Growing Pains

For the first few days of the New Year and now again this week, we’ve had some problems keeping up with the volume of images uploaded. This has resulted in people occasionally being unable to upload, and in some weird things happening (multiple copies of images being uploaded, extremely long waits while uploading, etc.). We are all working on this as fast as we can. This post will explain the problem and what we’re doing to fix it (the good news is that it should be fixed quite soon).

How it works
Flickr is quite a complex piece of software. When you upload a photo it is passed through a load balancer (among other things) to one of the servers in the web-serving and image-receiving cluster. One component of the server receives the image file itself along with any associated metadata. Another places and holds the image in a queue for processing. A third component processes the image: converting it to the right format, making all the different sizes that are used on the site and extracting the EXIF – and soon, IPTC – data.

The queuing component then passes all the new files along with the original and the metadata to a fourth component that copies all the files to multiple storage servers (hooray for redundancy!), ensures that they’re safe, and then updates the database with the location of your photo and all the metadata.
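For the code-minded, the four components described above can be sketched roughly as follows. This is only an illustration, not the actual Flickr code: the function names, the list of sizes, and the in-memory lists standing in for storage servers and the database are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Upload:
    owner: str
    filename: str
    metadata: dict = field(default_factory=dict)


def receive(owner, filename, metadata):
    """Component 1: accept the image file and its associated metadata."""
    return Upload(owner, filename, dict(metadata))


def process(upload, sizes=("thumbnail", "small", "medium")):
    """Component 3: stand-in for format conversion and resizing.

    The size names are hypothetical; real processing would also read EXIF data.
    """
    stem = upload.filename.rsplit(".", 1)[0]
    return [f"{stem}_{size}.jpg" for size in sizes]


def store(upload, files, storage_servers, database):
    """Component 4: copy all files to every storage server, then update the database."""
    all_files = [upload.filename] + files  # the original travels with the new sizes
    for server in storage_servers:
        server.extend(all_files)
    database[upload.filename] = {
        "owner": upload.owner,
        "files": all_files,
        "metadata": upload.metadata,
    }


def run_pipeline(uploads, storage_servers, database):
    """Component 2: hold uploads in a queue and hand them on in arrival order."""
    queue = deque(uploads)
    while queue:
        upload = queue.popleft()
        store(upload, process(upload), storage_servers, database)


storage = [[], []]  # two pretend storage servers, for redundancy
db = {}
run_pipeline([receive("alice", "cat.jpg", {"camera": "X100"})], storage, db)
```

After this runs, both pretend storage servers hold the same four files and the database maps `cat.jpg` to their names and metadata.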

The problem
In a nutshell, the problem is that at peak times, more photos are coming in than we have the capacity to handle. However, it manifests itself in many ways:

  • The load is not balanced very well – one server might have hundreds of images sent to it while another only gets a dozen in the same period of time.

  • The queue is strictly "first in, first out", so if someone uploads 500 photos all at once to the same server your photo was sent to, you get a long wait.

  • Processing images, especially the large ones, takes quite a while. While a 640×480 cameraphone image has 307,200 pixels in it, a 3,008×2,000 image (like those from modern DSLRs) has a whopping 6,016,000 pixels in it, and we’ve got to look at all of them.

  • Under extremely high load and long queue times, parts of the system can "freak out", for lack of a better technical term.
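To put rough numbers on two of the points above, here is a small back-of-the-envelope sketch. The pixel counts come straight from the image dimensions; the two seconds of processing time per photo is a made-up figure used only to show how a strict first-in, first-out queue turns a big batch into a long wait.

```python
def pixel_count(width, height):
    """Number of pixels that have to be examined for one image."""
    return width * height


def fifo_wait_seconds(photos_ahead, seconds_per_photo):
    """Wait before your photo is started when everything ahead of it goes first."""
    return photos_ahead * seconds_per_photo


cameraphone = pixel_count(640, 480)    # 307,200 pixels
dslr = pixel_count(3008, 2000)         # 6,016,000 pixels
ratio = round(dslr / cameraphone, 1)   # the DSLR image has about 19.6x as many pixels

# Hypothetical: 500 photos queued ahead of yours, 2 seconds of processing each.
wait = fifo_wait_seconds(500, 2.0)     # 1000.0 seconds, a bit under 17 minutes
```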

In another sense, the problem is simply one of growth. While we’re used to rapid growth, and have planned for it, the last month has been even "growthier" than normal. To give you a sense of the whole Flickr system, on a really busy minute in a busy day: 8 new people sign up, 400 new photos are uploaded resulting in around 44,000 new images being saved, 5,000 pages and 60,000 images are served, and over 100,000 database queries are processed. That’s a lot.

The solution
The easy part of the solution is getting more servers. We ordered many more when the problem first arose and they should be here soon. Once we get them set up, configured, installed and tested, we’re rolling. We had been holding off on adding hardware until the big move we just completed, and we now have the extra space and power we need to add machines with abandon.

The harder part – what we’re all working on now – is making the whole system perform better, even when the loads are very high:

  • The queuing component is being improved by changing to what we call a "fair queue" – when you upload a few images right after someone else uploaded 100, yours will be interleaved with theirs, resulting in much faster processing for you, and the wait will be distributed. (This is in testing now.)

  • The processing process (ha!) has been optimized to move images through about 2-3x as quickly as before. (This is in testing now.)

  • Load balancing will be improved after some changes to the setup of our internal network. (This will take a little over a week.)

  • More testing is happening constantly to prevent any freaking out (multiple copies of images being uploaded, uploads failing, etc.)

  • Better feedback about and handling of high load situations is being added – this is already present when you upload via the website, and will be rolled into the uploading applications as quickly as possible.
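The "fair queue" idea in the first item above can be sketched as a simple round-robin over uploaders. This is a minimal illustration of the interleaving, not the queue actually being tested: the class name, its methods, and the choice of plain round-robin are assumptions made for the example.

```python
from collections import OrderedDict, deque


class FairQueue:
    """Round-robin queue: each uploader gets one photo processed per turn."""

    def __init__(self):
        self._per_user = OrderedDict()  # user -> deque of that user's pending photos

    def put(self, user, photo):
        self._per_user.setdefault(user, deque()).append(photo)

    def get(self):
        """Take one photo from the user at the front, then send them to the back."""
        if not self._per_user:
            raise IndexError("queue is empty")
        user, pending = next(iter(self._per_user.items()))
        photo = pending.popleft()
        if pending:
            self._per_user.move_to_end(user)
        else:
            del self._per_user[user]
        return user, photo

    def __len__(self):
        return sum(len(pending) for pending in self._per_user.values())


q = FairQueue()
for i in range(100):      # someone uploads 100 photos first
    q.put("bulk", i)
for i in range(3):        # you upload 3 right after
    q.put("you", i)

order = [q.get() for _ in range(len(q))]
# Your three photos come out in positions 2, 4 and 6 instead of 101 to 103.
```

With a strict first-in, first-out queue your last photo would be the 103rd one processed; with the interleaving above it is the 6th, while the bulk uploader finishes only three photos later than they otherwise would.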

In the meantime
We ask for your patience while we work through this, and if you are having problems, help us by giving us some of the details in the official thread. Happily, only some users are experiencing problems, and even then, only some of the time. Unhappily, if you’re one of them, it can be really frustrating. If you have a pro account and feel like your Flickring has been unduly hampered, let us know and we’ll try to make it up to you.