Link to talk description and video (videos should be public next week I believe)
Performance (e.g. a request should return in less than 5 seconds) is not the same as scalability (e.g. a request should ALWAYS return in less than 5 seconds, whatever the load). Fortunately, it turns out that when you start working on scalability you usually end up improving performance as well -- note that this doesn't work the other way around.
Common bottlenecks
The database is almost always an issue.
Caching and invalidation help.
They use Postgres for 98% of their data; it works great on good hardware with only one master (Disqus, his company, uses Django to serve 3 billion page views a month).
Packaging matters
Packaging is key: it makes your deployment repeatable, which is incredibly useful even when you're working by yourself. Unfortunately there are too many ways to do packaging in Python, and none of them solves all the problems. He uses setuptools, because it usually works.
Plenty of benefits to packaging (see the sketch after this list):
- The handy 'develop' command installs all the dependencies.
- Dependencies are frozen.
- It's a great way to get a new team member quickly set up.
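As a rough illustration of what that buys you, here's a minimal setup.py; the project name and version pins are placeholders, not the ones Disqus uses:

```python
# Minimal setuptools packaging sketch -- names and pins are made up.
from setuptools import setup, find_packages

setup(
    name='twitter-clone',
    version='1.0.0',
    packages=find_packages(),
    install_requires=[
        # Frozen dependencies: exact versions, so every deploy is identical.
        'Django==1.3.1',
        'redis==2.4.9',
        'celery==2.4.5',
    ],
)
```

With this in place, `python setup.py develop` installs the package in-place along with everything in `install_requires`, which is exactly what a new team member needs to get going.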
Then they use Fabric to deploy consistently.
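Something like this Fabric (1.x style) task is all it takes; the host names and paths below are invented for the example:

```python
# fabfile.py -- a deploy sketch; hosts and paths are hypothetical.
from fabric.api import cd, env, run

env.hosts = ['web1.example.com', 'web2.example.com']

def deploy():
    with cd('/srv/twitter-clone'):
        run('git pull')
        run('python setup.py develop')  # reinstall the pinned dependencies
        run('touch deploy.wsgi')        # nudge the app server to reload
```

Run `fab deploy` and every host gets the same steps in the same order.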
Database(s)
This applies to any kind of datastore, which is the usual bottleneck. It can become difficult to scale once there is more than one server.
The rest of the talk uses a Twitter clone as an example.
For the public timeline, you select everything and order it by date. That's OK if there is only one database server; otherwise you need some sort of map/reduce variant to get it working, and the index on date will be fairly heavy. It's quite easy to cache (push the tweet onto a queue whenever one is added) and to invalidate.
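A sketch of that cache-on-write idea, assuming redis-py and a capped list (the key name and the cap of 1000 are invented):

```python
# Push new tweets into the public-timeline cache as they are written,
# instead of re-running SELECT ... ORDER BY date on every page view.
import redis

r = redis.Redis()

def on_new_tweet(tweet_id):
    r.lpush('public_timeline', tweet_id)
    r.ltrim('public_timeline', 0, 999)   # keep only the newest 1000 ids

def public_timeline(limit=20):
    return r.lrange('public_timeline', 0, limit - 1)
```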
For personal timelines, you can use vertical partitioning, with users and tweets on separate machines. Unfortunately this means a SQL JOIN is not possible. Materialised views are a possible answer, but they aren't supported by many databases (MySQL, for instance, will generate the view by rerunning the query every time, which means you can't index it).
Using Postgres and Redis, you can keep a sorted set, using the tweet id as the member with the timestamp as its weight (which becomes the ordering). Note that you can't have a never-ending long tail of data: it gets truncated after 30 days or whatever (removed from Redis).
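A minimal sketch of that sorted-set layout with redis-py (3.x-style zadd); here the truncation is by count rather than by age, and the key format is invented:

```python
import time
import redis

r = redis.Redis()

def push_to_timeline(user_id, tweet_id):
    key = 'timeline:%d' % user_id
    # Member = tweet id, score = timestamp, so ZREVRANGE gives newest first.
    r.zadd(key, {str(tweet_id): time.time()})
    # No never-ending tail: drop everything past the newest 1000 entries.
    r.zremrangebyrank(key, 0, -1001)

def timeline(user_id, limit=20):
    return r.zrevrange('timeline:%d' % user_id, 0, limit - 1)
```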
Now the new problem is scaling Redis! You can partition per user: if you keep, say, 1000 tweets per user, you know how much space each user will take, and therefore how many users you can have per server.
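The arithmetic is simple once the per-user cap is fixed; all the numbers in this sketch are assumptions, not Disqus figures:

```python
# Back-of-the-envelope Redis capacity planning.
TWEETS_PER_USER = 1000          # timelines are truncated to this length
BYTES_PER_ENTRY = 64            # rough guess per sorted-set entry
NODE_RAM = 4 * 1024 ** 3        # one 4 GB Redis node

bytes_per_user = TWEETS_PER_USER * BYTES_PER_ENTRY   # 64 KB per user
users_per_node = NODE_RAM // bytes_per_user          # ~65,000 users/node
print(bytes_per_user, users_per_node)
```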
See github.com/disqus/nydus, which packages a cluster of Redis connections; it can be configured much like a Django database (if I understood correctly). They run 64 Redis nodes in virtual machines on a single physical machine.
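Based on the nydus README, usage looks roughly like this (the exact config keys may differ between versions):

```python
from nydus.db import create_cluster

# A Django-settings-like description of 64 Redis nodes on one box.
redis = create_cluster({
    'backend': 'nydus.db.backends.redis.Redis',
    'router': 'nydus.db.routers.keyvalue.PartitionRouter',
    'hosts': dict((n, {'db': n}) for n in range(64)),
})

redis.incr('impressions')   # nydus routes the command to the right node
```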
Vertical vs. Horizontal partitioning
You can have, for instance (see the routing sketch after this list):
- Master database with no indexes, only primary keys
- A database of users
- A database of tweets
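One way to express that split in Django is a database router; the settings and app labels here are made up for illustration:

```python
# settings.py
DATABASES = {
    'default': {'ENGINE': 'django.db.backends.postgresql_psycopg2',
                'NAME': 'master'},   # no indexes, only primary keys
    'users':   {'ENGINE': 'django.db.backends.postgresql_psycopg2',
                'NAME': 'users'},
    'tweets':  {'ENGINE': 'django.db.backends.postgresql_psycopg2',
                'NAME': 'tweets'},
}
DATABASE_ROUTERS = ['myapp.routers.VerticalRouter']

# myapp/routers.py
class VerticalRouter(object):
    """Send each app's models to its own database (remember: no JOINs across them)."""

    def db_for_read(self, model, **hints):
        if model._meta.app_label in ('users', 'tweets'):
            return model._meta.app_label
        return 'default'

    def db_for_write(self, model, **hints):
        return self.db_for_read(model, **hints)
```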
So far the hardware scales at the same pace as their app: if you need more machines or more RAM, it's cheap enough, and when you need more again in a few years it will be about the same price.
Asynchronous tasks
Using RabbitMQ and Celery, you can use application triggers to manage your data, e.g. a signal on a model's save() hook that adds the new item to a queue after it has been written to the database. This way the worker can add the new tweet to all the caches without blocking the request (if someone has 7 million followers, their tweet needs to be added to 7 million streams).
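A hedged sketch of that fan-out-on-write pattern; `Tweet`, `followers_of()` and the key format are hypothetical stand-ins for application code:

```python
import time

import redis
from celery import shared_task
from django.db.models.signals import post_save
from django.dispatch import receiver

from myapp.models import Tweet, followers_of   # hypothetical app module

r = redis.Redis()

@shared_task
def fan_out(tweet_id, author_id):
    # Runs on a Celery worker, off the request path; a real system would
    # use the tweet's creation time as the score rather than time.time().
    for follower_id in followers_of(author_id):   # possibly millions of ids
        r.zadd('timeline:%d' % follower_id, {str(tweet_id): time.time()})

@receiver(post_save, sender=Tweet)
def enqueue_fan_out(sender, instance, created, **kwargs):
    if created:   # the row is saved; queue the heavy cache work
        fan_out.delay(instance.pk, instance.author_id)
```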
Building an API
Having an API is important for scaling both your code and your architecture. Make sure all the places in your code (the Django code, the Redis code, the REST part, whatever) use the same API, or are refactored to use the same API, so that you can change them all in one place.
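A sketch of what that consolidation looks like in practice; everything below (model, key format, view names) is invented for the example:

```python
import redis
from django.http import JsonResponse
from django.shortcuts import render

from myapp.models import Tweet   # hypothetical app module

r = redis.Redis()

def get_timeline(user_id, limit=20):
    """The single entry point: the only code that knows where timelines live."""
    ids = r.zrevrange('timeline:%d' % user_id, 0, limit - 1)
    return list(Tweet.objects.filter(pk__in=ids))

def timeline_view(request):      # the Django page...
    return render(request, 'timeline.html',
                  {'tweets': get_timeline(request.user.pk)})

def timeline_api(request):       # ...and the REST endpoint share it
    return JsonResponse({'tweets': [t.pk for t in get_timeline(request.user.pk)]})
```

Move timelines from one store to another and only `get_timeline` changes.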
To wrap up
- Use a framework (like Django, to do some of the legwork for you), then iterate: start by querying the database, then scale.
- Scaling can lead to performance but not the other way around.
- When you have a large infrastructure, architect it in terms of services; it's easier to scale
- Consolidate the entry points; it becomes easier to optimise
Lessons learnt
- Provision more upfront, for instance 64 VMs on one machine, so that you can scale out to 64 physical machines if needed.
- Redistributing/rebalancing shards is a nightmare, plan far ahead.
- PUSH to the cache, don't PULL: otherwise, if the data is not there, 5000 users might request it at the same time and suddenly you have 5000 hits to the database. Cache everything; it's easier to invalidate (everything is cached for 5 minutes in memcached in their system). See the first sketch after this list.
- Write counters to denormalise views (updated via queues, stored in Redis I think; second sketch below)
- Push everything to a queue from the start; it will make processing faster. There is no excuse -- Celery is so easy to set up.
- Don't write database triggers, handle the trigger logic in your queue
- Database pagination is slow and crappy: LIMIT 0, 1000 may be OK, but by LIMIT 1000, 2000 the database already has to scan and discard the skipped rows, and it only gets slower and consumes more CPU and memory as the offset grows. There are easier ways to do pagination: he likes to do id chunks and select ranges of ids, which is very quick (third sketch below).
- Build with future sharding in mind. Think massive, use Puppet.
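First sketch, illustrating PUSH vs PULL caching with Django's cache API; `recompute()` and `save_row()` are trivial stand-ins for real queries:

```python
from django.core.cache import cache

def recompute(key):         # stand-in for an expensive database query
    return 'value-for-%s' % key

def save_row(key, value):   # stand-in for the database write
    pass

def get_pull(key):
    # PULL: on a miss, every concurrent reader recomputes. 5000 readers
    # missing at once means 5000 database hits.
    value = cache.get(key)
    if value is None:
        value = recompute(key)
        cache.set(key, value, 300)
    return value

def write_push(key, value):
    # PUSH: the writer refreshes the cache, so readers never miss.
    save_row(key, value)
    cache.set(key, value, 300)   # everything cached for 5 minutes
```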
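Second sketch, for the denormalised counters; the key format is invented, and Redis as the store is hedged in the notes above:

```python
import redis
from celery import shared_task

r = redis.Redis()

@shared_task
def on_new_follower(user_id):
    r.incr('user:%d:followers' % user_id)   # updated from the queue worker

def follower_count(user_id):
    # Views read a single key instead of running COUNT(*) over a big table.
    return int(r.get('user:%d:followers' % user_id) or 0)
```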
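Third sketch, id-chunk pagination with the Django ORM (`Tweet` is the hypothetical model from the earlier examples):

```python
from myapp.models import Tweet   # hypothetical app module

def next_page(last_seen_id, page_size=1000):
    # WHERE id > last_seen_id ORDER BY id LIMIT page_size: the primary-key
    # index makes every page equally cheap, no matter how deep you go.
    return list(Tweet.objects.filter(pk__gt=last_seen_id)
                             .order_by('pk')[:page_size])
```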
One of the questions was: does that mean there are 7 million cache misses if someone deletes a tweet? Answer: Yes indeed.