Storage System Design Considerations

When designing storage systems for applications, not building something like a SQL/NoSQL database, I’m talking the layer/logic that goes on top, you have to consider a whole bunch of things if you want to build the “right thing(TM)” for your application.

In part one, we’ll be considering the simple cases (spoiler: most cases are simple) and SQL databases, in part 2 we’ll be considering NoSQL databases.

Even with various considerations to think about, hard and fast rules about datastores tend to get very difficult, as the various considerations can push things around quite a bit as to how to best implement them for today while taking tomorrow into account.

The main things to start thinking about are data size, desired response time, money budget and data access patterns.

A few rules of thumb to consider:

Avoid the new snazzy datastore. Let someone else discover where it falls over, you’ve got better things to do.
Any time you step up the complexity, necessary or not, you now get a different bag of problems than where you started, and more complex storage systems have more complex failure modes, so start simple and only go up the chain when you need to.
Before choosing a NoSQL datastore, be really sure you need it. It’s not that they don’t have their place, but they seem to be over-prescribed.

Simple Cases

If your data will fit on a single machine, and access rates/patterns are not such as to overwhelm that box, generally speaking, a regular SQL database is probably your best bet. They provide a lot of batteries included, atomicity just works, ad hoc queries are easy-peasy, foreign keys are included, query tools abound, and oh my: on delete cascade, oh how I love thee! If your app and database will easily fit on a single machine, you may want to consider SQLite as there’s less maintenance to deal with. For apps that I write for myself to run on shared hosting, SQLite is usually what I pick, but for where SQLite won’t do, PostgreSQL is my favorite, but MySQL is provided by just about every cheap hosting provider, so you kinda need to know both.

Fortunately, most things fit in the “Simple Cases” bucket. Unless you are a startup that expects things to get really big in a hurry, start here. Even as that startup, you probably still want to start here and design your schema with the idea that you may wind up switching to a NoSQL database at some point, so that the eventual migration won’t be so hairy. Also, if you design things with microservices in mind (but not necessarily doing μservices), the individual databases for the services may be quite small, and never need to be scaled very far. Either way, future you will know better about what you need from your datastore than present day you anyway.

If you exceed the simple cases bucket, there are two roads to consider, staying with the SQL database, and then going to a NoSQL database.

Staying With SQL Databases

If you need to be able to scale horizontally, because one storage backend won’t cut it, either for fault tolerance, or scalability, things get a little interesting.

If your data access pattern is read-mostly, then read-only slave replication on a SQL database is probably all you need, and there are pretty general recipes on how to do this. Both MySQL and PostgreSQL can have slave replicas, and that may be all you need.

If you need write scalability with a SQL database, with the current state of the art, you probably either need to shard your database into sub-databases, or you need to give a big bag of money to someone. There are things like MySQL cluster and Postgres-XL which may work for you, but there are caveats with both that you need to consider before proceeding, they are not just a drop in – read their documentation carefully and note the special cases.

Vertical Scaling

One solution, while not that sexy is just to buy bigger database servers with as much RAM as you can throw at it. For many cases I’ve seen, this is sufficient. The limits you’ll hit if you’re cloud hosted will generally be hit much sooner than if you’re hosting your own hardware though as you can buy physical machines that are larger than any cloud-hosted machine.

Query Optimization

You can usually buy some read-scalability with query optimization. There can be some really impressive wins, especially if some portion of your queries are generated by machine in response to user queries. Look at the query patterns that you see, and learn how to understand how your database plans the query, and how you could rewrite the query when a particular query pattern is noticed: Special case handling for these is not some horrible hack, but a really good idea. Just because you can’t use some trick all of the time is no reason to not do it most of the time. I’ve seen query optimization take queries that took 57 seconds to run in less than 100ms. Consider adding indexes to referenced fields; PostgreSQL has some nice, exotic-ish ones that can be quite amazing. Also, if you’re on a database that supports them (sorry MySQL) look into Common Table Expressions.

Denormalization

Once you’ve gotten all you can out of query optimization, you may have been saying to yourself things like “if only I didn’t have to join to this table, I could save a ton of time” or “why do I need to do this in three queries rather than two?”, or I’m always executing this same query and getting the same results over and over. If you see that, consider denormalizing some of your data to see if that could speed things up – but always measure first so you have a baseline of comparison! With database triggers, you can ensure that the denormalized data is properly done and can possibly pre-bake things for you. Materialized views here can be a huge win.

Caching

If you have a query which has a simple key and/or the data doesn’t change that often (for some value of often), definitely consider a cache. Though take heed that cache invalidation is one of the two hardest things in computer science. Sometimes cache invalidation is pretty trivial, or if it’s ok to have stale data in the cache, in which case, we celebrate. But if stale data in a cache is a big problem, your problem just became pretty interesting.

Some ways you can mitigate the cache invalidation problem is to shorten the cache lifetime, and many times, that’s good enough: it still turns some large variable number of queries per second into a lower fixed number. If you can segregate things that so as few reader care about freshness as possible, sometimes having only a few people paying the read penalty is good enough. Other times, it won’t hack it and you either have to pay the read cost every time, or need to ensure that caches get invalidated on a write. If you get there, you probably need something like etcd which has nodes that have TTLs (or put a ttl in the node itself if you use ZooKeeper, and have something to periodically examine nodes to delete expired one), to make sure that everything is properly up to date.

For me, for an external cache (as opposed to in-server cache), memcache is my default choice: It’s super simple to use and to set up, and it basically just works.

Breaking Apart Your Schema

One thing you may consider is in the course of doing database optimization your schema may actually have mini sub-schemas in it that can be separated out into their own database clusters, because these subsets are only dealt with together. In this case, code changes involved are fairly simple, you just have to track which connections go with what subsets.

Dividing your servers apart this way may reduce the resource contention, and if you do choose to vertically scale things, you’ve at least isolated the parts that need scaling, and so your costs will be lower because of it.

Sharding and Entity Groups

If you break your database into shards, you need to consider the notion of Entity Groups. A shard contains a set of Entity Groups. Depending on your schema, an Entity Group might be something like an individual user and their data. An entity group should also be able to contain any data you’d need to change in a single database transaction since transactions don’t normally span databases – unless you invest into two-phase commit, which you can do, but you’re signing up for significant complexity (which by extension means more interesting failure modes). But however you choose to define it for your application, it is a set of things that you need to be able to query/modify at one time. So look at your SQL queries. Is there a way if you sharded your database (by user id, or some other key), that based upon what you are looking to query, could you know which shard to run the queries on?

From there, you can break your database into some arbitrary number of shards, and break up the entity groups amongst them. When you do this, consider (but don’t blindly use!) consistent hashing as part of your shard selection algorithm. This way, if you need to add machines later, the number of entity groups that need to move is minimized. Alternatively, add new entity groups to shards as you add them, and so you shouldn’t need to move entity groups. Either way, consider your sharding algorithm very carefully and with much thought – and oh please, please test it before you implement it, really, then go back and test it again carefully, as it’s really hard to change without having to juggle around tons of records and handling the interim case where an entity group is on two shards, etc. Also consider how you would move entity groups between shards in the eventual case where you may need it.

Also, you may have a few different entity groups in your schema. If you think of something like Spotify, you might have User entity groups, but you might also have Artist entity groups (the artist, their albums and songs). These all could be sharded amongst the same shards, or you might choose to have different database systems for Artists and Users and shard them differently. But these entity groups could live in separate databases entirely too.

As an interesting datapoint, Google AdWords ran on top of a heavily sharded/replicated MySQL database for many years, so before you jump into the NoSQL camp, think first.

Sharding can also significantly improve query response times if you can have queries that cross shards (in aggregation). Say you had you database sharded by Albums, and you wanted to find all Albums that reached gold. Instead of querying a single database for all albums, you’re now querying N smaller (and presumably faster to respond) databases to get this information. It’s akin (but not) to a small mapreduce job.

Sharding may save you money, but you buy application complexity in dealing with it, with the benefit of you still have a SQL database underneath, with all it’s consistency guarantees, and tooling.

Write Contention

Stil you may have some write contention that is not addressed with all of this.

Here, there are no quick band-aids, but here are some possible patterns:

Mini-sharding: do you have a row which is getting hammered? Try sharding the row. Have a counter column which is getting hammered? Have it represented by multiple counter rows (possibly in separate tables) that you add together when you need the total.
Mini-sharding (2): do you have a table getting write-hammered? Shard the table into several smaller tables with identical definitions, and possibly on different disks. Fixing the constraints may be tricky, but it’s usually doable via at least a check-constraint if the usualy means don’t work.
Is consistency on the hammered bit really needed? Consider a lazy-write through cache, whereby the value is only updated in the database store a few times a second instead of n where n is large, or after some perioud of inactivity.

Conclusion

In part one, we’ve covered things using SQL databases, in part two, we’ll cover the issues surrounding NoSQL databases and some possible ways to combine the two.