Oct 11, 2024 4 min read capacity-management

Go Forth and Replicate

The noise the replicators made is still burned into my brain. All credit to MGM I suppose?

There is a lot that goes into managing the capacity of a sharded service.

Once you have a good definition of what capacity is and you're using it to make good decisions about where users should be placed in the available set of shards, you're far from done.

Things have an annoying habit of changing over time.

As a result, any system that adopts service sharding needs to be able to migrate users from one shard to another to adapt to that change.

Ideally without the user even noticing.

We Brought Them Aboard, To Study Them

The simplest way to migrate data from one shard to another is to copy it.

Most services have a persistence layer of some sort. Relational databases are common and are pretty easy to reason about from a data migration point of view, but even more esoteric or complicated data storage systems can still be acceptable.

The most important capabilities are:

- that data for a specific user can be teased apart from the data belonging to all of the other users who share the same shard
- that the data for a single user can be iterated through and copied, piece by piece, to the persistence layer on another shard
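To make those two capabilities concrete, here is a minimal sketch using SQLite as a stand-in for each shard's persistence layer. The `documents` table, its columns and the `copy_user_rows` function are all hypothetical, and the sketch assumes that row ids are unique across shards so they can be copied as-is.

```python
import sqlite3

# Hypothetical schema: every row carries the id of the user who owns it,
# which is what lets one user's data be teased apart from everyone else's.
SCHEMA = """
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    body TEXT
)
"""


def copy_user_rows(source, dest, user_id, batch_size=100):
    """Copy one user's rows from source to dest in batches.

    Returns the number of rows copied. Re-running it is safe because
    existing rows in dest are replaced rather than duplicated.
    """
    copied = 0
    last_id = 0
    while True:
        # Walk the user's rows in id order, one batch at a time.
        rows = source.execute(
            "SELECT id, user_id, body FROM documents "
            "WHERE user_id = ? AND id > ? ORDER BY id LIMIT ?",
            (user_id, last_id, batch_size),
        ).fetchall()
        if not rows:
            return copied
        dest.executemany(
            "INSERT OR REPLACE INTO documents (id, user_id, body) "
            "VALUES (?, ?, ?)",
            rows,
        )
        dest.commit()
        copied += len(rows)
        last_id = rows[-1][0]
```

Batching by id keeps memory use flat no matter how much data the user has, but it only produces a consistent copy if nothing changes in the source while it runs.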

Simple enough, right?

Unfortunately not, because the source data will constantly change. It's not like the user is just going to coincidentally stop using the software system so you can copy everything across.

So you need to force them to stop using it.

Either restrict their access to the software system entirely, or put them into a read-only mode of some sort while the data is being copied, then restore the access once the copy is done.
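One way to picture the read-only option is a small gate that the write path checks before doing anything. This is only a sketch: `AccessControl`, `UserReadOnly` and `check_write` are made-up names, and a real service would need to share this state across every process that serves the user.

```python
from contextlib import contextmanager


class UserReadOnly(Exception):
    """Raised when a write is attempted for a user who is being migrated."""


class AccessControl:
    def __init__(self):
        self._read_only = set()

    def check_write(self, user_id):
        # The write path calls this before touching the persistence layer.
        if user_id in self._read_only:
            raise UserReadOnly(f"user {user_id} is temporarily read-only")

    @contextmanager
    def read_only(self, user_id):
        # Block writes for the duration of the copy, and always restore
        # access afterwards, even if the copy itself blows up.
        self._read_only.add(user_id)
        try:
            yield
        finally:
            self._read_only.discard(user_id)
```

The copy then runs inside `with access.read_only(user_id):`, so the user gets write access back whether the copy succeeds or fails.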

For small users this is generally acceptable.

If you're smart, you can analyse their usage patterns and come up with a time when they are unlikely to be using the software and schedule the copy for then. A few minutes of downtime or read-only access in the dead of night in order to perform a data migration is a fine trade-off.
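A very rough version of that analysis is to bucket a user's past requests by hour of day and pick the emptiest bucket. The `quietest_hour` function below is hypothetical, and it assumes the timestamps are already in whatever time zone you schedule migrations in.

```python
from collections import Counter
from datetime import datetime


def quietest_hour(request_times):
    """Return the hour of day (0-23) with the fewest recorded requests."""
    per_hour = Counter(t.hour for t in request_times)
    # Hours with no requests count as zero; ties go to the earliest hour.
    return min(range(24), key=lambda hour: per_hour[hour])
```

Real usage patterns vary by day of week and season, so treat this as a starting point for picking a window rather than a guarantee that the user will be asleep.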

It's a different story for large users though, or users who are active all the time.

For these customers, you need to do something a little bit fancier.

We Did Not Fully Comprehend The Danger

A more complicated way to migrate data between shards is to copy it, but sneakily.

Like a ninja.

What I mean is that rather than taking the user offline and then copying their data piece by piece, you instead leave them online and replicate the data from the source to the destination behind the scenes.

Eventually enough of the data will be available in the destination that you can take the customer down for a much shorter period of time, copy any outstanding differences and then point them at the destination.

You still copy the same amount of data, you just do most of it while the user can still use your software system, which is a vast improvement from their perspective. This breaks the relationship between how long the user is disrupted and how much data they have, which allows for far more flexibility in how you can use the migration capability.
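The shape of the replicate-then-cut-over idea can be sketched with a toy in-memory model. Everything here is an assumption for illustration: `Shard` holds a single user's data plus an ordered change log, and `Migration` tracks how far through that log the destination has got. A real persistence layer would provide its own change stream.

```python
class Shard:
    """Toy stand-in for one user's data on a shard, with a change log."""

    def __init__(self):
        self.data = {}
        self.log = []  # ordered (key, value) changes, oldest first

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Migration:
    def __init__(self, source, dest):
        self.source = source
        self.dest = dest
        self.position = 0  # source log entries already applied to dest

    def lag(self):
        """Number of source changes the destination has not seen yet."""
        return len(self.source.log) - self.position

    def replicate(self, max_changes):
        """Apply up to max_changes outstanding changes; return how many."""
        batch = self.source.log[self.position:self.position + max_changes]
        for key, value in batch:
            self.dest.data[key] = value
        self.position += len(batch)
        return len(batch)
```

While the user stays online you call `replicate` repeatedly in the background. Once `lag()` is small, you block writes, call `replicate` one last time to drain whatever is left, and point the user at the destination. The offline window is then proportional to the lag, not to the total amount of data.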

It sounds simple, but I promise that it's not.

For example, any replication process is going to be heavily dependent on your persistence layer. If you've chosen something that already has the ability to replicate data from a source to a destination (like PostgreSQL) then you're going to have a much easier time than if you have to build it yourself.

Even then, depending on how you've partitioned your users' data, you might run into further complications, as a poor partitioning strategy might make the available persistence layer functionality unusable.

Another thing to be aware of is that a replication-based migration is fundamentally more complicated than a simple copy one, just because it features far more moving parts. If the replication channel fails, or the migration needs to be rolled back, unpicking all of the things can be a bit of an ordeal.

Still, it's worth it in exchange for giving your users less disruption.

Less disruption isn't no disruption though...

Yeah, We Do That All The Time

The ideal migration approach is a natural extension of the replication approach described above, except you never take the user offline.

With this approach, whenever you decide that you need to do a migration, you still set up a replication channel between the source and destination, but instead of letting the user write to the source and then replicating the data from the source to the destination behind the scenes, you let the user write to the source and destination at the same time in an active-active configuration.

This might not seem like much of a difference, but it is.

For example, having an active-active setup may require the software system itself to be aware of the situation, rather than just the data layer. If a write to the destination fails, should the write to the source fail as well? What does the user see in that case, a failure, or is there some sort of silent retry?
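To show what answering those questions looks like in code, here is a sketch of one possible policy, not a recommendation: the source write decides what the user sees, destination failures are retried silently, and anything that still fails is parked for later repair. `dual_write`, `WriteError` and the repair queue are all hypothetical names.

```python
class WriteError(Exception):
    """Hypothetical error raised by a shard when a write fails."""


def dual_write(source_write, dest_write, key, value, repair_queue, retries=2):
    """Write to the source, then to the destination with silent retries.

    Returns True if both writes landed, False if the destination write
    was given up on and queued for repair.
    """
    # A failure here propagates to the caller, so the user sees it.
    source_write(key, value)

    for _ in range(retries + 1):
        try:
            dest_write(key, value)
            return True
        except WriteError:
            continue  # silent retry; the user never hears about this

    # The shards have now diverged, so record the write for reconciliation.
    repair_queue.append((key, value))
    return False
```

The opposite policy, failing the whole request when the destination write fails, keeps the two shards identical but means a struggling destination can hurt a user who was otherwise fine. Either way, the application has to know that a migration is in progress.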

It's a challenging set of problems to solve.

Once you have a zero-downtime migration solution though?

Incredible.

The entire migration process becomes completely invisible to the user.

And when the user isn't impacted at all by a migration, you can do all sorts of wondrous things from a capacity management point of view.

Kind Of Expected More From You Guys Though

Minimising the disruption caused by a shard migration is important to being able to effectively manage the capacity of a sharded service.

You need to be pragmatic though.

You can't always go straight for the absolutely perfect experience, because otherwise you'll sink a huge amount of engineering effort into something that might not matter all that much.

If your users are small and their data is easy to copy around, a small amount of downtime during a period of low usage is far, far easier to implement than a complicated active-active replication between source and destination.

You do want to give it some thought ahead of time though.

You don't want to get yourself into a situation where you have to awkwardly tell a user that you need them to stop using your software system for a few days, just so you can move their data around in a way that they don't care about.

Not that I've had to do that or anything.