
Construct Additional Pylons

Such glorious cinematics. Shame about the company. All credit to Blizzard

At the company I worked at before Atlassian, the biggest problem we had was just building enough raw functionality into our new cloud product so that it could compete with the ancient on-premises beast that we were trying to replace.

Unsurprisingly, we never quite reached the same scale as Atlassian, but I now know that once you have thousands or millions of customers champing at the bit to use your cloud product, you have to deal with an entirely different set of problems.

Like where the hell do you put everyone.

Not Enough Energy

If you're building a software product these days, it's probably going to be hosted in some data centre, available on the internet and have many customers using it at the same time.

These sorts of software systems feature one or more back-end services that fulfil requests from some sort of front-end interface, probably a website, or maybe an app that is installed on a device.

When you're just getting started, and you only have a few hundred or maybe even a few thousand users, you will have a small pile of very closely connected infrastructure that all of those requests go through.

As the number of users grows, you will upgrade that pile of infrastructure, making the individual components larger. Those requests have to be fulfilled somehow, and throwing power at the problem is a pragmatic way to move forward.

And that works.

Until it doesn't:

  • Maybe you do a bad deployment, and you take all of your customers offline, making a large number of people fundamentally unhappy
  • Maybe a critical piece of infrastructure in your pile fails for some reason, and suddenly none of your customers can use what they paid for
  • Maybe you reach the limits of what one of your key resources is capable of and the system starts coming apart at the seams
  • Maybe one of your users starts absolutely slamming your system for some reason and it causes problems for everyone else

It doesn't matter what the reason is; the outcome is the same.

You can't or shouldn't keep scaling vertically.

Which means it's time to scale horizontally.

Research Complete

If you're following good engineering and operational practices, you may already be scaling horizontally.

For example, having a load balancer that routes web requests to a set of auto-scaling backend servers is a way of scaling horizontally. More traffic? Let the auto-scaler do its thing and create a few more servers, evening out the number of requests that a single server needs to deal with and preventing performance problems.
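
To make that concrete, here's a minimal sketch of the kind of target-tracking rule an auto-scaler applies; the numbers and names are illustrative, not any real cloud provider's API:

```python
import math

def desired_server_count(total_rps: float, rps_per_server: float,
                         min_servers: int = 2, max_servers: int = 50) -> int:
    """A toy target-tracking scaling rule: run enough servers so that
    each one handles at most `rps_per_server` requests per second,
    clamped to a sensible range."""
    needed = math.ceil(total_rps / rps_per_server)
    return max(min_servers, min(needed, max_servers))

# 1,800 req/s at ~200 req/s per server -> 9 servers
print(desired_server_count(1_800, 200))
```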

At that point, it no longer matters if a single resource can deal with all of your web requests, because you can make as many of those resources as you like.

That only works because web requests tend to be stateless, though, so it's not really a good solution if your data store is the resource that can't keep up with the load. You can't really auto-scale data stores, at least not easily.

At that point, if you want to solve a swathe of scalability problems all at once, you can instead just create an entirely separate copy of your stack. Another load balancer, another set of backend servers, another data store. Literally everything, from top to bottom.

And now you have a shard.

Having a set of shards is like having a fancier version of the load balancer + auto-scaler that I mentioned above. You're still creating isolated units that can each deal with a slice of the workload; you're just doing it in a far more persistent way that aims to solve a long-term problem rather than a short-term one.

Assuming you're dealing with persistent workloads (i.e. a customer only exists on one shard and all requests for that customer must be fulfilled on that shard), you've prevented a bunch of the problems outlined above right from the get-go; there's a sketch of that sticky routing after the list below.

  • A single deployment, assuming you're deploying to shards in some sort of sequence, can't explode everyone. In fact, you can partition your customers by risk and only deploy to the risky ones when everything is proven to be fine
  • A single piece of key infrastructure failing no longer causes a total outage. It might still cause an outage for some of your customers, but that is still a solid improvement
  • A resource reaching the limits of what it is capable of no longer brings down the entire product, though, again, it will bring down some customers
  • A particularly noisy user can't make everyone else's lives hell, just the people they are sharing a shard with
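
To put that "persistent workloads" assumption into code, here's a hypothetical sketch of a sticky-routing layer; the shard names, endpoints and lookup table are all invented for illustration, and in practice the assignment would live in some sort of control-plane data store:

```python
# Each customer is pinned to exactly one shard, and every request for
# that customer is routed to that shard's copy of the stack.

SHARD_ENDPOINTS = {
    "shard-1": "https://shard-1.example.com",
    "shard-2": "https://shard-2.example.com",
}

# The persistent customer -> shard assignment (illustrative only).
CUSTOMER_TO_SHARD = {
    "customer-a": "shard-1",
    "customer-b": "shard-2",
}

def route(customer_id: str) -> str:
    """Return the endpoint of the shard that owns this customer."""
    shard = CUSTOMER_TO_SHARD[customer_id]  # KeyError means an unknown customer
    return SHARD_ENDPOINTS[shard]

print(route("customer-a"))  # https://shard-1.example.com
```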

Adopting service sharding, as it is called, is a way to architecturally protect yourself from the risks inherent in shared infrastructure, reducing blast radius and improving resiliency.

But like anything, it's not free.

We Are Under Attack!

While you may solve a bunch of problems by adopting service sharding, you also introduce a host of new and interesting challenges.

The biggest one is simply deciding where a customer should go in your set of shards.

When a customer first starts using your product, you need to choose the shard that they will be located in.

A naïve strategy is to just randomise that decision and select a destination from the available set of shards. That will work pretty much as expected, until it doesn't.
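
In code, the naïve strategy really is this small (a sketch, with an invented list of shard names):

```python
import random

def assign_shard_naive(shards: list[str]) -> str:
    """Place a new customer on a shard chosen uniformly at random."""
    return random.choice(shards)

print(assign_shard_naive(["shard-1", "shard-2", "shard-3"]))  # e.g. "shard-2"
```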

Eventually, things get more complicated because different customers have different load profiles and because a single shard can only support a certain number of customers before its performance starts to degrade and every customer on the shard suffers.

At that point, you need to get a bit more sophisticated and start monitoring the capacity of your shards while also tracking how much of that capacity each tenant is consuming. Once you have that information, you can combine it to make better decisions about where a new customer should be located within the set of shards.
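
Here's a sketch of what that might look like, assuming your monitoring system can give you per-shard capacity and usage numbers (the function and parameter names here are invented for illustration):

```python
def assign_shard_by_capacity(shard_capacity: dict[str, float],
                             shard_usage: dict[str, float],
                             estimated_customer_load: float) -> str:
    """Place a new customer on the shard with the most headroom.

    The capacity and usage figures are assumed to come from monitoring,
    and the load estimate for a brand-new customer is necessarily a
    guess that has to be re-evaluated later.
    """
    headroom = {s: shard_capacity[s] - shard_usage[s] for s in shard_capacity}
    best = max(headroom, key=headroom.get)
    if headroom[best] < estimated_customer_load:
        raise RuntimeError("No shard has enough headroom; time to construct another one")
    return best

print(assign_shard_by_capacity(
    shard_capacity={"shard-1": 100.0, "shard-2": 100.0},
    shard_usage={"shard-1": 85.0, "shard-2": 40.0},
    estimated_customer_load=10.0,
))  # shard-2
```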

But the challenges keep coming, because:

  • A single customer's capacity consumption will change over their lifetime, which means that a shard that was perfectly utilised today might be over- or under-utilised tomorrow (a sketch of one way to catch that drift follows this list)
  • At the point where a customer starts using your software, you literally know nothing about how much capacity they are going to consume, so you need to make decisions using assumptions that have to be re-evaluated later
  • Any measure of capacity is only going to be an approximation, and as the capacity is consumed, the shard's ability to deal with load may fundamentally (or subtly) change
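
One (illustrative) way to catch the drift from that first point is a periodic job that re-measures utilisation and flags shards that have crept past a threshold; a minimal sketch, assuming the same monitoring-sourced numbers as above:

```python
def find_hot_shards(shard_capacity: dict[str, float],
                    shard_usage: dict[str, float],
                    threshold: float = 0.8) -> list[str]:
    """Flag shards whose measured utilisation has drifted past a threshold,
    so their customers can be rebalanced before performance degrades."""
    return [shard for shard in shard_capacity
            if shard_usage[shard] / shard_capacity[shard] > threshold]

print(find_hot_shards(
    shard_capacity={"shard-1": 100.0, "shard-2": 100.0},
    shard_usage={"shard-1": 90.0, "shard-2": 40.0},
))  # ['shard-1']
```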

And that's just the tip of the capacity management iceberg.

En Taro Adun

Scaling horizontally by adopting service sharding is a great way to solve a certain class of problems while introducing a whole different class of problems.

I would say that it's still a net gain though, especially if you can leverage existing capacity management platforms to avoid having to build everything from scratch.

I wouldn't bother architecting for service sharding right from the start though. Be pragmatic and do what you need to do when you need to do it.

Remember, sharding is only really useful once you have a lot of customers, which is just as well, because it's definitely not free.