Sep 13, 2024 4 min read capacity-management

Heart of the Swarm

Blizzard really did lean heavily on that corruption theme for a while. All credit to them, I suppose.

For the last three and a bit years at Atlassian I've been immersed in the capacity management domain.

It's a pretty good domain; interesting without being mind-meltingly complex, impactful without being monstrously high-pressure, and data-oriented without requiring everyone to be a statistician.

While I wouldn't say I'm an expert, I like to think I understand it well enough to explain at least some of the concepts.

Like what capacity actually is and how you can calculate it.

Armies Will Be Shattered

In abstract terms, the capacity of a software system is the number of users that can be served without the user experience degrading to an unacceptable level.

Annoyingly enough, there isn't always a specific and easily predictable number of users where everything breaks. It's more of a probability curve, where, as the number of users grows, there is an increased chance that something bad will happen.

That is one of the reasons why, when it comes to capacity management, shards are not buckets and users are not water.

Concepts aside, any capacity management process or system needs to have an actual definition of capacity. A numerical value that can be used to make decisions.

With this definition, you can know:

how much available capacity you have
how much capacity is being consumed
the rate at which those two values are converging
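As a rough illustration of those three values, here is a minimal sketch that treats capacity as a single number of users and projects the convergence linearly. The class name, field names and figures are all hypothetical, and a straight-line projection is an assumption made for simplicity, not how any particular system does it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CapacitySnapshot:
    """Hypothetical point-in-time view of capacity, measured in users."""
    available: float                 # users the current shards can serve
    consumed: float                  # users currently being served
    available_growth_per_day: float  # capacity added per day (assumed constant)
    consumed_growth_per_day: float   # users added per day (assumed constant)

    @property
    def headroom(self) -> float:
        """Capacity that is still unused."""
        return self.available - self.consumed

    @property
    def convergence_per_day(self) -> float:
        """How quickly consumed capacity is catching up with available capacity."""
        return self.consumed_growth_per_day - self.available_growth_per_day

    def days_until_exhausted(self) -> Optional[float]:
        """Linear projection; None means the two values are not converging."""
        if self.convergence_per_day <= 0:
            return None
        return max(self.headroom, 0.0) / self.convergence_per_day


snapshot = CapacitySnapshot(
    available=1000,
    consumed=700,
    available_growth_per_day=5,
    consumed_growth_per_day=15,
)
print(snapshot.headroom, snapshot.days_until_exhausted())
```

In practice the growth rates would themselves be estimates, so the projected number of days is better read as a rough signal than a deadline.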

Calculating capacity is a fairly involved process, and it all starts with metrics.

Worlds Will Burn

In this context, a metric is a dimension of either the shard or the user, typically for a single point-in-time.

If you're paying attention to your software system from an operational point of view, you're probably already monitoring a whole bunch of metrics and know which ones are relevant from the perspective of user experience.

The amount of data stored in your (probably) infinitely scalable binary storage mechanism? That doesn't really matter.

But the CPU usage of the underlying RDS database, which is incapable of scaling automatically? That's probably relevant to understanding your capacity.

But looking at raw metrics isn't enough when it comes to capacity management, especially when they change constantly, as most metrics do. There is too much noise to make long-term capacity decisions.

So the next step towards calculating capacity is to abstract them into profiles.

Vengeance Will Be Mine

A profile is an aggregation of the relevant capacity metrics for a shard or user, bundled together in a handy data structure that is easier to reason about.

The intent of a profile is to draw a pattern out of the noise, discarding the data points that don't meaningfully impact capacity over the long-term and keeping the ones that do.

Constructing a profile typically involves applying some sort of aggregate function over a window of time for each metric, like the MAX RDS CPU Usage over the last seven days or something like that. The profile is still anchored to a point-in-time, just like the metrics are, but it encapsulates within it information about the data leading up to that point.
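To make that concrete, here is a minimal sketch of building a profile by running one aggregate function per metric over a trailing window. The names (MetricSample, build_profile, rds_cpu_percent) and the sample values are made up for illustration; the seven-day window and MAX aggregation come from the example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class MetricSample:
    """A single raw metric reading (hypothetical shape)."""
    name: str
    timestamp: datetime
    value: float


def build_profile(samples, as_of, aggregations, window=timedelta(days=7)):
    """Aggregate each metric over the window that ends at `as_of`.

    `aggregations` maps a profile field name to a
    (raw metric name, aggregate function) pair.
    """
    profile = {}
    for field_name, (metric_name, aggregate) in aggregations.items():
        values = [
            s.value
            for s in samples
            if s.name == metric_name and as_of - window < s.timestamp <= as_of
        ]
        if values:  # leave the field out if there is no data in the window
            profile[field_name] = aggregate(values)
    return profile


as_of = datetime(2024, 9, 13)
samples = [
    MetricSample("rds_cpu_percent", as_of - timedelta(days=10), 90.0),  # too old
    MetricSample("rds_cpu_percent", as_of - timedelta(days=3), 40.0),
    MetricSample("rds_cpu_percent", as_of - timedelta(days=1), 52.0),
]
profile = build_profile(
    samples,
    as_of,
    aggregations={"max_rds_cpu_percent": ("rds_cpu_percent", max)},
)
print(profile)
```

The spike from ten days ago falls outside the window, so it no longer influences the profile. Swapping max for a percentile or a mean is a one-line change, which is where the math and statistics come in.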

As you might imagine, deciding what aggregations to run on your raw metrics in order to create a profile can be somewhat complicated. If you have someone who is good at math and statistics, this is a great problem to give them.

They love this sort of stuff.

If you don't have someone like that, well you better hope that you have access to a strongly-opinionated platform that guides you in the right direction, because otherwise you're in for a lot of trial and error.

Profiles are an improvement over raw metrics, being less noisy and showing high-level patterns more clearly, but they are still just another stepping stone on the path to actually being able to manage capacity.

You also need to identify constraints.

The Swarm Calls

Constraints are the breakpoints for your software system. The metric values at which things start to degrade enough that you are no longer comfortable putting more users into a shard.

Constraints only really apply to shards, because shards are the containers, so for each metric in a shard's profile, you need to select a numerical value and a comparator that together define when the metric is constrained.

For example, you might say that the MAX RDS CPU Usage shouldn't be any higher than 55%. Once it breaches that, the shard is considered at capacity and should no longer be available, at least until it goes below that point again.
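A constraint like that can be sketched as a metric name, a comparator and a threshold. The Constraint class and the max_rds_cpu_percent field name are hypothetical, and treating a missing metric as "not constrained" is an assumption you may well want to flip.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Constraint:
    """A breakpoint for one profile metric (hypothetical shape)."""
    metric: str
    comparator: Callable[[float, float], bool]
    threshold: float

    def is_breached(self, profile: Dict[str, float]) -> bool:
        value = profile.get(self.metric)
        if value is None:
            # Assumption: no data means not constrained.
            return False
        return self.comparator(value, self.threshold)


# "MAX RDS CPU Usage shouldn't be any higher than 55%"
cpu_constraint = Constraint("max_rds_cpu_percent", operator.gt, 55.0)

print(cpu_constraint.is_breached({"max_rds_cpu_percent": 52.0}))
print(cpu_constraint.is_breached({"max_rds_cpu_percent": 61.5}))
```

Because the check runs against the profile rather than the raw metric, the shard becomes available again once the aggregated value drops back under the threshold.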

As you can imagine, defining the constraints for your software system involves a lot of experimentation and analysis, but it's worth it.

Metrics come in, they get aggregated into profiles, and the profiles are compared to the constraints in order to determine whether the shard is at capacity or not. You can make well-founded decisions about where users should go, based on actual data.

There is a bit of nuance here though.

For example, you will almost certainly want to establish a default profile for a user. When a new user arrives, you know literally nothing about them, and if you count them as consuming nothing, you will overload your shards. Instead, you should assume that they consume some capacity and embed that assumption into the system. You can always shuffle things around later when you have actual data anyway.

Also, if you have multiple constraints, you want to be pessimistic, considering the first one to breach to be the one that anchors the capacity calculation. You don't have to do this, but it's generally a safer model when it comes to safeguarding the user experience.
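Both pieces of nuance can be sketched together. The example below assumes a deliberately naive additive model, where each user adds a fixed amount to each shard metric, which is exactly the "users are water" simplification warned about earlier, so treat it as an illustration only. The metric names, limits and default user profile are invented.

```python
import math

# Hypothetical default: what a brand new user is assumed to consume.
DEFAULT_USER_PROFILE = {"max_rds_cpu_percent": 0.5, "db_connections": 2.0}


def remaining_users(shard_profile, limits, user_profile=DEFAULT_USER_PROFILE):
    """Return (anchoring metric, users that still fit) for one shard.

    Every constraint is evaluated and the most pessimistic answer wins.
    Assumes each user adds a fixed, positive cost to every limited metric.
    """
    per_metric = {}
    for metric, limit in limits.items():
        cost = user_profile[metric]
        if cost <= 0:
            raise ValueError(f"user cost for {metric} must be positive")
        headroom = limit - shard_profile[metric]
        per_metric[metric] = max(math.floor(headroom / cost), 0)
    anchor = min(per_metric, key=per_metric.get)
    return anchor, per_metric[anchor]


shard_profile = {"max_rds_cpu_percent": 45.0, "db_connections": 180.0}
limits = {"max_rds_cpu_percent": 55.0, "db_connections": 200.0}
anchor, remaining = remaining_users(shard_profile, limits)
print(anchor, remaining)
```

Here CPU alone would allow twenty more default users, but connections only allow ten, so connections anchor the calculation.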

And I Am Its Beating Heart

Metrics, profiles and constraints.

With their power combined, you can reason about capacity, which gives you the ability to answer the following two questions:

How many more users can we support with the shards we have available?
Where will this specific user fit in the shards we have available?
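As a sketch of how those two questions might be answered across a set of shards, the example below uses a single constrained metric and a simple additive cost per user. Everything in it is an assumption made for illustration: the shard names, the numbers, the additive model, and the placement policy of picking whichever shard has the most room.

```python
import math


def users_that_fit(shard_profile, limits, user_profile):
    """How many copies of this user fit before the tightest limit is hit."""
    fits = min(
        math.floor((limits[m] - shard_profile[m]) / user_profile[m])
        for m in limits
    )
    return max(fits, 0)


def total_headroom(shards, limits, default_user):
    """Question one: how many more (default) users can the shards support?"""
    return sum(users_that_fit(p, limits, default_user) for p in shards.values())


def place(shards, limits, user_profile):
    """Question two: pick the shard with the most room, or None if none fit."""
    room = {name: users_that_fit(p, limits, user_profile) for name, p in shards.items()}
    best = max(room, key=room.get)
    return best if room[best] >= 1 else None


limits = {"max_rds_cpu_percent": 55.0}
default_user = {"max_rds_cpu_percent": 0.5}
shards = {
    "shard-a": {"max_rds_cpu_percent": 45.0},
    "shard-b": {"max_rds_cpu_percent": 54.0},
    "shard-c": {"max_rds_cpu_percent": 50.0},
}

print(total_headroom(shards, limits, default_user))
print(place(shards, limits, {"max_rds_cpu_percent": 6.0}))
print(place(shards, limits, {"max_rds_cpu_percent": 12.0}))
```

A user with a known, heavier profile only fits on the emptiest shard, and one that is heavy enough fits nowhere, which is the signal that more shards are needed.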

It might not seem like much, but the ability to answer those questions, to be able to understand what your capacity is at any point in time, is what forms the foundation for almost all other capacity management capabilities.

Like placement and growth and rebalancing and a whole bunch of other keywords and phrases that I'm not going to get into right now.

But I will eventually.

After all, the Terran Dominion wasn't destroyed in a day.