Platforming Is Hard

If you've been following this blog at all over the last year or so, you'll know that I've been part of an effort to build a new internal platform that empowers service teams within Atlassian to reap the benefits of capacity management.
I've participated in a lot of engineering initiatives in my time on this earth, but I haven't done anything quite as large or ambitious as building an entirely new platform, especially not one planned at this sort of scale.
It's safe to say I've got some reflections.
Well, This Doesn't Look Too Bad
Before I get into the reflections though, let's set the stage.
The problem was that there wasn't anything within Atlassian that helped a service team to manage the capacity of their service once it got into production.
The biggest and most critical services within Atlassian had their own capacity management solutions, but they were not re-usable, because they were never designed with that in mind.
For context, managing capacity is all about:
- Deciding how many users a single service shard can support
- Deciding how many shards are necessary to support all of the users
- Balancing the users across all of those shards based on various conditions
It's a problem that manifests when you have a lot of users and you either can't support all of them on a single cluster of infrastructure, or you want to mitigate the risk of a single cluster going down and impacting every single one of your users.
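To make those three jobs a little more concrete, here's a minimal sketch of the sort of arithmetic involved, in Python. Everything in it is made up for illustration (the users-per-shard limit, the function names, the naive round-robin balancing); a real capacity management system would weight its decisions by tenant size, region, shard health and a bunch of other signals.

```python
import math

# Illustrative numbers only; real limits come from load testing and observation.
USERS_PER_SHARD = 50_000   # how many users a single shard can comfortably support
TOTAL_USERS = 1_200_000    # how many users need to be supported overall


def shards_required(total_users: int, users_per_shard: int) -> int:
    """How many shards are needed to hold every user."""
    return math.ceil(total_users / users_per_shard)


def balance(user_ids: list[str], shard_count: int) -> dict[int, list[str]]:
    """Naive round-robin balancing of users across shards."""
    assignment: dict[int, list[str]] = {i: [] for i in range(shard_count)}
    for index, user_id in enumerate(user_ids):
        assignment[index % shard_count].append(user_id)
    return assignment


shard_count = shards_required(TOTAL_USERS, USERS_PER_SHARD)  # 24 for these numbers
```

The hard part, of course, isn't the arithmetic; it's doing this continuously, safely and automatically while users are actively using the shards.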
I've written about the nature of capacity management a few times on this blog already, so if you want to learn a bit more about the problem space, feel free to go and have a read of either of the posts below.


With no common solution for managing capacity, and a continued drive towards decomposing monolithic services, it was entirely likely that each team would end up building their own solution for capacity management.
This would waste engineering effort across the board and possibly create reliability problems because each solution would be bespoke and probably immature.
So, we decided to solve that problem by building a platform, mostly because we already owned the most mature capacity management solution in the business, and thus had the necessary skill and experience to generalise it into something that other services could leverage.
Which leads nicely into my first reflection.
Oh, That Was Unexpected
It turned out that owning the most mature capacity management solution in the business, for the two largest and most critical services that back Atlassian products, was both a blessing and a curse.
It was a blessing because we knew a lot about how to manage capacity at scale, and had already built a bunch of stuff that let those management processes be automated and tracked, plus all of the other good stuff you need when you're operating at scale.
It was a curse for basically exactly the same reasons :(
In fairness, the burden of knowledge part wasn't that bad, as we were able to generalise a lot of the service-specific stuff without too much trouble when we were designing the new platform. I wouldn't say it was easy, but it wasn't a massive blocker or anything.
The second part though, the existing systems? That really did cause more pain than we originally expected.
The problem wasn't that the existing systems were bad or anything. They were perfectly serviceable for what they needed to do, though with some rough edges that we hoped we could smooth out in the new platform.
We didn't even try to naively retrofit a new generic capacity management engine into the existing systems while keeping them running in production; we built the new stuff in parallel, taking the functionality of the old stuff into account as we did so.
The pain came from two places.
The first was that the existing capacity management systems had their functionality split across at least three services.
As you can imagine, this is problematic when you want to create a new way of doing things. If you don't consolidate, you need to make changes in multiple places, which is just a recipe for disaster. So we decided to consolidate, to merge all of the existing capacity management stuff together into a single service and smooth out all of the flows.
It was a lot more painful and time consuming than we originally estimated, and we'd assumed from the start that it was going to be somewhat painful and time consuming.
Complicating things was the second pain point, which was that other things happening within Atlassian exposed operational problems in the existing capacity management systems.
This put more pressure on the team doing the consolidation and platform work, from both an engineering and an operational point of view, because they needed to add functionality to the existing systems to deal with the new operational challenges (and adding that functionality was both time-critical and more difficult because of the previously mentioned functionality split).
All in all, I'm not actually sure what the lesson is here.
Maybe not to underestimate the effort required to work on an existing system? I mean, we tried to take that complexity into account; we just didn't quite get the estimate right, which is fine, because they are estimates.
Possibly the lesson is that you should action known weaknesses in your system as quickly as you can. We'd known for a long time that the functionality split was a danger, and we knew we needed to fix it, but we thought we had more time before it bit us.
Classic mistake.
Ridiculous Challenges At Every Turn
Speaking of classic mistakes, whenever you embark on building a new thing, one that will take many quarters, perhaps even years to fully mature, you need to make sure that you have consistent and focused leadership.
That leadership needs to be focused inward, understanding the north star that you are aiming towards and constantly tweaking the plans for how to get there to take reality into account. No plan survives contact with the enemy after all, and there is no greater enemy than the unexpected complexities of engineering.
The leadership needs to be focused outward as well, talking to customers, understanding their needs, and lining them up to evaluate and give feedback on early solutions. A perfectly built platform is useless if no-one knows about it. The outward focus needs to extend to other parts of the business too, reinforcing the value of the platform to extended stakeholders and senior leadership so that they aren't tempted to shift all of the necessary engineering effort elsewhere.
The second reflection I have is that we didn't have enough leadership.
There was some. We had initial plans, we re-evaluated those plans on a regular basis, we talked to customers, we extrapolated requirements from those conversations, and we generally kept the ball rolling and things moving forward.
There just wasn't enough of it.
Everyone on the leadership team for the platform was constantly being pulled in other directions, to contribute to other initiatives that the business deemed more valuable or to fight fires that needed to be extinguished in order to keep the existing things running.
Doing all of that classic leadership stuff was no-one's first priority. Sometimes it wasn't even a second or third priority, so from a leadership point of view we've basically limped along as best we could, and the progress that we've made has suffered as a result.
To be clear, this is not a mark against any of the engineers who have been working hard to build out what they can; they have been doing the best they can with the work that has been organised for them to execute.
It's entirely on the leadership team, and more specifically on me, as the platform was basically my baby, and I have neglected it more than I should have.
The lesson here is a complex one, but I think it basically comes down to making sure that the people who are involved in the initiative have the time and focus necessary to actually work on it, consistently, over an extended period of time.
Obviously if the priorities of the business shift though, there is very little that you can do. Effort goes where effort is needed, and if the consensus is that effort is not needed on your specific thing, you should listen.
Not Giving Up
Which brings me to the end of my reflections.
It's not that we haven't accomplished anything at all. We have. We've built some great basic functionality and strengthened our foundational constructs so that we can use them to keep moving forward towards that north star of a generic, re-usable capacity management platform.
It's just been slower and more painful than I thought it would be.
That's no reason to give up though, so I won't.
If anything, the whole experience is a great opportunity to learn some valuable lessons about how not to do things and what mistakes not to make.
At least, that's what I keep telling myself anyway.