
Beyond Backstage: Building Platforms That Scale

Ajay Chankramath & Nic Cheneweth on platform engineering: product mindset, control planes & AI. Don't skip the foundations.



Why Platform Engineering Needs a Product Mindset

Ajay Chankramath: Hello, everyone. Welcome to GOTO Unscripted. Today we're going to have some interesting conversations on platform engineering. I'm Ajay Chankramath, founder and CEO of Platform Metrics, where we look at how to break down the problems organizations have in their platforms today. Nic?

Nic Cheneweth: Yeah. Nic Cheneweth, and I'm a principal technologist at ThoughtWorks, deeply involved in several different ways in which we build and put platforms into use with large customers.

Ajay Chankramath: For the listeners who are not familiar — Nic and I used to be colleagues. We worked together for many years, and then we went in slightly different directions. But the good thing is that we have been working on some books and research together. I'm thinking that maybe we should keep the conversation around the platform engineering space: some of the things we talked about in the book, and also the new things that are coming up right now. What do you think?

Nic Cheneweth: Yeah. Sounds good.

Ajay Chankramath: Awesome. I'll kick it off with some thoughts and I would love to hear yours. One of the things we cover a lot in the book is about the product mindset from the platform point of view. It's often misused and not quite well understood. From your point of view, why is that product mindset needed for platforms?

Nic Cheneweth: It ties right back to the results you're going to get out of it. A lot of companies have great product mindsets when they're literally facing their external customers. They've developed those muscles — they're pretty good at it, and they're always getting feedback from customers in that sense. But with internal products, it's all over the place. A lot of organizations have never really thought about it that way, so they've never developed a sense of looking at something and saying, 'that's a terrible product experience,' when they're delivering it to internal customers. They've just thought in terms of, 'can I get something done?' And there's always time pressure in that. Thinking about it in a product way takes a bigger upfront investment, but the rewards are tremendous. The productivity and sustainability of something you've built properly is going to be very different from just taking the quickest path.

Ajay Chankramath: One of the transformation patterns I've been talking about more recently is that one of the primary goals is not just to standardize everything end users are doing within the enterprise. Then you start thinking about orchestrating some of those things, and then comes the whole contextualization part. If you're working in a financial services company versus a telecom, how does that context get into this whole idea of building the infrastructure? It's not a question of whether I should apply a product mindset — it's a question of how I do it. I think about this in the context of a supervised approach. When I think about standardization, we've all been talking about golden paths — that has been the backbone of a lot of what you do in internal developer platforms, trying to make sure that the majority of users have a standard way of getting things done.

Control Planes, API-First Design, and Golden Paths

Ajay Chankramath: Then comes the interesting part, which we cover a lot in the book — the whole idea of control planes. The way I think of control planes is mostly around how you take a developer intent — a developer wants to deploy Postgres, a database, whatever — and translate that into the architectural parts of the actual infrastructure build-out. I know this has been evolving over time, and since you did a lot of the heavy lifting on that in the book, I'd love to hear your thoughts on that orchestration piece.

Nic Cheneweth: It goes back to what you're saying about the control plane — we want to build everything API-first. Whatever capability we're going to create and provide to developers, behind it is an API. And frequently it's not one you have to create yourself. If you're working with a cloud provider, they've already done that, and you can expose those things. But when you talk about the golden path for developers interacting with the things they most commonly use, you still want to present an API to them so that it's a convention — a standard they get for low effort. It's just: boom, they can deploy it. But it's also an API endpoint, so when they need to break out of that mold for whatever reason, they have a place to start. It's standard. Like: I've got a one-inch pipe, I need to go to a half-inch — I can just get a step-down fitting, because it's a set of conventions and standards they can follow. That's what we're always looking for: when I expose something, how can I do it either directly if it's a third-party thing, or in a way that gives developers guardrails? That's where a control plane comes in. It lets me wrap something like AWS or a Google Cloud resource with those standards, while at the same time having a mechanism for developers to go direct and get whatever else the cloud provider offers — completely self-serve.
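The golden-path-with-an-escape-hatch idea Nic describes can be sketched in a few lines. This is a hypothetical illustration, not any real platform's API: `DatabaseSpec`, `provision_database`, and `golden_path_postgres` are made-up names, and the defaults are invented. The point is that the convenient path and the direct path are the same endpoint underneath, so a team that outgrows the defaults has a place to start.

```python
from dataclasses import dataclass, field

# Hypothetical low-level resource request, mirroring what a cloud
# provider's own API might accept. Every knob is explicit here.
@dataclass
class DatabaseSpec:
    engine: str
    version: str
    storage_gb: int
    backup_retention_days: int
    tags: dict = field(default_factory=dict)

def provision_database(spec: DatabaseSpec) -> dict:
    """Direct, self-serve path: callers control every field."""
    return {"status": "provisioned", "spec": spec}

def golden_path_postgres(team: str) -> dict:
    """Golden path: one call, platform conventions baked in.

    Internally it is just the same API, so stepping down from the
    golden path to the direct path is the 'step-down fitting' —
    call provision_database() yourself with a custom spec.
    """
    spec = DatabaseSpec(
        engine="postgres",
        version="16",
        storage_gb=50,
        backup_retention_days=7,
        tags={"team": team, "managed-by": "platform"},
    )
    return provision_database(spec)
```

A team on the golden path calls `golden_path_postgres("payments")`; a team that needs 500 GB of storage drops down to `provision_database(DatabaseSpec(...))` without leaving the platform's conventions.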

Ajay Chankramath: One thing that came up at a conference a couple of weeks ago was: how does this commitment to an API-first design simplify that whole world of managing all those APIs? And you also alluded to self-healing, which is important. When you're building some of these things, how can a control plane help with that process?

Nic Cheneweth: Imagine if it wasn't there and you were just thinking, 'I need to build this' — which is what Google was doing with Borg way back. It was: I've got this problem I need to solve, and if I do automate it and provide it in a highly resilient way, what things do I have to do? I need a reconciliation loop, constantly checking the state and then responding to it, comparing it to what the user asked for. Has it drifted? How do I put it back? That whole set of things — what constitutes health and can I continuously check it — is what Kubernetes handles. It's a very resilient, highly-tested loop, with probably more community focus on it than just about any other control loop we've got available to us. And it's extensible. It's meant to be used this way. You can readily extend it to something it doesn't understand natively, but still apply that same logical loop. Rather than rebuild all that by hand and try to achieve that same level of resiliency and community understanding — which would be very difficult — just use it for what it is. We create platform products where most people think of Kubernetes as 'I want to deploy my API or app there.' We deploy a control plane where users don't even know it's Kubernetes — they're not trying to deploy apps, they're trying to build other kinds of infrastructure, but that's the API interface we give them so they get all of those benefits. Why reinvent the wheel when it's sitting right there and relatively cheap to provide compared to custom-writing all of this?
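The reconciliation loop Nic describes — compare desired state to actual state, detect drift, converge — is simple to sketch. This is a toy model with plain dicts standing in for resources, not Kubernetes itself; a real controller watches events and re-runs this pass continuously.

```python
import copy

def reconcile(desired: dict, actual: dict) -> list:
    """One pass of a reconciliation loop: diff desired state against
    actual state and emit the actions needed to converge."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name, spec))
        elif actual[name] != spec:  # drift detected
            actions.append(("update", name, spec))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions

def converge(desired: dict, actual: dict) -> dict:
    """Apply one pass of actions and return the new actual state.
    A controller would run this in a loop, so transient failures
    just get retried on the next pass."""
    actual = copy.deepcopy(actual)
    for action in reconcile(desired, actual):
        if action[0] == "delete":
            del actual[action[1]]
        else:
            actual[action[1]] = action[2]
    return actual
```

The key property is idempotence: once actual matches desired, `reconcile` returns an empty action list, which is also why extending the same loop to resources Kubernetes doesn't understand natively (via custom resources and controllers) works so well.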

Ajay Chankramath: Absolutely. Something related to that is applying infrastructure-as-code to the foundational aspects. There's a term — I'm not sure we introduced it, maybe we did — something like 'platforms-as-code,' which is a play on the API-first approach.

Nic Cheneweth: A play on the API development approach, right?

Ajay Chankramath: Yes. From my point of view that's a vital cog in this. If you're really trying to get platforms to evolve over time — which is a theme we're seeing a lot more of — then there's a question of how much maintenance a platform requires, how you keep platforms properly maintained, and how you avoid the platform devolving as it scales. What do you see companies being more interested in across that spectrum?

Nic Cheneweth: What do you see? Because you're tied into a lot of what companies are looking for as they reach out into the market.

Ajay Chankramath: That's a great question. Just today, about an hour ago, I was having a meeting with a client who is a big player in the educational technology space. They've been on the platform journey for about three years. They initially stood up a team, extracted all the capabilities into a separate team, and developed it further from an operations point of view. Their challenge is precisely on the evolution side. They built a bunch of things, and now they're finding it extremely complex to manage. This is where the separation of duties comes in — we discuss this in the book, the whole idea of shared responsibility models. They initially thought the platform team should build and maintain everything all the time. That's led to a situation where even their CI/CD pipelines are being maintained by the platform team. I was telling them that's probably the wrong place. If you think pipelines are the only thing the platform team should own, you've actually cheated yourself by not building the right foundations.

Nic Cheneweth: That's one of those things that adds a 2% improvement, which is valuable when you add it to everything else. But there are a few things you do that are a 10% improvement — measured by reduction of time and removal of friction. Developers maintaining ownership of pipelines is a key one. At ThoughtWorks, we basically invented doing that — with CruiseControl and similar tools back in the day. The whole notion was created for developers and for developer concerns. Then it got co-opted by everyone else for their concerns and turned into something that can work contrary to what you're trying to achieve. But if you do it right, there are specific architectural outcomes you need to meet. Otherwise it just becomes hard and you don't want that.

Backstage: Promise Versus Reality

Ajay Chankramath: With this client, another challenge is that they've been using Backstage on their journey for a while, and it hasn't been very successful. When I talk to a lot of newer clients, one thing I consistently find is that so many people are getting started with Backstage, but the number of people who are really finishing that journey and getting something meaningful out of it seems to be very low. Part of it is that we have to blame ourselves for setting up that expectation — a lot of us talk about Backstage as the holy grail of everything, to the point where people don't even talk about developer portals anymore. They think, 'oh, that's a platform thing, right?' But having Backstage alone should not give you the false comfort of saying you've implemented a platform. Providing a button for developers to push doesn't mean you're successful — not if the backend isn't properly set up, whether it's the underlying network structures or all the things we discuss in the book. Those architectural underpinnings are usually the biggest challenge.

Nic Cheneweth: It can be a complete trap if not done well. It's a touch point, it's a UI, and a lot of companies don't have an alternative that provides that same function — but they don't implement it the right way. It's very tempting, especially when you look through the quickstart guides on Backstage, to get sucked into the pattern of saying, 'oh, everything's just a plugin,' and it becomes something that performs the business logic of setting up platforms. But it wasn't designed to do that. You can do it that way — but should you? That's another question. The other thing is people think, 'well, I'm just starting, one team is going to use it, we'll cut corners to get that one team on it.' There's value in experimenting and finding out what those teams actually need before investing. But if you didn't have the time to implement it right, why do you think you're going to have the time to fix it? And then you double down, and the problem compounds.

Ajay Chankramath: Exactly. And this comes up in conversations all the time. If everyone goes on the Backstage journey and at some point isn't getting the right value out of it, the question becomes: what do I do? Making sure architecture decisions are solid behind the scenes is what can actually help. One of the things we emphasize in the book is how you actually translate these things into organizational management. Backstage POCs look really good because they have a front end and all of that. But when you start to scale, you start seeing the larger problems — whether it's being able to onboard more developers or being able to translate developer intent into the kinds of outcomes you actually need.

Nic Cheneweth: Right. When you're mentioning that customers are having challenges at scale — from their perspective, how would they frame the question of what's going wrong? How would they describe it?

Ajay Chankramath: More often than not, they don't know. Which is why measurement is a huge aspect of how we should be improving any of these things. Organizations even struggle to measure the problem, because they tend to jump to DORA metrics, which is a giant leap from where their problems actually are to the manifestation of those problems. Sometimes it's about breaking down the basic problems as they see them and then building from there. Sometimes, if they're setting up APIs that aren't being considered properly within their architecture — or developer touchpoints are somewhat spread out rather than more unified — defining those problems is really the key to figuring out why people aren't seeing the value they'd expect.

Nic Cheneweth: It's like — if you knew as a company that you wanted to provide a great mobile experience to customers, you'd think in terms of needing people who are familiar with mobile, whether that's shaping what the app should be or just building what you think you want to build. You'd hire or train that way, or engage a vendor that way. But with internal products, people often don't do that. They get some folks together who are familiar with DevOps ideas and say, 'you're going to build a platform,' without any product support. It's not uncommon for a full-stack team doing DevOps to not be coders who build APIs or work in cloud-native spaces — they haven't needed to, so why would they jump into it? But if you're going to build a platform, you do need some of your people on that product team who can build and shape those things. You're never going to get a great platform experience solely from third-party tools integrated out of the box. You'll always have to write something custom. You just try to minimize that, and make choices that keep things as simple as possible — but not simpler. Paraphrasing Einstein there. It's a big part of it.

Domain Design: The Foundation That Makes Platforms Scale

Ajay Chankramath: The thing is — at least when I started doing some of these things many years back — that was the whole promise: being able to make sure your domain APIs, the domains you're working in, can translate domain requests into infrastructure decisions. So that even if you're not an expert in infrastructure, your platform and DevOps teams can help with that process, but the internal translation happens from the domain side. There still seems to be a gap. People are still looking at this as, 'here's my infra silo, here's my platform capability build-out silo,' and they're still treating them separately. Bridging that gap is where the control plane discussions come in — at least, that's what I glean from that part of the book.

Nic Cheneweth: Right. Doing domain design badly — or not doing it at all, which is one way of doing it badly. If you don't aim for something, you're definitely going to hit it. When we switch from doing infrastructure in traditional ways to doing it in software, why would you think all the software needs, problems, and engineering strategies aren't going to come with it? Absolutely they do. You need good, strong domain design within platform products so that whatever part needs to grow and scale — or be managed by a different team depending on platform size — it has domain boundaries that we've tested and that work. APIs work. If it was on the business side of your product, you'd be very concerned about those things. If you were going into a distributed services architecture, you'd ask: how do I know I've got this domain boundary right? It's not just wherever I want it to be, it's where it can actually work. I know that because teams aren't stepping on each other, they're not in the same codebase, it can evolve and ship quickly without dependencies across teams. You do the same thing on a platform to get that outcome.

Ajay Chankramath: For listeners who may not be very familiar with domain concepts and applying them to platforms — can you throw out a couple of typical examples of a bounded context in this case?

Nic Cheneweth: In a platform, the control plane itself is often a clean first example. If you think of Kubernetes on GKE with Google Cloud, there's what the vendor manages inside of that and what you manage on top of it — because as a platform builder you will always be deploying things. That's a clean line: what is everything Google manages for me? That becomes its own pipeline and domain — everything I need to do to manage and build it is in one place, separate from what I deploy on top of it. If I'm running a service mesh on top of that, it's a different pipeline, maintained maybe by a different team depending on how large the platform is. The team that's evolving the Kubernetes baseline is free to do that without interfering with the teams building on top of it. There's not a lot for you to do if Google is managing GKE, but you are choosing when and where you want to be on certain upgrades, when node upgrades happen, and you do it in a way that doesn't interfere with anyone else. And likewise for the teams building on top. I always tell people: if you've ever used a well-managed third-party API — like Google Maps — to do something, you'll kind of get what the internal experience should be like. I get as much Google Maps as I need, on demand. They may not have all the features I need today, but I don't call up Google and demand it. I have a workaround, and I wait and see if they build it someday. All those experiences should be there with internal platform domains — whatever team is building whatever, they can ship and evolve as quickly as there's value, and other teams around them aren't blocked or dependent. There are no ten different meetings required to consume what they need, and vice versa.
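The cluster-baseline and service-mesh bounded contexts Nic walks through can be sketched as two independently owned modules that share only a published contract. Everything here is illustrative — the function names and the contract fields are invented — but it shows the property he's after: the mesh team consumes the baseline team's interface, never its internals, so each side can ship without coordinating with the other.

```python
# Domain 1: owned by the cluster team. Everything the vendor
# (e.g. GKE) manages, plus this team's version choices, sits
# behind one published contract.
def cluster_baseline_api() -> dict:
    """The only thing other domains are allowed to depend on."""
    return {
        "kubernetes_version": "1.30",
        "endpoint": "https://cluster.internal",
    }

# Domain 2: owned by the mesh team, with its own pipeline.
def deploy_service_mesh(cluster: dict) -> dict:
    """Consumes only the baseline contract. The cluster team can
    roll node upgrades or change internals freely; as long as the
    contract holds, nothing here needs a meeting."""
    return {
        "mesh": "installed",
        "target": cluster["endpoint"],
        "compatible_with": cluster["kubernetes_version"],
    }
```

The test of a good boundary, as in the passage: the two functions could live in separate codebases with separate pipelines, and neither team steps on the other.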

AI and Agents in Platform Engineering: What's Actually Working

Ajay Chankramath: We're almost running out of time, so let's jump to the topic everyone's interested in: the application of AI and agentic systems in platforms. The kinds of things I'm looking at include how you apply solutions to reduce manual processes and improve the whole experience — things like intelligent CI/CD, going from the self-service aspects to more self-optimization or self-evolving solutions. Other areas I'm exploring include enhanced observability — anomaly detection and predictive analytics. And then multi-agent scenarios, like having a cost optimization agent running alongside an incident response agent. I'm actually building a couple of prototype solutions to show people what's possible. What's been your journey in that space?

Nic Cheneweth: I've been fortunate to work with probably the biggest company in that space in terms of consumers of the hardware needed to build new AI things — whether it's training completely new models or further training a model you've bought. Maybe you're using an advanced open-source model like Llama or something similar, but you're tuning it with your own data so that it knows enough about your business to start applying it to things in ways that only people could do before. That said, it doesn't take away the human responsibility. If you can't assess what AI tells you and know if it's correct — if you don't evaluate its output — that's the core problem. Because it's that kind of contributor who can come up with brilliant things sometimes, and other times comes up with things where it's like, 'are they even paying attention?' And doing that well is where most of the effort is: taking something generalized and tuning it to the realities of a particular customer's data, so that it's much more likely to provide useful help. On a support interaction analyzing metrics and logs from your cluster to try to help you catch things ahead of time, it functions like an additional knowledgeable team member that owns the operational space — coming up with ideas. And like most humans in a group, some of those ideas are good and some aren't. But it does it really fast. There's a bunch of good ideas and then there are some terrible ideas. But they don't sleep. They don't need time off. They're always there to help the team contribute.

Ajay Chankramath: Your experience seems really fascinating. What I see in my conversations and in the work I'm doing is people jumping into agentic automation at this stage without the foundations — which leads to fragmented infrastructure because there are no real standards. All the bad things get amplified. If you don't build the foundation for your house right, you're not going to have a house. When you think about getting people to apply the right standards — control plane-centered design, domain-level APIs, all those things that should happen with the supervised approach, a human-in-the-loop where you don't just let it run — in the space you're really looking at on the model and operational side, how do you reinforce building those foundations?

Nic Cheneweth: I think about what we do, and we're still in a place where — say you've got nodes in a cluster where the logs are filling up. That's a boring operational problem. Just truncating the log file isn't really the solution. The real fix is having something running automatically that knows how much information should be kept, offloads it, and keeps things clean. When you deploy an AI agent, it's probably not going to say, 'here's how you permanently fix that so you never need to worry about it again.' It's probably just going to clean it up — erase those logs. Same idea: it's going to propose a lot of ideas and take action on things, but you can't just assume the solutions will all be correct. Unless there's a system that runs unattended for a year and then gets turned off — in which case, who cares? But for anything real, you need to figure out how to apply it such that the value meets the expected return. I'm always amazed when someone is doing something with AI and the result is like — if a human did that for you, you'd fire them. We wouldn't tolerate from a person the same things we're saying we'll accept from AI. At least apply the same standard.
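The "permanent fix" Nic contrasts with blind truncation can be sketched as a small retention job: instead of erasing logs, it offloads anything past a retention window to compressed archives and only then removes it from the hot path. This is an illustrative sketch, not production log management — real setups would ship to object storage and handle files being written concurrently — but it shows the shape of a fix that can run unattended in a loop.

```python
import gzip
import os
import shutil
import time

def enforce_log_retention(log_dir: str, archive_dir: str,
                          max_age_seconds: float) -> list:
    """Offload *.log files older than the retention window to gzip
    archives, then remove them from the hot path. Idempotent, so it
    can run on a schedule without supervision."""
    os.makedirs(archive_dir, exist_ok=True)
    archived = []
    now = time.time()
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".log"):
            continue  # leave non-log files alone
        path = os.path.join(log_dir, name)
        if now - os.path.getmtime(path) < max_age_seconds:
            continue  # still within retention, keep in place
        # Compress into the archive before deleting anything.
        with open(path, "rb") as src, gzip.open(
            os.path.join(archive_dir, name + ".gz"), "wb"
        ) as dst:
            shutil.copyfileobj(src, dst)
        os.remove(path)
        archived.append(name)
    return archived
```

An agent that just truncated the file would pass a naive "disk is no longer full" check; a job like this is the kind of durable solution a human reviewer should be asking the agent for.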

Ajay Chankramath: The non-deterministic nature of some of these things makes it super frustrating. And more than anything, I'll admit — I'm not good at math. I've been starting to trust AI a little more on the math side of things.

Nic Cheneweth: It doesn't know anything. It's just math.

Ajay Chankramath: Right. We actually did something called the State of AI in Platform Engineering, and some of the data that came out was really interesting. We asked literally thousands of developers how they're using AI and how many of them are using it. Apparently about nine out of ten people seem to be using it. The things that are both surprising and not surprising — because everyone knows about this — are code generation and documentation. And obviously making sure communications don't come across as rude, making them warmer, things like that. But what I've been seeing a lot more of is people using AI to write infrastructure code, which gives me nightmares. They're saying they use it for reading and generating infrastructure files. When those things happen, I still struggle to articulate why that's a bad idea. You need to have standards in place. If you put garbage in, what are you going to get out? The things I think about are: you're going to have infrastructure that doesn't scale properly, it won't have proper tests. How do you actually counter that or articulate it?

Nic Cheneweth: How do you write the tests? If you're going to have AI write the code, don't let it write the tests too. It just gets in a loop. Everything it does passes the tests, even though it may not at all be what you imagined behind the scenes. If you're having it generate infrastructure, what are the tests you need to write ahead of time? How would you know that this infrastructure is built to your standards — resiliency and everything else? Because if you've used it like a black box, it doesn't know if it did it right or not. It's just autocomplete, essentially. The analogy I keep coming back to is when spreadsheets were first invented on computers — Lotus 1-2-3 was a brilliant tool. Whenever anything was being calculated by hand on green-bar paper, spreadsheets were wonderful. But spreadsheets don't know anything. You have to use them effectively to produce good results. AI is the same kind of thing. If it's generated something you don't understand and couldn't have written yourself, well — good luck.
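The tests-written-ahead-of-time idea can be made concrete with policy checks that any infrastructure definition must pass, regardless of whether a person or a model produced it. This is a hypothetical sketch — the policies, the required tags, and the dict-shaped resource model are all invented for illustration; real setups typically express this with policy-as-code tools — but it shows the direction: the standards exist before the generator runs.

```python
# Hypothetical platform standards, written before any code is
# generated. Resources are modeled as plain dicts for illustration.
REQUIRED_TAGS = {"team", "cost-center"}

def check_resource(resource: dict) -> list:
    """Return the list of policy violations for one resource
    definition. An empty list means the definition meets the
    platform's standards."""
    violations = []
    if not resource.get("encrypted", False):
        violations.append("storage must be encrypted at rest")
    if resource.get("min_replicas", 1) < 2:
        violations.append("must run at least 2 replicas for resiliency")
    missing = REQUIRED_TAGS - set(resource.get("tags", {}))
    if missing:
        violations.append(f"missing required tags: {sorted(missing)}")
    return violations
```

Gating generated infrastructure on `check_resource(...) == []` means the black box doesn't have to "know if it did it right" — the humans already encoded what right means, and the generator can't grade its own homework.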

Ajay Chankramath: I run a nonprofit on the side — you probably know this. Coming to the end of the year, I've been doing some reconciliation and I've been really behind on it. I had some accounting questions and thought, 'how bad could it be to just ask ChatGPT?' I cannot tell you how bad it was. I ended up having to call Quicken to actually get things fixed. It's amazing what happens when you don't really know what you're doing in an area and you let AI fill that gap.

Nic Cheneweth: Yeah. And why is everyone going so fast with this? Because everybody's so desperate for it to succeed, and there are some genuinely valuable things happening — that's why it captured people's attention in the first place. But what we imagine it can be gives some people a good argument to try strategies that normally wouldn't be able to get through. It'll come back to earth. The value that's real is still real. If it's not doing things effectively in a given context, that will become clear. And depending on who you are, maybe competition keeps you honest — or maybe it doesn't.

Ajay Chankramath: For me, I've seen history repeat itself. It's all going to come crashing down for organizations that don't do it right. There are some genuinely valuable aspects, but fundamentals always matter. It's been such a pleasure talking to you and catching up again. Hopefully until next time.

Nic Cheneweth: Take care.