Effective Platform Engineering
Transcript
Intro
Wes Reisz: Hello and welcome to a new edition of the GOTO Book Club. The GOTO Book Club brings the brightest and boldest ideas from new books in science, development, and tech to you in an easily digestible online format. Today we're talking platform engineering. According to Gartner, by 2026, 80% of large software engineering organizations will establish platform engineering teams to provide reusable services, components, and tools via platforms for application delivery. It's often said platform engineering is an evolution of DevOps. Far from evolving away from the three ways found in the DevOps Handbook, platform engineering extends and refines the definition of DevOps. Platform engineering is about taking a product-focused approach to enabling software teams to deliver with autonomy and low friction, and to achieve fast flow.
Platform engineering does this by accelerating time to value and reducing the cognitive load to deliver software. My name is Wes Reisz. I'm a technical principal for Equal Experts working in enterprise modernization in the platform space in North America. And up until a few weeks ago, I worked with today's authors at ThoughtWorks, where I had the opportunity to be on some engagements with these folks. Today's guests are Ajay Chankramath and Sean Alvarez of Brillio, and Nic Cheneweth and Bryan Oliver of ThoughtWorks. Their book is called "Effective Platform Engineering." It was recently released on the Manning Early Access Program and covers topics like improving business outcomes with platform engineering, platform product management strategies, scaling Kubernetes-based engineering teams across the entire lifecycle that companies go through, and much, much more.
Some of which we're going to get into over the next hour. What I like about this book is that, without losing focus on promising and delivering business value, it takes a developer-centered view of platforms to streamline operations, reduce complexity, and enhance developer autonomy, which is all truly key to delivering the value of platform engineering. This book represents dozens, if not hundreds, of different platform engineering engagements that this group has worked on, delivering value to production. Gentlemen, it's great to see you all again and welcome to the club.
Ajay Chankramath: Thank you. Looking forward to it, Wes.
Key Principles for Starting Strong in Platform Engineering
Wes Reisz: So, let's start off by just going around the table. The first question I have is, what is the one thing that you really need to get right when you're starting off with platform engineering? So, Sean, we'll start with you.
Sean Alvarez: Sure. The first thing you need to get right is picking the initial use case that you're going to apply platform engineering to in order to increase your efficiency. And you want to start small. You want to start with something that you know you can show the value in, and where you can get a return on the investment quickly, so that you can get more investment going forward as you get the buy-in from your stakeholders.
Wes Reisz: That flywheel effect. Thin-slice through the value and repeat. Bryan, what about you?
Bryan Oliver: The thing I've seen from a lot of organizations when taking this sort of thing on is not adopting evolutionary architecture from the very beginning. So, often we'll see organizations start to build a platform and enable teams with self-service, and they then have to go and refactor a lot of the things that they built because they didn't start with things like interfaces, abstractions, and architectural decision records, things that we talk about in the book. So, we try to focus from the very beginning on teaching teams how to adopt those principles and make sure that they are plugging them in from day one.
Wes Reisz: Makes sense. Ajay.
Ajay Chankramath: Thanks, Wes. I know that Sean and Bryan covered some key points. But from my point of view, it's mostly building the shared understanding of why the platform exists and who it is for. Very, very basic, simple, fundamental things. Within that, I sort of look at it from four different axes. The first axis is developer experience. We talked about the why and the who, and this axis is about improving the developer experience. The next thing is to really think about it from an outcomes and value point of view. What are the outcomes that you're going to get out of building a platform, and what's the value? Because you could build the best platform in the world and probably not get the right value. The third thing is just as important, mostly around stakeholder alignment. You have a lot of stakeholders within your organization from soup to nuts, and a lot of places within the organization where a platform can add value. So, how do you align all those stakeholders to get the right kind of thing? The fourth point, Sean sort of mentioned this, right? Start small, and how do you actually scale that? Going from MVP to scale is so critical; you have to get that right as you really start off.
Wes Reisz: Yes, absolutely. Nic, we'll wrap it with you.
Nic Cheneweth: Yes, and those are really good things you're all listing out there. We've heard this quoted on television a few times recently, the line that you have to know where you are to be able to make good decisions about where you're going next. I think a critical part of it is that introspection as an organization. You can't really make the decisions the rest of you were just talking about unless you recognize where you are today, the level of maturity you have in the space. As a traditional IT organization trying to ship an internal product, you may be close to being able to do that or you may be a long way away from it. If you don't know where you actually are, if you have an overinflated or even underinflated sense of where you're at, it's hard to make good, wisdom-based decisions on what steps to tackle next. So, I think that self-awareness is really critical.
Wes Reisz: Absolutely. Self-awareness is definitely a lost art on some of the teams that we tend to work with.
Recommended talk: The Value Flywheel Effect: A Modern Cloud Strategy • David Anderson & Charles Humble • GOTO 2024
What Sets 'Effective Platform Engineering' Apart: Practical Guidance and Sustained Success
Wes Reisz: Nic, let's keep going with you. There are a lot of really good books out there on platform engineering today. A couple that come to mind are Camille Fournier's platform engineering book and Mauricio Salatino's book, two really good ones. So, what does your book bring that differentiates it from the field that's out there now?
Nic Cheneweth: I think a couple of key things, apart from the fact that obviously all of us have been involved in successfully achieving this at very large scale and in growing it from small scale, as well as making the mistakes you will make along the way while trying to get this right. You know, learn from that; you don't have to remake those mistakes going forward. But I think it's probably the emphasis on the product. How do you build and create good products and ship them and maintain them? And that's what this is: it's an internal product, but it's a product. And so it's not enough to, you know, pick something that's technically accurate. That's good. You don't want to build technically inaccurate things. But how do you life cycle that? How do you evolve it over time?
How do you achieve the internal organizational structures and just like awareness and build, you know, the right muscles to succeed at it? I think there's some strong differentiation for us around that. How do you sustain that over time? And then another part of it is, you know, part of the book, we give people a taste of building some foundational components, actually getting hands-on and building them. They're the starting points. They're not the ending points. But if you get those right, they set you up for that evolutionary architecture that Bryan was talking about. So, I think those are probably two of the critical things, actually getting to try it and then how do you keep focused on the sort of outcome goals you need and the way of measuring those goals that'll lead to success.
Wes Reisz: I think that's so right. You can't forklift a platform from one place to another. What works for me isn't necessarily going to work for you. So, that experience that this group has of actually working with clients and actually coming to a platform that works for them, that solves their problems, those thin slices you were talking about, I think is such an amazing point for this book. And it really comes through in some of the things that we're going to talk a bit about.
Ajay Chankramath: Wes, if I may jump in for a second.
Wes Reisz: Please.
Ajay Chankramath: So, I think this is the uniqueness that Nic sort of highlighted, right? On top of it, one of the things that we are very well aware of, all of us having done consulting work in the past, is that we have talked to a lot of clients, right? All five of us have talked to a lot of potential customers and clients. So, the first thing they ask is, what's in it for them? I mean, we could do the best of the technology thing, but what's in it for them? So, we approached this book asking that same question. What is it that we can provide the practitioners? And as we actually started writing it, and as we started evolving it, we found that a lot of the things that would resonate with the practitioners would resonate with those decision-makers too. This is something that Nic brought into our collective parlance here, looking at personas from the practitioner and decision-maker points of view, right? So, we ended up writing for both, even though we wanted to write a practitioner's book. And this is probably, in our own personal opinion, the best practitioner's book that you would find out there. But it's got that perspective on all the things we spoke about earlier, right? How do you actually make sure that you're doing the right thing? And how do you actually measure that value? You can see a lot of articulation of that in the book.
Wes Reisz: Absolutely.
Recommended talk: Platform Strategy • Gregor Hohpe & James Lewis • GOTO 2024
Evolving from DevOps to Platform Engineering
Wes Reisz: Bryan, I want to ask you the next question. You mentioned evolutionary architecture when you were talking at the very beginning, and before I jump into the question I kind of had in my head for you, I thought we might circle back. What is evolutionary architecture? That may not be a term that everybody's familiar with.
Bryan Oliver: In the context of the book and in building platforms, if you could imagine, take an example of self-service APIs. This is something we talked about in the book quite a bit. If you start building APIs for deploying, say, infrastructure or applications, and you start with just writing Terraform, it's kind of like you are designing a mouse and you decided to write the firmware first, instead of first designing the interface, how it feels, how it looks, like all those things that are the touch point for the customer. It's like you wouldn't start with the firmware. So, it's kind of like you're writing the firmware for the cloud before you've written the interface.
So, what you want to do first is instead say, let's define the interface, which is the point between the platform team and the developer team and how they're going to interact with each other. That interface is an API. So, that can be implemented in hundreds of different ways. Terraform, Crossplane, Pulumi, it doesn't really matter. It's a detail. What matters is that you have this interface in front of everything, that when you start to change those underlying bits, your platform can evolve and get better over time, but you don't have to change how your customers are interacting with that platform. So, it kind of allows the platform team to grow without necessarily creating a ton of, like, friction when it needs to grow.
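To make that concrete, here is a minimal sketch, not from the book, of what "interface first, implementation as a detail" can look like for a self-service platform API. The names, such as `DatabaseRequest` and `DatabaseProvisioner`, are hypothetical; the point is that developer teams only ever see the interface, while the platform team is free to swap a Terraform-backed implementation for a Crossplane- or Pulumi-backed one behind it.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class DatabaseRequest:
    """What a developer team asks for: the platform's public interface."""
    name: str
    engine: str          # e.g. "postgres"
    size_gb: int
    environment: str     # e.g. "dev", "prod"


class DatabaseProvisioner(ABC):
    """The contract between developer teams and the platform team."""

    @abstractmethod
    def provision(self, request: DatabaseRequest) -> str:
        """Provision a database and return a connection identifier."""


class TerraformProvisioner(DatabaseProvisioner):
    def provision(self, request: DatabaseRequest) -> str:
        # In a real platform this would render and apply a Terraform module.
        return f"terraform://{request.environment}/{request.name}"


class CrossplaneProvisioner(DatabaseProvisioner):
    def provision(self, request: DatabaseRequest) -> str:
        # In a real platform this would create a Crossplane claim resource.
        return f"crossplane://{request.environment}/{request.name}"


def handle_self_service_call(provisioner: DatabaseProvisioner) -> str:
    # Developer teams code against the interface; the backend is swappable.
    req = DatabaseRequest(name="orders-db", engine="postgres", size_gb=20, environment="dev")
    return provisioner.provision(req)


if __name__ == "__main__":
    print(handle_self_service_call(TerraformProvisioner()))
    print(handle_self_service_call(CrossplaneProvisioner()))
```

Swapping `TerraformProvisioner` for `CrossplaneProvisioner` changes nothing for the consuming team, which is the evolutionary property being described here.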
Wes Reisz: Makes full sense. So, it's like keeping that point where you can introduce change later on so you're not coupled to a decision early on. You can continue to evolve.
Nic Cheneweth: There's a good distinguisher in there, I think, for people who are going to ask, you know, how is this an evolution of DevOps? If you think about DevOps inside most organizations, one thing that teams doing that kind of work are often thinking about is, "How do I build something that's reusable across all these different use cases? Someone's going to come in and want me to provision a database somehow." And even if they're writing code to do that internally, they're supporting a myriad of different use cases. Whereas we're saying, contrasting that with building a product, no, you optimize this end to end for this product and nothing else. So, you don't build to suit every possible use case of the database inside your company. You're not the database team or the IT team in general. You're building a product. And so you make very different choices if you're trying to optimize for one thing than if you're trying to sort of generically optimize for the whole company.
Wes Reisz: The kind of ports and adapters being baked into how we're designing platforms. Makes sense. Bryan, one of the questions that I wanted to ask you is about what Nic just brought up, this transition, this evolution from DevOps into platform engineering. What does that look like in real terms? What are some of the challenges that are there? And how does the book talk about addressing those challenges in that evolution?
Bryan Oliver: Yes, so there are quite a few. I think it might have been Nic that said this, but somebody here once said the pull request is the same thing as a ServiceNow ticket. And we've seen that as a challenge when organizations try to adopt a self-service mentality, infrastructure as code, GitOps. They think that they're adopting platform engineering when really they're just sort of increasing their DevOps capabilities and not really moving into a product mindset. So, if you just say we're going to do GitOps, we're going to do infrastructure as code, we're going to do pull requests, you're really not moving the needle. But you do move it when you start to say, "Okay, instead of all those details, we're going to build a product that is an engineering platform and treat it as a product." And that sort of separates you from the traditional thinking that you see in DevOps.
Instead of handing teams a really rigid Jenkinsfile, like Nic was saying, where he was trying to address just a myriad of things, we instead say, "We're going to provide you with a target that you can use to then employ whatever process you want, whether that's owning your own CI/CD, and you're going to deploy it to our platform. We're going to provide APIs and services that help you get there, but we're not going to tell you how to do it. We're maintaining a product much like a Heroku or a Google Cloud." In other words, we're providing you an endpoint that you can consume and use, but we're not going to tell you how to get there.
Sean Alvarez: I think that one of the things we often talk about with platforms is you're not trying to target 100% of everything the company is doing in engineering. Eighty percent is usually a good number, right? Because the more you try to snowflake into the system, so to speak, the more complex it becomes, the more unwieldy it becomes, and the more you lose track of that user experience. Now, of course, the side effect of that, because you are developing a product, is to say, if you are wildly outside the bounds of the workflow of every other team in the organization, maybe this isn't the right place for you, or the alternative to say, maybe you should be doing it in a more streamlined way so that your team can now become more efficient by adopting this platform.
Wes Reisz: Thanks. Ajay, I've heard you talk about Gall's law in this context. I'm going to misquote it, but a complex system that works is made up of small working parts, and if you try to build an entire complex system from scratch, you're bound to fail. Start with that thin slice that we were talking about. Make it work. Make it solve that problem before you move on to the larger complex system.
Ajay Chankramath: Absolutely.
Nic Cheneweth: Well, that 80-20, too. I was going to say 70-30, 80-20; those are the numbers that we find. But it's not that you pick a team and about 80% of what that team needs is in the engineering platform. It's not really that number. It's more like, if you look at large-scale organizations that maybe have a couple thousand developers divided up amongst a variety of internal product teams, about 80% of those teams need common, ubiquitous things that an engineering platform can provide very effectively. And then you've got 20% or 30% that can use it when they need it, but they have these other things that they're doing that are a little bit different and outside, and so their path will look different. So, it's less about trying to hit most users' needs; it's about hitting what most users look like, most personas, and then not trying to solve for the ones where you can't solve it that way. For those, you have to go back to more of a DevOps relationship and less of a product experience. Not every team needs that; it's like 80% don't need any of that and then another 20% do.
Bryan Oliver: Also, to add to what Nic's saying, it's not just about trying to meet 80% of those needs. It's, don't force 100% to use the things that you're creating. Because, going back to the Jenkinsfile thing, if you force the entire organization to follow this process, you're going to have 20% of them screaming at you like, "I need this, this, this, this, this," and then you're drowning in a sea of tickets, right? So, it's like, no, instead just meet the 80% and the other 20% can figure out what they need to do, because you're not telling them what they have to do. That's the big difference.
Wes Reisz: I love the way Team Topologies in particular puts the stream-aligned teams, the feature delivery teams, at the heart of platform engineering. They're the customer that's driving it. So, if you keep that customer in mind and solve those needs, then it keeps the focus in the right place. And I think it does exactly what you're describing, Bryan.
Domain-Driven Design in Platform Engineering: A Scalable Approach
Wes Reisz: Sean, early on in the book, you talk about platform and product domains, where you apply domain-driven design to some of the practices around engineering platforms. I think that's a novel thing; I haven't seen DDD in particular applied to platform engineering in some of the other books. Could you describe that a bit and why you felt it was important to include in the book?
Sean Alvarez: Absolutely. So, on one hand, we're developing a product, right? It's designed to be long-lived. There's not really an end date to this. We're going to continue to evolve it over time, evolutionary architecture. We're also going to be scaling it out over time. Ideally, we're going to get more and more users onto this, more and more of the organization. It's also going to become, ideally, a critical piece of infrastructure for the organization. So, we're going to have to have SLOs on it, make sure that it has uptime, make sure that it has availability. And one of the big differentiators that I've seen between DevOps and platform engineering is DevOps practitioners and ops practitioners really think of their job as, "We're writing scripts to deploy infrastructure." But here we're actually writing software, right? This is a product no different than any other product you would deploy internally.
So, by using DDD and having those bounded contexts, now we can carve out the pieces of our application that are separate from each other, which will enable us to increase our availability by not having to deploy a monolithic platform every time we make an update to it. We can have multiple repositories, multiple deployment pipelines. And as we scale, because we've looked at our domains and separated them effectively, if I suddenly have a huge influx of requests for, I don't know, networking changes or some new computing enhancement, quantum computing maybe, where it results in having to deploy new node types into the base underlying infrastructure or the service mesh, I can spin off teams very easily with these code bases that are targeting that domain and have them continue to operate independently and autonomously from what the rest of the platform is doing, enabling that scale-out over time.
Wes Reisz: I like in particular, as we were talking about before, that the experience you all had engaging with clients means you found those boundaries through time. A lot of times with DDD you discover the boundaries, but they may change as you put them to the test. So, these are some really good, sensible defaults that you can start with and then evolve as you scale. I thought that was a really cool thing that you developed.
Sean Alvarez: It's a good starting point. It's certainly going to evolve over time and it may be different for every organization and tech stack in some little way, right?
Recommended talk: Why Is My App SLOw? Defining Reliability in Platform Engineering • Jez Humble • YOW! 2023
Platform Engineering: Bridging DevOps, SRE, and Developer Experience
Wes Reisz: Ajay, there are a bunch of different roles that are typically kind of adjacent to platform engineering. I'm talking about things like traditional DevOps, maybe SREs, developer experience. How do those roles fit together with platform engineering? How do you talk about them in the book?
Ajay Chankramath: That's an excellent question, Wes, because we go into a lot of detail around some of those things within the book, and we already covered a couple of them here, right? So, maybe what I should do is take a step back and define, from our point of view, what DevOps is, what SRE is, what DevEx is, and how they start fitting together. From our point of view, this is something that the industry has come to terms with over the years. We used to call people a DevOps engineer and a DevOps team, and there are still a lot of organizations who do that. But that transition is happening where people are understanding that DevOps is nothing more than a cultural paradigm that generally improves collaboration and communication within all aspects of the SDLC, right? I think that understanding is supremely critical as we really try to contrast these things. The second thing is when you really talk about SRE, you're always talking about how you apply those software engineering principles to operations.
Not everybody gets SRE right, but again, that's the traditional definition. That's what everybody aspires to do. And for what purpose? Obviously, to increase that reliability, to make sure that you have highly reliable production systems. Now, in recent years, a lot of us have been talking about developer experience, and so have a lot of the companies and clients that we all talk to. What exactly is that? It's essentially the experience of the developers interacting with the tools, frameworks, processes, and everything else through that SDLC process. So, as you look at those three definitions, from our point of view, platform engineering is something that enables DevOps, supports SRE, including the observability, and is required by DevEx. As you can see, I threw all three of those into the definition of platform engineering, right?
But the cherry on top is the fact that this is being done as a product, essentially using those three different things. I think that is how we really look at platform engineering. So, from our point of view, platform engineering is that centerpiece that enables DevOps, is required by DevEx, and supports SRE, but with that core focus of having self-serve capabilities for build, test, deploy, the whole thing, right? A couple of other things I want to bring up here. One is the audience. When you do the DevOps activities, there is this tacit expectation that DevOps people are sitting around on the side, getting requests from the developers to execute things that the developers could potentially do themselves.
But for some reason, over the years, somebody came to the conclusion that that's not the best use of developers' time, so you need a completely different set of people with a completely different set of expertise to actually go in and build all of those things, right? So, that audience is a huge thing. From the platform engineering point of view, that audience is really critical: the developers who actually build those products. And there are some unique responsibilities along those lines. Within the traditional DevOps definition, we don't always talk about abstracting complexity; within platform engineering, you always talk about abstracting complexity. Then there's self-service, which we just spoke about. Whether we ever intended it to be like that or whether it just turned out that way, I don't know, but DevOps teams and DevOps engineers almost always were sort of anti-self-service, right?
It's like, "Let me do it for you," or, "I'm supposed to do it for somebody else." Now, this one sort of enables that self-service. Then the third part of it is where we are seeing a lot of, and this is also something that we are touching in the book is this whole idea of dev portals, being able to have some kind of a conduit into the engineering platforms, to be able to have a unified interface to talk to your engineering platforms and through that, to the whole STLC process. So, those three things to me are sort of that unique aspect of platform engineering that starts differentiating beyond these definitions.
Measuring Success in Platform Engineering: Metrics, Boundaries, and Waste Reduction
Wes Reisz: Right. Makes total sense. How do you measure the platform? How do you know that you're having success? Earlier, we talked about business value, we talked about thin slices, we talked about individual pieces and then measuring it. So, of all the things you just described, Ajay, Nic, one of you two, I think you both are well equipped to answer this, but how do you measure whether your platform is doing what you think it should be doing? How do you know if you're being successful?
Ajay Chankramath: I'm happy to start off. I'd love to hear from my co-authors too. So, this is something that we take really seriously, going back to the original purpose of why we wrote this book, right? A lot of us have a background in building platforms, but the place where we really struggle is that measurement part, what you're just asking about. Ultimately, somebody has to pay for it. We can always talk in abstract, subjective terms, saying that if you were to invest in building an engineering platform and your teams were to use it, you're going to be better off. Sure, that is great. But how do you know how much better? Is my investment going to pay off the right way? So, this is the context in which I want to give a shout-out to work like the DORA work, the Accelerate book, and a lot of the work that Gene Kim and team have done.
Now that a lot of people are working in this space, the idea of having some clear articulation of how you actually measure the value of every capability you're building is something that we go into in a lot of detail within this book. We are actually introducing certain models that say, "Hey, here is a simple model," using which you could say that if I'm investing X amount of dollars, I'm actually going to make 10X. If I'm not going to make the 10X out of that, then I should not be investing the X. So, this goes back to the point that Nic and Sean were talking about earlier with that 80-20 rule kind of thing. I think, Bryan, you were talking about that too. Sometimes it's not even 80-20; like Nic said, maybe it's 70-30, maybe it's 60-40. So, you just need to build 60% of those capabilities to find that value. And you have to measure that both objectively and subjectively.
What we are presenting is more of an objective way of measuring that and ensuring that you don't build things that you don't need. And at this point, I probably want to throw in one last point, which is who does that, and whose responsibility is it? I would say as a platform practitioner, it is your responsibility, but there is a specific role that we are recommending in the book, which is the technical product manager. That person is the one who brings it all together; to me, they're the cat-herder who brings everybody together. But a technical product manager alone can't do this job, right? If the platform engineers or platform practitioners, whether you are acting as a producer or a consumer of these capabilities, don't have that appreciation, things won't work. But that's how we would recommend measuring the value.
Wes Reisz: Let me ask a follow-up on that. So, you mentioned DORA Metrics as a way of measuring the success of a feature that you may be delivering, a capability that you're working on. What if you don't have DORA Metrics? Does that mean that platform engineering is not right for you, or does that mean maybe getting towards something like DORA Metrics is kind of your first step?
Ajay Chankramath: So, there are two aspects to that question, right? One is, should it be DORA metrics itself, or should it be something else? DORA metrics, as we know, are mostly lagging indicators of a lot of things that are happening. Some of the things that you might want to look at as you build your engineering platform are leading metrics, because you may not want to wait until after the fact to see some of the numbers that you would get from the DORA four key metrics. With the SPACE framework and things like that, every digital-native organization is starting to think about some of those challenges. We have some very specific recommendations on what some of the leading metrics are that you should be looking at. I think it is in chapter three where we go into a lot of detail about how you actually set up the leading metrics, what kind of things you should be looking for as you measure the value. And there is no one-size-fits-all, which is why we are providing some guidance on how to do it, and we have some exercises that actually walk through those things. But it's up to every practitioner, every organization, to figure out what's going to work within this framework of what they want to achieve using an engineering platform.
Wes Reisz: Makes sense.
Nic Cheneweth: There are other things we bring in there too, which are critical, and it's why we talk about us all being software engineers now. With a lot of that domain-driven design and the product teams we talk about, inherent within that is this notion of what the boundary between two teams looks like. What does it look like when I'm on this team and I work with stuff that that team creates? Let's talk about that contract boundary. And it's that, no, they have an API. Maybe they're the Google Maps team; I can take as much Google Maps as I need whenever I need it, or as little as I need. That's a very low friction, seamless relationship between them. And below our actual internal product teams, as you go down through the tech stack, we're all part of that same contract. That's the idea. I need all those same things to ship this payment service, you know, if that's what my product teams are building. And I want the same experience from all of those teams going downstream.
And so while you could say, "Well, how hard is it to measure, you know, the quality and the effectiveness of these things?", you can focus on that contract boundary: if it's seamless and easy and quick, you're getting what you need when you need it. But also, you can focus on waste. That's another thing that we talk a lot about, that it's much easier to measure and identify waste and remove it than it is to say, "Hey, how many commits should a person make a week?" or some of these other proxy indicators that we say are good signals. We're saying, well, the time you spend waiting around for anything is waste. If I want to use a Git product in the cloud and I have to call them or open tickets with my provider to do anything, that's just wasted time. I should have all the Git I need, whenever I need it. And we're saying that the engineering platform should be the same way all the way down. So, we're going to measure it like that. And that leaves teams free, hopefully, to be able to say, "We're shipping as fast as we're capable of shipping. And if that's not fast enough, let's talk about our internal team structure," or maybe it's that we don't know exactly what we're trying to build, or something at that level. And it's not engineering friction downstream that's creating, you know, those long lead times.
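As a rough illustration of the measurement ideas above, here is a minimal sketch, not from the book, using made-up timestamps and a simplified event shape. It computes a DORA-style lead time for changes alongside the "waiting waste" Nic describes, the time a change spends blocked on tickets or approvals.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical pipeline events for a couple of changes: when the commit landed,
# when each wait started and ended (ticket queues, approvals), and when it hit prod.
changes = [
    {
        "committed": datetime(2024, 5, 1, 9, 0),
        "deployed": datetime(2024, 5, 2, 15, 0),
        "waits": [(datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 2, 9, 0))],  # ticket queue
    },
    {
        "committed": datetime(2024, 5, 3, 11, 0),
        "deployed": datetime(2024, 5, 3, 16, 0),
        "waits": [],
    },
]


def lead_time(change) -> timedelta:
    """DORA-style lead time for changes: commit to running in production."""
    return change["deployed"] - change["committed"]


def waiting_waste(change) -> timedelta:
    """Time the change spent blocked on someone else: the waste Nic describes."""
    return sum((end - start for start, end in change["waits"]), timedelta())


if __name__ == "__main__":
    lead_times = [lead_time(c) for c in changes]
    wastes = [waiting_waste(c) for c in changes]
    print("median lead time:", median(lead_times))
    print("total waiting waste:", sum(wastes, timedelta()))
```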
Recommended talk: Modern Platform Engineering: 9 Secrets of Generative Teams • Liz Fong-Jones • GOTO 2023
Balancing Autonomy and Compliance in Platform Engineering
Wes Reisz: A lot of times when we talk about platform engineering, we talk in planes: delivery planes, developer planes, security, observability, compliance and governance. I want to dive into one particular area, Bryan, particularly around governance and compliance. One of the friction points that we often see is this tension between the autonomy of a developer to deploy the things that they need and the automation that we're trying to do with the platform. How do you deal with that?
Bryan Oliver: Thanks for asking. So, from what we've seen, developer autonomy has kind of become a distracted space where people tend to focus on all these "something as code" terms, or combine some word with "ops": GitOps, AIOps, infrastructure as code. Those are all implementation details of the developer experience, and they tend to just end up being things like, the developer makes a pull request or they make some config file change and everything is just magic. And that's not really the focus of developer experience from our perspective. Instead, what we're saying is you should begin to build a culture of trust. You're building an engineering platform that developers can use autonomously without actually telling them how to do it. What I mean by that is, if you start to put a bunch of guardrails in their pipelines, like in the traditional DevOps sense that Ajay was talking about, you might have a security team that owns the scanning step of your pipeline and they just own it outright. In a traditional DevOps world, that's quite normal.
Then you have an audit team that owns the package repository where you get all your dependencies from. And there's all these stopping points in the workflow of just trying to build your application. And we've seen from experience that that just hinders the entire development process for months. Instead, what we do is say, "Okay, we're going to trust that developers know how to build their application and they know how to optimize and tune their own workflow for their application. Every application team is different." An AI or ML team has a very different build process from a web API. So, we're not going to tell them all how to build their stuff because it varies greatly how they should be doing it. Instead, what we're going to do is provide a gate at the deployment API, meaning the point at which you actually send the application to the platform. That's going to verify those things that we're trusting them to do. So, like, we trust that they are using secure packages. We trust that they are writing good code. We trust that they're writing tests. Those are all verified once they actually try to go to production with their application and hit that deployment API.
Now, there are pros and cons. One of the cons is that some teams will just completely defer doing any of that work until the very end. So, the way that you handle that is, instead of just having this deferred situation, you start them off from the beginning with things like starter kits that are deployable into production from day one. Meaning, you should be able to fork that starter kit and deploy it all the way to prod and be passing all of those rules from the get-go. Then, when you start checking in code, if you're following the rule that every change we make goes to prod, you're going to immediately see, "I'm failing those deployment gates," right at the beginning. That's kind of how we think about it: instead of all these different concepts, start to think about building a culture of trust. But it's also kind of a trust-but-verify thing. We trust our teams, but we verify the artifacts that are actually hitting our platform.
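A minimal sketch of that trust-but-verify gate idea might look like the following. This is not the book's implementation; the `Artifact` fields and the required checks are hypothetical. The gate never prescribes how a team builds; it only checks that the required evidence is attached to whatever reaches the deployment API.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """Evidence attached to a build that the platform can verify at deploy time."""
    image: str
    signed: bool = False
    scan_passed: bool = False
    tests_passed: bool = False
    labels: dict = field(default_factory=dict)


# The gate does not care how the team built the artifact, only that the
# required evidence is present when it reaches the deployment API.
REQUIRED_CHECKS = {
    "signed": "image must be signed",
    "scan_passed": "dependency/image scan must pass",
    "tests_passed": "test evidence must be attached",
}


def deployment_gate(artifact: Artifact) -> list[str]:
    """Return a list of violations; an empty list means the deploy is admitted."""
    return [msg for attr, msg in REQUIRED_CHECKS.items() if not getattr(artifact, attr)]


if __name__ == "__main__":
    candidate = Artifact(image="registry.example.com/orders:1.4.2", signed=True, scan_passed=True)
    violations = deployment_gate(candidate)
    if violations:
        print("deployment rejected:", violations)
    else:
        print("deployment admitted")
```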
Wes Reisz: So, intuitively, I think we can figure out what a starter kit is, but let's talk more about that. What is a starter kit? What is an accelerator in the context that you're just talking about? How do you ease that path to production using them?
Bryan Oliver: It's very different from what you see in the traditional DevOps world, where you just have a repo with a Jenkinsfile and maybe a little bit of code. We see a starter kit as a deployable application that actually uses a database, has a small API, and maybe uses some of that self-service infrastructure, like creating a database or creating an S3 bucket or whatever. You should be able to take that starter kit and actually go all the way into the production environment and use it as a publicly available endpoint. So, it should be a secure, functional, production-ready application. There's not really a lot of value in building toy applications or small shared libraries that don't immediately show the value of what they're doing. At some clients, we see it take months to get a brand new application spun up into prod. By enabling this starter kit pattern, that goes down to a single day. A brand new team spins up, and they're in prod that day. It might just be a skeleton application with a simple API, even a hello world API, but that's a starting point that they can immediately build on and push to prod day by day, maybe using feature flags to protect some of it until they're ready for it to be exposed to real customers.
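As a toy illustration of the skeleton a starter kit might ship on day one, here is a minimal hello-world API with a health endpoint, using only the Python standard library. This is an assumption-laden sketch, not the book's starter kit; a real one would add the pipeline config, manifests, and self-service infrastructure described above.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import json


class StarterKitHandler(BaseHTTPRequestHandler):
    """The 'hello world' API a starter kit might ship on day one."""

    def do_GET(self):
        if self.path == "/healthz":
            body = {"status": "ok"}          # readiness for the platform's probes
        elif self.path == "/":
            body = {"message": "hello from the starter kit"}
        else:
            self.send_response(404)
            self.end_headers()
            return
        payload = json.dumps(body).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    # In a real starter kit this would sit alongside a Dockerfile, pipeline
    # config, and the self-service manifests needed to reach production.
    HTTPServer(("0.0.0.0", 8080), StarterKitHandler).serve_forever()
```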
Wes Reisz: Then it can encapsulate some of those compliance-at-the-point-of-change things you were talking about. So, for example, if you're using OPA and Gatekeeper to prevent running as root, it can demonstrate with the accelerator how to correctly create a container that's not running as root so it can operate in your environment. You've set the culture of how it should be done.
Bryan Oliver: Exactly. And then you also need to surface things like conformance testing that further lets developer teams pull down those requirements from your platform and run them against their application.
Wes Reisz: What have you seen on things like being in permissive mode early on, to gently warn rather than just block deployment to production? Have you seen those types of tools being used?
Bryan Oliver: For sure. Nic or Sean, you've seen this a ton, but we will often use things like policy as code or a policy engine, and most of the technologies in that space support an audit mode. You also see this in the Kubernetes world: when an API is deprecated, it's just a warning for maybe three or four versions before you actually need to move on to the next one. So, it's kind of the same thing with our development teams. We'll build things like a deprecation policy or, in this case, an audit-mode policy, where every time they hit the deployment gate, we're like, "Hey, you're not passing this regulation. It's not enforced yet. You have X amount of time." We go beyond just saying that in a warning though, because they may never see it if it's just in a pipeline log. It will also be part of a continuous scanning process where we can send them automated Slack messages. You need to be more proactive with things like that before you create immediate friction.
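Here is a minimal sketch of what audit mode versus enforce mode could look like in a policy evaluation. It is hypothetical and not tied to the book or to any particular policy engine's API; real platforms would typically lean on engines such as OPA/Gatekeeper or Kyverno, which offer similar warn and enforce behaviors. The audit-mode violation notifies the team rather than blocking the deploy.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    AUDIT = "audit"      # record and warn, don't block
    ENFORCE = "enforce"  # block the deployment


@dataclass
class Policy:
    name: str
    mode: Mode
    check: callable  # takes a deployment dict, returns True if compliant


def notify(team: str, message: str) -> None:
    # Placeholder for a Slack/webhook notification so warnings are not
    # buried in a pipeline log.
    print(f"[notify {team}] {message}")


def evaluate(deployment: dict, policies: list) -> bool:
    """Return True if the deployment may proceed."""
    admitted = True
    for policy in policies:
        if policy.check(deployment):
            continue
        msg = f"{deployment['app']} violates '{policy.name}'"
        if policy.mode is Mode.ENFORCE:
            print(f"BLOCKED: {msg}")
            admitted = False
        else:
            notify(deployment["team"], f"WARNING: {msg} (will be enforced soon)")
    return admitted


if __name__ == "__main__":
    policies = [
        Policy("no-root-containers", Mode.ENFORCE, lambda d: not d.get("runs_as_root", False)),
        Policy("new-sbom-rule", Mode.AUDIT, lambda d: d.get("sbom_attached", False)),
    ]
    deployment = {"app": "orders", "team": "orders-team", "runs_as_root": False, "sbom_attached": False}
    print("admitted:", evaluate(deployment, policies))
```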
Sean Alvarez: Something to keep in mind there, Wes: we want to get away from that idea of being enforcers, right? We don't want to be the people the development teams are afraid of, the ones who come down and tell them they're doing everything wrong. We want to think of ourselves as enablers who get them into production safely with governance, compliance, and security, and make their jobs easier in the process by saying, "Hey, you're not following this guideline. In the future, there might be an issue, so take the time to fix it now."
Wes Reisz: It creates that collaborative culture.
Nic Cheneweth: And providing the right tools too, because like we say, those starter kits include whatever controls, if you will, are going to be there at the point of deployment asking, "Have you done this work?" All that stuff should be pushed left. We're leaving it up to teams to do the work of becoming compliant, and then we do the act of testing whether or not you're compliant at the point of deployment. But if you're security and you own some particular security scan, or the results that are going to be looked for when you deploy, you have to have a reusable bit of pipeline code that's already in that starter kit that gives teams their feedback. When you're back in your local environment, you know, "Hey, I'm going to pass or fail that thing that I'm going to encounter when I get to my actual data center environment." So, the feedback is early, but you're free to use that and plug it into your pipeline wherever you like. Or if you remove it and don't pass, you know you're just going to fail when you get to deployment. It's keeping the two things separate: I do this work to become compliant versus I'm being scanned to see if I did it.
Wes Reisz: Yes, it's such a shift, because you put the developer at the heart as the customer, and the platform isn't blocking them, it's enabling them to meet the compliance requirements. It becomes a partnership, not a friction point. Makes total sense.
Bryan Oliver: It's like the platform becomes a target instead of a process.
Recommended talk: Platform Engineering on Kubernetes • Mauricio Salatino & Thomas Vitale • GOTO 2023
Scaling Engineering Platforms: Evolving from Startups to Enterprises
Wes Reisz: Sean, let's talk a little bit about scaling organizations and engineering tech together. How does the book deal with evolving engineering platforms at different scales of a company, from a small startup all the way up to an enterprise? How do you all think about that?
Sean Alvarez: Yes, it's a question we get a lot. Think back to when microservices were first coming out and there was that line, I think it was Martin Fowler, right? "You must be this tall to use microservices." The question always comes in: at what point of maturity does platform engineering make sense? When should we start to think about it? And the way we look at it is you want to think about it as early in the process as you can. For a small digital startup, maybe full platform engineering with an engineering platform doesn't make sense. You've only got a few teams. You've got people that are very autonomous already, so the self-service aspect doesn't really apply. But if they're starting to think in terms of platform engineering principles, right? Automate everything, automate the checks, automate the infrastructure deployments.
Now, you're going to help avoid the problems that we often see in many of our engagements of, you know, we're X number of years down the line. We have so much tech debt built up. We can't deploy more than once every six months because these things were just never thought about before. So, by instilling the principles, you're still going to be able to benefit down the line. Now, for an established enterprise, once you start to scale up, that's when you need to start thinking about, "Am I prepared to offer a product? Do I have a product team that can support this?" We talk about technical product managers in the book that Ajay referenced and those key roles that can drive these things. And then like we talked about earlier, you know, identifying the right thin slices, identifying the right value streams. Every organization is going to have some initiatives that maybe aren't going to return a whole lot of value over time. Maybe they're just in maintenance mode right now. It doesn't make sense to bring them onto a platform. And then we've got others that are, you know, going through a lot of churn.
They're spinning up new services rapidly. And those are the ones we want to make sure we get onto the platform. So, they're going to have to think about that adoption roadmap. What initiatives do I want to bring on this? Who do I want to bring onto it? And then again, that scale of the domains within the platform itself. Can I do, you know, a single platform to start with, with a single platform team? It's usually the best way to go about it because you can show the value immediately. But you want to start to think about how many parts of the organization are going to use this. Am I going to need different platforms for different parts of the organization? Because maybe they have different focus areas between different divisions or, you know, got one using one cloud and one using another cloud, and they're using it in very different ways. Might not make sense for a single platform in that case. So, you still really have to start thinking in terms of scale and how to break up and kind of make that composable platform. How many of the building blocks do you need to put it together?
Recommended talk: Platform as a Product • William Rizzo • GOTO 2024
The Role of Generative AI in Shaping Platform Engineering
Wes Reisz: Thanks. Ajay, I think I would be kicked off the tech island if I didn't bring up gen AI at some point in the conversation. The book talks a bit about gen AI. Can you talk a bit about gen AI and AI in general, and how you see it affecting platform engineering?
Ajay Chankramath: Absolutely. Before we even talk about it, right, amongst the four of us, we had a serious conversation about whether we should include gen AI in the book or not. We had those really hard discussions early on, with the Manning team as well as amongst ourselves, to see whether it was relevant for us to bring it in. As we started writing this, it became fairly obvious to us that, yes, it was a good idea to eventually write it. And what we are seeing is that even from when we started writing this, several months back, to now, we can see those kinds of impacts and differences within our own client base, right? So, the way I look at gen AI, and the way we are addressing this in the book, is, first of all, from a developer's point of view. What do developers get if we were to apply gen AI in the platform engineering space? Some things are very obvious, right?
Things like your code completion and code recommendations and those kinds of things are the most popular ones, the Copilots of the world, right? Everybody does that, so there isn't much additional value that we could add there from our point of view. But as we started looking at it from the point of view of self-service platforms, there are some unique areas that we could really get into. Things like your chatbots or your AI assistants, or now agentic AI, or even an agentic mesh. There are lots of very specific areas that we could get into that will have a direct impact on those self-service platforms. The other thing, which is an interesting take on it: as you look at any kind of domain-based product, you always have some kind of a CX component, a customer experience component, to it. These days you start really talking about some kind of hyper-personalization. But what if you are able to create personalized workflows for your developers? Say you have a team of about 50 developers doing things.
And as you bring in some of these platform capabilities, we are making an assumption at some level that there is still some level of commonality there, right? We talked about that 80-20 rule. But what if we can apply AI to create that developer hyper-personalization? I think that will make a huge difference. So, that's on the developer experience side of things. The other one would be more on the automation and efficiency side of things. I think in chapter 11 we go into some very specific cases of how you actually do predictive maintenance and optimization. This is an area where we all know AI has been chomping at the bit for a while: AI models that can really predict infrastructure and application issues before they occur, which means there is a direct correlation to that observability piece we've been talking about. A specific example you can think of is a Kubernetes cluster having some kind of scaling issue, and how do you actually predict that and, you know, lean on more hybrid cloud environment kinds of things.
So, that's one thing. The other one would be more on the continuous optimization side of things; you have ways of applying AI to do continuous optimization. Bryan already talked about some of the governance aspects, and automated governance is an area where we see significant benefits. So, that's all on the application and infrastructure side of things. Now, if you talk about it from a productivity side of things, we could even look at bringing some of the AI tools into the platform teams themselves to design and evolve engineering platforms, right? Because we have always been talking about this co-creation of doing things. If you are doing it the right way, your engineering platforms will have contributions, at least PRs, from the developers who are actually consuming these capabilities, so that the capabilities are a lot more democratized in the process.
So, within that space, we see a lot of applications of it. At the end of the day, the way to look at it is that it enhances that collaboration. And I would be remiss if I didn't bring up this whole aspect of how RAG, retrieval-augmented generation, helps us, right? We go into this again in a lot of detail in chapter 11. One thing that we are seeing more and more is that there is this contextual part to every organization. When you look at all the models that are out there and all the LLMs that you could use, there is some sort of generalization happening. It's got lots of data going into it, but the things that are very specific to a particular organization, its typical ontology, specific activities, documentation, or anything else you have in that organization, are probably missing.
So, the way we see this is that more and more organizations will start developing and applying their own RAG pipelines to leverage these models and create platform interfaces that are contextually aware of that particular organization. This makes it so much easier, because when you are really querying the ecosystem and using those capabilities, you could even create a path forward with that specific context that is coming out of the RAG. This will lead to more custom AI training, and we'll have more specific AI models coming into that. But using RAG is relatively far easier if you have a bunch of data, and most of these organizations have it. So, as you can see, there are lots of things we could talk about for hours just on this topic, but we are happy that we decided to include it, because this will definitely show a path for how we move forward.
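As a loose illustration of the RAG idea, here is a minimal sketch, not from the book, that retrieves organization-specific platform documentation and assembles it into a prompt. It uses crude word-overlap scoring purely for illustration; a real pipeline would use an embedding store and an actual LLM call.

```python
# Hypothetical snippets of an organization's own platform documentation.
docs = [
    "To request a postgres database, POST a DatabaseRequest to the platform API.",
    "Production deployments must pass the no-root-containers policy.",
    "Starter kits live in the platform-starter-kits group and deploy to prod on day one.",
]


def retrieve(question: str, corpus: list, k: int = 2) -> list:
    """Crude retrieval: rank documents by word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(corpus, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:k]


def build_prompt(question: str, corpus: list) -> str:
    """Assemble the organization-specific context that would be sent to an LLM."""
    context = "\n".join(f"- {snippet}" for snippet in retrieve(question, corpus))
    return f"Answer using only this platform documentation:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("How do I get a postgres database from the platform?", docs))
```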
Key Takeaways from 'Effective Platform Engineering'
Wes Reisz: That's good. Makes sense. Well, gentlemen, I think we are just about at time, so let's start to wrap. I want to go around the room and give each one of you an opportunity to speak to this. Ajay, I'll start with you since you were just talking. For somebody who reads through this book and gets all the way through it, what is your hope that they'll walk away with? What do you feel is the big takeaway that can really make an impact on my teams come Monday morning after I read the book?
Ajay Chankramath: We briefly touched upon these things already, but I'll sort of summarize this by saying this, right? For us, platform engineering is that strategic enabler. And so we would want the readers to understand that. Strategic enabler for what? For essentially balancing your operational excellence, your developer experience, your innovation. And as you try and do those three things, you are fostering a scalable self-service system that empowers teams. And once you empower the teams, you get that organizational agility for that long-term value. So, think of platform engineering as that strategic enabler that brings these things together.
Wes Reisz: Makes sense. Nic.
Nic Cheneweth: I hope that people will feel equipped, I think, to go in and run a bunch of experiments. Let's just say that they'll feel equipped to try some of these things in an organization, to know if it's working or not working. They'll have some measures to put against it. Then from there to really fundamentally make a difference in what's going on and believe that they can. I think that's a big part of it.
Wes Reisz: Yeah, absolutely. Bryan.
Bryan Oliver: I think I hope for two things. The first one is I hope this book helps people that read it cut through the BS of what's floating around in the industry around this concept, because there's a lot of different marketing and things around it that make it really hard. And the second is I hope that the readers of our book find platform engineering to be much more approachable and friendly, and that they actually understand at a concrete level what it's all about and how they can approach it.
Wes Reisz: Sean.
Sean Alvarez: I think the key to me is finding a practical way to think through how to increase your engineering effectiveness and excellence by producing something that people are going to want to use. And it's not going to be a friction point for the organization that you're constantly struggling to get adopted. Because if people don't use it, you'll never see the value of it. So, you need to make sure that it's something that people see as making their jobs easier by taking part in it.
Wes Reisz: Absolutely. Well, the book is called "Effective Platform Engineering." You can find it on the Manning Early Access Program at manning.com. I believe there are 9 of 11 chapters published; the last few chapters will be hitting on some of the scaling questions that we talked about and diving deeper into the people and architecture required for evolving a platform. Gentlemen, it's always a pleasure. I miss seeing you all, so I look forward to some less formal chats in the future. Thanks for joining the GOTO Book Club.
Sean Alvarez: Thank you, Wes. Appreciate the time.
About the speakers
Ajay Chankramath ( author )
Nic Cheneweth ( author )
Bryan Oliver ( author )
Sean Alvarez ( author )
Wes Reisz ( interviewer )