
Platform Engineering: From Theory to Practice

"Platform engineering isn't just about building tools - it's about creating sustainable systems that empower developers while fostering collaboration across disciplines." Watch the conversation between Liz Fong-Jones and Lesley Cordero.

About the experts

Liz Fong-Jones (expert)

Field CTO, Honeycomb.io

Lesley Cordero (interviewer)

Staff Software Engineer, Tech Lead at The New York Times

Intro

Liz Fong-Jones: Welcome to Chicago. This is the second time both of us have been here in about a year and a half. It's really cool to see how much of the financial sector is represented here - fintech startups across the Midwest, insurance companies, banks, and people doing startups and other ventures.

Lesley Cordero: I'm in New York, so I'm pretty used to finance. It's a familiar vibe. I would live here if it was warmer than New York, but here we are. The range of companies tracks with New York in terms of finance, but I'm not too familiar with the Chicago tech scene.

Liz Fong-Jones: I used to talk with folks from the Google Chicago office when I worked at Google, but I don't work there anymore. It's interesting that we now have a number of clients here.

We should introduce ourselves. I'm Liz and I work at Honeycomb.

Lesley Cordero: I'm Lesley and I work at the New York Times as a staff engineer.

Engineering the Platform

Liz Fong-Jones: One thing we share in common is that platform engineering is really near and dear to our hearts. Let's start with the perennial question: How do you think platform engineering and SRE relate to each other?

Lesley Cordero: I usually describe them as both applications of DevOps. In my talk, I argue that just as SRE emerged from DevOps, platform engineering emerged from there too. They definitely overlap in terms of technical practices and the socio-technical approach.

Liz Fong-Jones: I love what you said about supersets or subsets. In your talk you mentioned it's not just SRE stuff that's relevant - there's design system work and other things that help developers do what they want to do. Disciplines like reliability, security, and UX engineering are all trying to get developers to do things differently. We can either go at it alone with separate ambassadors from each discipline, or we can come together to make one platform with unified defaults and advice. I love that this is finally getting us to collaborate across these disciplines.

Lesley Cordero: That's my favorite part about platform engineering - I view it as a unifying domain. You can apply platformization across so many use cases. I find that unifying approach super appealing because tech could use more intentional unification.

Liz Fong-Jones: The reason we're seeing all these platform teams emerging is people have realized that creating the job of "DevOps engineer" doesn't work. Trying to adopt SRE the way Google originally espoused isn't working for them. People are grasping at this concept of ensuring developer productivity when engineers are increasingly scarce. How do we avoid just throwing more duplicative engineers at a problem and actually invest in empowering them to do more?

Similar to SRE, that's going to be doomed to fail unless people follow some degree of common practices and learn from each other. That's what was great about your talk - you laid out some of those areas ranging from principles to implementation.

Lesley Cordero: Thank you. SRE was my entry point to platform engineering, so it's always going to be near and dear to my heart. Recently I saw Alex Hidalgo post about how not everything in the SRE book was actually validated in the ways people assume. I thought that was super interesting because no one's willing to admit that sometimes there's theory in there, which is why I like conversations like this where we get to talk about the nitty gritty.

Liz Fong-Jones: Alex wound up deleting that conversation because it was getting quoted out of context for demanding retractions. It was actually practiced by one team at Google - that's like a lab experiment that works in mice. If you take that lab experiment and it doesn't work in humans, you shouldn't demand paper retractions. This is suggesting avenues for further research.

Lesley Cordero: One thing I love about SRE is it emphasizes experimentation so much. When I read the SRE book from Google, I didn't take it as a step-by-step manual. I think it's a great conversation starter, which is why I like talks. I hope that's what my talk did - start conversations.

Liz Fong-Jones: I think it worked because you presented examples of how you applied it while sharing overarching principles you think hold true. You encouraged people to try it out and see what happens. But one challenge is that this requires psychological safety and people being willing to take risks and fail. Without that, you're not going to succeed as well.

Lesley Cordero: Definitely. I have a whole separate talk on psychological safety that I actually like more, but because it's softer, I don't think it's as popular.

Liz Fong-Jones: At the end of the day, everything we do is made up of people in the system. Whether you use the word socio-technical, or what my colleague Jessi calls "symmathesy" - a learning system made of learning parts - our systems are dynamic and we need to treat people as part of the system design.

Lesley Cordero: I totally agree. You were actually my entry point to socio-technical thinking, and you're part of why I started doing talks.

Liz Fong-Jones: That's really exciting because you give really great talks.

Lesley Cordero: Thank you. That's actually what happens when you have women of color to look up to - you have proof that "I can do it because that other person is surviving."

I thought it was interesting what you said about the presumption that I must be a front-end engineer. It turns out front-end engineering is really complicated and difficult. I look at our front-end engineers in awe - you don't have hard requirements about how the system runs, it's subjective from the user's perspective whether it worked or not. There's no acceptance test for that.

Liz Fong-Jones: I always joke that it's too complicated for me. I think backend is more straightforward, though I know people will argue the opposite. I definitely have appreciation for the craftsmanship that goes into it.

There's this joke from when I was in college at a science-focused school about hierarchy - mathematicians were the purists, physicists were doing applied mathematics, chemists were doing applied physics, and it goes all the way down to biology and psychology. But instead of arguing about whose discipline is pure, we could work together and collaborate to achieve results.

Lesley Cordero: I've seen that in every aspect of my career. In my school of engineering, computer science was seen as the "soft" one, which is funny to say out loud. Even within CS, there were people on the systems track versus the application track, which was more front-end focused.

Liz Fong-Jones: I don't know how much this has changed over the past ten years, but when I was in college, there was a lack of teaching about running production systems. You'd get assigned coursework, turn in assignments after a week or work on a project for a month, then drop it and never think about it again. That's not how modern software development works. We need source control, comments, and tests because someone else has to pick it up after you.

Lesley Cordero: I agree. I wish there wasn't as much of a gap. School was good at teaching me how to write new code, but not how to contribute to existing codebases. I've worked at non-startups for my entire career, so I was rarely writing things completely from scratch. I was having to learn systems first before being able to contribute, but I didn't feel equipped for that.

Liz Fong-Jones: A former colleague named Mikey Dickerson has created a college course on creating and running reliable systems. Students start with a simple service - your first week's task is to get the service up and running and printing "Hello." By the end of the semester, make it print "Hello" when hit with a million queries per second, and also when I take down an availability zone or mess with your database. That approach needs to be duplicated more so there are more entry points into platform engineering where people can come in feeling like they know what this work is about, that they enjoy it, and feel prepared to do it.

Recommended talk: Organizational Sustainability with Platform Engineering • Lesley Cordero • GOTO 2024

Platform Engineering in Practice

Lesley Cordero: Since you're so big in open source, I've thought about this problem with respect to training students to become open source developers. I worked on this in college, but it was so hard. I saw someone post today about an O'Reilly course on how to become an open source contributor, and the idea was super appealing. I know you're huge in the OpenTelemetry space.

Liz Fong-Jones: One thing about open source is that it has cultural norms that are being rewritten over time. A course on how to contribute to open source 20 years ago might have been "here's how to suck up to Linus Torvalds and get him to not yell at you." I'd like to hope we've moved past this era of the lead maintainer yelling at you if he doesn't like your patch.

Now we're having interesting distortions around corporate open source. What does it mean when you have the lead contributors to a project all having the same employer and then being able to relicense it? We've seen the breakdown of the non-foundation model with the WordPress fiasco going on as we record this.

By and large, what has historically held true with open source contributions is that people need to appreciate what it's like to be an open source maintainer in order to function in our modern world. You will inevitably be using someone else's open source codebase, even if you're not an open source maintainer yourself. Getting experience with what that's like helps you write better bug reports and be kinder to open source maintainers. We can create a culture where we don't have open source maintainer burnout.

Lesley Cordero: How do you see building that sort of industry?

Liz Fong-Jones: I was employed by Google from 2008 to 2019. In that time I watched Google do the Google Summer of Code, where they would pay software developers who were sophomores or juniors in college $3,000 to work for a summer on an open source project. It wasn't just Google - other companies participated, but Google was facilitating it. That's a partial start - it gets people exposed to doing open source work for pay.

But some of that is working on feature requests. There's this long tail though - software's out there and then you have to maintain it. Especially in a world where there are security vulnerabilities - it's the security engineering equivalent of being litigious. Instead of getting sued all the time, you get bug bounties opened against you saying this particular sub-package is vulnerable. How do we push out fixes across the whole open source ecosystem?

I don't entirely know how we reconcile this. You can get people into entry-level work on adding more surface area to attack, as opposed to getting people to maintain when they're not necessarily very trusted. A university student writes the code, but someone has to review it, merge it, and then carry the maintenance of it forever.

I was talking to some OpenSSL maintainers about a year ago. One interesting thing they pointed out is that many people on the OpenSSL maintainer team don't have professional experience as software developers - they're cryptography enthusiasts who just started hacking in C.

I think we have to figure out sustainable models where companies that benefit from open source actually fund people to work on it while avoiding vendor lock-in and the relicensing drama. That's what turns people off of open source a lot - having to deal with drama.

Lesley Cordero: It's very true. Drama and uncertain funding - you can see why people might not want to pursue that as a career.

Liz Fong-Jones: Exactly. And also no one breathing down your neck asking if your contribution is approved for relicensing under the Contributor License Agreement when you're close to working.

Lesley Cordero: Open source is something I've always been adjacent to in terms of wanting to be a contributor, but it's hard. It was easier to explore in college because most companies don't invest in open source the way they should. I found it actually easier when I had no work obligations.

Recommended talk: Effective Platform Engineering • Chankramath, Cheneweth, Oliver, Alvarez & Reisz • GOTO 2024

From Kubernetes to Observability and Onboarding

Liz Fong-Jones: All this stuff underpins the work of every platform engineer. We all, by and large, have to maintain Kubernetes platforms. This is a favorite pet peeve of mine - people keep conflating Kubernetes with being a developer platform. Kubernetes is not a developer platform. Kubernetes is a platform for building a developer platform. It's so meta.

Lesley Cordero: Absolutely.

Liz Fong-Jones: We cannot expect every software developer to have to learn how to write Kubernetes manifests.

Lesley Cordero: Absolutely not. As much as that would be enticing for everyone who's a developer.
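To make the "platform for building a platform" point concrete, here is a minimal sketch of what a thin layer over Kubernetes might look like: developers fill in a handful of fields, and the platform expands them into a full Deployment manifest. The `ServiceSpec` and `render_deployment` names, the defaults, and the example image are assumptions made for illustration; neither speaker describes their own tooling at this level of detail.

```python
import json
from dataclasses import dataclass


@dataclass
class ServiceSpec:
    """Hypothetical developer-facing input: the few things an app team decides."""
    name: str
    image: str
    port: int
    replicas: int = 2  # assumed platform default, not from the conversation


def render_deployment(spec: ServiceSpec) -> dict:
    """Expand a ServiceSpec into a Kubernetes apps/v1 Deployment manifest."""
    labels = {"app": spec.name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": spec.name, "labels": labels},
        "spec": {
            "replicas": spec.replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": spec.name,
                            "image": spec.image,
                            "ports": [{"containerPort": spec.port}],
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    # Illustrative values only.
    manifest = render_deployment(ServiceSpec("hello", "example.com/hello:1.0", 8080))
    print(json.dumps(manifest, indent=2))
```

The developer supplies three values; the selector and label wiring, which is easy to get wrong by hand, lives in one place that the platform team owns. A real platform would add much more (resource limits, probes, rollout policy), which is exactly the care and feeding discussed next.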

Liz Fong-Jones: This creates a conundrum. The population of software developers out there includes platform engineers who are using Kubernetes to build a platform for the software developers. Then we've got the set of Kubernetes contributors who are providing the components that are used by platform engineers. It's so many layers of abstraction away from the actual end user problem.

This leads both to people not understanding that this requires care and feeding, and also to us potentially building the wrong solutions. It leads to people not being funded to work on the actual problems. I think more people need to rotate onto at least understanding these issues - having something they'd like to fix, wanting to contribute upstream, or being able to write a useful bug report. These are things that we need to get practice with in order to make sure that we are contributing upstream.

In terms of what you asked me about how OpenTelemetry has navigated some of these things - OpenTelemetry has such strong incentives around observability and telemetry generation. They need to be grounded in the commonality of data. This means that by and large, with one exception that's changing, the major players in the observability vendor space don't want to duplicate effort. They don't want to have the Splunk SDK and Sentry SDK and Honeycomb SDK. We just want to have one SDK, and library authors are aligned here because they want to build in telemetry as opposed to having to make it something that you're swapping in at runtime or having to play favorites.

I think everyone wants that project to succeed - vendors, users, and library authors. That's how we've managed to get the funding for our efforts, because people already were investing development effort into developing SDKs. This was just a "let's put all our resources together" approach.

Lesley Cordero: That's the beauty of centralization efforts. This happens in platform engineering too. We talked about using IDPs - internal developer platforms. I think we use those terms interchangeably, but we also see platform engineering being conflated with just the development of IDPs, which I think is another problem.

Liz Fong-Jones: Exactly. I love that you said writing documentation is a form of platform engineering. It doesn't have to have a Backstage plugin for it.

Lesley Cordero: One of my favorite things about our pipelines is that I have the docs automatically generated based off of comments because I don't have to worry about it. Documentation is so important, but keeping documentation up to date is annoying sometimes. That's why I think it's almost more important that we find these automated ways.
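As a rough illustration of docs generated from comments, the sketch below builds a plain-text reference page from function signatures and docstrings using Python's standard `inspect` module. The `render_docs` helper and the `build` and `deploy` functions are hypothetical; the conversation does not say which language or documentation tool Lesley's pipelines actually use.

```python
import inspect


def render_docs(functions) -> str:
    """Build a plain-text reference page from each function's signature and docstring."""
    sections = []
    for fn in functions:
        signature = inspect.signature(fn)
        # Make missing documentation visible instead of silently skipping it.
        doc = inspect.getdoc(fn) or "(undocumented)"
        sections.append(f"{fn.__name__}{signature}\n    {doc}")
    return "\n\n".join(sections)


# Hypothetical pipeline steps, present only to exercise render_docs.
def build(target: str) -> None:
    """Compile the given target."""


def deploy(env: str, dry_run: bool = False) -> None:
    """Roll the latest build out to an environment."""


if __name__ == "__main__":
    print(render_docs([build, deploy]))
```

Because the page is derived from the code on every run, the signatures cannot drift out of date. The prose in each docstring still can, so generation reduces the upkeep burden without removing it.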

Liz Fong-Jones: The challenge with documentation is that misleading documentation is worse than no documentation. At least if there's no documentation, you know "here be dragons." But if the documentation leads you astray, you've now invested time in developing the wrong mental model. This is the thing I keep saying about the goal of observability - the goal is to give you and your systems the feedback loops so that you have a good mental model of what the system is doing.

If you got the wrong mental model - I actually had an outage a couple of weeks ago where I misled myself into believing that the system was in a different state than it was, and therefore I disregarded feedback that was showing me that the system was in fact in a different broken state than I thought it was. Red herrings and human biases are so real.

Lesley Cordero: Having data that can help you challenge your assumptions is super crucial. That's also what I like about observability - it's such a full stack problem, genuinely. When I think about my talks in platform engineering, they come from an observability perspective because of that.

Liz Fong-Jones: Definitely. It's the source of the data by and large. You can write all of the design documents in the world, but if your design document says service A does not call service B, and you have tracing showing that service A does call service B, which one are you going to believe?
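The "design document versus tracing" comparison can be sketched in a few lines: derive the service-to-service calls actually present in trace data and subtract the ones the design document lists. The `Span` record here is a deliberately simplified, hypothetical shape (real tracing data carries far more), and the function names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span: an id, its parent's id, and the emitting service."""
    span_id: str
    parent_id: Optional[str]
    service: str


def observed_calls(spans):
    """Return (caller, callee) pairs where a span's parent belongs to a different service."""
    by_id = {s.span_id: s for s in spans}
    edges = set()
    for span in spans:
        parent = by_id.get(span.parent_id) if span.parent_id else None
        if parent is not None and parent.service != span.service:
            edges.add((parent.service, span.service))
    return edges


def undocumented_calls(spans, documented):
    """Calls seen in traces that the design document does not list."""
    return observed_calls(spans) - set(documented)


if __name__ == "__main__":
    spans = [
        Span("1", None, "A"),
        Span("2", "1", "B"),  # A called B
        Span("3", "1", "C"),  # A called C
    ]
    documented = {("A", "C")}  # the design doc claims A only calls C
    print(undocumented_calls(spans, documented))
```

In this toy data the design document omits the call from A to B, and the traces surface it. The same set difference run the other way would list documented calls that never show up in production.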

Lesley Cordero: Exactly. At an old team, when I was onboarding people, I used to tell new hires that I would just go onto our observability vendor and see what API endpoints come up for any given service. That was the easiest way to actually figure out what's going on with our systems. Just digging through the codebase or docs was far less effective in terms of getting the right mental model.

Liz Fong-Jones: Although there's that challenge we need to figure out - helping people navigate from trace span names or API endpoint names to the relevant lines of code, which is easier in some languages than others depending on the stack trace generation or the metaprogramming constructs. Being able to say this trace comes from this line of code at this commit.

Lesley Cordero: I also found it helpful for finding confusing points - if something that appears in the code is not what they would have expected, being able to flag that. I think that beginner's perspective was super helpful. I don't know if you experimented with doing anything like that during onboarding.

Liz Fong-Jones: Our learning approach is super interesting. This has gotten harder to scale as we've gotten larger, but usually we have the second most recent batch of hires give the platform architecture overview to the newest set of recruits. That's really fun because it means that we have this living oral story about how the system came to be and how it works. There are more senior engineers who sit in on it, but that's how we avoid depending on the one most tenured person - this is actually a story being told by the newest people.

Lesley Cordero: That's a great idea. I love that.

Recommended talk: Using Serverless & ARM64 for Real-Time Observability • Liz Fong-Jones • GOTO 2024

From Shadow IT to Shared Trust

Liz Fong-Jones: You mentioned you were a manager before. So let's talk about your manager pendulum.

Lesley Cordero: I'm not an engineering manager right now. I can talk about why I'm not an engineering manager right now, but how about you?

Liz Fong-Jones: I'm technically the field CTO at Honeycomb. I have no direct reports, and Charity has no direct reports either. We're the co-CTOs and we are in a department off on our own. So we're working on the executive level business problems but not necessarily the people management parts.

I've been an SRE manager before. I used to manage while I was living in New York. I used to manage the Cloud Bigtable SRE team. Then I went back to being an individual contributor working on SRE education and working with Google Cloud customers. So I did a back and forth on that manager and engineer pendulum myself.

Lesley Cordero: How do you make a decision on when to switch back and forth? How much of it is personal versus strategic?

Liz Fong-Jones: I think it's based partly on what are the organization's needs, what are my own needs, what's going to further my career development the best right now, and also deliver the most impact. That's how I thought about it, but definitely that experience of being a manager has been super helpful in understanding organizational psychology and why organizations make decisions the way they do.

It helps with understanding how to support the career development and growth of people on a team where we're the tech lead. How do we work with a manager to help someone who's maybe on a performance plan? Having been the manager in that situation really helps you not be completely mystified about why they did what they did.

Lesley Cordero: I found it super helpful. I don't know how much I should be talking about what's going on at my company with respect to the labor union dispute, but that actually was really helpful in the union negotiations we've been doing the last couple of years because I've been on both sides. There's a lot of nuance there.

Liz Fong-Jones: Certainly when we were doing organizing at Google, there's a spectrum of whether we actually believe that line managers - people who are mid-career managers - have sufficient influence over company policy to be part of capital and management. That's an interesting discussion that's happening in our industry.

Under labor law, it boils down to whether that person has the power to actually fire someone. Maybe the answer is no - they have to get the sign off of HR and their director. These people are kind of executing the will of the organization in some ways, but not necessarily setting strategy.

It's super interesting. I've seen this play out definitely as far as what is our role in shaping effective organizations, whether it be from an individual contributor or manager perspective.

Lesley Cordero: My situation is super complicated. Especially because you're in a field CTO position, how does that feel in terms of being more of a manager role versus an IC role or some weird in between?

Liz Fong-Jones: It's definitely weird in between. I do not have to do performance management of reports, but I do think about budgets and strategy. The field part specifically is that Charity and I both are very frequent conference speakers. But Charity spends the time which is not speaking at conferences looking after the internal organization structure, our board or executive team, company strategy, whereas I'm a lot more focused on working one on one with our clients and helping our clients execute their strategies.

That's where that delineation of field versus home comes from. Both of us are looking very much at which direction the industry is heading and how we help the company succeed in that circumstance.

Do I write code sometimes? Do I cause incidents sometimes? You're not going to take that away from me simply because it's part of how I continue to be someone that you ought to listen to, because the instant that I retire from writing code, that's going to be the instant that I stop being relevant.

Lesley Cordero: That's the hard part of being a manager - the idea of staying relevant.

Liz Fong-Jones: But I never do anything that's in the critical path of delivery. That's super important. That's common to most managers. You do not want to be working on something where you have to choose between "I'm going to deal with that performance management issue on my team" or "I'm going to write code." You have to look out for the team first, always.

But if it's in service of the team - I worked on something a couple of weeks ago and posted about it on LinkedIn. I was looking at our build times and noticing that our build times were getting slower and slower. They regressed from 10 minutes to 12 minutes to 14 minutes. I said enough was enough. I had two days where I could look at it, and I cut it to 10 minutes, and then I cut it to 6.5 minutes. That's something that no one had put on the roadmap. It was not blocking any team, but it was something I could do to help the team execute faster.

Lesley Cordero: Someone asked me about how I go about prioritizing that kind of work. I had my answer, but I'm curious what yours is.

Liz Fong-Jones: My answer to how you prioritize maintenance and care and feeding work is very similar to how we think in SRE about paying down toil. A majority of your team's time should be spent not doing toil. For people who are not familiar with the SRE space, we define toil as repetitive, automatable work that doesn't require human judgment necessarily.

The non-toil work can be bucketed into feature work or platform work. As a platform team, that priority is almost set by the organization - how many resources are they giving to platform versus product.

If you are a more monolithic organization and you're making decisions on how to prioritize platform versus product work on your own team, that's going to depend a lot on the stage your organization is at, because sometimes you're going to have to incur more technical debt in order to make it to market, to raise revenue. But there will come a time where if you're not careful about paying down that debt, the interest is going to overcome everything else and you're going to spend way too much time doing toil.

Lesley Cordero: I try to rely on having as much evidence at any given time so that I can figure out when's the right time to reprioritize. Initial prioritization should be thorough, but you also have to acknowledge that priorities are going to change.

Liz Fong-Jones: I love that - gathering evidence to know when you should change your priorities. Our team does this by surveying our developers who are not working on the platform. They ask questions like "How did your last context shift feel?" - text field, not 1 to 5 - or "How easy was it to add UI components?" That helps inform where we prioritize our time as a platform team, and whether we're under or over investing in certain internal areas - when is enough polish there? Ideally we get ahead of problems.

For instance, we're now in a better state than we were before. We have time and energy to work on rewriting our deployment system. We know that we're hiring more developers. We know that we now are regularly in a state where the hourly push train has four or more changes, rather than only 1 to 3. Before we get to 16 changes going in a single release, we need to make sure that we're deploying more incrementally. That requires us to run a progressive, continuous train rather than an hourly one.

Initially you have to do what you have to do to get the job done, and then later you can set the roadmap and say, "If we continue to grow, these are problems you're going to be facing. Let's make sure that we're proactive about addressing them."

Lesley Cordero: In my psychological safety talk, I talk about three phases: reactive phase versus proactive versus preventative. When you're in a more reactive state, what prioritization looks like is extremely different than when you're in a more proactive, preventative state as a team.

Liz Fong-Jones: The addendum to that - I got annoyed about the build times, but it turns out that was on the SRE team's OKRs for the last quarter, and it had fallen off the bottom of their priority list because more urgent things came up. You don't want me in the critical path of doing something urgent, but it was important work, and it was work that I was able to get off of that team's roadmap by doing it.

Lesley Cordero: Those are my favorite developers when they do the work for you without being asked.

Liz Fong-Jones: As someone who is not reporting to our VP of engineering, I think it's important for me to acknowledge that I will turn up occasionally with harebrained ideas and sometimes it distracts people or causes work or churn. So I feel like I owe it to the team to also chop wood and carry water. If I don't do that, I'm going to have a very resentful team that's like "Liz is just turning up again with another thing that's disrupting us" as opposed to "Hey, Liz both contributes to experiments and also cleans up after the experiments."

Lesley Cordero: It's a way of making sure you have trust with the people you work with. At the end of the day, no one likes being told what to do without reciprocity.

Liz Fong-Jones: I think that applies to platform teams too. Platform teams need to provide value to the organizations. We cannot just say we have a monopoly on the platform - you work here, you have to use us.

Lesley Cordero: That's why I made the point that it's not our job as platform engineers to tell other people what to do. It's our job to help them do what they need to do. For platform engineers, I do see sometimes that tendency to be very "do what we say because we're the centralized platform team."

Liz Fong-Jones: And that's how you get shadow IT.

Lesley Cordero: Exactly. I'm very familiar with shadow ops.

Liz Fong-Jones: Hopefully not from your current workplace.

Lesley Cordero: I think every workplace has some shadow ops, especially if they're an older company. It's hard for me to think of a company that's large and doesn't have any shadow ops.

Liz Fong-Jones: Either you can develop a platform that gets people excited about migrating to it so that they don't have to do shadow ops, or you can try to root it out by auditing everyone's credit card expenses to find out who's spending money on unauthorized tools.

Liz Fong-Jones: This is super exciting and fun. I really enjoyed this conversation about platform engineering and everything that comes along with it - engineering, open source, and how we get people into our field.

Lesley Cordero: I agree, thank you for speaking with me. This is awesome.