
Reliability Engineering Mindset
Charity Majors (Honeycomb) interviews Alex Ewerlöf (Volvo Cars) on the gap between Google's SRE ideals & reality. Key insight: SLOs are your team's API for pushing back against micromanagement.
Transcript
The Serendipitous Path to SRE
Charity Majors: My name is Charity Majors, co-founder and CTO of Honeycomb. And I was interviewed a while ago about my books, and I'm here with Alex Ewerlöf. Tell us who you are.
Alex Ewerlöf: Yes. I'm a Senior Staff Engineer at Volvo Cars currently. Prior to that, I was working at different media companies for six and a half years and actually started my journey with computers in 1999. But I've been moving around a little bit. I've been a PM at some point. I've been in a startup at some point.
But the common theme is product development. I came back to engineering and programming. I accidentally became an SRE, which is a funny story. It wasn't an intentional career choice.
Charity Majors: Wow. I didn't actually know that. That is fascinating. I am a huge fan of diversifying one’s career, pursuing sort of parallel skillsets. But it's rare to see someone who comes to SRE later in life. What brought you into SRE?
Alex Ewerlöf: I was working at a media company. Media typically makes money from ads. Google and Facebook were eating our lunch because they could do targeted ads. We weren’t a tech-savvy company. McKinsey was in the house and they convinced the leadership to throw money at the problem.
They hired a lot of skilled folks. That was the most highly paid and talented gang of people I have ever worked with. We opened an office in London and gulped in talent from Google, Facebook, and all over the place. Fast forward two years and the venture failed to deliver. As it turns out, you cannot just throw money at the problem while keeping the old leadership.
The company broke in two halves. I was in the half that got a new CEO. We went into cost saving mode.
The new leadership said, we don't want specialist titles like QA or DevOps, we just want generalist software developers. So anyone who didn't want to be a generalist software developer was welcome to leave the company. I didn't.
I was a web developer at that time. I knew so little about DevOps, infrastructure, and observability. I just knew that if I write good code in JavaScript and HTML and CSS and hand it over to the operations, they run it for me and if incidents happen, it's their problem.
Overnight, my colleagues and I were given admin access to the company's website - we could go and deface it - and you know how sensitive media companies are about their public face. We made some stupid mistakes and also went to some courses. Over time, we kind of grew into what Google calls SRE. Ben Treynor Sloss, who coined the term SRE, says that people from two different backgrounds come into this. Roughly half of them come from the traditional system administration background. And the other half are software engineers.
An interesting dynamic that I touch on in my book is that if you find a way to make quality and reliability a software engineer's problem, the engineers are going to use code to fix that. And that's exactly what happened.
Charity Majors: They were actually way ahead of the curve because there is no more "dev over here, ops over there." It's like we're at the end of the road for DevOps. Which is not to say that everyone's gotten there, but the idea that you have dev teams and ops teams and they're supposed to have empathy and collaborate - no.
Now we have engineers who write code and who own their code in production. Everything we know about the art and science of shipping good software says it's about connecting the feedback loops and making them short. And when you tear that up into two different sets of heads, that's not a feedback loop.
Alex Ewerlöf: Exactly.
And incidents love these cracks in the feedback loop. Whenever there is a handover, incidents emerge in the ownership gaps. If you look inside what the team is providing, they test it and ensure that quality is good. That's not usually where the incidents happen. Usually it's the point of handover where the ownership is ambiguous.
Charity Majors: Yes, absolutely. All right.
Recommended talk: Is It Time To Version Observability? (Signs Point To Yes) • Charity Majors • GOTO 2024
From SRE to Author
Charity Majors: So you started off as a software engineer. You did a bunch of different things. You got into SRE. Sounds like about ten years ago. When did you start writing your book?
Alex Ewerlöf: I’m currently working at a car company. And cars these days are COWs: "computers on wheels." One major distinguishing factor for cars is actually how well integrated their software ecosystem is - for example, with the Apple or Google ecosystem: running apps, settling subscriptions, gathering data, downloading and installing OTA (over-the-air) updates, etc.
My company invested heavily in software. The headquarters is on the west coast of Sweden. The west coast of Sweden has a lot of manufacturing. And on the East Coast we have many software companies: Spotify, Skype, Minecraft, DICE, Klarna, Lovable, King (Candy Crush). We made a strategic decision to open an office in Stockholm to tap into the software ecosystem there.
That's where I come in. I have a framework called T-POP for Tech, People, Operation, and Product. Usually when you’re new to a company, you know one of these and use that to crawl to the rest. For example, you know the tech, but you don't know how the company operates and the product. Or you know the people, but you don't know the tech.
For me that was SRE because I was a Staff SRE at Discovery Networks. I thought, okay, I've been doing SRE for a couple of years now, maybe I can find where the Volvo SRE team is. And then kind of self onboard and crawl my way into the organization from there.
I got lucky and one of the senior directors in the Platform organization gave me the task of implementing DORA metrics and then I could motivate them to improve reliability using SLIs and SLOs. At the time I had hundreds of teams, which is kind of big.
At my previous company, I had just four teams! I knew everyone by name. I could do workshops face to face and find the right SLIs. But in the new one I had to find a way to scale that effort. There is no way I can go and talk to every person and tell them what SLIs, SLOs and SLAs are, let alone go through all their architectures.
So that's where the book comes in - to solve that problem. How do I scale the basic language of SLIs, SLOs and SLAs across the company?
Recommended talk: Principles For Secure & Reliable Systems • Eleanor Saitta • GOTO 2023
The Google SRE Gap
Charity Majors: Google famously coined the term SRE and kind of pioneered the idea of general software engineers doing this sort of work. But I think there's a pretty significant delta between what Google thinks of as SRE versus what the rest of the world thinks of as SRE. How does that factor into your work?
Alex Ewerlöf: I often use this metaphor: their books are like fantastic chef's recipes for Michelin-starred meals.
And most companies try to copy Google and mimic what Google does, hoping to get the same results.
Charity Majors: They don't have Google's business model. They don't have Google's resources. They don't have Google's stack. Yeah.
Alex Ewerlöf: Exactly. So they look at their kitchen and they see, well, I don't even have a fridge. And there are three fires burning instead of an oven. Many companies start building a kitchen - AKA the platform. Hence the rise of platform engineering. So I think SRE will come back once the majority of the industry reaches a point where we can actually do proper SRE. But before that, we have to first build the platform and tooling and automation.
The reason that I wrote this book is the delta between where Google starts their books and where most companies are in terms of platform maturity.
Charity Majors: You know, honestly, I kind of wish that many of the books that were written as advice - I kind of wish they were written as memoirs, like "here's what I did, here's what happened" instead of "here's what you should do." Let me learn my own lessons about what I should do based on my context.
Give me way more information about the context in which you were operating, in which these decisions made sense. Because I think without that context and without that framing, the amount to which it's applicable to the rest of the world is debatable.
Alex Ewerlöf: To your point, there is the term "best practice." I am more fond of using "fit practice" because “best practice” has this notion of absolutism in it: this is objectively the best. Whereas in reality we need to find what works for a particular situation based on the tech landscape, the budget, headcount, timings, trade-offs, all that stuff. Fit practice over best practice.
I don't believe in one recipe that works for everything. The more flexible it is, the more nuanced it is, the more I respect that. That's the main reason I wrote this book. There was no reason for me to write a book when Google has published such excellent books other than sharing my experience in what works in different environments.
Charity Majors: Yes, exactly.
Making Service Level Objectives (SLO) Practical
Charity Majors: Well, I'm glad you mentioned SLOs. A lot of your book is devoted to the nuances around them. Do you want to share with the audience how you would position them?
Alex Ewerlöf: I also have a newsletter and sometimes people book me to go and talk to other companies - the majority of the companies I met are not even at the level where they can talk about SLIs. It’s just Google jargon they don't fully understand. But one of the most common SLI pitfalls is not measuring at all. I believe that Service Level is a very powerful concept, because it normalizes failure.
Charity Majors: SLIs and SLOs are Google's number one contribution to the field of reliability engineering. They did it so well. I mean, I think everyone should adopt some version of these - there are almost no exceptions.
Alex Ewerlöf: Yes. And here's the second most common pitfall: measuring the wrong thing. Most teams measure one of the golden signals (availability, latency, error rate, saturation) because it's in the book. But I've written an open source tool that helps the companies to find the right angle to think about SLI.
So it's visual using SVGs. I'm piggybacking on my development background and created a tool that draws a graph between the service providers and service consumers. And at the point of dependencies, that's where the failures can happen.
So you list your failures and then sort them based on business impact. And then tie SLIs to those failures. I have done this exercise more than seventy times across different companies. This process wasn't like this in the beginning when we started. And now I can confidently say that this is a foolproof way to find SLIs that are meaningful.
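The workflow Alex describes here (map providers to consumers, list the failures at each dependency, sort them by business impact, then tie an SLI to each failure) can be sketched in a few lines of Python. This is not his open source tool. The service names, failure descriptions, impact scores, and SLI wordings below are all hypothetical and only illustrate the shape of the exercise.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    """One failure that can happen at a consumer -> provider dependency."""
    consumer: str
    provider: str
    description: str
    business_impact: int  # hypothetical scale: 1 (minor) to 5 (severe)
    sli: str              # the indicator tied to this failure


# Hypothetical failures for an imaginary streaming product.
failures = [
    Failure("mobile app", "recommendations API",
            "home screen shows stale picks", 2,
            "share of responses built from data fresher than 1 hour"),
    Failure("mobile app", "playback API",
            "video fails to start", 5,
            "share of playback requests that start within 2 seconds"),
    Failure("playback API", "license service",
            "license lookup times out", 4,
            "share of license lookups answered without a timeout"),
]


def rank_by_impact(items):
    """Sort failures so the most business-critical ones come first."""
    return sorted(items, key=lambda f: f.business_impact, reverse=True)


ranked = rank_by_impact(failures)
for f in ranked:
    print(f"[impact {f.business_impact}] {f.consumer} -> {f.provider}: "
          f"{f.description} | SLI: {f.sli}")
```

Starting from the failure and working back to the indicator is what keeps a team from defaulting to whichever signal happens to be easiest to measure.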
Charity Majors: I love that. I sometimes joke that Honeycomb is a sticker company where we sometimes build developer tools on the side. One of the stickers that I have says "SLOs are the APIs for your engineering teams" because one of the things that you talk about in one of your chapters is how it drives creative change when it comes to the relationship between the management chain and the engineers, because there's this tendency to micromanage each other's roadmaps, each other's time. And SLOs, I think, are your most powerful tool for pushing back on that and being like, no, you don't need to understand, you don't need to have an opinion on what my team is doing.
As long as they're hitting these numbers that we have agreed upon, we are good. And if we're not hitting those numbers, then I'm sorry, we have to take cycles off this roadmap stuff and put it back, because we have agreed - we made this agreement when we were sober, looking at the numbers, not in the heat of the moment.
And I just feel like it's this really important power tool for balancing all of these complex forces that are pushing and pulling teams all over the place.
Alex Ewerlöf: Exactly. It goes back to the definition of Service Levels. What is a service? When I start a workshop with my teams, I ask them, well, why is the company paying you? Why are you here? What problem are you solving? That is the service. The service is not a microservice or database that is running. The service is the problem we are solving. That service has a level whether you measure it and commit to it or not. That's one thing - to abstract what goes inside the team as long as it's providing the service.
And the second idea behind Service Levels is that reliability is not free. For every nine you're adding to the SLO, you're essentially shrinking the error budget by a factor of ten. That has a cost - you have to refactor, you have to change vendors, you have to maybe improve your tooling, maybe you need to hire more people, maybe you have to ship slower. Reliability has a cost. It is usually not free. Sometimes you find a clever trick to improve reliability, but more often than not there is a cost.
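The "factor of ten" arithmetic is easy to check. The sketch below assumes a time-based availability SLO measured over a 30-day window; both of those choices are assumptions made for illustration, not something stated in the conversation.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a time-based SLO.

    Assumes the SLO is evaluated over a rolling window of `window_days`.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


# Each extra nine leaves one tenth of the previous error budget.
for slo in (99.0, 99.9, 99.99, 99.999):
    budget = error_budget_minutes(slo)
    print(f"{slo}% -> {budget:.2f} minutes of error budget per 30 days")
```

Under these assumptions, 99% leaves about 432 minutes per 30 days, 99.9% about 43 minutes, and 99.99% a little over 4 minutes, which is why each additional nine tends to need a different level of investment.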
And we want those conversations to be part of the negotiation. It's not like you have platform teams and then the product teams just go and pull the strings in the management chain to increase the number of nines. That's not how it's supposed to work.
Charity Majors: This is a sign of a junior engineering manager usually versus a very seasoned engineering manager - whether they know how to say no or push back and explain the trade-offs of what they're being asked to do, or whether they just accept all these assignments and they just say yes to everything. The team grinds to a halt. Everyone's stressed working 60 hours a week, making no progress. It doesn't work right.
So the engineering manager usually doesn't feel like they can just say, no, we won't do this. So the way you say no is you explain the costs. You push the costs into the room so you can have an informed discussion. So you can be like, okay, if you're asking me to do X, Y, Z, I need A, B, C. And that - I think that's a hard - I associate with senior engineers, staff engineers are often responsible for this too. But for me, I feel like every new engineering leader needs to go through this because it doesn't come naturally to anyone.
And I really feel like you can tell how experienced and effective an engineering leader is, whether they're an IC or a manager, by how effectively they shift the discussion from the territory of "what are we being asked to do" to "what does it cost exactly?"
Alex Ewerlöf: It helps to put the product hat on. The power of a positive no. There is a book with that title. The idea is that if someone says they want you to spend more on improving your reliability, you ask why? Because you can do a lot with UX. For example, they may ask for shorter backend latency in order to make the UI feel more snappy. But with some UX tweaks, you can actually make it appear faster. Behind the scenes you may be just polling an async worker. You can hide that unreliability much cheaper than investing in infrastructure.
I have a fun story. I was hired by another company that had very poor app reviews due to subpar reliability. They decided to throw money at the problem and hire SREs. That's where I come in. And as soon as we started talking to the CTO about SLIs, he said, "I want 99.999% availability." I thought: “What are you smoking? I want some of that. It must be really good.” This is a media streaming app. What you're talking about is the realm of high availability systems like hospital information systems or airport control towers.
Charity Majors: What they're saying when they say that is "we want our customers to have a good experience." You don't need five nines reliability for customers to have a good experience.
Alex Ewerlöf: Exactly. The UX research showed that the users can put up with up to two hours of unavailability. That's 99.7% availability. That's workable because we had just 1000 developers.
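The conversation does not say over which period the two hours apply. The figures line up if the window is roughly a month, so the sketch below assumes 30 days; that window is an assumption, not something stated in the interview. It converts a tolerated downtime into an availability target, and shows what the CTO's five nines would have allowed over the same window.

```python
def availability_percent(tolerated_downtime_hours: float,
                         window_days: int = 30) -> float:
    """Availability target implied by a tolerated amount of downtime."""
    window_hours = window_days * 24
    return 100 * (1 - tolerated_downtime_hours / window_hours)


def allowed_downtime_hours(availability: float,
                           window_days: int = 30) -> float:
    """Downtime allowed by an availability target (the inverse conversion)."""
    return window_days * 24 * (1 - availability / 100)


two_hours = availability_percent(2)
five_nines_hours = allowed_downtime_hours(99.999)

print(f"2 hours per 30 days -> {two_hours:.2f}% availability")
print(f"99.999% -> about {five_nines_hours * 3600:.0f} seconds per 30 days")
```

With a 30-day window, two hours of tolerated downtime is about 99.72% availability, while five nines allows roughly 26 seconds, which makes the gap between what users tolerate and what was being asked for concrete.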
Charity Majors: It's turtles all the way down. I love this quote from your book: "I'm fully aware of the risk of service levels being weaponized by management. However, I'd like to honestly invite my engineering peers to look at it as a way to use hard data to put an end to emotional discussions." I love that.
Alex Ewerlöf: One of the companies I used to work at used a blanket SLO for all services: just measure availability regardless of what you're building. They defined several classes of availability depending on where your system sat in the critical user journey, impacting how the business makes money. It wasn't so much about "okay, so maybe we can get away with 99% because that will dramatically reduce the cost of building a highly reliable system." It can make it much more manageable, for lack of a better word. Not every system deserves to be as reliable. There are certainly some systems that really deserve the cost of reliability, but many systems don't.
Recommended talk: A Field Guide to Reliability Engineering at Zalando • Heinrich Hartmann • GOTO 2024
The Future of Observability and SLOs
Charity Majors: One other thing I feel obligated morally to throw in here. That is something I think most people in the world don't have yet, but I have come to think of as absolutely necessary for good SLOs is calculating your SLOs from the same data that you're using to investigate incidents. The way Honeycomb does it is you define your SLOs and they're computed from the events you're shipping into the system.
So if there are events that are violating the SLO, you can just click on them and see exactly which requests are causing problems. You can then Bubble Up, which is a Honeycomb thing where you draw a little bubble around the thing you're like, oh, what is this? And it diffs what's inside the bubble versus the baseline.
So it's like oh this thing you care about - oh all the errors in this time range are from Android devices from this region hitting this cache or whatever combination of things it is. Because I feel like when you have to jump between tools or sources of truth, and they don't agree on what the problem is, you often spend more time trying to figure out the differences between those data sources than investigating the problem itself.
And I think being able to just switch fluidly back and forth between "oh, here's my SLI, here's my SLO, and here's the data that is flowing into them" - it's such a power tool.
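The idea of computing an SLO from the same events you investigate with can be sketched in plain Python. This is not Honeycomb's API or its BubbleUp implementation. The event fields, the 500 ms threshold, and the sample data are all hypothetical and only show why a single data source makes the drill-down trivial.

```python
from collections import Counter

# Hypothetical request events; in practice these would come from telemetry.
events = [
    {"duration_ms": 120, "status": 200, "platform": "android", "region": "eu"},
    {"duration_ms": 900, "status": 200, "platform": "android", "region": "eu"},
    {"duration_ms": 80,  "status": 500, "platform": "android", "region": "eu"},
    {"duration_ms": 100, "status": 200, "platform": "ios",     "region": "us"},
    {"duration_ms": 150, "status": 200, "platform": "ios",     "region": "eu"},
    {"duration_ms": 700, "status": 503, "platform": "android", "region": "us"},
]


def is_good(event) -> bool:
    """Assumed SLI definition: no server error and answered within 500 ms."""
    return event["status"] < 500 and event["duration_ms"] <= 500


def sli(evts) -> float:
    """Fraction of good events, computed from the raw events themselves."""
    return sum(is_good(e) for e in evts) / len(evts)


def attribute_counts(evts, key) -> Counter:
    """How often each value of `key` appears in a set of events."""
    return Counter(e[key] for e in evts)


# The events that burn the error budget are the same records we can inspect.
bad_events = [e for e in events if not is_good(e)]

print(f"SLI: {sli(events):.0%} good events")
print("platform, bad events:", dict(attribute_counts(bad_events, "platform")))
print("platform, all events:", dict(attribute_counts(events, "platform")))
```

Because the indicator and the investigation read the same records, there is no second data source to reconcile: comparing attribute counts for the bad events against the baseline is a crude version of the diffing described above.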
Alex Ewerlöf: I haven’t worked with Honeycomb, unfortunately. Most of my experience comes from observability 1.0 providers, as you would call it. I’ve heard so many good things about Honeycomb.
One of the things that I was really excited about is that SLOs are a first class citizen in Honeycomb, and that's how it should be.
Charity Majors: You shouldn't have static dashboards as your entry point into your systems. It should be the most business critical information that you look at and use as your jumping off point into exploration.
Alex Ewerlöf: Yeah, and the other thing that you said - it should be the same data and SLOs. I haven't seen this in the wild unfortunately, but that's like a dream come true if there is a company that does it like that. That's how it should be done.
Charity Majors: I don't know of anyone but Honeycomb, but I believe, honestly, between you, me and our podcast listeners.
These changes are being driven by soaring complexity. And I really do think that the observability 1.0 model - we've never been able in this industry to treat observability like a data problem. Why? Because we're putting every signal to its own storage location. You can't query over a combination of metrics and exceptions and traces. You can't query, you can't run an ETL process, you can't derive summary statistics.
Every single observability company that has been founded since 2019 has followed the 2.0 model of having a single source of truth. Anyway, this is not about that, but I think the next generation of tools will absolutely have SLOs as a first class citizen because it's such a game changer.
All right. Before we run out of time, I wanted to say - by the way, I think my other favorite chapter of yours was the organization architecture. I don't know if we have enough time to talk about it, but I really encourage everyone to look it up. It's got so many cool little diagrams. I love the little smiley faces and the part about everyone supports the same football team. I was just like, yes, this is what the real life of your organization looks like. It's so good. All right. Okay, I can't go into it, but.
Swapping Stories on Recent Reads
Charity Majors: All right. We are - I think we have 1 or 2 minutes left. What's the best book you've read in the last year?
Alex Ewerlöf: How The Mighty Fall: And Why Some Companies Never Give In.
Charity Majors: This one. Yes. All right. I'm going to have to look that up.
It's called "Fluke: Chance, Chaos, and Why Everything We Do Matters." And I love this because I grew up religious. I'm not religious anymore. But I feel like there are kind of two competing extremes. There are people who are like, nothing we do really matters. Someone was going to invent the light bulb. Does it really matter if it was Edison or one of 200 other people? Versus people who are like, no, everything we do really matters. We have agency. We can make an impact.
In reality, he talks about how there's truth to both of these, but actually there's a lot of contingency in history. There's a lot of - World War Two wouldn't have happened without World War One. And World War One happened through a series of wildly improbable events. His takeaway was not everything you do ends up changing the world, obviously, but some of the things you do end up changing the world, and you have no way of predicting which things.
A stray comment you make to someone may compound and result in them starting a company years later which results in them changing the world - the butterfly effect. And I just found this so meaningful. As an atheist, there is meaning to our existence, and there are reasons to be mindful of the impact that we have in the world, even if most of what we do will get drowned out in the noise of history. So yeah, "Fluke" - loved it.
Alex Ewerlöf: Yes. Speaking of books, I think one other book that really changed me was The Elephant in the Brain. That one talks about how a lot of what is happening - that we think we have decided to do - is actually decided in the subconscious of the brain, and we just observe it and justify that it was us who did that.
Charity Majors: All right. I have two great new books for my reading list.
Alex Ewerlöf: Me too. And you can read this one.
Charity Majors: I'm so glad we got to do this.
Alex Ewerlöf: Thank you, Charity Majors, I really appreciate your time.
Charity Majors: Likewise.
About the speakers

Charity Majors (interviewer)
CTO at honeycomb.io

Alex Ewerlöf (author)