
Observability Engineering

George Miranda • Liz Fong-Jones • Charity Majors | Gotopia Bookclub Episode • June 2022


Observability is crucial for developing and understanding the software that powers today's complex systems. Charity Majors, Liz Fong-Jones, and George Miranda, authors of "Observability Engineering," show you how to manage software at scale, deliver complex cloud-native applications and systems, and realize the benefits observability brings across the entire software development lifecycle. You'll also learn the impact observability has on organizational culture (and vice versa).


Transcript


Intro

Charity Majors: Hello. I am Charity Majors, co-founder and CTO of Honeycomb, and co-author of the upcoming "Observability Engineering" book. And I'm here with my co-authors, Liz Fong-Jones and George Miranda.

Liz Fong-Jones: Hi, I'm Liz Fong-Jones. I'm a principal developer advocate at Honeycomb. And I got roped into this book project a couple of years ago, which is exciting. And we're really glad to have George Miranda on board as well.

George Miranda: Hi, I'm George Miranda. I also have been at Honeycomb. When I joined Honeycomb, the book was somewhat in its infancy. I was super interested, so I just jumped right in. And together, we've all been cranking on this for a while and getting it done. So I'm super excited to be here to talk about this book today.

Charity Majors: I feel like when George joined, Liz and I were both like, "Hey, we're half done with this book." Remember that?

George Miranda: To be young and naive.

Charity Majors: Exactly. Yeah, this book has taken...

Liz Fong-Jones: It was my first rodeo with a book, but it was not Charity's first rodeo with a book.

Charity Majors: No, no. I'd been doing the book for a little while, and I thought we were half done. Anyway, Liz came on board first, I think, and drew up a whole new chapter outline and we started writing. But I don't feel like the train got its wheels on, to stretch a metaphor, until George started. He just brought so much structure to the process. I would spew out a couple of pages of just, like, notes off the top of my head, and he would turn out these beautiful, nicely crafted entire chapters. And I feel like that was when we finally got an actual rhythm to it.

George Miranda: Absolutely. I think the three of us have been a really good combination to... I guess, you know what, this is not my first O'Reilly book, but the ones I've written before were much shorter. I think working with you and with Liz Fong-Jones has just been super informative for me joining Honeycomb.

Liz Fong-Jones: Oh, yeah. You were new and you had, like...

George Miranda: I was brand new.

Charity Majors: A newbie.

George Miranda: Exactly, a brand newbie. This was a way for me to wrap my mind around the nuance and the difference. And having the two of you to lean on in terms of the insights and just the different areas of focus that you both brought to the table really let me absorb it; working with both of your writing helped me internalize it. I think it took me a couple of months to wrap my brain around it. But then, as we kept going, there were parts of the book that I was then able to own and run with.

And so I can only hope that as folks read this book, they also can glean some of the insights that just made things click, that made the practice of observability easy to understand and follow and to understand why it is such a paradigm shift. I hope that's what people get out of it. I think we've done, hopefully, a pretty good job of it when I look back at it, maybe.

The evolution of Observability Engineering

Charity Majors: I hope so too. It's been nice having the reader feedback all along, which is the main thing. But also, things have been changing along the way pretty quickly. It wouldn't have taken us this long to write if observability, as we know it today, had existed in its present form three years ago. Right, Liz?

Liz Fong-Jones: Definitely. One of the key things that I was always lamenting to George was that when we originally sat down to write about OpenTelemetry, we were imagining that it was going to be this giant effort to explain to people how they should instrument their code. Then it turns out that other people have written excellent documentation. So we can almost provide a pointer to existing documentation rather than having to explain it all ourselves. Also, the code examples that we use are now 1.0. There's no longer a, "Oops, this is already out of date," you know, from the moment the ink is dry.
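[Editor's note: for readers who haven't seen that instrumentation, here is a minimal sketch of what manual spans look like with the post-1.0 OpenTelemetry Python API. The service name "checkout" and the "app.*" attribute names are illustrative only, not examples from the book.]

```python
# A minimal sketch of manual instrumentation with the OpenTelemetry Python API (1.0+).
# Names like "checkout" and "app.cart_value" are made up for illustration.
from opentelemetry import trace

tracer = trace.get_tracer("checkout")

def handle_checkout(request):
    # Wrap the unit of work in a span; it ends when the block exits.
    with tracer.start_as_current_span("handle_checkout") as span:
        # Attach whatever context might matter later, not just what you need today.
        span.set_attribute("app.user_id", request.user_id)
        span.set_attribute("app.cart_value", request.cart_value)
        # ... do the actual work here ...
```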

Charity Majors: Yes, for sure. Well, maybe I'll talk just briefly about the origin story for Honeycomb, which is kind of the origin story for observability. I know Liz has her spin on it coming from Google. But when we started Honeycomb back in 2016, observability wasn't in the air. I think Twitter had used observability engineering for the name of its team once, but it wasn't a thing. When Christine and I started Honeycomb, it was after this experience we had had at Parse. Parse is this mobile backend as a service that was acquired by Google...sorry, acquired by Facebook, I wish. Acquired by Facebook and shut down.

The experience we had was that Parse was doing a lot of stuff that was before its time. We were doing microservices before there were microservices. We were doing these multi-tenant platforms when all we had to understand them were metrics, aggregates, and logs. And Parse was going down every day. It was professionally humiliating to me just how often the site was going down. And we tried everything too, you know, all of the current best practices, and nothing worked until we got acquired by Facebook and we started using this tool called Scuba.

We got some data sets from Parse into a Scuba data set, and in the time that it took us to identify what the problem was... it was a different problem every day, a different app hitting the iTunes top 10, and there was just no predictability to what was causing it on a day-to-day basis. The time it took us to identify and solve these problems after we got data sets into Scuba dropped like a rock, from open-ended hours or days, where we might never figure it out, to not even minutes, to like seconds. To the point where it wasn't even an engineering problem anymore, it was a support problem.

This had a huge impact on me because we got a hold of our reliability issues, and we moved on with our lives, and it wasn't a thing anymore. Then, when I was leaving Facebook, I suddenly went, "Oh, shit, I don't know how to engineer anymore without the stuff that we've built on top of this." It was like the idea of going back to the Dark Ages. I was reading the other day about when pilots went from only flying by visual reference to having instrument ratings, being able to fly with their instruments. It was like the idea of going back to only having VFR again, right? That's why we started Honeycomb.

It was like six months into Honeycomb when... The hardest job in the world is product marketing, I swear to you. Because we were trying to figure out how to describe what we were doing. We knew it wasn't monitoring, but what was it? In July, six months after we started the company, I googled the term observability and I read the definition, right? It's all about how well you can understand what's going on inside the system just by asking questions from outside the system. I just had light bulbs going off like, "Oh my God, this is exactly what we're trying to build." Right? So we started trying to define observability for systems. The definition caught on quickly, a little too quickly. But then when Liz joined, a couple of years later, you had a very different angle that you were coming at this from. Right, Liz?

Liz Fong-Jones: So I lived in the future, right? I worked at Google. We had all of this super-advanced technology, including tracing, including metrics, and we'd done work to pioneer stitching these two things together in a way that was available to site reliability engineers at Google. But one of the challenges was that as much as Google lives in the future, it also reinvents everything, because it's inventing the future. Therefore, there's this huge gulf between... Towards the end of my time at Google, I was working with Google Cloud customers.

"How do we help bridge this divide between people who are using Google Cloud versus people who are using the shiny, cool, and interesting Google stuff?" We didn't use the word observability for this. It was-  what is your monitoring situation? What is your position with service-level objectives? But what I realized was that the most effective way for me to level at the industry was going to be helping everyone, not just the people at Google maturity get good observability. So that's kind of how I came to this.

Monitoring vs. observability

Charity Majors: I love that phrasing of you lived in the future. Because that's exactly the experience that we had at Facebook, right? Living in the future tool-wise, but also living in the past or the present in terms of the rest of our stack. Just wanting to bridge that gap is exactly where this came from. You talked about SLOs, which, when you started at Honeycomb, was the first thing. I think it was on day one, you were like, "Charity, it's time to build an SLO product." And it took me a little while to understand the vision, but I completely understand it now.

Liz Fong-Jones: One of the key things I'd seen people struggle with was that at Google, we really, really encouraged people to stop alerting on stupid things, to stop alerting on the CPU being too high or the disk consumption being too high.

Charity Majors: The symptoms.

Liz Fong-Jones: Instead, we advocated that people set goals for user experience, or service-level objectives, and alert only on those. To alert only if, over a 30-day window, it looks like you're going to stop having 99.9% of your requests succeed. If it looks like 99.8% of your requests are going to succeed instead, then you need to intervene and stop that. But the problem that we saw with adoption was that people were like, "But you can't take my safety blanket away from me, how will I be able to debug things if I don't understand which individual servers are having problems?" They were stuck in this mode of, "The only way I can debug things is starting from the very bottom of the stack and looking up, even if things are randomly flapping at the bottom of the stack all the time." So that was part of...
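[Editor's note: as a concrete illustration of the arithmetic behind that kind of alert, a 99.9% target over a 30-day window gives you a fixed error budget, and you page only when you are on track to burn through it. The traffic numbers below are invented; this is a back-of-the-envelope sketch, not code from the book.]

```python
# Back-of-the-envelope error-budget math for a 99.9% SLO over a 30-day window.
# WINDOW_REQUESTS is an assumed traffic volume, purely for illustration.
SLO_TARGET = 0.999
WINDOW_REQUESTS = 100_000_000
ERROR_BUDGET = (1 - SLO_TARGET) * WINDOW_REQUESTS   # ~100,000 failed requests allowed

def budget_remaining(failed_so_far: int) -> float:
    """Fraction of the 30-day error budget still unspent."""
    return 1 - failed_so_far / ERROR_BUDGET

# Alert on the trajectory, not on a single host's CPU: if the budget will be
# exhausted before the window ends, a human needs to intervene.
print(budget_remaining(40_000))   # ~0.6 -> roughly 60% of the budget left
```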

Charity Majors: And I think a conclusion that you and I kind of came up with was, you know, that you need observability to understand from the bottom-up what's happening, and you need SLOs to understand from the top-down, whether or not you should care.

Liz Fong-Jones: Right. And observability is that connective tissue to help you systematically debug from the top-down to figure out where in the stack is this problem happening. Rather than having everything in the stack blaring at you at the same time and playing whack-a-mole.

Charity Majors: Because it's a super core principle of observability that everything matters right down to the raw rows, right? The way that we can take your data and bubble up, you can just draw a bubble and go, "Boom, that's where it is," is only by capturing all that really rich context, right? And then, you know, of course...

Liz Fong-Jones: Capturing the really rich context, but crucially, being able to analyze it. You can have all the data in the world, but if you can't actually analyze it, and slice and dice it in real time, you don't have observability. I think that's the key thing. People have been collecting logs and having them sit on a shelf for ages.

Charity Majors: Yes. George also has a background in operations, which is part of why I think we all gelled so well here.

George Miranda: Which is what I was gonna bring up, right? While you two had both been living in the future, I was a product of the past or the present. And Charity, I love the VFR example that you're making, right? Because really, that's what we're talking about. It is that leap from "I see and can viscerally connect with what is happening because I'm so intimately familiar with the system," which is how I was used to thinking about production, having to gain that knowledge over time, and having to gain that experience and troubleshoot through intuition. Switching to a mode where you can use your instruments and you can rely on them.

I don't know if you know this, but part of your flight test, at some point when flying with instruments, is sort of Luke Skywalker style. You put on a face shield and you can't see out the window. You're supposed to fly just with your instruments. Observability kind of gets you to do the same thing, which is realizing that it is now technologically possible to gather and analyze this much data. That was an unfathomable amount of data that simply was not possible to handle, you know, even as far back as a decade ago. And as an industry, there are so many things that we have normalized when it comes to intuitive troubleshooting, right?

Liz Fong-Jones: Yes. We made so many compromises. Right? You have to use metrics because metrics are the only way that you can handle that much scale. And that was true 10, 20 years ago; that is not true today.

George Miranda: Right. I think making that mental leap is something that I had to do coming into Honeycomb. I think part of the reason that we all gelled so well is that that's the perspective I brought to the table with the book, right? So, you know, not having quite made that transition, how do we lay the trail successfully, how do we figure out a methodical set of steps to step away from troubleshooting by intuition? So, Liz, when you say "Don't take away my security blanket," that's kind of what I think about, right? I am so used to doing things that way. How do we move back to, or I guess forward to, a more data-driven, hypothesis-driven, scientific approach to troubleshooting? That's what computer science should be. We just aren't used to doing it.

Liz Fong-Jones: Yes. But I think it's scary to people, though, to move from kind of making leaps of intuition and being like, "I know this. I can figure it out instantly." With observability, people have to rewire their brains to think, "I'm not necessarily going to either figure it out instantly or take three hours. Instead, if I follow this methodical set of troubleshooting steps, powered by my tooling, I will be able to solve every problem in 5 to 10 minutes."

Charity Majors: This is why people often ask us, "Well, do I need observability if I have a monolith?" And it's like, not as much, you know. Because when you have a monolith you can pretty much tell at a glance from your dashboards whether it's the web, the database, or the app tier, right? But that's not to say you don't need instrumentation and telemetry. It's not that it isn't better to have it. Right? And it's that panicky feeling that people get when they don't know what's going on.

Most of our customers have come from people who are on the edge of that, who have felt that, who have seen what happens when their intuition can't keep up. That's when they're willing to invest in learning how to do things a slightly different way. Then it's never as bad as they thought it would be. 

Don't get me wrong. My niche as an engineer was being the first infrastructure engineer, being the person who built the thing, who knows the thing better than anyone else. I love that godly feeling that I get from looking at a dashboard, stroking my beard, and going, "It's Redis." I know it doesn't say it anywhere, but I just know that it's Redis, and I feel like a god, and you can't take that away from me.

Liz Fong-Jones: Your big, giant, bushy, gray beard, Charity.

Charity Majors: Yes, exactly. I stroke it. It's there. I miss that feeling too. But then also, when I went on my honeymoon to Hawaii, I was getting called at 3 a.m. and 4 a.m. because Parse was down and MongoDB was broken and nobody could figure it out. So they were calling me, and it doesn't scale. You know, it doesn't scale. And our architecture, our systems, are getting to the point of complexity where it doesn't matter how bushy or gray your beard is, there are still going to be more problems that your intuition hasn't encountered before and cannot debug than there will be ones that you're familiar with.

George Miranda: And we accept that as normal, right? And we accept it as sort of these...like, there's a set of unknowable things that happen in production, right? Production is unpredictable and it's chaotic. And so nobody wants to touch it, or break it, or, like, you approach super carefully. And we're starting to think...

Liz Fong-Jones: Or to put it even more starkly, we're talking about this thing of, oh, there's no way we could understand that without pushing new code. It turns out that the key difference between monitoring and observability is moving from "we have to push new code to understand things" to "we've already added the right instrumentation to debug things that we never anticipated would be a problem when we wrote the instrumentation."

Charity Majors: I know that cost models are not the sexiest thing to talk about with engineers, but, like, the cost model for metrics scales linearly with every question that you want to be able to ask, right? Every custom metric that you capture is another increment of cost, right? Versus the model with observability, where you are capturing these arbitrarily wide structured data blobs, like one per request per service, and just, you know, throwing another key-value pair, or 2, or 3, or 10, or 15 onto the event is free, right? And so you're incentivized to capture everything that might be useful someday, and you're not getting priced out of it.

I mean, it's free if you're using Honeycomb to add more detail to your events, because you're priced per event. It's basically free no matter what you're doing, because adding a couple more bytes onto, you know, a packet that's already getting shipped is not that expensive. It's not free, free, but it's approximately free for most people. This means you don't have to sit there and, like, cull things, you know, "We haven't used this in a while. I'm gonna rip this out because I want to save on my operating costs."

Cardinality

Liz Fong-Jones: It hurts my heart every single time I see a conference talk about how to limit cardinality or how to pre-aggregate things to save money on your metrics. It's like, "oh, no..."

Charity Majors: Let's talk about cardinality for a second for the kids at home.

George Miranda: Sure.

Charity Majors: George, do you want to define it for us?

George Miranda: Yeah, sure. So, cardinality is the number of unique values a particular field can take. Something with low cardinality could be something like gender, for example, right? It's more than just binary, but there are only a handful of possible values.

Charity Majors: I'm not gonna fight you on this one, but I...

George Miranda: Right. Sorry, I couldn't come up with a better example on the fly. But something like customer ID or transaction ID, right, is very high cardinality, nearly unique per row. And that's where the challenge with high cardinality comes in. Liz or Charity, do you want to take that?

Charity Majors: Well, the challenge with metrics is they're just not built for it. Modern metrics systems have gotten to a point where they can afford a couple of high-cardinality dimensions, but you're gonna blow out the keyspace. You'll blow it out rapidly and it's gonna cost a lot of money. It all comes down to how we literally lay the data down on disk, and it's just not...

Liz Fong-Jones: Which is one of the chapters that we have in the book, right? I think this is why I'm excited to methodically lay it out in writing, systematically, for the people who learn best through writing. Because Charity, George, and I have been talking about this for ages. But at some point, you have to see that argument laid out. You have to see that diagram that shows that you are wasting a lot of bytes on disk, you're wasting a lot of room in your indexes, if every single index entry is unique, right? Like, that's the problem with high cardinality: if a value or a combination of values only appears once, you're paying all this overhead to store the name of the index, and then there's a singular value at a single point in time that never gets reused again, right? That's why people in the metrics world have been saying, "You have to reduce cardinality." And we say, no, you don't have to reduce cardinality, you have to embrace cardinality.
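[Editor's note: to make the keyspace blow-up concrete, here is a toy calculation with invented tag counts, not numbers from the book. In a metrics system, every unique combination of tag values becomes its own stored time series, so the keyspace grows as the product of the cardinalities of the tags you attach.]

```python
# Toy illustration of why high-cardinality tags blow out a metrics keyspace.
# Tag names and counts are made up; each unique combination of tag values
# becomes its own time series that has to be indexed and stored.
from math import prod

tag_cardinalities = {
    "endpoint": 50,        # low cardinality: fine
    "status_code": 10,     # low cardinality: fine
    "host": 300,           # starts to hurt
    "user_id": 1_000_000,  # high cardinality: blows up the keyspace
}

series = prod(tag_cardinalities.values())
print(f"{series:,} potential time series")  # 150,000,000,000
```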

But I think there is an interesting thing here, which is that Charity talked about the fact that observability makes it so that the cost does not scale with the number of questions you have to answer. One of the main pieces of pushback that I always saw against wide events in observability was this idea of, oh my goodness, it's going to scale linearly with the amount of traffic my service receives. I think that's where another innovation comes into play, which is why we have a chapter in our book about sampling, about how you can have it not scale linearly with the size of the service.

Charity Majors: Which is a dirty word in production engineering. I blame the logging companies who've been making bank for decades just saying, "Save every log message. Don't let a single one drop." This is unfortunate because, if you look at your NGINX log, do you care about, say, your health check requests with the same amount of caring as you care about, say, errors at /payments? No, you don't. Health checks, in a modern system, can often account for 20% of your load. Like, keeping 1 out of 20 of those requests that are all the same, all 200s, all from the same location to the same location, there's a huge cost savings. There's a lot you can do to reduce your cost there without any downgrade in reliability or in your ability to ask complex questions.
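[Editor's note: here is a minimal sketch of what that kind of head sampling could look like, assuming you record the sample rate with each kept event so the backend can re-weight counts later. The routes and rates are invented for illustration; this is not the sampling implementation described in the book.]

```python
# A sketch of per-route head sampling: keep every error and every /payments
# request, keep roughly 1 in 20 identical health checks, and record the sample
# rate so queries can multiply counts back up. Routes and rates are illustrative.
import random

def sample_rate_for(event: dict) -> int:
    if event["status_code"] >= 500 or event["path"] == "/payments":
        return 1          # always keep the interesting stuff
    if event["path"] == "/healthz":
        return 20         # keep ~1 in 20 health checks
    return 4              # modest default for everything else

def maybe_send(event: dict, send) -> None:
    rate = sample_rate_for(event)
    if random.randint(1, rate) == 1:
        event["sample_rate"] = rate   # lets the backend re-weight counts
        send(event)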

George Miranda: So, taking a little bit of a step back, I think what I like about the book is that we start with an outcome: what we're getting at is data-driven debugging, stepping away from debugging by intuition and toward debugging by hypothesis. What are the things that are necessary for that to be true? Like you just mentioned, right, being able to analyze all of that telemetry data. Well, first, being able to just gather all that telemetry data, right? And having a data store that can handle that is step 1, right? There might be some things that you need to do at scale, like sampling, for example, or managing telemetry pipelines, but...

Charity Majors: The scale is a lot bigger than people think it is.

George Miranda: Scale is much, much bigger, right? And so what I do like about the book is that we start with, what are the fundamental building blocks? From a technical perspective, what do you need to do to gather that telemetry and store it? How should you instrument your applications? And then what kind of analysis do you need to do to debug in those methods? Right? And then we sort of methodically walk through what are ways to automate it? What's the kind of data store that you need? And what functionality do you need so that if your data set is, you know, millions of rows, or billions of rows, or hundreds of billions of rows, how does it need to behave? Right? How do you get answers fast? How do you need to construct things to get to these outcomes?

I know we spend a lot of time talking about Honeycomb's pricing model, for example. Right? But I think what I want to get across is that what we do with the book is...I think there are maybe a handful of times in there when we relate things back specifically to Honeycomb for product-specific examples of this is what, like, a TraceView looks like.

Liz Fong-Jones: This is what an implementation of this looks like.

George Miranda: Exactly. Right. But here are the things that you need if you want to build to this type of outcome. Right? And so, I'm hoping that's what people get.

Liz Fong-Jones: Right. I think Charity called it showing our work, right?

George Miranda: There you go.

Liz Fong-Jones: People for a very long time said, "You can't possibly do that. That doesn't make any sense." We're like, "No, actually, we'll show you." Right? Like if you want to do it, you can do it.

Definition of Observability Engineering

Charity Majors: There are so many definitions of observability proliferating. I'm wondering, can we all come up with our favorite? What's your favorite definition, Liz, of observability?

Liz Fong-Jones: A favorite wrong one or a favorite great one?

Charity Majors: Oh, that might be more fun. Both.

Liz Fong-Jones: Okay. Well, let's start with my favorite wrong one, which is, you know, you have observability if you have logs, traces, metrics, profiles, every single signal type that your vendor is trying to sell you as a separate add-on. That's my least favorite definition of observability. But my favorite definition of observability is the one where our customers say, "I can see and understand my systems for the first time." I think that's where the magic is.

George Miranda: I think that's also a little fuzzy, though, right? Because of some of that experience, what people hear with observability is another synonym for visibility, right, where it's turned into a synonym for monitoring as well. What we frequently talk about is, you know, that mean time to WTF, right, which is that experience that you have when you wire up your applications and you start observing what your requests in production are actually doing. And suddenly, there's this moment of, "Wait, my code's not supposed to work that way. Wait, it's doing what?" That's when it suddenly clicks that this is a fundamentally different way of seeing what your code is doing. And there's not a great way to explain that unless you see it for yourself in a context that you thought you understood one way, and suddenly your mental model shifts and you're like, "Oh," right, and then you get it. So, yeah, that's... I don't know, but that's what I see. What about you, Charity? What's your favorite definition?

Charity Majors: I used to lean a lot on the unknown-unknowns type of definition. Monitoring is for your known unknowns and observability is for your unknown unknowns. But increasingly... the three pillars thing bugs the shit out of me. And the three pillars thing dates to a blog post that Peter Bourgon, who's amazing, wrote in 2018, you know, two years into when we were trying to define Honeycomb as, you know, observability, etc. He came up with this blog post that's like, there are three pillars: metrics, logs, and traces. And immediately, every vendor in the world latched on to it. They were like, "Great. We have observability, too, because we have a metrics product, a logging product, and a tracing product."

And we're just like, "No. Because it's worse if you have to pay to store it three times, and then you have to have a human in the middle who's, like, bouncing back and forth, and copy-pasting ideas around trying to get it. It's not observability." But that aside, the definition that I've really settled on is, that it's about high cardinality, high dimensionality, and explorability, right? There are a lot of BI tools that have had the support for high cardinality and all these things for a very long time. But what they tend to lack is explorability. We talk to teams, even to this day, we're like, "I can answer any question about these systems. But you know what, give me those questions and I'll get back to you in two or three weeks." And we're just like, "What?"

Liz Fong-Jones: Right, it's about speed, right? Can you actually do this iteratively? I think, Charity, you say, be in constant conversation with your code, right? Like, is it a conversation, or is it, you know, trying to talk to Voyager 1, where you send a message and it comes back, you know, months later?

Charity Majors: I strongly feel that we should target the 95th percentile of our query speed. I know this is very Honeycomb-centric. But, you know, we try to make everything return in less than a second. Because when you're debugging, this isn't like dashboards where you're just, like, sitting there flipping through a bunch of dashboards, trying to pattern match with your eyes like, "Well, there's a spike here. So flip, flip, flip, flip. Oh, there's a spike there, they must be related." Right? Like, that's not science. That's not debugging. That's just pattern matching.

But what are you doing? Well, often you're starting somewhere, and then you're iterating a lot, right? You might start with something that's like a mile from the end, but you're just like, "What about this? What about this? What about this?" And every time you ask a question, you know, you're checking yourself, and then it might not pan out, so you roll back and go, "Okay, I know I had it like five queries ago, right?" So you go back, and you find it, and you start going again. But that needs to be iterative, right? It can't break your flow. It can't be the kind of speed where you have to get up and go to the bathroom or get coffee, because that'll break your entire train of thought. It can be a few seconds, maybe, but you need this: what about this? What about that? What about that? Because you're on the trail, right? You're debugging something. You've got the suspect in your sights.

Liz Fong-Jones: Which is funnily enough, why Honeycomb used to be called Hound.

Charity Majors: Yeah, it's true. It was even Bloodhound at one point, although that didn't last more than a couple of months. It's a very aggressive-sounding name for your company. Funny story, I love this story. It's a little bit of a detour. We were called HoundDog SH [SP] for the first two years, and then we got served a cease and desist from HoundCI, which, you know, they have to defend their trademarks. No fault there. But we had like an all-nighter, we're like, "Which of these names?" And Honeycomb, we loved, but it was owned by some friends from Slack.

We reached out and we asked them if they would sell us the domain names. Because it was like one of Stewart's pet projects or something. They came back and said they would gift it to us, along with the URL shortener hny.co, just for the price of the legal fees to do the switch. So, you know, mad love for Slack. And hilariously enough, Slack became one of our earliest...probably our first real customer at scale. They ended up contributing a couple of chapters to the book that we're working on. George paired a lot with Frank Chen, on their CI/CD platform in particular. Do you want to talk about that a little bit?

Case study: Slack

George Miranda: Yeah, sure. So, I think this goes back to the earlier discussion, right? When you think scale, scale is usually a lot bigger than most people think.

Charity Majors: Slack is scale.

George Miranda: And Slack is scale. Frank Chen contributed a chapter on how Slack is using observability in their CI/CD infrastructure, like, looking at their pipelines. One of the things that I found fascinating was I got a chance to spend a lot of time talking to Frank about the particulars of the use case, and how they're doing things, and just sort of uncovering some of the whys. I guess, you know, in my operations time, our pipeline farms were, I don't know, maybe a dozen, two dozen servers tops at the high end, just pushing changes all the time.

Slack's footprint just for pipelines is enormously huge. The underpinnings of what might be changing, different versions of libraries, different versions of plugins, it's massive. They are pushing tens of thousands of changes. There is so much instability that could be introduced when one small component changes, and you would never see it. And so they started applying observability concepts to their pipelines and treating it pretty much as a production application. Because for Slack, right, shipping that many changes is part of production.

Liz Fong-Jones: For a lot of companies, their number one expense is the salaries of their developers. It turns out, that is really, really, really important to protect. And this is why the whole developer tooling space is so important because people have recognized that you have to invest in your number one asset. Throwing away hours, days, or weeks of developer time is just not a good idea.

Charity Majors: It goes back to that study that Stripe published that showed that engineers self-report that they waste 42% of their time on orienting themselves, trying to reproduce bugs, tech debt, just trying to figure out what's happening, stuff that does not move the business forward. So it does not do anything but frustrate the engineers who are working on it. They call this something like the $2 trillion gap or something, the opportunity for developer tools to make engineers more productive and their lives better.

George Miranda: What I'm hoping is that this book has a little bit of something for everyone. So everything from the introductory underpinnings of why events are the building block of observability, right when you're just getting started, to some of the business cases for why observability matters, like what it means for your business to move faster, saving those developer salaries, etc., and some of the use cases on the operation at scale side of the spectrum. And so, Liz, we had another contribution from Slack around telemetry management pipelines. Do you want to say a little bit about that?

Charity Majors: I have one thing to say real quick about CI/CD, which is that most people don't realize that they can instrument their build pipelines as a trace. This has been the number one use case for Honeycomb's free tier: people just instrumenting their CI/CD pipelines, and suddenly they can see, "Oh, which tests are slow? Oh, where's that time going? Oh, why is it taking me so long to build, ship, and deploy?" And that's a great use case. And that was one that Slack...no, it was Intercom that came up with it. That was not a Honeycomb invention, that was an Intercom invention. They started instrumenting their CI/CD pipelines as a trace and they were just, like, mind blown. So cool. Sorry. Over to you, Liz, telemetry.
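[Editor's note: for a feel of what instrumenting a build as a trace could look like, here is a sketch using the OpenTelemetry Python API, with made-up step names and commands. Each pipeline stage becomes a child span, so slow tests show up the same way a slow downstream service would. It is not the setup used by Slack or Intercom.]

```python
# A sketch of emitting a CI/CD pipeline as a single trace with OpenTelemetry.
# Step names and commands are illustrative only.
import subprocess
from opentelemetry import trace

tracer = trace.get_tracer("ci-pipeline")

def run_step(name: str, cmd: list[str]) -> None:
    # Each build step becomes a child span, so slow steps stand out in the waterfall.
    with tracer.start_as_current_span(name) as span:
        result = subprocess.run(cmd)
        span.set_attribute("ci.command", " ".join(cmd))
        span.set_attribute("ci.exit_code", result.returncode)

with tracer.start_as_current_span("build-and-deploy"):
    run_step("unit-tests", ["make", "test"])
    run_step("build-image", ["make", "image"])
    run_step("deploy-staging", ["make", "deploy"])
```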

Liz Fong-Jones: This was an addition to the book that we were debating and going back and forth on, because it's only applicable at certain scales: the idea of how you route and standardize all of the telemetry that's flowing in. For the majority of folks, the OpenTelemetry Collector will do just fine. But we landed on, having Slack as one of the most sophisticated builders of their own observability data pipeline, that we should offer them the opportunity to talk about some of the challenges, why they did it, whether or not you should do it, to make observability feel like it's possible at large scales, and not just at medium and small scales. To talk about those real challenges of what happens when, you know, practically everyone in the digital world is using your platform and you're generating all of this data, like, how do you keep the clutter under control? Right, like, we thought that was important to include in the end.

How does observability relate to DevOps, SRE and cloud-native?

Charity Majors: Switching gears just slightly. How does observability relate to DevOps, SRE and cloud-native?

Liz Fong-Jones: As I said when we were talking about origin stories, at a certain point you cannot practice SRE without practicing observability. We spell that out in a lot of our tutorial material on SRE, where we said one of the key responsibilities of SRE is appropriate monitoring. If I were re-recording those videos today, I would use the word observability.

Charity Majors: Provide it for the entire company, for all of the teams?

Liz Fong-Jones: Yes, to be a force multiplier. Because SREs should not be the only consumers of this observability, but SREs and platform teams should be making sure that observability is brought to everyone. So I think that's kind of one answer to how observability relates to SRE and DevOps and all these things: if you do not have observability, like, you are not fulfilling your mission of ensuring that your teams can provide reliability.

Charity Majors: And this is an area where...

Liz Fong-Jones: Because there is another angle...

Charity Majors: Just to interject, this is an area where I feel like observability is a step up from monitoring. Because monitoring, traditionally, was concerned with very low-level statistics, right? And you had the ops team sitting there, almost like a translation layer between the systems and the engineers who are trying to ship code. And they're like, "Aha, so you made this change and that made this spike," you know, across four graphs, "this spike in your memory graph, and this spike in your CPU." But it wasn't talking to them in the language of endpoints, variables, and functions. So they couldn't parse it on their own unless they were ops people. And that's a big part of observability that I think people don't talk about very much, which is just that, you know, it's very consciously in the language that software engineers are using every day.

Liz Fong-Jones: So to go to the other kind of intersection with DevOps and SRE, the other thing, when we talked earlier about technical requirements for observability: it turns out that there are technical requirements to implement monitoring, but there are prerequisites to observability too. If you cannot ship code to begin with, no amount of observability is going to help you, because you're not going to be able to take those insights from looking at your code and figure out, how do I take action? How do I fix it, right? It doesn't matter if it takes 10 seconds to query and five minutes to understand the issue, if you're then stuck waiting three months for the next quarterly build, right?

It's challenging to get started with observability if you can't enrich your code with those rich attributes, right? Sure, automatic instrumentation can kind of monkey-patch things and get you some visibility to start with. But ideally, the power of observability comes from enriching your data with business-logic-specific attributes. But again, if it takes three months to get those pushed to production so that you can then use them to run the queries you couldn't predict in advance, like, yeah.
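[Editor's note: in practice that enrichment is usually a one-liner wherever the business logic already knows something interesting. A hedged sketch with invented field names, again using the OpenTelemetry Python API rather than anything specific from the book:]

```python
# A sketch of enriching auto-instrumented telemetry with business-specific
# attributes. Field names like "app.plan" are invented for illustration.
from opentelemetry import trace

def apply_discount(order):
    # Auto-instrumentation has already started a span for this request;
    # attach the business-level fields a human will want to filter and group by.
    span = trace.get_current_span()
    span.set_attribute("app.customer_id", order.customer_id)
    span.set_attribute("app.plan", order.plan)
    span.set_attribute("app.discount_code", order.discount_code)
    # ... apply the discount ...
```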

Charity Majors: Or if you've got devs or DevOps teams who aren't allowed to touch the code, and they're saying, "Do whatever you can without shipping any code whatsoever; that's too dangerous, we don't trust it." Like, you can only get so far treating your code as a black box, right? That's not observability. You can monitor. You can say it's returning between these thresholds, it's good or it's bad. But you can't get any of the rich detail about what's... Like, it's not observability if you're not able to see inside the box and have it explain back to you what's going on.

George Miranda: I think where it gets interesting is when you as a developer can start writing instrumentation along with new features, with the idea that when you deploy to production, you can see what is happening as soon as that code starts getting real production traffic, right? And I think when we talk about the intersections, when it comes to DevOps, really what we're talking about is service owners. Right? And if you are a developer that is close...

Liz Fong-Jones: Service ownership and shipping faster in kind of smaller units of delivery, right? Like, all those principles of DevOps are supercharged by observability.

George Miranda: Absolutely, 100%.

Charity Majors: And it's not just observability. Observability is kind of, I think, the prerequisite, as Liz was saying, but it's also feature flags, it's also putting software engineers on call for their code, it's also chaos engineering. I feel like there's been this big pendulum swing over the past five or so years: we've gotten everything we can get out of pre-processing our code before it goes into production. We reached sharply diminishing returns because most of the really interesting things that happen are only going to happen in production.

Increasingly, all the tools you get a lot of bang for your buck out of are the ones helping you slice and dice in production and giving you better visibility into production. Progressive delivery is, I think, a huge thing. It's all about getting to production more quickly, and then being able to segment your traffic, or tell what's going on, or ship to certain users and not others. There are a lot of different ways you can combine these things to bring in more reliability and predictability, but it starts with observability, and it doesn't stop there. You're all like, "Yes."

The other thing that I think is interesting is cloud-native. Because, well, this is a buzzword that I don't use because I find it personally obnoxious. I also think that the shift from monitoring to observability is kind of a step function, and often happens when people adopt the cloud. Because if you think about it, the more ephemeral and dynamic the infrastructure is, the less those old patterns of predictable, black-box, up-down monitoring apply, and the more manual massaging they need. You want instrumentation that can keep up with dynamic infrastructure. And if you think about it, everything's...

Liz Fong-Jones: It's the transition from monoliths to microservices, and from fixed servers to ephemeral servers, right? Like, those are kind of...

Charity Majors: Everything.

Liz Fong-Jones: That's how I'd decompose what the buzzword means. And again, you need to have observability because there's no longer one place to go and grep through the logs. And it turns out to be horrendously expensive; people pay these gigantic bills to send their data to a log aggregator, right? It's like, no, there's a better way. There's a better way.

Charity Majors: Everything's a high-cardinality dimension these days. And the first high-cardinality dimension people usually run into is hostname. Right? A lot of people will spin up their Prometheus, or their Datadog, or whatever, and they start thinking about tags to append: hostname. That seems a really obvious thing to append, right? Then they get up to a couple of hundred hosts, and suddenly, nothing works anymore. That's when they run into cardinality.

Liz Fong-Jones: Or a lot of containers. It's like, "Sure, you might have a couple of hundred hosts, but you're running tens of thousands of containers, oops."

Charity Majors: Exactly. Not being able to tag that with the container name.

Liz Fong-Jones: And this is why people love to bill by the number of hosts or containers. It's like, "Ah, no, no, no."

Charity Majors: Exactly. Exactly.

Main takeaways from the book

Liz Fong-Jones: So, George, what do you think people could get from our book?

George Miranda: Well, I think like I was saying earlier, I hope there's a little something in there for everyone. I will speak from my own experience. For somebody new to the world of observability, trying to understand not just the technical differences, but the changes in behavior that are necessary to wrap your mind around a new way of troubleshooting. We sort of cover that in parts as we go throughout the book. And then I think we build to a point where we even cover what that means from a culture and a team perspective. Right?

And some patterns for adoption, and ways to grow champions in an org to help you push your adoption along. I think that end-to-end journey of covering what it means from, I guess, just the definition perspective, what it means from a technical perspective, and then what it means from a cultural and adoption perspective, I hope is something that will, one, help disambiguate the term a bit, and two, give people something to dig into, and hopefully drive the conversation forward, right? So we're not just focused on pillars, but on a better way of understanding and debugging systems, right? That's what I want to see people get.

Charity Majors: We talk about Honeycomb a fair bit in the book, but ultimately, if Honeycomb failed and observability succeeded, I would be happier than if Honeycomb succeeded and observability failed. There were lots of times in the first five years of the company that people were leaning on us so hard: "Do metrics, go where the money is, do these things." And we held fast, and we thought we were gonna fail there for a while. We haven't yet, thank goodness. We all come from operations backgrounds, and we're all familiar with the pain, the agony, and the ecstasy of being on call, and of being that last...you're like the last best hope of survival. Like, everything's down, nobody's gonna get it up but you. Right? We just have so much empathy for the people who are in this situation. And, you know, as we all enter our 30s and 40s, we don't want to get woken up all the time.

Liz Fong-Jones: No one should have to suffer through that on call. No one.

Charity Majors: No one should have to. I feel like people think it's a necessity, like it's a prerequisite for having a good software system. It's actually the opposite. I think that the happiness of developers and the happiness of users tend to rise and fall in tandem. You never see them diverge for too long, right? You don't have miserable developers and happy users, or vice versa. The care and feeding of your engineering teams, this is where...I think we talk a little bit in the book about this, but making room in the product roadmap for actual tech debt paydown and for reliability work, and making it so that it's not... It's everyone's job to make sure that the business keeps running. It's not on any one group of heroes and saviors to man the barricades. That may be how we grew up, but it's not healthy and it's not sustainable.

We see that more than ever now because systems are so complex that, honestly, if you didn't have a hand in writing them or supporting them, you can't troubleshoot them, right? You have to get inside the code, you can't just fight the fire like they're black boxes. And I feel like the original sin of...the pre-DevOps sin was like, "Okay, engineers, you write the code. Okay, ops people, you run the code." And the entire DevOps movement was born out of the fact that this is a lie. You can't build and run software systems this way. Right? There can be specialties. But in the end, every single one of us needs to be writing code and supporting the code that we write.

George Miranda: And know how to debug the code, regardless of how familiar we are with it.

Charity Majors: Right. We can't outsource that. You can't outsource understanding your systems.

George Miranda: And it turns out that it doesn't have to rely on your rock star engineers, right? You don't have to rely on the senior-most person to answer every question. There can be a repeatable method and a way of understanding and combing through that data to objectively figure out where is the problem? And that, I feel like, you know... You referenced, like, the original sin of DevOps, and somewhere in there, we just concluded that you know what, there's no possible way to understand what is happening with this code. And I think it's time that we resolve that lie.

Charity Majors: Yeah. For sure. Absolutely. And the thing is, we've seen this happen now. We've seen this happen on our teams. It used to be, like George has said a couple of times, that pretty much the debugger of last resort was always the person who'd been there the longest, right? That was me, that was my niche. Right? And it's a good feeling, but it's a better feeling when the best debuggers are the people who are the most curious, the people who look at their code after it's been shipped, every time.

Because if you aren't looking at your code after you've shipped it, every time, you don't know what normal is, right? You only know what normal looks like if you see it every day. And part of our mission at Honeycomb, I think, has always been inviting people into production and trying to make it friendly, trying to make it so that it's not the scary walled-off thing where only the experts dare tread. If you're writing code, you need to understand your code.

Liz Fong-Jones: But we hadn't offered people the tools to do this before, right? It was, "Oh, you have to be a Charity Majors or a Liz Fong-Jones, or some kind of systems engineering wizard, to operate systems." And it's like, no, we can make it so that it's not intimidating to operate your code. Every engineer should be able to do it and be supported in doing it. It's not like we're subjecting you to a bunch of pain and masochism.

George Miranda: What I like about doing this book in tandem with O'Reilly is that we've had to be very objective and instructive about how to build a solution like this as well. If you want to go down the path of really doing this, here's what you need to know about what the outcomes are, how it needs to operate, how it needs to come together, and what the functional types of questions are that you need to answer. And like Liz said, the rest of it is, let's show our work, right? Here's an actual example of this. And so, hopefully, anybody can pick up this book and wade their way through what it means to build this, what it means to operate this way, and what observability is.

Charity Majors: I couldn't have said it better myself.

Liz Fong-Jones: And I think the other cool thing, as you mentioned, George, is working with O'Reilly, it's been great being part of the broader ecosystem of similar books that have been published by O'Reilly. I know that for one, I was very, very inspired by the original kind of SRE books, you know, Site Reliability Engineering, the Site Reliability Workbook, and Seeking SRE as these resources that define what it means to be an SRE. And we're hoping to do the same thing for observability.

Charity Majors: Yeah, yeah. The SLO book was great. Another one that's coming out, not O'Reilly, but...

Liz Fong-Jones: Oh, yeah. The SLO book is separate from the SRE book. The SLO book...

Charity Majors: The SLO book is another book, by Alex Hidalgo. It's a great book. It's very practical. It's very friendly. It's very inviting. And then another Alex...Alex Boten from Lightstep is writing a book on OpenTelemetry that will be out about the same time as ours, I think.

Liz Fong-Jones: There was also a book already published, again by some of the folks at Lightstep, on distributed tracing in general. So there are lots of really fascinating books in the space, spanning from cloud-native applications and Kubernetes, to SRE, to OpenTelemetry, right? I think that kind of helps put together this picture, because there are so many things we wanted to include in our book, but we couldn't. And it turns out other people have done it. It's great.

Charity Majors: Thank goodness. All right. This has been super fun. Any last words? No?

Liz Fong-Jones: I think the only thing I'd add is, if you're listening to this, if you're an O'Reilly Safari subscriber, you can get access to the book raw, unedited. Well, you know, less unedited now. We've had several passes. But you can access kind of our work in progress over O'Reilly Safari. And if you're not an O'Reilly Safari subscriber, Honeycomb is providing complimentary access to the early release of the book. We'll drop a link to that in the show notes.

Charity Majors: Yeah, honeycomb.io/blog is where we drop a lot of posts on this stuff, and there's the honeycombio Twitter account. And you can find me @mipsytipsy on Twitter. Liz, where can they find you?

Liz Fong-Jones: You can find me @lizthegrey, except it's with an E, not an A, and it confuses people sometimes. And George Miranda?

Charity Majors: That's how gray should always be spelled. I'm with "Emily of New Moon" on that one.

George Miranda: I'm definitely with you on that. And you can find me @gmiranda23 on Twitter. And as Charity Majors said, we're dotting the i's and crossing the t's on the final manuscript as we speak. So hopefully, you should have the actual production print of this book very soon. And if not, you know where to find us before that.

Charity Majors: Exactly. Thanks so much for having us. This has been really fun.

George Miranda: Thanks. This has been great.

About the speakers

Liz Fong-Jones

Field CTO, Honeycomb.io

Charity Majors

CTO, Honeycomb.io