
Observability 2.0: Transforming Logging and Metrics in Software Engineering

Charity Majors and James Lewis explore the evolution of observability and the challenges faced by software engineers. Charity introduces the concept of canonical logs, which enhances observability by systematically gathering and organizing contextual data throughout a request's lifecycle, leading to more structured debugging. They examine the limitations of traditional metrics and tools like Prometheus and Datadog, emphasizing the need for a shift towards "Observability 2.0," which prioritizes analysis over search and high-cardinality data management. The conversation also touches on the importance of platform engineering and cost-efficient infrastructure, advocating for design innovations that simplify complexity for engineers and improve their ability to manage systems effectively.


About the experts

Charity Majors (expert)

CTO at honeycomb.io

James Lewis (interviewer)

Software Architect & Director at Thoughtworks


Introduction to Observability and Version Change

James Lewis: Hello and welcome to GOTO Unscripted coming to you from GOTO Amsterdam 2024. My name is James Lewis and I'm so pleased to be with you here today with...

Charity Majors: Charity Majors.

James Lewis: I've been a big fan of yours and the work you've been doing, especially, obviously, with observability at Honeycomb, for a number of years now. So I hope you don't mind me saying that, bit of a fan moment...

Charity Majors: No. Thank you. Thank you, it's very sweet.

James Lewis: It's been a really exciting journey I guess.

Charity Majors: It has, yeah, it's been a very exciting journey full of many near-death company experiences.

James Lewis: So you were talking today, you did the keynote this morning which was fantastic, fabulous, I loved it.

Charity Majors: Thank you.

James Lewis: Just wanted to say that. On whether it's time for a version change in terms of how we define observability. Now, I have to admit when I saw the title, I was thinking, do you mean like a version of logs? Is it schema versions? So what do you mean by is it time for a version change?

Charity Majors: I feel like we spent a few years arguing about observability and how it's different from monitoring, and this led to a few years where I felt like my entire life was just correcting people on Twitter about terminology, which does not make you a lot of friends. You are literally the "well, actually" guy. It's not super fun. And I feel like we've come to this place where... in the beginning I was like, we're going to make a unique technical definition for observability. But you don't actually get all of the say. The market has a lot of say. Other engineers have a lot of say in this. And the term kind of got diluted. So at this point I think of observability as just an attribute of complex systems, like performance or reliability. But there's been this massive step function, a discontinuous leap, in recent years, in everything from the cost model to how you store the data. And this is where the idea of semantic versioning comes in: if you're bumping the major version, it had better be a breaking, backwards-incompatible change. And I think that observability 1.0 versus 2.0 is very clarifying for a lot of people. Because observability 1.0 tools are metrics, logs, and traces. You're storing data in your RUM tools, your APM tools, your profiling tools, your tracing tools, all these different tools. And what connects them? Nothing. Just you, the engineer, sitting there eyeballing shapes and going, well, that's probably that. Copying and pasting.

Observability 2.0 is very simple. You have a single source of truth. You have these arbitrarily wide structured data blobs. And you can slice and dice. And you can zoom in. You can zoom out. You can derive metrics and logs and traces. But since you have a single source of truth, all of that context gets preserved. It's very simple. You have many sources of truth or you have one. But the socio-technical implications are just massive.
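
To make "arbitrarily wide structured data blobs" concrete, here is a minimal sketch in Go of what one such event might look like. All field names and values are invented for illustration; this is the shape of the idea, not any particular vendor's format:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// One wide event per request: infrastructure, application, and
	// business context all live side by side in a single record.
	event := map[string]any{
		"timestamp":      "2024-06-11T09:15:02Z",
		"service":        "checkout",
		"trace_id":       "a1b2c3",
		"duration_ms":    182.4,
		"http.status":    200,
		"user_id":        "user-48213", // high-cardinality, and that's fine
		"app_id":         "app-9032",
		"cart_value_usd": 54.90, // business data in the same record
		"build_id":       "2024.06.11-3",
	}

	// Metrics, logs, and traces can all be derived from this one
	// source of truth, so the context is never thrown away.
	line, _ := json.Marshal(event)
	fmt.Println(string(line))
}
```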

Recommended talk: Is It Time To Version Observability? (Signs Point To Yes) • Charity Majors • GOTO 2024

James Lewis: I mean, maybe I'm old enough to remember before observability was 1.0. I remember back in the day, around 2010, 2011, there was a guy called Coda Hale (CodaHale was the Twitter handle), and his metrics libraries were the first ones I started using. It was awesome, absolutely awesome. Can you maybe talk us through how things have evolved since back then? Because I know you were saying in your keynote it was about 2015, 2016 that you coined the term. But there'd been this giant explosion of things happening before that.

Charity Majors: StatsD was a brilliant innovation for its time. It was a generic interface to telemetry for engineers. And at the time, storage and compute resources were extremely expensive. And scarce. And the metric is just a number. So you were deriving the absolute least amount of information that you could to get insights out of it. And God bless, that carried us for decades. It's amazing. But ultimately, the metric is really a data type you should be using to summarize vast quantities of data. And the metric was also really useful in the days when your architecture was very simple. You had the web tier, the app tier, the database, singular. So all the real complexity was bound up inside your lines of code. It was tractable. If all else failed, you could just attach a debugger and step through it. Well, what's happened over the past 10 years? Well, we blew up the monolith.

James Lewis: Yes, sorry about that.

Charity Majors: Right? Patient zero here.

We've got all these services. We've got all these third-party APIs. You've got so many different storage systems. And so much of the complexity no longer lives inside your code. It's outside of your code, in the system, where everything is a high-cardinality dimension. Debugging is no longer a question of finding the code and fixing it. That's the easy part. The hard part is finding where in the system the code you need to fix even lives. So it's just a very different universe.
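
As an aside on "the metric is just a number": StatsD's wire protocol really is just a name, a value, and a type fired over UDP, which is exactly why it was so cheap, and why all the request context gets stripped away. A minimal hand-rolled sketch in Go (the metric names and address are made-up examples):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// StatsD listens on UDP; 8125 is its conventional port, and
	// this address is just an example.
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// The whole telemetry payload: a name, a value, and a type.
	// "c" = counter, "ms" = timer. No user ID, no request ID,
	// no context at all, just a number to be aggregated.
	fmt.Fprint(conn, "checkout.requests:1|c")
	fmt.Fprint(conn, "checkout.latency:182|ms")
}
```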

James Lewis: I remember you saying earlier that one of the things you used to love, when you were still at Facebook I guess, was being the smartest debugger in the room.

Charity Majors: The debugger of last resort.

James Lewis: The debugger of last resort, that was the phrase. Yeah. I love that phrase. It's a really cool phrase.

Charity Majors: I think a lot of us identify with that, right?

James Lewis: Yes, once you get your head around the system enough, then you suddenly...

Charity Majors: You built it. You've seen all the ways it can fail. You can pattern match with the best of them. And it does feel really good. But it really doesn't scale. It was very humbling. Almost humiliating. Let's stick with humbling. For me, when I was at Facebook, at Parse, these fairly junior engineers came along and they just started feeding some data sets into Scuba. And suddenly these things that only I could get out of the data... I could figure it out, even though the data didn't quite say it. Suddenly these very junior engineers were just like, bop, bop, bop, bop, well, it looks like it's this. And I was just like, what the hell? It was like, I guess I need to learn this modern tooling too. Which is a little... for cranky people like us, it's like, okay. But it's so healthy. It's so healthy. If the connective tissue, the context, sits in a tool instead of in our heads, it means that the best debuggers aren't the people who have been there the longest anymore. It's the people who are the most curious.

James Lewis: I love that idea. Curiosity drives that.

Charity Majors: Yes. Curiosity and your habits. Habits do so much work for us as engineers. The things you do during exceptional times are one thing. But it's the habits that really drive us.

Recommended talk: The Best Programmer I Know • Daniel Terhorst-North • GOTO 2024

The Shift Toward Observability-Driven Development

James Lewis: I remember a long time ago talking with Daniel Terhorst-North about his time at a FinTech, a commodity trading firm, and about someone he said was probably the best, or second-best, programmer he'd ever met or worked with. The thing this guy did when Daniel first started pairing with him: Daniel came over and said, let me show you the tests. He said, no, no, no, run it and show me the logs. I thought that's quite a profound thing actually, right? Because yes, I can see the tests. But actually, if the logs don't tell me enough about the running system, then how am I ever going to work with this long term?

Charity Majors: It's such a profound mental shift. I feel like in so many ways, where we are right now as an industry, we're in the early days of observability the way that, 20 years ago, we were in the early days of TDD and test-driven development, where it feels very manual. It feels very awkward. It feels very unnecessary, kind of excessive. It's like, why do I need to figure out how to add instrumentation? I know how it works. I've got my dashboards. For a long time, it was like, why do I need unit tests? And it is kind of manual at the moment because, to be completely honest, you have to have tooling to support higher-level tooling. Right now there are not many places where you can get raw events and slice and dice and all this stuff. And that's an absolute prerequisite. You can't do this built on top of aggregates and samples. So the industry is going to have to advance a bit. But it's going to get there, because your code passing tests does not tell you what's working.

James Lewis: That's really interesting.

Charity Majors: It tells you at best that your code probably is logically correct.

James Lewis: Yeah. Your colleague, Martin Thwaites.

Charity Majors: Martin is great.

Recommended talk: Expert Talk: What’s Next For .NET? • Hannes Lowette & Martin Thwaites • GOTO 2022

James Lewis: I saw him speak about observability-driven development a couple of years ago, maybe even a few months ago now. I was trying to think about what it reminded me of. And it reminded me a bit of going from a closed to an open system. You had TDD and you were driving out your unit tests. Then we started doing acceptance-test-driven development, driving out from acceptance tests. Then we went almost even further out, where we were running the whole thing... not even standing up a JVM. You didn't need to stand up a JVM. You were driving the APIs from tests. And it seems like we're moving outwards now. And when you talk about debugging, it also made me think: I used to joke, I haven't opened a debugger in years, because I've got enough coverage in my test suite to know what's correct. Is that how you see this evolving?

Charity Majors: There are so many similarities and parallels. Because when I talk about debugging, you don't know if your code works until you watch it run in production. I feel like our model of debugging and monitoring and observability has, for so long, focused on problems and errors and bugs. We've trained ourselves to not even look at production unless there's a problem or we get paged. But there's so much that goes into understanding your software that doesn't have anything to do with something being down or paging you. And we're engineers, we like to understand things. We are curious. Anytime I encounter somebody who's like, oh God, I don't want to look at it, I'm just like, who hurt you? Because that has to get drummed out of you; inherently, we're so curious about it. And being able to understand your software is what informs how you spend your time, whether or not what you built is being used. One of the things...

Austin Parker, who I work with, has said some really cool stuff that I can't get out of my head, about how if observability turns out to just be better monitoring, we will have lost. Because so much of it is about blending business metrics with application and system metrics. It's so weird, if you think about it, that they're all walled off from each other. Every interesting question you want to ask is probably a blend of business and application and systems.

James Lewis: And it absolutely changes behavior. Just to share a story, I was on a team many years ago, and we put this set of microservices into production. We were curious about how it was performing, so we had lots of stuff around latency. Don't tell anyone, it was a CQRS implementation, so we had to measure the latency between writes and reads, to make sure we'd eventually get... blah, blah, blah. But the biggest breakthrough came when someone said, I wonder how much money this is making. And no one knew. But someone said, I'm sure finance knows roughly what a click on the link is worth. So someone went down to the finance team and said, hey, how much do we make, on average, every time someone clicks the thing? And we started graphing that and went, oh, that's quite a lot of money. Actually, this thing is worth a lot, right? So maybe we should start taking it a bit more seriously. Not that we had everything automated anyway. But suddenly the mind shift of seeing those business metrics, just amazing. Yeah.

Charity Majors: You know, Intercom was one of our earliest customers, back when I was in charge of marketing and I was like, high cardinality is a great marketing term. One guy was like, that's a great marketing term.

One of the first things they did was instrument their services with app IDs. As I recall the story, they were on the verge of having to do this giant MySQL migration; their database had exceeded the maximum size you could get in AWS. So they were like, well, it's time to shard, time to partition all this stuff. And this engineer instrumented their code with Honeycomb and app IDs, just started poking around, and went, huh, 80% of our execution time is being taken up by one customer who's paying us $20 a month. So maybe we don't need to do this year-long migration. Maybe we throttle this guy, or charge him a lot more. We have so little visibility into what's happening in our systems, and it really takes that curiosity.
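
That discovery is, at bottom, a group-by over raw events: sum execution time per app ID and sort. A toy sketch of the aggregation in Go, with invented data, assuming the wide events are already in memory (real tools run this kind of query over a columnar store):

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a wide event reduced to the two fields this query needs.
type Event struct {
	AppID      string
	DurationMS float64
}

func main() {
	events := []Event{
		{"app-9032", 950}, {"app-9032", 870}, {"app-1147", 40},
		{"app-9032", 1020}, {"app-2210", 65}, {"app-1147", 55},
	}

	// Sum execution time per app ID.
	total := map[string]float64{}
	var grand float64
	for _, e := range events {
		total[e.AppID] += e.DurationMS
		grand += e.DurationMS
	}

	// Sort app IDs by share of total execution time, descending.
	ids := make([]string, 0, len(total))
	for id := range total {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return total[ids[i]] > total[ids[j]] })

	for _, id := range ids {
		fmt.Printf("%s  %.0f ms  (%.0f%% of total)\n", id, total[id], 100*total[id]/grand)
	}
}
```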

And I feel like the tools that we have punish curiosity. How often have you seen it: you put somebody new on call and they're poking around with their little grasshopper hat on, like, this is interesting, what's happening here? And your instinct is to go, oh, grasshopper, just don't even look. I've done this so many times. It's going to take you down a rabbit hole for two days. You're not going to find anything. And you have product work to do. It punishes you.

But if it rewards you instead, if you can quickly go, oh, this weird thing is the intersection of... this is what I love about it. We have this thing in Honeycomb called BubbleUp. If you're instrumenting your code with these wide events, you get a graph, and if you're curious about something, you draw a little bubble around it. Like, this is a spike, I'm curious. Draw a little bubble, and we'll compute everything inside the bubble and the baseline outside, then sort and diff them. So it's like: here's something I care about, that's weird. And you're like, oh, those are all Android devices with this particular build ID and this particular return string, using this language pack, in this region of the world. And you're like, oh, now I know what the problem is. Because so much of debugging is just: here's the thing I care about. Why do I care about it?
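
The mechanics she describes are simple to sketch: count how often each attribute-value pair occurs inside the selected region versus the baseline outside it, then sort by the difference. This is a toy illustration of that idea in Go, not Honeycomb's implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a wide event reduced to string attributes for the sketch.
type Event map[string]string

// freq returns how often each attribute=value pair occurs in a set of
// events, as a fraction of the set.
func freq(events []Event) map[string]float64 {
	f := map[string]float64{}
	for _, e := range events {
		for k, v := range e {
			f[k+"="+v]++
		}
	}
	for k := range f {
		f[k] /= float64(len(events))
	}
	return f
}

func main() {
	selected := []Event{ // events inside the "bubble" (the spike)
		{"os": "android", "build": "2024.06.11-3"},
		{"os": "android", "build": "2024.06.11-3"},
		{"os": "android", "build": "2024.06.10-9"},
	}
	baseline := []Event{ // everything outside the bubble
		{"os": "ios", "build": "2024.06.10-9"},
		{"os": "android", "build": "2024.06.10-9"},
		{"os": "ios", "build": "2024.06.11-3"},
	}

	in, out := freq(selected), freq(baseline)

	// Score each attribute=value by how much more common it is inside
	// the bubble than outside, then sort descending: the top entries
	// are what the spike has in common.
	type scored struct {
		key  string
		diff float64
	}
	var diffs []scored
	for k, v := range in {
		diffs = append(diffs, scored{k, v - out[k]})
	}
	sort.Slice(diffs, func(i, j int) bool { return diffs[i].diff > diffs[j].diff })

	for _, d := range diffs {
		fmt.Printf("%-28s %+.2f\n", d.key, d.diff)
	}
}
```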

James Lewis: Yeah, it's amazing. One thing I'm kind of fascinated by in this world, and have been for a very long time, talking of business metrics, is I still don't get why the giant products that you have to spend all the money on are still absolutely the leaders in this. I know it's basically because of marketing spend, right? Because I remember being on a big project where we had it in the backlog: sometime before launch, we have to implement [insert giant vendor tool here]. That was going to be the instrumentation. And it kept getting pushed back and pushed back, and then we had nothing. But actually what we ended up doing was very simple instrumentation, a very, very simple set of...

Charity Majors: It wasn't valuable to you while you were developing it. So it just gets shut down.

James Lewis: And I think that's a bit of a problem. This is back to observability-driven development.

Charity Majors: If it's part of your development process, if you're validating yourself at every step, you're like: I just shipped a thing, is it working the way I expected it to? I feel like so much of this is almost muscle memory. There's this feeling in your body where you're like, well, I've committed it, it's done, and I'm going home for the day. And it's about moving that marker from "my tests pass" to "I've watched people using it in production and I've looked at my telemetry." Then you can feel okay. Then you can leave for the day.

Recommended talk: How Flow Works • James Lewis • GOTO 2024

Canonical Logs

James Lewis: One thing I was super excited about today that I can't believe I hadn't heard the term before was canonical...

Charity Majors: Canonical logs.

James Lewis: I feel ashamed, deeply embarrassed.

Charity Majors: It should be way more widely known, because it's revolutionary, and it's not new. You know, Brandur, I forget his last name, an engineer who was at Stripe, he named it. So the idea of canonical logs is this. With normal logs, your request is executing through your service and you're just like, here's something, log it, log it, log it, log it. It's kind of scattershot. So if you ever actually want to do anything with it, you have to go and do a bunch of sed and awk or whatever to recombine it. The idea of canonical logging is this: when your request enters a service, you initialize an empty struct. You pre-populate it with anything you already know, like any parameters being passed in. Then while it's executing...

James Lewis: Kind of passing a blob of context around, essentially.

Charity Majors: You pass it around. Anything you're like, oh, shopping cart ID, might be useful someday? Drop it in the blob. App ID, username, user ID, anything you're like, might be useful someday, just drop it in the blob. You can get fancy. Anything in /proc? Drop it in the blob. Language internals? Drop it in the blob. Whatever. Drop it and forget it. And then when your request gets to the end, when it's ready to exit or error, you ship it all off to your storage as one very wide structured event. So all of that context for that request, in that service, is knit together. And this isn't new. I learned, years after starting Honeycomb, that AWS has been doing this in their backend services forever. And it's literally just a text file in /something-or-other. But I would honestly rather have these wide structured events plus grep than almost any other debugging solution on the market, because it's so powerful.
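
A minimal sketch of the canonical-log pattern as Go net/http middleware, assuming one plain JSON line per request as the output; the route, fields, and handler are illustrative. Initialize the blob on entry, let the handler drop things in as it executes, and ship one wide event on exit:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// canonical wraps a handler so that every request emits exactly one
// wide, structured log line at the end of its lifecycle.
func canonical(next func(w http.ResponseWriter, r *http.Request, blob map[string]any)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()

		// 1. Initialize the blob and pre-populate what we already know.
		blob := map[string]any{
			"method": r.Method,
			"path":   r.URL.Path,
			"remote": r.RemoteAddr,
		}

		// 2. Hand the blob to the handler, which drops in anything
		//    that "might be useful someday" as it executes.
		next(w, r, blob)

		// 3. On exit, ship it all off as one wide structured event.
		blob["duration_ms"] = time.Since(start).Milliseconds()
		line, _ := json.Marshal(blob)
		log.Println(string(line))
	}
}

func main() {
	http.HandleFunc("/checkout", canonical(func(w http.ResponseWriter, r *http.Request, blob map[string]any) {
		blob["cart_id"] = r.URL.Query().Get("cart") // might be useful someday
		blob["user_id"] = "user-48213"              // hypothetical lookup
		w.WriteHeader(http.StatusOK)
		blob["status"] = http.StatusOK
	}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```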

James Lewis: I can almost see a secondary industry, though, in formatting the stuff you dump into structured logs. Because you can imagine multiple people working on the software, and suddenly you've got uppercase, lowercase... There's definitely an open source library in there that's going to standardize the field names: is first name going to be camel case, is it going to be this...

Charity Majors: I know. There's a need for a lot of tooling around this. One of the interesting things about Observability 2.0 is you don't want schemas or indexes or anything that assumes you can predict in advance which data is going to be valuable. But at the same time, you also don't want chaos and madness. So you want some sort of implicit schema, some sort of resolution, something that automatically converts camel case to snake case or something like that. Both of these things are true, and it doesn't exist right now. So it's very manual.
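
As a sketch of the kind of tooling that's missing: a tiny, hypothetical key normalizer in Go that folds camel case, hyphens, and spaces down to snake_case before a field lands in the blob. A real "implicit schema" resolver would have to handle types and units as well; this just shows the flavor:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalizeKey converts a field name like "shoppingCartID" or
// "First Name" into a canonical snake_case form, so that fields
// added by different people still line up in one implicit schema.
func normalizeKey(key string) string {
	var b strings.Builder
	prevLower := false
	for _, r := range key {
		switch {
		case r == '-' || r == ' ' || r == '.':
			b.WriteRune('_')
			prevLower = false
		case unicode.IsUpper(r):
			if prevLower {
				b.WriteRune('_') // word boundary: camelCase -> camel_case
			}
			b.WriteRune(unicode.ToLower(r))
			prevLower = false
		default:
			b.WriteRune(r)
			prevLower = unicode.IsLower(r) || unicode.IsDigit(r)
		}
	}
	return b.String()
}

func main() {
	for _, k := range []string{"firstName", "shoppingCartID", "first-name", "First Name"} {
		fmt.Printf("%-16s -> %s\n", k, normalizeKey(k))
	}
}
```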

James Lewis: This literally has only just occurred to me. So I don't know. This is probably something that's completely stupid. But in the kind of data space, data warehouse space, there's obviously been an evolution away from the old traditional kind of data warehouses, star schemas and stuff.

Charity Majors: ETLs.

James Lewis: ETLs towards data lakes. And then again, away from data lakes towards data pipelines. And even now an evolution again towards things like data products, where each service is offering up its data as a product. Is that something you see in terms of observability? Or is that just slightly... Is that orthogonal to it?

Charity Majors: It's related, and also not. I heard the term data observability and I was like, well, this is interesting, sounds like my people. So I investigated a little. And it means something a little different than what I mean when I talk about observability.

Data observability has a lot to do with things like: you've got a schema, but some events might be sparse or rare. How sparse is too sparse? How do you know when something stops sending versus it's just not very common? How do you validate schemas when they get impacted? How do you version those schemas? It's really interesting. It aligns a lot, I think, with the kind of observability I talk about. But it's just different enough that you've got to do some manual stuff to glue them together, which some people do. I think it's a super interesting time to be in data, though.

James Lewis: I mean, it certainly is. It certainly is. But it also is a super expensive time in terms of cost.

Charity Majors: This is something where I feel like software engineers... it has always kind of surprised me, software engineers are not all that literate in data stuff. But anytime you are using a data model and a storage type that is not right for your use case, your costs are going to go like this. And this is where I feel like people who are buying observability tools haven't quite reckoned with the fact that the data model based on metrics is just... there's no relational data stored in a time series database. It's so expensive to try and store any kind of, not even high cardinality, but even itty-bitty cardinality, like 100 distinct values, and it's just like, whoa, it's not the right tool for the job. And people are just running into...

After I wrote that blog post on the cost crisis in observability tooling, I heard from so many observability teams, and I'm talking about people at the peak of their craft, experts in this domain. And they all feel kind of ashamed and embarrassed, because they're like, we spend an outright majority of our time as a team just trying to manage costs. Not even automating or building libraries or doing something that will last, just manually monitoring and massaging. And I'm just like, this is such a smell: wrong data model.
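
To put rough numbers on why the metrics data model breaks down here (these figures are invented for illustration): in a time-series database, every distinct combination of tag values becomes its own stored series. Tag one request-latency metric with 1,000 hosts, 200 endpoints, and 50 customer plans, and you have up to 1,000 × 200 × 50 = 10,000,000 series for that single metric. Add a genuinely high-cardinality field like user ID, say one million users, and the theoretical ceiling multiplies to 10,000,000,000,000 series, which is why vendors either forbid such tags or bill brutally for them. A wide-event store, by contrast, pays for one row per request no matter how many attributes the event carries.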

James Lewis: I had this conversation, I won't say with who, but it was, should we say, about the infrastructure of a large FinTech. I think their production infrastructure was something like four fairly meaty instances running Kubernetes with a bunch of containers on them, but their observability infrastructure, they had 40. It was literally an order of magnitude more infrastructure to meet their observability requirements.

Charity Majors: Such a smell. Such a smell.

James Lewis: But it's astonishing.

Charity Majors: But I also don't want to talk shit, because tools like Datadog or Prometheus, these are phenomenal tools.

James Lewis: Sure.

Charity Majors: Phenomenal.

James Lewis: I'm glad you said that about Prometheus. Phil Calçado, who's a former colleague and good friend of mine, he was at SoundCloud, where Prometheus came from.

Charity Majors: They're very well engineered, well architected. These are the last, best metrics-based tools we're ever going to get.

James Lewis: Right. Exactly. And they were game-changers when they came out.

Charity Majors: They were game changers, and they're now mature. The tooling is exquisite. You've just got to know the use case, or it's not going to work. Infrastructure: great use case for metrics. With application data, you literally want to aggregate around every user's experience, which is a very high-cardinality dimension. But when it comes to infrastructure, you want to aggregate around the CPU or the disk. There are lots of use cases where metrics are the right tool for the job. They just get out of their lane and then you run into trouble.

James Lewis: Awesome. I love the idea of getting out of their lane and running into trouble. I'm totally, can I borrow that? Do you mind if I just take that?

Charity Majors: Absolutely, take it.

James Lewis: I'll use it for all the things.

Charity Majors: Do it, do it.

Recommended talk: Observability Engineering • Charity Majors, Liz Fong-Jones & George Miranda • GOTO 2022

Observability and the Future of Software Development

James Lewis: So, going forwards from Observability 2.0, it's maybe a bit silly to ask what 3.0 would look like, but what does the future hold? If you could wave a magic wand and look a little bit into the future, what does it look like for you, for the company, for the world of software and application development?

Charity Majors: I feel like there are so many interesting trends in software, and we're sitting at the nexus of them: putting software engineers on call for their systems, and the rise and fall of DevOps. Ops teams are becoming less and less of a thing. The end state of DevOps is that every engineer writes code, and every engineer owns the code that they write.

Platform engineering is, I think, so interesting because it's bringing product discipline to infrastructure for the first time. The end of the zero interest rate phenomenon is, I think, forcing us to build or relearn these muscles when it comes to, like you were saying, connecting what we do to dollars and cents, which is the universal denominator, right? Cost is a property of systems, and every architect should care about it. And I think all these things are so good for us. For such a long time in the C-suite, the VP of engineering has been the junior member of the squad, because we've been like, well, we're artists, you know. We can't translate the work that we do into something the business can measure.

James Lewis: You couldn't possibly understand.

Charity Majors: You couldn't possibly understand, which is not a good thing.

James Lewis: No.

Charity Majors: It's not a good thing. I feel like these are all long-term trends. And, let me be a little bit arrogant for a moment: I feel like the model that we have sketched out over the past few years, wide events, high cardinality, is the inevitable end state. Whether Honeycomb succeeds or fails, it's just the only way to handle it.

James Lewis: That's the way it's going.

Charity Majors: It's been really fun. You can't really explain this quickly... maybe it just takes a long time to explain, and attention spans are like this, you know, which I get, which I get. But I think that over the next few years you'll start to see more companies and startups built on this kind of model. There are some really interesting startups built just on ClickHouse, which I think is really exciting. It needs to be a commodity. It needs to be open source. It needs to be more competitive. But I think we're going to see some really interesting design innovations.

So much of this is a mental-model shift away from a world where, with logs and with metrics, you need to predict in advance what questions you're going to want to ask. You need to know what strings you're searching for in order to find them. The world of Observability 2.0 is much more: I don't know what I'm looking for. I just know that something is wrong, but I don't know why. And being able to move quickly, being comfortable with that, comfortable with analysis first instead of search first... this is a design problem.

James Lewis: And as you were saying earlier, hypothesis first.

Charity Majors: Hypothesis first. The exciting thing to me about this is that it's such a leveler. Democratizer is an overused term, but when you join a new team, you're like, well, I can never catch up. I can never be the expert in these systems. This is a way of taking people's hands and leading them into the future, because it's so much easier. The way people are doing it now is so hard and so expensive. We've been doing it for decades, so we're like, well, we know it. But it's so hard, and it can just be so much easier. People are always like, what's the biggest hurdle we have? And like I said in my talk, it's that people don't believe a better world is possible. And I get it. We've all been told things by vendors. But it's not that the world is going to be easy. It's that there are tools that obscure complexity and tools that try to make complexity tractable. And that's where I hope observability is going.

James Lewis: I think that's probably a brilliant place to end. So the world is going to be so much easier.

Charity Majors: I can talk to you all day.

James Lewis: Making complexity tractable through observability 2.0. Thank you so much, Charity.

Charity Majors: Thanks for having me.

James Lewis: It's been absolutely brilliant chatting. My name is James Lewis.

Charity Majors: Charity Majors.

James Lewis: Thank you for listening to GOTO Unscripted again, coming from GOTO Amsterdam. Thanks.