Getting Started with Chaos Engineering

Nora Jones • Casey Rosenthal | Gotopia Bookclub Episode • October 2020

Casey Rosenthal and Nora Jones, authors of “Chaos Engineering,” highlight some of the best practices that famous companies like Netflix and Capital One use to break (or not break) their systems in production, so that you can get a taste of it.

Transcript

Chaos engineering is much more than just hype. Get the map and compass that you need to navigate the stormy waters of distributed systems while optimizing to meet business goals. Casey Rosenthal and Nora Jones, authors of “Chaos Engineering,” highlight some of the best practices that famous companies like Netflix and Capital One use to break (or not break) their systems in production, so that you can get a taste of it. This Book Club episode is an expansion of a recent interview between Nora Jones and Casey Rosenthal on the comedic Chaos Community Broadcast. It now includes an expanded conversation, including discussion of Nora's work with the Learning from Incidents community.


SUPPORTER

Verica uses chaos engineering to make systems more secure and less vulnerable to costly incidents.
With Verica, you can trust that your software is working how it’s meant to. As systems become more complex, Verica will be there to help maintain confidence in those systems. Get your free copy of the Chaos Engineering book.
https://www.verica.io/book

Looking for a unique learning experience?
Attend the next GOTO conference near you! Get your ticket at http://gotocon.com


Two of the field’s prominent figures, Casey Rosenthal and Nora Jones, pioneered the discipline while working together at Netflix. In this book, they expound on the what, how, and why of Chaos Engineering while facilitating a conversation from practitioners across industries. Many chapters are written by contributing authors to widen the perspective across verticals within (and beyond) the software industry.

Get your free copy
Chaos Engineering by Casey Rosenthal and Nora Jones

James Wickett: Well, today we're joined by Nora Jones as we continue our series on the book, "Chaos Engineering: System Resiliency in Practice." Nora Jones is a co-author of the book. She also wrote the chapter, Creating Foresight, and she's the CEO and co-founder of Jeli. So, Nora, thanks so much for coming on the show.

Nora Jones: Thanks for having me.

Casey Rosenthal: Congratulations on being published. It's nice to finally meet you in person, Nora.

Nora Jones: Yes. Likewise, nice to meet you in person too.

Casey Rosenthal: I know it's going to come up, so let's just get this question out of the way. Your name is Nora Jones — I'm sure you get asked this all the time about your name — is there any relation to Indiana?

Nora Jones: No relation to Indiana. No.


Nora Jones’s journey in Chaos Engineering

Casey Rosenthal: Okay, if I could take you back to the year MMXVII, November 30th, MGM Grand Arena in Vegas, the AWS re:Invent conference. There were 40,000 people in the audience, another 60,000 people watching remotely. And you had the stage... you spoke about the journey of chaos engineering sophistication, the forces of chaos. How did you get to that stage? What was your journey to end up there?

Nora Jones: I was working at a company called Jet.com that had just launched in 2015. They were an e-commerce website. Their marketing team was doing a phenomenal job. But as with any good problem, you need to keep up with the tech. So we were having a lot of incidents, right? We needed to meet the demand that was coming to the website. We were meeting almost every morning to talk about the previous night's incident. So at that point, we had talked about just trying some new techniques. I came across Netflix's Chaos Monkey, and I started implementing a version of that at Jet.com. It went in a really interesting direction: I thought I was making a tool.

Slowly but surely, I realized it was actually a cultural shift, and the tool was just kind of a catalyst for that. At that point I was leading developer productivity and developer tools there. I went over to Netflix for a couple of years after that, where I was on the chaos engineering team with Casey, although this is the first time we've met in person.

Casey Rosenthal: Interesting.

Nora Jones: Yes, interesting. Then I was building tooling there for a little while. That's where I did the re:Invent talk in 2017. But my journey's changed a lot since then. My thoughts have evolved a lot since then as well, but it's a pretty fascinating field, and I think it's going to keep evolving too.

What would you attribute your success to?

Casey Rosenthal: You mentioned you worked on the chaos engineering team at Netflix, which I believe I managed. Was I the worst manager you've ever had?

Nora Jones: No.

Casey Rosenthal: See, James?

James Wickett: I'm surprised by that.

Casey Rosenthal: Yes. I've heard of Jet, which you mentioned — James and I were talking about this last week — was acquired by Walmart, and then discarded. Comforting to know that even Walmart sometimes regrets what it buys at Walmart. But I don't often use the word brilliant, acceptance, self-affirmation. 

In less than five years, you've gone from Jet to Netflix to Slack, and now you run a VC-backed startup. If you could be self-reflective for a moment, is there anything that you would attribute your success to, versus other people who have had the same advantages? Obviously, we live in a system biased towards privilege. But is there anything in particular about your outlook on software, or your understanding of where the industry is, that you feel has contributed to your success?

Nora Jones: I've been so lucky to work with amazing people at each company I've been at. I mean, Jet was kind of a startup environment. I was in hardware before Jet, still focused on reliability, but it was a different journey for me. I was able to take some of the same theories and approaches towards hardware reliability. I was actually surprised how much it applied towards software as well. 

Honestly, just collaborating and getting to know teammates, like one of the things…

I studied computer engineering back when I went to school, and one of the things I really disliked about the major was how individual it was, how I felt that it could be a lot more collaborative — I felt that we were going to build better things if we were working together, if we were talking.

I think sometimes the software industry falls into that trap as well, where you put your headphones on, and you just code all day. And I think, actually, sitting down and working with my coworkers, and building reliability tools, and really thinking about the business impacts of things has just completely changed my viewpoint and has really helped me along this journey. Yeah, I am still going. But it's been so fun. 

It's really just the people.


Casey Rosenthal: So that perspective, in the language of the field, is viewing it more as, or at least being open to, the socio part of this socio-technical system. You're still in the master's program at Lund University?

Nora Jones: Yes. I'm still in it.

Casey Rosenthal: How much of that has changed your thinking?

Nora Jones: It changed my world. I mean it was stuff that I knew and felt, and then I came into that first week of class. I felt like I could actually put theories and research and practical applications behind stuff I was thinking and feeling all along. I just feel like the software industry has so much ahead of it in terms of reliability, of how we think about making the Internet a place folks can trust from a reliability perspective and totally understand the impacts of what we're building as software engineers, not only to society, but to our businesses.

Key takeaway no 1

It's the people!

Collaboration and learning from incidents

Casey Rosenthal: You mentioned collaboration earlier and you launched a community, Learning from Incidents, which now has a website, learningfromincidents.io, some very well-recognized names in the industry submit blogs there and post articles and essays. How did that come about? Was that an extension of this work from Lund and your interest in socio-technical systems?

Nora Jones: Yes, absolutely. So, working at Netflix was so amazing on the chaos engineering team, but it was a very different chaos engineering perspective than I had at Jet, right? Most folks at Jet hadn't heard of Chaos Monkey or chaos engineering or the philosophies behind it, and we were a startup. We were just getting off the ground, whereas Netflix formalized it.

I was coming into a team that had started building tooling around that, but as you know, Casey, most of the folks using that tooling were folks on our team.

I started trying to think about how to get more folks in the organization using it, how to extract their mental models. Being in the Lund program helped me ask those questions a little bit better so that we could get more people at the table and understand the actual business impacts, refine folks' mental models, and help make the system more reliable overall so that we understand how we're actually working.

I think after a month or two in the Lund program, I started looking into incidents at Netflix a little bit more, and I would just review previous incidents. Lorin Hochstein was doing a great job at that at Netflix too, and I started getting into it as well: reading previous Slack transcripts, looking at pages that went off, looking at who needed to be paged, looking at what dates we were getting paged, looking at what events were happening around that time.

But I started doing that as, okay, maybe we can build something to feed into our chaos tooling to prioritize experiments based on previous incidents that have happened or previous surprises that have happened or prioritize certain teams that were maybe underwater in certain regards, like around different dates.

That was why I started looking into incidents there, and I ended up doing like a cross-incident analysis of a bunch of different incidents just to see which themes kind of emerge from the organization, and it was so amazing to do. It took a significant amount of time but it was like… it was so enlightening to see which people were involved, which teams were involved, which people weren't on call. All these different things and then bubbling up and sharing that with the rest of the organization helped them prioritize their OKRs, helped them prioritize future requests against reliability measures.
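As a rough illustration of the idea Nora describes here, the sketch below ranks parts of a system as candidates for chaos experiments based on past incidents, and counts how often each team was paged. It is not the tooling Netflix built: the `Incident` fields, the scoring rule (one point per incident, one extra point when the incident was a surprise), and the sample data are all assumptions made up for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """One past incident; every field here is a hypothetical example."""
    service: str          # part of the system that was involved
    teams_paged: tuple    # teams that were paged during the incident
    surprised_us: bool    # did responders describe the behavior as a surprise?


def rank_experiment_candidates(incidents):
    """Return (service, score) pairs, highest score first.

    Assumed scoring rule: one point per incident involving the service,
    plus one extra point for each of those incidents that was a surprise.
    """
    scores = Counter()
    for incident in incidents:
        scores[incident.service] += 1
        if incident.surprised_us:
            scores[incident.service] += 1
    # Sort by score (highest first), then by name so ties are stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def pages_per_team(incidents):
    """Count how often each team was paged across all incidents."""
    counts = Counter()
    for incident in incidents:
        counts.update(incident.teams_paged)
    return counts


# Made-up sample data, only to show the shape of the input.
EXAMPLE_INCIDENTS = [
    Incident("checkout", ("payments", "platform"), True),
    Incident("checkout", ("payments",), False),
    Incident("search", ("discovery",), True),
    Incident("login", ("identity",), False),
]

if __name__ == "__main__":
    for service, score in rank_experiment_candidates(EXAMPLE_INCIDENTS):
        print(service, score)
```

A score like this is only a starting point for a conversation about where to experiment; the themes she mentions (who was involved, who was not on call, what was happening around those dates) come from reading the incidents, not from the count.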

At that point I was like: this is a lot of work, I feel like there's probably more people in the industry doing stuff like this. I'm just going to send out a tweet and see who else is doing stuff like this because I wanna talk to them. I feel like it's going to be more beneficial for us all as an industry if we're talking about some of the stuff we're finding because then we can all improve together.

So I sent out a tweet about starting a Learning from Incidents community, and I think I got about 200 DMs that night from folks wanting to join. I wanted to keep the community small just to make sure that we were getting to know each other, because we were sharing specific things about our companies and our incidents, and I wanted it to be a tight-knit community. Then we ended up learning a bunch together.

I wanted to open source those learnings further, which is why I started the learningfromincidents.io website, and we've been so lucky to have a number of great folks write blog posts on there about their chop-wood and carry-water stories of how they're doing this at their companies. I think a lot of folks see the word "learning" and they're like, okay, well, you have to do more than learn, you have to fix. But you can't fix until you've learned how the thing happened to begin with. So I think that's one of the things I'm trying to help folks understand as well.

Casey Rosenthal: It feels like you're on the precipice of opening up an entirely new discipline, if you haven't already, right? It's blue skies ahead. Where does that community go next?

Nora Jones: It's a great question. I see that community evolving into a number of different things. Folks are sharing some of their post-incident reviews there and they're getting feedback from folks that are not at their companies.

I started my own company around the incidents space too, and we're not ready to share stuff yet, but we will be willing to soon… we will be able to soon. I think it's going to be a lot of fun to share some of the post-incident world there as well.

Key takeaway no 2

You can't fix something until you learn what happened in the first place.

Chaos engineering at Netflix

James Wickett: This has been something that I've been thinking about and wondering about — maybe you could speak about this.

Netflix is kind of this place that has really been a launching pad for a lot of interesting startups. And we see that over the years: both Casey and you, Nora, have startups coming out of this field. What made Netflix particularly successful in being able to help people see their vision, tackle these problems, and then spread this to the wider audience?

Nora Jones: Casey, do you want to take that first?

Casey Rosenthal: No, I don't recall being there.

Nora Jones: Netflix was an amazing place to work in a lot of different ways. They really share freedom and responsibility. 

You can work on what you want to work on, but you have to be able to back it up, you have to be able to own it, you have to be able to share it. One of the things that was really great for me at Netflix, although sometimes it could be a lot, was that folks in the reliability organization had to play every single role. I had to be a product manager, I had to be a designer, I had to be a back-end engineer, I had to be a front-end engineer. I had to be an internal evangelist.

I got good at a bunch of those different skill sets just by being there. I think it was part of the Netflix culture that really helped with that. But a lot of those skill sets helped me find my way into starting my own company as well. So I'm so grateful for that experience.

Casey Rosenthal: At best I think those companies have a generative culture that has really little bureaucracy. Of course, aligning the entire company to these values is not easy.

James Wickett: As an outsider looking in at Netflix, Blockbuster 2.0, how did it become the place that generated all this? You wouldn't have suspected that if you were to write this down in a book. You wouldn't have thought, "Oh, that's the place where innovation is going to come from." Your book, "Chaos Engineering," had a lot of those pieces, but I think hearing you both talk about it is really helpful. But now I'd like to move to a word from our sponsor.

Casey Rosenthal: I think that was a good one, though. That reminds me of when I was back in the Twin Cities when they first rolled out the Gamewell fire alarm telegraph system, do you remember that?

Nora Jones: What year was this?

Casey Rosenthal: Oh, I don't want to date myself. But back then it was a novelty. So you know, of course, the kids would want to see the fire trucks. So they'd do it and set the alarm off and the fire trucks would show up, then Minneapolis, I guess they passed an ordinance that, you know, if you did that and there wasn't an actual fire, you'd get fined. So, of course, the kids were smart. So they'd do it, they set off the alarm and then they'd look around for something to light on fire so that they didn't have to pay the fine. Of course, they've probably taken all those Gamewell boxes down by now. I don't know if Minneapolis still has a problem with fires but unintended consequences of automation. Smart kids. 

So in the book, you talk about… I'm sorry, it's really difficult to work with these uncomfortable shoes. James, can you get me a pair of good studio shoes? Nora Jones, do you want a good pair of studio shoes? If you can Postmates or courier a pair of comfortable studio shoes, that would be great.

James Wickett: No, Casey. That's not something I do, no.

Casey Rosenthal: I'm surprised you do anything around here. So in the book, you talk about...

James Wickett: Sorry about that, Nora Jones. He gets like that sometimes.

Key takeaway no 3

Chaos Engineering was possible at Netflix due to the generative culture and very little bureaucracy.

The role of a facilitator in chaos engineering

Casey Rosenthal: In the book, you talk about using a facilitator...

Nora Jones: The show business getting to him.

Casey Rosenthal: You know, James, Milton once said to me, "Never apologize for somebody else's free will." In the book, you talk about using a facilitator to investigate an incident, much like it's done in other high-stakes industries: medicine, aviation and maritime. Can you explain how that relates to "Chaos Engineering" and, you know, some of your work in the book?

Nora Jones: Yes, absolutely. So, at Netflix, we had built a really great system, and we could do a lot of things. Like we could inject failure safely in production. But we had this giant form for folks to fill out if they wanted to do it. It was a pretty daunting form. 

It's not like they were thinking about doing chaos experiments all the time unless they were the four of us on this team. We were thinking about it all the time. They were thinking about it, maybe once every few months. So seeing a giant form, asking what business metric you want to impact can be kind of scary for folks, apparently.

Doing some facilitation techniques at Netflix was really helpful, like talking to teams and asking what kept them up at night. If you actually had time, talking to people individually, and seeing how mental models differ between folks on the same team. Seeing how mental models differ between different tenures on the same team, different roles on the same team, different pieces of the system on the same team. A really good facilitator can extract that out of people and find those gaps. 

When I went to Slack, we had been rolling out Envoy, and people wanted to do some chaos experiments on it. I started asking the questions I had put in the chaos book, right? And I really grouped with people on the team. We filled out a lot of pre-work to be able to run the chaos experiment. And by the time we got about 75 percent through facilitating the experiment and preparing for it, we realized we weren't actually ready to roll it out. And we didn't even need to run the chaos experiment to know that. It was just asking those questions.

Casey Rosenthal: You said a really good facilitator can pull those things out, how does one become a really good facilitator?

Nora Jones: Yes. It's a good question. It's not being the expert, but being able to dig into things and different words that someone tells you about a situation. So if they're saying, "Yeah, this part of the system is very flaky." Not being like, "Oh, yeah, I know, that tipped over last night," even though you do know, asking them, "Oh, what do you mean, it's flaky? Can you elaborate on what that word means?" And I know that can be kind of awkward, especially with folks that you've worked with for three years.

I remember in a post-incident review once, I was asking someone what tool they used to make a configuration change. They were like, "Nora, I made a change with you in the system, like a couple weeks ago," like, you know, they were just looking at me, I'm like, "Well, humor me for a second."

I think being a good facilitator is all about being able to kind of remove that ego and ask some of the silly questions that you might know the answers to. But it's really not about the canonical answer, right? It's about how someone else will use the system, because then you can see the gaps between how they view the system, maybe how you view the system, maybe how other people in the organization are viewing that system. And you get to see how you actually work versus how you think you work as an organization. It kind of bubbles things up to the surface that folks have maybe been ignoring, that folks didn't actually know exists, and it creates learning opportunities in the organization.

But there's a big, big, big psychological trust and safety component too. If someone doesn't feel comfortable stating, "Hey, how does this work?" Right? The facilitator has to be confident in that. I usually recommend that person being an engineer, someone that is well-respected in the organization and then can have these conversations with folks without fear.

And it matters for the folks on the other end too. That's why it's important to do these in one-on-one meetings, so that if they say something they feel is silly, or they describe the system in a way that it doesn't actually work, they have the freedom and comfort to do that, because it's not reflecting on them. It's reflecting on the system and the organization as a whole, on how they're disseminating information to folks.

Because ultimately if they're working under different assumptions, you're going to be having a weird system overall, but you need to figure out what's making those oddities happen.

Casey Rosenthal: I like that picture of surfacing gaps in people's mental models. It's an interesting answer. The canonical answer is actually to read the book. But that was an interesting answer.

Nora Jones: Yes, I put a large set of questions in the book to help you become a facilitator.

James Wickett: Let's remind all listeners to — if you want access to those questions, and much more — get the "Chaos Engineering" book.

Nora Jones: You'll never believe question seven. It's wild.

Key takeaway no 4

A really good facilitator can uncover and leverage mental models while validating chaos engineering experiments.

Chaos libs time

Casey Rosenthal: James, do you want to introduce the next section here?

James Wickett: I think it is now chaos libs time. This is a fun-filled very exciting lightning fill-in-the-blank game. Casey Rosenthal is going to read and, Nora Jones, you just need to respond as fast as you possibly can. Are you ready?

Nora Jones: Ready.

James Wickett: All right.

Casey Rosenthal: Continuous?

Nora Jones: Verification.

Casey Rosenthal: Ironies of...?

Nora Jones: Automation.

Casey Rosenthal: Human?

Nora Jones: Factors.

Casey Rosenthal: Could have gone the other way. Working with Casey Rosenthal was?

Nora Jones: Fascinating.

Casey Rosenthal: Experiment?

Nora Jones: Design.

Casey Rosenthal: Game?

Nora Jones: Days.

Casey Rosenthal: Peanut butter and?

Nora Jones: Jelly.

Casey Rosenthal: James, how did she score?

James Wickett: Oh, that was great. I think that's one of our all-time highs: 37 points so far. Good job. Well, I think time's up on that one.

Casey Rosenthal: Is there a prize for that? Still waiting for my shoes.

Q&A for the authors of the “Chaos Engineering” book

James Wickett: Yep, we have some questions. Let's bring in question caller number one.

Woman: Hi, Nora Jones. Question for you. Yeah, so what is it we can learn from chaos engineering that we can't already learn from a well-executed root cause analysis?

Casey Rosenthal: Excellent question.

Nora Jones: That's a good question. Yes, I think they're two very different things. A well-executed root cause analysis — I don't know exactly what that looks like. I would say that there are well-executed post-incident learnings that you can do. But they're two separate things. Chaos engineering is proactively looking at things, and I think they're inextricably linked. I like to think of them as sort of a feedback loop. You kind of can't do one without the other. You need to be able to learn from your incidents, right? And so you can do different chaos experiments. You can do better chaos experiments. You can know where to experiment, who to bring into the room, if you should fix it or not. Those kinds of things can't be totally learned without learning from your incidents first. But I wouldn't call that a well-executed root cause analysis. But that's not for today.

Casey Rosenthal: Maybe we should just execute our root cause analysis better.

Nora Jones: Yes. Maybe. Maybe.

Key takeaway no 5

Chaos engineering looks at things proactively, and it is inextricably linked with learning from incidents: the two form a feedback loop.

Casey Rosenthal: Any other callers on the line?

James Wickett: Yep. We have one last caller, let me pull… let's bring them in.

Man: Hi, longtime caller, first-time listener. My question is, what's the best way to adapt a post-mortem template for use with chaos experiments?

Nora Jones: Yes. There are a few different ways I would answer that. I think you can look at a post-incident template, and also look for opportunities to do chaos experiments in the future: look at what surprised us during the incident, look at who we needed to rely upon that we weren't sure we needed to rely upon, what parts of the system we needed to rely upon. And you can use that to feed into future chaos experiments. But you can also use some guides with chaos experiments too, like, what are our expectations with this experiment? What are our hypotheses? Who do we need to tell about this experiment? When should we run it? Why does it matter when we run it?

I would call the two of them guides rather than templates, so that folks don't feel like they have to stick to them. But I think, again, those two worlds are fairly linked, and they can be used as a feedback cycle into each other. I would recommend mentioning chaos experiments in post-incident guides. And I would recommend mentioning incidents in chaos experiment guides as well. Because that will help you see, over time, how you're actually doing: which chaos experiments you've run, which incidents have been associated with them, what the impacts of those incidents were.
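One way to picture the two guides referencing each other is the minimal sketch below. It assumes two small record types whose fields mirror the questions Nora lists (hypotheses, who to tell, when to run, surprises, who or what was relied upon). The class names, field names, and the `link` helper are hypothetical and are not taken from the book or from any tool.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentGuide:
    """Hypothetical chaos experiment guide; fields are illustrative."""
    name: str
    hypotheses: list       # what we expect to happen
    people_to_tell: list   # who needs to know before we run it
    when_to_run: str       # when we plan to run it, and why then
    related_incidents: list = field(default_factory=list)


@dataclass
class IncidentReview:
    """Hypothetical post-incident guide; fields are illustrative."""
    incident_id: str
    surprises: list        # what surprised us during the incident
    relied_upon: list      # people or systems we unexpectedly relied upon
    related_experiments: list = field(default_factory=list)


def link(experiment, review):
    """Record the association on both sides, without duplicates."""
    if review.incident_id not in experiment.related_incidents:
        experiment.related_incidents.append(review.incident_id)
    if experiment.name not in review.related_experiments:
        review.related_experiments.append(experiment.name)


def experiments_suggested_by(review):
    """Turn each recorded surprise into a draft hypothesis to discuss."""
    return [f"We expect the system to handle: {s}" for s in review.surprises]


if __name__ == "__main__":
    # Made-up example values.
    review = IncidentReview(
        incident_id="INC-1",
        surprises=["cache fell over under retry load"],
        relied_upon=["platform team"],
    )
    experiment = ExperimentGuide(
        name="cache-retry-load",
        hypotheses=experiments_suggested_by(review),
        people_to_tell=["platform team"],
        when_to_run="weekday morning, outside the release window",
    )
    link(experiment, review)
    print(experiment.related_incidents, review.related_experiments)
```

Keeping the link on both records is what lets you look back over time and see which experiments were run and which incidents were associated with them.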

Casey Rosenthal: Well, thank you so much for coming on the show, Nora. Best of luck to you at Jeli. And I hope people read your book so that they don't mislearn the things that they should have learned because they didn't learn how to learn from incidents.

Nora Jones: I hope they read your book too, Casey.

Key takeaway no 6

Chaos engineering guides and post-incident guides should be used together.



About the authors

Nora Jones, founder at Jeli, practiced chaos engineering at Slack. She’s passionate about resilient software, people, and the intersection of those two worlds. She cowrote two books on chaos engineering and keynoted at AWS re:Invent in 2017 to an audience of over 40,000 people about the technical benefits and business case behind implementing chaos engineering.

Casey Rosenthal is the co-founder and CEO at Verica. As an Executive Manager and Senior Architect, Casey manages teams to tackle Big Data, architects solutions to difficult problems, and trains others to do the same. He seeks opportunities to leverage his experience with distributed systems, artificial intelligence, translating novel algorithms and academia into working models, and selling a vision of the possible to clients and colleagues alike. His superpower is transforming misaligned teams into high performance teams, and his personal mission is to help people see that something different, something better, is possible. For fun, Casey models human behavior using personality profiles in Ruby, Erlang, Elixir, Prolog and Scala.

About the speakers

Nora Jones

Co-founder and CEO at Jeli; Chaos Engineering pioneer

Casey Rosenthal

Deprecating Simplicity and the Rise of CV