Security Chaos Engineering

Kelly Shortridge • Aaron Rinehart | Gotopia Bookclub Episode • May 2022

What’s the state of the art in modern security practices? The authors of the book Security Chaos Engineering, Aaron Rinehart and Kelly Shortridge, talk to Mark Miller about the shift in mental model that one has to undertake to reap its benefits. Their approach paves a new way for security engineers to uncover bugs in complex systems through chaos experiments before an actual attack.


Transcript

What’s the state of the art in modern security practices?
The authors of the book Security Chaos Engineering, Aaron Rinehart, co-founder and CTO at Verica, and Kelly Shortridge, senior principal at Fastly, talk to Mark Miller, Vice President of Community Engagement and Outreach at the Linux Foundation, about the shift in mental model that one has to undertake to reap its benefits. Their approach paves a new way for security engineers to uncover bugs in complex systems through chaos experiments before an actual attack.


What is Chaos Engineering?

Mark Miller: One of the things that I like to do to start off is to lay the groundwork so that we're all on the same plane here. Kelly Shortridge, for those that are just starting out conceptually with chaos, what is chaos engineering?

Kelly Shortridge: Yes, I would hope that everybody knows the scientific method. You generate a hypothesis, you test it, you look at the evidence, and then you analyze it to confirm or deny the hypothesis. That's ultimately what chaos engineering is. In the security context, you're coming up with hypotheses about the safety of your systems. That could be whether a security control works, or whether a certain type of attack is going to succeed against your systems.

You can inject a little chaos to test whether the evidence confirms or invalidates that hypothesis, which is powerful and something we haven't seen very much in information security.

Mark Miller: When you say inject, you mean go into the system and inject an issue into the system?

Aaron Rinehart: Yes, and we do that because, as Kelly said, chaos engineering sounds provocative, right? But really, we're not creating the chaos. The problem is that because of the speed, scale, and complexity of modern systems, we're already dealing with chaos. What we're trying to do is create order proactively.

All we're doing is presenting the system with the conditions we expect it to successfully operate under. Right. We believe that under condition X, Y will happen. So instead of finding out randomly what didn't work, we're proactively saying, hey, computer, do you do what you're supposed to do? And that's what we're doing.
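A minimal sketch, in Python, of the hypothesis-driven loop Aaron describes. Every name here (run_experiment, inject_condition, observe, rollback) is a hypothetical placeholder rather than any particular tool's API; the point is only the shape: state the expectation, inject the condition, observe, compare, and always restore the system.

```python
# Hypothetical sketch of "under condition X, we expect Y" as an experiment loop.
# None of these names come from a real chaos engineering tool.

def run_experiment(hypothesis, inject_condition, observe, rollback):
    """Inject a condition the system is expected to handle, observe what
    actually happens, and compare it against the stated expectation."""
    try:
        inject_condition()          # e.g. open an unauthorized port, kill a process
        actual = observe()          # e.g. poll alerts, logs, or health checks
    finally:
        rollback()                  # always return the system to its prior state

    if actual == hypothesis["expected"]:
        print(f"Confirmed: {hypothesis['statement']}")
    else:
        print(f"Surprise: expected {hypothesis['expected']!r}, got {actual!r}")
    return actual


if __name__ == "__main__":
    run_experiment(
        hypothesis={"statement": "an unauthorized port change triggers an alert",
                    "expected": "alert_fired"},
        inject_condition=lambda: print("injecting: opening port 2222"),
        observe=lambda: "alert_fired",   # stand-in for polling a real alert queue
        rollback=lambda: print("rollback: closing port 2222"),
    )
```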

Why a book on security Chaos Engineering?

Mark Miller: And is that the genesis of this book? What was the purpose of the book? What are you trying to do?

Kelly Shortridge: Actually, the purpose of the book is to bring information security into not just the modern era, but also a realistic era. So we're not just talking about what perfect security is and textbook security and all of that stuff. It's like, OK, you know, you have a mental model of how you think your security is and how you think the safety of your systems is.

How does that match up with reality? We're looking at the difference between the two. As I said, I think most computer systems these days are complex systems, and complex systems inherently have unexpected interactions. We can either uncover those when there's a successful attack, or we can uncover them through experimentation. So the book is not just about this experimentation, though that is powerful.

It's also about all of the underlying philosophy around resilience: that ability to recover gracefully. How do you adapt? How do you evolve when you're building, maintaining, and securing these complex systems? That's a radical departure from the old way of doing security. So we could have just talked about the experimental part, but we're trying to do something much bigger as well.

Mark Miller: It's interesting. If you think about it, you're trying to overlay a mental map on top of a physical one. What is the relationship between the two maps?

Kelly Shortridge: You have your mental model of how the system is going to behave under certain kinds of stress, right? So you have that mental model, and then you have the physical reality. Look at, say, the supply chain attack, which is very popular right now.

You have this actual interconnection, maybe between two components you had no idea were connected. Then an attacker moves laterally in, and you're like, oh my goodness, this completely shatters my mental model of reality. It also manifests when you're performing incident response: you may have assumptions about how your systems behave, so you overlook things, and maybe the thing you overlooked is actually the problem because it's not even in your mental model of reality.

That's really what you're trying to do. You can almost think of it like those old-school movies, forgive me, where you do rubbings on a rock. It's almost like that: you're uncovering and excavating. OK, here's how it actually looks, not just whatever your preconceived sketch was.

Aaron Rinehart: I want to play off of that, too. It takes a while, I don't know if you feel this way, when you start doing this, to find the right words and the message. You're much better at that, hence all the good writing in the book.

But one of the best examples that I really honed in on is the concept of a legacy system. So what does that mean? To me, legacy typically means it's mission-critical. Right. It's typically the flagship application, or some derivative of that, or usually some kind of mainframe.

It's legacy. We call it legacy because otherwise we would have gotten rid of it. It's tech that we need; it's critical to the business. We often associate legacy systems with things like stability, right? We know how the systems work. The engineers working on them feel somewhat confident and competent in how they function, and they kind of have incidents rather than just issues.

But the question I ask is, was it always that way? Right. The answer is that it wasn't. We learned through a series of unforeseen surprises, incidents, and outages. That is, we learned the difference between what we thought the system was and what it actually was through a series of surprise events.

Now, those surprise events are when we said, hey, this works differently than we thought, so we need to go back and fix it. But remember, those surprises are reflective of pain, right? Customer pain, user pain, pain for the people operating the system. You can think of chaos engineering and security chaos engineering as a proactive way of accelerating that process of learning about the system before you encounter the pain.

How can Chaos Engineering help with control?

Mark Miller: There are a couple of things to unravel there. One is why people keep legacy systems when everybody's saying, oh, you should be replacing that with the latest and greatest. Usually, it's because those systems are critical. It's a critical system that is functioning, and the business doesn't want to replace it. But what happens with those systems is exactly what you said: there's a corporate memory attached to that system.

When you say surprise, surprises happen, I like that. Surprises happen as part of Chaos Engineering. Where is the corporate memory? Where's the documentation for these surprises?

Kelly Shortridge: Well, I think that's the beautiful thing about security chaos engineering: it generates a lot of that kind of shared memory. And ideally, and we'll talk about this, it should be widely shared and transparent. So it's not just you.

Mark Miller: Ideally, yes. But in reality?

Kelly Shortridge: No. But the beautiful thing about security chaos engineering is that, you know, incidents eventually fade into memory. Right. They become memories of the past, and people stop learning the lessons from them. But if you're continually practicing, conducting these experiments, checking your hypotheses, challenging your assumptions, if you're doing that continuously, which is what security chaos engineering is all about, then you're continually generating this new kind of shared knowledge.

You're continually keeping that memory fresh, which is important. I think to your point. Part of the reason why people do stick with legacy systems is that they do understand them. I think it's a very, very understandable natural human tendency. We want to understand the world around us. It makes us feel safer. It makes us feel more in control.

And again, with security chaos engineering, rather than, you know, relying on the long lifespan of the legacy system, which is one way to come to understand all of the different interactions and the way the system works, you can conduct experiments and generate that same understanding.

Even though chaos sounds like there's not a lot of control, it can, at least in part, give you a greater feeling of control because you have that greater understanding of how your system behaves in practice.

Aaron Rinehart: Furthermore, think about the moment when we're doing this. It's proactive, versus during an active event or surprise event, right? Because if it wasn't a surprise, you would have just fixed it beforehand. But during an active incident, that is not a good time to learn. People are worried about being named, blamed, shamed.

I knew I shouldn't have pushed that code. I knew I shouldn't have done it. I'm going to get fired, right? People are freaking out. When the world around you is on fire, you don't think clearly.

But the world is not on fire when we're doing chaos engineering or security chaos engineering.

We're proactively trying to understand the system. We can learn. We don't have the blinders on; we're not looking back the way we do when an incident or an outage happens. In the book we expand on a lot of these concepts around hindsight bias and the sharp end versus the blunt end, and a lot of it comes from Sidney Dekker's 20 years of airline accident investigations. We go into the world of cognitive science and safety engineering.

If you know the outcome of an event, you look at the events that unfolded leading up to that outcome completely differently, right. But what we're doing is proactively trying to understand the system with eyes wide open, if that makes sense.

How documenting incidents empowers engineers

Mark Miller: As you both were talking, it came to me that if the documentation of these surprise events is done properly, it becomes part of the system itself and a way for the system to respond to future incidents of the same kind. Is that part of the process here?

Kelly Shortridge: Yes. The way we characterize it is like building muscle memory for incidents. You can almost think of it as a training montage, where you have people fighting you so that by the real boss fight you're ready. It's very similar to that. You're building that muscle memory and making sure that when a real surprise happens, not these injected chaos surprises, you have a much greater sense of how to handle it and how to respond gracefully, and you don't experience the burnout and stress I think most responders go through, whether that's performance incidents or security incidents.

Mark Miller: Still, though, if you're thinking of humans as part of that process, that's the problem. I'm thinking that if chaos engineering is done properly, the system is the one that responds, not the humans.

Kelly Shortridge: Completely disagree.

Mark Miller: Okay. Good.

Kelly Shortridge: I don't think humans are the problem. I think that's like Aaron was saying; we can see this across all sorts of disciplines. I think there was one incident where the relevant alert...this might've been in a nuclear plant. Maybe you remember. The relevant alert was on the backside of one of those old-school big panels, at the very bottom, flashing. And they said, "Well, the operator should've known." It's like, "Really? Because 100 other lights were flashing. They should've known that the correct one was on the other side?" It's ridiculous, right. You can see other industries where they blame an operator for not knowing that a truck's backup beeper was broken. Not realizing that it was broken and looking around to see…

We love to blame operators because we're removed from it. We're not in the situation, and I think it's very much part of the just-world hypothesis. We don't like the idea that random, terrible surprises can happen, so it makes us feel better to say, "Well, actually, it wasn't random and terrible. It was because this operator did this dumb human thing." If we look at the history of incidents, humans are ultimately often a source of strength, because computers are largely, if not entirely, deterministic, right. Humans are the ones who can be innovative, nimble, and flexible in their thinking, and they can respond in very adaptive ways.

So part of what we're trying to promote with information security...information security, right, has blamed the human most of the time. Dumb users clicking on links, all that stuff. But what we're saying is actually, "Well, humans behave in very human ways. Maybe we should be designing our information security systems and our processes and procedures in a way that understands that humans can be very adaptive but also they have finite attention, they have finite time, they have all these sorts of production pressures and constraints." You have to ship software quickly, all that stuff.

We need to be very empathetic with humans. We also need to view the human as a source of strength. How can we make sure during incidents they do feel empowered to have a nimble response? I'm sure Aaron can talk about some of what we've seen kinda more in the wild on that front too.

Aaron Rinehart: That was a great answer. I can only echo it. I mean, we read a lot about this. Humans are the solution. I don't think we're ever gonna have machines writing all of our software for us. It's one of the ironies of automation, too: you'd think that when you write automation you need fewer humans, but you actually need more, because somebody's gotta maintain the code and somebody's gotta write new code, right.

Kelly said the word adaptive, right. There's this term in resiliency engineering called adaptive capacity. Humans have more ability to look at different things and figure out what's going on and make decisions.

We need to empower people. Part of where chaos engineering began at Netflix was putting better information in front of engineers, right. Turns out if engineers have better context...context over control, right. The more context you have about a problem, the more likely you can solve it. I think Charles Kettering once said a problem well defined is half solved, right. That's what engineers do. They solve small problems.

Kelly Shortridge: And I think there's an important distinction here between uncertainty and ambiguity. Computers can often be good at resolving uncertainty. That's just a lack of information: collecting all the relevant data points. But there's also ambiguity, and that's when you can have the same set of data points but two different ways to interpret them. What's the right interpretation in context? That depends, and it's not something we've taught computers how to solve, at least not yet. For instance, take an intensive care unit, right. A doctor sees two different patients. One maybe needs palliative care; they're facing something terminal. The other maybe could be cured, but they came in later.

How is a computer supposed to decide that? I don't think we necessarily want computers making those kinds of decisions, right. That's ambiguous. You have all the data that you could have, but what is the right decision? It's very unclear. And in a lot of these incidents, again, across industries and certainly in what I call the computer system realm, roughly what we call human error is when we think the human operator should've resolved that ambiguity in a different way, that there was a right way, though maybe not in that context. So I think we have to be careful: more information won't necessarily solve things, because a lot of these choices aren't ones where there's just one right answer.

It could be the same data points, again, in a slightly different context, resulting in a completely different outcome. We need to empower the operators to make the best decisions that they can with as much context as possible, but they aren't the problem. There are always gonna be mistakes. I don't think we're ever gonna be able to resolve ambiguity, ever. That's a judgment problem, and, not to get too philosophical, I think we see across society that a lot of times we try to solve problems of judgment with metrics, and I'm not sure how well that's working out because we just game the metrics. Is that better? I don't think so but…

What is security Chaos Engineering?

Mark Miller: All right. You have both used the term security chaos engineering several times already. Is it a coincidence that that's the name of your new book?

Kelly Shortridge: No. We recognize that there is a branding part of it, right. If we're trying to usher in this new era of information security, which we think is more pragmatic and more aligned to realistic models of how systems and the humans in those systems work, we need a catchy name, and security chaos engineering gets people's attention. Like Aaron was saying, chaos is a pithy word, one that makes people go, "Wait, do we want chaos?" And the answer is yes. So that's one part of it.

We also thought it was really important to extend the underlying discipline of chaos engineering, just because that has been well practiced on the broader DevOps community side, which is relevant to our conversation. So we're bringing it over and looking very much at how it applies to information security as well. We could've called it something like human-sensitive resilience engineering for system safety, but security chaos engineering is much pithier.

Mark Miller: So, who called who?

Kelly Shortridge: I think I was already writing a book because that's just what I do in my free time. I just love writing. And I think Aaron reached out to me because he had seen I was talking about security chaos…

Aaron Rinehart: There was a lot in that talk. It was the keynote you did. I think I came up to New York and we talked about all these things we could write about, and I must've filled up several pages of notes. I was like, "Okay. We have maybe several books here." There's so much depth beyond the practice of security chaos engineering. So, one of the things I was wondering early on was...well, I wrote the first open-source tool that applied Netflix's chaos engineering to cybersecurity.

Mark Miller: And that tool was?

Aaron Rinehart: ChaoSlingr. It's still out on GitHub but it's deprecated. UnitedHealth Group has their own version of it, and I'm no longer there. One of the things I was trying to figure out, what I wanted to understand more deeply, was why these chaos experiments were almost never successful, right. Because we only do chaos engineering experiments, security or not, under conditions we expect the system to handle. We think we're right.

We're introducing the conditions we expect the system to handle, right. We're not trying to introduce randomness; it's not a monkey in a data center pulling cables. Everybody loves to use that example. That's not what this is. That is chaos, right. That's mayhem.

What we're trying to do is introduce the conditions we expect the system to operate under, and when we did that, the system rarely did what we expected it to, right. So why were we wrong about how our system behaves? Because we were basing all of our engineering practice on hope and luck, right. As engineers, we don't believe in hope and luck. We believe in instrumentation, feedback loops, and data. We believe in measurements, the things that tell us whether it worked, why it didn't work, and how we can improve it.
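A hedged sketch in the spirit of the misconfigured-port experiment ChaoSlingr is known for (its real implementation is AWS Lambda-based; the security group ID, port, hypothesis, and timeout below are hypothetical stand-ins): introduce a condition the system is expected to handle and measure whether a preventative control deals with it.

```python
# Hedged sketch: open an unexpected port on a security group and check whether
# the preventative control (or a human) closes it within a time budget.
# GROUP_ID, PORT, and the timeout are hypothetical values for illustration.
import time
import boto3

ec2 = boto3.client("ec2")
GROUP_ID = "sg-0123456789abcdef0"   # hypothetical target security group
PORT = 2222                          # a port nothing should legitimately open
RULE = [{"IpProtocol": "tcp", "FromPort": PORT, "ToPort": PORT,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]

def port_is_open(group_id, port):
    group = ec2.describe_security_groups(GroupIds=[group_id])["SecurityGroups"][0]
    return any(p.get("FromPort") == port for p in group["IpPermissions"])

def experiment(timeout_seconds=300):
    # Hypothesis: "a misconfigured ingress rule is reverted within 5 minutes."
    ec2.authorize_security_group_ingress(GroupId=GROUP_ID, IpPermissions=RULE)
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        if not port_is_open(GROUP_ID, PORT):
            print("Hypothesis confirmed: the misconfiguration was reverted.")
            return True
        time.sleep(15)
    print("Surprise: port still open after the timeout; revoking it ourselves.")
    ec2.revoke_security_group_ingress(GroupId=GROUP_ID, IpPermissions=RULE)
    return False

if __name__ == "__main__":
    experiment()
```

Whatever the outcome, the experiment ends with the injected condition removed and a concrete data point about whether the control behaved as expected.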

I started following some of the work being done at Lund University, and the work of Casey Rosenthal and Nora Jones, on resilience engineering, safety engineering, and the cognitive sciences. There's a human brain behind these computers, right. And they interpret things differently.

We've been reading a lot about these concepts and trying to extend some of the great work being done by Dr. David Woods and Richard Cook and, you know, these people who have been...David Woods created resilience engineering, I think, from…

Kelly Shortridge: Essentially, the way I came to it was slightly different. I actually started with earthquake resilience, of all things, looking at how buildings are designed. There's this one quote that still sticks with me, from Dr. Elizabeth, I think, Hau, who's a geologist from what I remember. She says, "A building doesn't care if the earthquake was predicted or not. It's either gonna stand up or it's gonna fall, right." I thought that pattern-matched so well to information security. At the time, it's a little less now, there was so much effort put into how we predict breaches. Like, how can we predict what's going on, with all this data science going into it, and it doesn't matter. Either your system is going to be resilient against an attacker or it's not.

Either you're going to experience a breach or you're not; you're gonna experience downtime or you're not. But no one was talking about it in those terms. Before the information security side, I delved more into natural disaster resilience. I ended up also going to David Woods because he spans a bunch of disciplines and looks into resilience across all these different domains, and quite late I discovered that, "Oh, this has already been described on the chaos engineering side." I found that fascinating. But I think there is that underlying issue in information security, which I'm not sure how often the broader software engineering community understands: in information security, it's very difficult to measure what success is.

As a result, we have stuck with strategies for decades that just don't work. But that's the folk wisdom, and that's how it's always been done. If we can't measure success, we're never going to improve. And it seemed like being able to conduct these experiments...by the way, computer people are very lucky here; you can't conduct these sorts of experiments with nuclear power, right. It's unethical, and there are a bunch of domains like that. Even in macroeconomics, we can't just inject a financial crisis to see what happens across the system. So in computer science we're incredibly lucky that we can inject this kind of, again, what I call controlled chaos to better understand how our systems are gonna behave. In information security that is so powerful, because we can finally see: hey, is our strategy actually effective?

Do they operate the way we think they're gonna operate? Can we, again, build this muscle memory to respond to incidents better? That's why we think it has the opportunity...we can start to see an information security industry that is maybe a little more pragmatic and constructive rather than just hand-waving and, kind of, the shamanism of old.

Aaron Rinehart: AI is not gonna solve all your problems.

Kelly Shortridge: Correct, yes.

Aaron Rinehart: Stop that.

Examples of security Chaos Engineering

Mark Miller: Let's talk then about specifics.

Aaron Rinehart: Sure.

Mark Miller: We've talked in generalities about what's going on. How has this been applied? I think in the book you've got two examples to start with, right.

Kelly Shortridge: We have a few. We have quite a few case studies. We've been lucky that there are a bunch of people working on this, and Aaron has seen a lot of these up close. One of my favorite ones is actually...I believe it's the one in the book talking about logging pipelines. I think it was the…

Aaron Rinehart: That's Prima Virani,

Kelly Shortridge: Prima Virani, yes. I think it may have been the Census Bureau that had a breach, and what they discovered is that logs had been sent nowhere, I think for 18 months, maybe more.

Mark Miller: Logs have been sent nowhere?

Kelly Shortridge: Nowhere. They were being sent to a...I believe it was a SIEM that had been decommissioned, or something like that. So they were broken for 18 months and they had no idea, and that was one of the recommendations, obviously, in their report. I thought the report was a little too shamey, again, about human error, but it was basically, "Probably you should have your logs going somewhere real." So, in the case study by Prima, the point is that log pipelines are the lifeblood, whether that's for site reliability engineering or obviously on the security operations side, responding to an incident. You need your logs. You need that visibility.

You can test: hey, are our log pipelines going to continue operating under these various scenarios? You wanna be able to validate: yes, we can be confident that our log pipelines are gonna provide us that visibility.
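A hedged sketch of that log-pipeline check: emit a uniquely tagged synthetic event at the pipeline's entry point and verify it becomes searchable at the destination within a time budget. The syslog host and the SIEM search endpoint below are hypothetical placeholders, not anything from the case study itself.

```python
# Hedged sketch: does an event that enters the pipeline actually arrive?
# SYSLOG_HOST and SEARCH_URL are hypothetical stand-ins for a real pipeline
# entry point and a real SIEM search API.
import socket
import time
import uuid

import requests  # assumes the destination exposes an HTTP search API

SYSLOG_HOST = ("logs.internal.example", 514)
SEARCH_URL = "https://siem.internal.example/api/search"

def emit_marker():
    marker = f"chaos-marker-{uuid.uuid4()}"
    message = f"<134>security-chaos-experiment {marker}".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, SYSLOG_HOST)   # inject a synthetic log event
    return marker

def marker_arrived(marker):
    resp = requests.get(SEARCH_URL, params={"query": marker}, timeout=10)
    return resp.ok and resp.json().get("count", 0) > 0

def experiment(timeout_seconds=600):
    # Hypothesis: "events entering the pipeline are searchable within 10 minutes."
    marker = emit_marker()
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        if marker_arrived(marker):
            print("Hypothesis confirmed: the pipeline delivered the marker event.")
            return True
        time.sleep(30)
    print("Surprise: the marker never arrived; the logs may be going nowhere.")
    return False

if __name__ == "__main__":
    experiment()
```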

Mark Miller: But isn't that a one-off?

Aaron Rinehart: If someone's gonna spend time on security chaos engineering, I think control validation's a good place to start. It's a response to the observability problem in general. Observability, in my opinion, is the other half of this big problem Kelly and I are attacking with security chaos engineering. Observability in software security sucks. It's horrible. It's horrific, right.

We put all this time and effort into detective and preventative kinds of controls, hoping that when something happens, the technology works, the humans repair it, all these sorts of things. Well, with chaos engineering, as soon as we inject the condition that the preventative or detective logic is supposed to fire upon, we can look at the technology. Did it work the way it was supposed to?

But there are other things we can look at now that we're coming at it from the other angle. We're not looking at it after the fact, we're looking at it proactively, so we can see: did the systems give us log data and event data that we could read and make sense of? Had this been a real problem, would we have known what to look at?

Mark Miller: Well, still, go back to the point where if you have never seen a problem where logs are being sent nowhere, you have no way to look for that. You wouldn't even know to look for that.

Kelly Shortridge: I don't think that's necessarily true, because we have a case study in the book saying, "Hey, you should worry about this." I think there's also a great database on GitHub of postmortems and things going wrong. Part of what's in the book is that we recommend a whole boatload of experiments to conduct as well. But Aaron's point is really...again, it's the scientific method. I think it's very reasonable for people to state their hypotheses: when this port is opened up, we expect the security engineering team's Slack channel to receive an alert. You're just documenting, here's your expectation.

What may happen in practice when you experiment is, "Huh, the alert didn't show up. Why is that?" And then when you go and investigate, maybe it's because something's messed up in your logging pipelines. I think it's always good to come back to that scientific method, where you don't necessarily have to know the counterfactual of "log pipelines can be broken." What you do know is that you expect to receive an alert. From there you can untangle things, and that, again, is the beauty of security chaos engineering: it's about that mental model. You have this mental model of how your system's going to behave end to end, and how does that measure up in practice? So you don't need to specifically test, "Hey, are our logs being sent nowhere?"

But you can test: do we have alerts? And to Aaron's point, if we wanna investigate deeper, can we access the logs? And then maybe you're like, "Huh, we can't access the logs. That's interesting." Again, it is much better to find that out in this kind of controlled chaos experiment rather than when you have an attacker that's up in your systems, right.
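A sketch of the exact hypothesis Kelly states, "when this port is opened up, the security engineering team's Slack channel receives an alert." The injection step is a placeholder (it could be the security-group change sketched earlier), and the Slack channel ID, bot token, and alert keyword are hypothetical. If no alert appears, the investigation starts at the detection and logging pipeline, as described above.

```python
# Hedged sketch: inject the condition, then check whether the expected alert
# actually landed in the team's Slack channel within a time budget.
import os
import time

from slack_sdk import WebClient  # pip install slack_sdk

slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
ALERT_CHANNEL = "C0123456789"    # hypothetical #security-alerts channel ID

def open_unexpected_port():
    # Placeholder for the real injection; returns when the condition started.
    print("injecting: opening an unauthorized port")
    return time.time()

def alert_fired_since(start_ts, keyword="unauthorized port"):
    history = slack.conversations_history(channel=ALERT_CHANNEL, oldest=str(start_ts))
    return any(keyword in (msg.get("text") or "") for msg in history["messages"])

def experiment(timeout_seconds=300):
    # Hypothesis: "opening this port produces an alert in the channel within 5 minutes."
    start = open_unexpected_port()
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        if alert_fired_since(start):
            print("Hypothesis confirmed: the alert reached the channel.")
            return True
        time.sleep(15)
    print("Surprise: no alert. Did the detection fire? Are the logs flowing?")
    return False

if __name__ == "__main__":
    experiment()
```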

Mark Miller: A real-world example: the situation where the HVAC system was hooked up to the point-of-sale systems and nobody knew it. That example. Would chaos engineering have done anything about that, or flagged anything about that?

Kelly Shortridge: It depends on the experiment, but a lot of times it can, again, excavate that kind of flow. Think of the tributaries underneath once you unearth a system: you have control flow within a particular program, and you also have interaction flows and data flows within your application and within your systems. In theory, you might've been able to see, "Huh, there's relatively tight coupling here." Because if you injected some sort of fault into one system or the other, maybe you would've seen, "We're getting some sort of reaction in this other system, but there shouldn't be one, right."

I think it's certainly possible. Again, our view is that you shouldn't just do a one-off experiment. That's a good way to start but ideally, you're trying to automate some of this, you're trying to continually test your hypotheses and eventually get a little more sophisticated over time. My view is certainly if those systems matter to you and you want to conduct...I would imagine you would wanna conduct experiments in them, then probably that coupling would've been uncovered. But I don't know. What do you think, Aaron?

Aaron Rinehart: That's a good answer. This is not magic stuff. This is rooted in the core components of all science and engineering. We're providing a method for instrumentation. As Kelly said, over time with these experiments, and it depends on the experiment itself, you will see more, because how often are you actually looking at the system from that kind of angle? We're always looking at it after something bad happened, when we're worried about damage control. We're not focused on what actually happened, and we always love to believe there's a root cause.

What we're trying to do is proactively understand the system so that you'd be able to understand: where is this error coming from? It looks like it's an HVAC system.

Chaos Engineering in complex systems

Mark Miller: The problem, though, at this time in history is complexity and scale. At the enterprise level, with problems like this, the complexity has grown beyond human comprehension. Where does chaos fit in to make this, I would say, palatable...but that's not the right word.

Kelly Shortridge: That's the whole point of chaos engineering. If we had perfectly linear systems, you probably wouldn't need chaos engineering so much, because ultimately what chaos engineering is doing is looking at the interactions in the systems. The hard part about complex systems is that you can't reason about them as well; you can't build that mental model once they reach a certain level of complexity, in large part, like your HVAC system with the point-of-sale system, because of those interactions between components. Traditional security is not going to catch that either, right, because it's checking whether each component has something from a list of vulnerabilities; it's not looking at those interactions.

What chaos engineering is saying is, "Okay, that's not enough. What we need to do is, again, build that better mental model of how our systems are behaving." That's precisely what you do when you experiment: you're able to uncover, "There seems to be coupling over here and coupling over there, and some sort of fault in this component ends up trickling through to all these other components." That's not something you can do with basically any of the security testing methodologies today, but you can do it when you conduct chaos experiments.

We're certainly not saying it's a silver bullet, that in your first chaos experiment you're going to get this beautiful new architecture diagram of all of the interactions in a complex system. The point is that over time, I like to view it as piecing together a mosaic in a way that you couldn't before, and that's powerful.

Aaron Rinehart: One of my missions in speaking and writing about this stuff is trying to educate the security world about complex systems. The depth behind security chaos engineering comes from the resilience engineering and safety engineering kinds of sciences. If we don't start to understand this problem, everything we're doing today is almost irrelevant. It was okay when we had the three-tier app, right, and we had far fewer components, but now with microservices, public cloud computing, continuous delivery, continuous integration, DevOps, we're delivering value to customers faster than we ever have before.

But also, software never decreases in complexity, because of its changeability. If you see a complex software system and you want to make it simple, how do you do that? You have to change it, right, to make it simple. Well, the act of changing it...remember the relationship between change and complexity, right. You're just moving the complexity around. So it's not about removing complexity, it's about learning how to navigate the complexity and understand it. That's what we're doing with this stuff, because if we don't get better, I mean, this problem is a gnarly one to tackle.

Kelly Shortridge: I think that's a really good point. Computer systems aren't the only complex systems, right. It goes back to my point that, in some ways, computer people are spoiled in that we can wrangle this complexity in a way you just can't in other disciplines, because it's unethical or just impractical. One of my favorite anecdotes is that in nuclear power plants, in their pipelines, like, literal pipelines, they have to deal with the fact that clams will start to grow. We don't have to deal with stuff like that in computer systems, but it's mind-blowing that for a system that, if it experiences a catastrophic accident, could kill potentially hundreds of thousands of people, one of the ways that could happen, the root cause, could be clams growing in the pipelines, right.

I think, again, it's not like we're inventing this concept of complex systems. Computer systems aren't the first time we've had to deal with complexity. We can draw on all this incredibly rich research, over decades now, into complexity, whether that's in ecosystems, nuclear power plants, healthcare, mining, marine systems, or air transportation. It's just everywhere. Even human brains. We can leverage all of that expertise and those hard-won lessons to improve, in this case, the discipline of information security as it pertains to computer systems.

Aaron Rinehart: You brought up the human brain...I was thinking about the human body when you were talking about nuclear. The human body is also a complex adaptive system. I recently recommended that a security researcher read "How Complex Systems Fail" by Richard Cook. It's only two pages. When you read it, you think it's about a computer, right. It's actually about the human body. He's writing about the human body, but it makes total sense for computers. After she had read it, this researcher reached out to me: "That's an amazing paper." I said, "Okay. Did you know that he's talking about the human body?" And her mind was blown, right.

I guess what Kelly and I are trying to relay here is that we're going back and looking at what other people have done to tackle these problems as we walk into this world of technology. And now, arguably, the power has shifted. Technology is now facilitating a lot of these core components, and we have a responsibility, as stewards of them, to manage them effectively.

Mark Miller: One of the dilemmas, though, as part of this whole question of complexity: you mentioned speed, Aaron. Things have to be faster and faster. But there's a tradeoff there, because potential vulnerabilities get created by the speed at which things are changing.

Kelly Shortridge: I am not sure I agree.

Mark Miller: You never agree with me.

Kelly Shortridge: I know, I know. That's me as a person. One of the things Dr. Nicole Forsgren uncovered, and this was looking at the Accelerate metrics, or DORA metrics, the four golden signals for DevOps, is that speed and stability are correlated. Think about a very simple case in information security: patching.

One of the root causes of a lot of breaches is that a patch didn't happen for 18 months, or some egregious amount of time, after the patch was released. Why is that? Because they couldn't ship software quickly. If you can ship changes to your software on demand, you can ship patches on demand. That flexibility and that speed are vital for being able to implement security changes as well.

This is similar in other disciplines too: speed and stability are often correlated. Not always. If you look at seatbelts and brakes, and I think Sounil Yu has mentioned this, brakes allow us to go faster in a safe way. So I would certainly argue that we shouldn't view speed as inherently bad for safety; I think sometimes safety is what lets you go fast. But I think the bigger issue is around the mental models. If we're going quickly, and if we're able to build things at a scale that we can't reason about in our brains, that's where the mental models start to break down, because you're having to iterate on your mental model constantly. That's again a vital benefit of security chaos engineering: it allows you to keep evolving your mental model.

After a year of changes in a high-velocity organization, the way your system behaved on January 1st versus December 31st is gonna be radically different. If your mental model is the same as it was on January 1st, that's a huge problem. If you're continuously conducting these experiments, then your mental model is also evolving along the way, which is important.

How to update a mental model of a system

Mark Miller: You've used the term mental model numerous times here. How is a mental model actualized? What does it look like?

Kelly Shortridge: I think that's different in everyone's brain, but when it's actualized, architecture diagrams are a great example. They're very much the software architect or designer's notion of here's how the system looks, and maybe how it behaves, if there are flows between components, depending on the diagram. Some tools help with this, and security chaos engineering helps uncover it as well. If you look in practice, though, probably after a year in production that architecture diagram doesn't reflect reality. There are other things you would have to extend it to include: other systems, what it's talking to, all that good stuff.

Mark Miller: You're smiling. You just like hearing her talk.

Aaron Rinehart: I do. She's brilliant. Before my current role at Verica, I was the Chief Security Architect at UnitedHealth Group, and that's where I started doing a lot of this stuff with security chaos engineering. So she's so right. A solutions architect and a data architect would come to me about the same system and show me two different diagrams, right. Often the system never actually reflected either one; that was just how they believed the system to work. So neither one of them was correct. And when we say mental model, think about the number of mental models running through a system: all the humans and their perceptions of things.

So let's say you have 10 microservices in a modern application, right. You've got payments, you've got billing, you've got reporting, you've got 10 of them, right. You don't have to have exactly 10, but usually, for each service, there's a team of engineers: an engineering manager and probably some engineers as part of that, right. Sometimes one team will handle two services, but usually just one. So you have 10 sets of humans working on individual microservices. Those microservices are not independent; they depend on each other for functionality. All these things have to work together. And let's say they're doing things like CI and CD and DevOps. They're making, let's say, five changes a day per microservice. Maybe they're on the same schedule, maybe they're not.

They're all making vital changes that affect each other, right, in this post-deployment world. Just imagine the number of changes in systems impacting each other. All these teams see their slice of the system post-deployment differently, right, so it's a shared mental model problem. If the humans operating the system don't understand the system in their mental model, it's hard. Complex systems are hard, right. But we have to get the engineers operating these critical complex systems better information and better context so they can be more effective at what they're trying to do.

Short vs long version of the book

Mark Miller: Let's put that in the context of the book. You've got a 90-page book here and you have another one coming out. What's the purpose of the first book here, the first 90 pages?

Kelly Shortridge: I would say it's something bigger than an appetizer but maybe not a big steak sort of meal: an introduction to security chaos engineering. The idea is you get a bite-size taste of the underlying philosophy, enough to get you started and to rethink your security programs, practices, procedures, all the good stuff, as well as to get started with how to conduct experiments, which Aaron very much led, plus some examples of experiments that might work for the way most systems are built. So, for instance, thinking about automation servers, orchestration control planes, just vanilla Linux servers. We have a bunch of cool experiments there.

So it's a how-to guide for, "Okay, you're early in your journey." More realistically, you know that something is off in your security program. You don't seem to be getting results. Why is that? We talk about that a little bit, because we compare what I call security theater, which is the traditional way, where you do things for the sake of feeling like you're doing something about the security problem, with the security chaos engineering way. So it's enough for you to see, "Okay, here are some of the reasons why it's not working today. Here's how we can reimagine it, both from an organizational and cultural perspective and in terms of the philosophies behind the practices. And here's how we can get started in practice with some experiments."

What the full book is going to do is extend that. It adds much more depth. It's going to be a full-fledged guide, not just for how to build a brand-new, modern security program that works for this era of complex systems we're talking about, but also with many more case studies, because the one thing that is very clear from the audience is that they wanna hear from a diversity of industries and organizational types: how has it worked for you? There's going to be a large, chunky part of the book dedicated to that too.

Mark Miller: One of the things you and I have talked about, Aaron, when we were together in Singapore, maybe even Sydney, is that people learn from failure, hopefully other people's failures. Where does that fit into what you two are working on here?

Kelly Shortridge: The underlying philosophy of security chaos engineering is that failure is a great teacher and that failure is inevitable so we might as well learn from it.

Aaron Rinehart: That is the first opening…

Kelly Shortridge: I think it might be the first sentence.

Aaron Rinehart: Please, Yoda: "The greatest teacher, failure is." Isn't that Yoda's teaching?

Kelly Shortridge: Whatever it is..

Mark Miller: Somebody who was...Steve...I forget. Dr. Steve...one of Gene Kim’s friends said that failure is the default state of the system.

Aaron Rinehart: Oh, that's Dr. David Woods.

Mark Miller: Okay. You're right.

Aaron Rinehart: And that's what's...one of many Woodsisms. But he loves to say the system was never broken. It was designed that way, right. Right? Failure is a visible component in all human evolution. It's how we learn. I mean, we fall, we get up. Failure is a core component of how we grow and build better systems. What we're doing...that's why we opened up the first book with it. Probably going to open up the second book with something similar just because we need to...people need to understand that we're not very good at this stuff we're doing, right. 

We're trying to get better and this is a way you can...this is a methodology, this is a discipline you can adopt to help you get better at it.

So, what I love about chaos engineering as an engineer is the confidence you build, right. I'm not just putting something out into the ether, some system, some control, some firewall, some whatever, hoping that it works when I need it to. I'm actively introducing a condition it's supposed to work under, so I know it works. Knowing it works creates a level of confidence, so when something bad does happen in the wild, you know you're not fully covered, but you have confidence that you can handle those sorts of conditions.

Kelly Shortridge: There's also a meta point that maybe gets into colder, more corporate territory. Most traditional security strategy today is based on trying to prevent failure, which, as David would say, is impossible. So, one, you're wasting a lot of money, what economics calls opportunity cost: the time and money you're spending on trying to prevent failure from happening, which is impossible, could be spent on actually preparing for that inevitable failure and making sure you recover gracefully and the impact to the business is minimized. That's not where the focus is today. Today it's about trying to minimize risk, which is very nebulous, to zero, and it doesn't work.

So again, from that cold corporate perspective, we're just wasting a boatload of money right now on trying to prevent failure, which is impossible. The meta point here is that it results in failures of security programs. By trying to prevent failure, we're introducing failure at the macro level in our security programs, which is slightly poetic but very unfortunate for the users whose data we're not protecting very well.

Learning more about security Chaos Engineering

Mark Miller: You both have mentioned Sounil Yu a couple of times here. Who else should people be following? Sounil is an obvious one. He does great stuff.

Kelly Shortridge: I think certainly following the people who helped pioneer chaos engineering on the performance side of things, for one.

Mark Miller: Who would that be?

Kelly Shortridge: Well, some of Aaron Rinehart's colleagues.

Mark Miller: Casey for one. I mean, Casey's obvious…

Aaron Rinehart: Casey Rosenthal created chaos engineering at Netflix. And then there's Nora Jones. She did the keynote at AWS re:Invent years ago. There are also David Woods and Richard Cook. For security chaos engineering, you've got Jamie Dicken. She's doing some amazing work.

Kelly Shortridge: Absolutely.

Aaron Rinehart: She wrote the Cardinal Health pieces there. And Mario Platt…

Kelly Shortridge: Yeah, Mario Platt for sure. Some people are in an adjacent space, because again, security chaos engineering isn't just about the experiments. That's an important part of it, but there's also the underlying philosophy. I would say Bea Hughes is someone I know. She gave a talk, I think, back...gosh, in 2013. It was all about, "Hey, maybe we should actually be working with the humans in our systems, maybe they matter, and maybe just telling them, hey, you made a mistake and you're wrong, isn't particularly helpful, and how can we become more confident during incident response."

I know she's also continued to publish and talk about what I would call contrarian takes on traditional security, which I think is important because, again, security chaos engineering is all about challenging our assumptions. I think it's important to follow thinkers who are challenging those assumptions about what I would call status quo security.

Main takeaways from the book

Mark Miller: To kind of close the loop here, what are you hoping to accomplish? What do you hope people will do once they've read the book?

Kelly Shortridge: I have a very long wish list, but doing away with human error as a reason for incidents, as a root cause of incidents, I think would be huge, because from that trickle so many different changes. If you assume it's an individual human who caused this massive incident, then you aren't looking at the design; you're completely ignoring the fact that your mental model may be off. You just think, "Well, we need to punish the human or create more restrictions around the human," and nothing's ever gonna get better. So that's one of the key items there.

At a broader level, other than world peace and harmony, information security right now is honestly letting down the modern discipline of software engineering. I think that's safe to say. We would very much like to usher information security into the 21st century in a way that's going to make systems more resilient, not just keep spending money, making a lot of people very well off and, you know, giving them more importance as thought leaders. We want information security that works, not security that just does something for the sake of doing something.

Aaron Rinehart: I'm glad you said the human error bit. That was gonna be mine. I would also add root cause to that because root cause does not exist. It never has. It's a fallacy, right. 

Take this example. Name the one reason, the root cause, why you as a person are successful, right. You can't, right. If you can't name one reason why something is successful, you can't do the same for why it's not, right. And if you land on human error as the root cause of an incident or a problem, that's the beginning of your investigation, not the end, right.

So I'd challenge for more discipline there. I want to build off what Kelly said too. It's not that there aren't humans trying, right. Passionate people in this craft are trying to do good things. What we're trying to do is lead them in a direction that targets the real problem at heart, which is complex systems. And what we're also trying to give them is the philosophy for thinking about what we do differently. This is not Kelly and Aaron just making stuff up, right. This comes from the world of safety engineering and how they've been able to pioneer and transform the way they handle airline accident investigations or nuclear power plant incidents.

We bring that strength in too: the philosophy of why we're saying you should think about doing this differently, and for what reason. Furthermore, we follow that up, in the larger book coming toward the end of the year, with a deep how-to on how to write these experiments. This is not super advanced engineering, this is not AI, this is not blockchain, this is not some magic engineering that you pour on your stuff and it solves everything. This is a basic level of instrumentation, similar to testing. We're gonna walk through different examples and different kinds of systems. We have case studies from government, healthcare, banking, and startups, so everybody has an idea of how this stuff can be applied and how to write these experiments. And, you know, we look forward to seeing more stories unfold in the community.

Kelly Shortridge: Importantly, it's gonna be inclusive. This is not a problem that can be solved just by information security professionals; they just need a better way to conceive of the problem. It's also for software architects and software engineers, whether that's leaders or people building systems day to day. We will actually have chapters covering each phase of the software delivery lifecycle. So if you're architecting your system, you can open up the architecting and designing chapter and ask, how can security chaos engineering help me make this system more resilient against attackers?

So we want it to be practical, grounded in principles, and full of examples, so people feel confident about, I guess, interestingly enough, being more confident in their systems, so…

Mark Miller: The book is out, "Security Chaos Engineering".

Kelly Shortridge: The slim book is.

Mark Miller: The slim book is out.

Kelly Shortridge: And then the full book with the animal on the cover, which is TBD, we're very excited to see which animal we'll be given, will be out, I think, later this year.

Aaron Rinehart: Yeah. And the first book is written more to be read from end to end, whereas the larger book is much bigger and more of a reference [inaudible 00:51:23]. As Kelly said, if you're an architect, you can open up the architecture section.

Mark Miller: Right. Thanks to you both.

Kelly Shortridge: Thank you for having us.

Mark Miller: That was great.

Aaron Rinehart: Thank you.

Kelly Shortridge: Yeah. Thank you.

About the speakers

Kelly Shortridge

Senior Principal, Product Technology at Fastly

Aaron Rinehart

Sr. Distinguished Engineer of Production Reliability Engineering (PRE) at CapitalOne