Despite the widespread adoption of DevOps and CI/CD, some companies still rely on manual deployment in 2023. Michael Nygard, author of "Release It!", examines new patterns and anti-patterns that have emerged since the first edition of his book was released in 2007. Mike and Trisha Gee explore why companies using current best practices continue to encounter challenges.
Come along to hear from the trenches of the DevOps movement.
Trisha Gee: Hello, and welcome to another episode of the GOTO Book Club. In this episode, we're going to cover "Release It!" the second edition by Mike Nygard. I'm Trisha Gee, I'm a Java Champion, a developer advocate, and I'm gonna be asking Mike all about his book. Mike, please introduce yourself.
Mike Nygard: Hi. I'm Mike Nygard. I've been a developer and architect, and now a technology leader for more years than I really care to admit. I was one of the early developers to go into operations, and learned some hard lessons from being woken up at 3:00 and 4:00 in the morning, for weeks on end, and tried to sort of bring back my experience from there because when I went into operations, I thought it would be about fans failing and replacing disk drives, and that was the tiny minority of problems we resolved. It was all software problems. So that's what motivated me to write the book. And I've since spent a lot of time traveling around the world to conferences where we met, and taught people how to make software that survives production.
Trisha Gee: Which kind of preempts a little bit my next question, which is, can you give us a quick overview of the book? What does it cover, and who is it for?
Mike Nygard: Well, the primary audience is the professional software developer or architect. It covers a lot of the topics that generally weren't addressed in software design courses or in training that you would find, because those would always have these very short examples and they would say things like, you know, error handling is omitted for clarity. And then you go read real software that's been around for a while, and it's like three-quarters of the code is about error handling and recovering from weird states.
So, there was this whole big space that just...it wasn't being talked about. And kind of at the same time, we were moving into larger scale systems, web systems that have to be on all the time. You don't get a maintenance window. You may have very large volumes of people arriving all at once. So, it had more demanding characteristics, and people weren't really being taught how to design for that. So, that was the primary emphasis, was kind of filling that gap.
Trisha Gee: As you said a bit earlier on, one of the themes which keeps coming up in the book is being woken up at 4:00 in the morning with some sort of error. And you're right, like when I was a developer in the '00s, that never happened to developers because that was an Ops problem, and we just didn't feel that pain of something that we'd written that was causing production problems and causing the Ops folks to have to figure out what on earth we'd written in our applications.
The first edition is a classic, in my mind, and it came like at just the right time. It sort of preempted the DevOps thing and in my mind kind of was part of the input to the movement towards DevOps, where developers are encouraged to start thinking about like, how do you get your code into production and what is the impact? What's the difference between running in production and running on your machine?
What is new in the second edition
Trisha Gee: And so, the first edition is now a little while back. What has changed in the book since that first edition? What's new in the second edition?
Mike Nygard: Well, when I started contemplating the second edition, I started making a list of what had changed in terms of infrastructure and architecture and the market. And it was like all the smartphones and mobile apps, basically all of the cloud. I think in the first edition I had maybe a page about, if you're going to the cloud, then a bunch of these rules change. And by the time I sat down for the second edition, it was sort of like cloud was almost the default or the de facto standard. We had the entire movement around open-source monitoring and management tools. When I wrote the first book, if you wanted monitoring in production, you went to HP and signed a check with a lot of zeros on it to acquire Opsview...sorry, OpenView or some similar, you know, licensed package.
Then by the time we looked at the second one, there was this embarrassment of riches around monitoring tools. And the question is, you know, which ones do you plug together rather than are there any? So, an enormous amount had changed. Some things hadn't changed, and that was actually gratifying that there were some things that stood the test of time. A lot of the material around stability and distributed systems absolutely still remains true. But, for example, there was a large section on capacity management that was too specific to physical hardware, and also needed to be generalized to take on physical environments, cloud environments, and auto-scaling environments. And there's sort of been this shift in the cloud world away from capacity management to cost management because you can always get the capacity and it's really about how much you want to burn.
So, I removed that section, changed it completely, and put in a whole new section around structuring cloud-based systems. Removed a large section around what I had termed the Ops DB in the first edition because people had actually gone and built the things. So, I didn't really need to talk about why you should have one. So, it needed a lot of updating. I also added an extensive amount about virtualization, and cloud environments, plus a couple of new stability anti-patterns and patterns that had been discovered in the time since the first edition.
Trisha Gee: You just said a whole bunch of things that I want to explore more. One of the things was visibility. I went back and had a quick look at the first edition and I could see there were a certain number of pages devoted to explaining why you need visibility and monitoring and, you know, this whole thing about, this seems to be important. And it's really funny to read the first edition now, because you're like, "Well, duh, of course, you need all of that stuff." And the second edition is much more like, "Yes, of course, you need all of that. And this is what you can do and this is what it gives you." Which I thought was really valuable and much more aligned with where a lot of applications are these days, I think.
Recommended talk: Observability Engineering • Charity Majors, Liz Fong-Jones & George Miranda • GOTO 2022
Mike Nygard: Well, there's a meme on the TV Tropes site that says Seinfeld isn't funny. And what it's referring to is if you watch Seinfeld now, maybe for the first time, you'll recognize all of the jokes, the instant they're being set up. That's because they've been repeated so many times and imitated so many times that they now are just sort of common expectations. There's nothing surprising about it. When Seinfeld was on at first, they were new, fresh, surprising, and funny. So, yeah, all the part about observability is sort of like that. You know, it was new at the time but has now become kind of accepted conventional wisdom, which is great.
Trisha Gee: Good. Yes, for sure. You don't have to devote pages trying to sell developers on, "This is an important thing." You're like, "Look, let's just assume that you're gonna do this." So, that was good. And yeah, and also, similarly, I thought what was quite interesting was, when you're talking about not just the auto-scaling side of stuff, but like virtualization and things, the fact that you have to be quite careful about the terminology and define those terms, because before in the first edition, you're able to talk about a server, and everyone has in their head like, you know, a server and a data center. And now you're like, "What does that mean?"
Mike Nygard: Right. Right. It's no longer a pizza box with blinking lights.
Trisha Gee: Right. Or it might be.
Mike Nygard: In fact, there's a really fascinating thing happening in the large-scale data centers where virtualization is now moving a layer below the CPU. So, it used to be that you'd run the hypervisor on the CPU, it would intercept all of your hardware operations. Now, especially in AWS and Azure, a lot of those things have been moved out onto the PCI bus, where your CPU is just running a regular operating system, and thinks it's talking to local storage, but it's actually being intercepted by a smart device that turns it into a remote network call. So, server now is like, yeah, you have your CPU that you're programming, but it's surrounded by 100 other cores that are all pretending that the CPU is in the same kind of server architecture it used to be. So, the picture just continues to get more involved and more complicated.
New (anti-)patterns in modern applications
Trisha Gee: Right. For sure. I mean, to me, that was one of the takeaway points of, certainly later on in the book, where you're explaining the terminology. You're like, as developers, we have to worry about so much more than we used to have to worry about. And it's a good thing to think about production, but we also have to understand, like get a bit closer to the hardware than we used to get. And then we're not even that close to the hardware because it's all these different layers between us. And, you know, because early on in the book, you're talking a bit about how TCP works and, you know, it's important for us to understand what's actually going on, otherwise, we can't anticipate potential problems in production if we don't know what's going on. And things have got so much more complicated these days.
Mike Nygard: For sure. And not only can we not anticipate the problems, but when the problems do occur, you sometimes have to peel back a layer of abstraction to understand what the actual problem is. And so, you know, diagnosis and problem-solving usually require awareness and visibility at a level below where you have caused the problem.
Trisha Gee: I liked all the stories that you had in the book, all the case studies and things. I like the way that you said things like the CPU usage is low, but nothing's happening. And that told me that there was something waiting on something. And that's why your experience really shows because other developers could look at the system, and all this visibility and monitoring and observability that we have, and be like, "There are no errors and it's not working. Like, what do I do?" But that you really clearly explain, you know, "Under these circumstances, my suspicion was connection pooling or garbage collection or...because I've seen this sort of thing before." And I thought that using stories, then having the anti-patterns, and then the what to do about it was like a really helpful way to teach developers like what are the problems, what's causing those problems, and what you can do to prevent those things in future.
Mike Nygard: Well, thank you. Certainly, people learn from stories. Humans are storytelling creatures. It's one way of transferring experience without having to suffer through it. But you also mentioned that notion of seeing an observation and then forming a hypothesis about what that observation indicates, and that's a step that really requires a mental model of the system underneath, in order to be able to create a plausible hypothesis. And of course, then you go from observation to hypothesis and then you say, "If this hypothesis is true, what else would I see? Or what would I observe that would disprove this hypothesis?" So, you kind of go from observation to model to observation. The model is the crucial step, otherwise, you'll just be overwhelmed with all the data, and the graphs, and the flowing numbers, and, you know, seeing the Matrix characters dribbling down your screen and so on.
Trisha Gee: One of the things that really hit me early on in the book, when you're talking about how...I can't remember the terminology you used, I've already forgotten. Like a small failure can lead to the cracks in the system, which can end up like bringing everything down. To begin with, you see everything coming down. It's very difficult to trace it back without experience, without reading a book like yours. Difficult to trace it back to that one line of code where you forgot to close your file handles or whatever.
Mike Nygard: Yes, it can be, especially because the initial fault can be quite small, but then once it begins to create this snowball effect and you have the cascading failures, so many other things begin to report problems that they can absolutely drown out the true cause. The true cause was, you know, maybe one line in your log file, and then you have thousands and thousands of log reports about services not responding.
Testing: Should we stop QA?
Trisha Gee: And you know perfectly well, and you say this in the book as well, that as soon as an app becomes unresponsive, particularly if it's a mobile app or a web app that's got like real human beings on the other end rather than like an intranet-type app, people are hitting refresh, refresh, refresh, and you know for a fact that's just gonna make things worse. One of the things that you mentioned a bunch in the book as well is that QA does not replicate these types of situations because a tester or an automated test is not going to hit refresh 27 times after something goes wrong, and you don't see those sorts of things in QA.
Mike Nygard: It's true. In fact, just recently I was investigating a defect. This one wasn't an outage, but it was a defect that occurred with multiple users interacting with their mobile apps simultaneously. There was some information leakage from one user's session to another user's session. As we dug into kind of the whole tree of causes that contributed to this, one of the things we observed was that there was QA, but because it was a mobile app, the QA was still manual at this company. And they were testing against an environment that had paired servers for HA testing, but they never had enough testers using it simultaneously to have even a 1% chance of 2 requests being on the same node at the same time. So either they would come one after the other on the same node, or they'd be load-balanced across nodes. And so, there's a statistical element that's hard to produce in QA, unless you devote a lot of resources to it.
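The statistical point can be made concrete with a rough back-of-the-envelope estimate. The sketch below is not from the book or the interview; it assumes Poisson arrivals, equal request durations, and a load balancer that picks nodes uniformly at random, and every name and number in it is a hypothetical chosen for illustration.

```python
import math


def overlap_probability(testers, requests_per_tester_per_s, request_duration_s, nodes):
    """Chance that one request overlaps another user's request on the same node.

    Assumptions (illustrative only): Poisson arrivals, every request takes the
    same time, and the load balancer picks a node uniformly at random.
    """
    others = max(testers - 1, 0)
    rate_on_same_node = others * requests_per_tester_per_s / nodes
    # Another request overlaps if it starts within one duration before or after.
    window_s = 2 * request_duration_s
    return 1.0 - math.exp(-rate_on_same_node * window_s)


if __name__ == "__main__":
    # Hypothetical manual QA: 3 testers, one request every 30 s, 50 ms requests, 2 nodes.
    print("manual QA :", overlap_probability(3, 1 / 30, 0.05, 2))
    # Hypothetical production: 2000 concurrent users with the same behaviour.
    print("production:", overlap_probability(2000, 1 / 30, 0.05, 2))
```

Under these assumptions the per-request overlap chance in the small QA setup stays well under 1%, while the same traffic pattern at production scale makes overlaps routine, which is why a concurrency defect like this can pass manual QA untouched.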
Recommended talk: How To Build Quality Software Fast • Dave Farley • GOTO 2022
Trisha Gee: Right. And my takeaway from that was not that we shouldn't do QA because I can see some developers being like, or business, if you like, going, "Oh, no, testing is expensive and it doesn't catch these types of defects." It's like, no, no, no, no. It's like you still need the QA that we're used to seeing, and the automated tests, and, if possible, make them a bit more like production. The fact is that the sorts of problems you see in production you will probably only ever see in production, and you have to anticipate them, and not just throw away all of your other efforts and go, "Oh, it's useless." Right?
Mike Nygard: Oh, for sure. I would even say with regards to QA, my response would not be to say, "We need to stop doing QA." It would actually be to say, "We need to do more different types of QA." So, we got a long way for a long time in the industry with people all kind of clicking around GUIs and manual things. We scripted that, automated it, and that helped make it more repeatable, more scalable. We moved to unit testing, which I fully endorse, but maybe we went too far and tried to do everything through unit testing.
I'm now also a big fan of property-based or generative testing. I certainly like load testing. I do some model-driven testing, where I have kind of a Markov model of how users interact with the system. And so, that can generate endless streams of really odd traffic that ordinary human testers might not think to produce on their own, or you wouldn't see it written in a test plan because it looks kind of wacky to have in a test plan, refreshing the page 27 times.
But when you use more different types of QA, each one is good at finding a certain class of problems, and the first time you do one of these kinds of testing, you'll find a bunch of problems. Pushing that style of testing farther and farther and farther creates diminishing returns. And so, there comes a point where you're better off stopping with one kind of testing and adding another kind.
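As a rough illustration of the model-driven testing Mike describes, here is a minimal sketch of a Markov model of user behaviour that generates odd but plausible sessions. It is not his implementation; the states, weights, and function names are assumptions invented for this example.

```python
import random

# Hypothetical transition table: each state maps to (next_state, weight) pairs.
# The states and weights are illustrative assumptions, not taken from the book.
TRANSITIONS = {
    "home": [("search", 5), ("refresh", 2), ("leave", 1)],
    "search": [("view_item", 6), ("search", 2), ("refresh", 2), ("leave", 1)],
    "view_item": [("add_to_cart", 3), ("search", 4), ("refresh", 2), ("leave", 1)],
    "add_to_cart": [("checkout", 4), ("search", 3), ("refresh", 2), ("leave", 1)],
    "checkout": [("refresh", 5), ("leave", 3)],
    "refresh": [("refresh", 6), ("home", 2), ("leave", 1)],
}


def generate_session(rng, start="home", max_steps=50):
    """Walk the chain from `start` until the user leaves or max_steps is reached."""
    session = [start]
    state = start
    while len(session) < max_steps:
        next_states, weights = zip(*TRANSITIONS[state])
        state = rng.choices(next_states, weights=weights, k=1)[0]
        if state == "leave":
            break
        session.append(state)
    return session


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so a session that finds a bug can be replayed
    for _ in range(3):
        print(" -> ".join(generate_session(rng)))
```

Because "refresh" mostly transitions back to itself, the generator will occasionally produce long runs of repeated refreshes, the kind of traffic that would look wacky in a hand-written test plan. Each generated step would then be mapped to a real request against the system under test.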
Trisha Gee: Right. I saw you mention load testing a few times in the book as well, and obviously having load testing is a good thing. But even the sort of automated load testing that we might naively implement to begin with, isn't necessarily the same as what users might do. I was taken by what you said about some of these sessions that kind of log back on again after like 15 minutes, and you'll never see that sort of thing in an automated test because it runs for 5 minutes, 10 minutes, whatever.
Mike Nygard: In a way, even the automated load tests are too predictable.
Trisha Gee: I think that's the main thing for me that I took out of the book. I've been writing software for 20 years and seen a lot of different things, but users are going to do weird stuff. And as our applications get way more complicated, and we're running stuff on the cloud, and we don't know where the servers are, and we don't know how it's scaled out. You were talking about social networks. And when you did the first edition of the book, probably you had no idea that a piece of software was gonna serve like a billion, million users. Things are only going to get worse from here as far as I'm concerned.
Mike Nygard: Well, I think there's even another shift that's well underway in most organizations, which is to create internal platforms, where the users are now other pieces of the software inside the company. And it's a premise of an internal platform that new use cases and new services can show up and start consuming the platform kind of without notice, without permission, without having to tell you as the provider of the platform that they're going to start using it. So, when we think about users being kind of random and unpredictable, and doing clever things with our software, we now also have to think about the internal software using our platforms in unpredictable or unexpected ways.
Trisha Gee: I hadn't even thought about that because when I worked in the world of developing intranet applications, you could like literally physically meet your users. You would go and see the 12 people using your software. It didn't occur to me that there might be internal systems using your internal system doing I don't know what.
Meetings & Manual Deployments
Trisha Gee: One of the things I read gave me terrible flashbacks: the deployment case study where you had like 20 people sat around a conference table, you'd had the go/no-go meetings and you start the whole thing at 10:00 at night and the business are gonna wake up at 1:00 in the morning to do UAT. And I had terrible flashbacks because I've definitely worked in more than one environment that does that. My question was, do you still see organizations doing that kind of like manual deployment process, which, as you cost out in the book, is ridiculously expensive and very painful?
Mike Nygard: It's not as far in the past as you might like. Just recently, I did observe a cutover process for an airline moving from one passenger support system to another passenger support system. This happens all in one night, with airplanes in the air, and airports in operation. So, literally as a gate agent, you may be there checking people in for one flight, using one system, and then by the time you're checking people in for the next flight, you're on the new system. It's a massive, massive change. If you think about how widely distributed airports and baggage claims and terminals are, all the different states that a flight can be in, all the edge cases and the unhappy paths that travelers can experience. And to have that all happen overnight while there are actually planes in the air that took off using one PSS and landed using another PSS.
So, in an event like that, yeah, there's absolutely a war room. In fact, for this one, there were war rooms at multiple companies, in multiple countries, making sure that it all went off. And to this company's credit, it did work. You know, there was a punch list of minor things. There were corrections happening in the middle of the night, but all of the airline's operations continued, and there were several hundred people involved in that particular cutover.
Now, I'll say that is not a routine event, that is not the quarterly deployment process, but there are still quarterly deployment processes at companies that have not automated, and there's a sort of a vicious cycle that keeps them stuck in this mode. They'll do the big overnight thing with the playbook and they've customized the playbook every time. And so, not everything works every time. It requires human intervention to recover, which reinforces the idea that we can't automate this because it requires humans to do things, you know, every time. Where if they started from the premise that said, "We can automate this by standardizing it, and the value of automation is enough to make it worth the cost of standardizing," then they would get out of this trap...
Trisha Gee: Right. I mean...
Mike Nygard: ...that is sadly common still.
Trisha Gee: I've worked with those sorts of organizations where the release is so painful and so time-consuming that they do it rarely, and therefore, there's no need to automate it, and so you end up in this loop. But my assumption was that now we have continuous delivery and we have your book out there, and we have DevOps, and automation has got so much better, and we have cloud. In theory, organizations are taking advantage of at least some of those things to move closer to an automated deployment process perhaps. Or maybe it's just such a big change that they can't get over the hump.
Recommended talk: Expert Talk: DevOps & Software Architecture • Simon Brown, Dave Farley & Hannes Lowette • GOTO 2021
Mike Nygard: I think it's like many of these sorts of diffusion curve problems. Many organizations are well along the transition, and many of them have made that transition completely, but the mass of legacy software out there is so vast. And some of it is in this state that's sort of like terminal life support: they can't quite afford to shut it down and replace it, but they can't quite afford to really modernize it, and so, it limps along, sucking people's lives away.
Challenges for organizations using Continuous Delivery
Trisha Gee: I had another question on that area, which is, so organizations that are following more of a continuous delivery type thing and have taken advantage of DevOps who read your first edition of the book, they must be facing their own challenges, I assume, in terms of deployment, getting it into production, making that work. What sorts of things have you seen in that space?
Mike Nygard: This is actually one of the subjects I have a lot of fun with because as we have introduced more automation to do these jobs for us, the automation itself has created new ways to disrupt and break our systems. I have a couple of examples in the second edition about the kind of automation gone wrong. One of them comes from Reddit which had a very serious outage when they were doing a migration of one of their configuration systems, ZooKeeper from, you know, one cluster to another cluster.
They had an auto-scaling service that used that configuration to determine what should be running out in the environment. Because they were going to be moving clusters, they deactivated the auto-scaling service, which makes perfect sense. They didn't want it to look at partial or incomplete data. Midway through their migration, a different configuration management service saw that the auto-scaling service was down and said, "Oh, auto-scaling should be up."
So that piece of automation turned on the auto-scaling service, which looked at the incomplete data and said, "Oh, apparently all of Reddit runs on three servers," and it shut down thousands of nodes. And it did it very, very quickly. It took a long time for Reddit to recover from that because they had to actually fix the auto-scaling, bring back all the nodes, and warm up all the caches. So, it was quite a large recovery effort. And it happened because these different pieces of automation were interacting with each other indirectly through the environment. So, this is kind of a force multiplier effect where the power of automation is, it can do a lot of things very quickly, and the risk of automation is that it can do a lot of things very quickly.
Trisha Gee: So, I guess the challenge then is to really understand the environment that you're trying to deploy into and anticipate what could go wrong when you don't really know what's happening or what could go wrong.
Recommended talk: Modern Continuous Delivery • Ken Mugrage • GOTO 2019
Mike Nygard: Well, that sounds a little bit impossible, right? Like that's in the unknowns category. But there are things you can do to give yourself a fighting chance. One of them is, certain actions are safer than others. So, if I need to bring up 10% more nodes than I've got, that's a relatively safe action, subject to budgetary limits and thresholds and limiting the burn rate. If I decide to shut off 50% of the nodes that are running, that may be an unsafe action. And so, one of the things I recommend is actually building into your automation some degree of understanding about what's a safe action and what's an unsafe action. And just slow down the unsafe actions a little bit to give the intelligent elements in this sociotechnical system an opportunity to see that something is not going right and intervene.
If any human saw that the auto-scaler said shutting down 100,000 nodes, you know, 50% complete, they'd go, "Oh my God, stop." And they'd be able to prevent the disaster. But the decision was taken automatically, the action was done very quickly, there was no sort of guardrails or limits that said, "Never shut down more than 10% at a time, don't do shutdowns more than, you know, once every 5 minutes," or something like that. So, in a way, we have to create some notion of inertia or momentum or just some friction to slow down those unsafe actions.
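The kind of friction Mike describes could be sketched as a small guard that sits between an auto-scaler's decision and the action itself. This is a minimal illustration, not code from the book or from any real auto-scaler: the class name, parameters, and the default limits (10% per batch, one batch every 5 minutes, echoing the numbers in the conversation) are all assumptions. Scale-up is treated as the relatively safe action and is not restricted here.

```python
import time


class ScaleDownGuard:
    """Caps how much capacity automation may remove at once, and how often."""

    def __init__(self, max_percent=10, min_interval_s=300.0, clock=time.monotonic):
        self.max_percent = max_percent          # largest share of nodes per batch
        self.min_interval_s = min_interval_s    # minimum gap between batches
        self._clock = clock                     # injectable so tests can fake time
        self._last_shutdown = None

    def allowed_shutdowns(self, running, requested):
        """Return how many of `requested` node shutdowns may proceed right now."""
        if requested <= 0 or running <= 0:
            return 0
        now = self._clock()
        if self._last_shutdown is not None and now - self._last_shutdown < self.min_interval_s:
            return 0  # too soon after the previous shutdown batch
        cap = running * self.max_percent // 100
        granted = min(requested, cap)
        if granted > 0:
            self._last_shutdown = now
        return granted


if __name__ == "__main__":
    guard = ScaleDownGuard()
    # The auto-scaler wrongly concludes that only 3 of 1000 nodes are needed.
    print(guard.allowed_shutdowns(running=1000, requested=997))  # prints 100
    # An immediate second attempt is refused until the interval has passed.
    print(guard.allowed_shutdowns(running=900, requested=897))   # prints 0
```

The guard does not try to decide whether the shutdown is correct. It only slows the unsafe action down enough that a human watching the system has time to notice and intervene before most of the fleet is gone.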
Trisha Gee: Right. As you said, the speed of automation is the problem really, and the lack of intelligence as well. Brute force is not usually the right answer.
Mike Nygard: Yes.
The impact of DevOps on the second edition of the book
Trisha Gee: I looked up when the first edition came out, and it was basically about the same time that people started talking about DevOps. And I'm guessing that DevOps was just not really a thing when the first edition came out. In my mind, the first edition very much fed into it; it was one of the inputs into the DevOps movement and the idea that developers need to care a lot more about not just writing lines of code on their laptop, but how it really works in production. So, I was interested in, given that the first edition fed into the DevOps movement, how much did the DevOps movement feed back into the second edition? And sort of how has your thinking changed and what's happened there?
Mike Nygard: Well, speaking of feedback loops, yes, you're exactly right. That did happen. Both of those influences were definitely there. When you look at the second edition, I think you'll see that there's sort of more of a notion of an ongoing process of adaptation and change. And so it's less about kind of big releases, ironic given the title of the book, but it's sort of about continual pathfinding. The back section of the new edition, and I know people only read the first half or two-thirds of most tech books, but the last half talks about some of the ways of thinking about architectural change in a continuous way and thinking about programmatic change in a continuous way. And that's very much driven by observations from DevOps where, for example, in the DevOps world, we talk about very significant architectural changes driven by operational issues.
I think it was probably John Allspaw, but it might be somebody else that I heard this story from. But imagine a large MySQL database or Oracle database and operations kind of coming back and saying, "Look, there's really just no way to do this kind of continual zero downtime deployments because every time we make a schema change, we have to re-index these massive, massive tables of users. And as we're re-indexing, we can't do any transaction on it. So, you're gonna observe downtime whether you want it or not."
That's an operational consideration. But it motivated a move to a different database technology and a different architecture where that table was partitioned and sharded across many different instances, precisely so that they could do zero downtime deployments. And so, when you think about that kind of feedback loop happening on an ongoing basis, that's kind of what the last section of the new edition is about.
Trisha Gee: Right. And I like the way that whole thing feeds into...well, a lot of the trends over the last sort of 10 years have been taking what Agile started in terms of like iterations and fast feedback, and sort of applying that to pretty much everything, and evolutionary architecture, and all of that kind of thing. It's all about observing, feeding back, making changes, and iterating over that.
Mike Nygard: In the DevOps community, that's now even being extended to internal audit and IT risk management. That's a notion of continual feedback and incremental change.
Trisha Gee: My experience...
Mike Nygard: Agile everywhere.
Trisha Gee: Good. My experience of risk management is that it's not very iterative at all, so I'm really pleased to hear.
Recommended talk: Architecting For Scale • Lee Atchison & Ken Gavranovic • GOTO 2021
Mike’s personal development
Trisha Gee: One last question because I think we're gonna wrap up soon. I've got to find it now. Yes. It was a slightly less book-focused question and a more you-focused question. What has changed with you since the first edition was released? What have you been working on and how did that impact this book, if you want to tie it back to the book?
Mike Nygard: Well, even the second edition is a little bit aged now. So, the second edition was in 2018. A few things have changed in the world since 2018. So, as for myself, I'm now working for a Brazilian financial services FinTech startup called Nubank with operations in Brazil, Colombia, and Mexico. And in particular, I'm responsible for the data analytics platform. This is an area that I think DevOps has largely ignored or bypassed: still very much centralized operations, throw it over the wall, run in a big batch. So, I'm keenly interested in how we can take the same principles and practices and bring them into the data world where sometimes just running a join on this 10-billion row table takes several minutes. Maybe you're doing a calculation that takes an hour or two. So, how do you get fast feedback in an environment where, once you run it and it breaks, your cycle time there is measured in hours or days? So, it's gonna be an interesting set of challenges.
Trisha Gee: I'm looking forward to the third edition. Thank you. I want to give you one last chance if there's anything else you want to comment on about the book or anything else you want to say.
Mike Nygard: Well, I think the largest thing I would say is, very often we look at the challenges of adapting to the new world, cloud-native continuous deployment, and we say, "We can't do this because X, Y, Z." I would like people to invert their thinking around that and say, "It's valuable to do this, and in order to do it X, Y, and Z must be true." And it seems like a little, you know, linguistic inversion, and maybe just playing games with words, but actually, expressing things in terms of prerequisites rather than obstacles, it does have a pretty large change in how we approach things. And so that would be my encouragement. If you're struggling to get to this more continuous, feedback-driven, incremental style of work, start enumerating the things which must be true, in order for that to be part of your world.
Trisha Gee: I like that. I like that because it's useful, but also, it's more positive and more proactive. "We need to do this, this, and this, and then we can get to where we want to be."
Mike Nygard: Exactly so.
Trisha Gee: Great. Well, thank you very much, Mike. I hope lots of people buy your book because it's extremely helpful. I love the way it's been updated to encompass a whole bunch of like, more modern patterns of developing applications, and I think everyone should read it. Thank you.
Mike Nygard: It's been my pleasure.