Building Software That Survives
"Organizations that design systems are constrained to produce designs which copy their communication structures," but most people miss the word "communication."
About the experts
Charles Humble (interviewer)
Freelance Techie, Podcaster, Editor, Author & Consultant
Michael Nygard (expert)
General Manager of Data at Nubank
Charles Humble: Hello and welcome to this episode of GOTO Unscripted. I'm Charles Humble, a freelance tech editor, author, and consultant, and this is part of a series of podcasts for GOTO talking to software engineering leaders. Today we're joined by Michael Nygard. Michael is currently General Manager of the Architecture Enablement Group at Nubank, which is a very interesting and successful financial services company. He's a much sought-after conference speaker and has written and coauthored several books, including 97 Things Every Software Architect Should Know and the bestseller Release It!, which is a book about building software that survives the real world and was a huge influence on my own career. I'm thrilled to have you on, Michael. Welcome to the show.
Michael Nygard: Hi, Charles. I'm delighted to be here.
Charles Humble: Thank you. So I've described your career to some extent, but can you maybe talk about it from your perspective? What is it that you would say that you do in a general sense?
Michael Nygard: Yeah, it's a surprisingly hard question. In fact, I don't expect an existential crisis at the passport control line, but when they ask me what I do, I sort of say, well, I used to be a programmer. And it's true. That's where I began. I was a programmer and then an architect. And for about the last ten years, I've been in this sort of fuzzy space that we call tech leadership or technology leadership, which is applying what I understand about systems and people and organizations and architecture to try to help companies succeed in their objectives. It is a little bit nebulous and vague around the edges. So when somebody asks what do I do, I usually just say I'm in technology now.
Charles Humble: Yes, I can relate. So as I said in the intro, I think I came across you, as I suspect many did, through Release It!, which is such an important work, I think. Could you talk a bit about that book? How did it come into being?
The Genesis of Release It!
Michael Nygard: First of all, I'm glad that you found it useful. I never get tired of hearing that people found it valuable and influential. It came about through kind of a chaotic set of circumstances in my life. A consulting company that I had co-founded with great timing at the beginning of 2001 had gone along for a while, but we decided that it ultimately wasn't working. I was casting about looking for some kind of contract where I could put in just 40 hours a week, go home, and recover from the stresses of that consulting company.
What I ended up doing was getting a contract that turned into 60, 70, 80 hour weeks for the next two years. So it was not exactly the relaxing venture I'd hoped for, but it was a step in a different direction. Instead of being a developer, I was in this role called application administrator, which meant I didn't have root privileges, but I was there to write all the configs for the web servers and the app servers and help get this commerce site up and running.
It was an operational role at first, mainly focused on QA and then trying to move into production. I started asking annoying questions like, how are we going to manage all these configurations for production? Because we've got a whole bunch of machines and different instances. It turns out when you ask too many annoying questions that nobody has answers to, eventually you get tasked with solving those. So I designed and built the production infrastructure for that system that we were launching.
Once it launched, I was there with the care and feeding and the 3 a.m. crashes and the sudden operational outages. But because I had come from development, I was able to do things like look at the stack traces from the application server and go tell the developers, this is exactly where all of your threads are getting hung up and why your pages are not showing up. It would be something like get a connection from a connection pool with no timeouts, so just wait infinitely. And then something else would be holding a connection and throw an exception and bust out of its stack and never release the connection. And you have a deadlock.
So I was in that space where I was a developer in operations, quite a few years before we started talking about DevOps. I was able to bridge these two worlds that had drifted apart during the early web boom days. It was a great education for me because I saw that in ops, the standard solution is have you tried turning it off and back on again? And most of the time that worked. It was very disruptive. It just sort of got under my skin as an engineer that we were creating such awful software that had to be restarted several times a day just to keep up and running.
That was the genesis of it. I started looking at the problems and I saw that the problems came in types or classes, and that was actually an encouraging discovery because if there are types of problems, then maybe there are categorical solutions. I started looking for those, and that's what became the book.
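Editor's sketch: the connection-pool hang Michael describes above has two parts, a checkout that waits forever and a connection that is never returned when an exception unwinds the stack. The Python below is a minimal illustration of guarding against both, not code from the book or from the system he worked on. The names `ConnectionPool`, `PoolTimeout`, and `checkout_timeout` are hypothetical, and the "connections" are placeholder strings.

```python
import queue
from contextlib import contextmanager


class PoolTimeout(Exception):
    """Raised when no connection becomes free within the timeout."""


class ConnectionPool:
    """Toy fixed-size pool; the connections are placeholder strings."""

    def __init__(self, size, checkout_timeout=1.0):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")
        self._timeout = checkout_timeout

    @contextmanager
    def connection(self):
        # Bounded wait: fail fast instead of blocking the thread forever.
        try:
            conn = self._free.get(timeout=self._timeout)
        except queue.Empty:
            raise PoolTimeout("no connection available") from None
        try:
            yield conn
        finally:
            # Runs even if the caller raises, so the connection is not leaked.
            self._free.put(conn)

    def available(self):
        """Number of connections currently idle in the pool."""
        return self._free.qsize()
```

With a pool like this, a caller that raises inside `with pool.connection() as conn:` still hands its connection back, and a caller that cannot get one receives `PoolTimeout` after the configured wait rather than hanging.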
Charles Humble: Right. Yes. It was such an interesting time because I think that whole business of how, as you say, operations and development sort of drifted into these two very separate camps was never really going to work. I think lots of us in different ways independently tried to figure out ways of bringing them back together, and that was sort of what fed into the DevOps movement. It didn't appear in isolation, but I think it's so interesting that that parallels with your experience and prompting the book as well. I think that's fascinating.
Modernizing Sabre: Legacy Systems and Technical Debt
You were involved with this remarkable technology transformation project at Sabre, which is an interesting company in its own right. So can you tell us a bit about what Sabre does and maybe a bit about the underlying system and architecture?
Michael Nygard: Sure. So I have sometimes referred to Sabre as the first software as a service company, because back when Sabre got started, it was the first electronic booking system and the first one to be offered to multiple airlines. It was built within American Airlines as a mainframe system. The mainframe was actually named Sabre, and it allowed travel agents to reserve seats on planes.
Sabre started offering that to multiple airlines and connecting agents to those airlines, running the passenger support system. So it all began on mainframes. When I joined, the majority of revenue eventually went through the mainframes in one way or another. Over the years, there had been successive attempts to take things off of the mainframe and build different types of systems—some monoliths, some mega services, some microservices later on.
But there was this environment of 50 years of accumulated technical debt, some data centers that Sabre had been occupying since the 1970s, and really a technology portfolio that no human being could fully understand. In that realm, I came in with the task of modernizing development practices, testing practices, moving to continuous delivery, and moving to the cloud.
We had a multi-cloud strategy at first, then we eventually settled on Google Cloud as the single target and had the mission to get everything to Google Cloud except the mainframes in a pretty short period of time. It was fascinating because we had initially thought the challenge would be the cloud portion of it, but what we actually discovered was that there had been this extended period of underfunding in keeping the technology current.
We had applications running on every version of Windows, every version of Linux you could think of. When I joined, there was some HP, some Solaris, there's probably a VAX off in a corner someplace. So really the biggest challenge was the modernization of dev practices, getting people just doing automated builds, continuous integration and automated testing. That was the first huge hurdle. I think we underestimated the challenge of that when we started.
Charles Humble: And the heart of Sabre at one point was something called TPF, if I remember rightly, which was a very unusual real time operating system. Was that still there when you were working? Was that still a core part of the architecture?
Michael Nygard: TPF was still there when I joined, and it was still there when I left, although the amount being done on it had changed throughout that time. So yes, TPF is sort of an evolutionary branch that most of the world never pursued. Real time operating system on a mainframe, highly interrupt driven, very different architecture than anything else you've seen, but it delivers extremely high volume throughput with very predictable latency.
The downside is, if it can't serve the request within the latency windows, it just refuses to accept it. It's quite different than the Unix model where it's like throw as many connections as you like, I'll queue them up, I'll have enough threads, I'll get around to it sometime. TPF has a lot of attractive attributes, and the throughput at the cost is actually pretty good.
The biggest problem is there just aren't a lot of people who know how to do it anymore. And you really have just one vendor for your hardware and operating system operations. So it's the ultimate in lock in.
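Editor's sketch: the contrast Michael draws above, refusing work that cannot be served in time versus queueing everything and getting to it eventually, can be illustrated with a small admission gate. This is not how TPF is implemented; it is a hedged Python illustration of the general refuse-when-full idea, and `AdmissionGate`, `Overloaded`, and `limit` are hypothetical names.

```python
import threading


class Overloaded(Exception):
    """Raised when a request is refused rather than queued."""


class AdmissionGate:
    """Admit at most `limit` concurrent requests and refuse the rest at once."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def handle(self, work):
        # Non-blocking acquire: when every slot is busy, say no immediately
        # instead of parking the request in an unbounded queue.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("at capacity, request refused")
        try:
            return work()
        finally:
            # Free the slot whether the work succeeded or raised.
            self._slots.release()
```

The design choice is that callers learn about overload right away and can retry or go elsewhere, which keeps latency predictable for the requests that are admitted.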
Charles Humble: Yeah. It's fascinating. And as part of updating the stack, you've got what's effectively under-invested legacy systems on different operating systems and so on. And so you're updating things, you're paying down technical debt, and you're migrating to the cloud. Are you also doing structural things within the organization at the same time, changing the nature of the way the tech organization is working as you're doing the modernization process?
Michael Nygard: That was definitely happening. It was sort of one step above my pay grade. So my boss was part of a couple of big reorganizations, one moving us to a functional structure where it had been the case that there were multiple business units that each had their own development. And then there was the sort of corporate portion that was meant to be common platforms and common technology.
Centralization, Autonomy, and Organizational Change
We moved to a single engineering org shortly after I joined. This was one of the successive waves of centralization and decentralization that had happened there, which many people regarded with some skepticism and a little cynicism, along the lines of, "Well, I guess the pendulum is swinging the other way now."
But that was one of the big organizational changes. On the business side of things, two separate business units were unified. One was aimed at serving the airlines. The other was aimed at serving travel agencies. And those were brought together as well as a single product organization.
Charles Humble: So just to clarify, you're essentially moving from some centralized command and control type structures to something that is more autonomous, more of a liberated structure. Is that what was going on at the same time?
Michael Nygard: Well, it's interesting when we talk about centralization versus autonomy, because I think people often view it as a single spectrum, and you're either going in one direction or you're going in the other direction. But what I have come to believe is that you have to talk about centralization versus autonomy with respect to certain activities that you do.
What are the things that you can do independently, and what are the things that you have to go through the centralized authority for? So a classic example: finance is typically not autonomous. You have a certain leeway to make spending decisions within your organization, but you normally can't commit millions of dollars of the company's capital without getting some approval from a central authority.
So you have autonomy within limits, but you also have centralization for some governance and control. There you have a type of activity and a degree of autonomy to talk about. So what we went through at Sabre was creating a degree of autonomy around deployment and build processes that had been fully centralized in an operations group.
Some of it was being devolved to the individual teams, but not all of it, because they couldn't just decide to go hire their own data center and move a bunch of hardware into it. They couldn't decide to opt out of security monitoring tools and that sort of thing. So we were increasing autonomy in some dimensions, but actually increasing centralization in other dimensions simultaneously.
Charles Humble: That's such an interesting insight. I think it's a really good point and one that gets missed so often. How does that compare with what you have at Nubank where you are now?
Michael Nygard: Nubank really strongly endorses autonomy, maybe more than is practicable at a scaled up startup. But it is, again, autonomy in certain activities supported by structures. I'll give you some examples. We have a few hundred squads that are working on a few thousand microservices. Each squad is on their own pace for when they build features, when they do deployments.
It's all git-driven operations now. So when they merge a pull request, a deployment happens. And that's entirely on their schedule. But it works at scale because we have the supporting structures that allow them to do that. So we have a build pipeline that everything goes through. We have an architecture that expresses a lot of the service's attributes as data.
And then we have tooling that was built in the early days that helps examine that data and make sure that the service is safe to deploy. So we are enabling autonomy in those activities, like writing code, merging code, deploying code, by creating the structures that make it safe to do that at scale.
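Editor's sketch: Michael mentions an architecture that expresses service attributes as data and tooling that examines that data before a deployment. The interview does not describe the actual schema or checks, so everything below is an assumption made for illustration: the `ServiceDescriptor` fields, the `deploy_blockers` function, and the three rules are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescriptor:
    """Hypothetical metadata record describing one service as plain data."""
    name: str
    owner_squad: str
    depends_on: tuple = ()
    handles_pii: bool = False
    has_alerts: bool = False


def deploy_blockers(service, known_services):
    """Return a list of reasons the service should not be deployed yet."""
    problems = []
    if not service.owner_squad:
        problems.append("no owning squad")
    for dep in service.depends_on:
        if dep not in known_services:
            problems.append(f"unknown dependency: {dep}")
    if service.handles_pii and not service.has_alerts:
        problems.append("handles PII but has no alerts configured")
    return problems
```

A pipeline step could call `deploy_blockers` on every merge and stop the deployment when the list is non-empty, which is one way a shared structure can make squad-level autonomy safe.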
Charles Humble: That's really interesting. There's something that I have come across, and I've come to think of it as a day two problem. So you've had some sort of organizational transformation and you've given people more autonomy, and then suddenly you have to figure out how you keep the company moving towards a common goal or common objective.
I think it's one of those things that often gets overlooked. There are tools for that that are quite effective tools, but if you haven't really thought about how you're going to do it, you end up in a situation where the organization is pulling in all sorts of different directions. I was curious if you had any thoughts on this or any reflections on tools or approaches that you've seen work?
Michael Nygard: Sure. First, though, I have to say, one of our slogans is it's always day one. So I'm officially not allowed to talk about day two problems.
But at Nubank, we started as a single product company in one geography. We were offering a mass market credit card in Brazil, and since then we have added more products: investments, loans, deposit accounts. We're in multiple countries: Brazil, Mexico and Colombia. We've also scaled up. We had a period of hyper growth in our customer base and in our organization.
So alignment is a constant challenge. It's something that we spend a lot of effort on. We used OKRs from an early stage, and that's a tool that's commonly recognized as helping to drive alignment. I'll say it helps. OKRs by themselves are not sufficient because you'll often have multiple top level teams directly facing the market that each have their OKRs.
Where we are in Brazil is very different than where we are in Colombia. And therefore the things we need to do in Brazil are very different than the things we need to do in Colombia. And then, of course, that means that we have market forces pulling us in different directions at the very beginning, and you percolate that through a layer of global products and then global platforms. You often have multiple OKRs that are asking for different things from the same people. So OKRs are part of it.
We are also working on high trust methods, something called high performing teams, attempting to get most of the leadership speaking a common language about active communication, conflict surfacing, conflict resolution. This is all very squishy stuff to most engineers. But you can see the results when you look at a management team that, even if they're not all pulling in the same direction, they have common tools for acknowledging that, surfacing it and working out the differences.
Even so, we put a lot of effort into alignment. One of the things I'm doing is constantly looking for situations where architectural boundaries would enable more autonomy and therefore require less effort expended on alignment. And that's a little bit of a complicated causal chain, but I'll give an example, or an analogy.
Suppose you have green space behind your house, and four other families also open up into that same green space, and you want to plant a garden. Well, you probably need to talk to your neighbors because you're modifying some shared space. You're taking use of it for your own purposes. You have to negotiate, like I'll give you some cabbage or whatever, or I'll give you more zucchini than you would ever want.
Whereas if you actually had fences and you knew the boundaries of what was yours, you could do what you needed to do in that space with less effort spent on negotiation and working out the boundaries and parameters. There's a sense in which knowing the boundaries of where you can operate freely increases your autonomy within that area, even though you do have boundaries.
So architecturally speaking, we talk a lot about boundaries. We look at coupling inside and across those boundaries, and we try to find good boundaries that make for good fences and make for good neighbors.
Conway's Law and Communication Structures
Charles Humble: That sounds consistent with Conway's Law, which gets bounced around a lot. Organizations which design systems, in the broadest sense, are constrained to produce designs which are copies of the communication structures of these organizations. I think it's one of the things that gets brandished around a lot. I think there's some really interesting, quite subtle aspects to it that maybe get missed. I don't know if you want to reflect on that a bit.
Michael Nygard: I would love to. I think it's super important. In fact, Conway's Law is very well understood by the leadership at Nubank and was something that we were trying to teach to the leadership at Sabre. One of the most interesting aspects of Conway's Law is that people often leave out the word "communication" when talking about the structure. So they say that the software will mirror the structure of the organization.
That's not what he said. It will mirror the communication structure. What I've often seen is companies reorganize with an intent to have things operate differently, create better cohesion among these groups that you're bringing together, create a boundary between the groups that you're separating. But if you only change the formal structure and you don't also change the communication structure, then the change will not produce the effects you expect under Conway's Law.
There's a great book called Team Topologies that had a section in it that blew my mind and was clearly, evidently true once you assimilated it. It stated that communication is not an absolute good. If you're observing excess communication between platform users and a platform team, it might indicate that your platform is inadequate, or that the APIs are not well designed, or that you've missed an abstraction.
So that notion of using a surplus of communication where it doesn't need to be as a smoke detector was kind of a wild idea, but it absolutely fits along with Conway's Law. If you expect there to be an interface in the software and you're not observing an interface in the communication structure, you won't get what you want from the software.
Charles Humble: Yeah, 100%. I think there's a related thing I've seen. I've only seen it once, but it was very early on in my career when I was very junior, so I wasn't really involved in it, but it was a large investment bank, and they were having all sorts of problems. They brought in McKinsey, a large consulting firm. The McKinsey people recommended this new, very complex, very hierarchical structure that was extremely difficult to implement.
But they also, as far as I could work out, paid very little attention to how things were actually being done. What we ended up with was the official organization charts and then this sort of unofficial shadow charts. We talk about shadow IT a lot, but this was like a whole shadow organization. The actual structure didn't look a lot like the official structure. Is that something you've seen?
Michael Nygard: Absolutely. And especially if you have a place where there are people with high tenure who have known each other for decades, they know each other's kids and so on, you'll never stop them from talking to each other. Sometimes the most effective way to get things done is to utilize that network. But yes, you risk falling into this worst of both worlds situation where you have formal systems and formal processes and JIRA tickets for everything and ServiceNow in place.
You have a procurement process and a procurement system, but nothing actually happens until you get one of the tenured, networked people to poke somebody else. So you can eventually access the speed of that high trust network, but you've also got the weight of the bureaucracy of the formal processes that just don't work.
Or you have to exercise the informal network first, and then the formal process is just the paperwork that you do after all the decisions are already made. So if you want to use the network style and have that high speed lateral, high trust communication, that's great. If you want to have the formal processes, that's great. But one or the other ought to work without needing to engage both of them every time.
Charles Humble: Yeah. I think there's something related around tooling. Where we have workflow automation and tooling that itself ends up entrenched. And then you go and change the structure, but you don't change the tooling. So you've now got this weird disconnect between how the workflow automation works and how the organization is supposed to work. Again, is that a thing you've seen?
Michael Nygard: Yes. And I've seen an entire industry grow up around trying to solve that by enabling the users of the processes to do their own automation in one way or another. It used to be that the most effective departments were the ones that had the couple of people who knew how to run Microsoft Access and Visual Basic to serve the immediate needs of that department without getting tied up in IT scheduling.
These days, we tend to refer to RPA (because giving something a three-letter acronym makes it less distasteful) as a Band-Aid over tools that are always two reorgs behind. Automation and re-automation is a classic problem. You can reduce but not eliminate the lag of tooling behind the organization by putting more of that automation directly in the hands of the users.
I think what we're going to see over the next year or two is a lot of adoption of agentic AI as the latest incarnation of solving the tools problem. So I can just have the AI go and figure out how to get my needs met in the current organizational structure.
Charles Humble: What's your take on this as an approach? Do you think that's something that's likely to help or is it just more noise?
Michael Nygard: I think it's likely to help the users of corporate processes. Particularly, we're seeing a lot of interest in Model Context Protocol as a way of making systems more usable by the agents. I think there's a risk that we're pouring concrete on the problem instead of solving the problem. Meaning, we may lose adaptability in the lower layer and just have this layer of agents in the middle trying to keep up. I'm not entirely sure how that's going to play out in practice.
Charles Humble: Yes. I too am not sure. I think I have certainly seen an awful lot of attempts of using tooling to paper over a problem that's really an organizational problem. And really the only way to fix an organizational problem is to fix the organizational problem, I think. But it may be that it can help in some way.
Michael Nygard: Maybe. I will say I've been in this industry for a while now and I've seen 4GLs, VB and Delphi builders, I've seen screen scrapers and RPA, I've seen agents, I've seen low code platforms. All of these are attempts to put some control directly in the hands of the end users. And they're all beneficial. But they're always sort of on the periphery of what the real IT inside the company is doing. And that makes it hard for them to stay current and continue to have traction.
Charles Humble: Yes. And I think often people who are not IT people don't see it as their job to build automation tools or database querying tools or whatever it is. So I think it's difficult. If I'm an HR manager or something and you say we've got this wonderful new way of solving your problem, that's like, well, that's not my job. My job is managing people.
Michael Nygard: At this point, I don't necessarily see it as their job to do the last mile of knitting together four different corporate systems, so that the end user isn't swiveling between them. It's very common if you go and watch people engaging with the corporate side of the world—I'm not talking about software development for serving customers, but the internal systems—it's very common that you see people with a spreadsheet open where they're taking notes, which is their scratchpad or their context window for integrating across multiple different internal systems.
Leadership & Culture
Charles Humble: Yes, absolutely. We've talked a bit about organizational change, but I want to throw something else in here, which is that I think in leadership positions you can often effect a huge amount of change, possibly even more change, without necessarily messing around with an org chart very much. It's kind of like the reversal of that thing of you have to be really careful what you say as a leader, because what you say is much more important than what you do. But I think sometimes we go to the reorg because it seems like that's what we need to do. I think quite often there are maybe less high profile ways to effect change that are perhaps more effective or in some cases easier to do.
Michael Nygard: I don't know if I would say easier, but I would say more effective and less disruptive. So I agree with you. Reorgs are disruptive. Sometimes, in rare circumstances, the disruption is what you desire out of a reorg. But that's usually not what I'm going for.
One of the ways that you can create change is in what you celebrate and the sort of stories and myths that you tell. If you celebrate the person who's working 70 hours a week, they're always in at 10:00 at night, this person got divorced and lost their family because they're working so hard, and you give them a cheer for that, you're sending a very strong signal about what you expect and what's required to get ahead in this company.
It's very easy to celebrate the people who put out fires. It's much harder to celebrate the people who prevent the fires. But I can tell you that your company is better off if you have more fire preventers. You need some firefighters. Always. I've done a lot of it. It's very gratifying. But fire preventers are important too.
So the stories and the kind of founding myths that you tell, that's one thing. Certainly be careful what behaviors you will reward with promotions and high performance ratings. I don't care how much you think performance ratings are confidential. Everybody knows everybody's performance rating immediately. So whatever you reward, you get more of. That's a lot of it.
And of course, the leader is also constantly on stage and under a spotlight, as it were. So you want to be deliberate about the culture that you're modeling and how you show up and what you are indicating is acceptable discourse and what's not acceptable discourse. That's something that leaders should always be running a background thread on. Like what am I signaling beyond what I might say?
Charles Humble: Right. Yes. And I think there's a level of additional care or additional caution you need to think about as you get more senior. Again, early in my career, I can remember very senior leadership who had come down to the pub with the developers and got drunk and behaved inappropriately. The message that you're sending about what behavior you deem to be acceptable is very strong and not great.
You just can't do that anymore. It's one thing when you're a junior programmer or something. It's still not good. But we can probably say that's okay. But when you're at a point where your behavior is going to get mirrored, that comes with a whole layer of extra responsibility that I think everyone needs to be aware of.
Michael Nygard: Yeah, absolutely. And I think for people in management positions, there's actually even an inverse causality, which is if you are someone who cannot manage himself, you won't get the advancement into those higher level positions. It's something that you learn to recognize. Who has the maturity and sort of self-regulation to handle larger scope, larger responsibilities, more pressure, more difficult negotiations? And who isn't there yet?
The other thing is for individual contributors, senior engineers often don't understand how much of a role model they actually are. They are leaders, even if they don't have management positions. Younger engineers look up to them not only for how do I solve problems and build systems and think about things, but also what does it mean to be a senior engineer in this organization?
Charles Humble: Yes, that's a really good observation. I love that a lot. Yeah, absolutely. You were talking about the thing of the firefighters getting promotions and the fire preventers getting overlooked. I think that can happen at the team level as well, where the team that does well ends up getting starved of resources and the team that's always late and always making a noise gets more and more resources thrown at them. You get this kind of perverse incentive.
Michael Nygard: Yeah, definitely. So I would also say that within a team, there are often people doing a lot of glue work. This is a term that was introduced relatively recently, but I think it captures an important concept. Glue work helps the entire team be effective, but isn't always as flashy or as direct an impact or direct contribution. But it's extremely valuable nonetheless.
Resources
Charles Humble: Yes, absolutely. Do you have any recommendations of books or conference talks or other resources for people who are moving into more senior management positions, things that you've recommended or things that you've found helpful? You mentioned Team Topologies, but are there other things that you would recommend?
Michael Nygard: Yeah. I'll recommend a book and I'll recommend a speaker. This is the point where I ought to have a book to promote. I don't. I'm sort of remiss in that. But one that I would definitely recommend is called Crucial Conversations. It's about having hard talks when the stakes are high. It is something that I have reread a few times and studied and deliberately practiced, and I'm still not very good at. So that's a good one that I can recommend.
And then on the conference circuit, there's a gentleman named Pat Kua who talks about becoming a technical leader, and he has a lot of valuable material, much of it from conferences, including GOTO conferences. Much of it is on YouTube now. So I would say you could do a lot worse than watching everything he's said.
Charles Humble: Michael, that's brilliant. Thank you so much for your time. I really appreciate it. It's lovely to talk to you. And thank you for joining me on this episode of GOTO Unscripted.
Michael Nygard: It's been my pleasure.