Enabling Microservices Success
Sam Newman talks to Sarah Wells about her new book "Enabling Microservice Success." Sarah Wells, an independent consultant with extensive experience from working at the "Financial Times," shares insights on engineering leadership, culture, and the practicalities of implementing microservices. They discuss challenges like out of hours support, the importance of organizational culture, and lessons learned from early microservice adoption. Sarah and Sam highlight the necessity of a thoughtful approach to microservices, emphasizing team autonomy and resilience.
Transcript
Intro
Sam Newman: Welcome to a lovely episode of the "GOTO Book Club," where I'm talking to Sarah Wells, who's going to be sharing some insights from her recently published book, "Enabling Microservice Success" (https://amzn.to/4cAYNrR). Sarah, thank you so much for coming along today.
Sarah Wells: Thank you for inviting me.
Sam Newman: We're gonna get right into all of the awesome insights you have to share in the book and more. But I thought it'd be great if you could maybe just introduce yourself a bit, and maybe talk about what you do today, and maybe what your background is in IT.
Sarah Wells: What I do today is I'm an independent consultant. I work with a selection of different clients, talking about engineering, engineering leadership, and culture. This is my first time working independently. I've always had permanent jobs. The most recent one was at the "Financial Times." I was there for a long time. I was there for 11 years. I worked in IT for about 25 years, as a second career. I originally worked in scientific book production. And I went and did a master's course and learned how to code. Worked as a software developer for quite a long time, and then became a principal engineer, and then a technical director.
Sam Newman: And so, just to let people in a little bit, we're not gonna talk too much about that...basically, this is about books. But because of Sarah's background, she's an awesome person to review books. She reviewed two of my books, and you get excellent insights and fantastic copyediting. I got to review Sarah Wells's book. And it was also good to see that you actually understood things like spelling and grammar, which are things that pass me by. But it's always great when you've had other careers before IT; having different backgrounds means there are loads of new skills you can bring into what you do.
Purpose of the Book
Sam Newman: I guess that leads me to thinking about the main focus of the book. So it's called "Enabling Microservice Success." But could you maybe explain what it encompasses? Why did you want to write it? And what's its main focus?
Sarah Wells: I wanted to write it because I spent a long time at the FT working with microservices. We were very early adopters. I felt like I learned a lot about what to do and what not to do, and how important the cultural and organizational sides of things were. So you can easily focus on discussing what the architectural patterns are with microservices. But if you haven't got the right sort of team structure in place, and you don't approach things in the right way...and sometimes that is different for microservices than it was for previous ways of working, say with a monolith, then you can get yourself into a world of pain, where microservices require a lot more effort. And if you're not benefiting from that, you're just paying a cost that you don't get anything for.
Sam Newman: I think the thing I liked about your book, and this partly comes again to the roles you played at the "Financial Times." You mentioned culture and leadership. A lot of people might be thinking, oh, we hear about this stuff all the time in the context of development. But you actually had a bit more of a focus on more traditional spaces within the world of IT, looking at things like disaster recovery, and business continuity planning, and, you know, a datacenter springing a leak, and all those sorts of things. So, when you're talking about organizational culture, it goes way beyond, I am a developer in my box, much more to, we're a large complex IT organization, here's what sort of has to happen if you want to get the most out of these things. So could you maybe expand a little bit on some of the roles you played more recently, just before you left the "Financial Times?"
Sarah Wells: So I had been a Java tech lead for a long time. And then a principal engineer running the content API team. But because we were building a microservices architecture, we had to really care about running it in production. It was the first time I had to focus on what it means to build something that you can run and operate in production. And I ended up commenting on a proposal for how we were going to do out of hours support from development teams. Because we'd really begun to buy into the DevOps idea that you build it, you run it. The only people who can run the system in production, when it's changing as rapidly as you want it to be changing when you've got a microservices architecture, are the people that built it. You may have support for the platform underneath it, but you're the ones that understand your application. And that ended up leading to me moving across to be a technical director for operations and reliability. So I'd never done any kind of operational type role.
I took that job up and didn't realize that disaster recovery and business continuity were going to be part of the things I should care about. But they're all very related. You realize that if you've done operational support, and you've done chaos engineering, actually, those are applicable to planning for business continuity. So this is a complete tangent from the book. But when we had to go remote at the beginning of the pandemic, we'd actually been preparing for a month or so before that, by sending individual departments to work from home for a day. So we were doing the practice of, you've now got to work from home, but in a constricted sense so that the service desk weren't overwhelmed, and so that we had some continuity, we could send like part of a group home. So we were using things we had learned from operational incidents and operational practices in a non-technology part of the organization.
Recommended talk: Mature Microservices & How to Operate Them • Sarah Wells • YOW! 2019
Challenges of Out of Hours Support
Sam Newman: One of the insights you've shared with me in the past was about you build it, you run it, you ship it, and thinking about out of hours support. There's selling people on the idea, which, you know, I do all the time. And then there's actually making that idea a reality. And I know you touch on it in the book as well. But what were some of the challenges in terms of what you actually had to do to make you build it, you run it, you ship it a reality? And in transitioning to out of hours support, what sort of things have to happen?
Sarah Wells: Well, first of all, there's a real general kind of working-with-HR part of it, which is, quite often your organization doesn't have anything in people's contracts that talks about doing out of hours support. And people are concerned about doing it because they're worried that they're going to be unable to live their normal life, that they're gonna get called up all the time. People have lives outside work. They want to be able to go for a walk without worrying about phone reception. They want to be able to pick their kids up from school without worrying about that sort of thing. And I think you have to consider that. If you want people to support this stuff in production, you have to assure them that there will be space to build resilient, robust systems that are very unlikely to have problems.
So for a lot of our systems at the FT, it was very rare for them to have a production incident. Quite often, those would be things where we didn't have a lot of control over it. So it might be a CDN provider having an outage, where actually, you may have a backup, or you may just have to work out how you're going to cope with that for a period of time until it's fixed. But I think people will sign up for out of hours support if they don't feel it's going to take over their life.
Sam Newman: So you start off thinking about DevOps, then you start thinking about you build it, you run it, you ship it. And the next thing you're going through employment contracts with HR. So it sounds very glamorous. But I'd imagine...and this obviously varies in different parts of the world. I mean, it is, I suppose, quite concrete discussions like, can I drink when I'm on call? Can I go out to a restaurant? If I do get paged, am I expected to come into the office the next day? I would imagine you just had to work through all of those things, talk to people, and find out who could do what.
Sarah Wells: You do. And there definitely are parts of that. But you can't ask people to do out of hours support on a formal basis, like you're expected to do it. We took an approach for a lot of the teams at the FT where it was opt in. So you could agree to be on a rota. And we were very clear to people that you can choose to come off that rota at any point, either for a short period of time or for a longer period of time. And we would have conversations where people said, I'm not able to do this, for whatever reason. It could be that you just don't have the capacity to be called up at 3:00 in the morning and deal with something. But that's fine, because there are other things that you can be doing. So we would worry if we had no one signing up for the rota; that's usually a sign that you actually haven't built a resilient system and that you haven't given people space to make the right architectural choices.
Often though, you're saying to people, you get to make choices about the tools that you use, and the way that you build this because you're going to support it. And that can be quite appealing. When we started to introduce out of hours support for our content API team, we had a conversation about the architecture, and said, "What would make you feel comfortable?" And it turned out that everyone said, "Well, actually the data store we're using, it's proprietary, we don't understand it. We can't look up any problems that people have with it. We want to make a change." So we moved to a different data store before we went live, because that gave the team a conviction that if something went wrong, they could basically look up other people's experiences and find help.
Organizational and Technical Changes
Sam Newman: I've heard Michael Nygard say, talking about it in terms of architectural changes, that architecture doesn't change immediately. It changes like a wave crashing over the system. It's like change happens at different rates. That does sort of happen organizationally as well. Like, it starts by having those conversations person to person, making that little change happen, and not everybody and not every team, I guess, can move at the same rate. And even the services they look after, you know, they might be too complicated to be... They're not ready for that world. So I guess it sounds like it's being open to the fact that change isn't going to be immediate, and it won't be one size fits all straightaway.
Sarah Wells: Well, there are two things that that makes me think about. One is that actually, it's important to consider which of your services actually need out-of-hours support. It's very easy to say everything needs to be supported out of hours, but there will be some that don't. And we had a very strong categorization at the FT which was, you know, platinum, gold, silver, and bronze systems. Really, bronze and platinum were the two we mostly used. Bronze was something that was not required to be supported out of hours. Platinum is, and that means it needs to be multi-region. So you're giving the teams that extra level of resilience, where it's less likely there'll be a problem because you've lost an availability zone or a region. That was the first thing.
The second thing you made me think about was that actually lots of the changes we did around how we did out-of-hours support, we did on the basis of, we're going to try this for three months. So there was a debate the whole time that I was a tech director there about whether we were going to move out of hours support onto a more formal footing where people got paid for it, and they had to agree to definitely be there. And it never happened. There was a lot of debate. We stuck with the original, we're just going to try this out for three months. And at the time I left, that was still the way it was working. And people were generally happy with it, although people really worry that they're going to try to phone and there'll be no one available who can help.
Recommended talk: Release It! • Michael Nygard & Trisha Gee • GOTO 2023
Sam Newman: I want to come back to the gold, silver, bronze stuff in a minute because I find that stuff really interesting. But it might help to maybe set a little bit of context in terms of, at this stage, what the "Financial Times" looked like in terms of size, scale, location, because often people are hearing things like this and thinking, oh, that's fine, but you're a small company. But you weren't, right? So at this point, when you left, like, how many people were actively developing software at the "Financial Times?" And kind of where were you geographically?
Sarah Wells: I think it was about 250 to 300 people working in software development teams, something like that, spread across two locations: a development hub in London and one in Sofia. And another in Manila, sorry, three locations. Manila is largely operational, but there are also some platform engineering teams there as well.
Sam Newman: So when we're talking about this being quite informal, and it just kind of seemed to work, this is not like a small, 10-person company. When you're in a smaller group, you can often do things quite informally. And people assume when you're bigger, you need lots of formality. But clearly, you don't always.
Sarah Wells: So I think the thing that's interesting for me is you need to... So the way we worked, it was best endeavors: there is a list of people, and we had a first line operations team who would phone around and find someone who was available and could answer the phone. And there was a definite sense of fear that this meant that you were not gonna be able to find someone who could help. But if you had had a formal rota, there was still a chance that the person that's on call today can't help for this particular thing, because they didn't write that code, and they don't know that part of the code base. Because you have to have a rota that covers more than one team. Because you cannot have a four or five-person development team working out of hours on a formal rota, because you would be on call one week in three, which I think is excessive.
So you never have that certainty. I think there's that thing about, like, the certainty that someone will pick up the phone is not a certainty that that person is going to be able to fix the problem. And in a way, having a wider perspective of, we'll do our best, we'll find some people, maybe you'll be more likely to find a person who's available, who does know how to fix this problem.
Sam Newman: I want to come back to the gold, silver, bronze, and platinum model because that's something where almost a light switch went on when you first told me about it. Because a lot of people have this challenge. And this is what goes all the way throughout the book: we have now got a service. What does it mean to run it and to own it? And so the conversations then go to, okay, what kind of SLOs, what kind of uptime, what's my support rota? You see some companies where everyone tries to work that out themselves, in which case you end up with all kinds of mismatched things, or a one-size-fits-all that doesn't really make sense. So this felt like a really clever way of saying, which category does your service fall into, do we think, based on need? Oh, if you're gold, or if you're platinum, these are the things you need to do. So how did you kind of evolve those levels, and what did those levels mean?
Sarah Wells: So they'd been introduced while I was a tech lead in one of the teams. So I wasn't really involved in that first definition. The way that they evolved... In practice, most systems were either platinum or bronze. We didn't find the two in the middle... The main distinction was, is it multi region? And do people get called up if it breaks out of hours? Or is that not the case? And that worked pretty well. And meant you understood, should I phone someone because this thing is broken? And in terms of things like SLOs, there were some associated, how long should we aim for a response? If something goes wrong, how quickly should we aim to get hold of someone? How quickly should someone be working on this? I think there was some definition of how long to fix it. I'm not really super convinced about that. Because I think once you've got people working on it, they're going to fix it as soon as they can. So I'd rather just say that a main aim is to make sure we pick up through monitoring, that there has been a problem, and we get someone working on it, who's got the ability to try and mitigate it as quickly as possible.
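To make the shape of that concrete, here is a minimal, hypothetical sketch of how a tiering scheme like the one described above might be encoded as data that alerting and platform tooling can consult. The tier names echo the conversation, but the fields and response targets are invented for illustration rather than being the FT's actual definitions.

```python
# Hypothetical sketch: a service-tier scheme expressed as data so that alerting
# and platform tooling can consult it. Field names and numbers are invented;
# only the platinum/bronze idea comes from the conversation above.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServiceTier:
    name: str
    out_of_hours_support: bool               # do we phone someone at 3:00 a.m.?
    multi_region: bool                       # must it run in more than one region?
    response_target_minutes: Optional[int]   # how quickly someone should be working on it


TIERS = {
    "platinum": ServiceTier("platinum", True, True, response_target_minutes=30),
    "bronze": ServiceTier("bronze", False, False, response_target_minutes=None),
}


def should_page_out_of_hours(tier_name: str) -> bool:
    """Used by alerting: only platinum-style services wake anyone up."""
    return TIERS[tier_name].out_of_hours_support


if __name__ == "__main__":
    print(should_page_out_of_hours("platinum"))  # True: call someone
    print(should_page_out_of_hours("bronze"))    # False: wait for working hours
```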
Standardization and Tooling
Sam Newman: One of the things I know you did internally was, though, tried to give people guidance about the things that they should do. So there's often this conversation about standardization, all the services should sort of behave the same way. And there's always a concern in that, which is, people end up being constrained or they're given some framework they've got to use. But some of the things you do are quite interesting, which is you say, look, your service should be doing these things. Oh, and by the way, here's how you do it. Can you sort of explain how that worked internally?
Sarah Wells: We had an engineering checklist, or sort of guardrails, that said when you're building a service and you're running it in production, we expect these things to happen. There are tools to do them. If you're using one of our standard ways of deploying, maybe things are built in automatically. But if you do want to use some different approach because that's going to help you, that's fine, but you need to comply with the guardrails. So an example might be, we want every release to production to be logged through a change API, so that we know in one place everything that's changing. So if you're using one of the standard CI/CD pipelines, that's quite simple. But if you're not, here's the API that you have to make sure is called from your pipeline.
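To give a flavour of what that guardrail looks like from a pipeline's point of view, here is a minimal sketch of a deploy step calling a change-logging API. The endpoint, payload fields, and token are invented for illustration; they are not the FT's actual change API.

```python
# Hypothetical sketch: a deploy step that records a release via a change API,
# as the guardrail described above requires. The URL, fields and auth token
# are invented for illustration.
import os

import requests


def log_change(service: str, version: str, environment: str) -> None:
    """Tell the (hypothetical) change API that a release has happened."""
    payload = {
        "service": service,
        "version": version,
        "environment": environment,
        "triggered_by": os.environ.get("CI_PIPELINE_URL", "manual"),
    }
    response = requests.post(
        "https://change-api.example.com/changes",  # hypothetical endpoint
        json=payload,
        headers={"Authorization": f"Bearer {os.environ['CHANGE_API_TOKEN']}"},
        timeout=10,
    )
    response.raise_for_status()


if __name__ == "__main__":
    # Called at the end of a non-standard pipeline, so the release still shows
    # up in the one place where everything that's changing is recorded.
    log_change("content-api", version="1.42.0", environment="production")
```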
And some of those things were a little bit more of a constraint than others. So logging: we want people to produce useful logs of significant events. But actually, we want them all to go to the same log aggregation place. Because if you don't have all the logs aggregated in a single place, you end up finding it very difficult if something goes wrong. And I had this example at the FT, where we had an incident and we discovered that a team was experimenting with sending their production logs somewhere different, but I didn't have access to those as someone in the operations team. And it wasn't really clear to me how I would get hold of them. And I think that's a sign that you really need to have those things in a single place.
Sam Newman: There is also probably something about picking your battles in these situations. Which is: these are the things you should do. And these ones we're serious about. And these ones we're going to help you do. And you're probably, you know... It's just the whole kind of paved road idea, which is, if you use the things we've got as standard, most of these things are going to happen for you without you having to do any work. If you're going off piste, you can. But there are these things that we are quite serious about. It's like, I've got nephews, right? And if you overreact about everything they do, then when they're getting near a kettle, they don't pay attention if you tell them not to touch it because it's hot, right? I know I have just compared developers to my four-year-old nephew. But there is probably a bit of that, that you have to pick your battles a little bit around these things internally, right?
Sarah Wells: I think so. And those guardrails are really the things that are important to the organization, to making sure that the organization is secure, that we aren't incurring huge amounts of cost. Those things are really important. I think there's a couple of really useful things. One is building them into tools. The second is making things really visible. So if you can show people how they're doing... And in particular, if you can show people that actually this thing we've asked you to do, everyone else is doing it, but you're not. That can really influence people. I remember a very long time ago at the FT when I was running a team, the operations teams started to send out an email periodically saying, these are the noisiest alerts that we have. I completely ignored the email until my team was in the top 10. At which point I was like, yeah, we need to fix that because that's embarrassing. And we would do the same thing for some of the tools that we built.
We had lots of runbooks that had no information in them. We wanted to make sure that there was enough information to help people find the right team, do things like failover or scaling up: just the information that would help you to try and mitigate stuff. So we started scoring the fields in the runbook. And then we aggregated those scores together, so we could say this service has got 80% of the necessary information in it. And we rolled that up so that we could show teams how they were doing. And at that point it starts to get a little bit gamified, but also some people get very competitive. We had people coming to our team and saying, why can't I get 100%? Why can't I do this? And we're like, because we haven't quite got the rules right so that you can get 100%. Well, I want 100%.
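As an illustration of that kind of scoring, here is a minimal sketch that weights runbook fields, scores a single runbook, and averages across a team's services so the results can be published as a league table. The field names and weights are invented, not the FT's actual criteria.

```python
# Illustrative sketch (field names and weights invented): score how complete a
# runbook is, then aggregate per team so the results can be made visible.
from statistics import mean

# Fields we want filled in, weighted by how much they help during an incident.
FIELD_WEIGHTS = {
    "owning_team": 3,
    "failover_procedure": 3,
    "scaling_procedure": 2,
    "monitoring_dashboard": 2,
    "architecture_diagram": 1,
}


def runbook_score(runbook: dict) -> float:
    """Return the percentage of weighted fields that have a non-empty value."""
    total = sum(FIELD_WEIGHTS.values())
    filled = sum(weight for field, weight in FIELD_WEIGHTS.items() if runbook.get(field))
    return 100 * filled / total


def team_score(runbooks: list[dict]) -> float:
    """Average score across a team's services, for the league table."""
    return mean(runbook_score(r) for r in runbooks)


if __name__ == "__main__":
    content_api = {"owning_team": "content", "failover_procedure": "fail over via DNS"}
    print(f"content-api runbook: {runbook_score(content_api):.0f}% complete")
```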
Sam Newman: You worry about those things if it encourages the opposite behavior of what you're hoping for. But it sounds like in most cases, that wasn't what happened, and it was about... But again, it comes back to, you're making it clear this is important to you. You're letting people know how they're doing against it. You're explaining how they solve that problem. And you're giving them visibility. And then I think most people want to do the right thing in those situations. They're not going to circumvent things for the sake of it, they want to... And yeah, developers love getting points, right? That's kind of...
Sarah Wells: Yeah, some, not everybody. But the other thing about it is honestly, there's so much that you need to do. And if you can give people in a list where you say these are the most important, then I think that really works. We did similar things with security scanning. We wanted to show all of the things that people could do to improve security, and to prioritize them, to say, these are the ones that are most important for us and set objectives around. We're going to improve in this category, because that's the one where we think we get the most value.
Recommended talk: When To Use Microservices (And When Not To!) • Sam Newman & Martin Fowler • GOTO 2020
The Microservices Journey at Financial Times
Sam Newman: If you look back at the time at the "Financial Times," thinking about that kind of early adoption of microservices all the way through. I mean, looking back, what were some of the things that really surprised you about that whole journey? Was there anything that sticks in your mind as things that were just like, I wasn't expecting that?
Sarah Wells: So for us, we were so early, there was no tooling a lot of the time. There are lots of things now where you can basically get a tool for it. So we had to solve a lot of the problems ourselves. And we also found lots of things where no one had really pointed out that this was going to be very different. So testing, for example: we started building microservices, and we thought, oh, yeah, we need to have acceptance tests, where they have to test a business flow. And then we just basically built ourselves some acceptance tests that coupled every single service together. So we just had a distributed monolith. We all hated these tests. Because, and this may be an exaggeration, half of our code fixes would take a week to fix all the fixtures for the acceptance tests, because they all had to have data set up. And at some point, we just said, that's it, we're basically going to remove these, and we're going to find a solution that is better.
So it really pushed us to think about monitoring as testing, and in particular, testing in production. And actually those were both massive improvements. So instead of having a suite of acceptance tests that ran during the release process, we had some synthetic monitoring that ran all the time, and it ran in production, not just in pre-production environments. So the first example we had was: let's take a real article, a real old article, and just republish it every minute, and check that that publication went through our publishing process. So we'll know really quickly if some part of our publishing flow is broken, because that monitor will tell us that we can no longer publish that article. And I think that, you know, production is what you care about. Production is where you want to know that something's gone wrong. And in modern software development systems, it could break for many reasons that are not related to a code release. I would want to know about those as well. So I just think the value of that kind of testing being in production and not as part of a release pipeline is huge. And I wasn't expecting that change.
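As a rough illustration of that style of synthetic monitoring, here is a minimal sketch that republishes a known article and checks it comes out the other side of the publishing flow. Every URL and payload here is hypothetical, and the FT's real pipeline was more involved than this.

```python
# Illustrative sketch of "monitoring as testing": republish a known old article
# and verify it comes out the other side of the publishing flow. The endpoints
# and payloads are hypothetical, invented for this example.
import time

import requests

SYNTHETIC_ARTICLE_ID = "synthetic-check-article"        # a real but old article
PUBLISH_URL = "https://publish.example.com/articles"    # hypothetical publish endpoint
READ_URL = "https://content.example.com/articles"       # hypothetical read endpoint


def publish_and_verify(timeout_seconds: int = 60) -> bool:
    """Return True if the republished article becomes readable within the timeout."""
    marker = str(time.time())  # unique value to prove this publish round-tripped
    requests.post(
        f"{PUBLISH_URL}/{SYNTHETIC_ARTICLE_ID}",
        json={"marker": marker},
        timeout=10,
    ).raise_for_status()

    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        resp = requests.get(f"{READ_URL}/{SYNTHETIC_ARTICLE_ID}", timeout=10)
        if resp.ok and resp.json().get("marker") == marker:
            return True
        time.sleep(5)
    return False  # some part of the publishing flow is broken


if __name__ == "__main__":
    # Run this on a schedule (e.g. every minute) against production itself, so
    # breakage shows up whether or not a release just happened.
    if not publish_and_verify():
        print("ALERT: synthetic publish did not appear in production")
```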
Sam Newman: But both at your time at the "Financial Times" and even more so subsequently, you've been exposed to a load more insights from what other companies have done, other case studies, and loads of that is in the book, right? You know, yes, the "Financial Times" was a big part of it, but there's loads of other insights that you've gained from across industry. Again, sort of thinking about looking back, given the things you've learned subsequently, are there things that you wish you could have gone back and done differently? Were there blind alleys that you went down, things you explored, where in hindsight, you're thinking, well, of course that wasn't going to work, or, oh, I wish we had that idea. We could have used that and would have been so much better. So kind of if you had your time over again, I guess I'm gonna say, what would you do differently?
Sarah Wells: I think actually, you should get used to the idea that you do things, and you decide that they were a bad move. I think that that's the whole point. You only benefit from adopting something by being experimental and trying things, and some of them won't work. I think you should accept that. And, like, the idea is it should be easy to change and go back. The biggest thing about microservices was actually that we really believed the whole, it's 100 lines of code, they do a single thing. And we had so many microservices. And the more that you have, the more it costs you to keep everything up to date. So we had a team of 15 that had 150 microservices. So when you have to upgrade a library, you have to upgrade it in a lot of places. If it takes you five minutes, that's still a lot of time. It wasn't enormously painful, but if I was doing it again, I would compose them a bit more. I would have bigger microservices. And I think that generally what you want is something where it fits the boundaries of the team. So a team may have two or three, depending on what they want to do. But really, it's about making sure that their service is not shared across multiple teams.
Sam Newman: Those really aggressive... I'm kind of fascinated by that. I think the size thing is a meaningless concept with microservices. And it is about team boundaries, which is kind of the constraining factor for me. But I also get quite interested in ratios. And so what I do see in organizations where there's less awareness, there's less support, maybe there's a lot more manual processes involved, you might struggle to manage one microservice as well for like a team of 15 developers. Whereas if you've done a load of stuff and lots of automation, and you've got toolchains and investments, you can be a bit more aggressive the other way. I still kind of think that 10 services per developer is probably not that healthy. I think similar teams at Monzo are not far off that. But they do some very special things to make that work that most people don't get. So I tend to agree with that sizing, by the way. I think a team should own it. They can make it smaller.
But again, I suppose that kind of comes full circle. It's like, if the team is making the architectural decisions, and the team is seeing the implications of those decisions because they're the ones supporting it over the long term, then they're the people best placed to say, should I split it into 10? Should I have it as one? And that different teams might make different decisions. Right? I guess that's sort of an organizational point of view, a big part of that, of why we do this, right?
Sarah Wells: Certainly by the time I left the FT, people were not spending a lot of time putting all of these microservices back together. They felt like they had invested the time to be able to run a lot of separate microservices. Yeah, I think you're right. Monzo is a very interesting case, because they have a lot of standardization around how you're expected to build a service. The FT did not; there was a lot more variation between teams. So I think where you have a lot of standardization, it's quite easy to say, or easier to say, I'm building a new service to do this one thing. But where you don't have that, it's... I feel like when I ask people, 20 to 30 microservices in an organization feels like a very common number, rather than 2000.
Sam Newman: I think even if an organization has 2000, like, it's not 2000 within one system. This is the other thing that... This is why I think the ratio is still for me my preferred number, because we've got 2000 microservices. Yeah, but actually, what you've got are discrete lines of service. And within each discrete line of service, you've got about 100 people working on them. And in there, you've got about 30 services, right? That's kind of more interesting. So I'm always wary when you hear people saying, oh, we do X and we do Y, and extrapolating from that. So also, like, don't measure success by how many services you have. I think the only people that succeed in terms of you having more services are the people that are selling you Kubernetes clusters, right?
So I guess you went through that whole experience at the... You know, you've worked out how to make microservices work for you within the context of the "Financial Times." And you've also seen other companies that have tried, and some that have failed, at getting the most out of microservice architectures. That's what the whole book is about, right? It's enabling you to be successful with microservices. If you've got microservices, you should read the book. However, do you have a clearer view in terms of those... Are there situations where you just would not pick microservices, or is that, for you, quite situational? Are there certain circumstances where you just think microservices should be avoided?
Sarah Wells: I feel like both you and I have a chapter in our books that say, here are all the places where you shouldn't use microservices. Because absolutely, it's not helpful to go, I need to build microservices right now, and you've got a team of four people. I feel like microservices are generally...they are about solving organizational scaling issues, where you've got enough people that it's difficult to know what everyone else is doing. And there's a chance that two of you will try to change the same part of the codebase for two different reasons. And there will be bad consequences as a result of that. So I wouldn't generally start off with a microservice architecture in a small team. I also think it's really difficult, even in a big team, to build something for the first time using microservices, because you don't understand where the boundaries should be.
So you learn where the boundaries are by building stuff. So I would start by just building a monolith. And then when you become aware that there's a part of that that always changes together, and that begins to represent a domain, you could extract that out as a service. I think that's the obvious way to go. This is not what the FT did, or not completely. And actually, when we adopted microservices, three different groups within the FT adopted a microservices architecture for a new system. But in each case, they were rebuilding an existing system. So rather than extracting services from a monolith, we basically already knew the domain and built a replacement that used our knowledge of the domain to split it up into smaller services. Which, you know, I look at it and go, no one would have advised us to do that. Three in parallel, all adopting this quite new technique, but it worked.
Conclusion
Sam Newman: Well, and if you too want it to work for you, you should read Sarah's book. Sarah Wells, I'm gonna say, a huge thank you for your time today. I'd also like to thank you for reviewing my last book and my next book. So this is me giving back. I really would encourage you to read Sarah Wells's book. It's got loads of great insights about... Okay, so there are books that talk about the technical aspects, and then this is... So you've got a microservice architecture, how do you actually make it work in an organizational context. So it really is a great read. So there are loads more fun stories in there. I can't remember if you feature the story about the leaking roof in the book. I don't know if that's really related to microservices or not.
Sarah Wells: Well, that's part of my whole becoming, like, realizing that I was involved in business continuity. And yeah, there was a plumbing issue that ended up dripping water through the roof of the server room on the editorial floor. And we were running around putting plastic over the top of all of the server racks to just try and make sure that... Because it happened, of course, during newspaper production time. So in a newspaper, between 4:00 and 7:00 is the time when, if anything goes wrong, you have a real constraint. But yeah, that was a fun one.
Sam Newman: It's a glamorous, glamorous life. The pictures you circulated were great. Thank you so much for your time. And also thank you for writing the book. It's great to be able to point people towards that as a resource. It's packed full of great, useful advice. So thank you so much for your time today.
Sarah Wells: Thank you. And thank you for some great questions.
Sam Newman: I think we're good.
About the speakers
Sam Newman (interviewer)
The Microservices Expert. Author of "Building Microservices" & "Monolith to Microservices"
Sarah Wells (author)
Independent consultant and author