Building Modern Software at Scale: Architectural Principles Two Decades in the Making
Battle-tested lessons from the tech frontline! Join Charles Humble and Randy Shoup on GOTO Unscripted as they reveal when to break your monolith, master domain-driven design, and drive productivity with continuous delivery.
Thrive Market and Randy Shoup's Career Journey
Charles Humble: Hello, and welcome to this episode of "GOTO Unscripted." I'm Charles Humble. I'm a freelance techie, editor, author, and consultant, and this is part of a series of podcasts that I'm doing for GOTO talking to software engineering leaders.
Today, we're joined by Randy Shoup. Randy has spent more than two decades building distributed systems and high-performing teams and has worked as a senior technology leader at many of the Silicon Valley giants, including eBay, Google, WeWork, and Stitch Fix. He is currently SVP of engineering at Thrive Market. He also coaches CTOs, advises companies, and talks a lot, sometimes at conferences, about software, and he's interested in the nexus of culture, technology, and organization. Randy, welcome to the show.
Randy Shoup: Thanks, Charles. It's great to be with you. Love to chat with you as always.
Charles Humble: Always a pleasure. Thank you very much for doing this. So, what attracted you to Thrive Market? It's a smaller company than eBay, Google, or Amazon. So, what was the sort of attraction there?
Randy Shoup: Thrive Market, for the very many people that don't know, is an online organic grocery here in the United States. So, I think Costco meets Whole Foods, but online. What Costco means is it's a membership model. It's a curated selection, and you get better prices by being a member. Whole Foods' angle is it's all organic, sustainable, regenerative. We have a thousand banned ingredients that we do not have on any of our products. So, one of the reasons people love us is because they know we've already done all the looking at the labels. So, like, if we say it's allergen safe for you as a gluten-free allergen-having person, like my son, or various other things, like, we've already done that.
So, we have 1.6 million members now. We've been around for 10 years, and we're growing more this year than we ever have. And I'm starting to answer your question about why I joined. So, I love this phase of the company in particular. It's late-stage private, hopefully, with fingers crossed, transitioning to a public company sometime over the next while, when the markets, etc., permit. And I've done this ride four times now: first, actually, back in the dot-com boom/bust with a company called Tumbleweed Communications, which did security software. At Stitch Fix, I led engineering up to and through their IPO in 2017. At WeWork, I led a significant portion of engineering up to and through our not successful IPO in 2019. Happy to tell that story, but that's a whole other podcast and miniseries, by the way. Sidebar: if there's a miniseries about a place that you've worked, you probably had an exciting time, in the Chinese-curse sense of exciting.
This will be the fourth for me in this phase. I love it because the company is large enough that the eBay, Google, et al. lessons apply. So, like, my skills are valuable and useful and can change the trajectory of the company. And it's also small enough that we can decide together and act on it, like, right now. You know, we can decide, "Hey, we're going to do feature flags," and boom, tomorrow we're doing feature flags. You know, we can decide to do canary deployment, and boom, tomorrow we're doing canary deployment. So, it's the combination of being able to bring the large-scale distributed systems and large-scale organizational expertise and have those lessons be relevant, combined with being small enough to be able to change the trajectory of the company significantly and relatively quickly.
I'll highlight the other aspect of it, which I love: the mission. So, our mission is again to bring healthy and sustainable living to everybody. And everybody is a subset of the United States at the moment. But we want it to be a lot of the United States, and maybe then all of the United States, and then expanding out from there. And every time somebody becomes a member, a paid member, we also give a free membership to a first responder or a teacher or a veteran or somebody who's low income. So, you know, it's doing well by doing good, if that makes sense.
Evolving Architecture: From Monolith to Microservices
Charles Humble: Can you tell us what the architecture looks like? What are the components you're using? How is it kind of structured?
Randy Shoup: We're in the monolith-transitioning-to-microservices phase. That's another whole podcast. We might talk about a little bit of that today. But, yeah. So, you know, we've been around for 10 years, and we have 10-year-old software. So, we started on, and are still on, Magento, which is a PHP-based open-source e-commerce-in-a-box. It was really big 10 years ago. Now you'd think of Shopify, but for Magento there's no company behind the open-source thing in the same way. So, the core of our e-commerce, in, out, up, and down, is this PHP monolith.
We have a constellation of, let's call it, 100 services around that. And then over time, we find important pieces of the monolith that, for reasons, need to be extracted, and we extract them into services written typically in Python or Java. That's the quick overview. We're all in on cloud, AWS, all up and down and left and right. No physical infrastructure except for the Wi-Fi things in the warehouses, because we're a physical business like Stitch Fix was, so we have physical warehouses, and one of my teams builds and maintains software for that.
The other thing, though, is that, as every small company should, we use a ton of third-party services. So, we use some of the managed services from AWS, our cloud vendor, but we also use a ton of third-party services for everything that's not our core competency, right? Our core competency is not shipping physical boxes from our warehouses to people. So, we leverage a third party called Convey to give us notifications about that. For all aspects, you know, internal and external, of things that are not core to our mission, we use third parties.
Charles Humble: Right. Yes. And you've not been there all that long, have you? So, I'm presuming the architecture is largely something you've effectively inherited and are working with. Is that right?
Randy Shoup: That's exactly right. I started in early March. So, that would give me nine and a half months here. Yeah, I mean, the architecture has not... I was going to say, not significantly changed. It's probably not changed at all, candidly, since I've been here, though we've built a lot of cool stuff. There's a reason why you use these e-commerce things in a box, because stuff like regional pricing you get for free if you do it right. So, you know, we're doing a lot of work there.
Just in the last couple of weeks, we integrated with an Instacart offering around what are called retail ads. So, it's ads on our site that are promoted by our vendors for things we already sell. Hope that makes sense. If you're old enough, as you and I are, to remember the little inserts in the physical newspapers: "Hey, your local Aldi or Tesco or Safeway, here's 5% off this thing or 15% off this other thing." You can think of that. So, vendors actually fund, for the most part, those ads. But again, the ads are not for things that are outside of Thrive. They're for things we already sell.
The reason why I mentioned that in the context of architecture is that it was an opportunity for us to build a set of services around that in Java and use SQS and SNS and, you know, the whole deal.
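To make that concrete, here is a minimal Python sketch of the kind of SQS consumer such a service might contain. The real services Randy describes are in Java, and the queue URL, region, and event fields here are purely hypothetical.

```python
import json
import boto3  # AWS SDK for Python; the actual services are written in Java

# Hypothetical queue for retail-ads events; the real queue name is an assumption.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/retail-ads-events"

sqs = boto3.client("sqs", region_name="us-east-1")

def handle_ad_event(event: dict) -> None:
    """Placeholder business logic: record an ad impression or click."""
    print(f"Processing {event.get('type')} for product {event.get('product_id')}")

def poll_forever() -> None:
    while True:
        # Long-poll to reduce empty receives and cost.
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,
        )
        for msg in resp.get("Messages", []):
            handle_ad_event(json.loads(msg["Body"]))
            # Delete only after successful processing (SQS is at-least-once).
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    poll_forever()
```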
Charles Humble: How naturally does that architecture fit with the kind of current business domain where you are as a business, as it were?
Randy Shoup: Oh, very insightful question. As is absolutely correct for this phase of the company, our architecture lags the current state of the business. Because, like, if the architecture were ahead of the business, we would have wasted a lot of time building things that we didn't need. And so, as we might talk about later, eBay, Twitter, Amazon, everybody who's big now started off small and was dragged by their business into better and more sophisticated distributed systems in terms of their architecture. So, we're in that same exact phase. We're in what I have called in the past the scaling phase of, "Okay, now's the time to take that monolith and break it up into smaller pieces."
And the way you ask it is really apt. You know, we have very well-defined individual business domains, you know, around how do we acquire new members? That's a thing, you know, for a membership company. How do we make the membership or the subscription-like service really good for people? So, recurring shipments that you would get from us every, you know, couple of weeks or a month. Like, making that experience really good. How do people find stuff on the site? How do we get information about the items that we sell from our vendors, and how do we incorporate that into the system, etc.?
Those are all separate domains. So, the first one is member growth. There's what we call habit formation, which is around recurring shipments. We talk about discovery, which is how people find things on the site, etc., etc. Right now all of those domains more or less live in the monolith and have some ancillary services. But where we will ultimately get to is that those domains will own, up and down, independent services around their particular domain. So, a very apt question. And that's why I'm here. I mean, that's why I like this.
The other aspect of why I like this is the architectural opportunity to take a really successful business, because why would you be in this situation if you weren't a successful business? And then we leverage the knowledge that we now have that, frankly, we didn't have, you know, 5 or 10 years ago about, "Okay, here is the domain decomposition. Here is the place where we want to invest in terms of scaling the business and scaling the technology." And we do that. So, it's a fun place to be.
Microservices: When (and Why) to Break Apart Your Monolith
Charles Humble: I think there's a really interesting thing to pull on there, which is when should you choose to go to a distributed architecture like microservices. Because I think, and I think you would agree with me on this, it's generally a huge mistake to sort of start building microservices when you're, you know, at that seeking product-market fit stage. And obviously, you're a bit beyond that stage now, and so you're starting to kind of break bits of your monolith apart. But how do you make that decision? What's the point at which you go, "We need to start breaking this down?"
Randy Shoup: Extremely apt question that I'm living now. There are really two motivations. So, first, let me underline what you said, which I 100% agree with. Ninety-five, 99% of the applications on the planet are and should remain monoliths. Full stop, end of story. You should only do... I mean, I love distributed systems. I've lived in it for about 25 years, and, like, I just get joy from doing this kind of thing. And I read the papers, and, like, I find it super exciting. And also it's a 1% situation, right? The vast majority of you should only be dragged kicking and screaming into distributed systems rather than starting from the beginning.
There are a handful of counterexamples to that that I can think of, and they are super rare. And I've never been in one myself where you would start with microservices. The only case that I can think of, or the several cases, are the thing you are offering needs to be at massive scale out the gate. Right? So, you're AWS, you're offering a new service, like you don't start that as a monolith because day one it's supposed to be... It is always day one at Amazon. But, like, day one, when they launch the new X, Y, Z service, like, it's already got a million requests a second or whatever, and so you better be ready for that.
The other example is companies like Monzo in the UK, which does banking software. The reason they started with microservices is because the banking domain has been known up and down and left and right for four or five decades. So, the domain decomposition is super obvious. If you start by, like, skating to where the puck is going to be, you can get a lot of benefit. There are some great talks by Matt Smith, I want to say, from Monzo about that. And that's a decision that I'm on board with, but, like, I've exhausted my examples of where you should start with microservices.
Absolutely start with a monolith. Ninety-nine percent case for sure. And so, only when you are at the scale that I'm talking about, where the monolith is running out of gas, should you want to move to a distributed system, typically in the form of microservices. Okay, so what do I mean by running out of gas? Two things. First, developer velocity is slowing down, because one of the benefits, and then costs, of a monolith is it's a single repo, and everybody's on top of each other. And so, when developers are stepping on each other's toes, you want to pull things apart, and that's going to solve essentially an organizational problem, an organizational efficiency and scalability problem, by simply making separate parts of the system, components, or services, or whatever, that are independent and isolated and able to be worked on and deployed independently.
The other argument is straight-up load, where some parts of the system need to scale much faster than other parts, and simply vertically scaling or, like, even horizontally stamping out more instances of the monolith isn't going to cut it. And so, those are the two arguments that I have deployed over time about why you would want to move from monoliths to microservices. But again, I will triple underline that the vast majority of software on the planet should remain monolithic, and you should wait until it feels almost too late, and only then should you do it.
Recommended talk: Working Effectively with Legacy Code • Michael Feathers & Christian Clausen • GOTO 2023
The Relationship Between Organizational Scaling and Microservices
Charles Humble: I think it's worth underlining that point about organizational scaling because I think generally people think of microservices in terms of being able to scale up or scale horizontally or whatever. But they tend to miss the point that, while that's certainly one reason for doing it, if you've got thousands of developers working in a single repo or in a single code base, you're going to start tripping over each other, and it just doesn't work. And so, at that point, splitting things apart makes a bunch of sense.
Randy Shoup: If you had sufficient foresight and you componentized your monolith, that makes it a lot easier. The hard case, which is the one I was in at WeWork, am in here, and was somewhat in at Stitch Fix, is where the monolith isn't particularly well-componentized. So, step one is taking a page out of Michael Feathers' "Working Effectively with Legacy Code" book and, like, finding a seam for a component within your monolith, walling that off behind your favorite interface approach, writing tests around that, and extracting it. So, yeah, it's a standard progression. And as you know, Charles, but maybe your listeners don't, I've given talks about this exact thing. So, you can Google my name and look for monoliths and microservices and migrations and, you know, hour-long disquisitions on thinking about this stuff.
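A minimal sketch of that "find a seam" step, with entirely hypothetical names: define an interface for the capability, wrap the existing in-monolith code behind it, write tests against the interface, and only then swap in an extracted-service implementation.

```python
from abc import ABC, abstractmethod

import requests

def legacy_price_lookup(sku: str, region: str) -> float:
    """Stand-in for existing monolith code (hypothetical)."""
    return 9.99

# Step 1: define the seam as an interface the rest of the monolith depends on.
class PricingService(ABC):
    @abstractmethod
    def price_for(self, sku: str, region: str) -> float: ...

# Step 2: wrap the existing logic behind the interface, unchanged, and test it.
class InProcessPricing(PricingService):
    def price_for(self, sku: str, region: str) -> float:
        return legacy_price_lookup(sku, region)

# Step 3: once the seam is covered by tests, an extracted service can slot in.
class RemotePricing(PricingService):
    def __init__(self, base_url: str):
        self.base_url = base_url

    def price_for(self, sku: str, region: str) -> float:
        resp = requests.get(f"{self.base_url}/price", params={"sku": sku, "region": region})
        resp.raise_for_status()
        return resp.json()["price"]
```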
Recommended talk: Effective Microservices in a Data-Centric World • Randy Shoup • GOTO 2017
Charles Humble: There's one interesting thing just to sort of connect a couple of dots in that because you were talking about the Monzo example, and you said quite rightly with Monzo, because the problem domain is... Or the domains are so well understood, basically, you can do the whole domain-driven design because you know what your bounded contexts are effectively. You know what your boundaries are straight away.
There's an interesting sort of counter to that, which is, I think, sometimes you know what a lot of the boundaries are, but then you have a sort of mysterious blob over here where we're really not quite sure yet. And so, you can sort of group those things together, separate them off, if you will. And then at some point, you'll get clearer on what the boundaries are. And then you can start splitting those off. Does that sort of... Do you see what I'm getting at?
Randy Shoup: That's so insightful. I actually want to riff on that for a moment because I missed something. Thank you for reminding me: I missed the third reason. I'll say it this way. An additional reason why you do not start with microservices is that you do not understand your domain. You think you do, but you don't.
You do not understand your domain when you're starting out, and particularly when you don't fully have a business model or product-market fit. You do not understand your domain. Therefore, stay as a monolith, because you can use refactoring tools to move things around and componentize and so on. Only when your domain, your overall set of domains, let's say it that way, has coalesced, and you really understand your domain decomposition, then you can say, "Okay. Well, now it's clear that I can pull out payments. Now it's clear that I can pull out member acquisition. Now it's clear that I can pull out finding or searching or, you know, everything in the catalog."
Super insightful point: once you've split something into services, you have reified your domain decomposition. "Coalesced" is not the right word. You have taken it from... solidified. That's a good word. It's an even better word. You have solidified it into software, and now it's really hard to refactor capabilities, to undo this particular service and, like, recompose it into other services. It's just harder.
Charles Humble: Yes.
Randy Shoup: Great point. Thank you.
Recommended talk: Microservices, Where Did It All Go Wrong • Ian Cooper • GOTO 2024
How eBay Doubled Engineering Productivity
Charles Humble: As I said, we were talking about microservices in the context of it being effectively an engineering productivity thing, right? As your team gets larger. So, focusing on that, because I think your roles at eBay, in your second stint at eBay, because you were at eBay twice, and at Thrive have some similarities in the sense that you're focusing on developer productivity. So, what we might call platform engineering or engineering enablement in the Team Topologies sense of that, and how that all fits together. Is that kind of right? Is that a reasonable description of the parallels between the two?
Randy Shoup: I've been at eBay twice. I was there mostly as an individual contributor from 2004 to 2011, where I worked on eBay's search engine and learned a ton about distributed systems. And then I left and did other things for 10 years or so, worked at Google, Stitch Fix, WeWork, and other places. And then in 2020, right at the start of the pandemic, I was asked back to be chief architect of eBay. So, I came back, and I took on the chief architect role but also led the platform engineering organization that developed, like, the internal developer frameworks, the CI/CD pipelines, the external APIs, a whole bunch of stuff that's related to engineering productivity.
This is something that is very near and dear to my heart, and I care a lot about it. And so even though I wore the chief architect hat, the thing that I did first and really solely, frankly, in the two years that I was there was not reforming the architecture. Not because it didn't need to be reformed. It did and does. But because it was very clear that what was preventing us from reforming the architecture was all aspects of software delivery, right? Like better setup for automated tests, better pipelines for canary deployment out and back.
The only way I know of doing large-scale change, whether it's architectural or otherwise, is in small steps that are two-way doors that go forward and back. There's the whole "How do you eat an elephant?" It's one bite at a time. Then I love the reformulation by John Smart of "Sooner Safer Happier," where he says you want elephant carpaccio. So, like, super, super, super thinly sliced pieces, which, aggregated all together, make a big difference.
How did we approach that idea? So, as with all things, whether it's developer productivity, architecture, or whatever, you really need to understand the problem that you're solving. So, if people take nothing else from this talk or this chat, remember these seven words. What problem are we trying to solve? So, often as engineers, we don't ask that question. We assume we know, or we're handed off some spec or some diagram or some beautiful sketch and implement that. Well, if we understand why, we're going to do a better job, and we actually will get there. So, understand the problem statement.
At eBay, it was, you know, we had better words for it, but we were too slow. Like, just end to end, it took us too long to go from idea all the way to value. And so, you know, I like to think of it as, like, four steps. Planning is how an idea becomes a project. Software development is how a project becomes committed code. Software delivery is how committed code becomes a feature that people use on the site. And then post-release iteration is how that feature we launched initially gets iterated upon based on user feedback and analytics and monitoring, and so on.
At eBay, we did a value stream map, which comes out of lean. So, we thought of it as a big workflow, and we looked at where the opportunities were, what was taking longer, what was taking shorter. And it became glaringly clear, what a lot of people knew anecdotally and individually, but the data was really clear, that it was all about software delivery at eBay. You know, there were opportunities everywhere, frankly; like, planning could be a lot better, software development could be a lot better, etc. But the really big opportunity was in the software delivery space.
Fortunately, we were living in the world of 2020, and still are, in a world where we have a lot of industry expertise now, like 10, 15 years of the State of DevOps reports and the "Accelerate" book and the DORA metrics and so on. So, what we did is we just straight up, you know, had everybody read the "Accelerate" book, which everybody should if you haven't already. Stop now, pause the video, go read the book, come back. We used the four DORA metrics as our way of measuring our software delivery performance. And again, without going into huge detail, read the book. Dr. Nicole Forsgren, who led all that research and was the primary author of the book, proved that these four metrics, deployment frequency, lead time for change, change failure rate, and mean time to recover, do predict software delivery performance and therefore business and organizational performance.
We measured ourselves on, "Hey, how often are we deploying code? How long does it take us from when a developer commits their code to when it shows up on the site?" That's the lead time for change. When we deploy things, how often do we have to roll it back or hotfix it? That's the change failure rate. And then MTTR: when we have an incident, how long does it take us to recover from it?
We put together... eBay's big, and they build everything themselves, so they built a tool, you know, to monitor all those things: our deployment pipeline, issue tracking, etc. It integrated all that together, and we were able to see all of that very quickly, which is really great.
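As a toy illustration of the measurement side, not eBay's internal tooling, here is a sketch that computes the four DORA metrics from a list of deployment records; the record shape and field names are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median

@dataclass
class Deployment:
    committed_at: datetime                # when the change was committed
    deployed_at: datetime                 # when it reached production
    failed: bool                          # did it need a rollback or hotfix?
    restored_at: datetime | None = None   # when service was restored, if it failed

def dora_metrics(deploys: list[Deployment], window_days: int) -> dict:
    """Compute deployment frequency, lead time, change failure rate, and MTTR."""
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restores) if restores else None,
    }

# Example with made-up data: two deploys over a 7-day window, one of which failed.
now = datetime(2024, 12, 1, 12, 0)
print(dora_metrics([
    Deployment(now - timedelta(hours=3), now, failed=False),
    Deployment(now - timedelta(hours=8), now, failed=True,
               restored_at=now + timedelta(minutes=20)),
], window_days=7))
```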
And then how do you do it from there? You basically go to teams and you ask, "What's preventing you from moving faster and being more effective?" And the most useful question that I could ask as the head of platform engineering was, "Hi, X, Y, Z team. I see that today you're deploying maybe once a week or once every two weeks. If I told you that you had to deploy every day, tell me all the reasons why you can't." And this wasn't like a challenge. This was like an opportunity. And they were like, "Wow. Somebody will finally listen." And they said, "How long do you have, Randy? And, like, would you want the whole list or just the top 150?"
They gave us, you know, a huge list, all legit, of, like, "Okay. Well, the tests are flaky, and the builds take too long. And I have to get manual approval, not from one, not from two, but, like, from all the dependent teams that depend on the software that I build. And when I roll out, I have to have somebody look at it by hand to go from, you know, one stage to the next stage, to the next stage of the deployment process." And so, unlike all the other conversations and rants that, you know, those people legitimately had over the years, finally somebody was listening to them, and I said, "Great. You just gave my team our backlog."
Then we formed a team of teams, referencing exactly the book, where, like, all the teams that were working on improving their software delivery performance would get together once a week, and we would share the things that we learned and how we were doing. And what immediately came out was that everybody was motivated to make their lives better individually, and then people could share ideas like, "Oh, I'm having problems nagging people about doing their code reviews." Another team says, "Hey, we implemented this super lightweight tool that nags people. Hey, you wanna use it?" And, you know, so the sharing of experiences and successes was really valuable.
I inherited this word, but overall this thing at eBay we called the velocity initiative. I would have given it a different name if I'd been there when it was named, but I inherited the word. So it's about improving the overall engineering productivity and flow of the organization. And one of the things I'm super proud of, of the work that we did together with a bunch of really great people at eBay, was that for the teams that we worked with, we doubled their engineering productivity. And what does that mean, doubled? It means, as measured by features and bug fixes produced end-to-end per unit time, they doubled. So, given the same team composition, same team size, before we did this work they were doing X number of, you know, features and bug fixes per week or month or whatever, and now they're doing 2X. And, like, that's cool.
I am very personally proud of it, and also we didn't do anything magic. It's the standard... "Standard." I mean, it's the playbook. It's the DevOps, Accelerate, DORA playbook of, you know, "Hey, we need to do automated testing. We need to make sure that tests are not flaky. We need to automate the deployment process." So, like a canary situation where a human is not pressing a button to go from stage X to stage Y to stage Z. It happens automatically: looking at metrics, if they're good, it goes forward; if they're bad, it rolls back. Just leveraging all the tools and techniques and capabilities that we have in the 2020s and had in the 2010s in our world. And so, we didn't do anything magic. We just executed and, like, introduced people to continuous delivery essentially.
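A minimal sketch of that automated canary decision, under invented assumptions: the stage percentages, thresholds, and metric names are hypothetical, and the metrics function stands in for a real monitoring query.

```python
import random
import time

# Hypothetical thresholds; real values would come from each service's SLOs.
MAX_ERROR_RATE = 0.01       # 1% errors allowed in the canary
MAX_P99_LATENCY_MS = 250

def fetch_canary_metrics() -> dict:
    """Stand-in for querying the monitoring system for the canary instances."""
    return {"error_rate": random.uniform(0, 0.02),
            "p99_latency_ms": random.uniform(100, 300)}

def canary_rollout(stages=("1%", "5%", "25%", "100%")) -> bool:
    """Promote traffic stage by stage; roll back automatically on bad metrics."""
    for stage in stages:
        print(f"Routing {stage} of traffic to the new version")
        time.sleep(1)  # in reality: soak time at each stage
        m = fetch_canary_metrics()
        if m["error_rate"] > MAX_ERROR_RATE or m["p99_latency_ms"] > MAX_P99_LATENCY_MS:
            print(f"Metrics regressed at {stage}; rolling back automatically")
            return False
    print("All stages healthy; rollout complete")
    return True

if __name__ == "__main__":
    canary_rollout()
```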
Recommended talk: Better Value Sooner Safer Happier • Simon Rohrer & Eduardo da Silva • GOTO 2025
Balancing Platform Engineering Effort
Charles Humble: How much of your engineering effort goes into that platform team at somewhere like eBay?
Randy Shoup: At eBay. It's a good question. I have noticed anecdotally, from my experience in multiple places, that as you get larger, a larger percentage of time is spent in what I'll call the platform versus the other areas. And that is because when you are small, the amount of leverage you get from a common platform team is small. It only applies over the, whatever, 20 or 30 people you have, let's say. But when you are 4,000 engineers like eBay, or 100,000 engineers like Google, or maybe 200,000 at Amazon, Amazon has some crazy number at this point, the tiniest little three-person team on a platform that does what seems like a relatively small thing has this massive impact and pays for themselves in a week.
So, at eBay scale, it's basically half and half. I don't know if they've reorganized, but when I was last there, we had what we called the product engineering organization, which delivered user-facing features, and they were excellent, and then we had core tech. So, think platform organizations writ large. That's infrastructure. That's DevOps. That's my platform engineering. All sorts of different things. And it was basically half and half. So, on the order of 2,000 engineers on the product engineering side and 2,000 engineers on the core tech side. And when I was there, Google had a similar ratio: the user-facing things versus the multiple really excellent layers of software that everything is built on, if that makes sense.
Charles Humble: Totally. There are a couple of other things about platform teams that I've seen people come unstuck with. One of those is how you prioritize the work that comes in and how you manage that relationship, because it's a bit of an unusual relationship; it's a bit like a vendor-customer relationship rather than a... It's different from other internal teams, I think. Do you get what I'm getting at?
Randy Shoup: You used the metaphor that I like to use straight away, which is vendor customers. So, how, as a platform team, do I understand what to do? I talk to my customers. And just as if I were a third-party vendor, I prioritize based on that, like, okay, I think of my thing as a product. My platform as a product. And I use product management techniques and lean product management to, like, figure out what to do. So, it's exactly like any third-party vendor. Listen to your customers, you know, and prioritize the largest opportunities/the largest pain points first.
And so does that mean that every customer gets exactly what they want when they want? No. That's what prioritization is. Like, we don't have infinite resources to everything, but it's correct for a third-party vendor, and it is equivalently correct for an internal platform team to say, "If I work on this thing, it impacts most or all of the teams and makes them substantially more productive. And if I do this other thing that this particular customer team is asking for, that really only, you know, makes them more productive. And so, okay, that prioritization is really clear."
We used this at eBay: put your eBay hat on, right? Put your broad company hat on. Like, what is the correct thing for us to improve, you know, as a member of the company? Like, what would the executive say, essentially? What's the overall priority of the organization? And once you put that hat on and, like, pretend you're a vendor, it all becomes really clear. And again, you can't do everything, but you prioritize, and you say, "Hey, I can't get to this this time, but, you know, two quarters from now, sure, I'll be happy to do it." And also, one of the benefits of running that organization is, guess what? We'll happily consult with you. If you want to roll your own thing to do your kind of custom improvement, we can help you and point you in the right direction. We just can't do it for you right now.
Charles Humble: There's another thing. I mean, it's become something of a maxim that you shouldn't mandate the use of a platform, but I think it's worth talking about because I still see people going, "Hey, we've built a platform, and now you've got to use it." And I think there are interesting aspects to that. So, can you just riff on that a bit? What's your view on mandating versus just allowing it, having it there? Why would you favor one over the other?
Randy Shoup: Fantastic question. I love it. The distinction is, is it a mandated platform or is it a paved path? And I'm very intentionally stealing that phrase from Netflix. So, Netflix and Google are excellent at this. Let me give a little more... So, in general, I prefer the paved path, or almost exclusively I prefer the paved path versus the mandate, and I'll explain why in a moment.
I want to add a slightly more nuanced thing: if you're small, it shouldn't be mandated, but there's also not a lot of excuse at my scale for people to roll their own, because we only have 100 engineers. At Google scale, with 100,000 engineers, there's a lot more play and frankly a lot more... the distribution of individual teams' needs is much broader. But overall my philosophy is the platform should be so good that people should want to choose it affirmatively. So, it needs to be affirmatively better, strictly better, than buying it, borrowing it, or stealing it.
And that's the choice, because it is right for me, as the internal software vendor of platforms, to give my customers the choice. Why? Because it puts all the incentives in the correct places. The customer team, if they're good engineers, which they mostly are, is choosing what works best for them, and sometimes, hopefully most of the time, that's the platform. Okay. But maybe they have, not weird, but different needs, like, "Okay, for machine learning, hey, this thing isn't going to do it. You need something else."
So, it puts the customer in control, where they should be, essentially, the customer team in this example. But it also gives me, as the platform team, the right incentives as well. Like, if I'm essentially competing with all the other options my customer teams have, now I'm like a third-party vendor again: "Okay, how am I going to survive as a going concern?" Like, if it is strictly better for my customers to use a managed service from our cloud vendor instead of the thing I built, great, I should want that, because guess what? There's other stuff I should do. If we can get this thing as a managed service or there's a third party that does this, fantastic. "Thanks. Glad what I did in the past helped you." And now we're going to do other things because, again, there's a big long list of priorities from my customers that we weren't able to get to.
And I 100% believe that. The reasons why people would mandate a platform are, I think, deeply wrong. It's the... I don't mean to be mean, but it's the bias of sunk costs. It's the sunk cost fallacy. It's like, "Well, I built this, so you should use it." I'm like, "Well, okay. We spent that money. Now what?" And if the right thing today is to continue to use the platform, great. We did a good job, and we should pat ourselves on the back. But often, as it happens, that was the right thing to do four years ago, and right now there's a third party that has packaged it up for free, and it's all integrated and way better, in the cloud ecosystem, in the Kubernetes and CNCF ecosystem. In any given year there's way more stuff out in the world in open source, etc., etc., than there was the year before.
I hope this metaphor is going to work, but the overall tide, in a good way, is rising, right? The baseline level of what's out there for free or for cheap in the industry is going up. And that's a good thing. We should want to float on top of that rather than chaining ourselves to the bottom of the harbor, if that makes any sense.
Recommended talk: Structures Shape Results: Software Insights • Elisabeth Hendrickson & Charles Humble • GOTO 2024
Building eBay Today: Modern Architecture & Cloud Tools
Charles Humble: There's actually an interesting sort of line I want to pull on there, which is, because obviously eBay has been around by kind of internet company standards a long time. And when you started, or when it started out, you know, the cloud wasn't even a thing. So, if you were building an eBay today, or something like an eBay, say, what would be different? Are there more off-the-shelf services that you would maybe draw on that you couldn't have drawn on at the time? And how would that look? How would that compare?
Randy Shoup: I love this. I actually did this thought experiment when I was at eBay as the chief architect, which is, what should the chief architect do? The chief architect should be thinking, "Hey, if I had to do eBay all over again, what would I do?" And, what's the way to say this? I'll give specific examples, but, like, most of the patterns are correct, and I would keep them. Most of the implementations I would update. Right? So, well, let's take it one at a time.
So, eBay does not use the public cloud, basically. eBay leverages its own private cloud, because guess what, when eBay started almost now 30 years ago... We're at 29 years old right now. There was no cloud, like you say. The first thing you do is you buy some actual physical hardware from an actual physical vendor. You install software on it, like, from a CD or a floppy. You rack it in a rack, you put it in a data center, blah, blah, blah. And companies that got big early on in internet time: eBay, Amazon, Yahoo, Google, all had to learn how to do that. So, they all learned how to be their own, what we would now call cloud vendors. And eBay's pretty good at that, actually.
From a cost perspective, if you think... What's the way to say this? If you lifted and shifted all of eBay's infrastructure from the private setup that they currently have to the public cloud, it would probably increase the cost; let's call it 4X. But if you were smart, you could get a lot of benefit, and it wouldn't be 4X if that makes any sense. Like if you just lifted and shifted, it would be that way.
What I would do today is I would leverage... go all in on one of the cloud vendors, partner with them up and down and left and right, exactly how, you know, Netflix has done and various other big places have done. I would build from the ground up in an auto-scaled way. So, again, when you run your own data centers, physical things don't flex in the way that we would like. And now we've learned what we can do with auto scaling. So, I would do that. The reason why I said all in on a cloud vendor is I would by default use all of their managed services up and down and left and right, and only where a managed service didn't meet a very specific need that we had, which is frankly a rare case these days in 2024, only then would I build custom infrastructure.
And then the other thing is, eBay... again, you know, it's legit. eBay builds and maintains its own data centers. There are three data centers in the United States, mostly in the Western United States. So, that's great if you happen to live in the United States. That's not great if you happen to live in Australia. So, if I had it to do all over again, and if I could jump to where we currently are, we're not eBay as a startup, we're eBay at current scale, let's say it that way, I would be global, multi-region, active/active, right? So, I haven't directly used Spanner from Google or the brand-spanking-new thing from Amazon, Aurora DSQL, that they just announced a few weeks ago at re:Invent, which is Spanner-plus-plus, as far as I can tell. Spanner, even better. Anyway, I would leverage something that had all those multi-region capabilities from the beginning, if that makes sense.
That's what I would do on infrastructure. Again, it is correct that something at eBay scale would be microservices. I would continue to do that. eBay, you know, evolved over time. Like, if you could jump right to the domain decomposition that we've currently learned, or that eBay's currently learned, we would just, you know, construct services around that. That's fine. We would do what eBay does today, not what it did the first time I was there. But really, in a DDD way, organizing teams and services aligned to those domains, very clearly segmented, with bounded contexts like you were suggesting earlier.
I would be event-driven from the beginning. Again, we're not eBay as a startup; we're eBay at current scale. And either I would go all in on the eventing setup of whatever cloud I chose, or, if I didn't like that for whatever reason, I would go all in on Kafka. And then, because we can't get past functions as a service even 10 years along, I would use those as kind of lightweight event producers and consumers. But despite the, like, serverless-first push, I don't think starting with functions as a service is the right way to go. It's a tool in our toolbox, but it's not the first tool. Now, serverless as AWS says it writ large, like everything should be scalable down to zero and scalable up to a million, like yes, yes, yes. A thousand times yes.
Then the very last thing is I would do a first-order implementation of workflows. So, whether that's a saga pattern layered over an eventing system, or something that's new to me but I think is not new to a lot of people, which is something like Temporal, which is used by Coinbase and Stripe and a bunch of places where this stuff really matters. We're currently looking into that now, about how to implement workflows in a higher-level, correct, and very reliable way. So, that's what I would do.
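For readers unfamiliar with the saga idea mentioned here, this is a tiny sketch of the shape of the pattern, not Temporal's API and not any particular company's implementation: each step pairs an action with a compensating action, and if a step fails, the completed steps are undone in reverse order. The e-commerce step names are hypothetical.

```python
from typing import Callable

# Each saga step pairs an action with a compensating action that undoes it.
Step = tuple[Callable[[], None], Callable[[], None]]

def run_saga(steps: list[Step]) -> bool:
    """Run steps in order; on failure, run compensations in reverse order."""
    completed: list[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception as exc:
            print(f"Step failed ({exc}); compensating in reverse order")
            for undo in reversed(completed):
                undo()  # compensations should be idempotent and safe to retry
            return False
    return True

# Hypothetical checkout workflow: charge the card, reserve stock, schedule shipment.
if __name__ == "__main__":
    run_saga([
        (lambda: print("charge card"),       lambda: print("refund card")),
        (lambda: print("reserve inventory"), lambda: print("release inventory")),
        (lambda: print("schedule shipment"), lambda: print("cancel shipment")),
    ])
```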
Charles Humble: That's absolutely fascinating. That's really interesting.
Randy Shoup: I was going to say, did I indicate that I've been thinking about this for a while?
Charles Humble: No, it's really interesting, though, because it's something I've been reflecting on a bit, and actually something you said earlier about when you came back to eBay and you basically used the work from the "Accelerate" book and "Continuous Delivery" and all of that. And so much of that is stuff that wasn't there 20 years ago. It's like the patterns and approaches that we need are so much better defined for the scale that we operate at than they were a few years ago. And it's just, it's kind of interesting, I think, to reflect back on that.
Recommended talk: Platform Strategy • Gregor Hohpe & James Lewis • GOTO 2024
Evolution of Architectural Principles Over 20 Years
Charles Humble: Actually on that, because we're getting towards the end of our time, obviously you and I met, I think at QCon San Francisco in 2007. Quite a long time ago now. So, we've talked about how you would do eBay differently, but how have the sort of architectural principles or approaches changed over the sort of 20 or so years that you and I have known each other?
Randy Shoup: In preparation for this talk, I was actually listening to an interview I did at QCon London in 2008 for "Software Engineering Radio," but whatever. That's another good podcast. I listen to "GOTO" all the time too, don't worry. And I was listening to the themes from that talk, which was the same talk that you saw at QCon San Francisco in 2007. And oh my gosh, all the patterns are the same. And it's not because Randy was brilliant. It's because it took us a while to learn. It took us maybe 5 or 10 years to learn how to do distributed systems, but then we learned them, and the patterns are the same. The detailed implementations are very different.
So, number one was scaling. Okay: domain partitioning, domain decomposition, services per domain, microservices that own their own data. Boom, we knew that then. We know that now, whatever, 17 years later. Asynchrony. Again, particularly at large scale, prefer an event-driven architecture over everything being synchronously coupled, and go all in on the implication that there is eventual consistency: something happens over here when somebody bids on an item, and only later does it show up in search, and only later does it show up in these other places. Now, at eBay, we did a lot of work so that the latter showed up in search within a couple of seconds. But, you know, the conceptual idea is a thing happens over here in one part of the system, and then it propagates quickly or slowly, you know, to the rest of the system in an eventually consistent way.
Automation. You cannot live at large scale without automating everything up and down and left and right, and that is things like a canary rollout, right? So, automating the steps to roll things out, and what I call adaptive configuration. So, say to a part of the system, "Here is your SLO. I need you, as an event consumer or message consumer, to consume messages at a certain rate and keep the queue of messages below a certain level. And then tune yourself, tune your polling rate and your parallel threads and your number of consumers, to keep that rate with minimal resource usage." And that's not conceptually all that difficult, but you just have to implement it.
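A toy sketch of that adaptive-configuration idea, with made-up numbers: a consumer watches queue depth against a target and adjusts its own concurrency. The queue-depth function is a stand-in for whatever the broker exposes (an SQS attribute, Kafka consumer lag, and so on).

```python
import random
import time

TARGET_QUEUE_DEPTH = 1000        # hypothetical SLO: keep the backlog below this
MIN_WORKERS, MAX_WORKERS = 1, 64

def current_queue_depth() -> int:
    """Stand-in for querying the broker for backlog size."""
    return random.randint(0, 2000)

def adaptive_consumer_loop(workers: int = 4) -> None:
    """Periodically re-tune concurrency to meet the SLO with minimal resources."""
    while True:
        depth = current_queue_depth()
        if depth > TARGET_QUEUE_DEPTH and workers < MAX_WORKERS:
            workers += 1             # falling behind: add a consumer
        elif depth < TARGET_QUEUE_DEPTH // 2 and workers > MIN_WORKERS:
            workers -= 1             # comfortably ahead: give resources back
        print(f"queue depth={depth}, workers={workers}")
        time.sleep(30)               # re-evaluate on a fixed cadence
```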
We could talk about machine learning. That's automation at massive scale, but same exact thing. And then failure. I think it's the first fallacy of distributed systems. It's certainly there. Like, you know, things fail all the time. So, you know, you have to invest in some form or another. Now you can use third parties, but invest in observability, step one. And then, as I was mentioning, you know, things fail in every way. So like canaries and automated rollback. You know, looking at those monitoring metrics to decide, you know, what to do to scale up and down.
What we now call, thanks to Mike Nygard, circuit breakers. We had that technique at eBay back 20 years ago; we called it markdown. Like, you mark down a database temporarily when it's not performing well. So, circuit breakers like that, and then, when things aren't available, you don't collapse; you gracefully degrade the functionality. And then there's a thing that's a little bit new. Well, okay, the massive thing that's new is what you hinted at, which is continuous delivery.
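A minimal circuit-breaker ("markdown") sketch in the spirit of what's described here; the thresholds are invented, and the fallback stands in for the graceful degradation he mentions.

```python
import time

class CircuitBreaker:
    """Marks a dependency 'down' after repeated failures and retries it later."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, fallback):
        # While marked down, serve the degraded fallback until the cooldown passes.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback()
            self.opened_at = None  # half-open: let one call probe the dependency

        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # "mark down" the dependency
            return fallback()

# Hypothetical usage: degrade to a static list when a recommendations call is marked down.
# breaker = CircuitBreaker()
# items = breaker.call(fetch_personalized_recs, fallback=lambda: POPULAR_ITEMS)
```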
So, those other patterns, around scaling, around asynchrony, around automation, and around failure, those things were known. Different words, but we knew them all 20 years ago, and at the time they were at least somewhat secret. Big companies knew all these things. And it wasn't just eBay, by the way. Yahoo, Google, everybody knew it.
The thing that is massively different today versus 20 years ago is all aspects of continuous delivery. So, when I first gave that talk back in 2006 about, "Hey, here's how we're doing things at eBay," opening up the hood a little bit, people were amazed, shocked that we released the entire site every two weeks. So, like, "Oh my God, how do you do it?" And we were so proud. We're like, "Well, we compute the transitive closure of all the interdependencies of all the different services and applications, and we execute on that in this automated way, going up the DAG from the leaves to the root, rolling forward and back at a pace based on things." And we were so proud.
But we thought of the site as this monolithic thing, and we released the whole site, you know, in pieces, but the whole site, every two weeks. Now, that would be laughable, absolutely laughable. I don't remember the exact figure, but Amazon 10 years ago was doing hundreds of thousands of deployments every day. I don't know that they've said it recently, but let's say it's two orders of magnitude larger now. Google would be the same. So, the idea in its first form... you know, credit to Jez Humble and Dave Farley for the "Continuous Delivery" book in 2010, and the very famous "10 Deploys a Day" talk from John Allspaw and Paul Hammond, which I was lucky to attend at Velocity back in 2009.
So now it's crazy talk to think that you shouldn't be deploying multiple times a day for each individual thing. So, that's all the techniques you mentioned that are around continuous delivery. And I think we knew all the individual techniques, but we never layered them together and combined them in a way that would allow us to safely deploy micro changes in a constant stream. And now we do. And now the best-in-class companies, those high performers in the Accelerate/DORA milieu, are doing multiple deployments a day for every application. And that's the standard. And you would be laughed at if you were a big site and updated everything every two weeks. So, we've come a long way.
Recommended talk: Release It! • Michael Nygard & Trisha Gee • GOTO 2023
Charles Humble: We have.
Randy Shoup: The patterns are all the same, which I love. And it's really exciting to me to see that we as an industry discovered all these things, and co-discovered them, right? Nobody talked to anybody. Super secret. But, like, Yahoo and eBay and Amazon and Google were all co-discovering all these techniques all at once, you know? Feature flags and canaries and circuit breakers and the broader techniques as well. Anyway, I love that, and I love that those patterns are still just as relevant today as they were. And I also love that we've come a long way on the continuous delivery front. So, 2025 is a great time to be a software engineer.
Charles Humble: A hundred percent. Randy, thank you so much as ever. I feel like I could talk to you for hours, but that was wonderful as always. Thank you so much.
Randy Shoup: Thanks, Charles. Me too. Clearly we can talk on any of these topics forever, but really enjoyed our conversation. Thank you.