Innovations in Serverless and Event-Driven Solutions
Julian Wood and Ben Ellerby explore the challenges and innovations in event-driven architectures, generative AI, and serverless technologies. They emphasize the importance of well-structured event schemas and the role of platform teams in reducing cognitive load for developers. Ben Ellerby highlights the potential of generative AI in modernizing legacy codebases and discusses the resurgence of event-driven architectures, driven by improved tools and frameworks that promote decoupling and efficiency. The conversation also touches on the future of serverless computing, edge computing, and the significance of data management in global applications, underscoring a transformative shift toward more scalable and flexible cloud solutions.
About the experts
Julian Wood (interviewer)
Serverless Developer Advocate at AWS
Ben Ellerby (expert)
AWS Serverless Hero
Read further
Understanding the Cloud Transformation Journey
Julian Wood: Welcome to GOTO Unscripted. We are here at the GOTO Conference in the wonderful, fabulous Amsterdam. Today I'm joined by Ben Ellerby. Ben, welcome to GOTO.
Ben Ellerby: Thanks so much for having me.
Julian Wood: Absolutely. So you are into transformation. You are helping customers go into the cloud. You work for aleios, which is a consultancy. What are people's journeys to the cloud? And I know you talk about modern cloud. What does that mean? What are the struggles people are having and how are you helping?
Ben Ellerby: I mean, I think everyone's sort of at a different point on that journey. I think a lot of people have, you know, lifted and shifted to the cloud, but are still wanting to get even more return on investment. Some people have kind of stayed back and waited to see how things go in the cloud landscape. And they're now looking to make that move. I've worked with airlines that have moved from fully on-premise to fully serverless and event-driven in AWS. And I've also worked with organizations that have lifted and shifted to AWS, but are then trying to get even more benefits, trying to break down monoliths, trying to restructure teams and trying really to focus on reducing total cost of ownership. But also focusing on the time to value, the time from an idea in the organization to customer feedback in production. What are the barriers to that and how can they use cloud to reduce that time?
Julian Wood: I mean, it seems like for some companies that can be quite a journey and that can be long. And I'm sure there are many sorts of hills, valleys, and pitfalls that they're going through. How do people sort of start to even think or to work out what they need to do or, you know, where do they start the journey?
Ben Ellerby: I think, unfortunately, a lot of them see it as a program with a fixed date where everything shuts down in our data center and everything's on in the cloud, or we stop having our monolith and then we have all of our beautiful microservices. And that big bang approach is ill-advised. I've seen it fail many times and often it creates a lot of frustration, a lot of waste. Instead, I'm a big proponent of a more agile approach, a strangler fig pattern. And I'm often an advocate of the Minimum Viable Migrations framework.
Julian Wood: Just strangler fig. Can you explain what strangler fig is for people who may not know?
Ben Ellerby: Strangler fig is a pattern coming out of Martin Fowler. And it refers to the strangler fig, which is a plant in Australia that wraps around a tree and slowly breaks it down. And I think the important part is taking it piece by piece. Rather than taking the whole, taking slices. And I think Martin Fowler regrets strangler fig because it sounds so aggressive. We're going to strangle the monolith. But instead, it's actually how do we gradually decompose the monolith? And I'm a big advocate of taking those vertical slices, saying, okay, we've got a really complex e-commerce system, built for 20 years. It's struggling. It's on-premise. Let's take payments. Let's break it out. Let's move it into the cloud. And let's have two systems living in parallel and gradually break it down. And in that way, you can release something in eight weeks and see if it works. Test assumptions. Pull risk forwards. Rather than have a two-year modernization program with a big day where everyone is very nervous. You flick the switch and then we realize there was a big issue. We didn't test something, and we have to roll back. I think instead we need to take it as iterative and evolutionary. And I think that's key because it's not just moving to the cloud. It's moving to the modern cloud. And of course, the modern cloud is going to always change. It's the very nature of being modern.
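In practice, the strangler fig pattern is often implemented with a routing facade in front of the monolith: requests for the extracted slice go to the new service, and everything else still hits the monolith. Here is a minimal, hypothetical sketch of that routing decision; the paths and backend URLs are illustrative assumptions, not from the conversation:

```python
# Minimal sketch of a strangler-fig routing facade.
# Backend URLs and path prefixes are illustrative assumptions.

MONOLITH = "https://legacy.example.com"
NEW_PAYMENTS_SERVICE = "https://payments.example.com"

# Slices already migrated out of the monolith, by path prefix.
# The list grows as more vertical slices are extracted.
MIGRATED_PREFIXES = ["/payments"]

def route(path: str) -> str:
    """Return the backend URL that should serve this request path."""
    for prefix in MIGRATED_PREFIXES:
        if path.startswith(prefix):
            return NEW_PAYMENTS_SERVICE + path
    return MONOLITH + path
```

When the prefix list eventually covers every slice, the monolith behind the facade can be retired without a single big-bang cutover.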
Recommended talk: The Ultimate Cloud Platform Team Topology • Ben Ellerby • GOTO 2024
Modern Cloud and Navigating Organizational Change
Julian Wood: So what is a modern cloud? What would you think about or define the modern cloud?
Ben Ellerby: It's a big question. I think modern cloud is right now the move to seeing compute and storage and other fundamental building blocks as a utility or a commodity. To trust the cloud provider to run your compute for you and really focus on your differentiating value, your application code. And also using things like Cognito for authentication instead of building it yourself. Trusting the cloud provider, using the cloud provider. But then there is a challenge in just writing your bespoke code. The cloud is building these things at scale. There has to be a bridge between them. And then there's still some complexity there. Over time, I think they'll get even simpler. But right now, I think it's building event-driven architectures, building serverless and moving undifferentiated heavy lifting to the cloud provider and focusing your company on its differentiating value.
But moving to that is a big change for organizations that typically have a monolith, typically have built a lot of stuff themselves. And it's saying goodbye to some of that stuff, which is difficult. But it's also breaking down systems that have been built over decades by multiple teams. So the path to the modern cloud is hard. But I think what the modern cloud is right now is quite clear. But it's going to change. It's going to evolve. So I think your approach to adopt modern cloud today has to be an approach to adopt modern cloud in 10 years. It has to be an evolutionary approach where you gradually move pieces to what are the modern parts rather than have a one-step jump from not modern to modern. If that makes sense.
Julian Wood: I know you're a big proponent of serverless. What are the objections when people are starting this journey? There's obviously an implicit trust in handing things over to a cloud provider to provide these kinds of managed services. What kind of things do people get tripped up with and how would you help them understand that?
Ben Ellerby: I often get dropped into organizations, or me and my team get dropped into organizations, that have started a journey to modernize, or are about to start one, and are struggling to actually get results. And often, you know, it's the classic stuff of: is there buy-in? Does the board agree? Do the product team and the tech team have a joint vision? Is there joint resourcing? What's the plan? So a lot of it's that.
Julian Wood: So that's not technical. That's a, you know, people and organizational thing.
Ben Ellerby: Unfortunately. A lot of my job has become fixing people...
Julian Wood: Psychological.
Ben Ellerby: ...and organizations, which is, you know, not necessarily what my computer science background helps with. But until you get that right, none of the technology changes really matter. So a lot of it's getting the sort of socio part in place. A lot of it's getting the technical part in place. And we join that together in what I call the serverless staircase, which is sort of a socio-technical blueprint for adopting cloud. And that starts with getting...
Julian Wood: So socio-technical. So, you know, talking about the people and the technology together to migrate and modernize.
Ben Ellerby: To the horror of engineers, yes.
Julian Wood: Okay.
Ben Ellerby: And that staircase for me is getting the vision right. So what is the North Star we're working towards and aligning an organization on that.
Julian Wood: So is that when you're talking about, you know, when there's not alignment with the board and things like that, is that, you know, having a North Star vision with your company that you know the end state you're heading towards and have that clear.
Ben Ellerby: Exactly. And getting product and tech to work together on the end state as one team. The second part then is really making sure you have the right topology. This is the architectural topology, what you're trying to get to, but also what is the team topology, the structure of teams to facilitate you to get there. Then it's how do you change an organization's skill set? So how do you provide psychological safety? How do you provide learning paths? Learning serverless is a big step for someone who's been doing a lot of sort of monolithic work in, you know, maybe a legacy Java framework for a long time. It's a big step change and it's a big change for traditional infrastructure teams as well.
And then it's really how do we take a progressive approach. That MVM structure that I talked about. That is step four. How do we take vertical slices and show results in eight weeks, rather than the mistake lots of people make, saying, in two years' time, it's all going to be done. Flip the switch and it doesn't work. And then finally, the most boring part, but the most important part, governance. How do you have tech and product working together to measure the progress of a modernization, to see the impact on DORA metrics, to see the impact on time to value, to see the impact on total cost of ownership and to measure that over time?
Julian Wood: That's interesting. So those are not, because lots of people think of governance as controls and security and auditors and all this kind of thing. But you're actually seeing governance from a software delivery perspective and sort of being agile and DORA metrics and that kind of thing, which is interesting.
Ben Ellerby: I mean, all the other stuff is important, it has to happen.
Julian Wood: It's included as well.
Ben Ellerby: But I think for a modernization, if you're trying to move an organization, make huge changes, decouple a monolith, break up teams, restructure, maybe move product into teams directly. It's a huge change. And I think the organization needs to see the results and also needs to make sure the results are actually happening. And that's the North Star. It's a staircase, but it's also kind of a cycle. You define the vision, so you need to actually check you're moving towards that vision. But yeah, I think it's a lot of actually making sure the return on investment is felt and seen throughout a modernization, not just expected at the end. And that's why I focus on the actual tangible business outcomes: reduce total cost of ownership, decrease time to value. We have to measure that through modernization, not just at the end.
Julian Wood: So what kind of things do you measure that you would put into this sort of feedback loop?
Ben Ellerby: It depends. It depends on what the vision is. If it's time to value, there are proxies. Deployment frequency is a really good proxy. It's not perfect. You can still measure time to value directly, but it's a little bit complicated. Things like deployment frequency are a good way of doing it. Or for other organizations, the North Star is really reducing downtime issues impacting SLAs. A great sort of proxy for that is MTTR, Mean Time to Recover. When there's an incident, how quickly do you recover? This has a financial impact because nearly any organization can put a dollar value on an hour of downtime.
I was once working with a travel company and there was, I think, $300,000 per hour of downtime in lost revenue. That's a big impact when their Mean Time to Recover was 14 hours. Bringing that down to 20 minutes, that's a huge financial saving that the commercial team, the product team can get behind. And then the engineers who've been working hard on modernization are not seen as the people who aren't delivering features and working on the IT project, but the people delivering huge financial benefits to the organization. So there's some examples, I guess, of proxies to that overall North Star that we're working towards.
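The arithmetic behind that example is simple to sketch. The revenue figure and the two MTTR values come from the conversation; the number of incidents per year is a made-up assumption for illustration:

```python
# Downtime cost before and after an MTTR improvement.
# revenue_per_hour and the MTTR values are from the conversation;
# incidents_per_year is an illustrative assumption.

revenue_per_hour = 300_000      # lost revenue per hour of downtime ($)
incidents_per_year = 4          # assumed for illustration

mttr_before_hours = 14.0        # 14 hours
mttr_after_hours = 20 / 60      # 20 minutes

cost_before = revenue_per_hour * mttr_before_hours * incidents_per_year
cost_after = revenue_per_hour * mttr_after_hours * incidents_per_year
annual_saving = cost_before - cost_after
```

Even with only a handful of incidents a year, the saving runs into millions, which is the kind of number a commercial team can get behind.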
Julian Wood: And then you talked about the sort of people organization and changing the organizational structure and maybe putting product in teams and everything. I mean, that sounds like quite an emotive and complex thing to do in a company because it's not just moving tech around and changing libraries and doing stuff in the cloud. How do people approach that with that psychological safety that you mentioned?
Ben Ellerby: I think a lot of people don't, and it's a lot of the cause for failure. I think starting on the right note is key. I think taking small steps helps a lot. It's no longer we're going to rework everything. Instead we say, we're going to take payments. We're going to move it to a more modern tech stack. It's going to improve our Mean Time to Recover. It's going to have this business impact. Yes, these individuals are going to move to be a different type of team. They're going to move to work in a different way and we're going to see the benefits. And then everyone sees, okay, that worked. Let's do it again.
Recommended talk: Minimum Viable Migrations (MVMs): A Path to Cloud Native EDAs • Ben Ellerby • GOTO 2022
Incentivizing Teams and Managing Modernization Costs
Julian Wood: Are there ways to sort of incentivize teams and people to sort of be excited about this change? I mean, how would people do that?
Ben Ellerby: I think engineers are excited to learn new technology. It's fun and exciting, but they need to feel safe in doing it. If they're in an organization where the product or commercial teams are yelling at them because they're not delivering features fast enough, and I'm saying, and now you need to go learn a new tech stack, and then you're going to rebuild the stuff you've already built, and then you can build the features that you're being yelled at for. Well, there's no way they want to do that. So I think it's making sure the organization gives them the space, the time, the resources, the access to experts to pair program with them, to help them understand the technology.
So it's making sure there's a psychological safety in place and it's tying people together to that North Star vision. That governance not only to look at what's happening and make sure we're going to succeed, but also to communicate throughout the organization, to show the wins and celebrate the efforts. I think it comes down to safety and resourcing more than it does ever encouraging an engineer to learn something new because nearly every engineer is an engineer because they like learning something new. That's never really the challenge.
Julian Wood: I mean, you being a proponent of serverless and, you know, event-driven architectures, that's also a different architectural model. You talked about costing and this kind of thing. Is going through this journey a financially viable thing? Anything that's a change, anything that's new, is going to have a cost attached. Is the benefit at the end generally worth it, or how would you sort of sell that upfront investment?
Ben Ellerby: Well, as a consultant working in this space, I'd say the benefits are definitely worth it.
Julian Wood: Of course, yeah.
Ben Ellerby: But no, I think the benefits are worth it. And if they're not, then you don't have the vision. You don't start doing it. I think you need to be clear on what the benefits are.
Julian Wood: So being clear on that North Star in the very beginning to even understand whether you even do this journey.
Ben Ellerby: Exactly. And then, yeah, cost is an interesting question, right? Because there's the cost of the new architecture that's serverless. You've got to work with teams to explain FinOps: how the cost of using something will vary under a pay-per-use model, and the fact that you're also paying for some things to be baked in. We're paying for scalability to be baked in. We're not doing that ourselves. There's a whole education piece about total cost of ownership for a CFO of an organization that has typically had IT as a line item. It takes some work to get them on that journey. But if you explain it in terms of CAPEX and OPEX, they often quickly understand. So I think it's helping the organization understand the cost profile of what they're moving to, helping the organization understand the cost of what they're doing, and then helping them understand the end outputs. Any modernization project takes time. That's the biggest cost.
One thing I typically do with organizations that are working on their case for modernization, which is often a board paper for the executive team of large organizations, is to apply what I call an NFP model, new feature productivity. So we come up with a proxy. Okay, an average developer is currently 0.7 times productive, because the baseline is one, when things were good at the start of the company, when there wasn't as much complexity, when we were small and a fast team. But there's a lot of tight coupling in the monolithic system that's slowing them down. Added on to that, there might be a failed modernization program in place which drops that down to 0.5. We then multiply that by the number of engineers delivering in that environment. And we have a proxy for productivity.
Then we can model the end stage with everyone working in this fully serverless, event-driven utopia. We can look at studies. We can look at past experience and we see about a 33% increase in efficiency. So we're moving up to 1.33. Now in the interim, we're not just jumping in one day to 1.33. We're going to put some people on modernizing, some people on learning, some people on rebuilding. So new feature productivity is going to dip a little bit. But then it's going to move back up, and we actually model that as a graph. And those graphs are a very useful way to talk about different scenarios. Do you want to start with putting all 60 of your engineers on the modernization effort and drop NFP down? No, the product team is going to be very frustrated.
Do you want to take five people, move them to a good level of productivity, then take another five people, and actually level out that drop? You can bring in external people to help level that too. But I think it's important to model that out and get tech and product to agree on the scenario they want to go through when it comes to productivity. Because yes, modernization is expensive. Yes, you might use some consultants. Yes, you're going to spend some money on cloud while your on-premise is still running, so there's an overlap. And cloud providers have funding programs that can help with that. But I think the key cost is the productivity disruption to the organization. And that's how we apply that NFP model: to map that out and to decrease the impact on productivity throughout the modernization program.
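The NFP projection Ben describes can be sketched as a simple model. The 0.7 baseline, the roughly 1.33 end state, the temporary dip, and the 60 engineers are figures from the conversation; the quarter-by-quarter shape of the curve is an illustrative assumption:

```python
# Sketch of a New Feature Productivity (NFP) projection.
# Baseline (0.7), end state (1.33) and headcount (60) come from
# the conversation; the per-quarter curve is an assumed shape.

engineers = 60

# Scenario: migrate in small waves. Productivity dips while people
# learn and rebuild, then recovers past the baseline toward 1.33.
per_dev_nfp_by_quarter = [0.70, 0.62, 0.68, 0.85, 1.05, 1.20, 1.33]

# Team-level productivity proxy per quarter: headcount x multiplier.
team_nfp = [round(engineers * nfp, 1) for nfp in per_dev_nfp_by_quarter]
```

Plotting `team_nfp` over time gives the graph used to compare scenarios, for example all 60 engineers at once (a deep dip) versus five people at a time (a shallow one).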
Julian Wood: That's interesting. So it's a very structured approach, rather than just throwing engineers at the problem, or even throwing product managers at the problem, all at once. You're taking the iterative approach on the tech and also on the people. Modernizing the people as well.
Ben Ellerby: We said that it's a people issue, but we're still engineers. So it's a pretty mathematical approach.
Recommended talk: Software Architecture for Tomorrow: Expert Talk • Sam Newman & Julian Wood • GOTO 2024
Modernizing Microservices: Balancing Complexity and Cognitive Load
Julian Wood: You mentioned microservices and monoliths and breaking that kind of thing down. What are you seeing customers doing there? Because there also is a bit of a backlash against microservices, whether people have gone too crazy on microservices or not. You know, where is it even nowadays in 2024, 10 years into microservices? I spoke to Sam Newman yesterday and he was writing the microservices book 10 years ago. He said it made him feel old. But yeah, the microservices journey. What do people think about that now? When can it go wrong? And what would the reasons be for doing microservices in the modern cloud?
Ben Ellerby: I think, you know, microservices somewhat feel like second nature to us in engineering. But we're talking about, you know, companies that are moving to modernize to solve time to value, for instance. The time it takes from an idea in the organization to customer feedback in production. They're frustrated by how long things are taking to build. And that's often due to a monolithic code base or monolithic system. But it's not due to the fact it's a monolith. It's due to the fact it's a monolith that was built quickly and iteratively and the company's changed a lot. And the domain modeling maybe wasn't done at the start. And also the domain has changed. It might have been a company that did X that now does Y, and there's a load of legacy code for X that's kind of used in Y. But it's all very confusing.
One of the ways we try and show this to teams, because obviously, you know, it's their baby. They built it. Some of them were there at the start. We're not trying to say the monolith that they built is bad. To become legacy is to have succeeded. If the company had failed, it wouldn't be legacy. It wouldn't exist. The legacy is a success of what they did. But it's time for sort of the next chapter of that. And typically, what we do is analyze the code base. So we take the commit history of everyone in the organization, and we group it into the teams that they're in. We then model that over time and we can build a model where you can see the code base. You can see the different areas, maybe in Django, the different modules. And you can see little dots of color depending on which team made changes. And if you see an area where all the dots are green from one team, it's fine. One team is making changes to one area. Highly effective. If you see an area with dots of lots of different colors, it's a huge friction. It's a lot of stepping on each other's toes. When we show that picture back to the teams that are working, they're like, oh yeah, no, it's always painful to touch that. Oh, I never touched that. You know, we don't actually touch that. We rebuilt it over here because we didn't want to tell them. But actually, no, that bit's a nightmare.
So I think it's important to look at the monolith or the, you know, the monolithic system that an organization has when you're looking to move them to something else. To value that it was really good, but also to show where the pain points are and not talk about them theoretically, but talk to a team and show them, remember when you tried to do that commit last year and it was a nightmare. Well, it's because everyone else is making these commits at the same time. So really, visualizing the monolith structure and the interaction modes on the monolith, overlaying the current team structure, and then showing a new way in which that team structure will be different and how those dots would be non-overlapping. It's a really down-to-earth way to show engineers the benefits of moving to a decomposed system.
And I'm not saying an organization of 5 engineers needs 20 microservices. I'm saying an organization of 200 engineers probably needs more than one service to work on. And yes, you can scale a monolith. And yes, you can structure an amazing monolith, if you start fresh and think about a monolith that's going to support that many teams. But that's not the situation people are in. And the cognitive load on teams now is just too high. And that's the big move to decomposition. Yes, you get independent scalability. Yes, you reduce blast radius. Yes, you can reduce the scope of things like PCI DSS. But largely, you're reducing the amount of things someone has to think about when they're trying to build their feature. And that's really what I try and move people towards seeing. It's not about moving to microservices. It's about making the job of the teams easier.
Julian Wood: I think that is a fantastic point. People get very focused on the size of the microservice or the size of the monolith when it's actually got nothing to do with that. It's about being independently deployable, to be able to evolve the system.
Ben Ellerby: Exactly.
Julian Wood: So, I mean, we did talk about the size of monoliths and size of microservices. What do people get wrong?
Ben Ellerby: Lots. I think people say, okay, well, our code base has these modules, they're now each microservices, and let's build that. And we had this way of looking at, you know, our data modeling, and that's how we're going to do it in the new system. But modernizing the system, decomposing a monolith, moving to an event-driven architecture: it's your unique opportunity to do DDD right for the domain that you're in now.
Julian Wood: So DDD, domain-driven design?
Ben Ellerby: Domain-driven design, yeah, exactly. And it may have been used at the start of the company or the start of the system, but the system has changed over time. It's a good time to reset and think about what is a customer? What is an order? What is a payment method? And event storming, I'm a big fan of. It's an extension to domain-driven design. Putting lots of post-it notes on a whiteboard, involving both the tech and the business.
Julian Wood: That's a critical part. It's not a tech thing. The tech people, maybe for the very first time, are understanding, oh, this is what our business model is. They're normally working in their team doing their tech kind of thing, and this does a lot more to help them understand the bigger picture.
Ben Ellerby: It's interesting, right, because that forces us to step back and look at the whole system. It's a very painful week of workshops. Best to do in person, sometimes remote, which is quite painful. But when you do it in person and you look back, it's really complicated. The cognitive load is really high. God, this is huge. But then the point is we're only looking at it all together once to agree on our language, to agree on the areas of the system, to agree on the responsibilities, to agree on the events which become the sort of interface between systems. Then we never have to look at that big picture again. That's not true. You need to redo this periodically.
But then you can focus on your area. You don't have to think about all the complexity around it. So I think it's interesting. I'm proposing reducing cognitive load by taking a step back and looking at everything. So we have a high cognitive load to then drop it way back down and have that split up. But yeah, I think what people don't get right is assuming that the way they modeled a domain in the past is necessarily the right way to model their domain going forwards. And it's a unique opportunity to get it right. Because the split up of your bounded contexts, the split up of your systems, will lead to the split up of your teams, applying the reverse Conway maneuver to make sure that we get the right organization for the architecture that we want. This is the core time to get that domain modeling correct.
Julian Wood: To reverse Conway. I mean, Conway's law is that your software architecture is basically built on the way that your teams are organized. And reversing that is deciding on your architecture and then formulating your teams to help deliver that.
Ben Ellerby: Exactly. "Team Topologies" by Matthew Skelton and Manuel Pais covers it in detail. It's a really great technique and we use it with almost every organization we're modernizing. If you have two teams that you run separately, you'll end up with a two-system architecture. So instead, let's think about the systems that we want to have, sorry, the architecture we want to have. And then let's map our teams and structure our teams around that. And then that team has a reduced cognitive load because all it has to think about is bookings, let's say. And it's working in a system that only deals with bookings. The code is simpler. And yes, it interacts with other parts of the system. But through an event-driven architecture, there's no tight coupling. There's no temporal coupling. It's not request-response. Instead they're pushing out events. And I think the other thing people don't value enough is getting that event schema, the structure of those events, correct, and basing them off the business domain, not the technical domain.
Recommended talk: You Keep Using That Word • Sam Newman • GOTO 2023
Event-Driven Architectures and Platform Teams
Julian Wood: I think that as soon as people in my experience do event-driven architectures, that's the sort of first hurdle. Everyone's going, what do I put in my event? How do I structure my event? And, you know, lots of things about coupling and decoupling and everything. But the consumer and producers do need to know about each other in some kind of coupling. So any sort of advice on event structure?
Ben Ellerby: I think event storming is a good way to get your list of candidate events. We have a blog article, EventBridge Storming, where we apply that to EventBridge and we go further and build out the event schema. I think it's good to have somewhere to put your event schema. The schema registry for EventBridge is an example in AWS. There are other ways to do it. But making sure it's discoverable. The idea is that the marketing team, if they want to send an email after someone buys something, they shouldn't have to go to the payments team to figure out at what point in the code to trigger that. They should just look up payment success.
Oh, that's an event. Okay, when that event happens, we're going to trigger this. There's no communication between the two teams. The events are the interface. But I think it's key to think about events globally at the start, to work really hard on a really good event schema that people agree with, rather than letting every team ad-hoc it and then saying, okay, in a few months, we'll look across and make sure we move things more in line. That doesn't work. It's one of the few things that you really need to try and get right globally at the start.
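As a concrete illustration of events as the interface, here is a hedged sketch of a shared event envelope of the kind a schema registry might enforce. The envelope fields, the `payment.succeeded` name, and the payload are all hypothetical examples, not a schema from the conversation or from PostNL:

```python
# Sketch of a shared event envelope checked against a registry.
# All field names and the event type are illustrative assumptions.

REQUIRED_FIELDS = {"eventId", "eventType", "occurredAt", "data"}

def make_event(event_type: str, data: dict) -> dict:
    """Build an event envelope that follows the agreed schema.

    Events are named for what happened in the business domain
    ("payment.succeeded"), not for the technology that produced them.
    """
    return {
        "eventId": "evt-001",  # would be a generated UUID in practice
        "eventType": event_type,
        "occurredAt": "2024-06-01T12:00:00Z",
        "data": data,
    }

def is_valid(event: dict) -> bool:
    """Check the envelope carries every field the registry requires."""
    return REQUIRED_FIELDS.issubset(event)

event = make_event("payment.succeeded",
                   {"orderId": "o-123", "amountCents": 4999})
```

With a discoverable registry of such envelopes, the marketing team can subscribe to `payment.succeeded` without ever asking the payments team where in their code to hook in.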
Julian Wood: And that is hard. I mean, in the Netherlands at PostNL, the Dutch postal service, Luc van Donkersgoed has done a huge amount of work on explaining their whole event system and schemas. They've got a very formalized system of registering events, and the structure has all got to be done in a certain kind of way. And that goes far. Once that schema is well defined, the consumers then have amazing flexibility to use those events because they know it's within that structure.
Ben Ellerby: And there are, you know, lots of views about fat events versus thin events and how you're going to do it. But I think the key is doing it one way, defined across all events, having good discoverability of those events and some sort of schema registry. And really, really leaning into the fact that services should master their own data. So, yeah, maybe the marketing system needs to send emails. So it needs to know my name, but it doesn't need to call some service to get my name. It can listen to the events about user data changing. Now, this is a little bit complex because it's PII data and there are other aspects to that. But the key thing is it can manage its own projection of that data. It doesn't have to have coupling. People always push back because data duplication is wrong, you should never duplicate data. But if you never duplicate data, you add huge constraints on communication and coupling overhead.
Julian Wood: And it's a temporary duplication of data, because a microservice is going to receive an email address and a name to send an email, but once that email is sent, it doesn't necessarily need to retain that data.
Ben Ellerby: It can and it can't. But yes, it doesn't necessarily have to. I think people are terrified of data duplication. I understand why, but we need to think about why we're so scared of it. Temporary data duplication to enable us to send an email is not necessarily a bad thing. And even longer-lived data duplication can work, because data duplication is better than communication at scale. Obviously, it depends on what sort of data. Your bank balance is very bad to duplicate. But some less important aspects of data can be duplicated if you think about the outcomes. This is where we come back to organizations that are trying to modernize towards this. This is moving towards a new type of cloud. It's moving towards event-driven architecture, towards eventual consistency, and towards changing how they think about data. This is why people struggle. So I think a lot of it is trying to break down those problems and give people a path, because it's a lot to think about at once.
Recommended talk: The Art of EDA Visuals: Exploring Concepts Through Graphics • David Boyne & Eric Johnson • GOTO 2024
Julian Wood: Just a shout out to an ex-colleague of mine, David Boyne, who's written an open-source tool called EventCatalog for understanding and cataloging all those kinds of events. We were speaking earlier about cognitive load with developers and this mythical 10x engineer. Especially if you're in the cloud and you're doing serverless, the engineer is going to need to know everything from the front end to managing Kafka queues at the bottom, plus the code, deployments, CI/CD pipelines, security, the whole thing. And I know you're a big proponent of helping people understand how platforms work in organizations, particularly with serverless. Some people don't understand why you would need a platform team or platform capabilities if you are using serverless. The cloud provider just does it all. What are your thoughts on that?
Ben Ellerby: Yeah, I see that challenge a lot. And to be honest, five years ago, I was probably more in that camp: the cloud provider is our platform. And yes, it's a shared responsibility model. The cloud provider is your platform. You're consuming compute as a utility. But take S3, an S3 bucket. It might be that security and compliance says certain encryption standards need to be applied, and you need to stop anyone ever making an S3 bucket public. That all makes sense. But how do we enable our stream-aligned teams to do that correctly? We've moved to a mode in which it's much simpler to build applications at scale than it ever has been. Compute is a resource. Storage is a utility. It's easy to consume. But also, our teams look very different.
We don't have a database expert and a networking expert and a security expert on the team. We have a few full-stack developers with a bit of a backlog and a dream, trying to make all of this work. So we need to reduce the cognitive load to build systems at scale. Yes, some things are easier, but we need to enable those teams. So I typically see my platform team sitting between the cloud provider's platform and those stream-aligned teams, if we borrow the terms from Team Topologies, and providing things that enable those stream-aligned teams to move faster. So, for instance, building CDK constructs. CDK, the Cloud Development Kit, is an infrastructure-as-code solution from AWS (well, open source, but from AWS) that enables people to build in familiar programming languages like TypeScript. I often have platform teams building CDK constructs, L2+ constructs, so between level two and level three, that enable their stream-aligned teams to move quicker by giving them building blocks with…
Julian Wood: So these are little packaged building blocks with all the best practices built in, things which their developers can consume.
Ben Ellerby: And those are their company-specific best practices. That company might have some rules around S3 buckets, how they want to store them and how they want to encrypt them. That's fine. The teams just need to make a company bucket, whether it's an airline bucket or whatever it might be. That's then consumed by the stream-aligned teams and used as a building block, which means it's almost pre-approved, which means you can roll back some of the ideas of change approval boards and top-down controls kicking stuff back late in the development process. Of course, some top-down controls are still needed, but it sort of inverts that relationship. It's not top-down imposing. It's bottom-up enabling.
And that is the platform as a product. Those constructs are the product. They're versioned, they're released on something like NPM privately, and then they're consumed by those stream-aligned teams. And it can be an inner-source model: those stream-aligned teams can make changes, but the code is owned by the platform team. That is their product to the stream-aligned teams. So yes, AWS and other cloud providers have made it super simple to build applications at scale. We are then putting a little bit more customization on top, giving building blocks, and those building blocks are then used by the stream-aligned teams. So it's a thin layer. It's important not to try and build your own platform. Use what the cloud provider's giving you, but add the level of customization you need to help those teams meet the needs of your organization. That's the key part that I see.
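The pre-approved building block pattern can be sketched without any AWS dependencies. In practice this would be an aws-cdk-lib L2+ construct in TypeScript, as Ben says; this dependency-free Python stand-in just shows the shape: the platform team bakes company policy into a factory, and stream-aligned teams only touch the knobs they're allowed to. All names and defaults here are invented for illustration.

```python
# Sketch of a "company bucket" building block: platform policy is baked
# in, while teams get a small set of configurable options.
def company_bucket(name: str, *, team: str, versioned: bool = False) -> dict:
    """Return a bucket definition with company defaults pre-applied."""
    return {
        "name": name,
        "encryption": "aws:kms",      # platform policy: always encrypted
        "block_public_access": True,  # platform policy: never public
        "versioned": versioned,       # team-configurable knob
        "tags": {"team": team, "managed-by": "platform"},
    }

bucket = company_bucket("invoices", team="billing")
assert bucket["block_public_access"] is True
assert bucket["tags"]["managed-by"] == "platform"
```

Because the policy lives inside the building block, consuming it is what makes the resource "almost pre-approved": there's no way to forget encryption or accidentally make the bucket public.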
Julian Wood: I can think of platform teams doing a huge number of things, because you've got API design, even all the schemas we're talking about for events, where logs go, observability and monitoring. There's a huge amount of cognitive load that can be taken off the developers so they can concentrate on their little business domain, and the common platform can sort things out.
Ben Ellerby: The encapsulation and abstraction that come with familiar programming languages enable us to do all of that. To build observability into every Lambda function, to apply some best practices on events. And yes, it's visible to the stream-aligned team. They can click through the code. They can see what the construct does. But they don't have to. That's the big shift: they don't have to. So when they're trying to build their REST API, they don't have to think about the thousand things that need to be right. They just have to think about the application code they're writing that's custom to the application they're building. And of course, they need to understand and have some awareness of the other parts that are in play, but they don't have to do all that work, because the platform team is enabling them.
Julian Wood: I think those platform teams are even starting to take on more things, even in terms of cost profiling and other things you wouldn't necessarily think of as a platform team capability. Understanding your costs, understanding scale and reliability and all those kinds of things as a central service across these different teams. I think that's also a new way that platform teams can be useful.
Ben Ellerby: I've seen it on the platform teams that we run. One of them recently built a FinOps construct, a CDK construct that you deploy inside your service, and it builds a CloudWatch dashboard with all the metrics we think are important around FinOps. The stream-aligned team can configure them, but they don't have to think about what they are to start with. All the tagging has already been done because they've used the construct for the Lambda function and the construct for DynamoDB. It's all well tagged, it's all wired up for observability, and thought and design have been put into that dashboard for the team to use. That team can still customize it. But FinOps is a responsibility of all teams. It's the responsibility of the platform team to give teams the tools to be empowered to leverage that information. There is also the education piece for teams to understand FinOps, and that's where enabling teams come into play: how do you make sure you have subject matter experts around FinOps? But the platform team is giving teams the tools to do things the right way. You then need to make sure that teams want to do things the right way and use those building blocks. Typically, engineers want to do things the right way. If there is a building block, they're going to want to use it because they see the value, but they would never have had the time to put up a FinOps dashboard and work through the backlog and modernize and learn new technology all at the same time.
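The reason the consistent tagging in Ben's FinOps example matters is that cost attribution then becomes a trivial group-by. A rough sketch, with invented cost rows standing in for what you'd actually parse out of a cost and usage report:

```python
# Sketch: once every resource created via platform constructs carries a
# consistent "team" tag, per-team cost rollup is a simple aggregation.
from collections import defaultdict

cost_rows = [  # invented data standing in for a cost and usage report
    {"resource": "fn-orders",  "tags": {"team": "orders"},  "usd": 12.50},
    {"resource": "tbl-orders", "tags": {"team": "orders"},  "usd": 3.25},
    {"resource": "fn-billing", "tags": {"team": "billing"}, "usd": 7.00},
]

def cost_by_team(rows: list[dict]) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        # Untagged spend is surfaced rather than silently dropped.
        totals[row["tags"].get("team", "untagged")] += row["usd"]
    return dict(totals)

assert cost_by_team(cost_rows) == {"orders": 15.75, "billing": 7.0}
```

The "untagged" bucket is the useful failure mode: it shows exactly how much spend escaped the platform's building blocks.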
Julian Wood: And I know platform teams can also get into trouble, because you can have enormous platform teams that end up being blockers, end up being the bottleneck. How do people structure platform teams? I know there are different schools of thought on having one big platform team or multiple different platform teams. How would you suggest people approach that?
Ben Ellerby: Yes, it's a good point. The platform team becoming a bottleneck is a big thing everyone is scared of, especially in a modernization like we were talking about earlier. What you don't want to do in that big-bang approach is say, cool, we've got a platform team, we've got these six stream-aligned teams, we need to get this done in eight months, everyone go. And everyone's like, well, we don't have any constructs, the platform team hasn't done anything, okay, we're going to build some stuff and use the platform team's stuff later. The platform team builds the stuff. Obviously, it never gets used. There are huge debates, huge fights. Everyone blames the platform team. It's not a great start.
So if you're looking to modernize and move towards this structure, give your platform team a head start. It's probably the easiest win. Let them get a bit ahead. I think then it's important to have an inner-source model. Say a team really wants support for Aurora Serverless v2 that the platform team hasn't yet built in. Well, the team can build that construct and make the pull request. That has to be merged in by the platform team, who will have subject matter experts in security and networking. They'll be able to approve it and own it: not run the architecture that you're building, but maintain the building block that you are using.
So the inner-source model is one way to do it. Letting the platform team get ahead of time is another. And not taking on too much. Really be like, okay, the cloud provider is our platform; we're just adding our company-specific logic on top.
Julian Wood: Customizations.
Ben Ellerby: Customizations on top. And that's the key: not to take on too much responsibility. Keep your platform thin, a thin layer on top of something that evolves and changes, and make sure you keep up to date with those evolutions and changes. I think that's key. Inner source is probably the approach: teams can see the code and request changes to it, but the ownership stays with the platform team. That's probably the way you can have more dynamic resourcing of that platform team.
Recommended talk: Complexity is the Gotcha of Event-driven Architecture • David Boyne • GOTO 2024
Future Innovations in Cloud and Serverless Architectures
Julian Wood: So, casting your eye into the crystal ball: the modern cloud and serverless are continually evolving, but where do you see new innovations and cool things coming down the pipe, or at least that you hope are coming down the pipe, to take advantage of? Could be Gen AI, could be not, but, you know, that's the whole...
Ben Ellerby: I feel like there's almost an obligation to mention Gen AI. Gen AI is interesting, of course, and it's what everyone's talking about. One of the areas where we're actually seeing real applications in the work that we do is in automating the modernization of legacy code bases. Moving an old Java application to a new version of Java, moving from an old framework to a new version of a framework. The legacy is at a scale where some enterprises have 60,000 applications running an old version of Java. Who has the time and the bandwidth? And most people...
Julian Wood: Or even understanding that code. Selecting some code and using a Gen AI model to say: just explain what this code is, what's the input, what's the output, what transformation is happening. Then at least you know what's going on, rather than having to decipher the old, crufty stuff yourself.
Ben Ellerby: Exactly. It's useful for that. And also, what we're experimenting with when code bases have a high level of testing is to move things back to the abstract syntax tree, and then to do a transformation from the AST of one version or one language to the AST of another. And we found that because generative AI models are trained on XML...
Julian Wood: This is a whole other language you are moving between…
Ben Ellerby: We're experimenting with moving between languages. But because it's trained on XML, it's quite good at XML manipulations. We found that worked quite well. It works on languages; it's working-ish on languages where there are loads of examples. The issue is that some of these organizations are asking for modernization of versions of languages that are not popular online, not popular on Stack Overflow or GitHub, that there aren't great examples of. So we're also working on how to provide more context to these models. How do we build a bit of a RAG system with good examples?
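A toy version of the AST-to-AST idea Ben describes can be shown with Python's standard `ast` module: parse the old code, rewrite a call site as a tree transformation, and unparse the result. The function names are made up for illustration; the real targets are old Java versions and frameworks, but the mechanism is the same.

```python
# Sketch: modernize code by transforming its abstract syntax tree rather
# than doing fragile text replacement.
import ast

class RenameCall(ast.NodeTransformer):
    """Rewrite calls to the (hypothetical) `legacy_fetch` as `fetch`."""
    def visit_Call(self, node: ast.Call) -> ast.Call:
        self.generic_visit(node)  # transform nested calls first
        if isinstance(node.func, ast.Name) and node.func.id == "legacy_fetch":
            node.func.id = "fetch"
        return node

old_source = "result = legacy_fetch(url, timeout=30)"
tree = RenameCall().visit(ast.parse(old_source))
new_source = ast.unparse(tree)
assert new_source == "result = fetch(url, timeout=30)"
```

Working on the tree means arguments, nesting, and formatting quirks are handled structurally, which is also why a strong test suite matters: the tests verify the behavior survived the transformation.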
Julian Wood: Send them to Magdeline. Awesome.
Ben Ellerby: Exactly. So automatic code modernization, I think, is one of the super exciting parts, because yes, we're looking to modernize how people build new applications and the applications that are touched a lot. But sometimes they're sitting on layers and layers and layers of legacy that need to be moved, and there is a critical need to move. We're seeing automating that with large language models as a really interesting use case. Of course, people are using coding assistants; that all makes sense. But can we get to the point where we could say: we're just going to upgrade the language of a well-tested code base all in one, and it's going to move across? I think we're not as far away from that as we might think. Some of the other use cases, I think, are further away than some people might be saying.
Julian Wood: Other than Gen AI (we've ticked our box with that; any conference talk needs to mention Gen AI), what sort of future things are you looking forward to?
Ben Ellerby: I think it's interesting to see how modern platform teams evolve and how they leverage things like IDPs, internal developer portals. How there can be better discoverability and more productization of what platform teams do. That's really interesting.
Julian Wood: Is that the discoverability of your templates and your blueprints, even a list of the NPM modules you may have created, or the CDK constructs, your APIs, the schemas, anything like that?
Ben Ellerby: I think it's bidirectional discoverability. It's helping stream-aligned teams discover what you've got to offer, and also helping the platform team see what the stream-aligned teams are up to.
Julian Wood: And what they're using, consuming.
Ben Ellerby: What's that random thing? We didn't have a template for that. Not saying stop; saying, if everyone's trying to do this thing, why don't we build a construct that helps them? So I think it's discoverability in both ways, and I think it's a core building block for that. I'm seeing lots of people try different things, and I think it's a space that's going to evolve a lot in the next few years. So with IDPs, keep a strong eye on how people are adopting them at scale. And event-driven architectures: there's a resurgence, and I think it's interesting to see the new patterns, the new tooling. You mentioned David Boyne's EventCatalog tool. We're using that. It's really exciting to see that move forwards.
Julian Wood: So what's causing the resurgence in event-driven architectures, do you think?
Ben Ellerby: That's a good question. It feels like a resurgence in event-driven architectures and domain-driven design. And I think it's because we've solved a lot of the challenges around compute and storage; they're becoming more of a utility or commodity. We are now focused on how we structure teams for fast flow, to borrow again from Team Topologies. How do we lower time to value? How do we lower the cost of ownership? Some of that is a structure of stream-aligned teams working effectively with reduced cognitive load. But to reduce cognitive load, you need independent areas of the system with independent teams, a reduced blast radius, and fewer things you have to think about.
Now, if there's a synchronous request-response between microservices, that's a lot of coupling. That's what we've seen, right? People are negative on microservices for many reasons, but one of them is the amount of tight temporal coupling and other types of coupling between systems. Event-driven architectures that are asynchronous just push the event out. Other services respond to it. It's eventually consistent. That, I feel, is a way to reduce cognitive load while also getting the benefits of splitting up teams.
Sorry, to reduce coupling while also getting the benefits of splitting up teams and reducing cognitive load. And, you know, an event broker is not a new thing. If we think about the enterprise service bus and SOA, we've done this before as an industry. But running an ESB was a privilege. Well, not if you ask the teams, but it was something only huge organizations could do. It was very complicated. I can now spin up an EventBridge bus in 30 seconds. So I think it's also that the event backbones that enable this are now coming out as a commodity, as a utility, and that enables companies of any scale to start to work in this way.
Julian Wood: I think that's true as well. I mean, specifically, I work in the serverless space. Those integration patterns from Gregor Hohpe are 20 years old and entirely valid today. But it was really hard to build those enterprise service buses or use the different patterns, because you had to model your whole system around one enterprise service bus and shoehorn everything into it. Nowadays, whether you're doing Pub/Sub or fan-out or a whole bunch of different patterns, it's so much easier in a serverless way because, as you say, you can spin it up in 30 seconds.
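The Pub/Sub fan-out pattern being discussed can be sketched with a toy in-memory event bus: the publisher fires one event without knowing who consumes it, and each subscriber reacts independently. A managed bus like EventBridge gives you this shape as a service; this sketch only shows the interaction, with invented event and handler names.

```python
# Toy in-memory event bus illustrating decoupled Pub/Sub fan-out.
from collections import defaultdict
from typing import Callable

class EventBus:
    def __init__(self) -> None:
        self._subs: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, detail_type: str, handler: Callable) -> None:
        self._subs[detail_type].append(handler)

    def publish(self, detail_type: str, detail: dict) -> None:
        # Fan-out: every subscriber gets the event; the publisher has no
        # knowledge of, or coupling to, its consumers.
        for handler in self._subs[detail_type]:
            handler(detail)

bus = EventBus()
log = []
bus.subscribe("OrderPlaced", lambda d: log.append(("billing", d["id"])))
bus.subscribe("OrderPlaced", lambda d: log.append(("shipping", d["id"])))
bus.publish("OrderPlaced", {"id": "o-1"})
assert log == [("billing", "o-1"), ("shipping", "o-1")]
```

Adding a third consumer requires no change to the publisher, which is the decoupling both speakers are pointing at.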
Recommended talk: Platform Strategy • Gregor Hohpe & James Lewis • GOTO 2024
But even older products are evolving. Something like Kafka is hugely used, and they're adding far more innovative things: reinventing the storage mechanisms, adding queuing capabilities. So I think there's a resurgence in people understanding that the patterns are super helpful, and a serverless way of building these makes it far easier to use more of the patterns without having to worry about feature bloat.
Ben Ellerby: Yeah, managed Kafka is a big change in how you can actually manage to use some of these things. And I think it's also important to say that when this resurgence in event-driven started, a lot of people treated event-driven and event sourcing as the same thing: if we're going to be event-driven, we're going to do event sourcing. Event sourcing can be very complicated and costly, and it's useful for some use cases, but not every use case.
I think people are moving towards event-driven architectures, but not necessarily towards event sourcing. It was important for people to realize they could use events without taking an event-sourced approach, because the learning curve is a lot lower, the investment is a lot lower, and the results come a lot faster with a pure event-driven approach. Of course, event sourcing is useful for some things: financial data, great; audit trails, great. It's not needed for every application. You asked what I was interested in about the future of where cloud's going. How about yourself? What are you excited about?
Julian Wood: I'm sort of happy that serverless has been around forever. We often talk about the start of serverless as when Lambda came out, which is 10 years ago, the same time as Kubernetes came out. You can see different approaches to solving distributed computing problems. But I often say serverless was way before then. In fact, all of the major cloud vendors came out with a serverless model where you just consume things via an API over the internet, and that could be a queue, it could be storage. Talking Amazon, we've got SQS and S3, and the concept of actually renting virtual machines only came later. That's not just an AWS thing; it's true of all of them.
Recommended talk: Serverless Compute at the Heart of Your EDA • Julian Wood • GOTO 2024
I sort of jokingly say that the cloud was born serverless. And I think that's just going to continue. Certainly, serverless as a term has been very much diluted, and anyone with marketing money has jumped on the bandwagon. Luckily, they've now got Gen AI to focus on. Serverless is used by so many companies. It will just continue to evolve, and the cloud providers, and the companies behind the scenes, are just going to iterate and evolve.
It's going to be faster, cheaper, simpler to use, with more functionality. And I think we're in a phase now where we don't have to tell people that serverless is some gimmicky thing; massive production workloads are being built on it. I'm hoping that in this realization phase, people are just going to get on with it, build with it, and have success with it. And then under the hood, because you are handing over some of your operational control to a cloud provider, they can do the best job they can to make things better, faster, cheaper and do more.
Ben Ellerby: And I guess while we're talking about buzzwords: edge compute. How do you see that playing into this future?
Julian Wood: I think that is going to be interesting. There's a lot of interesting work happening there. People needing lower latency, building more global applications, compute power moving towards the edge. People are going to want to run functions or containers and things like that at the edge. And then there's the gravity question of where the data is. Sure, you can have your data in a central place and something connecting from the edge, but that's going to add latency.
So I see a push and pull dynamic of more data moving towards the edge, and more data services moving towards the edge. And that means more things need to work where the data is. So maybe there's going to be AI modeling or analytics and that kind of thing moving to the edge. So even with network bandwidth, where you can always throw more bandwidth at the problem, I think more and more services are going to be moving towards the edge. How that is managed, and what the difference becomes between public cloud, private cloud, hybrid cloud, I think it all sort of merges. But yeah, I think it's exciting if you can have truly globally distributed applications with data, compute and everything else in multiple locations. That's a whole new world to explore.
Ben Ellerby: Then in 10 years, we can have this conversation again and talk about all the pitfalls of edge computing.
Julian Wood: Exactly. Well, you're a consultant, you'll be able to help.
Ben Ellerby: Once we've done generative AI, we'll move on.
Julian Wood: Exactly. Ben, thanks so much for your time today. A great encapsulation. I love talking about the socio-technical side; we spoke a lot about people as well as technology. So yeah, good luck with all your future transformations. I know you're helping your clients in a great way. Thanks for talking to us.
Ben Ellerby: Thanks so much. See you back in London.