
Designing A Data-Intensive Future: An Unscripted Journey with Martin Kleppmann

Jesse Anderson, director at Big Data Institute, and Martin Kleppmann, author of "Designing Data-Intensive Applications," explore together the evolving data landscape. They start with the origins of Martin’s book, emphasizing the crucial art of asking the right questions. Martin unveils industry shifts since 2017, spotlighting the transformative rise of cloud services. The conversation then takes a twist as Martin delves into academia, sharing insights on local-first collaboration software and the fascinating world of Automerge. Aspiring software engineers are treated to some advice on how to navigate the delicate balance between simplicity and adaptability. The interview concludes with a glimpse into diverse career paths in the dynamic realm of data engineering, making it a must-watch for professionals at every stage of their journey.


About the experts

Martin Kleppmann ( author )

Researcher at the Technical University of Munich

Jesse Anderson ( interviewer )

Managing director of Big Data Institute, host of The Data Dream Team podcast


Introduction

Jesse Anderson: Hello and welcome to GOTO Unscripted. We're here in Amsterdam. My name is Jesse Anderson. I'm the managing director at Big Data Institute. With me is Martin Kleppmann. If you don't know about him, he's done all kinds of things in the data space and he is doing some other quite interesting things. We're gonna get into that and we'll talk a little bit more about that. Welcome, Martin. Thank you again for joining us.

Martin Kleppmann: Thank you, Jesse. It's great to be here.

Jesse Anderson: Oh, I appreciate it. So you're probably known in the data world most for your work in your book, "Designing Data-Intensive Applications." Could you tell us briefly what you were trying to get across in that book?

Martin Kleppmann: Yes, sure, of course. I was bewildered by the number of different technologies that there are in the data space. So many different data processing systems and databases, so many different ways of doing things, and so many blog posts that tell you which one is the best or which is the worst piece of software. I was just trying to figure out some way of getting people good advice on what technologies to use for their particular application because I realized there's no one right answer, right? It always depends on what you want to do, and that means you have to ask the right questions about the technology to figure out whether the technology is suitable for you or not. And so what I had essentially wanted to do with this book is equip readers to ask all the right questions so that they could figure out themselves how to build systems involving lots of data and which technologies to use.

Jesse Anderson: That's really good advice. That's one thing I try to instill even in my own consulting business: a vendor will tell you what they want you to use, which, oddly enough, is their software. You need to learn to ask the right questions; otherwise, you could get painted into a corner. It could be the wrong thing. So it's really important to ask the right questions.

Martin Kleppmann: Exactly.

Recommended talk: 32 Book Recommendations for the Holidays • Various Speakers • GOTO 2021

Evolution of Data Systems

Jesse Anderson: As you look back in the book, what has changed since you wrote the book to now that you'd like people to know about?

Martin Kleppmann: The book came out in 2017, so that's been a while already, but it took me a long time to write it. So the first few chapters were written in maybe 2014 or something like that. So they're almost a decade old now. I think one of the biggest things that has changed in the meantime has been the rise of cloud services. Like, the cloud existed already in 2013, but it wasn't as omnipresent as it is now, and I felt like back then, many companies were still essentially self-hosting things on-premise. That's what the book mostly focused on. But now I feel these sorts of cloud-native architectures have just become much more popular and much more widespread. Also, it has really changed the architecture of how data systems are built because it's no longer this idea of centering on individual machines, and each machine has the hard disks attached to it, and instead, it's more like you build, you compose one service from other services, and each service individually may be spread across multiple machines.

All of these distributed systems problems have been pushed to a different layer essentially. And so that's something that I've been thinking about, how that affects the architecture of data systems also going forward into the future. So I've started working on a second edition of the book, but progress has been very slow. So don't hold your breath just yet. It's going to be a long time till it comes out. But one of the things that I've been trying to work into the second edition is how this sort of cloud-native perspective changes things. Of course, not all systems will be in the cloud, and so that's another one of these questions that you need to ask whether cloud or on-premise systems are better for a particular application. But for those that do want to be in the cloud, I think it's brought some interesting new, like, design dimensions.

Jesse Anderson: So let's invert that question. What has stayed timeless?

Martin Kleppmann: Fortunately, most of it. So I'm hoping that will make the rest of the revisions of the book manageable. The basics of, like, transactions and consistency models, I think not much has changed. The basics of replication, how data gets spread across multiple machines, and partitioning, most of those things haven't changed that much in 30 years. They haven't changed that much more recently either. But other things have. So for example, I think MapReduce was featured quite prominently in the first edition, and I think that's essentially dead now. It's been replaced in one direction by, like, cloud data warehouses for the sort of business analytics use case, and in the other by streaming systems for the sort of custom application logic use cases. So those sorts of things have changed, but a lot of the principles, I think, have stayed the same.

Jesse Anderson: Are there any controversial things that you think you'll put either in the first edition or second edition?

Martin Kleppmann: I find it a bit hard to anticipate what will be controversial really because I try to look at things from first principles as much as possible. Whether something is controversial or not depends kind of on your prior beliefs. So there'll probably be some things that some people find controversial. But on the whole, I try to make the book less about opinions and more about, like, trade-offs that we can reason about with evidence. So I'm not telling people what to do, I'm telling people what questions to ask.

Jesse Anderson: And I think that goes to your experience there and becoming more academic is "I'm not going to tell you what to do." You're getting out of that consulting mindset and saying, "Here's how to think about the problem." And that's really what I think people should focus on. What are the things that you need to do to understand, to figure out the problem?

Martin Kleppmann: What have you found in your work? Because you say, like, when you're consulting, you try to get people to ask the questions as well. What do you think about that?

Jesse Anderson: One of the primary things I tell people to do is don't look at technology choices until you understand the problem, because usually, it's an after-the-fact of, "Hey, everybody, we're doing MapReduce." Now we invert that problem and we say, "Okay, now every business problem has to use Hadoop." And from my time at Cloudera, I saw the problems of that, of "Look, everybody, you're using Hadoop and MapReduce, and it has these inherent limitations," and now you're having to go and do workarounds for those limitations instead of looking at the business problem in the first place and saying, "Oh, this technology lends itself."

I would say the other one is that people are looking for a single technology to do everything, and that just doesn't exist. We spent our time probably in small data just saying, "We have a database and we have maybe a web server if we're doing something like that." We just had three technologies kind of, and now we have this explosion and people say, "Well, that explosion is just us trying to over-engineer things." I don't think it is. It's us trying to optimize things, optimize for the business case when we have to do that.

Martin Kleppmann: Right. Exactly. I think if you're working with small amounts of data, you know, if it's a small enough amount, you can put it in a spreadsheet. If it's a bit bigger, you can put it in a relational database and it's probably just fine. But then when you get to large data volumes, you need specialized systems to deal with the particular workload that you want to throw at it. And, you know, you said people search for, like, a single tool that can do everything, but my response to that is that if you have a single tool that can do everything, the only way it can do that is by doing everything badly. So it's better to have specialized tools that do one thing well and then to try and compose those, so that then, all in all, you're getting your problems solved well rather than badly.

Jesse Anderson: Yes. And to kind of bring it to, since we're at GOTO Amsterdam, you and I sat together and we enjoyed Dave Thomas' talk, and the premise of Dave's talk is that our biggest or most important thing as technologists is to enable change, right? To be able to make things as easy to change as possible. How do you think that manifests? And let's narrow it down to the data field.

Martin Kleppmann: I think it's very important for the data field. Partly it's because, you know, I see this in discussions of scalability. Like, startups will come and contact me and say, "Hey, can you give us advice on how to make our system scalable?" And I ask them, "Well, how much data and how many users do you have right now?" And they say, "Oh, we haven't launched yet. We just want to be ready for the future." My number one advice to them then is: actually, don't worry about scalability, because what you want to optimize for at the early stage is making things easy to change. So just implement things in the simplest way that will possibly work but allows you to modify it easily and adapt to what you learn about the market, about your users, and so on. And that often means choosing general-purpose technologies.

For example, you know, just using a relational database will get you pretty far, I think. And it's pretty good for evolution, you know. You can add more tables to it, you can add indexes, you can drop indexes, and you can evolve the schema as needed. Or by all means, you know, use a document database if that's more like your preference, but those things are more or less converging anyway is my feeling. And those things allow, you know, maybe not for scale to billions of users, but they will allow you to evolve things easily. And then later on, once you learn what features you need for your system to be popular, only then will you know which features are putting a lot of load on your data systems, and only then will you be able to engineer them in such a way that they scale well for that particular workload. But you can't do that scalability work before you know what the workload is going to be, and you don't know what the workload is until you have a lot of people using it and large amounts of data.

I feel that there's, like, almost this trade-off between ease of change and scalability where highly scalable systems are often very difficult to change because you've baked in assumptions about what operations are gonna be common and which are going to be uncommon, for example. They get baked in at a very fundamental level in the architecture, whereas a system that is easy to change and general purpose is often more difficult to scale. So I think that's quite an important trade-off for people to consider.
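The schema evolution described above (adding tables, adding and dropping indexes, changing a table as requirements change) can be sketched with Python's built-in `sqlite3` module. This is only a minimal illustration, assuming SQLite as a stand-in for whichever relational database a startup picks; the table, column, and index names are hypothetical and do not come from the interview.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")

# Start with the simplest schema that could work (hypothetical names).
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")

# Later, a new feature needs a new table.
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "user_id INTEGER NOT NULL REFERENCES users(id), "
    "total_cents INTEGER NOT NULL)"
)

# Add an index when a query pattern appears; drop it again if it goes unused.
conn.execute("CREATE INDEX idx_orders_user ON orders (user_id)")
conn.execute("DROP INDEX idx_orders_user")

# Evolve an existing table; rows that already exist pick up the default.
conn.execute("ALTER TABLE users ADD COLUMN plan TEXT NOT NULL DEFAULT 'free'")

rows = conn.execute("SELECT id, email, plan FROM users").fetchall()
```

Each change is a single statement against live data, which is the "easy to change" property being argued for; none of it says anything about how the same schema would behave at very large scale.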

Jesse Anderson: Since you and I are both in the streaming world, I think one of the biggest changes that we've seen is with Kafka, with Pub/Sub, distributed Pub/Sub systems, where I think it enables an ability to change and add systems significantly better than other ways, where if you want to add that, like we were just talking about, that particular database for a specific use case, if that data's already in your Pub/Sub system, adding that efficiently can be either trivial or very straightforward. And so I would say the biggest thing, I think, for people who are further on in that journey is that using those Pub/Sub systems will enable you to change more easily.

Martin Kleppmann: In particular, you can then compose and extend the system. So if you currently have a single consumer for a stream and some other application comes along and they also want to maybe take another view onto that data, that's fine, just add another consumer. It's cheap in a Pub/Sub system like Kafka. And so I feel that having that common infrastructure of the streaming system allows you then to do more experiments with having multiple consumers potentially if you want to migrate from one consumer to another. For example, you could run both consumers side by side for an amount of time, check the consistency across the two systems, and then eventually decide to switch over from the old one to the new one. Streaming systems allow that so much better than, for example, systems based on, like, doing calls to individual services. So I feel like, as you said, the streaming can help with making change easier there.
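The side-by-side migration described above can be sketched with a toy in-memory log. This is not Kafka's API; `Log`, `Consumer`, `old_apply`, and `new_apply` are hypothetical names used only to show the idea that each consumer keeps its own read position, so a second consumer can be added and compared without disturbing the first.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Log:
    """Append-only list of events, standing in for one topic partition."""
    events: list = field(default_factory=list)

    def append(self, event):
        self.events.append(event)

    def read_from(self, offset):
        return self.events[offset:]


@dataclass
class Consumer:
    """Keeps its own offset and folds events into its own view."""
    log: Log
    apply: Callable
    offset: int = 0
    view: dict = field(default_factory=dict)

    def poll(self):
        # Process everything appended since this consumer last polled.
        for event in self.log.read_from(self.offset):
            self.apply(self.view, event)
            self.offset += 1


def old_apply(view, event):
    user, amount = event
    view[user] = view.get(user, 0) + amount


def new_apply(view, event):
    # Rewritten implementation that is meant to produce the same totals.
    user, amount = event
    view.setdefault(user, 0)
    view[user] += amount


log = Log()
for event in [("alice", 5), ("bob", 3), ("alice", 2)]:
    log.append(event)

old = Consumer(log, old_apply)
old.poll()

# The new consumer is added later and replays the log from offset 0.
new = Consumer(log, new_apply)
new.poll()

# Both keep running side by side as new events arrive.
log.append(("bob", 4))
old.poll()
new.poll()

# Check consistency across the two before switching over.
views_match = old.view == new.view
```

Because reading does not remove events from the log, adding `new` costs the producer and the existing consumer nothing, which is what makes this kind of experiment cheap.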

Embracing Change and Timeless Principles in Startups

Jesse Anderson: I would agree. So you mentioned something interesting, and that's a question I get a lot, and that is startups. Startups are usually under-resourced on several different levels: people, time, and technology. What advice would you give a startup right now to deal with that? You mentioned relational databases. What other suggestions would you give them?

Martin Kleppmann: I think something like streaming can be useful to put in at an early stage if you need some sort of, like, real-time updates, because, yes, in principle, you can build evented systems on top of a relational database, but you're kind of going against the grain because, like, there's usually not a good way of subscribing to new changes, for example. And so a system like Kafka, even though it is engineered for sort of large scale, and if you're an early-stage startup you won't have that scale or that need, feels to me like such a fundamental building block that you can build into your system. So if you have that event stream, then it can be something that enables all sorts of experiments further down the line. And so that might be one piece of infrastructure that is worth putting in at an early stage. But otherwise, I think my best recommendation would be to just keep it as simple as possible, and the fewer lines of code, the better, essentially.

Jesse Anderson: I think that's very similar to the opinions or suggestions I give. Remember that these distributed systems are pretty complex. So if you take a complex over-engineered system that you've already created, not knowing what the future's going to be, you're gonna create an even more complex, over-engineered system that nobody can deal with and nobody can use.

Martin Kleppmann: It might not have solved the needs of what you discover two years down the line.

Jesse Anderson: Exactly. So try to keep things as simple as possible. Complexity is your enemy on so many different levels. Don't go too far. Keep your optionality with you.

Martin Kleppmann: Yeah, I agree.

Local-First Collaboration Software

Jesse Anderson: Otherwise, you'll get into these land wars in Russia that just don't work well. So you're at the Technical University of Munich right now doing research. Could you tell us about the research that you're doing?

Martin Kleppmann: So for the last couple of years already, and now continuing, I've been looking at collaboration software. So in the style of Google Docs, for example, where several people can edit a document at the same time, I've been trying to change that style of software so that it's less cloud-centric and gives more control to the end users. So do more of the data storage on the user's device, so that if the cloud service locks you out, for example, you don't lose all of your data. You still have your copy of the data. Part of what this enables is also end-to-end encryption, for example. So we can build software that allows multiple users to collaborate without giving the servers access to all of our data so that the servers only see encrypted data and only the end users on their device have access to the actual document contents. And so that's, like, an interesting set of, like, distributed systems problems, and a bit of cryptography and security protocols, a bit of database, and so on, all mashed together.

Jesse Anderson: And so you've written papers on that, or you've written code or what's out there that somebody could use?

Martin Kleppmann: Yeah, papers, code, proofs, all sorts of things. So we've been coming at it from several different angles. Partly, it's like a theory angle of trying to figure out what are the right algorithms and how can we prove that the algorithms work. The proofs are really important because some of those algorithms are very subtle and very easy to screw up. And so without a formal proof of correctness, I would simply not trust that they're correct. So that's kind of the theory angle, but there's also a practical angle. So I've been working with some industrial collaborators to develop open-source implementations of some of the work we've done on the theory side. And so, for example, we have this library called Automerge, which is now a professionally maintained open-source library for building collaboration software in this style. We call the style local first, where the data is stored primarily on your device, but cloud services act as a kind of data synchronization and backup mechanism. And so that's been like, you know, a fun mixture of bringing together the theoretical and practical perspectives there.
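The local-first idea above (each device holds its own copy, edits happen locally, and synchronization merges the copies) can be illustrated with a deliberately tiny toy. This is not Automerge's API and not its algorithms, which handle much richer documents; the `Replica` class and its add-only set are assumptions made purely for illustration.

```python
class Replica:
    """One device's local copy of a shared, add-only set of items."""

    def __init__(self, name):
        self.name = name
        self.items = set()

    def add(self, item):
        # Edits apply to the local copy first; no server round trip is needed.
        self.items.add(item)

    def merge(self, other):
        # Set union is commutative, associative, and idempotent, so replicas
        # end up equal regardless of the order or number of syncs.
        self.items |= other.items


laptop = Replica("laptop")
phone = Replica("phone")

laptop.add("buy milk")
phone.add("call Alice")  # edit made while the phone is offline
phone.add("buy milk")    # same item added independently on both devices

# Synchronize in both directions once the devices can talk again.
laptop.merge(phone)
phone.merge(laptop)

converged = laptop.items == phone.items
```

Even in this trivial case the merge function has to be order-independent for the replicas to agree, which hints at why the interview stresses proofs once deletions, ordering, and text editing enter the picture.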

Recommended talk: Creating Local-First Collaboration Software with Automerge • Martin Kleppmann • GOTO 2023

Reflections on Academia

Jesse Anderson: And was there a reason for that change? It doesn't sound like it's quite a pivot. It's more like, here's the next thing for Martin Kleppmann. What brought that about?

Martin Kleppmann: What, moving into academia?

Jesse Anderson: Well, yes, let's talk about academia. What brought about your move into academia?

Martin Kleppmann: So I had spent almost a decade in startups previously, and it was really interesting and I learned a huge amount, but I was kind of getting tired of constantly rushing to ship something and never having time to think something through in detail. I felt like we were being very superficial with everything and never really getting down to the bottom of things. And I wanted the freedom to think harder and try to understand how and why things worked. And that sort of freedom you get in academia. It's a trade-off. You know, in a startup, you can have an immediate impact. You can ship something and the users will immediately see it, whereas, in academia, I can write a paper, maybe five years later it'll be in an open-source implementation, and 10 years later it'll be in production software. So the timelines are much longer, but on the other hand, I enjoy that freedom to go deep and make sure we understand things because, you know, so much in distributed systems is tricky to get right and I feel it's an area that rewards deep and careful thinking.

Jesse Anderson: Yes. Deep and careful thinking. Well, you see that sometimes in a design. I don't know if you've ever seen this. You'll look over the design of a technology and you'll say, "Well, there's a problem there, and there's a problem there, and a problem there." And probably if you talk to the designer of that, they'd say, "Well, it works well enough for the users." What do you think about that?

Martin Kleppmann: Well, often you see with, like, distributed systems they work just fine as long as you don't have any problems. And then a weird problem comes along, like whatever, there's a network partition that disconnects two parts of your data center from each other, or, like, disconnects one rack from the rest of the data center for several minutes, and, like, there's the network traffic that is trying to get through while that network partition is happening, but it's not getting through. And then it gets repaired, and then the network traffic gets through, and then a bunch of servers start killing each other or start trampling on each other's feet, or you have a split-brain system where different parts of the system both think that they have the authority over some data, but, you know, you got into a contradictory situation that should never have happened, and then everything falls apart and it's catastrophic.

And so what a lot of distributed systems research is trying to do is find these ways of thinking about it such that even when those weird edge cases happen that, you know, might only happen once in several years, but when they do happen, we want to make sure that we can handle them correctly and we have a good way for making sure that the system is always working. But, you know, it can be a difficult sell because if you have a system that's working just fine right now when nothing is going wrong, people don't necessarily want to think ahead about weird obscure edge cases that might happen. But I feel it's valuable, for example, not so much for the app developers. I think for the app developers, it's good to just rely on some infrastructure and take it as a given. But if you're the developer implementing a new database and you're implementing, say, the replication mechanism inside that database that keeps multiple copies of the data up to date with each other, those are the kinds of people who need to be thinking very carefully about the obscure edge cases of what can go wrong. And so hopefully then, for example, the database can handle those things so that the app developers don't have to worry about it. The database can provide a clean abstraction that hides the weird edge cases.

Jesse Anderson: You brought up something that I haven't seen for a while. Every so often, I would see somebody decide that they wanted to write their own distributed system, and they would write that distributed system and it would work well for that first 80% of use cases, and then the 20% where the main problems are, it didn't do very well. So, fortunately, we have enough open source out there where people have implemented this and have thought through this.

Martin Kleppmann: Do you have any particular examples of what went wrong in that kind of thing, if you can share? I always love hearing, like, failure stories. There's so much to learn from them.

Jesse Anderson: One of those is that people's PhD thesis is usually not what you wanna put in production.

Martin Kleppmann: Yes, that's probably fine.

Jesse Anderson: So let's just be blunt about that. I've only seen a few PhD theses that stood the test of time. Apache Flink, for example. I didn't realize until a few years ago that that was Kostas' PhD thesis. And it stood the test of time. When I first met him, I asked him, "What did you do differently or what did they do differently that allowed a PhD thesis code to get into production?" But more often than not, some company would hire them, they did their PhD thesis on it, and they would say, "Oh, code ready. Let's put it in production." And that isn't production-worthy code. It either was ready for far fewer uses or didn't think about this other thing, and as a direct result, they were tied to this legacy code that they thought they could never pull out and replace with something else, and nobody else could maintain it because it was that person's code. And worse yet, that code wasn't written well. It wasn't ready for professional software engineering as it were.

Martin Kleppmann: Yes. But I think I wouldn't blame that on the PhD thesis. I think you could say the same for a random library that you find on GitHub. You probably also shouldn't put that into production just like that without thinking about it as well. It's maybe the difference between, like, one person's work that is exploratory and trying out some new thing versus something that has been tried and is already in production use and already has experience that has filtered into its design.

Jesse Anderson: Yes. And then thinking about that more, one of my friends said something interesting and said it well. They said that when you have a technology designed for one thing, it was designed for, let's say, just Pub/Sub, as you add new use cases to that and new features, you deviate further and further away from that paper. So what he would do is he would go back and read the original paper and then see how they were changing it to see just how much they were deviating, and that would give you a good metric for just how much they're changing the code, or how low the odds are that they changed it well rather than this being a workaround or a patch. Have you thought about or seen that at all?

Martin Kleppmann: What, in, like, the transition from research into industry, you mean, or...?

Jesse Anderson: Well, as people start to change, I'll pick on Kafka since we know Kafka. For example, they're going to add the queuing mechanism. I didn't read the KIP in great detail, but I was curious about whether they are gonna try to implement the queuing on the client side or the server side. Because, in my opinion, that should be a mostly server-side change.

Martin Kleppmann: Right. Yes. I haven't read that in detail either, so I don't know about that specific proposal. But, in general, I feel like the people working on Kafka have a really good understanding of distributed systems, and that is reflected in the architecture of Kafka on a very foundational level. And so I think that's actually been a good example of taking some foundational principles which are not in themselves hard, like, you know, the idea of an append-only log, and using the page cache well in the way that it does, and essentially running a consensus protocol to decide on the messages that appear in the topic partition, but not quite calling it that and relying on something like ZooKeeper to provide external coordination. And, you know, all of those design decisions are very sound and well-made.

And so I think that's something where research can help by giving us those sorts of fundamental design principles that can then be used to engineer systems that work well. But somebody had to come up with those designs in the first place, so, you know, there's still a lot of research behind it, but that might not have produced a particular implementation that is production-ready, but it might have informed the thinking of the engineers who then made the production-ready implementation. So I feel like the research trades much more in ideas than implementations.

Recommended talk: Why Most Data Projects Fail & How to Avoid It • Jesse Anderson • GOTO 2023

Advice for Aspiring Data Engineers

Jesse Anderson: Okay. So one last question for you. What advice would you give to an aspiring data engineer now?

Martin Kleppmann: That's a difficult one. I mean, I'm fairly disconnected from day-to-day data engineering these days, right? I'm an academic now. So I've mostly been looking at the systems and their internals to understand which is good to use. I guess my recommendation would be to learn just enough about the internals of the systems that you're using so that you have a reasonably accurate mental model of what's going on there. You don't need to be able to, like, modify the code of Kafka yourself, but I think having just enough of an idea of the internals that if, for example, the performance goes bad, you have a way of visualizing in your head what's going on, or if something goes wrong and you get a weird error message, or if you're trying to figure out whether you can build a certain consistency mechanism on top of it, all of those things. It's incredibly valuable to just have a bit of a mental model and not just treat it as a black box. So I think my suggestion would be to learn just a bit about what's going on internally.

Jesse Anderson: Okay. I think mine would be that there are many paths. I've finally come to that. I wish I would've known that earlier on: it's not just heads-down coding. That's one path. There's also a path of going into management. Those are usually the two most known, but you also showed that third path of academics, or the path that I'm taking of going down the consulting route of business, or you also have the people doing influencing or doing the DevRel, developer relations. There are a lot of paths out there. The point is to look at the various paths, look at what you want to do and what your skills are, and see if one of those applies to you.

Martin Kleppmann: That's great advice.

Jesse Anderson: Well, Martin Kleppmann, it's been great. Thank you so much. Thank you so much for watching this interview. This has been GOTO Unscripted. Thank you.