Building Modern Databases with the FDAP Stack
"Building a database from scratch used to take 5 years and 20 engineers". In this conversation, Olimpiu Pop and InfluxData's Andrew Lamb explore how the FDAP Stack is changing database development forever.
About the experts
Olimpiu Pop (interviewer)
CTO
Andrew Lamb (expert)
PMC member, Apache DataFusion and Apache Arrow
The Evolution of Database Development
Olimpiu Pop: Hello, everybody. I am Olimpiu Pop, technologist, and probably I can call myself a tech journalist as well. I have in front of me Andrew Lamb, another technologist who knows at least two very complicated things, from my point of view: databases, or rather data systems, to be more correct, and the C language. So Andrew, I think it's better for you to introduce yourself, please.
Andrew Lamb: Yeah, thank you. I'm very excited to be here. I'm a staff engineer at InfluxData. In that context, I work a lot on databases, specifically the analytic database stack, and I'm sure we'll talk about a lot of that later. My background is basically as a low-level systems programmer, so that means databases and compilers, that kind of stuff.
I worked for many years at a company called Vertica, and I worked at Oracle. Vertica was part of the first wave of commercialization of these shared-nothing distributed database architectures. Then I worked on various other engineering and data platform teams. Now I work on InfluxDB 3 and the core engine that drives it, including the SQL engine, which is built on top of a big pile of open source, including Apache DataFusion and Apache Arrow, where I'm on the PMC for both projects. I'm in the guts on a day-to-day basis, and I really love that stuff.
Olimpiu Pop: Thank you. Thank you for loving it, because then we don't have to love it and can just exploit it on the back end. We have had other conversations, one of them together with Phoenix, who unfortunately could not join us. What we talked about is that in other parts of the programming universe, we can take bits and pieces: we have multiple frameworks, we have libraries that we can just put together to get a final system.
But until not long ago, that wasn't the case for data systems. That's the conversation we started as a trio. Kafka played the role of one of those parts, at least for the ingestion side of data systems. Then we had different engines like RocksDB or LevelDB. But when you rebuilt InfluxDB for its third version, you made a couple of changes. One of them was using another programming language, Rust, as we discussed. And for the other, you coined the term FDAP. Can you elaborate on what FDAP is? Should we add the AI as well, or not just yet?
The FDAP Stack: Flight, DataFusion, Arrow, and Parquet
Andrew Lamb: Yeah, maybe an I for Iceberg. So the FDAP stack: I wrote the blog post, but really it was the CEO and the CTO of InfluxData who came up with it. Paul Dix is the CTO of InfluxData, and he saw this early on, perhaps partly because he wanted to go write yet another version of the database, and the CEO was asking how we could ever possibly afford that. It's a big, expensive undertaking, which they knew firsthand because they had just gone through it, and now they wanted to rework everything from the ground up.
Paul's vision at that point, and the pitch, was that we don't have to rebuild everything from the ground up, because by now people have built enough of these analytic systems that common patterns are emerging. He identified the technologies, specifically Flight (Apache Arrow Flight, which is part of Arrow), DataFusion, Arrow, and Parquet, as pieces you could build a whole new time series database on top of without having to reinvent everything.
Paul was definitely on the earlier side of that trend. The idea was that rather than build a database from scratch, which, by the way, I did at Vertica, and I saw it was a five-year endeavor with 20 good engineers, PhD-database kinds of people, we could lean on the fact that this pattern has now been implemented so many times. The thesis was that we understood where the boundaries were and could focus people's efforts together so we don't have to reinvent the same thing over and over again.
Olimpiu Pop: Okay. Since you mentioned that you already have the experience of building everything from scratch, let's take a step backwards. What are the conceptual pieces that usually come in, the building blocks of a typical data system? And then walk us through how data actually moves from, I don't know, ingestion through the memory model, onto the file system, and back again.
Understanding Columnar Storage and Apache Arrow
Andrew Lamb: Yeah. So I'm going to talk from the perspective of the database implementer, someone who's building a system: what are the components that you then build an actual data platform on top of? I'm going to talk about how you would implement a database. As a user, when you're actually trying to assemble this into some data platform or a higher-level application, you would often use preassembled components. So I'm going to talk about how you implement something like InfluxDB, not how you deploy an application on top of it, with Kafka or whatever.
Actually, the line there is always a little blurry, and I think it's getting blurrier as we go, in the sense that it's easier to decide exactly where you want to work. Maybe we'll talk about that later in the call. But at a high level, when you're going to build one of these high-performance analytics systems, the best practice that emerged over the last 20 years or so came out of the research world. It's a very simple concept: you store the data in columns.
This is mostly in the context of storing tabular data, but the same concepts apply to other forms. When storing relational tables, you can either store all the values in each row next to each other, if you think about a spreadsheet, or you can store the values in each column next to each other in memory. It turns out that for analytic processing, for a lot of reasons, the column store is a much better choice, specifically because it can take advantage of modern hardware through vectorized execution, which does phenomenally well when you have the same operation to apply to a lot of data items. With a row-oriented layout, you would have to keep changing what you do.
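To make the row-versus-column distinction concrete, here is a tiny, self-contained Rust sketch (the struct and field names are purely illustrative) contrasting an array-of-structs layout with a struct-of-arrays layout, where each column lives in its own contiguous buffer that a scan can sweep straight through:

```rust
// Row-oriented: each record's fields sit next to each other in memory,
// like one row of a spreadsheet.
struct RowRecord {
    host: String,
    cpu: f64,
}

// Column-oriented: all values of one column sit next to each other, so a
// scan like "sum all cpu values" walks one contiguous buffer of f64s,
// which modern CPUs can vectorize very effectively.
struct ColumnTable {
    host: Vec<String>,
    cpu: Vec<f64>,
}

fn main() {
    let rows = vec![
        RowRecord { host: "web-1".into(), cpu: 0.42 },
        RowRecord { host: "web-2".into(), cpu: 0.87 },
    ];

    // The same data, pivoted into columns.
    let table = ColumnTable {
        host: rows.iter().map(|r| r.host.clone()).collect(),
        cpu: rows.iter().map(|r| r.cpu).collect(),
    };

    let total: f64 = table.cpu.iter().sum();
    println!("{} hosts, total cpu {total}", table.host.len());
}
```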
So there was a shift that started probably 20 or 30 years ago, beginning in academia and then really moving columnar technology into databases. These days, if you're going to build a new database system, to your point, you typically need some way to get the data from your client, represent it in memory somehow, and in particular be able to do operations on it quickly. The other thing you need to do is query it. That's what Apache Arrow is basically designed for.
I wouldn't say Apache Arrow is really earth-shattering in terms of its technical content. Those ideas have been around, and there really aren't that many ways to represent an array of 64-bit integers in memory. But what's important is that everyone agrees on how it looks, how to pass it around, and how you represent nulls. Whether you call it a null mask or a validity mask doesn't really matter, but it matters a lot that everyone agrees on what the mask means. I think Arrow's real value is that it standardized best practice for columnar representations. So that's one piece.
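As a concrete illustration, here is a minimal sketch using the Rust arrow crate; the table schema and values are hypothetical, but the pattern, typed column arrays in a record batch plus a validity mask for nulls, is exactly the kind of representation the standard pins down:

```rust
use std::sync::Arc;

use arrow::array::{Float64Array, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The schema says exactly how each column is typed and whether it may
    // contain nulls, so every Arrow-aware system reads it the same way.
    let schema = Arc::new(Schema::new(vec![
        Field::new("host", DataType::Utf8, false),
        Field::new("requests", DataType::Int64, false),
        Field::new("cpu", DataType::Float64, true),
    ]));

    // Each column is one contiguous typed array; nulls are tracked in a
    // separate validity mask rather than stored as sentinel values.
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(StringArray::from(vec!["web-1", "web-2", "web-3"])),
            Arc::new(Int64Array::from(vec![120, 98, 310])),
            Arc::new(Float64Array::from(vec![Some(0.42), None, Some(0.87)])),
        ],
    )?;

    println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    Ok(())
}
```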
Olimpiu Pop: Yeah, let me just stop you here to translate it for a five-year-old.
Andrew Lamb: Yeah, yeah. Sorry.
Olimpiu Pop: That's why we are here. So what you're saying is that in the last 20 to 30 years, especially for analytics, we had a transition from the traditional way we thought about databases to looking at them in terms of columnar data, because that makes sense for how we extract the data, how we store it, and how modern hardware behaves. It's much more efficient. Is that it?
Andrew Lamb: Yeah.
Olimpiu Pop: So after that, you mentioned that Apache Arrow takes care of that memory model. Nothing earth-shattering, but it takes care of a couple of things. What it actually does is set a standard. Then all the things you would normally have to think about, the layout of the data types, the way we store them, how we represent nulls and other aspects, are now standard. We can build on top of that abstraction, and that becomes easy to do. With each iteration it just improves, and everybody uses the same standard. Does that do it?
Andrew Lamb: Yeah, that's a very good summary.
Olimpiu Pop: Great. Thank you Andrew. Moving on.
Andrew Lamb: And I just wanted to take a step back, a little higher, to describe why the transition to column stores happened. At a really high level, when computers and databases were first coming around in the 60s and 70s, there wasn't that much data in them. Not a lot of the world had been digitized, and we weren't creating massive amounts of digital information at that point. So the speed of getting data in and out was based on human actions, like you did something to your bank account. That's fundamentally a human scale, for both reading and writing.
So that was one thing. Then, in the 90s and 2000s, an increasing amount of both our lives and our computing infrastructure became digitized. That generated huge amounts of data that didn't just come from human-scale systems; it came from machine-scale systems. The volumes of data were much, much higher, and we needed new techniques to keep up. That's around the time when columnar technology really took off, because we needed better ways to handle the ever-increasing data. And now, of course, there are some theories about moving into AI, but that's maybe a different conversation.
So Arrow is the in-memory representation: you need a way to actually manipulate the data once it gets in there, to compute the calculations people ask for. I touched on it, but I want to highlight again that Arrow really has two things going for it. One is that it's a standard, which means you can pass data between things like data frames and databases without having to translate the format. It basically lets you pass data between systems without paying the serialization cost of translating back and forth between formats, which I think is the original reason Arrow was founded.
But another aspect that I think is really critical for system design is that you no longer have to go through the trouble of deciding exactly how you're going to represent timestamps, where the timezones come in, or how you're going to convert them. None of this is impossible to solve, but I'll tell you from experience, it takes a huge amount of time and effort to work out all the details. The fact that someone else has done that already accelerates your development cycle a lot. There's a whole bunch of infrastructural work you just don't have to do, because you can take advantage of what's already there. I think that's pretty huge.
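For instance, rather than inventing a timestamp encoding, you can reuse Arrow's. A minimal sketch with the Rust arrow crate (the field names are illustrative) declares a nanosecond-precision, UTC-zoned timestamp column that any Arrow-aware system will interpret the same way:

```rust
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};

fn main() {
    // Nanosecond-precision timestamps with an explicit UTC timezone: every
    // Arrow-aware consumer interprets this column identically, with no
    // bespoke conversion rules to design or document.
    let schema = Schema::new(vec![
        Field::new(
            "time",
            DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".into())),
            false,
        ),
        Field::new("cpu_usage", DataType::Float64, true),
    ]);

    println!("{schema:?}");
}
```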
From Memory to Storage: Understanding Parquet
Olimpiu Pop: Okay. So these would be the things we have in memory. But usually we call a system a database because there is a place where we can find the data afterwards.
Andrew Lamb: That's right.
Olimpiu Pop: And that's probably Parquet. How do you make the connection between the two of them?
Andrew Lamb: Yeah. They are different technologies, but they're both columnar. Parquet is designed for persisting data. In particular, it has a bunch of encodings and compression so that the data volume on disk, or the volume you have to transfer over the network, is significantly lower. To process things in memory, you need a representation that's convenient to process; but if you want to save data for long-term archival storage, share it with people, or store it on S3 or whatever, Parquet does that.
It's a common file format designed for storing large volumes of data efficiently on disk. I think it has a couple of benefits over the alternatives. One option is to build your own file format, which gives you complete freedom, but that's very expensive. Another classic alternative is CSV files: you just dump a bunch of CSV files.
There are a couple of problems with CSV files, and similarly with JSON files. They're not particularly well compressed, so they're very space-inefficient, both for your storage cost and for the time it takes to transfer them. Also, if you want to read a CSV file, it takes a lot of effort to extract the data out of the text form into the format you want to process, because you have to parse strings into numbers, which is a very expensive operation. And the final thing is that they don't have type information attached to them.
So when you get a CSV file, you have to figure out: is this column full of dates, or does it just kind of look like dates, or maybe they're IDs? Every system has to do that schema interpretation, that schema inference, itself. And of course the implementations are a little incompatible, so there ends up being a bunch of pain when you have to get systems to work together.
So Parquet has its own type system, and it's got very efficient encoding and compression schemes. It's also a standard governed by the Apache Software Foundation, so you have lots of vendors that support the file format. So yeah, hopefully that makes sense.
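A minimal round-trip sketch with the Rust arrow and parquet crates illustrates the point about types and compression traveling with the file; the file name, schema, and values here are illustrative, not from the conversation:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // An illustrative batch of sensor readings.
    let schema = Arc::new(Schema::new(vec![
        Field::new("sensor_id", DataType::Int64, false),
        Field::new("reading", DataType::Int64, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int64Array::from(vec![1, 1, 2])),
            Arc::new(Int64Array::from(vec![40, 41, 39])),
        ],
    )?;

    // Write: the schema (types included) and the compression settings travel
    // inside the file, so readers need no external metadata.
    let props = WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        .build();
    let mut writer = ArrowWriter::try_new(File::create("readings.parquet")?, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?;

    // Read back: typed Arrow batches come out directly, with no schema
    // inference and no parsing of strings into numbers.
    let reader =
        ParquetRecordBatchReaderBuilder::try_new(File::open("readings.parquet")?)?.build()?;
    for maybe_batch in reader {
        println!("read {} rows", maybe_batch?.num_rows());
    }
    Ok(())
}
```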
Olimpiu Pop: Yeah. Well, sorry to be tricky, but CSV and JSON are not exactly earth-shattering either, so comparing Parquet with those, I can't say I'm impressed. But maybe we can look at what else Parquet has that makes it so important in the new data landscape. What's probably very important is the fact that a lot of other projects have already adopted it, which means you can most likely also use it for interoperability across multiple data projects. Did that make a difference for you while implementing it? What are a couple of benefits for a database?
Andrew Lamb: That's right. Yeah. There are two primary benefits to Parquet. The first is that it's compatible, at least in theory: you can actually write a Parquet file and read it with other tools. That helps when you want to transfer data, when your customers want to get their data or migrate it in bulk, either into or out of your system. It's very helpful for that, especially because you don't have to rewrite the data internally; you just give them the internal Parquet files. That was the theory.
The other major benefit, which goes back to the original discussion, is that the amount of effort that's gone into these open source Parquet readers and writers is actually quite high. In a database you would typically have your own custom file format; to your point, you don't typically store data in CSV files, you have some custom proprietary format.
The upside of that is you can make the file format however you want. The downside is you're also on the hook for implementing, maintaining, and optimizing it, which again is not impossible; with an infinite amount of resources you could surely make something better than Parquet. But the fact that Parquet captures maybe 90% of the value of your own custom binary file format, plus the work lots of other people have already put in to make it fast, plus the interoperability with all these other systems, means you start significantly ahead of where you could realistically get by building everything from scratch yourself.
Olimpiu Pop: Okay, great. So this covers both the memory space, which is volatile, and persisting data on the file system. Those are two of the pieces. The other two letters in the FDAP acronym are F and D. One of them is a query engine, and it's pretty obvious what that does. The other is a network protocol. How does Flight come into this?
Network Protocols and Distributed Systems
Andrew Lamb: So networks have become faster over the last 10 years, about 20 years really, and that means almost all databases these days are either distributed or doing some sort of client-server, so you have to send data over the network. If you have to design a fast protocol that can efficiently send large amounts of data over the network, you can certainly do it, but it's not going to look all that different from Flight.
So basically Flight lets you send these Arrow arrays efficiently over the network without having to spend a lot of time preparing the messages on the way out or parsing them when you receive them. That really allows you to build distributed systems with this toolkit. InfluxDB 3 has a distributed component, and rather than re-implement yet another protocol, which would probably look about the same as Flight, we can reuse the existing components that are out there.
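As a rough sketch of what that looks like in code, assuming the Rust arrow-flight crate's encoder API, the snippet below turns Arrow record batches into the FlightData messages that a gRPC DoGet or DoPut stream carries; a real deployment would send these to, or receive them from, a Flight server rather than just collecting them:

```rust
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow_flight::encode::FlightDataEncoderBuilder;
use arrow_flight::error::FlightError;
use futures::{stream, TryStreamExt};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A tiny batch standing in for query results.
    let schema = Arc::new(Schema::new(vec![Field::new("value", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(schema, vec![Arc::new(Int64Array::from(vec![1, 2, 3]))])?;

    // Encode the batch into FlightData messages (Arrow IPC framed for gRPC).
    // A Flight server would return this stream from DoGet; a client would
    // decode it back into RecordBatches with no row-by-row translation.
    let input = stream::iter(vec![Ok::<_, FlightError>(batch)]);
    let flight_data: Vec<_> = FlightDataEncoderBuilder::new()
        .build(input)
        .try_collect()
        .await?;

    println!("encoded {} FlightData messages", flight_data.len());
    Ok(())
}
```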
And realistically, InfluxData is not just using these pieces; it's also actively contributing back. A big chunk of my job is to help organize and drive forward the open source implementations. But it's not just an altruistic donation on InfluxData's part. The idea is that by strengthening the community and getting other people to contribute as well, we're effectively able to get access to technology that we wouldn't be able to afford to develop entirely by ourselves.
Olimpiu Pop: Yeah. I think what Paul mentioned at some point is that this is one of the benefits of open source communities. Especially in the database space, a lot of academic work and research is needed. Rather than doing all of that and reinventing the wheel over and over again, you benefit from the open source community: you have all the knowledge that you need and you just put the pieces in place.
It seems like this is the next step for the data ecosystem as well, because it usually stayed a bit behind. In a lot of spaces we can be early adopters, but data needs to be persistent and you cannot lose it, so my feeling is that it's important that we got here. Tying these pieces together gives us all the building blocks we want and builds an open source community, which is a lot better, given that more eyes on the same source code helps ensure it's safe to use.
Andrew Lamb: Yeah, yeah, yeah.
Olimpiu Pop: Now I have one more question for you because, as you said, you have worked on databases for a long time, and we were discussing distributed data systems. On one hand, what I was taught during my university years, a long time ago, was all about ACID and how important it is for a relational database that the data is there and has certain other properties.
On the other hand, we have the CAP theorem for distributed systems, which says we can only pick two of the three properties. How does this all fit together to make sure you have all this data and meet those expectations? I'm thinking also in terms of time series, where you have to see trends and how they fit together.
Andrew Lamb: Yeah. I'll talk about time series in just a second. At a high level, what you see is that different people with different use cases actually have slightly different tradeoffs they want to make, which sounds obvious. But I think that's part of why you saw the move away from a few very monolithic databases like Oracle and SQL Server, which are still huge businesses and great products.
But what you've seen in the last 10 or 15 years is an explosion of other database types. There are a lot of different types, and also a whole bunch of small startups making databases. It's partly driven by all these open source components we're talking about, but I think it's also an acknowledgment that people are willing and able to spend money, time, and effort optimizing for their particular use case.
So the ACID properties that Oracle offers are actually not the same properties that people want out of time series databases. So what do you get from a special-purpose time series database? It's obviously built around time, but what does that really mean? One of the major features I've seen is that when you put data in, you don't have to define the schema upfront.
In Oracle, MySQL, or Postgres, you have to specify upfront what all the columns are and what their types are before you can put data in. With a time series database, you can typically just load the data directly and the database has to figure it out. Another major difference in a time series database is that there's a huge drop-off in the value of data over time.
A lot of the time people are using time series databases to capture metrics or information from their production systems, and it turns out the production metrics from two weeks ago are nowhere near as valuable as the production metrics from 30 seconds ago. So a huge amount of effort goes into getting the data organized and queryable as soon as it arrives, but then also quickly moving it out of the system so you stop paying lots of money for it. You might want to keep the last two hours in RAM, for example, but you just can't afford that for two years.
So there's a lot of data lifecycle management that happens, which you typically wouldn't do at the same granularity in a big general-purpose database. In Oracle you'd have monthly partitions, and maybe you drop and reclaim the storage every month or something. And then, of course, I don't know if you've ever tried to write time series queries in SQL; they can be pretty terrible. So InfluxData, and many other time series databases, also have special query languages that are much less cumbersome for expressing common time series operations, like gap filling, for example, which is not realistically in any SQL standard.
DataFusion: The Query Engine
Olimpiu Pop: Okay. Thank you, thank you for all the insights, Andrew. Last question from my side before we wrap up; I know we could go deep. What's the D in FDAP? That's the query engine. Everything so far is designed to store data, but at some point we need to get it back out, because in the end that's the benefit.
Andrew Lamb: Absolutely. The D stands for DataFusion. Apache DataFusion is another top-level Apache project. It's a query engine, as you said. What that basically means is you can give it data, it can read Parquet files, or you can feed it data from any source as Arrow record batches, and it will plan SQL queries for you, with a full optimizer and all the bells and whistles, to make a nice, fast query plan.
Then it's got a bunch of optimized execution operators, which we could talk about at length. But that's another example of how people basically know, by now, how to implement this class of vectorized execution engine for SQL. We did it, whatever it is, 15 or 20 years ago at Vertica. It was done again in a sequence of products from the Hadoop era; Impala had one, and a bunch of other systems from the Hadoop ecosystem have similar implementations. They're not the same code, but it's a similar technique.
DuckDB now has one. This was all pioneered in a system called MonetDB/X100, I believe that's the paper name, which was eventually commercialized as Vectorwise, I think, when they spun it out. But basically the point is there have been something like two generations of commercial products based on this, so building a third whole query engine from scratch seems really quite silly, because they all basically look the same.
DataFusion is basically a vectorized query engine that we've been able to get a lot of people using. It implements all of the well-known, but still complicated and effort-intensive, optimizations for running a SQL query engine, and not just SQL but any sort of relational query language, which I could wax philosophical about, but that's probably a topic for a whole other podcast.
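To give a feel for what the D provides, here is a small sketch using the Rust datafusion crate: it registers a hypothetical Parquet file as a table and lets DataFusion parse, optimize, and execute a SQL aggregation with its vectorized operators:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Register a (hypothetical) Parquet file as a queryable table.
    let ctx = SessionContext::new();
    ctx.register_parquet("readings", "readings.parquet", ParquetReadOptions::default())
        .await?;

    // DataFusion plans the SQL, runs its optimizer, and executes the plan
    // with vectorized operators over Arrow record batches.
    let df = ctx
        .sql("SELECT sensor_id, avg(reading) AS avg_reading FROM readings GROUP BY sensor_id")
        .await?;
    df.show().await?;

    Ok(())
}
```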
Olimpiu Pop: That sounds quite interesting. Do we want to get into that? Well, obviously for decades we had the SQL movement, and that was huge piles of money split between the giants. Then there was the NoSQL movement that tried to democratize everything and did a lot of things that weren't actually the greatest approach, even though companies were built on it.
But now, with this whole movement, we have bits and pieces that allow us to tie everything together to get exactly what we need. For me, probably the most important aspect is the ability to query, in SQL, these nuggets of data that are stored in columnar form. That's probably the best takeaway from this whole landscape.
But I know you have one more ace up your sleeve, and that's Iceberg, probably on the roadmap. I don't know how far along it is, but I know that Iceberg, as the name suggests, is about large data and the ability to touch it. Can you briefly share something about it?
The Future with Apache Iceberg
Andrew Lamb: Absolutely. Yeah. What's happened in the analytic database landscape over the last five or ten years is that the dominant architectural pattern is persisting data in object storage, which basically means you store Parquet files on S3. People do that with AWS S3 or the equivalents from GCP and Azure; it's called the disaggregated database architecture.
You basically outsource durability to S3. I don't know how familiar you are with S3, but it's basically an FTP server in the sky. It's relatively cheap, or at least it used to be, and it's durable. So from a database perspective, it frees you from having to worry about scalability and durability concerns, which can be very complex to implement.
Where Iceberg fits in is that as people started storing more and more of their data as Parquet files on object storage, huge piles of it, the question becomes: how do you, and other systems, read and write that? Many of the early systems had their own custom, basically second database, called a catalog, which tells you which Parquet files make up which table.
Iceberg is an attempt to standardize that, in a way that lets multiple systems both read and write data stored in this cheap, durable storage on S3. The promise of something like Iceberg, and I don't think we're quite at the realization yet, is that Iceberg can be the core system of record where all your data lives, and you can have a whole bunch of specialized data processing tools living outside of it, each able to access the same data.
You don't have to have an ETL process that moves the data back and forth between systems. You just have one centralized repository of data, and it lives in open file formats like Parquet; Iceberg defines how to find the tables reasonably efficiently. Then, if you want some custom AI tool to read it, you can do that. If you want a BI dashboard tool to read it, you can do that. If you want a distributed SQL database to read it, you can do that.
So that's the vision. I don't think we're quite there yet, but the idea is that you can basically have a database where the data files are shared across different systems.
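As a small sketch of the storage side of that picture, using the Rust object_store crate with a hypothetical bucket and key, an engine just fetches Parquet bytes from object storage; the catalog layer, such as Iceberg, is what tells each engine which files currently make up a table:

```rust
use object_store::aws::AmazonS3Builder;
use object_store::path::Path;
use object_store::ObjectStore;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical bucket and key; region and credentials are read from the
    // environment. Requires the object_store crate's "aws" feature.
    let store = AmazonS3Builder::from_env()
        .with_bucket_name("analytics-data")
        .build()?;

    // Fetch the raw bytes of one Parquet file belonging to a table. A catalog
    // such as Iceberg is what records which files like this form the table.
    let bytes = store
        .get(&Path::from("events/part-000.parquet"))
        .await?
        .bytes()
        .await?;

    println!("fetched {} bytes from object storage", bytes.len());
    Ok(())
}
```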
Olimpiu Pop: Great. So if I got it correctly, for a long time we had data pipelines taking the data through all the phases, and on the other side we had ETL. Now, with Apache Iceberg, we will probably have something much more efficient, where you store the data once and it's easy to access regardless of which system it's coming from. Did I get that right?
Andrew Lamb: I mean, that's the pitch. But as you might imagine, there are a bunch of vendors that are very interested in who controls which part of your data pipeline, so I would say there's a lot going on in that space right now.
Olimpiu Pop: So let's meet in one or two years and see where we've got to. Okay. Thank you, Andrew. Anything else you'd like to add? Something we missed?
Andrew Lamb: No, thank you for having me on. This is great. I know we had a lot to cover, but it's an exciting time to be in databases.
Olimpiu Pop: I'm sure we'll talk again. Thank you for your time.
Andrew Lamb: Thank you very much. Yeah.