
Building Planetary-Scale Data Systems with Venice


Introduction

Olimpiu Pop: Hello everybody, and welcome to the Unscripted podcast. Today we have Félix with us, one of the rock stars in the data space. He has built planetary-scale data systems, and we're going to dive into his experience and see what we can learn from his knowledge. Félix, please introduce yourself.

Félix GV: Hi, Olimpiu. It's great to talk to you again. Always fun to have a conversation with you and see what you're up to. Happy to see where this one takes us.

Olimpiu Pop: Last time we spoke, a couple of months ago, you were tinkering with replacing RocksDB as the core engine of the data system you and your colleagues put together. But before we go there, maybe provide more context. You were on the team that built LinkedIn's derived data database, the place where data is stored for recommendations and so on. If we take a cut through it, what are the main pieces of the data system, and how does it look at a high level?

The Architecture of Venice: An Unbundled Database

Félix GV: Venice, the system that I've been working on for the past decade up until recently—I left LinkedIn a month ago—was a distributed database which we could say was part of the category of databases that we called unbundled. That means each piece of the database is a separate standalone component.

One big part of the database was Kafka. Internally, we were in the process of replacing it with something else, but we can say it's just a pub/sub; I'm saying Kafka because that's what people are familiar with. That was our write-ahead log component, and also our commit log or replication log between the replicas.

We had servers, each of which was internally a Java process with a local RocksDB database. A server could serve requests across the network, but it could also do certain computations inside the server itself. Then there was a whole ecosystem of clients around it, various client options for accessing the data.

You had a client that could send requests across the network to the servers, like a typical client-server architecture. But then there was also another client which had the same API as the remote query client, but instead internally it was actually embedding the RocksDB database and it could pull data directly into the application process. We call that pattern the eager cache because it would eagerly load the data ahead of time. That means your client application became as if it were another follower replica of the database. You could set it up that way and then get even better performance. Of course, there is a resource cost. Nothing is free. You would have the data in your local RAM or local SSD if that's available. So that's why it's faster, but it also costs more to set it up this way.
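To make the pattern concrete, here is a minimal Java sketch of the idea Félix describes: one read API backed by two transports. All names here are hypothetical and only illustrate the pattern; they are not the actual Venice client API.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical interface and class names, for illustration only.
// The point is the pattern described above: the same read API backed by two transports.
interface StoreClient<K, V> {
    CompletableFuture<V> get(K key);
}

// Option 1: classic client-server -- every get() is a network call to the server fleet.
class RemoteStoreClient<K, V> implements StoreClient<K, V> {
    @Override
    public CompletableFuture<V> get(K key) {
        return CompletableFuture.supplyAsync(() -> fetchOverNetwork(key));
    }
    private V fetchOverNetwork(K key) { /* remote call to a Venice server */ return null; }
}

// Option 2: the "eager cache" -- the client process ingests the store ahead of time
// (acting like another follower replica) and serves get() from local RAM/SSD.
class EmbeddedStoreClient<K, V> implements StoreClient<K, V> {
    private final java.util.Map<K, V> localReplica = new java.util.concurrent.ConcurrentHashMap<>();

    // Fed continuously by the replication log, before any reads arrive.
    void onIngestedRecord(K key, V value) { localReplica.put(key, value); }

    @Override
    public CompletableFuture<V> get(K key) {
        return CompletableFuture.completedFuture(localReplica.get(key)); // no network hop
    }
}
```

The application codes against the same interface either way; the embedded variant trades extra RAM or SSD in the client for lower latency, which matches the resource cost Félix mentions.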

There were some other clients. You could listen to the change capture stream. Internally, we had an ETL component to ship data to the grid. That part is not open source, but everything else I mentioned before is open source. That's the high-level architecture. Then of course there's a separate control plane, apart from all of that, which decides on all of the metadata, the partition placement, all that stuff. It uses ZooKeeper and Apache Helix.

Circling back to the beginning, I said it follows the unbundled database pattern. What I mean is that each of the components I mentioned is a distributed system. The write-ahead log, let's say Kafka, is a distributed system right there. The server fleet—we had maybe a thousand servers per region or something like that—was split across several clusters. Each of those is a full distributed system. The control plane itself is distributed for reliability. The clients are distributed. We don't run applications that are just a single instance. There are usually maybe an order of magnitude more clients than databases, or sometimes two orders of magnitude more. Every single piece is a distributed system.

Multi-Region Reliability and Chaos Engineering

Olimpiu Pop: Given that, while we are recording this, another part of the cloud infrastructure is going through a major outage, did you really have multi-zone resilience?

Félix GV: We did. We exercised it very frequently, like the Chaos Monkey pattern that Netflix popularized. We did what we call load tests several times a week where we would fail traffic out of data centers and concentrate the traffic into a single data center. Not all of the traffic, but quite a bit.

We were running in three data centers, and in a normal outage scenario we would fail out of one data center, which means there are two left. Each of those needs to absorb half of the total traffic, which is roughly 50% more than its normal load. That would be the normal failure-handling scenario. But during those load tests we would concentrate it even more: we would drain traffic out of one full data center plus a little bit of another, so the data center under test would get even more than 50% of the site's peak traffic.
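As a quick back-of-the-envelope check of that math, assuming (hypothetically) that traffic is split evenly across three identical data centers and one of them is failed out:

```java
// Back-of-the-envelope failover math for the scenario described above.
// Assumes traffic is spread evenly across N identical data centers.
public class FailoverHeadroom {
    public static void main(String[] args) {
        int regions = 3;                            // data centers in normal operation
        double normalShare = 1.0 / regions;         // each carries ~33% of total traffic
        double failoverShare = 1.0 / (regions - 1); // after failing one out, survivors carry 50% each
        double loadIncrease = failoverShare / normalShare - 1.0;
        System.out.printf("Normal share: %.0f%%, after failover: %.0f%%, increase: +%.0f%%%n",
                normalShare * 100, failoverShare * 100, loadIncrease * 100);
        // Prints: Normal share: 33%, after failover: 50%, increase: +50%
    }
}
```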

We would do that on weekday mornings, which is when we regularly have our peak traffic. We would pick the time of day with the highest traffic and concentrate traffic on a single data center. We would regularly test that we could sustain the current load, and even the load of the next stage of growth that we anticipated.

Inevitably, we would discover that one system or another doesn't actually sustain it. It was very rarely Venice, but sometimes it was. That's a bad place to be, when you're designated as the team that is the load test blocker. Then all eyes are on you and you really have to fix it quickly.

We exercised this because having reliability mechanisms that you do not regularly test basically means you don't have reliability mechanisms. By the time you need them, they're not going to work anymore.

Olimpiu Pop: I think the technical term for this is called playing with fire, and I'm happy to hear that someone is doing it to make sure that things actually work.

Data Flow: Writing and Reading in Venice

Olimpiu Pop: If we imagine that we'd like to store something, what would be the normal flow of writing it and then reading it, so we get a full picture of how information flows from one side to the other?

Félix GV: That's a very good question to lean into. Venice is a little bit different than many other data systems in that regard. Venice is a derived data system, which means the data it hosts has been machine generated rather than user generated. That's not a hard rule. You could put user generated data in it if you wanted. It's just not as well suited for that. Venice is not alone. There are many derived data systems. Apache Pinot is another derived data system.

The pattern that we see in general with derived data systems is that the way you write into the system is typically asynchronous. That means the data gets loaded from a pub sub like Kafka, or maybe it gets loaded from a dump of files that were generated offline, like in some sort of job, maybe a daily job or hourly job or weekly job. So it's either batch ingestion or stream ingestion, and sometimes it's both.

In Venice we supported all permutations of these various derived data ingestion modes. You could have your data come fully from a stream processor, or you could have it come from a batch processor. Or you could have the data come once a day from a batch processor, but then in between the batch pushes you would have a stream processor that's refreshing it as well. There's a bunch of different variants of that. You could say the stream processor refresh is just a subset of the columns of the table, and the rest of the columns are batch only. It would orchestrate all of these data ingestion scenarios out of the box.
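Roughly, the ingestion permutations described here could be summarized as follows. This is an illustrative sketch only; the type and field names are made up and do not reflect Venice's actual configuration API.

```java
import java.time.Duration;
import java.util.Set;

// Hypothetical sketch of the ingestion modes described above -- illustrative only.
enum IngestionMode {
    BATCH_ONLY,   // full dataset pushed from an offline job (daily, hourly, or weekly)
    STREAM_ONLY,  // all writes come from a stream processor via pub/sub
    HYBRID        // periodic batch pushes, with streaming updates in between
}

// A hybrid store might refresh only some columns from the stream between batch pushes.
record StoreIngestionConfig(
        IngestionMode mode,
        Duration batchPushInterval,          // e.g. once a day
        Set<String> streamRefreshedColumns   // subset kept fresh by the stream processor
) {}

class HybridStoreExample {
    static final StoreIngestionConfig RECOMMENDATIONS_STORE = new StoreIngestionConfig(
            IngestionMode.HYBRID,
            Duration.ofDays(1),
            Set.of("click_count_last_hour", "recent_embeddings"));
}
```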

The key thing to remember is that because the data comes in asynchronously, through a pub/sub, the system can support very high ingestion throughput. But there's something in exchange; trade-offs are what data systems are all about. The trade-off in this case is that because you're writing asynchronously to the system, you lose strong consistency, or the technical term would be linearizability. Another way we sometimes put it is read-your-own-writes.

You do a write operation to the system and you can get an acknowledgment that the write is durable. You have attained durability, but it's only the durability provided by the write-ahead log, which is Kafka at that point. The data is not yet indexed into the system; it's not yet readable. That's the design decision we made there. Venice is not unique; Apache Pinot is pretty similar in that respect. But it's still a little bit uncommon.

In a traditional database like Postgres, you also have a write-ahead log, but the write-ahead log is embedded inside of the Postgres process. The acknowledgment you get when you do a write request, an insert or an update, comes after the data was persisted to the write-ahead log for durability, but also after Postgres updated all of its in-memory indices. When you write to Postgres, you do get read-your-own-writes guarantees. You do get linearizability, because Postgres has been designed to cater to primary data use cases. With primary data, typically you do want that type of consistency guarantee. That's the nuance between derived data and primary data.
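A toy sketch of the asynchronous write path being described may help. The names and structure below are hypothetical and greatly simplified; the point is only that the acknowledgment reflects durability in the log, while readability comes later, once a background ingestion process has indexed the record.

```java
import java.util.Map;
import java.util.concurrent.*;

// Minimal, hypothetical sketch of an asynchronously ingested store.
class AsyncIngestStore {
    private final BlockingQueue<Map.Entry<String, String>> writeAheadLog = new LinkedBlockingQueue<>();
    private final ConcurrentMap<String, String> index = new ConcurrentHashMap<>();

    // The ack only means the record is durably in the log (think: Kafka),
    // NOT that it is readable yet -- so there is no read-your-own-writes guarantee.
    CompletableFuture<Void> put(String key, String value) {
        writeAheadLog.add(Map.entry(key, value));
        return CompletableFuture.completedFuture(null); // durable-in-log acknowledgment
    }

    String get(String key) {
        return index.get(key); // may still return null right after an acked put()
    }

    // A background ingestion loop drains the log into the readable index later.
    void startIngestion(ExecutorService executor) {
        executor.submit(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                Map.Entry<String, String> record = writeAheadLog.take();
                index.put(record.getKey(), record.getValue());
            }
            return null;
        });
    }
}
```

In the Postgres model described above, the equivalent of `put()` would not return until both the log write and the index update had happened, which is exactly where the read-your-own-writes guarantee comes from.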

Understanding the CAP Theorem

Olimpiu Pop: The thing that is spinning in my head is a theorem that I almost never understood. I usually have 5 to 10 minutes when I understand it and then I forget about it. That's the CAP theorem: three letters that are very important in distributed systems, mainly because you have to choose two of the three when designing a distributed system. How do the permutations look from your point of view when dealing with data? Because data is sensitive and it has gravity, and when you're talking about global scale, we spoke about having multiple data centers.

A while ago, two or three years back, I was surprised by the way data was replicated at Amazon. I had placed an order, and in one browser I could see the order, but on my phone I couldn't see it. Then in the browser I got a big banner about it. That was my hands-on example of the CAP theorem. How did you look at it practically? Because you cannot play with those things.

Félix GV: The CAP theorem says there are three properties we care for: consistency, which in what we were discussing earlier means read your own writes or linearizability. The A stands for availability, and the P stands for partition tolerance. But there's a catch in the CAP theorem. It says pick two out of three, but one of the two you pick must be partition tolerance. The only real choices you have are CP or AP: consistent and partition tolerant, or available and partition tolerant. That's what the theorem says.

In practice, partition tolerance means a network partition. That means you are cut off from the database, or maybe the distributed database itself has some of its replicas cut off from the others, or some kind of scenario where some machine is unable to contact another machine. That's what partition tolerance means: how do you deal with that?

This comes up during failure scenarios. If you do not have a network partition, also called a net split, then you could have a system that offers both the A and the C. It could be available and consistent while there are no failures. Then, when a failure occurs, you have to sacrifice one or the other.

In terms of Venice, and I would say probably derived data systems in general, there is maybe a different way to think about it. Those systems are certainly highly available, so they are AP in that sense. But the design choices that we explored earlier have made our write path asynchronous, so we've sacrificed strong consistency and linearizability from the get-go. We've already sacrificed it at the drawing-board level. It's not a matter of suffering a network partition that makes us lose consistency; it's almost as if we were continuously network partitioned.

It's almost like that in the sense that the network route through which all of your write requests must travel always has high latency. From the point of view of a traditional database, it's as if that functionality is continuously degraded. So you never have consistency; the C in the CAP theorem is lost continuously.

Interestingly, this does not only apply to derived data systems. It can also apply to primary data systems that are multi-region and architected in a way to be multi-leader. I think that's probably what you experienced with the Amazon shopping cart example. It could be that your laptop is connected to one data center, let's say US East One, and then your phone for some reason is going to another data center. Maybe it's connecting to Europe. You're going across the Atlantic with your laptop and you're connecting locally with your phone to a closer data center, assuming you're in Europe.

There is also replication going in both directions between these data centers, but it's asynchronous. It could be that you add something to your shopping cart over here and it's not yet replicated over there. In those scenarios where you have a multi-leader replicated system, you are also, in a sense, in a situation of continuous high write latency between two regions. That can be an interesting design trade-off. You get great reliability characteristics from being present in multiple regions, which improves your availability profile, but it does mean you're sacrificing consistency, and not just during outages. You're sacrificing it continuously.
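As an illustration of the kind of staleness being described, here is a toy two-region model with asynchronous replication on a delay. The region names, keys, and delay are invented for the example; real multi-leader systems replicate in both directions and resolve conflicts, which this sketch does not attempt.

```java
import java.util.concurrent.*;

// Toy model of asynchronous cross-region replication -- illustrative only.
class Region {
    final String name;
    final ConcurrentMap<String, String> data = new ConcurrentHashMap<>();
    Region(String name) { this.name = name; }
}

class MultiLeaderDemo {
    public static void main(String[] args) throws Exception {
        Region usEast = new Region("us-east");
        Region europe = new Region("eu-west");
        ScheduledExecutorService replicator = Executors.newSingleThreadScheduledExecutor();

        // Cross-region replication is asynchronous: changes are shipped on a delay.
        replicator.schedule(() -> europe.data.putAll(usEast.data), 2, TimeUnit.SECONDS);

        // The laptop writes to us-east and immediately reads back from it: visible.
        usEast.data.put("cart:olimpiu", "1x book");
        System.out.println("us-east sees: " + usEast.data.get("cart:olimpiu"));

        // The phone reads from eu-west right away: the write hasn't arrived yet.
        System.out.println("eu-west sees: " + europe.data.get("cart:olimpiu")); // null -> stale read

        // After the asynchronous replication delay, both regions converge.
        Thread.sleep(2500);
        System.out.println("eu-west later sees: " + europe.data.get("cart:olimpiu"));
        replicator.shutdown();
    }
}
```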

Olimpiu Pop: We have to choose our poison wisely and see what's actually the important part. Thank you for the explanation. Given that we're looking at about a half-hour conversation, I now understand the CAP theorem. As I told you, my span of remembering a CAP theorem explanation is at most a quarter of an hour, so thank you for that.

Félix GV: So it should last you for another five minutes after we wrap up.

Integrating DuckDB: Adding SQL Capabilities

Olimpiu Pop: The last question from me: you spoke about the two different storage engines that you and your team worked with. I know that during a hackathon you chose to give it a try and, rather than using RocksDB, use DuckDB. Now, they're a bit different: the first one is a key-value store and the other is a SQL engine, just to oversimplify things. Given that you're talking mostly about machine-generated data in huge quantities, and, as you mentioned, batch and even stream processing, why would you need SQL?

Félix GV: That's a good question. You're right that this was integrated in a hackathon project. In a sense, it was a way to test if our abstractions were flexible enough, a theoretical experiment. It did work. We were able to get the Venice data into DuckDB instances that we could then query with arbitrary SQL.

In terms of real-world scenarios, I don't think we've used it in any real production use case yet. It is available there in open source if people want to try it out, the Venice DuckDB integration. But DuckDB is an incredibly flexible system. It's a very impressive project; every time I've worked with it, I really enjoyed it. It has a bunch of different extensions, so you can do many, many things with it, including even some vector-type work, like search and things like that.

These are things that could be interesting from the Venice point of view. Venice already has a little bit of built-in vector math computation so that we can make those workloads more efficient. But it's a little bit basic. We developed this before the big wave of vector databases that have exploded onto the scene in the past five years. We've had vector math running in the Venice servers since maybe 2019, a little bit before the rest of the industry got on that bandwagon, but we also haven't really updated it much since then. So we were a little bit earlier, but also a little bit more primitive in terms of the scope of capabilities. By bringing in DuckDB, we now have all of these extensions that we could pull into the mix "for free", so to speak. Nothing's ever free, there's always a trade-off, but "for free" in quotes.

Another motivation we had was just data exploration and debugging. Sometimes you want to understand what is the shape of your data, what is the cardinality of values in a given column. Regularly we had some internal users that were asking us, "Hey, my Venice data is populated by the stream processors. I should know what's in there, but now I have a doubt. Maybe it's not quite what I expected that's going in there." Then they would ask us to turn on the ETL so that the data gets loaded out of Venice and into the grid, and then they would use offline tools like Spark or whatever to query that.

All of that workflow worked, but it was a little tedious and high-inertia. This is one more option. If the data set is small enough, or if you could get the answer you need out of just one partition of your data set, you could load that into DuckDB, run queries very fast, and get the answers you need, about what's the max value in that column or those kinds of aggregation queries that DuckDB does very fast.
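For illustration, here is roughly what that quick-exploration path could look like using DuckDB's JDBC driver from Java. The table name, columns, and values are invented; in practice the rows would come out of a Venice partition or an exported file.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch of ad-hoc exploration with an in-memory DuckDB instance.
// Requires the DuckDB JDBC driver (org.duckdb:duckdb_jdbc) on the classpath.
public class ExploreWithDuckDB {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {

            // Pretend these rows were loaded from a single partition of a derived dataset.
            stmt.execute("CREATE TABLE member_features (member_id BIGINT, click_count INTEGER)");
            stmt.execute("INSERT INTO member_features VALUES (1, 42), (2, 7), (3, 911)");

            // The kind of quick aggregation question users wanted answered.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT COUNT(*) AS row_count, MAX(click_count) AS max_clicks FROM member_features")) {
                while (rs.next()) {
                    System.out.println("rows=" + rs.getLong("row_count")
                            + ", max_clicks=" + rs.getInt("max_clicks"));
                }
            }
        }
    }
}
```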

These are the initial use cases we were thinking of. But DuckDB is so flexible that there's probably a lot more that I'm not even thinking of. We'll see.

Olimpiu Pop: That matches the reasoning in my head, that it's more about analytics and just looking over the data. Thank you for sharing, and thank you for taking the time to have another conversation.

Félix GV: My pleasure, Olimpiu. It's always fun chatting with you.