
Kafka Connect: Build and Run Data Pipelines

Mickael Maison • Kate Stanley | Gotopia Bookclub Episode • November 2024


Danica Fine, together with the authors of “Kafka Connect,” Kate Stanley and Mickael Maison, unpacks Kafka Connect's game-changing power for building data pipelines, no tedious custom scripts needed! Kate and Mickael discuss how they structured the book to help everyone, from data engineers to developers, tap into Kafka Connect's strengths, including Change Data Capture (CDC), real-time data flow, and fail-safe reliability. They dig into Kafka Connect's customization options, share tips on keeping pipelines robust in production, and reveal some powerful tweaks for those looking to get advanced. Throughout, they emphasize how Kafka Connect's open-source community makes learning and using it easier than ever, inviting readers to jump in and start building smarter data pipelines.


Transcript

Intro

Danica Fine: All right. Hey, everyone. Welcome back to another episode of the "GOTO Book Club." My name is Danica Fine. I'm a developer advocate in the data streaming space. And I'm probably entirely too excited to be here today with Kate Stanley and Mickael Maison to discuss the release of their latest book, "Kafka Connect: Build and Run Data Pipelines." And throughout the conversation, I think you'll figure out why I'm so excited to be here chatting with the two of them. But before we get into the book itself and Kafka Connect, how about each of you introduce yourselves? Kate, do you want to kick it off?

Kate Stanley: Yes. Thanks, Danica. It's great to be here. I'm Kate Stanley. I'm a principal software engineer working in the Kafka team at Red Hat. And I've actually been working on Kafka-related products since 2018. And alongside my day job, I also love sharing my experience. So I've done lots of presenting at conferences about Kafka Connect. And then, of course, now branched out slightly into writing a book.

Danica Fine: And Mickael?

Mickael Maison: Right. I'm also a software engineer, also in the Kafka team at Red Hat. I've been working with Kafka for a very long time; I think I started around 2015. I've also been a committer on the project since 2019. And for the past year and a half, I've been the chair of the project management committee for Apache Kafka.

Danica Fine: Nice. And I think based on what we just heard about both of your experiences in Apache Kafka and your involvement in the Kafka community and being out at conferences speaking, I think the two of you are clearly qualified to be writing a book about Kafka Connect. But I'm curious to hear more details about the sort of this journey that brought you to writing this book. You both are very active in Kafka, but what was the motivation for getting into Kafka Connect and writing this resource?

Mickael Maison: Right. So it goes back a pretty long way. At the time, I was in a team that was running Kafka in production. We were running Kafka Connect as well, especially running MirrorMaker at scale. And I started basically contributing back to Connect. So anytime we found an issue or a bug, or an improvement, we'd do it. At the time I was becoming a committer, there were not very many other committers working on Connect; they were mostly working on the brokers and clients. So basically I became one of the regular maintainers on Connect. That's basically where I grew my experience: from using it in production and also from helping merge contributions from other members of the community. And at the time I was also working with Kate Stanley. We were not in the same team, but we were interacting regularly, and I knew she also had experience working with Connect. So Kate, do you want to share yours?

Kate Stanley: I'd worked with Connect as part of my team. And then, as part of that as well, working with lots of customers, helping them to run Kafka Connect. And I just realized how few resources there were around Connect. And so I started presenting at conferences and things like that. And obviously, Mickael had seen some of my sessions and approached me and said, you know, "Would you be interested in writing a book?" And I think one of the great things about Connect is it can be used for so many different use cases. But it then felt like, you know, that's quite a lot to cover. But actually, the more we talked about the content and what we might want to include, we realized that because we did work in different teams and we'd come at Connect from different angles, we actually felt like between us, we had quite good coverage of everything we wanted to include. So we said, "Actually, this feels like the perfect sort of pairing of our experiences."

Recommended talk: Making Kafka Queryable with Apache Pinot • Tim Berglund • GOTO 2023

Understanding Kafka Connect and Its Versatility

Danica Fine: That's awesome. And I love what you said about, you know, Kafka Connect being used for a lot of different use cases. I think that's one of the major selling points of it. It's just so versatile. And I think that speaks to sort of the reality of, you know, when it comes to writing data pipelines, to architecting these systems that can either, you know, move data from A to B or whatever other purpose we're using it for. Everyone who's using Kafka Connect is starting from a slightly different perspective. They have a different programming background. And even in some cases, their title is going to affect how they actually approach building a data pipeline and approach that problem. And so that must have been a challenge for both of you, you know, going into writing a resource on Kafka Connect. How did you cater to the different types of people who are looking to use your resource and learn something about this technology?

Kate Stanley: We started off looking at different topics, and we quickly realized that the best way we felt to structure the whole book was to look at the different personas that we had and kind of group the chapters in that way.

So there are three different personas that we kind of identified around Connect. You have the data engineers who are building the pipelines and kind of determining how my data flows from A to B. Then you have the site reliability engineers, or SREs, who are running the infrastructure, standing up Connect, managing it. And then developers who want to actually customize Kafka Connect as well.

And so what we know is that there isn't necessarily a one-to-one correlation between those roles and a real person. Often you're doing multiple roles, or it might be there's different teams. So we decided to have four sections in the book. So the first part is kind of more general, and then we have one section for each of those personas. But we really wanted people to be able to read the book beginning to end. So even if you are in a specific role and you're just an SRE or you're specifically a developer, you'll definitely gain something by reading it all.

But then by grouping the chapters by persona, we're hoping that this really means that people can more easily refer back to it and be able to say, "Okay, you know, this is my primary role day-to-day, so these are the chapters that I'm going to be going back to again and again." And certainly that's how I use the book. So I definitely have the book on my desk to kind of flick through and make sure that if I want to check things, it just makes it really easy to find things. So that's how we kind of chose to structure it.

Danica Fine: I definitely...I love that. I really do appreciate how this was broken up for this particular book. Because speaking as someone who is in developer relations, you know, constantly thinking about how best to approach the writing of any sort of content, whether that is a written blog post or a resource like this, or a guide, or even a conference presentation, I have personally found that anytime that people who are consuming these resources can see themselves reflected in the resource itself, it's just going to do that much better, right? And so I think by, like, clearly defining these personas, even if it doesn't, you know, really match up for that one-to-one mapping, it's still going to be way more successful, right? So I really, really appreciate that. Good job.

But I mean, looking at sort of... you mentioned all the personas, but I noticed that the first section of the book is effectively for everyone, right? So any person who's Kafka Connect curious or wants a little bit more motivation around the technology as a whole. And so I think you also do a really good job there in setting the stage and also giving a little bit of a background around Apache Kafka and then going into some really great use cases to get people thinking.

Recommended talk: Kafka in Action • Viktor Gamov & Tim Berglund • GOTO 2022

Key Use Cases for Kafka Connect

Danica Fine: Going back to that, seeing yourself in the resource, I think use cases are a great way to, you know, get people more interested. So you cover things like Change Data Capture, migrating databases, or, you know, mirroring Kafka clusters. And speaking of someone who loves Apache Kafka, I'd say these are all really great use cases, but I'm very curious for the both of you. How many times have you been in a conversation with someone who's, you know, either maybe thinking about Kafka Connect or maybe has ignored it for a little while and you're talking about these use cases, and how many times have they looked at you and said, "Oh yeah, I can just write a script to do that?"

Kate Stanley: Presenting a lot, I get a lot of people kind of coming up and sort of really wanting to know, you know, "Why should I be using Connect over other things?" And I think it's really noticeable, you know, when people are first learning about Kafka, the first thing they learn to do is write a producer and a consumer application. And so then when you're introducing that external system, it's easy to just say, "Oh, well, I'll just extend that application that I've already written." And so for that reason, Connect just isn't something that's kind of front of mind that you think of straight away, but Connect is designed to flow data to and from Kafka in a scalable, reliable, and resilient way. And I think resiliency is one of the key pieces here of why you really should be looking at Connect rather than writing your own script. Because it's really easy to write a script that works when everything works, the golden path, but things go wrong all the time.

Danica Fine: They always will.

Kate Stanley: And it's much harder to write a script that's going to account for all of that. So there's different ways that Connect does this. It's got a lot of resiliency built in, in terms of, you know, if one of the workers goes down, it will rebalance work onto a different worker. So that workload management's really useful. But even just things like you can configure Connect so that if it's processing a record and something goes wrong, it pushes it to a dead letter queue rather than the connector failing.

So there's so much in Connect that helps you on that journey. And I really think the big motivator for using Kafka Connect is this idea that rather than you writing all of these different custom applications and having to handle all these custom failures and things, every time you introduce a new external system, you still just use Connect. So you have that knowledge base that you can start with, and then you're just adding a different connector for the different external systems. So it gives you a really good way to easily extend your pipeline without having to write everything again from scratch.
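A minimal sketch of the error handling Kate describes, assuming a Connect worker listening on localhost:8083 and a hypothetical sink connector class: the errors.tolerance and dead letter queue settings are standard sink connector configuration, submitted here through the Connect REST API with Java's built-in HTTP client.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateSinkWithDlq {
    public static void main(String[] args) throws Exception {
        // Hypothetical connector name, class, and topic; the errors.* keys are
        // standard Kafka Connect sink settings for error tolerance and dead letter queues.
        String body = """
            {
              "name": "orders-sink",
              "config": {
                "connector.class": "com.example.SomeSinkConnector",
                "topics": "orders",
                "errors.tolerance": "all",
                "errors.deadletterqueue.topic.name": "orders-dlq",
                "errors.deadletterqueue.context.headers.enable": "true"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

With a configuration like this, a record that fails in the connector is routed to the dead letter queue topic instead of failing the whole pipeline, which is the resilience behaviour described above.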

Danica Fine: That's really helpful.

Mickael Maison: I think it's also a great tool when you're adopting Apache Kafka. So sometimes when you adopt it, you say, "I'm going to build something brand new, a new use case, and that's going to be my entry to Apache Kafka," but what we see in practice is that all organizations have many data systems already, and very rarely, when you build something new, does it exist in a vacuum. So you'll have to interact with your existing data, and this is where Connect really shines. It allows you to really easily connect with some of your data systems and reuse that data to build something new.

So in your journey to adopt Apache Kafka, it really speeds things up and makes things easier. A use case like Change Data Capture, for example, is really powerful because it lets you expose something you have in a database to all sorts of new use cases. And you don't put more stress on the database: once it's connected, you only have this one extra process reading from it, rather than every application going to your database until suddenly your production database goes down because it's just overloaded.

And also with the script approach, a very easy trap, a very common anti-pattern we see, is dual writes. So we say, "Oh, I've got my application that's writing to the database. We just change it slightly, add a new line of code, and then it writes to Kafka as well." Obviously, this works well on the golden path. It may work well for a week or two, and then suddenly it writes to one place and not the other, and you're in trouble. So really, to all the points Kate made, Kafka Connect can actually abstract that away from you, so you only have to focus on the pipeline you're building and forget about the details of the infrastructure.

Danica Fine: And for the Change Data Capture use case, I think we see a lot of that being supported by the Debezium project. And so I think here it's, yeah, sure, you might be able to do it yourself, but here we have a whole project dedicated to this very specific thing and doing it well, right? So those connectors are sort of battle tested, and you can trust them a lot more.

Mickael Maison: Exactly. We have a team of database experts that have built these connectors, and they've done the job already for you.
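To make the Change Data Capture example concrete, here is a rough sketch of what a Debezium MySQL source connector configuration can look like. The hostnames, credentials, and table names are placeholders, and some property names differ between Debezium versions, so treat it as illustrative rather than authoritative; it would be submitted to the Connect REST API the same way as the sink example above.

```java
import java.util.Map;

public class DebeziumMySqlConfigExample {
    public static void main(String[] args) {
        // Illustrative only: hostnames, credentials, and table names are placeholders,
        // and some property names differ between Debezium versions.
        Map<String, String> config = Map.of(
                "connector.class", "io.debezium.connector.mysql.MySqlConnector",
                "database.hostname", "mysql.example.com",
                "database.port", "3306",
                "database.user", "cdc-user",
                "database.password", "cdc-password",
                "database.server.id", "184054",
                "topic.prefix", "inventory", // "database.server.name" in older Debezium versions
                "table.include.list", "inventory.orders",
                "schema.history.internal.kafka.bootstrap.servers", "kafka:9092",
                "schema.history.internal.kafka.topic", "schema-changes.inventory"
        );
        config.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```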

Danica Fine: So why not take advantage of that? And so I think to summarize, yeah, sure, you could write a script, but Kafka Connect has so much more to offer on top of that. And I actually think, thinking back to this sort of conversation, the mentality of, "I can be scrappy and I can do it myself," I think, is a common mentality. It's always fun to test yourself and see if you can do it. But I think that's sort of the biggest hurdle in getting folks to use Kafka Connect these days, even though it's really great and it's very useful.

Mickael Maison: I mean, it's maybe fine for proof of concepts, but if you want to go at enterprise scale or if you want something very reliable, then...

Danica Fine: Absolutely.

Mickael Maison: ...it's much harder to write a script that works 90% of the time.

Danica Fine: I think that's what, you know, Connect has to offer there: handling the things that you don't necessarily anticipate. And I've seen many of those issues myself in writing those scripts and then giving up and using Kafka Connect. So I feel that. I really do.

The Challenge of Kafka Connect Terminology

Danica Fine: And so actually, maybe it's not, just to shift gears here, maybe it's not actually that scrappy mentality that's the hurdle for Kafka Connect. I'm really excited to bring this up with you, but I think one of the biggest hurdles for getting people to use Kafka Connect is actually a lot of the terminology and the naming, and the technology that's been introduced. We've got Connect, we've got Connector, we've got Connector Plugin, the list goes on and on and on. What is with that? Why is it so difficult?

Kate Stanley: Terminology is a topic that I could just go on about. Because I just remember so clearly sitting in a customer meeting and we were discussing their Kafka Connect architecture and, you know, what they deployed and what was working and what wasn't. And I just realized every time somebody used the word Connect or Connectors, they meant something slightly different. And it just made the meeting really difficult because we were kind of talking cross purposes.

And I think it goes back to those different personas that Connect caters to. So you have your word Connect: that could be the Connect runtime, so that's for someone who's actually deployed Connect, or it could be the framework that you're using to implement your connector. And similarly with the word Connector: a Connector could be, you know, the code sitting in GitHub or a jar file that you've installed in Kafka Connect, or it could be the actual running process. And then even the term Connector Plugin is overloaded, because you could mean a Connector, or you could mean a Converter or a Transformation, which are all still types of Connector Plugin.

So I think it's really important to be clear on the terminology, and we have tried really hard to check in the book that we're using the right terms in the right places. But also, I think when people are talking about Connect, if you can get in your head, "What persona am I looking at? What context am I thinking about?" then that I think helps with the terminology because then it's easier to determine, "Okay, you know, we're talking about deploying this thing. We're probably talking about the Connect runtime here." And so that does make it easier. And yeah, it is something where just now I start all of my presentations with, "This is the terminology we're going to use," just to kind of level set, but it is a tricky one with Connect.

Danica Fine: I guess it's too late to change and clarify a little bit better, but I think that is good advice to, you know, first of all, think about the persona who you're actually talking to. Know your audience. That's also something I'm constantly thinking about in developer relations. But I also definitely appreciate it. I sat through a couple of your presentations. They're wonderful. And I do definitely appreciate that, you know, there's a couple slides in the beginning to clarify things.

Kafka Connect for Data Engineers

Danica Fine: With that settled for posterity, thank you for that. I feel like people, with their definitions set, can actually start to build things with Kafka Connect. And conveniently, that is the focus of the second section of the book. It's mostly for those data engineers who want to set up pipelines.

And maybe you can speak a little bit more to this, Mickael, but I feel like the open source Kafka Community has done a great job in designing Kafka Connect so that this part is relatively simple.

Mickael Maison: Yeah. So a lot of work has been done for sure. So the thing is, nowadays, if you're building a data pipeline, I think the majority of users will not have to write any code. So you just write the definition of your pipeline, and you reuse existing plugins that have been built there for you. So the Apache Kafka project itself provides the runtime where you run the connectors and the APIs for building them, but doesn't provide any connectors. The only connectors provided are for MirrorMaker, for mirroring between clusters.

So as you pointed out, the community around the project has done a great job building all those connectors. Obviously, this ecosystem wouldn't work without the connectors. So there was a bit of a debate about whether Kafka should have connector implementations, but the project needs to stay focused on core Kafka and the expertise of its maintainers. And since Kafka was created, there have been literally hundreds of new data systems. If you had to integrate with every one of them, you couldn't have all that knowledge in Kafka; it's just not scalable. So instead, we've let the community build them, and most connectors are built by the people that build the data system. So for example, the people behind, let's say, Datadog or Snowflake build their own connector. And so they have some dedicated knowledge for that. And that's great. That works well.

That has worked very well. An example is Debezium, which you mentioned earlier as well. You have all that specialized expertise put into the connector, and then anybody can use it. So I mean, there are connectors for pretty much any data system. Unless you're using something custom or something very new or very esoteric, there will probably be a connector already for you to reuse. Just checking earlier today, on Confluent Hub alone there are over 200 connectors. And there's probably a whole lot more: if you look on GitHub or just around the internet, you can find probably hundreds more. So all that work done by the community has made using Kafka Connect to build a pipeline, yeah, as you said, relatively simple.

Danica Fine: That's incredible. And I always forget how impressive it is that the Apache Kafka project itself, as you said, only provides the MirrorMaker connectors. Those are the only connectors it has shipped itself. And so it is absolutely crazy. We always quote this number of over 200 connectors, and there's hundreds more, as you just said. And it's crazy to think that those are all from independent companies and systems and data experts that are using what they know and their expertise about the system to release that. And so that's pretty crazy. I already appreciate and love the Apache Kafka community, but when you think about something like this, about how folks have rallied together to really build onto the framework, it is pretty cool.

Mickael Maison: And many are open source as well.

Danica Fine: Then from there, you don't have to... If the documentation for the connector is good, then all you have to do is spin up the connector with your configuration. And so you don't have to, or the average person doesn't have to write any real connector code, which is a huge selling point for it. So once you've made that decision on how you want to run Kafka Connect, you just set up your configuration and then you go from there, right? "So there's very little that could go wrong," she said very sarcastically.

Kate Stanley: It just sounds so straightforward, doesn't it? There is some complexity that we sort of talk about in the book. And I think for me, the big piece that is overlooked with Connect is people think that the complexity is in the terminology. Actually, once you've worked out the terminology, then it's a lot easier from there. But you do have to think about the configuration. And specifically, you have to think about it in different layers, I think. There's not just your one piece.

So if you think about, say, a source pipeline, so taking data from an external system, and then through your connector and into Kafka, you have your connector that runs first and fetches the data from the external system, but then it passes it to a converter. And that is going to serialize the data to put it into Kafka. And so it's often easy to kind of overlook the configuration for your converter because there's a default, and you can just run your pipeline without really thinking about it. But I've had multiple instances where, actually, the configuration for the converter was what was missing in terms of getting that data in the right format when you get it into Kafka.

And it doesn't just end at Kafka. Normally, the pipelines then continue from there. So you might have applications that are then consuming. And whatever converter you use is going to influence the downstream consumers, and which serializers and deserializers they'll have to use. So when you then add in schemas as well, it gets even more complicated. So you do have to think about not just those individual pieces, but how do I look at this as a whole: my connector, my converter, transformations, if you're using them, and thinking at all levels about "How am I partitioning this data? How am I going to structure the format? What's changing?" to make sure that you're flowing it in a way that really works. And that can be quite hard to get your head around. And so we tried to make it a little bit easier for people by introducing some real-life examples in the book.

So two of the chapters that we definitely want people to really look at are Chapters 5 and 6. In Chapter 5, we picked three specific connectors: the Confluent S3 sink connector, the Confluent JDBC source connector, and the Debezium MySQL source connector. And then in Chapter 6, we look at the MirrorMaker connectors. And for each one, we talk about the configuration options, but we also show an example of, "Okay, let's take some data. How is it partitioned? What's the format? How do we flow it through the connector into Kafka, or from Kafka into the external system?" And I think that really helps to kind of ground people in, "I'm not just running this tiny connector, I'm building a whole pipeline. And how do I do this in a way that works?"

Danica Fine: That won't, like, bite you later on when you're trying to, you know, extend it or add something else to the pipeline. Yeah, and again, I really, you know, I love the grounding of these examples throughout the book. And I think generally, that's great advice, what you said about kind of, you know, not just focusing on the tiny bit that you yourself are implementing in this larger pipeline. You need to take a step back and think about it, you know, without driving yourself crazy, right? Because you can't always anticipate everything that's going to happen with your pipeline, or every way that it's going to be extended, or every other application that's going to tap into that data later on. But I think just thinking about it reasonably, in a way that does open it up to those sorts of extensions, in a smartly architected way. And I think that's great advice for, you know, not just Kafka Connect; for anybody who's working with real-time systems or Apache Kafka, I think that's a common thing that comes up in, you know, discussions that I have with users.

Kate Stanley: I think it's about making decisions as well and being really, like, mindful of them. So with the converters, it's easy to just use the default. But if you actively configure a converter in all of your pipelines, then you're actively being mindful of that decision you're making rather than just using the default that was configured by somebody else.
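A small sketch of what "actively configuring a converter" can look like in a connector definition, assuming a hypothetical sink connector class and topic: key.converter, value.converter, and the JsonConverter schemas.enable flag are standard Connect settings that can be set per connector or as worker defaults.

```java
import java.util.HashMap;
import java.util.Map;

public class ExplicitConverterConfig {
    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        // Placeholder connector class and topic.
        config.put("connector.class", "com.example.SomeSinkConnector");
        config.put("topics", "orders");

        // Configure converters explicitly instead of relying on the worker defaults,
        // so the wire format (and whether schemas are embedded) is a deliberate,
        // visible decision in the pipeline definition.
        config.put("key.converter", "org.apache.kafka.connect.storage.StringConverter");
        config.put("value.converter", "org.apache.kafka.connect.json.JsonConverter");
        config.put("value.converter.schemas.enable", "false");

        config.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```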

Recommended talk: Why Most Data Projects Fail & How to Avoid It • Jesse Anderson • GOTO 2023

Optimizing Kafka Connect: Transformations, Customization, and Best Practices

Danica Fine: Absolutely. Converters, those are, you know, really important. Usually the defaults probably could be okay, but again, like you said, it's good to have a more active role in that decision-making. But another thing that you mentioned in addition to the converters were transforms, and I think that deserves a little more air time as far as, you know, Kafka Connect components can go because those can be pretty powerful.

Mickael Maison: Yes. So this is a part of Connect that's not very well known: you can apply transformations to records as they flow through your pipeline. So you can basically turn Connect into an ETL pipeline: extract, transform on the fly, and then load into Kafka or into your target system.

So actually, we said earlier on that Kafka doesn't provide any connectors, but it does provide a bunch of transformations; that's the main type of plugin it ships, along with a few converters. So you can use them to do all sorts of things. You can direct which topic a record will go to. You can sanitize or change some fields on the fly. You can, as I said, modify the fields, so you can do some formatting, add some fields, remove some fields, inject some more data into records as they flow through.

And it's really flexible. Transformations can be chained, so you can build a transformation as a single, small unit and basically reuse it and compose your pipeline out of them. So it's very useful in many use cases. And you can also pair them with predicates to have them run conditionally depending on the record they receive.

Using them is really useful, but there are some limitations. Obviously, you don't want to have a chain of 50 transformations that adds loads of latency. A transformation has to be stateless; it's one record at a time. So if you want to do a lot of processing, with remote calls for every single record, that obviously doesn't work very well with Connect. But apart from that, if you want to do something fast, like just change a field or do a bit of formatting that you cannot do in your converter, that's very useful. So it's useful to know about them. They're also very easy to implement: it's basically a single method. You just take a record and spit a record back.

Obviously, if you want to do something more involved, make bigger changes, or do remote calls, you're probably looking more at building an ELT pipeline, where you basically do the processing later and use something better suited, for example Kafka Streams or Flink, where you can do more involved processing and do aggregations and computations over multiple records. Transformations are really not designed for that. So knowing those limitations, you're free to build transformations. And as I said, they're really composable, so you can have a suite of transformations you build and reuse them as needed in your pipelines.
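To give a feel for the single-method plugin Mickael describes, here is a hedged sketch of a custom single message transformation. The class and its topic-prefix behavior are invented for illustration, but Transformation is the real Connect plugin interface.

```java
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.transforms.Transformation;

/**
 * Hypothetical single message transformation that reroutes every record to a
 * topic with a configurable prefix. The behavior is made up for illustration,
 * but Transformation is the actual Connect plugin interface.
 */
public class TopicPrefixTransform<R extends ConnectRecord<R>> implements Transformation<R> {

    public static final String PREFIX_CONFIG = "topic.prefix";

    private static final ConfigDef CONFIG_DEF = new ConfigDef()
            .define(PREFIX_CONFIG, ConfigDef.Type.STRING, "copy.",
                    ConfigDef.Importance.MEDIUM, "Prefix added to the topic name");

    private String prefix;

    @Override
    public void configure(Map<String, ?> configs) {
        prefix = (String) CONFIG_DEF.parse(configs).get(PREFIX_CONFIG);
    }

    @Override
    public R apply(R record) {
        // One record in, one record out: stateless and fast, as described above.
        return record.newRecord(
                prefix + record.topic(),
                record.kafkaPartition(),
                record.keySchema(), record.key(),
                record.valueSchema(), record.value(),
                record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return CONFIG_DEF;
    }

    @Override
    public void close() {
        // No resources to release.
    }
}
```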

Danica Fine: No, they're absolutely wonderful, speaking from experience. And I think, yeah, that's something that people forget. You think about Kafka Connect, you think more or less a dumb pipeline, right? You're just moving from A to B. Great. But you really unlock so many additional use cases here when you realize you can do the single message transform and take advantage of those.

But I do want to call out something else, because I know you said some remote calls and whatnot, but I've also seen issues. You don't want to get too wild in moving that processing into Kafka Streams and doing external calls there. And I've seen people get a bit...about that as well.

Mickael Maison: Kafka Streams is not well suited for remote calls either.

Danica Fine: Yes. So big asterisk there.

Mickael Maison: Yes, big asterisk. Yeah, I'm not saying you should use Kafka Streams to do remote calls, basically.

Danica Fine: Don't do it. But yeah, so I think while it's all fun and games to set up these connectors, realize the power that you have with transforms and really think about your data pipelines, I think it's an entirely different beast when it comes to deploying and maintaining, and monitoring that pipeline that you've just implemented. And it's really my least favorite part. But thankfully, you have an entire section of the book dedicated to, as you say, "Running Kafka Connect in production." So are there any general best practices that people should abide by or any gotchas that we can avoid?

Kate Stanley: Running Connect well is certainly something that people need to know how to do. I think in terms of best practices, the one that comes to mind for me is metrics. So there are so many metrics for Kafka Connect, and for Kafka as well, which is really great, but it can be a little bit overwhelming, especially when you're first getting started with running a system like this.

Think carefully about how you're going to use them. And this is something we did in the book. Rather than just listing all the metrics, we wanted to give a bit more of our experience. So we've listed specifically if you're starting out, like, "Here are the ones that you should alert on. These are the ones you should graph so that you can kind of see trends over time. And then these are the ones that you just want to collect so you can do debugging and that sort of thing."

I think being familiar with your metrics, having a starting point like that, but then for your specific system, evolving your use of the metrics is really important. I think it's so easy to just forget that they're there until something goes wrong. Whereas if you're taking an active role...

Danica Fine: Then it's almost too late.

Kate Stanley: In looking at your metrics and knowing what your system looks like, then that's definitely the better way to go. And then I think in terms of gotchas, for me the big one is the fact that if a connector fails, it doesn't get restarted automatically.

Danica Fine: Yes. It didn't work.

Kate Stanley: So this kind of comes back into the monitoring, right? You need to monitor to know that your connector has failed, and then you need to restart it yourself using the REST API. So that's how you generally manage all of the connectors in Kafka Connect. Or you need to have some automation around it, and this is where actually in the book we talk about how you can run Connect on Kubernetes, which can help with some of that automation.

So for example, in the book, we do explicitly talk about Strimzi, which is a CNCF project that allows you to run Apache Kafka on Kubernetes. And that handily has an option to allow you to automatically restart the connector. So that does give you some help there, but if you're running Connect yourself, then yeah, you do need to think about how you're going to monitor and then restart those connectors.
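A rough illustration of the monitor-and-restart loop Kate describes, for when Strimzi isn't doing it for you: it polls the Connect REST API status endpoint and requests a restart if the connector reports a FAILED state. The worker URL and connector name are placeholders, and a real monitor would parse the JSON properly and check each task as well.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestartFailedConnector {
    public static void main(String[] args) throws Exception {
        String worker = "http://localhost:8083";   // placeholder Connect worker URL
        String connector = "orders-sink";          // placeholder connector name
        HttpClient client = HttpClient.newHttpClient();

        // GET /connectors/{name}/status returns the connector and task states.
        HttpResponse<String> status = client.send(
                HttpRequest.newBuilder(URI.create(worker + "/connectors/" + connector + "/status"))
                        .GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // Crude check: a real monitor would parse the JSON and inspect each task too.
        if (status.body().contains("\"FAILED\"")) {
            HttpResponse<String> restart = client.send(
                    HttpRequest.newBuilder(URI.create(worker + "/connectors/" + connector + "/restart"))
                            .POST(HttpRequest.BodyPublishers.noBody())
                            .build(),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println("Restart requested: " + restart.statusCode());
        } else {
            System.out.println("Connector looks healthy");
        }
    }
}
```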

Danica Fine: Yes.

Mickael Maison: I mean, a trend we see a lot as well is SRE teams adopting GitOps practices. So GitOps is basically when you treat configuration as code and use a version control system like Git to store the configuration. So basically, you get auditing, and you can be sure that what's running now is what's in the repository. And as we said before, a Kafka Connect pipeline is only configuration, so you can already use Git to deploy your connectors. But if you use something like Strimzi, the infrastructure is also a configuration-based definition, like everything in Kubernetes, so you can use GitOps for everything. You have your data pipelines in Git, but also the definition of the infrastructure, the Connect workers you use, and their configuration. So that really brings things together. It makes it a lot easier to manage for operations teams, basically.

Danica Fine: I think the addition of Strimzi as a technology has really made a lot of this process so much easier. And again, speaking from experience, having all these configurations just thrown around willy-nilly, it's nice to keep them somewhere and have the whole thing be automated. And it just takes one less thing off your plate of maintaining your connectors.

Mickael Maison: Your production environment is only touched by the CI/CD pipeline, and nobody touches it. And what's in the repo is what's running. Easy.

Danica Fine: And then going back to something that you had kind of mentioned, Kate, touching on, yeah, your connectors aren't automatically restarted, and you need to be monitoring those. But on sort of a related note, I'm thinking back to when I was a developer working with Kafka Connect. If I, say, wanted to restart my source connector, deliberately restart it from scratch so that, if it were monitoring a database table, it would reprocess all of the rows, I would have to stop the connector manually. I would have to send a tombstone to the internal topic containing the connector's offsets. And then I would have to manually restart that connector with my fingers crossed, hoping that everything was okay. But I understand that this workflow, it's not actually like that anymore, is it? We've made some improvements, right?

Mickael Maison: Let's be clear about tombstones being a Kafka thing, right? You're not selling tombstones. So, yeah, this used to be really a major pain point to manage the position of your connectors. So, since Kafka 3.6, there's a dedicated REST API for that. There's a REST endpoint where you can see where your connectors are and you can change positions, you can reset, delete. So, 3.6 was released in October, so last year, October 2023. So, yeah, that's a huge relief to any Kafka Connect operator. Now there's an API to do that. You don't need to send tombstones to specific internal topics and do something manually.

Danica Fine: No, I was always surprised when I was doing it. I'm like, "This seems like a little too low level for me to be doing just to manage my connector." So, I'm personally...

Mickael Maison: There have been basically a few improvements. So, when you define a connector now, you can define it and have it not start automatically; you define it, and it stays stopped. Then it's, "Now that I've defined you, I'm going to maybe touch your offsets and start you whenever I want." Before, you would define it, it would automatically start, and you would have to first stop it before changing the position or doing anything. So, this has improved quite dramatically over the past few years.
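A hedged sketch of the newer workflow Mickael describes, assuming a Connect cluster recent enough to expose the stop and offsets endpoints: stop the connector, reset its offsets through the REST API, then resume it. The worker URL and connector name are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ResetConnectorOffsets {
    public static void main(String[] args) throws Exception {
        String worker = "http://localhost:8083";  // placeholder Connect worker URL
        String connector = "inventory-source";    // placeholder connector name
        HttpClient client = HttpClient.newHttpClient();

        // 1. Stop the connector (it stays defined, but no tasks run).
        send(client, HttpRequest.newBuilder(URI.create(worker + "/connectors/" + connector + "/stop"))
                .PUT(HttpRequest.BodyPublishers.noBody()).build());

        // 2. Reset its offsets so it will reprocess from the beginning; no tombstones required.
        send(client, HttpRequest.newBuilder(URI.create(worker + "/connectors/" + connector + "/offsets"))
                .DELETE().build());

        // 3. Resume the connector.
        send(client, HttpRequest.newBuilder(URI.create(worker + "/connectors/" + connector + "/resume"))
                .PUT(HttpRequest.BodyPublishers.noBody()).build());
    }

    private static void send(HttpClient client, HttpRequest request) throws Exception {
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(request.method() + " " + request.uri() + " -> " + response.statusCode());
    }
}
```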

Danica Fine: This is what I love about Kafka Connect and the broader open-source Apache Kafka community. So much has changed in such a short amount of time, and for the better. And so, I think honestly, at this point, keeping up with all the improvements is a difficult part. There's just so much going on.

Kate Stanley: Definitely. I mean, even just the Connect offsets endpoint that you were talking about, for Strimzi fans, that's something that's now integrated into Strimzi. So, it's not just people using Connect, it's all of the downstream applications, the connectors, and everybody that are constantly having to update and use the new features, which is great because we want to see all the new features. But it means that as somebody not only using Connect, but also creating things for Connect, you really should be keeping an eye on what's coming, because there are just so many improvements.

And it was something we really had to keep on top of for the book. So, we had to fix on a specific version. I think in the book, we in the end settled on 3.5. So, there are things like exactly-once support for source connectors, which came from KIP-618 and was released in Kafka 3.3. So, we included that in the book. But I think the MirrorMaker source connectors only supported it from Kafka 3.5, so we kind of had to go back and add that in.

And then there were parts that didn't really make it. So, there's a new mechanism that came in Kafka 3.6 to improve the discovery and loading of plugins. And we do talk about that in the book. But as we were writing the book, KIP-898 had been agreed, but it hadn't been implemented. So, we were kind of talking about, "This is what's been agreed and is going to come," but it hadn't actually landed yet. So, we did have to keep that in mind. So, in the book, we do talk about some things that are coming down the pipeline. So, hopefully that will help people who are reading it after it's come out. But yeah, there is just so much going on in Connect, which is really great to see.

Danica Fine: For the exactly-once support for source connectors, that one especially, thinking back, selfishly, to a lot of my pain points, that was a huge one. And I was so excited to see it released. And then there was also sort of a wave of additional resources that came out after that. If you're curious to dive more into it, there are a couple of really good conference talks on how that was implemented. So that's my shameless plug: you should check them out because they're really nice.

Mickael Maison: Try including that in your script.

Danica Fine: That's the challenge for the day. Are we egging them on? Yeah. Try to do that and see how it gets better. Actually, think about it a little bit. Try to write a script and then watch Chris Egerton's talk on KIP-618 and then decide how difficult it is. It's not simple. It's crazy. And so, that's why I appreciate the community.

But if we... So, we've got to get back on topic here. So, one of the first questions I think a developer has when looking at a new framework or technology is to look at it with skepticism, but then also to say, "Well, how can I break that/customize it? How can I tailor this to my experience?" And sure, Kafka Connect is customizable in that it is configurable. That's just the default. But we can always take it a step further, right? There's so much more you can do with Connect.

Mickael Maison: It is very customizable. Literally, every stage in a pipeline is a plugin that can be customized. So, as Kate Stanley mentioned, there are connectors, there are transformations, there are predicates for transformations, there are converters. All of that is pluggable. So, you can write your own. There's an API for that. Most of the APIs are relatively straightforward. There's only a few methods, so you can implement them. And that gives you really a lot of flexibility. So, you can basically pick a connector that already exists, but maybe build your own transformation and reuse one of the predicates and recompose them together.

All stages in the pipeline are pluggable and customizable. But the runtime itself is also customizable. So, since the runtime is a Kafka component, you can plug in the same kinds of things as with a regular Kafka client. So, the authorization or the configuration providers, that's all customizable. It works the same way as in the brokers and the clients. You can also customize the REST API. So, you can write standard JAX-RS extensions and plug them into your Connect REST API. For example, if you want to have some authentication, that's something you can implement and plug into Connect.

And there's also an override policy that you can implement, basically to define what connectors can override. So, the runtime comes with default configuration that connectors can reuse or change, and you can write an override policy to decide, "Okay, connectors can change that, but not that configuration, because that's a secret, it's about authentication, and I don't want connectors to mess with what's set at the runtime level." So, yeah, as we said, it's very customizable.
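As one small example of that pluggability, here is a sketch of a custom predicate that could gate a transformation so it only runs on records from a configured topic. The class itself is hypothetical (Connect ships a similar built-in predicate), but Predicate is the real plugin interface.

```java
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.transforms.predicates.Predicate;

/**
 * Hypothetical predicate that matches records from a single configured topic.
 * Paired with a transformation in the connector configuration, it makes that
 * transformation run conditionally, as described above.
 */
public class TopicIs<R extends ConnectRecord<R>> implements Predicate<R> {

    public static final String TOPIC_CONFIG = "topic";

    private static final ConfigDef CONFIG_DEF = new ConfigDef()
            .define(TOPIC_CONFIG, ConfigDef.Type.STRING, ConfigDef.Importance.HIGH,
                    "Topic name this predicate matches");

    private String topic;

    @Override
    public void configure(Map<String, ?> configs) {
        topic = (String) CONFIG_DEF.parse(configs).get(TOPIC_CONFIG);
    }

    @Override
    public boolean test(R record) {
        return topic.equals(record.topic());
    }

    @Override
    public ConfigDef config() {
        return CONFIG_DEF;
    }

    @Override
    public void close() {
        // Nothing to clean up.
    }
}
```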

Danica Fine: You cover most of that, if not all of that, in another very conveniently written section of your book.

Mickael Maison: We have sections for every single type of plugin, and we show you the API, how the methods go together, what the workflow is for connectors, and also all the other plugins that can be plugged in.

Danica Fine: Personally, I think that was one of my favorite parts of the book, because I've ventured down that custom connector path before, many years ago. And looking back, I'm really frustrated, really disappointed that I didn't have your book to guide me through that process. As I was writing a connector, I figured, what could go wrong? It's only a couple of methods that you have to implement. But really, in the middle of it, I was up to my eyeballs, digging through the source code, trying to figure out when those different methods were called by the framework and what exactly I had to implement and do in them. And so, it was very, very trying.

Kate Stanley: That's what we tried to do in the book: we take that away from you so you don't have to do it yourself. We've done that for you. So, yeah, when you're trying to implement a connector, as you say, there aren't that many methods, but the really crucial bit is how am I integrating this external system with Kafka? When am I going to commit my records? When am I going to flush the data? You have to think about all of these pieces.

And like you say, in order to write a good connector that works really well at high load and handles failures and everything, you need to really understand where in the life cycle of your connector the different methods are being called. So, in the final section of the book, we talk about how to write your own connector plugin. But rather than just listing the methods, we explain, "This is when they'll be called," and kind of give some recommendations as much as we can, because obviously, it depends on the use case. But what are the kind of things you should consider when you're writing a connector?
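To give a sense of the surface area Kate is talking about, here is a heavily trimmed, hypothetical skeleton of the source side of that API: a SourceConnector that hands configuration to its tasks, and a SourceTask whose poll() returns records for Connect to write into Kafka. When and how often these methods get called, and what you do about offsets and flushing, is exactly the lifecycle detail the book walks through.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

/** Hypothetical, heavily trimmed source connector skeleton, for illustration only. */
public class ExampleSourceConnector extends SourceConnector {

    private Map<String, String> props;

    @Override
    public void start(Map<String, String> props) {
        this.props = props;               // called once when the connector starts
    }

    @Override
    public Class<? extends Task> taskClass() {
        return ExampleSourceTask.class;
    }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // Split the work across tasks; here every task just gets the same config.
        return Collections.nCopies(Math.max(1, maxTasks), props);
    }

    @Override
    public void stop() { }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public String version() {
        return "0.1.0";
    }

    public static class ExampleSourceTask extends SourceTask {

        @Override
        public void start(Map<String, String> props) {
            // Open connections to the external system here.
        }

        @Override
        public List<SourceRecord> poll() throws InterruptedException {
            // Called in a loop by the Connect runtime; return fetched records
            // (with their source partitions and offsets) or null when there is nothing new.
            return null;
        }

        @Override
        public void stop() {
            // Close connections here.
        }

        @Override
        public String version() {
            return "0.1.0";
        }
    }
}
```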

Expanding Kafka Connect: Resources, Insights, and Community

Danica Fine: I think just having that high level of guidance would have saved me so much time. So, I think that's great for future readers of the book. If you want to write a custom connector or plugin, please read through that section. I'm very excited for this book. I find myself flipping through it and just wishing, like I said, that I'd had access to it as a resource before. Because there's just so much great information in here. But it's not just the book, right? There's a lot of other Kafka Connect resources out there. So much more now than even a few years ago, right?

Mickael Maison: That's really a good point. So, if you go back just three years ago, five years ago, there was a lot less content online about Connect. Connect was really a niche topic even within the Apache Kafka ecosystem. People would often not even know about it; they were using Kafka and running Kafka and happy. And I'd say, well, you have this thing, Connect, it's part of Apache Kafka, and you're already familiar with the concepts because it's similar to Kafka.

So, over the past few years, there has been a lot of content and awareness around Connect. And we see use cases like mirroring with MirrorMaker and Change Data Capture really becoming very common across organizations, in literally any industry nowadays; people are using those use cases and building solutions based on them. So, it's much easier nowadays to find resources online. I've seen plenty of good talks from previous and recent Kafka Summits and other conferences, and plenty of articles online. So, clearly, it's much easier nowadays to get started with Connect than it was a few years ago.

Kate Stanley: Definitely. And it's even just noticeable, you know, when I've been speaking at conferences. So, I spoke at Kafka Summit this year and did a session about Connect and how rebalancing works and specifically how it's kind of evolved over time. And there were so many people approaching me saying that, you know, they'd already started looking at Connect, and they were now wanting to go deeper and really kind of appreciated the deeper topic. And I think that's just such a change from, you know, a few years ago where I would do a kind of Connect talk and people were coming up after me afterwards and saying, you know, "Oh, like, why can't I just run a producer and consumer?" We've kind of gone beyond that now where people know they want to use Connect and they're getting started, so they are wanting that kind of deeper knowledge.

Danica Fine: I think that's fantastic. And also, it's just reassuring. And I think those sorts of talks, just like I mentioned before, and I'm always going to keep mentioning this, is that, like, people see themselves in those talks, right? Because if you approach it as sort of a, "Here's an interesting tidbit maybe motivated by a frustration from the technology," then they're excited to use that as a new jumping off point.

I think maybe on that note, you know, people wanting to go deeper in Connect, what should we leave folks with to mull over, right? If you could sort of distill down your experience and knowledge into a couple takeaways, what would they be?

Kate Stanley: I think for me, the bigger takeaway would be just not being kind of scared or intimidated by the technology. I think, like I alluded to at the beginning, once you get your head around the terminology, then it's such a powerful tool, and there're so many customizations that you can really do whatever you want with it. So, don't listen to your first Connect talk and then think, "Oh gosh, there's just too much use of the word Connect here. I can't get my head around it." But actually, if you can push past that, yeah, it's just such an amazing tool.

Mickael Maison: As Kate said, it's a really powerful tool. So, it enables many use cases, from CDC to mirroring. These are very common scenarios nowadays. If you already use Kafka, it's a tool you already have. It's available for you. You're already aware of it. So, the barrier of entry is relatively low compared to other solutions. So, really consider it. Get started. It's the easiest way to learn and use Kafka Connect, really.

Danica Fine: I think, kind of going back to all the use cases that it enables, that really speaks to just the power of the open source community. And I know nobody was asking me for takeaways, but I think it's also great just to think about the Apache Kafka community. And so, if people want to dive deeper into this technology, I think the best way to do that is to contribute to the community, get involved, poke around, and see what pieces are of interest to you. And then...

Mickael Maison: And they're very friendly.

Danica Fine: They are. We all are very friendly. Selfishly, I think the Kafka community is probably the friendliest open source community. Someone can challenge me on that, but you'll be wrong. And then I'm disappointed. As a takeaway, neither of you mentioned that people should go out and get the book. It's got a bullfrog on it, which is very fun. I don't know why. We'll have to talk about that more. But seriously, this is a fantastic resource, and yeah, everyone should dive more into Connect. And so, I think on that note, maybe, I think that's it. 

Mickael Maison: Thank you very much, Danica.

Kate Stanley: Thank you so much, Danica.

About the speakers

Mickael Maison


Senior Principal Software Engineer at Red Hat & Co-Author of "Kafka Connect"

Kate Stanley


Principal Software Engineer at Red Hat & Co-Author of "Kafka Connect"