Practical Data Privacy: Enhancing Privacy and Security in Data

Alyona Galyeva • Katharine Jarmul | Gotopia Bookclub Episode • November 2023

Behind the Scenes of Practical Data Privacy

Alyona Galyeva: Welcome, everyone, to the GOTO Book Club. We are here in Amsterdam now. I'm Alyona Galyeva from PyLadies Amsterdam and ThoughtWorks, and I'm here to introduce you to Katharine Jarmul.

Katharine Jarmul: Hi. I'm Katharine Jarmul. I'm excited to be here with you. I'm also at ThoughtWorks. I work right now as a principal data scientist. And we're here, I think, to talk about my recent O'Reilly book which is called "Practical Data Privacy" and is aimed towards data folks and technical folks who wanna learn about privacy. So excited to chat with you about this today.

Alyona Galyeva: I'm excited too, probably overexcited. But, anyway, so I was curious, what exactly inspired you to write this book in the first place?

Katharine Jarmul: I think probably a lot of things. First and foremost, I've been working in the field of data privacy in data science and machine learning for multiple years. I think I learned a lot the hard way. So when I first got interested, I started asking questions like, okay, is privacy important in machine learning? How are we gonna do it? How would we even go about that? A lot of the material that was available was either high-level, really basic stuff, so, like, kind of older techniques, maybe even sometimes broken techniques that we wouldn't recommend anymore, or it was hardcore research. It was, like, you're a PhD in differential privacy, here's a paper, let's go. Or here's a paper on cryptography. Like, figure it out yourself.

I think that through that learning there were probably a lot of battles I wouldn't have had to fight. The idea of the book was really like a gift to the me of five or six years ago. These are all the shortcuts that you eventually find out the hard way. If I had to teach it to you again, here's how I would teach it. And kind of the goal is to demystify the field of privacy engineering a little bit, and just of technical privacy in general, and get more people into it. Because I think it should be, like, more accessible, easier to understand, like, if we wanna use the term democratized, but available for more people.

Alyona Galyeva: Wow. I think that's nice. It's a gift back to me and the rest of the folks. I wish we had more such beautiful books available where you can just go, "Oh, I just need to talk to this knowledge." Ding, ding, ding. Ready. The next one.

Katharine Jarmul: That's your book, right?

The Need for Demystifying Privacy Engineering

Alyona Galyeva: Fantastic. So before, let's say, deep diving a little bit more into this book, I just want to highlight the specific terms that we're gonna use because, based on my personal experience, I've noticed that a lot of terms are used interchangeably. What I'm usually hearing in a room full of enterprise architects when the data topic pops up usually starts with, let's say, "We need to protect data, of course." And then it goes, "We need to protect data. So what techniques are we gonna use? We're gonna use pseudonymization, we're gonna use anonymization, we're gonna use this and that and that." And I have a feeling that these are all different techniques with different meanings. The results, how the data looks after all these techniques, that's what's usually really underestimated because we're forgetting about the data consumers such as data scientists, data engineers, data analysts, and the rest of the folks. So could you walk me through a little bit: are there any differences between them, or can you just use them interchangeably?

Katharine Jarmul: Yes. And I think you're highlighting a key point, and I know it's been a pain point that you've probably experienced in your life, right?

Alyona Galyeva: Yes.

Katharine Jarmul: I wanna even zoom back to, like, data protection versus data privacy. Those are also different fields, right? They're overlapping fields. But data privacy is also like a social and cultural understanding of privacy and maybe also relates to individual experiences and feelings and, like, how we want to share things and how we want to change how we share things. And then we have data protection, which is really like a lot of the laws around data privacy, around data protection. And maybe that also borders on data security...

Alyona Galyeva: Compliance.

Katharine Jarmul: ...which is also different, and InfoSec, which is also different. And so we have all these, like, neighboring things, and I think if you're just an architect or a software person, you may just think, oh, those are all the same words and they mean all the same stuff. And that's not your fault. Probably lots of people have used them interchangeably with you. But let's correct some of the problems here. I think you pointed out pseudonymization versus anonymization.

Alyona Galyeva: Yes. That's usually what gets mixed up constantly.

Katharine Jarmul: Massively different things, massively different outcomes. I mean, technically, with pseudonymization all we're trying to do is create some sort of placeholder or pseudonym for personal identifiers. So maybe we shouldn't release the name or the email, so we create a pseudonym. We can use many different methods. Within pseudonymization, there are some fields like masking and tokenization. Of course, you can always use redaction. That's slightly different. You can even do format-preserving pseudonymization using methods of format-preserving encryption. You can do all of these different things, but those are all subfields of pseudonymization.
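To make two of the pseudonymization methods named above concrete, here is a minimal Python sketch of tokenization and masking. It is an illustration under stated assumptions, not the approach from the book: the function names, the `tok_` prefix, the example record, and the hard-coded key are all hypothetical, and a real system would keep the key in a key management service.

```python
import hashlib
import hmac

# Hypothetical secret key for illustration only; never hard-code keys in practice.
SECRET_KEY = b"example-key-not-for-production"


def tokenize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed pseudonym (truncated HMAC-SHA256)."""
    digest = hmac.new(key, value.lower().encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


def mask_email(email: str) -> str:
    """Keep the first character and the domain, mask the rest of the local part."""
    local, _, domain = email.partition("@")
    return local[:1] + "*" * max(len(local) - 1, 0) + "@" + domain


# Made-up record: the identifiers are replaced, the analytical field is kept.
record = {"name": "Ada Example", "email": "ada@example.com", "purchases": 3}
pseudonymized = {
    "name": tokenize(record["name"]),
    "email": mask_email(record["email"]),
    "purchases": record["purchases"],
}
print(pseudonymized)
```

The same input always maps to the same token, which keeps joins and counts working for data consumers, but it also means the data is still personal data: anyone holding the key can recompute the mapping.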

Then we have anonymization, and I'm here to break it to everybody, and I'm sorry, anonymization doesn't exist. Like, if we collect data, if we collect information, and we release information, there is mathematically no guarantee that somebody cannot learn something about the individual. The dictionary definition of anonymization is that you can't learn anything about the individual, that you can't have the potential to re-identify the individual. It's been proven using math and logic and information theory that that's just impossible.

That's okay. Because even though it's not technically possible, we have now defined methods to...let's just use the term anonymize, or to approximate the best way that we can anonymize data, using techniques like differential privacy. I'm happy to talk further about it, but there are ways that we can try to think about the information that we're giving out and think about what the implication of that information is for the privacy of individuals and to rigorously define and measure that and to tune that for, as you say, the exact use case, right? Because at the end of the day, we have some users, they might be data analysts, they might be business users, they might be the users themselves, they might be data scientists or machine learning folks, and they're the ones that have to then consume this data and make decisions with it or potentially lead the company in certain directions. We need to be very clear about what their needs are too in this. So we balance the user needs for privacy and the data needs for information.

Tuning Data Protection with Differential Privacy

Alyona Galyeva: So a short summary: pseudonymization and anonymization are not the same thing, they're different. And I would say the classical anonymization, like a 100% guarantee, doesn't exist. Just to get back to that point. I think we are gonna deep dive further into differential privacy, of course. But what I usually see in the, let's say, boardrooms is this somewhat binary approach to, let's say, how our data could be secured. It's like, either it's secured or it's not. And this is usually how all the conversations go that we have with InfoSec folks, with compliance folks, and all of that. If it's a huge enterprise, you have, of course, a data governance board, and then it goes on and on. But I think what is fascinating with differential privacy is that this is the first time you can think about, let's say, data protection as a scale. So it's not, let's say, either protected or not protected anymore. So could you explain more about that? Because I think that's the most tremendous concept that is introduced in Chapter 2 of your book.

Katharine Jarmul: I think it is even bigger than that too. Because pseudonymization is also a method, and that offers some privacy guarantees, then we go into anonymization and differential privacy, and then we can go even further to, like, federated use cases and all this stuff. You know this stuff from our conversations, but it's all in the book too. But I love this thing there. It's like privacy isn't on or off. Data protection isn't on or off. It isn't like, "I switched on the data protection. Magically, everything's protected." It probably never was. So, unfortunately, we have this kind of idea: do we implement the security? Do we implement the protection? This is a very binary mindset, but that's not at all how any of these technologies work. And we know from our work in data that it's also not how data and information work by default, right? When you're working in machine learning, as you and I have had a lot of experience in, there's a variety of truthiness of the data, and there's a variety of protection of that data depending on how you implement your algorithms or your architectures. I think the cool thing about differential privacy is that it's all tunable. So it's tunable by default. The base theory of differential privacy is the idea of tuning the amount of information that somebody can get from the result and bounding it by a small probability bound.

So I give you an answer, and let's say the answer is 10, whatever the answer is, how many purchases did users in this area make this month or something like that? It's 10. And then I add a person to the data set, and maybe the person falls in that query, and then you ask me again. And the change in those answers, so 10, let's say that now I say 11, it's like if I just answered you 11, you could make a pretty good inference that somebody got added and that person made a purchase. The goal of differential privacy is to add essentially some uncertainty in whether the answer was the answer and in doing so allow there to be some of this probabilistic thinking in how certain am I that a person got added or not. A really good example, I think is salaries. So let's say there was, like, a dashboard of all payroll, right?

Alyona Galyeva: With some buckets probably.

Katharine Jarmul: Exactly. And then a new person got hired in your team, and you're, of course, quite curious because all people are curious, and you're like, "Whoa, I wonder how much this person's getting paid?"

Alyona Galyeva: And it's a subverted question, especially in the European Union.

Katharine Jarmul: Exactly. And maybe the person doesn't wanna share or maybe you don't wanna ask or whatever, but if the dashboard just reports all of the actual numbers, there's a pretty good chance, especially with some data thinking and probabilistic thinking or statistical thinking, that you could reverse engineer the salary. What differential privacy tries to guarantee, by using differential privacy mechanisms, is, again, to add some level of noise and some level of bounding, so constraining the data so that outliers are essentially non-existent, because outliers leak a lot of privacy. Then you add this probabilistic noise so that even the people running the system can't determine how much noise was added. This uncertainty and the tuning of the noise can help you tune the amount of privacy guarantees versus the amount of information. And, of course, in an internal use case, maybe you want less noise and more information, but then you also have less privacy. And maybe if you're releasing data to a partner, to a third party, or to the public, you wanna tune up that noise. And it's okay that it's not 100% accurate, for whatever accurate means.
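The bounding-plus-noise idea in the salary example can be sketched in a few lines of Python. This is a hedged illustration rather than the mechanism from the book or the Tumult Analytics API: the function names, the salary bounds, and the made-up team data are assumptions, the number of records is treated as public, and the noise is plain Laplace noise sampled with ordinary floating point, which production-grade libraries deliberately avoid.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean_salary(salaries, lower, upper, epsilon, rng):
    """Noisy mean: clamp each salary to [lower, upper], then add Laplace noise.

    Assumes the record count is public and that neighboring datasets differ
    in one person's salary, so the clamped mean moves by at most
    (upper - lower) / n.
    """
    clamped = [min(max(s, lower), upper) for s in salaries]
    n = len(clamped)
    sensitivity = (upper - lower) / n
    true_mean = sum(clamped) / n
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random()  # illustration only, not a cryptographically secure source
team = [52_000, 55_000, 61_000, 58_000, 250_000]  # made-up salaries, one outlier
for epsilon in (0.1, 1.0, 10.0):
    noisy = dp_mean_salary(team, lower=30_000, upper=100_000, epsilon=epsilon, rng=rng)
    print(f"epsilon={epsilon:>4}: reported mean ~ {noisy:,.0f}")
```

Clamping stops the outlier from dominating the result, and a smaller `epsilon` means more noise: more privacy for the new hire, less precision for the dashboard.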

Alyona Galyeva: So could we say that in this way we try to protect data from the, say, reverse engineering?

Katharine Jarmul: From privacy violations. Essentially, what we're trying to do is de-risk the release of information for whatever we define as an adequate privacy risk. And this is where the thinking in privacy overlaps with security and InfoSec thinking. Because it's about what is the actual risk, what's the threat model here when we release this data? What are we worried about? And how do we then adequately tune the protections that we have or employ or, you know, implement the protections that are gonna adequately mitigate that risk? So we feel safe, and we feel, like, maybe our users or our citizens, if you're a country, are safe and we can release this data. And so it's pretty cool. The library that I used to implement differential privacy in the book is Tumult Analytics. It's an open-source library. They were the folks who helped the U.S. Census release differentially private census data for the first time in 2020. And they just last week... There'll be some delay in the release of this video, but a few weeks ago now at this point, they just released all of the Wikipedia data with differential privacy, which is pretty cool, I think.

Structuring the Privacy Process

Alyona Galyeva: I just want to, let's say, pause at this moment. We usually see that, especially in data projects, privacy is the last step, compliance is the last step. And also, taking into account the upcoming new AI Regulation Act, which indicates specific, let's say, high-risk industries and high-risk use cases, I'm wondering how you can structure the process in the most mutually beneficial way for software engineers, for InfoSec, and for data folks. So what are you gonna do, let's say, step by step? What is your recommendation on that?

Katharine Jarmul: We've seen a lot of this happen in governance and risk conversations.

Alyona Galyeva: Exactly.

Katharine Jarmul: It's gonna be different for every organization, which is why it has to be led by those experts. I think you named them very appropriately: it has to be the software, infra, and architecture side of the house. It has to be the data side of the house. It has to be the security side of the house. If the company's large enough, as you referenced before, it has to be compliance, audit, privacy, and legal. And those people should all be sitting on the data governance board themselves and talking with each other regularly. Let's hope. If not, you know, maybe that's the first step. And that's the first chapter of the book: if you don't have functioning data governance, you can't do any of this cool stuff. So you need functioning data governance first. Because, as you point out, there has to be a risk appetite set. So you have to define what is privacy risk for us. What is data protection risk? What is an information security risk? How do we define that? How do we define it on a data science or a data case-by-case basis? So, for certain things, do we have a larger risk appetite? Like, are we willing to take more privacy risks to do a new cool machine learning thing that we wanna test out?

But that all has to be defined at the organizational level. Once that starts to get defined, you can work with teams to prioritize experiments. That's very much the ThoughtWorks way, which is to identify a priority, identify a thin slice, and then iterate on it, as in very agile methodologies as well. And I think that works well for these new technologies because they're so tunable, and because some of them have different requirements in terms of how to deploy and scale them and so forth. So, on a case-by-case basis, you experiment with some of these new technologies, you iterate and learn as you always do. And then maybe as you iterate and learn, you can build them into platform services. Maybe you can build them into reusable functionality that your teams can leverage. That requires the software folks in the house and the platform folks in the house to think through how we optimize these systems for scaled use and for the types of users that we have. And so it can be this huge iterative process, and then you go back to the governance board and say, "Okay, here's our learnings. Like, what's the next priority?" And it can iterate that way too.

Alyona Galyeva: Okay. So let's, for example, zoom in on the use case. So you're in the room. There's a specific use case. We're not talking about mission-critical scenarios right now. We're just talking about something that we would prefer to do better. Let's put boundaries. It's for internal usage. So it's still, let's say, private information, but we are not releasing this information somewhere external to the company. So what will be the hands-on approach? You briefly mentioned the library that you use. So how are you gonna go? So you have this use case. You have this, like, software folks in the room, data folks in the room, infra privacy folks in the room. So what are you gonna do?

Katharine Jarmul: I mean, ideally there would be policies and standards that say, like, for these use cases, this is what we recommend. And depending on the org you're at, they might be super specific and say, like, use this library and with these parameters and so on.

Alyona Galyeva: So they...you can even define a parameter.

Katharine Jarmul: Exactly.

Alyona Galyeva: Wow. Okay.

Katharine Jarmul: If you wanted to, right?

Alyona Galyeva: Yes.

Katharine Jarmul: And that's usually for organizations that have been doing this for a long time. That's why they don't have to have so much experimentation. They've already experimented. They've codified it in their policies or their standards. But there are also a lot of orgs that don't wanna codify it in standards because they know technologies change over time. So you go to the guiding standards and policies. Hopefully, by that point in time, you've read and understood a lot of them. And then, to implement them, if you can leverage a platform that's already there, if you can leverage a tool that's already there, always do that. Nobody should be rolling their own version of any of these things, in my opinion, unless you have, you know, a privacy engineering team that's in charge of this. And in the case that you described, so, like, internal data use, not gonna be released publicly, you might even just decide pseudonymization is enough, or you might decide, like, aggregation plus pseudonymization for these particular categories is enough. So there can be, I think, particularly for internal use cases, a much larger risk appetite because, presumably, with your internal users, there's a high level of trust.

If you're in a huge org, and this is one of the things that they did at Google, is, for data scientists, you have to first prove that your experiment idea works with differentially private data before you can get access to...

Alyona Galyeva: To the real...wow.

Katharine Jarmul: ...the real data. And it's for the sensitive, you know, PII and other types of sensitive data. And I think that's a cool idea that can be used in all sorts of industries where it's like, look, maybe you're just in the EDA step, the exploratory data analysis step. Maybe you don't need the raw data. Maybe you just need some artifacts of the raw data to test out the idea. And then if you wanna go further, maybe there's a process to apply to get access to the raw data, and maybe there's a timeframe of experimentation. Because I think a lot of times people accidentally have access for like two years, and then they accidentally pull production or something like this. You don't wanna do that either. It helps the users and it helps the organization to have a little bit more structure around how some of this more exploratory data work happens and then to define, okay, once the experiments are run, and we know the approach, and maybe we wanna move into staging or even production, then we essentially decide to give less access to the individual data user and more of a systems-level, MLOps, I know near and dear to your heart, MLOps layer of access to the data, which is great, right? Because then the chance of somebody accidentally pulling production or accidentally exposing users is much less.

Differential Privacy Applied

Alyona Galyeva: I think that's the most difficult part because, usually, it's like, we can't do this because we don't have a sense of the production data, so we couldn't create synthetic data based on it. So I'm just wondering about, let's say, the specifics of using these libraries and tuning the specific parameters that you just mentioned. Probably, we can zoom in a little bit more. So I just want a case. Imagine we have, I don't know, an operational database, and at a specific moment we just scoop a specific amount of data from it. We have, let's say, a data processing pipeline, which has, I guess, some embedded data verification checks and also applies differential privacy. And then we have, let's say, the result. So what type of parameters can you tweak? That's what I'm really curious about. If you look at differential privacy, what are the main concepts of it, and how do these concepts, let's say, turn into parameters? Explain it, let's say, in simple words, because if you go into the whole privacy budget with epsilon and so forth, it could be too much. If I need to explain it to my software colleagues, what am I gonna do?

Katharine Jarmul: Yes. So, perfect example, right? One of the problems and also beautiful things about differential privacy is that we now have hundreds of definitions. Those definitions have different parameters, and there are some that you can use as kind of, like, hyperparameters to optimize different definitions. It can get quite complicated. Damien Desfontaines has done work on it and an article series on it. He helped me with the differential privacy chapters as a technical advisor, helped implement the Google differential privacy stuff, and is now with the Tumult Analytics folks. He and a co-researcher, whose name I now unfortunately forget, did a survey paper, and they parameterized all of the definitions. So there were like 100 different parameters.

Alyona Galyeva: I'm just trying to imagine the dimensionality of this data.

Katharine Jarmul: Exactly. Yes, yes, yes. Absolutely. So, let's go to a simpler definition. And then depending on the privacy expertise or if you start hiring privacy engineers at your org, then they can go dive into the deep end. But from a simple standpoint, we're going back to these queries that you're running. What you wanna understand is what's the probability of learning something before in the first query and then what's the probability of learning something in the second query, and you want those probabilities to be very close to each other because the closer they are together, the less information that you've essentially given away. And the less information you've given away, the less of a chance of privacy loss. So privacy loss of the individual is information gain of the person querying.

And these are modeled in a threat modeling sense: the person querying is trying to attack. And therefore they can be relaxed for different types of use cases. If you're releasing data publicly, you should probably think about a very high level of security and a low level of trust in the people using the data. But if it's internal, we can broaden that. And one of those parameters that you can tune is the closeness of those responses. Do they need to be tightly coupled? Then we have very high security but also probably a lot of noise, right? So the noise distribution that we're drawing from is much larger in a sense, and therefore we have much higher uncertainty about whether the answer that we got the first time or the answer that we got the second time has any information in it. But at some point in time, it also gets too noisy, right? And then we need to determine how to balance that.
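The "closeness" of the two query results described above is usually written with a parameter called epsilon. As a reference sketch of the standard pure definition, which the conversation does not spell out: a mechanism M is epsilon-differentially private if, for every pair of datasets D and D' that differ in one person's data and every set of possible outputs S, the following holds.

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
```

A small epsilon forces the two probabilities to be nearly equal, so the person querying learns very little about whether any one individual is in the data. A larger epsilon loosens the bound, which allows less noise but more privacy loss.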

The higher the accuracy that you want, the less chance of privacy for the individuals. So if you absolutely 100% need the right answer and it's mission-critical that you get the right answer, then differential privacy is probably not for your use case. But maybe an approximation of the right answer is enough, and this is also a theoretical debate in data: what is ground truth, and do we ever have accurate data, and all of these things? I think there is something to be said about how much we understand the data that we're collecting. Are there not already errors in the data we're collecting? And what does the insertion of differential privacy related error do? Because that is essentially what we do as part of this process: we insert error so we can ensure privacy for the individuals, and we can decide what type of error, so we can choose the distributions that we insert, so we can use Gaussian...

Alyona Galyeva: Distribution.

Katharine Jarmul: ...which we would normally expect anyway in data we collect. So as data scientists, we expect normally distributed error in most of the data we collect anyway. And so inserting some more normally distributed error is probably not the end of the world as long as you're doing robust data science. But these are all things to think about. From a software perspective, you need to think about this trade-off between the accuracy of the response, or the information in the response, and the privacy you can guarantee. And in the mechanism in and of itself, what you can tune is this tension between those two points. So either you're getting more privacy and also more error and noise and, for the attacker, less certainty that they know for sure what the data is or who's in it, or you're getting higher accuracy, closer to the actual numbers, but the chance that somebody learns that somebody got added to the data or that somebody got removed from the data, or learns certain things about the people in the data, is much higher. Therefore you're dealing with much riskier data science or data release.
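The tension between accuracy and privacy can be put into numbers with a short Python sketch. It assumes a simple counting query (adding or removing one person changes the answer by at most 1) and Laplace noise rather than the Gaussian noise mentioned in the conversation, because the Laplace case has a one-line formula; the function name `laplace_tradeoff` is hypothetical. Here epsilon is the usual privacy parameter: smaller means more privacy.

```python
import math


def laplace_tradeoff(epsilon: float, sensitivity: float = 1.0) -> dict:
    """Summarize the accuracy/privacy trade-off of the Laplace mechanism."""
    scale = sensitivity / epsilon  # noise scale b = sensitivity / epsilon
    return {
        "epsilon": epsilon,
        "noise_scale": scale,
        "expected_abs_error": scale,  # E|Laplace(0, b)| = b
        "max_probability_ratio": math.exp(epsilon),  # bound on what an attacker gains
    }


for eps in (0.1, 0.5, 1.0, 5.0):
    row = laplace_tradeoff(eps)
    print(
        f"epsilon={row['epsilon']:<4} "
        f"expected error={row['expected_abs_error']:<6.2f} "
        f"max ratio={row['max_probability_ratio']:.2f}"
    )
```

Reading the rows from top to bottom, the expected error on the count shrinks while the bound on how much the output distribution can shift when one person is added grows. Picking a row is the organizational risk-appetite decision described earlier.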

Alyona Galyeva: Let's say, in my head, you know, it's like a slider: how much privacy loss we could afford to get meaningful results at the end versus how much more we're prepared to lose to get, let's say, better results. So that's, let's say, the space where we operate.

Katharine Jarmul: Absolutely.

Alyona Galyeva: I think everything is a trade-off in the majority of cases. This is exactly the trade-off on how sensitive the data is that we need. I just want to get back a little bit. I think what triggered me was when you mentioned this Google approach, which was really interesting. So we're getting back to the story of, let's say, again, data in different environments. Usually, of course, you're not allowed to have, like, high-risk sensitive PII data available in dev or staging. You prefer to keep this data in production. But what I saw with my own eyes, like my own two eyes, is that usually there's only one environment.

It's only one environment. And let's say that there is no understanding of where the data folks could be put to do their experiments. They're not allowed to do their experiments in production, because in production, by the way, there is no infra to do this, of course, for all the hungry computations. And then, usually, it goes like this: if I cannot access the data, then I find someone who could help me access the data, right? So we go this way: "Oh, probably you can send me these 3 gigabytes as an attachment to the email," and imagine if it's PII data. Good luck with this. So, again, in this, like, chaotic situation, what would be your piece of advice? The scale of the company doesn't matter right now, but imagine they don't have anything, and they just want to start. How are you gonna approach this?

Katharine Jarmul: Absolutely. This goes back to the very basics of data governance and the fact that, like, if you have a system that cannot support an entire department or team of people except for going around the data governance rules, you haven't done it right. And we see this a lot, even at scaled companies, where data governance comes in...

Alyona Galyeva: Experis.

Katharine Jarmul: ...with very strong hard rules of, like, no access for data teams, no this, no that. Although that might be a great idea for information security, people are clever. People are social, and social and clever people find and socially engineer new ways to access data. So just saying no all the time is not gonna stop data from being used. Instead, you need to implement privacy engineering at the organization. Even if right now you call it infra engineering, you put it in the architecture, you put it in software, wherever you put it, the concepts of privacy engineering, which is making these technologies easier to use and more available for the entire organization, that stuff has to be built in to avoid these shadow IT systems and these workarounds that people find because people need to do their jobs, right?

So just saying no and blocking access to production data or something is not gonna stop data scientists from figuring out a way to do that or leaving the org, even worse, getting so frustrated that they can't get access to data that they leave the org, and then you don't have any data scientists anymore, or you don't have data scientists that want to do daily data science, right, which is also problematic. So I think you have to go back to an organizational decision. We're gonna invest in ways, and if you're a small org, you can make these small ways, right? This can be a simple interface that allows for pseudonymization, that allows for tokenization, or something like this, or an interface that allows for an aggregate dump from production to go through a differential privacy pipeline or something like this. It doesn't have to... You don't have to do everything at once, but you have to give people the tools they need to do their job and the tools they need to do their job safely and with privacy-respecting tools.

The Vector of Attack in Differential Privacy

Alyona Galyeva: I do believe as well that it should be built in from day one. If it's there, then it's easier to expand it later. But there is nothing to build on if there is not even a categorization of the data regarding the sensitivity of specific data, and no basic way for users to access the data by a specific, like, role. If you just give everyone root access, then good luck with this later for the use cases. That's, I think, the hard thing. I also wonder, if I look back at, for example, pseudonymization, the vector of attack is usually, say, the linkage between the pseudonymized data and, let's say, the original data. Or it's some dictionary or some other thing. So that's usually what attackers try to get access to. What would be the vector of attack in the case of differential privacy?
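The dictionary-style attack on pseudonymized data mentioned in this question can be illustrated with a small Python sketch. It assumes a deliberately weak scheme, an unkeyed SHA-256 hash of the email address, and a hypothetical attacker who holds a list of candidate emails; the names and data are made up.

```python
import hashlib


def naive_pseudonym(email: str) -> str:
    """Unkeyed SHA-256 of the identifier: deterministic and therefore guessable."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()


# A "pseudonymized" release keyed by the hashed email (made-up data).
released = {naive_pseudonym("ada@example.com"): {"purchases": 3}}

# The attacker's dictionary of candidate identifiers (hypothetical).
candidates = ["bob@example.com", "ada@example.com", "eve@example.com"]
lookup = {naive_pseudonym(c): c for c in candidates}

# Hash every candidate, match against the release, and the pseudonym is undone.
reidentified = {lookup[p]: row for p, row in released.items() if p in lookup}
print(reidentified)
```

Using a secret key or a separately stored token table raises the bar, but then that key or table becomes exactly the linkage an attacker goes after.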

Katharine Jarmul: With differential privacy, you tend to release aggregates, things like histograms, things like count sketches, and other things like this. But you could also release a single result, right? So you can release an average or something like that. Interestingly enough, some of the core differential privacy attacks that have been proven exploit implementation errors. So Damien, the same researcher who's done a bunch of work on differential privacy, will be presenting at BSides Zurich this year some work that the Tumult team did on leveraging floating point attacks against differential privacy systems. And we can think about this like, when we're using floating point, that's already an abstraction on how computers store data, right? There's no real floating point data type at the core processing level. We're dealing with some sort of abstraction. What has been proven before and what they further proved is that, depending on the sensitivity and how you're sampling this noise, the computer can sometimes not be truly pseudorandom.

These are some of the same problems we have in cryptography and so forth as well: with computers, we don't often have a true source of randomness, at least from a mathematical perspective. Let's say that you're an attacker of a differential privacy system. If it's quite obvious that you're running a query that has a particular distribution, and the sampling of that distribution is predictable, you can start to reverse engineer. We can think of this in terms of Bayesian or probabilistic thinking. You can start to reverse engineer these processes, and you could potentially even start to guess how much noise was added. If you can guess how much noise was added with any certainty, you immediately remove all of the privacy guarantees, because the privacy guarantees come completely from the fact that nobody can probabilistically infer how much noise was added. And so the whole thing kind of falls apart if you could do that. Interestingly enough, most of the attack vectors against differential privacy have come less from data analysis and more from computer and systems analysis.
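
The point that guessing the noise removes the guarantee can be made concrete with a toy sketch in Python. This is not the floating-point attack described above. It swaps in a much simpler flaw, a noise generator with a guessable seed, and all names and numbers are hypothetical. It is an experiment for building intuition, not an implementation to use on real data.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random,
                sensitivity: float = 1.0) -> float:
    """Toy Laplace mechanism: add noise with scale = sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# The flaw (an assumption made for this sketch): the release uses a
# predictable seed instead of a secure source of randomness.
TRUE_COUNT = 42
EPSILON = 0.5
release_rng = random.Random(1234)
released = noisy_count(TRUE_COUNT, EPSILON, release_rng)

# An attacker who guesses the seed replays exactly the same noise draw...
attacker_rng = random.Random(1234)
guessed_noise = laplace_noise(1.0 / EPSILON, attacker_rng)

# ...and subtracts it, recovering the true count up to floating-point rounding.
recovered = released - guessed_noise
```

Real attacks are subtler than replaying a seed, but the consequence is the same: once the noise is predictable, the released value reveals the underlying count.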

Alyona Galyeva: Now I guess I get why there is no such thing as a 100% guarantee of anonymization. With all that being said, we tried to walk through different chapters a little bit and touch on different things, so I want to summarize all of this. Could you briefly walk me through the content of the chapters, more or less: how many chapters the book has, what's in them, and how this is a hands-on privacy implementation book? Just highlight this, and I think that will conclude our session for today.

Katharine Jarmul: As you're familiar with already and as, hopefully, viewers will be as they read the book, the chapters are structured so that there's theory first, and then there's hands-on stuff in the latter half of the chapter. This is because I'm a practitioner like yourself. I think it's really important to not just teach people how something works, but how they can use it, you know?

Alyona Galyeva: Exactly.

Katharine Jarmul: Okay. That sounds great. But, like, what am I gonna type when I sit down? Like, how am I supposed to do this? And so each chapter has open-source libraries. So we start with data governance, and we talk a little bit about pseudonymization and so forth, and then move on to differential privacy. You get to play around with the Tumult Analytics library.

Alyona Galyeva: Cool.

Katharine Jarmul: There's also some implementing differential privacy from scratch, which, by the way, you should never do unless you're an expert.

Alyona Galyeva: I remember this. Don't do that.

Katharine Jarmul: I teach it to folks so they can play with it, right? Because it's important to, like, play with these mechanisms and kind of reverse engineer how they work so you can reason about them. Then it moves into data pipelines and data engineering. So how do you put privacy technologies into data pipeline work and data engineering work? Then it moves on to attacking privacy systems, because we need to know how to attack stuff if we wanna know how to protect it.

Alyona Galyeva: How to protect, yeah.

Katharine Jarmul: And then we move into federated learning, federated and distributed data analysis...

Alyona Galyeva: With Flower.

Katharine Jarmul: ...which we didn't get to talk about today. Then you use Flower, and I love Flower for this. And then we go into encrypted computation. This is just computing only on encrypted data without decrypting it: encrypted machine learning, encrypted data processing, and how does one do that? And then I sort of go into the human side. So how do you read policies? How do you work with lawyers? All of these things, and then some FAQs and some use cases. So it's a pretty fun, wild ride, I hope.

Alyona Galyeva: I do hope.

Katharine Jarmul: Open for feedback if there are open questions. It's been really exciting to chat with you, and thank you so much for being my host. You're the most.

Alyona Galyeva: Thanks a lot. And I guess the folks can get your book because it's right now available, so you can find it in digital copy or hard copy, whatever you prefer. Thanks a lot, Katharine. Thanks.

Katharine Jarmul: Thank you so much.

About the speakers

Alyona Galyeva (interviewer)

Principal MLOps & Data Engineer at Thoughtworks

Katharine Jarmul (author)

Principal Data Scientist at Thoughtworks & O'Reilly Author