Apache Spark Veteran Builds AI Tool to Challenge Healthcare Insurance Denials
After a decade building Apache Spark, Holden Karau is now using AI to fight back against health insurance denials. The free tool at fighthealthinsurance.com is already helping people navigate one of America's most frustrating systems.
About the experts
Julian Wood (interviewer)
Serverless Developer Advocate at AWS
Holden Karau (interviewee)
Open Source Engineer at Netflix
Apache Spark: A Decade of Distributed Data Processing
Julian Wood: Welcome to another episode of GOTO Unscripted. We are here in the lovely city of Chicago. It's the first time I've been here and the weather's been amazing. GOTO always puts on a great conference and we get to speak to some awesome speakers. Holden, welcome to GOTO Chicago. You've done a huge amount of work in the open source and data space. Tell us, how does that work? What have you done? In the world now where data is massive and it's all about generative AI and LLMs, how does it all fit together?
Holden Karau: Well, I would say it fits together poorly, but a little bit it does fit together. I've worked on Apache Spark mostly for the past decade now, which is kind of a trip to think that I've done something for ten years. Over that time we've seen Spark develop and in many ways take over from MapReduce. But there's also a lot of other tools around like Flink and more recent ML-focused tools like TensorFlow.
There is this interplay between the AI tools and the data tools. You have to have the data to make the models or even to validate the models to know if they're going to work. So you have this classic ETL problem. But then we also have this problem of getting the data onto the GPUs or onto the specialized chips for doing the training. This is a problem because the bus there is relatively slow. Copying data back and forth a lot of times can be very expensive.
The other thing is that Spark has this approach to resilience for handling machine failures. When a computer fails, we just redo the bit of work that was on that computer. But the problem with training machine learning models is when we're using many different computers, we can't take that approach anymore. If we lose one computer in the middle of training our model, we can't just recompute the part that was on that machine. We have to recompute a lot more.
This introduced a new form of scheduling called gang scheduling into Spark to support those things. Also, a lot of the machine learning tools are written in Python ostensibly, but in practice they're CUDA libraries or whatever. They're largely not in the JVM. So we have this problem of how we get our data from the JVM onto the GPU or into the place that our machine learning libraries can use it. The Arrow project has been of increasing importance as the lingua franca between these different tools. It's the thing that glues everything together in a lot of ways.
Julian Wood: For people who may not be familiar with Spark, what is it about Spark that's differentiated it from MapReduce? There seem to be a lot of different data processing and data handling tools. For people to remember Spark, what should they know about it?
Holden Karau: Spark's strong suit, the place where it belongs, is when you have too much data to fit on a single computer, probably too much data to fit on two computers. Increasingly we can fit more data on a single computer, which is great and lovely. But on the flip side, the amount of data that we use is also increasing drastically. We can handle maybe some of the problems that we used to do on single computers, but the kinds of problems we're being asked to solve have also grown.
I think of Spark as like a conductor. It brings together a bunch of different things and helps them work together towards a shared goal.
Julian Wood: So is it a distributed computing platform that you would install? Spark runs on a fleet of computers and runs jobs to process things? How does it evolve from the MapReduce part, or is it using similar concepts of mapping things out, doing the calculations, and then reducing that to the ultimate result?
Holden Karau: There are shared parts. The mapping things out remains the same. Then we do have what's called the shuffle step, which is very similar to the reduce step but not necessarily exactly the same. That's where when we need different pieces of data that's on different computers, we shuffle the data around so all of the data we need to answer something is on the right computer together.
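A toy, stdlib-only sketch of the shuffle idea (an editorial illustration, not Spark's actual implementation): the heart of a shuffle is deciding, per key, which machine a record belongs to, so that every record with the same key ends up co-located before aggregation.

```python
from collections import defaultdict

def partition_of(key: str, num_machines: int) -> int:
    # A deterministic stand-in for a hash partitioner.
    return sum(key.encode()) % num_machines

def shuffle(records, num_machines):
    """Route each (key, value) pair to the machine that owns its key."""
    machines = [defaultdict(list) for _ in range(num_machines)]
    for key, value in records:
        machines[partition_of(key, num_machines)][key].append(value)
    return machines

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
partitions = shuffle(records, num_machines=2)
# After the shuffle, summing per key needs only local data on each machine.
per_key_sums = {k: sum(vs) for m in partitions for k, vs in m.items()}
```

Because every key lives on exactly one machine after the shuffle, the per-key aggregation that follows never needs to talk over the network again.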
A lot of what made Spark successful, in my opinion, is that in MapReduce you have to spend a lot of time thinking about what is your map, what is your reduce. You have to manually put a lot of different things together to get good performance. Whereas in Spark you can write a lot of small individual things and the optimizer will figure out, "Okay, I can put all of these things together into a single pass over the data. Here is where we have to do this kind of expensive thing. And then once again, I can put together all of these things." It takes a lot of this abstraction away from you in terms of having to think about the difference there.
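The fusing behavior described above, reduced to its essence in a stdlib-only editorial sketch (not Spark's optimizer, just the idea): instead of materializing results after each small transformation, the chain of steps is combined into a single pass over the data.

```python
def fuse(*steps):
    """Compose several per-element transformations into one function."""
    def fused(x):
        for step in steps:
            x = step(x)
        return x
    return fused

# Three "small individual things", written separately...
add_one = lambda x: x + 1
double  = lambda x: x * 2
minus_3 = lambda x: x - 3

# ...executed as a single pass per element, not three passes over the data.
pipeline = fuse(add_one, double, minus_3)
result = [pipeline(x) for x in range(5)]
```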
The other one is the resiliency model. In MapReduce, we handle machine failures by writing data out many times so that if a computer went away it was fine, we would just go and read it back in from disk. Whereas in Spark, we could do that, and in fact if you do have a problem where that is the right solution, there is an option called checkpointing you can use. But the default is that most things are not too expensive, so if we lose it we can just recompute it. We can just get this little piece back. That made Spark generally faster than MapReduce in a lot of ways.
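The lineage-based recovery described above can be sketched in a few lines (an editorial toy, not Spark's RDD machinery): a derived dataset remembers its parent and its transformation, so a lost result can be recomputed from the chain instead of being re-read from replicated disk.

```python
class Lineage:
    """A dataset that records how it was derived from its parent."""
    def __init__(self, data=None, parent=None, fn=None):
        self._data, self.parent, self.fn = data, parent, fn

    def map(self, fn):
        return Lineage(parent=self, fn=fn)  # no work happens yet

    def compute(self):
        if self._data is not None:      # a source (or checkpointed) dataset
            return list(self._data)
        # A "lost" result is rebuilt by walking back up the lineage chain.
        return [self.fn(x) for x in self.parent.compute()]

base = Lineage(data=[1, 2, 3])
derived = base.map(lambda x: x * 10).map(lambda x: x + 1)
recovered = derived.compute()  # nothing was ever written to disk
```

Checkpointing is then just the escape hatch for when walking the chain back to the source would be more expensive than a disk write.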
Julian Wood: So is that because the individual jobs are smaller and it's just worth doing the compute thing?
Holden Karau: I would say that even with the individual jobs being the same size, yes we lose machines, but we don't lose machines all the time. So occasionally recomputing is cheaper than always writing to three disks, because writing to three different disks is very slow.
GPU Integration and Resource Management
Julian Wood: When you're talking about machine learning and AI across GPUs, the bottleneck before, I presume, was the storage writing speed. Things are going to be in memory for speed, but then network transfer costs to get things between GPUs. When you're running things on a single GPU or multiple GPUs on the same box, that networking path is still a bottleneck but obviously a lot faster than what we've had before. Is that the evolution of Spark, being able to take advantage of that data locality in different ways?
Holden Karau: In many ways, yes. We bring the data to the right place or we bring the compute to the right place. These are actually tunable configs that you can do, based on the fact that the network is a lot faster. We have giant fast pipes between a lot of these machines. You can configure how long it should wait for there to be room on a machine to run a task versus just copying the data to another machine that could run the task. These are tunable options.
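In Spark these waits are ordinary configuration knobs; a hedged sketch of the relevant settings (the values shown are the defaults as I understand them, so check the documentation for your Spark version):

```
spark.locality.wait          3s    # how long to wait for a data-local slot
spark.locality.wait.process  3s    # wait for same-process locality
spark.locality.wait.node     3s    # wait for same-node locality
spark.locality.wait.rack     3s    # wait for same-rack locality
```

Lowering these makes Spark give up on locality sooner and ship the data; raising them makes it wait longer for the compute to come to the data.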
Julian Wood: Obviously at the moment the chips are super expensive, and wasting any time doing any training or any data analytics on that chip is super wasteful. So is that the evolution of Spark you're talking about, making these efficiencies to use the hardware to the best ability?
Holden Karau: That's actually something new in the current versions of Spark, and in the next version I think it'll be very solid. We introduced something called resource profiles. The resource profiles are a way of saying this part of my job needs these special pieces of hardware. This is really important because for a lot of things you don't need those GPUs. There are just these critical parts where you do need them. So you can tag those critical parts and then Spark will only use those resources when it needs them.
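An editorial toy of the idea behind resource profiles (a concept sketch in plain Python, not the actual `pyspark.resource` API): tag only the stages that need special hardware, and claim a GPU slot only for those stages.

```python
def place(stage, executors):
    """Assign a stage to the first executor offering the resource it needs."""
    needed = stage.get("needs", "cpu")
    for e in executors:
        if needed in e["resources"]:
            return e["name"]
    raise RuntimeError(f"no executor offers {needed!r}")

executors = [
    {"name": "cpu-1", "resources": {"cpu"}},
    {"name": "gpu-1", "resources": {"cpu", "gpu"}},
]
stages = [
    {"name": "etl",   "needs": "cpu"},  # most of the job: no GPU claimed
    {"name": "train", "needs": "gpu"},  # the critical part that needs one
]
placement = {s["name"]: place(s, executors) for s in stages}
```

The payoff is exactly what's described above: the expensive hardware is only tied up for the stages that actually need it.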
Julian Wood: The best hardware for the job.
Holden Karau: A fair amount of that feature came out of Nvidia, not surprisingly. Although on the other hand, it might be a little bit surprising because it makes it so that you don't have to use them for the whole job, so they maybe don't get quite as much money. But I think in the long run it is better for them and better for the industry.
Common Mistakes and Best Practices in Data Processing
Julian Wood: What mistakes do people make when they are doing Spark or distributed data? There's obviously a massive space there must be many tripping up points. If you're advising somebody who's going down this road, what are the kinds of things that they should or shouldn't do?
Holden Karau: The most common root cause of problems that I see is not taking a look at your data first. It's understandable, these data sets are quite large, but a lot of problems come from misunderstanding the cardinality of certain kinds of data. Because you don't understand the cardinality of your data, you ask the computers to do things that aren't reasonable to do, and they'll do their best, but it'll be incredibly slow or expensive. Whereas if you did know, you might have asked them to do something else instead.
Julian Wood: That's like a data preparation step, better categorization, understanding, sanitizing data. The better your data structure, the better insights you can ultimately get out of it.
Holden Karau: In a way, yes. Data sanitization definitely plays a nice part. But I would also say in addition to sanitizing your data, I think it's good to look at the distribution of the core properties that you're analyzing. Do you have a very nice normal distribution? Do you have a skewed distribution? What does this look like? You use very different techniques for working with these different kinds of data.
For example, if we have data with a really massive skew, which is very common, you need to do things differently. In fact, it is not uncommon for me to find that people strongly believe their data is not skewed. But when we go and look at it, it's like no, your data is skewed. In fact, everything is moved to one computer and it's very sad. We have these other 100 computers and they're doing nothing. So let us think about how we could structure that.
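A stdlib-only editorial sketch of "look at your data first": a simple frequency count over the join or group key reveals skew before you ask a hundred machines to do something unreasonable.

```python
from collections import Counter

# Synthetic example: one hot key dominates an otherwise uniform key set.
keys = ["hot_user"] * 9_000 + [f"user_{i}" for i in range(1_000)]

counts = Counter(keys)
hot_key, hot_count = counts.most_common(1)[0]
hot_share = hot_count / len(keys)
# hot_share is 0.9: nine out of ten records hash to the same machine, so one
# executor does almost all the work while the rest sit idle. Knowing this up
# front, you might salt the hot key or aggregate it separately.
```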
Julian Wood: It's a classic parallelization problem, deciding where things are going to run and proper distribution.
Where's Spark in the future? You also mentioned some other projects. Spark doesn't exist in a vacuum. Is Spark alive in ten years? What would you look back on and say that Spark had achieved or had helped the world with?
Holden Karau: Is Spark alive in ten years? Probably. Realistically it might not be as important in ten years. I think that's quite likely. I think it's very likely that we'll see some other tools come and disrupt the space. One of them is Ray, which certainly has some very interesting properties. I think it's a fascinating tool.
But on the flip side, Spark is now used for so many things. It's not COBOL, but it's going to be difficult to get rid of. Even if we have a magic wand and things are better, there's still MapReduce jobs out there.
Julian Wood: What are use cases of Spark? Data is a very generic term. What are people picking Spark to use, and when is it the wrong tool for the job?
Holden Karau: Sometimes I see people using Spark for things which, frankly, really they just need local parallelization over a pandas dataframe. On the flip side, it's kind of hard to do, so I get it. I think it's, as I was saying earlier, very well suited to data sets which are just too big for a single machine. I think it is also okay for things where it does fit on a single machine but it's very slow, and we want a high level of parallelization to make things go faster. I do see a use for that.
One of the more common mistakes that I see in general is the assumption that Spark will parallelize everything. This is not true. It will only parallelize the parts that you do with Spark. So it is important to understand which parts of your program are being done inside of Spark and which parts of your program are not Spark. Once you call to pandas, it's not Spark anymore and it will in fact not go any faster. If truly this is at the point where you've filtered your data down enough, it might even be time to stop Spark. You can call spark.stop and you can just do the rest of your work in pandas. This will be great, you'll save a bunch of money.
Building an AI Tool to Fight Insurance Denials
Julian Wood: We talked about large language models and AI and all this kind of thing, obviously lots of data and lots of data processing. Not necessarily even about Spark, but what are your thoughts on the revolution we're going through at the moment and how data plays into that?
Holden Karau: Data is very important for large language models. My talk here at GOTO Chicago is very much around how we use the data from independent medical review boards to try and build language models to help people with insurance problems in America, which are depressingly common.
Julian Wood: Your talk is about using AI to help people who've been denied insurance as a remedial mechanism to be able to understand the process or help them through finding out why that happened?
Holden Karau: More when they've had a claim denied.
Julian Wood: A claim to insurance.
Holden Karau: Yes, that's fundamentally what the talk was about.
Julian Wood: What is the genesis of that? The data wrangling of Spark over many years and then pivoting to talk about machine learning and AI in terms of insurance is a different thing.
Holden Karau: I will admit it is a little bit different. I haven't worked for an insurance company. There's a few different things that brought it on. One is I got hit by a car and so I had a lot of insurance bills, and of course no one agreed about who should pay for things, the classic American story.
Also being trans and choosing to medically transition, even in California and working for large, ostensibly progressive companies, those insurance plans didn't always do what they said they were going to do. In fact, rarely did they do what they said they were going to do. And just knowing a lot of other people, I started working on it probably as a result of that.
But I have ADHD among one of my many wonderful or terrible qualities, depending on who you ask. So I do many different projects and they don't always make it to the point of something that's usable. But the thing which pushed me over the edge was the pet insurance company for my dog denied the claim for anesthesia, saying that he should have had the root canal while he was awake. I was like, we don't even do that to humans. He is a two year old dog. I don't know of a single dog that you could do this to, certainly not for a root canal.
I guess they might do that on humans with local anesthetic, but even then, that was a terrible experience and the person understands what's going on. Unfortunately our large language models don't yet translate to dog, so we don't have a way to tell them, "Hey, we're doing this because there's something wrong with your tooth." So they just think that they're being hurt and obviously, understandably, are not okay with that.
I got super upset with them and it was like, "Okay, well, it is time to finish this project and get it to a point that people can use it."
Julian Wood: I often think that we have concerns in the world about using AI for decision making, for even things like insurance, like whether you should be insured and what your coverage should be. But you're taking on the second step of, okay, you have insurance, you've paid your premiums, something has happened and your claim has been denied. What is the AI asking, querying, checking, understanding? Is it looking through a contract and understanding more? What is the actual outcome of the process?
How the Insurance Appeal Tool Works
Holden Karau: That's a great question. In its current iteration, it processes the denial and extracts what was denied and the reason that the insurance company stated for the denial. That's the starting point. The patient can also provide their health history, which we look at.
Then it's like, a very common reason for denial is that insurance companies will say something is not medically necessary. In fact, they say this for all kinds of things. My physical therapy after getting hit by a car, they're like, "Oh, you don't need that much physical therapy." I'm like, "I literally can't walk right now, so I think I probably need more than one hour."
So it says, "Okay, this is what we understand. Does this look reasonable to you?" And you would say, "Yeah, this is right. This is the procedure I was going for. This is why they said it was denied." You provide some of your history. Then we would go ahead and generate an appeal that would probably incorporate some of the history.
Based on the historical appeals that exist, the generated appeal is more likely to be successful. For procedures with previous denials, and for better or worse a lot of those have reached the independent medical review board, that history is a good prior.
It's not perfect because it depends on the person providing their medical history. I spend a lot of time thinking about health insurance and even I don't know my medical history very well.
Julian Wood: And whether your insurance company knows or you've given the same information to them.
Holden Karau: I don't care what the insurance company thinks.
Julian Wood: Default is going to be no because it's going to cost them money.
Holden Karau: The default is go away. So one of the things I want to do is switch it so instead of asking the person for their medical history, we instead ask them a bunch of questions about their medical history. I believe that if we do that, we'll be able to get more context to fit into the model and produce better appeals.
But I haven't done that yet. It's definitely the sort of thing like this is a project that I do in my time after work, so there is only so much that I can do right now.
Julian Wood: So you train your own model based on existing insurance claims and denials?
Holden Karau: I use what's called the IMR data or independent medical review data. That's essentially a third level appeal data set. So you're not going to have everything in it, and you're only going to have things in it that are being appealed for medical necessity. So there's a bias there, but it's the best data set that I could find. There are tens of thousands of records. I do a bunch of things to that data to try and produce the kind of data that I need for doing the fine tuning. It's not perfect, but it's the starting point.
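The preprocessing described here could look roughly like the sketch below, turning each review record into a prompt/completion pair for fine-tuning. The field names are invented for illustration; they are not the real IMR schema, and the actual pipeline surely does much more cleaning.

```python
import json

def to_training_pair(record):
    # Hypothetical record fields; the real IMR schema differs.
    prompt = (
        f"Procedure: {record['treatment']}\n"
        f"Denial reason: {record['denial_reason']}\n"
        "Write an appeal arguing medical necessity:"
    )
    return {"prompt": prompt, "completion": record["findings"]}

record = {
    "treatment": "physical therapy, 12 sessions",
    "denial_reason": "not medically necessary",
    "findings": "The reviewer determined the therapy was medically necessary.",
}
pair = to_training_pair(record)
jsonl_line = json.dumps(pair)  # one line per record in a JSONL training file
```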
Julian Wood: People will fill in their history, fill in what's happened to them. Obviously there's an insurance contract and company that you would know about. Ultimately the challenge is matching all these together and using the documentation or the plan you have, finding out what is or isn't covered, looking at your medical history, and then trying to infer whether they've been unfair to you or not.
Holden Karau: There's also this challenge of the contract. It should exist, and to an extent we do some very limited processing of the contract. But in my experience, most people do not have their plan contracts. So you go based more on what the regulations require the insurance companies to do rather than necessarily the terms of the contract.
If they have their contract, that's great, it gets considered by the model. But many people don't have it. There's actually a requirement that insurance companies provide it on request within a certain period of time, I'm not a lawyer, et cetera, but in my experience, insurance companies don't do that even though they are legally required to.
Julian Wood: Easy for them to obfuscate and just stall. Tactics for medical and insurance altogether just doesn't happen.
Holden Karau: A lot of Americans are insured under what's called an ERISA plan for privately funded health plans. In these cases, it's kind of not exactly clear who the plan administrator is. Like, is the plan administrator Anthem Blue Cross or is it your employer? And the answers that you get when you ask them may well vary. They have legally distinct roles, and this is really important because on request the plan administrator must do certain things. But there have been court cases where the person only asked the insurance company and the insurance company went, "Whoa, we're not the plan administrator. We're just acting on their behalf." And I'm like, how is an individual supposed to know that? The number on the back of their card is...
Julian Wood: In fact, a fundamental question that you can be insured by somebody you're not quite sure who they are.
Holden Karau: It is incredibly, incredibly obtuse. And then there's also things like multi-plan where you could actually have an insurance contract that's written by this other insurance company that uses another insurance company's network, but the ultimate financial responsibility is, you don't know. It's so confusing.
Technical Challenges and Open Source Sustainability
Julian Wood: Lessons learned in building this, things that went right and wrong that surprised you, that was easy, that annoyed you because they were difficult. What's the lesson in building what seems like an interesting app to solve a particular problem that you yourself and many people have?
Holden Karau: I would say that renting GPUs by the hour is expensive, which is annoying.
Julian Wood: Is this mainly for the training of the model?
Holden Karau: Yes. This is actually one of the decisions that I made. We would rent GPUs for training, but we would use our own GPUs for inference, because renting them for inference would just be too expensive. But this obviously limits the kind of model I can make to a model that will fit on the kind of GPU I can afford, which is not as large as one might hope.
So yes, you can rent big GPUs for training and then use smaller ones for inference, a very classic thing. The catch when you do that is the GPUs you rent are not going to be anywhere near you network-wise. So you have the classic problem of a giant pile of data that you need to get to these machines in the middle of nowhere.
Julian Wood: Sounds like what we were talking about earlier, getting your data to the computer, the data Spark problem.
Holden Karau: The problem is when you're doing this as an individual, you don't own a data center. So these machines are just whatever you happen to be able to find. And so you have, in the inverse of what we were talking about where normally nowadays you have very fast networks, you actually have a very slow network. The network is an incredible bottleneck. So you very much want to configure things differently.
The other one that was kind of annoying is that even for supposedly non-preemptible machines, you'll get preempted probably eventually.
Julian Wood: What does preempted mean?
Holden Karau: Preempted essentially means what you're doing will get stopped and just thrown aside. Often, normally, I'm not accusing anyone of this, it happens because of money.
Julian Wood: Okay, so this is when you're renting a GPU and you run a job on the GPU and they eject your job somehow?
Holden Karau: Now, of course, it can also be ejected because the hardware failed. And in theory you can run GPUs that are non-preemptible and that means it'll only kick you out if the hardware fails. But I'm just saying it's possible that you should plan for perhaps a higher level of preemption than one might expect.
Julian Wood: Does that just completely break the job and the job needs to completely start again? Because going back to earlier when you're talking about restarting things.
Holden Karau: This is checkpointing. It's a trade-off of how often you write out the intermediate steps of what you've been doing to persistent storage. And also, do you have persistent storage available on this computer? Some places you can, some places you don't. So I think it is very important to get persistent storage and get that attached to your machine so that when you are preempted, you don't have to start over. You can start from one of your checkpoints.
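The checkpoint/resume loop for preemptible training can be sketched as follows (an editorial, stdlib-only toy: real training would save model and optimizer state, not a counter): write state to persistent storage every N steps, and on restart pick up from the newest checkpoint instead of step zero.

```python
import os, pickle, tempfile

CKPT_DIR = tempfile.mkdtemp()  # stand-in for attached persistent storage

def save_checkpoint(step, state):
    path = os.path.join(CKPT_DIR, f"step_{step:06d}.pkl")
    with open(path, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)

def latest_checkpoint():
    names = sorted(os.listdir(CKPT_DIR))  # zero-padded names sort by step
    if not names:
        return None
    with open(os.path.join(CKPT_DIR, names[-1]), "rb") as f:
        return pickle.load(f)

def train(total_steps, ckpt_every, die_at=None):
    ckpt = latest_checkpoint()
    step = ckpt["step"] if ckpt else 0
    state = ckpt["state"] if ckpt else 0
    while step < total_steps:
        if step == die_at:
            raise RuntimeError("preempted")  # the machine goes away
        state += 1                           # stand-in for one training step
        step += 1
        if step % ckpt_every == 0:
            save_checkpoint(step, state)
    return state

try:
    train(total_steps=100, ckpt_every=10, die_at=57)
except RuntimeError:
    pass  # preempted mid-run; checkpoints through step 50 survive on disk
resumed_from = latest_checkpoint()["step"]   # resumes from 50, not 0
final = train(total_steps=100, ckpt_every=10)
```

The trade-off mentioned in the conversation lives in `ckpt_every`: smaller values lose less work on preemption but spend more time writing checkpoints.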
Julian Wood: So when you're doing the model training, I'm thinking the checkpointing is one part, but obviously the training steps that are happening need to be stored and saved somewhere else. How does that work?
Holden Karau: It's in VRAM. So when the computer dies you lose whatever has happened since your last checkpoint. It's not the end of the world.
Julian Wood: The checkpoint, is there a lot of data that needs to be saved?
Holden Karau: Generally speaking, checkpoints are not small. It also builds up over time because you do many of them quite likely.
Julian Wood: What are some of the things that surprised you that were easier to do, that worked really well, or new things? As you were doing the project, you did something different, you changed track and it was like, "Oh wow, that was a much better solution."
Holden Karau: One of the biggest things that changed between when I started and where I ended up is there was an introduction of a tool called Axolotl. I might be mispronouncing it, but it greatly simplified the fine-tuning process and made it a lot easier to try different models. With fine-tuning, you might try different base models and see how they perform, and it made that much more reasonable to do. So if anyone out there is trying to do fine-tuning, I would say check out Axolotl. I think it's a great tool. It's open source. I have no relation to it besides using it.
Julian Wood: You just mentioned open source. Obviously Spark was open source as well, and that's of massive importance in data and machine learning. What are your thoughts on the state of open source at the moment? There's been a lot going on in open source over the past few months.
Holden Karau: I would say that it's definitely been a rough period of time for open source. There's been some behavior from certain participants in different communities that, quite frankly, I would view as unacceptable.
Julian Wood: Is this licensing changes that you're mentioning or thinking of?
Holden Karau: License changes, trademark disputes, different specific instances. And by unacceptable I don't mean illegal. To be clear, I'm not accusing anyone of breaking the law. I'm just saying that they're acting in ways that are not...
Julian Wood: Assumed contracts or understandings with people as well maybe broken.
Holden Karau: Yes. The implied, yes. I think it's unfortunate. To me, in some ways it does show the importance of foundations that are not controlled by the corporate entities of which they're a part. I think, unfortunately, I say this with care, some foundations are perhaps more controlled by corporate entities than one might hope. And I think that's unfortunate.
To be clear, I don't think that's the only issue. It's a balancing act. I think frequently some foundations will also take it perhaps too far in the other direction where they will refuse any corporate money out of the fear of tainting things. It's like, well yes, this is a very noble goal, but perhaps maybe we should take some computers from them.
Julian Wood: Altruistic but doesn't pay the bills.
Julian Wood: One of the fundamental issues with open source is it has been put together with the idea of free. A lot of people have taken the free to be in terms of free from the money aspect when it's more the freedom of being able to do what you want with it. But ultimately we want people to have the freedom to be able to use software the way they want, but there is a monetary angle to it because it does cost huge amounts of people's time, money, computers, brainpower to work on these open source things. What is a path forward that you can see, or multiple ones, of how open source becomes sustainable in the current challenges?
Holden Karau: I think I'm not the person for this fundamentally. I'm just an open source developer.
Julian Wood: But you're passionate about it.
Holden Karau: I am. I care greatly about this topic, but I think that there are other people who have perhaps more informed views here. I think there are paths forward. The classic approach of I'll make this software and then offer it as a service to make money, we're seeing that break down when someone else can offer it as a service to make money better than the people that made the software.
Julian Wood: That's the thing that some people don't understand, that you have sympathy for the people who have created something and then run it as a hosting service, then somebody else runs the same software better, and it's like, well, yeah, how does that work?
Holden Karau: I make money when people decide to pay me to write software. Normally they decide to pay me to write open source software to an extent. But that only makes sense in a world where running it yourself is a feasible option. So I think some of the licensing changes that we see happening are things that I don't think are good, but I am also biased.
But I think at the end of the day, there's this question of how do we still cooperate while existing inside of capitalism? I think there was a period of time when we had pretty reasonable cooperation. We're seeing the financial incentives break down in some places. And I don't have a new financial model for how open source gets funded. There's a lot of people trying a lot of different things, but I don't know what the right answer is.
Julian Wood: The right ideas bubble up and it's always going to evolve. That's something you can't hark back to the past or the future. We did a nice open source tangent on funding, but back to the tool that you've built. Is it available to use? Can people use it? What's the consumption model?
Holden Karau: It is open source, so that's one option. But I also realize that the majority of people are probably not going to install a Python package on their computer.
Julian Wood: "I've been denied my insurance, I just need help now."
Holden Karau: Yeah. So you can also just go to fighthealthinsurance.com. It's a hosting service, it's free. Perhaps in the future, as we perhaps have been alluding to, I will figure out a way to make some money by offering things on top of that. But the business plan currently is that of the underpants gnomes, if anyone's aware what that is.
Julian Wood: Underpants gnomes sounds amazing.
Holden Karau: Step one, collect underpants. Step two, question mark. Step three, profit. So we're at step one, fight health insurance. Step two, question mark, is what's next. And step three profit is almost assured because, you know, what could go wrong? But there's some small details to work out in step two.
Julian Wood: Well, it sounds like the genesis of a great open source tool where it meets a very obvious and tangible customer need, and then funding is to be worked out in the future when and if it's successful.
Holden Karau: One of the things that I think about in open source is that it tends to be more successful when the users and developers of a project are a very similar set. Unfortunately or fortunately, everyone gets health insurance now, so the users and developers are almost entirely distinct. I use my own tool because I work on it, but otherwise most people who get a health insurance denial are not going to go through the process and then be like, "You know what I would love to do? Submit a PR to fix that typo." This is probably not going to happen. A few people might, and if you're that kind of person, more power to you.
So it's that thing where it's like, okay, how do I keep working on this and still have money for food?
Julian Wood: There's also, this is not something that may be open source and people can run it themselves, but you're running a hosted version that's costing you money.
Holden Karau: Running a hosted version or even training the model, these things have real costs in terms of dollars. Not huge because I'm cheap and I've done my best to architect it in such a way that the costs are not through the roof, but they're still not zero. So yeah, I don't know. We'll see. Maybe I'll figure out how to make it survive for a while. I certainly hope so.
Julian Wood: I'm sure it's going to help many people because it is the most frustrating thing in the world to be denied health insurance for something that you know you're entitled to.
Holden Karau: It's terrible. If anyone works for a health insurance company, I'd encourage you to look in the mirror, think about your life choices, and if you want to do something else or help, then you can come and help overthrow your bosses.
And to be clear, when I say look in the mirror and consider life choices, I have done things in my life that I'm not proud of as we all have to survive in capitalism. I totally get it. Maybe you look in the mirror and you're like, "Hey, I hate that I have to deny these claims, but I'm going to do the best that I can to try and help the people that I can. But I got to eat."
Julian Wood: That's an awesome spot. Thanks so much for spending your time. I love talking about the open source and talking about Spark and data processing and how that links to machine learning and everything, and your super altruistic, amazing project to help people with health insurance. Thanks for joining us for GOTO Unscripted here at GOTO Chicago where we get to speak to these amazing speakers who are doing awesome things in the world of our lovely software engineering practice. Thank you very much.