Scaling Machine Learning with Spark: Distributed ML with MLlib, TensorFlow, and PyTorch
Looking to scale your machine learning projects? This book, written by Adi Polak, offers a comprehensive overview of the tools and resources available for distributed ML with MLlib, TensorFlow, and PyTorch on the Apache Spark ecosystem. Designed for data scientists and engineers alike, this book covers the trade-offs of distributed machine learning and the challenges of keeping up with open source software changes. Get a sneak peek into the upcoming book and learn more about Adi Polak's next big project.
Join author Adi Polak as she goes through key topics from the book together with Holden Karau.
Holden Karau: Hello. We're here today with Adi Polak talking about her new amazing book that I really love. It is all about doing machine learning with Spark. Adi, why don't you tell us a little bit about sort of your background, and I know that your book actually talks about why you wrote the book, but let's also hear a little bit about that, too, to start things off, if you're okay with that.
Adi Polak: I would love to. Well, it's exciting to be here with you today. It's always a pleasure and thank you so much for all the insights and support during writing the book. It takes a village. It truly is a village of good people that are willing to share from their experiences.
Why did I write the book? I guess I should go back to my personal experience. I started my career as a machine learning researcher. I worked for different labs, starting with telecom, IBM security, and so on. I still remember the good old days of Mahout and Hadoop, and also some Teiresias algorithms, essentially sequence algorithms in C++, which weren't part of the Hadoop space, but still, I was able to make them performant and run them on big data.
That was a lot of fun. Starting all the way back in those days and then learning more about the data space, I found myself deepening my experience in building data platforms for big data analytics and machine learning. And really starting from, we have all this great functionality. There are a lot of things that we can do. Of course, there are caveats. Not everything works the way we want it to work. How can we leverage the data infrastructure, the rest of the company, and the people that can come together to really develop state-of-the-art machine learning and make sure we are delivering it to production on time? Which are two different worlds at the end of the day. One of them is, oh, I can build this amazing model, and the other one is, how do I get it to production?
Holden Karau: Yes. Those are very, very different problems. Sadly not super-overlapping skillsets.
Lead with the tools and resources you have
Adi Polak: Definitely not. One of the things I always advocate for when I speak with data scientists, and also one of the big reasons why this book was written, is how people can leverage the tools that already exist in their organization. And we know Spark is a big one of them, right? There's huge, massive adoption. I don't think there is a company in the world that does analytics or data processing that is not using Spark in some constellation, either through their BI or operational data or what have you. And so a lot of data scientists can tap into the clusters that already exist, the DevOps team, all the different resources available to them in the organization, and build their models and experiments and research leveraging that. Because many times what happens is we go and we try to invent everything from scratch, and then 90% of machine learning projects are failing because there are not enough tools and not enough people. It's hard to get the data. That was the big thing that the book is leading with: what already exists, how can we leverage that, and how can we bridge out of it if we need to?
Holden Karau: I really like that context. That's an amazing sort of foundation to pick is like, these are the tools that you already have. How can we build awesome things with them? It's an amazing place to start. How did you pick the tools?
The Apache Spark Ecosystem
Holden Karau: This book is not just about Spark, obviously, it's about Spark inside of an ecosystem. How did you pick the other tools that you included in this sort of ecosystem to use alongside Spark?
Adi Polak: It was a huge topic that I had for a very long time. What are people truly using? What would be beneficial? What are the qualities of each one of the tools? Pros, cons. If these are open source, what's the size of the community? Integration with Spark was a big thing. How well does it integrate? Is it kind of like a first-party integration with Spark? Or is it more of a second or third party? And throughout this research, I also went back and thought, what did I need back in the day? What would be beneficial? So I used that. I interviewed a lot of data scientists, a lot of folks. And I was also fortunate to work for Microsoft, and we had a lot of information around what customers are using, what they are struggling with, and how they are building their machine learning systems.
The truth is there were a lot of struggles around tools and adding new tools to the mix. Every time a team needs to adopt a new tool into their workflows, whether that's experimenting, training, getting your model into a staging environment and then into production, or getting real, fresh data out of production for experiments, this kind of loop always had a lot of friction around what you can do and how you can do it. And also some skill gap between the different parties. Data scientists use a different language, essentially.
Holden Karau: Sometimes literally as well, different languages, but sometimes figuratively. Or both. Were PyTorch and TensorFlow tools that people were frequently using, but struggling to use alongside Spark, when you were writing the book?
Adi Polak: Yes, especially around the data. Loading Parquet data using TensorFlow and PyTorch is not exactly 100% intuitive. There are some tools trying to help folks, but at the end of the day, a lot of companies develop their own type of connectors or build on top of Arrow or things like that. Or they had a process like, how do you take your Parquet file? You sample it, and then you transfer it into a CSV so you can bring it to your data scientists. I see your face and I'm like, how many times did I do that?
Holden Karau: I mean, 100%. I totally see that in production, but it also, it makes me sad. It just makes me sad, throwing away all that good data. It's a crime. I mean, not actually a crime, but it feels like...
Adi Polak: It is also a format mismatch, right? Because of the different formats between CSV and Parquet. Then you need to have feature engineering, for example, and it's a big thing, because you want to have a similar workflow for feature engineering during the experiment and then in production as well. If you're doing the whole transformation and running feature engineering on top of CSV files, those have different data types. You can see the fun and the inconsistencies.
Holden Karau: Getting the feature engineering to move from batch to online for inference is another place that I imagine people struggled with a lot, because you probably have another set of types that you're gonna get represented in your online queries. So much fun.
Adi Polak: It's trying to duplicate the same logic with different tools. At the end of the day, it introduces friction, file format issues, and a lot of pain for folks, and if different people are implementing it, it's going to behave a little bit differently.
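To make the type mismatch concrete, here is a minimal sketch using only Python's standard library. The records and field names are hypothetical stand-ins for typed rows that would come out of a Parquet file; the point is only that a CSV export and re-import turns every value into a string, so feature code written against the CSV sees different types than code running on the original data.

```python
import csv
import io

# Hypothetical typed records, standing in for rows read from a Parquet file.
rows = [
    {"user_id": 1, "price": 2.5, "active": True},
    {"user_id": 2, "price": 10.0, "active": False},
]


def csv_round_trip(records):
    """Write records to CSV text and read them back, as a sampling export would."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    buffer.seek(0)
    return list(csv.DictReader(buffer))


restored = csv_round_trip(rows)

# Compare the Python types before and after the round trip.
original_types = {key: type(value).__name__ for key, value in rows[0].items()}
restored_types = {key: type(value).__name__ for key, value in restored[0].items()}
print(original_types)
print(restored_types)

# A naive truthiness check now treats the non-empty string "False" as true.
print(bool(restored[1]["active"]))
```

Every column comes back as a string, which is why feature logic developed on a CSV sample can quietly diverge from the same logic applied to the typed production data.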
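One common way to avoid duplicating the logic is to keep a single feature function and call it from both the batch (training) path and the online (serving) path. The sketch below is an illustration under assumptions, not the approach from the book: the function names, field names, and features are all hypothetical, and a real system would more likely express this as a Spark pipeline.

```python
import math


def build_features(record):
    """Single source of truth for feature logic (field names are hypothetical)."""
    return {
        "log_price": math.log1p(record["price"]),
        "is_weekend": 1 if record["day_of_week"] in (5, 6) else 0,
    }


def batch_features(records):
    """Training path: apply the shared function to every historical record."""
    return [build_features(record) for record in records]


def online_features(request):
    """Serving path: apply the same function to one incoming request."""
    return build_features(request)


history = [
    {"price": 9.0, "day_of_week": 5},
    {"price": 0.0, "day_of_week": 2},
]
training_rows = batch_features(history)
serving_row = online_features({"price": 9.0, "day_of_week": 5})

# The same input yields identical features on both paths.
print(training_rows[0] == serving_row)
```

Because both paths go through `build_features`, a change to the feature logic lands in training and serving at the same time instead of being reimplemented twice.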
Holden Karau: Then you can have these like excellent models that, like everything you've got says it's gonna be fantastic. And then, when you actually run it, you don't see those results. That's really disappointing.
Book chapter overview
Adi Polak: Exactly. I devoted a whole chapter to feature engineering with Spark and the whole thought process around how you engineer these features and why you need to leverage pipelines and different automation so these will act the same in production. There are challenges, of course, around running the actual Spark cluster in a staging environment versus production.
Holden Karau: Of course.
Adi Polak: And so on. But at least making it as close as possible can help overcome the mismatch where people later on see inaccuracy in their model and it fails in production, although in development and experiments it worked fantastically. And also, in terms of tooling, Spark is great. There are a lot of machine learning capabilities, right? There's MLlib and then there's ML, and there are probably new features coming up with all the new exciting things around LLMs and so on. Yet there's still some friction around MapReduce and the scheduler.
And this is why the book starts with, hey, here are all the functionalities you have with Spark. Here's how you automate things. And while you're working at it, you can also leverage a lot of the pre-work that you did with Spark on pre-processing and analytics and bridge into PyTorch and TensorFlow leveraging Petastorm. It gives you caching capabilities, or kind of a translator between Parquet and the rest of the ecosystem. And so midway through the book I'm like, okay, up until now, we leveraged everything that's available to us in the organization. Now let's see what small things we can do in order to provide more machine learning capabilities and more algorithms, so that we can go deeper into the deep learning space, leveraging everything that we did so far. There's also a chapter around MLflow and how to manage your experiments with MLflow. Leveraging all of that, and then diving into how the PyTorch and TensorFlow distributed architectures work, which enable people to do deep learning at scale and are way different from Spark.
Holden Karau: Very different. Very different. I really liked how you used that explanation to explain why there had to be this new method of scheduling and then talked about the introduction of the scheduler. I thought that was fantastic. Because otherwise, I remember a lot of the Spark community, when the gang scheduler was introduced, were just like, "Why do we need this? I don't understand." And so I wish you'd been around to explain it back when we were debating including the gang scheduler, because it was fantastic. Fantastic job.
Adi Polak: Thank you. I'm glad it resonates. One of my main north stars was, how do I take all these complex things and simplify them so people can read, understand, and take action? And you know it better than I do, you wrote multiple books. All are fantastic, and you're truly a great resource for the industry, and this is always...
Exploring the Glue Spaces in Machine Learning and Data Engineering
Holden Karau: Your book, and I'm super impressed with it. So to me, the interesting problems always exist in sort of the glue spaces, right, is one of the expressions. It's the place where we've got these two things that are intersecting and they're just not quite lining up because neither group is talking to the other. And I don't know if this is your intention, but to me it feels like your book is sort of a book which explains to both sides: this is what these people are trying to do and this is how they can do it. And it's not gonna fix the fact that we've got data engineers and data scientists and machine learning folks all doing their own thing. But I think it develops a lot of empathy and understanding as well as the technical skills. And I think that's really awesome. I know that your primary audience is, or at least I think your primary audience was, the ML folks, right? From your introduction, that's what it said. But I think beyond that, even just having data engineers read this book would be super useful for them to get that perspective. I don't know if that was a secondary thing that you were considering, but that definitely came through to me.
Adi Polak: Wow. I never thought about that. Now that you're saying it, I understand that data engineers can definitely benefit from it, because many times data engineering folks would find themselves in these glue spaces enabling other teams to do their work, right, the BI and the data scientists and so on. And so they can definitely leverage that book, which speaks about the glue inside the architecture of how we build machine learning systems end to end, and think through what they can offer to folks in their organization to be successful building their machine learning experiments, models, innovation, and so on. That's really interesting.
Holden Karau: It was not a conscious target, it's just something that incidentally came... Okay. I thought it was fantastic. It's fine if you weren't even trying, that makes it all the more impressive. But to me it was something where I was like, I really need to share this with the other people working on the data platform team with me. Because the machine learning platform obviously works on top of the things that we do. And I want us to have better empathy for our machine learning platform colleagues so that we can understand each other more.
How did you pick the target audience? Because for me that's always a big challenge when I'm writing: who is the archetype I'm writing to?
Adi Polak: It's always what we start with when writing a book. My target audience when I started was data scientists who want to expand their skills and tools. People who see that their project is not resonating, that it doesn't get into production on time. They get frustrated because some data scientists are building state-of-the-art algorithms and they have such a deep understanding of the data space and the domain that they're operating in, yet they're struggling with actually bringing it to life. Those were the folks that I targeted in the beginning because I used to be one of them, and I used to work with them later on when I moved to the data infrastructure side. I used to build things for them. And I wanted to have that so they will learn the language of the data space as well.
The book is essentially not explaining all the machine learning algorithms. There are some that are explained a little bit more, like deep learning, because it goes deeper into architecture and so on, and what deep learning actually looks like when I have a distributed system. But for the rest of the algorithms, like logistic regression and so on, the book doesn't go into depth. There's an assumption that, hey, there's logistic regression, it exists in the world. There are a lot of great books on the basics of machine learning and people can dive into them. The audience was data scientists, and now I understand that data engineers or infrastructure engineers would definitely benefit from it so they can speak the same language. It's a good one.
Holden Karau: That's really cool. That's really cool. You picked the audience by, in many ways, talking to past you, is sort of the vibe that I'm getting. Like, this is the book that you wish you had when you were doing these tasks. I love it. I love it.
Adi Polak: I wish more people on my team would have...
Holden Karau: Would've also had the book. Yes. Okay. That, I mean that's also, that makes sense. Part of why I write sometimes is like I wish other people knew these things so that I didn't have to fix them myself. Because I'm lazy. But also writing to a past version of oneself is I think a super great way to write as well. That's awesome.
Adi Polak: It gives a holistic view and dives into the important areas where people are usually missing out on connecting things. I have a whole chapter on deployment at the end. What's the difference between real-time, near-real-time, and batch deployment? What does it look like within a container, as a service, or as part of an existing service? What are the pros and cons? Does it impact the way we build the experiment? Right. Also, there's a question of whether those impact each other: if we think through deployment, should we plan ahead when we're running the whole experiment? And I always say yes, please have in mind the things that you wanna do, because as you go through that path, you're going to hit those roadblocks and it's not going to be pretty. At least giving people a heads-up, saying, here are the potential roadblocks, here are the pros and cons. When you are making decisions, take those into consideration and act accordingly, because in software there's never a perfect project.
Holden Karau: No. It's always trade-offs. It's always trade-offs.
Adi Polak: Then the question is, how much knowledge do you have in advance?
Holden Karau: I mean, as my boss is fond of saying, there's the known unknowns and then there's the unknown unknowns. It's the second one that's just the bane of our existence.
Adi Polak: It can be a lot of fun and a lot of frustration.
Holden Karau: Lots of frustration.
Adi Polak: For sure.
Navigating the Trade-offs of Distributed Machine Learning
Holden Karau: I'm so sorry if this is like kind of a random tangent, but, one of the other things that I really liked about the book was how you took...so there's the optimization algorithms, right? Which the machine learning and data scientist folks are already familiar with. But then you also went and explained like, hey, now that we're in a distributed model, the trade-offs are different here. Right? And so I love that and I was wondering like how did you decide to do that, I guess? Like what?
Adi Polak: That's a great question. I looked at a data scientist who works locally versus what actually happens with the model when building it to cover as many use cases and edge cases as possible. For building a good, accurate model, it's always best to have as large a sample set as possible, rather than taking that perfect Kaggle dataset and, oh boy, my code works fantastically, my ROC curve is perfect, my accuracy, RMSE, and so on, all these good metrics are always perfect. And then I reach the real world and I'm like, oh, this is not actually working the way I expected it to, what is wrong? And then understanding that every time we take any type of logic, at the end of the day, a machine learning algorithm is a set of operators that we're running on top of the data, statistical operators that are broken down into different CPU and GPU cycles, and understanding that in a distributed world there are a lot of stochastic operations involved that are not necessarily coming from the algorithm itself.
It comes from the machine: oh boy, a machine is down. Oh boy, I lost a message. Or different things that can happen. And then we need to take that into account when we're looking at the error rate and say, when I'm building this, I wanna build it so it'll be as similar to production as possible. And so I wanna train my model on a large dataset because I can, because Spark is available to me in the organization, and I'm taking on these challenges of how differently those algorithms are implemented when they're running on one machine versus in a distributed setting. And I'm putting all of these potential error risks into one place, into a plan. And then I know when I'm executing that its accuracy won't be 95% like we love with Kaggle. It's probably going to be something around 70%, and we'll be happy with it.
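As a toy illustration of randomness that comes from the cluster rather than the algorithm, the sketch below averages per-worker gradients for a one-parameter linear model and then repeats the step with one worker's contribution missing. Everything here is an assumption made for illustration: the data, the function names, and the idea of simply dropping a shard. It is not how Spark, TensorFlow, or PyTorch actually handle failures; it only shows that the same training step can produce a different update depending on which workers report in.

```python
def gradient(weight, shard):
    """Mean-squared-error gradient for the model y = weight * x on one shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)


def averaged_gradient(weight, shards):
    """Average the per-worker gradients, as a data-parallel step would."""
    grads = [gradient(weight, shard) for shard in shards]
    return sum(grads) / len(grads)


# Hypothetical data, split across three workers.
shards = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0)],
    [(5.0, 11.0), (6.0, 13.0)],
]

# All three workers report their gradients.
full = averaged_gradient(0.0, shards)

# The third worker is down or its message is lost, so only two report.
degraded = averaged_gradient(0.0, shards[:2])

print(full, degraded)
```

The two updates differ even though the model, the data, and the code are unchanged, which is the kind of environment-driven variation that is worth budgeting for when setting accuracy expectations.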
Holden Karau: I would sell many things for 70%, which I should not mention on the recorded stream.
Adi Polak: Data scientists always get frustrated. They do all these great PhDs and research and so on, where everything is perfect and it just works. And then they reach the real world, and it's like, oh, what happened to my 95% accuracy and my 5% mistake? And it's like, no, that's real life now. It's a different ballgame altogether. It was important to explain that, and I also believe it gives people the ability to maybe improve that at some point when they're tweaking their hyperparameters, seed, and so on, so they can look into these things and say, "Oh, this comes from my distributed settings. Maybe I wanna have fewer machines, more machines."
Holden Karau: And I think, I don't know, maybe this is just me, but I'm really impressed with that, because people don't normally think, maybe I want fewer machines. Right? That's not something that people think about. And admittedly, as someone who works on distributed systems, it's not something that I think about very often. Normally I'm just like, no, no, we get more computers, the problems will go away. But I really like this counterpoint: sometimes this is not what we need. And so that was really awesome to see written down and also explained. I am sorry, I just really like that part of your book. I mean, I like a lot of your book, but that part was just... That's so good. So good. The other thing that I'm really curious about is how you decided the trade-off of what level of infrastructure information was the right level to include, right? You don't have, hey, we're gonna go set up a Kubernetes cluster, obviously, because it's way too far down the stack. But how did you decide where these people would probably not have to be worrying about the infrastructure anymore? Was that from your time at Microsoft, or was it from more recent times that you got this sort of gut feeling, or maybe more than a gut feeling?
Adi Polak: This is where having a lot of folks giving feedback and reviews was an immense help, because I can go deep into the rabbit hole, and I would be like, oh, let's do these configurations, and here's a configuration file, and you have to tweak so and so. But then great folks were like, you know what, you don't have to explain it. If people wanna learn more about Kubernetes or Docker and how to configure these things, then you should put a link to the documentation, saying, here are the docs, you're welcome to read about that. And I think that's fantastic because then the book is not a 600-page book. It's something more tolerable. And this is a delicate balance because I always think, ooh, some people would probably need all these explanations, right, because they really wanna do it right now. And so for that, I provided different notebooks for people to run things.
I think I ended up with 10 different exercise notebooks, so one for each chapter. And I was like, here you go, take it, try it out, get your own hands-on experience. Many people asked how I built a Docker image and for the files for that. I provided them with that as well so they can tweak it and change it according to their environment. I tried to keep the book as much more...
Holden Karau: Streamlined, but with like sort of accessory resources available to folks.
Adi Polak: Exactly.
Holden Karau: Do you... This is maybe super rabbit hole-y. Did you use the include tags in AsciiDoc to pull in some of those resources? Or were you just, no, the resources are gonna sit over here completely separately?
Adi Polak: I had the resources separately in a GitHub repo, and there's a reference to it, because I realized that some people would appreciate the print edition and some would want the online one. The print edition can be a little bit tricky. If I'm using the print edition, now I need to copy a lot of things. I'd rather just go to a GitHub repo, clone it, and do whatever. But I guess different authors, different flavors.
Holden Karau: One hundred percent. No, I read your book on Kindle because I have too many books to carry around with me that I'm excited about reading. And so it was just like, no. Kindle, actually, I took it with me on my motorcycle trip to read because I was really curious and very happy. There's also really bad wifi, so I couldn't go onto the notebooks, but that's okay. I went onto the notebooks once I got back.
Adi Polak: Oh, nice. Thank you.
Holden Karau: This is really good. Sorry, I just... This is such a lovely book. Okay. So, notebooks separate. Didn't use include tags. I get that 100%. I'm just curious, mostly because I use a lot of include tags, and I always wonder if I'm overusing include tags or not. And I just, I don't know.
Adi Polak: I have code snippets.
Holden Karau: Yeah. No.
Adi Polak: I did add a lot.
Holden Karau: I saw them. But I was wondering if you, and so this is like a matter of personal taste as an author I suppose. Like, it's when you wanna share the code snippets as like chunks of the same examples that are gonna live separately in the GitHub or if you wanna just like, create specific snippets that are just for the book. And I never know what the right thing to do is here.
Adi Polak: I always think about the reader, right? I always have the reader in mind: what would provide them with the most value and set them up for success. I believe a lot of readers, including myself, like to run the examples and get hands-on experience. I can read it, it's an educational book. I have it on my desk, and if I'm looking for a term or I wanna read a chapter and get inspired by new architectures and new ideas and so on, then I read it end to end and just sit with it. And then when I wanna take action, sometimes I will dive into the examples and the notebook and say, oh, I remember reading about it. Now I want to...
Holden Karau: I wanna do it.
Adi Polak: Copy and test it, right? So I can run it and then adjust it to my environment. Because at the end of the day, as engineers, copy-paste is something that we do. And then combined with understanding what it is, if we just copy and paste, we can adjust it to our own system. And so that kind of, I believe, gives people the ability to learn from examples, combining theoretical, conceptual, practical side of things.
Challenges of Keeping Up with Open Source Software
Holden Karau: That's awesome. Another, like more sort of processy type question. How many times did you have to... Did the open source software change while you were writing it, and you had to go back and rewrite some of the things that you've written?
Adi Polak: Oh my God. I changed the whole table of contents midway, because I realized at some point I was writing for too advanced an audience. I was like, this book will be great, but I still need to introduce it to a larger audience. Because most of the feedback was: this is great, but it's very hard to understand. You need to have all these years of experience and knowledge and field experience from production systems. And I was like, okay, so I'll start changing. I redid the whole table of contents. I also added a short chapter introducing, or refreshing people's memory around, Spark, although there are plenty of wonderful books. And one of them is yours. Yours is the best.
Holden Karau: Thank you. I appreciate it.
Adi Polak: And I was like, okay, I'll need to reintroduce it a little bit, just for the sake of people who are not familiar with the terminology and the architecture and what's available. And so I added that as well. And when I looked into the open source versions, I kind of locked them in during that time. I was like, okay, I redid the table of contents. There were a couple of major new updates, especially with Spark. I remember Spark 3.1 was released, and this was the last one I introduced. And then I realized that, okay, to keep to the timeline, we need to lock it. And so the GitHub repository and the book state all the versions of the open source software that were used, saying features might change, functionality might change, and high-level architecture could change, but less frequently. And that was it. I was like, okay, six months in, locking the versions of the open source.
Holden Karau: That's very reasonable. That's very reasonable. Because otherwise, you can keep coming back to it and just spinning your wheels on it. And, you are right, I mean, Spark 3.3 we know has some new features for doing machine learning, but fundamentally, like the architecture that you described is still, like, that is probably the one that I would pick, right? Like doing in memory, like prior or serialization between these two, is a lovely idea. But in practice, that just hasn't panned out yet.
Adi Polak: It's always interesting when writing about open source and technology, when it's not a conceptual book but a practical book for people. How can we provide our readers with the most value, knowing that these products are changing and evolving all the time? And this is where I think emphasizing the architecture, the concepts, and the thought process around how to choose the glue between the different products provides readers with a lot of good information. And a lot of people appreciate that.
Holden Karau: I love it. I love it. And, not everything needs to be the latest version. Because the latest versions are always great, of course and fantastic. And please, software developers keep working. But we don't always want to run the latest version of everything in production. There's some opportunities for improvement, as my boss would also say with those techniques. Of course.
Adi Polak: That's really interesting. I just had a conversation at work about how people run their open source versions. It takes forever to upgrade. I remember it would take at least a year and a half, well, depending. I work for a corporate, so I guess it's slightly different. But for corporates, when the system is so heavy, with a lot of moving parts and many products that you're supporting, then upgrading to a new version is kind of a big thing.
Holden Karau: It is. And I feel you on the year-and-a-half timeline. We have upgrades that are scheduled for a year. And I am very anxious about those. I don't. But I also imagine, I remember being at a startup and that was totally different vibe. We were like, "Oh cool. Let's try this new software. Oh, it looks pretty cool. Let's run it in production. Oh no. Oh no. Oops, I'm sorry."
Adi Polak: Roll back. Roll back.
Holden Karau: Oh yeah.
Adi Polak: So it always depends on how many products are, how much revenue this product brings.
Holden Karau: That's the great thing about a startup with no money. It's like, well, it went terribly, but it's also not like we made any money yesterday either. Who's to say, who's to say. After having written this book, is there another book that you wanna write? Or are you gonna take a break?
Can We Expect Another Book?
Adi Polak: It's a big question. It's funny because once I finished it I was like, oh, I'm done. And then, you are not really done because there's always editing, typos, images... So I guess for almost two months I thought I was done and then realized I wasn't done. I definitely wanna write another one. I love the process.
Holden Karau: Oh great.
Adi Polak: I love working with the author community. There's a fantastic author community, and there are amazing folks in the industry who are happy to provide insights and feedback, which is amazing because you always need people to work with you on these, kind of your tribe. I don't know yet which one. I've started investigating the whole plug-ins for generative AI space and I'm fascinated with the opportunities, and I'm starting to think through the right systems for myself, systems for doing things at scale, more on the engineering side. How can I make it scalable? What are the opportunities, and how can people interact with that better to get the most out of these tools? This is a thought process that I had.
And then, on the other side of things, the data space has been booming. I have a lot of experience in that, on the machine learning side, the analytics side, but also building data systems.
Holden Karau: So we can expect another book, which book TBD, but this is an experience that was not terrible for you. And so I'm excited. I look forward to your next book as well, whatever it may be, because this book is just fantastic. I love it. And I really think that for any data engineers out there, you might not be the intended audience, but I think you should definitely check this book out so that you can understand what the machine learning practitioners you're supporting or working with also need. And of course, machine learning practitioners, you should check out this book because you're gonna need this so that you can get your stuff into production. And we all love our experiments, but if we're not in production, we're not making money. And unfortunately, under capitalism, we need to make money. But, Adi Polak, thank you so much for writing this book and taking the time to talk today. That's really fantastic. And I hope to see you again in San Francisco soon.
Adi Polak: I hope.
Holden Karau: It's awesome.
Adi Polak: Yes. Coming soon. I do have a trip booked. I'll send you the details.
Holden Karau: Cool. Sounds good.
Adi Polak: Thank you so much for having me. It was a lot of fun. I hope a lot of people will benefit from reading the book and learning and getting hands-on experience and really deepening their skills in the places where it matters. And then, hopefully, with capitalism, deliver great machine learning projects.