How AutoML & Low Code Empower Data Scientists
Over the past decade, AutoML has revolutionized the world of data science, propelling it several layers forward in terms of abstraction. This powerful technology has paved the way for a new era of democratization, empowering experts from all fields to harness the power of data through the concept of the citizen data scientist. Moez Ali, Creator of PyCaret, and Linda Stougaard Nielsen, director of data science at Ava Women, discuss two sides of this discipline and its future.
What is a citizen data scientist?
Linda Stougaard Nielsen: Welcome to GOTO Unscripted. We are here at the GOTO conference in Copenhagen on the second day. And my name is Linda Stougaard Nielsen. With me, I have Moez Ali.
Moez Ali: Hey, my name is Moez. I'm a data scientist, and very excited to be here today talking with Linda.
Linda Stougaard Nielsen: Right. And I think we will talk a little bit about the concept of a citizen data scientist. Can you explain what that is?
Moez Ali: Yes, sure. So, a citizen data scientist is a term coined by Gartner a couple of years ago, and it is described as a person who uses or leverages predictive and prescriptive models in their work, but their primary domain or line of work is not statistics or analytics. So, think of it as a secondary persona to add more fuel to your data science team.
Linda Stougaard Nielsen: So, this could be any person who has any interest in the domain that they're working with, basically. So, a person who has no idea of what a machine learning model even is, or how the data is structured, but just knows about the domain, right?
Moez Ali: Knows about the domain and, obviously, is interested in leveraging no-code and low-code tools to their advantage to solve the problem. So, imagine a doctor facing the problem of diagnosing a condition from an X-ray or MRI image. The doctor is interested in leveraging data to understand what causes the condition. In that case, the doctor would not be a data scientist but would be using data science and data to solve the problem.
Linda Stougaard Nielsen: And how would they go about that? Because, I mean, nowadays, if you are working as a data scientist, you work directly with code: you write SQL, you write Python, and you work with Jupyter notebooks or even more advanced technologies. So, you can't expect a citizen data scientist to know all this code, right? So, how is this typically done?
Moez Ali: So, in the last few years, there's a wave of these tools that are called no-code and low-code tools where you don't necessarily need to write code to be able to build a machine learning model or test a machine learning model. You have very easy-to-use, friendly interfaces that you can use to kind of solve these problems.
Linda Stougaard Nielsen: Yes. And you were even working on one, right?
Moez Ali: Yes. Of course. In 2020, two years ago, I led the creation of a project called PyCaret, which is an open source, low-code machine learning library in Python. You still have to code, which is why we call it low-code: you write a few lines of code instead of many. But then you also have no-code tools, where the code layer is completely abstracted and you have beautiful front-end UIs to work with.
Linda Stougaard Nielsen: So, the kind of tools that you would typically use as a business analyst, such as Tableau?
Moez Ali: Yes. The target audience of these tools was originally thought of as business analysts or citizen data scientists. Citizen data scientist is not a formal designation, right? You can be a marketing analyst and still be a citizen data scientist if you're using and leveraging data to solve your problems, right? Originally, all these tools came with a story of less technical personas using them. But what I've seen over the last few years is that even your hardcore engineering and data science resources will sometimes use tools like this, maybe to configure an ETL pipeline, compress a model, or train a model. It's not only about technical skills or technical acumen; these kinds of tools are also really good time savers. They're good productivity tools as well.
Linda Stougaard Nielsen: Exactly. So, we are talking about this level of abstraction, right? Before this talk, we talked about AutoML, which is another level of abstraction that has been introduced within the last couple of years, where you actually have machine learning code that finds the best machine learning code or model for you. So, it's a model that finds the best model for you. It's an abstraction level that frees people to do more interesting things than the low-level stuff. Well, finding a good model is not necessarily low level, but you can get more creative by having these tools.
Moez Ali: Of course. AutoML has been around for over a decade now. We saw first-tier companies like DataRobot and H2O, and these products started coming out in 2012, so 10 years ago, right? Today, what we are seeing is a second tier of automation in this space, whereby all these big companies are converting AutoML into an API service where you send your data and target through an API endpoint. Everything is abstracted. You don't see anything. It gets processed in the cloud, and at the end of the day, you get the forecast or prediction back, right? Google has a product called Vertex AI, and what they're essentially doing is building these forecasting engines behind the scenes for other companies to use in their products, to eventually sell a product to consumers, right?
What we are seeing today is more of this automation in these products. All these capabilities are now being integrated into the platform, whether an internal or a purchased platform, rather than operating as individual software as a service, as they did a couple of years ago. So, AutoML is now a big part, I think, of every organization, or will be a big part of every organization.
Linda Stougaard Nielsen: Yes, I agree with you. But also, I mean, we are talking about abstraction, and about people not necessarily knowing what's going on under the hood when using these models. And then there's also this other idea that with really powerful tools comes a lot of responsibility, right? You can still build and do a lot of wrong things, even though you have these powerful tools. If you don't select your data correctly, if you don't represent your whole population, for example, you can end up building a fantastic model based on your training data, but if that data does not match the real-life population your app will serve afterward, the model won't do a good job once you move it into real life. We talked about that before: if you have an image recognition system that is trained on the wrong data, it might not be able to perform at all in your target population if you don't make sure that your data is correct. So, even though you have all these tools, there are still other skill sets that we need to...
Moez Ali: Of course. So AutoML wouldn't prevent stupidity at work, right? It's the same as having a car and a license: you can still go on the road and have an accident. Right? So, domain knowledge is the number one thing; AutoML won't replace that. All the underlying statistical correctness of how you sample your data, how you prevent bias, how you collect the data, etc., all of that has to be part of your skillset or your team's skillset. If you go on Google now and search for AutoML, I can bet that over 50% of the time an article that mentions AutoML will also mention job security, or data scientists losing their jobs because of AutoML. I don't believe in that idea, because when Microsoft Excel was invented, nobody said Excel would replace the accountant's job, right? You still need a person. I think AutoML is one of the tools in the stack of data science teams. It won't replace, or act as a substitute for, logical acumen. You still need to have that in your team.
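As a rough illustration of the sampling concern raised in this exchange, the Python sketch below (not part of the conversation) fits a very simple classifier on one synthetic population and then scores it on a second population whose feature values are shifted. Everything here is an assumption made for illustration: the function names, the Gaussian data, the shift of 1.5, and the midpoint-threshold "model" are all hypothetical stand-ins for whatever an AutoML tool would actually produce.

```python
import random


def sample(n, shift, seed):
    """Draw n (feature, label) pairs from a synthetic population.

    `shift` moves the feature distribution of both classes, standing in
    for a real-world population that differs from the training data.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.random() < 0.5
        centre = (1.0 if label else -1.0) + shift
        rows.append((rng.gauss(centre, 0.5), label))
    return rows


def fit_threshold(rows):
    """Toy 'model': a cut-off halfway between the two class means."""
    pos = [x for x, y in rows if y]
    neg = [x for x, y in rows if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(threshold, rows):
    """Fraction of rows where 'feature above threshold' matches the label."""
    return sum((x > threshold) == y for x, y in rows) / len(rows)


train = sample(2000, shift=0.0, seed=0)
same_population = sample(2000, shift=0.0, seed=1)
shifted_population = sample(2000, shift=1.5, seed=2)

threshold = fit_threshold(train)
acc_same = accuracy(threshold, same_population)
acc_shifted = accuracy(threshold, shifted_population)

print(f"accuracy on a population like the training data: {acc_same:.2f}")
print(f"accuracy on a shifted population:                {acc_shifted:.2f}")
```

The model looks strong when the evaluation data resembles the training data and degrades sharply on the shifted population. No amount of automated model selection on the original sample would reveal that; it only shows up when someone checks that the data represents the people the model will actually serve.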
Linda Stougaard Nielsen: Yes, I absolutely agree with you. I also don't feel threatened by its emergence. I think it opens up much more creativity, because you can have people who actually work in the domain, and they're not as far away from the tools as they used to be. Compared with writing C code, or whatever you would be writing, the closer they get to the tools, the more insights they can gain from playing around with the data themselves, and even with the models themselves. And, yes, they might need some people to help them find the right data and make sure that they're looking in the right direction, at the right models and the right type of models. But at least the gap is closing, which is something that I really like, because it also gives us opportunities to do something that's more fun, instead of doing what a lot of people have already done, such as finding the best model for recognizing images, for example.
Linda Stougaard Nielsen: Yes. A lot of people have already done that, so let's just leverage that.
Two sides of data science
Moez Ali: Yes. Another way to look at it is that I always see data science as having two parts. There is art in it, and there's science. I think the science can be automated and is getting automated, right? The art part stays with humans. Let me ask this: would you hire a person who always writes the implementation of neural networks from scratch? Or would you hire somebody who leverages open source libraries like TensorFlow? I bet you'd pick the second, right? Because the first person, who writes everything from scratch every time, would spend a lot of time, and you wouldn't want that as a manager, right? So, I think AutoML is saving time, and giving you that time back to focus on things that matter. That could be more engagement with domain experts, more time spent on identifying and engineering new features, or even communicating or quantifying the value of the project, right? The idea is that you do what you as a human are best at, and let the machine do what the machine can do better than you, right? And if you think about AutoML at a ground level, at the end of the day, it's iteration, or a certain set of attempts in a search space, right? And I'm pretty sure a machine can do iteration better than us.
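The description of AutoML as iteration over a set of attempts can be sketched in a few lines of Python. This is a minimal, hypothetical illustration and not how PyCaret or any other product is implemented: the candidate models (a mean baseline and k-nearest-neighbour regressors), the synthetic data, and all function names are assumptions chosen to keep the example self-contained.

```python
import random
from statistics import mean


def make_data(n, seed):
    """Synthetic 1-D regression data: y = 3x + 2 plus Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + 2.0 + rng.gauss(0.0, 0.1)))
    return data


def fit_mean(train):
    """Baseline candidate: always predict the mean training target."""
    avg = mean(y for _, y in train)
    return lambda x: avg


def fit_knn(k):
    """Build a fitting function for a k-nearest-neighbour regressor."""
    def fit(train):
        def predict(x):
            nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
            return mean(y for _, y in nearest)
        return predict
    return fit


def mse(predict, rows):
    """Mean squared error of a predictor on (x, y) rows."""
    return mean((predict(x) - y) ** 2 for x, y in rows)


def search(candidates, train, valid):
    """Fit every candidate on train, score it on valid, return the best."""
    scores = {name: mse(fit(train), valid) for name, fit in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# The "search space": a handful of candidate models to iterate over.
CANDIDATES = {
    "mean": fit_mean,
    "knn_1": fit_knn(1),
    "knn_5": fit_knn(5),
    "knn_25": fit_knn(25),
}

train = make_data(200, seed=0)
valid = make_data(100, seed=1)
best_name, scores = search(CANDIDATES, train, valid)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:>7}: validation MSE = {score:.4f}")
print("selected:", best_name)
```

The loop itself is mechanical, which is the point being made: the machine can run it over far more candidates than a person would try by hand. Choosing the data, the candidates, and the metric is the part that stays with the human.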
Linda Stougaard Nielsen: Yes. I really like this way of putting it, saying that we have science and we have art, because people have often associated our field with nerds, as if you're just being a nerd. But I have often said that you don't get anywhere without a lot of creativity, because you really need to be able to not just work within the box, but actually go outside of the box and explore. You have to be creative; you have to do the art in order to do something innovative in this area. And the more tools you have to help you with the standard tasks, the more time you have for the actually interesting work, the art. So, I like that way of putting it.
Moez Ali: Absolutely. Also, if you think about it, we have been hearing for a long time that there would be a shortage of data scientists in the future. I think there would be a shortage of people who can convert business problems into machine-learning problems. I don't think technical skills will be in short supply, because of the way we are automating and abstracting everything from cars to software to washing machines. I don't think we are far away from a point where we'll be surrounded by abstraction. I think the real shortage of talent is in people who can look at a situation and say: okay, I am having inventory shortages again and again in my business, and this is because of a lack of quality forecasts. So, this is actually a forecasting problem. If I solve that, I'll be able to overcome the inventory shortage problem.
I think we have a shortage of talent that can identify a problem, or take a business problem and frame it as a machine-learning problem, and then collaborate and coordinate with different pieces, different teams, and organizations to make that project happen, right? And that's exactly what I think a citizen data scientist is: somebody who is closest to the business problem and can convert that problem into a machine learning problem. At least frame it, not practically do it: frame it, quantify the value of it, and convince people that we should solve this problem and why we should solve it.
Linda Stougaard Nielsen: I think that would actually also attract many more people into this area: you don't have to deal with all the nitty-gritty details, and you have this abstraction level that lets you deal with the problems and solve them instead. I think that would be a really good way to attract people.
Moez Ali: Right. Exactly. I mean, if you work in a company where you do not have motivated people, or where you have not taken the business into confidence, you can build as fancy a data science team as you want, but you won't have any results. At the end of the day, you need to satisfy a business ask or a business problem, and you need the business to have confidence in you.
Linda Stougaard Nielsen: Thanks a lot. It was very interesting talking to you.
Moez Ali: Same here. It was a pleasure meeting you, Linda. Thank you.
Linda Stougaard Nielsen: Thank you.