AI, ML & Data Science: What's the Difference?
There is a fine line between three disciplines that share the spotlight in computer science. What are the differences between machine learning, data science and artificial intelligence? Are we ready for truly creative computers, and what's the progress in this field? Feynman Liang, a Ph.D. student in statistics, and Phil Winder, CEO of Winder Research, try to answer all these questions while explaining what data science looks like in practice.
Intro
Preben Thorø: My name is Preben Thorø, I'm part of the PC for the GOTO conferences. With me today I have Phil Winder and Feynman Liang.
Feynman Liang: My name is Feynman Liang, I am currently a Ph.D. student. I left my job as director of engineering at Gigster, and prior to that, I was a master's student at the University of Cambridge, working on machine learning.
Phil Winder: Hi, my name is Phil Winder, and I run a small data science consultancy in the UK. We work with companies, big and small, to help them extract value from their data and to implement data-oriented applications and processes. My background is in general engineering: I started in electronics, moved into software, and then into data science and AI.
AI, ML & Data Science - What's the Difference?
Preben Thorø: That is exactly why we brought the two of you together, since your fields of work and expertise are data science and machine learning. I guess many of us have a hard time defining what data science, machine learning and AI actually are. Could you help me draw these lines?
Phil Winder: Sure. Do you want to go first? Then we'll compare notes at the end.
Feynman Liang: Sure. I do hear these terms thrown around quite a bit. For me at least, data science was originated by Jeff Hammerbacher and his colleagues at Facebook, back when there was a missing role between analysts and engineers: they needed someone who could dive deep into the data being created by these production systems, which required more sophisticated processing than a SQL analyst or an Excel analyst could produce. As for machine learning, I really define it as learning from data, or systems that learn from data over time. This definitely requires large pools of data, you can't do machine learning without data, and moreover, it requires algorithms, drawing from optimization and control theory, that improve as more data comes in and are able to find conclusions and patterns in that data over time. I really view AI as the overarching or general term. It covers this idea of systems that are more intelligent, that can do things that are not just 0/1 logic. Rather, it's fairly fuzzy, the tasks are less well-defined, but nevertheless, we expect these machines to perform almost on par with humans.
Phil Winder: I tend to agree with all of that. You've got to remember that these fields are not accredited in any way. If you say you're a civil engineer or an accredited electronic engineer, that means a very specific thing according to the certification of those accrediting bodies. But software engineering and data science are not that mature, and these things are evolving over time, so the definitions of these fields are developed by the people doing them. In my opinion, data science is the overarching thing: doing science with data, working with data. A subset of that is machine learning: trying to teach machines to make decisions based upon data. But there's also lots of other stuff, like exploratory data analysis, that is not consumed within machine learning itself.
Then for AI, I tend to see that as more of an academic and philosophical discipline. It's been around since Alan Turing, so since the 1950s: can we build something artificial that, I don't want to say thinks like a human, but appears to think like a human, acts like a human, behaves like a human, from an artificial construct, whether that be software or electronics or quantum computing or anything like that? So I think it's really hard to pin down what AI really is. But obviously, the marketing guys really like the word AI, and it's a really interesting philosophical field.
At what point do you become more intelligent than a human? Does it have to be at a specific task, can it be at a specific task, or does it have to be more general than that? So, that's kind of where the field is going, it's like these expert systems have done very well in very narrow fields, so the question is can we kind of expand that field now to cover more of a human psyche?
Does AI/Machine Learning still need a human to understand the data?
Preben Thorø: I think a big issue in this, the way I see it at least, is that today, when we're dealing with machine learning, or if we call it AI, we still have to do a lot of pre-processing of the data, we have to prepare it for the machine. Basically, we have to understand it first before we can hand it over to the machine. Is that true? How far can we actually go if there still needs to be a human mind behind it?
Phil Winder: Yes, so I'll take the first turn. There are several questions buried in that one question, really. From a very practical perspective, pre-processing and cleaning your data is vitally important for projects where you're trying to apply data science or machine learning to a particular problem, and there are a couple of reasons for that. The first reason is that most of the models used in production data science systems today expect your data to be in a very specific format. The vast majority of algorithms expect your data to be normally distributed, to have a fixed scale, and to not be correlated with each other; all of these requirements are needed so that the algorithm can do its job. If you don't do that, then the best case is that it just doesn't perform quite as well as it could; sometimes it can completely fail and won't work at all. So there's a very practical reason to do data cleaning.
The second point, and people forget about this a lot, actually, is that the practice of looking at the data, visualizing the data, and analyzing the data to clean it often throws up new insights, new ideas, and even new questions. Quite often, I've had times where I've been doing data cleaning, looking at some particular features, and I'm like, what's going on there? That's really interesting, and that's led to a new insight, a new idea for a particular project. That helps the clients because they might not have thought of doing that thing. So you need to do it in order to gain performance, and you also need to do it for understanding and new insights as well.
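To make Phil's first point concrete, here is a minimal sketch of that kind of preprocessing in Python with scikit-learn. The features, values, and pipeline are invented for illustration; they are not from any of Phil's projects:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Hypothetical raw features on wildly different scales,
# e.g. age in years and income in dollars.
X = np.array([[25, 40_000], [32, 85_000], [47, 120_000], [51, 62_000]])
y = np.array([0, 1, 1, 0])

# StandardScaler rescales each feature to zero mean and unit variance,
# the kind of "very specific format" many algorithms implicitly expect.
# Without it, the large-scale feature (income) dominates the model.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)
print(model.predict(X))
```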
Feynman Liang: Yes, I really do agree with Phil. He mentioned the necessity of it both for algorithms to perform as well as possible and for the learning and discovery process.
What does it mean for data to be clean? For example, if you see outliers on a certain feature, are those actually outliers, or are they a very strong signal for your algorithm? Just defining what it means for this data to be clean, that this data is not corrupted, is in itself specifying your modeling assumptions. So I think through cleaning, you're able to discover and define: what do I expect, what does my domain look like, what is natural and what is not natural?
I do think we're seeing more and more automation of data cleaning itself. For example, Google has released a tool called Cloud Dataprep, which is specifically designed to help with numerous tasks such as outlier detection, scaling the data so that it's between 0 and 1 or normally distributed, and whitening the data, that is, decorrelating it. These processes are very important for certain algorithms to succeed. Without them, you're going to get fairly crazy results.
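As a rough sketch of the three operations Feynman lists, here is what they might look like in plain NumPy and scikit-learn. The data is synthetic, and this only mirrors the kinds of transforms such tools automate; it is not how Cloud Dataprep itself works:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic, deliberately correlated two-feature data.
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 1.5], [0.0, 0.5]])

# 1. Crude outlier detection: drop points more than 3 standard deviations out.
z = (X - X.mean(axis=0)) / X.std(axis=0)
X_clean = X[~np.any(np.abs(z) > 3, axis=1)]

# 2. Scale each feature into [0, 1].
X_scaled = MinMaxScaler().fit_transform(X_clean)

# 3. Whitening: PCA with whiten=True rotates and rescales the data so the
#    resulting features are decorrelated and have unit variance.
X_white = PCA(whiten=True).fit_transform(X_scaled)
print(np.round(np.cov(X_white, rowvar=False), 2))  # approximately the identity
```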
Can a computer be creative?
Preben Thorø: You keep talking about algorithms. To me, that is like a sequence of steps you need to do with your data. I might be wrong, but it also sounds to me like this is very far away from creativity. Can a computer be creative?
Phil Winder: Well, just to go back to the previous point and link the two questions together: you also asked, given that we need to do this much data cleaning, is it ever possible to get to a point where the machine can do it itself?
More advanced algorithms these days, i.e., neural network-based algorithms, are attempting to automate much of the feature engineering and, to a degree, the data cleaning process. Not all of it, because obviously they still expect data to be between plus and minus one, and things like that. But they are very, very capable of extracting features by themselves, given a reward that makes sense for the problem you're trying to solve. In the future, I expect to see people leveraging that a lot more, especially with complex data types. Because when we talk about data cleaning, the examples we keep using are always very simple: one-dimensional features that don't really affect any other features and are easy to comprehend. But in many applications, you have millions of features that you can't possibly analyze by hand, and so you have to rely on these more cognitive systems to learn representations of that data that better explain the domain.
Feynman Liang: I really like your point about data representation. In theoretical computer science, one learns that some problems which are hard when represented in certain ways become easy when you change the representation. I think the same is true in data science: if you are good with your feature engineering, then something simple like a linear model might be enough to solve your task, but if the features are poorly engineered, then there's no way for you to separate two classes with a linear decision boundary.
Phil Winder: Yes, the classic example there is going from Cartesian to polar coordinates when you've got two concentric circles. Just the switch of coordinate system turns a really complex problem into a very simple linear one.
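A short sketch of that classic example, with synthetic data: two concentric circles are not linearly separable in Cartesian coordinates, but after the switch to polar coordinates the radius alone separates them:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

# Two concentric circles of points, one class per circle.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# In Cartesian (x, y) coordinates, a linear model is stuck near chance level.
print("Cartesian:", LogisticRegression().fit(X, y).score(X, y))  # ~0.5

# Re-represent each point as (radius, angle).
X_polar = np.column_stack([np.hypot(X[:, 0], X[:, 1]),
                           np.arctan2(X[:, 1], X[:, 0])])

# The radius feature alone now makes the classes linearly separable.
print("Polar:", LogisticRegression().fit(X_polar, y).score(X_polar, y))  # ~1.0
```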
Recommended talk: GOTO 2019 • Composing Bach Chorales Using Deep Learning • Feynman Liang
Feynman Liang: But tying back to the question of whether computers can be creative: well, shameless plug for my talk later today. If you're tuning in to this, you'll notice that almost half of the talk is about data pre-processing and injecting domain knowledge, my knowledge about music, as well as my collaborators' knowledge about music, into the representation, and then running a fairly simple, generic algorithm on top of it to just model the sequences.
I actually don't know how you'd be able to just make music from nothing. You have to take the music and represent it on the computer in the first place, and in just that process, you're injecting human knowledge and human creativity: how do I take the music and represent it in a way that's understandable to a computer, that a computer can compute on top of?
Phil Winder: Yes. I mean, the definition of creativity is difficult as well. When you say creativity, you usually think of music, of art, of things like that, and they, themselves, work within a very constrained context. Take your Bach examples: there are very constrained rules about what sounds like Bach and what doesn't. If you're an artist, you often aim for a specific form of artistry. And you've got to remember that all current, and probably future, AI systems are only as good as the data you pass into them, so if you provide your algorithm with lots of examples of a specific type of art form, it could be very creative in that specific art form.
The point that we haven't quite got to yet is matching the human ability to go from one creative medium into a completely different creative medium that appears entirely disconnected, like going from visual art to musical art, for example. But in the future, if we can transfer that learning from one domain into another, then it might influence something else entirely. Let me try to give an example.
If you could take Van Gogh's paintings, what would that sound like? If you took the learning from those paintings and applied it to a musical algorithm, what would it sound like? I'm not sure you'd be able to hear it very well; that's the problem with that example.
Are we on the track to truly smart AI?
Preben Thorø: So on one hand, it sounds like we are actually pretty close to having truly smart AI? And at the other pole, it sounds like we're very far away from it.
Phil Winder: Yes.
Preben Thorø: Are we on the road towards it? Will it come?
Feynman Liang: Well, I think the difficulty of problems in AI is actually fairly counterintuitive. The example I really like to use here is playing chess with Garry Kasparov. AI has been able to beat Garry Kasparov since before 2000 (Deep Blue did it in 1997), but physically picking up the piece and moving the pawn is still an unsolved problem. Just the grasping mechanics, and the control systems and algorithms needed to do that, are still a research problem in AI. For humans it's very easy to move a piece, but most of us probably can't beat Garry. Would we call this AI smart? I don't know.
Phil Winder: Yes, exactly. It goes back to the idea of designing an expert system, and the reason that occurs is because of the way we do data science. Generally, the way it works is we take the data and then try to find a hypothesis or a model to fit that data, so your expert system is only going to be an expert in the data that you provide. If you want a solution to more general problems, then you need to provide it with a more general context and more general data. I think that will happen, because these algorithms are becoming more and more capable over time; it's just that the applications that can be derived from these algorithms are all being picked off first. So, I'm getting very involved and interested in reinforcement learning. It's very similar to supervised learning, except that you're changing the definition of the reward: you provide a reward signal rather than supervised corrections saying yes, that was right, or no, that was wrong. The beauty of this type of algorithm is that we're not technically telling it how to do its job; it learns how to do its job itself.
Using algorithms like this in a wider range of more diverse applications will, I think, spawn more general AI. But I still think it won't be purely general AI for a long time.
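As a minimal sketch of the reward-signal idea Phil describes, here is a tiny tabular Q-learning loop. The environment, rewards, and hyperparameters are all invented for illustration; notice that the agent is never told which action is correct, it only ever sees a reward:

```python
import random

N_STATES = 5          # positions 0..4 on a line; state 4 is the goal
ACTIONS = [-1, +1]    # step left or step right
EPSILON, ALPHA, GAMMA = 0.1, 0.5, 0.9

# Q-table: the learned value of taking each action in each state.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(500):
    state = 0
    while state != N_STATES - 1:
        # Explore occasionally; otherwise exploit the best-known action.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state = min(max(state + action, 0), N_STATES - 1)
        # The only feedback is a reward signal: +1 at the goal, 0 elsewhere.
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# The learned policy steps right without ever being told "right is correct".
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)])
```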
Will AI take over my job?
Preben Thorø: So it will still remain a tool for me in my daily work, it will not take over my job?
Phil Winder: Yes. I mean, industries are all being affected in slightly different ways. It depends on the job, and it depends on the tasks that people are doing in that job. I've got an example from a project at the moment, where we are helping a company with a condition monitoring problem.
At the moment, there are lots of people who travel many, many miles to do this condition monitoring manually, and instead of just replacing what they do, we are attempting to do the same job remotely. Those people will still be needed; it just makes their job a lot easier to do. It's not necessarily taking over their job, it's helping them do their job more efficiently. That's where we've got to so far.
Recommended talk: GOTO 2021 • How to Leverage Reinforcement Learning • Phil Winder & Rebecca Nugent
Feynman Liang: Yes, I agree. I think it's highly dependent on which industry you're in and what role you're in. There's been quite a bit of discussion about how autonomous driving is going to take over the trucking industry, and truckers will need to be retrained to do other work. You may see similar trends in low-skilled labor as well, where machines are quite capable of automating tasks currently done by humans. However, one ethical question that comes along is: if a machine makes a mistake, who's at fault? Is it the algorithm's fault, or who do we blame, who do we file the lawsuit against? With a human worker there's no ambiguity here, so for certain tasks where you do need responsibility and accountability, I think humans are still necessary.
Preben Thorø: So the future still looks bright.
Feynman Liang: If you want to be a lawsuit target.
Preben Thorø: Yes. Thanks a lot for joining us today.
Phil Winder: Thank you.
Feynman Liang: Thank you.