What is Data Science and Where is it Heading?
We're kicking off 2021 with a new interview series: GOTO Unscripted, with our first round of interviews recorded back when we could still meet in person. GOTO Unscripted takes our conference speakers off the big stage and brings them behind the scenes for an intimate conversation on topics they know best.
What does a data scientist do in their day-to-day job and how will they impact our lives in the future? Join a conversation with Em Grasmeder, code witch at ThoughtWorks, and Evelina Gabasova, a principal research data scientist at The Alan Turing Institute, on how data science is currently shaping our lives and what its potential for the future is.
Preben Thorø: So, we're at GOTO Berlin right now. We're having a lot of interesting speakers from all around the world. I managed to bring Em and Evelina together with me yesterday. Could you please briefly mention your background?
Em Grasmeder: My name is Em Grasmeder. My background is in economics and experimental economics, and that's where I learned to do data analysis and software development. And now, I'm a consultant at ThoughtWorks, so I work on whatever clients ThoughtWorks throws me at, and often I'm doing data projects, whether it's predictions or image recognition and natural language processing. I kind of run the whole gamut.
Evelina Gabasova: I'm Evelina Gabasova. I work at The Alan Turing Institute, which is the British national institute for data science and artificial intelligence, and my background is mostly in computer science. Then, I did a machine learning Ph.D. Then, I worked in cancer research. And right now, I'm sort of in between industry and academia, trying to take problems from industry and apply current research to them.
What is a data scientist?
Preben Thorø: So, what is a data scientist? What is data science? What is it, actually?
Evelina Gabasova: That's a very good question, and actually, it's one of the goals of our institute, to define what a data scientist should look like. I can tell you what we, as the research engineering group within the institute, think it means, and it means, basically, something in between software development and mathematics. It combines a lot of things. It combines a sort of business knowledge of the domain you are working with, knowledge of algorithms, as well as the ability to implement them, make them scalable, and use proper software engineering practices.
Em Grasmeder: I would say a data scientist is a software developer who is also using the scientific method, and often doing work that is non-deterministic. Like, we don't know, at the outset of a two-week sprint, whether we'll have discovered something useful that can improve the work that we're doing, or whether we'll have discovered something where, "It's not useful, that's a good thing to know, but we're not going to use this piece of data," or, "We're not going to use this methodology for approaching the problem." So there's a little bit more trial and error and scientific thinking than happens in traditional software development.
Preben Thorø: I guess being a data scientist is not a protected title. Can you hold a Master's in data science?
Evelina Gabasova: I think you can right now.
Em Grasmeder: Probably by now, yes.
Preben Thorø: Yes?
Evelina Gabasova: Yes, and I saw some very interesting thoughts recently where people are arguing that, even in high school, people shouldn't be learning algebra, they should be learning data science because that's much more useful in real life and is going to be a part of many, many jobs in the future.
Recommended talk: Thinking Like a Data Scientist • Em Grasmeder
How does data science handle the data volumes?
Preben Thorø: But maybe they should, because we tend to talk about there being so much data around us. Is that actually true, or are we just more focused on data that has already been there?
Em Grasmeder: So, the data has always been there. Like, the world is a thing that we can measure. Now we're maybe starting to make data about the data that exists, so you could say there's an exponential growth in that, but I think the meaningful data that's been there has been there forever.
Evelina Gabasova: Well, where I see a difference is that, right now, we have methods to analyze data, even from the past, that we never had before. For example, my colleagues are working on a project called Living with Machines, which is looking at digitized materials from the 19th century, because they want to see the parallels between how machines were introduced and how the industrialization of society was reflected in written literature and in people's sort of diaries and letters, pamphlets, everything, and how it mirrors what's happening in our society right now. And looking at the 19th century, that doesn't seem like a stereotypical data science task, but it is, and current methods allow us to process a lot of that data together.
What will we need more: software developers or data scientists?
Preben Thorø: If we look ahead, out in the future, what would we need the most, software developers or data scientists? And be honest.
Evelina Gabasova: Both. For example, in our team, we are research engineering, and when people join our team, they can choose if they want to be called data scientists or software engineers. People change cycles, and we see it as a spectrum, basically. So, it's about where you put yourself on the spectrum.
Em Grasmeder: I hope to see a convergence in these. I hope to see data scientists doing continuous delivery, doing infrastructure as code, doing test-driven development. I would love to explore more about how do we do quality assurance in our data and in our data models. I think all the skills and learnings that we have from modern best practices in software should be transferred over to the field of data science as well.
Are we focusing too much on machine learning?
Preben Thorø: Most often, I think people are relating data science to machine learning and AI, maybe because it's so easy just to draw that parallel. Are we too focused on machine learning rather than good old plain math?
Evelina Gabasova: Well, I like to use the metaphor of an iceberg because data science is a bit like an iceberg where the top of the iceberg is just fitting the fancy models and using machine learning, and doing all the fancy stuff that gets reported in papers, etc., but the bottom is dealing with the data. And my favorite joke is that data science is 80% data wrangling and 20% complaining about data wrangling. So, that's how I see it, really.
Em Grasmeder: Yes. I really liked what you said earlier, like, "If I gave you the perfect model, would you be able to use it? What would you do with it?" And I think that's the hard part, is positioning yourself in a place where if you had the right model, how could we operationalize it and make it useful. I like to joke... One of the most pressing questions in data science today is, "Couldn't we just use a linear regression for this?" Like, that can get a lot of people pretty far, logistic or linear regression to do your classifications or your predictions, and from there, if you could get a thing that delivers value to somebody using that, then let's talk about making it fancy or making it more complex. But for the vast majority of problems, the hard part is not, "Which algorithm do I use? How do I perfect it?", but "How do I get myself in a position where I can use this model right?"
Recommended talk: Breaking Black-box AI • Evelina Gabasova
Can the work of a data scientist be done by machines?
Preben Thorø: Given that we could define this universal model that would help us in many, many cases, wouldn't that mean that your work as data scientists could be taken over by machines?
Evelina Gabasova: I would love that. No, the problem is that all the model fitting and model selection, etc., can be fairly automated, and there are automated ways to do that already, but the whole dealing with data, that's the difficult part, really, and there are no ways to automate that, yet at least. Because you say, "Oh, data wrangling," but it's, you need to make sure that the data are clean, that they make sense for your use case. You have to make sure that they even make sense for the model. For example, you have a linear regression model, which works fine in many cases, but then you have one outlier, and that completely changes your prediction. And without looking at the data, without understanding what are the assumptions that the model makes with respect to the data, you are hopeless.
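The outlier effect described in this answer can be sketched in a few lines of Python. This is a minimal illustration with made-up numbers, not something shown in the interview; the helper names `fit_line` and `predict` are hypothetical, and the fit is a plain ordinary least squares line through five points.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Evaluate a fitted (slope, intercept) pair at x."""
    slope, intercept = model
    return slope * x + intercept


# Made-up data (an assumption for illustration) that follows y = 2x exactly.
xs = [1, 2, 3, 4, 5]
ys_clean = [2, 4, 6, 8, 10]

# The same data with a single corrupted value: 100 instead of 10.
ys_outlier = [2, 4, 6, 8, 100]

clean_prediction = predict(fit_line(xs, ys_clean), 6)
outlier_prediction = predict(fit_line(xs, ys_outlier), 6)

print(f"prediction at x=6 without the outlier: {clean_prediction}")
print(f"prediction at x=6 with the outlier:    {outlier_prediction}")
```

With these numbers, one bad value out of five moves the prediction at x = 6 from 12 to 84, which is why looking at the data before fitting matters.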
Will quantum computing change things?
Preben Thorø: Well, that's just a matter of machine power, isn't it? So, we see quantum computing slowly appearing on the horizon somewhere. Wouldn't that completely change the picture?
Evelina Gabasova: I don't think it's solving the problem that data science has right now because, for example, I'm not worried about the machines taking over and all the jobs being automated, etc. I'm worried about people using the algorithms that we have right now incorrectly, with unreliable data, using them to affect people's lives without actually understanding what the model does, why it is biased, and where it is biased, etc. And quantum computing doesn't solve that at all.
Preben Thorø: So, it's more about educating the world and each other than just continuing to add tools, that kind of thing.
Em Grasmeder: And then, using the algorithms for the right things. I mean, the job is not the end goal. I think our goal in society is to provide safe and happy and healthy...things that can enable people to be safe and happy and healthy. And a job is a means to that end right now, but if automation takes my job and we can still have a society that ensures that for all people, then great.
Evelina Gabasova: Then why not?
Preben Thorø: I think that's a wonderful way to conclude this discussion. Thanks a lot for joining us.
Em Grasmeder: Thank you very much.
Evelina Gabasova: Thank you.