What Does It Take To Be a Data Scientist?
Data science is so much more than collecting, sorting and analyzing data. What does it take to be a data scientist, and what does a day in the life of a data scientist look like? Ekaterina Sirazitdinova, Prayson Daniel and Nicholai Stålung give you an insight into this and more.
Intro
Nicholai Stålung: Welcome to this interview about data science and artificial intelligence. I have Katja and Prayson with me today. I'll have them introduce themselves. My name is Nicholai Stålung, and I'm the lead data scientist at Trifork. Trifork is a software consultancy company where we try to bridge the gap between needs and technology. I'm really excited you are here. I think we can have a nice discussion and I'm looking forward to your stories. Katja, can you please introduce yourself?
Ekaterina Sirazitdinova: Hi, I am a data scientist at NVIDIA. I work in Munich, Germany, and my background is in deep learning and computer vision. In my day-to-day work, I help customers develop and deploy efficient AI models for video analytics. Thank you for having me today.
Prayson Daniel: I'm Prayson Daniel, and I'm a principal data scientist at NTT DATA Business Solutions here in Copenhagen, in the Nordics. My day-to-day job is causing trouble in the ML world. But basically, we help our clients build ethical and trustworthy algorithms that they can put into production with a clear conscience. And as I said, I'm causing trouble: I am always trying to challenge some of the norms that get thrown around with GDPR and ALTAI and all those things.
Interpretability and explainability
Nicholai Stålung: You just had a talk on “Black Box Models” and whether we should even look into them. Maybe you could elaborate on your talk for us?
Prayson Daniel: Most of the time, we're living in a kind of, I'm usually scared to say, illusion, but let me just use that word even though it's a heavy one. Whenever we can explain our model, or whenever we can interpret it, we conclude that therefore it is trustworthy. And my case is that it is not, really. We can have a model that is completely transparent, very interpretable, but extremely biased. And then, on the flip side of the coin, we can have a completely uninterpretable model, God forbid, a complete black box, and yet a very ethical one.
So, my case is: what stops me from using one over the other? Maybe the trustworthiness and accountability of a model have nothing to do with the model itself, the mechanism, but with the entire space behind it. So, my talk went beyond the interpretability and explainability of the model and came up with a way in which we can fulfill all these needs of fairness, etc., without having to unveil the models themselves.
Nicholai Stålung: What is your take, for the layman? How do you attack this task?
Prayson Daniel: I usually love telling stories and coming up with explanatory images. My best image is of a container: we fill the container with water and we start making holes in it. Assume the container is transparent, almost like glass. We know that if we make a hole near the top, the water comes out with little pressure, and if we put the hole a bit further down, the pressure is a bit higher. We can also see that the water is at a certain level, and when we don't see water coming out of a hole, we can say the level has dropped past it. But then, what happens when we paint this container completely black and repeat the same experiment of putting in holes?
Apparently, we can still explain the pressure and know whether the water has passed a certain level without seeing where the water is. So, in this case, we come up with something called counterfactual explanation, in which you just need to test your model: if I put in these features, what do I get back? And if I can give a different counterfactual and get the same result, then I can come up with a good explanation. For example: my model is not affected by gender. So, I can take the input, change only the gender parameter between male and female, and see whether I get the same predictions. Right? And: my model is not affected by ethnicity.
So, I will pass a Christian, a Jew, a Muslim, and see whether we get the same results. And if those tests pass, then I don't need to unveil my model to tell you that it fulfills the guidelines we are trying to meet.
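In practice, such a counterfactual check can be a few lines of code: hold every other feature fixed, flip only the sensitive one, and require identical predictions. A minimal sketch, assuming a scikit-learn-style `model` and a hypothetical `gender` feature (both stand-ins, not from any system the speakers describe):

```python
import pandas as pd

def counterfactual_gender_check(model, applicant: dict) -> bool:
    """True if flipping only the gender feature leaves the prediction unchanged."""
    flipped = applicant.copy()
    flipped["gender"] = "female" if applicant["gender"] == "male" else "male"
    pred_original, pred_flipped = model.predict(pd.DataFrame([applicant, flipped]))
    return bool(pred_original == pred_flipped)
```

The same pattern extends to any protected attribute: vary one feature at a time and check that the output does not change.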
Nicholai Stålung: Katja, how does NVIDIA tackle this? Or how do you tackle the whole area of explainability and interpretability?
Ekaterina Sirazitdinova: I work more on the practical side. NVIDIA does, of course, do some research in explainability and interpretability, but for my particular customers and clients, the general outcome of the model matters most. How does your model perform on test sets? And the most important question: how does the model perform in a real setting?
There's often a case where our customer or partner has trained a model for some customer of theirs, and they said, “Yay, the model works well in the laboratory scenario,” but then the model is put into real life and it doesn't work. It doesn't detect anything, or it doesn't segment anything. So then we help our customers and partners solve these problems. And the best way, I think, is still just testing, testing, testing the models. Of course, there are algorithms like Grad-CAM, where you can visualize the attention, but it is really questionable whether what the model visualizes is what leads to the final outcome.
I also agree with you and your position that it's probably not really necessary to look inside, but you really need to take care of what is coming out of your model.
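For readers who want to see what such an attention visualization involves, here is a minimal Grad-CAM sketch in PyTorch. This is an illustration, not NVIDIA code; it assumes torchvision's pretrained ResNet-18 and uses a random tensor in place of a real image:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1").eval()
activations, gradients = {}, {}

def save_activation(module, inputs, output):
    # keep the feature maps of the last conv block, and hook their gradient
    activations["value"] = output.detach()
    output.register_hook(lambda grad: gradients.update(value=grad.detach()))

model.layer4[-1].register_forward_hook(save_activation)

x = torch.randn(1, 3, 224, 224)        # random tensor standing in for a real image
logits = model(x)
logits[0, logits.argmax()].backward()  # backprop the top-class score

weights = gradients["value"].mean(dim=(2, 3), keepdim=True)  # pool grads per channel
cam = F.relu((weights * activations["value"]).sum(dim=1))    # weighted activation map
cam = F.interpolate(cam.unsqueeze(1), size=x.shape[2:],
                    mode="bilinear", align_corners=False)
print(cam.shape)  # (1, 1, 224, 224): a coarse heat map of where the network looked
```

As Ekaterina points out, a heat map like this shows where the network looked, not necessarily why it decided, so it complements rather than replaces real-world testing.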
Testing models outside of the lab
Nicholai Stålung: That's very interesting, and we all agree on that. You're both advocating a testing regime for models outside of the lab, which I'm also a heavy advocate for, because that's the only way we know how these models will perform. In our world, we're dealing with variance, as opposed to normal computer scientists. That means we cannot just deal with Boolean values and say it's passed or not passed; it can be any value in between. Right? My next question is for you, because you're working primarily with video and image analytics. Could you please explain what that entails? And what parts of the pipeline of deploying models are you involved with?
Ekaterina Sirazitdinova: I'm actually involved in the whole end-to-end process. I usually start early with training the model and, if applicable, curating the data set: checking whether the data set is good enough and whether all classes are well represented in it. Then the next important step is training the model efficiently. It's very important to iterate fast, to try different hyperparameters. And training also needs to be done in the most efficient way, by utilizing multiple GPUs if possible and by utilizing Automatic Mixed Precision. Many people don't know that this technique exists, and they still run their long trainings even when they have GPUs that support it.
Nicholai Stålung: Maybe you can elaborate on mixed precision, because it's something new for a lot of data scientists out there.
Ekaterina Sirazitdinova: It's a hardware-associated feature. The newest generations of NVIDIA GPUs have a feature which enables training in mixed precision. You have your data in Floating-Point 32 precision, and it enters the model. But then, when you run through the convolutions, you do it in Floating-Point 16. To preserve accuracy, you switch back to Floating-Point 32 before calculating the loss and doing the optimization step. And then you switch back to Floating-Point 16 again and backpropagate in Floating-Point 16.
By doing that, you can run your training faster, you can train bigger models, and you can use bigger batches. That is a big advantage. And it's usually no more than a couple of lines of code in popular frameworks like TensorFlow or PyTorch, or, if you're using our TAO Toolkit, it's basically a single parameter which you pass on the command line. It's a super nice feature, and many people just don't know about it. But it's really important to iterate more often and faster when you tune your hyperparameters, right? And then after the training, of course, there is deployment. I also advise our customers on how to deploy their models in the most efficient way using our frameworks.
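In PyTorch, those “couple of lines” look roughly like the sketch below: a minimal, self-contained example assuming a CUDA-capable GPU, where the tiny model and random data are stand-ins for a real training setup.

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

device = "cuda"  # AMP as shown here requires an NVIDIA GPU with FP16 support
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.Flatten(),
                      nn.Linear(16 * 30 * 30, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()
scaler = GradScaler()  # scales the loss so FP16 gradients don't underflow

for _ in range(10):  # stand-in training loop with random data
    inputs = torch.randn(8, 3, 32, 32, device=device)
    targets = torch.randint(0, 10, (8,), device=device)
    optimizer.zero_grad()
    with autocast():                  # forward pass runs in mixed FP16/FP32
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()     # backprop on the scaled loss
    scaler.step(optimizer)            # unscales gradients, then takes the step
    scaler.update()                   # adjusts the scale factor for the next step
```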
The best model is no model
Nicholai Stålung: Mixed precision is something I found useful just a few days ago; it was an eye-opener in terms of training time on the GPUs that we use. Prayson, can you talk about your preferences within data science and what piques your interest?
Prayson Daniel: After going through different journeys as a data scientist, I've come to the conclusion that there are different paradigms as a data scientist. The first time I received a task, the younger me would just jump in: XGBoost is the first algorithm you go to.
Nicholai Stålung: Like all of us.
Prayson Daniel: Every one of us, right? But then I discovered that, as you mature, you start wondering: maybe, sometimes, the best model is no model. And people will be like, "What?" And I say, okay, the best deep learning is no deep learning. And they say again, "What? No, we love that."
But I discovered that in most cases we need to start with basic questions. Is a statistical table going to solve your problem? If it does, then probably we shouldn't go any further. If you're not doing image classification, maybe a simple linear model or a logistic model will do the job, so there's no need to pull the big guns out. And then there are times when we do need to pull the guns out; we need to go deeper.
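That “start simple” instinct can be made concrete: before reaching for XGBoost or deep learning, compare a no-model baseline against a plain logistic model. A minimal sketch with synthetic stand-in data (an illustration, not Prayson's code):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data

# "no model": always predict the most frequent class
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y).mean()
# the simple model: plain logistic regression
simple = cross_val_score(LogisticRegression(max_iter=1000), X, y).mean()

print(f"no-model baseline: {baseline:.2f}, logistic model: {simple:.2f}")
# only pull the big guns out if the simple model leaves real accuracy on the table
```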
My area mostly comes down to Transformers and natural language processing. We do images, but not at that scale: very basic image segmentation, and tracking and measuring animals' weight and height just from images. But the deepest part we've been working on is natural language processing, and that is where there is huge attention at the moment, with all this cool stuff we are borrowing from computer vision.
So, thanks to them, we are applying computer vision ideas in natural language processing. And we hope maybe some of our ideas are going back the other way, which I see happening: they're starting to use attention mechanisms in computer vision too.
Ekaterina Sirazitdinova: True.
Prayson Daniel: It's like a back and forth, helping different fields work together. What excites me most when it comes to all the AI fields at the moment, though maybe later I'll change my mind, is the whole ethics side, and how we make sure that we have an end-to-end pipeline that ensures trust: from the moment we acquire data, and the moment we agree on what we're going to do, to the deployment and the continuous monitoring of our model. How do we create a pipeline that is traceable and visible, and that we can explain to different stakeholders? We can hold each part of this puzzle accountable, from the person who sent the data to the person who received the data, to the person who made sure our data was not mislabeled.
I just discovered that nowadays you can do a task like computer vision and try to beat what you'd call the standard benchmark out there, but then you look at the benchmark and its data set is corrupted in some places. They say this is a ship, but when you look at the ship, you go, no, this is a car.
Nicholai Stålung: I read somewhere that 40 percent of the COCO dataset is actually mislabeled.
Prayson Daniel: Sometimes we are actually optimizing toward those wrong labels, and then we need to go back and say, "Okay, we need to check our data sets. We need to re-label them again." And there are good techniques to find those things nowadays, right?
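One such technique is confident learning, available for instance in the open-source cleanlab library (an example; the speakers don't name a specific tool). The idea: train any classifier, obtain out-of-sample predicted probabilities, and flag samples whose given label disagrees confidently with the model. A minimal sketch with deliberately corrupted synthetic labels:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,
                           random_state=0)
y_noisy = y.copy()
flip = np.random.RandomState(0).choice(len(y), size=25, replace=False)
y_noisy[flip] = (y_noisy[flip] + 1) % 3   # corrupt 5% of the labels on purpose

# confident learning needs out-of-sample predicted probabilities
pred_probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y_noisy,
                               method="predict_proba", cv=5)
issues = find_label_issues(labels=y_noisy, pred_probs=pred_probs,
                           return_indices_ranked_by="self_confidence")
print(f"{len(issues)} suspected label errors, e.g. indices {issues[:5]}")
```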
But again, I'm excited about ethics. I'm excited about classical data mining, the part where we can just look at the data, see associations, and create almost the old-school “if else, do something”. And then, when that is not enough, we go to the next level, and then the next level. So, that's what excites me at the moment.
Nicholai Stålung: How did that become your field of interest? Did you have a bad experience, or was it more like, this is something that's not covered by people, or that people are talking about but don't really know what they're talking about?
Prayson Daniel: So, it has to bite back, and it bites back when you receive legacy code. Somebody has built this wonderful architecture with deep learning, they have done all these things, and now they're gone. Because data scientists have this 2.5-year lifetime...
Ekaterina Sirazitdinova: [laughing] What do you mean by that?
Prayson Daniel: Go look at data scientists' profiles. After 2.5 years they're off to another company, and then another company. I was a victim of that. I've done that, but hopefully, I am staying where I am.
Nicholai Stålung: Are you a victim of that as well?
Ekaterina Sirazitdinova: I can attest.
Prayson Daniel: So, we usually say 2.5 years, and then you receive someone else's legacy code. And you go, "Oh heavens, how do I proceed?" while they say, "We would just like you to add this one extra feature." And you go, "Oh, it's not just an extra feature. I don't even know how to deal with this." In most cases, when you receive legacy code, you have two options: one, you quit your job, because nobody wants to maintain it, or two, you toss it out and start from scratch. Or the best one: you take one component out and put another in, and you cross your fingers that the person before you had good documentation and all of that.
But then I discovered that if you go simple, and you have this in your git history, one can always check out an earlier commit and build from the point where the code was still working. So we start with very simple changes, then build on top and commit each change. And in our commits, I use a lot of tags: tag one "baseline", then tag one "before deep learning".
That way I know I can check out "before deep learning" and build maybe another deep learning model on top of that. All these parts are really important. So, that experience is what made me always go simple.
Simplify the workflow for data scientists and perfect automation
Nicholai Stålung: Branching out early so you can attach multiple models to your workflow, that's an interesting idea. How do you go about that at NVIDIA? Because I know both of you are working to simplify the workflow for data scientists, even though you're working in different fields. Is that top of mind at NVIDIA?
Ekaterina Sirazitdinova: NVIDIA's position is also to simplify the life of data scientists and engineers, and we try to encapsulate everything we can. A good example is our TAO framework, where we hid the whole engineering part behind an executable. To train your model, you run a simple command on the command line. A person who is using it doesn't necessarily have to know deep learning; they need to know how to prepare a data set and what to expect from the model, but the rest is hidden.
And we are now also working on a UI for it. Hopefully next year we'll introduce a UI to make it even simpler for data scientists, who do not necessarily have to be computer scientists, right?
Nicholai Stålung: So, how did you end up in a job where you're simplifying stuff for other data scientists?
Ekaterina Sirazitdinova: I'm not only simplifying. I also write code, and I have a lot of experience; I have a PhD in Computer Science.
Nicholai Stålung: And you have the knowledge that everyone else wants, right? So you're the perfect person to ask.
Ekaterina Sirazitdinova: My main goal is to perfect automation, and also the simplification of things. As a person who can develop these tools, that is what I work on. But I don't expect everybody to know these tools. I don't expect every person to know how to code. And with that, I really want to help people bridge this gap.
Prayson Daniel: Just to add on top of that: what we've been pushing so far for data scientists is mostly the software engineering part, because most of us data scientists came from the notebook world of very messy code. Luckily, I was in a company where we had this inter-team shuffling to remove silos, so I found myself in the software engineering team, because they had to package our models.
And there I was introduced to design principles. Whoa, SOLID. Whoa, what? Encapsulation, dependency inversion, and I go, "Whoa, this is like dessert." So, when I came back to our team, I went, "Oh, guys, we have to do SOLID, encapsulation. Always abstract," and all this, and they went, "Oh, it's too much code." Yes, too much code, but it's unchanging code: I only have to change this one part, and the rest remains the same. So, all this work of making things simpler actually also comes from the way we build our software, the way we do software design.
If you can get that right, then it's easy to automate, because it follows guidelines. When I come to you and ask, "What have you done?", you can say, "I have used the strategy pattern," or "I have used the factory pattern." Then everybody knows: if you used a factory, this is how things work. And in that case, people can embed the models in a UI or an executable. But I also don't want people to be scared to make a change to a codebase.
Whether you're very skilled or not, we introduce you to testing first; everybody has to go through that when you join my team. Test, test, test. And then, after we introduce you to testing, we introduce you to the basic SOLID design principles. So you know that if you come to our code, you will have the abstract classes over there, the implementations here, and everything in its place. And then, your code. And then we say, "Change this as much as you like," and because we enforce some of those principles through abstraction, we are usually on safe ground, mostly.
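As a rough illustration of the abstraction-first style Prayson describes (the names here are hypothetical, not from any real codebase): the evaluation code depends only on an abstract interface, so any concrete model can be swapped in without touching the rest, and the baseline even gets its own small test.

```python
from abc import ABC, abstractmethod

class Model(ABC):
    """The abstract class: the stable contract everyone codes against."""
    @abstractmethod
    def fit(self, X, y) -> "Model": ...
    @abstractmethod
    def predict(self, X) -> list: ...

class MajorityBaseline(Model):
    """'No model' baseline: always predict the most common label."""
    def fit(self, X, y):
        self.majority_ = max(set(y), key=list(y).count)
        return self
    def predict(self, X):
        return [self.majority_] * len(X)

def accuracy(model: Model, X, y) -> float:
    # depends only on the Model abstraction (dependency inversion),
    # so swapping in a deep learning model changes nothing here
    preds = model.fit(X, y).predict(X)
    return sum(p == t for p, t in zip(preds, y)) / len(y)

def test_majority_baseline():
    # the "test, test, test" part: a tiny unit test for the baseline
    assert accuracy(MajorityBaseline(), [[0], [1], [2]], [1, 1, 0]) == 2 / 3
```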
Is coding part of the future of data science?
Nicholai Stålung: I know both of you are very good coders, but you're also working to simplify things, as we've spoken about. But is coding part of the future of data science? Because you're both young, and you will influence how the data science community evolves over the next many years. Is coding the future of data scientists?
Ekaterina Sirazitdinova: Well, it depends. I think it's like in every field: you pick your own path. If you don't feel comfortable coding, you don't have to; there are tools for you. But if you want to create something new, of course you need to code. You were also talking about design patterns, and I think in general it can be characterized as reusability. Reusability of code is the most important thing, and when creating code, one needs to keep that in mind above all else.
Nicholai Stålung: What's your take?
Prayson Daniel: The future is still coding. Why do I say that? I think there is a difference between what we call a citizen data scientist and a data scientist. A citizen data scientist is, let's say, someone who just wants to get something done: I want to train my model and do something with it. I'm not going to develop a new model; I really don't care about building another optimizer. At that level, you really don't need to learn much; there are so many tools out there to do AutoML for you.
But at the same time, I say the future is still code, because code gives you the freedom to invent new things, the freedom to augment, the freedom to think outside the box. And I know it gives you this godlike complex, where you think things into being: it was not there, and now it's there. You start with a blank file, maybe a .py for your Python or a .cpp for your C++, whichever language you know, to do all these kinds of things. And the no-code world has been here since when, 1996? "This is the end of coding." And we're still coding.
And then they create these GUIs where you can put blocks together, but somebody is still coding, because we're still inventing new things. Coding is here to stay. And the best thing is, we should just train data scientists and non-data scientists to code in the right way, taking all the principles from software design.
Ekaterina Sirazitdinova: I have a somewhat controversial take on that. As a data scientist, my dream is to develop algorithms, and maybe I will not do it, but hopefully somebody will someday, for some kind of general AI. Just imagine code which writes code. I think this is the general target and what we are aiming at. But I guess it will be years until it's possible.
Nicholai Stålung: That sounds interesting. The last question I have is on top of that, so what will you be working on tomorrow? What's your interest tomorrow, as opposed to today?
Ekaterina Sirazitdinova: Well, I'm still very interested in images and image processing, but I think it's all converging: we all process different sorts of data, and images come together with text. I'm looking into mixed deep learning models to gather all the insights from the world into one big insight.
Nicholai Stålung: Wow, that sounds like a big task. Good luck with that.
Ekaterina Sirazitdinova: Thank you.
Nicholai Stålung: And you, Prayson? What will you be working on tomorrow?
Prayson Daniel: I am working on raising an army of data scientists who are good software developers and are also good at performing ethical tests. Good software developers, meaning they can perform software tests, all the unit tests and integration tests, but also this ethical element. So, not really raising an army, but inspiring new data scientists who are really in love with software engineering, because that's the part we're missing. We just need to have this reusable code, and we are starting to see that, like Hugging Face with all the natural language models: they don't care how you built your model. It can always be packaged so that everybody can use it in the same way.
They are on this new path of building beautiful, reusable packages. And for me, I'm just adding more elements of fairness testing and all the other things that need to be done as a whole. But where am I tomorrow? I'm just trying to bridge the gap between data scientists, stakeholders, and software engineers in one space.
Nicholai Stålung: Sounds interesting. I hope you'll raise your army, and that you'll get your model. I really appreciate you taking the time to answer a few questions. It's been lovely, and very interesting to peek into your heads. So, thank you very much.
Ekaterina Sirazitdinova: Thank you for having us.
Prayson Daniel: Thank you. Yeah.