
Is Machine Learning a Black Box?

Data science has become a bigger part of software engineering. Where does the path lead? What have the changes been over the last couple of years and where are we heading? In this unscripted episode, Dean Wampler takes you on a journey through data science.





Preben Thorø: My name is Preben and I'm part of the GOTO team. And with me, I have Dean Wampler today. Good morning, Dean. Thanks for getting up so early.

Dean Wampler: Good morning, Preben. That's not that early. So good to be here with you today.

Preben Thorø: Thank you. May I ask you to do a very short presentation of yourself?

Dean Wampler: My name is Dean Wampler. I work for Domino Data Lab, which is a vendor of an integrated suite of tools for data science and MLOps. I've been in the Scala community for over a decade, and for the last decade or so I've worked in the data science industry one way or another, mostly on the engineering side but with some data science. So being at Domino right now is a nice intersection of those two areas of interest for me.

Preben Thorø: Thank you. Yes. And it's no big secret that we've known each other for some years. You're part of the wonderful GOTO community that we have in the Chicago area.

Dean Wampler: Yeah, this conference here has been great. I'll be very happy when it finally goes live again. I really miss it.


The gap between data science and software development

Preben Thorø: We are doing our best. When we briefly talked a couple of weeks ago, you mentioned that one of the things that you are really struggling with is spending a lot of mental energy on the gap between data science and software development. That kind of surprised me because I heard the same story three years ago. Does that mean we haven't really come any further?

Dean Wampler: Yeah, I think it's going through a lot of refinement. The current thinking, for example, is that the DevOps side of the software world has been very successful at solving a lot of problems to make deployments of production software easier, automated, all those things. Maybe we should do something similar on our side, our side being the data science side. I think if you look back a few years, people were still figuring it out: "All right, here's a model. I don't know how to get it into production. I don't know what to do with it after that." And so they would put it in a container and put it in production and hope for the best, and maybe start looking at some metrics to know when the model is performing well and so forth.

And now we're getting to that phase where we really need a little bit more control, more data governance, more automation, repeatability so we can do traceability back when problems happen and so forth. It’s been a process of maturation. If you go into any typical organization of data scientists and data engineers, they'll be all over the map in terms of maturity in this way too. So a lot of that is happening as well as people learn, maybe what they should be doing.

Preben Thorø: So we have actually moved over the past three to five years. We're on a journey here. That's what you're saying.

Dean Wampler: Pretty much. As we all know, nothing happens overnight.

Preben Thorø: No, that's true. How long did it take before we finally got to merge dev and ops together? I think that was a journey, too.

Dean Wampler: It was, and it took something like cloud services, in very generic terms, and organizations like Netflix, Google, and so forth, who realized, "We can never have downtime. We have to figure out how to do this stuff live, deploying new things while keeping the site up and running." And once they figured all that out, they found out, "You know what, we don't have to do these big bang deployments anymore. We can do things incrementally with very little impact on end-users, except in good ways, like new features."

So I think that's where the data science world is coming into. There are several differences. And one I'll just mention offhand is you don't necessarily deploy models over and over again really fast like you might for some applications daily. You typically leave a model running for a while until it's clear that it needs to be updated. Nevertheless, a lot of the same challenges exist in the data world, plus some new ones.

Preben Thorø: Help me, I don't really see how that can be a challenge. You let the model run to gather experience. Or you train the model in production for some time. Or maybe you train it in a test system and deploy it to production when you feel it's mature. How is that different from testing your software over a period and then moving it to production?

Dean Wampler: There are certainly analogies that people are exploiting. You mentioned something that's kind of an interesting, sort of smaller trend and that is, should we train in production or maybe have a production level pipeline, which is fully automated in the usual way that we build software now. Or should we let data scientists continue to build models sort of incrementally iteratively in their experimental environments? And then we take that artifact and check it into version control of some kind and deploy that right away, in which case, it feels a little bit more like how we do software.

I think we're gonna get to the point where most organizations, at least the larger ones, will decide, "You know what, if we deploy a model that way, it's exactly the same thing as 'This software works on my machine. Why isn't it running in production?'" We really need to get to a maturity level where data scientists figure out the model structure, like the so-called hyperparameters, using some sort of representative data to determine that this seems like a good model, and then let an automated pipeline actually build the deployable model itself.

There are some other advantages to that. You can verify that you're training at scale, whereas the data scientists might be using a smaller data set for performance reasons. If you have data governance concerns about who has access to certain data, you can hide that completely behind your training pipeline. There's a bunch of reasons like that, but mostly it's the repeatability and traceability aspects. When I put a model in production, I wanna know exactly how it was built and what data was used to train it, and then be able to trace back. If it's problematic, then I'll have a full suite of traceability artifacts that I can use to understand, or hopefully understand, why it failed.

I think this is important. There are a lot of analogies, and we can leverage a lot of tooling from DevOps, but there are some unique aspects, and there's a maturation process in which people are figuring out what should be automated versus what is still okay to sort of hand-build.
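To make the traceability point concrete, here is a minimal sketch of the kind of record a training pipeline could store next to each model it builds. It is only an illustration, not a description of any particular tool: the function names, the manifest fields, and all example values (model name, hyperparameters, code version) are hypothetical.

```python
import hashlib
import json


def fingerprint_rows(rows):
    """Return a SHA-256 hex digest of the training rows, independent of dict key order."""
    digest = hashlib.sha256()
    for row in rows:
        # sort_keys makes the serialization stable, so the same data gives the same hash
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def build_manifest(model_name, hyperparameters, rows, code_version):
    """Collect what is needed to trace a deployed model back to how it was built."""
    return {
        "model_name": model_name,
        "hyperparameters": hyperparameters,
        "training_data_sha256": fingerprint_rows(rows),
        "training_row_count": len(rows),
        "code_version": code_version,
    }


# Hypothetical example values, for illustration only.
training_rows = [
    {"income": 52000, "age": 41, "approved": 1},
    {"income": 18000, "age": 23, "approved": 0},
]
manifest = build_manifest(
    model_name="credit-approval",
    hyperparameters={"max_depth": 4, "learning_rate": 0.1},
    rows=training_rows,
    code_version="abc1234",
)
print(json.dumps(manifest, indent=2, sort_keys=True))
```

If a model misbehaves in production, a record like this lets you check which data and settings produced it; a change in the training data shows up as a different hash.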


Is machine learning a black box?

Preben Thorø: The real problem here is that machine learning is to some extent a black box. You don't really know all the details inside of it.

Dean Wampler: That's certainly one of the big areas of research and technological innovation that's going on, so-called explainability. And this is especially problematic for very complex systems like neural networks. And it's actually an argument why a lot of people will still rely on simpler mechanisms, sort of more classic machine learning, where it's easier to understand what's going on if it's good enough for their purposes.

So you see that tension a lot where people see neural networks as this new amazing shiny thing and they wanna use it, but then they run into this problem of how do I really understand what the model is doing? Why did it say that person shouldn't get a credit card, or why was that tumor missed in an x-ray? So there's still a lot of those kinds of things that need to be sorted out and are being sorted out for the more sophisticated approaches. And the other reason for sticking with simpler models in a lot of cases is that they're computationally easier to work with, easier to train, and easier to score at runtime. So there's still a lot of tension there about optimal choices of model types and so forth.


Which tools help us?

Preben Thorø: We are on a journey here. Which tools do we have that can help us?

Dean Wampler: That's another area where you're seeing some maturation happening in the industry. One of the things I think is most interesting right now is the notion of a feature store, which is emerging as kind of a boundary. On one side is the data science process of ingesting data that can be in all kinds of states of quality and formats and transforming it into so-called features, which are really just columns with maybe more metadata attached to them. Effectively, it's like a database. And then that feature store is what is used for model training, both for experimentation and for production, and even in some cases for actually serving the data.

So if I have a raw stream of data coming in, I might pass it through this transformation process, put it in a feature store, and then let that be what is used by the production scoring system. So things like that are ways in which people are figuring out where the good boundaries are between these different roles and responsibilities, and what tools best fit the needs on both ends of that spectrum, the producers as well as the consumers.
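As a rough illustration of that boundary, the sketch below shows a toy in-memory feature store. It is an assumption-laden example, not the API of any real feature store product: the `FeatureStore` class, the `to_features` transformation, and the record fields are all hypothetical. The point is only that raw records are transformed once on the way in, and that training and scoring both read the same features.

```python
class FeatureStore:
    """A toy in-memory feature store keyed by entity id (hypothetical, not a real product API)."""

    def __init__(self, transform):
        self._transform = transform
        self._features = {}

    def ingest(self, entity_id, raw_record):
        """Transform a raw record into features and store them."""
        self._features[entity_id] = self._transform(raw_record)

    def get(self, entity_id):
        """Return a copy of the stored features for one entity, e.g. for scoring."""
        return dict(self._features[entity_id])

    def training_set(self):
        """Return all feature rows, e.g. for model training."""
        return [dict(f) for f in self._features.values()]


def to_features(raw):
    """Clean a messy raw record into typed feature columns."""
    return {
        "income_thousands": float(str(raw["income"]).replace(",", "")) / 1000.0,
        "is_returning": 1 if str(raw.get("returning", "")).strip().lower() == "yes" else 0,
    }


store = FeatureStore(to_features)
store.ingest("cust-1", {"income": "52,000", "returning": " Yes "})  # messy raw input
store.ingest("cust-2", {"income": 18000})                           # missing field

rows_for_training = store.training_set()   # consumed by the training pipeline
features_for_scoring = store.get("cust-1")  # consumed by the production scoring system
print(features_for_scoring)
```

Because both consumers go through the store, the cleaning logic lives in one place instead of being reimplemented separately for experimentation and production.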

So that's one of the interesting ones, I think, for just the sort of low-level pragmatics, if you will, of turning data science and data engineering into MLOps, or whatever term you wanna use. As far as some of these issues like explainability and so forth, there have been some interesting tools developed where people are learning how to analyze the impact of data changes or feature changes, if you wanna use that term, and how they impact the scoring of the model, how they impact training.

You know, one of the advantages there is figuring out what features actually matter, so that if I can cut back to, let's say, 100 features versus 1000, that's a huge energy saving and performance benefit. So there's a lot of tooling like that where industry and academic research are starting to attack some of these more sophisticated problems and then figure out how we can leverage that technology to let us use things like neural networks with more confidence.
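One common way to estimate which features matter is permutation importance: shuffle one feature column and measure how much the model's error grows. The sketch below is a minimal illustration of that technique, not necessarily the tooling Dean has in mind. The `model` function is a hypothetical stand-in for a trained model, and the feature names `signal` and `noise` are made up for the example.

```python
import random


def model(row):
    """Stand-in for a trained model: the score depends only on 'signal'."""
    return 2.0 * row["signal"]


def mean_squared_error(rows, targets):
    """Average squared difference between model scores and targets."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(rows, targets, feature, rng):
    """Increase in error after shuffling one feature column across rows."""
    baseline = mean_squared_error(rows, targets)
    shuffled_values = [r[feature] for r in rows]
    rng.shuffle(shuffled_values)
    # Rebuild the rows with only the chosen feature replaced by its shuffled value
    shuffled_rows = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled_values)]
    return mean_squared_error(shuffled_rows, targets) - baseline


rng = random.Random(0)  # fixed seed so the run is repeatable
rows = [{"signal": float(i), "noise": rng.random()} for i in range(50)]
targets = [2.0 * r["signal"] for r in rows]

for name in ("signal", "noise"):
    print(name, permutation_importance(rows, targets, name, rng))
```

Here shuffling `noise` leaves the error unchanged because the model ignores it, while shuffling `signal` makes the error grow. Features whose importance stays near zero are candidates for removal.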

Preben Thorø: For a moment, I was hoping that you would mention the GOTO conferences as one of those tools.

Dean Wampler: It's a really good point because things are moving so fast. It's really hard for most people to keep up with what's actually interesting or what actually is showing promise. I try to keep up with some of the research papers and it's just kind of hopeless for me. But, yeah, I think that the benefit of conferences like GOTO for me is that you can come together and not only learn what experts in your specialty are focused on, what they're figuring out but also to figure out what other people are doing.

And sometimes that sort of cross-pollination is hugely valuable. It's really tempting to just go to conferences that are narrowly focused on your specialty or interest. And they're good. I don't wanna denigrate them in any way but I often find that some sort of light bulb moment happens when I go to a talk that's completely outside my area of specialty and hear what people are thinking about in their context. So I think, to me, that's always been one of the great things about GOTO, the caliber of the sessions and the tracks in general, but also, the cross-pollination that's possible.

Preben Thorø: I think you're right. Now, this should not turn into a big commercial for the GOTO conferences. That was actually meant as a joke. But I think you're right because, like with everything else, you need to find the flavor and the way that works for your company and for your culture. So there's no one-size-fits-all here. One of the best things you can do is actually to hear what worked and what didn't for other organizations, for other teams.

Dean Wampler: Yeah, and if I'm a data scientist, I need to learn more about how to manage projects. That's something that we've gotten very good at in software, well, at least in theory. If you walk into most organizations, they still struggle. But that's the sort of thing now where you see this cross-pollination. Data scientists need to learn more about the engineering cultures that have been maturing, like DevOps and agile methods, and similarly, software engineers or data engineers need to understand data better than maybe they did 10, 20 years ago.


The way we teach software development

Preben Thorø: That's another point, the way we teach software development. Software engineers should realize that there is a new persona that they should work with. It's not just the customer, it's not just ops, it's not just whatever. Now, there's a new type of persona in the picture too.

Dean Wampler: And now that you've said that, it reminds me that software itself is evolving. There's been a lot of interesting stuff said, and it's now starting to materialize, about actually using machine learning to accelerate software development, like GitHub Copilot, which will actually suggest code for you and which hopefully will go beyond just cut-and-paste patterns.

So I think it really is true that you can't just sit still. If you're gonna get the most productivity out of your organization as a whole, you really have to understand where other people are and what their problems are, and how the world is changing and what matters to you, like data becoming even more important than it was in the past.


GitHub Copilot

Preben Thorø: Yeah. By the way, GitHub's new initiative is definitely fascinating, but it is a little bit scary too, isn't it?

Dean Wampler: Right, it's almost like Stack Overflow being automated. You know, the old joke that we just spend our days Googling stuff on Stack Overflow? And even the way we learn has changed a lot. I remember times in the way past, like pre-internet actually, when I would very carefully record information, like I found this algorithm or I'd buy all these books, and now it's much easier to find things when you have questions like, how do I do something? But you also have to be a savvy consumer because there's so much, as in life in general, useful and not useful information out there.

Preben Thorø: That's amazing. Let's pick it up at an in-person conference somewhere, hopefully soon. Thanks a lot for joining me today. Thank you.

Dean Wampler: Oh, it's my pleasure. Good talking with you today.