How to Leverage Reinforcement Learning

Reinforcement learning focuses on sequential decision making, which sets it apart from the one-shot predictions of most machine learning. Recommendations based on RL algorithms could help you attain your business goals faster while also increasing engagement.


Join Phil Winder, author of Reinforcement Learning and CEO of Winder Research, in a conversation with Rebecca Nugent, professor of statistics and data science at Carnegie Mellon University, as they discuss key concepts and the industry applications of reinforcement learning. They also cover some of the fundamentals like “problem definition,” “how to get started with RL” and the switch in risk taking from the management level to the engineering department.


Rebecca Nugent: Thank you so much for having me. My name is Rebecca Nugent. I'm the Stephen E. and Joyce Fienberg Professor of Statistics and Data Science at Carnegie Mellon University. I'm also the associate head of the Department of Statistics and Data Science. And I'm very excited today to be talking about reinforcement learning with Phil Winder.

Phil Winder: Hi, my name is Phil Winder. And I'm the CEO of Winder Research. We're a data science consultancy. And we're going to be talking more about this book that I've just written about reinforcement learning. Snap. Okay, how should we get it kicked off?

Rebecca Nugent: Well, let's just kick things off and dive right in there. First of all, super bonus points for the best cover picture because these penguins are amazing.

Phil Winder: Yes.


Reinforcement learning vs machine learning?

Rebecca Nugent: Second of all, when we're thinking about reinforcement learning, a lot of people may not have a good idea of what reinforcement learning is versus machine learning, and so on. So I thought we could just start with: Phil, philosophically, what separates reinforcement learning from machine learning?

And as an example, I would say, let's say that I'm in "The Matrix," right? Morpheus is standing in front of me, and he offers me the blue pill which lets me go to bed and wake up and believe whatever I want to believe. Or I get to take the red pill, and I see how deep that rabbit hole goes. What do I want helping me make my decision? Do I want reinforcement learning helping me there? Or do I want machine learning?

Phil Winder: That's a really interesting question. There are two aspects to it: a practical aspect, and a philosophical and theoretical aspect. Practically, I consider reinforcement learning to be a sub-discipline of machine learning, because reinforcement learning builds upon machine learning. It uses models within the algorithms to describe things.

Machine learning is still absolutely necessary. But philosophically, it's quite different. The whole objective of machine learning is to automate and optimize decisions, but single decisions, only one decision.

If you have lots of examples of the same decision, then you can build some models and build some descriptive statistics to basically say, which is the best decision given any input? The difference between that and reinforcement learning is that reinforcement learning attempts to model and therefore optimize sequential decision making.

So not just one decision. In your example there, red pill or blue pill is very much a machine learning question. In this instance, "what is the best pill to take?", you could try lots of things many times and find out which pill is best on average.

An example of something more oriented towards reinforcement learning, though, is if you are already within "The Matrix" and you're exploring it to try and figure out the best path through. That would be where reinforcement learning comes in, because you would try certain avenues, you would try going down certain corridors, and you would find that some of them are good and some of them are bad.

Over time, you can learn which are good and which are bad. That will create a strategy, one that allows you to make multiple decisions over time that are more optimal than ad hoc, one-shot decisions saying left or right.
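
As an illustration of that contrast (not something discussed in the conversation), here is a minimal Python sketch. The maze, its state names and its reward values are invented for the example. It does no learning at all; it only shows that picking the best immediate reward at each step can lose to a strategy that looks at the whole sequence of decisions.

```python
# Hypothetical two-step maze: each state maps an action to (reward, next_state).
MAZE = {
    "start": {"left": (1, "dead_end"), "right": (0, "corridor")},
    "corridor": {"left": (0, "dead_end"), "right": (5, "exit")},
    "dead_end": {},  # terminal: no actions available
    "exit": {},      # terminal: no actions available
}


def greedy_total(state):
    """Follow the action with the best immediate reward at every step."""
    total = 0
    while MAZE[state]:
        action = max(MAZE[state], key=lambda a: MAZE[state][a][0])
        reward, state = MAZE[state][action]
        total += reward
    return total


def best_total(state):
    """Best achievable sum of rewards from `state`, considering the whole sequence."""
    if not MAZE[state]:
        return 0
    return max(reward + best_total(nxt) for reward, nxt in MAZE[state].values())


print(greedy_total("start"))  # one-shot thinking: takes the +1 and hits a dead end
print(best_total("start"))    # sequential thinking: gives up the +1 to reach the +5
```

A reinforcement learning agent would not be handed `MAZE`; it would have to discover those rewards by trying corridors, which is the part the sketch leaves out.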

Rebecca Nugent: Okay, so that's helpful. Maybe we don't have to put this in the final cut, but I'll tell you: I actually auditioned for one of "The Matrix" movies, and I did not get a part. It is one of the great regrets of my life. I did not have reinforcement learning to help me make my decision; I was just trying to get into the movies.

Key takeaway no. 1

Machine learning is all about predictions for right now while reinforcement learning deals with sequential decisions, optimizing strategies looking at what might happen in the future.


Key terminology for reinforcement learning

Rebecca Nugent: Okay. Well, for people who are less familiar with reinforcement learning, maybe we could just walk through some terminology basics, some terms that you use in your book, and then we could maybe take that and adapt it to another example that people might want to walk through.

So for example, one common word is action. How would you describe an action?

Entities: agent, environment and reward

Phil Winder: An action is something that... So actually, let's step back a little bit and describe the two main entities. The two main entities are actually an agent and an environment. An agent is the thing that is making decisions. It could be a piece of software, it could be a robot, it could be a human being.

So this is the thing that's actually making decisions. The environment represents the... I'm trying to not use the word environment again. The environment is basically everything else that the agent resides within. So the environment could be the real world. It could be all of the information that you have about a particular user.

It's all of the information that the agent can use to form and make a decision. But also, and this is the interesting part, the environment actually contains states, and it often contains hidden states. So you can imagine that I can observe the weather outside, but there's some inherent state that is managing the weather, you know, millions of miles away, that I can't actually observe.

That doesn't mean it doesn't exist; it's just a state that is hidden. So we've got the agent, and we've got the environment, and the agent can enact or create actions upon the environment. So I can operate within real life, and I can decide to go outside or go inside.

I can decide to turn left or turn right. These are all actions that I've decided upon, which are altering the environment in some subtle way. And then the environment presents back an observation of the state of the environment.

So for example, if I walk outside, if that's my action walk outside, my observation of the environment will change. Because I'm no longer looking at the walls in my room, I'm looking at the trees outside. So typically, when the agent uses or tries an action, it alters the state of the environment, which then alters the observation that you receive back.

And the key to reinforcement learning is that this is a loop — this is an ongoing continuous loop. So the one final and crucial part of the equation is something called the reward. The reward is feedback that the environment gives you when something goes well... It's feedback from the environment.

It could be good feedback, or it could be bad feedback. It could be positive, or it could be negative. And generally, the aim of all reinforcement learning is to teach the agent to maximize that reward. We always want to try and maximize the reward.
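
The loop described here can be sketched in a few lines of Python. This is a toy illustration only: the class names, the two actions and the reward of 1.0 for being outside are assumptions made up for the example, and the agent acts at random rather than learning.

```python
import random


class RoomEnvironment:
    """Toy environment: the agent is either inside or outside (assumed setup)."""

    def __init__(self):
        self.state = "inside"

    def step(self, action):
        # The action alters the state of the environment...
        self.state = "outside" if action == "go_outside" else "inside"
        # ...which alters the observation handed back to the agent.
        observation = "trees" if self.state == "outside" else "walls"
        # Assumed reward: being outside is good, being inside is neutral.
        reward = 1.0 if self.state == "outside" else 0.0
        return observation, reward


class RandomAgent:
    """Placeholder agent that ignores the observation and picks an action at random."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def act(self, observation):
        return self.rng.choice(["go_outside", "go_inside"])


def run_episode(env, agent, steps=10):
    """Run the agent-environment loop and collect the reward along the way."""
    observation, total = "walls", 0.0
    history = []
    for _ in range(steps):
        action = agent.act(observation)         # agent decides
        observation, reward = env.step(action)  # environment responds
        total += reward
        history.append((action, observation, reward))
    return total, history
```

A learning agent would replace `RandomAgent.act` with something that uses the rewards in `history` to prefer actions that paid off.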

Rebecca Nugent: You also use the term policy?

Phil Winder: Yeah.

Rebecca Nugent: You are incorporating it into the description that you just did. But maybe we don't need to go through policy, you actually hit most of the terms, and I was there, so we can probably move on from that one.

Let's say you're a small child. So I have two children. And let's think about... can you walk through an example of talking about your agent and your environment, say we're trying to have a small child learn not to do something dangerous, for example. Or learn to walk or learn to… you know, something that they're learning, right?

They're learning through an experience. Just to kind of give us an analogy or an example that goes along with those terms?

Reinforcement learning (RL) will deliver one of the biggest breakthroughs in AI over the next decade, enabling algorithms to learn from their environment to achieve arbitrary goals. This exciting development avoids constraints found in traditional machine learning (ML) algorithms. This practical book shows data science and AI professionals how to learn by reinforcement and enable a machine to learn by itself. Author Phil Winder of Winder Research covers everything from basic building blocks to state-of-the-art practices. You'll explore the current state of RL, focus on industrial applications, learn numerous algorithms, and benefit from dedicated chapters on deploying RL solutions to production. This is no cookbook; doesn't shy away from math and expects familiarity with ML.

Markov Decision Process

Phil Winder: Humans learn through reinforcement. That's really where the inspiration came from, for the early researchers. And the classic example is riding a bike. So attempting to learn how to ride a bike. The agent in this case is the child and the environment is everything that is relevant to that task.

It's the bike itself, it's the feel, it's the balance, it's the things you can see, the things you can hear. That's all of the environment. The goal is that the child has to try actions, and those actions could be steer left, steer right, pedal or not pedal, or brake, depending on how complicated you make it. Then the child will try one of those actions and see what happens.

Over time, something bad could happen or something good could happen. The bad thing might be falling off the bike. So whenever you fall off the bike, you get some negative reinforcement to say, "Don't do that again because that hurts. Okay, try something else."

The positive reinforcement could be the enjoyment of riding the bike, or maybe you're getting reassurance or applause from a parent or a friend or something like that. That would be a positive experience, which would help the child to decide that the action just taken was a good one and that they should continue trying to do that. And so over time, the child will develop strategies to try and keep doing the things that generate these positive rewards.

Rebecca Nugent: So if we take that a little bit further to talk about one of the underlying models that we need for reinforcement learning, which you spend some time on in the book: could you explain a little bit about what a Markov Decision Process, or MDP, is? And then what form it would take inside that example of the child learning to ride the bike?

Phil Winder: We've used all of the terminology already, and that really is the definition of the MDP. The only extra caveat that we need to add is that every cycle of that loop has to be independent. If it's not independent, then you can get into a situation where you have this ongoing feedback loop that is not stable.

Rebecca Nugent: Ok.

Phil Winder: So every action needs to be independent of the previous action. It only depends on the observation that you've currently observed. That's it. As long as you can provide all of the details that we've just described, the definition of the agent, the environment, the actions, the reward and the observation, and as long as they're independent-ish, then you've got an MDP.
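
The "independent" requirement described here is commonly formalized as the Markov property. As a sketch in standard notation (the symbols $s_t$ for state, $a_t$ for action and $r_t$ for reward are conventions assumed here, not taken from the conversation), the next state and reward depend only on the current state and action, not on the earlier history:

```latex
P(s_{t+1}, r_{t+1} \mid s_t, a_t) = P(s_{t+1}, r_{t+1} \mid s_0, a_0, \ldots, s_t, a_t)
```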

Rebecca Nugent: So how strict is that independence? Or what kinds of guarantees do we need there?

Phil Winder: That's hard to define. It's the same as in machine learning. The vast majority of machine learning algorithms assume independent and identically distributed data. And more often than not, that's not true. But it still works reasonably well in most cases, depending on the algorithm.

So there's a couple of things there. The less independence you have, the more unstable the loop becomes. If it does start to become dependent, you get this reinforcing loop, which could blow the whole learning process up.

An example would be a child eating a sweet. Eating sweets is an inherently positive thing, and it tastes great. And so you just eat more and eat more and eat more and eat more and eat more, but eventually, you're sick. Eventually, you explode.

So there's that. And then also, inside the agent itself, the thing that it uses to learn these optimal strategies is machine learning and statistical models. So again, if you don't have independent data in there, it can actually blow up the model that the agent is using as well. So you have to be careful, but it should be okay if you design your problem well.


Reinforcement learning in education

Rebecca Nugent: One of the examples that you give in your book about where reinforcement learning could be used is in education, so the e-learning, online learning, etc. That's an area that I'm really interested in. One of my main research projects is something called the Integrated Statistics Learning Environment, or ISLE. 

That's a browser-based educational platform. It primarily focuses on doing data analysis, but it's used for other kinds of classes as well. But what we're doing is we're tracking everything the students are doing. And we're tracking everything the instructors are doing.

If they collaborate and work on a data analysis project together, we track their keystrokes and we track what they write. If they write, "God, I hate statistics," and then they delete it, we technically can see it. But what we're excited about is that we end up seeing that so many different people, depending on their backgrounds and their perspectives, have different ways of seeing data and different optimal paths to get to their final answer.

So we think about it as an opportunity to provide a kind of personalized, adaptive learning environment. I was wondering if you could talk me through what an example of a reinforcement learning framework or setup would be for, say, either ISLE or some other kind of e-learning environment where you're aiming for a personalized kind of adaptive optimization for all these different users? How can they get to their final goal, their final learning outcome?

Phil Winder: That's a really interesting bit of research and important as well. There's a couple of different ways in which you could tackle that. If I was treating this as an industrial project, I would kind of start as simple as possible.

Rebecca Nugent: Sure.

Phil Winder: Start with a simpler problem. There's a bit of research that I saw a couple of months ago; it is related, so bear with me. It was a piece of research showing how you can use effective nudges, appropriate nudges at the right time, to get people with type 2 diabetes to do exercise, basically.

What they did is they used reinforcement learning to train a policy, to train a strategy, to nudge people: send messages to people, and emails and articles and questions and things like that, to help them be more active. Over time, this agent was able to learn the best opportunities to send out some information or a nudge to help them lose weight and get healthy.

So I could imagine that a simple first step might be to try and help people with their learning by attempting to learn the best moments in time to give them a nudge. To ask them, "Are you stuck?"

Rebecca Nugent: That's a great idea.

Phil Winder: "Do you need help? What do you think of this?" And over time, the agent would learn the best opportunity to do that. So that's one thing, but I think the Holy Grail for education, like you said, is a pure sort of personalized curriculum for an individual. And, I mean, it's easy to say what the solution is: the solution is basically allowing the agent to actually provide that curriculum. Over time, it would learn the best curriculum, given something that you're trying to optimize.

Say you're trying to optimize for the final year test score. You could try and maximize that score, based upon all of the learning that was done during the year. The agent could switch out and try different things and teach different ways and try all of these things in order to optimize and maximize that score.

The problem with that approach, though, is that you need to allow the agent to make that decision itself. This means that it has to have the content to begin with, to be able to provide it to the user. So, unfortunately, I don't think it's as easy as saying "just use RL", because something still has to generate that content. I think that is actually the biggest problem that you've got, because that is still the one thing that probably isn't as scalable yet as it needs to be.

In order to do that, we would need to scale the process of generating that content. And if you could do that, then I could certainly see that you could start building more personalized curricula.
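
The simpler "when to nudge" idea from earlier in this answer can be sketched as an epsilon-greedy bandit. This is a deliberate simplification and an assumption on my part: a bandit treats each nudge as a one-off choice, whereas the research Phil mentions used full reinforcement learning. The class name, the moment labels and the reward signal (1.0 if the learner responded well, 0.0 otherwise) are all hypothetical.

```python
import random


class NudgeBandit:
    """Epsilon-greedy choice over candidate nudge moments (hypothetical labels)."""

    def __init__(self, moments, epsilon=0.1, seed=0):
        self.moments = list(moments)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {m: 0 for m in self.moments}
        self.values = {m: 0.0 for m in self.moments}  # running mean reward per moment

    def choose(self):
        """Mostly pick the best-known moment, occasionally explore another."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.moments)  # explore
        return max(self.moments, key=lambda m: self.values[m])  # exploit

    def update(self, moment, reward):
        """Fold an observed reward into the running mean for that moment."""
        self.counts[moment] += 1
        n = self.counts[moment]
        self.values[moment] += (reward - self.values[moment]) / n


bandit = NudgeBandit(["after_idle", "on_error", "start_of_session"])
moment = bandit.choose()     # decide when to ask "Are you stuck?"
bandit.update(moment, 1.0)   # assumed feedback: the learner found the nudge helpful
```

With `epsilon=0.1` the agent still tries a random moment about one time in ten, so it can notice if a different moment starts working better.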

Rebecca Nugent: If I come back to that nudge idea for reinforcement learning, I might have a different set of constraints. So for example, if I'm teaching a university course in a typical semester in the United States, it’s about 15 weeks of classes, roughly. I do have some constraints, and I want to get to a particular learning outcome. But I only have so much time, right?

Versus somebody who's maybe in a MOOC or a course that's a little bit more open-ended. Maybe that's kind of a self-paced course. It's one of these gated things, you just move to the next level when you're ready to go to the next level. You could imagine the nudges being a little bit different there. Because they don't have as much of a time constraint, maybe they are just more focused on that final goal and not having to get there in a certain amount of time.

With reinforcement learning, would you be able to learn different sets of nudges depending on, say, whether I'm working in different classroom environments, or maybe even with different people? Say they only have three weeks left in the semester to finish some concepts, and maybe they're working in a different area. That's kind of complex, what I just asked, but could reinforcement learning be adapted in that way?

Phil Winder: Definitely. And there are a couple of different ways of solving that challenge. The first and the simplest way is to simply retrain your RL algorithm for the specific task at hand. So you could say that we're going to train one algorithm for university students, and we're going to train another one for the MOOC students.

Those would be totally separate algorithms; there's no interaction there. That would work perfectly fine. The downside is that you miss an opportunity to learn a more general set of rules shared between them. If you did want to learn something like that, then the next step would be to give the agent information that tells it more about the particular setting.

So for example, you could add in a feature that represents the amount of time left in the course. That would be an important signal that you would feed to the model. You could include information about the individual students, such as demographic information, prior qualifications and past experience. And again, the agent would be able to leverage that when it switches between a quite educated university student and a beginner MOOC-type online learner.

So the key for transferring that information and that learning is actually to include more information about the circumstances that make those two things different in the first place.
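
One way to picture "including more information about the setting" is as extra numeric features in the agent's input. The sketch below is illustrative only: the field names, the cap of 10 prior courses and the scaling are assumptions, not anything specified in the conversation or taken from ISLE.

```python
from dataclasses import dataclass


@dataclass
class LearnerContext:
    weeks_left: float    # time remaining in the course
    course_weeks: float  # total course length, e.g. 15 for a typical US semester
    prior_courses: int   # hypothetical proxy for prior qualifications
    self_paced: bool     # True for a MOOC-style, open-ended course


def to_features(ctx):
    """Turn the context into a numeric vector an agent's model could consume."""
    return [
        ctx.weeks_left / ctx.course_weeks,  # fraction of time left, between 0 and 1
        min(ctx.prior_courses, 10) / 10.0,  # capped and scaled prior experience
        1.0 if ctx.self_paced else 0.0,     # setting flag
    ]


# A university student with three weeks left in a 15-week semester.
university = to_features(LearnerContext(3, 15, 2, False))
# A beginner part-way through a self-paced MOOC.
mooc = to_features(LearnerContext(6, 12, 0, True))
```

A single agent trained on both kinds of vector has a chance of sharing what it learns across settings, while still behaving differently when the time-left feature is small.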


What kind of industry applications is RL best suited for?

Rebecca Nugent: Well, I've completely violated one of the principles of your book, which is to start simple. So we're going to have to break this down, I think, into much simpler tasks. You've mentioned a few times here, and of course it's in the title of your book, that the focus is on industry applications. Why are reinforcement learning and intelligent agents specifically useful for industrial applications? And are there applications that you think it might be better suited for? Or less well suited for?

Phil Winder: That's a good question. I think, principally, the goal of the book was to attempt to build a bridge between academia and industry. The academics and the researchers that have been concentrating on pure reinforcement learning have done a lot of really incredible work. But I saw that even though they have some really great examples and use cases, many of them hadn't really been translated or transferred into industrial applications.

There are probably various reasons for that. But the whole goal of the book was to try and ease that process by moving the knowledge that has been gained across to industry. Industry brings a slightly different set of challenges, in the sense that the priorities are different. In research, the priority is always knowledge for knowledge's sake, basically pushing the boundaries of what we know, whereas industry is far more profit-driven. So the implications are slightly different because of that. But back to your question. Some of the applications that we see in industry have typically been using machine learning.

Sometimes maybe not even anything, because there hasn't been anything suitable. But many of the applications that use machine learning at the moment could be doing better using reinforcement learning. So I don't think that it's industry-specific; I think it crosses many industries, just like machine learning does. But to give a couple of examples, I think one classic example is the task of recommendations.

So recommending things to users, either in like an e-commerce website or maybe through some sort of search functionality, you know, searching for anything. The current breed of recommenders attempts to make a single decision, a one-shot single decision, attempting to make the best recommendation at the time. And businesses like that approach, but it's not really solving the right problem.

The whole goal of running a business is to fundamentally affect some business metrics. We want to try and increase profit. We want to make money. We want to increase the number of users. We want to increase engagement. We want to increase customer happiness, you know, metrics like this. But the machine learning algorithms are not trained against those metrics, because the two are too far apart.

So if you trained some of these algorithms with reinforcement learning instead, what you might find is that it actually makes a series of recommendations that lead to more engagement, or to more profit in the long run, or to more customers because they're happier, and things like that. We're optimizing the recommendations over a longer period of time, to improve those metrics directly.

I think the classic example there is when you're recommending articles to people. It's very easy to get a user to click on an article through some really clickbaity kind of title. But that really impacts the relationship in the future. If somebody clicks on that and really hates the article, they'll never click on one again; they'll never use your website again.

We've got to move past this one-shot decision making where we optimize for that one single decision and actually look at the bigger picture and optimize these decisions over time and over sequential decisions.
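
To make the clickbait argument concrete, here is a small Python sketch of a discounted return, the quantity reinforcement learning typically tries to maximize. The click probabilities and the discount factor of 0.9 are invented numbers chosen for illustration, not measurements.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where a reward t steps in the future is weighted by gamma**t."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Assumed expected clicks per visit over five visits.
clickbait = [0.9, 0.0, 0.0, 0.0, 0.0]  # great first click, then the user never returns
quality = [0.3, 0.3, 0.3, 0.3, 0.3]    # fewer clicks at first, but the user keeps coming back

print(discounted_return(clickbait))  # what the clickbait strategy is worth overall
print(discounted_return(quality))    # what the relationship-building strategy is worth
```

A one-shot recommender compares only the first entries and prefers clickbait; comparing the returns over the whole sequence favours the quality strategy under these assumed numbers.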

Rebecca Nugent: So the idea being that it's not so much that I care whether the user is clicking on this particular title, or what percentage of the time people click on each recommendation that I show, but whether they are continuing to come back to my recommendation engine three months from now, and continuing to engage and click on articles? So thinking about that long-term relationship, is that one way to think about it?

Phil Winder: Yes, exactly. Thinking of another example for a similar problem in advertising. If you imagine you're trying to advertise your product or your brand to a customer, the vast majority of the industry currently charges per click or per view or something like that. The brand and the company, all they know, and all they understand, is how many times someone has clicked on their advert.

They're paying for it every time. They have no idea whether that's actually improving the sales of their products or the awareness of their brand. So you could use reinforcement learning and tie the reward and the feedback back to those final customer sales, and actually optimize your adverts to promote those sales as opposed to just paying for clicks.

So yes, it's trying to make strategic and multiple decisions for individuals in order to find the best strategy for getting them to do what you want them to do.
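
The difference between rewarding clicks and rewarding sales comes down to how the reward function is written. The sketch below is a hypothetical illustration: the event fields `clicked` and `purchase_value` and the cost per click of 0.5 are assumptions, not a real advertising API.

```python
def click_reward(event):
    """Reward used by a click-optimizing system: every click counts the same."""
    return 1.0 if event["clicked"] else 0.0


def sales_reward(event, cost_per_click=0.5):
    """Reward tied to the business outcome: revenue from the sale minus what the click cost."""
    revenue = event.get("purchase_value", 0.0)
    cost = cost_per_click if event["clicked"] else 0.0
    return revenue - cost


curious_click = {"clicked": True}                         # clicked, never bought
buying_click = {"clicked": True, "purchase_value": 20.0}  # clicked and bought

print(click_reward(curious_click), click_reward(buying_click))  # identical under click reward
print(sales_reward(curious_click), sales_reward(buying_click))  # very different under sales reward
```

In practice the sale may arrive long after the click, which is why this is a sequential problem and not a one-shot prediction.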

Key takeaway no. 2

Recommendations based on RL could lead to better engagement and attaining the business goal faster.


Can we apply RL to inform long-term strategies? 

Rebecca Nugent: It feels like with most models and algorithms and forecasting and just in general, we're going to do better when it's a little bit more local. But do you see applications of reinforcement learning being used to think about strategies that are months in advance down the road or years in advance? Or are most people using them for maybe shorter-term goals?

Phil Winder: Well, the main difference between describing problems with machine learning and reinforcement learning is that with reinforcement learning, you've got this inherent decision to continuously retrain your agent to perform better in the future. So most of the time it's a single action that happens before it learns again.

Timescale-wise, there isn't really a timescale on that, because that one action could affect something for many months, or it could affect something right now. My decision to walk outside, for example, is just one decision, and I will learn from that. But that one decision to go outside could affect me for years to come. So the concept of time and future prediction is a little bit different.

Rebecca Nugent: Yes, that's a really good point. Let me rephrase what I was getting at. Say I'm the CEO of a company, and my group of engineers is coming to me and saying, "We want to adopt this RL framework. This is the direction that we want to go." How is that business leader thinking about what the investment is here?

And I completely hear your point on timescale; it's hard to know. But suppose you're trying to sell this to your C-suite: we want to think far more about sequential decision making and these longer-term strategies, and we may decide to take an action because, when the RL agent was exploring the space, it saw something way down the road that it wants to optimize.

The current state may seem really odd. So how do you sell that time aspect, I guess, to your business leaders if you're trying to get them to move in that direction? That may be a better way to think about it.

Phil Winder: Yes, that's a really good question. And there's no technical answer to that. That's pure salesmanship. I think, just like in machine learning, the best way to prove something to someone else is through evidence. It depends on what stage you're at. It never goes from "this is an idea" to "this is a full-on implementation that we're going live with, for real customers, right now."

There are phases and steps between those two things. So the first thing would be, "I have a simulation, and it looks quite good. I'd like to continue this with some real data." Then you get some real data, and you say, "We've just tried this on real data, and we think that we can actually make a big difference. Look at these results, they're really good." And then maybe you start shadowing other systems. So you don't actively provide the action, but you suggest what the action would have been.

And you see from that shadowing experience that it's no worse than the current machine learning, and that gives you more confidence. Then you start testing it with a very small percentage of real users. After that small-scale test, you find again that it is doing really well, and then you scale it up, and so on, and so on.

So it's very iterative, you know, you wouldn't go from point A… from start to end in one big jump, you would take a phased approach, much like you would in any kind of critical sort of ML application.
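
The shadowing phase can be sketched as follows. This is a minimal illustration under assumptions: the function name, the policies and the log format are hypothetical, and a real shadow deployment would also need a way to estimate how the suggested actions would have performed, which simple agreement counting does not give you.

```python
def shadow_compare(requests, production_policy, candidate_policy):
    """Serve the production action; only log what the candidate would have done."""
    log = []
    for request in requests:
        served = production_policy(request)    # this is what the user actually gets
        suggested = candidate_policy(request)  # never shown to the user
        log.append({"request": request, "served": served, "suggested": suggested})
    matches = sum(entry["served"] == entry["suggested"] for entry in log)
    agreement = matches / len(log) if log else 0.0
    return log, agreement


# Hypothetical policies: the current system always shows item "A",
# the candidate RL policy would show "B" on odd-numbered requests.
production = lambda request: "A"
candidate = lambda request: "A" if request % 2 == 0 else "B"

log, agreement = shadow_compare([0, 1, 2, 3], production, candidate)
```

The disagreements in `log` are the interesting cases to review before moving on to a test with a small percentage of real users.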


Are the people running the algorithms running the company?

Rebecca Nugent: Well I imagine one of the selling points would also be that the performance metric can change, right? That you're able to really define what performance metric you're looking for, but also the reward structure. So if the business context changes, right? Or the business goals change, that those kinds of changes can be incorporated into the reinforcement learning, right?

So that might be one way to think about prioritizing what the business side might want. The shadowing made me think of something, but I want to get back to one of the questions that I had specifically about this executives-versus-engineers issue. In Chapter 9 of your book, one of the sections that stuck out to me was a short discussion about how executives are typically the ones assuming the risk of making strategic decisions for companies.

That happens all the time, right? And that's their job. But the decisions about how to design a reward structure and how to quantify performance directly impact the strategies chosen through reinforcement learning. That really falls on the engineers, and you have some language in there asking: does that mean that eventually the people who design the algorithms are actually the ones who are running the company?

I was wondering if you really do see kind of a shakeup coming down the road, about who is assuming risk, or sort of who's making the strategic decisions for the company. Albeit maybe not through verbal indications of what the strategy should be, but through the design of algorithms that are helping dictate the business strategies? I thought that was really fascinating. I wonder if you could talk more about that.

Phil Winder: Yes. So that's just a trend of the replacement of who takes on the risk. And it's happening throughout the tech industry, with self-driving cars, for example. The responsibility for and ownership of decisions has changed and is moving away from an individual, from a human, to something that is not a human and is designed by somebody else.

So, where do you place that risk, and who's responsible for those decisions, ultimately? This is kind of just an extrapolation of that. Like you said, the executives in the company are generally rewarded the highest, because they're the ones that have the vision, and quantify the risks, and assume the burden of those risks if they don't play out well.

It's those risk-takers that are rewarded the most highly. It's the same in a stock market or a trading environment. You used to get traders that were paid huge sums of money to make trades because there was the potential of making a lot of money. More and more algorithms are taking over those types of jobs because they're not necessarily more robust, but they're more quantifiable. The risks involved are more quantifiable.

So rather than relying on someone's judgment, you can actually start to quantify the risk of taking a decision in the first place. So it's not that that particular algorithm is better or worse. It just has a known risk profile that is easily quantifiable. And I think that's the real sort of important thing. But, in terms of the way businesses are structured, it kind of inverts the whole hierarchical approach.
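To make "a known risk profile that is easily quantifiable" a little more concrete, here is a minimal sketch that summarizes a list of returns from a decision policy. The definition of value at risk used here (the return at the worst 5% of sorted outcomes) is a deliberately simple assumption, and the function name is hypothetical.

```python
import statistics


def risk_profile(returns, tail_fraction=0.05):
    """Summarize simulated or historical returns of a decision policy.

    'value_at_risk' is simply the return found at the worst `tail_fraction`
    of the sorted outcomes, an assumed and simplified definition.
    """
    ordered = sorted(returns)
    tail_index = int(tail_fraction * len(ordered))
    return {
        "mean": statistics.mean(ordered),
        "stdev": statistics.stdev(ordered),
        "value_at_risk": ordered[tail_index],
        "worst": ordered[0],
    }


# Made-up returns purely for illustration: -10, -9, ..., 9.
profile = risk_profile(list(range(-10, 10)))
```

The point is not that the numbers are good or bad, but that they exist and can be compared across policies, which is harder to do with an individual's judgment.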

I think slowly, that's been happening for a long time. We're sort of changing from a culture of having very hierarchical sort of power structures, they're flattening out over time, especially in the tech industry. Now, it's very rare to see any strict hierarchy anymore. Organizations are usually quite flat. I think that trend is only going to continue when other people sort of get the tools and they get the help to do some of the functions of a business that previously would have been very difficult for them to do.

I gave a couple of examples in the book and I've been talking about it recently about another important subject that's kind of related to this, it's the… sort of the migration of jobs and people's livelihoods and how they're affected by new technologies and things like that. So we said for a long time that software engineering has replaced kind of manual process driven repetitive jobs.

Machine learning started to replace some of the jobs of the simple decision makers. The people that are sort of first-line decision makers that were making decisions about loan applications and things like that. With reinforcement learning, because it's inherently learning optimal policies over time, we're now really moving up the stack.

Now we're kind of starting to take the jobs of the data scientists, because reinforcement learning algorithms can learn for themselves which models are the best to use. So, technology is eating at the heart of the business. And I don't know where that's going to end, to be honest.

Reinforcement learning (RL) will deliver one of the biggest breakthroughs in AI over the next decade, enabling algorithms to learn from their environment to achieve arbitrary goals. This exciting development avoids constraints found in traditional machine learning (ML) algorithms. This practical book shows data science and AI professionals how to learn by reinforcement and enable a machine to learn by itself. Author Phil Winder of Winder Research covers everything from basic building blocks to state-of-the-art practices. You'll explore the current state of RL, focus on industrial applications, learn numerous algorithms, and benefit from dedicated chapters on deploying RL solutions to production. This is no cookbook; it doesn't shy away from math and expects familiarity with ML.


Rebecca Nugent: Well let me push back on the data scientists. Let's go down that road. Just for some context, I ran a data science experiential learning program for our department in which we get teams of students and faculty who work on real, active data science problems that these partners are working on. So it's something they haven't solved.

It's not like a toy problem. It's an open research problem for them. We work with for-profits, nonprofits, government organizations, social services, you name it, just all kinds of things. A lot of which are the types of applications that you've already mentioned. They work with teams of undergrads or masters or Ph.D. students, just depending on what the problem is.

But one of the things that we hear all the time in these conversations with kind of bringing in data science, or bringing in machine learning, reinforcement learning, what have you, bringing it into these companies is, there's a lot of pushback on the idea that people are just kind of replaceable. Because they're... And we see it all the time in data science.

Our data science problems as well, that the algorithms and the models that are learning things over time, they do a really good job at some types of problems. But if we're missing some of that business context, if we're missing some of that expertise that the humans have built up over time, we either don't explore this space quite correctly, or it moves in a direction that maybe the company wasn't comfortable with.

Almost 100% of the companies are very interested in how can we use these tools but still be able to incorporate subjective information or advice or wisdom, or… and I'm using those phrases kind of loosely, but bringing in kind of that personal information that somebody has, right? Or another example that we see quite often is that maybe they want to completely change directions.

So they found that their hiring practices… just as a common example, that shows up in popular media. That their hiring practices tend to only hire certain kinds of people and they want to be more diverse. Be that skill set, background, demographics, whatever the definition behind diversity. But if they only use their data in their processes that they have now, the models wouldn't move off and explore into other spaces and bring in that more diverse hiring.

Key takeaway no. 3

Technology is eating at the heart of the business because the risk is now taken by engineers and not at the management level.


Reinforcement learning and people

Rebecca Nugent: So I know that in reinforcement learning, we can add some random steps in there to maybe help us with the exploring. But I wonder kind of what you think about that, that pushback that we hear all the time from companies. Don't replace the people? How do I do reinforcement learning and people together to make it even more kind of an optimal situation for our company? Do you hear that a lot as well?
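The "random steps" mentioned here are commonly implemented as epsilon-greedy action selection: most of the time the agent picks the option it currently believes is best, and with a small probability it tries something else. A minimal sketch, where the function name and the example values are assumptions for illustration:

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))  # explore
    # exploit: index of the highest estimated value
    return max(range(len(action_values)), key=lambda a: action_values[a])


# Estimated values learned from past data (made-up numbers).
estimated_values = [0.1, 0.9, 0.3]
greedy_choice = epsilon_greedy(estimated_values, epsilon=0.0)  # never explores
```

With `epsilon=0.0` the agent only ever repeats what past data favors, which is the situation in the hiring example; a non-zero epsilon forces it to occasionally try options the historical data would never select.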

Phil Winder: I don't hear that explicitly. But it just makes sense to think like that, I think. Because these are just tools. And the tools are only as good as the people designing the tools. And the tools are there to be useful. They're not used… they're not there to be harmful or to do damage. They're there to be useful. And if they're not useful, they won't get used and if they're not used, then they're not commercially viable.

So it makes sense to make them as useful as possible. So let's start from the bottom of the stack again. Thinking about software engineering.

Rebecca Nugent: Yes.

Phil Winder: Using software in IT, we don't use it to replace people, we use it to help those people to do their jobs. Similarly, with machine learning, we don't replace people with AIs or whatever. We do it to help them make the decisions and to make their decisions more robust. And same with reinforcement learning, it's a tool, it still needs to be designed, and it needs to be curated. It's just that I feel like it's working at a slightly higher level than people are used to. So it's going to take a little bit of a mind shift to let that happen.

Rebecca Nugent: Yes, and I think that's right. So I agree with you that these tools are designed and are best used in helping people and assisting them in doing their jobs. Maybe taking away the low-level easy decisions that can be automated and letting them be more complex… letting them spend their time on more complex decisions, right? In which we see that having humans involved helps, right? That it helps.

Key takeaway no. 4

Keep in mind: The tools are only as good as the people designing them


Problem definition and reinforcement learning

Rebecca Nugent: One area where I thought it might be useful to have quite a bit of input from people, and we certainly see this in data science as well, is exactly one of the things that you highlight throughout the book, which is the importance of the problem definition.

At one point, you actually say, "All projects start with a problem definition and it's often wrong." Which is kind of a basic tenet that we hear in statistics all the time, like, all models are wrong. But hopefully, some of them are useful, right? I'm paraphrasing the famous quote. But we certainly see that all the time in our data science work as well.

So I was curious, if you were going to make recommendations, right? About where people should be putting most of their time, energy, effort, right? Would it be in that problem definition phase? I guess I'll stop there.

Phil Winder: Yes.

Rebecca Nugent: Why is problem definition... I mean, problem definition is so crucial for any problem ever on the planet Earth, right?

Phil Winder: Exactly.

Rebecca Nugent: Why is it so crucial for reinforcement learning?

Phil Winder: I don't think it is any more crucial compared to other tools and techniques. Because you can start a software project or an ML project and go down the same path. You can do a lot of research, invest a lot of money, and if you're solving the wrong problem, then it's irrelevant what happens next. So it's… yeah. So that's that.

But I think, to focus on reinforcement learning a little bit, it's slightly different because the whole point of using an agent to form an optimal strategy is so that it can explore spaces that are computationally inefficient to explore in the first place. So actually, yeah, so I'm sorry, I'm working it out in my mind as I'm talking.

Rebecca Nugent: Yes.

Phil Winder: So I think the main difference is actually… it's not getting the wrong problem definition. I mean, that's still important, especially from a commercial point of view, you don't want to be working on the wrong project. But I think the main thing is that if you don't improve and work on the problem definition, you're missing out on potential constraints that can make the problem much simpler.

Because more often, actually not more often than not, but one thing that's really tricky in reinforcement learning is that it's very easy to try and do something that is too complex. You're basically… it's like saying, you've just bought this little robot and you got this robot hoover thing, and you've put it on the floor. And you tell that robot to learn how to make the coffee.

Well, that's an incredibly complex task for it to do. But if you constrain the problem, if you made the problem smaller, if you worked on the problem and say, "Well, okay, the first thing is, you just learn how to move about my house. Learn about the places you can go and the places you can't go. All right, now you've done that. Now learn how to clean the floor, that'll be step two. Once you've figured out that, now learn where the coffee machine is," and it can go around, it'll take time, it'll finally find something that it thinks is a coffee machine. And you can work up from there.
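The robot example above, where a hard task is broken into smaller ones that are learned in order, can be sketched as a simple staged training loop. This is a minimal illustration, not a method from the book: `run_curriculum`, the stage names, and the caller-supplied `train_stage` function are all hypothetical.

```python
def run_curriculum(stages, train_stage, success_threshold=0.9, max_rounds=50):
    """Train on each stage in order; move on only once a stage is mastered.

    `stages` is an ordered list of task names (hypothetical).
    `train_stage(name)` is supplied by the caller: it runs one round of
    training on that task and returns a success rate between 0 and 1.
    Returns (completed_stages, stuck_stage_or_None).
    """
    completed = []
    for name in stages:
        for _ in range(max_rounds):
            if train_stage(name) >= success_threshold:
                completed.append(name)
                break
        else:
            # The stage never reached the threshold: stop and rethink the problem.
            return completed, name
    return completed, None
```

If the loop gets stuck on a stage, that is a signal the stage is still too big and the problem definition needs another pass, rather than a cue to simply train for longer.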

So, I think the difference with reinforcement learning is that problem definition defines whether the problem is doable in the first place. And I think that's a little bit different to software engineering because no matter what the problem definition is, you could probably always solve that software engineering problem.

Rebecca Nugent: Okay, that's it. Now, I really liked the way you stated that about how the more time we spend on defining the problem, the easier it is to identify solutions, to identify the constraints that we need. But I think what I'm hearing is that you actually are going to be able to target what you want in kind of these smaller, achievable steps, right?

They have a higher chance of success by identifying a strategy that's going to work for you long term, right? Depending on what you're trying to optimize. And that I mean, that's a good lesson for everything, not just reinforcement learning.

Phil Winder: Right.

Key takeaway no. 5

In reinforcement learning, the problem definition determines whether the problem is doable in the first place.


The book

Rebecca Nugent: That makes a lot of sense, that's kind of one of the things that are going to promote success, right? With reinforcement learning. Let's talk about the actual book for a second, right? So my view of it when I was working through it, is that it's a really practical guide to reinforcement learning. So it's less mathematical by design, you chose to do it that way. And it's really focused on what I would call down-to-earth advice about how to actually do a reinforcement learning project, an RL project. It's pretty straightforward and it's pretty funny in a lot of spots.

So that was nice. I don't read a lot of statistics, machine learning books that are funny. So I was like, yeah, that's good. So it starts with some basic background in RL. And then you build the underlying models kind of from the ground up giving examples, kind of like what we did at the beginning when we first started talking.

And then the final chapters move more into real-world applications and kind of discuss the challenges of actually building, operationalizing, and scaling in a parallel framework. It uses lots of graphics and conceptual explanations. And the supplementary code, which is slightly unusual, has been posted on a company website, rather than in the book.

So what was your primary motivation for presenting the material in this way? And what kinds of audiences do you think will really be able to connect with the material in the book?

Phil Winder: Thank you for that. That's a good summary, better than I could do. So I think when I first started writing the book, I actually wanted to... If you look at the end of the book, the bits about productionizing and operationalizing, and actually how to do RL projects, that's actually the start of the book. I started to write that first. But it very quickly dawned on me that actually, this subject is still pretty new. And actually, there's a lot of context that is required in order to explain that fully.

I kind of had to step back a bit and actually provide a bit more of an introduction. So when I went down that path, I thought, well, what else is out there, and there are some good books out there. But most of the good books were academic, they were textbooks. And, you know, they're great for what they are. They're great for learning, they're great for reference, but they're quite dry. They're very mathematical. So then I had to sort of think about the audience, and the primary audience for this particular publisher are software engineers, they're engineers, practicing engineers.

I know from experience that these people maybe prefer slightly less math… they're okay with math, but less math than usual. But code, they're used to working in code. So I tried to avoid the sort of dryness of the academic texts and tried to sort of interject it with some advice, basically. And moving on to, like, the code question, that was… it's a little bit controversial because many engineers expect to see code written in books, but I just thought... I mean personally, I really hate it. I really don't like it.

I don't like reading reams and reams of code, because I don't think it's… it's hard to read. It's not meant to be read, it's not spoken language. So I don't particularly like it, but more importantly, like, code is the definition of a program and that thing changes over time. There are probably a million bugs in there that people are going to point out, I'm sure.

If it was printed on a dead tree, I wouldn't be able to do anything about it. But if it lives in version control, as it does, I can easily update it, I can easily improve it, I can add more examples, I can add to it. So I felt that it was better to keep the code separate to the actual written text and sort of concentrate on the story, on the things I think are the most important points, and also the advice.

Rebecca Nugent: So if someone's picking up this book and they're just completely brand new to reinforcement learning, where should they start? Should they start from the beginning reading through, and then maybe accessing the code or examples as they want to? Should they try to work through the entire book and then start coding? Kind of what would you tell somebody who, again, is just brand new to reinforcement learning?

Phil Winder: I'd recommend reading the book all the way through, to be honest. I think that, from my perspective, I feel like people learn best when they actually do something for real. So I would read the book first, then go and refer to the examples for things that they're interested in. But then really… the crucial step is to find a problem that they are interested in. And if they're working in the industry, finding an industrial problem that is related to their work or related to work that they want to do, I think is really important, because it's only then that you really realize what you don't know.

You realize all sorts of practicalities involved in learning a new subject. And I think also it would help to have some context as well. So this is... I don't really talk that much about machine learning. But I assume that they do have some experience with machine learning. So I'd also say read this book, but then go away and read some machine learning books as well. Because there are all sorts of things that you need to learn about machine learning and statistics and modeling and things like that.


How to get started with RL after reading the book

Rebecca Nugent: Well, you do reference several available tools in the RL community to help people get started such as, for example, OpenAI's Gym for an environment interface. Do you think that it's a useful exercise for someone to build their own interface in a simulation? Maybe for some really, really super small, constrained problem before using already built tools? Or do you think, should new people kind of just already start with what's out there? Maybe there are pros and cons?

Phil Winder: Yes, I think at the moment, yes. In the future, maybe not. A similar explanation would be if you learn...

Rebecca Nugent: Which one? I'm sorry, Phil. Yes to doing your own, to building your own first? Or yes to going and grabbing the tools that are out there?

Phil Winder: Yes and no.

Rebecca Nugent: All right, because...

Phil Winder: Because there are arguments for and against. If you're trying to learn how to program you can do a lot by just learning the language and writing in that language. You don't need to learn how to… you don't need to write a compiler to be a good programmer. You don't need to learn how to build a microprocessor to be a good coder. So I think there's some level somewhere, I don't know where exactly that is, that you probably don't need to go lower than that.

I think writing the environment definitely is required. You don't necessarily have to do a toy one. And again, I recommend doing something that's related to your work. The reason for that is because as part of your day job, you will probably have to write some simulator, you will probably have to do that in order to gain some knowledge of the problem.
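For a sense of how small a hand-written environment can be, here is a minimal sketch loosely modelled on the `reset`/`step` convention popularized by OpenAI Gym. It does not import Gym, and the `CorridorEnv` class, its rewards, and its parameters are all assumptions invented for this illustration.

```python
class CorridorEnv:
    """A toy 1-D corridor: start at cell 0, reach the last cell to finish."""

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        """Apply an action (0 = move left, 1 = move right).

        Returns (observation, reward, done, info).
        """
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        self.steps += 1
        reached_goal = self.position == self.length - 1
        done = reached_goal or self.steps >= self.max_steps
        reward = 1.0 if reached_goal else -0.1  # small cost per step
        return self.position, reward, done, {}


# Usage: a fixed "always move right" policy, just to exercise the interface.
env = CorridorEnv()
observation = env.reset()
total_reward = 0.0
done = False
while not done:
    observation, reward, done, _ = env.step(1)
    total_reward += reward
```

Even in a toy like this, the decisions about what counts as an observation, a reward, and the end of an episode are where most of the knowledge about the problem gets encoded.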

So I think that'll happen externally. Similarly, you're probably going to have to tinker with the algorithms at some point. So again, playing with and trying new algorithms is really important. And then I think the final sort of piece of the puzzle is the sort of the fundamental modeling and statistical thinking behind some of the models that are used within the agents. I think that's really important as well so you need to gain experience playing with them.

Rebecca Nugent: Ok. I've got some extra questions so if you don't mind, they're a little bit out of order now because I have some extra things that I thought of.


What excites you the most about reinforcement learning?

Rebecca Nugent: The follow-up question to what I just did is what excites you the most about RL and where do you think it's going?

Phil Winder: Well, I think it's an evolution of ML. It's like adding a piece of the puzzle that has never existed before. So I feel like it will be able to solve problems that were not… that weren't being solved in the right way, previously. We were just hacking around the problem rather than solving the core problem. But it also opens up some really new interesting avenues of research, like people are sort of fundamental researchers. They're still very interested in the act of learning.

It's what makes us different to the other animals living on this rock hurtling through space. And I think that reinforcement learning is a big part of that. It's testing ideas. In fact, it's just the scientific method. It's testing ideas, it's testing theories, it's seeing what happens and if they're good, continue doing them. Otherwise don't. And I feel like, for humanity's sake, it's obvious that the rate at which humans are doing that is increasing. There's this exponential growth in knowledge and the speed at which knowledge is being acquired, it's just going faster and faster and faster.

And now we're finding ways to kind of automate that process through things potentially like reinforcement learning, and I feel like this sets a really strong and reliable and robust sort of baseline for the future to continue on that trajectory. So I think that reinforcement learning is interesting and important for maybe the next sort of 2, 3 to 5 to 10 years, in the industry, it'll be useful.

But actually, I think it'll help set the trajectory for the things that are going to happen in 10, 20, 30 years in the future. And I can only imagine what kind of breakthroughs there are going to be. But I feel like it's going to be on the same sort of trajectory. But we're starting to investigate how things are learning. And so the next obvious step is to continue with that.

It's like, why do we learn in the first place? And are there more complex models that can be used to model more complex learning behavior? And through that, you know, the things that, that future thing could learn would be even more complex? If you know what I mean? I'm getting slightly ahead of myself a little bit there. But...

Rebecca Nugent: No, I think you're actually describing, if we go back to "The Matrix," I think you're describing "The Matrix" movies, right?

Phil Winder: Yes.

Rebecca Nugent: Right? Like you had to choose the red pill or the blue pill. But then that set them off on a course. Right?

Phil Winder: Exactly. We've got parallel universes now.


What are the challenges with the interpretability of reinforcement learning results?

Rebecca Nugent: And now it kind of gets back to maybe reinforcement learning would have been helpful there, right? To sort of learn, here's the sequential. One of the things that came up in this conversation caused me to think, one of the things that we hear a lot about when we're working with partners on data science problems.

I'm helping lead a team of students do something that's pretty straightforward. It doesn't require extremely complex models. And sometimes we argue when working on problems that reinforcement learning would be very, very helpful.

But at the end of the day, we are always spending a non-trivial amount of time thinking about how to interpret the results and how to explain them and turn them into actionable insights that can be explained to like investors, customers, people who need to adopt whatever results are coming out. And I wondered if you could talk a little bit about any challenges you see with interpretability of reinforcement learning, kind of the final results that are coming back?

I know it's presenting you with a strategy or a decision. But do you see any difficulty in industry with communicating about reinforcement learning. How it got to the results, why does it show the results that it did? How we can extract actionable insights from it, right? That we can turn into decisions for the company and so on.

Phil Winder: Yes, this is going to be another long answer, I'm afraid. So I think I'll start with the simplest answer first. And it comes back to the problem definition again. If you've got a really tight problem definition of the thing that you're trying to achieve, and that is tied to a good business metric, it can actually be quite easy to explain the result of an algorithm. The tricky part that many, many, many researchers are focusing on at the moment is not necessarily explaining why you would use an algorithm, but actually explaining how it comes to that decision.

That can be important for many different areas, for regulatory reasons, maybe, or for bias reasons. You want to be sure that you're not biased towards certain things. And then at that point, it does begin to become really difficult. The reason why it's difficult is that you've not only got the explainability issues of the underlying model. So you've got a model embedded within the RL algorithm that is doing something.

There are already inherent problems if you've got a really complex, deep neural network based model being used in the depths of your RL algorithm. That's going to be hard to explain to begin with, before you even get started. Now you're adding to that the fact that the explanation involves time, it involves decisions over time. And so you're adding yet another dimension to it.

And then it comes to the problem domain as well. If you're working in a problem domain that is inherently conceivable by humans, if it's a domain that is 2D or 3D, or 3D plus time, it's still vaguely doable. If you're trying to explain to people, well, this robotic dog did this crazy sudden jump, because it predicted in the future it was going to get shot up.

That's why it moved to the right, even though it looked weird at the time, it actually saved that dog, and something like that. But when we're talking about things that are more conceptual, like, say the strategy of a business. Something as abstract as the strategy of the business, and if you had an RL algorithm saying, stop selling your best-selling product and invest loads of money in that one. Everyone's going to say, "Oh, we don't trust it, we're not going to go through with that."

Rebecca Nugent: Yes.

Phil Winder: Even though there are some… it saw something that other people had not seen, changes in regulations, changes in the perception of people. Something really complex like that. And we're going to have to figure out how to describe those types of outcomes, so that people have the confidence in trusting those decisions. There's no easy answer for that, I don't think. I think that's an open research question.
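The robotic dog example, where an action looks wrong in the moment but pays off later, is what a discounted return captures: the agent scores a choice by the rewards that follow it, not just the immediate one. A minimal sketch follows; the reward numbers and the discount factor of 0.9 are made up for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where a reward t steps ahead is weighted by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical reward sequences for two choices made at the same moment.
stay_on_course = [1.0, 1.0, -10.0]  # looks fine now, costly two steps later
sudden_jump = [-1.0, 1.0, 1.0]      # looks odd now, avoids the later cost

value_stay = discounted_return(stay_on_course)
value_jump = discounted_return(sudden_jump)
```

Under these assumed numbers the jump scores higher overall even though its first reward is negative. Explaining a decision therefore means explaining the predicted future that justified it, which is part of why interpretability gets harder once time is involved.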

Rebecca Nugent: Well, I think the open research question is right. But I think there's also a need to invest in building skill sets for people to be able to do that. So I thought you just did a really good job. I mean, I felt kind of bad that that dog was getting shot at, that was a little bit much for early in the morning in the U.S.

Phil Winder: Sorry.

Rebecca Nugent: No, no. I'm joking. I started with Morpheus, cracking open "The Matrix." But the point being is that I totally understood exactly what you meant, though. You meant that like, something "bizarre" has just happened, but it's because it's saving me from a problem down the road, I can't see yet. Right? But maybe what we also need in this field, is we also need more people to concentrate on building this skill set of being able to kind of be that go-between that can explain these results.

So it's almost like there's a lot of work that kind of goes into data journalism, right? So data science and journalism, it's not exactly a perfect match for what I'm talking about. But the idea of taking some really complex models that are trying to help people understand how to better manage their healthcare, whatever.

How do you take that information and turn it into something that people will understand? Something that doesn't require them to really understand all the technicalities of the models that arrived at their results, right? And when you get that wrong, people make the wrong decisions, right? So much of this is about communication. And it kind of makes me think that we need people who are studying this also.

We need classes or courses in how to communicate about this, communicate about what these RL algorithms are doing, or other kinds of algorithms that are complex and not well understood. To make… to help them understand, okay, that bizarre thing just happened. But here's why. Or I have a bunch of choices in front of me. Where did those choices come from?

Now I'm going to have to make a decision on what to do. I guess I'm thinking more like Decision Sciences, kinds of Social Behavioral sciences like connecting these kinds of complex data-centered tools, with how can we get people to understand them and then change their behavior based on the strategies and things that are presented to them?

I think that would be… that's probably really necessary for us in the future. I'm speaking mostly from academia standpoint, we need to be training people to be able to do that. I think that's what I'm hearing. Another example that popped up... I'm sorry.

Phil Winder: I was totally agreeing with what you said. I think that you're actually taking a really complex subject, technically, with a really complex subject, people. Trying to get those things to fit together is a complex-squared problem.

Rebecca Nugent: No, you're exactly right, though. We people are complex and these algorithms are complex, but to get them, the power is really in their intersection, right? 

Phil Winder: Yes, exactly. You're right.

Rebecca Nugent: I think we need to be concentrating more on that. And in some ways, I think your book is trying to do that, in that it's trying to build that bridge that you talked about, right? From academia to industry, how can we get more people to be... And I'm sure my department in my university would not be happy about me saying this next thing.

But how can we get more people to work in this space without requiring them to spend years learning all of the technical mathematical details, which other people are already working on, right? So how can we get people in this space, thinking about these ideas, right? Thinking about reinforcement learning, not being intimidated, right? Not kind of being intimidated by whether or not they'd be able to do it.

Phil Winder: I'm sure it's feasible. It's people like me that will help make that a reality. Like, I've got two young kids as well, and my eldest is six and she's able to do stuff like Scratch and sort of programming environments like that now. And only a few years ago, that's pretty, it would have been pretty impossible to ask a five-year-old or a six-year-old to do programming.

So I can imagine that moving forward in the future, maybe it's going to be possible for that five or six year old to be able to do modeling or some kind of abstract ML type thing. And again, moving into the future, maybe that'll go to RL as well. But, yeah, it's achievable.

Rebecca Nugent: My five-year-old has programming things like Scratch and Kodable on her school iPad as well. And I'll be looking at it trying to figure out what her… what the app is about. And she'll just snatch it out of my hand and be like, "Please, mom, you don't get it."

Phil Winder: And I'm going like, "Slow down, slow down. You're doing it too fast. Can you just slow down?" I feel so old already.


Applying RL algorithms to real life

Rebecca Nugent: No, no, they're fantastic. Another example that popped into my head when you were talking about… now I'm going way back in the conversation… is shadowing. When you were talking about bringing in an RL framework or an RL system, right? You wouldn't just move it online exactly, right? You would shadow systems, and that's very typical in industry, right?

But what it made me think of was actually a possible sports analytics reinforcement learning example, so I was curious what you thought. We do an awful lot of sports analytics work at Carnegie Mellon. We have a whole center and we do conferences and camps and things like that.

Phil Winder: Yes.

Rebecca Nugent: But one of the interesting applications we've seen came out of some Carnegie Mellon researchers. It's now at Disney because it was coming out of a research group with ESPN, and ESPN was moving into Disney. It would show a video of a soccer game that had been played, with shadow players showing where the players should have been to better optimize a final performance goal.

So every time players make a decision: am I going to run six steps to the right, or am I going to run six steps to the left? Am I speeding up here, etc.? Where should my position be if the ball was over there? What it was doing was trying to show how I could've optimized my position, how I could have optimized my speed and so on, and you would watch a shadow game at the same time as you would watch the real game.

I mean, the real game was already over. They had taken the information and built the shadow game behind it. But then the coaches and players could see what other kinds of decisions could be made there. And I don't remember if they used reinforcement learning, but I thought I'd run that by you and see if that was a place where we might be able to think about sequential decision making, right? My goal at the end is to get the ball in the net, right? What strategy should I take to get there? Have you seen it used in sports?

Phil Winder: Yes, I mean 100%. Games are perfect for reinforcement learning, as you've probably seen in the media. Generally, the most publicized examples are limited to very constrained games, i.e. board games and things like that. So you've seen lots of board game examples. But you can think of most sports as a slightly extended form of game, and there are lots of researchers out there doing pure research on, yes, strategic game theory and trying to figure out algorithms to best solve those problems.

And because they're so constrained to begin with, generally, there are a lot of computational tricks that you can do in order to make the problem solvable and tractable. One of the things that I was interested in is motor racing. So I quite like motor racing, I'll follow five Formula One teams. And we've already seen teams beginning to use Monte Carlo style methods and sort of rudimentary reinforcement learning to inform teams on strategies for pit stops.

So basically, when to stop the car to put new tires on and refill the fuel. They do that so that they can try and take advantage of other teams not stopping at the optimal point. And so it's being used in Formula One and other racing disciplines already. And the interesting thing about soccer, or American football or rugby, or anything like that, is that there's more to play with than just the rules of the game.

So that's the main problem. It depends more on the individuals and the actions and the abilities of the individuals. Because it's all well and good saying to score the goal, but you've actually got to have a person to actually kick the ball well enough to score the goal. I'm sure it's doable, and I'd be really interested in seeing a spin-off where there is more integration between... I don't even know how that would work. It would end up looking like Tron, I think.
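Editor's sketch: the pit-stop idea Phil mentions can be illustrated with a very small Monte Carlo simulation. This is only a toy under stated assumptions, not how any racing team actually does it. Every number here (lap time, tire wear rate, pit-lane time loss, noise level) and both function names are hypothetical, and the code only evaluates fixed one-stop strategies by averaging simulated race times. It is plain Monte Carlo evaluation, not a full reinforcement learning agent.

```python
import random


def simulate_race(pit_lap, rng, total_laps=50, base_lap=90.0,
                  wear_per_lap=0.08, pit_loss=22.0, noise=0.3):
    """Simulate one race with a single pit stop; return total time in seconds.

    All parameters are made-up illustrative values:
    - base_lap: lap time on fresh tires
    - wear_per_lap: seconds lost per lap of tire age
    - pit_loss: time lost by driving through the pit lane
    - noise: standard deviation of random lap-to-lap variation
    """
    total = 0.0
    tire_age = 0
    for lap in range(1, total_laps + 1):
        # Older tires make the lap slower; add random variation.
        total += base_lap + wear_per_lap * tire_age + rng.gauss(0.0, noise)
        tire_age += 1
        if lap == pit_lap:
            # Stop at the end of this lap: pay the pit-lane cost, reset tires.
            total += pit_loss
            tire_age = 0
    return total


def best_pit_lap(candidates, n_sims=500, seed=0):
    """Estimate the average race time for each candidate pit lap.

    Returns the candidate with the lowest average and all the averages.
    """
    rng = random.Random(seed)
    averages = {}
    for pit_lap in candidates:
        times = [simulate_race(pit_lap, rng) for _ in range(n_sims)]
        averages[pit_lap] = sum(times) / n_sims
    return min(averages, key=averages.get), averages


if __name__ == "__main__":
    best, averages = best_pit_lap([5, 15, 25, 35, 45])
    for pit_lap, avg in sorted(averages.items()):
        print(f"pit on lap {pit_lap:2d}: average race time {avg:.1f} s")
    print("best candidate:", best)
```

With linear tire wear, stopping near half distance splits the wear evenly between the two stints, so the mid-race candidate comes out ahead in this toy model. A real strategy tool would also have to model traffic, fuel, safety cars, and what rival teams do, which is where the sequential decision making Phil describes comes in.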

Rebecca Nugent: Well, that's not bad. I mean, Tron's pretty great.

Phil Winder: The problem is that sport is inherently about watching people make these decisions themselves. It's the enjoyment of seeing those things come to fruition. If you got rid of that, if you made everything so clinical that it was just the optimization of an algorithm, would it be sport anymore? I'm not sure it would.

Rebecca Nugent: Oh, I don't think I'd want to replace that by telling the players what they had to do. I mean, because I agree with you, it's fun to see what's going to happen. I was thinking more if you were going to train somebody, right? Or correct mistakes, do you know what I mean? If I'm getting ready to play a team that we've played several times before, right?

Doing something more complicated, like you need to make these decisions better, this sequence of decisions. Because they typically do this kind of offense or this kind of defense. And we're able to kind of model what might be the best strategy here. In the talk where the gentleman was presenting the soccer game research, he told us that they had tried to do something similar with basketball and that they had used film and video from a division three college basketball team. Because that's what was available.

And they had built their system, etc. Then they'd take it down to… I forget the full name, but it's the ESPN day down in Florida where lots of high school all-stars come. They're playing and a bunch of scouts, and there's press and it's kind of just a big fun day. But these are people who are really kind of aiming for being professional. They tried, and they played a game. They played an exhibition game, and then all of their simulations completely failed.

Their system failed because they had been basing it off of team basketball. When they went down to ESPN day, all of the students were really just trying to show off and slam dunk the ball and do all kinds of things to draw attention to themselves. And the system couldn't keep up with them at all, so I thought that was a pretty funny story about how... those populations have to be similar, right?

Phil Winder: Yes, exactly.

Rebecca Nugent: For it to work. But the two games had a totally different purpose. Totally different purpose. Okay.

Rebecca Nugent: All right, well, thank you so much, Phil. This was an absolutely wonderful conversation. I love learning about reinforcement learning, industrial applications of intelligent agents. I found this to be a super fun read which is kind of an odd thing to say about a book of this type. But thank you for your efforts in trying to build this bridge between academia and industry. I think it will be very successful, and I hope that others benefit and take the time to explore the topic, explore the materials on the website and check it out.

Reinforcement learning (RL) will deliver one of the biggest breakthroughs in AI over the next decade, enabling algorithms to learn from their environment to achieve arbitrary goals. This exciting development avoids constraints found in traditional machine learning (ML) algorithms. This practical book shows data science and AI professionals how to learn by reinforcement and enable a machine to learn by itself. Author Phil Winder of Winder Research covers everything from basic building blocks to state-of-the-art practices. You'll explore the current state of RL, focus on industrial applications, learn numerous algorithms, and benefit from dedicated chapters on deploying RL solutions to production. This is no cookbook; it doesn't shy away from math and expects familiarity with ML.


Phil Winder: Thank you very much. That's very kind of you to say that. I thought you had some really fantastic questions there. Some really hard ones as well, so thank you for that. But I really appreciate your time and hopefully, we can work together in the future. Thank you.

Rebecca Nugent: I'm going to bring the sports analytics, and we have to work on modeling "The Matrix."

Phil Winder: Sounds good, thanks.

Rebecca Nugent: Thank you so much.

About the author and interviewer

Dr. Phil Winder is a multidisciplinary software engineer and data scientist. As the CEO of Winder Research, a cloud-native data science consultancy, he helps startups and enterprises improve their data-based processes, platforms, and products. Phil specializes in implementing production-grade cloud-native machine learning and was an early champion of the MLOps movement. More recently, Phil has authored a book on Reinforcement Learning (RL) (https://rl-book.com) which provides an in-depth introduction to industrial RL for engineers.

He has thrilled thousands of engineers with his data science training courses in public, private, and on the O’Reilly online learning platform. Phil’s courses focus on using data science in industry and cover a wide range of hot yet practical topics, from cleaning data to deep reinforcement learning. He is a regular speaker and is active in the data science community.

Phil holds a Ph.D. and M.Eng. in electronic engineering from the University of Hull and lives in Yorkshire, U.K., with his brewing equipment and family.

Rebecca Nugent is the Stephen E. and Joyce Fienberg Professor of Statistics & Data Science, the Associate Department Head and Co-Director of Undergraduate Studies for the Carnegie Mellon Statistics & Data Science Department, and an affiliated faculty member of the Block Center for Technology and Society. She received her PhD in Statistics from the University of Washington in 2006. Prior to that, she received her B.A. in Mathematics, Statistics, and Spanish from Rice University and her M.S. in Statistics from Stanford University. She has won several national and university teaching awards including the American Statistical Association Waller Award for Innovation in Statistics Education and serves as one of the co-editors of the Springer Texts in Statistics. She recently served on the National Academy of Sciences study on Envisioning the Data Science Discipline: The Undergraduate Perspective and is the co-chair of the current NAS study Improving Defense Acquisition Workforce Capability in Data Use. She is the Founding Director of the Statistics & Data Science Corporate Capstone program, an experiential learning initiative that matches groups of faculty and students with data science problems in industry, non-profits, and government organizations. She has worked extensively in clustering and classification methodology with an emphasis on high-dimensional, big data problems and record linkage applications. Her current research focus is the development and deployment of low-barrier data analysis platforms that allow for adaptive instruction and the study of data science as a science.
