The man who tamed a reinforcement learning agent in a labyrinth game


Maarten Fish

Machine Learning Engineer

Maarten Fish, a Machine Learning Engineer at Faktion with a passion for reinforcement learning, talks about his experiments with reward and punishment signals for a reinforcement learning agent in a labyrinth game.


Back in January, I did a three-month internship on reinforcement learning; since then, I've worked on synthetic data projects as well as more traditional NLP and computer vision.

My background is not so straightforward: I'm a self-taught programmer who learned by building fun, engaging game mechanics in Unity, written in C#. That urge to code grew stronger and stronger into a desire to learn artificial intelligence and machine learning.

Scratching the surface of ML, I quickly ran into a huge overload of information, and I was a bit intimidated by all the theory. I was missing so much background that I just dropped the idea, until I heard of BeCode, which had an AI-focused bootcamp on offer.

There, I went through all the machine learning branches and learned computer vision and NLP. Through BeCode I came into contact with Faktion, whose Head of Applied AI presented us with an interesting use case. It immediately felt like the most technically challenging use case we had tackled back then, and I knew that Faktion would be my go-to partner for an internship. Later, when Faktion announced they were offering an internship on reinforcement learning, I knew exactly where I needed to be: reinforcement learning was the very reason I wanted to do artificial intelligence in the first place. I was really intrigued by the reinforcement learning agent, an entity learning a task autonomously. So that's how I got into contact with Faktion, and it's been a fantastic journey since day one.

What have you been working on?

During my internship specifically, I went through the book "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto. The book covers everything from basic tabular approaches all the way to the state-of-the-art deep learning side of RL. Every time I covered a chapter, I would go and put it to the test. My background isn't exactly theoretical, so at first the maths behind RL was really intimidating. The formulas started out looking completely alien to me, but after applying what I had learned in a more practical manner, I started to puzzle all the pieces together and the maths became second nature. I just did it backwards: normally you would learn the theory and then try to apply it. I found my perfect learning method in reading a chapter, playing around with all the new information, and finding insight through practical use.

As an example, after reading the chapter on planning and modelling, I would go to OpenAI Gym, an open-source library that lets you create reinforcement learning environments and additionally offers some predefined games. (Games are a really good way to benchmark reinforcement learning algorithms.) So I would pick a game suited to the new algorithm I had just learned, apply it, solve the game, and take note of all the interesting behaviour I encountered. This grew my love for reinforcement learning exponentially. I started out learning how to code through game development, and ended up guiding autonomous reinforcement learning agents in handcrafted environments: in a sense, making games for reinforcement learning agents that then allow for business optimization. All in all, I'm really pleased with how everything has been progressing lately.
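To give a flavour of what "apply the algorithm, solve the game" means in practice, here is a tabular Q-learning pass over a tiny hand-rolled corridor world standing in for a Gym game. Everything in it (the environment, reward values and hyperparameters) is made up for illustration; Gym's real API differs.

```python
import random

# Tiny corridor world: states 0..4, goal is state 4. Illustrative only.
N_STATES = 5
ACTIONS = [-1, +1]    # move left or right

def step(state, action):
    """Apply an action, return (next_state, reward, done)."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    if nxt == N_STATES - 1:
        return nxt, 1.0, True   # reached the goal
    return nxt, 0.0, False

# Tabular Q-learning: learn Q[state][action] from experience.
q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1
random.seed(0)
for episode in range(200):
    s, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
        a = random.randrange(2) if random.random() < epsilon \
            else max((0, 1), key=lambda i: q[s][i])
        s2, r, done = step(s, ACTIONS[a])
        # Update toward reward plus discounted value of the best next action.
        q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
        s = s2

# The greedy policy for the non-terminal states should now be "move right".
policy = [max((0, 1), key=lambda i: q[s][i]) for s in range(N_STATES)]
print(policy[:4])  # [1, 1, 1, 1]
```

The same read-a-chapter-then-code pattern scales up: swap the corridor for a Gym game and the table for a network, and the update rule stays recognisable.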

After researching RL, I was looking for something to challenge myself and use everything I had learned so far in a somewhat inspiring way. In reinforcement learning, the example environments are usually grid-based, with some challenge introduced into the environment. Because such a grid world is rather simplistic by nature, small adjustments yield big changes, making it the perfect place to dive deep into reinforcement learning agent behaviour.

Still in the mindset of these grid worlds, I wanted to create something in Unity. When exploring available state-of-the-art RL algorithms, I encountered Ray's RLlib, an amazing open-source anthology of learning algorithms that allows for scaled reinforcement learning. I wanted a fun way to explore this library, and I had recently discovered that Unity offers a lot of machine learning tools. Unity is a free-to-use 3D game engine, usually scripted in C#.

Unity is essentially a 3D engine for creating games and other forms of media; it's also used in industry to make demos and simulations. They have now released a couple of exciting machine learning tools, such as the ML-Agents toolkit as well as the Perception camera, a tool for computer vision.


Why did I decide to do the labyrinth?

Because it looks and feels like a grid world. It is essentially just a maze with added danger in the form of traps, where the ball can fall through the floor and reset the game. Everything about the actual game just screamed agent environment to me, and I wanted to challenge myself to see if I could make the bridge from 2D to 3D. I was wondering: if we keep the agent exactly the same, would it behave the same going from a 2D to a 3D environment, and would it still work? Obviously with added observations, because going from 2D to 3D makes the challenge far more complex.

Can you tell more in detail about the labyrinth game and reinforcement learning agent behaviour?

Within the reinforcement learning landscape, there are a couple of key components that I started to spot, such as the reward signal. In essence, a reward is given to the reinforcement learning algorithm each step based on the actions it took. That's one crucial function of any good environment: it can provide meaningful feedback based on the actions that were performed.

The reward signal became almost a philosophical concept for me, the one aspect that intrigues me most in reinforcement learning. All the rewards and penalties you have programmed into the environment cause an internal reaction within the agent. It will play a couple of episodes and, based on the signal you're giving it, the actions taken and the mistakes made, it will completely change behaviour in sometimes unpredictable ways.

That's what I wanted to investigate more deeply: which signals cause which kinds of new RL agent behaviour. The article and the labyrinth game were just a tool to explore how a reinforcement learning agent learns. In doing so, I also discovered what type of changes to a reward signal can guide agent behaviour in the intended direction. It was very amusing to witness all the creative ways a reinforcement learning algorithm tries to cheat and cut corners.

To cheat?

The reinforcement learning agent always tries to "cheat". In our eyes it's cheating, but in its experience it is simply looking for the most optimal strategy. If you're good at cheating, you will win a lot, which is an optimal strategy, is it not? The reinforcement learning algorithm was obviously doing things it's not supposed to do, because it's hardcoded to optimize whatever it's doing. For example, the first iteration of my labyrinth was loose, so if you moved the playing field fast enough you could fling the ball. One of the first things the reinforcement learning agent did was learn how to fling the ball from one side to the other and finish the game in a single move. To the observer that's considered cheating, so we have to adjust the environment without really telling the agent what to do.

That's the thing with training reinforcement learning agents: you're not allowed to explicitly tell the agent to do something. You're not supposed to say to an algorithm, "Hey, do this." You're supposed to signal "Hey, do this" without telling it directly which actions to take.

How do you do that? With rewards and penalties. The crux of reinforcement learning is basically to fine-tune an environment so that you can place an agent inside and, without being told what it's supposed to do, it can derive meaning from the environment and then optimize a solution.

By the environment, I also mean the signalling. The signals are programmed into the environment. For example, in the case of the labyrinth, you have holes in the board, and if the ball falls through a hole, it triggers a terminal state. That reward signal is programmed into the environment, so that if the reinforcement learning agent enters a terminal state, it finishes the episode with a heavy penalty. This signals to the agent: "don't fall into these holes". It's as simple as that. You just predefine a penalty amount for dropping down one of those holes, and this causes the reinforcement learning algorithm to avoid those states, because it knows it will get punished for entering them.
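A minimal sketch of how that hole-penalty signalling could be wired into an environment looks like this. The positions, reward amounts and goal here are invented for illustration, not the project's actual values:

```python
# Hypothetical labyrinth signalling. Holes and goal positions are made up.
HOLES = {(1, 1), (2, 3)}   # falling in ends the episode with a heavy penalty
GOAL = (3, 3)

def reward_signal(position):
    """Return (reward, done) for the state the ball just entered."""
    if position in HOLES:
        return -100.0, True   # terminal state: heavy penalty, episode over
    if position == GOAL:
        return +100.0, True   # terminal state: the maze is solved
    return -1.0, False        # small step cost nudges the agent to hurry

print(reward_signal((1, 1)))  # (-100.0, True)
print(reward_signal((0, 0)))  # (-1.0, False)
```

Nothing here says "avoid holes" explicitly; the agent is only ever handed the number, and avoidance emerges from maximizing it.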

What are your main takeaways and conclusions?

I was surprised to see that the translation from 2D to 3D went pretty well. In the 2D version, the reinforcement learning agent just has a location to work with, whereas in the 3D version there is a lot more relevant information, such as the ball's velocity and direction. So besides the location, the ball's physical properties are valuable information the reinforcement learning algorithm can use. That's the only change, to the observation; the reward signal stays the same. The reinforcement learning agent doesn't care where it is. It just cares about discovering the right tools to complete the challenge.
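Concretely, only the observation vector widens between the two versions, while the reward interface is untouched. The feature names and ordering below are hypothetical, not the project's actual layout:

```python
# Illustrative observation layouts; names and ordering are made up.
def observation_2d(x, y):
    # 2D version: the agent only sees where the ball is.
    return [x, y]

def observation_3d(x, y, z, vx, vy, vz):
    # 3D version: the ball's physical properties (velocity, and
    # therefore direction) join the position in the observation.
    return [x, y, z, vx, vy, vz]

print(len(observation_2d(0.2, 0.8)))                        # 2 features
print(len(observation_3d(0.2, 0.8, 0.0, 1.5, -0.3, 0.0)))   # 6 features
```

Because the reward signal is unchanged, the same learning algorithm can be dropped onto the wider vector without modification.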

Another big takeaway from this project is that making a digital twin can have great business value. Even though this is just a game, it would take a physical robot 100 000+ training iterations before it starts to attain an accurate world model, and a physical robot can only play one game at a time, so achieving an optimal strategy in real time is impractical. If a company wants to enhance their business model using reinforcement learning, we could build them a digital twin that closely mirrors the task, be it an operation or some logistics problem. This digital twin can then be simulated non-stop and in parallel, yielding a thousand lifetimes in a single second.

We made a game to optimize a maze in an intelligent way, but all the same steps and logic that go into solving games autonomously also apply to business. We create an environment (a digital twin), observe the initial behaviour of the reinforcement learning agent, and then fine-tune that behaviour and push the agent to its limits. Under optimal conditions, receiving appropriate signalling, a reinforcement learning algorithm can come up with super-human strategies.

You mean time value and financial value?

It depends on the task you put the agent on, but essentially, if you optimize the reward signal so that the reinforcement learning agent receives better feedback, you increase the overall intelligence it possesses. Better signalling also drastically decreases the time required to achieve an optimal strategy. This means shorter training, less supervision required, and lower electricity bills.

An environment doesn't need to be static. Once an environment is made, you could introduce certain scenarios to the RL agent. Imagine you're building a supply chain RL agent for Ikea. It's perfectly possible to introduce a fictional COVID-like situation: we make lead times skyrocket, which they did, and close retail to the public over health concerns. Suddenly it takes two months rather than two weeks to receive items ordered online, and we can run this simulation to see how the RL agent behaves in such a scenario, because obviously it has been trained to optimize a normal situation. Suddenly, out of nowhere, a global pandemic hits and all the observations differ from the norm. How will the agent handle such a situation, and how can we improve the agent to handle it better in the future?

A reinforcement learning agent really just comes down to an algorithm. There are handfuls of algorithms that agents can use to build the most accurate model of the world. A distinction worth making: an agent is essentially just an algorithm. There isn't an actual physical or digital unit that is the agent; what we call an agent is the interaction driving state-to-state transitions, and it doesn't have to resemble a unit at all. So the reinforcement learning agent is the learning algorithm, and the environment is like a stage and audience: if the RL agent performs an action, the environment is programmed to provide adequate feedback, and that causes the RL agent, over time, to learn the optimal strategy for collecting the most reward.
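That act, feedback, learn loop can be sketched in a few lines. Everything here (the toy environment, the random policy, the learning rate) is illustrative; the "agent" is visibly nothing more than an update rule applied to state-to-state transitions:

```python
import random

def environment(state, action):
    """The stage and audience: apply the action, hand back feedback."""
    next_state = state + action            # action 0 stays put, 1 moves on
    reward = 1.0 if next_state == 3 else 0.0
    done = next_state == 3                 # state 3 is terminal
    return next_state, reward, done

random.seed(1)
value = {}  # the agent's growing estimate of how good each state is
for episode in range(50):
    state, done = 0, False
    while not done:
        action = random.choice([0, 1])                          # act
        next_state, reward, done = environment(state, action)   # feedback
        # Learn: nudge the state's value toward reward + next state's value.
        value[state] = value.get(state, 0.0) + 0.1 * (
            reward + value.get(next_state, 0.0) - value.get(state, 0.0))
        state = next_state

print(sorted(value))  # [0, 1, 2]: every non-terminal state gets an estimate
```

There is no agent "object" anywhere in that loop, just a policy, an update rule, and an environment scoring each action; that is the whole distinction.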

Any other projects you are involved in?

So my internship was reinforcement learning, and since then I have done synthetic data, computer vision and NLP projects. I really like the fact that I'm learning a lot at Faktion. I'm here to grow personally and become a better ML engineer, but I'm also trying to add value to the Faktion team; I value those two things most. I've had a project from each branch of machine learning so far, and researching the state of the art has been amazing. Synthetic data in particular is a really exciting field: you have these big tech companies that kind of bully the rest of the world by asking ridiculous prices for data, and tools such as Unity that allow for synthetic data are setting a precedent for democratized data, where companies don't charge ridiculous rates and you can just create your own data, where possible obviously.

What is the next step for the reinforcement learning agent project?

The primary goal was to learn. I think reinforcement learning is new-ish to most because it's quite a new field: the advancements made recently have really pushed the field to a new standard, so exploring that new standard is new to all of us. Exploring Ray's RLlib has been just amazing. It's like opening Pandora's box: all these super complicated learning algorithms ready to plug and play. When we discovered this library, we wanted to test it in different areas, and one of those areas for me personally was testing whether an agent is capable of the fine motor skills required for such a challenging game.

One current downside of the project is that I didn't use any vision. The agent in our labyrinth game has no vision of the game whatsoever; it's blind and does everything by touch. That is one major improvement I want to add. This is also still a huge field, which is something I didn't know: a lot of people are still looking into solving mazes because it's difficult, and the bigger you make them, the more nearly impossible it becomes for the agent to find the exit. Learning only starts the moment the agent has found the exit, because only then does value start propagating backwards through the maze.
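A small value-iteration sketch shows why nothing happens before the exit is found: value exists only at the exit at first, and each sweep pushes it one step further back toward the start. The maze layout and discount factor below are made up:

```python
# Backward value propagation through a toy maze. Layout is illustrative.
MAZE = [
    "S..",
    ".#.",   # '#' is a wall, 'G' the exit
    "..G",
]
GAMMA = 0.9
rows, cols = len(MAZE), len(MAZE[0])
goal = next((r, c) for r in range(rows) for c in range(cols)
            if MAZE[r][c] == "G")

# Zero value everywhere except the exit: before it is found, there is
# literally nothing to learn from.
value = {(r, c): 0.0 for r in range(rows) for c in range(cols)
         if MAZE[r][c] != "#"}
value[goal] = 1.0

for sweep in range(20):  # repeated sweeps push value back toward the start
    for (r, c) in value:
        if (r, c) == goal:
            continue
        neighbours = [(r + dr, c + dc)
                      for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0))]
        best = max(value.get(n, 0.0) for n in neighbours)
        value[(r, c)] = GAMMA * best  # each step from the exit discounts value

# The start's value is gamma^(shortest path length): 4 steps here.
print(round(value[(0, 0)], 4))  # 0.6561 == 0.9 ** 4
```

The bigger the maze, the more sweeps (and, for a learning agent, the more lucky first visits to the exit) it takes before the start cell sees any value at all, which is exactly what makes large mazes so hard.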

One change I may want to make in the future is a more vision-based labyrinth solver, together with a labyrinth generator so that each episode has a unique layout. Rather than the agent blindly going through the maze and remembering everything, this version would just look at the maze and know how to solve it. Each episode would be a variant, so over time the agent would learn to link what it is seeing to what it needs to do.

My internship has ended. It was nice to make such an interactive game, but obviously we're focused on business, so we're taking reinforcement learning in the direction of business value.

We're also making some advancements towards supply chain optimization, which is exciting because I see it as making a strategy game for an RL agent: trying to optimize whichever digital twin you offer it. Other fields we're exploring with RL are process operations and some experiments with robotics.

I think that's probably my superpower now. I've spent 10 years learning how to code through interactive features and fun mechanics in an environment. A Unity environment or an agent environment, it's all very similar. When it comes to making engaging games, I can now make rewarding games for an RL agent. So my superpower is just seeing the fun in all of it.

What are your interests aside the job?

I'm a big music lover. I like to make music, listen to music.

I try to be a jack of all trades when it comes to instruments and coding. I play the guitars: acoustic, bass, electric. I play the piano. I play the drums. I really want to learn cello and saxophone, but those are crazy expensive, and crazy complex as well. I'm a big fan of music, and obviously a huge fan of coding; I spend 90% of my time just coding.

I code anything from games to more machine-learning-heavy side projects. If I make a game these days, I'll try to implement artificial intelligence to create an interesting system and learn from that. My biggest passion of all is knowledge and learning. I'm always looking to learn new stuff and to improve, which brings us to values at work: I really like to provide insights into the things I research. I like to be challenged, and Faktion is just great at that. They allow me to do valuable research, and I'm given the opportunity to start a project in a field I might not yet be super knowledgeable in; give it a couple of weeks and I might have learned a thing or two and come up with an interesting solution. Here at Faktion we value knowledge sharing. When all the engineers are together, we can nerd out about some algorithm. We try to share as much knowledge as possible, because machine learning has a lot of different branches, so it's nice to see what the NLP people are doing and what the MLOps guys are doing.

What drives me?

I wake up every day feeling like a 21st-century wizard because machine learning just feels like magic to me, magic built on theory. What drives me is just the rush of getting a model right, of getting a reinforcement learning agent to behave exactly as I predicted. Sometimes, I can be stuck on a problem, and then I'll randomly get an idea. And if that works, it's just fascinating. What drives me is knowledge and learning new things, learning crazy machine learning concepts.

Ambitions? Dreams?

Again, knowledge, my ambition is to learn as much as I possibly can from as many different fields as I can. I'm happy at Faktion. I'm being challenged. Every project has been totally different and required a specific angle to solve, which is always interesting. As a hobbyist programmer and developer, I'm learning every day and that's just the best feeling.

I hope one day I'll make a commercial game infused with artificial intelligence. As for future plans at Faktion, I'm just trying to provide as much meaningful knowledge and expertise where I can and I'm trying to look for products everywhere I go.


My dream would be to have an independent game company. I'm waiting for the next wave of next-gen virtual reality hardware, because then there's going to be another big boom of software sold for those devices. I'm expecting that within the next 5 to 10 years, and by then I'm hoping I'll have everything I need to create a mind-blowing gaming experience. It would be a dream come true to push the landscape of entertainment by infusing intelligence into existing systems.
