Imagine if robots could adapt to any environment they’re put into without explicitly being told how to do so. Similar to how toddlers learn to walk by falling over again and again, what if we could give robots the ability to learn a task by failing and learning from their mistakes? Surely, you must be wondering which science fiction movie I stole this notion from. Welcome to the world of **Reinforcement Learning (RL)**. You may have heard of Google DeepMind’s ‘AlphaGo’ beating professionals at the game ‘Go’ and computer algorithms beating players at 1v1 Dota. These are indeed technological revelations of the recent past. RL is booming right now and finds a plethora of applications in robotics, such as robotic arm manipulation and the locomotion of walking robots, both biped and quadruped. Many robotics researchers are moving away from traditional control algorithms and toward RL because of its ability to adapt to any working environment. It is therefore becoming an essential tool for roboticists. This post serves as an introduction to RL and the various possibilities it offers.

Reinforcement learning is a sub-field of machine learning, which is an essential tool for achieving intelligent robotic behavior. Two common approaches to training a machine learning model are supervised and unsupervised learning. In supervised learning, you give the model inputs whose outputs are definitive and known to you. You can therefore tweak the parameters of your algorithm so that it produces those known outputs. For example, if you were to train your algorithm to play a video game, you would simply have an expert gamer play the game, feed in the various situations encountered during the course of the game as inputs, and use the corresponding actions taken by the professional in each situation as the expected outputs. Over time, the algorithm will replicate the actions of the expert and ideally become as good as the professional. However, **it can never be better than the professional**. Clearly, this approach requires a lot of data about the input states and the corresponding actions to be performed on them. It is therefore a ‘data-driven’ approach.

What if you wanted your algorithm to be better than the professional? To realize this, we need to move away from supervised learning and segue into the world of Reinforcement Learning. Strictly speaking, RL is usually treated as its own paradigm rather than a form of unsupervised learning, but like unsupervised learning it requires no labeled data. The framework of Reinforcement Learning is surprisingly similar to that of supervised learning. Any neural network model, supervised or not, will have the following features: an input layer, a set of hidden layers, and an output layer. A typical neural network model is depicted below:

Considering the same scenario of training an algorithm to play a video game, we will not have any data to tell us the ideal move in a given scenario, and thus no idea what the output should be for any given input. In Reinforcement Learning, the network (the hidden layers in the image above) that transforms the given input into an output action is called a **‘Policy Network’**. The objective now is to train the policy network in a non-data-driven fashion, such that the inputs are appropriately processed to yield a meaningful output.
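As a concrete sketch, a minimal policy network can be written in plain NumPy. Everything here is a hypothetical illustration: the flattened 80×80 frame size, the 200 hidden units, and the single ‘jump probability’ output are arbitrary assumptions, not values from any particular game.

```python
import numpy as np

def policy_network(frame, w_hidden, w_out):
    """A minimal policy network: one hidden layer mapping a game frame
    (the input layer) to a probability of jumping (the output layer)."""
    hidden = np.maximum(0, frame @ w_hidden)   # ReLU hidden layer
    logit = hidden @ w_out                     # single output unit
    return 1.0 / (1.0 + np.exp(-logit))        # sigmoid -> P(jump)

# Hypothetical sizes: an 80x80 flattened frame, 200 hidden units
rng = np.random.default_rng(0)
w_hidden = rng.standard_normal((6400, 200)) * 0.01
w_out = rng.standard_normal(200) * 0.01

frame = rng.standard_normal(6400)              # stand-in for a real frame
p_jump = policy_network(frame, w_hidden, w_out)
```

With randomly initialized weights, the output probability starts out near 0.5, which is exactly the ‘no idea what to do’ state described next.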

To understand how this works, imagine that the game we’re playing is the classic Super Mario, and at a particular frame in the game engine (given as input), we have the option to either jump or not jump. Our algorithm will have no idea what the correct action is, so initially there is an equal probability of choosing between the available options, that is, either to jump or not jump. The algorithm will thus make its choice randomly, which will have either a positive or a negative impact on the player, and the game engine will move on to the next frame, where another choice awaits. In this way, the algorithm is trained to assign different probabilities to jumping or not jumping in each iteration. In other words, we let the algorithm explore the various possibilities, much like how we play around with the different functionalities and controls of a game we’ve just been introduced to. How do we as humans decide whether our actions are right while playing a game? We look at the scoreboard and judge whether we’re progressing or falling behind. In a similar way, the only data we give the algorithm is the scoreboard, as feedback from the game engine. The algorithm receives a reward if it makes a move that increases its score on the scoreboard and a penalty if it fails to do so. Using this feedback, the algorithm then tries to optimize its policy to obtain the **maximum reward**. In effect, the algorithm learns from experience by playing Super Mario several thousand times.
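The sampling-and-feedback loop above can be sketched in a few lines. This is a hypothetical illustration, not an interface to any real game engine: `choose_action` samples from the policy’s jump probability, and `scoreboard_feedback` stands in for the engine reporting whether the score went up.

```python
import numpy as np

rng = np.random.default_rng(1)

def choose_action(p_jump):
    """Sample the action from the policy's probability. Early in training
    p_jump is near 0.5, so jump and no-jump are roughly equally likely."""
    return 1 if rng.random() < p_jump else 0   # 1 = jump, 0 = don't jump

def scoreboard_feedback(score_before, score_after):
    """The only data the algorithm gets: reward if the scoreboard
    went up, penalty if it did not."""
    return 1.0 if score_after > score_before else -1.0
```

Repeating this loop over thousands of games produces the (action, reward) experience the policy is trained on.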

As a novice to the game, our algorithm is going to fail miserably in most of its attempts. However, there will be certain instances when it gets lucky and accumulates a lot of points on the scoreboard, and thus a lot of rewards. These chance encounters are exactly what we need, as the algorithm will be modified to ensure that there is a greater probability of generating the moves that led to these rewards in the future, by applying something called a **‘positive gradient’**. Similarly, we try to avoid all the negative results in the future by multiplying the gradient by -1 (a **‘negative gradient’**), as a way to indicate that those moves are not favorable, thus reducing their likelihood. Eventually, the negative results will be filtered out and the reward-yielding actions will become more and more likely. Over time, our algorithm will learn to play the game all by itself.
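The positive/negative gradient idea is the core of REINFORCE-style policy-gradient updates: scale the gradient that made the chosen move more likely by the reward, so a reward of -1 literally multiplies it by -1. A minimal sketch, with a toy gradient vector as an assumption:

```python
import numpy as np

def reinforce_update(weights, grad_log_prob, reward, lr=0.01):
    """REINFORCE-style update: the gradient that made the chosen action
    more likely is scaled by the reward. A positive reward applies a
    'positive gradient' (the move becomes more likely); a negative
    reward flips its sign, making the move less likely."""
    return weights + lr * reward * grad_log_prob

# Toy example: the same gradient pushed in opposite directions
w = np.zeros(3)
g = np.array([1.0, 0.0, 0.0])                   # gradient favoring the chosen move
w_good = reinforce_update(w, g, reward=+1.0)    # rewarded -> weight moves up
w_bad = reinforce_update(w, g, reward=-1.0)     # penalized -> weight moves down
```

The update direction is identical to supervised learning’s gradient step; the only difference is that the ‘label’ is replaced by the sign and size of the reward.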

While all of this sounds idealistic, there are downsides to this approach. Suppose the algorithm is playing Super Mario really well five minutes into the game. Suddenly, it makes a bad move and a penalty is incurred. The algorithm then assumes that all the good moves it made during those five minutes were also bad moves, and reduces their likelihood as well. However, we know that it was only the last move, after the fifth minute, that resulted in its downfall, not the entire five-minute episode of moves. This is called the **‘Credit Assignment Problem’** in RL. It occurs because of the **‘Sparse Reward Setting’**, where instead of getting a reward after every move, we only get the reward after an entire episode of moves. To solve this, the algorithm needs to work out which part of the episode caused the reward, and thus deduce which actions to repeat and which to avoid. A major disadvantage of this approach is the sheer number of games that need to be sampled before a stable, desirable output can be achieved.
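One common way to soften the credit-assignment problem (a standard technique, though not the only one) is to spread a sparse end-of-episode reward backwards through the episode with an exponential discount, so moves near the outcome get more credit than moves far from it. A minimal sketch:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Spread a sparse reward backwards over the episode: each step's
    return is its own reward plus the discounted return of the future,
    so later moves receive more credit (or blame) than early ones."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Episode where only the final move was penalized:
returns = discounted_returns([0.0, 0.0, 0.0, -1.0])
```

Here the final move receives the full -1 penalty, while the good moves five minutes earlier are only mildly discouraged, rather than being blamed equally.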

In this simple scenario of Super Mario, we considered whether the algorithm should make the sprite jump or not jump. However, when applying RL algorithms to robotics, the options are seldom binary. Suppose you wanted to train a robotic arm to pick up an object and stack it on top of another. A typical robotic arm has at least 7 joints, so there is a much larger sample space to explore to find the right combination of moves that results in a reward, i.e., successfully stacking one block over the other. In most cases, randomly generating moves from the state space is futile and impractical. We solve this with a technique called **‘Reward Shaping’**, where we give the model intermediate rewards to allow faster convergence to the desired solution. In our example, we could give a reward each time the arm brings one block closer to the other, and again when the orientations of the blocks are similar. However, since reward shaping is very specific to a given problem, it needs to be redone for every new problem and every new environment; it cannot be generalized and thus isn’t scalable. In some cases, the agent we’re training finds a way to ‘cheat’ our reward-shaping function, collecting rewards while doing something completely absurd. As a result, reward shaping can be very tricky to get right.
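A shaping function for the block-stacking example might look like the sketch below. Everything here is a hypothetical assumption: the 0.1 and 0.5 reward magnitudes, the distance-based trigger, and the boolean `aligned` flag are hand-picked for illustration, which is exactly the per-task tuning that makes shaping hard to generalize.

```python
import numpy as np

def shaped_reward(block_pos, target_pos, prev_dist, aligned):
    """Hypothetical shaping for block stacking: a small intermediate
    reward for moving the block closer to the target, plus a bonus
    when the orientations align. Both magnitudes are hand-tuned."""
    dist = float(np.linalg.norm(np.asarray(block_pos) - np.asarray(target_pos)))
    reward = 0.0
    if dist < prev_dist:      # got closer -> intermediate reward
        reward += 0.1
    if aligned:               # orientations match -> bonus
        reward += 0.5
    return reward, dist
```

Note the failure mode described above: an agent can ‘cheat’ such a function, e.g. by jittering the block back and forth to repeatedly collect the got-closer bonus without ever stacking anything.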

While it’s true that many algorithms have learned some tasks all by themselves, credit must be given to the engineers behind their development, as the learning process has to be optimized in a meticulous fashion to reap these rewards (pun intended). While this was a generic introduction to RL, further posts will explore robotics-specific applications of RL and the mathematics involved in these algorithms. I hope this was of some use in understanding the underlying mechanisms of RL. Stay tuned for the next one. Your comments and feedback will be greatly appreciated!