
Exploring the Maze of Reinforcement Learning

Mostafa Hussien
Mostafa Hussien is a PhD student in the Department of Electrical Engineering at ÉTS. He completed his MSc in 2017.



The last decade has witnessed increased applicability of reinforcement learning (RL) as a consequence of its successive achievements. These achievements have taken the form of defeating human players at complex games that require a high degree of intelligence, like chess, Go, and Atari games. However, beginners who have just started their journey of learning this powerful tool usually find themselves in a maze of algorithms, technical terms, and jargon, which complicates building a road map to guide their learning efforts. In this article, we present a clear taxonomy of the well-known RL algorithms. This taxonomy guides beginners on their journey toward learning and mastering reinforcement learning. At the end, we recommend some powerful tools for building RL applications, along with helpful readings.

What Is Reinforcement Learning?

Reinforcement learning (RL) is a branch of machine learning concerned with learning through trial and error. According to Richard Sutton, “Reinforcement learning problems involve learning what to do—how to map situations to actions—so as to maximize a numerical reward signal.” Therefore, RL has been applied to many problems that require interaction with external environments, such as robotics, control, wireless communications, games, and algorithmic trading.
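To make this definition concrete, here is a minimal sketch of the trial-and-error paradigm on a hypothetical two-action environment. The reward probabilities, the epsilon value, and the episode count are illustrative assumptions, not anything from the article; the point is only that the agent improves its action-value estimates purely from observed rewards:

```python
import random

# Hypothetical toy environment: two actions with different expected rewards.
# The agent never sees these probabilities; it must learn them by trial and error.
REWARD_PROB = {0: 0.3, 1: 0.8}  # action -> probability of receiving reward 1

def step(action):
    """The environment returns a numerical reward signal for the chosen action."""
    return 1.0 if random.random() < REWARD_PROB[action] else 0.0

def run(episodes=5000, epsilon=0.1, seed=0):
    random.seed(seed)
    value = {0: 0.0, 1: 0.0}   # estimated value of each action
    count = {0: 0, 1: 0}
    for _ in range(episodes):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if random.random() < epsilon:
            action = random.choice([0, 1])
        else:
            action = max(value, key=value.get)
        reward = step(action)
        count[action] += 1
        # Incremental average: nudge the estimate toward the observed reward.
        value[action] += (reward - value[action]) / count[action]
    return value

estimates = run()
print(estimates)  # the estimate for action 1 should approach 0.8
```

After enough interactions, the agent's estimate for the better action converges toward its true expected reward, which is exactly the "maximize a numerical reward signal" idea in Sutton's definition.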

The Potential of Reinforcement Learning

RL has shown great potential in tackling complex problems across different domains. Recently, this power has been greatly boosted by advances in deep learning. Coupling the two techniques produced the powerful tool of deep reinforcement learning (DRL). Many groundbreaking success stories have been recorded using DRL. For example, a DRL agent learned to play classic Atari games at a superhuman level of performance. The hybrid DRL system AlphaGo recorded another success story when it defeated the human world champion at the complex game of Go. An earlier breakthrough for RL, recorded in the early 1990s, is TD-Gammon, a neural-network-based technique that attained expert-level performance at backgammon. These examples are a mere subset of a wide range of achievements.

Taxonomy of RL Algorithms

FIGURE 1. An intuitive taxonomy of RL algorithms. Such a taxonomy is crucial for anyone wishing to start learning RL, as it acts as a guide for exploring the interconnected and overlapping algorithms.

One main challenge facing anyone who wants to start learning RL is the lack of a clear road map of the techniques and algorithms falling under its umbrella. A beginner is confronted with a large number of algorithms, making it difficult to draw the big picture of how these components connect with each other. In Fig. 1, we present a high-level taxonomy of the common RL algorithms in the literature.

A given problem can be categorized as either a bandits problem or a Markov decision process (MDP) problem, based on whether the agent’s actions change the state of the environment. It is worth mentioning that MDP-based problems are more common than bandits problems in real-life applications; therefore, the most famous RL algorithms fall under the umbrella of MDP-based problems.

A broad taxonomy of RL algorithms divides them, based on awareness of the environment dynamics (i.e. the model), into two main classes: model-based and model-free. In model-based algorithms, as the name implies, the model is assumed to be known. Model-free algorithms, on the other hand, do not require prior knowledge of the environment model; the environment is explored through trial and error.

Another relevant classification, based on what the agent tries to optimize, places a given algorithm into one of two main categories: value-based and policy-based. In value-based algorithms, the agent tries to learn the state/action quality function (Q-values), from which the optimal policy can be obtained. In policy-based algorithms, the agent tries to learn the policy directly by means of a parameterized function (e.g. an artificial neural network). Actor-critic is a third category that combines value-based and policy-based methods. As shown in Fig. 1, we follow the taxonomy that divides MDP-based problems according to the availability of the environment model.
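To make the model-free, value-based branch of the taxonomy concrete, here is a minimal sketch of tabular Q-learning on a hypothetical five-state corridor task. The environment, the reward of +1 at the rightmost state, and all hyperparameters are illustrative assumptions, not from the article; the agent learns Q-values without ever being given the transition model:

```python
import random

N_STATES = 5           # states 0..4; reaching state 4 ends the episode (assumed toy task)
ACTIONS = [-1, +1]     # move left or right

def step(state, action):
    """Toy deterministic dynamics; the model-free agent never reads this function."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    random.seed(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # epsilon-greedy: trial-and-error exploration of the unknown model
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            next_state, reward, done = step(state, action)
            best_next = max(Q[(next_state, a)] for a in ACTIONS)
            # Q-learning update: bootstrap from the best value of the next state
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q

Q = q_learning()
# The greedy policy derived from Q should move right in every non-terminal state.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

Once the Q-values are learned, the optimal policy is simply the greedy one with respect to them, which is exactly the value-based idea described above; a policy-based method would instead parameterize and optimize the policy directly.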

Essential Tools for RL

To start learning RL, you should get your hands dirty with some of the available tools. Because of their popularity in this context, the tools mentioned here are mainly used for developing RL applications in Python. First, we need an easy-to-use integrated development environment (IDE). Anaconda is one of the best options for this task: it integrates the most commonly used Python packages in one place. OpenAI Gym is an RL development toolkit that encapsulates many environments for developing and validating different RL algorithms. For the sake of completeness, we should mention other necessary ML packages, such as NumPy, SciPy, scikit-learn, and Keras-RL. Fig. 2 summarizes some of these essential tools.
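As a hypothetical starting point, the packages above can be installed into a fresh Anaconda environment roughly as follows. The environment name and Python version are arbitrary choices, and the package names are the ones commonly published on PyPI, not something the article specifies:

```shell
# Create and activate an isolated environment (assumes Anaconda is installed)
conda create -n rl-sandbox python=3.10 -y
conda activate rl-sandbox

# Core scientific stack plus the RL-specific packages mentioned above
pip install numpy scipy scikit-learn gym keras-rl
```

Working inside a dedicated environment like this keeps RL experiments from interfering with other Python projects on the same machine.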


The complexity of the interconnections between different RL algorithms is a major barrier for beginners. In this article, we provided a bird’s-eye view of RL algorithms to help the reader start a self-guided learning journey.

Author's profile

Program : Automated Manufacturing Engineering 

Research laboratories : SYNCHROMEDIA – Multimedia Communication in Telepresence 

