Demystifying Machine Learning: Part III

A Three-Part Series on Machine Learning Techniques

“Nature is a self-made machine, more perfectly automated than any automated machine. To create something in the image of Nature is to create a machine, and it was by learning the inner workings of Nature that man became a builder of machines.” —Eric Hoffer, Reflections on the Human Condition

As we’ve learned from parts one and two of this series on Machine Learning techniques, Machine Learning divides into three main approaches: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. In this final article, we will cover Reinforcement Learning.

Reinforcement Learning (RL) explores a sequence of actions, adapting a policy to a specific context in order to maximize a result. It uses a trial-and-error learning paradigm in which two elements, reward and punishment, dictate how an action's adaptation evolves. A policy in RL is a method that transforms input signals into output actions.

When the intelligent agent's action produces an observable consequence in the environment, a function evaluates that consequence and decides whether it is good or bad, rewarding or punishing the agent accordingly and, finally, updating the policy. Hence, the more actions the agent tries, the more knowledge it accumulates in the form of action-reward pairs.
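This try-evaluate-update loop can be sketched in a few lines of Python. Everything here, the actions, the reward function, and the update rule, is a hypothetical stand-in for illustration, not a real RL library API:

```python
import random

ACTIONS = ["left", "right"]

def reward(action):
    # Hypothetical environment: "right" is the good action.
    return 1.0 if action == "right" else -1.0

def update_policy(policy, action, r, lr=0.1):
    # Reward raises the preference for the chosen action; punishment lowers it.
    policy[action] += lr * r

random.seed(0)
policy = {a: 0.0 for a in ACTIONS}
for _ in range(50):
    action = random.choice(ACTIONS)   # the agent tries an action
    r = reward(action)                # the consequence is evaluated
    update_policy(policy, action, r)  # ...and the policy is updated
```

After enough trials, the accumulated preference for the rewarded action clearly outweighs that of the punished one.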

Of course, to build an exhaustive dataset of all the possible action-reward combinations, many unproven actions have to be tried multiple times to confirm their reliability. So there is a balance to strike between exploring new possible actions and exploiting the knowledge already acquired.
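A common way to strike that balance is an epsilon-greedy rule: with a small probability the agent explores a random action, and otherwise it exploits the best-known one. The action values below are made-up numbers for illustration:

```python
import random

def epsilon_greedy(values, epsilon, rng):
    # Explore with probability epsilon, otherwise exploit.
    if rng.random() < epsilon:
        return rng.choice(list(values))   # explore: try any action
    return max(values, key=values.get)    # exploit: best-known action

rng = random.Random(42)
values = {"sit": 0.8, "roll": 0.3, "bark": 0.1}  # hypothetical learned values
choices = [epsilon_greedy(values, 0.2, rng) for _ in range(100)]
```

Most picks exploit the best action, while a minority keep exploring the alternatives.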

To understand this concept better, think about training your dog to perform actions based on the commands you give him. Four main parts come into play when you do this:

  1. Your command
  2. Dog’s reaction
  3. Your right hand, holding a treat representing a reward
  4. Your left hand, holding nothing intended as a punishment

When you say “sit” (command), and the dog sits (action), you give him a treat (reward). If not, you don’t give him anything (punishment). 

If you repeat this long enough, the dog eventually associates the “sit” command with the treat, and in his brain, the neural networks instinctively move the muscles of his legs to make him sit down. That’s also what Reinforcement Learning aims to do.

Another example that fits this learning paradigm well is the braking system of autonomous cars.

Let’s explore this in the diagram below:

Reinforcement Learning Algorithms


The algorithms used in RL split into two categories: model-free algorithms and model-based ones. Think of the model as the representation of reality, or of an intelligent agent's environment, at a specific moment.

Model-free algorithms try to make sense of the environment by learning it in real time, that is, by trial and error. Robots that have to move through a forest (or any other real-life environment) and non-playable characters in a computer game may use this method.

Those algorithms also divide into two additional classes: Policy Optimization and Q-Learning.

Policy Optimization methods optimize the policy parameters θ either directly, by gradient ascent on the performance objective J(πθ), or indirectly, by maximizing local approximations of J(πθ).

This optimization uses data acquired while operating according to the most recent version of the policy and performs the update directly on-policy. Indeed, it usually involves learning an approximator for the on-policy value function, which describes how to update the policy.
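The on-policy idea can be sketched on a two-armed bandit: a single parameter theta defines a sigmoid policy, and gradient ascent on J(πθ) is approximated with REINFORCE-style sampled gradients. The reward values and learning rate are assumptions for illustration:

```python
import math
import random

def action_probs(theta):
    # Sigmoid policy over two actions; p1 is the probability of action 1.
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return [1.0 - p1, p1]

REWARDS = [0.2, 1.0]  # hypothetical average payoffs; action 1 is better

random.seed(1)
theta, lr = 0.0, 0.1
for _ in range(500):
    probs = action_probs(theta)
    a = 0 if random.random() < probs[0] else 1  # sample from the current policy
    r = REWARDS[a]
    # d log pi(a) / d theta for the sigmoid policy:
    grad_log_pi = (1.0 - probs[1]) if a == 1 else -probs[1]
    theta += lr * r * grad_log_pi  # gradient ascent step on J(pi_theta)
```

Because each update uses an action sampled from the current policy, the data is always fresh and on-policy, and theta drifts toward favoring the better-paying action.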

Q-Learning methods learn the optimal action-value function Q(s, a), which translates to learning how good an action is in a particular state. This optimization uses data acquired at any point during training and performs the update off-policy.
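A minimal tabular version of the Q-learning update illustrates the off-policy idea: the agent behaves completely at random, yet the update always bootstraps from the best next action. The tiny three-state chain environment below is an assumption for illustration:

```python
import random

ACTIONS = ["left", "right"]
GOAL = 2

def step(state, action):
    # "right" moves toward the goal state, which pays a reward of 1.
    nxt = min(state + 1, GOAL) if action == "right" else max(state - 1, 0)
    return nxt, (1.0 if nxt == GOAL else 0.0)

random.seed(0)
alpha, gamma = 0.5, 0.9
Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}

for _ in range(200):                   # episodes
    s = 0
    while s != GOAL:
        a = random.choice(ACTIONS)     # behave randomly: off-policy data
        s2, r = step(s, a)
        best_next = max(Q[(s2, b)] for b in ACTIONS)
        # Q-learning update: bootstrap from the best next action
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
```

Even though the behavior was never greedy, the learned table ranks "right" above "left" in every state, which is exactly the self-consistency property mentioned below.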

The main advantage of Policy Optimization methods is that you directly optimize the quantity you care about, which makes them stable and reliable.

Q-Learning methods, on the other hand, only indirectly optimize for agent performance by training to satisfy a self-consistency equation. When they work, they are more sample-efficient thanks to effective data recycling. However, given their many failure modes, they are also less stable.


Model-based algorithms rely on a pre-existing model of the environment, consisting of knowledge of state transitions and rewards. This approach allows the agent to plan ahead, thinking through which actions to perform. This method also splits into two categories: Model Provided and Learn the Model.

The difference between the two approaches is that in the latter case (Learn the Model), the agent has to interact with the environment just enough to construct a representation of it: the model itself. To build this model, the agent applies the Supervised Learning technique.
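Here is a sketch of the Learn the Model idea: the agent gathers a few (state, action) to next-state transitions, fits a simple supervised model (here, a most-frequent-outcome lookup), and then plans inside that model without touching the real environment again. The toy ring-shaped dynamics are an assumption for illustration:

```python
from collections import Counter, defaultdict

def env_step(state, action):
    # Hypothetical deterministic dynamics on four states arranged in a ring.
    return (state + 1) % 4 if action == "fwd" else (state - 1) % 4

# 1. Interact briefly with the environment to gather experience.
data = []
for s in range(4):
    for a in ("fwd", "back"):
        data.append(((s, a), env_step(s, a)))

# 2. Supervised step: predict the most frequently observed next state.
counts = defaultdict(Counter)
for s_a, s_next in data:
    counts[s_a][s_next] += 1
model = {s_a: c.most_common(1)[0][0] for s_a, c in counts.items()}

# 3. Plan ahead inside the learned model, not the real environment.
state = 0
for a in ["fwd", "fwd", "back"]:
    state = model[(state, a)]
```

In this toy case the learned model matches the environment exactly; in real life, as noted below, the environment can change and the model's predictions can drift from reality.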

A downside to this approach is that real-life environments can change, and the simulated actions applied in the model may not have the same outcome in real life.


This learning paradigm applies best to robots and intelligent agents that have to move through a particular environment to maximize a specified result, i.e., to complete a series of actions leading to a desired outcome. Beyond physical robots, many other applications benefit from this AI approach.

Trading agents, for instance, can be trained to spot the most effective trading strategy. The game industry can change gamers' experience completely for the better: as non-playable characters become AI-powered, they will be able to interact with players and together create entirely personalized experiences.

Reinforcement Learning is a cutting-edge technology that has enormous potential to transform our world. Indeed, it is one of the most promising approaches for making machines creative. By seeking new, innovative ways to fulfill its assignments, the computer behaves much like a creative person.

If you think this is only theory, I suggest you look into DeepMind's AlphaGo, which beat Lee Sedol, one of the world's best Go players, in 2016 (Go has more possible board configurations than there are particles in the observable universe!). DeepMind's technology is now used to power research in fields like scientific discovery and many projects that were once seen only in science-fiction movies.

AI systems capable of creating music, a skill once considered the exclusive domain of gifted human beings, are beginning to appear. One of them is MuseNet from OpenAI.

As of 2020, reinforcement learning adoption is still modest. However, recent advances are making this AI approach more and more important. It has the potential to disrupt industries and deliver immense value.

Keep in touch with Cognizant Softvision to be among the first to know how cutting edge technology is shaping our world.