#### Introduction

Machine Learning can be broadly classified into 3 categories:

- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning

**Supervised Learning** is a type of learning in which the target variable is known and is used explicitly during training; the model is trained under the supervision of a "teacher" (the target). For example, to build a classification model for handwritten digits, the input is a set of images (the training data) and the target variable is the label assigned to each image, that is, its class from 0 to 9.

**Unsupervised Learning** is a type of learning algorithm that is used to draw inferences from datasets consisting of input data without knowing the target. The most common unsupervised learning method is cluster analysis, which is used for exploratory data analysis to find hidden patterns or grouping in data.

**Reinforcement Learning** is a type of learning in which the machine decides which actions to take in a given situation or environment so as to maximize a reward. It differs from supervised learning in its feedback: the reward signal only tells the agent whether the action it took was good or bad. It says nothing about which action would have been best. In this type of learning there is no pre-collected training dataset and there are no target variables; the agent learns from the experience it gathers by interacting with the environment.

#### Reinforcement Learning

Reinforcement learning is a type of Machine Learning that is influenced by behaviorist psychology. It is concerned with how software agents ought to take action in an environment so as to maximize some notion of cumulative reward.

Reinforcement learning is learning what to do, that is, how to map situations to actions, so as to maximize a numerical reward signal. Unlike the other learning methods above, it does not learn patterns from a fixed training dataset. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics, trial-and-error search and delayed reward, are the distinguishing features of Reinforcement Learning.
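Trial-and-error search can be illustrated with a small sketch. The example below is not from the original article: it assumes a hypothetical three-armed bandit with Gaussian rewards and uses an epsilon-greedy rule (the names `run_bandit`, `true_means`, and `epsilon` are illustrative). It shows only the trial-and-error aspect; delayed reward does not arise in a bandit because each action affects only the immediate reward.

```python
import random


def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error on a hypothetical multi-armed bandit."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms  # running average reward observed per action
    counts = [0] * n_arms       # how many times each action has been tried
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action to learn more about it.
            action = rng.randrange(n_arms)
        else:
            # Exploit: pick the action that currently looks best.
            action = max(range(n_arms), key=lambda a: estimates[a])
        # The environment answers with a noisy reward (assumed Gaussian here).
        reward = rng.gauss(true_means[action], 1.0)
        counts[action] += 1
        # Incremental update of the average reward for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, counts, total_reward


estimates, counts, total = run_bandit([0.2, 0.5, 1.5])
print("pulls per action:", counts)
print("estimated mean rewards:", [round(e, 2) for e in estimates])
```

The agent is never told that the third action has the highest mean reward; it finds this out by trying actions and keeping track of what each one returned.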

**The reinforcement learning model consists of:**

- A set of environment and agent states S.
- A set of actions A of the agent.
- Policies of transitioning from states to actions.
- Rules that determine the scalar immediate reward of a transition.
- Rules that describe what the agent observes.

A task is defined by a set of states, s ∈ S, a set of actions, a ∈ A, a state-action transition function, T: S × A → S, and a reward function, R: S × A → ℝ. At each time step, the learner (also called the agent) selects an action and, as a result, receives a reward and moves to a new state. The goal of reinforcement learning is to learn a policy, a mapping from states to actions, π: S → A, that maximizes the sum of its rewards over time.
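The pieces of this definition can be written down directly. The following is a minimal sketch under assumptions that are not in the original text: a hypothetical three-state chain with two actions, a deterministic transition function `transition`, a reward function `reward`, and a policy stored as a plain dictionary. The `rollout` helper sums the rewards a policy collects over a fixed number of steps.

```python
# Hypothetical task: states 0, 1, 2 on a line; actions move left or right.
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]


def transition(state, action):
    """T: S x A -> S. Deterministic move, clipped to the ends of the chain."""
    if action == "right":
        return min(state + 1, 2)
    return max(state - 1, 0)


def reward(state, action):
    """R: S x A -> real number. Only stepping right from state 1 pays off."""
    return 1.0 if (state == 1 and action == "right") else 0.0


def rollout(policy, start_state, n_steps):
    """Follow a policy (a mapping from state to action) and sum the rewards."""
    state, total = start_state, 0.0
    for _ in range(n_steps):
        action = policy[state]
        total += reward(state, action)
        state = transition(state, action)
    return total


always_right = {s: "right" for s in STATES}
always_left = {s: "left" for s in STATES}

print(rollout(always_right, 0, 5))  # 1.0
print(rollout(always_left, 0, 5))   # 0.0
```

Comparing the two rollouts shows what "learning a policy" means here: among all mappings from states to actions, the learner looks for one with the largest accumulated reward.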

In machine learning, the environment is typically formulated as a Markov decision process (MDP), since many reinforcement learning algorithms in this setting use dynamic programming techniques.

##### Examples

To gain more insight into Reinforcement Learning, let us consider some examples:

- A master chess player makes a move. The choice is informed both by planning (anticipating possible replies and counter-replies) and by immediate, intuitive judgments of the desirability of particular positions and moves.
- An adaptive controller adjusts parameters of a petroleum refinery’s operation in real time. The controller optimizes the yield/cost/quality trade-off on the basis of specified marginal costs without sticking strictly to the set points originally suggested by engineers.
- A gazelle calf struggles to its feet minutes after being born. Half an hour later it is running at 20 miles per hour.
- A self-driving car that improves its driving decisions from feedback on its actions is a popular example of Reinforcement Learning.
- A computer that has been trained through Reinforcement Learning plays Tic-Tac-Toe against a human.

##### Elements of Reinforcement Learning

Besides the agent and the environment, a reinforcement learning system has four sub-elements:

- **Policy:** It defines the learning agent's way of behaving at a given time.
- **Reward function:** It defines the goal in a reinforcement learning problem.
- **Value function:** It specifies what is good in the long run.
- **Model of the environment (optional):** Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced.

Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary. Without rewards, there could be no values, and the only purpose of estimating values is to achieve more reward.

Reinforcement learning is all about finding the optimal way of making decisions and taking actions so that we maximize the **reward R**. The reward is a feedback signal that shows how well the agent is doing at a given time step. The **action A** that an agent takes at every time step is a function of both the reward and the **state S**, which is a description of the environment the agent is in. The mapping from environment states to actions is the **policy P**. The policy defines the agent's way of behaving at a certain time, given a certain situation. We also have a **value function V**, which is a measure of how good each state is. It differs from the reward in that the reward signal indicates what is good in the immediate sense, while the value function indicates how good it is to be in a given state in the long run. Finally, we have a **model M**, which is the agent's representation of the environment, that is, how the agent thinks the environment is going to behave.

The whole Reinforcement Learning environment can be described with an **MDP**.

#### Markov Decision Process (MDP)

A Markov decision process (MDP) is a mathematical framework used to model decision making in situations where outcomes are partly random and partly under the control of a decision maker.

MDPs are useful for studying a wide range of optimization problems that can be solved by dynamic programming and reinforcement learning. An MDP consists of a finite set of states, value functions for those states, a finite set of actions, a policy, and a reward function.

The above diagram illustrates an MDP with three states (green circles) and two actions (orange circles), with two rewards (yellow arrows). The value of a state or action can be defined in terms of two functions.

**State-value function V:** The state-value function V is defined as the expected return from being in a state s and following a policy π. It is calculated as the sum of the discounted rewards at each future time step, where γ (gamma) is a constant discount factor with a value between 0 and 1. It is represented by the following equation:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s\right]$$
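As a small worked illustration of the discounted sum described above, the sketch below computes the return of one finite reward sequence. The reward values and the helper name `discounted_return` are made up for the example; the state value would be the expectation of this quantity over all trajectories that the policy can generate from the state.

```python
def discounted_return(rewards, gamma):
    """Return r_1 + gamma * r_2 + gamma^2 * r_3 + ... for a finite reward list."""
    g = 0.0
    # Work backwards so each step applies G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards observed after leaving a state
print(discounted_return(rewards, 0.5))  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
print(discounted_return(rewards, 1.0))  # no discounting: 3.0
```

With a discount factor below 1, rewards that arrive later contribute less, which is why the same three rewards give 1.5 rather than 3.0.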

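**Action-value function Q:** The value of taking an action a in a state s under a policy π is the expected return starting from that state, taking that action, and thereafter following π:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$$

In Q-learning, action values are estimated with the update rule

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right]$$

This update is commonly known as the Q-equation, where $r_{t+1}$ is the reward observed after performing $a_t$ in $s_t$, α (alpha) is the learning rate, and γ (gamma) is a number between 0 and 1 called the discount factor.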
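A tabular version of this kind of update can be sketched in a few lines. The example assumes the standard Q-learning rule, Q(s, a) ← Q(s, a) + α [r + γ max Q(s', a') − Q(s, a)], and a hypothetical corridor environment with states 0 to 3 in which reaching state 3 gives a reward of 1 and ends the episode. The environment, the function names, and the uniformly random behaviour policy are illustrative choices, not part of the original article.

```python
import random
from collections import defaultdict

ACTIONS = ["left", "right"]
GOAL = 3  # hypothetical corridor with states 0..3; reaching 3 ends the episode


def step(state, action):
    """Environment dynamics: move one cell, reward 1.0 on reaching the goal."""
    next_state = min(state + 1, GOAL) if action == "right" else max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            # Q-learning is off-policy, so random behaviour is enough here.
            action = rng.choice(ACTIONS)
            next_state, reward = step(state, action)
            # Value of the best action in the next state (0 at the goal).
            best_next = max(Q[(next_state, a)] for a in ACTIONS)
            # Move Q(s, a) a fraction alpha towards the bootstrapped target.
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
            if state == GOAL:
                break
    return Q


Q = q_learning()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
print(greedy)
```

Although the agent behaves randomly while learning, acting greedily with respect to the learned table moves right in every non-goal state, and the learned values shrink by a factor of γ for each extra step away from the goal.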

Solving an MDP yields the optimal policy. This can be done with dynamic programming, and specifically with **policy iteration**. The idea is to take some initial policy π1 and evaluate the state-value function for that policy, using the **Bellman expectation equation**:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[R_{t+1} + \gamma V^{\pi}(S_{t+1}) \mid S_t = s\right]$$

This equation says that the value function under the policy π can be decomposed into the expected sum of the immediate reward $R_{t+1}$ and the discounted value of the successor state $S_{t+1}$. This is equivalent to the value function definition given earlier. The **policy evaluation** component uses this equation. To get a better policy, we use a **policy improvement** step in which we simply act greedily with respect to the value function; in other words, the agent takes the action that maximizes value.

To obtain the optimal policy, we repeat these two steps, one after the other, until the policy converges to the optimal policy π*.
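The alternation of evaluation and improvement can be sketched as follows. The three-state MDP, its transition probabilities, and the action names `stay` and `go` are assumptions made for this example; evaluation repeatedly applies the Bellman expectation backup until the values stop changing, and improvement picks the greedy action in every state.

```python
GAMMA = 0.9

# Hypothetical MDP: P[state][action] = list of (probability, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(0.8, 2, 1.0), (0.2, 0, 0.0)]},
    2: {"stay": [(1.0, 2, 0.0)], "go": [(1.0, 2, 0.0)]},  # absorbing state
}


def q_value(V, s, a):
    """Expected immediate reward plus discounted value of the successor state."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def evaluate(policy, tol=1e-10):
    """Policy evaluation: iterate the Bellman expectation backup to convergence."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = q_value(V, s, policy[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


def improve(V):
    """Policy improvement: act greedily with respect to the value function."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}


def policy_iteration():
    policy = {s: "stay" for s in P}  # arbitrary initial policy
    while True:
        V = evaluate(policy)
        new_policy = improve(V)
        if new_policy == policy:  # no change: the policy is stable
            return policy, V
        policy = new_policy


policy, V = policy_iteration()
print(policy)
print({s: round(v, 4) for s, v in V.items()})
```

On this small example the loop stops after a few rounds, once the greedy policy with respect to the evaluated values is the same policy that was just evaluated.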

#### Summary

Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision making. It is distinguished from other computational approaches by its emphasis on learning by the individual from direct interaction with its environment, without relying on a predefined labeled dataset. Reinforcement learning addresses the computational issues that arise when learning from interaction with the environment so as to achieve long-term goals.

RL uses a formal framework that defines the interaction between a learning agent and its environment in terms of states, actions, and rewards. The framework is intended to be a simple way of representing essential features of the artificial intelligence problem. These features include a sense of cause and effect, a sense of uncertainty and non-determinism, and the existence of explicit goals.

#### References

- Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Cambridge: MIT Press, 1998.
- Kaelbling, Leslie Pack, Michael L. Littman, and Andrew W. Moore. "Reinforcement Learning: A Survey." Journal of Artificial Intelligence Research 4 (1996): 237-285.
- Deep-Learning-Research-Review-Week-2-Reinforcement-Learning: https://adeshpande3.github.io/