An Introduction to Reinforcement Learning

The key idea behind reinforcement learning is the interaction between an agent and an environment. The environment represents the outside world; the agent takes actions in it and, in return, receives an observation consisting of a reward for its action and information about its new state. The reward tells the agent how good or bad the action was, and the observation tells it which state it is in now. An action may affect not only the immediate reward but also the rewards available in future states.
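
To make this loop concrete, here is a minimal sketch with a toy environment and a random agent. The `ToyEnv` class and its reward scheme are invented purely for illustration; real libraries such as Gymnasium expose a very similar reset/step cycle:

```python
import random

# A toy environment: two states, 0 and 1. Taking action 1 while in
# state 1 earns a reward and ends the episode; everything else gives
# no reward. (Purely illustrative.)
class ToyEnv:
    def reset(self):
        self.state = 0
        return self.state                     # initial observation

    def step(self, action):
        reward = 1.0 if (self.state == 1 and action == 1) else 0.0
        done = self.state == 1 and action == 1
        self.state = min(self.state + action, 1)
        return self.state, reward, done       # observation, reward, episode end

# The simplest possible agent: pick an action at random.
def random_agent(observation):
    return random.choice([0, 1])

env = ToyEnv()
obs = env.reset()
total_reward = 0.0
for t in range(20):                           # cap the episode length
    action = random_agent(obs)                # agent acts...
    obs, reward, done = env.step(action)      # ...environment responds
    total_reward += reward                    # reward signals how good the action was
    if done:
        break
print("return from this episode:", total_reward)
```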

Key Concepts in Reinforcement Learning

Bellman Equations

The intuition, as OpenAI's Spinning Up puts it, is that the value of your starting point is the reward you expect to get from being there, plus the value of wherever you land next.

Both value functions obey Bellman equations. For a policy π, the Bellman equations for the state-value and action-value functions are

$V_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma\, V_\pi(S_{t+1}) \mid S_t = s]$

$Q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma\, Q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$
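
In practice the Bellman equation doubles as an update rule: repeatedly replace the value of each state with the expected immediate reward plus the discounted value of the successor. Below is a minimal sketch of this iterative evaluation; the two-state chain, its transition matrix P, rewards R, and discount gamma are all invented for illustration (under a fixed policy, an MDP reduces to exactly this kind of reward process):

```python
import numpy as np

# Hypothetical two-state chain: P[i][j] is the probability of moving
# from state i to state j, and R[i] is the expected immediate reward in i.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
R = np.array([0.0, 1.0])
gamma = 0.9

# Iterative evaluation: apply the Bellman backup until the values stop
# changing.  V[s] <- R[s] + gamma * sum_s' P[s][s'] * V[s']
V = np.zeros(2)
for _ in range(1000):
    V_new = R + gamma * P @ V
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V)  # converges to the unique fixed point of the Bellman equation
```

Because the backup is linear here, the same values also solve V = R + 𝛾PV directly, i.e. V = (I − 𝛾P)⁻¹R; the iterative form is what scales to large problems.
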
Markov Decision Processes

Markov Processes

$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, S_2, \dots, S_t]$

A Markov process is a memoryless random process, i.e. a sequence of random states S1, S2, … with the Markov property above: the next state depends only on the current state, not on the history. A Markov process (or Markov chain) is a tuple (S, P), where S is a state space and P is a state transition probability function. These two components, S and P, fully define the dynamics of the system.
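
Here is a minimal sketch of sampling from such a chain; the two-state weather space and the transition probabilities are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

states = ["sunny", "rainy"]            # hypothetical state space S
P = np.array([[0.8, 0.2],              # transition function P:
              [0.4, 0.6]])             # P[i][j] = Pr(next = j | current = i)

# Sample a trajectory: each next state depends only on the current one.
s = 0
trajectory = [states[s]]
for _ in range(10):
    s = rng.choice(len(states), p=P[s])
    trajectory.append(states[s])

print(" -> ".join(trajectory))
```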

Markov Reward Process

A Markov reward process is a Markov process with a value judgement attached: it tells us how much reward accumulates along a particular state sequence that we sample. An MRP is a tuple (S, P, R, 𝛾), where S is a finite state space, P is the state transition probability function, 𝛾 is a discount factor, and R is a reward function giving the immediate reward we expect to collect from state s:

$R_s = \mathbb{E}[R_{t+1} \mid S_t = s]$

There is also the notion of the return G_t, which is the total discounted reward from time step t onwards:

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

This is the quantity we care about; the goal is to maximise this return.

𝛾 is a discount factor, with 𝛾 ∈ [0, 1]. It tells the agent how much it should care about immediate rewards relative to rewards in the future. If 𝛾 = 0, the agent is short-sighted: it cares only about the first reward. If 𝛾 = 1, the agent is far-sighted: it cares about all future rewards equally.
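
A short sketch of how the discount factor shapes the return, computing G_0 for one made-up reward sequence under several values of 𝛾:

```python
# Discounted return G_t = sum_k gamma^k * R_{t+k+1}, computed from the
# end of the sequence backwards: G_t = R_{t+1} + gamma * G_{t+1}.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0, 10.0]   # hypothetical reward sequence

for gamma in (0.0, 0.5, 0.9, 1.0):
    print(gamma, discounted_return(rewards, gamma))
# gamma = 0.0 -> 1.0   (only the first reward counts)
# gamma = 1.0 -> 13.0  (all rewards count equally)
```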

Markov Decision Process

An MDP is a Markov reward process with decisions; it is an environment in which all states are Markov. This is the problem we want to solve. An MDP is a tuple (S, A, P, R, 𝛾), where S is a finite state space, A is a finite set of actions, P is the state transition probability function (now conditioned on the action taken), R is the reward function, and 𝛾 is a discount factor.
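
To make the tuple concrete, here is a sketch of a tiny MDP solved by value iteration, which applies the Bellman backup with a max over actions. The states, actions, transitions, and rewards below are all invented for illustration:

```python
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9

# Hypothetical dynamics: P[a][s][s'] = Pr(s' | s, a); R[s][a] = expected reward.
P = np.array([
    [[0.7, 0.3, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],  # action 0
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

# Value iteration: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
V = np.zeros(n_states)
for _ in range(500):
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)   # greedy policy w.r.t. the final values
print("V:", V, "policy:", policy)
```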

Types of Reinforcement Learning Agents

Famous Deep RL Algorithms

References

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html