Learn Reinforcement Learning by Coding (1)

Value Iteration, Policy Iteration, Q-learning and Monte Carlo method

January 10, 2017 - 5 minute read -
machine-learning reinforcement-learning

To consolidate my understanding of reinforcement learning algorithms, I decided to implement popular RL algorithms myself. This post includes the core math of RL algorithms, problems I faced while implementing the algorithms and their solutions, and some useful links for learning RL.

Git repo: https://github.com/stormmax/reinforcement_learning

Table of Contents

Dynamic Programming

Value Iteration

  • Learning

here is the expected value (cumulated discounted reward) of state if the agent act optimally. The time complexity of each iteration is , is number of states, is number of actions on each state.

can also be expressed in terms of q-value:

  • Optimal Policy

Policy Iteration

Policy iteration method could be see as iteratively doing policy evaluation and policy improvement.

Deterministic Policy

From UC Berkeley’s AI course as well as on Richard S. S, and Andrew G. B. ‘s Reinforcement Learning book, the policy on each state is deterministic

  • Policy Evaluation

The run time complexity is for each iteration.

Since policy iteration removes the nonlinear operation, it can be solved by solving a linear system.

  • Policy Improvement

Policy improvement is not necessarily performed on each iteration and hence we can save computation.

Non-Deterministic Policy

However, from David Silver’s lecture slides, the policy on each state is a probability distribution over all the possible actions.

  • Policy Evaluation

The run time complexity is therefore

  • Policy Improvement

Here I ran into the issue that the distribution of probability of actions on state s should always be positive, however, the values of some states were evaluated as negative. To solve this problem, I exponentiated all the values and then took the softmax distribution:

Where is a set of possible actions at state .

Model-based Q-learning

Model-Free Temporal Difference (TD) Learning

TD Value Learning

Where is the learning rate. The second equation is exponential moving average. The reasons why using moving average instead of averaging over all the sample:

  • Samples are noisy
  • More recent samples are better, and hence should be counted more
  • Memoryless

Off-Policy TD Q-learning

TD Q-learning only propagate the optimal Q-value. Q-learning is off-policy learning, which means, in the limit, it will converge to the correct value no matter how you select the actions.

Some things to note for Q-learning:

  • You have to explore more
  • You have to eventually make the learning rate small enough


  • With (small) probability , act randomly
  • With (large) probability , act on current policy

One of the issues I ran into when implementing q-learning is that how does the agent get access to the legal actions on each state? In UCB’s pacman project, they passed a actionFn in the constructor which takes a state and returns a list of legal actions. This practice assumes that the agent has an oracle that magically provides list of possible actions for every state. This design works for most cases, however, in the case of the agent does not have access to all the legal actions on all states, there should be some extra thoughts. An alternative is to embed a map from state to legal actions in the agent and then the agent learns the legal actions on state only after it observed that state. I finally decided to follow the UCB Pacman’s design since it is more flexible.

On-Policy SARSA

SARSA stands for (state 1, action 1, reward, state 2, action 2):

  • Q-learning vs SARSA

SARSA is on-policy, whereas q-learning is off-policy. SARSA evaluates q-value with one-step-look-a-head current policy, where as q-learning evaluates q-value with one-step-look-a-head optimal policy.


Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards. Mimimizing regret is optimally learning to be optimal.

Model-Free Monte-Carlo (MC) Methods

  • MC methods learn directly from complete episodes of experiences

  • Value = empirical mean of returns
    • return is the total discounted reward
  • All episodes must terminate
  • Monte Carlo learning is unbiased with high variance (opposite to TD learning, which is biased with low variance)
  • Incremental Mean


After some thoughts, I decided to implement my own environments (e.g. Gridworld) to gain a thorough understanding of how the agent interacts with environments and to hone my software engineering skills.

I did checkout some existing environment implementations to see their design trade-offs.

  • Openai Gym

Most of the Gym environments requires only one agent. Openai Gym assumes nothing about the structures of the agents, and hence, users can implement agents as they like.

  • UC Berkeley Pacman project

UCB Pacman project is more complicated. Some environment involves multiple agents. It defined a base interfaces for agents and had a class game to handle agents’ information.


  • Richard S. S, and Andrew G. B. ‘s Reinforcement Learning book
  • UC Berkeley’s Pacman Project

This post is getting too long, therefore, I decided to start new posts.