To consolidate my understanding of reinforcement learning algorithms, I decided to implement popular RL algorithms myself. This post includes the core math of RL algorithms, problems I faced while implementing the algorithms and their solutions, and some useful links for learning RL.
Git repo: https://github.com/stormmax/reinforcement_learning
Table of Contents
 Dynamic Programming
 Model-Free Temporal Difference (TD) Learning
 Model-Free Monte Carlo (MC) Methods
 Environments
 References
 Links
Dynamic Programming
Value Iteration
 Learning
$$V_{k+1}(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]$$

Here $V(s)$ is the expected value (cumulative discounted reward) of state $s$ if the agent acts optimally. The time complexity of each iteration is $O(S^2 A)$, where $S$ is the number of states and $A$ is the number of actions at each state.
$V(s)$ can also be expressed in terms of the Q-value:

$$V(s) = \max_a Q(s, a)$$
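As a sketch, the value-iteration backup above fits in a few lines of NumPy. The toy two-state MDP below (the tensors `T` and `R` and their values) is my own illustration, not taken from the repo:

```python
import numpy as np

# Toy 2-state, 2-action MDP (illustrative values, not from the post):
# T[s, a, s2] is the transition probability, R[s, a, s2] the reward.
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
    [[0.0, 1.0], [0.5, 0.5]],   # transitions from state 1
])
R = np.array([
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 0.0], [1.0, 1.0]],
])
gamma = 0.9

def value_iteration(T, R, gamma, tol=1e-8):
    """Repeat the Bellman optimality backup until convergence.

    Each sweep computes Q[s, a] = sum_{s'} T(s,a,s') (R(s,a,s') + gamma V(s')),
    which is O(S^2 A) per iteration."""
    S = T.shape[0]
    V = np.zeros(S)
    while True:
        Q = np.einsum('ijk,ijk->ij', T, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)          # V(s) = max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new

V_star, Q_star = value_iteration(T, R, gamma)
```

At convergence, `V_star` is a fixed point of the backup, and extracting `Q_star.max(axis=1)` recovers it, matching the identity $V(s) = \max_a Q(s,a)$.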
 Optimal Policy

$$\pi^*(s) = \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$$
Policy Iteration
The policy iteration method can be seen as iteratively alternating between policy evaluation and policy improvement.
Deterministic Policy
In UC Berkeley’s AI course, as well as in Richard S. Sutton and Andrew G. Barto’s Reinforcement Learning book, the policy at each state is deterministic.
 Policy Evaluation

$$V^\pi_{k+1}(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^\pi_k(s') \right]$$

The run time complexity is $O(S^2)$ for each iteration.
Since fixing the policy removes the nonlinear $\max$ operation, policy evaluation can also be done exactly by solving a linear system.
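As a sketch, the linear system $(I - \gamma T^\pi) V^\pi = R^\pi$ can be solved in one call; the toy MDP values below are my own illustration:

```python
import numpy as np

def evaluate_policy_exact(T, R, pi, gamma):
    """Exact evaluation of a deterministic policy pi (array: state -> action).

    Solves (I - gamma * T_pi) V = R_pi, where T_pi[s, s'] = T(s, pi(s), s')
    and R_pi[s] is the expected one-step reward under pi."""
    S = T.shape[0]
    T_pi = T[np.arange(S), pi]                        # (S, S') rows under pi
    R_pi = (T_pi * R[np.arange(S), pi]).sum(axis=1)   # expected reward per state
    return np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)

# Toy 2-state MDP (illustrative values, not from the post).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 0.0], [1.0, 1.0]]])
V = evaluate_policy_exact(T, R, pi=np.array([0, 1]), gamma=0.9)
```

The direct solve costs $O(S^3)$ once, versus $O(S^2)$ per sweep for iterative evaluation, so which is cheaper depends on how many sweeps convergence would take.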
 Policy Improvement
Policy improvement does not necessarily have to be performed on every iteration, which saves computation.
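A minimal sketch of the improvement step, acting greedily with respect to the current value estimate (the tensor layout `(S, A, S')` is my own convention):

```python
import numpy as np

def improve_policy(T, R, V, gamma):
    """One greedy policy improvement step: pi'(s) = argmax_a Q(s, a).

    T, R are (S, A, S') transition/reward tensors, V the current values."""
    # Q[s, a] = sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V(s'))
    Q = np.einsum('sap,sap->sa', T, R + gamma * V[None, None, :])
    return Q.argmax(axis=1)
```

Alternating `evaluate` and `improve_policy` until the policy stops changing is the whole policy-iteration loop.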
NonDeterministic Policy
However, in David Silver’s lecture slides, the policy at each state is a probability distribution over all the possible actions, $\pi(a \mid s)$.
 Policy Evaluation

$$V^\pi_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^\pi_k(s') \right]$$

The run time complexity is therefore $O(S^2 A)$ for each iteration.
 Policy Improvement
Here I ran into the issue that the probability distribution over actions at state $s$ must always be positive; however, the values of some states were evaluated as negative. To solve this problem, I exponentiated the values and normalized, i.e. took the softmax distribution:

$$\pi(a \mid s) = \frac{e^{Q(s, a)}}{\sum_{a' \in A(s)} e^{Q(s, a')}}$$

where $A(s)$ is the set of possible actions at state $s$.
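A minimal sketch of that softmax trick; the max-subtraction for numerical stability is a standard addition of mine, not something the post discusses:

```python
import numpy as np

def softmax_policy(q_values):
    """Turn (possibly negative) action values into a valid distribution.

    Exponentiation makes every probability positive; subtracting the max
    first avoids overflow without changing the result."""
    z = q_values - q_values.max()
    p = np.exp(z)
    return p / p.sum()

# Negative values still yield strictly positive probabilities.
probs = softmax_policy(np.array([-1.0, 0.5, -3.0]))
```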
Model-based Q-learning
Model-Free Temporal Difference (TD) Learning
TD Value Learning
$$\text{sample} = R(s, \pi(s), s') + \gamma V^\pi(s')$$

$$V^\pi(s) \leftarrow (1 - \alpha) V^\pi(s) + \alpha \cdot \text{sample}$$

where $\alpha$ is the learning rate. The second equation is an exponential moving average. The reasons for using a moving average instead of averaging over all samples:
 Samples are noisy
 More recent samples are better, and hence should be weighted more heavily
 Memoryless
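The update above can be sketched as a one-line function; the parameter defaults are arbitrary choices of mine:

```python
def td_update(v, reward, v_next, alpha=0.1, gamma=0.9):
    """TD(0) value update: V(s) <- (1 - alpha) V(s) + alpha (r + gamma V(s')).

    This is an exponential moving average over sampled one-step returns,
    so it needs no memory of past samples."""
    sample = reward + gamma * v_next
    return (1 - alpha) * v + alpha * sample
```

For example, starting from `v = 0.0` with a reward of `1.0` and `v_next = 0.0`, the estimate moves a fraction `alpha` of the way toward the sample.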
Off-Policy TD Q-learning
TD Q-learning propagates only the optimal Q-value:

$$Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \left[ R(s, a, s') + \gamma \max_{a'} Q(s', a') \right]$$

Q-learning is off-policy learning, which means that, in the limit, it will converge to the correct values no matter how you select the actions.
Some things to note for Qlearning:
 You have to explore more
 You have to eventually make the learning rate small enough
$\epsilon$-Greedy
 With (small) probability $\epsilon$, act randomly
 With (large) probability $1 - \epsilon$, act on the current policy
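A minimal sketch of the $\epsilon$-greedy rule above (the function name and signature are mine):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action (explore);
    otherwise pick the action with the highest Q-value (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Decaying `epsilon` over time is the usual way to satisfy "explore more" early while still converging to greedy behavior.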
One of the issues I ran into when implementing Q-learning is: how does the agent get access to the legal actions at each state? In UCB’s Pacman project, they pass an actionFn
in the constructor which takes a state and returns a list of legal actions. This practice assumes that the agent has an oracle that magically provides the list of possible actions for every state. This design works in most cases; however, when the agent does not have access to all the legal actions in all states, some extra thought is needed. An alternative is to embed a map from states to legal actions in the agent, so that the agent learns the legal actions of a state only after it has observed that state. I finally decided to follow the UCB Pacman design since it is more flexible.
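A minimal sketch of a tabular Q-learning agent built around that actionFn-style design; the class and parameter names are mine, not the repo's:

```python
import random
from collections import defaultdict

class QLearningAgent:
    """Tabular off-policy Q-learning with an injected action function,
    mirroring the UCB Pacman actionFn design discussed above."""

    def __init__(self, action_fn, alpha=0.5, gamma=0.9, epsilon=0.1):
        self.action_fn = action_fn      # state -> list of legal actions
        self.q = defaultdict(float)     # (state, action) -> Q-value, default 0
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def get_action(self, state):
        """Epsilon-greedy over the legal actions for this state."""
        actions = self.action_fn(state)
        if random.random() < self.epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s2):
        """Q(s,a) <- (1 - alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))."""
        best_next = max((self.q[(s2, a2)] for a2 in self.action_fn(s2)),
                        default=0.0)
        self.q[(s, a)] = ((1 - self.alpha) * self.q[(s, a)]
                          + self.alpha * (r + self.gamma * best_next))
```

Because the agent only ever asks `action_fn` about states it actually visits, swapping in a learned state-to-actions map would only change the constructor argument.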
On-Policy SARSA
SARSA stands for (state 1, action 1, reward, state 2, action 2):

$$Q(s_1, a_1) \leftarrow (1 - \alpha) Q(s_1, a_1) + \alpha \left[ r + \gamma Q(s_2, a_2) \right]$$
 Q-learning vs SARSA
SARSA is on-policy, whereas Q-learning is off-policy. SARSA evaluates the Q-value with a one-step lookahead under the current policy, whereas Q-learning evaluates the Q-value with a one-step lookahead under the optimal (greedy) policy.
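The difference is entirely in the bootstrap target; a sketch (function names mine):

```python
def sarsa_target(r, q_next_taken, gamma=0.9):
    """SARSA (on-policy): bootstrap on the action the policy actually took."""
    return r + gamma * q_next_taken

def q_learning_target(r, q_next_all, gamma=0.9):
    """Q-learning (off-policy): bootstrap on the greedy next action."""
    return r + gamma * max(q_next_all)
```

When the behavior policy happens to pick the greedy action, the two targets coincide; they differ exactly on exploratory steps.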
Regret
Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards. Minimizing regret is optimally learning to be optimal.
Model-Free Monte Carlo (MC) Methods

MC methods learn directly from complete episodes of experiences
 Value = empirical mean of returns
 The return is the total discounted reward: $G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$
 All episodes must terminate
 Monte Carlo learning is unbiased with high variance (opposite to TD learning, which is biased with low variance)
 Incremental Mean

$$\mu_k = \mu_{k-1} + \frac{1}{k} \left( x_k - \mu_{k-1} \right)$$
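The incremental mean lets MC update its value estimate one return at a time instead of storing every episode's return; a minimal sketch:

```python
def incremental_mean(mu, x, k):
    """Update the running mean with the k-th sample (k is 1-indexed):
    mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k."""
    return mu + (x - mu) / k

# Folding in returns one by one reproduces the batch mean.
mu = 0.0
for k, g in enumerate([1.0, 2.0, 3.0, 4.0], start=1):
    mu = incremental_mean(mu, g, k)
```

Replacing the `1/k` weight with a constant `alpha` turns this into exactly the exponential moving average used by TD learning above.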
Environments
After some thought, I decided to implement my own environments (e.g. Gridworld) to gain a thorough understanding of how agents interact with environments and to hone my software engineering skills.
I did check out some existing environment implementations to see their design trade-offs.
 OpenAI Gym
Most Gym environments require only one agent. OpenAI Gym assumes nothing about the structure of the agents, and hence users can implement agents however they like.
 UC Berkeley Pacman project
The UCB Pacman project is more complicated. Some environments involve multiple agents. It defines base interfaces for agents and has a game class to handle agents’ information.
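For contrast, a minimal Gridworld sketch in the single-agent Gym style (`reset`/`step` returning state, reward, done). The grid size, goal position, and reward values here are my own assumptions, not those of the repo:

```python
class GridWorld:
    """A tiny deterministic Gridworld following the Gym-style interface."""

    # Action index -> (dx, dy): up, down, left, right.
    ACTIONS = [(0, 1), (0, -1), (-1, 0), (1, 0)]

    def __init__(self, width=4, height=3, goal=(3, 2)):
        self.width, self.height, self.goal = width, height, goal
        self.state = (0, 0)

    def reset(self):
        """Return the agent to the start state and return it."""
        self.state = (0, 0)
        return self.state

    def step(self, action):
        """Apply an action; walls clamp movement to the grid.

        Returns (next_state, reward, done): a small living penalty
        each step, +1 on reaching the goal."""
        dx, dy = self.ACTIONS[action]
        x = min(max(self.state[0] + dx, 0), self.width - 1)
        y = min(max(self.state[1] + dy, 0), self.height - 1)
        self.state = (x, y)
        done = self.state == self.goal
        return self.state, (1.0 if done else -0.04), done
```

Because the environment exposes nothing about the agent, the same class works with any of the agents above; the agent's legal-action function is simply `lambda s: range(4)` here.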
References
Links
This post is getting too long, so I decided to start new posts for the remaining topics.