To consolidate my understanding of reinforcement learning algorithms, I decided to implement popular RL algorithms myself. This post includes the core math of RL algorithms, problems I faced while implementing the algorithms and their solutions, and some useful links for learning RL.
Table of Contents
- Dynamic Programming
- Model-Free Temporal Difference (TD) Learning
- Model-Free Monte-Carlo (MC) Methods
$$V_{k+1}(s) = \max_a \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma V_k(s')\right]$$

where $V^*(s)$ is the expected value (cumulative discounted reward) of state $s$ if the agent acts optimally. The time complexity of each iteration is $O(S^2 A)$, where $S$ is the number of states and $A$ is the number of actions at each state.
$V^*(s)$ can also be expressed in terms of the q-value:

$$V^*(s) = \max_a Q^*(s, a)$$
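As a concrete sketch, here is value iteration on a made-up two-state MDP (the transition tensor `T`, reward tensor `R`, and all numbers are invented for illustration):

```python
import numpy as np

# A tiny made-up MDP: 2 states, 2 actions.
# T[s, a, s'] = transition probability, R[s, a, s'] = reward.
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 0.0], [0.0, 1.0]]])
gamma = 0.9

def value_iteration(T, R, gamma, iters=500):
    S, A, _ = T.shape
    V = np.zeros(S)
    for _ in range(iters):
        # Q[s, a] = sum_{s'} T[s,a,s'] * (R[s,a,s'] + gamma * V[s'])
        Q = (T * (R + gamma * V)).sum(axis=2)
        V = Q.max(axis=1)  # the nonlinear max over actions
    return V

V_star = value_iteration(T, R, gamma)
```

Each iteration touches every (state, action, next state) triple once, which is where the $O(S^2 A)$ cost comes from.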
- Optimal Policy
The policy iteration method can be seen as iteratively performing policy evaluation and policy improvement.
In UC Berkeley's AI course, as well as in Richard S. Sutton and Andrew G. Barto's Reinforcement Learning book, the policy at each state is deterministic.
- Policy Evaluation
$$V^\pi_{k+1}(s) = \sum_{s'} T(s, \pi(s), s')\left[R(s, \pi(s), s') + \gamma V^\pi_k(s')\right]$$

The runtime complexity is $O(S^2)$ for each iteration.
Since fixing the policy removes the nonlinear max operation, policy evaluation can also be done by solving a linear system.
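A minimal numpy sketch of that linear solve, on a made-up two-state MDP (all numbers are invented for illustration):

```python
import numpy as np

# Made-up 2-state MDP and a fixed deterministic policy pi.
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 0.0], [0.0, 1.0]]])
gamma = 0.9
pi = np.array([0, 1])  # action chosen in each state

S = T.shape[0]
# Under pi: transition matrix T_pi[s, s'] and expected one-step reward r_pi[s].
T_pi = T[np.arange(S), pi]                      # shape (S, S)
r_pi = (T_pi * R[np.arange(S), pi]).sum(axis=1)
# Solve (I - gamma * T_pi) V = r_pi exactly, instead of iterating.
V_pi = np.linalg.solve(np.eye(S) - gamma * T_pi, r_pi)
```

One solve replaces the whole iterative sweep, at the cost of a cubic-in-$S$ linear solve.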
- Policy Improvement
Policy improvement need not be performed on every iteration, which saves computation.
However, in David Silver's lecture slides, the policy at each state is a probability distribution over all possible actions.
- Policy Evaluation
$$V^\pi_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma V^\pi_k(s')\right]$$

Since the update now also sums over actions, the runtime complexity is therefore $O(S^2 A)$ for each iteration.
- Policy Improvement
Here I ran into the issue that the probability distribution over actions at state $s$ must always be positive; however, the values of some states were evaluated as negative. To solve this problem, I took a softmax over the values, which exponentiates them before normalizing:

$$\pi(a \mid s) = \frac{e^{Q(s, a)}}{\sum_{a' \in \mathcal{A}(s)} e^{Q(s, a')}}$$

where $\mathcal{A}(s)$ is the set of possible actions at state $s$.
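The softmax trick can be sketched as follows; subtracting the max before exponentiating is a standard numerical-stability step, my own addition rather than part of the original text:

```python
import numpy as np

def softmax_policy(q_values):
    """Turn (possibly negative) action values at one state into a
    valid probability distribution over actions."""
    z = q_values - q_values.max()  # stability shift; does not change the result
    e = np.exp(z)
    return e / e.sum()

probs = softmax_policy(np.array([-1.0, 0.5, -3.0]))
```

Even though two of the inputs are negative, every output probability is strictly positive and the distribution sums to one.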
Model-Free Temporal Difference (TD) Learning
TD Value Learning
$$\text{sample} = R(s, \pi(s), s') + \gamma V^\pi(s')$$
$$V^\pi(s) \leftarrow (1 - \alpha) V^\pi(s) + \alpha \cdot \text{sample}$$

where $\alpha$ is the learning rate. The second equation is an exponential moving average. The reasons for using a moving average instead of averaging over all samples:
- Samples are noisy
- More recent samples are better, and hence should be counted more
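A minimal sketch of one TD(0) update in the moving-average form above (the dictionary-based value table and state names are my own choices):

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) update: V(s) <- (1 - alpha) V(s) + alpha * sample."""
    sample = r + gamma * V[s_next]          # one-step bootstrapped target
    V[s] = (1 - alpha) * V[s] + alpha * sample
    return V

V = {"a": 0.0, "b": 1.0}
V = td_update(V, "a", r=0.5, s_next="b")
```

With `alpha=0.1`, the new estimate moves 10% of the way toward the sample, so older samples decay geometrically.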
Off-Policy TD Q-learning
TD Q-learning propagates only the optimal Q-value:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R(s, a, s') + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Q-learning is off-policy learning, which means that, in the limit, it converges to the correct values no matter how you select actions.
Some things to note for Q-learning:
- You have to explore more
- You have to eventually make the learning rate small enough
- With (small) probability $\epsilon$, act randomly
- With (large) probability $1 - \epsilon$, act on the current policy
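The $\epsilon$-greedy rule above can be sketched as follows (`q` maps actions to current Q-values; the names are my own):

```python
import random

def epsilon_greedy(q, legal_actions, epsilon=0.1):
    """With probability epsilon act randomly, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(legal_actions)
    # Greedy step: pick the action with the highest current Q-value.
    return max(legal_actions, key=lambda a: q.get(a, 0.0))

action = epsilon_greedy({"left": 0.2, "right": 1.0},
                        ["left", "right"], epsilon=0.0)
```

In practice $\epsilon$ (and the learning rate) are often decayed over time, matching the notes above.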
One of the issues I ran into when implementing Q-learning was how the agent gets access to the legal actions at each state. In UCB's Pacman project, an actionFn is passed to the constructor; it takes a state and returns a list of legal actions. This design assumes the agent has an oracle that magically provides the list of possible actions for every state. It works for most cases; however, when the agent does not have access to all the legal actions in all states, some extra thought is needed. An alternative is to embed a map from states to legal actions in the agent, so that the agent learns the legal actions of a state only after it has observed that state. I finally decided to follow the UCB Pacman design since it is more flexible.
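A sketch of that design: the agent receives an action function at construction time (as in the UCB project) and performs the standard Q-learning update. Everything beyond the actionFn idea, including all names, is my own illustration:

```python
import random
from collections import defaultdict

class QLearningAgent:
    def __init__(self, action_fn, alpha=0.5, gamma=0.9, epsilon=0.1):
        self.action_fn = action_fn   # oracle: state -> list of legal actions
        self.q = defaultdict(float)  # (state, action) -> value, defaults to 0
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def best_value(self, s):
        return max((self.q[(s, a)] for a in self.action_fn(s)), default=0.0)

    def get_action(self, s):
        actions = self.action_fn(s)
        if random.random() < self.epsilon:
            return random.choice(actions)          # explore
        return max(actions, key=lambda a: self.q[(s, a)])  # exploit

    def update(self, s, a, r, s_next):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        target = r + self.gamma * self.best_value(s_next)
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])

# Hypothetical usage with a trivial action function.
agent = QLearningAgent(lambda s: ["left", "right"], epsilon=0.0)
agent.update("s0", "right", 1.0, "s1")
```

The agent never enumerates states up front; it only queries the actionFn for states it actually encounters.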
SARSA stands for the tuple it updates on (state 1, action 1, reward, state 2, action 2):

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R + \gamma Q(s', a') - Q(s, a) \right]$$
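A minimal sketch of one SARSA update using that full tuple (the dictionary-based Q-table and names are my own):

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a)).

    Unlike Q-learning, the target uses the action a_next actually taken
    by the current policy, not the greedy max over next actions."""
    target = r + gamma * q.get((s_next, a_next), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (target - q.get((s, a), 0.0))
    return q

q = sarsa_update({}, s=0, a="l", r=1.0, s_next=1, a_next="r", alpha=1.0)
```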
- Q-learning vs SARSA
SARSA is on-policy, whereas Q-learning is off-policy. SARSA evaluates the q-value with a one-step look-ahead under the current policy, whereas Q-learning evaluates the q-value with a one-step look-ahead under the optimal (greedy) policy.
Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and the optimal (expected) rewards. Minimizing regret means optimally learning to be optimal.
Model-Free Monte-Carlo (MC) Methods
MC methods learn directly from complete episodes of experience.
- Value = empirical mean of returns
- return is the total discounted reward
- All episodes must terminate
- Monte Carlo learning is unbiased with high variance (opposite to TD learning, which is biased with low variance)
- Incremental Mean

The mean $\mu_k$ of a sequence $x_1, \dots, x_k$ can be computed incrementally:

$$\mu_k = \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right)$$
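A sketch of every-visit MC value estimation using the incremental mean (the episode format and names are my own):

```python
from collections import defaultdict

def mc_update(V, counts, episode, gamma=0.9):
    """episode: list of (state, reward) pairs from one complete episode.
    Update each visited state's value toward its return with the
    incremental mean: mu_k = mu_{k-1} + (G - mu_{k-1}) / k."""
    G = 0.0
    # Walk backwards so G accumulates the discounted return from each step.
    for s, r in reversed(episode):
        G = r + gamma * G
        counts[s] += 1
        V[s] += (G - V[s]) / counts[s]  # incremental mean
    return V

V, counts = defaultdict(float), defaultdict(int)
V = mc_update(V, counts, [("a", 0.0), ("b", 1.0)])
```

Because the return `G` needs the whole trajectory, this only works once the episode has terminated, matching the "all episodes must terminate" requirement above.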
After some thought, I decided to implement my own environments (e.g. Gridworld) to gain a thorough understanding of how the agent interacts with environments and to hone my software engineering skills.
I did check out some existing environment implementations to see their design trade-offs.
- OpenAI Gym
Most Gym environments require only one agent. OpenAI Gym assumes nothing about the structure of the agents, and hence users can implement agents however they like.
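To make that interface concrete, here is a made-up toy environment following classic Gym's reset/step convention (observation from reset; observation, reward, done, info from step). This is my own minimal mock, not Gym itself:

```python
class CountdownEnv:
    """A toy environment with a Gym-style interface: the state is a
    counter and the episode ends when it reaches zero."""

    def reset(self):
        self.state = 3
        return self.state

    def step(self, action):
        self.state -= 1
        done = self.state <= 0
        reward = 1.0 if done else 0.0
        return self.state, reward, done, {}

# The usual interaction loop, agnostic to the agent's internal structure.
env = CountdownEnv()
obs, done, total = env.reset(), False, 0.0
while not done:
    obs, reward, done, info = env.step(0)  # any agent's action goes here
    total += reward
```

Because the loop only touches `reset` and `step`, any agent design can be plugged in, which is exactly the flexibility Gym's minimal interface buys.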
- UC Berkeley Pacman project
The UCB Pacman project is more complicated: some environments involve multiple agents. It defines a base interface for agents and has a Game class to handle the agents' information.
This post is getting too long, so I decided to continue in new posts.