To consolidate my understanding of reinforcement learning algorithms, I decided to implement popular RL algorithms myself. This post continues Learn Reinforcement Learning by Coding (1) and follows Chapter 13 of Sutton's RL book.
In comparison with value-based reinforcement learning algorithms, which learn action values and derive a policy from them, policy gradient methods directly optimize the performance of a parameterized policy.
We use $\theta \in \mathbb{R}^{d}$ to denote the weights in the model and $J(\theta)$ to denote the performance of the policy. The weights are updated by gradient ascent on $J$:

$$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)},$$

where $\widehat{\nabla J(\theta_t)}$ is a stochastic estimate whose expectation approximates the gradient of the performance measure w.r.t. $\theta_t$.
Let $h(s, a, \theta)$ be the parameterized numerical preference for each state-action pair; we can get a stochastic policy according to the softmax distribution:

$$\pi(a \mid s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}}$$

Oftentimes, $h(s, a, \theta)$ can be represented as a linear combination of features $x(s, a)$:

$$h(s, a, \theta) = \theta^\top x(s, a)$$
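As a small sketch of this parameterization (the one-hot feature function `x` and the integer state/action encodings here are my own toy assumptions), the softmax policy over linear preferences can be written as:

```python
import numpy as np

def softmax_policy(theta, x, state, actions):
    """Action probabilities pi(a|s, theta) from linear preferences
    h(s, a, theta) = theta . x(s, a), via a softmax over actions."""
    prefs = np.array([theta @ x(state, a) for a in actions])
    prefs -= prefs.max()  # subtract max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

# Toy example: 2 actions, one-hot features of size 2 (hypothetical encoding).
def x(state, action):
    feat = np.zeros(2)
    feat[action] = 1.0
    return feat

theta = np.array([1.0, 0.0])
probs = softmax_policy(theta, x, state=0, actions=[0, 1])
```

Subtracting the maximum preference before exponentiating leaves the probabilities unchanged but avoids overflow when preferences grow large.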
The Policy Gradient Theorem
$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla \pi(a \mid s, \theta)$$

where $\mu(s)$ is the on-policy state distribution: in the episodic case, $\mu(s)$ is proportional to the expected number of time steps on which $S_t = s$ in a randomly generated episode starting in $s_0$ and following $\pi$ and the dynamics of the MDP. Here, we expect $\pi(a \mid s, \theta)$ to be differentiable with respect to $\theta$.
REINFORCE employs Monte Carlo estimation for the policy gradient, sampling the return $G_t$ and updating

$$\theta_{t+1} = \theta_t + \alpha\, G_t\, \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} = \theta_t + \alpha\, G_t\, \nabla \ln \pi(A_t \mid S_t, \theta_t)$$

Since we use Monte Carlo estimation to estimate $G_t$ here, we expect the episodes to be finite.
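Computing the Monte Carlo returns from a finite episode is a single backward sweep over the recorded rewards. A minimal sketch (the reward list and discount `gamma` are assumptions for illustration):

```python
def monte_carlo_returns(rewards, gamma=1.0):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + ... for each step t
    of a finite episode, by sweeping the rewards backwards."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```

For example, `monte_carlo_returns([1.0, 1.0, 1.0], gamma=0.9)` gives `[2.71, 1.9, 1.0]`.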
For linear features, the eligibility vector takes a simple form:

$$\nabla \ln \pi(a \mid s, \theta) = x(s, a) - \sum_b \pi(b \mid s, \theta)\, x(s, b)$$
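Putting the pieces together, here is a sketch of one REINFORCE pass with the linear-softmax eligibility vector. The one-hot feature function and the `(state, action, reward)` episode format are hypothetical choices for illustration, not the only way to set this up:

```python
import numpy as np

def x(state, action):
    """Hypothetical one-hot features over 2 actions."""
    feat = np.zeros(2)
    feat[action] = 1.0
    return feat

def action_probs(theta, state, actions):
    """Softmax over linear preferences h(s, a) = theta . x(s, a)."""
    prefs = np.array([theta @ x(state, a) for a in actions])
    prefs -= prefs.max()  # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def grad_log_pi(theta, state, actions, action):
    """Eligibility vector for the linear softmax policy:
    grad ln pi(a|s) = x(s, a) - sum_b pi(b|s) x(s, b)."""
    probs = action_probs(theta, state, actions)
    expected = sum(p * x(state, b) for p, b in zip(probs, actions))
    return x(state, action) - expected

def reinforce_update(theta, episode, actions, alpha=0.1, gamma=1.0):
    """One REINFORCE pass over an episode of (state, action, reward)
    triples: theta <- theta + alpha * gamma^t * G_t * grad ln pi."""
    # Backward sweep to compute the returns G_t.
    g, returns = 0.0, []
    for _, _, r in reversed(episode):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    theta = theta.copy()
    for t, ((s, a, _), g_t) in enumerate(zip(episode, returns)):
        theta = theta + alpha * (gamma ** t) * g_t * grad_log_pi(theta, s, actions, a)
    return theta
```

A quick sanity check on the eligibility vector: its expectation under the policy, $\sum_a \pi(a \mid s)\, \nabla \ln \pi(a \mid s)$, should be the zero vector, and rewarding an action should raise its preference.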
Pros and Cons of Policy Gradient Methods
Pros:

- Better convergence properties
- Effective in high-dimensional or continuous action spaces
- Can learn stochastic policies
  - When dealing with incomplete information (e.g. Rock Paper Scissors), a stochastic policy might be optimal

Cons:

- Typically converges to a local optimum rather than the global optimum
- Evaluating a policy is typically inefficient and high variance