Learn Reinforcement Learning by Coding (2)

Policy Gradient, Actor-Critic

February 1, 2017 - 2 minute read -
machine-learning reinforcement-learning

To consolidate my understanding of reinforcement learning algorithms, I decided to implement popular RL algorithms myself. This post continues Learn Reinforcement Learning by Coding (1) and follows Chapter 13 of Sutton's RL book.

Policy Gradient

In comparison with value-based reinforcement learning algorithms, policy gradient methods directly optimize the performance of the policy.

We use $\theta$ to denote the weights of the policy model and $J(\theta)$ to denote the performance of the policy. Policy gradient methods update the weights by stochastic gradient ascent:

$$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)},$$

where $\widehat{\nabla J(\theta_t)}$ is a stochastic estimate whose expectation approximates the gradient of the performance measure w.r.t. $\theta_t$.

Let $h(s, a, \theta)$ denote the parameterized numerical preference for each state-action pair. We can then obtain a stochastic policy from the softmax distribution:

$$\pi(a \mid s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}}.$$

Oftentimes, $h(s, a, \theta)$ can be represented as a linear combination of features $\phi(s, a)$:

$$h(s, a, \theta) = \theta^\top \phi(s, a).$$

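As a minimal sketch of the two equations above, the softmax policy over linear preferences can be computed like this (numpy; the names `theta` and `phi` are illustrative, not from the original post):

```python
import numpy as np

def softmax_policy(theta, phi, state, actions):
    """Compute pi(a | s, theta) for a softmax policy with linear preferences.

    theta:   weight vector, shape (d,)
    phi:     feature function phi(s, a) -> np.ndarray of shape (d,)
    returns: probabilities over `actions`, shape (len(actions),)
    """
    # h(s, a, theta) = theta^T phi(s, a) for each action
    prefs = np.array([theta @ phi(state, a) for a in actions])
    # Subtract the max preference for numerical stability
    prefs -= prefs.max()
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()
```

Subtracting the maximum preference before exponentiating leaves the distribution unchanged but avoids overflow for large preferences.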
The Policy Gradient Theorem

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta),$$

where, in the episodic case, $\mu(s)$ is defined as the expected number of time steps $t$ on which $S_t = s$ in a randomly generated episode starting in $s_0$ and following $\pi$ and the dynamics of the MDP. Here, we require the policy $\pi(a \mid s, \theta)$ to be differentiable with respect to $\theta$.


REINFORCE

REINFORCE employs Monte Carlo estimation of the policy gradient: the full return $G_t$ serves as an unbiased sample of $q_\pi(S_t, A_t)$, giving the update

$$\theta_{t+1} = \theta_t + \alpha G_t \nabla_\theta \ln \pi(A_t \mid S_t, \theta).$$

Since we use Monte Carlo estimation to estimate $q_\pi$ here, we expect the episodes to be finite.


Note that, for the softmax parameterization,

$$\nabla_\theta \ln \pi(a \mid s, \theta) = \nabla_\theta h(s, a, \theta) - \sum_b \pi(b \mid s, \theta)\, \nabla_\theta h(s, b, \theta).$$

For linear features,

$$\nabla_\theta \ln \pi(a \mid s, \theta) = \phi(s, a) - \sum_b \pi(b \mid s, \theta)\, \phi(s, b).$$

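Putting the pieces together, a REINFORCE update for the linear-softmax policy might look like the following sketch (numpy; all names, the episode format, and the hyperparameters are illustrative assumptions, not from the original post):

```python
import numpy as np

def reinforce_update(theta, phi, episode, actions, alpha=0.01, gamma=1.0):
    """One REINFORCE update from a finished episode.

    episode: list of (state, action, reward) tuples
    Applies theta <- theta + alpha * G_t * grad log pi(A_t | S_t, theta),
    where, for linear features,
        grad log pi(a | s) = phi(s, a) - sum_b pi(b | s) phi(s, b).
    """
    T = len(episode)
    # Compute the returns G_t backwards through the episode
    returns = np.zeros(T)
    G = 0.0
    for t in reversed(range(T)):
        G = episode[t][2] + gamma * G
        returns[t] = G
    for t, (s, a, _) in enumerate(episode):
        prefs = np.array([theta @ phi(s, b) for b in actions])
        prefs -= prefs.max()                    # numerical stability
        pi = np.exp(prefs) / np.exp(prefs).sum()
        feats = np.array([phi(s, b) for b in actions])
        grad_log_pi = phi(s, a) - pi @ feats    # eligibility vector
        theta += alpha * returns[t] * grad_log_pi
    return theta
```

The eligibility vector $\phi(s, a) - \sum_b \pi(b \mid s)\, \phi(s, b)$ pushes the weights toward the features of the action taken and away from the policy's average features, scaled by the observed return.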
Pros and Cons of Policy Gradient Methods

  • Advantages
    • Better convergence properties
    • Effective in high-dimensional or continuous action spaces
    • Can learn stochastic policies
    • Under incomplete information (e.g., Rock-Paper-Scissors), a stochastic policy may itself be optimal
  • Disadvantages
    • Typically converges to a local optimum rather than the global optimum
    • Evaluating a policy is typically inefficient and high-variance