# Learn Reinforcement Learning by Coding (2)

February 1, 2017 - 2 minute read -

To consolidate my understanding of reinforcement learning algorithms, I decided to implement popular RL algorithms myself. This post continues Learn Reinforcement Learning by Coding (1) and follows Chapter 13 of Sutton’s RL book.

In comparison with value-based reinforcement learning algorithms, policy gradient methods directly optimize the performance of the policy.

We use $\theta$ to denote the weights in the model, and $\eta$ to denote the performance of the policy. The weights are updated by stochastic gradient ascent,

$$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla \eta(\theta_t)},$$

where $\widehat{\nabla \eta(\theta_t)}$ is a stochastic estimate whose expectation approximates the gradient of the performance measure with respect to $\theta_t$.

Let $h(s,a,\theta)$ be the parameterized numerical preference for each state-action pair. We can then obtain a stochastic policy according to the softmax distribution:

$$\pi(a \vert s, \theta) = \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}}.$$

Oftentimes, $h(s,a,\theta)$ can be represented as a linear combination of features $\phi(s,a)$,

$$h(s,a,\theta) = \theta^\top \phi(s,a).$$
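
Below is a minimal sketch of such a softmax policy over linear preferences, assuming simple tabular one-hot features; the names (`softmax_policy`, `phi`, `theta`) and the toy problem sizes are illustrative choices, not fixed by the text above.

```python
import numpy as np

def softmax_policy(theta, phi, state, actions):
    """pi(a|s, theta) for linear preferences h(s, a, theta) = theta . phi(s, a)."""
    h = np.array([theta @ phi(state, a) for a in actions])  # action preferences
    h -= h.max()                                            # for numerical stability
    exp_h = np.exp(h)
    return exp_h / exp_h.sum()                              # softmax distribution

# Toy example: 3 states, 2 actions, one-hot features over (state, action) pairs.
n_states, n_actions = 3, 2

def phi(s, a):
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

theta = np.zeros(n_states * n_actions)
print(softmax_policy(theta, phi, state=0, actions=range(n_actions)))  # uniform: [0.5 0.5]
```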

The policy gradient theorem then gives the gradient of the performance with respect to the policy weights,

$$\nabla \eta(\theta) = \sum_s d_{\pi}(s) \sum_a q_{\pi}(s,a)\, \nabla_\theta \pi(a \vert s, \theta),$$

where, in the episodic case, $d_{\pi}(s)$ is defined to be the expected number of time steps $t$ on which $S_t = s$ in a randomly generated episode starting in $s_0$ and following $\pi$ and the dynamics of the MDP. Here, we require $\pi(a\vert s,\theta)$ to be differentiable with respect to $\theta$.

#### REINFORCE

REINFORCE employs Monte Carlo estimation of the policy gradient. Since we use Monte Carlo estimation to estimate the return $G_t$ here, we require the episodes to be finite.

Then,

$$\nabla \eta(\theta) = E_{\pi}\left[\gamma^t G_t \frac{\nabla_\theta \pi(A_t \vert S_t, \theta)}{\pi(A_t \vert S_t, \theta)}\right],$$

which yields the REINFORCE update

$$\theta_{t+1} = \theta_t + \alpha \gamma^t G_t \frac{\nabla_\theta \pi(A_t \vert S_t, \theta_t)}{\pi(A_t \vert S_t, \theta_t)}.$$

Note that,

$$\frac{\nabla_\theta \pi(A_t \vert S_t, \theta)}{\pi(A_t \vert S_t, \theta)} = \nabla_\theta \ln \pi(A_t \vert S_t, \theta).$$

For linear features,

$$\nabla_\theta \ln \pi(a \vert s, \theta) = \phi(s,a) - \sum_b \pi(b \vert s, \theta)\, \phi(s,b).$$
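
Putting the pieces together, the following is a minimal sketch of the REINFORCE update for the linear softmax policy above; the one-hot features, the hand-written episode, and the hyperparameters `alpha` and `gamma` are illustrative assumptions. Each update consumes one complete episode, which is why the episodes need to be finite.

```python
import numpy as np

def softmax_probs(theta, phi, state, n_actions):
    """pi(.|s, theta) for linear preferences h(s, a, theta) = theta . phi(s, a)."""
    h = np.array([theta @ phi(state, a) for a in range(n_actions)])
    h -= h.max()
    e = np.exp(h)
    return e / e.sum()

def grad_log_pi(theta, phi, state, action, n_actions):
    """Eligibility vector: phi(s, a) - sum_b pi(b|s, theta) * phi(s, b)."""
    probs = softmax_probs(theta, phi, state, n_actions)
    expected_phi = sum(probs[b] * phi(state, b) for b in range(n_actions))
    return phi(state, action) - expected_phi

def reinforce_update(theta, episode, phi, n_actions, alpha=0.1, gamma=0.99):
    """One REINFORCE update from a finished episode of (S_t, A_t, R_{t+1}) triples."""
    # Monte Carlo returns, computed backwards: G_t = R_{t+1} + gamma * G_{t+1}.
    returns = np.zeros(len(episode))
    G = 0.0
    for t in reversed(range(len(episode))):
        G = episode[t][2] + gamma * G
        returns[t] = G
    # theta <- theta + alpha * gamma^t * G_t * grad log pi(A_t | S_t, theta).
    for t, (s, a, _) in enumerate(episode):
        theta = theta + alpha * (gamma ** t) * returns[t] * grad_log_pi(theta, phi, s, a, n_actions)
    return theta

# Toy usage: 3 states, 2 actions, one-hot features, one hand-written episode.
n_states, n_actions = 3, 2

def phi(s, a):
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

theta = np.zeros(n_states * n_actions)
episode = [(0, 1, 0.0), (1, 0, 0.0), (2, 1, 1.0)]  # (S_t, A_t, R_{t+1})
theta = reinforce_update(theta, episode, phi, n_actions)
print(theta)
```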