Revisit Maximum Entropy Inverse Reinforcement Learning

“What I cannot create, I do not understand” – Richard Feynman

Following the last post about linear programming IRL, this is a summary of Ziebart et al’s 2008 paper: Maximum Entropy Inverse Reinforcement Learning . I found this is a good way for me to distill the essence of the paper. Since the Maxent algorithm is mostly cited by the later papers in IRL/imitation learning, I would like to look into details of this algorithm. Code is available at github. Hope this post can also help others who are interested in IRL.

Motivations

The paper points out the limitations of the previous algorithms including LPIRL (Ng & Russell 2000), structured maximum margin prediction (MMP, Ratliff, Bagnell & Zinkevich 2006, Apprenticeship Learning vis IRL (Abbeel & Ng 2004) about feature counts and IRL

Both IRL and the matching of feature counts are ambiguous.
- Each policy can be optimal for many reward functions.
- many policies lead to the same feature counts.
The ambiguity of suboptimality is unresolved.

Notations

An MDP is a tuple $(S, A, \{P_{sa}\}, \gamma, R)$

$S$ is a finite set of $N$ states.
$A = \{a_1, .., a_k\}$ is a set of $k$ actions.
$P_{sa}(s')$ is the state transition probability of landing at state $s'$ : $P(s, a, s')$ upon taking the action $a$ at state $s$ .
$\gamma \in [0, 1)$ is the discount factor.
$R: S \rightarrow \mathbf{R}$ is the reward function.

Maxent

$\zeta = \{(s, a)\}$ is a trajectory.
$\mathbf{f_s} \in \mathbf{R}^k$ is the feature vector of the state $s$ .
$\theta \in \mathbf{R}^k$ reward function parameters.
$P(\zeta)$ probability of the trajectory $\zeta$ to occur
$P(s)$ the probatility of visiting state $s$ (state visitation frequency), $P(\zeta) = \prod_{s\in\zeta} P(s)$ .

Assumptions

The reward of a trajectory is expressed as a linearly combination with feature counts

$R(\zeta) = \theta ^T \mathbf{f}_{\zeta} = \sum_{s\in \zeta} \theta ^T \mathbf{f}_s$

Principle of maximum entropy (Jaynes 1957): probability of a trajectory demonstrated by the expert is exponentially higher for higher rewards than lower rewards,

$P(\zeta) \propto e^{R(\zeta)}$

Algorithm

The Maxent algorithm learns from demonstrated expert trajectories with the objective being maximizing the likelihood of the demonstrated trajectories,

$\theta^* = \text{argmax}_{\theta} L(\theta) \\ = \text{argmax}_{\theta} \frac{1}{M}\text{log} P(\{\zeta\} | \theta)\\ = \text{argmax}_{\theta} \frac{1}{M}\text{log} \prod_{\zeta} P(\zeta|\theta)\\ = \text{argmax}_{\theta} \frac{1}{M}\sum_{\zeta} \text{log} P(\zeta|\theta)\\ = \text{argmax}_{\theta} \frac{1}{M}\sum_{\zeta} \text{log} \frac{e^{R(\zeta)}}{Z}\\ = \text{argmax}_{\theta} \frac{1}{M}\sum_{\zeta} R(\zeta) - \text{log} Z\\$

Where $M$ is the number of trajectories, $Z$ is the normalization term,

$Z = \sum_{\zeta} e^{R(\zeta)}$

Then,

$\theta^* = \text{argmax}_{\theta} \frac{1}{M}\sum_{\zeta} R(\zeta) - \text{log} \sum_{\zeta} e^{R(\zeta)}\\ \theta^* = \text{argmax}_{\theta} \frac{1}{M}\sum_{\zeta} \theta ^T \mathbf{f}_{\zeta} - \text{log} \sum_{\zeta} e^{\theta ^T \mathbf{f}_{\zeta}}\\$

And the objective is convex! (with the second term being log-sum-exp). We go ahead to differentiate the objective to find the gradients:

$\nabla_{\theta} L = \frac{1}{M}\sum_\zeta \mathbf{f}_{\zeta} - \frac{1}{\sum_\zeta e^{R(\zeta)}} \sum_{\zeta} (e^{R(\zeta)\frac{d R(\zeta)}{d\theta}})\\ = \frac{1}{M}\sum_\zeta \mathbf{f}_{\zeta} - \frac{1}{\sum_\zeta e^{R(\zeta)}} \sum_{\zeta} (e^{R(\zeta)}\frac{d R(\zeta) }{d\theta})\\ = \frac{1}{M}\sum_\zeta \mathbf{f}_{\zeta} - \sum_{\zeta}\frac{e^{R(\zeta)}}{\sum_\zeta e^{R(\zeta)}} \mathbf{f}_{\zeta}\\ = \frac{1}{M}\sum_\zeta \mathbf{f}_{\zeta} - \sum_{\zeta} P(\zeta | \theta) \mathbf{f}_{\zeta}\\$

Since the trajectories $\{\zeta\}$ are consist of states,

$\nabla_{\theta} L = \frac{1}{M}\sum_s \mathbf{f}_{s} - \sum_{s} P(s | \theta) \mathbf{f}_{s}$

Where $\mathbf{f}_s$ is the feature vector for the state $s$ . And

$P(s|\theta)$

is the state visitation frequency for state $s$ .

So far the main body of the algorithm is described. The only thing left is to compute the state visitation frequency (SVF) vector. To do so, we can use the following dynamic programming algorithm (for convienience we use $P(s)$ to denote SVF on state $s$ ).

We use $\mu_t(s)$ to denote the prob of visiting $s$ at $t$ (obviously, $P(s) = \sum_t \mu_t(s)$ )

solve the MDP using value iteration with the intermediate recovered rewards to get current optimal policy $\{\pi(a,s)\}$ .
compute $\mu_1(s)$ using sampled trajectories
using DP to solve for the rest given optimal policy $\{\pi(a,s)\}$ and the transition dynamics $\{P_{sa}(s')\}$

For $t = 1,..,T$ ,

$\mu_{t+1} (s) = \sum_{a}\sum_{s'} \mu_{t}(s')\pi(a,s')P_{sa}(s')$

And finally.

$P(s) = \sum_t \mu_t(s)$

One key things to note is that, the algorithm solves MDP in each iteration of training.

If the transition dynamics $\{P_{sa}(s')\}$ is unknown, we can actually using Monte Carlo to estimate the SVF with the trajectories. This is much more easier, so the details are omitted. Plugging the SVF back to the gradients, we can use iterative gradient descent to solve for the parameters $\theta$ .

Summary

As a final summary of the algorithm, here is the slide from UC Berkeley’s CS 294, Deep Reinforcement Learning course,

Implementation & Experiments

usage: maxent_gridworld.py [-h] [-hei HEIGHT] [-wid WIDTH] [-g GAMMA]
                           [-a ACT_RANDOM] [-t N_TRAJS] [-l L_TRAJ]
                           [--rand_start] [--no-rand_start]
                           [-lr LEARNING_RATE] [-ni N_ITERS]

optional arguments:
  -h, --help            show this help message and exit
  -hei HEIGHT, --height HEIGHT
                        height of the gridworld
  -wid WIDTH, --width WIDTH
                        width of the gridworld
  -g GAMMA, --gamma GAMMA
                        discount factor
  -a ACT_RANDOM, --act_random ACT_RANDOM
                        probability of acting randomly
  -t N_TRAJS, --n_trajs N_TRAJS
                        number of expert trajectories
  -l L_TRAJ, --l_traj L_TRAJ
                        length of expert trajectory
  --rand_start          when sampling trajectories, randomly pick start
                        positions
  --no-rand_start       when sampling trajectories, fix start positions
  -lr LEARNING_RATE, --learning_rate LEARNING_RATE
                        learning rate
  -ni N_ITERS, --n_iters N_ITERS
                        number of iterations

$ python maxent_gridworld.py --height=10 --width=10 --gamma=0.8 --n_trajs=100 --l_traj=50 --no-rand_start --learning_rate=0.01 --n_iters=20

Try use 2 rewards on the gridworld. Note here I enabled random start for generating trajectories. Because if the trajectory always starts at (0,0) then they tend to go to the nearest reward and stay there.

$ python maxent_gridworld.py --gamma=0.8 --n_trajs=400 --l_traj=50 --rand_start --learning_rate=0.01 --n_iters=20

Strengths and Limitations

Strengths
- scales to neural network costs (overcome the drawbacks of linear costs)
- efficient enough for real robots
Limitations
- requires repeatedly solving the MDP
- assumes known dynamics

References

Ziebart et al’s 2008 paper: Maximum Entropy Inverse Reinforcement Learning
UCB’s CS 294 DRL course, lecture on IRL