Policy Search (policy gradient)
Tags: CS 234 · CS 285 · Policy · Reviewed 2024
From value methods
Value methods always search in value-function space and then derive a policy from the result. Can we bypass the middleman and just optimize the policy directly?
In other words, we define a parameterized policy $\pi_\theta(a \mid s)$, and we want to train it to yield the highest value function $V^{\pi_\theta}$.
As a quick sidenote, value methods and policy search methods are not completely disjoint. In the middle lie actor-critic methods, which we will cover later.
The case for stochastic policies
Previously with value methods, we mostly used a greedy policy, which is deterministic. In fact, in a fully observed tabular MDP, there is always an optimal policy that is deterministic. But there are two main cases where an optimal policy becomes stochastic:
- Gameplay: you do not want a predictable agent (e.g., rock-paper-scissors).
- Aliased states, where two states with very different properties look alike to the agent.
The aliased states are a little tricky. In essence, if you can't tell which state you're actually in, it's bad to commit to a single action; rather, you should randomize, i.e., be as entropic as possible.
However, there is a caveat: even though stochastic policies are important for exploration, for any (fully observed) MDP, if an optimal policy exists, then a deterministic optimal policy also exists. That's because if a policy is optimal and stochastic, every action it gives nonzero probability must itself be optimal, so you can just pick one of them.
The Policy Gradient
What we want is some $\theta$ that optimizes $V(\theta) = V^{\pi_\theta}(s_0)$. You can think of $V(\theta)$ as your objective function, and you want to perform optimization over $\theta$.
You can do this through a few gradient-free methods (e.g., hill climbing or the cross-entropy method), and in fact, gradient-free methods have certain benefits: they're simple, easy to parallelize, and a great heuristic even for non-differentiable policies. The problem is that they're less sample efficient.
Deriving the policy gradient 🚀
We can express the value function as an expectation over policy trajectories, which becomes

$$V(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big] = \sum_{\tau} P(\tau; \theta)\, R(\tau), \qquad \nabla_\theta V(\theta) = \sum_{\tau} \nabla_\theta P(\tau; \theta)\, R(\tau),$$

where $R(\tau) = \sum_{t=0}^{T-1} r_t$ is the total reward of trajectory $\tau$ and $P(\tau; \theta)$ is the probability of $\tau$ under $\pi_\theta$.
So this is BAD, because we have to marginalize over $\tau$. However, we also know that we can reduce integrals/sums to Monte Carlo samples IF we can get them into expectation form. We ruined the expectation form with the gradient, but we can fix it using a little trick called the log-gradient trick:

$$\nabla_\theta P(\tau) = P(\tau)\, \frac{\nabla_\theta P(\tau)}{P(\tau)} = P(\tau)\, \nabla_\theta \log P(\tau)$$
(all of the $P(\tau)$ depend on $\theta$; I just didn't write it out for shorthand). The ratio $\frac{\nabla_\theta P(\tau)}{P(\tau)}$ is known as the likelihood ratio.
Ok, so this is a good trick, but where does it get us? Well, let's write out what $\log P(\tau; \theta)$ is:

$$\log P(\tau; \theta) = \log\Big[\mu(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)\Big] = \log \mu(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T-1} \log P(s_{t+1} \mid s_t, a_t).$$
When we take the gradient with respect to $\theta$, everything drops out except for the policy itself, which is the one part that is differentiable in $\theta$ (the initial-state distribution and the dynamics don't depend on $\theta$):

$$\nabla_\theta \log P(\tau; \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$

And what's more, the overall gradient is now an expectation, so we can estimate it with samples!
So, in summary, the policy gradient is just

$$\nabla_\theta V(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big],$$
which, through a Monte Carlo estimate over $m$ sampled trajectories $\tau^{(1)}, \dots, \tau^{(m)}$, becomes

$$\nabla_\theta V(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} R\big(\tau^{(i)}\big) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big).$$
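To make the estimator concrete, here is a minimal numerical sketch (numpy; the one-step "bandit" setup, the softmax parameterization, and all specific numbers are illustrative assumptions) comparing the Monte Carlo score-function estimate to the exact gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-step ("bandit") setup: a softmax policy over K actions with fixed rewards.
K = 3
theta = rng.normal(size=K)            # policy parameters (logits)
r = np.array([1.0, 2.0, 0.5])         # reward for each action

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

p = softmax(theta)

# Exact gradient of J(theta) = E_{a~pi}[r_a]:
# d pi_a / d theta = pi_a * (e_a - pi), so grad J = sum_a r_a * pi_a * (e_a - pi)
exact = sum(r[a] * p[a] * (np.eye(K)[a] - p) for a in range(K))

# Score-function (policy gradient) Monte Carlo estimate:
# grad J = E_{a~pi}[ r_a * grad log pi_a ], with grad log pi_a = e_a - pi here
m = 200_000
actions = rng.choice(K, size=m, p=p)
scores = np.eye(K)[actions] - p                      # (m, K) score vectors
estimate = (r[actions][:, None] * scores).mean(axis=0)

print(exact)      # exact gradient
print(estimate)   # agrees up to Monte Carlo noise
```

The two agree up to Monte Carlo noise, which is exactly the "unbiased but noisy" behavior discussed below.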
Score functions: a deeper dive
We call $\nabla_\theta \log \pi_\theta(a \mid s)$ the score function, and for many policies this is pretty easy to compute in closed form:
- Softmax: if the policy is a softmax over linear scores, $\pi_\theta(a \mid s) \propto \exp\big(\theta^\top \phi(s, a)\big)$ for feature vector $\phi(s, a)$, the score is $\nabla_\theta \log \pi_\theta(a \mid s) = \phi(s, a) - \mathbb{E}_{a' \sim \pi_\theta}\big[\phi(s, a')\big]$.
- Gaussian: for a Gaussian policy parameterized by mean $\mu(s) = \theta^\top \phi(s)$ with fixed variance $\sigma^2$, i.e. $a \sim \mathcal{N}\big(\theta^\top \phi(s), \sigma^2\big)$, the score is $\nabla_\theta \log \pi_\theta(a \mid s) = \frac{\big(a - \theta^\top \phi(s)\big)\, \phi(s)}{\sigma^2}$ (a quick numerical check of this one is sketched below).
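As a sanity check on the Gaussian score above, here is a minimal numpy sketch comparing the closed form against a finite-difference approximation (the feature vector, $\sigma$, and the sampled action are arbitrary illustrative values):

```python
import numpy as np

# Check the Gaussian score grad_theta log pi_theta(a|s) = (a - theta^T phi(s)) phi(s) / sigma^2
# against a central finite-difference approximation.
theta = np.array([0.5, -1.0, 2.0])
phi = np.array([1.0, 3.0, -0.5])       # feature vector phi(s) (arbitrary values)
sigma = 0.7
a = 1.3                                # a sampled action

def log_pi(theta):
    mu = theta @ phi
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

analytic = (a - theta @ phi) * phi / sigma ** 2

eps = 1e-6
numeric = np.array([
    (log_pi(theta + eps * e) - log_pi(theta - eps * e)) / (2 * eps)
    for e in np.eye(len(theta))
])

print(analytic)
print(numeric)   # matches the analytic score to ~1e-6
```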
Intuition
If you stare at the policy gradient definition, you see that it's just weighting the gradient of the log-policy at each step by how much reward the trajectory got. This actually makes a ton of sense: you're increasing the likelihood of actions that yielded good long-term reward, and decreasing the likelihood of actions that didn't.
Making Policy Gradients Better
So policy gradients are Monte Carlo estimates: unbiased, but noisy. Can we make them better?
Temporal Structure
So what we have right now is this expression:

$$\nabla_\theta V(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \Big(\sum_{t=0}^{T-1} r_t^{(i)}\Big) \Big(\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\Big).$$
This is valid, but you're weighting every action in a trajectory by the same total reward, including rewards that arrived before the action was taken. This doesn't make sense, because an action can't influence its past. So we can just get rid of those terms!
Concretely, this just means you have

$$\nabla_\theta V(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big(\sum_{t'=t}^{T-1} r_{t'}\Big)\Big].$$
And we know that this sum of rewards is just the return $G_t = \sum_{t'=t}^{T-1} r_{t'}$. If we hop back to the Monte Carlo approximation, we get

$$\nabla_\theta V(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{T-1} G_t^{(i)}\, \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big).$$
This is useful because it reduces variance. Why? The rewards collected before time $t$ have their own randomness (from environment stochasticity and earlier actions), but the action at time $t$ can't affect them, so they contribute only noise to the weight on $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$. Accounting for causality factors that noise out.
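Here is a minimal sketch of the reward-to-go computation (the optional discount factor $\gamma$ is an assumption; the derivation above used undiscounted sums):

```python
import numpy as np

def reward_to_go(rewards, gamma=1.0):
    """G_t = sum_{t' >= t} gamma^(t'-t) * r_{t'} for every timestep of one trajectory."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# Each grad log pi_theta(a_t | s_t) term is weighted by G[t],
# not by the total trajectory reward.
print(reward_to_go([1.0, 0.0, 2.0]))   # -> [3. 2. 2.]
```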
Baselines
Here’s another insight. While the rewards can vary a lot, they may vary in a controlled way. For example, you might pass through a place of naturally high reward. So you shouldn’t get phased by that. Furthermore, because we are doing weighted regression, the “shift” of the reward now matters. If your rewards are always positive, then we never discourage bad behavior, for example.
To mitigate these problems, you can subtract a baseline $b(s_t)$, which accounts for what you "normally" see from that state:

$$\nabla_\theta V(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big) \Big(G_t^{(i)} - b\big(s_t^{(i)}\big)\Big).$$

This removes some of the noise.
Proof that baselines don't add bias (as long as the baseline doesn't depend on the action $a_t$)
Intuitively, this seems weird. But we can show that any baseline will not add bias to the estimator.
What we want to show is that the baseline term contributes nothing in expectation, i.e.

$$\mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big] = 0.$$

But how?
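A sketch of the standard argument: fix a timestep, condition on $s_t$, pull $b(s_t)$ out, and use the fact that $\pi_\theta(\cdot \mid s_t)$ sums (or integrates) to one:

$$
\begin{aligned}
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big]
&= b(s_t) \sum_{a} \pi_\theta(a \mid s_t)\, \nabla_\theta \log \pi_\theta(a \mid s_t) \\
&= b(s_t) \sum_{a} \nabla_\theta \pi_\theta(a \mid s_t)
= b(s_t)\, \nabla_\theta \underbrace{\sum_{a} \pi_\theta(a \mid s_t)}_{=\,1} = 0.
\end{aligned}
$$

Taking the outer expectation over states keeps this at zero, so subtracting the baseline leaves the estimator unbiased.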
Analysis of Variance with Baselines
We want to derive the variance-minimizing baseline, which shows us what we might want to use in practice. This isn't sufficient to show that all baselines decrease variance (in fact, a badly chosen baseline can increase it), but it does show where the minimum is.
We start by writing the variance expression, using the shorthand $g(\tau) := \nabla_\theta \log P(\tau; \theta)$ (treated per component of $\theta$):

$$\mathrm{Var}\big[g(\tau)\,(R(\tau) - b)\big] = \mathbb{E}_\tau\big[g(\tau)^2 (R(\tau) - b)^2\big] - \Big(\mathbb{E}_\tau\big[g(\tau)\,(R(\tau) - b)\big]\Big)^2.$$
We showed previously that the second term doesn't vary with $b$ (the baseline adds no bias, so $\mathbb{E}_\tau[g(\tau)(R(\tau) - b)] = \mathbb{E}_\tau[g(\tau)\, R(\tau)]$), so in our optimization it is irrelevant.
Now, let’s compute the gradient of the variance WRT (where ). When we write the derivative out, we get
The first term doesn't depend on $b$ and drops, but the second and third terms do matter. Now, we take the derivative and set it to zero:

$$\frac{d}{db}\,\mathbb{E}_\tau\big[g(\tau)^2 (R(\tau) - b)^2\big] = -2\,\mathbb{E}_\tau\big[g(\tau)^2 R(\tau)\big] + 2b\,\mathbb{E}_\tau\big[g(\tau)^2\big] = 0.$$
And we get

$$b^* = \frac{\mathbb{E}_\tau\big[g(\tau)^2\, R(\tau)\big]}{\mathbb{E}_\tau\big[g(\tau)^2\big]}.$$
So this baseline depends on the gradient: it's the expected reward, weighted by squared gradient magnitudes. In practice this is hard to compute, but if we assume that the gradients are more or less constant across trajectories, then $b \approx \mathbb{E}_\tau[R(\tau)]$. What's so special about this? Well, it shows that the best choice for the baseline is (approximately) the expected return from state $s_t$, i.e., the value function $V^{\pi}(s_t)$.
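As a sketch, here is how you might estimate that optimal baseline from samples, per parameter component (the array shapes and names are illustrative assumptions):

```python
import numpy as np

def optimal_baseline(grads, returns):
    """Per-component optimal baseline b* = E[g^2 R] / E[g^2], estimated from samples.

    grads:   (m, d) array, g(tau^(i)) = sum_t grad log pi_theta(a_t | s_t) per trajectory
    returns: (m,)   array, R(tau^(i)) per trajectory
    """
    g2 = grads ** 2
    return (g2 * returns[:, None]).mean(axis=0) / g2.mean(axis=0)
```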
REINFORCE and Vanilla PG
REINFORCE is the policy gradient with the temporal structure imposed, turned into stochastic gradient ascent: sample an episode $s_0, a_0, r_0, s_1, \dots$ from $\pi_\theta$, then for each timestep $t$ update

$$\theta \leftarrow \theta + \alpha\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

(with an extra $\gamma^t$ factor if you use discounting).
And when we add the baseline, we get the vanilla policy gradient algorithm, which alternates between fitting the baseline $b(s_t)$ (by regressing it onto the observed returns $G_t$) and updating $\theta$ with the baselined gradient estimate $\frac{1}{m}\sum_i \sum_t \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\big(G_t^{(i)} - b(s_t^{(i)})\big)$.
Actor Critic Methods
This is actually surprisingly simple. In REINFORCE, we formed an advantage-like estimate ($G_t$, or $G_t - b(s_t)$ with a baseline) through a Monte Carlo approach. This is good in that it doesn't rely on a Markov assumption, but it's bad in that it's high variance.
We can just use a TD approach to learn $\hat{V}_w(s)$, which is an approximation to $V^{\pi}(s)$. And while we're at it, we can also learn $\hat{Q}_w(s, a)$, which is an approximation to $Q^{\pi}(s, a)$. In doing so, we keep track of the policy (the actor) and the value functions (the critic). This is why they're called actor-critic methods. The gradient estimate becomes

$$\nabla_\theta V(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, \hat{A}\big(s_t^{(i)}, a_t^{(i)}\big),$$

where $\hat{A}(s_t, a_t) = r_t + \gamma \hat{V}_w(s_{t+1}) - \hat{V}_w(s_t)$ is a TD estimate of the advantage (you can also use $\hat{Q}_w(s_t, a_t) - \hat{V}_w(s_t)$). Soft Actor-Critic (SAC) is a popular method, as well as A3C.
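Here is a minimal PyTorch sketch of one actor-critic update on a single transition (the network sizes, learning rates, and the random dummy transition are illustrative assumptions, not a full training loop):

```python
import torch

obs_dim, n_actions, gamma = 4, 2, 0.99
actor = torch.nn.Sequential(torch.nn.Linear(obs_dim, 32), torch.nn.Tanh(),
                            torch.nn.Linear(32, n_actions))
critic = torch.nn.Sequential(torch.nn.Linear(obs_dim, 32), torch.nn.Tanh(),
                             torch.nn.Linear(32, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# A single made-up transition (s, a, r, s') just to show the update mechanics.
s, s_next = torch.randn(obs_dim), torch.randn(obs_dim)
dist = torch.distributions.Categorical(logits=actor(s))
a = dist.sample()
r = torch.tensor(1.0)

# Critic: TD(0) target r + gamma * V_hat(s'); the TD error doubles as the advantage estimate.
with torch.no_grad():
    td_target = r + gamma * critic(s_next).squeeze()
v = critic(s).squeeze()
advantage = (td_target - v).detach()

critic_loss = (td_target - v) ** 2
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()

# Actor: policy gradient step, weighting grad log pi by the TD-error advantage.
actor_loss = -dist.log_prob(a) * advantage
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```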
There is more information in the Deep Reinforcement Learning notes
Implementing Policy Gradients (practical advice)
We can take advantage of autodiff by setting up a graph such that the derivative of the graph gives the policy gradient.
Recall that the PG looks very much like a (weighted) maximum likelihood gradient. So, we can just take the maximum likelihood objective and weight each log-probability by the return:

$$\tilde{J}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{T-1} \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, G_t^{(i)}$$

(or $G_t^{(i)} - b(s_t^{(i)})$ if you use a baseline).
The gradient of this surrogate objective is exactly the policy gradient, so any autodiff framework can compute it for us.
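A minimal PyTorch sketch of that surrogate objective for a categorical policy (the function name and tensor shapes are assumptions for illustration):

```python
import torch

def pg_loss(policy_net, states, actions, weights):
    """Negative surrogate objective: minimizing it performs gradient ascent on V(theta).

    states:  (N, obs_dim) float tensor of visited states
    actions: (N,) long tensor of the actions that were actually taken
    weights: (N,) float tensor of returns-to-go (or baselined advantages)
    """
    dist = torch.distributions.Categorical(logits=policy_net(states))
    return -(dist.log_prob(actions) * weights).mean()

# Usage sketch:
# loss = pg_loss(policy_net, states, actions, weights)
# loss.backward()   # autodiff now produces (minus) the policy gradient
# optimizer.step()
```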