Policy Search (policy gradient)

Tags: CS 234, CS 285, Policy · Reviewed 2024

From value methods

Value methods focus on searching in value space and then deriving a policy from the value function. Can we bypass the middleman and optimize the policy directly?

In other words, we define a parameterized policy \pi_\theta(s, a), and we want to train it to yield the highest value function V^\pi.

As a quick side note, value methods and policy search methods are not completely disjoint. In the middle lie actor-critic methods, which we will cover later.

The case for stochastic policies

Previously with value methods, we mostly used a greedy policy, which is deterministic. In fact, in the tabular setting there always exists a deterministic optimal policy. But there are two main cases where we want a stochastic policy:

  1. Gameplay against an opponent: you do not want a predictable agent.
  2. Aliased states, where two states with very different properties look alike to the agent.

The aliased states are a little tricky. In essence, if the agent can't tell certain states apart, committing deterministically to one action can lock it into the wrong behavior for some of them; randomizing over the reasonable actions does better in expectation, so you want to stay as entropic as possible.

However, there is a caveat: even though stochastic policies are important for exploration, for any (fully observed) MDP, if an optimal policy exists, then a deterministic optimal policy also exists. That's because if a policy is stochastic and optimal, every action it randomizes over must itself be optimal, so you can just pick one of them. The aliased-state case doesn't contradict this, because aliasing is effectively partial observability.

The Policy Gradient

What we want is some \pi that maximizes V^\pi(s_0; \theta). You can think of V as your objective function, and you just perform optimization over \theta.

You can do this through a few gradient-free methods (e.g., random search over the parameters), and in fact, gradient-free methods have certain benefits: they're often faster in wall-clock time because they parallelize easily, they work even when the policy isn't differentiable, and they make a great heuristic. The problem is that they're less sample efficient. A sketch of the simplest version is below.
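To make this concrete, here is a minimal sketch of simple random search (hill climbing) over the parameters of a linear-softmax policy. The env interface (reset() returning a state, step(a) returning (state, reward, done)) and the policy form are illustrative assumptions, not a specific course codebase.

```python
# Minimal sketch of gradient-free policy search: simple random search
# (hill climbing) over the parameters of a linear-softmax policy.
# The `env` interface (reset() -> state, step(a) -> (state, reward, done))
# is an assumption for illustration.
import numpy as np

def evaluate(theta, env, episodes=5):
    """Average undiscounted return of a linear-softmax policy with params theta."""
    total = 0.0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            logits = theta @ s                        # theta: (n_actions, state_dim)
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            a = np.random.choice(len(probs), p=probs)
            s, r, done = env.step(a)
            total += r
    return total / episodes

def random_search(env, n_actions, state_dim, iters=200, sigma=0.1):
    theta = np.zeros((n_actions, state_dim))
    best = evaluate(theta, env)
    for _ in range(iters):
        candidate = theta + sigma * np.random.randn(*theta.shape)  # perturb parameters
        score = evaluate(candidate, env)
        if score > best:                              # keep the perturbation only if it helps
            theta, best = candidate, score
    return theta
```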

Deriving the policy gradient 🚀

We can express the value function as an expectation over trajectories generated by the policy, V(\theta) = E_{\tau \sim P(\tau; \theta)}[R(\tau)] = \sum_\tau P(\tau; \theta) R(\tau), so its gradient becomes

\nabla_\theta V(\theta) = \sum_\tau \nabla_\theta P(\tau; \theta)\, R(\tau)

So this is BAD, because we have to marginalize over \tau. However, we also know that we can reduce integrals/sums to Monte Carlo samples IF we can get them into expectation form. We ruined the expectation form by taking the gradient, but we can fix it using a little trick called the log-gradient trick:

\nabla_\theta V(\theta) = \sum_\tau \frac{P(\tau)}{P(\tau)}\nabla_\theta P(\tau)R(\tau) = \sum_\tau P(\tau) R(\tau) \frac{\nabla_\theta P(\tau)}{P(\tau)} = \sum_\tau P(\tau)R(\tau)\nabla_\theta \log P(\tau; \theta)

(all P(\tau) depend on \theta; I just didn't write it out for shorthand). The ratio \nabla_\theta P(\tau) / P(\tau) is known as the likelihood ratio.

Ok, so this is a good trick, but where does it get us? Well, let's write out what \nabla_\theta \log P(\tau; \theta) is. The trajectory probability factors into the initial state distribution, the dynamics, and the policy, so

\nabla_\theta \log P(\tau; \theta) = \nabla_\theta \left[\log \mu(s_0) + \sum_t \log P(s_{t+1} \mid s_t, a_t) + \sum_t \log \pi_\theta(a_t \mid s_t)\right] = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)

We see that everything drops out except the policy itself, which is differentiable: the dynamics and initial-state terms don't depend on \theta. And what's more, the whole expression is still an expectation!

So, in summary, the policy gradient is just

\nabla_\theta V(\theta) = E_{\tau \sim p(\tau)}\left[R(\tau)\sum_{t=0}^{T - 1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]

which, through a Monte Carlo estimate over m sampled trajectories, becomes

\nabla_\theta V(\theta) \approx \frac{1}{m}\sum_{i=1}^m R(\tau^{(i)})\sum_{t=0}^{T - 1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})
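As a hedged sketch of what this estimator looks like in code (PyTorch here, with policy_net and the trajectory format as assumed placeholders): sum the log-probabilities of the actions actually taken in each trajectory, weight each sum by that trajectory's total reward R(\tau), and average over the m trajectories. Minimizing the negative of this quantity performs gradient ascent on V.

```python
# Sketch of the Monte Carlo policy-gradient estimate. Assumes `policy_net`
# maps a (T, state_dim) tensor of states to (T, n_actions) logits, and that
# `trajectories` is a list of (states, actions, rewards) tensors collected
# with the current policy.
import torch
from torch.distributions import Categorical

def policy_gradient_loss(policy_net, trajectories):
    losses = []
    for states, actions, rewards in trajectories:
        log_probs = Categorical(logits=policy_net(states)).log_prob(actions)  # (T,)
        R = rewards.sum()                      # total trajectory reward R(tau), no gradient
        losses.append(-(R * log_probs.sum()))  # negative sign: minimizing => gradient ascent on V
    return torch.stack(losses).mean()          # average over the m sampled trajectories
```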

Score functions: a deeper dive

We call \nabla_\theta \log \pi_\theta(a \mid s) the score function, and for many common policy classes it is easy to compute in closed form.
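For example, two standard closed-form cases (stated here for reference; not derived in these notes): for a softmax policy over discrete actions with features \phi(s, a), \pi_\theta(a \mid s) \propto e^{\theta^\top \phi(s, a)}, the score function is

\nabla_\theta \log \pi_\theta(a \mid s) = \phi(s, a) - E_{a' \sim \pi_\theta(\cdot \mid s)}[\phi(s, a')]

and for a Gaussian policy with mean \mu_\theta(s) = \theta^\top \phi(s) and fixed variance \sigma^2,

\nabla_\theta \log \pi_\theta(a \mid s) = \frac{(a - \theta^\top \phi(s))\,\phi(s)}{\sigma^2}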

Intuition

Policy gradient is literally weighted maximum likelihood

If you stare at the policy gradient definition, you see that it's just the gradient of the policy's log-likelihood on its own actions, weighted by how much reward the trajectory got. This makes a ton of sense: you increase the likelihood of actions that yielded good long-term reward, and decrease the likelihood of actions that didn't.

Making Policy Gradients Better

So the policy gradient is a Monte Carlo estimate: it's unbiased, but it's noisy (high variance). Can we make it better?

Temporal Structure

So what we have right now is this expression:

\nabla_\theta V(\theta) = E_{\tau \sim p(\tau)}\left[R(\tau)\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]

This is valid, but you're weighting every action in a trajectory by the same total reward R(\tau), which includes rewards that happened before the action was taken. This doesn't make sense, because an action can't influence the past. So we can just get rid of those terms!

Concretely, this just means you have

\nabla_\theta V(\theta) = E_{\tau \sim p(\tau)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\sum_{t' = t}^{T - 1}r_{t'}\right]

And we know that this sum of rewards is just the return G^{(i)}_t (the "reward to go"). If we hop back to Monte Carlo approximations, we get

\nabla_\theta V(\theta) \approx \frac{1}{m}\sum_{i=1}^m \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, G^{(i)}_t

This is useful because we reduce the variance. Why? Well, the rewards collected before time t (due to environment and policy stochasticity) contribute variance to the weight but carry no signal about the action at time t. When we account for causality, we factor that noise out.
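In practice, the returns G_t are computed with a single backward scan over each trajectory's rewards. A minimal sketch (the optional discount \gamma is standard in implementations, even though these notes use the undiscounted sum):

```python
# Reward-to-go: G_t = sum_{t' >= t} gamma^(t'-t) * r_{t'}, computed with one
# backward scan over a trajectory's rewards. gamma=1.0 recovers the plain
# undiscounted sum used in the text.
def rewards_to_go(rewards, gamma=1.0):
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: rewards_to_go([1, 0, 2]) == [3, 2, 2]
```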

Baselines

Here's another insight. While the rewards can vary a lot, they may vary in a way that has nothing to do with the action: for example, you might pass through a region of naturally high reward, and you shouldn't be fazed by that. Furthermore, because we are doing weighted maximum likelihood, the "shift" of the rewards now matters: if your rewards are always positive, you never actively discourage bad behavior, for example.

To mitigate these problems, you can subtract a baseline, which accounts for what you “normally” see. This removes some of the noise.
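The simplest version of this in code is just centering (and often rescaling) the returns before using them as weights; a learned, state-dependent baseline b(s_t) would replace the mean. A minimal sketch:

```python
# Simplest baseline in code: center (and optionally rescale) the returns
# before using them as weights on the log-probs. A learned state-dependent
# baseline b(s_t) would replace returns.mean() below.
import torch

def returns_to_weights(returns, eps=1e-8):
    returns = torch.as_tensor(returns, dtype=torch.float32)
    adv = returns - returns.mean()        # subtract a constant baseline
    return adv / (adv.std() + eps)        # optional: normalize to roughly unit variance
```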

Analysis of Variance with Baselines

We want to derive the variance-minimizing baseline, which tells us what we might want to use. This isn't a proof that every baseline decreases variance (in fact, a poorly chosen baseline can increase it), but it shows where the minimum is.

We start by writing out the variance of the (baselined) single-sample gradient estimate:

Var = E_{\tau \sim p_\theta}\left[\left(\nabla_\theta \log p_\theta(\tau)\,(r(\tau) - b)\right)^2\right] - E_{\tau \sim p_\theta}\left[\nabla_\theta \log p_\theta(\tau)\,(r(\tau) - b)\right]^2

We showed previously that the second term doesn't vary with b (subtracting a baseline doesn't change the expected gradient, since E[\nabla_\theta \log p_\theta(\tau)] = 0), so in our optimization it is irrelevant.

Now, let's compute the derivative of the variance with respect to b, writing g(\tau) = \nabla_\theta \log p_\theta(\tau) for shorthand. Expanding the square in the first term, the quantity we need to differentiate is

\frac{d}{db}\, E\left[g(\tau)^2 (r(\tau) - b)^2\right] = \frac{d}{db}\left( E[g^2 r^2] - 2b\, E[g^2 r] + b^2\, E[g^2] \right)

The first term drops, but the second and third terms do matter. Now, we take the derivative and set it to zero.

And we get

-2\,E[g(\tau)^2\, r(\tau)] + 2b\,E[g(\tau)^2] = 0 \quad\Longrightarrow\quad b = \frac{E[g(\tau)^2\, r(\tau)]}{E[g(\tau)^2]}

So the optimal baseline depends on the gradient: it's the expected reward, weighted by squared gradient magnitudes. In practice this is hard to compute, but if we assume the gradients are more or less constant, then b = E[r(\tau)]. What's so special about this? Well, it shows that the best choice of baseline is (roughly) the expected return; the state-dependent version of this is the expected return from s_t, i.e., the value function.

REINFORCE and Vanilla PG

REINFORCE is the policy gradient with the temporal structure imposed: sample trajectories with the current policy, compute the returns G_t, and step in the direction \frac{1}{m}\sum_i \sum_t \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, G_t^{(i)}.

👉
You need a stochastic policy: if the policy is deterministic, the log-probabilities of the chosen actions are all \log 1 = 0, so the score function yields zero gradient, and there is no exploration.

And when we add the baseline, we get the vanilla policy gradient algorithm, which alternates between fitting b(s_t) and \pi.
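A hedged sketch of one iteration of that alternation, assuming the rollout phase has already produced flattened tensors of states, actions, and reward-to-go returns, and with policy_net, value_net, and the two optimizers as hypothetical placeholders:

```python
# One iteration of vanilla policy gradient: fit the baseline b(s_t) by
# regressing a value network onto observed returns, then update the policy
# with baseline-subtracted returns as weights. All names are placeholders.
import torch
from torch.distributions import Categorical

def vanilla_pg_update(policy_net, value_net, pi_opt, v_opt, states, actions, returns):
    # 1) Fit the baseline: regress V(s_t) onto the observed returns.
    v_loss = (value_net(states).squeeze(-1) - returns).pow(2).mean()
    v_opt.zero_grad()
    v_loss.backward()
    v_opt.step()

    # 2) Policy update with baseline-subtracted returns as weights.
    with torch.no_grad():
        adv = returns - value_net(states).squeeze(-1)
    log_probs = Categorical(logits=policy_net(states)).log_prob(actions)
    pi_loss = -(adv * log_probs).mean()
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()
```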

Actor Critic Methods

This is actually surprisingly simple. In REINFORCE, we estimated the return (and, with a baseline, the advantage) through a Monte Carlo approach. This is good in that it doesn't rely on a Markov assumption, but it's bad in that it's high variance.

We can instead use a TD approach to learn Q(s, a), which approximates G_t. And while we're at it, we can learn V(s), which approximates the baseline b(s). In doing so, we keep track of both the policy (the actor) and the value functions (the critic), which is why these are called actor-critic methods. Soft Actor-Critic (SAC) is a popular method, as is A3C.

The resulting gradient weights the score function by the advantage,

\nabla_\theta V(\theta) = E\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\right]

where A = Q - V.
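As a sketch of how this looks per batch of transitions (a one-step actor-critic update, with policy_net, value_net, and the transition tensors as assumptions): the critic is trained toward the TD(0) target, and the TD error is used as an approximate advantage to weight the score function.

```python
# One-step actor-critic sketch: the critic V(s) is trained toward the TD(0)
# target, and the TD error is used as an approximate advantage for the actor.
# `policy_net`, `value_net`, and the batched transition tensors are assumptions.
import torch
from torch.distributions import Categorical

def actor_critic_step(policy_net, value_net, pi_opt, v_opt,
                      s, a, r, s_next, done, gamma=0.99):
    # Critic update: minimize (V(s) - (r + gamma * V(s')))^2
    with torch.no_grad():
        target = r + gamma * value_net(s_next).squeeze(-1) * (1.0 - done)
    v_loss = (value_net(s).squeeze(-1) - target).pow(2).mean()
    v_opt.zero_grad()
    v_loss.backward()
    v_opt.step()

    # Actor update: weight the score function by the TD error (approximate A = Q - V)
    with torch.no_grad():
        delta = target - value_net(s).squeeze(-1)
    log_prob = Categorical(logits=policy_net(s)).log_prob(a)
    pi_loss = -(delta * log_prob).mean()
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()
```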

There is more information in the Deep Reinforcement Learning notes

Implementing Policy Gradients (practical advice)

We can take advantage of autodiff by setting up a computation graph whose gradient is the policy gradient.

Recall that the PG is very similar to maximum likelihood. So we can just take the maximum likelihood objective and weight each log-probability by the return.

The derivative of this is the PG.
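A minimal sketch of that surrogate "pseudo-loss" in PyTorch (names are placeholders): its numerical value isn't meaningful, but its gradient is the policy-gradient estimate (up to sign and the mean scaling), so one backward() call gives the update direction.

```python
# Surrogate "pseudo-loss": weighted maximum likelihood, with (baseline-
# subtracted) returns as the weights. The weights must not carry gradients,
# hence detach().
import torch
from torch.distributions import Categorical

def pg_pseudo_loss(policy_net, states, actions, weights):
    log_probs = Categorical(logits=policy_net(states)).log_prob(actions)
    return -(weights.detach() * log_probs).mean()

# Usage sketch (all names hypothetical):
#   loss = pg_pseudo_loss(policy_net, states, actions, returns - baseline)
#   loss.backward()   # autodiff now produces the policy-gradient update direction
```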