Value Function Methods
Tags: CS 234, CS 285, Reviewed 2024, Value
Suggested readings
Moving from AC
In actor-critic, we used a policy to act and a critic to evaluate and improve the policy. But can we do without the actor entirely? This insight gives rise to a family of value-based methods.
Note that these will sound very familiar because we’ve talked about them in the Model-Free Control section. However, that section posed the methods in a tabular setting. We will now turn to practical methods that work in real environments with deep networks. You will find that the ideas are very similar.
Policy Iteration
This is the first value function method we will talk about. The idea is very simple
- Fit $A^\pi(s, a)$ using your preferred method (like TD). Typically this is done through a Monte Carlo estimate of $V^\pi(s)$.
- Set $\pi \leftarrow \pi'$, where $\pi'$ is the delta function: $\pi'(a_t \mid s_t) = 1$ if $a_t = \arg\max_{a_t} A^\pi(s_t, a_t)$ (and $0$ otherwise).

Practically speaking, you only need to fit $V^\pi$, because $A^\pi(s, a) = r(s, a) + \gamma \mathbb{E}_{s'}\left[V^\pi(s')\right] - V^\pi(s)$ and the $\arg\max$ over actions doesn’t depend on the $-V^\pi(s)$ term.
Connection to tabular, model-based methods
Recall that we had policy iteration in the tabular setting, where we alternated between learning a V function through Bellman backups and deriving a policy from the Q function (which is in turn derived from the V function). So we are just doing the same thing, except that everything is approximate.
Value Iteration
Policy iteration can work, but we are still representing a policy explicitly. If we know that the policy is just the argmax of the Q function, we can forgo the explicit policy completely and arrive at the value iteration algorithm
- Fit $Q(s, a) \leftarrow r(s, a) + \gamma \mathbb{E}_{s' \sim p(s' \mid s, a)}\left[V(s')\right]$. Approximate this expectation with a single sample $s'$.
- Fit $V(s) \leftarrow \max_a Q(s, a)$.

Note that the first step is evaluation, and the second step is improvement. The implicit policy behind $\max_a Q(s, a)$ is at least as good as the implicit policy behind the old $V$.
Derivation of Q-learning
We can do steps 1 and 2 in one single optimization problem: define the target $y = r(s, a) + \gamma \max_{a'} Q_\phi(s', a')$ and then fit $Q_\phi(s, a)$ to $y$, e.g. by minimizing $\frac{1}{2}\sum_i \left(Q_\phi(s_i, a_i) - y_i\right)^2$. We call this Q-learning.
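Below is a minimal sketch of this loop over a fixed batch of transitions. The names (`fit`, `q_init`, the tuple layout) are illustrative placeholders, not anything from the lecture; the Q function is assumed to return a vector of values over discrete actions.

```python
import numpy as np

def fitted_q_iteration(transitions, q_init, fit, gamma=0.99, num_iters=50):
    """Fitted Q iteration over a fixed batch of (s, a, r, s', done) arrays.

    `q_init` and `fit(inputs, targets)` stand in for any supervised regressor;
    both return a function q(states) -> (batch, num_actions) array of values.
    """
    q = q_init
    s, a, r, s_next, done = transitions
    for _ in range(num_iters):
        # Step 1: build targets y = r + gamma * max_a' Q(s', a'), zero at terminals
        y = r + gamma * (1.0 - done) * np.max(q(s_next), axis=1)
        # Step 2: regress Q(s, a) toward the targets
        q = fit((s, a), y)
    return q
```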
Connection to tabular, model-based methods
As we saw in the tabular notes, value iteration is different because of its implicit policy representation. There are some subtle differences in behavior, as outlined in those notes. But overall, we are just re-establishing a method that we first talked about in tabular.
Dealing with the maximization
In discrete land, you can just ask the Q function for the values of all the actions and pick the best one. In continuous land, this is harder because you need to solve a maximization over actions. In this case, an actor-critic algorithm might work better (although the tradeoff is that AC is generally on-policy)
Why fit V or Q?
The cautious reader may notice that our version of value iteration is different from the one in the tabular notes. Instead of fitting V, we fit Q and then maximize. This begs the question: why choose to fit Q or V?
It turns out, there’s one critical difference. Consider policy evaluation for V:

$$V^\pi(s) \leftarrow r(s, \pi(s)) + \gamma \mathbb{E}_{s' \sim p(s' \mid s, \pi(s))}\left[V^\pi(s')\right]$$

Note how the update requires you to have actions from $\pi$, and this isn’t good if you’re using offline data of the form $(s, a, r, s')$ collected by some other policy. But now consider policy evaluation for Q:

$$Q^\pi(s, a) \leftarrow r(s, a) + \gamma \mathbb{E}_{s' \sim p(s' \mid s, a)}\left[Q^\pi(s', \pi(s'))\right]$$

Do you see how it doesn’t matter what distribution the action $a$ was sampled from? It means that you can apply this algorithm to any interaction data and still yield $Q^\pi$.
Therefore, if you want your algorithm to be off-policy, you would want to fit a Q function, not a V function.
The General Fitted Algorithms ⭐
All of these fit $Q_\phi(s, a)$ (or $V_\phi(s)$) to some target $y$ by regression; they differ only in the target:

- For Monte Carlo, you use the observed return $y_t = \sum_{t' = t}^{T} \gamma^{t' - t} r_{t'}$ as the target.
- For SARSA, you use the TD target $y = r(s, a) + \gamma Q_\phi(s', a')$, where $a'$ is the action actually taken next.
- For Q-learning, you use the max-Q target $y = r(s, a) + \gamma \max_{a'} Q_\phi(s', a')$ (known as fitted Q iteration).
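A small sketch of how the three targets differ in code (function and argument names are illustrative; `q(s)` is assumed to return a vector of Q-values over discrete actions):

```python
import numpy as np

def fitted_target(kind, r, s_next, a_next, mc_return, q, gamma=0.99):
    """Targets for the general fitted algorithm; only the target changes."""
    if kind == "mc":      # Monte Carlo: the observed return from this timestep
        return mc_return
    if kind == "sarsa":   # TD target using the next action actually taken
        return r + gamma * q(s_next)[np.arange(len(r)), a_next]
    if kind == "q":       # fitted Q iteration: max over next actions
        return r + gamma * q(s_next).max(axis=1)
    raise ValueError(f"unknown target kind: {kind}")
```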
Why Fitted convergence is hard
Essentially, when you are doing function approximation, you’re constrained to a subspace.
The Bellman backup is a guaranteed contraction, but this backup may take you out of this subspace (see diagram).
Therefore, you have to project it back onto the subspace with a projection operator $\Pi$, and you can no longer say anything about the composed operator $\Pi \mathcal{B}$ being a contraction on your subspace. You might actually be moving further away.
Deep Q learning (DQN) 💻
This is just Q-learning with neural networks as the function approximator, plus a slowly updated target network modification.
Pseudocode
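Here is a hedged sketch of the classic loop, combining the replay buffer and the lagging target network discussed below; the environment interface follows the Gym convention, and the architecture and hyperparameters are illustrative rather than the ones from the lecture.

```python
import random
from collections import deque
import torch
import torch.nn as nn

def dqn(env, obs_dim, n_actions, steps=100_000, gamma=0.99,
        batch_size=64, target_period=1_000, epsilon=0.1):
    q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    target_net.load_state_dict(q_net.state_dict())          # start identical
    opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    buffer = deque(maxlen=100_000)                           # replay buffer

    s, _ = env.reset()
    for t in range(steps):
        # epsilon-greedy action from the current Q network
        if random.random() < epsilon:
            a = env.action_space.sample()
        else:
            with torch.no_grad():
                a = q_net(torch.as_tensor(s, dtype=torch.float32)).argmax().item()
        s2, r, terminated, truncated, _ = env.step(a)
        buffer.append((s, a, r, s2, float(terminated)))
        s = s2 if not (terminated or truncated) else env.reset()[0]

        if len(buffer) >= batch_size:
            batch = random.sample(buffer, batch_size)
            S, A, R, S2, D = (torch.as_tensor(x, dtype=torch.float32)
                              for x in map(list, zip(*batch)))
            with torch.no_grad():                            # targets use the frozen net
                y = R + gamma * (1 - D) * target_net(S2).max(dim=1).values
            q_sa = q_net(S).gather(1, A.long().unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(q_sa, y)
            opt.zero_grad(); loss.backward(); opt.step()

        if t % target_period == 0:                           # slow target update
            target_net.load_state_dict(q_net.state_dict())
    return q_net
```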
This isn’t stable out of the box though. We might get some divergence. There are two primary problems.
- Correlation between samples
- Non-stationary targets
In the next part, we’ll talk about how we can resolve these issues.
Experience replay
The first problem is sample correlation. You ideally want IID samples from your policy, but your policy is rolled out in the environment sequentially. So you’re going to get highly correlated samples, which can lead to a sort of sequential overfitting.
To solve this problem, just use a replay buffer! Q-learning is off-policy so we can learn from past experiences.
Generally, the more memory the better, although if your dynamics are changing, you might want a smaller memory.
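A minimal replay buffer sketch (the class name and capacity are illustrative); sampling uniformly from a large memory of old transitions breaks the temporal correlation between consecutive samples, and Q-learning being off-policy is what makes reusing them valid:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        self.storage = deque(maxlen=capacity)      # oldest transitions fall off

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.storage, batch_size)   # roughly decorrelated draws
        return list(zip(*batch))                          # group the fields together
```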
Fixed Q targets
The second problem is that your target Q value changes as you update, because the target and the prediction share the same parameters (even if you don’t propagate gradients directly through the target). This can lead to a dog-chasing-its-tail sort of problem.
To solve this, just freeze the target Q network and update it only once every $N$ updates to the actual Q network. However, this doesn’t solve another problem that arises with this framework, known as maximization bias. We will talk about this in the next part.
Maximization Bias and How to Solve It
What is maximization bias?
Well, intuitively, you’re trying to maximize the Q function, but this maximization has some potentially harmful pitfalls. We’ll look at offline RL later, where this is a big problem, but essentially, when you maximize the Q function, you may end up selecting an action whose estimated value is high purely due to noise.
More theoretically, you can consider Jensen’s inequality in this toy example, where we only have two actions, $a_1$ and $a_2$, and both have zero expected reward. Because $\max$ is convex, Jensen’s inequality gives

$$\mathbb{E}\left[\max\left(\hat{Q}(s, a_1), \hat{Q}(s, a_2)\right)\right] \geq \max\left(\mathbb{E}\left[\hat{Q}(s, a_1)\right], \mathbb{E}\left[\hat{Q}(s, a_2)\right]\right) = 0$$

So this estimate may be greater than $0$, even though the true value is $0$.
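A tiny numerical check of this toy example (both actions have true value $0$, but the estimates are noisy):

```python
import numpy as np

# Two actions, both with true value 0, estimated from noisy (unbiased) samples.
rng = np.random.default_rng(0)
noisy_q = rng.normal(loc=0.0, scale=1.0, size=(100_000, 2))

print(noisy_q.mean(axis=0))        # ~[0, 0]: each estimate is unbiased on its own
print(noisy_q.max(axis=1).mean())  # ~0.56 > 0: taking the max biases the value upward
```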
How does Maximization Bias affect learning?
Maximization bias can slow learning down, and in the function approximation case it can lead to suboptimality. However, in the tabular setting, GLIE is still enough for convergence.
Double Q learning
One idea: use one Q function to select the max action, and use another to estimate the value of that selected action. Intuitively, we want to avoid using the max of estimates as an estimate of the max of true values. By decoupling selection and evaluation, we move in the right direction.
Algorithm
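As a sketch, here is one tabular double Q-learning update in the standard formulation (a coin flip decides which table gets updated; the other table evaluates the selected action); names and hyperparameters are illustrative:

```python
import numpy as np

def double_q_update(q_a, q_b, s, a, r, s_next, done, alpha=0.1, gamma=0.99, rng=np.random):
    """One update to a pair of tabular Q functions q_a, q_b (2D arrays [state, action])."""
    if rng.random() < 0.5:
        best = np.argmax(q_a[s_next])                          # select with Q_A ...
        target = r + gamma * (1 - done) * q_b[s_next, best]    # ... evaluate with Q_B
        q_a[s, a] += alpha * (target - q_a[s, a])
    else:
        best = np.argmax(q_b[s_next])                          # select with Q_B ...
        target = r + gamma * (1 - done) * q_a[s_next, best]    # ... evaluate with Q_A
        q_b[s, a] += alpha * (target - q_b[s, a])
```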
Double DQN
With DQN, there’s a lagging target network and the current network. In standard DQN, the target $y = r + \gamma \max_{a'} Q_{\phi'}(s', a')$ uses the target network both to select and to evaluate the next action. Can we use the same idea from double Q-learning here?
We can! Use the current network to select the best action, evaluate it with the target network, and update your current network: $y = r + \gamma Q_{\phi'}\left(s', \arg\max_{a'} Q_\phi(s', a')\right)$.
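In code, the only thing that changes relative to DQN is how the target is built (a torch-style sketch; `q_net` is the current network $Q_\phi$, `target_net` is the lagging copy $Q_{\phi'}$):

```python
import torch

def dqn_target(r, s_next, done, target_net, gamma=0.99):
    # vanilla DQN: the target network both selects and evaluates the next action
    return r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

def double_dqn_target(r, s_next, done, q_net, target_net, gamma=0.99):
    # double DQN: the current network selects, the target network evaluates
    best = q_net(s_next).argmax(dim=1, keepdim=True)
    return r + gamma * (1 - done) * target_net(s_next).gather(1, best).squeeze(1)
```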
With continuous actions
We can extend Q-learning to continuous action spaces. The main problem is that the $\max_a Q(s, a)$ step doesn’t work very well over continuous action spaces. We have three choices.
Approximating the maximum with classic approaches
We can approximate the maximum over the continuous space as the maximum over a set of discretized values. The problem is that this is not very accurate.
We can also use the cross-entropy method (see later lectures). It’s basically rounds of iterated sampling and refitting, and it’s pretty fast. These approaches work OK up to roughly 40-dimensional action spaces.
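A sketch of CEM used as the inner action maximizer (all names and sizes are illustrative): sample candidate actions, keep the elites, refit a Gaussian, repeat.

```python
import numpy as np

def cem_argmax_q(q, s, action_dim, iters=3, pop=64, n_elite=8, rng=None):
    """Approximately compute argmax_a q(s, a) by iterated sampling."""
    rng = rng or np.random.default_rng()
    mean, std = np.zeros(action_dim), np.ones(action_dim)
    for _ in range(iters):
        actions = rng.normal(mean, std, size=(pop, action_dim))    # candidate actions
        values = np.array([q(s, a) for a in actions])              # score with the Q function
        elites = actions[np.argsort(values)[-n_elite:]]            # keep the best candidates
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3 # refit the sampling Gaussian
    return mean
```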
Fitting a simple Q function
We can also use a function class that is easy to maximize in closed form, like a quadratic in the action:

$$Q_\phi(s, a) = -\frac{1}{2}\left(a - \mu_\phi(s)\right)^T P_\phi(s) \left(a - \mu_\phi(s)\right) + V_\phi(s)$$

This is a neural network that outputs a quadratic in terms of the action (NOT in the state, so it’s still pretty powerful): the maximizing action is just $\mu_\phi(s)$ and $\max_a Q_\phi(s, a) = V_\phi(s)$. This method is NAF, or Normalized Advantage Functions.
This is efficient, but it’s less expressive.
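A sketch of the quadratic-in-the-action Q value computation (shapes are illustrative; in a real NAF head, `mu`, the Cholesky factor `L`, and `v` would each be outputs of the network, which keeps $P_\phi(s)$ positive semi-definite):

```python
import torch

def naf_q(a, mu, L, v):
    """Q(s, a) = V(s) - 0.5 (a - mu(s))^T P(s) (a - mu(s)).

    a, mu: (batch, act_dim); L: (batch, act_dim, act_dim) lower-triangular; v: (batch,)
    """
    P = L @ L.transpose(1, 2)                      # P = L L^T is positive semi-definite
    diff = (a - mu).unsqueeze(2)                   # (batch, act_dim, 1)
    quad = (diff.transpose(1, 2) @ P @ diff).squeeze(2).squeeze(1)
    return v - 0.5 * quad                          # argmax_a Q = mu, max_a Q = v
```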
Learn an approximate maximizer
We can train a second neural network that performs the maximization. One example is DDPG: essentially, you train $\mu_\theta(s)$ to approximate $\arg\max_a Q_\phi(s, a)$ by letting $-Q_\phi(s, \mu_\theta(s))$ be the loss, i.e. you optimize the second network to maximize $Q_\phi(s, \mu_\theta(s))$. This is sort of a deterministic actor-critic algorithm.
We compute the gradient using the chain rule:

$$\frac{dQ_\phi}{d\theta} = \frac{d\mu_\theta}{d\theta} \frac{dQ_\phi}{da}$$

where $\frac{d\mu_\theta}{d\theta}$ is a Jacobian.
DDPG algorithm (here, $\mu_\theta$ is the maximizer):
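Here is a hedged sketch of one DDPG update step with the usual target networks and Polyak averaging; the network interfaces, tuple layout, and hyperparameters are illustrative (the critic is assumed to be callable as `critic(s, a)`):

```python
import torch
import torch.nn as nn

def ddpg_update(batch, actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    s, a, r, s2, done = batch                                   # float tensors

    # Critic: regress Q(s, a) toward r + gamma * Q'(s', mu'(s'))
    with torch.no_grad():
        y = r + gamma * (1 - done) * critic_targ(s2, actor_targ(s2)).squeeze(-1)
    critic_loss = nn.functional.mse_loss(critic(s, a).squeeze(-1), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: maximize Q(s, mu(s)) by minimizing its negation; autograd applies
    # dQ/dtheta = (dmu/dtheta)(dQ/da) for us.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Slowly track the learned networks with the target networks (Polyak averaging)
    with torch.no_grad():
        for p, p_t in zip(critic.parameters(), critic_targ.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)
        for p, p_t in zip(actor.parameters(), actor_targ.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)
```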
Practical Considerations & Improvements
- use multi-step returns (although beware: this is no longer off-policy!)
- try with toy examples
- Q learning performs very differently with different environments
- large replay buffer can help (roughly 1 million)
- it takes a lot of time; you might see signs of life much later
- Bellman error gradients can be very large. If you have a very bad action, you don’t really care exactly how bad it is, but the regression task will care a lot, and your gradient can get swamped. Clip your gradients, or use a Huber loss (see the sketch after this list)
- double Q learning can really help
- schedule exploration and learning rate, as well as using things like Adam
- run multiple random seeds
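Two of the fixes above in torch-style code (a sketch; names and the clipping threshold are illustrative):

```python
import torch
import torch.nn as nn

def robust_q_loss(q_pred, target):
    # Huber / smooth-L1 loss: quadratic near zero, linear for large errors,
    # so one terrible action cannot swamp the gradient.
    return nn.functional.smooth_l1_loss(q_pred, target)

def clipped_step(optimizer, q_net, loss, max_norm=10.0):
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(q_net.parameters(), max_norm)   # cap the gradient norm
    optimizer.step()
```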
BONUS: Heavy Theoretical Analysis of Fitted Q iteration (CS 285)
Recall that fitted Q iteration is defined as the (exact) backup

$$Q_{k+1} \leftarrow \mathcal{T} Q_k, \qquad (\mathcal{T} Q)(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[\max_{a'} Q(s', a')\right]$$

We approximate this by posing it as an optimization problem over a function class $\mathcal{Q}$, using an empirical backup operator $\hat{\mathcal{T}}$:

$$Q_{k+1} \leftarrow \arg\min_{Q \in \mathcal{Q}} \left\| Q - \hat{\mathcal{T}} Q_k \right\|$$

There are two sources of error: the $\hat{\mathcal{T}}$ is an approximate Bellman operator (because it’s derived from approximate rewards and probabilities), and the minimization is approximate as well (the function class can’t represent everything). These are called sampling error and approximation error, respectively.
Concretely, the $\hat{r}(s, a)$ is the average reward seen at this state-action tuple, and $\hat{P}(s' \mid s, a)$ is the empirical transition distribution as derived from the observations:

$$(\hat{\mathcal{T}} Q)(s, a) = \hat{r}(s, a) + \gamma \mathbb{E}_{s' \sim \hat{P}(\cdot \mid s, a)}\left[\max_{a'} Q(s', a')\right]$$
What do we want to ask? Well, how close can our learned Q function get to the optimal $Q^*$?
Sampling error
We want to understand this difference:

$$\left\| \hat{\mathcal{T}} Q - \mathcal{T} Q \right\|_\infty$$

If we sub in the definitions, we get

$$\left\| \hat{r} + \gamma \hat{P} \max_{a'} Q - r - \gamma P \max_{a'} Q \right\|_\infty$$

By the triangle inequality, we get

$$\left\| \hat{\mathcal{T}} Q - \mathcal{T} Q \right\|_\infty \le \left\| \hat{r} - r \right\|_\infty + \gamma \left\| (\hat{P} - P) \max_{a'} Q \right\|_\infty$$

Now, the first term is the estimation error of an empirical mean of a bounded random variable, so we can use a concentration inequality (e.g. Hoeffding): with probability at least $1 - \delta$,

$$\left\| \hat{r} - r \right\|_\infty \le c_1 \sqrt{\frac{\log(1/\delta)}{N}}$$

The second term we can factorize, and then we can bound

$$\left| \left(\hat{P}(\cdot \mid s, a) - P(\cdot \mid s, a)\right)^\top \max_{a'} Q(\cdot, a') \right| \le \left\| \hat{P}(\cdot \mid s, a) - P(\cdot \mid s, a) \right\|_1 \left\| \max_{a'} Q \right\|_\infty$$

and this by definition is (twice) the total variation divergence of $\hat{P}$ and $P$, multiplied by the infinity norm of $\max_{a'} Q$. Using our derived concentration bounds for categorical distributions, we get

$$\left\| \hat{P}(\cdot \mid s, a) - P(\cdot \mid s, a) \right\|_1 \le c_2 \sqrt{\frac{\log(1/\delta)}{N}}$$

(where $c_2$ absorbs any dependence on the number of states). So, we get that, with high probability,

$$\left\| \hat{\mathcal{T}} Q - \mathcal{T} Q \right\|_\infty \le c_1 \sqrt{\frac{\log(1/\delta)}{N}} + \gamma c_2 \sqrt{\frac{\log(1/\delta)}{N}} \left\| \max_{a'} Q \right\|_\infty$$

i.e. the sampling error shrinks like $O(1/\sqrt{N})$ in the number of samples $N$ per state-action pair.
Approximation error
Why don’t we assume that, when we fit a function, our fit has an infinity-norm error of at most $\epsilon$:

$$\left\| Q_{k+1} - \mathcal{T} Q_k \right\|_\infty \le \epsilon$$

(for now, we assume that we have the exact backup operator $\mathcal{T}$ for a second)

Start from the distance to the optimum and add and subtract $\mathcal{T} Q_k$:

$$\left\| Q_{k+1} - Q^* \right\|_\infty = \left\| Q_{k+1} - \mathcal{T} Q_k + \mathcal{T} Q_k - Q^* \right\|_\infty$$

^ this is just an add-subtract trick. We can group the terms, apply the triangle inequality, and use the definitions above (together with $\mathcal{T} Q^* = Q^*$):

$$\left\| Q_{k+1} - Q^* \right\|_\infty \le \left\| Q_{k+1} - \mathcal{T} Q_k \right\|_\infty + \left\| \mathcal{T} Q_k - \mathcal{T} Q^* \right\|_\infty \le \epsilon + \gamma \left\| Q_k - Q^* \right\|_\infty$$
The Bellman backup is a contraction, which means that applying $\mathcal{T}$ to two Q functions causes them to be closer by a factor of $\gamma$; that’s where the last inequality comes from.
So, we’ve related the Q functions recursively. Unrolling over $K$ iterations:

$$\left\| Q_K - Q^* \right\|_\infty \le \gamma^K \left\| Q_0 - Q^* \right\|_\infty + \sum_{k=0}^{K-1} \gamma^k \epsilon \le \gamma^K \left\| Q_0 - Q^* \right\|_\infty + \frac{\epsilon}{1 - \gamma}$$
And at this point, we realize that the more iterations we take, the more we forget our initialization!
(We simplified the sum $\sum_k \gamma^k \epsilon_k$ by replacing each per-iteration error $\epsilon_k$ with the maximum $\epsilon = \max_k \epsilon_k$.) So, the approximation error scales with the effective horizon $\frac{1}{1 - \gamma}$ too.
Putting it together
We have our sampling error and approximation error. How do we combine them?
Well, let’s do the same add-subtract trick on the per-iteration error:

$$\left\| Q_{k+1} - \mathcal{T} Q_k \right\|_\infty \le \left\| Q_{k+1} - \hat{\mathcal{T}} Q_k \right\|_\infty + \left\| \hat{\mathcal{T}} Q_k - \mathcal{T} Q_k \right\|_\infty$$

The first term is the approximation error and the second term is the sampling error. So, the total per-iteration error is just these two errors added up.
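Plugging this back into the recursion from before (a quick sketch, assuming the per-iteration approximation and sampling errors are uniformly bounded by $\epsilon_{\text{approx}}$ and $\epsilon_{\text{sample}}$):

$$\left\| Q_{k+1} - Q^* \right\|_\infty \le \epsilon_{\text{approx}} + \epsilon_{\text{sample}} + \gamma \left\| Q_k - Q^* \right\|_\infty \quad \Longrightarrow \quad \left\| Q_K - Q^* \right\|_\infty \le \gamma^K \left\| Q_0 - Q^* \right\|_\infty + \frac{\epsilon_{\text{approx}} + \epsilon_{\text{sample}}}{1 - \gamma}$$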
This tells us that
- Error compounds with the horizon $\frac{1}{1 - \gamma}$
- Error compounds due to sampling