Value Function Methods

Tags: CS 234, CS 285, Value · Reviewed 2024

Moving from AC

In actor-critic, we used a policy to act and a critic to evaluate and improve that policy. But can we do without the actor? This insight gives rise to a family of value-based methods.

Note that these will sound very familiar because we’ve talked about them in the Model-Free Control section. However, that section posed many different methods in a tabular environment. We will now turn to practical methods that work in real environments and deep networks. You will find that the ideas are very similar.

Policy Iteration

This is the first value function method we will talk about. The idea is very simple

  1. Fit $A^\pi(s, a)$ using your preferred method (like TD). Typically this is done through a Monte Carlo estimate of $r(s, a) + \gamma E[V^\pi(s')] - V^\pi(s)$
  2. Set $\pi$ as the delta function, where $\pi(a \mid s) = 1$ if $a = \arg\max_a A^\pi(s, a)$

Practically speaking, you only need to fit $Q(s, a)$, because $A = Q - V$ and $V$ doesn't depend on $a$.

Connection to tabular, model-based methods

Recall that we had policy iteration in the tabular setting, where we alternated between learning a V function through Bellman backups and deriving a policy from the Q function (which is in turn derived from the V function). So we are just doing the same thing, except that everything is approximate.

Value Iteration

Policy iteration can work, but we are still representing a policy explicitly. If we know that the policy is just the argmax, we can forgo the explicit policy completely and arrive at the value iteration algorithm:

  1. Fit $Q(s, a) = r(s, a) + \gamma E[V(s')]$. Approximate this expectation with a sample
  2. Fit $V(s) = \max_a Q(s, a)$

Note that the first step is evaluation, and the second step is improvement. The implicit policy of $V(s)$ is better than the implicit policy behind $Q(s, a)$.

Derivation of Q-learning

We can do steps 1 and 2 in a single optimization problem: define the target $y = r(s, a) + \gamma \max_{a'} Q(s', a')$ and then fit $Q(s, a)$ to $y$. We call this Q-learning.

Connection to tabular, model-based methods

As we saw in the tabular notes, value iteration is different because of its implicit policy representation. There are some subtle differences in behavior, as outlined in those notes. But overall, we are just re-establishing a method that we first talked about in the tabular setting.

Dealing with the maximization

In discrete land, you can just ask the Q function for all the actions and pick the best. In continuous land, this is harder because you need to solve an inner maximization over actions. In this case, an actor-critic algorithm might work better (although the tradeoff is that AC is generally on-policy).

Why fit V or Q?

The cautious reader may notice that our version of value iteration is different from the one in the tabular notes. Instead of fitting V, we fit Q and then maximize. This raises the question: why choose to fit Q or V?

It turns out, there’s one critical difference. Consider policy evaluation for V:
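Roughly, the backup looks like

$$V^\pi(s) \leftarrow r(s, \pi(s)) + \gamma\, E_{s' \sim p(\cdot \mid s, \pi(s))}\left[V^\pi(s')\right]$$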

Note how the update requires you to have actions from $\pi(s)$, and this isn't good if you're using offline data of the form $(s, a, r, s')$. But now consider policy evaluation for Q:
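Roughly,

$$Q^\pi(s, a) \leftarrow r(s, a) + \gamma\, E_{s' \sim p(\cdot \mid s, a)}\left[Q^\pi(s', \pi(s'))\right]$$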

Do you see how it doesn't matter what distribution $a$ is sampled from? It means that you can apply this algorithm to any interaction data and yield $Q^\pi$.

Therefore, if you want your algorithm to be off-policy, you would want to fit a Q function, not a V function.

The General Fitted Algorithms ⭐

For Monte Carlo, you use $G_t$ as the target

For SARSA, you use the TD target

and for Q-learning, you use the max-Q target (known as fitted Q iteration).
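Spelled out (roughly, for a sampled transition $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$), the three targets are:

$$\begin{aligned}
\text{Monte Carlo:} \quad & y_t = G_t = \textstyle\sum_{k \ge t} \gamma^{k-t} r_k \\
\text{SARSA:} \quad & y_t = r_t + \gamma\, Q(s_{t+1}, a_{t+1}) \\
\text{Q-learning:} \quad & y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a')
\end{aligned}$$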

Why Fitted convergence is hard

Essentially, when you are doing function approximation, you’re constrained to a subspace.

The bellman backup is a guaranteed contraction, but this backup may take you out of this subspace (see diagram).

Therefore, you have to project it back to the subspace with $\Pi$, and you can no longer say anything about this being a contraction on your subspace. You might actually be going further away.

Deep Q learning (DQN) 💻

This is just Q-learning with neural networks and a slowly-updating target network modification.

This isn’t stable out of the box though. We might get some divergence. There are two primary problems.

In the next part, we'll talk about how we can resolve these issues.

Experience replay

The first problem is sample correlation. You ideally want IID samples from your policy, but your policy is rolling out in the environment sequentially. So you're going to get highly correlated samples, which can lead to sequential overfitting.

To solve this problem, just use a replay buffer! Q-learning is off-policy so we can learn from past experiences.

Generally, the more memory the better, although if your dynamics are changing, you might want a smaller memory.
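A minimal replay buffer sketch in Python (the class name, capacity, and batch size are just illustrative, not from any particular codebase):

```python
# A minimal replay buffer sketch: store transitions, sample uniformly.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions fall off automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the temporal correlation of rollouts
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```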

Fixed Q targets

The second problem is that your target Q value changes with your updates, because the target and the prediction share the same parameters (even if you don't backpropagate directly into the target). This can lead to a dog-chasing-its-tail sort of problem.

To solve this, just freeze the target Q network and update it only once every $k$ updates of the actual Q network. However, this doesn't necessarily solve another problem that arises with this framework, known as maximization bias. In the next part, we will talk about this.
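Putting the two fixes together, a rough sketch of one DQN update step in PyTorch might look like this (the network sizes, sync period, and batch format are assumptions for illustration, with batches coming from a buffer like the one above):

```python
# Sketch of a DQN update with a frozen target network.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)            # frozen copy used only for targets
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(batch, step, target_sync_every=1000):
    s, a, r, s_next, done = batch            # tensors: (B, obs_dim), (B,), (B,), (B, obs_dim), (B,)
    with torch.no_grad():                    # no gradients flow into the target
        max_next_q = target_net(s_next).max(dim=1).values
        y = r + gamma * (1.0 - done) * max_next_q
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, y)               # regress Q(s, a) onto the frozen target y
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % target_sync_every == 0:        # periodically copy weights into the target
        target_net.load_state_dict(q_net.state_dict())
```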

Maximization Bias and How to Solve

What is maximization bias?

Well, intuitively, you're trying to maximize the $Q$ function, but this maximization has some potentially harmful pitfalls. We'll look at offline RL later, where this is a big problem, but essentially, when you maximize the Q function, you may accidentally select a value that is high only because of estimation noise.

More formally, consider a toy example where we only have two actions and they both have zero expected reward. Because of Jensen's inequality, you have:
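$$E\big[\max\big(\hat Q(a_1),\, \hat Q(a_2)\big)\big] \;\ge\; \max\big(E[\hat Q(a_1)],\, E[\hat Q(a_2)]\big) = 0$$

where $\hat Q$ denotes the (noisy) estimated values.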

So this estimate $V^{\hat \pi}$ may be greater than $0$, even though the true value is $V^\pi = 0$.
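A quick (made-up) numerical check of this: estimate two zero-mean action values from a few noisy samples each, and look at the max of the estimates. The noise scale and sample count here are arbitrary.

```python
# Toy check of maximization bias: two actions with true value 0,
# but noisy estimates make the max of the estimates positive on average.
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_samples = 10_000, 5

# Each trial: estimate Q(a1) and Q(a2) from a few noisy zero-mean samples
q_hat = rng.normal(0.0, 1.0, size=(n_trials, 2, n_samples)).mean(axis=2)

print(q_hat.mean(axis=0))        # each estimate is unbiased, ~[0, 0]
print(q_hat.max(axis=1).mean())  # but E[max of estimates] > 0 = max of true values
```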

How does Maximization Bias affect learning?

Maximization bias can slow things down, and with function approximation, it can lead to suboptimality. However, in the tabular setting, GLIE is enough for convergence.

Double Q learning

One idea: use one Q function to select the max action, and use another to estimate the value of this maximal action. Intuitively, we want to avoid using the max of the estimates as an estimate of the max of the true values. By decoupling selection and evaluation, we move in the right direction.
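In the tabular version, with two estimates $Q_A$ and $Q_B$, one of the updates looks roughly like

$$Q_A(s, a) \leftarrow Q_A(s, a) + \alpha\left[ r + \gamma\, Q_B\!\left(s', \arg\max_{a'} Q_A(s', a')\right) - Q_A(s, a) \right]$$

(and symmetrically for $Q_B$, choosing which one to update at random).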

Double DQN

With DQN, there's a lagging target network and the current network. Standard DQN both selects the maximizing action and evaluates it with the target network. Can we use the same idea from Double Q-learning here?

We can! Use the current network to select the best action, evaluate it with the target network, and update your current network.
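In other words, writing $\phi$ for the current network and $\phi^-$ for the target network, the target becomes

$$y = r + \gamma\, Q_{\phi^-}\!\left(s', \arg\max_{a'} Q_{\phi}(s', a')\right)$$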

With continuous actions

We can extend Q-learning to continuous action spaces. The main problem is that the $\arg\max$ doesn't work very well over continuous action spaces. We have three choices:

Approximating the maximum with classic approaches

We can approximate the maximum over the continuous space as the maximum over a set of discretized values. The problem is that this is not very accurate.

We can use the cross-entropy method (see later lectures). It's basically rounds of iterated sampling, and it's pretty fast. These approaches work OK up to roughly 40-dimensional action spaces.

Fitting a simple Q function

We can also use a function class that is easy to optimize, like a quadratic in the action:
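$$Q_\theta(s, a) = -\tfrac{1}{2}\,(a - \mu_\theta(s))^\top P_\theta(s)\,(a - \mu_\theta(s)) + V_\theta(s)$$

so (with $P_\theta(s)$ constrained to be positive definite) the maximizing action is just $\mu_\theta(s)$ and $\max_a Q_\theta(s, a) = V_\theta(s)$.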

This is a neural network that gives a quadratic in terms of the action (NOT in the state, so it's still pretty powerful). This method is NAF, or Normalized Advantage Functions.

This is efficient but it’s less expressive

Learn an approximate maximizer

We can train a second neural network that performs the optimization. One example is DDPG: essentially, you treat $-Q$ as the loss and train the second neural network to maximize this $Q$. This is sort of a deterministic actor-critic algorithm.

We solve this using the chain rule:

$$\frac{dQ}{d\theta} = \frac{da}{d\theta}\frac{dQ}{da}$$

where $da/d\theta$ is a Jacobian.
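A minimal sketch of this actor update in PyTorch (the architectures, dimensions, and hyperparameters are placeholders, not the actual DDPG setup):

```python
# Sketch of a DDPG-style actor update: maximize Q(s, mu(s)) w.r.t. the actor.
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2

# Hypothetical critic Q_phi(s, a) and deterministic actor mu_theta(s)
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(32, obs_dim)  # stand-in for a sampled batch of states

# Treat -Q(s, mu(s)) as the actor loss; autograd applies the chain rule
# dQ/dtheta = (da/dtheta)(dQ/da) for us.
actor_loss = -critic(torch.cat([states, actor(states)], dim=-1)).mean()
actor_opt.zero_grad()
actor_loss.backward()   # gradients flow through the action into the actor
actor_opt.step()
```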

Practical Considerations & Improvements

BONUS: Heavy Theoretical Analysis of Fitted Q iteration (CS 285)

Recall that fitted Q iteration is defined as
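Roughly, the exact version repeatedly applies the Bellman optimality backup:

$$Q_{k+1} \leftarrow T Q_k, \qquad (T Q)(s, a) = r(s, a) + \gamma\, E_{s' \sim P(\cdot \mid s, a)}\left[\max_{a'} Q(s', a')\right]$$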

We approximate this by doing it as an optimization problem
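i.e., something like

$$Q_{k+1} \leftarrow \arg\min_{Q \in \mathcal{Q}} \left\| Q - \hat T Q_k \right\|, \qquad (\hat T Q)(s, a) = \hat r(s, a) + \gamma\, E_{s' \sim \hat P(\cdot \mid s, a)}\left[\max_{a'} Q(s', a')\right]$$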

There are two sources of error: $\hat T$ is an approximate Bellman operator (because it's derived from estimated rewards and transition probabilities), and the minimization is approximate as well. These are called sampling error and approximation error, respectively.

The $\hat{r}$ is the average reward seen at this state-action tuple, and $\hat{P}$ is the empirical transition distribution derived from observations.

What do we want to ask? Well, how close can we get our learned Q function to the true one?

Sampling error

We want to understand this difference
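Concretely, this is (roughly) the quantity

$$\left\| \hat T Q - T Q \right\|_\infty$$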

If we substitute the definitions and apply the triangle inequality, we get:
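$$\left\| \hat T Q - T Q \right\|_\infty \le \left\| \hat r - r \right\|_\infty + \gamma \max_{s, a} \Big| \textstyle\sum_{s'} \big(\hat P(s' \mid s, a) - P(s' \mid s, a)\big) \max_{a'} Q(s', a') \Big|$$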

Now, the first term is the estimation error of a bounded random variable (the empirical mean reward), so we can bound it with a concentration inequality

The second value we can factorize, and then we can bound

and this by definition is the total variation divergence between $\hat P$ and $P$ multiplied by the infinity norm of $Q$, which gets us

and using our derived bounds for categorical distributions, we get

So, we get that

Approximation error

Why don't we assume that when we fit a function, our fit has an infinity-norm error of at most $\epsilon_k$:
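$$\left\| Q_{k+1} - T Q_k \right\|_\infty \le \epsilon_k$$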

(for now, we assume that we have an exact backup operator for a second)
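Then, to relate $Q_{k+1}$ to $Q^*$, write (roughly):

$$\left\| Q_{k+1} - Q^* \right\|_\infty = \left\| Q_{k+1} - T Q_k + T Q_k - Q^* \right\|_\infty$$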

This is just an add-subtract trick. We can group terms, apply the triangle inequality, and use our definitions:
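$$\left\| Q_{k+1} - Q^* \right\|_\infty \le \left\| Q_{k+1} - T Q_k \right\|_\infty + \left\| T Q_k - T Q^* \right\|_\infty \le \epsilon_k + \left\| T Q_k - T Q^* \right\|_\infty$$

(using the fact that $Q^* = T Q^*$).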

The Bellman backup is a contraction, which means that applying $T$ to two Q functions brings them closer by a factor of $\gamma < 1$:
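$$\left\| T Q_k - T Q^* \right\|_\infty \le \gamma \left\| Q_k - Q^* \right\|_\infty$$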

So, we’ve related the Q functions recursively:
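$$\left\| Q_{k+1} - Q^* \right\|_\infty \le \epsilon_k + \gamma \left\| Q_k - Q^* \right\|_\infty$$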

And at this point, we realize that the more iterations we take, the more we forget our initialization!
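Unrolling the recursion over $K$ iterations gives, roughly:

$$\left\| Q_K - Q^* \right\|_\infty \le \gamma^K \left\| Q_0 - Q^* \right\|_\infty + \sum_{k=0}^{K-1} \gamma^{K-1-k} \epsilon_k \le \gamma^K \left\| Q_0 - Q^* \right\|_\infty + \frac{\max_k \epsilon_k}{1 - \gamma}$$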

(We simplified the $\epsilon_k$ by replacing them with the maximum $\epsilon$.) So the approximation error scales with the horizon too.

Putting it together

We have our sampling error and approximation error. How do we combine them?

Well, let's do the same add-subtract trick:
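$$\left\| Q_{k+1} - T Q_k \right\|_\infty \le \left\| Q_{k+1} - \hat T Q_k \right\|_\infty + \left\| \hat T Q_k - T Q_k \right\|_\infty$$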

The first term is approximation error and the second term is the sampling error. So, the total error is just these two errors added up

This tells us that

  • Error compounds with the horizon
  • Error compounds due to sampling