Linear Analysis of Value Function Methods

Tags: CS 234, CS 285, Value

Linear Analysis

💡
This is looking at a specific version of fitted RL that has some nice guarantees.

We can use a feature vector to represent some state $s$. As the most degenerate feature vector, we can use $\delta(s)$, which is a one-hot vector (more on this later).

With the feature vector (written here as $x(s)$), we can approximate the value linearly with some weight vector $w$:

$$\hat{V}(s; w) = x(s)^\top w$$

which means that with an MSE objective function

$$J(w) = \mathbb{E}_\pi\!\left[\big(V^\pi(s) - x(s)^\top w\big)^2\right],$$

the sampled gradient weight update (absorbing the constant factor into the step size $\alpha$) is

$$\Delta w = \alpha\,\big(V^\pi(s) - \hat{V}(s; w)\big)\, x(s).$$

In all of these, we treat the target as a SCALAR, not a function, even though it comes from the same model type! So $V^\pi(s)$ in the example above is treated as a fixed number; we do not propagate gradients into it.
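As a minimal sketch of this update (assuming NumPy; `vfa_update` is a made-up helper name, not anything from the course), the target enters as a plain float, so no gradient ever flows through it:

```python
import numpy as np

def vfa_update(w, x, target, alpha):
    """One stochastic-gradient step for a linear value function V_hat(s; w) = x(s) . w.

    `target` is a plain scalar (a Monte Carlo return, a TD target, ...),
    so nothing is ever differentiated through it.
    """
    v_hat = np.dot(x, w)           # current estimate V_hat(s; w)
    error = target - v_hat         # scalar prediction error
    return w + alpha * error * x   # move w along error * x (constant factor absorbed into alpha)
```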

But you don’t have $V^\pi$. So… we make do with an approximation.

Monte Carlo Value Function Approximation

We can just approximate $V^\pi$ with the return $G_t$ and get

$$\Delta w = \alpha\,\big(G_t - \hat{V}(s_t; w)\big)\, x(s_t).$$
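Using the hypothetical `vfa_update` sketch from above, the Monte Carlo version would look something like this (`states`, `returns`, `features`, and `alpha` are placeholder names):

```python
# After an episode ends, returns[t] holds the Monte Carlo return G_t from time t onward.
for t, s in enumerate(states):
    w = vfa_update(w, features(s), returns[t], alpha)
```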

Temporal Difference Value Function Approximation

We can also just use the TD(0) target as the $V^\pi$ approximation:

$$\Delta w = \alpha\,\big(r + \gamma \hat{V}(s'; w) - \hat{V}(s; w)\big)\, x(s).$$

This is a little weird, because we’re invoking the value function twice. The key observation is that we only take the derivative through $\hat{V}(s)$, not $\hat{V}(s')$ (a "semi-gradient" update), or else we run into issues.
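Continuing the same sketch, a TD(0) step would look something like this (again with placeholder names). The target is computed with the current $w$, but it is handed to the update as a plain number, so no gradient flows through $\hat{V}(s')$:

```python
# One TD(0) step for a sampled transition (s, r, s_next).
x_s, x_next = features(s), features(s_next)
td_target = r + gamma * np.dot(x_next, w)  # uses the current w, but is just a scalar
w = vfa_update(w, x_s, td_target, alpha)   # derivative only flows through V_hat(s; w)
```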

There are three approximations in TD learning

  1. Sampling (we use sampled transitions $s, a, r, s'$ rather than exact expectations)
  2. Bootstrapping (the TD part)
  3. Value function approximation (non-tabular)

Convergence of Linear Policy Evaluation

We know that the Bellman backup is a contraction, but that result assumes a perfectly accurate (exact) representation of the value function. In reality, if you’re not in the tabular realm, you can’t say much about $|V_\theta(s) - V^*(s)|$, because the optimization (fitting) step is itself another projection, and this projection in $L_2$ space can actually be an expansion in the infinity norm. So you have to be careful!
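To spell this out, here is a sketch in the usual projected-Bellman-operator notation (the symbols $T^\pi$, $\Pi$, $P^\pi$, and the state weighting $d$ are introduced here; $d$ is the stationary distribution defined below):

$$T^\pi V = r^\pi + \gamma P^\pi V, \qquad \|T^\pi V - T^\pi V'\|_\infty \le \gamma\, \|V - V'\|_\infty$$

$$\Pi V = \arg\min_{\hat V \in \operatorname{span}\{x\}} \sum_s d(s)\,\big(\hat V(s) - V(s)\big)^2$$

$\Pi$ is a non-expansion in the $d$-weighted $L_2$ norm but can expand distances in the $\infty$-norm, so the composed operator $\Pi T^\pi$ (which is what fitted policy evaluation actually iterates) is not guaranteed to be a sup-norm contraction.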

Finite state distribution

Define $\mu(s)$ as the probability of visiting state $s$ under policy $\pi$. This is for a finite-horizon task. Note that the definition doesn’t use the Markov property anywhere, so it also works for Monte Carlo.

MSVE-$\mu$

We define the mean-squared value error MSVE as

$$\mathrm{MSVE}_\mu(w) = \sum_s \mu(s)\,\big(V^\pi(s) - \hat{V}(s; w)\big)^2.$$

For Monte Carlo policy evaluation, the weights converge to the minimum MSVE:

$$\mathrm{MSVE}_\mu(w_{\mathrm{MC}}) = \min_w \sum_s \mu(s)\,\big(V^\pi(s) - \hat{V}(s; w)\big)^2.$$

Stationary distribution

A stationary distribution $d(s)$ is defined as the long-run distribution of states under $\pi$. This is a property of any Markov chain: if you set off a swarm of robots in the chain, eventually their populations stabilize. That limiting population is $d(s)$.

Because of the Markov chain structure (the Markov assumption), the stationary distribution satisfies this balance equation:

$$d(s') = \sum_s d(s) \sum_a \pi(a \mid s)\, p(s' \mid s, a)$$

which sort of makes sense: we’re saying that the probability of being in $s'$ equals the expected inflow from all of its predecessor states.
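As a small sanity check of the balance equation, here is a sketch (assuming NumPy; `P_pi` is a made-up two-state transition matrix) that finds $d$ by repeatedly pushing a population of "robots" through the chain:

```python
import numpy as np

# State-to-state transition matrix induced by the policy:
# P_pi[s, s'] = sum_a pi(a|s) * p(s'|s, a).  Made-up two-state example.
P_pi = np.array([[0.9, 0.1],
                 [0.5, 0.5]])

d = np.array([1.0, 0.0])         # start all robots in state 0
for _ in range(1000):            # push the population through the chain
    d = d @ P_pi                 # balance equation: d'(s') = sum_s d(s) P_pi[s, s']
print(d)                         # converges to the stationary distribution
assert np.allclose(d, d @ P_pi)  # d is a fixed point of the balance equation
```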

MSVE-d

The MSVE-$d$ is the same as MSVE-$\mu$ except that we use this stationary distribution as the weighting.

For TD policy evaluation, it converges to a scaled version of the minimum MSVE-$d$, as follows:

$$\mathrm{MSVE}_d(w_{\mathrm{TD}}) \le \frac{1}{1 - \gamma}\, \min_w \sum_s d(s)\,\big(V^\pi(s) - \hat{V}(s; w)\big)^2.$$

The larger $\gamma$ is, the worse the upper bound: e.g., $\gamma = 0.9$ gives a factor of $10$, while $\gamma = 0.99$ gives a factor of $100$.

This is only really relevant in the function approximation case. In the tabular case, the MSVE goes to zero, because the representation is exact and the contraction inequalities apply.

Linear Control

Control using function approximation is basically just approximating $Q^\pi$ and using the same policy evaluation + policy improvement approach.

This is unfortunately unstable. We have three things present (at least in Q-learning)

  1. Function approximation
  2. Bootstrapping
  3. Off-policy learning

These three form the deadly triad and can sometimes yield bad results (even divergence).

Functional representation

Just like before, you can represent the state-action pair with some feature vector $x(s, a)$ and then let $Q$ be the inner product:

$$\hat{Q}(s, a; w) = x(s, a)^\top w$$
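A minimal sketch of semi-gradient Q-learning with this linear representation, assuming NumPy and placeholder names (`featurize`, `actions`, and the hyperparameters are made up for illustration):

```python
import numpy as np

def q_hat(w, x_sa):
    """Linear action-value estimate: Q_hat(s, a; w) = x(s, a) . w."""
    return np.dot(x_sa, w)

def q_learning_step(w, s, a, r, s_next, actions, featurize, alpha, gamma):
    """One semi-gradient Q-learning update on a sampled transition (s, a, r, s_next)."""
    # Off-policy, bootstrapped target: greedy over next actions, treated as a plain scalar
    # (terminal-state handling omitted for brevity).
    target = r + gamma * max(q_hat(w, featurize(s_next, a2)) for a2 in actions)
    x_sa = featurize(s, a)
    error = target - q_hat(w, x_sa)   # gradient flows only through Q_hat(s, a; w)
    return w + alpha * error * x_sa
```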

Great table
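The table itself did not survive the export. As a stand-in, here is the standard convergence-of-control summary (a reconstruction, not necessarily the exact table from the slides):

| Algorithm | Tabular | Linear VFA | Nonlinear VFA |
| --- | --- | --- | --- |
| Monte Carlo control | converges | chatters (convergence is active work) | no guarantee |
| SARSA | converges | chatters | no guarantee |
| Q-learning | converges | no guarantee | no guarantee |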

“Chatter” means that SARSA will converge to within a narrow window: it oscillates near a solution rather than settling on one. Whether Monte Carlo control with a linear approximator converges is still active work.