Model-Based Tabular Control

Tags: CS 234, Tabular, Value

Ways of understanding V and Q

Q is used for control, V is used for evaluation. They have an intimate connection with each other.

Policy Iteration in an MDP 💻

Policy iteration is the gradual, systematic construction of a good policy. We do this with a two-step process:

  1. make a good policy evaluation of $\pi_i$
  1. make a policy improvement based on this policy evaluation, yielding $\pi_{i+1}$.

V function (step 1)

Through the Bellman backup, you can compute a perfectly accurate $V^{\pi_i}$. By default, this is the infinite-horizon value of the policy.

But how do you use it?

Q function (step 2)

How do we make a good policy based on policy evaluation? Well, we previously wanted to take an argmax over $V^\pi$, but that was impractical. To help, we need to define a new function in terms of the $V$ function we got in step (1). We call this the Q function:

$$Q^{\pi_i}(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^{\pi_i}(s')$$

You might recognize this as the inside of the expectation bracket of the bellman backup. In other words,

$$V^{\pi}(s) = \mathbb{E}_{a\sim \pi(a \mid s)}\left[Q^{\pi}(s, a)\right]$$

so as much as we are building $Q$ from $V$, we can just as readily build $V$ from $Q$. They are inherently interconnected.

Intuitively, the Q function means “take action $a$, and then follow $\pi$ forever”.

Now, we propose to let the new policy follow this rule:

$$\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s, a)$$

and this is the policy improvement step. Now, we will show some critical things about this approach.
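
To make the two-step loop concrete, here is a minimal sketch in Python, assuming a known tabular model given as a hypothetical transition tensor `P` of shape `(S, A, S)`, a reward table `R` of shape `(S, A)`, and a discount `gamma` (these names are illustrative, not from the notes):

```python
import numpy as np

def policy_evaluation(P, R, gamma, policy):
    """Step 1: exact evaluation of a deterministic policy.

    Solves V = r_pi + gamma * P_pi V, which is what repeatedly applying
    the Bellman backup converges to in the tabular case."""
    S = R.shape[0]
    P_pi = P[np.arange(S), policy]                 # (S, S): transitions under pi
    r_pi = R[np.arange(S), policy]                 # (S,):   rewards under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def policy_improvement(P, R, gamma, V):
    """Step 2: greedy improvement, pi_{i+1}(s) = argmax_a Q^{pi_i}(s, a)."""
    Q = R + gamma * P @ V                          # (S, A) table of Q values
    return Q.argmax(axis=1)

def policy_iteration(P, R, gamma):
    S, A = R.shape
    policy = np.zeros(S, dtype=int)                # arbitrary starting policy
    while True:
        V = policy_evaluation(P, R, gamma, policy)
        new_policy = policy_improvement(P, R, gamma, V)
        if np.array_equal(new_policy, policy):     # policy stopped changing => done
            return policy, V
        policy = new_policy
```

The stopping test is justified by the results below: improvement is monotonic and there are only finitely many deterministic policies, so the loop terminates.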

Monotonic Improvement of policy 🚀

We define a monotonic improvement of the policy as follows: $\pi_{i+1}$ is a monotonic improvement over $\pi_i$ if

$$V^{\pi_{i+1}}(s) \geq V^{\pi_i}(s) \quad \text{for all states } s.$$

Here, again, because we are dealing with a tabular environment, we assume that every $V^{\pi_i}$ used in the comparison is fully converged (exact).

We will show that the policy improves monotonically if we use the policy iteration algorithm defined previously.
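
A sketch of the standard argument, using the greedy rule $\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s, a)$ defined above:

$$
\begin{aligned}
V^{\pi_i}(s) &\leq \max_a Q^{\pi_i}(s, a) = Q^{\pi_i}\big(s, \pi_{i+1}(s)\big) \\
&= r\big(s, \pi_{i+1}(s)\big) + \gamma \sum_{s'} p\big(s' \mid s, \pi_{i+1}(s)\big)\, V^{\pi_i}(s') \\
&\leq r\big(s, \pi_{i+1}(s)\big) + \gamma \sum_{s'} p\big(s' \mid s, \pi_{i+1}(s)\big)\, Q^{\pi_i}\big(s', \pi_{i+1}(s')\big) \\
&\leq \cdots \leq V^{\pi_{i+1}}(s).
\end{aligned}
$$

Each step swaps one more early action from $\pi_i$ to $\pi_{i+1}$; in the limit we are following $\pi_{i+1}$ forever, which is exactly $V^{\pi_{i+1}}(s)$.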

Key results from monotonicity 🚀

If the policy stops changing, then it can’t change again. We can show this inductively. Suppose we had $\pi_{i-1} = \pi_i$. Then, we know that $Q^{\pi_{i-1}} = Q^{\pi_{i}}$ (so in particular their argmaxes match). Therefore, if we define $\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s, a)$, we know that

$$\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s, a) = \arg\max_a Q^{\pi_{i-1}}(s, a) = \pi_{i}(s)$$

So, if $\pi_{i-1} = \pi_i$, then $\pi_{i+1} = \pi_i$, and the rest inductively follows.

There is a maximum number of iterations of policy iteration: because policy improvement is monotonic and because there is only a finite number of deterministic policies (at most $|A|^{|S|}$), you will eventually converge on the best.

Bellman Operator 🐧

We define the Bellman operator $B$ as a function that runs a Bellman backup on a value function. When we talk about value iteration (below), the Bellman operator is defined as

$$BV(s) = \max_a \left( r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V(s') \right)$$

When we talk about policy evaluation in our previous setup, we define the Bellman operator as

$$B^\pi V(s) = \mathbb{E}_{a \sim \pi(a \mid s)}\left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V(s') \right]$$

which means that policy evaluation (above) can be seen as applying this operator repeatedly until things stop changing.

In the next section, we will show some neat things about this operator. Essentially, $V^\pi$ is a fixed point of the operator.
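
As a minimal sketch (same hypothetical layout as before: `P` of shape `(S, A, S)`, `R` of shape `(S, A)`, discount `gamma`), the two operators look like:

```python
import numpy as np

def bellman_optimality_backup(V, P, R, gamma):
    """The value-iteration operator B: (BV)(s) = max_a [ r(s,a) + gamma * E[V(s')] ]."""
    Q = R + gamma * P @ V                 # (S, A)
    return Q.max(axis=1)                  # (S,)

def bellman_policy_backup(V, P, R, gamma, policy):
    """The policy-evaluation operator B^pi for a deterministic policy."""
    S = len(V)
    Q = R + gamma * P @ V
    return Q[np.arange(S), policy]

def evaluate_by_iteration(P, R, gamma, policy, tol=1e-10):
    """Policy evaluation as 'apply B^pi until things stop changing'."""
    V = np.zeros(R.shape[0])
    while True:
        V_next = bellman_policy_backup(V, P, R, gamma, policy)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next                 # approximately the fixed point V^pi
        V = V_next
```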

Contraction 🐧

We define a contraction as some operator $O$ and some norm $\|\cdot\|$ such that $\|OV - OV'\| \leq \alpha \|V - V'\|$ for some $\alpha < 1$.

Intuitively, if the norm were L2, repeatedly applying this operator to any two vectors shrinks them toward a common point. We want to show (below) that the Bellman operator is a contraction in the infinity norm.

Extra: Asynchronous Bellman

What if you updated the value function one element at a time? Well, you can’t say that this one-element update is a contraction anymore. You can say that it’s not an expansion, i.e. $\|B_s V' - B_s V''\| \leq \|V' - V''\|$. You also know that for the affected state, it is a $\gamma$-contraction.

Putting this together, you can claim that applying $|S|$ of these single-state updates in place, one per state (i.e. changing the values one by one in a full sweep), will yield a contraction.
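
A minimal sketch of one such in-place sweep (again with a hypothetical `P` of shape `(S, A, S)` and `R` of shape `(S, A)`):

```python
import numpy as np

def in_place_sweep(V, P, R, gamma):
    """One asynchronous sweep: apply the single-state backup B_s to every state,
    in place, immediately reusing the freshest values (Gauss-Seidel style)."""
    for s in range(len(V)):
        V[s] = np.max(R[s] + gamma * P[s] @ V)   # only V[s] changes in this step
    return V
```

Each individual assignment is only a non-expansion, but after the full sweep every coordinate has been pushed through a $\gamma$-contraction, so the sweep as a whole contracts.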

Value iteration in an MDP 💻

You can think of policy iteration as a slow dance between the policy (implicitly defined in terms of $Q$) and the value function $V$. Again, it’s a dance because you make a future policy from a past policy evaluation. You can think of it as bootstrapping.

But in this process, you have a policy that slowly improves. In value iteration, you ditch the policy altogether. It just doesn’t exist. You immediately try to compute $V^*$, which is the value function of the optimal policy.

The update is

$$V_{k+1}(s) = \max_a \left( r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V_k(s') \right),$$

or $V_{k+1} = BV_k$.

To extract the optimal policy, you just need to act greedily with respect to the final $V$:

$$\pi(s) = \arg\max_a \left( r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V(s') \right)$$
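
A minimal sketch of the whole loop in Python (assuming, as before, a hypothetical transition tensor `P` of shape `(S, A, S)`, rewards `R` of shape `(S, A)`, and discount `gamma`):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V <- BV with no explicit policy; extract the greedy policy at the end."""
    S, A = R.shape
    V = np.zeros(S)                                  # V_0 = 0 (see the planner view below)
    while True:
        V_next = (R + gamma * P @ V).max(axis=1)     # V_{k+1} = B V_k
        if np.max(np.abs(V_next - V)) < tol:
            break
        V = V_next
    policy = (R + gamma * P @ V_next).argmax(axis=1) # greedy extraction at the end
    return V_next, policy
```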

Value iteration vs policy iteration

Now, this is a common pitfall: you might think that value iteration is just policy iteration in disguise. After all, you say, isn’t the right hand side just $\max_a Q(s, a)$? Isn’t this just smushing the two steps together?

Here’s the slight but important difference

  1. Policy iteration will compute the infinite-horizon value of a policy and then improve the policy through that evaluation. This works because we always have a policy; even though it’s implicit, it’s still a policy.
    1. we showed that policy iteration yields a monotonically improving policy, which is enough
    1. this is related to policy gradients, which we will cover later
  1. Value iteration never has a policy. You can imagine it as starting from an end state where $V(s') = 0$, and then greedily maximizing $R(s, a)$. This gets us the $V$ for the optimal policy acting on the state $s$ before the end state. Then, imagine moving backwards to the state before $s$. The value iteration step will make the $V$ for the optimal policy acting on the state before $s$, and so on and so forth.
    1. as another way of understanding it: in policy iteration there’s a set target (the current policy’s value). In value iteration, you’re kinda chasing your own tail a little bit. We actually need to show that the Bellman backup is a contraction.

Here’s another way of understanding the difference: with policy iteration, if you pull a $V^\pi$ out of the algorithm prematurely, you can say very little about it. In contrast, for value iteration, if you pull a $V$ prematurely, you can say that it is the optimal value function for $k$ steps, with $k$ being the number of times it’s been run so far.

The tl;dr is that policy iteration is a story of bootstrapping, while value iteration is a story of planning. If you run it long enough, the $V$ will be optimal at any state. Namely, if the environment is ergodic and you need $k$ steps to reach anywhere, then you only need to run value iteration $k$ times.

Will value iteration converge? 🚀

Well, we seek to show that this is true (in the infinity norm):

$$\|BV - BV'\|_\infty \leq \gamma \|V - V'\|_\infty$$
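
A sketch of the standard argument, using the fact that $\left|\max_a f(a) - \max_a g(a)\right| \leq \max_a |f(a) - g(a)|$: for any state $s$,

$$
\begin{aligned}
|BV(s) - BV'(s)|
&\leq \max_a \left| \gamma \sum_{s'} p(s' \mid s, a)\big(V(s') - V'(s')\big) \right| \\
&\leq \gamma \max_a \sum_{s'} p(s' \mid s, a)\, \|V - V'\|_\infty
= \gamma \|V - V'\|_\infty .
\end{aligned}
$$

Taking the max over $s$ gives the claim, so each application of $B$ brings any two value functions (and in particular $V_k$ and $V^*$) closer by a factor of $\gamma$.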

Is the convergence unique? 🚀

Here, we show the uniqueness of the fixed point
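
A sketch of the usual argument: suppose $V_1$ and $V_2$ are both fixed points of $B$. Then, using the contraction result above,

$$
\|V_1 - V_2\|_\infty = \|BV_1 - BV_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty
\quad\Rightarrow\quad (1 - \gamma)\,\|V_1 - V_2\|_\infty \leq 0 ,
$$

so for $\gamma < 1$ we must have $V_1 = V_2$: the fixed point, and hence what value iteration converges to, is unique regardless of initialization.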

Is the number of value iteration steps bounded? 🚀

We saw with policy iteration that we can only iterate for at most $|A|^{|S|}$ steps before we get convergence. That was because we computed the value function $V^\pi$ to convergence every time. In contrast, with value iteration, we sort of slowly approximate $V^*$. So does the same limit still apply?

It turns out that no, there is no such limit anymore. Intuitively, the value function might “drag” behind and take a long time to converge, even if the greedy policy extracted from it is correct.

Additional questions to consider

  1. Is value iteration going to yield a unique solution?
  1. Does initialization matter?

Value iteration as finite horizon planner ⭐

Here, we re-explore the thing we just talked about: value iteration can be seen as a way of expanding our planning horizon by one step every time we run the algorithm.

Let $V_k$ be the optimal value if making $k$ more decisions. Then, let’s initialize $V_0(s) = 0$ for all states. Now, we run value iteration:

$$V_{k+1}(s) = \max_a \left( r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V_k(s') \right)$$

Now, you see how we can integrate one more time step into $V_{k + 1}$, because we consider this last $V_k$ on top of the current reward. So there’s this branching out of states. It’s kinda cool!

Value iteration may not have converged after $k$ timesteps; it just means that information is still propagating. However, we can say that after $k$ iterations, value iteration will yield a policy that is optimal within $k$ steps of the goal, because we’ve propagated the value backwards from the goal by $k$ steps, and that information is already present. It’s just that in infinite-horizon settings, the value itself may converge slowly.
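
A tiny illustration of this backward propagation, on a hypothetical 5-state chain where the only reward is for stepping into the rightmost (absorbing) state:

```python
import numpy as np

# Chain MDP: states 0..4, actions 0 = left, 1 = right, deterministic moves.
S, A, gamma = 5, 2, 0.9
P = np.zeros((S, A, S))
R = np.zeros((S, A))
for s in range(S - 1):
    P[s, 0, max(s - 1, 0)] = 1.0      # left (bumps into the wall at state 0)
    P[s, 1, s + 1] = 1.0              # right
P[S - 1, :, S - 1] = 1.0              # rightmost state is absorbing
R[S - 2, 1] = 1.0                     # reward only for stepping into the goal

V = np.zeros(S)                       # V_0 = 0
for k in range(1, 5):
    V = (R + gamma * P @ V).max(axis=1)   # one value-iteration sweep
    print(k, np.round(V, 3))
# k=1: [0.    0.    0.    1.    0.]   <- only the state next to the goal knows
# k=2: [0.    0.    0.9   1.    0.]
# k=3: [0.    0.81  0.9   1.    0.]
# k=4: [0.729 0.81  0.9   1.    0.]   <- information has propagated k steps back
```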

The saga of $V^*$, $V$, $V^\pi$ (and the greedy policy)

To be clear, $V$ can be any vector of size $|S|$. Now, there exists some $V^*$ which is the best possible value function for an environment. Similarly, there exists some $V^\pi$ for every policy that accurately describes the return of the policy.

For $V^*$ and $V^\pi$, they are fixed points of the Bellman backup and therefore must satisfy $V^*(s) = \max_a\left(r(s, a) + \gamma\sum_{s'} p(s' \mid s, a)\, V^*(s')\right)$ and $V^\pi(s) = \mathbb{E}_{a\sim \pi(\cdot \mid s)}\left[r(s, a) + \gamma\sum_{s'} p(s' \mid s, a)\, V^\pi(s')\right]$, respectively.

For $V$, no such rule exists. So you can’t write out any relationship that $V$ satisfies. But you can use $V$ in a policy! Just do a greedy selection

$$\pi(s) = \arg\max_a \left( r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V(s') \right)$$

Now, we note that for the greedy policy $\pi$, we have $BV = B^\pi V$, because the policy is exactly the greedy one with respect to $V$. But be careful: for any other $V'$, repeatedly applying $B$ pushes $V'$ towards $V^*$, while repeatedly applying $B^\pi$ pushes $V'$ towards $V^\pi$, which is the value of the greedy policy based on $V$, an arbitrary vector.

The tricky part is that $V$ generates $\pi$, which has its own $V^\pi$ that can be very different from $V$.
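
A minimal sketch of both directions (hypothetical `P` of shape `(S, A, S)`, `R` of shape `(S, A)`): extract the greedy policy from an arbitrary $V$, then compute that policy’s true $V^\pi$ exactly:

```python
import numpy as np

def greedy_policy(V, P, R, gamma):
    """pi(s) = argmax_a [ r(s, a) + gamma * sum_s' p(s'|s, a) V(s') ]."""
    return (R + gamma * P @ V).argmax(axis=1)

def value_of_policy(policy, P, R, gamma):
    """The policy's own V^pi, solved exactly from (I - gamma * P_pi) V = r_pi."""
    S = R.shape[0]
    P_pi = P[np.arange(S), policy]
    r_pi = R[np.arange(S), policy]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

# Any vector V induces a greedy policy, but that policy's true value V^pi
# is generally a very different vector from the V we started with:
#   pi   = greedy_policy(V_arbitrary, P, R, gamma)
#   V_pi = value_of_policy(pi, P, R, gamma)      # != V_arbitrary in general
```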

Reasoning about optimality

Optimality is defined as some $\pi^*$ where $V^{\pi^*}(s) \geq V^\pi(s)$ for every policy $\pi$ and every state $s$ in the environment. If you apply $\pi^*$ to another environment, does the same condition hold for the real value function? If so, the optimality still holds. Otherwise, it doesn’t. Linear shifts and positive scales in value functions do not change optimality.
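
As a quick sanity check of that last claim (a sketch, assuming every per-step reward is transformed as $r'(s, a) = c\, r(s, a) + b$ with $c > 0$, which induces the corresponding shift and scale of every value function): for any policy $\pi$,

$$
V'^{\pi}(s) = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t \big( c\, r_t + b \big) \right]
= c\, V^{\pi}(s) + \frac{b}{1 - \gamma} ,
$$

so every policy’s value goes through the same positive-affine map, the ordering $V^{\pi^*}(s) \geq V^{\pi}(s)$ is preserved, and the optimal policy is unchanged.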

Optimality with shifts in rewards and horizon

Impact of discount factor on value 🚀

Claim: for any $\beta < \gamma$, we have

which effectively means that we can bound the differences in the true value.
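
One bound of this form (a sketch, assuming rewards are bounded by $R_{\max}$ and comparing the value of the same policy under the two discount factors):

$$
\big| V^{\pi}_{\gamma}(s) - V^{\pi}_{\beta}(s) \big|
= \left| \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \big(\gamma^t - \beta^t\big)\, r_t \right] \right|
\leq R_{\max} \sum_{t=0}^{\infty} \big(\gamma^t - \beta^t\big)
= \frac{(\gamma - \beta)\, R_{\max}}{(1 - \gamma)(1 - \beta)} ,
$$

so the two value functions can only drift apart by an amount controlled by $\gamma - \beta$ (though the bound blows up as $\gamma \to 1$).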