Model-Free Control

Tags: CS 234, Tabular, Value
What is tabular? Tabular means that you can represent $Q$ and $V$ as a table. Sometimes the boundary is blurry: a linear $Q$ function is effectively tabular if it assigns one weight per state.

What’s our main problem?

So in the last section, we were able to compute $V^\pi$ in a few different ways. We could do this by letting the policy run as normal and then averaging the returns.

The problem is that we were only able to evaluate the policy. We really want to derive a policy using similar methods.

An intuitive start is to use the rollouts of $\pi$ to estimate $Q^\pi(s, a)$. But there’s a critical problem: with a deterministic policy, we won’t get sufficient action coverage! We always pick the same action in each state!

Again, remember that if we had $p(s' \mid s, a)$ we wouldn’t have any of this problem. In fact, we could arrive at the optimal policy without any exploration. The problem of exploration is only present in these model-free settings.

Good rule of thumb: evaluation only needs $V$. But control needs $Q$.

How do we intend to solve it?

We will first show how to use methods similar to policy evaluation to learn $Q$. This will yield an on-policy control algorithm that requires data collected by the current policy. We will then show an off-policy control algorithm that can learn from an arbitrary dataset.

Remember: control is nothing more than evaluation with an additional policy definition step.

Exploration

To learn $Q(s, a)$, we first need to resolve the exploration problem. We can’t just have a greedy policy running in the environment. Rather, we need to have some degree of exploration.

$\epsilon$-greedy policy 💻

The idea here is pretty simple: with probability $1 - \epsilon$ we select the greedy action $\arg\max_a Q(s, a)$, and with probability $\epsilon$ we take a random action. Therefore, at every state, you eventually try every action, with a bias toward what you currently think is the best action.

Epsilon-greedy is a key way to balance exploration and exploitation as we learn a policy from scratch.

We often anneal $\epsilon$ as the agent learns: we start with a high $\epsilon$ and then decay it as the agent gets better.
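A minimal sketch of what this looks like, assuming a tabular $Q$ stored as a NumPy array (the function names and the decay schedule here are illustrative, not from the course code):

```python
import numpy as np

def epsilon_greedy_action(Q, s, epsilon):
    """Pick a random action with probability epsilon, else the greedy one.
    Q is a (num_states, num_actions) array; s is a discrete state index."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])   # explore: uniform random action
    return int(np.argmax(Q[s]))                # exploit: current best action

def epsilon_schedule(step, start=1.0, end=0.05, decay_steps=10_000):
    """Illustrative annealing schedule: linearly decay epsilon over training."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```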

Boltzmann exploration 💻

You can try something softer: instead of a hard max, sample actions from a softmax over the Q-values, with a temperature that controls the “hardness” of the distribution.
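A sketch of Boltzmann (softmax) exploration over the Q-values; the temperature parameterization here is one common choice, not something prescribed by the lecture:

```python
import numpy as np

def boltzmann_action(Q, s, temperature=1.0):
    """Sample an action with probability proportional to exp(Q(s, a) / T).
    High T -> near-uniform (soft); low T -> near-greedy (hard)."""
    logits = Q[s] / temperature
    logits -= logits.max()                         # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(probs), p=probs))
```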

Monotonic guarantees

You can actually show that with $\epsilon$-greedy policies, the policy improvement step still improves the policy monotonically.
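A sketch of the standard argument (this is the usual $\epsilon$-greedy policy improvement proof): let $\pi$ be an $\epsilon$-greedy policy and let $\pi'$ be the $\epsilon$-greedy policy with respect to $Q^\pi$. Then

$$
\begin{aligned}
\sum_a \pi'(a \mid s)\, Q^\pi(s, a)
  &= \frac{\epsilon}{|\mathcal{A}|} \sum_a Q^\pi(s, a) + (1 - \epsilon) \max_a Q^\pi(s, a) \\
  &\ge \frac{\epsilon}{|\mathcal{A}|} \sum_a Q^\pi(s, a) + (1 - \epsilon) \sum_a \frac{\pi(a \mid s) - \epsilon / |\mathcal{A}|}{1 - \epsilon}\, Q^\pi(s, a) \\
  &= \sum_a \pi(a \mid s)\, Q^\pi(s, a) = V^\pi(s),
\end{aligned}
$$

where the inequality holds because the weights $\frac{\pi(a \mid s) - \epsilon/|\mathcal{A}|}{1 - \epsilon}$ are nonnegative (since $\pi$ is $\epsilon$-greedy) and sum to 1, so the max dominates the weighted average. The policy improvement theorem then gives $V^{\pi'}(s) \ge V^\pi(s)$ for all $s$.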

Monte-Carlo

The idea of Monte-Carlo control is actually pretty easy. In the last section, we talked about how we used count-based methods to derive $V$. Now, we just treat $(s, a)$ as the unique identifier instead of $s$, and we still use count-based methods.

Control with the Monte Carlo Algorithm ⭐

Essentially, we keep the same counts but for $Q$, and we use an epsilon-greedy policy to help with exploration. We continue to extract a greedy policy as we iterate on $Q$.
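A minimal sketch of on-policy Monte Carlo control with epsilon-greedy exploration, assuming an old-style gym interface where `reset()` returns a discrete state and `step()` returns `(s', r, done, info)` (all names here are illustrative):

```python
import numpy as np
from collections import defaultdict

def mc_control(env, num_episodes, gamma=0.99, epsilon=0.1):
    """Sketch: every-visit Monte Carlo control with count-based averaging."""
    nA = env.action_space.n
    Q = defaultdict(lambda: np.zeros(nA))   # Q(s, a) table
    N = defaultdict(lambda: np.zeros(nA))   # visit counts per (s, a)

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(nA)
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        # Roll out one episode with the current epsilon-greedy policy.
        episode, s, done = [], env.reset(), False
        while not done:
            a = epsilon_greedy(s)
            s_next, r, done, _ = env.step(a)
            episode.append((s, a, r))
            s = s_next

        # Walk backwards, accumulating returns, and average them into Q.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            N[s][a] += 1
            Q[s][a] += (G - Q[s][a]) / N[s][a]
    return Q
```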

Guarantees

So this is a little different from Monte Carlo policy evaluation, where the policy was stationary and you kept trying to get a better approximation. In this case, the policy is non-stationary, and you’re not training each evaluation to convergence. Therefore, each $Q$ is not a good estimate of $Q^{\pi_k}$.

This procedure might also fail to reach $Q^*$ if $\epsilon$ is too small: we can get stuck in a locally optimal policy and never really explore anything else.

However, we have the Greedy in the Limit of Infinite Exploration (GLIE) condition. According to this, if every state-action pair $(s, a)$ is visited infinitely often, and the behavior policy converges to the greedy policy (e.g., $\epsilon$-greedy with $\epsilon_k = 1/k$), then these are sufficient conditions for the convergence of $Q$ to $Q^*$.

MC off-policy

We can use MC approaches off-policy with importance sampling, if we have access to both the behavior policy and the policy we’re currently trying to evaluate. However, importance sampling can be very unstable (high variance).
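As a reminder of what the correction looks like, here is standard per-trajectory (ordinary) importance sampling, with $\pi_b$ the behavior policy and $\pi$ the target policy:

$$
V^\pi(s) \approx \frac{1}{n} \sum_{i=1}^{n} \left( \prod_{t=0}^{T_i - 1} \frac{\pi(a_t^{(i)} \mid s_t^{(i)})}{\pi_b(a_t^{(i)} \mid s_t^{(i)})} \right) G^{(i)},
$$

where $G^{(i)}$ is the return of the $i$-th trajectory starting from $s$. The product of ratios over the whole trajectory is exactly what makes the estimator blow up in variance when $\pi$ and $\pi_b$ differ much.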

Temporal Difference

Previously, we used TD approaches to evaluate policies. Can we adapt them for control?

The SARSA Algorithm ⭐💻

So this is very literally the same approach as TD policy evaluation. We just need to sample $(s, a, r, s', a')$ where $a, a'$ are sampled from the policy. Note that this requires the additional $a'$ (hence the name SARSA).

You can also run SARSA with one fixed, arbitrary policy, and it will converge to the value of that policy. For control, we instead act greedily (with exploration) with respect to the current $Q$ so that we converge toward the optimal policy.

SARSA is essentially like policy iteration, except there is no world model and we don’t train each evaluation step to convergence.
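A minimal SARSA sketch, again assuming an old-style gym interface over discrete states and actions (names and hyperparameters here are just illustrative):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon):
    # Behave epsilon-greedily with respect to the current Q table.
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def sarsa(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, epsilon)                 # a  ~ current policy
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            a_next = epsilon_greedy(Q, s_next, epsilon)   # a' ~ same (current) policy
            target = r if done else r + gamma * Q[s_next, a_next]
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next                         # on-policy: reuse a' as the next action
    return Q
```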

Expected SARSA

You can run SARSA, but instead of using the sampled $a_{t+1}$, you take the expectation of $Q(s_{t+1}, \cdot)$ over your current policy $\pi$ (more relevant if you’re doing SARSA on a specific stochastic policy, not the greedy one).
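A sketch of the expected SARSA target, shown here for an $\epsilon$-greedy current policy (the function name and signature are my own, not from the lecture):

```python
import numpy as np

def expected_sarsa_target(Q, s_next, r, epsilon, gamma=0.99, done=False):
    """Expected SARSA: average Q(s', .) under the current epsilon-greedy policy
    instead of plugging in the single sampled a'."""
    if done:
        return r
    num_actions = Q.shape[1]
    probs = np.full(num_actions, epsilon / num_actions)   # exploration mass on every action
    probs[np.argmax(Q[s_next])] += 1.0 - epsilon          # remaining mass on the greedy action
    return r + gamma * np.dot(probs, Q[s_next])
```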

Properties of the SARSA algorithm

The SARSA algorithm is on-policy, meaning that the data we use must be sampled from the current policy. This is because we are trying to estimate $Q^\pi$, and the $a, a'$ must be sampled from $\pi$.

SARSA will converge to $Q^*$ if the sequence of policies is GLIE and the step sizes satisfy the Robbins-Monro conditions, $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.

Q Learning

We just saw the SARSA algorithm, which is basically model-free policy iteration: it estimates $Q$ and then improves $\pi$ based on $Q$. This means that the data we use to estimate $Q$ must come from the current $\pi$, which makes it on-policy. This isn’t very data-efficient.

Instead, we can try to estimate $\pi^*$ while acting under a different behavior policy $\pi_b$, which is what we call an off-policy RL algorithm.

The Q-learning algorithm 💻⭐

Note that there is only one difference from SARSA: we replace $Q(s_{t+1}, a_{t+1})$ with the maximization $\max_{a'} Q(s_{t+1}, a')$.

Unlike SARSA, Q-learning converges to $Q^*$ under any rollout policy (provided it still visits all state-action pairs; see below).
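A minimal Q-learning sketch under the same illustrative gym-style assumptions as the SARSA sketch above; the only substantive change is the max in the target:

```python
import numpy as np

def q_learning(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # Behavior policy: epsilon-greedy here, but any exploratory policy works.
            if np.random.rand() < epsilon:
                a = np.random.randint(Q.shape[1])
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # Target uses max over a', not the action the behavior policy takes next.
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```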

Properties of Q learning

If $\epsilon = 0$, the Q-learning algorithm is EXACTLY the same as SARSA, because the $\max_a Q$ is implicit in the greedy policy. If $\epsilon > 0$, the difference is that SARSA evaluates the epsilon-greedy policy, while Q-learning bootstraps from the pure greedy policy.

To converge to $Q^*$, we just need to visit all $(s, a)$ pairs infinitely often and satisfy the Robbins-Monro conditions with the step size $\alpha$. We don’t need to reduce $\epsilon$ to get $Q^*$.

To converge to $\pi^*$ (i.e., to act optimally), we also need to make sure that $\epsilon \to 0$.

Why is Q-learning off-policy? Well, the critical part is that we only need $(s, a, r, s')$ tuples. These don’t depend on any policy. A tuple just says “suppose I was in this state and took this action; what happens?” This is in contrast with SARSA, which requires $(s, a, r, s', a')$. That last $a'$ comes from the current policy, and so the update is bound to the current policy. It’s very nuanced!

In Q-learning, we just imagine that we follow the Q function greedily past this first $(s, a, r, s')$. This means that it is off-policy. You can easily plug in data collected from any arbitrary policy and it will learn (although some datasets work better than others).

Quick summary about convergences

What happens if they break? (Q learning)

If you don’t visit all possible state-action pairs, you may not converge to $Q^*$. In fact, if you perform Q-learning with $Q$ initialized to 0 and only positive rewards, on data collected by a deterministic policy, then $\max_a Q(s, a)$ will always select the only non-zero entry, which comes from the data-generating action $a = \pi_b(s)$. So, in this case, Q-learning actually converges to the value of the data-generating policy.
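A minimal sketch of this failure mode on a toy MDP made up for this note (the environment and numbers are purely illustrative):

```python
import numpy as np

# 3 states in a ring; both actions move to (s + 1) % 3, but action 0 pays 1.0
# and action 1 pays 2.0, so the optimal policy is to always take action 1.
num_states, num_actions, gamma, alpha = 3, 2, 0.9, 0.5

def step(s, a):
    reward = 1.0 if a == 0 else 2.0      # all rewards positive
    return (s + 1) % num_states, reward

# Data is collected by a deterministic behavior policy: always action 0.
Q = np.zeros((num_states, num_actions))  # Q initialized to zero
s = 0
for _ in range(5000):
    a = 0                                # deterministic behavior policy
    s_next, r = step(s, a)
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    s = s_next

print(Q)                   # the column for action 1 is still all zeros
print(Q.argmax(axis=1))    # greedy policy = [0, 0, 0], i.e., the behavior policy, not pi*
```

Since action 1 is never tried, its Q-values never leave zero, so the max in the bootstrap target always lands on the logged action and Q-learning recovers the behavior policy’s value (about 10 per state here) instead of $Q^*$ (about 20).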