Exploration without Rewards

Tags: CS 285, Exploration, Reviewed 2024

Removing the rewards

Human children and other animals interact with the environment seemingly at random, without any external rewards, yet they build up a useful body of knowledge along the way. Perhaps there is a set of skills we can learn the same way!

The quantities we care about

Using imagined goals

Skew Fit Algorithm

👉
This is from the Skew-Fit paper.

One idea for reward-free exploration is to self-propose goals, and therefore self-propose reward structures. This is a bootstrapping approach: you can become more and more creative with goal generation as your policy gets better.

Concretely, we create a VAE that can generate a goal state $x_g$, and then attempt to reach this goal using a goal-conditioned policy. We then use this online data to update both the policy and the variational autoencoder.
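To make the loop concrete, here is a minimal sketch of the propose-and-practice cycle, assuming hypothetical `vae`, `policy`, `env`, and `replay_buffer` interfaces. This is an illustration of the idea, not the actual Skew-Fit implementation.

```python
import numpy as np

def propose_and_practice(vae, policy, env, replay_buffer, num_episodes=100):
    """One round of self-proposed goals (hypothetical interfaces)."""
    for _ in range(num_episodes):
        # 1. Imagine a goal: sample the VAE prior and decode it into a state.
        z = np.random.randn(vae.latent_dim)
        goal = vae.decode(z)

        # 2. Try to reach the imagined goal with the goal-conditioned policy.
        state, done = env.reset(), False
        while not done:
            action = policy.act(state, goal)
            state, done = env.step(action)
            replay_buffer.add(state, goal)

        # 3. Use the collected online data to update both the policy
        #    (e.g. with a goal-reaching reward) and the VAE itself.
        policy.update(replay_buffer)
        vae.update(replay_buffer)
```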

One potential problem is that we need to keep pushing the proposed goals slightly out of distribution to encourage exploration. To do this, we upweight rarely seen states when fitting the goal distribution, i.e. we "skew" it.

We can obtain these weights because the VAE lets us estimate the density $p(x)$ by computing $p_\theta(x) = \sum_z p_\theta(x \mid z)\, p(z)$. This sum is tractable if the latent $z$ is categorical.

More specifically, we use this weight when updating the generative model: each training sample is reweighted in proportion to $p_\theta(x)^\alpha$.

It's possible to prove that if you use a negative exponent $\alpha$ in the algorithm, the entropy of $p_\theta(x)$ increases, which means that we will converge to a uniform distribution over valid states!
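A minimal sketch of the skewing step, assuming we already have per-sample log-densities $\log p_\theta(x)$ from the model; the function and variable names here are made up for illustration.

```python
import numpy as np

def skew_weights(log_densities, alpha=-1.0):
    """Weights proportional to p_theta(x) ** alpha, with alpha in [-1, 0).

    alpha = 0 recovers ordinary (unskewed) training; more negative alpha
    puts more weight on rarely visited states.
    """
    log_w = alpha * np.asarray(log_densities)
    log_w -= log_w.max()              # subtract max for numerical stability
    w = np.exp(log_w)
    return w / w.sum()

# Example: resample the replay data with these weights before refitting the
# VAE, so rarely visited states get extra influence on the goal distribution.
# idx = np.random.choice(len(data), size=len(data), p=skew_weights(log_p))
```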

There’s an interesting connection with count-based exploration. Instead of using counts as rewards, we use this measure of novelty to skew the distribution towards these novel states.

What is the Objective?

We've derived the Skew-Fit algorithm, but a common question to ask is: what is the underlying objective?

If we get really good at reaching goals, $p(G \mid S)$ becomes more deterministic, so we can write the goal-reaching part of the objective as minimizing $H(p(G \mid S))$. Meanwhile, skewing the goal proposals toward rarely seen states maximizes the goal entropy $H(p(G))$.

Taken together, our objective is

$$\max \; H(p(G)) - H(p(G \mid S)),$$

which is just $I(S; G)$: we are maximizing the mutual information between the final states and the goal states! This means we have high goal coverage and a good ability to hit those goals. It is a very intuitive result.

State-Distribution matching

As we saw in Skew-Fit, our objective was in terms of states and goals. Is it possible to run further with this idea and attempt to fit a policy to a certain state marginal distribution?

An initial implicit attempt

Let's say that we wanted the policy to have a state marginal that is as high-entropy as possible. Can we just fit a state-marginal estimator and add a reward for being novel with respect to this marginal?

Concretely, here's an initial idea (a rough code sketch follows the list):

  1. Given $\bar{r}(s) = r(s) - \log p_\pi(s)$, optimize $\pi(a \mid s)$ to maximize this reward.
    1. Note that this is closely related to adding an entropy term to the RL objective, but it's slightly different. Let's talk more about this below.
  2. Fit $p_\pi(s)$ to the new state marginal.
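Here is the rough loop that the two steps above describe, with $r(s) = 0$; `rollout`, `policy`, `density_model`, and `env` are hypothetical stand-ins rather than any particular implementation.

```python
def implicit_state_entropy_loop(policy, density_model, env, num_iters=100):
    """Alternate policy updates and density refits (hypothetical interfaces)."""
    for _ in range(num_iters):
        # Roll out the current policy and record the states it visits.
        states, actions = rollout(policy, env)

        # Augmented reward r_bar(s) = r(s) - log p_pi(s); here r(s) = 0,
        # so the policy is rewarded for visiting currently-unlikely states.
        rewards = [-density_model.log_prob(s) for s in states]

        # Step 1: optimize the policy against this augmented reward.
        policy.update(states, actions, rewards)

        # Step 2: re-fit the density model to the states just visited,
        # approximating the updated policy's state marginal.
        density_model.fit(states)
    return policy
```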

There is a subtle problem. Let's assume that $r(s) = 0$ for a second. In this case, the reward encourages visiting states with low likelihood under $p_\pi(s)$, which basically means changing the policy as much as possible from the last iterate. You end up with a policy progression that chases its own tail.

As you do this, the density estimator ends up with good state coverage, but the policy you end up with is arbitrary: it depends on where you stop it in the tail-chasing progression. More importantly, that final policy on its own doesn't achieve full coverage.

Explicit Matching: Attempt 1

Instead of implicitly adding a reward augmentation (which didn't really work), can we be more explicit about matching state marginals? For example, can we directly minimize $D_{\mathrm{KL}}(p_\pi(s) \,\|\, p^*(s))$ for some target marginal $p^*(s)$?

We can write the reward as

$$\tilde{r}(s) = \log p^*(s) - \log p_\pi(s),$$

and the RL objective under this reward is exactly the (negative) KL divergence.
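One way to see this, treating the RL objective as an expectation over the policy's state marginal:

$$\mathbb{E}_{s \sim p_\pi(s)}\big[\tilde{r}(s)\big] = \mathbb{E}_{s \sim p_\pi(s)}\big[\log p^*(s) - \log p_\pi(s)\big] = -D_{\mathrm{KL}}\big(p_\pi(s) \,\|\, p^*(s)\big),$$

so maximizing the expected augmented reward would, in principle, push $p_\pi(s)$ toward $p^*(s)$.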

However, RL doesn't exactly do this! RL isn't aware that $p_\pi$ depends on $\pi$ (no gradients flow through the density term), which means we still get the suboptimal tail-chasing problem. In fact, this is no different from our previous non-example; the only change is the additional term $\log p^*(s)$, which is constant with respect to $\pi$.

Resolving tail-chasing behavior (Attempt 2)

💡
From: Efficient Exploration via State Marginal Matching and Provably Efficient Maximum Entropy Exploration

The original algorithm was the two-step loop from before (optimize the policy against $\log p^*(s) - \log p_\pi(s)$, then re-fit $p_\pi$ to the latest rollouts), which had the tail-chasing behavior. But there is a pretty simple resolution: what if we fit the density estimator to all the states seen so far, instead of only the current policy's state marginal?

We also return the mixture of all policy iterates, $\pi^*(a \mid s) = \frac{1}{K}\sum_{k=1}^{K} \pi^k(a \mid s)$.

This works because $p_\pi(s) = p^*(s)$ is the Nash equilibrium of a two-player game between the exploration policy and the density model. Intuitively, the exploration policy is essentially carving out a path, but that path immediately becomes something to avoid. Once the state marginals are matched, there is nowhere left to carve out, but the cost is that now everything needs to be avoided.

And because we know this is a Nash equilibrium, game-theoretic results tell us to return the mixture of historical policies rather than just the final iterate.
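A hedged sketch of this fix, reusing the same hypothetical interfaces as before; `target_log_prob` stands in for $\log p^*(s)$, and `policy.snapshot()` is an assumed way to save an iterate.

```python
def state_marginal_matching(policy, density_model, target_log_prob, env,
                            num_iters=100):
    """Fit the density to ALL states seen so far; return the policy mixture."""
    all_states = []        # every state visited across all iterations
    policy_iterates = []   # snapshots pi^1, ..., pi^K

    for _ in range(num_iters):
        states, actions = rollout(policy, env)
        all_states.extend(states)

        # Fit the density model to the full history, not just the latest batch.
        density_model.fit(all_states)

        # Reward: r(s) = log p*(s) - log p_history(s)
        rewards = [target_log_prob(s) - density_model.log_prob(s)
                   for s in states]
        policy.update(states, actions, rewards)
        policy_iterates.append(policy.snapshot())

    # The exploration policy we return is the uniform mixture over iterates:
    # to act, sample one snapshot (e.g. per episode) and follow it.
    return policy_iterates
```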

When is state entropy helpful?

In Skew-Fit, we compute $\max I(S; G)$; in SMM, in the special case where the target marginal is uniform, we are computing $\max H(p_\pi(S))$.

When is this a good idea? If an adversary is going to hand the agent the worst possible goal, and the adversary knows how the agent prepared, then the best goal distribution to train on is the uniform one. This is a result from maximum-entropy modeling.

Do you see why we can't practice for one particular worst-case situation? The adversary will know and pick something else, so it's better to be versatile.

Learning Diverse Skills

💡
From: Diversity is All You Need

Previously we talked about state-distribution objectives, but can we be more abstract? Can we have maximum-entropy objectives over skills?

Let's say that we have a policy $\pi(a \mid s, z)$ where $z$ is a task index. Can we train, in a self-supervised way, a policy that performs maximally different skills for each $z$?

You can use a diversity-promoting reward function, which means we should reward states that are unlikely to occur under the other tasks.

You can just have the reward come from a classifier that predicts the skill from the state: $r(s, z) = \log p(z \mid s)$.

Note that this is not an adversarial paradigm. You’re actually trying to make the classifier’s job easier!

We can also understand the above objective as maximizing $I(Z; S) = H(Z) - H(Z \mid S)$: $H(Z)$ is maximized by using a uniform prior $p(z)$, and $H(Z \mid S)$ is minimized by maximizing the reward objective $\log p(z \mid s)$.
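A rough sketch of this skill-diversity loop, assuming a uniform prior over `num_skills` discrete skills; `discriminator`, `rollout`, and the other interfaces are placeholders that play the role of the classifier $p(z \mid s)$ and the environment interaction, and the reward includes the constant $-\log p(z)$ baseline as in DIAYN.

```python
import numpy as np

def diversity_reward(discriminator, state, z, num_skills):
    """Pseudo-reward for skill z at `state`: log q(z|s) - log p(z)."""
    log_q_z_given_s = discriminator.log_prob(state, z)
    log_p_z = -np.log(num_skills)          # uniform prior over skills
    return log_q_z_given_s - log_p_z

def training_step(policy, discriminator, env, num_skills):
    # Sample a skill, roll it out, and reward states the classifier can
    # attribute to that skill; then train the classifier on the new
    # (state, skill) pairs -- making its job easier, not harder.
    z = np.random.randint(num_skills)
    states, actions = rollout(policy, env, z)
    rewards = [diversity_reward(discriminator, s, z, num_skills)
               for s in states]
    policy.update(states, actions, rewards, z)
    discriminator.update(states, z)
```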