Skill Discovery

Tags: CS 224R

Soft Q-learning

There’s more theory in the Berkeley version, but essentially we replace the maximization in the Bellman backup with a soft value built from $\exp Q(s_t, a_t)$ (a log-sum-exp over actions), which comes with some very nice theoretical background. More practically, it yields better exploration, finetunability, and robustness.
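
To make the contrast concrete, here is a minimal sketch (in NumPy, with illustrative values and a hypothetical temperature `alpha`) of how the soft Bellman target differs from the hard max used in standard Q-learning:

```python
import numpy as np

def hard_target(q_next, reward, gamma=0.99):
    # Standard Q-learning: bootstrap with max_a' Q(s', a').
    return reward + gamma * np.max(q_next)

def soft_target(q_next, reward, gamma=0.99, alpha=1.0):
    # Soft Q-learning: replace the max with a log-sum-exp over actions,
    # V(s') = alpha * log sum_a' exp(Q(s', a') / alpha).
    v_next = alpha * np.log(np.sum(np.exp(q_next / alpha)))
    return reward + gamma * v_next

q_next = np.array([1.0, 0.9, -2.0])      # Q-values of the next state's actions
print(hard_target(q_next, reward=0.0))   # bootstraps only from the best action
print(soft_target(q_next, reward=0.0))   # also credits near-optimal actions
```

As `alpha` goes to zero the soft target recovers the hard max; a larger `alpha` spreads value across near-optimal actions, which is where the extra exploration comes from.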

Diversifying Skills

Condition your policy on some style variable $z$, but you want to be careful. How can you be certain that these skills are diverse?

Soft Q-learning increases action entropy, but this is NOT the same as state entropy: an agent can act randomly while still hovering around the same few states. Our objective here is different.

This naturally pushes us towards the mutual information objective $I(s, z) = H(z) - H(z \mid s)$. The first term is maximized by a uniform prior over skills. The second term is minimized when the visited states identify the skill sharply, so we propose

$$r(s, z) = \log p(z \mid s)$$

as the reward for the policy. The idea here is that we want to maximally separate the skills from one another in state space.
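
Since the true posterior $p(z \mid s)$ is intractable, it is typically approximated with a learned classifier (a discriminator) trained to predict which skill produced a visited state. The sketch below is one way this reward could be computed; the `Discriminator` architecture and shapes are illustrative assumptions, not something specified in the lecture.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Predicts which skill z produced a visited state s."""
    def __init__(self, state_dim: int, num_skills: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, num_skills),
        )

    def forward(self, state):
        return self.net(state)  # unnormalized logits over skills

def skill_reward(disc, state, z: int):
    # r(s, z) = log q_phi(z | s), the learned stand-in for log p(z | s).
    log_q = torch.log_softmax(disc(state), dim=-1)
    return log_q[..., z]
```

The discriminator itself is trained with a cross-entropy loss to recover $z$ from states visited while executing skill $z$, so the policy and the discriminator improve together.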

Using Discovered Skills

So with this policy that uses $z$, you can learn a second policy on top of it that only learns how to set $z$. This gives a policy that operates at a higher level of abstraction, yielding an easier learning problem. But this is actually quite naive. Can we do better?
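
A minimal sketch of that scheme, assuming hypothetical `env`, `high_level`, and `skill_policy` callables: the high-level policy picks a skill index every fixed number of steps, and the frozen skill policy handles the low-level control.

```python
def rollout_hierarchical(env, high_level, skill_policy, skill_len=50, horizon=1000):
    s = env.reset()
    total_reward = 0.0
    for t in range(horizon):
        if t % skill_len == 0:
            z = high_level(s)       # high-level action = choice of skill
        a = skill_policy(s, z)      # frozen skill does the low-level control
        s, r, done, _ = env.step(a)
        total_reward += r           # high-level learns from the task reward
        if done:
            break
    return total_reward
```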

In the previous paradigm, we didn’t know whether a skill was useful or not. Instead, we turn to a slightly different objective. Can we condition the mutual information on the current state?

$$I(s'; z \mid s) = H(s' \mid s) - H(s' \mid s, z)$$

How does that help? Well, it means that $z$ is maximally informative about the dynamics, i.e. about which state $s'$ comes next. Intuitively, this means that every $z$ will pick out behavior that is meaningful and consistent.

This yields far more consistent skills. And because we learn a dynamics model along the way, we can now use it for classical model-based control.
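
One way to turn this objective into a reward, sketched under the assumption of a learned skill-conditioned dynamics model `dynamics_log_prob(s, s_next, z)` $\approx \log q(s' \mid s, z)$ and a `sample_skills` helper (both hypothetical names): reward transitions that the model predicts well under the executed $z$ but poorly under other skills.

```python
import torch

def dynamics_aware_reward(dynamics_log_prob, s, s_next, z, sample_skills, num_samples=32):
    # log q(s' | s, z): how predictable the transition is given this skill.
    log_q = dynamics_log_prob(s, s_next, z)
    # Monte-Carlo estimate of log q(s' | s) = log E_{z'}[q(s' | s, z')].
    z_samples = sample_skills(num_samples)
    log_qs = torch.stack([dynamics_log_prob(s, s_next, zp) for zp in z_samples])
    log_marginal = torch.logsumexp(log_qs, dim=0) - torch.log(torch.tensor(float(num_samples)))
    # High reward: predictable under z, not explained by other skills.
    return log_q - log_marginal
```

Because $q(s' \mid s, z)$ is an explicit dynamics model, it can also be used for model-predictive planning over skills at test time, which is the classical-control connection above.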

Hierarchical RL

Why Hierarchical? Well, that’s how we learn: we start with the simplest techniques and then build them up into complicated behaviors.

There is a ton of work in this field, so this will only be a brief survey.