Skill Discovery

Tags: CS 224R

Soft Q-learning

There’s more theory in the Berkeley version, but essentially we replace the maximization in the Bellman backup with a soft value built from $\exp Q(s_t, a_t)$ (a log-sum-exp over actions), which comes with some very nice theoretical background. More practically, it yields better exploration, finetunability, and robustness.
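
To make the contrast concrete, here is a minimal sketch (in NumPy, with illustrative values and a hypothetical temperature `alpha`) of how the soft Bellman target differs from the hard max used in standard Q-learning:

```python
import numpy as np

def hard_target(q_next, reward, gamma=0.99):
    # Standard Q-learning: bootstrap with max_a' Q(s', a').
    return reward + gamma * np.max(q_next)

def soft_target(q_next, reward, gamma=0.99, alpha=1.0):
    # Soft Q-learning: replace the max with a log-sum-exp over actions,
    # V(s') = alpha * log sum_a' exp(Q(s', a') / alpha).
    v_next = alpha * np.log(np.sum(np.exp(q_next / alpha)))
    return reward + gamma * v_next

q_next = np.array([1.0, 0.9, -2.0])      # Q-values of the next state's actions
print(hard_target(q_next, reward=0.0))   # bootstraps only from the best action
print(soft_target(q_next, reward=0.0))   # also credits near-optimal actions
```

As `alpha` goes to zero the soft target recovers the hard max; a larger `alpha` spreads value across near-optimal actions, which is where the extra exploration comes from.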

Diversifying Skills

Condition your policy on some style variable $z$, but you want to be careful. How can you be certain that these skills are diverse?

Soft Q-learning increases action entropy, but this is NOT the same as state entropy: an agent can act randomly while still hovering around the same few states. Our objective here is different.

This naturally pushes us towards the mutual information objective $I(s, z) = H(z) - H(z \mid s)$. The first term is maximized by a uniform prior over skills. The second term is minimized when the visited states identify the skill sharply, so we propose

$$r(s, z) = \log p(z \mid s)$$

as the reward for the policy. The idea here is that we want to maximally separate the skills from one another in state space.
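
Since the true posterior $p(z \mid s)$ is intractable, it is typically approximated with a learned classifier (a discriminator) trained to predict which skill produced a visited state. The sketch below is one way this reward could be computed; the `Discriminator` architecture and shapes are illustrative assumptions, not something specified in the lecture.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Predicts which skill z produced a visited state s."""
    def __init__(self, state_dim: int, num_skills: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, num_skills),
        )

    def forward(self, state):
        return self.net(state)  # unnormalized logits over skills

def skill_reward(disc, state, z: int):
    # r(s, z) = log q_phi(z | s), the learned stand-in for log p(z | s).
    log_q = torch.log_softmax(disc(state), dim=-1)
    return log_q[..., z]
```

The discriminator itself is trained with a cross-entropy loss to recover $z$ from states visited while executing skill $z$, so the policy and the discriminator improve together.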

Using Discovered Skills

So with this policy that uses $z$, you can learn a second policy on top of it that only learns how to set $z$. This gives a policy that operates at a higher level of abstraction, yielding an easier learning problem. But this is actually quite naive. Can we do better?
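
A minimal sketch of that scheme, assuming hypothetical `env`, `high_level`, and `skill_policy` callables: the high-level policy picks a skill index every fixed number of steps, and the frozen skill policy handles the low-level control.

```python
def rollout_hierarchical(env, high_level, skill_policy, skill_len=50, horizon=1000):
    s = env.reset()
    total_reward = 0.0
    for t in range(horizon):
        if t % skill_len == 0:
            z = high_level(s)       # high-level action = choice of skill
        a = skill_policy(s, z)      # frozen skill does the low-level control
        s, r, done, _ = env.step(a)
        total_reward += r           # high-level learns from the task reward
        if done:
            break
    return total_reward
```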

In the previous paradigm, we didn’t know whether a skill was useful or not. Instead, we turn to a slightly different objective. Can we condition the mutual information on the current state?

$$I(s'; z \mid s) = H(s' \mid s) - H(s' \mid s, z)$$

How does that help? Well, it means that $z$ is maximally informative about the dynamics, i.e. about which state $s'$ comes next. Intuitively, this means that every $z$ will pick out behavior that is meaningful and consistent.

This yields far more consistent skills. And because we learn a dynamics model along the way, we can now use it for classical model-based control.
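
One way to turn this objective into a reward, sketched under the assumption of a learned skill-conditioned dynamics model `dynamics_log_prob(s, s_next, z)` $\approx \log q(s' \mid s, z)$ and a `sample_skills` helper (both hypothetical names): reward transitions that the model predicts well under the executed $z$ but poorly under other skills.

```python
import torch

def dynamics_aware_reward(dynamics_log_prob, s, s_next, z, sample_skills, num_samples=32):
    # log q(s' | s, z): how predictable the transition is given this skill.
    log_q = dynamics_log_prob(s, s_next, z)
    # Monte-Carlo estimate of log q(s' | s) = log E_{z'}[q(s' | s, z')].
    z_samples = sample_skills(num_samples)
    log_qs = torch.stack([dynamics_log_prob(s, s_next, zp) for zp in z_samples])
    log_marginal = torch.logsumexp(log_qs, dim=0) - torch.log(torch.tensor(float(num_samples)))
    # High reward: predictable under z, not explained by other skills.
    return log_q - log_marginal
```

Because $q(s' \mid s, z)$ is an explicit dynamics model, it can also be used for model-predictive planning over skills at test time, which is the classical-control connection above.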

Hierarchical RL

Why Hierarchical? Well, that’s how we learn: we start with the simplest techniques and then build them up into complicated behaviors.

There is a ton of work in this field, so this will only be a brief survey.