Reward Learning (overview)

Tags: CS 224R

What we want

We can learn from reward functions now, but what happens if we don’t have a good reward? What do we do?

Well, we can learn one! The process of learning a reward is known as inverse reinforcement learning, and it comes with a lot of challenges. There is a more theoretical treatment in a separate set of notes; here we look at the higher-level concepts.

Goal Classifiers

The idea is simple: train a classifier to distinguish success from failure, and then use the classifier's output as the reward. The problem is that the RL agent may learn to "reward hack": it visits states that are out-of-distribution (OOD) for the classifier, where the predicted success probability can be badly overestimated.
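To make this concrete, here is a minimal PyTorch sketch of using a binary success classifier as a reward signal. The names (`GoalClassifier`, `state_dim`, `reward`) are hypothetical, not from the lecture.

```python
import torch
import torch.nn as nn

class GoalClassifier(nn.Module):
    """Binary success/failure classifier over states (sketch)."""
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def success_prob(self, states):
        # Probability that each state counts as a "success".
        return torch.sigmoid(self.net(states))

def reward(classifier, states):
    # The RL agent treats the predicted success probability as its reward.
    # Caveat: on OOD states this probability can be arbitrarily wrong,
    # which is exactly the reward-hacking failure mode described above.
    with torch.no_grad():
        return classifier.success_prob(states)
```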

Regularizing classifiers

The simple fix: add the states visited by the policy as negative examples, and retrain the classifier online.

We need to be careful: sometimes visited states actually are successes, so we end up giving false negative labels. But as long as the batches are balanced, the classifier will still output $\geq 0.5$ for successful states, because at least half of each batch (the provided goal examples) is correctly labeled positive.
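A minimal sketch of one such online retraining step, assuming a classifier like the one above and batches given as tensors of shape `[batch, state_dim]` (all names here are hypothetical):

```python
import torch
import torch.nn.functional as F

def retrain_step(classifier, optimizer, success_states, visited_states):
    # Balance the batch: goal examples are positives, recently visited
    # policy states are (possibly noisy) negatives.
    n = min(len(success_states), len(visited_states))
    states = torch.cat([success_states[:n], visited_states[:n]], dim=0)
    labels = torch.cat([torch.ones(n, 1), torch.zeros(n, 1)], dim=0)

    probs = classifier.success_prob(states)
    loss = F.binary_cross_entropy(probs, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```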

As a sidenote, this works a lot like a GAN. The policy plays the role of the generator, and as the policy improves, the "unsuccessful" states it visits look more and more like successes. Adversarially, the policy's goal is to drive the classifier toward outputting a uniform 0.5 (maximal confusion), which amounts to matching the expert distribution.

When done correctly, this can outperform imitation learning.

GAIL

This brings us to a more general class of imitation learning. Previously, we had goal classifiers over states, but we can more generally take $(s, a)$ examples from the expert and use these as the positive examples, with the policy's own $(s, a)$ pairs as the negative examples. We call this generative adversarial imitation learning, or GAIL.
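A rough sketch of the discriminator side of GAIL, assuming PyTorch and flat state/action tensors. The surrogate policy reward shown ($-\log(1 - D(s,a))$) is one common choice, not the only one, and all names are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Classifies (s, a) pairs: expert = 1, policy = 0 (sketch)."""
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, states, actions):
        return self.net(torch.cat([states, actions], dim=-1))  # logits

def discriminator_loss(disc, expert_s, expert_a, policy_s, policy_a):
    expert_logits = disc(expert_s, expert_a)
    policy_logits = disc(policy_s, policy_a)
    # Expert (s, a) pairs are positives, policy (s, a) pairs are negatives.
    return (F.binary_cross_entropy_with_logits(expert_logits, torch.ones_like(expert_logits))
            + F.binary_cross_entropy_with_logits(policy_logits, torch.zeros_like(policy_logits)))

def policy_reward(disc, states, actions):
    # The policy is rewarded for confusing the discriminator; -log(1 - D)
    # is large wherever the pair looks expert-like.
    with torch.no_grad():
        d = torch.sigmoid(disc(states, actions))
        return -torch.log(1.0 - d + 1e-8)
```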

Human Preferences

Humans are much better at providing relative rankings than at assigning absolute numerical scores.

Learning rewards from human preferences

Essentially, you are given a ranking over trajectories (which could be full or partial rollouts), and you want to learn a reward function that respects that ranking as closely as possible.

We define the following preference model:

$$P(\tau_A \succ \tau_B) = \sigma\big(r_\theta(\tau_A) - r_\theta(\tau_B)\big)$$

Here, $r_\theta(\tau)$ is shorthand for the sum of rewards along the trajectory under the learned reward function, and $\sigma$ is the sigmoid function. The larger the difference in returns, the more confident the model is that one trajectory is ranked above the other. It essentially "softens" the rankings into a differentiable function, and the objective becomes a maximum-likelihood objective over the observed preferences.

The overall algorithm applies this maximum-likelihood objective over all ranked pairs, as in the sketch below.
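A minimal sketch of that pairwise objective in PyTorch, assuming `reward_model` maps a `[T, state_dim]` trajectory tensor to per-step rewards and `traj_pairs` holds (preferred, dispreferred) tuples; these names are hypothetical:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, traj_pairs):
    losses = []
    for preferred, dispreferred in traj_pairs:
        # r_theta(tau): sum of per-step rewards along each trajectory.
        r_pref = reward_model(preferred).sum()
        r_disp = reward_model(dispreferred).sum()
        # P(preferred > dispreferred) = sigmoid(r_pref - r_disp);
        # maximizing its log-likelihood = minimizing the negative log-sigmoid.
        losses.append(-F.logsigmoid(r_pref - r_disp))
    return torch.stack(losses).mean()
```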

Rewards from AI feedback

Critique is easier than generation, so we can use one AI model to critique the outputs of another and use that feedback to improve it. We can even reduce harmful behavior with a paradigm called constitutional RL, where you establish a set of ground rules for the model.

Proposing your own goals

This is an active field of research: can an agent propose its own goal-directed behavior? This would be unsupervised RL, and it is most similar to how humans and animals learn without trainers or teachers.