Prompting, Instruction Finetuning, RLHF
Tags: CS 224N
Zero-Shot / Few-Shot In-Context Learning
Large language models can learn a task from just a few examples placed in the prompt, with no gradient updates.
You can also elicit zero-shot behavior: for example, appending a cue like "TL;DR:" to the end of a passage prompts the model to summarize it.
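As a minimal sketch, a few-shot prompt is just worked demonstrations concatenated in front of the query, and a zero-shot summary can be requested with a cue. The translation demonstrations loosely echo the GPT-3 paper's few-shot example; the strings are otherwise illustrative.

```python
# Few-shot: prepend a few worked demonstrations; the model infers the task
# format and completes the last line in the same style.
few_shot_prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "plush giraffe => girafe peluche\n"
    "hello => "
)

# Zero-shot summarization: no demonstrations, just a cue the model has
# seen during pretraining.
passage = "Long article text goes here..."
zero_shot_prompt = passage + "\nTL;DR:"
```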
Performance can be improved further by prompting the model to talk through its reasoning before answering, known as the chain-of-thought paradigm.
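A minimal chain-of-thought prompt, loosely following the worked example from Wei et al. (2022): the demonstration includes the intermediate reasoning, so the model imitates that step-by-step style before committing to an answer.

```python
# The demonstration answer spells out its reasoning, so the model is
# nudged to do the same for the new question.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n"
    "\n"
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\n"
    "A:"
)
```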
Overall, in-context learning requires no finetuning, but you are limited by what fits in the context window, and running a large model on every query is computationally expensive.
Instruction Finetuning
For pretrained decoder models, maximizing likelihood is NOT the same as satisfying user intent. The prompt “write a story about a cat” might be completed with a list of other fun writing prompts instead of an actual story.
We can, of course, finetune an LLM on instruction-following data to fix this. A T5 model (encoder-decoder) is particularly well suited to the task, because its encoder attends bidirectionally over the instruction before the decoder generates the response.
Unfortunately, for open-ended tasks like creative generation there is no single right answer, so finetuning toward one reference may not be optimal. Furthermore, the LM loss penalizes every token equally, even though some token swaps are minor while others change the meaning entirely. So there is a mismatch between the training objective and human preferences.
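A minimal sketch of the instruction-finetuning loss in PyTorch, assuming a decoder-style LM; the `prompt_mask` marking instruction tokens is an illustrative convention. Note that every response token contributes equally to the loss, which is exactly the mismatch described above.

```python
import torch
import torch.nn.functional as F

def instruction_finetuning_loss(logits, labels, prompt_mask):
    """Token-level cross-entropy over the response only.

    logits:      (batch, seq_len, vocab) from the language model
    labels:      (batch, seq_len) target token ids
    prompt_mask: (batch, seq_len) True for instruction/prompt tokens,
                 which are excluded from the loss
    """
    # Standard next-token setup: predict position t+1 from position t.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    shift_mask = prompt_mask[:, 1:]

    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    )
    # Zero out prompt positions; average over response tokens only.
    response_tokens = (~shift_mask).reshape(-1).float()
    return (loss * response_tokens).sum() / response_tokens.sum()
```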
Reinforcement Learning from Human Feedback (RLHF)
You can also just let humans define the reward and run RL on it. For ChatGPT, we start by finetuning on instruction following, then use human preference labels to train a reward model, and finally optimize the policy against that reward model with reinforcement learning.
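A sketch of the standard pairwise reward-model objective (Bradley-Terry style, as used in InstructGPT-like pipelines); the function name and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise preference loss: push the reward of the human-preferred
    response above the reward of the rejected one.

    r_chosen, r_rejected: (batch,) scalar rewards from the reward model
    for the preferred and dispreferred completions of the same prompt.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```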
The problem is that RL is tricky to get right, and we may end up gaming the reward model: answers that sound authoritative can be wrong and still receive a high score. To help with this, we typically regularize the reward with a KL-divergence term that prevents the policy from straying too far from the supervised (instruction-finetuned) policy.
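A sketch of that KL-regularized reward, using the common per-sample estimate of the KL term; `beta` and the names are illustrative.

```python
def kl_shaped_reward(reward, logp_policy, logp_sft, beta=0.1):
    """Reward actually fed to the RL algorithm.

    reward:      score from the learned reward model for a sampled response
    logp_policy: log-prob of that response under the current policy
    logp_sft:    log-prob of the same response under the supervised model
    beta:        strength of the KL penalty (illustrative value)
    """
    # logp_policy - logp_sft is a per-sample estimate of KL(policy || SFT);
    # subtracting it discourages drifting far from the supervised policy.
    return reward - beta * (logp_policy - logp_sft)
```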
And more philosophically, human preferences are fallible.
What’s next?
- can we use AI feedback to improve LLMs?