Lifelong Learning
Tags: Advanced, CS 330
Review of methods
multi-task learning is when you learn to solve a set of tasks at once.
meta-learning is when you use examples of tasks and learn a way to solve a new task quickly.
In the real world, however, things are more sequential. Given a stream of tasks fed to us one at a time, can we get better over time?
The different setups
Unfortunately, lifelong learning is a very active field and the definitions haven’t solidified.
- Discrete vs continuous: do you get different, labeled tasks, or do the tasks slowly change?
- Level of data storage: do we get access to all the prior data? Or is this not possible/not allowed? On one end of the spectrum, you have a large replay buffer. On the other end, you have a single point drawn from a mysterious black box, and it disappears after training.
- Desired performance: do we only want best performance on the current task, or do we want good performance on all prior tasks?
- Task order: is it predictable? Random? Is it a curriculum? Is it adversarial (like email spam filtering)?
  - an iid setting is one where each task is drawn from a shared distribution. For curricula and adversarial orderings, however, this may not be the case.
- Data exposure: do we expect the model to generalize zero-shot to the new data, or do we provide some support data?
Other terms for lifelong learning include online learning, continual learning, incremental learning, and streaming learning.
What do you want?
We want to minimize regret, which is the cumulative loss of the learner as it goes through the data, versus a learner that takes all the data and fits things all at once.
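Concretely, the standard formalization (writing $\ell_t$ for the loss at round $t$ and $\theta_t$ for the learner's parameters at that round; this notation is my own, not from the lecture):

```latex
% Regret after T rounds: the online learner's cumulative loss minus the
% cumulative loss of the best single model chosen in hindsight.
\[
\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \ell_t(\theta_t) \;-\; \min_{\theta} \sum_{t=1}^{T} \ell_t(\theta)
\]
```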
Now, it may well be possible that the regret is negative: if the tasks are highly temporally correlated, a near-sighted learner that tracks the current task can beat a single model fit to all the data at once.
What regret do we want?
Linear regret is trivial. If the regret grows linearly, it means that we aren't improving at all; we keep making mistakes of the same magnitude every round. This can be as stupid as outputting a constant prediction, but it could also mean that the model is learning from scratch for every task.
Therefore, we want sublinear regret: regret that grows slower than linearly in the number of rounds (such as $O(\sqrt{T})$), so that the average regret per round goes to zero.
Positive and negative transfer
forward transfer is having previous tasks affect your future tasks. Positive forward transfer means that past tasks allow future tasks to be learned better than learning from scratch. Negative forward transfer means the opposite.
backward transfer is having current tasks affect your previous tasks. Positive backward transfer means that performance on past tasks improves after seeing more data. Vice versa for negative.
Ideally, you want both positive forward and backward transfer.
Basic approaches
Follow the leader
This is very easy. Just store all the data you've seen, and retrain on all of it whenever new data arrives (see the sketch after this list).
- very strong performance (as long as it isn’t adversarial)
- computationally intensive and memory intensive
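A minimal follow-the-leader sketch, assuming a PyTorch regression model and a stream of `(x, y)` tensor pairs; the loss, optimizer, and hyperparameters here are illustrative assumptions, not from the lecture:

```python
import torch

# Follow the leader: keep every observation seen so far, and refit the
# model on the full buffer each time a new point arrives.
def follow_the_leader(model, stream, epochs=5, lr=1e-3):
    buffer_x, buffer_y = [], []
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for x, y in stream:                        # data arrives one point at a time
        buffer_x.append(x)
        buffer_y.append(y)
        X, Y = torch.stack(buffer_x), torch.stack(buffer_y)
        for _ in range(epochs):                # train on ALL data seen so far
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(X), Y)
            loss.backward()
            opt.step()
    return model
```

The cost is exactly the bullet points above: the buffer grows without bound, and every round retrains on everything.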
SGD
For every data point you observe, take a gradient step (or a few) on that point's loss, then move on; see the sketch after this list.
- very computationally cheap and requires very little memory
- subject to negative backward transfer (catastrophic forgetting)
- learning is slow because single-point gradients are noisy
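A minimal online-SGD sketch under the same assumptions as before (illustrative model, loss, and hyperparameters):

```python
import torch

# Online SGD: take one (or a few) gradient steps on each incoming point,
# then discard it. Cheap in compute and memory, but nothing stops later
# updates from overwriting what earlier tasks taught the model.
def online_sgd(model, stream, steps_per_point=1, lr=1e-2):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x, y in stream:
        for _ in range(steps_per_point):
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()
            opt.step()
    return model
```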
Variants
You might train on only the current task's data while continuing to reuse some model carried over from the past, like a Q-function in reinforcement learning.
Improving from the basics
Improving SGD
Can we improve SGD to avoid negative backward transfer? Here's one idea (this is the gradient episodic memory, GEM, approach):
- store small amounts of data per task in memory
- For the current optimization step, compute a gradient $g_k$ for each of these stored tasks. Then, find the update closest to the current task's gradient $g$ subject to it having a non-negative inner product $\langle \tilde{g}, g_k \rangle \ge 0$ with every stored task's gradient.
We can solve for this constrained gradient using quadratic programming, with one inequality constraint per stored task.
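The full QP runs every step. A cheaper single-constraint simplification (this is the averaged A-GEM variant, not the exact QP above) projects the current gradient against one gradient computed on a batch sampled from memory, in closed form:

```python
import torch

# Single-constraint gradient projection (the A-GEM simplification):
#   g     -- flattened gradient for the current task's batch
#   g_ref -- flattened gradient on a batch sampled from episodic memory
# If the two conflict (negative inner product), project g onto the
# half-space where its inner product with g_ref is non-negative.
def project_gradient(g: torch.Tensor, g_ref: torch.Tensor) -> torch.Tensor:
    dot = torch.dot(g, g_ref)
    if dot < 0:  # the proposed step would increase loss on memory
        g = g - (dot / torch.dot(g_ref, g_ref)) * g_ref
    return g
```

The projected gradient is then written back into the model's parameters before the optimizer step.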
Meta-learning perspective
If we assume that we get a little bit of information about each task beforehand, then we can reframe this as an online meta-learning problem. Each task is a meta-task, and the small amount of data is the support set.
You can think of this as follow the meta-leader, where you meta-train on all the tasks seen so far and then run the learned update procedure on the new task's support set. This removes follow-the-leader's implicit assumption that one set of parameters works for all tasks. However, this can still suffer on out-of-distribution tasks. For better performance, optimization-based meta-learning approaches can be helpful (see the sketch below).
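A hedged sketch of follow the meta-leader with a MAML-style (optimization-based) inner loop. Here `model_fn(params, x)` is a hypothetical functional forward pass and `task_buffer` holds (support, query) splits for every task seen so far; none of these names come from the lecture:

```python
import torch
import torch.nn.functional as F

# One MAML-style inner adaptation step on a task's support set.
# params is a list of leaf tensors with requires_grad=True.
def maml_inner(params, support_x, support_y, model_fn, inner_lr=0.01):
    loss = F.mse_loss(model_fn(params, support_x), support_y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    return [p - inner_lr * g for p, g in zip(params, grads)]

# Follow the meta-leader: each round, meta-train on every task seen so
# far, so that one inner step adapts well to the next task's support set.
def follow_the_meta_leader(params, task_buffer, model_fn,
                           meta_lr=1e-3, meta_steps=100):
    opt = torch.optim.Adam(params, lr=meta_lr)
    for _ in range(meta_steps):
        opt.zero_grad()
        meta_loss = 0.0
        for sx, sy, qx, qy in task_buffer:       # support/query per task
            adapted = maml_inner(params, sx, sy, model_fn)
            meta_loss = meta_loss + F.mse_loss(model_fn(adapted, qx), qy)
        meta_loss.backward()
        opt.step()
    return params
```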