Meta-Learning Basics
| Tags | CS 330, Meta-Learning Methods |
|---|---|
Why meta-learning?
Transfer learning is when you have a source task and a target task, and you adapt a model from one to the other. Meta-learning explicitly optimizes how well we go from source to target. In other words, given a set of training tasks, can we optimize to learn new tasks quickly?
Alternatively, we can understand meta-learning as transfer learning with many source tasks. For both transfer and meta-learning, we don't assume access to the prior (source) tasks' data at adaptation time. In all settings (multi-task, transfer, meta) we must assume shared structure across tasks.
Ways of looking at meta-learning
Probabilistic view
We can draw a PGM in which shared meta-parameters $\theta$ sit above the per-task parameters $\phi_i$, which in turn generate each task's data.

The $\phi_i$ are the parameters for each task, and $\theta$ is the meta-learned information. You can interpret $\theta$ as a random variable whose support is over all possible functions. If the tasks are not independent to start with, then this $\theta$ gives the prior over the $\phi_i$.
We can also see that
$$\mathcal{H}\big(p(\phi_i \mid \theta)\big) \le \mathcal{H}\big(p(\phi_i)\big),$$
which means that if we know $\theta$, we are more certain about what $\phi_i$ is.
Here are some examples: if you are fitting a family of sinusoids, then $\theta$ might encode the shared sinusoid shape, so each $\phi_i$ only needs to capture that task's amplitude and phase. If you are fitting a machine translation model, $\theta$ might capture what is shared across the family of languages. The key here is that $p(\phi_i \mid \theta)$ is a narrower distribution than the one over all possible functions, which makes it easier to learn $\phi_i$.
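To make the sinusoid example concrete, here is a minimal numpy sketch of such a task distribution. The amplitude and phase ranges are borrowed from the classic sinusoid regression benchmark and are assumptions, not something fixed by these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sinusoid_task():
    """Sample one task phi_i = (amplitude, phase) from the family theta fixes.

    Knowing theta ("every task is a sinusoid") narrows p(phi_i | theta)
    down to two scalars, instead of a distribution over all functions.
    """
    amplitude = rng.uniform(0.1, 5.0)   # task-specific parameter
    phase = rng.uniform(0.0, np.pi)     # task-specific parameter
    return lambda x: amplitude * np.sin(x - phase)

# Each call yields a new task's ground-truth function f_{phi_i}.
tasks = [sample_sinusoid_task() for _ in range(3)]
xs = np.linspace(-5, 5, 100)
ys = [f(xs) for f in tasks]
```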
Mechanistic view
This view just focuses on what meta-learning models actually are, rather than what they mean probabilistically.
In meta-supervised learning, your inputs are a task's training set $\mathcal{D}^{\text{tr}}$ and a test input $x^{\text{ts}}$, and your output is the prediction $y^{\text{ts}} = f_\theta(\mathcal{D}^{\text{tr}}, x^{\text{ts}})$. You have many datasets, one for each task.
Therefore, the meta-learning objective is just
$$\theta^\star = \arg\max_\theta \sum_{i=1}^{n} \log p\!\left(y_i^{\text{ts}} \mid x_i^{\text{ts}}, \mathcal{D}_i^{\text{tr}}; \theta\right),$$
where $\theta$ are the meta-parameters. This $f_\theta$ can be an actual model (black-box models), an optimization process, or something else.
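To make this concrete, here is a minimal PyTorch sketch of one meta-training step for a black-box $f_\theta$ that conditions on the training set. The architecture, episode format, and use of MSE are illustrative assumptions, not the course's canonical implementation.

```python
import torch
import torch.nn as nn

class BlackBoxMetaLearner(nn.Module):
    """f_theta: reads a task's training set D^tr and a query input x^ts,
    and predicts y^ts."""
    def __init__(self, in_dim=1, hidden=64):
        super().__init__()
        # Encode each (x, y) training pair, then pool into a task embedding.
        self.encoder = nn.Sequential(nn.Linear(in_dim + 1, hidden), nn.ReLU())
        # Predict y^ts from [task embedding, x^ts].
        self.head = nn.Sequential(
            nn.Linear(hidden + in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x_tr, y_tr, x_ts):
        pairs = torch.cat([x_tr, y_tr], dim=-1)     # (k, in_dim + 1)
        task_emb = self.encoder(pairs).mean(dim=0)  # pool over the support
        emb = task_emb.expand(x_ts.shape[0], -1)    # one copy per query point
        return self.head(torch.cat([emb, x_ts], dim=-1))

model = BlackBoxMetaLearner()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def meta_train_step(task_batch):
    """One gradient step on the meta-objective over a batch of tasks.

    Each task is a tuple (x_tr, y_tr, x_ts, y_ts) of float tensors.
    MSE is the Gaussian log-likelihood up to a constant, so minimizing it
    maximizes sum_i log p(y_i^ts | x_i^ts, D_i^tr; theta).
    """
    loss = torch.tensor(0.0)
    for x_tr, y_tr, x_ts, y_ts in task_batch:
        loss = loss + nn.functional.mse_loss(model(x_tr, y_tr, x_ts), y_ts)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```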
Terminology
- For the subtasks, we call the training set the **support set** and the test set the **query set**. This reduces confusion, since the meta-training and meta-testing sets are each made up of individual tasks, each with its own support and query.
- So it's actually important that we train on the sub-test set (the query set), which might sound wrong at first.
**k-shot** learning means that we learn from $k$ examples per class at meta-test time. A holy grail is 1-shot learning: feed the model a single example of each desired class, and it works on new examples of those classes.
**N-way** classification means you choose between $N$ classes.
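Putting the terminology together, here is a minimal Python sketch of sampling one N-way, k-shot episode. The function name and dataset format are assumptions for illustration.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, query_per_class=5):
    """Build one task: an N-way, k-shot support set plus a query set.

    `dataset` is assumed to be a list of (example, label) pairs with at
    least n_way classes and k_shot + query_per_class examples per class.
    """
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)

    classes = random.sample(list(by_class), n_way)  # choose N classes
    support, query = [], []
    for new_label, c in enumerate(classes):         # relabel as 0..N-1
        examples = random.sample(by_class[c], k_shot + query_per_class)
        support += [(x, new_label) for x in examples[:k_shot]]
        query += [(x, new_label) for x in examples[k_shot:]]
    return support, query
```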
Thoughts about training and testing
Your $\mathcal{D}^{\text{tr}}$ (support) and your $\mathcal{D}^{\text{ts}}$ (query) don't have to be sampled in the same way from the master dataset. You could add a lot of augmentations to the training data, like noisy labels or domain shift. This can help the model generalize to the test set, just as non-meta augmentations help an ordinary model generalize.
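As one sketch of such an augmentation, the snippet below flips a fraction of support labels while leaving the query set clean; the 10% noise rate and function name are arbitrary assumptions.

```python
import random

def corrupt_support_labels(support, n_way, noise_rate=0.1):
    """Task augmentation: randomly flip some support labels.

    The query set stays clean, so the meta-learner is pushed to be
    robust to noisy supports, analogous to ordinary data augmentation.
    """
    noisy = []
    for x, y in support:
        if random.random() < noise_rate:
            y = random.randrange(n_way)  # possibly-wrong label
        noisy.append((x, y))
    return noisy
```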