Multimodal Data

Tags	CS 224N

Early models

These models essentially create a shared embedding space for the different modalities.

The idea is to align related inputs across modalities, such as a piece of text and the image it describes. A popular application is image captioning.
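To make the shared embedding space concrete, here is a minimal sketch of how aligned embeddings could be used to pick a caption for an image. The vectors are hand-picked toy values, and `cosine` and `best_caption` are hypothetical helper names, not something from the lecture. In a real model, learned encoders would produce the embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_caption(image_vec, caption_vecs):
    """Return the caption whose embedding is closest to the image embedding."""
    return max(caption_vecs, key=lambda c: cosine(image_vec, caption_vecs[c]))

# Toy embeddings (assumed values): both modalities live in the same 3-d space.
image_embedding = [0.9, 0.1, 0.0]
caption_embeddings = {
    "a dog on grass": [0.8, 0.2, 0.1],
    "a plate of pasta": [0.0, 0.1, 0.9],
}

print(best_caption(image_embedding, caption_embeddings))
```

Because both modalities map into one space, the same similarity function works in either direction (text to image or image to text).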

Features and Fusion

For images, we typically extract features from regions found by a bounding-box detector (R-CNN, YOLO). For text, we typically use a pretrained embedding model, although we will look more into this later.

Combining Modalities

There are many ways of combining features from different modalities. We can group them into three categories:

Early (mix the inputs, then feed them into one network)

Middle (concatenate per-modality features, then do more processing)

End (combine the final per-modality scores, then backprop)
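The three categories differ only in where the modalities meet. The sketch below illustrates this with toy linear scorers; the function names, the `alpha` mixing weight, and the encoder/model callables are all assumptions made for illustration, and training (backprop) is left out.

```python
def linear(weights, x):
    """A toy linear scorer: the dot product of weights and inputs."""
    return sum(w * xi for w, xi in zip(weights, x))

def early_fusion(image_input, text_input, weights):
    # Mix the raw inputs first; one model sees the joint input.
    joint_input = image_input + text_input  # list concatenation
    return linear(weights, joint_input)

def middle_fusion(image_input, text_input, image_encoder, text_encoder, head_weights):
    # Encode each modality separately, concatenate the features,
    # then do more processing (here a single linear head).
    features = image_encoder(image_input) + text_encoder(text_input)
    return linear(head_weights, features)

def end_fusion(image_input, text_input, image_model, text_model, alpha=0.5):
    # Each modality produces its own final score; only the scores are combined.
    # This is often called late fusion.
    return alpha * image_model(image_input) + (1 - alpha) * text_model(text_input)

# Toy usage with assumed inputs and stand-in encoders.
image, text = [1.0, 2.0], [3.0]
print(early_fusion(image, text, [0.1, 0.2, 0.3]))
print(middle_fusion(image, text, lambda x: [sum(x)], lambda x: [2 * v for v in x], [1.0, 0.5]))
print(end_fusion(image, text, lambda x: sum(x), lambda x: x[0]))
```

The earlier the fusion, the more the network can model interactions between modalities; the later the fusion, the more each per-modality model can be built and trained independently.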

It turns out that for text and images, it is often sufficient to stick everything into a transformer.
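One way to read "stick everything into a transformer" is sketched below: image region features and word embeddings are placed in one token sequence, and self-attention lets every token attend to every other token regardless of modality. This is a simplified illustration under several assumptions: both modalities are already projected to the same dimension, a per-modality type vector is added to each token, and the attention uses identity query/key/value projections with a single head. The helper names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def build_joint_sequence(region_feats, word_embs, image_type, text_type):
    """Tag each token with its modality type vector and concatenate the sequences."""
    def add(x, t):
        return [a + b for a, b in zip(x, t)]
    return [add(r, image_type) for r in region_feats] + [add(w, text_type) for w in word_embs]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections."""
    d = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all tokens, image and text alike.
        outputs.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return outputs

# Toy usage: two image regions and one word, all 2-dimensional (assumed values).
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[0.5, 0.5]]
sequence = build_joint_sequence(regions, words, image_type=[0.1, 0.0], text_type=[0.0, 0.1])
mixed = self_attention(sequence)
print(len(mixed), len(mixed[0]))
```

A real transformer would add learned projections, multiple heads, feed-forward layers, and positional information; the point here is only that fusion happens inside attention rather than in a hand-designed combination step.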