Multimodal Data

Tags: CS 224N

Early models

Features and Fusion

For images, we typically extract features from bounding-box regions produced by an object detector (R-CNN, YOLO). For text, we typically use a pretrained embedding model, although we will look more into this later.

Combining Modalities

There are many ways of combining features from different modalities. They fall into three categories:

  1. Early (mix the inputs, then feed them into a single network)
  2. Middle (encode each modality separately, concatenate the features, then do more processing)
  3. Late (each modality produces its own scores; combine the final scores, then backprop)
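The three options above (early, middle, and end-stage, often called late fusion) can be sketched with toy vectors. This is only an illustration under stated assumptions: the feature sizes, the random weights, the two-class output, and the helper names (`linear`, `relu`, `random_matrix`, and the three `*_fusion` functions) are all hypothetical and not taken from the lecture. Averaging is used for the late case, but other score combinations are possible.

```python
import random

def linear(x, weights):
    """Apply a weight matrix (a list of rows) to the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def relu(x):
    return [max(0.0, v) for v in x]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
image_feat = [rng.uniform(-1.0, 1.0) for _ in range(4)]  # hypothetical image features
text_feat = [rng.uniform(-1.0, 1.0) for _ in range(3)]   # hypothetical text features

def early_fusion(img, txt, w_hidden, w_out):
    # Mix the inputs first, then run one shared network on the joint vector.
    joint = img + txt
    return linear(relu(linear(joint, w_hidden)), w_out)

def middle_fusion(img, txt, w_img, w_txt, w_out):
    # Encode each modality separately, concatenate the features, process further.
    hidden = relu(linear(img, w_img)) + relu(linear(txt, w_txt))
    return linear(hidden, w_out)

def late_fusion(img, txt, w_img_cls, w_txt_cls):
    # Each modality produces its own class scores; only the scores are combined.
    s_img = linear(img, w_img_cls)
    s_txt = linear(txt, w_txt_cls)
    return [(a + b) / 2 for a, b in zip(s_img, s_txt)]

# Assumed sizes: 2 output classes, small hidden layers.
w_hidden = random_matrix(5, 7, rng)
w_out_early = random_matrix(2, 5, rng)
w_img = random_matrix(3, 4, rng)
w_txt = random_matrix(3, 3, rng)
w_out_middle = random_matrix(2, 6, rng)
w_img_cls = random_matrix(2, 4, rng)
w_txt_cls = random_matrix(2, 3, rng)

early_scores = early_fusion(image_feat, text_feat, w_hidden, w_out_early)
middle_scores = middle_fusion(image_feat, text_feat, w_img, w_txt, w_out_middle)
late_scores = late_fusion(image_feat, text_feat, w_img_cls, w_txt_cls)
```

The only difference between the three functions is where the modalities meet: at the input, at the hidden features, or at the output scores.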

It turns out that for text and images, it is often sufficient to feed the features of both modalities into a single transformer as one sequence.
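A rough sketch of what putting everything into a transformer can look like: image and text embeddings are concatenated into one sequence, and self-attention lets every token attend to every other token regardless of modality. The embeddings below are made up, they are assumed to already share a dimension of 4, and `self_attention` is a hypothetical helper that uses identity query/key/value projections and omits positional and modality embeddings, so it is far simpler than a real transformer layer.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (a simplification)."""
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Scaled dot-product score of this token against every token in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all token vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, tokens)) for i in range(d)])
    return outputs, weights

# Hypothetical embeddings, assumed to be projected to a shared dimension already.
image_tokens = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]
text_tokens = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]

# One joint sequence: the attention layer sees both modalities at once.
sequence = image_tokens + text_tokens
outputs, attn = self_attention(sequence)
```

Here `attn[2][0]` is the weight the first text token places on the first image token, which is how information crosses between modalities without a separate fusion step.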