Multimodal Data
Tags: CS 224N
Early models
- Essentially create a shared embedding space for the different modalities
- The idea is to align the relevant modalities, e.g., text with images; a popular application is image captioning (see the sketch after this list)
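A minimal sketch of this idea, with random tensors standing in for real encoder outputs and all dimensions illustrative: two projection heads map each modality into a shared space, trained with a CLIP-style symmetric contrastive loss so matched image–caption pairs score highest.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEmbedder(nn.Module):
    """Projects image and text features into one shared embedding space."""
    def __init__(self, img_dim=2048, txt_dim=768, shared_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, shared_dim)  # image -> shared space
        self.txt_proj = nn.Linear(txt_dim, shared_dim)  # text  -> shared space

    def forward(self, img_feats, txt_feats):
        # L2-normalize so dot products are cosine similarities
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return img, txt

def contrastive_loss(img, txt, temperature=0.07):
    """Symmetric loss: matched (image, caption) pairs lie on the diagonal."""
    logits = img @ txt.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(img.size(0))   # i-th image matches i-th caption
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random "features" in place of real encoder outputs
model = SharedEmbedder()
img, txt = model(torch.randn(8, 2048), torch.randn(8, 768))
loss = contrastive_loss(img, txt)
```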
Features and Fusion
For images, we typically extract relevant features using bounding boxes from an object detector (e.g., R-CNN, YOLO). For text, we typically use some pretrained embedding model, although we will look into this more later.
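As a rough sketch of these two pipelines, assuming a recent torchvision with its pretrained Faster R-CNN on the image side, and an nn.Embedding standing in for a real pretrained text embedding model (all inputs here are dummies):

```python
import torch
import torchvision

# Image side: a pretrained detector proposes boxes for salient objects
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

image = torch.rand(3, 480, 640)            # dummy RGB image in [0, 1]
with torch.no_grad():
    det = detector([image])[0]             # dict with 'boxes', 'labels', 'scores'
keep = det["scores"] > 0.5                 # keep confident detections
boxes = det["boxes"][keep]                 # (N, 4) region proposals to featurize

# Text side: a pretrained embedding model maps tokens to vectors
# (stand-in: a plain nn.Embedding; in practice e.g. BERT or word2vec)
vocab_size, emb_dim = 30000, 768
txt_embed = torch.nn.Embedding(vocab_size, emb_dim)
token_ids = torch.tensor([[101, 2023, 2003, 1037, 4937, 102]])  # toy token ids
txt_feats = txt_embed(token_ids)           # (1, seq_len, emb_dim)
```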
Combining Modalities
There are many ways of combining features from different modalities. We can group them into three categories (sketched in code after this list):
- Early fusion (mix the raw inputs, then feed them into one network)
- Middle fusion (encode each modality separately, concatenate the features, then do more processing)
- Late fusion (run a separate model per modality, combine the final scores, then backpropagate)
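A minimal sketch of the three strategies, with simple linear layers standing in for real encoders and illustrative dimensions throughout:

```python
import torch
import torch.nn as nn

d_img, d_txt, d = 2048, 768, 512

# Early fusion: mix the inputs first, then one network processes the blend
early_net = nn.Sequential(nn.Linear(d_img + d_txt, d), nn.ReLU(), nn.Linear(d, 1))
def early_fusion(img, txt):
    return early_net(torch.cat([img, txt], dim=-1))

# Middle fusion: encode each modality, concatenate features, process further
img_enc, txt_enc = nn.Linear(d_img, d), nn.Linear(d_txt, d)
mid_head = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))
def middle_fusion(img, txt):
    return mid_head(torch.cat([img_enc(img), txt_enc(txt)], dim=-1))

# Late fusion: a full model per modality, combined only at the score level
img_model = nn.Sequential(nn.Linear(d_img, d), nn.ReLU(), nn.Linear(d, 1))
txt_model = nn.Sequential(nn.Linear(d_txt, d), nn.ReLU(), nn.Linear(d, 1))
def late_fusion(img, txt):
    return img_model(img) + txt_model(txt)  # e.g., sum or average of scores

img, txt = torch.randn(4, d_img), torch.randn(4, d_txt)
for fuse in (early_fusion, middle_fusion, late_fusion):
    assert fuse(img, txt).shape == (4, 1)
```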
It turns out that for text and images, it is often sufficient to feed everything into a single Transformer as one sequence and let self-attention do the fusion.
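A minimal sketch of that recipe, with illustrative dimensions: project image region/patch features and text tokens into one sequence, add a modality embedding so the model can tell them apart, and run a standard Transformer encoder over the concatenation (positional embeddings omitted for brevity).

```python
import torch
import torch.nn as nn

d = 512
img_proj = nn.Linear(2048, d)        # project region/patch features to model dim
txt_embed = nn.Embedding(30000, d)   # toy text vocabulary
modality_embed = nn.Embedding(2, d)  # 0 = image token, 1 = text token

layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)

img_feats = torch.randn(1, 10, 2048)          # 10 image regions/patches
token_ids = torch.randint(0, 30000, (1, 12))  # 12 text tokens

img_tok = img_proj(img_feats) + modality_embed(torch.zeros(1, 10, dtype=torch.long))
txt_tok = txt_embed(token_ids) + modality_embed(torch.ones(1, 12, dtype=torch.long))

# One sequence, one Transformer: self-attention fuses the two modalities
fused = encoder(torch.cat([img_tok, txt_tok], dim=1))  # (1, 22, d)
```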