Multimodal Learning

Models that jointly understand multiple data types.

Multimodal learning builds models that take in and relate more than one modality — text, images, audio, video — in a shared representation. It enables systems that can describe images, answer questions about video, or generate across modalities.

Related papers