Introduction to MM-LLMs
Many people were deeply impressed when OpenAI's Sora made its debut in February 2024, seamlessly producing lifelike videos. Sora is a prime example of a multimodal LLM (MM-LLM), using text to guide the generation of videos, a research direction that has been evolving for several years. In the past year in particular, MM-LLMs have seen remarkable advancements, paving the way for a new era of AI capable of processing and generating content across multiple modalities. These MM-LLMs represent a significant evolution of traditional LLMs, as they integrate information from various sources such as text, images, and audio to enhance their understanding and generation capabilities.
It's essential to note that not all multimodal systems are MM-LLMs. While some models simply combine text and image processing, true MM-LLMs encompass a broader range of modalities and integrate them seamlessly to enhance understanding and generation capabilities. In essence, MM-LLMs augment off-the-shelf LLMs with cost-effective training strategies, enabling them to support multimodal inputs or outputs. By leveraging the inherent reasoning and decision-making capabilities of LLMs, MM-LLMs empower a diverse range of multimodal tasks spanning natural language understanding, computer vision, and audio processing.
Another notable example of an MM-LLM is OpenAI's GPT-4 with Vision (GPT-4V), which combines the language processing capabilities of the GPT series with image understanding. With GPT-4V, the model can generate text-based descriptions of images, answer questions about visual content, and even generate captions for images. Similarly, Google's Gemini and Microsoft's KOSMOS-1 are pioneering MM-LLMs that demonstrate impressive capabilities in processing both text and images.
The applications of MM-LLMs are vast and diverse. MM-LLMs can analyze text inputs along with accompanying images or audio to derive deeper insights and context. For example, they can assist in sentiment analysis of social media posts by considering both the textual content and the accompanying images. In computer vision, MM-LLMs can enhance image recognition tasks by incorporating textual descriptions or audio cues, leading to more accurate and contextually relevant results. Additionally, in applications such as virtual assistants and chatbots, MM-LLMs can leverage multimodal inputs to provide more engaging and personalized interactions with users.
Beyond these examples, MM-LLMs have the potential to improve various industries and domains, including healthcare, education, entertainment, and autonomous systems. By seamlessly integrating information from different modalities, MM-LLMs can enable AI systems to better understand and interact with the world, ultimately leading to more intelligent and human-like behavior.
In the following sections of this beginner-friendly guide, we will explore the core components, training paradigms, state-of-the-art advancements, evaluation methods, challenges, and future directions of MM-LLMs, shedding light on the exciting possibilities and implications of this groundbreaking technology.
Image Source: https://arxiv.org/pdf/2401.13601
Most MM-LLMs can be broken down into a set of key components, each differentiated by specific design choices. In this guide, we will adopt the component framework outlined in the paper "MM-LLMs: Recent Advances in Multimodal Large Language Models" (https://arxiv.org/pdf/2401.13601). These components are designed to seamlessly integrate information from diverse modalities such as text, images, videos, and audio, enabling the model to understand and generate content that spans multiple modalities.
2.1 Modality Encoder
The Modality Encoder (ME) plays a pivotal role in MM-LLMs by encoding inputs from various modalities into corresponding feature representations. Its function is akin to translating the information from different modalities into a common format that the model can process effectively. For example, the ME processes images, videos, audio, and 3D data, converting them into feature vectors that capture their essential characteristics. This step is essential for facilitating the subsequent processing of multimodal inputs by the model. Examples of Modality Encoders include ViT and OpenCLIP-based vision encoders for images, with analogous pretrained encoders used for other modalities.
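As a concrete illustration, here is a minimal sketch of encoding an image with a pretrained CLIP vision transformer via the Hugging Face transformers library. The checkpoint name and the feature shapes are just one common choice, not something prescribed by any particular MM-LLM.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# A common off-the-shelf image encoder: a CLIP ViT checkpoint.
checkpoint = "openai/clip-vit-base-patch32"
processor = CLIPImageProcessor.from_pretrained(checkpoint)
encoder = CLIPVisionModel.from_pretrained(checkpoint)

image = Image.open("example.jpg").convert("RGB")   # any input image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# Patch-level feature vectors: (batch, num_patches + 1, hidden_size), e.g. (1, 50, 768).
image_features = outputs.last_hidden_state
print(image_features.shape)
```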
2.2 Input Projector
Once the inputs from different modalities are encoded into feature representations, the Input Projector comes into play. This component aligns the encoded features of other modalities with the textual feature space, enabling the model to effectively integrate information from multiple sources. By aligning the features from different modalities with the textual features, the Input Projector ensures that the model can generate coherent and contextually relevant outputs that incorporate information from all modalities present in the input. Common implementations include a linear projector, a multi-layer perceptron (MLP), cross-attention layers, or a Q-Former.
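A minimal PyTorch sketch of an MLP-style Input Projector is shown below. The dimensions (768-d vision features mapped into a 4096-d LLM embedding space) are illustrative assumptions rather than fixed requirements.

```python
import torch
import torch.nn as nn

class InputProjector(nn.Module):
    """Maps modality-encoder features into the LLM's textual embedding space.
    This is the simple MLP variant; cross-attention or Q-Former modules
    are common alternatives."""

    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(image_features)

projector = InputProjector()
visual_tokens = projector(torch.randn(1, 50, 768))  # acts like "soft prompt" tokens for the LLM
print(visual_tokens.shape)                           # torch.Size([1, 50, 4096])
```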
2.3 LLM Backbone
At the core of MM-LLMs lies the LLM Backbone, which processes the aligned representations from the various modalities and performs semantic understanding, reasoning, and decision-making over the inputs. The LLM Backbone produces textual outputs along with signal tokens for other modalities, which act as instructions to guide the generation process. By leveraging the capabilities of pre-trained LLMs, MM-LLMs inherit properties like zero-shot generalization and few-shot learning, enabling them to generate diverse and contextually relevant multimodal content. Commonly used backbones include Flan-T5, PaLM, and LLaMA, though in principle any text-generation LLM can serve this role.
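To make the hand-off concrete, here is a hedged sketch of feeding projected visual tokens into an off-the-shelf causal LLM by concatenating them with the text embeddings. The checkpoint is only an example of a small text-generation model, and the random visual_tokens tensor stands in for the Input Projector's output.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any text-generation LLM can serve as the backbone; a small chat model keeps the sketch light.
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = AutoModelForCausalLM.from_pretrained(model_name)

prompt_ids = tokenizer("Describe the image:", return_tensors="pt").input_ids
text_embeds = llm.get_input_embeddings()(prompt_ids)       # (1, T, llm_dim)

# Stand-in for the Input Projector's output: (1, N, llm_dim)
visual_tokens = torch.randn(1, 50, text_embeds.shape[-1])

# Prepend the visual tokens to the text embeddings and let the LLM reason over both.
inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
outputs = llm(inputs_embeds=inputs_embeds)
print(outputs.logits.shape)                                 # (1, N + T, vocab_size)
```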
2.4 Output Projector
The Output Projector serves as the bridge between the LLM Backbone and the Modality Generator, mapping the signal-token representations from the LLM Backbone into features the Modality Generator can understand. This component ensures that the generated multimodal content is aligned with the textual representations produced by the model. By minimizing the distance between the mapped features and the conditional text representations, the Output Projector facilitates the generation of coherent and semantically consistent multimodal outputs. It is typically implemented as a tiny Transformer with a learnable decoder feature sequence, or as an MLP.
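Below is a rough sketch of an MLP-style Output Projector together with the alignment objective described above: the LLM's signal-token hidden states are mapped into the conditioning space of the generator's (frozen) text encoder and trained to minimize the distance to that encoder's features for the ground-truth text. All dimensions here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class OutputProjector(nn.Module):
    """Maps LLM signal-token hidden states into the Modality Generator's
    conditioning space (e.g. the 768-d text-conditioning space of an image
    diffusion model). A tiny Transformer is a common alternative to this MLP."""

    def __init__(self, llm_dim: int = 4096, cond_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, signal_hidden: torch.Tensor) -> torch.Tensor:
        # (batch, num_signal_tokens, llm_dim) -> (batch, num_signal_tokens, cond_dim)
        return self.proj(signal_hidden)

projector = OutputProjector()
signal_hidden = torch.randn(2, 77, 4096)          # hidden states at the LLM's signal tokens
mapped = projector(signal_hidden)                 # (2, 77, 768)

# Training signal: pull the mapped features toward the frozen text encoder's
# representation of the ground-truth caption (the conditional text representation).
target_text_features = torch.randn(2, 77, 768)    # placeholder for the frozen encoder's output
alignment_loss = nn.functional.mse_loss(mapped, target_text_features)
print(mapped.shape, alignment_loss.item())
```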
2.5 Modality Generator
Finally, the Modality Generator is responsible for producing outputs in distinct modalities based on the aligned textual representations. By leveraging off-the-shelf Latent Diffusion Models (LDMs), the Modality Generator synthesizes multimodal content that aligns with the input text and other modalities. During training, the Modality Generator utilizes ground truth content to learn to generate coherent and contextually relevant multimodal outputs.
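For image output, the Modality Generator is often a frozen latent diffusion model. The sketch below uses the diffusers Stable Diffusion pipeline and conditions it directly on Output Projector features via the prompt_embeds argument; the checkpoint, dtype, and the (1, 77, 768) shape are illustrative choices tied to this particular model family.

```python
import torch
from diffusers import StableDiffusionPipeline

# An off-the-shelf latent diffusion model acting as the image Modality Generator.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Features produced by the Output Projector, standing in for the usual CLIP
# text-encoder output; (batch, 77, 768) matches this model family's conditioning space.
prompt_embeds = torch.randn(1, 77, 768, dtype=torch.float16, device="cuda")

# The diffusion model stays frozen; it simply decodes the conditioning features into an image.
image = pipe(prompt_embeds=prompt_embeds, num_inference_steps=30).images[0]
image.save("generated.png")
```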
Let's understand how all these components work together in a use case: generating a caption for multimedia content, given an image and a textual description as inputs.
Here's how each component would function:
- The Modality Encoder (e.g. a CLIP ViT) converts the input image into feature vectors.
- The Input Projector maps those visual features into the LLM's textual embedding space so they can sit alongside the embeddings of the textual description.
- The LLM Backbone reasons jointly over the projected visual tokens and the text, and generates the caption as ordinary text tokens.
- Because the output here is plain text, the Output Projector and Modality Generator are not needed; they come into play when the model must emit images, audio, or video.
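Putting the pieces together, here is a condensed, illustrative sketch of that data flow. It reuses the hypothetical InputProjector from the earlier sketch and small public checkpoints; the model names, the prompt format, and the assumption that the projector's output dimension matches the LLM's embedding size are illustrative choices, not a prescribed recipe.

```python
import torch
from PIL import Image
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          CLIPImageProcessor, CLIPVisionModel)

def caption_image(image_path: str, description: str, projector: torch.nn.Module) -> str:
    # 1) Modality Encoder: image -> patch-level feature vectors.
    clip_name = "openai/clip-vit-base-patch32"
    processor = CLIPImageProcessor.from_pretrained(clip_name)
    vision = CLIPVisionModel.from_pretrained(clip_name)
    with torch.no_grad():
        feats = vision(**processor(images=Image.open(image_path).convert("RGB"),
                                   return_tensors="pt")).last_hidden_state

    # 2) Input Projector: align visual features with the LLM's embedding space.
    #    (Its output dimension must match the LLM's hidden size, 2048 for this checkpoint.)
    visual_tokens = projector(feats)

    # 3) LLM Backbone: reason over visual tokens + text and generate the caption.
    llm_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"     # any text-generation LLM would do
    tokenizer = AutoTokenizer.from_pretrained(llm_name)
    llm = AutoModelForCausalLM.from_pretrained(llm_name)
    prompt_ids = tokenizer(f"{description}\nCaption:", return_tensors="pt").input_ids
    text_embeds = llm.get_input_embeddings()(prompt_ids)
    inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
    new_ids = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=30)
    return tokenizer.decode(new_ids[0], skip_special_tokens=True)
```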
Training MM-LLMs generally involves two stages, similar to training LLMs: pretraining and instruction tuning.
During the pretraining phase of an MM-LLM, the model encounters a vast dataset comprising pairs of various modalities, such as images and text, audio and text, video and text, or other combinations, depending on the task at hand. The aim of pretraining is to initialize the model's parameters and facilitate the learning of representations that capture meaningful connections between different modalities and their respective textual descriptions.
Throughout pretraining, the MM-LLM acquires the ability to extract features from each modality and merge them to produce cohesive representations.
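To make the pretraining objective concrete, here is a hedged sketch of a single training step on one image-caption pair, under the common (but not universal) setup where the modality encoder and the LLM are kept frozen and only the Input Projector is updated with a next-token prediction loss on the caption. The component objects reuse the hypothetical sketches above.

```python
import torch

def pretraining_step(image, caption, vision_encoder, processor,
                     projector, llm, tokenizer, optimizer):
    """One illustrative pretraining step on an image-text pair.
    Assumes the caller has frozen the encoder and LLM
    (e.g. requires_grad_(False)) and optimizes only the projector."""
    # Encode the image with the frozen Modality Encoder.
    with torch.no_grad():
        feats = vision_encoder(**processor(images=image, return_tensors="pt")).last_hidden_state

    # Project visual features into the LLM's embedding space (trainable).
    visual_tokens = projector(feats)                               # (1, N, llm_dim)

    # Embed the paired caption and build the full input sequence.
    cap_ids = tokenizer(caption, return_tensors="pt").input_ids    # (1, T)
    cap_embeds = llm.get_input_embeddings()(cap_ids)
    inputs_embeds = torch.cat([visual_tokens, cap_embeds], dim=1)

    # Next-token prediction loss computed only on the caption tokens;
    # visual positions are ignored via the -100 label convention.
    ignore = torch.full(visual_tokens.shape[:2], -100, dtype=torch.long)
    labels = torch.cat([ignore, cap_ids], dim=1)
    loss = llm(inputs_embeds=inputs_embeds, labels=labels).loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```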
Typically, three primary types of paired data are utilized: image-text, video-text, and audio-text datasets.
The table below, from the MM-LLMs survey paper, lists popular training datasets and their sizes:
In the instruction tuning phase, the MM-LLM is fine-tuned to perform specific tasks by providing explicit instructions alongside the input data. Let's break down how this works using Visual Question Answering (VQA) as an example: each training sample pairs an image with a natural-language instruction (e.g. "answer the question about the image"), the question itself, and the ground-truth answer, and the model is fine-tuned so that, given the image and the instruction, it produces the expected answer.
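As an illustration of what a single instruction-tuning sample might look like, here is a hedged sketch of a simple VQA prompt template and its loss mask; the instruction wording and formatting vary across papers, so this template is just an assumption. The image itself would flow through the Modality Encoder and Input Projector as before.

```python
import torch

def build_vqa_sample(question: str, answer: str, tokenizer):
    """Format the textual part of one VQA instruction-tuning example.
    Returns input_ids and labels; only the answer tokens are supervised."""
    instruction = (
        "You are given an image. Answer the question about the image as briefly as possible.\n"
        f"Question: {question}\nAnswer: "
    )
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt", add_special_tokens=False).input_ids

    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    # Instruction and question positions are ignored (-100); the loss is computed
    # only on the answer, which is the explicit feedback that guides learning.
    labels = torch.cat([torch.full_like(prompt_ids, -100), answer_ids], dim=1)
    return input_ids, labels

# Example usage (tokenizer assumed to come from the chosen LLM backbone):
# input_ids, labels = build_vqa_sample("What color is the car?", "Red", tokenizer)
```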
Therefore, the instruction tuning phase for MM-LLMs involves fine-tuning the model on task-specific datasets while providing explicit instructions and feedback to guide the learning process, ultimately improving the model's ability to perform the desired task. In the above example, we studied a specific task: Visual Question Answering.
Image Source: https://arxiv.org/pdf/2401.13601
The above image lists some popular SoTA MM-LLMs. At a high level, different MM-LLMs vary in several key aspects: the input and output modalities they support, the design choices made for each of the components described earlier (Modality Encoder, Input Projector, LLM Backbone, Output Projector, and Modality Generator), and the data and strategies used to train them.
To choose the right MM-LLM for a specific use case, consider factors such as the modalities your application needs to handle, whether you only need multimodal understanding or also generation of non-text outputs, and the compute and data available for fine-tuning and deployment.
MM-LLMs can be evaluated using a variety of metrics and methodologies to assess their performance across different tasks and datasets, ranging from task-specific benchmarks such as VQA and image captioning to broader multimodal evaluation suites.
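As one small, concrete example of task-level evaluation, here is a sketch of an exact-match accuracy metric for VQA-style answers. Real benchmarks apply more careful answer normalization and often use additional metrics, so treat this as an illustrative simplification.

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference answer
    after lowercasing and stripping surrounding whitespace and trailing periods."""
    def normalize(text: str) -> str:
        return text.lower().strip().rstrip(".")

    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(references)

print(exact_match_accuracy(["Red", "two dogs"], ["red", "Two dogs."]))  # 1.0
```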
MM-LLMs represent a rapidly growing field with vast potential for both research advancements and practical applications.
Here are some promising directions to explore: building more powerful models (supporting more modalities, stronger LLM backbones, higher-quality instruction-tuning data, and better multimodal generation), constructing more challenging evaluation benchmarks, enabling lightweight and mobile deployment, integrating MM-LLMs with embodied agents, and supporting continual instruction tuning.