Multimodal RAG systems are AI systems capable of processing both text and non-text data.
Multimodal RAG enables more sophisticated inferences beyond what is conveyed by text alone. For example, it could analyze someone's facial expressions and speech tonality to add richer context to a meeting's transcript.
There are three basic strategies, at increasing levels of sophistication:

Level 1: Translate modalities to text
Level 2: Text-only retrieval + MLLM
Level 3: Multimodal retrieval + MLLM
In Level 1, a simple way to make a RAG system multimodal is to translate new modalities to text before storing them in the knowledge base. This could be as simple as converting meeting recordings into text transcripts, using an existing multimodal LLM (MLLM) to generate image captions, or converting tables to a readable text format (e.g., .csv or .json).
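A minimal sketch of this translation step is below. The caption_image and transcribe_audio helpers are hypothetical stand-ins for whatever MLLM and speech-to-text services you use; the file names are made up, and only the pandas table conversion is a real library call.

```python
# Sketch of Level 1: translate every modality to text before indexing.
import pandas as pd

def caption_image(path: str) -> str:
    """Hypothetical: call an MLLM to describe the image at `path`."""
    raise NotImplementedError

def transcribe_audio(path: str) -> str:
    """Hypothetical: call a speech-to-text model on the recording at `path`."""
    raise NotImplementedError

def to_text(item: dict) -> str:
    """Convert a knowledge-base item to plain text based on its modality."""
    if item["type"] == "image":
        return caption_image(item["path"])
    if item["type"] == "audio":
        return transcribe_audio(item["path"])
    if item["type"] == "table":
        # Tables can simply be serialized to a readable text format.
        return pd.read_excel(item["path"]).to_csv(index=False)
    return open(item["path"], encoding="utf-8").read()  # already text

knowledge_base = [to_text(item) for item in [
    {"type": "image", "path": "diagram.png"},
    {"type": "audio", "path": "meeting.mp3"},
    {"type": "table", "path": "metrics.xlsx"},
]]
# From here on, any ordinary text-only RAG pipeline can index `knowledge_base`.
```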
In Level 2 (text-only retrieval + MLLM), we generate text representations of every item in the knowledge base (e.g., descriptions and meta-tags) and use them for retrieval, but pass the original modality to a multimodal LLM (MLLM) at generation time.
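The sketch below illustrates this split, assuming a sentence-transformers text embedding model for retrieval. The ask_mllm wrapper, item descriptions, and file names are illustrative assumptions, not part of any particular library.

```python
# Sketch of Level 2: retrieve over text descriptions, but hand the original
# file (here, an image) to a multimodal LLM for answering.
from sentence_transformers import SentenceTransformer, util

items = [
    {"path": "q3_revenue_chart.png", "description": "Bar chart of Q3 revenue by region"},
    {"path": "org_chart.png",        "description": "Company organizational chart"},
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # text-only embedding model
doc_embs = encoder.encode([it["description"] for it in items])

def ask_mllm(question: str, image_path: str) -> str:
    """Hypothetical: send the question plus the raw image to an MLLM."""
    raise NotImplementedError

query = "Which region had the highest revenue last quarter?"
query_emb = encoder.encode(query)
best = int(util.cos_sim(query_emb, doc_embs).argmax())

# Retrieval used only the text description; generation sees the actual image.
answer = ask_mllm(query, items[best]["path"])
```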
In Level 3, we can use multimodal embeddings to perform multimodal retrieval. This works the same way as text-based vector search, but the embedding space now co-locates similar concepts independent of their original modality. The results of such a retrieval can then be passed directly to an MLLM.
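A minimal sketch of Level 3, assuming a CLIP-style model (here loaded through sentence-transformers) that embeds both images and text into a shared space; ask_mllm and the file names are again hypothetical placeholders.

```python
# Sketch of Level 3: a shared text-image embedding space lets a text query
# retrieve images directly; the retrieved image then goes to an MLLM.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

clip = SentenceTransformer("clip-ViT-B-32")  # embeds both images and text

image_paths = ["q3_revenue_chart.png", "org_chart.png", "roadmap_slide.png"]
image_embs = clip.encode([Image.open(p) for p in image_paths])

query = "Which region had the highest revenue last quarter?"
hits = util.cos_sim(clip.encode(query), image_embs)[0]
top_path = image_paths[int(hits.argmax())]

def ask_mllm(question: str, image_path: str) -> str:
    """Hypothetical: send the question plus the retrieved image to an MLLM."""
    raise NotImplementedError

answer = ask_mllm(query, top_path)  # retrieved image passed directly to the MLLM
```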
References:
https://towardsdatascience.com/multimodal-rag-process-any-file-type-with-ai-e6921342c903