Multimodal RAG systems are AI systems capable of processing both text and non-text data.
Multimodal RAG enables more sophisticated inferences beyond what is conveyed by text alone. For example, it could analyze someone's facial expressions and speech tonality to add richer context to a meeting's transcript.
There are three basic strategies, at increasing levels of sophistication:

Level 1: Translate modalities to text
Level 2: Text-only retrieval + MLLM
Level 3: Multimodal retrieval + MLLM
In Level 1, a simple way to make a RAG system multimodal is to translate new modalities to text before storing them in the knowledge base. This could be as simple as converting meeting recordings into text transcripts, using an existing multimodal LLM (MLLM) to generate image captions, or converting tables to a readable text format (e.g., .csv or .json).
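A minimal sketch of this translation step is below. The caption_image and transcribe_audio helpers are hypothetical stand-ins for whatever MLLM and speech-to-text services you use; the file names are made up, and only the pandas table conversion is a real library call.

```python
# Sketch of Level 1: translate every modality to text before indexing.
import pandas as pd

def caption_image(path: str) -> str:
    """Hypothetical: call an MLLM to describe the image at `path`."""
    raise NotImplementedError

def transcribe_audio(path: str) -> str:
    """Hypothetical: call a speech-to-text model on the recording at `path`."""
    raise NotImplementedError

def to_text(item: dict) -> str:
    """Convert a knowledge-base item to plain text based on its modality."""
    if item["type"] == "image":
        return caption_image(item["path"])
    if item["type"] == "audio":
        return transcribe_audio(item["path"])
    if item["type"] == "table":
        # Tables can simply be serialized to a readable text format.
        return pd.read_excel(item["path"]).to_csv(index=False)
    return open(item["path"], encoding="utf-8").read()  # already text

knowledge_base = [to_text(item) for item in [
    {"type": "image", "path": "diagram.png"},
    {"type": "audio", "path": "meeting.mp3"},
    {"type": "table", "path": "metrics.xlsx"},
]]
# From here on, any ordinary text-only RAG pipeline can index `knowledge_base`.
```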
In Level 2 (text-only retrieval + MLLM), we generate text representations of every item in the knowledge base (e.g., descriptions and meta-tags) and use them for retrieval, but pass the original modality to a multimodal LLM (MLLM) at generation time.
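The sketch below illustrates this split, assuming a sentence-transformers text embedding model for retrieval. The ask_mllm wrapper, item descriptions, and file names are illustrative assumptions, not part of any particular library.

```python
# Sketch of Level 2: retrieve over text descriptions, but hand the original
# file (here, an image) to a multimodal LLM for answering.
from sentence_transformers import SentenceTransformer, util

items = [
    {"path": "q3_revenue_chart.png", "description": "Bar chart of Q3 revenue by region"},
    {"path": "org_chart.png",        "description": "Company organizational chart"},
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # text-only embedding model
doc_embs = encoder.encode([it["description"] for it in items])

def ask_mllm(question: str, image_path: str) -> str:
    """Hypothetical: send the question plus the raw image to an MLLM."""
    raise NotImplementedError

query = "Which region had the highest revenue last quarter?"
query_emb = encoder.encode(query)
best = int(util.cos_sim(query_emb, doc_embs).argmax())

# Retrieval used only the text description; generation sees the actual image.
answer = ask_mllm(query, items[best]["path"])
```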
In Level 3, we can use multimodal embeddings to perform multimodal retrieval. This works the same way as text-based vector search, but the embedding space now co-locates similar concepts independent of their original modality. The results of such a retrieval can then be passed directly to an MLLM.
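A minimal sketch of Level 3, assuming a CLIP-style model (here loaded through sentence-transformers) that embeds both images and text into a shared space; ask_mllm and the file names are again hypothetical placeholders.

```python
# Sketch of Level 3: a shared text-image embedding space lets a text query
# retrieve images directly; the retrieved image then goes to an MLLM.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

clip = SentenceTransformer("clip-ViT-B-32")  # embeds both images and text

image_paths = ["q3_revenue_chart.png", "org_chart.png", "roadmap_slide.png"]
image_embs = clip.encode([Image.open(p) for p in image_paths])

query = "Which region had the highest revenue last quarter?"
hits = util.cos_sim(clip.encode(query), image_embs)[0]
top_path = image_paths[int(hits.argmax())]

def ask_mllm(question: str, image_path: str) -> str:
    """Hypothetical: send the question plus the retrieved image to an MLLM."""
    raise NotImplementedError

answer = ask_mllm(query, top_path)  # retrieved image passed directly to the MLLM
```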
References:
https://towardsdatascience.com/multimodal-rag-process-any-file-type-with-ai-e6921342c903