"Naïve RAG is great for queries where an embedding nearest neighbour
search will help you arrive at a result quickly," Larson explained. "In other
words, naïve RAG is better at finding specific phrases rather than more
abstract ideas and concepts. It is difficult for naïve RAG to retrieve all
relevant parts of abstract ideas and concepts. It has no understanding of
the dataset as a whole and can't reason holistically over it."
One question that a traditional naïve RAG approach can answer is a query
such as: "How many models of Product XYZ are we currently selling to
Customer ZYX?"
However, naïve approaches do not work so well with deeper questions such as:
"Tell me about all of my customers and give me a summary of the status
for each."
"Naïve RAG will fall short on this type of question as it doesn't have the
ability to holistically analyze the dataset," Larson continued.
GraphRAG enters the fray by improving on naïve RAG approaches based
on vector search, a method of information retrieval in which queries and
documents are represented mathematically as vectors rather than as plain text.
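As a rough illustration of vector search, the sketch below ranks documents by cosine similarity to a query. It assumes a toy bag-of-words "embedding" in place of a learned embedding model, and the names embed, cosine, nearest and DOCS are hypothetical, chosen only for this example.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts (an assumed stand-in for a learned model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest(query, documents, k=1):
    """Return the k documents whose vectors are closest to the query vector."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

# Illustrative documents only; not taken from any real dataset.
DOCS = [
    "customer zyx buys three models of product xyz",
    "quarterly travel policy for all staff",
    "design notes for the new billing service",
]

if __name__ == "__main__":
    print(nearest("how many models of product xyz for customer zyx", DOCS))
```

A specific, phrase-heavy query lands on the one document that shares its wording, which is the case the article says naïve RAG handles well. Nothing in this ranking step looks at the dataset as a whole.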
GraphRAG uses an LLM to automate the extraction of a "rich knowledge
graph" from a collection of text documents. Before answering user queries,
it reports on the semantic structure of the data by detecting
"communities" of nodes and then creating a hierarchical summary of the
data. This provides an overview of the dataset, with a summary for each
community covering its entities and their relationships.
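The idea of grouping graph nodes into communities and summarising each one can be sketched as follows. This is a simplified illustration, not GraphRAG's implementation: the triples are hypothetical examples of what an LLM extraction step might emit, connected components stand in for a real community detection algorithm, the "summary" is just a list of facts rather than LLM-written text, and only one level of the hierarchy is shown.

```python
from collections import defaultdict

# Hypothetical (source, relation, target) triples; illustrative only.
TRIPLES = [
    ("Alice", "works_on", "Project Atlas"),
    ("Bob", "works_on", "Project Atlas"),
    ("Carol", "works_on", "Project Borealis"),
    ("Project Borealis", "uses", "Billing Service"),
]

def build_graph(triples):
    """Build an undirected adjacency map from the triples."""
    graph = defaultdict(set)
    for source, _relation, target in triples:
        graph[source].add(target)
        graph[target].add(source)
    return graph

def find_communities(graph):
    """Group nodes by connected component (assumed stand-in for community detection)."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        communities.append(sorted(members))
    return communities

def summarise(community, triples):
    """List a community's entities and the relationships among them."""
    members = set(community)
    facts = [f"{s} {r} {t}" for s, r, t in triples if s in members and t in members]
    return {"entities": community, "relationships": facts}

if __name__ == "__main__":
    for community in find_communities(build_graph(TRIPLES)):
        print(summarise(community, TRIPLES))
```

Each per-community summary is small enough to read in full, so a question about the whole dataset can be answered from the summaries rather than from a handful of nearest-neighbour chunks.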
Larson said: "GraphRAG enables a variety of new scenarios that naïve RAG
fails to address. We see enormous potential for business productivity as
GraphRAG takes us beyond the limitations of naïve RAG, allowing us to
reason holistically and to get past the limitations of vector search.
"For example, suppose I look at a tranche of enterprise project and design
documents and ask the question: 'What are the major projects that are
being worked on? Give me details of each project and a listing of everyone
mentioned to be working on it.'"
In contrast to naïve approaches, GraphRAG builds a memory
representation of the dataset which allows it to "clearly see and reason over
its contents and their relationships", Larson went on. "This allows you to
ask questions like 'which are the most popular products across all of our
customers', with which naïve RAG would struggle," he said.
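To make the "most popular products" example concrete, here is a minimal sketch of why an explicit relationship structure helps with dataset-wide questions. It assumes the relationships have already been extracted as (customer, relation, product) edges; the edge data, the "buys" relation name and the function most_popular_products are all hypothetical.

```python
from collections import Counter

# Hypothetical (customer, relation, product) edges; illustrative only.
EDGES = [
    ("Customer A", "buys", "Product X"),
    ("Customer B", "buys", "Product X"),
    ("Customer B", "buys", "Product Y"),
    ("Customer C", "buys", "Product X"),
    ("Customer C", "buys", "Product Y"),
    ("Customer C", "buys", "Product Z"),
]

def most_popular_products(edges, top_n=2):
    """Count distinct customers per product and return the top_n products."""
    buyers = {}
    for customer, relation, product in edges:
        if relation == "buys":
            buyers.setdefault(product, set()).add(customer)
    counts = Counter({product: len(customers) for product, customers in buyers.items()})
    return counts.most_common(top_n)

if __name__ == "__main__":
    print(most_popular_products(EDGES))
```

The answer depends on every customer's edges at once, so no single retrieved passage contains it. That is the kind of aggregate question a nearest-neighbour lookup alone is poorly suited to.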
Microsoft's own research found that GraphRAG "outperforms" naïve RAG on
comprehensiveness and diversity when using community summaries "at
any level of the community hierarchy", with a win rate of between 70% and
80%.
One challenge arises when many of the files in your data contain very
similar information. How do you help your RAG system find the right
data when the search is looking at files with very similar semantic
content?
References:
https://www.microsoft.com/en-us/research/project/graphrag/