Wednesday, June 18, 2025

What is Project GraphRAG?

"Naïve RAG is great for queries where an embedding nearest neighbour

search will help you arrive at a result quickly," Larson explained. "In other

words, naïve RAG is better at finding specific phrases rather than more

abstract ideas and concepts. It is difficult for naïve RAG to retrieve all

relevant parts of abstract ideas and concepts. It has no understanding of

the dataset as a whole and can't reason holistically over it."


One question that a traditional naïve RAG approach can answer is a query

such as: "How many models of Product XYZ are we currently selling to

Customer ZYX?"


However, naïve approaches do not work as well with deeper questions such as:

"Tell me about all of my customers and give me a summary of the status

for each."


"Naïve RAG will fall short on this type of question as it doesn't have the

ability to holistically analyze the dataset," Larson continued.


GraphRAG enters the fray by improving on naïve RAG approaches based

on vector search, a method of information retrieval in which queries and

documents are mathematically represented as vectors instead of plain text.
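
To make the idea of vector search concrete, here is a minimal sketch in plain Python. The three-dimensional "embeddings", the chunk names, and the helper functions `cosine_similarity` and `nearest_chunks` are all made up for illustration; a real system would get its vectors from an embedding model and would typically use a vector index rather than a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_chunks(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy embeddings, invented for this example (not produced by any model).
chunk_vecs = {
    "product_xyz_sales": [0.9, 0.1, 0.0],
    "customer_zyx_orders": [0.8, 0.3, 0.1],
    "office_party_memo": [0.0, 0.1, 0.9],
}

# Pretend this is the embedding of the user's question.
query_vec = [1.0, 0.2, 0.0]

top = nearest_chunks(query_vec, chunk_vecs, k=2)
print(top)
```

The search only ever returns the handful of chunks closest to the query, which is why it suits lookups of specific facts and struggles with questions that need the whole dataset.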


GraphRAG uses an LLM to automate the extraction of a "rich knowledge

graph" from a collection of text documents. It reports on the semantic

structure of the data before answering user queries by detecting

"communities" of nodes and then creating a hierarchical summary of the

data to provide an overview of a dataset, with each community able to

summarise its entities and their relationships.
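
The pipeline described above can be sketched roughly as follows. This is a toy under loud assumptions, not GraphRAG's implementation: the triples are hard-coded where GraphRAG would have an LLM extract them, communities are found with plain connected components as a stand-in for a real graph clustering algorithm, only a single level of the hierarchy is built, and the "summary" is just the joined facts rather than LLM-written text. All names (`build_graph`, `detect_communities`, `summarise`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples. In GraphRAG an LLM
# would extract these from the source documents.
triples = [
    ("Alice", "works_on", "Project Apollo"),
    ("Bob", "works_on", "Project Apollo"),
    ("Carol", "works_on", "Project Zephyr"),
    ("Project Zephyr", "uses", "Widget API"),
]


def build_graph(triples):
    """Build an undirected adjacency map from the triples."""
    graph = defaultdict(set)
    for subj, _, obj in triples:
        graph[subj].add(obj)
        graph[obj].add(subj)
    return graph


def detect_communities(graph):
    """Group nodes into connected components (a simple stand-in
    for a proper community detection algorithm)."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        communities.append(sorted(members))
    return communities


def summarise(community, triples):
    """List the facts whose endpoints both lie inside the community.
    A real system would ask an LLM to write this summary."""
    members = set(community)
    facts = [f"{s} {r} {o}" for s, r, o in triples if s in members and o in members]
    return "; ".join(facts)


graph = build_graph(triples)
communities = detect_communities(graph)
summaries = [summarise(c, triples) for c in communities]

for community, summary in zip(communities, summaries):
    print(community, "->", summary)
```

Because the summaries are computed ahead of time for each community, a broad question can be answered from them instead of from a few nearest-neighbour chunks.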


Larson said: "GraphRAG enables a variety of new scenarios that naïve RAG

fails to address. We see enormous potential for business productivity as

GraphRAG takes us beyond the limitations of naïve RAG, allowing us to

reason holistically and to get past the limitations of vector search.


"For example, suppose I look at a tranche of enterprise project and design

documents and ask the question: 'What are the major projects that are

being worked on? Give me details of each project and a listing of everyone

mentioned to be working on it.'"


In contrast to naïve approaches, GraphRAG builds a memory

representation of the dataset which allows it to "clearly see and reason over

its contents and their relationships", Larson went on. "This allows you to

ask questions like 'which are the most popular products across all of our

customers' for which naïve RAG would struggle," he said.
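
A question like "most popular products across all customers" is an aggregation over every relationship in the data, not a lookup of one passage. Here is a minimal sketch of that kind of whole-dataset reasoning, assuming the customer and product relationships have already been extracted as edges. The edge list, the `bought` relation, and the `product_popularity` helper are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical (customer, relation, product) edges, invented for this example.
purchases = [
    ("Customer A", "bought", "Widget"),
    ("Customer B", "bought", "Widget"),
    ("Customer B", "bought", "Gadget"),
    ("Customer C", "bought", "Widget"),
    ("Customer C", "bought", "Gizmo"),
    ("Customer A", "bought", "Gadget"),
]


def product_popularity(edges):
    """Count how many distinct customers are linked to each product."""
    customers_by_product = {}
    for customer, relation, product in edges:
        if relation == "bought":
            customers_by_product.setdefault(product, set()).add(customer)
    return Counter({p: len(c) for p, c in customers_by_product.items()})


# Products ranked by number of distinct customers, most popular first.
ranking = product_popularity(purchases).most_common()
print(ranking)
```

A top-k vector search would see only a few of these edges at a time, so it could not produce the full ranking.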


Microsoft's own research found that GraphRAG "outperforms" naïve RAG on

comprehensiveness and diversity when using community summaries "at

any level of the community hierarchy", with a win rate of between 70% and

80%.


One remaining challenge arises when a dataset contains many files with

very similar information. How do you help a RAG system find the right

data when the files it is searching are semantically almost

indistinguishable?




References:

https://www.microsoft.com/en-us/research/project/graphrag/
