Wednesday, June 18, 2025

How to use GraphRAG with LlamaIndex?

GraphRAG (Graphs + Retrieval Augmented Generation) combines the strengths of Retrieval Augmented Generation (RAG) and Query-Focused Summarization (QFS) to effectively handle complex queries over large text datasets. While RAG excels in fetching precise information, it struggles with broader queries that require thematic understanding, a challenge that QFS addresses but cannot scale well. GraphRAG integrates these approaches to offer responsive and thorough querying capabilities across extensive, diverse text corpora.

This notebook provides guidance on constructing the GraphRAG pipeline using the LlamaIndex PropertyGraph abstractions.

GraphRAG Approach

GraphRAG involves two steps:

Graph Generation - Creates a graph over the given documents, then builds communities and their summaries.

Answering the Query - Uses the community summaries from step 1 to answer the query.

Graph Generation:

Source Documents to Text Chunks: Source documents are divided into smaller text chunks for easier processing.

Text Chunks to Element Instances: Each text chunk is analyzed to identify and extract entities and relationships, resulting in a list of tuples that represent these elements.

Element Instances to Element Summaries: The extracted entities and relationships are summarized into descriptive text blocks for each element using the LLM.

Element Summaries to Graph Communities: These entities, relationships and summaries form a graph, which is subsequently partitioned into communities using the Hierarchical Leiden algorithm to establish a hierarchical structure.

Graph Communities to Community Summaries: The LLM generates summaries for each community, providing insights into the dataset’s overall topical structure and semantics.
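The community-partitioning step above can be sketched with graspologic's hierarchical_leiden on a toy NetworkX graph. This is only an illustration of the algorithm: the toy nodes, edges and max_cluster_size value are made up; in the actual pipeline the graph is built from the extracted entities and relationships.

import networkx as nx
from graspologic.partition import hierarchical_leiden

# Toy graph standing in for the entity-relationship graph:
# two tightly connected triangles joined by one bridge edge.
G = nx.Graph()
G.add_edges_from([
    ("Alice", "Bob"), ("Bob", "Carol"), ("Carol", "Alice"),  # cluster 1
    ("Dave", "Eve"), ("Eve", "Frank"), ("Frank", "Dave"),    # cluster 2
    ("Carol", "Dave"),                                       # bridge
])

# Partition into hierarchical communities; max_cluster_size bounds
# how large a community may grow before it is split further.
partition = hierarchical_leiden(G, max_cluster_size=5)

# Each entry assigns a node to a cluster at some hierarchy level.
for item in partition:
    print(item.node, item.cluster, item.level)

Each node can appear at multiple levels of the hierarchy; the level-0 assignments give the coarsest communities, which are then summarized by the LLM.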

Answering the Query:

Community Summaries to Global Answers: The summaries of the communities are utilized to respond to user queries. This involves generating intermediate answers, which are then consolidated into a comprehensive global answer.
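The map-reduce shape of this answering step can be sketched in plain Python. The helper names and summaries below are hypothetical, and a stub stands in for the LLM calls; the real query engine would prompt the LLM once per community summary and once more to consolidate.

# Hypothetical community summaries keyed by community id.
community_summaries = {
    0: "Articles about chip manufacturing and supply chains.",
    1: "Articles about central bank interest rate decisions.",
}

def answer_from_summary(summary: str, query: str) -> str:
    # In the real pipeline an LLM answers the query from this summary;
    # here we just echo both strings for illustration.
    return f"Based on [{summary}]: partial answer to '{query}'"

def global_answer(query: str) -> str:
    # Map: one intermediate answer per community summary.
    intermediate = [
        answer_from_summary(s, query) for s in community_summaries.values()
    ]
    # Reduce: consolidate intermediate answers (another LLM call in practice).
    return "\n".join(intermediate)

print(global_answer("What are the main news themes?"))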

GraphRAG Pipeline Components

Here are the different components we implemented to build all of the processes mentioned above.

Source Documents to Text Chunks: Implemented using SentenceSplitter with a chunk size of 1024 and chunk overlap of 20 tokens.

Text Chunks to Element Instances AND Element Instances to Element Summaries: Implemented using GraphRAGExtractor.

Element Summaries to Graph Communities AND Graph Communities to Community Summaries: Implemented using GraphRAGStore.

Community Summaries to Global Answers: Implemented using GraphRAGQueryEngine.

Let's walk through each of these components and build the GraphRAG pipeline.
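The chunking component can be exercised on its own. Here is a minimal sketch using LlamaIndex's SentenceSplitter with the parameters above; the sample text is made up purely to demonstrate splitting.

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

# Same parameters as in the pipeline: 1024-token chunks, 20-token overlap.
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)

# A long synthetic document, so the splitter produces multiple chunks.
doc = Document(text="This is a sample sentence. " * 500)
nodes = splitter.get_nodes_from_documents([doc])
print(len(nodes))

Each resulting node is a text chunk that will later be fed to the extractor.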

Installation

graspologic provides the hierarchical_leiden algorithm used for building communities.


!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0


Load Data

We will use a sample news article dataset retrieved from Diffbot, which Tomaz has conveniently made available on GitHub for easy access.


The dataset contains 2,500 samples; for ease of experimentation, we will use 50 of these samples, which include the title and text of news articles.


import pandas as pd
from llama_index.core import Document

news = pd.read_csv(
    "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
)[:50]

news.head()



Prepare documents as required by LlamaIndex


documents = [
    Document(text=f"{row['title']}: {row['text']}")
    for _, row in news.iterrows()
]


Setup API Key and LLM


import os

os.environ["OPENAI_API_KEY"] = "sk-..."

from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4")




References:

https://docs.llamaindex.ai/en/stable/examples/cookbooks/GraphRAG_v1/
