Wednesday, February 19, 2025

What is Document Summary Index in LlamaIndex?

The document summary index will extract a summary from each document and store that summary, as well as all nodes corresponding to the document.


Retrieval can be performed through the LLM or through embeddings. The retriever first selects the documents relevant to the query based on their summaries, then returns all nodes belonging to the selected documents.
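To make the idea concrete, here is a toy sketch in plain Python of what the index stores and how summary-based retrieval works. It is not LlamaIndex code: the class name `ToyDocSummaryIndex`, the sample summaries, and the word-overlap scoring are all assumptions made for illustration. The real index asks an LLM or an embedding model to judge relevance.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocSummaryIndex:
    """Toy stand-in: one summary per document plus that document's nodes."""

    summaries: dict[str, str] = field(default_factory=dict)
    nodes: dict[str, list[str]] = field(default_factory=dict)

    def add(self, doc_id: str, summary: str, chunks: list[str]) -> None:
        self.summaries[doc_id] = summary
        self.nodes[doc_id] = list(chunks)

    def retrieve(self, query: str, top_k: int = 1) -> list[str]:
        # Score each summary by word overlap with the query. This is only a
        # placeholder for the LLM- or embedding-based relevance scoring.
        query_words = set(query.lower().split())
        ranked = sorted(
            self.summaries,
            key=lambda doc_id: len(
                query_words & set(self.summaries[doc_id].lower().split())
            ),
            reverse=True,
        )
        # Return every node of the selected documents, not just one chunk.
        return [chunk for doc_id in ranked[:top_k] for chunk in self.nodes[doc_id]]


index = ToyDocSummaryIndex()
index.add(
    "Boston",
    "history and sports teams of boston",
    ["Boston chunk 1", "Boston chunk 2"],
)
index.add("Seattle", "climate and economy of seattle", ["Seattle chunk 1"])

result = index.retrieve("sports teams in boston")
print(result)
```

The key point is the last line of `retrieve`: once a document is selected through its summary, all of its nodes come back, which is what distinguishes this index from chunk-level vector search.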


The steps involved are as follows:


Step 1: Load Datasets

Load Wikipedia pages on different cities


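from llama_index.core import SimpleDirectoryReader

# Example titles (an assumption); each needs a matching data/<title>.txt file
wiki_titles = ["Toronto", "Seattle", "Chicago", "Boston", "Houston"]

city_docs = []
for wiki_title in wiki_titles:
    docs = SimpleDirectoryReader(
        input_files=[f"data/{wiki_title}.txt"]
    ).load_data()
    # Use the page title as the document ID so its summary can be looked up by name
    docs[0].doc_id = wiki_title
    city_docs.extend(docs)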


Step 2: Build Document Summary Index 

There are two ways of building the index:


a. Default mode of building the document summary index

b. Customizing the summary query


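from llama_index.core import DocumentSummaryIndex, get_response_synthesizer
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI

# LLM (gpt-3.5-turbo)
chatgpt = OpenAI(temperature=0, model="gpt-3.5-turbo")
# Split each document into chunks of up to 1024 tokens
splitter = SentenceSplitter(chunk_size=1024)

# default mode of building the index
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize", use_async=True
)
doc_summary_index = DocumentSummaryIndex.from_documents(
    city_docs,
    llm=chatgpt,
    transformations=[splitter],
    response_synthesizer=response_synthesizer,
    show_progress=True,
)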


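# Look up the stored summary for one document by its doc_id
doc_summary_index.get_document_summary("Boston")

# Persist the index to the "index" directory so it can be reloaded later
doc_summary_index.storage_context.persist("index")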



from llama_index.core import load_index_from_storage
from llama_index.core import StorageContext

# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="index")
doc_summary_index = load_index_from_storage(storage_context)
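For option b, the prompt used to generate each summary can be replaced. The following is a minimal sketch, assuming that `DocumentSummaryIndex` accepts a `summary_query` keyword and that `from_documents` forwards it to the constructor. The helper name `build_index_with_custom_summary` and the prompt wording are hypothetical, so check them against the version of LlamaIndex you have installed.

```python
def build_index_with_custom_summary(city_docs, llm, splitter, response_synthesizer):
    """Build the index with a custom summary prompt instead of the default one."""
    # Imported lazily so this sketch can be loaded without llama_index installed.
    from llama_index.core import DocumentSummaryIndex

    # Hypothetical prompt wording; adjust it to the kind of summary you need.
    summary_query = (
        "Summarize the provided text in a few sentences. "
        "Also list some questions that this text can answer."
    )
    return DocumentSummaryIndex.from_documents(
        city_docs,
        llm=llm,
        transformations=[splitter],
        response_synthesizer=response_synthesizer,
        summary_query=summary_query,  # assumption: forwarded to the constructor
        show_progress=True,
    )
```

It would be called with the same objects as the default build, for example `build_index_with_custom_summary(city_docs, chatgpt, splitter, response_synthesizer)`. Since retrieval is driven by the summaries, a prompt that asks for the questions a document can answer may help match summaries to user queries.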


Step 3: Perform Retrieval from the Document Summary Index
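As a rough sketch of LLM-based retrieval, the snippet below wraps `DocumentSummaryIndexLLMRetriever` from `llama_index.core.indices.document_summary`. The helper name `retrieve_with_llm` and the sample question in the usage note are assumptions for illustration. It expects the `doc_summary_index` built (or reloaded) in Step 2 and an LLM configured for LlamaIndex.

```python
def retrieve_with_llm(doc_summary_index, query: str, top_k: int = 1):
    """Return the nodes of the top_k documents whose summaries the LLM selects."""
    # Imported lazily so this sketch can be loaded without llama_index installed.
    from llama_index.core.indices.document_summary import (
        DocumentSummaryIndexLLMRetriever,
    )

    retriever = DocumentSummaryIndexLLMRetriever(
        doc_summary_index,
        choice_top_k=top_k,  # number of documents to keep after the LLM ranks them
    )
    # Each result wraps one node from a selected document.
    return retriever.retrieve(query)
```

A call such as `retrieve_with_llm(doc_summary_index, "What are the sports teams in Boston?")` should return all nodes of the best-matching document. For a synthesized answer rather than raw nodes, `doc_summary_index.as_query_engine(response_mode="tree_summarize", use_async=True)` followed by `.query(...)` is the higher-level route.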


References:

https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary/
