A document summary index extracts a summary from each document and stores that summary together with all of the nodes that belong to the document.
Retrieval can be performed with the LLM or with embeddings (the latter is still a TODO). The query is first matched against the document summaries to select the relevant documents, and then all nodes belonging to the selected documents are returned.
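The two-stage retrieval described above can be illustrated with a small, library-free sketch. It is not how LlamaIndex scores documents: a simple keyword overlap stands in for the LLM or embedding relevance check, and the function names and toy summaries are made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_documents(query, summaries, top_k=1):
    """Stage 1: rank documents by how many query words their summary shares."""
    query_terms = tokenize(query)
    scored = [
        (len(query_terms & tokenize(summary)), doc_id)
        for doc_id, summary in summaries.items()
    ]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Keep at most top_k documents and drop those with no overlap at all.
    return [doc_id for score, doc_id in ranked[:top_k] if score > 0]


def retrieve(query, summaries, doc_nodes, top_k=1):
    """Stage 2: return every node that belongs to a selected document."""
    selected = select_documents(query, summaries, top_k)
    return [node for doc_id in selected for node in doc_nodes[doc_id]]


# Toy data: one summary per document, plus the nodes stored for each document.
summaries = {
    "Boston": "Boston is the capital of Massachusetts, known for its universities and sports teams.",
    "Seattle": "Seattle is a port city in Washington, home to large technology companies.",
}
doc_nodes = {
    "Boston": ["Boston chunk 1", "Boston chunk 2"],
    "Seattle": ["Seattle chunk 1"],
}

result = retrieve("Which sports teams play in Boston?", summaries, doc_nodes)
print(result)  # ['Boston chunk 1', 'Boston chunk 2']
```

The point to notice is that the query is never compared with the individual chunks: only the summaries are scored, and the chunks of the winning document come back as a group.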
The steps involved are as follows.
Step 1: Load Datasets
Load Wikipedia pages on different cities
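from llama_index.core import SimpleDirectoryReader

# wiki_titles is assumed to be a list of city names defined beforehand,
# with one text file per title under data/
city_docs = []
for wiki_title in wiki_titles:
    docs = SimpleDirectoryReader(
        input_files=[f"data/{wiki_title}.txt"]
    ).load_data()
    # Use the page title as the document ID so the summary can be looked up by city
    docs[0].doc_id = wiki_title
    city_docs.extend(docs)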
Step 2: Build Document Summary Index
There are two ways to build the index:
a. default mode of building the document summary index
b. customizing the summary query
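from llama_index.core import DocumentSummaryIndex, get_response_synthesizer
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI

# LLM (gpt-3.5-turbo)
chatgpt = OpenAI(temperature=0, model="gpt-3.5-turbo")
# Split each document into chunks of up to 1024 tokens
splitter = SentenceSplitter(chunk_size=1024)

# Default mode of building the index
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize", use_async=True
)
doc_summary_index = DocumentSummaryIndex.from_documents(
    city_docs,
    llm=chatgpt,
    transformations=[splitter],
    response_synthesizer=response_synthesizer,
    show_progress=True,
)

# Look up the generated summary by document ID
doc_summary_index.get_document_summary("Boston")

# Persist the index to the "index" directory
doc_summary_index.storage_context.persist("index")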
from llama_index.core import load_index_from_storage
from llama_index.core import StorageContext
# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="index")
doc_summary_index = load_index_from_storage(storage_context)
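Option b from Step 2, customizing the summary query, is not shown in the code above. A minimal sketch follows, assuming that `DocumentSummaryIndex` accepts a `summary_query` keyword that replaces the default summarization prompt. The helper name `build_index_with_custom_summary` and the example prompt text are hypothetical.

```python
def build_index_with_custom_summary(documents, llm, summary_query):
    """Build a document summary index using a custom summarization prompt.

    Hypothetical helper: `documents` and `llm` correspond to city_docs and
    chatgpt from the earlier steps.
    """
    # Imported inside the function so the helper can be defined even
    # where llama_index is not installed.
    from llama_index.core import DocumentSummaryIndex

    return DocumentSummaryIndex.from_documents(
        documents,
        llm=llm,
        # Assumption: summary_query overrides the default summary prompt.
        summary_query=summary_query,
        show_progress=True,
    )


# Example prompt text, made up for illustration.
CITY_SUMMARY_QUERY = (
    "Summarize this page in a few sentences, and list the kinds of "
    "questions about the city that it can answer."
)
```

It could then be called as `build_index_with_custom_summary(city_docs, chatgpt, CITY_SUMMARY_QUERY)`. A prompt that asks for answerable questions may help summary-based selection, since the summaries are what the query is matched against.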
Step 3: Perform Retrieval from the Document Summary Index
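A minimal sketch of LLM-based retrieval against the index from Step 2, using `DocumentSummaryIndexLLMRetriever` from LlamaIndex. The wrapper function `retrieve_by_summary` and the sample question are illustrative and not part of the library. Running it requires an OpenAI API key.

```python
def retrieve_by_summary(index, query, top_k=1):
    """Return the nodes of the documents whose summaries best match `query`.

    `index` is expected to be a DocumentSummaryIndex, such as the
    doc_summary_index built in Step 2. This helper is illustrative only.
    """
    # Imported inside the function so the helper can be defined even
    # where llama_index is not installed.
    from llama_index.core.indices.document_summary import (
        DocumentSummaryIndexLLMRetriever,
    )

    # The LLM reads the stored summaries and picks the top_k documents.
    retriever = DocumentSummaryIndexLLMRetriever(index, choice_top_k=top_k)
    # All nodes belonging to the selected documents are returned.
    return retriever.retrieve(query)


# Usage, given the index from Step 2:
# nodes = retrieve_by_summary(doc_summary_index, "What are the sports teams in Boston?")
# for node in nodes:
#     print(node.node.get_text()[:200])
```

For a synthesized answer rather than raw nodes, the index can also be queried through `doc_summary_index.as_query_engine(response_mode="tree_summarize", use_async=True)` followed by `.query(...)`.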
References:
https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary/