Wednesday, February 19, 2025

Metadata Replacement + Node Sentence Window based retrieval for RAG

We use the SentenceWindowNodeParser to parse documents into single sentences per node. Each node also contains a "window" with the sentences on either side of the node sentence.


Then, after retrieval and before passing the retrieved sentences to the LLM, each single sentence is replaced with a window containing the surrounding sentences, using the MetadataReplacementPostProcessor.


This is most useful for large documents/indexes, as it helps to retrieve more fine-grained details.


By default, the sentence window is 5 sentences on either side of the original sentence.
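
The window size and metadata keys are configurable when constructing the parser. A minimal setup, mirroring the linked demo notebook (which uses window_size=3; treat the exact values as tunable rather than required):

from llama_index.core.node_parser import SentenceWindowNodeParser

# one sentence per node; each node's metadata carries a "window" of
# surrounding sentences and the "original_text" of the sentence itself
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # number of sentences on either side
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)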



The whole process can be split into the steps below.



Step 1: Load Data, Build the Index

!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf


from llama_index.core import SimpleDirectoryReader


documents = SimpleDirectoryReader(

    input_files=["./IPCC_AR6_WGII_Chapter03.pdf"]

).load_data()



Extract Nodes


We extract the set of nodes that will be stored in the VectorStoreIndex. This includes both the nodes produced by the sentence window parser and the "base" nodes produced by a standard parser.
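
Here node_parser is the SentenceWindowNodeParser defined above. For the "base" nodes, a plain SentenceSplitter with default settings serves as the text_splitter (this matches the demo; any standard parser works):

from llama_index.core.node_parser import SentenceSplitter

# default chunking (1024-token chunks) for the baseline index
text_splitter = SentenceSplitter()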



nodes = node_parser.get_nodes_from_documents(documents)

base_nodes = text_splitter.get_nodes_from_documents(documents)


Build the Indexes

We build both the sentence index and the "base" index (with default chunk sizes).
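
Building a VectorStoreIndex assumes an embedding model is configured (and an LLM, for querying later). One possible setup using OpenAI models; the demo itself uses a local HuggingFace embedding model, so swap in whichever you have credentials for:

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# assumes OPENAI_API_KEY is set in the environment
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
Settings.embed_model = OpenAIEmbedding()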

from llama_index.core import VectorStoreIndex

sentence_index = VectorStoreIndex(nodes)

base_index = VectorStoreIndex(base_nodes)


Step 2: Querying

With MetadataReplacementPostProcessor


We now use the MetadataReplacementPostProcessor to replace the sentence in each retrieved node with its surrounding context.


from llama_index.core.postprocessor import MetadataReplacementPostProcessor


query_engine = sentence_index.as_query_engine(

    similarity_top_k=2,

    # the target key defaults to `window` to match the node_parser's default

    node_postprocessors=[

        MetadataReplacementPostProcessor(target_metadata_key="window")

    ],

)

window_response = query_engine.query(

    "What are the concerns surrounding the AMOC?"

)

print(window_response)


We can also check the original sentence that was retrieved for each node, as well as the actual window of sentences that was sent to the LLM.


window = window_response.source_nodes[0].node.metadata["window"]

sentence = window_response.source_nodes[0].node.metadata["original_text"]


print(f"Window: {window}")

print("------------------")

print(f"Original Sentence: {sentence}")




Contrast with normal VectorStoreIndex


query_engine = base_index.as_query_engine(similarity_top_k=2)

vector_response = query_engine.query(

    "What are the concerns surrounding the AMOC?"

)

print(vector_response)



Well, that didn't work. Let's bump up the top k! This will be slower and use more tokens compared to the sentence window index.


query_engine = base_index.as_query_engine(similarity_top_k=5)

vector_response = query_engine.query(

    "What are the concerns surrounding the AMOC?"

)

print(vector_response)



Step 3: Analysis


So the SentenceWindowNodeParser + MetadataReplacementPostProcessor combo is the clear winner here.


Embeddings at a sentence level seem to capture more fine-grained details, like the word AMOC.


We can also compare the retrieved chunks for each index!


for source_node in window_response.source_nodes:

    print(source_node.node.metadata["original_text"])

    print("--------")


Here, we can see that the sentence window index easily retrieved two nodes that talk about AMOC. Remember, the embeddings are based purely on the original sentence here, but the LLM actually ends up reading the surrounding context as well!
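
To make that concrete, we can compare the size of the embedded sentence against the window the LLM actually reads (a quick sanity check, not part of the original demo):

for source_node in window_response.source_nodes:
    meta = source_node.node.metadata
    print(
        f"embedded {len(meta['original_text'].split())} words, "
        f"LLM reads {len(meta['window'].split())} words"
    )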


Let's try to dissect why the naive vector index failed.


for node in vector_response.source_nodes:

    print("AMOC mentioned?", "AMOC" in node.node.text)

    print("--------")


So the source node at index [2] mentions AMOC, but what did this text actually look like?

print(vector_response.source_nodes[2].node.text)




Step 4: Evaluation 

We more rigorously evaluate how well the sentence window retriever works compared to the base retriever.


We define/load an eval benchmark dataset and then run different evaluations over it.


WARNING: This can be expensive, especially with GPT-4. Use caution and tune the sample size to fit your budget.



from llama_index.core.evaluation import DatasetGenerator, QueryResponseDataset


from llama_index.llms.openai import OpenAI

import nest_asyncio

import random


nest_asyncio.apply()


num_nodes_eval = 30

# there are 428 nodes total. Take the first 200 to generate questions (the back half of the doc is all references)

sample_eval_nodes = random.sample(base_nodes[:200], num_nodes_eval)

# NOTE: run this if the dataset isn't already saved

# generate questions from the largest chunks (1024)

dataset_generator = DatasetGenerator(

    sample_eval_nodes,

    llm=OpenAI(model="gpt-4"),

    show_progress=True,

    num_questions_per_chunk=2,

)


eval_dataset = await dataset_generator.agenerate_dataset_from_nodes()


eval_dataset.save_json("data/ipcc_eval_qr_dataset.json")


eval_dataset = QueryResponseDataset.from_json("data/ipcc_eval_qr_dataset.json")


import asyncio

import nest_asyncio


nest_asyncio.apply()


from llama_index.core.evaluation import (

    CorrectnessEvaluator,

    SemanticSimilarityEvaluator,

    RelevancyEvaluator,

    FaithfulnessEvaluator,

    PairwiseComparisonEvaluator,

)



from collections import defaultdict

import pandas as pd


# NOTE: can uncomment other evaluators

evaluator_c = CorrectnessEvaluator(llm=OpenAI(model="gpt-4"))

evaluator_s = SemanticSimilarityEvaluator()

evaluator_r = RelevancyEvaluator(llm=OpenAI(model="gpt-4"))

evaluator_f = FaithfulnessEvaluator(llm=OpenAI(model="gpt-4"))

# pairwise_evaluator = PairwiseComparisonEvaluator(llm=OpenAI(model="gpt-4"))


from llama_index.core.evaluation.eval_utils import (

    get_responses,

    get_results_df,

)

from llama_index.core.evaluation import BatchEvalRunner


max_samples = 30


eval_qs = eval_dataset.questions

ref_response_strs = [r for (_, r) in eval_dataset.qr_pairs]


# resetup base query engine and sentence window query engine

# base query engine

base_query_engine = base_index.as_query_engine(similarity_top_k=2)

# sentence window query engine

query_engine = sentence_index.as_query_engine(

    similarity_top_k=2,

    # the target key defaults to `window` to match the node_parser's default

    node_postprocessors=[

        MetadataReplacementPostProcessor(target_metadata_key="window")

    ],

)


import numpy as np


base_pred_responses = get_responses(

    eval_qs[:max_samples], base_query_engine, show_progress=True

)

pred_responses = get_responses(

    eval_qs[:max_samples], query_engine, show_progress=True

)


pred_response_strs = [str(p) for p in pred_responses]

base_pred_response_strs = [str(p) for p in base_pred_responses]


evaluator_dict = {

    "correctness": evaluator_c,

    "faithfulness": evaluator_f,

    "relevancy": evaluator_r,

    "semantic_similarity": evaluator_s,

}

batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)


eval_results = await batch_runner.aevaluate_responses(

    queries=eval_qs[:max_samples],

    responses=pred_responses[:max_samples],

    reference=ref_response_strs[:max_samples],

)


base_eval_results = await batch_runner.aevaluate_responses(

    queries=eval_qs[:max_samples],

    responses=base_pred_responses[:max_samples],

    reference=ref_response_strs[:max_samples],

)


results_df = get_results_df(

    [eval_results, base_eval_results],

    ["Sentence Window Retriever", "Base Retriever"],

    ["correctness", "relevancy", "faithfulness", "semantic_similarity"],

)

display(results_df)




References:

https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/MetadataReplacementDemo/
