Tuesday, February 18, 2025

What is the Camelot library for PDF extraction?

Camelot is a Python library that makes it easy to extract tables from PDF files. It's particularly useful when table data can't be copied out of a PDF cleanly (e.g., PDFs with complex layouts or tables without ruling lines). Camelot works by using a combination of image processing and text analysis to identify and extract table data.

Here's a breakdown of what Camelot does and why it's helpful:

Key Features and Benefits:

Table Detection: Camelot can automatically detect tables within a PDF, even if they aren't marked up as tables in the PDF's internal structure.   

Table Extraction: Once tables are detected, Camelot extracts the data from them and provides it in a structured format (like a Pandas DataFrame).   

Handles Different Table Types: It can handle various table formats, including tables with borders, tables without borders, and tables with complex layouts.   

Output to Pandas DataFrames: The extracted table data is typically returned as a Pandas DataFrame, making it easy to further process and analyze the data in Python.   

Command-Line Interface: Camelot also comes with a command-line interface, which can be useful for quick table extraction tasks.   
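
For illustration, here is a minimal usage sketch (assuming camelot-py and its dependencies are installed; "report.pdf" and the page range are placeholders):

import camelot

# flavor="lattice" expects ruled tables; use flavor="stream" for tables without ruling lines
tables = camelot.read_pdf("report.pdf", pages="1-2", flavor="lattice")

print(tables.n)                       # number of tables detected
df = tables[0].df                     # first table as a Pandas DataFrame
tables.export("tables.csv", f="csv")  # write every detected table to CSV files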

How it Works (Simplified):


Image Processing: Camelot often uses image processing techniques to identify the boundaries of tables within the PDF. This is especially helpful for PDFs where the tables aren't readily discernible from the underlying PDF structure.   

Text Analysis: It analyzes the text content within the identified table regions to reconstruct the table structure and extract the data.   

When to Use Camelot:


PDFs with Hard-to-Copy Tables: If copying a table out of the PDF gives you jumbled or unstructured text, Camelot is likely the right tool.

Complex Table Layouts: When tables have complex formatting, borders, or spanning cells that make standard PDF text extraction difficult, Camelot can help.   

Automating Table Extraction: If you need to extract tables from many PDFs programmatically, Camelot provides a convenient way to do this.

Limitations:


Scanned PDFs: Camelot primarily works with text-based PDFs. It does not have built-in OCR (Optical Character Recognition) capabilities. If your PDF is a scanned image, you'll need to use an OCR library (like Tesseract) first to convert the image to text before you can use Camelot.

Accuracy: While Camelot is good at table detection and extraction, its accuracy can vary depending on the complexity of the PDF and the tables. You might need to adjust some parameters or do some manual cleanup in some cases.



In summary: Camelot is a valuable library for extracting table data from PDFs, particularly when the tables are difficult to extract using other methods.  It combines image processing and text analysis to identify and extract table data, providing it in a structured format that can be easily used in Python.  Keep in mind its limitations with scanned PDFs and the potential for some inaccuracies.


References:

Gemini

What are some advanced techniques for building production-grade RAG?

 Decoupling chunks used for retrieval vs. chunks used for synthesis

Structured Retrieval for Larger Document Sets

Dynamically Retrieve Chunks Depending on your Task

Optimize context embeddings


Decoupling Chunks Used for Retrieval vs. Chunks Used for Synthesis

Key Techniques

There are two main ways to take advantage of this idea:

1. Embed a document summary, which links to chunks associated with the document.

This can help retrieve relevant documents at a high level before retrieving chunks, vs. retrieving chunks directly (which might be in irrelevant documents).

2. Embed a sentence, which then links to a window around the sentence.

This allows for finer-grained retrieval of relevant context (embedding giant chunks leads to “lost in the middle” problems), but also ensures enough context for LLM synthesis.
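
As a rough sketch of technique 2 (sentence-window retrieval) with LlamaIndex, assuming a recent llama_index release, an OpenAI key configured for the default LLM and embeddings, and a placeholder "data" folder (exact import paths can differ between versions):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

docs = SimpleDirectoryReader("data").load_data()

# Embed single sentences, but carry a window of neighbouring sentences as metadata.
parser = SentenceWindowNodeParser.from_defaults(window_size=3, window_metadata_key="window")
nodes = parser.get_nodes_from_documents(docs)
index = VectorStoreIndex(nodes)

# At query time, swap each retrieved sentence for its surrounding window before synthesis.
query_engine = index.as_query_engine(
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
print(query_engine.query("What did the narrator do during his time at Google?"))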


Structured Retrieval for Larger Document Sets

A big issue with the standard RAG stack (top-k retrieval + basic text splitting) is that it doesn't do well as the number of documents scales up - e.g. if you have 100 different PDFs. In this setting, given a query you may want to use structured information to help with more precise retrieval; for instance, if you ask a question that's only relevant to two PDFs, structured information can ensure those two PDFs are returned rather than relying on raw embedding similarity with chunks.

Key Techniques

1. Metadata Filters + Auto Retrieval: Tag each document with metadata and then store it in a vector database. During inference time, use the LLM to infer the right metadata filters to query the vector db in addition to the semantic query string (see the sketch after this list).

Pros ✅: Supported in major vector dbs. Can filter documents via multiple dimensions.

Cons 🚫: Can be hard to define the right tags. Tags may not contain enough relevant information for more precise retrieval. Also, tags represent keyword search at the document level, which doesn't allow for semantic lookups.

2. Store Document Hierarchies (summaries -> raw chunks) + Recursive Retrieval: Embed document summaries and map them to chunks per document. Fetch at the document level first before the chunk level.

Pros ✅: Allows for semantic lookups at the document level.

Cons 🚫: Doesn't allow for keyword lookups by structured tags (which can be more precise than semantic search). Also, autogenerating summaries can be expensive.
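
As a rough sketch of technique 1 (metadata filters), assuming a recent llama_index release and an embedding/LLM key configured; the documents, metadata keys, and filter values below are made-up placeholders, and in full auto-retrieval the LLM would infer the filters rather than having them hard-coded:

from llama_index.core import Document, VectorStoreIndex
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

docs = [
    Document(text="2023 D&I initiatives ...", metadata={"year": 2023, "topic": "DEI"}),
    Document(text="2022 financial summary ...", metadata={"year": 2022, "topic": "finance"}),
]
index = VectorStoreIndex.from_documents(docs)

# Hard-coded filters; an auto-retriever would infer them from the natural-language query.
retriever = index.as_retriever(
    filters=MetadataFilters(filters=[ExactMatchFilter(key="year", value=2023)]),
    similarity_top_k=2,
)
nodes = retriever.retrieve("What were the D&I initiatives?")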

Dynamically Retrieve Chunks Depending on your Task

RAG isn't just about question-answering about specific facts, which top-k similarity is optimized for. There can be a broad range of queries that a user might ask. Queries that are handled by naive RAG stacks include ones that ask about specific facts e.g. "Tell me about the D&I initiatives for this company in 2023" or "What did the narrator do during his time at Google". But queries can also include summarization e.g. "Can you give me a high-level overview of this document", or comparisons "Can you compare/contrast X and Y". All of these use cases may require different retrieval techniques.

LlamaIndex provides some core abstractions to help you do task-specific retrieval. These include the router module and the data agent module, as well as advanced query engine modules and modules that join structured and unstructured data.

You can use these modules to do joint question-answering and summarization, or even combine structured queries with unstructured queries.
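
A rough sketch of routing between a fact-lookup engine and a summarization engine with LlamaIndex's RouterQueryEngine (assuming a recent llama_index release, an OpenAI key for the default LLM, and a placeholder "data" folder):

from llama_index.core import SimpleDirectoryReader, SummaryIndex, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

docs = SimpleDirectoryReader("data").load_data()

vector_tool = QueryEngineTool.from_defaults(
    query_engine=VectorStoreIndex.from_documents(docs).as_query_engine(),
    description="Useful for questions about specific facts in the documents.",
)
summary_tool = QueryEngineTool.from_defaults(
    query_engine=SummaryIndex.from_documents(docs).as_query_engine(),
    description="Useful for high-level summaries of the documents.",
)

# The selector asks the LLM which tool fits the incoming query.
router = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[vector_tool, summary_tool],
)
print(router.query("Can you give me a high-level overview of this document?"))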

Optimize Context Embeddings

This is related to the motivation described above in "decoupling chunks used for retrieval vs. synthesis". We want to make sure that the embeddings are optimized for better retrieval over your specific data corpus. Pre-trained models may not capture the salient properties of the data relevant to your use case.

Key Techniques

Beyond some of the techniques listed above, we can also try finetuning the embedding model. We can actually do this over an unstructured text corpus, in a label-free way.
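
As one hedged illustration (using the generic sentence-transformers fit API rather than LlamaIndex's own finetuning abstractions; the base model, example pairs, and hyperparameters are placeholder assumptions), the label-free idea is to have an LLM generate a question for each corpus chunk and then train on those (question, chunk) pairs:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# (question, chunk) pairs; in the label-free setup the questions are generated by an LLM.
train_examples = [
    InputExample(texts=["What were the 2023 D&I initiatives?", "In 2023 the company launched ..."]),
    InputExample(texts=["How is retrieval evaluated?", "Retrieval quality is measured by ..."]),
]

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # placeholder base model
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # other pairs in the batch act as negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("finetuned-embeddings")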

references:

https://docs.llamaindex.ai/en/stable/optimizing/production_rag/

Monday, February 17, 2025

When using PyMuPDF4LLM, LlamaIndex is one of the output options. What are the advantages of this?

When parsing a PDF and getting the result as a LlamaIndex Document, the primary advantage is the ability to seamlessly integrate the extracted information with other data sources and readily query it using a large language model (LLM) within the LlamaIndex framework. This allows for richer, more contextual responses and analysis compared to simply extracting raw text from a PDF alone; essentially, it enables you to build sophisticated knowledge-based applications by combining data from various sources, including complex PDFs, in a unified way.

Key benefits:

Contextual Understanding:

LlamaIndex can interpret the extracted PDF data within the broader context of other related information, leading to more accurate and relevant responses when querying. 

Multi-Source Querying:

You can easily query across multiple documents, including the parsed PDF, without needing separate data processing pipelines for each source. 

Advanced Parsing with LlamaParse:

LlamaIndex provides a dedicated "LlamaParse" tool specifically designed for complex PDF parsing, including tables and figures, which can be directly integrated into your workflow. 

RAG Applications:

By representing PDF data as LlamaIndex documents, you can readily build "Retrieval Augmented Generation" (RAG) applications that can retrieve relevant information from your PDF collection based on user queries. 
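
A minimal sketch of that flow (assuming pymupdf4llm and llama-index are installed and an LLM/embedding API key is configured; "input.pdf" is a placeholder):

import pymupdf4llm
from llama_index.core import VectorStoreIndex

docs = pymupdf4llm.LlamaMarkdownReader().load_data("input.pdf")

index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
print(query_engine.query("Summarize the key points of this document."))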

references:

Gemini 



Sunday, February 16, 2025

What are the main features of PyMuPDF4LLM?

PyMuPDF4LLM is built on top of the tried and tested PyMuPDF and uses the library behind the scenes to achieve the following:

Support for multi-column pages

Support for image and vector graphics extraction (and inclusion of references in the MD text)

Support for page chunking output

Direct support for output as LlamaIndex Documents

Multi-Column Pages

The text extraction can handle document layouts with multiple columns, meaning that "newspaper" type layouts are supported. The associated Markdown output will faithfully represent the intended reading order.

Image Support

PyMuPDF4LLM will also output image files alongside the Markdown if we request write_images:

import pymupdf4llm

output = pymupdf4llm.to_markdown("input.pdf", write_images=True)

The call produces Markdown text with references to any images found in the document. The images are saved to the location from where you run the Python script, and the Markdown references them with the correct Markdown image syntax.


Page Chunking

We can obtain output with enriched semantic information if we request page_chunks:

import pymupdf4llm

output = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)


This delivers a list of dictionary objects, one for each page of the document, with the following schema:


metadata — dictionary consisting of the document’s metadata.

toc_items — list of Table of Contents items pointing to the page.

tables — list of tables on this page.

images — list of images on the page.

graphics — list of vector graphics rectangles on the page.

text — page content as Markdown text.

In this way page chunking allows for more structured results for your LLM input.
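
For example, a small sketch of inspecting the per-page dictionaries (using "input.pdf" as in the other snippets here):

import pymupdf4llm

pages = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)

for page in pages:
    print(len(page["tables"]), "tables,", len(page["images"]), "images")
    print(page["text"][:100])  # first 100 characters of the page's Markdown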


LlamaIndex Documents Output

If you are using LlamaIndex for your LLM application then you are in luck! PyMuPDF4LLM has a seamless integration as follows:

import pymupdf4llm

llama_reader = pymupdf4llm.LlamaMarkdownReader()

llama_docs = llama_reader.load_data("input.pdf")


With these simple three lines of code you will receive LlamaIndex document objects from the PDF file input, ready for use with your LLM application!



What is the Test-Time Scaling technique?

Test-Time Scaling (TTS) is a technique used to improve the performance of Large Language Models (LLMs) during inference (i.e., when the model is used to generate text or make predictions, not during training).  It works by adjusting the model's output probabilities based on the observed distribution of tokens in the generated text.   

Here's a breakdown of how it works:

Standard LLM Inference:  Typically, LLMs generate text by sampling from the probability distribution over the vocabulary at each step.  The model predicts the probability of each possible next token, and then a sampling strategy (e.g., greedy decoding, beam search, temperature sampling) is used to select the next token.   

The Problem:  LLMs can sometimes produce outputs that are repetitive, generic, or lack diversity.  This is partly because the model's probability distribution might be overconfident, assigning high probabilities to a small set of tokens and low probabilities to many others.   

Test-Time Scaling: TTS addresses this issue by introducing a scaling factor to the model's output probabilities.  This scaling factor is typically applied to the logits (the pre-softmax outputs of the model).

How Scaling Works: The scaling factor is usually less than 1. When the logits are scaled down, the probability distribution becomes "flatter" or less peaked (a short numeric example follows below). This has the effect of:

Increasing the probability of less frequent tokens: This helps to reduce repetition and encourages the model to explore a wider range of vocabulary.

Reducing the probability of highly frequent tokens: This can help to prevent the model from getting stuck in repetitive loops or generating overly generic text.   

Adaptive Scaling (Often Used):  In many implementations, the scaling factor is adaptive.  It's adjusted based on the characteristics of the generated text so far.  For example, if the generated text is becoming repetitive, the scaling factor might be decreased further to increase diversity.  Conversely, if the text is becoming too random or incoherent, the scaling factor might be increased to make the distribution more peaked.
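
To make the scaling effect concrete, here is a small illustrative example (toy logits, not from any real model):

import numpy as np

def softmax(logits):
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

logits = np.array([4.0, 2.0, 1.0, 0.5])  # toy logits for a 4-token vocabulary

print(softmax(logits))        # peaked: the top token dominates
print(softmax(0.5 * logits))  # scaling factor < 1 flattens the distribution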

Benefits of TTS:

Improved Text Quality: TTS can lead to more diverse, creative, and less repetitive text generation.

Better Performance on Downstream Tasks: For tasks like machine translation or text summarization, TTS can improve the accuracy and fluency of the generated output.

In summary: TTS is a post-processing technique applied during inference. It adjusts the LLM's output probabilities to encourage more diverse and less repetitive text generation.  By scaling down the logits, the probability distribution is flattened, making it more likely for the model to choose less frequent tokens and avoid getting stuck in repetitive loops.  Adaptive scaling makes the process even more effective by dynamically adjusting the scaling factor based on the generated text.

references:

https://www.marktechpost.com/2025/02/13/can-1b-llm-surpass-405b-llm-optimizing-computation-for-small-llms-to-outperform-larger-models/


 


Saturday, February 15, 2025

What is LLM-as-a-judge and how does it compare to RAGAS?

The idea behind LLM-as-a-judge is simple: provide an LLM with the output of your system and the ground truth answer, and ask it to score the output based on some criteria.
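
For illustration, a minimal judge could look like the sketch below (the prompt wording, 1-5 scale, and model name here are placeholder assumptions, not the setup described in the referenced post):

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading the answer produced by a RAG system.
Question: {question}
Ground truth answer: {reference}
System answer: {answer}
Give a score from 1 (wrong) to 5 (fully correct and complete), then a one-sentence justification."""

def judge(question, reference, answer):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
    )
    return response.choices[0].message.content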

The challenge is to get the judge to score according to domain-specific and problem-specific standards.

In other words, we needed to evaluate the evaluators!

First, we ran a sanity test – we used our system to generate answers based on ground truth context, and scored them using the judges.

This test confirmed that both judges behaved as expected: the answers, which were based on the actual ground truth context, scored high – both in absolute terms and in relation to the scores of running the system including the retrieval phase on the same questions. 

Next, we performed an evaluation of the correctness score by comparing it to the correctness score generated by human domain experts.

Our main focus was investigating the correlation between our various LLM-as-a-judge tools to the human-labeled examples, looking at trends rather than the absolute score values.

This method helped us deal with another risk – human experts can have a subjective perception of absolute score numbers. So instead of looking at the exact score they assigned, we focused on the relative ranking of examples.

Both RAGAS and our own judge correlated reasonably well to the human scores, with our own judge being better correlated, especially in the higher score bands.

The results convinced us that our LLM-as-a-Judge offers a sufficiently reliable mechanism for assessing our system’s quality – both for comparing the quality of system versions to each other in order to make decisions about release candidates, and for finding examples which can indicate systematic quality issues we need to address.

references:
https://www.qodo.ai/blog/evaluating-rag-for-large-scale-codebases/

What are a couple of issues with RAGAS?

RAGAS covers a number of key metrics useful in LLM evaluation, including answer correctness (later renamed to “factual correctness”) and context accuracy via precision and recall.

RAGAS implements correctness tests by converting both the generated answer and the ground truth (reference) into a series of simplified statements.

The score is essentially a grade for the level of overlap between statements from reference vs. the generated answer, combined with some weight for overall similarity between the answers.
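
To run this metric in practice, a minimal sketch with the ragas 0.1-era API looks roughly like the following (column names and the metric name have changed across ragas releases, an LLM API key is required, and the sample texts are made up):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness

data = Dataset.from_dict({
    "question": ["What does the service return on timeout?"],
    "answer": ["It returns a 504 error and retries once."],
    "ground_truth": ["On timeout the service returns HTTP 504 and retries once."],
})

result = evaluate(data, metrics=[answer_correctness])
print(result)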

When eyeballing the scores RAGAS generated, we noticed two recurring issues:

For relatively short answers, every small “missed fact” results in significant penalties.

When one of the answers was more detailed than the other, the correctness score suffered greatly, despite both answers being valid and even useful.

The latter issue was common enough, and didn’t align with our intention for the correctness metric, so we needed to find a way to evaluate the “essence” of the answers as well as the details.

references:

https://www.qodo.ai/blog/evaluating-rag-for-large-scale-codebases/