Sunday, February 16, 2025

What are the main features for PyMuPDF4LLM?

 PyMuPDF4LLM is based on top of the tried and tested PyMuPDF and utilizes the library behind the scenes to achieve the following:

Support for multi-column pages

Support for image and vector graphics extraction (and inclusion of references in the MD text)

Support for page chunking output

Direct support for output as LlamaIndex Documents

Multi-Column Pages

The text extraction can handle document layouts with multiple columns and meaning that “newspaper” type layouts are supported. The associated Markdown output will faithfully represent the intended reading order.

Image Support

PyMuPDF4LLM will also output image files alongside the Markdown if we request write_images:

import pymupdf4llm

output = pymupdf4llm.to_markdown("input.pdf", write_images=True)

The resulting output will create a markdown text output with references to any images that may have been found in the document. The images will be saved to the location from where you have run the Python script and the markdown will have logically referenced them with the correct markdown syntax for images.


Page Chunking

We can obtain output with enriched semantic information if we request page_chunks:

import pymupdf4llm

output = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)


This delivers a list of dictionary objects for each page of the document with the following schema:


metadata — dictionary consisting of the document’s metadata.

toc_items — list of Table of Contents items pointing to the page.

tables — list of tables on this page.

images — list of images on the page.

graphics — list of vector graphics rectangles on the page.

text — page content as Markdown text.

In this way page chunking allows for more structured results for your LLM input.


LlamaIndex Documents Output

If you are using LlamaIndex for your LLM application then you are in luck! PyMuPDF4LLM has a seamless integration as follows:

import pymupdf4llm

llama_reader = pymupdf4llm.LlamaMarkdownReader()

llama_docs = llama_reader.load_data("input.pdf")


With these simple 3 lines of code you will receive LLamaIndex document objects from the PDF file input for use with your LLM application!



No comments:

Post a Comment