Saturday, October 26, 2024

What is PyMuPDF4LLM

PyMuPDF4LLM is based on top of the tried and tested PyMuPDF and utilizes the library behind the scenes to achieve the following:


Support for multi-column pages

Support for image and vector graphics extraction (and inclusion of references in the MD text)

Support for page chunking output

Direct support for output as LlamaIndex Documents


Multi-Column Pages

The text extraction can handle document layouts with multiple columns and meaning that “newspaper” type layouts are supported. The associated Markdown output will faithfully represent the intended reading order.


Image Support

PyMuPDF4LLM will also output image files alongside the Markdown if we request write_images:

import pymupdf4llm

output = pymupdf4llm.to_markdown("input.pdf", write_images=True)

The resulting output will create a markdown text output with references to any images that may have been found in the document. The images will be saved to the location from where you have run the Python script and the markdown will have logically referenced them with the correct markdown syntax for images.

Page Chunking

We can obtain output with enriched semantic information if we request page_chunks:


import pymupdf4llm

output = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)


This delivers a list of dictionary objects for each page of the document with the following schema:


metadata — dictionary consisting of the document’s metadata.

toc_items — list of Table of Contents items pointing to the page.

tables — list of tables on this page.

images — list of images on the page.

graphics — list of vector graphics rectangles on the page.

text — page content as Markdown text.

Pymupdfllm has support in LLamaIndex.

import pymupdf4llm

llama_reader = pymupdf4llm.LlamaMarkdownReader()

llama_docs = llama_reader.load_data("input.pdf")

No comments:

Post a Comment