PyMuPDF4LLM is built on top of the tried-and-tested PyMuPDF library and uses it behind the scenes to provide the following:
Support for multi-column pages
Support for image and vector graphics extraction (and inclusion of references in the MD text)
Support for page chunking output
Direct support for output as LlamaIndex Documents
Multi-Column Pages
The text extraction handles document layouts with multiple columns, so “newspaper”-style layouts are supported. The resulting Markdown output faithfully represents the intended reading order.
Image Support
PyMuPDF4LLM will also output image files alongside the Markdown if we request write_images:
import pymupdf4llm
output = pymupdf4llm.to_markdown("input.pdf", write_images=True)
The result is Markdown text containing references to any images found in the document. The images are saved to the directory from which you ran the Python script, and the Markdown references them using the standard Markdown image syntax.
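To check which image files the Markdown actually points at, the references can be pulled out with a regular expression. This is a minimal sketch rather than part of PyMuPDF4LLM: the helper name image_paths is made up for illustration, and the sample string (including its file name) is hand-written to stand in for real to_markdown() output.

```python
import re

# Matches the standard Markdown image syntax: ![alt text](path)
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def image_paths(markdown_text):
    """Return the image paths referenced in a Markdown string, in order."""
    return IMAGE_REF.findall(markdown_text)


# Hand-written stand-in for to_markdown(..., write_images=True) output.
# The file name below is hypothetical, not a documented naming scheme.
sample = "# Report\n\n![](input.pdf-0-0.png)\n\nSome text.\n"
print(image_paths(sample))
```

With real output, passing the string returned by to_markdown() to image_paths should list the files to look for next to the script.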
Page Chunking
We can obtain output with enriched semantic information if we request page_chunks:
import pymupdf4llm
output = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)
This returns a list of dictionaries, one per page of the document, each with the following keys:
metadata — dictionary consisting of the document’s metadata.
toc_items — list of Table of Contents items pointing to the page.
tables — list of tables on this page.
images — list of images on the page.
graphics — list of vector graphics rectangles on the page.
text — page content as Markdown text.
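As a rough illustration of working with this structure, the sketch below counts the tables, images and graphics in each chunk and measures the length of its text. It relies only on the keys listed above; the helper name summarize_chunks and the sample data are assumptions made up for this example, and the contents of the individual list entries are left empty because their exact shape is not described here.

```python
def summarize_chunks(chunks):
    """Build one short summary dictionary per page chunk."""
    summary = []
    for index, chunk in enumerate(chunks):
        summary.append({
            "chunk": index,  # position in the list, not a PDF page label
            "tables": len(chunk.get("tables", [])),
            "images": len(chunk.get("images", [])),
            "graphics": len(chunk.get("graphics", [])),
            "characters": len(chunk.get("text", "")),
        })
    return summary


# Hand-built stand-in for to_markdown("input.pdf", page_chunks=True) output.
sample_chunks = [
    {"metadata": {}, "toc_items": [], "tables": [], "images": [{}],
     "graphics": [], "text": "# Title\n"},
    {"metadata": {}, "toc_items": [], "tables": [{}], "images": [],
     "graphics": [{}, {}], "text": "Body text"},
]

for row in summarize_chunks(sample_chunks):
    print(row)
```

In practice the list returned by to_markdown(..., page_chunks=True) would take the place of sample_chunks.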
PyMuPDF4LLM can also produce LlamaIndex Documents directly:
import pymupdf4llm
llama_reader = pymupdf4llm.LlamaMarkdownReader()
llama_docs = llama_reader.load_data("input.pdf")
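A quick way to sanity-check the loaded documents is to print the first line of each one. The sketch below assumes each document exposes its Markdown through a text attribute, as LlamaIndex Document objects generally do; the helper name preview_documents is hypothetical, and SimpleNamespace objects stand in for real documents so the example runs without LlamaIndex installed.

```python
from types import SimpleNamespace


def preview_documents(docs, width=60):
    """Return one line per document: its number and first line of text."""
    lines = []
    for number, doc in enumerate(docs, start=1):
        # Assumption: the document object has a .text attribute.
        text = getattr(doc, "text", "").strip()
        first_line = text.splitlines()[0] if text else ""
        lines.append(f"{number}: {first_line[:width]}")
    return lines


# Stand-ins for the objects returned by llama_reader.load_data("input.pdf").
stand_ins = [
    SimpleNamespace(text="# Title\n\nBody", metadata={}),
    SimpleNamespace(text="", metadata={}),
]

for line in preview_documents(stand_ins):
    print(line)
```

With real data, llama_docs would be passed in place of stand_ins.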