MinerU is a powerful open-source PDF data extraction tool developed by OpenDataLab. It intelligently converts PDF documents into structured data formats, supporting precise extraction of text, images, tables, and mathematical formulas. Whether you’re dealing with academic papers, technical documents, or business reports, MinerU makes it easy.
Key Features
🚀 Smart Cleaning - Automatically removes headers, footers, and other distracting content
📝 Structure Preservation - Retains the hierarchical structure of the original document
🖼️ Multimodal Support - Accurately extracts images, tables, and captions
➗ Formula Conversion - Automatically recognizes mathematical formulas and converts them to LaTeX
🌍 Multilingual OCR - Supports text recognition in 84 languages
💻 Cross-Platform Compatibility - Works on all major operating systems
Multilingual Support
MinerU leverages PaddleOCR to provide robust multilingual recognition, supporting over 80 languages.
When processing documents, you can optimize recognition accuracy by specifying the language parameter:
magic-pdf -p paper.pdf -o output -m auto --lang ch
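To run the same command from a script, the invocation can be assembled as an argument list and handed to subprocess. The following is a minimal sketch that uses only the flags shown in the command above (-p, -o, -m, --lang). The helper names build_magic_pdf_cmd and run_magic_pdf are hypothetical, and the sketch assumes magic-pdf is installed and available on PATH.

```python
import subprocess
from typing import List, Optional


def build_magic_pdf_cmd(pdf_path: str, output_dir: str,
                        method: str = "auto",
                        lang: Optional[str] = None) -> List[str]:
    """Assemble a magic-pdf invocation as an argument list (hypothetical helper)."""
    cmd = ["magic-pdf", "-p", pdf_path, "-o", output_dir, "-m", method]
    if lang:
        # Only pass --lang when a language code is given
        cmd += ["--lang", lang]
    return cmd


def run_magic_pdf(pdf_path: str, output_dir: str,
                  method: str = "auto", lang: Optional[str] = None) -> None:
    """Run magic-pdf and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_magic_pdf_cmd(pdf_path, output_dir, method, lang),
                   check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works anywhere
    print(" ".join(build_magic_pdf_cmd("paper.pdf", "output", lang="ch")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.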
API Integration Development
MinerU also provides a flexible Python API. Here is a complete usage example:
import json
import os

from loguru import logger
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.pipe.OCRPipe import OCRPipe
from magic_pdf.pipe.TXTPipe import TXTPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter


def pdf_parse_main(
    pdf_path: str,
    parse_method: str = 'auto',
    model_json_path: str = None,
    is_json_md_dump: bool = True,
    output_dir: str = None
):
    """
    Execute the process from pdf to json and md

    :param pdf_path: Path to the .pdf file
    :param parse_method: Parsing method, supports auto, ocr, txt, default auto
    :param model_json_path: Path to an existing model data file
    :param is_json_md_dump: Whether to save parsed data to json and md files
    :param output_dir: Output directory path
    """
    try:
        # Prepare output path (splitext keeps dots inside the file name)
        pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]
        if output_dir:
            output_path = os.path.join(output_dir, pdf_name)
        else:
            pdf_path_parent = os.path.dirname(pdf_path)
            output_path = os.path.join(pdf_path_parent, pdf_name)
        output_image_path = os.path.join(output_path, 'images')
        # Relative image folder name used inside the .md and content list
        image_path_parent = os.path.basename(output_image_path)

        # Read PDF file
        with open(pdf_path, "rb") as f:
            pdf_bytes = f.read()

        # Load existing model data if a path was given, otherwise start empty
        if model_json_path:
            with open(model_json_path, "r", encoding="utf-8") as f:
                model_json = json.load(f)
        else:
            model_json = []

        # Initialize writers
        image_writer = DiskReaderWriter(output_image_path)
        md_writer = DiskReaderWriter(output_path)

        # Select parsing method
        if parse_method == "auto":
            jso_useful_key = {"_pdf_type": "", "model_list": model_json}
            pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
        elif parse_method == "txt":
            pipe = TXTPipe(pdf_bytes, model_json, image_writer)
        elif parse_method == "ocr":
            pipe = OCRPipe(pdf_bytes, model_json, image_writer)
        else:
            logger.error("unknown parse method, only auto, ocr, txt allowed")
            return

        # Execute processing flow
        pipe.pipe_classify()  # Document classification
        if not model_json:
            pipe.pipe_analyze()  # Document analysis (skipped when model data is supplied)
        pipe.pipe_parse()  # Content parsing

        # Generate output content
        content_list = pipe.pipe_mk_uni_format(image_path_parent)
        md_content = pipe.pipe_mk_markdown(image_path_parent)

        # Save results
        if is_json_md_dump:
            # Save model results
            md_writer.write(
                content=json.dumps(pipe.model_list, ensure_ascii=False, indent=4),
                path=f"{pdf_name}_model.json"
            )
            # Save content list
            md_writer.write(
                content=json.dumps(content_list, ensure_ascii=False, indent=4),
                path=f"{pdf_name}_content_list.json"
            )
            # Save Markdown
            md_writer.write(
                content=md_content,
                path=f"{pdf_name}.md"
            )
    except Exception as e:
        logger.exception(e)


# Usage example
if __name__ == '__main__':
    pdf_path = "demo.pdf"
    pdf_parse_main(
        pdf_path=pdf_path,
        parse_method="auto",
        output_dir="./output"
    )
Note: The above code demonstrates a complete processing flow, including:
Support for multiple parsing methods (auto/ocr/txt)
Automatic creation of the output directory structure
Saving of model results, the content list, and Markdown output
Exception handling and logging
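Once the files are saved, the content list can be inspected with the standard library alone. The sketch below counts entries by kind. It assumes that the content list file is a JSON list of dictionaries and that each entry carries a "type" key; that key name is an assumption and should be checked against the actual output. The function name summarize_content_list is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_content_list(json_path: str) -> Counter:
    """Count entries per "type" in a *_content_list.json file.

    Assumption: the file holds a JSON list of dicts, each with a "type" key.
    Entries without that key are counted as "unknown".
    """
    items = json.loads(Path(json_path).read_text(encoding="utf-8"))
    return Counter(item.get("type", "unknown") for item in items)


if __name__ == "__main__":
    # Matches the usage example: demo.pdf parsed with output_dir="./output"
    path = Path("output") / "demo" / "demo_content_list.json"
    if path.exists():
        for kind, count in summarize_content_list(str(path)).most_common():
            print(f"{kind}: {count}")
    else:
        print(f"{path} not found; run pdf_parse_main first")
```

A quick count like this is a cheap sanity check that tables, images, and text were all picked up before the output is fed into a downstream system.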
Practical Application Scenarios
1. Academic Research
Batch extract research paper data
Build a literature knowledge base
Extract experimental data and charts
2. Data Analysis
Extract financial statement data
Process technical documents
Analyze research reports
3. Content Management
Document digitization
Build a search system
Build a knowledge base
4. Development Integration
RAG system development
Document processing service
Content analysis platform
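Several of these scenarios, such as batch extraction of research papers, come down to running the same parse step over a folder of PDFs. The following is a minimal sketch of that loop. The helper name batch_parse is hypothetical, and the parse function is passed in as an argument so the sketch runs without MinerU installed; in practice pdf_parse_main from the API example would be passed instead.

```python
from pathlib import Path
from typing import Callable, List


def batch_parse(input_dir: str, output_dir: str,
                parse_fn: Callable[..., None]) -> List[str]:
    """Run parse_fn on every .pdf file directly inside input_dir.

    parse_fn must accept pdf_path, parse_method, and output_dir keyword
    arguments, as pdf_parse_main does.
    Returns the processed paths in sorted order.
    """
    processed = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        parse_fn(pdf_path=str(pdf), parse_method="auto", output_dir=output_dir)
        processed.append(str(pdf))
    return processed


if __name__ == "__main__":
    # Stand-in parser for demonstration; replace with pdf_parse_main
    def fake_parse(pdf_path, parse_method, output_dir):
        print(f"would parse {pdf_path} -> {output_dir} ({parse_method})")

    batch_parse("./papers", "./output", fake_parse)
```

Because pdf_parse_main writes each document into its own subfolder of output_dir, results from different PDFs do not overwrite each other as long as the file names differ.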
References:
https://stable-learn.com/en/mineru-tutorial/