Monday, August 11, 2025

What is MinerU

MinerU is an open-source PDF data extraction tool developed by OpenDataLab. It converts PDF documents into structured formats such as Markdown and JSON, extracting text, images, tables, and mathematical formulas. It is suited to academic papers, technical documents, and business reports.


Key Features

🚀 Smart Cleaning - Automatically removes headers, footers, and other distracting content

📝 Structure Preservation - Retains the hierarchical structure of the original document

🖼️ Multimodal Support - Accurately extracts images, tables, and captions

➗ Formula Conversion - Automatically recognizes mathematical formulas and converts them to LaTeX

🌍 Multilingual OCR - Supports text recognition in 84 languages

💻 Cross-Platform Compatibility - Runs on Windows, Linux, and macOS


Multilingual Support


MinerU uses PaddleOCR for multilingual recognition, supporting over 80 languages.

When processing documents, you can improve recognition accuracy by specifying the document language with the --lang option. The example below passes ch, PaddleOCR's code for Chinese:


magic-pdf -p paper.pdf -o output -m auto --lang ch
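When the same command has to be run for many files or languages, it can help to build the argument list in Python instead of concatenating strings. The sketch below uses only the flags shown in the command above; `build_magic_pdf_command` is a hypothetical helper name, not part of MinerU.

```python
from typing import List


def build_magic_pdf_command(pdf_path: str, output_dir: str,
                            method: str = "auto", lang: str = "ch") -> List[str]:
    """Return the argv list for one magic-pdf run (hypothetical helper)."""
    return [
        "magic-pdf",
        "-p", pdf_path,      # input PDF
        "-o", output_dir,    # output directory
        "-m", method,        # parsing method: auto, ocr, or txt
        "--lang", lang,      # OCR language code
    ]


if __name__ == "__main__":
    # Only prints the command; to execute it, pass the list to
    # subprocess.run(cmd, check=True) on a machine with magic-pdf installed.
    print(" ".join(build_magic_pdf_command("paper.pdf", "output")))
```

Passing a list rather than a shell string avoids quoting problems when file names contain spaces.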


API Integration Development

MinerU also provides a Python API. Here is a complete usage example:


import json
import os

from loguru import logger
from magic_pdf.pipe.OCRPipe import OCRPipe
from magic_pdf.pipe.TXTPipe import TXTPipe
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter


def pdf_parse_main(
    pdf_path: str,
    parse_method: str = 'auto',
    model_json_path: str = None,
    is_json_md_dump: bool = True,
    output_dir: str = None
):
    """
    Convert a PDF to JSON and Markdown output.

    :param pdf_path: Path to the .pdf file
    :param parse_method: Parsing method, supports auto, ocr, txt, default auto
    :param model_json_path: Path to an existing model data file
    :param is_json_md_dump: Whether to save parsed data to json and md files
    :param output_dir: Output directory path
    """
    try:
        # Prepare output path (defaults to a folder next to the PDF)
        pdf_name = os.path.basename(pdf_path).split(".")[0]
        if output_dir:
            output_path = os.path.join(output_dir, pdf_name)
        else:
            pdf_path_parent = os.path.dirname(pdf_path)
            output_path = os.path.join(pdf_path_parent, pdf_name)

        output_image_path = os.path.join(output_path, 'images')
        # Relative image folder name used inside the .md and content list
        image_path_parent = os.path.basename(output_image_path)

        # Read PDF file; the context manager closes the file handle
        with open(pdf_path, "rb") as f:
            pdf_bytes = f.read()

        # Load existing model data if a path was given, otherwise start empty
        if model_json_path:
            with open(model_json_path, "r", encoding="utf-8") as f:
                model_json = json.load(f)
        else:
            model_json = []

        # Initialize writers
        image_writer = DiskReaderWriter(output_image_path)
        md_writer = DiskReaderWriter(output_path)

        # Select parsing method
        if parse_method == "auto":
            jso_useful_key = {"_pdf_type": "", "model_list": model_json}
            pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
        elif parse_method == "txt":
            pipe = TXTPipe(pdf_bytes, model_json, image_writer)
        elif parse_method == "ocr":
            pipe = OCRPipe(pdf_bytes, model_json, image_writer)
        else:
            logger.error("unknown parse method, only auto, ocr, txt allowed")
            return

        # Execute processing flow
        pipe.pipe_classify()     # Document classification
        if not model_json:
            pipe.pipe_analyze()  # Document analysis (skipped when model data was loaded)
        pipe.pipe_parse()        # Content parsing

        # Generate output content
        content_list = pipe.pipe_mk_uni_format(image_path_parent)
        md_content = pipe.pipe_mk_markdown(image_path_parent)

        # Save results
        if is_json_md_dump:
            # Save model results
            md_writer.write(
                content=json.dumps(pipe.model_list, ensure_ascii=False, indent=4),
                path=f"{pdf_name}_model.json"
            )
            # Save content list
            md_writer.write(
                content=json.dumps(content_list, ensure_ascii=False, indent=4),
                path=f"{pdf_name}_content_list.json"
            )
            # Save Markdown
            md_writer.write(
                content=md_content,
                path=f"{pdf_name}.md"
            )

    except Exception as e:
        logger.exception(e)


# Usage example
if __name__ == '__main__':
    pdf_path = "demo.pdf"
    pdf_parse_main(
        pdf_path=pdf_path,
        parse_method="auto",
        output_dir="./output"
    )



Note: The code above demonstrates a complete processing flow, including:


Support for multiple parsing methods (auto/ocr/txt)

Automatic creation of the output directory structure

Saving of model results, content list, and Markdown output

Exception handling and logging
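To preview where those files end up before running a parse, the path logic of pdf_parse_main can be mirrored in a small standalone function. This is a sketch that repeats the same os.path calls without touching the disk; `output_layout` is a hypothetical name introduced here for illustration.

```python
import os
from typing import Dict, Optional


def output_layout(pdf_path: str, output_dir: Optional[str] = None) -> Dict[str, str]:
    """Return the paths pdf_parse_main would write to (nothing is created)."""
    # Same rule as the example: everything before the first dot is the name
    pdf_name = os.path.basename(pdf_path).split(".")[0]
    base = output_dir if output_dir else os.path.dirname(pdf_path)
    output_path = os.path.join(base, pdf_name)
    return {
        "images": os.path.join(output_path, "images"),
        "model_json": os.path.join(output_path, f"{pdf_name}_model.json"),
        "content_list": os.path.join(output_path, f"{pdf_name}_content_list.json"),
        "markdown": os.path.join(output_path, f"{pdf_name}.md"),
    }


if __name__ == "__main__":
    for kind, path in output_layout("demo.pdf", "./output").items():
        print(f"{kind}: {path}")
```

One consequence of splitting on the first dot is that a file such as paper.v2.pdf is written under the name paper, so two files that differ only after the first dot would share an output folder.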


Practical Application Scenarios

1. Academic Research

Batch extract research paper data

Build a literature knowledge base

Extract experimental data and charts
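Batch extraction mostly comes down to walking a folder and calling a per-file function. The sketch below is one possible shape, using only the standard library; `find_pdfs` and `batch_process` are hypothetical helper names, and `process_one` stands for whatever function handles a single file.

```python
from pathlib import Path
from typing import Callable, Dict, List


def find_pdfs(root: str) -> List[Path]:
    """Return all PDF files under root (recursively), sorted for a stable order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def batch_process(root: str, process_one: Callable[[str], None]) -> Dict[str, str]:
    """Call process_one on every PDF and record 'ok' or the error per file."""
    results: Dict[str, str] = {}
    for pdf in find_pdfs(root):
        try:
            process_one(str(pdf))
            results[str(pdf)] = "ok"
        except Exception as exc:
            # Keep going so that one bad file does not stop the whole batch
            results[str(pdf)] = f"failed: {exc}"
    return results
```

A function like pdf_parse_main from the API example could be passed as process_one. Because that function logs its own exceptions instead of raising them, every file would be reported as ok here, so failures would have to be read from the log.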

2. Data Analysis

Extract financial statement data

Process technical documents

Analyze research reports
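For data analysis, the content list JSON saved by the API example is usually the easiest starting point. The sketch below assumes each entry is a dictionary with a "type" key (for example text or table); that schema is an assumption and should be checked against a real output file. The helper names are hypothetical.

```python
import json
from collections import Counter
from typing import Dict, List


def load_content_list(path: str) -> List[Dict]:
    """Load a *_content_list.json file written by the parsing step."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def blocks_of_type(content_list: List[Dict], block_type: str) -> List[Dict]:
    """Keep only entries whose 'type' field equals block_type (assumed key)."""
    return [block for block in content_list if block.get("type") == block_type]


def count_block_types(content_list: List[Dict]) -> Counter:
    """Count entries per 'type'; entries without the key count as 'unknown'."""
    return Counter(block.get("type", "unknown") for block in content_list)
```

Counting block types first is a quick way to see whether a report contains any tables before writing extraction logic for them.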

3. Content Management

Document digitization

Build a search system

Build a knowledge base

4. Development Integration

RAG system development

Document processing service

Content analysis platform
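In a RAG pipeline, the Markdown output is typically split into chunks before indexing. The sketch below shows one simple approach, splitting at Markdown headings so that each chunk keeps its section title. It is a naive illustration, not part of MinerU: `chunk_markdown` is a hypothetical name, and the function does not treat headings inside code fences specially.

```python
import re
from typing import Dict, List

# Matches Markdown ATX headings such as "# Title" or "### Subsection"
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(md_text: str) -> List[Dict[str, str]]:
    """Split Markdown into chunks, one per heading, keeping the heading as title."""
    chunks: List[Dict[str, str]] = []
    title, lines = "", []

    def flush() -> None:
        # Emit the current chunk unless it has neither a title nor any text
        if title or any(line.strip() for line in lines):
            chunks.append({"title": title, "text": "\n".join(lines).strip()})

    for line in md_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Text that appears before the first heading is kept as a chunk with an empty title, so nothing from the document is dropped.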




references:

https://stable-learn.com/en/mineru-tutorial/
