Saturday, May 10, 2025

How to Extract Tables from a Document

How to Extract Table Content from Documents

If you see a table in a document, you are normally not looking at something like an embedded Excel or other identifiable object. It usually is just normal, standard text, formatted to appear as tabular data.


Extracting tabular data from such a page area therefore means that you must first identify the table area (i.e. its bounding box), then (1) graphically indicate the table and column borders, and (2) extract the text based on this information.
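
Step (2) can be illustrated with a small, library-free sketch. It assumes the column and row borders are already known and that each word comes with its bounding box as an (x0, y0, x1, y1, text) tuple in reading order. The function name assign_words_to_cells and the sample coordinates are made up for illustration; this is not how PyMuPDF implements table extraction internally.

```python
from bisect import bisect_right


def assign_words_to_cells(words, col_edges, row_edges):
    """Group words into a grid of cells using known column and row borders.

    words: iterable of (x0, y0, x1, y1, text) tuples (hypothetical sample format)
    col_edges, row_edges: sorted border coordinates, including the outer borders
    """
    n_rows, n_cols = len(row_edges) - 1, len(col_edges) - 1
    grid = [[[] for _ in range(n_cols)] for _ in range(n_rows)]
    for x0, y0, x1, y1, text in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2  # center point of the word
        col = bisect_right(col_edges, cx) - 1  # column whose borders enclose cx
        row = bisect_right(row_edges, cy) - 1  # row whose borders enclose cy
        if 0 <= row < n_rows and 0 <= col < n_cols:  # ignore words outside the table
            grid[row][col].append(text)
    return [[" ".join(cell) for cell in row] for row in grid]


# invented sample data: a 2 x 2 table with borders at x = 10, 110, 210 and y = 10, 30, 50
words = [
    (12, 12, 40, 22, "Name"), (112, 12, 130, 22, "Qty"),
    (12, 32, 45, 42, "Apple"), (112, 32, 118, 42, "3"),
]
table = assign_words_to_cells(words, col_edges=[10, 110, 210], row_edges=[10, 30, 50])
print(table)
```

The hard part in practice is finding the borders in the first place, which is what the rest of this post relies on the library for.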


This can be a very complex task, depending on details like the presence or absence of lines, rectangles or other supporting vector graphics.


The method Page.find_tables() does all that for you, with high table detection precision. Its great advantage is that it has no external library dependencies and does not need artificial intelligence or machine learning technologies. It also provides an integrated interface to pandas, the well-known Python package for data analysis.
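
Because detection depends on whether a table has ruling lines, it can help to try more than one detection strategy. Page.find_tables() accepts a strategy keyword with the documented values "lines", "lines_strict" and "text". The sketch below tries them in turn and returns the first one that finds anything. The helper name find_tables_with_fallback and the order of the strategies are assumptions for illustration, not a recommendation from the PyMuPDF documentation.

```python
def find_tables_with_fallback(page, strategies=("lines_strict", "lines", "text")):
    """Return (strategy, tables) for the first strategy that detects a table.

    page: a PyMuPDF Page object (anything with a compatible find_tables() works)
    strategies: strategy names to try in order (this order is an assumption)
    """
    for strategy in strategies:
        finder = page.find_tables(strategy=strategy)  # returns a TableFinder
        if finder.tables:  # list of detected Table objects, empty if none found
            return strategy, finder.tables
    return None, []  # no strategy detected a table
```

With a real document this would be called as strategy, tables = find_tables_with_fallback(doc[0]). Trying the line-based strategies first is a guess that they produce fewer false positives than the text-based one; the right order will depend on your documents.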


Sample tables for analysis are available here:


https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/table-analysis

https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/table-analysis/find_tables.ipynb


The sample code below, written for Google Colab, detects the tables on the first page and outlines each table and its header cells:


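import fitz  # PyMuPDF; install with: pip install pymupdf
from google.colab import drive  # only available inside Google Colab

# mount Google Drive so the PDF can be read from it
drive.mount('/content/drive')
file_path = '/content/drive/My Drive/temp/sample_pdfs/input1.pdf'

doc = fitz.open(file_path)
page = doc[0]  # first page

tabs = page.find_tables()  # detect the tables
for i, tab in enumerate(tabs):  # iterate over all tables
    for cell in tab.header.cells:
        if cell is not None:  # skip header cells without a bounding box
            page.draw_rect(cell, color=fitz.pdfcolor["red"], width=0.3)  # header cells in red
    page.draw_rect(tab.bbox, color=fitz.pdfcolor["green"])  # table boundary in green
    print(f"Table {i} column names: {tab.header.names}, external: {tab.header.external}")

# show_image() is not part of PyMuPDF; it is a display helper defined in the
# find_tables.ipynb notebook linked above and must be defined before this call
show_image(page, "Table & Header BBoxes")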


To get the actual table content, convert a detected table to a pandas DataFrame:


# choose the first table for conversion to a DataFrame (requires pandas)

tab = tabs[0]

df = tab.to_pandas()


# show the DataFrame

df
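
If pandas is not installed, or plain Python data is preferred, the same content can be handled as row lists. Table.extract() returns the table as a list of rows, each row a list of cell strings. The sketch below turns such a list into dictionaries keyed by column name. It assumes the first row holds the column names, which is not true for every table, and the helper name rows_to_records and the sample rows are invented for illustration.

```python
def rows_to_records(rows):
    """Turn [[header...], [row...], ...] into a list of dicts keyed by header.

    rows: list of rows in the shape returned by Table.extract();
          the first row is assumed to contain the column names
    """
    if not rows:
        return []
    # replace missing or empty header cells with a positional placeholder name
    header = [(name or f"col{i}").strip() for i, name in enumerate(rows[0])]
    records = []
    for row in rows[1:]:
        cells = [(cell or "").strip() for cell in row]  # treat None as empty text
        records.append(dict(zip(header, cells)))
    return records


# invented sample rows; with PyMuPDF these would come from tab.extract()
rows = [["Name", "Qty"], ["Apple", "3"], ["Pear", None]]
records = rows_to_records(rows)
print(records)
```

With a detected table this would be called as rows_to_records(tab.extract()).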



References:

https://pymupdf.readthedocs.io/en/latest/recipes-text.html#how-to-extract-table-content-from-documents

https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/table-analysis

