How to Extract Table Content from Documents
If you see a table in a document, you are normally not looking at an embedded Excel sheet or any other identifiable object. It is usually just standard text, formatted to appear as tabular data.
Extracting tabular data from such a page area therefore means that you must first identify the table area (i.e. its bounding box), then (1) mark the table and column borders graphically, and (2) extract the text based on this information.
This can be a very complex task, depending on details like the presence or absence of lines, rectangles or other supporting vector graphics.
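The two steps above can be illustrated with a minimal pure-Python sketch. It assumes the table area and its column borders are already known, and that every word comes with a bounding box as an (x0, y0, x1, y1, text) tuple. The names assign_to_cells, col_edges and row_tol, as well as the sample words, are hypothetical. This is only an illustration of the idea, not how Page.find_tables() works internally.

```python
from bisect import bisect_right


def assign_to_cells(words, col_edges, row_tol=3.0):
    """Group words into table rows and columns.

    words:     iterable of (x0, y0, x1, y1, text) tuples inside the table area
    col_edges: sorted x-coordinates of the column borders (n columns -> n + 1 edges)
    row_tol:   words whose top edge is within this distance of a row's first
               word are put into that row (hypothetical tuning parameter)
    """
    n_cols = len(col_edges) - 1

    # Step 1: group words into rows by their top coordinate.
    lines = []  # each entry: [row_top, [words in that row]]
    for w in sorted(words, key=lambda w: w[1]):
        if lines and w[1] - lines[-1][0] <= row_tol:
            lines[-1][1].append(w)
        else:
            lines.append([w[1], [w]])

    # Step 2: within each row, place the words left to right into columns.
    table = []
    for _, line_words in lines:
        cells = [""] * n_cols
        for x0, y0, x1, y1, text in sorted(line_words, key=lambda w: w[0]):
            # the column is chosen by the horizontal centre of the word
            col = bisect_right(col_edges, (x0 + x1) / 2) - 1
            if 0 <= col < n_cols:  # ignore words outside the table area
                cells[col] = f"{cells[col]} {text}".strip()
        table.append(cells)
    return table


# Hypothetical sample: two columns with borders at x = 0, 100 and 200.
sample_words = [
    (10, 100, 40, 110, "Name"), (120, 100, 150, 110, "Qty"),
    (10, 120.5, 30, 130, "Green"), (34, 120, 60, 130, "tea"),
    (120, 120, 130, 130, "2"),
]
print(assign_to_cells(sample_words, [0, 100, 200]))
```

In PyMuPDF, page.get_text("words") returns tuples that begin with these same five fields, so its output could be fed into such a function after trimming the extra items. Real documents add complications that this sketch ignores, such as merged cells, multi-line cells and missing border lines.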
The method Page.find_tables() does all of that for you, with high table detection precision. Its great advantage is that it has no external library dependencies and does not need artificial intelligence or machine learning technologies. It also provides an integrated interface to pandas, the well-known Python package for data analysis.
Sample tables for analysis are available here:
https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/table-analysis
https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/table-analysis/find_tables.ipynb
The sample code below runs in Google Colab:
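import fitz  # PyMuPDF
from google.colab import drive

# mount Google Drive so the sample PDF can be read from Colab
drive.mount('/content/drive')
file_path = '/content/drive/My Drive/temp/sample_pdfs/input1.pdf'
doc = fitz.open(file_path)
page = doc[0]  # first page of the document
tabs = page.find_tables()  # detect the tables
for i, tab in enumerate(tabs):  # iterate over all tables
    for cell in tab.header.cells:
        page.draw_rect(cell, color=fitz.pdfcolor["red"], width=0.3)  # header cells
    page.draw_rect(tab.bbox, color=fitz.pdfcolor["green"])  # table bounding box
    print(f"Table {i} column names: {tab.header.names}, external: {tab.header.external}")
# show_image() is not part of PyMuPDF; it is assumed to be a display helper
# like the one defined in the find_tables.ipynb notebook linked above
show_image(page, "Table & Header BBoxes")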
To get the actual content, convert a detected table to a pandas DataFrame:
# choose the first table for conversion to a DataFrame
tab = tabs[0]
df = tab.to_pandas()
# show the DataFrame (displayed automatically as the last expression in a notebook cell)
df
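If pandas is not needed, the same content can also be handled as plain Python lists. The sketch below assumes rows shaped like the result of PyMuPDF's Table.extract(), i.e. a list of rows where each row is a list of cell strings and empty cells may be None. The helper name rows_to_csv and the sample rows are hypothetical and only serve as an illustration.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize table rows (lists of cell strings, possibly None) as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        # empty cells may arrive as None; line breaks inside a cell are flattened
        writer.writerow([" ".join((cell or "").split()) for cell in row])
    return buf.getvalue()


# Hypothetical rows; with PyMuPDF they would come from: rows = tab.extract()
sample_rows = [["Name", "Qty"], ["Green\ntea", "2"], ["Coffee", None]]
print(rows_to_csv(sample_rows))
```

Flattening line breaks inside cells is a design choice made here for compact CSV output. Keep the original cell text if the line structure matters.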
References:
https://pymupdf.readthedocs.io/en/latest/recipes-text.html#how-to-extract-table-content-from-documents
https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/table-analysis