Hands-on: Extract Text and Tables from Invoices and Forms
Extracting structured information, such as text, tables, and key-value pairs, from documents like invoices and forms is a fundamental task in document processing. This guide provides a practical, step-by-step approach using powerful open-source tools: Tesseract OCR, EasyOCR, and PDFPlumber.
Objectives
- Extract text from scanned or digital invoices and forms.
- Identify and extract tabular data.
- Structure the extracted data for further processing or storage (e.g., JSON, CSV, database).
Prerequisites
Before you begin, ensure you have the necessary Python packages installed.
pip install pytesseract pillow easyocr opencv-python pdfplumber
Additionally, you need to install the Tesseract OCR engine itself:
- Windows: Download from Tesseract OCR releases on GitHub.
- Linux (Debian/Ubuntu):
sudo apt update
sudo apt install tesseract-ocr
- macOS:
brew install tesseract
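To confirm that both the Python bindings and the Tesseract binary are installed correctly, a quick check like the following can help (a minimal sketch; the Windows path in the comment is only an example and should be adjusted to your actual install location):
import pytesseract

# On Windows, if tesseract.exe is not on your PATH, point pytesseract at it explicitly.
# The path below is only an example; adjust it to your actual install location.
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

try:
    print("Tesseract version:", pytesseract.get_tesseract_version())
except pytesseract.TesseractNotFoundError:
    print("Tesseract binary not found. Check your installation and PATH.")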
Method 1: Extract Text Using Tesseract OCR
Tesseract is a robust open-source OCR engine that can be integrated into Python using the pytesseract library.
Step 1: Import Required Libraries
import pytesseract
from PIL import Image
import cv2
Step 2: Load and Preprocess Image
For optimal OCR results, it's often beneficial to preprocess the image by converting it to grayscale.
image_path = 'invoice_sample.png' # Replace with your image file path
image = cv2.imread(image_path)
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
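Grayscale conversion alone is often enough, but for noisy scans it can also help to denoise and binarize before running OCR. The following is a minimal sketch building on gray_image from above; the blur kernel and the use of Otsu thresholding are illustrative choices that usually need tuning per document:
# Optional extra preprocessing for noisy scans (illustrative values).
# A light blur suppresses speckle noise; Otsu thresholding then binarizes the image.
blurred = cv2.GaussianBlur(gray_image, (3, 3), 0)
_, binary_image = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# You can pass binary_image to pytesseract in the next step instead of gray_image.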
Step 3: Run OCR with Tesseract
pytesseract.image_to_string() converts the image content into a string. Custom configuration options (--oem and --psm) can improve accuracy depending on the document type.
- --oem 3: Use the default Tesseract OCR Engine Mode (LSTM).
- --psm 6: Assume a single uniform block of text.
custom_config = r'--oem 3 --psm 6'
extracted_text = pytesseract.image_to_string(gray_image, config=custom_config)
print("--- Extracted Text (Tesseract) ---")
print(extracted_text)
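If you also need word-level positions and confidence scores, for example to locate fields such as totals or dates on the page, pytesseract provides image_to_data. A minimal sketch reusing gray_image and custom_config from above (the confidence cutoff of 60 is arbitrary):
# Word-level OCR output: each entry has text, a bounding box, and a confidence score.
data = pytesseract.image_to_data(gray_image, config=custom_config,
                                 output_type=pytesseract.Output.DICT)

for i, word in enumerate(data['text']):
    confidence = float(data['conf'][i])
    if word.strip() and confidence > 60:  # arbitrary confidence cutoff
        x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
        print(f"{word!r} at (x={x}, y={y}, w={w}, h={h}), confidence {confidence:.0f}")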
Method 2: Extract Text Using EasyOCR
EasyOCR is a more modern OCR library known for its ease of use and effectiveness, especially with noisy or varied documents.
Step 1: Import and Initialize Reader
EasyOCR supports multiple languages. Initialize the Reader with the desired language codes.
import easyocr
# Initialize the reader for English. Add more language codes if needed (e.g., ['en', 'fr']).
reader = easyocr.Reader(['en'])
Step 2: Read Text from Image
The readtext() method returns a list of detected text, including bounding box coordinates and confidence scores.
image_path = 'invoice_sample.png' # Replace with your image file path
results = reader.readtext(image_path)
print("\n--- Extracted Text (EasyOCR) ---")
for (bbox, text, prob) in results:
    print(f"Detected text: '{text}' (Confidence: {prob:.2f})")
Note: EasyOCR often performs better than Tesseract on noisy or handwritten documents.
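Because each detection includes its bounding box, you can filter out low-confidence results and roughly restore reading order by sorting on the top-left corner of each box. A minimal sketch using the results list from above (the 0.5 confidence threshold is arbitrary):
# Keep reasonably confident detections and sort them top-to-bottom, then left-to-right.
# Each bbox is a list of four corner points; bbox[0] is the top-left corner [x, y].
confident = [(bbox, text, prob) for (bbox, text, prob) in results if prob > 0.5]
confident.sort(key=lambda r: (r[0][0][1], r[0][0][0]))

easyocr_text = "\n".join(text for (_, text, _) in confident)
print(easyocr_text)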
Method 3: Extract Tables Using PDFPlumber (For Digital PDFs)
PDFPlumber is excellent for extracting text and tables directly from digital PDF files. It understands the structure of digital PDFs, making table extraction straightforward.
Step 1: Load PDF and Extract Table
Open the PDF file and access its pages. extract_text() returns all textual content, while extract_tables() specifically targets tabular data.
import pdfplumber
pdf_path = 'invoice_sample.pdf' # Replace with your PDF file path
try:
    with pdfplumber.open(pdf_path) as pdf:
        # Process the first page
        first_page = pdf.pages[0]

        # Extract all text
        page_text = first_page.extract_text()
        print("--- Extracted Text (PDFPlumber) ---")
        print(page_text)

        # Extract tables
        # extract_tables() returns a list of tables, where each table is a list of rows.
        tables = first_page.extract_tables()
        print("\n--- Extracted Table(s) (PDFPlumber) ---")
        if tables:
            # Assuming you're interested in the first table found
            first_table = tables[0]
            for row in first_table:
                print(row)
        else:
            print("No tables found on this page.")
except FileNotFoundError:
    print(f"Error: The file '{pdf_path}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")
Important Note: The extract_tables() method works best with digital PDFs that have defined table structures. It will not work reliably on scanned PDFs, which are essentially images.
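If you have pandas installed (an extra dependency not listed in the prerequisites), the extracted table converts naturally into a DataFrame, which makes cleaning and exporting easier. A minimal sketch assuming first_table from the code above, with its first row used as the header:
import pandas as pd

# Assumes 'first_table' was extracted above and its first row contains the column headers.
if 'first_table' in locals() and first_table:
    df = pd.DataFrame(first_table[1:], columns=first_table[0])
    print(df.head())
    df.to_csv("extracted_table_pandas.csv", index=False)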
Optional: Use OpenCV for Table Line Detection (for Scanned Documents)
For scanned tables where rows and columns are not explicitly defined (i.e., the PDF is an image), you can use computer vision techniques with OpenCV to detect table lines. This can help in segmenting the table structure.
import cv2
import numpy as np
# Assuming 'image' is your loaded image from Method 1 (or loaded similarly)
# If you are working with a scanned PDF page converted to an image:
# image_path = 'scanned_invoice_page.png'
# image = cv2.imread(image_path)
if 'image' not in locals():
    print("Please load an image first for line detection.")
else:
    # Convert to grayscale if not already
    gray_for_lines = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Apply thresholding to get a binary image
    # Adjust the threshold value (e.g., 150) based on your image's brightness and contrast.
    # THRESH_BINARY_INV makes text/lines white and background black.
    thresh = cv2.threshold(gray_for_lines, 150, 255, cv2.THRESH_BINARY_INV)[1]

    # Detect horizontal lines
    # Kernel size determines how long a line needs to be to be detected.
    # A larger width (e.g., 40) detects longer horizontal lines.
    horizontal_kernel_size = (40, 1)
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, horizontal_kernel_size)
    detected_horizontal_lines = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)

    # Detect vertical lines
    # A larger height (e.g., 40) detects longer vertical lines.
    vertical_kernel_size = (1, 40)
    vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, vertical_kernel_size)
    detected_vertical_lines = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)

    # Combine horizontal and vertical lines to form a table grid
    table_grid = cv2.addWeighted(detected_horizontal_lines, 0.5, detected_vertical_lines, 0.5, 0)

    # You can further process `table_grid` to find contours of cells
    # and then apply OCR to each cell for more precise extraction.
    # For example, find contours:
    # cnts = cv2.findContours(table_grid, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # ... process contours to identify cells ...

    print("\n--- Table Line Detection (OpenCV) ---")
    print("OpenCV processed image for table line detection. Examine `thresh`, `detected_horizontal_lines`, `detected_vertical_lines`, and `table_grid` for visualization.")

    # To visualize, you would typically use cv2.imshow() or save the images.
    # cv2.imshow("Thresholded", thresh)
    # cv2.imshow("Horizontal Lines", detected_horizontal_lines)
    # cv2.imshow("Vertical Lines", detected_vertical_lines)
    # cv2.imshow("Table Grid", table_grid)
    # cv2.waitKey(0)
    # cv2.destroyAllWindows()
This method provides the building blocks for detecting table structures. Combining this with contour analysis and applying OCR to individual cells is a more advanced technique for scanned documents.
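As an illustration of that more advanced step, the sketch below finds cell-like regions from the combined grid, crops each one from the original image, and runs Tesseract on it. It assumes the image and table_grid variables from the block above and an OpenCV 4.x installation; the size filters and the --psm 7 setting are rough heuristics to adapt per document:
import cv2
import pytesseract

# Binarize the combined grid so every line pixel becomes 255.
_, grid_binary = cv2.threshold(table_grid, 0, 255, cv2.THRESH_BINARY)

# With RETR_TREE, the regions enclosed by the grid lines (the cells) appear as contours
# alongside the outer table boundary. (OpenCV 4.x returns (contours, hierarchy).)
contours, _ = cv2.findContours(grid_binary, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

img_h, img_w = grid_binary.shape[:2]
cells = []
for contour in contours:
    x, y, w, h = cv2.boundingRect(contour)
    # Rough heuristics: skip tiny specks and anything close to the full page or table size.
    if 20 < w < 0.9 * img_w and 10 < h < 0.9 * img_h:
        cells.append((x, y, w, h))

# Sort cells roughly top-to-bottom, then left-to-right.
cells.sort(key=lambda box: (box[1], box[0]))

for (x, y, w, h) in cells:
    cell_image = image[y:y + h, x:x + w]
    # --psm 7 treats the crop as a single line of text; adjust per document.
    cell_text = pytesseract.image_to_string(cell_image, config='--psm 7').strip()
    print(f"Cell at ({x}, {y}, {w}x{h}): {cell_text!r}")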
Exporting Extracted Data
Once you have extracted data, you'll want to save it in a usable format.
Save Text as .txt
# Assuming 'extracted_text' contains the text from Tesseract or EasyOCR
if 'extracted_text' in locals() and extracted_text:
    with open("extracted_text.txt", "w", encoding='utf-8') as file:
        file.write(extracted_text)
    print("\nExtracted text saved to 'extracted_text.txt'")
elif 'page_text' in locals() and page_text:
    with open("extracted_text.txt", "w", encoding='utf-8') as file:
        file.write(page_text)
    print("\nExtracted text saved to 'extracted_text.txt'")
Save Table as CSV
import csv
# Assuming 'first_table' contains the extracted table data from PDFPlumber
if 'first_table' in locals() and first_table:
    with open("extracted_table.csv", "w", newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerows(first_table)
    print("Extracted table saved to 'extracted_table.csv'")
Bonus: Convert to Structured JSON
Often, you'll need to map extracted text and table data to a predefined schema, such as JSON, for easier consumption by applications.
import json
# Example of manually structuring extracted data into a dictionary
# In a real-world scenario, you'd use parsing logic to identify fields
structured_data = {
    "Vendor": "Example Corp",        # Manually identified or extracted
    "InvoiceNumber": "INV-9876",     # Manually identified or extracted
    "Date": "2023-10-27",            # Manually identified or extracted
    "TotalAmount": 1500.75,          # Manually identified or extracted
    "Currency": "USD",               # Manually identified or extracted
    "LineItems": [
        {"Description": "Product A", "Quantity": 2, "UnitPrice": 500.00, "Amount": 1000.00},
        {"Description": "Service B", "Quantity": 1, "UnitPrice": 500.75, "Amount": 500.75}
    ]
}
# If you extracted a table, you could convert its rows into line items
if 'first_table' in locals() and first_table:
    # Example conversion (assuming table has columns like: 'Item', 'Qty', 'Price', 'Total')
    # You'll need to adapt this based on your actual table structure and data types.
    structured_data["LineItems"] = []
    for row in first_table[1:]:  # Skip header row if present
        try:
            item_data = {
                "Description": row[0],
                "Quantity": int(row[1]),
                "UnitPrice": float(row[2].replace('$', '').replace(',', '')),
                "Amount": float(row[3].replace('$', '').replace(',', ''))
            }
            structured_data["LineItems"].append(item_data)
        except (ValueError, IndexError) as e:
            print(f"Could not parse row {row}: {e}")

with open("invoice_output.json", "w", encoding='utf-8') as json_file:
    json.dump(structured_data, json_file, indent=4)
print("Structured data saved to 'invoice_output.json'")
Applications
The techniques described are widely applicable in various domains:
- Automated Invoice Processing: Streamline accounts payable by automatically extracting invoice data.
- Form Digitization: Convert paper forms (e.g., HR applications, insurance claims, medical records) into digital, searchable formats.
- Expense Report Generation: Automatically extract details from receipts.
- Receipt Recognition: For accounting and personal finance management.
- Data Entry Automation: Reduce manual data input across various industries.
Conclusion
Extracting text and tables from documents like invoices and forms is achievable using powerful open-source libraries such as Tesseract, EasyOCR, and PDFPlumber. By combining these tools with preprocessing techniques and data formatting (like CSV or JSON), you can build robust automation solutions.
For more advanced scenarios, such as understanding complex document layouts, extracting specific fields without explicit table structures, or handling very challenging document quality, consider exploring more sophisticated models like LayoutLM, Donut, or fine-tuned deep learning approaches.
SEO Keywords
Invoice text extraction, Table extraction from PDFs, Tesseract OCR tutorial, EasyOCR invoice processing, PDFPlumber table extraction, Extract tables from scanned documents, Automated form digitization, Python OCR tools for invoices, Export OCR data to JSON, Open-source document processing.
Interview Questions
- What are the main differences between Tesseract OCR and EasyOCR for text extraction?
- Tesseract is a mature, powerful engine with extensive language support and customization but can require more preprocessing. EasyOCR is often easier to set up and performs well on a wider variety of document types out-of-the-box, especially those with noise or variations.
- How do you preprocess images before applying OCR in Python?
- Common preprocessing steps include converting to grayscale, resizing, noise reduction (e.g., Gaussian blur), binarization (thresholding), and deskewing (correcting image rotation). Libraries like OpenCV (cv2) are typically used for these tasks; a deskew sketch follows this answer.
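As a concrete example of the deskewing step, here is a minimal OpenCV sketch. It assumes a binary image with white text on a black background (such as the THRESH_BINARY_INV output used earlier); the angle estimate from cv2.minAreaRect is a common heuristic that may need adjusting for your OpenCV version or for sparse pages:
import cv2
import numpy as np

def deskew(binary_image):
    # Collect the coordinates of all white (text) pixels and fit a rotated rectangle around them.
    coords = np.column_stack(np.where(binary_image > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # Classic normalization for angles in (-90, 0]; OpenCV >= 4.5 reports angles in [0, 90),
    # so this mapping may need adapting.
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    h, w = binary_image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    return cv2.warpAffine(binary_image, matrix, (w, h),
                          flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)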
- Explain how PDFPlumber helps in extracting tables from digital PDFs.
- PDFPlumber analyzes the internal structure of digital PDFs, identifying text blocks, lines, and geometric relationships. It uses this information to infer table boundaries and extract data into structured formats (lists of lists), unlike image-based OCR which treats the page as a pixel grid.
- What challenges arise when extracting tables from scanned documents versus digital PDFs?
- Scanned: Tables lack inherent digital structure. They are essentially images. Challenges include image quality (resolution, skew, noise), inconsistent line thickness, merged cells, and variations in table layout. OCR might require complex image processing (line detection) and cell-by-cell analysis.
- Digital: Tables have defined structure. Challenges are usually related to how the PDF creator defined the table (e.g., complex layouts, embedded images representing text, or tables not using standard PDF objects), which PDFPlumber might sometimes misinterpret.
- How can OpenCV be used to detect table structures in images?
- OpenCV can detect table structures by:
- Converting the image to binary (thresholding).
- Using morphological operations (opening, closing) with carefully chosen kernels to isolate horizontal and vertical lines.
- Combining these detected lines to form a grid.
- Finding contours on this grid to identify individual cells or the overall table bounding box.
- Describe a workflow to convert extracted invoice data into structured JSON format.
- Extraction: Use OCR (Tesseract/EasyOCR) for text and PDFPlumber (for digital PDFs) or OpenCV-based analysis (for scanned tables) to extract raw data.
- Parsing/Mapping: Implement logic (rule-based, regular expressions, or even machine learning models) to identify key fields (e.g., Invoice Number, Date, Total) and line items from the extracted text and tables.
- Structuring: Populate a Python dictionary or object with the identified data according to a predefined JSON schema.
- Serialization: Use the json library to serialize the Python object into a JSON string and save it to a file.
- When would you prefer EasyOCR over Tesseract OCR for document processing?
- Prefer EasyOCR when dealing with:
- Noisy or low-quality scanned documents.
- Documents with varied fonts or handwriting.
- Quick setup and good out-of-the-box performance.
- Need for accurate bounding box information for each detected text segment.
- Tesseract might be preferred for:
- High-volume processing where fine-tuning specific configurations or language packs yields better results.
- Documents with very clean, standard text.
- Integration into existing systems already reliant on Tesseract.
- How do you handle multi-language support using OCR libraries?
- Tesseract: Requires downloading and installing language data packs (tessdata files, e.g., fra, deu). You then pass the language code to pytesseract, for example pytesseract.image_to_string(image, lang='fra').
- EasyOCR: Initialize the Reader with a list of language codes, e.g., easyocr.Reader(['en', 'fr']).
- What are some limitations of rule-based table detection methods, and how can deep learning help?
- Rule-based limitations: Highly dependent on clear, consistent lines and grid structures. Struggle with:
- Tables without visible borders.
- Tables with merged cells or complex nested structures.
- Variations in line thickness or style.
- Documents with heavy noise or artifacts.
- Deep Learning: Models like LayoutLM or specialized object detection models can learn to recognize table structures, cells, and even semantic meaning (e.g., header vs. data cell) directly from pixel data, making them more robust to variations and capable of handling tables without explicit lines. They can also perform end-to-end extraction of structured data.
- Can you explain how to combine OCR output with layout analysis models like LayoutLM for enhanced document understanding?
- LayoutLM models are pre-trained on large datasets of documents and understand both the text content and the spatial layout.
- Workflow:
- OCR: Extract raw text and bounding boxes for each word/token using Tesseract or EasyOCR.
- Input Preparation: Format the extracted text and bounding box information into the input format expected by the LayoutLM model (often including token IDs, attention masks, and layout embeddings).
- LayoutLM Processing: Feed this prepared input into a fine-tuned LayoutLM model (e.g., for Named Entity Recognition or Question Answering tasks related to invoices).
- Output Interpretation: The model predicts labels for text tokens (e.g., "Vendor Name", "Invoice Date", "Total Amount") or answers questions based on the document content and layout.
- This combination allows for more accurate extraction of specific fields, even if they aren't neatly arranged in tables, by leveraging the model's understanding of document structure and context.