LayoutLM & Donut: Advanced Document Understanding with AI
Explore LayoutLM and Donut for powerful document understanding, going beyond OCR. Learn how AI models interpret text, layout, and visual structure for complex documents.
Document Understanding with LayoutLM and Donut
Document understanding extends beyond simple Optical Character Recognition (OCR). It involves comprehending the text, layout, and visual structure of documents such as invoices, forms, receipts, and scientific papers. Traditional OCR systems struggle in scenarios where layout is crucial for accurate content interpretation.
LayoutLM and Donut are two state-of-the-art deep learning models that address this challenge by combining visual and textual data for comprehensive document understanding.
LayoutLM: Layout-Aware Language Modeling
LayoutLM, developed by Microsoft, is a document AI model that integrates three essential sources of information for document understanding:
- Textual Information: Extracted text content, typically from OCR.
- Layout Information: Spatial coordinates (x, y) of text blocks on the page.
- Visual Features: Image embeddings (introduced in LayoutLMv2/v3) that capture visual cues from the document.
By understanding the spatial arrangement of text, LayoutLM enables more intelligent document processing, going beyond just recognizing the words.
LayoutLM Variants
Model | Key Feature | Use Case Examples |
---|---|---|
LayoutLMv1 | Text + Layout | Form understanding, document classification |
LayoutLMv2 | Text + Layout + Visual (images) | End-to-end document understanding with scanned documents |
LayoutLMv3 | Unified text-image pretraining with simplified visual embeddings (no CNN backbone) | Improved performance on both text-centric and image-centric document tasks
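Each variant is available as a checkpoint on the Hugging Face Hub. The snippet below is a minimal sketch of how they are typically loaded; the checkpoint IDs are the commonly used Hub names, and only v1 (which has no image branch) is actually instantiated here, since v2 additionally requires detectron2.
from transformers import AutoModel, AutoTokenizer

# Commonly used Hugging Face checkpoint IDs for the three variants.
LAYOUTLM_CHECKPOINTS = {
    "v1": "microsoft/layoutlm-base-uncased",    # text + layout
    "v2": "microsoft/layoutlmv2-base-uncased",  # text + layout + image (needs detectron2)
    "v3": "microsoft/layoutlmv3-base",          # unified text-image pretraining
}

# LayoutLMv1 has no visual branch, so a plain tokenizer is enough.
tokenizer = AutoTokenizer.from_pretrained(LAYOUTLM_CHECKPOINTS["v1"])
model = AutoModel.from_pretrained(LAYOUTLM_CHECKPOINTS["v1"])

# Size of the 2D position embedding table; bounding boxes are normalized
# to the 0-1000 range before being embedded (see the next section).
print(model.config.max_2d_position_embeddings)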
How LayoutLM Works
LayoutLM processes inputs including:
- Tokens: Words or sub-word units derived from OCR.
- 2D Position Embeddings: Bounding box coordinates for each token.
- Visual Embeddings: Representations of the document image (for v2 and v3).
These embeddings are fed into a Transformer architecture, which learns the complex relationships between words, their positions, and visual features, thereby achieving layout awareness.
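As a concrete illustration of the 2D position input, the sketch below normalizes an OCR bounding box from pixel coordinates into the 0-1000 grid that LayoutLM's position embeddings expect. The word and coordinate values are illustrative, not taken from a real document.
# Minimal sketch: map a pixel bounding box onto the 0-1000 grid used by
# LayoutLM's 2D position embeddings.
def normalize_box(box, page_width, page_height):
    """box = (x0, y0, x1, y1) in pixels -> [0, 1000] integer coordinates."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

# One OCR word with its pixel box on a 1654x2339 px page (illustrative values).
word, pixel_box = "Total:", (1180, 1905, 1320, 1942)
print(word, normalize_box(pixel_box, 1654, 2339))
# -> Total: [713, 814, 798, 830]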
Applications of LayoutLM
- Form understanding
- Invoice and receipt parsing
- Key-value pair extraction
- Document classification
- Named Entity Recognition (NER) within documents
Example Use Case: Extracting Fields from Invoices
LayoutLM can be trained to extract specific fields from invoices, such as:
- Vendor name
- Invoice date
- Total amount
- Line item details
This capability is vital for automating business workflows like expense reporting and financial auditing.
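In practice this is usually framed as token classification: the model tags each OCR word with a label such as B-VENDOR or I-VENDOR, and a small post-processing step groups adjacent tags into field values. The sketch below shows only that grouping step; the BIO label scheme and the words are hypothetical, and the actual label names depend on how the model was fine-tuned.
# Hypothetical BIO-tagged output from a LayoutLM token classifier fine-tuned on
# invoices; the labels are an assumption for illustration.
words  = ["ACME", "Corp.", "Invoice", "Date:", "2024-03-01", "Total:", "$118.40"]
labels = ["B-VENDOR", "I-VENDOR", "O", "O", "B-DATE", "O", "B-TOTAL"]

def group_entities(words, labels):
    """Merge consecutive B-/I- tags into (field, value) pairs."""
    fields, current_field, current_words = [], None, []
    for word, label in zip(words, labels):
        if label.startswith("B-"):
            if current_field:
                fields.append((current_field, " ".join(current_words)))
            current_field, current_words = label[2:], [word]
        elif label.startswith("I-") and current_field == label[2:]:
            current_words.append(word)
        else:
            if current_field:
                fields.append((current_field, " ".join(current_words)))
            current_field, current_words = None, []
    if current_field:
        fields.append((current_field, " ".join(current_words)))
    return fields

print(group_entities(words, labels))
# -> [('VENDOR', 'ACME Corp.'), ('DATE', '2024-03-01'), ('TOTAL', '$118.40')]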
LayoutLM Example (Hugging Face)
This example demonstrates how to perform inference using LayoutLMv2 for token classification (e.g., identifying entity types).
Step 1: Install Dependencies
pip install transformers datasets torchvision pytesseract pdf2image Pillow
Note: LayoutLMv2 also depends on detectron2 for its visual backbone. Install it following the detectron2 repository's instructions (for example, pip install 'git+https://github.com/facebookresearch/detectron2.git').
Optional: Install Tesseract OCR and Poppler
For processing scanned PDFs, you'll need to install these system dependencies.
- Ubuntu:
sudo apt install tesseract-ocr poppler-utils
- Windows: Download binaries and add them to your system's PATH. Refer to the Tesseract OCR and Poppler GitHub repositories for instructions. A quick way to verify that both tools are reachable from Python is shown below.
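A minimal sanity check, assuming pytesseract is installed and that Poppler provides the pdftoppm binary:
# Quick check (a sketch) that Tesseract and Poppler are reachable before running
# the pipeline; pdftoppm ships with poppler-utils.
import shutil
import pytesseract

try:
    print("Tesseract version:", pytesseract.get_tesseract_version())
except pytesseract.TesseractNotFoundError:
    print("Tesseract is not on PATH - install it or set pytesseract.pytesseract.tesseract_cmd")

print("Poppler (pdftoppm) found:", shutil.which("pdftoppm") is not None)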
Step 2: Prepare Input Document
You'll need an image file of the document (e.g., invoice.png, form.jpg).
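If your source document is a scanned PDF rather than an image, pdf2image (installed in Step 1) can rasterize its pages first. A small sketch, assuming a hypothetical file named scanned_form.pdf and Poppler available on the PATH:
# Convert the first page of a scanned PDF to a PNG that the inference code
# below can consume. The input file name is hypothetical.
from pdf2image import convert_from_path

pages = convert_from_path("scanned_form.pdf", dpi=300)
pages[0].save("form_sample.png")
print(f"Converted {len(pages)} page(s); first page saved as form_sample.png")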
Step 3: LayoutLMv2 Inference Code
import torch
from transformers import LayoutLMv2Processor, LayoutLMv2ForTokenClassification
from PIL import Image
import pytesseract

# Load model and processor.
# apply_ocr=False because we supply our own words and boxes from pytesseract below.
# Note: this base checkpoint has a randomly initialized token-classification head,
# so its predicted labels are placeholders; for meaningful entity labels, use a
# checkpoint fine-tuned for token classification (e.g. on FUNSD).
model_name = "microsoft/layoutlmv2-base-uncased"
processor = LayoutLMv2Processor.from_pretrained(model_name, apply_ocr=False)
model = LayoutLMv2ForTokenClassification.from_pretrained(model_name)

# Load image
image_path = "form_sample.png"  # Replace with your image path
try:
    image = Image.open(image_path).convert("RGB")
except FileNotFoundError:
    print(f"Error: Image file not found at {image_path}")
    exit()

# OCR using pytesseract
ocr_data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

# Extract words and bounding boxes
words = []
boxes = []
for i in range(len(ocr_data["text"])):
    if ocr_data["text"][i].strip() != "":
        words.append(ocr_data["text"][i])
        (x, y, w, h) = (ocr_data["left"][i], ocr_data["top"][i], ocr_data["width"][i], ocr_data["height"][i])
        boxes.append([x, y, x + w, y + h])

# Normalize bounding boxes to the 0-1000 range expected by LayoutLM
width, height = image.size
normalized_boxes = []
for box in boxes:
    normalized_box = [
        int(1000 * (box[0] / width)),
        int(1000 * (box[1] / height)),
        int(1000 * (box[2] / width)),
        int(1000 * (box[3] / height)),
    ]
    normalized_boxes.append(normalized_box)

# Tokenize and encode for the model
encoding = processor(
    image,
    words,
    boxes=normalized_boxes,
    return_tensors="pt",
    truncation=True,
    padding="max_length"
)

# Inference
with torch.no_grad():
    outputs = model(**encoding)

# Get predicted label IDs for every token position
logits = outputs.logits
predicted_ids = logits.argmax(-1).squeeze().tolist()

# Map token-level predictions back to words. The tokenizer splits words into
# sub-word pieces and adds special/padding tokens, so we keep the prediction of
# each word's first sub-token (word_ids() requires the fast tokenizer, which
# from_pretrained loads by default).
labels = model.config.id2label
word_ids = encoding.word_ids(batch_index=0)
word_predictions = {}
for token_index, word_id in enumerate(word_ids):
    if word_id is not None and word_id not in word_predictions:
        word_predictions[word_id] = labels[predicted_ids[token_index]]

# Display results (words beyond the truncation limit are skipped)
print(f"Processing {image_path}:")
for word_id, word in enumerate(words):
    if word_id in word_predictions:
        print(f"{word} -> {word_predictions[word_id]}")
Donut: Document Understanding Transformer
Donut, developed by NAVER AI Lab, is a cutting-edge, OCR-free model that directly processes document images to output structured content, eliminating the need for OCR preprocessing.
Key Features of Donut
- OCR-Free: Bypasses traditional OCR steps, processing images directly.
- End-to-End: Takes an image as input and outputs structured data (e.g., JSON).
- Vision Transformer Based: Leverages the Swin Transformer architecture for image encoding.
- Template-based Output: Returns document content in structured formats, such as key-value pairs (illustrated in the sketch below).
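Rather than emitting raw JSON, Donut's decoder produces a tagged token sequence that the processor converts into nested key-value pairs. Below is a minimal sketch using DonutProcessor.token2json with the CORD-finetuned processor used later in this article; the tagged sequence is hand-written for illustration, not real model output.
# Sketch: Donut's decoder emits a tagged sequence; token2json turns it into
# nested key-value pairs. The tags below are illustrative.
from transformers import DonutProcessor

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

tagged_sequence = "<s_menu><s_nm>Advil Tablets</s_nm><s_price>$8.99</s_price></s_menu>"
print(processor.token2json(tagged_sequence))
# -> e.g. {'menu': {'nm': 'Advil Tablets', 'price': '$8.99'}}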
Why Donut is a Breakthrough
Traditional document processing pipelines:
Image → OCR → NLP
Donut's streamlined pipeline:
Image → [Vision + Transformer] → Structured Output
This approach offers several advantages:
- Improved Accuracy: Better performance on noisy scans, handwritten text, and complex layouts.
- Simplified Pipeline: Removes the OCR stage, so there are no OCR errors to manage or propagate.
- Flexibility: Easily trainable on custom templates and diverse document types.
Donut Use Cases
- Receipt parsing
- ID document parsing
- Invoice digitization
- Document classification
- Multilingual document understanding
Donut Architecture Overview
- Image Encoder: Utilizes a Swin Transformer to process the input image.
- Decoder: A Transformer-based text decoder that autoregressively generates the output sequence in a tagged, JSON-like format (see the sketch after this overview).
- Pretraining: Trained on a combination of synthetic and real document datasets.
- Output: A token sequence that is converted into structured JSON field-value pairs (e.g., via the processor's token2json utility).
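This encoder-decoder composition can be inspected directly from a pretrained checkpoint. A small sketch that only loads the model and prints parts of its configuration, using the CORD-finetuned checkpoint from the next section:
# Sketch: inspect how the pretrained Donut checkpoint is composed - a Swin-based
# image encoder paired with an autoregressive text decoder - without running it.
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

print("Encoder:", type(model.encoder).__name__, "| model_type:", model.config.encoder.model_type)
print("Decoder:", type(model.decoder).__name__, "| model_type:", model.config.decoder.model_type)
print("Decoder max positions:", model.config.decoder.max_position_embeddings)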
Example Code with Pretrained Donut
This example shows how to use a pre-trained Donut model for extracting information from a receipt or invoice.
import re
import json
import torch
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image

# Load pre-trained Donut model and processor
# This example uses a model fine-tuned on receipts (the CORD dataset)
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

# Load image
image_path = "invoice.png"  # Replace with your invoice image path
try:
    image = Image.open(image_path).convert("RGB")
except FileNotFoundError:
    print(f"Error: Image file not found at {image_path}")
    exit()

# Prepare input for the model
pixel_values = processor(image, return_tensors="pt").pixel_values

# Define the task prompt for Donut
# The prompt tells the decoder which fine-tuned output schema to generate
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids

# Inference
# The model autoregressively generates a tagged token sequence describing the document
with torch.no_grad():
    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=768,  # Adjust max_length as needed
    )

# Decode the generated IDs while keeping the structural tags (e.g. <s_total>),
# then strip the end-of-sequence/padding tokens and the task prompt itself
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove the task prompt token

print("\n🧾 Extracted Output:\n")
print(sequence)

# Postprocess the tagged sequence into a Python dictionary / JSON
extracted_data = processor.token2json(sequence)
print("\nParsed JSON:")
print(json.dumps(extracted_data, indent=2))
Example Output (illustrative JSON; actual field names depend on the fine-tuning dataset):
{
"company": {
"name": "CVS Pharmacy"
},
"items": [
{
"name": "Advil Tablets",
"price": "$8.99"
},
{
"name": "Toothbrush",
"price": "$2.50"
}
],
"total": "$11.49",
"date": "2023-08-19"
}
Comparison: LayoutLM vs. Donut
Feature | LayoutLM | Donut |
---|---|---|
OCR Dependency | Yes (requires OCR input) | No (OCR-free) |
Input Type | OCR tokens, positions, optional visual features | Document image |
Output | Tokens, labels (for NER, classification) | Structured text (e.g., JSON with key-value pairs) |
Common Datasets | FUNSD, SROIE, DocVQA, RVL-CDIP | CORD (Consolidated Receipt Dataset), DocVQA, synthetic documents (SynthDoG)
Best For | Layout-aware NLP tasks, detailed text analysis | End-to-end document parsing, structured data extraction |
Language Support | Multilingual via OCR | Multilingual with visual training |
Libraries and Tools
- LayoutLM:
  - Transformers by Hugging Face (LayoutLMv2Processor, LayoutLMv2ForTokenClassification, and the v1/v3 equivalents)
  - Datasets: FUNSD, SROIE, RVL-CDIP
- Donut:
  - Transformers by Hugging Face (DonutProcessor, VisionEncoderDecoderModel)
  - Datasets: CORD, synthetic documents
Real-World Applications
- Finance: Invoice and expense report automation, financial auditing.
- Healthcare: Parsing prescriptions, patient records, and medical reports.
- Legal: Extracting clauses from contracts, reviewing legal documents.
- Logistics: Processing bills of lading, shipment tracking information.
- E-commerce: Digitizing receipts, delivery notes, and order confirmations.
Conclusion
LayoutLM and Donut represent two powerful paradigms for document understanding. LayoutLM excels at layout-aware Natural Language Processing (NLP) by leveraging OCR and positional data. In contrast, Donut simplifies the pipeline with an OCR-free, image-to-JSON transformation approach. The choice between them depends on your specific data sources, accuracy requirements, and the complexity of your use case.
SEO Keywords
LayoutLM, Donut, document AI, OCR-free model, document understanding transformer, layout-aware NLP, invoice parsing, end-to-end document parsing, vision transformer, multimodal document understanding, OCR vs OCR-free, structured data extraction.
Interview Questions
- What is LayoutLM and how does it improve document understanding beyond traditional OCR?
- Explain the key differences between LayoutLMv1, LayoutLMv2, and LayoutLMv3.
- How does LayoutLM integrate textual, layout, and visual information for document processing?
- What is Donut, and how does it differ from OCR-dependent models like LayoutLM?
- Describe the architecture of Donut and how it processes document images end-to-end.
- What are the main advantages of OCR-free models like Donut in handling noisy or handwritten documents?
- How can LayoutLM be applied for key-value pair extraction in invoices or forms?
- What datasets and tools are commonly used to train and fine-tune LayoutLM and Donut models?
- In what scenarios would you prefer LayoutLM over Donut, and vice versa?
- How do vision transformers contribute to the advancements in document AI models like Donut?