LayoutLM & Donut: Advanced Document Understanding with AI

Explore LayoutLM and Donut for powerful document understanding, going beyond OCR. Learn how AI models interpret text, layout, and visual structure for complex documents.

Document Understanding with LayoutLM and Donut

Document understanding extends beyond simple Optical Character Recognition (OCR). It involves comprehending the text, layout, and visual structure of documents such as invoices, forms, receipts, and scientific papers. Traditional OCR systems struggle in scenarios where layout is crucial for accurate content interpretation.

LayoutLM and Donut are two state-of-the-art deep learning models that address this challenge by combining visual and textual data for comprehensive document understanding.

LayoutLM: Layout-Aware Language Modeling

LayoutLM, developed by Microsoft, is a document AI model that integrates three essential elements for understanding:

  • Textual Information: Extracted text content, typically from OCR.
  • Layout Information: Spatial coordinates (x, y) of text blocks on the page.
  • Visual Features: Image embeddings (introduced in LayoutLMv2/v3) that capture visual cues from the document.

By understanding the spatial arrangement of text, LayoutLM enables more intelligent document processing, going beyond just recognizing the words.

LayoutLM Variants

Model      | Key Feature                     | Use Case Examples
LayoutLMv1 | Text + Layout                   | Form understanding, document classification
LayoutLMv2 | Text + Layout + Visual (images) | End-to-end document understanding with scanned documents
LayoutLMv3 | Enhanced multi-modal alignment  | Stronger results in OCR-free scenarios, improved overall performance
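
A minimal sketch of loading these variants from the Hugging Face Hub; the checkpoint names below are the publicly released base models, and AutoModel resolves the correct architecture from each checkpoint's configuration:

from transformers import AutoModel

# Publicly released base checkpoints for each LayoutLM variant
checkpoints = {
    "LayoutLMv1": "microsoft/layoutlm-base-uncased",
    "LayoutLMv2": "microsoft/layoutlmv2-base-uncased",  # also requires detectron2
    "LayoutLMv3": "microsoft/layoutlmv3-base",
}

# Load one variant; the same pattern works for the others
model = AutoModel.from_pretrained(checkpoints["LayoutLMv3"])
print(type(model).__name__)  # e.g. LayoutLMv3Model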

How LayoutLM Works

LayoutLM processes inputs including:

  • Tokens: Words or sub-word units derived from OCR.
  • 2D Position Embeddings: Bounding box coordinates for each token.
  • Visual Embeddings: Representations of the document image (for v2 and v3).

These embeddings are fed into a Transformer architecture, which learns the complex relationships between words, their positions, and visual features, thereby achieving layout awareness.
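
The following is a minimal sketch of these inputs using the original LayoutLM checkpoint, with two hand-written words and boxes (already normalized to the 0-1000 range) purely for illustration; the [CLS]/[SEP] boxes follow the convention used in the Hugging Face documentation:

import torch
from transformers import LayoutLMTokenizer, LayoutLMModel

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMModel.from_pretrained("microsoft/layoutlm-base-uncased")

# Toy example: two words with bounding boxes normalized to the 0-1000 range
words = ["Invoice", "Total"]
word_boxes = [[100, 50, 300, 80], [100, 600, 220, 630]]

# Tokenize each word and repeat its box for every sub-word token
tokens, token_boxes = [], []
for word, box in zip(words, word_boxes):
    word_tokens = tokenizer.tokenize(word)
    tokens.extend(word_tokens)
    token_boxes.extend([box] * len(word_tokens))

# Add special tokens with the conventional [0,0,0,0] / [1000,1000,1000,1000] boxes
input_ids = tokenizer.convert_tokens_to_ids([tokenizer.cls_token] + tokens + [tokenizer.sep_token])
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

# Feed token IDs and 2D positions into the layout-aware Transformer
outputs = model(
    input_ids=torch.tensor([input_ids]),
    bbox=torch.tensor([token_boxes]),
)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)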

Applications of LayoutLM

  • Form understanding
  • Invoice and receipt parsing
  • Key-value pair extraction
  • Document classification
  • Named Entity Recognition (NER) within documents

Example Use Case: Extracting Fields from Invoices

LayoutLM can be trained to extract specific fields from invoices, such as:

  • Vendor name
  • Invoice date
  • Total amount
  • Line item details

This capability is vital for automating business workflows like expense reporting and financial auditing.
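
Once a fine-tuned LayoutLM has labeled every word, a small post-processing step can group the labels back into fields. The sketch below assumes a hypothetical BIO-style label scheme (B-VENDOR, B-DATE, B-TOTAL, and so on) chosen only for illustration:

from collections import defaultdict

def group_fields(word_label_pairs):
    """Group BIO-tagged (word, label) pairs into field -> value strings."""
    fields = defaultdict(list)
    current_field = None
    for word, label in word_label_pairs:
        if label.startswith("B-"):
            current_field = label[2:]
            fields[current_field].append([word])
        elif label.startswith("I-") and current_field == label[2:]:
            fields[current_field][-1].append(word)
        else:
            current_field = None
    return {field: [" ".join(chunk) for chunk in chunks] for field, chunks in fields.items()}

# Hypothetical model output for a short invoice snippet
predictions = [
    ("ACME", "B-VENDOR"), ("Corp", "I-VENDOR"),
    ("2023-08-19", "B-DATE"),
    ("Total", "O"), ("$11.49", "B-TOTAL"),
]
print(group_fields(predictions))
# {'VENDOR': ['ACME Corp'], 'DATE': ['2023-08-19'], 'TOTAL': ['$11.49']}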

LayoutLM Example (Hugging Face)

This example demonstrates how to run inference with LayoutLMv2 for token classification (e.g., identifying entity types). Note that the base checkpoint used here has a randomly initialized classification head, so its predictions are placeholders; fine-tune the model on a labeled dataset such as FUNSD to obtain meaningful labels.

Step 1: Install Dependencies

pip install transformers datasets torchvision pytesseract pdf2image Pillow

Note: LayoutLMv2 additionally depends on detectron2 for its visual backbone; detectron2 is installed from its GitHub repository rather than from PyPI (see its installation instructions).

Optional: Install Tesseract OCR and Poppler

For processing scanned PDFs, you'll need to install these system dependencies.

  • Ubuntu:
    sudo apt install tesseract-ocr poppler-utils
  • Windows: Download binaries and add them to your system's PATH. Refer to the Tesseract OCR and Poppler GitHub repositories for instructions.
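
If your documents arrive as scanned PDFs rather than images, pdf2image (backed by Poppler) can convert each page into an image before OCR. A minimal sketch, assuming a local file named scanned_invoice.pdf:

from pdf2image import convert_from_path

# Convert each PDF page into a PIL image (requires Poppler on the system PATH)
pages = convert_from_path("scanned_invoice.pdf", dpi=300)

# Save each page so it can be fed to the OCR + LayoutLM pipeline below
for i, page in enumerate(pages):
    page.save(f"page_{i}.png")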

Step 2: Prepare Input Document

You'll need an image file of the document (e.g., invoice.png, form.jpg).

Step 3: LayoutLMv2 Inference Code

import torch
from transformers import LayoutLMv2Processor, LayoutLMv2ForTokenClassification
from PIL import Image
import pytesseract

# Load model and processor.
# apply_ocr=False so the processor accepts the words and boxes we extract with pytesseract below.
model_name = "microsoft/layoutlmv2-base-uncased"
processor = LayoutLMv2Processor.from_pretrained(model_name, apply_ocr=False)
# The base checkpoint's token-classification head is randomly initialized;
# fine-tune it (e.g., on FUNSD) or load a fine-tuned checkpoint for meaningful labels.
model = LayoutLMv2ForTokenClassification.from_pretrained(model_name)

# Load image
image_path = "form_sample.png"  # Replace with your image path
try:
    image = Image.open(image_path).convert("RGB")
except FileNotFoundError:
    print(f"Error: Image file not found at {image_path}")
    exit()

# OCR using pytesseract
ocr_data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

# Extract words and bounding boxes
words = []
boxes = []
for i in range(len(ocr_data["text"])):
    if ocr_data["text"][i].strip() != "":
        words.append(ocr_data["text"][i])
        (x, y, w, h) = (ocr_data["left"][i], ocr_data["top"][i], ocr_data["width"][i], ocr_data["height"][i])
        boxes.append([x, y, x + w, y + h])

# Normalize bounding boxes to 0-1000 range
width, height = image.size
normalized_boxes = []
for box in boxes:
    normalized_box = [
        int(1000 * (box[0] / width)),
        int(1000 * (box[1] / height)),
        int(1000 * (box[2] / width)),
        int(1000 * (box[3] / height)),
    ]
    normalized_boxes.append(normalized_box)

# Tokenize and encode for the model
encoding = processor(
    image,
    words,
    boxes=normalized_boxes,
    return_tensors="pt",
    truncation=True,
    padding="max_length"
)

# Inference
with torch.no_grad():
    outputs = model(**encoding)

# Get predicted label IDs per token
logits = outputs.logits
predicted_ids = logits.argmax(-1).squeeze().tolist()

# Align token-level predictions back to words: sub-word tokens, special tokens,
# and padding share or lack a word index, so keep the first sub-token's prediction per word.
word_ids = encoding.word_ids(batch_index=0)
id2label = model.config.id2label

# Display results
print(f"Processing {image_path}:")
previous_word_id = None
for token_idx, word_id in enumerate(word_ids):
    if word_id is None or word_id == previous_word_id:
        continue  # skip special tokens, padding, and sub-word continuations
    previous_word_id = word_id
    print(f"{words[word_id]} -> {id2label[predicted_ids[token_idx]]}")

Donut: Document Understanding Transformer

Donut, developed by NAVER AI Lab, is a cutting-edge, OCR-free model that directly processes document images to output structured content, eliminating the need for OCR preprocessing.

Key Features of Donut

  • OCR-Free: Bypasses traditional OCR steps, processing images directly.
  • End-to-End: Takes an image as input and outputs structured data (e.g., JSON).
  • Vision Transformer Based: Leverages the Swin Transformer architecture for image encoding.
  • Template-based Output: Designed to return document content in structured formats, such as key-value pairs.

Why Donut is a Breakthrough

Traditional document processing pipeline: Image → OCR → NLP

Donut's streamlined pipeline: Image → Vision Encoder-Decoder Transformer → Structured Output

This approach offers several advantages:

  • Improved Accuracy: Better performance on noisy, handwritten, or complex layouts.
  • Simplified Architecture: Eliminates the need to manage OCR errors.
  • Flexibility: Easily trainable on custom templates and diverse document types.

Donut Use Cases

  • Receipt parsing
  • ID document parsing
  • Invoice digitization
  • Document classification
  • Multilingual document understanding

Donut Architecture Overview

  • Image Encoder: Utilizes a Swin Transformer to process the input image.
  • Decoder: A Transformer-based component that generates sequences, typically in a JSON-like format.
  • Pretraining: Trained on a combination of synthetic and real document datasets.
  • Output: Generates fully formatted JSON objects with field-value pairs.
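
A small sketch that loads the base Donut checkpoint and inspects these two halves; the printed class names are whatever the checkpoint's configuration specifies:

from transformers import VisionEncoderDecoderModel

# Donut is packaged as a VisionEncoderDecoderModel: a Swin-based image encoder
# paired with an autoregressive Transformer text decoder
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

print(type(model.encoder).__name__)  # Swin-based image encoder
print(type(model.decoder).__name__)  # Transformer text decoder
print(model.config.encoder.model_type, model.config.decoder.model_type)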

Example Code with Pretrained Donut

This example shows how to use a pre-trained Donut model for extracting information from a receipt or invoice.

import re
import torch
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image

# Load pre-trained Donut model and processor
# This example uses a model fine-tuned on receipts/invoices (CORD dataset)
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

# Load image
image_path = "invoice.png"  # Replace with your invoice image path
try:
    image = Image.open(image_path).convert("RGB")
except FileNotFoundError:
    print(f"Error: Image file not found at {image_path}")
    exit()

# Prepare input for the model
pixel_values = processor(image, return_tensors="pt").pixel_values

# Define the task prompt for Donut
# The prompt guides the model on what kind of structured output to generate
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids

# Inference
# The model generates a sequence of tokens representing the structured output
with torch.no_grad():
    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=768  # Adjust max_length as needed
    )

# Decode the generated tokens.
# Donut emits an XML-like tag sequence (e.g., <s_menu><s_nm>...</s_nm>...), so keep the
# special field tokens and strip only the end-of-sequence/padding tokens and the task prompt.
output_text = processor.batch_decode(outputs)[0]
output_text = output_text.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
output_text = re.sub(r"<.*?>", "", output_text, count=1).strip()  # remove the first task start token

print("\n🧾 Extracted Output:\n")
print(output_text)

# Convert the tag sequence into a Python dictionary and pretty-print it as JSON
import json
extracted_data = processor.token2json(output_text)
print("\nParsed JSON:")
print(json.dumps(extracted_data, indent=2))

Example Output (illustrative JSON):

Note: the field names actually produced by the CORD-fine-tuned checkpoint follow the CORD schema (e.g., menu, nm, price, total); the structure and values below are only illustrative.

{
  "company": {
    "name": "CVS Pharmacy"
  },
  "items": [
    {
      "name": "Advil Tablets",
      "price": "$8.99"
    },
    {
      "name": "Toothbrush",
      "price": "$2.50"
    }
  ],
  "total": "$11.49",
  "date": "2023-08-19"
}

Comparison: LayoutLM vs. Donut

Feature                      | LayoutLM                                        | Donut
OCR Dependency               | Yes (requires OCR input)                        | No (OCR-free)
Input Type                   | OCR tokens, positions, optional visual features | Document image
Output                       | Token-level labels (for NER, classification)    | Structured text (e.g., JSON-like key-value pairs)
Common Datasets / Benchmarks | FUNSD, SROIE, DocVQA                            | Synthetic documents, receipts/invoices, CORD (Consolidated Receipt Dataset)
Best For                     | Layout-aware NLP tasks, detailed text analysis  | End-to-end document parsing, structured data extraction
Language Support             | Multilingual via OCR                            | Multilingual with visual training

Libraries and Tools

  • Hugging Face Transformers: pretrained LayoutLM and Donut models, processors, and tokenizers
  • PyTorch: underlying framework for inference and fine-tuning
  • pytesseract / Tesseract OCR: text and bounding-box extraction for LayoutLM
  • pdf2image / Poppler: converting scanned PDFs into images
  • Pillow (PIL): image loading and preprocessing

Real-World Applications

  • Finance: Invoice and expense report automation, financial auditing.
  • Healthcare: Parsing prescriptions, patient records, and medical reports.
  • Legal: Extracting clauses from contracts, reviewing legal documents.
  • Logistics: Processing bills of lading, shipment tracking information.
  • E-commerce: Digitizing receipts, delivery notes, and order confirmations.

Conclusion

LayoutLM and Donut represent two powerful paradigms for document understanding. LayoutLM excels at layout-aware Natural Language Processing (NLP) by leveraging OCR and positional data. In contrast, Donut simplifies the pipeline with an OCR-free, image-to-JSON transformation approach. The choice between them depends on your specific data sources, accuracy requirements, and the complexity of your use case.

SEO Keywords

LayoutLM, Donut, document AI, OCR-free model, document understanding transformer, layout-aware NLP, invoice parsing, end-to-end document parsing, vision transformer, multimodal document understanding, OCR vs OCR-free, structured data extraction.

Interview Questions

  1. What is LayoutLM and how does it improve document understanding beyond traditional OCR?
  2. Explain the key differences between LayoutLMv1, LayoutLMv2, and LayoutLMv3.
  3. How does LayoutLM integrate textual, layout, and visual information for document processing?
  4. What is Donut, and how does it differ from OCR-dependent models like LayoutLM?
  5. Describe the architecture of Donut and how it processes document images end-to-end.
  6. What are the main advantages of OCR-free models like Donut in handling noisy or handwritten documents?
  7. How can LayoutLM be applied for key-value pair extraction in invoices or forms?
  8. What datasets and tools are commonly used to train and fine-tune LayoutLM and Donut models?
  9. In what scenarios would you prefer LayoutLM over Donut, and vice versa?
  10. How do vision transformers contribute to the advancements in document AI models like Donut?