OCRmyPDF Batch Processor

Batch process multiple folders of scanned documents (TIFF, JPEG, PDF) into searchable, compressed PDFs using OCR.

Features

Batch processing: Processes a whole directory of document folders, several folders at a time (-p)
OCR with Tesseract: Adds a searchable text layer, with automatic deskewing and page rotation
JBIG2 compression: Reduces file size while preserving OCR quality
Parallel processing: Spreads each folder's work across multiple CPU cores (-j)
Docker-based: Easy deployment, consistent environment

Quick Start

# Build the Docker image
docker build -t ocr-batch-processor .

# Process all folders in /input, output to /output
# 4 folders in parallel, 12 CPU cores per folder (48 total)
docker run -v /path/to/input:/input -v /path/to/output:/output \
  ocr-batch-processor python batch_process.py /input /output -p 4 -j 12

What It Does

Input folder structure:       Output:
/input/                       /output/
  ├── 630666-01/             ├── 630666-01.pdf (searchable, compressed)
  │   ├── page_001.tif       ├── 629157-02.pdf (searchable, compressed)
  │   ├── page_002.tif       └── 633421-03.pdf (searchable, compressed)
  │   └── page_003.tif
  ├── 629157-02/
  │   └── scan.pdf
  └── 633421-03/
      ├── img001.jpg
      └── img002.jpg

Each subfolder becomes one output PDF with:

Full text OCR layer (searchable/copyable)
Automatic page rotation and deskewing
JBIG2 compression (smaller file size)
Preserved image quality
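
The folder-to-PDF mapping above can be sketched in a few lines of Python. This is a minimal illustration, not the code in batch_process.py: the function name plan_jobs and the extension list are assumptions based on the input types this README mentions.

```python
from pathlib import Path

# Assumed extension list, taken from the input types named in this README.
SUPPORTED = {".tif", ".tiff", ".jpg", ".jpeg", ".pdf"}


def plan_jobs(input_root, output_root):
    """Return (folder, output_pdf) pairs, one per subfolder with supported files."""
    jobs = []
    for folder in sorted(Path(input_root).iterdir()):
        if not folder.is_dir():
            continue  # loose files at the top level are ignored in this sketch
        pages = [
            p for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() in SUPPORTED
        ]
        if pages:
            # The output PDF is named after the subfolder, e.g. 630666-01.pdf
            jobs.append((folder, Path(output_root) / f"{folder.name}.pdf"))
    return jobs
```

Calling plan_jobs("/input", "/output") on the tree shown above would yield three pairs, one per subfolder. Empty subfolders are skipped here; how the real script treats them is not stated in this README.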

Usage

For a 48-Core Machine (Recommended)

Balanced approach (4 folders × 12 cores = 48 cores):

docker run -v /input:/input -v /output:/output \
  ocr-batch-processor python batch_process.py /input /output -p 4 -j 12

More parallelism (6 folders × 8 cores = 48 cores):

docker run -v /input:/input -v /output:/output \
  ocr-batch-processor python batch_process.py /input /output -p 6 -j 8
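
Both examples keep folders × cores equal to the machine's core count. On a machine with a different core count, the same rule can be applied with a small helper. This is a sketch only; split_cores is a hypothetical name, not part of this project.

```python
def split_cores(total_cores, parallel_folders):
    """Return a -j value such that parallel_folders * j never exceeds total_cores."""
    if total_cores < 1 or parallel_folders < 1:
        raise ValueError("total_cores and parallel_folders must be at least 1")
    # Integer division rounds down, so the machine is never oversubscribed.
    return max(1, total_cores // parallel_folders)


# Reproduce the two 48-core examples above.
for p in (4, 6):
    print(f"-p {p} -j {split_cores(48, p)}")
```

On a real machine, os.cpu_count() can supply total_cores in place of the fixed 48.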

Advanced Options

# High-quality OCR (slower, larger files)
docker run -v /input:/input -v /output:/output \
  ocr-batch-processor python batch_process.py /input /output -p 4 -j 12 --accurate-ocr

# Preview what will be processed (no actual processing)
docker run -v /input:/input -v /output:/output \
  ocr-batch-processor python batch_process.py /input /output --dry-run

# Longer timeout for very large documents
docker run -v /input:/input -v /output:/output \
  ocr-batch-processor python batch_process.py /input /output -p 4 -j 12 --timeout 7200
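
This README does not state the unit of --timeout or how it is enforced. Assuming the value is in seconds (7200 would then be two hours) and is applied per folder, the general technique looks like the sketch below. run_with_timeout is a hypothetical helper, not a function from batch_process.py.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd and report 'ok', 'failed', or 'timeout'."""
    try:
        # subprocess.run kills the child process when the timeout expires,
        # then raises TimeoutExpired.
        result = subprocess.run(cmd, timeout=timeout_seconds, capture_output=True)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"


# Stand-in command; a real run would launch the per-folder processor instead.
print(run_with_timeout([sys.executable, "-c", "pass"], timeout_seconds=30))
```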

Single Folder Processing

To process a single folder, call FINAL.py directly. In this example, subfolder is the document folder inside the mounted input directory:

docker run -v /path/to/folder:/input -v /output:/output \
  ocr-batch-processor python FINAL.py /input/subfolder /output/result.pdf -j 48

Documentation

See DEPLOYMENT.md for detailed deployment instructions, troubleshooting, and performance expectations.

Components

batch_process.py: Parallel batch processing wrapper
FINAL.py: Single folder processor (combine → OCR → compress)
opencv_optimizer.py: JBIG2 compression engine
jbig2enc/: JBIG2 encoder binaries and utilities
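
How FINAL.py invokes OCRmyPDF is not shown in this README. As a rough illustration of the OCR step only, the sketch below builds an ocrmypdf command line using that tool's standard --deskew, --rotate-pages, and --jobs options. The helper name and the file paths are hypothetical, and the combine and compress steps are left out.

```python
import shlex


def build_ocr_command(input_pdf, output_pdf, jobs):
    """Build an ocrmypdf command for one combined PDF (the OCR step only)."""
    return [
        "ocrmypdf",
        "--deskew",        # straighten skewed scans
        "--rotate-pages",  # fix pages that are sideways or upside down
        "--jobs", str(jobs),
        str(input_pdf),
        str(output_pdf),
    ]


# Hypothetical paths, for illustration only.
cmd = build_ocr_command("/tmp/combined.pdf", "/output/result.pdf", jobs=12)
print(shlex.join(cmd))
```

Building the command as a list, rather than a single string, lets it be passed to subprocess.run without shell quoting problems.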

Requirements

Docker
Input folders containing TIFF, JPEG, or PDF files
Sufficient disk space for output

Performance

On a 48-core machine:

Small docs (10-20 pages): ~30-60 seconds each
Large docs (100+ pages): ~5-15 minutes each
100 folders (~50 pages each): ~2 hours total
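
The last figure can be turned into a rough throughput estimate. The arithmetic below assumes the run used -p 4, as in the balanced example; this README does not say which settings produced these numbers.

```python
def throughput(folders, pages_per_folder, total_hours, parallel_folders):
    """Return (pages per minute, wall-clock minutes per folder) for a batch run."""
    total_minutes = total_hours * 60
    pages_per_minute = folders * pages_per_folder / total_minutes
    # With parallel_folders running at once, the batch completes in "waves".
    waves = folders / parallel_folders
    minutes_per_folder = total_minutes / waves
    return pages_per_minute, minutes_per_folder


ppm, mpf = throughput(folders=100, pages_per_folder=50, total_hours=2, parallel_folders=4)
print(f"{ppm:.1f} pages/min overall, {mpf:.1f} min per 50-page folder")
```

Under that assumption, the batch averages about 42 pages per minute, and each 50-page folder takes just under 5 minutes of wall-clock time.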

License

This project uses OCRmyPDF, Tesseract OCR, and jbig2enc. See their respective licenses.
