Batch process multiple folders of scanned documents (TIFF, JPEG, PDF) into searchable, compressed PDFs using OCR.
- Batch processing: Process entire directories of document folders in parallel
- OCR with Tesseract: Adds searchable text layer with automatic deskewing and rotation
- JBIG2 compression: Optimizes file size while preserving OCR quality
- Parallel processing: Utilize all CPU cores efficiently
- Docker-based: Easy deployment, consistent environment
# Build the Docker image
docker build -t ocr-batch-processor .
# Process all folders in /input, output to /output
# 4 folders in parallel, 12 CPU cores per folder (48 total)
docker run -v /path/to/input:/input -v /path/to/output:/output \
ocr-batch-processor python batch_process.py /input /output -p 4 -j 12
Input folder structure:
/input/
├── 630666-01/
│   ├── page_001.tif
│   ├── page_002.tif
│   └── page_003.tif
├── 629157-02/
│   └── scan.pdf
└── 633421-03/
    ├── img001.jpg
    └── img002.jpg
Output:
/output/
├── 630666-01.pdf  (searchable, compressed)
├── 629157-02.pdf  (searchable, compressed)
└── 633421-03.pdf  (searchable, compressed)
Each subfolder becomes one output PDF with:
- Full text OCR layer (searchable/copyable)
- Automatic page rotation and deskewing
- JBIG2 compression (smaller file size)
- Preserved image quality
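The folder-to-PDF mapping described above can be sketched in a few lines of Python. This is only an illustration, not the code in batch_process.py: the function name `plan_outputs` and the exact extension set are assumptions (the README names only TIFF, JPEG, and PDF).

```python
from pathlib import Path

# Assumed extension list; the README only names TIFF, JPEG, and PDF.
SCAN_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".pdf"}


def plan_outputs(input_root, output_root):
    """Map each subfolder that contains scans to <output_root>/<folder name>.pdf."""
    plan = {}
    for folder in sorted(Path(input_root).iterdir()):
        if not folder.is_dir():
            continue  # loose files next to the folders are ignored
        pages = sorted(
            p for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() in SCAN_EXTENSIONS
        )
        if pages:  # skip folders with nothing to process
            plan[folder] = (Path(output_root) / f"{folder.name}.pdf", pages)
    return plan
```

Calling `plan_outputs("/input", "/output")` on the structure shown earlier would pair `630666-01/` with `/output/630666-01.pdf`, and so on for the other two folders.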
Balanced approach (4 folders × 12 cores = 48 cores):
docker run -v /input:/input -v /output:/output \
ocr-batch-processor python batch_process.py /input /output -p 4 -j 12
More parallelism (6 folders × 8 cores = 48 cores):
docker run -v /input:/input -v /output:/output \
ocr-batch-processor python batch_process.py /input /output -p 6 -j 8
# High-quality OCR (slower, larger files)
docker run -v /input:/input -v /output:/output \
ocr-batch-processor python batch_process.py /input /output -p 4 -j 12 --accurate-ocr
# Preview what will be processed (no actual processing)
docker run -v /input:/input -v /output:/output \
ocr-batch-processor python batch_process.py /input /output --dry-run
# Longer timeout for very large documents
docker run -v /input:/input -v /output:/output \
ocr-batch-processor python batch_process.py /input /output -p 4 -j 12 --timeout 7200
If you just need to process one folder:
docker run -v /path/to/folder:/input -v /output:/output \
ocr-batch-processor python FINAL.py /input/subfolder /output/result.pdf -j 48
See DEPLOYMENT.md for detailed deployment instructions, troubleshooting, and performance expectations.
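The usage examples keep `-p` × `-j` equal to the 48 available cores (4 × 12, 6 × 8). A small helper along these lines can pick a matching `-j` for a given `-p`. It is a sketch that assumes `-p` is the number of folders processed in parallel and `-j` the cores per folder, as the quick start comment states; the function name `jobs_per_folder` is hypothetical.

```python
import os


def jobs_per_folder(parallel_folders, total_cores=None):
    """Return a -j value so that parallel_folders * jobs stays within total_cores."""
    if parallel_folders < 1:
        raise ValueError("parallel_folders must be at least 1")
    if total_cores is None:
        total_cores = os.cpu_count() or 1  # fall back to 1 if the count is unknown
    # Integer division rounds down, so the machine is never oversubscribed.
    return max(1, total_cores // parallel_folders)


for p in (4, 6):
    print(f"-p {p} -j {jobs_per_folder(p, total_cores=48)}")
# prints "-p 4 -j 12" then "-p 6 -j 8"
```

Inside a container, `os.cpu_count()` may report the host's cores rather than the container's CPU limit, so passing `total_cores` explicitly is the safer choice.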
Files:
- batch_process.py: Parallel batch processing wrapper
- FINAL.py: Single folder processor (combine → OCR → compress)
- opencv_optimizer.py: JBIG2 compression engine
- jbig2enc/: JBIG2 encoder binaries and utilities
Requirements:
- Docker
- Input folders containing TIFF, JPEG, or PDF files
- Sufficient disk space for output
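For readers curious how a parallel wrapper around a single-folder processor can work, here is a minimal sketch. It is not the actual batch_process.py: `run_folder` and `run_batch` are hypothetical names, and the approach shown (one child process per folder, a thread pool to cap concurrency, a per-folder timeout) is just one common way to do it.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_folder(command, timeout):
    """Run one per-folder command; return 'ok', 'failed', or 'timeout'."""
    try:
        # subprocess.run kills the child process if the timeout expires.
        result = subprocess.run(command, timeout=timeout, capture_output=True)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"


def run_batch(commands, parallel, timeout):
    """Run up to `parallel` commands at once; return statuses in input order."""
    # Threads are enough here: the heavy work happens in the child processes.
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        return list(pool.map(lambda cmd: run_folder(cmd, timeout), commands))
```

Each command would mirror the single-folder invocation shown earlier, for example `[sys.executable, "FINAL.py", "/input/630666-01", "/output/630666-01.pdf", "-j", "12"]`. A failure or timeout in one folder is reported without stopping the others.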
On a 48-core machine:
- Small docs (10-20 pages): ~30-60 seconds each
- Large docs (100+ pages): ~5-15 minutes each
- 100 folders (~50 pages each): ~2 hours total
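As a rough sanity check on the last figure, total time can be estimated by treating the batch as waves of parallel folders. The sketch below assumes `-p 4` and about 5 minutes per 50-page folder; that per-folder time is an interpolation between the small and large document figures above, not a measured value, and `estimate_total_minutes` is a hypothetical helper.

```python
import math


def estimate_total_minutes(n_folders, minutes_per_folder, parallel_folders):
    """Wall-clock estimate when folders run in waves of `parallel_folders`."""
    waves = math.ceil(n_folders / parallel_folders)
    return waves * minutes_per_folder


# 100 folders, 4 at a time, assuming ~5 minutes per 50-page folder.
print(estimate_total_minutes(100, 5, 4))  # 125 minutes, roughly 2 hours
```

Real runs will vary with scan resolution, page content, and disk speed, so treat the result as an order-of-magnitude guide.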
Uses OCRmyPDF, Tesseract OCR, and jbig2enc. See respective licenses.