OCRBase

How to Install and Set Up OCRBase: Complete Self-Hosting Guide for PDF to JSON Conversion

PDF extraction remains one of those problems that looks deceptively easy until you’re knee-deep in regex patterns, trying to parse a random receipt that you scanned in at a 15-degree angle. The traditional approach for devs – throwing Tesseract at it and praying – works fine for clean documents, but the moment you need structured data from real-world PDFs at any kind of scale, things fall apart quickly. Tables become chaos, headers merge with body text, and that carefully formatted JSON you needed? Good luck.

So when OCRBase showed up on my radar claiming to solve this through a combination of PaddleOCR and LLM-powered parsing, my immediate reaction was to dig into the architecture. The premise is interesting: instead of treating OCR as a purely optical problem, it layers intelligent parsing on top to convert extracted text into clean markdown and JSON. Beyond the feature set, what I liked is that it’s open-source and self-hostable: if you’re dealing with sensitive documents, you don’t have to ship that data to yet another third-party API.

The real question is whether this architecture actually delivers on the promise of production-grade document processing. In this breakdown, we’ll examine how OCRBase’s queue-based processing handles document throughput, take a look at the TypeScript SDK and React hooks they’ve built for integration, and figure out if this approach holds up under real-world conditions. If you’ve been cobbling together your own OCR pipeline or burning API credits on commercial solutions, this might be worth your attention.

After spending time with the implementation, we’ll walk through the complete setup process and integration patterns that actually work in production environments.

What is OCRBase?

OCRBase is an open-source document processing solution that combines advanced OCR capabilities with LLM-powered parsing to convert PDFs into structured markdown and JSON data. Built around the PaddleOCR-VL-0.9B model for accurate text extraction, it offers a complete API-driven solution for document processing at scale.

Key Features:

  • Advanced OCR Processing: Uses the PaddleOCR-VL-0.9B model for high-accuracy text extraction from PDF documents
  • Structured Data Extraction: Define custom schemas and receive structured JSON output from unstructured documents
  • Scalable Architecture: Queue-based processing system designed to handle thousands of documents efficiently
  • Type-safe TypeScript SDK: Full TypeScript support with React hooks for seamless frontend integration
  • Real-time Updates: WebSocket notifications provide live progress updates for document processing jobs
  • Self-hostable: Complete control over deployment and data processing on your own infrastructure

Prerequisites

Before you begin, make sure you have:

  • [ ] Docker and Docker Compose installed (version 20.10 or higher)
  • [ ] Git for cloning the repository
  • [ ] At least 4GB RAM available for the Docker containers
  • [ ] Node.js 18+ and Bun runtime for SDK development
  • [ ] A text editor or IDE for configuration files

Step-by-Step Installation Guide

Step 1: Clone the OCRBase Repository

First, clone the official OCRBase repository to your local machine:

git clone https://github.com/majcheradam/ocrbase.git
cd ocrbase

Expected result: You should see the repository files downloaded and be inside the ocrbase directory.

Step 2: Examine the Project Structure

Take a moment to understand the project layout:

ls -la

You’ll see the main components:

  • docker-compose.yml – Container orchestration
  • .env.example – Environment configuration template
  • packages/ – Contains the SDK and core components
  • apps/ – Application components

The project structure tells you a lot about the engineering philosophy here. Instead of cramming everything into a monolith, they’ve carefully separated the SDK from the core processing engine. This modular approach makes it easier to maintain and suggests they’ve thought about how teams actually deploy and integrate these kinds of tools.

Step 3: Configure Environment Variables

Copy the example environment file and customize it for your setup:

cp .env.example .env

Open the .env file in your preferred text editor:

nano .env

The key settings to configure:

# Database configuration
DATABASE_URL="postgresql://ocrbase:password@postgres:5432/ocrbase"

# Redis configuration for job queue
REDIS_URL="redis://redis:6379"

# OCR processing settings
OCR_MODEL_PATH="/models/paddleocr"
MAX_CONCURRENT_JOBS=3

# API configuration
API_PORT=3000
WEBSOCKET_PORT=3001

Expected result: Your .env file should contain all necessary configuration variables with values appropriate for your setup.

Step 4: Start the OCRBase Services

Launch all services using Docker Compose:

docker-compose up -d

This command starts several containers:

  • PostgreSQL database
  • Redis for job queuing
  • OCRBase API server
  • WebSocket server for real-time updates
  • OCR processing workers

Expected result: All containers should start without errors. You can verify with:

docker-compose ps

Step 5: Verify Installation

Check that all services are running properly:

docker-compose logs --tail=50

Look for these success indicators:

  • Database migrations completed
  • API server listening on port 3000
  • WebSocket server ready
  • OCR workers initialized

The logs reveal another thoughtful design decision: they handle database migrations automatically on startup. This eliminates the common headache of remembering to run migration scripts manually, especially when deploying updates.

Step 6: Test the API Endpoint

Verify the API is responding:

curl http://localhost:3000/health

Expected result: You should receive a JSON response indicating the service is healthy:

{
  "status": "ok",
  "timestamp": "2026-01-27T10:30:00.000Z",
  "services": {
    "database": "connected",
    "redis": "connected",
    "ocr": "ready"
  }
}
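
If you want to wire this check into a deploy script, a small helper that validates the payload can be handy. This is a sketch, not part of the SDK; the field names simply mirror the sample response above, so adjust them if your deployment’s payload differs:

```typescript
// Minimal health-check helper (sketch; not part of the OCRBase SDK).
// Field names mirror the sample /health response shown above.
interface HealthResponse {
  status: string;
  timestamp: string;
  services: Record<string, string>;
}

// Returns true only when the API reports "ok" and every dependent
// service (database, redis, ocr) is in a healthy state.
function isHealthy(body: HealthResponse): boolean {
  const healthyStates = new Set(["connected", "ready", "ok"]);
  return (
    body.status === "ok" &&
    Object.values(body.services).every((s) => healthyStates.has(s))
  );
}

const sample: HealthResponse = {
  status: "ok",
  timestamp: "2026-01-27T10:30:00.000Z",
  services: { database: "connected", redis: "connected", ocr: "ready" },
};

console.log(isHealthy(sample)); // → true
```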

Step 7: Install the TypeScript SDK

For application integration, install the OCRBase client library:

bun add ocrbase

Or if using npm:

npm install ocrbase

Expected result: The SDK should install without dependency conflicts and be ready for use in your applications.

Configuration Options

The configuration system reveals some impressive operational planning. Instead of hard-coding behavior deep in the application, they’ve exposed the parameters that actually matter for scaling through clean environment variables.

Processing Configuration

The OCR processing behavior can be tuned through environment variables:

# Maximum concurrent OCR jobs
MAX_CONCURRENT_JOBS=3

# Processing timeout (in milliseconds)
PROCESSING_TIMEOUT=300000

# Output format preferences
DEFAULT_OUTPUT_FORMAT="json"
ENABLE_MARKDOWN_OUTPUT=true

Queue Configuration

Adjust the job queue settings for your workload:

# Redis connection settings
REDIS_URL="redis://redis:6379"
REDIS_DB=0

# Queue processing settings
JOB_ATTEMPTS=3
JOB_BACKOFF_DELAY=5000
QUEUE_CONCURRENCY=5

The queue configuration options demonstrate real production experience. Configurable retry attempts with exponential backoff means they understand that OCR jobs can fail for transient reasons—network hiccups, memory pressure, corrupted image data—and building resilience into the core system saves you from debugging mysterious failures at 2 AM.
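
To make the retry behavior concrete, here’s a sketch of the delay schedule these settings imply, assuming a standard doubling backoff; the queue’s exact strategy may differ:

```typescript
// Compute the retry delay schedule implied by JOB_ATTEMPTS and
// JOB_BACKOFF_DELAY, assuming the delay doubles on each attempt
// (a common convention -- verify against your queue's actual config).
function backoffDelays(attempts: number, baseDelayMs: number): number[] {
  return Array.from({ length: attempts }, (_, i) => baseDelayMs * 2 ** i);
}

// With JOB_ATTEMPTS=3 and JOB_BACKOFF_DELAY=5000:
console.log(backoffDelays(3, 5000)); // → [ 5000, 10000, 20000 ]
```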

Database Optimization

For high-volume processing, optimize the database settings:

# PostgreSQL connection pool
DB_POOL_MIN=2
DB_POOL_MAX=20
DB_TIMEOUT=30000

# Enable query logging for debugging
DB_LOGGING=false

Common Configuration Patterns

High-Volume Processing Setup

For processing thousands of documents daily:

MAX_CONCURRENT_JOBS=8
QUEUE_CONCURRENCY=10
DB_POOL_MAX=50
PROCESSING_TIMEOUT=600000

Development Environment

For local development and testing:

MAX_CONCURRENT_JOBS=1
QUEUE_CONCURRENCY=2
DB_POOL_MAX=5
DB_LOGGING=true

Memory-Constrained Environment

For servers with limited RAM:

MAX_CONCURRENT_JOBS=1
QUEUE_CONCURRENCY=1
DB_POOL_MAX=10
OCR_BATCH_SIZE=1

SDK Integration and Usage

Basic Client Setup

Create an OCRBase client in your application:

import { createOCRBaseClient } from "ocrbase";

const client = createOCRBaseClient({
  baseUrl: "http://localhost:3000",
  timeout: 30000
});

Processing Documents

Submit a PDF for processing:

async function processDocument(pdfBuffer: Buffer) {
  try {
    const job = await client.jobs.create({
      type: "parse",
      document: pdfBuffer,
      outputFormat: "json"
    });
    
    console.log(`Job created: ${job.id}`);
    return job;
  } catch (error) {
    console.error("Processing failed:", error);
  }
}
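
If you can’t hold a WebSocket open (say, in a short-lived worker), polling is the fallback. The helper below is a generic sketch; `fetchStatus` stands in for whatever status call the SDK exposes (something like `client.jobs.get(job.id)`, which is an assumption about the API surface):

```typescript
// Poll a job's status until it completes, fails, or times out.
// `fetchStatus` is any function returning the current job state
// (it stands in for a hypothetical SDK status call).
async function pollUntilDone<T>(
  fetchStatus: () => Promise<{ status: string; result?: T }>,
  intervalMs = 1000,
  maxAttempts = 60
): Promise<T> {
  for (let i = 0; i < maxAttempts; i++) {
    const job = await fetchStatus();
    if (job.status === "completed") return job.result as T;
    if (job.status === "failed") throw new Error("OCR job failed");
    // Wait before the next poll to avoid hammering the API.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Timed out waiting for job completion");
}
```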

Real-time Progress Monitoring

Use WebSocket connections for live updates:

import { useOCRJob } from "ocrbase/react";

function DocumentProcessor({ jobId }: { jobId: string }) {
  const { job, progress, error } = useOCRJob(jobId);
  
  if (error) return <div>Error: {error.message}</div>;
  if (!job) return <div>Loading...</div>;
  
  return (
    <div>
      <h3>Processing Status: {job.status}</h3>
      <div>Progress: {progress}%</div>
      {job.status === "completed" && (
        <pre>{JSON.stringify(job.result, null, 2)}</pre>
      )}
    </div>
  );
}

The React hooks integration shows sophisticated understanding of how document processing actually gets used in applications. Most OCR tools dump a basic REST API on you and call it done. The useOCRJob hook handles WebSocket subscriptions, reconnection logic, and state management automatically. This is the kind of developer experience that saves hours of boilerplate code.
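
If you need similar behavior outside React, the state management the hook performs boils down to a small reducer over incoming events. The event names below are illustrative only, not the SDK’s actual wire protocol:

```typescript
// Minimal job-state reducer: apply a stream of progress events to a
// local state object. Event names are illustrative, not the real protocol.
type JobState = { status: string; progress: number };
type JobEvent = { type: "progress" | "completed" | "failed"; value?: number };

function reduceJobEvent(state: JobState, event: JobEvent): JobState {
  switch (event.type) {
    case "progress":
      return { ...state, progress: event.value ?? state.progress };
    case "completed":
      return { status: "completed", progress: 100 };
    case "failed":
      return { ...state, status: "failed" };
  }
}
```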

Tips and Troubleshooting

Common Issues

Problem: Docker containers fail to start

This usually happens when a port is already in use or when Docker doesn’t have enough memory available. To fix it:

  1. Check port availability:

netstat -tlnp | grep :3000

  2. Stop conflicting services or change ports in docker-compose.yml
  3. Ensure Docker has at least 4GB memory allocated
  4. Restart the Docker daemon if needed:

sudo systemctl restart docker

Problem: OCR processing is slow or fails

Performance issues often stem from resource constraints:

  1. Monitor container resources:

docker stats

  2. Reduce concurrent jobs in .env if memory is limited:

MAX_CONCURRENT_JOBS=1

  3. Check the OCR worker logs:

docker-compose logs ocr-worker

Problem: WebSocket connections drop frequently

Network configuration issues can cause connection instability:

  1. Verify the WebSocket port is accessible:

telnet localhost 3001

  2. Check firewall settings for WebSocket protocols
  3. Increase the connection timeout in the client configuration:

const client = createOCRBaseClient({
  baseUrl: "http://localhost:3000",
  websocketTimeout: 60000
});

Problem: Database connection errors

Database connectivity issues can halt processing:

  1. Verify the PostgreSQL container is healthy:

docker-compose exec postgres pg_isready

  2. Check the database logs:

docker-compose logs postgres

  3. Reset the database if it is corrupted (note: the -v flag removes volumes, deleting all stored data):

docker-compose down -v
docker-compose up -d

Pro Tips

  • Batch Processing: Process multiple documents simultaneously by creating multiple jobs and monitoring them collectively through the WebSocket connection.
  • Custom Schemas: Define JSON schemas for structured extraction to get consistent output formats tailored to your specific document types.
  • Resource Monitoring: Set up monitoring for the Redis queue depth and PostgreSQL connection counts to identify bottlenecks before they impact performance.
  • Backup Strategy: Regularly backup your PostgreSQL database and Redis snapshots, especially if you’re storing processing results long-term.
  • Load Testing: Use the SDK to create test scripts that simulate your expected document volume to validate performance under load.
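
The batch-processing tip above can be sketched as a small concurrency-limited runner. Here `submit` stands in for a call like `client.jobs.create(...)` (an assumption about the SDK); the limiter itself is generic:

```typescript
// Run `submit` over every item, keeping at most `concurrency` calls
// in flight at once. Results come back in input order.
// `submit` stands in for e.g. client.jobs.create (an assumption).
async function runBatch<T, R>(
  items: T[],
  submit: (item: T) => Promise<R>,
  concurrency = 4
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // safe: the event loop is single-threaded
      results[i] = await submit(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(concurrency, items.length) }, worker)
  );
  return results;
}
```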

Conclusion

Awesome! You now have a fully functional OCRBase installation capable of processing PDF documents into structured Markdown and JSON formats. The system includes queue-based processing for scalability, real-time progress tracking, and a type-safe TypeScript SDK for easy integration.

While I was a bit skeptical starting out, implementing the tool changed my view. OCRBase actually delivers on its promises, though it’s not perfect. The setup process is more involved than I’d like, especially that initial model download, and you definitely need decent hardware to run it smoothly. But once it’s running, the accuracy is solid and the structured extraction feature is genuinely useful.

The TypeScript SDK is well-designed and the real-time updates make it feel professional rather than hacky. I’ve been using it for a few weeks now and it’s become part of my regular workflow for document processing projects.

Is it worth the setup effort? If you’re processing more than a handful of documents regularly and need structured output, absolutely. Just make sure you have the hardware to support it properly.

Next steps:

  • Explore custom schema definitions for your specific document types
  • Set up monitoring and alerting for production deployments
  • Integrate OCRBase into your existing document processing workflows