Claude Router

A high-performance, cost-optimized routing layer for Claude models with virtual threads support.

Features

  • Racing Router - Races multiple models in parallel, returns fastest response
  • Adaptive Learning - Automatically identifies best-performing models over time
  • StructuredTaskScope - Java 25 structured concurrency for clean, safe parallelism
  • Tiered Routing - Routes requests to Haiku, Sonnet, or Opus based on priority
  • Cost Optimization - Achieves up to 96% cost savings vs using Opus for everything
  • Virtual Threads - Java 25 virtual threads for high concurrency (~9K req/sec)
  • Metrics Collection - Tracks latency, success rates, tokens, and costs per model

Model Tiers

| Model  | Tier | Input $/1M | Output $/1M | Use Case                    |
|--------|------|------------|-------------|-----------------------------|
| Haiku  | 1    | $0.25      | $1.25       | Simple queries, high volume |
| Sonnet | 2    | $3.00      | $15.00      | Balanced tasks              |
| Opus   | 3    | $15.00     | $75.00      | Complex analysis            |
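
The per-request cost implied by this table can be sketched as follows. This is an illustrative helper, not the repository's actual `Model` enum in `api/Model.java`, which may store pricing differently:

```java
import java.util.Map;

// Hypothetical cost helper based on the pricing table above.
public class CostExample {
    // USD per 1M tokens: {input, output}
    static final Map<String, double[]> PRICES = Map.of(
        "haiku",  new double[]{0.25, 1.25},
        "sonnet", new double[]{3.00, 15.00},
        "opus",   new double[]{15.00, 75.00});

    static double costUsd(String model, long inputTokens, long outputTokens) {
        double[] p = PRICES.get(model);
        return (inputTokens * p[0] + outputTokens * p[1]) / 1_000_000.0;
    }

    public static void main(String[] args) {
        // 1M input + 1M output tokens on Haiku: $0.25 + $1.25 = $1.50
        System.out.println(costUsd("haiku", 1_000_000, 1_000_000));
    }
}
```

The same token volume on Opus would cost $90.00, which is where the headline savings numbers come from.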

Racing Router & StructuredTaskScope

The RacingRouter uses Java 25's StructuredTaskScope (a preview API, JEP 505) to race multiple models in parallel and return the fastest response. This is a significant simplification over traditional approaches.

StructuredTaskScope vs Traditional Implementation

With StructuredTaskScope (Java 25)

private LLMResponse raceModels(List<Model> models, LLMRequest request) {
    try (var scope = StructuredTaskScope.open(
            Joiner.<LLMResponse>anySuccessfulResultOrThrow())) {

        // Fork concurrent tasks
        for (Model model : models) {
            scope.fork(() -> executeModel(model, request));
        }

        // Wait for first success - others auto-cancelled
        return scope.join();

    } catch (Exception e) {
        return handleError(e);
    }
}

Benefits:

  • Automatic cancellation of losing tasks
  • Structured lifetime - scope closes cleanly
  • Exception handling is straightforward
  • No thread pool management
  • No Future tracking or cleanup

Traditional Implementation (Pre-Java 25)

private LLMResponse raceModelsTraditional(List<Model> models, LLMRequest request) {
    ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();
    CompletableFuture<LLMResponse>[] futures = new CompletableFuture[models.size()];
    AtomicBoolean completed = new AtomicBoolean(false);

    try {
        // Create futures for each model
        for (int i = 0; i < models.size(); i++) {
            Model model = models.get(i);
            futures[i] = CompletableFuture.supplyAsync(() -> {
                if (completed.get()) {
                    throw new CancellationException("Race already won");
                }
                return executeModel(model, request);
            }, executor);
        }

        // Wait for first successful result
        CompletableFuture<Object> anyOf = CompletableFuture.anyOf(futures);
        LLMResponse winner = (LLMResponse) anyOf.get(30, TimeUnit.SECONDS);
        completed.set(true);

        // Manually cancel remaining futures
        for (CompletableFuture<LLMResponse> future : futures) {
            if (!future.isDone()) {
                future.cancel(true);
            }
        }

        return winner;

    } catch (TimeoutException e) {
        // Cancel all on timeout
        for (CompletableFuture<LLMResponse> future : futures) {
            future.cancel(true);
        }
        return handleError(e);
    } catch (Exception e) {
        return handleError(e);
    } finally {
        executor.shutdown();
    }
}

Problems:

  • Manual cancellation required
  • Complex exception handling
  • Race conditions with completed flag
  • Must manage executor lifecycle
  • Type casting from anyOf()
  • Timeout handling is verbose

Comparison Summary

| Aspect            | StructuredTaskScope | CompletableFuture         |
|-------------------|---------------------|---------------------------|
| Cancellation      | Automatic           | Manual                    |
| Resource Cleanup  | try-with-resources  | Manual shutdown           |
| Type Safety       | Full generics       | Casting required          |
| Thread Management | Built-in            | ExecutorService lifecycle |

Adaptive Racing Phases

The RacingRouter adapts its strategy based on accumulated data:

| Phase      | Races   | Behavior                  | Cost |
|------------|---------|---------------------------|------|
| Cold Start | 0-100   | Race ALL 3 models         | 3x   |
| Learning   | 100-500 | Race top 2 by win rate    | 2x   |
| Optimized  | 500+    | Single model if >80% wins | 1x   |
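
The phase logic above can be sketched as a pure function. Thresholds match the defaults shown under GET /config, but the method and class names here are illustrative, not the repository's actual RacingRouter code:

```java
import java.util.List;
import java.util.Map;

// Sketch of the adaptive phase selection described in the table above.
public class PhaseExample {
    static List<String> modelsToRace(long totalRaces, Map<String, Double> winRates) {
        if (totalRaces < 100) {                        // cold start: race everything
            return List.of("haiku", "sonnet", "opus");
        }
        // Rank models by win rate, best first
        List<String> ranked = winRates.entrySet().stream()
            .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
            .map(Map.Entry::getKey)
            .toList();
        if (totalRaces < 500) {                        // learning: race the top 2
            return ranked.subList(0, 2);
        }
        // Optimized: single model only if it dominates (>80% win rate)
        String best = ranked.get(0);
        return winRates.get(best) > 0.80 ? List.of(best) : ranked.subList(0, 2);
    }

    public static void main(String[] args) {
        Map<String, Double> rates = Map.of("haiku", 0.96, "sonnet", 0.038, "opus", 0.002);
        System.out.println(modelsToRace(1000, rates)); // prints [haiku]
    }
}
```

A real implementation would also need the exploration_rate shown in /config, occasionally re-racing non-dominant models so stale win rates can be corrected.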

Requirements

  • Java 25+
  • Gradle 8.14+

Quick Start

# Clone the repo
git clone https://github.com/SivagurunathanV/claude-router.git
cd claude-router

# Run the server
./gradlew run

# Test the endpoint
curl -X POST http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Hello","priority":"LOW"}'

API Endpoints

POST /chat

Route a request through the RacingRouter.

{
  "prompt": "Your prompt here",
  "priority": "LOW | MEDIUM | HIGH"
}

Racing Behavior:

  • Races multiple models in parallel (based on current phase)
  • Returns the fastest successful response
  • Automatically adapts model selection based on win rates
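
A request matching the JSON body above can be built with the JDK's own HttpClient. This is a client-side sketch assuming the default host and port from the Configuration section; note the naive string formatting does no JSON escaping, so a real client should use a JSON library:

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative client for the POST /chat endpoint documented above.
public class ChatRequestExample {
    static HttpRequest chatRequest(String prompt, String priority) {
        // No escaping: fine for a sketch, not for arbitrary prompts
        String json = String.format(
            "{\"prompt\":\"%s\",\"priority\":\"%s\"}", prompt, priority);
        return HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8080/chat"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(json))
            .build();
    }

    public static void main(String[] args) {
        HttpRequest req = chatRequest("Hello", "LOW");
        System.out.println(req.method() + " " + req.uri()); // POST http://localhost:8080/chat
    }
}
```

Sending it is then `HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString())`.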

GET /metrics

Get cost, performance metrics, and race statistics.

{
  "models": {
    "haiku": { "totalRequests": 1000, "raceWins": 960, "totalCostUSD": 0.26 },
    "sonnet": { "totalRequests": 500, "raceWins": 38, "totalCostUSD": 7.50 },
    "opus": { "totalRequests": 50, "raceWins": 2, "totalCostUSD": 3.75 }
  },
  "race_stats": {
    "total_races": 1000,
    "current_phase": "optimized",
    "models_racing": ["haiku"],
    "win_rates": { "haiku": 0.96, "sonnet": 0.038, "opus": 0.002 }
  },
  "total_cost_usd": 11.51,
  "cost_savings_pct": 95.9
}

GET /config

View current racing configuration.

{
  "racing_config": {
    "min_sample_size": 100,
    "learning_threshold": 500,
    "dominance_threshold": 0.80,
    "exploration_rate": 0.10
  },
  "current_phase": "optimized",
  "total_races": 1000
}

GET /health

Health check with phase info.

{
  "status": "ok",
  "active_threads": 12,
  "total_cost_usd": 11.51,
  "current_phase": "optimized"
}

POST /reset

Reset all metrics and race statistics.

Configuration

Environment variables:

| Variable            | Default | Description                    |
|---------------------|---------|--------------------------------|
| PORT                | 8080    | Server port                    |
| USE_VIRTUAL_THREADS | true    | Enable Java 25 virtual threads |
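
Reading these variables with their documented defaults is a one-liner each. The helper names below are illustrative, not the actual `RouterConfig.java` API:

```java
import java.util.Map;

// Hypothetical config reader for the environment variables above.
public class ConfigExample {
    static int port(Map<String, String> env) {
        return Integer.parseInt(env.getOrDefault("PORT", "8080"));
    }

    static boolean useVirtualThreads(Map<String, String> env) {
        return Boolean.parseBoolean(env.getOrDefault("USE_VIRTUAL_THREADS", "true"));
    }

    public static void main(String[] args) {
        // Passing the real environment; defaults apply when a variable is unset
        System.out.println(port(System.getenv()) + " " + useVirtualThreads(System.getenv()));
    }
}
```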

Benchmark Results

RacingRouter Performance

| Metric                  | Value     |
|-------------------------|-----------|
| P50 Latency             | 96ms      |
| Throughput (concurrent) | 231 req/s |
| HAIKU Win Rate          | 96%       |
| Cost Savings vs Opus    | 95.9%     |

RacingRouter vs TierRouter

| Metric                | RacingRouter    | TierRouter    |
|-----------------------|-----------------|---------------|
| Strategy              | Adaptive racing | Deterministic |
| P50 Latency           | 96ms            | ~100ms        |
| Adapts to Performance | Yes             | No            |
| Cost (1000 reqs)      | $0.46           | $0.29         |

Virtual Threads vs Platform Threads

10,000 requests, 1,000 concurrency

| Metric       | Virtual | Platform | Improvement |
|--------------|---------|----------|-------------|
| Requests/sec | 3,078   | 1,530    | 2x          |
| p50 latency  | 103ms   | 475ms    | 4.6x        |
| p95 latency  | 420ms   | 1,276ms  | 3x          |
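
The virtual-thread side of this benchmark boils down to one executor choice, available since Java 21. A minimal, self-contained sketch (task counts and sleep times here are illustrative, not the benchmark's actual parameters):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// One cheap virtual thread per task, instead of a fixed-size platform pool.
public class VirtualThreadsExample {
    // Runs n I/O-bound tasks and returns how many completed.
    static int runTasks(int n) {
        AtomicInteger done = new AtomicInteger();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < n; i++) {
                executor.submit(() -> {
                    Thread.sleep(1); // simulated I/O wait; parks the virtual thread cheaply
                    return done.incrementAndGet();
                });
            }
        } // close() waits for all submitted tasks before returning
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println(runTasks(10_000)); // prints 10000
    }
}
```

With platform threads the same workload would need thousands of OS threads or suffer queueing delay behind a bounded pool, which is what the p50/p95 gap in the table reflects.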

See BENCHMARK_RESULTS.md for detailed results.

Project Structure

app/src/main/java/example/
├── api/
│   ├── Model.java          # Model enum with pricing
│   ├── LLMRequest.java     # Request with priority
│   ├── LLMResponse.java    # Response with metrics
│   ├── ModelFactory.java   # Model instantiation
│   └── RequestHandler.java # Simulated inference
├── router/
│   ├── Router.java         # Router interface
│   ├── RacingRouter.java   # Adaptive racing with StructuredTaskScope
│   ├── RacingConfig.java   # Racing tunable parameters
│   ├── TierRouter.java     # Priority-based routing
│   ├── AdaptiveRouter.java # Performance-based routing
│   ├── RouterServer.java   # HTTP server
│   └── RouterConfig.java   # Server configuration
└── metrics/
    ├── MetricsCollector.java # Metrics aggregation + race stats
    └── ModelMetrics.java     # Per-model stats + race wins

Running Tests

./gradlew test

Running Benchmarks

# Start server
./gradlew run &

# Run racing router benchmark (tests all phases)
./scripts/racing_benchmark.sh

# Compare RacingRouter vs TierRouter
./scripts/router_comparison.sh

# Run simple load test
./scripts/simple_load_test.sh LOW 10000 1000

License

MIT
