A high-performance, cost-optimized routing layer for Claude models with virtual threads support.
- Racing Router - Races multiple models in parallel, returns fastest response
- Adaptive Learning - Automatically identifies best-performing models over time
- StructuredTaskScope - Java 25 structured concurrency for clean, safe parallelism
- Tiered Routing - Routes requests to Haiku, Sonnet, or Opus based on priority
- Cost Optimization - Achieves up to 96% cost savings vs using Opus for everything
- Virtual Threads - Java 25 virtual threads for high concurrency (~9K req/sec)
- Metrics Collection - Tracks latency, success rates, tokens, and costs per model
| Model | Tier | Input $/1M | Output $/1M | Use Case |
|---|---|---|---|---|
| Haiku | 1 | $0.25 | $1.25 | Simple queries, high volume |
| Sonnet | 2 | $3.00 | $15.00 | Balanced tasks |
| Opus | 3 | $15.00 | $75.00 | Complex analysis |
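Per-request cost follows directly from this table. A minimal sketch (the `Model` enum here is illustrative; the repo's `Model.java` may be shaped differently):

```java
public class PricingSketch {
    // Pricing per 1M tokens, taken from the table above
    enum Model {
        HAIKU(0.25, 1.25), SONNET(3.00, 15.00), OPUS(15.00, 75.00);

        final double inputPerMTok, outputPerMTok;
        Model(double in, double out) { inputPerMTok = in; outputPerMTok = out; }

        double costUSD(long inputTokens, long outputTokens) {
            return inputTokens * inputPerMTok / 1_000_000
                 + outputTokens * outputPerMTok / 1_000_000;
        }
    }

    public static void main(String[] args) {
        // Same workload (1,000 input + 500 output tokens) on each tier
        for (Model m : Model.values()) {
            System.out.printf("%s: $%.6f%n", m, m.costUSD(1_000, 500));
        }
    }
}
```

For that workload, Haiku costs $0.000875 per request versus $0.0525 on Opus, which is where the headline savings come from when most requests can be won by Haiku.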
The RacingRouter uses Java 25's StructuredTaskScope to race multiple models in parallel and return the fastest response. This is significantly simpler than the traditional CompletableFuture approach, shown below for comparison.
```java
private LLMResponse raceModels(List<Model> models, LLMRequest request) {
    try (var scope = StructuredTaskScope.open(
            Joiner.<LLMResponse>anySuccessfulResultOrThrow())) {
        // Fork concurrent tasks
        for (Model model : models) {
            scope.fork(() -> executeModel(model, request));
        }
        // Wait for first success - others auto-cancelled
        return scope.join();
    } catch (Exception e) {
        return handleError(e);
    }
}
```

Benefits:
- Automatic cancellation of losing tasks
- Structured lifetime - scope closes cleanly
- Exception handling is straightforward
- No thread pool management
- No Future tracking or cleanup
```java
private LLMResponse raceModelsTraditional(List<Model> models, LLMRequest request) {
    ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();
    @SuppressWarnings("unchecked")
    CompletableFuture<LLMResponse>[] futures = new CompletableFuture[models.size()];
    AtomicBoolean completed = new AtomicBoolean(false);
    try {
        // Create futures for each model
        for (int i = 0; i < models.size(); i++) {
            Model model = models.get(i);
            futures[i] = CompletableFuture.supplyAsync(() -> {
                if (completed.get()) {
                    throw new CancellationException("Race already won");
                }
                return executeModel(model, request);
            }, executor);
        }
        // Wait for the first future to complete (anyOf completes on the
        // first completion, whether it succeeded or failed)
        CompletableFuture<Object> anyOf = CompletableFuture.anyOf(futures);
        LLMResponse winner = (LLMResponse) anyOf.get(30, TimeUnit.SECONDS);
        completed.set(true);
        // Manually cancel remaining futures
        for (CompletableFuture<LLMResponse> future : futures) {
            if (!future.isDone()) {
                future.cancel(true);
            }
        }
        return winner;
    } catch (TimeoutException e) {
        // Cancel all on timeout
        for (CompletableFuture<LLMResponse> future : futures) {
            future.cancel(true);
        }
        return handleError(e);
    } catch (Exception e) {
        return handleError(e);
    } finally {
        executor.shutdown();
    }
}
```

Problems:
- Manual cancellation required
- Complex exception handling
- Race conditions with the `completed` flag
- Must manage executor lifecycle
- Type casting from `anyOf()`
- Timeout handling is verbose
| Aspect | StructuredTaskScope | CompletableFuture |
|---|---|---|
| Cancellation | Automatic | Manual |
| Resource Cleanup | try-with-resources | Manual shutdown |
| Type Safety | Full generics | Casting required |
| Thread Management | Built-in | ExecutorService lifecycle |
The RacingRouter adapts its strategy based on accumulated data:
| Phase | Races | Behavior | Cost |
|---|---|---|---|
| Cold Start | 0-100 | Race ALL 3 models | 3x |
| Learning | 100-500 | Race top 2 by win rate | 2x |
| Optimized | 500+ | Single model if >80% wins | 1x |
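The phase transitions above can be sketched as a pure function of the race count and observed win rates. This is an illustrative sketch whose thresholds mirror the `racing_config` defaults shown later; the actual `RacingRouter`, including its exploration rate, may decide differently:

```java
import java.util.List;
import java.util.Map;

public class PhaseSketch {
    enum Phase { COLD_START, LEARNING, OPTIMIZED }

    // Mirrors min_sample_size, learning_threshold, dominance_threshold
    static final int MIN_SAMPLE_SIZE = 100;
    static final int LEARNING_THRESHOLD = 500;
    static final double DOMINANCE_THRESHOLD = 0.80;

    static Phase phase(long totalRaces, double topWinRate) {
        if (totalRaces < MIN_SAMPLE_SIZE) return Phase.COLD_START;
        if (totalRaces < LEARNING_THRESHOLD || topWinRate <= DOMINANCE_THRESHOLD) {
            return Phase.LEARNING;
        }
        return Phase.OPTIMIZED;
    }

    // Which models to race, best win rate first
    static List<String> modelsToRace(long totalRaces, Map<String, Double> winRates) {
        List<String> ranked = winRates.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .toList();
        return switch (phase(totalRaces, winRates.get(ranked.get(0)))) {
            case COLD_START -> ranked;                // race all models: 3x cost
            case LEARNING   -> ranked.subList(0, 2);  // top 2 by win rate: 2x cost
            case OPTIMIZED  -> ranked.subList(0, 1);  // single dominant model: 1x cost
        };
    }

    public static void main(String[] args) {
        Map<String, Double> rates = Map.of("haiku", 0.96, "sonnet", 0.038, "opus", 0.002);
        System.out.println(modelsToRace(50, rates));    // cold start: all three
        System.out.println(modelsToRace(1_000, rates)); // optimized: haiku only
    }
}
```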
- Java 25+
- Gradle 8.14+
```bash
# Clone the repo
git clone https://github.com/SivagurunathanV/claude-router.git
cd claude-router

# Run the server
./gradlew run

# Test the endpoint
curl -X POST http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Hello","priority":"LOW"}'
```

Route a request through the RacingRouter.
```json
{
  "prompt": "Your prompt here",
  "priority": "LOW | MEDIUM | HIGH"
}
```

Racing Behavior:
- Races multiple models in parallel (based on current phase)
- Returns the fastest successful response
- Automatically adapts model selection based on win rates
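For a programmatic client, the same request can be issued with the JDK's built-in `HttpClient`. A sketch (`ChatClientSketch` is hypothetical; it assumes the server from the quick start is listening on localhost:8080):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ChatClientSketch {
    static HttpRequest buildChatRequest(String baseUrl, String prompt, String priority) {
        String body = "{\"prompt\":\"" + prompt + "\",\"priority\":\"" + priority + "\"}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/chat"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request = buildChatRequest("http://localhost:8080", "Hello", "LOW");
        // Only send when invoked with "--send", so the sketch runs without a live server
        if (args.length > 0 && args[0].equals("--send")) {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        } else {
            System.out.println("Would POST to " + request.uri());
        }
    }
}
```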
Get cost, performance metrics, and race statistics.
```json
{
  "models": {
    "haiku": { "totalRequests": 1000, "raceWins": 960, "totalCostUSD": 0.26 },
    "sonnet": { "totalRequests": 500, "raceWins": 38, "totalCostUSD": 7.50 },
    "opus": { "totalRequests": 50, "raceWins": 2, "totalCostUSD": 3.75 }
  },
  "race_stats": {
    "total_races": 1000,
    "current_phase": "optimized",
    "models_racing": ["haiku"],
    "win_rates": { "haiku": 0.96, "sonnet": 0.038, "opus": 0.002 }
  },
  "total_cost_usd": 11.51,
  "cost_savings_pct": 95.9
}
```

View current racing configuration.
```json
{
  "racing_config": {
    "min_sample_size": 100,
    "learning_threshold": 500,
    "dominance_threshold": 0.80,
    "exploration_rate": 0.10
  },
  "current_phase": "optimized",
  "total_races": 1000
}
```

Health check with phase info.
```json
{
  "status": "ok",
  "active_threads": 12,
  "total_cost_usd": 11.51,
  "current_phase": "optimized"
}
```

Reset all metrics and race statistics.
Environment variables:
| Variable | Default | Description |
|---|---|---|
| `PORT` | `8080` | Server port |
| `USE_VIRTUAL_THREADS` | `true` | Enable Java 25 virtual threads |
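Reading these variables might look like the following sketch (`ServerConfigSketch` is illustrative; the repo's `RouterConfig.java` may handle them differently):

```java
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ServerConfigSketch {
    final int port;
    final boolean useVirtualThreads;

    ServerConfigSketch(Map<String, String> env) {
        // Defaults match the table above
        port = Integer.parseInt(env.getOrDefault("PORT", "8080"));
        useVirtualThreads = Boolean.parseBoolean(env.getOrDefault("USE_VIRTUAL_THREADS", "true"));
    }

    // One virtual thread per task when enabled; a bounded platform-thread pool otherwise
    ExecutorService newExecutor() {
        return useVirtualThreads
                ? Executors.newVirtualThreadPerTaskExecutor()
                : Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    }

    public static void main(String[] args) {
        ServerConfigSketch cfg = new ServerConfigSketch(System.getenv());
        System.out.println("port=" + cfg.port + " virtualThreads=" + cfg.useVirtualThreads);
    }
}
```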
| Metric | Value |
|---|---|
| P50 Latency | 96ms |
| Throughput (concurrent) | 231 req/s |
| HAIKU Win Rate | 96% |
| Cost Savings vs Opus | 95.9% |
| Metric | RacingRouter | TierRouter |
|---|---|---|
| Strategy | Adaptive racing | Deterministic |
| P50 Latency | 96ms | ~100ms |
| Adapts to Performance | Yes | No |
| Cost (1000 reqs) | $0.46 | $0.29 |
10,000 requests, 1,000 concurrency
| Metric | Virtual | Platform | Improvement |
|---|---|---|---|
| Requests/sec | 3,078 | 1,530 | 2x |
| p50 latency | 103ms | 475ms | 4.6x |
| p95 latency | 420ms | 1,276ms | 3x |
See BENCHMARK_RESULTS.md for detailed results.
```
app/src/main/java/example/
├── api/
│   ├── Model.java           # Model enum with pricing
│   ├── LLMRequest.java      # Request with priority
│   ├── LLMResponse.java     # Response with metrics
│   ├── ModelFactory.java    # Model instantiation
│   └── RequestHandler.java  # Simulated inference
├── router/
│   ├── Router.java          # Router interface
│   ├── RacingRouter.java    # Adaptive racing with StructuredTaskScope
│   ├── RacingConfig.java    # Racing tunable parameters
│   ├── TierRouter.java      # Priority-based routing
│   ├── AdaptiveRouter.java  # Performance-based routing
│   ├── RouterServer.java    # HTTP server
│   └── RouterConfig.java    # Server configuration
└── metrics/
    ├── MetricsCollector.java  # Metrics aggregation + race stats
    └── ModelMetrics.java      # Per-model stats + race wins
```
```bash
./gradlew test
```

```bash
# Start server
./gradlew run &

# Run racing router benchmark (tests all phases)
./scripts/racing_benchmark.sh

# Compare RacingRouter vs TierRouter
./scripts/router_comparison.sh

# Run simple load test
./scripts/simple_load_test.sh LOW 10000 1000
```

MIT