This directory contains comprehensive benchmarking tools for comparing GraphBit with other popular AI agent frameworks including LangChain, LangGraph, PydanticAI, LlamaIndex, and CrewAI.
The benchmark suite measures and compares:
- Execution Time: Task completion speed in milliseconds
- Memory Usage: RAM consumption in MB
- CPU Usage: Processor utilization percentage
- Token Count: LLM token consumption
- Throughput: Tasks completed per second
- Error Rate: Failure percentage across scenarios
- Latency: Response time measurements
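As a rough illustration of how a few of these metrics relate to one another, the sketch below aggregates per-task records into mean execution time, throughput, and error rate. The `TaskRecord` shape and the `summarize` helper are hypothetical and are not taken from the suite's code. Treating only successful tasks as "completed" is also an assumption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRecord:
    """Hypothetical per-task record; not the suite's actual schema."""
    duration_ms: float
    succeeded: bool


def summarize(records: list[TaskRecord], wall_time_s: float) -> dict[str, float]:
    """Aggregate mean execution time, throughput, and error rate."""
    if not records or wall_time_s <= 0:
        raise ValueError("need at least one record and a positive wall time")
    failures = sum(1 for r in records if not r.succeeded)
    completed = len(records) - failures  # assumption: failed tasks do not count
    return {
        "mean_execution_ms": mean(r.duration_ms for r in records),
        "throughput_per_s": completed / wall_time_s,
        "error_rate_pct": 100.0 * failures / len(records),
    }


records = [
    TaskRecord(120.0, True),
    TaskRecord(80.0, True),
    TaskRecord(100.0, False),
    TaskRecord(100.0, True),
]
summary = summarize(records, wall_time_s=2.0)
print(summary)
```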
- GraphBit
- LangChain
- LangGraph
- PydanticAI
- LlamaIndex
- CrewAI
- AutoGen (Microsoft fork: autogen-agentchat)
- AG2 (community fork: ag2)
The suite runs six different scenarios to test various aspects of framework performance:
- Simple Task: Basic single LLM call
- Sequential Pipeline: Chain of dependent tasks
- Parallel Pipeline: Concurrent independent tasks
- Complex Workflow: Multi-step workflow with conditional logic
- Memory Intensive: Large data processing tasks
- Concurrent Tasks: Multiple simultaneous operations
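To make the difference between the Sequential Pipeline and Parallel Pipeline scenarios concrete, here is a minimal sketch using `asyncio`. The `fake_llm_call` coroutine is a stand-in invented for this example; the real scenarios call an LLM provider and may be structured differently.

```python
import asyncio


async def fake_llm_call(prompt: str) -> str:
    """Stand-in for a real LLM request; yields to the event loop like I/O would."""
    await asyncio.sleep(0.01)
    return f"response to: {prompt}"


async def sequential_pipeline(prompts: list[str]) -> list[str]:
    """Each step waits for the previous one and receives its output as context."""
    outputs: list[str] = []
    previous = ""
    for prompt in prompts:
        previous = await fake_llm_call(f"{prompt} | context: {previous}")
        outputs.append(previous)
    return outputs


async def parallel_pipeline(prompts: list[str]) -> list[str]:
    """Independent tasks are started together and awaited as a group."""
    return list(await asyncio.gather(*(fake_llm_call(p) for p in prompts)))


prompts = ["summarize", "translate", "classify"]
seq_results = asyncio.run(sequential_pipeline(prompts))
par_results = asyncio.run(parallel_pipeline(prompts))
print(len(seq_results), len(par_results))
```

In the sequential case total latency grows with the number of steps, while in the parallel case it is bounded by the slowest call, which is why the two are measured as separate scenarios.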
graphbit-benchmarks supports both Poetry and uv for dependency management.
Python requirement:
>=3.10,<3.14
Tested only on Python 3.10, 3.11, 3.12, and 3.13. Python 3.9 is excluded because the frameworks other than GraphBit do not support it.
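If you want to confirm the interpreter before installing anything, a small check along these lines may help. It simply encodes the `>=3.10,<3.14` range stated above; the `is_supported` helper is illustrative and not part of the benchmark code.

```python
import sys

MIN_VERSION = (3, 10)  # inclusive lower bound, from ">=3.10"
MAX_VERSION = (3, 14)  # exclusive upper bound, from "<3.14"


def is_supported(version: tuple) -> bool:
    """Return True if a (major, minor) pair falls inside the supported range."""
    return MIN_VERSION <= tuple(version) < MAX_VERSION


current = sys.version_info[:2]
if not is_supported(current):
    print(f"Python {current[0]}.{current[1]} is outside the range >=3.10,<3.14")
```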
uv reads dependencies from the [project] section in pyproject.toml.
uv sync
source .venv/bin/activate # .venv\Scripts\activate on Windows
poetry reads dependencies from the [tool.poetry] section in pyproject.toml.
poetry install
poetry env activate
Install AG2:
pip install "ag2[openai,anthropic,ollama]>=0.11.0"
Run AG2 benchmark:
python run_benchmark.py --framework ag2 --provider openai --model gpt-4o-mini
Note: ag2 and autogen-agentchat are separate packages with incompatible APIs. autogen-agentchat is Microsoft's fork; ag2 is maintained by the original AutoGen contributors. Both are benchmarked independently.
Use run_benchmark.py (or the module variant) to execute all scenarios with your chosen LLM provider and model.
python -m benchmarks.run_benchmark \
--provider openai \
--model gpt-4o-mini \
--frameworks graphbit,langchain \
--scenarios simple_task,parallel_pipeline
Use --concurrency to define how many tasks run in parallel.
By default, it uses the number of CPU cores available to the process.
Increase this value cautiously to avoid CPU contention.
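"The number of CPU cores available to the process" can differ from the machine's total core count, for example when the process is started under `taskset`. One plausible way to compute such a default is sketched below. The `default_concurrency` name is hypothetical, and the suite's actual logic may differ.

```python
import os


def default_concurrency() -> int:
    """Return the number of CPU cores this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: honours the affinity mask set by taskset or similar tools.
        return len(os.sched_getaffinity(0))
    # Platforms without sched_getaffinity: fall back to the logical core count.
    return os.cpu_count() or 1


print(default_concurrency())
```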
python benchmarks/run_benchmark.py --provider openai --model gpt-4o-mini --concurrency 8
Use --cpu-cores to specify a comma-separated list of cores (e.g. 0,1,2,3).
This helps ensure more reproducible results by isolating benchmark workloads from other system processes.
python benchmarks/run_benchmark.py --cpu-cores 0,1,2,3 --concurrency 2
Use --membind to bind the benchmark process's memory allocations to a specific
NUMA node. This can improve consistency on multi-socket systems.
python benchmarks/run_benchmark.py --membind 0
If libnuma is not available, the benchmark will attempt to re-execute itself
under numactl. Make sure numactl is installed and in your PATH when using
--membind.
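The re-execution fallback can be pictured roughly as follows. This is only a sketch of the general technique, not the benchmark's implementation: `find_numactl` and `build_numactl_argv` are hypothetical helpers, and the only external fact relied on is that `numactl` accepts `--membind=<nodes>` followed by the command to run.

```python
import shutil
import sys
from typing import Optional


def find_numactl() -> Optional[str]:
    """Locate numactl on PATH, or return None if it is not installed."""
    return shutil.which("numactl")


def build_numactl_argv(node: int, script_argv: list, numactl: str = "numactl") -> list:
    """Build the command line that re-runs a Python script under numactl."""
    if node < 0:
        raise ValueError("NUMA node index must be non-negative")
    return [numactl, f"--membind={node}", sys.executable, *script_argv]


cmd = build_numactl_argv(0, ["benchmarks/run_benchmark.py", "--provider", "openai"])
print(cmd)
# To actually re-execute (replaces the current process):
#   os.execvp(cmd[0], cmd)
```

A real implementation would also need a guard, such as dropping `--membind` from the forwarded arguments or setting an environment variable, so that the re-executed process does not try to re-execute itself again.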
Run scenarios sequentially by default to minimize noisy performance interference.
Only run multiple benchmarks in parallel if you assign each to a unique set of CPU cores with --cpu-cores.
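For readers curious what pinning to a core set involves, the sketch below parses a `--cpu-cores`-style string and applies it with `os.sched_setaffinity`, which is available on Linux only. Both helper names are made up for this example, and the suite may pin cores differently.

```python
import os


def parse_cpu_cores(spec: str) -> set:
    """Turn a string such as "0,1,2,3" into a set of core indices."""
    cores = {int(part) for part in spec.split(",") if part.strip()}
    if not cores:
        raise ValueError("expected at least one CPU core index")
    if any(core < 0 for core in cores):
        raise ValueError("CPU core indices must be non-negative")
    return cores


def pin_to_cores(cores: set) -> bool:
    """Pin the current process to the given cores; return False if unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # e.g. macOS or Windows
    os.sched_setaffinity(0, cores)  # raises OSError if no listed core is usable
    return True


cores = parse_cpu_cores("0,1,2,3")
print(sorted(cores))
```

When running two benchmarks side by side, giving each a disjoint set (for instance `0,1` and `2,3`) keeps them from competing for the same cores.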
| Option | Description | Example |
|---|---|---|
| --provider | LLM provider to use (e.g., openai, anthropic) | --provider openai |
| --model | Model name or ID for the chosen provider | --model gpt-4o-mini |
| --frameworks | Comma-separated list of frameworks to benchmark (e.g., graphbit,langchain,crewai) | --frameworks graphbit,langchain |
| --scenarios | Comma-separated list of benchmark scenarios to run | --scenarios simple_task,parallel_pipeline |
| --concurrency | Number of tasks to run in parallel. Defaults to CPU cores count. | --concurrency 8 |
| --cpu-cores | Comma-separated list of specific CPU cores to pin the process to | --cpu-cores 0,1,2,3 |
| --membind | Bind memory allocations to a specific NUMA node | --membind 0 |
| --output | Path to save benchmark results JSON file | --output results.json |
| --verbose | Enable detailed logging | --verbose |
| --dry-run | Show which benchmarks would be run, but do not execute them | --dry-run |
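The sketch below shows how a handful of these options could be declared with `argparse`, and how `--dry-run` might list the planned framework and scenario combinations. It is a hypothetical reconstruction for illustration only: the default values and the `comma_list` helper are assumptions, and the real `run_benchmark.py` may define its options differently.

```python
import argparse


def comma_list(value: str) -> list:
    """Split "a,b,c" into ["a", "b", "c"], ignoring empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hypothetical benchmark CLI sketch")
    parser.add_argument("--provider", default="openai")  # default is an assumption
    parser.add_argument("--model", default="gpt-4o-mini")  # default is an assumption
    parser.add_argument("--frameworks", type=comma_list, default=[])
    parser.add_argument("--scenarios", type=comma_list, default=[])
    parser.add_argument("--concurrency", type=int, default=None)
    parser.add_argument("--dry-run", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--frameworks", "graphbit,langchain", "--scenarios", "simple_task", "--dry-run"]
)
if args.dry_run:
    # Preview the plan without executing anything.
    for framework in args.frameworks:
        for scenario in args.scenarios:
            print(f"would run {scenario} on {framework}")
```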
# Run all benchmarks sequentially with default concurrency (auto-detects CPU cores)
python -m benchmarks.run_benchmark \
--provider openai \
--model gpt-4o-mini \
--frameworks graphbit,langchain \
--scenarios simple_task,parallel_pipeline
# Run with explicit concurrency
python benchmarks/run_benchmark.py \
--provider openai \
--model gpt-4o-mini \
--concurrency 8
# Pin to specific CPU cores for reproducible results
python benchmarks/run_benchmark.py \
--cpu-cores 0,1,2,3 \
--frameworks graphbit \
--scenarios parallel_pipeline \
--concurrency 2
# Just preview what will run without executing
python benchmarks/run_benchmark.py \
--frameworks graphbit,langchain \
--dry-run
# Show available options
docker-compose run --rm graphbit-benchmark run_benchmark.py --help
# List available models for a provider
docker-compose run --rm graphbit-benchmark run_benchmark.py --provider openai --list-models
# Run specific scenarios only
docker-compose run --rm graphbit-benchmark run_benchmark.py --scenarios "simple_task,parallel_pipeline"
# Run with verbose output
docker-compose run --rm graphbit-benchmark run_benchmark.py --verbose
# Run with custom configuration
docker-compose run --rm graphbit-benchmark run_benchmark.py \
--provider anthropic \
--model claude-3-5-sonnet-20241022 \
--temperature 0.2 \
--max-tokens 1000 \
--frameworks "graphbit,pydantic_ai" \
--verbose
Results are saved in multiple formats:
- JSON Data: results/benchmark_results_YYYYMMDD_HHMMSS.json
  - Raw performance metrics
  - Error details
  - Configuration used
- Visualizations: results/benchmark_comparison_YYYYMMDD_HHMMSS.png
  - Performance comparison charts
  - Memory usage graphs
  - Execution time analysis
- Logs: logs/{framework}.log
  - Detailed execution logs per framework
  - LLM responses and errors
  - Debug information
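Because result files carry a `YYYYMMDD_HHMMSS` timestamp in their names, the most recent run can be located without opening every file. The following sketch assumes the `results/` directory and file name pattern described above. It makes no assumption about the JSON schema, so it only loads the file and reports the top-level type.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Optional

PREFIX = "benchmark_results_"


def result_timestamp(path: Path) -> datetime:
    """Parse the YYYYMMDD_HHMMSS suffix from a results file name."""
    return datetime.strptime(path.stem[len(PREFIX):], "%Y%m%d_%H%M%S")


def latest_result(results_dir: Path) -> Optional[Path]:
    """Return the newest results file in the directory, or None if there is none."""
    candidates = list(results_dir.glob(f"{PREFIX}*.json"))
    return max(candidates, key=result_timestamp, default=None)


latest = latest_result(Path("results"))
if latest is not None:
    data = json.loads(latest.read_text(encoding="utf-8"))
    print(latest.name, type(data).__name__)
```

Files that match the prefix but carry a malformed timestamp will raise `ValueError` here; filter them out first if the directory may contain such files.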