GraphBit Framework Benchmarks

This directory contains benchmarking tools for comparing GraphBit with other popular AI agent frameworks: LangChain, LangGraph, PydanticAI, LlamaIndex, CrewAI, AutoGen, and AG2.

Overview

The benchmark suite measures and compares:

Execution Time: Task completion speed in milliseconds
Memory Usage: RAM consumption in MB
CPU Usage: Processor utilization percentage
Token Count: LLM token consumption
Throughput: Tasks completed per second
Error Rate: Failure percentage across scenarios
Latency: Response time measurements
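As a rough illustration of how several of these metrics relate, the sketch below times a callable and derives average execution time, throughput, and error rate from the same runs. It is not the suite's actual implementation: `measure` is a hypothetical helper, and `tracemalloc` only tracks Python-level allocations, so its peak figure is an approximation rather than true process RAM.

```python
import time
import tracemalloc


def measure(task, runs=5):
    """Run `task` several times and summarize basic metrics (illustrative only)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    durations_ms = []
    failures = 0
    tracemalloc.start()
    started = time.perf_counter()
    for _ in range(runs):
        t0 = time.perf_counter()
        try:
            task()
        except Exception:
            # Any exception counts as one failed run.
            failures += 1
        durations_ms.append((time.perf_counter() - t0) * 1000.0)
    elapsed_s = time.perf_counter() - started
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    completed = runs - failures
    return {
        "avg_execution_time_ms": sum(durations_ms) / runs,
        "peak_memory_mb": peak_bytes / (1024 * 1024),  # Python allocations only
        "throughput_tasks_per_s": completed / elapsed_s if elapsed_s > 0 else 0.0,
        "error_rate_pct": 100.0 * failures / runs,
    }
```

Calling `measure(lambda: sum(range(1000)))` returns a dictionary with the four keys above; CPU usage and token counts would need extra tooling (for example a process monitor and the provider's usage data) and are left out here.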

Frameworks

GraphBit
LangChain
LangGraph
PydanticAI
LlamaIndex
CrewAI
AutoGen (Microsoft fork — autogen-agentchat)
AG2 (community fork — ag2)

Benchmark Scenarios

The suite runs six different scenarios to test various aspects of framework performance:

Simple Task: Basic single LLM call
Sequential Pipeline: Chain of dependent tasks
Parallel Pipeline: Concurrent independent tasks
Complex Workflow: Multi-step workflow with conditional logic
Memory Intensive: Large data processing tasks
Concurrent Tasks: Multiple simultaneous operations
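To make the difference between the Sequential Pipeline and Parallel Pipeline scenarios concrete, here is a minimal sketch using `asyncio`. The `fake_llm_call` function is a stand-in that sleeps instead of contacting a provider, and the function names are assumptions for illustration; the real scenarios are defined by each framework's benchmark code.

```python
import asyncio


async def fake_llm_call(prompt: str) -> str:
    # Stand-in for a real provider call; sleeps instead of hitting an API.
    await asyncio.sleep(0.01)
    return f"response to: {prompt}"


async def sequential_pipeline(prompts):
    # Each step receives the previous step's output, so calls cannot overlap.
    context = ""
    for prompt in prompts:
        context = await fake_llm_call(f"{prompt} | {context}")
    return context


async def parallel_pipeline(prompts):
    # Independent tasks are awaited together and may overlap in time.
    return await asyncio.gather(*(fake_llm_call(p) for p in prompts))
```

With N prompts, the sequential version takes roughly N times the single-call latency, while the parallel version takes roughly one call's latency, which is the gap these two scenarios are meant to expose.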

Environment Setup

graphbit-benchmarks supports both Poetry and uv for dependency management.

Python requirement: >=3.10,<3.14
Tested on Python 3.10, 3.11, 3.12, and 3.13 only. Python 3.9 is excluded because frameworks other than GraphBit do not support it.

Option 1: Using `uv` (Recommended)

uv reads dependencies from the [project] section in pyproject.toml.

uv sync
source .venv/bin/activate  # on Windows: .venv\Scripts\activate

Option 2: Using `poetry`

poetry reads dependencies from the [tool.poetry] section in pyproject.toml.

poetry install
eval $(poetry env activate)  # `poetry env activate` only prints the activation command

AG2 (community-maintained fork)

pip install "ag2[openai,anthropic,ollama]>=0.11.0"

Run AG2 benchmark:

python run_benchmark.py --frameworks ag2 --provider openai --model gpt-4o-mini

Note: ag2 and autogen-agentchat are separate packages with incompatible APIs. autogen-agentchat is Microsoft's fork; ag2 is maintained by the original AutoGen contributors. Both are benchmarked independently.

Running Benchmarks

Use run_benchmark.py (or the module variant) to execute all scenarios with your chosen LLM provider and model.

Example

python -m benchmarks.run_benchmark \
  --provider openai \
  --model gpt-4o-mini \
  --frameworks graphbit,langchain \
  --scenarios simple_task,parallel_pipeline

Concurrency Control

Use --concurrency to define how many tasks run in parallel. By default, it uses the number of CPU cores available to the process. Increase this value cautiously to avoid CPU contention.

python benchmarks/run_benchmark.py --provider openai --model gpt-4o-mini --concurrency 8
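The phrase "CPU cores available to the process" can differ from the machine's total core count when an affinity mask is in effect. One plausible way to compute such a default is sketched below; `default_concurrency` is a hypothetical name, and whether the benchmark does exactly this is an assumption.

```python
import os


def default_concurrency() -> int:
    # os.sched_getaffinity reflects the process's CPU affinity mask (Linux only),
    # so it already accounts for pinning applied by taskset or similar tools.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback for platforms without affinity support (macOS, Windows).
    return os.cpu_count() or 1
```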

CPU Core Pinning

Use --cpu-cores to specify a comma-separated list of cores (e.g. 0,1,2,3). This helps ensure more reproducible results by isolating benchmark workloads from other system processes.

python benchmarks/run_benchmark.py --cpu-cores 0,1,2,3 --concurrency 2
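For readers curious how a comma-separated core list can translate into actual pinning, the following sketch parses the value and applies it with `os.sched_setaffinity`. The helper names are hypothetical, and the affinity call is only available on Linux; this is an illustration of the technique, not the script's confirmed code path.

```python
import os


def parse_cpu_cores(spec: str) -> set[int]:
    # Turn "0,1,2,3" into {0, 1, 2, 3}; empty fragments are ignored.
    cores = {int(part) for part in spec.split(",") if part.strip()}
    if not cores or min(cores) < 0:
        raise ValueError(f"invalid core list: {spec!r}")
    return cores


def pin_to_cores(cores: set[int]) -> bool:
    # Returns False on platforms that lack sched_setaffinity.
    if not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, cores)  # 0 means the current process
    return True
```

Pinning to a core that does not exist on the machine raises `OSError`, so a real tool would likely validate the list against the cores actually present.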

Memory Binding

Use --membind to bind the benchmark process's memory allocations to a specific NUMA node. This can improve consistency on multi-socket systems.

python benchmarks/run_benchmark.py --membind 0

If libnuma is not available, the benchmark will attempt to re-execute itself under numactl. Make sure numactl is installed and in your PATH when using --membind.
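The re-execution fallback described above could look roughly like the sketch below. It assumes the standard `numactl --membind=<node>` option; the helper names and the `BENCH_UNDER_NUMACTL` environment variable (used here to avoid re-executing in a loop) are hypothetical and not taken from the benchmark's source.

```python
import os
import shutil
import sys

GUARD_ENV = "BENCH_UNDER_NUMACTL"  # hypothetical marker to avoid re-exec loops


def numactl_command(node: int, argv: list[str]) -> list[str]:
    # Wrap the current interpreter and script arguments in a numactl call.
    return ["numactl", f"--membind={node}", sys.executable, *argv]


def reexec_under_numactl(node: int, argv: list[str]) -> None:
    if os.environ.get(GUARD_ENV) == "1":
        return  # already running under numactl
    if shutil.which("numactl") is None:
        raise RuntimeError("numactl not found in PATH")
    os.environ[GUARD_ENV] = "1"
    cmd = numactl_command(node, argv)
    os.execvp(cmd[0], cmd)  # replaces the current process; does not return
```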

Sequential Execution Recommended

Run scenarios sequentially by default to minimize noisy performance interference. Only run multiple benchmarks in parallel if you assign each to a unique set of CPU cores with --cpu-cores.
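If you do run several benchmarks side by side, each run needs its own disjoint core set. The sketch below splits a core list into contiguous groups and formats each as a value suitable for `--cpu-cores`; `partition_cores` is a hypothetical helper and not part of the benchmark suite.

```python
def partition_cores(cores: list[int], groups: int) -> list[list[int]]:
    # Split cores into `groups` contiguous, non-overlapping chunks.
    if groups < 1 or groups > len(cores):
        raise ValueError("groups must be between 1 and the number of cores")
    base, extra = divmod(len(cores), groups)
    chunks, start = [], 0
    for i in range(groups):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        chunks.append(cores[start:start + size])
        start += size
    return chunks


def as_cpu_cores_arg(chunk: list[int]) -> str:
    # Format a chunk as the comma-separated value expected by --cpu-cores.
    return ",".join(str(core) for core in chunk)
```

For example, eight cores split into two groups yields the values `0,1,2,3` and `4,5,6,7`, one for each concurrent run.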

| Option | Description | Example |
| --- | --- | --- |
| `--provider` | LLM provider to use (e.g., `openai`, `anthropic`) | `--provider openai` |
| `--model` | Model name or ID for the chosen provider | `--model gpt-4o-mini` |
| `--frameworks` | Comma-separated list of frameworks to benchmark (e.g., `graphbit,langchain,crewai`) | `--frameworks graphbit,langchain` |
| `--scenarios` | Comma-separated list of benchmark scenarios to run | `--scenarios simple_task,parallel_pipeline` |
| `--concurrency` | Number of tasks to run in parallel. Defaults to the number of CPU cores. | `--concurrency 8` |
| `--cpu-cores` | Comma-separated list of specific CPU cores to pin the process to | `--cpu-cores 0,1,2,3` |
| `--membind` | Bind memory allocations to a specific NUMA node | `--membind 0` |
| `--output` | Path to save benchmark results JSON file | `--output results.json` |
| `--verbose` | Enable detailed logging | `--verbose` |
| `--dry-run` | Show which benchmarks would be run, but do not execute them | `--dry-run` |

Quick Start Cheatsheet

# Run all benchmarks sequentially with default concurrency (auto-detects CPU cores)
python -m benchmarks.run_benchmark \
  --provider openai \
  --model gpt-4o-mini \
  --frameworks graphbit,langchain \
  --scenarios simple_task,parallel_pipeline

# Run with explicit concurrency
python benchmarks/run_benchmark.py \
  --provider openai \
  --model gpt-4o-mini \
  --concurrency 8

# Pin to specific CPU cores for reproducible results
python benchmarks/run_benchmark.py \
  --cpu-cores 0,1,2,3 \
  --frameworks graphbit \
  --scenarios parallel_pipeline \
  --concurrency 2

# Just preview what will run without executing
python benchmarks/run_benchmark.py \
  --frameworks graphbit,langchain \
  --dry-run

Docker Commands Reference

# Show available options
docker-compose run --rm graphbit-benchmark run_benchmark.py --help

# List available models for a provider
docker-compose run --rm graphbit-benchmark run_benchmark.py --provider openai --list-models

# Run specific scenarios only
docker-compose run --rm graphbit-benchmark run_benchmark.py --scenarios "simple_task,parallel_pipeline"

# Run with verbose output
docker-compose run --rm graphbit-benchmark run_benchmark.py --verbose

# Run with custom configuration
docker-compose run --rm graphbit-benchmark run_benchmark.py \
  --provider anthropic \
  --model claude-3-5-sonnet-20241022 \
  --temperature 0.2 \
  --max-tokens 1000 \
  --frameworks "graphbit,pydantic_ai" \
  --verbose

Results

Results are saved in multiple formats:

JSON Data: results/benchmark_results_YYYYMMDD_HHMMSS.json
- Raw performance metrics
- Error details
- Configuration used
Visualizations: results/benchmark_comparison_YYYYMMDD_HHMMSS.png
- Performance comparison charts
- Memory usage graphs
- Execution time analysis
Logs: logs/{framework}.log
- Detailed execution logs per framework
- LLM responses and errors
- Debug information
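
The `YYYYMMDD_HHMMSS` suffix in the file names above corresponds to a `strftime` pattern. As a small sketch, the hypothetical helper below builds the two result paths for a given timestamp, which can be handy when locating the files from a specific run; the directory layout is taken from this section, everything else is an assumption.

```python
from datetime import datetime
from pathlib import Path


def result_paths(now: datetime, root: Path = Path(".")) -> dict[str, Path]:
    # YYYYMMDD_HHMMSS maps to the strftime pattern below.
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return {
        "json": root / "results" / f"benchmark_results_{stamp}.json",
        "chart": root / "results" / f"benchmark_comparison_{stamp}.png",
    }
```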
