Environment
- Parlant Version: 3.0.2
- Python: 3.12.4
- Platform: macOS
- LLM: Qwen (qwen3-32b)
Problem
Tool calls through Parlant take 10.22 seconds, while the same LLM API call made directly takes only 1.14 seconds, roughly a 9x slowdown.
Reproduction
Simple weather tool:
```python
import parlant.sdk as p

@p.tool
async def get_weather(context: p.ToolContext, city: str) -> p.ToolResult:
    return p.ToolResult(f"Weather in {city}: sunny, 22°")
```

Performance Comparison
| Method | Time |
|---|---|
| Direct Qwen API | 1.14s ✅ |
| Through Parlant | 10.22s |
| Overhead | +795% |
Logs
[ToolCaller] Creating batches finished in 0.0 seconds
[ToolCaller] Processing batches started
[Qwen LLM Request] started
[Qwen LLM Request] finished in 10.219 seconds ← 9-second overhead
[ToolCaller] Evaluation finished in 10.22 seconds
Analysis
- Only 1 LLM request is made (no hidden retries)
- Framework batch creation: ~0ms
- The entire 9-second overhead occurs during the single LLM call
- Same model, same API endpoint, vastly different performance
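One hypothesis this points to is prompt size: Parlant presumably sends a much larger prompt (guidelines, tool schemas, conversation context) than the bare request below. Here is a rough sketch for checking how latency scales with prompt size against the same endpoint; the padding sizes and test prompt are my own assumptions, not Parlant's actual prompts:

```python
import asyncio
import time

from openai import AsyncClient

# Assumes the OpenAI-compatible Qwen endpoint is configured via environment variables.
client = AsyncClient()


async def time_request(prompt: str) -> float:
    start = time.perf_counter()
    await client.chat.completions.create(
        model="qwen3-32b",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        extra_body={"enable_thinking": False},
    )
    return time.perf_counter() - start


async def main() -> None:
    base = 'Return {"city": "Beijing", "weather": "sunny"} as JSON.'
    # Pad the prompt to roughly simulate a framework-generated system prompt.
    for repeats in (0, 500, 2000, 4000):
        elapsed = await time_request("background text " * repeats + base)
        print(f"{repeats} padding repeats -> {elapsed:.2f}s")


asyncio.run(main())
```

If latency grows with prompt size here, most of the overhead would be prefill time on a large generated prompt rather than blocking inside the framework itself.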
Test Code
Direct API (1.14s):
```python
from openai import AsyncClient

# Assumes OPENAI_BASE_URL / OPENAI_API_KEY point at the OpenAI-compatible Qwen endpoint.
client = AsyncClient()
prompt = "..."  # same prompt used in both tests

response = await client.chat.completions.create(
    model="qwen3-32b",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
    extra_body={"enable_thinking": False},
)
# Takes 1.14 seconds
```

Through Parlant (10.22s):
```python
async with p.Server(nlp_service=load_qwen_service) as server:
    agent = await server.create_agent(...)
    await agent.create_guideline(
        condition="User asks about the weather",
        action="Get the current weather and reply with a friendly response and suggestions",
        tools=[get_weather],
    )
# Takes 10.22 seconds for the tool call
```

Impact
Makes Parlant unsuitable for:
- Real-time conversations
- Production environments with latency SLAs
- High-throughput applications
Questions
- Where is this 9-second overhead coming from?
- Are there extremely large system prompts being generated?
- Is there hidden synchronous blocking?
- Any performance tuning recommendations?
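To help answer the prompt-size question myself, I could wrap the OpenAI-compatible client with an httpx event hook that logs every outgoing request body. The hook API is standard httpx/openai-python; wiring the instrumented client into load_qwen_service is an assumption on my side and depends on how the service builds its client:

```python
import json

import httpx
from openai import AsyncClient


async def log_request(request: httpx.Request) -> None:
    # Log the size, message count, and model of every outgoing chat completion request.
    if request.url.path.endswith("/chat/completions"):
        body = json.loads(request.content)
        messages = body.get("messages", [])
        print(f"-> {len(request.content)} bytes, {len(messages)} messages, model={body.get('model')}")


# Pass this client to whatever constructs the Qwen NLP service so that
# Parlant's own requests go through the hook.
instrumented_client = AsyncClient(
    http_client=httpx.AsyncClient(event_hooks={"request": [log_request]}),
)
```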
Additional Testing
Different Qwen models (direct API):
- qwen3-32b: 1.14s
- qwen-turbo: 0.48s
- qwen-turbo (streaming): 0.84s (TTFB: 0.23s)
All direct calls finish in about a second or less, while Parlant consistently takes 10+ seconds.
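For reference, numbers like the streaming TTFB above can be reproduced with a small harness along these lines; the model name, prompt, and endpoint configuration are assumptions:

```python
import asyncio
import time

from openai import AsyncClient

# Assumes the OpenAI-compatible Qwen endpoint is configured via environment variables.
client = AsyncClient()


async def measure(model: str) -> None:
    start = time.perf_counter()
    first_token_at = None

    stream = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "What is the weather like in Beijing?"}],
        stream=True,
    )
    async for chunk in stream:
        # Record the time at which the first content token arrives.
        if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
            first_token_at = time.perf_counter()

    total = time.perf_counter() - start
    ttfb = (first_token_at - start) if first_token_at else total
    print(f"{model}: TTFB {ttfb:.2f}s, total {total:.2f}s")


asyncio.run(measure("qwen-turbo"))
```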
Let me know if you need profiling data or additional information!