
Framework adds 9x overhead to LLM API calls #587

@oyty

Description

Environment

  • Parlant Version: 3.0.2
  • Python: 3.12.4
  • Platform: macOS
  • LLM: Qwen (qwen3-32b)

Problem
Tool calls through Parlant take 10.22 seconds, while the same LLM API call made directly takes only 1.14 seconds: roughly a 9x performance overhead.

Reproduction

Simple weather tool:

@p.tool
async def get_weather(context: p.ToolContext, city: str) -> p.ToolResult:
    return p.ToolResult(f"Weather in {city}: sunny, 22°")

Performance Comparison

Method             Time
Direct Qwen API    1.14s ✅
Through Parlant    10.22s ⚠️
Overhead           +795%

Logs

[ToolCaller] Creating batches finished in 0.0 seconds
[ToolCaller] Processing batches started
[Qwen LLM Request] started
[Qwen LLM Request] finished in 10.219 seconds  ← 9-second overhead
[ToolCaller] Evaluation finished in 10.22 seconds

Analysis

  • Only 1 LLM request is made (no hidden retries)
  • Framework batch creation: ~0ms
  • The entire 9-second overhead occurs during the single LLM call
  • Same model, same API endpoint, vastly different performance
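Since the same endpoint behaves so differently, one way to narrow this down is to capture the request body Parlant actually sends (e.g. via an HTTP-level log or proxy) and compare it with the direct call's. A minimal sketch of the comparison step; the `direct` payload below is a hypothetical stand-in mirroring the direct call, not the framework's real request:

```python
def payload_summary(payload: dict) -> dict:
    """Summarize an OpenAI-style chat payload so the direct request and the
    framework's request can be compared (sizes and parameters, not content)."""
    messages = payload.get("messages", [])
    return {
        "n_messages": len(messages),
        "prompt_chars": sum(len(m.get("content", "")) for m in messages),
        "model": payload.get("model"),
        "extra_keys": sorted(k for k in payload if k not in {"model", "messages"}),
    }

# Hypothetical payload mirroring the direct call in this report:
direct = {
    "model": "qwen3-32b",
    "messages": [{"role": "user", "content": "What is the weather in Beijing?"}],
    "response_format": {"type": "json_object"},
    "enable_thinking": False,
}
print(payload_summary(direct))
```

If the framework's request turns out to have a much larger `prompt_chars`, or is missing flags like `enable_thinking` that the direct call sets, that would point at hypothesis 2.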

Test Code

Direct API (1.14s):

from openai import AsyncClient

client = AsyncClient()  # credentials / base_url assumed to come from the environment
response = await client.chat.completions.create(
    model="qwen3-32b",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
    extra_body={"enable_thinking": False},
)
# Takes 1.14 seconds

Through Parlant (10.22s):

async with p.Server(nlp_service=load_qwen_service) as server:
    agent = await server.create_agent(...)
    await agent.create_guideline(
        condition="The user asks about the weather",
        action="Get the current weather and reply with a friendly response and suggestions",
        tools=[get_weather],
    )
# Takes 10.22 seconds for tool call

Impact
Makes Parlant unsuitable for:

  • Real-time conversations
  • Production environments with latency SLAs
  • High-throughput applications

Questions

  1. Where is this 9-second overhead coming from?
  2. Are there extremely large system prompts being generated?
  3. Is there hidden synchronous blocking?
  4. Any performance tuning recommendations?
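For question 1, a simple wall-clock wrapper around each awaited step can localize the overhead (framework setup vs. the LLM round trip). A sketch using only the standard library, with a stand-in coroutine; swap in the real direct or framework call:

```python
import asyncio
import time

async def timed(label: str, coro):
    """Await a coroutine and report wall-clock time, to localize
    where the latency is spent."""
    t0 = time.perf_counter()
    result = await coro
    dt = time.perf_counter() - t0
    print(f"{label}: {dt:.2f}s")
    return result, dt

# Stand-in coroutine simulating a 50 ms call; replace with the real API call.
async def fake_llm_call(delay: float) -> str:
    await asyncio.sleep(delay)
    return "ok"

result, dt = asyncio.run(timed("direct call", fake_llm_call(0.05)))
```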

Additional Testing

Different Qwen models (direct API):

  • qwen3-32b: 1.14s
  • qwen-turbo: 0.48s
  • qwen-turbo (streaming): 0.84s (TTFB: 0.23s)

All direct calls complete in about a second or less, while Parlant consistently takes 10+ seconds.
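The TTFB figure above comes from streaming; measuring it separates queueing/prefill time from token generation time. A sketch of the measurement with a stand-in async stream (with a real call, pass the streaming response iterator, e.g. from `chat.completions.create(..., stream=True)`):

```python
import asyncio
import time

async def time_to_first_chunk(stream) -> float:
    """Time until the first chunk arrives from an async stream (TTFB)."""
    t0 = time.perf_counter()
    async for _chunk in stream:
        return time.perf_counter() - t0
    return time.perf_counter() - t0

# Stand-in stream simulating 20 ms of prefill latency before the first token.
async def demo_stream():
    await asyncio.sleep(0.02)
    yield "first token"
    yield "rest"

ttfb = asyncio.run(time_to_first_chunk(demo_stream()))
print(f"TTFB: {ttfb:.3f}s")
```

A TTFB close to the full request time would suggest the extra latency is in prefill (e.g. a very large prompt or server-side reasoning) rather than in token generation.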

Let me know if you need profiling data or additional information!
