Price Per Token

Price Per Token Discussion

3
Pinned
ellmanalex·9d ago

Why I made this forum!

Hello all!
I am launching this forum in hopes of creating a helpful community to discuss AI, development, and agents. There are many issues in the existing spaces (Reddit and Hacker News), and I am hoping that a forum owned by a solo developer can avoid some or all of them. These include:

AI-generated posts: we have a zero-tolerance policy for using AI to post here. If I suspect you of using AI, I will permanently ban your GitHub account (the only way to sign in) from the forum.

Big tech influence: I am not affiliated with any big tech companies or VCs. I rank posts with the Hacker News algorithm (rank = points / (age_hours + 2) ^ 1.8); happy to answer any questions on that. My only goal is to have positive engagement and an active community. I monetize via the ads you see on the sides of the site, and I will update here if that ever changes.
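That ranking formula drops out to a few lines of Python (a quick illustration; the `gravity` parameter name is mine, not part of the forum's code):

```python
# HN-style ranking as described above: rank = points / (age_hours + 2) ^ 1.8
def rank(points: int, age_hours: float, gravity: float = 1.8) -> float:
    """Score decays with age, so newer posts need fewer points to rank high."""
    return points / (age_hours + 2) ** gravity

# A fresh 5-point post outranks a 2-day-old 50-point post:
posts = [("old, 50 points", 50, 48.0), ("fresh, 5 points", 5, 1.0)]
ranked = sorted(posts, key=lambda p: rank(p[1], p[2]), reverse=True)
```

The +2 offset keeps brand-new posts from dividing by a near-zero age and dominating the front page.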

Price Per Token currently gets 1.5-2k daily visitors. We will need a bit more traffic for a very active community, but it is growing, and I plan on contributing here daily. If there is anything I can do to make this better, feel free to reply here.

2
arturoyo·1d ago

Measuring and optimizing LLM spend (optym.pro)

Been working on exactly this problem for the past year. What we've found is that the biggest lever isn't really observability — it's automatic model routing.

Most teams are paying for a single model (usually GPT-4 class) for every call, when in practice 60–70% of those calls could be handled by a model 5–10x cheaper with equivalent output quality.

We built optym.pro around this idea — it's an OpenAI-compatible router that evaluates each request and routes it to the best model for that specific task, in real time. No code changes on your end, just swap the endpoint.

Teams spending $3K+/month on LLMs are seeing 40–60% reduction. Happy to share more details if useful.
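For the curious, the core routing idea can be sketched in a few lines. This is not optym.pro's actual implementation (a real router would use a learned classifier, not word counts); model names and the heuristic below are purely illustrative:

```python
# Minimal sketch of complexity-based model routing (illustrative only).
CHEAP, FLAGSHIP = "small-model", "flagship-model"  # placeholder model names

def route(prompt: str, threshold: int = 40) -> str:
    """Send short/simple prompts to the cheap model, the rest to the flagship.
    Real routers replace this heuristic with a trained classifier."""
    hard_markers = ("prove", "refactor", "architecture", "multi-step")
    complexity = len(prompt.split()) + 50 * any(m in prompt.lower() for m in hard_markers)
    return FLAGSHIP if complexity >= threshold else CHEAP

assert route("Summarize this paragraph in one line.") == "small-model"
assert route("Refactor this module and explain the architecture tradeoffs.") == "flagship-model"
```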

2
Cahl-Dee·2d ago

Measuring and optimizing LLM spend

Curious what's working for people here in practice. The tooling landscape is getting crowded (LiteLLM, Helicone, Langfuse, LangSmith, Bifrost, provider dashboards, custom middleware...) but actual workflows still seem pretty fragmented, or completely unhinged (read: caveman).

I'm interested across three use cases:

  1. LLMs powering apps / products

  • Are you tracking spend per request, per customer, per feature, or per successful outcome?
  • Gateway-level? SDK wrapper? Something custom?
  • Beyond just "swap to a cheaper model," is anyone doing real optimization like semantic caching, prompt compaction, adaptive context windows, confidence-based model escalation?
  • Has anyone gotten to actual cost-per-user-action metrics rather than just raw token totals?

  2. LLMs inside agents (OpenClaw-style, tool-use loops, planner-executor setups)

This is where spend gets slippery. Lots of small calls that add up: planning, tool selection, heartbeat, memory updates, retries, self-correction.
  • Are you routing different parts of the loop to different models (cheap for heartbeat, expensive only for hard reasoning)?
  • Setting explicit budgets per run?
  • Anyone found good heuristics for stopping an agent from overthinking before the orchestration costs more than the task itself?

  3. LLMs for software development

Feels under-discussed because coding workflows look cheap per interaction but can be massive in aggregate.

  • Are you tracking spend per dev, per repo, per task type?
  • Aggressively filtering context before sending?
  • Using smaller models for first-pass edits and reserving bigger ones for architecture decisions?
  • Has anyone measured whether agentic coding actually reduces cost per merged change, or just increases throughput while also increasing spend?

What I'd love to hear:

  • Your stack and what you measure
  • Where spend surprised you
  • The optimization that actually moved the needle (and the one you thought would but didn't)
  • Whether you optimize for raw token cost, latency-adjusted cost, or cost per outcome
  • What's working for you?
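One of the patterns asked about above, confidence-based model escalation, can be sketched in a few lines. This is illustrative only: the lambdas stub out real model calls, and the confidence score would in practice come from logprobs, a verifier model, or self-rating:

```python
# Confidence-based escalation sketch: try the cheap model first,
# pay for the flagship only when the cheap answer looks shaky.
def answer(prompt: str, cheap_call, flagship_call, min_confidence: float = 0.8):
    """cheap_call / flagship_call return (text, confidence in [0, 1])."""
    text, conf = cheap_call(prompt)
    if conf >= min_confidence:
        return text, "cheap"
    return flagship_call(prompt)[0], "flagship"

# Stubbed example: the cheap model is unsure, so we escalate.
text, tier = answer("hard question",
                    cheap_call=lambda p: ("maybe?", 0.4),
                    flagship_call=lambda p: ("definitely.", 0.95))
```

The same shape works per loop step in an agent: heartbeat and tool selection rarely need the flagship, so the escalation threshold effectively becomes a per-run budget knob.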
2
z-spectral·2d ago

“Cost per token vs cost per outcome” - are we optimizing the wrong thing?

Everyone here tracks token pricing (this site literally exists for that), but I’m starting to think it’s becoming a misleading metric for real-world systems.

A few observations from building with LLMs:

  1. Token price is going down, but total system cost often goes up
  2. Longer context, retries, tool calls, and evaluation loops all multiply usage
  3. A “cheap” model can end up more expensive if it fails more often

There’s also the hidden cost of:

  • bad outputs → user churn
  • retries → more tokens
  • guardrails → more complexity

This matches what some FinOps discussions highlight - the real cost is driven by usage patterns, not list price...

So I’m wondering:
👉 Should we stop comparing models by $/token and instead compare them by something like:

  • cost per successful task
  • cost per correct answer
  • cost per user session
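The tradeoff in point 3 is easy to put in numbers (all figures below are made up for illustration):

```python
# A "cheap" model that fails more often can cost MORE per successful task
# once you count retries and a per-failure cost (review, churn risk, etc.).
def cost_per_success(price: float, success_rate: float, failure_cost: float = 0.0) -> float:
    """Expected spend per successful task, retrying until success."""
    expected_calls = 1 / success_rate
    expected_failures = expected_calls - 1
    return price * expected_calls + failure_cost * expected_failures

cheap = cost_per_success(price=0.002, success_rate=0.50, failure_cost=0.05)
big = cost_per_success(price=0.010, success_rate=0.95, failure_cost=0.05)
# Despite a 5x lower list price, the cheap model ends up pricier per success.
```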

Curious how others here think about this....

Do you track anything beyond token usage?
Or is everyone still optimizing for the wrong metric?

2
ellmanalex·7d ago

After Claude ban I found my new main model

I've been using OpenClaw for months with only Opus 4.6, Sonnet 4.6, and GPT 5.3/5.4. I'm the kind of person who needs the flagship model as long as budgets are reasonable.

Claude is dead. OpenAI made business plan quotas unusable. So I went shopping for alternatives.

GLM 5.1 and 5 Turbo: absolute garbage for agentic tasks and automation. Couldn't even write a simple Reddit reply without flooding Telegram with code dumps. Felt like talking to a drunk model. Cancelled. (They said "we'll refund" — still waiting 3 weeks later.)

MiMo V2 Pro: using it since launch, really liked it. Honestly got Opus/GPT vibes in many ways. After Claude banned OpenClaw yesterday I got the Token Plan (standard $16). Terrible credit system. Everything in OpenClaw deducts from credits. Session history, bootstrap MD content, tool outputs, cache — literally everything. One month's quota gone in 1 day after filling just 2 session contexts. Horribly inefficient. I will never pay again until they fix the credit logic.

Kimi: reviews were bad, never tried.

Grok: community feedback looked bad, skipped it.

Gemini: no monthly payment option. If there was I'd probably use it, but it's too expensive.

So I went with the most popular alternative: Minimax 2.7. Where MiMo and GPT failed to handle my nit cron task, Minimax 2.7 solved it in 5 minutes. And the quota on Minimax seems impossible to exhaust; I was shocked. How are they this generous? If anyone knows, please explain, because it really feels like it won't run out. I tested browser automations too. It's not as smart as Opus, but for my automation tasks, light coding work, and being a personal agent, it's enough. If it falls short, I might rotate between a few GPT Plus subscriptions for GPT 5.4 access. Right now, price/performance-wise, Minimax and 2x GPT Plus accounts are the only efficient OpenClaw model options I could find.

1

Have the GB10 devices become the current "best value" for LLMs?

I want to buy some real hardware because I feel like I'm falling behind. 3090s are >$1000 on eBay, and building out the server would be very expensive at current memory and storage prices. Macs are backordered for the next 5 months. I have no idea about the status of AMD or Intel products, but I don't want to fight driver and compatibility issues on top of trying to get models and harnesses running.

Are the GB10 variants the best value if you want to buy now? Is it better to try to wait on the M5 releases in 2-4 months? That seems like forever in today's fast-moving environment.

1

Used ray tracing cores on my RTX 5070 Ti for LLM routing — 218x speedup, runs entirely on 1 consumer GPU

Quick summary: I found a way to use the RT Cores (normally used for ray tracing in games) to handle expert routing in MoE models. Those cores sit completely idle during LLM inference, so why not put them to work?

What it does:

  • Takes the routing decision in MoE models (which experts process which tokens)
  • Projects tokens into 3D space
  • Uses the GPU's dedicated ray tracing hardware to find the right experts
  • O(log N) instead of O(N) — hardware-accelerated
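A rough CPU analogue of the idea, with scipy's KD-tree standing in for the BVH traversal the RT cores do in hardware (this is not the linked repo's code; all data below is random, and the projection is a plain random matrix):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 64, 64, 1024

experts = rng.normal(size=(n_experts, d_model))  # toy expert "centroids"
tokens = rng.normal(size=(n_tokens, d_model))

# "Projects tokens into 3D space": here just a random projection.
proj = rng.normal(size=(d_model, 3)) / np.sqrt(3)

# Spatial index over projected experts: O(log N) queries vs an O(N) dense scan.
tree = cKDTree(experts @ proj)
_, expert_idx = tree.query(tokens @ proj, k=2)  # top-2 routing, as in many MoEs
```

On a GPU the same nearest-neighbor query is what the RT cores accelerate natively, which is where the claimed speedup comes from.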

Numbers (OLMoE-1B-7B, RTX 5070 Ti 16GB):

  • 218x faster routing at batch 1024
  • 731x less VRAM for routing
  • Only +1.5% perplexity hit
  • 95.9% routing accuracy

Unexpected discovery: I also found that MoE experts don't actually specialize by topic. Tested across 3 different models (OLMoE, Qwen-MoE, DeepSeek-MoE) — they all specialize by syntactic type (content words vs function words vs punctuation). The "science expert" is a myth.

Code repo: https://github.com/JordiSilvestre/Spectral-AI

All papers are open access on Zenodo with full data and reproduction instructions: https://doi.org/10.5281/zenodo.19457288

1

Salesforce cut 4,000 support roles using AI agents. Then admitted the AI had reliability problems significant enough to warrant a strategic pivot.

I have said this multiple times and received a lot of pushback. But this Salesforce story makes it clearer than anything I could write.

You cannot deploy AI in production workflows without infrastructure governing how it executes. Salesforce just figured that out. The hard way.

They deployed Agentforce across their own help site, handling over 1.5 million customer conversations. Cut 4,000 support roles in the process. Then their SVP of Product Marketing said: "All of us were more confident about large language models a year ago."

One customer found satisfaction surveys were randomly not being sent despite clear instructions. The fix was deterministic triggers. Another name for what should have been enforced from the start.

Human agents had to step in to correct AI-generated responses. That is the babysitting problem. The same one developers describe when they say half their time goes into debugging the agent's reasoning instead of the output.

They could have added LLM-as-judge. A verification protocol. Some other mitigation. But all of that is post-hoc. It satisfies the engineering checklist. It does not satisfy the user who already got a wrong answer and moved on. A frustrated customer does not give you a second chance to get it right.

They have now added Agent Script, a rule-based scripting layer that forces step-by-step logic so the AI behaves predictably. Their product head wrote publicly about AI drift, when agents lose focus on their primary objectives as context accumulates. Stock is down 34% from peak.

The model was not the problem. Agentforce runs on capable LLMs. What failed was the system around them. No enforcement before steps executed. No constraint persistence across turns. No verification that instructions were actually followed before the next action ran.

They are now building what should have been there before the 4,000 roles were cut. Deterministic logic for business-critical processes, LLMs for the conversational layer.

That is not a new architecture. That is the enforcement layer. Arrived at the hard way.

1
Dace1187·6d ago

Bypassing context decay in long-running sims: Why we ditched sliding windows for strict DB mutations

If you’re building long-running agentic loops or text-based RPGs, you already know standard sliding windows and simple RAG eventually fall apart. By turn 30, the model forgets your inventory, hallucinates dead NPCs back to life, and totally loses the causal chain.

I’m working on a project called Altworld, and we decided to solve this by completely decoupling the LLM's narrative generation from the actual state management.

Instead of treating the chat transcript as the source of truth, "canonical run state is stored in structured tables and JSON blobs". We basically force the LLMs to act as highly constrained database mutators first, and storytellers last.

Here is the architectural pattern that keeps our simulation consistent across hundreds of turns.

The Pipeline: Specialist Roles

We don't use one massive prompt. Instead, "The AI layer is split into specialist roles rather than one monolithic prompt: scenario generation, scenario bootstrap, world systems reasoning, NPC planning, action resolution, narrative rendering".

When a user submits a move, the pipeline fires like this:

  1. State Load: We acquire a lock and pull the canonical state from PostgreSQL via Prisma. This includes exact numerical values for `coin`, `fatigue`, and `stress`.

  2. NPC & System Inference: We run smaller models (e.g., Gemini 3 Flash Preview via OpenRouter) to handle background logic. Crucially, "important NPCs make local plans and act based on limited knowledge rather than omniscient story scripting". They output JSON diffs.

  3. Action Adjudication: An action resolution model compares the user's intent against their stats and outputs a JSON result (success/fail, state changes).

  4. The Commit: The server transactionally persists all of these structured state changes to the database.

  5. Narrative Render: This is our golden rule: "narrative text is generated after state changes, not before". We pass the database diffs to the narrative model, which *only* has to write the prose describing what just happened.
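The turn loop can be sketched as below. These are illustrative stubs, not Altworld's actual code: the hardcoded diffs stand in for real LLM calls, and the in-memory dict stands in for the Postgres transaction:

```python
# Sketch of the pipeline: LLMs as constrained JSON mutators first, prose last.
state = {"coin": 10, "fatigue": 2, "stress": 1}  # canonical state (Postgres in the real system)

def npc_planning(state):
    """NPC & system inference: a small model emits JSON diffs, not prose."""
    return [{"key": "stress", "delta": +1}]

def resolve_action(state, move):
    """Action adjudication: user intent vs. stats -> structured result."""
    return {"success": True, "changes": [{"key": "coin", "delta": -3}]}

def run_turn(state, move):
    diffs = npc_planning(state)
    result = resolve_action(state, move)
    for d in diffs + result["changes"]:     # the commit (transactional in production)
        state[d["key"]] += d["delta"]
    # Narrative render happens AFTER the commit, purely from the diffs.
    return f"Turn resolved: {diffs + result['changes']}"

narration = run_turn(state, "buy a lantern")
```

Because the narrator only ever sees committed diffs, it cannot invent state the database doesn't hold, which is what keeps hundreds of turns consistent.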

Latency vs. Consistency

The obvious tradeoff here is latency. You are making 3-4 LLM calls per turn. We mitigate this by parallelizing the world/NPC reasoning where possible, and relying heavily on UI streaming.

Because we use a commercial Stripe setup for this project (Candles/subscriptions), I am strictly adhering to Rule 5 regarding no commercial self-promotion and Rule 10 against disguised marketing. Therefore, I won't drop direct links. But I did want to share this architecture, because treating LLMs as modular JSON calculators instead of omniscient storytellers is the only way we've found to reliably maintain state in highly mutable environments.

Has anyone else moved away from text-based context windows toward strict relational DB mutations for their memory layers? Curious what your latency overhead looks like.

1

Did Moonshot AI quietly block Kimi Code API in third-party tools like OpenClaw?

I've been using the Kimi Code API (api.kimi.com/coding/v1) with OpenClaw for months without issues, but recently started getting this error.
{"error":{"type":"access_terminated_error","message":"Kimi For Coding is currently only available for Coding Agents such as Kimi CLI, Claude Code, Roo Code, Kilo Code, etc."}}

I pay for the Moderato plan ($19/mo) specifically for API access, but now it seems OpenClaw isn't on their "whitelist."

1

A lot of the new "GPT 5.4 sucks in OpenClaw" posts are really config issues

Since the Anthropic change, a lot more people here are trying GPT 5.4 in OpenClaw.

Which makes sense. Claude Pro / Max inside OpenClaw was one of the biggest cost-saving setups, so now a lot of people are testing GPT 5.4 harder than before.

What I keep seeing though is people blaming the model for stuff that is mostly setup.

A lot of them are testing on old OpenClaw versions. Or with reasoning off. Or with weak thinking settings. Or without the right OpenAI path.

At that point, you’re not really testing GPT 5.4 properly.

You’re testing a crippled setup.

That’s also where the weird "it looks like it’s working but it’s not actually doing anything" feeling comes from.

If reasoning isn’t active, the bot is way more limited than people think. It can answer the last message. It can sound plausible. But as soon as the task needs actual multi-step reasoning, or something changes mid-flow, it falls apart fast.

That’s why some demos feel fake.
Not because GPT 5.4 is automatically bad in OpenClaw, but because the setup is bad enough that the bot never had much room to work with.

The biggest fixes for me were:

  • update OpenClaw to at least 2026.4.5
  • turn reasoning on
  • keep thinking at least at medium
  • use openai-responses
  • enable block streaming if the bot is living in Telegram
  • keep enough recent context so the conversation doesn’t degrade

Once those were fixed, GPT 5.4 felt way better.

Less fake progress.
Less random stopping.
Better continuity.

One important thing though: even with the right setup, GPT 5.4 still doesn’t feel exactly like Opus 4.6.

Opus 4.6 would sometimes take a lot more initiative. Sometimes that felt great. Sometimes it was honestly too much, and that freedom could also lead to mistakes.

GPT 5.4 feels a bit different. In my experience, it benefits more from validation and tighter steering on some steps.

Personally, I prefer that.

I’d rather have a model that needs a bit more checking, but stays more controllable, than one that takes too much initiative and occasionally goes off in the wrong direction.

1
TigerJoo·7d ago

Here’s a stupid‑simple H = π * ψ² governor you can paste into your pipeline

Below is a minimal pattern of the H Formula code that anyone can try:

  1. Define ψ as a simple scalar from your own context (e.g., prompt length).
  2. Compute H = π·ψ².
  3. Use H to govern max_tokens (or any other cost driver).
  4. Print a tiny before/after cost report.

You can adapt it to OpenAI, vLLM, llamafile, etc.

  1. Minimal “H Governor” Demo (pure Python)

This version doesn’t call any API.
It just shows how H changes the token budget and logs the savings:

import math

PI = math.pi

def estimate_psi(prompt: str) -> float:
    """
    Super simple ψ estimator:
    - Longer, denser prompts → higher ψ.
    - You can swap this with entropy, KV size, etc.
    """
    base = len(prompt.split())
    return base / 50.0  # scale factor so numbers aren't huge

def holistic_energy(psi: float) -> float:
    """H = π * ψ²"""
    return PI * (psi ** 2)

def token_budget_with_H(prompt: str,
                        max_tokens_baseline: int = 512,
                        H_cap: float = 25.0,
                        min_tokens: int = 64) -> tuple:
    """
    Use H to govern the token budget:
    - High H → strong / intense state → we don't need to brute-force tokens.
    - Low H → allow more tokens (within baseline).
    Returns (psi, H, governed_budget).
    """
    psi = estimate_psi(prompt)
    H = holistic_energy(psi)

    # Normalize H into a [0, 1] band using a cap
    H_norm = min(H / H_cap, 1.0)

    # Invert: higher H_norm → smaller token budget
    reduction_factor = 0.5 * H_norm  # up to a 50% cut
    governed_budget = int(max_tokens_baseline * (1.0 - reduction_factor))

    governed_budget = max(governed_budget, min_tokens)

    return psi, H, governed_budget

def run_demo():
    prompts = [
        "Quick: summarize this in one sentence.",
        "Explain the H = pi * psi^2 formula and its implications for AI cost control.",
        "You are given a long technical spec document about distributed systems, "
        "OOM behavior, and inference economics. Analyze the tradeoffs between context length, "
        "KV cache growth, and token-based governors, providing detailed recommendations.",
    ]

    max_tokens_baseline = 512

    print("=== H-Governor Cost Demo ===")
    for i, prompt in enumerate(prompts, start=1):
        psi, H, governed = token_budget_with_H(
            prompt,
            max_tokens_baseline=max_tokens_baseline,
        )

        saved = max_tokens_baseline - governed
        save_pct = (saved / max_tokens_baseline) * 100

        print(f"\n[Example {i}]")
        print(f"Prompt length (words): {len(prompt.split())}")
        print(f"ψ (psi) estimate:      {psi:.3f}")
        print(f"H = π * ψ²:            {H:.3f}")
        print(f"Baseline max_tokens:   {max_tokens_baseline}")
        print(f"H-governed max_tokens: {governed}")
        print(f"Estimated tokens saved: {saved} ({save_pct:.1f}% reduction)")

if __name__ == "__main__":
    run_demo()

What this gives you:

  • A visible mapping: longer / denser prompts → higher ψ → higher H.
  • Automatic token reduction as H rises.
  • Immediate printout of token savings per request.

You can literally run:

python h_governor_demo.py

…and see exactly how much of your max_tokens budget the governor cuts on high-H prompts.

1
ellmanalex·7d ago

Have you used Gemma 4 yet?

I think the best part of it is that you can run it locally. When I get OpenClaw up and running again I think I will explore using it for research and other simpler tasks since the Anthropic subscription doesn't cover that anymore.