Cut your Hermes Agent token bill in half

The Hermes maintainers know per-call overhead is high. Issue #4379 on the GitHub repo broke down the cost: roughly 73 percent of each LLM call is fixed overhead before the agent has done anything useful, dominated by tool definitions and the system prompt, totalling around 13.9K tokens per call. For someone making a few dozen calls a day, the difference is a few cents a month. For someone running an agent that handles real volume, it adds up fast.

This guide is the practical version of "how to spend less money running Hermes". It covers what the overhead is composed of, which provider choices give you the cheapest path for routine work, and the configuration changes that meaningfully drop your per-call cost. The numbers below come from my own usage tracking on a personal agent over the last month.

What the 13.9K tokens per call is made of

Quick anatomy of a typical Hermes API call's input.

Tool definitions. Around 8K to 10K tokens. The agent has access to dozens of tools (filesystem, browser, shell, code execution, messaging, web search, etc.) and the LLM provider needs each tool's name, description, parameters and JSON schema in the request. This is the biggest single line item.

System prompt. Around 2K to 3K tokens. Hermes's behavioural rules, how to use tools, when to ask for confirmation, formatting conventions.

SOUL.md plus MEMORY.md plus USER.md. Around 1K to 5K tokens depending on what you've put in them. The persona and facts the agent loads every session.

Skills loaded for the current task. Variable, often 0 to 2K tokens; loaded only when a skill matches the task at hand.

Recent conversation context. Variable, dominated by your actual messages and the agent's recent replies.

For a brand-new chat with no skills active, the input cost is roughly 12K to 14K tokens before you've sent your first message. Each subsequent turn adds the new exchange to the running context.
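To make the anatomy concrete, here's the arithmetic as a sketch (the point estimates are mine, taken from the middle of each range above, not measured values):

```python
# Back-of-the-envelope budget for a fresh session's fixed input overhead.
# Point estimates taken from the middle of the ranges above.
overhead_tokens = {
    "tool_definitions": 9_000,  # 8K to 10K range
    "system_prompt": 2_500,     # 2K to 3K range
    "memory_files": 2_000,      # SOUL.md + MEMORY.md + USER.md, 1K to 5K
    "skills": 0,                # fresh chat, no skill matched
}
total = sum(overhead_tokens.values())
print(f"fixed overhead: ~{total:,} tokens before your first message")
```

That lands at roughly 13.5K tokens, consistent with the 13.9K figure from the issue thread.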

Lower-cost LLM providers

The cheapest possible setup is the Nous Portal free tier with the Xiaomi MiMo v2 Pro model. It's good enough for most agentic work (coding help, content drafting, basic research, day-to-day chat) and costs nothing per call. Rate-limited rather than dollar-limited, so you can lean on it heavily without worrying about a bill.

Set with:

hermes model
# Pick "nous-portal" then "mimo-v2-pro"

For tasks where MiMo isn't strong enough, OpenRouter gives you access to 200+ models on a metered basis. Two specific models worth flagging:

Qwen2.5-7B-Instruct via OpenRouter, around $0.07 per million input tokens. Cheap enough to be effectively free for personal use, and capable enough for most non-creative tasks.

DeepSeek's models on OpenRouter, similar pricing, sometimes better at code and reasoning. Worth A/B testing on your specific load.

The headline-name models (Claude Sonnet, GPT-5, Gemini Pro) are an order of magnitude more expensive per token. Keep them for tasks that genuinely need their capability; route routine work to the cheap models.

Per-task model routing

Hermes lets you set a default model and override per task or per skill. The pattern that works for me:

hermes config set model.default openrouter:qwen/qwen-2.5-7b-instruct
hermes config set model.heavy openrouter:anthropic/claude-3.5-sonnet

Then in skills that need horsepower (a complex code-review skill, a serious research skill), declare the heavy model in the SKILL.md frontmatter:

---
name: deep-code-review
model: heavy
---

Routine chat uses Qwen at 7 cents per million tokens; complex code review jumps to Sonnet at $3 per million tokens. Average cost stays low because the heavy work is the minority of total calls.
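A quick sanity check on the routing economics: the blended input rate under an assumed 90/10 split between cheap and heavy calls (the split is hypothetical; measure your own mix with the tracking query later in this article):

```python
# Blended input-token rate under per-task model routing.
# The 90/10 workload split is an assumption, not a measurement.
cheap_rate = 0.07    # $/M input tokens, Qwen via OpenRouter
heavy_rate = 3.00    # $/M input tokens, Claude Sonnet
heavy_share = 0.10   # fraction of calls routed to the heavy model

blended = (1 - heavy_share) * cheap_rate + heavy_share * heavy_rate
print(f"blended rate: ${blended:.3f} per million input tokens")
```

At that split you pay about $0.36 per million input tokens blended, still an order of magnitude below running everything on the premium model.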

Trim the tool surface

The biggest fixed cost is tool definitions. Hermes ships with a lot of tools enabled by default; you can disable ones you don't use.

hermes tools list  # See what's enabled
hermes tools disable browser_navigate browser_snapshot
hermes tools disable code_execute_python
hermes tools disable web_search

Each disabled tool removes its definition from the per-call payload. A few hundred to a few thousand tokens per call depending on the tool. The savings compound across thousands of calls.
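To see how the savings compound, here's a sketch with assumed numbers (the 1,500 tokens saved per call is a made-up figure; compare your own payloads before and after disabling):

```python
# Monthly savings from removing tool definitions from the per-call payload.
tokens_saved_per_call = 1_500   # assumed figure; measure your own
calls_per_day = 200             # heavy personal use

monthly_tokens = tokens_saved_per_call * calls_per_day * 30
for label, rate in [("cheap model ($0.07/M)", 0.07), ("premium model ($3/M)", 3.00)]:
    print(f"{label}: ~${monthly_tokens / 1e6 * rate:.2f}/month saved")
```

On a cheap model the saving is cents; on a premium model the same trim is worth tens of dollars a month.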

Be careful what you disable. If a skill depends on a disabled tool, the skill silently fails. Disable with intent (you're sure you don't use the browser, you don't need code execution), not as a blanket cost-saving exercise.

The opposite move: tool gates. The v0.10 series introduced a "tool gateway" pattern where some tools are mounted lazily; they're not in the per-call payload until the agent decides it might need them, at which point they get pulled in dynamically. Enable tool-gating for the heavy tools:

hermes config set tools.gating.enabled true
hermes config set tools.gating.always_loaded "[shell, filesystem, send_message]"
hermes config set tools.gating.lazy "[browser, code_execute, web_search, mcp_*]"

The always-loaded set is what the agent has on hand for any task. The lazy set is loaded only on demand. Net: typical calls have a smaller tool footprint, while the heavy tools are still available when needed.

Trim memory files

Every session pays the cost of SOUL.md, MEMORY.md and USER.md. If those files have grown to 5K tokens combined, every call costs an extra 5K tokens of overhead.

Audit them. Open each file. Delete entries that aren't useful any more. Consolidate duplicates. The reflective phase usually keeps these files reasonably tight, but human-edited memory tends to grow.

Target: keep the three files under 3K tokens combined for a personal setup. Use wc -c as a quick proxy: 3,000 tokens is roughly 12K to 15K characters in English.

wc -c ~/.hermes/SOUL.md ~/.hermes/memories/MEMORY.md ~/.hermes/memories/USER.md
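The character-to-token conversion behind that proxy, as a tiny helper (the 4-characters-per-token ratio is a rough heuristic for English prose, not an exact tokeniser):

```python
# Approximate token count from a character/byte count, as reported by wc -c.
# ~4 characters per token is a common rule of thumb for English text.
def approx_tokens(char_count: int, chars_per_token: float = 4.0) -> int:
    return round(char_count / chars_per_token)

# e.g. 13,000 characters across the three memory files
print(approx_tokens(13_000))  # roughly 3,250 tokens
```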

Use the /compress slash command

Long conversations bloat over time as each turn appends to context. Hermes's /compress slash command summarises the conversation so far into a shorter context, reducing the per-turn cost. Run it when a session is getting expensive:

/compress

The agent generates a summary of the recent context, replaces the verbose history with the summary in active context and continues. Subsequent turns pay less per call because the context is smaller.

Don't run this aggressively; the summary loses some specificity and the agent may lose track of details that mattered. As a rule of thumb, run /compress when a session has been going for 50+ turns or when the agent feels slow because the context is large.

Track what each call costs

You can't optimise what you can't measure. Hermes logs every call's token counts and provider choice to the session DB.

sqlite3 ~/.hermes/state.db "SELECT
  date(created_at) as day,
  model,
  COUNT(*) as calls,
  SUM(input_tokens) as in_tok,
  SUM(output_tokens) as out_tok
FROM messages
WHERE created_at > date('now', '-7 days')
GROUP BY day, model
ORDER BY day"

This gives you a daily breakdown of calls per model with token totals. Multiply by your provider's per-token rate to get the dollar cost. The slash command /usage inside the agent shows similar data summarised for the current day.
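That multiplication can be scripted against the same DB. A sketch: the rate table is illustrative, and the schema is assumed to match the query above:

```python
import sqlite3

# $/M token rates (input, output). Illustrative values; check your provider.
RATES = {
    "qwen/qwen-2.5-7b-instruct": (0.07, 0.07),
    "anthropic/claude-3.5-sonnet": (3.00, 15.00),
}

def weekly_cost(con: sqlite3.Connection) -> float:
    """Dollar cost of the last 7 days of calls in the messages table."""
    rows = con.execute(
        "SELECT model, SUM(input_tokens), SUM(output_tokens) "
        "FROM messages WHERE created_at > date('now', '-7 days') "
        "GROUP BY model"
    ).fetchall()
    return sum(
        (in_tok or 0) / 1e6 * RATES.get(model, (0.0, 0.0))[0]
        + (out_tok or 0) / 1e6 * RATES.get(model, (0.0, 0.0))[1]
        for model, in_tok, out_tok in rows
    )

# usage:
# con = sqlite3.connect(os.path.expanduser("~/.hermes/state.db"))
# print(f"last 7 days: ${weekly_cost(con):.2f}")
```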

If you see one specific model accounting for a disproportionate share of cost, route work away from it. If your input tokens are roughly 14K per call regardless of model, the fixed-overhead optimisations above are where the savings are.

Self-host with Ollama or vLLM for free inference

The most aggressive cost cut is to stop paying per-token entirely by self-hosting the model. Hermes supports custom OpenAI-compatible endpoints, so anything that speaks the OpenAI API works.

Ollama is the easiest option, whether it runs on a workstation with a decent GPU, your VPS or a separate GPU box. Install it, pull a model, then configure Hermes:

hermes config set provider custom
hermes config set custom.base_url http://localhost:11434/v1
hermes config set custom.model qwen2.5:7b
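Before switching the agent over, a quick smoke test that the endpoint answers is worthwhile (this assumes Ollama is already running locally and qwen2.5:7b has been pulled):

```shell
# Verify the OpenAI-compatible endpoint responds before pointing Hermes at it.
# The model name must match what you pulled with `ollama pull`.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5:7b",
        "messages": [{"role": "user", "content": "Say ok."}]
      }'
```

If this returns a JSON completion rather than a connection error, the Hermes config above will work.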

Inference latency is higher than commercial providers (a 7B model on CPU takes seconds per token; on a consumer GPU it's faster but still slower than commercial APIs). For latency-sensitive work this isn't ideal. For batch tasks, daily summaries, anything that doesn't need sub-second response, it's free after the hardware cost.

For higher throughput, vLLM with a small open model handles much more traffic per GPU than Ollama. Setup is more involved; Hermes config is the same custom-endpoint pattern.

Disable features you don't use

A few specific defaults that cost tokens but you might not need.

The reflective phase generates summaries periodically, costing a small batch of LLM calls per day. If you don't want that:

hermes config set memory.reflection_enabled false

Background self-evaluation runs occasionally to grade past task quality. If you don't care:

hermes config set evaluation.enabled false

Auto-skill creation runs the LLM to generate skill files from successful runs. Useful but not free:

hermes config set skills.auto_create false

Each of these is small individually; together they remove a few percent of background cost.

The OpenClaw equivalent

If you're cost-tuning OpenClaw rather than Hermes, the patterns are similar but the levers are named differently. The OpenClaw cost optimisation guide covers the OpenClaw-specific configuration knobs. Most of the high-level strategy (cheap default model, expensive model for hard tasks, trim overhead, self-host where possible) is identical between the two.

Realistic monthly numbers

Three reference points from my own usage:

Personal agent, light use, 30 to 50 calls per day. Default config, OpenRouter for inference. Around $5 to $8 per month.

Personal agent, heavy use, 200+ calls per day with browser automation and code review. Same setup with model routing for heavy tasks. Around $25 to $40 per month.

Same heavy setup but using Nous Portal free tier as the default and OpenRouter only for the 10 percent of calls that need stronger models. Around $4 per month, mostly the OpenRouter overflow.

Your numbers will vary; the point is that the order-of-magnitude difference between "default config, premium provider" and "tuned config, free tier as default" is real and worth the half-hour of work to set up.

The 1-click route

The LumaDock Hermes Agent VPS template ships with the agent and a sensible default config; the cost-optimisation steps in this article are the manual tuning you'd layer on top. The VPS itself is a flat monthly fee with no per-GB egress charges, so the hosting cost is predictable. The variable is your LLM provider bill, which is exactly what the tuning above shrinks.


FAQ

How do I tell if a specific skill is more expensive to invoke than I realised?

The session log records which skills were active per call, along with the token counts. To see average input tokens per skill set:

sqlite3 ~/.hermes/state.db "SELECT skills, AVG(input_tokens), COUNT(*) FROM messages WHERE skills IS NOT NULL GROUP BY skills ORDER BY AVG(input_tokens) DESC"

If one skill bloats every call where it's active, look at its SKILL.md and trim. Common culprits are skills that embed large lookup tables or long preambles; both can usually be moved into a tool the skill calls rather than sitting in the SKILL.md itself.
