
How to reduce your OpenClaw API costs by 90% or more


Running OpenClaw continuously is not free, and the cost can surprise you. A single agent checking heartbeats every few minutes, running a handful of cron jobs, and handling conversational sessions across two or three channels can easily generate $50-150/month in API costs if you're not paying attention to model selection and session hygiene. Add multi-agent coordination and that number can triple: research puts the coordination overhead at roughly 3.5× the token consumption of equivalent single-agent workflows, because every handoff duplicates context across specialists.

The good news is that most of that cost is recoverable through fairly mechanical changes: model tiering, proxy caching, smarter heartbeat scheduling, and context management. This article covers all of it, with concrete config examples rather than general advice.

Understanding where the money actually goes

Before optimizing anything, it helps to know what you're actually paying for. OpenClaw costs are almost entirely LLM API calls, and each call costs (input tokens × input price) + (output tokens × output price), with prices quoted per million tokens. The token count per call isn't just your message — it's your message plus the full conversation history, plus whatever MEMORY.md content gets retrieved, plus tool definitions, plus system prompt. On a long session, that input token count balloons fast.
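To make the compounding concrete, here's a back-of-envelope sketch. The prices (Opus-class, $15/M input and $75/M output) and per-turn token sizes are illustrative assumptions, not official rates:

```python
# Back-of-envelope sketch of per-call cost and session compounding.
# Prices and token sizes here are illustrative assumptions.
def call_cost(input_tokens, output_tokens, in_per_m=15.0, out_per_m=75.0):
    """Dollar cost of one call: each token stream billed at its own rate."""
    return (input_tokens * in_per_m + output_tokens * out_per_m) / 1_000_000

PER_TURN = 300    # assumed new user-message tokens per turn
REPLY = 500       # assumed assistant output tokens per turn
OVERHEAD = 2_000  # assumed system prompt + tool definitions, re-sent every call

def session_cost(turns):
    """Total cost of a session where every turn re-sends the full history."""
    history, total = 0, 0.0
    for _ in range(turns):
        total += call_cost(OVERHEAD + history + PER_TURN, REPLY)
        history += PER_TURN + REPLY
    return total

print(f"10 turns: ${session_cost(10):.2f}")  # ~$1.26
print(f"40 turns: ${session_cost(40):.2f}")  # ~$12.24: 4x the turns, ~10x the cost
```

Quadrupling the session length roughly decuples the cost, because history re-sending grows the input quadratically with turn count. That non-linearity is why session hygiene matters as much as model choice.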

The main cost drivers in practice:

  • Model choice. This is the biggest lever by far. Claude Opus 4 runs around $15/M input and $75/M output. Gemini 1.5 Flash is $0.30/M input and $1.20/M output. That's a 50× difference on input tokens. Using Opus for a heartbeat check that just needs to read a status file and reply "OK" is burning money for no reason.
  • Session length and history accumulation. Every turn in a session re-sends the full conversation history as input tokens. In a 40-turn session, the first turn's messages get sent 40 times, the second turn's 39 times, and so on. Sessions that never get reset or compacted keep growing.
  • Concurrency without limits. Heartbeats firing every few minutes, cron jobs running in parallel, webhooks triggering agent runs - if you haven't set concurrency limits, these pile up and run simultaneously, each as a separate billable API call.
  • Multi-agent coordination overhead. Every time a coordinator sends context to a specialist, that specialist receives the coordinator's summarized context plus its own system prompt plus tool definitions. If your coordinator is verbose and your specialists receive detailed briefings, the token cost per task compounds quickly.

The fastest way to see where your spend is going is session_status, which returns tokens in/out per run and which model was used. Run it on a few recent sessions and you'll usually find one or two obvious culprits.

Model tiering: matching model to task

The single most impactful cost change most OpenClaw setups can make is not using the same model for everything. Routine tasks (heartbeats, status checks, simple cron jobs, routing decisions) don't need Claude Sonnet or Opus. They need something fast and cheap that can read a status file and make a yes/no decision.

The tiers worth knowing

At the time of writing, the cost-to-capability landscape roughly breaks down like this:

  • Free / effectively free: Local models via Ollama (Qwen 2.5, Llama 3.2, Mistral) running on your VPS cost nothing per call beyond compute. Google AI Studio's free tier for Gemini Flash gives you a meaningful number of free requests per day. These are ideal for heartbeats, simple cron jobs, and anything where "good enough" is genuinely good enough.
  • Budget tier ($0.10–$0.50/M input): Gemini 1.5 Flash, Gemini 2.0 Flash, Claude Haiku 3.5. Fast, cheap, capable enough for classification, summarization, and light research tasks.
  • Mid tier ($1–$5/M input): Claude Sonnet 3.5/4, GPT-4o Mini, Gemini 1.5 Pro. Use for tasks that need real reasoning quality - code review, complex research synthesis, multi-step planning.
  • Premium ($10+/M input): Claude Opus, GPT-4o, Gemini Ultra. Reserve for the specific tasks that genuinely need them - complex code generation, nuanced writing, difficult reasoning chains.

Configuring a model allowlist with tiering

OpenClaw's agents.defaults.models acts as an allowlist. When you define it, only listed models can be selected by any agent or cron job:

agents:
  defaults:
    models:
      - "google/gemini-1.5-flash-latest"
      - "qwen/qwen2.5-coder-7b"
      - "anthropic/claude-3-5-sonnet-20241022"

Then assign specific models per agent and per cron job rather than letting everything default to whatever the most capable model is:

# In your cron job definition
{
  "name": "daily-status-check",
  "schedule": "0 9 * * *",
  "model": "google/gemini-1.5-flash-latest",
  "session": "isolated"
}

For multi-agent setups, give the coordinator a budget model for routing decisions and let specialists use higher-tier models only when the task warrants it. A coordinator prompt like "use the cheapest model unless the task requires deep reasoning or code generation - when in doubt, start cheap and escalate if the output is insufficient" actually works reasonably well in practice.

See the free AI models guide and the Claude vs OpenAI model comparison for more on selecting models for specific use cases.

API proxy setup: Caching, rate limiting and fallbacks

Running OpenClaw through a proxy layer like LiteLLM adds a small amount of infrastructure complexity but pays for itself quickly through caching and fallback handling. The API proxy setup guide covers the configuration in detail; here's the cost-specific angle.

Prompt caching

LiteLLM can cache responses to identical or near-identical prompts. For OpenClaw, the most cacheable calls are deterministic tool invocations - heartbeat checks that read the same status file and produce the same output, cron jobs that run on schedule with the same instruction. Identical calls can be served from an exact-match cache; near-identical ones (same intent, slightly different wording) are where semantic caching earns its keep.

Realistic savings from caching vary a lot by workload. If your setup has frequent repetitive heartbeats (every 5 minutes, checking the same conditions), caching can cut those call costs by 70–90%. Sessions with high conversational variability won't see much benefit. The low end is probably 20% total cost reduction; setups with heavy scheduled automation can see 50%+.
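If you enable caching at the proxy, the relevant settings look roughly like this. The key names follow LiteLLM's config file format, but the Redis backend choice and the 300-second TTL are assumptions; verify against the current LiteLLM docs:

```yaml
# litellm_config.yaml (sketch; verify keys against current LiteLLM docs)
litellm_settings:
  cache: true
  cache_params:
    type: redis        # or "local" for in-process caching
    host: redis
    port: 6379
    ttl: 300           # seconds; tune to your heartbeat interval
```

A TTL close to your heartbeat interval means repeated identical checks within that window are served from cache rather than billed.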

Docker Compose setup for LiteLLM alongside OpenClaw:

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    env_file: .env
    command: "--detailed_debug --cache yes"
  openclaw-gateway:
    image: openclaw/openclaw:latest
    environment:
      - OPENCLAW_LITELLM_BASE_URL=http://litellm:4000/v1
    depends_on:
      - litellm

Then in OpenClaw config, point your provider at the proxy:

providers:
  litellm:
    baseUrl: "http://litellm:4000/v1"

Rate limiting and 429 handling

API rate limits are a cost issue as well as a reliability issue. When you hit a 429, OpenClaw retries, and those retries can stack up quickly if you have multiple agents hitting the same provider endpoint simultaneously. LiteLLM's rate limiting layer adds a token bucket with configurable burst limits so you never pile up retries, and its fallback routing automatically switches to a backup provider on sustained 429s or 5xx errors.

A practical fallback config: primary Anthropic endpoint, fall back to OpenAI on errors, fall back to local Ollama on sustained failures. This means your agent stays responsive even when a provider has an outage, and the cost stays manageable because you're not burning tokens on failed retries.
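A sketch of that chain in LiteLLM config form. The model aliases are hypothetical and the fallback syntax should be checked against your LiteLLM version:

```yaml
# Sketch: Anthropic primary, OpenAI backup, local Ollama as last resort.
model_list:
  - model_name: primary
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
  - model_name: backup
    litellm_params:
      model: openai/gpt-4o-mini
  - model_name: local
    litellm_params:
      model: ollama/qwen2.5
      api_base: http://localhost:11434

litellm_settings:
  num_retries: 2              # keep retries short so failures fall through fast
  fallbacks:
    - primary: ["backup", "local"]
```

Keeping `num_retries` low matters for cost: the point of the chain is to fall through to a working provider quickly, not to hammer a failing one.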

Budgeting: Tracking spend and setting limits

OpenClaw doesn't have a built-in hard spend cap — there's no native configuration that cuts API calls when you've hit a dollar limit. What it does have is enough observability to build budget monitoring on top of, using a cron skill that aggregates session_status data and alerts when thresholds are crossed.

Tracking with session_status

The foundation is session_status, which returns token counts per run along with the model used. From this you can estimate cost per session by multiplying token counts by the model's per-token price. Running this as a daily aggregation cron job gives you a running total per agent and per model.

A simple version: a cron job that runs each morning, reads session logs, estimates spend, and writes a summary to MEMORY.md. A more sophisticated version sends the summary to Telegram or Slack and fires an alert if spend is trending above the weekly budget.
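A minimal sketch of the estimation step inside that cron job, assuming the session logs have already been parsed into per-run records. The field names (`model`, `tokens_in`, `tokens_out`) and the price table are assumptions, not OpenClaw's actual output schema:

```python
# Sketch: estimate dollars from per-run token counts.
# Field names and prices are assumptions; adjust to your session_status output.
PRICES = {  # $ per million tokens: (input, output); illustrative
    "google/gemini-1.5-flash-latest": (0.30, 1.20),
    "anthropic/claude-3-5-sonnet-20241022": (3.00, 15.00),
}

def estimate_spend(runs):
    """Sum estimated dollars across a list of per-run token counts."""
    total = 0.0
    for run in runs:
        in_price, out_price = PRICES[run["model"]]
        total += (run["tokens_in"] * in_price + run["tokens_out"] * out_price) / 1e6
    return total

runs = [
    {"model": "google/gemini-1.5-flash-latest", "tokens_in": 50_000, "tokens_out": 2_000},
    {"model": "anthropic/claude-3-5-sonnet-20241022", "tokens_in": 120_000, "tokens_out": 8_000},
]
print(f"Estimated spend: ${estimate_spend(runs):.4f}")
```

Even this toy data shows the shape of the problem: the single Sonnet run costs nearly 30× the Flash run despite similar token volumes.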

Define your thresholds in MEMORY.md so the monitoring agent can read and apply them:

## Budget thresholds
- Daily limit: $5.00 (alert at $4.00)
- Weekly limit: $20.00 (alert at $14.00)
- Per-agent daily: $2.00
- Alert channel: Telegram

Concurrency as a cost lever

Setting maxConcurrentRuns in your Gateway config is a soft cost control mechanism. If six agents can run simultaneously, you have six parallel API calls potentially hitting premium model endpoints at the same time. Limiting concurrency to three or four won't prevent expensive calls; it just ensures they're not all happening at the same instant, which smooths out your rate limit exposure and makes spend more predictable.
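A sketch of what that looks like in the Gateway config; the exact nesting is an assumption, so check your Gateway reference:

```yaml
gateway:
  maxConcurrentRuns: 3   # soft cap: smooths rate-limit exposure, doesn't block expensive calls
```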

The monitoring guide covers Prometheus-based metrics that can surface token usage trends over time. Grafana dashboards showing token spend per agent per day are more useful for budgeting than per-session snapshots because they make trends visible.

Cost-aware automation: The design decisions that matter

Heartbeats are the silent budget drain

Default heartbeat configuration in OpenClaw fires regularly and, if you haven't thought about it, uses whatever model your agent defaults to. If that's Claude Sonnet firing every five minutes, you're paying Sonnet prices for a check that could run on Gemini Flash or a local Qwen model at a fraction of the cost.

The pattern that works: cheap model for the heartbeat itself, escalate to a real model only when the heartbeat determines something actually needs attention. A Gemini Flash heartbeat that checks status files and replies HEARTBEAT_OK unless it finds an anomaly costs almost nothing. Only when it finds something worth acting on does it trigger a more capable agent run.

Configure heartbeats explicitly rather than relying on defaults:

openclaw cron add --every 30m --session isolated --model google/gemini-flash \
  "Read HEARTBEAT.md; if all systems normal reply HEARTBEAT_OK. If anomaly found, alert coordinator."

Or disable the default heartbeat entirely and replace it with an explicit cron job you control:

agents:
  defaults:
    heartbeat:
      every: "0"    # disabled

The heartbeat vs cron guide goes into the tradeoffs in more depth. The short version: heartbeats are convenient but opaque; explicit cron jobs give you full control over model, session, and trigger logic.

Session management and context compaction

Long sessions are expensive because token costs compound with history length. A session that's been running for 40 turns and hasn't been compacted is sending those early messages as input tokens on every single subsequent turn, even though they're probably not relevant anymore.

Two approaches: aggressive compaction and session resets. Compaction prunes history while preserving a summary — the trade-off is that some detail gets lost, which is the memory concern covered in the advanced memory guide. Session resets are more aggressive: end the session, write anything important to MEMORY.md, start a new session. This is the right call for task-based workflows where each session has a clear scope and you don't need to carry full history forward.

For cron jobs specifically: always use --session isolated. An isolated session starts clean with no history, runs the task, and terminates. This is both cheaper (no accumulated history) and cleaner (no cross-contamination between scheduled runs).

Summarize before you delegate

In multi-agent setups, the coordinator's verbosity directly multiplies cost. If the coordinator sends a 2,000-token briefing to each specialist, and you have three specialists handling a complex task, that's 6,000 tokens of input before any specialist has done any work. A coordinator that compresses task briefings to 200 tokens saves 5,400 tokens per delegation cycle, which compounds across many tasks.

This sounds obvious but goes against the instinct to be thorough in instructions. The right framing: specialists should have enough context to complete their specific subtask, not the full context of everything the coordinator knows.

Tool call costs

Tool definitions are sent as part of every prompt. If your agent has 20 tools defined and only uses 3 of them regularly, those 17 unused tool definitions are adding input tokens to every single call. Use per-agent tool allowlists to give each agent only the tools it actually needs, and you trim token overhead without changing functionality.
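A per-agent allowlist sketch. The key names are assumptions; the point is that each agent ships only the tool definitions it actually uses:

```yaml
agents:
  heartbeat-monitor:
    tools:
      allow:
        - read_file
        - session_status
  research-assistant:
    tools:
      allow:
        - read_file
        - write_file
        - web_search
```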

Web search tools deserve special attention — each web_search call can trigger multiple sub-requests, and if your agent is using web search to answer questions that could be answered from memory, you're paying for searches you don't need. Adding maxResults: 3 to your web search configuration and configuring memory retrieval to run before web search can cut unnecessary search calls significantly.
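A sketch of that web search tightening. The `maxResults` field is taken from the text; the `memoryFirst` flag is a hypothetical name for the memory-before-search behavior described above:

```yaml
tools:
  web_search:
    maxResults: 3       # cap sub-requests per search
    memoryFirst: true   # hypothetical flag: try memory retrieval before searching
```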

Free and discounted options

Local models via Ollama

If you're running OpenClaw on a VPS with decent specs, local models via Ollama cost nothing per call beyond the compute you're already paying for. Qwen 2.5 Coder 7B is a strong choice for code-adjacent tasks. Llama 3.2 3B handles simple classification and routing well. Mistral 7B is a reasonable general-purpose option for tasks that don't need high capability.

The practical limitation is memory: a 7B parameter model in 4-bit quantization needs roughly 4–5GB of RAM to run comfortably. If your VPS has 8GB of RAM and the gateway itself is using 2GB, you're tight. A 16GB VPS gives you comfortable room to run a local model alongside the gateway. See the free AI models guide for Ollama setup specifics.

Google AI Studio free tier

Google AI Studio offers a free tier for Gemini models with per-day request limits that are generous enough for background automation tasks. Gemini 1.5 Flash on the free tier handles heartbeats and simple cron jobs well. It's worth setting up as a fallback in your LiteLLM proxy configuration - if your primary provider hits rate limits, traffic falls back to the free tier rather than failing or queuing.

Provider volume discounts

Anthropic's tier system gives meaningful rate and price benefits at higher usage levels. If you're consistently in the Tier 3–4 range (which corresponds to significant monthly spend), the per-token price drops and rate limits increase substantially. If your OpenClaw usage is borderline, it can be worth consolidating API usage (using Anthropic for more tasks rather than splitting across providers) to reach a higher tier faster.

Third-party API resellers (aggregators that buy capacity in bulk and resell at a discount) are another option worth knowing about, though quality and reliability vary. The trade-off is always price versus stability and compliance — for production setups handling sensitive data, stick with direct provider relationships.


FAQ

How much can I realistically save with model tiering?

Reports from the OpenClaw community consistently show 80–95% cost reduction when heartbeats and cron jobs are moved to free or budget-tier models and premium models are reserved for sessions that genuinely need them. Your exact savings depend on your workload mix - setups heavy on scheduled automation save more than setups heavy on interactive sessions, because scheduled tasks are the easiest to tier aggressively.
