
A simple guide to OpenClaw concurrency and retry control

OpenClaw processes every agent run through an internal queue. Most people never think about this until something goes wrong: a second message arrives while the first task is still running, a cron job fires simultaneously with an inbound message and they seem to interfere, or a rate limit error from Anthropic starts a retry loop that burns through tokens before you even know it started. Understanding how the queue and retry systems actually work is what separates a setup that holds up under load from one that fails in ways that are hard to explain.

This guide covers the lane-based concurrency model in depth, how to configure limits per lane and per agent, what retry policies exist for each channel and provider, the known gaps in the current implementation worth being aware of, and how to monitor queue health before problems become outages.

How OpenClaw's lane-based queue system works

OpenClaw does not use threads or background worker processes. The entire Gateway is a single Node.js process running on async promises. Concurrency is managed by a lane-aware FIFO queue implemented in src/process/command-queue.ts, which serializes work within each lane while allowing different lanes to run in parallel.

Every incoming task gets enqueued twice before it executes:

  1. Session lane (session:<key>): guarantees only one active run per session at a time. No two tasks from the same session can execute concurrently, which prevents race conditions on session files and history.
  2. Global lane: the session-scoped task then joins a global lane (main by default), which caps total parallelism across all sessions via agents.defaults.maxConcurrent.
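The double-enqueue flow can be modeled as a single lane-aware FIFO in which a session lane has a cap of 1 and a global lane has the configured cap. The sketch below is illustrative TypeScript under assumed names (`LaneQueue`, `enqueue`); it is not the actual src/process/command-queue.ts implementation:

```typescript
// Illustrative lane-aware FIFO: serialize within a lane up to its cap,
// run different lanes independently. Not OpenClaw's real code.
type Task<T> = () => Promise<T>;

class LaneQueue {
  private active = new Map<string, number>();              // running tasks per lane
  private waiting = new Map<string, Array<() => void>>();  // FIFO of blocked enqueuers

  constructor(private caps: Map<string, number>) {}

  async enqueue<T>(lane: string, task: Task<T>): Promise<T> {
    const cap = this.caps.get(lane) ?? 1;                  // session lanes default to 1
    if ((this.active.get(lane) ?? 0) >= cap) {
      // Lane is saturated: park until a slot frees up.
      await new Promise<void>((resolve) => {
        const q = this.waiting.get(lane) ?? [];
        q.push(resolve);
        this.waiting.set(lane, q);
      });
    }
    this.active.set(lane, (this.active.get(lane) ?? 0) + 1);
    try {
      return await task();
    } finally {
      this.active.set(lane, (this.active.get(lane) ?? 0) - 1);
      const q = this.waiting.get(lane);
      if (q && q.length > 0) q.shift()!();                 // wake exactly one waiter
    }
  }
}
```

In this model, the double enqueue is just nesting: `queue.enqueue("session:abc", () => queue.enqueue("main", run))` serializes within the session while still counting against the global cap.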

The four named global lanes in the current codebase are:

  • main: inbound messages and main heartbeats. Default cap of 4.
  • cron: all scheduled jobs. Runs independently so a long cron task cannot block an inbound reply.
  • subagent: sessions spawned via sessions_spawn. Default cap of 8.
  • nested: nested tool calls within a running agent turn.

Lanes don't compete. A cron job in the cron lane cannot starve an inbound message waiting in the main lane, and a fleet of subagents in the subagent lane has its own concurrency budget separate from the main chat flow. The official queue documentation describes this as the core reliability guarantee: "serialize writes per session, throttle global work, and define exactly what happens when new input arrives mid-run."

One detail with practical implications: typing indicators fire immediately on enqueue, even before the run starts executing. From the user's perspective, the agent appears to be working. Under the hood, the task is waiting in the queue. This is a deliberate UX choice, but it can create confusion during debugging when the typing indicator appears but the response is delayed by queue depth.

Configuring concurrency limits

Global and per-lane limits

The primary knob is agents.defaults.maxConcurrent in your config, which controls the main lane cap:

agents:
  defaults:
    maxConcurrent: 4          # main lane (inbound messages)
    subagents:
      maxConcurrent: 8        # subagent lane
    cron:
      maxConcurrent: 2        # cron lane

Be aware of a current limitation: there is an open GitHub issue (#16055) reporting that agents.defaults.maxConcurrent does not always propagate to the main lane as documented. Users running five or more independent agents on the same Gateway have observed bottlenecks where all agents share the main lane's default cap of 4 regardless of what the config says. The workaround, also proposed in that issue, is assigning a custom lane per agent:

agents:
  list:
    - id: "agent-one"
      lane: "lane-one"
      laneConcurrency: 10
    - id: "agent-two"
      lane: "lane-two"
      laneConcurrency: 10

Custom lanes auto-create separate queue buckets per agent, enabling true parallel processing without sharing the main lane's global cap. This is the practical solution for multi-bot setups until the underlying issue is fixed.

Session isolation via dmScope

For agents handling direct messages from multiple users, session isolation determines whether each user gets their own session lane. The relevant config:

session:
  dmScope: "per-channel-peer"    # one session per user per channel (default for DMs)
  # or "per-channel" for all DMs to share one session

per-channel-peer gives each user their own session key, meaning their conversations are fully serialized and isolated from other users. This is almost always what you want for user-facing agents, and it's what prevents one user's long-running task from blocking another user's message.
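The effect of dmScope on session keys can be shown with a tiny sketch. The key format here is hypothetical; OpenClaw's real keys may differ:

```typescript
// Hypothetical session-key derivation for the two dmScope modes.
// The actual key format in OpenClaw may differ.
function sessionKey(
  scope: "per-channel-peer" | "per-channel",
  channel: string,
  peer: string
): string {
  // per-channel-peer: each user gets an isolated session lane.
  // per-channel: all DMs on the channel share one session lane.
  return scope === "per-channel-peer" ? `${channel}:${peer}` : channel;
}
```

With per-channel-peer, alice and bob map to different keys and therefore different session lanes; with per-channel, they map to the same key and their runs serialize behind each other.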

Message deduplication and inbound batching

Two related features are worth knowing about. OpenClaw maintains a short-lived deduplication cache keyed on channel, account, peer, session, and message ID, so duplicate deliveries from the channel provider (which happen more often than you'd expect, especially with Telegram webhooks and Discord event replay) don't trigger duplicate agent runs.
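A minimal sketch of such a dedupe cache, assuming a composite string key and a fixed TTL (names and structure are illustrative, not OpenClaw's internals):

```typescript
// Illustrative short-lived dedupe cache keyed on
// channel/account/peer/session/messageId with a TTL.
class DedupeCache {
  private seen = new Map<string, number>();  // key -> first-seen timestamp (ms)

  constructor(private ttlMs: number) {}

  isDuplicate(
    channel: string,
    account: string,
    peer: string,
    session: string,
    messageId: string,
    now = Date.now()
  ): boolean {
    const key = [channel, account, peer, session, messageId].join("|");
    const at = this.seen.get(key);
    if (at !== undefined && now - at < this.ttlMs) return true;  // drop the redelivery
    this.seen.set(key, now);                                     // record first sight
    return false;
  }
}
```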

For human typing patterns where multiple short messages arrive in quick succession, messages.inbound.debounceMs batches them into a single agent turn:

messages:
  inbound:
    debounceMs: 1500          # wait 1.5s for follow-up messages before starting the run
    perChannel:
      telegram:
        debounceMs: 800       # shorter debounce for Telegram

This reduces unnecessary queue entries and API calls when users send fragmented messages.
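The debounce behavior can be sketched as a buffer whose flush timer resets on every new message; this is an illustrative model, not the actual implementation:

```typescript
// Illustrative inbound debouncer: collect messages until debounceMs of
// quiet, then hand the whole batch to one agent turn.
class InboundDebouncer {
  private buf: string[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private debounceMs: number,
    private flush: (batch: string[]) => void
  ) {}

  push(msg: string): void {
    this.buf.push(msg);
    if (this.timer) clearTimeout(this.timer);  // follow-up message: restart the quiet window
    this.timer = setTimeout(() => {
      const batch = this.buf;
      this.buf = [];
      this.timer = null;
      this.flush(batch);                        // one queue entry for the whole burst
    }, this.debounceMs);
  }
}
```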

Retry policies per channel and provider

Retry behavior in OpenClaw is configured separately for each channel (Telegram, Discord) and for LLM providers. The defaults are reasonable but not always sufficient for production use, and there are some documented implementation gaps worth understanding.

Telegram retry configuration

Telegram retries on transient errors: 429 rate limits, connection timeouts, ECONNRESET, ETIMEDOUT, and "temporarily unavailable" responses. It uses retry_after from the response header when available, and exponential backoff otherwise. Markdown parse errors are not retried; they fall back to plain text instead.
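The delay calculation described above — honor retry_after when the server provides it, otherwise fall back to exponential backoff with jitter — can be sketched as follows. Parameter names mirror the retry config keys; the function itself is illustrative, not OpenClaw's actual code:

```typescript
// Illustrative backoff: server-provided retry_after wins; otherwise
// exponential growth from minDelayMs, capped at maxDelayMs, plus jitter.
interface RetryOpts {
  attempts: number;
  minDelayMs: number;
  maxDelayMs: number;
  jitter: number; // 0..1, fraction of the base delay added randomly
}

function backoffDelay(attempt: number, opts: RetryOpts, retryAfterMs?: number): number {
  if (retryAfterMs !== undefined) return retryAfterMs;            // trust the server's hint
  const base = Math.min(opts.minDelayMs * 2 ** attempt, opts.maxDelayMs);
  return base + Math.random() * opts.jitter * base;               // additive jitter
}
```

Jitter matters here: without it, several sessions rate-limited at the same moment retry at the same moment and hit the limit again in lockstep.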

There is a known long-polling bug worth configuring around: Telegram's long-polling connection silently dies after roughly 8 minutes (issue #7526). No error is logged, messages stop arriving, and the only fix is a Gateway restart. The workaround is to set an explicit retry policy on the Telegram channel, which converts this hard failure into a self-healing reconnection:

channels:
  telegram:
    retry:
      attempts: 5
      minDelayMs: 1000
      maxDelayMs: 10000
      jitter: 0.3

Users who added this config reported 23+ hours of continuous operation with zero timeouts versus frequent silent polling deaths without it. The default config ships without this, so it's worth adding manually if you rely on Telegram for anything important.

Discord retry configuration

Discord retries only on 429 rate limit responses, using retry_after from the header when available. The minimum delay is 500ms. Configuration follows the same structure:

channels:
  discord:
    retry:
      attempts: 3
      minDelayMs: 500
      maxDelayMs: 30000
      jitter: 0.1

LLM provider retry and the known backoff bug

This is the area with the most important caveat in this entire guide. OpenClaw's documentation describes exponential backoff intervals of 1, 5, 25, and 60 minutes for LLM provider 429 errors. The actual implementation does not behave as documented.

GitHub issue #5159 documents that observed retry intervals are as short as 1-27 seconds rather than the documented minutes. This issue was closed by maintainers as "not planned" for fixing. The practical implication: do not rely on OpenClaw's internal LLM retry logic for rate limit handling. Configure fallback providers via LiteLLM proxy instead, and let the proxy handle backoff at the infrastructure level rather than at the gateway level. The API proxy setup guide covers this.

There is also a related bug (issue #17589) where a Gateway restart or config reload (SIGUSR1) aborts in-flight requests, which then get classified as errors and retried up to 4 times with the full session context. If you're on Opus with a large context window, this can burn significant tokens. The symptom is "API rate limit reached" messages from the user's perspective while the Anthropic console shows successful completed requests. The workaround is to avoid restarting the Gateway mid-session when possible, and to schedule config reloads during low-traffic windows.

Another related issue: a single model hitting rate limits can trigger cooldown for the entire provider, not just that model (issue #5744). Claude Sonnet hitting a limit marks Claude Opus as unavailable too. Configure per-provider fallbacks to a different provider (not a different model within the same provider) to avoid this.
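The difference between provider-wide cooldown (the behavior reported in issue #5744) and per-model cooldown can be illustrated with a hypothetical tracker; the class and its scoping parameter are assumptions for the sake of the example:

```typescript
// Hypothetical cooldown tracker contrasting provider-wide keys (the
// reported behavior) with per-model keys (which would avoid the issue).
class CooldownTracker {
  private until = new Map<string, number>();  // key -> cooldown expiry (ms)

  constructor(private scope: "provider" | "model") {}

  private key(provider: string, model: string): string {
    return this.scope === "provider" ? provider : `${provider}/${model}`;
  }

  markLimited(provider: string, model: string, cooldownMs: number, now = Date.now()): void {
    this.until.set(this.key(provider, model), now + cooldownMs);
  }

  isAvailable(provider: string, model: string, now = Date.now()): boolean {
    const t = this.until.get(this.key(provider, model));
    return t === undefined || now >= t;
  }
}
```

Under provider-wide scoping, marking one model as rate-limited makes every model from that provider unavailable, which is why the recommended fallback is a different provider entirely.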

Cron job retry behavior

Cron jobs currently have no built-in retry on transient failure. If a cron job fails due to a provider 429 or a network timeout, it is immediately set to enabled: false with no automatic retry. One-shot jobs with deleteAfterRun: true get permanently disabled. This is tracked as an open issue (issue #24355).

The workaround until this is fixed: implement retry logic inside the cron job's prompt itself. A cron agent can check whether its last run succeeded (via MEMORY.md status) and retry the actual task if needed, or send an alert so you can manually re-enable it. For transient network failures, scheduling crons to run slightly less frequently also reduces the chance of hitting a rate limit window.

Transient vs fatal errors: what gets retried and what doesn't

Understanding which errors are retryable is important for designing reliable automation. The distinction:

Transient errors are temporary conditions that may resolve without any code or config change. OpenClaw (where it implements retry) will attempt these automatically:

  • HTTP 429 (rate limit) with a retry_after header
  • Connection timeout (ETIMEDOUT)
  • Connection reset (ECONNRESET)
  • Temporary provider unavailability (503)

Fatal errors indicate a permanent problem that won't resolve with retries. These get logged and fail immediately:

  • HTTP 401 / 403 (auth failure). Retrying is pointless; the credential is wrong or expired.
  • HTTP 400 / 422 (bad request, schema validation failure). The request itself is malformed.
  • HTTP 500 (provider internal error) without a retryable indicator. Some providers signal retryability explicitly; without that signal, a 500 is treated as fatal.
  • Markdown parse errors from Telegram. These fall back to plain text rather than retrying.
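The transient/fatal split above amounts to a small classifier. This sketch assumes a Node-style error code plus an optional HTTP status; it is illustrative, not OpenClaw's actual code:

```typescript
// Illustrative classifier for the transient/fatal lists above.
// Transient errors are worth retrying; everything else fails fast.
function isTransient(status: number | null, code?: string): boolean {
  if (code === "ETIMEDOUT" || code === "ECONNRESET") return true;  // network blips
  if (status === 429 || status === 503) return true;               // rate limit / unavailable
  return false;  // 400/401/403/422, and 500 without a retryable signal: fatal
}
```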

When designing cron jobs and automation, structure tasks so that the parts that talk to external APIs are isolated and their errors are logged explicitly. A cron agent that silently fails because of a 401 and gets disabled is harder to diagnose than one that writes a failure entry to a status file that you can check.

Diagnosing stuck queues

The first sign of a stuck or saturated queue is usually delayed responses, not errors. The Gateway appears to be working but messages take much longer than expected. Before assuming a model or provider problem, check the queue.

Enable verbose logging to surface queue wait times:

OPENCLAW_LOG_LEVEL=verbose openclaw gateway run

With verbose logging enabled, runs that wait more than roughly 2 seconds before starting emit a log line like:

[queue] session:abc123 queued for 4230ms

Consistent "queued for Xms" lines in the logs indicate the lane is saturated and tasks are piling up behind the concurrency cap. The fix depends on the cause:

  • If it's the main lane filling up, increase agents.defaults.maxConcurrent or assign separate lanes per agent as described above.
  • If it's the cron lane backing up, reduce cron frequency or increase cron.maxConcurrent. Each cron job execution occupies a cron lane slot for its full duration, so long-running cron tasks scheduled at high frequency are a common cause.
  • If sessions seem stuck and not clearing, check for runs that are waiting on a tool call that will never return. A web search or exec tool that hangs indefinitely holds the session lane open. Tool call timeouts in agent configuration prevent this from being permanent.

If the queue seems completely stuck and verbose logs show nothing draining, restart the Gateway. On restart, OpenClaw drains the queue backlog cleanly using a generation counter that invalidates stale entries. This is the intended recovery path for queue deadlocks.
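The generation-counter idea can be sketched in a few lines: each queued entry is stamped with the current generation, the restart bumps it, and stale entries are skipped during the drain. This is an illustrative model, not the actual implementation:

```typescript
// Illustrative generation counter: entries stamped before a restart
// become stale once the generation is bumped, so the drain skips them.
class GenerationGate {
  private gen = 0;

  stamp(): number {
    return this.gen;            // recorded on each entry at enqueue time
  }

  bump(): void {
    this.gen++;                 // called on restart/reload
  }

  isStale(stamped: number): boolean {
    return stamped !== this.gen; // stale entries are dropped, not executed
  }
}
```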

Monitoring concurrency and tuning for cost

OpenClaw's Prometheus exporter (if enabled) surfaces metrics you can use to monitor queue health over time. The ones worth tracking for concurrency:

  • openclaw_queue_depth per lane: how many tasks are waiting. Sustained depth above 3-4 in the main lane suggests you're consistently hitting the concurrency cap.
  • openclaw_run_duration_seconds: how long individual runs take. Sudden increases here often precede queue depth increases.
  • openclaw_retry_count per provider: high retry rates signal either a rate limit problem or the provider having reliability issues.

Set an alert for queue depth that stays elevated for more than five minutes. At that point, you either need to raise the concurrency cap, add resources, or look at whether some tasks can be rescheduled to distribute load more evenly. The monitoring guide covers Prometheus and alerting configuration in more depth.

On the cost side: higher concurrency caps mean more parallel LLM calls, which means higher instantaneous spend. The relationship is direct. If you're finding that raising maxConcurrent causes cost spikes, the right response is usually not to lower it back down, but to review which tasks are running in parallel and whether any of them should be on cheaper models.

A main lane running four concurrent sessions where two of them are heartbeat checks is a good candidate for routing those heartbeats to a separate cron lane on a budget model, freeing the main lane cap for real interactive work. See the cost optimization guide for how to combine model tiering with concurrency configuration effectively.

FAQ

Why are messages from different agents being serialized even though I set maxConcurrent high?

This is the known issue described in GitHub issue #16055: multiple agents share the main lane's global cap regardless of the maxConcurrent value in config. Assign a custom lane per agent in your agent list config to give each agent its own queue bucket. This is the current workaround until the config propagation issue is fixed.
