Fix Hermes Agent Ollama context window 4096 error

Hermes Agent against Ollama is the cheapest local-model setup you can run. It also breaks in a way the agent doesn't explain: weird short replies, mid-session amnesia, the model "forgetting" your prompt halfway through, tools that suddenly stop firing. Nine times out of ten the cause is the same. Ollama is silently truncating your prompt because its default context window is 4096 tokens, and Hermes Agent burns through 4096 tokens before the model sees any of your actual question.

Hermes alone pushes around 13,000 tokens of system context per turn (skill registry, persona, memory headers, tool schemas). Add a real conversation and you're at 20k easily. Default Ollama just drops everything past 4096. The model sees a stub. Output is stubby too.
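To put numbers on that, here is a minimal sketch of the arithmetic in shell. The helper name fits_pct is made up for illustration, and the token counts are the rough estimates from this article, not measurements.

```shell
#!/bin/sh
# Hypothetical helper: what share of a prompt fits in the context window?
# $1 = num_ctx, $2 = prompt size in tokens (estimates, not measurements)
fits_pct() {
  if [ "$2" -le "$1" ]; then
    echo 100
  else
    echo $(( $1 * 100 / $2 ))
  fi
}

echo "Hermes system context alone: $(fits_pct 4096 13000)% fits"
echo "With a real conversation:    $(fits_pct 4096 20000)% fits"
echo "Same conversation at 32k:    $(fits_pct 32768 20000)% fits"
```

At the default window, less than a third of Hermes's own system context fits, and that is before any of your conversation is counted.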

Why Ollama ships with 4096

RAM. Ollama defaults are tuned for someone running a 7B model on a laptop with 8 GB of RAM. At those numbers, a 4096 context window fits in memory comfortably. Bump the window to 32k and the KV cache for the same model grows about 8x, because it scales linearly with context length. Most casual users' machines would run out of memory. So Ollama plays it safe and lets advanced users opt into bigger windows.

Hermes Agent is an advanced user. It needs the bigger window.

The fix: set num_ctx properly

Three ways to raise it. Pick the one that matches how you run Ollama.

Option 1: Modelfile (cleanest, model-specific)

Create a custom Modelfile that inherits from your base model and pins the context window:

cat > Modelfile.hermes << 'EOF'
FROM qwen2.5-coder:32b
PARAMETER num_ctx 32768
EOF

ollama create qwen2.5-coder-32b-hermes -f Modelfile.hermes

Then point Hermes at the new model name instead of the original. Wins: each model can have its own window. Losses: you have to remember to do this for each model you pull.
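If you keep several models around, the per-model chore can be scripted. The sketch below is one way to do it, not an official workflow: the helper names and the "-hermes" naming scheme are made up, and the model tags are examples to swap for whatever you have pulled. It writes the Modelfiles and prints the ollama create commands instead of running them, so you can review them first.

```shell
#!/bin/sh
# Print a Modelfile that inherits from a base model and pins num_ctx.
# $1 = base model tag, $2 = num_ctx
make_modelfile() {
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2"
}

# Hypothetical naming scheme: "qwen2.5-coder:32b" -> "qwen2.5-coder-32b-hermes"
variant_name() {
  printf '%s-hermes\n' "$(printf '%s' "$1" | tr ':' '-')"
}

# Example tags; substitute the models you have pulled.
out_dir=${OUT_DIR:-$(mktemp -d)}
for base in qwen2.5-coder:32b phi4:14b; do
  name=$(variant_name "$base")
  make_modelfile "$base" 32768 > "$out_dir/Modelfile.$name"
  # Printed rather than executed, so you can review before running.
  echo "ollama create $name -f $out_dir/Modelfile.$name"
done
```

Set OUT_DIR to choose where the Modelfiles land; otherwise they go to a temporary directory.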

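Option 2: OLLAMA_CONTEXT_LENGTH environment variable

Recent Ollama versions support a global default through an environment variable, which Ollama's documentation names OLLAMA_CONTEXT_LENGTH. If you run Ollama as a foreground process, set it in the same shell before starting the server:

export OLLAMA_CONTEXT_LENGTH=32768
ollama serve

If Ollama runs as a systemd service, an export in your shell never reaches it. Put the variable in the unit instead, which also makes it persist across reboots. Run sudo systemctl edit ollama and add:

[Service]
Environment="OLLAMA_CONTEXT_LENGTH=32768"

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

This affects every model loaded on this Ollama instance. It is simpler than a Modelfile per model, but you lose the option of a smaller window for a tiny model.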

Option 3: Pass num_ctx in Hermes provider config

Hermes lets you pass extra options to the inference call. If your provider definition supports it (check hermes provider show ollama), you can set num_ctx there and override whatever Ollama itself defaults to. The exact field name varies by Hermes version. Look in the provider's options block or check the Hermes configuration docs for the current field name.
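Whatever the Hermes field ends up being called, on the Ollama side this option amounts to a num_ctx entry in the options object of the request body, which Ollama's native API accepts on /api/generate and /api/chat. The sketch below only shows that underlying request. The build_payload helper is hypothetical and none of this is Hermes configuration.

```shell
#!/bin/sh
# Build the JSON body for Ollama's /api/generate with a per-request num_ctx.
# $1 = model, $2 = num_ctx, $3 = prompt
# No JSON escaping is done here: keep the prompt free of quotes and backslashes.
build_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_ctx":%s}}\n' \
    "$1" "$3" "$2"
}

payload=$(build_payload "qwen2.5-coder:32b" 32768 "Say hello")
echo "$payload"

# With Ollama running locally, send it like this:
# curl -s http://localhost:11434/api/generate -d "$payload"
```

One caveat: Ollama's OpenAI-compatible /v1 endpoints have no field for context size. If your Hermes provider talks to Ollama through /v1, this option likely won't apply and Option 1 or 2 is the safer route.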

How much context Hermes really needs

This is the bit no one writes about. The answer depends on what you ask Hermes to do.

Use case                                Minimum num_ctx   Recommended
Short chat with no tools                4096              8192
Hermes with shell + filesystem tools    16384             32768
Hermes with full skill registry         32768             65536
Long multi-turn debugging session       65536             131072

32k is what I'd start with. It covers normal Hermes Agent operation with a few skills enabled and leaves room for chat. If you load every skill and you're doing long multi-step work, push to 64k or 128k (only on models that support a context that large).
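If you'd rather derive a window from a token estimate than pick from a table, a rough rule is to take the prompt size plus some room for the reply and round up to the next power of two. The helper below is a hypothetical sketch of that rule. The token counts are illustrative, and the result still has to fit what your model and memory support.

```shell
#!/bin/sh
# Hypothetical helper: smallest power-of-two window (4096 and up) that holds
# the prompt plus room for the reply.
# $1 = estimated prompt tokens, $2 = tokens reserved for the reply
pick_num_ctx() {
  needed=$(( $1 + $2 ))
  ctx=4096
  while [ "$ctx" -lt "$needed" ]; do
    ctx=$(( ctx * 2 ))
  done
  echo "$ctx"
}

pick_num_ctx 20000 2048   # 13k system context + ~7k conversation
pick_num_ctx 13000 2048   # system context only
pick_num_ctx 60000 4096   # made-up figure for a long session
```

The first call lands on 32768, which is why 32k is a sensible starting point for a typical session.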

What this costs in RAM

The KV cache (key-value pairs the model holds in memory during inference) scales with context size. Rough numbers for a 7B model with 4-bit quantisation:

  • 4096 context: ~0.5 GB extra RAM
  • 16384 context: ~2 GB extra RAM
  • 32768 context: ~4 GB extra RAM
  • 131072 context: ~16 GB extra RAM

Multiply by roughly 4x for a 32B model. A 32B model with 32k context wants around 24 GB of total VRAM/RAM. If your machine doesn't have it, part of the model spills into CPU memory and inference becomes painfully slow.
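The per-context figures above follow from a simple formula: KV cache bytes = 2 x layers x KV heads x head dimension x bytes per value x num_ctx. The sketch below plugs in an assumed model shape (32 layers, 8 KV heads, head dimension 128) with the cache kept at 16 bits per value. That shape is my guess at a typical model in this size class, not something read from a specific model card.

```shell
#!/bin/sh
# Estimate KV cache size in MiB.
# $1 = layers, $2 = KV heads, $3 = head dimension,
# $4 = bytes per cached value (2 for fp16), $5 = num_ctx
kv_cache_mib() {
  # Leading 2 = one key and one value vector per layer, per KV head, per token
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 1048576 ))
}

# Assumed shape: 32 layers, 8 KV heads, head dimension 128, fp16 cache.
for ctx in 4096 16384 32768 131072; do
  echo "$ctx tokens: $(kv_cache_mib 32 8 128 2 "$ctx") MiB"
done
```

The formula is linear in num_ctx, so doubling the window doubles the cache. Models with a different layer count or more KV heads will land elsewhere, so treat this as an order-of-magnitude check.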

Verifying the fix worked

Two quick checks. First, ask Ollama what context size it's using for your model:

curl http://localhost:11434/api/show -d '{"name":"qwen2.5-coder-32b-hermes"}' | jq .parameters

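If you used a Modelfile, you should see num_ctx 32768 (or whatever you set) in the output. If the field is missing, Hermes is probably pointed at the base model rather than the one you created. This output only reflects Modelfile parameters, so a window set through the environment variable or per request will not appear here. On recent Ollama versions, ollama ps shows the context size of each loaded model instead.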

Second, run a long Hermes conversation that would have failed before. Something with five or six turns of context plus a skill invocation. If it completes coherently, your context window is doing what it should.
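The first check can be scripted so it fails loudly when num_ctx is missing. This is a small sketch, assuming the parameters text keeps its usual layout of one name and value per line. The extract_num_ctx helper is hypothetical and the sample text is illustrative, not captured from a real server.

```shell
#!/bin/sh
# Read the "parameters" text from /api/show on stdin and print num_ctx.
# Exits non-zero if no num_ctx line is present.
extract_num_ctx() {
  awk '$1 == "num_ctx" { print $2; found = 1 } END { exit !found }'
}

# Illustrative sample of the parameters text (name, whitespace, value per line).
sample='num_ctx                        32768
temperature                    0.7'

if ctx=$(printf '%s\n' "$sample" | extract_num_ctx); then
  echo "num_ctx is $ctx"
else
  echo "num_ctx not set in the Modelfile"
fi

# Against a live server:
# curl -s http://localhost:11434/api/show -d '{"name":"qwen2.5-coder-32b-hermes"}' \
#   | jq -r .parameters | extract_num_ctx
```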

The symptoms the model gives you when context is too small

These are the giveaways. If you see any of these, num_ctx is probably your problem (not the model itself):

  • Replies that ignore the last half of your message
  • Tool calls that reference earlier conversation but get the details wrong
  • Sudden persona shifts mid-session (because SOUL.md got truncated out of the prompt)
  • "What were we discussing again?" mid-conversation
  • Skill calls that fail because the skill definition got truncated

People often blame the model in these cases ("Qwen 7B is dumb"). Then they switch to a bigger model and it does the same thing. The bug isn't the model. It's the window.

Per-model recommendations from my own use

I run a few different local models for different tasks. Here's what I settled on:

  • Qwen2.5-coder 32B: 32k context, 24 GB VRAM, fine for most Hermes work
  • Llama 3.3 70B: 16k context (because 32k blows past 48 GB VRAM on Q4), only for one-shot tasks
  • Phi-4 14B: 16k context, runs on a 16 GB consumer card, my go-to for quick chats

None of these are 4096. The default never made sense for Hermes-class use.

When to give up on local and route to a hosted provider

If you've gone through the Modelfile dance and set num_ctx high, but the model still loses the plot mid-session and you're running on a machine with 16 GB of VRAM or less, you've hit the trade-off ceiling. Either upgrade hardware or send tool-heavy work to Anthropic via the fallback pattern in our Hermes 402 quota fallback piece. Local is great for cost. It is not great for accuracy at the edge.

Where this fits with the rest of the stack

If the symptoms above feel familiar but you're not on Ollama, the closest related piece is our Hermes Agent talks but takes no action diagnostic, which covers tool-call failures from a different angle. If you want to cut token usage to make a smaller window survivable, see our cut Hermes token costs guide for the /compress command and skill pruning.

The hosted-model alternative

For the cost-conscious who don't want to manage Ollama at all, OpenRouter's free tier covers a lot of casual use and avoids the whole context-window dance.

The LumaDock Hermes Agent template ships with provider setup walking you through OpenRouter or Anthropic in a few prompts, so you can be running before you've even decided if local makes sense for you. Unmetered bandwidth and no setup fees. Setup walkthrough in our Hermes Agent complete guide.

FAQ

Why does Hermes Agent fail or give weird short replies with Ollama?

Because Ollama's default context window is 4096 tokens and Hermes alone pushes about 13,000 tokens of system context per turn. Your actual question and the conversation history get silently truncated before reaching the model.
