Fix Hermes Agent Ollama context window 4096 error

Ellie Grace Hayes

11/04/2026

Fix Hermes Agent Ollama context window 4096 error

Hermes Agent against Ollama is the cheapest local-model setup you can run. It also breaks in a way the agent doesn't explain: weird short replies, mid-session amnesia, the model "forgetting" your prompt halfway through, tools that suddenly stop firing. Nine times out of ten the cause is the same. Ollama is silently truncating your prompt because its default context window is 4096 tokens, and Hermes Agent burns through 4096 tokens before the model sees any of your actual question.

Hermes alone pushes around 13,000 tokens of system context per turn (skill registry, persona, memory headers, tool schemas). Add a real conversation and you're at 20k easily. Default Ollama just drops everything past 4096. The model sees a stub. Output is stubby too.

Why Ollama ships with 4096

RAM. Ollama defaults are tuned for someone running a 7B model on a laptop with 8 GB of RAM. At those numbers, a 4096 context window fits in memory comfortably. Bump the window to 32k and the same model needs 3-4x more memory for the KV cache. Most casual Ollama users would crash. So Ollama plays it safe and lets advanced users opt into bigger windows.

Hermes Agent is an advanced user. It needs the bigger window.

The fix: set num_ctx properly

Three ways to raise it. Pick the one that matches how you run Ollama.

Option 1: Modelfile (cleanest, model-specific)

Create a custom Modelfile that inherits from your base model and pins the context window:

cat > Modelfile.hermes << 'EOF'
FROM qwen2.5-coder:32b
PARAMETER num_ctx 32768
EOF

ollama create qwen2.5-coder-32b-hermes -f Modelfile.hermes

Then point Hermes at the new model name instead of the original. Wins: each model can have its own window. Losses: you have to remember to do this for each model you pull.

Option 2: OLLAMA_NUM_CTX environment variable

Newer Ollama versions support a global default through an env var. Set it before starting Ollama:

export OLLAMA_NUM_CTX=32768
systemctl restart ollama
# or if you run Ollama as a foreground process
ollama serve

For persistence across reboots, add it to the Ollama systemd unit:

[Service]
Environment="OLLAMA_NUM_CTX=32768"

This affects every model loaded on this Ollama instance. Simpler than Modelfile per model, but if you sometimes want a smaller window for a tiny model you've lost that flexibility.

Option 3: Pass num_ctx in Hermes provider config

Hermes lets you pass extra options to the inference call. If your provider definition supports it (check hermes provider show ollama), you can set num_ctx there and override whatever Ollama itself defaults to. The exact field name varies by Hermes version. Look in the provider's options block or check the Hermes configuration docs for the current field name.

How much context Hermes really needs

This is the bit no one writes about. The answer depends on what you ask Hermes to do.

Use case	Minimum num_ctx	Recommended
Short chat with no tools	4096	8192
Hermes with shell + filesystem tools	16384	32768
Hermes with full skill registry	32768	65536
Long multi-turn debugging session	65536	131072

32k is what I'd start with. It covers Hermes Agent normal operation with a few skills enabled and leaves room for chat. If you load every skill and you're doing long multi-step work, push to 64k or 128k (only on models that support that big a context).

What this costs in RAM

The KV cache (key-value pairs the model holds in memory during inference) scales with context size. Rough numbers for a 7B model with 4-bit quantisation:

4096 context: ~0.5 GB extra RAM
16384 context: ~2 GB extra RAM
32768 context: ~4 GB extra RAM
131072 context: ~16 GB extra RAM

Multiply roughly by 4x for a 32B model. A 32B model with 32k context wants around 24 GB of total VRAM/RAM. If your machine doesn't have it, the model loads partially into CPU memory and inference becomes painfully slow.

Verifying the fix worked

Two quick checks. First, ask Ollama what context size it's using for your model:

curl http://localhost:11434/api/show -d '{"name":"qwen2.5-coder-32b-hermes"}' | jq .parameters

You should see num_ctx 32768 (or whatever you set) in the output. If it still shows 4096 or doesn't show the field at all, your Modelfile or env var didn't take.

Second, run a long Hermes conversation that would have failed before. Something with five or six turns of context plus a skill invocation. If it completes coherently, your context window is doing what it should.

The symptoms the model gives you when context is too small

These are the giveaways. If you see any of these, num_ctx is probably your problem (not the model itself):

Replies that ignore the last half of your message
Tool calls that reference earlier conversation but get the details wrong
Sudden persona shifts mid-session (because SOUL.md got truncated out of the prompt)
"What were we discussing again?" mid-conversation
Skill calls that fail because the skill definition got truncated

People often blame the model in these cases ("Qwen 7B is dumb"). Then they switch to a bigger model and it does the same thing. The bug isn't the model. It's the window.

Per-model recommendations from my own use

I run a few different local models for different tasks. Here's what I settled on:

Qwen2.5-coder 32B: 32k context, 24 GB VRAM, fine for most Hermes work
Llama 3.3 70B: 16k context (because 32k blows past 48 GB VRAM on Q4), only for one-shot tasks
Phi-4 14B: 16k context, runs on a 16 GB consumer card, my go-to for quick chats

None of these are 4096. The default never made sense for Hermes-class use.

When to give up on local and route to a hosted provider

If you've gone through the Modelfile dance, set num_ctx high, the model still loses the plot mid-session and you're running on a machine with 16 GB of VRAM or less, you've hit the trade-off ceiling. Either upgrade hardware or send tool-heavy work to Anthropic via the fallback pattern in our Hermes 402 quota fallback piece. Local is great for cost. It is not great for accuracy at the edge.

Where this fits with the rest of the stack

If the symptoms above feel familiar but you're not on Ollama, the closest related piece is our Hermes Agent talks but takes no action diagnostic, which covers tool-call failures from a different angle. If you want to cut token usage to make a smaller window survivable, see our cut Hermes token costs guide for the /compress command and skill pruning.

The hosted-model alternative

For the cost-conscious who don't want to manage Ollama at all, OpenRouter's free tier covers a lot of casual use and avoids the whole context-window dance.

The LumaDock Hermes Agent template ships with provider setup walking you through OpenRouter or Anthropic in a few prompts, so you can be running before you've even decided if local makes sense for you. Unmetered bandwidth and no setup fees. Setup walkthrough in our Hermes Agent complete guide.

Your idea deserves better hosting

24/7 support 30-day money-back guarantee Cancel anytime

Cycle de facturation

VPS.S1

$5.99 Save 17 %

$4.99 Mensuel

2 vCPU AMD EPYC
2 GB RAMMÉMOIRE
30 GB NVMeSTOCKAGE
Bande passante illimitée
IPv4 & IPv6Le support IPv6 est actuellement indisponible en France, Finlande ou aux Pays-Bas. inclus

Fix Hermes Agent Ollama context window 4096 error

Why Ollama ships with 4096

The fix: set num_ctx properly

Option 1: Modelfile (cleanest, model-specific)

Option 2: OLLAMA_NUM_CTX environment variable

Option 3: Pass num_ctx in Hermes provider config

How much context Hermes really needs

What this costs in RAM

Verifying the fix worked

The symptoms the model gives you when context is too small

Per-model recommendations from my own use

When to give up on local and route to a hosted provider

Where this fits with the rest of the stack

The hosted-model alternative

Your idea deserves better hosting

VPS.S1

VPS.S2

VPS.S3

EPYC VPS.P1

EPYC VPS.P2

EPYC VPS.P3

EPYC VPS.P4

EPYC VPS.P5

EPYC VPS.P6

EPYC VPS.P7

Genoa VPS.G2

Genoa VPS.G3

Genoa VPS.G4

Genoa VPS.G6

Genoa VPS.G7

AMD Ryzen VPS.R1

AMD Ryzen VPS.R2

AMD Ryzen VPS.R3

AMD Ryzen VPS.R4

FAQ

Why does Hermes Agent fail or give weird short replies with Ollama?

How do I increase Ollama context window for Hermes Agent?

What context window size should I set for Hermes on Ollama?

How much extra RAM does a bigger Ollama context window need?

Can I set num_ctx per model instead of globally?

Your agent runs wild. Your bill doesn't.

Produits

Hébergement d’apps

Ressources

Entreprise

Fonctionnalités

Obtenir de l’aide

Solutions

Générer un mot de passe