Hermes Agent against Ollama is the cheapest local-model setup you can run. It also breaks in a way the agent doesn't explain: weird short replies, mid-session amnesia, the model "forgetting" your prompt halfway through, tools that suddenly stop firing. Nine times out of ten the cause is the same. Ollama is silently truncating your prompt because its default context window is 4096 tokens, and Hermes Agent burns through 4096 tokens before the model sees any of your actual question.
Hermes alone pushes around 13,000 tokens of system context per turn (skill registry, persona, memory headers, tool schemas). Add a real conversation and you're at 20k easily. Default Ollama just drops everything past 4096. The model sees a stub. Output is stubby too.
Why Ollama ships with 4096
RAM. Ollama defaults are tuned for someone running a 7B model on a laptop with 8 GB of RAM. At those numbers, a 4096 context window fits in memory comfortably. Bump the window to 32k and the same model needs 3-4x more memory for the KV cache. Most casual Ollama users would crash. So Ollama plays it safe and lets advanced users opt into bigger windows.
Hermes Agent is an advanced user. It needs the bigger window.
The fix: set num_ctx properly
Three ways to raise it. Pick the one that matches how you run Ollama.
Option 1: Modelfile (cleanest, model-specific)
Create a custom Modelfile that inherits from your base model and pins the context window:
cat > Modelfile.hermes << 'EOF'
FROM qwen2.5-coder:32b
PARAMETER num_ctx 32768
EOF
ollama create qwen2.5-coder-32b-hermes -f Modelfile.hermes
Then point Hermes at the new model name instead of the original. Wins: each model can have its own window. Losses: you have to remember to do this for each model you pull.
Option 2: OLLAMA_NUM_CTX environment variable
Newer Ollama versions support a global default through an env var. Set it before starting Ollama:
export OLLAMA_NUM_CTX=32768
systemctl restart ollama
# or if you run Ollama as a foreground process
ollama serve
For persistence across reboots, add it to the Ollama systemd unit:
[Service]
Environment="OLLAMA_NUM_CTX=32768"
This affects every model loaded on this Ollama instance. Simpler than Modelfile per model, but if you sometimes want a smaller window for a tiny model you've lost that flexibility.
Option 3: Pass num_ctx in Hermes provider config
Hermes lets you pass extra options to the inference call. If your provider definition supports it (check hermes provider show ollama), you can set num_ctx there and override whatever Ollama itself defaults to. The exact field name varies by Hermes version. Look in the provider's options block or check the Hermes configuration docs for the current field name.
How much context Hermes really needs
This is the bit no one writes about. The answer depends on what you ask Hermes to do.
| Use case | Minimum num_ctx | Recommended |
|---|---|---|
| Short chat with no tools | 4096 | 8192 |
| Hermes with shell + filesystem tools | 16384 | 32768 |
| Hermes with full skill registry | 32768 | 65536 |
| Long multi-turn debugging session | 65536 | 131072 |
32k is what I'd start with. It covers Hermes Agent normal operation with a few skills enabled and leaves room for chat. If you load every skill and you're doing long multi-step work, push to 64k or 128k (only on models that support that big a context).
What this costs in RAM
The KV cache (key-value pairs the model holds in memory during inference) scales with context size. Rough numbers for a 7B model with 4-bit quantisation:
- 4096 context: ~0.5 GB extra RAM
- 16384 context: ~2 GB extra RAM
- 32768 context: ~4 GB extra RAM
- 131072 context: ~16 GB extra RAM
Multiply roughly by 4x for a 32B model. A 32B model with 32k context wants around 24 GB of total VRAM/RAM. If your machine doesn't have it, the model loads partially into CPU memory and inference becomes painfully slow.
Verifying the fix worked
Two quick checks. First, ask Ollama what context size it's using for your model:
curl http://localhost:11434/api/show -d '{"name":"qwen2.5-coder-32b-hermes"}' | jq .parameters
You should see num_ctx 32768 (or whatever you set) in the output. If it still shows 4096 or doesn't show the field at all, your Modelfile or env var didn't take.
Second, run a long Hermes conversation that would have failed before. Something with five or six turns of context plus a skill invocation. If it completes coherently, your context window is doing what it should.
The symptoms the model gives you when context is too small
These are the giveaways. If you see any of these, num_ctx is probably your problem (not the model itself):
- Replies that ignore the last half of your message
- Tool calls that reference earlier conversation but get the details wrong
- Sudden persona shifts mid-session (because SOUL.md got truncated out of the prompt)
- "What were we discussing again?" mid-conversation
- Skill calls that fail because the skill definition got truncated
People often blame the model in these cases ("Qwen 7B is dumb"). Then they switch to a bigger model and it does the same thing. The bug isn't the model. It's the window.
Per-model recommendations from my own use
I run a few different local models for different tasks. Here's what I settled on:
- Qwen2.5-coder 32B: 32k context, 24 GB VRAM, fine for most Hermes work
- Llama 3.3 70B: 16k context (because 32k blows past 48 GB VRAM on Q4), only for one-shot tasks
- Phi-4 14B: 16k context, runs on a 16 GB consumer card, my go-to for quick chats
None of these are 4096. The default never made sense for Hermes-class use.
When to give up on local and route to a hosted provider
If you've gone through the Modelfile dance, set num_ctx high, the model still loses the plot mid-session and you're running on a machine with 16 GB of VRAM or less, you've hit the trade-off ceiling. Either upgrade hardware or send tool-heavy work to Anthropic via the fallback pattern in our Hermes 402 quota fallback piece. Local is great for cost. It is not great for accuracy at the edge.
Where this fits with the rest of the stack
If the symptoms above feel familiar but you're not on Ollama, the closest related piece is our Hermes Agent talks but takes no action diagnostic, which covers tool-call failures from a different angle. If you want to cut token usage to make a smaller window survivable, see our cut Hermes token costs guide for the /compress command and skill pruning.
The hosted-model alternative
For the cost-conscious who don't want to manage Ollama at all, OpenRouter's free tier covers a lot of casual use and avoids the whole context-window dance.
The LumaDock Hermes Agent template ships with provider setup walking you through OpenRouter or Anthropic in a few prompts, so you can be running before you've even decided if local makes sense for you. Unmetered bandwidth and no setup fees. Setup walkthrough in our Hermes Agent complete guide.

