Hermes Agent voice mode with faster-whisper and Edge TTS

Ellie Grace Hayes

15/06/2026

Voice mode in Hermes Agent gets you the "ask the agent something while making coffee" experience that's the whole reason people want voice agents in the first place. The official docs cover the cloud-provider route (OpenAI Whisper API, ElevenLabs TTS), which is metered and costs money. The free, mostly local route is what this article covers: faster-whisper for speech-to-text running on your machine, Microsoft Edge TTS for the spoken reply.

Edge TTS, in case you haven't seen it, is the speech engine behind the Edge browser's "Read aloud" feature. Microsoft exposes it for free through an undocumented API. The voices are surprisingly good. The community wrapped it into a Python package (edge-tts) that you can use without an account or API key. There is no published rate limit, but because the API is undocumented it can change or be throttled without notice.

The architecture in one diagram

Four stages, each handling one job:

1. Microphone capture: Hermes records audio from your mic
2. Speech-to-text: faster-whisper transcribes the audio locally
3. Agent: Hermes processes the text prompt and generates a response
4. Text-to-speech: edge-tts speaks the response through your speakers

Steps 1 and 2 happen entirely on your machine. Step 4 is triggered locally, but the synthesis itself runs on Microsoft's servers. Step 3 happens at whatever LLM provider you've configured. Total cost: zero for speech, plus whatever the LLM call costs you (usually a few cents per turn).
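If it helps to see the flow as code, here's a rough sketch of one conversational turn. This is not how Hermes implements it internally; voice_turn and the four stage types are names I made up for illustration. The point is just that each stage is a plain function, so any one of them can be swapped out.

```python
from typing import Callable

# Hypothetical stage signatures; Hermes's real internals may look nothing like this.
Capture = Callable[[], bytes]         # step 1: returns recorded audio
Transcribe = Callable[[bytes], str]   # step 2: audio -> text
Respond = Callable[[str], str]        # step 3: prompt -> reply text
Speak = Callable[[str], None]         # step 4: reply text -> audio out


def voice_turn(capture: Capture, transcribe: Transcribe,
               respond: Respond, speak: Speak) -> str:
    """Run one conversational turn and return the reply text."""
    audio = capture()
    prompt = transcribe(audio)
    if not prompt.strip():
        return ""  # nothing intelligible was said, so skip the paid LLM call
    reply = respond(prompt)
    speak(reply)
    return reply


def demo() -> str:
    """Run a turn with stand-in stages, so no mic or network is needed."""
    return voice_turn(
        capture=lambda: b"fake-audio",
        transcribe=lambda audio: "what time is it",
        respond=lambda prompt: f"You asked: {prompt}",
        speak=lambda reply: None,
    )
```

Skipping the LLM call on an empty transcript is a design choice of the sketch: it avoids paying for a turn when the mic only picked up noise.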

Install faster-whisper

The "faster" in faster-whisper comes from it being a CTranslate2-based reimplementation of OpenAI's Whisper. Its maintainers report up to 4x faster transcription than the reference implementation at the same accuracy. CPU-only mode works fine for the smaller models.

pip install faster-whisper

On Linux/macOS the install is straightforward. On Windows, install Microsoft Visual C++ Redistributable first (the CTranslate2 wheel needs it).

Pick a model size

faster-whisper supports several model sizes. Trade-off is speed vs accuracy:

Model	Disk	CPU speed (rough)	Accuracy
tiny	75 MB	10x realtime	OK for clear English
base	140 MB	6x realtime	Good for clear English
small	460 MB	3x realtime	Robust, handles accents
medium	1.5 GB	1.5x realtime	Near-human
large-v3	3.0 GB	0.8x realtime	Best

For interactive voice mode I'd start with base. At roughly 6x realtime, a six-second utterance transcribes in about a second once you stop speaking, which is what you want for conversational use. Move up to small if you have a slight accent or work in a noisy room, and to large-v3 if you have a GPU and want best-in-class quality.
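To put numbers on the trade-off, here's a small sketch that turns the table's rough CPU speeds into an estimated wait per utterance. The figures are the ballpark ones from the table, not benchmarks of your machine, and the function names are mine.

```python
# Rough CPU speeds from the table, as multiples of realtime.
CPU_REALTIME_FACTOR = {
    "tiny": 10.0,
    "base": 6.0,
    "small": 3.0,
    "medium": 1.5,
    "large-v3": 0.8,
}


def transcription_seconds(audio_seconds: float, model: str) -> float:
    """Estimated CPU time to transcribe a clip of the given length."""
    return audio_seconds / CPU_REALTIME_FACTOR[model]


def models_within_budget(audio_seconds: float, max_wait_seconds: float) -> list[str]:
    """Models whose estimated transcription time fits the wait budget."""
    return [
        name for name in CPU_REALTIME_FACTOR
        if transcription_seconds(audio_seconds, name) <= max_wait_seconds
    ]
```

With a six-second utterance and a two-second patience budget, models_within_budget(6, 2.0) leaves you tiny, base and small on these figures. Measure on your own hardware before trusting any of it.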

Install edge-tts

pip install edge-tts

That's it. No account. No API key. The library hits Microsoft's free TTS endpoint directly.

Pick a voice

List available voices:

edge-tts --list-voices | head -30

You'll see hundreds of voices across languages, genders and styles. My picks for English:

en-GB-RyanNeural: clean British male
en-GB-LibbyNeural: clean British female
en-US-GuyNeural: clean American male
en-US-JennyNeural: clean American female

Test a voice before wiring it in:

edge-tts --voice en-GB-RyanNeural --text "Hello, this is Hermes speaking." --write-media /tmp/test.mp3
mpv /tmp/test.mp3

(Or use whatever audio player you have.)
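The same test works from Python if you'd rather audition several voices in one go. A minimal sketch, assuming the edge-tts package's Communicate class and its async save() method; audition_plan, audition and main are helper names I made up.

```python
import asyncio


async def synthesize(text: str, voice: str, out_path: str) -> None:
    """Write `text` spoken in `voice` to an MP3 file at `out_path`."""
    import edge_tts  # imported lazily so this file loads without the package

    communicate = edge_tts.Communicate(text, voice)
    await communicate.save(out_path)


def audition_plan(voices: list[str], out_dir: str) -> list[tuple[str, str]]:
    """Pair each voice with the file its sample would be written to."""
    return [(voice, f"{out_dir}/{voice}.mp3") for voice in voices]


async def audition(voices: list[str], text: str, out_dir: str) -> None:
    # One request at a time, to stay gentle on the free endpoint.
    for voice, path in audition_plan(voices, out_dir):
        await synthesize(text, voice, path)


def main() -> None:
    picks = ["en-GB-RyanNeural", "en-GB-LibbyNeural",
             "en-US-GuyNeural", "en-US-JennyNeural"]
    asyncio.run(audition(picks, "Hello, this is Hermes speaking.", "/tmp"))
```

Calling main() needs network access and leaves one MP3 per voice in /tmp, which you can then play back to back.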

Wire both into Hermes voice mode

Hermes voice mode reads from a config block in ~/.hermes/config.yaml. Replace the default cloud-provider config with the local stack:

voice_mode:
  enabled: true
  stt:
    provider: faster-whisper
    model: base
    language: en
  tts:
    provider: edge-tts
    voice: en-GB-RyanNeural
    rate: "+0%"
    volume: "+0%"

Restart Hermes:

hermes restart

Then start a voice session:

hermes voice

The agent listens through your default microphone. Speak. When you pause, faster-whisper transcribes, Hermes processes the prompt, edge-tts speaks the response. Repeat until you exit with Ctrl-C.
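If you want to sanity-check a voice_mode block before restarting, something like the sketch below works on the already-parsed YAML. It only mirrors the keys shown in this article. How Hermes validates the config internally is not something I've confirmed, and parse_local_voice_mode plus the two settings classes are illustrative names.

```python
from dataclasses import dataclass

# Providers covered in this article; Hermes itself supports more.
LOCAL_STT = {"faster-whisper"}
LOCAL_TTS = {"edge-tts"}


@dataclass(frozen=True)
class SttSettings:
    provider: str = "faster-whisper"
    model: str = "base"
    language: str = "en"
    device: str = "cpu"


@dataclass(frozen=True)
class TtsSettings:
    provider: str = "edge-tts"
    voice: str = "en-GB-RyanNeural"
    rate: str = "+0%"
    volume: str = "+0%"


def parse_local_voice_mode(block: dict) -> tuple[SttSettings, TtsSettings]:
    """Turn a parsed voice_mode mapping into typed settings, filling defaults."""
    # A misspelled key raises TypeError here, which is the point of the check.
    stt = SttSettings(**block.get("stt", {}))
    tts = TtsSettings(**block.get("tts", {}))
    if stt.provider not in LOCAL_STT:
        raise ValueError(f"not a local stt provider: {stt.provider}")
    if tts.provider not in LOCAL_TTS:
        raise ValueError(f"not a local tts provider: {tts.provider}")
    return stt, tts
```

Feed it the voice_mode mapping from whatever YAML loader you use. A typo such as modle: base fails loudly instead of silently falling back to a default.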

Tuning for actual usability

Out of the box, voice mode is functional but rough. Three tweaks make it pleasant.

Set silence detection sensitivity

If Hermes cuts you off mid-sentence (because there's a brief pause), increase the silence threshold:

voice_mode:
  silence_threshold_ms: 800

Default is 500ms. 800 gives you more thinking time without dragging out turn-taking.
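To see why 800 behaves differently from 500, here's a toy model of end-of-turn detection. It assumes audio arrives in fixed-length frames already labelled speech or silence. That is a common design, but I haven't verified it's what Hermes does, and end_of_utterance is a made-up name.

```python
from typing import Optional, Sequence


def end_of_utterance(frame_is_speech: Sequence[bool], frame_ms: int,
                     silence_threshold_ms: int) -> Optional[int]:
    """Return the index of the frame at which the turn ends, or None.

    The turn ends once speech has been heard and is then followed by
    at least `silence_threshold_ms` of continuous silence.
    """
    heard_speech = False
    silent_ms = 0
    for i, is_speech in enumerate(frame_is_speech):
        if is_speech:
            heard_speech = True
            silent_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silent_ms += frame_ms
            if silent_ms >= silence_threshold_ms:
                return i
    return None
```

With 100 ms frames, a 600 ms thinking pause in mid-sentence ends the turn at a 500 ms threshold but is ridden out at 800 ms, which is exactly the cut-off behaviour described above.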

Reduce response length for voice

The default Hermes response style is good for reading on screen. Bad for listening. At a speaking pace of around 200 words per minute, a 300-word response takes about 90 seconds to speak. Add to your SOUL.md or via /personality:

When in voice mode, keep responses under 50 words unless I explicitly ask for detail.

The agent picks up the context. Quick conversational turns. Much better feel.

Pick a faster TTS rate

Edge TTS default rate is slow for native English speakers. Bump it up:

voice_mode:
  tts:
    rate: "+15%"

15% faster is the sweet spot for me. 25% starts to sound rushed. Try a few values.
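A quick back-of-envelope helper for picking values. The 200 words-per-minute baseline is just what the "300 words in 90 seconds" figure earlier implies, and the sketch assumes playback speed scales linearly with the rate setting, which I haven't measured. Both function names are mine.

```python
BASELINE_WPM = 200  # implied by the article's figure: 300 words in 90 seconds


def rate_string(percent: int) -> str:
    """Format a signed percentage in the style of the rate field, e.g. '+15%'."""
    return f"{percent:+d}%"


def speaking_seconds(words: int, rate_percent: int = 0) -> float:
    """Estimate playback time, assuming speed scales linearly with rate."""
    wpm = BASELINE_WPM * (1 + rate_percent / 100)
    return words / wpm * 60
```

On these assumptions a 50-word reply takes about 15 seconds at the default rate, and a 300-word reply drops from 90 seconds to roughly 78 at +15%. Trimming the response length buys far more than speeding up the voice.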

Where this fits in the stack

Voice mode pairs well with other patterns we've covered. Combine it with the scheduled tasks from our Hermes cron tasks piece to get the morning briefing read out at 7 a.m. Combine it with the Telegram gateway from our Telegram setup guide so that voice notes you send to the bot get transcribed and answered automatically.

For the latter, you'll want faster-whisper's larger models because Telegram voice notes are often noisier than your local mic.

GPU acceleration for whisper

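If you have an NVIDIA GPU and want to run the large model in realtime, the same faster-whisper package handles it. What it needs in addition are NVIDIA's cuBLAS and cuDNN libraries, which on Linux can be installed from pip (check the faster-whisper README for the versions that match your CUDA setup):

pip install nvidia-cublas-cu12 nvidia-cudnn-cu12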

And add to your voice_mode config:

voice_mode:
  stt:
    model: large-v3
    device: cuda

With this in place, large-v3 transcribes at around 5x realtime on a consumer GPU, though the exact figure depends on the card. Worth doing if you want the agent to handle accents, mixed languages or noisy environments well.
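To check the GPU path outside Hermes, you can call faster-whisper directly. A minimal sketch using the library's WhisperModel class; the compute_type choices (float16 on GPU, int8 on CPU) are common defaults rather than anything Hermes mandates, and whisper_options and transcribe_file are names I made up.

```python
def whisper_options(device: str) -> dict:
    """Pick constructor options for WhisperModel based on the target device."""
    if device == "cuda":
        # float16 is the usual choice on NVIDIA GPUs
        return {"device": "cuda", "compute_type": "float16"}
    # int8 quantisation keeps CPU inference fast and light on memory
    return {"device": "cpu", "compute_type": "int8"}


def transcribe_file(path: str, model_size: str = "large-v3",
                    device: str = "cuda") -> str:
    """Transcribe an audio file and return the joined text."""
    from faster_whisper import WhisperModel  # lazy import: needs the package

    model = WhisperModel(model_size, **whisper_options(device))
    segments, _info = model.transcribe(path, language="en")
    # `segments` is a generator; transcription runs as it is consumed
    return " ".join(segment.text.strip() for segment in segments)
```

Time transcribe_file on a clip of known length with device="cuda" and again with device="cpu" to get your own realtime factor. The first call also downloads the model, so don't count that run.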

What goes wrong

Microphone not detected

Linux usually needs ALSA or PulseAudio configured correctly. On macOS, grant Terminal microphone permission in System Settings > Privacy & Security. On Windows you may need to enable microphone access for the WSL distro.

arecord -l   # Linux: list capture devices
hermes voice --list-mics

edge-tts errors with "no audio output"

The edge-tts API occasionally returns empty audio for certain voices and certain text patterns. Workaround: catch the error and retry with a different voice or split the text. Usually transient.
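The retry half of that workaround looks roughly like this. The synth callable is a stand-in for whatever function actually produces audio bytes in your setup. I haven't pinned down which exception edge-tts raises on empty audio, so the sketch catches broadly and also treats an empty result as a failure.

```python
from typing import Callable, Optional, Sequence


def synthesize_with_fallback(
    text: str,
    voices: Sequence[str],
    synth: Callable[[str, str], bytes],
    attempts_per_voice: int = 2,
) -> bytes:
    """Try each voice in order until one returns non-empty audio."""
    last_error: Optional[Exception] = None
    for voice in voices:
        for _ in range(attempts_per_voice):
            try:
                audio = synth(text, voice)
            except Exception as exc:  # broad on purpose: exact type is unknown
                last_error = exc
                continue
            if audio:
                return audio
    raise RuntimeError(f"no audio produced by any of {list(voices)}") from last_error
```

Put your preferred voice first and a known-good fallback second. If the error really is transient, the second attempt on the first voice usually succeeds and the fallback is never reached.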

Whisper transcribes garbage

Three usual causes. Mic level too low (test with arecord first). Background noise (try the small or medium model). Wrong language config (Whisper detects language by default but voice_mode forces it; check the language field).
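For the first cause, you can put a number on "too low" by measuring the RMS level of a short test recording. A standard-library sketch, assuming a 16-bit PCM WAV file; the function names are mine, and any cut-off you pick is a rule of thumb rather than a spec.

```python
import math
import sys
import wave
from array import array


def rms_dbfs(samples: array) -> float:
    """RMS level of 16-bit PCM samples in dB relative to full scale."""
    if len(samples) == 0:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 20 * math.log10(math.sqrt(mean_square) / 32768)


def wav_level_dbfs(path: str) -> float:
    """Read a 16-bit PCM WAV file and return its RMS level in dBFS."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        frames = wav.readframes(wav.getnframes())
    samples = array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return rms_dbfs(samples)
```

Record a few seconds of normal speech (for example arecord -f S16_LE -d 3 test.wav) and run wav_level_dbfs on it. As a rough guide, a reading far below about -40 dBFS suggests the mic gain needs raising before you blame the model.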

Hermes responds in English when you spoke another language

Set the agent's persona to respond in the language it heard. SOUL.md line: "Respond in the same language the user spoke." Or set language: auto in the STT config and let Whisper detect.

Privacy reality check

faster-whisper runs locally. Your audio never leaves your machine. Edge TTS does send your text to Microsoft's servers (that's where the synthesis happens). If you're voicing sensitive prompts, edge-tts isn't the right output choice. Local TTS options exist (Coqui, Piper, Mozilla TTS) but they don't sound as good. Trade-off.

For most use, "the question text gets sent to Microsoft" is the same privacy posture as "the question gets sent to Anthropic" (which already happens through the LLM call). If you've accepted the LLM going to a hosted provider, Edge TTS isn't a meaningful additional leak.

What edge-tts doesn't do

It doesn't do real-time streaming TTS. The whole response gets synthesised, then played. For long responses you'll hear a few seconds of delay before audio starts. ElevenLabs and OpenAI TTS support streaming and feel more responsive. For free local: this is the trade-off.

If responsiveness matters more than cost, OpenAI TTS streaming is the go-to. The voice_mode config in Hermes supports it; swap the tts provider block.

Where else to take this

Voice in, voice out is the most obvious use. Voice in, Telegram out is the more interesting one for me. I'll dictate a quick note while walking, the agent processes it (writing a Telegram message back to me, pushing a task to my todo list or whatever else the prompt asks), I see the result on screen later.

The same multi-channel pattern from our multi-platform gateway tutorial applies. Voice is just another channel.

Running this on a VPS

Voice mode is mostly a local-machine feature because microphones and speakers don't exist on a VPS. You can still use a VPS-hosted Hermes brain for the agent side, with voice handled on your laptop using Hermes Desktop's voice mode. The Desktop app talks to the remote agent and routes audio I/O locally. Setup pattern in our Desktop remote backend on VPS tutorial.

For the VPS itself, the LumaDock Hermes Agent template handles the brain side cleanly. Unmetered bandwidth and no setup fees, which matters because voice transcripts plus TTS prompts plus LLM responses add up to more total traffic than text-only use. Full setup in our Hermes Agent complete guide.
