Hermes Agent voice mode with faster-whisper and Edge TTS

Voice mode in Hermes Agent gets you the "ask the agent something while making coffee" experience that's the whole reason people want voice agents in the first place. The official docs cover the cloud-provider route (OpenAI Whisper API, ElevenLabs TTS), which bills you by usage. The free route is what this article covers: faster-whisper for speech-to-text running on your machine, Microsoft Edge TTS for the spoken reply.

Edge TTS, in case you haven't seen it, is the speech engine behind the Edge browser's "Read aloud" feature. Microsoft exposes it for free use through an undocumented API. The voices are surprisingly good. The community wrapped it into a Python package (edge-tts) you can use without an account or API key. No rate limit is published, but because the API is undocumented that can change without notice.

The architecture in one diagram

Four components, each handling one job:

  • Microphone capture: your mic, Hermes records audio
  • Speech-to-text: faster-whisper transcribes the audio locally
  • Agent: Hermes processes the text prompt, generates response
  • Text-to-speech: edge-tts speaks the response through your speakers

Steps 1 and 2 happen locally. Step 3 happens at whatever LLM provider you've configured. Step 4 sends the reply text to Microsoft's servers for synthesis and plays the audio locally. Total cost: zero for speech, plus whatever the LLM call costs you (usually a few cents per turn).
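The four steps can be sketched as a single function that chains them. This is only an illustration of the data flow, not how Hermes implements it: run_turn and its four parameters are hypothetical names, and each stage is passed in as a plain callable so the sketch runs without a microphone or any network access.

```python
from typing import Callable


def run_turn(
    record: Callable[[], bytes],
    transcribe: Callable[[bytes], str],
    ask_agent: Callable[[str], str],
    speak: Callable[[str], None],
) -> str:
    """Run one voice turn and return the agent's reply text."""
    audio = record()            # 1. microphone capture (local)
    prompt = transcribe(audio)  # 2. speech-to-text (local, faster-whisper)
    reply = ask_agent(prompt)   # 3. agent call (your configured LLM provider)
    speak(reply)                # 4. text-to-speech (edge-tts)
    return reply


# Stand-in stages so the flow can be tried without any audio hardware.
spoken = []
reply = run_turn(
    record=lambda: b"\x00\x00",
    transcribe=lambda audio: "what time is it",
    ask_agent=lambda prompt: f"You asked: {prompt}",
    speak=spoken.append,
)
print(reply)
```

Keeping each stage behind its own callable also makes it easy to swap one provider for another later without touching the rest of the loop.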

Install faster-whisper

The "faster" in faster-whisper comes from it being a reimplementation of OpenAI's Whisper on top of the CTranslate2 inference engine. It runs up to about 4x faster than vanilla Whisper at the same accuracy. CPU-only mode works fine for the smaller models.

pip install faster-whisper

On Linux/macOS the install is straightforward. On Windows, install Microsoft Visual C++ Redistributable first (the CTranslate2 wheel needs it).
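A quick way to confirm the install works is to transcribe a short recording from Python before involving Hermes at all. The sketch below uses faster-whisper's WhisperModel class and its transcribe method; the helper names (join_texts, transcribe_file) and the choice of int8 on CPU are my own assumptions, not something Hermes requires.

```python
import sys
from typing import Iterable


def join_texts(texts: Iterable[str]) -> str:
    """Join per-segment text into one line, trimming stray whitespace."""
    return " ".join(t.strip() for t in texts)


def transcribe_file(path: str, model_size: str = "base") -> str:
    # Imported here so join_texts works even if the package is missing.
    from faster_whisper import WhisperModel

    # int8 on CPU keeps memory use low; the model downloads on first use.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, info = model.transcribe(path, language="en")
    # segments is a lazy generator; transcription runs while it is consumed.
    return join_texts(segment.text for segment in segments)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_whisper.py recording.wav
    print(transcribe_file(sys.argv[1]))
```

If this prints a sensible transcript, any remaining problems are in the Hermes config or the microphone setup rather than in faster-whisper itself.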

Pick a model size

faster-whisper supports several model sizes. Trade-off is speed vs accuracy:

Model      Disk     CPU speed (rough)     Accuracy
tiny       75 MB    10x realtime          OK for clear English
base       140 MB   6x realtime           Good for clear English
small      460 MB   3x realtime           Robust, handles accents
medium     1.5 GB   1.5x realtime         Near-human
large-v3   3.0 GB   0.8x realtime (CPU)   Best

For interactive voice mode I'd start with base. Anything comfortably faster than realtime keeps the wait short: at 6x realtime, a six-second utterance transcribes in about a second, which is what you want for conversational use. Move up to small if you have a slight accent or work in a noisy room, and to large-v3 if you have a GPU and want best-in-class quality.
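The speed-versus-accuracy choice can be written down as a small lookup. The realtime factors are the rough figures from the table above, not measurements of your machine, and largest_model_at is a hypothetical helper name.

```python
# Rough CPU realtime factors copied from the table above (not measured here).
CPU_REALTIME = {
    "tiny": 10.0,
    "base": 6.0,
    "small": 3.0,
    "medium": 1.5,
    "large-v3": 0.8,
}


def largest_model_at(min_realtime: float) -> str:
    """Return the most accurate model whose CPU speed meets min_realtime."""
    # The dict runs from smallest to largest model, so the last match wins.
    candidates = [name for name, speed in CPU_REALTIME.items() if speed >= min_realtime]
    if not candidates:
        raise ValueError(f"no model reaches {min_realtime}x realtime on CPU")
    return candidates[-1]


print(largest_model_at(1.0))  # medium
print(largest_model_at(5.0))  # base
```

On your own hardware, replace the numbers with timings from a test recording before trusting the result.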

Install edge-tts

pip install edge-tts

That's it. No account. No API key. The library hits Microsoft's free TTS endpoint directly.

Pick a voice

List available voices:

edge-tts --list-voices | head -30

You'll see hundreds of voices across languages, genders and styles. My picks for English:

  • en-GB-RyanNeural: clean British male
  • en-GB-LibbyNeural: clean British female
  • en-US-GuyNeural: clean American male
  • en-US-JennyNeural: clean American female

Test a voice before wiring it in:

edge-tts --voice en-GB-RyanNeural --text "Hello, this is Hermes speaking." --write-media /tmp/test.mp3
mpv /tmp/test.mp3

(Or use whatever audio player you have.)
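The same test can be run from Python, which is handy for trying several voices in a row. This sketch uses edge-tts's Communicate class and its save method; media_path and synthesise are hypothetical helper names, and writing to /tmp is an assumption carried over from the command above.

```python
import asyncio
import sys


def media_path(voice: str, directory: str = "/tmp") -> str:
    """Build an output filename such as /tmp/en-GB-RyanNeural.mp3."""
    return f"{directory.rstrip('/')}/{voice}.mp3"


async def synthesise(text: str, voice: str, rate: str = "+0%") -> str:
    # Imported here so media_path works even if the package is missing.
    import edge_tts

    path = media_path(voice)
    communicate = edge_tts.Communicate(text, voice, rate=rate)
    await communicate.save(path)  # writes the MP3 audio to disk
    return path


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python try_voice.py en-US-JennyNeural
    print(asyncio.run(synthesise("Hello, this is Hermes speaking.", sys.argv[1])))
```

The script prints the path of the file it wrote, so you can pass it straight to your audio player.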

Wire both into Hermes voice mode

Hermes voice mode reads from a config block in ~/.hermes/config.yaml. Replace the default cloud-provider config with the local stack:

voice_mode:
  enabled: true
  stt:
    provider: faster-whisper
    model: base
    language: en
  tts:
    provider: edge-tts
    voice: en-GB-RyanNeural
    rate: "+0%"
    volume: "+0%"

Restart Hermes:

hermes restart

Then start a voice session:

hermes voice

The agent listens through your default microphone. Speak. When you pause, faster-whisper transcribes, Hermes processes the prompt, edge-tts speaks the response. Repeat until you exit with Ctrl-C.

Tuning for actual usability

Out of the box voice mode is functional but rough. Three tweaks make it pleasant.

Set silence detection sensitivity

If Hermes cuts you off mid-sentence (because there's a brief pause), increase the silence threshold:

voice_mode:
  silence_threshold_ms: 800

Default is 500ms. 800 gives you more thinking time without dragging out turn-taking.
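To see why the threshold matters, here is a toy end-of-utterance detector. It is not Hermes's actual implementation: the function name, the 20 ms frame size and the 0.02 loudness cutoff are all assumptions made for illustration. It waits for speech, then reports the first frame at which the trailing silence reaches the threshold.

```python
def utterance_end(levels, frame_ms=20, silence_threshold_ms=800, quiet_below=0.02):
    """Return the frame index where the speaker is judged to have finished.

    levels: per-frame loudness values between 0.0 and 1.0.
    Returns None if the trailing silence never reaches the threshold.
    """
    needed = silence_threshold_ms // frame_ms  # consecutive quiet frames required
    quiet_run = 0
    heard_speech = False
    for i, level in enumerate(levels):
        if level >= quiet_below:
            heard_speech = True
            quiet_run = 0  # any sound resets the silence counter
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= needed:
                return i
    return None


# 1 s of speech, a 600 ms pause, 1 s more speech, then 1 s of silence.
frames = [0.3] * 50 + [0.0] * 30 + [0.3] * 50 + [0.0] * 50
print(utterance_end(frames, silence_threshold_ms=500))  # 74: cuts in during the pause
print(utterance_end(frames, silence_threshold_ms=800))  # 169: waits for the real end
```

With a 500 ms threshold the 600 ms thinking pause ends the turn early; at 800 ms the same pause is ignored and the turn ends only after the final silence.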

Reduce response length for voice

The default Hermes response style is good for reading on screen and bad for listening. A 300-word response takes somewhere between 90 seconds and two minutes to speak, depending on the voice and rate. Add this to your SOUL.md or set it via /personality:

When in voice mode, keep responses under 50 words unless I explicitly ask for detail.

The agent picks up the context. Quick conversational turns. Much better feel.
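As a rough check on those numbers, speaking time is just word count divided by speaking rate. The words-per-minute figures below are assumptions (around 150 for a measured pace, around 200 for a brisk one), not measured Edge TTS values, and speaking_seconds is a hypothetical helper.

```python
def speaking_seconds(words: int, words_per_minute: float = 150.0) -> float:
    """Estimate how long a reply of the given length takes to speak."""
    return words / words_per_minute * 60


print(round(speaking_seconds(300)))       # 120
print(round(speaking_seconds(300, 200)))  # 90
print(round(speaking_seconds(50)))        # 20
```

A 50-word cap therefore keeps most replies to around 20 seconds, which is roughly the length of a normal conversational turn.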

Pick a faster TTS rate

Edge TTS default rate is slow for native English speakers. Bump it up:

voice_mode:
  tts:
    rate: "+15%"

15% faster is the sweet spot for me. 25% starts to sound rushed. Try a few values.
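The rate values in the config are signed percentages written as strings ("+0%", "+15%"). As far as I can tell edge-tts rejects a value without the sign, so a small formatter and checker helps when trying several rates from a script. Both function names are hypothetical, and the pattern is my assumption about the accepted format rather than something taken from the Hermes docs.

```python
import re

# Assumed format: a sign, one or more digits, then a percent sign.
RATE_PATTERN = re.compile(r"^[+-]\d+%$")


def rate_string(percent: int) -> str:
    """Format a signed percentage the way the config writes it."""
    return f"{percent:+d}%"


def is_valid_rate(value: str) -> bool:
    """Check that a rate string carries an explicit sign and a percent sign."""
    return bool(RATE_PATTERN.match(value))


print(rate_string(15))       # +15%
print(rate_string(0))        # +0%
print(rate_string(-10))      # -10%
print(is_valid_rate("15%"))  # False: the sign is missing
```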

Where this fits in the stack

Voice mode pairs well with other patterns we've covered. Combine with the scheduled tasks from our Hermes cron tasks piece to get the morning briefing read out at 7 a.m. Combine with Telegram gateway from our Telegram setup so voice notes you send to the bot get transcribed and answered automatically.

For the latter, you'll want faster-whisper's larger models because Telegram voice notes are often noisier than your local mic.

GPU acceleration for whisper

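If you have an NVIDIA GPU and want to run the large model in realtime, note that faster-whisper has no separate CUDA build. The same package runs on the GPU once NVIDIA's cuBLAS and cuDNN libraries are available. On Linux they can be installed with pip (check the faster-whisper README for the versions that match your CUDA setup):

pip install nvidia-cublas-cu12 nvidia-cudnn-cu12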

And add to your voice_mode config:

voice_mode:
  stt:
    model: large-v3
    device: cuda

Now you're transcribing at high quality at roughly 5x realtime on a consumer GPU. Worth doing if you want the agent to handle accents, mixed languages or noisy environments well.
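If the same setup has to run on machines with and without a GPU, the device choice can be made at load time. This is a sketch outside Hermes, under stated assumptions: whisper_settings and load_model are hypothetical names, float16 on GPU and int8 on CPU are common choices rather than requirements, and the CUDA check relies on CTranslate2's get_cuda_device_count function.

```python
def whisper_settings(cuda_devices: int) -> dict:
    """Choose model, device and precision from the number of CUDA devices found."""
    if cuda_devices > 0:
        return {"model": "large-v3", "device": "cuda", "compute_type": "float16"}
    return {"model": "base", "device": "cpu", "compute_type": "int8"}


def load_model():
    # Imported lazily so whisper_settings() works without the packages installed.
    import ctranslate2
    from faster_whisper import WhisperModel

    s = whisper_settings(ctranslate2.get_cuda_device_count())
    return WhisperModel(s["model"], device=s["device"], compute_type=s["compute_type"])


print(whisper_settings(0)["device"])  # cpu
print(whisper_settings(1)["device"])  # cuda
```

The returned model and device values map directly onto the model and device fields of the stt block in the config.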

What goes wrong

Microphone not detected

Linux usually needs ALSA or PulseAudio configured correctly. On macOS, grant Terminal microphone permission in System Settings > Privacy & Security. On Windows you may need to enable microphone access for the WSL distro.

arecord -l   # Linux: list capture devices
hermes voice --list-mics

edge-tts errors with "no audio output"

The edge-tts API occasionally returns empty audio for certain voices and certain text patterns. Workaround: catch the error and retry with a different voice or split the text. Usually transient.
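The retry-with-another-voice workaround can be sketched like this. The synth argument stands in for whatever function actually calls edge-tts; speak_with_fallback and flaky_synth are hypothetical names, and the broad except is deliberate because I'm not certain which exception types every edge-tts version raises.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def speak_with_fallback(
    synth: Callable[[str, str], Awaitable[bytes]],
    text: str,
    voices: Sequence[str],
    attempts_per_voice: int = 2,
) -> bytes:
    """Try each voice in turn until one returns non-empty audio."""
    last_error = None
    for voice in voices:
        for _ in range(attempts_per_voice):
            try:
                audio = await synth(text, voice)
                if audio:
                    return audio
                last_error = RuntimeError(f"{voice} returned empty audio")
            except Exception as exc:  # broad on purpose: error types vary
                last_error = exc
    raise RuntimeError("all voices failed") from last_error


async def flaky_synth(text: str, voice: str) -> bytes:
    # Stand-in for a real edge-tts call: the first voice always comes back empty.
    return b"" if voice == "en-GB-RyanNeural" else text.encode()


audio = asyncio.run(
    speak_with_fallback(flaky_synth, "hello", ["en-GB-RyanNeural", "en-GB-LibbyNeural"])
)
print(audio)  # b'hello'
```

Splitting long text into sentences before calling synth would slot in at the same place, with each sentence going through the same fallback.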

Whisper transcribes garbage

Three usual causes. Mic level too low (test with arecord first). Background noise (try the small or medium model). Wrong language config (Whisper detects language by default but voice_mode forces it; check the language field).
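To rule out the first cause, record a few seconds (for example with arecord -f S16_LE -r 16000 -d 3 test.wav) and measure how loud the file actually is. The sketch below uses only the standard library; wav_rms and write_tone are hypothetical helpers, and the 0.02 "too quiet" guideline in the comment is my assumption, not a Whisper or Hermes setting.

```python
import math
import os
import struct
import tempfile
import wave


def wav_rms(path: str) -> float:
    """Return the RMS level of a 16-bit PCM WAV file, scaled to 0.0-1.0."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        raw = wav.readframes(wav.getnframes())
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    if not samples:
        return 0.0
    mean_square = sum(s * s for s in samples) / len(samples)
    return math.sqrt(mean_square) / 32768


def write_tone(path: str, amplitude: float, seconds: float = 0.5, rate: int = 16000) -> None:
    """Write a 440 Hz mono test tone so the check can be tried without a mic."""
    frames = b"".join(
        struct.pack("<h", int(amplitude * 32767 * math.sin(2 * math.pi * 440 * i / rate)))
        for i in range(int(seconds * rate))
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(frames)


path = os.path.join(tempfile.gettempdir(), "level_check.wav")
write_tone(path, 0.5)
loud = wav_rms(path)
write_tone(path, 0.01)
quiet = wav_rms(path)
# Assumed guideline: speech averaging well below 0.02 is probably too quiet.
print(loud > 0.02, quiet > 0.02)  # True False
```

Point wav_rms at your own recording; if the level is very low, raise the input gain before blaming the model.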

Hermes responds in English when you spoke another language

Set the agent's persona to respond in the language it heard. SOUL.md line: "Respond in the same language the user spoke." Or set language: auto in the STT config and let Whisper detect.

Privacy reality check

faster-whisper runs locally. Your audio never leaves your machine. Edge TTS does send your text to Microsoft's servers (that's where the synthesis happens). If you're voicing sensitive prompts, edge-tts isn't the right output choice. Local TTS options exist (Coqui, Piper, Mozilla TTS) but they don't sound as good. Trade-off.

For most use, "the question text gets sent to Microsoft" is the same privacy posture as "the question gets sent to Anthropic" (which already happens through the LLM call). If you've accepted the LLM going to a hosted provider, Edge TTS isn't a meaningful additional leak.

What edge-tts doesn't do

It doesn't do real-time streaming TTS. The whole response gets synthesised, then played. For long responses you'll hear a few seconds of delay before audio starts. ElevenLabs and OpenAI TTS support streaming and feel more responsive. For free local: this is the trade-off.

If responsiveness matters more than cost, OpenAI TTS streaming is the go-to. The voice_mode config in Hermes supports it; swap the tts provider block.

Where else to take this

Voice in, voice out is the most obvious use. Voice in, Telegram out is the more interesting one for me. I'll dictate a quick note while walking, the agent processes it (writing a Telegram message back to me, pushing a task to my todo list or whatever else the prompt asks), I see the result on screen later.

The same multi-channel pattern from our multi-platform gateway tutorial applies. Voice is just another channel.

Running this on a VPS

Voice mode is mostly a local-machine feature because microphones and speakers don't exist on a VPS. You can still use a VPS-hosted Hermes brain for the agent side, with voice handled on your laptop using Hermes Desktop's voice mode. The Desktop app talks to the remote agent and routes audio I/O locally. Setup pattern in our Desktop remote backend on VPS tutorial.

For the VPS itself, the LumaDock Hermes Agent template handles the brain side cleanly. Unmetered bandwidth and no setup fees, which matters because voice transcripts plus TTS prompts plus LLM responses add up to more total traffic than text-only use. Full setup in our Hermes Agent complete guide.

FAQ

How do I set up Hermes Agent voice mode without paying for cloud APIs?

Use faster-whisper for speech-to-text (runs locally) and edge-tts for text-to-speech (Microsoft's free TTS endpoint, no account needed). Both install via pip. Configure in ~/.hermes/config.yaml under voice_mode.
