Voice Messages
When gateway.voice.enabled = true (the default), inbound voice notes and audio attachments are decoded with ffmpeg, transcribed by nodejs-whisper, and the transcript is sent to the agent as if you’d typed it.
Pipeline
Inbound audio attachment
  ↓
ffmpeg → decode to 16 kHz mono WAV
  ↓
nodejs-whisper → local ASR transcription (no network)
  ↓
Transcript → forwarded to the agent like a normal text message

Why ffmpeg?
Telegram, Discord, and Slack deliver voice messages as compressed audio (Telegram uses OGG/Opus; Discord and Slack vary). Whisper needs a 16 kHz mono PCM WAV. ffmpeg does the conversion in one shot: resample, downmix, container swap.
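The conversion described above maps onto a handful of standard ffmpeg options. The following is a minimal sketch, assuming the gateway shells out to the ffmpeg CLI. The helper name `buildWhisperWavArgs` and the file names are hypothetical, and the gateway's actual invocation may differ.

```typescript
// Hypothetical helper (not taken from the Ptah source): builds ffmpeg CLI
// arguments that turn an inbound audio file into the 16 kHz mono PCM WAV
// that Whisper expects.
function buildWhisperWavArgs(inputPath: string, outputPath: string): string[] {
  return [
    "-y",                // overwrite the output file if it already exists
    "-i", inputPath,     // compressed source, e.g. OGG/Opus from Telegram
    "-ar", "16000",      // resample to 16 kHz
    "-ac", "1",          // downmix to a single (mono) channel
    "-c:a", "pcm_s16le", // encode as 16-bit little-endian PCM
    "-f", "wav",         // write a WAV container
    outputPath,
  ];
}

// Example file names are placeholders.
const args = buildWhisperWavArgs("voice-note.ogg", "voice-note.wav");
console.log(["ffmpeg", ...args].join(" "));
```

The resulting array could be passed to Node's `child_process.execFile` together with the binary path that the `ffmpeg-static` package exports. Keeping the argument list in a pure function makes it easy to inspect without running ffmpeg.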
Requirements
- ffmpeg ships bundled with the Ptah desktop app via the ffmpeg-static npm package: no separate install, no PATH setup. The pinned binary is platform-specific (Windows / macOS / Linux) and is selected automatically by electron-builder at packaging time.
- whisper model files are downloaded on first run by nodejs-whisper and cached locally under the user data directory.
Fallback
If the bundled ffmpeg binary is missing (corrupted install) or whisper fails to load, the gateway replies with a short “voice message ignored” notice on the same chat thread and drops the audio. It does not fail silently: you’ll see the platform reply.
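The fallback behaviour can be sketched as a try/catch around the two pipeline stages. This is an illustration under stated assumptions, not the gateway's actual code: the `VoiceDeps` interface, the `handleVoiceMessage` function, and the exact notice wording are hypothetical. The dependencies are injected so the sketch runs without ffmpeg or a Whisper model.

```typescript
// All names below are hypothetical stand-ins for the gateway's internals.
interface VoiceDeps {
  decodeToWav(audioPath: string): Promise<string>; // stands in for the ffmpeg step
  transcribe(wavPath: string): Promise<string>;    // stands in for nodejs-whisper
  reply(text: string): Promise<void>;              // posts on the same chat thread
  forwardToAgent(text: string): Promise<void>;     // normal text-message path
}

// Returns true if a transcript reached the agent, false if the audio was dropped.
async function handleVoiceMessage(audioPath: string, deps: VoiceDeps): Promise<boolean> {
  let transcript: string | null = null;
  try {
    const wavPath = await deps.decodeToWav(audioPath);
    transcript = await deps.transcribe(wavPath);
  } catch {
    // Missing binary or model load failure: fall through to the notice below.
    transcript = null;
  }

  if (transcript === null) {
    // Tell the user instead of failing silently, then drop the audio.
    await deps.reply("voice message ignored");
    return false;
  }

  await deps.forwardToAgent(transcript);
  return true;
}
```

Only the decode and transcribe stages sit inside the `try`, so a failure while forwarding the transcript is not misreported as a voice-decoding problem.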
Disabling
Set gateway.voice.enabled = false in Settings → Messaging to ignore audio attachments entirely. Text messages continue to work.
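Because the setting defaults to true, a gate on it has to treat a missing value as enabled. The following is a small sketch of that check. Only the `gateway.voice.enabled` key comes from this page; the `Settings` shape and the `isVoiceEnabled` helper are hypothetical.

```typescript
// Hypothetical settings shape; every level is optional so that an
// unconfigured install still resolves to the documented default.
interface Settings {
  gateway?: { voice?: { enabled?: boolean } };
}

// Voice handling is on unless the flag is explicitly set to false.
function isVoiceEnabled(settings: Settings): boolean {
  return settings.gateway?.voice?.enabled ?? true;
}

console.log(isVoiceEnabled({})); // prints true (default)
console.log(isVoiceEnabled({ gateway: { voice: { enabled: false } } })); // prints false
```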