
Voice Messages

When gateway.voice.enabled = true (the default), inbound voice notes and audio attachments are decoded with ffmpeg, transcribed by nodejs-whisper, and the transcript is sent to the agent as if you’d typed it.

Inbound audio attachment
ffmpeg → decode to 16 kHz mono WAV
nodejs-whisper → local ASR transcription (no network)
Transcript → forwarded to the agent like a normal text message

Telegram, Discord, and Slack deliver voice messages as compressed audio (Telegram uses OGG/Opus; Discord and Slack vary). Whisper needs 16 kHz mono PCM audio in a WAV container. ffmpeg does the whole conversion in a single invocation: it decodes the compressed stream, resamples to 16 kHz, downmixes to mono, and writes the WAV file.
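As a rough illustration of that conversion, the sketch below builds an ffmpeg argument list that produces 16 kHz mono 16-bit PCM WAV. The flags (-i, -vn, -ar, -ac, -c:a, -f, -y) are standard ffmpeg options, but the function name buildWhisperWavArgs and the choice of pcm_s16le are assumptions made for this example; the exact arguments the gateway passes are not documented here.

```typescript
// Hypothetical helper (not part of the gateway's documented API):
// builds ffmpeg arguments for a 16 kHz mono PCM WAV suitable for Whisper.
function buildWhisperWavArgs(inputPath: string, outputPath: string): string[] {
  return [
    "-y",                 // overwrite the output file if it already exists
    "-i", inputPath,      // compressed input, e.g. an OGG/Opus voice note
    "-vn",                // ignore any video or cover-art stream
    "-ar", "16000",       // resample to 16 kHz
    "-ac", "1",           // downmix to a single (mono) channel
    "-c:a", "pcm_s16le",  // assumed sample format: 16-bit little-endian PCM
    "-f", "wav",          // write a WAV container
    outputPath,
  ];
}
```

The resulting array could be handed to a process launcher such as Node's child_process.spawn together with the path of the bundled ffmpeg binary. Keeping the argument list in a pure function makes it easy to check without running ffmpeg.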

  • ffmpeg ships bundled with the Ptah desktop app via the ffmpeg-static npm package — no separate install, no PATH setup. The pinned binary is platform-specific (Windows / macOS / Linux) and is selected automatically by electron-builder at packaging time.
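  • Whisper model files are downloaded on first run by nodejs-whisper and cached locally under the user data directory, so the first run needs network access. After that, transcription runs entirely on the local machine.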

If the bundled ffmpeg binary is missing (corrupted install) or whisper fails to load, the gateway replies with a short “voice message ignored” notice on the same chat thread and drops the audio. It does not silently fail — you’ll see the platform reply.
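The fallback behavior can be sketched as follows. This is a minimal illustration, assuming the decode and transcribe steps are injected as functions; every name here (VoiceSteps, handleVoiceMessage, IGNORED_NOTICE) is hypothetical and does not reflect the gateway's actual code or the exact wording of its notice.

```typescript
// Hypothetical shapes for illustration only; the real gateway internals may differ.
type VoiceSteps = {
  decodeToWav: (audioPath: string) => Promise<string>; // ffmpeg step, returns a WAV path
  transcribe: (wavPath: string) => Promise<string>;    // whisper step, returns the transcript
  forwardToAgent: (text: string) => Promise<void>;     // same path as a typed message
  replyInThread: (notice: string) => Promise<void>;    // reply on the originating chat thread
};

const IGNORED_NOTICE = "voice message ignored"; // assumed wording

async function handleVoiceMessage(
  audioPath: string,
  steps: VoiceSteps,
): Promise<"forwarded" | "ignored"> {
  let transcript: string | null = null;
  try {
    const wavPath = await steps.decodeToWav(audioPath);
    transcript = await steps.transcribe(wavPath);
  } catch {
    // A missing ffmpeg binary or a whisper load failure both end up here.
    transcript = null;
  }
  if (transcript === null) {
    // Tell the user instead of failing silently, then drop the audio.
    await steps.replyInThread(IGNORED_NOTICE);
    return "ignored";
  }
  await steps.forwardToAgent(transcript);
  return "forwarded";
}
```

Only the decode and transcribe steps are wrapped in the try block, so an error while forwarding the transcript is not mistaken for a voice-processing failure.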

Set gateway.voice.enabled = false in Settings → Messaging to ignore audio attachments entirely. Text messages continue to work.