Three shapes of a voice bot
I’ve shipped three voice products in the last year. None of them look the same. Voicemail to SMS is async and one-to-one. Realtime WebRTC is a phone call. WhatsApp with voice notes is chat that sometimes talks. The channel decides almost everything.
Three shapes, side by side:

| Shape | Latency budget | Trust model |
| --- | --- | --- |
| Voicemail in, SMS out | Minutes | One-shot, final |
| Realtime WebRTC | Under half a second to first audio | Live, recoverable |
| WhatsApp with voice notes | Seconds | Interactive, correctable |
Three channels. Three latency budgets. Three trust models. Pick wrong and the product stops working, no matter how good the model is.
Shape one. Voicemail in, SMS out.
Someone rings, I don’t pick up, they leave a voicemail. A worker transcribes the recording with Whisper. GPT-4o-mini reads the transcript and returns a summary, an intent tag, a suggested reply, and a confidence score. A confidence gate decides whether to send the LLM’s text or one of five pre-written SMS templates. The caller gets exactly one text back. They can’t ask follow-up questions.
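The worker's classifier output might look something like this. The field names here are my guesses from the description (summary, intent tag, suggested reply, confidence score); the real schema may differ.

```python
# Assumed shape of the GPT-4o-mini classifier output for one voicemail.
# Field names and values are illustrative, not the actual schema.
result = {
    "summary": "Caller wants a callback about a quote.",
    "intent": "callback_request",
    "suggested_reply": "Thanks for calling! We'll ring you back today.",
    "confidence": "high",
}
```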
The latency budget is generous. Nobody expects an instant response to a voicemail.
The trust model is low. One message ships to a stranger. If the LLM is wrong, they act on wrong information and I find out hours later when the phone rings again.
Three design constraints fell out of the trust model.
- The fallback template is the default. Send the LLM's text only when it reports `confidence: high` and the intent is one of `callback_request`, `information_request`, or `issue_report`. Anything else falls back to a handful of static templates I wrote myself. The LLM is the exception. The template is the rule.
- One SMS per call, ever. Twilio retries webhooks. Networks blip. The fix is hard idempotency on a status column so the phase short-circuits on replay. Without it, a retry means a duplicate text to the caller, and the caller loses trust in the thing that's meant to be helping.
- The owner gets a copy. One notification SMS per inbound call, with the intent, the summary, and the caller's number as a `tel:` link. If the auto-reply misfired, the business owner can still tap to call back.
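The first two constraints can be sketched in a few lines. This is a minimal sketch, not the real code: the template wording, function names, and table schema are mine, and the list stands in for the actual Twilio send.

```python
import sqlite3

ALLOWED_INTENTS = {"callback_request", "information_request", "issue_report"}

# Hand-written static templates; wording is illustrative.
TEMPLATES = {
    "callback_request": "Thanks for calling. We'll ring you back shortly.",
    "default": "Thanks for your message. We'll get back to you soon.",
}

def choose_reply(llm: dict) -> str:
    """The template is the rule; the LLM's text is the exception."""
    if llm.get("confidence") == "high" and llm.get("intent") in ALLOWED_INTENTS:
        return llm["suggested_reply"]
    return TEMPLATES.get(llm.get("intent"), TEMPLATES["default"])

def send_once(conn, call_sid: str, body: str, outbox: list) -> bool:
    """Hard idempotency: flip status pending -> sent in one statement.
    A replayed webhook finds no 'pending' row and short-circuits."""
    cur = conn.execute(
        "UPDATE calls SET status = 'sent' WHERE sid = ? AND status = 'pending'",
        (call_sid,),
    )
    conn.commit()
    if cur.rowcount == 0:
        return False          # replay: already handled, send nothing
    outbox.append(body)       # stand-in for the Twilio SMS send
    return True
```

The point of doing the idempotency check as a single conditional `UPDATE` rather than a read-then-write is that it stays correct even when two webhook deliveries race.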
Shape two. Realtime voice.
Someone dials in and the audio streams both ways over WebRTC. LiveKit self-hosted for the media, OpenAI Realtime API for the agent. There’s a traditional pipeline (speech to text, then LLM, then text to speech) sitting next to it as a fallback, because Realtime has downtime and because some voices cost less.
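The fallback arrangement reduces to something like this sketch. `RealtimeDown` and the two callables are stand-ins I made up, not LiveKit or OpenAI types; the real failover logic has more signals (health checks, per-voice cost) than one exception.

```python
# Minimal sketch of "two pipelines, one agent": prefer Realtime,
# fall back to STT -> LLM -> TTS when it is unavailable.
class RealtimeDown(Exception):
    """Stand-in for whatever error the Realtime path raises."""

def answer_call(call, realtime, classic):
    try:
        return realtime(call)
    except RealtimeDown:
        return classic(call)   # slower but predictable
```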
The latency budget is tight. Under 500ms for first audio out or the conversation feels broken. A pause reads as thinking, then as confusion, then as a dropped call.
The trust model is higher than voicemail. The caller is on the line. They interrupt, push back, ask again. The bot gets to recover. The UI is the audio.
What changes:
- Two pipelines, one agent. Realtime is great when it works. The traditional pipeline is slower but predictable, and the cost profile is different. Each agent picks.
- Interrupt handling is the product. Server-side voice activity detection. When the caller starts speaking, the bot stops talking mid-sentence. Without this, users feel talked over and hang up.
- First-token latency beats every other metric. Warm pools, streamed TTS, a pre-cached opener. The time to first audible reply is the single variable that decides whether the product feels alive.
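The interrupt-handling bullet can be sketched as a tiny barge-in state machine, assuming a VAD that fires a speech-start callback (LiveKit's real API differs; class and method names here are mine).

```python
class BargeInSpeaker:
    """Streams TTS chunks until the caller starts speaking,
    then stops mid-sentence instead of talking over them."""

    def __init__(self):
        self.interrupted = False
        self.played = []

    def speak(self, chunks):
        for chunk in chunks:
            if self.interrupted:
                break              # caller is talking: go quiet now
            self.played.append(chunk)  # stand-in for audio playback

    def on_speech_start(self):
        """Called by server-side VAD when the caller's voice starts."""
        self.interrupted = True
```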
Shape three. Chat with voice notes.
WhatsApp on Twilio. Users send text. Sometimes they send voice notes. Sometimes the bot sends voice notes back. The conversation is async but the surface looks synchronous because the thread is always there.
Latency is moderate. A few seconds feels fine, ten feels slow, twenty feels broken.
Trust is medium. The user can see the whole history. They can scroll back and ask the same thing differently. The mistake from last Thursday is still in the thread.
Three shifts from the other shapes:
- Paid actions bypass the LLM. The tool generates a cover letter. In code, I send the cover letter directly to the user. In the message history I replace it with `[ALREADY SENT, do NOT repeat]`. If I let the LLM forward it, the LLM summarises. The user paid two credits and gets one sentence. Anything the user paid for ships deterministically.
- Voice out is content-aware. Don't send a voice note for a cover letter (they want to read it). Do send one for "great question, here's how I'd prep for that." The decision is per-tool, not global.
- Tools, not prompts. Nine tools in the router. The system prompt fits on a page. The prompt is a router, the tools are the product. Adding a capability means adding a tool, not rewriting the prompt.
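The paid-action bypass fits in a few lines. A minimal sketch, assuming a `send` callable for the WhatsApp delivery and a chat-style history list; the function name and history format are mine.

```python
PLACEHOLDER = "[ALREADY SENT, do NOT repeat]"

def deliver_paid_result(document: str, send, history: list) -> None:
    """Paid content ships deterministically: the user gets the full
    document, and the LLM only ever sees a placeholder in history,
    so it cannot summarise away what the user paid for."""
    send(document)
    history.append({"role": "tool", "content": PLACEHOLDER})
```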
What the channel decides
Each shape comes with its own defaults. Picking the channel picks all of them.
- Latency budget. Minutes for voicemail. Seconds for chat. Under half a second for realtime.
- Trust model. One-shot and final at the voicemail end. Interactive and correctable for chat. Live and recoverable for realtime.
- Failure tolerance. Lowest at voicemail (the caller can’t reply). Highest at realtime (the caller is right there).
- Fallback shape. Static templates, or a slower pipeline, or bypassing the LLM for paid content. Different answers to the same question: what happens when the model is wrong?
You can fight the channel. I wouldn’t.
The one thing that’s the same
In every shape, the LLM is the unreliable part. It’s also the most interesting part. The design work is the same move every time. Find the probabilistic bits. Wrap them in something deterministic. Make the safe path the default.
The channel decides what “safe” looks like. Whether that’s a fallback template, a fallback pipeline, or a deterministic delivery path. The shape of the work is identical.
The voice is the last 10%
The voice of the bot gets all the attention. I’ve spent weeks on system prompts. The best thing I can say about each of these products is this: when the model misbehaves, nothing bad happens to the user.
That’s not a prompt problem. That’s an architecture problem. The shape you pick decides the architecture.