Architecture

Production WhatsApp AI: What the Tutorials Don't Tell You

Lalit Sharma · 10 min read

The 80% Nobody Talks About

Every WhatsApp AI tutorial shows you the happy path: receive message, call GPT, send response. Three API calls and a demo video. That’s about 20% of the work.

The other 80% is the part that makes production systems production systems: What happens when WhatsApp delivers the same webhook twice? When two messages from the same user arrive simultaneously? When the LLM times out mid-conversation and the user sends “hello?” again? When a voice note is 4 minutes long and the transcription takes 90 seconds?

I’ve built two WhatsApp AI systems that handle these problems differently — because they have fundamentally different architectures:

Kaiwa’s chatbot is stateful. It manages multi-step booking flows — event discovery, seat selection, dietary preferences, Razorpay payment, post-dinner feedback. Conversation state lives in Redis with a 24-hour TTL. The hard part is concurrency and state management.

MR Voice is stateless. A pharma field rep sends a voice note describing a doctor visit; 30 seconds later, they get back a structured intelligence report. No database, no queue, no sessions. Every request is independent. The hard part is the transcription-to-extraction pipeline and graceful degradation.

Same messaging platform. Opposite architectures. The patterns that emerged are what tutorials skip.

Webhooks Will Betray You

WhatsApp Business API delivers messages via webhooks. The first thing to accept: webhooks are unreliable. Your server restarts. The network hiccups. WhatsApp retries. You’ll receive the same message twice. If you process it twice, the user gets duplicate responses — or worse, double charges.

Kaiwa: Deduplication via Redis SET NX

Every incoming webhook ID gets written to Redis with a SET NX (set-if-not-exists) and a 5-minute TTL. If the key already exists, the message is a duplicate — skip it silently.

dedup:{message_id} → "1" (TTL: 300 seconds)

The webhook handler returns HTTP 200 immediately, before processing. The actual work runs in a FastAPI background task. This matters because WhatsApp has a tight timeout on webhook responses — if you block on LLM inference, the webhook times out and WhatsApp retries, creating the exact duplicate you’re trying to prevent.
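The ack-then-process pattern can be sketched as follows. This is an illustration, not the production code: the `InMemoryDedup` class stands in for the Redis `SET NX` call (which would be `redis.set(f"dedup:{message_id}", "1", nx=True, ex=300)`), the payload field name is hypothetical, and `background_queue` stands in for FastAPI's `BackgroundTasks`.

```python
import time

class InMemoryDedup:
    """Illustrative stand-in for Redis SET NX with a TTL; production code
    would call redis.set(f"dedup:{message_id}", "1", nx=True, ex=300)."""
    def __init__(self, ttl_seconds=300.0):
        self.ttl = ttl_seconds
        self.seen = {}  # message_id -> expiry timestamp

    def first_time(self, message_id):
        now = time.monotonic()
        if self.seen.get(message_id, 0) > now:
            return False  # key still alive: this is a duplicate delivery
        self.seen[message_id] = now + self.ttl
        return True

dedup = InMemoryDedup()
background_queue = []  # stands in for FastAPI BackgroundTasks

def handle_webhook(payload):
    """Return the HTTP status immediately; do the real work later."""
    message_id = payload["message_id"]  # field name is illustrative
    if dedup.first_time(message_id):
        background_queue.append(payload)  # real code: background_tasks.add_task(...)
    return 200  # always ack before any LLM call runs
```

Note that a duplicate still gets a 200: from WhatsApp's point of view the delivery succeeded, so it stops retrying.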

MR Voice: Return 200, Process in Background

MR Voice takes the same approach but doesn’t need deduplication. Each voice note is processed independently — if a duplicate arrives, it generates the same report twice. The user gets a redundant message, not a broken state. For a stateless system, idempotency is free.

Both systems validate webhook signatures before processing. Kaiwa uses HMAC-SHA256 with hmac.compare_digest() for constant-time comparison (prevents timing attacks). MR Voice validates Telegram’s X-Telegram-Bot-Api-Secret-Token header. Neither system trusts unsigned payloads.
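The HMAC check is short enough to show in full. Meta's webhooks send the signature of the raw request body in an `X-Hub-Signature-256` header, formatted as `sha256=<hexdigest>` and keyed with the app secret; the function name here is my own.

```python
import hmac
import hashlib

def valid_signature(app_secret: str, raw_body: bytes, signature_header: str) -> bool:
    """Verify a Meta-style webhook signature: 'sha256=<hexdigest>' of the
    raw request body, keyed with the app secret."""
    expected = "sha256=" + hmac.new(
        app_secret.encode(), raw_body, hashlib.sha256
    ).hexdigest()
    # compare_digest runs in constant time regardless of where the
    # strings differ, which is what defeats timing attacks
    return hmac.compare_digest(expected, signature_header)
```

The comparison must run on the raw bytes of the request body, before any JSON parsing: re-serializing the parsed payload can change whitespace and break the digest.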

The Concurrency Problem Nobody Warns You About

Here’s a scenario tutorials never cover: a user sends “What’s the venue for next Tuesday in HSR Layout?” and then immediately sends “no, tell me about venue in Central Bengaluru” before the first message finishes processing. Both messages hit your webhook simultaneously. Both read the same session state. Both trigger LLM tool calls. The first response comes back with HSR Layout results, but the session history now contains the Central Bengaluru query too — the conversation state is corrupted.

Kaiwa: Per-Phone Distributed Locks

Kaiwa solves this with a Redis-based per-phone lock. Before processing any message, the handler acquires a lock:

lock:{phone} → "1" (NX, TTL: 30 seconds)

If the lock is already held (another message from the same user is being processed), the message is silently skipped. The 30-second TTL prevents deadlocks if the process crashes mid-handling.

This is a deliberate tradeoff: the second message is dropped while the first is being processed. The user gets HSR Layout results, not Central Bengaluru. But the alternative — queuing both and processing sequentially — adds latency that makes the chatbot feel unresponsive. In practice, the user sees the HSR Layout response, sends “no, Central Bengaluru” again, and gets the right answer on the next turn. One extra message beats a corrupted session.
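The acquire-or-skip lock can be sketched like this. Again an in-memory stand-in: the real thing is a single Redis `SET lock:{phone} 1 NX EX 30`, and the handler names are hypothetical.

```python
import time

class PhoneLock:
    """Illustrative stand-in for the Redis pattern
    SET lock:{phone} 1 NX EX 30: acquire-or-skip, never block."""
    def __init__(self, ttl_seconds=30.0):
        self.ttl = ttl_seconds
        self.locks = {}  # phone -> expiry timestamp

    def acquire(self, phone):
        now = time.monotonic()
        if self.locks.get(phone, 0) > now:
            return False  # another message from this user is in flight
        self.locks[phone] = now + self.ttl  # TTL guards against crashed handlers
        return True

    def release(self, phone):
        self.locks.pop(phone, None)

locks = PhoneLock()

def handle_message(phone, text, process):
    if not locks.acquire(phone):
        return None  # deliberate drop: skip rather than corrupt the session
    try:
        return process(phone, text)
    finally:
        locks.release(phone)
```

The `try/finally` matters: if processing raises, the lock is still released, and if the whole process dies before `finally` runs, the TTL cleans up.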

On top of the lock, Kaiwa enforces a rate limit: 10 messages per minute per phone number, implemented as a Redis INCR with a 60-second TTL. This protects against spam and accidental message floods.
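The rate limit is the same fixed-window counter idea, sketched here in memory; in Redis it is an `INCR` on `rate:{phone}` plus an `EXPIRE 60` set on the first increment.

```python
import time

class RateLimiter:
    """Illustrative stand-in for Redis INCR + EXPIRE: a fixed 60-second
    window allowing at most `limit` messages per phone number."""
    def __init__(self, limit=10, window_seconds=60.0):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}  # phone -> (window_expiry, count)

    def allow(self, phone):
        now = time.monotonic()
        expiry, count = self.counters.get(phone, (0.0, 0))
        if expiry <= now:  # window lapsed: start fresh (in Redis, the key expired)
            expiry, count = now + self.window, 0
        count += 1
        self.counters[phone] = (expiry, count)
        return count <= self.limit
```

A fixed window is deliberately crude: a user can burst at the window boundary, but for spam protection (as opposed to billing) that is an acceptable tradeoff for one Redis round trip per message.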

MR Voice: No Concurrency Problem

MR Voice doesn’t need locks because there’s no shared state. Two simultaneous voice notes from the same user generate two independent reports. No session to corrupt, no booking to double-process. Stateless architectures sidestep entire categories of bugs.

Conversation State: The Hardest Part of Chatbot Engineering

WhatsApp doesn’t have sessions. Every message arrives independently — there’s no “conversation” object in the API. You build that yourself.

Kaiwa: A Full State Machine in Redis

Kaiwa stores a complete session object per phone number in Redis:

session:{phone} → JSON (TTL: 24 hours)

The session tracks everything: conversation stage (welcome → qa → booking → payment → confirmed), user ID, selected event, dietary preferences, booking ID, payment link, escalation status, and the full conversation history.

The stage field gates behavior. In the payment stage, if the user sends any message, the bot checks payment status instead of routing to the LLM. In the confirmed stage, it offers rebooking. This prevents the LLM from accidentally re-starting a booking flow after payment.

For LLM context, the bot sends the last 10 messages from the conversation history — not the full session object. The LLM sees the recent conversation; the state machine handles the flow logic. This separation is critical: the LLM handles natural language understanding, deterministic code handles the booking state transitions.
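That separation can be made concrete in a few lines. This is a sketch under stated assumptions: the `store` dict stands in for Redis (`GET`/`SET session:{phone}` with `ex=86400`), the session fields are trimmed down to three, and the `route` return values are hypothetical handler names, not Kaiwa's actual ones.

```python
import json

def load_session(store, phone):
    """`store` is a dict standing in for Redis GET session:{phone}."""
    raw = store.get(f"session:{phone}")
    if raw is None:
        return {"stage": "welcome", "history": [], "selected_event": None}
    return json.loads(raw)

def save_session(store, phone, session):
    store[f"session:{phone}"] = json.dumps(session)  # real code adds ex=86400

def llm_context(session, last_n=10):
    """Only the last N turns reach the LLM; the stage field never does.
    Deterministic code, not the model, drives flow control."""
    return session["history"][-last_n:]

def route(session, text):
    """Stage gates behavior before the LLM is ever consulted
    (handler names are illustrative)."""
    if session["stage"] == "payment":
        return "check_payment_status"  # never let the LLM near a live payment
    if session["stage"] == "confirmed":
        return "offer_rebooking"
    return "call_llm"
```

`route` runs first on every message; the LLM only sees messages that fall through to `call_llm`, which is what stops it from restarting a booking flow after payment.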

MR Voice: Zero State

MR Voice has no session, no history, no state machine. Each voice note is an independent request. The “conversation” is a single exchange: voice in, report out. This simplicity is the entire architecture — no Redis, no database, no session expiry to worry about.

LLM Integration: Tools, Fallbacks, and Failure

Kaiwa: Tool Calling With a Fallback Chain

Kaiwa’s chatbot uses Azure OpenAI (GPT-4o-mini) with three defined tools:

  • get_upcoming_dinners — searches available events
  • get_policies — returns cancellation, refund, privacy, and safety policies
  • validate_referral_code — checks if a referral code is valid

The LLM decides when to call a tool based on the user’s message. “What’s happening this Saturday?” triggers get_upcoming_dinners. “What’s your cancellation policy?” triggers get_policies. The tool execution loop runs a maximum of three rounds — the LLM calls a tool, gets the result, and either responds or calls another tool.
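The bounded loop looks roughly like this. A minimal sketch, not Kaiwa's code: the reply shape (`content` vs. `tool_call`) is a simplification of the actual OpenAI response schema, and the fallback string is a placeholder.

```python
def run_tool_loop(llm, tools, messages, max_rounds=3):
    """Bounded tool-calling loop. `llm` is any callable returning either
    {"content": ...} or {"tool_call": {"name": ..., "args": {...}}}
    (a simplified shape for illustration)."""
    for _ in range(max_rounds):
        reply = llm(messages)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]  # model answered directly: done
        result = tools[call["name"]](**call["args"])
        messages = messages + [
            {"role": "tool", "name": call["name"], "content": result}
        ]
    return "Sorry, I couldn't finish that. Please try again."  # round budget spent
```

The hard cap on rounds is the point: without it, a model that keeps requesting tools can loop forever, and each round is paid latency the user is staring at.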

When Azure OpenAI fails (timeout, rate limit, outage), Kaiwa falls back to Anthropic Claude Haiku — without tool calling. The fallback can still answer questions from context but can’t search events or validate codes. If both providers fail, the user gets a hardcoded message with an option to escalate to a human.

Three-tier degradation: full tool calling → knowledge-only LLM → hardcoded response. The bot always responds, even if imperfectly.
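The degradation chain is just nested try/except, which is worth seeing because it is so often done wrong (a bare retry on the same provider). A sketch: `primary` and `fallback` stand in for the Azure OpenAI and Claude Haiku clients, and the hardcoded string is illustrative.

```python
def respond(user_message, primary, fallback):
    """Three-tier degradation: full tool calling, then a knowledge-only
    model, then a hardcoded reply. `primary` and `fallback` are any
    callables that may raise (timeouts, rate limits, outages)."""
    try:
        return primary(user_message)   # tier 1: Azure OpenAI with tools
    except Exception:
        pass
    try:
        return fallback(user_message)  # tier 2: Claude Haiku, no tools
    except Exception:
        pass
    return ("Sorry, something went wrong on our side. "
            "Reply HELP to reach a human.")  # tier 3: always answer
```

In production the `except` clauses would also log the failure and narrow the exception types; the structural point is that every tier has a tier below it, so the function cannot fail to return.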

MR Voice: Two-Stage Pipeline, No Tools

MR Voice doesn’t use tool calling. Its LLM integration is a two-stage pipeline:

Stage 1: gpt-4o-transcribe-diarize transcribes the voice note with automatic speaker diarization. Raw audio bytes are sent directly — no format conversion, no silence detection, no chunking. The API handles OGG, MP3, WAV, AAC, AMR, and WebM natively. Speakers are automatically labeled as “MR” (field rep) and “Doctor.”

Stage 2: GPT-5.4 analyzes the diarized transcript and extracts 15+ structured fields: doctor name, products discussed, receptivity assessment (Positive/Neutral/Negative with reasoning), objections, commitments with strength ratings (Strong/Tentative/Polite deflection), next steps, follow-up date, sentiment score, rep talk percentage, key insight, and coaching notes.

The output is validated JSON. If GPT-5.4 returns malformed JSON (markdown fences, trailing commas, truncated output), a text-only repair call sends the broken text back to GPT-5.4 for extraction — without re-sending the audio. Cheaper and faster than a full retry.

If transcription fails entirely, summarization still runs with a “(Transcript was unavailable)” placeholder. The user gets a partial report rather than an error message.

Platform-Specific Formatting: The Tedious Necessity

Both systems format output differently per platform — and this matters more than it sounds.

MR Voice: HTML for Telegram, Plain Text for WhatsApp

The same structured report is formatted twice:

  • Telegram: HTML tags (<b>, <i>), emoji section headers, structured indentation
  • WhatsApp: Markdown-style bold (*text*), emoji headers, simplified layout

Long messages are split at newline boundaries to respect the 4,096-character platform limit. Telegram’s splitter finds the last newline within the limit and cuts there. WhatsApp’s accumulates full lines until adding the next line would exceed the limit. Both approaches respect section boundaries — a report about “Doctor Receptivity” never gets split mid-paragraph.
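The WhatsApp-style accumulate-full-lines splitter is a few lines of code. A sketch, with the caveat noted in the comment: a single line longer than the limit would still need hard truncation, which this version does not do.

```python
def split_message(text, limit=4096):
    """Accumulate whole lines until adding the next would exceed the
    limit; never cut mid-line, so section boundaries survive."""
    chunks, current = [], ""
    for line in text.split("\n"):
        candidate = line if not current else current + "\n" + line
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = line  # a single over-long line would still need hard truncation
    if current:
        chunks.append(current)
    return chunks
```

Splitting on `len()` counts Python characters, not bytes; if the platform limit is byte-based, multi-byte emoji in section headers will make this over-count, so production code should measure the encoded length instead.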

Empty sections are omitted entirely. If there are no objections, the “Objections” section disappears rather than showing “None.” The transcript is truncated to 20 lines with a note: “… [full transcript: 47 lines, 15 min visit].”

Kaiwa: Templates for Critical Messages

Kaiwa uses WhatsApp’s pre-approved message templates for critical communications — booking confirmations, payment links, reminders. These bypass the 24-hour customer service window and can be sent proactively. Free-form LLM responses are used only within the conversation window.

The Patterns

After building both systems, three patterns stand out:

Return 200 before processing. Always. WhatsApp webhooks have tight timeouts. If you block on LLM inference (which can take 2-30 seconds), the webhook times out, WhatsApp retries, and you get duplicates. Acknowledge immediately, process in background.

Separate the LLM from the state machine. In Kaiwa, the LLM handles language understanding and the state machine handles flow control. The LLM never decides to charge a credit card or confirm a booking — it generates a response, and deterministic code acts on the session stage. This is the same “agents for meaning, code for pixels” principle, applied to chatbots.

Design for the failure that will definitely happen. Both systems assume every external call can fail. Kaiwa’s three-tier LLM fallback. MR Voice’s JSON repair instead of full retry. Blob storage deleted fire-and-forget. Message sending that logs errors but never crashes. The user should always get a response — even if it’s “Something went wrong, try again.”

The hard parts of production WhatsApp AI aren’t the AI. They’re webhooks, concurrency, state management, platform formatting, and graceful degradation. Get those right, and the AI part is the easy 20%.