When to Use AI Agents vs. Deterministic Code
The Expensive Mistake in Both Directions
The first version of QueryNLQ’s SQL verification used an LLM. I asked GPT-4o: “Is this SQL query safe to execute?” It said yes to a query with a cross-product join that would have returned 400 million rows. The actual fix was six lines of sqlglot parsing — deterministic, testable, and it has never been wrong.
The first version of the AI Marketing Engine’s poster layout was a rule engine. Dozens of if statements mapping content types to templates. It worked for the first five archetypes. By archetype twelve, every new rule broke two old ones. The actual fix was giving an LLM the archetype catalog and letting it reason about which pattern fit the content.
Both mistakes cost weeks. The first one put an agent where a function belonged — expensive, slow, unreliable for a binary decision. The second put functions where an agent belonged — brittle, unscalable for a creative decision.
After shipping three production systems — QueryNLQ (14-agent NL-to-SQL pipeline), Kaiwa (social dining platform with AI matching), and the AI Marketing Engine (autonomous poster generation) — I’ve landed on five questions that reliably place the boundary.
Five Questions to Find the Boundary
1. Is the output space bounded?
If you can enumerate every valid output, use deterministic code. If the output requires creativity, judgment, or contextual interpretation, use an agent.
In the Marketing Engine, the Overlay Policy engine decides how to make text readable over a background image. Three options: DIRECT (text shadow), SCRIM (gradient overlay), CARD (opaque panel). The choice is driven by a busyness score from image analysis — if busy > 0.60, CARD; if busy > 0.30, SCRIM; otherwise DIRECT. Three outputs, one threshold function. No agent needed.
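The whole policy fits in a few lines. A minimal sketch using the thresholds above (the function name is mine, not the engine's):

```python
def overlay_policy(busy: float) -> str:
    """Map an image-busyness score (0.0 to 1.0) to a text-readability treatment."""
    if busy > 0.60:
        return "CARD"    # opaque panel: background too busy for anything lighter
    if busy > 0.30:
        return "SCRIM"   # gradient overlay: moderate clutter
    return "DIRECT"      # text shadow only: clean background
```

Bounded output, one numeric input, trivially testable. Exactly the shape that never needs an agent.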
But the Creative Director — which selects from 25+ poster archetypes, writes the headline, and maps brand colors — operates in an unbounded space. “Summer sale for a sushi restaurant” could be a hero image overlay, a bold announcement card, or a minimal gradient with large type. The right answer depends on brand voice, content type, and visual context. That’s agent territory.
The test: can you write a lookup table or decision tree that covers every case? If yes, it’s a function. If the table would be infinite, it’s an agent.
2. Does correctness require reasoning across context?
If the answer depends only on the input parameters, use a function. If it depends on understanding relationships, history, or meaning, consider an agent.
In QueryNLQ, SQL verification depends only on the query itself. Does it parse? Does it execute in a read-only transaction? Do the referenced columns exist? sqlglot parses the AST, SQLAlchemy executes it — both are context-free. The query is either valid or it isn’t.
But SQL generation requires reasoning across the user’s intent (“top customers by revenue last quarter”), the pruned schema (which tables, which columns, which foreign keys), and the conversation history (“now filter that by the Asia region”). No function can synthesize a correct JOIN + GROUP BY + ORDER BY + LIMIT from those three inputs. That requires an agent reasoning about meaning.
The test: does the component need to understand why, or only what? Verification asks “what does this SQL do?” — that’s parsing. Generation asks “what SQL would answer this question?” — that’s reasoning.
3. Can you test it with assertions?
If you can write assert output == expected for every valid input, it’s a function. If the best you can do is “this looks reasonable,” it’s probably an agent.
The Marketing Engine’s Typography Engine computes font sizes from container width, text length, and platform dimensions. Same input, same output, every time. I can write assert font_size("Summer Sale 30% Off", 1080) == 64 and it will pass on every run.
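The shape of such a function, as a hypothetical sketch: the scaling constants here are invented for illustration and won't reproduce the real engine's numbers, but the property that matters (same input, same output) holds.

```python
def font_size(text: str, container_width: int) -> int:
    """Deterministic font sizing: shrink as the headline gets longer,
    scale with the container. Constants are illustrative only."""
    base = container_width // 12          # starting size for a short headline
    shrink = max(0, len(text) - 10) // 2  # lose a point per 2 chars past 10
    return max(24, base - shrink)         # never drop below a readable floor
```

Every branch is pure arithmetic, so every case is one assertion away from being a regression test.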
The Quality Auditor evaluates a rendered poster PNG and decides if white text is readable over a partly cloudy sky. There’s no assertion for that. “Readable” is a perceptual judgment that depends on the specific image, the specific text placement, and the specific overlay. That’s why it uses GPT-5.4 with vision — it needs to see the output to evaluate it.
The test: if you can’t write a deterministic test for it, an agent might be the right tool. If you can write a deterministic test, an agent is almost certainly the wrong one.
4. What’s the cost of being wrong?
If a wrong answer causes data corruption, financial loss, or security holes — keep it deterministic. Use agents where mistakes can be caught, retried, or where “good enough” is acceptable.
Kaiwa processes payments through Razorpay. Every step — payment link generation, webhook verification, subscription renewal, invoice generation — is deterministic code. No LLM touches the money path. A hallucinated payment amount or a skipped webhook verification would be catastrophic.
But Kaiwa’s WhatsApp chatbot uses Azure OpenAI with tool calling to handle event queries. When a user asks “What’s happening this Saturday near Indiranagar?”, the LLM interprets the intent and calls the event search API. If it misinterprets “Saturday” as the wrong date, the user sees the wrong events — annoying, but they can ask again. The cost of being wrong is a bad search result, not a bad financial transaction.
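Tool calling works by handing the model a schema of callable functions and letting it fill in the arguments. A sketch of what such a tool definition might look like in the OpenAI function-calling format (the search_events name and its fields are hypothetical, not Kaiwa's actual API):

```python
# Illustrative tool schema; the model fills in "date" and "area" from free text.
search_events_tool = {
    "type": "function",
    "function": {
        "name": "search_events",
        "description": "Find dining events by date and neighborhood.",
        "parameters": {
            "type": "object",
            "properties": {
                "date": {"type": "string", "description": "ISO date, e.g. 2025-06-28"},
                "area": {"type": "string", "description": "Neighborhood, e.g. Indiranagar"},
            },
            "required": ["date"],
        },
    },
}
```

The model maps "this Saturday near Indiranagar" onto those arguments; deterministic code runs the actual search and owns the results.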
The test: if a wrong output is irrecoverable (money, data, access), keep it deterministic. If it’s recoverable (search, copy, layout), an agent is acceptable.
5. Will the input space surprise you?
If inputs are structured and predictable, use a function. If users will express the same intent in a hundred different ways, use an agent.
Kaiwa sends dinner reminders at T-7, T-3, and T-1 days before an event. The input is an event date. The logic is subtraction. A cron job handles this — no LLM needed, no surprises possible.
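The reminder logic really is just subtraction. A minimal sketch with Python's datetime (the function name is illustrative):

```python
from datetime import date, timedelta

def reminder_dates(event_date: date) -> list[date]:
    """Reminders at T-7, T-3, and T-1 days before the event."""
    return [event_date - timedelta(days=d) for d in (7, 3, 1)]
```

Structured input, enumerable output, zero ambiguity.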
But the WhatsApp chatbot receives messages like “hey whats happening sat”, “Any dinners near koramangala this weekend?”, “do you have anything vegetarian friendly on the 28th?” Same intent, wildly different phrasing. No regex or keyword matcher can handle this variance reliably. The LLM with tool calling parses the intent and calls the right API.
The test: can you define a schema for every possible input? If yes, use a function. If users will surprise you, use an agent.
Three Patterns That Kept Showing Up
Agents sandwiched between deterministic guardrails
In every system I’ve built, the most robust architecture is the same: deterministic validation before and after the agent.
QueryNLQ: the SQLGeneration agent writes SQL → sqlglot parses and validates the AST → SQLAlchemy executes in a read-only transaction → SQLReviewer agent checks logical correctness. The agent is never the last step before execution. Deterministic code always gets the final say on safety.
Marketing Engine: the Creative Director agent selects an archetype and writes copy → six deterministic engines handle vision analysis, overlay policy, typography, template composition, and rendering → the Quality Auditor agent evaluates the output. Agents make creative decisions. Engines execute them with pixel-level precision. The renderer has zero AI.
The pattern: agent generates → function validates → function executes. Never let an agent be the last gate before a side effect.
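The sandwich reduces to a pipeline in which the agent's output is treated as untrusted input. A schematic sketch (every function here is a stand-in; in production the first is an LLM call and the other two are sqlglot and SQLAlchemy):

```python
def generate_sql(question: str) -> str:
    # Stand-in for the agent step: reasons about intent and schema.
    return "SELECT name FROM customers LIMIT 10"

def validate_sql(sql: str) -> bool:
    # Stand-in for deterministic validation (AST parse, read-only check).
    return sql.strip().upper().startswith("SELECT")

def execute_readonly(sql: str) -> list:
    # Stand-in for execution inside a read-only transaction.
    return ["Acme Corp"]

def run_query(question: str) -> list:
    sql = generate_sql(question)           # agent generates
    if not validate_sql(sql):              # function validates
        raise ValueError("agent produced invalid SQL")
    return execute_readonly(sql)           # function executes: the final gate
```

The agent can hallucinate all it wants; nothing it says reaches the database without passing the deterministic gate.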
Bounded revision loops with restricted vocabularies
Open-ended “improve until perfect” loops are a production hazard. Every system that uses agent feedback needs hard bounds and constrained output.
Marketing Engine: the Quality Auditor can return exactly four fix types — APPROVE, REWRITE_HEADLINE, REGENERATE_IMAGE, ESCALATE_OVERLAY. Not “make the font bigger” or “try a different shade of blue.” Four verbs that the pipeline knows how to execute. Maximum two revision cycles, then ship the best attempt.
QueryNLQ: the self-correcting SQL loop (generate → verify → review → regenerate) runs a maximum of three times per task. A LoopDetectionManager tracks per-agent visit counts. If any agent runs too many times, the system breaks the loop and asks the user to rephrase. No infinite cycles.
The pattern: constrain the agent’s output to actions the system can execute. Set a hard ceiling on iterations. Make the exit condition deterministic, not agent-decided.
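Put together, a bounded loop with a restricted vocabulary might look like this. The four fix types are the Marketing Engine's; everything else (the dict shape, apply_fix) is an illustrative stand-in:

```python
FIX_TYPES = {"APPROVE", "REWRITE_HEADLINE", "REGENERATE_IMAGE", "ESCALATE_OVERLAY"}
MAX_REVISIONS = 2

def apply_fix(poster: dict, verb: str) -> dict:
    # Stand-in for the deterministic engines that execute each fix verb.
    return {**poster, "revisions": poster["revisions"] + 1, "last_fix": verb}

def revise(poster: dict, audit) -> dict:
    """Bounded revision loop: constrained verbs, hard iteration ceiling."""
    for _ in range(MAX_REVISIONS):
        verdict = audit(poster)        # agent: must return one of FIX_TYPES
        if verdict == "APPROVE" or verdict not in FIX_TYPES:
            break                      # unknown verb or approval ends the loop
        poster = apply_fix(poster, verdict)
    return poster                      # ceiling reached: ship the best attempt
```

The ceiling, not the agent, decides when the loop ends; the agent only picks from verbs the pipeline already knows how to execute.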
AI services are not agents
This distinction tripped me up early. Kaiwa uses Azure Speech SDK for voice transcription, Azure Document Intelligence for receipt OCR, and pgvector for personality embeddings. These are AI-powered, but they’re not agents — they’re deterministic API calls. Same input, same output, no reasoning.
Kaiwa’s voice profile feature records a user answering personality questions, transcribes with Azure Speech SDK, then extracts traits deterministically from the structured text. The voice audio never goes to an LLM. The transcription is a stateless service call. The trait extraction is parsing, not reasoning.
Bill splitting works the same way: Azure Document Intelligence extracts line items from a receipt photo (AI service), then deterministic code splits the total across diners and calculates tax per person (arithmetic).
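The split itself is plain arithmetic. A sketch assuming an even split and a flat tax rate (both assumptions for illustration; the real feature reads amounts from the receipt):

```python
def split_bill(line_items: list[float], tax_rate: float, diners: int) -> float:
    """Per-person share: subtotal plus tax, divided evenly."""
    subtotal = sum(line_items)
    total = subtotal * (1 + tax_rate)
    return round(total / diners, 2)
```

The AI service's job ends at extraction; from there on it's division.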
The test: does the service maintain state, reason about context, or make judgment calls? If no, it’s an API call, not an agent. Treat it like any other external service — validate inputs, handle errors, cache where possible.
The One Rule
If you’re writing a prompt that says “always output exactly this format” and you’re frustrated when it sometimes doesn’t — that’s a function, not an agent. Move the formatting to deterministic code and let the agent focus on the decision.
Across three production systems — 14 agents in QueryNLQ, 3 agents and 6 engines in the Marketing Engine, one LLM integration in Kaiwa — the boundary always falls in the same place. Agents own meaning: intent, creativity, judgment, perceptual evaluation. Deterministic code owns execution: validation, rendering, payments, scheduling, safety.
The most reliable parts of all three systems are the parts with zero AI in them.