AI Engineering

Agents for Meaning, Code for Pixels: Building a 9-Step Poster Pipeline

Lalit Sharma · 8 min read

The Poster That Broke the Demo

The first poster the AI Marketing Engine generated looked incredible. DALL-E rendered a moody restaurant interior, warm lighting, a headline in elegant serif type: “Summr Sale — 30% Off Appetisers.”

Two typos. Wrong brand colors. The logo was hallucinated into a smear of pixels in the corner. And the next poster for the same restaurant looked nothing like the first one.

This is the core problem with end-to-end image generation for marketing: generative models produce beautiful one-offs. Brands need boring consistency. A restaurant’s Tuesday lunch special must look like it came from the same designer as Monday’s event announcement — same fonts, same palette, same logo placement, same visual language. Every time.

The insight that reshaped the entire architecture: use AI to generate background images, but never let AI place a single pixel of text or layout. AI decides what to say and what mood the image should evoke. Deterministic code handles typography, composition, and rendering.

Agents for meaning. Code for pixels.

The Pipeline at a Glance

Poster pipeline architecture — 9 steps from user prompt to production poster, showing 3 LLM agents and 6 deterministic engines with a bounded revision loop

Three LLM agents. Six deterministic engines. Every intermediate result flows through strongly-typed Pydantic schemas — no free-form text between stages.

Decision 1: Archetypes Over Free-Form Layout

The single most important design choice: the Creative Director doesn’t invent layouts. It selects from 25+ tested archetypes — hero_image_overlay, split_image_text, quote_card, tip_list, event_promo, and more. Each archetype maps to a Jinja2 HTML template that guarantees professional structure.

This is constrained creativity. The LLM’s creative range is channeled through patterns that are known to work, not left to hallucinate CSS Grid layouts. The Creative Director decides which pattern fits — “this is a summer sale, hero image with text overlay” — and the template handles every pixel downstream.
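The archetype-to-template dispatch is deliberately boring. A sketch with Jinja2, using an in-memory loader and toy templates standing in for the real version-controlled HTML/CSS files:

```python
from jinja2 import Environment, DictLoader

# Toy stand-ins; the real templates are full HTML/CSS files under version control.
templates = {
    "hero_image_overlay.html": (
        "<div class='poster' style=\"background-image:url('{{ bg_url }}')\">"
        "<h1>{{ headline }}</h1></div>"
    ),
    "quote_card.html": "<blockquote>{{ headline }}</blockquote>",
}
env = Environment(loader=DictLoader(templates), autoescape=True)

def render_archetype(archetype: str, **params) -> str:
    """The Creative Director only picks `archetype`; the template
    owns every pixel downstream."""
    return env.get_template(f"{archetype}.html").render(**params)

html = render_archetype("quote_card", headline="Agents for meaning. Code for pixels.")
```

The LLM's entire layout authority is one enum value; everything after that is ordinary, testable template code.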

The constraint unlocked something unexpected: the system got more creative, not less. With layout guaranteed, the LLM could focus entirely on copy, tone, and image mood — the things it’s actually good at.

Decision 2: The Image Paradox

Here’s the tension I had to resolve: image generation doesn’t work for full posters (the typo problem), but AI-generated backgrounds are often exactly what a poster needs. A photorealistic food shot, a moody restaurant interior, an abstract gradient — these are where FLUX.2-pro excels.

The Image Strategist agent resolves this per-poster. It routes between four options:

  • FLUX.2-pro ($0.05) — photorealistic backgrounds via Azure-hosted FLUX
  • GPT-Image ($0.04) — fallback when FLUX is rate-limited
  • CSS Programmatic ($0.00) — deterministic gradients, mesh patterns, geometric shapes
  • None ($0.00) — text-only archetypes like quote cards

Every AI image prompt gets a suffix: “The image must not contain any text, words, letters…” — because text rendering is the deterministic pipeline’s job, not the image model’s.

The key insight: the same technology that fails at generating complete posters succeeds brilliantly when scoped to just backgrounds. The problem was never image generation — it was asking image models to also be typographers, brand managers, and layout engines.

Decision 3: Analyzing What You Can’t Predict

An AI-generated background is inherently unpredictable. FLUX might produce a bright sky in the top-left where you planned to place the headline, or a busy texture that makes text unreadable everywhere.

Two deterministic engines solve this:

Vision Analysis maps the image into a 3×3 grid using Pillow, measuring brightness (0-255) and busyness (variance-based edge density, 0-1) in each cell. It identifies safe zones where text can sit — regions where the busyness score falls below 0.35.
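A sketch of that grid analysis with Pillow. It approximates busyness as edge density via `FIND_EDGES`, scaled to 0-1; the article's variance-based metric may differ in detail:

```python
from PIL import Image, ImageFilter, ImageStat

def analyze_grid(img: Image.Image, busy_threshold: float = 0.35) -> list[dict]:
    """Score each cell of a 3x3 grid for brightness and busyness,
    flagging safe zones where text can sit."""
    gray = img.convert("L")
    edges = gray.filter(ImageFilter.FIND_EDGES)
    w, h = gray.size
    cells = []
    for row in range(3):
        for col in range(3):
            box = (col * w // 3, row * h // 3,
                   (col + 1) * w // 3, (row + 1) * h // 3)
            brightness = ImageStat.Stat(gray.crop(box)).mean[0]       # 0-255
            busyness = ImageStat.Stat(edges.crop(box)).mean[0] / 255  # 0-1
            cells.append({
                "cell": (row, col),
                "brightness": brightness,
                "busyness": busyness,
                "safe": busyness < busy_threshold,  # candidate text zone
            })
    return cells

cells = analyze_grid(Image.new("RGB", (300, 300), "navy"))
```

A flat navy fill scores near-zero busyness everywhere, so all nine cells are safe; a busy food photo would typically leave only a few.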

Overlay Policy uses those measurements to choose a readability strategy:

  • DIRECT (busy ≤ 0.30) — text shadow only, image shows through fully
  • SCRIM (busy ≤ 0.60) — semi-transparent gradient between image and text
  • CARD (busy > 0.60) — opaque panel, readability guaranteed regardless of background

Pure threshold logic. No LLM. The Creative Director can express a preference for direct text, but if the image analysis says the background is too complex, the engine overrides. This is deliberate — creative intent is advisory, readability is mandatory.
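The whole policy, including the advisory-preference override, is a handful of lines. A sketch under the thresholds listed above, where measured busyness can escalate the strategy but never relax it:

```python
def choose_overlay(busyness: float, preference: str = "DIRECT") -> str:
    """Pure threshold logic, no LLM. Creative preference is advisory;
    the measured busyness wins whenever it demands more protection."""
    if busyness <= 0.30:
        measured = "DIRECT"   # text shadow only
    elif busyness <= 0.60:
        measured = "SCRIM"    # semi-transparent gradient
    else:
        measured = "CARD"     # opaque panel
    order = ["DIRECT", "SCRIM", "CARD"]
    # Take whichever strategy is more protective of readability.
    return max(preference, measured, key=order.index)

strategy = choose_overlay(0.72, preference="DIRECT")  # → "CARD"
```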

Decision 4: Puppeteer Over Pillow

The original prototype used Pillow’s ImageDraw.text() for rendering. It worked for simple layouts. Then I needed kerning control. Then letter-spacing. Then responsive font sizing. Then CSS Grid for split-panel layouts. Then clamp() for type scales that work across Instagram (1080×1080) and LinkedIn (1200×627).

Pillow can’t do any of that.

The switch to Puppeteer — a Node.js microservice that renders HTML to PNG — gave the pipeline the entire web typography stack. Jinja2 templates are version-controlled and testable. The renderer is a pure function: same HTML in, identical PNG out, every time. At 2× DPI for retina quality.

This also made the archetype system practical. Writing 25+ layout templates in HTML/CSS is routine web development. Writing them as Pillow drawing sequences would have been a nightmare.
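The contract between Python and the render service can stay this small. A sketch of the request a caller might build, with assumed field names (`deviceScaleFactor` mirrors Puppeteer's own viewport option; the real service contract is not shown in the article):

```python
import json

def make_render_request(html: str, width: int, height: int, scale: int = 2) -> bytes:
    """Build the JSON body for a hypothetical Puppeteer render endpoint.
    `scale=2` requests 2x DPI for retina-quality PNGs."""
    return json.dumps({
        "html": html,
        "viewport": {"width": width, "height": height, "deviceScaleFactor": scale},
        "format": "png",
    }).encode()

# Same HTML in, identical request out: rendering stays a pure function
# of its inputs, so posters are reproducible.
body = make_render_request("<h1>Summer Sale</h1>", 1080, 1080)
```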

Decision 5: Vision-Based QA With a Restricted Vocabulary

The Quality Auditor is the most unusual agent in the pipeline. It uses GPT-5.4 with vision to evaluate the actual rendered PNG — not the template parameters, not the intermediate data, the real output image that a user would see.

But here’s the constraint that makes it production-safe: the auditor can only return four fix types.

  • APPROVE — poster passes all checks
  • REWRITE_HEADLINE / REWRITE_SUBHEADLINE — text clarity issue, includes suggested replacement
  • REGENERATE_IMAGE — background doesn’t match the brief
  • ESCALATE_OVERLAY — text contrast too poor, escalate from DIRECT to SCRIM or CARD

No “change the font size to 48px.” No “move the logo 20 pixels left.” No hallucinated CSS. The auditor identifies what’s wrong using vision. The pipeline’s deterministic engines decide how to fix it.

If the auditor rejects, the pipeline re-enters at the appropriate step — step 4 for image issues, step 5 for overlay escalation — with a maximum of two revision cycles. After that, the best-effort result ships. The pipeline converges or returns its best attempt. It never spins.
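The restricted vocabulary plus the bounded loop reduce to a small, provably terminating control flow. A sketch where `render` and `audit` are stand-ins for the pipeline's real steps (re-entry at the specific step per fix type is elided):

```python
from enum import Enum

class Fix(str, Enum):
    """The auditor's entire vocabulary: four fix types plus approval."""
    APPROVE = "approve"
    REWRITE_HEADLINE = "rewrite_headline"
    REWRITE_SUBHEADLINE = "rewrite_subheadline"
    REGENERATE_IMAGE = "regenerate_image"
    ESCALATE_OVERLAY = "escalate_overlay"

MAX_REVISIONS = 2

def run_with_qa(render, audit):
    """Converge on APPROVE or ship the best attempt after two revision
    cycles. The loop cannot spin: its bound is a constant."""
    poster = render()
    for _ in range(MAX_REVISIONS):
        if audit(poster) is Fix.APPROVE:
            return poster
        poster = render()  # re-enter at the step the verdict maps to
    return poster  # best-effort result ships
```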

The Economics

Cost tracking isn’t overhead — it drives model selection.

A poster with an AI-generated FLUX background: $0.06-0.10 total (image generation + three LLM agent calls). A CSS-only poster (quote card, gradient text): $0.01-0.03. The Image Strategist factors this in — it picks the cheapest model that satisfies the creative brief.

If a revision loop triggers image regeneration, the pipeline first tries FLUX Kontext (an image editing model at $0.05) to fix the existing image before falling back to full regeneration. Preserves what worked, costs the same.
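The edit-first fallback is a one-branch policy. A sketch with stand-in callables, assuming a failed Kontext edit surfaces as an exception:

```python
def repair_image(edit_with_kontext, regenerate):
    """On REGENERATE_IMAGE, try the $0.05 Kontext edit first and fall back
    to full regeneration only if the edit fails. Both callables are
    stand-ins for the real model clients."""
    try:
        return edit_with_kontext()   # preserves the composition that worked
    except Exception:
        return regenerate()          # same $0.05, fresh image
```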

What Held Up, What Didn’t

Held up: The agent/engine boundary. Every time I’ve needed to add a new archetype, fix a rendering bug, or tune typography, the change has been in deterministic code with tests. The LLM agents haven’t needed prompt changes in weeks.

Held up: Bounded revision loops. Two cycles with four possible actions converges quickly. I was initially worried about rejection rates — in practice, the pipeline approves on the first pass more often than not.

Didn’t hold up: My initial assumption that CSS programmatic backgrounds would be rare. Restaurants post a lot of quote cards, tip lists, and event announcements. Nearly half the posters use CSS backgrounds, no image model needed at all.

Surprised me: Multi-platform re-rendering was almost free. The same creative brief generates posters for Instagram, Facebook, LinkedIn, and Stories — just re-run steps 4-8 at different dimensions. The typography engine adjusts font sizes per platform. The vision analyzer re-evaluates safe zones for different aspect ratios. Same creative intent, platform-native execution.

The AI Marketing Engine currently powers Kaiwa’s social media presence. The principle — agents for meaning, code for pixels — has held up through every iteration. The most reliable part of the system is the part with zero AI in it.