
The AI Application Roadmap — From PoC to Production

The architecture, evals, monitoring, and process disciplines that take an AI feature from "works in a demo" to "survives real users at scale."

DjangoZen Team, May 09, 2026

Why most AI features die between demo and production

The first version of an AI feature usually works in a demo. The version that actually serves real users for months without on-call pages — that's a different beast. The gap between them is the work of this tutorial.

This is the playbook for crossing it.

Phase 1 — Proof of Concept (week 1)

Goal: prove the LLM can do the task at all.

What to build:

  • A single Django management command or Jupyter notebook
  • Hardcoded prompt, hardcoded model, hardcoded inputs
  • Just enough output handling to see if the result is plausible
  • 10–20 hand-picked examples

What NOT to build yet:

  • Production endpoints
  • Streaming
  • Caching
  • Authentication
  • Error handling
  • A UI

What to decide at the end of this phase:

  • Is the LLM capable of this task at all? If results on your hand-picked examples are clearly bad, prompt engineering alone is unlikely to close the gap. Pivot or kill.
  • Which model tier do you actually need? Run the same examples through Haiku, Sonnet, and Opus. Pick the cheapest one that meets your quality bar.

If quality is acceptable, advance.
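
The tier decision above can be made mechanical. The following is a minimal sketch, assuming you have already hand-graded each tier on the same examples (1 = acceptable, 0 = not) and have a rough relative cost per tier. The helper name `pick_cheapest_tier`, the scores, and the cost ratios are all illustrative, not real pricing or part of any SDK.

```python
from typing import Optional


def pick_cheapest_tier(
    scores: dict[str, list[int]],
    relative_cost: dict[str, float],
    min_pass_rate: float,
) -> Optional[str]:
    """Return the cheapest tier whose pass rate meets the bar, or None."""
    qualifying = []
    for tier, results in scores.items():
        if not results:
            continue  # no examples graded for this tier
        pass_rate = sum(results) / len(results)
        if pass_rate >= min_pass_rate:
            qualifying.append(tier)
    if not qualifying:
        return None  # nothing is good enough: pivot or kill
    return min(qualifying, key=lambda tier: relative_cost[tier])


# Hypothetical hand-graded results on 10 examples per tier.
scores = {
    "haiku": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "sonnet": [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    "opus": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
}
relative_cost = {"haiku": 1.0, "sonnet": 3.0, "opus": 15.0}  # made-up ratios

chosen = pick_cheapest_tier(scores, relative_cost, min_pass_rate=0.85)
print(chosen)
```

Deciding `min_pass_rate` before looking at the results keeps the choice honest; otherwise it is tempting to lower the bar until the cheapest tier passes.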

Phase 2 — Working integration (week 2–3)

Goal: plug the feature into the Django app and let internal users hit it.

What to build:

  • A Django view + template (or a /admin/ page)
  • Reasonable error handling (covered in tutorial 4)
  • Streaming if responses run longer than a few sentences (tutorial 8)
  • Logging of every call (input, output, latency, tokens)
  • Feature flag — start it disabled for everyone except staff
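
# settings.py
from decouple import config  # python-decouple; assumed from the config(...) call

AI_FEATURE_X_ENABLED = config("AI_FEATURE_X_ENABLED", default=False, cast=bool)
AI_FEATURE_X_STAFF_ONLY = config("AI_FEATURE_X_STAFF_ONLY", default=True, cast=bool)

# views.py
from django.conf import settings
from django.contrib.auth.decorators import login_required
from django.http import Http404


@login_required
def feature_x(request):
    # Return 404 rather than 403 so the feature stays invisible while the flag is off.
    if not settings.AI_FEATURE_X_ENABLED:
        raise Http404()
    # During the internal phase, non-staff users also get a 404.
    if settings.AI_FEATURE_X_STAFF_ONLY and not request.user.is_staff:
        raise Http404()
    ...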
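
For the logging bullet above, one possible shape is a thin wrapper that times the call and records every field in one place. This is a sketch under assumptions: `ask_fn` stands in for whatever function actually calls the model, and it is assumed to return the text plus input and output token counts. `CallRecord`, `logged_call`, and `fake_ask` are hypothetical names.

```python
import json
import logging
import time
from dataclasses import asdict, dataclass
from typing import Callable

logger = logging.getLogger("ai_calls")


@dataclass
class CallRecord:
    feature: str
    input_text: str
    output_text: str
    latency_ms: float
    input_tokens: int
    output_tokens: int


def logged_call(
    feature: str,
    input_text: str,
    ask_fn: Callable[[str], tuple[str, int, int]],
) -> CallRecord:
    """Call the model via ask_fn and log input, output, latency, and tokens."""
    started = time.perf_counter()
    output_text, input_tokens, output_tokens = ask_fn(input_text)
    latency_ms = (time.perf_counter() - started) * 1000
    record = CallRecord(
        feature, input_text, output_text, latency_ms, input_tokens, output_tokens
    )
    # One JSON object per log line keeps the log easy to query later.
    logger.info(json.dumps(asdict(record)))
    return record


def fake_ask(text: str) -> tuple[str, int, int]:
    """Stand-in for a real model call (an assumption for this example)."""
    return f"echo: {text}", len(text.split()), 2


record = logged_call("feature_x", "hello there", fake_ask)
```

If inputs may contain personal information, consider redacting or hashing `input_text` before it reaches the log.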

What to decide at the end of this phase:

  • What does it actually feel like to use? Have 3–5 internal users use it daily for a week. Listen to complaints, watch their screens.
  • Where does it fail catastrophically? Document every failure case. These are your eval set seeds.

Phase 3 — Evals and quality gates (week 3–4)

Goal: make it impossible to ship a prompt change without knowing whether quality regressed.

What to build:

  • A test set of 50–200 (input, expected output) pairs covering: typical cases, edge cases, adversarial inputs, known failure modes
  • An eval harness that runs the prompt against the test set and scores results
  • A scoring strategy:
      • For factual answers: regex/keyword match against ground truth
      • For classifications: exact match
      • For free-form text: LLM-as-judge (use Claude to score each output 1–5 against criteria)
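
# evals/test_set.json
[
    {"input": "How do I install QuizCraft?", "expected_keyword": "venv"},
    {"input": "Refund policy?", "expected_keyword": "30 days"},
    ...
]

# evals/run_eval.py
import json
from pathlib import Path

# Assumes the test set sits next to this script.
TEST_SET_PATH = Path(__file__).parent / "test_set.json"


def load_test_set():
    return json.loads(TEST_SET_PATH.read_text(encoding="utf-8"))


def evaluate():
    # ask() is assumed to be your existing helper that sends the input
    # to the model and returns the response text.
    test_set = load_test_set()
    results = []
    for example in test_set:
        actual = ask(example["input"])
        # Case-insensitive keyword match against the expected answer.
        passed = example["expected_keyword"].lower() in actual.lower()
        results.append({"input": example["input"], "passed": passed, "actual": actual})

    if not results:
        raise ValueError("test set is empty")
    pass_rate = sum(r["passed"] for r in results) / len(results)
    print(f"Pass rate: {pass_rate:.1%}")
    return pass_rate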
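
The three scoring strategies listed above could be sketched as separate functions. This is illustrative only: `judge_fn` is a hypothetical callable that sends a prompt to a judge model and returns its text reply, the judge prompt wording is an assumption, and the exact-match scorer normalizes case and whitespace, which you may or may not want.

```python
from typing import Callable


def score_keyword(actual: str, expected_keyword: str) -> bool:
    """Factual answers: case-insensitive keyword match."""
    return expected_keyword.lower() in actual.lower()


def score_exact(actual: str, expected_label: str) -> bool:
    """Classifications: exact match after trimming whitespace and lowercasing."""
    return actual.strip().lower() == expected_label.strip().lower()


def score_with_judge(
    actual: str,
    criteria: str,
    judge_fn: Callable[[str], str],
    min_score: int = 4,
) -> bool:
    """Free-form text: ask a judge model for a 1-5 score and apply a threshold."""
    judge_prompt = (
        "Score the response from 1 to 5 against the criteria. "
        "Reply with the number only.\n"
        f"Criteria: {criteria}\nResponse: {actual}"
    )
    reply = judge_fn(judge_prompt).strip()
    if reply not in {"1", "2", "3", "4", "5"}:
        return False  # treat an unparseable verdict as a failure
    return int(reply) >= min_score
```

Treating an unparseable judge reply as a failure is a conservative choice; it keeps a misbehaving judge from silently inflating the pass rate.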

What to decide at the end of this phase:

  • Where's the floor? What pass rate is acceptable for production launch? Set the bar before you start tuning.
  • When does the eval run? On every prompt change. Tie it to CI if possible.
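
One way to turn the pass-rate floor into a CI gate is a function that maps the eval result to a process exit code. This is a sketch, not a prescribed setup: the `gate` name, the optional `baseline` comparison, and the default `max_drop` are assumptions added for illustration.

```python
from typing import Optional


def gate(
    pass_rate: float,
    floor: float,
    baseline: Optional[float] = None,
    max_drop: float = 0.05,
) -> int:
    """Return an exit code: 0 if the eval is acceptable, 1 otherwise."""
    if pass_rate < floor:
        print(f"FAIL: pass rate {pass_rate:.1%} is below floor {floor:.1%}")
        return 1
    # Optionally also fail on a large drop from the last accepted run.
    if baseline is not None and baseline - pass_rate > max_drop:
        print(f"FAIL: pass rate fell from {baseline:.1%} to {pass_rate:.1%}")
        return 1
    print(f"OK: pass rate {pass_rate:.1%}")
    return 0
```

In a CI job this might be wired up as `sys.exit(gate(evaluate(), floor=0.90))`, where 0.90 is a placeholder for whatever floor you agreed on before tuning.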

Phase 4 — Limited public beta (week 4–6)

Goal: see what real users do that you didn't anticipate.

What to build:

  • Open the feature flag to a small percentage of users (5–10%) or to opted-in beta users
  • In-app feedback widget (thumbs up/down per response)
  • Per-user rate limits to prevent abuse
  • A dashboard tracking: response satisfaction, latency, error rate, cost per user
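
A percentage rollout like the one in the first bullet is often done by hashing the user ID into a stable bucket, so the same user always gets the same answer and raising the percentage only adds users. The sketch below assumes integer user IDs; `in_rollout` and the feature name are hypothetical.

```python
import hashlib


def in_rollout(user_id: int, feature: str, percent: int) -> bool:
    """Deterministically place a user in or out of a percentage rollout."""
    # Including the feature name gives each feature its own set of early users.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return bucket < percent


# The same user gets the same answer on every request.
print(in_rollout(42, "feature_x", 10) == in_rollout(42, "feature_x", 10))
```

Because a user's bucket never changes, moving from 5% to 10% keeps everyone who was already in the beta.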

What to expect:

  • Real users will type things you never imagined. Single-character prompts. Prompts in languages you don't speak. Prompts in foul language. Adversarial prompts trying to break the system. Prompts with personal information you shouldn't store.
  • Edge cases will not match your evals. Add the new failure modes to the eval set.
  • Costs may surprise you. A single user can make thousands of calls if your rate limits are wrong.
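
To make the rate-limit point concrete, here is a minimal in-memory sliding-window limiter. It is a sketch only: the class name and limits are illustrative, and because the state lives in one process, a multi-worker deployment would need a shared store instead.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_calls per user within any window_seconds span."""

    def __init__(self, max_calls: int, window_seconds: float) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._calls: dict[int, deque] = defaultdict(deque)

    def allow(self, user_id: int, now: Optional[float] = None) -> bool:
        # `now` is injectable so the limiter can be tested without sleeping.
        now = time.monotonic() if now is None else now
        calls = self._calls[user_id]
        # Drop timestamps that have fallen out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True


limiter = SlidingWindowLimiter(max_calls=2, window_seconds=60)
print(limiter.allow(user_id=1, now=0.0))
```

Rejected calls are not recorded, so a user who keeps hammering the endpoint does not extend their own lockout.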

What to decide at the end of this phase:

  • Is the feature actually adding value? Look at retention, completion rate, NPS — not just thumbs up. People rate generously; they vote with their feet.
  • Are there abuse vectors? Patch them.

Phase 5 — General availability (week 6+)

Goal: make it boring and reliable.

What to build:

  • Full rollout (100% of users)
  • Production monitoring: Sentry on all errors, Grafana on cost/latency/usage
  • Per-feature kill switch — one config flip can disable the feature instantly
  • Prompt versioning — every prompt change gets a version number, the version is logged with every call
  • Regular regression evals (weekly cron)
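
# Versioned prompts
PROMPT_VERSION = "summarize_v3"
PROMPTS = {
    "summarize_v1": "Summarize this:",
    "summarize_v2": "Provide a 2-sentence summary:",
    "summarize_v3": "Provide a concise 2-sentence summary preserving key facts:",
}

# Send the versioned instruction together with the text to summarize
# (`text` is a placeholder for the user's input).
response = ask(f"{PROMPTS[PROMPT_VERSION]}\n\n{text}")
# Log the version alongside the other per-call fields (input, output, latency, tokens).
log_call(prompt_version=PROMPT_VERSION)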

When you change a prompt and quality drops, you can correlate the regression with the version change in your logs.
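
The correlation described above can be as simple as grouping logged calls by prompt version. The sketch below assumes each log entry is a dict with a `prompt_version` key and a boolean `passed` key (for example, a thumbs-up or an eval result); both field names and the sample data are hypothetical.

```python
from collections import defaultdict


def pass_rate_by_version(calls: list[dict]) -> dict[str, float]:
    """Group logged calls by prompt version and compute each version's pass rate."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for call in calls:
        version = call["prompt_version"]
        totals[version] += 1
        if call["passed"]:
            passes[version] += 1
    return {version: passes[version] / totals[version] for version in totals}


# Made-up log entries for illustration.
calls = [
    {"prompt_version": "summarize_v2", "passed": True},
    {"prompt_version": "summarize_v2", "passed": True},
    {"prompt_version": "summarize_v2", "passed": True},
    {"prompt_version": "summarize_v2", "passed": False},
    {"prompt_version": "summarize_v3", "passed": True},
    {"prompt_version": "summarize_v3", "passed": False},
]
rates = pass_rate_by_version(calls)
print(rates)
```

With only a handful of calls per version the rates are noisy, so compare versions over a meaningful sample before rolling a prompt back.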

The architecture pattern

By Phase 5, your AI code should look something like this:

┌─ View layer ───────────────────────────────────┐
│  - Authentication, rate limit, feature flag    │
│  - Calls service layer, returns response       │
└────────────────────────────────────────────────┘
                        ↓
┌─ Service layer ────────────────────────────────┐
│  - Routes to right model (router)              │
│  - Constructs prompt from versioned templates  │
│  - Calls inference layer                       │
│  - Validates output (Pydantic)                 │
│  - Logs call with all metadata                 │
└────────────────────────────────────────────────┘
                        ↓
┌─ Inference layer ──────────────────────────────┐
│  - Wraps Anthropic SDK                         │
│  - Handles retries, errors, timeouts           │
│  - Manages prompt caching                      │
└────────────────────────────────────────────────┘

Three clean layers, each testable. Don't put client.messages.create() in your view.
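
A framework-free sketch of the three layers, to show where each responsibility sits. Everything here is illustrative: `send` stands in for the real SDK call, the view is a plain function rather than a Django view, output validation is a simple emptiness check instead of Pydantic, and retries have no backoff.

```python
from typing import Callable, Optional

PROMPTS = {
    "summarize_v3": "Provide a concise 2-sentence summary preserving key facts:",
}


class InferenceLayer:
    """Wraps the model client; retries live here, not in views."""

    def __init__(self, send: Callable[[str], str], max_attempts: int = 3) -> None:
        self._send = send
        self._max_attempts = max_attempts

    def complete(self, prompt: str) -> str:
        last_error: Optional[Exception] = None
        for _ in range(self._max_attempts):
            try:
                return self._send(prompt)
            except ConnectionError as exc:  # retry only transient failures
                last_error = exc
        raise RuntimeError("model unavailable") from last_error


class SummaryService:
    """Builds the prompt from a versioned template and validates the output."""

    def __init__(self, inference: InferenceLayer, prompt_version: str = "summarize_v3") -> None:
        self._inference = inference
        self.prompt_version = prompt_version

    def summarize(self, text: str) -> dict:
        prompt = f"{PROMPTS[self.prompt_version]}\n\n{text}"
        output = self._inference.complete(prompt).strip()
        if not output:
            raise ValueError("empty model output")
        return {"summary": output, "prompt_version": self.prompt_version}


def summarize_view(user_is_staff: bool, text: str, service: SummaryService) -> tuple[int, dict]:
    """Stand-in for a view: gate access, then delegate to the service."""
    if not user_is_staff:
        return 404, {}
    return 200, service.summarize(text)
```

Because the inference layer takes `send` as a parameter, the service and view can be unit-tested with a fake that never touches the network.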

What to monitor

The metrics that matter:

| Metric                              | Alert when        |
|-------------------------------------|-------------------|
| p50 / p95 latency                   | p95 > 10 seconds  |
| Error rate                          | > 2%              |
| Cost per user per day               | > 2x baseline     |
| Eval pass rate (weekly)             | drops > 10 points |
| User feedback (thumbs down rate)    | > 20%             |
| Hallucination rate (sample-audited) | > 5%              |

Not all of these are easy to measure automatically. The hallucination rate especially needs human spot-checks. Build a weekly review where someone reads 20 random conversations and rates them.
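
The thresholds and the weekly sample can both be expressed in a few lines. This sketch assumes the metrics have already been aggregated into a dict; the key names, `check_alerts`, and `sample_for_review` are hypothetical, and the threshold values are the ones listed above.

```python
import random
from typing import Optional


def check_alerts(metrics: dict[str, float], baseline_cost: float) -> list[str]:
    """Return the names of the metrics that crossed their alert threshold."""
    alerts = []
    if metrics["p95_latency_s"] > 10:
        alerts.append("p95 latency")
    if metrics["error_rate"] > 0.02:
        alerts.append("error rate")
    if metrics["cost_per_user_day"] > 2 * baseline_cost:
        alerts.append("cost per user per day")
    if metrics["eval_pass_rate_drop_points"] > 10:
        alerts.append("eval pass rate")
    if metrics["thumbs_down_rate"] > 0.20:
        alerts.append("thumbs down rate")
    if metrics["hallucination_rate"] > 0.05:
        alerts.append("hallucination rate")
    return alerts


def sample_for_review(
    conversation_ids: list[int], k: int = 20, seed: Optional[int] = None
) -> list[int]:
    """Pick up to k distinct conversations at random for the weekly human review."""
    rng = random.Random(seed)
    return rng.sample(conversation_ids, min(k, len(conversation_ids)))
```

Sampling at random rather than reading the most recent conversations avoids biasing the audit toward one day or one heavy user.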

Failure mode planning

Before launch, write down:

  1. What happens if Anthropic is down? (Your AI feature should fail gracefully — return a "service unavailable" message, not a 500.)
  2. What happens if costs spike? (Per-user limits + global circuit breaker.)
  3. What happens if a user submits something that gets the LLM to behave badly? (Output filtering + audit log.)
  4. What happens if the LLM returns unsafe content? (Output moderation pass + fallback.)

Have answers before you launch, not during the incident.
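
For the first two questions, a small circuit breaker covers both "the provider is down" and "stop calling it for a while". The version below is a simplified sketch: the class, the fallback text, and the thresholds are assumptions, it catches every exception where real code would catch the SDK's specific error types, and after the cooldown it simply resets rather than implementing a strict half-open state.

```python
import time
from typing import Callable, Optional

FALLBACK_MESSAGE = "The assistant is temporarily unavailable. Please try again later."


class CircuitBreaker:
    """Stop calling the model for cooldown_seconds after repeated failures."""

    def __init__(self, max_failures: int, cooldown_seconds: float) -> None:
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], str], now: Optional[float] = None) -> str:
        # `now` is injectable so the breaker can be tested without sleeping.
        now = time.monotonic() if now is None else now
        if self._opened_at is not None:
            if now - self._opened_at < self.cooldown_seconds:
                return FALLBACK_MESSAGE  # circuit open: skip the provider entirely
            self._opened_at = None  # cooldown over: start counting failures afresh
            self._failures = 0
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = now
            return FALLBACK_MESSAGE  # fail gracefully instead of raising a 500
        self._failures = 0
        return result
```

The caller always receives a string it can show to the user, so a provider outage degrades the feature instead of breaking the page.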

What kills AI features

Common causes of failed AI features:

  • No evals — you don't know if it's getting better or worse
  • No monitoring — you don't know what's happening in production
  • Premature scale — you launched to all users before validating quality
  • No kill switch — when something goes wrong you can't stop it
  • No human review — high-stakes decisions made entirely by LLM with no oversight
  • Cost surprises — your launch worked, then a viral moment 10x'd traffic and your bill

Each of these is preventable with the practices above.

The gap between demo and production

AI features are famously easy to prototype and famously hard to productionize, and understanding this gap is the heart of the roadmap. A proof of concept that works on a few hand-picked inputs can feel almost done, but production demands reliability across messy real inputs, handling of failures and edge cases, cost control, monitoring, and safety — none of which the demo addressed. Many AI projects stall precisely because teams underestimate this gap, treating the impressive demo as nearly finished when most of the real engineering still lies ahead. Recognizing that the demo is the start, not the end, sets realistic expectations for what it takes to ship.

From proof of concept to pilot

The first step beyond a demo is a pilot: putting the feature in front of a limited set of real users with real data, which immediately surfaces the gap between curated test cases and reality. Real inputs are messier, users do unexpected things, and quality problems that were invisible in testing appear. The pilot is where you learn what actually needs hardening, measure real quality and cost, and validate that the feature delivers value. Treating the pilot as a deliberate learning phase — instrumented, observed, and limited in blast radius — is what turns a promising prototype into knowledge about what production will really require.

Hardening for production

Moving from pilot to production means addressing everything the demo ignored: handling errors and model failures gracefully, validating and constraining output, controlling cost, monitoring quality and spend, guarding against misuse and prompt injection, and ensuring the feature degrades sensibly when the model is unavailable. This hardening work is the bulk of the real engineering in an AI feature, and it is what makes the difference between something that impresses in a meeting and something that holds up under real traffic. Budgeting for this phase realistically — rather than assuming a working pilot is nearly shippable — is essential to actually reaching production.

Why AI features fail to reach production

AI features fail to reach or survive production for recognizable reasons: unreliable quality on real inputs, runaway or unjustified cost, lack of monitoring so problems go unnoticed, safety and misuse issues, and solving a problem that did not really need AI. Many of these are avoidable with the right roadmap — piloting to surface quality issues early, controlling cost deliberately, monitoring from the start, and honestly assessing whether AI is the right tool. Knowing the common failure modes in advance lets you steer around them, which is much of what separates AI projects that deliver lasting value from the many that produce an exciting demo and then quietly stall.

Summary

The roadmap from PoC to production is not about writing better prompts. It's about:

  1. Hand-validating capability before building anything
  2. Internal-first integration with logging
  3. Eval-driven quality gates
  4. Staged rollout with feedback loops
  5. Production monitoring and kill switches
  6. Architectural separation of concerns

Most demos can be built in a weekend. Most production AI features take 4–8 weeks. The difference is this checklist. Skip it and you'll either ship something flaky or burn money silently.

Take it seriously and you'll ship features that earn user trust and survive scale. Which is the whole point.