AI & LLMs Intermediate

The AI Application Roadmap — From PoC to Production

The architecture, evals, monitoring, and process disciplines that take an AI feature from "works in a demo" to "survives real users at scale."

DjangoZen Team May 09, 2026 14 min read

Why most AI features die between demo and production

The first version of an AI feature usually works in a demo. The version that actually serves real users for months without on-call pages — that's a different beast. The gap between them is the work of this tutorial.

This is the playbook for crossing it.

Phase 1 — Proof of Concept (week 1)

Goal: prove the LLM can do the task at all.

What to build:

  • A single Django management command or Jupyter notebook (a command sketch follows this list)
  • Hardcoded prompt, hardcoded model, hardcoded inputs
  • Just enough output handling to see if the result is plausible
  • 10–20 hand-picked examples
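A minimal sketch of that management command, assuming the anthropic Python SDK; the app name, example inputs, and model ID are placeholders:

# myapp/management/commands/poc_feature_x.py (throwaway PoC, not production code)
import anthropic

from django.core.management.base import BaseCommand

# Placeholder inputs: paste 10-20 real examples from your product here.
HAND_PICKED_EXAMPLES = [
    "How do I install QuizCraft?",
    "Refund policy?",
]

PROMPT = "Answer the customer question below in two sentences:\n\n{question}"

class Command(BaseCommand):
    help = "Run hand-picked examples through one hardcoded prompt and eyeball the output."

    def handle(self, *args, **options):
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        for question in HAND_PICKED_EXAMPLES:
            response = client.messages.create(
                model="<model-id>",  # placeholder: whichever tier you're testing
                max_tokens=300,
                messages=[{"role": "user", "content": PROMPT.format(question=question)}],
            )
            self.stdout.write(f"Q: {question}")
            self.stdout.write(f"A: {response.content[0].text}\n")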

What NOT to build yet:

  • Production endpoints
  • Streaming
  • Caching
  • Authentication
  • Error handling
  • A UI

What to decide at the end of this phase:

  • Is the LLM capable of this task at all? If results on your hand-picked examples are bad, no amount of prompt engineering will save you. Pivot or kill.
  • Which model tier do you actually need? Run the same examples through Haiku, Sonnet, Opus (a comparison sketch follows this list). Pick the cheapest one that meets quality.
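A quick comparison sketch, reusing the same hand-picked examples; the model IDs are placeholders for the current Haiku, Sonnet, and Opus names:

# compare_model_tiers.py (throwaway script for the Phase 1 decision)
import anthropic

# Placeholders: substitute the current Haiku / Sonnet / Opus model IDs.
MODELS = ["<haiku-model-id>", "<sonnet-model-id>", "<opus-model-id>"]

EXAMPLES = [
    "How do I install QuizCraft?",
    "Refund policy?",
]

client = anthropic.Anthropic()

for model in MODELS:
    print(f"=== {model} ===")
    for question in EXAMPLES:
        response = client.messages.create(
            model=model,
            max_tokens=300,
            messages=[{"role": "user", "content": question}],
        )
        print(f"Q: {question}")
        print(f"A: {response.content[0].text}")
        print(f"   tokens in/out: {response.usage.input_tokens}/{response.usage.output_tokens}")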

If quality is acceptable, advance.

Phase 2 — Working integration (week 2–3)

Goal: plug the feature into the Django app and let internal users hit it.

What to build:

  • A Django view + template (or a /admin/ page)
  • Reasonable error handling (covered in tutorial 4)
  • Streaming if response > a few sentences (tutorial 8)
  • Logging of every call (input, output, latency, tokens); a logging sketch follows the view code below
  • Feature flag — start it disabled for everyone except staff
# settings.py
from decouple import config  # assuming python-decouple for env-driven settings

AI_FEATURE_X_ENABLED = config("AI_FEATURE_X_ENABLED", default=False, cast=bool)
AI_FEATURE_X_STAFF_ONLY = config("AI_FEATURE_X_STAFF_ONLY", default=True, cast=bool)

# views.py
from django.conf import settings
from django.contrib.auth.decorators import login_required
from django.http import Http404

@login_required
def feature_x(request):
    if not settings.AI_FEATURE_X_ENABLED:
        raise Http404()
    if settings.AI_FEATURE_X_STAFF_ONLY and not request.user.is_staff:
        raise Http404()
    ...
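For the logging bullet above, one option is a dedicated model written to on every call. A minimal sketch; AICallLog and logged_ask are hypothetical names, not something from an earlier tutorial:

# models.py
from django.conf import settings
from django.db import models

class AICallLog(models.Model):
    created_at = models.DateTimeField(auto_now_add=True)
    user = models.ForeignKey(settings.AUTH_USER_MODEL, null=True, on_delete=models.SET_NULL)
    prompt_version = models.CharField(max_length=64, blank=True)
    input_text = models.TextField()
    output_text = models.TextField()
    latency_ms = models.IntegerField()
    input_tokens = models.IntegerField()
    output_tokens = models.IntegerField()

# services.py
import time

from .models import AICallLog

def logged_ask(user, input_text, ask_fn, prompt_version=""):
    """Wrap an inference call so every request/response pair is recorded."""
    start = time.monotonic()
    response = ask_fn(input_text)  # ask_fn returns the SDK Message object
    AICallLog.objects.create(
        user=user,
        prompt_version=prompt_version,
        input_text=input_text,
        output_text=response.content[0].text,
        latency_ms=int((time.monotonic() - start) * 1000),
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
    )
    return response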

What to decide at the end of this phase:

  • What does it actually feel like to use? Have 3–5 internal users use it daily for a week. Listen to complaints, watch their screens.
  • Where does it fail catastrophically? Document every failure case. These are your eval set seeds.

Phase 3 — Evals and quality gates (week 3–4)

Goal: make it impossible to ship a prompt change without knowing whether quality regressed.

What to build:

  • A test set of 50–200 (input, expected output) pairs covering: typical cases, edge cases, adversarial inputs, known failure modes
  • An eval harness that runs the prompt against the test set and scores results
  • A scoring strategy:
      • For factual answers: regex/keyword match against ground truth
      • For classifications: exact match
      • For free-form text: LLM-as-judge (use Claude to score each output 1–5 against criteria; a judge sketch follows the harness below)
# evals/test_set.json
[
    {"input": "How do I install QuizCraft?", "expected_keyword": "venv"},
    {"input": "Refund policy?", "expected_keyword": "30 days"},
    ...
]

# evals/run_eval.py
import json

def load_test_set(path="evals/test_set.json"):
    with open(path) as f:
        return json.load(f)

def evaluate():
    test_set = load_test_set()
    results = []
    for example in test_set:
        actual = ask(example["input"])  # ask() = the feature's own inference call
        passed = example["expected_keyword"].lower() in actual.lower()
        results.append({"input": example["input"], "passed": passed, "actual": actual})

    pass_rate = sum(r["passed"] for r in results) / len(results)
    print(f"Pass rate: {pass_rate:.1%}")
    return pass_rate
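For free-form outputs where keyword matching is too crude, an LLM-as-judge scorer might look like this sketch; the rubric wording and the judge model ID are assumptions:

# evals/judge.py
import anthropic

JUDGE_PROMPT = """Rate the answer below from 1 to 5 against these criteria:
- factually consistent with the question
- concise and directly useful

Question: {question}
Answer: {answer}

Reply with a single digit from 1 to 5 and nothing else."""

client = anthropic.Anthropic()

def judge(question, answer):
    """Score one (question, answer) pair; 0 means the judge's reply was unparseable."""
    response = client.messages.create(
        model="<judge-model-id>",  # placeholder; a cheap tier is usually enough for judging
        max_tokens=5,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    text = response.content[0].text.strip()
    return int(text[0]) if text and text[0].isdigit() else 0

In the harness, you would then treat a score at or above some threshold (say 4) as a pass.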

What to decide at the end of this phase:

  • Where's the floor? What pass rate is acceptable for production launch? Set the bar before you start tuning.
  • Run the eval on every prompt change. Tie it to CI if possible; a minimal gate sketch follows.
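One way to wire that in, assuming the evaluate() harness above and a floor you have already agreed on: exit nonzero when the pass rate drops below the floor, so the pipeline fails the build on a regressed prompt change.

# evals/ci_gate.py
import sys

from run_eval import evaluate  # the harness above

PASS_RATE_FLOOR = 0.90  # assumption: whatever bar you set before tuning

if __name__ == "__main__":
    pass_rate = evaluate()
    if pass_rate < PASS_RATE_FLOOR:
        print(f"FAIL: pass rate {pass_rate:.1%} is below the floor of {PASS_RATE_FLOOR:.0%}")
        sys.exit(1)
    print("OK")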

Phase 4 — Limited public beta (week 4–6)

Goal: see what real users do that you didn't anticipate.

What to build:

  • Open the feature flag to a small percent of users (5–10%) or to opted-in beta users
  • In-app feedback widget (thumbs up/down per response)
  • Per-user rate limits to prevent abuse (a cache-counter sketch follows this list)
  • A dashboard tracking: response satisfaction, latency, error rate, cost per user
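A per-user rate limit can be as small as a cache counter. A sketch using Django's cache framework; the key name, window, and daily limit are assumptions:

# throttling.py
from django.core.cache import cache

DAILY_LIMIT = 50  # assumption: tune to your feature's economics

def allow_request(user_id):
    """Return True if the user is under today's cap; increments the counter as a side effect."""
    key = f"ai_feature_x:{user_id}"
    cache.add(key, 0, timeout=86400)       # create the counter if missing (24 h window)
    return cache.incr(key) <= DAILY_LIMIT  # incr is atomic on Redis/Memcached backends

# In the view, before calling the service layer:
#     if not allow_request(request.user.id):
#         return JsonResponse({"error": "Daily limit reached"}, status=429)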

What to expect:

  • Real users will type things you never imagined. Single-character prompts. Prompts in languages you don't speak. Prompts trying to break the system. Prompts in foul language. Adversarial prompts. Prompts with personal information you shouldn't store.
  • Edge cases will not match your evals. Add the new failure modes to the eval set.
  • Costs may surprise you. A single user can make thousands of calls if your rate limits are wrong.

What to decide at the end of this phase:

  • Is the feature actually adding value? Look at retention, completion rate, NPS — not just thumbs up. People rate generously; they vote with their feet.
  • Are there abuse vectors? Patch them.

Phase 5 — General availability (week 6+)

Goal: make it boring and reliable.

What to build:

  • Full rollout (100% of users)
  • Production monitoring: Sentry on all errors, Grafana on cost/latency/usage
  • Per-feature kill switch — one config flip can disable the feature instantly
  • Prompt versioning — every prompt change gets a version number, the version is logged with every call
  • Regular regression evals (weekly cron)
# Versioned prompts
PROMPT_VERSION = "summarize_v3"
PROMPTS = {
    "summarize_v1": "Summarize this:",
    "summarize_v2": "Provide a 2-sentence summary:",
    "summarize_v3": "Provide a concise 2-sentence summary preserving key facts:",
}

response = ask(PROMPTS[PROMPT_VERSION])
log_call(prompt_version=PROMPT_VERSION, ...)

When you change a prompt and quality drops, you can correlate the regression with the version change in your logs.

The architecture pattern

By Phase 5, your AI code should look something like this:

┌─ View layer ───────────────────────────────────┐
│  - Authentication, rate limit, feature flag    │
│  - Calls service layer, returns response       │
└────────────────────────────────────────────────┘
                        ↓
┌─ Service layer ────────────────────────────────┐
│  - Routes to right model (router)              │
│  - Constructs prompt from versioned templates  │
│  - Calls inference layer                       │
│  - Validates output (Pydantic)                 │
│  - Logs call with all metadata                 │
└────────────────────────────────────────────────┘
                        ↓
┌─ Inference layer ──────────────────────────────┐
│  - Wraps Anthropic SDK                         │
│  - Handles retries, errors, timeouts           │
│  - Manages prompt caching                      │
└────────────────────────────────────────────────┘

Three clean layers, each testable. Don't put client.messages.create() in your view.
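A minimal skeleton of those three layers; every name here (complete, summarize_text, the prompt key) is illustrative rather than a prescribed API:

# ai/inference.py (the only file that talks to the SDK)
import anthropic

_client = anthropic.Anthropic()

def complete(prompt, model, max_tokens=512):
    """Single call site for the Anthropic SDK; retries, timeouts, and caching live here."""
    return _client.messages.create(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )

# ai/services.py (prompt construction, validation, logging)
from ai import inference

PROMPTS = {
    "summarize_v3": "Provide a concise 2-sentence summary preserving key facts:\n\n{text}",
}

def summarize_text(text, prompt_version="summarize_v3"):
    prompt = PROMPTS[prompt_version].format(text=text)
    response = inference.complete(prompt, model="<model-id>")  # placeholder model ID
    output = response.content[0].text
    # validate the output and log the call (Pydantic model, AICallLog, ...) before returning
    return output

# views.py stays thin: auth, rate limit, feature flag, then one call into services.summarize_text().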

What to monitor

The metrics that matter:

Metric                                  Alert when
p50 / p95 latency                       p95 > 10 seconds
Error rate                              > 2%
Cost per user per day                   > 2x baseline
Eval pass rate (weekly)                 drops > 10 points
User feedback (thumbs-down rate)        > 20%
Hallucination rate (sample-audited)     > 5%

Not all of these are easy to measure automatically. The hallucination rate especially needs human spot-checks. Build a weekly review where someone reads 20 random conversations and rates them.
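A small helper for that weekly review, assuming the hypothetical AICallLog model sketched in Phase 2:

# management/commands/sample_for_review.py
import random

from django.core.management.base import BaseCommand

from myapp.models import AICallLog  # the hypothetical log model from Phase 2

class Command(BaseCommand):
    help = "Print a random sample of recent AI calls for the weekly human review."

    def handle(self, *args, **options):
        recent_ids = list(AICallLog.objects.order_by("-id").values_list("id", flat=True)[:2000])
        sample = random.sample(recent_ids, min(20, len(recent_ids)))
        for log in AICallLog.objects.filter(id__in=sample):
            self.stdout.write(f"--- call {log.id} ({log.created_at:%Y-%m-%d}) ---")
            self.stdout.write(f"IN:  {log.input_text[:500]}")
            self.stdout.write(f"OUT: {log.output_text[:500]}")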

Failure mode planning

Before launch, write down:

  1. What happens if Anthropic is down? (Your AI feature should fail gracefully — return a "service unavailable" message, not a 500.)
  2. What happens if costs spike? (Per-user limits + global circuit breaker; sketched after this list.)
  3. What happens if a user submits something that gets the LLM to behave badly? (Output filtering + audit log.)
  4. What happens if the LLM returns unsafe content? (Output moderation pass + fallback.)
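For item 2, the global circuit breaker can be a cache-backed daily spend counter that the service layer checks before every call; the budget figure and key name are assumptions:

# circuit_breaker.py
from django.core.cache import cache

DAILY_BUDGET_CENTS = 50_000  # assumption: $500/day across all users

def record_spend(cents):
    """Call after each AI request with its estimated cost."""
    cache.add("ai_spend_today", 0, timeout=86400)  # create the counter if missing (24 h window)
    cache.incr("ai_spend_today", cents)

def breaker_open():
    """True when today's spend has blown the budget; callers should refuse new AI calls."""
    return (cache.get("ai_spend_today") or 0) >= DAILY_BUDGET_CENTS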

Have answers before you launch, not during the incident.

What kills AI features

Common causes of failed AI features:

  • No evals — you don't know if it's getting better or worse
  • No monitoring — you don't know what's happening in production
  • Premature scale — you launched to all users before validating quality
  • No kill switch — when something goes wrong you can't stop it
  • No human review — high-stakes decisions made entirely by LLM with no oversight
  • Cost surprises — your launch worked, then a viral moment 10x'd traffic and your bill

Each of these is preventable with the practices above.

Summary

The roadmap from PoC to production is not about writing better prompts. It's about:

  1. Hand-validating capability before building anything
  2. Internal-first integration with logging
  3. Eval-driven quality gates
  4. Staged rollout with feedback loops
  5. Production monitoring and kill switches
  6. Architectural separation of concerns

Most demos can be built in a weekend. Most production AI features take 4–8 weeks. The difference is this checklist. Skip it and you'll either ship something flaky or burn money silently.

Take it seriously and you'll ship features that earn user trust and survive scale. Which is the whole point.