The architecture, evals, monitoring, and process disciplines that take an AI feature from "works in a demo" to "survives real users at scale."
The first version of an AI feature usually works in a demo. The version that serves real users for months without paging whoever is on call is a different beast.
This tutorial is the playbook for crossing the gap between the two.
Goal: prove the LLM can do the task at all.
What to build:
What NOT to build yet:
What to decide at the end of this phase:
If quality is acceptable, advance.
Goal: plug the feature into the Django app and let internal users hit it.
What to build:
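# settings.py
from decouple import config  # assumption: config() comes from python-decouple

AI_FEATURE_X_ENABLED = config("AI_FEATURE_X_ENABLED", default=False, cast=bool)
AI_FEATURE_X_STAFF_ONLY = config("AI_FEATURE_X_STAFF_ONLY", default=True, cast=bool)

# views.py
from django.conf import settings
from django.contrib.auth.decorators import login_required
from django.http import Http404

@login_required
def feature_x(request):
    # Hide the feature entirely while the flag is off.
    if not settings.AI_FEATURE_X_ENABLED:
        raise Http404()
    # During the internal rollout, non-staff users get a 404 as well.
    if settings.AI_FEATURE_X_STAFF_ONLY and not request.user.is_staff:
        raise Http404()
    ...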
What to decide at the end of this phase:
Goal: make it impossible to ship a prompt change without knowing whether quality regressed.
What to build:
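# evals/test_set.json
[
  {"input": "How do I install QuizCraft?", "expected_keyword": "venv"},
  {"input": "Refund policy?", "expected_keyword": "30 days"},
  ...
]

# evals/run_eval.py
import json
from pathlib import Path

# ask() is the app's own LLM helper; import it from wherever your project defines it.
TEST_SET_PATH = Path(__file__).parent / "test_set.json"

def load_test_set():
    # Minimal loader: reads the JSON list of examples shown above.
    return json.loads(TEST_SET_PATH.read_text())

def evaluate():
    test_set = load_test_set()
    if not test_set:
        raise ValueError("test set is empty")  # avoids dividing by zero below
    results = []
    for example in test_set:
        actual = ask(example["input"])
        # Case-insensitive keyword match: crude, but cheap enough to run on every change.
        passed = example["expected_keyword"].lower() in actual.lower()
        results.append({"input": example["input"], "passed": passed, "actual": actual})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    print(f"Pass rate: {pass_rate:.1%}")
    return pass_rate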
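To turn the pass rate returned by evaluate() into an actual gate, one option is to compare it against the last accepted value and fail the build when it drops too far. The following is a minimal sketch, not a prescribed setup: the baseline file, the function names, and the 10-point tolerance (borrowed from the weekly eval alert threshold in the metrics table) are all assumptions.

```python
import json
from pathlib import Path

# Hypothetical tolerance: how many percentage points the pass rate may drop.
MAX_DROP_POINTS = 10.0


def check_regression(pass_rate: float, baseline_rate: float,
                     max_drop_points: float = MAX_DROP_POINTS) -> bool:
    """Return True when the new pass rate is within tolerance of the baseline."""
    drop_points = (baseline_rate - pass_rate) * 100
    return drop_points <= max_drop_points


def load_baseline(path: Path, default: float = 0.0) -> float:
    """Read the last accepted pass rate; fall back to `default` on the first run."""
    if not path.exists():
        return default
    return float(json.loads(path.read_text())["pass_rate"])


def save_baseline(path: Path, pass_rate: float) -> None:
    """Persist the accepted pass rate for the next comparison."""
    path.write_text(json.dumps({"pass_rate": pass_rate}))
```

In CI, a wrapper script could call evaluate(), run check_regression() against the stored baseline, and exit with a non-zero status when it returns False, so a regressed prompt cannot merge unnoticed.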
What to decide at the end of this phase:
Goal: see what real users do that you didn't anticipate.
What to build:
What to expect:
What to decide at the end of this phase:
Goal: make it boring and reliable.
What to build:
# Versioned prompts
PROMPT_VERSION = "summarize_v3"
PROMPTS = {
    "summarize_v1": "Summarize this:",
    "summarize_v2": "Provide a 2-sentence summary:",
    "summarize_v3": "Provide a concise 2-sentence summary preserving key facts:",
}
response = ask(PROMPTS[PROMPT_VERSION])
log_call(prompt_version=PROMPT_VERSION, ...)
When you change a prompt and quality drops, you can correlate the regression with the version change in your logs.
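As a rough illustration of that correlation step, the sketch below groups logged calls by prompt version and computes a thumbs-down rate for each. The `CallRecord` shape and its field names are assumptions made for the example; your log schema will differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    """One logged LLM call. Field names are illustrative assumptions."""
    prompt_version: str
    thumbs_down: bool


def thumbs_down_rate_by_version(records: list[CallRecord]) -> dict[str, float]:
    """Group logged calls by prompt version and compute the thumbs-down rate."""
    totals: dict[str, int] = defaultdict(int)
    downs: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record.prompt_version] += 1
        downs[record.prompt_version] += int(record.thumbs_down)
    return {version: downs[version] / totals[version] for version in totals}
```

A jump in the rate between two adjacent versions is a strong hint about which prompt change to inspect first.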
By Phase 5, your AI code should look something like this:
┌─ View layer ───────────────────────────────────┐
│ - Authentication, rate limit, feature flag     │
│ - Calls service layer, returns response        │
└────────────────────────────────────────────────┘
                        ↓
┌─ Service layer ────────────────────────────────┐
│ - Routes to right model (router)               │
│ - Constructs prompt from versioned templates   │
│ - Calls inference layer                        │
│ - Validates output (Pydantic)                  │
│ - Logs call with all metadata                  │
└────────────────────────────────────────────────┘
                        ↓
┌─ Inference layer ──────────────────────────────┐
│ - Wraps Anthropic SDK                          │
│ - Handles retries, errors, timeouts            │
│ - Manages prompt caching                       │
└────────────────────────────────────────────────┘
Three clean layers, each testable. Don't put client.messages.create() in your view.
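The layering can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the tutorial's actual code: `send` is a hypothetical callable standing in for the real SDK call, a plain dataclass stands in for the Pydantic model, and the view is an ordinary function rather than a Django view so that the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable

# Same wording as the versioned-prompt example earlier in the tutorial.
PROMPTS = {
    "summarize_v3": "Provide a concise 2-sentence summary preserving key facts:",
}


class InferenceError(RuntimeError):
    """Raised when the model call fails, whatever the underlying cause."""


class InferenceLayer:
    """Wraps the model SDK. `send` is a hypothetical stand-in for the real call."""

    def __init__(self, send: Callable[[str], str]):
        self._send = send

    def complete(self, prompt: str) -> str:
        try:
            return self._send(prompt)
        except Exception as exc:  # a real wrapper would also add retries and timeouts
            raise InferenceError("model call failed") from exc


@dataclass
class Summary:
    """Validated output. A Pydantic model could replace this dataclass."""
    text: str
    prompt_version: str


class SummaryService:
    """Service layer: builds the prompt, calls inference, validates, logs."""

    def __init__(self, inference: InferenceLayer, log: Callable[[dict], None],
                 prompt_version: str = "summarize_v3"):
        self._inference = inference
        self._log = log
        self._prompt_version = prompt_version

    def summarize(self, document: str) -> Summary:
        prompt = f"{PROMPTS[self._prompt_version]}\n\n{document}"
        raw = self._inference.complete(prompt).strip()
        if not raw:
            raise ValueError("model returned empty output")
        self._log({"prompt_version": self._prompt_version, "output_chars": len(raw)})
        return Summary(text=raw, prompt_version=self._prompt_version)


def summarize_view(is_staff: bool, document: str, service: SummaryService) -> dict:
    """View layer stand-in: access check, then delegate to the service."""
    if not is_staff:
        return {"status": 404}
    try:
        summary = service.summarize(document)
    except (InferenceError, ValueError):
        return {"status": 503, "error": "summary unavailable"}
    return {"status": 200, "summary": summary.text}
```

Because each layer receives its dependency as an argument, a test can pass a lambda in place of the SDK and a list's `append` in place of the logger, which is what makes the layers individually testable.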
The metrics that matter:
| Metric | Alert when |
|---|---|
| p50 / p95 latency | p95 > 10 seconds |
| Error rate | > 2% |
| Cost per user per day | > 2x baseline |
| Eval pass rate (weekly) | drops > 10 points |
| User feedback (thumbs down rate) | > 20% |
| Hallucination rate (sample-audited) | > 5% |
Not all of these are easy to measure automatically. The hallucination rate especially needs human spot-checks. Build a weekly review where someone reads 20 random conversations and rates them.
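The thresholds in the table can be checked mechanically once the metrics exist, and the weekly review needs a fair sample. The sketch below shows one possible shape for both; the metric key names and function names are assumptions, and computing the metrics themselves is left to your logging pipeline.

```python
import random
from typing import Iterable, Optional

# Thresholds copied from the table above; the key names are assumptions.
ALERT_RULES = {
    "p95_latency_seconds": 10.0,
    "error_rate": 0.02,
    "cost_ratio_to_baseline": 2.0,
    "eval_pass_rate_drop_points": 10.0,
    "thumbs_down_rate": 0.20,
    "hallucination_rate": 0.05,
}


def triggered_alerts(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that exceed their threshold, sorted by name."""
    return sorted(
        name for name, limit in ALERT_RULES.items()
        if metrics.get(name, 0.0) > limit
    )


def sample_for_review(conversation_ids: Iterable[str], k: int = 20,
                      seed: Optional[int] = None) -> list[str]:
    """Pick up to k conversations uniformly at random for the weekly human review."""
    ids = list(conversation_ids)
    return random.Random(seed).sample(ids, min(k, len(ids)))
```

Missing metrics are treated as zero here, which keeps the check quiet until a metric is wired up; a stricter variant could alert on missing data instead.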
Before launch, write down:
Have answers before you launch, not during the incident.
Common causes of failed AI features:
Each of these is preventable with the practices above.
AI features are famously easy to prototype and famously hard to productionize, and understanding this gap is the heart of the roadmap. A proof of concept that works on a few hand-picked inputs can feel almost done, but production demands reliability across messy real inputs, handling of failures and edge cases, cost control, monitoring, and safety — none of which the demo addressed. Many AI projects stall precisely because teams underestimate this gap, treating the impressive demo as nearly finished when most of the real engineering still lies ahead. Recognizing that the demo is the start, not the end, sets realistic expectations for what it takes to ship.
The first step beyond a demo is a pilot: putting the feature in front of a limited set of real users with real data, which immediately surfaces the gap between curated test cases and reality. Real inputs are messier, users do unexpected things, and quality problems that were invisible in testing appear. The pilot is where you learn what actually needs hardening, measure real quality and cost, and validate that the feature delivers value. Treating the pilot as a deliberate learning phase — instrumented, observed, and limited in blast radius — is what turns a promising prototype into knowledge about what production will really require.
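One common way to keep a pilot's blast radius small is a deterministic percentage rollout: hash each user ID into a bucket from 0 to 99 and admit only the buckets below the rollout percentage. The sketch below is an illustration of that idea, not something the tutorial prescribes; the function name and the salt value are assumptions.

```python
import hashlib


def in_pilot(user_id: str, rollout_percent: int, salt: str = "feature_x") -> bool:
    """Deterministically assign a user to the pilot based on a hash of their ID."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in the range 0..99
    return bucket < rollout_percent
```

Because the bucket depends only on the salt and the user ID, a given user stays in or out of the pilot across requests, and raising the percentage only ever adds users.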
Moving from pilot to production means addressing everything the demo ignored: handling errors and model failures gracefully, validating and constraining output, controlling cost, monitoring quality and spend, guarding against misuse and prompt injection, and ensuring the feature degrades sensibly when the model is unavailable. This hardening work is the bulk of the real engineering in an AI feature, and it is what makes the difference between something that impresses in a meeting and something that holds up under real traffic. Budgeting for this phase realistically — rather than assuming a working pilot is nearly shippable — is essential to actually reaching production.
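Graceful degradation can be as simple as retrying a few times with exponential backoff and then returning a safe fallback instead of an error page. The following is a minimal sketch under assumptions: `call_model` is any zero-argument callable you supply, the retry counts and delays are placeholder values, and real code would catch only the SDK's transient error types rather than every exception.

```python
import time
from typing import Callable


def call_with_fallback(call_model: Callable[[], str], fallback: str,
                       max_attempts: int = 3, base_delay_seconds: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Try the model a few times with exponential backoff, then degrade to `fallback`."""
    for attempt in range(max_attempts):
        try:
            return call_model()
        except Exception:  # assumption: narrow this to transient errors in real code
            is_last_attempt = attempt == max_attempts - 1
            if not is_last_attempt:
                # Delays grow as 0.5s, 1s, 2s, ... with the default base delay.
                sleep(base_delay_seconds * 2 ** attempt)
    return fallback
```

Passing `sleep` as a parameter keeps the function testable without real waiting; the fallback might be a cached answer or a plain message telling the user the feature is temporarily unavailable.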
AI features fail to reach or survive production for recognizable reasons: unreliable quality on real inputs, runaway or unjustified cost, lack of monitoring so problems go unnoticed, safety and misuse issues, and solving a problem that did not really need AI. Many of these are avoidable with the right roadmap — piloting to surface quality issues early, controlling cost deliberately, monitoring from the start, and honestly assessing whether AI is the right tool. Knowing the common failure modes in advance lets you steer around them, which is much of what separates AI projects that deliver lasting value from the many that produce an exciting demo and then quietly stall.
The roadmap from PoC to production is not about writing better prompts. It's about:
Most demos can be built in a weekend. Most production AI features take 4–8 weeks. The difference is this checklist. Skip it and you'll either ship something flaky or burn money silently.
Take it seriously and you'll ship features that earn user trust and survive scale. Which is the whole point.