
Feature Flags and Progressive Delivery in Django: Canary Releases, A/B Tests, and Kill Switches

Ship to production continuously without big-bang risk. Build a feature-flag layer in Django, roll features out to a percentage of users, run A/B experiments, and add instant kill switches for when something goes wrong.

DjangoZen Team, Jun 06, 2026

Deploying code and releasing a feature should be two separate events. Feature flags let you merge to main, deploy on Tuesday, turn a feature on for 5% of users on Thursday, and roll it back in one click if metrics dip — all without a redeploy. This is progressive delivery, and it is how mature teams ship continuously without betting the site on every release. This tutorial builds a real flag system in Django and the delivery patterns it unlocks.

Decoupling deploy from release

Without flags, every deploy is a gamble: all the new code goes live to everyone the instant it ships, so a bug is an outage and a rollback is another deploy under pressure. Feature flags break that coupling. The deploy becomes boring — the new code ships dormant, wrapped in a flag that is off — and the release becomes a controlled, reversible configuration change you make when you are ready and watching. That separation is the entire point: it turns scary, all-or-nothing launches into gradual, observable, instantly-reversible rollouts, and it lets code sit safely in production ahead of the business decision to turn it on.

It also unlocks workflows that are otherwise impossible: merging long-running work in small safe increments behind a flag instead of in a giant risky branch, testing in production with internal users before anyone else sees a feature, and giving product and sales a switch they can flip without engineering. The flag is the seam between "the code exists" and "customers experience it."

A minimal flag layer

You do not need a SaaS vendor to start. A small model captures the essentials — an on/off switch, a percentage rollout, and an explicit allowlist for specific users:

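from django.conf import settings
from django.core.validators import MaxValueValidator
from django.db import models

class FeatureFlag(models.Model):
    key = models.SlugField(unique=True)
    enabled = models.BooleanField(default=False)  # master on/off switch
    rollout_percent = models.PositiveSmallIntegerField(
        default=0, validators=[MaxValueValidator(100)])  # 0-100
    allowed_users = models.ManyToManyField(settings.AUTH_USER_MODEL, blank=True)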

The model is deliberately simple, but the resolution logic is where the important decisions live — above all, that the same user must always get the same answer for a given flag.

Deterministic resolution

A flag that flips on and off for the same user on every request is useless — the experience flickers and any experiment data is noise. Resolution must be deterministic: bucket each user by hashing the flag key together with their ID, so a user stays in the same bucket as you ramp the percentage up:

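import hashlib

def flag_enabled(key, user=None):
    # Unknown or switched-off flags are off for everyone, allowlist included.
    flag = FeatureFlag.objects.filter(key=key).first()
    if not flag or not flag.enabled:
        return False
    if flag.rollout_percent >= 100:
        return True
    # Anonymous users have pk=None; without this check they would all hash
    # into one shared bucket, so they only get a feature once it is at 100%.
    if user is None or user.pk is None:
        return False
    if flag.allowed_users.filter(pk=user.pk).exists():
        return True
    # Stable bucket in 0..99 derived from the flag key and the user ID.
    bucket = int(hashlib.sha256(
        f"{key}:{user.pk}".encode()).hexdigest(), 16) % 100
    return bucket < flag.rollout_percent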

Hashing key:user_id has two crucial properties: a given user's bucket is stable, so raising the rollout from 5% to 25% only ever adds users and never removes someone who already had the feature; and the bucketing differs per flag, so a user in the unlucky 95% for one feature is not systematically excluded from every feature. That stability is what makes both safe rollouts and honest experiments possible.
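
The stability property can be checked without Django at all. The sketch below pulls the hashing into a standalone helper and compares the cohorts at 5% and 25%. The names `bucket_for` and `enabled_users` and the flag key `new-checkout` are hypothetical, chosen only for illustration.

```python
import hashlib


def bucket_for(key: str, user_id: int) -> int:
    """Map (flag key, user id) to a stable bucket in 0..99."""
    digest = hashlib.sha256(f"{key}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def enabled_users(key: str, user_ids, percent: int) -> set:
    """Users whose bucket falls below the rollout percentage."""
    return {uid for uid in user_ids if bucket_for(key, uid) < percent}


users = range(1, 10_001)
at_5 = enabled_users("new-checkout", users, 5)
at_25 = enabled_users("new-checkout", users, 25)

# Ramping up only adds users: the 5% cohort is a subset of the 25% cohort.
ramp_only_adds = at_5 <= at_25
```

Because a bucket below 5 is also below 25, `ramp_only_adds` holds for any set of users, not just this sample.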

Caching flag lookups

Every request will check flags, often several, so hitting the database each time is wasteful. Cache flag definitions in Redis with a short TTL — a few seconds to a minute — so lookups are fast but changes still propagate quickly. The short TTL is a deliberate balance: long enough to remove the database load, short enough that flipping a kill switch takes effect almost immediately. For the highest-traffic paths, an in-process cache with an even shorter TTL layered on top removes the Redis round trip too. Whatever you choose, a flag check must be cheap enough that developers never hesitate to add one.
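
As a rough illustration of the in-process layer, here is a minimal TTL cache in plain Python. It is a sketch under assumptions, not a production cache: `TTLCache` and `load_flag` are hypothetical names, the loader stands in for whatever Redis or database lookup you use, and there is no eviction or thread-safety handling.

```python
import time


class TTLCache:
    """Tiny in-process cache: values expire `ttl` seconds after being stored."""

    def __init__(self, ttl: float, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh, skip the slower lookup
        value = loader(key)  # e.g. a Redis or database read
        self._store[key] = (now + self.ttl, value)
        return value


def load_flag(key):
    # Hypothetical stand-in for the real flag lookup.
    return {"key": key, "enabled": True, "rollout_percent": 25}


flags = TTLCache(ttl=5.0)
flag = flags.get_or_load("new-checkout", load_flag)
```

The TTL is the upper bound on how long a flipped kill switch can go unnoticed by this process, which is why it should stay short.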

Canary rollouts

The canary pattern ramps a feature gradually while you watch the system's health: 1%, then 5%, 25%, 50%, 100%, pausing at each step to check error rates, latency, and the business metric the feature is meant to move. If the dashboards stay green, you continue; if anything spikes, you drop the rollout back to 0 instantly — no deploy, no rollback branch, just a configuration change. This turns a release from a single terrifying moment into a series of small, observable, reversible steps. The discipline that makes it work is watching the right metrics at each stage and being willing to halt; a canary you do not monitor is just a slow full release.
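
The ramp itself is simple enough to write down. The sketch below encodes the stages named above; `next_rollout` is a hypothetical helper, and the `healthy` argument is assumed to come from whatever monitoring you already trust (error rate, latency, the business metric).

```python
RAMP_STEPS = [1, 5, 25, 50, 100]  # the stages named above


def next_rollout(current: int, healthy: bool) -> int:
    """Return the next rollout percentage: advance if healthy, else drop to 0."""
    if not healthy:
        return 0
    for step in RAMP_STEPS:
        if step > current:
            return step
    return 100  # already fully rolled out
```

For example, `next_rollout(5, healthy=True)` moves to 25, while `next_rollout(50, healthy=False)` drops straight back to 0.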

A/B experiments

The same deterministic bucketing that powers rollouts powers experiments. Assign users to variant A or B by their stable bucket, render the appropriate experience, and emit an analytics event recording which variant each user saw. Then measure the metric you actually care about (conversion, retention, revenue per user), not vanity numbers like clicks. Decide the sample size or duration before you start and run the experiment to that point, rather than stopping the moment the numbers look good or first cross a significance threshold; repeatedly checking and stopping early is how teams fool themselves with noise. When a winner is clear, ship it to everyone and delete the flag. Done well, this replaces opinion-driven debate with evidence; done carelessly, it produces confident conclusions from random variation, so respect the statistics.
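
A minimal sketch of both halves, assuming a 50/50 split and a binary conversion metric. The function names are hypothetical, and the pooled two-proportion z-test is one common choice for comparing conversion rates, not something this tutorial prescribes.

```python
import hashlib
import math


def variant_for(experiment: str, user_id: int) -> str:
    """Stable 50/50 assignment, salted by the experiment key."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 100 < 50 else "B"


def two_proportion_p_value(conv_a, n_a, conv_b, n_b) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so nothing to distinguish
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


# Illustrative numbers only: 200/4000 conversions for A, 260/4000 for B.
p = two_proportion_p_value(200, 4000, 260, 4000)
```

The p-value is only meaningful if it is computed once, at the sample size chosen in advance.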

Targeting and segmentation

Percentage rollouts are the start, not the end. Real systems target by attributes: enable a feature for internal staff first, then beta-opt-in users, then a specific plan tier, then a region, then everyone. Extend the resolution logic to evaluate rules against user and request context — plan, signup date, country, account flags — so you can say "on for enterprise customers in the EU" declaratively. This is also how you run a feature for a single customer who requested it, or exclude a customer who must not see a change yet. Keep the rules data-driven so product can adjust targeting without a deploy.
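
One way to keep targeting data-driven is to store rules as plain data and evaluate them against a context dictionary. The rule shape below (attribute name mapped to a list of allowed values, with any rule sufficient to match) is an assumption made for this sketch, not a standard format.

```python
def matches(rule: dict, context: dict) -> bool:
    """A rule matches when every attribute it names has an allowed value."""
    return all(context.get(attr) in allowed for attr, allowed in rule.items())


def targeted(rules: list, context: dict) -> bool:
    """A flag is on for this context if any one of its rules matches."""
    return any(matches(rule, context) for rule in rules)


# "On for staff, or for enterprise customers in the EU."
rules = [
    {"is_staff": [True]},
    {"plan": ["enterprise"], "region": ["EU"]},
]

eu_enterprise = targeted(rules, {"plan": "enterprise", "region": "EU"})
us_enterprise = targeted(rules, {"plan": "enterprise", "region": "US"})
```

Since the rules are just data, they can live in a JSON field next to the flag and be edited without a deploy. Note that an empty rule matches every context, so validate rules before saving them.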

Kill switches

Not every flag controls a new feature — some exist purely to turn things off in an emergency. Wrap risky dependencies and expensive operations in a kill switch: a new payment provider, a third-party integration prone to outages, a heavy report that can overload the database under load. When that dependency misbehaves, you flip the switch in seconds and degrade gracefully instead of failing hard, with no deploy in the critical moment. Kill switches are cheap insurance; add them proactively to anything whose failure would hurt, so that when the inevitable incident comes, your response is a click rather than a frantic hotfix.
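
In code, a kill switch is usually a guard with a fallback path. The wrapper below is a sketch with hypothetical names: `is_killed` stands in for a cheap, cached flag lookup, and the two report functions are placeholders for a heavy operation and its degraded alternative.

```python
def with_kill_switch(is_killed, primary, fallback):
    """Call `primary` unless the switch is thrown; then degrade to `fallback`."""
    def guarded(*args, **kwargs):
        if is_killed():
            return fallback(*args, **kwargs)
        return primary(*args, **kwargs)
    return guarded


switch = {"killed": False}  # stand-in for a cached flag lookup


def heavy_report(account_id):
    return f"full report for {account_id}"


def cached_summary(account_id):
    return f"cached summary for {account_id}"


report = with_kill_switch(lambda: switch["killed"], heavy_report, cached_summary)
```

The switch is checked on every call, so flipping it takes effect on the next request without restarting anything.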

Flag hygiene and technical debt

Flags are not free forever. A flag that has been at 100% for three months is dead conditional code — two code paths where there should be one, extra branches to read and test, and a latent risk that someone flips the wrong one. This is real technical debt, and ignoring it leaves your codebase a maze of stale toggles. Build the discipline to remove flags once they have fully shipped: track each flag's age and state, schedule cleanup of the long-since-launched ones, and treat flag removal as part of finishing a feature, not an optional afterthought. A healthy flag system has a steady flow of flags created and retired, not an ever-growing pile.
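
Tracking age and state can start as a small report. The sketch below assumes each flag records when it was last changed; the `last_changed` field is not part of the model shown earlier, and the 90-day threshold is an arbitrary example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class FlagState:
    key: str
    rollout_percent: int
    last_changed: date  # assumed extra field, not in the model above


def stale_flags(flags, today: date, max_age_days: int = 90) -> list:
    """Keys of flags that have sat at 100% for at least `max_age_days`."""
    cutoff = today - timedelta(days=max_age_days)
    return [
        f.key for f in flags
        if f.rollout_percent >= 100 and f.last_changed <= cutoff
    ]


flags = [
    FlagState("new-checkout", 100, date(2026, 1, 10)),
    FlagState("search-v2", 25, date(2026, 1, 10)),
    FlagState("dark-mode", 100, date(2026, 5, 30)),
]
to_remove = stale_flags(flags, today=date(2026, 6, 6))
```

Here only `new-checkout` qualifies: `search-v2` is still ramping and `dark-mode` reached 100% too recently.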

Build versus buy

The model above is enough to start and gives you full control with no external dependency. As needs grow (complex targeting, audit logs, a UI for non-engineers, experiment analysis, SDKs across services), dedicated platforms like LaunchDarkly, Flagsmith, or Unleash become attractive; Flagsmith and Unleash are both open source and can be self-hosted. The decision hinges on how central flags become to your workflow: a handful of rollout toggles do not justify a vendor, but an organization where product managers run dozens of experiments and targeting rules benefits from purpose-built tooling. Start simple, and adopt a platform when the operational and product needs clearly exceed what a model and an admin page provide.

Server-side versus client-side evaluation

Where a flag is evaluated matters for both security and performance. Server-side evaluation — deciding on the flag in Django and sending the client only the resulting experience — keeps your rollout logic and targeting rules private, which is essential when a flag gates an unreleased feature you do not want competitors or curious users discovering by reading the JavaScript bundle. Client-side evaluation, where the browser receives flag values and decides, is simpler for purely cosmetic toggles but exposes the existence of features and the shape of your targeting. The rule of thumb: evaluate anything sensitive or security-relevant on the server, and only push non-secret presentation flags to the client.
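
One way to enforce that rule of thumb is an explicit allowlist of flags that may reach the browser, applied after server-side evaluation. The helper name and the flag keys below are hypothetical.

```python
def client_payload(resolved: dict, client_safe: set) -> dict:
    """Keep only flags explicitly marked safe to expose to the browser."""
    return {key: value for key, value in resolved.items() if key in client_safe}


# Flags already resolved on the server for the current user.
resolved = {"compact-header": True, "new-billing-engine": True, "promo-banner": False}
CLIENT_SAFE = {"compact-header", "promo-banner"}

payload = client_payload(resolved, CLIENT_SAFE)
```

Defaulting to "not exposed" means a newly created flag stays private until someone deliberately adds it to the allowlist.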

Flags across multiple services

Once flags span more than one service, consistency becomes the challenge: the same user must get the same flag decision whether they hit the web app, the API, or a background worker, or the experience fractures. The deterministic hashing approach from earlier is what makes this possible. Because the decision is a pure function of the flag key, the user ID, and the flag's configuration, every service computes the same answer without coordinating, as long as they share that configuration and implement the hash identically (same input string, same digest, same modulus). Distribute the flag definitions to every service (via a shared store or a flag platform's SDK) and let each evaluate locally. This keeps decisions fast and consistent without a central evaluation service becoming a bottleneck or a single point of failure.
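
To make the "evaluate locally" idea concrete, the sketch below evaluates a flag from a serialized configuration snapshot rather than a database row. The snapshot format and the `evaluate` function are assumptions for illustration; allowlists are left out to keep it short.

```python
import hashlib
import json


def evaluate(config: dict, key: str, user_id: int) -> bool:
    """Pure function of the shared config, the flag key, and the user id."""
    flag = config.get(key)
    if not flag or not flag["enabled"]:
        return False
    digest = hashlib.sha256(f"{key}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < flag["rollout_percent"]


# The same serialized snapshot, as each service might receive it.
snapshot = json.dumps({"new-checkout": {"enabled": True, "rollout_percent": 25}})
web_config = json.loads(snapshot)
worker_config = json.loads(snapshot)

decisions_match = all(
    evaluate(web_config, "new-checkout", uid)
    == evaluate(worker_config, "new-checkout", uid)
    for uid in range(1, 1001)
)
```

A service written in another language would need to reproduce the same string format, SHA-256 digest, and modulus to stay in agreement.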

Governance: who can flip a flag

A kill switch is powerful, and power needs guardrails. Flipping a flag in production is a production change, so it deserves the same care as a deploy: an audit log of who changed what and when, permissions controlling who can toggle which flags, and ideally a confirmation step for high-impact switches. Without this, a feature flag system becomes an ungoverned back door where anyone can alter production behavior untracked. Record every change, restrict sensitive flags to the people who should control them, and surface the current state of all flags somewhere visible, so the team always knows what is on, for whom, and who decided.
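
A minimal sketch of those guardrails, using plain Python structures in place of Django models. Everything here is hypothetical: `FlagChange`, `set_rollout`, and the idea of passing a set of permitted editors are illustrations of the pattern, and a real system would persist the log and tie permissions to your auth model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FlagChange:
    flag_key: str
    old_percent: int
    new_percent: int
    actor: str
    at: datetime


def set_rollout(flag: dict, percent: int, actor: str,
                editors: set, audit_log: list) -> None:
    """Change a rollout percentage, enforcing permissions and recording it."""
    if actor not in editors:
        raise PermissionError(f"{actor} may not change {flag['key']}")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    audit_log.append(FlagChange(flag["key"], flag["rollout_percent"], percent,
                                actor, datetime.now(timezone.utc)))
    flag["rollout_percent"] = percent


audit_log = []
flag = {"key": "new-checkout", "rollout_percent": 5}
set_rollout(flag, 25, actor="alice", editors={"alice"}, audit_log=audit_log)
```

Routing every change through one function like this is what guarantees the log is complete; a flag edited directly in the database leaves no trace.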

Flags, experiments, and entitlements are different

It is worth separating three things that all look like flags but serve different masters. Release flags are temporary, controlling the rollout of new code and meant to be deleted once shipped. Experiment flags drive A/B tests and live until the experiment concludes. Entitlement flags are permanent, encoding what a customer's plan unlocks — they are part of your product, not technical debt. Treating them identically causes confusion: someone deletes an entitlement flag thinking it is stale release debt, and breaks a paying customer's access. Name and categorize flags by their purpose, apply hygiene only to the temporary kinds, and model entitlements deliberately as the long-lived configuration they are.
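
Making the category explicit in code helps prevent that mistake. The sketch below uses a hypothetical `FlagKind` enum; in the Django model this could be a choices field, which is an assumption rather than something the model above includes.

```python
from enum import Enum


class FlagKind(Enum):
    RELEASE = "release"          # temporary: delete once fully shipped
    EXPERIMENT = "experiment"    # temporary: delete when the test concludes
    ENTITLEMENT = "entitlement"  # permanent: part of the product


TEMPORARY_KINDS = {FlagKind.RELEASE, FlagKind.EXPERIMENT}


def eligible_for_cleanup(kind: FlagKind) -> bool:
    """Only temporary flags should ever show up in a cleanup report."""
    return kind in TEMPORARY_KINDS
```

With the kind recorded, cleanup tooling can skip entitlement flags entirely instead of relying on someone recognizing them by name.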

Testing code behind flags

A flag creates two code paths, and both must work, so your tests must exercise both states. Test the feature on and off, and test the transition, because bugs often hide in the assumption that everyone is in one state. This doubles some test cases, which is a real cost and another reason to delete flags promptly: every lingering flag doubles the states your suite must cover for the code it touches. For flags that interact, the combinations grow exponentially (n interacting flags allow up to 2^n states), so keep the number of simultaneously live flags manageable. Parametrized tests that run the same scenario with the flag forced on and off are the clean way to cover both paths without duplicating test code.
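
Here is what that can look like using only the standard library's `unittest` and its `subTest` context manager. The `greeting` function is a hypothetical stand-in for code with one branch per flag state; with pytest, `pytest.mark.parametrize` achieves the same thing.

```python
import unittest


def greeting(name: str, flag_on: bool) -> str:
    """Hypothetical code under test with one branch per flag state."""
    return f"Welcome back, {name}!" if flag_on else f"Hello, {name}"


class GreetingTests(unittest.TestCase):
    def test_both_flag_states(self):
        # Same scenario, flag forced on and then off.
        for flag_on in (True, False):
            with self.subTest(flag_on=flag_on):
                self.assertIn("Ada", greeting("Ada", flag_on))
```

Passing the flag state in as an argument, rather than looking it up inside the function, is a design choice that makes both paths trivial to force in tests.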

Summary

Feature flags turn scary releases into boring config changes by separating deploy from release. Build a small flag layer whose resolution is deterministic — hash key:user_id so a user's bucket is stable and ramping a rollout only ever adds users — and cache lookups with a short TTL so checks are cheap and kill switches still act fast. On that foundation, ramp features as monitored canaries, reuse the bucketing for statistically honest A/B tests, target by user and request attributes, and put kill switches on anything risky. The discipline that keeps the system healthy is hygiene: delete flags once they have fully shipped, or your codebase fills with dead toggles. Start with your own model and graduate to a managed platform only when targeting, experimentation, and team workflows genuinely demand it.
