
Profiling Django in Production: py-spy, django-silk, and Flame Graphs

Stop guessing why your app is slow. Profile a live Django process without restarting it using py-spy, trace per-request queries and timings with django-silk, read flame graphs, and turn findings into concrete fixes.

DjangoZen Team, Jun 06, 2026

"The site feels slow" is not a bug report you can act on. Profiling turns a vague complaint into a precise answer: this function, on this query, for eight hundred milliseconds. The best news is that you can profile a live production process without restarting it, adding code, or noticeably slowing it down. This tutorial covers the tools and the workflow to find the real bottleneck instead of guessing — py-spy for CPU, django-silk for queries, and flame graphs to read it all.

The cardinal rule: measure, don't guess

The most expensive mistake in performance work is optimizing the wrong thing. Developers are notoriously bad at guessing where time goes — the function you are sure is slow is often fine, while the real cost hides in an innocent-looking helper that runs ten thousand times. Every hour spent optimizing a path that was not actually the bottleneck is wasted, and worse, it adds complexity for no gain. Profiling exists precisely to replace intuition with evidence. Before changing a single line for performance, you should be able to point at a measurement that says exactly where the time goes; this discipline is what separates real optimization from superstition.

Confirm the symptom first

Start above the code, with metrics. Look at your p95 and p99 latency, your slow-query log, and your error rates to confirm there is a real, located problem and to decide what to profile. "Slow" at p50 and "slow" at p99 are different problems with different causes — a high p99 with a fine p50 often points at occasional lock contention or a cold cache, not at the average code path. Letting the metrics tell you which endpoint or job is slow means you profile the right thing, instead of attaching a profiler to a healthy process and learning nothing.
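To make the p50-versus-p99 distinction concrete, here is a minimal sketch that computes the three percentiles from a list of request latencies. The sample data is invented for illustration (97 fast requests and 3 that hit a slow path), and the helper name latency_percentiles is hypothetical; in practice these numbers would come from your metrics system rather than a Python list.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return p50, p95 and p99 for a list of request latencies in milliseconds."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Hypothetical data: most requests are fast, a few hit a slow path
samples = [40.0] * 97 + [900.0] * 3
stats = latency_percentiles(samples)
print(stats)
```

With data shaped like this, the median and p95 look healthy while p99 is dominated by the rare slow requests, which is the pattern that suggests an occasional cause rather than a slow average code path.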

py-spy: zero-instrumentation sampling

py-spy is a sampling profiler that attaches to a running Python process by its process ID and reads the call stacks from outside the process — no code changes, no restart, and negligible overhead because it samples rather than instruments every call. This is what makes it safe to run against production. You point it at a busy gunicorn worker and immediately see where that worker spends its time:

pip install py-spy

# live, top-like view of where a worker spends time
py-spy top --pid 2372612

# record 30 seconds and emit an interactive flame graph
py-spy record -o profile.svg --pid 2372612 --duration 30

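Because it requires no cooperation from the target process, py-spy works on a process that is already misbehaving in production, which is exactly when you most need answers and least want to redeploy. The one prerequisite is permission to inspect another process: on Linux that usually means running py-spy as root (sudo), and inside a container it typically means granting the SYS_PTRACE capability.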

Diagnosing a hung worker

One of py-spy's most valuable uses is the stuck process. When a worker hangs — pinned at 100% CPU, or frozen waiting on something — py-spy dump --pid <PID> instantly prints the current stack of every thread, showing you the exact line it is stuck on. Is it spinning in a regex, blocked on a database lock, waiting on a slow external API with no timeout? The dump tells you in seconds, without the futile exercise of trying to reproduce a production hang locally. This single command has rescued countless incident calls from hours of speculation.

Reading a flame graph

A flame graph is the standard way to visualize profiler output, and reading it is a skill worth learning. Each box is a function call; the boxes stack vertically by call depth, with callers below and callees above in the classic layout. Some tools, including py-spy's SVG output, draw the same picture inverted (an "icicle" graph), with the entry point at the top and the deepest calls at the bottom. The crucial axis is horizontal: width represents the share of samples, and therefore time spent, not chronological order. So you read a flame graph by scanning for the widest boxes, because those are where the time goes. A wide plateau at the leaf end of a stack (the top in the classic layout) is a single function eating CPU directly; a wide box that turns out to be database-driver code means you are query-bound rather than CPU-bound. The shape tells you not just where the time is, but what kind of problem you have.
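The "width is time" idea can also be read straight from the numbers behind the picture. The sketch below assumes collapsed-stack input, the one-line-per-stack text format that py-spy can emit with record --format raw, where each line is a semicolon-separated stack followed by a sample count. The function names in the sample data are invented, and summarize_folded is a hypothetical helper, not part of py-spy.

```python
from collections import Counter


def summarize_folded(lines):
    """Return (total, self) sample counts per frame from collapsed-stack lines."""
    total, self_time = Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The sample count is the last space-separated field on the line
        stack, _, count = line.rpartition(" ")
        frames = stack.split(";")
        samples = int(count)
        # Every distinct frame on the stack was present for these samples
        for frame in set(frames):
            total[frame] += samples
        # Only the last (leaf) frame was actually executing
        self_time[frames[-1]] += samples
    return total, self_time


# Hypothetical profile: three stacks with their sample counts
folded = [
    "handle_request;render_page;serialize 70",
    "handle_request;render_page;cursor_execute 20",
    "handle_request;check_auth 10",
]
total, self_time = summarize_folded(folded)
print(self_time.most_common(3))
```

Total count corresponds to a box's width, while self count corresponds to the exposed plateau with nothing stacked on it. A frame that is wide but has little self time is merely a caller; the frame with the largest self count is where the CPU actually went.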

django-silk: per-request query tracing

Where py-spy shows CPU, django-silk shows the request lifecycle — and for most Django apps, the bottleneck is the database, not CPU. Silk records every request, every SQL query it issued, each query's duration, and crucially the duplicate queries that are the signature of an N+1 problem:

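pip install django-silk
# settings.py
INSTALLED_APPS += ["silk"]
MIDDLEWARE = ["silk.middleware.SilkyMiddleware"] + MIDDLEWARE
# urls.py (requires: from django.urls import include, path)
urlpatterns += [path("silk/", include("silk.urls", namespace="silk"))]
# then create Silk's tables
python manage.py migrate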

Silk's interface will show you a page running "73 queries, 68 duplicates" where there should be three — an instant, unambiguous N+1 diagnosis. You add the missing select_related or prefetch_related, reload, and watch the count collapse. Because it stores request data and adds overhead, run Silk on staging or behind admin-only access rather than wide-open in production.
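If you do expose Silk beyond a local machine, a settings fragment along these lines is one way to lock it down and limit what it records. The setting names are Silk's own configuration options; the specific values (10 percent sampling, a 10,000-request cap) are assumptions chosen for illustration, not recommendations from the Silk project.

```python
# settings.py (sketch): restrict who can open Silk and how much it records
SILKY_AUTHENTICATION = True   # users must be logged in to view Silk
SILKY_AUTHORISATION = True    # and must pass the permission check below
SILKY_PERMISSIONS = lambda user: user.is_superuser

# Assumed values, tune for your own traffic and storage
SILKY_INTERCEPT_PERCENT = 10          # record roughly one request in ten
SILKY_MAX_RECORDED_REQUESTS = 10_000  # cap the number of stored requests
```

Sampling a fraction of requests keeps the overhead and storage bounded while still catching an endpoint that is consistently query-heavy.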

The N+1 query, the most common culprit

If there is one performance bug that dominates Django, it is the N+1 query: code that runs one query to fetch a list and then one more query per item to fetch a related object, turning a 50-item page into 51 queries. It hides easily because each individual query is fast and the ORM's lazy loading makes the extra queries invisible in the code. Silk surfaces them by showing duplicates; the fix is almost always select_related for forward foreign keys and one-to-one relations, or prefetch_related for reverse foreign keys and many-to-many relations. Learning to spot and kill N+1 queries is the highest-leverage Django performance skill there is.
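The shape of the problem is easier to see with a query counter attached. The sketch below deliberately uses the standard-library sqlite3 module rather than the Django ORM so that it runs on its own; the author and book tables are hypothetical. The first function mimics what lazy loading does behind the scenes, and the second mimics the single joined query that select_related would produce.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT,
                       author_id INTEGER REFERENCES author(id));
""")
conn.executemany("INSERT INTO author VALUES (?, ?)",
                 [(i, f"Author {i}") for i in range(1, 51)])
conn.executemany("INSERT INTO book VALUES (?, ?, ?)",
                 [(i, f"Book {i}", i) for i in range(1, 51)])
conn.commit()

queries = []
conn.set_trace_callback(queries.append)  # record every SQL statement from here on


def n_plus_one():
    # One query for the list, then one more query per book for its author
    books = conn.execute("SELECT title, author_id FROM book").fetchall()
    return [
        (title, conn.execute("SELECT name FROM author WHERE id = ?",
                             (author_id,)).fetchone()[0])
        for title, author_id in books
    ]


def joined():
    # A single query that fetches each book together with its author
    return conn.execute(
        "SELECT book.title, author.name FROM book "
        "JOIN author ON author.id = book.author_id"
    ).fetchall()


queries.clear()
slow = n_plus_one()
n_plus_one_count = len(queries)

queries.clear()
fast = joined()
joined_count = len(queries)

print(n_plus_one_count, joined_count)
```

Both functions return the same data; only the number of round trips differs. In ORM terms, the second version corresponds to something like Book.objects.select_related("author"), assuming a Book model with an author foreign key.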

Profiling locally with precision

Production sampling tells you where time goes under real load; local deterministic profiling gives you exact call counts and timings for a specific code path you are optimizing. Python's built-in cProfile, or a tool like line_profiler for line-by-line detail, lets you drill into a single function with precision that sampling cannot match. Use the two together: py-spy in production points you at the hot path, and a deterministic local profiler tells you exactly which calls within it to attack. Reproduce the slow operation locally with representative data and you can iterate quickly without touching production at all.
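As a minimal sketch of the deterministic side, the following profiles one code path with the built-in cProfile and prints the top entries sorted by cumulative time. The functions build_report and slow_helper are stand-ins invented for the example; you would substitute the code path that production sampling pointed you at.

```python
import cProfile
import io
import pstats


def slow_helper(n):
    # Stand-in for an innocent-looking helper that is called many times
    return sum(i * i for i in range(n))


def build_report(rows):
    # Stand-in for the hot path identified in production
    return [slow_helper(200) for _ in range(rows)]


profiler = cProfile.Profile()
profiler.enable()
build_report(1000)
profiler.disable()

# Render the ten most expensive entries by cumulative time into a string
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

The ncalls column is what sampling cannot give you: it shows exact call counts, so a helper that runs a thousand times per request stands out even when each call is individually cheap.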

Memory profiling

Not every performance problem is about speed; some are about memory. A worker whose memory grows steadily until it is killed and restarted has a leak or an unbounded accumulation — often a cache without a limit, a global list that keeps growing, or loading a huge queryset entirely into memory instead of iterating. Tools like tracemalloc (built in) and memray show you what is allocating and where it is held. For the common Django case of processing a large table, the fix is usually to stream with iterator() or process in batches rather than materializing millions of rows at once. Memory issues manifest as mysterious restarts and OOM kills, so watch your workers' memory trend alongside their CPU.

A real profiling workflow

Putting it together, an effective investigation follows a clear sequence. First, confirm the symptom with metrics so you profile the right endpoint or job. Second, run py-spy top on a busy worker to learn whether you are CPU-bound or waiting. Third, if you are waiting on the database, point django-silk at the slow endpoint to find the offending or duplicated queries. Fourth, if you are CPU-bound, record a flame graph and attack the widest bar. Fifth — and this is the step people skip — change exactly one thing, then measure again to confirm it actually helped. Optimizing without a before-and-after number is how you add complexity that does not improve anything.

Continuous profiling

Profiling does not have to be a reactive, incident-driven activity. Continuous profiling tools run a low-overhead sampler across your fleet all the time, storing flame graphs you can look back on, so when latency creeps up over a week you can compare today's profile to last month's and see exactly what changed. This turns performance from a fire you fight into a property you monitor, catching slow regressions before they become incidents. Even a lightweight version — periodically recording py-spy profiles of a representative worker — gives you a historical baseline that makes the next investigation far faster.
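The lightweight version could be as simple as a scheduled script that records a timestamped profile of one worker. This sketch only uses the py-spy record flags shown earlier in this article; the helper names, the output directory, and the file naming scheme are assumptions. It expects py-spy to be installed and the script to run with permission to attach to the target process.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def build_record_command(pid, out_dir, duration=30, now=None):
    """Build a py-spy record command that writes a timestamped flame graph."""
    now = now or datetime.now(timezone.utc)
    out_file = Path(out_dir) / f"profile-{pid}-{now:%Y%m%dT%H%M%SZ}.svg"
    return [
        "py-spy", "record",
        "-o", str(out_file),
        "--pid", str(pid),
        "--duration", str(duration),
    ]


def record_once(pid, out_dir):
    """Record one profile; meant to be called from cron or another scheduler."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_record_command(pid, out_dir), check=True)
```

Calling record_once on a schedule leaves a directory of dated flame graphs, which is enough to compare this week's profile against last month's.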

Database query profiling with EXPLAIN

When the bottleneck is a specific slow query, the profiler hands off to the database's own tool: EXPLAIN (ANALYZE, BUFFERS). It shows the query plan: whether PostgreSQL is using an index or scanning the whole table, where the time and I/O actually go, and how its row estimates compare to reality. Keep in mind that ANALYZE actually executes the statement, so wrap data-modifying queries in a transaction that you roll back. A sequential scan on a large table where you expected an index scan is the classic finding, often fixed by adding the right index. Learning to read a query plan is the natural complement to django-silk: Silk tells you which query is slow, and EXPLAIN tells you why. Together they turn database performance from guesswork into diagnosis.
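To show the before-and-after of adding an index in a form that runs anywhere, this sketch swaps in SQLite's EXPLAIN QUERY PLAN as a stand-in for PostgreSQL's EXPLAIN (ANALYZE, BUFFERS). The output format is different and far less detailed than PostgreSQL's, but the scan-versus-index distinction is the same idea. The orders table and index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)


def plan(sql):
    # The human-readable description is the last column of each plan row
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


query = "SELECT * FROM orders WHERE customer_id = 42"

plan_before = plan(query)  # no usable index yet: a full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(query)   # the planner can now search via the index

print(plan_before)
print(plan_after)
```

From Django, QuerySet.explain() asks the configured database for the plan of a queryset, so on PostgreSQL you can get the real thing without leaving the shell.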

Caching as the fix, not just the diagnosis

Often the result of profiling is not "make this query faster" but "stop running this query so often." When you find an expensive operation on a hot path whose inputs change slowly, caching the result is frequently the highest-leverage fix. The profile tells you exactly what to cache — the specific resolver, queryset, or computed value eating the time — and how often it is called, which informs the cache strategy and TTL. Profiling and caching work hand in hand: the profile identifies the costly, repeated work, and the cache removes it. Just remember that caching adds an invalidation problem, so cache where staleness is acceptable and key entries carefully.
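As an illustration of caching a hot, slowly changing computation, here is a minimal in-process TTL cache decorator. It is a sketch under simplifying assumptions: positional hashable arguments only, no size limit, and no sharing between worker processes. The top_sellers function and the 60-second TTL are invented for the example.

```python
import time
from functools import wraps


def ttl_cache(ttl_seconds):
    """Cache a function's results per argument tuple for ttl_seconds."""
    def decorator(func):
        store = {}  # args -> (time stored, value); unbounded in this sketch

        @wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]  # still fresh: skip the expensive call
            value = func(*args)
            store[args] = (now, value)
            return value

        return wrapper
    return decorator


calls = 0


@ttl_cache(ttl_seconds=60)
def top_sellers(category_id):
    # Stand-in for the expensive query the profile pointed at
    global calls
    calls += 1
    return [f"product-{category_id}-{i}" for i in range(3)]


top_sellers(7)
top_sellers(7)  # served from the cache
top_sellers(8)  # different key, so the function runs again
print(calls)
```

In a real Django deployment you would more likely reach for the cache framework (for example cache.get_or_set with a timeout) so that entries are shared across workers and bounded by the cache backend; the decorator above only shows the mechanism.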

Profiling I/O-bound and async code

Not all slowness is CPU. Much of a web app's latency is spent waiting — on the database, on external APIs, on the network — and a CPU profiler shows that wait as idle time rather than a hot function. For I/O-bound work, the question shifts from "what is computing" to "what is it waiting on, and can the waits overlap." Async views and concurrent requests to independent services can collapse serial waits into parallel ones, turning three 200ms calls made one after another into a single 200ms window. Profile with the wait in mind: a request that is slow but shows little CPU is an I/O problem, and the fix is usually concurrency or a faster dependency, not faster code.
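The three-calls example can be reproduced with asyncio, using asyncio.sleep as a stand-in for a network round trip. The service names are hypothetical, and the sketch assumes the three calls are genuinely independent of each other.

```python
import asyncio
import time

SERVICES = ("users", "orders", "inventory")  # hypothetical independent services


async def call_service(name, delay=0.2):
    await asyncio.sleep(delay)  # stands in for a 200ms network round trip
    return name


async def serial():
    # Each call waits for the previous one to finish
    return [await call_service(name) for name in SERVICES]


async def concurrent():
    # All three waits overlap, so the total is roughly the slowest call
    return await asyncio.gather(*(call_service(name) for name in SERVICES))


def timed(coro_func):
    start = time.perf_counter()
    result = asyncio.run(coro_func())
    return result, time.perf_counter() - start


serial_result, serial_seconds = timed(serial)
concurrent_result, concurrent_seconds = timed(concurrent)
print(f"serial: {serial_seconds:.2f}s, concurrent: {concurrent_seconds:.2f}s")
```

Neither version uses meaningfully more CPU than the other, which is why a CPU profile alone would not reveal the difference.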

Profiling background tasks

Performance work tends to focus on web requests, but background tasks deserve the same scrutiny — a slow Celery task can back up a queue, delay everything behind it, and hold resources far longer than any request. The same tools apply: py-spy can attach to a worker process to see where a task spends its time, and the same query and caching analysis applies to the database work tasks do. Watch queue depth and task duration as first-class metrics, because a task that quietly grows from one second to thirty as data scales will eventually saturate your workers. The asynchronous nature that makes tasks useful also makes their slowness easy to overlook until the queue is on fire.

Performance budgets and SLOs

Profiling is most powerful when it is not purely reactive. Set explicit performance budgets — a target p95 latency for key endpoints, a maximum query count, a ceiling on task duration — and treat a breach as a bug, the same way you treat a failing test. Service-level objectives turn performance from a vague aspiration into a concrete, monitored commitment, and they tell you when to invest in profiling before users feel the pain. Combined with the query-budget tests from your test suite and continuous profiling in production, budgets close the loop: you define what fast means, measure against it continuously, and act the moment you drift, rather than waiting for complaints.
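One way to treat a breach like a failing test is to keep budgets as data and compare measurements against them in CI or a scheduled job. The metric names and limits below are purely illustrative assumptions, as is the find_breaches helper.

```python
# Hypothetical budgets; metric names and limits are illustrative only
BUDGETS = {
    "checkout_p95_ms": 300,
    "checkout_query_count": 12,
    "nightly_export_task_s": 120,
}


def find_breaches(measured, budgets=BUDGETS):
    """Return one message per measured metric that exceeds its budget."""
    return [
        f"{name}: measured {measured[name]} exceeds budget {limit}"
        for name, limit in budgets.items()
        if name in measured and measured[name] > limit
    ]


# Hypothetical measurements pulled from your metrics system
measured = {
    "checkout_p95_ms": 410,
    "checkout_query_count": 9,
    "nightly_export_task_s": 95,
}
breaches = find_breaches(measured)
for message in breaches:
    print(message)
```

A non-empty result can fail a build or open a ticket, which is what turns the budget from an aspiration into a commitment.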

Summary

Profile, do not guess — the cardinal rule, because optimizing the wrong thing is worse than doing nothing. Confirm the symptom with metrics first, then reach for the right tool: py-spy attaches to live production processes for CPU profiling and hung-worker diagnosis with no code changes, django-silk exposes the per-request queries and duplicate-query N+1 patterns that dominate Django slowness, and flame graphs point you straight at the widest, hottest path. Drill in with a deterministic local profiler, watch memory as well as CPU, and consider continuous profiling so regressions surface before incidents. Above all, change one thing at a time and always measure the delta — that discipline, more than any single tool, is what makes performance work effective instead of superstitious.