Why Your AI App Feels Slow (and the Latency Budget That F...

Open the network tab on a typical AI chat product and you will usually find something that should bother you more than it does. The model returned its first token in about 280 milliseconds. The user saw the first word on screen 2.1 seconds after hitting enter. So where did the other 1.8 seconds go?

That gap is the whole story. Your model can be state of the art and your product can still feel sluggish, because users never experience your model. They experience the wait, and the wait is almost always somewhere you are not looking. I have watched a team spend three weeks shaving 90 milliseconds off inference while a single un-pooled Postgres connection upstream quietly cost 400 on every request. The model was never the problem. The accounting was.

So let us do the accounting properly.

100ms: Below this, a response feels like a direct result of your action

1 second: Attention holds, but the user now feels the system working

10 seconds: The limit of patience; most people give up and context-switch

Those three thresholds come from decades of human-computer interaction research (Robert Miller’s 1968 work and Jakob Nielsen’s later summaries), and they have held up remarkably well across hardware generations. Your goal is not to make everything faster than 100 milliseconds, because some things physically cannot be. Your goal is to spend a fixed amount of the user’s patience where it actually buys you something.

Latency is a budget, not a number

Here is the reframe that changes how you build: stop treating speed as one number and start treating it as a budget spread across a chain of events. The time a user feels is the sum of a dozen smaller times stacked end to end. Each is a line item. Some you control, several you do not, and a few you have never measured.

“You cannot optimize a budget you have never itemized.”

The first rule of making things feel fast

Let me itemize a real request, because the shape of it surprises people every single time. Take a retrieval-augmented chat query: a user asks a question, your backend pulls relevant context from a vector store, and a hosted model streams an answer.

Connection setup, before a single byte of yours moves

On a cold connection the browser pays for a DNS lookup (often 20 to 120ms on a first, uncached resolution), a TCP handshake (one full round trip), and a TLS negotiation. TLS 1.3 needs one round trip to set up a new session; TLS 1.2 needed two. A returning client that resumes a TLS 1.3 session, or one that connects over QUIC (HTTP/3), can send data with 0-RTT and skip most of that setup. None of this is code you wrote, and on a fresh cross-region connection it can total 150 to 300ms.
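To make that arithmetic concrete, here is a minimal sketch of the cold-connection cost as a function of round-trip time. The function name `cold_connection_ms` and the sample inputs (an 80ms round trip and a 60ms DNS lookup) are my own illustrative assumptions, and the model ignores packet loss and server processing time.

```python
def cold_connection_ms(rtt_ms: float, dns_ms: float, tls_round_trips: int = 1) -> float:
    """Rough setup cost before the first request byte can be sent.

    Assumes DNS, TCP, and TLS happen strictly in sequence and ignores
    packet loss and server-side processing time.
    """
    tcp_ms = rtt_ms                    # TCP handshake: one full round trip
    tls_ms = tls_round_trips * rtt_ms  # TLS 1.3: 1 round trip, TLS 1.2: 2
    return dns_ms + tcp_ms + tls_ms


# Hypothetical cross-region client: 80 ms round trip, 60 ms uncached DNS lookup.
tls13_ms = cold_connection_ms(rtt_ms=80, dns_ms=60, tls_round_trips=1)  # 220
tls12_ms = cold_connection_ms(rtt_ms=80, dns_ms=60, tls_round_trips=2)  # 300
```

Both results sit inside the 150 to 300ms range above, and every term except DNS scales with the round trip, which is why distance makes setup so expensive.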

The network round trip, where physics gets a vote

Light in fiber travels at roughly two-thirds of c, about 200,000 km per second, so every 1,000 km of distance costs about 5ms each way before any processing. In practice: same-region hops land under 5ms, US coast-to-coast round trips run 60 to 80ms, New York to London is around 70 to 90ms, and US to India commonly exceeds 200ms. You cannot beat the speed of light. You can refuse to pay it five times in series.
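The physics floor is easy to compute yourself. This sketch uses only the 200,000 km per second figure above; the function names are hypothetical, and real routes are longer than the straight-line distance, so treat the result as a lower bound rather than a prediction.

```python
FIBER_KM_PER_MS = 200.0  # ~200,000 km/s, roughly two-thirds of c


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time from propagation delay alone."""
    return 2 * distance_km / FIBER_KM_PER_MS


def serial_round_trips_ms(distance_km: float, trips: int) -> float:
    """Cost of paying the same round trip several times in sequence."""
    return trips * min_round_trip_ms(distance_km)


one_hop_ms = min_round_trip_ms(1_000)            # 10.0 (5 ms each way)
five_far_ms = serial_round_trips_ms(12_000, 5)   # 600.0
```

Five serial round trips to a user 12,000 km away cost at least 600ms before any server has done any work, which is the case for collapsing them into one.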

The edge, auth, and the cold-start tax

Gateway routing, a JWT verification, a rate-limit check. Cheap when everything is warm. Brutal when a serverless function cold-starts: a lean Node or Python function typically adds 100 to 400ms, and a heavy runtime, a large dependency tree, or a VPC attachment can push that into multiple seconds on the unlucky request.

Context gathering, the silent killer for AI apps

This is where the budget quietly disappears. An approximate-nearest-neighbor search over an HNSW index is genuinely fast (single-digit milliseconds for millions of vectors held in memory, more once it spills to disk). The damage is rarely the search itself. It is the three internal service calls made one after another, the N+1 query fetching user history, and the embedding call for the query that you forgot to count. Four small serial waits routinely add up to more than the model.

The model, finally, in two distinct phases

LLM inference is not one cost, it is two. Prefill processes your whole prompt in parallel and produces the first token; its cost scales with prompt length, so a bloated 8,000-token context hurts time-to-first-token directly. Then decode generates the rest sequentially, one token at a time, at maybe 20 to 100 tokens per second depending on model size and hardware. The first token is the moment that matters. Everything after it can be watched.
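A back-of-the-envelope model of those two phases, as a sketch: the function name and the sample numbers (a 400ms time-to-first-token, 250 further tokens, 50 tokens per second) are assumptions I picked for illustration, not measurements.

```python
def response_timeline_ms(ttft_ms: float, remaining_tokens: int,
                         tokens_per_second: float) -> tuple[float, float]:
    """Return (first token visible at, last token visible at) in ms.

    ttft_ms covers everything up to and including prefill;
    remaining_tokens are the tokens decoded after the first one.
    """
    decode_ms = remaining_tokens / tokens_per_second * 1000
    return ttft_ms, ttft_ms + decode_ms


first_ms, last_ms = response_timeline_ms(ttft_ms=400, remaining_tokens=250,
                                         tokens_per_second=50)
# first_ms == 400, last_ms == 5400.0
```

With streaming, the user starts reading at the first number. Without it, they wait for the second.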

Notice what just happened. We reached the fifth stage before the model did anything, and four of those five are things teams routinely never measure. The AI is the headline act. The opening bands played for most of the evening.

Where the milliseconds actually go

Draw your request as a waterfall, where each stage starts, takes its slice, and hands off. The total height is what the user feels.

A representative RAG chat request. Connection, network, and context gathering consume the budget long before decode does. Illustrative figures in the typical ranges named above; measure your own, because they will differ.

Two things jump out once you draw your own version. First, the context-gathering bar is almost always wider than anyone guessed, because nobody timed the embedding call plus the three serial lookups hiding inside it. Second, time-to-first-token matters far more than total generation time, because the instant the first token appears the user knows the system is alive and working. That instant, around 855ms in the chart above, is the number you are really racing toward. The remaining 900ms of decode is spent while the user is already reading.
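If you want to build the same waterfall from your own timings, a sketch like this is enough. Only the 380ms context-gathering bar, the 855ms first-token mark, and the 900ms decode come from the figures quoted above; the other stage durations are placeholders I chose so the sums line up, and the function name is hypothetical.

```python
from itertools import accumulate

# (stage, duration in ms). Durations marked "placeholder" are assumptions.
STAGES = [
    ("connection setup", 200),    # placeholder
    ("network round trip", 75),   # placeholder
    ("edge and auth", 50),        # placeholder
    ("context gathering", 380),   # from the text
    ("prefill", 150),             # placeholder
    ("decode", 900),              # from the text
]


def waterfall(stages):
    """Turn serial stage durations into (name, start_ms, end_ms) rows."""
    ends = list(accumulate(ms for _, ms in stages))
    return [(name, end - ms, end) for (name, ms), end in zip(stages, ends)]


rows = waterfall(STAGES)
ttft_ms = rows[-2][2]   # end of prefill: 855
total_ms = rows[-1][2]  # end of decode: 1755
```

Swap in your measured durations and the two numbers at the bottom tell you when the user first sees a word and when the answer is complete.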

Perceived speed beats real speed

Here is the liberating part. You often do not have to make things faster. You have to make them feel faster, and human time perception is wildly manipulable. Three techniques do most of the work.

Stream the first token the instant prefill finishes. This is the single highest-leverage move in AI UX. Send the response over Server-Sent Events or a chunked HTTP stream so the user sees words at 855ms instead of waiting 1.75 seconds for the whole answer. A response that starts in under a second and finishes in two feels dramatically faster than one that lands all at once at 1.5, even though the streaming version technically takes longer to complete. The brain rewards “it started.”
Show structure before content. A skeleton layout or a real progress signal converts dead waiting into legible waiting. The wait feels shorter when the user can see the system is busy on their behalf rather than frozen. The keyword is real: a fake progress bar that does not track actual work erodes trust the second time someone notices.
Refuse to block on your slowest dependency. If you need three pieces of context and one lives in a slow service, render with the two fast ones and stream the third in when it arrives. Fire independent calls concurrently instead of awaiting them in sequence. In the waterfall above, running the embedding and the three lookups in parallel rather than in series can collapse that 380ms bar to roughly the slowest single call, maybe 150ms, with no new infrastructure at all.
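For the streaming technique, the wire format is simpler than people expect. A Server-Sent Events frame is one or more `data:` lines followed by a blank line. The sketch below only formats frames; the function name and the `{"token": ...}` payload shape are my own assumptions, and you would hand the generator to whatever streaming response your web framework provides.

```python
import json
from typing import Iterable, Iterator


def sse_frames(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each model token in a Server-Sent Events frame as it arrives.

    JSON-encoding the payload keeps raw newlines out of the frame, since a
    bare newline inside a data line would end the field early.
    """
    for token in tokens:
        payload = json.dumps({"token": token})
        yield f"data: {payload}\n\n"  # the blank line ends the event


frames = list(sse_frames(["Hello", ",\nworld"]))
```

Because this is a generator, the first frame can leave the server as soon as the first token exists, instead of after the last one.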

“A response that starts in 855 milliseconds and finishes in two seconds beats one that lands all at once at 1.5. Starting is the whole game.”

The second rule of making things feel fast

Underneath all three sits one principle: move work off the critical path. The critical path is the chain of things that must finish before the user sees a useful result. Anything you can do before the user asks (prefetch context, warm the function, keep a connection pool hot) or after the first useful token (lazy-load citations, hydrate the rest) no longer costs you any felt latency. Fast systems are mostly the art of shrinking that path.

A practical playbook

Here is the sequence I would actually run on a sluggish AI product, in order, because order matters. You cannot fix what you have not measured, and you should not touch what does not move the number.

Instrument the whole waterfall, not just the model

Put a timer around every stage: connection, each backend call, the embedding, the vector search, prefill (time to first token), and decode. Emit them as spans you can read per request. Most teams are flying blind on four of these six, which is exactly why they optimize the wrong one.
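A minimal sketch of that instrumentation, using only the standard library. The class name `RequestTrace` and the stage names are hypothetical; in production you would more likely emit these spans through your tracing system than keep them in a dictionary.

```python
import time
from contextlib import contextmanager


class RequestTrace:
    """Collects one wall-clock duration (ms) per named stage of a request."""

    def __init__(self):
        self.spans = {}

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised.
            self.spans[name] = (time.perf_counter() - start) * 1000

    def widest(self):
        """Name of the stage that took the longest."""
        return max(self.spans, key=self.spans.get)


trace = RequestTrace()
with trace.span("vector_search"):
    time.sleep(0.005)   # stand-in for the real call
with trace.span("history_fetch"):
    time.sleep(0.03)    # stand-in for the real call
```

After the request, `trace.spans` holds one number per stage and `trace.widest()` names the bar to look at first.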

Find the widest bar at p95, not p50

Sort by the tail, where your heaviest users live, and target the single widest bar. It is very often context gathering, and it is very often a surprise to the person who wrote it.
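Here is a small sketch of why the percentile you sort by changes the answer. It uses a nearest-rank percentile, and the sample latencies are invented for illustration: a steady 300ms model stage next to a context stage that is usually 120ms but 900ms on one request in ten.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    rank = max(1, (p * len(ordered) + 99) // 100)  # ceil(p/100 * n)
    return ordered[rank - 1]


def widest_stage(stage_samples, p):
    """Stage with the largest latency at the given percentile."""
    return max(stage_samples, key=lambda s: percentile(stage_samples[s], p))


# Hypothetical per-stage latencies in ms over 100 requests.
samples = {
    "model": [300] * 100,
    "context": [120] * 90 + [900] * 10,
}

at_p50 = widest_stage(samples, 50)  # "model"
at_p95 = widest_stage(samples, 95)  # "context"
```

At the median the model looks like the problem. At p95 the context stage is three times wider, and that is the request your unhappiest users are having.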

Parallelize everything that has no data dependency

Serial awaits are the most common self-inflicted wound in this whole stack. If the embedding, the user-history fetch, and the permissions check do not depend on one another, issue them concurrently. This alone routinely removes hundreds of milliseconds for the price of a refactor.
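The refactor is usually this small. In the sketch below, `fake_call` stands in for real I/O and the delays are made up; the only real API is `asyncio.gather`, which runs independent awaitables concurrently and returns their results in the order they were passed.

```python
import asyncio
import time


async def fake_call(name: str, delay_s: float) -> str:
    """Stand-in for a network call; the delays are invented."""
    await asyncio.sleep(delay_s)
    return name


async def gather_context_serial() -> list:
    # Each await blocks the next one: total is roughly the sum of the delays.
    embedding = await fake_call("embedding", 0.05)
    history = await fake_call("history", 0.03)
    permissions = await fake_call("permissions", 0.02)
    return [embedding, history, permissions]


async def gather_context_concurrent() -> list:
    # No data dependency between the calls, so issue them together:
    # total is roughly the slowest single call.
    return await asyncio.gather(
        fake_call("embedding", 0.05),
        fake_call("history", 0.03),
        fake_call("permissions", 0.02),
    )
```

Running each with `asyncio.run(...)` returns the same three results, in about 100ms for the serial version and about 50ms for the concurrent one. Anything with a true dependency, such as a vector search that needs the embedding first, still has to wait its turn.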

Kill the cold starts and reuse connections

Keep critical functions warm (provisioned concurrency, a ping, or a long-lived process) and pool your database and HTTP connections so you stop paying TCP and TLS setup on every call. Connection reuse with HTTP keep-alive turns a recurring 70ms tax into a one-time cost.
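To show the idea behind pooling, here is a deliberately tiny sketch. `ConnectionPool` is a hypothetical class, not any library's API, and it skips health checks, timeouts, and a cap on total connections. In a real service you would use the pool that ships with your database driver or HTTP client.

```python
import queue


class ConnectionPool:
    """Hands out idle connections before paying to open a new one."""

    def __init__(self, connect, max_idle=4):
        self._connect = connect                  # callable that opens a connection
        self._idle = queue.LifoQueue(maxsize=max_idle)
        self.created = 0                         # how many times setup was paid

    def acquire(self):
        try:
            return self._idle.get_nowait()       # reuse: no TCP or TLS setup
        except queue.Empty:
            self.created += 1
            return self._connect()               # cold path: full setup cost

    def release(self, conn):
        try:
            self._idle.put_nowait(conn)
        except queue.Full:
            pass  # a real pool would close the surplus connection here


pool = ConnectionPool(connect=object)  # object() stands in for a real connection
for _ in range(100):
    conn = pool.acquire()
    pool.release(conn)
# pool.created == 1: setup was paid once, not 100 times
```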

Trim the prompt, then stream

Prefill scales with prompt length, so a 2,000-token context reaches the first token meaningfully sooner than an 8,000-token one. Cut the context to what the model actually needs, then stream the output so the user starts reading at first token instead of last.
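One simple way to enforce a prompt budget, sketched under stated assumptions: the chunks arrive already sorted most relevant first, and tokens are approximated by whitespace-separated words. Both are stand-ins; use your model's real tokenizer for the count. The function name is hypothetical.

```python
def trim_context(chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the most relevant chunks that fit inside a token budget.

    Assumes chunks are ordered most relevant first and stops at the first
    chunk that would overflow the budget. The default counter is a crude
    word count, not a real tokenizer.
    """
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


ranked = ["refund policy text here", "shipping times", "company history and founding story"]
prompt_chunks = trim_context(ranked, max_tokens=6)
# keeps the first two chunks (4 + 2 words) and drops the third
```

Stopping at the first overflow keeps the relevance order intact; skipping the oversized chunk and continuing is a reasonable alternative if your chunks vary a lot in length.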

The part nobody likes to hear

Some latency you cannot remove. The speed of light is not in your backlog, and a user on a congested mobile network 12,000 km from your nearest region will have a slower experience than someone on fiber next to your data center. No amount of clever engineering erases that gap completely.

That is freeing, not discouraging. Once you accept that a slice of the budget is fixed, you stop wrestling physics and spend your energy on the parts you own: the serial calls you can parallelize, the cold starts you can warm, the prompt you can trim, the connection you can reuse, and the perception you can shape with a stream. The fastest-feeling AI products are not the ones with the fastest models. They are the ones that respect the user’s budget of patience and spend it with taste.

That 1.8-second gap from the opening was never a model problem. It was a budget nobody itemized, spent in places nobody timed. Itemize it, move the spend onto things the user can watch, and the app stops feeling slow. Not because the milliseconds vanished, but because you finally put them where they count.

The short version

Latency is a budget, not a number. Itemize the full request waterfall with a timer on every stage before you optimize anything.
At the median, the model is usually not the bottleneck. Connection setup, serial context calls, and cold starts are.
Time-to-first-token is the number that matters. Prefill scales with prompt length, so trim the context; decode happens while the user is already reading.
Measure p95 and p99, not the average. Your heaviest users live in the tail and feel the worst of it.
Parallelize independent calls, pool connections, and warm cold functions. These remove hundreds of milliseconds for the price of a refactor.
Perceived speed beats real speed. Stream the first token over SSE, show real structure while waiting, and never block on your slowest dependency.
Some latency is physics. Accept it, and spend your effort on the parts you actually control.

Why Your AI App Feels Slow (and the Latency Budget That Fixes It)


Anshad Ameenza
