Why Your AI App Feels Slow (and the Latency Budget That Fixes It)
A deeply technical walkthrough of AI app latency: the real cost of TLS handshakes, cross-region round trips, cold starts, vector search, and LLM time-to-first-token, plus the budget framework that makes a product feel instant.
Open the network tab on a typical AI chat product and you will usually find something that should bother you more than it does. The model returned its first token in about 280 milliseconds. The user saw the first word on screen 2.1 seconds after hitting enter. So where did the other 1.8 seconds go?
That gap is the whole story. Your model can be state of the art and your product can still feel sluggish, because users never experience your model. They experience the wait, and the wait is almost always somewhere you are not looking. I have watched a team spend three weeks shaving 90 milliseconds off inference while a single un-pooled Postgres connection upstream quietly cost 400 milliseconds on every request. The model was never the problem. The accounting was.
So let us do the accounting properly.
100 milliseconds: below this, a response feels like a direct result of your action
1 second: attention holds, but the user now feels the system working
10 seconds: the limit of patience; most people give up and context-switch
Those three thresholds come from decades of human-computer interaction research (Robert Miller’s 1968 work and Jakob Nielsen’s later summaries), and they have held up remarkably well across hardware generations. Your goal is not to make everything faster than 100 milliseconds, because some things physically cannot be. Your goal is to spend a fixed amount of the user’s patience where it actually buys you something.
Latency is a budget, not a number
Here is the reframe that changes how you build: stop treating speed as one number and start treating it as a budget spread across a chain of events. The time a user feels is the sum of a dozen smaller times stacked end to end. Each is a line item. Some you control, several you do not, and a few you have never measured.
“You cannot optimize a budget you have never itemized.”
Let me itemize a real request, because the shape of it surprises people every single time. Take a retrieval-augmented chat query: a user asks a question, your backend pulls relevant context from a vector store, and a hosted model streams an answer.
Connection setup, before a single byte of yours moves
On a cold connection the browser pays for a DNS lookup (often 20 to 120ms on a first, uncached resolution), a TCP handshake (one full round trip), and a TLS negotiation. TLS 1.3 needs one round trip to set up a session; TLS 1.2 needs two. On a repeat visit, TLS 1.3 session resumption can send data with no extra TLS round trip (0-RTT), and QUIC (HTTP/3) folds the transport and TLS handshakes into one, so a returning client can skip most of this. None of it is code you wrote, and on a fresh cross-region connection it can total 150 to 300ms.
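The arithmetic is easy to sketch. This is a rough model, not a measurement: the 80ms round trip and 60ms DNS lookup are values I assumed for a cross-region client, and `cold_connection_ms` is a hypothetical helper name.

```python
def cold_connection_ms(rtt_ms: float, dns_ms: float, tls_round_trips: int) -> float:
    """Time spent on setup before the first request byte can leave a new connection."""
    tcp_round_trips = 1  # the TCP handshake costs one full round trip
    return dns_ms + (tcp_round_trips + tls_round_trips) * rtt_ms

RTT_MS = 80.0  # assumed cross-region round trip
DNS_MS = 60.0  # assumed uncached lookup, inside the 20 to 120ms range

tls13 = cold_connection_ms(RTT_MS, DNS_MS, tls_round_trips=1)
tls12 = cold_connection_ms(RTT_MS, DNS_MS, tls_round_trips=2)
print(f"TLS 1.3: {tls13:.0f}ms, TLS 1.2: {tls12:.0f}ms")  # TLS 1.3: 220ms, TLS 1.2: 300ms
```

Both results land inside the 150 to 300ms range, and the round-trip term dominates, which is why distance matters so much in the next stage.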
The network round trip, where physics gets a vote
Light in fiber travels at roughly two-thirds of c, about 200,000 km per second, so every 1,000 km of distance costs about 5ms each way before any processing. In practice: same-region hops land under 5ms, US coast-to-coast round trips run 60 to 80ms, New York to London is around 70 to 90ms, and US to India commonly exceeds 200ms. You cannot beat the speed of light. You can refuse to pay it five times in series.
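You can compute the floor for any distance yourself. This sketch uses only the 200,000 km per second figure; `min_round_trip_ms` is a hypothetical helper, and real routes come in higher because cables do not follow straight lines and routers add their own delay.

```python
FIBER_KM_PER_MS = 200.0  # about 200,000 km per second, roughly two-thirds of c

def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber, ignoring routing and processing."""
    return 2 * distance_km / FIBER_KM_PER_MS

print(min_round_trip_ms(1_000))   # 10.0
print(min_round_trip_ms(12_000))  # 120.0
```

Pay that floor five times in series and a user 12,000 km away has spent 600ms before your code has done anything.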
The edge, auth, and the cold-start tax
Gateway routing, a JWT verification, a rate-limit check. Cheap when everything is warm. Brutal when a serverless function cold-starts: a lean Node or Python function typically adds 100 to 400ms, and a heavy runtime, a large dependency tree, or a VPC attachment can push that into multiple seconds on the unlucky request.
Context gathering, the silent killer for AI apps
This is where the budget quietly disappears. An approximate-nearest-neighbor search over an HNSW index is genuinely fast (single-digit milliseconds for millions of vectors held in memory, more once it spills to disk). The damage is rarely the search itself. It is the three internal service calls made one after another, the N+1 query fetching user history, and the embedding call for the query that you forgot to count. Four small serial waits routinely add up to more than the model.
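The N+1 pattern is easy to miss in review, so here is a small sketch of it. The `messages` table and both function names are made up for illustration, and it runs against an in-memory SQLite database where queries are nearly free. Against a networked database, every extra query is another round trip.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)")
db.executemany(
    "INSERT INTO messages (user_id, body) VALUES (?, ?)",
    [(1, "hi"), (1, "what is HNSW?"), (2, "unrelated"), (1, "thanks")],
)

def history_n_plus_one(user_id):
    """One query for the ids, then one more per id: N+1 round trips."""
    ids = [row[0] for row in db.execute(
        "SELECT id FROM messages WHERE user_id = ? ORDER BY id", (user_id,))]
    bodies = [
        db.execute("SELECT body FROM messages WHERE id = ?", (i,)).fetchone()[0]
        for i in ids
    ]
    return bodies, 1 + len(ids)

def history_batched(user_id):
    """The same rows in a single round trip."""
    rows = db.execute(
        "SELECT body FROM messages WHERE user_id = ? ORDER BY id", (user_id,))
    return [row[0] for row in rows], 1

print(history_n_plus_one(1))  # (['hi', 'what is HNSW?', 'thanks'], 4)
print(history_batched(1))     # (['hi', 'what is HNSW?', 'thanks'], 1)
```

Same answer, a quarter of the round trips, and the gap widens with every message the user has ever sent.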
The model, finally, in two distinct phases
LLM inference is not one cost, it is two. Prefill processes your whole prompt in parallel and produces the first token; its cost scales with prompt length, so a bloated 8,000-token context hurts time-to-first-token directly. Then decode generates the rest sequentially, one token at a time, at maybe 20 to 100 tokens per second depending on model size and hardware. The first token is the moment that matters. Everything after it can be watched.
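A first-order model of the two phases, as a sketch. It treats prefill as linear in prompt length, which is a simplification, and the 10,000 tokens per second prefill throughput is a number I made up for illustration. Only the decode rate is taken from the 20 to 100 tokens per second range.

```python
def first_token_ms(prompt_tokens: int, prefill_tokens_per_s: float) -> float:
    """Prefill: the whole prompt is processed before the first token appears."""
    return 1000 * prompt_tokens / prefill_tokens_per_s

def decode_ms(output_tokens: int, decode_tokens_per_s: float) -> float:
    """Decode: output tokens are generated one after another."""
    return 1000 * output_tokens / decode_tokens_per_s

PREFILL_TPS = 10_000.0  # assumed; depends heavily on model and hardware
DECODE_TPS = 50.0       # inside the 20 to 100 tokens per second range

print(first_token_ms(8_000, PREFILL_TPS))  # 800.0
print(first_token_ms(2_000, PREFILL_TPS))  # 200.0
print(decode_ms(45, DECODE_TPS))           # 900.0
```

Under these assumptions, trimming the prompt from 8,000 to 2,000 tokens buys back 600ms of time-to-first-token, while the decode time does not move at all.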
Notice what just happened. We reached the fifth stage before the model did anything, and four of those five are things teams routinely never measure. The AI is the headline act. The opening bands played for most of the evening.
Where the milliseconds actually go
Draw your request as a waterfall, where each stage starts, takes its slice, and hands off. The total width is what the user feels.
Two things jump out once you draw your own version. First, the context-gathering bar is almost always wider than anyone guessed, because nobody timed the embedding call plus the three serial lookups hiding inside it. Second, time-to-first-token matters far more than total generation time, because the instant the first token appears the user knows the system is alive and working. That instant, around 855ms in the chart above, is the number you are really racing toward. The remaining 900ms of decode is spent while the user is already reading.
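Here is one way to turn a waterfall into numbers you can argue about. Only three figures come from this article: the 380ms context-gathering bar, first token at about 855ms, and 900ms of decode. The other four durations are placeholders I picked so the sum lands on 855ms, so swap in your own measurements.

```python
from itertools import accumulate

STAGES_MS = {
    "connection setup": 180,    # placeholder
    "network round trip": 70,   # placeholder
    "edge and auth": 45,        # placeholder
    "context gathering": 380,   # from the article
    "prefill": 180,             # placeholder
    "decode": 900,              # from the article
}

def waterfall(stages):
    """Return (name, start_ms, end_ms) rows for stages that run back to back."""
    ends = list(accumulate(stages.values()))
    starts = [0] + ends[:-1]
    return list(zip(stages, starts, ends))

rows = waterfall(STAGES_MS)
for name, start, end in rows:
    print(f"{name:<20}{start:>6} -> {end:>5}ms")

# The first token appears when prefill ends.
first_token_at = next(end for name, _, end in rows if name == "prefill")

# The widest bar the user has to sit through before seeing anything.
before_first_token = {k: v for k, v in STAGES_MS.items() if k != "decode"}
widest = max(before_first_token, key=before_first_token.get)

print(first_token_at, widest)  # 855 context gathering
```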
Perceived speed beats real speed
Here is the liberating part. You often do not have to make things faster. You have to make them feel faster, and human time perception is wildly manipulable. Three techniques do most of the work.
- Stream the first token the instant prefill finishes. This is the single highest-leverage move in AI UX. Send the response over Server-Sent Events or a chunked HTTP stream so the user sees words at 855ms instead of waiting 1.75 seconds for the whole answer. A response that starts in under a second and finishes in two feels dramatically faster than one that lands all at once at 1.5, even though the streaming version technically takes longer to complete. The brain rewards “it started.”
- Show structure before content. A skeleton layout or a real progress signal converts dead waiting into legible waiting. The wait feels shorter when the user can see the system is busy on their behalf rather than frozen. The keyword is real: a fake progress bar that does not track actual work erodes trust the second time someone notices.
- Refuse to block on your slowest dependency. If you need three pieces of context and one lives in a slow service, render with the two fast ones and stream the third in when it arrives. Fire independent calls concurrently instead of awaiting them in sequence. In the waterfall above, running the embedding and the three lookups in parallel rather than in series can collapse that 380ms bar to roughly the slowest single call, maybe 150ms, with no new infrastructure at all.
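The third technique, rendering each piece as it arrives, can be sketched with `asyncio.as_completed`. The `fetch` coroutine and its delays are stand-ins I invented for real service calls. A real handler would push each result to the client where this one appends to a list.

```python
import asyncio

async def fetch(name: str, delay_s: float) -> str:
    await asyncio.sleep(delay_s)  # stand-in for a real network call
    return name

async def render_as_ready():
    # All three calls start immediately; the slow one is listed first on purpose.
    tasks = [
        asyncio.create_task(fetch("slow_docs", 0.15)),
        asyncio.create_task(fetch("profile", 0.01)),
        asyncio.create_task(fetch("history", 0.05)),
    ]
    rendered = []
    for next_done in asyncio.as_completed(tasks):
        rendered.append(await next_done)  # handle each result the moment it lands
    return rendered

order = asyncio.run(render_as_ready())
print(order)  # ['profile', 'history', 'slow_docs']
```

The fast pieces are on screen after about 50ms. The slow dependency still takes its 150ms, but nothing else waited for it.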
“A response that starts in 855 milliseconds and finishes in two seconds beats one that lands all at once at 1.5. Starting is the whole game.”
Underneath all three sits one principle: move work off the critical path. The critical path is the chain of things that must finish before the user sees a useful result. Anything you can do before the user asks (prefetch context, warm the function, keep a connection pool hot) or after the first useful token (lazy-load citations, hydrate the rest) no longer costs you any felt latency. Fast systems are mostly the art of shrinking that path.
A practical playbook
Here is the sequence I would actually run on a sluggish AI product, in order, because order matters. You cannot fix what you have not measured, and you should not touch what does not move the number.
Instrument the whole waterfall, not just the model
Put a timer around every stage: connection, each backend call, the embedding, the vector search, prefill (time to first token), and decode. Emit them as spans you can read per request. Most teams are flying blind on four of these six, which is exactly why they optimize the wrong one.
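A minimal sketch of per-request spans, using only the standard library. `RequestTrace` is a hypothetical class, and the `time.sleep` calls stand in for real work. In production you would probably hand these timings to whatever tracing system you already run.

```python
import time
from contextlib import contextmanager

class RequestTrace:
    """Collects one duration in milliseconds per named stage of a single request."""

    def __init__(self):
        self.spans = {}

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised.
            self.spans[name] = (time.perf_counter() - start) * 1000

trace = RequestTrace()
with trace.span("embedding"):
    time.sleep(0.02)   # stand-in for the embedding call
with trace.span("vector_search"):
    time.sleep(0.005)  # stand-in for the ANN lookup

for name, ms in trace.spans.items():
    print(f"{name:<15}{ms:7.1f}ms")
```

Wrap all six stages this way and log the dictionary once per request, and the widest bar stops being a matter of opinion.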
Find the widest bar at p95, not p50
Sort by the tail, where your heaviest users live, and target the single widest bar. It is very often context gathering, and it is very often a surprise to the person who wrote it.
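To see why the percentile matters, here is a sketch with made-up samples: a model stage that always takes 300ms, and a context stage that is usually 100ms but hits 900ms on one request in ten. The data and the `p50_p95` helper are illustrative only.

```python
import statistics

def p50_p95(samples_ms):
    """Return the 50th and 95th percentiles of a list of durations."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]  # cuts[i] is the (i + 1)th percentile

# Made-up per-request timings for two stages, 20 requests each.
SAMPLES = {
    "model": [300] * 20,
    "context_gathering": [100] * 18 + [900] * 2,
}

stats = {stage: p50_p95(ms) for stage, ms in SAMPLES.items()}
widest_p50 = max(stats, key=lambda stage: stats[stage][0])
widest_p95 = max(stats, key=lambda stage: stats[stage][1])

print(widest_p50)  # model
print(widest_p95)  # context_gathering
```

Rank by the median and you go optimize the model. Rank by the tail and you find the stage that is actually hurting people.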
Parallelize everything that has no data dependency
Serial awaits are the most common self-inflicted wound in this whole stack. If the embedding, the user-history fetch, and the permissions check do not depend on one another, issue them concurrently. This alone routinely removes hundreds of milliseconds for the price of a refactor.
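The refactor is usually this small. A sketch with `asyncio.gather`, where `call` and its timings are placeholders for your own embedding, history, and permissions calls:

```python
import asyncio
import time

async def call(name: str, delay_s: float) -> str:
    await asyncio.sleep(delay_s)  # stand-in for a real network call
    return name

# Placeholder timings for three calls that do not depend on one another.
CALLS = {"embedding": 0.05, "user_history": 0.04, "permissions": 0.03}

async def serial():
    # Each await finishes before the next call starts: the delays add up.
    return [await call(name, delay) for name, delay in CALLS.items()]

async def concurrent():
    # All calls start together: the total is roughly the slowest one.
    return await asyncio.gather(*(call(name, delay) for name, delay in CALLS.items()))

def timed(coro_fn):
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return list(result), (time.perf_counter() - start) * 1000

serial_result, serial_ms = timed(serial)
concurrent_result, concurrent_ms = timed(concurrent)
print(f"serial {serial_ms:.0f}ms, concurrent {concurrent_ms:.0f}ms")
```

On my assumptions the serial version takes about 120ms and the concurrent one about 50ms, with identical results in the same order.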
Kill the cold starts and reuse connections
Keep critical functions warm (provisioned concurrency, a ping, or a long-lived process) and pool your database and HTTP connections so you stop paying TCP and TLS setup on every call. Connection reuse with HTTP keep-alive turns a recurring 70ms tax into a one-time cost.
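A toy pool shows the idea. This is a sketch, not something to ship: real database drivers and HTTP clients come with their own pools, and `fake_connect` is a stand-in that only counts how many times setup was paid.

```python
import queue

class Pool:
    """A tiny fixed-size pool: connections are created once, then borrowed and returned."""

    def __init__(self, connect, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    def run(self, fn):
        conn = self._idle.get()  # blocks if every connection is busy
        try:
            return fn(conn)
        finally:
            self._idle.put(conn)  # hand it back for the next request

setup_log = []

def fake_connect():
    # A real connect() would pay for the TCP and TLS handshakes here.
    setup_log.append("handshake")
    return {"id": len(setup_log)}

pool = Pool(fake_connect, size=2)
results = [pool.run(lambda conn: conn["id"]) for _ in range(100)]
print(len(setup_log))  # 2
```

A hundred requests, two handshakes. Without the pool it would have been a hundred.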
Trim the prompt, then stream
Prefill scales with prompt length, so a 2,000-token context reaches the first token meaningfully sooner than an 8,000-token one. Cut the context to what the model actually needs, then stream the output so the user starts reading at first token instead of last.
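On the wire, streaming over Server-Sent Events is just text frames that each end in a blank line. Here is a sketch of the framing only. The `sse_events` name, the `token` payload key, and the closing `done` event are my own choices, and you would plug the generator into your web framework's streaming response.

```python
import json
from typing import Iterable, Iterator

def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Turn a stream of model tokens into Server-Sent Events frames, one per token."""
    for token in tokens:
        # JSON-encode the payload so a newline inside a token cannot break the framing.
        payload = json.dumps({"token": token})
        yield "data: " + payload + "\n\n"
    # A named event of our own choosing that tells the client the answer is complete.
    yield "event: done\ndata: {}\n\n"

frames = list(sse_events(["Hel", "lo"]))
print(frames[0], end="")  # data: {"token": "Hel"}
```

Because it is a generator, the first frame can go out as soon as the first token exists, which is exactly the moment you are racing toward.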
The part nobody likes to hear
Some latency you cannot remove. The speed of light is not in your backlog, and a user on a congested mobile network 12,000 km from your nearest region will have a slower experience than someone on fiber next to your data center. No amount of clever engineering erases that gap completely.
That is freeing, not discouraging. Once you accept that a slice of the budget is fixed, you stop wrestling physics and spend your energy on the parts you own: the serial calls you can parallelize, the cold starts you can warm, the prompt you can trim, the connection you can reuse, and the perception you can shape with a stream. The fastest-feeling AI products are not the ones with the fastest models. They are the ones that respect the user’s budget of patience and spend it with taste.
That 1.8-second gap from the opening was never a model problem. It was a budget nobody itemized, spent in places nobody timed. Itemize it, move the spend onto things the user can watch, and the app stops feeling slow. Not because the milliseconds vanished, but because you finally put them where they count.