feelsfast.fyi
Concepts

AI changes the shape of the wait

The previous eight essays assumed something nobody bothered to state out loud: the wait has a known shape. A page load takes 800 ms to 4 seconds. A form submit takes 200 ms to a few seconds. A search query is bounded above by a timeout you control. You can measure the p50 and the p90, you know roughly where the wait is going to land, and you can put a skeleton screen at the right place because you know what "the right place" means.

AI breaks all three.

I'd argue the cleanest way to see this is to look at what AI waits and web waits do not share, rather than what they do.

Three properties that change

Three differences are load-bearing. The rest is detail.

  1. Duration variance across two orders of magnitude. A Copilot autocomplete returns in ~180 ms. A Claude chat response takes ~4 seconds. An o1-style reasoning response takes ~45 seconds. An agentic workflow — Cursor's compose, Devin, a long Claude tool-using session — can run for minutes. The same physical surface (the chat composer) produces all four. The user does not know in advance which one they are about to wait for, and the perception cue cannot ask.
  2. The shape is conversational, not navigational. Web waits are mostly "click → page" — the wait sits between an input and a settled destination state, and the perception job is to keep the user oriented while the destination loads. AI waits sit between an input and an answer; the answer arrives in pieces; the conversation continues. There is no destination state. The wait is followed by another wait.
  3. The wait is filled by the answer arriving. The token-by-token streaming pattern that every modern LLM converged on is the most aggressive form of "filled time" in any UI category. Block & Zakay's (1997) meta-analysis is the long-standing justification for skeleton screens and engaging loaders — filled retrospective duration is shorter than empty retrospective duration. AI streaming applies the same mechanism, but instead of filling the wait with a placeholder, the product fills it with the actual answer.

The three properties compound. A wait of unknown length, with no destination state, where the answer arrives mid-wait, is a different perception animal than a web click-to-page wait. Same toolbox; different problem. The bad AI waits I see in the wild are usually waits where someone applied a web heuristic to an AI surface — a generic spinner across a 30-second stream, a "Loading..." copy where a thinking-state cue belonged, a determinate progress bar promising 60 % done when the model has no notion of percent.

The new anatomy

Essay 04 broke every wait into four phases — pre-action signal, response, animation, completion. The framing applies to AI surfaces but it maps a little oddly. I'd argue the AI wait has its own four-phase anatomy, with a different shape (a minimal sketch of it as code follows the list):

  1. Pre-action signal — identical to web. The user reaches for the send button. The 100 ms perceptual frame still holds. mousedown over click still works.
  2. Thinking — the new phase. After the user hits send and before the first token arrives, the model is doing something the user cannot see. Could be 200 ms. Could be 30 seconds. There is no web analog.
  3. Streaming — the answer arrives as a stream of tokens, optionally interrupted by tool calls that expose mid-stream agent work. The user is reading the answer as it forms. This is the bulk of the wait for most AI surfaces.
  4. Settled — the response is complete. The cancel button disappears, the input becomes active again, and the user can react.
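
Here is that anatomy written down as a tiny TypeScript state machine. The phase names and events are lifted from the list above; the type and function names are mine, a sketch rather than any product's actual API:

```ts
// Hypothetical phase model for a single AI response. Names are illustrative.
type Phase = "pre-action" | "thinking" | "streaming" | "settled";

type WaitEvent =
  | { kind: "sent" }         // user hit send
  | { kind: "first-token" }  // first token arrived
  | { kind: "done" }         // stream completed
  | { kind: "cancelled" };   // user hit cancel

// Each event advances the wait one phase; cancel settles it from anywhere.
function nextPhase(current: Phase, event: WaitEvent): Phase {
  if (event.kind === "cancelled") return "settled";
  switch (current) {
    case "pre-action": return event.kind === "sent" ? "thinking" : current;
    case "thinking":   return event.kind === "first-token" ? "streaming" : current;
    case "streaming":  return event.kind === "done" ? "settled" : current;
    default:           return current; // settled stays settled
  }
}
```

The point of writing it down is the same as the point of the list: each phase gets its own perception treatment, so the UI has to know which phase it is in.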

Two things to point out about this anatomy.

First, the user is doing something different in each phase. In thinking, they are waiting passively. In streaming, they are reading actively. In settled, they are deciding what to do next. The perception cues required for each are different — and conflating them is the source of most bad AI UX. A spinner that lives across all four phases is not a perception treatment; it is the absence of one.

The thinking phase has its own design vocabulary because no web pattern fits the case. The two conventions that have settled into practice are the three-dot bounce (small, calm, sits inline next to the input — reads as "the system is composing a reply") and the pulsing orb (larger, slower, centred — reads as "the model is doing extended work"). I'd argue the choice is about implied effort: dots for sub-3 s expected thinking, orb for anything past that. Conflating them is harmless visually but degrades the cue's information content — when both surfaces show dots the user cannot tell whether to wait a beat or a minute. Both variants must be gated behind a motion-safe check so they collapse to a static shape under prefers-reduced-motion; a thinking state that animates against the system preference is the kind of accessibility regression that ships unnoticed because every developer on the team has motion enabled.
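
A minimal sketch of that gate, assuming a browser environment. The matchMedia query for prefers-reduced-motion is the standard check; the class names and the sub-3 s split are assumptions carried over from the paragraph above:

```ts
// Choose the thinking-state treatment and respect the OS motion preference.
// Class names are hypothetical; the -static variants render without animation.
function thinkingCueClass(expectedThinkingMs: number): string {
  const reduceMotion = window.matchMedia("(prefers-reduced-motion: reduce)").matches;
  const variant = expectedThinkingMs < 3_000 ? "thinking-dots" : "thinking-orb";
  return reduceMotion ? `${variant}-static` : `${variant}-animated`;
}
```

Whether the gate lives in JavaScript or in CSS is an implementation detail; the structural requirement is only that the animated path is opt-in under the user's motion preference, not the default.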

Second, the streaming phase blurs the line between "the wait" and "the result." There is no clean "the wait ended, here is the answer" event. The answer was the wait. The user's experience of "how long did I wait" anchors heavily on the reading-cadence of the streamed tokens, not on the wall-clock duration. This is why AI UX teams are obsessive about token cadence — and why the products that get it right feel categorically better than the ones that do not.

Deliberate cadence deserves a paragraph of its own. The naive implementation pushes one token to the DOM as soon as it arrives — fast tokens cluster, slow tokens stutter, and the reading rhythm jerks. The thoughtful implementation buffers and reveals at variable chunk sizes calibrated to reading speed: short chunks (a few characters) at the start of a sentence when the eye is locking on, longer chunks (a phrase) mid-sentence when the reader is moving with momentum, a brief pause at punctuation. The cost is one client-side timer and a small reveal queue. The benefit is that the same wall-clock latency reads as prose being written instead of characters being dribbled — and the perception cost of the wait drops noticeably. The principle: cadence matters more than latency on conversational AI surfaces, because cadence is what the user actually measures.
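
A minimal sketch of such a reveal queue, assuming tokens arrive on a push callback and a render function repaints the visible text. The chunk sizes and pause lengths are illustrative placeholders, not measured values:

```ts
// Buffer incoming tokens and reveal them at a reading cadence rather than
// at network arrival times. All tuning numbers below are illustrative.
class RevealQueue {
  private buffer = "";
  private shown = "";
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(private render: (text: string) => void) {}

  push(token: string): void {
    this.buffer += token;
    if (this.timer === null) this.tick();
  }

  private tick(): void {
    if (this.buffer.length === 0) { this.timer = null; return; }

    // Short chunks at a sentence start, longer chunks mid-sentence.
    const atSentenceStart = this.shown === "" || /[.!?]\s*$/.test(this.shown);
    const chunkSize = atSentenceStart ? 3 : 12;
    const chunk = this.buffer.slice(0, chunkSize);
    this.buffer = this.buffer.slice(chunk.length);
    this.shown += chunk;
    this.render(this.shown);

    // Brief pause at punctuation; otherwise keep a steady reveal rhythm.
    const delay = /[.!?,]$/.test(chunk.trimEnd()) ? 180 : 40;
    this.timer = setTimeout(() => this.tick(), delay);
  }
}
```

The exact tuning is a product decision; the structural point is that the reveal clock is decoupled from the network clock.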

Which classical thresholds still hold

Useful to be specific about which parts of the canon survive the move to AI.

  1. The 100 ms perceptual frame. Card, Moran & Newell's (1983) ~100 ms upper bound for the input-to-acknowledgement loop is intact. When the user hits send, the input has to deactivate, the cursor has to clear, the cancel affordance has to appear, all within ~100 ms. Same as the web. The failure mode is worse: on web a sluggish acknowledgement produces a double-click and a duplicate submit; on AI it produces two parallel inference jobs, two streams competing for the same composer, and a billing event the user did not intend.
  2. The 1 s flow break. Nielsen's (1993) 1-second threshold applies, but only to the first-token moment, not the full response. If the model takes more than ~1 s to start streaming, the user has slipped out of active mode. The thinking-state cue — pulsing orb, three-dot bounce, a calm "Thinking..." label — is what keeps the active connection alive past 1 s and into the streaming phase.
  3. The 10 s wall. Nielsen's (1993) attention-drift threshold applies, with a footnote. Past 10 seconds of pure thinking, the user disengages. Past 10 seconds of streaming, the user is still engaged because the answer is arriving. The wall moves. For agentic workflows that disappear into 30-second tool-call sequences, you have to surface what the agent is doing — otherwise the wall closes and the user goes to another tab. A sketch of how the three thresholds wire to AI events follows this list.
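
A minimal sketch of that wiring, reusing the phase vocabulary from the anatomy above. The timer values mirror the 1 s and 10 s thresholds; the handler names are hypothetical, and the ~100 ms acknowledgement is synchronous rather than timer-driven:

```ts
// Escalate perception cues as the thresholds pass, counting from send.
const FLOW_MS = 1_000;   // the 1 s flow break, applied to first-token latency
const WALL_MS = 10_000;  // the 10 s wall, applied to unexposed agent work

interface WaitCues {
  acknowledgeSend(): void;   // disable input, show cancel: within ~100 ms of send
  showThinkingCue(): void;   // dots or orb, once the flow break passes
  surfaceAgentWork(): void;  // expose tool-call labels past the wall
}

// Hypothetical wiring: call onFirstToken() when streaming starts so the
// later cues never fire for fast responses.
function startWaitTimers(cues: WaitCues): { onFirstToken: () => void } {
  cues.acknowledgeSend(); // must not wait on any timer
  const timers = [
    setTimeout(() => cues.showThinkingCue(), FLOW_MS),
    setTimeout(() => cues.surfaceAgentWork(), WALL_MS),
  ];
  return { onFirstToken: () => timers.forEach((t) => clearTimeout(t)) };
}
```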

These three thresholds are doing work in AI UX. They map onto different events than they did on the web, but the perception physics is the same.

Which ones break

A few thresholds that get repeated in AI UX writeups do not actually translate.

  1. Doherty's 400 ms productivity cliff. Doherty's (1982) finding applies to autocomplete-band AI surfaces. Cursor, Copilot, and v0's inline completions all live inside this budget; the GitHub Research study by Ziegler et al. (2022) on Copilot productivity is the modern follow-up, and it confirms that completion timing in the sub-second range materially affects acceptance rates. But Doherty is irrelevant for the chat surface. A 3-second chat response is not a productivity-cliff event because the user is not doing repeated transactions — they are having a conversation. Citing Doherty for chat latency is a category error you see a lot in AI UX hot takes.
  2. The unit-task model. Card et al. (1991) framed responsiveness in unit-task tiers — perceptual (~100 ms), immediate (~1 s), unit-task (~10 s). The tiers assume the user has a discrete task they are completing. AI conversations are continuous; there is no clean unit-task boundary. Trying to enforce tier-based budgets on a chat surface produces UX that feels mechanical — "your last response was 2.3 s, well within target" is not a thought a user is having.
  3. Time-to-Interactive (TTI). A perfectly reasonable web metric. Useless for AI surfaces. On a chat surface the input becomes interactive again only after the response completes, which is variable; trying to optimise TTI on chat tells you nothing about whether the wait felt good.
  4. Miller's transaction taxonomy. Miller's (1968) seventeen transaction types assume bounded server-side work — the system has a known thing to do and a known time to do it in. AI is unbounded in both senses. The taxonomy holds for the autocomplete band (those are recognisable Miller transactions) but stops describing what is happening as soon as the user is in chat.

How the products converged

A few minutes with Claude, ChatGPT, Gemini, Cursor, and v0 reveals that they have all converged on roughly the same shape:

  1. A thinking-state cue that holds the user's attention without claiming progress.
  2. Token streaming with a deliberate cadence — fast enough to feel alive, slow enough to read.
  3. A visible cancel button at every point in the stream.
  4. Tool-call surfaces that expand inline to show mid-stream agent work.
  5. A clear settled state — input re-activates, cancel disappears, focus returns to the composer.

Tool-call transparency deserves a footnote because it is the headline distinction between AI products that converged in the last two years and the ones still using a generic spinner across long compute. The surface looks small — an inline strip listing "Reading file… Running tests…" — but it is what makes a multi-minute agent run feel like work rather than a stall. Two failure modes worth flagging. The first is over-sharing: tool-call labels can leak sensitive content (file paths, query strings, prompt fragments) that the user did not expect to be visible — redact aggressively, especially on shared screens. The second is flicker: tool calls that resolve in under ~300 ms produce a rapid-fire strip of strings appearing and disappearing, which reads as broken rather than transparent. Below ~300 ms, batch the label updates or skip them entirely; transparency is information about the agent's work, not a per-keystroke debug log.
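
A minimal sketch of the anti-flicker gate, taking the "skip them entirely" branch: a label is shown only if the tool call outlives the threshold. It assumes start and finish arrive as callbacks and that the label has already been redacted upstream; the ~300 ms figure is the one above, everything else is illustrative:

```ts
// Suppress tool-call labels that resolve faster than the flicker threshold,
// so the strip only shows work the user can actually read.
const MIN_VISIBLE_MS = 300; // the ~300 ms flicker heuristic from the text

function makeToolCallStrip(show: (label: string) => void, hide: () => void) {
  let pending: ReturnType<typeof setTimeout> | null = null;

  return {
    // Called when a tool call starts; label should already be redacted.
    start(label: string): void {
      pending = setTimeout(() => {
        show(label);
        pending = null;
      }, MIN_VISIBLE_MS);
    },
    // Called when the tool call finishes; fast calls never render a label.
    finish(): void {
      if (pending !== null) {
        clearTimeout(pending); // resolved under the threshold: skip the label
        pending = null;
      } else {
        hide();
      }
    },
  };
}
```

A burst of sub-300 ms calls renders as nothing at all, which is the intended trade: the strip is a narration of the agent's work, not a log.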

Convergence usually means the perception research is right. Block & Zakay's (1997) filled-vs-empty framing predicted that streaming would beat batching; Nielsen's (1993) 1-second wall predicted that thinking-state cues would be required; the 100 ms perceptual frame predicted that the send button needs to acknowledge within a frame. The shape we see across products is the shape the canon predicts when you re-apply it to a different wait.

What is genuinely new in AI UX is the thinking state itself and the tool-call surface. Neither has a direct web analog. Those are the surfaces where individual product teams are doing original perception work — what shape should "the model is thinking" take? What shape should "the agent is reading three files" take? The rest of the AI wait is existing perception patterns applied carefully.

What to do with this

Three takeaways before the next essay:

  1. Design the four phases separately. The thinking-state, the streaming phase, the tool-call surface, and the settled state are four distinct perception problems. Treating them as one — a generic spinner that sits across all four — is the most common AI-UX mistake. The fix is structural: name the four phases in your design system and pick a treatment per phase.
  2. Match the cadence to the reading. Token streaming feels good when the reading speed matches comfortable prose intake (~200–300 words per minute). Faster reads as the model showing off; slower reads as broken. The cadence is the perception control surface, not a side-effect of the API. Tune it; a back-of-envelope conversion from reading speed to reveal delay follows this list.
  3. Treat AI thresholds as bands, not lines. The classical thresholds (100 ms, 1 s, 10 s) still mark perception transitions, but in AI they apply to different events (first-token latency, thinking-state appearance, stream completion, attention drift inside long agent runs). Map each threshold to the AI event it covers, not the wall-clock duration of the whole response.

References · 7

  1. Block & Zakay 1997

    Block, R. A., & Zakay, D. (1997). Prospective and retrospective duration judgments: A meta-analytic review. Psychonomic Bulletin & Review, 4(2), 184–197. The filled-vs-empty-time finding that justifies token streaming on AI surfaces — content arriving mid-wait shortens retrospective duration even when wall-clock latency is unchanged.

  2. Miller 1968

    Miller, R. B. (1968). Response time in man-computer conversational transactions. Proceedings of the AFIPS Fall Joint Computer Conference, 33(I), 267–277. The 17-transaction taxonomy. Useful in this essay as a counter-example — AI conversations are not unit tasks, and Miller's tier-based budgeting does not survive the move.

  3. Nielsen 1993

    Nielsen, J. (1993). Usability Engineering. Academic Press. The 0.1 / 1 / 10 s thresholds. Still load-bearing for AI — the perceptual frame applies to first-token latency, the 1 s wall applies to the thinking state, the 10 s wall closes during agentic tool-call sequences unless the work is surfaced.

  4. Card et al. 1991

    Card, S. K., Robertson, G. G., & Mackinlay, J. D. (1991). The information visualizer, an information workspace. Proceedings of CHI '91, 181–188. Perceptual / immediate-response / unit-task tiers. The tiered framing assumes discrete user tasks; this essay argues the assumption breaks on AI's continuous conversational surfaces.

  5. Card, Moran & Newell 1983

    Card, S. K., Moran, T. P., & Newell, A. (1983). The Psychology of Human-Computer Interaction. Lawrence Erlbaum. ~100 ms perceptual processing frame — applied here to the send-button acknowledgement on AI input surfaces, where a slow ack produces double-tap submission and parallel inference jobs.

  6. Doherty 1982

    Doherty, W. J., & Thadhani, A. J. (1982). The Economic Value of Rapid Response Time. IBM. The 400 ms productivity cliff. Relevant to autocomplete-band AI surfaces (inline code completion sits inside this budget); a category error when invoked for chat-band surfaces, which are not productivity transactions in Doherty's sense.

  7. Ziegler et al. 2022

    Ziegler, A., Kalliamvakou, E., Li, X. A., Rice, A., Rifkin, D., Simister, S., Sittampalam, G., & Aftandilian, E. (2022). Productivity assessment of neural code completion. Proceedings of MAPS 2022 (PLDI workshop). GitHub Research study of Copilot acceptance and timing — useful as an industry anchor for the inline-completion latency band where Doherty's cliff still bites.