AI · Chat / streaming response
The user submits a prompt to an AI surface. The total time to last token is roughly the same on both sides — call it 5–6 seconds. The naive surface goes blank for those seconds, then drops the full response in at the end. The tuned surface fills the same wait with a brief thinking state and a token-by-token reveal pacing the response to a natural reading rhythm. Same clock, very different experience.
This scenario sits in the 1–10 s band. Card et al. 1991's ~100 ms perceptual frame sets the cadence floor for the token reveal. Block & Zakay 1997's filled-duration principle is the underlying mechanism — streaming converts an empty wait into a filled one. Myers 1985's indeterminate-feedback baseline is what the thinking state covers during time-to-first-token.
AI · Streaming response
A chat-style assistant returns a ~200-character answer. Naive: total wait, then the full response drops in. Tuned: ~600 ms thinking state, then tokens stream at a natural reading pace.
What is happening in the demo
Both sides simulate the same total work — a 5,500 ms p50 wait drawn from a gamma distribution. The naive side waits the entire duration with an empty box, then renders the full response at once. The tuned side has two distinct phases.
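Before the two phases, the shared wait itself. Here is a minimal TypeScript sketch of one way to draw it, assuming a gamma with integer shape 4 and a scale of about 1,500 ms, which puts the median near 5,500 ms; the names and exact parameters are illustrative, not the demo's actual code.

```ts
// Sketch: sample a gamma-distributed wait for the simulation.
// Assumption: shape 4, scale ~1,500 ms, so the median (p50) lands near 5,500 ms.
function sampleWaitMs(shape = 4, scaleMs = 1500): number {
  // For an integer shape, a gamma draw is the sum of `shape` exponential draws.
  let sum = 0;
  for (let i = 0; i < shape; i++) {
    sum += -Math.log(1 - Math.random());
  }
  return sum * scaleMs;
}

const totalWaitMs = sampleWaitMs(); // the same draw drives both sides of the demo
```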
Phase 1 — Thinking (~600 ms with gamma jitter). Three pulsing dots in the muted-foreground colour. The model is doing the time-to-first-token work; the thinking state is the only honest signal that the input registered and processing has begun. Without it, the user wonders whether their submit even fired. Past 5 seconds without a first token, this would escalate to tool-call transparency or an engaging loading state; for a typical chat response, the dots cover the gap.
Phase 2 — Streaming. Characters reveal in chunks of 1–2 at a time at ~30 ms intervals, with slight randomness so the cadence does not feel mechanical. A pulsing primary-color cursor sits at the tail and disappears on completion. ARIA live announcements fire on phase changes; motion-safe: gates the dots and cursor for users who have asked for less motion.
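A compressed sketch of those two phases in TypeScript, against a plain DOM container. The helper names and timings are assumptions for illustration, not the demo's implementation; markdown rendering and the cursor styling are left out.

```ts
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Assumed signature: render the full answer into `container`, phase by phase.
async function streamAnswer(fullText: string, container: HTMLElement): Promise<void> {
  // Phase 1 (thinking): cover time-to-first-token with an indeterminate signal.
  container.setAttribute("aria-busy", "true");
  container.textContent = "…"; // stand-in for the three pulsing dots
  await sleep(600 + Math.random() * 200); // ~600 ms with jitter

  // Phase 2 (streaming): reveal 1–2 characters roughly every 30 ms,
  // with slight randomness so the cadence does not feel mechanical.
  container.textContent = "";
  let i = 0;
  while (i < fullText.length) {
    const chunk = 1 + Math.floor(Math.random() * 2); // 1 or 2 characters
    container.textContent += fullText.slice(i, i + chunk);
    i += chunk;
    await sleep(20 + Math.random() * 20); // ~30 ms on average
  }

  // Settled: the cursor would be removed here; the container stops announcing work.
  container.setAttribute("aria-busy", "false");
}
```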
The total wall-clock time is identical. What changed is whether the user is reading along with the model or staring at a blank box.
What to tune
- Pre-action — submit button echo within ~50 ms; nothing about the model state yet.
- Thinking — pulsing dots during time-to-first-token. Indeterminate by design; replaced the moment content begins.
- Streaming — variable-chunk reveal at ~30 ms cadence. Constant rate feels mechanical; instant per-token jitters the eye.
- Cancellation — stop button registers within ~100 ms even if the abort takes longer; see the sketch after this list.
- Settled — cursor disappears on completion; semantics (code blocks, lists, headings) stay correct across the stream.
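The cancellation point is the one most often skipped, so here is a minimal sketch of the idea, assuming fetch with an AbortController; the element names and copy are illustrative. The key move is acknowledging the press immediately, before the abort itself resolves.

```ts
// Sketch: acknowledge the stop press right away, then let the abort propagate.
const controller = new AbortController();

function onStopClick(stopButton: HTMLButtonElement): void {
  // Registers within ~100 ms: flip the visible state first...
  stopButton.disabled = true;
  stopButton.textContent = "Stopping…";
  // ...then trigger the actual abort, which may take longer to settle.
  controller.abort();
}

async function requestAnswer(url: string): Promise<void> {
  try {
    const response = await fetch(url, { signal: controller.signal });
    const reader = response.body?.getReader();
    // ...read and render chunks from `reader` as they arrive...
  } catch (err) {
    if (err instanceof DOMException && err.name === "AbortError") {
      // Move the surface to its settled "stopped" state here.
    }
  }
}
```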
When perceived performance hurts you here
If your time-to-first-token consistently exceeds 5–10 seconds, the thinking-state-plus-streaming pattern is no longer enough. The user crosses the unit-task boundary and starts to disengage. The fix at that point is real engineering — model selection, retrieval optimisation, parallel inference — not a longer thinking state. See Concepts §6 on production surfaces and the looks-fast / is-fast trap.
The other failure mode: streaming the response too fast. Token reveal at 5 ms intervals reads as flashing text rather than typing; the user cannot encode it in real time. The cadence has to land in a band the human reading system can handle — roughly 30 ms per character chunk for native English readers, slightly more for content with code or technical vocabulary.
Accessibility
aria-live="polite"on the streaming container so screen readers announce updates as they arrive. Avoidaria-live="assertive"— it interrupts other speech.aria-busyflips with each phase:trueduring thinking,trueduring streaming,falseon done.motion-safe:on the thinking dots and the cursor. For reduced-motion users, the dots stay static (still informative) and the cursor disappears entirely.- Don't strip semantics during streaming. If the model produces markdown — code blocks, lists, headings — render them with the right elements as the tokens accumulate. A streaming
<pre><code>block reads better than a paragraph that retroactively becomes a code block. - Cancellation must be reachable from keyboard. Escape by convention; surface it visibly the first few times.
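A minimal sketch of that wiring, assuming a plain DOM container and Tailwind-style `motion-safe:` variants; the id and class names are placeholders for whatever the project already uses.

```ts
// Sketch: accessibility attributes on the streaming container (names are placeholders).
const container = document.getElementById("assistant-response")!;

container.setAttribute("aria-live", "polite"); // announce updates without interrupting
container.setAttribute("aria-busy", "true");   // stays true through thinking and streaming

// Thinking dots: animate only when the user has not asked for reduced motion.
const dots = document.createElement("span");
dots.className = "motion-safe:animate-pulse"; // static but still visible otherwise
dots.textContent = "…";
container.append(dots);

// ...stream tokens, rendering markdown with real elements as they accumulate...

container.setAttribute("aria-busy", "false");  // settled
```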
References
- Card et al. 1991
Card, S. K., Robertson, G. G., & Mackinlay, J. D. (1991). The information visualizer, an information workspace. Proceedings of CHI '91, 181–188. ~100 ms perceptual frame is the cadence floor for token reveals.
- Block & Zakay 1997
Block, R. A., & Zakay, D. (1997). Prospective and retrospective duration judgments: A meta-analytic review. Psychonomic Bulletin & Review, 4(2), 184–197. The filled-duration principle: streaming converts an empty wait into a filled one.
- Myers 1985
Myers, B. A. (1985). The importance of percent-done progress indicators for computer-human interfaces. Proceedings of CHI '85, 11–17. Indeterminate-feedback principle behind the thinking state during time-to-first-token.