Long Compute Inference
The user submits a request that is going to take 30 seconds or more. Image generation. A complex analytical query. A long-context reasoning task. The user has crossed the unit-task boundary; static skeletons no longer carry their weight; engagement is the last move before they leave.
This scenario sits in the 10 s+ band. The Block & Zakay 1997 meta-analysis frames the trade-off: engagement compresses prospective duration while expanding retrospective duration; the design decision is whether you have made that trade deliberately. Fitch's Slack and FIFA examples are the canonical references; Myers 1985 is the determinate-progress fallback where the inference reports phases.
Animated Text Streaming
A chat-style assistant returns a short answer. Naive: total wait, then the full response drops in. Tuned: ~600 ms thinking state, then tokens stream at a natural reading pace.
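The tuned timing can be sketched as a pure schedule: a thinking delay, then one reveal per token at a steady pace. This is only a minimal sketch. The helper name `buildRevealSchedule` and the 40 ms-per-token pace are assumptions for illustration; the ~600 ms thinking state comes from the description above.

```typescript
// Hypothetical helper: plans when each token should become visible.
interface Reveal {
  token: string;
  atMs: number; // milliseconds after submit
}

function buildRevealSchedule(
  tokens: string[],
  thinkingMs = 600, // thinking state shown before the first token
  msPerToken = 40   // assumed reading pace; tune against real output
): Reveal[] {
  return tokens.map((token, i) => ({
    token,
    atMs: thinkingMs + i * msPerToken,
  }));
}

// Example: three tokens revealed after the thinking state.
const schedule = buildRevealSchedule(["Perceived", " speed", " matters."]);
```

A renderer would walk this schedule with timers; keeping the schedule separate from rendering makes the pacing easy to adjust and test.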
What is happening
The ai-streaming demo stands in for the long-inference case. For a real long inference, the tuned flow stacks more layers:
- Thinking state during the time-to-first-token gap — the same dots-and-cursor pattern from ai-chat-streaming-response.
- Streaming render as soon as the first token arrives, paced to a natural reading rhythm.
- Tool-call transparency if the inference involves visible tool calls — narrate them ("Searching…", "Reading…", "Reasoning…").
- Determinate progress if the inference can report phases ("Step 3 of 7"); fall back to engagement otherwise.
- Cancellation always available — the stop button must respond inside the perceptual frame even if the abort takes longer.
- Background fallback — past 30–60 seconds, offer "do this in the background and notify me" so the user can leave the surface.
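The cancellation rule in the list above (acknowledge the click at once, let the real abort settle later) could be modelled as a small state holder. This is a sketch under assumptions: the class `InferenceRun`, its state names, and the `abort` callback are hypothetical. In a browser, the callback would typically call `AbortController.abort()` on the in-flight request.

```typescript
// Hypothetical state model for a stop button; names are illustrative.
type RunState = "running" | "stopping" | "stopped" | "done";

class InferenceRun {
  state: RunState = "running";
  private abort: () => void;

  // `abort` starts the real cancellation, which may take a while to settle.
  constructor(abort: () => void) {
    this.abort = abort;
  }

  // Called from the stop button's click handler.
  requestStop(): void {
    if (this.state !== "running") return; // ignore repeat clicks
    this.state = "stopping";              // the UI can acknowledge immediately
    this.abort();                         // the actual abort may finish later
  }

  // Called when the backend confirms the abort.
  confirmStopped(): void {
    if (this.state === "stopping") this.state = "stopped";
  }

  // Called when the inference finishes normally.
  complete(): void {
    if (this.state === "running") this.state = "done";
  }
}
```

The separate "stopping" state is what lets the button respond inside the perceptual frame: the label can change on click, independent of how long the abort itself takes.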
What to tune
- Pre-action — submit button echo within ~50 ms; thinking dots cover the time-to-first-token gap.
- First 1 s — thinking state in place. No spinner, no skeleton over content the model will produce.
- 1 – 10 s — token streaming where text is the output. Tool-call transparency where the work is visible.
- Past 10 s — engaging copy where applicable; determinate progress where the inference reports phases. Cancellation always visible.
- Past 30–60 s — hand-off to background sync with notification. The foreground is no longer the right surface.
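The time bands above can be expressed as a lookup from elapsed time to the treatment the UI should show. A minimal sketch follows; the function name `waitBand`, the band labels, and the choice of 30 s (the lower end of the 30–60 s range) as the hand-off threshold are assumptions to be tuned per product.

```typescript
// Hypothetical band labels for the treatments listed above.
type WaitBand =
  | "thinking"
  | "streaming"
  | "progress-or-engagement"
  | "background-handoff";

interface BandThresholds {
  streamingMs: number;  // end of the "first 1 s" band
  engagementMs: number; // start of the "past 10 s" band
  backgroundMs: number; // start of the hand-off band (assumed 30 s)
}

const DEFAULT_THRESHOLDS: BandThresholds = {
  streamingMs: 1000,
  engagementMs: 10000,
  backgroundMs: 30000,
};

function waitBand(
  elapsedMs: number,
  t: BandThresholds = DEFAULT_THRESHOLDS
): WaitBand {
  if (elapsedMs >= t.backgroundMs) return "background-handoff";
  if (elapsedMs >= t.engagementMs) return "progress-or-engagement";
  if (elapsedMs >= t.streamingMs) return "streaming";
  return "thinking";
}
```

The pre-action echo (~50 ms) is left out on purpose: it is a response to the click, not a band of the wait.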
When perceived performance hurts you here
The engagement-vs-retrospective-duration trade is the central trap. A 30-second inference with rich engagement feels short while it runs and long in retrospect — the user remembers it as taking forever even when their session went smoothly. Slack and FIFA accept this; for inference where the user repeats the action many times in a session, the retrospective cost compounds.
The cleaner answer for repeat-use AI inference: ship determinate progress where measurable, tool-call transparency where applicable (the user is learning during the wait), and background sync past 60 s. Generic engaging content (motivational quotes, mini-games) belongs only on rare or one-off inferences.
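That preference order (determinate progress first, tool-call narration second, a generic state last) might look like the sketch below. The `InferenceStatus` shape, the tool kinds, and the "Thinking…" fallback are assumptions; the other strings mirror the narration examples earlier in this section.

```typescript
// Hypothetical status shape; a real backend may report something different.
interface InferenceStatus {
  step?: number;                        // present only if phases are reported
  totalSteps?: number;
  tool?: "search" | "read" | "reason";  // assumed tool-call kinds
}

function statusLabel(s: InferenceStatus): string {
  // Prefer determinate progress whenever the inference reports phases.
  if (s.step !== undefined && s.totalSteps !== undefined && s.totalSteps > 0) {
    return `Step ${s.step} of ${s.totalSteps}`;
  }
  // Otherwise narrate the visible tool call, if any.
  switch (s.tool) {
    case "search":
      return "Searching…";
    case "read":
      return "Reading…";
    case "reason":
      return "Reasoning…";
    default:
      return "Thinking…"; // assumed generic fallback
  }
}
```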
Accessibility
- aria-live="polite" on the streaming output and on tool-call narration.
- aria-busy="true" during the inference; flip on completion.
- prefers-reduced-motion: reduce: replace cross-fades and pulse animations with static states.
- Always provide a way out: visible cancellation, visible "do in background", visible re-attempt on failure.
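The ARIA and reduced-motion rules above can be kept in two small helpers, so every streaming surface applies them the same way. This is a sketch: the helper names and the CSS class names are hypothetical. In a browser, the `prefersReducedMotion` flag would typically come from `window.matchMedia("(prefers-reduced-motion: reduce)").matches`.

```typescript
// Hypothetical helper: attributes for the streaming output container.
function outputRegionAttrs(inferenceRunning: boolean): Record<string, string> {
  return {
    "aria-live": "polite",                            // announce updates without interrupting
    "aria-busy": inferenceRunning ? "true" : "false", // flip on completion
  };
}

// Hypothetical helper: picks a static indicator when the user asks for reduced motion.
// Both class names are illustrative, not from any real stylesheet.
function thinkingIndicatorClass(prefersReducedMotion: boolean): string {
  return prefersReducedMotion ? "thinking-static" : "thinking-pulse";
}
```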
Other patterns that fit
Same band, same surface — switch in whichever maps best to your UI.
Branded Story Sequence
Slack-style cold-boot pattern. Naive: a single "Loading…" line for the full ~12 s. Tuned: a paced sequence of branded frames — wordmark, tagline, skeleton, near-ready — each fading to the next over the wait. The user reads the wait as the app composing itself, not as absence.
Rotating Tips
Engaging copy during a long wait. Naive: a static "Loading…" line for the full duration; the user has nothing to do but stare. Tuned: a "Did you know?" card cycles through a handful of perception facts every ~2.5 s with a soft cross-fade. Same wait — but the time fills with information instead of absence.
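The rotation itself reduces to picking a tip from the elapsed time. A minimal sketch, assuming a hypothetical helper `tipAt` and the ~2.5 s interval described above; the cross-fade is left to the rendering layer.

```typescript
// Hypothetical helper: returns the tip to show at a given elapsed time.
function tipAt<T>(tips: T[], elapsedMs: number, intervalMs = 2500): T | undefined {
  if (tips.length === 0) return undefined; // nothing to rotate through
  const slot = Math.floor(Math.max(0, elapsedMs) / intervalMs);
  return tips[slot % tips.length];         // wrap around after the last tip
}
```

Deriving the tip from elapsed time, not from a counter that a timer increments, keeps the card correct if a timer tick is delayed or dropped.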
Entertainment Loader
T-Rex Run-style mini-game during a long wait. Naive: a spinner for the full duration. Tuned: a clickable runner game; press Jump to leap over incoming mushrooms. The wait stops being time the user is paying and becomes time they are spending. Block & Zakay 1997: filled time feels shorter than empty time while it passes, at the cost of a longer remembered duration.
Pulsing Orb
AI-presence cue for short waits. Naive: a static "Working…" line. Tuned: a small primary-coloured circle that breathes (opacity + scale). Calmer than a spinner, more present than nothing — the modern "the agent is here" signal used between tool calls and after a user message.
Put into Background + Success Message
Past the 10-second wall, foreground waiting is the wrong unit. Naive: spinner blocks the panel; the user must keep watching. Tuned: a "Run in background" affordance demotes the wait to a small corner indicator and surfaces a toast when the work lands. Foreground attention is freed; the result still arrives loudly.
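The tuned behaviour can be sketched as a tiny reducer: one event demotes the wait to the corner, another marks the work as landed and raises a toast. The type names, event names, and toast copy are all hypothetical.

```typescript
// Hypothetical view state for a long-running job.
type JobView = {
  surface: "panel" | "corner"; // where the wait indicator lives
  done: boolean;
  toast: string | null;        // message to surface when backgrounded work lands
};

type JobEvent = "run-in-background" | "work-finished";

function reduceJobView(view: JobView, event: JobEvent): JobView {
  switch (event) {
    case "run-in-background":
      // Demote the wait to a corner indicator; no effect once the work is done.
      return view.done ? view : { ...view, surface: "corner" };
    case "work-finished":
      // Announce loudly only if the user had left the foreground panel.
      return {
        ...view,
        done: true,
        toast: view.surface === "corner" ? "Your result is ready" : null,
      };
  }
}
```

When the job finishes in the foreground panel, no toast is raised, since the result is already in front of the user.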
References
- Block & Zakay 1997
Block, R. A., & Zakay, D. (1997). Prospective and retrospective duration judgments: A meta-analytic review. Psychonomic Bulletin & Review, 4(2), 184–197. The trade-off engagement makes during long inference waits.
- Fitch
Fitch, E. Perceived Performance: The Only Kind That Really Matters (conference talk). Engaging-loading examples (Slack, FIFA) that map onto long AI inference waits.
- Myers 1985
Myers, B. A. (1985). The importance of percent-done progress indicators for computer-human interfaces. Proceedings of CHI '85, 11–17. Determinate progress where measurable.