Long Compute Inference

The user submits a request that is going to take 30 seconds or more. Image generation. A complex analytical query. A long-context reasoning task. The user has crossed the unit-task boundary; static skeletons no longer carry their weight; engagement is the last move before they leave.

This scenario sits in the 10 s+ band. The Block & Zakay 1997 meta-analysis frames the trade-off: engagement compresses prospective duration while expanding retrospective duration, so the design decision is whether you have made that trade deliberately. Fitch's Slack and FIFA examples are the canonical references; Myers 1985 is the determinate-progress fallback where the inference reports phases.

Animated Text Streaming

A chat-style assistant returns a short answer. Naive: total wait, then the full response drops in. Tuned: ~600 ms thinking state, then tokens stream at a natural reading pace.

1 – 10 S

What is happening

The ai-streaming demo above stands in for a long inference. For a real one, the tuned flow stacks more layers:

Thinking state during the time-to-first-token gap — the same dots-and-cursor pattern from ai-chat-streaming-response.
Streaming render as soon as the first token arrives, paced to a natural reading rhythm.
Tool-call transparency if the inference involves visible tool calls — narrate them ("Searching…", "Reading…", "Reasoning…").
Determinate progress if the inference can report phases ("Step 3 of 7"); fall back to engagement otherwise.
Cancellation always available — the stop button must respond inside the perceptual frame even if the abort takes longer.
Background fallback — past 30–60 seconds, offer "do this in the background and notify me" so the user can leave the surface.
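The cancellation layer can be sketched as a small state holder: the stop button flips the UI to a "cancelling" state synchronously, and the slower abort settles afterwards. This is a minimal sketch, not a definitive implementation. `RunState`, `createCancellableRun`, and the `runInference` callback are hypothetical names introduced for illustration; only `AbortController` is a standard platform API.

```typescript
// Hypothetical sketch: all names here are illustrative, not a library API.
type RunState = "running" | "cancelling" | "cancelled" | "done" | "failed";

function createCancellableRun(
  // Assumption: the caller's inference function honours the abort signal.
  runInference: (signal: AbortSignal) => Promise<void>,
  onStateChange: (state: RunState) => void = () => {},
) {
  const controller = new AbortController();
  let state: RunState = "running";

  const setState = (next: RunState): void => {
    state = next;
    onStateChange(next);
  };

  // The final state is only known once the inference promise settles,
  // which may be well after the user pressed stop.
  const finished: Promise<void> = runInference(controller.signal).then(
    () => setState(controller.signal.aborted ? "cancelled" : "done"),
    () => setState(controller.signal.aborted ? "cancelled" : "failed"),
  );

  return {
    state: (): RunState => state,
    cancel: (): void => {
      if (state !== "running") return;
      // Acknowledge the press immediately, before the abort completes.
      setState("cancelling");
      controller.abort();
    },
    finished,
  };
}
```

The design choice is to separate the acknowledgement from the abort itself, so the button can respond at once even when the backend takes longer to stop.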

What to tune

Pre-action — submit button echo within ~50 ms; thinking dots cover the time-to-first-token gap.
First 1 s — thinking state in place. No spinner, no skeleton over content the model will produce.
1 – 10 s — token streaming where text is the output. Tool-call transparency where the work is visible.
Past 10 s — engaging copy where applicable; determinate progress where the inference reports phases. Cancellation always visible.
Past 30–60 s — hand-off to background sync with notification. The foreground is no longer the right surface.
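The timeline above can be sketched as a pure function from elapsed time to the stage the UI should be in. This is an illustrative sketch under stated assumptions: `WaitStage`, `stageFor`, and `ECHO_DEADLINE_MS` are hypothetical names, and the 30 s default for the background hand-off is simply the lower end of the 30–60 s range given in the text.

```typescript
// Hypothetical sketch: names and the default hand-off point are assumptions.
type WaitStage = "thinking" | "streaming" | "engaged" | "background";

// Budget for echoing the submit press (the ~50 ms figure from the list above).
const ECHO_DEADLINE_MS = 50;

function stageFor(elapsedMs: number, backgroundAfterMs: number = 30_000): WaitStage {
  if (elapsedMs < 1_000) return "thinking"; // first 1 s: thinking state
  if (elapsedMs < 10_000) return "streaming"; // 1 to 10 s: tokens or tool-call narration
  if (elapsedMs < backgroundAfterMs) return "engaged"; // past 10 s: progress or engaging copy
  return "background"; // past the hand-off point: offer background run
}
```

A caller might re-evaluate `stageFor(Date.now() - startedAt)` on a timer and pass `60_000` as the second argument for surfaces where a later hand-off suits the task.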

When perceived performance hurts you here

The engagement-vs-retrospective-duration trade is the central trap. A 30-second inference with rich engagement feels short while it runs and long in retrospect — the user remembers it as taking forever even when their session went smoothly. Slack and FIFA accept this; for inference where the user repeats the action many times in a session, the retrospective cost compounds.

The cleaner answer for repeat-use AI inference: ship determinate progress where measurable, tool-call transparency where applicable (the user is learning during the wait), and background sync past 60 s. Generic engaging content (motivational quotes, mini-games) belongs only on rare or one-off inferences.
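One way to read that rule is as a small decision function over what the inference can report and how often the user runs it. The sketch below is an interpretation, not the author's implementation: `InferenceProfile`, `Treatment`, and `pickTreatments` are hypothetical names, and treating generic engagement as a fallback only when no phases are reported is an assumption drawn from the "fall back to engagement otherwise" line earlier.

```typescript
// Hypothetical sketch: one possible encoding of the rule in the paragraph above.
interface InferenceProfile {
  reportsPhases: boolean; // can the backend report "Step 3 of 7"?
  hasVisibleToolCalls: boolean; // are there tool calls worth narrating?
  expectedMs: number; // rough expected duration
  repeatedInSession: boolean; // does the user run this many times per session?
}

type Treatment =
  | "determinate-progress"
  | "tool-call-narration"
  | "background-offer"
  | "generic-engagement";

function pickTreatments(profile: InferenceProfile): Treatment[] {
  const treatments: Treatment[] = [];
  if (profile.reportsPhases) treatments.push("determinate-progress");
  if (profile.hasVisibleToolCalls) treatments.push("tool-call-narration");
  if (profile.expectedMs > 60_000) treatments.push("background-offer");
  // Assumption: quotes and mini-games are reserved for rare, one-off
  // inferences that have no measurable progress to show.
  if (!profile.repeatedInSession && !profile.reportsPhases) {
    treatments.push("generic-engagement");
  }
  return treatments;
}
```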

Accessibility

aria-live="polite" on the streaming output and on tool-call narration.
aria-busy="true" during the inference; flip on completion.
prefers-reduced-motion: reduce — replace cross-fades and pulse animations with static states.
Always provide a way out — visible cancellation, visible "do in background", visible re-attempt on failure.
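The attribute rules above can be sketched as two pure helpers that a rendering layer might call. This is a minimal sketch: `liveRegionAttrs` and `motionMode` are hypothetical names, and the reduced-motion flag is passed in as a boolean so the code runs outside a browser. In a browser, the flag would typically come from `window.matchMedia("(prefers-reduced-motion: reduce)").matches`.

```typescript
// Hypothetical sketch: helper names are illustrative, not a framework API.
interface LiveRegionAttrs {
  "aria-live": "polite";
  "aria-busy": "true" | "false";
}

// Attributes for the streaming output or tool-call narration container.
function liveRegionAttrs(inferenceRunning: boolean): LiveRegionAttrs {
  return {
    "aria-live": "polite",
    "aria-busy": inferenceRunning ? "true" : "false", // flips on completion
  };
}

// Static states replace cross-fades and pulses when the user asks for less motion.
function motionMode(prefersReducedMotion: boolean): "static" | "animated" {
  return prefersReducedMotion ? "static" : "animated";
}
```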

Other patterns that fit

Same band, same surface — switch in whichever maps best to your UI.

Branded Story Sequence

Slack-style cold-boot pattern. Naive: a single "Loading…" line for the full ~12 s. Tuned: a paced sequence of branded frames — wordmark, tagline, skeleton, near-ready — each fading to the next over the wait. The user reads the wait as the app composing itself, not as absence.

10 S+

Rotating Tips

Engaging copy during a long wait. Naive: a static "Loading…" line for the full duration; the user has nothing to do but stare. Tuned: a "Did you know?" card cycles through a handful of perception facts every ~2.5 s with a soft cross-fade. Same wait — but the time fills with information instead of absence.

10 S+

Entertainment Loader

T-Rex Run-style mini-game during a long wait. Naive: a spinner for the full duration. Tuned: a clickable runner game in which the user presses Jump to leap over incoming mushrooms. The wait stops being time the user is paying and becomes time they are spending. Block & Zakay 1997: filled time has a shorter prospective (in-the-moment) duration than empty time, at the cost of a longer retrospective one.

10 S+

Pulsing Orb

AI-presence cue for short waits. Naive: a static "Working…" line. Tuned: a small primary-coloured circle that breathes (opacity + scale). Calmer than a spinner, more present than nothing — the modern "the agent is here" signal used between tool calls and after a user message.

100 MS – 1 S

Put into Background + Success Message

Past the 10-second wall, foreground waiting is the wrong unit. Naive: spinner blocks the panel; the user must keep watching. Tuned: a "Run in background" affordance demotes the wait to a small corner indicator and surfaces a toast when the work lands. Foreground attention is freed; the result still arrives loudly.

10 S+

References

Block & Zakay 1997
Block, R. A., & Zakay, D. (1997). Prospective and retrospective duration judgments: A meta-analytic review. Psychonomic Bulletin & Review, 4(2), 184–197. The trade-off engagement makes during long inference waits.
Fitch
Fitch, E. Perceived Performance: The Only Kind That Really Matters (conference talk). Engaging-loading examples (Slack, FIFA) that map onto long AI inference waits.
Myers 1985
Myers, B. A. (1985). The importance of percent-done progress indicators for computer-human interfaces. Proceedings of CHI '85, 11–17. Determinate progress where measurable.