What a streaming response feels like

April 2026 · 5 min read

Every AI product has to answer a design question that wasn’t a design question five years ago: how should the response arrive?

You have a model that takes seconds to finish. You have a user watching a box. In between, there’s an arrival pattern, and three reasonable defaults. They look similar in a screenshot and feel completely different in use.


Same response. Same total time. Different products.

All at once

The user clicks. The spinner turns. After a beat, the full response appears. This is what most request-response UIs defaulted to before streaming was easy.

The reason it feels slow isn’t the total duration. It’s the time-to-first-byte. The user is watching a blank box with no signal that anything is happening. A 1.6-second wait with nothing appearing feels worse than a 3-second wait with words showing up.

There are places this is still correct: short answers, deterministic responses, copy-paste output like JSON where partial reads would be misleading. But for anything prose-shaped, it’s the wrong default.

Chunked by sentence

Complete sentences arrive one at a time. You get a fast time-to-first-byte, the reader can skim, and the grammar is always intact.

This is where most good AI chat UIs land. It’s a compromise that respects both the model’s actual output rhythm (tokens arriving) and the reader’s actual reading rhythm (a sentence at a time). The rendering has time to apply formatting (bold, inline code, link targets) because each sentence arrives complete, not partial.
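The buffering behind this is small. Here is a minimal sketch (names are illustrative, and the boundary regex is deliberately naive — it would split on abbreviations like “e.g.”): accumulate streamed tokens and invoke a render callback only when a complete sentence is sitting in the buffer.

```typescript
// Sketch: buffer streamed tokens, flush only complete sentences.
// `onSentence` is a hypothetical callback that renders one chunk.
function makeSentenceChunker(onSentence: (s: string) => void) {
  let buffer = "";
  // Naive boundary: terminal punctuation followed by whitespace or end of buffer.
  const boundary = /([.!?])(\s+|$)/;
  return {
    push(token: string) {
      buffer += token;
      let m: RegExpExecArray | null;
      while ((m = boundary.exec(buffer)) !== null) {
        const end = m.index + m[1].length;
        onSentence(buffer.slice(0, end).trim());
        buffer = buffer.slice(end).replace(/^\s+/, "");
      }
    },
    // Call when the stream closes, so a trailing fragment isn't lost.
    flush() {
      if (buffer.trim()) onSentence(buffer.trim());
      buffer = "";
    },
  };
}
```

Because each callback fires with a grammatically complete chunk, the renderer can safely apply markdown formatting to it before display.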

A small detail that matters: a typing indicator that stays visible after the first sentence lands, right up until the response is finished. Without it, the reader assumes the response is done at every pause. The indicator is the promise that more is coming.

Token by token

One character at a time. The typewriter. Time-to-first-byte is instant. This is what ChatGPT did at launch, and it’s still the default most products reach for because it looks “live.”

It’s the pattern with the sharpest trade-off. On short responses (a yes/no, a one-line answer, a joke) it’s perfect. The text appears with theatrical intent. On long responses, it’s a prison. The reader’s eye is pinned to the cursor, crawling along one glyph at a time, unable to skim. The product is literally slower to read than a 3-second wait followed by the full paragraph.
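The mechanism itself is trivial, which is part of why it gets picked: drain the text one character per tick. A sketch, with an illustrative 25 ms/character pace:

```typescript
// Sketch: typewriter reveal — show one more character per tick.
// `render` is a hypothetical callback that repaints the visible text.
async function typewriter(
  text: string,
  render: (visible: string) => void,
  msPerChar = 25
): Promise<void> {
  for (let i = 1; i <= text.length; i++) {
    render(text.slice(0, i));
    await new Promise((r) => setTimeout(r, msPerChar));
  }
}
```

Note that the pace is fixed regardless of response length — which is exactly the property that makes it delightful at 40 tokens and punishing at 400.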

A good design engineer notices that and conditions on output length. If the model is about to emit 400 tokens, switch to chunked. If it’s going to emit 40, let the typewriter run.
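That conditioning can be a one-line policy. A sketch — the 100-token cutoff here is an illustrative threshold, not a recommendation:

```typescript
type StreamMode = "typewriter" | "sentence-chunked";

// Sketch: pick a reveal mode from an estimate of output length.
// `cutoff` is an assumed, tunable threshold.
function chooseMode(estimatedTokens: number, cutoff = 100): StreamMode {
  return estimatedTokens <= cutoff ? "typewriter" : "sentence-chunked";
}
```

The hard part isn’t the branch; it’s getting a length estimate early enough, e.g. from the prompt type or a max-tokens setting, since the model won’t tell you up front.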

The small details the patterns don’t show

A few things that matter in production and don’t show up above:

  • Auto-scroll should be sticky, not forced. If the user scrolls up to re-read something, don’t yank them back down on every new token. The moment they interact, hand them back the scroll.
  • Stop button, always. Streaming without a way to interrupt is a trap. Users realise halfway through that the answer is off-track and need to cut it without waiting for the full generation.
  • Markdown should render progressively. If the model streams **bold**, don’t wait for the closing asterisks to apply bold. Show the formatting as each token confirms it.
  • Tool calls deserve their own visual. When the model pauses to call a function, show that, not a blank cursor. “Searching the web…” or “Querying the database…” with a spinner.
  • Network failure mid-stream. Show the partial response, mark it incomplete, offer retry from where it stopped. The worst pattern is silently dropping the half that did arrive.
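The sticky-scroll rule in the first bullet reduces to one comparison. A sketch, using parameter names that mirror the DOM scroll metrics (the 40 px slack is an assumed tolerance for sub-pixel rounding):

```typescript
// Sketch: follow the stream only while the user is already pinned to the
// bottom. A manual scroll-up moves them outside the slack zone, which
// releases auto-scroll until they return to the bottom themselves.
function shouldAutoScroll(
  scrollTop: number,
  clientHeight: number,
  scrollHeight: number,
  slackPx = 40
): boolean {
  const distanceFromBottom = scrollHeight - (scrollTop + clientHeight);
  return distanceFromBottom <= slackPx;
}
```

Run this check on each new chunk, before scrolling: if it returns false, the user has taken the wheel and the stream should render without moving the viewport.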

Cadence is the product

Two chat products with identical model quality feel completely different depending on how the words arrive. The model is half the product. The arrival pattern is the other half. Most teams pick a streaming mode once at the start of the project and never revisit it. That choice is a design decision masquerading as an engineering one.

Good default: chunked by sentence, with a typing indicator that persists through the response, and a stop button visible the entire time. Everything else is tuning.