The LLM Cold-Start Problem — Trevor Thompson

Every LLM benchmark you’ve ever seen was taken on a brain that just woke up.

There’s a phenomenon called Sleep Inertia. The moment you wake from deep sleep, your cognitive performance doesn’t just dip — it craters. Wertz et al. (JAMA, 2006) found that performance immediately upon waking was measurably worse than after 26 straight hours without sleep.

Sleep inertia is well-characterized in the literature: — Peak impairment in the first 15–30 minutes after waking — Dissipates asymptotically — not linearly — over 2 to 4 hours (Jewett et al., 1999) — Complex cognitive tasks are hit harder than simple motor tasks — Attention lapses more than double vs. baseline (PMC3832615) The brain is online before it’s operational.

Basically every time you open a new chat with an LLM, the model instantiates with zero prior context. No conversation history. No accumulated task frame. No warm state. It doesn’t pick up where it left off. It begins. Every. Single. Time.

The parallel isn’t metaphorical. It’s structural. Sleep inertia research identifies a specific mechanism: the transition lag between consciousness returning and higher-order cognitive regions coming fully online. Brain areas tied to arousal recover first. Regions handling complex reasoning recover much later. Sound familiar?

We have LLM data on this too. The NoLiMa benchmark: at 32k tokens, 11 of 12 tested models dropped below 50% of their short-context performance. Chroma Research (2025): performance degrades as input length increases — “often in surprising and non-uniform ways.” LongCodeBench: Claude 3.5 Sonnet’s accuracy dropped from 29% → 3% as context scaled from 32k to 256k tokens. The model degrades as it fills. What does it look like before it fills?

Here’s the measurement gap nobody has closed: Sleep inertia research built rigorous test batteries to quantify the exact performance cost of the wake-up transition. Psychomotor Vigilance Task. Descending Subtraction Task. N-back. Cognitive throughput across defined intervals. We have decades of clean, quantified impairment curves for humans. We have no equivalent for LLMs at instantiation.

The question worth asking: If we designed an equivalent battery — measuring LLM reasoning quality, working memory proxy tasks, retrieval accuracy, and multi-step coherence at T=0, T+1k tokens, T+5k tokens, T+10k tokens — Would we see an asymptotic dissipation curve? Would it look like the Jewett curve?

And if it does — Current benchmarks are almost exclusively run at optimal context conditions. That’s like measuring human cognitive performance only after the subject has been awake for 4 hours, eaten breakfast, and reviewed their notes. We’re not benchmarking the model. We’re benchmarking the model at its best.

The implication isn’t small. If there’s a measurable cold-start penalty — and the architecture strongly suggests there is — then frontier model performance ceilings are currently underreported by some unknown delta. That delta has never been formally named, measured, or subtracted from evals.

Sleep inertia has countermeasures: light exposure, nap duration, caffeine, sound. Each targets the mechanism of the lag — not just the symptom. If LLM cold-start impairment is real and measurable, what are the countermeasures? Warm-start context priming? Structured system prompt scaffolding? Pre-loaded task state? The intervention design follows directly from the measurement.

We’ve been measuring intelligence. We haven’t been measuring intelligence under the only conditions in which it’s actually deployed. Every AI interaction starts at T=0. Nobody has benchmarked T=0.

Is the benchmark that doesn’t exist yet the one that matters most?