Context Rot: Why More Context Can Make AI Worse

Context rot is when AI accuracy drops as input grows — well before the window fills. Learn why it happens and how to avoid it in your agents.

By Founder of CtxFlow

Context rot is the measurable decline in an AI model’s accuracy as its input grows, well before the context window is full. Add more tokens to a prompt and, past a certain point, the model’s answers get worse, not better: it loses track of details, misses facts it was given, and reasons less reliably. Because the decline starts far below the model’s maximum token limit, a bigger window does not protect you.

Context rot is one of the strongest arguments against the “just dump everything in” approach to AI. This guide explains what it is, the research behind it, and how to keep your agents out of the rot zone.

What exactly is context rot?

Context rot is the phenomenon where adding tokens to an LLM’s input lowers the quality of its output. The model still functions, but it gets less accurate and less consistent as the prompt grows.

It is distinct from context-window overflow. Overflow is hitting the hard token limit. Rot sets in well below that limit — a model with a 200K-token window can show serious degradation at 50K tokens. The capacity is there; the model just stops using it well. This is a core reason behind the Goldilocks problem of agent context.

The name is apt: like rot in fruit, the decline is gradual and starts from the inside while the outside still looks fine. There is no error, no warning, no truncation message — just answers that quietly get less reliable as the prompt grows. That silence is what makes context rot dangerous. A prompt that overflows the window fails loudly and you fix it; a prompt that has rotted returns a confident, fluent, subtly wrong answer that passes a glance and ships.

| | Context rot | Context overflow |
|---|---|---|
| Trigger | Input grows past a soft point | Input exceeds the hard token limit |
| When | Well below the window limit | At the window limit |
| Signal | Silent: accuracy quietly drops | Loud: truncation or an error |
| Fix | Curate and shrink the context | Same, but also forced by the limit |

What does the research say about context rot?

The term was formalized in Chroma’s 2025 study, which put 18 frontier models — Claude, GPT, and Gemini variants among them — through tasks ranging from 10,000 to 500,000 tokens of input.

Every single model degraded as input length grew. Performance dropped even on simple retrieval tasks once distracting filler was added. The finding overturns a common assumption: that models use their full advertised context uniformly. They do not.

The study’s design is worth understanding because it isolates the effect cleanly. The researchers held the task constant — for example, find a specific answer that is definitely present — and varied only the surrounding input length. If models used context uniformly, accuracy would stay flat; the needle is always there. Instead, accuracy fell as the haystack grew. That rules out “the model never knew the answer” and points squarely at the length of the input as the cause. They also found that adding semantically similar but irrelevant distractors hurt more than adding random filler — the harder the noise is to dismiss, the worse the rot. This is consistent with the broader industry framing of context as a finite resource that has to be curated, not merely filled, echoed in Anthropic’s context engineering guidance.
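A length-controlled test of this kind can be approximated in a few lines. The sketch below is not Chroma’s harness; it only illustrates the idea of holding the needle fixed while varying the haystack. The function name `build_haystack`, the word-count approximation of length, and the `depth` parameter are all assumptions made for illustration.

```python
import random


def build_haystack(needle: str, filler: list[str], target_words: int,
                   depth: float, seed: int = 0) -> str:
    """Pad `needle` with filler sentences up to roughly `target_words` words.

    `depth` sets where the needle sits: 0.0 is the start, 1.0 is the end.
    Length is approximated in words, not model tokens (an assumption).
    """
    if not any(sentence.split() for sentence in filler):
        raise ValueError("filler must contain at least one non-empty sentence")
    rng = random.Random(seed)  # seeded so the same haystack can be rebuilt
    sentences: list[str] = []
    words = len(needle.split())
    while words < target_words:
        sentence = rng.choice(filler)
        sentences.append(sentence)
        words += len(sentence.split())
    # The needle is always present; only the surrounding length changes.
    sentences.insert(round(depth * len(sentences)), needle)
    return " ".join(sentences)
```

Swapping the filler list between unrelated sentences and sentences that resemble the needle gives a rough way to compare random filler against similar-looking distractors on your own tasks.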

It is not just about finding a needle

Earlier benchmarks tested “needle in a haystack” — find one fact in a long document. Real tasks are harder: multi-hop reasoning across scattered facts. Context rot hits these hardest, because the model must hold many things in view at once.

This matters because the needle test flatters models. Finding one verbatim string in a long document is something even a brittle long-context model can often do, which led to early optimism that long windows “just worked.” But production tasks are rarely a single lookup. They require connecting a fact on page 3 with a fact on page 40 and a tool output from two steps ago — exactly the kind of distributed reasoning that degrades fastest as the window fills. A model that aces needle-in-a-haystack can still rot badly on real multi-hop work, which is why benchmark headlines about huge context windows should be read with caution.

Why does context rot happen?

Context rot happens because attention is a finite resource spread across all tokens. The more tokens, the thinner the model’s attention on any one of them.

Long inputs also introduce distractors — irrelevant content that competes with the signal. And models attend unevenly across position, which produces the related lost-in-the-middle effect, where mid-prompt facts get dropped. Together these mean more context can actively crowd out the facts that matter.
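The dilution argument can be made concrete with a toy softmax calculation. This is a deliberately simplified sketch, not a description of how any production model works: it assumes a single attention head, one signal token with a hypothetical relevance score, and distractors that all share one lower score. The function name and the score values are illustrative.

```python
import math


def signal_attention_share(signal_score: float, distractor_score: float,
                           n_distractors: int) -> float:
    """Softmax weight on one signal token among n identical distractors.

    Softmax weights always sum to 1, so every extra token takes some
    weight away from the signal.
    """
    signal = math.exp(signal_score)
    noise = n_distractors * math.exp(distractor_score)
    return signal / (signal + noise)


if __name__ == "__main__":
    for n in (10, 100, 1_000, 10_000):
        filler = signal_attention_share(4.0, 0.0, n)   # easy-to-dismiss noise
        similar = signal_attention_share(4.0, 3.0, n)  # plausible-looking noise
        print(f"{n:>6} distractors: filler={filler:.4f} similar={similar:.4f}")
```

In this toy model the share on the signal shrinks as distractors are added, and it shrinks faster when the distractors score close to the signal.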

There is a third, subtler driver: training distribution. Models see far fewer extremely long, densely packed examples during training than short ones, so their behaviour on very long inputs is less practiced and more brittle. A model can advertise a 200K-token window because that is the maximum it can mechanically process, but the bulk of its training rewarded competent behaviour on much shorter inputs. The window size is an engineering ceiling; reliable behaviour is an empirical question that has to be measured, not assumed from the spec sheet. This is why two models with the same advertised window can rot at very different rates.

How is context rot different from a small context window?

A small window limits how much you can include. Context rot is about how poorly the model uses what you do include, regardless of window size.

So the answer to context rot is not “buy a bigger window.” A larger window raises the ceiling but does nothing for attention quality. This is why more context does not reliably improve LLM answers.

Signs your agent is in the rot zone

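You rarely see rot directly; you infer it from behaviour. Watch for the tells described earlier: answers that lose track of details, miss facts that are present in the prompt, or reason less consistently as the prompt grows.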

If you can reproduce any of these by shortening the prompt and watching quality recover, you have confirmed rot rather than a missing fact.
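The shortening check can be sketched as a small helper. Everything here is an assumption for illustration: `ask` stands in for whatever model client you use, `is_correct` is your own grading function, and the three labels are just one way to name the outcomes.

```python
from typing import Callable, Sequence


def shortening_test(
    ask: Callable[[str], str],
    is_correct: Callable[[str], bool],
    question: str,
    full_context: Sequence[str],
    essential_context: Sequence[str],
) -> str:
    """Classify a failure by re-running the task on a trimmed prompt.

    `ask` is a hypothetical wrapper around your model call;
    `is_correct` is your own check of the returned answer.
    """
    def build(chunks: Sequence[str]) -> str:
        return "\n\n".join([*chunks, question])

    if is_correct(ask(build(full_context))):
        return "no failure"
    if is_correct(ask(build(essential_context))):
        # The fact was available both times; trimming recovered it.
        return "likely rot"
    return "likely missing fact or model limitation"
```

Model output can vary between runs, so repeating the test a few times before drawing a conclusion is a reasonable precaution.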

How do you prevent context rot?

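You prevent context rot by keeping the context lean and relevant: retrieve only task-relevant facts, prune stale content, budget tokens, and place key facts at the prompt edges.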

These tactics are all facets of context engineering, the discipline of deciding what the model sees. None requires a model change; they are things you control in how you assemble the prompt.
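One way to sketch lean prompt assembly is to keep only the highest-relevance snippets that fit a token budget and to put the most relevant one last, next to the question. The names `Snippet`, `estimate_tokens`, and `assemble_prompt` are hypothetical, the relevance score is assumed to come from your own retriever, and the four-characters-per-token estimate is a rough assumption that a real tokenizer should replace.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    text: str
    relevance: float  # hypothetical retriever score, higher is better


def estimate_tokens(text: str) -> int:
    """Rough assumption: about 4 characters per token."""
    return max(1, len(text) // 4)


def assemble_prompt(question: str, snippets: list[Snippet],
                    token_budget: int) -> str:
    """Keep the most relevant snippets that fit; the best one goes last."""
    remaining = token_budget - estimate_tokens(question)
    kept: list[Snippet] = []
    for snippet in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        cost = estimate_tokens(snippet.text)
        if cost <= remaining:  # snippets that do not fit are dropped
            kept.append(snippet)
            remaining -= cost
    # Reverse so relevance rises toward the end, next to the question.
    ordered = list(reversed(kept))
    return "\n\n".join([*(s.text for s in ordered), question])
```

Dropping whole snippets rather than truncating them mid-sentence is a design choice in this sketch; a snippet cut in half can become a distractor of its own.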

It also helps to measure your own rot curve rather than trusting a model’s advertised window. Build a small set of representative tasks, run them at increasing input lengths, and note where accuracy turns over. That turning point — not the spec-sheet maximum — is your practical working limit for that model and task. Re-measure when you change models, because, as noted above, two models with identical windows can rot at very different lengths. Knowing your real limit lets you set a token budget with confidence instead of guessing.
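The measurement loop described above might look like the following sketch. `run_task` is a hypothetical hook you would write yourself, and treating the first drop below a `min_accuracy` threshold as the turning point is an assumption; other definitions of where the curve turns over are equally defensible.

```python
from typing import Callable, Optional, Sequence


def measure_rot_curve(
    run_task: Callable[[int], bool],
    lengths: Sequence[int],
    trials: int = 20,
) -> dict[int, float]:
    """Accuracy at each input length.

    `run_task(length)` is a hypothetical hook: it should build a prompt
    of roughly `length` tokens, call your model, and return True if the
    answer was correct.
    """
    return {
        length: sum(run_task(length) for _ in range(trials)) / trials
        for length in lengths
    }


def working_limit(curve: dict[int, float],
                  min_accuracy: float) -> Optional[int]:
    """Largest tested length before accuracy first falls below the threshold."""
    limit = None
    for length in sorted(curve):
        if curve[length] < min_accuracy:
            break
        limit = length
    return limit
```

The result is only as fine-grained as the lengths you test, so it is worth adding extra measurement points around the length where accuracy starts to fall.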

A worked example: the distractor problem

Suppose an agent needs to answer a question about your Q3 pricing change, and the relevant memo is one paragraph. Two ways to build the context:

Dump. You retrieve the entire pricing wiki — twelve pages covering Q1 through Q4, historical rates, regional variations, and an FAQ. The Q3 memo is in there, but so are eleven pages of plausibly relevant pricing text. Those pages are the worst kind of distractor: semantically close to the question, so the model cannot easily dismiss them. Attention spreads across all twelve pages, and the model may anchor on a stale Q1 rate that reads just as authoritative as the Q3 one.

Curate. You retrieve only the Q3 memo paragraph and place it near the end of the prompt, next to the question. The window is short, the signal is unmistakable, and there is nothing for the model to confuse it with. Same underlying knowledge base, same model — but one approach invites rot and the other sidesteps it entirely.

The lesson generalizes: rot is driven less by length for its own sake than by the ratio of signal to plausible-looking noise. Cutting the noise is almost always cheaper and more effective than reasoning harder over it.

The deeper takeaway: a bigger window is the wrong lever for rot. What actually protects accuracy is retrieving a small, relevant slice of knowledge per query instead of pasting in walls of text — and keeping durable facts in persistent agent memory rather than the active prompt. Treat every token as something that has to earn its place, and rot stops being your bottleneck.

FAQ

What is context rot in LLMs? Context rot is the decline in an LLM’s accuracy as its input grows longer. The model becomes less reliable at using facts it was given, even when the prompt stays well under the maximum token limit.

Is context rot the same as running out of context window? No. Running out of window is overflow — hitting the hard token cap. Context rot happens well before that cap, as a gradual quality decline driven by attention spreading thin across many tokens.

At what context length does rot start? There is no fixed threshold; it varies by model and task. Research shows degradation can begin at a fraction of a model’s advertised window — sometimes tens of thousands of tokens below the limit.

How do I stop context rot in my AI agent? Keep context lean: retrieve only task-relevant facts, prune stale content, budget tokens, and place key facts at the prompt edges. Treat the window as scarce, not free.

Does context rot affect newer long-context models? Yes. The Chroma study tested current frontier models and found all of them degraded as input grew. Newer models tolerate more before they slip, but the direction is the same — quality falls with length. A bigger advertised window does not make a model immune.

Is context rot the same as the lost-in-the-middle effect? They are related but not identical. Context rot is the overall decline in accuracy as input grows. Lost-in-the-middle is the positional version — facts in the middle of a long prompt get dropped. The longer the input, the more middle there is, so the two compound each other.

Why does adding a relevant document sometimes make answers worse? Because even relevant-looking documents act as distractors when most of their content does not bear on the question. They spread the model’s attention and can pull it toward stale or adjacent facts. Retrieving a trimmed passage instead of a whole document avoids this.

How do I tell context rot apart from a model simply being wrong? Run the shortening test. Take the failing prompt, remove everything except the few facts the task truly needs, and re-run it. If the answer improves, the problem was rot — the model had the fact but lost it in the noise, and trimming brought it back into focus. If the answer stays wrong even on the lean prompt, the issue is not rot but a missing fact or a genuine model limitation, and the fix is to retrieve the right fact rather than to trim. This two-minute experiment is the fastest way to know whether you are facing a context problem you can engineer away or a grounding gap you need to fill.
