Lost in the Middle: How LLMs Drop Context in Long Prompts

“Lost in the middle” is the well-documented tendency of large language models to use information at the beginning and end of a prompt reliably, while ignoring facts buried in the middle. Place the key fact your agent needs in the center of a long prompt, and the model is far more likely to miss it. Accuracy follows a U-shape: high at the edges, low in the middle. This means where you put a fact in the context window changes whether the model actually uses it.

Named in a 2023 study, the effect holds even for models built for long contexts, and it remains a core challenge for long-context AI. The practical fix is ordering and length: put key facts at the prompt edges and keep prompts lean. This guide explains the effect, the research behind it, and how to keep critical facts where the model will actually read them.

In this guide

What is the lost-in-the-middle effect
What does the Lost in the Middle paper say
Why do LLMs lose the middle
How does this affect AI agents
How do you avoid lost-in-the-middle problems
Reordering: a cheap, training-free fix
A worked example: ten documents, one answer
Common mistakes that bury the signal

What is the lost-in-the-middle effect?

The lost-in-the-middle effect is a position bias in how LLMs read their input. Facts at the top or bottom of the prompt get used; facts in the middle get overlooked.

The practical consequence: two prompts with identical content can produce different answers depending on where the important fact sits. This is one of the failure modes behind the Goldilocks problem of how much context an agent needs — long prompts bury their own signal.

What does the Lost in the Middle paper say?

The effect was documented in Lost in the Middle: How Language Models Use Long Contexts by Liu et al. (2023), later published in the Transactions of the Association for Computational Linguistics.

The researchers tested models on multi-document question answering and key-value retrieval. They found performance was highest when the relevant information appeared at the very start or very end of the input, and degraded significantly when it sat in the middle — even for models explicitly designed for long contexts. The conclusion: current models do not robustly use information across long inputs.

The experimental design is elegant. The team gave models a set of documents, exactly one of which contained the answer, and systematically slid that one document from the first position to the last, measuring accuracy at each step. The content was identical every time; only the position of the answer changed. If models read uniformly, the line would be flat. Instead it traced a clear U: strong at position one, sagging through the middle, recovering at the final position. Follow-on work has put numbers on how steep that sag can be — in some setups, moving the relevant passage from the first slot to roughly the fifth dropped answer accuracy by more than 20 percentage points. Same information, different slot, materially worse answer.
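A position sweep of this kind is easy to reproduce in miniature. The sketch below is not the paper's code; it is a minimal harness that assumes you supply your own ask_model function (a hypothetical callable that takes a prompt string and returns the model's reply) and that a simple substring match is a good enough correctness check for your task.

```python
from typing import Callable, Dict, List

# Hypothetical type: any function that takes a prompt and returns the model's reply.
AskModel = Callable[[str], str]


def build_prompt(question: str, documents: List[str]) -> str:
    """Number the documents, join them, and append the question at the end."""
    numbered = [f"Document {i + 1}: {doc}" for i, doc in enumerate(documents)]
    return "\n\n".join(numbered + [f"Question: {question}"])


def position_sweep(
    ask_model: AskModel,
    question: str,
    answer_doc: str,
    distractors: List[str],
    expected: str,
) -> Dict[int, bool]:
    """Slide answer_doc through every slot and record whether the reply contains `expected`."""
    results: Dict[int, bool] = {}
    for slot in range(len(distractors) + 1):
        # Same content every time; only the position of the answer changes.
        docs = distractors[:slot] + [answer_doc] + distractors[slot:]
        reply = ask_model(build_prompt(question, docs))
        results[slot + 1] = expected.lower() in reply.lower()
    return results
```

Run it over many questions and average the hits per slot to get a curve. A single question gives only one pass or fail per position, which is too noisy to read a shape from.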

Why a U-shape?

The curve is high at both ends and dips in the center. This mirrors a primacy-and-recency pattern — the model anchors on what it reads first and last, much like a person skimming a long document.

The human parallel is more than a metaphor. Cognitive psychology has long documented the serial-position effect: people recall the first and last items in a list better than the middle ones. LLMs show a strikingly similar profile, though for mechanical rather than cognitive reasons — the first tokens benefit from strong early-position signals and the last tokens from recency, while the dense middle competes for a thinner slice of attention. Some research finds the bias is asymmetric, with an early-token (primacy) pull that is often stronger than the recency pull, and that the exact shape varies by model. The practical takeaway is unchanged: the middle is the weak zone wherever its exact floor sits.

Why do LLMs lose the middle?

Models lose the middle because of how attention distributes across a long sequence. Tokens at the edges receive stronger, more consistent attention; tokens in the dense middle compete for a thinner share.

This is closely related to context rot, where overall accuracy falls as input grows. Lost-in-the-middle is the positional face of that same problem: the longer the prompt, the more “middle” there is to lose.

It helps to separate the two ways length hurts. Rot is the aggregate effect: total accuracy drifts down as you add tokens, regardless of where the key fact sits. Lost-in-the-middle is the spatial pattern within that decline: some positions suffer far more than others. You can think of rot as the average falling and lost-in-the-middle as the spread across positions. They share a root cause (attention spread thin over many tokens) but suggest different fixes: rot says “shorten the prompt”; lost-in-the-middle adds “and put what matters at the edges.” A robust context pipeline does both, because shortening reduces the middle and ordering protects whatever middle remains.
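If you have a per-position accuracy curve from your own evaluation, the two effects can be read off it separately. The sketch below is one simple way to do that, not a standard metric: it assumes the first and last slots count as the edges and everything between them as the middle, and the function name is made up for illustration.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize_curve(accuracy_by_position: Sequence[float]) -> Dict[str, float]:
    """Split a position-accuracy curve into an overall level and an edge-vs-middle gap."""
    if len(accuracy_by_position) < 3:
        raise ValueError("need at least three positions to have a middle")
    edges = [accuracy_by_position[0], accuracy_by_position[-1]]
    middle = accuracy_by_position[1:-1]
    return {
        # Overall level: this is what drifts down as the prompt grows (rot).
        "overall": mean(accuracy_by_position),
        # Positional gap: positive when the edges beat the middle (a U-shaped curve).
        "edge_minus_middle": mean(edges) - mean(middle),
    }


# Made-up numbers for illustration only, not measurements from any model.
example = summarize_curve([0.8, 0.6, 0.5, 0.6, 0.8])
```

Tracking both numbers as you change a pipeline shows which lever moved: shortening should raise the overall level, while reordering should matter most when the gap is large.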

Encouragingly, researchers have shown the bias is partly correctable without retraining. Work on calibrating positional attention finds that the under-attention to middle positions can be measured and adjusted at inference time, recovering some of the lost accuracy. That reinforces the practical point: the middle is a known, tractable weakness, not an immovable law — and the cheapest way to exploit that is simply to stop putting important things there.

How does this affect AI agents?

For agents, lost-in-the-middle means you cannot just append everything and trust the model to find what matters. Retrieved documents, long histories, and tool outputs pile up in the middle of the prompt — exactly where attention is weakest.

An agent given ten documents may answer from the first and last while ignoring the one in position six that holds the answer. This is a common cause of generic or wrong AI answers despite a “full” prompt.

The effect is especially treacherous for agents because it is invisible at assembly time. When you write the code that stuffs ten documents into the prompt, every document is “there” — the prompt looks complete and correct. The failure only shows up in the output, as an answer that quietly draws from the wrong source or misses a fact that was demonstrably present. There is no error to catch, no missing-data exception. This is why debugging an agent that “has all the information but still gets it wrong” so often comes down to position: the data is in the window, just not where the model reads it well.
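Because nothing fails at assembly time, it can help to log where each document actually lands before the prompt is sent. The sketch below is one rough way to do that. It measures position in characters rather than tokens, and the 25% to 75% "middle band" is an arbitrary assumption to tune, not a threshold from the research.

```python
from typing import List, Tuple


def audit_positions(
    documents: List[str],
    middle_band: Tuple[float, float] = (0.25, 0.75),
) -> List[dict]:
    """Report where each document's midpoint falls in the joined prompt (0.0 = start, 1.0 = end)."""
    separator = "\n\n"
    total = sum(len(doc) for doc in documents) + len(separator) * max(len(documents) - 1, 0)
    report = []
    offset = 0
    for index, doc in enumerate(documents):
        midpoint = (offset + len(doc) / 2) / total if total else 0.0
        report.append({
            "index": index,
            "relative_position": round(midpoint, 3),
            # Flag documents whose midpoint sits inside the assumed weak zone.
            "in_middle": middle_band[0] <= midpoint <= middle_band[1],
        })
        offset += len(doc) + len(separator)
    return report
```

If the document you care about most shows up flagged as in_middle, that is a cue to reposition it before blaming retrieval or the model.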

How do you avoid lost-in-the-middle problems?

You avoid it by controlling order and length:

Put the most important facts at the start and end of the prompt.
Keep prompts short so there is less middle to lose — see context window management.
Prune irrelevant content that pads the middle.
Retrieve narrowly so only high-signal context enters the window.

These are all part of context engineering.
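The four habits above can be combined in a single assembly step. The sketch below is a minimal illustration under stated assumptions: the function name, the score threshold, and the item cap are all hypothetical, and restating the key fact near the bottom is one design choice for covering both edges, not a requirement.

```python
from typing import List, Tuple


def assemble_prompt(
    instruction: str,
    key_fact: str,
    scored_context: List[Tuple[float, str]],
    question: str,
    min_score: float = 0.5,
    max_items: int = 4,
) -> str:
    """Prune low-signal context, then keep the instruction and key fact at the edges."""
    # Prune and retrieve narrowly: drop weak matches, then cap how many survive.
    strong = [item for item in scored_context if item[0] >= min_score]
    kept = sorted(strong, key=lambda item: item[0], reverse=True)[:max_items]
    context = "\n".join(text for _, text in kept)
    sections = [
        instruction,                # top edge
        f"Key fact: {key_fact}",
        context,                    # whatever middle remains
        f"Reminder: {key_fact}",    # restated near the bottom edge
        f"Question: {question}",
    ]
    # Skip empty sections so a fully pruned context leaves no blank gap.
    return "\n\n".join(section for section in sections if section)
```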

Reordering: a cheap, training-free fix

If you cannot shorten a prompt — say you genuinely have ten relevant documents — the next-best lever is order. Because the edges get the strongest attention, you want your highest-value content there and your lowest-value content buried in the middle, which is the opposite of what naive concatenation usually produces.

A well-studied technique in retrieval-augmented setups is to reorder retrieved passages by relevance score, placing the highest-scoring documents at the start and end and the weakest in the center. Research on long-context RAG reports that this consistently improves accuracy, especially when many passages are retrieved — and it requires no fine-tuning, no model change, and almost no extra compute. It is one of the highest-return-on-effort fixes in the whole context toolbox: you already have the documents and their scores; you are only changing the order in which you paste them.
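One common way to implement this is to rank the passages and then deal them alternately to the front and the back, so the ranking works inward from both edges. The sketch below assumes you already have (score, text) pairs from your retriever, with higher scores meaning more relevant; the function name is illustrative.

```python
from typing import List, Tuple


def reorder_edges_first(scored_passages: List[Tuple[float, str]]) -> List[str]:
    """Place the highest-scoring passages at the start and end, the weakest in the center."""
    ranked = sorted(scored_passages, key=lambda item: item[0], reverse=True)
    front: List[str] = []
    back: List[str] = []
    for rank, (_, text) in enumerate(ranked):
        # Alternate: rank 0 goes to the front, rank 1 to the back, and so on inward.
        if rank % 2 == 0:
            front.append(text)
        else:
            back.append(text)
    # Reverse the back half so the second-best passage ends up in the final slot.
    return front + back[::-1]


passages = [(0.9, "a"), (0.1, "e"), (0.7, "b"), (0.3, "d"), (0.5, "c")]
print(reorder_edges_first(passages))  # → ['a', 'c', 'e', 'd', 'b']
```

The weakest passage ("e") lands in the center slot, and the two strongest ("a" and "b") take the first and last positions.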

The same logic applies beyond RAG. In any prompt, put the instruction the model must not miss and the single most important fact near the top or bottom — never in the soft center of a long block.

A worked example: ten documents, one answer

Picture an agent that retrieves ten passages to answer a question, and the answer lives entirely in passage six. Concatenate them in arbitrary retrieval order and passage six lands squarely in the lossy middle; the model may anchor on passages one and ten and never properly read six, producing a confident answer drawn from the wrong source.

Two fixes apply, and they stack. First, reorder: sort by relevance so the strongest passage (likely six) sits at an edge. Second, trim: if passages two, four, and eight are weak matches, drop them entirely. Seven passages remain, so the middle shrinks from eight buried passages to five. After both moves, the prompt is shorter and the answer-bearing passage is prominent and near an edge, exactly where the model reads best. Notice that neither fix touched the model or the question; both are pure context discipline.
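The scenario can be traced on toy data. In the sketch below, the relevance scores are invented for illustration: passage six is given the highest score, and passages two, four, and eight the lowest. The 0.3 cutoff is likewise an assumption.

```python
from typing import Dict, List

# Hypothetical relevance scores for passages 1..10; passage 6 holds the answer.
SCORES: Dict[int, float] = {
    1: 0.55, 2: 0.10, 3: 0.60, 4: 0.15, 5: 0.50,
    6: 0.95, 7: 0.65, 8: 0.12, 9: 0.58, 10: 0.52,
}


def trim_then_reorder(scores: Dict[int, float], min_score: float = 0.3) -> List[int]:
    """Drop weak passages, then alternate the rest between front and back by rank."""
    kept = [pid for pid, score in scores.items() if score >= min_score]
    ranked = sorted(kept, key=lambda pid: scores[pid], reverse=True)
    # Even ranks fill the front; odd ranks fill the back, reversed so rank 1 is last.
    return ranked[0::2] + ranked[1::2][::-1]


before = list(SCORES)               # arbitrary retrieval order: 1..10
after = trim_then_reorder(SCORES)

print(before.index(6) + 1, "of", len(before))  # → 6 of 10
print(after.index(6) + 1, "of", len(after))    # → 1 of 7
print(after)                                   # → [6, 3, 1, 5, 10, 9, 7]
```

With these scores, passage six moves from slot six of ten to slot one of seven, and the weakest surviving passage (five) ends up in the center.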

Common mistakes that bury the signal

Concatenating retrieved docs in arbitrary order. Sort by relevance and put the best at the edges.
Padding the prompt with “just in case” context. Every filler token lengthens the middle and dilutes attention.
Putting the critical instruction in the middle of a long system prompt. Move must-not-miss rules to the top or the very end.
Trusting a big window to compensate. A large window has a larger middle, not a safer one. Length is the enemy here, not capacity.
Replaying full history verbatim. Old turns swell the middle. Summarize them so the recent, relevant turns stay near the edge.
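The last item in that list, summarizing old history, can be sketched as follows. This is a minimal illustration, assuming turns are plain strings and that you supply your own summarize function (hypothetical here; in practice it might be another model call). The default of four recent turns is arbitrary.

```python
from typing import Callable, List


def compact_history(
    turns: List[str],
    summarize: Callable[[List[str]], str],
    keep_recent: int = 4,
) -> List[str]:
    """Replace older turns with one summary line; keep the latest turns verbatim."""
    if keep_recent <= 0:
        raise ValueError("keep_recent must be positive")
    if len(turns) <= keep_recent:
        return list(turns)
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    # One short summary stands in for the old turns, so the recent ones stay near the edge.
    return [f"Summary of earlier conversation: {summarize(older)}"] + recent
```

The summary should preserve any facts the agent still needs, since whatever it omits is gone from the prompt entirely.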

The through-line is simple: the surest way to beat lost-in-the-middle is to never build a long middle. Retrieve a small, relevant slice per query rather than appending everything, and the position problem mostly takes care of itself — there is little middle left for the model to lose.

FAQ

What does “lost in the middle” mean for LLMs? It means LLMs reliably use information placed at the beginning and end of a prompt, but tend to overlook facts in the middle. Answer quality depends on where the key fact sits, not just whether it is present.

Who discovered the lost-in-the-middle effect? It was documented by Liu et al. in the 2023 paper Lost in the Middle: How Language Models Use Long Contexts, from Stanford and collaborators, later published in the Transactions of the Association for Computational Linguistics.

Does lost-in-the-middle affect long-context models? Yes. The original study found the effect even in models explicitly designed for long contexts. A large window does not guarantee the model uses the middle of that window well.

How do I fix lost-in-the-middle in my prompts? Place critical facts at the start and end, keep prompts short, prune filler, and retrieve only high-signal context. Reducing prompt length shrinks the vulnerable middle. If you must keep many documents, reorder them by relevance so the strongest sit at the edges.

Does reordering retrieved documents really help? Yes. Research on long-context retrieval shows that sorting passages by relevance and placing the highest-scoring ones at the start and end consistently improves accuracy, especially with many passages. It is training-free and nearly free to compute — one of the best effort-to-payoff fixes available.

Is lost-in-the-middle the same as context rot? They overlap. Context rot is the overall accuracy decline as input grows; lost-in-the-middle is its positional component — facts in the center get dropped. Longer prompts have more middle, so the two effects reinforce each other.

Why does my agent ignore a document I clearly gave it? Most likely it landed in the lossy middle of a long prompt. Try moving it to the start or end, or trimming the surrounding documents so there is less middle. If quality recovers when you reposition the same content, you have confirmed a position problem.

Does lost-in-the-middle affect every model the same way? No. The strength of the effect varies by model and changes as architectures improve, but research across many long-context models keeps finding the same U-shaped curve — strongest attention at the edges, weakest in the center — even in models marketed as having very large windows. Newer models often soften the dip rather than remove it. The practical takeaway does not change with model version: do not assume a fact is “in context” just because it fits within the window, and verify with the reposition test rather than trusting the headline context length.