Long-Term vs Short-Term Memory for LLMs: The Brain Analogy
Short-term memory for an LLM is the context window — small, fast, and holding only what’s needed for the current task. Long-term memory is a persistent external store that keeps durable facts the model can recall much later. The split mirrors the human brain. Your working memory juggles the few things you’re actively thinking about, while your long-term memory stores everything from your name to skills learned years ago. You don’t keep your whole life in active attention, and an LLM shouldn’t either. Separating the two keeps each request focused and affordable, and lets the model build durable knowledge without overloading any single prompt.
This article walks the analogy end to end: how the brain splits memory by time horizon, what each tier maps to in an LLM, and why cramming everything into the short-term tier costs more and answers worse.
In this guide
- How does the brain split memory?
- What is short-term memory in an LLM?
- What is long-term memory in an LLM?
- Short-term vs long-term: a comparison
- Where the analogy breaks down
- Beyond two tiers: episodic, semantic, procedural
- Why not just use a giant short-term memory?
- How do the two work together?
- What goes in each tier?
- A worked example: the two tiers on one request
How does the brain split memory?
The brain separates memory by time horizon and capacity. Two systems matter here.
- Working memory is tiny and fast. It holds the handful of things you’re actively thinking about — a phone number you just heard, the sentence you’re reading.
- Long-term memory is vast and durable. It stores knowledge for years and surfaces it only when something triggers recall.
You don’t recite your entire life to make a decision. You pull the relevant memory and act. That selective recall is the model worth copying — and it underpins our pillar on AI agent memory.
Just how tiny is working memory? The famous estimate is George Miller’s “magical number seven, plus or minus two,” though later research revised the practical limit down to roughly four chunks. Either figure is startlingly small next to the lifetime of knowledge in long-term storage. The lesson for AI is the same one evolution settled on: don’t try to hold everything in active attention. Keep working memory lean and rely on a large, queryable long-term store.
What is short-term memory in an LLM?
Short-term memory for an LLM is the context window — the text the model reads on a single request. It’s the model’s working memory: large compared to a human’s, but still bounded and temporary.
The window is rebuilt on every request, and when the session ends it’s gone. Nothing in the context window persists unless something writes it down. The window is attention, not storage, a distinction we draw in AI memory vs context.
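As a rough illustration of “attention, not storage”, the sketch below models the window as a bounded in-process buffer. The `ContextWindow` class, its `max_tokens` budget, and the whitespace word count standing in for a real tokenizer are all assumptions made for this example, not any particular library’s API.

```python
from collections import deque


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


class ContextWindow:
    """Hypothetical short-term memory: a token-bounded buffer that lives only in RAM."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.messages = deque()  # oldest message on the left

    def total_tokens(self) -> int:
        return sum(count_tokens(m) for m in self.messages)

    def add(self, message: str) -> None:
        self.messages.append(message)
        # Evict the oldest messages until the buffer fits the budget again,
        # always keeping at least the newest message.
        while self.total_tokens() > self.max_tokens and len(self.messages) > 1:
            self.messages.popleft()


window = ContextWindow(max_tokens=8)
window.add("My name is Ada")          # 4 tokens
window.add("I prefer metric units")   # 4 tokens, total 8
window.add("What is 5 miles in km")   # 6 tokens, pushes the total over budget

print(list(window.messages))
```

Once the budget is exceeded, the earlier messages (including the user’s name) are simply gone, because nothing wrote them anywhere else.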
What is long-term memory in an LLM?
Long-term memory is a persistent external store that survives sessions. It holds durable facts the model can recall later through a retrieval step.
Unlike the context window, long-term memory isn’t read on every request. It’s queried when relevant, and only the matching subset is pulled into the prompt. The durable layer is covered in persistent memory for AI agents, and the recall mechanics in how do AI agents remember.
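To make “survives sessions” and “queried when relevant” concrete, here is a minimal sketch using Python’s built-in `sqlite3` module. The `facts` table, the `open_store`, `remember`, and `recall` helpers, and the exact-topic lookup are assumptions chosen for brevity, not a recommended schema.

```python
import os
import sqlite3
import tempfile


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the hypothetical long-term store on disk."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS facts (topic TEXT, fact TEXT)")
    return conn


def remember(conn: sqlite3.Connection, topic: str, fact: str) -> None:
    """Write one durable fact and commit it to disk."""
    conn.execute("INSERT INTO facts (topic, fact) VALUES (?, ?)", (topic, fact))
    conn.commit()


def recall(conn: sqlite3.Connection, topic: str) -> list[str]:
    """Return only the facts filed under the requested topic."""
    rows = conn.execute(
        "SELECT fact FROM facts WHERE topic = ? ORDER BY rowid", (topic,)
    ).fetchall()
    return [fact for (fact,) in rows]


db_path = os.path.join(tempfile.mkdtemp(), "memory.db")

# Session 1: write durable facts, then end the session.
session_one = open_store(db_path)
remember(session_one, "billing", "Standard contract is net-30")
remember(session_one, "formatting", "Prefers DD/MM/YYYY dates")
session_one.close()

# Session 2: a fresh connection still sees what session 1 saved,
# and pulls back only the subset that matches the query.
session_two = open_store(db_path)
billing_facts = recall(session_two, "billing")
session_two.close()

print(billing_facts)
```

Exact topic matching keeps the sketch short; production systems commonly use full-text or embedding-based search for the recall step.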
Short-term vs long-term: a comparison
| Dimension | Short-term (context window) | Long-term (persistent store) |
|---|---|---|
| Role | Working memory | Durable knowledge |
| Lifespan | One request | Across sessions |
| Capacity | Bounded by tokens | Bounded by storage |
| When used | Every request | When relevant |
| Cost | Tokens per request | Storage + retrieval |
| Brain analog | Active attention | Long-term memory |
The takeaway: they do different jobs. Short-term keeps the current task sharp; long-term gives the model continuity over time.
Where the analogy breaks down
The brain analogy is a useful blueprint, not a literal equivalence, and it’s worth knowing where it strains so you don’t over-apply it.
- An LLM’s “working memory” is enormous by human standards. A model can hold tens of thousands of words in its context window — orders of magnitude beyond a human’s few chunks. The shapes rhyme; the scales don’t.
- The model doesn’t learn from its long-term store. When you recall a fact, your brain can integrate it and change. An LLM’s long-term memory is external; reading from it doesn’t alter the model. The weights stay frozen.
- There’s no automatic consolidation. Human memory moves things from short-term to long-term on its own, during rest. An LLM only “remembers” if a write step explicitly saves something. Nothing transfers by default.
Hold the analogy loosely. It tells you how to structure a system — lean working memory, large queryable long-term store, selective recall — without implying the two work the same way under the hood.
Beyond two tiers: episodic, semantic, procedural
The short-term/long-term split is the foundation, but the long-term tier is itself not monolithic. Research on LLM agents borrows a finer vocabulary from cognitive science, describing several distinct long-term memory types:
- Episodic memory holds specific past events — “the user asked about pricing last Tuesday.”
- Semantic memory holds durable, generalized facts — “this team’s standard contract is net-30.”
- Procedural memory holds how to do recurring things — a verified sequence of steps for a task.
These interact in a way the brain analogy predicts: repeated episodes consolidate into semantic facts (three date-format corrections become the rule “prefers DD/MM/YYYY”). The mix is also workload-dependent — research finds personal assistants lean on semantic memory while coding agents lean on procedural. For most knowledge-grounded assistants, semantic long-term memory does the heaviest lifting.
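The consolidation step can be sketched as a simple counting rule. The `(key, value)` episode format, the `consolidate` function, and the threshold of three are assumptions for illustration; real systems may use a model or a richer heuristic to decide when episodes have become a rule.

```python
from collections import Counter

# Hypothetical threshold: how many matching episodes justify a general rule.
PROMOTION_THRESHOLD = 3


def consolidate(
    episodes: list[tuple[str, str]], threshold: int = PROMOTION_THRESHOLD
) -> dict[str, str]:
    """Promote observations seen at least `threshold` times to semantic facts."""
    counts = Counter(episodes)
    semantic = {}
    for (key, value), seen in counts.items():
        if seen >= threshold:
            semantic[key] = value
    return semantic


# Episodic memory: individual events, each recorded as
# (what it was about, what was observed).
episodes = [
    ("date_format", "DD/MM/YYYY"),
    ("date_format", "DD/MM/YYYY"),
    ("asked_about", "pricing"),
    ("date_format", "DD/MM/YYYY"),
]

semantic_facts = consolidate(episodes)
print(semantic_facts)
```

The three date-format corrections clear the threshold and become a standing preference, while the single pricing question stays an episode.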
Why not just use a giant short-term memory?
Because cramming everything into the context window backfires. There’s a hard token limit, and even under it, accuracy sags for facts parked in the middle of a long input. Liu and colleagues measured this “lost in the middle” effect in 2023. Past a point, a bigger window stops helping.
It’s also expensive: every token in the window costs compute on every request. Selective long-term recall is cheaper and sharper than a bloated short-term memory. The right amount of context per call is its own topic, covered in how much context an AI agent needs, and the relevance principle in scoped memory for AI agents.
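A back-of-the-envelope calculation shows how quickly per-request tokens add up. Every figure here (prompt size, fact size, traffic, and price per million tokens) is invented for the example, so treat the output as a shape, not a quote.

```python
# All figures below are invented for illustration, not measured prices or sizes.
PROMPT_TOKENS = 200             # system prompt plus the user's question
TOKENS_PER_FACT = 50
FACTS_IN_STORE = 1_000
FACTS_RECALLED = 3
REQUESTS_PER_DAY = 10_000
USD_PER_MILLION_INPUT_TOKENS = 1.00


def daily_input_cost(tokens_per_request: int) -> float:
    """Input-token spend per day at the assumed price and traffic."""
    total_tokens = tokens_per_request * REQUESTS_PER_DAY
    return total_tokens * USD_PER_MILLION_INPUT_TOKENS / 1_000_000


# Giant short-term memory: every stored fact rides along on every request.
stuffed_tokens = PROMPT_TOKENS + FACTS_IN_STORE * TOKENS_PER_FACT
# Selective recall: only the few retrieved facts enter the window.
selective_tokens = PROMPT_TOKENS + FACTS_RECALLED * TOKENS_PER_FACT

stuffed_cost = daily_input_cost(stuffed_tokens)
selective_cost = daily_input_cost(selective_tokens)

print(f"stuffed:   {stuffed_tokens:>6} tokens/request, ${stuffed_cost:,.2f}/day")
print(f"selective: {selective_tokens:>6} tokens/request, ${selective_cost:,.2f}/day")
```

Under these made-up numbers the stuffed window costs $502.00 a day in input tokens against $3.50 for selective recall. The comparison leaves out the storage and retrieval costs that the long-term tier adds.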
The brain made the same trade-off. Working memory could in principle have been larger — but a small, fast working set that pulls from a vast long-term store turned out to be the better architecture for acting in a complex world. LLMs are rediscovering the same engineering conclusion: a lean window fed by selective recall beats a giant window stuffed indiscriminately.
How do the two work together?
In a well-built agent, long-term memory feeds short-term memory. The flow is the same one the brain uses: recall the relevant durable facts, hold them in working memory, act.
- A request arrives. The context window (short-term memory) holds nothing yet beyond the system prompt and the new question.
- A retrieval step queries long-term memory for relevant facts.
- Those facts enter the context window alongside the prompt.
- The model answers; anything new worth keeping is written back to long-term memory.
This loop lets an agent remember across sessions without ever overloading a single request.
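The four steps above can be sketched as one function. Everything named here is hypothetical: `retrieve` ranks facts by shared keywords as a stand-in for a real search index, `fake_model` replaces an actual LLM call, and the caller decides which `new_fact` is worth keeping.

```python
import re
from typing import Optional

STOPWORDS = {"the", "is", "a", "in", "does", "which", "where", "what"}


def keywords(text: str) -> set[str]:
    """Lowercased words minus a tiny stopword list."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(store: list[str], query: str, limit: int = 3) -> list[str]:
    """Rank stored facts by how many keywords they share with the query."""
    wanted = keywords(query)
    scored = [(len(wanted & keywords(fact)), fact) for fact in store]
    relevant = [pair for pair in scored if pair[0] > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [fact for _, fact in relevant[:limit]]


def fake_model(context: list[str]) -> str:
    """Placeholder for a real LLM call; it only reports what it was shown."""
    return f"answered using {len(context) - 1} recalled fact(s)"


def handle_request(store: list[str], prompt: str, new_fact: Optional[str] = None) -> str:
    context = []                          # 1. the request starts with an empty window
    recalled = retrieve(store, prompt)    # 2. query long-term memory
    context.extend(recalled)              # 3. recalled facts join the prompt
    context.append(prompt)
    answer = fake_model(context)          # 4. answer ...
    if new_fact and new_fact not in store:
        store.append(new_fact)            #    ... and write back what is worth keeping
    return answer


long_term = ["The user prefers metric units", "The project deadline is in March"]

first = handle_request(
    long_term, "Which units does the user want?", new_fact="The user is based in Lisbon"
)
second = handle_request(long_term, "Where is the user based?")

print(first)
print(second)
print(retrieve(long_term, "Where is the user based?")[0])
```

The second request can recall the Lisbon fact only because the first request wrote it back; the window itself carried nothing over.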
What goes in each tier?
Knowing the split is one thing; deciding what belongs where is the practical skill. A useful rule of thumb: short-term memory holds what’s true for this request, and long-term memory holds what’s true across requests.
Short-term memory, the context window, should carry the immediate, request-specific material: the user’s current question, the active document or task, and the small set of facts retrieval just pulled in to answer it. It’s deliberately transient. Nothing belongs in the window simply because it might be relevant someday; that’s long-term memory’s job. Padding the window “just in case” is exactly the over-stuffing that buries facts mid-input, where models recall them least reliably.
Long-term memory holds the durable layer: stable facts, prior decisions, preferences, project state, conventions — anything that should outlive the current conversation and might matter to a future one. The test for promotion to long-term storage is whether a different session would benefit from knowing it. The user’s name passes; their typo two messages ago does not.
| Belongs in short-term (window) | Belongs in long-term (store) |
|---|---|
| The current question | Stable facts about the user or work |
| The active document or task | Decisions already made |
| Facts retrieval just pulled in | Preferences and conventions |
| The running conversation | Project state across sessions |
Get the sorting right and the two tiers stay in their lanes: a lean window kept sharp, a rich store kept durable, and a retrieval step moving the right facts between them on demand.
A worked example: the two tiers on one request
Make it concrete with a single request to a support assistant: “Can this customer get a refund?”
- Short-term memory (the window) starts nearly empty — just the system prompt and this question. There’s no point loading the entire knowledge base here; it would bury the question and cost a fortune per call.
- Long-term memory (the store) holds everything durable: the refund policy, this customer’s purchase history, the product’s return rules. None of it is in the window yet.
- Retrieval queries long-term memory for what’s relevant to this question — the refund policy and this customer’s recent purchase — and pulls just those into the window.
- The model answers using a short, sharp context: the question plus two or three relevant facts. The other thousand facts in long-term memory never entered the window, so they neither cost tokens nor distracted the model.
Swap in a giant short-term memory and you’d dump all thousand facts into the window: slower, far more expensive, and more likely to bury the two that mattered. The two-tier design wins precisely because it keeps working memory lean and lets long-term memory stay large.
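The refund scenario can be mocked up in a few lines. The `Fact` record, the 30-day policy, the customer IDs, and the filler facts are all invented for illustration, and the sketch assumes some earlier step has already mapped the question to the topic `"refunds"` and identified the customer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    topic: str
    text: str
    customer_id: Optional[str] = None  # None: the fact applies to every customer


def build_store() -> list[Fact]:
    """Long-term memory: three refund-related facts plus unrelated filler."""
    store = [
        Fact("refunds", "Refunds are allowed within 30 days of purchase."),
        Fact("refunds", "Customer C-17 bought a keyboard 12 days ago.", customer_id="C-17"),
        Fact("refunds", "Customer C-42 bought a monitor 90 days ago.", customer_id="C-42"),
    ]
    # Filler standing in for the rest of the knowledge base.
    store += [Fact("other", f"Unrelated fact number {i}.") for i in range(997)]
    return store


def retrieve(store: list[Fact], topic: str, customer_id: str) -> list[Fact]:
    """Keep facts on this topic that are general or belong to this customer."""
    return [
        fact for fact in store
        if fact.topic == topic and fact.customer_id in (None, customer_id)
    ]


SYSTEM_PROMPT = "You are a support assistant."
question = "Can this customer get a refund?"

store = build_store()
recalled = retrieve(store, topic="refunds", customer_id="C-17")

# Short-term memory: system prompt, question, and only the recalled facts.
window = [SYSTEM_PROMPT, question] + [fact.text for fact in recalled]

print(f"facts in long-term store: {len(store)}")
print(f"items in the window:      {len(window)}")
for item in window:
    print(" ", item)
```

Scoping by customer matters as much as scoping by topic: the other customer’s purchase is on the right subject but would only mislead the model if it entered the window.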
Get this split right and the design questions that follow get easier: how much to recall per call, how to keep long-term storage organized, how to write back only what’s worth keeping. The two-tier brain model isn’t a metaphor you outgrow — it’s the structure every durable agent ends up rebuilding.
FAQ
What is short-term memory in an LLM?
It’s the context window — the text the model reads on a single request. It acts as working memory: large but bounded and temporary. When the session ends it’s discarded, so nothing in it persists unless a separate layer writes it down.
What is long-term memory in an LLM?
It’s a persistent external store that survives across sessions and holds durable facts. Unlike the context window, it isn’t read on every request; a retrieval step queries it when relevant and pulls only the matching subset into the prompt.
Why is the brain analogy useful for AI memory?
The brain separates working memory (the current task) from long-term memory (durable knowledge) and recalls only what’s relevant. That selective, two-tier design keeps each decision focused — exactly the structure that keeps an LLM’s requests sharp and affordable.
Can’t a bigger context window replace long-term memory?
No. The context window is temporary and read fresh each request, and quality drops when it’s overloaded. Long-term memory provides durable, scoped recall across sessions — something a larger short-term window cannot do, no matter its size.
Where does the brain analogy break down for LLMs?
In scale and mechanics. An LLM’s window is huge next to human working memory, reading from its long-term store doesn’t change the model the way human recall changes the brain, and nothing consolidates automatically — an LLM only remembers what a write step explicitly saves. The analogy guides structure, not literal behavior.
What are episodic, semantic, and procedural memory in LLMs?
They’re finer divisions of long-term memory borrowed from cognitive science: episodic holds specific past events, semantic holds durable generalized facts, and procedural holds how to do recurring tasks. Repeated episodes often consolidate into semantic facts. Semantic memory does most of the work for knowledge-grounded assistants.
How does an LLM decide what moves from short-term to long-term memory?
It doesn’t decide on its own — a separate write step makes that call, because the model is stateless and forgets the window the moment a session ends. In a well-built system, that write step runs after a turn and asks: is anything here worth keeping beyond this session? A decision, a stated preference, a durable fact about the project all qualify; small talk and one-off scratch work do not. This is the closest analog to the brain’s consolidation, except it is explicit engineering rather than something that happens automatically while you sleep. Get the write step wrong and the long-term store either fills with noise or misses the facts that mattered.