How Agents Remember: On Memory and the Art of Context Engineering
Preface
When we talk about memory in LLM agents, we’re not talking about neurons or synapses — we’re talking about tokens, context windows, and clever hacks to make a model feel less forgetful than it really is. Every agent, no matter how fancy, is stuck with a hard ceiling: the model can only “see” a limited number of tokens at once. That’s where context engineering comes in — the art of deciding what actually makes it into the prompt each turn.
The Context Window Problem
Every LLM has a fixed context window — the number of tokens it can “see” at once:
| Model | Max context (tokens) | Rough words | Equivalent pages* |
|---|---|---|---|
| Qwen3-4B-Thinking | 256K | ~192,000 | ~480 pages |
| GPT-5-mini | 400K | ~300,000 | ~750 pages |
| GPT-5 | 400K | ~300,000 | ~750 pages |
| GPT-4.1 | 1M | ~750,000 | ~2,500 pages |
| Claude Sonnet 4 | 1M (beta) | ~750,000 | ~2,500 pages |
| Claude Opus 4.1 | 200K | ~150,000 | ~375 pages |
*very rough: ~400 words per page.
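If you want a feel for how quickly text eats into these windows, you can count tokens yourself. Here's a minimal sketch using the tiktoken library (the cl100k_base encoding is an assumption; pick the encoding that matches your model):

```python
import tiktoken

# cl100k_base is an assumption here -- use the encoding that matches your model.
enc = tiktoken.get_encoding("cl100k_base")

text = "I live in NYC and love anchovy pizza."
print(len(enc.encode(text)), "tokens")  # roughly 10 tokens for this sentence

# Rule of thumb behind the table above: 1 token is roughly 0.75 English words.
```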
Here’s the catch: more context ≠ smarter use of context. The Lost in the Middle paper shows that models often ignore information buried in the middle of long prompts. Great at the start, decent at the end, fuzzy in between. So while bigger windows help, they’re not a free pass — they still need careful curation.
Before Agents: MemGPT and the OS Analogy
Long before today’s frameworks started bolting on “memory,” the MemGPT paper introduced a neat metaphor: treat LLMs like operating systems.
- Context window = RAM: fast, but strictly limited.
- Long-term memory (LTM) = Disk: slower, external, persistent.
- Virtual context management = Paging: swap content in/out of RAM to give the illusion of infinite memory.
MemGPT even gave models the ability to notice memory pressure and decide what to evict or recall — just like your laptop silently juggling processes between RAM and disk. This reframed the field: the problem isn’t “how do I cram my whole DB into the prompt?” but how do I page intelligently?
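To make the paging idea concrete, here's a tiny illustrative sketch (not MemGPT's actual code) of a context buffer that evicts its oldest messages to an external store once a token budget is exceeded; count_tokens is a hypothetical stand-in for a real tokenizer.

```python
from collections import deque

def count_tokens(text: str) -> int:
    # Hypothetical stand-in for a real tokenizer (e.g. tiktoken).
    return len(text.split())

class PagedContext:
    """Keep recent messages in 'RAM' (the prompt); page older ones out to 'disk'."""

    def __init__(self, budget: int = 2000):
        self.budget = budget
        self.ram = deque()    # messages that go into the prompt
        self.disk = []        # external archive: a DB, file, or vector store

    def append(self, message: str) -> None:
        self.ram.append(message)
        # "Memory pressure": evict the oldest messages until we fit the budget again.
        while sum(count_tokens(m) for m in self.ram) > self.budget:
            self.disk.append(self.ram.popleft())

    def prompt(self) -> str:
        return "\n".join(self.ram)
```

A real system also needs the reverse path, searching the archive and paging relevant items back in, which is exactly what the long-term memory retrieval later in this post handles.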
Why It Matters for Agents
- Multi-turn conversations: Without memory, agents forget earlier turns after just a few exchanges.
- Verbose tool use: JSON outputs, stack traces, logs — they eat tokens and crowd out the chat.
- Enterprise data: You can’t “just stuff your whole DB into the prompt.” It's too big, and much of it has to stay siloed anyway.
- Performance and cost: Bigger windows = slower, more expensive inference.
- User expectations: People expect an “assistant” to remember what was said before.
Memory Taxonomies
There are multiple frameworks these days for managing memory and context for your agent, from Letta to mem0 to LangChain. What they all share is the distinction between short-term and long-term memory.
Long-Term Memory Types
If I had to generalize, I'd say there are two main types of long-term memory:
- Episodic Memory — records specific past experiences.
  - Example: “On Sept 2, the user asked me to book a flight.”
  - Typically built from conversation logs or summaries.
- Semantic Memory (a.k.a. Factual Memory) — builds generalized knowledge.
  - Example: “Paris is the capital of France.”
  - Often extracted into structured forms like graphs.
All of these can be retrieved and saved using various techniques — from vector stores to graph databases.
Short-Term Memory Types
- Conversation history: Several techniques apply here, but the main one is a rolling buffer combined with summarization (see the sketch after this list).
- The agent's state: The tools used, intermediate results, or metadata about the current plan.
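Here's a rough sketch of the buffer-plus-summarization idea. The summarize callable would typically be an LLM call; the class and method names are illustrative, not taken from any particular framework.

```python
class ShortTermMemory:
    """Keep the last `max_turns` messages verbatim; fold older ones into a running summary."""

    def __init__(self, summarize, max_turns: int = 10):
        self.summarize = summarize   # callable: (old_summary, overflow_messages) -> new summary
        self.max_turns = max_turns
        self.summary = ""
        self.turns = []

    def append(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})
        if len(self.turns) > self.max_turns:
            # Everything that falls off the buffer gets compressed into the summary.
            overflow = self.turns[: -self.max_turns]
            self.turns = self.turns[-self.max_turns:]
            self.summary = self.summarize(self.summary, overflow)

    def window(self) -> list[dict]:
        # What actually goes into the prompt: the summary first, then the recent turns.
        prefix = ([{"role": "system", "content": f"Conversation so far: {self.summary}"}]
                  if self.summary else [])
        return prefix + self.turns
```

The summary absorbs whatever falls off the buffer, so the prompt stays bounded while older context survives in compressed form.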
A Minimal Virtual Context Manager (Code)
The GitHub repo has a minimal virtual context manager that you can use to manage memory and context for your agent. Here’s the basic flow:
- User query gets appended to short-term memory (a rolling buffer of recent turns).
- The same query is stored in long-term memory (using SBERT for embeddings, and FAISS for similarity search).
- When answering, the agent builds a context by fetching:
- the latest short-term memory (STM) window (conversation history).
- relevant items from LTM (facts, preferences).
- The model answers, and its response is stored back into both STM and LTM.
async def ask(self, query: str) -> str:
    # 1. Record the user turn in short-term memory and persist it to long-term memory.
    self.stm.append("user", query)
    await self.ltm.save(query)
    # 2. Build the prompt from the STM window plus relevant LTM items, and answer.
    context = self._build_context(query)
    response = await self._answer_query(context)
    answer = response["content"]
    # 3. Store the answer back into both memories.
    self.stm.append("assistant", answer)
    await self.ltm.save(answer)
    return answer
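The ask method leans on two pieces the snippet doesn't show: the LTM store and _build_context. Here's a minimal sketch of what they might look like, assuming sentence-transformers (SBERT) for embeddings and a flat FAISS index for similarity search, and reusing the ShortTermMemory sketch from earlier; the repo's actual method names and prompt format may differ.

```python
import faiss
from sentence_transformers import SentenceTransformer

class LongTermMemory:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.encoder = SentenceTransformer(model_name)
        self.index = faiss.IndexFlatL2(self.encoder.get_sentence_embedding_dimension())
        self.texts = []

    async def save(self, text: str) -> None:
        # Embed the text and add it to the FAISS index.
        vec = self.encoder.encode([text]).astype("float32")
        self.index.add(vec)
        self.texts.append(text)

    def search(self, query: str, k: int = 3) -> list[str]:
        if not self.texts:
            return []
        vec = self.encoder.encode([query]).astype("float32")
        _, idx = self.index.search(vec, min(k, len(self.texts)))
        return [self.texts[i] for i in idx[0]]

# Inside the agent class: combine the STM window with relevant LTM items.
def _build_context(self, query: str) -> list[dict]:
    ltm_items = self.ltm.search(query)
    memories = ([{"role": "system",
                  "content": "Relevant memories:\n" + "\n".join(ltm_items)}]
                if ltm_items else [])
    return memories + self.stm.window()
```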
Now query the agent with the following:
# First turn: state a fact about yourself.
input_text = "I live in NYC and love anchovy pizza."
response1 = await agent.ask(input_text)

# Second turn: ask the agent to recall it.
query = "Where do I live and what topping do I like?"
response2 = await agent.ask(query)
You'll get a response along these lines (the exact wording will vary):
User: I live in NYC and love anchovy pizza.
Agent: That's great to hear! New York City has a vibrant pizza scene, and anchovy pizza can be a delicious choice for those who appreciate its unique flavor. Do you have any favorite spots in the city where you like to get your anchovy pizza?
User: Where do I live and what topping do I like?
Agent: You live in New York City and you like anchovy pizza!
Takeaways
- Context windows are finite — and bigger windows don’t magically fix recall issues (Lost in the Middle still applies).
- MemGPT’s OS metaphor (RAM vs. Disk, with paging) gives the right mental model.
- Long-term memory comes in two flavors:
- Episodic (time-tied events).
- Semantic (general truths).
- Short-term memory (buffers, summaries, state) keeps conversations coherent in the moment.
- In practice: STM + LTM = the context the model sees each turn.
- The art of context engineering is paging intelligently under token pressure.
What's next
In this post we’ve seen how to manage memory; the missing piece is tying it all together into a full agent.
👉 Full code for the agent is available in my GitHub repo