How AI Agents Actually Remember (Part 1): Inside Mem0, Supermemory, and Letta

Jun 21, 2026

∙ Paid

Build an agent that forgets everything between sessions and you feel the pain fast. The context window is small. Each new conversation starts from zero. The user repeats their name, their stack, their preferences, and the long task that needed five turns of setup collapses on turn six.

The common fix is “just add a vector database.” That fix is wrong, or at least it is half the story. Dumping whole conversations into a vector store and pulling back the nearest neighbors gives you a search index, not a memory. Real agent memory is a lifecycle: decide what is worth keeping, capture it at the right moment, store it in a shape you can query, pull back only what the current turn needs, and throw away what went stale. Skip any one of those and the system rots.

I spent time in the public repos, docs, and changelogs of the three systems people reach for most: Mem0, Supermemory, and Letta (the project that grew out of MemGPT). This piece walks the same four questions across all three, grounds each claim in what the code and docs actually say as of mid-2026, and ends with the parts nobody markets: deduplication, conflict resolution, and forgetting.

The four questions every memory system answers

Strip away the branding and every memory system is answering four questions in order. What do I keep? When do I write it? Where does it live? How do I get it back and rank it? Hold those four in your head and any system becomes legible.

figure: the memory lifecycle from raw input to compact prompt

The figure reads left to right because that is the order the work happens in. Raw text comes in. A selection step throws out the noise so you are not storing “ok thanks.” A trigger decides this is the moment to extract. Storage gives the fact a typed home, a profile field or a graph edge or an in-context block. Retrieval pulls back a small, ranked slice and drops it into the prompt. The win at the end is token cost: instead of replaying the whole history, you inject a few hundred tokens of the right memory. Every system below is a different set of answers to these four questions, so that is the lens I will use on each.

Mem0: a write-only log that resolves conflicts when you read

Mem0 stores discrete facts and preferences, scoped along four axes you pass at write time: user_id, agent_id, run_id for a session, and app_id. Facts the agent itself produces, like a confirmed booking, are first-class, stored with the same weight as things the user said. Entities get extracted and linked across memories, which gives Mem0 a light graph layer on top of the flat fact store.

The interesting change is in how it writes. Older Mem0 ran two LLM passes on every add(): one to pull candidate facts, a second to reconcile them against existing memory with ADD, UPDATE, and DELETE operations. That second pass is where cost and latency lived, and where a bad reconciliation could silently delete a fact you wanted. The current algorithm, labeled v3 in the migration docs, drops the second pass.

figure: Mem0 v2 two-pass rewrite versus v3 single-pass add-only

The right side of the figure is the whole idea. One LLM call, and it only adds. When a fact changes, the new version lands next to the old one and both survive. Mem0 no longer tries to decide at write time whether “I moved to SF” should overwrite “I live in NYC.” It keeps both and lets retrieval sort out which is relevant now. That is a real tradeoff, not a free lunch: you trade a clean store for a cheap, fast write, and you push the conflict problem downstream to read time. The payoff shows up in the numbers Mem0 publishes: on the LoCoMo benchmark the score moves from 71.4 to 91.6, on LongMemEval from 67.8 to 93.4, and selective retrieval reports a p95 latency near 1.44s against 17.12s for stuffing the full context.

from mem0 import Memory

memory = Memory()

# add() triggers extraction; v3 does one LLM call and only adds
memory.add(messages, user_id=”ken”, agent_id=”code-agent”, run_id=”sess-1”)

# search() fuses semantic, BM25 keyword, and entity-match signals
hits = memory.search(query=”what are my preferences?”,
filters={”user_id”: “ken”}, top_k=3)

The two methods carry the whole model. add() is where extraction fires, and you tag the memory with whichever scopes apply so a session fact never leaks into another user’s profile. search() is the part people underestimate: it is not pure vector similarity. Mem0 fuses semantic search with BM25 keyword matching and entity matching, then ranks the fused set. That hybrid is why it can answer “what are my preferences” and also a sharp keyword query like a specific project name. Storage underneath is pluggable: a standalone setup usually points at Qdrant, while the official Docker stack ships Postgres with pgvector plus an optional Neo4j graph.

Supermemory: ingest is instant, understanding catches up in the background

Supermemory pushes higher up the stack. Instead of handing you raw facts, it builds a user profile with two halves: a static half for durable facts like skills and preferences, and a dynamic half for what happened in the last few conversations. Underneath sits a knowledge graph with typed, dated edges between entities. You scope everything with a containerTag, usually user_<id> for a personal agent or project_<name> for a shared knowledge base.

What makes it different is what happens after you write. Ingestion is instant and the new content is queryable right away, but the real consolidation runs later in a background process Supermemory calls “dreaming.”

figure: Supermemory instant ingest plus a background dream cycle

The top row is the fast path: add(content), extract facts and build the graph, queryable in the same breath. The branch downward is the slow, smart path. On its own schedule, with no cron job and no fixed timer, the system enters a dream cycle and reconsolidates: it merges fragments into cleaner abstractions, reconciles contradictions by checking timestamps so a newer fact supersedes an older one, and drops facts past their forgetAfter date. The docs put a number on the lag: the dreamt state catches up within about fifteen minutes. This is the opposite bet from Mem0. Mem0 keeps conflicts and resolves at read time; Supermemory does the reconciliation work in the background so the store stays clean.

import Supermemory from ‘supermemory’;
const client = new Supermemory({ apiKey: process.env.SUPERMEMORY_API_KEY });

// ingest; note containerTags is a plural array on add()
await client.add({ content: “Ken ships agents in TypeScript”,
containerTags: [”user_123”] });

// one call returns the built profile: static + dynamic
const { profile } = await client.profile({ containerTag: “user_123” });

The profile() call is the part worth stealing. Instead of running a search and assembling context yourself, you ask for the profile and get back a static array and a dynamic array already shaped for a system prompt. Pass a q and it folds in search results too. One detail to keep straight: add() and search take a plural containerTags array, while profile() takes a singular containerTag string. Storage is a unified vector plus graph engine with ontology-aware edges, and there are connectors for Drive, Gmail, Notion, OneDrive, and GitHub. You can also run the whole thing as a single self-hosted binary on localhost:6767. The benchmark numbers Supermemory cites, like 81.6% on LongMemEval, are vendor-reported, so read them as a claim to verify against your own tasks rather than a settled fact.

Letta: the agent manages its own memory like an operating system

Letta inherits the core idea from the MemGPT paper (arXiv:2310.08560): treat the context window like the main memory of an operating system. The window is small and precious, like RAM. Everything else lives on disk and gets paged in when the agent asks for it. The agent itself runs the paging through tool calls. Nothing extracts memory on a pipeline behind its back.

figure: Letta core, archival, and recall memory paged like an OS

Three tiers, and the figure labels each one. Core memory is a set of editable blocks pinned in the prompt, the RAM. Archival memory is a vector database the agent searches on demand, and recall memory is the full message history, both sitting on disk. The agent moves data between them with named tools: memory_replace and memory_rethink to rewrite a core block, archival_memory_insert and archival_memory_search for long-term facts, and conversation_search to dig through history. That paging is what creates the illusion of unlimited memory inside a fixed window.

from letta_client import Letta

client = Letta(api_key=os.getenv(”LETTA_API_KEY”))

agent = client.agents.create(
name=”assistant”,
memory_blocks=[
{”label”: “persona”, “value”: “I am a friendly assistant.”},
{”label”: “human”, “value”: “Name: Ken”},
],
model=”openai/gpt-4o-mini”,
)

You do not call a memory API here. You define the starting core blocks and the model, and from then on the agent edits its own memory inside its reasoning loop. When it learns your name, it calls memory_replace on the human block. When it needs an old fact, it calls archival_memory_search. That autonomy is the appeal and the cost: you get an agent that improves its own memory, but you have to design good tools and the behavior is harder to debug than a fixed pipeline. Letta backs all of this with Postgres and pgvector, runs hybrid vector plus full-text search fused with reciprocal rank fusion, and in Letta Code ships MemFS, a git-backed memory directory the agent commits to, so you get a real version history of what the agent learned. The “sleep-time compute” option adds a background agent that shares the same blocks and updates them while the primary agent is idle, which is Letta’s answer to the same problem Supermemory solves with dreaming.

The three side by side

Lay the four questions against the three systems and the design choices line up cleanly.

Continue reading this post for free, courtesy of Ken Huang.

Or purchase a paid subscription.

Agentic AI