Technical Notes

Memory is not a search problem

Thrindex

Ask most teams how their agent's memory works and the answer is some version of the same sentence: we embed the text, store the vectors, and run a similarity search at query time. That sentence describes a search engine. It does not describe a memory system, and the difference is the entire problem.

Search and memory look alike from the outside. Both take a query and return stored items. But they answer different questions. A search engine answers "What did I store that resembles this?" A memory system has to answer "What is true right now that I should act on?" The first is a geometry problem: distance between vectors. The second is not a geometry problem at all.
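
To make the distinction concrete, here is a minimal sketch of what a pure similarity search computes. The three "embeddings" are hand-picked toy vectors and the memory texts are invented for illustration; neither comes from any real model. The search does its job correctly: it returns the closest vector. That closest vector happens to be a fact that is no longer true.

```python
import math

# Toy 3-dimensional "embeddings", hand-written for illustration only.
# A real system would obtain these from an embedding model.
MEMORIES = {
    "User lives in Berlin":        [0.90, 0.10, 0.00],  # older, superseded
    "User moved to Lisbon in May": [0.70, 0.30, 0.20],  # newer, currently true
    "User prefers dark mode":      [0.00, 0.20, 0.95],
}

def cosine(a, b):
    """Cosine similarity: the only signal a pure search engine has."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, k=1):
    """Return the k stored texts closest to the query, by geometry alone."""
    ranked = sorted(MEMORIES, key=lambda t: cosine(query_vec, MEMORIES[t]), reverse=True)
    return ranked[:k]

# Toy query vector standing in for "Where does the user live?"
print(search([0.85, 0.15, 0.05]))  # → ['User lives in Berlin']
```

Nothing in the computation knows that Berlin was superseded. Only geometry is available to it, so "resembles" wins over "is true."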

Where the work actually is

It helps to separate a memory system into two paths, because they have opposite shapes.

The read path is what runs while the agent waits. It must be fast — tens of milliseconds, not hundreds — because every millisecond here is latency the user feels. The temptation is to make it smart: call a model, reason about the query, traverse a graph of related facts. Every one of those choices is a mistake. Anything expensive on the read path is latency you cannot get back.

The write path is what runs when a new memory arrives. The agent is not waiting on it. This is where every expensive, intelligent operation belongs — and there are more of them than people expect:

Operation	Question it answers	Why it cannot live on the read path
Deduplication	Have we already stored this?	Comparing against the corpus is too slow to do per query
Importance scoring	How much should this memory matter?	Needs signals that take time to compute
Compression	What is the short version of this?	Summarizing is expensive; do it once, not per read
Conflict resolution	Does this contradict something we know?	Requires looking up and reasoning over related memories
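
The four rows can be sketched as a single write function that runs them in order. This is a minimal illustration under loud assumptions. Every step is a cheap stand-in for the real cognition: exact-match hashing replaces semantic deduplication, a keyword heuristic replaces importance scoring, and truncation replaces summarization. Conflict resolution relies on a hypothetical `subject` key, with a "newest write wins" rule. All names here (`Memory`, `Store`, `write`, `subject`) are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Memory:
    subject: str            # hypothetical key, e.g. "user.city", used for conflict checks
    text: str
    summary: str = ""
    importance: float = 0.0
    superseded: bool = False

@dataclass
class Store:
    by_hash: dict = field(default_factory=dict)     # content hash -> Memory
    by_subject: dict = field(default_factory=dict)  # subject -> current Memory

def content_hash(text: str) -> str:
    # Deduplication stand-in: exact match after case/whitespace normalization.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def score(mem: Memory) -> float:
    # Importance stand-in: a toy keyword heuristic, not a real signal model.
    keywords = ("always", "never", "moved", "prefers", "allergic")
    return min(1.0, 0.2 + 0.2 * sum(k in mem.text.lower() for k in keywords))

def compress(text: str, limit: int = 40) -> str:
    # Compression stand-in: truncation in place of model-based summarization.
    return text if len(text) <= limit else text[: limit - 3] + "..."

def write(store: Store, mem: Memory) -> bool:
    """Run all four write-path steps. Return False if the memory was a duplicate."""
    h = content_hash(mem.text)
    if h in store.by_hash:                  # 1. deduplication
        return False
    mem.importance = score(mem)             # 2. importance scoring
    mem.summary = compress(mem.text)        # 3. compression
    previous = store.by_subject.get(mem.subject)
    if previous is not None:                # 4. conflict resolution
        previous.superseded = True          #    assumption: newest write wins
    store.by_subject[mem.subject] = mem
    store.by_hash[h] = mem
    return True

store = Store()
write(store, Memory("user.city", "User lives in Berlin"))         # stored
write(store, Memory("user.city", "user lives in   Berlin"))       # duplicate, returns False
write(store, Memory("user.city", "User moved to Lisbon in May"))  # supersedes Berlin
print(store.by_subject["user.city"].text)  # → User moved to Lisbon in May
```

Each stand-in could be replaced by a model call without changing the shape of the function. That is the point: all of this cost is paid once, at write time.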

The pattern is the same across all four rows. Each is a piece of cognition — judgment about the memory, not just storage of it. None of it can happen while an agent is waiting for an answer. So it happens before: continuously, in the background, every time a memory comes in. By the time a query arrives, the thinking is already done. The read path only has to look up the result.

This is the load-bearing idea, and it is worth stating plainly: a fast memory system is not fast because retrieval is optimized. It is fast because all the slow work was moved off the read path entirely.

  WRITE PATH (agent is not waiting)        READ PATH (agent is waiting)
  ────────────────────────────────        ────────────────────────────
  new memory                              query
     │                                       │
     ▼                                       ▼
  dedup → score → compress → resolve       look up pre-computed result
     │   (cognition, runs in background)      │   (no model calls, no graph walk)
     ▼                                       ▼
  durable store ───────────────────────▶  ranked answer
     (the slow, smart work lands here)       target: tens of ms
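
The split in the diagram can be sketched with a queue and a background thread. In this minimal sketch, `remember` only enqueues and returns, so the caller never waits. A worker thread performs the slow step, simulated here with `time.sleep`, and publishes the finished result. `recall` is a plain dictionary lookup. The names, the `subject` keying, and the sleep are all illustrative assumptions, not a production design.

```python
import queue
import threading
import time
from typing import Dict, Optional

# Pre-computed results: the only thing the read path is allowed to touch.
INDEX: Dict[str, str] = {}
INDEX_LOCK = threading.Lock()
INBOX = queue.Queue()  # items are (subject, text) tuples

def expensive_cognition(text: str) -> str:
    # Stand-in for dedup/score/compress/resolve; the sleep simulates model latency.
    time.sleep(0.05)
    return text.strip()

def write_worker() -> None:
    """Write path: drains the inbox in the background. Nobody waits on this."""
    while True:
        subject, text = INBOX.get()
        result = expensive_cognition(text)
        with INDEX_LOCK:
            INDEX[subject] = result  # publish the finished result
        INBOX.task_done()

def remember(subject: str, text: str) -> None:
    """Called by the agent. Returns immediately after enqueueing."""
    INBOX.put((subject, text))

def recall(subject: str) -> Optional[str]:
    """Read path: no model call, no graph walk, just a lookup."""
    with INDEX_LOCK:
        return INDEX.get(subject)

threading.Thread(target=write_worker, daemon=True).start()

if __name__ == "__main__":
    remember("user.city", "  User moved to Lisbon  ")
    INBOX.join()  # demonstration only; a real agent would not block here
    print(recall("user.city"))
```

The cost of this design is staleness. A query that arrives before the worker finishes sees the previous result, or none. In exchange, read latency no longer depends on how much thinking the write path does.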

What this changes

Once memory is framed this way, a lot of design questions answer themselves.

You stop asking which vector database is fastest and start asking what cognition runs on the write path, because that is what determines whether the answers are any good. A faster vector search returns the wrong memory faster. It does not return a better one.

You stop treating the store as a passive bucket and start treating it as something that is always working — comparing, scoring, compressing, resolving — even when no agent is talking to it. The memory is being maintained between queries, not just during them.

And you stop measuring quality by similarity alone. Similarity tells you a returned memory is related to the query. It does not tell you the memory is current, or important, or relevant to the task the agent is actually doing. Those are separate signals, and a memory system worth the name has to account for all of them.
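
One common way to account for several signals is a weighted score at ranking time, where every input except similarity was pre-computed on the write path. The sketch below assumes a linear blend of similarity, exponential recency decay, and importance, and it drops memories that conflict resolution has marked as superseded. The weights, the 30-day half-life, and the field names are illustrative assumptions, not tuned or recommended values.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    text: str
    similarity: float          # from the vector search, in [0, 1]
    age_days: float            # time since the memory was written or last confirmed
    importance: float          # pre-computed on the write path, in [0, 1]
    superseded: bool = False   # set by conflict resolution on the write path

# Illustrative weights and half-life; assumptions, not tuned values.
W_SIM, W_RECENCY, W_IMPORTANCE = 0.5, 0.3, 0.2
HALF_LIFE_DAYS = 30.0

def recency(age_days: float) -> float:
    """Exponential decay: 1.0 when fresh, 0.5 after one half-life."""
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def combined(c: Candidate) -> float:
    return W_SIM * c.similarity + W_RECENCY * recency(c.age_days) + W_IMPORTANCE * c.importance

def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Drop superseded memories, then order the rest by the blended score."""
    live = [c for c in candidates if not c.superseded]
    return sorted(live, key=combined, reverse=True)

stale = Candidate("User lives in Berlin", similarity=0.99, age_days=400, importance=0.2)
fresh = Candidate("User moved to Lisbon in May", similarity=0.95, age_days=10, importance=0.4)
print(rank([stale, fresh])[0].text)  # → User moved to Lisbon in May
```

The more similar memory loses because it is old and less important. Ranking stays cheap because `importance` and `superseded` were already computed before the query arrived.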

Retrieval is the easy half. It is well understood, and the tools for it are mature and commoditized. The hard half — the half that decides whether an agent can be trusted — is everything that happens to a memory before anyone asks for it. That is not a search problem. That is the problem.
