What "good" looks like
Agent memory is evaluated by whether a fact written in one session can be read back correctly in a later one, after contradicting updates, explicit deletions, and hundreds of turns of interference. The bar is recall of the latest surviving write, not recall of any write. LongMemEval is the benchmark that operationalizes this. It embeds 500 curated questions in scalable chat histories and tests five abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates (does the latest write win), and abstention (knowing when memory has no answer)[1]. Its headline finding is sobering: commercial assistants and long-context LLMs show a roughly 30% accuracy drop under sustained interaction. The paper also earns its place by being diagnostic rather than just a leaderboard. It decomposes memory systems into indexing, retrieval, and reading stages so that failures can be attributed to a stage, and its verified mitigations (session decomposition, fact-augmented key expansion, time-aware query expansion) materially improve both recall and end-to-end QA[1].
Scope note: per-run deterministic state checks (did the agent leave the environment in the right state within a single run) live under long-horizon planning and under computer and browser use. This page owns cross-session retention and recall over time.
Secondary metrics
MemoryAgentBench reframes the problem as four competencies: accurate retrieval, test-time learning, long-range understanding, and selective forgetting. Its verified finding is that no current method masters all four, which is a strong argument for reporting the competencies separately rather than as one score[2]. Reliability over repeated trials matters as much as accuracy. tau-bench showed gpt-4o succeeding on under 50% of tasks averaged across domains, with pass^8 (the probability that all 8 independent trials of a task succeed) below 25% on retail, graded by comparing the final database state[3][4]; tau2-bench extends the environments and supports the same repeated scoring[5]. Track pass^k on your memory probes with k of at least 4. For long-horizon context, METR's 50%-task-completion time horizon (the human task duration at which the model succeeds half the time) has doubled roughly every 7 months across six years of frontier models, which means the retention window your memory system must cover keeps growing[6].
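As a minimal sketch of the repeated-trial scoring, the snippet below estimates pass^k per probe with the combinatorial estimator C(c, k) / C(n, k), where c is the number of successful trials out of n, and averages it across probes. The function names `pass_hat_k` and `reliability_report` and the input layout (probe name mapped to a list of trial outcomes) are assumptions made for illustration, not part of any benchmark's harness.

```python
from math import comb
from statistics import mean
from typing import Dict, List


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials drawn from the observed ones all succeed."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    # comb(successes, k) is 0 whenever fewer than k trials succeeded.
    return comb(successes, k) / comb(trials, k)


def reliability_report(results: Dict[str, List[bool]], k: int) -> Dict[str, float]:
    """Average single-trial success rate and pass^k over all probes."""
    pass_at_1 = mean(sum(r) / len(r) for r in results.values())
    pass_k = mean(pass_hat_k(sum(r), len(r), k) for r in results.values())
    return {"pass@1": pass_at_1, f"pass^{k}": pass_k}


# Hypothetical outcomes: one probe always recalls, one fails a single trial.
example = {
    "home_city_update": [True, True, True, True],
    "diet_deletion": [True, True, True, False],
}
print(reliability_report(example, k=4))  # {'pass@1': 0.875, 'pass^4': 0.5}
```

A single failed trial barely moves pass@1 here but halves pass^4, which is why the two numbers are worth reporting side by side.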
An honest gap: the governance metrics teams most need (staleness distribution, contradiction rate, privacy leakage, deletion compliance) have no widely adopted public benchmark. Do not pretend one exists; build the DIY probes below and report them as first-class numbers.
Offline eval design
Run LongMemEval as-is first; it is the closest thing to a standard[1]. Then build a probe matrix over your own domain that exercises each seeded fact with five probe types:
| Probe | Seeds | Checks |
|---|---|---|
| Seed fact | one fact, session 1 | extraction in session N |
| Contradicting update | new value, session k | latest-write-wins, not first-write |
| Deletion order | "forget X", session k | X absent from all later answers |
| Distractor interference | many similar facts | retrieval survives near-neighbors |
| No-answer question | nothing | abstention, not confabulation |
The latest-write-wins probe and the interference probe deserve explicit attention; the LongMemEval authors observed commercial assistants overwriting or losing crucial information under sustained interaction, and updates and distractors are exactly what static histories never test. Grade deterministically where possible (string or entity match on the recalled value); reserve LLM-as-a-judge (LaaJ) grading for multi-session reasoning questions where the answer is compositional. Run every probe at least 4 times and report pass@1 and pass^4 together; the divergence between them is the reliability signal.
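One way to implement the deterministic grading is a normalized substring match with an explicit list of values that must not appear. The sketch below is illustrative only: the `Probe` dataclass, the `grade` function, and the `ABSTAIN_MARKERS` phrases are hypothetical names and choices, and a real grader would likely need entity matching on top of string matching.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed abstention phrases; tune these to your agent's actual refusal wording.
ABSTAIN_MARKERS = ("i don't know", "i do not know", "no record", "not sure")


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores formatting."""
    return re.sub(r"\s+", " ", text.lower()).strip()


@dataclass(frozen=True)
class Probe:
    kind: str                          # "recall", "update", "deletion", "abstention"
    expected: Optional[str] = None     # value that must appear in the answer
    forbidden: Tuple[str, ...] = ()    # superseded or deleted values


def grade(probe: Probe, answer: str) -> bool:
    text = _norm(answer)
    # Strict by design: any mention of a superseded or deleted value fails.
    if any(_norm(value) in text for value in probe.forbidden):
        return False
    if probe.kind == "abstention":
        return any(marker in text for marker in ABSTAIN_MARKERS)
    if probe.kind == "deletion":
        return True  # passing means the forbidden value never surfaced
    return probe.expected is not None and _norm(probe.expected) in text


update = Probe(kind="update", expected="Lisbon", forbidden=("Berlin",))
print(grade(update, "You live in Lisbon."))   # True
print(grade(update, "You live in Berlin."))   # False
```

The strict forbidden-value rule means an answer such as "you moved from Berlin to Lisbon" fails the update probe. That is a deliberate simplification; cases like it are candidates for judge-based grading instead.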
Online eval design
Instrument four production signals:
- Memory-hit rate: of responses where a stored memory was relevant, how often the system actually used it; a falling hit rate usually means index drift, not model regression.
- Stale-preference incident tracking: tag every support ticket or thumbs-down where the agent acted on a superseded fact, and trend it weekly.
- Periodic deletion-compliance audits: replay a sample of "forget this" requests and verify the fact is unrecoverable through paraphrased probes, not just through the original phrasing.
- Cross-user leakage probes: seed canary facts in test accounts and query for them from other accounts on a schedule; any hit is a page-the-team incident, not a metric.
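Two of these signals reduce to small computations over logged data. The sketch below assumes a hypothetical log record (`ResponseLog`) whose `memory_relevant` flag has already been set by some upstream labeling step, and a hypothetical layout for canary tokens and collected responses; none of these names come from a real logging product.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class ResponseLog:
    session_id: str
    memory_relevant: bool  # assumed label: a stored memory applied to this turn
    memory_used: bool      # the response actually drew on that memory


def memory_hit_rate(logs: Iterable[ResponseLog]) -> Optional[float]:
    """Share of memory-relevant responses that used memory; None if none were relevant."""
    relevant = [log for log in logs if log.memory_relevant]
    if not relevant:
        return None
    return sum(log.memory_used for log in relevant) / len(relevant)


def leaked_canaries(
    canaries: Dict[str, str],         # owner account -> canary token
    responses: Dict[str, List[str]],  # querying account -> collected answers
) -> List[Tuple[str, str]]:
    """Return (owner, querying account) pairs where a canary surfaced cross-account."""
    leaks = []
    for owner, token in canaries.items():
        for account, answers in responses.items():
            if account == owner:
                continue  # the owner seeing their own canary is expected
            if any(token.lower() in answer.lower() for answer in answers):
                leaks.append((owner, account))
    return leaks


logs = [
    ResponseLog("s1", memory_relevant=True, memory_used=True),
    ResponseLog("s1", memory_relevant=True, memory_used=False),
    ResponseLog("s2", memory_relevant=False, memory_used=False),
]
print(memory_hit_rate(logs))  # 0.5
```

Since any leak is an incident rather than a metric, `leaked_canaries` returns the offending account pairs instead of a rate, so an alert can name exactly which boundary was crossed.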
Top benchmarks
- The LongMemEval harness for the five-ability probe set and the stage-level decomposition.
- MemoryAgentBench for the selective-forgetting competency, which almost no other public harness scores[2].
- tau2-bench when memory operates inside a tool-using agent and the grader needs to be final environment state rather than a recalled string[5].
- Langfuse for production trajectory logging with per-session slicing, which is what makes the memory-hit-rate and stale-preference metrics computable at all.
Common pitfalls
- Passive recall benchmarks overestimate production readiness. Needle-in-a-haystack style retrieval is the easy subset; knowledge updates and abstention are the abilities production traffic exercises constantly, and a system can score well on plain extraction while failing both[1].
- The stationary-intent assumption hides staleness. Most eval sets assume the user's facts never change, so a system that always returns the first write scores perfectly. Test contradictory updates explicitly and grade for the latest write.
- Single-trial pass@1 hides inconsistency. A memory system that recalls a fact on most runs but not all looks fine at pass@1 and is a support-ticket generator in production; report pass^k with k of at least 4[3].
- No standard privacy leakage test exists. Say so in your eval report, then probe explicitly with canary facts and cross-account queries rather than citing a benchmark score that does not exist.
Canonical reading
LongMemEval (Wu et al., 2024) is the canonical read because it does three things at once: defines the five abilities that structure this whole page, quantifies the roughly 30% accuracy drop under sustained interaction, and ships mitigations that verifiably work[1]. The indexing-retrieval-reading decomposition is the part to internalize; it converts "memory is flaky" into a named, fixable stage.
Day-1 starter
- Seed 30 synthetic user histories, each containing stable facts, later contradicting updates, and at least one fact the user explicitly asks to delete.
- Probe next-session recall, latest-write-wins on the contradicted facts, and deletion compliance on the deleted ones, with deterministic value-match grading.
- Run every probe 4 times and report pass^4 alongside pass@1; the gap between them is your memory system's reliability debt.
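The seeding step above can be sketched as a small generator that also produces the ground truth to grade against. Everything here is an illustrative assumption: the fact keys and values in `FACT_POOL`, the event tuple layout, and the choice of two stable facts, one updated fact, and one deleted fact per history.

```python
import random
from typing import Dict, List, Optional, Tuple

# (session number, operation, fact key, value); value is None for deletes.
Event = Tuple[int, str, str, Optional[str]]

# Hypothetical fact pool; replace with facts from your own domain.
FACT_POOL = {
    "home_city": ["Lisbon", "Osaka", "Denver", "Nairobi"],
    "diet": ["vegetarian", "vegan", "pescatarian", "omnivore"],
    "editor": ["vim", "emacs", "vscode", "helix"],
    "pet_name": ["Miso", "Rex", "Pixel", "Juno"],
}


def seed_history(rng: random.Random) -> List[Event]:
    """Write four facts in session 1, then update one and delete another later."""
    keys = list(FACT_POOL)
    rng.shuffle(keys)
    updated, deleted = keys[2], keys[3]  # keys[0] and keys[1] stay stable
    events: List[Event] = [(1, "write", k, rng.choice(FACT_POOL[k])) for k in keys]
    old_value = next(e[3] for e in events if e[2] == updated)
    new_value = rng.choice([v for v in FACT_POOL[updated] if v != old_value])
    events.append((rng.randint(2, 5), "write", updated, new_value))
    events.append((rng.randint(2, 5), "delete", deleted, None))
    # sorted() is stable, so the session-1 writes keep their original order.
    return sorted(events, key=lambda e: e[0])


def surviving_state(events: List[Event]) -> Dict[str, str]:
    """Replay events in order; the result is the ground truth for recall probes."""
    state: Dict[str, str] = {}
    for _, op, key, value in events:
        if op == "write" and value is not None:
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state


# One seeded RNG per user keeps every history reproducible across eval runs.
histories = [seed_history(random.Random(user_index)) for user_index in range(30)]
```

Replaying the events with `surviving_state` gives the expected value for each latest-write-wins probe, and any key missing from the result is one the agent should no longer be able to recall.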
Cross-cutting links