The ATM-Bench Leaderboard is Live

One Table for Every Personal-Memory System

Posted by Jingbiao on May 22, 2026. Reading time: 8 minutes.

When we released ATM-Bench earlier this year, the most common piece of feedback we got was not about the dataset or the metrics. It was a question:

“OK, but where can I see how my system stacks up?”

The paper has tables. The README has tables. But results were starting to pile up — coding agents, new memory systems, retrieval variants — and each new entry forced readers to mentally diff against numbers buried three sections apart in two different documents.

Today we’re fixing that. The ATM-Bench Leaderboard is live, with every reported result in one place, on one comparable axis, with one click to sort or filter.


What’s actually new

The leaderboard is not just “the README, but on a webpage.” Three things make it useful:

1. System type is a column, not a tab. Memory systems, RAG pipelines, oracle baselines, and general-purpose coding agents all live in the same table. You can compare A-Mem against Codex (GPT-5.2) against GPT-5 with gold context in a single sorted view. The historical convention is to put these in separate tables — but real readers want to know whether their memory architecture beats just-throwing-Claude-Code-at-the-problem, and the table format should let them check that in two seconds.

2. Oracle is filtered off by default. Oracle entries (the model answering with gold evidence pre-injected) are an upper bound, not a competitor. Including them in the headline view biases the comparison; everyone looks bad next to a model that already has the answer. So the Oracle chip is deselected on page load, which hides the Oracle rows. One click brings them back, but the headline comparison is between systems that actually do the retrieval.

3. One PR per submission. The leaderboard data is a plain JavaScript array at the bottom of leaderboard.html. To submit a result, append one object and open a PR. No build step, no CI, no contributor agreement. If a PR feels heavy, an issue with the numbers works too.
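
As a rough illustration of what "append one object" means, a submission might look like the sketch below. The field names (`system`, `type`, `split`, `qs`, `recallAt10`, `link`), the `LEADERBOARD` array name, and the `checkEntry` helper are all assumptions made for this example, not the actual schema in leaderboard.html; the existing entries in that file are the reference for the real shape.

```javascript
// Hypothetical entry shape; the real field names live in leaderboard.html.
const LEADERBOARD = [
  { system: "ATM-RAG", type: "RAG", split: "hard", qs: 13.8, recallAt10: 30.4, link: null },
];

// Illustrative sanity check a submitter could run before opening a PR.
// Scores are percentages; null means "not reported".
function checkEntry(entry) {
  const problems = [];
  const types = ["Memory", "RAG", "Agent", "Oracle"];
  if (typeof entry.system !== "string" || entry.system.trim() === "") {
    problems.push("system must be a non-empty string");
  }
  if (!types.includes(entry.type)) {
    problems.push("type must be one of " + types.join(", "));
  }
  for (const key of ["qs", "recallAt10"]) {
    const value = entry[key];
    const ok = value === null ||
      (typeof value === "number" && Number.isFinite(value) && value >= 0 && value <= 100);
    if (!ok) problems.push(key + " must be null or a number between 0 and 100");
  }
  return problems;
}

// A new submission: one object appended to the array.
const mySubmission = {
  system: "MySystem", // placeholder name
  type: "Memory",
  split: "hard",
  qs: 12.3,           // placeholder score, not a real result
  recallAt10: null,   // unreported, so left as null rather than guessed
  link: null,
};

if (checkEntry(mySubmission).length === 0) LEADERBOARD.push(mySubmission);
```

Leaving an unreported metric as `null` instead of inventing a number keeps the entry honest and lets the page decide how to display it.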

The page is a single static file. It loads under file:// for local previews, has bilingual EN/中文 chrome, sorts every column by click, and supports multi-select filter chips. No framework, no bundler, no auth.
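
The default filter state described above can be sketched in a few lines. The four class names come from the leaderboard's Class column; the `activeTypes` set and the `toggleChip` and `visibleRows` helpers are hypothetical names for this sketch, not code taken from the page.

```javascript
// Multi-select filter state: every class is on except Oracle.
const ALL_TYPES = ["Memory", "RAG", "Agent", "Oracle"];
const activeTypes = new Set(ALL_TYPES.filter((t) => t !== "Oracle"));

// Clicking a chip flips that class on or off.
function toggleChip(type) {
  if (activeTypes.has(type)) activeTypes.delete(type);
  else activeTypes.add(type);
}

// Rows shown in the table are those whose class is currently selected.
function visibleRows(rows) {
  return rows.filter((row) => activeTypes.has(row.type));
}

// Three rows from the Hard split (QS in percent).
const rows = [
  { system: "GPT-5 (gold context)", type: "Oracle", qs: 74.7 },
  { system: "Codex / GPT-5.2", type: "Agent", qs: 39.7 },
  { system: "ATM-RAG", type: "RAG", qs: 13.8 },
];

const onLoad = visibleRows(rows).map((r) => r.system);
toggleChip("Oracle"); // the "one click" that brings Oracle back
const afterClick = visibleRows(rows).map((r) => r.system);
```

Keeping the selection in a `Set` makes multi-select trivial: each chip is an independent toggle, and the table is just a filter over the same data array.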


What the numbers say today

The interesting comparisons depend on which split you look at.

ATM-Bench (full set)

Class  | System                         | QS    | Recall@10
Oracle | Claude Opus 4.5 (gold context) | 86.0% | -
Oracle | GPT-5 (gold context)           | 85.3% | -
Memory | MemPalace                      | 56.8% | 76.4%
Memory | ScrapMem (No-Forget)           | 52.5% | 70.3%
RAG    | ATM-RAG                        | 51.0% | 68.7%
RAG    | Self-RAG                       | 50.3% | 68.7%
Memory | ScrapMem                       | 48.4% | 66.1%
Memory | MemoryOS                       | 47.2% | 59.2%
Memory | A-Mem                          | 44.8% | 66.4%
Memory | Mem0                           | 43.5% | 61.9%
RAG    | HippoRAG2                      | 42.9% | 66.4%

The headline here is that memory-system architectures are catching up. MemPalace cleared 56%, the first system to do so. The new entry from arXiv:2605.03804, ScrapMem (No-Forget), sits about four points behind at 52.5%, and ATM-RAG and Self-RAG cluster around 50%. The Oracle ceiling is still about 29 points away (86.0% versus 56.8%), but the spread is narrowing.

ATM-Bench-Hard

Class  | System                   | QS    | Recall@10
Oracle | GPT-5 (gold context)     | 74.7% | -
Oracle | Gemini 2.5 Pro           | 64.3% | -
Agent  | Codex / GPT-5.2          | 39.7% | -
Agent  | Claude Code / Opus 4.6   | 33.8% | -
Agent  | OpenCode / Kimi K2.5     | 30.3% | -
RAG    | ATM-RAG                  | 13.8% | 30.4%
Memory | MemoryOS                 | 13.7% | 32.7%
Memory | A-Mem                    | 9.9%  | 31.7%
Memory | MemPalace                | 9.7%  | 28.3%
RAG    | HippoRAG2                | 9.4%  | 31.9%
Memory | Mem0                     | 9.2%  | 23.7%

This is the table that should worry anyone building a personal AI assistant. The Oracle hits 75%; the best end-to-end memory or RAG system hits ~14%. The best agent — Codex, burning 15.46M tokens per run — hits 40%, and only by virtue of being able to explore the data with code and tools.

The ~60-point gap between Oracle and the best memory or RAG system (35 points for the best agent) is the single most important number on this page. It says: the problem isn’t that we lack the right model. The problem is that no current architecture knows how to organize a person’s life into something a model can reason over.
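
The gaps quoted above are plain differences between QS values in the Hard table. A small sketch of the arithmetic, using only numbers from that table (the `hardBest` object and the `gapToOracle` helper are names made up for this example):

```javascript
// QS (percent) for the top entry of each class on ATM-Bench-Hard.
const hardBest = {
  Oracle: 74.7, // GPT-5 (gold context)
  Agent: 39.7,  // Codex / GPT-5.2
  RAG: 13.8,    // ATM-RAG
  Memory: 13.7, // MemoryOS
};

// Difference in percentage points, rounded to one decimal place.
function gapToOracle(type) {
  return Math.round((hardBest.Oracle - hardBest[type]) * 10) / 10;
}

const gaps = {
  Agent: gapToOracle("Agent"),
  RAG: gapToOracle("RAG"),
  Memory: gapToOracle("Memory"),
};
console.log(gaps);
```

The memory and RAG gaps come out at about 61 points each, while the best agent sits 35 points below Oracle.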

ATM-Bench-Hard-NIAH-100

This board is intentionally a placeholder right now — two rows, both Qwen3-VL-8B-Instruct with different memory representations. We left the tab labeled Preview because we want it to fill up with submissions, not with our own ablations.


What we’re hoping to see

The leaderboard isn’t a vanity project; it’s an attempt to make the field’s claims falsifiable. A few specific submission shapes would be especially useful:

  • Smaller answer models on the Hard split. Right now most non-Oracle entries use Qwen3-VL-8B as the answerer. Does Qwen3-VL-2B catch up with a better harness? What about a 70B open model? These are the experiments that tell us whether the bottleneck is the model or the system around it.
  • NIAH-100 entries. The long-context stress test is the cleanest way to measure how much the harness helps versus just throwing more tokens at the model. We want every claim of “million-token context” to be checkable against a real, messy needle.
  • Submissions from architectures we haven’t tested. Graph-based memories, RL-trained memory agents, hybrid retrieval pipelines — the leaderboard explicitly tracks system type as a column precisely so non-canonical approaches have a visible spot.

If you build memory systems, RAG, or autonomous agents that touch personal data, please run ATM-Bench and submit. Even a low score is useful: it tells us where the field actually stands, not where the press releases say it stands.


What we changed under the hood

A few decisions that matter if you’re cloning the leaderboard pattern for your own benchmark:

  • - for unreported fields. Numeric cells default to null, which renders as a muted -, gets excluded from “best in column” highlights, and always sorts to the bottom regardless of direction. This means a submitter doesn’t have to fabricate a value to fill every cell.
  • Best-in-column is recomputed over visible rows. If you filter to just Memory + RAG, the gold highlight shifts to the best Memory/RAG score, not the global best. This is what you want when you’re using the filter to compare a subset.
  • Harness names are clickable when a link field exists. Every system in the table either links to its canonical repo (A-Mem → WujiangXu/A-mem, Mem0 → mem0ai/mem0, Claude Code → anthropics/claude-code, etc.) or to its paper on arXiv (ScrapMem, ATM-RAG). It’s a one-line render branch but it makes a huge difference for readers tracing where a number came from.
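
For anyone copying the pattern, the three behaviours above fit in a handful of small functions. This is a sketch under assumptions, not the code shipped in leaderboard.html: the field names (`qs`, `recallAt10`, `link`) and the helper names are invented here, and the name rendering returns an HTML string rather than touching the DOM so that it can run outside a browser.

```javascript
// Sort rows by a numeric column; null (unreported) always goes last,
// whichever direction the user picked.
function sortRows(rows, key, descending = true) {
  return [...rows].sort((a, b) => {
    const x = a[key], y = b[key];
    if (x === null && y === null) return 0;
    if (x === null) return 1;  // a goes after b
    if (y === null) return -1; // a goes before b
    return descending ? y - x : x - y;
  });
}

// Best value in a column, computed only over the rows currently visible.
// Returns null when no visible row reports the metric.
function bestInColumn(visible, key) {
  const values = visible.map((r) => r[key]).filter((v) => v !== null);
  return values.length ? Math.max(...values) : null;
}

// Render a numeric cell: "-" for null, otherwise a percentage.
function renderCell(value) {
  return value === null ? "-" : value.toFixed(1) + "%";
}

// Render the system name as a link only when a link field exists.
// HTML escaping is omitted to keep the sketch short.
function renderName(row) {
  return row.link ? `<a href="${row.link}">${row.system}</a>` : row.system;
}

// Three rows from the Hard split; agents report no Recall@10.
const rows = [
  { system: "Codex / GPT-5.2", type: "Agent", qs: 39.7, recallAt10: null, link: null },
  { system: "ATM-RAG", type: "RAG", qs: 13.8, recallAt10: 30.4, link: null },
  { system: "MemoryOS", type: "Memory", qs: 13.7, recallAt10: 32.7, link: null },
];

const ascending = sortRows(rows, "recallAt10", false).map((r) => r.system);
const descending = sortRows(rows, "recallAt10", true).map((r) => r.system);
const memoryAndRag = rows.filter((r) => r.type !== "Agent");
const bestQs = bestInColumn(memoryAndRag, "qs");
```

Handling `null` before the numeric comparison is what keeps unreported cells at the bottom in both directions; a plain `y - x` comparator would coerce `null` to 0 and sort it as the lowest score.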

What we still don’t know

Two things keep me up at night about this leaderboard:

Is the Oracle ceiling itself wrong? GPT-5 with gold context gets 74.7% on Hard. The remaining 25% includes questions where the gold evidence is genuinely ambiguous, where multiple correct answers exist, and where the model and human annotator disagree on what counts as “the answer.” We treat Oracle as the ceiling, but it isn’t really — it’s a noisy ceiling, and we should probably report human agreement separately as the actual ceiling.

Are agents the wrong abstraction for memory? Codex burns roughly 15M tokens per run to get to 40% on Hard. That’s not a memory system; that’s a brute-force exploration of a file system. If the eventual winning architecture looks more like Codex than like A-Mem, then “memory benchmarks” are really “long-horizon agent benchmarks,” and we should be designing the next benchmark around that framing. But if memory wins (i.e., a $0.10/query memory system eventually beats a $30/query agent), then the leaderboard is exactly the right artifact.

I don’t know which one is true. The leaderboard is how we’ll find out.


Try it

Leaderboard: atmbench.github.io/leaderboard.html

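If you want to submit: append one object to the data array at the bottom of leaderboard.html and open a PR, or open an issue with the numbers.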

If you want to reproduce the existing entries:

The leaderboard updates as PRs land. If the headline number on the Hard split changes by more than five points before the end of the year, I’ll write a follow-up.


Jingbiao Mei is a final-year PhD student at the University of Cambridge’s Machine Intelligence Lab, working on multimodal retrieval, agent systems, and the kind of memory problems that don’t fit in a chat window. He is the creator of ATM-Bench, FLMR, PreFLMR, and ExPO-HM.


