May 06, 2026
Overview
We took MemPalace and extended its techniques to close the gap on the LongMemEval recall@5 retrieval benchmark, reaching a reproducible 100% score using only local compute (no LLM or API calls).
What this is not
- Not a LongMemEval leaderboard score. The full LongMemEval benchmark is end-to-end and involves generating answers plus GPT-4 judging. This experiment is strictly about the same retrieval metric that MemPalace was tackling.
- Not a strong metric. The metric is recall_any@5, the softer variant; recall_all@5 (requiring every gold session in the top 5) would be a harder bar. The sketch after these caveats makes the distinction concrete.
- Not a novel algorithm. The patches came from iterating on failures in the dataset and are general NLP patterns. A new benchmark could be assembled that requires different heuristics, but that would just continue the cat-and-mouse game of developing “human-comparable intelligence”.
These caveats aren’t intended to steer your attention away so much as to set expectations for the result. The central takeaway, that grammatical patterns in text can be applied to vector stores, still deserves some acknowledgement.
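For concreteness, here is a minimal sketch of the difference between the two recall variants; the function names are illustrative, not the benchmark’s actual evaluation code.

```python
# Illustrative sketch of the two recall variants, not the benchmark's code.
def recall_any_at_k(gold: set, retrieved: list, k: int = 5) -> float:
    """1.0 if any gold session appears in the top k results."""
    return 1.0 if gold & set(retrieved[:k]) else 0.0

def recall_all_at_k(gold: set, retrieved: list, k: int = 5) -> float:
    """1.0 only if every gold session appears in the top k results."""
    return 1.0 if gold <= set(retrieved[:k]) else 0.0
```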
What we did do
We achieved 100% recall@5 retrieval on all 500 LongMemEval questions. The system uses no language model, makes no API calls, and requires no GPU. The MemPalace baseline on the same metric is 96.6%, so the +3.4% improvement represents real engineering work. Shared in a project dubbed Retaining, it achieves:
- 500/500 R@5 (100% recall at rank 5)
- 500/500 R@10 (100% recall at rank 10)
- Fully deterministic and reproduced across multiple runs
Context
On April 6th, Ben Sigman shared that Milla Jovovich had fun with coding agents and built a solution for long-term memory named “MemPalace”. Fans of sci-fi movies may recognize Jovovich as the actress who played the Fifth Element as well as Alice in Resident Evil. The cherry on top: at the end of the Resident Evil series, Alice is able to take on the antagonist after her childhood memories are uploaded to her, a message rather similar to enabling agents by giving them a “memory palace”.
Though originally proclaimed to score 100% with an optional Haiku rerank before backtracking, it has racked up a good volume of attention and validation, so it’s not totally “viral slop”. By both compressing content and making historical context navigable, it highlights the efficacy of simple NLP techniques when applied with LLMs.
Improving MemPalace
What worked
If you’ve seen structured note-taking like the Cornell Note Taking System or backlinks in Obsidian, then you know there’s more to outlining text than just indexing when or where words occur. With spaCy and named entity recognition, we can extend the existing pipeline to include noun phrases and other grammatical relations that give a more detailed picture of the “ontology” representing the content at hand.
Below is a table of newly added techniques and how much they contributed to the recall@5 performance:
| Technique | Measurement | Net Δ R@5 | Net Qs Fixed |
|---|---|---|---|
| NER-enriched synthetic documents | individual | +1.6% | +8 |
| Keyword overlap re-ranking | individual | +1.2% | +6 |
| Time-based date matching | individual | +0.8% | +4 |
| Logic engine scores | individual | +0.4% | +2 |
| Theme detection | individual | +0.2% | +1 |
| NP embeddings + LogicKB rewrite | cumulative | +0.4% | +2 |
| Rank preservation injection | cumulative | +0.6% | +3 |
| Temporal-NP bridge | cumulative | +0.2% | +1 |
Individual: technique alone added to the baseline. Cumulative: technique added on top of prior ones. Deltas overlap and do not sum to total.
The top three contributors are all simple re-ranking heuristics. The logic engine contributes modestly and actually causes the most regressions. The finding from this experiment: enriching the NLP extraction in a retrieval pipeline can produce more than improving the logic engine that queries it. (Damn you, bitter lesson!)
In more detail
1. spaCy-based extraction
Every session gets processed through spaCy’s en_core_web_sm pipeline. We extract entities, noun phrases, relations (subject-verb-object triples), time-related markers, and quoted phrases. This takes ~5 seconds per question’s haystack when run on my MacBook.
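A minimal sketch of what that extraction step can look like, assuming spaCy’s small English pipeline; the field names and dependency rules here are illustrative, not the project’s actual schema.

```python
# Illustrative extraction sketch; field names and dependency rules are
# assumptions, not the project's actual schema.
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_facts(session_text: str) -> dict:
    doc = nlp(session_text)
    facts = {
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
        "noun_phrases": [chunk.text for chunk in doc.noun_chunks],
        "time_markers": [ent.text for ent in doc.ents
                         if ent.label_ in ("DATE", "TIME")],
        "quotes": re.findall(r'"([^"]+)"', session_text),
        "relations": [],
    }
    # Subject-verb-object triples read off the dependency parse
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [w for w in token.lefts if w.dep_ in ("nsubj", "nsubjpass")]
            objects = [w for w in token.rights if w.dep_ in ("dobj", "attr")]
            for s in subjects:
                for o in objects:
                    facts["relations"].append((s.text, token.lemma_, o.text))
    return facts
```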
2. Pure-Python logic engine
A LogicKB Python class stores extracted facts as inverted indexes. For each query, it looks up matching objects across all sessions, returning a weighted score per session. This replaced an earlier Prolog approach: same idea, but with much less complexity and no IPC overhead.
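A minimal sketch of the inverted-index shape; the weights and method names are assumptions rather than the project’s actual API.

```python
# Illustrative inverted-index fact store; weights are assumptions.
from collections import defaultdict

class LogicKB:
    def __init__(self):
        # term -> {session_id: weight}
        self.index = defaultdict(lambda: defaultdict(float))

    def add_fact(self, session_id: str, subj: str, verb: str, obj: str):
        # Index every component of the triple; objects weighted highest
        for term, weight in ((obj, 2.0), (subj, 1.0), (verb, 0.5)):
            self.index[term.lower()][session_id] += weight

    def score(self, query_terms: list) -> dict:
        # Sum weights of matching terms per session
        scores = defaultdict(float)
        for term in query_terms:
            for session_id, weight in self.index.get(term.lower(), {}).items():
                scores[session_id] += weight
        return dict(scores)
```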
3. NER-enriched synthetic documents
For each session, we create a document containing its extracted facts and details. These get indexed alongside the raw session text, giving the embedding model a richer retrieval surface. This is the single biggest contributor to accuracy.
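A minimal sketch of how such a synthetic document could be built and indexed next to the raw session, assuming ChromaDB’s default client; the template and collection names are illustrative.

```python
# Illustrative synthetic-document builder; template and names are assumptions.
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("sessions")

def synthetic_doc(facts: dict) -> str:
    return "\n".join([
        "Entities: " + ", ".join(text for text, _ in facts["entities"]),
        "Topics: " + ", ".join(facts["noun_phrases"]),
        "Facts: " + "; ".join(" ".join(triple) for triple in facts["relations"]),
        "Dates: " + ", ".join(facts["time_markers"]),
    ])

def index_session(session_id: str, raw_text: str, facts: dict):
    # Raw text and the fact summary are indexed as separate documents that
    # point back at the same session, widening the retrieval surface.
    collection.add(
        ids=[session_id, session_id + ":facts"],
        documents=[raw_text, synthetic_doc(facts)],
        metadatas=[{"session": session_id}, {"session": session_id}],
    )
```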
4. Noun-phrase embedding bridge
We embed each session’s extracted objects into a separate ChromaDB collection and query it with the question’s noun phrases. This bridges gaps that neither keywords nor full-document embeddings can cross. “Battery life phone” → “portable power bank” has a close enough embedding distance in the noun phrase space to pick up the right session.
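A minimal sketch of the bridge, assuming a second ChromaDB collection holding each session’s extracted objects; the collection name and the vote aggregation are assumptions.

```python
# Illustrative noun-phrase bridge; aggregation scheme is an assumption.
import chromadb
import spacy

nlp = spacy.load("en_core_web_sm")
client = chromadb.Client()
np_collection = client.get_or_create_collection("noun_phrases")

def np_bridge(question: str, n_results: int = 5) -> dict:
    phrases = [chunk.text for chunk in nlp(question).noun_chunks] or [question]
    results = np_collection.query(query_texts=phrases, n_results=n_results)
    votes = {}
    # Each phrase votes for the sessions its nearest neighbors belong to;
    # a smaller embedding distance means a stronger vote.
    for metadatas, distances in zip(results["metadatas"], results["distances"]):
        for meta, dist in zip(metadatas, distances):
            session_id = meta["session"]
            votes[session_id] = votes.get(session_id, 0.0) + (1.0 - dist)
    return votes
```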
5. Time-related bridge
For time-related questions (“What did I buy 10 days ago?”), we first identify all sessions in the date window, then run the noun phrase bridge within that filtered set. This discriminates between 14 sessions that all share the same date by finding the one whose noun phrases are topically closest to the question.
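A minimal sketch of the temporal filter; how session dates are stored and the window slack are assumptions.

```python
# Illustrative date-window filter; storage format and slack are assumptions.
from datetime import date, timedelta

def sessions_in_window(session_dates: dict, anchor: date,
                       days_ago: int, slack: int = 1) -> list:
    # Keep only sessions whose date falls near the target day
    target = anchor - timedelta(days=days_ago)
    lo, hi = target - timedelta(days=slack), target + timedelta(days=slack)
    return [sid for sid, d in session_dates.items() if lo <= d <= hi]

# The noun-phrase bridge is then restricted to those candidates, e.g. with
# a ChromaDB metadata filter:
#   np_collection.query(query_texts=phrases, n_results=5,
#                       where={"session": {"$in": candidates}})
```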
What didn’t work
When people complain about LLMs failing to answer questions or hallucinating false information, nobody complains about the LLM’s ability to identify the question it needs to answer (we can depend on AI to write code that does a thing rather than depend on it to handle a task end-to-end). For answering questions and digging through “long context” problems, I first attempted to have the LLM use Prolog for storing and retrieving facts.
However, the semantic fuzziness (i.e. synonyms, or finding topics similar to a query) ended up hurting the overall score more than helping. MemPalace’s approach of depending on a vector store actually proved to be “more correct” in this experiment.
Nevertheless, I do think there may be types of problems where realistic input queries (ignoring cases where people get funny and try to jailbreak support agents) would benefit from a more structured, queryable store of relations between objects. Prolog just may not be a low-hanging-fruit solution for long-term memory problems where semantic similarity is worth indexing.
Running it yourself
First, clone the repo and install dependencies.
git clone https://github.com/hdresearch/retaining
cd retaining
python3 -m venv .venv && source .venv/bin/activate
pip install spacy chromadb
python -m spacy download en_core_web_sm
Next, download the dataset for the benchmark.
# Download LongMemEval data (~265MB)
curl -fsSL -o /tmp/longmemeval_s_cleaned.json \
https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
Lastly, run the benchmarks.
# Vector-only baseline: 96.6% R@5, ~5 min
python bench_v2.py /tmp/longmemeval_s_cleaned.json --mode vector
# Full hybrid: 100% R@5, ~50 min
python bench_v2.py /tmp/longmemeval_s_cleaned.json --mode hybrid
No API keys. No GPU. Python 3.9+ and ~300MB of disk.
Conclusion
AI famously hit “winters” in the past when some wall prevented computers from becoming sufficiently intelligent. Interestingly, the problem back then was that “symbolic” approaches to AI would fall short when it came to the last mile of complexity. Similarly, LLM-maximalist approaches also run into a “last mile problem” when it comes to ensuring the accuracy of details (i.e. hallucination).
By incorporating older NLP techniques to tackle the “last mile” problems of modern LLM-based approaches, there are rather interesting results to be found! That said, the implementation used here to game recall@5 is by no means a complete solution for knowledge retrieval.
The beauty of the finding is that the old problem of “if only someone had sat down long enough to write every NLP grammar rule” becomes somewhat negligible in a world with coding agents. So, rather than continuing to see human text as black boxes, know that a richer pipeline may supply enough complexity for some information to be adequately indexed.