I've been building an autonomous work agent — a system where the LLM decides what to do, and scripts, schedulers, and databases are tools it reaches for. The design wasn't planned upfront. It emerged over several weeks of building in rounds with Claude Code, starting from a previous agent that was mostly bash scripts and a constellation of markdown files feeding the context window. Each round surfaced a new limitation, and each limitation pointed at the same root cause: memory.
Self-evaluation needs memory of past evaluations. Initiative needs memory of what the operator hasn't seen. Value learning needs memory of decisions and their outcomes. None of this works with just a context window and a system prompt.
I finished the implementation yesterday, late. Then I went looking for references on the current state of agent memory research. I found convergences with the literature that gave me some confidence, at least one decision I think I got right, and many open paths I hadn't considered. I need weeks of actual use before drawing conclusions — but the problems are interesting enough to share now.
Why memory becomes structural
Most LLM systems follow a pattern: an application orchestrates the model. The script decides what to do, the model executes. Memory in this setup is curated — the orchestrator chooses what context to feed.
When you invert this — the model decides, the scripts serve — memory becomes load-bearing. The model must reason about its own context needs: what it knows, what it did before, what worked, what changed. Memory moves from a retrieval problem ("find the relevant chunk") to an infrastructure problem ("maintain the operational context of an autonomous entity").
The CoALA framework (Sumers et al., 2024) formalizes this as a cognitive architecture for language agents, distinguishing working memory, episodic memory, semantic memory, and procedural memory. I arrived at each of these categories empirically — because the system broke in specific ways when any one was missing. Finding the formal taxonomy afterward was reassuring.
Session logs as raw material
Claude Code writes a JSONL file per session — one structured event per line — with messages, tool calls, outcomes, errors, token usage. Append-only, real-time. Most users never look at them.
I built a daemon that watches these files with fswatch and parses every event as it arrives: decisions, tool calls, outcomes, errors. It feeds a temporal knowledge graph and maintains an aggregated portfolio state across projects. The context window within a session handles short-term memory through compression and summarization. But between sessions, that conversation context is gone. The daemon bridges this gap — processing everything that happened, so the next session starts with rich operational state reflecting all prior work.
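The extraction step the daemon performs can be sketched in a few lines. The event schema below (`type`, `name`, `tokens`) is a hypothetical simplification, not Claude Code's actual JSONL format, and the fswatch wiring is omitted — this is only the parse-and-aggregate core:

```python
import json
from collections import Counter

def parse_session_events(jsonl_text):
    """Aggregate tool calls, errors, and token usage from a session log."""
    summary = {"tool_calls": Counter(), "errors": [], "tokens": 0}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)  # one structured event per line
        if event.get("type") == "tool_call":
            summary["tool_calls"][event["name"]] += 1
        elif event.get("type") == "error":
            summary["errors"].append(event.get("message", ""))
        summary["tokens"] += event.get("tokens", 0)
    return summary

# Hypothetical sample of the append-only event stream
sample = "\n".join([
    '{"type": "tool_call", "name": "bash", "tokens": 120}',
    '{"type": "tool_call", "name": "edit", "tokens": 80}',
    '{"type": "error", "message": "timeout", "tokens": 5}',
])
print(parse_session_events(sample))
```

In the real system each summary feeds the knowledge graph; the point here is that the raw material is already structured, so extraction is cheap.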
Several open-source projects have emerged around this JSONL format — session viewers, replay tools, monitoring APIs. For agent builders, session logs are the richest first-party data available, and most architectures ignore them entirely.
Temporal graphs vs. static RAG
Standard RAG treats memory as search: embed chunks, find neighbors, inject into context. It works for factual recall but fails for the reasoning an autonomous agent actually needs.
Concrete example: my agent manages a portfolio of projects — software, a book, articles. These share concepts, but the connections are structural and temporal. The book thesis evolved when the software design was refactored. A business insight changed the book's framing. Static RAG retrieves each chunk independently; it cannot represent that these converged during a specific period, or why.
I use Graphiti, a temporal knowledge graph on FalkorDB. The graph is bi-temporal: every fact has a valid_at (when it became true), an invalid_at (when superseded), and a created_at (when the system learned it). Facts are never deleted — they're invalidated. Full lineage preserved.
This enables queries impossible with vector retrieval:
- "Which projects share concepts that evolved during the same period?"
- "What decisions had no recorded outcomes after seven days?"
- "What signals were detected but never acted on?"
The CONTRADICTS edge type is particularly interesting. When a new fact contradicts an existing one, both are preserved with an explicit link. The agent can reason about why it previously believed X and what changed. That is qualitatively different from a search engine with good recall.
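Graphiti handles all of this internally; to make the bi-temporal mechanics concrete, here is a toy model of invalidation-without-deletion, with the CONTRADICTS link reduced to a plain attribute (names and structure are illustrative, not Graphiti's API):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Fact:
    statement: str
    valid_at: datetime                     # when the fact became true
    created_at: datetime                   # when the system learned it
    invalid_at: "datetime | None" = None   # set when superseded, never deleted
    contradicts: "Fact | None" = None      # CONTRADICTS link to the old fact

def assert_fact(store, statement, contradicts=None):
    now = datetime.now(timezone.utc)
    fact = Fact(statement, valid_at=now, created_at=now, contradicts=contradicts)
    if contradicts is not None:
        contradicts.invalid_at = now  # invalidate the old fact in place
    store.append(fact)
    return fact

store = []
old = assert_fact(store, "Book framing follows the original software design")
new = assert_fact(store, "Book framing follows the refactored design",
                  contradicts=old)
# Both facts survive: the old one is invalidated, with full lineage intact.
```

The temporal queries in the list above then become filters over `valid_at` and `invalid_at` rather than similarity searches.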
Learning from memory, not just retrieving it
Memory also becomes a substrate for learning operator values.
I use contextual Thompson Sampling — a reinforcement learning algorithm. Context: portfolio features (deadlines, momentum, project type). Actions: what to prioritize. Reward: operator approval (1) or override (0). The learned weights encode the operator's real preferences, including the gap between stated and revealed preferences.
The key architectural decision: the value function reads from the same temporal graph that stores everything else. Decisions, outcomes, operator reactions — all graph entities. Memory and learning share substrate, which keeps both debuggable.
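Stripped of the portfolio-feature machinery, the learning loop is small. Below is a minimal Beta-Bernoulli sketch where the context is collapsed to a discrete bucket — the real system uses richer features, and the context, action names, and training signal here are all invented for illustration:

```python
import random
from collections import defaultdict

random.seed(0)

class ThompsonPrioritizer:
    """Thompson Sampling with one Beta posterior per (context, action)."""
    def __init__(self):
        # (alpha, beta) counts, starting from a uniform Beta(1, 1) prior
        self.posteriors = defaultdict(lambda: [1.0, 1.0])

    def choose(self, context, actions):
        # Sample a plausible success rate for each action; act greedily on it.
        samples = {a: random.betavariate(*self.posteriors[(context, a)])
                   for a in actions}
        return max(samples, key=samples.get)

    def update(self, context, action, reward):
        # reward: 1 = operator approved, 0 = operator overrode
        key = (context, action)
        self.posteriors[key][0] += reward
        self.posteriors[key][1] += 1 - reward

agent = ThompsonPrioritizer()
# Simulated operator who always approves shipping when a deadline looms
for _ in range(200):
    a = agent.choose("deadline_week", ["ship_software", "write_book"])
    agent.update("deadline_week", a, 1 if a == "ship_software" else 0)
```

After a couple hundred rounds the posterior mass concentrates on the action the operator actually rewards — which is exactly the stated-versus-revealed-preference gap made measurable.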
This connects to Memory-R1 (2025), which proposes RL-oriented memory management. My approach is adjacent but distinct: RL learns preferences from memory, rather than managing memory itself. Separating the memory substrate from the learning layer has made both easier to reason about.
What's still unsolved
Selective forgetting
My graph never forgets — every fact preserved, possibly invalidated but never deleted. Philosophically clean, practically problematic as the graph grows. Not all memories are equally valuable.
FadeMem (2025) proposes bio-inspired forgetting based on Ebbinghaus curves — memories decay unless reinforced by access or relevance. This is compelling and well-grounded in cognitive science. I haven't implemented it yet. Current mitigation is coarse: stale insights expire after seven days. Adding proper decay dynamics is the clearest next step for my system.
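The decay dynamics themselves are simple to state. A sketch of an Ebbinghaus-style retention score, where access-reinforced memories get a larger strength and so decay more slowly — the threshold and strength values are placeholders, not tuned:

```python
import math
from datetime import datetime, timedelta, timezone

def retention(last_access, now, strength_days):
    """Exponential forgetting curve: retention = exp(-elapsed / strength).
    Higher strength (more reinforcement) means slower decay."""
    elapsed = (now - last_access).total_seconds() / 86400.0  # in days
    return math.exp(-elapsed / strength_days)

def should_archive(last_access, now, strength_days, threshold=0.3):
    return retention(last_access, now, strength_days) < threshold

now = datetime.now(timezone.utc)
fresh = retention(now - timedelta(days=1), now, strength_days=7)   # ≈ 0.87
stale = retention(now - timedelta(days=30), now, strength_days=7)  # ≈ 0.014
```

Compared to my current hard seven-day expiry, this makes forgetting graded: a memory that keeps getting accessed earns a longer half-life instead of falling off a cliff.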
Memory reconsolidation
Distinct from forgetting: updating existing memories when new information arrives, rather than adding new facts alongside old ones.
A-Mem (2025) proposes memory with agency — capable of self-organizing through reconsolidation. When new experience arrives, it evaluates whether existing memories should be updated, merged, or restructured. I have a crude version (EVOLVED_INTO edges connecting old facts to successors), but reconsolidation isn't automatic yet. The challenge is cost: reconsolidation requires LLM reasoning about memory relationships, competing with production work for token budget. At my scale — single operator, fixed subscription, no dedicated GPU — strategies like fine-tuning or multiple LLM calls for signal weighting and aggregation are out of reach. The architecture has to be efficient by design, not by throwing compute at it.
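One way to keep reconsolidation affordable is to gate the expensive step behind a cheap filter. A sketch of that escalation logic — the lexical similarity check, the 0.6 threshold, and the 500-token cost are all invented, and the actual LLM call is elided:

```python
from difflib import SequenceMatcher

def maybe_reconsolidate(new_fact, existing_facts, llm_budget_tokens):
    """Budget-gated EVOLVED_INTO linking: a cheap lexical filter nominates
    candidate predecessors; only those would be escalated to LLM reasoning,
    and only while the token budget holds out."""
    links = []
    for old in existing_facts:
        similarity = SequenceMatcher(None, new_fact, old).ratio()
        if similarity > 0.6:             # plausible predecessor
            if llm_budget_tokens < 500:  # too expensive right now: defer
                continue
            links.append((old, "EVOLVED_INTO", new_fact))
            llm_budget_tokens -= 500     # charge the (hypothetical) LLM call
    return links

links = maybe_reconsolidate(
    "Book thesis: memory is load-bearing infrastructure",
    ["Book thesis: memory is infrastructure", "Unrelated grocery list"],
    llm_budget_tokens=2000)
```

The filter is crude by design: at single-operator scale, reconsolidation that competes with production work for tokens has to be rationed, not maximized.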
The local-first thesis
A pattern across both my implementation and the research landscape: the most capable agent memory systems are local-first.
The practical reasons are obvious: latency, privacy, cost, debuggability. But the deeper reason is architectural. An autonomous agent with local memory can be inspected at every layer. The graph is queryable. The logs are readable. The value function weights are a JSON file. There is no black box.
When the agent makes a surprising decision, I can trace the memory chain that led to it — from JSONL event to daemon extraction to graph entity to value function weight to recommendation. This inspectability is a prerequisite for trust. And trust is what gates how much autonomy you can responsibly grant.
Practical takeaways
If you're building agent memory:
Start with ingestion, not retrieval. The richest memory comes from your own session logs — structured, comprehensive, already there. Most architectures start by choosing a vector database. Start by parsing your logs.
Temporal metadata is not optional. Every fact needs valid_at and created_at at minimum. Without temporality, your memory is a snapshot masquerading as history.
Separate memory substrate from learning layer. The graph stores facts. The value function learns from facts. Keeping these distinct makes each independently debuggable.
Make memory inspectable. If you can't explain why the agent made a decision by tracing through its memory, you don't have an agent — you have a stochastic process.
The implementation is done; the first real tests — long-term memory, initiative, analysis — will come from weeks of actual use. I'm sharing this now because the problems are interesting and the space is moving fast.
If you're working on temporal memory, value learning, or forgetting — I'd like to hear from you.