AI engineering · 11 min read
How to stop context loss in Claude Code: an operational memory approach
You are not rebuilding software anymore. You are rebuilding context.
Every new Claude Code session starts the same way: you re-explain the codebase, re-paste the playbook, re-state the rules that were obvious to the agent yesterday. The window forgets. The playbook gets longer. The drift gets worse. This guide shows why throwing more memory at the problem makes it worse, and how the operational memory layer GNETICS OPS uses — typed executable patterns retrieved contextually, plus explicit stop conditions on every ticket — addresses it in practice.
What context loss actually is (from an operator's seat)
The textbook description of context loss is "the agent's session memory ends when the session ends." That is true and useless. The way it actually feels, when you operate an AI agent against real tickets, is this: every Monday morning the agent reports for work as a fresh contractor. It does not know what we shipped Friday. It does not know which fix we rejected last week. It does not know that the convention on this codebase is to handle timeouts at the boundary and never inside the retry loop. You walk it through all of it again. By Tuesday the same agent, in a new session, has forgotten all of it again.
The agent is not failing. The agent is doing exactly what it was built to do: read the prompt, reason inside the window, generate. The failure is upstream — there is nothing feeding the agent the operational knowledge it needs at the moment it needs it. So either you re-feed it (the re-explanation tax), or you write it a giant playbook and hope it survives the next prompt (the drift problem we will get to in a minute).
Why bigger windows are not the fix
The reflex when you first hit context loss is to ask for a bigger window. Surely a 200,000-token context, or a million-token context, makes the problem go away?
It does not, for three concrete reasons.
1. Tokens are not free
A 200k-token operational preamble that ships on every single call is a tax, not a strategy. A real task does not need to re-read the whole company brain — it needs the two or three patterns relevant to this task. Stuffing the full playbook into the prompt is the wrong shape: you pay the full price for a tiny fraction of the value.
2. Attention degrades with distance
Long-context benchmarks have repeatedly shown accuracy dropping as the prompt grows, especially for instructions that sit far from where the model is generating. The model is technically able to read 200k tokens. It is not equally able to remember and apply what was said on token 4,000 when it is generating on token 180,000. A "remembered" instruction that gets functionally ignored is worse than no instruction at all: the operator loses the signal that the agent does not know.
3. Sessions still end
Even with an infinite window and perfect attention, the session boundary is set by the user, not by the model. People close laptops. New chats get opened. The window resets. What carries forward is only what the operator can re-paste, or what the agent can look up. That last word is the lever: if the agent can look something up, the window does not need to remember it.
The hidden cost: the silent regression
The visible cost of context loss is the re-explanation tax: 10 to 30 minutes at the start of every session bringing the agent back up to speed. Annoying, measurable, and what most teams complain about first.
The expensive cost is the one nobody sees coming.
Real GNETICS scenario
Problem. In an early version of our agent stack we shipped a single 14,000-character operational playbook on every task, on the assumption that more instructions would produce smarter behaviour.
What failed. Instructions placed early in the prompt got functionally forgotten by the time the agent reached the action. Constraints rejected last week reappeared in this week's diff. The agent looked confident, the diff looked plausible, and the silent regression was caught only after the fact — when at all.
What changed. We replaced the 14k playbook with operational memory retrieved contextually by the current ticket — only the patterns matching this stage, this tool, this error signature. The agent stopped re-reading the whole brain on every task and started looking up only what mattered, at the moment it mattered.
Measured operational effect. Sessions stopped opening with a re-paste ritual. Recurring regressions surfaced in retrieval before they surfaced in production. Operator review shifted from "did the agent re-learn the rules" to "did the agent apply the pattern correctly" — a smaller, more verifiable question.
This is what we mean by the silent regression: the agent re-proposes a fix you already rejected, because nothing remembers the rejection. The diff looks fine. It ships. You only notice the wrong pattern shipped two days later, when the bug it was supposed to fix comes back. The re-explanation tax wastes time. The silent regression wastes trust — once an operator stops trusting the diff, the productivity gain from running the agent at all starts collapsing.
Operational memory, not "more memory"
The lesson from the 14k playbook is that the problem is not memory volume, it is memory shape. Pouring more memory into the prompt produces instruction drift. Storing it in an unstructured vector blob produces noisy retrievals. What works is operational memory: a layer with a deliberate shape.
Three properties make a memory layer operational:
- Typed records the agent can act on directly. Not chat transcripts. Not free-form Markdown. Structured entries with named fields the agent reads at execution time and applies without reinterpretation.
- Two bounded operations. The agent calls search before coding to retrieve patterns relevant to the current task, and calls contribute after solving to file what it just learned. Two operations. Not eight. Not a Swiss army knife. The contract stays small so the behaviour stays predictable.
- Per-tenant isolation as a hard guarantee. If two projects share a memory store without isolation, the agent will eventually retrieve a pattern from project A while working on project B and confidently apply it. Once. That is the last time you trust the catalogue.
A vector database is one possible implementation of this layer. A SQL table with full-text search is another. The retrieval engine matters less than the contract above. We have seen teams ship a perfectly capable operational memory on SQLite + FTS5; we have also seen teams fail spectacularly with a top-tier vector database wired to free-form notes. The shape decides the outcome.
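To make "the contract matters more than the engine" concrete, here is a minimal sketch in Python. The names OperationalMemory and InMemoryStore, the tenant ids, and the naive keyword scoring are all illustrative assumptions, not the GNETICS OPS implementation; the only things taken from the text above are the two operations and the per-tenant scoping.

```python
from collections import defaultdict
from typing import Protocol


class OperationalMemory(Protocol):
    """The whole contract: two bounded operations, always scoped to one tenant."""

    def search(self, tenant_id: str, query: str, limit: int = 5) -> list[dict]: ...

    def contribute(self, tenant_id: str, record: dict) -> None: ...


class InMemoryStore:
    """Toy backend; a SQL table or a vector index could satisfy the same contract."""

    def __init__(self) -> None:
        # One list of records per tenant, so a lookup can never cross projects.
        self._records: dict[str, list[dict]] = defaultdict(list)

    def contribute(self, tenant_id: str, record: dict) -> None:
        self._records[tenant_id].append(dict(record))

    def search(self, tenant_id: str, query: str, limit: int = 5) -> list[dict]:
        terms = set(query.lower().split())
        scored = []
        # Only this tenant's records are ever scanned.
        for record in self._records.get(tenant_id, []):
            text = " ".join(str(value) for value in record.values()).lower()
            hits = sum(1 for term in terms if term in text)
            if hits:
                scored.append((hits, record))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [record for _, record in scored[:limit]]


store = InMemoryStore()
store.contribute("project-a", {"error_signature": "TimeoutError waiting for FTS5 rebuild"})
print(store.search("project-a", "FTS5 rebuild"))  # project A finds its own pattern
print(store.search("project-b", "FTS5 rebuild"))  # project B sees nothing from project A
```

Swapping the keyword scan for full-text search or embeddings changes only the body of search; callers keep the same two methods.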
Executable patterns: memory that is actionable
The records inside an operational memory layer have a name in our system: we call them executable patterns. The point is in the word. A pattern is not a note about something we learned. It is a typed entry an agent can execute against when the current ticket matches its signature, without reinterpreting a long playbook.
Here is the shape an executable pattern carries. Names kept simple on purpose so the schema survives contact with multiple coding agents:
{
  "execution_stage": "before_edit",
  "tool_name": "edit_file",
  "error_signature": "TimeoutError waiting for FTS5 rebuild",
  "expected_behavior": "Warm the FTS index in a readiness probe before serving traffic; never block first request on rebuild.",
  "stop_condition": "Tests not green OR readiness probe missing.",
  "doc_reference": "/blog/claude-code-context-loss#stop-conditions",
  "quick_fix": "Trigger a no-op INSERT/DELETE in a startup hook to warm the FTS index before serving traffic.",
  "root_fix": "Replace FTS5 rebuild-on-attach with explicit SELECT * FROM patterns_fts LIMIT 1 in the readiness probe.",
  "tags": ["fts5", "sqlite", "warmup", "readiness-probe"],
  "status": "resolved"
}
Five things make this useful to an agent at execution time:
- execution_stage + tool_name tell the retrieval layer when this pattern is allowed to surface. Patterns scoped to before_edit do not pollute retrievals during after_test. Retrieval is filtered, not just similarity-ranked.
- error_signature is the real string the next agent will match on. "Timeout after redeploy" is hopeless; "TimeoutError waiting for FTS5 rebuild" hits. Full-text search beats vector search for this kind of exact signature; a hybrid retriever combines both.
- expected_behavior is what the agent should do, written so it can be applied without rereading the playbook. This field is what makes the pattern executable rather than purely informational.
- stop_condition is the explicit halt rule attached to the pattern. If the agent retrieves this pattern and the listed condition holds (tests not green, readiness probe missing), it halts and surfaces the situation instead of pushing through.
- quick_fix and root_fix are separated. The agent in a hurry needs the quick fix. The agent doing it right needs the root fix. Both ship in the same record so neither gets lost in the shuffle.
A pattern without these fields is a note. A pattern with these fields is a unit of operational behaviour the agent can carry out predictably.
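As a rough sketch of "filtered, not just similarity-ranked", the record above can be modelled as a typed Python object and passed through a stage and tool filter before any ranking happens. The class name ExecutablePattern, the function names, and the token-overlap ranking are assumptions made for illustration; only the field names come from the schema above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ExecutablePattern:
    execution_stage: str
    tool_name: str
    error_signature: str
    expected_behavior: str
    stop_condition: str
    quick_fix: str
    root_fix: str
    status: str
    doc_reference: str = ""
    tags: tuple[str, ...] = ()


def signature_overlap(pattern: ExecutablePattern, error_text: str) -> int:
    """Count lowercase tokens shared by the stored signature and the live error."""
    stored = set(pattern.error_signature.lower().split())
    return len(stored & set(error_text.lower().split()))


def retrieve(patterns, execution_stage, tool_name, error_text, limit=3):
    # Filter first: a pattern scoped to another stage or tool never reaches ranking.
    allowed = [
        p for p in patterns
        if p.execution_stage == execution_stage and p.tool_name == tool_name
    ]
    ranked = sorted(allowed, key=lambda p: signature_overlap(p, error_text), reverse=True)
    return [p for p in ranked[:limit] if signature_overlap(p, error_text) > 0]


warmup = ExecutablePattern(
    execution_stage="before_edit",
    tool_name="edit_file",
    error_signature="TimeoutError waiting for FTS5 rebuild",
    expected_behavior="Warm the FTS index in a readiness probe before serving traffic.",
    stop_condition="Tests not green OR readiness probe missing.",
    quick_fix="Trigger a no-op INSERT/DELETE in a startup hook.",
    root_fix="Query patterns_fts in the readiness probe.",
    status="resolved",
    tags=("fts5", "sqlite"),
)
# Same signature, but scoped to a different stage and tool.
later_stage = replace(warmup, execution_stage="after_test", tool_name="run_tests")

hits = retrieve([warmup, later_stage], "before_edit", "edit_file",
                "TimeoutError waiting for FTS5 rebuild")
print([p.execution_stage for p in hits])  # only the before_edit pattern surfaces
```

Both records match the error text equally well; the stage and tool scope is what keeps the second one out of the result.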
Autonomous tickets and stop conditions
Operational memory solves the knowledge problem — what the agent should know at the moment it acts. There is a second problem the memory layer alone does not solve: what the agent should not do when it does not know enough. This is where stop conditions come in.
An agent that improvises freely is not autonomous. It is unsupervised.
In our model, an autonomous agent works against a ticket (a discrete unit of work — a bug, a feature, a sprint task) with a bounded loop: retrieve operational patterns relevant to the ticket, execute according to those patterns, and halt on explicit stop conditions before going off-piste. The stop conditions are not vibes; they are listed up front, both on the ticket and inside the patterns the agent retrieves.
What a real stop condition looks like
Concrete examples we use on autonomous tickets at GNETICS OPS:
- Proof of delivery missing. The ticket asks for a fix; the agent must produce a verifiable artifact (a passing test, a curl that returns the expected status, a log line in production). No artifact, no done.
- Tests not green. A diff that does not green the relevant test suite is not "almost ready" — it stops. Surfacing the failure is more valuable than a confident "should work."
- Owner approval gate. Certain classes of change (schema migrations, deletes, anything touching billing) require an owner to approve before the agent proceeds. The agent halts and asks; it does not assume.
- Risk-class change. If the agent is about to widen the blast radius of its action (touching another service, broadening permissions, changing a config that affects other tenants), it stops and surfaces the change of scope.
- Permission boundary reached. The agent has read access here but not write access; the agent can write here but not deploy. When the next step would cross a boundary, the loop halts.
The point of stop conditions is not to make the agent timid. It is to make the agent verifiable. An agent that halts on stop conditions can be trusted with longer tickets, because the operator knows the boundary cases where it will surface for review. An agent without stop conditions cannot be trusted with anything past the trivial.
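One way to keep stop conditions from turning into vibes is to write each one as a named check the agent can evaluate by itself. The sketch below is an assumption about how that could look in Python; TicketState and its fields are hypothetical, and the five condition names simply mirror the list above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TicketState:
    """Facts the agent can verify on its own before the next step (hypothetical fields)."""
    tests_green: bool
    proof_of_delivery: Optional[str]  # e.g. the name of a passing test
    needs_owner_approval: bool
    owner_approved: bool
    widens_blast_radius: bool
    has_permission_for_next_step: bool


# Each stop condition is a name plus a predicate that is True when the agent must halt.
STOP_CONDITIONS: list[tuple[str, Callable[[TicketState], bool]]] = [
    ("proof of delivery missing", lambda s: not s.proof_of_delivery),
    ("tests not green", lambda s: not s.tests_green),
    ("owner approval gate", lambda s: s.needs_owner_approval and not s.owner_approved),
    ("risk-class change", lambda s: s.widens_blast_radius),
    ("permission boundary reached", lambda s: not s.has_permission_for_next_step),
]


def triggered_stops(state: TicketState) -> list[str]:
    """Return every stop condition that currently holds; an empty list means proceed."""
    return [name for name, holds in STOP_CONDITIONS if holds(state)]


state = TicketState(
    tests_green=False,
    proof_of_delivery=None,
    needs_owner_approval=True,
    owner_approved=False,
    widens_blast_radius=False,
    has_permission_for_next_step=True,
)
print(triggered_stops(state))
# ['proof of delivery missing', 'tests not green', 'owner approval gate']
```

Returning the names rather than a bare boolean gives the operator the exact reason the loop halted, which is what makes the halt reviewable.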
Wiring it to Claude Code
Claude Code can reach an operational memory layer through its native tool use and MCP (Model Context Protocol) support. There is no plugin install, no fine-tuning, no clever prompt hack. Three steps.
1. Expose the two operations
POST /api/v1/search
body: { "q": "TimeoutError after redeploy", "limit": 5 }
-> [{ pattern... }, { pattern... }]
POST /api/v1/contribute
body: { execution_stage, tool_name, error_signature, ... }
-> { id, status: "stored" }
The auth is per-tenant: every call carries the agent's tenant key, every row in the store is scoped by tenant id, and the retrieval is filtered before similarity ranking. Isolation at the database layer, not at the prompt layer.
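Here is a hedged sketch of what "isolation at the database layer" could look like behind those two endpoints, using Python's built-in sqlite3 module and an FTS5 table. It assumes the local SQLite build includes FTS5; the function names, the reduced two-field schema, and the tenant ids are illustrative, and the table name patterns_fts is borrowed from the pattern example earlier in this post.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # tenant_id is stored but not indexed for text search; it is used only as a filter.
    conn.execute(
        "CREATE VIRTUAL TABLE patterns_fts USING fts5("
        "tenant_id UNINDEXED, error_signature, expected_behavior)"
    )
    return conn


def contribute(conn, tenant_id, error_signature, expected_behavior):
    conn.execute(
        "INSERT INTO patterns_fts (tenant_id, error_signature, expected_behavior) "
        "VALUES (?, ?, ?)",
        (tenant_id, error_signature, expected_behavior),
    )
    conn.commit()


def search(conn, tenant_id, text, limit=5):
    # Quote every token so characters like '/' or ':' cannot break the FTS5 query syntax.
    match = " OR ".join('"' + tok.replace('"', '""') + '"' for tok in text.split())
    if not match:
        return []
    # The tenant filter lives in the WHERE clause, so other tenants' rows never get ranked.
    return conn.execute(
        "SELECT error_signature, expected_behavior FROM patterns_fts "
        "WHERE patterns_fts MATCH ? AND tenant_id = ? ORDER BY rank LIMIT ?",
        (match, tenant_id, limit),
    ).fetchall()


conn = open_store()
contribute(conn, "tenant-a", "TimeoutError waiting for FTS5 rebuild",
           "Warm the FTS index in a readiness probe before serving traffic.")
print(search(conn, "tenant-a", "TimeoutError after redeploy"))  # tenant A gets its pattern
print(search(conn, "tenant-b", "TimeoutError after redeploy"))  # tenant B gets an empty list
```

In a real service the tenant id would be derived from the caller's key on the server side, never taken from the request body.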
2. Bind the tools through MCP
In your project's MCP configuration, declare a memory server. Claude Code surfaces memory.search and memory.contribute as first-class tools the agent can call inside its planning loop. From the agent's perspective they look identical to read_file or run_tests.
3. Set the bounded behaviour
A short system prompt does most of the work:
Before generating code, call memory.search with the task description or the error signature. If a relevant pattern returns, follow its expected_behavior and respect its stop_condition. After resolving a non-trivial incident, call memory.contribute with the full pattern shape. Do not skip these steps on the grounds of time.
What changes after this is observable in the first ten minutes. The agent's first move on a new task stops being "ask the user to re-explain." It becomes "search the catalogue, see what's already there, proceed from a more informed plan." The 14k playbook stays out of the prompt. The patterns that actually apply arrive on demand. The drift becomes visible and controllable.
Pitfalls and honest limits
Memory is only as good as what you put in it
A catalogue full of chat transcripts and standup notes is worse than no catalogue — every search returns noise, the agent learns to ignore the tool, and the whole pattern collapses. Only contribute records with a recurring lesson and the executable fields filled. Filter aggressively. A 200-pattern brain where every entry is useful beats a 2,000-pattern brain where 90% is noise.
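A small gate in front of contribute can enforce the "executable fields filled" half of that rule mechanically. The sketch below is an assumption about how such a check might look; the function names are hypothetical, and whether a lesson is genuinely recurring remains a judgment call the code does not attempt.

```python
REQUIRED_FIELDS = (
    "execution_stage", "tool_name", "error_signature",
    "expected_behavior", "stop_condition", "quick_fix", "root_fix",
)


def missing_fields(record: dict) -> list[str]:
    """Names of executable fields that are absent or blank in a candidate record."""
    return [name for name in REQUIRED_FIELDS if not str(record.get(name) or "").strip()]


def accept_contribution(record: dict) -> bool:
    """Reject anything that would enter the catalogue as a note rather than a pattern."""
    return not missing_fields(record)


standup_note = {"error_signature": "redeploy felt slow on Friday"}
print(accept_contribution(standup_note))  # False
print(missing_fields(standup_note))       # every required field except error_signature
```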
Vector-only retrieval is not enough
Embeddings are powerful for semantic similarity but worse than full-text search for matching exact error signatures like ECONNRESET on /api/v1/search. A hybrid retriever (full-text on signature, vector on description) outperforms either alone. Do not let the architecture astronaut in you skip the boring solution.
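To illustrate the hybrid idea without pulling in an embedding model, the sketch below scores each record on two signals: a verbatim signature match, and a bag-of-words cosine that stands in for vector similarity. The "description" key, the 0.7 weight, and the sample records are assumptions chosen for the example, not measured values.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hybrid_score(query: str, record: dict, signature_weight: float = 0.7) -> float:
    # Full-text side: does the stored signature appear verbatim in the query?
    exact = 1.0 if record["error_signature"].lower() in query.lower() else 0.0
    # Vector side: word-count cosine as a stand-in for embedding similarity.
    semantic = cosine(
        Counter(query.lower().split()),
        Counter(record["description"].lower().split()),
    )
    return signature_weight * exact + (1 - signature_weight) * semantic


records = [
    {"error_signature": "TimeoutError waiting for FTS5 rebuild",
     "description": "first request blocks while the search index rebuilds"},
    {"error_signature": "ECONNRESET on /api/v1/search",
     "description": "connection reset while querying the search endpoint"},
]
query = "agent saw ECONNRESET on /api/v1/search during search"
ranked = sorted(records, key=lambda r: hybrid_score(query, r), reverse=True)
print(ranked[0]["error_signature"])  # ECONNRESET on /api/v1/search
```

Both descriptions share the word "search" with the query, so the word-overlap score alone barely separates them; the exact signature match is what puts the right record on top.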
Stop conditions need to be specific or they are noise
"Stop if anything is unclear" is not a stop condition; it is an excuse. A real stop condition names a concrete failure mode (tests not green, owner gate, risk-class change) that the agent can detect by itself. If the condition is so abstract that only a human can evaluate it, it is not bounding the behaviour, it is just inviting interruption.
Within-session drift is a different problem
Operational memory addresses session-to-session loss cleanly. The within-session drift — the agent forgets an instruction you gave 50k tokens ago because attention degraded — is a separate failure that memory does not fix. The mitigation is shorter sessions, explicit re-anchoring on important constraints, and treating long sessions with suspicion.
Frequently asked questions
Why does Claude Code keep losing context between sessions?
Because the context window is working memory, not long-term memory. When a session ends, the in-memory state is discarded. A bigger window only delays the problem.
Is context loss a memory problem or a behaviour problem?
A behaviour problem first. Pouring more memory into the prompt produces drift, not better decisions. The fix is two bounded operations against an external store, plus stop conditions that prevent the agent from improvising past what it knows.
What is operational memory for AI coding agents?
A structured catalogue of executable patterns the agent retrieves contextually at execution time, instead of being fed a 14k-character playbook on every task. Typed records with execution stage, tool name, expected behaviour, stop condition, quick fix and root fix. Actionable, not just stored.
What is an executable pattern?
A typed record the agent applies during execution without reinterpreting a long playbook. Execution stage, tool name, expected behaviour, stop condition, error signature, quick fix, root fix, doc reference, tags, status.
How do stop conditions stop AI agents from improvising?
A stop condition is an explicit halt rule on the ticket or the pattern. Proof of delivery missing, tests not green, owner gate, risk-class change, permission boundary. The agent halts and surfaces, instead of pushing through.
Does increasing the context window fix context loss?
No. Tokens are not free, attention degrades on long contexts, and sessions still end. External operational memory queried on demand is the fix, not a bigger window.
What does context loss actually cost a builder?
The visible cost is the re-explanation tax. The expensive cost is the silent regression — the agent re-proposes a fix you already rejected because nothing remembers the rejection. The silent regression is what justifies the investment.
How does this work with Claude Code in practice?
Through tool use and MCP. Expose search and contribute, bind them as tools, instruct the agent to search before coding and contribute after solving. No fine-tuning, no plugin install.
If your team spends more time rebuilding context than shipping, the bottleneck may not be the model — it may be the absence of operational memory.
GNETICS OPS was built around that single assumption.