Design context management for long-horizon agents
Interview-style answer. First-person, as the candidate. I talk through decisions out loud, flag tradeoffs explicitly, and mark Staff-level signals wherever they appear. I’ll use ASCII diagrams and concrete numbers rather than hand-waving.
1. Clarify scope and assumptions
Before I start designing, a few questions I’d want to pin down with the interviewer. I’ll ask them, then state what I’m assuming and move on — I won’t block on answers.
Questions I’d ask:
- Is this a single product (like a coding agent or research agent), or a platform hosting many agent applications? That changes whether the memory service is multi-tenant across apps or per-app.
- What does “long-horizon” mean here — minutes, hours, days, weeks? Days changes durability requirements substantially.
- Are we targeting a specific underlying model with a known context window (e.g. 200K, 1M), or must the design be model-agnostic?
- Do users expect cross-session memory (the agent “remembers me” across distinct conversations) or just intra-session persistence across pauses and resumes?
- Is there a human-in-the-loop step? If yes, the agent can idle for hours or days between inputs, and we need durable resumable workflow state.
- Are tools read-only (pure retrieval) or read-write (emailing, booking, committing code)? Read-write tools change the correctness story — idempotency and replay matter.
- What’s the cost envelope? If a user run costs $0.10 vs $10, different caching and retrieval strategies make sense.
- What latency target per step — interactive (~1s p50), batch (~30s), or background (minutes)?
- Is data per-tenant isolated (must never leak across customers) or per-user within a shared tenant?
- What evaluation signal do we have — user ratings, labeled traces, downstream task success?
Assumptions I’ll make to proceed:
- Multi-tenant platform. Memory store is tenant-scoped with hard isolation.
- Long-horizon = up to ~72 hours per task, multi-week cross-session memory.
- Target model is 200K-context-class with prompt caching (Claude Sonnet / GPT-4.1 family). I’ll design to not rely on a large window; the window is one layer, not the whole answer.
- Cross-session memory is in scope. The agent should remember users.
- Tools are in scope and some are write-capable. Idempotency and approval workflows matter.
- Interactive p50 latency target ≤ 2s per agent step; p99 ≤ 8s.
- Per-run cost budget ~$1–$10 depending on tier; we need to drive this down through caching and retrieval.
- Multi-agent workflows are in scope but orchestrated through the same memory substrate.
Staff-level signals in this section
The signal here is that I refuse to design without naming the assumptions, and I pick defaults that force the harder version of the problem. A mid-level candidate often assumes “large context window = solved” and then designs within that. I’m explicitly rejecting that — I’m assuming the window is a scarce, lossy resource, because that matches the operational reality: models exhibit context rot, where recall degrades as tokens pile up, even well below the hard limit. If the interviewer pushes me toward “why not just use a 1M window,” I have an answer: quality, latency, and cost all degrade with context length, and the window is not where durability lives.
2. Requirements
Functional:
- Maintain coherent behavior across turns, pauses, resumes, crashes, and multi-day gaps.
- Ingest and surface relevant history from: prior messages in the session, prior sessions, task-relevant documents, tool call results, environment state, user-declared facts.
- Support tool use with durable traces and replay-safe semantics for side-effecting tools.
- Support human-in-the-loop pauses: the agent must yield, persist state, and resume cleanly.
- Support multi-agent workflows sharing a common memory substrate with clear ownership.
- Support user-visible memory management: view, edit, forget.
Non-functional:
| Dimension | Target | Why |
|---|---|---|
| Step latency p50 | ≤ 2s | Interactive feel |
| Step latency p99 | ≤ 8s | Catches pathological cases |
| Cost / agent step | ≤ $0.02 median | Enables consumer-grade use |
| Long-horizon task success | No degradation beyond 10% vs short-horizon on matched tasks | Otherwise system is not actually “long-horizon” |
| Durability of workflow state | 99.99% | Users will not tolerate losing a 4-hour job |
| Tenant isolation | Hard (cryptographic + policy) | Enterprise requirement |
| Debuggability | Full provenance from any output claim back to source | Needed to triage regressions |
What “quality stays high as context grows” means operationally:
It means that on a fixed eval suite (graded by a judge model + labeled rubrics), the task success rate at step N is within a small ε of step 1. Specifically: I want the success curve over a 100-step run to be approximately flat, not a decay. That’s the metric I’d hold myself to. If I can’t measure it, I can’t claim the system works long-horizon.
It also means no silent degradation. Context rot is insidious precisely because it’s invisible — the model just gets dumber without telling you. So operationally I need synthetic probes that fail loudly when compaction or retrieval quality drops.
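The flatness requirement above is easy to turn into a loud probe. Here is a minimal sketch, assuming graded run traces arrive as (step_index, passed) pairs; the function names and the bucketing scheme are illustrative, not from any particular eval harness:

```python
# Sketch of the "flat success curve" check: compare task success in
# early step-buckets vs late ones, and alert when the gap exceeds ε.
# All names (step_results, bucket_size) are illustrative.

def success_by_bucket(step_results, bucket_size=10):
    """step_results: list of (step_index, passed: bool) from graded runs.
    Returns mean success rate per bucket of steps."""
    buckets = {}
    for step, passed in step_results:
        b = step // bucket_size
        buckets.setdefault(b, []).append(1.0 if passed else 0.0)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}

def horizon_degradation(step_results, bucket_size=10):
    """Success-rate drop from the earliest bucket to the latest one."""
    rates = success_by_bucket(step_results, bucket_size)
    first, last = min(rates), max(rates)
    return rates[first] - rates[last]
```

Wire the degradation number into an alert at ε (say 0.10) and context rot stops being silent: a compaction or retrieval regression shows up as a measurable late-horizon drop.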
Staff-level signals in this section
Naming an SLO for quality preservation over horizon length is the staff move here. Mid-level requirements lists stop at latency and cost. The thing that kills long-horizon agents is quality drift, and I’m treating it as a first-class measurable requirement, not an aspiration. I’m also being honest about the failure mode — “silent” — which shapes the observability design later.
3. Core problem framing
This is not “store chat history.” If it were, we’d append to a list, truncate from the front, and be done. The reason it’s hard:
The core tension: the model’s quality is a non-monotonic function of context. More relevant context helps; more irrelevant context actively hurts through attention dilution. So the job is not to remember everything, it’s to assemble the right few thousand tokens at each step.
Key tensions I’d name to the interviewer:
- Full context vs selective context. Feeding everything loses to rot; feeding too little loses to missing facts. Sweet spot is selective and high-signal.
- Latency vs recall quality. Retrieval takes time. Cheap retrieval misses. Good retrieval (reranking, hybrid) adds 100–500ms per step.
- Summarization vs fidelity. A 10× compression loses detail. Stacked summaries lose it exponentially.
- Retrieval breadth vs distraction. Top-20 beats top-5 on recall but loses to it on precision once the model has to reason. Chroma’s context-rot study shows distractors degrade performance non-uniformly — even one semantically-close distractor hurts.
- Caching vs staleness. Cached prefixes save 70–90% cost and latency, but invalidate the moment you change instructions.
- Personalization vs privacy. Cross-session memory is valuable and a liability.
- Local context quality vs long-term memory quality. The policy that makes a good single-turn response differs from the one that makes a good 500-turn trajectory.
Failure modes I’d design against:
- Context bloat. The window fills; recall degrades silently; prefill latency balloons.
- Stale summaries. A day-old summary asserts facts that have since changed.
- Tool trace pollution. A 50KB JSON response from one tool call displaces useful history.
- Wrong retrievals. Semantically-similar-but-wrong memory fact gets injected, model confidently repeats it.
- Compaction drift. Summary of summary of summary. Errors compound.
- Recursive error accumulation. Agent reads its own earlier hallucination back as “fact.”
- Cache thrashing. Every turn invalidates the cache because we rewrote the prefix.
Staff-level signals in this section
The framing I’d want to land: context is a scarce, lossy resource with diminishing marginal returns; the job is allocation, not accumulation. That reframes the design from “database for memory” to “policy for context assembly,” which is the right abstraction. I’d also name recursive error accumulation explicitly — the agent reading its own bad output as ground truth later — because that’s the specific failure mode that kills trust in long-horizon systems and most candidates miss it.
4. High-level architecture
Here’s the end-to-end picture I’d draw on the whiteboard.
```
                 ┌──────────────────────────────────────┐
                 │           Agent Controller           │
                 │   (policy loop, step orchestration)  │
                 └──────────────────┬───────────────────┘
                                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                        CONTEXT ASSEMBLER                        │
│     Builds the prompt for this step under a token budget.       │
│   Pulls from all memory layers, applies priority + dedup.       │
└───┬───────────┬─────────────┬───────────────┬───────────┬───────┘
    ▼           ▼             ▼               ▼           ▼
 Session     Rolling      Retrieval      Tool State    Durable
 Working     Summary      Layer          & Trace       Memory Store
 Memory      Service      (vec + kw +    Store         (episodic +
 (Redis)                  structured)                  semantic)
                              │              │             │
                              ▼              ▼             ▼
               ┌────────────────────────────────────────┐
               │      Retrieval Index + Blob Store      │
               │   (vector DB, inverted index, object   │
               │       store for raw traces/docs)       │
               └────────────────────────────────────────┘

┌────────────┐     ┌─────────────────┐     ┌─────────────────────┐
│  Prompt /  │     │  Compaction &   │     │    Policy Engine    │
│  Prefix    │◀───▶│  Summarization  │◀───▶│ (what enters active │
│  Cache     │     │     Service     │     │  context, budgets)  │
└─────┬──────┘     └─────────────────┘     └─────────────────────┘
      ▼
┌────────────┐                            ┌──────────────────┐
│   Model    │─────── tool calls ────────▶│   Tool Gateway   │
│  Runtime   │◀────── tool results ───────│ (with approvals, │
└─────┬──────┘                            │   idempotency)   │
      ▼                                   └──────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│                Evaluation + Observability Layer                  │
│   (judge models, probes, provenance traces, metrics, alerts)     │
└──────────────────────────────────────────────────────────────────┘
```

Components, briefly:
- Agent Controller: the policy loop. Decides “next step”: plan, call tool, read, respond, compact, checkpoint. This is where Memory-as-Action-style decisions live if we want the agent itself to drive compaction (more on that later).
- Context Assembler: the crux of this system. Given the controller’s intent for the next step, it builds the prompt under a fixed token budget by pulling from all layers.
- Session Working Memory: the last N turns, raw, in Redis or equivalent. Hot path.
- Rolling Summary Service: produces and versions summaries at multiple granularities (turn-window, milestone, task).
- Retrieval Layer: hybrid (vector + BM25 + structured). Queries episodic and semantic memory, documents, and prior tool traces.
- Tool State & Trace Store: normalized state snapshots, raw traces, side-effect logs.
- Durable Memory Store: episodic (events) and semantic (facts) memory, versioned and provenance-tagged.
- Prompt / Prefix Cache: both the provider-side cache (Anthropic/OpenAI prompt caching) and an internal cache for summaries and retrieval results.
- Policy Engine: enforces budgets, admission rules, privacy, tenant isolation.
- Compaction Service: background worker that produces summaries, detects drift, and schedules regeneration from source.
- Eval + Obs Layer: judges, probes, metric collection, provenance graphs.
Staff-level signals in this section
Three things I’d make sure land:
- The Context Assembler is a distinct, testable component with its own policy — it’s not implicit in the prompt template. This is the single most important abstraction in the system.
- The tool trace store is separate from the active context by design — raw traces never flow directly into the window, they’re normalized first. This is the difference between a toy agent and a production one.
- The prompt cache and memory store serve different purposes — the cache is for latency and cost, memory is for reasoning continuity. Conflating them is a common mid-level mistake.
5. Memory model and context layers
I think of this as a pyramid with latency and capacity roughly inverted:
```
┌──────────────────────────┐
│ Immediate Turn Context   │   ~1K tokens, <1ms
│ (system + current user)  │
└──────────────────────────┘
┌────────────────────────────┐
│ Rolling Working Memory     │   ~20K tokens, ms
│ (last K turns, verbatim)   │
└────────────────────────────┘
┌──────────────────────────────┐
│ Compressed Summaries         │   ~5K tokens each, versioned, ~10ms
│ (rolling + milestone + task) │
└──────────────────────────────┘
┌────────────────────────────────┐
│ Episodic Memory (events)       │   MB-scale, ~50ms retrieve
│ — what happened, when          │
└────────────────────────────────┘
┌──────────────────────────────────┐
│ Semantic Memory (facts)          │   MB-scale, ~50ms retrieve
│ — declared / extracted facts     │
└──────────────────────────────────┘
┌────────────────────────────────────┐
│ Tool / Environment State           │   KB–MB scale, ~10ms
│ — normalized, current, versioned   │
└────────────────────────────────────┘
┌──────────────────────────────────────┐
│ Cached Prompt Prefixes               │   provider-side, ~0ms prefill
│ — sys prompt + tool schema + ...     │
└──────────────────────────────────────┘
┌────────────────────────────────────────┐
│ Durable Workflow State                 │   KB scale, ~10ms
│ — checkpoints, pending approvals       │
└────────────────────────────────────────┘
┌──────────────────────────────────────────┐
│ Cold Object Store (raw traces, docs)     │   TB–PB scale, ~200ms
│ — source of truth, rarely read directly  │
└──────────────────────────────────────────┘
```

What goes in each layer and why:
| Layer | Contents | Write cadence | Read pattern | Why here |
|---|---|---|---|---|
| Immediate turn | System prompt, current user msg, active task anchor | Every turn | Every step | Highest attention weight; must be concise |
| Rolling working | Last K=10–20 turns verbatim | Append per turn | Every step | Recent coherence; verbatim fidelity |
| Compressed summaries | Rolling summary (last N turns), milestone summaries (task checkpoints), task-state summary | Every K turns, or on milestone | Every step | Bridges verbatim → durable |
| Episodic memory | Event records: user said X, tool returned Y at time T | On event | Retrieved on demand | Time-aware recall |
| Semantic memory | Extracted/declared facts: “user is vegetarian” | On extraction | Retrieved on demand | Durable knowledge about user/world |
| Tool/env state | Normalized current state (e.g. calendar view, repo tree) | On tool call | Every step if task-relevant | Compact, current — not raw trace |
| Prompt prefix cache | System prompt + tool schema + stable prelude | Once per change | Every call | Latency/cost — not for reasoning |
| Durable workflow state | Plan, pending approvals, checkpoints, task DAG progress | On state change | On resume | Resumability, crash-safety |
| Cold object store | Raw traces, full documents, full chat logs | On event | Rarely | Source of truth for regeneration |
A key design rule: every layer above the cold store must be reconstructible from the cold store. If a summary goes stale or a memory fact is wrong, I can throw it away and regenerate. This is the invariant that saves the system long-term.
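The reconstructibility invariant can be made concrete as a data-model rule: every derived artifact records the cold-store span ids it was built from, so it can be discarded and rebuilt at any time. A minimal sketch, with summarize() as a stand-in for a real summarizer call and all names illustrative:

```python
# Sketch of the reconstructibility invariant: derived artifacts are
# caches over the cold store, never the source of truth themselves.

COLD_STORE = {}  # span_id -> raw text (source of truth)

def summarize(texts):
    # Placeholder: a real system makes a summarizer-model call here.
    return " | ".join(t[:20] for t in texts)

class DerivedArtifact:
    """A summary/fact/snapshot that can always be regenerated."""

    def __init__(self, source_span_ids):
        self.source_span_ids = list(source_span_ids)
        self.text = None
        self.regenerate()

    def regenerate(self):
        # Throw the cached form away and rebuild from source spans.
        self.text = summarize([COLD_STORE[s] for s in self.source_span_ids])
        return self.text
```

If a summary goes stale or a fact is suspect, the playbook is one call: regenerate() — no archaeology through stacked summaries.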
Staff-level signals in this section
The reconstructibility invariant is the thing I’d emphasize. Without it, errors become permanent — you summarize a summary and there’s no way back to ground truth. With it, every summary or memory is a cache over source-of-truth, and you have a clear playbook when things go wrong: regenerate. I’d also separate episodic from semantic explicitly, because they have different retrieval patterns (time-indexed vs similarity-indexed) and different decay policies.
6. Context assembly policy
This is where most of the design lives. Let me spell out exactly how the assembler builds the prompt for step t.
Budget: suppose the model’s useful-context window is 200K tokens, but based on context-rot research I’ll cap the effective working budget at 50K tokens — roughly 25% fill. Past that, recall degrades noticeably even before hard limits. I’ll describe how the 50K is allocated.
Fixed allocation (cached, prefix):
| Slot | Budget | Cached |
|---|---|---|
| System prompt | 2K | yes |
| Tool schemas | 3K | yes |
| Stable user profile / persona | 1K | yes |
| Cached prefix subtotal | 6K | |

Variable allocation (not cached, or cached with short TTL):
| Slot | Budget |
|---|---|
| Task goal + plan (current) | 1K |
| Rolling summary (most recent) | 2K |
| Milestone summaries (relevant) | 2K |
| Working memory (last K turns) | 15K |
| Retrieved episodic memory | 5K |
| Retrieved semantic facts | 2K |
| Retrieved documents | 8K |
| Active tool state snapshot | 3K |
| Reserved for response + CoT | 6K |
| Variable subtotal | 44K |

TOTAL: 50K (75% of useful capacity reserved as headroom). These numbers aren’t sacred — they come from per-workload tuning. The point is budgets are explicit and enforced, not emergent.
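“Explicit and enforced” is the operative phrase, so it is worth sketching what enforcement looks like. A hedged sketch: slot names mirror the budget tables, and a crude whitespace tokenizer stands in for the model’s real one:

```python
# Budget-first assembly sketch: each slot has a hard token ceiling and
# ranked content is cut to fit, rather than admitted until the window
# fills. (The remaining 6K of the 44K variable budget is reserved for
# the response, so it has no slot here.)

SLOT_BUDGETS = {
    "task_goal_plan": 1_000,
    "rolling_summary": 2_000,
    "milestone_summaries": 2_000,
    "working_memory": 15_000,
    "episodic": 5_000,
    "semantic_facts": 2_000,
    "documents": 8_000,
    "tool_state": 3_000,
}

def count_tokens(text):
    return len(text.split())  # crude stand-in for the model tokenizer

def fit_slot(items, budget):
    """items: candidate strings ranked best-first (e.g. by rerank score).
    Keep whole items in rank order until the ceiling is hit."""
    kept, used = [], 0
    for item in items:
        cost = count_tokens(item)
        if used + cost > budget:
            break  # cut by rank, never "whatever fits"
        kept.append(item)
        used += cost
    return kept
```

The ceiling is per slot, which is what bounds worst-case cost and latency: a 30K-token retrieval result cannot displace the rolling summary, because it never competes with it.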
Assembly flow:
```
Step begins: controller emits intent (plan / act / ...)
  │
  ▼
1. Seed cached prefix: sys + tool schema + stable persona
  │
  ▼
2. Inject always-include: task goal, plan, rolling summary
  │
  ▼
3. Generate retrieval queries (from current intent + recent turns)
  │
  ▼
4. Parallel retrieval: episodic | semantic | docs | tool traces
  │
  ▼
5. Rerank + dedup across sources: penalize near-duplicates and stale items
  │
  ▼
6. Budget allocation: truncate/drop lowest-priority until it fits
  │
  ▼
7. Inject working memory (last K turns) with age-weighted truncation
  │
  ▼
8. Inject current tool state snapshot
  │
  ▼
9. Emit prompt, preserving cache-friendly ordering (stable prefix unchanged)
```

Key design decisions and defenses:
- Budget-first, not content-first. Each slot has a token ceiling. If retrieval returns 30K of potentially-relevant stuff, I cut to 5K based on rerank scores, not based on “whatever fits.” This bounds worst-case cost and latency.
- Stable prefix first, variable content last. This is critical for prompt caching to work. Cache hit requires the prefix to be byte-identical; a single changed character earlier in the prompt invalidates everything after. So the order is: cached slots → semi-stable → volatile.
- Recency doesn’t automatically win. Recent turns go in verbatim up to a cap. Beyond that, I summarize. Recency bias is how bloat starts.
- Retrieval is not always invoked. On routine continuation turns (the last step was “continue the code edit”), I skip retrieval entirely. It’s ~150ms of latency and tends to pollute context. The controller decides.
- Priority-based eviction. When over budget, I drop in this order: retrieved docs → retrieved episodic → older milestone summaries → older working-memory turns (replaced by a deeper rolling summary). The rolling summary itself is never dropped.
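The eviction order above can be sketched directly; the slot names and the small helper are illustrative, but the invariant at the end is the real rule:

```python
# Sketch of priority-based eviction: when over budget, drop from the
# lowest-priority slots first; the rolling summary is never dropped.

EVICTION_ORDER = [
    "documents",       # retrieved docs go first
    "episodic",        # then retrieved episodic memory
    "milestones",      # then older milestone summaries
    "working_memory",  # then older turns (a deeper summary replaces them)
]

def evict_until_fits(slots, sizes, budget):
    """slots: dict slot -> list of items (oldest last in each list).
    sizes: dict item -> token cost. Mutates slots; returns tokens used."""
    def total():
        return sum(sizes[i] for items in slots.values() for i in items)
    for slot in EVICTION_ORDER:
        while total() > budget and slots.get(slot):
            slots[slot].pop()  # drop lowest-priority / oldest item
    assert slots.get("rolling_summary"), "rolling summary is never dropped"
    return total()
```

Encoding the order as data rather than branching logic also makes it trivially testable, which matters once the order itself becomes a tuning knob.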
Staff-level signals in this section
The subtle point I’d make here: my effective budget is 50K in a 200K window on purpose. The interviewer might push on this — “why not use the whole window?” — and my answer is concrete: (a) recall degrades past ~50% fill even on current frontier models, so more tokens trade quality for nothing; (b) prefill latency scales with context length — 200K prefill is multiple seconds even with batching; (c) prompt caching recovers most of the latency only if the prefix is stable, so keeping the variable portion small is an explicit lever. A mid-level answer fills the window; a staff answer sets a budget below the limit and defends it with numbers.
Second signal: retrieval is a budgeted, gated action, not an always-on. That’s the difference between an agent that feels sharp and one that feels noisy.
7. Compaction and summarization
Compaction is where long-horizon quality is won or lost.
What gets compacted:
- Rolling conversation — every K turns (K ≈ 10), produce a summary of that window. Older raw turns can then be evicted from the active prompt while the summary carries forward.
- Milestones — when the agent completes a subgoal (“found 5 candidate vendors”, “deployed v1”), a milestone summary is produced. Milestones are pinned — they persist across compaction cycles.
- Task-state summary — a continuously-maintained summary of “what the task is, where we are, what’s decided, what’s open.” Regenerated, not appended.
- Tool traces — raw tool outputs are summarized on ingestion (see §10). Raw trace goes to cold store.
- Document chunks — when a document is brought into context, we summarize it and keep the summary + pointer. Full chunks retrieved on demand.
When compaction fires:
- Rolling: every K turns, in background.
- Milestone: on controller signal (plan advanced a node in the task DAG).
- Emergency: when the context assembler would exceed budget, a synchronous compaction pass runs on the oldest uncommitted working memory.
- Regeneration: scheduled. Every task-state summary has a max age (e.g., 30 min or 50 turns, whichever comes first) after which it’s regenerated from source.
Compaction pipeline:
```
Source inputs: raw turns, tool traces, docs, prior milestone
summaries (but NOT prior rolling summaries)
  │
  ▼
Summarizer model call (small, fast, tuned), prompted with:
  "preserve: decisions, open questions, user preferences,
   failed attempts, tool side-effects; drop: redundant
   confirmations, verbose tool output, closed side threads"
  │
  ▼
Provenance-tag every claim: each sentence in the summary maps
to source span(s) in the cold store. Stored as a side-table.
  │
  ▼
Write summary version (monotonic id), index it. Store
side-by-side with predecessor for audit/rollback.
```

Anti-drift rules I’d bake in:
- Never summarize a summary of a summary. I always regenerate from source (raw turns + tool traces) plus at most one prior summary for continuity. Two levels of stacking is the limit. This is the single most important rule for avoiding compaction drift.
- Preserve “hard” information verbatim. Numbers, names, IDs, code diffs, commitments to the user — these are extracted into a structured “facts” sidebar alongside the summary and carried forward uncompressed.
- Confidence + provenance. Each summary claim has a provenance pointer and a confidence (from the summarizer’s own rationale, or from a secondary judge). Low-confidence claims get flagged for regeneration from source.
- Cold-start the rolling summary on drift detection. If a judge probe (see §11, §15) flags a regression, I throw the rolling summary away and rebuild from the last K*3 raw turns.
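The “two levels of stacking” rule is mechanical enough to enforce in code rather than in a prompt. A sketch, assuming each compaction input is tagged as raw or as a summary with a known depth; the record shape is illustrative:

```python
# Sketch of the stacking limit: a compaction pass refuses inputs that
# would produce a summary more than two levels away from source.

MAX_DEPTH = 2  # raw -> summary -> summary-with-continuity, never deeper

def summary_depth(inputs):
    """inputs: list of ('raw', 0) or ('summary', depth) records.
    Depth of the summary this pass would produce."""
    return 1 + max((d for kind, d in inputs if kind == "summary"), default=0)

def can_compact(inputs):
    prior = [i for i in inputs if i[0] == "summary"]
    # At most one prior summary for continuity, and never an input
    # that is itself already at the depth limit.
    return len(prior) <= 1 and summary_depth(inputs) <= MAX_DEPTH
```

When can_compact is false, the correct move is not to force the pass but to regenerate from cold-store source spans, which the reconstructibility invariant makes possible.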
Tool trace summarization specifically:
Tool traces are the worst offender for context bloat. A single search call can return 20KB of JSON. Rule: raw traces never enter the prompt. The flow is:
```
tool call ──▶ raw trace ──▶ cold store (full fidelity)
                  │
                  ├──▶ normalizer ──▶ compact state snapshot
                  │      (e.g., "top 5 results: A, B, C, D, E")
                  │
                  └──▶ summarizer ──▶ summary entry in episodic memory
                         ("search for X returned 57 results, top 5: ...")
```

Only the compact state snapshot goes into the active context. Everything else is retrievable on demand.
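A sketch of that fanout, with illustrative names; the point is that the raw payload is written once to the cold store, and only the compact snapshot (carrying a provenance pointer back to it) is eligible for the active context:

```python
# Sketch of tool-trace ingestion: full-fidelity raw trace to cold
# store, normalized snapshot for the prompt. Field names illustrative.

import hashlib
import json

COLD_STORE = {}

def ingest_tool_result(tool, raw_payload, top_n=5):
    raw = json.dumps(raw_payload, sort_keys=True)
    span_id = hashlib.sha256(raw.encode()).hexdigest()[:12]
    COLD_STORE[span_id] = raw  # full fidelity, retrievable on demand

    results = raw_payload.get("results", [])
    snapshot = {
        "tool": tool,
        "total": len(results),
        "top": [r["title"] for r in results[:top_n]],
        "cold_store_span": span_id,  # provenance pointer
    }
    return snapshot  # only this enters the active context
```

A 20KB search response collapses to a few dozen tokens in the window, and the span id means the agent can still pull the full trace later if the task needs it.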
Staff-level signals in this section
Three things to highlight:
- “Never summarize a summary of a summary.” This is the specific technical rule that prevents the slow degradation that most long-horizon agents hit around turn 200. Most candidates don’t name it.
- Structured facts sidebar + narrative summary, not one or the other. The narrative summary is lossy; the structured facts are not. Carrying both is redundant in the good way.
- Compaction is triggered by quality signals, not just token pressure. If a judge probe says “summary is missing something important,” I regenerate. Token count is necessary but not sufficient.
8. Retrieval design
The purpose of retrieval in this system is to answer the question: “what does the agent need to know right now that isn’t already in the prompt?”
Pipeline:
```
Query generation (from intent + recent turns + plan state;
can produce multiple queries in parallel)
  │
  ▼
Fanout across sources:
  ┌─────────────┐    ┌─────────────┐
  │   Vector    │    │   BM25 /    │
  │   (dense)   │    │   keyword   │
  └──────┬──────┘    └──────┬──────┘
         │                  │
  ┌──────┴──────────────────┴──────┐
  │ Structured filters:            │
  │ tenant_id, user_id, session,   │
  │ time_range, memory_type,       │
  │ source, confidence ≥ τ         │
  └───────────────┬────────────────┘
                  ▼
Merge + rerank (cross-encoder):
  learned reranker scores each candidate; freshness bonus for
  recent items; penalty for near-duplicates (MMR)
  │
  ▼
Budget-fit: take top-K until budget spent, with mandatory
diversity (not all from the same source or the same time slice)
  │
  ▼
Return with citations (each chunk tagged with id for provenance)
```

Hybrid, not vector-only. Pure vector retrieval misses exact-match cases (IDs, file names, code symbols). Pure keyword misses semantic matches. Hybrid with learned fusion or reciprocal rank fusion gives materially better recall in practice — this is a well-established pattern and, more importantly, recent memory-system work has shown retrieval-stage optimizations dominate over ingestion changes.
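The merge step can use reciprocal rank fusion (RRF), a standard way to combine rankings from sources whose scores are not calibrated against each other. A minimal sketch; k=60 is the conventional constant from the RRF literature:

```python
# Reciprocal rank fusion: each source contributes 1/(k + rank) per
# document; summed scores give the fused order. No score calibration
# across vector and keyword sources is needed.

def rrf_fuse(rankings, k=60):
    """rankings: list of ranked id lists (best first).
    Returns ids sorted by summed reciprocal rank."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear in both rankings float to the top even when neither source ranked them first, which is exactly the behavior you want from a hybrid merge.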
Structured filters always applied. Tenant isolation is a hard filter, enforced at the index layer, not post-hoc. No query is allowed to run without tenant_id. This is a safety invariant.
Multi-source merge with source-aware priors. I don’t treat “memory fact” and “document chunk” as interchangeable. A semantic memory fact (“user is allergic to peanuts”) should outrank a random doc chunk on relevance ties, because confidence and specificity matter.
Freshness weighting, not pure recency. For most queries, recent > old, but “what’s my birthday” should return the durable fact, not the most recent mention. Different memory types get different freshness priors.
When retrieval fires:
- Always: at step start, if the step is “plan” or “act with user-facing output.”
- Sometimes: mid-step, if the model emits a retrieval tool call (just-in-time retrieval, matching Claude Code’s pattern of grep/glob-on-demand).
- Never: on pure “continue the current in-progress action” steps.
Dedup and contradiction handling:
- Dedup by hash of normalized content plus semantic similarity threshold.
- If two memory items contradict (say, “user prefers dark mode” v1 vs “user prefers light mode” v2), the later-dated, higher-confidence one wins in retrieval but both are visible to the model with conflict flagged. The model can ask the user to clarify if the contradiction is material.
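A sketch of that conflict rule: the later-dated version wins the primary slot, confidence breaks ties, and both versions carry a conflict flag so the model can surface the inconsistency. The field names are illustrative:

```python
# Sketch of contradiction handling for memory facts: resolve for
# ranking purposes, but never hide the conflict from the model.

def resolve_conflict(facts):
    """facts: list of dicts with 'value', 'date' (ISO string),
    'confidence'. Returns (winner, flagged_copies)."""
    # ISO dates compare correctly as strings; confidence breaks ties.
    winner = max(facts, key=lambda f: (f["date"], f["confidence"]))
    flagged = [dict(f, conflict=len(facts) > 1) for f in facts]
    return winner, flagged
```

Returning the flagged copies rather than just the winner is the design choice that lets the agent say “I have conflicting notes on this — which is right?” instead of silently picking one.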
Staff-level signals in this section
Key points to emphasize:
- Retrieval is a policy decision per step, not always-on. Running retrieval every turn is expensive and often harmful (distractor injection).
- Tenant isolation is an index-level invariant, not a post-filter. Post-filters leak under bugs.
- I retrieve with explicit source awareness — a fact is not a doc chunk is not a tool trace. Each has different retrieval semantics.
- Contradictions are surfaced, not silently resolved. The system should make the agent aware that its memory is inconsistent, so it can ask the user.
9. Caching and reuse
This is where real money gets saved. I’ll separate caching into two conceptually distinct things, because conflating them is a common mistake:
- Cache for latency/cost. Prompt prefix caching. Purpose: don’t re-prefill the same tokens.
- Memory for reasoning continuity. Summaries, facts. Purpose: carry meaning across calls when the verbatim text can’t.
Different systems, different invalidation rules.
Prompt prefix caching — how to structure for hits:
The rule: stable content first, variable content last. If I change anything in the prefix, everything after it invalidates.
```
┌─────────────────────────────────────────────────────────┐
│ [CACHE BREAKPOINT]                                      │
│ System prompt (stable)                                  │
│ Tool schemas (stable across a deploy)                   │
│ Stable user profile / persona                           │
│ [CACHE BREAKPOINT]                                      │
│ Rolling summary (semi-stable, updates every K turns)    │
│ Task goal + plan                                        │
│ [CACHE BREAKPOINT]                                      │
│ Retrieved items (volatile; no cache)                    │
│ Working memory turns (semi-stable; cached between       │
│   turns until next turn arrives)                        │
│ Active tool state snapshot (volatile)                   │
│ Current user message (volatile)                         │
└─────────────────────────────────────────────────────────┘
```

Cache economics — concrete numbers:
Using Claude Sonnet-class pricing at roughly $3/MTok input, 5-minute cache writes at ~$3.75/MTok (1.25×), and cache reads at ~$0.30/MTok (0.1×). A 50K-token cached prefix over a 10-turn conversation:
- No cache: 50K × 10 turns × $3/MTok = $1.50 of prefix cost alone.
- 5-min cache: first call writes at $0.1875, next 9 calls read at 9 × 50K × $0.30/MTok = $0.135. Total $0.32. Savings ≈ 78%.
- 1-hour cache (2× write): first call $0.30, next 9 reads $0.135. Total $0.435. Still ~71% savings, and useful if turns are spaced out (hitting 5-min TTL would cost a re-write).
Break-even on the write overhead: cache-write costs 0.25× extra. Cache-read saves 0.9× per hit. Break-even after ~0.28 hits. So anything read more than once is profitable. (This matches what providers publish and independent studies confirm for long-horizon agentic workloads.)
Latency math is similar in shape. Anthropic reports up to ~85% latency reduction on long prompts — my rough internal mental model: prefill for 50K tokens that would take ~2–3s without cache drops to ~300–500ms with cache hit.
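To make the arithmetic concrete, here's a small sketch of the prefix-economics numbers above. The prices are the illustrative Sonnet-class figures from this section, not an official rate card, and the function shape is mine:

```python
# Illustrative prefix-cache economics (assumed prices, per this section).
INPUT_PER_MTOK = 3.00          # $/MTok, uncached input
CACHE_WRITE_MULT_5MIN = 1.25   # write surcharge, 5-minute TTL
CACHE_WRITE_MULT_1H = 2.00     # write surcharge, 1-hour TTL
CACHE_READ_MULT = 0.10         # read discount


def prefix_cost(prefix_tokens, turns, write_mult=None):
    """Dollar cost of re-sending a fixed prefix across `turns` calls.

    write_mult=None models no caching: every turn pays full input price.
    Otherwise turn 1 pays the write surcharge and the rest read from cache.
    """
    mtok = prefix_tokens / 1_000_000
    if write_mult is None:
        return turns * mtok * INPUT_PER_MTOK
    write = mtok * INPUT_PER_MTOK * write_mult
    reads = (turns - 1) * mtok * INPUT_PER_MTOK * CACHE_READ_MULT
    return write + reads


no_cache = prefix_cost(50_000, 10)                           # $1.50
cached_5m = prefix_cost(50_000, 10, CACHE_WRITE_MULT_5MIN)   # $0.3225
cached_1h = prefix_cost(50_000, 10, CACHE_WRITE_MULT_1H)     # $0.435
print(f"savings at 5-min TTL: {1 - cached_5m / no_cache:.1%}")
```

The 5-min vs 1-hour decision drops out of the same function: just vary `write_mult` against the expected gap between turns.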
What else to cache:
- Summaries. Summaries are expensive to produce (one LLM call each). Cache them by `(source_span_hash, summarizer_version)`. Invalidate when the source spans change.
- Retrieval results. Cache by `(query_hash, tenant_id, filters, index_version)`. Short TTL (~60s) for interactive; can be longer for batch.
- Tool schemas. Cacheable across the whole tenant until the tool catalog changes.
- Intermediate plans. If the agent produces a plan and we resume later, cache it along with its inputs.
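As a sketch of the summary-cache keying above (class and method names are hypothetical), note how putting the summarizer version in the key makes invalidation automatic — a version bump simply never hits stale entries:

```python
import hashlib


class SummaryCache:
    """Toy summary cache keyed by (source_span_hash, summarizer_version).

    A summarizer version bump changes the key, so stale summaries are
    never hit again and can be garbage-collected lazily. Sketch only;
    the real summarize step would be an LLM call.
    """

    def __init__(self, summarize_fn, version):
        self._summarize = summarize_fn
        self.version = version
        self._store = {}     # (span_hash, version) -> summary text
        self.misses = 0

    @staticmethod
    def span_hash(source_text):
        return hashlib.sha256(source_text.encode()).hexdigest()[:16]

    def get(self, source_text):
        key = (self.span_hash(source_text), self.version)
        if key not in self._store:
            self.misses += 1                       # recompute on miss
            self._store[key] = self._summarize(source_text)
        return self._store[key]


cache = SummaryCache(lambda text: text[:20] + "...", version="v1")
cache.get("long transcript span A")   # miss: computed
cache.get("long transcript span A")   # hit: cached
cache.version = "v2"                  # summarizer upgraded
cache.get("long transcript span A")   # miss again: regenerated under v2
```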
What NOT to cache:
- Raw tool outputs beyond the current step. These are better written to the trace store once.
- Anything derived from freshness-sensitive data. Weather, stock prices, calendar state — either don’t cache or cache with very short TTL and freshness checks.
Invalidation:
- Prefix cache invalidates automatically when the prefix changes. The design job is to keep the prefix from changing unnecessarily.
- Summary cache invalidates on source change or summarizer version bump.
- Retrieval cache invalidates on index update or TTL.
- Fact memory has no cache — it is the source of truth for its layer.
Per-session vs cross-session caches:
- Prefix cache is naturally per-session-prefix-hash but shared across identical prefixes (different users with the same system prompt share cache).
- Summary and retrieval caches are keyed by tenant + content, so they can share across sessions for the same user or task.
Staff-level signals in this section
The key distinctions I’d make:
- Cache for latency/cost ≠ memory for reasoning. Different invalidation, different failure modes.
- The prompt structure is a cache contract. Rearranging slots to be cache-friendly is a specific, deliberate design choice with measurable ROI.
- Cache-write cost is a real budget item. In workloads with low reuse, caching can make things worse. The 5-min vs 1-hour TTL decision is an economic one that depends on actual reuse patterns.
- I’d instrument cache hit rate as a first-class metric. Without it, caching regressions are invisible.
10. Tool traces, world state, and workflow state
This is the layer most candidates underweight, and it’s where long-horizon agents actually break.
Three distinct representations:
- Raw trace — full request/response for every tool call. Written to object store. Never enters active context. Used for audit, replay, debugging, regeneration.
- Normalized state — a structured view of “what is true about the world now.” For a calendar tool: `current_view = {events: [...], last_refreshed: T}`. Updated after tool calls.
- Compact snapshot — the projection of normalized state that’s small enough to inject into context. Only the fields the current step plausibly needs.
```text
┌──────────────┐
│ Tool call    │
└──────┬───────┘
       │
       ▼
┌──────────────┐      ┌──────────────────┐
│ Raw trace    │─────▶│ Object store     │  <- source of truth
│ (full JSON)  │      │ (cold, cheap)    │
└──────┬───────┘      └──────────────────┘
       │
       ▼
┌──────────────┐      ┌──────────────────┐
│ Normalizer   │─────▶│ Normalized       │  <- durable state
│              │      │ state store      │     (versioned)
└──────┬───────┘      └──────────────────┘
       │
       ▼
┌──────────────┐      ┌──────────────────┐
│ Summarizer   │─────▶│ Episodic memory  │  <- event record
│ (lossy)      │      │ entry            │     (indexed)
└──────┬───────┘      └──────────────────┘
       │
       ▼
┌──────────────┐
│ Compact      │─────▶  Active context
│ snapshot     │
└──────────────┘
```

Why raw traces don’t stay in context:
A single tool call can produce tens of KB of JSON, most of which is irrelevant. Ten calls and the window is polluted. Worse, the noise is high-entropy and model attention gets drawn to it over signal. Keeping raw traces out of context is the single biggest operational win over naive designs. This is exactly the failure mode Anthropic documents — “context window fills with hundreds of thousands of tokens of file content, most of it already processed and noted.”
Side effects and idempotency:
- Every write-capable tool call gets a client-generated `idempotency_key` derived from `(session_id, step_id, action_id)`. If the agent retries, the tool layer deduplicates.
- Side effects are logged with their idempotency key and result, so on replay the tool gateway can return the prior result rather than re-executing.
- For non-idempotent external APIs (e.g., sending an email), the gateway enforces an at-most-once contract with a write-ahead log: we record the intent, then the result, atomically.
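A minimal sketch of that gateway contract (all names hypothetical; a real system would make the WAL writes durable and atomic in Postgres, per §13):

```python
class ToolGateway:
    """At-most-once tool gateway sketch.

    Before executing a side-effecting call we write an intent record
    keyed by the idempotency key; after execution we record the result.
    A retry with the same key returns the logged result instead of
    re-executing the side effect.
    """

    def __init__(self):
        self.wal = {}        # idempotency_key -> {"status", "result"}
        self.executions = 0  # counts real side effects, for illustration

    @staticmethod
    def idempotency_key(session_id, step_id, action_id):
        return f"{session_id}:{step_id}:{action_id}"

    def call(self, key, execute_fn):
        record = self.wal.get(key)
        if record is not None and record["status"] == "done":
            return record["result"]       # dedup: replay the prior result
        self.wal[key] = {"status": "intent", "result": None}  # write-ahead
        result = execute_fn()             # the actual side effect
        self.executions += 1
        self.wal[key] = {"status": "done", "result": result}
        return result


gw = ToolGateway()
k = ToolGateway.idempotency_key("sess-1", "step-7", "send-email")
r1 = gw.call(k, lambda: "email-id-123")
r2 = gw.call(k, lambda: "email-id-456")   # retried step: deduplicated
```

A crash between "intent" and "done" leaves a record the gateway can inspect on resume, which is exactly the mid-tool-call recovery path described under resumability.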
Human-in-the-loop and approvals:
- On approval points, the controller writes a checkpoint: full normalized state + active plan + pending action. The run is then suspended.
- The checkpoint is durable (relational DB + object store references). Resumption is by checkpoint id.
- User approval comes back through an out-of-band channel (UI, email). The workflow coordinator resumes from checkpoint with the approval decision injected.
Resumability:
- All state needed to resume is in the durable workflow store: plan DAG + node states, pending tool calls with idempotency keys, last normalized state, last rolling summary version.
- On resume, we don’t replay tool calls; we re-read normalized state and continue. This avoids re-executing side effects.
- If a crash happens mid-tool-call: the idempotency key + WAL lets us detect that the call already happened, retrieve the result, and continue.
Concurrent multi-agent:
- When multiple agents touch shared workflow state (e.g., planner + executor + critic), each holds a lease on the specific nodes it’s acting on. Conflicting writes fail-fast and surface to the controller.
- Read access is non-blocking with version vectors; writes use optimistic concurrency.
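The optimistic-concurrency rule can be sketched as follows (toy in-memory store, hypothetical names; a production version would sit on the workflow DB):

```python
class VersionedStore:
    """Toy optimistic-concurrency store for shared plan nodes.

    Reads return (value, version); writes must present the version they
    read. A conflicting write fails fast so the controller can decide
    what to do, rather than silently clobbering another agent's update.
    """

    def __init__(self):
        self._data = {}   # key -> (value, version)

    def read(self, key):
        return self._data.get(key, (None, 0))

    def write(self, key, value, expected_version):
        _, current = self._data.get(key, (None, 0))
        if current != expected_version:
            return False                       # conflict: fail fast
        self._data[key] = (value, current + 1)
        return True


store = VersionedStore()
_, v = store.read("plan/node-3")
ok1 = store.write("plan/node-3", "executor: running", v)          # succeeds
ok2 = store.write("plan/node-3", "critic: needs revision", v)     # stale: rejected
```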
Staff-level signals in this section
Four things I’d emphasize:
- Three representations, not one. The most common bug in naive designs is treating “tool output” as a single thing and trying to fit it into context.
- Idempotency keys and WAL are non-negotiable for write-capable tools. If you crash between “email sent” and “trace written,” you’d better know which happened.
- Resumability is from normalized state, not from replay. Replaying tool calls is correct in theory and catastrophic in practice.
- Leases and version vectors for multi-agent. Otherwise two agents can overwrite each other’s plan updates silently.
11. Long-running quality preservation
This is the section where I’d spend the most time in a real interview, because it’s the heart of the question.
The problem: over a long run, the agent’s effective beliefs drift from ground truth. Summaries lose detail. Facts get stale. Hallucinations get reified. And the agent is the last to notice.
My principles:
- Provenance on every claim. Every item that can enter context carries a pointer to its source. If the agent outputs a claim, we can trace it backward.
- Confidence scores, surfaced to the agent. A memory fact has a confidence; the agent can see it and reason about uncertainty.
- Regeneration from source is the recovery mechanism. Summaries are caches; if they’re wrong, throw them away and regenerate from the raw traces and turns.
- Active contradiction detection. Run a cheap background check after every K turns: does the current rolling summary conflict with any semantic memory fact? If so, flag.
- Periodic re-planning and self-critique. At milestones, the agent re-reads the task-state summary and the original task description, and checks: am I still on track? This catches drift.
- Checkpoints as rollback points. If a judge eval detects quality has crashed, we can roll back to the last-known-good checkpoint and re-run from there with the problematic summary evicted.
- Eval-driven corrections. Judge probes run continuously on live traces (sampled). Regressions trigger alerts and can auto-evict suspect memories.
“What if the agent’s memory is wrong?”
Concretely: the agent confidently remembers “we decided on Option B,” but we actually decided on Option A. How do I detect?
- Provenance check on critical claims. When the agent makes a decision-referencing statement, the system can verify the claim against the sourced span in episodic memory. If the span doesn’t support the claim, flag.
- Contradiction probe. A lightweight judge model periodically asks: “given the raw transcript, does the rolling summary accurately capture X, Y, Z?” If the judge flags a miss, regenerate.
- User-facing confirmation on high-stakes actions. Before a write-capable tool fires on the basis of a remembered preference, confirm: “You’d said you wanted Option B — is that still right?”
“What if a summary omitted a crucial detail?”
- Structured facts sidebar (mentioned in §7) carries the hard-to-summarize stuff — IDs, numbers, commitments — verbatim alongside the narrative summary. So the summary can drop the detail and the sidebar still has it.
- Regeneration on demand. If the agent tries to retrieve a detail and can’t find it in the current compressed state, it can issue a retrieval query against the cold store (raw traces) to get it back. This is the “just-in-time” pattern — don’t pre-process, fetch on demand.
- Coverage probes. A sampled set of synthetic questions is answered against the compressed state vs against the raw transcript; coverage gap is a metric.
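The sidebar-then-cold-store fallback can be sketched like this (names and the dict-backed "cold store" are stand-ins for the real trace query):

```python
class DetailResolver:
    """Fallback path for details a summary dropped. Sketch only.

    Hard-to-summarize items (IDs, numbers, commitments) live verbatim in
    a structured sidebar; if a lookup misses there, we query the cold
    store (raw traces) on demand instead of assuming the detail is gone.
    """

    def __init__(self, sidebar, cold_store_search):
        self.sidebar = sidebar
        self.cold_store_search = cold_store_search
        self.cold_fetches = 0   # instrumented: JIT fetches are a metric

    def resolve(self, key):
        if key in self.sidebar:
            return self.sidebar[key]        # cheap, verbatim
        self.cold_fetches += 1
        return self.cold_store_search(key)  # expensive, from raw traces


raw_traces = {"order_id": "ORD-8841", "deadline": "2025-03-01"}
resolver = DetailResolver(
    sidebar={"order_id": "ORD-8841"},       # the summary kept this one
    cold_store_search=raw_traces.get,       # stand-in for a trace query
)
print(resolver.resolve("order_id"))   # served from sidebar
print(resolver.resolve("deadline"))   # recovered from cold store
```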
Periodic “fresh eyes” pass:
Every N steps, I’d run a dedicated “fresh context” call: same task, new model session, just the source-of-truth inputs (task, raw transcript, current state snapshot). Compare its proposed next step to the in-session agent’s proposed next step. If they diverge materially, that’s a signal the in-session agent has drifted. This is expensive, so it runs at milestone boundaries, not every turn.
Staff-level signals in this section
The “what if memory is wrong?” question is where candidates either shine or flatten. My answer has three specific mechanisms (provenance check, contradiction probe, user confirmation on high-stakes) plus a recovery path (regenerate from source). I’d also stress the fresh-eyes pass — most candidates don’t propose anything that actively checks for drift; they just hope it doesn’t happen.
The concept I’d reach for explicitly: every memory artifact is a cache, the raw transcript is the source of truth, and the recovery playbook is always “regenerate.” If the interviewer asks “but your summary might drop a detail,” the answer is “yes, and when the agent asks for it, we fetch it from source.”
12. Admission control, budgets, and limits
Budgets everywhere, enforced at assembly time. Without hard limits, context monotonically grows and bad things happen silently.
| Budget | Default | Enforced where | Degrade behavior |
|---|---|---|---|
| Active context tokens | 50K (of 200K window) | Context assembler | Drop lowest-priority slots |
| Retrieval returns per query | Top 20 pre-rerank, top 5 post-rerank | Retrieval layer | Cap hard, log the drop |
| Summarization rate | 1 compaction/10 turns max | Compaction service | Queue if exceeded, never skip |
| Per-run memory growth | 100 new semantic facts / run max | Memory writer | Reject with “user confirmation required” |
| Per-tenant QPS | Configurable | Gateway | 429 with retry-after |
| Per-run $ budget | Tenant-configurable | Controller | Pause + notify owner |
| Max plan depth | 20 | Controller | Force re-plan with constraint |
| Max tool calls per step | 5 | Controller | Force reasoning step |
| Total run duration | 72h | Workflow coordinator | Checkpoint + pause |
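The "drop lowest-priority slots" degrade behavior can be sketched as a small assembly-time function (slot names and budgets are illustrative, not the real config):

```python
def assemble_context(slots, budget_tokens):
    """Budget enforcement at assembly time: shed lowest-priority slots.

    `slots` is a list of (name, priority, tokens); lower priority number
    means more important. Returns (kept, dropped) so the drops can be
    logged in the context_assembly_plan record rather than vanishing.
    """
    kept = sorted(slots, key=lambda s: s[1])   # most important first
    dropped = []
    total = sum(s[2] for s in kept)
    while total > budget_tokens and kept:
        victim = kept.pop()                    # least important remaining
        dropped.append(victim)
        total -= victim[2]
    return kept, dropped


slots = [
    ("system_prompt",   0,  2_000),
    ("task_state",      1,  4_000),
    ("working_memory",  2, 30_000),
    ("retrieved_items", 3, 12_000),
    ("tool_snapshot",   4,  8_000),
]
kept, dropped = assemble_context(slots, budget_tokens=50_000)
print([s[0] for s in dropped])   # -> ['tool_snapshot']
```

Shedding whole slots (rather than truncating every slot a little) keeps each surviving slot coherent, and the `dropped` list is exactly what the degradation alert consumes.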
Why these specific limits matter:
- Memory growth per run. Without this, an agent in a pathological loop can write thousands of “facts” per hour, each polluting future retrievals. Capping memory writes per run and requiring confirmation for large batches prevents runaway pollution.
- Plan depth. Infinite plan expansion (subgoal → subgoal → …) is a classic long-horizon failure. A depth cap forces replanning at the top level.
- Retrieval returns. Chroma’s research on distractor sensitivity is direct evidence that top-5 with a good reranker beats top-20 on most tasks — more is not better.
Graceful degradation when budgets are hit:
- Budget-exceeded doesn’t kill the run; it triggers shedding. The system announces it’s operating with reduced context, and logs which slots were dropped so we can debug.
- On repeated degradation, an alert fires to eng. Sustained shedding means the workload needs rebalancing.
Staff-level signals in this section
The point is that budgets are the control surface for the whole system. Without them, the only thing stopping runaway behavior is the model’s context window, and by the time you hit that, you’re already in context-rot territory. Naming a specific list of budgets with specific numbers and degrade behaviors is the staff-level move.
I’d also note: budgets are per-tenant configurable, not global. Different customers have different cost/quality tradeoffs.
13. Distributed systems concerns
Where state lives and how failures propagate.
State classification:
- Ephemeral (in-memory, session-scoped): current prompt being assembled, in-flight retrieval results. If lost, recomputed from durable state.
- Short-lived durable (Redis with TTL, or in-memory cache): working memory (last K turns as a raw list), compact snapshots, mid-session plan state. Persisted to durable store on checkpoint boundaries.
- Durable (relational + object): workflow state, summaries, facts, raw traces, idempotency records.
```text
┌─────────────────────────────────────────────────────────────┐
│ Context Assembler (stateless)                               │
│ One instance per request, can scale horizontally.           │
└──────────────┬──────────────────────────────────────────────┘
               │
     ┌─────────┴─────────┐
     │                   │
     ▼                   ▼
┌─────────┐     ┌──────────────────┐
│ Redis   │     │ Session          │
│ (hot    │     │ Coordinator      │  (stateful, sharded
│ tiers)  │     │ (leases, locks)  │   by session_id)
└────┬────┘     └────────┬─────────┘
     │                   │
     ▼                   ▼
┌────────────────────────────────────────────────────────────┐
│ Durable backing:                                           │
│   Postgres (workflow state, metadata)                      │
│   Vector index (memory, docs)                              │
│   Object store (raw traces, full docs, checkpoints)        │
│   Event log (Kafka / equivalent) for ingest pipelines      │
└────────────────────────────────────────────────────────────┘
```

Stateless context assembler, stateful session coordinator. The assembler can run anywhere; it reads state, produces a prompt, returns. The coordinator owns the session lease — it decides what the next step is and who may write to session state. One session = one coordinator at a time (via lease).
Multi-region:
- Within a region: single primary for workflow state, read replicas for memory and vector indices.
- Cross-region: memory and workflow state are eventually consistent, with tenant affinity to home region. Cross-region failover is a break-glass that costs some stale reads but keeps sessions alive.
Concurrent access:
- Session-scoped state: coordinator lease prevents concurrent writers.
- Shared memory (cross-session semantic facts): optimistic concurrency with version vectors, LWW with conflict flags for human review.
- Vector index: eventually consistent; retrievals are best-effort.
Consistency of workflow state:
- Workflow state is the strongest-consistency piece. It lives in Postgres with serializable transactions for checkpoint writes.
- Tool call records and idempotency keys are in the same Postgres so we can atomically write `(intent, result)` pairs.
- Everything else is eventually consistent and regenerable.
Retries and replay:
- Agent step is idempotent-by-step-id. If a step crashes, the coordinator retries with the same step id.
- Tool calls use idempotency keys. Retry returns the prior result if any.
- Summary generation is idempotent per source span + summarizer version.
Versioning:
- Every summary has a version id. Readers pin to a version; writers produce new versions rather than overwriting.
- Memory facts have version chains; “current” is the latest non-retracted version.
- Summarizer itself has a version. When the summarizer version bumps, existing summaries aren’t retroactively regenerated — they’re lazily regenerated when read.
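The lazy-regeneration rule can be sketched as a read-path check (record shape and names are hypothetical, modeled on the `summary_version` schema in §14):

```python
CURRENT_SUMMARIZER_VERSION = "v3"


def read_summary(record, regenerate_fn):
    """Lazy regeneration on read.

    Summaries produced by an older summarizer are not rewritten in bulk
    on deploy; the first read under the new version regenerates from the
    source spans and produces a *new* version record (chained via
    predecessor_id) rather than overwriting the old one.
    """
    if record["summarizer_version"] == CURRENT_SUMMARIZER_VERSION:
        return record
    return {
        "id": record["id"] + "'",            # stand-in for a new UUID
        "predecessor_id": record["id"],      # version chain, no overwrite
        "source_spans": record["source_spans"],
        "summarizer_version": CURRENT_SUMMARIZER_VERSION,
        "content": regenerate_fn(record["source_spans"]),
    }


old = {"id": "s-1", "summarizer_version": "v2",
       "content": "stale summary", "source_spans": ["span-a", "span-b"]}
fresh = read_summary(old, lambda spans: f"summary of {len(spans)} spans")
print(fresh["content"])   # regenerated under v3
```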
Failure modes and responses:
| Failure | Response |
|---|---|
| Model call fails | Retry with backoff; if persistent, checkpoint + alert |
| Retrieval layer down | Serve request without retrieval; log degraded mode; alert |
| Vector index lag | Fall back to keyword-only; alert if lag > threshold |
| Workflow DB down | Hard stop; in-flight sessions fail cleanly with checkpoint |
| Coordinator crash | Lease expires; another coordinator picks up session from last checkpoint |
Staff-level signals in this section
The distinction between stateless assembler and stateful coordinator is the architectural move. It lets the expensive, parallel part (assembly) scale horizontally while keeping session consistency. I’d also call out that most state is regenerable from a smaller durable core — workflow state + raw traces is enough to rebuild everything else. This is what makes the system recoverable.
14. Data model and storage
Concrete storage choices and schemas.
Storage layer choices:
| Tier | Store | Contents |
|---|---|---|
| In-memory | Redis (clustered) | Session working memory, compact snapshots, cached summaries, retrieval result cache |
| Vector/retrieval | Pinecone / Qdrant / internal shard (depends on scale) | Embeddings for episodic, semantic, docs |
| Keyword/inverted | OpenSearch or equivalent | BM25 over same corpus as vector |
| Relational | Postgres | Workflow state, plan DAG, checkpoints, idempotency, memory metadata |
| Object | S3 / equivalent | Raw traces, full documents, full transcripts, checkpoint blobs |
| Event log | Kafka | Ingest pipeline for memory writes, audit trail |
Key schemas (sketch):
```text
memory_item
├── id UUID
├── tenant_id UUID (hard-filtered on every query)
├── user_id UUID
├── session_id UUID nullable (null = cross-session)
├── type enum (episodic | semantic | doc_chunk | tool_trace_summary)
├── content text
├── embedding vector(1536)
├── source_ref text (pointer into object store + span offsets)
├── confidence float
├── created_at ts
├── valid_from ts (temporal scope)
├── valid_until ts nullable
├── supersedes UUID nullable (version chain)
├── retracted bool
└── metadata jsonb (tool name, tags, etc.)

summary_version
├── id UUID
├── scope enum (rolling | milestone | task_state | doc | tool_trace)
├── session_id UUID nullable
├── source_spans text[] (references to raw ranges used to generate this)
├── predecessor_id UUID nullable (chain, but max depth 1 for regeneration)
├── summarizer_version text
├── content text
├── provenance jsonb (sentence-level source map)
├── confidence jsonb
└── created_at ts

tool_trace
├── id UUID
├── session_id UUID
├── step_id UUID
├── tool_name text
├── idempotency_key text (unique)
├── request jsonb (or pointer to object store if large)
├── response_ref text (pointer into object store)
├── normalized_state_ref UUID (pointer to tool_state table)
├── summary_ref UUID (pointer to summary_version)
├── status enum (pending | success | failed)
├── side_effect bool
└── created_at ts

context_assembly_plan (logged for every step)
├── id UUID
├── session_id UUID
├── step_id UUID
├── budget_total int
├── slots jsonb (slot → [item_ids with budget spent])
├── retrieval_queries jsonb
├── cache_status jsonb (hit/miss for each cached block)
├── dropped_items UUID[]
└── created_at ts

eval_record
├── id UUID
├── session_id UUID
├── step_id UUID
├── probe_type enum (summary_coverage | contradiction | fresh_eyes | user_rating)
├── result jsonb
├── severity enum
└── created_at ts

workflow_state
├── session_id UUID (PK)
├── tenant_id UUID
├── plan jsonb
├── node_states jsonb
├── last_checkpoint_id UUID
├── pending_approvals jsonb
├── rolling_summary_id UUID
├── task_state_summary_id UUID
├── status enum
└── updated_at ts
```

Why these choices:
- Postgres for workflow state — needs serializable writes, strong consistency, easy joins for ops queries. Memory volume is low.
- Object store for raw traces — cheap, high-throughput, scales to TB-PB. Traces rarely read; when they are, latency is fine.
- Separate vector + keyword indices — hybrid retrieval wants both. Unified “hybrid store” products work but lose flexibility.
- Event log for ingest — memory writes should be async and replayable. Kafka gives us that plus a clean audit trail.
- Redis for hot session state — sub-ms reads for working memory matter; a database call per turn is too slow.
Sizing sanity check:
Assume 10M active users, 100 sessions/year each, 20 turns/session, ~300 tokens/turn.
- Raw session text per user/year: 100 × 20 × 300 × 4 bytes ≈ 2.4 MB.
- Across 10M users: 24 TB/year of raw transcripts to object store. Cheap.
- After 5:1 compaction to summaries + extracted facts: ~480 KB/user/year durable. Across fleet: 4.8 TB/year in Postgres+vector. Easy.
- Embeddings at 1536d × 4 bytes per chunk, ~5K chunks/user/year: 30 MB/user/year. Fleet: 300 TB/year in vector store. Meaningful but manageable; pruning decayed memory keeps steady state bounded.
- KV-cache-equivalent for prefix caching — this is ephemeral on the provider side; I don’t store it, but worth noting that a 50K-token prefix for a 70B-class model backs ~16 GB of KV cache per session. That’s why provider caching is TTL’d — it’s expensive GPU memory.
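The sizing arithmetic above, written out (assumptions as stated: ~4 bytes of text per token, 5:1 compaction, 1536-d float32 embeddings; the exact embedding figure comes out to ~30.7 MB/user and ~307 TB fleet, which this section rounds to 30 MB and 300 TB):

```python
# Back-of-envelope storage sizing under the stated assumptions.
USERS = 10_000_000
SESSIONS_PER_YEAR, TURNS, TOKENS_PER_TURN, BYTES_PER_TOKEN = 100, 20, 300, 4

raw_per_user = SESSIONS_PER_YEAR * TURNS * TOKENS_PER_TURN * BYTES_PER_TOKEN
fleet_raw_tb = USERS * raw_per_user / 1e12          # object store, per year

compacted_per_user = raw_per_user / 5               # 5:1 compaction

embeddings_per_user = 1536 * 4 * 5_000              # 5K chunks/user/year
fleet_embeddings_tb = USERS * embeddings_per_user / 1e12

print(f"raw text: {raw_per_user / 1e6:.1f} MB/user, {fleet_raw_tb:.0f} TB fleet")
print(f"compacted: {compacted_per_user / 1e3:.0f} KB/user")
print(f"embeddings: {embeddings_per_user / 1e6:.1f} MB/user, "
      f"{fleet_embeddings_tb:.0f} TB fleet")
```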
Staff-level signals in this section
I’d call out two things:
- Storage choices are derived from access patterns, not picked first. Each store is justified by the query pattern it serves. A common mid-level error is picking a vector DB and trying to cram everything into it.
- Versioning and provenance are in the schema, not bolted on. `supersedes`, `provenance` jsonb, `summarizer_version` — these are first-class.
15. Evaluation and correctness
You can’t operate what you can’t measure. This system has too many failure modes for eyeballing.
Evaluation axes and metrics:
| Axis | Offline metric | Online metric |
|---|---|---|
| Retrieval quality | Recall@k, nDCG on labeled sets | Retrieval-item usefulness (did a cited retrieval appear in the final answer?) |
| Summary fidelity | Judge-rated coverage on known Q/A pairs against source | Coverage-probe pass rate on sampled live sessions |
| Context usefulness | Ablation: remove slot X, does task success drop? | Per-slot “citation rate” — how often does each slot get used? |
| Long-horizon task success | Labeled end-to-end runs in eval harness | User task completion rate, user-reported resolution |
| Contradiction rate | Synthetic contradiction injection and detection | Judge-detected contradictions per 1000 steps |
| Stale-memory errors | Time-shifted eval: inject outdated fact, measure error rate | User correction events (user says “no, that’s wrong”) |
| Hallucination rate | Factual eval set | Unresolvable-citation rate (model makes a claim with no retrievable source) |
| Human preference | A/B comparisons | Session-level user ratings |
| Cost per task | Mean cost over eval set | p50/p95 cost per session |
| Latency | p50/p99 per step on eval | Same, live |
Offline evaluation:
- Curated set of long-horizon tasks with labeled milestones and ground truth.
- Step-level replays: take a real session, replay it with a changed policy (e.g., different retrieval config), diff the outputs.
- Synthetic stress tests: inject deliberate distractors, contradictions, and stale facts into the memory; see if the agent recovers.
- Shadow runs: new policy runs in shadow next to prod, diffs logged, not surfaced to user.
Online evaluation:
- Judge probes on sampled live sessions (~1%), running the checks above. Latency-neutral because they’re async.
- Per-step provenance audit: for each model-generated claim, can we trace it back? Track unresolvable-claim rate.
- User signal: ratings, corrections, abandonments. Correlate with internal quality metrics.
How I’d detect “context management is hurting quality”:
- Ablation in prod. Randomly, for 1% of sessions, disable a single context layer (e.g., don’t inject retrieved episodic memory) and measure downstream task success vs the control. If task success goes up with a layer disabled, that layer is net harmful.
- Trend alerts. Judge-measured quality on live traces trending down over a week triggers an investigation. Correlate with cache hit rate, retrieval depth, summary regeneration rate.
- Fresh-eyes probe vs in-session agent. If the two diverge on next-step more than a threshold, something in context management is misleading the live agent.
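A sketch of the in-prod ablation assignment (function names and the specific layer are hypothetical): hash the session id per layer, so each layer gets an independent ~1% holdout and a session stays in the same arm for its whole lifetime.

```python
import hashlib

ABLATION_RATE = 0.01   # ~1% of sessions run with one context layer disabled


def ablation_bucket(session_id, layer):
    """Deterministic, per-layer assignment via a hash of (layer, session)."""
    digest = hashlib.sha256(f"{layer}:{session_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF < ABLATION_RATE


def assemble(session_id, slots):
    # Hypothetical hook in the context assembler: drop the episodic-memory
    # slot for sessions in that layer's ablation arm, and tag the arm so
    # downstream task-success metrics can be compared against control.
    out = dict(slots)
    if ablation_bucket(session_id, "episodic_memory"):
        out.pop("episodic_memory", None)
        out["_ablation_arm"] = "no_episodic_memory"
    return out


in_arm = sum(ablation_bucket(f"sess-{i}", "episodic_memory")
             for i in range(100_000))
print(f"{in_arm / 100_000:.2%} of sessions in the ablation arm")
```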
Staff-level signals in this section
Three signals:
- In-prod ablations as a quality measurement tool, not just an offline thing. Most candidates don’t think about running controlled experiments live.
- Unresolvable-claim rate as a specific hallucination metric. Tied to provenance, which is already in the design. This is concrete, computable, and alertable.
- Task success is decoupled from user rating. Both matter. A user might rate a failed run highly if the agent was polite; a successful run might get a low rating for UX reasons. Track both and don’t collapse them.
16. Observability and operations
What I’d want on a dashboard.
Per-step metrics:
- Active context token count by slot (histogram).
- Retrieval latency (per source), rerank latency.
- Cache hit rate, cache-read tokens / cache-write tokens / uncached tokens.
- Time-to-first-token, time-to-last-token.
- Tool call count, tool latency, tool error rate.
- Budget-shedding events (which slots got dropped).
Per-session / per-run metrics:
- Memory growth curve: tokens in durable memory over turn count.
- Summary regeneration rate.
- Contradiction detection events.
- Stale-memory incidents (user corrects agent).
- Long-horizon task completion rate, segmented by run length.
- Compaction “loss indicator” — difference between a sampled question answered against summary vs against source.
Per-tenant metrics:
- Cost per session, cost per user.
- Memory storage footprint.
- Privacy events (access-denied, redaction triggered).
Traces and logs:
- Every step logs a `context_assembly_plan` record — which slots, which items, which cache hits, which budgets. This alone lets us reconstruct any decision.
- Every model call logs input/output token counts, latency, cache hit details, and model version.
- Every tool call logs idempotency key, status, and pointers to trace.
- Every memory read logs the query, top-k returned, and which items ended up in context (vs dropped by rerank/budget).
Dashboards I’d ship:
- Operator dashboard: availability, p50/p99 step latency, error rates, cache hit rate, per-region health.
- Quality dashboard: task success rate (long-horizon), contradiction rate, stale-memory rate, judge-probe pass rate, trend over time.
- Cost dashboard: tokens/session, cost/session by tenant, cost contribution by slot.
- Agent-forensic view: per-session drill-down showing plan DAG, all steps, context assembly plan, retrievals, tool calls, summaries, probes.
Alerting:
- Page on: availability drops, error rate spikes, tail latency regressions, contradiction rate spikes, judge quality drops.
- Ticket on: cache hit rate regressions, memory growth anomalies, per-tenant cost anomalies.
Staff-level signals in this section
The key claim: agent-forensic view is the single most valuable tool for operating this system. Without being able to drill into “why did this specific step go wrong,” you’re debugging by vibes. The context_assembly_plan log table exists exactly so this view is possible.
Also: quality metrics must be separate from availability metrics. Most AI systems conflate them because “it ran without crashing” feels like a win. For long-horizon agents, a silent quality regression is worse than a crash.
17. Tradeoffs and alternatives
Let me compare the design to the main alternatives and defend the choices.
Alternative A: Full-history prompting (feed everything).
- Pro: simple. No compaction complexity, no retrieval complexity.
- Con: hits context window hard limits; suffers context rot well before that; prefill latency grows linearly; cost grows linearly.
- Verdict: fails above ~50–100 turns on any current frontier model. Non-starter for the 72-hour horizon.
Alternative B: Aggressive summarization-only (no retrieval, just stacked summaries).
- Pro: context stays small; latency and cost are predictable.
- Con: compaction drift is severe; verbatim details are lost after 2–3 cycles; no way to recover specific facts.
- Verdict: acceptable for short, linear conversations; catastrophic for long, branching work.
Alternative C: Retrieval-heavy, minimal summarization.
- Pro: keeps raw source available; retrieval is more precise than summary recall.
- Con: retrieval has latency cost every turn; distractor risk is real; requires very good retrieval quality.
- Verdict: close to viable, and it’s basically what Claude Code does with `grep`/`glob` — minimal pre-compaction, just-in-time retrieval. I think for pure document-over-large-corpus workloads, this is arguably better than my design. But for multi-turn conversation with tool side effects, you still need some summary to carry task state.
Alternative D: Centralized memory service vs per-session local memory.
- Pro of centralized: cross-session continuity, shared insights, economies of scale in infra.
- Con of centralized: privacy surface area, harder tenant isolation, cross-session contamination risk, higher infra complexity.
- Verdict: I chose centralized but with hard tenant isolation at the index layer. Per-session-local is fine for a chat app without cross-session memory, but the problem statement asked for long-horizon + multi-session.
Alternative E: Cache-heavy (big static prefix, everything cached) vs retrieval-heavy.
- Cache-heavy optimizes for latency and cost when the prefix is stable.
- Retrieval-heavy optimizes for relevance when the corpus is large and queries vary.
- Verdict: I use both, at different layers. The first ~6K of the prompt is cache-optimized; the retrieved slots are not. This is a hybrid; defending it is easy — they serve different purposes.
Alternative F: Vector-only retrieval vs hybrid.
- Pro of vector-only: single index, simpler ops.
- Con: misses exact-match cases (IDs, symbols), weaker on rare terms.
- Verdict: hybrid is worth the complexity. Ops overhead is real but modest, and the quality win on real-world queries is measurable.
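One standard way to merge the keyword and vector rankings without tuning their score scales against each other is reciprocal rank fusion; a sketch (document ids are made up):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple rankings by summing 1 / (k + rank) per document.

    RRF only looks at ranks, so a BM25 ranking (exact IDs, rare terms)
    and a vector ranking (paraphrase recall) combine cleanly even though
    their raw scores live on incompatible scales. k=60 is the value
    commonly used in the RRF literature.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_top = ["doc-ID-8841", "doc-a", "doc-b"]     # exact-match strengths
vector_top = ["doc-a", "doc-c", "doc-ID-8841"]   # semantic strengths
fused = reciprocal_rank_fusion([bm25_top, vector_top])
print(fused[:2])
```

Documents that appear high in both lists float to the top; documents only one retriever liked still survive into the rerank stage.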
Alternative G: “Memory-as-Action” (agent explicitly edits its own memory) vs controller-driven compaction.
- This is a recent line of research where compaction and retrieval are tool calls the agent itself decides to invoke, inside an RL-trained policy. The MemAct work shows this can outperform handcrafted compaction heuristics on some benchmarks.
- Pro: compaction decisions are task-aware; the model knows when to forget or regenerate.
- Con: requires RL training data; quality depends on the policy; unpredictable behavior in production without guardrails.
- Verdict: I’d adopt it as a layer on top of a heuristic baseline rather than as the core. Let the agent request compaction or retrieval, but have the controller enforce budgets and safety. The baseline heuristics are the floor; the learned policy is the ceiling.
Explicit decisions I’d make and defend:
- Keep effective context ~25% of the window, not 100%. Driven by context-rot evidence; any interviewer who pushes back, I’d defer to the Chroma/Anthropic data.
- Hybrid retrieval, not vector-only. Worth the ops cost.
- Raw traces never enter the prompt. Three representations.
- Never stack summaries beyond depth 1. Regenerate from source instead.
- Centralized memory service with per-tenant index-level isolation.
- Cache-friendly prompt ordering as an architectural invariant.
- Memory-as-action as an optional overlay, not a core.
Staff-level signals in this section
The signal is that I actually evaluate each alternative on specific metrics relevant to the workload, and I acknowledge where my choice might be suboptimal (e.g., retrieval-heavy without summarization is actually better for some workloads). A mid-level answer picks a design and declares it best; a staff answer says “here are the workloads where my choice wins and here are the workloads where another choice would win, and I picked this one because the target workload is this.”
18. Final summary
The core claim: context management for long-horizon agents is not a memory store, it’s a policy for allocating a scarce, lossy attention budget under uncertainty, with explicit durability and recovery.
The three design choices that matter most, in order:
1. A dedicated Context Assembler with budget-driven, priority-based slot allocation. This is the single most important component. Everything else feeds it. Without explicit budgets, the system silently degrades.
2. The reconstructibility invariant: every memory artifact is a cache, the raw transcript is the source of truth, and recovery is “regenerate from source.” This is what keeps compaction drift bounded. No stacked summaries deeper than one. Always a path back to ground truth.
3. Separation of prompt caching (for latency/cost) from memory (for reasoning continuity), with cache-friendly prompt structure as an architectural invariant. Conflating these is the most common mid-level mistake. Treating prompt structure as a cache contract — stable first, variable last — is what makes the economics work.
Supporting everything: budgets everywhere, provenance on every claim, raw tool traces out of the prompt, and continuous eval-driven drift detection.
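The Context Assembler’s core loop can be sketched in a few lines. This is a minimal, hypothetical illustration — the `Slot` structure, slot names, and token numbers are mine, not a real API: required slots (system prompt, tool schemas) are always included, and everything else competes by priority under an explicit token budget.

```python
from dataclasses import dataclass

@dataclass
class Slot:
    name: str        # e.g. "system", "summary", "retrieved" (illustrative)
    tokens: int      # cost of including this slot in the prompt
    priority: int    # lower number = more important
    required: bool = False

def assemble(slots: list[Slot], budget: int) -> list[Slot]:
    """Greedy budget-driven allocation: required slots first, then by priority.
    Slots that would exceed the budget are simply dropped, never truncated here."""
    chosen, spent = [], 0
    for s in sorted(slots, key=lambda s: (not s.required, s.priority)):
        if s.required or spent + s.tokens <= budget:
            chosen.append(s)
            spent += s.tokens
    return chosen
```

The point of making this explicit is that dropping a slot is a visible, logged decision, not a silent truncation somewhere downstream.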
19. Latest developments
A few things from the recent literature and practice worth flagging for this design:
- Context rot is now a mainstream, documented concern, including in provider docs. Anthropic’s own context engineering guidance explicitly frames context as a finite resource with diminishing returns. Chroma’s 2025 context-rot study across 18 frontier models quantified the degradation and showed even one well-placed distractor can hurt performance. This is the empirical backbone for budget-below-the-limit designs like mine.
- Prompt caching has matured into a cost-and-latency lever, not a micro-optimization. Anthropic’s caching supports 5-min and 1-hour TTLs with different write multipliers (1.25× and 2×). A Feb 2026 systematic study “Don’t Break the Cache” evaluated prompt caching across OpenAI, Anthropic, and Google on long-horizon agentic workloads and confirmed roughly linear cost/TTFT benefits past the caching minimum — but also showed provider-specific strategies diverge, meaning cache design is now workload-specific engineering.
- “Memory-as-action” is emerging as a unified framing — treating compaction and retrieval as tool calls the agent itself decides to invoke. MemAct (2025) and related work show RL-trained memory policies can outperform handcrafted heuristics. Relevant for this design as an overlay on the heuristic baseline.
- AgentFold and IterResearch (late 2025) push proactive context management and Markovian state reconstruction for long-horizon web agents. The pattern: compress aggressively, carry forward a compact state, let the agent re-fetch detail on demand. This mirrors the just-in-time retrieval pattern Claude Code uses.
- Evaluation is shifting toward trajectory-level rather than answer-level metrics. TRACE and AgentLongBench evaluate how the agent gets to an answer, not just whether it arrives. UltraHorizon benchmarks trajectories averaging 200K tokens and 400+ tool calls — these are getting close to the regime a real long-horizon agent sees in production. For evaluation design, trajectory-level signals (efficiency, hallucination rate along the way, adaptivity) are what to optimize.
- LongMemEval and derived benchmarks (LoCoMo, AMA-Bench) have become the de facto evaluation substrate for memory systems. The MemMachine evaluation on LongMemEval showed retrieval-stage optimizations dominate over ingestion-stage optimizations — meaning how you query matters more than how you store. This sharpens my §8 emphasis on rerank and hybrid retrieval.
- Production agents are converging on a small set of primitives — rolling compaction, tool result clearing, sub-agent delegation, and durable note-taking to files. The Anthropic cookbook on context engineering demonstrates these composing well. My design reflects this consolidation.
Likely follow-up questions
- What if the user’s query depends on a detail that was dropped during compaction?
- How do you prevent a malicious user from using cross-session memory to leak information to another tenant?
- You said 50K effective context. What’s the evidence, and how would you tune K for a specific workload?
- Walk me through exactly what happens when a tool call crashes after side-effect but before trace write.
- How do you version the summarizer without invalidating all existing summaries?
- Your memory fact store has contradictions — how does the agent decide which to believe?
- What’s the cold-start experience? First session, no memory, how does the system behave?
- You have 10M users. At some point the vector index becomes the bottleneck. What’s your plan?
- Concretely, how do you detect that a specific memory fact is stale?
- An agent is in a tight loop making bad tool calls. How does your system break the loop?
Strong follow-up answers
1. Missing detail after compaction.
Three lines of defense. First, the structured facts sidebar: hard-to-regenerate details (IDs, numbers, commitments) are extracted verbatim at compaction time, not summarized. Second, just-in-time retrieval: the agent can query the raw transcript in cold store and fetch the specific span on demand. Third, coverage probes: a background eval tests sampled questions against the summary vs the source; systematic gaps trigger regeneration. If the interviewer pushes further on “but what if the agent doesn’t know it needs that detail” — that’s the fresh-eyes probe’s job, running a parallel session from source and comparing outcomes.
2. Cross-tenant leak prevention.
Tenant isolation is enforced at the index layer, not post-filter. Every retrieval query must carry a tenant_id and the index refuses queries without one. Embeddings for user A are never in the same shard as user B’s embeddings unless they share a tenant, and cross-shard retrieval is disabled by default. Memory writes go through an authorization layer that rejects writes to a different tenant. On top of that, at the encryption layer, each tenant’s data is encrypted with a tenant-specific key so even accidental cross-reads return ciphertext. Finally, audit logs on all cross-session reads make leaks detectable.
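As a sketch of what “enforced at the index layer, not post-filter” means in practice — the class and method names here are hypothetical, and a real system would sit this guard in front of the ANN index rather than a dict — the key property is that a query without a tenant_id cannot be expressed at all, and only the caller’s shard is ever searched:

```python
class TenantScopedIndex:
    """Illustrative guardrail: tenant_id is mandatory on every read and write,
    and data lives in per-tenant shards with no cross-shard fan-out."""

    def __init__(self):
        self._shards: dict[str, list[dict]] = {}  # tenant_id -> documents

    def write(self, tenant_id: str, doc: dict) -> None:
        if not tenant_id:
            raise PermissionError("write rejected: missing tenant_id")
        self._shards.setdefault(tenant_id, []).append(doc)

    def query(self, tenant_id: str, predicate) -> list[dict]:
        if not tenant_id:
            raise PermissionError("query rejected: missing tenant_id")
        # Only the caller's own shard is searched; other tenants' data
        # is structurally unreachable from this code path.
        return [d for d in self._shards.get(tenant_id, []) if predicate(d)]
```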
3. Evidence for 50K, and workload tuning.
Evidence: Chroma’s 2025 context-rot study shows measurable degradation on semantic retrieval tasks well before 200K; Anthropic’s own docs explicitly state that recall degrades universally with length. 50K is a conservative default for 200K-class models, targeting ~25% fill.
Tuning: for a specific workload, I’d run the actual eval set at several fill levels — say 20K, 50K, 100K, 150K — and plot task success vs cost. The knee of the curve is the target. For code agents with very structured content the ceiling is higher; for conversational tasks with lots of chatty turns it’s lower. The budget is a per-product configuration, not a platform constant.
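One way to operationalize “the knee of the curve” is a marginal-gain rule: keep raising the fill level until the extra task success per additional 1K tokens falls below a threshold. This is a hypothetical sketch — the threshold value and the sample numbers are illustrative, and a real tuning run would use the product’s own eval set:

```python
def knee(points: list[tuple[int, float]], min_gain_per_1k: float = 0.001) -> int:
    """points: sorted [(fill_tokens, task_success)] from eval sweeps.
    Returns the largest fill level still paying its way in marginal success."""
    best = points[0][0]
    for (f0, s0), (f1, s1) in zip(points, points[1:]):
        marginal = (s1 - s0) / ((f1 - f0) / 1000)  # success gained per extra 1K tokens
        if marginal < min_gain_per_1k:
            break
        best = f1
    return best
```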
4. Crash between side-effect and trace write.
The write-ahead log is the answer. Before the tool call fires, we write a tool_call_intent record with the idempotency key and a status = pending. The tool gateway executes the call. On return, we update to status = success with the response pointer, atomically. If the process crashes between send and WAL update, on resume we see pending with no terminal state. For idempotent tools, we retry with the same key — the upstream deduplicates. For non-idempotent tools (email, irreversible action), we don’t retry; we surface an “unknown outcome” to the controller, which either asks the user or issues a reconciliation query to the upstream (e.g., “did email X get sent?”) via a verification tool.
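The intent/outcome protocol above can be sketched as follows. This is a minimal illustration — the record fields and an in-memory dict standing in for the durable WAL are mine; the invariant is that a record stuck in `pending` after a restart means the side effect may or may not have happened, and only idempotent tools get retried:

```python
import uuid

WAL: dict[str, dict] = {}  # in-memory stand-in for a durable write-ahead log

def call_tool(tool, args: dict, idempotent: bool):
    key = str(uuid.uuid4())                              # idempotency key
    WAL[key] = {"status": "pending", "idempotent": idempotent}  # intent, pre-send
    result = tool(**args)                                # side effect fires here
    WAL[key].update(status="success", result=result)     # terminal state, post-return
    return key, result

def recover() -> list[tuple[str, str]]:
    """On restart: pending records mean the outcome is unknown.
    Idempotent tools are safe to retry with the same key; others
    need a reconciliation query (or a human) before proceeding."""
    return [("retry" if rec["idempotent"] else "reconcile", key)
            for key, rec in WAL.items() if rec["status"] == "pending"]
```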
5. Summarizer versioning.
Summaries carry summarizer_version. When the summarizer version bumps, existing summaries are not retroactively regenerated — that would be a massive batch job. Instead, lazy regeneration: on read, if the version is older than the current deployed version by more than N revisions and the summary is on a critical path, a background job enqueues regeneration. Newer version reads take priority. For catastrophic bugs in the new summarizer, we pin consumers to the old version while we fix. Versioned summaries with immutable history make this safe.
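The lazy-regeneration read path is small enough to show inline. A hedged sketch — the field names, the lag threshold, and the queue abstraction are illustrative choices, not a fixed schema; the load-bearing property is that reads are never blocked and regeneration happens in the background:

```python
CURRENT_VERSION = 7
MAX_VERSION_LAG = 2  # the "N revisions" from the prose; value illustrative

def on_read(summary: dict, regen_queue: list) -> dict:
    """Serve a stale-but-usable summary and enqueue regeneration if it is
    too far behind the deployed summarizer and sits on a critical path."""
    lag = CURRENT_VERSION - summary["summarizer_version"]
    if lag > MAX_VERSION_LAG and summary.get("critical_path"):
        regen_queue.append(summary["id"])  # background job, never inline
    return summary                         # the read itself is never blocked
```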
6. Contradiction resolution.
The memory fact store never silently overwrites. A new fact that contradicts an existing one gets stored as a new version with supersedes = old_id and a contradiction_flag. On retrieval, the later, higher-confidence version is surfaced by default, but the agent sees there’s a conflict. For high-stakes decisions, the agent is instructed to ask the user. For low-stakes, it goes with the surfaced version. Crucially, the record of the contradiction is preserved — we don’t delete the old version — so audit and rollback are possible.
7. Cold-start behavior.
First session, no memory. The system degrades to “just a chat agent with good prompt engineering.” System prompt, tool schemas, and rolling summary (empty initially) are there. No retrieval hits; the agent operates from current-session context only. As turns accumulate, working memory and episodic memory populate. Semantic memory starts populating as the agent extracts durable facts (with user confirmation for high-confidence claims). The key design point: the system must work correctly at zero memory. I’d test that explicitly in the eval harness.
8. Vector index scaling at 10M users.
The vector index hits several walls: write throughput, index size, query latency under load. Plan: (a) shard by tenant — tenant is the natural boundary and queries are tenant-scoped anyway, so there’s no cross-shard query. (b) Hot/cold tiering: recent-and-frequent items in an in-memory ANN index (HNSW or equivalent), older items in a disk-backed index; queries fan out if necessary but most hits are in the hot tier. (c) Time-decayed pruning: after a threshold, rarely-accessed memory falls out of the active index and is archived; it can be re-activated on explicit user request. (d) Embedding compression (product quantization) for cold items. (e) Per-tenant capacity limits — a pathological tenant doesn’t get unbounded storage without opting in.
9. Detecting stale facts.
Facts have a valid_from and optional valid_until. For facts with no valid_until, staleness is detected several ways: (a) temporal judge probe — when the fact is retrieved, the judge asks “is this fact the kind that can become stale?” and if so, checks whether recent sessions have evidence that contradicts it; (b) user correction events — if a user ever corrects a claim that traces back to a fact, the fact is flagged; (c) periodic re-confirmation — long-lived facts that are used often get re-confirmed with the user at natural checkpoints (“you still want me to sort by due date?”); (d) for facts with clear expiration semantics (e.g., “user’s current project is X”), the system applies a soft TTL and prompts a re-extraction.
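The soft-TTL check (mechanism d) is the easiest of these to show concretely. A sketch under stated assumptions — the field names follow the prose, timestamps are Unix seconds, and the 90-day default is an illustrative number, not a recommendation:

```python
DAY = 86400.0  # seconds

def needs_reconfirmation(fact: dict, now: float, soft_ttl_days: float = 90) -> bool:
    """Facts with an explicit valid_until expire hard at that time;
    open-ended facts trigger a re-confirmation prompt past the soft TTL."""
    if fact.get("valid_until") is not None:
        return now >= fact["valid_until"]
    return (now - fact["valid_from"]) / DAY >= soft_ttl_days
```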
10. Breaking a bad-tool-call loop.
Multiple guardrails. (a) Per-step tool call limit — 5 tool calls per step max. (b) Per-run tool call limit and cost budget — exceed and the run pauses with a controller alert. (c) Repetition detection — if the same tool is called with the same or near-identical arguments N times without progress, the controller forces a re-planning step. (d) Progress metric — the controller tracks whether the plan DAG is advancing. Stalled DAGs trigger re-planning after K steps. (e) Model-driven reflection — at milestones, the agent self-critiques; repeated failures at the same subgoal surface as an explicit “stuck” signal. (f) Backstop: a separate critic model reviews the last N steps periodically and can inject a stop.
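Guardrail (c), repetition detection, can be sketched in a few lines. This is an illustrative implementation with hypothetical thresholds — “near-identical” is reduced here to exact argument equality; a real system would also fuzzy-match arguments:

```python
from collections import deque

class RepetitionGuard:
    """Flag when the same tool is called with the same arguments
    N times within a sliding window of recent calls."""

    def __init__(self, n: int = 3, window: int = 10):
        self.n = n
        self.recent = deque(maxlen=window)

    def record(self, tool: str, args: dict) -> bool:
        """Returns True when the controller should force a re-planning step."""
        sig = (tool, tuple(sorted(args.items())))
        self.recent.append(sig)
        return list(self.recent).count(sig) >= self.n
```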
Where candidates sound mid-level instead of staff-level
- Treating “memory” as a single thing. Mid-level answers say “I’ll use a vector DB for memory.” Staff answers distinguish working memory, rolling summary, episodic, semantic, tool state, cached prefix, durable workflow state — seven layers with different semantics, stores, and policies. The layers matter more than any individual store.
- Defaulting to “just use a bigger context window.” The window is not where durability, retrieval, or quality live. Context rot means more tokens actively hurt beyond a point. A staff candidate explicitly budgets below the hard limit and defends it with numbers.
- Conflating cache and memory. Prompt caching is for latency and cost; memory is for reasoning continuity. They have completely different invalidation logic. If a candidate says “we’ll cache the summaries to make things faster” without distinguishing those concerns, they’re mid-level.
- No story for correctness when memory is wrong. Mid-level answers assume the memory is right. Staff answers have explicit mechanisms for detecting, correcting, and recovering: provenance, confidence, contradiction detection, regeneration from source, and fresh-eyes probes. The question “what if the summary dropped a detail” should have a crisp answer.
- No evaluation story, or eval-as-afterthought. Mid-level answers tack eval on at the end. Staff answers treat measurable quality preservation over horizon length as a requirement that drives the design, with specific online and offline metrics, in-prod ablations, and alerts. You can’t claim “quality stays high as context grows” without a way to observe that claim.