Design context management for long-horizon agents
Interview-style answer. First-person, as the candidate. I talk through decisions out loud, flag tradeoffs explicitly, and mark Staff-level signals wherever they appear. I’ll use ASCII diagrams and concrete numbers rather than hand-waving.
1. Clarify scope and assumptions
Before I start designing, a few questions I’d want to pin down with the interviewer. I’ll ask them, then state what I’m assuming and move on — I won’t block on answers.
Questions I’d ask:
- Is this a single product (like a coding agent or research agent), or a platform hosting many agent applications? That changes whether the memory service is multi-tenant across apps or per-app.
- What does “long-horizon” mean here — minutes, hours, days, weeks? Days changes durability requirements substantially.
- Are we targeting a specific underlying model with a known context window (e.g. 200K, 1M), or must the design be model-agnostic?
- Do users expect cross-session memory (the agent “remembers me” across distinct conversations) or just intra-session persistence across pauses and resumes?
- Is there a human-in-the-loop step? If yes, the agent can idle for hours or days between inputs, and we need durable resumable workflow state.
- Are tools read-only (pure retrieval) or read-write (emailing, booking, committing code)? Read-write tools change the correctness story — idempotency and replay matter.
- What’s the cost envelope? If a user run costs $0.10 vs $10, different caching and retrieval strategies make sense.
- What latency target per step — interactive (~1s p50), batch (~30s), or background (minutes)?
- Is data per-tenant isolated (must never leak across customers) or per-user within a shared tenant?
- What evaluation signal do we have — user ratings, labeled traces, downstream task success?
Assumptions I’ll make to proceed:
- Multi-tenant platform. Memory store is tenant-scoped with hard isolation.
- Long-horizon = up to ~72 hours per task, multi-week cross-session memory.
- Target model is 200K-context-class with prompt caching (Claude Sonnet / GPT-4.1 family). I’ll design to not rely on a large window; the window is one layer, not the whole answer.
- Cross-session memory is in scope. The agent should remember users.
- Tools are in scope and some are write-capable. Idempotency and approval workflows matter.
- Interactive p50 latency target ≤ 2s per agent step; p99 ≤ 8s.
- Per-run cost budget ~$1–$10 depending on tier; we need to drive this down through caching and retrieval.
- Multi-agent workflows are in scope but orchestrated through the same memory substrate.
Staff-level signals in this section
The signal here is that I refuse to design without naming the assumptions, and I pick defaults that force the harder version of the problem. A mid-level candidate often assumes “large context window = solved” and then designs within that. I’m explicitly rejecting that — I’m assuming the window is a scarce, lossy resource, because that matches the operational reality: models exhibit context rot, where recall degrades as tokens pile up, even well below the hard limit. If the interviewer pushes me toward “why not just use a 1M window,” I have an answer: quality, latency, and cost all degrade with context length, and the window is not where durability lives.
2. Requirements
Functional:
- Maintain coherent behavior across turns, pauses, resumes, crashes, and multi-day gaps.
- Ingest and surface relevant history from: prior messages in the session, prior sessions, task-relevant documents, tool call results, environment state, user-declared facts.
- Support tool use with durable traces and replay-safe semantics for side-effecting tools.
- Support human-in-the-loop pauses: the agent must yield, persist state, and resume cleanly.
- Support multi-agent workflows sharing a common memory substrate with clear ownership.
- Support user-visible memory management: view, edit, forget.
Non-functional:
| Dimension | Target | Why |
|---|---|---|
| Step latency p50 | ≤ 2s | Interactive feel |
| Step latency p99 | ≤ 8s | Catches pathological cases |
| Cost / agent step | ≤ $0.02 median | Enables consumer-grade use |
| Long-horizon task success | No degradation beyond 10% vs short-horizon on matched tasks | Otherwise system is not actually “long-horizon” |
| Durability of workflow state | 99.99% | Users will not tolerate losing a 4-hour job |
| Tenant isolation | Hard (cryptographic + policy) | Enterprise requirement |
| Debuggability | Full provenance from any output claim back to source | Needed to triage regressions |
What “quality stays high as context grows” means operationally:
It means that on a fixed eval suite (graded by a judge model + labeled rubrics), the task success rate at step N is within a small ε of step 1. Specifically: I want the success curve over a 100-step run to be approximately flat, not a decay. That’s the metric I’d hold myself to. If I can’t measure it, I can’t claim the system works long-horizon.
It also means no silent degradation. Context rot is insidious precisely because it’s invisible — the model just gets dumber without telling you. So operationally I need synthetic probes that fail loudly when compaction or retrieval quality drops.
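The flatness requirement above is easy to turn into a loud probe. Here is a minimal sketch, assuming graded run traces arrive as (step_index, passed) pairs; the function names and the bucketing scheme are illustrative, not from any particular eval harness:

```python
# Sketch of the "flat success curve" check: compare task success in
# early step-buckets vs late ones, and alert when the gap exceeds ε.
# All names (step_results, bucket_size) are illustrative.

def success_by_bucket(step_results, bucket_size=10):
    """step_results: list of (step_index, passed: bool) from graded runs.
    Returns mean success rate per bucket of steps."""
    buckets = {}
    for step, passed in step_results:
        b = step // bucket_size
        buckets.setdefault(b, []).append(1.0 if passed else 0.0)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}

def horizon_degradation(step_results, bucket_size=10):
    """Success-rate drop from the earliest bucket to the latest one."""
    rates = success_by_bucket(step_results, bucket_size)
    first, last = min(rates), max(rates)
    return rates[first] - rates[last]
```

Wire the degradation number into an alert at ε (say 0.10) and context rot stops being silent: a compaction or retrieval regression shows up as a measurable late-horizon drop.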
Staff-level signals in this section
Naming an SLO for quality preservation over horizon length is the staff move here. Mid-level requirements lists stop at latency and cost. The thing that kills long-horizon agents is quality drift, and I’m treating it as a first-class measurable requirement, not an aspiration. I’m also being honest about the failure mode — “silent” — which shapes the observability design later.
3. Core problem framing
This is not “store chat history.” If it were, we’d append to a list, truncate from the front, and be done. The reason it’s hard:
The core tension: the model’s quality is a non-monotonic function of context. More relevant context helps; more irrelevant context actively hurts through attention dilution. So the job is not to remember everything, it’s to assemble the right few thousand tokens at each step.
Key tensions I’d name to the interviewer:
- Full context vs selective context. Feeding everything loses to rot; feeding too little loses to missing facts. Sweet spot is selective and high-signal.
- Latency vs recall quality. Retrieval takes time. Cheap retrieval misses. Good retrieval (reranking, hybrid) adds 100–500ms per step.
- Summarization vs fidelity. A 10× compression loses detail. Stacked summaries lose it exponentially.
- Retrieval breadth vs distraction. Top-20 beats top-5 on recall but loses to it on precision once the model has to reason. Chroma’s context-rot study shows distractors degrade performance non-uniformly — even one semantically-close distractor hurts.
- Caching vs staleness. Cached prefixes save 70–90% cost and latency, but invalidate the moment you change instructions.
- Personalization vs privacy. Cross-session memory is valuable and a liability.
- Local context quality vs long-term memory quality. The policy that makes a good single-turn response differs from the one that makes a good 500-turn trajectory.
Failure modes I’d design against:
- Context bloat. The window fills; recall degrades silently; prefill latency balloons.
- Stale summaries. A day-old summary asserts facts that have since changed.
- Tool trace pollution. A 50KB JSON response from one tool call displaces useful history.
- Wrong retrievals. Semantically-similar-but-wrong memory fact gets injected, model confidently repeats it.
- Compaction drift. Summary of summary of summary. Errors compound.
- Recursive error accumulation. Agent reads its own earlier hallucination back as “fact.”
- Cache thrashing. Every turn invalidates the cache because we rewrote the prefix.
Staff-level signals in this section
The framing I’d want to land: context is a scarce, lossy resource with diminishing marginal returns; the job is allocation, not accumulation. That reframes the design from “database for memory” to “policy for context assembly,” which is the right abstraction. I’d also name recursive error accumulation explicitly — the agent reading its own bad output as ground truth later — because that’s the specific failure mode that kills trust in long-horizon systems and most candidates miss it.
4. High-level architecture
Here’s the end-to-end picture I’d draw on the whiteboard.
```
                 ┌──────────────────────────────────────┐
                 │           Agent Controller           │
                 │   (policy loop, step orchestration)  │
                 └──────────────────┬───────────────────┘
                                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                        CONTEXT ASSEMBLER                        │
│     Builds the prompt for this step under a token budget.       │
│   Pulls from all memory layers, applies priority + dedup.       │
└───┬───────────┬─────────────┬───────────────┬───────────┬───────┘
    ▼           ▼             ▼               ▼           ▼
 Session     Rolling      Retrieval      Tool State    Durable
 Working     Summary      Layer          & Trace       Memory Store
 Memory      Service      (vec + kw +    Store         (episodic +
 (Redis)                  structured)                  semantic)
                              │              │             │
                              ▼              ▼             ▼
               ┌────────────────────────────────────────┐
               │      Retrieval Index + Blob Store      │
               │   (vector DB, inverted index, object   │
               │       store for raw traces/docs)       │
               └────────────────────────────────────────┘

┌────────────┐     ┌─────────────────┐     ┌─────────────────────┐
│  Prompt /  │     │  Compaction &   │     │    Policy Engine    │
│  Prefix    │◀───▶│  Summarization  │◀───▶│ (what enters active │
│  Cache     │     │     Service     │     │  context, budgets)  │
└─────┬──────┘     └─────────────────┘     └─────────────────────┘
      ▼
┌────────────┐                            ┌──────────────────┐
│   Model    │─────── tool calls ────────▶│   Tool Gateway   │
│  Runtime   │◀────── tool results ───────│ (with approvals, │
└─────┬──────┘                            │   idempotency)   │
      ▼                                   └──────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│                Evaluation + Observability Layer                  │
│   (judge models, probes, provenance traces, metrics, alerts)     │
└──────────────────────────────────────────────────────────────────┘
```

Components, briefly:
- Agent Controller: the policy loop. Decides “next step”: plan, call tool, read, respond, compact, checkpoint. This is where Memory-as-Action-style decisions live if we want the agent itself to drive compaction (more on that later).
- Context Assembler: the crux of this system. Given the controller’s intent for the next step, it builds the prompt under a fixed token budget by pulling from all layers.
- Session Working Memory: the last N turns, raw, in Redis or equivalent. Hot path.
- Rolling Summary Service: produces and versions summaries at multiple granularities (turn-window, milestone, task).
- Retrieval Layer: hybrid (vector + BM25 + structured). Queries episodic and semantic memory, documents, and prior tool traces.
- Tool State & Trace Store: normalized state snapshots, raw traces, side-effect logs.
- Durable Memory Store: episodic (events) and semantic (facts) memory, versioned and provenance-tagged.
- Prompt / Prefix Cache: both the provider-side cache (Anthropic/OpenAI prompt caching) and an internal cache for summaries and retrieval results.
- Policy Engine: enforces budgets, admission rules, privacy, tenant isolation.
- Compaction Service: background worker that produces summaries, detects drift, and schedules regeneration from source.
- Eval + Obs Layer: judges, probes, metric collection, provenance graphs.
Staff-level signals in this section
Three things I’d make sure land:
- The Context Assembler is a distinct, testable component with its own policy — it’s not implicit in the prompt template. This is the single most important abstraction in the system.
- The tool trace store is separate from the active context by design — raw traces never flow directly into the window, they’re normalized first. This is the difference between a toy agent and a production one.
- The prompt cache and memory store serve different purposes — the cache is for latency and cost, memory is for reasoning continuity. Conflating them is a common mid-level mistake.
5. Memory model and context layers
I think of this as a pyramid with latency and capacity roughly inverted:
```
┌──────────────────────────┐
│ Immediate Turn Context   │   ~1K tokens, <1ms
│ (system + current user)  │
└──────────────────────────┘
┌────────────────────────────┐
│ Rolling Working Memory     │   ~20K tokens, ms
│ (last K turns, verbatim)   │
└────────────────────────────┘
┌──────────────────────────────┐
│ Compressed Summaries         │   ~5K tokens each, versioned, ~10ms
│ (rolling + milestone + task) │
└──────────────────────────────┘
┌────────────────────────────────┐
│ Episodic Memory (events)       │   MB-scale, ~50ms retrieve
│ — what happened, when          │
└────────────────────────────────┘
┌──────────────────────────────────┐
│ Semantic Memory (facts)          │   MB-scale, ~50ms retrieve
│ — declared / extracted facts     │
└──────────────────────────────────┘
┌────────────────────────────────────┐
│ Tool / Environment State           │   KB–MB scale, ~10ms
│ — normalized, current, versioned   │
└────────────────────────────────────┘
┌──────────────────────────────────────┐
│ Cached Prompt Prefixes               │   provider-side, ~0ms prefill
│ — sys prompt + tool schema + ...     │
└──────────────────────────────────────┘
┌────────────────────────────────────────┐
│ Durable Workflow State                 │   KB scale, ~10ms
│ — checkpoints, pending approvals       │
└────────────────────────────────────────┘
┌──────────────────────────────────────────┐
│ Cold Object Store (raw traces, docs)     │   TB–PB scale, ~200ms
│ — source of truth, rarely read directly  │
└──────────────────────────────────────────┘
```

What goes in each layer and why:
| Layer | Contents | Write cadence | Read pattern | Why here |
|---|---|---|---|---|
| Immediate turn | System prompt, current user msg, active task anchor | Every turn | Every step | Highest attention weight; must be concise |
| Rolling working | Last K=10–20 turns verbatim | Append per turn | Every step | Recent coherence; verbatim fidelity |
| Compressed summaries | Rolling summary (last N turns), milestone summaries (task checkpoints), task-state summary | Every K turns, or on milestone | Every step | Bridges verbatim → durable |
| Episodic memory | Event records: user said X, tool returned Y at time T | On event | Retrieved on demand | Time-aware recall |
| Semantic memory | Extracted/declared facts: “user is vegetarian” | On extraction | Retrieved on demand | Durable knowledge about user/world |
| Tool/env state | Normalized current state (e.g. calendar view, repo tree) | On tool call | Every step if task-relevant | Compact, current — not raw trace |
| Prompt prefix cache | System prompt + tool schema + stable prelude | Once per change | Every call | Latency/cost — not for reasoning |
| Durable workflow state | Plan, pending approvals, checkpoints, task DAG progress | On state change | On resume | Resumability, crash-safety |
| Cold object store | Raw traces, full documents, full chat logs | On event | Rarely | Source of truth for regeneration |
A key design rule: every layer above the cold store must be reconstructible from the cold store. If a summary goes stale or a memory fact is wrong, I can throw it away and regenerate. This is the invariant that saves the system long-term.
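The reconstructibility invariant can be made concrete as a data-model rule: every derived artifact records the cold-store span ids it was built from, so it can be discarded and rebuilt at any time. A minimal sketch, with summarize() as a stand-in for a real summarizer call and all names illustrative:

```python
# Sketch of the reconstructibility invariant: derived artifacts are
# caches over the cold store, never the source of truth themselves.

COLD_STORE = {}  # span_id -> raw text (source of truth)

def summarize(texts):
    # Placeholder: a real system makes a summarizer-model call here.
    return " | ".join(t[:20] for t in texts)

class DerivedArtifact:
    """A summary/fact/snapshot that can always be regenerated."""

    def __init__(self, source_span_ids):
        self.source_span_ids = list(source_span_ids)
        self.text = None
        self.regenerate()

    def regenerate(self):
        # Throw the cached form away and rebuild from source spans.
        self.text = summarize([COLD_STORE[s] for s in self.source_span_ids])
        return self.text
```

If a summary goes stale or a fact is suspect, the playbook is one call: regenerate() — no archaeology through stacked summaries.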
Staff-level signals in this section
The reconstructibility invariant is the thing I’d emphasize. Without it, errors become permanent — you summarize a summary and there’s no way back to ground truth. With it, every summary or memory is a cache over source-of-truth, and you have a clear playbook when things go wrong: regenerate. I’d also separate episodic from semantic explicitly, because they have different retrieval patterns (time-indexed vs similarity-indexed) and different decay policies.
6. Context assembly policy
This is where most of the design lives. Let me spell out exactly how the assembler builds the prompt for step t.
Budget: suppose the model’s useful-context window is 200K tokens, but based on context-rot research I’ll cap the effective working budget at 50K tokens — roughly 25% fill. Past that, recall degrades noticeably even before hard limits. I’ll describe how the 50K is allocated.
Fixed allocation (cached, prefix):
| Slot | Budget | Cached |
|---|---|---|
| System prompt | 2K | yes |
| Tool schemas | 3K | yes |
| Stable user profile / persona | 1K | yes |
| Cached prefix subtotal | 6K | |

Variable allocation (not cached, or cached with short TTL):
| Slot | Budget |
|---|---|
| Task goal + plan (current) | 1K |
| Rolling summary (most recent) | 2K |
| Milestone summaries (relevant) | 2K |
| Working memory (last K turns) | 15K |
| Retrieved episodic memory | 5K |
| Retrieved semantic facts | 2K |
| Retrieved documents | 8K |
| Active tool state snapshot | 3K |
| Reserved for response + CoT | 6K |
| Variable subtotal | 44K |

TOTAL: 50K (75% of useful capacity reserved as headroom). These numbers aren’t sacred — they come from per-workload tuning. The point is budgets are explicit and enforced, not emergent.
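“Explicit and enforced” is the operative phrase, so it is worth sketching what enforcement looks like. A hedged sketch: slot names mirror the budget tables, and a crude whitespace tokenizer stands in for the model’s real one:

```python
# Budget-first assembly sketch: each slot has a hard token ceiling and
# ranked content is cut to fit, rather than admitted until the window
# fills. (The remaining 6K of the 44K variable budget is reserved for
# the response, so it has no slot here.)

SLOT_BUDGETS = {
    "task_goal_plan": 1_000,
    "rolling_summary": 2_000,
    "milestone_summaries": 2_000,
    "working_memory": 15_000,
    "episodic": 5_000,
    "semantic_facts": 2_000,
    "documents": 8_000,
    "tool_state": 3_000,
}

def count_tokens(text):
    return len(text.split())  # crude stand-in for the model tokenizer

def fit_slot(items, budget):
    """items: candidate strings ranked best-first (e.g. by rerank score).
    Keep whole items in rank order until the ceiling is hit."""
    kept, used = [], 0
    for item in items:
        cost = count_tokens(item)
        if used + cost > budget:
            break  # cut by rank, never "whatever fits"
        kept.append(item)
        used += cost
    return kept
```

The ceiling is per slot, which is what bounds worst-case cost and latency: a 30K-token retrieval result cannot displace the rolling summary, because it never competes with it.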
Assembly flow:
```
Step begins: controller emits intent (plan / act / ...)
  │
  ▼
1. Seed cached prefix: sys + tool schema + stable persona
  │
  ▼
2. Inject always-include: task goal, plan, rolling summary
  │
  ▼
3. Generate retrieval queries (from current intent + recent turns)
  │
  ▼
4. Parallel retrieval: episodic | semantic | docs | tool traces
  │
  ▼
5. Rerank + dedup across sources: penalize near-duplicates and stale items
  │
  ▼
6. Budget allocation: truncate/drop lowest-priority until it fits
  │
  ▼
7. Inject working memory (last K turns) with age-weighted truncation
  │
  ▼
8. Inject current tool state snapshot
  │
  ▼
9. Emit prompt, preserving cache-friendly ordering (stable prefix unchanged)
```

Key design decisions and defenses:
- Budget-first, not content-first. Each slot has a token ceiling. If retrieval returns 30K of potentially-relevant stuff, I cut to 5K based on rerank scores, not based on “whatever fits.” This bounds worst-case cost and latency.
- Stable prefix first, variable content last. This is critical for prompt caching to work. Cache hit requires the prefix to be byte-identical; a single changed character earlier in the prompt invalidates everything after. So the order is: cached slots → semi-stable → volatile.
- Recency doesn’t automatically win. Recent turns go in verbatim up to a cap. Beyond that, I summarize. Recency bias is how bloat starts.
- Retrieval is not always invoked. On routine continuation turns (the last step was “continue the code edit”), I skip retrieval entirely. It’s ~150ms of latency and tends to pollute context. The controller decides.
- Priority-based eviction. When over budget, I drop in this order: retrieved docs → retrieved episodic → older milestone summaries → older working-memory turns (replaced by a deeper rolling summary). The rolling summary itself is never dropped.
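The eviction order above can be sketched directly; the slot names and the small helper are illustrative, but the invariant at the end is the real rule:

```python
# Sketch of priority-based eviction: when over budget, drop from the
# lowest-priority slots first; the rolling summary is never dropped.

EVICTION_ORDER = [
    "documents",       # retrieved docs go first
    "episodic",        # then retrieved episodic memory
    "milestones",      # then older milestone summaries
    "working_memory",  # then older turns (a deeper summary replaces them)
]

def evict_until_fits(slots, sizes, budget):
    """slots: dict slot -> list of items (oldest last in each list).
    sizes: dict item -> token cost. Mutates slots; returns tokens used."""
    def total():
        return sum(sizes[i] for items in slots.values() for i in items)
    for slot in EVICTION_ORDER:
        while total() > budget and slots.get(slot):
            slots[slot].pop()  # drop lowest-priority / oldest item
    assert slots.get("rolling_summary"), "rolling summary is never dropped"
    return total()
```

Encoding the order as data rather than branching logic also makes it trivially testable, which matters once the order itself becomes a tuning knob.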
Staff-level signals in this section
The subtle point I’d make here: my effective budget is 50K in a 200K window on purpose. The interviewer might push on this — “why not use the whole window?” — and my answer is concrete: (a) recall degrades past ~50% fill even on current frontier models, so more tokens trade quality for nothing; (b) prefill latency scales with context length — 200K prefill is multiple seconds even with batching; (c) prompt caching recovers most of the latency only if the prefix is stable, so keeping the variable portion small is an explicit lever. A mid-level answer fills the window; a staff answer sets a budget below the limit and defends it with numbers.
Second signal: retrieval is a budgeted, gated action, not an always-on. That’s the difference between an agent that feels sharp and one that feels noisy.
7. Compaction and summarization
Compaction is where long-horizon quality is won or lost.
What gets compacted:
- Rolling conversation — every K turns (K ≈ 10), produce a summary of that window. Older raw turns can then be evicted from the active prompt while the summary carries forward.
- Milestones — when the agent completes a subgoal (“found 5 candidate vendors”, “deployed v1”), a milestone summary is produced. Milestones are pinned — they persist across compaction cycles.
- Task-state summary — a continuously-maintained summary of “what the task is, where we are, what’s decided, what’s open.” Regenerated, not appended.
- Tool traces — raw tool outputs are summarized on ingestion (see §10). Raw trace goes to cold store.
- Document chunks — when a document is brought into context, we summarize it and keep the summary + pointer. Full chunks retrieved on demand.
When compaction fires:
- Rolling: every K turns, in background.
- Milestone: on controller signal (plan advanced a node in the task DAG).
- Emergency: when the context assembler would exceed budget, a synchronous compaction pass runs on the oldest uncommitted working memory.
- Regeneration: scheduled. Every task-state summary has a max age (e.g., 30 min or 50 turns, whichever comes first) after which it’s regenerated from source.
Compaction pipeline:
```
Source inputs: raw turns, tool traces, docs, prior milestone
summaries (but NOT prior rolling summaries)
  │
  ▼
Summarizer model call (small, fast, tuned), prompted with:
  "preserve: decisions, open questions, user preferences,
   failed attempts, tool side-effects; drop: redundant
   confirmations, verbose tool output, closed side threads"
  │
  ▼
Provenance-tag every claim: each sentence in the summary maps
to source span(s) in the cold store. Stored as a side-table.
  │
  ▼
Write summary version (monotonic id), index it. Store
side-by-side with predecessor for audit/rollback.
```

Anti-drift rules I’d bake in:
- Never summarize a summary of a summary. I always regenerate from source (raw turns + tool traces) plus at most one prior summary for continuity. Two levels of stacking is the limit. This is the single most important rule for avoiding compaction drift.
- Preserve “hard” information verbatim. Numbers, names, IDs, code diffs, commitments to the user — these are extracted into a structured “facts” sidebar alongside the summary and carried forward uncompressed.
- Confidence + provenance. Each summary claim has a provenance pointer and a confidence (from the summarizer’s own rationale, or from a secondary judge). Low-confidence claims get flagged for regeneration from source.
- Cold-start the rolling summary on drift detection. If a judge probe (see §11, §15) flags a regression, I throw the rolling summary away and rebuild from the last K*3 raw turns.
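The “two levels of stacking” rule is mechanical enough to enforce in code rather than in a prompt. A sketch, assuming each compaction input is tagged as raw or as a summary with a known depth; the record shape is illustrative:

```python
# Sketch of the stacking limit: a compaction pass refuses inputs that
# would produce a summary more than two levels away from source.

MAX_DEPTH = 2  # raw -> summary -> summary-with-continuity, never deeper

def summary_depth(inputs):
    """inputs: list of ('raw', 0) or ('summary', depth) records.
    Depth of the summary this pass would produce."""
    return 1 + max((d for kind, d in inputs if kind == "summary"), default=0)

def can_compact(inputs):
    prior = [i for i in inputs if i[0] == "summary"]
    # At most one prior summary for continuity, and never an input
    # that is itself already at the depth limit.
    return len(prior) <= 1 and summary_depth(inputs) <= MAX_DEPTH
```

When can_compact is false, the correct move is not to force the pass but to regenerate from cold-store source spans, which the reconstructibility invariant makes possible.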
Tool trace summarization specifically:
Tool traces are the worst offender for context bloat. A single search call can return 20KB of JSON. Rule: raw traces never enter the prompt. The flow is:
```
tool call ──▶ raw trace ──▶ cold store (full fidelity)
                  │
                  ├──▶ normalizer ──▶ compact state snapshot
                  │      (e.g., "top 5 results: A, B, C, D, E")
                  │
                  └──▶ summarizer ──▶ summary entry in episodic memory
                         ("search for X returned 57 results, top 5: ...")
```

Only the compact state snapshot goes into the active context. Everything else is retrievable on demand.
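A sketch of that fanout, with illustrative names; the point is that the raw payload is written once to the cold store, and only the compact snapshot (carrying a provenance pointer back to it) is eligible for the active context:

```python
# Sketch of tool-trace ingestion: full-fidelity raw trace to cold
# store, normalized snapshot for the prompt. Field names illustrative.

import hashlib
import json

COLD_STORE = {}

def ingest_tool_result(tool, raw_payload, top_n=5):
    raw = json.dumps(raw_payload, sort_keys=True)
    span_id = hashlib.sha256(raw.encode()).hexdigest()[:12]
    COLD_STORE[span_id] = raw  # full fidelity, retrievable on demand

    results = raw_payload.get("results", [])
    snapshot = {
        "tool": tool,
        "total": len(results),
        "top": [r["title"] for r in results[:top_n]],
        "cold_store_span": span_id,  # provenance pointer
    }
    return snapshot  # only this enters the active context
```

A 20KB search response collapses to a few dozen tokens in the window, and the span id means the agent can still pull the full trace later if the task needs it.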
Staff-level signals in this section
Three things to highlight:
- “Never summarize a summary of a summary.” This is the specific technical rule that prevents the slow degradation that most long-horizon agents hit around turn 200. Most candidates don’t name it.
- Structured facts sidebar + narrative summary, not one or the other. The narrative summary is lossy; the structured facts are not. Carrying both is redundant in the good way.
- Compaction is triggered by quality signals, not just token pressure. If a judge probe says “summary is missing something important,” I regenerate. Token count is necessary but not sufficient.
8. Retrieval design
The purpose of retrieval in this system is to answer the question: “what does the agent need to know right now that isn’t already in the prompt?”
Pipeline:
```
Query generation (from intent + recent turns + plan state;
can produce multiple queries in parallel)
  │
  ▼
Fanout across sources:
  ┌─────────────┐    ┌─────────────┐
  │   Vector    │    │   BM25 /    │
  │   (dense)   │    │   keyword   │
  └──────┬──────┘    └──────┬──────┘
         │                  │
  ┌──────┴──────────────────┴──────┐
  │ Structured filters:            │
  │ tenant_id, user_id, session,   │
  │ time_range, memory_type,       │
  │ source, confidence ≥ τ         │
  └───────────────┬────────────────┘
                  ▼
Merge + rerank (cross-encoder):
  learned reranker scores each candidate; freshness bonus for
  recent items; penalty for near-duplicates (MMR)
  │
  ▼
Budget-fit: take top-K until budget spent, with mandatory
diversity (not all from the same source or the same time slice)
  │
  ▼
Return with citations (each chunk tagged with id for provenance)
```

Hybrid, not vector-only. Pure vector retrieval misses exact-match cases (IDs, file names, code symbols). Pure keyword misses semantic matches. Hybrid with learned fusion or reciprocal rank fusion gives materially better recall in practice — this is a well-established pattern and, more importantly, recent memory-system work has shown retrieval-stage optimizations dominate over ingestion changes.
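The merge step can use reciprocal rank fusion (RRF), a standard way to combine rankings from sources whose scores are not calibrated against each other. A minimal sketch; k=60 is the conventional constant from the RRF literature:

```python
# Reciprocal rank fusion: each source contributes 1/(k + rank) per
# document; summed scores give the fused order. No score calibration
# across vector and keyword sources is needed.

def rrf_fuse(rankings, k=60):
    """rankings: list of ranked id lists (best first).
    Returns ids sorted by summed reciprocal rank."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear in both rankings float to the top even when neither source ranked them first, which is exactly the behavior you want from a hybrid merge.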
Structured filters always applied. Tenant isolation is a hard filter, enforced at the index layer, not post-hoc. No query is allowed to run without tenant_id. This is a safety invariant.
Multi-source merge with source-aware priors. I don’t treat “memory fact” and “document chunk” as interchangeable. A semantic memory fact (“user is allergic to peanuts”) should outrank a random doc chunk on relevance ties, because confidence and specificity matter.
Freshness weighting, not pure recency. For most queries, recent > old, but “what’s my birthday” should return the durable fact, not the most recent mention. Different memory types get different freshness priors.
When retrieval fires:
- Always: at step start, if the step is “plan” or “act with user-facing output.”
- Sometimes: mid-step, if the model emits a retrieval tool call (just-in-time retrieval, matching Claude Code’s pattern of grep/glob-on-demand).
- Never: on pure “continue the current in-progress action” steps.
Dedup and contradiction handling:
- Dedup by hash of normalized content plus semantic similarity threshold.
- If two memory items contradict (say, “user prefers dark mode” v1 vs “user prefers light mode” v2), the later-dated, higher-confidence one wins in retrieval but both are visible to the model with conflict flagged. The model can ask the user to clarify if the contradiction is material.
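A sketch of that conflict rule: the later-dated version wins the primary slot, confidence breaks ties, and both versions carry a conflict flag so the model can surface the inconsistency. The field names are illustrative:

```python
# Sketch of contradiction handling for memory facts: resolve for
# ranking purposes, but never hide the conflict from the model.

def resolve_conflict(facts):
    """facts: list of dicts with 'value', 'date' (ISO string),
    'confidence'. Returns (winner, flagged_copies)."""
    # ISO dates compare correctly as strings; confidence breaks ties.
    winner = max(facts, key=lambda f: (f["date"], f["confidence"]))
    flagged = [dict(f, conflict=len(facts) > 1) for f in facts]
    return winner, flagged
```

Returning the flagged copies rather than just the winner is the design choice that lets the agent say “I have conflicting notes on this — which is right?” instead of silently picking one.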
Staff-level signals in this section
Key points to emphasize:
- Retrieval is a policy decision per step, not always-on. Running retrieval every turn is expensive and often harmful (distractor injection).
- Tenant isolation is an index-level invariant, not a post-filter. Post-filters leak under bugs.
- I retrieve with explicit source awareness — a fact is not a doc chunk is not a tool trace. Each has different retrieval semantics.
- Contradictions are surfaced, not silently resolved. The system should make the agent aware that its memory is inconsistent, so it can ask the user.
9. Caching and reuse
This is where real money gets saved. I’ll separate caching into two conceptually distinct things, because conflating them is a common mistake:
- Cache for latency/cost. Prompt prefix caching. Purpose: don’t re-prefill the same tokens.
- Memory for reasoning continuity. Summaries, facts. Purpose: carry meaning across calls when the verbatim text can’t.
Different systems, different invalidation rules.
Prompt prefix caching — how to structure for hits:
The rule: stable content first, variable content last. If I change anything in the prefix, everything after it invalidates.
```
┌─────────────────────────────────────────────────────────┐
│ [CACHE BREAKPOINT]                                      │
│ System prompt (stable)                                  │
│ Tool schemas (stable across a deploy)                   │
│ Stable user profile / persona                           │
│ [CACHE BREAKPOINT]                                      │
│ Rolling summary (semi-stable, updates every K turns)    │
│ Task goal + plan                                        │
│ [CACHE BREAKPOINT]                                      │
│ Retrieved items (volatile; no cache)                    │
│ Working memory turns (semi-stable; cached between       │
│   turns until next turn arrives)                        │
│ Active tool state snapshot (volatile)                   │
│ Current user message (volatile)                         │
└─────────────────────────────────────────────────────────┘
```

Cache economics — concrete numbers:
Using Claude Sonnet-class pricing at roughly $3/MTok input, 5-minute cache writes at ~$3.75/MTok (1.25×), and cache reads at ~$0.30/MTok (0.1×). A 50K-token cached prefix over a 10-turn conversation:
- No cache: 50K × 10 turns × $3/MTok = $1.50 of prefix cost alone.
- 5-min cache: first call writes at $0.1875, next 9 calls read at 9 × 50K × $0.30/MTok = $0.135. Total $0.32. Savings ≈ 78%.
- 1-hour cache (2× write): first call $0.30, next 9 reads $0.135. Total $0.435. Still ~71% savings, and useful if turns are spaced out (hitting 5-min TTL would cost a re-write).
Break-even on the write overhead: cache-write costs 0.25× extra. Cache-read saves 0.9× per hit. Break-even after ~0.28 hits. So anything read more than once is profitable. (This matches what providers publish and independent studies confirm for long-horizon agentic workloads.)
Latency math is similar in shape. Anthropic reports up to ~85% latency reduction on long prompts — my rough internal mental model: prefill for 50K tokens that would take ~2–3s without cache drops to ~300–500ms with cache hit.
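To make the arithmetic concrete, here's a small sketch of the prefix-economics numbers above. The prices are the illustrative Sonnet-class figures from this section, not an official rate card, and the function shape is mine:

```python
# Illustrative prefix-cache economics (assumed prices, per this section).
INPUT_PER_MTOK = 3.00          # $/MTok, uncached input
CACHE_WRITE_MULT_5MIN = 1.25   # write surcharge, 5-minute TTL
CACHE_WRITE_MULT_1H = 2.00     # write surcharge, 1-hour TTL
CACHE_READ_MULT = 0.10         # read discount


def prefix_cost(prefix_tokens, turns, write_mult=None):
    """Dollar cost of re-sending a fixed prefix across `turns` calls.

    write_mult=None models no caching: every turn pays full input price.
    Otherwise turn 1 pays the write surcharge and the rest read from cache.
    """
    mtok = prefix_tokens / 1_000_000
    if write_mult is None:
        return turns * mtok * INPUT_PER_MTOK
    write = mtok * INPUT_PER_MTOK * write_mult
    reads = (turns - 1) * mtok * INPUT_PER_MTOK * CACHE_READ_MULT
    return write + reads


no_cache = prefix_cost(50_000, 10)                           # $1.50
cached_5m = prefix_cost(50_000, 10, CACHE_WRITE_MULT_5MIN)   # $0.3225
cached_1h = prefix_cost(50_000, 10, CACHE_WRITE_MULT_1H)     # $0.435
print(f"savings at 5-min TTL: {1 - cached_5m / no_cache:.1%}")
```

The 5-min vs 1-hour decision drops out of the same function: just vary `write_mult` against the expected gap between turns.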
What else to cache:
- Summaries. Summaries are expensive to produce (one LLM call each). Cache them by `(source_span_hash, summarizer_version)`. Invalidate when the source spans change.
- Retrieval results. Cache by `(query_hash, tenant_id, filters, index_version)`. Short TTL (~60s) for interactive; can be longer for batch.
- Tool schemas. Cacheable across the whole tenant until the tool catalog changes.
- Intermediate plans. If the agent produces a plan and we resume later, cache it along with its inputs.
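As a sketch of the summary-cache keying above (class and method names are hypothetical), note how putting the summarizer version in the key makes invalidation automatic — a version bump simply never hits stale entries:

```python
import hashlib


class SummaryCache:
    """Toy summary cache keyed by (source_span_hash, summarizer_version).

    A summarizer version bump changes the key, so stale summaries are
    never hit again and can be garbage-collected lazily. Sketch only;
    the real summarize step would be an LLM call.
    """

    def __init__(self, summarize_fn, version):
        self._summarize = summarize_fn
        self.version = version
        self._store = {}     # (span_hash, version) -> summary text
        self.misses = 0

    @staticmethod
    def span_hash(source_text):
        return hashlib.sha256(source_text.encode()).hexdigest()[:16]

    def get(self, source_text):
        key = (self.span_hash(source_text), self.version)
        if key not in self._store:
            self.misses += 1                       # recompute on miss
            self._store[key] = self._summarize(source_text)
        return self._store[key]


cache = SummaryCache(lambda text: text[:20] + "...", version="v1")
cache.get("long transcript span A")   # miss: computed
cache.get("long transcript span A")   # hit: cached
cache.version = "v2"                  # summarizer upgraded
cache.get("long transcript span A")   # miss again: regenerated under v2
```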
What NOT to cache:
- Raw tool outputs beyond the current step. These are better written to the trace store once.
- Anything derived from freshness-sensitive data. Weather, stock prices, calendar state — either don’t cache or cache with very short TTL and freshness checks.
Invalidation:
- Prefix cache invalidates automatically when the prefix changes. The design job is to keep the prefix from changing unnecessarily.
- Summary cache invalidates on source change or summarizer version bump.
- Retrieval cache invalidates on index update or TTL.
- Fact memory has no cache — it is the source of truth for its layer.
Per-session vs cross-session caches:
- Prefix cache is naturally per-session-prefix-hash but shared across identical prefixes (different users with the same system prompt share cache).
- Summary and retrieval caches are keyed by tenant + content, so they can share across sessions for the same user or task.
Staff-level signals in this section
The key distinctions I’d make:
- Cache for latency/cost ≠ memory for reasoning. Different invalidation, different failure modes.
- The prompt structure is a cache contract. Rearranging slots to be cache-friendly is a specific, deliberate design choice with measurable ROI.
- Cache-write cost is a real budget item. In workloads with low reuse, caching can make things worse. The 5-min vs 1-hour TTL decision is an economic one that depends on actual reuse patterns.
- I’d instrument cache hit rate as a first-class metric. Without it, caching regressions are invisible.
10. Tool traces, world state, and workflow state
This is the layer most candidates underweight, and it’s where long-horizon agents actually break.
Three distinct representations:
- Raw trace — full request/response for every tool call. Written to object store. Never enters active context. Used for audit, replay, debugging, regeneration.
- Normalized state — a structured view of “what is true about the world now.” For a calendar tool: `current_view = {events: [...], last_refreshed: T}`. Updated after tool calls.
- Compact snapshot — the projection of normalized state that’s small enough to inject into context. Only the fields the current step plausibly needs.
```text
┌──────────────┐
│ Tool call    │
└──────┬───────┘
       │
       ▼
┌──────────────┐      ┌──────────────────┐
│ Raw trace    │─────▶│ Object store     │  <- source of truth
│ (full JSON)  │      │ (cold, cheap)    │
└──────┬───────┘      └──────────────────┘
       │
       ▼
┌──────────────┐      ┌──────────────────┐
│ Normalizer   │─────▶│ Normalized       │  <- durable state
│              │      │ state store      │     (versioned)
└──────┬───────┘      └──────────────────┘
       │
       ▼
┌──────────────┐      ┌──────────────────┐
│ Summarizer   │─────▶│ Episodic memory  │  <- event record
│ (lossy)      │      │ entry            │     (indexed)
└──────┬───────┘      └──────────────────┘
       │
       ▼
┌──────────────┐
│ Compact      │─────▶  Active context
│ snapshot     │
└──────────────┘
```

Why raw traces don’t stay in context:
A single tool call can produce tens of KB of JSON, most of which is irrelevant. Ten calls and the window is polluted. Worse, the noise is high-entropy and model attention gets drawn to it over signal. Keeping raw traces out of context is the single biggest operational win over naive designs. This is exactly the failure mode Anthropic documents — “context window fills with hundreds of thousands of tokens of file content, most of it already processed and noted.”
Side effects and idempotency:
- Every write-capable tool call gets a client-generated `idempotency_key` derived from `(session_id, step_id, action_id)`. If the agent retries, the tool layer deduplicates.
- Side effects are logged with their idempotency key and result, so on replay the tool gateway can return the prior result rather than re-executing.
- For non-idempotent external APIs (e.g., sending an email), the gateway enforces an at-most-once contract with a write-ahead log: we record the intent, then the result, atomically.
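A minimal sketch of that gateway contract (all names hypothetical; a real system would make the WAL writes durable and atomic in Postgres, per §13):

```python
class ToolGateway:
    """At-most-once tool gateway sketch.

    Before executing a side-effecting call we write an intent record
    keyed by the idempotency key; after execution we record the result.
    A retry with the same key returns the logged result instead of
    re-executing the side effect.
    """

    def __init__(self):
        self.wal = {}        # idempotency_key -> {"status", "result"}
        self.executions = 0  # counts real side effects, for illustration

    @staticmethod
    def idempotency_key(session_id, step_id, action_id):
        return f"{session_id}:{step_id}:{action_id}"

    def call(self, key, execute_fn):
        record = self.wal.get(key)
        if record is not None and record["status"] == "done":
            return record["result"]       # dedup: replay the prior result
        self.wal[key] = {"status": "intent", "result": None}  # write-ahead
        result = execute_fn()             # the actual side effect
        self.executions += 1
        self.wal[key] = {"status": "done", "result": result}
        return result


gw = ToolGateway()
k = ToolGateway.idempotency_key("sess-1", "step-7", "send-email")
r1 = gw.call(k, lambda: "email-id-123")
r2 = gw.call(k, lambda: "email-id-456")   # retried step: deduplicated
```

A crash between "intent" and "done" leaves a record the gateway can inspect on resume, which is exactly the mid-tool-call recovery path described under resumability.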
Human-in-the-loop and approvals:
- On approval points, the controller writes a checkpoint: full normalized state + active plan + pending action. The run is then suspended.
- The checkpoint is durable (relational DB + object store references). Resumption is by checkpoint id.
- User approval comes back through an out-of-band channel (UI, email). The workflow coordinator resumes from checkpoint with the approval decision injected.
Resumability:
- All state needed to resume is in the durable workflow store: plan DAG + node states, pending tool calls with idempotency keys, last normalized state, last rolling summary version.
- On resume, we don’t replay tool calls; we re-read normalized state and continue. This avoids re-executing side effects.
- If a crash happens mid-tool-call: the idempotency key + WAL lets us detect that the call already happened, retrieve the result, and continue.
Concurrent multi-agent:
- When multiple agents touch shared workflow state (e.g., planner + executor + critic), each holds a lease on the specific nodes it’s acting on. Conflicting writes fail-fast and surface to the controller.
- Read access is non-blocking with version vectors; writes use optimistic concurrency.
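The optimistic-concurrency rule can be sketched as follows (toy in-memory store, hypothetical names; a production version would sit on the workflow DB):

```python
class VersionedStore:
    """Toy optimistic-concurrency store for shared plan nodes.

    Reads return (value, version); writes must present the version they
    read. A conflicting write fails fast so the controller can decide
    what to do, rather than silently clobbering another agent's update.
    """

    def __init__(self):
        self._data = {}   # key -> (value, version)

    def read(self, key):
        return self._data.get(key, (None, 0))

    def write(self, key, value, expected_version):
        _, current = self._data.get(key, (None, 0))
        if current != expected_version:
            return False                       # conflict: fail fast
        self._data[key] = (value, current + 1)
        return True


store = VersionedStore()
_, v = store.read("plan/node-3")
ok1 = store.write("plan/node-3", "executor: running", v)          # succeeds
ok2 = store.write("plan/node-3", "critic: needs revision", v)     # stale: rejected
```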
Staff-level signals in this section
Four things I’d emphasize:
- Three representations, not one. The most common bug in naive designs is treating “tool output” as a single thing and trying to fit it into context.
- Idempotency keys and WAL are non-negotiable for write-capable tools. If you crash between “email sent” and “trace written,” you’d better know which happened.
- Resumability is from normalized state, not from replay. Replaying tool calls is correct in theory and catastrophic in practice.
- Leases and version vectors for multi-agent. Otherwise two agents can overwrite each other’s plan updates silently.
11. Long-running quality preservation
This is the section where I’d spend the most time in a real interview, because it’s the heart of the question.
The problem: over a long run, the agent’s effective beliefs drift from ground truth. Summaries lose detail. Facts get stale. Hallucinations get reified. And the agent is the last to notice.
My principles:
- Provenance on every claim. Every item that can enter context carries a pointer to its source. If the agent outputs a claim, we can trace it backward.
- Confidence scores, surfaced to the agent. A memory fact has a confidence; the agent can see it and reason about uncertainty.
- Regeneration from source is the recovery mechanism. Summaries are caches; if they’re wrong, throw them away and regenerate from the raw traces and turns.
- Active contradiction detection. Run a cheap background check after every K turns: does the current rolling summary conflict with any semantic memory fact? If so, flag.
- Periodic re-planning and self-critique. At milestones, the agent re-reads the task-state summary and the original task description, and checks: am I still on track? This catches drift.
- Checkpoints as rollback points. If a judge eval detects quality has crashed, we can roll back to the last-known-good checkpoint and re-run from there with the problematic summary evicted.
- Eval-driven corrections. Judge probes run continuously on live traces (sampled). Regressions trigger alerts and can auto-evict suspect memories.
“What if the agent’s memory is wrong?”
Concretely: the agent confidently remembers “we decided on Option B,” but we actually decided on Option A. How do I detect?
- Provenance check on critical claims. When the agent makes a decision-referencing statement, the system can verify the claim against the sourced span in episodic memory. If the span doesn’t support the claim, flag.
- Contradiction probe. A lightweight judge model periodically asks: “given the raw transcript, does the rolling summary accurately capture X, Y, Z?” If the judge flags a miss, regenerate.
- User-facing confirmation on high-stakes actions. Before a write-capable tool fires on the basis of a remembered preference, confirm: “You’d said you wanted Option B — is that still right?”
“What if a summary omitted a crucial detail?”
- Structured facts sidebar (mentioned in §7) carries the hard-to-summarize stuff — IDs, numbers, commitments — verbatim alongside the narrative summary. So the summary can drop the detail and the sidebar still has it.
- Regeneration on demand. If the agent tries to retrieve a detail and can’t find it in the current compressed state, it can issue a retrieval query against the cold store (raw traces) to get it back. This is the “just-in-time” pattern — don’t pre-process, fetch on demand.
- Coverage probes. A sampled set of synthetic questions is answered against the compressed state vs against the raw transcript; coverage gap is a metric.
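The sidebar-then-cold-store fallback can be sketched like this (names and the dict-backed "cold store" are stand-ins for the real trace query):

```python
class DetailResolver:
    """Fallback path for details a summary dropped. Sketch only.

    Hard-to-summarize items (IDs, numbers, commitments) live verbatim in
    a structured sidebar; if a lookup misses there, we query the cold
    store (raw traces) on demand instead of assuming the detail is gone.
    """

    def __init__(self, sidebar, cold_store_search):
        self.sidebar = sidebar
        self.cold_store_search = cold_store_search
        self.cold_fetches = 0   # instrumented: JIT fetches are a metric

    def resolve(self, key):
        if key in self.sidebar:
            return self.sidebar[key]        # cheap, verbatim
        self.cold_fetches += 1
        return self.cold_store_search(key)  # expensive, from raw traces


raw_traces = {"order_id": "ORD-8841", "deadline": "2025-03-01"}
resolver = DetailResolver(
    sidebar={"order_id": "ORD-8841"},       # the summary kept this one
    cold_store_search=raw_traces.get,       # stand-in for a trace query
)
print(resolver.resolve("order_id"))   # served from sidebar
print(resolver.resolve("deadline"))   # recovered from cold store
```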
Periodic “fresh eyes” pass:
Every N steps, I’d run a dedicated “fresh context” call: same task, new model session, just the source-of-truth inputs (task, raw transcript, current state snapshot). Compare its proposed next step to the in-session agent’s proposed next step. If they diverge materially, that’s a signal the in-session agent has drifted. This is expensive, so it runs at milestone boundaries, not every turn.
Staff-level signals in this section
The “what if memory is wrong?” question is where candidates either shine or flatten. My answer has three specific mechanisms (provenance check, contradiction probe, user confirmation on high-stakes) plus a recovery path (regenerate from source). I’d also stress the fresh-eyes pass — most candidates don’t propose anything that actively checks for drift; they just hope it doesn’t happen.
The concept I’d reach for explicitly: every memory artifact is a cache, the raw transcript is the source of truth, and the recovery playbook is always “regenerate.” If the interviewer asks “but your summary might drop a detail,” the answer is “yes, and when the agent asks for it, we fetch it from source.”
12. Admission control, budgets, and limits
Budgets everywhere, enforced at assembly time. Without hard limits, context monotonically grows and bad things happen silently.
| Budget | Default | Enforced where | Degrade behavior |
|---|---|---|---|
| Active context tokens | 50K (of 200K window) | Context assembler | Drop lowest-priority slots |
| Retrieval returns per query | Top 20 pre-rerank, top 5 post-rerank | Retrieval layer | Cap hard, log the drop |
| Summarization rate | 1 compaction/10 turns max | Compaction service | Queue if exceeded, never skip |
| Per-run memory growth | 100 new semantic facts / run max | Memory writer | Reject with “user confirmation required” |
| Per-tenant QPS | Configurable | Gateway | 429 with retry-after |
| Per-run $ budget | Tenant-configurable | Controller | Pause + notify owner |
| Max plan depth | 20 | Controller | Force re-plan with constraint |
| Max tool calls per step | 5 | Controller | Force reasoning step |
| Total run duration | 72h | Workflow coordinator | Checkpoint + pause |
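The "drop lowest-priority slots" degrade behavior can be sketched as a small assembly-time function (slot names and budgets are illustrative, not the real config):

```python
def assemble_context(slots, budget_tokens):
    """Budget enforcement at assembly time: shed lowest-priority slots.

    `slots` is a list of (name, priority, tokens); lower priority number
    means more important. Returns (kept, dropped) so the drops can be
    logged in the context_assembly_plan record rather than vanishing.
    """
    kept = sorted(slots, key=lambda s: s[1])   # most important first
    dropped = []
    total = sum(s[2] for s in kept)
    while total > budget_tokens and kept:
        victim = kept.pop()                    # least important remaining
        dropped.append(victim)
        total -= victim[2]
    return kept, dropped


slots = [
    ("system_prompt",   0,  2_000),
    ("task_state",      1,  4_000),
    ("working_memory",  2, 30_000),
    ("retrieved_items", 3, 12_000),
    ("tool_snapshot",   4,  8_000),
]
kept, dropped = assemble_context(slots, budget_tokens=50_000)
print([s[0] for s in dropped])   # -> ['tool_snapshot']
```

Shedding whole slots (rather than truncating every slot a little) keeps each surviving slot coherent, and the `dropped` list is exactly what the degradation alert consumes.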
Why these specific limits matter:
- Memory growth per run. Without this, an agent in a pathological loop can write thousands of “facts” per hour, each polluting future retrievals. Capping memory writes per run and requiring confirmation for large batches prevents runaway pollution.
- Plan depth. Infinite plan expansion (subgoal → subgoal → …) is a classic long-horizon failure. A depth cap forces replanning at the top level.
- Retrieval returns. Chroma’s research on distractor sensitivity is direct evidence that top-5 with a good reranker beats top-20 on most tasks — more is not better.
Graceful degradation when budgets are hit:
- Budget-exceeded doesn’t kill the run; it triggers shedding. The system announces it’s operating with reduced context, and logs which slots were dropped so we can debug.
- On repeated degradation, an alert fires to eng. Sustained shedding means the workload needs rebalancing.
Staff-level signals in this section
The point is that budgets are the control surface for the whole system. Without them, the only thing stopping runaway behavior is the model’s context window, and by the time you hit that, you’re already in context-rot territory. Naming a specific list of budgets with specific numbers and degrade behaviors is the staff-level move.
I’d also note: budgets are per-tenant configurable, not global. Different customers have different cost/quality tradeoffs.
13. Distributed systems concerns
Where state lives and how failures propagate.
State classification:
- Ephemeral (in-memory, session-scoped): current prompt being assembled, in-flight retrieval results. If lost, recomputed from durable state.
- Short-lived durable (Redis with TTL, or in-memory cache): working memory (last K turns as a raw list), compact snapshots, mid-session plan state. Persisted to durable store on checkpoint boundaries.
- Durable (relational + object): workflow state, summaries, facts, raw traces, idempotency records.
```text
┌─────────────────────────────────────────────────────────────┐
│ Context Assembler (stateless)                               │
│ One instance per request, can scale horizontally.           │
└──────────────┬──────────────────────────────────────────────┘
               │
     ┌─────────┴─────────┐
     │                   │
     ▼                   ▼
┌─────────┐     ┌──────────────────┐
│ Redis   │     │ Session          │
│ (hot    │     │ Coordinator      │  (stateful, sharded
│ tiers)  │     │ (leases, locks)  │   by session_id)
└────┬────┘     └────────┬─────────┘
     │                   │
     ▼                   ▼
┌────────────────────────────────────────────────────────────┐
│ Durable backing:                                           │
│   Postgres (workflow state, metadata)                      │
│   Vector index (memory, docs)                              │
│   Object store (raw traces, full docs, checkpoints)        │
│   Event log (Kafka / equivalent) for ingest pipelines      │
└────────────────────────────────────────────────────────────┘
```

Stateless context assembler, stateful session coordinator. The assembler can run anywhere; it reads state, produces a prompt, returns. The coordinator owns the session lease — it decides what the next step is and who may write to session state. One session = one coordinator at a time (via lease).
Multi-region:
- Within a region: single primary for workflow state, read replicas for memory and vector indices.
- Cross-region: memory and workflow state are eventually consistent, with tenant affinity to home region. Cross-region failover is a break-glass that costs some stale reads but keeps sessions alive.
Concurrent access:
- Session-scoped state: coordinator lease prevents concurrent writers.
- Shared memory (cross-session semantic facts): optimistic concurrency with version vectors, LWW with conflict flags for human review.
- Vector index: eventually consistent; retrievals are best-effort.
Consistency of workflow state:
- Workflow state is the strongest-consistency piece. It lives in Postgres with serializable transactions for checkpoint writes.
- Tool call records and idempotency keys are in the same Postgres so we can atomically write `(intent, result)` pairs.
- Everything else is eventually consistent and regenerable.
Retries and replay:
- Agent step is idempotent-by-step-id. If a step crashes, the coordinator retries with the same step id.
- Tool calls use idempotency keys. Retry returns the prior result if any.
- Summary generation is idempotent per source span + summarizer version.
Versioning:
- Every summary has a version id. Readers pin to a version; writers produce new versions rather than overwriting.
- Memory facts have version chains; “current” is the latest non-retracted version.
- Summarizer itself has a version. When the summarizer version bumps, existing summaries aren’t retroactively regenerated — they’re lazily regenerated when read.
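The lazy-regeneration rule can be sketched as a read-path check (record shape and names are hypothetical, modeled on the `summary_version` schema in §14):

```python
CURRENT_SUMMARIZER_VERSION = "v3"


def read_summary(record, regenerate_fn):
    """Lazy regeneration on read.

    Summaries produced by an older summarizer are not rewritten in bulk
    on deploy; the first read under the new version regenerates from the
    source spans and produces a *new* version record (chained via
    predecessor_id) rather than overwriting the old one.
    """
    if record["summarizer_version"] == CURRENT_SUMMARIZER_VERSION:
        return record
    return {
        "id": record["id"] + "'",            # stand-in for a new UUID
        "predecessor_id": record["id"],      # version chain, no overwrite
        "source_spans": record["source_spans"],
        "summarizer_version": CURRENT_SUMMARIZER_VERSION,
        "content": regenerate_fn(record["source_spans"]),
    }


old = {"id": "s-1", "summarizer_version": "v2",
       "content": "stale summary", "source_spans": ["span-a", "span-b"]}
fresh = read_summary(old, lambda spans: f"summary of {len(spans)} spans")
print(fresh["content"])   # regenerated under v3
```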
Failure modes and responses:
| Failure | Response |
|---|---|
| Model call fails | Retry with backoff; if persistent, checkpoint + alert |
| Retrieval layer down | Serve request without retrieval; log degraded mode; alert |
| Vector index lag | Fall back to keyword-only; alert if lag > threshold |
| Workflow DB down | Hard stop; in-flight sessions fail cleanly with checkpoint |
| Coordinator crash | Lease expires; another coordinator picks up session from last checkpoint |
Staff-level signals in this section
The distinction between stateless assembler and stateful coordinator is the architectural move. It lets the expensive, parallel part (assembly) scale horizontally while keeping session consistency. I’d also call out that most state is regenerable from a smaller durable core — workflow state + raw traces is enough to rebuild everything else. This is what makes the system recoverable.
14. Data model and storage
Concrete storage choices and schemas.
Storage layer choices:
| Tier | Store | Contents |
|---|---|---|
| In-memory | Redis (clustered) | Session working memory, compact snapshots, cached summaries, retrieval result cache |
| Vector/retrieval | Pinecone / Qdrant / internal shard (depends on scale) | Embeddings for episodic, semantic, docs |
| Keyword/inverted | OpenSearch or equivalent | BM25 over same corpus as vector |
| Relational | Postgres | Workflow state, plan DAG, checkpoints, idempotency, memory metadata |
| Object | S3 / equivalent | Raw traces, full documents, full transcripts, checkpoint blobs |
| Event log | Kafka | Ingest pipeline for memory writes, audit trail |
Key schemas (sketch):
```text
memory_item
├── id UUID
├── tenant_id UUID (hard-filtered on every query)
├── user_id UUID
├── session_id UUID nullable (null = cross-session)
├── type enum (episodic | semantic | doc_chunk | tool_trace_summary)
├── content text
├── embedding vector(1536)
├── source_ref text (pointer into object store + span offsets)
├── confidence float
├── created_at ts
├── valid_from ts (temporal scope)
├── valid_until ts nullable
├── supersedes UUID nullable (version chain)
├── retracted bool
└── metadata jsonb (tool name, tags, etc.)

summary_version
├── id UUID
├── scope enum (rolling | milestone | task_state | doc | tool_trace)
├── session_id UUID nullable
├── source_spans text[] (references to raw ranges used to generate this)
├── predecessor_id UUID nullable (chain, but max depth 1 for regeneration)
├── summarizer_version text
├── content text
├── provenance jsonb (sentence-level source map)
├── confidence jsonb
└── created_at ts

tool_trace
├── id UUID
├── session_id UUID
├── step_id UUID
├── tool_name text
├── idempotency_key text (unique)
├── request jsonb (or pointer to object store if large)
├── response_ref text (pointer into object store)
├── normalized_state_ref UUID (pointer to tool_state table)
├── summary_ref UUID (pointer to summary_version)
├── status enum (pending | success | failed)
├── side_effect bool
└── created_at ts

context_assembly_plan (logged for every step)
├── id UUID
├── session_id UUID
├── step_id UUID
├── budget_total int
├── slots jsonb (slot → [item_ids with budget spent])
├── retrieval_queries jsonb
├── cache_status jsonb (hit/miss for each cached block)
├── dropped_items UUID[]
└── created_at ts

eval_record
├── id UUID
├── session_id UUID
├── step_id UUID
├── probe_type enum (summary_coverage | contradiction | fresh_eyes | user_rating)
├── result jsonb
├── severity enum
└── created_at ts

workflow_state
├── session_id UUID (PK)
├── tenant_id UUID
├── plan jsonb
├── node_states jsonb
├── last_checkpoint_id UUID
├── pending_approvals jsonb
├── rolling_summary_id UUID
├── task_state_summary_id UUID
├── status enum
└── updated_at ts
```

Why these choices:
- Postgres for workflow state — needs serializable writes, strong consistency, easy joins for ops queries. Memory volume is low.
- Object store for raw traces — cheap, high-throughput, scales to TB-PB. Traces rarely read; when they are, latency is fine.
- Separate vector + keyword indices — hybrid retrieval wants both. Unified “hybrid store” products work but lose flexibility.
- Event log for ingest — memory writes should be async and replayable. Kafka gives us that plus a clean audit trail.
- Redis for hot session state — sub-ms reads for working memory matter; a database call per turn is too slow.
Sizing sanity check:
Assume 10M active users, 100 sessions/year each, 20 turns/session, ~300 tokens/turn.
- Raw session text per user/year: 100 × 20 × 300 × 4 bytes ≈ 2.4 MB.
- Across 10M users: 24 TB/year of raw transcripts to object store. Cheap.
- After 5:1 compaction to summaries + extracted facts: ~480 KB/user/year durable. Across fleet: 4.8 TB/year in Postgres+vector. Easy.
- Embeddings at 1536d × 4 bytes per chunk, ~5K chunks/user/year: 30 MB/user/year. Fleet: 300 TB/year in vector store. Meaningful but manageable; pruning decayed memory keeps steady state bounded.
- KV-cache-equivalent for prefix caching — this is ephemeral on the provider side; I don’t store it, but worth noting that a 50K-token prefix for a 70B-class model backs ~16 GB of KV cache per session. That’s why provider caching is TTL’d — it’s expensive GPU memory.
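The sizing arithmetic above, written out (assumptions as stated: ~4 bytes of text per token, 5:1 compaction, 1536-d float32 embeddings; the exact embedding figure comes out to ~30.7 MB/user and ~307 TB fleet, which this section rounds to 30 MB and 300 TB):

```python
# Back-of-envelope storage sizing under the stated assumptions.
USERS = 10_000_000
SESSIONS_PER_YEAR, TURNS, TOKENS_PER_TURN, BYTES_PER_TOKEN = 100, 20, 300, 4

raw_per_user = SESSIONS_PER_YEAR * TURNS * TOKENS_PER_TURN * BYTES_PER_TOKEN
fleet_raw_tb = USERS * raw_per_user / 1e12          # object store, per year

compacted_per_user = raw_per_user / 5               # 5:1 compaction

embeddings_per_user = 1536 * 4 * 5_000              # 5K chunks/user/year
fleet_embeddings_tb = USERS * embeddings_per_user / 1e12

print(f"raw text: {raw_per_user / 1e6:.1f} MB/user, {fleet_raw_tb:.0f} TB fleet")
print(f"compacted: {compacted_per_user / 1e3:.0f} KB/user")
print(f"embeddings: {embeddings_per_user / 1e6:.1f} MB/user, "
      f"{fleet_embeddings_tb:.0f} TB fleet")
```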
Staff-level signals in this section
I’d call out two things:
- Storage choices are derived from access patterns, not picked first. Each store is justified by the query pattern it serves. A common mid-level error is picking a vector DB and trying to cram everything into it.
- Versioning and provenance are in the schema, not bolted on. `supersedes`, `provenance` jsonb, `summarizer_version` — these are first-class.
15. Evaluation and correctness
You can’t operate what you can’t measure. This system has too many failure modes for eyeballing.
Evaluation axes and metrics:
| Axis | Offline metric | Online metric |
|---|---|---|
| Retrieval quality | Recall@k, nDCG on labeled sets | Retrieval-item usefulness (did a cited retrieval appear in the final answer?) |
| Summary fidelity | Judge-rated coverage on known Q/A pairs against source | Coverage-probe pass rate on sampled live sessions |
| Context usefulness | Ablation: remove slot X, does task success drop? | Per-slot “citation rate” — how often does each slot get used? |
| Long-horizon task success | Labeled end-to-end runs in eval harness | User task completion rate, user-reported resolution |
| Contradiction rate | Synthetic contradiction injection and detection | Judge-detected contradictions per 1000 steps |
| Stale-memory errors | Time-shifted eval: inject outdated fact, measure error rate | User correction events (user says “no, that’s wrong”) |
| Hallucination rate | Factual eval set | Unresolvable-citation rate (model makes a claim with no retrievable source) |
| Human preference | A/B comparisons | Session-level user ratings |
| Cost per task | Mean cost over eval set | p50/p95 cost per session |
| Latency | p50/p99 per step on eval | Same, live |
Offline evaluation:
- Curated set of long-horizon tasks with labeled milestones and ground truth.
- Step-level replays: take a real session, replay it with a changed policy (e.g., different retrieval config), diff the outputs.
- Synthetic stress tests: inject deliberate distractors, contradictions, and stale facts into the memory; see if the agent recovers.
- Shadow runs: new policy runs in shadow next to prod, diffs logged, not surfaced to user.
Online evaluation:
- Judge probes on sampled live sessions (~1%), running the checks above. Latency-neutral because they’re async.
- Per-step provenance audit: for each model-generated claim, can we trace it back? Track unresolvable-claim rate.
- User signal: ratings, corrections, abandonments. Correlate with internal quality metrics.
How I’d detect “context management is hurting quality”:
- Ablation in prod. Randomly, for 1% of sessions, disable a single context layer (e.g., don’t inject retrieved episodic memory) and measure downstream task success vs the control. If task success goes up with a layer disabled, that layer is net harmful.
- Trend alerts. Judge-measured quality on live traces trending down over a week triggers an investigation. Correlate with cache hit rate, retrieval depth, summary regeneration rate.
- Fresh-eyes probe vs in-session agent. If the two diverge on next-step more than a threshold, something in context management is misleading the live agent.
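A sketch of the in-prod ablation assignment (function names and the specific layer are hypothetical): hash the session id per layer, so each layer gets an independent ~1% holdout and a session stays in the same arm for its whole lifetime.

```python
import hashlib

ABLATION_RATE = 0.01   # ~1% of sessions run with one context layer disabled


def ablation_bucket(session_id, layer):
    """Deterministic, per-layer assignment via a hash of (layer, session)."""
    digest = hashlib.sha256(f"{layer}:{session_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF < ABLATION_RATE


def assemble(session_id, slots):
    # Hypothetical hook in the context assembler: drop the episodic-memory
    # slot for sessions in that layer's ablation arm, and tag the arm so
    # downstream task-success metrics can be compared against control.
    out = dict(slots)
    if ablation_bucket(session_id, "episodic_memory"):
        out.pop("episodic_memory", None)
        out["_ablation_arm"] = "no_episodic_memory"
    return out


in_arm = sum(ablation_bucket(f"sess-{i}", "episodic_memory")
             for i in range(100_000))
print(f"{in_arm / 100_000:.2%} of sessions in the ablation arm")
```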
Staff-level signals in this section
Three signals:
- In-prod ablations as a quality measurement tool, not just an offline thing. Most candidates don’t think about running controlled experiments live.
- Unresolvable-claim rate as a specific hallucination metric. Tied to provenance, which is already in the design. This is concrete, computable, and alertable.
- Task success is decoupled from user rating. Both matter. A user might rate a failed run highly if the agent was polite; a successful run might get a low rating for UX reasons. Track both and don’t collapse them.
16. Observability and operations
What I’d want on a dashboard.
Per-step metrics:
- Active context token count by slot (histogram).
- Retrieval latency (per source), rerank latency.
- Cache hit rate, cache-read tokens / cache-write tokens / uncached tokens.
- Time-to-first-token, time-to-last-token.
- Tool call count, tool latency, tool error rate.
- Budget-shedding events (which slots got dropped).
Per-session / per-run metrics:
- Memory growth curve: tokens in durable memory over turn count.
- Summary regeneration rate.
- Contradiction detection events.
- Stale-memory incidents (user corrects agent).
- Long-horizon task completion rate, segmented by run length.
- Compaction “loss indicator” — difference between a sampled question answered against summary vs against source.
Per-tenant metrics:
- Cost per session, cost per user.
- Memory storage footprint.
- Privacy events (access-denied, redaction triggered).
Traces and logs:
- Every step logs a `context_assembly_plan` record — which slots, which items, which cache hits, which budgets. This alone lets us reconstruct any decision.
- Every model call logs input/output token counts, latency, cache hit details, and model version.
- Every tool call logs idempotency key, status, and pointers to trace.
- Every memory read logs the query, top-k returned, and which items ended up in context (vs dropped by rerank/budget).
Dashboards I’d ship:
- Operator dashboard: availability, p50/p99 step latency, error rates, cache hit rate, per-region health.
- Quality dashboard: task success rate (long-horizon), contradiction rate, stale-memory rate, judge-probe pass rate, trend over time.
- Cost dashboard: tokens/session, cost/session by tenant, cost contribution by slot.
- Agent-forensic view: per-session drill-down showing plan DAG, all steps, context assembly plan, retrievals, tool calls, summaries, probes.
Alerting:
- Page on: availability drops, error rate spikes, tail latency regressions, contradiction rate spikes, judge quality drops.
- Ticket on: cache hit rate regressions, memory growth anomalies, per-tenant cost anomalies.
Staff-level signals in this section
The key claim: agent-forensic view is the single most valuable tool for operating this system. Without being able to drill into “why did this specific step go wrong,” you’re debugging by vibes. The context_assembly_plan log table exists exactly so this view is possible.
Also: quality metrics must be separate from availability metrics. Most AI systems conflate them because “it ran without crashing” feels like a win. For long-horizon agents, a silent quality regression is worse than a crash.
17. Tradeoffs and alternatives
Let me compare the design to the main alternatives and defend the choices.
Alternative A: Full-history prompting (feed everything).
- Pro: simple. No compaction complexity, no retrieval complexity.
- Con: hits context window hard limits; suffers context rot well before that; prefill latency grows linearly; cost grows linearly.
- Verdict: fails above ~50–100 turns on any current frontier model. Non-starter for the 72-hour horizon.
Alternative B: Aggressive summarization-only (no retrieval, just stacked summaries).
- Pro: context stays small; latency and cost are predictable.
- Con: compaction drift is severe; verbatim details are lost after 2–3 cycles; no way to recover specific facts.
- Verdict: acceptable for short, linear conversations; catastrophic for long, branching work.
Alternative C: Retrieval-heavy, minimal summarization.
- Pro: keeps raw source available; retrieval is more precise than summary recall.
- Con: retrieval has latency cost every turn; distractor risk is real; requires very good retrieval quality.
- Verdict: close to viable, and it’s basically what Claude Code does with `grep`/`glob` — minimal pre-compaction, just-in-time retrieval. I think for pure document-over-large-corpus workloads, this is arguably better than my design. But for multi-turn conversation with tool side effects, you still need some summary to carry task state.
Alternative D: Centralized memory service vs per-session local memory.
- Pro of centralized: cross-session continuity, shared insights, economies of scale in infra.
- Con of centralized: privacy surface area, harder tenant isolation, cross-session contamination risk, higher infra complexity.
- Verdict: I chose centralized but with hard tenant isolation at the index layer. Per-session-local is fine for a chat app without cross-session memory, but the problem statement asked for long-horizon + multi-session.
Alternative E: Cache-heavy (big static prefix, everything cached) vs retrieval-heavy.
- Cache-heavy optimizes for latency and cost when the prefix is stable.
- Retrieval-heavy optimizes for relevance when the corpus is large and queries vary.
- Verdict: I use both, at different layers. The first ~6K of the prompt is cache-optimized; the retrieved slots are not. This is a hybrid; defending it is easy — they serve different purposes.
Alternative F: Vector-only retrieval vs hybrid.
- Pro of vector-only: single index, simpler ops.
- Con: misses exact-match cases (IDs, symbols), weaker on rare terms.
- Verdict: hybrid is worth the complexity. Ops overhead is real but modest, and the quality win on real-world queries is measurable.
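One standard way to merge the keyword and vector rankings without tuning their score scales against each other is reciprocal rank fusion; a sketch (document ids are made up):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple rankings by summing 1 / (k + rank) per document.

    RRF only looks at ranks, so a BM25 ranking (exact IDs, rare terms)
    and a vector ranking (paraphrase recall) combine cleanly even though
    their raw scores live on incompatible scales. k=60 is the value
    commonly used in the RRF literature.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_top = ["doc-ID-8841", "doc-a", "doc-b"]     # exact-match strengths
vector_top = ["doc-a", "doc-c", "doc-ID-8841"]   # semantic strengths
fused = reciprocal_rank_fusion([bm25_top, vector_top])
print(fused[:2])
```

Documents that appear high in both lists float to the top; documents only one retriever liked still survive into the rerank stage.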
Alternative G: “Memory-as-Action” (agent explicitly edits its own memory) vs controller-driven compaction.
- This is a recent line of research where compaction and retrieval are tool calls the agent itself decides to invoke, inside an RL-trained policy. The MemAct work shows this can outperform handcrafted compaction heuristics on some benchmarks.
- Pro: compaction decisions are task-aware; the model knows when to forget or regenerate.
- Con: requires RL training data; quality depends on the policy; unpredictable behavior in production without guardrails.
- Verdict: I’d adopt it as a layer on top of a heuristic baseline rather than as the core. Let the agent request compaction or retrieval, but have the controller enforce budgets and safety. The baseline heuristics are the floor; the learned policy is the ceiling.
Explicit decisions I’d make and defend:
- Keep effective context ~25% of the window, not 100%. Driven by context-rot evidence; any interviewer who pushes back, I’d defer to the Chroma/Anthropic data.
- Hybrid retrieval, not vector-only. Worth the ops cost.
- Raw traces never enter the prompt. Three representations.
- Never stack summaries beyond depth 1. Regenerate from source instead.
- Centralized memory service with per-tenant index-level isolation.
- Cache-friendly prompt ordering as an architectural invariant.
- Memory-as-action as an optional overlay, not a core.
Staff-level signals in this section
The signal is that I actually evaluate each alternative on specific metrics relevant to the workload, and I acknowledge where my choice might be suboptimal (e.g., retrieval-heavy without summarization is actually better for some workloads). A mid-level answer picks a design and declares it best; a staff answer says “here are the workloads where my choice wins and here are the workloads where another choice would win, and I picked this one because the target workload is this.”
18. Final summary
The core claim: context management for long-horizon agents is not a memory store, it’s a policy for allocating a scarce, lossy attention budget under uncertainty, with explicit durability and recovery.
The three design choices that matter most, in order:
1. A dedicated Context Assembler with budget-driven, priority-based slot allocation. This is the single most important component. Everything else feeds it. Without explicit budgets, the system silently degrades.
2. The reconstructibility invariant: every memory artifact is a cache, the raw transcript is the source of truth, and recovery is “regenerate from source.” This is what keeps compaction drift bounded. No stacked summaries deeper than one. Always a path back to ground truth.
3. Separation of prompt caching (for latency/cost) from memory (for reasoning continuity), with cache-friendly prompt structure as an architectural invariant. Conflating these is the most common mid-level mistake. Treating prompt structure as a cache contract — stable first, variable last — is what makes the economics work.
Supporting everything: budgets everywhere, provenance on every claim, raw tool traces out of the prompt, and continuous eval-driven drift detection.
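The Context Assembler’s core loop can be sketched in a few lines. This is a minimal, hypothetical illustration — the `Slot` structure, slot names, and token numbers are mine, not a real API: required slots (system prompt, tool schemas) are always included, and everything else competes by priority under an explicit token budget.

```python
from dataclasses import dataclass

@dataclass
class Slot:
    name: str        # e.g. "system", "summary", "retrieved" (illustrative)
    tokens: int      # cost of including this slot in the prompt
    priority: int    # lower number = more important
    required: bool = False

def assemble(slots: list[Slot], budget: int) -> list[Slot]:
    """Greedy budget-driven allocation: required slots first, then by priority.
    Slots that would exceed the budget are simply dropped, never truncated here."""
    chosen, spent = [], 0
    for s in sorted(slots, key=lambda s: (not s.required, s.priority)):
        if s.required or spent + s.tokens <= budget:
            chosen.append(s)
            spent += s.tokens
    return chosen
```

The point of making this explicit is that dropping a slot is a visible, logged decision, not a silent truncation somewhere downstream.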
19. Latest developments
A few things from the recent literature and practice worth flagging for this design:
- Context rot is now a mainstream, documented concern, including in provider docs. Anthropic’s own context engineering guidance explicitly frames context as a finite resource with diminishing returns. Chroma’s 2025 context-rot study across 18 frontier models quantified the degradation and showed even one well-placed distractor can hurt performance. This is the empirical backbone for budget-below-the-limit designs like mine.
- Prompt caching has matured into a cost-and-latency lever, not a micro-optimization. Anthropic’s caching supports 5-min and 1-hour TTLs with different write multipliers (1.25× and 2×). A Feb 2026 systematic study “Don’t Break the Cache” evaluated prompt caching across OpenAI, Anthropic, and Google on long-horizon agentic workloads and confirmed roughly linear cost/TTFT benefits past the caching minimum — but also showed provider-specific strategies diverge, meaning cache design is now workload-specific engineering.
- “Memory-as-action” is emerging as a unified framing — treating compaction and retrieval as tool calls the agent itself decides to invoke. MemAct (2025) and related work show RL-trained memory policies can outperform handcrafted heuristics. Relevant for this design as an overlay on the heuristic baseline.
- AgentFold and IterResearch (late 2025) push proactive context management and Markovian state reconstruction for long-horizon web agents. The pattern: compress aggressively, carry forward a compact state, let the agent re-fetch detail on demand. This mirrors the just-in-time retrieval pattern Claude Code uses.
- Evaluation is shifting toward trajectory-level rather than answer-level metrics. TRACE and AgentLongBench evaluate how the agent gets to an answer, not just whether it arrives. UltraHorizon benchmarks trajectories averaging 200K tokens and 400+ tool calls — these are getting close to the regime a real long-horizon agent sees in production. For evaluation design, trajectory-level signals (efficiency, hallucination rate along the way, adaptivity) are what to optimize.
- LongMemEval and derived benchmarks (LoCoMo, AMA-Bench) have become the de facto evaluation substrate for memory systems. The MemMachine evaluation on LongMemEval showed retrieval-stage optimizations dominate over ingestion-stage optimizations — meaning how you query matters more than how you store. This sharpens my §8 emphasis on rerank and hybrid retrieval.
- Production agents are converging on a small set of primitives — rolling compaction, tool result clearing, sub-agent delegation, and durable note-taking to files. The Anthropic cookbook on context engineering demonstrates these composing well. My design reflects this consolidation.
Likely follow-up questions
- What if the user’s query depends on a detail that was dropped during compaction?
- How do you prevent a malicious user from using cross-session memory to leak information to another tenant?
- You said 50K effective context. What’s the evidence, and how would you tune K for a specific workload?
- Walk me through exactly what happens when a tool call crashes after side-effect but before trace write.
- How do you version the summarizer without invalidating all existing summaries?
- Your memory fact store has contradictions — how does the agent decide which to believe?
- What’s the cold-start experience? First session, no memory, how does the system behave?
- You have 10M users. At some point the vector index becomes the bottleneck. What’s your plan?
- Concretely, how do you detect that a specific memory fact is stale?
- An agent is in a tight loop making bad tool calls. How does your system break the loop?
Strong follow-up answers
1. Missing detail after compaction.
Three lines of defense. First, the structured facts sidebar: hard-to-regenerate details (IDs, numbers, commitments) are extracted verbatim at compaction time, not summarized. Second, just-in-time retrieval: the agent can query the raw transcript in cold store and fetch the specific span on demand. Third, coverage probes: a background eval tests sampled questions against the summary vs the source; systematic gaps trigger regeneration. If the interviewer pushes further on “but what if the agent doesn’t know it needs that detail” — that’s the fresh-eyes probe’s job, running a parallel session from source and comparing outcomes.
2. Cross-tenant leak prevention.
Tenant isolation is enforced at the index layer, not post-filter. Every retrieval query must carry a tenant_id and the index refuses queries without one. Embeddings for user A are never in the same shard as user B’s embeddings unless they share a tenant, and cross-shard retrieval is disabled by default. Memory writes go through an authorization layer that rejects writes to a different tenant. On top of that, at the encryption layer, each tenant’s data is encrypted with a tenant-specific key so even accidental cross-reads return ciphertext. Finally, audit logs on all cross-session reads make leaks detectable.
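As a sketch of what “enforced at the index layer, not post-filter” means in practice — the class and method names here are hypothetical, and a real system would sit this guard in front of the ANN index rather than a dict — the key property is that a query without a tenant_id cannot be expressed at all, and only the caller’s shard is ever searched:

```python
class TenantScopedIndex:
    """Illustrative guardrail: tenant_id is mandatory on every read and write,
    and data lives in per-tenant shards with no cross-shard fan-out."""

    def __init__(self):
        self._shards: dict[str, list[dict]] = {}  # tenant_id -> documents

    def write(self, tenant_id: str, doc: dict) -> None:
        if not tenant_id:
            raise PermissionError("write rejected: missing tenant_id")
        self._shards.setdefault(tenant_id, []).append(doc)

    def query(self, tenant_id: str, predicate) -> list[dict]:
        if not tenant_id:
            raise PermissionError("query rejected: missing tenant_id")
        # Only the caller's own shard is searched; other tenants' data
        # is structurally unreachable from this code path.
        return [d for d in self._shards.get(tenant_id, []) if predicate(d)]
```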
3. Evidence for 50K, and workload tuning.
Evidence: Chroma’s 2025 context-rot study shows measurable degradation on semantic retrieval tasks well before 200K; Anthropic’s own docs explicitly state that recall degrades universally with length. 50K is a conservative default for 200K-class models, targeting ~25% fill.
Tuning: for a specific workload, I’d run the actual eval set at several fill levels — say 20K, 50K, 100K, 150K — and plot task success vs cost. The knee of the curve is the target. For code agents with very structured content the ceiling is higher; for conversational tasks with lots of chatty turns it’s lower. The budget is a per-product configuration, not a platform constant.
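One way to operationalize “the knee of the curve” is a marginal-gain rule: keep raising the fill level until the extra task success per additional 1K tokens falls below a threshold. This is a hypothetical sketch — the threshold value and the sample numbers are illustrative, and a real tuning run would use the product’s own eval set:

```python
def knee(points: list[tuple[int, float]], min_gain_per_1k: float = 0.001) -> int:
    """points: sorted [(fill_tokens, task_success)] from eval sweeps.
    Returns the largest fill level still paying its way in marginal success."""
    best = points[0][0]
    for (f0, s0), (f1, s1) in zip(points, points[1:]):
        marginal = (s1 - s0) / ((f1 - f0) / 1000)  # success gained per extra 1K tokens
        if marginal < min_gain_per_1k:
            break
        best = f1
    return best
```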
4. Crash between side-effect and trace write.
The write-ahead log is the answer. Before the tool call fires, we write a tool_call_intent record with the idempotency key and a status = pending. The tool gateway executes the call. On return, we update to status = success with the response pointer, atomically. If the process crashes between send and WAL update, on resume we see pending with no terminal state. For idempotent tools, we retry with the same key — the upstream deduplicates. For non-idempotent tools (email, irreversible action), we don’t retry; we surface an “unknown outcome” to the controller, which either asks the user or issues a reconciliation query to the upstream (e.g., “did email X get sent?”) via a verification tool.
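The intent/outcome protocol above can be sketched as follows. This is a minimal illustration — the record fields and an in-memory dict standing in for the durable WAL are mine; the invariant is that a record stuck in `pending` after a restart means the side effect may or may not have happened, and only idempotent tools get retried:

```python
import uuid

WAL: dict[str, dict] = {}  # in-memory stand-in for a durable write-ahead log

def call_tool(tool, args: dict, idempotent: bool):
    key = str(uuid.uuid4())                              # idempotency key
    WAL[key] = {"status": "pending", "idempotent": idempotent}  # intent, pre-send
    result = tool(**args)                                # side effect fires here
    WAL[key].update(status="success", result=result)     # terminal state, post-return
    return key, result

def recover() -> list[tuple[str, str]]:
    """On restart: pending records mean the outcome is unknown.
    Idempotent tools are safe to retry with the same key; others
    need a reconciliation query (or a human) before proceeding."""
    return [("retry" if rec["idempotent"] else "reconcile", key)
            for key, rec in WAL.items() if rec["status"] == "pending"]
```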
5. Summarizer versioning.
Summaries carry summarizer_version. When the summarizer version bumps, existing summaries are not retroactively regenerated — that would be a massive batch job. Instead, lazy regeneration: on read, if the version is older than the current deployed version by more than N revisions and the summary is on a critical path, a background job enqueues regeneration. Newer version reads take priority. For catastrophic bugs in the new summarizer, we pin consumers to the old version while we fix. Versioned summaries with immutable history make this safe.
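The lazy-regeneration read path is small enough to show inline. A hedged sketch — the field names, the lag threshold, and the queue abstraction are illustrative choices, not a fixed schema; the load-bearing property is that reads are never blocked and regeneration happens in the background:

```python
CURRENT_VERSION = 7
MAX_VERSION_LAG = 2  # the "N revisions" from the prose; value illustrative

def on_read(summary: dict, regen_queue: list) -> dict:
    """Serve a stale-but-usable summary and enqueue regeneration if it is
    too far behind the deployed summarizer and sits on a critical path."""
    lag = CURRENT_VERSION - summary["summarizer_version"]
    if lag > MAX_VERSION_LAG and summary.get("critical_path"):
        regen_queue.append(summary["id"])  # background job, never inline
    return summary                         # the read itself is never blocked
```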
6. Contradiction resolution.
The memory fact store never silently overwrites. A new fact that contradicts an existing one gets stored as a new version with supersedes = old_id and a contradiction_flag. On retrieval, the later, higher-confidence version is surfaced by default, but the agent sees there’s a conflict. For high-stakes decisions, the agent is instructed to ask the user. For low-stakes, it goes with the surfaced version. Crucially, the record of the contradiction is preserved — we don’t delete the old version — so audit and rollback are possible.
7. Cold-start behavior.
First session, no memory. The system degrades to “just a chat agent with good prompt engineering.” System prompt, tool schemas, and rolling summary (empty initially) are there. No retrieval hits; the agent operates from current-session context only. As turns accumulate, working memory and episodic memory populate. Semantic memory starts populating as the agent extracts durable facts (with user confirmation for high-confidence claims). The key design point: the system must work correctly at zero memory. I’d test that explicitly in the eval harness.
8. Vector index scaling at 10M users.
The vector index hits several walls: write throughput, index size, query latency under load. Plan: (a) shard by tenant — tenant is the natural boundary and queries are tenant-scoped anyway, so there’s no cross-shard query. (b) Hot/cold tiering: recent-and-frequent items in an in-memory ANN index (HNSW or equivalent), older items in a disk-backed index; queries fan out if necessary but most hits are in the hot tier. (c) Time-decayed pruning: after a threshold, rarely-accessed memory falls out of the active index and is archived; it can be re-activated on explicit user request. (d) Embedding compression (product quantization) for cold items. (e) Per-tenant capacity limits — a pathological tenant doesn’t get unbounded storage without opting in.
9. Detecting stale facts.
Facts have a valid_from and optional valid_until. For facts with no valid_until, staleness is detected several ways: (a) temporal judge probe — when the fact is retrieved, the judge asks “is this fact the kind that can become stale?” and if so, checks whether recent sessions have evidence that contradicts it; (b) user correction events — if a user ever corrects a claim that traces back to a fact, the fact is flagged; (c) periodic re-confirmation — long-lived facts that are used often get re-confirmed with the user at natural checkpoints (“you still want me to sort by due date?”); (d) for facts with clear expiration semantics (e.g., “user’s current project is X”), the system applies a soft TTL and prompts a re-extraction.
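The soft-TTL check (mechanism d) is the easiest of these to show concretely. A sketch under stated assumptions — the field names follow the prose, timestamps are Unix seconds, and the 90-day default is an illustrative number, not a recommendation:

```python
DAY = 86400.0  # seconds

def needs_reconfirmation(fact: dict, now: float, soft_ttl_days: float = 90) -> bool:
    """Facts with an explicit valid_until expire hard at that time;
    open-ended facts trigger a re-confirmation prompt past the soft TTL."""
    if fact.get("valid_until") is not None:
        return now >= fact["valid_until"]
    return (now - fact["valid_from"]) / DAY >= soft_ttl_days
```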
10. Breaking a bad-tool-call loop.
Multiple guardrails. (a) Per-step tool call limit — 5 tool calls per step max. (b) Per-run tool call limit and cost budget — exceed and the run pauses with a controller alert. (c) Repetition detection — if the same tool is called with the same or near-identical arguments N times without progress, the controller forces a re-planning step. (d) Progress metric — the controller tracks whether the plan DAG is advancing. Stalled DAGs trigger re-planning after K steps. (e) Model-driven reflection — at milestones, the agent self-critiques; repeated failures at the same subgoal surface as an explicit “stuck” signal. (f) Backstop: a separate critic model reviews the last N steps periodically and can inject a stop.
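Guardrail (c), repetition detection, can be sketched in a few lines. This is an illustrative implementation with hypothetical thresholds — “near-identical” is reduced here to exact argument equality; a real system would also fuzzy-match arguments:

```python
from collections import deque

class RepetitionGuard:
    """Flag when the same tool is called with the same arguments
    N times within a sliding window of recent calls."""

    def __init__(self, n: int = 3, window: int = 10):
        self.n = n
        self.recent = deque(maxlen=window)

    def record(self, tool: str, args: dict) -> bool:
        """Returns True when the controller should force a re-planning step."""
        sig = (tool, tuple(sorted(args.items())))
        self.recent.append(sig)
        return list(self.recent).count(sig) >= self.n
```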
Where candidates sound mid-level instead of staff-level
- Treating “memory” as a single thing. Mid-level answers say “I’ll use a vector DB for memory.” Staff answers distinguish working memory, rolling summary, episodic, semantic, tool state, cached prefix, durable workflow state — seven layers with different semantics, stores, and policies. The layers matter more than any individual store.
- Defaulting to “just use a bigger context window.” The window is not where durability, retrieval, or quality live. Context rot means more tokens actively hurt beyond a point. A staff candidate explicitly budgets below the hard limit and defends it with numbers.
- Conflating cache and memory. Prompt caching is for latency and cost; memory is for reasoning continuity. They have completely different invalidation logic. If a candidate says “we’ll cache the summaries to make things faster” without distinguishing those concerns, they’re mid-level.
- No story for correctness when memory is wrong. Mid-level answers assume the memory is right. Staff answers have explicit mechanisms for detecting, correcting, and recovering: provenance, confidence, contradiction detection, regeneration from source, and fresh-eyes probes. The question “what if the summary dropped a detail” should have a crisp answer.
- No evaluation story, or eval-as-afterthought. Mid-level answers tack eval on at the end. Staff answers treat measurable quality preservation over horizon length as a requirement that drives the design, with specific online and offline metrics, in-prod ablations, and alerts. You can’t claim “quality stays high as context grows” without a way to observe that claim.