Control plane
1. What a staff engineer actually needs to know
What matters in interviews
- Crisp separation of control plane vs data plane
- Desired-state model and reconciliation loops
- Idempotency, retries, fencing
- Scheduler/controller separation
- Failure modes: leader failover, split brain, stale reads, thundering herd
- Tradeoff vocabulary: consistency vs availability, centralized vs sharded, push vs pull, static stability
Expected depth
- Whiteboard a control plane for a job scheduler / ML training platform / CDN edge orchestrator in 20 minutes
- Explain why each component exists and what breaks when it fails
- Reason about correctness under concurrent controllers, network partitions, and worker crashes
What you can safely ignore
- Raft/Paxos internals — say “back it with etcd/ZooKeeper/Spanner” and move on
- Specific Kubernetes CRD mechanics
- History of cluster managers (Borg → Omega → K8s trivia)
- Internals of specific schedulers (Sparrow, YARN). Know the archetypes, not the code.
2. Core mental model
Control plane = decides what should be true. Stores desired state, observes actual state, issues actions to reconcile the gap. Low QPS, high stakes.
Data plane = does the actual work. Serves traffic, runs jobs, moves packets. High QPS, low per-decision stakes.

    ┌──────────────────────────────────────────┐
    │              CONTROL PLANE               │
    │    (decisions, desired state, policy)    │
    │                                          │
    │   API ──► State Store ──► Controllers    │
    │                    ▲         │           │
    │                    │         ▼           │
    │                 status   commands        │
    └────────────────────┬─────────┬───────────┘
                         │         │
                      observe   actuate
                         │         │
    ┌────────────────────┴─────────┴───────────┐
    │                DATA PLANE                │
    │    Workers / Proxies / Runtime Agents    │
    │    (execute, serve traffic, run code)    │
    └──────────────────────────────────────────┘

Scheduling = deciding where and when work runs. A specialized controller whose job is placement under constraints.
Reconciliation loop = continuously compare desired vs actual; take corrective action; repeat. Not “do this once,” but “keep this true.”
Why this model wins: systems fail constantly. A one-shot imperative command loses information the moment it is issued. A reconciliation loop survives crashes, network blips, and partial failures because the intent is persisted and any healthy controller can resume convergence.
Three properties that define a good control plane
- Control — clear authority over what actions are issued
- Convergence — observed state approaches desired state over time
- Correctness under failure — no corruption, no duplicate destructive actions, no permanent divergence
3. Essential components
API / Desired-state store
- What: accepts writes of intent (“I want 10 replicas of service X”). Validates, authorizes, persists.
- Why: single source of truth. Decouples wanting from achieving.
- Failure/scale: becomes bottleneck under write load; stale reads if replicated async; large objects degrade watches.
- Interview signal: “Strongly consistent store (etcd/Spanner/FoundationDB); optimistic concurrency via resource versions.”
Controller / Operator
- What: runs the reconciliation loop for a resource type. Reads desired + observed, computes diff, issues actions.
- Why: separates policy (what) from mechanism (how). Each resource type gets its own controller.
- Failure/scale: if it crashes mid-action, something else must take over without duplicating destructive work.
- Interview signal: “One active leader per control loop, fencing via lease, idempotent actions.”
Scheduler
- What: assigns pending work to available capacity. A specialized controller.
- Why: placement is a global optimization problem (fairness, packing, constraints) that doesn’t belong in per-resource controllers.
- Failure/scale: decisions go stale fast (node disappears, capacity changes); queue can grow unbounded.
- Interview signal: “Separate from executor; OCC so stale decisions get rejected, not applied.”
Worker / Agent
- What: runs on the data plane. Executes assigned work, reports status.
- Why: local execution with local knowledge (disk, GPU, processes).
- Failure/scale: worker crash mid-task → task in unknown state; heartbeats flood the control plane at scale.
- Interview signal: “Agents pull work rather than having it pushed to them; idempotent task execution; heartbeats with bounded frequency + jitter.”
State store / metadata store
- What: persists desired + observed state, leader election, locks.
- Why: durable ground truth; a crash without it wipes intent.
- Failure/scale: etcd-class stores cap at a few GB and tens of thousands of watches; hot keys kill performance.
- Interview signal: “Shard by resource type or tenant once you outgrow a single cluster. Not a general-purpose DB.”
Event stream / watch mechanism
- What: push-based notification on state change. Controllers watch rather than poll.
- Why: polling doesn’t scale; watches give low-latency reconciliation.
- Failure/scale: watches drop (network, server restart); events can duplicate, reorder, or be lost.
- Interview signal: “Treat watches as cache-invalidation hints, not ground truth. Periodic full resync + per-event reconcile. Events must be idempotent in handling.”

         ┌─────────┐  write desired   ┌──────────────┐
    User │ Client  │ ───────────────► │  API Server  │
         └─────────┘                  └──────┬───────┘
                                             │ persist (w/ version)
                                             ▼
                                      ┌──────────────┐
                                      │ State Store  │◄──┐
                                      │ (etcd-like)  │   │
                                      └──────┬───────┘   │
                                       watch │           │ status
                                      events ▼           │ updates
                                      ┌──────────────┐   │
                                      │  Controller  │───┘
                                      │    (loop)    │
                                      └──────┬───────┘
                                             │ actuate
                                             ▼
                                      ┌──────────────┐
                                      │    Worker    │
                                      └──────────────┘

4. Reconciliation loops
The single most important concept in this domain.

           ┌─────────────────────────────────┐
           │                                 │
           ▼                                 │
    ┌─────────────┐                          │
    │   OBSERVE   │  read actual state       │
    │   current   │  from workers/store      │
    └──────┬──────┘                          │
           │                                 │
           ▼                                 │
    ┌─────────────┐                          │
    │   COMPARE   │  desired vs actual       │
    │   compute   │  → diff / plan           │
    │    diff     │                          │
    └──────┬──────┘                          │
           │                                 │
           ▼                                 │
    ┌─────────────┐                          │
    │     ACT     │  issue idempotent,       │
    │  corrective │  bounded corrective      │
    │   action    │  action                  │
    └──────┬──────┘                          │
           │                                 │
           └───── wait / re-trigger ─────────┘

Why reconciliation beats one-shot orchestration
| One-shot orchestration | Reconciliation |
|---|---|
| “Run this sequence of steps” | “Make reality match this spec” |
| If step 3 crashes, workflow is stuck | Loop resumes from current observed state |
| Hard to handle drift (manual changes, failures) | Drift is just another diff to reconcile |
| Retries require explicit plumbing | Retries are the default mode |
| State lives in the orchestrator | State lives in the store; orchestrator is stateless |
Idempotency is non-negotiable
Every action must be safe to repeat.
- create_if_not_exists(pod, spec) not create(pod)
- set_replicas(n) not add_replica()
- IDs derived deterministically from desired state, so retries collide and dedupe at the store.
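
A minimal Python sketch of these three rules. The in-memory `Store` class, the `deterministic_id` helper, and the `service-x` names are hypothetical stand-ins for illustration, not any real store's API.

```python
import hashlib
import json


def deterministic_id(owner: str, spec: dict) -> str:
    """Derive a stable ID from the desired state, so a retry maps to the same key."""
    payload = json.dumps({"owner": owner, "spec": spec}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]


class Store:
    """Hypothetical in-memory stand-in for the state store."""

    def __init__(self) -> None:
        self.objects: dict = {}

    def create_if_not_exists(self, obj_id: str, spec: dict) -> bool:
        """Return True if the object was created, False if it already existed."""
        if obj_id in self.objects:
            return False
        self.objects[obj_id] = dict(spec)
        return True

    def set_replicas(self, obj_id: str, n: int) -> None:
        """Absolute write: applying it once or many times gives the same state."""
        self.objects[obj_id]["replicas"] = n


store = Store()
spec = {"image": "svc-x:1.4", "replicas": 3}
pod_id = deterministic_id("service-x", spec)

first = store.create_if_not_exists(pod_id, spec)
retry = store.create_if_not_exists(pod_id, spec)  # same ID, deduped at the store
store.set_replicas(pod_id, 10)
store.set_replicas(pod_id, 10)  # harmless repeat
```

Because the ID is a hash of the desired state, the retried create lands on the same key and is recognized as a duplicate instead of producing a second object.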
Eventual convergence means
- At any instant, actual may diverge from desired
- Given enough time without new changes, it converges
- Design goal: no reconcile step leaves the system further from desired than before (monotonic progress, where possible)
Duplicate events must be tolerated
Watch streams can deliver the same event twice, or an event may fire while a full resync is in flight. The controller must produce the same final action regardless of event cardinality. Rule: key your actions off the observed state, not off the triggering event.
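
A small sketch of that rule, assuming a toy model where observed state is just a set of replica names. The `reconcile` function and the `replica-N` naming are illustrative assumptions. The event is only a trigger; the plan is computed from state alone.

```python
def reconcile(desired: int, actual: set) -> list:
    """One observe/compare/act pass over a mutable set of replica names.

    Returns the actions taken. Nothing here reads the triggering event.
    """
    wanted = {f"replica-{i}" for i in range(desired)}
    actions = []
    # Scale up: names are deterministic, so a repeated pass cannot double-create.
    for name in sorted(wanted - actual):
        actual.add(name)
        actions.append(f"create {name}")
    # Scale down / drift repair: remove anything outside the desired set.
    for name in sorted(actual - wanted):
        actual.discard(name)
        actions.append(f"delete {name}")
    return actions


actual = {"replica-0", "replica-7"}  # replica-7 is drift
# The same object is triggered three times: two duplicate events and a resync.
results = [reconcile(3, actual) for _event in ("update", "update", "resync")]
```

The first pass does all the work; the duplicate event and the resync find nothing to do, so event cardinality does not change the outcome.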
Retries with backoff

    event arrives ──► enqueue ──► dequeue ──► reconcile
                         ▲                        │
                         │   on error:            │
                         └── requeue with ────────┘
                             exp backoff + jitter

- Transient (network, rate limit) → exponential backoff + jitter
- Permanent (bad spec, auth) → surface in status, stop retrying
- Per-resource retry budget so one sick object can’t starve the queue
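
One way the three rules above could fit together, as a sketch. The exception classes, `handle`, and the default `base`, `cap`, and `budget` values are assumptions chosen for illustration; the jitter variant shown is "full jitter" (uniform between zero and the exponential ceiling).

```python
import random
from typing import Callable, Optional


class TransientError(Exception):
    """Network blip or rate limit: retrying may succeed."""


class PermanentError(Exception):
    """Bad spec or auth failure: retrying cannot help."""


def backoff_delay(attempt: int, rng: random.Random,
                  base: float = 0.5, cap: float = 60.0) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def handle(key: str, reconcile: Callable[[str], None], attempts: dict,
           status: dict, rng: random.Random, budget: int = 5) -> Optional[float]:
    """Run one reconcile. Return a requeue delay in seconds, or None to stop."""
    try:
        reconcile(key)
    except PermanentError as exc:
        # Surface in status and stop retrying.
        status[key] = f"permanent failure: {exc}"
        return None
    except TransientError:
        attempts[key] = attempts.get(key, 0) + 1
        if attempts[key] >= budget:
            # Per-resource budget: one sick object cannot occupy the queue forever.
            status[key] = "retry budget exhausted"
            return None
        return backoff_delay(attempts[key], rng)
    attempts.pop(key, None)  # success resets the budget
    status[key] = "ok"
    return None
```

The caller requeues the key after the returned delay; a `None` return means the key leaves the queue until a new event or resync brings it back.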
5. Scheduling at interview depth
Scheduling = “I have N pending units of work and M units of capacity; match them under constraints.”
Placement decisions weigh
- Resource requirements — CPU, memory, GPU type, accelerator count
- Affinity / anti-affinity — “put these together” or “spread these apart”
- Constraints — zone, hardware class, license, data locality
- Fairness — no tenant monopolizes shared capacity (DRF, weighted shares)
- Priority / preemption — high-priority work can evict lower
Bin packing vs spreading

          BIN PACKING                         SPREADING
     (maximize utilization)            (maximize availability)

    ┌────────┐  ┌────────┐            ┌────────┐  ┌────────┐  ┌────────┐
    │ A B C  │  │        │            │   A    │  │   B    │  │   C    │
    │ D E    │  │        │            │        │  │        │  │        │
    └────────┘  └────────┘            └────────┘  └────────┘  └────────┘
      Node 1      Node 2                Node 1      Node 2      Node 3

    + cost efficient                  + failure isolation
    + fewer nodes running             + predictable tail latency
    − one node failure = big blast    − wasted capacity
    − noisy neighbors                 − higher infra cost

Real schedulers blend both: pack within a failure domain, spread across failure domains.
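
The two policies differ only in the scoring function applied to feasible nodes. A sketch under simplifying assumptions: a single resource dimension, and hypothetical node names and capacities.

```python
from typing import Callable, Optional


def pack_score(used: float, capacity: float, request: float) -> float:
    """Bin packing: prefer the node that ends up fullest."""
    return (used + request) / capacity


def spread_score(used: float, capacity: float, request: float) -> float:
    """Spreading: prefer the node that ends up emptiest."""
    return 1.0 - (used + request) / capacity


def pick_node(nodes: dict, request: float,
              score: Callable[[float, float, float], float]) -> Optional[str]:
    """Filter to nodes with room for the request, then take the best score."""
    feasible = {n: uc for n, uc in nodes.items() if uc[0] + request <= uc[1]}
    if not feasible:
        return None  # unschedulable for now; the caller would requeue
    # Sorting first makes ties deterministic.
    return max(sorted(feasible),
               key=lambda n: score(feasible[n][0], feasible[n][1], request))


# node name -> (used, capacity), in arbitrary units
nodes = {"node-1": (6.0, 8.0), "node-2": (1.0, 8.0), "node-3": (7.5, 8.0)}
packed = pick_node(nodes, 1.0, pack_score)    # fullest node that still fits
spread = pick_node(nodes, 1.0, spread_score)  # emptiest node
```

A blended policy could apply `spread_score` when choosing a failure domain and `pack_score` when choosing a node inside it.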
Queueing of pending work
A scheduler is fed by one or more queues. Queues let you:
- Decouple arrival rate from decision rate
- Apply fairness across tenants
- Retry un-schedulable work without blocking the loop
Preemption (high level)
If a higher-priority task can’t be placed, find lower-priority tasks to evict. Must be done carefully — preempting mid-training-step is expensive. Most schedulers use graceful preemption with a drain period (PDBs, graceful shutdown signals).
Why separate scheduling from execution
- Scheduler is a decision engine: “this pod → that node.”
- Executor is a mechanism: “make this pod run on this node.”
- Scheduler crashes shouldn’t affect running workloads.
- Schedulers can be stateless and replaceable; executors keep local runtime state (mount points, process supervision).
What makes scheduling hard at scale
- Decision rate — 10K nodes × churn = many decisions/sec. Sequential schedulers cap out.
- Stale view — by the time you place a pod, the chosen node may be full. Use OCC: write fails, you re-schedule.
- Constraint complexity — NP-hard in general. Approximate: score + pick top-K.
- Heterogeneous hardware — GPU types, TPU pods, FPGAs, mixed generations.
- Fairness vs throughput — strict fairness starves latency-sensitive work; pure throughput starves small tenants.
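
The stale-view bullet can be sketched as a versioned bind. `NodeStore`, `Conflict`, and `schedule` are hypothetical names; the point is only that the scheduler decides from a snapshot and the store rejects the bind if the node changed since that snapshot was taken.

```python
from typing import Optional


class Conflict(Exception):
    """The node record changed since the scheduler read it."""


class NodeStore:
    """Hypothetical store: each node record carries a version bumped on write."""

    def __init__(self, free: dict) -> None:
        self.nodes = {n: {"free": f, "version": 1} for n, f in free.items()}

    def snapshot(self) -> dict:
        return {n: dict(rec) for n, rec in self.nodes.items()}

    def bind(self, node: str, request: int, expected_version: int) -> None:
        rec = self.nodes[node]
        if rec["version"] != expected_version or rec["free"] < request:
            raise Conflict(node)
        rec["free"] -= request
        rec["version"] += 1


def schedule(store: NodeStore, request: int,
             view: Optional[dict] = None, max_tries: int = 3) -> Optional[str]:
    """Pick the node with the most free capacity; on conflict, refresh and retry."""
    for _ in range(max_tries):
        view = view or store.snapshot()
        candidates = [n for n, rec in view.items() if rec["free"] >= request]
        if not candidates:
            return None
        node = max(sorted(candidates), key=lambda n: view[n]["free"])
        try:
            store.bind(node, request, view[node]["version"])
            return node
        except Conflict:
            view = None  # the view was stale; re-read before the next attempt
    return None


store = NodeStore({"a": 4, "b": 2})
stale_view = store.snapshot()
store.bind("a", 4, expected_version=1)  # another scheduler fills node "a" first
placed = schedule(store, 2, view=stale_view)
```

The first attempt targets node `a` from the stale view and is rejected; the retry re-reads the store and places the work on `b`.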
Scheduler architectures

    MONOLITHIC           TWO-LEVEL (Mesos-like)     SHARED STATE (Omega-like)
    single scheduler     framework schedulers       many schedulers
    sees everything      negotiate offers           contend via OCC

    ┌────────┐           ┌────────┐  ┌────────┐     ┌──────┐┌──────┐┌──────┐
    │ Sched  │           │Sched A │  │Sched B │     │Sched1││Sched2││Sched3│
    └───┬────┘           └──┬─────┘  └────┬───┘     └──┬───┘└──┬───┘└──┬───┘
        │                   └── offers ───┘            └───────┼───────┘
        ▼                          ▼                           ▼
     Cluster                    Cluster                   Shared State
                                (broker)                  (OCC writes)

    simple, limited      scales horizontally        scales, conflicts
    throughput

6. Must-know concepts
- Desired state — what the user asked for (declarative spec)
- Observed state — what the system currently reports (actual reality)
- Convergence — observed → desired over time, absent new changes
- Control loop — observe/compare/act, repeating
- Idempotent action — safe to apply 1 or N times with the same final effect
- Work queue — buffer of pending reconciliations, typically per-key rate-limited
- Leader election — exactly one active controller per loop; backed by a consensus store via a lease
- Lease / fencing token — monotonically increasing token attached to actions; state store rejects writes with stale tokens
- Optimistic concurrency (OCC) — read version → compute change → write-if-version-unchanged; retry on conflict
- Retry with backoff — exponential + jitter; distinguish transient from permanent
- Stale state — cached/watched data that lags reality; handled by resync and versioned writes
- Drift — observed diverges from desired without a new intent change (manual edit, partial failure, external actor)
- Rollout — apply new desired state gradually (canary, %-based, wave-by-wave)
- Rollback — revert desired state to previous; reconciler converges back
Fencing in one picture

    t=0  Controller A acquires lease, token = 7
    t=1  Controller A gets GC-paused for 30s
    t=2  Lease expires; Controller B acquires lease, token = 8
    t=3  Controller B issues writes with token 8            ─► ACCEPTED
    t=4  Controller A wakes up, issues writes with token 7  ─► REJECTED (store checks token)

    Without fencing: A and B both write. Corruption.
    With fencing:    A's stale writes are rejected at the store.

OCC in one picture

    Reader:   GET obj  ─► {value=X, version=42}
    Compute:  new = f(X)
    Writer:   PUT obj if version==42  ─► either SUCCESS (version=43)
                                         or CONFLICT (someone else wrote)
              on conflict: re-read and retry

7. Failure and scaling issues
Controller crash
- Another replica takes over via leader election.
- In-flight action must be idempotent so the successor doesn’t duplicate it.
- Work queue is rebuilt by a full resync on startup.
Scheduler leader failover
- Active scheduler dies; standby wins lease after timeout (typically 5–15s).
- Pending decisions not yet persisted are lost; new leader re-derives from the queue.
- Decisions that were persisted are picked up by executors as normal.
Duplicate work
- Two controllers each think they’re leader during a partition.
- Or a single controller retries a non-idempotent action.
- Mitigations: fencing tokens, idempotency keys, exactly-once-effect via the store (not the network).
Stale reads from state store
- Async replicas may lag the leader.
- Correctness-critical reads → leader. Status/list reads → followers OK.
- Use resource versions on conditional writes.
Split brain

             Network Partition
                     │
           ┌─────────┴─────────┐
           ▼                   ▼
       ┌───────┐           ┌───────┐
       │ Ctrl A│           │ Ctrl B│
       │ lease │           │ lease │
       │  = 7  │           │  = 8  │
       └───┬───┘           └───┬───┘
           │                   │
           ▼                   ▼
      writes w/ 7         writes w/ 8
           │                   │
           └─────► Store ◄─────┘
                     │
        accepts only highest token
               → A rejected

Root causes: lease timeout < max clock skew, long partition, GC pause.
Mitigations: fencing at the store, short leases + monotonic tokens, never trust “I am leader” without a fresh lease check just before writing.
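
Store-side fencing can be sketched in a few lines. `LeaseService` and `FencedStore` are hypothetical classes written for this example; the token values 7 and 8 mirror the scenario above.

```python
class StaleToken(Exception):
    """The write carried a token older than one the store has already seen."""


class LeaseService:
    """Hypothetical lease issuer: every grant carries a strictly larger token."""

    def __init__(self, start: int = 0) -> None:
        self._token = start

    def acquire(self) -> int:
        self._token += 1
        return self._token


class FencedStore:
    """Remembers the highest token seen and rejects anything older."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data: dict = {}

    def write(self, key: str, value: str, token: int) -> None:
        if token < self.highest_token:
            raise StaleToken(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value


leases = LeaseService(start=6)
store = FencedStore()

token_a = leases.acquire()  # 7: Controller A, which then stalls
token_b = leases.acquire()  # 8: Controller B takes over after the lease expires

store.write("shard-1/primary", "B", token_b)  # accepted
try:
    store.write("shard-1/primary", "A", token_a)  # A wakes up with a stale token
    a_rejected = False
except StaleToken:
    a_rejected = True
```

The check lives in the store, not in the controller, because a paused controller cannot be trusted to notice that its own lease has expired. Note that in this sketch the store only learns about token 8 once B writes; a write from A that arrives before B's first write would still be accepted.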
Worker crash during task execution
- Task is in unknown state: started? half-done? committed?
- Mitigations:
  - Tasks write checkpoints + status to control plane
  - Controller treats “unknown” as “retry after grace period”
  - Task operations keyed by task ID → idempotent
  - For non-idempotent external side effects, use idempotency tokens at the remote service or 2PC
Task stuck in unknown state
- Timeout-and-assume-failed after a conservative deadline.
- Before declaring dead, send a fencing signal (revoke worker’s lease/credentials) so the old worker can’t still act.
- Only then restart elsewhere.
Thundering herd after outage

    Outage ends  →  all agents reconnect at once   →  ░░░░░░░░░ API melts
    With jitter  →  reconnects spread over N min   →  ▒ ▒ ▒ ▒ ▒ steady load

Mitigations:
- Jittered reconnect backoff on agents
- Rate-limited admission at API server
- Watch resumes use resourceVersion to avoid full re-list
- Priority queues so critical reconciles happen first
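
Rate-limited admission is commonly built on a token bucket. A minimal sketch; the `TokenBucket` class and its `rate` and `burst` parameters are illustrative, and the clock is passed in explicitly so the behavior is easy to reason about.

```python
class TokenBucket:
    """Admit up to `burst` requests at once, refilling at `rate` tokens/second."""

    def __init__(self, rate: float, burst: float, now: float = 0.0) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = now

    def allow(self, now: float) -> bool:
        """Return True to admit the request, False to shed it (e.g. HTTP 429)."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=2.0, burst=3.0)
# Five agents reconnect in the same instant: the burst is admitted, the rest shed.
at_outage_end = [bucket.allow(now=0.0) for _ in range(5)]
# One second later two tokens have refilled.
one_second_later = [bucket.allow(now=1.0) for _ in range(3)]
```

Rejected agents fall back to their jittered reconnect backoff, which is what spreads the herd over time.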
Control-plane overload
- Symptoms: watch lag, reconcile queue growth, 503s on API.
- Mitigations: shard controllers by namespace/tenant, separate read/write paths, back-pressure at API, bounded work queues with drop policies for low-priority items.
Slow convergence
- Reconcile interval too long, or queue starvation on hot keys.
- Mitigation: per-key work queues with rate limiting; priority lanes for user-facing resources; watch-driven triggers instead of pure polling.
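
A sketch of the per-key queue idea, showing only the deduplication half (rate limiting is left out). The `WorkQueue` class and its method names are assumptions for this example. A key that is already queued is not queued again, and a key that changes while it is being processed is re-run once afterwards.

```python
from collections import deque
from typing import Optional


class WorkQueue:
    """Dedupes by key so a hot object occupies at most one queue slot."""

    def __init__(self) -> None:
        self._queue: deque = deque()
        self._queued: set = set()
        self._processing: set = set()
        self._dirty: set = set()

    def add(self, key: str) -> None:
        if key in self._processing:
            self._dirty.add(key)  # re-run once the current pass finishes
            return
        if key not in self._queued:
            self._queued.add(key)
            self._queue.append(key)

    def get(self) -> Optional[str]:
        if not self._queue:
            return None
        key = self._queue.popleft()
        self._queued.discard(key)
        self._processing.add(key)
        return key

    def done(self, key: str) -> None:
        self._processing.discard(key)
        if key in self._dirty:
            self._dirty.discard(key)
            self.add(key)

    def __len__(self) -> int:
        return len(self._queue)


q = WorkQueue()
for _ in range(1000):
    q.add("hot")  # 1000 events collapse into one queue entry
q.add("cold")
depth_after_burst = len(q)

current = q.get()  # "hot" is now being reconciled
q.add("hot")       # it changes again mid-reconcile
q.done(current)    # so it is requeued, behind "cold"
order = [q.get(), q.get()]
```

Since the reconciler reads current state rather than the event payload, collapsing a thousand events into one queue entry loses nothing.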
8. Multi-controller and coordination concerns
One active leader per control loop, many replicas for HA

    ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
    │  Replica 1  │  │  Replica 2  │  │  Replica 3  │
    │  [LEADER]   │  │  (standby)  │  │  (standby)  │
    │ holds lease │  │             │  │             │
    └─────────────┘  └─────────────┘  └─────────────┘
           │
           ▼
        actuates

Standby is cheap; failover is fast (seconds); correctness enforced via fencing.
Sharding control responsibilities

    Resources    A–H        I–P        Q–Z
                  │          │          │
                  ▼          ▼          ▼
               Ctrl 1     Ctrl 2     Ctrl 3
              (leader)   (leader)   (leader)

Split the keyspace (hash / tenant / zone) across leader-per-shard. Each shard is an independent control loop with its own lease. A resharding protocol is needed when topology changes — hand off cleanly, fence the old owner before the new owner starts acting.
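
Hash-based assignment is the simplest of the three splits to sketch. `shard_for` and `owns` are hypothetical helpers; a stable hash is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a resource key to a shard index using a process-independent hash."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def owns(key: str, my_shard: int, num_shards: int) -> bool:
    """A shard leader reconciles a key only if the key maps to its shard."""
    return shard_for(key, num_shards) == my_shard


keys = [f"tenant-{i}/job-{j}" for i in range(5) for j in range(4)]
owners_per_key = [sum(owns(k, s, 3) for s in range(3)) for k in keys]
```

Plain modulo hashing moves most keys when `num_shards` changes, which is one reason a handoff protocol is needed rather than simply restarting with a new shard count.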
Avoiding conflicting actuators
- Two controllers must never own the same field on the same object.
- Enforce via schema (field ownership) or single-writer-per-resource-type.
- Kubernetes solves this with server-side apply field ownership. Simpler systems just say “only the deployment controller writes replica counts.”
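
A toy model of per-field ownership, to make the rule concrete. This is a deliberate simplification and not the algorithm Kubernetes server-side apply uses; `OwnedObject`, `OwnershipConflict`, and the manager names are invented for the example.

```python
class OwnershipConflict(Exception):
    """A manager tried to write a field that another manager owns."""


class OwnedObject:
    """Tracks which manager owns each field; the first writer becomes the owner."""

    def __init__(self) -> None:
        self.fields: dict = {}
        self.owners: dict = {}

    def apply(self, manager: str, changes: dict) -> None:
        # Validate every field first so a rejected apply changes nothing.
        for field in changes:
            owner = self.owners.get(field)
            if owner is not None and owner != manager:
                raise OwnershipConflict(f"{field} is owned by {owner}")
        for field, value in changes.items():
            self.fields[field] = value
            self.owners[field] = manager


obj = OwnedObject()
obj.apply("deployment-controller", {"replicas": 3})
obj.apply("status-reporter", {"ready": 2})        # different field: fine
obj.apply("deployment-controller", {"replicas": 4})  # same owner: fine
try:
    obj.apply("autoscaler", {"replicas": 9, "ready": 0})
    conflict = False
except OwnershipConflict:
    conflict = True
```

The conflicting apply is rejected as a whole, so the object never ends up with a mix of two controllers' intent.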
Centralized scheduling is fine when
- < ~10K decisions/sec
- Uniform constraints
- Single administrative domain
- You want strict global fairness/packing
Distributed / hierarchical scheduling is needed when
- Federated multi-region systems
- Extreme throughput (Sparrow-style random sampling)
- Heterogeneous workloads with specialized schedulers (batch vs online vs interactive)

        HIERARCHICAL SCHEDULING

            ┌──────────────┐
            │ Global Sched │  (routes jobs to regions)
            └──────┬───────┘
                   │
           ┌───────┼────────┐
           ▼       ▼        ▼
        Region1 Region2  Region3
         sched   sched    sched
           │       │        │
           ▼       ▼        ▼
         nodes   nodes    nodes

9. Interview reasoning patterns
Q: Why reconciliation instead of imperative orchestration? Reconciliation makes failure recovery free. Intent is persisted once; the loop always picks up from current reality. An imperative workflow has to encode every failure branch explicitly and keeps state in the orchestrator, which becomes a SPOF. Reconciliation is also naturally tolerant of drift — external changes just look like a diff to fix.
Q: What belongs in control plane vs data plane? Control plane: desired state, decisions, policy, placement, coordination. Data plane: the actual work — serving requests, running containers, moving packets. Rule of thumb: anything in the request hot path is data plane; anything that decides what should exist is control plane. Goal: control-plane outages must not take down the data plane (static stability).
Q: How do you make control actions safe under retries? Make them idempotent. Use deterministic IDs derived from desired state so retries collide. For non-idempotent side effects, use idempotency tokens on the downstream API or 2PC if supported. Separate “decide” from “act”: the decision can be logged and recovered; the act references the logged decision ID.
Q: How do you avoid duplicate or conflicting work? Single-leader-per-loop via lease. Fencing tokens on all mutating actions — the store rejects writes with stale tokens, so a paused-then-resumed old leader can’t corrupt state. Field ownership so two controllers never write the same field. OCC on all updates.
Q: How do you design scheduler failover? Stateless schedulers with leader election. All pending work in the persisted queue, never in scheduler memory. New leader re-derives its view on startup. Decisions use OCC so stale decisions (from before failover) get rejected rather than applied. Target failover in < 30s.
Q: How do you reason about convergence and correctness? Define desired and observed state precisely. Show each reconcile step makes monotonic progress or is a no-op. Identify invariants the system must never violate (“no two primaries for the same shard”) and show they hold across any interleaving — usually by reducing to the lease/fencing mechanism.
Q: What are the first bottlenecks and operational pain points? (1) State store write QPS and watch fan-out. (2) Reconcile queue backlog on popular resource types. (3) Scheduler decision latency as node count grows. (4) Thundering herd after control-plane restarts. Mitigations in order: watch caching, per-key rate limits, shard controllers, jittered agent reconnects, priority queues.
10. Common candidate mistakes
- Drawing the control plane as just an API server. API is one component. Without controllers, scheduler, and state store, nothing converges. Interviewers want to see the loop.
- Ignoring desired vs actual state. Treating the system as “submit job → run job” misses the entire reason control planes exist.
- Hand-waving retries and idempotency. “We’ll retry on failure” without addressing dedup / idempotency is a red flag.
- Ignoring stale state and race conditions. Assuming your cache matches the store. Assuming your “leader” status is still valid 10s after you checked.
- Coupling scheduler and executor. If the scheduler dies, running jobs shouldn’t care. If you can’t explain why they’re separate, you don’t understand the architecture.
- Ignoring leader election / fencing for mutating actions. Two controllers mutating the same resource is a data-corruption incident waiting to happen.
- Assuming events are reliable and unique. Watches drop, duplicate, and reorder. The reconciler must be correct under any event sequence.
- Ignoring backlog growth. “The controller will reconcile it eventually” with no bound on queue depth or per-key fairness.
- Treating the state store as a general-purpose DB. etcd is not Postgres. Large objects, high write rates, and watches don’t mix.
- Mixing control-plane and data-plane SLAs. Control plane: low-QPS / high-consistency. Data plane: high-QPS / high-availability. Designing both against the same tier causes misery.
11. Final cheat sheet
Control plane vs data plane
| Aspect | Control plane | Data plane |
|---|---|---|
| Purpose | Decide what should exist | Execute the work |
| QPS | Low (ops/sec) | High (req/sec) |
| Consistency | Strong (for decisions) | Eventual is often fine |
| Latency budget | Seconds OK | ms or μs |
| Failure impact | Can’t change state; existing work keeps running | User-visible outage |
| Typical store | etcd / Spanner / ZK | In-memory / local disk / sharded DB |
| Primary goal | Correctness & convergence | Throughput & availability |
| Scaling axis | Shard by resource / tenant | Shard by request / key |
Scheduler vs controller
| Aspect | Scheduler | Controller |
|---|---|---|
| Job | Placement decisions (where/when) | Drive a resource to desired state |
| Input | Pending work + capacity view | Desired spec + observed status |
| Output | Binding (work → resource) | CRUD actions on managed objects |
| Scope | Cross-object global optimization | One resource type |
| Statefulness | Mostly stateless | Mostly stateless (state in store) |
| Failure mode | Queue grows until leader recovers | Drift grows until leader recovers |
| Typical count per system | 1 (or a few specialized) | Many (one per resource type) |
Decision framework

    Given a request to design a distributed system:

    1. Is there intent separate from execution?
       → Yes: declarative control plane with desired state
       → No:  maybe a pure data plane (e.g., a KV store) — skip control plane

    2. Does state need to converge under failure?
       → Yes: reconciliation loop, not imperative workflow

    3. Is there placement/allocation of work across resources?
       → Yes: separate scheduler from controller
       → No:  single controller per resource type is enough

    4. Multiple writers to the same field possible?
       → Yes: leader election + fencing + field ownership
       → No:  single controller, OCC for concurrent clients

    5. What's the state store?
       → Strongly consistent: etcd / Spanner / FoundationDB
       → Size/QPS > single cluster: shard by tenant/resource type

    6. How do workers learn about work?
       → Small scale: push from control plane
       → Large scale: pull + watch, with jittered reconnect

    7. What's the blast radius of control-plane failure?
       → Data plane should keep serving (static stability)
       → Existing assignments persist; only new changes are blocked

10 likely interview questions with short strong answers
1. Control plane vs data plane — what’s the difference and why does it matter? Control plane decides; data plane does. Separating them lets data plane survive control-plane outages (static stability) and lets you scale them independently.
2. Why reconciliation instead of orchestration? Reconciliation tolerates drift, crashes, and external changes by design. Orchestration has to encode every failure branch explicitly and keeps state in a fragile orchestrator.
3. How do you guarantee a control action isn’t applied twice? Idempotent actions keyed by deterministic IDs, fencing tokens rejected by the state store on stale leaders, idempotency keys on downstream APIs. Exactly-once effect, not exactly-once delivery.
4. How do you handle a controller crash mid-action? A replica takes over via leader election. On startup it full-re-lists to rebuild queue state. The in-flight action was either idempotent (safe to redo) or its effect was persisted (recognized as done via observed state).
5. What’s a fencing token and why do you need it? A monotonic token attached to the lease. The state store rejects writes with outdated tokens. Without it, a GC-paused or partitioned old leader can corrupt state after a new one is elected.
6. How does a scheduler handle stale node state? OCC: bind pod→node with a resource version; if the node’s state changed, the write fails and the pod goes back in the queue for rescheduling.
7. How do you scale a control plane beyond a single etcd cluster? Shard controllers by tenant/namespace/resource-type, each with its own state store. Route at the API layer. Avoid cross-shard transactions; make cross-shard operations eventually consistent with reconciliation.
8. How do you avoid thundering herd after a control-plane restart? Jittered reconnect on agents, rate-limited API admission, watch resumes via resourceVersion (not full re-list), priority lanes for critical reconciles.
9. How do two controllers avoid fighting over the same resource? Field ownership (schema enforces who writes what) or single-writer-per-resource-type (only one controller type mutates it). Fencing prevents stale leaders of the same type from conflicting.
10. When would you use hierarchical scheduling instead of a single scheduler? Multi-region systems where a global scheduler can’t meet decision-rate or locality needs; very large clusters where a single scheduler is the bottleneck; heterogeneous workloads where specialized schedulers (batch vs online) benefit from separation.
Five phrases that signal staff-level thinking
- “I’d separate decide-from-act so retries are safe.”
- “Fencing tokens so the store rejects stale writes.”
- “Static stability — data plane should survive a control-plane outage.”
- “OCC on the binding so a stale scheduling decision gets rejected, not applied.”
- “Per-key work queue with rate limiting to avoid hot-key starvation.”