Analytical systems
1. What a staff engineer actually needs to know
Section titled “1. What a staff engineer actually needs to know”Interviewers probe depth in three places:
- Can you correctly decide when an analytical system is needed vs when OLTP + cache suffices.
- Can you reason about storage layout (row vs column), partitioning, sorting, and indexing as tradeoffs — not as “I add an index.”
- Can you articulate ingestion/freshness/query-latency tradeoffs without handwaving.
Expected depth:
- Explain columnar layout in 60 seconds, with compression and pruning reasoning.
- Pick partitioning + sort keys from a workload description and justify them.
- Name at least three index types, when each helps, when each hurts.
- Design an ingestion path that doesn’t block the query path.
- Describe why joins are usually denormalized or precomputed in analytics.
Ignore for most staff interviews:
- MVCC internals, WAL group commit, buffer pool replacement policies.
- Specific codec internals (Gorilla, FOR-delta details, ZSTD dictionary training).
- SIMD/vectorized execution internals beyond “it’s batched and cache-friendly.”
- Database-implementer depth of join algorithms (enough: hash, sort-merge, broadcast, shuffle).
- Specific product feature matrices. You need the shape of Pinot/Druid/ClickHouse, not their release notes.
2. Core mental model
Section titled “2. Core mental model”2.1 What an analytical data system is
Section titled “2.1 What an analytical data system is”A system optimized for reading large ranges of many rows to compute aggregates over few columns, with bulk ingestion and rare row-level updates. Reads dominate; writes arrive in batches or streams as append-only.
2.2 OLTP vs OLAP — the only distinction that matters
Section titled “2.2 OLTP vs OLAP — the only distinction that matters”OLTP (e.g. Postgres, MySQL) OLAP (e.g. ClickHouse, Druid, Pinot, BigQuery)───────────────────────────────── ────────────────────────────────────────────Workload: point lookups + updates Workload: scans + aggregationsQuery touches: 1–100 rows Query touches: 10^6 – 10^12 rowsColumns read: most of the row Columns read: 3–20 of 200Concurrency: thousands of small txns Concurrency: tens to hundreds of large queriesWrites: random, row-level, ACID Writes: bulk append, immutable segmentsLatency target: ms per query Latency target: ms–seconds per query, but over vastly more dataStorage layout: row-oriented Storage layout: column-orientedIndex: B-tree on PK + secondary Index: partitions, zone maps, bitmap, inverted2.3 Why analytical systems are optimized differently
Section titled “2.3 Why analytical systems are optimized differently”Three forcing functions:
- I/O is the bottleneck, not CPU. A 100-column table where you read 3 columns means row storage wastes 97% of bytes read. Column storage doesn’t.
- Values within a column are homogeneous, so compression ratios are 5–20×. Row storage fights you here — adjacent bytes are unrelated types.
- Queries are read-mostly and bulk, so you can make writes slow/batched to make reads fast, and you can make segments immutable (no per-row locking, no MVCC tax on reads).
2.4 Typical OLAP workloads
Section titled “2.4 Typical OLAP workloads”- Dashboards: pre-defined queries, p95 < 1s, high QPS, last N days.
- Ad hoc exploration: arbitrary slice/dice, p95 seconds to minutes, low QPS.
- ETL/reporting: nightly batch, minutes-to-hours acceptable.
- User-facing analytics: p95 < 200ms, high QPS, bounded query shape. (Pinot/Druid territory.)
2.5 “Just use Postgres” — when it breaks
Section titled “2.5 “Just use Postgres” — when it breaks”Postgres handles analytics fine until one of:
- Working set stops fitting in RAM (hot data ≫
shared_buffers). - Scan volume per query > ~10 GB and p95 matters.
- You need to aggregate billions of rows in seconds.
- Ingest rate exceeds what random-I/O row storage can sustain.
- Dashboard QPS × scan-per-query exceeds disk bandwidth.
At that point row storage + B-trees becomes structurally wrong — not “tune it harder” wrong. You need a columnar engine.
3. Essential design ideas
Section titled “3. Essential design ideas”The seven concepts below are the vocabulary. You must speak all of them fluently.
3.1 Columnar storage
Section titled “3.1 Columnar storage”- What: values of one column stored contiguously on disk, separate file/stream per column.
- Why: scans read only needed columns; homogeneous data compresses well; vectorized execution over tight loops.
- Tradeoff: row reconstruction costs more; row-level updates are expensive (touch N column files); wide-schema OLTP kills it.
- Say in interview: “Columnar cuts scan bytes proportional to columns-read / columns-total, and compression gives another 5–20× on top.”
3.2 Partitioning
Section titled “3.2 Partitioning”- What: physical split of data by a key, typically time (
day) or a low-cardinality categorical (region,tenant_id). - Why: partition pruning — queries with a predicate on the partition key skip whole partitions without reading them.
- Tradeoff: too fine-grained → many small files, metadata overhead; too coarse → no pruning benefit. Wrong key → no pruning at all.
- Say: “Partition by the dimension that’s in 90%+ of WHERE clauses, usually time.”
3.3 Sorting / clustering
Section titled “3.3 Sorting / clustering”- What: within a partition, rows sorted by a key or a composite key.
- Why: range scans, zone maps, and compression all benefit from locality. Sorted data compresses far better (RLE, delta).
- Tradeoff: only one sort order per storage copy; ingestion must sort or compaction must re-sort.
- Say: “Sort key = most selective frequent filter after the partition key.”
3.4 Indexing
Section titled “3.4 Indexing”- What: auxiliary structures (zone maps, bitmaps, inverted lists) that skip data.
- Why: avoid scanning what you don’t need.
- Tradeoff: storage overhead, write amplification, must be maintained during compaction.
- Say: “Indexes in OLAP are pruning aids, not lookup structures — selectivity and write cost decide.”
3.5 Pre-aggregation / materialization
Section titled “3.5 Pre-aggregation / materialization”- What: precompute aggregate tables or cubes (rollups, materialized views, star-trees).
- Why: trade storage + write-time compute for dramatically cheaper reads.
- Tradeoff: staleness, storage cost, query must match the cube’s dimensions or fall back to raw.
- Say: “If dashboards hit a fixed query shape at high QPS, pre-aggregation converts O(rows) reads into O(dimensions) reads.”
3.6 Compression
Section titled “3.6 Compression”- What: dictionary, RLE, delta, bit-packing, frame-of-reference, plus general-purpose (ZSTD, LZ4) on top.
- Why: smaller files = less I/O = faster scans. On modern hardware, decompression is faster than reading uncompressed bytes from disk.
- Tradeoff: CPU cost on decode, but almost always worth it. Encodings constrain update patterns.
- Say: “Compression isn’t about disk cost — it’s scan speed. I/O is the bottleneck.”
3.7 Append-heavy ingestion
Section titled “3.7 Append-heavy ingestion”- What: writes arrive as new immutable segments; updates/deletes are handled via tombstones + compaction or via eventual rewrite.
- Why: lets you build dense, sorted, compressed, indexed segments once; reads never fight writes.
- Tradeoff: updates are expensive; freshness bounded by segment-seal latency; need background compaction.
- Say: “Immutable segments are what lets columnar layouts exist at all — you can’t maintain 200 compressed column streams under random row updates.”
4. Columnar systems and data layout
Section titled “4. Columnar systems and data layout”This section is the one that separates strong candidates. Get it precise.
4.1 Row vs column layout — the picture
Section titled “4.1 Row vs column layout — the picture”Row store (Postgres page):┌────────────────────────────────────────────────────────┐│ row1: (user_id, ts, country, amount, status, ...) ││ row2: (user_id, ts, country, amount, status, ...) ││ row3: (user_id, ts, country, amount, status, ...) ││ row4: ... │└────────────────────────────────────────────────────────┘ ↑ Read one row = one page. Read SUM(amount) = read everything.
Column store (Parquet / ORC / native columnar):┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐│ user_id │ │ ts │ │ country │ │ amount ││ [101,102, │ │ [t1,t2,t3, │ │ [US,US,FR, │ │ [12.0,7.5, ││ 103,...] │ │ t4,...] │ │ US,...] │ │ 3.2,...] │└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ ↑ Read SUM(amount) = read one file. Read whole row = read N files + reconstruct.4.2 Why columnar helps scans and aggregation
Section titled “4.2 Why columnar helps scans and aggregation”- Bytes read:
bytes_read ≈ rows × sum(size(c) for c in projected_columns)vs row store’srows × sum(size(c) for c in all_columns). - Cache & SIMD: tight loop over a packed, type-homogeneous array → vectorized kernels, branch-predictable, prefetchable.
- Column-local predicates: filter on
country = 'US'by scanning one file, producing a bitmap, then materializing other columns only for matching rows (late materialization).
4.3 Why compression is better in columnar layouts
Section titled “4.3 Why compression is better in columnar layouts”Column values share type, distribution, and often sort order. That unlocks:
Dictionary: [US, US, FR, US, DE, US] → dict = [US:0, FR:1, DE:2] data = [0,0,1,0,2,0] (2-bit packed)
RLE: [0,0,0,0,1,1,2,2,2,2] → [(0,4),(1,2),(2,4)]
Delta: [1700, 1703, 1710, 1712] → [1700, +3, +7, +2]
Frame-of-ref: [1000003, 1000007, 1000012] → base=1000003, [0,4,9] (bit-packed)Row stores can’t use any of these well — adjacent bytes belong to unrelated fields.
4.4 Why columnar is bad for point writes / row updates
Section titled “4.4 Why columnar is bad for point writes / row updates”Updating one row in a 200-column table means touching 200 column streams, each compressed, often RLE’d across many rows. So OLAP systems:
- Make segments immutable.
- Handle updates via: delete-vector + re-insert, or merge-on-read, or periodic rewrite via compaction.
- Expose them at read time via a merge step (adds query cost).
This is why you don’t run OLTP on columnar engines.
4.5 How partitioning and sort order affect pruning and locality
Section titled “4.5 How partitioning and sort order affect pruning and locality”Query: SELECT SUM(amount) FROM events WHERE day = '2026-04-20' AND country = 'US' AND user_id = 42;
Partition pruning: day = '2026-04-20' → skip all other day partitionsZone-map pruning: country = 'US' → skip blocks whose min..max doesn't overlap 'US'Sort-key skip: user_id = 42 (if sorted) → seek into sorted runColumn scan: read `amount` column only → tiny I/O
I/O read: one day partition × surviving blocks × one column << full table scanPartitioning is coarse pruning (directories/files). Sort + zone maps are fine pruning (blocks within a file).
4.6 What staff-level candidates say about scan reduction
Section titled “4.6 What staff-level candidates say about scan reduction”Frame every design choice as how it reduces scan bytes:
- Partitioning → skip partitions.
- Sort + zone maps → skip blocks within partition.
- Column projection → skip columns.
- Compression → fewer bytes per surviving block.
- Pre-aggregation → skip the raw table entirely.
This layered pruning story is what interviewers want to hear.
5. Must-know indexing concepts
Section titled “5. Must-know indexing concepts”In OLAP, “index” mostly means “data-skipping structure.” Point lookups aren’t the common case.
5.1 Partition pruning
Section titled “5.1 Partition pruning”- Problem solved: avoid opening irrelevant files.
- When it helps: queries filter on the partition key (usually time).
- When it doesn’t: queries don’t filter on that key, or filter on a high-cardinality key you partitioned on (too many small files).
- Tradeoffs: essentially free at query time, but fine-grained partitioning causes metadata explosion and slow planning.
5.2 Zone maps / min-max metadata
Section titled “5.2 Zone maps / min-max metadata”- What: per block (e.g. per 64K-row chunk), store
min,max, null count for each column. - Problem solved: skip blocks whose value range can’t match the predicate.
- When it helps: predicate column is correlated with sort order. On sorted data: skips are dramatic.
- When it doesn’t: random/unsorted data → min/max spans the whole domain → no skipping.
- Tradeoff: cheap to store, cheap to maintain, useless without sort correlation.
Sorted by user_id: Unsorted:Block 1: min=1, max=1000 Block 1: min=1, max=999999Block 2: min=1001,max=2000 Block 2: min=2, max=999998Block 3: min=2001,max=3000 Block 3: min=1, max=999997 ↑ ↑WHERE user_id=1500 skips 2/3 WHERE user_id=1500 skips 0/35.3 Inverted index (high level)
Section titled “5.3 Inverted index (high level)”- What: term → list of doc/row IDs containing it. Same idea as search engines.
- Problem solved: highly selective equality/IN predicates on high-cardinality columns (e.g.
user_id = X,trace_id = Y). - When it helps: needle-in-haystack queries; text search; log lookups.
- When it doesn’t: broad scans, aggregations over most rows.
- Tradeoff: large storage overhead (can rival data size), expensive to build and maintain on updates. Common in Elasticsearch, Pinot text indexes, Druid.
5.4 Bitmap index (high level)
Section titled “5.4 Bitmap index (high level)”- What: per distinct value of a column, a bitmap of rows where that value appears.
- Problem solved: fast AND/OR over multiple low/medium-cardinality predicates.
- When it helps:
WHERE country='US' AND device='mobile' AND status='paid'— intersect three bitmaps. - When it doesn’t: high-cardinality columns (a bitmap per value explodes), heavy updates.
- Tradeoff: compressed bitmaps (Roaring) make this cheap for moderate cardinality. Great for faceted filtering.
country: US -> 1110010110 FR -> 0001101001 DE -> 0000000000...
device: mobile -> 1011010110 desktop -> 0100101001
AND: US & mobile -> 1010010110 (scan only these rows)5.5 Star-tree / pre-aggregation indexes (high level)
Section titled “5.5 Star-tree / pre-aggregation indexes (high level)”- What: a tree that materializes aggregates along a set of dimensions, with “star” nodes that represent “any value.”
- Problem solved: fixed-shape dashboard queries over many dimensions at high QPS with bounded latency.
- When it helps: you know the dimension set, slices are predictable, ingest rate is manageable.
- When it doesn’t: ad-hoc exploration over arbitrary columns; cube explosion.
- Tradeoff: storage blow-up proportional to dimension combinations; ingestion cost; staleness vs rebuild frequency. Pinot’s differentiator.
5.6 Dictionary encoding
Section titled “5.6 Dictionary encoding”- What: replace string/large values with integer codes; store a dictionary per column or per block.
- Problem solved: compression + fast equality/GROUP BY (integer compares, not string compares).
- When it helps: low-to-medium cardinality string columns (country, device, status).
- When it doesn’t: high-cardinality unique strings (URLs, user agents verbatim) — dictionary grows without bound.
- Tradeoff: small CPU cost for dict lookup vs huge wins for aggregation and joins.
5.7 Clustering / sorted order as indexing aid
Section titled “5.7 Clustering / sorted order as indexing aid”- What: sort the table (or sub-groups) on a key.
- Problem solved: makes zone maps effective, improves compression, enables range scans.
- When it helps: predictable hot filter columns (time + tenant_id is classic).
- When it doesn’t: evenly-accessed dimensions — pick one sort order and you lose on the others.
- Tradeoff: one physical sort per copy. Some systems support secondary sort via projections/skip-indexes.
5.8 When is full indexing worth it vs too expensive?
Section titled “5.8 When is full indexing worth it vs too expensive?” │ Low write rate │ High write rate │ + High QPS │ + Low QPS───────────────────┼───────────────────┼───────────────────High selectivity │ BUILD INDEX │ Maybe build zone(needle/haystack) │ (inverted,bitmap)│ maps only───────────────────┼───────────────────┼───────────────────Low selectivity │ Scan is fine; │ Definitely no(broad aggregate) │ use compression │ index; relies on │ + sort │ partitioning aloneDefault position: partition + sort + zone maps + dictionary. Add bitmap/inverted only when a specific query pattern demands it.
6. Ingestion vs query tradeoffs
Section titled “6. Ingestion vs query tradeoffs”6.1 Batch vs real-time
Section titled “6.1 Batch vs real-time”- Batch: hourly/daily load, large sorted compressed segments, maximum read performance, staleness = batch window.
- Real-time: streaming ingest, small segments, often less compressed/sorted, queryable in seconds, higher per-row overhead.
- Staff move: combine both. Lambda-flavored.
Stream (Kafka) Batch (S3 / warehouse) │ │ ▼ ▼ ┌──────────────────┐ ┌────────────────────┐ │ Real-time segs │ │ Historical segs │ │ (mutable-ish, │ │ (immutable, sorted,│ │ small, recent) │ │ compressed, pre- │ │ │ │ aggregated) │ └────────┬─────────┘ └──────────┬─────────┘ │ │ └──────────────────┬────────────────────────┘ ▼ Query layer merges results from both, dedupes on PK/version6.2 Mutable vs immutable segments
Section titled “6.2 Mutable vs immutable segments”- Immutable: append-only, compressed, indexed once, never rewritten except by compaction. Fast to read.
- Mutable (real-time): recent window, in-memory or lightly persisted, supports per-row writes, slower per-row reads but tiny data volume.
- Transition: real-time segment seals (at size/time threshold) → becomes an immutable segment → eventually merged by compaction.
6.3 Freshness vs query speed
Section titled “6.3 Freshness vs query speed”Three dials, pick your spot:
FRESHNESS │ SEGMENT SIZE │ QUERY SPEEDHigh (seconds) │ Small │ Slower (many segs, less compressed)Low (hours) │ Large │ Fast (few big sorted segs)6.4 Compaction / merge (high level)
Section titled “6.4 Compaction / merge (high level)”- Background process merges small segments into larger ones, re-sorts, re-compresses, rebuilds indexes, applies tombstones.
- Reduces segment count (read amplification) at the cost of write amplification.
- LSM-ish. Tuning knobs: trigger threshold, concurrency, resource isolation from query path.
Before: [s1][s2][s3][s4][s5][s6][s7][s8] (8 small segments, lots of metadata) │ │ │ │ │ │ │ │ └───┴───┴───┴───┼───┴───┴───┴──── compaction ▼After: [ S1 ][ S2 ] (2 large segments, sorted, compressed)6.5 Why separate ingest path from serving path
Section titled “6.5 Why separate ingest path from serving path”- Ingestion bursts should not crash queries. Queries under load shouldn’t block ingestion.
- Separate workers (different machines or at least isolated CPU/mem).
- Shared storage (object store or distributed FS) as handoff.
- Lets you scale each independently and do rolling upgrades safely.
6.6 Operational implications of low-latency analytics
Section titled “6.6 Operational implications of low-latency analytics”- Hot tier for recent data (local NVMe, replicated); cold tier on object store.
- Tiered storage with async demotion.
- Query-time freshness depends on segment handoff latency — watch it.
- Compaction as a first-class resource consumer — plan its budget.
- Rebalancing/replication when nodes come and go; segment assignment algorithms matter.
7. Tiered storage and decoupled storage-compute
Section titled “7. Tiered storage and decoupled storage-compute”Two linked architectural shifts worth fluency on: separating storage from compute, and tiering data by access frequency. Both show up in every modern analytics system, and the same ideas are now being imported into inference infrastructure.
7.1 Shared-nothing MPP vs decoupled architecture
Section titled “7.1 Shared-nothing MPP vs decoupled architecture”Classical MPP (Vertica, early Redshift, Greenplum) co-located storage and compute on each node. Scaling meant restriping data across new nodes. Compute ran 24/7 to keep storage online.
Decoupled architectures (Snowflake, BigQuery, Databricks, Trino-over-Iceberg) put durable data in object storage (S3, GCS, Azure Blob) and run stateless compute clusters over it. A catalog/metadata service mediates (file lists, schemas, snapshots, stats).
Shared-nothing MPP Decoupled storage-compute────────────────── ─────────────────────────┌─────┐ ┌─────┐ ┌─────┐ ┌───────┐ ┌───────┐ ┌───────┐│ C+S │ │ C+S │ │ C+S │ │compute│ │compute│ │compute│└─────┘ └─────┘ └─────┘ │ W1 │ │ W2 │ │ W3 │ ↑ ↑ ↑ └───┬───┘ └───┬───┘ └───┬───┘ scale = restripe │ │ │ data across nodes └─────────┼─────────┘ │ ┌──────▼──────┐ metadata: │ catalog │ files, stats, │ (Iceberg / │ snapshots, │ Snowflake │ schema │ meta svc) │ └──────┬──────┘ │ ┌──────▼──────┐ durable, infinite, │ object store│ cheap, │ (S3/GCS) │ high-latency └─────────────┘Wins from decoupling:
- Elastic compute: spin up a 100-node cluster for a heavy query, tear it down. Pay per second.
- Multi-tenant isolation: independent compute clusters over the same data; no resource contention.
- Scale to zero: compute cost → $0 when idle; storage cost stays linear in data volume.
- Storage is someone else’s problem: S3 gives 11 nines durability at commodity price; don’t reinvent it.
- Independent scaling: storage and compute grow at different rates. Storage usually grows faster.
Costs:
- Network is the new bottleneck. Object store GET is ~20–100ms vs local NVMe ~100µs. Orders-of-magnitude worse tail latency.
- Cold start. Warming caches on a fresh warehouse takes seconds to minutes.
- Consistency model. Historically object stores were eventually consistent (S3 is stronger now); table formats (Iceberg/Delta/Hudi) layer snapshot isolation on top.
7.2 Open table formats as the enabler
Section titled “7.2 Open table formats as the enabler”Iceberg / Delta Lake / Hudi turned “a bunch of Parquet files in S3” into a queryable table with schema evolution, snapshots, and ACID-ish semantics. Critical because they let multiple engines (Trino, Spark, Flink, ClickHouse) share the same data without a proprietary format.
What they give you:
- Manifest files: list of data files + per-file stats (min/max per column).
- Snapshots: atomic commit unit; readers see a consistent view.
- Schema evolution without full rewrite.
- Time travel: query as-of a snapshot.
- Partition evolution: change partition scheme without full rewrite.
Interview relevance: if someone asks about “data lake” design, Iceberg/Delta is the modern answer. Raw Parquet in S3 without a table format is a pre-2020 design.
7.3 Hot / warm / cold tiering
Section titled “7.3 Hot / warm / cold tiering”Within a given system, data is classified by access frequency and placed on storage with matching latency/cost.
Tier Latency Cost Typical use────── ─────────────── ───── ─────────────────────────────Hot 100µs (NVMe) $$$$ Last 1–24 hours; real-time dashboards; low-latency servingWarm 1–10ms (SSD, remote $$ Last 1–30 days; frequently block storage) queried historicalCold 50–200ms (S3/GCS) $ Months to years; rare ad-hoc or compliance queriesArchive seconds–minutes ¢ Years; audit, recovery only (Glacier, tape)Placement rules:
- By age: time-based demotion (hot → warm → cold as data ages).
- By access frequency: LRU-ish promotion/demotion based on recent query hits.
- By explicit label: user-declared hot datasets (Snowflake, Pinot support this).
7.4 Caching is the whole game
Section titled “7.4 Caching is the whole game”When the authoritative store is remote, cache hit rate is the dominant performance lever. Typical layers:
┌─────────────────────────────────────────┐│ Query result cache (query text → result)│ p99 µs; low hit rate for│ │ parameterized dashboards├─────────────────────────────────────────┤│ Fragment / partial-aggregation cache │ reuse across similar queries├─────────────────────────────────────────┤│ Block / page cache (local NVMe) │ the workhorse; column-block LRU├─────────────────────────────────────────┤│ Metadata cache (file lists, stats) │ critical; planning can take│ │ longer than execution if this│ │ misses against S3├─────────────────────────────────────────┤│ Remote object store (S3) │ authoritative, slow└─────────────────────────────────────────┘Interview points:
- Query planning hits the metadata cache first — raw S3 file listing is slow; with 10K+ partitions it can dominate latency.
- Block cache hit rate on recent data is typically high; cold historical queries are where tail latency blows up.
- Pre-warm caches on node addition (“cache hydration”) rather than cold-starting.
7.5 Inference-adjacent analogs
Section titled “7.5 Inference-adjacent analogs”Same pattern, different scale:
- Feature stores (Feast, Tecton): offline store on object storage for training; online store on Redis/DynamoDB for serving. Same data, two tiers for two latency regimes.
- Telemetry / request-log lakes: inference request/response logs in Parquet on S3; ClickHouse or Trino for ad-hoc. SLO is “queryable within minutes,” not milliseconds.
- Model artifact serving: weights in S3/GCS (cold) → warm cache on shared NVMe → hot copy in GPU HBM per worker. Same tiering principle, different latency scales.
- KV cache tiering for LLM inference: HBM (µs) → host DRAM (10s of µs) → local NVMe (100s of µs) → distributed object store (ms). Same eviction/promotion questions, same decoupled-compute insight (serve a request from any node that can pull the prefix fast enough). This is active research territory — prefix cache offloading, paged/tiered KV.
Honest framing for an inference-infra interview: analytical systems pioneered decoupled storage/compute a decade ago; inference systems are importing the ideas now because HBM is the new “RAM” and NVMe + object store are the new “disk.”
7.6 Interview reasoning — when do I pick decoupled?
Section titled “7.6 Interview reasoning — when do I pick decoupled?”- Pick decoupled when: workload is bursty or bimodal, multiple teams share data, storage ≫ sustained compute, cost elasticity matters, or you’re on a cloud with cheap durable object storage.
- Stick with shared-nothing when: p95 at single-digit ms is non-negotiable, workload is steady-state (always-on high-QPS dashboards), network to object store would dominate, or on-prem without a good object store.
- Hybrid is the common reality: hot tier on shared-nothing for user-facing queries; cold tier in object storage queried by a separate analytics cluster. Ingestion writes to both.
7.7 First-order bottlenecks in decoupled systems
Section titled “7.7 First-order bottlenecks in decoupled systems”- Object store throughput/latency — you’re GET-limited on scans. Parallelize aggressively; use ranged GETs, not whole-file fetches.
- Cache hit rate — one miss on a 1 GB file blows your latency budget.
- Metadata planning — listing 100K partitions from S3 is slow. Manifest files (Iceberg) fix this.
- Cold start — warming a fresh cluster’s caches from S3 on first queries.
- Request amplification — each Parquet file = header + footer + column chunks, i.e. multiple round trips if you’re not careful. Coalesce reads.
8. Query patterns and bottlenecks
Section titled “8. Query patterns and bottlenecks”8.1 The five query shapes
Section titled “8.1 The five query shapes”| Pattern | What dominates | What to exploit |
|---|---|---|
| Selective filter (1 in 10⁶) | index probe | inverted/bitmap index, sort + zone map |
| Broad scan | sequential I/O | columnar, compression, projection, parallelism |
| GROUP BY aggregation | hash aggregation, memory | dictionary encoding, pre-aggregation, partition-local agg |
| Top-K | partial heap | compute top-K per partition, merge |
| COUNT DISTINCT | memory/exactness | HyperLogLog (approx), sort+stream (exact, expensive) |
8.2 Distinct counts (high level)
Section titled “8.2 Distinct counts (high level)”- Exact distinct = keep a hash set. Memory = O(distinct), which can be huge, and doesn’t parallelize without shuffle.
- Approx distinct (HLL, Theta) = fixed ~16KB sketch, mergeable, 1–2% error. Default to it for dashboards.
- Interview answer: “Unless the PM requires exactness, use HLL. It merges across partitions and fits in memory.”
8.3 Joins at interview depth
Section titled “8.3 Joins at interview depth”OLAP joins are constrained relative to OLTP. You need to know:
- Broadcast join: small side replicated to all workers, big side scanned locally. Fast when one side is small (< few GB). No shuffle.
- Shuffle (hash) join: both sides repartitioned on join key. Expensive network; skew kills you.
- Sort-merge join: both sides sorted on join key. Good when data is already sorted or must be sorted anyway.
- Co-located join: both tables partitioned by the same key on the same nodes → local join, no shuffle. The one you design for when possible.
Broadcast: Shuffle: small left right │ │ │ replicate hash-partition hash-partition │ │ │ │ │ │ │ │ │ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ [W1][W2][W3] [W1][W2][W3] ←←←← network scan big locally local hash join on each worker join + emit (watch skew!)8.4 Why joins are constrained or precomputed in analytics
Section titled “8.4 Why joins are constrained or precomputed in analytics”- Shuffles move terabytes; bandwidth and skew dominate latency.
- Star schema dim tables are usually small → broadcast joins are fine.
- When dims are huge, systems often denormalize (flatten at ingest) or pre-join into a wide table. Pinot/Druid famously restricted joins for this reason (loosening over time).
- Answer: “In analytics, you design the physical layout to avoid shuffles. Either co-locate on the join key, broadcast small dims, or denormalize.”
8.5 First-order bottlenecks
Section titled “8.5 First-order bottlenecks”In rough order of what kills queries:
- Scan volume — too many bytes, not enough pruning.
- Shuffle — cross-node data movement on joins and high-cardinality GROUP BY.
- Skew — one partition or key has 80% of data; one worker becomes the critical path.
- Memory pressure — hash aggregation spilling to disk.
- Query planning overhead — thousands of tiny partitions/files.
- Compaction interference — background work starving foreground queries.
Your design should name the bottleneck and kill it explicitly.
9. Interview reasoning patterns
Section titled “9. Interview reasoning patterns”These are the reusable scripts.
9.1 Do I need an analytical store vs OLTP?
Section titled “9.1 Do I need an analytical store vs OLTP?”Start │ ├── Query touches < 10K rows, needs ACID on row updates? ─── OLTP │ ├── Query scans millions+, aggregations, read-mostly? ───── OLAP │ ├── Mixed? Small hot (last hour) + big historical? ──────── OLTP for hot, │ OLAP for cold, │ CDC between them │ └── User-facing, high QPS, bounded query shape, fresh? ──── Pinot/Druid class (or pre-agg + cache)9.2 Does columnar help here?
Section titled “9.2 Does columnar help here?”Yes if: wide schema, queries project few columns, aggregation-heavy, read-mostly, bulk ingest. No if: row updates are frequent, queries need full rows, narrow schema.
9.3 Are indexes worth it?
Section titled “9.3 Are indexes worth it?”- Default: no index beyond sort + zone maps + dictionary.
- Add bitmap when you have 3+ low-cardinality filters ANDed together frequently.
- Add inverted when you have very selective equality on high-cardinality keys that are hot.
- Reject: “I’ll index everything.” Indexes cost storage, write time, and compaction work.
9.4 Should I pre-aggregate?
Section titled “9.4 Should I pre-aggregate?”- Yes if: query shape is fixed, QPS is high, latency SLO is tight, dimension set is bounded.
- No if: ad-hoc exploration, unbounded dimension combinations, low QPS.
- Middle ground: materialize rollups for the top-N known queries; fall back to raw for the rest.
9.5 Freshness vs latency
Section titled “9.5 Freshness vs latency”Three levers:
- Segment size — smaller = fresher, slower queries.
- Compaction cadence — more frequent = fresher sealed state, more background load.
- Real-time tier size — bigger RT window = more fresh data queryable, but slower since RT is less optimized.
Name the SLO first (e.g., p95 < 500ms, data ≤ 30s old), then pick the levers.
9.6 Ingestion + fast queries simultaneously
Section titled “9.6 Ingestion + fast queries simultaneously”- Separate write path and read path processes.
- Immutable segments so readers never lock against writers.
- Object store as durable landing zone; hot tier on local disk for serving.
- Background compaction with its own resource budget, isolated from query workers.
9.7 When is Pinot/ClickHouse/Druid thinking appropriate?
Section titled “9.7 When is Pinot/ClickHouse/Druid thinking appropriate?”- User-facing analytics with high QPS, sub-second p95, bounded queries → Pinot/Druid.
- Internal analytics with broad ad-hoc queries, tolerant of seconds-to-minutes → ClickHouse, Snowflake, BigQuery, Trino over Iceberg/Delta.
- Logs / traces at scale → columnar + some text indexing; ClickHouse, Pinot, or Elasticsearch depending on shape.
9.8 First bottlenecks / operational pains
Section titled “9.8 First bottlenecks / operational pains”- Partition key selection gone wrong → no pruning → everything scans full table.
- Skew on a tenant_id or user_id causing one hot shard.
- Compaction backlog causing segment count to explode, metadata planning time to spike.
- Real-time segment seal latency defining your freshness SLO.
- Tiered storage cache misses on cold data when someone runs a wide historical query.
10. Common candidate mistakes
Section titled “10. Common candidate mistakes”- “Use a data warehouse” without specifying latency, freshness, or QPS. Snowflake is not a dashboard engine out of the box.
- Ignoring row vs column — proposing scans over Postgres for analytics at TB scale.
- Handwaving “add an index” without naming the index type, what it indexes, its update cost, or its selectivity assumption.
- Skipping partition pruning — proposing a big table without saying what’s partitioned and how.
- Assuming indexes always help — indexes on low-selectivity columns are strictly negative.
- Ignoring ingest/query tradeoff — proposing low-latency fresh queries without saying how writes don’t collide with reads.
- Assuming OLTP scales to dashboards — Postgres with read replicas does not solve 10 TB scan dashboards.
- Ignoring skew and scan cost — uniform workload assumption breaks in real systems; tenant-based skew is the norm.
- Proposing heavy joins without discussing broadcast vs shuffle, co-location, or denormalization.
- Over-engineering — building a Pinot clone when a Postgres read replica + materialized view would suffice.
- Talking about “caching” as a primary strategy — for dashboards with varied parameters, cache hit rate is often low; fix the storage layout instead.
- Forgetting compaction — systems need it; ignoring it means ignoring half the operational pain.
11. Final cheat sheet
Section titled “11. Final cheat sheet”11.1 OLTP row store vs OLAP column store
Section titled “11.1 OLTP row store vs OLAP column store”| Axis | OLTP row store | OLAP column store |
|---|---|---|
| Layout | Rows contiguous | Columns contiguous |
| Read pattern | Point / small range | Large scan on few columns |
| Write pattern | Random, per-row, ACID | Bulk append, immutable segs |
| Compression | Modest | 5–20× |
| Index default | B-tree PK + secondary | Partition + sort + zone map |
| Concurrency | Thousands of small txns | Tens of large queries |
| Latency | ms per query | ms–seconds over 10⁹ rows |
| Update cost | Cheap | Expensive (tombstone + compact) |
| Canonical fit | Transactions | Analytics, dashboards, logs |
| Products | Postgres, MySQL, Spanner | ClickHouse, Druid, Pinot, BigQuery, Snowflake |
11.2 Partitioning vs sorting vs indexing vs pre-aggregation
Section titled “11.2 Partitioning vs sorting vs indexing vs pre-aggregation”| Technique | Granularity | Kills | Write cost | Storage cost | Freshness impact |
|---|---|---|---|---|---|
| Partitioning | Directory / file | Whole-file scans | Low | None | None |
| Sorting / clustering | Within partition | Block scans (via zone map) | Medium (must sort) | None | Sort may delay seal |
| Zone map / min-max | Block-level | Block reads | Trivial | Trivial | None |
| Bitmap index | Column-level | Multi-predicate scan | Medium | Moderate | Must rebuild on update |
| Inverted index | Term-level | Needle queries | High | High (rivals data) | High — costly to update |
| Dictionary encoding | Column-level | Decompression cost; speeds GROUP BY / eq | Low | Negative (saves storage) | None |
| Pre-aggregation | Query-level | Raw table scan | High (precompute) | High (cube blowup) | Rebuild window = staleness |
11.3 Decision framework
Section titled “11.3 Decision framework”1. Workload: OLTP? OLAP? mixed? ↓2. Query shape: scans / aggregations / selective / joins / top-K? ↓3. Freshness SLO: seconds? minutes? hours? ↓4. QPS: user-facing high QPS? internal low QPS? ↓5. Storage layout: row or columnar? ↓6. Partition key: what's in 90% of WHERE clauses? (usually time) ↓7. Sort key: next most selective frequent filter ↓8. Indexes: zone map by default; bitmap/inverted only if query demands it ↓9. Pre-aggregation: fixed query shape + high QPS + tight SLO → yes ↓10. Ingest path: immutable segments, background compaction, isolated from query path ↓11. Name bottlenecks: scan volume, shuffle, skew, memory, compaction ↓12. Operational: tiered storage, rebalancing, compaction budget, monitoring for segment count & skew11.4 Ten likely interview questions with short strong answers
Section titled “11.4 Ten likely interview questions with short strong answers”Q1. When would you not use Postgres for analytics? A. When working set exceeds RAM and per-query scans exceed ~10 GB with sub-second SLOs, or when ingest rate plus dashboard QPS saturates random-I/O row storage. At that point columnar + partitioning is structurally right, not just a tuning issue.
Q2. Why is columnar better for analytics? A. Scans read only projected columns (bytes_read ∝ cols_read/cols_total), column-homogeneous data compresses 5–20× better, and execution vectorizes cleanly. Tradeoff: row updates and row reconstruction are expensive, so segments are immutable.
Q3. How do you pick a partition key? A. The dimension that appears in 90%+ of WHERE clauses, usually time (day or hour). Low enough cardinality to avoid small-file explosion, high enough to give meaningful pruning. For multi-tenant, sometimes time + tenant_id, but watch skew.
Q4. When is a bitmap index worth it?
A. Low-to-medium cardinality columns used in ANDed predicates with moderate-to-high selectivity. Classic case: country AND device AND status dashboard filters. Not worth it on high-cardinality or write-heavy columns.
Q5. How do you handle updates in a columnar store? A. Immutable segments + delete vectors (or tombstones) + re-insert, merged at read time, with periodic compaction to rewrite. The system is append-heavy by design; true row-level UPDATE at high rate is a signal you picked the wrong store.
Q6. Freshness vs query latency — how do you balance them? A. Real-time tier holds recent data in small, less-optimized segments for low freshness latency. Historical tier holds sealed, sorted, compressed, indexed segments for fast queries. Queries merge both. Tune segment-seal cadence and compaction frequency to meet the SLO.
Q7. Why are joins constrained in analytics systems? A. Shuffles move huge data volumes across the network, and skew creates stragglers. Designs avoid shuffles via (a) co-located partitioning on the join key, (b) broadcast joins for small dims, or (c) denormalization at ingest into wide tables. Many early OLAP systems forbade joins for exactly this reason.
Q8. When would you pre-aggregate? A. Fixed query shape, bounded dimension set, high QPS, tight latency SLO. Pre-aggregation converts per-query work from O(rows) to O(dimensions). Don’t pre-aggregate when exploration is ad-hoc or the dimension cube would explode.
Q9. What’s the first bottleneck you’d watch for? A. Scan volume not reduced by partition pruning — usually a bad partition key. Then shuffle volume and skew on joins and high-cardinality GROUP BYs. Then compaction interference on the query path.
Q10. Pinot/Druid vs ClickHouse vs Snowflake — when does each fit? A. Pinot/Druid for user-facing, high-QPS, sub-second analytics with bounded query shapes. ClickHouse for internal analytics and logs with heavy ad-hoc SQL at seconds-latency. Snowflake/BigQuery for warehouse-scale ad-hoc analytics where minutes are acceptable and ease-of-use and decoupled storage matter. The axes are QPS, freshness, query flexibility.
One-page summary
Section titled “One-page summary”Decide: OLTP or OLAP? └─ OLAP → Storage: columnar, immutable segments Partition: the filter in 90%+ of queries (usually time) Sort: next most selective frequent filter Index: zone map + dict by default; bitmap/inverted only if specific query demands Pre-agg: if fixed shape + high QPS + tight SLO Ingest: separate from query path; real-time tier + sealed historical tier; background compaction Bottlenecks: scan volume, shuffle, skew, memory, compaction
Always frame choices as: how does this reduce scan bytes?