S3 storage

S3-Compatible Object Storage — Staff-Level Design

https://prachub.com/interview-questions/design-an-s3-like-object-storage-service

Q: Design an S3-compatible object storage service.

REQUIRED ASSUMPTIONS (state, then design to them)

  • 10M PUT/s and 100M GET/s aggregate at steady state.
  • 10 EB stored, 30%/yr growth.
  • Object size: bimodal — 80% < 1 MB, 20% from 100 MB up to 5 TB.
  • Durability target: 11 nines.
  • Availability: 4 nines/region, 5 nines multi-region.
  • State your p50/p99 latency targets for small GETs and justify.
  • Cost model: storage + requests + egress. Design must reason in $.

Notation: blockquotes marked “STAFF:” flag moves that separate a staff/principal answer from a senior IC answer. Numbers are quoted inline; the spec drives the boxes, not the other way around.


PUTs 10M/s ┬─ 80% small (~256 KB avg) → 8M × 0.25 MB = 2 TB/s
└─ 20% large (~500 MB avg) → 2M × 500 MB = 1,000 TB/s ◄ dominant
───────────
~1 PB/s ingress
(large = many concurrent MPU streams,
not 2M independent writes/s)
GETs 100M/s aggregate → ~2 PB/s sustained egress
Storage: 10 EB / (800 TB/node × 0.70 util) = ~18k nodes → 24k after Y1 @ 30%
Frontend: 2 PB/s / (200 GbE × 0.80) = ~100k pods (data path)
Metadata: ~30M ops/s / 50k QPS/shard = ~600 shards → 2k with headroom
Latency commits (intra-region client):
small GET p50 9 ms p99 40 ms p999 150 ms
small PUT p50 12 ms p99 25 ms
large GET first-byte p99 60 ms
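
The sizing arithmetic above can be re-derived in a few lines. This is only a sketch: every input is one of the stated assumptions (object-size averages, 800 TB/node at 70% fill, 200 GbE at 80% utilization, 50k QPS/shard), and the variable names are illustrative.

```python
# Reproduces the back-of-envelope numbers above. All inputs are the
# document's stated assumptions, not measurements.
MB, TB, PB, EB = 1e6, 1e12, 1e15, 1e18

small_ingress = 8e6 * 0.25 * MB              # 80% of 10M PUT/s at ~256 KB each
large_ingress = 2e6 * 500 * MB               # 20% of 10M PUT/s at ~500 MB each
total_ingress = small_ingress + large_ingress

storage_nodes = 10 * EB / (800 * TB * 0.70)  # 800 TB/node at 70% fill
nodes_after_y1 = storage_nodes * 1.30        # 30%/yr growth

nic_bytes_per_s = 200e9 / 8 * 0.80           # 200 GbE at 80% utilization
frontend_pods = 2 * PB / nic_bytes_per_s     # 2 PB/s egress on the data path

metadata_shards = 30e6 / 50e3                # 30M ops/s at 50k QPS/shard

print(f"ingress         {total_ingress / PB:.3f} PB/s")
print(f"storage nodes   {storage_nodes:,.0f} now, {nodes_after_y1:,.0f} after year 1")
print(f"frontend pods   {frontend_pods:,.0f}")
print(f"metadata shards {metadata_shards:,.0f} minimum")
```

Note that the node count divides logical bytes by usable capacity; it does not yet include the replication or erasure-coding overhead discussed later.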

STAFF: The 1 PB/s “ingress” is meaningless unless restated as concurrent MPU streams: 1 PB/s is 8 Pb/s, or ~400k streams at 20 Gbps each. Junior answers multiply request rate by mean object size and don’t notice. State the assumption, then design.


┌─────────────────────────────────┐
│ Client / SDK │
└────────────────┬────────────────┘
│ HTTPS, SigV4
┌────────────────▼────────────────┐
│ Edge / Anycast Routing │ TLS term, WAF
└────────────────┬────────────────┘
┌────────────────▼────────────────┐
│ Frontend Pods (~100k) │ auth, hedge,
│ - SigV4, IAM │ EC encode
│ - EC encode (CPU or DPU) │ pre-sign
│ - Pre-signed URL gate │
└────┬─────────────────────┬──────┘
│ metadata path │ data path
┌───────────▼─────────┐ ┌───────▼──────────────────┐
│ Metadata Service │ │ Storage Plane │
│ ~2k Raft ranges │ │ ~24k nodes (3 AZs) │
│ range-partitioned │ │ │
│ (Spanner/Cockroach)│ │ Small: log-struct extts │
│ 5x replicas / 3 AZ │ │ Large: chunked + EC │
└───────────┬─────────┘ └──────────────────────────┘
┌───────────▼──────────────────┐
│ Control Plane │
│ - Placer (centralized) │
│ - GC, scrub, anti-entropy │
│ - Lifecycle, replication │
└──────────────────────────────┘

REST surface: PUT/GET/HEAD/DELETE /bucket/key, GET /bucket?list-type=2&prefix=&continuation-token=, multipart triplet (Initiate, UploadPart, Complete), conditional ops (If-Match, If-None-Match), versioning, range reads.

Auth (SigV4-equivalent): kSecret → kDate → kRegion → kService → kSigning, request signed over canonical request hash. Pre-signed URLs encode the signature in query string with explicit expiry and a server-side fence on object version.
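
The key-derivation chain can be sketched with the standard library. This follows the publicly documented AWS Signature Version 4 derivation (HMAC-SHA256 at each step, `AWS4` prefix on the secret, `aws4_request` terminator); the function names and the placeholder inputs are illustrative, and building the canonical request and string-to-sign is out of scope here.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """kSecret -> kDate -> kRegion -> kService -> kSigning."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)  # date is YYYYMMDD
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def sign(signing_key: bytes, string_to_sign: str) -> str:
    """Hex signature over an already-built string-to-sign."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder credentials and scope, for illustration only.
key = derive_signing_key("example-secret", "20260401", "us-east-1", "s3")
signature = sign(key, "AWS4-HMAC-SHA256\n<timestamp>\n<scope>\n<canonical-request-hash>")
```

Because the signing key is scoped to one date, region, and service, a frontend can cache it per credential for the day instead of re-deriving it on every request.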

Consistency contract published to clients:

  • Strong read-after-write for new PUTs and overwrites.
  • Strong list-after-write within a bucket.
  • Linearizable conditional puts (If-None-Match: * for “create if absent”).
  • Multipart Complete is atomic; partial state never observable.
  • No cross-bucket transactions.

Part-size choice: min 5 MB (S3-compat), recommended 16–64 MB, max 5 GB; per-object cap 10k parts.

Why 64 MB and not 5 MB?
At 25 GB/s NIC, 64 MB part transfers in ~2.6 ms wire time.
Retry cost if a part fails 90% of the way through: ~2.3 ms of wasted wire time.
5 MB parts: 5 TB object = 1M parts = 1M metadata rows. UNWORKABLE.
64 MB parts: 5 TB object = ~80k parts. Workable as metadata, but over the 10k-part cap, so
5 TB → parts of at least 500 MB (5 TB / 10k), i.e. ~512 MB+ at the upper end.
State machine:
┌─────────────┐
│ Initiated │◄── InitiateMultipart (uploadId minted)
└──────┬──────┘
│ UploadPart idempotent key = (uploadId, partNum)
┌─────────────┐
│PartUploaded*│ parts table grows; per-part SHA-256 ETag
└──┬──────┬───┘
Complete Abort ───── 7d idle ─────► sweeper reclaims extents
│ │
▼ ▼
┌────────┐ ┌──────────┐
│ Object │ │ Reclaimed│
└────────┘ └──────────┘

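Composite ETag = SHA256(concat(partETags)) + "-" + partCount. This keeps the shape of the S3 multipart ETag (hash of the concatenated part digests, a dash, then the part count), although S3 itself uses MD5 where this design uses SHA-256. GC of abandoned uploads runs against the mpu_parts index, not the main objects table; the two are never co-mingled.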
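
A minimal sketch of the two multipart calculations above: choosing a part size that respects the 10k-part cap, and building the composite ETag. It assumes binary units (MiB/GiB) for the limits and assumes the ETag hashes the raw binary part digests; the function names are hypothetical.

```python
import hashlib

MIN_PART = 5 * 2**20      # 5 MiB minimum part size
MAX_PART = 5 * 2**30      # 5 GiB maximum part size
MAX_PARTS = 10_000        # per-object part cap


def choose_part_size(object_size: int, preferred: int = 64 * 2**20) -> int:
    """Smallest part size >= preferred that keeps the upload within MAX_PARTS."""
    floor_for_cap = -(-object_size // MAX_PARTS)   # integer ceiling division
    part = max(preferred, floor_for_cap, MIN_PART)
    if part > MAX_PART:
        raise ValueError("object too large for the part-size and part-count caps")
    return part


def part_count(object_size: int, part_size: int) -> int:
    return -(-object_size // part_size)


def composite_etag(part_digests: list[bytes]) -> str:
    """SHA-256 over the concatenated part digests, then '-<partCount>'."""
    outer = hashlib.sha256(b"".join(part_digests)).hexdigest()
    return f"{outer}-{len(part_digests)}"


parts = [hashlib.sha256(chunk).digest() for chunk in (b"part-1", b"part-2", b"part-3")]
etag = composite_etag(parts)
```

For a 5 GiB object the preferred 64 MiB part size is kept; for a 5 TiB object the cap forces parts of roughly 525 MiB.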


Two backends, picked at PUT by size:

SMALL OBJECT (< 4 MB) LARGE OBJECT (≥ 4 MB)
═════════════════════ ═════════════════════════════════
PUT key=foo, 100 KB PUT key=video, 5 GB (multipart)
│ │ chunk into 4 MB blocks
▼ ▼
[Frontend batches] [Frontend EC encoder]
256 MB extents RS(10,4) per chunk
│ │
▼ ▼
[3x replicated extent] [14 shards × 3 AZs]
(W=2/N=3 sync) (W=10/N=14 sync)
│ async, after seal │
▼ ▼
[EC re-encode → seal] [Sealed immediately]
Metadata: 64 B per obj Metadata: chunk_map ~1 KB / 5 GB obj
(extent_id, off, len) (chunk_id × N)

The small-object packer is Haystack/needle-style: it collapses 8M small writes/s into ~30k extent-level appends/s and reaches >97% disk utilization vs. ~60% for one-object-per-file. The IOPS reduction is the entire point; without it, small files overwhelm the disks.

                    │ 3x Repl  │ RS(10,4) │ LRC(12,2,2) │ Clay
────────────────────┼──────────┼──────────┼─────────────┼─────────
Storage overhead    │ 3.0x     │ 1.4x     │ 1.33x       │ ~1.4x
Min shards to read  │ 1        │ 10       │ 12          │ 10
Single-shard repair │ 1        │ 10       │ 6           │ ~5
Decode CPU          │ none     │ med      │ med         │ high
Production maturity │ ★★★★★    │ ★★★★★    │ ★★★★        │ ★★
Default fit         │ ingest   │ warm     │ hot         │ research
                    │ buffer   │ default  │ rebuild-    │
                    │          │          │ hot data    │

STAFF: Don’t pick a code rate; pick a policy: 3x replication on the open ingest extent for ~5 min, then async re-encode to RS(10,4). Tectonic/Pelican do this. Naive answer (“use 3x replication”) burns 2x your storage budget. Naive-plus answer (“use RS”) tanks your write latency because you’re encoding on the hot path.

Analytic check against the 11-nines target: RS(10,4) placed cross-rack at a 2% per-disk AFR gives >12 nines under independent failures. The binding constraint is correlated failures (firmware bugs, AZ events), so placement matters more than code rate.
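
One rough way to sanity-check the independent-failure figure is a binomial model: a stripe is lost if more than m of its n shards fail within one repair window. The sketch below assumes a 24-hour repair window (not stated in the text), fully independent failures, and a union bound over windows; it is per stripe and ignores correlated failures entirely, which is exactly the part that matters in practice.

```python
from math import comb, log10

HOURS_PER_YEAR = 8760


def stripe_loss_per_window(n: int, k: int, afr: float, repair_hours: float) -> float:
    """P(more than n-k of n shards fail in one repair window), independent disks."""
    p = afr * repair_hours / HOURS_PER_YEAR          # per-disk failure prob. in the window
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n - k + 1, n + 1))


def annual_nines(n: int, k: int, afr: float = 0.02, repair_hours: float = 24.0) -> float:
    """Durability in 'nines' per stripe per year (union bound over windows)."""
    windows = HOURS_PER_YEAR / repair_hours
    annual_loss = stripe_loss_per_window(n, k, afr, repair_hours) * windows
    return -log10(annual_loss)


rs_nines = annual_nines(14, 10)      # RS(10,4)
repl_nines = annual_nines(3, 1)      # 3x replication, lost only if all 3 copies fail
print(f"RS(10,4): {rs_nines:.1f} nines, 3x replication: {repl_nines:.1f} nines")
```

Under these assumptions RS(10,4) clears 12 nines with room to spare; shortening or lengthening the assumed repair window moves the result by several nines, which is why repair bandwidth shows up again below.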


The storage layer is doing three different things at once, and they are easier to reason about separately.

Forget capacity for a second. The thing that kills naive object stores isn’t “where do I put 10 EB,” it’s how many disk operations per second the fleet can sustain.

NVMe SSD: ~500k–1M random IOPS, ~7 GB/s sequential
HDD: ~100–200 random IOPS, ~250 MB/s sequential

Now look at the workload:

8M small PUTs/sec. If each becomes 1 filesystem write
(1 inode + 1 data block + 1 metadata commit) on the storage node...
8M PUTs × ~3 I/Os per PUT = 24M IOPS needed
24M IOPS / 500k IOPS-per-NVMe = 48 NVMe drives minimum JUST FOR IOPS
24M IOPS / ~150 IOPS-per-HDD = ~160k HDDs minimum JUST FOR IOPS
Capacity needs ~24k storage nodes regardless. So which is binding?
If you naively did one-file-per-object:
IOPS-binding → fleet sized by IOPS, capacity wasted
fsync-binding → tail latency tanks
inode-binding → metadata blows up

This is why small objects need a different storage path than large objects. Not for elegance — because the IOPS math forces it.


Path 1: Small objects → log-structured packing (Haystack)

The trick: don’t store one object per file. Concatenate many objects into one big append-only file (“extent”). Facebook published this design as Haystack (OSDI 2010), and most large blob stores handle small files in a similar way.

Naive (one file per object):
─────────────────────────────
filesystem:
/data/aa/bb/foo.jpg ← inode + 1 block + metadata
/data/aa/bb/bar.jpg ← inode + 1 block + metadata
/data/aa/bb/baz.jpg ← inode + 1 block + metadata
... 100M more ...
Each PUT = 3+ IOPS. 8M PUT/s = dies.
Log-structured (Haystack-style):
─────────────────────────────────
ONE file on disk: extent_42
┌──────────────────────────────────────────────────────┐
│ [hdr] foo.jpg bytes [hdr] bar.jpg bytes [hdr] baz... │
└──────────────────────────────────────────────────────┘
↑ ↑ ↑
offset=0 offset=104832 offset=237449
length=104832 length=132617 length=88011
Metadata service stores:
foo.jpg → (extent=42, offset=0, length=104832)
bar.jpg → (extent=42, offset=104832, length=132617)
baz.jpg → (extent=42, offset=237449, length=88011)
Each PUT = 1 sequential append. 30k extent-writes/s handles 8M obj/s.

The frontend batches writes from many concurrent clients into the same extent in memory, then does one sequential append to disk. Sequential writes on NVMe hit ~7 GB/s; random writes hit a small fraction of that. You’ve collapsed 8M random IOPS into 30k sequential appends.

Reads work because the metadata service knows the (extent, offset, length) for every object. A GET becomes: “metadata lookup → seek to offset on extent → read length bytes.” On NVMe, the seek is free.

Why this matters: without packing, the small-object workload is mathematically infeasible. With packing, it becomes a metadata problem (which is solvable with sharded KV) instead of a disk problem.
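
A minimal in-memory sketch of the packing idea, assuming no per-object headers, no checksums, and no replication; `Packer` and its fields are hypothetical names, and a real extent would be an append-only file rather than a `bytearray`.

```python
class Packer:
    """Appends objects into shared extents and records (extent, offset, length)."""

    def __init__(self, capacity: int = 256 * 2**20, first_extent_id: int = 0):
        self.capacity = capacity
        self.extents: dict[int, bytearray] = {first_extent_id: bytearray()}
        self.sealed: set[int] = set()
        self.index: dict[str, tuple[int, int, int]] = {}   # stands in for the metadata service
        self._open_id = first_extent_id

    def put(self, key: str, payload: bytes) -> tuple[int, int, int]:
        if len(payload) > self.capacity:
            raise ValueError("object does not fit in one extent; use the large path")
        if len(self.extents[self._open_id]) + len(payload) > self.capacity:
            self.sealed.add(self._open_id)                 # seal: no more appends
            self._open_id += 1
            self.extents[self._open_id] = bytearray()
        extent = self.extents[self._open_id]
        location = (self._open_id, len(extent), len(payload))
        extent.extend(payload)                             # one sequential append
        self.index[key] = location
        return location

    def get(self, key: str) -> bytes:
        extent_id, offset, length = self.index[key]
        return bytes(self.extents[extent_id][offset:offset + length])


packer = Packer(first_extent_id=42)
packer.put("foo.jpg", b"a" * 104832)
packer.put("bar.jpg", b"b" * 132617)
packer.put("baz.jpg", b"c" * 88011)
```

With the three object sizes from the diagram above, the index ends up holding the same (extent, offset, length) triples, and a GET is a dictionary lookup plus one slice.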


Path 2: Large objects → chunking + erasure coding

Large objects (≥ 4 MB) get a different path because:

  1. They’re already big enough that one-object-per-file doesn’t waste IOPS.
  2. They’re big enough that 3× replication is expensive at scale.
  3. They’re big enough that you can split them across many disks for parallel reads.
PUT video.mp4, 5 GB
Step 1: Chunk into 4 MB blocks
───────────────────────────────
[chunk 1][chunk 2][chunk 3] ... [chunk 1280] (1280 chunks)
Step 2: For EACH chunk, erasure code it
────────────────────────────────────────
chunk_1 (4 MB)
┌───────────────────────────────────────┐
│ RS(10,4) encoder │
│ splits 4 MB into 10 data shards │
│ computes 4 parity shards │
└───────────────────────────────────────┘
d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 p1 p2 p3 p4
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
host1 host2 ... host14 (each on a different host, ≥3 AZs)
Each shard is 4 MB / 10 = 400 KB.
Total stored: 14 × 400 KB = 5.6 MB for a 4 MB chunk.
Overhead = 1.4×.

What erasure coding actually does (the math, briefly)

You don’t need to know Reed-Solomon internals to design this, but you need the intuition:

RS(k, m) means:
- Take your data, split into k equal pieces (data shards)
- Compute m extra pieces (parity shards) from the data
- Store all k+m pieces on different machines
- You can reconstruct the original from ANY k of the k+m pieces
RS(10, 4):
10 data + 4 parity = 14 shards
Can lose any 4 of 14 and still recover.
Storage overhead: 14/10 = 1.4×.
Compare to 3× replication:
Store 3 full copies.
Can lose any 2 of 3.
Storage overhead: 3.0×.
Replication RS(10,4)
Cost 3.0× 1.4× ← EC wins big
Tolerance 2 losses 4 losses ← EC also wins
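
To make "reconstruct from any k of the k+m pieces" concrete, here is the simplest possible case: k data shards plus a single XOR parity shard (m = 1). This is not Reed-Solomon; real RS(10,4) uses Galois-field arithmetic to tolerate four losses, but the recovery idea is the same. All function names are illustrative.

```python
from typing import Optional


def split(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal-length shards, zero-padding the tail."""
    shard_len = -(-len(data) // k)
    padded = data.ljust(shard_len * k, b"\0")
    return [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]


def xor_shards(shards: list[bytes]) -> bytes:
    out = bytearray(len(shards[0]))
    for shard in shards:
        for i, byte in enumerate(shard):
            out[i] ^= byte
    return bytes(out)


def encode(data: bytes, k: int) -> list[bytes]:
    """k data shards followed by one XOR parity shard."""
    shards = split(data, k)
    return shards + [xor_shards(shards)]


def recover(shards: list[Optional[bytes]]) -> list[bytes]:
    """Rebuild at most one missing shard (marked None) from the survivors."""
    missing = [i for i, shard in enumerate(shards) if shard is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost shard")
    survivors = [shard for shard in shards if shard is not None]
    if missing:
        shards[missing[0]] = xor_shards(survivors)
    return [shard for shard in shards if shard is not None]


original = b"object bytes to protect"
stored: list[Optional[bytes]] = list(encode(original, k=4))
stored[2] = None                                   # lose one data shard
rebuilt = recover(stored)
restored = b"".join(rebuilt[:4])[:len(original)]   # drop parity and padding
```

Note that the rebuild reads every surviving shard, which is the repair cost the next section is about.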

So why doesn’t everything use EC? Because of the repair problem.


A disk fails. You need to rebuild the lost shard onto a new disk.
REPLICATION rebuild:
Pick a surviving copy. Read 1 shard. Copy it. Done.
Network read = 1× the lost data.
RS(10,4) rebuild:
Pick any 10 surviving shards. Read all 10.
Decode the original chunk. Re-compute the lost shard.
Network read = 10× the lost data. ← ouch

At fleet scale, disks fail constantly. If every disk failure costs you 10× its capacity in network bandwidth, your network bisection bandwidth becomes the binding constraint on cluster size.

LRC (Local Reconstruction Codes, published by Microsoft Azure) cuts the repair fan-in:

LRC(12, 2, 2):
12 data shards, split into 2 LOCAL groups of 6
Each group has 1 local parity (so it can repair 1 loss within the group)
Plus 2 global parities for catastrophic recovery
Group A: d1 d2 d3 d4 d5 d6 + l_A
Group B: d7 d8 d9 d10 d11 d12 + l_B
Globals: g1 g2
Total: 12 + 2 + 2 = 16 shards. Overhead 16/12 ≈ 1.33×
Single-shard rebuild within Group A:
Read 6 shards from Group A only.
Network read = 6× the lost data, not 10×.
Catastrophic rebuild (>1 loss in a group):
Fall back to global decode, read 12 shards.

The pick:

  • Hot tier (lots of churn, frequent disk replacements): LRC. Repair bandwidth is the cost driver.
  • Warm tier (default): RS(10,4). Tolerates any 4 losses and is simpler to operate, at slightly higher storage overhead than LRC (1.4× vs. 1.33×).
  • Cold tier: even wider EC like RS(20,4) for cheaper storage at the cost of slower repair.
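
The trade-off behind the pick can be put into numbers. The sketch below uses the shard counts from the text; the fleet parameters passed to `daily_repair_read_tb` (disk count, disk size) are made-up illustrative values, and the model assumes every failure is a single-shard repair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    name: str
    data_shards: int
    parity_shards: int
    repair_reads: int          # shards read to rebuild one lost shard

    @property
    def overhead(self) -> float:
        return (self.data_shards + self.parity_shards) / self.data_shards


SCHEMES = [
    Scheme("3x replication", 1, 2, 1),
    Scheme("RS(10,4)", 10, 4, 10),
    Scheme("LRC(12,2,2)", 12, 4, 6),     # 2 local + 2 global parities
    Scheme("RS(20,4)", 20, 4, 20),
]


def daily_repair_read_tb(scheme: Scheme, disks: int, disk_tb: float, afr: float) -> float:
    """Network read per day to rebuild failed disks, single-shard repairs only."""
    failed_tb_per_day = disks * disk_tb * afr / 365
    return failed_tb_per_day * scheme.repair_reads


for s in SCHEMES:
    # Hypothetical fleet: 1M disks of 20 TB at 2% AFR.
    traffic = daily_repair_read_tb(s, disks=1_000_000, disk_tb=20, afr=0.02)
    print(f"{s.name:<15} overhead {s.overhead:.2f}x  repair reads ~{traffic:,.0f} TB/day")
```

Wider codes buy storage efficiency with repair traffic; LRC gives back some repair traffic for a small loss in fault tolerance.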

Why two-stage write (3× replication → re-encode to EC)

Now the missing piece: encoding EC on the write path is slow and CPU-heavy. If you encode a 4 MB chunk into 14 shards on the hot path, you’ve added encoder CPU (~1–2 ms) plus 14 cross-host writes (slowest of which dominates).

Replication is fast: just send the same bytes to 3 hosts. No CPU overhead.

So the production pattern is:

Write arrives
┌─────────────────────────────────────┐
│ Stage 1: INGEST (hot path) │
│ Replicate 3× to an open extent │
│ W=2/N=3 — ack as soon as 2 of 3 │
│ Latency: ~5–10 ms (network only) │
└─────────────────────────────────────┘
│ extent fills up (~256 MB)
│ OR ages out (~5 minutes)
┌─────────────────────────────────────┐
│ SEAL the extent (no more writes) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Stage 2: BACKGROUND re-encode │
│ Read sealed extent, encode RS(10,4) │
│ Write 14 shards across failure dom. │
│ Delete the 3 replicas after verify │
│ Latency: doesn't matter (async) │
└─────────────────────────────────────┘

You only pay the 3× storage cost for the ~5-minute “open” window. After that, the data lives at 1.4× overhead. Hot-path latency stays low; steady-state storage cost stays low. This is what Tectonic (Meta) and Pelican (MS) do.


PUT request
┌───────────────┐
│ size < 4 MB? │
└───┬───────┬───┘
yes │ │ no
▼ ▼
┌──────────────────┐ ┌──────────────────────┐
│ SMALL PATH │ │ LARGE PATH │
│ │ │ │
│ Pack into shared │ │ Chunk into 4 MB blks │
│ 256 MB extent │ │ Each chunk: 3× repl │
│ 3× replicate │ │ on open chunk-group │
│ extent │ │ │
│ │ │ When sealed: │
│ When extent │ │ re-encode RS(10,4) │
│ sealed: │ │ spread 14 shards │
│ re-encode to │ │ across 3 AZs │
│ RS(10,4) or │ │ │
│ LRC for hot │ │ │
└──────────────────┘ └──────────────────────┘
│ │
└───────────┬───────────┘
Metadata service stores:
- small: (extent, offset, length)
- large: chunk_map = list of chunk_ids
chunk_id → 14 shard locations

The whole storage layer is solving three problems with three different mechanisms:

Problem                                │ Mechanism                            │ Cost saved
───────────────────────────────────────┼──────────────────────────────────────┼─────────────────────────────
Small objects burn IOPS                │ Log-structured packing (Haystack)    │ 100× IOPS reduction
Replication wastes storage at EB scale │ Erasure coding (RS / LRC)            │ 3.0× → 1.4× overhead
EC is slow on the write path           │ Stage-1 replicate, stage-2 re-encode │ Latency stays at replication
                                       │                                      │ speed; storage stays at EC cost

Centralized placer (chosen)      vs. Consistent hash (vnodes)   vs. CRUSH
- Hierarchy-aware                    - Trivially scaling            - Hierarchy-aware
- Capacity/heat-aware                - No domain control            - Static rules
- Recovery budget tunable            - O(1/N) move on add           - Painful update churn
- ~50 µs / placement (batched)       - Best for KV, not blob        - Caps ~10k OSDs cleanly
Failure-domain hierarchy:
Region (us-east-1)
┌───────┼───────┬───────┐
AZ-a AZ-b AZ-c AZ-d ◄ chunk spans ≥3 AZs
Row-1 Row-2 Row-3
Rack Rack Rack ◄ chunk spans ≥10 racks
Host Host Host ◄ 14 distinct hosts per chunk
Recovery bandwidth budget: cap at 20% of bisection BW
18k nodes × 200 Gbps × 20% = 720 Tbps headroom
Full host-loss recovery (800 TB / 720 Tbps) ≈ 9 s minimum,
realistic ~1 hour with throttling and scrub priority.
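
A toy version of the hierarchy-aware placement rule, assuming a flat host inventory and ignoring capacity, heat, and recovery budgets (which the real placer would weigh). `Host` and `place_chunk` are hypothetical names. The sketch places one shard per rack, round-robin across AZs.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Host(NamedTuple):
    az: str
    rack: str      # rack ids are assumed globally unique
    name: str


def place_chunk(hosts: list[Host], shards: int = 14, min_azs: int = 3) -> list[Host]:
    """Pick `shards` hosts, one per rack, spread round-robin across AZs."""
    by_az: dict[str, list[Host]] = defaultdict(list)
    for host in hosts:
        by_az[host.az].append(host)
    azs = sorted(by_az)
    if len(azs) < min_azs:
        raise ValueError("not enough AZs")
    chosen: list[Host] = []
    used_racks: set[str] = set()
    while len(chosen) < shards:
        progressed = False
        for az in azs:
            if len(chosen) == shards:
                break
            candidate = next((h for h in by_az[az] if h.rack not in used_racks), None)
            if candidate is None:
                continue                      # this AZ has no unused rack left
            chosen.append(candidate)
            used_racks.add(candidate.rack)
            progressed = True
        if not progressed:
            raise ValueError("not enough distinct racks for the requested spread")
    return chosen


# Illustrative inventory: 4 AZs x 5 racks x 3 hosts.
inventory = [
    Host(f"az-{a}", f"az-{a}/rack-{r}", f"az-{a}/rack-{r}/host-{h}")
    for a in "abcd" for r in range(5) for h in range(3)
]
placement = place_chunk(inventory)
per_az = Counter(host.az for host in placement)
```

One consequence worth checking in any real placement policy: 14 shards over exactly three AZs force at least five shards into one AZ, more than the four losses RS(10,4) tolerates, so the example uses the four AZs drawn in the hierarchy above, where the split is 4/4/3/3.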

STAFF: Same-AZ EC with cross-AZ replication is strictly worse than cross-AZ EC at this scale. It doubles your network bill on rebuild and gives up the AZ-correlated-failure protection that 11 nines actually requires. S3 Express explicitly trades this away to be 10x faster — that’s the right move for a separate tier, not the default.


6. Metadata service — the actual bottleneck

RANGE-PARTITIONED (chosen) HASH-PARTITIONED (rejected)
═══════════════════════════ ═══════════════════════════════
Range 1: aaa..foo Shard 7: hash(bucket||key) % 1024
Range 2: foo..mum Shard ...:
Range 3: mum..zzz
LIST prefix='2026/04/' → LIST prefix='2026/04/' →
hits 1–2 ranges ✓ fans out to ALL 1024 shards ✗
OR maintain secondary list index ✗
(now you have 2 consistency
problems instead of 1)
Auto-split on: Hot-key behavior:
size > 100 MB no graceful split
QPS > 30k sustained only mitigation: app-level salting
prefix-aware boundary

Schema:

buckets (bucket_id PK) → policy, ACL, versioning, lifecycle
objects (bucket_id, key, version DESC) PK
→ size, etag, chunk_map_ref, content_type, ...
chunks (chunk_id) PK
→ ec_scheme, shard_locations[14], placement_epoch
mpu_uploads (bucket_id, upload_id) PK → state, initiator, key
mpu_parts (upload_id, part_num) PK → etag, chunk_ref, size

5-replica Paxos/Raft per range across 3 AZs, leader leases ~10 s. A read-replica fleet serves HEAD/GET metadata reads behind a strong-consistency epoch fence, so read capacity scales out with replicas without losing linearizability.

Hot-bucket mitigation: prefix-aware splits at detected boundaries (not arbitrary midpoints), per-prefix QPS counters in the leader, optional client-side salting hint for known hot prefixes. At 30M ops/s and 50k QPS/shard sustained: ~600 minimum, provision 2k for headroom + skew.
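
Why range partitioning makes LIST cheap can be shown with a sorted list of split points and `bisect`. This is a sketch with hypothetical names; it uses the `foo`/`mum` boundaries from the diagram, treats keys as plain strings, and approximates the end of a prefix scan by appending the highest code point.

```python
import bisect


class RangeMap:
    """Range i holds keys in [splits[i-1], splits[i]); the last range is open-ended."""

    def __init__(self, split_points: list[str]):
        self.splits = sorted(split_points)

    def range_for(self, key: str) -> int:
        return bisect.bisect_right(self.splits, key)

    def ranges_for_prefix(self, prefix: str) -> list[int]:
        first = self.range_for(prefix)
        # Approximate upper bound: sorts after any ordinary key with this prefix.
        last = self.range_for(prefix + "\U0010ffff")
        return list(range(first, last + 1))

    def split(self, at_key: str) -> None:
        """Add a boundary, e.g. when a range exceeds its size or QPS threshold."""
        bisect.insort(self.splits, at_key)


ranges = RangeMap(["foo", "mum"])            # Range 0: ..foo, 1: foo..mum, 2: mum..
list_targets = ranges.ranges_for_prefix("2026/04/")
straddling = ranges.ranges_for_prefix("m")   # prefix that crosses the 'mum' boundary
```

A LIST touches only the ranges whose key span overlaps the prefix, while a hash-partitioned layout has no such locality and must ask every shard.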

STAFF: Most candidates draw boxes for “metadata service” and move on. The metadata service is the entire system’s QPS ceiling — it sees more ops than the data plane (every PUT = 2+ writes, every GET = 1 read). If you don’t have a number for it, you don’t have a design.


S3 switched eventual → strong in December 2020. Implementation pattern (the right one to copy):

PUT path:
Frontend ──► write data shards (EC) ──► W=10 of 14 shard acks
Metadata Paxos commit ◄── only after data durable
client 200 OK
Object visible to GET/LIST iff metadata commit succeeded.
No "in-between" state externally observable.
GET path:
Frontend ──► metadata read (linearizable) ──► chunk_map
──► storage read ──► bytes
Costs vs eventual:
+1 Paxos round on PUT ≈ +3–5 ms intra-region (cross-AZ quorum)
≈ +80 ms cross-region (multi-region buckets)
Eventual saves the round but creates an entire bug class
(write S3, publish SQS, consumer reads stale S3) — not worth it.

Fenced caches: every cache entry carries (version, etag, lease_epoch). Overwrite bumps the epoch, revoking all stale cached copies in O(1) by epoch comparison, not per-key invalidation.
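
A sketch of the epoch-fence idea, under the assumption that each key's current epoch is cheap to read from an authority (standing in here for the metadata leader or a lease it hands out) and that many independent caches hold copies. Class names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheEntry:
    version: int
    etag: str
    lease_epoch: int
    body: bytes


class EpochAuthority:
    """Stand-in for the metadata service: tracks the current epoch per key."""

    def __init__(self) -> None:
        self._epochs: dict[str, int] = {}

    def current(self, key: str) -> int:
        return self._epochs.get(key, 0)

    def bump(self, key: str) -> int:
        self._epochs[key] = self.current(key) + 1
        return self._epochs[key]


class FencedCache:
    def __init__(self, authority: EpochAuthority) -> None:
        self._authority = authority
        self._entries: dict[str, CacheEntry] = {}

    def put(self, key: str, version: int, etag: str, body: bytes) -> None:
        epoch = self._authority.current(key)
        self._entries[key] = CacheEntry(version, etag, epoch, body)

    def get(self, key: str) -> Optional[CacheEntry]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        if entry.lease_epoch != self._authority.current(key):
            del self._entries[key]            # stale: fenced off by epoch comparison
            return None
        return entry


authority = EpochAuthority()
cache_a, cache_b = FencedCache(authority), FencedCache(authority)
cache_a.put("bucket/key", version=1, etag="v1", body=b"old")
cache_b.put("bucket/key", version=1, etag="v1", body=b"old")
hit_before = cache_a.get("bucket/key")
authority.bump("bucket/key")                  # an overwrite commits
miss_a, miss_b = cache_a.get("bucket/key"), cache_b.get("bucket/key")
```

The overwrite never contacts the caches; each one discovers staleness on its next read by comparing epochs.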


Quorum table:
Ingest extent (3x repl) N=3 W=2 R=1
Sealed EC chunk N=14 W=10 R=10 (any 10 of 14)
Metadata range N=5 W=3 R=3 (Paxos)
Anti-entropy:
Per-extent Merkle tree, gossiped between replica peers
Mismatch → re-replicate from majority
Scrub: every disk, every 14 days, ~50 MB/s background
Hinted handoff:
Write to offline target → buffer on peer with hint → replay on return
Cell-based blast radius:
┌──────── Cell 1 ────────┐ ┌──────── Cell 2 ────────┐
│ ~1k storage nodes │ │ ~1k storage nodes │
│ dedicated md ranges │ │ dedicated md ranges │
│ dedicated FE pods │ │ dedicated FE pods │
└────────────────────────┘ └────────────────────────┘
hard isolation hard isolation
no shared control plane no shared control plane
→ bug or rogue tenant ≤ 1 cell of damage
→ weekly cell-level chaos drills
RPO/RTO:
Intra-region: RPO = 0 (sync), RTO < 30 s (md leader election)
Multi-region: RPO ≈ 5 s (async repl), RTO minutes (DNS flip)
Cold backup: weekly md snapshot + continuous WAL ship to
separate pool with different SW version

STAFF: “Cells” with hard isolation is the move that keeps a 10 EB system from being one bug away from headlines. AWS, Google, Meta all do this. Junior answers leave the entire fleet sharing one control plane.


Request hedging on small GETs: dispatch the primary, then dispatch a secondary to another replica if the primary has not returned by p95 (~12 ms). Cuts p99 by ~3x at ~5% extra read load, since by definition about 5% of requests outlive the p95 timer. Cancel-on-first-success.
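
A minimal asyncio sketch of hedge-then-cancel, assuming a caller-supplied `fetch(replica)` coroutine and exactly two replicas. Error handling is deliberately thin: if the first task to finish raised, the exception propagates rather than waiting for the other replica.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def hedged_get(
    fetch: Callable[[str], Awaitable[bytes]],
    replicas: Sequence[str],
    hedge_after_s: float,
) -> bytes:
    """Send to replicas[0]; if it is slow, also send to replicas[1]; first result wins."""
    primary = asyncio.create_task(fetch(replicas[0]))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:
        return primary.result()               # fast path: no hedge sent

    secondary = asyncio.create_task(fetch(replicas[1]))
    done, pending = await asyncio.wait(
        {primary, secondary}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()                         # cancel-on-first-success
    await asyncio.gather(*pending, return_exceptions=True)
    return next(iter(done)).result()


async def _demo_fetch(replica: str) -> bytes:
    # Simulated replicas with fixed latencies, for illustration only.
    await asyncio.sleep({"slow": 0.3, "fast": 0.01}[replica])
    return replica.encode()


hedged_result = asyncio.run(hedged_get(_demo_fetch, ["slow", "fast"], hedge_after_s=0.05))
unhedged_result = asyncio.run(hedged_get(_demo_fetch, ["fast", "slow"], hedge_after_s=0.05))
```

In a real frontend the hedge delay would track the observed p95 rather than being a constant, and the hedge would typically be suppressed when the fleet is already overloaded.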

Edge caching is workload-dependent. For media-bucket-fronted-by-CDN: 95% hit rate, decisive. For ML training reading uniformly across an EB-scale shuffled dataset: hit rate ≈ 0, cache is pure overhead. Make it opt-in, per bucket, with explicit TTL.

Pre-signed URLs going direct to storage bypass the frontend on the data path:

WITHOUT pre-sign: WITH pre-sign:
Client → FE → Storage Client → FE (sign only)
data path Client → Storage (data path)
direct, FE not in line
~3 ms extra hop FE handles 2x more sigs/sec
for the same hardware.

Saves ~1–3 ms on small GETs, and offloads roughly half the frontend bytes-in-flight at the high end of the large-object distribution.


5 GB multipart upload, 64 MB parts (80 parts), 10-way client parallelism

ms Client Frontend Metadata Storage
────────────────────────────────────────────────────────────────
0 ─Initiate────►
1 auth
3 ─ Paxos write ──►
8 ◄── ack
8 ◄─uploadId──
10 ─UploadPart 1..10 (parallel)─►
EC encode (RS 10,4)
─── fan-out 14 shards × 3 AZs ──────►
30 NVMe + fsync
35 ◄── 10/14 acks
─ part metadata ─►
40 ◄─ETag──
... [parts 11..80 stream similarly; client uplink-bound]
250 ─Complete────►
255 validate manifest
257 ─atomic flip────►
262 ◄── committed
263 ◄─200 OK──
Net: ~250 ms protocol overhead on a ~2-second client-uplink-bound upload.
Object visible to LIST/GET at t=262 ms, atomically.

Small GET, 1 KB object
ms Client Frontend Metadata Storage
────────────────────────────────────────────────────────────────
0 ─GET key────►
0.05 HMAC verify
md cache lookup
├── HIT (80%) → 1 ms
└── MISS → ──►
5 ◄── obj row
5 range read ──────────────────────►
NVMe ~100 µs
6 ◄── 1 KB
6 ◄─200 OK + body──
p50: 5–10 ms (cache hit dominates)
p99: 25–40 ms (with hedging at p95=12 ms)
p999: 100–150 ms (tail from gc/scrub interference, fsync stalls)

Shift │ What it changes for a 2026 design
────────────────────────────┼─────────────────────────────────────
S3 Express One Zone │ Single-AZ tier mandatory; -85% GET cost
April 2025 reprice │ → ship a "directory bucket" type
S3 strong consistency │ Eventual is no longer defensible default
Dec 2020, now well-known │ → metadata-first commit + fenced caches
R2/Tigris/B2 zero egress │ ~15× cost gap vs S3 on egress-heavy WL
│ → egress price = strategy, not COGS
S3-over-RDMA + GPUDirect │ Meta: 3.8× training speedup w/ GDS
NVIDIA GA Jan 2026 │ → RDMA data plane = table stakes for AI
S3 Tables (Iceberg-native) │ Object store ⇒ tabular DB substrate
re:Invent 2024 │ → typed buckets: blob | table | vector
LRC / Clay in production │ Repair BW = hot-tier cost driver
│ → LRC for hot, RS for warm
S3 Vectors │ Vector search inside the object store
2025 │ → ANN APIs as first-class bucket type

The April 2025 S3 Express price drop was substantial: 31% off storage, 55% off PUT, 85% off GET, with single-digit ms latency, 2M GET/s and 200K PUT/s per directory bucket. Storage is still ~5x Standard, so it’s not a default — it’s the AI-training-scratch-space tier you ship alongside Standard.

R2 has rewritten the cost model. $0.015/GB-month with zero egress fees; at 10 TB stored / 5 TB egress, AWS S3 costs ~15× the cheapest alternative. A new design that doesn’t take a position on egress (match it, or own a non-egress moat) is shipping a 2020 cost model.

NVIDIA’s RDMA for S3 went GA via the CUDA Toolkit in early 2026. RDMA bypasses host CPU on the data path, delivering higher throughput per terabyte and significantly lower latencies vs TCP. Meta reported 3.8× training speedup with GPUDirect Storage, going from 50 GB/s CPU-bottlenecked to 192 GB/s direct-to-GPU and dropping data loading from 35% to 5% of training time. The right factoring is control-plane on HTTP/TCP, object payloads on RDMA — don’t try to RDMA SigV4.

S3 Tables made the object store a database substrate. Native Iceberg with up to 3× faster queries and 10× more transactions per second than self-managed Iceberg on standard S3. The implication: a 2026 design has typed bucket modes (blob, tabular, vector), with compute-on-read (Lambda-equivalent runtime per cell) as a first-class extensibility point.


A system designed today: strong-consistent metadata via range-partitioned Spanner-class KV, RS(10,4) warm tier + LRC(12,2,2) hot tier + 3x replicated ingest extents async-encoded, log-structured packing for small objects, S3-over-RDMA data plane with TCP fallback, single-AZ “express” tier and multi-AZ default tier as separate bucket types, native Iceberg-and-vector typed buckets, zero or low egress as a competitive position, and aggressive cell-based blast-radius isolation. Anything less is shipping a 2020 design in 2026.