S3 storage

S3-Compatible Object Storage — Staff-Level Design

https://prachub.com/interview-questions/design-an-s3-like-object-storage-service

Q: Design an S3-compatible object storage service.

REQUIRED ASSUMPTIONS (state, then design to them)

10M PUT/s and 100M GET/s aggregate at steady state.
10 EB stored, 30%/yr growth.
Object size: bimodal — 80% < 1 MB, 20% from 100 MB up to 5 TB.
Durability target: 11 nines.
Availability: 4 nines/region, 5 nines multi-region.
State your p50/p99 latency targets for small GETs and justify.
Cost model: storage + requests + egress. Design must reason in $.

Notation: ▸ STAFF blockquotes mark moves that separate a staff/principal answer from a senior IC answer. Numbers are quoted inline; the spec drives the boxes, not the other way around.

0. Capacity math first

PUTs 10M/s  ┬─ 80% small (~256 KB avg)  → 8M × 0.25 MB =     2 TB/s
            └─ 20% large (~500 MB avg)  → 2M × 500  MB = 1,000 TB/s  ◄ dominant
                                                          ───────────
                                                          ~1 PB/s ingress
                                  (large = many concurrent MPU streams,
                                   not 2M independent writes/s)

GETs 100M/s aggregate                            → ~2 PB/s sustained egress

Storage:  10 EB / (800 TB/node × 0.70 util) = ~18k nodes → 24k after Y1 @ 30%
Frontend: 2 PB/s / (200 GbE × 0.80)         = ~100k pods (data path)
Metadata: ~30M ops/s / 50k QPS/shard        = ~600 shards → 2k with headroom
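
The sizing arithmetic above can be re-derived in a few lines. This is a minimal sketch that uses only the figures quoted in the block, in decimal units; the variable names are illustrative, and the 30M metadata ops/s figure is taken as given rather than derived.

```python
# Re-derive the sizing figures from the stated assumptions (decimal units).
MB, TB, PB, EB = 1e6, 1e12, 1e15, 1e18

put_rate = 10e6                                  # PUT/s
small_ingress = 0.8 * put_rate * 0.25 * MB       # 80% small, ~256 KB average
large_ingress = 0.2 * put_rate * 500 * MB        # 20% large, ~500 MB average

storage_nodes = 10 * EB / (800 * TB * 0.70)      # 800 TB/node at 70% utilization
nodes_after_y1 = storage_nodes * 1.30            # 30%/yr growth

nic_bytes_per_s = 200e9 / 8 * 0.80               # 200 GbE at 80% utilization
frontend_pods = 2 * PB / nic_bytes_per_s         # sized for ~2 PB/s egress

metadata_shards = 30e6 / 50e3                    # 30M ops/s at 50k QPS/shard

print(f"small ingress  : {small_ingress / TB:,.0f} TB/s")
print(f"large ingress  : {large_ingress / TB:,.0f} TB/s")
print(f"storage nodes  : {storage_nodes:,.0f} now, {nodes_after_y1:,.0f} after year 1")
print(f"frontend pods  : {frontend_pods:,.0f}")
print(f"metadata shards: {metadata_shards:,.0f} minimum")
```

Changing a single input (node capacity, NIC speed, per-shard QPS) and re-running is usually faster than redoing the division by hand.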

Latency commits (intra-region client):
  small GET   p50  9 ms   p99  40 ms   p999 150 ms
  small PUT   p50 12 ms   p99  25 ms
  large GET   first-byte p99 60 ms

▸ STAFF: The 1 PB/s “ingress” is bullshit unless restated as ~400k concurrent MPU streams at 20 Gbps each (1 PB/s = 8 Pbps). Junior answers multiply request rate by mean object size and don’t notice. State the assumption, then design.

1. System overview

                     ┌─────────────────────────────────┐
                     │          Client / SDK           │
                     └────────────────┬────────────────┘
                                      │ HTTPS, SigV4
                     ┌────────────────▼────────────────┐
                     │   Edge / Anycast Routing        │  TLS term, WAF
                     └────────────────┬────────────────┘
                                      │
                     ┌────────────────▼────────────────┐
                     │   Frontend Pods  (~100k)        │  auth, hedge,
                     │   - SigV4, IAM                  │  EC encode
                     │   - EC encode (CPU or DPU)      │  pre-sign
                     │   - Pre-signed URL gate         │
                     └────┬─────────────────────┬──────┘
                          │ metadata path       │ data path
              ┌───────────▼─────────┐   ┌───────▼──────────────────┐
              │  Metadata Service   │   │  Storage Plane           │
              │  ~2k Raft ranges    │   │  ~24k nodes (3 AZs)      │
              │  range-partitioned  │   │                          │
              │  (Spanner/Cockroach)│   │  Small: packed extents   │
              │  5x replicas / 3 AZ │   │  Large: chunked + EC     │
              └───────────┬─────────┘   └──────────────────────────┘
                          │
              ┌───────────▼──────────────────┐
              │  Control Plane               │
              │  - Placer (centralized)      │
              │  - GC, scrub, anti-entropy   │
              │  - Lifecycle, replication    │
              └──────────────────────────────┘

2. API & data model

REST surface: PUT/GET/HEAD/DELETE /bucket/key, GET /bucket?list-type=2&prefix=&continuation-token=, multipart triplet (Initiate, UploadPart, Complete), conditional ops (If-Match, If-None-Match), versioning, range reads.

Auth (SigV4-equivalent): kSecret → kDate → kRegion → kService → kSigning, request signed over canonical request hash. Pre-signed URLs encode the signature in query string with explicit expiry and a server-side fence on object version.
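
The key-derivation chain can be sketched with the standard library. This follows the published AWS SigV4 construction (the "AWS4" key prefix and the "aws4_request" terminator); a SigV4-equivalent service could pick different constants, and the credential and string-to-sign below are placeholders, not real values.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """kSecret -> kDate -> kRegion -> kService -> kSigning (date is YYYYMMDD)."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def sign(signing_key: bytes, string_to_sign: str) -> str:
    """Hex signature over the string-to-sign, which embeds the canonical request hash."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder inputs, for illustration only.
key = derive_signing_key("EXAMPLE-SECRET", "20260101", "us-east-1", "s3")
signature = sign(key, "AWS4-HMAC-SHA256\n...placeholder string-to-sign...")
print(signature)
```

Because the signing key depends on date, region, and service, a leaked derived key is scoped to one day and one regional service, which is the point of the chain.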

Consistency contract published to clients:

Strong read-after-write for new PUTs and overwrites.
Strong list-after-write within a bucket.
Linearizable conditional puts (If-None-Match: * for “create if absent”).
Multipart Complete is atomic; partial state never observable.
No cross-bucket transactions.

3. Multipart upload

Part-size choice: min 5 MB (S3-compat), recommended 16–64 MB, max 5 GB; per-object cap 10k parts.

Why 64 MB and not 5 MB?
  At 25 GB/s NIC, 64 MB part transfers in ~2.6 ms wire time.
  Retry cost if a part fails 90% of the way through: ~2.3 ms of re-sent wire time.
  5 MB parts: 5 TB object = 1M parts = 1M metadata rows. UNWORKABLE.
  64 MB parts: 5 TB object = 80k parts. Workable but >10k cap, so
              5 TB → ~512 MB parts at the upper end.
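
A part-size chooser makes the 10k-part cap concrete. This is a minimal sketch assuming binary units (MiB/GiB/TiB) and a hypothetical helper name, choose_part_size; with binary units a 5 TiB object lands slightly above 512 MiB per part.

```python
MIB = 1024 * 1024
MIN_PART = 5 * MIB            # S3-compatible minimum part size
MAX_PART = 5 * 1024 * MIB     # 5 GiB maximum part size
MAX_PARTS = 10_000            # per-object part cap


def choose_part_size(object_size: int, preferred: int = 64 * MIB) -> int:
    """Smallest whole-MiB part size >= preferred that fits within MAX_PARTS."""
    floor_for_cap = -(-object_size // MAX_PARTS)      # integer ceiling division
    part = max(preferred, floor_for_cap, MIN_PART)
    part = -(-part // MIB) * MIB                      # round up to a whole MiB
    if part > MAX_PART:
        raise ValueError("object exceeds the part-size and part-count limits")
    return part


def part_count(object_size: int, part_size: int) -> int:
    return -(-object_size // part_size)


for size in (5 * 1024 * MIB, 5 * 1024 * 1024 * MIB):   # 5 GiB and 5 TiB
    p = choose_part_size(size)
    print(f"{size / MIB:>12,.0f} MiB object -> {p // MIB} MiB parts x {part_count(size, p)}")
```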

State machine:

      ┌─────────────┐
      │  Initiated  │◄── InitiateMultipart (uploadId minted)
      └──────┬──────┘
             │ UploadPart  idempotent key = (uploadId, partNum)
             ▼
      ┌─────────────┐
      │PartUploaded*│   parts table grows; per-part MD5 ETag
      └──┬──────┬───┘
   Complete  Abort  ───── 7d idle ─────► sweeper reclaims extents
         │      │
         ▼      ▼
    ┌────────┐ ┌──────────┐
    │ Object │ │ Reclaimed│
    └────────┘ └──────────┘

Composite ETag = MD5(concat(binary part MD5 digests)) + "-" + partCount (S3 contract). GC of abandoned uploads runs against the mpu_parts index, not the main objects table; the two are never commingled.
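
The composite ETag construction is short enough to show directly. S3 itself computes it from MD5 digests of the parts, concatenated in binary form; the algo parameter below is a hypothetical knob for a service that defines its ETags over a different digest.

```python
import hashlib


def part_etag(data: bytes, algo: str = "md5") -> str:
    """Hex digest of a single part."""
    return hashlib.new(algo, data).hexdigest()


def composite_etag(part_etags: list[str], algo: str = "md5") -> str:
    """Digest over the concatenated *binary* part digests, plus '-<part count>'."""
    digest_bytes = b"".join(bytes.fromhex(e) for e in part_etags)
    return f"{hashlib.new(algo, digest_bytes).hexdigest()}-{len(part_etags)}"


parts = [b"a" * 5, b"b" * 3]          # toy parts; real parts are >= 5 MB
etags = [part_etag(p) for p in parts]
print(composite_etag(etags))
```

Note that the result is not the digest of the whole object, so clients cannot verify a multipart object by hashing the concatenated bytes.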

4. Storage layer

Two backends, picked at PUT by size:

SMALL OBJECT  (< 4 MB)            LARGE OBJECT  (≥ 4 MB)
═════════════════════             ═════════════════════════════════
PUT key=foo, 100 KB               PUT key=video, 5 GB (multipart)

      │                                 │ chunk into 4 MB blocks
      ▼                                 ▼
[Frontend batches]                [Frontend EC encoder]
 256 MB extents                    RS(10,4) per chunk
      │                                 │
      ▼                                 ▼
[3x replicated extent]            [14 shards × 3 AZs]
 (W=2/N=3 sync)                    (W=10/N=14 sync)
      │ async, after seal               │
      ▼                                 ▼
[EC re-encode → seal]             [Sealed immediately]

Metadata: 64 B per obj             Metadata: chunk_map ~1 KB / 5 GB obj
          (extent_id, off, len)              (chunk_id × N)

The small-object packer is Haystack/needle-style: it collapses 8M small writes/s into ~30k extent-level appends/s and achieves >97% disk utilization vs. ~60% for one-object-per-file. The IOPS reduction is the entire point; small files would otherwise destroy you.

EC vs replication, with numbers

                    │ 3x Repl  │ RS(10,4) │ LRC(12,2,2) │ Clay
────────────────────┼──────────┼──────────┼─────────────┼─────────
Storage overhead    │   3.0x   │   1.4x   │    1.33x    │  ~1.4x
Min shards to read  │     1    │    10    │     12      │    10
Single-shard repair │     1    │    10    │      6      │    ~5
Decode CPU          │   none   │   med    │     med     │   high
Production maturity │ ★★★★★  │ ★★★★★   │   ★★★★    │   ★★
Default fit         │  ingest  │  warm    │    hot      │ research
                    │  buffer  │  default │  rebuild-   │
                    │          │          │  hot data   │

▸ STAFF: Don’t pick a code rate; pick a policy: 3x replication on the open ingest extent for ~5 min, then async re-encode to RS(10,4). Tectonic/Pelican do this. Naive answer (“use 3x replication”) burns 2x your storage budget. Naive-plus answer (“use RS”) tanks your write latency because you’re encoding on the hot path.

The 11-nines analysis: RS(10,4) placed cross-rack at 2% AFR per disk gives >12 nines against independent failures. The binding constraint is correlated failures (firmware bugs, AZ events), so placement matters more than code rate.
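
A back-of-envelope version of the independent-failure estimate, as a sketch only. It assumes failures are independent, that a chunk is lost when more than m of its shards fail inside one repair window, and a 24-hour repair window; the window length is an assumption made here, not a figure from the design.

```python
from math import comb, log10

HOURS_PER_YEAR = 8760.0


def annual_chunk_loss_probability(k: int, m: int, afr: float, repair_hours: float) -> float:
    """Approximate yearly probability that one chunk loses more than m shards
    within a single repair window (independent failures only)."""
    n = k + m
    p = afr * repair_hours / HOURS_PER_YEAR          # per-shard failure prob. per window
    p_window = sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(m + 1, n + 1))
    return p_window * (HOURS_PER_YEAR / repair_hours)


loss = annual_chunk_loss_probability(k=10, m=4, afr=0.02, repair_hours=24)
nines = -log10(loss)
print(f"annual per-chunk loss ~ {loss:.2e}  ({nines:.1f} nines)")
```

The model says nothing about correlated failures, which is exactly why the result looks far better than any fleet will achieve in practice.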

Storage in more detail

The storage layer is doing three different things at once, and the summary above jams them together. This section rebuilds it from the ground up and separates them.

The core problem: IOPS, not bytes

Forget capacity for a second. The thing that kills naive object stores isn’t “where do I put 10 EB,” it’s how many disk operations per second the fleet can sustain.

NVMe SSD:  ~500k–1M random IOPS, ~7 GB/s sequential
HDD:       ~100–200 random IOPS, ~250 MB/s sequential

Now look at the workload:

8M small PUTs/sec.  If each becomes 1 filesystem write
(1 inode + 1 data block + 1 metadata commit) on the storage node...

8M PUTs/s × ~3 IOPS/PUT = 24M IOPS needed
24M IOPS / 500k IOPS-per-NVMe = 48 NVMe drives minimum JUST FOR IOPS

But you also need ~24k storage nodes for capacity. So which is binding?

If you naively did one-file-per-object:
  IOPS-binding   → fleet sized by IOPS, capacity wasted
  fsync-binding  → tail latency tanks
  inode-binding  → metadata blows up

This is why small objects need a different storage path than large objects. Not for elegance — because the IOPS math forces it.
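
The IOPS arithmetic is easy to replay per media type. A minimal sketch using the figures above; the 150 IOPS value for HDD is simply the midpoint of the quoted 100–200 range and is an assumption.

```python
import math

small_puts_per_s = 8e6
iops_per_put = 3                       # inode + data block + metadata commit
required_iops = small_puts_per_s * iops_per_put


def drives_needed(iops_per_drive: float) -> int:
    """Drives required to sustain the naive one-file-per-object write load."""
    return math.ceil(required_iops / iops_per_drive)


nvme_drives = drives_needed(500_000)   # low end of the NVMe range
hdd_drives = drives_needed(150)        # assumed midpoint of the HDD range

print(f"required IOPS : {required_iops:,.0f}")
print(f"NVMe drives   : {nvme_drives:,}")
print(f"HDD drives    : {hdd_drives:,}")
```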

Path 1: Small objects → log-structured packing (Haystack)

The trick: don’t store one object per file. Concatenate many objects into one big append-only file (“extent”). Facebook published this as Haystack in 2010, and most modern blob stores handle small files the same way.

Naive (one file per object):
─────────────────────────────
filesystem:
  /data/aa/bb/foo.jpg     ← inode + 1 block + metadata
  /data/aa/bb/bar.jpg     ← inode + 1 block + metadata
  /data/aa/bb/baz.jpg     ← inode + 1 block + metadata
  ... 100M more ...

Each PUT = 3+ IOPS. 8M PUT/s = dies.


Log-structured (Haystack-style):
─────────────────────────────────
ONE file on disk: extent_42

  ┌──────────────────────────────────────────────────────┐
  │ [hdr] foo.jpg bytes [hdr] bar.jpg bytes [hdr] baz... │
  └──────────────────────────────────────────────────────┘
   ↑                    ↑                    ↑
   offset=0             offset=104832        offset=237449
   length=104832        length=132617        length=88011

Metadata service stores:
  foo.jpg → (extent=42, offset=0,      length=104832)
  bar.jpg → (extent=42, offset=104832, length=132617)
  baz.jpg → (extent=42, offset=237449, length=88011)

Each PUT = 1 sequential append. 30k extent-writes/s handles 8M obj/s.

The frontend batches writes from many concurrent clients into the same extent in memory, then does one sequential append to disk. Sequential writes on NVMe hit ~7 GB/s; random writes hit a small fraction of that. You’ve collapsed 8M random IOPS into 30k sequential appends.

Reads work because the metadata service knows the (extent, offset, length) for every object. A GET becomes: “metadata lookup → seek to offset on extent → read length bytes.” On NVMe, the seek is free.

Why this matters: without packing, the small-object workload is mathematically infeasible. With packing, it becomes a metadata problem (which is solvable with sharded KV) instead of a disk problem.
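
A toy packer shows how little machinery the idea needs. This is an in-memory sketch, not a storage engine: the Extent and Packer classes are hypothetical, per-needle headers and checksums are omitted, and a plain dict stands in for the metadata service.

```python
from dataclasses import dataclass, field


@dataclass
class Extent:
    extent_id: int
    data: bytearray = field(default_factory=bytearray)
    sealed: bool = False

    def append(self, payload: bytes) -> tuple[int, int]:
        """Append-only write; returns (offset, length) of the stored payload."""
        if self.sealed:
            raise RuntimeError("extent is sealed")
        offset = len(self.data)
        self.data += payload
        return offset, len(payload)

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self.data[offset:offset + length])


class Packer:
    def __init__(self, extent_id: int) -> None:
        self.extent = Extent(extent_id)
        self.index: dict[str, tuple[int, int, int]] = {}   # key -> (extent, offset, length)

    def put(self, key: str, payload: bytes) -> None:
        offset, length = self.extent.append(payload)
        self.index[key] = (self.extent.extent_id, offset, length)

    def get(self, key: str) -> bytes:
        _, offset, length = self.index[key]
        return self.extent.read(offset, length)


packer = Packer(extent_id=42)
packer.put("foo.jpg", b"f" * 104832)
packer.put("bar.jpg", b"b" * 132617)
packer.put("baz.jpg", b"z" * 88011)
print(packer.index)
```

The index entries match the (extent, offset, length) rows in the diagram above, which is all the metadata service has to remember per small object.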

Path 2: Large objects → chunking + erasure coding

Large objects (≥ 4 MB) get a different path because:

They’re already big enough that one-object-per-file doesn’t waste IOPS.
They’re big enough that 3× replication is expensive at scale.
They’re big enough that you can split them across many disks for parallel reads.

PUT video.mp4, 5 GB

Step 1: Chunk into 4 MB blocks
───────────────────────────────
[chunk 1][chunk 2][chunk 3] ... [chunk 1280]   (1280 chunks)

Step 2: For EACH chunk, erasure code it
────────────────────────────────────────
chunk_1 (4 MB)
       │
       ▼
   ┌───────────────────────────────────────┐
   │  RS(10,4) encoder                     │
   │  splits 4 MB into 10 data shards      │
   │  computes 4 parity shards             │
   └───────────────────────────────────────┘
       │
       ▼
   d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 p1 p2 p3 p4
   ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓   ↓  ↓  ↓  ↓
  host1 host2 ... host14 (each on a different host, ≥3 AZs)

Each shard is 4 MB / 10 = 400 KB.
Total stored: 14 × 400 KB = 5.6 MB for a 4 MB chunk.
Overhead = 1.4×.

What erasure coding actually does (the math, briefly)

You don’t need to know Reed-Solomon internals to design this, but you need the intuition:

RS(k, m) means:
  - Take your data, split into k equal pieces (data shards)
  - Compute m extra pieces (parity shards) from the data
  - Store all k+m pieces on different machines
  - You can reconstruct the original from ANY k of the k+m pieces

RS(10, 4):
  10 data + 4 parity = 14 shards
  Can lose any 4 of 14 and still recover.
  Storage overhead: 14/10 = 1.4×.

Compare to 3× replication:
  Store 3 full copies.
  Can lose any 2 of 3.
  Storage overhead: 3.0×.

           Replication     RS(10,4)
Cost          3.0×           1.4×       ← EC wins big
Tolerance     2 losses       4 losses   ← EC also wins
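
The "any k of k+m" property is easiest to see with the simplest possible code. The sketch below is not Reed-Solomon: it uses a single XOR parity shard (k data + 1 parity), which tolerates exactly one loss. Real RS generalizes the same idea to m parities with Galois-field arithmetic. The overhead helper applies to any (k, m).

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list[bytes]:
    """Split into k equal shards (zero-padded) and append one XOR parity shard."""
    shard_len = -(-len(data) // k)
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    return shards + [reduce(xor_bytes, shards)]


def recover(shards: list[bytes], lost: int) -> bytes:
    """Rebuild the one missing shard by XOR-ing every surviving shard."""
    return reduce(xor_bytes, (s for i, s in enumerate(shards) if i != lost))


def overhead(k: int, m: int) -> float:
    return (k + m) / k


shards = encode(bytes(range(40)), k=10)           # 10 data shards + 1 parity
rebuilt = recover(shards, lost=3)
print(rebuilt == shards[3], overhead(10, 4), overhead(1, 2))
```

Treating 3x replication as the degenerate code (k=1, m=2) puts both schemes on the same overhead scale.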

So why doesn’t everything use EC? Because of the repair problem.

The repair problem (and why LRC exists)

A disk fails. You need to rebuild the lost shard onto a new disk.

REPLICATION rebuild:
  Pick a surviving copy. Read 1 shard. Copy it. Done.
  Network read = 1× the lost data.

RS(10,4) rebuild:
  Pick any 10 surviving shards. Read all 10.
  Decode the original chunk. Re-compute the lost shard.
  Network read = 10× the lost data.   ← ouch

At fleet scale, disks fail constantly. If every disk failure costs you 10× its capacity in network bandwidth, your network bisection bandwidth becomes the binding constraint on cluster size.

LRC (Local Reconstruction Codes, published by Microsoft Azure):

LRC(12, 2, 2):
  12 data shards, split into 2 LOCAL groups of 6
  Each group has 1 local parity (so it can repair 1 loss within the group)
  Plus 2 global parities for catastrophic recovery

  Group A: d1 d2 d3 d4 d5 d6 + l_A
  Group B: d7 d8 d9 d10 d11 d12 + l_B
  Globals: g1 g2

  Total: 12 + 2 + 2 = 16 shards. Overhead 16/12 ≈ 1.33×

  Single-shard rebuild within Group A:
    Read 6 shards from Group A only.
    Network read = 6× the lost data, not 10×.

  Catastrophic rebuild (>1 loss in a group):
    Fall back to global decode, read 12 shards.
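
Local repair can be sketched with XOR as the local parity. This is a simplification: global parities are omitted entirely, and the shard contents are synthetic. It only demonstrates that a single loss is rebuilt from 6 reads inside the group.

```python
from functools import reduce

GROUP_SIZE = 6


def xor_all(shards: list[bytes]) -> bytes:
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), shards)


# 12 synthetic data shards in two local groups of 6, one XOR parity per group.
data = [bytes((i * 17 + j) % 256 for j in range(16)) for i in range(12)]
groups = [data[0:6], data[6:12]]
local_parity = [xor_all(g) for g in groups]


def repair_local(lost_index: int) -> tuple[bytes, int]:
    """Rebuild one lost data shard from its own group; returns (shard, shards_read)."""
    g, pos = divmod(lost_index, GROUP_SIZE)
    reads = [s for i, s in enumerate(groups[g]) if i != pos] + [local_parity[g]]
    return xor_all(reads), len(reads)


rebuilt, shards_read = repair_local(lost_index=8)
print(rebuilt == data[8], shards_read)
```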

The pick:

Hot tier (lots of churn, frequent disk replacements): LRC. Repair bandwidth is the cost driver.

Warm tier (default): RS(10,4). Tolerates any 4 losses and is simpler to operate, at slightly higher storage overhead than LRC (1.4× vs. 1.33×).

Cold tier: even wider EC like RS(20,4) for cheaper storage at the cost of slower repair.

Why two-stage write (3× replication → re-encode to EC)

Now the missing piece: encoding EC on the write path is slow and CPU-heavy. If you encode a 4 MB chunk into 14 shards on the hot path, you’ve added encoder CPU (~1–2 ms) plus 14 cross-host writes (slowest of which dominates).

Replication is fast: just send the same bytes to 3 hosts. No CPU overhead.

So the production pattern is:

   Write arrives
        │
        ▼
   ┌─────────────────────────────────────┐
   │ Stage 1: INGEST (hot path)          │
   │ Replicate 3× to an open extent      │
   │ W=2/N=3 — ack as soon as 2 of 3     │
   │ Latency: ~5–10 ms (network only)    │
   └─────────────────────────────────────┘
        │
        │  extent fills up (~256 MB)
        │  OR ages out (~5 minutes)
        ▼
   ┌─────────────────────────────────────┐
   │ SEAL the extent (no more writes)    │
   └─────────────────────────────────────┘
        │
        ▼
   ┌─────────────────────────────────────┐
   │ Stage 2: BACKGROUND re-encode       │
   │ Read sealed extent, encode RS(10,4) │
   │ Write 14 shards across failure dom. │
   │ Delete the 3 replicas after verify  │
   │ Latency: doesn't matter (async)     │
   └─────────────────────────────────────┘

You only pay the 3× storage cost for the ~5-minute “open” window. After that, the data lives at 1.4× overhead. Hot-path latency stays low; steady-state storage cost stays low. This is what Tectonic (Meta) and Pelican (MS) do.
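
The stage-1 acknowledgement rule (ack at 2 of 3) can be sketched with asyncio. Replica names and delays are made up for illustration, and sleep stands in for network plus fsync.

```python
import asyncio


async def write_replica(name: str, delay_s: float) -> str:
    await asyncio.sleep(delay_s)          # stands in for network + fsync
    return name


async def quorum_write(replica_delays: dict[str, float], w: int) -> list[str]:
    """Return the names of the first w replicas to acknowledge the write."""
    tasks = [asyncio.create_task(write_replica(n, d)) for n, d in replica_delays.items()]
    acked: list[str] = []
    for next_done in asyncio.as_completed(tasks):
        acked.append(await next_done)
        if len(acked) >= w:
            break
    # In this sketch asyncio.run() cancels the straggler on exit; a real
    # system would let it finish (or repair it) in the background.
    return acked


acked = asyncio.run(quorum_write({"az-a": 0.01, "az-b": 0.03, "az-c": 0.30}, w=2))
print(acked)
```

The client-visible latency is set by the second-fastest replica, which is why one slow AZ does not show up in the PUT tail.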

Putting it back together

                     PUT request
                          │
                          ▼
                  ┌───────────────┐
                  │ size < 4 MB?  │
                  └───┬───────┬───┘
                  yes │       │ no
                      ▼       ▼
        ┌──────────────────┐ ┌──────────────────────┐
        │ SMALL PATH       │ │ LARGE PATH           │
        │                  │ │                      │
        │ Pack into shared │ │ Chunk into 4 MB blks │
        │ 256 MB extent    │ │ Each chunk: 3× repl  │
        │ 3× replicate     │ │ on open chunk-group  │
        │ extent           │ │                      │
        │                  │ │ When sealed:         │
        │ When extent      │ │   re-encode RS(10,4) │
        │ sealed:          │ │   spread 14 shards   │
        │   re-encode to   │ │   across 3 AZs       │
        │   RS(10,4) or    │ │                      │
        │   LRC for hot    │ │                      │
        └──────────────────┘ └──────────────────────┘
                  │                       │
                  └───────────┬───────────┘
                              ▼
                  Metadata service stores:
                    - small: (extent, offset, length)
                    - large: chunk_map = list of chunk_ids
                             chunk_id → 14 shard locations

The whole storage layer is solving three problems with three different mechanisms:

Problem                                │ Mechanism                            │ Cost saved
───────────────────────────────────────┼──────────────────────────────────────┼─────────────────────────────────────────────────────────────
Small objects burn IOPS                │ Log-structured packing (Haystack)    │ 100× IOPS reduction
Replication wastes storage at EB scale │ Erasure coding (RS / LRC)            │ 3.0× → 1.4× overhead
EC is slow on the write path           │ Stage-1 replicate, stage-2 re-encode │ Latency stays at replication speed; storage stays at EC cost

5. Placement & replication

Centralized placer (chosen)        vs.  Consistent hash (vnodes)    vs.  CRUSH
- Hierarchy-aware                       - Trivially scaling              - Hierarchy-aware
- Capacity/heat-aware                   - No domain control              - Static rules
- Recovery budget tunable               - O(1/N) move on add            - Painful update churn
- ~50 µs / placement (batched)          - Best for KV, not blob         - Caps ~10k OSDs cleanly

Failure-domain hierarchy:

       Region (us-east-1)
           │
   ┌───────┼───────┬───────┐
   AZ-a   AZ-b   AZ-c   AZ-d        ◄ chunk spans ≥3 AZs
    │
  Row-1   Row-2   Row-3
    │
   Rack Rack Rack                   ◄ chunk spans ≥10 racks
    │
   Host Host Host                   ◄ 14 distinct hosts per chunk
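
The placement constraints in the hierarchy can be written as a checker. This is a sketch with hypothetical names (Shard, check_placement); the extra survives_az_loss check is an addition made here to show how many shards a single AZ may hold before an AZ outage exceeds the code's tolerance.

```python
from collections import Counter
from typing import NamedTuple


class Shard(NamedTuple):
    az: str
    rack: str
    host: str


def check_placement(shards: list[Shard], m: int = 4,
                    min_azs: int = 3, min_racks: int = 10) -> dict[str, bool]:
    per_az = Counter(s.az for s in shards)
    return {
        "distinct_hosts": len({s.host for s in shards}) == len(shards),
        "enough_azs": len(per_az) >= min_azs,
        "enough_racks": len({(s.az, s.rack) for s in shards}) >= min_racks,
        # An AZ outage removes every shard in that AZ; the code tolerates m losses.
        "survives_az_loss": max(per_az.values()) <= m,
    }


def round_robin(n: int, azs: list[str]) -> list[Shard]:
    return [Shard(az=azs[i % len(azs)], rack=f"rack-{i}", host=f"host-{i}") for i in range(n)]


three_az = check_placement(round_robin(14, ["az-a", "az-b", "az-c"]))
four_az = check_placement(round_robin(14, ["az-a", "az-b", "az-c", "az-d"]))
print(three_az)
print(four_az)
```

With 14 shards over only 3 AZs, some AZ must hold at least 5 shards, one more than RS(10,4) can lose, so the checker flags that layout; spreading over 4 AZs (4/4/3/3) passes. Whether a chunk must survive a full-AZ loss on its own is a policy decision this sketch does not settle.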

Recovery bandwidth budget: cap at 20% of bisection BW
  18k nodes × 200 Gbps × 20% = 720 Tbps headroom
  Full host-loss recovery (800 TB / 720 Tbps) ≈ 9 s minimum,
    realistic ~1 hour with throttling and scrub priority.
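
The recovery-budget arithmetic, replayed as a sketch. The first figure is the raw copy time quoted above; the second applies the 10× read amplification of an RS(10,4) rebuild, which is an extra step added here for comparison.

```python
nodes = 18_000
nic_gbps = 200
recovery_budget = 0.20                       # cap at 20% of aggregate bandwidth

recovery_tbps = nodes * nic_gbps * recovery_budget / 1000
node_capacity_tb = 800

raw_copy_s = node_capacity_tb * 8 / recovery_tbps        # terabytes -> terabits
rs_rebuild_s = raw_copy_s * 10                           # read 10 shards per lost shard

print(f"recovery headroom : {recovery_tbps:,.0f} Tbps")
print(f"raw copy time     : {raw_copy_s:.1f} s")
print(f"RS(10,4) rebuild  : {rs_rebuild_s:.0f} s of network reads")
```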

▸ STAFF: Same-AZ EC with cross-AZ replication is strictly worse than cross-AZ EC at this scale. It doubles your network bill on rebuild and gives up the AZ-correlated-failure protection that 11 nines actually requires. S3 Express explicitly trades this away to be 10x faster — that’s the right move for a separate tier, not the default.

6. Metadata service — the actual bottleneck

RANGE-PARTITIONED  (chosen)        HASH-PARTITIONED  (rejected)
═══════════════════════════         ═══════════════════════════════
Range 1: aaa..foo                  Shard 7: hash(bucket||key) % 1024
Range 2: foo..mum                  Shard ...:
Range 3: mum..zzz

LIST prefix='2026/04/' →           LIST prefix='2026/04/' →
  hits 1–2 ranges ✓                  fans out to ALL 1024 shards ✗
                                     OR maintain secondary list index ✗
                                     (now you have 2 consistency
                                      problems instead of 1)

Auto-split on:                     Hot-key behavior:
  size > 100 MB                      no graceful split
  QPS > 30k sustained                only mitigation: app-level salting
  prefix-aware boundary
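
The LIST fan-out difference can be sketched with bisect over the split points. The split keys come from the diagram above; the function names are hypothetical, and the prefix upper bound assumes a non-empty prefix whose last character is not the maximum code point.

```python
from bisect import bisect_left, bisect_right

# Range 0: keys < "foo"; range 1: "foo" <= key < "mum"; range 2: keys >= "mum".
SPLITS = ["foo", "mum"]


def prefix_upper_bound(prefix: str) -> str:
    """Smallest string that sorts after every key starting with prefix."""
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def ranges_for_prefix(prefix: str, splits: list[str] = SPLITS) -> list[int]:
    """Range-partitioned: only the ranges overlapping [prefix, upper_bound)."""
    first = bisect_right(splits, prefix)
    last = bisect_left(splits, prefix_upper_bound(prefix))
    return list(range(first, last + 1))


def shards_for_prefix_hashed(num_shards: int = 1024) -> list[int]:
    """Hash-partitioned: a prefix says nothing about placement, so ask everyone."""
    return list(range(num_shards))


print(ranges_for_prefix("2026/04/"))          # one range
print(ranges_for_prefix("f"))                 # straddles the "foo" split
print(len(shards_for_prefix_hashed()))        # full fan-out
```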

Schema:

buckets       (bucket_id PK) → policy, ACL, versioning, lifecycle
objects       (bucket_id, key, version DESC) PK
              → size, etag, chunk_map_ref, content_type, ...
chunks        (chunk_id) PK
              → ec_scheme, shard_locations[14], placement_epoch
mpu_uploads   (bucket_id, upload_id) PK → state, initiator, key
mpu_parts     (upload_id, part_num) PK → etag, chunk_ref, size

5-replica Paxos/Raft per range across 3 AZs, leader leases ~10 s. A read-replica fleet serves HEAD/GET reads behind a strong-consistency epoch fence, so reads can scale out across a large replica fleet without losing linearizability.

Hot-bucket mitigation: prefix-aware splits at detected boundaries (not arbitrary midpoints), per-prefix QPS counters in the leader, optional client-side salting hint for known hot prefixes. At 30M ops/s and 50k QPS/shard sustained: ~600 minimum, provision 2k for headroom + skew.

▸ STAFF: Most candidates draw boxes for “metadata service” and move on. The metadata service is the entire system’s QPS ceiling — it sees more ops than the data plane (every PUT = 2+ writes, every GET = 1 read). If you don’t have a number for it, you don’t have a design.

7. Consistency model

S3 switched eventual → strong in December 2020. Implementation pattern (the right one to copy):

PUT path:
  Frontend ──► write data shards (EC) ──► all shard acks
                                              │
                                              ▼
              Metadata Paxos commit  ◄── only after data durable
                      │
                      ▼
              client 200 OK

  Object visible to GET/LIST iff metadata commit succeeded.
  No "in-between" state externally observable.

GET path:
  Frontend ──► metadata read (linearizable) ──► chunk_map
            ──► storage read ──► bytes

Costs vs eventual:
  +1 Paxos round on PUT  ≈ +3–5 ms intra-region (cross-AZ quorum)
                         ≈ +80 ms cross-region (multi-region buckets)
  Eventual saves the round but creates an entire bug class
  (write S3, publish SQS, consumer reads stale S3) — not worth it.

Fenced caches: every cache entry carries (version, etag, lease_epoch). Overwrite bumps the epoch, revoking all stale cached copies in O(1) by epoch comparison, not per-key invalidation.
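
A fenced cache can be sketched in a few lines. The scope of the epoch (one counter per cache instance here) is an assumption; in a real deployment the authoritative epoch would live with the metadata leader and the class names below are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    version: int
    etag: str
    lease_epoch: int


class FencedCache:
    def __init__(self) -> None:
        self.epoch = 0
        self._entries: dict[str, Entry] = {}

    def fill(self, key: str, version: int, etag: str) -> None:
        """Cache an object row, stamped with the current epoch."""
        self._entries[key] = Entry(version, etag, self.epoch)

    def lookup(self, key: str) -> Optional[Entry]:
        """Serve only entries stamped with the current epoch."""
        entry = self._entries.get(key)
        if entry is None or entry.lease_epoch != self.epoch:
            return None                      # miss, or fenced by a newer epoch
        return entry

    def on_overwrite(self) -> None:
        """O(1) revocation: no per-key invalidation, just bump the epoch."""
        self.epoch += 1


cache = FencedCache()
cache.fill("photos/cat.jpg", version=1, etag="abc")
hit_before = cache.lookup("photos/cat.jpg")
cache.on_overwrite()
hit_after = cache.lookup("photos/cat.jpg")
print(hit_before, hit_after)
```

The trade-off is coarseness: one overwrite fences every entry in the epoch's scope, so the scope has to be small enough that refill traffic stays cheap.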

8. Failure handling

Quorum table:
  Ingest extent (3x repl)   N=3   W=2   R=1
  Sealed EC chunk           N=14  W=10  R=10  (any 10 of 14)
  Metadata range            N=5   W=3   R=3   (Paxos)

Anti-entropy:
  Per-extent Merkle tree, gossiped between replica peers
  Mismatch → re-replicate from majority
  Scrub: every disk, every 14 days, ~50 MB/s background

Hinted handoff:
  Write to offline target → buffer on peer with hint → replay on return

Cell-based blast radius:
  ┌──────── Cell 1 ────────┐  ┌──────── Cell 2 ────────┐
  │ ~1k storage nodes      │  │ ~1k storage nodes      │
  │ dedicated md ranges    │  │ dedicated md ranges    │
  │ dedicated FE pods      │  │ dedicated FE pods      │
  └────────────────────────┘  └────────────────────────┘
       hard isolation              hard isolation
   no shared control plane     no shared control plane

  → bug or rogue tenant ≤ 1 cell of damage
  → weekly cell-level chaos drills

RPO/RTO:
  Intra-region:   RPO = 0 (sync), RTO < 30 s (md leader election)
  Multi-region:   RPO ≈ 5 s (async repl), RTO minutes (DNS flip)
  Cold backup:    weekly md snapshot + continuous WAL ship to
                  separate pool with different SW version

▸ STAFF: “Cells” with hard isolation is the move that keeps a 10 EB system from being one bug away from headlines. AWS, Google, Meta all do this. Junior answers leave the entire fleet sharing one control plane.

9. Hot-path optimization

Request hedging on small GETs: dispatch the primary, then dispatch a secondary to another replica if the primary has not returned by the p95 latency (~12 ms). Cuts p99 by ~3x for ~5% extra read load, since by definition about 5% of requests outlive a p95 timer. Cancel-on-first-success.
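
Hedging with cancel-on-first-success can be sketched with asyncio. The fetch function and replica names are stand-ins, error handling is omitted, and the 12 ms timer is the p95 figure quoted above.

```python
import asyncio
from typing import Awaitable, Callable


async def hedged_get(fetch: Callable[[str], Awaitable[str]],
                     replicas: list[str], hedge_after_s: float) -> str:
    primary = asyncio.create_task(fetch(replicas[0]))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:                                   # primary beat the hedge timer
        return primary.result()
    secondary = asyncio.create_task(fetch(replicas[1]))
    done, pending = await asyncio.wait({primary, secondary},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                          # cancel-on-first-success
    return done.pop().result()


async def fake_fetch(replica: str) -> str:
    delays = {"slow": 0.20, "fast": 0.005}     # simulated replica latencies
    await asyncio.sleep(delays[replica])
    return f"bytes-from-{replica}"


hedged = asyncio.run(hedged_get(fake_fetch, ["slow", "fast"], hedge_after_s=0.012))
unhedged = asyncio.run(hedged_get(fake_fetch, ["fast", "slow"], hedge_after_s=0.012))
print(hedged, unhedged)
```

In the first call the primary is the slow replica, so the hedge fires at 12 ms and the fast replica wins; in the second the primary answers before the timer and no second request is sent.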

Edge caching is workload-dependent. For media-bucket-fronted-by-CDN: 95% hit rate, decisive. For ML training reading uniformly across an EB-scale shuffled dataset: hit rate ≈ 0, cache is pure overhead. Make it opt-in, per bucket, with explicit TTL.

Pre-signed URLs going direct to storage bypass the frontend on the data path:

WITHOUT pre-sign:                 WITH pre-sign:
  Client → FE → Storage             Client → FE  (sign only)
         data path                  Client → Storage  (data path)
                                         direct, FE not in line
  ~3 ms extra hop                   FE handles 2x more sigs/sec
                                    for the same hardware.

Saves ~1–3 ms on small GETs, and offloads roughly half the frontend bytes-in-flight at the high end of the large-object distribution.

10. End-to-end flow

5 GB multipart upload, 64 MB parts (80 parts), 10-way client parallelism

ms      Client          Frontend         Metadata        Storage
────────────────────────────────────────────────────────────────
  0   ─Initiate────►
  1                  auth
  3                  ─ Paxos write ──►
  8                                     ◄── ack
  8                  ◄─uploadId──

 10   ─UploadPart 1..10 (parallel)─►
                     EC encode (RS 10,4)
                     ─── fan-out 14 shards × 3 AZs ──────►
 30                                                        NVMe + fsync
 35                                                ◄── 10/14 acks
                     ─ part metadata ─►
 40                  ◄─ETag──

 ...   [parts 11..80 stream similarly; client uplink-bound]

 250  ─Complete────►
 255                 validate manifest
 257                 ─atomic flip────►
 262                                    ◄── committed
 263                 ◄─200 OK──

  Net: ~250 ms protocol overhead on a ~2-second client-uplink-bound upload.
  Object visible to LIST/GET at t=262 ms, atomically.

1 KB GET

ms      Client          Frontend         Metadata        Storage
────────────────────────────────────────────────────────────────
  0   ─GET key────►
0.05                 HMAC verify
                     md cache lookup
                     ├── HIT  (80%) → 1 ms
                     └── MISS       → ──►
  5                                      ◄── obj row
  5                  range read ──────────────────────►
                                                       NVMe ~100 µs
  6                                                ◄── 1 KB
  6                  ◄─200 OK + body──

  p50:  5–10 ms  (cache hit dominates)
  p99: 25–40 ms  (with hedging at p95=12 ms)
  p999: 100–150 ms (tail from gc/scrub interference, fsync stalls)

11. Recent developments (2025–2026)

Shift                       │ What it changes for a 2026 design
────────────────────────────┼─────────────────────────────────────
S3 Express One Zone         │ Single-AZ tier mandatory; -85% GET cost
April 2025 reprice          │ → ship a "directory bucket" type
                            │
S3 strong consistency       │ Eventual is no longer defensible default
Dec 2020, now well-known    │ → metadata-first commit + fenced caches
                            │
R2/Tigris/B2 zero egress    │ ~15× cost gap vs S3 on egress-heavy WL
                            │ → egress price = strategy, not COGS
                            │
S3-over-RDMA + GPUDirect    │ Meta: 3.8× training speedup w/ GDS
NVIDIA GA Jan 2026          │ → RDMA data plane = table stakes for AI
                            │
S3 Tables (Iceberg-native)  │ Object store ⇒ tabular DB substrate
re:Invent 2024              │ → typed buckets: blob | table | vector
                            │
LRC / Clay in production    │ Repair BW = hot-tier cost driver
                            │ → LRC for hot, RS for warm
                            │
S3 Vectors                  │ Vector search inside the object store
2025                        │ → ANN APIs as first-class bucket type

The April 2025 S3 Express price drop was substantial: 31% off storage, 55% off PUT, 85% off GET, with single-digit ms latency, 2M GET/s and 200K PUT/s per directory bucket. Storage is still ~5x Standard, so it’s not a default — it’s the AI-training-scratch-space tier you ship alongside Standard.

R2 has rewritten the cost model. $0.015/GB-month with zero egress fees; at 10 TB stored / 5 TB egress, AWS S3 costs ~15× the cheapest alternative. A new design that doesn’t take a position on egress (match it, or own a non-egress moat) is shipping a 2020 cost model.
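
Since the design must reason in dollars, a tiny cost function helps keep the comparison honest. This is a sketch under stated assumptions: the S3 prices below ($0.023/GB-month storage, $0.09/GB egress) are assumed list prices that should be checked against current pricing, the R2 storage price is the one quoted above, and request charges are left out.

```python
def monthly_cost(stored_gb: float, egress_gb: float,
                 storage_per_gb: float, egress_per_gb: float) -> float:
    """Storage plus egress only; request charges are not modeled."""
    return stored_gb * storage_per_gb + egress_gb * egress_per_gb


STORED_GB = 10_000      # 10 TB stored
EGRESS_GB = 5_000       # 5 TB egress per month

s3_cost = monthly_cost(STORED_GB, EGRESS_GB, storage_per_gb=0.023, egress_per_gb=0.09)
r2_cost = monthly_cost(STORED_GB, EGRESS_GB, storage_per_gb=0.015, egress_per_gb=0.0)

print(f"S3 (assumed list prices): ${s3_cost:,.0f}/month")
print(f"R2 (zero egress)        : ${r2_cost:,.0f}/month")
print(f"ratio                   : {s3_cost / r2_cost:.1f}x")
```

Under these assumed prices the S3-to-R2 ratio comes out near 4.5x, with egress making up about two thirds of the S3 bill. Reproducing the ~15x gap requires the pricing of whichever provider is cheapest for the workload, which is not modeled here.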

NVIDIA’s RDMA for S3 went GA via the CUDA Toolkit in early 2026. RDMA bypasses host CPU on the data path, delivering higher throughput per terabyte and significantly lower latencies vs TCP. Meta reported 3.8× training speedup with GPUDirect Storage, going from 50 GB/s CPU-bottlenecked to 192 GB/s direct-to-GPU and dropping data loading from 35% to 5% of training time. The right factoring is control-plane on HTTP/TCP, object payloads on RDMA — don’t try to RDMA SigV4.

S3 Tables made the object store a database substrate. Native Iceberg with up to 3× faster queries and 10× more transactions per second than self-managed Iceberg on standard S3. The implication: a 2026 design has typed bucket modes (blob, tabular, vector), with compute-on-read (Lambda-equivalent runtime per cell) as a first-class extensibility point.

Posture

A system designed today: strong-consistent metadata via range-partitioned Spanner-class KV, RS(10,4) warm tier + LRC(12,2,2) hot tier + 3x replicated ingest extents async-encoded, log-structured packing for small objects, S3-over-RDMA data plane with TCP fallback, single-AZ “express” tier and multi-AZ default tier as separate bucket types, native Iceberg-and-vector typed buckets, zero or low egress as a competitive position, and aggressive cell-based blast-radius isolation. Anything less is shipping a 2020 design in 2026.