Simulation Metrics: A 1-Hour Interview Learning Session

Companion notebook: simulation_metrics_colab.ipynb

Simulation metrics answer a deceptively hard question:

Did we generate a scenario that is realistic, physically valid, controllable, and useful for safety evaluation?

One metric cannot answer that. You need a dashboard.

0-10 min Why trajectory error is not enough
10-25 min Safety and map-validity metrics
25-40 min Realism, log divergence, and kinematics
40-50 min Evaluating generated scenarios
50-60 min Interview answers and drills

Autonomous driving simulation can fail in two opposite ways:

Very realistic, but boring:
  good log replay, poor long-tail safety coverage
Very challenging, but fake:
  lots of collisions, impossible actor motion

Good metrics distinguish:

  • realism,
  • safety relevance,
  • physical feasibility,
  • map compliance,
  • diversity,
  • controllability.

In robotics, the same idea applies: a generated manipulation rollout must be physically plausible and task-relevant, not just different.


Average displacement error:

$$\mathrm{ADE} = \frac{1}{T}\sum_{t=1}^{T}\|\hat{p}_t-p_t\|_2$$

Final displacement error:

$$\mathrm{FDE} = \|\hat{p}_T-p_T\|_2$$

For $K$ predicted modes:

$$\mathrm{minADE} = \min_k \frac{1}{T} \sum_{t=1}^{T} \|\hat{p}^{(k)}_t-p_t\|_2$$
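As a minimal sketch of the three formulas above, the following dependency-free Python computes ADE, FDE, and minADE for a single agent. The function names and the toy trajectories are illustrative assumptions.

```python
import math

def displacement(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def ade(pred, target):
    """Mean per-step displacement between two equal-length trajectories."""
    return sum(displacement(p, q) for p, q in zip(pred, target)) / len(target)

def fde(pred, target):
    """Displacement at the final timestep only."""
    return displacement(pred[-1], target[-1])

def min_ade(modes, target):
    """ADE of the best of K predicted modes."""
    return min(ade(mode, target) for mode in modes)

# Toy data (assumed): the logged future goes straight along x.
target = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
modes = [
    [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],  # mode 0 follows the log
    [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)],  # mode 1 drifts left
]

drift_ade = ade(modes[1], target)   # (1 + 2 + 3) / 3 = 2.0
drift_fde = fde(modes[1], target)   # 3.0
best_ade = min_ade(modes, target)   # 0.0, because mode 0 matches the log
```

Note that minADE is 0 here even though one of the two modes is far from the log, which is why it is usually reported alongside other metrics.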

Why trajectory error is useful:

  • Simple.
  • Easy to compare.
  • Good for prediction against logs.

Why it is insufficient:

  • Penalizes plausible alternatives.
  • Does not check collision.
  • Does not check offroad.
  • Does not check wrong-way.
  • Does not check generated mode probabilities.

Interview phrase:

ADE is a prediction metric, not a full simulation-quality metric.


For agents $i, j$ with bounding boxes $B_i(t)$ and $B_j(t)$:

$$\text{collision}_{ij}(t) = \mathbf{1}\left[B_i(t)\cap B_j(t)\ne \varnothing\right]$$

Use oriented boxes when possible, not just center distance.
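One common way to test oriented boxes for overlap is the separating axis test. The sketch below is an assumption about how such a check could look, not a prescribed implementation: boxes are given as hypothetical `(cx, cy, length, width, yaw)` tuples, and touching boxes are counted as overlapping.

```python
import math

def box_corners(cx, cy, length, width, yaw):
    """Corners of an oriented rectangle centred at (cx, cy), yaw in radians."""
    c, s = math.cos(yaw), math.sin(yaw)
    hl, hw = length / 2.0, width / 2.0
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c)
            for dx, dy in ((hl, hw), (hl, -hw), (-hl, -hw), (-hl, hw))]

def _project(corners, axis):
    """Interval covered by the corners when projected onto an axis."""
    dots = [x * axis[0] + y * axis[1] for x, y in corners]
    return min(dots), max(dots)

def boxes_overlap(box_a, box_b):
    """Separating axis test for two oriented rectangles."""
    corners_a, corners_b = box_corners(*box_a), box_corners(*box_b)
    for corners in (corners_a, corners_b):
        for i in range(2):  # a rectangle has two unique edge directions
            x0, y0 = corners[i]
            x1, y1 = corners[i + 1]
            axis = (-(y1 - y0), x1 - x0)  # normal of this edge
            min_a, max_a = _project(corners_a, axis)
            min_b, max_b = _project(corners_b, axis)
            if max_a < min_b or max_b < min_a:
                return False  # found a separating axis
    return True

car = (0.0, 0.0, 4.5, 2.0, 0.0)
# Perpendicular car whose side clips the first car's nose; centers are 3.0 m apart.
crossing = (3.0, 0.0, 4.5, 2.0, math.pi / 2)
# Parallel car in the adjacent lane; centers are only 2.5 m apart.
adjacent = (0.0, 2.5, 4.5, 2.0, 0.0)

hits_crossing = boxes_overlap(car, crossing)  # True
hits_adjacent = boxes_overlap(car, adjacent)  # False
```

In this toy case the colliding pair has the larger center distance, so no single center-distance threshold classifies both pairs correctly.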

Let $\mathcal{D}$ be the drivable area:

$$\text{offroad}(t)=\mathbf{1}[p_t\notin\mathcal{D}]$$

Let $h_t$ be the actor heading and $d_t$ the lane direction (as unit vectors). Then

$$h_t\cdot d_t < \tau$$

flags wrong-way behavior.
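A minimal sketch of this check, assuming headings and lane directions are supplied as angles in radians so that the dot product of the two unit vectors reduces to the cosine of the angle between them. The function name and the default `tau=0.0` are illustrative choices.

```python
import math

def wrong_way_flags(headings, lane_dirs, tau=0.0):
    """Flag each timestep where (heading . lane direction) < tau."""
    flags = []
    for h, d in zip(headings, lane_dirs):
        # Dot product of two unit vectors, equal to cos(h - d).
        dot = math.cos(h) * math.cos(d) + math.sin(h) * math.sin(d)
        flags.append(dot < tau)
    return flags

# Lane points along +x; the actor is aligned, slightly angled, then reversed.
flags = wrong_way_flags([0.0, 0.3, math.pi], [0.0, 0.0, 0.0])
# flags == [False, False, True]
```

With `tau=0.0` anything more than 90 degrees off the lane direction is flagged; a negative `tau` would tolerate sharper angles such as those seen mid-turn.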

Examples of traffic-rule violations:

  • red-light running,
  • stop-sign violation,
  • illegal turn,
  • lane-boundary crossing,
  • speed-limit violation.

These require map and signal state, not just trajectories.
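To illustrate why signal state is needed, here is a deliberately simplified red-light check. It assumes a one-dimensional lane coordinate, a single stop line, and a hypothetical per-timestep string encoding of the signal; a real map and signal interface would look different.

```python
def runs_red_light(xs, signal_states, stop_line_x):
    """Return True if the actor crosses the stop line while the signal is red.

    xs: positions along the lane in meters, one per timestep.
    signal_states: "red", "yellow", or "green" per timestep (assumed encoding).
    """
    for t in range(1, len(xs)):
        crossed = xs[t - 1] < stop_line_x <= xs[t]
        if crossed and signal_states[t] == "red":
            return True
    return False

xs = [8.0, 9.5, 11.0]                                    # crosses x = 10 at the last step
ran_red = runs_red_light(xs, ["red"] * 3, 10.0)          # True
ran_green = runs_red_light(xs, ["green"] * 3, 10.0)      # False
stopped = runs_red_light([8.0, 9.0, 9.5], ["red"] * 3, 10.0)  # False, never crosses
```

The same trajectory is a violation or not depending only on the signal, which a trajectory-only metric cannot see.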


A generated trajectory can be map-valid but physically impossible.

Speed:

$$v_t= \frac{\|p_t-p_{t-1}\|}{\Delta t}$$

Acceleration:

$$a_t= \frac{v_t-v_{t-1}}{\Delta t}$$

Jerk:

$$j_t= \frac{a_t-a_{t-1}}{\Delta t}$$

For vehicles, also inspect:

  • yaw rate,
  • curvature,
  • lateral acceleration,
  • reverse motion,
  • discontinuities.

Kinematic metrics catch teleporting actors and unrealistic sudden turns.
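As a sketch of the vehicle-specific checks, yaw rate and lateral acceleration can be estimated from positions alone by taking the heading of each displacement. This is an assumption-laden approximation: heading is undefined when the actor is stationary, and the function name is illustrative.

```python
import math

def yaw_rate_and_lateral_accel(traj, dt):
    """Estimate yaw rate (rad/s) and lateral acceleration (m/s^2) from (x, y) points."""
    headings, speeds = [], []
    for (x0, y0), (x1, y1) in zip(traj[:-1], traj[1:]):
        headings.append(math.atan2(y1 - y0, x1 - x0))
        speeds.append(math.hypot(x1 - x0, y1 - y0) / dt)
    yaw_rates, lat_accels = [], []
    for i in range(1, len(headings)):
        dpsi = headings[i] - headings[i - 1]
        dpsi = math.atan2(math.sin(dpsi), math.cos(dpsi))  # wrap to [-pi, pi]
        rate = dpsi / dt
        yaw_rates.append(rate)
        lat_accels.append(speeds[i] * rate)  # a_lat = v * yaw_rate
    return yaw_rates, lat_accels

# Drives straight at 10 m/s, then turns 90 degrees within a single 0.1 s step.
traj = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
yaw_rates, lat_accels = yaw_rate_and_lateral_accel(traj, dt=0.1)
# yaw_rates is about [0.0, 15.7] rad/s; lat_accels is about [0.0, 157.1] m/s^2
```

The instantaneous right-angle turn produces a yaw rate and lateral acceleration far beyond what a road vehicle can achieve, even though its speed never changes.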


Realism and safety challenge are in tension.

Log-likelihood high:
  likely realistic, but may be common/boring
Risk high:
  useful for safety, but may be unrealistic

A useful generated scenario should sit in the middle:

rare enough to test the system
plausible enough to matter
controllable enough to reproduce

This is why collision rate alone is not enough. A high collision rate may mean the generator is broken.


When a rollout starts from a logged scene, its divergence from the log at time $t$ is:

$$D(t)=\|\hat{p}_t-p_t^{\text{log}}\|_2$$

This helps measure how quickly a rollout departs from recorded behavior.

Interpretation:

  • Low divergence: close replay.
  • Moderate divergence: plausible variation.
  • Huge early divergence: likely unrealistic or uncontrolled.

But divergence is not always bad. Simulation often wants counterfactuals. The question is whether divergence is plausible and controlled.
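A small sketch of how the divergence curve might be computed and summarized. The helper `first_exceedance` and the 1.0 m threshold are illustrative assumptions; the point is that *when* a rollout departs from the log is often as informative as *how far*.

```python
import math

def log_divergence(rollout, logged):
    """Per-timestep distance D(t) between a rollout and the logged trajectory."""
    return [math.hypot(px - lx, py - ly)
            for (px, py), (lx, ly) in zip(rollout, logged)]

def first_exceedance(divergence, threshold):
    """Index of the first timestep whose divergence exceeds threshold, or None."""
    for t, d in enumerate(divergence):
        if d > threshold:
            return t
    return None

logged = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
rollout = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5), (3.0, 2.0)]

divergence = log_divergence(rollout, logged)      # [0.0, 0.0, 0.5, 2.0]
departs_at = first_exceedance(divergence, 1.0)    # 3
```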


Use a dashboard:

Prediction:
  ADE, FDE, minADE, miss rate
Map validity:
  offroad, wrong-way, rule violation
Safety:
  collision, near-miss, time-to-collision
Physics:
  speed, acceleration, jerk, yaw rate
Realism:
  discriminator score, log divergence, human review
Diversity:
  unique modes, coverage, pairwise distance
Control:
  prompt/intent success rate
Downstream:
  planner failure rate, intervention rate

The strongest interview answer is to discuss metric tradeoffs, not just list metrics.
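One way to make a tradeoff concrete is to report a metric conditioned on another. The sketch below aggregates hypothetical per-scenario boolean flags into rates, then reports the collision rate restricted to scenarios that are also on-road and kinematically feasible. The flag names and the definition of "plausible" are assumptions for illustration.

```python
def dashboard(scenarios):
    """Turn per-scenario boolean flags into a rate per flag."""
    keys = sorted({key for s in scenarios for key in s})
    n = len(scenarios)
    return {key: sum(bool(s.get(key, False)) for s in scenarios) / n for key in keys}

def plausible_collision_rate(scenarios):
    """Share of scenarios with a collision and no offroad or kinematic violation."""
    plausible = [
        s for s in scenarios
        if s["collision"] and not s["offroad"] and not s["kinematic_violation"]
    ]
    return len(plausible) / len(scenarios)

scenarios = [
    {"collision": True,  "offroad": False, "kinematic_violation": False},
    {"collision": True,  "offroad": False, "kinematic_violation": True},
    {"collision": False, "offroad": True,  "kinematic_violation": False},
    {"collision": False, "offroad": False, "kinematic_violation": False},
]

rates = dashboard(scenarios)
# rates == {"collision": 0.5, "kinematic_violation": 0.25, "offroad": 0.25}
plausible = plausible_collision_rate(scenarios)  # 0.25
```

Here half of the scenarios collide, but only a quarter do so through feasible motion, which is the number that matters for safety evaluation.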


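import torch

def ade(pred, target):
    # pred, target: (batch, T, 2); mean L2 error over all trajectories and timesteps
    return torch.norm(pred - target, dim=-1).mean()

def fde(pred, target):
    # L2 error at the final timestep, averaged over the batch
    return torch.norm(pred[:, -1] - target[:, -1], dim=-1).mean()

def kinematic_flags(traj, dt=0.1, max_speed=40.0, max_accel=8.0):
    # traj: (batch, T, 2) positions; finite differences give velocity and speed
    vel = (traj[:, 1:] - traj[:, :-1]) / dt
    speed = torch.norm(vel, dim=-1)
    # Change in speed only, so lateral acceleration is not captured here
    accel = (speed[:, 1:] - speed[:, :-1]) / dt
    # One boolean per trajectory for each limit
    return (speed > max_speed).any(dim=1), (accel.abs() > max_accel).any(dim=1)

def rectangular_offroad(traj, x_min, x_max, y_min, y_max):
    # True for a trajectory if any point leaves the axis-aligned rectangle
    x, y = traj[..., 0], traj[..., 1]
    off = (x < x_min) | (x > x_max) | (y < y_min) | (y > y_max)
    return off.any(dim=1)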

For real systems, replace rectangular offroad with map polygon checks.
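A minimal sketch of such a polygon check, using the standard ray-casting (even-odd) test in plain Python. It assumes the drivable area is a single simple polygon given as a list of `(x, y)` vertices and tests only the actor's center point; a production check would typically test the full box footprint against real map geometry.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray casting: count edge crossings of a ray going in +x."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        if (y0 > y) != (y1 > y):  # edge straddles the horizontal line through y
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside

def polygon_offroad(traj, polygon):
    """True if any point of the trajectory lies outside the drivable polygon."""
    return any(not point_in_polygon(x, y, polygon) for x, y in traj)

# L-shaped drivable area (assumed toy map).
drivable = [(0, 0), (10, 0), (10, 4), (4, 4), (4, 10), (0, 10)]

on_road = polygon_offroad([(2.0, 2.0), (2.0, 8.0)], drivable)   # False
cut_corner = polygon_offroad([(8.0, 2.0), (8.0, 8.0)], drivable)  # True
```

The point `(8.0, 8.0)` lies inside the polygon's bounding rectangle, so `rectangular_offroad` with that rectangle would accept it, while the polygon test correctly flags it.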


9. Common interview questions and strong answers

Q: Why is ADE insufficient?
A: It compares against one logged future and ignores multi-modality, collision, offroad, wrong-way, and physical feasibility.

Q: Is collision rate always bad?
A: No. For rare scenario generation, collisions or near-misses may be intentional, but they must be physically plausible.

Q: How do you evaluate a generated scenario?
A: Use a dashboard covering realism, safety relevance, map compliance, kinematics, diversity, controllability, and downstream planner impact.

Q: What is log divergence?
A: It measures how far the generated rollout departs from the logged trajectory over time. It helps distinguish replay from counterfactual generation.


10. A 60-second explanation you can say out loud

Simulation metrics need to measure more than trajectory error. ADE and FDE tell me how close I am to one logged future, but driving is multi-modal and simulation needs plausible alternatives. I also need collision, offroad, wrong-way, traffic-rule, and kinematic feasibility metrics. For generated scenarios, I evaluate realism, diversity, controllability, and safety relevance. A collision is useful only if it is physically plausible, not if it comes from impossible actor motion. The best evaluation is a dashboard plus visual inspection of high-risk cases.


Exercise 1: Why can minADE hide bad probability calibration?
Answer: It only checks whether one mode matches the log, not whether the model assigned that mode high probability.
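A toy illustration of this answer: two hypothetical models emit the same two modes but assign them opposite probabilities. Their minADE is identical, while the negative log-probability of the best-matching mode (one possible calibration-sensitive score, chosen here as an assumption) differs sharply.

```python
import math

def ade(pred, target):
    """Mean per-step Euclidean error between two (x, y) trajectories."""
    return sum(math.hypot(px - tx, py - ty)
               for (px, py), (tx, ty) in zip(pred, target)) / len(target)

def min_ade(modes, target):
    return min(ade(mode, target) for mode in modes)

def best_mode_nll(modes, probs, target):
    """Negative log-probability assigned to the mode closest to the log."""
    best = min(range(len(modes)), key=lambda k: ade(modes[k], target))
    return -math.log(probs[best])

target = [(1.0, 0.0), (2.0, 0.0)]
modes = [
    [(1.0, 0.0), (2.0, 0.0)],  # matches the log
    [(1.0, 1.0), (2.0, 2.0)],  # does not
]

shared_min_ade = min_ade(modes, target)                 # 0.0 for both models
nll_confident = best_mode_nll(modes, [0.9, 0.1], target)  # about 0.105
nll_miscal = best_mode_nll(modes, [0.1, 0.9], target)     # about 2.303
```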

Exercise 2: Why are oriented boxes better than center distance for collision?
Answer: Vehicles have size and heading. Center distance can miss side-swipe or corner collisions.

Exercise 3: A generated car goes from 0 to 30 m/s in 0.1 s. Which metric catches it?
Answer: Acceleration or jerk feasibility.
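Working the numbers through the finite-difference acceleration formula, with a 0.1 s step and taking g as roughly 9.81 m/s²:

```latex
a = \frac{v_t - v_{t-1}}{\Delta t} = \frac{30 - 0}{0.1} = 300\ \text{m/s}^2 \approx 30.6\,g
```

That is far above the `max_accel=8.0` default in `kinematic_flags`, so the acceleration check fires on this step.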

Exercise 4: What does high early log divergence suggest?
Answer: The generated rollout may be uncontrolled or unrealistic, unless the goal explicitly requested a strong counterfactual.