Simulation Metrics: A 1-Hour Interview Learning Session
Companion notebook: simulation_metrics_colab.ipynb
Simulation metrics answer a deceptively hard question:
Did we generate a scenario that is realistic, physically valid, controllable, and useful for safety evaluation?
One metric cannot answer that. You need a dashboard.
0. One-hour plan
0-10 min   Why trajectory error is not enough
10-25 min  Safety and map-validity metrics
25-40 min  Realism, log divergence, and kinematics
40-50 min  Evaluating generated scenarios
50-60 min  Interview answers and drills
1. Why you should care
Autonomous driving simulation can fail in two opposite ways:
- Very realistic, but boring: good log replay, poor long-tail safety coverage.
- Very challenging, but fake: lots of collisions, impossible actor motion.
Good metrics distinguish:
- realism,
- safety relevance,
- physical feasibility,
- map compliance,
- diversity,
- controllability.
In robotics, the same idea applies: a generated manipulation rollout must be physically plausible and task-relevant, not just different.
2. Trajectory error
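Average displacement error, for a predicted trajectory $\hat{x}_{1:T}$ and a logged trajectory $x_{1:T}$:
$$\mathrm{ADE} = \frac{1}{T} \sum_{t=1}^{T} \lVert \hat{x}_t - x_t \rVert_2$$
Final displacement error:
$$\mathrm{FDE} = \lVert \hat{x}_T - x_T \rVert_2$$
For $K$ predicted modes:
$$\mathrm{minADE}_K = \min_{k \in \{1, \dots, K\}} \frac{1}{T} \sum_{t=1}^{T} \lVert \hat{x}_t^{(k)} - x_t \rVert_2$$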
Why trajectory error is useful:
- Simple.
- Easy to compare.
- Good for prediction against logs.
Why it is insufficient:
- Penalizes plausible alternatives.
- Does not check collision.
- Does not check offroad.
- Does not check wrong-way.
- Does not check generated mode probabilities.
Interview phrase:
ADE is a prediction metric, not a full simulation-quality metric.
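To make the gap between a best-of-K error and probability calibration concrete, here is a minimal dependency-free sketch. The function names (`ade`, `min_ade`), the toy trajectories, and the mode probabilities are illustrative assumptions, not part of any benchmark API.

```python
import math

def ade(pred, target):
    """Mean Euclidean distance between two equal-length lists of (x, y) points."""
    return sum(math.dist(p, q) for p, q in zip(pred, target)) / len(target)

def min_ade(modes, target):
    """Return (best ADE, index of best mode) over K candidate trajectories."""
    errors = [ade(m, target) for m in modes]
    best = min(range(len(errors)), key=errors.__getitem__)
    return errors[best], best

# Toy example: the logged car drives straight along x.
target = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
modes = [
    [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],  # straight, matches the log
    [(1.0, 0.5), (2.0, 1.5), (3.0, 3.0)],  # drifts left
]
probs = [0.05, 0.95]  # hypothetical model confidence per mode

best_err, best_idx = min_ade(modes, target)
print(best_err, best_idx, probs[best_idx])
```

Here the best-of-K error is zero even though the model assigned the matching mode only 5% probability, which is why minADE is usually reported alongside a probability-aware metric.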
3. Safety and map-validity metrics
Collision
For agents $i$ and $j$ with bounding boxes $B_i$ and $B_j$:
$$\mathrm{collision}_{ij} = \mathbb{1}\left[ B_i \cap B_j \neq \emptyset \right]$$
Use oriented boxes when possible, not just center distance.
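One common way to test oriented boxes is the separating-axis theorem. The sketch below is an illustration under assumptions: boxes are given as a hypothetical `(cx, cy, length, width, yaw)` tuple in meters and radians, and touching boxes count as overlapping.

```python
import math

def box_corners(cx, cy, length, width, yaw):
    """Corners of an oriented rectangle centered at (cx, cy) with heading yaw (rad)."""
    c, s = math.cos(yaw), math.sin(yaw)
    hl, hw = length / 2.0, width / 2.0
    return [
        (cx + dx * c - dy * s, cy + dx * s + dy * c)
        for dx, dy in ((hl, hw), (hl, -hw), (-hl, -hw), (-hl, hw))
    ]

def boxes_overlap(box_a, box_b):
    """Separating-axis test for two oriented rectangles."""
    corners_a, corners_b = box_corners(*box_a), box_corners(*box_b)
    # For rectangles, the candidate separating axes are the edge directions of each box.
    axes = []
    for _, _, _, _, yaw in (box_a, box_b):
        axes.append((math.cos(yaw), math.sin(yaw)))
        axes.append((-math.sin(yaw), math.cos(yaw)))
    for ax, ay in axes:
        proj_a = [x * ax + y * ay for x, y in corners_a]
        proj_b = [x * ax + y * ay for x, y in corners_b]
        if max(proj_a) < min(proj_b) or max(proj_b) < min(proj_a):
            return False  # a gap exists along this axis, so no collision
    return True

car_a = (0.0, 0.0, 4.5, 2.0, 0.0)
side_by_side = (0.0, 2.5, 4.5, 2.0, 0.0)               # centers 2.5 m apart
perpendicular = (3.0, 2.5, 4.5, 2.0, math.pi / 2.0)    # centers about 3.9 m apart
print(boxes_overlap(car_a, side_by_side), boxes_overlap(car_a, perpendicular))
```

In this toy case the perpendicular car overlaps `car_a` despite being farther away center to center than the parallel car that does not, so no single center-distance threshold classifies both correctly.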
Offroad
Let $\mathcal{D}$ be the drivable area and $x_t$ the actor position:
$$\mathrm{offroad}_t = \mathbb{1}\left[ x_t \notin \mathcal{D} \right]$$
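As a rough illustration of an offroad check against a polygonal drivable area, the sketch below uses a standard ray-casting point-in-polygon test. The L-shaped polygon, the trajectory, and the function names are made up for the example, and only the actor center is tested; a production check would more likely test box corners against real map geometry.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside polygon [(x0, y0), ...]."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def offroad_rate(traj, drivable_polygon):
    """Fraction of trajectory points whose center is outside the drivable polygon."""
    off = [not point_in_polygon(x, y, drivable_polygon) for x, y in traj]
    return sum(off) / len(off)

# Hypothetical L-shaped drivable area (meters).
drivable = [(0, 0), (10, 0), (10, 4), (4, 4), (4, 10), (0, 10)]
traj = [(2, 2), (6, 2), (6, 6), (2, 8)]
print(offroad_rate(traj, drivable))
```

Only the point `(6, 6)` falls in the cut-out corner of the L, so one of the four samples is offroad.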
Wrong-way
Let $\theta_t$ be the actor heading and $\phi_t$ the lane direction:
$$\cos(\theta_t - \phi_t) < 0$$
flags wrong-way behavior.
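A wrong-way flag can be sketched as a thresholded heading difference. The function name and the 90 degree default are assumptions for illustration; with that default the test is equivalent to checking whether the cosine of the heading difference is negative.

```python
import math

def wrong_way_flags(headings, lane_directions, threshold_deg=90.0):
    """Flag timesteps where heading deviates from lane direction by more than threshold."""
    limit = math.radians(threshold_deg)
    flags = []
    for theta, phi in zip(headings, lane_directions):
        # Wrap the difference into [-pi, pi] so 350 deg vs 10 deg counts as 20 deg.
        diff = math.atan2(math.sin(theta - phi), math.cos(theta - phi))
        flags.append(abs(diff) > limit)
    return flags

headings = [0.1, math.pi - 0.1, math.radians(350.0)]
lanes = [0.0, 0.0, math.radians(10.0)]
print(wrong_way_flags(headings, lanes))
```

The wrap step matters: without it, the third sample would look like a 340 degree error instead of a 20 degree one.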
Traffic-rule violations
Examples:
- red-light running,
- stop-sign violation,
- illegal turn,
- lane-boundary crossing,
- speed-limit violation.
These require map and signal state, not just trajectories.
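To show why signal state is needed, here is a deliberately simplified red-light check. It assumes the map lookup has already been reduced to a 1D distance along the lane and a stop-line position, and that the signal state is available per timestep; all names and values are hypothetical.

```python
def red_light_violations(progress, signal_states, stop_line_s):
    """Return timestep indices where the actor crosses the stop line on red.

    progress: distance along the lane at each timestep (meters)
    signal_states: signal state at each timestep, e.g. "red" or "green"
    stop_line_s: stop-line position along the lane (meters)
    """
    violations = []
    for t in range(1, len(progress)):
        crossed = progress[t - 1] < stop_line_s <= progress[t]
        if crossed and signal_states[t] == "red":
            violations.append(t)
    return violations

progress = [0.0, 5.0, 10.0, 15.0, 20.0]
signals = ["green", "green", "red", "red", "red"]
print(red_light_violations(progress, signals, stop_line_s=12.0))
```

The same trajectory is legal or illegal depending only on `signals`, which is exactly the information a pure trajectory metric cannot see.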
4. Kinematic feasibility
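A generated trajectory can be map-valid but physically impossible.
Speed, for positions $x_t$ sampled every $\Delta t$ seconds:
$$v_t = \frac{\lVert x_{t+1} - x_t \rVert_2}{\Delta t}$$
Acceleration:
$$a_t = \frac{v_{t+1} - v_t}{\Delta t}$$
Jerk:
$$j_t = \frac{a_{t+1} - a_t}{\Delta t}$$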
For vehicles, also inspect:
- yaw rate,
- curvature,
- lateral acceleration,
- reverse motion,
- discontinuities.
Kinematic metrics catch teleporting actors and unrealistic sudden turns.
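The finite-difference quantities above can be sketched without any dependencies. The names `kinematic_profile` and `yaw_rate` are illustrative, heading is estimated from the displacement direction (an assumption that becomes noisy for nearly stationary actors), and thresholds are left to the caller.

```python
import math

def finite_diff(values, dt):
    """Forward difference of a scalar sequence sampled every dt seconds."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]

def kinematic_profile(traj, dt=0.1):
    """Speed, acceleration, and jerk from (x, y) positions sampled every dt seconds."""
    speed = [math.dist(p, q) / dt for p, q in zip(traj, traj[1:])]
    accel = finite_diff(speed, dt)
    jerk = finite_diff(accel, dt)
    return speed, accel, jerk

def yaw_rate(traj, dt=0.1):
    """Heading change per second, with heading taken from displacement direction."""
    headings = [math.atan2(q[1] - p[1], q[0] - p[0]) for p, q in zip(traj, traj[1:])]
    rates = []
    for h0, h1 in zip(headings, headings[1:]):
        diff = math.atan2(math.sin(h1 - h0), math.cos(h1 - h0))  # wrap to [-pi, pi]
        rates.append(diff / dt)
    return rates

# A "teleporting" actor: one 10 m jump inside a 0.1 s step.
teleport = [(0, 0), (1, 0), (2, 0), (12, 0), (13, 0)]
speed, accel, jerk = kinematic_profile(teleport)
print(max(speed), max(abs(a) for a in accel))

# An instant 90 degree turn between consecutive steps.
print(yaw_rate([(0, 0), (1, 0), (1, 1)]))
```

The jump shows up as a speed spike near 100 m/s and an acceleration near 900 m/s^2, and the instant turn as a yaw rate above 15 rad/s, all far outside plausible vehicle limits.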
5. Realism vs safety
Realism and safety challenge are in tension.
- Log-likelihood high: likely realistic, but may be common/boring.
- Risk high: useful for safety, but may be unrealistic.
A useful generated scenario should sit in the middle:
- rare enough to test the system,
- plausible enough to matter,
- controllable enough to reproduce.
This is why collision rate alone is not enough. A high collision rate may mean the generator is broken.
6. Log divergence
If starting from a logged scene, the divergence from the log at time $t$ is:
$$d_t = \lVert x_t^{\mathrm{gen}} - x_t^{\mathrm{log}} \rVert_2$$
This helps measure how quickly a rollout departs from recorded behavior.
Interpretation:
- Low divergence: close replay.
- Moderate divergence: plausible variation.
- Huge early divergence: likely unrealistic or uncontrolled.
But divergence is not always bad. Simulation often wants counterfactuals. The question is whether divergence is plausible and controlled.
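A per-timestep divergence curve, plus the first step at which it exceeds a tolerance, is often enough to separate close replay from early departure. The sketch below assumes time-aligned rollout and log positions; the 1.0 m threshold and the function names are arbitrary illustrative choices.

```python
import math

def log_divergence(rollout, logged):
    """Per-timestep Euclidean distance between generated and logged positions."""
    return [math.dist(p, q) for p, q in zip(rollout, logged)]

def first_divergence_step(rollout, logged, threshold=1.0):
    """First timestep where divergence exceeds threshold (meters), or None."""
    for t, d in enumerate(log_divergence(rollout, logged)):
        if d > threshold:
            return t
    return None

logged = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
rollout = [(0.0, 0.0), (1.0, 0.2), (2.0, 0.8), (3.0, 2.0)]
print(log_divergence(rollout, logged))
print(first_divergence_step(rollout, logged))
```

Reporting the step index rather than only the mean distance makes it easier to tell a gradual, plausible variation from a rollout that leaves the log immediately.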
7. Evaluating generated scenarios
Use a dashboard:
Prediction: ADE, FDE, minADE, miss rate
Map validity: offroad, wrong-way, rule violation
Safety: collision, near-miss, time-to-collision
Physics: speed, acceleration, jerk, yaw rate
Realism: discriminator score, log divergence, human review
Diversity: unique modes, coverage, pairwise distance
Control: prompt/intent success rate
Downstream: planner failure rate, intervention rate
The strongest interview answer is to discuss metric tradeoffs, not just list metrics.
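Time-to-collision, one of the safety entries in the dashboard, can be sketched under a constant-velocity assumption. This version approximates each agent as a disc, which is a simplification chosen for brevity rather than a recommended footprint model; `time_to_collision` and its parameters are hypothetical names.

```python
import math

def time_to_collision(pos_a, vel_a, pos_b, vel_b, radius_sum):
    """Constant-velocity TTC for two discs; returns None if they never touch."""
    px, py = pos_b[0] - pos_a[0], pos_b[1] - pos_a[1]
    vx, vy = vel_b[0] - vel_a[0], vel_b[1] - vel_a[1]
    # Solve |p + v t|^2 = radius_sum^2 for the smallest t >= 0.
    a = vx * vx + vy * vy
    b = 2.0 * (px * vx + py * vy)
    c = px * px + py * py - radius_sum ** 2
    if c <= 0.0:
        return 0.0  # already overlapping
    if a == 0.0:
        return None  # no relative motion
    disc = b * b - 4.0 * a * c
    if disc < 0.0 or b >= 0.0:
        return None  # paths miss, or the agents are moving apart
    return (-b - math.sqrt(disc)) / (2.0 * a)

# Ego at 10 m/s approaching a stopped car 50 m ahead; combined radius 2 m.
ttc = time_to_collision((0.0, 0.0), (10.0, 0.0), (50.0, 0.0), (0.0, 0.0), 2.0)
print(ttc)
```

In the example the gap between the two discs is 48 m and closes at 10 m/s, so the result is about 4.8 s. A near-miss metric can then be defined as the minimum TTC falling below a chosen threshold without an actual collision.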
8. Minimal PyTorch implementation
```python
import torch


def ade(pred, target):
    # pred, target: assumed shape [batch, time, 2]; mean L2 error over all steps
    return torch.norm(pred - target, dim=-1).mean()


def fde(pred, target):
    # L2 error at the final timestep only
    return torch.norm(pred[:, -1] - target[:, -1], dim=-1).mean()


def kinematic_flags(traj, dt=0.1, max_speed=40.0, max_accel=8.0):
    # Finite-difference speed (m/s) and change in speed per second (m/s^2)
    vel = (traj[:, 1:] - traj[:, :-1]) / dt
    speed = torch.norm(vel, dim=-1)
    accel = (speed[:, 1:] - speed[:, :-1]) / dt
    # One boolean per trajectory for each limit
    return (speed > max_speed).any(dim=1), (accel.abs() > max_accel).any(dim=1)


def rectangular_offroad(traj, x_min, x_max, y_min, y_max):
    # True for each trajectory that leaves the axis-aligned rectangle at any step
    x, y = traj[..., 0], traj[..., 1]
    off = (x < x_min) | (x > x_max) | (y < y_min) | (y > y_max)
    return off.any(dim=1)
```
For real systems, replace rectangular offroad with map polygon checks.
9. Common interview questions and strong answers
Q: Why is ADE insufficient?
A: It compares against one logged future and ignores multi-modality, collision, offroad, wrong-way, and physical feasibility.
Q: Is a high collision rate always bad?
A: No. For rare scenario generation, collisions or near-misses may be intentional, but they must be physically plausible.
Q: How do you evaluate a generated scenario?
A: Use a dashboard covering realism, safety relevance, map compliance, kinematics, diversity, controllability, and downstream planner impact.
Q: What is log divergence?
A: It measures how far the generated rollout departs from the logged trajectory over time. It helps distinguish replay from counterfactual generation.
10. A 60-second explanation you can say out loud
Simulation metrics need to measure more than trajectory error. ADE and FDE tell me how close I am to one logged future, but driving is multi-modal and simulation needs plausible alternatives. I also need collision, offroad, wrong-way, traffic-rule, and kinematic feasibility metrics. For generated scenarios, I evaluate realism, diversity, controllability, and safety relevance. A collision is useful only if it is physically plausible, not if it comes from impossible actor motion. The best evaluation is a dashboard plus visual inspection of high-risk cases.
11. Practice exercises with answers
Exercise 1: Why can minADE hide bad probability calibration?
Answer: It only checks whether one mode matches the log, not whether the model assigned that mode high probability.
Exercise 2: Why are oriented boxes better than center distance for collision?
Answer: Vehicles have size and heading. Center distance can miss side-swipe or corner collisions.
Exercise 3: A generated car goes from 0 to 30 m/s in 0.1 s. Which metric catches it?
Answer: Acceleration or jerk feasibility.
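As a quick check of the numbers, assuming the speed change is spread evenly over the single 0.1 s step:

$$a = \frac{\Delta v}{\Delta t} = \frac{30\ \mathrm{m/s}}{0.1\ \mathrm{s}} = 300\ \mathrm{m/s^2} \approx 30.6\,g$$

That is far above the `max_accel=8.0` default used in `kinematic_flags`, so the acceleration flag fires on this step alone.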
Exercise 4: What does high early log divergence suggest?
Answer: The generated rollout may be uncontrolled or unrealistic, unless the goal explicitly requested a strong counterfactual.