Skip to content

ML Modeling Metrics: A 1-Hour Interview Learning Session

ML Modeling Metrics: A 1-Hour Interview Learning Session

Section titled “ML Modeling Metrics: A 1-Hour Interview Learning Session”

Companion notebook: ml_modeling_metrics_colab.ipynb

Metrics tell you whether your model is improving the thing you actually care about. Without the right metric, you can optimize blindly.

For autonomous driving, the key point is:

Accuracy is rarely enough. The cost of a false negative pedestrian event is not the same as a false positive normal-driving event.

This article covers the core metrics you need for ML modeling interviews: precision, recall, F1, confusion matrix, ROC-AUC, PR-AUC, calibration, regression metrics, ranking metrics, and thresholding.

0-10 min Confusion matrix and accuracy
10-25 min Precision, recall, F1, false positives/negatives
25-35 min Thresholds, ROC-AUC, PR-AUC
35-45 min Calibration and probability quality
45-55 min Regression/ranking metrics
55-60 min Interview drills

Imagine a model predicts whether a pedestrian will enter the crosswalk.

If the dataset is:

does not cross: 99%
crosses: 1%

A model that always predicts “does not cross” gets 99% accuracy. It is useless for safety.

Metrics should answer:

  • What mistakes are we making?
  • Which mistakes are expensive?
  • Does the score threshold match the product need?
  • Are predicted probabilities trustworthy?
  • Does performance hold on rare slices?

In robotics, the same issue appears in grasp success, collision prediction, failure detection, and anomaly detection.


For binary classification:

Predicted positivePredicted negative
Actually positiveTPFN
Actually negativeFPTN

Definitions:

  • TP: predicted event, event happened.
  • FP: predicted event, event did not happen.
  • FN: missed event.
  • TN: correctly predicted no event.

Driving example:

positive = pedestrian crosses
negative = pedestrian does not cross

False negative:

model says no crossing, pedestrian crosses

False positive:

model says crossing, pedestrian waits

The false negative is often more safety-critical.


Accuracy=TP+TNTP+FP+FN+TNAccuracy = \frac{TP + TN}{TP + FP + FN + TN}

Accuracy is useful when:

  • classes are balanced,
  • mistake costs are similar,
  • the metric is only a rough sanity check.

Accuracy is dangerous when:

  • classes are imbalanced,
  • rare positives matter,
  • false positives and false negatives have different costs.

Interview phrase:

Accuracy can be high while the model is useless on the minority class.


Precision:

Precision=TPTP+FPPrecision = \frac{TP}{TP + FP}

Question it answers:

When the model predicts positive, how often is it right?

Recall:

Recall=TPTP+FNRecall = \frac{TP}{TP + FN}

Question it answers:

Of all real positives, how many did the model catch?

Driving interpretation:

  • High precision: few false alarms.
  • High recall: few missed events.

Tradeoff:

lower threshold:
more positives predicted
recall up
precision often down
higher threshold:
fewer positives predicted
precision up
recall often down

For safety-critical detection, you often start by requiring high recall, then manage false positives.


F1 is harmonic mean of precision and recall:

F1=2PrecisionRecallPrecision+RecallF1 = 2 \cdot \frac{Precision \cdot Recall} {Precision + Recall}

It is useful when you want one number that balances both.

If recall matters more than precision, use FβF_\beta:

F_\beta = (1+\beta^2) \frac{Precision \cdot Recall} \beta^2 Precision + Recall}

If β>1\beta > 1, recall matters more.

If β<1\beta < 1, precision matters more.

Autonomous driving example:

  • Pedestrian crossing detection may use high-recall operating points.
  • Scenario mining may tolerate lower precision if humans or filters review candidates.
  • Planner intervention prediction may need high precision to avoid noisy alarms.

Most classifiers output a score or probability:

p=P(y=1x)p = P(y=1|x)

To make a binary decision:

y^=1[pτ]\hat{y} = \mathbf{1}[p \ge \tau]

where τ\tau is threshold.

Changing τ\tau changes precision and recall.

threshold 0.1:
many positives
high recall
low precision
threshold 0.9:
few positives
low recall
high precision

Do not assume threshold 0.5 is correct. Pick thresholds based on validation metrics and product costs.


ROC curve plots:

TPR=TPTP+FNTPR = \frac{TP}{TP + FN}

against:

FPR=FPFP+TNFPR = \frac{FP}{FP + TN}

ROC-AUC measures ranking quality across thresholds.

PR curve plots precision vs recall. PR-AUC is often better for imbalanced rare-event problems because it focuses on positive-class performance.

Interview answer:

For rare events, I prefer PR-AUC over ROC-AUC because ROC-AUC can look strong when negatives dominate.

Example:

rare cut-in detection:
PR-AUC is more informative than accuracy

A model is calibrated if predicted probabilities match observed frequencies.

If the model predicts 0.8 probability for 1,000 examples, about 800 should be positive.

Calibration matters when:

  • downstream systems consume probabilities,
  • thresholds are safety-critical,
  • multiple model scores are compared,
  • uncertainty matters.

Expected Calibration Error:

ECE=b=1BSbnacc(Sb)conf(Sb)ECE = \sum_{b=1}^{B} \frac{|S_b|}{n} \left| acc(S_b) - conf(S_b) \right|

where:

  • SbS_b is confidence bin bb,
  • acc(Sb)acc(S_b) is accuracy in that bin,
  • conf(Sb)conf(S_b) is average confidence in that bin.

Driving example:

If a planner treats 0.9 crossing probability differently from 0.6, those probabilities must mean something.


Multi-class:

one label among many
example: maneuver = straight / left / right / stop

Multi-label:

multiple labels can be true
example: scenario = rainy + night + pedestrian + occlusion

For multi-class:

  • confusion matrix by class,
  • macro F1,
  • weighted F1,
  • per-class precision/recall.

For multi-label:

  • per-label precision/recall,
  • micro average,
  • macro average,
  • label co-occurrence analysis.

Macro average treats each class equally. Micro average pools decisions and can be dominated by common classes.


For trajectory or continuous prediction:

Mean absolute error:

MAE=1niy^iyiMAE = \frac{1}{n}\sum_i |\hat{y}_i-y_i|

Mean squared error:

MSE=1ni(y^iyi)2MSE = \frac{1}{n}\sum_i(\hat{y}_i-y_i)^2

Root mean squared error:

RMSE=MSERMSE = \sqrt{MSE}

Trajectory metrics:

ADE=1Ttp^tpt2ADE = \frac{1}{T} \sum_t \|\hat{p}_t-p_t\|_2 FDE=p^TpT2FDE = \|\hat{p}_T-p_T\|_2

Tradeoff:

  • MAE is more robust to outliers.
  • MSE/RMSE penalize large errors more.
  • ADE/FDE are intuitive but ignore map validity and multi-modality.

Many autonomous-driving systems rank candidates:

  • top likely trajectories,
  • risky scenarios,
  • objects to review,
  • retrieval results,
  • mined rare events.

Useful metrics:

  • Top-k accuracy.
  • Recall@k.
  • Precision@k.
  • Mean reciprocal rank (MRR).
  • NDCG.

Recall@k:

Recall@k=relevant items in top ktotal relevant itemsRecall@k = \frac{\text{relevant items in top k}} {\text{total relevant items}}

Scenario mining example:

If the review team can inspect only 100 scenarios, precision@100 matters.


import torch
def binary_metrics(scores, labels, threshold=0.5):
pred = scores >= threshold
labels = labels.bool()
tp = (pred & labels).sum().item()
fp = (pred & ~labels).sum().item()
fn = (~pred & labels).sum().item()
tn = (~pred & ~labels).sum().item()
precision = tp / max(tp + fp, 1)
recall = tp / max(tp + fn, 1)
accuracy = (tp + tn) / max(tp + fp + fn + tn, 1)
f1 = 2 * precision * recall / max(precision + recall, 1e-12)
return {
"tp": tp,
"fp": fp,
"fn": fn,
"tn": tn,
"precision": precision,
"recall": recall,
"accuracy": accuracy,
"f1": f1,
}

13. Common interview questions and strong answers

Section titled “13. Common interview questions and strong answers”

Q: Why is accuracy bad for rare event detection?
A: Because negatives dominate. A model can get high accuracy by predicting the majority class and missing rare positives.

Q: Precision or recall for pedestrian crossing?
A: I would usually prioritize recall because missing a crossing is safety-critical, then manage false positives through thresholding and downstream logic.

Q: ROC-AUC or PR-AUC for rare events?
A: PR-AUC is usually more informative because it focuses on positive-class performance under imbalance.

Q: Why does calibration matter?
A: If downstream planning uses probabilities, 0.9 should mean roughly 90% likelihood. Uncalibrated scores can lead to bad thresholds and risk decisions.

Q: What is macro vs micro averaging?
A: Macro averages metrics per class equally. Micro pools all examples and can be dominated by common classes.


  • Accuracy high, minority recall near zero.
  • ROC-AUC high, PR-AUC poor.
  • Threshold chosen arbitrarily at 0.5.
  • Model score ranks well but probabilities are uncalibrated.
  • Aggregate metric hides geography/weather/agent-type failures.
  • Regression metric improves while safety metrics worsen.
  • Top-k metric good but top-1 behavior poor.

Checklist:

  • Always inspect confusion matrix.
  • Report per-class precision/recall.
  • Sweep thresholds.
  • Plot PR curve for rare events.
  • Check calibration.
  • Slice by scenario type.
  • Match metric to product cost.

15. A 60-second explanation you can say out loud

Section titled “15. A 60-second explanation you can say out loud”

Metrics define what model improvement means. Accuracy is often insufficient in autonomous driving because rare events matter. I start with a confusion matrix, then look at precision and recall. Precision tells me how reliable positive predictions are; recall tells me how many real positives I catch. For rare safety events, PR-AUC is often more useful than ROC-AUC. I also care about calibration if downstream systems use probabilities. For trajectories, ADE and FDE are useful but incomplete because they ignore multi-modality, collision, offroad, and physical validity. I would always slice metrics by scenario type and choose thresholds based on the cost of false positives and false negatives.


Exercise 1: A model has 99% accuracy on a 1% positive dataset. What metric do you ask for next?
Answer: Positive-class recall and precision, plus PR-AUC.

Exercise 2: What happens to recall when you lower the threshold?
Answer: Recall usually increases because more examples are predicted positive.

Exercise 3: Why can ROC-AUC be misleading for rare events?
Answer: It includes true-negative behavior, and negatives dominate. PR-AUC focuses on positive-class retrieval.

Exercise 4: A model predicts 0.9 confidence but is correct 60% of the time in that bin. What is wrong?
Answer: It is overconfident and poorly calibrated.