
Testing & Evaluation

Phase 6 of the PMI-CPMAI Methodology

Overview

This module covers the critical process of validating AI models and evaluating their performance. You will learn about performance metrics, bias detection, cross-validation techniques, and statistical significance testing. Thorough testing ensures models meet business requirements and perform fairly across all user groups before deployment.

Learning Objectives

  • Evaluate model performance using appropriate metrics for classification, regression, and ranking tasks
  • Detect and mitigate bias across protected attributes using fairness metrics
  • Implement cross-validation strategies to ensure robust performance estimates
  • Interpret confusion matrices, ROC curves, and precision-recall trade-offs
  • Conduct A/B testing and statistical significance testing for model comparison

Key Concepts

Performance Metrics

Different problem types require different metrics. The project manager must ensure that metrics align with business objectives and that stakeholders understand the trade-offs between them.

Classification Metrics
  • Accuracy: Overall correctness
  • Precision: Positive predictive value
  • Recall: Sensitivity, true positive rate
  • F1 Score: Harmonic mean of precision and recall
  • AUC-ROC: Discrimination ability

Regression Metrics
  • MAE: Mean absolute error
  • MSE: Mean squared error
  • RMSE: Root mean squared error
  • MAPE: Mean absolute percentage error
  • R-squared: Variance explained
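
As a rough illustration of how these metrics are computed in practice, here is a minimal sketch using scikit-learn; the label, prediction, and score arrays are invented placeholders, not data from this module:

```python
# Minimal sketch: computing the metrics above with scikit-learn.
# All arrays below are invented placeholders, not module data.
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, mean_absolute_error, mean_squared_error, r2_score,
)

# --- Classification ---
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])   # hard class labels
y_prob = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1])  # predicted P(y=1)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_prob))  # needs scores, not labels

# --- Regression ---
y_true_r = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_r = np.array([2.8, 5.4, 2.0, 6.5])

mse = mean_squared_error(y_true_r, y_pred_r)
print("MAE :", mean_absolute_error(y_true_r, y_pred_r))
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("R^2 :", r2_score(y_true_r, y_pred_r))
```

Note that AUC-ROC is computed from predicted scores rather than hard labels, which is why the sketch keeps y_prob separate from y_pred.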

Bias Detection

AI systems can perpetuate or amplify biases present in training data. Fairness metrics evaluate model performance across demographic groups to identify disparate impacts.

Demographic Parity
  • Equal positive prediction rates across groups
  • Decisions independent of group membership

Equal Opportunity
  • Equal true positive rates across groups
  • Equalized odds (equal true and false positive rates) is a stricter variant

Calibration
  • Predicted probabilities reflect observed frequencies
  • Stated confidence matches reality
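
A minimal sketch of how the demographic parity and equal opportunity gaps might be measured, assuming NumPy and an invented binary protected attribute; the arrays are illustrative, not module data:

```python
# Minimal sketch: measuring demographic parity and equal opportunity gaps.
# Labels, predictions, and the protected attribute are invented placeholders.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])  # hypothetical attribute

def positive_rate(pred):
    """Fraction of positive predictions (demographic parity)."""
    return pred.mean()

def true_positive_rate(true, pred):
    """Recall restricted to actual positives (equal opportunity)."""
    return pred[true == 1].mean()

rates = {g: positive_rate(y_pred[group == g]) for g in np.unique(group)}
tprs = {g: true_positive_rate(y_true[group == g], y_pred[group == g])
        for g in np.unique(group)}

# A gap near zero means the groups are treated similarly on that criterion.
print("Positive rates by group:", rates)
print("True positive rates by group:", tprs)
print("Demographic parity gap:", abs(rates["A"] - rates["B"]))
print("Equal opportunity gap :", abs(tprs["A"] - tprs["B"]))
```

The same pattern extends to calibration: compare mean predicted probability against the observed positive rate within probability bins (e.g., via sklearn.calibration.calibration_curve).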

Cross-Validation

Cross-validation provides robust performance estimates by training and testing on multiple data splits. Common approaches include k-fold cross-validation, stratified sampling for imbalanced classes, and time-series cross-validation for temporal data.
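
A minimal sketch of stratified 5-fold cross-validation with scikit-learn; the synthetic dataset and logistic regression model are assumptions for illustration:

```python
# Minimal sketch: stratified 5-fold cross-validation with scikit-learn.
# The synthetic dataset and logistic regression model are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# StratifiedKFold preserves class proportions in every fold, which matters
# for imbalanced classes. For temporal data, swap in TimeSeriesSplit so
# training folds never contain observations from the future.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("Fold accuracies:", scores.round(3))
print(f"Mean ± std: {scores.mean():.3f} ± {scores.std():.3f}")
```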

Statistical Significance

Performance differences between models must be tested for statistical significance to avoid overinterpreting random variation. Use a test appropriate to the setup (e.g., McNemar's test for paired classifier predictions on the same test set, or a paired t-test across cross-validation folds) and consider practical significance alongside statistical significance.
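
As one possible illustration, McNemar's test compares two classifiers evaluated on the same test set; this sketch assumes statsmodels, and the disagreement counts are invented:

```python
# Minimal sketch: McNemar's test for two classifiers scored on the same
# test set. The disagreement counts below are invented placeholders.
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table of per-example outcomes:
#                  B correct   B wrong
# A correct           412         38
# A wrong              21         29
table = [[412, 38],
         [21, 29]]

result = mcnemar(table, exact=False, correction=True)  # chi-square version
print(f"statistic = {result.statistic:.3f}, p-value = {result.pvalue:.4f}")
# A small p-value suggests the accuracy difference is unlikely to be random
# variation; practical significance should still be judged separately.
```

McNemar's test uses only the off-diagonal disagreement counts, which is why it suits paired predictions on a shared test set.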

Example Scenario

"The recommendation engine evaluation shows 72% accuracy, but deeper analysis reveals disparities. Bias testing shows a 15% lower recommendation rate for products in the 'electronics' category for users under 25 vs. users 25-40. Cross-validation with 5 folds shows consistent RMSE of 1.23 ± 0.08. A/B testing compares the new model against the baseline showing +8% click-through rate (p < 0.01, statistically significant). The team addresses bias by rebalancing training data and implementing post-processing fairness constraints, reducing the demographic gap to under 5%."

Summary

Module 6 has covered essential testing and evaluation techniques:

  • Metric selection must align with business objectives and problem characteristics
  • Bias detection is essential for responsible AI deployment
  • Cross-validation provides reliable performance estimates
  • Statistical significance testing prevents overinterpretation of results
  • Documentation of evaluation results enables stakeholder communication