Module 4

Data Preparation

Phase 4 of the PMI-CPMAI Methodology

Overview

This module covers the practical aspects of preparing data for AI model training. You will learn data collection techniques, data cleaning methods, feature engineering strategies, and data transformation approaches. Data preparation typically consumes 60-80% of the time in AI projects, making it a critical skill for AI project managers to understand and manage effectively.

Learning Objectives

  • Implement data collection strategies from multiple sources including APIs, databases, and file systems
  • Apply data cleaning techniques including handling missing values, outliers, and duplicates
  • Create new features through feature engineering to improve model performance
  • Apply data augmentation techniques to increase dataset diversity and volume
  • Transform and normalize data using appropriate scaling and encoding methods

Key Concepts

Data Collection

Data collection involves gathering raw data from various sources to build training datasets. Common methods include API integrations, database queries, web scraping, sensor data collection, and manual data entry. The project manager must ensure collection processes are repeatable, well-documented, and compliant with data usage agreements.

Collection Considerations:
  • API rate limits and quotas
  • Batch vs. streaming collection
  • Error handling and retry logic
  • Data validation at ingestion
  • Storage and indexing strategy
  • Collection scheduling
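
As a sketch of the error-handling, rate-limit, and ingestion-validation points above, the Python fragment below pages through a hypothetical REST endpoint. The URL, field names, page size, and retry counts are illustrative assumptions, not part of any particular system.

    import time

    import requests

    API_URL = "https://api.example.com/v1/purchases"  # hypothetical endpoint
    PAGE_SIZE = 500
    MAX_RETRIES = 3

    def fetch_page(offset):
        """Fetch one page of records with retries and rate-limit handling."""
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                response = requests.get(
                    API_URL,
                    params={"offset": offset, "limit": PAGE_SIZE},
                    timeout=30,
                )
                if response.status_code == 429:
                    # Rate limited: honor the server's Retry-After header.
                    time.sleep(int(response.headers.get("Retry-After", 2 ** attempt)))
                    continue
                response.raise_for_status()
                records = response.json()
                # Validate at ingestion: drop records missing required fields.
                return [r for r in records if "customer_id" in r and "amount" in r]
            except requests.RequestException:
                if attempt == MAX_RETRIES:
                    raise
                time.sleep(2 ** attempt)  # exponential backoff before retrying
        raise RuntimeError(f"Gave up on offset {offset} after {MAX_RETRIES} attempts")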

Data Cleaning

Data cleaning addresses data quality issues identified in Phase 3. Key techniques include:

Missing Values
  • Imputation (mean, median, mode)
  • Forward/backward fill
  • Model-based imputation
  • Removal when appropriate

Outliers
  • Statistical detection (IQR, Z-score)
  • Domain-based validation
  • Winsorization
  • Separate handling for anomalies

Duplicates
  • Exact duplicate removal
  • Fuzzy matching for near-duplicates
  • Record linkage techniques
  • Deduplication rules

Inconsistencies
  • Format standardization
  • Category consolidation
  • Unit conversion
  • Data type validation
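
The pandas sketch below strings several of these techniques together in a sensible order (types first, then duplicates, missing values, and outliers). The file and column names are illustrative assumptions.

    import pandas as pd

    df = pd.read_csv("customers.csv")  # hypothetical file and columns

    # Inconsistencies: standardize formats and enforce data types first.
    df["email"] = df["email"].str.strip().str.lower()
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

    # Duplicates: exact removal on a natural key, keeping the first record.
    df = df.drop_duplicates(subset=["customer_id", "order_id"], keep="first")

    # Missing values: median imputation is robust to skewed distributions.
    df["age"] = df["age"].fillna(df["age"].median())

    # Outliers: winsorize values beyond the 1.5 * IQR fences.
    q1, q3 = df["amount"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["amount"] = df["amount"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)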

Feature Engineering

Feature engineering creates new variables from raw data to improve model performance. This includes creating interaction features, aggregating temporal data, encoding categorical variables, and extracting meaningful signals. Effective feature engineering often requires deep domain knowledge and iterative experimentation.
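
A minimal pandas sketch of temporal aggregation and an interaction feature, loosely following the example scenario later in this module; the table layouts and column names are hypothetical.

    import pandas as pd

    # Hypothetical inputs: one row per purchase and per browsing session.
    purchases = pd.read_csv("purchases.csv", parse_dates=["purchase_date"])
    sessions = pd.read_csv("sessions.csv")
    snapshot = pd.Timestamp("2024-01-01")  # reference date for recency

    # Temporal aggregation: days since each customer's most recent purchase.
    last_purchase = purchases.groupby("customer_id")["purchase_date"].max()
    recency = (snapshot - last_purchase).dt.days.rename("days_since_last_purchase")

    # Interaction feature: browsing-to-purchase ratio per customer.
    counts = pd.DataFrame({
        "n_sessions": sessions.groupby("customer_id").size(),
        "n_purchases": purchases.groupby("customer_id").size(),
    }).fillna(0)
    counts["browse_to_purchase_ratio"] = (
        counts["n_sessions"] / counts["n_purchases"].clip(lower=1)
    )

    features = counts.join(recency, how="left")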

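Data Augmentation

Data augmentation increases dataset diversity and volume by generating plausible variations of existing records, which can improve model generalization when genuine data is scarce. Techniques range from geometric transforms for images (flips, rotations, crops) to noise injection and resampling for tabular data, as well as fully synthetic records such as the synthetic browsing sessions in the example scenario below. Augmented records must remain realistic: labels and business constraints should survive the transformation.

A minimal tabular sketch, assuming numeric session features whose names are invented for illustration: jittered copies are created by adding Gaussian noise proportional to each column's spread.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(seed=42)

    def jitter(df, columns, scale=0.05):
        """Return a noisy copy of df: Gaussian noise sized at `scale` times
        each column's standard deviation is added to the named columns."""
        noisy = df.copy()
        for col in columns:
            noisy[col] = df[col] + rng.normal(0.0, scale * df[col].std(), len(df))
        return noisy

    # Double the training set with jittered copies; labels stay unchanged.
    train = pd.read_csv("train.csv")  # hypothetical file and columns
    augmented = pd.concat(
        [train, jitter(train, ["session_length", "pages_viewed"])],
        ignore_index=True,
    )
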
Data Transformation

Data transformation prepares features for model consumption through normalization, scaling, and encoding. Common techniques include Min-Max scaling, Standardization (Z-score), One-Hot encoding for categorical variables, and Target encoding. The choice of transformation depends on the algorithm requirements and data distribution characteristics.
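
A scikit-learn sketch of these transformations, reusing the hypothetical column names from the feature engineering example; recent scikit-learn releases (1.3+) also ship a TargetEncoder that could replace one-hot encoding for high-cardinality categoricals.

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

    X = pd.read_csv("features.csv")  # hypothetical feature table

    preprocess = ColumnTransformer([
        # Min-Max scaling maps a bounded continuous feature into [0, 1].
        ("minmax", MinMaxScaler(), ["days_since_last_purchase"]),
        # Z-score standardization suits unbounded, roughly normal features.
        ("zscore", StandardScaler(), ["browse_to_purchase_ratio"]),
        # One-hot encoding expands low-cardinality categoricals.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["favorite_category"]),
    ])

    X_prepared = preprocess.fit_transform(X)

To avoid leakage, fit scalers and encoders on the training split only, then apply the fitted transformer to validation and test data.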

Example Scenario

"For the recommendation engine, the data team collects purchase history via database queries, browsing data from Kafka streams, and product metadata from the catalog API. Data cleaning handles 12,000 missing customer age values using purchase pattern-based imputation. Feature engineering creates 'days since last purchase,' ' browsing-to-purchase ratio,' and 'category affinity scores.' Data augmentation generates synthetic browsing sessions by combining existing patterns with seasonal variations. Finally, continuous features are Min-Max scaled while categorical features use target encoding."

Summary

Module 4 has covered the essential data preparation techniques:

  • Data collection requires robust pipelines and error handling
  • Data cleaning addresses quality issues systematically
  • Feature engineering creates meaningful signals from raw data
  • Data augmentation can improve model generalization
  • Proper transformation ensures data compatibility with algorithms