Identifying Data Requirements
Phase 3 of the PMI-CPMAI Methodology
Overview
This module covers the process of defining data requirements for AI projects. You will learn how to conduct data inventories, assess data quality, establish data governance frameworks, and document data lineage. Understanding data requirements is essential because the quality and availability of data directly determine the success of any AI initiative.
Learning Objectives
- Conduct a comprehensive data inventory to identify available data sources
- Assess data quality across multiple dimensions including completeness, accuracy, and consistency
- Establish data governance policies including access controls, privacy, and compliance requirements
- Document data lineage to track data flow from source to model input
- Define data requirements specifications that align with project objectives
Key Concepts
Data Inventory
A data inventory is a comprehensive catalog of all data assets available for an AI project. It documents data sources, formats, storage locations, update frequencies, and ownership. Creating a thorough data inventory is the first step in understanding what data you have and what additional data you may need.
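As an illustration, an inventory entry can be modeled as a small record whose fields mirror the attributes listed above. The following is a minimal sketch, not a prescribed format: the `DataAsset` class, the `missing_assets` helper, and all sample values are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataAsset:
    """One inventory entry (hypothetical structure for illustration)."""
    name: str
    source_system: str
    data_format: str
    storage_location: str
    update_frequency: str
    owner: str


def missing_assets(inventory, required_names):
    """Return the required asset names that have no entry in the inventory."""
    available = {asset.name for asset in inventory}
    return sorted(set(required_names) - available)


# Sample values are invented for the example.
inventory = [
    DataAsset("purchase_history", "POS", "relational table",
              "on-prem warehouse", "daily", "Sales Ops"),
    DataAsset("product_catalog", "ERP", "relational table",
              "on-prem warehouse", "weekly", "Merchandising"),
]

gaps = missing_assets(
    inventory,
    ["purchase_history", "product_catalog", "customer_demographics"],
)
print(gaps)  # ['customer_demographics']
```

Comparing the inventory against the list of data the project needs makes the "what additional data you may need" question explicit.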
Data Quality Assessment
Data quality is evaluated across multiple dimensions:
- Completeness
  - Missing values and nulls
  - Coverage of required fields
  - Historical data availability
- Accuracy
  - Value validation
  - Outlier detection
  - Cross-reference verification
- Consistency
  - Format standardization
  - Temporal alignment
  - Cross-system reconciliation
- Timeliness
  - Data freshness
  - Update frequency
  - Processing latency
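Several of these checks can be expressed as simple measurements. The sketch below shows one possible way to compute a missing-value rate, apply a value validation rule, and test data freshness. The function names, the age range of 0 to 120, and the sample records are assumptions made for this example, not part of any standard.

```python
from datetime import datetime, timedelta


def completeness(records, field):
    """Fraction of records where `field` is present and not None."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) is not None)
    return filled / len(records)


def invalid_values(records, field, is_valid):
    """Records whose non-null `field` value fails the validation rule."""
    return [
        r for r in records
        if r.get(field) is not None and not is_valid(r[field])
    ]


def is_fresh(last_updated, max_age, now):
    """True if the data was updated within the allowed age window."""
    return now - last_updated <= max_age


# Invented sample data: one missing age and one implausible age.
records = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": None},
    {"customer_id": 3, "age": 212},
    {"customer_id": 4, "age": 51},
]

print(completeness(records, "age"))  # 0.75
print(invalid_values(records, "age", lambda a: 0 <= a <= 120))
print(is_fresh(datetime(2024, 1, 9), timedelta(days=2),
               now=datetime(2024, 1, 10)))  # True
```

In practice the validation rules and freshness thresholds would come from the project's requirements rather than being hard-coded.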
Data Governance Framework
Data governance establishes policies, procedures, and standards for data management. Key elements include data ownership, access controls, privacy compliance (GDPR, CCPA), security requirements, and data usage agreements. A robust governance framework ensures responsible data use throughout the AI project lifecycle.
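To make the idea of access controls concrete, a governance policy can be recorded per dataset and checked before data is released to a project. This is a simplified sketch under stated assumptions: the `DatasetPolicy` structure, the role names, and the purpose labels are hypothetical, and a check like this does not by itself establish GDPR or CCPA compliance.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetPolicy:
    """Governance attributes for one dataset (hypothetical structure)."""
    owner: str
    allowed_roles: set = field(default_factory=set)
    contains_personal_data: bool = False
    approved_purposes: set = field(default_factory=set)


def can_access(policy, role, purpose):
    """Allow access only for permitted roles and, for personal data,
    only for purposes the policy explicitly approves."""
    if role not in policy.allowed_roles:
        return False
    if policy.contains_personal_data and purpose not in policy.approved_purposes:
        return False
    return True


# Invented example policy for a CRM demographics dataset.
crm_policy = DatasetPolicy(
    owner="CRM team",
    allowed_roles={"data_scientist"},
    contains_personal_data=True,
    approved_purposes={"recommendations"},
)

print(can_access(crm_policy, "data_scientist", "recommendations"))  # True
print(can_access(crm_policy, "data_scientist", "ad_targeting"))     # False
print(can_access(crm_policy, "intern", "recommendations"))          # False
```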
Data Lineage
Data lineage tracks the origin, movement, and transformation of data from source systems to model inputs. Understanding data lineage is critical for debugging, compliance auditing, and ensuring reproducibility. Document the complete data flow including extraction methods, transformation rules, and any intermediate processing steps.
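One lightweight way to document lineage is to record each processing step with its inputs, output, and method, then walk the records backwards from a model input to its source systems. The sketch below assumes a hypothetical `LineageStep` record and invented dataset names. Dedicated lineage tools offer far more, so treat this only as an illustration of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageStep:
    """One documented step: which inputs produced which output, and how."""
    output: str
    inputs: tuple
    method: str  # extraction method or transformation rule, in plain words


def trace_to_sources(target, steps):
    """Walk lineage backwards from `target` and return the root sources."""
    by_output = {step.output: step for step in steps}
    sources, seen, stack = set(), set(), [target]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        step = by_output.get(name)
        if step is None:
            # No producing step is recorded, so treat it as a source system.
            sources.add(name)
        else:
            stack.extend(step.inputs)
    return sorted(sources)


# Invented dataset names for the example.
steps = [
    LineageStep("purchases_clean", ("pos.purchases",),
                "nightly extract; drop rows without customer_id"),
    LineageStep("model_input", ("purchases_clean", "erp.product_catalog"),
                "join on product_id"),
]

print(trace_to_sources("model_input", steps))
# ['erp.product_catalog', 'pos.purchases']
```

Keeping the method description next to each step gives auditors and debuggers the transformation rule alongside the data flow.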
Example Scenario
"For the retail company's recommendation engine project, the data team conducts a data inventory revealing customer purchase history in the POS system, website browsing behavior in Google Analytics, product catalog in the ERP system, and customer demographics in the CRM. The data quality assessment shows 85% completeness in purchase history but only 60% in demographic data. Data governance review identifies that combining browsing data with purchase data requires explicit customer consent under GDPR, necessitating a consent management implementation before data combination."
Summary
Module 3 has covered the essential process of identifying data requirements:
- Data inventory documents all available data assets and their characteristics
- Data quality assessment evaluates completeness, accuracy, consistency, and timeliness
- Data governance establishes policies for responsible data use and compliance
- Data lineage tracking enables traceability and reproducibility
- Early data requirement identification prevents downstream surprises