4.2 Phase 2: Data Curation & Management

Data quality and governance form the foundation of responsible AI. This phase establishes practices for sourcing, documenting, and preparing data that meets ethical, legal, and technical requirements.

AI lifecycle: 1. Ideation → 2. Data (this phase) → 3. Development → 4. Testing → 5. Deployment → 6. Monitoring

Key Takeaways

  • EU AI Act Article 10 mandates data governance for high-risk AI training datasets
  • Bias in training data is the leading cause of discriminatory AI outcomes
  • Data provenance documentation is essential for regulatory compliance and liability management
  • Copyright and IP clearance for training data is an emerging legal battleground

4.2.1 Data Lineage & Provenance Tracking

Data lineage documents the complete journey of data from its original source through all transformations to its final use in an AI model. Provenance tracking establishes the origin, ownership, and legal basis for data use. Together, they form the foundation of responsible data management.

EU AI Act Data Governance Requirements

Article 10: Data and Data Governance

High-risk AI systems must use training, validation, and testing datasets that:

  • Are subject to appropriate data governance and management practices
  • Are relevant, representative, free of errors, and complete
  • Have appropriate statistical properties for the intended purpose
  • Take into account characteristics specific to the geographic, contextual, or behavioral setting in which the system will be used

Data Lineage Framework

Each lineage element below lists its documentation requirements, followed by the tools and methods that support them.

Source Origin
  • Original data source identification
  • Collection methodology
  • Collection date/time range
  • Geographic scope
  Tools & methods: Data catalogs, source contracts, metadata repositories

Legal Basis
  • Consent records (if applicable)
  • Contractual basis documentation
  • Legitimate interest assessment
  • License terms and restrictions
  Tools & methods: Consent management platforms, contract management systems

Transformations
  • Data cleaning operations
  • Feature engineering steps
  • Aggregation and anonymization
  • Augmentation techniques
  Tools & methods: ETL pipeline logs, version control, transformation scripts

Quality Metrics
  • Completeness scores
  • Accuracy validation results
  • Consistency checks
  • Freshness indicators
  Tools & methods: Data quality tools, automated validation pipelines

Access History
  • Who accessed data and when
  • Purpose of access
  • Downstream uses
  • Sharing and transfer records
  Tools & methods: Access logs, data sharing agreements, audit trails
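
Transformation lineage can also be captured directly in pipeline code. The sketch below is a minimal, tool-agnostic illustration in Python; the record_step helper, its field names, and the sample data are hypothetical rather than drawn from any of the tools listed above.

```python
import datetime
import hashlib
import json

import pandas as pd


def dataset_fingerprint(df: pd.DataFrame) -> str:
    """Content hash so each lineage entry can be tied to an exact dataset state."""
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()[:16]


lineage_log = []


def record_step(df: pd.DataFrame, operation: str, params: dict) -> pd.DataFrame:
    """Append one lineage entry (hypothetical schema) after a transformation."""
    lineage_log.append({
        "timestamp": datetime.datetime.utcnow().isoformat(),
        "operation": operation,
        "parameters": params,
        "rows": len(df),
        "fingerprint": dataset_fingerprint(df),
    })
    return df


raw = pd.DataFrame({"age": [34, None, 51], "income": [52000, 61000, None]})
df = record_step(raw, "ingest", {"source": "crm_export_v1"})
df = record_step(df.dropna(), "drop_missing", {"strategy": "listwise"})

print(json.dumps(lineage_log, indent=2))
```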

Data Card Template

Every training dataset should have an accompanying Data Card that documents key characteristics:

Dataset Identification

  • Dataset ID: [Unique identifier]
  • Version: [Version number and date]
  • Data steward: [Data steward responsible]
  • Classification: [Confidentiality level]

Composition

  • Records: [Number]
  • Features: [Number and types]
  • Time period: [Start - End dates]
  • Geographic coverage: [Regions/countries covered]

Representativeness

  • Intended population: [Target population description]
  • Demographic breakdown: [Key demographic distributions]
  • Known gaps: [Underrepresented groups or scenarios]

Legal & Ethical

  • Legal basis: [GDPR basis / license type]
  • Personal data: [Yes/No - if yes, specify types]
  • Sensitive data: [Special category data present?]
  • Usage restrictions: [Any limitations on use]
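
One lightweight way to keep a Data Card versioned alongside the dataset is to encode it as a structured record. The sketch below uses a Python dataclass with illustrative field names and example values; it is not a standardized schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DataCard:
    # Dataset identification
    dataset_id: str
    version: str
    data_steward: str
    confidentiality: str
    # Composition
    num_records: int
    features: list
    time_range: str
    geographic_coverage: list
    # Representativeness
    target_population: str
    known_gaps: list = field(default_factory=list)
    # Legal & ethical
    legal_basis: str = "unspecified"
    contains_personal_data: bool = False
    special_category_data: bool = False
    usage_restrictions: str = ""


card = DataCard(
    dataset_id="loan-applications-2023",
    version="1.2 (2024-03-01)",
    data_steward="data-governance@example.org",
    confidentiality="internal",
    num_records=250_000,
    features=["age", "income", "region", "loan_amount"],
    time_range="2019-01 to 2023-12",
    geographic_coverage=["DE", "FR", "NL"],
    target_population="retail loan applicants in the EU",
    known_gaps=["applicants under 21 underrepresented"],
    legal_basis="GDPR Art. 6(1)(b) - contract",
    contains_personal_data=True,
)
print(json.dumps(asdict(card), indent=2))
```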

4.2.2 Bias Detection in Training Data

Training data bias is the single largest source of discriminatory AI outcomes. Systematic bias detection must be integrated into the data preparation pipeline to identify and mitigate issues before they propagate into model behavior.

Types of Data Bias

📊 Representation Bias

Occurs when certain groups are underrepresented or overrepresented in training data relative to the target population.

Example: A facial recognition dataset with 80% light-skinned faces leads to lower accuracy for darker-skinned individuals.
Detection: Compare demographic distributions in training data vs. target population using chi-square tests or KL divergence.
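
A minimal sketch of this detection step, assuming group counts from the training data and target-population shares are available (the groups, counts, and shares below are illustrative):

```python
import numpy as np
from scipy.special import rel_entr
from scipy.stats import chisquare

# Illustrative demographic group counts in the training data (groups A, B, C)
train_counts = np.array([8000, 1200, 800])
# Assumed shares of the same groups in the target population
population_share = np.array([0.60, 0.25, 0.15])

# Chi-square goodness-of-fit: does the sample match the target population?
expected = population_share * train_counts.sum()
stat, p_value = chisquare(f_obs=train_counts, f_exp=expected)

# KL divergence between observed and target group distributions
observed_share = train_counts / train_counts.sum()
kl = rel_entr(observed_share, population_share).sum()

print(f"chi2={stat:.1f}, p={p_value:.3g}, KL divergence={kl:.4f}")
```
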
📜 Historical Bias

Reflects societal prejudices embedded in historical data, even when data accurately represents past reality.

Example: Hiring data showing historical underrepresentation of women in tech roles trains AI to perpetuate this pattern.
Detection: Analyze outcome distributions across protected groups; compare to equity benchmarks rather than historical rates.
📏 Measurement Bias

Arises when the features or labels used as proxies do not measure the intended concept equally across groups.

Example: Using zip code as a feature, which correlates with race due to historical housing discrimination.
Detection: Correlation analysis between features and protected attributes; proxy variable identification.
🎯 Sampling Bias

Results from non-random sampling that systematically excludes certain populations or scenarios.

Example: Online survey data excludes elderly populations with lower internet access.
Detection: Compare sampling methodology against target population; analyze coverage gaps.
🏷️ Label Bias

Occurs when labels are assigned inconsistently or reflect annotator prejudices.

Example: Content moderation labels that disproportionately flag African American Vernacular English as "toxic."
Detection: Inter-annotator agreement analysis across demographic groups; label distribution analysis.

Temporal Bias

Data from specific time periods may not represent current or future conditions.

Example: Economic models trained on pre-pandemic data failing to account for changed behaviors.
Detection: Time-series analysis of feature distributions; concept drift detection.

Bias Detection Methodology

1. Define Protected Attributes

Identify attributes requiring fairness analysis based on legal requirements and ethical considerations:

  • Legally protected: Race, gender, age, disability, religion, national origin
  • Context-specific: Geographic location, language, socioeconomic indicators
  • Proxy variables: Features that correlate with protected attributes

2. Statistical Analysis

Conduct quantitative assessment of data distributions:

  • Group representation ratios
  • Feature distribution comparisons (KS test, chi-square; see the sketch after this list)
  • Label distribution across groups
  • Correlation analysis with protected attributes

3. Qualitative Review

Expert examination of data collection and labeling processes:

  • Review data collection methodology for systemic exclusion
  • Examine labeling guidelines for potential bias
  • Assess annotator demographics and training
  • Identify potential proxy variables

4. Documentation & Remediation

Record findings and implement corrections:

  • Document all identified biases and their potential impacts
  • Develop mitigation strategies (resampling, augmentation, weighting)
  • Track remediation effectiveness
  • Disclose known residual biases in model documentation
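
For the feature distribution comparisons in step 2, a two-sample Kolmogorov-Smirnov test per feature across groups is a common starting point. The sketch below uses synthetic income values for two hypothetical groups:

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic income distributions for two hypothetical demographic groups
rng = np.random.default_rng(seed=0)
income_group_a = rng.normal(loc=55_000, scale=12_000, size=500)
income_group_b = rng.normal(loc=48_000, scale=15_000, size=500)

# Two-sample KS test: a small p-value indicates the feature is distributed
# differently across the two groups and warrants closer review
stat, p_value = ks_2samp(income_group_a, income_group_b)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3g}")
```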

Bias Detection Metrics

Metric | Formula | Interpretation | Threshold
Representation Ratio | group % in data / group % in population | 1.0 = proportional representation | 0.8–1.25 acceptable range
Label Imbalance Ratio | positive rate of group A / positive rate of group B | 1.0 = equal positive rates | 0.8–1.25 (four-fifths rule)
Feature Correlation | Pearson/Spearman correlation with protected attribute | 0 = no correlation | absolute value of r < 0.3 generally acceptable
Coverage Gap | % of target scenarios not represented in data | 0% = complete coverage | < 5% for high-risk systems
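
These metrics can be computed directly from a labeled dataset. The sketch below uses hypothetical column names and assumed target-population shares, flags representation ratios outside the 0.8–1.25 band, and computes feature correlation against a 0/1-encoded protected attribute:

```python
import pandas as pd

# Hypothetical dataset: protected group membership, outcome label, one feature
df = pd.DataFrame({
    "group":  ["A", "A", "A", "B", "B", "B", "B", "B"],
    "label":  [1, 0, 1, 0, 0, 1, 0, 0],
    "income": [61_000, 54_000, 58_000, 47_000, 45_000, 52_000, 43_000, 46_000],
})
population_share = {"A": 0.45, "B": 0.55}   # assumed target-population shares

# Representation ratio: share in the data vs. share in the target population
data_share = df["group"].value_counts(normalize=True)
for group, pop in population_share.items():
    ratio = data_share.get(group, 0.0) / pop
    status = "ok" if 0.8 <= ratio <= 1.25 else "outside 0.8-1.25 range"
    print(f"representation ratio {group}: {ratio:.2f} ({status})")

# Label imbalance ratio (four-fifths rule): positive rate of A over positive rate of B
positive_rate = df.groupby("group")["label"].mean()
print(f"label imbalance ratio A/B: {positive_rate['A'] / positive_rate['B']:.2f}")

# Feature correlation with the protected attribute (group encoded as 0/1)
encoded_group = (df["group"] == "A").astype(int)
print(f"income vs. group correlation: {df['income'].corr(encoded_group):.2f}")
```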

4.2.3 Data Minimization & Privacy-Preserving Techniques

GDPR Article 5(1)(c) requires that personal data be "adequate, relevant and limited to what is necessary." For AI systems, this principle must be balanced against the data requirements for effective model training.

Data Minimization Framework

Necessity Test

For each data element, document:

  • Why is this data necessary for the stated purpose?
  • Could the objective be achieved with less data?
  • What is the marginal benefit of including this data?
  • What are the privacy risks of collection and processing?

Proportionality Test

Evaluate the balance between:

  • Model performance improvement from additional data
  • Privacy intrusion and individual rights impact
  • Data protection compliance risk
  • Storage and security burden

Privacy-Preserving Techniques

Differential Privacy

Mathematical framework adding calibrated noise to protect individual records while preserving statistical properties.

Use Case: Training models on sensitive data where individual privacy must be guaranteed.
Trade-off: Privacy budget (ε) vs. model accuracy
Implementation: TensorFlow Privacy, PyTorch Opacus, Google DP Library
Pr[M(D) ∈ S] ≤ e^ε × Pr[M(D') ∈ S]

For any two neighboring datasets D and D' (differing in one record) and any set of outputs S, the probability that mechanism M produces an output in S differs by at most a factor of e^ε.
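
The guarantee can be illustrated with the classic Laplace mechanism on a counting query. This is a hand-rolled sketch for intuition only, not a substitute for the libraries listed; the salary values are made up.

```python
import numpy as np


def private_count(values, threshold, epsilon):
    """Count entries above a threshold and add Laplace noise.
    A counting query has sensitivity 1, so the noise scale is 1/epsilon."""
    true_count = sum(v > threshold for v in values)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise


salaries = [41_000, 52_000, 67_000, 93_000, 120_000]
for eps in (0.1, 1.0, 10.0):
    noisy = private_count(salaries, threshold=60_000, epsilon=eps)
    print(f"epsilon={eps:>4}: noisy count above 60k = {noisy:.1f}")
```

Smaller values of ε inject more noise, so the privacy/accuracy trade-off noted above shows up directly in the 1/ε noise scale.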

Federated Learning

Training models across decentralized data sources without centralizing raw data.

Use Case: Healthcare AI, mobile device personalization, cross-organizational collaboration
Trade-off: Communication overhead, model convergence challenges
Implementation: TensorFlow Federated, PySyft, NVIDIA FLARE
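
At its core, the server-side step of federated averaging (FedAvg) is a dataset-size-weighted mean of locally trained parameters; raw data never leaves the clients, only their model updates. A framework-agnostic sketch with NumPy arrays standing in for model weights:

```python
import numpy as np


def fedavg(client_weights, client_sizes):
    """Aggregate client parameters as a weighted average (weights = local dataset sizes).
    client_weights: one list of layer arrays per client."""
    total = sum(client_sizes)
    num_layers = len(client_weights[0])
    return [
        sum(weights[layer] * (size / total)
            for weights, size in zip(client_weights, client_sizes))
        for layer in range(num_layers)
    ]


# Two hypothetical clients, each holding a single 2x2 weight matrix after local training
client_a = [np.array([[1.0, 2.0], [3.0, 4.0]])]
client_b = [np.array([[5.0, 6.0], [7.0, 8.0]])]

global_weights = fedavg([client_a, client_b], client_sizes=[100, 300])
print(global_weights[0])   # pulled toward client_b, which holds more data
```

The frameworks listed above layer secure aggregation, client sampling, and convergence handling on top of this basic step.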

Synthetic Data Generation

Creating artificial datasets that preserve statistical properties without containing real individual records.

Use Case: Training/testing when real data is restricted, data augmentation
Trade-off: Fidelity to real-world distributions, potential for memorization
Implementation: Synthetic Data Vault, Gretel.ai, MOSTLY AI
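
As a toy illustration of the principle (not of the tools listed), the sketch below fits a multivariate normal to a real table's means and covariances and samples artificial rows from it. Production synthesizers model far richer structure, but the idea of preserving aggregate statistics without copying real records is the same.

```python
import numpy as np
import pandas as pd

# Hypothetical "real" table of numeric attributes
real = pd.DataFrame({
    "age":    [34, 45, 29, 52, 41, 38],
    "income": [52_000, 61_000, 48_000, 75_000, 58_000, 55_000],
})

# Fit only aggregate statistics: column means and the covariance matrix
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()

# Sample synthetic rows that preserve those statistics but contain no real record
rng = np.random.default_rng(seed=42)
synthetic = pd.DataFrame(rng.multivariate_normal(mean, cov, size=1000),
                         columns=real.columns)

print(real.mean(), synthetic.mean(), sep="\n\n")   # aggregate statistics roughly match
```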

K-Anonymity & L-Diversity

Ensuring each record is indistinguishable from at least k-1 other records on its quasi-identifiers (k-anonymity), with l-diversity additionally requiring sufficient variety of sensitive attribute values within each group.

Use Case: Releasing datasets for research, regulatory reporting
Trade-off: Data utility reduction, vulnerable to background knowledge attacks
Implementation: ARX Data Anonymization Tool, sdcMicro
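
A quick check of both properties on a table prepared for release can be done with a group-by over the quasi-identifiers; the columns below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "age_band":  ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "zip3":      ["101",   "101",   "101",   "102",   "102"],
    "diagnosis": ["flu",   "flu",   "asthma", "diabetes", "diabetes"],  # sensitive attribute
})
quasi_identifiers = ["age_band", "zip3"]

groups = df.groupby(quasi_identifiers)
k = groups.size().min()                      # k-anonymity: size of the smallest equivalence class
l = groups["diagnosis"].nunique().min()      # l-diversity: least-diverse class on the sensitive attribute

print(f"table is {k}-anonymous and {l}-diverse on {quasi_identifiers}")
```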

Secure Multi-Party Computation

Cryptographic protocols enabling computation on combined data without revealing individual inputs.

Use Case: Joint model training across competitors, privacy-preserving inference
Trade-off: Computational overhead, protocol complexity
Implementation: MP-SPDZ, CrypTen, TF Encrypted
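
The simplest building block is additive secret sharing: each party splits its private value into random shares that individually reveal nothing, yet the shares of all inputs combine to the joint result. A minimal sketch with made-up party inputs:

```python
import secrets

PRIME = 2**61 - 1   # arithmetic is done modulo a large prime


def share(value, num_parties=3):
    """Split an integer into additive shares that sum to the value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(num_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


# Each party holds one private input (e.g. a local count or statistic)
inputs = {"party_a": 42, "party_b": 17, "party_c": 99}

# Every party splits its input; share i would be sent to party i
all_shares = {name: share(value) for name, value in inputs.items()}

# Each party sums the shares it received, seeing only random-looking values
partial_sums = [sum(all_shares[name][i] for name in inputs) % PRIME for i in range(3)]

# Combining the partial sums reveals only the total, never any individual input
print(sum(partial_sums) % PRIME)   # 158
```

Protocols implemented by the libraries listed above extend this idea to multiplications and full computations.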

Homomorphic Encryption

Enabling computation on encrypted data without decryption.

Use Case: Inference on sensitive data, cloud-based ML on encrypted inputs
Trade-off: Significant computational overhead, limited operations
Implementation: Microsoft SEAL, OpenFHE, Concrete ML

Privacy Technique Selection Guide

Scenario | Recommended Technique | Key Considerations
Training on health records | Differential Privacy + Federated Learning | Strong privacy guarantees required; data cannot leave institutions
Sharing data for research | Synthetic Data Generation | Balance utility with privacy; validate statistical fidelity
Cross-organization model training | Federated Learning | Data sovereignty requirements; model aggregation security
Inference on sensitive inputs | Homomorphic Encryption or Secure MPC | Performance overhead acceptable; strong security needed
Regulatory dataset release | K-Anonymity + Differential Privacy | Formal privacy guarantees for compliance demonstration

Implementation Guide

Data Curation Phase Deliverables

Tooling Recommendations

Capability | Open Source Options | Commercial Options
Data Cataloging & Lineage | Apache Atlas, DataHub, Amundsen | Collibra, Alation, Informatica
Bias Detection | Fairlearn, AI Fairness 360, Aequitas | Fiddler AI, Arthur AI
Privacy Techniques | TensorFlow Privacy, PySyft, OpenDP | Gretel.ai, MOSTLY AI, Privitar
Data Quality | Great Expectations, Deequ, Soda | Talend, Informatica DQ, Monte Carlo