4.2 Phase 2: Data Curation & Management

Data quality and governance form the foundation of responsible AI. This phase establishes practices for sourcing, documenting, and preparing data that meets ethical, legal, and technical requirements.

AI lifecycle: 1. Ideation → 2. Data (this phase) → 3. Development → 4. Testing → 5. Deployment → 6. Monitoring

Key Takeaways

  • EU AI Act Article 10 mandates data governance for high-risk AI training datasets
  • Bias in training data is the leading cause of discriminatory AI outcomes
  • Data provenance documentation is essential for regulatory compliance and liability management
  • Copyright and IP clearance for training data is an emerging legal battleground

4.2.1 Data Lineage & Provenance Tracking

Data lineage documents the complete journey of data from its original source through all transformations to its final use in an AI model. Provenance tracking establishes the origin, ownership, and legal basis for data use. Together, they form the foundation of responsible data management.

EU AI Act Data Governance Requirements

Article 10: Data and Data Governance

High-risk AI systems must use training, validation, and testing datasets that:

  • Are subject to appropriate data governance and management practices
  • Are relevant, representative, free of errors, and complete
  • Have appropriate statistical properties for the intended purpose
  • Take into account characteristics specific to the geographic, contextual, or behavioral setting in which the system will be used

Data Lineage Framework

Each lineage element below lists its documentation requirements, followed by the tools and methods that support them.

Source Origin
  • Original data source identification
  • Collection methodology
  • Collection date/time range
  • Geographic scope
  Tools & methods: Data catalogs, source contracts, metadata repositories

Legal Basis
  • Consent records (if applicable)
  • Contractual basis documentation
  • Legitimate interest assessment
  • License terms and restrictions
  Tools & methods: Consent management platforms, contract management systems

Transformations
  • Data cleaning operations
  • Feature engineering steps
  • Aggregation and anonymization
  • Augmentation techniques
  Tools & methods: ETL pipeline logs, version control, transformation scripts

Quality Metrics
  • Completeness scores
  • Accuracy validation results
  • Consistency checks
  • Freshness indicators
  Tools & methods: Data quality tools, automated validation pipelines

Access History
  • Who accessed data and when
  • Purpose of access
  • Downstream uses
  • Sharing and transfer records
  Tools & methods: Access logs, data sharing agreements, audit trails
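
Transformation lineage can also be captured directly in pipeline code. The sketch below is a minimal, tool-agnostic illustration in Python; the record_step helper, its field names, and the sample data are hypothetical rather than drawn from any of the tools listed above.

```python
import datetime
import hashlib
import json

import pandas as pd


def dataset_fingerprint(df: pd.DataFrame) -> str:
    """Content hash so each lineage entry can be tied to an exact dataset state."""
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()[:16]


lineage_log = []


def record_step(df: pd.DataFrame, operation: str, params: dict) -> pd.DataFrame:
    """Append one lineage entry (hypothetical schema) after a transformation."""
    lineage_log.append({
        "timestamp": datetime.datetime.utcnow().isoformat(),
        "operation": operation,
        "parameters": params,
        "rows": len(df),
        "fingerprint": dataset_fingerprint(df),
    })
    return df


raw = pd.DataFrame({"age": [34, None, 51], "income": [52000, 61000, None]})
df = record_step(raw, "ingest", {"source": "crm_export_v1"})
df = record_step(df.dropna(), "drop_missing", {"strategy": "listwise"})

print(json.dumps(lineage_log, indent=2))
```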

Data Card Template

Every training dataset should have an accompanying Data Card that documents key characteristics:

Dataset Identification

  • Dataset ID: [Unique identifier]
  • Version: [Version number and date]
  • Data steward: [Data steward responsible]
  • Classification: [Confidentiality level]

Composition

  • Records: [Number]
  • Features: [Number and types]
  • Time period: [Start - End dates]
  • Geographic coverage: [Regions/countries covered]

Representativeness

  • Intended population: [Target population description]
  • Demographic breakdown: [Key demographic distributions]
  • Known gaps: [Underrepresented groups or scenarios]

Legal & Ethical

  • Legal basis: [GDPR basis / license type]
  • Personal data: [Yes/No - if yes, specify types]
  • Sensitive data: [Special category data present?]
  • Usage restrictions: [Any limitations on use]
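
One lightweight way to keep a Data Card versioned alongside the dataset is to encode it as a structured record. The sketch below uses a Python dataclass with illustrative field names and example values; it is not a standardized schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DataCard:
    # Dataset identification
    dataset_id: str
    version: str
    data_steward: str
    confidentiality: str
    # Composition
    num_records: int
    features: list
    time_range: str
    geographic_coverage: list
    # Representativeness
    target_population: str
    known_gaps: list = field(default_factory=list)
    # Legal & ethical
    legal_basis: str = "unspecified"
    contains_personal_data: bool = False
    special_category_data: bool = False
    usage_restrictions: str = ""


card = DataCard(
    dataset_id="loan-applications-2023",
    version="1.2 (2024-03-01)",
    data_steward="data-governance@example.org",
    confidentiality="internal",
    num_records=250_000,
    features=["age", "income", "region", "loan_amount"],
    time_range="2019-01 to 2023-12",
    geographic_coverage=["DE", "FR", "NL"],
    target_population="retail loan applicants in the EU",
    known_gaps=["applicants under 21 underrepresented"],
    legal_basis="GDPR Art. 6(1)(b) - contract",
    contains_personal_data=True,
)
print(json.dumps(asdict(card), indent=2))
```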

4.2.2 Bias Detection in Training Data

Training data bias is the single largest source of discriminatory AI outcomes. Systematic bias detection must be integrated into the data preparation pipeline to identify and mitigate issues before they propagate into model behavior.

Types of Data Bias

📊 Representation Bias

Occurs when certain groups are underrepresented or overrepresented in training data relative to the target population.

Example: A facial recognition dataset with 80% light-skinned faces leads to lower accuracy for darker-skinned individuals.
Detection: Compare demographic distributions in training data vs. target population using chi-square tests or KL divergence.
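
A minimal sketch of this detection step, assuming group counts from the training data and target-population shares are available (the groups, counts, and shares below are illustrative):

```python
import numpy as np
from scipy.special import rel_entr
from scipy.stats import chisquare

# Illustrative demographic group counts in the training data (groups A, B, C)
train_counts = np.array([8000, 1200, 800])
# Assumed shares of the same groups in the target population
population_share = np.array([0.60, 0.25, 0.15])

# Chi-square goodness-of-fit: does the sample match the target population?
expected = population_share * train_counts.sum()
stat, p_value = chisquare(f_obs=train_counts, f_exp=expected)

# KL divergence between observed and target group distributions
observed_share = train_counts / train_counts.sum()
kl = rel_entr(observed_share, population_share).sum()

print(f"chi2={stat:.1f}, p={p_value:.3g}, KL divergence={kl:.4f}")
```
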
📜 Historical Bias

Reflects societal prejudices embedded in historical data, even when data accurately represents past reality.

Example: Hiring data showing historical underrepresentation of women in tech roles trains AI to perpetuate this pattern.
Detection: Analyze outcome distributions across protected groups; compare to equity benchmarks rather than historical rates.
📏 Measurement Bias

Arises when the features or labels used as proxies do not measure the intended concept equally across groups.

Example: Using zip code as a feature, which correlates with race due to historical housing discrimination.
Detection: Correlation analysis between features and protected attributes; proxy variable identification.
🎯 Sampling Bias

Results from non-random sampling that systematically excludes certain populations or scenarios.

Example: Online survey data excludes elderly populations with lower internet access.
Detection: Compare sampling methodology against target population; analyze coverage gaps.
🏷️ Label Bias

Occurs when labels are assigned inconsistently or reflect annotator prejudices.

Example: Content moderation labels that disproportionately flag African American Vernacular English as "toxic."
Detection: Inter-annotator agreement analysis across demographic groups; label distribution analysis.

Temporal Bias

Data from specific time periods may not represent current or future conditions.

Example: Economic models trained on pre-pandemic data failing to account for changed behaviors.
Detection: Time-series analysis of feature distributions; concept drift detection.

Bias Detection Methodology

1. Define Protected Attributes

Identify attributes requiring fairness analysis based on legal requirements and ethical considerations:

  • Legally protected: Race, gender, age, disability, religion, national origin
  • Context-specific: Geographic location, language, socioeconomic indicators
  • Proxy variables: Features that correlate with protected attributes

2. Statistical Analysis

Conduct quantitative assessment of data distributions:

  • Group representation ratios
  • Feature distribution comparisons (KS test, chi-square; see the sketch after this list)
  • Label distribution across groups
  • Correlation analysis with protected attributes

3. Qualitative Review

Expert examination of data collection and labeling processes:

  • Review data collection methodology for systemic exclusion
  • Examine labeling guidelines for potential bias
  • Assess annotator demographics and training
  • Identify potential proxy variables

4. Documentation & Remediation

Record findings and implement corrections:

  • Document all identified biases and their potential impacts
  • Develop mitigation strategies (resampling, augmentation, weighting)
  • Track remediation effectiveness
  • Disclose known residual biases in model documentation
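
For the feature distribution comparisons in step 2, a two-sample Kolmogorov-Smirnov test per feature across groups is a common starting point. The sketch below uses synthetic income values for two hypothetical groups:

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic income distributions for two hypothetical demographic groups
rng = np.random.default_rng(seed=0)
income_group_a = rng.normal(loc=55_000, scale=12_000, size=500)
income_group_b = rng.normal(loc=48_000, scale=15_000, size=500)

# Two-sample KS test: a small p-value indicates the feature is distributed
# differently across the two groups and warrants closer review
stat, p_value = ks_2samp(income_group_a, income_group_b)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3g}")
```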

Bias Detection Metrics

Metric | Formula | Interpretation | Threshold
Representation Ratio | group % in data / group % in population | 1.0 = proportional representation | 0.8–1.25 acceptable range
Label Imbalance Ratio | positive rate of group A / positive rate of group B | 1.0 = equal positive rates | 0.8–1.25 (four-fifths rule)
Feature Correlation | Pearson/Spearman correlation with protected attribute | 0 = no correlation | absolute value of r < 0.3 generally acceptable
Coverage Gap | % of target scenarios not represented in data | 0% = complete coverage | < 5% for high-risk systems
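
These metrics can be computed directly from a labeled dataset. The sketch below uses hypothetical column names and assumed target-population shares, flags representation ratios outside the 0.8–1.25 band, and computes feature correlation against a 0/1-encoded protected attribute:

```python
import pandas as pd

# Hypothetical dataset: protected group membership, outcome label, one feature
df = pd.DataFrame({
    "group":  ["A", "A", "A", "B", "B", "B", "B", "B"],
    "label":  [1, 0, 1, 0, 0, 1, 0, 0],
    "income": [61_000, 54_000, 58_000, 47_000, 45_000, 52_000, 43_000, 46_000],
})
population_share = {"A": 0.45, "B": 0.55}   # assumed target-population shares

# Representation ratio: share in the data vs. share in the target population
data_share = df["group"].value_counts(normalize=True)
for group, pop in population_share.items():
    ratio = data_share.get(group, 0.0) / pop
    status = "ok" if 0.8 <= ratio <= 1.25 else "outside 0.8-1.25 range"
    print(f"representation ratio {group}: {ratio:.2f} ({status})")

# Label imbalance ratio (four-fifths rule): positive rate of A over positive rate of B
positive_rate = df.groupby("group")["label"].mean()
print(f"label imbalance ratio A/B: {positive_rate['A'] / positive_rate['B']:.2f}")

# Feature correlation with the protected attribute (group encoded as 0/1)
encoded_group = (df["group"] == "A").astype(int)
print(f"income vs. group correlation: {df['income'].corr(encoded_group):.2f}")
```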

4.2.3 Data Minimization & Privacy-Preserving Techniques

GDPR Article 5(1)(c) requires that personal data be "adequate, relevant and limited to what is necessary." For AI systems, this principle must be balanced against the data requirements for effective model training.

Data Minimization Framework

Necessity Test

For each data element, document:

  • Why is this data necessary for the stated purpose?
  • Could the objective be achieved with less data?
  • What is the marginal benefit of including this data?
  • What are the privacy risks of collection and processing?

Proportionality Test

Evaluate the balance between:

  • Model performance improvement from additional data
  • Privacy intrusion and individual rights impact
  • Data protection compliance risk
  • Storage and security burden

Privacy-Preserving Techniques

Differential Privacy

Mathematical framework adding calibrated noise to protect individual records while preserving statistical properties.

Use Case: Training models on sensitive data where individual privacy must be guaranteed.
Trade-off: Privacy budget (ε) vs. model accuracy
Implementation: TensorFlow Privacy, PyTorch Opacus, Google DP Library
Pr[M(D) ∈ S] ≤ e^ε × Pr[M(D') ∈ S]

For any two neighboring datasets D and D' (differing in one record) and any set of outputs S, the probability that mechanism M produces an output in S differs by at most a factor of e^ε.
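
The guarantee can be illustrated with the classic Laplace mechanism on a counting query. This is a hand-rolled sketch for intuition only, not a substitute for the libraries listed; the salary values are made up.

```python
import numpy as np


def private_count(values, threshold, epsilon):
    """Count entries above a threshold and add Laplace noise.
    A counting query has sensitivity 1, so the noise scale is 1/epsilon."""
    true_count = sum(v > threshold for v in values)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise


salaries = [41_000, 52_000, 67_000, 93_000, 120_000]
for eps in (0.1, 1.0, 10.0):
    noisy = private_count(salaries, threshold=60_000, epsilon=eps)
    print(f"epsilon={eps:>4}: noisy count above 60k = {noisy:.1f}")
```

Smaller values of ε inject more noise, so the privacy/accuracy trade-off noted above shows up directly in the 1/ε noise scale.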

Federated Learning

Training models across decentralized data sources without centralizing raw data.

Use Case: Healthcare AI, mobile device personalization, cross-organizational collaboration
Trade-off: Communication overhead, model convergence challenges
Implementation: TensorFlow Federated, PySyft, NVIDIA FLARE
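
At its core, the server-side step of federated averaging (FedAvg) is a dataset-size-weighted mean of locally trained parameters; raw data never leaves the clients, only their model updates. A framework-agnostic sketch with NumPy arrays standing in for model weights:

```python
import numpy as np


def fedavg(client_weights, client_sizes):
    """Aggregate client parameters as a weighted average (weights = local dataset sizes).
    client_weights: one list of layer arrays per client."""
    total = sum(client_sizes)
    num_layers = len(client_weights[0])
    return [
        sum(weights[layer] * (size / total)
            for weights, size in zip(client_weights, client_sizes))
        for layer in range(num_layers)
    ]


# Two hypothetical clients, each holding a single 2x2 weight matrix after local training
client_a = [np.array([[1.0, 2.0], [3.0, 4.0]])]
client_b = [np.array([[5.0, 6.0], [7.0, 8.0]])]

global_weights = fedavg([client_a, client_b], client_sizes=[100, 300])
print(global_weights[0])   # pulled toward client_b, which holds more data
```

The frameworks listed above layer secure aggregation, client sampling, and convergence handling on top of this basic step.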

Synthetic Data Generation

Creating artificial datasets that preserve statistical properties without containing real individual records.

Use Case: Training/testing when real data is restricted, data augmentation
Trade-off: Fidelity to real-world distributions, potential for memorization
Implementation: Synthetic Data Vault, Gretel.ai, MOSTLY AI
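
As a toy illustration of the principle (not of the tools listed), the sketch below fits a multivariate normal to a real table's means and covariances and samples artificial rows from it. Production synthesizers model far richer structure, but the idea of preserving aggregate statistics without copying real records is the same.

```python
import numpy as np
import pandas as pd

# Hypothetical "real" table of numeric attributes
real = pd.DataFrame({
    "age":    [34, 45, 29, 52, 41, 38],
    "income": [52_000, 61_000, 48_000, 75_000, 58_000, 55_000],
})

# Fit only aggregate statistics: column means and the covariance matrix
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()

# Sample synthetic rows that preserve those statistics but contain no real record
rng = np.random.default_rng(seed=42)
synthetic = pd.DataFrame(rng.multivariate_normal(mean, cov, size=1000),
                         columns=real.columns)

print(real.mean(), synthetic.mean(), sep="\n\n")   # aggregate statistics roughly match
```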

K-Anonymity & L-Diversity

Ensuring each record is indistinguishable from at least k-1 other records on its quasi-identifiers (k-anonymity), with l-diversity additionally requiring sufficient variety of sensitive attribute values within each group.

Use Case: Releasing datasets for research, regulatory reporting
Trade-off: Data utility reduction, vulnerable to background knowledge attacks
Implementation: ARX Data Anonymization Tool, sdcMicro
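
A quick check of both properties on a table prepared for release can be done with a group-by over the quasi-identifiers; the columns below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "age_band":  ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "zip3":      ["101",   "101",   "101",   "102",   "102"],
    "diagnosis": ["flu",   "flu",   "asthma", "diabetes", "diabetes"],  # sensitive attribute
})
quasi_identifiers = ["age_band", "zip3"]

groups = df.groupby(quasi_identifiers)
k = groups.size().min()                      # k-anonymity: size of the smallest equivalence class
l = groups["diagnosis"].nunique().min()      # l-diversity: least-diverse class on the sensitive attribute

print(f"table is {k}-anonymous and {l}-diverse on {quasi_identifiers}")
```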

Secure Multi-Party Computation

Cryptographic protocols enabling computation on combined data without revealing individual inputs.

Use Case: Joint model training across competitors, privacy-preserving inference
Trade-off: Computational overhead, protocol complexity
Implementation: MP-SPDZ, CrypTen, TF Encrypted
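
The simplest building block is additive secret sharing: each party splits its private value into random shares that individually reveal nothing, yet the shares of all inputs combine to the joint result. A minimal sketch with made-up party inputs:

```python
import secrets

PRIME = 2**61 - 1   # arithmetic is done modulo a large prime


def share(value, num_parties=3):
    """Split an integer into additive shares that sum to the value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(num_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


# Each party holds one private input (e.g. a local count or statistic)
inputs = {"party_a": 42, "party_b": 17, "party_c": 99}

# Every party splits its input; share i would be sent to party i
all_shares = {name: share(value) for name, value in inputs.items()}

# Each party sums the shares it received, seeing only random-looking values
partial_sums = [sum(all_shares[name][i] for name in inputs) % PRIME for i in range(3)]

# Combining the partial sums reveals only the total, never any individual input
print(sum(partial_sums) % PRIME)   # 158
```

Protocols implemented by the libraries listed above extend this idea to multiplications and full computations.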

Homomorphic Encryption

Enabling computation on encrypted data without decryption.

Use Case: Inference on sensitive data, cloud-based ML on encrypted inputs
Trade-off: Significant computational overhead, limited operations
Implementation: Microsoft SEAL, OpenFHE, Concrete ML

Privacy Technique Selection Guide

Scenario | Recommended Technique | Key Considerations
Training on health records | Differential Privacy + Federated Learning | Strong privacy guarantees required; data cannot leave institutions
Sharing data for research | Synthetic Data Generation | Balance utility with privacy; validate statistical fidelity
Cross-organization model training | Federated Learning | Data sovereignty requirements; model aggregation security
Inference on sensitive inputs | Homomorphic Encryption or Secure MPC | Performance overhead acceptable; strong security needed
Regulatory dataset release | K-Anonymity + Differential Privacy | Formal privacy guarantees for compliance demonstration

Implementation Guide

Data Curation Phase Deliverables

Tooling Recommendations

Capability | Open Source Options | Commercial Options
Data Cataloging & Lineage | Apache Atlas, DataHub, Amundsen | Collibra, Alation, Informatica
Bias Detection | Fairlearn, AI Fairness 360, Aequitas | Fiddler AI, Arthur AI
Privacy Techniques | TensorFlow Privacy, PySyft, OpenDP | Gretel.ai, MOSTLY AI, Privitar
Data Quality | Great Expectations, Deequ, Soda | Talend, Informatica DQ, Monte Carlo