4.2 Phase 2: Data Curation & Management
Data quality and governance form the foundation of responsible AI. This phase establishes practices for sourcing, documenting, and preparing data that meets ethical, legal, and technical requirements.
Key Takeaways
- EU AI Act Article 10 mandates data governance for high-risk AI training datasets
- Bias in training data is a leading cause of discriminatory AI outcomes
- Data provenance documentation is essential for regulatory compliance and liability management
- Copyright and IP clearance for training data is an emerging legal battleground
4.2.1 Data Lineage & Provenance Tracking
Data lineage documents the complete journey of data from its original source through all transformations to its final use in an AI model. Provenance tracking establishes the origin, ownership, and legal basis for data use. Together, they form the foundation of responsible data management.
EU AI Act Data Governance Requirements
Article 10: Data and Data Governance
High-risk AI systems must use training, validation, and testing datasets that:
- Are subject to appropriate data governance and management practices
- Are relevant, sufficiently representative, and, to the best extent possible, free of errors and complete in view of the intended purpose
- Have appropriate statistical properties for the intended purpose
- Take into account characteristics specific to the geographic, contextual, or behavioral setting in which the system will be used
Data Lineage Framework
| Lineage Element | Documentation Requirements | Tools & Methods |
|---|---|---|
| Source Origin | Original source, collection method, collection date, and source owner | Data catalogs, source contracts, metadata repositories |
| Legal Basis | Legal grounds for use (consent, contract, legitimate interest) and any usage restrictions | Consent management platforms, contract management systems |
| Transformations | Each cleaning, filtering, labeling, and augmentation step applied, with versions | ETL pipeline logs, version control, transformation scripts |
| Quality Metrics | Completeness, accuracy, and validation results at each pipeline stage | Data quality tools, automated validation pipelines |
| Access History | Who accessed or received the data, when, and under what agreement | Access logs, data sharing agreements, audit trails |
Data Card Template
Every training dataset should have an accompanying Data Card that documents key characteristics across four areas:
- Dataset Identification: name, version, maintainer, and creation date
- Composition: record counts, features, labels, and languages covered
- Representativeness: target population, sampling method, and known coverage gaps
- Legal & Ethical: legal basis, personal data categories, retention period, and copyright status
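A minimal, machine-readable sketch of such a Data Card follows, written as a plain Python dictionary. The field names and values are illustrative only, not a mandated schema:

```python
# Illustrative Data Card skeleton -- field names and values are examples, not a mandated schema.
data_card = {
    "dataset_identification": {
        "name": "customer_support_tickets_v3",   # hypothetical dataset name
        "version": "3.0.1",
        "maintainer": "data-governance@example.com",
        "creation_date": "2024-05-01",
    },
    "composition": {
        "record_count": 1_200_000,
        "features": ["ticket_text", "product_line", "resolution_code"],
        "label": "escalation_required",
        "languages": ["en", "de", "fr"],
    },
    "representativeness": {
        "target_population": "EU retail customers",
        "sampling_method": "stratified by product line and region",
        "known_gaps": ["limited coverage of customers aged 65+"],
    },
    "legal_and_ethical": {
        "legal_basis": "contract performance (GDPR Art. 6(1)(b))",
        "pii_categories": ["name", "email"],
        "retention_period_days": 730,
        "copyright_status": "internally generated enterprise data",
    },
}
```

Storing the card alongside the dataset (for example in version control or a data catalog) lets lineage tooling validate and query it automatically.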
4.2.2 Bias Detection in Training Data
Training data bias is one of the largest sources of discriminatory AI outcomes. Systematic bias detection must be integrated into the data preparation pipeline to identify and mitigate issues before they propagate into model behavior.
Types of Data Bias
Representation Bias
Occurs when certain groups are underrepresented or overrepresented in training data relative to the target population.
Historical Bias
Reflects societal prejudices embedded in historical data, even when data accurately represents past reality.
Measurement Bias
Arises when the features or labels used as proxies do not measure the intended concept equally across groups.
Sampling Bias
Results from non-random sampling that systematically excludes certain populations or scenarios.
Label Bias
Occurs when labels are assigned inconsistently or reflect annotator prejudices.
Temporal Bias
Data from specific time periods may not represent current or future conditions.
Bias Detection Methodology
Define Protected Attributes
Identify attributes requiring fairness analysis based on legal requirements and ethical considerations:
- Legally protected: Race, gender, age, disability, religion, national origin
- Context-specific: Geographic location, language, socioeconomic indicators
- Proxy variables: Features that correlate with protected attributes
Statistical Analysis
Conduct quantitative assessment of data distributions:
- Group representation ratios
- Feature distribution comparisons (KS test, chi-square)
- Label distribution across groups
- Correlation analysis with protected attributes
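A brief sketch of these distribution checks using pandas and SciPy. The dataframe `df`, the file path, and the column names (`gender`, `income`, `label`) are placeholders for illustration:

```python
import pandas as pd
from scipy import stats

# Placeholder training-data frame with a protected attribute, a feature, and a label.
df = pd.read_parquet("training_data.parquet")  # hypothetical path

# Group representation: share of each group in the data.
representation = df["gender"].value_counts(normalize=True)

# Chi-square test: is the label distribution independent of the protected attribute?
contingency = pd.crosstab(df["gender"], df["label"])
chi2, p_value, dof, _ = stats.chi2_contingency(contingency)

# Kolmogorov-Smirnov test: does a continuous feature differ between two groups?
group_a = df.loc[df["gender"] == "A", "income"]
group_b = df.loc[df["gender"] == "B", "income"]
ks_stat, ks_p = stats.ks_2samp(group_a, group_b)

print(representation)
print(f"chi-square={chi2:.2f}, p={p_value:.4f}")
print(f"KS statistic={ks_stat:.3f}, p={ks_p:.4f}")
```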
Qualitative Review
Expert examination of data collection and labeling processes:
- Review data collection methodology for systemic exclusion
- Examine labeling guidelines for potential bias
- Assess annotator demographics and training
- Identify potential proxy variables
Documentation & Remediation
Record findings and implement corrections:
- Document all identified biases and their potential impacts
- Develop mitigation strategies (resampling, augmentation, weighting)
- Track remediation effectiveness
- Disclose known residual biases in model documentation
Bias Detection Metrics
| Metric | Formula | Interpretation | Threshold |
|---|---|---|---|
| Representation Ratio | Group % in data / Group % in population | 1.0 = proportional representation | 0.8 - 1.25 acceptable range |
| Label Imbalance Ratio | Positive rate Group A / Positive rate Group B | 1.0 = equal positive rates | 0.8 - 1.25 (four-fifths rule) |
| Feature Correlation | Pearson/Spearman correlation with protected attribute | 0 = no correlation | \|r\| < 0.3 generally acceptable |
| Coverage Gap | % of target scenarios not represented in data | 0% = complete coverage | < 5% for high-risk systems |
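A minimal sketch computing two of the metrics in the table (representation ratio and label imbalance ratio) with pandas. The population shares, column names, and thresholds are assumptions chosen for illustration:

```python
import pandas as pd

df = pd.read_parquet("training_data.parquet")  # placeholder frame, as in the earlier sketch

def representation_ratio(df, attr, population_share):
    """Group % in data divided by group % in population (1.0 = proportional)."""
    data_share = df[attr].value_counts(normalize=True)
    return data_share / pd.Series(population_share)

def label_imbalance_ratio(df, attr, label, group_a, group_b):
    """Positive rate of group A divided by positive rate of group B."""
    rates = df.groupby(attr)[label].mean()
    return rates[group_a] / rates[group_b]

# Hypothetical usage: census-derived population shares for a 'region' attribute.
population = {"urban": 0.70, "rural": 0.30}
ratios = representation_ratio(df, "region", population)
flagged = ratios[(ratios < 0.8) | (ratios > 1.25)]        # outside the acceptable range

imbalance = label_imbalance_ratio(df, "region", "approved", "urban", "rural")
meets_four_fifths = 0.8 <= imbalance <= 1.25               # four-fifths rule check
```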
4.2.3 Data Minimization & Privacy-Preserving Techniques
GDPR Article 5(1)(c) requires that personal data be "adequate, relevant and limited to what is necessary." For AI systems, this principle must be balanced against the data requirements for effective model training.
Data Minimization Framework
Necessity Test
For each data element, document:
- Why is this data necessary for the stated purpose?
- Could the objective be achieved with less data?
- What is the marginal benefit of including this data?
- What are the privacy risks of collection and processing?
Proportionality Test
Evaluate the balance between:
- Model performance improvement from additional data
- Privacy intrusion and individual rights impact
- Data protection compliance risk
- Storage and security burden
Privacy-Preserving Techniques
Differential Privacy
Mathematical framework adding calibrated noise to protect individual records while preserving statistical properties.
Trade-off: Privacy budget (ε) vs. model accuracy
Implementation: TensorFlow Privacy, PyTorch Opacus, Google DP Library
Pr[M(D) ∈ S] ≤ e^ε × Pr[M(D') ∈ S]
For any neighboring datasets D and D' (differing in a single record) and any set of outputs S, the probability that the mechanism M produces an output in S differs by at most a factor of e^ε.
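As a toy illustration of the definition above (not a substitute for the listed libraries), a Laplace-mechanism sketch for a counting query: noise scaled to sensitivity/ε yields the ε-differential-privacy guarantee. The dataset and query are hypothetical:

```python
import numpy as np

def dp_count(values, predicate, epsilon, sensitivity=1.0, rng=None):
    """epsilon-DP count query via the Laplace mechanism.

    Adding or removing one record changes the true count by at most
    `sensitivity`, so Laplace noise with scale sensitivity/epsilon
    satisfies Pr[M(D) in S] <= e^eps * Pr[M(D') in S].
    """
    rng = rng or np.random.default_rng()
    true_count = sum(1 for v in values if predicate(v))
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Hypothetical usage: count records with income above 50,000 under epsilon = 0.5.
incomes = np.random.default_rng(0).normal(45_000, 15_000, size=10_000)
noisy_count = dp_count(incomes, lambda x: x > 50_000, epsilon=0.5)
```

Smaller ε gives stronger privacy but noisier answers, which is the privacy-budget trade-off noted above.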
Federated Learning
Training models across decentralized data sources without centralizing raw data.
Trade-off: Communication overhead, model convergence challenges
Implementation: TensorFlow Federated, PySyft, NVIDIA FLARE
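A schematic federated-averaging round in plain NumPy, ignoring the communication, scheduling, and security layers the listed frameworks provide. The client data, model, and learning rate are illustrative:

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local training: a few gradient steps on a linear model."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # least-squares gradient on local data
        w -= lr * grad
    return w

def federated_average(client_weights, client_sizes):
    """Server aggregation: average client models weighted by local dataset size."""
    sizes = np.asarray(client_sizes, dtype=float)
    return np.average(np.stack(client_weights), axis=0, weights=sizes)

# Hypothetical round with three clients; raw data never leaves each client.
rng = np.random.default_rng(1)
global_w = np.zeros(4)
clients = [(rng.normal(size=(100, 4)), rng.normal(size=100)) for _ in range(3)]
updates = [local_update(global_w, X, y) for X, y in clients]
global_w = federated_average(updates, [len(y) for _, y in clients])
```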
Synthetic Data Generation
Creating artificial datasets that preserve statistical properties without containing real individual records.
Trade-off: Fidelity to real-world distributions, potential for memorization
Implementation: Synthetic Data Vault, Gretel.ai, MOSTLY AI
K-Anonymity & L-Diversity
Ensuring each record is indistinguishable from at least k-1 other records on quasi-identifiers; l-diversity additionally requires each such group to contain at least l distinct values of the sensitive attribute.
Trade-off: Data utility reduction, vulnerable to background knowledge attacks
Implementation: ARX Data Anonymization Tool, sdcMicro
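A short pandas sketch that checks whether a prepared release satisfies k-anonymity on a set of quasi-identifiers. The file path, column names, and the threshold k = 5 are assumptions for illustration:

```python
import pandas as pd

def k_anonymity(df, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifier columns.

    The release is k-anonymous if every combination of quasi-identifier
    values is shared by at least k records.
    """
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical usage on a generalized release (age bucketed, ZIP code truncated).
release = pd.read_csv("generalized_release.csv")            # placeholder path
k = k_anonymity(release, ["age_band", "zip3", "gender"])    # illustrative columns
if k < 5:                                                    # assumed policy threshold
    print(f"Release is only {k}-anonymous; further generalization needed.")
```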
Secure Multi-Party Computation
Cryptographic protocols enabling computation on combined data without revealing individual inputs.
Trade-off: Computational overhead, protocol complexity
Implementation: MP-SPDZ, CrypTen, TF Encrypted
Homomorphic Encryption
Enabling computation on encrypted data without decryption.
Trade-off: Significant computational overhead, limited operations
Implementation: Microsoft SEAL, OpenFHE, Concrete ML
Privacy Technique Selection Guide
| Scenario | Recommended Technique | Key Considerations |
|---|---|---|
| Training on health records | Differential Privacy + Federated Learning | Strong privacy guarantees required; data cannot leave institutions |
| Sharing data for research | Synthetic Data Generation | Balance utility with privacy; validate statistical fidelity |
| Cross-organization model training | Federated Learning | Data sovereignty requirements; model aggregation security |
| Inference on sensitive inputs | Homomorphic Encryption or Secure MPC | Performance overhead acceptable; strong security needed |
| Regulatory dataset release | K-Anonymity + Differential Privacy | Formal privacy guarantees for compliance demonstration |
4.2.4 Copyright & IP Clearance for Training Corpora
The legal landscape for AI training data is rapidly evolving, with ongoing litigation and regulatory developments creating significant uncertainty. Organizations must implement robust IP clearance processes to manage legal risk.
Emerging Legal Landscape
Major lawsuits regarding AI training on copyrighted content are proceeding through courts. The EU AI Act requires transparency about copyrighted training content. Organizations should treat this as a high-risk area requiring legal review.
Copyright Risk Categories
High Risk: Third-Party Content
- Web-scraped content without explicit license
- Published books, articles, and creative works
- Images from stock photo sites or social media
- Code from public repositories without permissive licenses
- Music, video, and audio content
Medium Risk: Licensed Content with Restrictions
- Data licensed for "research" but used commercially
- Open source code with copyleft licenses (GPL)
- Creative Commons content with NC or ND restrictions
- APIs with terms limiting AI training use
Lower Risk: Cleared Content
- Content with explicit AI training licenses
- Truly public domain works
- Internally generated enterprise data
- User-generated content with appropriate ToS
- Permissively licensed datasets (e.g., Apache, MIT)
IP Clearance Process
Data Source Inventory
Create comprehensive inventory of all training data sources:
- Source identification and URL/location
- Content type and format
- Volume of content from each source
- Collection method and date
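One way to make the inventory machine-checkable is a small record type whose fields mirror the bullets above; the names and sample entry are illustrative only:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DataSourceRecord:
    """Inventory entry for one training-data source (fields are illustrative)."""
    source_id: str           # internal identifier
    location: str            # URL or storage path
    content_type: str        # e.g., "text", "images", "code"
    volume: str              # e.g., "2.1M documents"
    collection_method: str   # e.g., "vendor delivery", "web crawl"
    collection_date: date
    license_status: str      # e.g., "explicit AI-training license", "unknown"

sources = [
    DataSourceRecord("src-001", "s3://corpus/wiki-snapshot", "text",
                     "6.5M articles", "bulk download", date(2024, 3, 12),
                     "CC BY-SA 4.0"),
]
```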
License Analysis
Document licensing status for each source:
- Explicit license terms (if any)
- Terms of service restrictions
- robots.txt and crawling policies (see the sketch after this list)
- Opt-out mechanisms honored
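For the robots.txt point above, a minimal check using Python's standard-library parser. The user agent string and URLs are placeholders, and this covers only crawling policy, not copyright or licensing status:

```python
from urllib import robotparser

def crawl_allowed(site_root, path, user_agent="ExampleTrainingBot"):
    """Check whether robots.txt at site_root permits fetching path."""
    rp = robotparser.RobotFileParser()
    rp.set_url(site_root.rstrip("/") + "/robots.txt")
    rp.read()                         # fetches and parses robots.txt
    return rp.can_fetch(user_agent, path)

# Hypothetical usage during license analysis of a scraped source.
if not crawl_allowed("https://example.com", "/articles/some-post"):
    print("Source disallows crawling; flag for removal or licensing review.")
```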
Risk Assessment
Evaluate legal exposure for each category:
- Copyright infringement likelihood
- Fair use/fair dealing arguments
- Jurisdictional variations
- Litigation risk from specific rights holders
Mitigation & Documentation
Implement risk reduction measures:
- Remove or replace high-risk content
- Obtain explicit licenses where feasible
- Document fair use rationale
- Implement opt-out mechanisms
- Prepare for regulatory disclosure requirements
EU AI Act Transparency Requirements
Under the EU AI Act, providers of general-purpose AI models must:
- Provide a sufficiently detailed summary of the content used for training
- Comply with EU copyright law, including the text and data mining directive
- Respect rights holders' opt-out of training use
- Maintain records demonstrating compliance with copyright obligations
Implementation Guide
Data Curation Phase Deliverables
Tooling Recommendations
| Capability | Open Source Options | Commercial Options |
|---|---|---|
| Data Cataloging & Lineage | Apache Atlas, DataHub, Amundsen | Collibra, Alation, Informatica |
| Bias Detection | Fairlearn, AI Fairness 360, Aequitas | Fiddler AI, Arthur AI |
| Privacy Techniques | TensorFlow Privacy, PySyft, OpenDP | Gretel.ai, MOSTLY AI, Privitar |
| Data Quality | Great Expectations, Deequ, Soda | Talend, Informatica DQ, Monte Carlo |