4.3 Phase 3: Model Development & Training

Documentation standards, sustainability considerations, and reproducibility requirements for responsible model development and training processes.

Lifecycle phases: 1 Ideation → 2 Data → 3 Development (this section) → 4 Testing → 5 Deployment → 6 Monitoring

Key Takeaways

  • EU AI Act Article 11 requires technical documentation for high-risk AI systems; Model Cards are the de facto standard way to provide it
  • Training a single large language model can emit over 300 tons of CO2 equivalent
  • Reproducibility documentation is essential for audit trails and compliance
  • System Cards extend Model Cards to document complete AI system behavior

4.3.1 Model Cards & System Cards (Documentation Standards)

Model Cards, introduced by Google researchers in 2018, have become the industry standard for AI model documentation. Article 11 of the EU AI Act requires comprehensive technical documentation for high-risk AI systems, and Model Card-style documentation is a natural building block for meeting that requirement.

Model Card vs. System Card

| Aspect | Model Card | System Card |
| Scope | Single ML model | Complete AI system (models + infrastructure + guardrails) |
| Focus | Model behavior and limitations | End-to-end system behavior and safety measures |
| When used | Individual model releases | Product/application deployment |
| Primary audience | ML engineers, researchers | Product teams, regulators, users |

Model Card Template

1. Model Details

  • Name & version: [Unique identifier, version number]
  • Model type: [Architecture: CNN, Transformer, XGBoost, etc.]
  • Developed by: [Team/organization responsible]
  • Release dates: [Initial release and updates]
  • License: [Usage license terms]
  • Contact: [Point of contact for questions]

2. Intended Use

  • Primary uses: [Specific applications the model is designed for]
  • Intended users: [Who should use this model]
  • Out-of-scope uses: [Applications explicitly NOT supported]

3. Training Data

  • Datasets: [Names and descriptions of training datasets]
  • Collection & preprocessing: [How data was collected and preprocessed]
  • Known limitations: [Known gaps, biases, or limitations]

4. Evaluation Data

  • Datasets: [Datasets used for evaluation]
  • Selection rationale: [Why these datasets were chosen]

5. Performance Metrics

  • Metrics reported: [Accuracy, F1, AUC, etc.]
  • Disaggregated results: [Performance broken down by demographic groups]
  • Uncertainty: [Uncertainty bounds on metrics]

6. Fairness Considerations

  • Groups analyzed: [Which protected groups were analyzed]
  • Fairness metrics: [Specific fairness measures and results]
  • Known disparities: [Documented performance gaps across groups]

7. Ethical Considerations

  • Risks: [Identified risks from model use]
  • Mitigations: [Steps taken to address risks]
  • Open issues: [Outstanding ethical issues]

8. Caveats & Recommendations

  • Limitations: [Technical and practical limitations]
  • Recommended practices: [Best practices for responsible use]
  • Human oversight: [When human review is needed]
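
In practice, the template above is easiest to maintain as structured data stored and versioned alongside the model artifact. Below is a minimal Python sketch; every field value, name, and path is illustrative rather than taken from any real system:

```python
import json
from pathlib import Path

# Illustrative model card following the template sections above.
# All values are placeholders for a hypothetical model.
model_card = {
    "model_details": {
        "name": "credit-risk-classifier",          # hypothetical model name
        "version": "2.3.1",
        "model_type": "XGBoost",
        "developed_by": "Risk Analytics Team",
        "release_date": "2025-01-15",
        "license": "Proprietary, internal use only",
        "contact": "ml-governance@example.com",
    },
    "intended_use": {
        "primary_uses": ["Pre-screening of consumer credit applications"],
        "intended_users": ["Credit analysts with domain training"],
        "out_of_scope": ["Fully automated decisions without human review"],
    },
    "training_data": {
        "datasets": ["internal-applications-2020-2024"],
        "known_limitations": ["Underrepresents applicants under 25"],
    },
    "performance_metrics": {"auc": 0.87, "f1": 0.74},  # placeholder results
    "caveats": ["Requires human review for borderline scores"],
}

def write_model_card(card: dict, path: str) -> None:
    """Persist the card next to the model artifact so both are versioned together."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(card, indent=2))

write_model_card(model_card, "artifacts/model_card_v2.3.1.json")
```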

System Card Components

System Cards extend Model Cards to document complete AI systems, including safety measures and deployment context:

System Architecture

  • Component models and their interactions
  • Data flows and dependencies
  • Integration with other systems
  • Deployment infrastructure

Safety Measures

  • Input/output filters and guardrails
  • Content moderation approaches
  • Rate limiting and abuse prevention
  • Fallback mechanisms

Human Oversight

  • Human-in-the-loop decision points
  • Escalation procedures
  • Override capabilities
  • Monitoring dashboards

Testing & Red Teaming

  • Safety testing methodologies
  • Red team findings and mitigations
  • Adversarial testing results
  • Edge case handling

EU AI Act Documentation Requirements

Article 11 requires providers of high-risk AI systems to draw up and maintain technical documentation covering: a general description of the system, design specifications, a description of the development process, risk management measures, changes made throughout the lifecycle, performance metrics, and interactions with other systems.

4.3.2 Energy Consumption & Sustainability Reporting (Green AI)

AI training's environmental impact has become a significant concern: training a single large model can consume as much electricity as hundreds of homes use in a year. Organizations must measure, report, and minimize the carbon footprint of their AI development activities.

Environmental Impact Context

  • 300+ tons of CO2 equivalent: training a single large language model (estimates vary widely)
  • 1,287 MWh of electricity: estimated energy consumption for training GPT-3
  • ~5x the lifetime emissions of a car: CO2 equivalent of training one large model
  • 10-100x: total inference energy over a model's life often exceeds training energy

Carbon Footprint Measurement

| Component | Measurement approach | Tools |
| Training compute | GPU-hours × power consumption × carbon intensity | CodeCarbon, ML CO2 Impact, Carbontracker |
| Data center energy | PUE × IT equipment energy × grid carbon intensity | Cloud provider sustainability dashboards |
| Data transfer | Data volume × energy per byte transferred | Network monitoring tools |
| Inference | Queries × energy per inference × carbon intensity | APM tools with energy monitoring |
| Embodied hardware | Manufacturing emissions amortized over hardware life | Hardware lifecycle assessments |
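
The first row of the table reduces to simple arithmetic once GPU-hours, average board power, data-center PUE, and grid carbon intensity are known. A rough sketch (all numbers below are placeholders, not measurements):

```python
def estimate_training_emissions_kg(
    gpu_hours: float,
    avg_gpu_power_kw: float,
    pue: float,
    grid_intensity_g_per_kwh: float,
) -> float:
    """GPU-hours x power x PUE gives facility energy in kWh; grid intensity converts it to kg CO2e."""
    energy_kwh = gpu_hours * avg_gpu_power_kw * pue
    return energy_kwh * grid_intensity_g_per_kwh / 1000.0  # grams -> kilograms

# Placeholder figures: 2,000 GPU-hours at ~0.35 kW per GPU, PUE 1.2, 400 gCO2e/kWh grid.
print(f"{estimate_training_emissions_kg(2000, 0.35, 1.2, 400):.1f} kg CO2e")  # -> 336.0 kg CO2e
```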

Sustainability Strategies

Efficient Architecture Selection

  • Choose smaller models when accuracy permits
  • Use distillation to compress large models
  • Apply pruning and quantization techniques
  • Consider sparse architectures
Potential reduction: 10-100x compute

Transfer Learning & Fine-tuning

  • Start from pre-trained models
  • Fine-tune only necessary layers
  • Use parameter-efficient fine-tuning (LoRA, adapters); a sketch follows these strategy cards
  • Leverage foundation models where appropriate
Potential reduction: 100-1000x compute vs. training from scratch

Geographic & Temporal Optimization

  • Train in regions with cleaner energy grids
  • Schedule training during low-carbon periods
  • Use cloud providers with renewable commitments
  • Consider on-premise renewable-powered facilities
Potential reduction: 30-50% carbon intensity

Efficient Experimentation

  • Use smaller proxy datasets for hyperparameter search
  • Implement early stopping for unpromising runs
  • Share and reuse experiment results
  • Document negative results to prevent duplication
Potential reduction: 10x fewer training runs
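
As an illustration of the parameter-efficient fine-tuning mentioned in the transfer-learning strategy, here is a minimal sketch using the Hugging Face transformers and peft libraries; the checkpoint name and LoRA settings are illustrative choices, not recommendations:

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

# Illustrative base checkpoint; any sequence-classification model works similarly.
base_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# LoRA trains small low-rank adapter matrices instead of updating the full weight set.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,             # adapter rank (illustrative)
    lora_alpha=16,
    lora_dropout=0.1,
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

Because only the adapters are trained, the same base model can be reused across tasks, which is where most of the compute savings come from.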

Sustainability Reporting Template

Model Sustainability Report

Training Phase
  • Compute: [GPU-hours]
  • Hardware: [GPU type, quantity]
  • Energy consumed: [kWh]
  • Emissions: [kg CO2e]
  • Grid carbon intensity: [gCO2/kWh]
  • Region: [Training location/region]

Efficiency Measures
  • Comparison to baseline: [Energy vs. standard approach]
  • Optimization techniques: [List techniques applied]
  • Carbon offsets or renewable energy: [If applicable]
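
One way to populate the energy and emissions fields above is to wrap the training run in a tracker such as CodeCarbon, which estimates energy use and emissions from hardware counters and grid data. A minimal sketch (the training function and project name are placeholders):

```python
from codecarbon import EmissionsTracker

def train_model():
    """Stand-in for the real training loop."""
    ...

tracker = EmissionsTracker(project_name="credit-risk-classifier")  # illustrative name
tracker.start()
try:
    train_model()
finally:
    emissions_kg = tracker.stop()  # estimated kg CO2e; CodeCarbon also writes emissions.csv

print(f"Estimated training emissions: {emissions_kg:.3f} kg CO2e")
```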

EU AI Act Sustainability Requirements

The EU AI Act requires providers of general-purpose AI models to document energy consumption during training, known or estimated energy consumption during use, and other resource usage. Organizations should begin tracking these metrics proactively.

4.3.3 Hyperparameter Tuning & Reproducibility Logs

Reproducibility is foundational to scientific validity, regulatory compliance, and operational reliability. Complete documentation of the training process enables audit, debugging, and continuous improvement.

Reproducibility Requirements

Code Reproducibility

  • Complete source code under version control
  • Dependency specifications (exact versions)
  • Container definitions (Docker/Singularity)
  • Build and execution scripts
Tools: Git, Docker, conda, pip freeze, Poetry

Data Reproducibility

  • Data versioning and snapshot identification
  • Preprocessing pipeline code
  • Train/validation/test split specifications
  • Data augmentation procedures
Tools: DVC, Delta Lake, lakeFS, Pachyderm

Training Reproducibility

  • Random seeds and initialization (see the seeding sketch after these lists)
  • All hyperparameter values
  • Hardware specifications
  • Training checkpoints
Tools: MLflow, Weights & Biases, Neptune, Comet

Environment Reproducibility

  • Operating system and version
  • CUDA/cuDNN versions
  • Hardware driver versions
  • Cloud instance specifications
Tools: Docker, Kubernetes, Terraform, cloud configs
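
A minimal sketch covering the random-seed and environment items above, assuming a PyTorch-based stack purely as an example; other frameworks need the equivalent calls:

```python
import os
import random
import platform

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Seed every RNG the training run touches so reruns are comparable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade some speed for determinism in cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    os.environ["PYTHONHASHSEED"] = str(seed)

def capture_environment() -> dict:
    """Snapshot the facts needed to reproduce (or explain) this run later."""
    return {
        "python": platform.python_version(),
        "os": platform.platform(),
        "torch": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
        "seed": 42,
    }

set_seed(42)
environment = capture_environment()  # log this alongside the run metadata
```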

Experiment Tracking Schema

Every training run should log the following metadata:

Identification (format: UUID, ISO 8601 timestamp)
  • Experiment ID
  • Run ID
  • Parent experiment (if applicable)
  • Timestamp
  • User/team

Code Reference (format: Git commit SHA, URL)
  • Git commit hash
  • Branch name
  • Repository URL
  • Diff from HEAD (if uncommitted changes)

Data Reference (format: content hash, JSON config)
  • Dataset version/hash
  • Split ratios
  • Preprocessing config
  • Feature list

Hyperparameters (format: JSON/YAML config)
  • Learning rate (schedule)
  • Batch size
  • Epochs/steps
  • Optimizer settings
  • Architecture parameters
  • Regularization settings

Environment (format: system info capture)
  • Hardware (GPU type, count)
  • Framework versions
  • Random seeds
  • Container image

Metrics (format: time series and final values)
  • Training loss (per step)
  • Validation metrics (per epoch)
  • Test metrics (final)
  • Custom metrics

Artifacts (format: file references with content hashes)
  • Model checkpoints
  • Evaluation outputs
  • Visualizations
  • Logs
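
The schema above maps directly onto most experiment trackers. A sketch using MLflow, with placeholder names, parameters, and metric values:

```python
import mlflow

mlflow.set_experiment("credit-risk-classifier")  # illustrative experiment name

with mlflow.start_run(run_name="xgboost-baseline") as run:
    # Code and data references; identification fields come from the run itself.
    mlflow.set_tags({
        "git_commit": "abc1234",               # placeholder; read from `git rev-parse HEAD`
        "dataset_version": "applications-v5",  # placeholder data snapshot id
    })

    # Hyperparameters.
    mlflow.log_params({"learning_rate": 0.05, "max_depth": 6, "n_estimators": 400})

    # Metrics: per-step training curve plus final values (placeholder numbers).
    for step, loss in enumerate([0.69, 0.55, 0.48]):
        mlflow.log_metric("train_loss", loss, step=step)
    mlflow.log_metric("val_auc", 0.87)

    # Artifacts: model checkpoints, evaluation outputs, plots, e.g.
    # mlflow.log_artifact("artifacts/model_card_v2.3.1.json")

    print("Run ID:", run.info.run_id)
```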

Hyperparameter Documentation

Hyperparameter Registry

For each hyperparameter, document:

  • Name: Parameter identifier
  • Value: Chosen value for production model
  • Search Range: Range explored during tuning
  • Selection Method: How value was chosen (grid search, Bayesian optimization, manual)
  • Sensitivity: How much performance varies with this parameter
  • Rationale: Why this value was selected
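
A registry entry like this can live as structured data next to the training configuration. A minimal sketch with illustrative values:

```python
from dataclasses import dataclass, asdict

@dataclass
class HyperparameterRecord:
    """One documented hyperparameter, mirroring the registry fields above."""
    name: str
    value: float
    search_range: tuple
    selection_method: str
    sensitivity: str
    rationale: str

learning_rate = HyperparameterRecord(
    name="learning_rate",
    value=0.05,                                   # illustrative production value
    search_range=(0.001, 0.3),
    selection_method="Bayesian optimization, 50 trials",
    sensitivity="High: validation AUC drops sharply above 0.1",
    rationale="Best mean validation AUC across 3 seeds",
)

registry = {rec.name: asdict(rec) for rec in [learning_rate]}
```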

Version Control Best Practices

Model Versioning

Use semantic versioning for model releases:

  • MAJOR: Breaking changes to input/output format
  • MINOR: Performance improvements, new capabilities
  • PATCH: Bug fixes, minor adjustments
Example: model-name-v2.3.1

Immutable Artifacts

Treat trained models as immutable artifacts:

  • Never modify a released model in place
  • Store models with content-addressable hashes
  • Maintain complete lineage from data to deployment
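
A small sketch of the content-addressable idea: derive the artifact identifier from the model file's bytes so any change produces a new identifier (the file written here is only a stand-in):

```python
import hashlib
from pathlib import Path

def content_hash(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of the file contents; identical artifacts always get identical ids."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Stand-in artifact so the sketch runs end to end; a real run hashes the serialized model.
Path("artifacts").mkdir(exist_ok=True)
model_path = "artifacts/model.pkl"
Path(model_path).write_bytes(b"placeholder model bytes")

artifact_id = content_hash(model_path)
print(f"store as models/sha256/{artifact_id}")  # immutable, content-addressed location
```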

Audit Trail

Maintain complete audit trail:

  • Who trained the model
  • When it was trained
  • What data was used
  • What changes were made between versions
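
These audit fields can be captured as one immutable record per training or release event; a minimal sketch with illustrative values:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    """One append-only entry per training or release event."""
    model_version: str
    trained_by: str
    trained_at: str
    dataset_version: str
    changes_from_previous: list = field(default_factory=list)

record = AuditRecord(
    model_version="credit-risk-classifier-v2.3.1",   # illustrative
    trained_by="j.doe@example.com",
    trained_at=datetime.now(timezone.utc).isoformat(),
    dataset_version="applications-v5",
    changes_from_previous=["Retrained on Q4 data", "Lowered learning_rate to 0.05"],
)
```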

Implementation Guide

Model Development Phase Deliverables

MLOps Tooling Stack

| Capability | Open source | Commercial |
| Experiment tracking | MLflow, Aim, Sacred | Weights & Biases, Neptune, Comet |
| Model registry | MLflow Model Registry, BentoML | AWS SageMaker, Azure ML, Databricks |
| Data versioning | DVC, lakeFS, Pachyderm | Delta Lake, Databricks Unity Catalog |
| Carbon tracking | CodeCarbon, Carbontracker | Cloud provider sustainability tools |
| Pipeline orchestration | Kubeflow, Airflow, Prefect | AWS Step Functions, Azure ML Pipelines |