4.3 Deployment & Monitoring

Deployment is not the finish line—it's the starting point of the most important phase. This is where the Cradle-to-Grave commitment becomes real: the pod that built the model now operates it, experiencing every production quirk, edge case, and 3 AM alert. Robust deployment and continuous monitoring ensure the AI product delivers value safely over its operational lifetime.

The Operations Mindset

In the AI Innovation model, there is no "throw it over the wall" to an operations team. The builders are the operators. This creates powerful incentives: if you build something brittle, you're the one who gets paged. The result is more robust systems, better documentation, and faster incident resolution.

Deployment Strategies

Progressive Rollout Options

Choose a deployment strategy based on risk and rollback requirements:

Shadow Deployment

New model runs in parallel, receiving real traffic but not affecting decisions. Outputs compared to production model.

Use when: High-risk, need confidence before any exposure
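
As a concrete illustration, a minimal request-path sketch, assuming `prod_model` and `shadow_model` are hypothetical objects exposing a `predict` method; the shadow call runs off the critical path so it can never alter or slow the user-facing response:

```python
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("shadow")
pool = ThreadPoolExecutor(max_workers=4)  # keeps shadow scoring off the request thread

def serve(request, prod_model, shadow_model):
    """Return the production prediction; score the shadow model asynchronously."""
    prod_out = prod_model.predict(request)

    def score_shadow():
        try:
            shadow_out = shadow_model.predict(request)
            # Log both outputs for offline agreement analysis.
            log.info("shadow_compare prod=%s shadow=%s", prod_out, shadow_out)
        except Exception:
            # A shadow failure is a data point, never a user-facing error.
            log.exception("shadow model failed")

    pool.submit(score_shadow)
    return prod_out  # callers only ever see the production output
```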

Canary Deployment

New model serves a small percentage of traffic (1-5%). Gradually increase the share if metrics are healthy.

Use when: Want real-world validation with limited blast radius
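
Deterministic hashing is one common way to implement the split: a given user lands in a stable cohort, so raising the percentage only adds users to the canary and nobody flaps between models. A sketch (the function and starting percentage are illustrative):

```python
import hashlib

CANARY_PERCENT = 5  # raise in steps (5 -> 25 -> 50 -> 100) per the rollout plan

def bucket(user_id: str) -> int:
    """Map a user to a stable 0-99 bucket, independent of process or host."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100

def route(user_id: str) -> str:
    return "canary" if bucket(user_id) < CANARY_PERCENT else "production"

assert route("user-42") in {"canary", "production"}
```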

Blue-Green Deployment

Two identical environments; switch traffic from old (blue) to new (green). Instant rollback by switching back.

Use when: Need fast rollback capability
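
A toy sketch of the core mechanism, a single atomic pointer to whichever environment is live, with hypothetical environment URLs; promotion and rollback are the same operation:

```python
ENVIRONMENTS = {
    "blue": "http://model-blue.internal/predict",    # current production
    "green": "http://model-green.internal/predict",  # new model, fully provisioned
}
live = "blue"

def cut_over(target: str) -> None:
    """Repoint all traffic in one step; rollback is the same call with the old color."""
    global live
    if target not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {target}")
    live = target

cut_over("green")  # promote the new model
cut_over("blue")   # instant rollback if metrics regress
```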

Feature Flags

New model behind feature flag, enabled for specific users or segments. Fine-grained control.

Use when: Want targeted rollout to specific users
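
A minimal sketch of flag-gated routing, assuming an in-memory flag store (production systems would typically use a flag service) and hypothetical segment and model names:

```python
from dataclasses import dataclass

@dataclass
class User:
    id: str
    segment: str

# Flag -> segments it is enabled for (hypothetical names).
FLAGS = {"new_model_enabled": {"internal_users", "beta_cohort"}}

def model_for(user: User) -> str:
    """Serve the new model only to flagged segments; everyone else stays on v1."""
    if user.segment in FLAGS["new_model_enabled"]:
        return "model_v2"
    return "model_v1"

assert model_for(User("u1", "beta_cohort")) == "model_v2"
assert model_for(User("u2", "general")) == "model_v1"
```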

Deployment Execution

Pre-Deployment

Final Checks

Run production readiness checklist. Verify all sign-offs obtained. Confirm monitoring and alerting active. Brief on-call team on new deployment.

Initial Deployment

Limited Exposure

Deploy to shadow or small canary. Monitor closely for anomalies. Compare outputs to baseline. Validate infrastructure behavior under load.

Expansion

Graduated Rollout

If initial metrics are healthy, expand the traffic percentage per the deployment plan. Continue monitoring at each stage. Pause or roll back if issues emerge.

Full Deployment

100% Traffic

Complete rollout to all traffic. Maintain elevated monitoring for 24-48 hours. Document any issues encountered and resolutions.

Stabilization

Post-Deployment Validation

After 1-2 weeks, conduct a formal post-deployment review. Validate that business metrics are trending as expected. Update the Model Card with production metrics.

Continuous Monitoring

Monitoring Layers

Effective AI monitoring operates at multiple layers:

| Layer | What to Monitor | Example Metrics |
| --- | --- | --- |
| Infrastructure | System health and resources | CPU, memory, disk, network, availability |
| Application | Service behavior | Latency, throughput, error rates, queue depth |
| Model | ML-specific performance | Prediction distribution, confidence scores, feature values |
| Data | Input data health | Schema compliance, missing values, distribution shifts |
| Business | Outcome metrics | Conversion rates, user satisfaction, business KPIs |
| Fairness | Equitable treatment | Subgroup performance, disparity metrics |
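
As one concrete instantiation, several of these layers map directly onto standard metrics primitives. A sketch assuming a Prometheus-based stack, with illustrative metric names:

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Application layer: latency and errors.
LATENCY = Histogram("inference_latency_seconds", "End-to-end inference latency")
ERRORS = Counter("inference_errors_total", "Failed inference requests")

# Model layer: output distribution by predicted class and version.
PREDICTIONS = Counter("predictions_total", "Predictions served",
                      ["model_version", "predicted_class"])

# Data layer: input health, e.g. null rate per feature.
NULL_FRACTION = Gauge("feature_null_fraction", "Fraction of null values", ["feature"])

start_http_server(9100)  # expose /metrics for scraping

with LATENCY.time():  # records the duration of the block
    PREDICTIONS.labels(model_version="v2", predicted_class="approve").inc()
NULL_FRACTION.labels(feature="income").set(0.012)
```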

Drift Detection

AI models degrade over time as the world changes. Monitor for multiple drift types:

Data Drift

Input feature distributions change from training data. May indicate changing user behavior or data pipeline issues.

Detection: Statistical tests (KS, PSI, chi-squared) on feature distributions
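
A sketch of two of these tests on a single numeric feature, comparing a training-time snapshot against recent production values; the thresholds in the comments are common rules of thumb, not fixed standards:

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference and a current sample."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Floor empty bins to avoid division by zero and log(0).
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

reference = np.random.default_rng(0).normal(0.0, 1.0, 10_000)  # training snapshot
current = np.random.default_rng(1).normal(0.3, 1.0, 5_000)     # recent production

stat, p_value = ks_2samp(reference, current)
print(f"KS p-value: {p_value:.3g}, PSI: {psi(reference, current):.3f}")
# Rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
```

The same PSI computation applied to the model's output distribution doubles as a simple monitor for the prediction drift described below.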

Concept Drift

Relationship between features and target changes. The world has changed in ways the model doesn't reflect.

Detection: Performance degradation on recent labeled data
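
In code, this amounts to re-scoring the model on a recently labeled window and comparing against the validation metric recorded at training time. A sketch with an illustrative tolerance:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def concept_drift_flag(y_recent: np.ndarray, scores_recent: np.ndarray,
                       baseline_auc: float, tolerance: float = 0.03) -> bool:
    """True when AUC on recent labels drops more than `tolerance` below the
    training-time baseline. `baseline_auc` would come from the Model Card
    or training-run metadata."""
    recent_auc = roc_auc_score(y_recent, scores_recent)
    return baseline_auc - recent_auc > tolerance
```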

Prediction Drift

Model output distribution shifts. May be caused by data drift or model issues.

Detection: Monitor prediction distribution over time

Label Drift

Ground truth label distribution changes. Business context or definitions may have shifted.

Detection: Monitor label rates when ground truth available

Alerting Strategy

Configure alerts at appropriate thresholds:

| Severity | Criteria | Response |
| --- | --- | --- |
| SEV-1 | Model producing harmful outputs, system down, major accuracy degradation | Immediate page, consider kill switch, escalate to STO |
| SEV-2 | Significant performance degradation, fairness threshold breach, high error rate | Page within 15 min, investigate immediately |
| SEV-3 | Minor performance issues, drift warnings, elevated latency | Notify during business hours, investigate same day |
| SEV-4 | Informational alerts, approaching thresholds, minor anomalies | Log for review, include in weekly monitoring |
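
A sketch of how a monitoring job might map one signal, accuracy drop against baseline, onto these severities; the cut-offs are hypothetical and belong in per-model configuration:

```python
from enum import IntEnum

class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4

def classify_accuracy_drop(baseline: float, current: float) -> Severity:
    """Map an accuracy drop onto an alert severity (illustrative thresholds)."""
    drop = baseline - current
    if drop >= 0.10:
        return Severity.SEV1  # major degradation: page immediately
    if drop >= 0.05:
        return Severity.SEV2  # page within 15 minutes
    if drop >= 0.02:
        return Severity.SEV3  # investigate same day
    return Severity.SEV4      # log and review weekly

assert classify_accuracy_drop(0.93, 0.80) is Severity.SEV1
```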

Retraining Triggers

When to Retrain

Model retraining can be scheduled or triggered by events:

Scheduled Retraining

Regular retraining on fresh data, regardless of drift signals. Frequency depends on domain:

  • Fast-changing domains (fraud, recommendations): Weekly or monthly
  • Moderate domains (credit risk, pricing): Monthly or quarterly
  • Stable domains (document classification): Quarterly or annually

Triggered Retraining

Event-driven retraining when conditions warrant (a minimal trigger check is sketched after the list below):

  • Drift detection exceeds threshold
  • Performance metrics fall below SLA
  • Fairness metrics breach acceptable range
  • Significant new data becomes available
  • Business rules or requirements change
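
A minimal sketch of such a trigger check, run on a schedule; every field name and threshold here is illustrative and would come from the model's SLA and fairness policy:

```python
def retraining_triggers(m: dict) -> list[str]:
    """Return the trigger conditions that currently fire (empty list = healthy)."""
    fired = []
    if m["psi"] > 0.25:
        fired.append("data drift above threshold")
    if m["accuracy"] < m["sla_accuracy"]:
        fired.append("performance below SLA")
    if m["subgroup_gap"] > 0.05:
        fired.append("fairness metric out of range")
    if m["new_labeled_rows"] > 50_000:
        fired.append("significant new data available")
    return fired

snapshot = {"psi": 0.31, "accuracy": 0.94, "sla_accuracy": 0.93,
            "subgroup_gap": 0.02, "new_labeled_rows": 12_000}
for reason in retraining_triggers(snapshot):
    print("retraining triggered:", reason)  # here: data drift above threshold
```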

Retraining Process

Retraining follows a lighter version of the development process:

  1. Data Refresh: Incorporate recent data while maintaining historical coverage
  2. Training: Run standard training pipeline with version control
  3. Validation: Execute full test suite including fairness tests
  4. Comparison: Benchmark the new model against the production model (a gating sketch follows this list)
  5. Approval: Lightweight sign-off (STO for routine, full review for significant changes)
  6. Deployment: Standard progressive rollout
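
The comparison step benefits from an explicit, automated gate. A champion/challenger sketch with hypothetical metric names, assuming lower values are better for the guardrail metrics:

```python
def promote_challenger(champion: dict, challenger: dict,
                       primary: str = "auc", min_gain: float = 0.0,
                       guardrails: tuple = ("fairness_gap", "p95_latency_ms")) -> bool:
    """Promote only if the challenger beats the champion on the primary metric
    without regressing any guardrail (lower is better for guardrails)."""
    if challenger[primary] < champion[primary] + min_gain:
        return False
    return all(challenger[g] <= champion[g] for g in guardrails)

champion = {"auc": 0.91, "fairness_gap": 0.03, "p95_latency_ms": 120}
challenger = {"auc": 0.93, "fairness_gap": 0.02, "p95_latency_ms": 118}
assert promote_challenger(champion, challenger)
```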

Steady-State Operations

On-Call Rotation

The pod maintains 24/7 on-call coverage for production issues.

Incident Response

When issues occur, follow the incident response framework (detailed in Section 5.3):

  1. Detect & Alert: Automated monitoring triggers an alert. On-call acknowledges within SLA.
  2. Assess & Classify: Determine severity, impact, and blast radius. Escalate if needed.
  3. Mitigate: Take immediate action to reduce impact (rollback, kill switch, manual override).
  4. Resolve: Fix the root cause and restore normal operations.
  5. Learn: Conduct a blameless post-mortem. Update runbooks. Implement preventive measures.

Regular Operational Reviews

| Review | Frequency | Focus |
| --- | --- | --- |
| Daily Check | Daily | Quick dashboard review, overnight alerts, any anomalies |
| Weekly Review | Weekly | Metrics trends, incident summary, upcoming changes |
| Monthly Ops Review | Monthly | SLA performance, capacity planning, technical debt |
| Quarterly Business Review | Quarterly | OKR progress, business value delivery, roadmap |
| Model Card Review | Monthly+ | Documentation currency, risk assessment, compliance |