4.3 Deployment & Monitoring
Deployment is not the finish line; it's the starting point of the most important phase. This is where the Cradle-to-Grave commitment becomes real: the pod that built the model now operates it, experiencing every production quirk, edge case, and 3 AM alert. Robust deployment and continuous monitoring ensure the AI product delivers value safely over its operational lifetime.
In the AI Innovation model, there is no "throw it over the wall" to an operations team. The builders are the operators. This creates powerful incentives: if you build something brittle, you're the one who gets paged. The result is more robust systems, better documentation, and faster incident resolution.
Deployment Strategies
Progressive Rollout Options
Choose a deployment strategy based on risk and rollback requirements:
Shadow Deployment
New model runs in parallel, receiving real traffic but not affecting decisions. Outputs compared to production model.
Use when: High-risk, need confidence before any exposure
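A minimal sketch of how a shadow comparison might be wired, assuming a request handler that already has both models loaded; `prod_model`, `shadow_model`, and the logging setup are illustrative placeholders rather than a prescribed stack:

```python
import logging

logger = logging.getLogger("shadow_deploy")

def handle_request(features, prod_model, shadow_model):
    """Serve the production model; run the candidate in shadow for comparison only."""
    prod_output = prod_model.predict(features)  # drives the real decision
    try:
        shadow_output = shadow_model.predict(features)  # observed, never returned
        logger.info(
            "shadow_comparison prod=%s shadow=%s match=%s",
            prod_output, shadow_output, prod_output == shadow_output,
        )
    except Exception:
        # A shadow failure must never affect production traffic.
        logger.exception("shadow model failed")
    return prod_output
```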
Canary Deployment
New model serves small percentage of traffic (1-5%). Gradually increase if metrics are healthy.
Use when: Want real-world validation with limited blast radius
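One way to implement the traffic split, sketched here as a stable hash of the user ID so each user consistently sees the same variant while the percentage is raised; the 5% starting point and function names are illustrative:

```python
import hashlib

CANARY_PERCENT = 5  # start at 1-5%, raise per the deployment plan

def pick_model(user_id: str) -> str:
    # Stable bucket in [0, 100) derived from the user ID, so assignment
    # does not change between requests as the rollout progresses.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < CANARY_PERCENT else "production"
```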
Blue-Green Deployment
Two identical environments; switch traffic from old (blue) to new (green). Instant rollback by switching back.
Use when: Need fast rollback capability
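At its simplest, the cutover is a single pointer flip, as in this illustrative sketch; the environment names are placeholders, and in practice the switch usually lives in a load balancer or service mesh rather than application code:

```python
ENVIRONMENTS = {"blue": "model-service-blue", "green": "model-service-green"}
ACTIVE = "green"  # flip to cut over; flip back to "blue" for instant rollback

def active_backend() -> str:
    return ENVIRONMENTS[ACTIVE]
```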
Feature Flags
New model behind feature flag, enabled for specific users or segments. Fine-grained control.
Use when: Want targeted rollout to specific users
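A sketch of the flag check itself, using a plain in-memory flag store; the flag name, segments, and store layout are assumptions rather than a specific flag product's API:

```python
FLAGS = {
    "new-model-v2": {"enabled": True, "segments": {"internal", "beta"}},
}

def use_new_model(flag_name: str, user_segment: str) -> bool:
    flag = FLAGS.get(flag_name, {})
    if not flag.get("enabled", False):
        return False  # a disabled flag doubles as a kill switch
    return user_segment in flag.get("segments", set())
```

In practice the store would sit behind a flag service so the flag can be flipped without a deploy, which is what makes it useful as a kill switch during incidents.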
Deployment Execution
Final Checks
Run production readiness checklist. Verify all sign-offs obtained. Confirm monitoring and alerting active. Brief on-call team on new deployment.
Limited Exposure
Deploy to shadow or small canary. Monitor closely for anomalies. Compare outputs to baseline. Validate infrastructure behavior under load.
Graduated Rollout
If initial metrics are healthy, expand traffic percentage per the deployment plan. Continue monitoring at each stage. Pause or rollback if issues emerge.
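A sketch of a promotion gate for the graduated rollout, assuming stage percentages taken from the deployment plan and comparing the canary's error rate and latency against the production baseline; the metric names and tolerances are illustrative:

```python
STAGES = [1, 5, 25, 50, 100]  # traffic percentages from the deployment plan

def next_stage(current_pct: int, canary: dict, baseline: dict) -> int:
    """Return the next traffic percentage, or 0 to signal a rollback."""
    error_ok = canary["error_rate"] <= baseline["error_rate"] * 1.1
    latency_ok = canary["p95_latency_ms"] <= baseline["p95_latency_ms"] * 1.2
    if not (error_ok and latency_ok):
        return 0  # pause or roll back and investigate before proceeding
    later = [s for s in STAGES if s > current_pct]
    return later[0] if later else 100  # hold at full traffic once reached
```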
100% Traffic
Complete rollout to all traffic. Maintain elevated monitoring for 24-48 hours. Document any issues encountered and resolutions.
Post-Deployment Validation
After 1-2 weeks, conduct formal post-deployment review. Validate business metrics trending as expected. Update Model Card with production metrics.
Continuous Monitoring
Monitoring Layers
Effective AI monitoring operates at multiple layers:
| Layer | What to Monitor | Example Metrics |
|---|---|---|
| Infrastructure | System health and resources | CPU, memory, disk, network, availability |
| Application | Service behavior | Latency, throughput, error rates, queue depth |
| Model | ML-specific performance | Prediction distribution, confidence scores, feature values |
| Data | Input data health | Schema compliance, missing values, distribution shifts |
| Business | Outcome metrics | Conversion rates, user satisfaction, business KPIs |
| Fairness | Equitable treatment | Subgroup performance, disparity metrics |
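As a sketch of what the model and data layers can look like in code, the snippet below computes a few of the example metrics from the table; `emit_metric` stands in for whatever metrics client the pod actually uses:

```python
import numpy as np

def emit_model_metrics(predictions: np.ndarray, confidences: np.ndarray, emit_metric):
    # Model layer: prediction distribution and confidence scores
    emit_metric("model.positive_rate", float(np.mean(predictions == 1)))
    emit_metric("model.confidence.mean", float(np.mean(confidences)))
    emit_metric("model.confidence.p10", float(np.percentile(confidences, 10)))

def emit_data_metrics(batch: np.ndarray, emit_metric):
    # Data layer: basic input health (missing values)
    emit_metric("data.missing_rate", float(np.mean(np.isnan(batch))))
```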
Drift Detection
AI models degrade over time as the world changes. Monitor for multiple drift types:
Data Drift
Input feature distributions change from training data. May indicate changing user behavior or data pipeline issues.
Detection: Statistical tests (KS, PSI, chi-squared) on feature distributions
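A minimal sketch of two of the tests named above: a two-sample Kolmogorov-Smirnov test via SciPy and a Population Stability Index computed over bins fit on the training data. The thresholds (p < 0.01, PSI > 0.2) are common rules of thumb, not prescribed values:

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference and a live sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) for empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

def feature_drifted(train_col: np.ndarray, live_col: np.ndarray) -> bool:
    ks_p = ks_2samp(train_col, live_col).pvalue
    return ks_p < 0.01 or psi(train_col, live_col) > 0.2
```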
Concept Drift
Relationship between features and target changes. The world has changed in ways the model doesn't reflect.
Detection: Performance degradation on recent labeled data
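A sketch of the corresponding check, assuming recent ground-truth labels are available and a validation-time accuracy was recorded at release; the five-point tolerance is illustrative:

```python
from sklearn.metrics import accuracy_score

def concept_drift_suspected(recent_labels, recent_preds,
                            validation_accuracy: float,
                            tolerance: float = 0.05) -> bool:
    # Performance on the most recent labeled window vs. release-time accuracy
    recent_accuracy = accuracy_score(recent_labels, recent_preds)
    return recent_accuracy < validation_accuracy - tolerance
```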
Prediction Drift
Model output distribution shifts. May be caused by data drift or model issues.
Detection: Monitor prediction distribution over time
Label Drift
Ground truth label distribution changes. Business context or definitions may have shifted.
Detection: Monitor label rates when ground truth available
Alerting Strategy
Configure alerts at appropriate thresholds:
| Severity | Criteria | Response |
|---|---|---|
| SEV-1 | Model producing harmful outputs, system down, major accuracy degradation | Immediate page, consider kill switch, escalate to STO |
| SEV-2 | Significant performance degradation, fairness threshold breach, high error rate | Page within 15 min, investigate immediately |
| SEV-3 | Minor performance issues, drift warnings, elevated latency | Notify during business hours, investigate same day |
| SEV-4 | Informational alerts, approaching thresholds, minor anomalies | Log for review, include in weekly monitoring |
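A sketch of how the table might be encoded as routing policy; the pager and logging objects are placeholders for the team's actual alerting tooling, and the acknowledgment windows mirror the table above:

```python
RESPONSES = {
    "SEV-1": {"page": True, "ack_minutes": 0, "escalate_to": "STO"},   # immediate page
    "SEV-2": {"page": True, "ack_minutes": 15, "escalate_to": None},
    "SEV-3": {"page": False},  # business hours, same-day investigation
    "SEV-4": {"page": False},  # logged for weekly monitoring review
}

def route_alert(severity: str, message: str, pager, log) -> None:
    policy = RESPONSES[severity]
    if policy["page"]:
        pager.page(message, ack_within_minutes=policy["ack_minutes"])
        if policy.get("escalate_to"):
            pager.escalate(policy["escalate_to"], message)
    else:
        log.info("alert severity=%s message=%s", severity, message)
```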
Retraining Triggers
When to Retrain
Model retraining can be scheduled or triggered by events:
Regular retraining on fresh data, regardless of drift signals. Frequency depends on domain:
- Fast-changing domains (fraud, recommendations): Weekly or monthly
- Moderate domains (credit risk, pricing): Monthly or quarterly
- Stable domains (document classification): Quarterly or annually
Event-driven retraining when conditions warrant (a sketch of these trigger checks follows the list):
- Drift detection exceeds threshold
- Performance metrics fall below SLA
- Fairness metrics breach acceptable range
- Significant new data becomes available
- Business rules or requirements change
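A sketch of how the automatable triggers above could be evaluated from a dictionary of current monitoring values; the keys and thresholds are illustrative and would come from the pod's monitoring plan and SLAs:

```python
def retraining_triggers(status: dict) -> list[str]:
    """Return the list of event-driven triggers that currently fire."""
    triggers = []
    if status["max_feature_psi"] > 0.2:
        triggers.append("drift detection exceeds threshold")
    if status["accuracy"] < status["sla_accuracy"]:
        triggers.append("performance below SLA")
    if status["fairness_disparity"] > status["fairness_limit"]:
        triggers.append("fairness metrics breach acceptable range")
    if status["new_labeled_rows"] >= status["new_data_threshold"]:
        triggers.append("significant new data available")
    return triggers  # any non-empty result opens a retraining task
```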
Retraining Process
Retraining follows a lighter version of the development process:
- Data Refresh: Incorporate recent data while maintaining historical coverage
- Training: Run standard training pipeline with version control
- Validation: Execute full test suite including fairness tests
- Comparison: Benchmark the new model against the production model (see the sketch after this list)
- Approval: Lightweight sign-off (STO for routine, full review for significant changes)
- Deployment: Standard progressive rollout
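A sketch of the comparison step, benchmarking the retrained challenger against the production champion on a shared holdout, with a subgroup guard tied to the fairness tests; the metric, margin, and promotion rule are illustrative:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def challenger_wins(champion, challenger, X_test, y_test, groups, margin=0.002):
    y_test, groups = np.asarray(y_test), np.asarray(groups)
    champ_auc = roc_auc_score(y_test, champion.predict_proba(X_test)[:, 1])
    chall_auc = roc_auc_score(y_test, challenger.predict_proba(X_test)[:, 1])
    if chall_auc < champ_auc + margin:
        return False  # not clearly better overall
    # Fairness guard: the challenger must not regress on any subgroup
    for g in np.unique(groups):
        mask = groups == g
        champ_g = roc_auc_score(y_test[mask], champion.predict_proba(X_test[mask])[:, 1])
        chall_g = roc_auc_score(y_test[mask], challenger.predict_proba(X_test[mask])[:, 1])
        if chall_g < champ_g - margin:
            return False
    return True
```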
Steady-State Operations
On-Call Rotation
The pod maintains 24/7 coverage for production issues:
- Primary on-call: Rotates among technical pod members (1 week typical)
- Secondary on-call: Backup for escalation or extended incidents
- STO availability: Always reachable for SEV-1 incidents
- Compensation: Fair on-call compensation and time-off policies
Incident Response
When issues occur, follow the incident response framework (detailed in Section 5.3):
Detect & Alert
Automated monitoring triggers alert. On-call acknowledges within SLA.
Assess & Classify
Determine severity, impact, and blast radius. Escalate if needed.
Mitigate
Take immediate action to reduce impact (rollback, kill switch, manual override).
Resolve
Fix root cause and restore normal operations.
Learn
Conduct blameless post-mortem. Update runbooks. Implement preventive measures.
Regular Operational Reviews
| Review | Frequency | Focus |
|---|---|---|
| Daily Check | Daily | Quick dashboard review, overnight alerts, any anomalies |
| Weekly Review | Weekly | Metrics trends, incident summary, upcoming changes |
| Monthly Ops Review | Monthly | SLA performance, capacity planning, technical debt |
| Quarterly Business Review | Quarterly | OKR progress, business value delivery, roadmap |
| Model Card Review | At least monthly | Documentation currency, risk assessment, compliance |