4.3 Deployment & Monitoring
Deployment is not the finish line; it's the starting point of the most important phase. This is where the Cradle-to-Grave commitment becomes real: the pod that built the model now operates it, experiencing every production quirk, edge case, and 3 AM alert. Robust deployment and continuous monitoring ensure the AI product delivers value safely over its operational lifetime.
In the AI Innovation model, there is no "throw it over the wall" to an operations team. The builders are the operators. This creates powerful incentives: if you build something brittle, you're the one who gets paged. The result is more robust systems, better documentation, and faster incident resolution.
Deployment Strategies
Progressive Rollout Options
Choose a deployment strategy based on risk and rollback requirements:
Shadow Deployment
New model runs in parallel, receiving real traffic but not affecting decisions. Outputs compared to production model.
Use when: High-risk, need confidence before any exposure
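A minimal sketch of how a shadow comparison might be wired, assuming a request handler that already has both models loaded; `prod_model`, `shadow_model`, and the logging setup are illustrative placeholders rather than a prescribed stack:

```python
import logging

logger = logging.getLogger("shadow_deploy")

def handle_request(features, prod_model, shadow_model):
    """Serve the production model; run the candidate in shadow for comparison only."""
    prod_output = prod_model.predict(features)  # drives the real decision
    try:
        shadow_output = shadow_model.predict(features)  # observed, never returned
        logger.info(
            "shadow_comparison prod=%s shadow=%s match=%s",
            prod_output, shadow_output, prod_output == shadow_output,
        )
    except Exception:
        # A shadow failure must never affect production traffic.
        logger.exception("shadow model failed")
    return prod_output
```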
Canary Deployment
New model serves small percentage of traffic (1-5%). Gradually increase if metrics are healthy.
Use when: Want real-world validation with limited blast radius
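One way to implement the traffic split, sketched here as a stable hash of the user ID so each user consistently sees the same variant while the percentage is raised; the 5% starting point and function names are illustrative:

```python
import hashlib

CANARY_PERCENT = 5  # start at 1-5%, raise per the deployment plan

def pick_model(user_id: str) -> str:
    # Stable bucket in [0, 100) derived from the user ID, so assignment
    # does not change between requests as the rollout progresses.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < CANARY_PERCENT else "production"
```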
Blue-Green Deployment
Two identical environments; switch traffic from old (blue) to new (green). Instant rollback by switching back.
Use when: Need fast rollback capability
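At its simplest, the cutover is a single pointer flip, as in this illustrative sketch; the environment names are placeholders, and in practice the switch usually lives in a load balancer or service mesh rather than application code:

```python
ENVIRONMENTS = {"blue": "model-service-blue", "green": "model-service-green"}
ACTIVE = "green"  # flip to cut over; flip back to "blue" for instant rollback

def active_backend() -> str:
    return ENVIRONMENTS[ACTIVE]
```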
Feature Flags
New model behind feature flag, enabled for specific users or segments. Fine-grained control.
Use when: Want targeted rollout to specific users
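A sketch of the flag check itself, using a plain in-memory flag store; the flag name, segments, and store layout are assumptions rather than a specific flag product's API:

```python
FLAGS = {
    "new-model-v2": {"enabled": True, "segments": {"internal", "beta"}},
}

def use_new_model(flag_name: str, user_segment: str) -> bool:
    flag = FLAGS.get(flag_name, {})
    if not flag.get("enabled", False):
        return False  # a disabled flag doubles as a kill switch
    return user_segment in flag.get("segments", set())
```

In practice the store would sit behind a flag service so the flag can be flipped without a deploy, which is what makes it useful as a kill switch during incidents.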
Deployment Execution
Final Checks
Run production readiness checklist. Verify all sign-offs obtained. Confirm monitoring and alerting active. Brief on-call team on new deployment.
Limited Exposure
Deploy to shadow or small canary. Monitor closely for anomalies. Compare outputs to baseline. Validate infrastructure behavior under load.
Graduated Rollout
If initial metrics are healthy, expand traffic percentage per the deployment plan. Continue monitoring at each stage. Pause or rollback if issues emerge.
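A sketch of a promotion gate for the graduated rollout, assuming stage percentages taken from the deployment plan and comparing the canary's error rate and latency against the production baseline; the metric names and tolerances are illustrative:

```python
STAGES = [1, 5, 25, 50, 100]  # traffic percentages from the deployment plan

def next_stage(current_pct: int, canary: dict, baseline: dict) -> int:
    """Return the next traffic percentage, or 0 to signal a rollback."""
    error_ok = canary["error_rate"] <= baseline["error_rate"] * 1.1
    latency_ok = canary["p95_latency_ms"] <= baseline["p95_latency_ms"] * 1.2
    if not (error_ok and latency_ok):
        return 0  # pause or roll back and investigate before proceeding
    later = [s for s in STAGES if s > current_pct]
    return later[0] if later else 100  # hold at full traffic once reached
```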
100% Traffic
Complete rollout to all traffic. Maintain elevated monitoring for 24-48 hours. Document any issues encountered and resolutions.
Post-Deployment Validation
After 1-2 weeks, conduct formal post-deployment review. Validate business metrics trending as expected. Update Model Card with production metrics.
Continuous Monitoring
Monitoring Layers
Effective AI monitoring operates at multiple layers:
| Layer | What to Monitor | Example Metrics |
|---|---|---|
| Infrastructure | System health and resources | CPU, memory, disk, network, availability |
| Application | Service behavior | Latency, throughput, error rates, queue depth |
| Model | ML-specific performance | Prediction distribution, confidence scores, feature values |
| Data | Input data health | Schema compliance, missing values, distribution shifts |
| Business | Outcome metrics | Conversion rates, user satisfaction, business KPIs |
| Fairness | Equitable treatment | Subgroup performance, disparity metrics |
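As a sketch of what the model and data layers can look like in code, the snippet below computes a few of the example metrics from the table; `emit_metric` stands in for whatever metrics client the pod actually uses:

```python
import numpy as np

def emit_model_metrics(predictions: np.ndarray, confidences: np.ndarray, emit_metric):
    # Model layer: prediction distribution and confidence scores
    emit_metric("model.positive_rate", float(np.mean(predictions == 1)))
    emit_metric("model.confidence.mean", float(np.mean(confidences)))
    emit_metric("model.confidence.p10", float(np.percentile(confidences, 10)))

def emit_data_metrics(batch: np.ndarray, emit_metric):
    # Data layer: basic input health (missing values)
    emit_metric("data.missing_rate", float(np.mean(np.isnan(batch))))
```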
Drift Detection
AI models degrade over time as the world changes. Monitor for multiple drift types:
Data Drift
Input feature distributions change from training data. May indicate changing user behavior or data pipeline issues.
Detection: Statistical tests (KS, PSI, chi-squared) on feature distributions
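A minimal sketch of two of the tests named above: a two-sample Kolmogorov-Smirnov test via SciPy and a Population Stability Index computed over bins fit on the training data. The thresholds (p < 0.01, PSI > 0.2) are common rules of thumb, not prescribed values:

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference and a live sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) for empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

def feature_drifted(train_col: np.ndarray, live_col: np.ndarray) -> bool:
    ks_p = ks_2samp(train_col, live_col).pvalue
    return ks_p < 0.01 or psi(train_col, live_col) > 0.2
```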
Concept Drift
Relationship between features and target changes. The world has changed in ways the model doesn't reflect.
Detection: Performance degradation on recent labeled data
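A sketch of the corresponding check, assuming recent ground-truth labels are available and a validation-time accuracy was recorded at release; the five-point tolerance is illustrative:

```python
from sklearn.metrics import accuracy_score

def concept_drift_suspected(recent_labels, recent_preds,
                            validation_accuracy: float,
                            tolerance: float = 0.05) -> bool:
    # Performance on the most recent labeled window vs. release-time accuracy
    recent_accuracy = accuracy_score(recent_labels, recent_preds)
    return recent_accuracy < validation_accuracy - tolerance
```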
Prediction Drift
Model output distribution shifts. May be caused by data drift or model issues.
Detection: Monitor prediction distribution over time
Label Drift
Ground truth label distribution changes. Business context or definitions may have shifted.
Detection: Monitor label rates when ground truth available
Alerting Strategy
Configure alerts at appropriate thresholds:
| Severity | Criteria | Response |
|---|---|---|
| SEV-1 | Model producing harmful outputs, system down, major accuracy degradation | Immediate page, consider kill switch, escalate to STO |
| SEV-2 | Significant performance degradation, fairness threshold breach, high error rate | Page within 15 min, investigate immediately |
| SEV-3 | Minor performance issues, drift warnings, elevated latency | Notify during business hours, investigate same day |
| SEV-4 | Informational alerts, approaching thresholds, minor anomalies | Log for review, include in weekly monitoring |
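A sketch of how the table might be encoded as routing policy; the pager and logging objects are placeholders for the team's actual alerting tooling, and the acknowledgment windows mirror the table above:

```python
RESPONSES = {
    "SEV-1": {"page": True, "ack_minutes": 0, "escalate_to": "STO"},   # immediate page
    "SEV-2": {"page": True, "ack_minutes": 15, "escalate_to": None},
    "SEV-3": {"page": False},  # business hours, same-day investigation
    "SEV-4": {"page": False},  # logged for weekly monitoring review
}

def route_alert(severity: str, message: str, pager, log) -> None:
    policy = RESPONSES[severity]
    if policy["page"]:
        pager.page(message, ack_within_minutes=policy["ack_minutes"])
        if policy.get("escalate_to"):
            pager.escalate(policy["escalate_to"], message)
    else:
        log.info("alert severity=%s message=%s", severity, message)
```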
Retraining Triggers
When to Retrain
Model retraining can be scheduled or triggered by events:
Regular retraining on fresh data, regardless of drift signals. Frequency depends on domain:
- Fast-changing domains (fraud, recommendations): Weekly or monthly
- Moderate domains (credit risk, pricing): Monthly or quarterly
- Stable domains (document classification): Quarterly or annually
Event-driven retraining when conditions warrant (a sketch of these trigger checks follows the list):
- Drift detection exceeds threshold
- Performance metrics fall below SLA
- Fairness metrics breach acceptable range
- Significant new data becomes available
- Business rules or requirements change
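A sketch of how the automatable triggers above could be evaluated from a dictionary of current monitoring values; the keys and thresholds are illustrative and would come from the pod's monitoring plan and SLAs:

```python
def retraining_triggers(status: dict) -> list[str]:
    """Return the list of event-driven triggers that currently fire."""
    triggers = []
    if status["max_feature_psi"] > 0.2:
        triggers.append("drift detection exceeds threshold")
    if status["accuracy"] < status["sla_accuracy"]:
        triggers.append("performance below SLA")
    if status["fairness_disparity"] > status["fairness_limit"]:
        triggers.append("fairness metrics breach acceptable range")
    if status["new_labeled_rows"] >= status["new_data_threshold"]:
        triggers.append("significant new data available")
    return triggers  # any non-empty result opens a retraining task
```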
Retraining Process
Retraining follows a lighter version of the development process:
- Data Refresh: Incorporate recent data while maintaining historical coverage
- Training: Run standard training pipeline with version control
- Validation: Execute full test suite including fairness tests
- Comparison: Benchmark the new model against the production model (see the sketch after this list)
- Approval: Lightweight sign-off (STO for routine, full review for significant changes)
- Deployment: Standard progressive rollout
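A sketch of the comparison step, benchmarking the retrained challenger against the production champion on a shared holdout, with a subgroup guard tied to the fairness tests; the metric, margin, and promotion rule are illustrative:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def challenger_wins(champion, challenger, X_test, y_test, groups, margin=0.002):
    y_test, groups = np.asarray(y_test), np.asarray(groups)
    champ_auc = roc_auc_score(y_test, champion.predict_proba(X_test)[:, 1])
    chall_auc = roc_auc_score(y_test, challenger.predict_proba(X_test)[:, 1])
    if chall_auc < champ_auc + margin:
        return False  # not clearly better overall
    # Fairness guard: the challenger must not regress on any subgroup
    for g in np.unique(groups):
        mask = groups == g
        champ_g = roc_auc_score(y_test[mask], champion.predict_proba(X_test[mask])[:, 1])
        chall_g = roc_auc_score(y_test[mask], challenger.predict_proba(X_test[mask])[:, 1])
        if chall_g < champ_g - margin:
            return False
    return True
```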
Steady-State Operations
On-Call Rotation
The pod maintains 24/7 coverage for production issues:
- Primary on-call: Rotates among technical pod members (1 week typical)
- Secondary on-call: Backup for escalation or extended incidents
- STO availability: Always reachable for SEV-1 incidents
- Compensation: Fair on-call compensation and time-off policies
Incident Response
When issues occur, follow the incident response framework (detailed in Section 5.3):
Detect & Alert
Automated monitoring triggers alert. On-call acknowledges within SLA.
Assess & Classify
Determine severity, impact, and blast radius. Escalate if needed.
Mitigate
Take immediate action to reduce impact (rollback, kill switch, manual override).
Resolve
Fix root cause and restore normal operations.
Learn
Conduct blameless post-mortem. Update runbooks. Implement preventive measures.
Regular Operational Reviews
| Review | Frequency | Focus |
|---|---|---|
| Daily Check | Daily | Quick dashboard review, overnight alerts, any anomalies |
| Weekly Review | Weekly | Metrics trends, incident summary, upcoming changes |
| Monthly Ops Review | Monthly | SLA performance, capacity planning, technical debt |
| Quarterly Business Review | Quarterly | OKR progress, business value delivery, roadmap |
| Model Card Review | At least monthly | Documentation currency, risk assessment, compliance |