5.3 The "Andon Cord" for AI: Incident Response

When AI systems cause harm or behave unexpectedly, the response must be swift, effective, and learning-oriented. Inspired by Toyota's famous "Andon Cord", which allows any worker to stop the production line when they see a problem, AI Innovations empowers anyone to halt an AI system when safety or ethics are at risk. This culture of empowered response is essential for responsible AI at scale.

The Andon Cord Principle

In Toyota's production system, any worker can pull the Andon Cord to stop the assembly line if they see a quality or safety issue. This empowerment—combined with a culture that celebrates stopping problems rather than punishing line stops—creates exceptional quality. For AI, we apply the same principle: anyone can stop an AI system causing harm, and they will be celebrated, not blamed, for doing so.

The AI Kill Switch

Who Can Pull the Cord

The authority to stop an AI system is broadly distributed: consistent with the Andon principle above, anyone who observes harm or a safety or ethics risk can initiate a stop and will be backed for doing so.

Kill Switch Implementation

Every AI product must have a functional kill switch:

Technical Kill Switch

  • One-click disable for model serving
  • Automatic rollback to safe state or previous version
  • Feature flag to disable AI without taking down surrounding system
  • Tested regularly (chaos engineering)
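A minimal sketch of the feature-flag approach above, assuming an in-memory flag store and names invented for illustration; a production version would back the flag with a feature-flag or configuration service.

```python
# Sketch of a feature-flag kill switch for model serving.
# The in-memory flag store and names are illustrative assumptions.

from typing import Any, Callable, Dict


class AIKillSwitch:
    """Gates model inference behind a flag so it can be disabled in one call."""

    def __init__(self, flag_store: Dict[str, bool], flag_name: str = "ai_serving_enabled"):
        self.flag_store = flag_store
        self.flag_name = flag_name

    def is_enabled(self) -> bool:
        return self.flag_store.get(self.flag_name, True)

    def disable(self, reason: str) -> None:
        # One-call disable: the AI path goes dark, the surrounding system stays up.
        self.flag_store[self.flag_name] = False
        print(f"AI serving disabled: {reason}")

    def predict(self, model: Callable[[Any], Any], fallback: Callable[[Any], Any], features: Any) -> Any:
        # Route to a safe fallback (rules, cached result, human review queue)
        # whenever the cord has been pulled.
        return model(features) if self.is_enabled() else fallback(features)


# Usage: anyone with access can pull the cord without redeploying.
switch = AIKillSwitch(flag_store={})
switch.disable("SEV-1: harmful outputs reported by support team")
result = switch.predict(model=lambda x: "ai answer",
                        fallback=lambda x: "routed to human review",
                        features={})
```

The important property is that pulling the cord is a single call that requires no redeploy, and the surrounding system keeps serving through the fallback path.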

Operational Kill Switch

  • Clear runbook for manual shutdown
  • Contact tree for escalation
  • Communication templates ready
  • Authority to act without prior approval

Incident Classification

Severity Levels

  • SEV-1: Active harm being caused; system down; major compliance breach. Response time: immediate (within 15 minutes). Escalation: STO + AI Council + Executive.
  • SEV-2: Significant degradation; fairness breach; high error rate. Response time: within 1 hour. Escalation: STO + Ethics Liaison.
  • SEV-3: Noticeable issues; elevated errors; approaching thresholds. Response time: within 4 hours. Escalation: on-call, with STO informed.
  • SEV-4: Minor issues; cosmetic problems; edge cases. Response time: next business day. Escalation: handled by on-call.
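One way to keep paging behavior aligned with this policy is to encode the table as data that alerting tools read. The sketch below is illustrative only; the field names and paging hook are assumptions.

```python
# Sketch: the severity policy above encoded as data so paging and escalation
# tooling stay consistent with the documented table. Field names are assumptions.

SEVERITY_POLICY = {
    "SEV-1": {"response_due": "15 minutes", "escalate_to": ["STO", "AI Council", "Executive"]},
    "SEV-2": {"response_due": "1 hour", "escalate_to": ["STO", "Ethics Liaison"]},
    "SEV-3": {"response_due": "4 hours", "escalate_to": ["On-call", "STO (informed)"]},
    "SEV-4": {"response_due": "next business day", "escalate_to": ["On-call"]},
}


def page_for(severity: str) -> None:
    """Notify everyone the policy names for this severity level."""
    policy = SEVERITY_POLICY[severity]
    for role in policy["escalate_to"]:
        print(f"Paging {role}: response due within {policy['response_due']}")


page_for("SEV-1")
```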

AI-Specific Incident Types

AI systems can fail in ways unique to machine learning:

Harmful Outputs

Model producing toxic, biased, or dangerous content

Example: LLM generating harmful medical advice

Fairness Failure

Systematic disadvantage to protected groups detected

Example: Hiring model rejecting qualified minority candidates

Privacy Breach

Model leaking training data or PII in outputs

Example: Model memorization exposing customer data

Performance Collapse

Model accuracy drops dramatically

Example: Fraud model missing most fraud due to data drift
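Two of these failure modes, fairness failures and performance collapse, lend themselves to simple automated checks. The sketch below is illustrative only; the metrics and thresholds (a 0.8 disparate-impact ratio, a 20% relative accuracy drop) are assumptions, not recommended values.

```python
# Sketch of automated checks for two AI-specific incident types.
# Thresholds (0.8 ratio, 20% relative drop) are illustrative assumptions.

from typing import Dict, Optional


def fairness_failure(selection_rates: Dict[str, float], min_ratio: float = 0.8) -> Optional[str]:
    """Flag a potential fairness incident if any group's selection rate falls
    below min_ratio of the highest group's rate (a disparate-impact style check)."""
    highest = max(selection_rates.values())
    for group, rate in selection_rates.items():
        if highest > 0 and rate / highest < min_ratio:
            return f"Fairness failure suspected: {group} selection-rate ratio {rate / highest:.2f}"
    return None


def performance_collapse(baseline_accuracy: float, current_accuracy: float,
                         max_relative_drop: float = 0.20) -> Optional[str]:
    """Flag a potential performance-collapse incident (e.g. from data drift)
    if accuracy has dropped more than max_relative_drop from baseline."""
    drop = (baseline_accuracy - current_accuracy) / baseline_accuracy
    if drop > max_relative_drop:
        return f"Performance collapse suspected: accuracy down {drop:.0%} from baseline"
    return None


# Usage with toy numbers:
print(fairness_failure({"group_a": 0.42, "group_b": 0.21}))
print(performance_collapse(baseline_accuracy=0.95, current_accuracy=0.70))
```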

Incident Response Process

The Response Framework

Phase 1: Detection (0-15 min)

Identify & Alert

  • Automated monitoring detects anomaly OR human reports issue
  • On-call acknowledges alert
  • Initial severity assessment
  • If SEV-1: Immediately consider kill switch
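As a rough illustration, the detection steps above might be wired into an alert handler along these lines; the payload fields and severity heuristics are assumptions, and only the control flow mirrors the steps listed.

```python
# Sketch of the detection phase as an alert handler: acknowledge, assess,
# and surface the kill-switch decision for SEV-1. Fields and thresholds are assumptions.

def assess_initial_severity(alert: dict) -> str:
    """Rough first-pass severity from the alert payload."""
    if alert.get("active_harm") or alert.get("system_down"):
        return "SEV-1"
    if alert.get("error_rate", 0.0) > 0.10:   # illustrative threshold
        return "SEV-2"
    return "SEV-3"


def handle_alert(alert: dict) -> str:
    print(f"Acknowledged: {alert['name']}")       # on-call acknowledges
    severity = assess_initial_severity(alert)     # initial assessment
    if severity == "SEV-1":
        # Active harm: the kill switch is considered immediately,
        # before any root-cause investigation begins.
        print("SEV-1 declared: consider pulling the kill switch now")
    return severity


handle_alert({"name": "toxic_output_rate_spike", "active_harm": True})
```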

Phase 2: Triage (15-60 min)

Assess & Contain

  • Determine scope and impact
  • Identify affected users/systems
  • Decide on containment: kill switch, rollback, or mitigate
  • Execute containment
  • Begin stakeholder communication

Phase 3: Investigate (1-24 hours)

Understand Root Cause

  • Gather data: logs, metrics, affected examples
  • Identify root cause(s)
  • Develop fix options
  • Document findings

Phase 4: Resolve (hours to days)

Fix & Restore

  • Implement fix
  • Validate fix addresses root cause
  • Progressive restoration of service
  • Confirm normal operation

Phase 5: Learn (within 1 week)

Post-Mortem & Prevent

  • Conduct blameless post-mortem
  • Identify prevention measures
  • Update runbooks and documentation
  • Share learnings across organization

Communication During Incidents

  • Internal stakeholders: what is happening, impact, and mitigation status. When: within the first hour, then hourly updates.
  • Affected users: service status, workarounds, and expected resolution. When: as soon as impact is confirmed.
  • AI Council: severity, root cause, and governance implications. When: immediately for SEV-1; within 4 hours for SEV-2.
  • Regulators: per regulatory requirements (varies by jurisdiction). When: per compliance obligations.
  • Public: if required, transparent acknowledgment and remediation. When: after internal alignment, if applicable.

Blameless Post-Mortems

The Blameless Culture

Post-mortems focus on systems and processes, not individuals:

Blameless Principles
  • People tried their best: Given what they knew at the time, their actions made sense
  • Systems fail, not people: If a human could make a mistake, the system allowed it
  • Honesty enables improvement: Full disclosure is rewarded, not punished
  • Focus on prevention: How do we make this impossible to repeat?

Post-Mortem Template

Required Sections
  1. Incident Summary: What happened, when, impact
  2. Timeline: Detailed sequence of events
  3. Root Cause Analysis: The 5 Whys or similar technique
  4. Impact Assessment: Who was affected and how
  5. What Went Well: Effective response elements
  6. What Could Be Improved: Gaps in detection, response, or prevention
  7. Action Items: Specific, assigned, time-bound improvements
  8. Lessons Learned: Broader takeaways for organization

Learning Distribution

Post-mortem learnings should be shared broadly across the organization, not kept within the team that owned the incident.

Prevention Measures

Every post-mortem should result in concrete prevention:

Detection Improvement

Could we have caught this sooner?

  • New monitoring metrics
  • Adjusted alert thresholds
  • Additional automated tests
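For instance, an "additional automated tests" action item might become a regression test that replays the traffic the model missed during the incident. The sketch below uses toy data, a stand-in model, and an assumed 90% recall bar.

```python
# Sketch of a regression test added after an incident: replay the cases the
# model missed and require the current model to catch nearly all of them.
# The stand-in model, replay data, and 90% recall bar are assumptions.

def current_model_flags_fraud(transaction: dict) -> bool:
    """Stand-in for the deployed fraud model's decision."""
    return transaction["amount"] > 1000


def test_incident_replay_recall():
    # Fraudulent transactions that slipped through during the incident.
    missed_fraud = [{"amount": 1500}, {"amount": 2200}, {"amount": 4800}]
    caught = sum(current_model_flags_fraud(t) for t in missed_fraud)
    recall = caught / len(missed_fraud)
    assert recall >= 0.9, f"Replay recall {recall:.0%} is below the required 90%"


test_incident_replay_recall()   # runs standalone, or under pytest as-is
```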

Prevention Improvement

Could we have prevented this entirely?

  • New guardrails
  • Process changes
  • Training requirements
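As an example of a post-incident guardrail, the sketch below holds back, or routes for review, any output that trips a deny-list or low-confidence check. The deny-list entry and confidence threshold are illustrative assumptions.

```python
# Sketch of a guardrail added after an incident: block or route for review any
# output that trips a deny-list or low-confidence check before it reaches users.
# The deny-list term and 0.7 confidence threshold are illustrative assumptions.

BLOCKED_TERMS = {"take double the prescribed dose"}   # hypothetical post-incident entry


def guardrail(output_text: str, confidence: float, min_confidence: float = 0.7) -> str:
    if any(term in output_text.lower() for term in BLOCKED_TERMS):
        return "BLOCK"            # never show; page on-call if it recurs
    if confidence < min_confidence:
        return "HUMAN_REVIEW"     # hold for a reviewer instead of auto-sending
    return "ALLOW"


print(guardrail("Take double the prescribed dose if symptoms persist.", confidence=0.9))
print(guardrail("Here is a summary of your account activity.", confidence=0.55))
```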

Response Improvement

Could we have responded faster/better?

  • Runbook updates
  • Tool improvements
  • Communication templates

Incidents as Investment

Every incident is an opportunity to improve. Organizations that conduct thorough, blameless post-mortems and follow through on action items build increasingly robust AI systems. Those that rush past incidents without learning repeat the same mistakes. The goal is not zero incidents—it's continuous improvement in how quickly incidents are detected, contained, and prevented from recurring.