5.3 The "Andon Cord" for AI: Incident Response
When AI systems cause harm or behave unexpectedly, the response must be swift, effective, and learning-oriented. Inspired by Toyota's famous "Andon Cord," which allows any worker to stop the production line when they see a problem, AI Innovations empowers anyone to halt an AI system when safety or ethics are at risk. This culture of empowered response is essential for responsible AI at scale.
In Toyota's production system, any worker can pull the Andon Cord to stop the assembly line if they see a quality or safety issue. This empowerment, combined with a culture that celebrates stopping the line to surface problems rather than punishing line stops, creates exceptional quality. For AI, we apply the same principle: anyone can stop an AI system that is causing harm, and they will be celebrated, not blamed, for doing so.
The AI Kill Switch
Who Can Pull the Cord
The authority to stop an AI system is broadly distributed:
- Any pod member who observes harmful behavior
- The STO for any operational or business reason
- The Ethics Liaison for governance concerns
- On-call engineers responding to alerts
- AI Council members for enterprise-wide concerns
- Executive leadership for crisis situations
Kill Switch Implementation
Every AI product must have a functional kill switch:
Technical Kill Switch
- One-click disable for model serving
- Automatic rollback to safe state or previous version
- Feature flag to disable AI without taking down the surrounding system (see the sketch after this list)
- Tested regularly (chaos engineering)
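In practice, the feature-flag approach means routing every AI call through a flag check that falls back to a non-AI path when the flag is off. Below is a minimal Python sketch; `FlagStore`, `ai_recommendations_enabled`, and `safe_fallback` are illustrative names, not a specific product's or vendor's API.

```python
class FlagStore:
    """Stand-in for a feature-flag service; any flag system with a
    runtime toggle (no redeploy required) can play this role."""

    def __init__(self):
        self._flags = {"ai_recommendations_enabled": True}

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)

    def disable(self, name: str) -> None:
        # The "one-click" kill switch: flip the flag, serving stops immediately.
        self._flags[name] = False


flags = FlagStore()


def safe_fallback(user_id: str) -> list[str]:
    # Non-AI path: most-popular items, cached results, or a static response.
    return ["popular-item-1", "popular-item-2"]


def recommend(user_id: str, model) -> list[str]:
    # Every AI call is wrapped: if the flag is off or the model errors,
    # the surrounding product keeps working on the fallback path.
    if not flags.is_enabled("ai_recommendations_enabled"):
        return safe_fallback(user_id)
    try:
        return model.predict(user_id)
    except Exception:
        return safe_fallback(user_id)
```

Regular chaos-style testing can flip the flag in staging and assert that the fallback path still serves traffic, which keeps the switch from rotting.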
Operational Kill Switch
- Clear runbook for manual shutdown
- Contact tree for escalation
- Communication templates ready
- Authority to act without prior approval
Incident Classification
Severity Levels
| Severity | Criteria | Response Time | Escalation |
|---|---|---|---|
| SEV-1 | Active harm being caused; system down; major compliance breach | Immediate (15 min) | STO + AI Council + Executive |
| SEV-2 | Significant degradation; fairness breach; high error rate | Within 1 hour | STO + Ethics Liaison |
| SEV-3 | Noticeable issues; elevated errors; approaching thresholds | Within 4 hours | On-call + STO informed |
| SEV-4 | Minor issues; cosmetic problems; edge cases | Next business day | On-call handles |
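If the paging automation reads the classification from configuration rather than from a wiki page, the table and the tooling cannot drift apart. A sketch of one possible encoding in Python; the field names and escalation groups simply mirror the table above and are not a mandated schema.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SeverityPolicy:
    response_time: timedelta          # maximum time to first response
    escalate_to: tuple[str, ...]      # roles notified when this severity is declared
    consider_kill_switch: bool        # SEV-1 triggers the kill-switch decision


SEVERITY_POLICIES = {
    "SEV-1": SeverityPolicy(timedelta(minutes=15), ("STO", "AI Council", "Executive"), True),
    "SEV-2": SeverityPolicy(timedelta(hours=1), ("STO", "Ethics Liaison"), False),
    "SEV-3": SeverityPolicy(timedelta(hours=4), ("On-call", "STO"), False),
    "SEV-4": SeverityPolicy(timedelta(days=1), ("On-call",), False),
}


def escalation_targets(severity: str) -> tuple[str, ...]:
    return SEVERITY_POLICIES[severity].escalate_to
```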
AI-Specific Incident Types
AI systems can fail in ways unique to machine learning:
Harmful Outputs
Model producing toxic, biased, or dangerous content
Example: LLM generating harmful medical advice
Fairness Failure
Systematic disadvantage to protected groups detected
Example: Hiring model rejecting qualified minority candidates
Privacy Breach
Model leaking training data or PII in outputs
Example: Model memorization exposing customer data
Performance Collapse
Model accuracy drops dramatically
Example: Fraud model missing most fraudulent transactions due to data drift (a drift-check sketch follows these examples)
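Drift of the kind behind the fraud example is usually caught by comparing the live distribution of a feature against its training-time baseline. The sketch below uses the population stability index (PSI); the 0.2 threshold is a common rule of thumb, and all names and numbers here are illustrative.

```python
import numpy as np


def population_stability_index(baseline: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    """Compare live feature values against the training-time baseline.
    Larger values mean more drift."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    expected, _ = np.histogram(baseline, bins=edges)
    actual, _ = np.histogram(live, bins=edges)

    # Convert counts to proportions; epsilon avoids division by zero and log(0).
    eps = 1e-6
    expected = expected / max(expected.sum(), 1) + eps
    actual = actual / max(actual.sum(), 1) + eps

    return float(np.sum((actual - expected) * np.log(actual / expected)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    baseline = rng.normal(0.0, 1.0, 10_000)   # feature values at training time
    live = rng.normal(0.7, 1.0, 10_000)       # feature values in production today
    psi = population_stability_index(baseline, live)
    if psi > 0.2:                              # common rule-of-thumb threshold
        print(f"Drift alert: PSI={psi:.2f}; open an incident")
```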
Incident Response Process
The Response Framework
Identify & Alert
- Automated monitoring detects anomaly OR human reports issue
- On-call acknowledges alert
- Initial severity assessment
- If SEV-1: Immediately consider the kill switch (see the first-response sketch after this list)
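This first phase can be automated end to end: a monitoring job evaluates a health metric, assigns an initial severity, and pages on-call; for SEV-1 it can also flip the kill switch described earlier. A sketch with assumed metric names, thresholds, and hooks:

```python
def initial_severity(error_rate: float) -> str:
    # Thresholds are illustrative; tune them per product and metric.
    if error_rate > 0.50:
        return "SEV-1"   # system effectively down or actively causing harm
    if error_rate > 0.10:
        return "SEV-2"
    if error_rate > 0.02:
        return "SEV-3"
    return "SEV-4"


def handle_alert(error_rate: float, page_oncall, disable_flag) -> str:
    # page_oncall and disable_flag are hooks into your paging and flag systems.
    severity = initial_severity(error_rate)
    page_oncall(f"AI incident, initial severity {severity}, error rate {error_rate:.1%}")
    if severity == "SEV-1":
        # Per the framework: immediately consider the kill switch.
        disable_flag("ai_recommendations_enabled")
    return severity
```

A human still confirms the severity; the automation only makes sure the clock starts and the safest default action is one decision away.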
Assess & Contain
- Determine scope and impact
- Identify affected users/systems
- Decide on containment: kill switch, rollback, or mitigate
- Execute containment
- Begin stakeholder communication
Understand Root Cause
- Gather data: logs, metrics, affected examples
- Identify root cause(s)
- Develop fix options
- Document findings
Fix & Restore
- Implement fix
- Validate fix addresses root cause
- Progressive restoration of service (see the ramp-up sketch after this list)
- Confirm normal operation
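Progressive restoration can reuse the same flag infrastructure as the kill switch: instead of jumping from 0% back to 100%, ramp the share of traffic taking the AI path while watching the health metric. The ramp steps and hook names below are assumptions for the sketch.

```python
import random

RAMP_STEPS = [0.01, 0.05, 0.25, 0.50, 1.00]   # fraction of traffic on the AI path


def use_ai_path(rollout_fraction: float) -> bool:
    # Per-request decision: send this fraction of traffic to the fixed model.
    return random.random() < rollout_fraction


def restore_service(healthy_at, set_rollout_fraction) -> bool:
    """Ramp traffic back up step by step; abort the ramp if health degrades.
    healthy_at(fraction) should soak at that level and check the SLO."""
    for fraction in RAMP_STEPS:
        set_rollout_fraction(fraction)
        if not healthy_at(fraction):
            set_rollout_fraction(0.0)      # back to the safe fallback path
            return False
    return True                            # normal operation confirmed at 100%
```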
Post-Mortem & Prevent
- Conduct blameless post-mortem
- Identify prevention measures
- Update runbooks and documentation
- Share learnings across organization
Communication During Incidents
| Audience | What to Communicate | When |
|---|---|---|
| Internal Stakeholders | What's happening, impact, mitigation status | Within first hour, then hourly updates |
| Affected Users | Service status, workarounds, expected resolution | As soon as impact is confirmed |
| AI Council | Severity, root cause, governance implications | SEV-1: Immediately; SEV-2: Within 4 hours |
| Regulators | Per regulatory requirements (varies by jurisdiction) | Per compliance obligations |
| Public | If required: transparent acknowledgment and remediation | After internal alignment, if applicable |
Blameless Post-Mortems
The Blameless Culture
Post-mortems focus on systems and processes, not individuals:
- People tried their best: Given what they knew at the time, their actions made sense
- Systems fail, not people: If a human could make a mistake, the system allowed it
- Honesty enables improvement: Full disclosure is rewarded, not punished
- Focus on prevention: How do we make this impossible to repeat?
Post-Mortem Template
- Incident Summary: What happened, when, impact
- Timeline: Detailed sequence of events
- Root Cause Analysis: The 5 Whys or similar technique
- Impact Assessment: Who was affected and how
- What Went Well: Effective response elements
- What Could Be Improved: Gaps in detection, response, or prevention
- Action Items: Specific, assigned, time-bound improvements
- Lessons Learned: Broader takeaways for organization
Learning Distribution
Post-mortem learnings should be shared broadly:
- Pod: Detailed review of incident and action items
- AI Council: Summary of significant incidents monthly
- All Pods: Relevant learnings that could prevent similar issues
- Organization: Trends and themes from aggregate incident data
Prevention Measures
Every post-mortem should result in concrete prevention:
Detection Improvement
Could we have caught this sooner?
- New monitoring metrics
- Adjusted alert thresholds
- Additional automated tests (see the regression-test sketch below)
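One concrete form of "additional automated tests" is turning an incident's failing examples into a regression suite that every future model version must pass before release. A minimal pytest-style sketch; the incident number, example cases, and `load_candidate_model` helper are all hypothetical.

```python
INCIDENT_1234_CASES = [
    # (input features, minimum acceptable fraud score); values are made up
    ({"amount": 9800, "country_mismatch": True, "new_device": True}, 0.80),
    ({"amount": 120, "country_mismatch": False, "new_device": False}, 0.00),
]


def load_candidate_model():
    """Stand-in for loading the model version proposed for release."""
    class StubModel:
        def score(self, features: dict) -> float:
            return 0.9 if features.get("country_mismatch") else 0.0
    return StubModel()


def test_incident_1234_does_not_regress():
    # Run in CI so no candidate ships while it still reproduces the incident.
    model = load_candidate_model()
    for features, min_score in INCIDENT_1234_CASES:
        assert model.score(features) >= min_score
```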
Prevention Improvement
Could we have prevented this entirely?
- New guardrails
- Process changes
- Training requirements
Response Improvement
Could we have responded faster/better?
- Runbook updates
- Tool improvements
- Communication templates
Incidents as Investment
Every incident is an opportunity to improve. Organizations that conduct thorough, blameless post-mortems and follow through on action items build increasingly robust AI systems. Those that rush past incidents without learning repeat the same mistakes. The goal is not zero incidents—it's continuous improvement in how quickly incidents are detected, contained, and prevented from recurring.