3.2 Agile for AI: Sprints, Epics, and Experiments
Standard Agile methodologies were designed for deterministic software development—where writing code produces predictable outputs. AI development is fundamentally different: experiments may fail, models may not converge, and "done" is often a moving target. AI Innovations adapt Agile practices to accommodate the inherent uncertainty of AI work while maintaining the discipline of iterative delivery.
Traditional Agile assumes that committed work can be completed with reasonable confidence. AI work includes experiments where the outcome is genuinely uncertain. The AI Innovation approach treats experiments as first-class work items with their own planning and completion criteria, separate from deterministic engineering tasks.
Why AI Development is Different
The Uncertainty Dimension
AI work contains inherent uncertainty that traditional software development does not:
| Traditional Software | AI Development |
|---|---|
| Requirements can be fully specified | Performance targets may be aspirational |
| Implementation is largely deterministic | Model training has stochastic elements |
| "Done" is clearly defined | "Good enough" requires judgment |
| Bugs are fixable | Model limitations may be inherent |
| Effort can be estimated with reasonable confidence | Experiments may succeed or fail unpredictably |
Types of AI Work
AI pods handle three distinct types of work, each with different planning characteristics:
Engineering Work
Deterministic tasks with predictable outcomes: building pipelines, creating APIs, implementing monitoring, writing tests.
Planning approach: Standard Agile estimation
Research Experiments
Exploratory work with uncertain outcomes: trying new architectures, testing hypotheses, exploring data patterns.
Planning approach: Time-boxed, outcome uncertain
Model Iteration
Incremental improvement work: tuning hyperparameters, adding features, addressing bias, fixing edge cases.
Planning approach: Hybrid—effort known, improvement uncertain
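To keep these distinctions visible in the backlog, some pods tag each work item with its type so planning treats it appropriately. The sketch below is only an illustration; the `WorkType` and `WorkItem` names and fields are assumptions, not part of any prescribed tooling.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class WorkType(Enum):
    """The three kinds of work an AI pod plans differently."""
    ENGINEERING = "engineering"          # deterministic; point-estimated
    EXPERIMENT = "experiment"            # time-boxed; outcome uncertain
    MODEL_ITERATION = "model_iteration"  # effort known; improvement uncertain


@dataclass
class WorkItem:
    title: str
    work_type: WorkType
    story_points: Optional[int] = None   # set for committed work
    time_box_days: Optional[int] = None  # set for experiments

    def counts_toward_velocity(self) -> bool:
        # Only point-estimated work feeds velocity; time-boxed experiments do not.
        return self.story_points is not None


backlog = [
    WorkItem("Build feature pipeline", WorkType.ENGINEERING, story_points=5),
    WorkItem("Evaluate a reranking step", WorkType.EXPERIMENT, time_box_days=10),
]
print([item.title for item in backlog if item.counts_toward_velocity()])
# -> ['Build feature pipeline']
```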
Adapted Sprint Structure
The Two-Track Sprint
AI Innovations run sprints with two parallel tracks that are planned and reviewed differently:
Track 1: Committed Work
Engineering tasks and well-understood model work that the pod commits to completing. These are planned with standard story points and included in velocity calculations.
- Infrastructure and pipeline development
- API and integration work
- Known model improvements
- Documentation and compliance tasks
- Bug fixes and technical debt
Track 2: Experimental Work
Research and exploratory tasks where the outcome is uncertain. These are time-boxed (not point-estimated) and success is measured by learning, not delivery.
- Architecture experiments
- Data exploration
- Hypothesis testing
- Performance improvement attempts
- Novel technique evaluation
Sprint Allocation
The balance between tracks varies by product maturity:
| Product Stage | Committed Work | Experimental Work |
|---|---|---|
| Exploration | 20-30% | 70-80% |
| Development | 50-60% | 40-50% |
| Production | 70-80% | 20-30% |
| Mature Operations | 80-90% | 10-20% |
Note: Even mature products should maintain some experimental capacity for continuous improvement and innovation.
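To make the split concrete, a pod can translate these ranges into capacity targets for a sprint. The sketch below is a minimal illustration; using the midpoint of each range and the `split_capacity` helper are assumptions, not prescribed rules.

```python
# Midpoints of the ranges in the table above (an assumption, not a rule).
ALLOCATION_BY_STAGE = {
    "exploration":       {"committed": 0.25, "experimental": 0.75},
    "development":       {"committed": 0.55, "experimental": 0.45},
    "production":        {"committed": 0.75, "experimental": 0.25},
    "mature_operations": {"committed": 0.85, "experimental": 0.15},
}


def split_capacity(total_person_days: float, stage: str) -> dict:
    """Split a sprint's total capacity into committed vs. experimental days."""
    shares = ALLOCATION_BY_STAGE[stage]
    return {track: round(total_person_days * share, 1)
            for track, share in shares.items()}


# Example: a 6-person pod with 8 productive days each in a two-week sprint.
print(split_capacity(48, "development"))
# -> {'committed': 26.4, 'experimental': 21.6}
```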
Experiments as First-Class Work
Experiment Design
Every experiment should be structured with the following elements (a template sketch appears after the list):
Hypothesis
A clear, testable statement of what we believe and why. "We believe [approach X] will improve [metric Y] by [amount Z] because [reasoning]."
Time Box
A strict limit on how long the experiment runs before evaluation. Experiments that drag on without conclusion waste resources and create uncertainty.
Success Criteria
Specific, measurable outcomes that would validate the hypothesis. What results would make us proceed? What results would make us stop?
Minimum Viable Experiment
The smallest version of the experiment that could validate the hypothesis. Don't build production infrastructure for an experiment.
Learning Documentation
Required output regardless of outcome. What did we learn? How does this inform next steps? This prevents "failed" experiments from being wasted effort.
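One way to hold experiments to this structure is a shared template that every experiment fills in before it starts. The following is a minimal sketch, assuming a Python dataclass; the field names and example values are illustrative, not a mandated schema.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Experiment:
    """Minimal record of the experiment design elements described above."""
    hypothesis: str        # "We believe X will improve Y by Z because ..."
    success_criteria: str  # measurable result that would validate the hypothesis
    stop_criteria: str     # result that would make us stop
    time_box_days: int     # hard limit before evaluation
    started: date
    learnings: list = field(default_factory=list)  # required output, whatever the outcome

    def evaluation_due(self) -> date:
        """Date by which the experiment must be evaluated, per its time box."""
        return self.started + timedelta(days=self.time_box_days)


# Illustrative example (values are invented for the sketch).
exp = Experiment(
    hypothesis=("We believe adding a reranking step will improve top-5 precision "
                "by 3 points because current errors cluster near the decision boundary."),
    success_criteria="Top-5 precision improves by at least 3 points on the holdout set.",
    stop_criteria="Improvement below 1 point, or the latency budget is exceeded.",
    time_box_days=10,
    started=date(2024, 3, 4),
)
print(exp.evaluation_due())  # -> 2024-03-14
```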
Experiment Outcomes
Experiments have three possible outcomes, all of which are valuable:
Validated
Hypothesis confirmed. Results meet success criteria. Proceed to committed engineering work to productionize.
Next step: Create engineering stories
Invalidated
Hypothesis disproven. Results do not meet criteria. Document learnings and move on.
Next step: Archive and inform future decisions
Inconclusive
Results unclear. May need more data, different approach, or refined hypothesis.
Next step: Decide whether to extend, pivot, or stop
An invalidated hypothesis is not a failure—it's valuable learning. Teams should celebrate experiments that quickly disprove bad ideas, saving the effort of building something that wouldn't work. The only true experiment failure is one that produces no actionable learning.
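Recording the outcome and its follow-up action explicitly keeps this discipline visible. A small sketch, again with assumed names, maps each outcome to the next step described above and refuses to close an experiment without documented learnings.

```python
from enum import Enum


class ExperimentOutcome(Enum):
    VALIDATED = "validated"
    INVALIDATED = "invalidated"
    INCONCLUSIVE = "inconclusive"


# Next step per outcome, as described above. "Failure" is deliberately absent:
# the only unacceptable result is an experiment with no documented learning.
NEXT_STEP = {
    ExperimentOutcome.VALIDATED: "Create engineering stories to productionize.",
    ExperimentOutcome.INVALIDATED: "Archive learnings; inform future decisions.",
    ExperimentOutcome.INCONCLUSIVE: "Decide whether to extend, pivot, or stop.",
}


def close_experiment(outcome: ExperimentOutcome, learnings: list) -> str:
    """An experiment may only be closed once its learnings are written down."""
    if not learnings:
        raise ValueError("Document what was learned before closing the experiment.")
    return NEXT_STEP[outcome]


print(close_experiment(ExperimentOutcome.INVALIDATED,
                       ["Reranking gained under 1 point; errors are label noise."]))
# -> Archive learnings; inform future decisions.
```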
Pod Ceremonies
Daily Standup (15 minutes)
Brief synchronization focused on blockers and coordination:
- Format: Each team member shares progress, plans, and blockers
- AI Adaptation: Include "experiment status" for running experiments
- Ethics Check: Quick flag for any emerging governance concerns
- Anti-pattern: Status reporting; keep it focused on coordination
Sprint Planning (2-4 hours)
Plan both committed work and experimental work for the sprint:
| Activity | Time | Participants |
|---|---|---|
| Review goals and priorities | 30 min | STO leads, full pod |
| Committed work selection & estimation | 60-90 min | Full pod |
| Experiment design & time-boxing | 30-60 min | Full pod, ML focus |
| Governance/ethics considerations | 15-30 min | Ethics Liaison leads |
| Capacity check & commitment | 15 min | Full pod |
Sprint Review (1-2 hours)
Demo completed work and share experiment learnings:
- Committed Work: Standard demo of completed features/capabilities
- Experiment Results: Share findings—validated, invalidated, or inconclusive
- Metrics Review: Update on key performance and governance metrics
- Stakeholder Feedback: Input from business sponsors and users
Sprint Retrospective (1 hour)
Continuous improvement focused on both delivery and AI-specific challenges:
- Did our experiments produce useful learning?
- Were our time boxes appropriate?
- Did we catch governance issues early enough?
- Are we maintaining appropriate committed/experimental balance?
- What technical debt is accumulating that we need to address?
Experiment Review (Weekly, 30 minutes)
AI-specific ceremony to manage experimental work:
- Active Experiments: Status check on running experiments
- Decision Points: Experiments reaching time-box limits
- Knowledge Sharing: Brief share of interesting findings
- Pipeline Management: Prioritization of potential future experiments
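If active experiments record their start date and time box (as in the earlier template sketch), the "decision points" portion of this review can be generated mechanically. A minimal, self-contained illustration with assumed names:

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ActiveExperiment:
    name: str
    started: date
    time_box_days: int

    def evaluation_due(self) -> date:
        return self.started + timedelta(days=self.time_box_days)


def decision_points(active: list, today: date, warn_days: int = 3) -> list:
    """Experiments whose time box ends within `warn_days`: each needs a
    validate / invalidate / extend decision at this week's review."""
    return [e.name for e in active
            if (e.evaluation_due() - today).days <= warn_days]


active = [
    ActiveExperiment("reranker-ablation", date(2024, 3, 4), 10),
    ActiveExperiment("feature-drift-probe", date(2024, 3, 11), 15),
]
print(decision_points(active, today=date(2024, 3, 15)))
# -> ['reranker-ablation']
```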
Model Card Review (Monthly or on significant change)
Governance ceremony to keep the Model Card current:
- Performance Review: Are we meeting documented targets?
- Fairness Metrics: Any drift in bias or fairness measures?
- Documentation Accuracy: Does the Model Card still reflect reality?
- Risk Assessment: Have risk factors changed?
- Update Actions: Required changes to Model Card or controls
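One lightweight way to support the performance and drift questions above is to compare current metrics against the targets documented in the Model Card. The sketch below is hypothetical and not tied to any particular Model Card format; the metric names and tolerance are illustrative.

```python
def model_card_drift(documented: dict, observed: dict, tolerance: float = 0.02) -> dict:
    """Return metrics whose observed value has drifted beyond `tolerance`
    from the target documented in the Model Card."""
    return {
        metric: {"documented": target, "observed": observed[metric]}
        for metric, target in documented.items()
        if metric in observed and abs(observed[metric] - target) > tolerance
    }


# Illustrative values only.
documented = {"accuracy": 0.91, "false_positive_rate_gap": 0.03}
observed = {"accuracy": 0.88, "false_positive_rate_gap": 0.035}
print(model_card_drift(documented, observed))
# -> {'accuracy': {'documented': 0.91, 'observed': 0.88}}
```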
Ceremony Calendar
A typical two-week sprint might look like:
| Week 1 | Week 2 |
|---|---|
| Monday: Sprint Planning; Daily: Standup; Friday: Experiment Review | Daily: Standup; Thursday: Sprint Review; Friday: Retrospective |