3.2 Agile for AI: Sprints, Epics, and Experiments
Standard Agile methodologies were designed for deterministic software development—where writing code produces predictable outputs. AI development is fundamentally different: experiments may fail, models may not converge, and "done" is often a moving target. AI Innovations adapt Agile practices to accommodate the inherent uncertainty of AI work while maintaining the discipline of iterative delivery.
Traditional Agile assumes that committed work can be completed with reasonable confidence. AI work includes experiments where the outcome is genuinely uncertain. The AI Innovation approach treats experiments as first-class work items with their own planning and completion criteria, separate from deterministic engineering tasks.
Why AI Development is Different
The Uncertainty Dimension
AI work contains inherent uncertainty that traditional software development does not:
| Traditional Software | AI Development |
|---|---|
| Requirements can be fully specified | Performance targets may be aspirational |
| Implementation is largely deterministic | Model training has stochastic elements |
| "Done" is clearly defined | "Good enough" requires judgment |
| Bugs are fixable | Model limitations may be inherent |
| Effort can be estimated with reasonable confidence | Experiments may succeed or fail unpredictably |
Types of AI Work
AI pods handle three distinct types of work, each with different planning characteristics:
Engineering Work
Deterministic tasks with predictable outcomes: building pipelines, creating APIs, implementing monitoring, writing tests.
Planning approach: Standard Agile estimation
Research Experiments
Exploratory work with uncertain outcomes: trying new architectures, testing hypotheses, exploring data patterns.
Planning approach: Time-boxed, outcome uncertain
Model Iteration
Incremental improvement work: tuning hyperparameters, adding features, addressing bias, fixing edge cases.
Planning approach: Hybrid—effort known, improvement uncertain
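To keep these distinctions visible in the backlog, some pods tag each work item with its type so planning treats it appropriately. The sketch below is only an illustration; the `WorkType` and `WorkItem` names and fields are assumptions, not part of any prescribed tooling.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class WorkType(Enum):
    """The three kinds of work an AI pod plans differently."""
    ENGINEERING = "engineering"          # deterministic; point-estimated
    EXPERIMENT = "experiment"            # time-boxed; outcome uncertain
    MODEL_ITERATION = "model_iteration"  # effort known; improvement uncertain


@dataclass
class WorkItem:
    title: str
    work_type: WorkType
    story_points: Optional[int] = None   # set for committed work
    time_box_days: Optional[int] = None  # set for experiments

    def counts_toward_velocity(self) -> bool:
        # Only point-estimated work feeds velocity; time-boxed experiments do not.
        return self.story_points is not None


backlog = [
    WorkItem("Build feature pipeline", WorkType.ENGINEERING, story_points=5),
    WorkItem("Evaluate a reranking step", WorkType.EXPERIMENT, time_box_days=10),
]
print([item.title for item in backlog if item.counts_toward_velocity()])
# -> ['Build feature pipeline']
```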
Adapted Sprint Structure
The Two-Track Sprint
AI Innovations run sprints with two parallel tracks that are planned and reviewed differently:
Track 1: Committed Work
Engineering tasks and well-understood model work that the pod commits to completing. These are planned with standard story points and included in velocity calculations.
- Infrastructure and pipeline development
- API and integration work
- Known model improvements
- Documentation and compliance tasks
- Bug fixes and technical debt
Track 2: Experimental Work
Research and exploratory tasks where the outcome is uncertain. These are time-boxed (not point-estimated) and success is measured by learning, not delivery.
- Architecture experiments
- Data exploration
- Hypothesis testing
- Performance improvement attempts
- Novel technique evaluation
Sprint Allocation
The balance between tracks varies by product maturity:
| Product Stage | Committed Work | Experimental Work |
|---|---|---|
| Exploration | 20-30% | 70-80% |
| Development | 50-60% | 40-50% |
| Production | 70-80% | 20-30% |
| Mature Operations | 80-90% | 10-20% |
Note: Even mature products should maintain some experimental capacity for continuous improvement and innovation.
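To make the split concrete, a pod can translate these ranges into capacity targets for a sprint. The sketch below is a minimal illustration; using the midpoint of each range and the `split_capacity` helper are assumptions, not prescribed rules.

```python
# Midpoints of the ranges in the table above (an assumption, not a rule).
ALLOCATION_BY_STAGE = {
    "exploration":       {"committed": 0.25, "experimental": 0.75},
    "development":       {"committed": 0.55, "experimental": 0.45},
    "production":        {"committed": 0.75, "experimental": 0.25},
    "mature_operations": {"committed": 0.85, "experimental": 0.15},
}


def split_capacity(total_person_days: float, stage: str) -> dict:
    """Split a sprint's total capacity into committed vs. experimental days."""
    shares = ALLOCATION_BY_STAGE[stage]
    return {track: round(total_person_days * share, 1)
            for track, share in shares.items()}


# Example: a 6-person pod with 8 productive days each in a two-week sprint.
print(split_capacity(48, "development"))
# -> {'committed': 26.4, 'experimental': 21.6}
```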
Experiments as First-Class Work
Experiment Design
Every experiment should be structured with the following elements (a template sketch appears after the list):
Hypothesis
A clear, testable statement of what we believe and why. "We believe [approach X] will improve [metric Y] by [amount Z] because [reasoning]."
Time Box
A strict limit on how long the experiment runs before evaluation. Experiments that drag on without conclusion waste resources and create uncertainty.
Success Criteria
Specific, measurable outcomes that would validate the hypothesis. What results would make us proceed? What results would make us stop?
Minimum Viable Experiment
The smallest version of the experiment that could validate the hypothesis. Don't build production infrastructure for an experiment.
Learning Documentation
Required output regardless of outcome. What did we learn? How does this inform next steps? This prevents "failed" experiments from being wasted effort.
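One way to hold experiments to this structure is a shared template that every experiment fills in before it starts. The following is a minimal sketch, assuming a Python dataclass; the field names and example values are illustrative, not a mandated schema.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Experiment:
    """Minimal record of the experiment design elements described above."""
    hypothesis: str        # "We believe X will improve Y by Z because ..."
    success_criteria: str  # measurable result that would validate the hypothesis
    stop_criteria: str     # result that would make us stop
    time_box_days: int     # hard limit before evaluation
    started: date
    learnings: list = field(default_factory=list)  # required output, whatever the outcome

    def evaluation_due(self) -> date:
        """Date by which the experiment must be evaluated, per its time box."""
        return self.started + timedelta(days=self.time_box_days)


# Illustrative example (values are invented for the sketch).
exp = Experiment(
    hypothesis=("We believe adding a reranking step will improve top-5 precision "
                "by 3 points because current errors cluster near the decision boundary."),
    success_criteria="Top-5 precision improves by at least 3 points on the holdout set.",
    stop_criteria="Improvement below 1 point, or the latency budget is exceeded.",
    time_box_days=10,
    started=date(2024, 3, 4),
)
print(exp.evaluation_due())  # -> 2024-03-14
```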
Experiment Outcomes
Experiments have three possible outcomes, all of which are valuable:
Validated
Hypothesis confirmed. Results meet success criteria. Proceed to committed engineering work to productionize.
Next step: Create engineering stories
Invalidated
Hypothesis disproven. Results do not meet criteria. Document learnings and move on.
Next step: Archive and inform future decisions
Inconclusive
Results unclear. May need more data, different approach, or refined hypothesis.
Next step: Decide whether to extend, pivot, or stop
An invalidated hypothesis is not a failure—it's valuable learning. Teams should celebrate experiments that quickly disprove bad ideas, saving the effort of building something that wouldn't work. The only true experiment failure is one that produces no actionable learning.
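Recording the outcome and its follow-up action explicitly keeps this discipline visible. A small sketch, again with assumed names, maps each outcome to the next step described above and refuses to close an experiment without documented learnings.

```python
from enum import Enum


class ExperimentOutcome(Enum):
    VALIDATED = "validated"
    INVALIDATED = "invalidated"
    INCONCLUSIVE = "inconclusive"


# Next step per outcome, as described above. "Failure" is deliberately absent:
# the only unacceptable result is an experiment with no documented learning.
NEXT_STEP = {
    ExperimentOutcome.VALIDATED: "Create engineering stories to productionize.",
    ExperimentOutcome.INVALIDATED: "Archive learnings; inform future decisions.",
    ExperimentOutcome.INCONCLUSIVE: "Decide whether to extend, pivot, or stop.",
}


def close_experiment(outcome: ExperimentOutcome, learnings: list) -> str:
    """An experiment may only be closed once its learnings are written down."""
    if not learnings:
        raise ValueError("Document what was learned before closing the experiment.")
    return NEXT_STEP[outcome]


print(close_experiment(ExperimentOutcome.INVALIDATED,
                       ["Reranking gained under 1 point; errors are label noise."]))
# -> Archive learnings; inform future decisions.
```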
Pod Ceremonies
Daily Standup (15 minutes)
Brief synchronization focused on blockers and coordination:
- Format: Each team member shares progress, plans, and blockers
- AI Adaptation: Include "experiment status" for running experiments
- Ethics Check: Quick flag for any emerging governance concerns
- Anti-pattern: Status reporting; keep it focused on coordination
Sprint Planning (2-4 hours)
Plan both committed work and experimental work for the sprint:
| Activity | Time | Participants |
|---|---|---|
| Review goals and priorities | 30 min | STO leads, full pod |
| Committed work selection & estimation | 60-90 min | Full pod |
| Experiment design & time-boxing | 30-60 min | Full pod, ML focus |
| Governance/ethics considerations | 15-30 min | Ethics Liaison leads |
| Capacity check & commitment | 15 min | Full pod |
Sprint Review (1-2 hours)
Demo completed work and share experiment learnings:
- Committed Work: Standard demo of completed features/capabilities
- Experiment Results: Share findings—validated, invalidated, or inconclusive
- Metrics Review: Update on key performance and governance metrics
- Stakeholder Feedback: Input from business sponsors and users
Sprint Retrospective (1 hour)
Continuous improvement focused on both delivery and AI-specific challenges:
- Did our experiments produce useful learning?
- Were our time boxes appropriate?
- Did we catch governance issues early enough?
- Are we maintaining appropriate committed/experimental balance?
- What technical debt is accumulating that we need to address?
Experiment Review (Weekly, 30 minutes)
AI-specific ceremony to manage experimental work:
- Active Experiments: Status check on running experiments
- Decision Points: Experiments reaching time-box limits
- Knowledge Sharing: Brief share of interesting findings
- Pipeline Management: Prioritization of potential future experiments
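If active experiments record their start date and time box (as in the earlier template sketch), the "decision points" portion of this review can be generated mechanically. A minimal, self-contained illustration with assumed names:

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ActiveExperiment:
    name: str
    started: date
    time_box_days: int

    def evaluation_due(self) -> date:
        return self.started + timedelta(days=self.time_box_days)


def decision_points(active: list, today: date, warn_days: int = 3) -> list:
    """Experiments whose time box ends within `warn_days`: each needs a
    validate / invalidate / extend decision at this week's review."""
    return [e.name for e in active
            if (e.evaluation_due() - today).days <= warn_days]


active = [
    ActiveExperiment("reranker-ablation", date(2024, 3, 4), 10),
    ActiveExperiment("feature-drift-probe", date(2024, 3, 11), 15),
]
print(decision_points(active, today=date(2024, 3, 15)))
# -> ['reranker-ablation']
```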
Model Card Review (Monthly or on significant change)
Governance ceremony to keep the Model Card current:
- Performance Review: Are we meeting documented targets?
- Fairness Metrics: Any drift in bias or fairness measures?
- Documentation Accuracy: Does the Model Card still reflect reality?
- Risk Assessment: Have risk factors changed?
- Update Actions: Required changes to Model Card or controls
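One lightweight way to support the performance and drift questions above is to compare current metrics against the targets documented in the Model Card. The sketch below is hypothetical and not tied to any particular Model Card format; the metric names and tolerance are illustrative.

```python
def model_card_drift(documented: dict, observed: dict, tolerance: float = 0.02) -> dict:
    """Return metrics whose observed value has drifted beyond `tolerance`
    from the target documented in the Model Card."""
    return {
        metric: {"documented": target, "observed": observed[metric]}
        for metric, target in documented.items()
        if metric in observed and abs(observed[metric] - target) > tolerance
    }


# Illustrative values only.
documented = {"accuracy": 0.91, "false_positive_rate_gap": 0.03}
observed = {"accuracy": 0.88, "false_positive_rate_gap": 0.035}
print(model_card_drift(documented, observed))
# -> {'accuracy': {'documented': 0.91, 'observed': 0.88}}
```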
Ceremony Calendar
A typical two-week sprint might look like:
| Week 1 | Week 2 |
|---|---|
| Monday: Sprint Planning; Daily: Standup; Friday: Experiment Review | Daily: Standup; Thursday: Sprint Review; Friday: Retrospective |