6.3 Shared Services & Platform Teams
As the number of pods grows, certain capabilities become more efficient to centralize: MLOps infrastructure, model serving, governance tooling, and specialized expertise. Platform teams provide these shared services, enabling product pods to focus on their unique value while building on common foundations. The challenge is delivering these services without creating the bottlenecks that shared services typically introduce.
Platform teams exist to accelerate product pods, not to control them. A well-designed platform is self-service: pods can use it without waiting for platform team help. The platform team's success is measured by pod velocity, not by how busy the platform team is.
The Platform Team Model
Platform vs. Product Teams
Platform teams operate differently from product pods:
| Dimension | Product Pod | Platform Team |
|---|---|---|
| Customer | External users, business stakeholders | Other pods (internal customers) |
| Success Metric | Business outcomes (revenue, efficiency) | Pod adoption, velocity, satisfaction |
| Ownership Model | Single AI product cradle-to-grave | Shared capabilities used by many |
| Autonomy | High autonomy within guardrails | Must maintain compatibility, standards |
| Prioritization | STO decides based on product needs | Balances needs across all pods |
Self-Service Principle
Effective platforms minimize the need for human intervention:
Level 1: Documentation
Comprehensive docs that answer most questions without asking anyone
Level 2: Automation
Self-service tools that let pods provision what they need
Level 3: Guardrails
Automated checks that enable safe self-service (see the sketch after this list)
Level 4: Support
Human help only for edge cases and escalations
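To make Levels 2 and 3 concrete, here is a minimal sketch of a self-service provisioning flow with automated guardrails. The request schema, the quota limits, and the escalation behavior are illustrative assumptions, not a reference to any particular platform.

```python
from dataclasses import dataclass

# Illustrative guardrail limits; real values would come from platform policy.
MAX_GPUS_WITHOUT_REVIEW = 8
APPROVED_DATA_TIERS = {"public", "internal"}  # "restricted" data needs human review


@dataclass
class EnvRequest:
    """A pod's self-service request for a training environment (hypothetical schema)."""
    pod_name: str
    gpu_count: int
    data_tier: str


def check_guardrails(request: EnvRequest) -> list[str]:
    """Level 3: automated checks that decide whether the request can proceed unattended."""
    violations = []
    if request.gpu_count > MAX_GPUS_WITHOUT_REVIEW:
        violations.append(f"{request.gpu_count} GPUs exceeds the self-service limit of {MAX_GPUS_WITHOUT_REVIEW}")
    if request.data_tier not in APPROVED_DATA_TIERS:
        violations.append(f"data tier '{request.data_tier}' requires human approval")
    return violations


def provision(request: EnvRequest) -> str:
    """Level 2: provision automatically when guardrails pass; escalate (Level 4) otherwise."""
    violations = check_guardrails(request)
    if violations:
        return "escalated to platform team: " + "; ".join(violations)
    return f"environment provisioned for pod '{request.pod_name}'"


print(provision(EnvRequest("fraud-detection", gpu_count=4, data_tier="internal")))
```

The same pattern applies to any platform resource: the guardrails, not a human reviewer, decide whether a routine request proceeds.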
Common AI Platform Services
MLOps Platform
Infrastructure for building, training, and deploying models:
| Capability | What It Provides | Self-Service Level |
|---|---|---|
| Experiment Tracking | MLflow or similar for tracking experiments (example below) | Full self-service |
| Training Infrastructure | GPU clusters, distributed training | Self-service provisioning |
| Model Registry | Versioned model storage and metadata | Full self-service |
| Model Serving | Scalable inference infrastructure | Self-service deployment |
| Feature Store | Shared feature definitions and serving | Self-service with review for new features |
| Monitoring | Model performance and drift monitoring | Pre-configured dashboards |
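For the experiment-tracking capability, a pod-side sketch of what full self-service can look like with MLflow follows; the tracking-server URL and the experiment and parameter names are illustrative placeholders.

```python
import mlflow

# Point at the shared tracking server run by the platform team (URL is a placeholder).
mlflow.set_tracking_uri("http://mlflow.platform.internal")
mlflow.set_experiment("churn-model")

with mlflow.start_run(run_name="baseline"):
    # Pods log their own parameters and metrics; no platform-team ticket required.
    mlflow.log_params({"model_type": "gradient_boosting", "max_depth": 6})
    mlflow.log_metric("validation_auc", 0.87)
```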
Data Platform
Infrastructure for data access and management:
- Data Catalog: Discoverable inventory of available data
- Data Access: Governed access to training and inference data
- Data Quality: Automated quality monitoring and alerting
- Privacy Controls: PII detection, anonymization, consent management (sketched below)
- Data Pipelines: Reusable ETL/ELT frameworks
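The privacy controls are a natural place for automated guardrails. Below is a deliberately simplified sketch of a PII scan a data pipeline might run before publishing a dataset to the catalog; the regex patterns and record schema are illustrative, and a production scanner would use far richer detection.

```python
import re

# Illustrative patterns only; a production scanner would use far richer detection.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def scan_for_pii(records: list[dict]) -> dict[str, list[str]]:
    """Report which PII types appear in which columns, the kind of check a
    pipeline guardrail might run before a dataset is published to the catalog."""
    findings: dict[str, list[str]] = {}
    for record in records:
        for column, value in record.items():
            for pii_type, pattern in PII_PATTERNS.items():
                if isinstance(value, str) and pattern.search(value):
                    if pii_type not in findings.setdefault(column, []):
                        findings[column].append(pii_type)
    return findings


sample = [{"name": "A. Lovelace", "contact": "ada@example.com", "note": "call 555-123-4567"}]
print(scan_for_pii(sample))  # {'contact': ['email'], 'note': ['phone']}
```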
Evaluation Platform
Tools for assessing model quality and fairness:
- Fairness Testing: Automated bias evaluation against multiple metrics (sketched below)
- Performance Benchmarking: Standardized evaluation suites
- A/B Testing: Infrastructure for production experiments
- Human Evaluation: Tools for collecting human judgments
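As an example of what automated fairness testing can look like, a minimal demographic-parity check follows; the metric choice, the group labels, and the 0.40 threshold are illustrative assumptions, and real evaluation suites compute several metrics across many data slices.

```python
def demographic_parity_gap(outcomes: list[tuple[str, int]]) -> float:
    """Difference in positive-outcome rates between groups, one of several
    fairness metrics an evaluation platform might compute automatically.

    `outcomes` is a list of (group_label, prediction) pairs, prediction in {0, 1}.
    """
    by_group: dict[str, list[int]] = {}
    for group, prediction in outcomes:
        by_group.setdefault(group, []).append(prediction)
    rates = [sum(preds) / len(preds) for preds in by_group.values()]
    return max(rates) - min(rates)


# Illustrative gate: fail the pipeline if the gap exceeds an agreed threshold.
results = [("group_a", 1), ("group_a", 1), ("group_a", 0),
           ("group_b", 1), ("group_b", 0), ("group_b", 0)]
gap = demographic_parity_gap(results)
assert gap <= 0.40, f"demographic parity gap {gap:.2f} exceeds threshold"
```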
Governance Services
Centralized Governance Capabilities
Some governance functions are more efficient when centralized:
Ethics Liaison Pool
Central management of Ethics Liaisons, ensuring consistent training and shared practices across pods.
Model Card System
Standardized tooling for creating, maintaining, and reviewing Model Cards across all pods.
Risk Assessment Tools
Consistent frameworks and tools for risk tier classification and assessment.
Audit Support
Central coordination of internal and external audits, shared evidence collection.
Governance Guardrail Implementation
Central teams implement automated guardrails used by all pods:
| Guardrail | Implementation | Ownership |
|---|---|---|
| Fairness testing | Shared library integrated into CI/CD | Evaluation Platform team |
| PII detection | Automated scanning in data pipelines | Data Platform team |
| Model Card validation | Pre-deploy completeness check (example below) | Governance Services |
| Security scanning | Dependency and code scanning | Security team |
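A sketch of the Model Card validation guardrail, assuming cards are represented as simple dictionaries; the required fields listed here are an assumption for illustration, since each organization defines its own template.

```python
# Required fields are an illustrative assumption; each organization defines its own template.
REQUIRED_MODEL_CARD_FIELDS = [
    "intended_use",
    "training_data",
    "evaluation_results",
    "fairness_assessment",
    "limitations",
    "risk_tier",
]


def validate_model_card(card: dict) -> list[str]:
    """Return missing or empty fields; an empty result means the pre-deploy check passes."""
    return [field for field in REQUIRED_MODEL_CARD_FIELDS if not card.get(field)]


card = {"intended_use": "Fraud scoring for card transactions", "risk_tier": "high"}
missing = validate_model_card(card)
if missing:
    raise SystemExit(f"Model Card incomplete, blocking deploy: {missing}")
```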
Operating Model for Platform Teams
Service Level Agreements
Platform teams define SLAs that pods can rely on:
- Availability: 99.9% uptime for production model serving (a measurement sketch follows this list)
- Provisioning: New training environments available via self-service within 4 hours
- Support Response: P1 issues acknowledged within 15 minutes
- Feature Requests: Initial triage within 1 week
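SLAs are only meaningful if they are measured. Here is a minimal sketch of how the availability target might be checked over a reporting window; how downtime is attributed and aggregated in practice is left out, and the measurement approach is an assumption.

```python
from datetime import timedelta

# Targets from the SLA list above; the other targets would be checked the same way.
AVAILABILITY_TARGET = 0.999                 # production model serving uptime
PROVISIONING_TARGET = timedelta(hours=4)    # self-service training environments
P1_ACK_TARGET = timedelta(minutes=15)       # support response for P1 issues


def availability(window: timedelta, downtime: timedelta) -> float:
    """Fraction of the reporting window during which model serving was up."""
    return 1 - downtime / window


month = timedelta(days=30)
measured = availability(month, downtime=timedelta(minutes=40))
print(f"availability {measured:.4%}; SLA met: {measured >= AVAILABILITY_TARGET}")
```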
Prioritization Framework
Platform teams must balance competing pod requests:
Collect Demand
Regular intake process for pod requests. Track patterns to identify common needs.
Assess Impact
How many pods benefit? How much time is saved? What's the strategic importance? (A scoring sketch follows this framework.)
Balance Portfolio
Mix of quick wins (immediate pod help), investments (long-term capability), and maintenance.
Transparent Communication
Share roadmap with pods so they can plan. Be honest about what won't make the cut.
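To illustrate the impact-assessment step, a toy scoring heuristic is sketched below; the formula, the weights, and the example requests are assumptions for illustration, not a prescribed method.

```python
def impact_score(pods_benefiting: int, hours_saved_per_pod_month: float,
                 strategic_weight: float = 1.0) -> float:
    """Toy triage heuristic: total pod-hours saved per month, scaled by strategic weight."""
    return pods_benefiting * hours_saved_per_pod_month * strategic_weight


# Hypothetical requests; the scores only illustrate how a multi-pod win outranks a single-pod ask.
requests = {
    "shared feature-store connector": impact_score(6, 10),
    "one pod's custom serving runtime": impact_score(1, 20),
    "automated fairness report": impact_score(4, 5, strategic_weight=1.5),
}
for name, score in sorted(requests.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:6.1f}  {name}")
```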
Avoiding Platform Traps
- The Bottleneck: Platform team becomes gatekeeper that pods must wait for
- The Ivory Tower: Platform builds what's technically elegant, not what pods need
- The One-Size-Fits-All: Forces all pods into identical patterns regardless of needs
- The Overbuilder: Builds comprehensive platforms before demand exists
- The Understaffed: Platform can't keep up, forcing pods to work around it
Success Metrics for Platform Teams
The metrics that matter mirror the success column in the comparison table above: pod adoption, pod velocity, pod satisfaction, and the share of platform usage that is fully self-service.
Platform as Product
The most successful platform teams treat their platforms as products, with pods as customers. They conduct user research, measure satisfaction, iterate based on feedback, and compete (in a sense) with the alternative of pods building their own solutions. This product mindset prevents platforms from becoming bureaucratic obstacles and keeps them focused on enabling pod success.