Operate
The Operate phase involves the day-to-day running of the software. This is where the value is delivered to the user - and where the consequences of every previous phase become real. Operations is not an afterthought; it is the sustained effort that keeps your product available, performant, and trustworthy.
SRE Principles
SRE (Site Reliability Engineering) - pioneered at Google - provides a framework for running reliable systems at scale. Its core ideas apply to teams of any size.
Error Budgets
An error budget is the inverse of your reliability target. If your SLO is 99.9% availability, you have a 0.1% error budget - roughly 43 minutes of downtime per month. This budget creates a powerful decision framework:
- Budget remaining? Ship features, take risks, move fast.
- Budget exhausted? Freeze feature work and focus entirely on reliability improvements.
This eliminates the perpetual tension between "ship faster" and "be more reliable" by making the trade-off data-driven.
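The budget arithmetic is simple enough to sketch. A minimal illustration in Python, using the 99.9% SLO and 30-day window from above:

```python
from datetime import timedelta

def error_budget(slo: float, window_days: int = 30) -> timedelta:
    """Downtime allowed per window for a given availability SLO."""
    return timedelta(days=window_days) * (1 - slo)

def can_ship(budget: timedelta, downtime_so_far: timedelta) -> bool:
    """Ship features while budget remains; freeze when it is exhausted."""
    return downtime_so_far < budget

budget = error_budget(0.999)                     # 43m12s per 30-day month
print(budget)                                    # 0:43:12
print(can_ship(budget, timedelta(minutes=20)))   # True: keep shipping
print(can_ship(budget, timedelta(minutes=50)))   # False: freeze and fix reliability
```

The point of encoding it is that the ship/freeze decision becomes mechanical rather than a negotiation.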
Toil Reduction
Toil is manual, repetitive operational work that scales linearly with service growth. SRE teams aim to spend no more than 50% of their time on toil - the rest goes to engineering work that eliminates toil permanently.
- Identify the top 5 most time-consuming manual operations.
- Automate them systematically, starting with the highest-frequency tasks.
- Track toil hours to measure progress over time.
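One way to make the "top 5" concrete is to log every manual intervention and rank tasks by the total time they consume. A hypothetical sketch - the task names and durations are made up:

```python
from collections import Counter

# Hypothetical log of manual operations: (task, minutes spent)
toil_log = [
    ("rotate-certs", 30),
    ("restart-worker", 10),
    ("restart-worker", 10),
    ("resize-disk", 45),
    ("restart-worker", 10),
    ("clear-queue", 20),
]

def automation_candidates(log, top_n=5):
    """Rank tasks by total time consumed: automate the biggest first."""
    totals = Counter()
    for task, minutes in log:
        totals[task] += minutes
    return totals.most_common(top_n)

print(automation_candidates(toil_log))
# [('resize-disk', 45), ('rotate-certs', 30), ('restart-worker', 30), ('clear-queue', 20)]
```

Even a spreadsheet works; what matters is that the ranking is measured, not guessed.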
Incident Management
When things break (and they will), a structured incident process minimises user impact and accelerates recovery.
Severity Levels
Define clear severity levels so everyone knows how urgently to respond:
| Severity | Impact | Response Time | Examples |
|---|---|---|---|
| SEV-1 (Critical) | Complete service outage or data loss | Immediate (all hands) | Production down, security breach, data corruption |
| SEV-2 (Major) | Significant degradation affecting many users | Within 30 minutes | Core feature broken, major performance issue |
| SEV-3 (Minor) | Limited impact, workaround available | Within business hours | Non-critical feature bug, minor performance degradation |
| SEV-4 (Low) | Minimal impact, cosmetic or edge case | Next sprint | UI glitch, minor logging issue |
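The table above can be encoded as data, so that alert routing is driven by policy rather than judgement calls made mid-incident. A sketch - the concrete deadlines for SEV-3 and SEV-4 are illustrative interpretations of "business hours" and "next sprint":

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class Severity:
    label: str
    respond_within: timedelta  # max time before someone is actively responding
    page_oncall: bool          # page immediately, or wait for working hours?

SEVERITIES = {
    1: Severity("SEV-1 (Critical)", timedelta(0), page_oncall=True),
    2: Severity("SEV-2 (Major)", timedelta(minutes=30), page_oncall=True),
    3: Severity("SEV-3 (Minor)", timedelta(hours=8), page_oncall=False),
    4: Severity("SEV-4 (Low)", timedelta(days=14), page_oncall=False),
}

def should_page(level: int) -> bool:
    """Only SEV-1 and SEV-2 interrupt someone's evening."""
    return SEVERITIES[level].page_oncall
```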
On-Call Rotations
- Rotate on-call responsibility across the team - no single person should carry the pager permanently.
- Provide clear runbooks for common alert scenarios so any team member can respond effectively.
- Compensate on-call engineers appropriately (time off in lieu, on-call payments, or reduced sprint commitments).
- Track on-call load - if one team gets paged 10x more than others, that is a reliability signal that needs attention.
Post-Incident Reviews (Blameless Postmortems)
After every SEV-1 and SEV-2 incident, conduct a blameless postmortem covering:
- Timeline: What happened, when, and what actions were taken.
- Root Cause: What underlying conditions allowed this to happen.
- Contributing Factors: What made detection or recovery slower.
- Action Items: Specific, assigned tasks to prevent recurrence - with deadlines.
- Learnings: What the team learned and what went well during the response.
The "blameless" part is non-negotiable. If engineers fear blame, they will hide information, and your postmortems will be useless. The goal is to fix the system, not the person. Make postmortems visible (publish them internally) to spread learning across the organisation.
Capacity Planning and Scaling
Capacity Planning
Capacity planning ensures your infrastructure can handle current and future load.
- Horizontal Scaling: Adding more instances behind a load balancer. Preferred for stateless services.
- Vertical Scaling: Adding more resources (CPU, RAM) to existing instances. Simpler but has an upper limit.
- Auto-Scaling: Configure automatic scaling policies that add/remove capacity based on real-time metrics (CPU usage, request queue depth, custom metrics).
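The target-tracking rule used by most auto-scalers (including Kubernetes' Horizontal Pod Autoscaler) fits in a few lines: scale the replica count proportionally to how far the observed metric sits from its target. The 60% CPU target and replica bounds here are illustrative:

```python
import math

def desired_replicas(current: int, cpu_utilisation: float,
                     target: float = 0.60,
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Target-tracking scaling: grow or shrink the fleet so the
    per-instance metric converges on its target, clamped to sane bounds."""
    desired = math.ceil(current * cpu_utilisation / target)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(4, 0.90))  # 6: running hot, add capacity
print(desired_replicas(4, 0.30))  # 2: over-provisioned, scale in
```

The clamping matters in practice: a metrics outage or divide-by-small-number glitch should never scale you to zero or to a thousand instances.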
Runbooks
Operational runbooks are living documents that capture the "how to" for common operational tasks:
- How to restart a crashed service
- How to failover to a backup database
- How to investigate and resolve common alert types
- How to perform emergency rollbacks
Automate steps where possible - the best runbook is one that executes itself.
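A runbook that is halfway to executing itself can be modelled as an ordered list of named steps, each either a prompt for a human or a callable. All names here are illustrative placeholders:

```python
from typing import Callable

def restart_service(name: str) -> None:
    # Placeholder: in practice this calls your orchestrator's API.
    print(f"restarting {name} ...")

# Runbook for a hypothetical "service is crash-looping" alert.
RUNBOOK: list[tuple[str, Callable[[], None]]] = [
    ("Check the health endpoint", lambda: print("GET /healthz")),
    ("Restart the service", lambda: restart_service("checkout")),
    ("Confirm error rate has recovered", lambda: print("check dashboard")),
]

def execute(runbook) -> list[str]:
    """Run each step in order, returning the names of completed steps."""
    completed = []
    for name, action in runbook:
        action()
        completed.append(name)
    return completed
```

Steps start as documented prompts and are swapped for real automation one at a time - the document and the automation never drift apart because they are the same artifact.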
Security Operations
- Threat Detection: Monitor for suspicious activity (unusual login patterns, privilege escalation, data exfiltration attempts).
- Patch Management: Apply security patches promptly. Define remediation timelines for critical vulnerabilities (e.g. patch within 48 hours for CVSS score 9+).
- Access Control: Implement least-privilege access. Review and audit access permissions quarterly.
- Compliance: Maintain audit logs and access records required by your regulatory framework.
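The 48-hour rule for critical vulnerabilities can be turned into a mechanical deadline check. Only the 9.0+/48-hour tier comes from the policy above; the other tiers are illustrative assumptions:

```python
from datetime import datetime, timedelta

def patch_deadline(cvss_score: float, published: datetime) -> datetime:
    """Deadline for applying a patch, tiered by CVSS severity."""
    if cvss_score >= 9.0:
        return published + timedelta(hours=48)  # critical: per the policy above
    if cvss_score >= 7.0:
        return published + timedelta(days=7)    # high: illustrative tier
    return published + timedelta(days=30)       # everything else: illustrative

published = datetime(2024, 1, 1, 9, 0)
print(patch_deadline(9.8, published))  # 2024-01-03 09:00:00
```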
Operations by Company Stage
- Startup: The founders and engineers are the on-call team. Use simple alerting (PagerDuty or Opsgenie free tier, or even Slack alerts). Write basic runbooks for the 3-5 most likely failure scenarios. Use cloud-managed services (managed databases, serverless platforms, and the like) to minimise operational burden. Postmortems can be informal but should still happen.
- Growth Stage: Hire or designate a dedicated SRE or platform team. Implement formal on-call rotations with compensation. Define SLOs for key services and track error budgets. Build a comprehensive runbook library. Invest in auto-scaling and capacity monitoring. Run a structured postmortem process with published learnings.
- Established: Operate a 24/7 or follow-the-sun on-call model. Run chaos engineering programs (Chaos Monkey, Gremlin, Litmus) to proactively test resilience. Maintain formal SLA management with contractual commitments. Hold regular disaster recovery drills. Undergo compliance audits (e.g. SOC 2, ISO 27001) covering operational practices.
Common Pitfalls
- Hero Culture: One person who "knows everything" and is always called when things break. This is a single point of failure for the organisation. Distribute knowledge through runbooks, paired on-call shifts, and documentation.
- No Postmortems: Without structured learning from incidents, the same failures repeat. Blameless postmortems are not optional - they are the primary mechanism for improving reliability.
- Alert Fatigue: Too many low-priority or noisy alerts cause engineers to ignore all alerts - including the critical ones. Tune alerts ruthlessly: every alert should be actionable and require human intervention.
- Ignoring Toil: Accepting manual operational work as "just the way things are" means your team scales linearly with your infrastructure. Automate relentlessly.
Operations Key Deliverables
- Service Level Objectives (SLOs) and Error Budgets
- On-Call Schedule and Escalation Policies
- Operational Runbooks
- Post-Incident Review Reports
- Capacity Plans
AI in Operations
AI is critical for managing the complexity of modern distributed systems:
- Self-Healing: AI enables systems to recover automatically by rerouting traffic or scaling capacity. Cisco AppDynamics uses AI to minimise downtime.
- Intelligent Incident Response: PagerDuty uses ML to automate incident triage and improve alert routing.
- Security: Palo Alto Networks and Splunk use AI to detect and block threats in real-time.