Operate
The Operate phase involves the day-to-day running of the software. This is where the value is delivered to the user - and where the consequences of every previous phase become real. Operations is not an afterthought; it is the sustained effort that keeps your product available, performant, and trustworthy.
SRE Principles
SRE (Site Reliability Engineering) - pioneered at Google - provides a framework for running reliable systems at scale. Its core ideas apply to teams of any size.
Error Budgets
An error budget is the inverse of your reliability target. If your SLO is 99.9% availability, you have a 0.1% error budget - roughly 43 minutes of downtime per month. This budget creates a powerful decision framework:
- Budget remaining? Ship features, take risks, move fast.
- Budget exhausted? Freeze feature work and focus entirely on reliability improvements.
This eliminates the perpetual tension between "ship faster" and "be more reliable" by making the trade-off data-driven.
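The budget arithmetic is simple enough to sketch. A minimal illustration in Python, using the 99.9% SLO and 30-day window from above:

```python
from datetime import timedelta

def error_budget(slo: float, window_days: int = 30) -> timedelta:
    """Downtime allowed per window for a given availability SLO."""
    return timedelta(days=window_days) * (1 - slo)

def can_ship(budget: timedelta, downtime_so_far: timedelta) -> bool:
    """Ship features while budget remains; freeze when it is exhausted."""
    return downtime_so_far < budget

budget = error_budget(0.999)                     # 43m12s per 30-day month
print(budget)                                    # 0:43:12
print(can_ship(budget, timedelta(minutes=20)))   # True: keep shipping
print(can_ship(budget, timedelta(minutes=50)))   # False: freeze and fix reliability
```

The point of encoding it is that the ship/freeze decision becomes mechanical rather than a negotiation.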
Toil Reduction
Toil is manual, repetitive operational work that scales linearly with service growth. SRE teams aim to spend no more than 50% of their time on toil - the rest goes to engineering work that eliminates toil permanently.
- Identify the top 5 most time-consuming manual operations.
- Automate them systematically, starting with the highest-frequency tasks.
- Track toil hours to measure progress over time.
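One way to make the "top 5" concrete is to log every manual intervention and rank tasks by the total time they consume. A hypothetical sketch - the task names and durations are made up:

```python
from collections import Counter

# Hypothetical log of manual operations: (task, minutes spent)
toil_log = [
    ("rotate-certs", 30),
    ("restart-worker", 10),
    ("restart-worker", 10),
    ("resize-disk", 45),
    ("restart-worker", 10),
    ("clear-queue", 20),
]

def automation_candidates(log, top_n=5):
    """Rank tasks by total time consumed: automate the biggest first."""
    totals = Counter()
    for task, minutes in log:
        totals[task] += minutes
    return totals.most_common(top_n)

print(automation_candidates(toil_log))
# [('resize-disk', 45), ('rotate-certs', 30), ('restart-worker', 30), ('clear-queue', 20)]
```

Even a spreadsheet works; what matters is that the ranking is measured, not guessed.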
Incident Management
When things break (and they will), a structured incident process minimises user impact and accelerates recovery.
Severity Levels
Define clear severity levels so everyone knows how urgently to respond:
| Severity | Impact | Response Time | Examples |
|---|---|---|---|
| SEV-1 (Critical) | Complete service outage or data loss | Immediate (all hands) | Production down, security breach, data corruption |
| SEV-2 (Major) | Significant degradation affecting many users | Within 30 minutes | Core feature broken, major performance issue |
| SEV-3 (Minor) | Limited impact, workaround available | Within business hours | Non-critical feature bug, minor performance degradation |
| SEV-4 (Low) | Minimal impact, cosmetic or edge case | Next sprint | UI glitch, minor logging issue |
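The table above can be encoded as data, so that alert routing is driven by policy rather than judgement calls made mid-incident. A sketch - the concrete deadlines for SEV-3 and SEV-4 are illustrative interpretations of "business hours" and "next sprint":

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class Severity:
    label: str
    respond_within: timedelta  # max time before someone is actively responding
    page_oncall: bool          # page immediately, or wait for working hours?

SEVERITIES = {
    1: Severity("SEV-1 (Critical)", timedelta(0), page_oncall=True),
    2: Severity("SEV-2 (Major)", timedelta(minutes=30), page_oncall=True),
    3: Severity("SEV-3 (Minor)", timedelta(hours=8), page_oncall=False),
    4: Severity("SEV-4 (Low)", timedelta(days=14), page_oncall=False),
}

def should_page(level: int) -> bool:
    """Only SEV-1 and SEV-2 interrupt someone's evening."""
    return SEVERITIES[level].page_oncall
```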
On-Call Rotations
- Rotate on-call responsibility across the team - no single person should carry the pager permanently.
- Provide clear runbooks for common alert scenarios so any team member can respond effectively.
- Compensate on-call engineers appropriately (time off in lieu, on-call payments, or reduced sprint commitments).
- Track on-call load - if one team gets paged 10x more than others, that is a reliability signal that needs attention.
Post-Incident Reviews (Blameless Postmortems)
After every SEV-1 and SEV-2 incident, conduct a blameless postmortem covering:
- Timeline: What happened, when, and what actions were taken.
- Root Cause: What underlying conditions allowed this to happen.
- Contributing Factors: What made detection or recovery slower.
- Action Items: Specific, assigned tasks to prevent recurrence - with deadlines.
- Learnings: What the team learned and what went well during the response.
The "blameless" part is non-negotiable. If engineers fear blame, they will hide information, and your postmortems will be useless. The goal is to fix the system, not the person. Make postmortems visible (publish them internally) to spread learning across the organisation.
Capacity Planning and Scaling
Capacity Planning
Capacity planning ensures your infrastructure can handle current and future load.
- Horizontal Scaling: Adding more instances behind a load balancer. Preferred for stateless services.
- Vertical Scaling: Adding more resources (CPU, RAM) to existing instances. Simpler but has an upper limit.
- Auto-Scaling: Configure automatic scaling policies that add/remove capacity based on real-time metrics (CPU usage, request queue depth, custom metrics).
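The target-tracking rule used by most auto-scalers (including Kubernetes' Horizontal Pod Autoscaler) fits in a few lines: scale the replica count proportionally to how far the observed metric sits from its target. The 60% CPU target and replica bounds here are illustrative:

```python
import math

def desired_replicas(current: int, cpu_utilisation: float,
                     target: float = 0.60,
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Target-tracking scaling: grow or shrink the fleet so the
    per-instance metric converges on its target, clamped to sane bounds."""
    desired = math.ceil(current * cpu_utilisation / target)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(4, 0.90))  # 6: running hot, add capacity
print(desired_replicas(4, 0.30))  # 2: over-provisioned, scale in
```

The clamping matters in practice: a metrics outage or divide-by-small-number glitch should never scale you to zero or to a thousand instances.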
Runbooks
Operational runbooks are living documents that capture the "how to" for common operational tasks:
- How to restart a crashed service
- How to failover to a backup database
- How to investigate and resolve common alert types
- How to perform emergency rollbacks
Automate steps where possible - the best runbook is one that executes itself.
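A runbook that is halfway to executing itself can be modelled as an ordered list of named steps, each either a prompt for a human or a callable. All names here are illustrative placeholders:

```python
from typing import Callable

def restart_service(name: str) -> None:
    # Placeholder: in practice this calls your orchestrator's API.
    print(f"restarting {name} ...")

# Runbook for a hypothetical "service is crash-looping" alert.
RUNBOOK: list[tuple[str, Callable[[], None]]] = [
    ("Check the health endpoint", lambda: print("GET /healthz")),
    ("Restart the service", lambda: restart_service("checkout")),
    ("Confirm error rate has recovered", lambda: print("check dashboard")),
]

def execute(runbook) -> list[str]:
    """Run each step in order, returning the names of completed steps."""
    completed = []
    for name, action in runbook:
        action()
        completed.append(name)
    return completed
```

Steps start as documented prompts and are swapped for real automation one at a time - the document and the automation never drift apart because they are the same artifact.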
Security Operations
- Threat Detection: Monitor for suspicious activity (unusual login patterns, privilege escalation, data exfiltration attempts).
- Patch Management: Apply security patches promptly. Define remediation timelines for critical vulnerabilities (e.g. patch within 48 hours for CVSS score 9+).
- Access Control: Implement least-privilege access. Review and audit access permissions quarterly.
- Compliance: Maintain audit logs and access records required by your regulatory framework.
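The 48-hour rule for critical vulnerabilities can be turned into a mechanical deadline check. Only the 9.0+/48-hour tier comes from the policy above; the other tiers are illustrative assumptions:

```python
from datetime import datetime, timedelta

def patch_deadline(cvss_score: float, published: datetime) -> datetime:
    """Deadline for applying a patch, tiered by CVSS severity."""
    if cvss_score >= 9.0:
        return published + timedelta(hours=48)  # critical: per the policy above
    if cvss_score >= 7.0:
        return published + timedelta(days=7)    # high: illustrative tier
    return published + timedelta(days=30)       # everything else: illustrative

published = datetime(2024, 1, 1, 9, 0)
print(patch_deadline(9.8, published))  # 2024-01-03 09:00:00
```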
Operations by Company Stage
- Startup: The founders and engineers are the on-call team. Use simple alerting (PagerDuty or Opsgenie free tier, or even Slack alerts). Write basic runbooks for the 3-5 most likely failure scenarios. Use cloud-managed services (managed databases, serverless platforms, and the like) to minimise operational burden. Postmortems can be informal but should still happen.
- Growth Stage: Hire or designate a dedicated SRE or platform team. Implement formal on-call rotations with compensation. Define SLOs for key services and track error budgets. Build a comprehensive runbook library. Invest in auto-scaling and capacity monitoring. Run a structured postmortem process with published learnings.
- Established: Operate a 24/7 or follow-the-sun on-call model. Run chaos engineering programs (Chaos Monkey, Gremlin, Litmus) to proactively test resilience. Maintain formal SLA management with contractual commitments. Hold regular disaster recovery drills. Undergo compliance audits (e.g. SOC 2, ISO 27001) covering operational practices.
Common Pitfalls
- Hero Culture: One person who "knows everything" and is always called when things break. This is a single point of failure for the organisation. Distribute knowledge through runbooks, paired on-call shifts, and documentation.
- No Postmortems: Without structured learning from incidents, the same failures repeat. Blameless postmortems are not optional - they are the primary mechanism for improving reliability.
- Alert Fatigue: Too many low-priority or noisy alerts cause engineers to ignore all alerts - including the critical ones. Tune alerts ruthlessly: every alert should be actionable and require human intervention.
- Ignoring Toil: Accepting manual operational work as "just the way things are" means your team scales linearly with your infrastructure. Automate relentlessly.
Operations Key Deliverables
- Service Level Objectives (SLOs) and Error Budgets
- On-Call Schedule and Escalation Policies
- Operational Runbooks
- Post-Incident Review Reports
- Capacity Plans
AI in Operations
AI is critical for managing the complexity of modern distributed systems:
- Self-Healing: AI enables systems to recover automatically by rerouting traffic or scaling capacity. Cisco AppDynamics uses AI to minimise downtime.
- Intelligent Incident Response: PagerDuty uses ML to automate incident triage and improve alert routing.
- Security: Palo Alto Networks and Splunk use AI to detect and block threats in real-time.