Operate

The Operate phase covers the day-to-day running of the software in production. This is where value is delivered to the user.

Operations & Reliability

The goal is to ensure high availability, reliability, and performance.

  • Capacity Planning: Ensuring there are enough resources (CPU, RAM, Storage) to handle user load.
  • Incident Management: Responding to outages or service degradations.
  • Chaos Engineering: Proactively testing system resilience by simulating failures.
  • Security Operations (SecOps): Defending against active threats.
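
As an illustration of the capacity planning item above, here is a minimal sketch of a sizing calculation. The function name, the per-instance throughput figure, and the 30% headroom default are assumptions made for this example, not values taken from any particular tool or standard.

```python
import math


def required_instances(peak_rps, rps_per_instance, headroom=0.3):
    """Estimate how many instances are needed to serve peak load.

    peak_rps: expected peak requests per second (hypothetical input).
    rps_per_instance: measured throughput of one instance (hypothetical input).
    headroom: extra capacity kept in reserve, as a fraction of peak load.
    """
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    # Size for peak load plus the safety margin, rounding up to whole instances.
    target_rps = peak_rps * (1 + headroom)
    return max(1, math.ceil(target_rps / rps_per_instance))


if __name__ == "__main__":
    print(required_instances(peak_rps=1000, rps_per_instance=150))
```

Real capacity plans usually track CPU, RAM, and storage separately and size for whichever resource runs out first; this sketch models a single throughput dimension only.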

Key Deliverables

  • Service Level Agreement (SLA) Reports
  • Incident Reports
  • Operational Runbooks
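
An SLA report typically compares measured availability against an agreed target. The sketch below shows one way that arithmetic might look, assuming a 30-day reporting period and downtime measured in minutes; the function names and the period length are illustrative assumptions.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # assumed reporting period: 43,200 minutes


def availability_pct(downtime_minutes, period_minutes=MINUTES_PER_30_DAYS):
    """Percentage of the reporting period during which the service was up."""
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def allowed_downtime_minutes(target_pct, period_minutes=MINUTES_PER_30_DAYS):
    """Downtime the availability target permits over the reporting period."""
    return period_minutes * (1 - target_pct / 100.0)


def sla_met(downtime_minutes, target_pct, period_minutes=MINUTES_PER_30_DAYS):
    """True if measured availability meets or exceeds the target."""
    return availability_pct(downtime_minutes, period_minutes) >= target_pct
```

Under these assumptions, a 99.9% target over 30 days allows roughly 43.2 minutes of downtime, so a month with 30 minutes of outage meets the target and a month with 60 minutes does not.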

How AI Can Help: Operations

AI is critical for managing the complexity of modern distributed systems:

  • Self-Healing: AI enables systems to recover automatically, for example by rerouting traffic or scaling resources more quickly. Cisco AppDynamics uses AI to minimize downtime.
  • Intelligent Incident Response: PagerDuty uses ML to automate incident triage and improve alert routing.
  • Security: Palo Alto Networks and Splunk use AI to detect and block threats in real time.
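
Many of these capabilities start from some form of anomaly detection on operational metrics. The following is a deliberately simple statistical stand-in, a trailing-window z-score over latency samples. It is not how any of the products named above work, and the function name, window size, and threshold are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate sharply from recent history.

    A sample is flagged when it lies more than `threshold` standard
    deviations from the mean of the preceding `window` samples.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change as anomalous.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    latencies_ms = [100, 102, 98, 101, 99, 100, 450, 101]
    print(find_anomalies(latencies_ms))
```

In this sample data only the 450 ms spike at index 6 is flagged. An alerting pipeline could use such a signal to open or prioritize an incident; production systems generally rely on more robust models that account for seasonality and correlated metrics.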