SRE Maturity in 8 Weeks
A focused playbook for SLOs, incident hygiene, and platform reliability that actually sticks
Executive summary
In most organizations, reliability initiatives stall because the scope is too broad, the targets are aspirational rather than empirical, and the day‑to‑day operating model never changes. This 8‑week playbook solves that by:
- Scoping ruthlessly to the smallest set of critical services that carry outsized business value.
- Basing targets on reality: measure how systems behave today, then refine SLOs/SLAs and set error budgets you can actually honor.
- Integrating reliability into how work flows: incident hygiene, change management, and SDLC gates tied to error budgets—up to and including pausing risky deployments when stability is threatened.
- Shipping a paved path: golden patterns, templates, and automation you can fan out to the rest of the estate after the first wins.
The result is a repeatable, auditable reliability program that improves customer outcomes without paralyzing delivery.
Principles
- Prioritize by business criticality. Not all components deserve the same engineering attention. Select the top 3–5 services or customer journeys that create or protect the most revenue.
- Measure what customers feel. Define SLIs that mirror user experience—availability, latency (p95/p99), correctness, and freshness. Cost and efficiency can be tracked, but don’t let them replace experience SLIs.
- Start with reality, not aspiration. Baseline real performance (last 30–90 days). Draft SLOs that reflect current behavior; tighten over time.
- Make error budgets real. Error budgets must be simple to understand, easy to calculate, and enforced through incident management and SDLC gates.
- Automate the boring, standardize the rest. Golden dashboards, alert policies, runbooks, post‑mortem templates, and CI/CD checks are the paved path.
Scope: pick the right “thin slice”
A thin, high‑value slice is the difference between traction and thrash. Use a criticality mapping exercise:
- Inventory: list services/systems and map to key user journeys (e.g., “checkout completes,” “report loads within 5s”).
- Business impact: estimate $/minute loss or churn risk for each journey.
- Feasibility: instrumentation readiness, team ownership, and expected lift.
- Select: pick the subset where impact × feasibility is highest. Typically this yields 3–5 services and 2–4 user journeys.
Deliverable: Service–Journey matrix with a ranked list and owners.
SLIs, SLOs, SLAs: crisp definitions
- SLI (Service Level Indicator): a measurement of experience. Examples: success rate, p95 latency, data freshness.
- SLO (Objective): a target for an SLI over a window. Example: “p95 checkout latency ≤ 600ms over 28 days.”
- SLA (Agreement): a contract with customers; typically downstream of SLOs and includes remedies/credits.
Choosing SLIs that matter
For each selected journey, define 2–4 SLIs:
- Availability: success / total requests (filter to “good” responses).
- Latency: p95 and/or p99 of end‑to‑end duration (client to durable write).
- Correctness: percent of responses with expected shape/invariants.
- Freshness: age of data powering the UI, or lag to eventual consistency.
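As a sketch, the availability and freshness SLIs above could be expressed as Prometheus queries. The metric names (http_requests_total, report_last_refresh_timestamp_seconds) and the service/journey labels are placeholders for whatever your systems actually emit, not a prescribed schema.

slis:
  availability:
    # "Good" here means non-5xx; tighten the filter if certain 4xx responses also count as bad.
    good_query:  sum(rate(http_requests_total{service="checkout",code!~"5.."}[5m]))
    total_query: sum(rate(http_requests_total{service="checkout"}[5m]))
  freshness:
    # Seconds since the data behind the report page was last refreshed.
    query: time() - max(report_last_refresh_timestamp_seconds{journey="report_load"})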
From baseline to SLO
- Measure current SLIs (ideally 28–30 days) via logs/traces/metrics and synthetics.
- Propose draft SLOs at or just below the performance the system demonstrably delivers today, so targets are achievable from day one (see the attainment sketch after this list).
- Review with product & customer‑facing teams—ensure targets mirror customer expectations and seasonality.
- Ratify v1 SLOs with clear ownership and a date to revisit (e.g., quarterly).
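One way to baseline attainment, assuming the Prometheus latency histogram used in the checkout spec later in this playbook, is a recording rule over the trailing 28 days; the rule name is an arbitrary convention, not a requirement.

groups:
  - name: checkout-slo-baseline
    rules:
      - record: checkout:slo_attainment:28d
        # Fraction of checkout requests completing within 600ms over the last 28 days.
        expr: |
          sum(increase(http_request_duration_seconds_bucket{le="0.6",service="checkout"}[28d]))
          /
          sum(increase(http_request_duration_seconds_count{service="checkout"}[28d]))

If the rule reports, say, 0.993, a 99% objective is defensible today and 99.9% is not yet.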
Error budgets
If the SLO is 99.9% monthly availability, the error budget is 0.1% of the minutes in the window:
- 30 days = 43,200 minutes → budget = 43.2 “bad” minutes.
- Track burn rate across short and long windows (e.g., 1h and 6h) to detect acute and chronic burn.
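Burn rate is simply the observed bad fraction divided by the allowed bad fraction (1 − SLO, i.e., 0.001 for 99.9%). A minimal sketch as a Prometheus recording rule, reusing the placeholder http_requests_total metric from the SLI examples above; analogous rules over 5m, 1h, 6h, and 3d windows back the alert policy below.

groups:
  - name: checkout-burn-rate
    rules:
      - record: checkout:slo_burn_rate:1h
        # (bad requests / all requests over 1h) divided by the 0.1% budget fraction.
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{service="checkout",code!~"5.."}[1h]))
              /
              sum(rate(http_requests_total{service="checkout"}[1h]))
            )
          ) / 0.001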
Burn‑rate alert policy (example)
- Page when the 1-hour burn rate is ≥ 14.4× (≈2% of the 30-day budget consumed in one hour; at that rate the whole budget is gone in about 2 days).
- Page when the 6-hour burn rate is ≥ 6× (≈5% of the budget in six hours; a sustained incident).
- Ticket when the 3-day burn rate is ≥ 2× (≈20% of the budget in three days; chronic degradation).
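Translated into Prometheus alerting rules, the policy might look like the sketch below. It assumes checkout:slo_burn_rate:* recording rules exist for each window (as sketched above) and that a severity label routes alerts to paging versus ticketing; both are conventions to adapt, not requirements.

groups:
  - name: checkout-slo-alerts
    rules:
      - alert: CheckoutFastBurn
        # Pair the 1h window with a 5m window so paging stops soon after the burn stops.
        expr: checkout:slo_burn_rate:1h > 14.4 and checkout:slo_burn_rate:5m > 14.4
        for: 2m
        labels: { severity: page }
      - alert: CheckoutSlowBurn
        expr: checkout:slo_burn_rate:6h > 6 and checkout:slo_burn_rate:1h > 6
        for: 15m
        labels: { severity: page }
      - alert: CheckoutChronicBurn
        expr: checkout:slo_burn_rate:3d > 2
        for: 1h
        labels: { severity: ticket }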
Incident hygiene: the operating model
A reliable platform needs an equally reliable process.
- Severity matrix tied to customer impact and SLO burn (e.g., Sev‑1 if active burn ≥ 14× and key journey unavailable).
- Roles: Incident Commander, Communications, Ops, and Subject‑Matter Leads.
- Runbooks: concise, one‑page actions with links to dashboards and rollback.
- Post‑incident reviews: blameless, small set of actionable items with owners and due dates.
- Tagging & taxonomy: every incident tagged to service/journey, failure mode, and root cause classification to surface systemic issues.
Integration points
- On‑call rotations formalized; paging only on customer‑impacting signals (SLOs), not raw metrics.
- Change management: deployments automatically annotated into traces/metrics; feature flags for safe rollouts.
- SDLC gates: when a service exhausts the error budget, high‑risk changes are paused or go through additional approval until the budget resets.
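As a sketch of such a gate, a reusable CI job can refuse to proceed when 28-day attainment (the checkout:slo_attainment:28d recording rule sketched earlier) has fallen below the target. The Prometheus URL, the 0.99 threshold, and the workflow wiring are assumptions to adapt, not a prescribed implementation.

name: error-budget-gate
on:
  workflow_call: {}        # intended to be called by the deploy workflow before rollout
jobs:
  check-budget:
    runs-on: ubuntu-latest
    steps:
      - name: Block deploy if the 28-day SLO attainment is below target
        run: |
          ATTAINMENT=$(curl -sG "https://prometheus.example/api/v1/query" \
            --data-urlencode 'query=checkout:slo_attainment:28d' \
            | jq -r '.data.result[0].value[1]')
          echo "28-day attainment: ${ATTAINMENT}"
          # Exit non-zero (fail the gate) when attainment has dropped below the 99% target.
          awk -v a="$ATTAINMENT" 'BEGIN { exit (a < 0.99) ? 1 : 0 }'

Gating on a long window deliberately keeps the pause in place until the budget genuinely recovers; pair it with the time-boxed exception path described under Governance so the gate never becomes a permanent freeze.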
The 8‑week plan
Week 1 — Kickoff, scope, and charter
- Identify executive sponsor and working team (product + engineering + SRE).
- Build the service–journey matrix and select the thin slice.
- Agree on decision rights, cadence, and success metrics.
- Stand up a shared reliability backlog.
Artifacts: charter, RACI, ranked service–journey matrix, backlog.
Week 2 — Instrumentation and SLIs
- Map signals to sources: logs, traces, metrics, synthetics.
- Standardize telemetry: trace IDs in logs, request IDs across services.
- Implement golden dashboards per service (availability, latency, saturation, errors).
- Draft SLIs and begin baselining.
Artifacts: SLI catalog; dashboard templates.
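For the telemetry-standardization step, a minimal OpenTelemetry Collector configuration gives every service one OTLP front door for traces and metrics; the backend endpoints (otel-backend.example, metrics.example) are placeholders for whatever stores you actually run.

receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  batch: {}
exporters:
  otlp:
    endpoint: otel-backend.example:4317             # placeholder tracing backend (OTLP/gRPC)
  prometheusremotewrite:
    endpoint: https://metrics.example/api/v1/write  # placeholder metrics store
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]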
Week 3 — Baseline & draft SLOs
- Gather 14–30 days of SLI data (pull historical where possible).
- Draft SLOs per journey with product and support alignment.
- Define the error budget policy (thresholds, actions, SDLC gates).
Artifacts: SLO proposals; error budget policy v1.
Week 4 — Alerts and incident hygiene
- Convert SLOs into burn‑rate alerts (short + long windows).
- Wire paging routes, escalation policies, and on‑call calendars.
- Create runbook skeletons; launch incident channels and templates.
Artifacts: alert policies; runbooks; incident templates.
Week 5 — Change safety and reliability work intake
- Add deployment annotations and release health checks.
- Enable canary or progressive delivery for at least one key service.
- Connect the error budget policy to SDLC gates (e.g., CI job checks or change approvals).
- Triage top toil sources and open platform tickets.
Artifacts: CI/CD checks; release checklist; toil reduction list.
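Deployment annotation can be a single extra step appended to the existing deploy job. The sketch below posts to the Grafana annotations API; the Grafana URL, the GRAFANA_TOKEN secret, and the tags are assumptions for your environment.

- name: Annotate the deployment in Grafana
  if: success()
  run: |
    curl -s -X POST "https://observability.example/api/annotations" \
      -H "Authorization: Bearer ${{ secrets.GRAFANA_TOKEN }}" \
      -H "Content-Type: application/json" \
      -d "{\"tags\": [\"deploy\", \"checkout-api\"], \"text\": \"Deploy ${GITHUB_SHA} by ${GITHUB_ACTOR}\"}"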
Week 6 — Game day and capacity validation
- Run a game day on one or two likely failure modes (dependency outage, region failover, cache stampede).
- Validate capacity assumptions (autoscaling, limits, backpressure).
- Update runbooks based on findings; refine SLOs if reality differs.
Artifacts: game day report; updated runbooks; capacity notes.
Week 7 — Adoption round 2 and reporting
- Apply the paved path to one additional service.
- Publish reliability reports: SLO attainment, incidents by tag, MTTx, change failure rate.
- Review with execs: what improved, where to invest next.
Artifacts: service #2 onboarding; SLO report; leadership update.
Week 8 — Codify the paved path & executive readout
- Freeze the golden patterns: SLO template, alert policies, dashboards, runbooks, post‑incident template.
- Finalize governance: review cadence (monthly), ownership, and annual target‑setting.
- Publish the fan‑out plan for the next quarter.
Artifacts: paved‑path package; program governance doc; roadmap.
The paved path (what “done” looks like)
Reusable templates and automation checked into a discoverable repo:
- SLO spec (YAML or JSON): SLI query, window, target, owner, dashboard link.
- Alert policies: multi‑window burn rate (e.g., 5m/1h and 1h/6h), ticketing/webhook routes.
- Dashboards: per service and per journey, with standardized panels.
- Runbook template: prerequisites, triage tree, rollback steps, comms.
- Post‑incident template: summary, timeline, impact, contributing factors, actions.
- CI/CD checks: error‑budget gate, change freeze override via executive approval.
Example SLO spec for the checkout journey (illustrative values):

service: checkout-api
journey: submit_order
owner: team-payments
sli:
  type: latency
  threshold: 600ms   # "good" = requests completing at or under this latency
  query: >
    sum(rate(http_request_duration_seconds_bucket{le="0.6",service="checkout"}[5m]))
    / sum(rate(http_request_duration_seconds_count{service="checkout"}[5m]))
window: 28d
objective: ">= 99% of requests ≤ 600ms"
error_budget:
  window_minutes: 40320   # 28 days
  budget_minutes: 403.2   # 1% of 28 days
alerts:
  - name: fast-burn
    policy: burn_rate >= 14.4 over 1h   # page
  - name: slow-burn
    policy: burn_rate >= 6 over 6h      # page
  - name: chronic
    policy: burn_rate >= 2 over 3d      # ticket
links:
  dashboard: https://observability.example/d/checkout
  runbook: https://runbooks.example/checkout
Governance and cadence
- Ownership: every SLO has a named engineering owner and a product owner.
- Reviews: monthly SLO review (attainment, burn events, actions); quarterly target tune‑ups.
- Change policy: error‑budget‑driven—if a service is out of budget, risky deploys are paused until budget resets or an executive approves an exception.
- Transparency: publish a simple internal reliability scorecard each month.
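The monthly scorecard can stay deliberately small; the sketch below shows the shape with illustrative placeholder values, not real data.

month: 2025-01                    # placeholder
services:
  - name: checkout-api
    slo_attainment: 99.2%         # vs. 99% target
    error_budget_remaining: 21%
    incidents: { sev1: 0, sev2: 1, sev3: 3 }
    change_failure_rate: 4%
    mttr_minutes: 38
    actions_overdue: 1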
Anti‑patterns to avoid
- Metric‑first paging (CPU, memory, queue depth) without customer‑visible impact → alert fatigue.
- Too many SLOs per service → dilution and confusion. Start with two or three.
- Aspirational SLOs that don’t match reality → constant breach and no behavior change.
- Ignoring product: SLOs set by ops alone miss what customers feel.
- Permanent freezes: if your error budget policy only blocks delivery, teams will route around it. Clear exit criteria and time‑boxed pauses are essential.
Measuring impact (what leaders will ask)
- Customer outcomes: fewer Sev‑1/Sev‑2 incidents on critical journeys; improved CSAT/NPS tied to those journeys.
- Engineering velocity: change failure rate and MTTR drop; on‑call interruptions decrease; fewer rollbacks.
- Financials: fewer credits (SLA), lower unplanned work; cost efficiency from right‑sizing and reduced thrash.
- Program health: SLO adoption rate, adherence to runbooks, post‑incident action closure within SLA.
Roles and time commitment (minimal but real)
- Sponsor (VP Eng/CTO): unblockers and policy backing (1–2 hours/week).
- SRE Lead / Facilitator: runs the playbook and authors the paved path (50–75% of an FTE for 8 weeks).
- Service Owners (2–4 teams): instrumentation and adoption (2–4 hours/week each, with bursts).
- Product/Support partners: define user journeys and SLA implications (1–2 hours/week).
Tooling (vendor‑neutral)
- Observability: OpenTelemetry + Prometheus/Grafana, or Datadog/New Relic.
- Paging & incident: PagerDuty/Opsgenie + Slack/Teams; Jira/ServiceNow for follow‑ups.
- Synthetics: k6, Playwright, or vendor equivalents.
- Delivery: GitHub Actions, ArgoCD/Flux, feature flagging (LaunchDarkly/OpenFeature).
Conclusion
You can raise reliability within two months without stopping feature work—if you pick a narrow, valuable scope, measure what customers feel, and wire reliability into the way changes are made. A single golden pattern across your most critical services produces immediate customer benefit and a reusable operating model. From there, fan‑out is an execution detail, not a reinvention.
Appendix A — Sample incident severity matrix (sketch)
- Sev‑1: Key journey unavailable or >50% error; active burn ≥ 14×; page + exec comms; 24×7 response.
- Sev‑2: Degraded experience; active burn 6–14×; page; incident comms.
- Sev‑3: Minor impact; active burn 2–6×; ticket; business‑hours response.
Appendix B — Post‑incident review template (outline)
- Summary (what/when/impact)
- Timeline with sources of truth (logs, dashboards, PRs)
- Contributing factors (technical, process, organization)
- Customer impact (SLO burn, SLA credits if any)
- What went well / what didn’t
- Actions with owners and due dates (limit to 3–5)
*Prepared by CS Freedom Advisory.*