incident-response · reliability · sre
By Shayan Ghasemnezhad · 4 min read
Your first major incident will happen. The difference between 20 minutes of downtime and 4 hours is what you prepared before the page fired.
Startups do not think about incident response until they have an incident. Then everyone is in a Slack channel, nobody knows who is in charge, three people are making changes to production simultaneously, and the CEO is asking “what is happening?” every five minutes. The incident takes four hours to resolve and two days to recover from. Most of that time is coordination failure, not technical complexity.
You do not need PagerDuty’s full incident management framework. You need four things: severity levels, a response checklist, a communication template, and a post-incident review process. These can live in a single Notion page or a markdown file in your repo. What matters is that everyone knows they exist and where to find them.
Define three severity levels. More than three invites debate about classification in the middle of an incident, when you should be fixing it.
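As a sketch of what three levels might look like (the names, file, and response targets here are illustrative, not prescriptive):

# severity.yml: illustrative definitions; adjust descriptions and response targets to your product
sev1:
  description: Core product down, or customer data at risk, for most customers
  response: Page on-call immediately; all hands until mitigated; update customers every 30 minutes
sev2:
  description: A major feature degraded or unavailable for a subset of customers
  response: Page on-call; mitigate now, fix properly during working hours
sev3:
  description: Minor bug or degradation with a workaround
  response: No page; ticket it and fix in the normal course of work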
When an alert fires or a user reports an issue, the first responder follows a checklist rather than improvising; a sketch of one is below.
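The exact steps are yours to define. As an illustration only (the steps and filename here are assumptions), a checklist kept next to the severity definitions might read:

# response-checklist.yml: illustrative steps, adapt to your team
steps:
  - Acknowledge the alert so it stops escalating
  - Assess severity against the three levels
  - Declare an incident and name a single incident commander
  - Open a dedicated channel and keep all production changes coordinated there
  - Send the initial acknowledgement to customers
  - Only then start debugging, and post updates on a fixed cadence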
Writing clear communication under stress is hard. Templates remove that burden. Prepare three templates: initial acknowledgement (“We are aware of an issue affecting [X]. We are investigating.”), progress update (“We have identified the cause as [X]. Estimated resolution: [time].”), and resolution (“The issue affecting [X] has been resolved. [Brief explanation].”).
Post these to your status page, Slack, and anywhere customers check. The goal is to reduce inbound “is it down?” queries, which consume responder attention during the incident.
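If the templates live alongside the rest of your incident docs in the repo, a small file is enough; the wording below mirrors the templates above, while the filename and keys are assumptions:

# comms-templates.yml: fill in [X] and [time] during the incident
initial: "We are aware of an issue affecting [X]. We are investigating."
progress: "We have identified the cause as [X]. Estimated resolution: [time]."
resolution: "The issue affecting [X] has been resolved. [Brief explanation]."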
A post-incident review (PIR) is not a blame session. It is a learning session. Cover four questions: What happened? (Timeline of events.) Why did it happen? (Contributing factors, not root cause—complex systems rarely have a single root cause.) How did we respond? (What worked, what did not.) What will we change? (Specific, assignable action items with deadlines.)
Write the PIR in a shared document. Publish it to the team. The goal is institutional learning: the next incident should be easier, faster, or prevented entirely because of what you learned from this one.
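If the shared document lives in the repo, a skeleton that mirrors the four questions keeps every PIR consistent; the field names here are one option, not a standard:

# pir-template.yml: one section per question above
what_happened: ""        # timeline of events, with timestamps
why_it_happened: ""      # contributing factors, not a single root cause
how_we_responded: ""     # what worked, what did not
what_we_will_change: []  # action items, each with an owner and a deadline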
Set up basic alerting before you need it: uptime checks (Pingdom, Better Uptime, or a CloudWatch synthetic canary), error rate monitoring (Sentry, Datadog), and infrastructure alerts (CPU, memory, disk). Route alerts to a Slack channel. Assign an on-call rotation—even if it is just two engineers alternating weeks.
# Minimal CloudWatch alarm for API error rate
AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  ApiName:
    Type: String  # name of the API Gateway REST API to watch
Resources:
  AlertTopic:
    Type: AWS::SNS::Topic  # subscribe your on-call email or Slack integration to this topic
  ApiErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: api-5xx-rate
      MetricName: 5XXError
      Namespace: AWS/ApiGateway
      Dimensions:
        - Name: ApiName
          Value: !Ref ApiName
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 10  # more than 10 5xx responses in each of two consecutive 5-minute windows
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref AlertTopic
The most common failure: alert fatigue. Too many alerts, most of which are not actionable, train the team to ignore them. Every alert should have a corresponding runbook entry that explains what to check and what to do. If you cannot write a runbook for an alert, the alert should not exist.
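For the alarm above, a runbook entry can be a short file next to the template; the checks and actions here are illustrative, not a prescription:

# runbooks/api-5xx-rate.yml: illustrative entry for the alarm defined above
alert: api-5xx-rate
what_to_check:
  - Whether the 5xx spike is ongoing or already recovering
  - Recent deploys that correlate with the start of the spike
  - Health of upstream dependencies (database, third-party APIs)
what_to_do:
  - Roll back the most recent deploy if it correlates with the spike
  - Escalate severity and post the initial acknowledgement if errors are customer-facing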
The second failure: skipping post-incident reviews because the team is too busy. The cost of not learning from incidents is repeat incidents. A 30-minute PIR that produces two action items is worth more than a week of feature work if it prevents the next four-hour outage.
Your first major incident will happen. The preparation you do now—severity levels, checklists, templates, review process—is the difference between a controlled response and hours of chaos. An afternoon of preparation saves days of recovery.