updated wording

Alysia Broddrick 2024-02-14 10:37:23 -08:00
parent d5e3c66012
commit a063cf012d
GPG key ID: 03917052CD0F06B7

@@ -1,14 +1,15 @@
# Downtime Incident Management Runbook
Our team has agreed upon steps for handling incidents that cause our site to go offline or become unusable for users. For this document, an incident refers to one in which manage.get.gov is offline or displaying error pages (400-500) and is caused by a critical bug in our code, not to be confused with a security incident. This document should not be used for security incident response.
Our team has agreed upon steps for handling incidents that cause our site to go offline or become unusable for users. For this document, an incident refers to one in which manage.get.gov is offline or displaying HTTP 400/500 errors on all pages and is caused by a critical bug in our code, not to be confused with a security incident. This document should not be used for security incident response.
## Response management rules
The following set of rules should be followed while an incident is in progress.
- The person who first notices that the site is down is responsible for using @here and notifying in #dotgov-announce that production is down
- This applies to any team member, including new team members and non-developers
- If no engineer has acknowledged the announcement within 10 minutes, whoever discovered the site was down should call each developer via the Slack DM huddle feature. If there is no response, this should escalate to a phone call.
- When calling, go down the list phone call list from top to bottom until someone answers
- When calling, go down the [phone call list](https://docs.google.com/document/d/1k4r-1MNCfW8EXSXa-tqJQzOvJxQv0ARvHnOjjAH0LII/edit) from top to bottom until someone answers
- Once an engineer is online, they should immediately start a huddle in the #dotgov-redalert channel to begin troubleshooting
- **All** available engineers should join the huddle once they see it.
- Uncomment the banner on get.gov, so it is transparent to users that we know about the issue on manage.get.gov