How to break prod

Yesterday my coworker’s PR caused a site outage for 10 minutes.

Taking down production is nobody’s idea of a good time!

But the way they handled it was so exemplary that they got praise from our manager.

Here’s what they did:

🙋 Responded to reports with relevant work

When they saw bug reports come in and made the connection, they pointed out their PR as a possible cause:

“This might be because of my PR: [GitHub link]”

⏪ Reverted the change as quickly as possible

They re-ran the last successful build, so that the site would be back up ASAP.

Then they opened a PR to revert the change in master.

📣 Let the full team know

They announced the issue to all staff, including:

  • ⏱ When the outage began
  • 👀 How it would be reported by a customer
  • 🆗 The current status of the issue

🔍 Identified the cause

As they confirmed that the site was back up, they began analyzing the issue.

They reported their debugging process in a Slack channel, public to the rest of the dev team.

🔧 Removed the cause of the issue

Not only did they fix this instance of the problem—they also put a fix in place that would prevent this type of issue from happening again.

📣 Announced when the issue was resolved

Updated the full team—including:

  • ✅ The issue was resolved
  • 💫 The cause was identified and is no longer possible
  • 🙏 Apology for any inconvenience
  • 👂 They were available for any questions

My coworker’s instincts were excellent, but we can all act this way if we’re prepared!

You can support your team’s quick action in crisis response by having a playbook with steps like the ones in this thread—and by making each technical step as easy to take as it can be.

This post was originally a Twitter thread as part of Ship 30 for 30.