Lessons learned from two decades of Site Reliability Engineering
- The riskiness of a mitigation should scale with the severity of the outage
- Recovery mechanisms should be fully tested before an emergency
- Canary all changes
- Have a “Big Red Button”
- Unit tests alone are not enough - integration testing is also needed
- COMMUNICATION CHANNELS! AND BACKUP CHANNELS!! AND BACKUPS FOR THOSE BACKUP CHANNELS!!!
- Intentionally degrade performance modes
- Test for Disaster resilience
- Automate your mitigations
- Reduce the time between rollouts, to decrease the likelihood of the rollout going wrong
- A single global hardware version is a single point of failure
Source: Lessons learned from two decades of Site Reliability Engineering
This is a good list. More details are in the article.