
Google - Site Reliability Engineering
by Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff
Why It Matters for Leaders
This book addresses the core challenges engineering leaders face in keeping services reliable while their teams continue to improve them. The actionable takeaway is to adopt reliability engineering practices, such as service level objectives, structured incident response, and safe change management, to improve both system performance and team collaboration.
Who Should Read This
Infrastructure and platform engineering leaders. Teams adopting SRE practices. Anyone responsible for system reliability at scale.
What's Inside
1. Core Principles of SRE: Emphasizes the importance of reliability in modern applications.
2. Service Level Objectives (SLOs): Defines measurable reliability goals to guide engineering efforts.
3. Monitoring and Incident Response: Offers strategies for effective monitoring and rapid incident response to maintain service health.
4. Capacity Planning and Management: Discusses techniques for forecasting demand and scaling systems accordingly.
5. Change Management: Outlines practices for safe deployment and managing changes to systems.
Related Content
The future of software engineering is SRE | Swizec Teller
Everything Breaks – Rands in Repose
Principles behind the Agile Manifesto
Technical Writing | Google Developers
The Infinite Game: How to Lead in the 21st Century - YouTube
The Infinite Game by Simon Sinek
How can I help?
12 “Manager READMEs” from Silicon Valley’s Top Tech Companies | Hacker Noon