Google - Site Reliability Engineering

by Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff

Published 2016·529 pages

Software Development, Engineering Management, Delivery and Execution

Why It Matters for Leaders

This book matters to engineering leaders because it addresses the core challenges of keeping systems reliable and sustaining continuous improvement within engineering teams. An actionable takeaway: adopt reliability engineering practices to improve both system performance and team collaboration.

Who Should Read This

Infrastructure and platform engineering leaders. Teams adopting SRE practices. Anyone responsible for system reliability at scale.

What's Inside

1. Core Principles of SRE: Emphasizes the importance of reliability in modern applications.

2. Service Level Objectives (SLOs): Defines measurable reliability goals to guide engineering efforts.

3. Monitoring and Incident Response: Offers strategies for effective monitoring and rapid incident response to maintain service health.

4. Capacity Planning and Management: Discusses techniques for forecasting demand and scaling systems accordingly.

5. Change Management: Outlines practices for safe deployment and managing changes to systems.

📰Featured in the Newsletter

Mentioned Oct 24, 2022

leadingIn(tech) #18: The value of asking good questions

This resource was discussed in this newsletter issue.
