I chose this book because I wanted to read about managing production services at scale from a devops/ops/development point of view: the teams around those services, and the practices, recommendations, do’s and don’ts of a big tech company like Google, some of which we might adopt in our team.
This summary covers these big concepts:
- What is the SRE Team and its responsibilities
- Service Level Objectives
- The Error Budget
- SRE to the Rescue
- Incident Management
1. What is the SRE Team and its Responsibilities
SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor.*1
One side of SRE is doing ops work, but through a different lens: developing systems and processes that make the ops work itself unnecessary.
In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s). We have codified rules of engagement and principles for how SRE teams interact with their environment — not only the production environment, but also the product development teams, the testing teams, the users, and so on.*2
Google caps operational work for SREs at 50% of their time. This lets the team focus on improving and automating their processes instead of spending their time on mechanical or predictable manual work. The goal is systems that are automatic, not just automated.
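To make the 50% cap concrete, here is a minimal sketch of how a team could check it against its own time tracking. The `TimeLog` record and the sample hours are hypothetical, made up for illustration; the book does not prescribe a particular way to measure this.

```python
from dataclasses import dataclass

OPS_CAP = 0.50  # the 50% cap on operational work mentioned above


@dataclass
class TimeLog:
    """Hours one SRE spent in a period (hypothetical record, for illustration)."""
    ops_hours: float          # tickets, on-call, manual work
    engineering_hours: float  # automation and project work


def ops_fraction(log: TimeLog) -> float:
    """Share of the logged time that went to operational work."""
    total = log.ops_hours + log.engineering_hours
    if total == 0:
        return 0.0
    return log.ops_hours / total


def over_cap(log: TimeLog, cap: float = OPS_CAP) -> bool:
    """True when operational work exceeds the cap."""
    return ops_fraction(log) > cap


if __name__ == "__main__":
    week = TimeLog(ops_hours=26, engineering_hours=14)  # made-up numbers
    print(f"ops share: {ops_fraction(week):.0%}, over cap: {over_cap(week)}")
```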
2. SLIs, SLOs and SLAs
We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives (SLOs), and agreements (SLAs). These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service.*3
These are the three pillars of service reliability, and all the work of the SRE team revolves around them, so pay close attention when you define them.
Service Level Indicators (SLIs)
An SLI is a service level indicator — a carefully defined quantitative measure of some aspect of the level of service that is provided.*4
Most services consider request latency — how long it takes to return a response to a request — as a key SLI. Other common SLIs include the error rate, often expressed as a fraction of all requests received, and system throughput, typically measured in requests per second. Another kind of SLI important to SREs is availability, or the fraction of the time that a service is usable.
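As a rough illustration of the SLIs just listed, the sketch below computes error rate, throughput, mean latency and availability from a handful of request records. The `Request` type, the function names and the sample data are assumptions for this example, not something taken from the book.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    """One served request (hypothetical record, for illustration)."""
    latency_ms: float
    ok: bool


def error_rate(requests: Sequence[Request]) -> float:
    """Failed requests as a fraction of all requests received."""
    if not requests:
        return 0.0
    return sum(1 for r in requests if not r.ok) / len(requests)


def throughput_rps(requests: Sequence[Request], window_seconds: float) -> float:
    """Requests per second over the measurement window."""
    return len(requests) / window_seconds


def mean_latency_ms(requests: Sequence[Request]) -> float:
    """Average time taken to return a response."""
    if not requests:
        return 0.0
    return sum(r.latency_ms for r in requests) / len(requests)


def availability(up_seconds: float, total_seconds: float) -> float:
    """Fraction of the time that the service was usable."""
    return up_seconds / total_seconds


if __name__ == "__main__":
    sample = [Request(80, True), Request(120, True),
              Request(95, False), Request(105, True)]
    print(error_rate(sample), throughput_rps(sample, 2.0),
          mean_latency_ms(sample), availability(3590, 3600))
```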
SREs help define the SLIs because there needs to be an objective way to measure the SLOs in the agreement, or disagreements will arise.
Service Level Objectives (SLOs)
An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound.*5
Example: average latency per request to be under 100 milliseconds.
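The "SLI ≤ target" and "lower bound ≤ SLI ≤ upper bound" structures can be sketched as a small check. The `SLO` class and its field names are hypothetical; the bounds are inclusive because that is how the quoted structure is written.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SLO:
    """A target value or range for an SLI (hypothetical type, for illustration)."""
    name: str
    upper: Optional[float] = None  # SLI <= upper, when set
    lower: Optional[float] = None  # lower <= SLI, when set

    def is_met(self, sli_value: float) -> bool:
        """True when the measured SLI falls within the configured bounds."""
        if self.lower is not None and sli_value < self.lower:
            return False
        if self.upper is not None and sli_value > self.upper:
            return False
        return True


if __name__ == "__main__":
    # The latency example from above: average latency of at most 100 ms.
    latency_slo = SLO(name="average request latency (ms)", upper=100.0)
    print(latency_slo.is_met(87.5))   # measured 87.5 ms
    print(latency_slo.is_met(130.0))  # measured 130 ms
```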
Without an explicit SLO, users often develop their own beliefs about desired performance, which may be unrelated to the beliefs held by the people designing and operating the service.
Service Level Agreements (SLAs)
SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. The consequences are most easily recognized when they are financial — a rebate or a penalty — but they can take other forms.*6
SRE doesn’t typically get involved in constructing SLAs, because SLAs are closely tied to business and product decisions. SRE does, however, get involved in helping to avoid triggering the consequences of missed SLOs.
3. The Error Budget
Product development performance is largely evaluated on product velocity, which creates an incentive to push new code as quickly as possible. Meanwhile, SRE performance is (unsurprisingly) evaluated based upon reliability of a service, which implies an incentive to push back against a high rate of change.*7
In order to base these decisions on objective data, the two teams jointly define a quarterly error budget based on the service’s SLO. The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.
As long as the system’s SLOs are met, releases can continue. If SLO violations occur frequently enough to expend the error budget, releases are temporarily halted while additional resources are invested in system testing and development to make the system more resilient, improve its performance, and so on.
What happens if a network outage or datacenter failure reduces the measured SLO? Such events also eat into the error budget. As a result, the number of new pushes may be reduced until the SLO is reset. The entire team supports this reduction because everyone shares the responsibility for uptime.
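A minimal sketch of the error budget arithmetic, assuming a request-based availability SLO (successful requests divided by total requests over the quarter). The function names and the numbers are made up for illustration; a real budget could just as well be defined in minutes of downtime.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates in the period."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Budget left after subtracting the failures observed so far."""
    return error_budget(slo_target, total_requests) - failed_requests


def releases_allowed(slo_target: float, total_requests: int,
                     failed_requests: int) -> bool:
    """Releases continue while some error budget is left."""
    return budget_remaining(slo_target, total_requests, failed_requests) > 0


if __name__ == "__main__":
    slo = 0.999          # 99.9% of requests should succeed (assumed target)
    total = 1_000_000    # requests served this quarter (made-up number)
    print(round(error_budget(slo, total)))     # about 1000 failures allowed
    print(releases_allowed(slo, total, 400))   # budget left, keep releasing
    print(releases_allowed(slo, total, 1200))  # budget spent, halt releases
```

Note that every failure counts against the same budget, whether it came from a bad release or from a datacenter outage, which matches the point made above.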
4. SRE to the Rescue
SRE is concerned with several aspects of a service, which are collectively referred to as production. These aspects include the following:
- System architecture and interservice dependencies
- Instrumentation, metrics, and monitoring
- Emergency response
- Capacity planning
- Change management
- Performance: availability, latency, and efficiency *8
Not all Google services can receive close SRE engagement. Because of this, very good documentation is maintained by SRE and developers can seek SRE consulting to discuss specific services or problem areas. But for some development teams, consultation is not sufficient. These types of services may have grown to the point at which they begin to encounter significant difficulties in production while simultaneously becoming important to users.
For these services, SRE engages more closely to improve their reliability. One way to do this is through an iterative review and implementation process.
Production Readiness Reviews
The objectives of the Production Readiness Review are as follows:
- Verify that a service meets accepted standards of production setup and operational readiness, and that service owners are prepared to work with SRE and take advantage of SRE expertise.
- Improve the reliability of the service in production, and minimize the number and severity of incidents that might be expected.
- A PRR targets all aspects of production that SRE cares about.
After sufficient improvements are made and the service is deemed ready for SRE support, an SRE team assumes its production responsibilities.
The process to get to this point is explained more thoroughly in the book.
The Early Engagement Model essentially immerses SREs in the development process. SRE’s focus remains the same, though the means to achieve a better production service are different. SRE participates in Design and later phases, eventually taking over the service any time during or after the Build phase. This model is based on active collaboration between the development and SRE teams.
The services that qualify for this model of engagement are those where:
- The service implements significant new functionality and will be part of an existing system already managed by SRE.
- The service is a significant rewrite or alternative to an existing system, targeting the same use cases.
- The development team sought SRE advice or approached SRE for takeover upon launch.
Again, more details of this process are in the book.
Frameworks and SRE Platform
The SRE organization is responsible for serving the needs of the large and growing number of development teams that do not already enjoy direct SRE support. This mandate calls for extending the SRE support model far beyond the original concept and engagement model.
To effectively respond to these conditions, it became necessary to develop a model that allowed for the following principles:
- Codified best practices
- Reusable solutions
- A common production platform with a common control surface
- Easier automation and smarter systems
Based upon these principles, a set of SRE-supported platform and service frameworks were created. Services built using these frameworks share implementations that are designed to work with the SRE-supported platform, and are maintained by both SRE and development teams.
5. Incident Management
Google’s incident management system is based on the Incident Command System, which is used by many emergency services, such as health care, police, the army and firefighters. This process needs to have:
- An incident commander: one leader, who holds all responsibilities except the ones they delegate
- An operations lead, who leads the ops work needed
- A public face for any external communication
- Planning: someone who keeps the longer term in mind and handles anything needed on the side during the incident
The most important on-call resources are:
- Clear escalation paths
- Well-defined incident-management procedures
- A blameless postmortem culture. *9
Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR) [Sch15]. The most relevant metric in evaluating the effectiveness of emergency response is how quickly the response team can bring the system back to health — that is, the MTTR.*10
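The quote does not say which function it has in mind. One common choice, used here as an assumption, is the standard steady-state availability formula MTTF / (MTTF + MTTR). The sketch shows why a shorter MTTR improves availability even when failures are just as frequent; the hours are made-up numbers.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    mttf = 720.0  # one failure every 30 days on average (made-up number)
    for mttr in (2.0, 1.0, 0.5):
        a = steady_state_availability(mttf, mttr)
        print(f"MTTR {mttr:>4} h -> availability {a:.5f}")
```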
When they are focused on operations work, on average, SREs should receive a maximum of two events per 8–12-hour on-call shift. This gives the on-call engineer enough time to
- handle the event accurately and quickly,
- clean up and restore normal service,
- and then conduct a postmortem.
If more than two events occur regularly per on-call shift, problems can’t be investigated thoroughly and engineers are sufficiently overwhelmed to prevent them from learning from these events.
A scenario of pager fatigue also won’t improve with scale. Conversely, if on-call SREs consistently receive fewer than one event per shift, keeping them on point is a waste of their time.
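The two thresholds above (at most two events per shift, and not consistently fewer than one) can be turned into a simple check. This sketch uses the average over recent shifts as a stand-in for "regularly" and "consistently", which is my simplification; the function name and labels are hypothetical.

```python
from statistics import mean
from typing import Sequence


def classify_oncall_load(events_per_shift: Sequence[int],
                         max_events: float = 2.0,
                         min_events: float = 1.0) -> str:
    """Classify on-call load from the number of events in recent shifts."""
    if not events_per_shift:
        raise ValueError("need at least one shift to classify")
    average = mean(events_per_shift)
    if average > max_events:
        return "overloaded"  # no time left for cleanup and postmortems
    if average < min_events:
        return "underused"   # keeping people on call is a waste of their time
    return "ok"


if __name__ == "__main__":
    print(classify_oncall_load([3, 4, 2, 5]))  # average 3.5 events per shift
    print(classify_oncall_load([1, 2, 2, 1]))  # average 1.5 events per shift
    print(classify_oncall_load([0, 1, 0, 0]))  # average 0.25 events per shift
```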
Postmortems should be written for all significant incidents, regardless of whether or not they paged; postmortems that did not trigger a page are even more valuable, as they likely point to clear monitoring gaps. This investigation should establish what happened in detail, find all root causes of the event, and assign actions to correct the problem or improve how it is addressed next time.*11
Nothing super new here, but the book provides a template for postmortems and incidents that might serve as an example or model for your team’s own.
The book is written by many authors, and this sometimes makes the chapters a bit repetitive or disconnected, but the content value is there. You might just need to speed-read some parts. Overall I think it has a lot of interesting and useful parts, and I’m sure you’ll get something out of it to implement in your workplace, whether you work in a big, small or one-person company.
I will read the following book too, so let me know if you’d like me to review it as well!
New SRE Book: “The Site Reliability Workbook: Practical Ways to Implement SRE”
Free Online Reading of the Book
Google video explaining some of the book’s concepts
*1 Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site Reliability Engineering (Kindle Locations 354-356). O’Reilly Media, Inc. Kindle Edition.
*2 Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site Reliability Engineering (Kindle Locations 395-399). O’Reilly Media, Inc. Kindle Edition.
*3 Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site Reliability Engineering (Kindle Locations 1012-1014). O’Reilly Media, Inc. Kindle Edition.
*4 Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site Reliability Engineering (Kindle Locations 1022-1023). O’Reilly Media, Inc. Kindle Edition.
*5 Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site Reliability Engineering (Kindle Locations 1041-1043). O’Reilly Media, Inc. Kindle Edition.
*6 Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site Reliability Engineering (Kindle Locations 1071-1073). O’Reilly Media, Inc. Kindle Edition.
*7 Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site Reliability Engineering (Kindle Locations 949-951). O’Reilly Media, Inc. Kindle Edition.
*8 Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site Reliability Engineering (Kindle Location 8836). O’Reilly Media, Inc. Kindle Edition.
*9 Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site Reliability Engineering (Kindle Locations 2776-2777). O’Reilly Media, Inc. Kindle Edition.
*10 Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site Reliability Engineering (Kindle Locations 459-461). O’Reilly Media, Inc. Kindle Edition.
*11 Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site Reliability Engineering (Kindle Locations 413-416). O’Reilly Media, Inc. Kindle Edition.