We’re pleased to announce that The Site Reliability Workbook is available in HTML now! Site Reliability Engineering (SRE), as it has come to be generally defined at Google, is what happens when you ask a software engineer to solve an operational problem. SRE is an essential part of engineering at Google. It’s a mindset, and a set of practices, metrics, and prescriptive ways to ensure systems reliability. The new workbook is designed to give you actionable tips on getting started with SRE and maturing your SRE practice. We’ve included links to specific chapters of the workbook that align with our tips throughout this post.
We’re often asked what implementing SRE means in practice, since our customers face challenges quantifying their success when setting up their own SRE practices. In this post, we’re sharing a couple of checklists to be used by members of an organization responsible for any high-reliability services. These will be useful when you’re trying to move your team toward an SRE model. Implementing this model at your organization can benefit both your services and teams due to higher service reliability, lower operational cost, and higher-value work for the humans.
But how can you tell how far you have progressed along this journey? While there is no simple or canonical answer, you can see below a non-exhaustive list to check your progress, organized as checklists by ascending order of maturity of a team. Within every checklist, the items are roughly in chronological order, but we do recognize that any given team’s actual needs and priorities may vary.
If you’re part of a mature SRE team, these checklists can be useful as a form of industry benchmark, and we’d love to encourage others to publish theirs as well. Of course, SRE isn’t an exact science, and challenges arise along the way. You may not get to 100% completion of the items here, but we’ve learned at Google that SRE is an ongoing journey.
SRE: Just getting started
The following three practices are key principles of SRE, but can largely be adopted by any team responsible for production systems, regardless of its name, before and in parallel to staffing an SRE team.
Beginner SRE teams
Most, if not all, SRE teams at Google have established the following practices and characteristics. We generally view these as fundamental to an effective SRE team, unless there are good reasons why they aren’t feasible for a specific team’s circumstances.
The following practices are also common for SRE teams starting out. If they don’t exist, that can be a sign of poor team health and sustainability issues:
Intermediate SRE teams
These characteristics are common in mature teams and generally indicate that the team is taking a proactive approach to efficient management of its services.
Advanced SRE teams
These practices are common in more senior teams, or sometimes can be achieved when an organization or set of SRE teams share a broader charter.
Another set of SRE “features” which may be desirable but unlikely to be implemented by most companies are:
- SREs are not on-call 24×7. SRE teams are geographically distributed in two locations, such as U.S. and Europe. It’s worth pointing out that neither half is treated as secondary.
- SRE and developer organizations share common goals and may have separate reporting chains up to SVP level or higher. This arrangement helps to avoid conflicts of interest.
What should I do next?
Once you’ve looked through these checklists, your next step is to think about whether they match your company’s needs.
For those without an SRE team where most of the beginner list is unfilled, we’d highly recommend reading the associated SRE Workbook chapters in the order they have been presented. If you happen to be a Google Cloud Platform (GCP) customer and would like to request CRE involvement, contact your account manager to apply for this program. But to be clear, SRE is a methodology that will work on a huge variety of infrastructures, and using Google Cloud is not a prerequisite for pursuing this set of engineering practices.
We’d also recommend attending existing conferences and organizing summits with other companies in order to share best practices on how to solve some of the blockers, such as recruiting.
We have also seen teams struggling to fill out the advanced list because of churn. The rate of systems and personnel changes may be a deterrent to get there. In order to avoid teams reverting to the beginner stage and other problems, our SRE leadership reviews key metrics per team every six months. The scope is more narrow than the checklists above because several of the items have now become standard.
As you may have guessed by now, answering the central question in this article involves addressing and attempting to assess a given team’s impact, health, and most importantly, how the actual work is done. After all, as we wrote in our first book on SRE: “If we are engineering processes and solutions that are not automatable, we continue having to staff humans to maintain the system. If we have to staff humans to do the work, we are feeding the machines with the blood, sweat, and tears of human beings.”
So yes, you might have an SRE team already. Is it effective? Is it scalable? Are people happy? Wherever you are in your SRE journey, you can likely continue to evolve, grow and hone your team’s work and your company’s services. Learn more here about getting started building an SRE team.
Thanks to Adrian Hilton, Alec Warner, David Ferguson, Eric Harvieux, Matt Brown, Myk Taylor, Stephen Thorne, Todd Underwood and Vivek Rau among others for their contributions to this post.
Not Affiliated With Any University