SRE: The Next Big Thing in IT?

A two-part series on Site Reliability Engineering

SRE, or Site Reliability Engineering, has grown into a significant force over the past 10 years. What is SRE, and is there room for Test Engineers to start a new career in this in-demand discipline? This two-part series explains SRE and explores the career opportunities.

What is SRE?

In the acronym “SRE,” the “S” originally stood for “site,” as in “website.” Over time it has expanded to include “system,” “service,” “software,” and, more broadly still, “online stuff.” The “R” stands for “reliability,” though “resilience” could also be used. The “E” stands for either the practice, “engineering,” or the people, “engineers.”

The term SRE was first applied to a designated role at Google around 2003 by Ben Treynor Sloss, VP Engineering. He defined SRE in a Google interview, saying:

“Fundamentally, it’s what happens when you ask a software engineer to design an operations function…So SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor.”

Google are pioneers behind this growing movement, and the rest of the industry has adopted SRE on its own terms, with results varying widely from company to company. Tammy Butow, SRE Manager at Dropbox, shares her definition:

“SREs are Software Engineers who specialize in reliability. SREs apply the principles of computer science and engineering to the design and development of computer systems: generally, large distributed ones.”

Site Reliability Engineering is, first and foremost, an outgrowth of the “always-on” world of online services, which could be anything from Infrastructure (IaaS) and Networking (NaaS) to Software (SaaS) and Platforms (PaaS). It is a broad discipline that requires the skills to run a large, distributed site. By measuring reliability directly, SRE takes the conjecture and debate out of deciding what can be launched and when. It introduces its own metrics, such as service-level indicators (SLIs), service-level objectives (SLOs) and service-level agreements (SLAs), to continuously oversee the reliability of the product.

Ultimately, choosing appropriate metrics helps drive the right action when something goes wrong, and gives an SRE team confidence that a service is healthy.

An SLI is a service-level indicator. It’s what you measure and where the measurement is taken. Below is a list of common SLIs:

  • Latency
  • Throughput
  • Availability
  • Error rate
  • Durability 

An SLO is a service-level objective. It’s the goal or threshold of acceptable values for an SLI within a given period of time.
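To make the relationship between an SLI and an SLO concrete, here is a minimal Python sketch. It assumes availability is measured as the fraction of successful requests, which is only one common way to define it. The `Slo` type, the function names and the request counts are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A target for one SLI over a measurement window (hypothetical type)."""
    name: str
    target: float      # e.g. 0.999 means 99.9% of requests must succeed
    window_days: int   # length of the measurement window


def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Availability SLI: the fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as available
    return successful_requests / total_requests


def meets_slo(sli_value: float, slo: Slo) -> bool:
    """True when the measured SLI is at or above the objective."""
    return sli_value >= slo.target


# Illustrative numbers only: 500 failed requests out of one million.
slo = Slo(name="availability", target=0.999, window_days=30)
sli = availability_sli(successful_requests=999_500, total_requests=1_000_000)
print(f"SLI = {sli:.4%}, meets SLO: {meets_slo(sli, slo)}")
```

The SLI is the measurement and the SLO is the line it is compared against. Real services usually take the request counts from a monitoring system rather than hard-coding them.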

Finally, an SLA is a service-level agreement. It’s a business promise to customers that spells out the consequences of meeting (or missing) the SLOs it contains. In other words, an SLA defines how reliable the system needs to be for end users. If the team agrees on a 99.9% SLA, that sets an error budget of 0.1%. An error budget is the maximum allowable threshold for errors and outages.
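The error-budget arithmetic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a standard formula set: the 30-day window, the request counts and the function names are hypothetical, and the budget is expressed both as downtime and as a share of failed requests.

```python
def error_budget_fraction(target: float) -> float:
    """Error budget as a fraction: a 99.9% target leaves roughly 0.1%."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return 1.0 - target


def allowed_downtime_minutes(target: float, window_days: int) -> float:
    """Minutes of full outage the budget allows over the window."""
    return error_budget_fraction(target) * window_days * 24 * 60


def budget_remaining(target: float, total_requests: int, failed_requests: int) -> float:
    """Share of the request-based budget still unspent (negative if overspent)."""
    allowed_failures = error_budget_fraction(target) * total_requests
    return (allowed_failures - failed_requests) / allowed_failures


# Illustrative numbers only: a 99.9% target over an assumed 30-day window.
downtime = allowed_downtime_minutes(0.999, window_days=30)
remaining = budget_remaining(0.999, total_requests=1_000_000, failed_requests=400)
print(f"Allowed downtime: {downtime:.1f} minutes")
print(f"Error budget remaining: {remaining:.0%}")
```

Under these assumptions, a 99.9% target leaves about 43 minutes of downtime per 30 days. With 400 failures out of a million requests, about 60% of the budget is still unspent.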

At a high level, SRE responsibilities may include addressing infrastructure and operations problems with code, reducing toil, and sharing ownership of production issues with the product development team. Some of the areas that SRE teams commonly own:

  • Monitoring and observability 
  • Incident response and reviews
  • Data center
  • Container platform 
  • Network 
  • Automation, Release Engineering
  • Databases 

As a discipline, SRE is devoted to helping an organization sustainably achieve the appropriate level of reliability for its services by implementing and continually improving data-informed production feedback loops to balance availability, performance and agility.

SRE and DevOps

Having covered the definition of SRE and the role description, you may ask where DevOps fits into this reliability movement. The answer: they work in tandem.

A quote from the Google SRE handbook explains:

“The term ‘DevOps’ emerged in industry in late 2008 and as of this writing (early 2016) is still in a state of flux. Its core principles—involvement of the IT function in each phase of a system’s design and development, heavy reliance on automation versus human effort, the application of engineering practices and tools to operations tasks—are consistent with many of SRE’s principles and practices. 

One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.”

DevOps focuses on engineering continuous delivery and continuous testing to the point of deployment. This can be achieved by bringing dev, test and ops teams closer together.

SRE focuses on engineering continuous operations at the point of customer consumption – achieving reliability that satisfies users is the main goal. For most organizations, users will want not only reliability but new features too, so SREs fold in practices for improving software delivery.

The speed of deployment is important, in part, because it means a developer can revert code and fix issues sooner. DevOps creates tighter feedback loops for improving the software delivery process. 

You can’t achieve SRE without being a learning organization. Your systems are always changing, and you need to keep learning to be able to manage that complexity to achieve the reliability you want. SREs do this by implementing DevOps practices. 

In the next article in this series, we discuss whether Test Engineers should consider changing roles to SRE.
