Fabian Iannarella

A well-known and respected member of the Australian DevOps community, Fabian is an excellent team motivator and enjoys building high-performing teams and fostering DevOps culture.

April 11, 2018

Rely on a Site Reliability Engineer | Spotlight on DevOps Talks

The DevOps Talks Conference, held on 22nd – 23rd March, was two days of fun and thought-provoking presentations delivered in true meet-up style by speakers from around the globe. Not only was it a great way to end the week, there was a great vibe in the air, and no one could fault the hospitality: fantastically catered networking sessions followed at the end of each day.

So what was it that made the event so great? Well, there was an abundance of up-and-coming DevOps trends that really stood out for me, in particular the notion of a Site Reliability Engineer.

 

Jamie Wilkinson took the stage with his talk “Introduction to SRE at Google”.

Source: DevOps Talks

———

 

A “Site Reliability Engineer” is an awesome concept from the minds of our friends at Google.

Jamie Wilkinson, an individual with over 15 years in the industry and a Site Reliability Engineer himself, gave us a run-through of the key aspects of this role. Beyond this, Jana Brummel and Robin Van Zijill from ING shared how they improved ING’s online banking reliability using SRE principles.

 

So what is the Site Reliability Engineer role all about?

The Site Reliability Engineer role was developed with the idea that reliability is the key feature in any of your products as a business – after all, if your service is down it cannot be used, right? So it is integral that you work on the reliability of your product offerings.

Within a business, a team of SREs is tasked with reliability; they don’t always write software features, but they do spend ample time with Developers, consulting on how to make their apps more reliable. SREs are not the release gatekeepers, but there are some strict rules established around whether teams are allowed to release.

The concept of an error budget is used here. For example, a Service Level Agreement (SLA) target is allocated to a service, let’s say 99.9%, which gives a 0.1% error allowance. If a prior release creates too many errors and the allocated budget is exceeded, further releases are not permitted until the error rate is reduced… bummer…

It is important, however, to define what constitutes an error, and reliability metrics need to be visible to everyone for this to work correctly.
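To make the arithmetic concrete, here is a minimal Python sketch of an error budget check. It assumes the budget is counted in failed requests over some measurement window; the class name `ErrorBudget`, its fields, and the request figures are hypothetical and were not part of the talk.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error budget for one service over one measurement window."""

    sla_target: float     # e.g. 0.999 for the 99.9% example above
    total_requests: int   # requests served in the window (assumed unit of measure)
    failed_requests: int  # requests counted as errors in the same window

    @property
    def allowed_failures(self) -> int:
        # A 99.9% target leaves a 0.1% allowance; round() absorbs float noise.
        return round((1 - self.sla_target) * self.total_requests)

    @property
    def remaining(self) -> int:
        # Negative means the budget has been exceeded.
        return self.allowed_failures - self.failed_requests

    def release_permitted(self) -> bool:
        # Releases stay open until the budget is exceeded.
        return self.remaining >= 0


if __name__ == "__main__":
    budget = ErrorBudget(sla_target=0.999, total_requests=1_000_000, failed_requests=1_200)
    print(budget.allowed_failures, budget.remaining, budget.release_permitted())
```

With these made-up numbers, a 99.9% target over one million requests allows 1,000 failures, so 1,200 failures puts the service 200 over budget and releases are paused.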

 

Some key details about the Site Reliability Engineer way of life:

  • SREs speak the same language as Developers (code; they are software engineers)
  • They put a cap on Operations work, say 50% (the rest of the time is spent on improvement)
  • SREs are “portable” amongst teams and are part of the same staffing pool
  • Devs are obligated to help out with the Ops Backlog! This is the part I really like…
  • Through monitoring, systems are made observable
  • Post-mortems (reviews held after an incident) focus on process and technology, not on blaming people

 

Fellow experts Krishna PCK and Navdip Kaur Kalsi also attended the insightful conference.

———

What I love about this role is that SREs are software engineers with a bonus eye for Ops – they spend at least 50% of their time making systems more reliable, and focus on reducing the time spent on Ops-style workload such as tickets, on-call, manual tasks, etc.

Decisions to release are made with metrics. SREs have no problem working with Dev teams, and together they help out with the Ops backlog when it exceeds the SRE team’s capacity.

They work as a group to minimise errors and reduce Operations overhead, with the aim of increasing reliability so that more time can be spent on feature releases!
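As a rough illustration of the 50% cap, the sketch below works out what share of a week went to Ops-style work and whether it breaches the cap. The category names, the hours, and the function names are all made up for the example; a real team would pull these figures from its own ticketing and on-call data.

```python
OPS_CAP = 0.5  # the 50% cap on Operations work mentioned above

# Hypothetical labels for work that counts as Ops-style workload.
OPS_CATEGORIES = {"tickets", "on-call", "manual tasks"}


def ops_share(hours_by_category: dict[str, float]) -> float:
    """Fraction of logged hours spent on Ops-style work."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    ops = sum(h for cat, h in hours_by_category.items() if cat in OPS_CATEGORIES)
    return ops / total


def over_cap(hours_by_category: dict[str, float], cap: float = OPS_CAP) -> bool:
    """True when Ops work exceeds the cap and the overflow should go back to Devs."""
    return ops_share(hours_by_category) > cap


if __name__ == "__main__":
    week = {"tickets": 10, "on-call": 8, "manual tasks": 6, "automation": 12, "reliability work": 4}
    print(f"Ops share: {ops_share(week):.0%}, over cap: {over_cap(week)}")
```

In this made-up week, 24 of 40 hours went to Ops, which is over the cap, so this is the point at which the Dev teams would be asked to pick up part of the Ops backlog.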

If that ain’t DevOps then nothing is!

 
