In today’s fast-paced digital landscape, ensuring the reliability and stability of our products has never been more important. That’s where our Site Reliability Engineering (SRE) team comes in. Our Head of SRE, Bruce, shares how SRE’s evolution at VGW aligns with our unwavering commitment to putting our players first.
Introduction to Site Reliability Engineering (SRE)
SRE originated at Google as a discipline that combines software engineering principles with infrastructure and operations problem-solving. SRE applies software engineering best practices to ensure the resilience and stability of software and infrastructure systems.
SRE uses automation to reduce repetitive work in deploying software and infrastructure, and advocates for observability so that engineers can follow breadcrumbs to identify problems in code and infrastructure that threaten stability. SRE is also heavily involved in the incident response process, which helps ensure swift and effective responses to any failures impacting customers.
Different Models of SRE Teams
SRE teams can look very different depending on the size and nature of the organisation. In larger organisations like Google, SRE teams may hold the pager for product engineering teams and collaborate directly to enhance reliability. In other instances, SRE teams can be cross-functional and work across multiple teams. At VGW, our Site Reliability Engineers are embedded within teams, where each acts as a reliability multiplier by focusing on levelling up that team to hit its reliability goals.
Striking the Right Balance Between Reliability and Innovation
Reliability is an important aspect of any product: repeated unplanned outages and instability erode customer trust and drive customers to look for alternatives. However, as consumers we are also willing to tolerate some level of instability in our services, for example a small network blip that forces a reload, or functionality that is partially reduced. This is why finding the right balance between investing in reliability and delivering new features is essential. It requires understanding our teams well enough to know how much to invest in reliability without over-investing to the detriment of new features.
The ROAD Framework for Assessing Reliability
At VGW, our SRE team has adopted the ROAD Framework, which consists of four guiding principles to evaluate teams on their reliability:
- Recovery – Evaluating how effectively the team responds to failures through the incident response process, runbooks, and disaster recovery testing.
- Observability – Assessing the team’s understanding of service health and their ability to surface events impacting availability and performance through telemetry.
- Availability – Ensuring teams understand the service’s availability, its ability to fulfil its intended function, and reliably service requests.
- Delivery – Analysing the team’s consistency in provisioning and deploying operational services and their dependencies.
Continuum of Reliability: From Reactive to Proactive and Strategic
The ROAD Framework maps each guiding principle onto a continuum of reliability maturity, ranging from reactive to proactive and strategic.
Reactive: This is where teams start, or end up, if there are no controls or processes in place. In this phase, a team can only react to reliability concerns stemming from recent outages or incidents, which leads to tactical fixes.
Proactive: As teams progress in maturity, we witness a transition towards a proactive approach. In this phase teams actively identify risks that could potentially impact the reliability of their product and take necessary precautions to mitigate these threats to reliability.
Strategic: In this phase, the team has ingrained reliability in its practices and in how software is designed, built and operated, with appropriate guardrails and controls in place to catch any threats.
Applying the ROAD Framework at VGW
The SRE team used the ROAD Framework as an assessment tool, with a set of questions and criteria to identify gaps in a team’s reliability. The outcomes of the assessments would help identify how best the SRE team could engage with that team. This could come in the form of a pitch to the product engineering team to incorporate reliability work in their next cycle, or short, well-scoped work conducted by the SRE team to level up the team’s observability, infrastructure or reliability patterns.
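As a rough illustration only, the outcome of such an assessment could be recorded as one maturity level per ROAD principle, with the lowest-rated principles flagged as candidates for engagement. This is a minimal sketch under stated assumptions: the numeric ordering of the levels, the function name `weakest_principles` and the example ratings are all hypothetical, and the actual questions and criteria used at VGW are not described in this article.

```python
from enum import IntEnum


class Maturity(IntEnum):
    """Maturity continuum; the numeric ordering is an assumption of this sketch."""
    REACTIVE = 1
    PROACTIVE = 2
    STRATEGIC = 3


# The four guiding principles of the ROAD Framework.
PRINCIPLES = ("Recovery", "Observability", "Availability", "Delivery")


def weakest_principles(assessment: dict[str, Maturity]) -> list[str]:
    """Return the principle(s) rated lowest, in ROAD order (hypothetical helper)."""
    missing = [p for p in PRINCIPLES if p not in assessment]
    if missing:
        raise ValueError(f"assessment is missing principles: {missing}")
    lowest = min(assessment[p] for p in PRINCIPLES)
    return [p for p in PRINCIPLES if assessment[p] == lowest]


# Made-up ratings for an imaginary team, purely for illustration.
example = {
    "Recovery": Maturity.PROACTIVE,
    "Observability": Maturity.REACTIVE,
    "Availability": Maturity.PROACTIVE,
    "Delivery": Maturity.STRATEGIC,
}

print(weakest_principles(example))  # ['Observability']
```

In this made-up case the team's observability is still reactive, so that is where a pitch or a short piece of scoped SRE work might be aimed first.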
One added benefit we found as we began these conversations across the organisation was that the Site Reliability Engineers embedded within teams now had a common understanding and language to engage with their teams as well as with the broader SRE team. As their teams levelled up and matured, the need for a full-time embedded Site Reliability Engineer became less important. This allowed SRE to take on more strategic work with organisation-wide impact, and to scale up and assist Site Reliability Engineers in other teams as needed.
Overcoming Challenges and Moving Forward
Transitioning from a decentralised to a centralised SRE function presented challenges, including inconsistent approaches and processes across teams. By consolidating incident response processes and tooling, and by improving practices, we have begun our journey to address these issues, with the ROAD Framework giving our teams clarity on the scope and benefits of what SRE can bring. We have a clear destination: improved reliability for our customers. Regardless of any bumps in the road we come across, now that our teams are using the same map, I am confident we will reach it.