Using Site Reliability Engineering to Increase Operational Resilience

Drafted by Ben Saunders: OpRes Founder

Roughly a 6-minute Read

Last week, I took the time to write about OpRes and our logical architecture. Over the course of that blog, we stressed that we are not intending to build another monitoring and alerting system into our solution. That is a crowded marketplace, with the likes of Datadog, Sumo Logic, and Splunk already providing great solutions across both public and private cloud for organisations of all sizes. That said, we are actively building the ability to integrate these types of solutions with OpRes.

Let me explain why….

The PRA and FCA have stipulated that firms must set impact tolerances for their important business services by March 2022. That in itself is an activity where firms baseline their initial impact tolerances and establish a framework for reviewing them at predefined intervals, based on their risk appetite. The guidance from the PRA and FCA is that these impact tolerances should apply time- and metric-based indicators, enabling firms to identify when the frequency and severity of service disruptions could cause intolerable harm to customers.

Invariably, conducting a pure “pen and paper” exercise will only get firms so far in complying with the recent operational resilience policies. Once mapping activities have been conducted, firms should have a sound understanding of their technology supply chains. This should include the hand-offs between internal and external suppliers, the relationships between 3rd- and 4th-party suppliers and, ultimately, private and public cloud providers.

More mature firms should also be able to outline the non-functional requirements of their important business services, whether these be measurements such as transactions per second, page load times or peak expected load windows. Indeed, being able to fine-tune these measurements for minimum and maximum business requirements, across core trading hours, will also allow firms to master their respective responses to the operational resilience policies.

In order to do this, we believe firms should explore applying Site Reliability Engineering (SRE) methods and techniques when setting their impact tolerances, bringing together concepts like Service Level Objectives (SLOs), Service Level Agreements (SLAs) and Service Level Indicators (SLIs).

These might be new terms to some of our readers, so let's break each of these down and explain what we mean and why firms should apply them to increase their operational resilience. 

SRE 101: What is SRE?

SRE is an approach to technology operations made popular by Google, and it is now becoming common practice across both start-ups and enterprise organisations. In short, SRE is what you get when your organisation treats service disruptions as a software problem. SRE has been applied heavily at Google across its customer-facing services, and its principles, methods, and approaches can be readily transferred to regulated environments such as financial services. Nationwide Building Society, Monzo, Starling Bank and Capital One, for example, have all spoken publicly about how they use SRE in their organisations.

SRE applies a set of golden performance indicators, or signals, that report on the health of a business service in real time. Typically, these are:

  • Latency

  • Traffic

  • Errors

  • Saturation

Where these signals break out of their defined thresholds, or more appropriately, impact tolerances, a set of remediation steps is executed to restore normal levels of service and limit the impact on customers. Or, in the case of the operational resilience policies published by the FCA & PRA, to prevent intolerable harm to customers.
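As a sketch of how the four golden signals might be checked against impact tolerances, consider the following. The field names and threshold values here are illustrative assumptions for the example, not an OpRes API.

```python
from dataclasses import dataclass

@dataclass
class GoldenSignals:
    latency_ms: float   # time taken to serve a request
    traffic_rps: float  # demand on the service, in requests per second
    error_rate: float   # fraction of requests failing (0.0 - 1.0)
    saturation: float   # fraction of capacity in use (0.0 - 1.0)

# Hypothetical impact tolerances for an important business service.
TOLERANCES = {"latency_ms": 300.0, "error_rate": 0.01, "saturation": 0.80}

def breached_tolerances(signals: GoldenSignals) -> list[str]:
    """Return the golden signals currently outside their impact tolerances."""
    breaches = []
    if signals.latency_ms > TOLERANCES["latency_ms"]:
        breaches.append("latency")
    if signals.error_rate > TOLERANCES["error_rate"]:
        breaches.append("errors")
    if signals.saturation > TOLERANCES["saturation"]:
        breaches.append("saturation")
    return breaches

# A service that is slow and near capacity, but not erroring:
print(breached_tolerances(GoldenSignals(420.0, 150.0, 0.002, 0.9)))
```

In a real deployment, a non-empty result from a check like this is what would trigger the predefined remediation steps.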

So, if an organisation wanted to apply SRE principles as part of its impact tolerances, what would it need to have in place, and what information would it need to capture? Let’s unpack the concepts of SLOs, SLAs and SLIs and the roles they have to play in improving operational resilience.

What is an SLO in the world of SRE? 

No firm wants to experience a service disruption. This is why it is refreshing that SRE begins with the notion that availability is a prerequisite to business success. For any firm, an important business service that is unavailable and cannot perform its function will need to be captured, audited and reported to the relevant regulatory body. Furthermore, even when a system is “up” and “available”, it may well not be performing within the acceptable thresholds of an impact tolerance.

In the world of SRE, availability is a measure of whether a system, or in our case a business service, is able to fulfil its desired function at a point in time. Furthermore, as well as underpinning a reporting framework, historical availability measurements can also be used to project the probability that your business service will perform in line with your golden signals and performance requirements in the future. (More on that shortly.)

The concept of an SLO was created to measure the availability of a system or business service against a precise numerical target. An SLO can be measured as the total number of available service minutes across a calendar year or month, or alternatively as a percentage of time across the same period (e.g. an SLO of 99.99% for the month of January).
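The two ways of expressing an SLO are interchangeable. As a minimal sketch, converting a percentage SLO into the downtime it permits over a period:

```python
def allowed_downtime_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime a service can sustain over the period
    while still meeting its availability SLO."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)

# An SLO of 99.99% for the month of January (31 days) permits
# roughly 4.46 minutes of downtime:
print(round(allowed_downtime_minutes(99.99, 31), 2))

# A 99.9% SLO over a 30-day month permits 43.2 minutes:
print(round(allowed_downtime_minutes(99.9, 30), 2))
```

Working the numbers this way makes the operational cost of each extra “nine” concrete when negotiating targets with stakeholders.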

Business services with high SLOs are often more expensive to operate. As such, firms may consider setting service tiers with standardised SLOs. Within OpRes, we classify services using a tiered model: Tier 1 is how we classify important business services, whilst Tiers 2, 3 & 4 enable firms to record less important business services, supporting wider dependency mapping across business services.
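A tiered model with standardised SLOs can be as simple as a lookup. The tier-to-SLO percentages below are purely illustrative assumptions for the sketch, not OpRes defaults:

```python
# Hypothetical standardised availability SLOs per service tier.
TIER_SLOS = {
    1: 99.95,  # Tier 1: important business services
    2: 99.9,
    3: 99.5,
    4: 99.0,
}

def standard_slo(tier: int) -> float:
    """Return the standardised availability SLO (%) for a service tier."""
    if tier not in TIER_SLOS:
        raise ValueError(f"unknown service tier: {tier}")
    return TIER_SLOS[tier]

print(standard_slo(1))  # 99.95
```

Standardising in this way stops every service negotiating a bespoke (and often unaffordable) availability target.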

Therefore, it is imperative that firms apply an availability SLO to each of their important business services. Without them, key stakeholders and material risk-takers (MRTs) cannot make informed decisions about whether a service is performing inside or outside of its impact tolerances. Having these insights will also be useful for the trend analysis and remediation investments which, the FCA has outlined, firms have three years to execute between 2022 and 2025.

I know, I know… some of you may be asking: “Isn’t this just a Service Level Agreement (SLA)?”

Let me explain how Google distinguishes between SLOs and SLAs in its SRE model, as we believe this distinction is a really powerful tool for increasing operational resilience.


What is an SLA in the world of SRE?

Google distinguishes between SLOs and SLAs so that it can establish availability agreements across the organisation’s internal software and platform engineering teams. The aim of these agreements is to ensure realistic and measurable targets are in place; in other words, ensuring that an SLA is never stricter than the SLO.

SLAs typically stipulate an agreement with the teams consuming or integrating with your service: its availability should meet a certain level over a certain period, and if it fails to do so then some kind of penalty is paid. For example, no further changes can be made to the application or platform codebase, in order to protect against unplanned outages. These penalties are typically managed through “error budgets”.

For firms looking to apply the SLO/SLA model, the framework would be expressed in availability numbers: for instance, an external availability SLA of 99.9% over one month, backed by an internal availability SLO of 99.95%.
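The error budget falls out of the SLO directly: whatever availability the SLO does not demand is budget that can be “spent” on outages and risky changes. A hedged sketch, assuming the 99.95% internal SLO above and a change freeze as the penalty:

```python
def error_budget_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Total downtime the SLO permits over the period, in minutes."""
    return period_days * 24 * 60 * (1 - slo_percent / 100)

def changes_allowed(slo_percent: float, downtime_so_far_min: float) -> bool:
    """Penalty check: block further releases once the error budget is spent."""
    return downtime_so_far_min < error_budget_minutes(slo_percent)

budget = error_budget_minutes(99.95)  # ~21.6 minutes per 30-day month
print(round(budget, 1))
print(changes_allowed(99.95, 15.0))   # 15 min of outages: budget remains
print(changes_allowed(99.95, 25.0))   # 25 min of outages: budget exhausted
```

The exact penalty (a freeze, an escalation, a remediation sprint) is a policy choice; the budget arithmetic stays the same.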

In order to master the real-time telemetry and early warning signs that they may be infringing their impact tolerances, it is important for firms to monitor and measure SLO compliance explicitly. To provide accurate reporting to internal compliance functions (1st, 2nd and 3rd line risk) and regulators alike, firms need insights that can report over the course of the SLA calendar period, whilst having optimal alerting and notification systems in place, with predefined thresholds, that provide early warning if a business service appears to be veering outside of its SLO and its impact tolerances.

This is where SLI’s come into play. 

What is an SLI in the World of SRE?

In an earlier blog, we wrote about the 12 Key Resilience Indicators firms should apply as part of a modern observability strategy. To a certain extent, when we refer to Key Resilience Indicators, we are also echoing the need to capture, measure and report on SLIs. Google applies SLIs as a direct measurement of a service’s behaviour, often using the golden signals we spoke about earlier in this blog as the measurement for identifying non-conformance with expected service up-time.

These concepts could be easily transferred to the volume of successful payments processed in banking, or the number of quote-and-buy conversion requests in the insurance domain: essentially, assessing the number of probes or transactions that have been executed and whether they completed successfully.

SLIs can be analysed over a defined period of time, most often a week, in order to validate the service availability percentage. With the requisite log tracking and analytics in place, firms will be able to introduce intervention measures should it appear that SLOs are going to be breached, in turn preventing potential service disruptions that could cause intolerable harm to customers.

In order for firms to understand how reliable their business services actually are, they must be able to measure the rates of successful and unsuccessful queries against the SLIs of their important business services. Whilst this is quite an involved activity, it will pay dividends over time and allow firms to move towards a real-time operational resilience reporting and remediation posture.
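Measuring an SLI as a success rate is simple arithmetic once the event counts are available. A minimal sketch, using the banking example above with made-up payment figures:

```python
def sli_success_rate(good_events: int, total_events: int) -> float:
    """SLI: percentage of probes/transactions that completed successfully."""
    if total_events == 0:
        return 100.0  # no traffic observed, so nothing has failed
    return 100.0 * good_events / total_events

# e.g. 99,872 successful payments out of 99,950 attempts this week:
rate = sli_success_rate(99_872, 99_950)
print(round(rate, 2))   # the weekly SLI
print(rate >= 99.9)     # compared against a 99.9% SLO
```

Tracked weekly, a series of these measurements is exactly the trend data needed to warn that a service is drifting towards an SLO breach before it becomes a disruption.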

Interestingly, there are a lot of similarities between Google’s approach to SRE and the approaches outlined by the Central Bank of Ireland in their recent publication regarding Cross Industry Guidance on Operational Resilience. If you haven’t read it yet, we suggest you do! They suggest firms apply a three-stage model to operational resilience. Their three stages are in bold below; we’ve added a sprinkle of SRE terminology to demonstrate the tight alignment between the two.

Identify & Prepare - Set impact tolerances and SLOs, SLAs & SLIs.

Respond & Adapt - Leverage reporting and telemetry to continuously optimise operational resilience in real-time.

Recover & Learn - Where impact tolerances are exceeded, use the vast amounts of data available to enhance future performance across people, processes, and technology. 

Closing Thoughts: 

We’ve used this blog to introduce the concept of SLIs and how they can be used as part of a wider SRE model to increase operational resilience in financial services. Mapping your business services and setting impact tolerances is a key requirement for firms as part of their responses to the FCA’s policies on operational resilience. However, treating this as a pen and paper activity is not going to solve the problem long-term. This is why applying SRE principles, in line with a modern observability strategy, is the key to long-term operational resilience for firms.

We’ve years of experience building cloud-native monitoring systems that interact with business services stretching across public and private cloud deployments. If you want to get in touch and ask us any questions, feel free to fire them over to hq@opres.uk.

Thanks for reading, 

Ben
