How To Build A Dynamic Operational Resilience Framework with OpRes & AWS


26 Jul

Written By Ben Saunders

Drafted by Ben Saunders: OpRes Founder

Roughly a 10-minute Read

Introduction:

“I want it all. And I want it now!”

No, that is not the sound of Queen strumming out one of their biggest hits. It’s the cries of Chief Information Officers, Chief Risk Officers, Heads of Operational Resilience and Service Delivery Leads yearning for a real-time operational resilience framework that is fit for the 21st Century! They want a dynamic capability that is evidence-based, all underpinned by trusted data sources.

As firms embark on responding to the operational resilience policies set by the Prudential Regulation Authority (PRA) & Financial Conduct Authority (FCA), some will be mapping their important business services in Word documents and Visio flowcharts, whilst others will be setting their impact tolerances in spreadsheets and documenting their risk appetite in reams of PowerPoint presentations.

Then there are the other “others”! These are the firms that have a robust and rigorous methodology that leverages an event-driven operational resilience architecture. By applying principles like the distributed data mesh, they are able to trigger alerts, notifications, and real-time insights to tell Material Risk Takers (MRTs) when impact tolerances are at risk of being breached. They can automate their regulatory reporting into the FCA’s new platform, RegData, and they can respond to service disruptions at speed using modern engineering practices akin to DevOps & Continuous Delivery. However, these firms are few and far between.

It would be easy for me to type this blog and state that OpRes is the silver bullet solution that firms need to turn towards, in order to answer the cries of their C-Suite! However, no single solution will enable firms to implement a dynamic, real-time operational resilience framework. In order to move to this level of maturity, firms will need to pull together multiple services and solutions across both public and private cloud, to deliver an end-to-end capability.

Considering the tight timelines for firms to have mapped their important business services, and with March 2022 fast approaching, the easy approach is merely to demonstrate a sufficient level of mapping and scenario testing between now and then, in order to reassure the regulators that just enough homework has been done. However, the FCA has suggested that firms who treat operational resilience as a first-class citizen can “promote effective competition” and see this as a differentiating factor in order to “compete for, and keep, customers”.

In order to meet the objectives set by regulators, whilst instilling a step-change in how operational resilience is oriented around the firm’s customer, we are setting about combining OpRes with the power of the public cloud, specifically Amazon Web Services (AWS). Over the course of this blog, we will share several high-level reference architectures that explain how we will utilise managed services in AWS and integrate these with a firm's trusted data sources to unlock dynamic, real-time operational resilience.

Build Fast with The Public Cloud

In a previous blog, I made a passing remark that there was a degree of irony in the fact that, as firms identify resilience gaps, they will likely gravitate towards the public cloud to alleviate their points of contention, whether through Infrastructure, Software, or Platform as a Service (IaaS, SaaS, PaaS) solutions. This is particularly ironic given that the Bank of England has just published its latest position on the growing adoption of public cloud services across financial services, and the risks this is introducing to the sector.

That said, March 2022 is fast approaching. Baselines for impact tolerances need to be set, and now, more than ever, is a great opportunity to capitalise on the disruptions brought about by Covid to reimagine how your organisation responds to operational resilience challenges. We are firm believers that if you are not delivering business value in ~3 months then you should probably stop what you are doing. Evidently, regulatory-driven agendas don’t have this luxury! However, by plugging together public cloud-hosted services, firms can instantiate their dynamic solutions faster, whilst limiting administrative overheads during business-as-usual operations.

Furthermore, firms will most likely have a raft of data at their disposal which demonstrates their actual resilience posture. This may be information stored in static documents such as risk reports, service criticality assessments, or supplier agreements. Equally, it could be data stored in systems of record such as a configuration management database, monitoring & alerting systems, or application delivery toolsets (e.g., continuous integration, code analysis, infrastructure testing & compliance tooling). Firms may also be interested in integrating with third-party sources to track social media announcements, financial performance or ESG-related data points of their suppliers.

As such, over the next passage of this blog we will start to explore how firms can combine higher-level managed services in AWS (e.g., PaaS solutions) with OpRes and other trusted sources, so that they can analyse data points that experience both low and high-frequency changes and build a more complete view of their operational resilience posture. Let's start with the data points that experience the least amount of change, for example supplier contracts or internal risk assessments. Getting access to these types of datasets can be a challenge, yet they are intrinsic to setting operational resilience baselines and service level targets for your firm's important business services.

Harnessing Hard to Reach Data with Artificial Intelligence

In order to move fast and build scalable, dynamic operational resilience solutions at speed, our reference architectures will, wherever possible, make use of PaaS offerings made available by AWS. By using PaaS services, our end-to-end solution limits the need for organisations to manage the underlying infrastructure (usually hardware and operating systems) and will enable them to focus on the oversight and governance of their resilience frameworks.

As previously discussed, many firms sit on a wealth of operational resilience data. The hard thing is getting access to it, or even locating it in the first place. In addition, the mapping requirements outlined by the PRA & FCA can be time-consuming to fulfil and may in some instances result in the duplication of data. As such, the question remains: how do we combine our existing data sets with a framework that is provided by a solution like OpRes?

In an earlier show & tell, we explained how users can map their important business services into OpRes via the user interface, or alternatively by uploading multiple records via a CSV file. By using AWS services, we intend to go one step further by introducing machine learning-based document analysis, enabled by services such as Amazon Textract. Textract uses machine learning to read and process scanned documents, extracting text, handwriting, tables and other data without manual effort. Once key-value pairs are identified, the data can then be loaded into another source.
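To make the key-value step a little more concrete, here is a minimal Python sketch. It assumes a response shaped like the output of Textract's form analysis: a list of `Blocks` in which `KEY_VALUE_SET` blocks link to their values and to the `WORD` blocks that hold the text. The function names and the sample response are illustrative assumptions only and are not part of OpRes or any AWS SDK.

```python
from typing import Dict, List


def _text_for(block: dict, block_map: Dict[str, dict]) -> str:
    """Join the text of the WORD blocks that are CHILD relations of `block`."""
    words: List[str] = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = block_map[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response: dict) -> Dict[str, str]:
    """Flatten a Textract-style form analysis response into {key: value}."""
    blocks = response.get("Blocks", [])
    block_map = {b["Id"]: b for b in blocks}
    pairs: Dict[str, str] = {}
    for block in blocks:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue  # VALUE blocks are reached through their KEY block
        key_text = _text_for(block, block_map)
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(
                    _text_for(block_map[vid], block_map) for vid in rel["Ids"]
                )
        pairs[key_text] = value_text
    return pairs


# Hand-written sample in the assumed response shape (not real Textract output).
SAMPLE_RESPONSE = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "RTO:"},
        {"Id": "w2", "BlockType": "WORD", "Text": "4"},
        {"Id": "w3", "BlockType": "WORD", "Text": "hours"},
    ]
}

print(extract_key_values(SAMPLE_RESPONSE))  # {'RTO:': '4 hours'}
```

Keeping the parsing separate from any AWS call means the same function can be exercised against stored sample responses before it is wired into a live workflow.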

This will allow us to scan, analyse, prepare and ingest multiple documents into OpRes, enabling the creation of new supplier or important business service records. It also ensures that if changes are made to these static documents, the data changes are replicated into OpRes as well. For example, we can search for keywords such as RTO, RPO, SLA, SLO, Support Hours, TPS and M/S, and pull these data points into the requisite fields in the OpRes data schema.
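As a rough illustration of the keyword mapping, the sketch below assumes the key-value pairs have already been extracted from a document into a plain Python dictionary. The target field names (for example `recovery_time_objective`) are hypothetical placeholders; the real OpRes data schema is not described in this blog.

```python
import re
from typing import Dict

# Hypothetical mapping from document keywords to OpRes-style field names.
KEYWORD_TO_FIELD = {
    "rto": "recovery_time_objective",
    "rpo": "recovery_point_objective",
    "sla": "service_level_agreement",
    "slo": "service_level_objective",
    "support hours": "support_hours",
    "tps": "transactions_per_second",
}


def normalise_key(raw_key: str) -> str:
    """Lower-case a key and strip punctuation such as a trailing colon."""
    return re.sub(r"[^a-z0-9 ]", "", raw_key.lower()).strip()


def map_to_opres_fields(pairs: Dict[str, str]) -> Dict[str, str]:
    """Keep only the pairs whose key matches a known keyword."""
    record: Dict[str, str] = {}
    for raw_key, value in pairs.items():
        field = KEYWORD_TO_FIELD.get(normalise_key(raw_key))
        if field is not None:
            record[field] = value.strip()
    return record


extracted = {"RTO:": " 4 hours", "RPO": "15 minutes", "Author": "Jane"}
print(map_to_opres_fields(extracted))
```

Keys that do not match a known keyword, such as `Author` here, are simply ignored, so unrelated form fields in a contract do not pollute the record.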

This also introduces the possibility of establishing an evergreen process to ensure that the data underpinning an important business service is relevant and up-to-date. Namely, whenever a document is updated and pushed to an S3 bucket the required workflows are triggered in order to scan the document, determine if any changes have been applied and then reflect these changes in OpRes.
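A minimal sketch of that evergreen check might look like the following. It assumes the workflow is triggered by a standard S3 event notification and that the fields extracted from the previous and current versions of a document are available as dictionaries. The function names are illustrative, and how the resulting changes are written into OpRes is deliberately left out.

```python
from typing import Dict, List, Optional, Tuple
from urllib.parse import unquote_plus


def documents_from_s3_event(event: dict) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification payload."""
    docs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        docs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return docs


def diff_fields(
    previous: Dict[str, str], current: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {field: (old, new)} for each field added, removed or changed."""
    changes = {}
    for field in sorted(set(previous) | set(current)):
        old, new = previous.get(field), current.get(field)
        if old != new:
            changes[field] = (old, new)
    return changes


event = {"Records": [{"s3": {"bucket": {"name": "contracts"},
                             "object": {"key": "suppliers/acme+contract.pdf"}}}]}
print(documents_from_s3_event(event))
print(diff_fields({"rto": "4 hours"}, {"rto": "2 hours", "sla": "99.9%"}))
```

Only the fields returned by `diff_fields` would need to be pushed downstream, which keeps each update small and makes every change easy to audit.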

In addition to Textract, we’d propose to use several additional services such as Lambda, S3, DynamoDB and Simple Queue Service (SQS). Again, each of these services will be patched and maintained by AWS. However, the firm will need to stipulate the patching and downtime windows when administrative activities can be executed. In addition, Identity & Access Management (IAM) and role-based access privileges will need to be configured by the firm and integrated with their Active Directory for optimal security.

We have illustrated a high-level reference architecture below to visualise how this might function in principle.

**AWS Textract Reference Architecture for Dynamic Operational Resilience**

At a high-level, the process is executed as follows:

The process starts when a message is sent to an SQS queue to analyze a document. In our example, these could be risk assessments, supplier contracts or service description documents.
A job scheduler Lambda function runs at a set frequency, for example every 60 minutes, and polls for messages in the SQS queue. This polling could be set at daily or monthly intervals for documents that are unlikely to experience significant changes.
For each message in the queue it submits an Amazon Textract job to process the document and continues submitting these jobs until it reaches the maximum limit of concurrent jobs in your AWS account.
Once Amazon Textract has finished processing a document, it sends a completion notification to an SNS topic. In this instance, we can also send alerts and updates to collaboration tooling like Slack or Microsoft Teams. Alternatively, an email can be sent.
SNS then triggers the job scheduler Lambda function to start the next set of Amazon Textract jobs.
SNS also sends a message to an SQS queue, which is then processed by a Lambda function to get the results from Amazon Textract and store them in a relevant data store, for example DynamoDB or S3.
Upon submission of a new artefact, the relevant values are identified by a Lambda function which then replicates the relevant changes into OpRes.
Notification alerts can also be pushed to collaboration tooling and email accounts in order to build in pre/post validation steps to ensure the changes have been committed successfully.
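The job scheduler step described above could be sketched roughly as follows. This is a hedged illustration rather than our actual implementation. It assumes each SQS message body is JSON of the form `{"bucket": ..., "key": ...}` (our assumption), and it takes the SQS and Textract clients as parameters so that the logic can be tested with stand-ins. In a real Lambda these would typically be `boto3.client("sqs")` and `boto3.client("textract")`. The `max_jobs` cap is a simple stand-in for the account's concurrent job limit.

```python
import json
from typing import Any, List


def submit_pending_jobs(sqs: Any, textract: Any, queue_url: str,
                        sns_topic_arn: str, role_arn: str,
                        max_jobs: int = 10) -> List[str]:
    """Pull up to `max_jobs` messages and start one Textract job for each."""
    job_ids: List[str] = []
    while len(job_ids) < max_jobs:
        batch = sqs.receive_message(
            QueueUrl=queue_url,
            # SQS returns at most 10 messages per call.
            MaxNumberOfMessages=min(10, max_jobs - len(job_ids)),
        )
        messages = batch.get("Messages", [])
        if not messages:
            break  # nothing left to process in this run
        for message in messages:
            body = json.loads(message["Body"])  # assumed message format
            job = textract.start_document_analysis(
                DocumentLocation={
                    "S3Object": {"Bucket": body["bucket"], "Name": body["key"]}
                },
                FeatureTypes=["FORMS", "TABLES"],
                # Textract publishes the completion notice to this SNS topic.
                NotificationChannel={
                    "SNSTopicArn": sns_topic_arn,
                    "RoleArn": role_arn,
                },
            )
            job_ids.append(job["JobId"])
            # Remove the message only once the job has been accepted.
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
            )
    return job_ids
```

Deleting each message only after the job has been accepted means a failed submission leaves the document on the queue, ready to be picked up by the next scheduled run.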

So now we can get information into OpRes via the UI, whilst also interacting with low change volume documents. Let's turn our attention to streaming real-time resilience data points from a wealth of internal and external sources.

When banks make a trade on the stock market, they compare multiple data sources to understand the risk associated with the transaction, whether these be socio-economic considerations, political data points or the liquidity position of the firm. They will have a defined appetite and tolerance levels associated with how much risk they can bring on to the firm, and this will be calculated in real time in order to determine the potential return on investment from executing their trades.

Conversely, when we look at the technology estate of the financial landscape, if a change is required to be administered in production it can take weeks, if not months to deliver the most minuscule of feature enhancements. Weeks of testing, risk assurance, change advisory boards and four eyes checking can diminish the value of the feature enhancement with each passing day.

With OpRes, we wanted to reimagine how operational resilience is executed, moving it from a stale, point-in-time activity to a real-time market differentiator for firms of all shapes and sizes. By further integrating with higher-level services in AWS, we are able to realise this target and help firms unlock a real-time operational resilience reporting capability.

Much like the architecture referenced above using Amazon Textract, we have again opted to use managed PaaS offerings for our real-time event streaming architecture. This allows us to speed up the delivery of an entirely serverless platform, in turn allowing firms to ingest terabytes of operational resilience, third-party risk management and system performance data every day, whether this be information from the public domain, trusted market data sources or even system/application data from the corporate data centre. For example, you could start to define and ingest a set of key resilience indicators across your technology landscape!
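To give a feel for what a key resilience indicator might look like on the wire, here is a small Python sketch. The `KeyResilienceIndicator` fields and the `publish` helper are assumptions made for illustration. The helper targets a Kinesis-style `put_record` call, with the client passed in as a parameter (for example `boto3.client("kinesis")`). A Kafka producer would use a different client but could reuse the same serialisation.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class KeyResilienceIndicator:
    service: str       # important business service the reading belongs to
    indicator: str     # e.g. "payment_latency_ms" (hypothetical name)
    value: float
    observed_at: str   # ISO 8601 timestamp in UTC


def to_record(kri: KeyResilienceIndicator) -> bytes:
    """Serialise an indicator as UTF-8 JSON with a stable key order."""
    return json.dumps(asdict(kri), sort_keys=True).encode("utf-8")


def publish(kinesis: Any, stream_name: str, kri: KeyResilienceIndicator) -> dict:
    """Send one reading; partitioning by service keeps its readings ordered."""
    return kinesis.put_record(
        StreamName=stream_name,
        Data=to_record(kri),
        PartitionKey=kri.service,
    )


reading = KeyResilienceIndicator(
    service="payments",
    indicator="payment_latency_ms",
    value=850.0,
    observed_at="2021-07-26T09:00:00+00:00",
)
print(to_record(reading).decode("utf-8"))
```

Agreeing a small, explicit record shape like this early on makes it easier for multiple producers to feed the same downstream analytics.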

In order for us to ingest, analyse, learn and report on real-time resilience data, we would need to leverage the following services in AWS:

Amazon Managed Streaming for Apache Kafka (Amazon MSK)

Amazon Kinesis Data Analytics

Amazon Simple Notification Service

Amazon Elasticsearch Service

Kibana (Not an AWS Service but part of the open source family!)

By pulling these services together, the high-level architecture would look like the diagram below in principle.

**Real Time Event Streaming for Dynamic Operational Resilience Insights**

At a high-level, the architecture would apply the following patterns:

Multiple producers continuously generate operational resilience data that might amount to terabytes per day, whether these be system events, financial data, economic data or social media postings. Producers integrate with the Kinesis Agent, a standalone Java application, to collect and send data to Amazon Kinesis Data Streams. The agent continuously monitors a set of files and sends new data to the chosen stream. It handles file rotation, checkpointing and retry upon failures, and delivers the data in a reliable, timely and simple manner.
We have opted to use Amazon Managed Streaming for Apache Kafka (Amazon MSK) to enable rapid data ingestion and aggregation. This will be suitable for firms that are relatively new to Apache Kafka, enabling them to run production-grade capabilities without needing specific Kafka infrastructure management expertise.
Streaming data can be fed to a real-time predictive analytics system to derive inferences, which in turn can be used for alerting using Amazon SNS. In addition, we have opted to integrate with SageMaker and Athena. This will allow the firm to perform detailed analysis of their data sets with advanced analytics. For instance, a firm could use Athena to identify patterns between the resilience of an important business service, weighted against its technical debt and the volume of changes being deployed across its technology estate. This can be done dynamically, without having to build huge infrastructure capabilities and manage multiple ETL pipelines. ETL will be handled by Athena’s out-of-the-box integrations with AWS Glue.
Furthermore, we have opted to use SageMaker to build predictive data models that are trained to notify a firm when certain performance conditions are being experienced or reported on. As an example, this could be used to alert the firm that their impact tolerances are at risk of being breached, or alternatively to build models that predict when a technology disruption could be set to occur based on a set of key resilience indicators. Indeed, we believe that this could be a very powerful intelligence tool for running continuous scenario testing over time across a firm's important business services.
Finally, data topics that are streamed can be further distributed to Kinesis Data Streams for additional processing. We have also opted to create an S3 data lake for integrating with Kibana and OpRes, in order to visualise the required resilience dashboards and reporting capabilities, for example the real-time visualisation of service level indicators that are used to monitor impact tolerance thresholds.
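As a simple illustration of the alerting idea, the sketch below flags an impact tolerance as at risk when too many of the most recent readings exceed a threshold. This is a plain rolling-window rule of our own choosing for the example, not a trained SageMaker model. The class and field names are hypothetical, and the SNS client is optional and passed in (for example `boto3.client("sns")`).

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Deque, Optional


@dataclass
class ImpactTolerance:
    service: str
    indicator: str
    threshold: float    # readings above this are out of tolerance
    window: int         # number of most recent readings to consider
    max_breaches: int   # at risk once this many readings exceed the threshold


class ToleranceMonitor:
    def __init__(self, tolerance: ImpactTolerance,
                 sns: Optional[Any] = None, topic_arn: Optional[str] = None):
        self.tolerance = tolerance
        self.sns = sns
        self.topic_arn = topic_arn
        self.readings: Deque[float] = deque(maxlen=tolerance.window)

    def observe(self, value: float) -> bool:
        """Record a reading and return True if the tolerance is at risk."""
        self.readings.append(value)  # the oldest reading drops out automatically
        breaches = sum(1 for v in self.readings if v > self.tolerance.threshold)
        at_risk = breaches >= self.tolerance.max_breaches
        if at_risk and self.sns is not None:
            self.sns.publish(
                TopicArn=self.topic_arn,
                Subject=f"Impact tolerance at risk: {self.tolerance.service}",
                Message=(
                    f"{breaches} of the last {len(self.readings)} readings of "
                    f"{self.tolerance.indicator} exceeded {self.tolerance.threshold}"
                ),
            )
        return at_risk


monitor = ToleranceMonitor(
    ImpactTolerance("payments", "latency_ms", 1000.0, window=3, max_breaches=2)
)
print([monitor.observe(v) for v in [500, 1200, 900, 1500, 1100]])
# [False, False, False, True, True]
```

A rule like this is easy to reason about and could act as a baseline against which any predictive model is later compared.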

What have we covered?

Over the course of this blog, we have explained our target of moving firms towards a real-time operational resilience posture. We have also provided examples of reference architectures that make use of higher-level services in AWS. In addition, we have seeded some suggestions of how firms could start to maximise these capabilities, not only to comply with the regulatory requirements set out by the PRA & FCA, but also to build a competitive edge over their rivals.

Our reasons for suggesting AWS higher-level services are twofold. Firstly, timelines are tight between now and March 2022. As such, using higher-level services allows us to deploy these reference architectures quickly, most probably in a matter of weeks. Secondly, leveraging PaaS capabilities removes the administrative overheads for firms with respect to running and maintaining infrastructure.

I said at the start of this blog that OpRes alone will not be a silver bullet solution for firms. However, by pulling together additional components from AWS’s rich set of offerings, we can enable firms not just to map their important business services and set baselined impact tolerances, but to establish an evergreen process that is highly automated and integrated with trusted data sources, in order to generate a set of real-time operational resilience insights. In turn, firms can move away from a subjective and static set of reporting perspectives to an evidence-based approach that is dynamic in nature.
