
CrowdStrike outage: A lesson in operational resilience

By: Manu Sharma
Organisations worldwide have suffered large-scale disruption caused by a defective content update. Manu Sharma highlights the importance of effective operational resilience and third-party supply chain oversight.

Fortunately, this disruption wasn't caused by a malicious cyber-attack, but the defective cybersecurity software update affected Windows computers worldwide and led to significant disruption across many industries.

The outage challenged many organisations' operational resilience frameworks – sharply highlighting the requirements and challenges of rapid, co-ordinated, and effective recovery.

While regulators have their sights set on resilience, it's more than a regulatory requirement. Firms need to protect their increasingly complex operations and supply chains from internal and external incidents, shocks, and threats.

It's critical to have robust plans in place to ensure continuity of key operations, maintain data security, and to minimise the impact on customers and the overall business as inevitable issues and incidents occur. Understanding and assessing complex technology supply chains is a critical part of an effective overall resilience approach.

How did the global outage happen?

On 19 July 2024, CrowdStrike issued a defective content update for its security software, causing an estimated 8.5 million Microsoft Windows computers to crash and fail to restart correctly. This resulted in extensive outages and knock-on impacts to the technology systems and services that relied upon them.

The outage disrupted daily operations not only for many financial services firms but also for businesses and governments worldwide, affecting industries such as airlines, airports, hotels, hospitals, manufacturing, broadcasting, fuel stations, and retail stores. It also affected public services, including emergency services and booking systems.

Although a fix was released shortly after the error was discovered, affected computers had to be fixed manually, which prolonged the outages and caused ongoing disruption across various services.

Operational resilience – not just a regulatory requirement

The IT outages and the subsequent disruption underscore the importance of effective operational resilience planning, highlighting that resilience isn't just a regulatory requirement but an essential element of ongoing business operations.

Understanding your third-party supply chain, assessing critical third parties, testing disruption scenarios, identifying vulnerabilities, and scenario planning (including developing manual workarounds) are crucial steps to ensure that important business services remain within impact tolerances.

Disruption can be caused both by external threats, such as cyber-attacks, and by internal changes, such as software updates and technology configuration changes.

It's essential not just to accept risks but to proactively plan for and manage potential disruption. 


How can firms strengthen operational resilience?

Ensure consistent focus and engagement across the organisation

Effective operational resilience requires engagement across the organisation to identify key services, understand the supply chain and assess vulnerabilities. This requires structure, controlled coordination, and the right level of ongoing senior management engagement and sponsorship.

While upcoming regulatory deadlines are important, operational resilience will not be ‘done’ – activities and approaches will need ongoing management and development to keep pace with technology and organisational changes and protect against new threats.

Understand your third-party supply chain

Contractual protection, oversight and assurance activities need to be proportionate to the services that third parties provide to your organisation and to the criticality of those services.

Engage with, and consider auditing, key vendors to confirm how their activities fit within your resilience arrangements. You need to understand exactly what activities and controls are being carried out on your behalf, and make sure that these are properly reflected in your scenario testing, identification of vulnerabilities and remediation activities.

Use data effectively

Operational resilience needs to be based on a complete, consistent view of processes and activities across the organisation, in sufficient depth and detail. Establishing the key data points and setting up effective approaches to collating and analysing information supports prioritisation of effort, effective delivery of operational resilience, and ongoing monitoring.

Prioritise and ensure the right resource is available

Operational resilience requires a mix of skills and knowledge across the organisation to understand relevant business processes and their underlying dependencies, particularly IT. Getting into the required level of technical detail requires a mix of business, technical (IT and contracts / legal) and project resources. Firms need to prioritise making this resource available for operational resilience planning.


What’s next for financial services firms?

Regulatory deadlines aside, firms without proper plans risk prolonged service disruptions, financial losses, and reputational damage.

Having robust operational resilience plans is essential for maintaining trust, safeguarding sensitive information, and meeting regulatory requirements in the event of a cyber outage.

To learn more about strengthening operational resilience and preparing for large-scale disruption, contact our team.
