On the 19th of July, many of us woke up to news of disruptions in the operations of airlines, banks, financial services, and other sectors. The disruption was caused by one of the biggest Information Technology infrastructure meltdowns, in which Windows machines got stuck in reboot loops, throwing up the Blue Screen of Death (BSOD) on every boot cycle.
Though IT disruptions are not uncommon, they mostly affect a single organisation; this time the effect was global, spanning many organisations. Those using Crowdstrike’s cybersecurity solutions to secure Windows endpoints were affected. The primary cause of the disruption has been identified as an error in a content update for Crowdstrike’s Falcon Endpoint Detection and Response (EDR) tool. Mac and Linux systems were unaffected.
IT organisations are increasingly using EDR solutions to safeguard against cyber threats, either as replacements or as complements to their anti-virus solutions. Traditional anti-viruses identify malicious activities by looking for signatures of malicious code, mostly their file hashes. However, modern threat actors are sophisticated; they use methods that may not be captured through signatures and keep evolving. Modern EDR tools look for the behavioural aspects of malicious activities and the methods employed, commonly known as Tactics, Techniques, and Procedures (TTPs). To capture these behavioural aspects, EDR solutions include ‘sensors’ (also known as agents) that run as part of the OS running on the endpoint device.
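The limitation of hash-based signatures can be seen in a minimal sketch. It assumes a signature database that is simply a set of SHA-256 file hashes; the payload bytes and the names SIGNATURES and is_flagged are hypothetical stand-ins, not any vendor's actual format.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-in for a known malicious file's contents.
known_bad = b"malicious payload v1"

# Hypothetical signature database: a set of hashes of known bad files.
SIGNATURES = {sha256_hex(known_bad)}


def is_flagged(data: bytes) -> bool:
    """Flag a file only if its hash exactly matches a known signature."""
    return sha256_hex(data) in SIGNATURES


# The exact known sample is caught, but a one-byte variant is not.
original_caught = is_flagged(known_bad)
variant_caught = is_flagged(known_bad + b" ")
```

Changing a single byte produces a different hash, so the variant slips past the signature check even though its behaviour would be the same. This is the gap that behaviour-based detection tries to close.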
Malicious actors combine known and new TTPs in innovative ways to launch attacks. These attacks may go unnoticed by traditional anti-viruses, as signatures may not be available for all malicious behaviours and new attacks. An attacker may need to execute a sequence of TTPs to carry out an attack successfully. EDR tools continuously monitor activities on endpoints to look for TTPs that could potentially indicate malicious behaviour. Cybersecurity firms gather threat intelligence from various sources that indicate potential malicious behaviour captured through TTPs. EDR tools build detection rules based on the sequences of TTPs observed and the correlations among them.
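As a rough illustration of a rule built on a sequence of TTPs, the sketch below checks whether a rule's steps appear in order within a stream of endpoint events. The Rule class, the matches function, and every event label are hypothetical; real EDR rules are far richer and this is not how any particular product represents them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    """A hypothetical detection rule: TTP labels that must occur in order."""
    name: str
    sequence: Tuple[str, ...]


def matches(rule: Rule, events: List[str]) -> bool:
    """Return True if the rule's steps appear in order in the event stream.

    Other events may occur between the steps. Each membership test consumes
    the iterator up to the match, so the order of steps is enforced.
    """
    remaining = iter(events)
    return all(step in remaining for step in rule.sequence)


# Hypothetical rule and event streams, for illustration only.
rule = Rule(
    name="credential-theft",
    sequence=(
        "attachment_opened",
        "script_interpreter_spawned",
        "credential_store_read",
    ),
)

suspicious = [
    "browser_started",
    "attachment_opened",
    "file_saved",
    "script_interpreter_spawned",
    "credential_store_read",
]
benign = ["script_interpreter_spawned", "file_saved"]
```

Here matches(rule, suspicious) is True while matches(rule, benign) is False: each step on its own may be legitimate, and it is the ordered combination that the rule treats as a signal.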
Malicious behaviours evolve dynamically. Moreover, the techniques used by malicious actors are evasive, and distinguishing them from those used by legitimate users is challenging. Cybersecurity firms address this challenge by regularly pushing updated and fine-tuned rules to defend against new attacks and reduce false alarms. To keep the detection rules up to date, end users often subscribe to automatic rule updates.
Furthermore, attackers try to gain execution control before security software such as EDR can detect them. To prevent this, EDR sensors are designed to run as part of the operating system’s kernel, which gives them the highest level of privilege. As a side effect, this makes them potentially lethal: any misstep on their part can destabilise the kernel and eventually crash the OS.
The kernel of an operating system works differently from other programmes. The simplest difference is that if an ordinary programme malfunctions and abruptly crashes, the damage is restricted to that programme. The rest of the system continues to work. However, a kernel is a shared component that acts as a broker of resources (e.g., hardware resources) for all other programmes.
Hence, if the kernel misbehaves, every other programme will be impacted. Therefore, an error in a kernel programme, or in an external module (like an EDR sensor) loaded as part of the kernel, can crash the entire operating system. Since kernel modules are loaded automatically at system start, the OS crashes again on every boot because of the same error, and the result is an endless loop that can only be broken by manual intervention.
In the case of Crowdstrike, instructions on how to respond to novel tactics, techniques, and procedures are sent via configuration files named “Channel-files”. These are part of the behavioural protection mechanisms used by the Falcon sensor. Crowdstrike releases these configuration files regularly, and the one released on July 19, 2024, at 04:09 UTC contained a logic error that resulted in a system crash and blue screen (BSOD) on impacted systems. Crowdstrike also confirmed that it was not a cyberattack but a logic error that impacted only the Windows version of its software. Crowdstrike reported this on its blog, which also provided a remediation method. The remediation requires manual steps, which now need to be performed on millions of computers, making recovery enormously painful.
The recent global outage is not the first of its kind. Such outages have happened before and are likely to happen again, given the dependence of IT infrastructure on multiple vendors, the rise in cybersecurity attacks, and the growing demand to scale businesses. The latest fiasco is an engineering failure, not a technological one, though it may have been driven by business priorities. The last such engineering failure in the cybersecurity domain to cause large-scale disruption was the McAfee DAT 5958 issue. We are seeing a repeat of that, just on a different scale and with a different vendor.
The scale of disruption this logic error has caused highlights the critical need for stringent quality checks by organisations. The error should have been detected in a quality check; it is unclear how it crept in despite quality checks being recommended as standard operating procedure. Also, when the stakes are as high as crippling their businesses, end users need to check updates in a sandbox environment before pushing them into production. However, an organisation may prefer to skip such stringent checks because they raise costs, and this is what happens when speed and scale of deployment take precedence over quality checks. It is also a trade-off between a delay in getting updates and tolerating a bug. The choice lies with the organisation and depends on the risk involved.
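The idea of checking an update in a sandbox before production can be sketched as a simple gate. This is an illustrative assumption about how an organisation might structure such a check, not a description of any vendor's deployment pipeline; the function deploy_if_sandbox_passes, the host names, and the apply_update callback are all hypothetical.

```python
from typing import Callable, List


def deploy_if_sandbox_passes(
    sandbox_hosts: List[str],
    production_hosts: List[str],
    apply_update: Callable[[str], bool],
) -> List[str]:
    """Apply an update to sandbox hosts first, then to production.

    apply_update(host) is assumed to return True if the host is still
    healthy after the update. If any sandbox host fails, deployment stops
    and no production host is touched. Returns the production hosts that
    were updated successfully.
    """
    for host in sandbox_hosts:
        if not apply_update(host):
            return []  # halt: the faulty update never reaches production

    updated = []
    for host in production_hosts:
        if apply_update(host):
            updated.append(host)
    return updated


# Hypothetical fleet, for illustration only.
sandbox = ["sandbox-1", "sandbox-2"]
production = [f"prod-{i}" for i in range(5)]

touched: List[str] = []


def faulty_update(host: str) -> bool:
    """Simulate an update that crashes every host it is applied to."""
    touched.append(host)
    return False


result_faulty = deploy_if_sandbox_passes(sandbox, production, faulty_update)
result_good = deploy_if_sandbox_passes(sandbox, production, lambda host: True)
```

With the faulty update, only the first sandbox host is affected and production is left alone; with a healthy update, all five production hosts receive it. The cost is the delay spent on the sandbox step, which is the trade-off described above.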
Manjesh Kumar Hanawal is Associate Professor, IIT Bombay, and Atul Kabra is Consultant, TCAAI, IIT Bombay. The views expressed in the above piece are personal and solely those of the authors. They do not necessarily reflect Firstpost’s or IIT Bombay’s views.