Article written by Vito Rallo
Tools, processes, and techniques to orchestrate security and perform rapid or automated response are increasingly powerful, accessible and straightforward.
However, the solutions put in place in response to cybersecurity incidents don’t fit the industrial operational technology (OT) environment. It’s clear to the engineers involved in operations and production, but the IT crowd with a cybersecurity mindset often struggles to understand why. Running a response in OT requires dedicated skill sets. Classic models don’t fit, nor does taking a traditional forensic approach to the emergency. And all the fancy host-based IT technology for detecting and containing advanced threats is a dangerous stretch; it’s not suitable for an industrial environment - period.
There’s a host of differences to consider before writing a plan, building internal incident response (IR) capabilities, or hiring an external IR team during the contingency. That's a lesson you learn in the field, running a threat management consulting business in a country whose manufacturing sector has been bombarded with cyber attacks for months.
There are plenty of good reasons why IT-style response doesn't fit OT; if you're used to cybersecurity response concepts, the following points may provide useful insights, even if they sound awkward at first.
Unfortunately, the consequences of a compromised OT network might be far worse than just the breach of the IT system in which it originated. A system shutdown can have a drastic impact on physical elements, with possible repercussions for safety and/or damage to the installation and environment.
There is good news: apart from sporadic cases of targeted attacks on a specific industry and technology, today's attacks are often an amplification of IT breaches. A dedicated piece of malware built to control operational aspects of critical infrastructure would be far more devastating, as was the case for renowned attacks like the Stuxnet malware, the Sandworm campaign or the Ukrainian power grid attack.
Often an attacker barely realises that the domain or host just compromised is part of the OT environment. They spread malware and move laterally using known techniques, easily exploiting poor segmentation and protection, weak and old operating systems and poorly managed access privileges.
The OT environment is incredibly rich and colourful: hardware, software, real-time systems, etc. Don't expect to find a single homogeneous 64-bit Windows environment. General-purpose malware or ransomware will either run with devastating efficiency, when the CPU architecture matches and the operating system is compatible, unprotected and outdated, or it won't run at all. These types of attacks still carry a considerable risk factor, however. They are more likely to impact human-machine interfaces (HMIs) running on old Windows systems than an embedded device, a programmable logic controller (PLC), or a dedicated real-time operating system (RTOS). Data on the manufacturing execution system (MES) will probably be compromised, and tracking and quality control will be impossible, as will access to shared drives to sideload automation programmes. All good reasons to interrupt production.
OT priorities are different. In essence, an OT cyber incident response comes with safety, operations continuity and regulatory requirements top of mind. Industrial IT is a business enabler or a support function, not the core business.
The goal is to go back into production as soon as possible and safely, without any harm to people, equipment or the environment, keeping an eye on regulations and with minimal financial impact. As simple as that - and it has little in common with forensic investigations, complex root cause analysis and automated response playbooks.
During OT incident response, communication is everything.
OT cybersecurity incidents always come with a level of IT involvement, so close collaboration between IT and OT is crucial. We’re talking about communication that is often difficult even in normal circumstances. Asking a group of cybersecurity-driven technology geeks to talk to a team of process-driven, hard-headed automation engineers with production and efficiency in mind brings major challenges: different languages, mindsets and priorities.
In addition, unlike IT incidents, response handling cannot be completely externalised. Intimate knowledge of the environment and critical business operations can only come from internal resources. An experienced lead coordinating the many stakeholders involved is an excellent starting point.
Forget about forensics and classic data acquisition with lengthy post-processing; it's often impossible. In OT, low-level systems keep logs for short periods only and aren't engineered to store data outside of the production process. Memory or live forensics require dangerous (or even impossible) hooks in the running system; disassembling devices is not an option.
Volatile memory and kernels of many embedded devices and RTOS get wiped on reboots or programme loading. It's a multivendor environment that is often closed; good luck acquiring forensic data from a vendor-supported Windows CE OS infected thanks to an old implementation of the SMBv1 protocol.
Most of the time, OT environments are not ready for cyber incident response. Gaining visibility of security events and detecting attacks across a variety of different devices is a real challenge. Often, even getting an overview of assets for classification and triage is difficult.
OT is not a simplified and legacy IT network; that’s the biggest misconception. It’s not the place to deploy the latest and greatest AI-powered host protection agent with advanced threat hunting capabilities, push a button, and investigate your freshly reverse engineered threat intelligence (indicators of compromise).
Recovery and system reimaging can be time-consuming or scarily costly. You’re not dealing with a typical corporate OS image; you’re dealing with a multi-vendor and often locked support ecosystem that needs proper recovery and reimaging procedures. Virtualisation and thin clients have popped up in industrial IT, at least around support functions. One day, hybrid cloud will reach this domain; effective recovery on a modern MES is tricky but possible. Preparation is crucial.
Before discussing a few essential aspects, one general axiom must be clear: each OT environment is different. Companies within the same industry are running their own business processes. Having the same FMS automation system or the same MES doesn't make their operations or production process similar. Instead, the OT reflects the essence of engineering, every aspect of which has been steadily improved upon across many years.
A single solution, a general OT response model, won't fit all - nor is the approach proposed here a silver bullet.
More than any other aspect, preparation is crucial. Know your environment and its possible weaknesses, know your enemies but keep it realistic: map attack trademarks (classes of attacks) to your real environment. NIST provides a fascinating dissection in NIST SP 800-82 Rev. 2, "Guide to Industrial Control Systems (ICS) Security".
Consider the following aspects carefully in your OT response plan:
Impact and priorities: life, safety, the environment and material property are your focus.
Business involvement and communication: remember that each decision taken during a cyber incident might have amplified consequences for safety, finance and regulation.
Plan for containment: you must know your assets and carefully consider your data flows - how entities of your production process are using technology (networking) to exchange data.
Critical business functions: design workarounds (backup plans) to provide minimum functionalities under limited network communication. During the incident, you might need to work within a ring-fenced, segregated environment. Design your 'islands' in advance.
Draft technical recovery processes and procedures (reimaging) involving all the necessary support from vendors and third parties.
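To make the containment and 'islands' points more concrete, here is a minimal sketch of how a data-flow inventory could be checked against a planned segmentation before an incident ever happens. Everything in it is hypothetical: the asset names, the island names, the flows and both helper functions are illustrative assumptions, not taken from a real plant or a specific tool. The idea is simply to list which flows would be cut when the islands are ring-fenced, because those are the ones that need a designed workaround.

```python
from collections import defaultdict

# Hypothetical asset inventory: asset name -> planned island (ring-fenced segment).
ISLANDS = {
    "hmi-line1": "line1",
    "plc-line1": "line1",
    "hmi-line2": "line2",
    "plc-line2": "line2",
    "mes-server": "plant-services",
    "file-share": "plant-services",
}

# Hypothetical data flows: (source asset, destination asset, purpose).
FLOWS = [
    ("hmi-line1", "plc-line1", "process control"),
    ("hmi-line2", "plc-line2", "process control"),
    ("plc-line1", "mes-server", "production tracking"),
    ("hmi-line2", "file-share", "automation programme download"),
]


def flows_cut_by_isolation(islands, flows):
    """Group the flows that cross an island boundary by (source island, destination island)."""
    cut = defaultdict(list)
    for src, dst, purpose in flows:
        src_island, dst_island = islands[src], islands[dst]
        if src_island != dst_island:
            cut[(src_island, dst_island)].append((src, dst, purpose))
    return dict(cut)


def flows_kept(islands, flows):
    """Return the flows that stay inside a single island and survive ring-fencing."""
    return [flow for flow in flows if islands[flow[0]] == islands[flow[1]]]


if __name__ == "__main__":
    for boundary, cut in flows_cut_by_isolation(ISLANDS, FLOWS).items():
        for src, dst, purpose in cut:
            print(f"{boundary[0]} -> {boundary[1]}: {src} -> {dst} ({purpose}) needs a workaround")
```

In this made-up example, process control stays inside each line's island, while production tracking and programme downloads cross a boundary and would need a backup plan.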
Despite the many attempts to define an 'always good' approach to OT response (see ICS-CERT/US-CERT), most of the existing methods are deeply inspired by IT incident response. The US-CERT document brings many interesting ICS related elements to the table, but experience teaches that CERT/CIRT models are the last thing you consider when firefighting.
A real-life approach to response follows this process:
The three high-level phases described in the NIST document (classification, response and recovery) fit with the restore-operations logic. During an OT incident, a continuous decision-making process fed with analysis (incident investigation and business impact) drives operations during the contingency to ensure a rapid return to production, or a potential shutdown of the parts of the operations that aren’t serving the company's critical business functions.
Cyber threat visibility during IR is the enabler of the many workarounds. It helps control the risks linked to going back into production in a contingency, often with the infected systems isolated.
Visibility and the ability to detect cyber-related threats are something that a cybersecurity-mature environment has permanently in place. If they aren't there, OT responders should make them available immediately, as part of the IR toolset, to give the team some level of comfort in ring-fencing networks. We really must monitor the OT for anomalies: new or known malicious activity (e.g. malware that keeps trying to beacon out, ransomware, lateral movement, or any other form of existing or new compromise), but also deviations from the usual network behaviour. Communications and protocols in the ICS/SCADA environment tend to follow recurring patterns. That’s the principle used by industrial security systems that rely on machine learning and clustering (such as k-means) for anomaly detection.
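As a rough illustration of why recurring patterns make deviations stand out, the sketch below compares observed conversations against a baseline learned from known-good traffic. It is deliberately much simpler than the learning and clustering approaches used by commercial products: it is a plain allow-list comparison, not k-means. The record format, host names, protocols and function names are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical flow records, e.g. summarised from a passive sensor's capture.
BASELINE = [
    {"src": "hmi-line1", "dst": "plc-line1", "proto": "modbus"},
    {"src": "hmi-line1", "dst": "plc-line1", "proto": "modbus"},
    {"src": "plc-line1", "dst": "mes-server", "proto": "opc-ua"},
]

OBSERVED = [
    {"src": "hmi-line1", "dst": "plc-line1", "proto": "modbus"},
    {"src": "hmi-line1", "dst": "203.0.113.10", "proto": "https"},  # possible beaconing
    {"src": "hmi-line1", "dst": "hmi-line2", "proto": "smb"},       # possible lateral movement
    {"src": "hmi-line1", "dst": "203.0.113.10", "proto": "https"},
]


def learn_baseline(records):
    """Count how often each (src, dst, proto) conversation appears in known-good traffic."""
    return Counter((r["src"], r["dst"], r["proto"]) for r in records)


def find_deviations(baseline, records, min_seen=1):
    """Return, in first-seen order, conversations seen fewer than min_seen times in the baseline."""
    reported = set()
    deviations = []
    for r in records:
        key = (r["src"], r["dst"], r["proto"])
        # Counter returns 0 for conversations that never appeared in the baseline.
        if baseline[key] < min_seen and key not in reported:
            reported.add(key)
            deviations.append(key)
    return deviations


if __name__ == "__main__":
    for src, dst, proto in find_deviations(learn_baseline(BASELINE), OBSERVED):
        print(f"deviation from baseline: {src} -> {dst} over {proto}")
```

A comparison like this only works because OT traffic is so repetitive; on a typical office network the same check would drown the team in alerts.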
Experience shows that a specialised OT incident response team can benefit greatly from an advanced threat hunting solution incorporating passive sensor technology (with zero impact on the environment) that, together with ring-fencing at firewall level, helps control and monitor the risks.
Real understanding of risk is critical in providing input to the previously described iterative decision-making process. It's not about potential and residual risks. It's about making decisions quickly under pressure.
Taking risks? Yes. Restarting production as soon as possible, minimising financial loss, that’s what we want, isn’t it?
In every incident there are two significant elements to consider in determining and mitigating the risks associated with restarting production within a segregated and (probably) still compromised environment. These elements support the decision making process:
A deep technical understanding of the attack and the incident, which includes a complete dissection of all the malicious software used (a prerequisite for the analysis of the second factor);
What type of damage the malware can inflict on your specific OT technology (and under which conditions). The impact might range from making the system unusable, leaving it running with great danger of physical damage, or exfiltrating intellectual property, to controllable effects or even no impact at all.
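The two elements can be combined into a quick triage pass: once the malware has been dissected, its requirements are matched against the asset inventory to see where it can actually run. The sketch below is a simplified assumption of how that might look. The asset attributes, the malware profile, the SMBv1 propagation requirement and the three exposure labels are all hypothetical, and a real assessment would also have to consider what the malware does once it is running.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    os: str
    arch: str
    smbv1_enabled: bool


@dataclass(frozen=True)
class MalwareProfile:
    # Hypothetical findings from malware analysis.
    target_os: frozenset
    target_arch: frozenset
    spreads_via_smbv1: bool


def exposure(malware, asset):
    """Classify an asset as 'cannot run', 'can run, propagation path closed' or 'exposed'."""
    if asset.os not in malware.target_os or asset.arch not in malware.target_arch:
        return "cannot run"
    if malware.spreads_via_smbv1 and not asset.smbv1_enabled:
        return "can run, propagation path closed"
    return "exposed"


# Hypothetical inventory and malware profile.
ASSETS = [
    Asset("hmi-line1", "windows-7", "x86", smbv1_enabled=True),
    Asset("eng-workstation", "windows-10", "x86_64", smbv1_enabled=False),
    Asset("plc-line1", "vendor-rtos", "arm", smbv1_enabled=False),
]

RANSOMWARE = MalwareProfile(
    target_os=frozenset({"windows-7", "windows-10"}),
    target_arch=frozenset({"x86", "x86_64"}),
    spreads_via_smbv1=True,
)

if __name__ == "__main__":
    for asset in ASSETS:
        print(f"{asset.name}: {exposure(RANSOMWARE, asset)}")
```

The output of such a pass feeds the decision about which islands can restart and which must stay isolated.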
“Been there, done that, got the t-shirt.” OT cyber incident response is something that you want to run with an experienced partner who genuinely understands your environment, and who won’t blindly start a trial-and-error approach based on “classic” IT response - regardless of leading-edge technology and proven excellence. It's something that requires proper tooling, processes, preparation, and coordination. And a long list of lessons learned on the basis of blood, sweat and tears in the OT field.