Over the past decade, system resilience (a.k.a. system resiliency) has been widely discussed as a critical concern, especially in the context of data centers and cloud computing. It is also vitally important to cyber-physical systems, although the term is less commonly used in that domain. Everyone wants their systems to be resilient, but what does that actually mean? And how does resilience relate to other quality attributes, such as availability, reliability, robustness, safety, security, and survivability? Is resilience a component of some or all of these quality attributes, a superset of them, or something else? If we are to ensure that systems are resilient, we must first know the answers to these questions and understand exactly what system resilience is.
As part of work on the development of resilience requirements for cyber-physical systems, I recently completed a literature study of existing standards and other documents related to resilience. My review revealed that the term resilience is typically used informally, as though its meaning were obvious. In those cases where it was defined, it was given similar, but somewhat inconsistent, meanings.
Another issue I found was that the term resilience is used in two very different senses: system resilience and organizational resilience. This blog post, the first in a two-part series, focuses on system resilience rather than organizational resilience, which has a much larger scope. Organizational resilience is primarily concerned with business continuity and includes the management of people, information, technology, and facilities.
Basically, a system is resilient if it continues to carry out its mission in the face of adversity (i.e., if it provides required capabilities despite excessive stresses that can cause disruptions). Being resilient is important because no matter how well a system is engineered, reality will sooner or later conspire to disrupt the system. Residual defects in the software or hardware will eventually cause the system to fail to correctly perform a required function or cause it to fail to meet one or more of its quality requirements (e.g., availability, capacity, interoperability, performance, reliability, robustness, safety, security, and usability). The lack or failure of a safeguard will enable an accident to occur. An unknown or uncorrected security vulnerability will enable an attacker to compromise the system. An external environmental condition (e.g., loss of electrical supply or excessive temperature) will disrupt service.
Due to these inevitable disruptions, availability and reliability by themselves are insufficient, and thus a system must also be resilient. It must resist adversity and provide continuity of service, possibly under a degraded mode of operation, despite disturbances due to adverse events and conditions. It must also recover rapidly from any harm that those disruptions might have caused. As in the old Timex commercial, a resilient system "can take a licking and keep on ticking."
However, system resilience is more complex than the preceding explanation implies. System resilience is not a simple Boolean function (i.e., a system is not merely resilient or not resilient). No system is 100 percent resilient to all adverse events or conditions. Resilience is always a matter of degree. System resilience is typically not measurable on a single ordinal scale. In other words, it might not make sense to say that system A is more resilient than system B.
To fully understand resilience, it must be decomposed into its component parts. To exhibit resilience, a system must incorporate controls that detect adverse events and conditions, respond appropriately to these disturbances, and rapidly recover afterward. Because resilience assumes that adverse events and conditions will occur, controls that prevent adversities are outside of the scope of resilience.
Some resilience controls support detection, while other controls support response or recovery. A system may therefore be resilient in some ways, but not in others. System A might be the most resilient in terms of detecting certain adverse events, whereas system B might be the most resilient in terms of responding to other adverse events. Conversely, system C might be the most resilient in terms of recovering from a specific type of harm caused by certain adverse events.
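To make the point about comparability concrete, here is a minimal Python sketch. It assumes, purely for illustration, that a system's resilience to one class of adversity can be summarized by three scores between 0 and 1 (detection, response, and recovery). The class name `ResilienceProfile`, the scoring scheme, and the numbers are all hypothetical. The sketch treats one system as "more resilient" than another only when it is at least as good on every dimension and strictly better on at least one, which is a partial order rather than a single scale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResilienceProfile:
    """Hypothetical scores in [0, 1] for one class of adversity; higher is better."""
    detection: float
    response: float
    recovery: float

    def dominates(self, other: "ResilienceProfile") -> bool:
        """True only if self is at least as good everywhere and differs somewhere."""
        mine = (self.detection, self.response, self.recovery)
        theirs = (other.detection, other.response, other.recovery)
        return all(m >= t for m, t in zip(mine, theirs)) and mine != theirs


def compare(a: ResilienceProfile, b: ResilienceProfile) -> str:
    """Compare two profiles without collapsing them onto a single scale."""
    if a == b:
        return "equal"
    if a.dominates(b):
        return "first is more resilient"
    if b.dominates(a):
        return "second is more resilient"
    return "incomparable"


# Illustrative values only: each system is strongest in a different function.
system_a = ResilienceProfile(detection=0.9, response=0.5, recovery=0.4)
system_b = ResilienceProfile(detection=0.6, response=0.9, recovery=0.5)
system_c = ResilienceProfile(detection=0.5, response=0.4, recovery=0.9)

print(compare(system_a, system_b))
print(compare(system_b, system_c))
```

With these made-up scores, no system dominates another, so every pairwise comparison comes back as incomparable. Any single-number ranking would have to introduce weights that the definition of resilience itself does not supply.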
A system is resilient to the degree to which it rapidly and effectively protects its critical capabilities from disruption caused by adverse events and conditions.
Implicit in the preceding definition is the idea that adverse events and conditions will occur. System resilience is about what the system does when these potentially disruptive events occur and conditions exist. Does the system detect these events and conditions? Does the system properly respond to them once they are detected? Does the system properly recover afterward?
Some organizations include the avoidance of adverse events and conditions within system resilience. However, this is misleading and inappropriate as avoidance falls outside of the definition of system resilience. Avoiding or preventing adversities does not make a system more resilient. Rather, avoidance decreases the need for resilience because systems would not need to be resilient if adversities never occurred.
A resilient system protects its critical capabilities (and associated assets) from harm by using protective resilience techniques to passively resist adverse events and conditions or actively detect these adversities, respond to them, and recover from the harm they cause. As we shall see in the second post in this series, each of these adverse events and conditions is associated with subordinate quality characteristics.
To understand the full scope and complexity of system resilience, it is important to understand the meanings of the key words italicized in the preceding definition and how they are related in the preceding figure.
Protection consists of the following four functions.
Resistance is the system's ability to passively prevent or minimize harm from occurring during the adverse event or condition. Resilience techniques for passive resistance include a modular architecture that prevents failure propagation between modules, a lack of single points of failure, and the shielding of electrical equipment, computers, and networks from electromagnetic pulses (EMP).
Detection is the system's ability to actively detect the loss or degradation of critical capabilities, harm to assets needed to implement critical capabilities, and adverse events and conditions that can cause harm to critical capabilities or related assets.
Reaction is the system's ability to actively react to the occurrence of an ongoing adverse event or respond to the existence of an adverse condition (whereby the reaction is implemented by reaction techniques). On detecting an adversity, a system might stop or avoid the adverse event, eliminate the adverse condition, and thereby eliminate or minimize further harm. Reaction techniques include employing exception handling, degraded modes of operation, and redundancy with voting.
Recovery is the system's ability to actively recover from harm after the adverse event is over (whereby recovery is implemented by recovery techniques). Recovery can be complete in the sense that the system is returned to full operational status with all damaged/destroyed assets having been repaired or replaced. Recovery can also be partial (e.g., full service is restored using redundant resources without replacement/repair) or minimal (e.g., degraded mode operations providing only limited services). Recovery might also include the system evolving or adapting (e.g., via reconfiguring itself) to avoid future occurrences of the adverse events or conditions.
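The active functions described above (detection, reaction, and recovery) can be illustrated with a small Python sketch of redundancy with voting. It is a simplified illustration under stated assumptions, not a reference design: the class `VotingChannel`, its method names, the status strings, and the toy replicas are all hypothetical, and replica outputs are compared for exact equality, whereas a real system would usually compare within a tolerance.

```python
from collections import Counter


class VotingChannel:
    """Hypothetical redundant channel: N replicas compute the same value."""

    def __init__(self, replicas, fallback):
        self.replicas = list(replicas)
        self.fallback = fallback   # value served in degraded mode
        self.suspect = set()       # indices of replicas flagged by the last read

    def read(self, x):
        outputs = []
        for replica in self.replicas:
            try:
                outputs.append(replica(x))
            except Exception:
                outputs.append(None)   # reaction: exception handling
        tally = Counter(o for o in outputs if o is not None)
        winner, votes = tally.most_common(1)[0] if tally else (None, 0)
        if votes * 2 > len(self.replicas):
            # Detection: any replica that disagrees with the majority is suspect.
            self.suspect = {i for i, o in enumerate(outputs) if o != winner}
            # Reaction: the majority value masks the faulty replicas.
            return winner, ("masked" if self.suspect else "nominal")
        # Reaction: no majority, so fall back to a degraded mode of operation.
        self.suspect = set(range(len(self.replicas)))
        return self.fallback, "degraded"

    def recover(self, make_replica):
        """Recovery: replace every suspect replica with a fresh one."""
        for i in self.suspect:
            self.replicas[i] = make_replica()
        self.suspect = set()


# Toy replicas for illustration only.
def good(x):
    return x * 2


def stuck(x):
    return 0   # always reports the same value


def crashing(x):
    raise RuntimeError("replica offline")


channel = VotingChannel([good, good, stuck], fallback=-1)
print(channel.read(3))          # one replica is outvoted
channel.recover(lambda: good)   # swap in a healthy replica
print(channel.read(3))
```

In this sketch a single faulty replica is outvoted and the service continues at full capability, two faulty replicas force the degraded fallback, and `recover` corresponds to complete recovery because the damaged replicas are replaced rather than merely bypassed.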
Note that anti-tamper (AT) is a special case that, at first glance, might appear to be unrelated to resilience. The goal of AT is to prevent an adversary from reverse-engineering critical program information (CPI) such as classified software. Anti-tamper experts typically assume that an adversary will obtain physical possession of the system containing the CPI to be reverse engineered, in which case ensuring that the system continues to function despite tampering would be irrelevant. However, tampering can also be attempted remotely (i.e., without first acquiring possession of the system). In situations where the adversary does not have physical possession of the system, an AT countermeasure might be to detect an adversary's remote attempt to access and copy the CPI and then respond by zeroizing the CPI, at which point the system would no longer be operational. Thus, remote tampering does have resilience ramifications.
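The remote-tampering scenario can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not an actual anti-tamper mechanism: the class `ProtectedStore`, the token check, and the failure threshold are hypothetical, and overwriting a `bytearray` in Python does not guarantee that no other copies of the data remain in memory.

```python
import hmac


class ProtectedStore:
    """Hypothetical holder of sensitive bytes that zeroizes on repeated failed access."""

    def __init__(self, secret: bytes, token: str, max_failures: int = 3):
        self._secret = bytearray(secret)
        self._token = token
        self._max_failures = max_failures
        self._failures = 0
        self.operational = True

    def read(self, token: str) -> bytes:
        if not self.operational:
            raise RuntimeError("store has been zeroized")
        # Constant-time comparison of the presented token with the expected one.
        if hmac.compare_digest(token.encode(), self._token.encode()):
            self._failures = 0
            return bytes(self._secret)
        self._failures += 1                       # detection: count failed attempts
        if self._failures >= self._max_failures:
            self._zeroize()                       # response: destroy the protected data
        raise PermissionError("access denied")

    def _zeroize(self) -> None:
        """Overwrite the secret in place and take the store out of service."""
        for i in range(len(self._secret)):
            self._secret[i] = 0
        self.operational = False


store = ProtectedStore(b"classified", token="good-token", max_failures=2)
for guess in ("guess-1", "guess-2"):
    try:
        store.read(guess)
    except PermissionError:
        pass
print(store.operational)
```

The sketch makes the trade-off visible: the response protects the CPI, but it does so by giving up the capability, so even a legitimate caller is refused once the store has been zeroized.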