As the examples from List 1 begin to indicate, safety problems have many causes. In general, they tend to be related to system complexity, in that more complex systems are more likely to contain faults that cause hazards. Safety is not simply a matter of increasing reliability, because under current technology we are unable to achieve ultra-high reliability in software, and research in this area is still in its infancy.
Hardware fault tolerance techniques are aimed primarily at preventing, detecting, and correcting errors due to random faults, while software reliability is difficult to assess in terms of random behavior because of software's extreme complexity. N-modular redundancy is used in hardware to allow detection and masking of faults, but experiments in N-version programming have been relatively unsuccessful [Kelly83] [Chen78], primarily because it is difficult to assure correctness of design, the behaviors of independently generated implementations of the same specification tend to be quite divergent, specifications are generally too imprecise to admit a unique solution, and specifications are no more reliable than software in terms of safety properties.
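The N-version voting idea can be sketched as follows. The three `version_*` functions and the voter are invented for illustration (the deliberate bug in `version_b` stands in for a divergent implementation); they are not taken from the cited experiments.

```python
# Minimal sketch of N-version majority voting: several independently
# written implementations of the same specification run on the same
# input, and the most common answer wins. All names are illustrative.
from collections import Counter

def version_a(x):          # one independent implementation
    return x * x

def version_b(x):          # a second implementation, deliberately buggy
    return x * x if x != 3 else -1

def version_c(x):          # a third implementation
    return x ** 2

def n_version_vote(x, versions=(version_a, version_b, version_c)):
    results = [v(x) for v in versions]
    answer, count = Counter(results).most_common(1)[0]
    if count <= len(versions) // 2:
        # No majority: the implementations diverged, as the text warns.
        raise RuntimeError("no majority among independent versions")
    return answer
```

The voter masks the fault in `version_b` only so long as a majority of versions agree; if the independently written versions diverge (as the cited experiments found they often do), no safe answer can be selected.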
Extensive reuse of certified software has not yet progressed to the point of widespread practicality, although a great deal of effort in this area is underway and there are many examples of low-level packages that are extensively reused. Exhaustive testing and verification are impractical for most software because of the large number of states and paths through a program. These problems are significantly exacerbated by interrupts, which create the possibility of large numbers of branches at any point in a program; by dynamic allocation, which depends on the availability of resources; and by heavy loading, which is often difficult to simulate under test conditions.
There is no way to guarantee that simulations are accurate, because assumptions must be made about the controlling and controlled processes and environments, and these assumptions may not be valid in every possible application. The problem is amplified when writing software for hardware that is new or does not yet exist, as is often done for the most critical portions of operating systems, since a system, once built, typically uses an operating system for most further development. As we have seen from the study of operating system protection, there are great difficulties in writing correct policies, designing appropriate models, transforming these into specifications, and implementing them correctly.
Computers are often used in safety critical systems because of their versatility, power, performance, and efficiency, but they present safety risks because of their extreme complexity and our inability to provide correct software for them. Software is just one part of the system, and while many techniques are used to assure safe operation of the hardware in critical systems, software is often given a great burden. Hazards typically arise from hardware component failures, interfacing problems (communication and timing), human error, and environmental stress. Software is often used to replace standard hardware safety devices such as interlocks, and this often places a disproportionate burden on the software engineer. Software controls cannot be viewed in isolation, because problems are often caused by complex interactions between components and by multiple failures.
It is quite likely that the future of software safety will be similar to that of secure operating system design. A few basic principles will be formalized, and systems will be generated in such a manner as to allow verification that the implementation meets the safety policies. Eventually, automatic programming holds hope for assurance of implementation and testing techniques, but the problems of policy, modeling, and specification are well beyond the state of the art in software safety.
There are no mathematically based software safety policies in the literature, and it is unlikely that any such policies will come into being without a substantial advancement in the state of the art. The closest thing to a safety policy comes from science fiction in Isaac Asimov's "I, Robot", wherein the three laws of robotics are built into the "positronic brains" of robots. These three laws are (approximately):

1 - a robot may not injure a human being or, through inaction, allow a human being to come to harm
2 - a robot must obey the orders given to it by human beings, except where such orders would conflict with the first law
3 - a robot must protect its own existence, as long as such protection does not conflict with the first or second law
In fiction, Asimov covers a number of scenarios in which the interactions of these laws create problems that are invariably solved by either the humans in charge of the robots or the robots themselves. In reality, these policies are impossible to implement because whether or not a given action will cause harm, or prevent it, is undecidable.
Because of the state of the art in policy making, the closest thing to a policy that exists in software safety is the policy of reducing risks to an 'acceptable' level. In risk analysis, the acceptability of risks is often assessed by comparison to other risks in everyday environments. For example, if the risk due to a particular system is reduced to the level where the increased hazard to each individual at risk is equivalent to that presented by the individual crossing the street one additional time in a lifetime, it might be acceptable. A fairly standard metric for measuring risk is the average reduction in life expectancy, but any number of other metrics may be used as well.
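As a rough illustration of the life-expectancy metric described above, the following sketch computes the expected reduction in life expectancy from an annual fatality risk; every number here is a made-up assumption for illustration, not data from any real system or study.

```python
# Illustrative arithmetic only: average reduction in life expectancy as
# a risk metric. All figures below are invented assumptions, not data.

def expected_life_lost(p_fatal_per_year, years_exposed, years_lost_if_fatal):
    """Expected life-expectancy reduction (in years) from an annual risk."""
    return p_fatal_per_year * years_exposed * years_lost_if_fatal

# A hypothetical system exposing each individual to a one-in-a-million
# annual fatality risk over 40 years of exposure, costing 35 years of
# remaining life if it occurs:
system_risk = expected_life_lost(1e-6, 40, 35)     # about 0.0014 years

# Judge 'acceptability' by comparison with a hypothetical everyday
# baseline risk, in the sense described in the text:
baseline_risk = expected_life_lost(5e-6, 40, 35)
acceptable = system_risk <= baseline_risk
```

The comparison step, not the arithmetic, carries the policy content: the metric only becomes a policy once a baseline everyday risk is chosen as the acceptability threshold.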
In practice, safety is implemented by step-wise improvement. We identify hazards posing unacceptable risks, determine if and how the system can exercise those hazards, and design the system so as to eliminate or minimize those hazards. The problem with this method is that there may be hazards that are not identified because there is no clear policy or model on which to base our analysis. Thus the state of affairs in safety is similar to the problem of fixing leaky sieves in operating systems.
In order to improve the situation to some degree, there are published standards for safety which specify pre-defined hazards; DoD nuclear safety requirements and NRC nuclear reactor safety standards are typical. We can also improve the situation by using hierarchical structure to reduce the complexity of design and analysis [Newell] and by providing standardized tools for risk analysis, but these measures in no way preclude the possibility of catastrophic failure in ways not specified under such ad-hoc techniques.
Hazard control is generally based on the elimination of hazards or minimization of their occurrence or effects. Safety analysis is generally done in a precedence order as follows:
1 - design for intrinsic safety
2 - design to prevent or minimize the occurrence of hazards
3 - design to automatically control hazards if they occur
4 - provide warning devices, procedures, and training to react to hazards
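One way to read this precedence order is as a dispatch that applies the highest-precedence control applicable to each hazard. The sketch below is hypothetical; the hazard attributes and control names are invented for illustration.

```python
# Sketch of hazard-control precedence: for each hazard, apply the
# highest-precedence control that is available. Names are invented.
CONTROLS = [
    ("intrinsic_safety",    lambda h: h.get("eliminable", False)),
    ("prevent_occurrence",  lambda h: h.get("preventable", False)),
    ("automatic_control",   lambda h: h.get("detectable", False)),
    ("warn_and_train",      lambda h: True),  # always available last resort
]

def select_control(hazard):
    """Return the first (highest-precedence) control applicable to hazard."""
    for name, applies in CONTROLS:
        if applies(hazard):
            return name
```

Because the list is ordered, a hazard that can be designed out entirely is never handled by mere warnings, mirroring the precedence in the text.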
The difference between intrinsic safety and fault tolerance is relative in that the fault tolerance at any given implementation level is normally treated as intrinsic safety at the next higher level of implementation. As an example, the design of semiconductor gates involves a great deal of redundancy in that many atomic particles are involved in storing a bit. At the level of the computer designer, gates are treated as having intrinsic reliability properties, and fault tolerance is used to improve the system reliability over the mission time where appropriate. At the OS level, the hardware is generally assumed to provide an intrinsic level of protection, and the OS is designed to add redundancy to achieve desired system goals. At the level of designing tools under an operating system, the OS is assumed to provide a level of intrinsic protection, and any added protection is provided by redundancy at that level. At the application level, the tools are assumed to provide a given level of intrinsic protection, and additional protection is added as required. At the user level, intrinsic behavior is expected, while the user provides some additional protection in the form of procedures for handling exceptional cases. In many systems, multiple users are provided to protect against failures in individuals, and in most large organizations, further redundancy is used to assure that the organization doesn't depend too heavily on any given group.
Design for intrinsic safety primarily involves the use of high quality equipment at the next lower implementation level, and usually involves fail safe mechanisms and reliability techniques. Minimizing hazard occurrence generally involves active monitoring of potentially hazardous conditions; automatic control of protection mechanisms; lockouts, which prevent functions that cause hazards in particular situations; lockins, which force activities in particular situations; and interlocks, which force complex sequences of activities before high risk functions are performed or require active signaling to continue performing hazardous activities. Automated safety devices to control potentially hazardous conditions usually involve hazard detection and warning, fail safe designs, and damage control or containment. Procedures and training help personnel react to hazards.
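A software interlock of the kind just described might be sketched as follows: the hazardous function refuses to run unless a required sequence of arming steps has just been completed, and any deviation resets the sequence. The class, step names, and sequence are all hypothetical.

```python
# Minimal sketch of a software interlock: a complex sequence of
# activities must precede the high-risk function, and each use requires
# re-arming. All names are invented for illustration.

class Interlock:
    REQUIRED_SEQUENCE = ["arm", "confirm"]

    def __init__(self):
        self._progress = []

    def step(self, action):
        """Record one arming step; any deviation resets the sequence."""
        idx = len(self._progress)
        if idx < len(self.REQUIRED_SEQUENCE) and action == self.REQUIRED_SEQUENCE[idx]:
            self._progress.append(action)
        else:
            self._progress = []

    def fire(self):
        """Perform the hazardous action only if fully armed."""
        if self._progress == self.REQUIRED_SEQUENCE:
            self._progress = []   # re-arming required for each use
            return "fired"
        raise PermissionError("interlock: required sequence not completed")
```

A lockout would invert the check (refusing an action while a hazardous condition holds), and a lockin would force an activity to continue; the reset-on-deviation behavior is what makes this an interlock rather than a simple flag.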