\section{Introduction}

This thesis describes the application of mathematical (formal) techniques to the design of safety critical systems.
The initial motivation for this study was to create a system applicable to industrial burner controllers.
The methodology developed was designed to cope with both the deterministic and probabilistic approaches.
%specific `simultaneous failures'\cite{EN298},\cite{EN230},\cite{EN12067}
%and the probability to dangerous fault approach\cite{EN61508}.
The visual notation developed was initially designed for electronic fault modelling.
However, it was realised that it could be applied to the mechanical and software domains as well.
This changed the target for the study slightly, to encompass these three domains in a common notation.

\section{Background}

I completed an MSc in Software Engineering in 2004 at Brighton University while working for an engineering firm as a software engineer.
The firm makes industrial burner controllers.
Industrial burners are potentially very dangerous plant.
They are generally left running unattended for long periods.
They are subject to stringent safety regulations and must conform to specific `EN' standards.
One cannot merely comply with the standards: the product must be `certified' by an independent `competent body' recognised under European law.
The certification involved stress testing with repeated operation cycles over a specified range of temperatures;
electrical stress testing with high voltage interference and power supply voltage surges and dips;
electrostatic discharge testing; and EMC (Electro-Magnetic Compatibility) testing.
A significant part of this process, however, was `static testing'.
This involved examining the design of the products from the perspective of components failing, and the effect such failures would have on safety.
Some of the static testing involved checking that the germane `EN' standards had been complied with.
Failure Mode Effects Analysis (FMEA) was also applied.
This involved looking in detail at critical sections of the product and proposing component failure scenarios.
For each failure scenario proposed, either a satisfactory answer was required, or a counter-proposal to change the design to cope with the theoretical component failure eventuality.
FMEA was time consuming and, being directed by experts, undoubtedly ironed out many potential safety faults before the product saw the light of day.
However, it was quickly apparent that only a small proportion of component~failure modes was considered.
Also, there was no formalism: the component~failure~modes investigated were not analysed within any rigorous or mathematically proven framework.

\subsection{Blanket Risk Reduction Approach}

The suite of tests applied for a certified product amounts to a `blanket' approach.
That is to say that by applying electrical, repeated operation, and environmental stress testing, it is hoped that the majority of latent faults are discovered.
The FMEA and static testing only looked at the most obviously safety critical aspects, and a small minority of the total component base for a product.
Systemic faults, or mistakes, are missed by this form of static testing.

\subsection{Possibility of applying mathematical techniques to FMEA}

My MSc project was a diagram editor for constraint diagrams.
I wanted to apply constraint diagram techniques to FMEA, and began thinking about how this could be done.
One obvious factor was that a typical safety critical system could have more than 1000 component parts.
Each component would typically have several failure modes.
Trying to apply a rigorous methodology to an entire product was going to be impractical.
To do this with complete coverage, each component failure mode would have to be checked against the other thousand or so components for influence, and a determination of the effects on the system would then have to be made.
Thus millions of checks would have to be performed and, as FMEA is an `expert only', time consuming technique, this idea was obviously impractical.
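To give a feel for the scale of the problem, take some illustrative figures (assumptions for the sake of example only): a product with 1000 components, each with an average of five failure modes.
Naively checking every failure mode against every other component for influence would give of the order of
\equation
1000 \times 5 \times 999 \; \approx \; 5 \times 10^{6}
\endequation
checks.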
Note that most of these checks would be redundant: a given component typically affects the performance of only the few components with which it is placed to perform some particular low-level function.

\paragraph{Top-down Approach}
A top-down approach has several potential problems.
By its nature, it means that at the start of the process a set of system-level (or top-level) faults or undesirable outcomes is defined.
The analyst must then break the system down into modules and decide which of these can contribute to a system level fault mode.
Potentially, failure modes, be they from components or from the interaction between modules, can be missed.
A disturbing example of this is the 1986 NASA Space Shuttle disaster, where the analysis missed the fault mode of an `O' ring.
This was made even worse by the fact that the `O' ring had a specified operating temperature range, and the probability of this fault occurring rose dramatically below that range.
This was a known and documented feature of a safety critical component, and it was ignored in the safety analysis.

\paragraph{Bottom-up Approach}
A bottom-up approach looks impractical at first, due to the sheer number of component failure modes in a typical system.
However, were this bottom-up approach to be modular, with the modules built into a hierarchy rising up until all components are covered, we could model an entire complex system.
This is the core concept behind this study.
By working from the bottom up, at the lowest level taking the smallest functional~groups of components and analysing these, we can obtain a set of failure modes for the functional~groups.
We can then treat these as `higher level' components and combine them to form new `functional~groups'.
In this way all failure modes from all components must be at the very least considered.
Also, a hierarchy is formed, in which the top level errors arise naturally from the lower levels of analysis.
Unlike a top~down analysis, we cannot miss a top level fault condition.

\paragraph{Multi-discipline}
Most safety critical systems are composed of mechanical, electrical and computing elements.
A tragic example of mechanical and electrical elements interfacing to a computer~controller is found in the Therac-25 X-ray dosage machine.
With no common notation to integrate the safety analysis between the electrical/mechanical and computing domains, synchronisation errors occurred that were in some cases fatal.

\paragraph{Requirements for a rigorous FMEA process}
It was determined that any process applying FMEA rigorously and completely (in terms of component coverage) had to be a bottom~up process, to eliminate the possibility of missing component failure modes.
It also had to converge naturally to a failure model of the system.
It had to take potentially thousands of component failure modes and simplify these into system level errors.
To analyse the large number of component failure modes, and resolve these to perhaps a handful of system failure modes, would require a process of modularisation from the bottom~up.
In summary:
\begin{list}{$*$}{}
\item The analysis process must be `bottom~up';
\item The process must be modular and hierarchical;
\item The process must be multi-discipline, able to represent hardware, electronics and software.
\end{list}

\section{Safety Critical Systems}

\subsection{General description of a Safety Critical System}

A safety critical system is one upon which lives may depend, or one which has the potential to become dangerous \cite{sccs}.
%An industrial burner is typical of plant that is potentially dangerous.
%An incorrect air/fuel mixture can be explosive.
%Medical electronics for automatically dispensing drugs or maintaining
%life support are examples of systems that lives depend upon.

\subsection{Two approaches: Probabilistic and Deterministic}

There are two main philosophies applied to safety critical systems certification.

\paragraph{Probabilistic safety measures}
One is to specify an acceptable number of failures per hour of operation\footnote{The common metric is Failure in Time (FIT) values: failures per ${10}^{9}$ hours of operation.}, or a given statistical failure rate on demand.
This is the probabilistic approach and is embodied in the European standard EN61508 \cite{EN61508} (international standard IEC61508).

\paragraph{Deterministic safety measures}
The second philosophy, applied in application specific standards, is to investigate components or sub-systems in the critical safety path, to look at their component failure modes, and to ensure that these cannot cause dangerous faults.
The application specific standards add detail particular to the process being controlled.
This philosophy is first seen in aircraft safety operation research from WWII, where potential single faults (usually mechanical) are traced to catastrophic failures.
% \cite{boffin}
%
% \begin{example}
% \label{exa1}
% Test example
% \end{example}
%
% And that is example~\ref{exa1}

\subsection{Overview of regulation of safety critical systems}

A later chapter deals specifically with the regulation of safety critical systems; a brief overview is given here.

\subsubsection{Overview of system analysis philosophies}
\begin{list}{$*$}{}
\item General safety standards;
\item Specific safety standards.
\end{list}

\subsubsection{Overview of current testing and certification}

Again, a later chapter deals specifically with testing and certification; an overview is given now.
A modern industrial burner has mechanical, electronic and software elements that are all safety critical.
That is to say, unhandled failures could create dangerous faults.
%To add to these problems
%Operators are often under pressure to keep them running. A boiler supplying
%heat to a large greenhouse complex could ruin crops
%should it go off-line. Similarly a production line relying on heat or steam
%can be very expensive in production down-time should it fail.
%This places extra responsibility on the burner controller.
%
% This needs to become a chapter
%\subsection{Mechanical components}
%describe the mechanical parts - gas valves, dampers,
%electronic and software
%give a diagram of how it all fits together
%\subsection{Electronic Components}
%
%\subsection{Software/Firmware Components}
%
%
%\subsection{A high level Fault Hierarchy for an Industrial Burner}
%
%This section shows the component level, leading up higher and higher in the abstraction level
%to the software levels and finally a top level abstract level.
%If the system has been designed correctly no `undetected faults' should be present here.
%
\section{An Outline of the FMMD Technique}

The methodology takes a bottom-up approach to the design of an integrated system.
%
Each component is assigned a well defined set of failure modes.
The system under inspection is then searched for functional groups of components that perform simple, well defined tasks.
These functional groups are analysed with respect to the failure modes of their components.
%
The `functional group', after analysis, has its own set of derived failure modes.
%
The number of derived failure modes will be less than or equal to the sum of the failure modes of all its components.
%
%
A `derived' set of failure modes is at a higher abstraction level.
%
Thus we can now treat our `functional group' as a component in its own right, with its own set of failure~modes.
We can create a `derived component' and assign it the derived failure modes as analysed from the `functional group'.
%
Derived components may now be used as building blocks, to model the system at ever higher levels of abstraction, building a hierarchy until the top level is reached.
%
Any unhandled faults will appear at this top level and will be `un-resolved'.
A formal description of this process is dealt with in Chapter \ref{fmmddefinition}.
%
%
%This principally focuses
%on simple control systems for maintaining temperature
%and for industrial burners. It is hoped that a general mathematical
%framework is created that can be applied to other fields of safety critical engineering.

\subsection{Automated Systems and Safety}

Automated systems, as opposed to manual ones, are now the norm in the home and in industry.
%
Automated systems have long been recognised as being more efficient and more accurate than a human operator, and the reason for automating a process is now more likely to be cost savings due to better efficiency than replacement of the human operator \ref{burnereffency}.
%
For instance, early automated systems were mechanical, with cams and levers simulating control functions.
%
A typical control function could be the fuel/air mixture profile curve over the firing range.
%
Because fuels vary slightly in calorific value, and air density changes with the weather, no fixed tuning can be optimal.
In fact, for aesthetic reasons (not wanting smoke to appear at the flue), the tuning was often air rich, causing excess air to be heated and passed unnecessarily through the burner, leading to a direct loss of energy.
An automated system analysing the combustion gasses and automatically adjusting the fuel/air mix can bring the efficiency very close to theoretical levels.

As the automation takes over more and more functions from the human operator, it also takes on more responsibility.
A classic example of an automated system failing is the Therac-25.
This was an X-ray dosage machine that, due to software errors, caused the deaths of several patients and injured more during the 1980s.
The Therac-25 was designed from a manual system, which had checks and interlocks, and was computerised.
Software bugs were the primary causes of the radiation overdoses \cite{therac}.

Any new safety critical analysis methodology should be able to model software, electrical and mechanical faults using a common notation.
Ideally the tool should be automated, so that it can seamlessly analyse the entire system and apply rigorous checking to ensure that no fault conditions are missed.
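As an informal illustration of the process outlined above, the following minimal sketch models components, functional groups and derived components.
The class and function names, and the potential divider example, are hypothetical illustrations for this introduction only; the formal definition of the process is given in Chapter \ref{fmmddefinition}.
\begin{verbatim}
# Illustrative sketch of the bottom-up FMMD process (Python).
class Component:
    """A part, base or derived, with a known set of failure modes."""
    def __init__(self, name, failure_modes):
        self.name = name
        self.failure_modes = set(failure_modes)

def analyse(group_name, components, mapping):
    """Analyse a functional group: every failure mode of every
    member must be mapped to a derived failure mode."""
    derived = set()
    for c in components:
        for fm in c.failure_modes:
            if (c.name, fm) not in mapping:
                raise ValueError("unhandled failure mode: %s/%s"
                                 % (c.name, fm))
            derived.add(mapping[(c.name, fm)])
    # A derived component has no more failure modes than the
    # sum of the failure modes of its member components.
    assert len(derived) <= sum(len(c.failure_modes) for c in components)
    return Component(group_name, derived)

# Example: two resistors analysed as a potential divider.
# (The mapping shown is illustrative; a real analysis depends
# on the circuit topology.)
r1 = Component("R1", {"OPEN", "SHORT"})
r2 = Component("R2", {"OPEN", "SHORT"})
divider = analyse("PotentialDivider", [r1, r2], {
    ("R1", "OPEN"): "HIGH", ("R1", "SHORT"): "LOW",
    ("R2", "OPEN"): "LOW",  ("R2", "SHORT"): "HIGH",
})
print(divider.failure_modes)   # {'HIGH', 'LOW'}
\end{verbatim}
The point of the sketch is the containment property: no component failure mode can slip through unconsidered, because the analysis refuses to complete until every one has been mapped to a derived failure mode.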
% http://en.wikipedia.org/wiki/Autopilot
\paragraph{Importance of self checking}
To take the example of an autopilot: simple early autopilots performed a single function (i.e. they prevented the aircraft straying from a compass bearing and kept it flying straight and level).
Were they to fail, the pilot would notice quite quickly and resume manual control of the bearing.
Modern autopilots control all aspects of flight, including the engines and the take off and landing phases.
The automated system does not have the common sense of a human pilot either: if fed the wrong sensory information, it could make horrendous mistakes.
This means that simply reading sensors and applying control corrections cannot be enough.
Checking for error conditions must also be incorporated.
It could also develop an internal fault, and must be able to cope with this.

\begin{figure}
\vskip 7cm
\special{psfile=introduction/millivoltsensor.ps hoffset=0 voffset=0 hscale=35 vscale=35 }\caption[Milli-Volt Sensor with safety resistor]{
Milli-Volt Sensor with safety resistor
\label{fig:millivolt}}
\end{figure}

\paragraph{Component added to detect errors}
For example, if the sensor in figure \ref{fig:millivolt} supplies a range of 0 to 40mV, and RG1 and RG2 are chosen such that the op-amp supplies a gain of 100, any signal between 0 and 4 volts at the ADC will be considered in range.
Should the sensor become disconnected, the op-amp will supply its maximum voltage, telling the system that the sensor reading is invalid.
This introduces a level of self checking into the system.
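A minimal sketch of this self-check follows, using the figures above (0--40mV sensor, gain of 100); the ADC parameters (12-bit converter, 5V reference) are assumptions for illustration.
\begin{verbatim}
# Self-checking sensor read (illustrative sketch).
ADC_REF_V  = 5.0     # assumed ADC reference voltage
ADC_COUNTS = 4096    # assumed 12-bit converter
GAIN       = 100.0   # op-amp gain set by RG1/RG2

def read_sensor_mv(adc_raw):
    """Return the sensor reading in millivolts,
    or None when a fault is detected."""
    adc_volts = adc_raw * ADC_REF_V / ADC_COUNTS
    if adc_volts > 4.0:
        # Above 4V the signal is outside the amplified 0-40mV
        # range: a disconnected sensor drives the op-amp to its
        # rail, so the reading is flagged as invalid.
        return None
    return adc_volts / GAIN * 1000.0

reading = read_sensor_mv(4095)       # saturated ADC reading
if reading is None:
    print("sensor fault detected")   # e.g. enter lockout
\end{verbatim}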
We need to be able to react not only to errors in the process itself, but also to validate the control system and look for internal errors within it.
This leads on to an important concept: the three main states of a safety critical system.
%
% To improve productivity, performance, and cost-effectiveness, we are developing more and more safety-critical systems that are under computer control. And centralized computer control is enabling many safety-critical systems (e.g., chemical and pesticide factories) to grow in size, complexity, and potential for catastrophic failure.
%
% We use software to control our factories and refineries as well as power generation and distribution. We also use software in our transportation systems including airplanes, trains, ships, subways, and even in our family automobiles. Software is also a major component of many medical systems in which safe functioning is critical to the safety of patients and operators alike. Even when the software does not directly control safety-critical hardware, software can provide operators and users with safety-critical data with which they must make safety-critical decisions (e.g., air traffic control or medical information such as blood bank records, organ donor information, and patient medical records). As we have come to rely more on software-intensive systems, we have come to rely more on those systems functioning safely.
%
% Many accidents are caused by problems with system and software requirements, and “empirical evidence seems to validate the commonly stated hypothesis that the majority of safety problems arise from software requirements and not coding errors” [Leveson1995]. Major accidents often result from rare hazards, whereby a hazard is a combination of conditions that increases the likelihood of accidents causing harm to valuable assets (e.g., people, property, and/or the environment). Most requirements specifications are incomplete in that they do not specify requirements to eliminate these rare hazards or mitigate their consequences.
%
% Requirements specifications are also typically incomplete in that they do not specify what needs to happen in exceptional “rainy day” situations or as a response to each possible event in each possible system state although accidents are often caused by the incorrect handling of rare combinations of events and states that were considered to be either impossible or too unlikely to worry about, and were therefore never specified. Even when requirements have been specified for such rare combinations of events and conditions, they may well be ambiguous (an unfortunately common characteristic of requirements in practice), partially incomplete (missing assumptions obvious only to subject matter experts), or incorrect, or inconsistently implemented. Thus, the associated hazards are not eliminated or the resulting harm is not properly mitigated when the associated accidents occur. Ultimately, safety related requirements are important requirements that need to be better engineered.
%
% The goal of this column is to define safety requirements and clarify how they differ from safety constraints and from functional, data, and interface requirements that happen to be safety critical. I start by defining safety in terms of a powerful quality model and show how quality requirements (including safety requirements) can be specified in terms of the components of this quality model. I will then show how to use the quality model to specify safety requirements. Then, I will define and discuss safety constraints and safety-critical requirements. Finally, I will pose a set of questions regarding the engineering of these three kinds of safety-related requirements for future research and experience to answer.
%
In the context of this study, a safety critical system may be said to be in one of three distinct overall states:
operating normally, operating in a lockout mode with a detected fault, or operating dangerously with an undetected fault.
The main role of the designers of safety critical equipment should be to eliminate the possibility of this last condition.
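These three states can be made explicit in any implementation or model; a minimal sketch (the names are illustrative):
\begin{verbatim}
from enum import Enum, auto

class SystemState(Enum):
    """Three overall states of a safety critical system."""
    NORMAL           = auto()  # operating normally
    LOCKOUT          = auto()  # fault detected, held safe
    UNDETECTED_FAULT = auto()  # operating dangerously; the
                               # designer's task is to make
                               # this state unreachable
\end{verbatim}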
% Software plays a critical role in almost every aspect facet of our daily lives - from , to driving our cars, to working in our offices.
% Some of these systems are safety-critical.
% Failure of software could cause catastrophic consequences for human life.
% Imagine the antilock brake system (ABS) in your car.
% A software failure here could render the ABS inoperable at a time when you need it most.
% For these types of safety-critical systems, having guidelines that define processes and
% objectives for the creation of software that focus on software quality, or the ability
% to use software that has been developed under this scrutiny, has tremendous value
% for developers of safety-critical systems.

\section{Motivation for developing a formal methodology}

A feature of many safety critical system specifications, including EN298 and EN230 \cite{EN298}\cite{EN230}, is the demand that, at the very least, single failures of hardware or software cannot create an unsafe condition in operational plant.
Further to this, a second fault introduced after the first must not cause an unsafe state through the combination of both faults.
\vskip 0.3cm
This sounds like an entirely reasonable requirement, but to check rigorously the effect a particular component fault has on the system, we would have to check its effect on all other components.
Should a diode in the power supply fail in a particular way, perhaps by introducing a ripple voltage, we would have to look at all components in the system to see how they would be affected.
%However consider a typical
%small system with perhaps 1000 components each
%with an average of say 5 failure modes.
Thus, to ensure complete coverage, the effects of each failure mode must be applied to all the other components: each component must be checked against the failure modes of all other components in the system.
Mathematically, with components denoted by $c$ and failure modes by $Fm$, and writing $\hat{c}$ for the component to which a given failure mode belongs:
\equation
\label{crossprodsingle}
checks = \{ \; (Fm,c) \; \mid \; \hat{c} \neq c \; \}
\endequation
Where demands are made for resilience against two simultaneous failures, this effectively squares the number of checks to make.
\equation
\label{crossproddouble}
doublechecks = \{ \; (Fm_{1},Fm_{2},c) \; \mid \; c_{1} \neq c_{2} \; \wedge \; Fm_{1} \neq Fm_{2} \; \}
\endequation
If we consider a system which has a total of $N$ failure modes (see equation \ref{crossprodsingle}), this would mean checking a maximum of
\equation
NumberOfChecks = \frac{N(N-1)}{2}
\endequation
individual component failures and their effects on other components when they fail.
For a small system with, say, 1000 failure modes, this would demand a potential 500,000 checks for any automated checking process.
\vskip 0.3cm
European standards \cite{EN298} direct that a system must be able to react to two component failures and not go into a dangerous state.
\vskip 0.3cm
This raises an interesting problem from the point of view of formal modelling.
Here we have a binary cross product of all components (see equation \ref{crossproddouble}).
This increases the number of checks greatly: the binary cross product, $(N^{2} - N)/2$, has to be checked against the remaining $(N-2)$ components.
\equation
\label{numberofchecks}
NumberOfChecks = \frac{(N^{2} - N)(N - 2)}{2}
\endequation
Thus for a 1000 failure mode system, roughly half a billion checks would be required for the double simultaneous failure scenario.
This astronomical number of potential combinations has, until now, made formal analysis of this type of system impractical.
Fault simulators
%\cite{sim}
are commonly used for the gas certification process, but to check this number of combinations of faults manually is in practice impossible.
A technique of modularising, or breaking down, the problem is clearly necessary.
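The check counts above are easily confirmed; a small sketch:
\begin{verbatim}
from math import comb

# Check counts for a system with n failure modes.
def single_failure_checks(n):
    return comb(n, 2)                # N(N-1)/2

def double_failure_checks(n):
    return comb(n, 2) * (n - 2)      # (N^2 - N)(N - 2)/2

print(single_failure_checks(1000))   # 499500 (~half a million)
print(double_failure_checks(1000))   # 498501000 (~half a billion)
\end{verbatim}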
\section{Challenger Disaster}

One question that anyone developing a safety critical analysis methodology would do well to answer is how that methodology would cope with known previous disasters.
The Challenger disaster is a good example, and was well documented and investigated.
The problem lay in a seal that had a specified operating temperature range.
On the day of the launch, the temperature of this seal was out of that range.
A bottom-up safety approach would have revealed this as a fault.
The FTA in use by NASA and the US Nuclear Regulatory Commission allows for environmental considerations such as temperature \cite{NASA}\cite{NUK}.
But because of the top down nature of the FTA technique, the safety designer must be aware of the environmental constraints of all component parts in order to use it correctly.
This element of FTA is discussed in \ref{surveysc}.

\section{Problems with Natural Language}

Written natural language descriptions are not only ambiguous and easy to misinterpret; it is also impossible to apply mathematical checking to them.
A mathematical model, on the other hand, can be checked for obvious faults, such as tautologies and contradictions; furthermore, intermediate results can be extracted and these too can be checked.
Mathematical modelling of systems is not new: the Z language, for instance, has been used to model systems \cite{ince}.
However, it is not widely understood or studied, even in engineering and scientific circles.
Graphical techniques for representing the mathematics for specifying systems, developed at Brighton and Kent universities, have been used and extended by this author to create a methodology for modelling complex safety critical systems using diagrams.
This project uses a modified form of Euler diagram to represent propositional logic.
%The propositional logic is used to analyse system components.

\section{Determining Component Failure Modes}

\subsection{Electrical}

Generic component failure modes for common electrical parts can be found in MIL1991.
Most modern electrical components have associated data sheets; usually these do not explicitly list failure modes.
% watch out for log axis in graphs !

\subsection{Mechanical}

% TODO: find references

\subsection{Software}

Software must run on a microprocessor or microcontroller, and these devices have a known set of failure modes.
The most common of these are RAM and ROM failures, but faults in particular machine instructions can also exist.
These can be checked for periodically.
Software bugs are unpredictable; however, there are techniques to validate software.
These include monitoring the program timings (with watchdogs and internal checking) and applying validation checks (such as independent functions to validate correct operation).

\subsection{Environmentally determined failures}

Some systems and components are guaranteed to work within certain environmental constraints, temperature being the most typical.
Very often, what happens to the system outside that range is not defined.

\section{Project Goals}

\begin{itemize}
\item To create a bottom-up FMEA technique that permits a connected hierarchy to be built representing the fault behaviour of a system.
\item To create a procedure in which no component failure mode can be left un-handled.
\item To create a user friendly, formal, common visual notation to represent fault modes in software, electronic and mechanical sub-systems.
\item To formally define this visual language in concrete and abstract domains.
\item To prove that the derived~components used to build the hierarchies provide traceable fault handling from component level to the highest abstract system `top level'.
\item To formally define the hierarchies and the procedure for building them.
\item To produce a software tool to aid in the drawing of diagrams and to ensure that all fault modes are addressed.
\item To provide for deterministic and probabilistic failure mode analysis processes.
\item To allow the possibility of MTTF calculation for statistical reliability/safety calculations.
\end{itemize}

\end{document}