The generic and statistical European Safety Standard, EN61508:6~\cite[B.6.6]{en61508}, describes Failure Mode Effects Analysis (FMEA) as:
\begin{quotation}
``To analyse a system design, by examining all possible sources of failure of a system's components and determining the effects of these failures on the behaviour and safety of the system.''
\end{quotation}

\section{FMEA}
\paragraph{FMEA basic concept.}
FMEA~\cite[pp.~341--344]{safeware} is widely used, and proof of its use is a mandatory legal requirement for a large proportion of safety critical products sold in the European Union.
The acronym FMEA can be expanded as follows:
\begin{itemize}
\item \textbf{F - Failures of given component} Consider a component in a system,
\item \textbf{M - Failure Mode} Look at one of the ways in which it can fail (i.e. determine a component `failure~mode'),
\item \textbf{E - Effects} Determine the effects this failure mode will cause to the system we are examining,
\item \textbf{A - Analysis} Analyse how much impact this symptom will have on the environment, on people and on the system itself.
\end{itemize}
%
FMEA is a broad term; it could mean anything from an informal check on how failures could affect some equipment in an initial brain-storming session in product design, to a formal submission as part of safety critical certification.
%
FMEA is always performed in context. That is, the equipment is always analysed for a particular purpose and in a given environment.
An `O' ring, for instance, can fail by leaking. Fitted to a water seal on a garden hose, the system level failure would be a slight leak at the tap outside the house.
Fitted to the rocket engine on a space shuttle, the system level failure is a catastrophic fire and destruction of the spacecraft~\cite{challenger}.
%
At a lower level, consider a resistor and capacitor forming a potential divider to ground.
This could be considered a low pass filter in some electrical environments, but for fixed frequencies the same circuit could be used as a phase changer.
The failure modes of the former could be `no~signal' and `all~pass', but when the circuit is used as a phase changer they would be `no~signal' and `no~phase~change'.

This chapter describes the basic concepts of FMEA, uses a simple example to demonstrate a single FMEA analysis stage, describes the five main variants of FMEA in use today and explores some concepts with which we can discuss and evaluate the effectiveness of FMEA.

FMEA is a procedure which starts with the failure modes of the low level components of a system; an example analysis will serve to demonstrate it in practice.
\paragraph{FMEA Example: Milli-volt reader}
Let us consider a system, in this case a milli-volt reader, consisting of instrumentation amplifiers connected to a micro-processor that reports its readings via RS-232.
\begin{figure}
\centering
\includegraphics[width=175pt]{./CH2_FMEA/mvamp.png}
% mvamp.png: 561x403 pixel, 72dpi, 19.79x14.22 cm, bb=0 0 561 403
\end{figure}
Let us perform an FMEA and consider how one of its resistors failing could affect it.
For the sake of example let us choose resistor R1 in the OP-AMP gain circuitry.
\begin{itemize}
\item \textbf{F - Failures of given component} The resistor (R1) could fail by going OPEN or SHORT (EN298 definition).
\item \textbf{M - Failure Mode} Consider the component failure mode SHORT.
\item \textbf{E - Effects} This will drive the minus input LOW, causing a HIGH OUTPUT/READING.
\item \textbf{A - Analysis} The reading will be out of the normal range, and we will have an erroneous milli-volt reading.
\end{itemize}
The analysis above has given us a result for one failure scenario, i.e. for one component failure mode.
A complete FMEA report would have to contain an entry for each failure mode of all the components in the system under investigation.
%
Note here that we have had to look at the failure~mode in relation to the entire circuit.
We have used intuition to determine the probable effect of this failure mode.
For instance, we have assumed that the resistor R1 going SHORT will not affect the ADC, the microprocessor or the UART.
%
To put this in more general terms, we have not examined this failure mode against every other component in the system.
Perhaps we should: this would be a more rigorous and complete approach in looking for system failures.

\section{Theoretical Concepts in FMEA}
\paragraph{The unacceptability of a single component failure causing a catastrophe}
FMEA, due to its inductive bottom-up approach, is very good at finding potential single component failures that could have catastrophic implications.
Used in the design phase of a project, FMEA is an invaluable tool for unearthing these failure scenarios.
It is less useful for determining catastrophic events for multiple simultaneous\footnote{Multiple simultaneous failures are taken to mean failures that occur within the same detection period.} failures.
\paragraph{Impracticality of Field Data for modern systems}
Modern electronic components are generally very reliable, and the systems built from them are thus very reliable too.
Reliable field data on failures will therefore be sparse.
Should we wish to prove a continuous demand system for, say, ${10}^{-7}$ failures\footnote{${10}^{-7}$ failures per hour of operation is the threshold for S.I.L. 3 reliability~\cite{en61508}. Failure rates are normally measured per $10^9$ hours of operation and are known as Failure in Time (FIT) values. The maximum FIT value for a SIL 3 system is therefore 100.} per hour of operation, even with 1000 correctly monitored units in the field we could only expect one failure per ten thousand hours (roughly one every fourteen months).
It would be utterly impractical to get statistically significant data for equipment at these reliability levels.
However, we can use FMEA (more specifically the FMEDA variant, see section~\ref{sec:FMEDA}), working from known component failure rates, to obtain statistical estimates of the equipment reliability.
\paragraph{Forward and backward searches}
A forward search starts with possible failure causes and uses logic and reasoning to determine system level outcomes.
Forward search types of fault analysis are said to be `inductive'.
A backward search starts with (undesirable) system level events and works back down to potential causes using decomposition of the system and logic.
FMEA based methodologies are forward searches~\cite{Lutz:1997:RAU:590564.590572}, whereas top down methodologies such as FTA~\cite{nucfta,nasafta} are backward searches.
Backward search types of fault analysis are said to be `deductive'.
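\paragraph{Worked check of the field data estimate}
As a rough check on the field data figures quoted earlier, the expected failure rate of the fleet can be worked through explicitly.
This is only a sketch: it assumes a constant failure rate, every unit running continuously, and 8760 hours in a year. The symbol $\lambda_{fleet}$ is introduced here for illustration only.
\begin{equation}
\lambda_{fleet} = 1000 \times 10^{-7}\,\mathrm{h}^{-1} = 10^{-4}\,\mathrm{h}^{-1},
\qquad
\frac{1}{\lambda_{fleet}} = 10^{4}\,\mathrm{h} \approx \frac{10^{4}}{8760}\,\mathrm{years} \approx 1.14\,\mathrm{years}
\end{equation}
On these assumptions a fleet of 1000 units would be expected to show about one failure every fourteen months, which is far too few events to support a statistical argument.
The same per-unit rate expressed as a FIT value is $10^{-7} \times 10^{9} = 100$.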
\paragraph{Reasoning distance}
A reasoning distance is the number of stages of logic and reasoning required to map a failure cause to its potential outcomes.
\subsection{FMEA and the State Explosion Problem}
\paragraph{Rigorous Single Failure FMEA}
FMEA for a safety critical certification~\cite{en298,en61508} will have to be applied to all known failure modes of all components within a system.
To perform FMEA rigorously (i.e. to examine every possible interaction of a failure mode with all other components in a system), we would need to look at all possible failure scenarios.
The number of checks this requires is given in equation~\ref{eqn:fmea_single}, where $N$ is the total number of components in the system, and $f$ is the number of failure modes per component.
\begin{equation}
\label{eqn:fmea_single}
N.(N-1).f
\end{equation}
This would mean an order of $O(N^2)$ checks to undertake a `rigorous~FMEA'.
Even small systems typically have 100 components, each with 3 or more failure modes, which gives $100 \times 99 \times 3 = 29{,}700$ checks.
\paragraph{Rigorous Double Failure FMEA}
When looking at potential double failure scenarios\footnote{Certain double failure scenarios are already legal requirements: the European Gas burner standard (EN298:2003) demands the checking of double failure scenarios (for burner lock-out scenarios).} (two components failing within a given time frame), the order becomes $O(N^3)$.
\begin{equation}
\label{eqn:fmea_double}
N.(N-1).(N-2).f
\end{equation}
For our theoretical example of 100 components with 3 failure modes each, this is $100 \times 99 \times 98 \times 3 = 2{,}910{,}600$ failure mode scenarios.
\paragraph{Reliance on experts for meaningful FMEA}
FMEA cannot, for practical reasons, take a rigorous approach.
We define rigorous FMEA as examining the effect of every component failure mode against the remaining components in the system under investigation.
%
Because we cannot perform rigorous FMEA, we rely on experts in the system under investigation to perform a meaningful FMEA.

\section{FMEA in practice: Five variants}
\paragraph{Five main Variants of FMEA}
\begin{itemize}
\item \textbf{PFMEA - Production} Car Manufacture etc
\item \textbf{FMECA - Criticality} Military/Space
\item \textbf{FMEDA - Statistical safety} EN61508/IEC61508 Safety Integrity Levels
\item \textbf{DFMEA - Design or static/theoretical} EN298/EN230/UL1998
\item \textbf{SFMEA - Software FMEA} Only used in highly critical systems at present
\end{itemize}
\section{PFMEA - Production FMEA : 1940's to present}
Production FMEA (or PFMEA) is FMEA used to prioritise, in terms of cost, problems to be addressed in product production.
It focuses on known problems, and determines how frequently they occur and their cost to fix.
These two figures are multiplied together to give a Risk Priority Number (RPN).
Fixing the problems with the highest RPN will return the most cost benefit.
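As a minimal worked instance of this product, the two rows of the PFMEA example table can be reproduced by hand.
The symbols $P$ (probability of occurrence) and $C$ (cost to fix) are shorthand assumed here, and the unit of cost is left unspecified, as it is in the table.
\begin{equation}
RPN = P \times C, \qquad 10^{-5} \times 38.0 = 0.00038, \qquad 10^{-5} \times 98.0 = 0.00098
\end{equation}
On these figures the door lock relay fault has the higher RPN and would be addressed first.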
\subsection{PFMEA Example}
\begin{table}[ht]
\caption{PFMEA Calculations} % title of Table
%\centering % used for centering table
\begin{tabular}{|| l | l | c | c | l ||} \hline
\textbf{Failure Mode} & \textbf{P} & \textbf{Cost} & \textbf{Symptom} & \textbf{RPN} \\ \hline \hline
relay 1 n/c & $1 \times 10^{-5}$ & 38.0 & indicators fail & 0.00038 \\ \hline
relay 2 n/c & $1 \times 10^{-5}$ & 98.0 & doorlocks fail & 0.00098 \\ \hline
\hline
\end{tabular}
\end{table}
\section{FMECA - Failure Modes Effects and Criticality Analysis}
FMECA places its emphasis on determining the criticality of a failure.
It applies some Bayesian statistics (the probabilities of component failures and of those failures causing given system level failures).

It is very similar to PFMEA, but instead of cost, a criticality or seriousness factor is ascribed to putative top level incidents.
FMECA has three probability factors for component failures.

\textbf{FMECA ${\lambda}_{p}$ value.}
This is the overall failure rate of a base component.
This will typically be the failure rate per million ($10^6$) or billion ($10^9$) hours of operation (reference: MIL1991).

\textbf{FMECA $\alpha$ value.}
The failure mode probability, usually denoted by $\alpha$, is the probability of a particular failure~mode occurring within a component (reference: FMD-91).
\textbf{FMECA $\beta$ value.}
The second probability factor, $\beta$, is the probability that the failure mode will cause a given system failure.
This is a conditional probability in the Bayesian sense: the probability of a given system level failure, given a particular component failure mode.

\textbf{FMECA `t' value.}
The time that a system will be operating for, or the working life time of the product, is represented by the variable $t$.

\textbf{Severity `s' value.}
A weighting factor to indicate the seriousness of the putative system level error.
\begin{equation}
C_m = {\beta} . {\alpha} . {{\lambda}_p} . {t} . {s}
\end{equation}
The highest $C_m$ values would be at the top of a `to~do' list for a project manager.
\section{FMEDA - Failure Modes Effects and Diagnostic Analysis}
\label{sec:FMEDA}
\begin{itemize}
\item \textbf{Statistical Safety} Safety Integrity Level (SIL) standards (EN61508/IEC61508).
\item \textbf{Diagnostics} Diagnostic or self checking elements modelled
\item \textbf{Complete Failure Mode Coverage} All failure modes of all components must be in the model
\item \textbf{Guidelines} To system architectures and development processes
\end{itemize}
FMEDA is the fundamental methodology of the statistical (safety integrity level) type standards (EN61508/IEC61508).
It provides a statistical overall level of safety and allows diagnostic mitigation for self checking etc.
It provides guidelines for the design and architecture of computer/software systems for the four levels of safety integrity.
%
FMEDA forces the user to consider all hardware components in a system by requiring that an MTTF value is assigned for each failure~mode; the MTTF may be statistically mitigated (improved) if it can be shown that self-checking will detect failure modes.
For software it provides procedural quality guidelines and constraints (such as forbidding certain programming languages and/or features).

\textbf{Failure Mode Classifications in FMEDA.}
\begin{itemize}
\item \textbf{Safe or Dangerous} Failure modes are classified SAFE or DANGEROUS
\item \textbf{Detectable failure modes} Failure modes are given the attribute DETECTABLE or UNDETECTABLE
\item \textbf{Four attributes to Failure Modes} All failure modes may thus be Safe Detected (SD), Safe Undetected (SU), Dangerous Detected (DD) or Dangerous Undetected (DU)
\item \textbf{Four statistical properties of a system} \\
$ \sum \lambda_{SD}$, $\sum \lambda_{SU}$, $\sum \lambda_{DD}$, $\sum \lambda_{DU}$
\end{itemize}
\textbf{Diagnostic Coverage.}
The diagnostic coverage is simply the ratio of the dangerous detected failure probabilities to the probability of all dangerous failures, and is normally expressed as a percentage.
$\Sigma\lambda_{DD}$ represents the summed failure rates of the dangerous detected base component failure modes, and $\Sigma\lambda_D$ the summed failure rates of all dangerous base component failure modes.
$$ DiagnosticCoverage = \Sigma\lambda_{DD} / \Sigma\lambda_D $$
The \textbf{diagnostic coverage} for safe failures, where $\Sigma\lambda_{SD}$ represents the summed failure rates of the safe detected base component failure modes, and $\Sigma\lambda_S$ the summed failure rates of all safe base component failure modes, is given as
$$ SF = \frac{\Sigma\lambda_{SD}}{\Sigma\lambda_S} $$

\textbf{Safe Failure Fraction.}
A key concept in FMEDA is the Safe Failure Fraction (SFF).
This is the ratio of safe and dangerous detected failures to all safe and dangerous failure probabilities.
Again this is usually expressed as a percentage.
$$ SFF = \big( \Sigma\lambda_S + \Sigma\lambda_{DD} \big) / \big( \Sigma\lambda_S + \Sigma\lambda_D \big) $$
SFF determines how proportionately fail-safe a system is, not how reliable it is.
There is a weakness in this philosophy: adding extra safe failures (even unused ones) improves the SFF.
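The two ratios, and the weakness in the SFF, can be illustrated with a short worked example.
The failure rates below are hypothetical values chosen only to make the arithmetic easy; they are not taken from any standard or real product.
Assume $\Sigma\lambda_{SD} = 400$, $\Sigma\lambda_{SU} = 100$, $\Sigma\lambda_{DD} = 450$ and $\Sigma\lambda_{DU} = 50$ (all in FIT), so that $\Sigma\lambda_S = 500$ and $\Sigma\lambda_D = 500$.
$$ DiagnosticCoverage = \frac{450}{500} = 90\% \qquad SFF = \frac{500 + 450}{500 + 500} = 95\% $$
If a further 1000 FIT of safe failure modes is added to the model, with the dangerous failure rates unchanged, then
$$ SFF = \frac{1500 + 450}{1500 + 500} = 97.5\% $$
The SFF has risen, yet the dangerous undetected failure rate is still 50 FIT.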
To achieve SIL levels, diagnostic coverage and SFF levels are prescribed along with hardware architectures and software techniques.
The overall aim of SIL is to classify the safety of a system, by statistically determining how frequently it can fail dangerously.
\begin{table}[ht]
\caption{Safety Integrity Level (SIL) failure probability bands} % title of Table
%\centering % used for centering table
\begin{tabular}{|| l | c | c ||} \hline
\textbf{SIL} & \textbf{Low Demand} & \textbf{Continuous Demand} \\
 & Prob of failing on demand & Prob of failure per hour \\ \hline \hline
4 & $\geq 10^{-5}$ to $< 10^{-4}$ & $\geq 10^{-9}$ to $< 10^{-8}$ \\ \hline
3 & $\geq 10^{-4}$ to $< 10^{-3}$ & $\geq 10^{-8}$ to $< 10^{-7}$ \\ \hline
2 & $\geq 10^{-3}$ to $< 10^{-2}$ & $\geq 10^{-7}$ to $< 10^{-6}$ \\ \hline
1 & $\geq 10^{-2}$ to $< 10^{-1}$ & $\geq 10^{-6}$ to $< 10^{-5}$ \\ \hline
\hline
\end{tabular}
\end{table}
Table adapted from EN61508-1:2001 [7.6.2.9 p33].

FMEDA is a modern extension of FMEA, in that it allows for self checking features and provides detailed recommendations for computer/software architecture.
It has a simple final result, a Safety Integrity Level (SIL) from 1 to 4 (where 4 is safest).
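As an illustration of how the continuous demand column of the SIL table is read, consider a hypothetical system whose summed dangerous undetected failure rate is 50 FIT.
The figure is an assumption made purely for this sketch, and it is assumed here that this rate is the quantity compared against the table.
\begin{equation}
50\ \mathrm{FIT} = 50 \times 10^{-9}\,\mathrm{h}^{-1} = 5 \times 10^{-8}\,\mathrm{h}^{-1},
\qquad
10^{-8} \leq 5 \times 10^{-8} < 10^{-7}
\end{equation}
On failure rate alone this hypothetical system would fall in the SIL 3 band.
The rate is only one of the criteria: diagnostic coverage, SFF, hardware architectures and software techniques are also prescribed for each level.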
\section{FMEA used for Safety Critical Approvals}
\subsection{DESIGN FMEA: Safety Critical Approvals FMEA}
\begin{figure}[h]
\centering
\includegraphics[width=300pt,keepaspectratio=true]{./CH2_FMEA/tech_meeting.png}
% tech_meeting.png: 350x299 pixel, 300dpi, 2.97x2.53 cm, bb=0 0 84 72
\caption{FMEA Meeting}
\label{fig:tech_meeting}
\end{figure}
Static FMEA, Design FMEA and Approvals FMEA are names for the same activity.
Experts from the Approval House and the Equipment Manufacturer discuss selected component failure modes judged to be in critical sections of the product.
\begin{itemize}
\item It is impossible to look at all component failures, let alone apply FMEA rigorously.
\item In practice, failure scenarios for critical sections are contested, and either justified or extra safety measures implemented.
\item Often only meeting notes or minutes are kept. It is unusual for detailed arguments to be documented.
\end{itemize}
\section{Literature Review}
%% FOCUS
The focus of this literature review is to establish the practice and applications of FMEA, and to examine its strengths and weaknesses.
%% GOAL
Its goal is to identify central issues and to criticise and assess the current FMEA methodologies.
%% PERSPECTIVE
The perspective of the author is that of a practitioner of static failure mode analysis techniques concerning the approval of products to European safety standards, both the prescriptive~\cite{en298,en230} and the statistical~\cite{en61508}.
A second perspective is that of a software engineer trained to use formal methods.
Examining FMEA methodologies for mathematical properties, influenced by formal methods applied to software, should provide an angle not traditionally considered.
%% COVERAGE
The literature reviewed has been restricted to published books, European safety standards (as examples of current safety measures applied), and traditional research from journal and conference papers.
%% ORGANISATION
The review is organised by concept, that is, FMEA can be applied to hardware, software, software~interfacing and to multiple failure scenarios etc.
Methodologies related to FMEA are briefly covered for the sake of context.
%% AUDIENCE
% PhD supervisors and examiners.
\subsection{Related Methodologies}
FTA, HAZOP, ALARP, Event Tree Analysis and the bow tie concept.
\subsection{Hardware FMEA (HFMEA)}
\subsection{Multiple Failure scenarios and FMEA}
\subsection{Software FMEA (SFMEA)}
\paragraph{Current work on Software FMEA}
SFMEA usually does not seek to integrate hardware and software models, but to perform FMEA on the software in isolation~\cite{procsfmea}.
%
Work has been performed using databases to track the relationships between variables and system failure modes~\cite{procsfmeadb}, to introduce automation into the FMEA process~\cite{appswfmea} and to provide code analysis automation~\cite{modelsfmea}.
Although the SFMEA and hardware FMEAs are performed separately, some schools of thought aim for Fault Tree Analysis (FTA)~\cite{nasafta,nucfta} (top down, deductive) and FMEA (bottom-up, inductive) to be performed on the same system to provide insight into the software/hardware interface~\cite{embedsfmea}.
%
Although this would give a better picture of the failure mode behaviour, it is by no means a rigorous approach to tracing errors that may occur in hardware through to the top (and therefore ultimately controlling) layer of software.
\paragraph{Current FMEA techniques are not suitable for software}
The main FMEA methodologies are all based on the concept of taking base component {\fms}, and translating them into system level events/failures~\cite{sfmea,sfmeaa}.
In a complicated system, mapping a component failure mode to a system level failure will mean a long reasoning distance; that is to say, the actions of the failed component will have to be traced through several sub-systems, gauging its effects with and on other components.
%
With software at the higher levels of these sub-systems, we have yet another layer of complication.
%
SFMEA regards the variables used by the programs as the equivalent of hardware components~\cite{procsfmea}.
The failure modes of these variables are that they could be erroneously over-written, calculated incorrectly (due to a mistake by the programmer, or a fault in the micro-processor on which the program is running), or altered by external influences such as ionising radiation flipping bits.
\section{Conclusion}
\paragraph{Where FMEA is now}
FMEA is a useful tool for basic safety: it provides statistics on safety where field data is impractical, and it is very good with single failure modes linked to top level events.
FMEA has become part of the safety critical and safety certification industries.
%
SFMEA is in its infancy, and there is a gap in current certification for software: EN61508~\cite{en61508} recommends hardware redundancy architectures in conjunction with FMEDA for hardware, but for software it recommends language constraints and quality procedures and no inductive fault finding technique.
FMEA has adapted from a cost saving exercise for mass produced items, to incorporating statistical techniques (FMECA), to allowing for self diagnostic mitigation (FMEDA).
However, it is still based on the single component failure mapped to system level failure.
All these FMEA based methodologies have the following shortcomings:
\begin{itemize}
\item It is impossible to integrate software and hardware models,
\item The state explosion problem is exacerbated by increasing complexity due to the density of modern electronics,
\item It is impossible to consider all multiple component failure modes.
\end{itemize}