%%% CHAPTER 2
\label{sec:chap2}
The generic and statistical European Safety Standard, EN61508:6~\cite{en61508}[B.6.6]
describes Failure Mode Effects Analysis (FMEA) as:
\begin{quotation}
``To analyse a system design, by examining all possible sources of failure
of a system's components and determining the effects of these failures
on the behaviour and safety of the system.''
\end{quotation}
\section{FMEA}
\label{basicfmea}
%\subsection{FMEA}
%\tableofcontents[currentsection]
\paragraph{FMEA basic concept.}
FMEA~\cite{safeware}[pp.341-344] is widely used, and proof of its use is a mandatory legal requirement
for a large proportion of safety critical products sold in the European Union.
The acronym FMEA can be expanded as follows:
\begin{itemize}
\item \textbf{F - Failures of a given component,} Consider a particular component in a system;
\item \textbf{M - Failure Mode,} Look at one of the ways in which it can fail (i.e. determine a component `failure~mode');
\item \textbf{E - Effects,} Determine the effects this failure mode will have on the system we are examining;
\item \textbf{A - Analysis,} Analyse how much impact this symptom will have on the environment, people and the system itself.
\end{itemize}
%
FMEA is a broad term; it could mean anything from an informal check on
how failures could affect some equipment in %an initial
a brain-storming session
%in product design,
to formal submission as part of safety critical certification.
FMEA is a time intensive process. To reduce the amount of work required,
software packages~\cite{931423, 1778436820050601} and analysis strategies have
been developed~\cite{incrementalfmea, automatingFMEA1281774}.
%
FMEA is always performed in context. That is, the equipment is always analysed for a particular purpose
and in a given environment. An `O' ring, for instance, can fail by leaking,
but if fitted to a water seal on a garden hose, the system level failure
would be a slight leak at the tap outside the house.
%
Applied to the rocket engine on a space shuttle, that same `O' ring failure mode
could cause a catastrophic fire and destruction of the spacecraft~\cite{challenger}.
%
At a lower level, consider a resistor and capacitor forming a potential divider to ground.
This could be considered a low pass filter in some electrical environments~\cite{aoe},
but for fixed frequencies the same circuit could be used as a phase changer~\cite{electronicssysapproach}[p.114].
When used as a low pass filter, its failure modes could be `no~signal' and `all~pass',
but when used as a phase changer, they would be `no~signal' and `no~phase~change'.

This chapter describes basic concepts of FMEA, uses a simple example to
demonstrate a single FMEA analysis stage, describes the five main variants of FMEA in use today
and explores some concepts with which we can discuss and evaluate
the effectiveness of FMEA.
\section{Determining the failure modes of components}
\label{sec:determine_fms}
In order to apply any form of FMEA we need to know the ways in which
the components we are using can fail.
%
A good introduction to hardware and software failure modes may be found in~\cite{sccs}[pp.114-124].
%
Typically when choosing components for a design, we look at manufacturers' data sheets,
which describe functionality, physical dimensions,
environmental ranges and tolerances, and can indicate how a component may fail or misbehave
under given conditions.
%
How base components could fail internally is not of interest to an FMEA investigation.
The FMEA investigator needs to know what failure behaviour a component may exhibit. %, or in other words, its modes of failure.
%
A large body of literature exists which gives guidance for determining component {\fms}.
%
For this study FMD-91~\cite{fmd91} and the gas burner standard EN298~\cite{en298} are examined.
%Some standards prescribe specific failure modes for generic component types.
In EN298 failure modes for most generic component types are listed, or if not listed,
determined by considering all pins OPEN and all adjacent pins shorted.
%a procedure where failure scenarios of all pins OPEN and all adjacent pins shorted
%are examined.
%
%
FMD-91 is a reference document released into the public domain by the United States DOD
and describes `failures' of common electronic components, with percentage statistics for each failure.
%
FMD-91 entries include general descriptions of internal failures alongside {\fms} of use to an FMEA investigation.
%
FMD-91 entries need, in some cases, some interpretation to be mapped to a clear set of
component {\fms} suitable for use in FMEA.
A third document, MIL-1991~\cite{mil1991}, provides overall reliability statistics for
component types, but does not detail specific failure modes.
%
Using MIL-1991 in conjunction with FMD-91 we can determine statistics for the failure modes
of component types.
%
The FMEDA process from European standard EN61508~\cite{en61508}
requires statistics for Mean Time To Failure (MTTF) for all {\bc} failure modes.
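%
The way MIL-1991 and FMD-91 can be combined, as described above, may be sketched as follows.
This is an illustration under stated assumptions, not a formula taken from either document:
$\lambda_c$ denotes the overall failure rate of a component type (the kind of figure MIL-1991 provides),
$\alpha_m$ the fraction of failures attributed to failure mode $m$ (the kind of percentage FMD-91 provides),
and the numerical values below are hypothetical.
\[
% Assumed notation: \lambda_c, \alpha_m and \lambda_m are introduced here for illustration only.
\lambda_m = \alpha_m \, \lambda_c , \qquad \sum_{m} \alpha_m = 1
\]
For example, a component type with a hypothetical $\lambda_c = 10$ failures per $10^9$ hours,
of which half are attributed to one mode ($\alpha_m = 0.5$), would give that mode
$\lambda_m = 5$ failures per $10^9$ hours.
Under the further assumption of a constant failure rate, the corresponding MTTF is $1/\lambda_m$,
here $2 \times 10^{8}$ hours.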
% One is from the US military document FMD-91, where internal failures
% of components are described (with stats).
%
% The other is EN298 where the failure modes for generic component types are prescribed, or
% determined by a procedure where failure scenarios of all pins OPEN and all adjacent pins shorted
% is applied. These techniques
%
% The FMD-91 entries need, in some cases, some interpretation to be mapped to
% component failure symptoms, but include failure modes that can be due to internal failures.
% The EN298 SHORT/OPEN procedure cannot determine failures due to internal causes but can be applied to any IC.
%
% Could I come in and see you Chris to quickly discuss these.
%
% I hope to have chapter 5 finished by the end of March, chapter 5 being the
% electronics examples for the FMMD methodology.
\section{Determining the failure modes of Components}
The starting point for FMEA is the set of failure modes of the {\bcs}.
In order to define FMEA we must start with a discussion on how these failure modes are chosen.
%
In this section we look in detail at two common electrical components and examine how
the two sources of information define their failure mode behaviour.
We look at the reasons why some known failure modes
can be found in one source but not in the other, and vice versa.
% are omitted, or presented in
%specific but unintuitive ways.
%We compare the US. military published failure mode specifications wi
%
Finally we compare and contrast the failure modes determined for these components
from the FMD-91 reference source and from the guidelines of the
European burner standard EN298.
\subsection{Failure mode determination for generic resistor.}
\label{sec:resistorfm}
%- Failure modes. Prescribed failure modes EN298 - FMD91
\paragraph{Resistor failure modes according to FMD-91.}
The resistor is a ubiquitous component in electronics, and is therefore a good candidate for detailed examination of its failure modes.
%
FMD-91~\cite{fmd91}[3-178] lists many types of resistor
and many possible failure causes.
For instance for {\textbf{Resistor,~Fixed,~Film}} we are given the following failure causes:
\begin{itemize}
\item Opened 52\%
\item Drift 31.8\%
\item Film Imperfections 5.1\%
\item Substrate defects 5.1\%
\item Shorted 3.9\%
\item Lead damage 1.9\%
\end{itemize}
% This information may be of interest to the manufacturer of resistors, but it does not directly
% help a circuit designer.
% The circuit designer is not interested in the causes of resistor failure, but to build in contingency
% against {\fms} that the resistor could exhibit.
% We can determine these {\fms} by converting the internal failure descriptions
% to {\fms} thus:
To make this useful for FMEA/FMMD we must assign each failure cause to a symptomatic failure mode descriptor,
as shown below.
%
%and map these failure causes to three symptoms,
%drift (resistance value changing), open and short.
\begin{itemize}
\item Opened 52\% $\mapsto$ OPEN
\item Drift 31.8\% $\mapsto$ DRIFT
\item Film Imperfections 5.1\% $\mapsto$ OPEN
\item Substrate defects 5.1\% $\mapsto$ OPEN
\item Shorted 3.9\% $\mapsto$ SHORT
\item Lead damage 1.9\% $\mapsto$ OPEN
\end{itemize}
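%
Summing the FMD-91 percentages over each mapped {\fm} gives a rough weighting for the three
symptomatic failure modes. This is a worked illustration of the mapping above, not a figure quoted from FMD-91.
\begin{eqnarray*}
% Each row sums the source percentages that were mapped to the same symptomatic failure mode.
OPEN  & : & 52\% + 5.1\% + 5.1\% + 1.9\% = 64.1\% \\
DRIFT & : & 31.8\% \\
SHORT & : & 3.9\%
\end{eqnarray*}
The three totals sum to 99.8\%, as do the six source percentages; the shortfall from 100\% is
presumably due to rounding in the source figures.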
%
The main cause of drift is overloading of components.
This is borne out in the FMD-91~\cite{fmd91}[232] entry for a resistor network where the failure
modes do not include drift.
%
If we can ensure that our resistors will not be exposed to overload conditions, the
probability of drift (sometimes called parameter change) occurring
is significantly reduced, enough for some standards to exclude it~\cite{en298,en230}.
\paragraph{Resistor failure modes according to EN298.}
EN298, the European gas burner safety standard, tends to give failure modes more directly usable for performing FMEA than FMD-91.
EN298 requires that a full FMEA be undertaken, examining all failure modes
of all electronic components~\cite{en298}[11.2 5] as part of the certification process.
%
Annex A of EN298 prescribes failure modes for common components
and gives guidance on determining sets of failure modes for complex components (i.e. integrated circuits).
EN298~\cite{en298}[Annex A] (for most types of resistor)
only requires that the failure mode OPEN be considered for FMEA analysis.
%
For resistor types not specifically listed in EN298, the failure modes
are considered to be either OPEN or SHORT.
The reason that parameter change is not considered for resistors chosen for an EN298 compliant system is that they must be {\em downrated}.
That is to say, the power and voltage ratings of components must be calculated
for maximum possible exposure, with a 40\% margin of error. This drastically reduces the probability
that the resistors will be overloaded,
and thus subject to drift/parameter change.
% XXXXXX get ref from colin T
%If a resistor was rated for instance for
%These are useful for resistor manufacturersthey have three failure modes
%EN298
%Parameter change not considered for EN298 because the resistors are down-rated from
%maximum possible voltage exposure -- find refs.
% FMD-91 gives the following percentages for failure rates in
% \label{downrate}
% The parameter change, is usually a failure mode associated with over stressing the component.
%In a system designed to typical safety critical constraints (as in EN298)
%these environmentally induced failure modes need not be considered.
\subsubsection{Resistor Failure Modes}
\label{sec:res_fms}
For this study we will take the conservative view from EN298, and consider the failure
modes for a generic resistor to be both OPEN and SHORT,
i.e.
\label{ros}
\[ fm(R) = \{ OPEN, SHORT \} . \]
%
% Mention tolerance here
%
% hmmmmmm
%
\subsection{Failure mode determination for generic operational amplifier}
\begin{figure}[h]
\centering
\includegraphics[width=200pt]{CH5_Examples/lm258pinout.jpg}
% lm258pinout.jpg: 478x348 pixel, 96dpi, 12.65x9.21 cm, bb=0 0 359 261
\caption{Pinout for an LM358 dual OpAmp}
\label{fig:lm258}
\end{figure}
The operational amplifier (op-amp) %is a differential amplifier and
is very widely used in nearly all fields of modern analogue electronics.
Op-amps are typically packaged in dual or quad configurations, meaning
that a chip will typically contain two or four amplifiers.
For the purpose of example, we look at
a typical op-amp designed for instrumentation and measurement, the dual packaged version of the LM358~\cite{lm358}
(see figure~\ref{fig:lm258}), and use this to compare the failure mode derivations from FMD-91 and EN298.
\paragraph{ Failure Modes of an OpAmp according to FMD-91 }
%Literature suggests, latch up, latch down and oscillation.
For OpAmp failure modes, FMD-91~\cite{fmd91}[3-116] states:
\begin{itemize}
\item Degraded Output (low slew rate, poor die attach) 50\%
\item No Operation (overstress) 31.3\%
\item Shorted ($V_+$ to $V_-$, overstress, resistive short in amplifier) 12.5\%
\item Opened ($V_+$ open) 6.3\%
\end{itemize}
Again these are mostly internal causes of failure, more of interest to the component manufacturer
than a designer looking for the symptoms of failure.
We need to translate these failure causes within the OpAmp into {\fms}.
We can look at each failure cause in turn, and map it to potential {\fms} suitable for use in FMEA
investigations.
\paragraph{OpAmp failure cause: Poor Die attach}
The symptom for this is given as a low slew rate. This means that the op-amp
will not react quickly to changes on its input terminals.
This is a failure symptom that may not be of concern in a slow responding system like an
instrumentation amplifier. However, where higher frequencies are being processed,
a signal may be lost entirely.
We can map this failure cause to a {\fm}, and we can call it $LOW_{slew}$.
\paragraph{No Operation - over stress}
Here the op-amp has been damaged, and the output may be held HIGH or LOW, or may be
effectively tri-stated, i.e. not able to drive circuitry in the next stages of
the signal path: we can call this state NOOP (no operation).
%
We can map this failure cause to three {\fms}, $LOW$, $HIGH$, $NOOP$.
\paragraph{Shorted $V_+$ to $V_-$}
Due to the high intrinsic gain of an op-amp, and the effect of offset currents,
this will force the output HIGH or LOW.
We map this failure cause to $HIGH$ or $LOW$.
\paragraph{Open $V_+$}
This failure cause will mean that the minus input will have the very high gain
of the OpAmp applied to it, and the output will be forced HIGH or LOW.
We map this failure cause to $HIGH$ or $LOW$.
\paragraph{Collecting OpAmp failure modes from FMD-91}
We can define an OpAmp, under FMD-91 definitions, to have the following {\fms}.
\begin{equation}
\label{eqn:opampfms}
fm(OpAmp) = \{ HIGH, LOW, NOOP, LOW_{slew} \}
\end{equation}
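The collection step can be written out as a union over the {\fms} assigned to each failure cause above.
This is only a restatement of the four mappings, with the set union notation assumed here for illustration.
\[
% Order of terms: poor die attach, no operation, shorted V+ to V-, open V+.
\{ LOW_{slew} \} \cup \{ LOW, HIGH, NOOP \} \cup \{ HIGH, LOW \} \cup \{ HIGH, LOW \}
= \{ HIGH, LOW, NOOP, LOW_{slew} \}
\]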
\paragraph{Failure Modes of an OpAmp according to EN298}
EN298 does not specifically define op-amp failure modes; these can be determined
by following a procedure for `integrated~circuits' outlined in
annex~A~\cite{en298}[A.1 note e].
This demands that all open connections, and shorts between adjacent pins, be considered as failure scenarios.
We examine these failure scenarios on the dual packaged $LM358$~\cite{lm358}%\mu741$
and determine its {\fms} in table~\ref{tbl:lm358}.
Collecting the op-amp failure modes from table~\ref{tbl:lm358} we obtain $HIGH$, $LOW$ and $NOOP$:
the {\fms} that we got from FMD-91, listed in equation~\ref{eqn:opampfms}, with the exception of
$LOW_{slew}$, which has an internal cause that the pin-level procedure cannot reveal.
%\paragraph{EN298: Open and shorted pin failure symptom determination technique}
\begin{table}[h]
\caption{LM358: EN298 Open and shorted pin failure symptom determination technique}
\begin{tabular}{|| l | l | c | c | l ||} \hline
%\textbf{Failure Scenario} & & \textbf{Amplifier Effect} & & \textbf{Symptom(s)} \\
\textbf{Failure} & & \textbf{Amplifier Effect} & & \textbf{Derived Component} \\
\textbf{cause} & & \textbf{ } & & \textbf{Failure Mode} \\
\hline
& & & & \\ \hline
FS1: PIN 1 OPEN & & A output open & & $NOOP_A$ \\ \hline
FS2: PIN 2 OPEN & & A-input disconnected, & & \\
& & infinite gain on A+input & & $LOW_A$ or $HIGH_A$ \\ \hline
FS3: PIN 3 OPEN & & A+input disconnected, & & \\
& & infinite gain on A-input & & $LOW_A$ or $HIGH_A$ \\ \hline
FS4: PIN 4 OPEN & & power to chip (ground) disconnected & & $NOOP_A$ and $NOOP_B$ \\ \hline
FS5: PIN 5 OPEN & & B+input disconnected, & & \\
& & infinite gain on B-input & & $LOW_B$ or $HIGH_B$ \\ \hline
FS6: PIN 6 OPEN & & B-input disconnected, & & \\
 & & infinite gain on B+input & & $LOW_B$ or $HIGH_B$ \\ \hline
FS7: PIN 7 OPEN & & B output open & & $NOOP_B$ \\ \hline
FS8: PIN 8 OPEN & & power to chip & & \\
 & & (Vcc) disconnected & & $NOOP_A$ and $NOOP_B$ \\ \hline
& & & & \\
& & & & \\
& & & & \\ \hline
FS9: PIN 1 $\stackrel{short}{\longrightarrow}$ PIN 2 & & A -ve 100\% Feed back, low gain & & $LOW_A$ \\ \hline
FS10: PIN 2 $\stackrel{short}{\longrightarrow}$ PIN 3 & & A inputs shorted, & & \\
& & output controlled by internal offset & & $LOW_A$ or $HIGH_A$ \\ \hline
FS11: PIN 3 $\stackrel{short}{\longrightarrow}$ PIN 4 & & A + input held to ground & & $LOW_A$ \\ \hline
FS12: PIN 5 $\stackrel{short}{\longrightarrow}$ PIN 6 & & B inputs shorted, & & \\
& & output controlled by internal offset & & $LOW_B$ or $HIGH_B$ \\ \hline
FS13: PIN 6 $\stackrel{short}{\longrightarrow}$ PIN 7 & & B -ve 100\% Feed back, low gain & & $LOW_B$ \\ \hline
FS14: PIN 7 $\stackrel{short}{\longrightarrow}$ PIN 8 & & B output held high & & $HIGH_B$ \\ \hline
\hline
\end{tabular}
\label{tbl:lm358}
\end{table}
%\clearpage
\subsubsection{Failure modes of an OpAmp}
\label{sec:opamp_fms}
For the purpose of the examples to follow, the op-amp will
have the following failure modes:
\[ fm(OpAmp) = \{ LOW, HIGH, NOOP, LOW_{slew} \} \]
\subsection{Comparing the component failure mode sources}
The EN298 open and shorted pin technique cannot reveal failure modes due to internal failures.
The FMD-91 entries for op-amps are not directly usable as
component {\fms} in FMEA or FMMD and require interpretation.
%For our OpAmp example could have come up with different symptoms for both sides. Cannot predict the effect of internal errors, for instance ($LOW_{slew}$)
%is missing from the EN298 failure modes set.
% FMD-91
%
% I have been working on two examples of determining failure modes of components.
% One is from the US military document FMD-91, where internal failures
% of components are described (with stats).
%
% The other is EN298 where the failure modes for generic component types are prescribed, or
% determined by a procedure where failure scenarios of all pins OPEN and all adjacent pins shorted
% is applied. These techniques
%
% The FMD-91 entries need, in some cases, some interpretation to be mapped to
% component failure symptoms, but include failure modes that can be due to internal failures.
% The EN298 SHORT/OPEN procedure cannot determine failures due to internal causes but can be applied to any IC.
%
% Could I come in and see you Chris to quickly discuss these.
%
% I hope to have chapter 5 finished by the end of March, chapter 5 being the
% electronics examples for the FMMD methodology.
\clearpage
%%
%% Paragraph using failure modes to build from bottom up
%%
% \subsection{FMEA}
% This talk introduces Failure Mode Effects Analysis, and the different ways it is applied.
% These techniques are discussed, and then
% a refinement is proposed, which is essentially a modularisation of the FMEA process.
% %
%
% \begin{itemize}
% \item Failure
% \item Mode
% \item Effects
% \item Analysis
% \end{itemize}
%
%
%
% % % \begin{itemize}
% % \item Failure
% % \item Mode
% % \item Effects
% % \item Analysis
% % \end{itemize}
%\clearpage
FMEA is a procedure which starts with the failure modes of the low level components of a system. An example
analysis will serve to demonstrate it in practice.
\paragraph{FMEA Example: Milli-volt reader}
Let us consider a system, in this case a milli-volt reader, consisting
of instrumentation amplifiers connected to a micro-processor
that reports its readings via RS-232.
\begin{figure}
\centering
\includegraphics[width=175pt]{./CH2_FMEA/mvamp.png}
% mvamp.png: 561x403 pixel, 72dpi, 19.79x14.22 cm, bb=0 0 561 403
\caption{System diagram of a milli-volt reader, showing an expanded circuit diagram for the component of interest.}
\end{figure}
\subsection{FMEA Example: Milli-volt reader}
Let us perform an FMEA on the milli-volt reader and consider how the failure of one of its resistors
could affect it.
For the sake of example let us choose resistor R1 in the op-amp gain circuitry.
% \begin{figure}
% \centering
% \includegraphics[width=175pt]{./mvamp.png}
% % mvamp.png: 561x403 pixel, 72dpi, 19.79x14.22 cm, bb=0 0 561 403
% \end{figure}
% \begin{figure}
% \centering
% \includegraphics[width=80pt]{./mvamp.png}
% % mvamp.png: 561x403 pixel, 72dpi, 19.79x14.22 cm, bb=0 0 561 403
% \end{figure}
\begin{itemize}
\item \textbf{F - Failures of given component} The resistor (R1) could fail by going OPEN or SHORT (EN298 definition).
\item \textbf{M - Failure Mode} Consider the component failure mode SHORT.
\item \textbf{E - Effects} This will drive the minus input LOW, causing a HIGH output/reading.
\item \textbf{A - Analysis} The reading will be out of the normal range, and we will have an erroneous milli-volt reading.
\end{itemize}
The analysis above has given us a result for one failure scenario i.e.
for one component failure mode.
A complete FMEA report would have to contain an entry
for each failure mode of all the components in the system under investigation.
%
In theory we have had to look at the failure~mode
in relation to the entire circuit.
We have used intuition to determine the probable
effect of this failure mode.
For instance we have assumed that the resistor R1 going SHORT
will not affect the ADC, the Microprocessor or the UART.
%
We have taken the {\bc} {\fm} R1 SHORT and then followed the failure reasoning path through to a putative system level symptom.
We have not looked in detail at any side effects of this {\fm}.
%
To put this in more general terms, we have not examined this failure mode
against every other component in the system.
Perhaps we should: this would be a more rigorous and complete
approach to looking for system failures.
\section{Theoretical Concepts in FMEA}
In this section we examine some fundamental concepts and underlying philosophies of FMEA.
\paragraph{The unacceptability of a single component failure causing a catastrophe.}
% NEED SOME NICE HISTORICAL REFS HERE
FMEA, due to its inductive bottom-up approach, is good
at mapping potential single component failures to system level faults/events.
Used in the design phase of a project, FMEA is a useful tool
for discovering potential failure scenarios~\cite{1778436820050601}.
%
%
\paragraph{Subjective and Objective thinking in relation to FMEA.}
\label{sec:subjectiveobjective}
FMEA is always performed in the context of the use of the equipment.
In philosophical terms, the context lies in the domain of the subjective, and the
logic and reasoning behind failure causation in the domain of the objective.
%
By using objective reasoning we trace a component level failure to a system level event,
but only in
the subjective sense can we determine its meaning and/or severity.
%
It is worth remembering that
failure mode analysis performed on the leaks possible from the `O' ring on the space shuttle
did not link this failure to the catastrophic failure of the spacecraft~\cite{challenger,sanjeev}.
This was not a failure in the objective reasoning, but in the subjective: the context in which the leak occurred.
%
FMEA is less useful for determining events for multiple
simultaneous\footnote{Multiple simultaneous failures are taken to mean failures that occur within the same detection period.}
failures.
%
This is because, with the additional complication of having to change between these two modes of thinking, it becomes more difficult to
get a balance between subjective and objective perspectives.
%
Work has been performed using component failure statistics to
offer the more likely multiple failures~\cite{FMEAmultiple653556} for analysis.
%subjective/objective become more cluttered when there are multiple possibilities
%for the the results of an FMEA line of reasoning.
\paragraph{Failure modes and their observability criterion: detectable and undetectable.}
Often the effects of a failure mode may be easy to detect,
and our equipment can react by raising an alarm or compensating for the resulting fault.
%
Some failure modes may cause undetectable failures. For instance, a component failure that causes
a measured reading to change could have adverse consequences, yet
would not be flagged as a failure by the system, because
the system has no way of knowing the reading is invalid.
%
The term observable has a specific meaning in the field of control engineering~\cite{721666, ACS:ACS1297};
systems submitted for FMEA are generally related to control systems,
and so to avoid confusion the terms `detectable' and `undetectable'
will be used for describing the observability of failure modes in this document.
\glossary{name={observability}, description={The property of a system failure in relation to a particular component failure mode, where it can be determined whether the readings/actions associated with it are valid, or the by-product of a failure. If we cannot determine that there is a fault present, the system level failure is said to be unobservable.}}
\paragraph{Impracticality of Field Data for modern systems.}
Modern electronic components are generally very reliable, and the systems built from them
are thus very reliable too. Reliable field data on failures will, therefore, be sparse.
Should we wish to prove a continuous demand system for say ${10}^{-7}$ failures\footnote{${10}^{-7}$ failures per hour of operation is the
threshold for SIL 3 reliability~\cite{en61508}. Failure rates are normally measured per $10^9$ hours of operation
and are known as Failure in Time (FIT) values. The maximum FIT value for a SIL 3 system is therefore 100.}
per hour of operation, even with 1000 correctly monitored units in the field
we could only expect one failure per ten thousand hours (a little under one a year).
It would be utterly impractical to get statistically significant data for equipment
at these reliability levels.
However, we can use FMEA (more specifically the FMEDA variant, see section~\ref{sec:FMEDA}),
working from known component failure rates, to obtain
statistical estimates of the equipment reliability.
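%
The arithmetic behind the field data estimate above can be set out as follows, taking the figures from the text
and assuming a constant failure rate and 8760 hours in a year (both assumptions made here for illustration).
\begin{eqnarray*}
% 1000 units, each at the 10^{-7} failures per hour level discussed above.
\mbox{fleet failure rate} & = & 1000 \times 10^{-7} = 10^{-4} \mbox{ failures per hour} \\
\mbox{mean time between fleet failures} & = & 1/10^{-4} = 10^{4} \mbox{ hours} \approx 1.14 \mbox{ years} \\
\mbox{expected failures per year} & = & 8760 \times 10^{-4} \approx 0.88
\end{eqnarray*}
In FIT terms, $10^{-7}$ failures per hour corresponds to $10^{-7} \times 10^{9} = 100$ FIT per unit.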
\paragraph{Forward and backward searches.}
A forward search starts with possible failure causes
and uses logic and reasoning to determine system level outcomes.
%
A backward search starts with (undesirable) system level events and
works back down to potential causes using de-composition
of the system and logic.
FMEA based methodologies are forward searches~\cite{Lutz:1997:RAU:590564.590572} and top down
methodologies such as FTA~\cite{nucfta,nasafta} are backward searches.
Backward search types of fault analysis are said to be `deductive'
(i.e. the causes of a failure are deduced).
Forward (or bottom-up) searches are said to be `inductive' (i.e. the results of failure are
induced).
\paragraph{Reasoning distance.}
\label{reasoningdistance}
A reasoning distance is the number of stages of logic and reasoning
required to map a failure cause to its potential outcomes.
%
In our basic FMEA example in section~\ref{basicfmea}
we were asked to consider one failure mode against all the components in the milli-volt reader.
%
To create a complete FMEA report on the milli-volt reader we would have had to examine every
known failure mode of every component within it---against all its other components.
%
For an exhaustive analysis, the reasoning~distance is the total number of failure modes
in the system, multiplied by the number of other components each must be examined against.
%
If the milli-volt reader had, say, 100 components with three failure modes each, this
would give a reasoning distance of $3 \times 100 \times 99 = 29{,}700$.
%.... general concept... simple ideas about how complex a
%failure analysis is the more modules and components are involved
% cite for forward and backward search related to safety critical software
%{sfmeaforwardbackward}
\subsection{FMEA and the State Explosion Problem}
\paragraph{Exhaustive Single Failure FMEA.}
FMEA for a safety critical certification~\cite{en298,en61508} will have to be applied
to all known failure modes of all components within a system.
To perform FMEA exhaustively we would need to examine every possible interaction
of a failure mode with all other components in the system; in other words,
we would need to look at all possible failure scenarios.
%to do this completely (all failure modes against all components).
This is represented in the equation below, %~\ref{eqn:fmea_state_exp},
where $N$ is the total number of components in the system, and
$f$ is the number of failure modes per component.
\begin{equation}
\label{eqn:fmea_single}
N.(N-1).f % \\
%(N^2 - N).f
\end{equation}
This would mean of the order of $O(N^2)$ checks
to undertake an `exhaustive~FMEA'. Even small systems typically have
100 components, and these typically have 3 or more failure modes each,
giving $100 \times 99 \times 3 = 29{,}700$ checks.
\paragraph{Exhaustive Double Failure FMEA}
For potential double failure
scenarios\footnote{Certain double failure scenarios are already legal requirements: the European gas burner standard (EN298:2003) demands the checking of
double failure scenarios (for burner lock-out scenarios).}
(two components failing within a given time frame) the order becomes $O(N^3)$.
\begin{equation}
\label{eqn:fmea_double}
N.(N-1).(N-2).f % \\
%(N^2 - N).f
\end{equation}
For our theoretical example of 100 components with 3 failure modes each, this is
$100 \times 99 \times 98 \times 3 = 2{,}910{,}600$ failure mode scenarios.
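%
To give a feel for how quickly these counts grow, equations~\ref{eqn:fmea_single} and~\ref{eqn:fmea_double}
can be evaluated for a few system sizes. The values of $N$ other than 100 are hypothetical, chosen here for
illustration, and $f=3$ is assumed throughout.
\begin{center}
\begin{tabular}{|| r | r | r ||} \hline
% Columns: number of components, single failure checks, double failure checks (f = 3 assumed).
$N$ & $N.(N-1).f$ & $N.(N-1).(N-2).f$ \\ \hline \hline
10 & 270 & 2,160 \\ \hline
100 & 29,700 & 2,910,600 \\ \hline
1000 & 2,997,000 & 2,991,006,000 \\ \hline
\end{tabular}
\end{center}
The ratio of the double failure count to the single failure count is $N-2$, so the extra burden of
double failure analysis itself grows in proportion to the size of the system.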
\paragraph{Reliance on experts for meaningful FMEA Analysis.}
Because of state explosion, current FMEA methodologies cannot take an exhaustive approach.
We define exhaustive FMEA ({\XFMEA}) as examining the effect of every component failure mode
against the remaining components in the system under investigation.
%
Because we cannot perform {\XFMEA},
we rely on experts in the system under investigation
to perform a meaningful FMEA analysis.
%
In practice these experts have to select the areas they see as most critical for detailed FMEA analysis.
\subsection{Component Tolerance}
Component tolerances may need to be considered when determining if a component has failed.
Calculations for acceptable ranges to determine failure or acceptable conditions
must be made where appropriate.
An example of component tolerance considered for FMEA
is given in section~\ref{sec:resistortolerance}.
\section{FMEA in current usage: Five variants}
\paragraph{Five main Variants of FMEA}
\begin{itemize}
\item \textbf{PFMEA - Production} Car manufacture etc.
\item \textbf{FMECA - Criticality} Military/Space
\item \textbf{FMEDA - Statistical safety} EN61508/IEC61508 Safety Integrity Levels
\item \textbf{DFMEA - Design or static/theoretical} EN298/EN230/UL1998
\item \textbf{SFMEA - Software FMEA} Only used in highly critical systems at present
\end{itemize}
\section{PFMEA - Production FMEA : 1940's to present}
Production FMEA (PFMEA) is FMEA used to prioritise, in terms of
cost, the problems to be addressed in product production.
It focuses on known problems, and determines the
frequency with which they occur and their cost to fix.
These two figures are multiplied together to give a Risk Priority Number (RPN).
Fixing the problems with the highest RPN
will return the most cost benefit.
% benign example of PFMEA in CARS - make something up.
\subsection{PFMEA Example}
\begin{table}[ht]
\caption{PFMEA Calculations} % title of Table
%\centering % used for centering table
\begin{tabular}{|| l | l | c | c | l ||} \hline
\textbf{Failure Mode} & \textbf{$P$ (probability)} & \textbf{Cost} & \textbf{Symptom} & \textbf{RPN ($P \times$ Cost)} \\ \hline \hline
relay 1 n/c & $1*10^{-5}$ & 38.0 & indicators fail & 0.00038 \\ \hline
relay 2 n/c & $1*10^{-5}$ & 98.0 & doorlocks fail & 0.00098 \\ \hline
% rear end crash & $14.4*10^{-6}$ & 267,700 & fatal fire & 3.855 \\
% ruptured f.tank & & & & \\ \hline
\hline
\end{tabular}
\end{table}
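As a check on the table above, and assuming (from the description in the text rather than from any formal definition) that the RPN is simply the product of the probability $P$ and the cost, the two rows work out as:
\[
\begin{array}{lcl}
\mathit{RPN}_{\mathrm{relay\,1}} & = & 1 \times 10^{-5} \times 38.0 = 3.8 \times 10^{-4} \\
\mathit{RPN}_{\mathrm{relay\,2}} & = & 1 \times 10^{-5} \times 98.0 = 9.8 \times 10^{-4}
\end{array}
\]
Relay 2 has the higher RPN, and so under this scheme its failure mode would be addressed first.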
\section{FMECA - Failure Modes Effects and Criticality Analysis}
\subsection{ FMECA - Failure Modes Effects and Criticality Analysis}
% \begin{figure}
% \centering
% %\includegraphics[width=100pt]{./military-aircraft-desktop-computer-wallpaper-missile-launch.jpg}
% \includegraphics[width=300pt]{./CH2_FMEA/A10_thunderbolt.jpg}
% % military-aircraft-desktop-computer-wallpaper-missile-launch.jpg: 1024x768 pixel, 300dpi, 8.67x6.50 cm, bb=0 0 246 184
% \caption{A10 Thunderbolt}
% \label{fig:f16missile}
% \end{figure}
FMECA places the emphasis on determining the criticality, rather than the cost, of system failures.
It applies some Bayesian statistics (the probabilities of component failures, and of those
failures causing given system level failures).
Applying Bayesian statistics to failure analysis suffers from the
problem that correlation does not imply causation~\cite{bayesfrequentist}.
However, correlation is evidence for causation, and may be the only evidence to hand;
this is the justification for its use.
A history of the usage and development of FMECA may be found in~\cite{FMECAresearch}.
FMECA is very similar to PFMEA but, instead of cost, a criticality or
seriousness factor is ascribed to putative top level incidents.
FMECA has three probability factors for component failures.
\textbf{FMECA ${\lambda}_{p}$ value.}
This is the overall failure rate of a base component.
It is typically expressed as the failure rate per million ($10^6$) or
billion ($10^9$) hours of operation (reference: MIL1991).
\textbf{FMECA $\alpha$ value.}
The failure mode probability, usually denoted by $\alpha$, is the probability of
a particular failure~mode occurring within a component (reference: FMD-91).
%, should it fail.
%A component with N failure modes will thus have
%have an $\alpha$ value associated with each of those modes.
%As the $\alpha$ modes are probabilities, the sum of all $\alpha$ modes for a component must equal one.
\textbf{FMECA $\beta$ value.}
The second probability factor, $\beta$, is the probability that the failure mode
will cause a given system failure.
This is a conditional (`Bayesian') probability: given a particular
component failure mode, it is the probability of a given system level failure.
\textbf{FMECA `t' Value.}
The time for which a system will be operating, or the working lifetime of the product, is
represented by the variable $t$.
%for probability of failure on demand studies,
%this can be the number of operating cycles or demands expected.
\textbf{Severity `s' value.}
A weighting factor to indicate the seriousness of the putative system level error.
%Typical classifications are as follows:~\cite{fmd91}
\begin{equation}
C_m = {\beta} . {\alpha} . {{\lambda}_p} . {t} . {s}
\end{equation}
The failure modes with the highest $C_m$ values would be at the top of a `to~do' list
for a project manager.
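To illustrate the calculation, consider a purely hypothetical failure mode (none of these figures are taken from a data source). Assume a component with ${\lambda}_p = 10$ failures per $10^9$ hours (i.e. $10^{-8}$ per hour), a failure mode with $\alpha = 0.5$ and $\beta = 0.1$, an operating time of $t = 10^5$ hours, and a severity weighting of $s = 5$. Then
\[
C_m = 0.1 \times 0.5 \times 10^{-8} \times 10^{5} \times 5 = 2.5 \times 10^{-4} .
\]
A second failure mode with the same figures, but a severity weighting of $s = 10$, would score $5.0 \times 10^{-4}$ and would therefore be ranked above the first.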
\section{FMEDA - Failure Modes Effects and Diagnostic Analysis}
\subsection{ FMEDA - Failure Modes Effects and Diagnostic Analysis}
% \begin{figure}
% \centering
% \includegraphics[width=200pt]{./SIL.png}
% % SIL.jpg: 350x286 pixel, 72dpi, 12.35x10.09 cm, bb=0 0 350 286
% \caption{SIL requirements}
% \end{figure}
% \begin{itemize}
% \item \textbf{Statistical Safety} Safety Integrity Level (SIL) standards (EN61508/IOC5108).
% \item \textbf{Diagnostics} Diagnostic or self checking elements modelled
% \item \textbf{Complete Failure Mode Coverage} All failure modes of all components must be in the model
% \item \textbf{Guidelines} To system architectures and development processes
% \end{itemize}
FMEDA is the fundamental methodology of the statistical (safety integrity level)
type standards (EN61508/IEC61508).
It provides an overall statistical level of safety
and allows diagnostic mitigation for self checking etc.
It provides guidelines for the design and architecture
of computer/software systems for the four levels of
safety integrity.
%For Hardware
%
FMEDA does force the user to consider all hardware components in a system
by requiring that an MTTF value is assigned for each failure~mode;
the MTTF may be statistically mitigated (improved)
if it can be shown that self-checking will detect failure modes.
For software it provides procedural quality guidelines and constraints (such as forbidding certain
programming languages and/or features).
%\subsection{ FMEDA - Failure Modes Effects and Diagnostic Analysis}
\label{sec:FMEDA}
\textbf{Failure Mode Classifications in FMEDA.}
\begin{itemize}
\item \textbf{Safe or Dangerous:} Failure modes are classified SAFE or DANGEROUS.
\item \textbf{Detectable failure modes:} Failure modes are given the attribute DETECTABLE or UNDETECTABLE.
\item \textbf{Four attributes to Failure Modes:} All failure modes may thus be Safe Detected (SD), Safe Undetected (SU), Dangerous Detected (DD) or Dangerous Undetected (DU).
\item \textbf{Four statistical properties of a system} \\
$ \sum \lambda_{SD}$, $\sum \lambda_{SU}$, $\sum \lambda_{DD}$, $\sum \lambda_{DU}$
\end{itemize}
% Failure modes are classified as Safe or Dangerous according
% to the putative system level failure they will cause.
% The Failure modes are also classified as Detected or
% Undetected.
% This gives us four level failure mode classifications:
% Safe-Detected (SD), Safe-Undetected (SU), Dangerous-Detected (DD) or Dangerous-Undetected (DU),
% and the probabilistic failure rate of each classification
% is represented by lambda variables
% (i.e. $\lambda_{SD}$, $\lambda_{SU}$, $\lambda_{DD}$, $\lambda_{DU}$).
%\subsection{ FMEDA - Failure Modes Effects and Diagnostic Analysis}
\textbf{Diagnostic Coverage.}
The diagnostic coverage is simply the ratio
of the dangerous detected failure rates
to the failure rates of all dangerous failures,
and is normally expressed as a percentage. $\Sigma\lambda_{DD}$ represents
the summed failure rate of the dangerous detected base component failure modes, and
$\Sigma\lambda_D$ the summed failure rate of all dangerous base component failure modes.
$$ \mathit{DiagnosticCoverage} = \Sigma\lambda_{DD} / \Sigma\lambda_D $$
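As a worked illustration with hypothetical figures (assumed for this example only, in failures per $10^9$ hours), suppose $\Sigma\lambda_{DD} = 450$ and $\Sigma\lambda_{DU} = 50$. Since every dangerous failure mode is classified as either detected or undetected, $\Sigma\lambda_D = \Sigma\lambda_{DD} + \Sigma\lambda_{DU} = 500$, and
\[
\frac{\Sigma\lambda_{DD}}{\Sigma\lambda_D} = \frac{450}{500} = 0.9 ,
\]
i.e. a diagnostic coverage of 90\%.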
%\subsection{ FMEDA - Failure Modes Effects and Diagnostic Analysis}
The \textbf{diagnostic coverage} for safe failures, where $\Sigma\lambda_{SD}$ represents the summed failure rate of the
safe detected base component failure modes,
and $\Sigma\lambda_S$ the summed failure rate of all safe base component failure modes,
is given as
$$ SF = \frac{\Sigma\lambda_{SD}}{\Sigma\lambda_S} $$
%\subsection{ FMEDA - Failure Modes Effects and Diagnostic Analysis}
\textbf{Safe Failure Fraction.}
A key concept in FMEDA is the Safe Failure Fraction (SFF).
This is the ratio of the safe and dangerous detected failure rates
to the failure rates of all safe and dangerous failures.
Again this is usually expressed as a percentage.
$$ SFF = \big( \Sigma\lambda_S + \Sigma\lambda_{DD} \big) / \big( \Sigma\lambda_S + \Sigma\lambda_D \big) $$
SFF measures how proportionately fail-safe a system is, not how reliable it is.
There is a weakness in this philosophy: adding extra safe failures (even unused ones) improves the SFF.
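The effect of adding safe failures can be seen in a small worked example. The figures are hypothetical, assumed purely for illustration, and are in failures per $10^9$ hours: $\Sigma\lambda_S = 400$, $\Sigma\lambda_{DD} = 450$ and $\Sigma\lambda_D = 500$. Then
\[
SFF = \frac{400 + 450}{400 + 500} = \frac{850}{900} \approx 94.4\% .
\]
If a further 100 safe failures are added to the design, with the dangerous failure rates unchanged, $\Sigma\lambda_S$ becomes 500 and
\[
SFF = \frac{500 + 450}{500 + 500} = \frac{950}{1000} = 95\% .
\]
The SFF has improved, although the rate of dangerous undetected failures ($\Sigma\lambda_D - \Sigma\lambda_{DD} = 50$) is exactly as before.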
To achieve a given SIL, levels of diagnostic coverage and SFF are prescribed along with
hardware architectures and software techniques.
The overall aim of SIL is to classify the safety of a system
by statistically determining how frequently it can fail dangerously.
%\subsection{ FMEDA - Failure Modes Effects and Diagnostic Analysis}
\begin{table}[ht]
\caption{SIL failure probability bands} % title of Table
%\centering % used for centering table
\begin{tabular}{|| l | c | c ||} \hline
\textbf{SIL} & \textbf{Low Demand} & \textbf{Continuous Demand} \\
 & Probability of failure on demand & Probability of failure per hour \\ \hline \hline
4 & $\geq 10^{-5}$ to $< 10^{-4}$ & $\geq 10^{-9}$ to $< 10^{-8}$ \\ \hline
3 & $\geq 10^{-4}$ to $< 10^{-3}$ & $\geq 10^{-8}$ to $< 10^{-7}$ \\ \hline
2 & $\geq 10^{-3}$ to $< 10^{-2}$ & $\geq 10^{-7}$ to $< 10^{-6}$ \\ \hline
1 & $\geq 10^{-2}$ to $< 10^{-1}$ & $\geq 10^{-6}$ to $< 10^{-5}$ \\ \hline
\hline
\end{tabular}
\end{table}
Table adapted from EN61508-1:2001 [7.6.2.9 p33]
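As an illustration of reading the table, consider a hypothetical continuous demand system whose probability of dangerous failure has been calculated as $5 \times 10^{-8}$ per hour (an assumed figure, not taken from any real product). Since
\[
10^{-8} \leq 5 \times 10^{-8} < 10^{-7} ,
\]
this figure lies in the SIL 3 band. Note that this covers the statistical requirement only; the architectural and procedural requirements of the standard would also have to be met.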
%\subsection{ FMEDA - Failure Modes Effects and Diagnostic Analysis}
FMEDA is a modern extension of FMEA, in that it allows for
self checking features and provides detailed recommendations for computer/software architecture.
It has a simple final result: a Safety Integrity Level (SIL) from 1 to 4 (where 4 is safest).
%FMEA can be used as a term simple to mean Failure Mode Effects Analysis, and is
%part of product approval for many regulated products in the EU and the USA...
\section{FMEA used for Safety Critical Approvals}
\subsection{DESIGN FMEA: Safety Critical Approvals FMEA}
\begin{figure}[h]
\centering
\includegraphics[width=300pt,keepaspectratio=true]{./CH2_FMEA/tech_meeting.png}
% tech_meeting.png: 350x299 pixel, 300dpi, 2.97x2.53 cm, bb=0 0 84 72
\caption{FMEA Meeting}
\label{fig:tech_meeting}
\end{figure}
In static FMEA (also termed design FMEA or approvals FMEA),
experts from the approval house and the equipment manufacturer
discuss selected component failure modes
judged to be in critical sections of the product.
% \begin{figure}[h]
% \centering
% \includegraphics[width=70pt,keepaspectratio=true]{./tech_meeting.png}
% % tech_meeting.png: 350x299 pixel, 300dpi, 2.97x2.53 cm, bb=0 0 84 72
% \caption{FMEA Meeting}
% \label{fig:tech_meeting}
% \end{figure}
\begin{itemize}
\item It is impossible to look at all component failures, let alone apply FMEA rigorously.
\item In practice, failure scenarios for critical sections are contested, and are either justified or extra safety measures are implemented.
\item Often only meeting notes or minutes are kept; it is unusual for detailed arguments to be documented.
\end{itemize}
\section{Conclusions on current FMEA Methodologies}
%% FOCUS
The focus of this chapter %literature review
is to establish the current practice and applications
of FMEA.
%, and to examine its strengths and weaknesses.
%% GOAL
Its
goal is to identify central issues and to criticise and assess the current
FMEA methodologies.
%% PERSPECTIVE
The perspective of the author is that of a practitioner of static failure mode analysis techniques
concerning the approval of products
to European safety standards, both the prescriptive~\cite{en298,en230} and the statistical~\cite{en61508}.
A second perspective is that of a software engineer trained to use formal methods.
Examining FMEA methodologies for mathematical properties, influenced by
formal methods applied to software, should provide a perspective not traditionally considered.
%% COVERAGE
The literature reviewed has been restricted to published books, European safety standards (as examples
of current safety measures applied), and traditional research from journal and conference papers.
%% ORGANISATION
The review is organised by concept; that is, FMEA can be applied to hardware, software, software~interfacing,
multiple failure scenarios etc. Methodologies related to FMEA are briefly covered for the sake of context.
%% AUDIENCE
% Well duh! PhD supervisors and examiners....
% \subsection{Related Methodologies}
% FTA --- HAZOP --- ALARP --- Event Tree Analysis --- bow tie concept
% \subsection{Hardware FMEA (HFMEA)}
% \subsection{Multiple Failure scenarios and FMEA}
% \subsection{Software FMEA (SFMEA)}
\paragraph{Current work on Software FMEA}
SFMEA usually does not seek to integrate
hardware and software models, but to perform
FMEA on the software in isolation~\cite{procsfmea}.
%
Work has been performed using databases
to track the relationships between variables
and system failure modes~\cite{procsfmeadb}, to %work has been performed to
introduce automation into the FMEA process~\cite{appswfmea} and to provide code analysis
automation~\cite{modelsfmea}. Although the SFMEA and hardware FMEAs are performed separately,
some schools of thought aim for Fault Tree Analysis (FTA)~\cite{nasafta,nucfta} (top-down, deductive)
and FMEA (bottom-up, inductive)
to be performed on the same system to provide insight into the
software/hardware interface~\cite{embedsfmea}.
%
Although this
would give a better picture of the failure mode behaviour, it
is by no means a rigorous approach to tracing errors that may occur in hardware
through to the top (and therefore ultimately controlling) layer of software.
\paragraph{Current FMEA techniques are not suitable for software}
The main FMEA methodologies are all based on the concept of taking
base component {\fms}, and translating them into system level events/failures~\cite{sfmea,sfmeaa}.
%
In a complicated system, mapping a component failure mode to a system level failure
will mean a long reasoning distance; that is to say the actions of the
failed component will have to be traced through
several sub-systems, gauging its effects with and on other components.
%
With software at the higher levels of these sub-systems,
we have yet another layer of complication.
%
%In order to integrate software, %in a meaningful way
%we need to re-think the
%FMEA concept of simply mapping a base component failure to a system level event.
%
SFMEA regards the variables used by the programs as the equivalent of hardware components~\cite{procsfmea}.
The failure modes of these variables are that they could be erroneously over-written,
calculated incorrectly (due to a mistake by the programmer, or a fault in the micro-processor on which the program is running), or
corrupted by external influences such as
ionising radiation causing bits to be erroneously altered.
\paragraph{FMEA and Modularity}
From the 1940s onwards, software has evolved from simple procedural languages (i.e. assembly language/Fortran~\cite{f77} call return)
to structured programming (C~\cite{DBLP:books/ph/KernighanR88}, Pascal etc.) and then to object oriented models (Java, C++ etc.).
FMEA has undergone no such evolution.
%
In a world where sensor systems, often including embedded software components, are brought in to
create complex systems, FMEA still follows a rigid {\bc} {\fm} to system level error model
that is only suitable for simple electro-mechanical systems.
%
%
% MAYBE MOVE THIS TO CH3, FMEA CRITICISM
% 30JAN2013
%
\subsection{Where FMEA is now.}
FMEA is a useful tool for basic safety: it provides statistics on safety where field data is impractical to obtain, and it is
very good with single failure modes linked to top level events.
FMEA has become part of the safety critical and safety certification industries.
%
SFMEA is in its infancy, and there are corresponding gaps in
certification for software. EN61508~\cite{en61508} recommends hardware redundancy architectures in conjunction
with FMEDA for hardware; for software it recommends language constraints and quality procedures
but no inductive fault finding technique.
%
FMEA has adapted from a cost saving exercise for mass produced items~\cite{bfmea,generic_automotive_fmea_6034891}, to incorporating statistical techniques
(FMECA), to allowing for self diagnostic mitigation (FMEDA).
%
However, it is still based on the concept of single component failures mapped to top~level/system~failures.
All these FMEA based methodologies have the following shortcomings:
\begin{itemize}
\item It is impossible to integrate software and hardware models,
\item The state explosion problem is exacerbated by increasing complexity due to the density of modern electronics,
\item It is impossible to consider all multiple component failure modes~\cite{FMEAmultiple653556}.
\end{itemize}