


The generic and statistical European Safety Standard, EN61508:6\cite{en61508}[B.6.6],
describes Failure Mode Effects Analysis (FMEA) as:
\begin{quotation}
``To analyse a system design, by examining all possible sources of failure
of a system's components and determining the effects of these failures
on the behaviour and safety of the system.''
\end{quotation}

\section{Concepts}

\paragraph{Forward and backward searches}

A forward search starts with possible failure causes
and uses logic and reasoning to determine system level outcomes.
A backward search starts with system level events
and works back down (not necessarily to
base components in a system) using de-composition
of the system and logic.
FMEA based methodologies are forward searches~\cite{Lutz:1997:RAU:590564.590572}, whereas top-down
methodologies such as FTA~\cite{nucfta,nasafta} are backward searches.

\paragraph{Reasoning distance}
A reasoning distance is the number of stages of logic and reasoning
required to map a failure cause to its potential outcomes.
%.... general concept... simple ideas about how complex a
%failure analysis is the more modules and components are involved
% cite for forward and backward search related to safety critical software
 %{sfmeaforwardbackward}

\section{FMEA}

%\subsection{FMEA}
%\tableofcontents[currentsection]


FMEA is a broad term; it could mean anything from an informal check on
how failures could affect some equipment in an initial brain-storming session
in product design, to formal submission as part of safety critical certification.
%
This chapter describes basic concepts of FMEA, uses a simple example to
demonstrate a single FMEA analysis stage, describes the five main variants of FMEA in use today
and explores some concepts with which we can discuss and evaluate
the effectiveness of FMEA.


% \subsection{FMEA}
% This talk introduces Failure Mode Effects Analysis, and the different ways it is applied.
% These techniques are discussed, and then
% a refinement is proposed, which is essentially a modularisation of the FMEA process.
% %
%
% \begin{itemize}
%    \item Failure
%     \item Mode
%     \item Effects
%    \item Analysis
% \end{itemize}
%
%
%
% % % \begin{itemize}
% %  \item Failure
% %  \item Mode
% %  \item Effects
% %  \item Analysis
% % \end{itemize}

\clearpage
\paragraph{FMEA basic concept.}


\begin{itemize}

   \item \textbf{F - Failures of given component} Consider a component in a system
    \item \textbf{M - Failure Mode} Look at one of the ways in which it can fail (i.e. determine a component `failure~mode')
    \item \textbf{E - Effects} Determine the effects this failure mode will cause to the system we are examining
   \item \textbf{A - Analysis} Analyse how much impact this symptom will have on the environment/people/the system itself
\end{itemize}


FMEA is a procedure based on the low level components of a system, and an example
analysis will serve to demonstrate it in practice.

 \paragraph{ FMEA Example: Milli-volt reader}
Let us consider a system, in this case a milli-volt reader, consisting
of instrumentation amplifiers connected to a micro-processor
that reports its readings via RS-232.
\begin{figure}
 \centering
 \includegraphics[width=175pt]{./CH2_FMEA/mvamp.png}
 % mvamp.png: 561x403 pixel, 72dpi, 19.79x14.22 cm, bb=0 0 561 403
\end{figure}


 \subsection{FMEA Example: Milli-volt reader}
Let us perform an FMEA and consider how one of its resistors failing could affect
it.
For the sake of example let us choose resistor R1 in the  OP-AMP gain circuitry.
% \begin{figure}
%  \centering
%  \includegraphics[width=175pt]{./mvamp.png}
%  % mvamp.png: 561x403 pixel, 72dpi, 19.79x14.22 cm, bb=0 0 561 403
% \end{figure}


% \begin{figure}
%  \centering
%  \includegraphics[width=80pt]{./mvamp.png}
%  % mvamp.png: 561x403 pixel, 72dpi, 19.79x14.22 cm, bb=0 0 561 403
% \end{figure}
\begin{itemize}
   \item \textbf{F - Failures of given component} The resistor (R1) could fail by going OPEN or SHORT (EN298 definition).
    \item \textbf{M - Failure Mode} Consider the component failure mode SHORT
    \item \textbf{E - Effects} This will drive the minus input LOW causing a HIGH OUTPUT/READING
   \item \textbf{A - Analysis} The reading will be out of the  normal range, and we will have an erroneous milli-volt reading
\end{itemize}


The analysis above has given us  a result for one failure scenario i.e.
for one component failure mode.
A complete FMEA report would have to contain an entry
for each failure mode of all the components in the system under investigation.
%
Note here that we have had to look at the failure~mode
in relation to the entire circuit.
We have used intuition to determine the probable
effect of this failure mode.
For instance we have assumed that the resistor R1 going SHORT
will not affect the ADC, the Microprocessor or the UART.
%
To put this in more general terms, we have not examined this failure mode
against every other component in the system.
Perhaps we should: this would be a more rigorous and complete
approach in looking for system failures.


\section{Theoretical Concepts in FMEA}


\subsection{The unacceptability of a single component failure causing a catastrophe}

FMEA, due to its inductive bottom-up approach, is very good
at finding potential single component failures that could have catastrophic implications.
Used in the design phase of a project, FMEA is an invaluable tool
for unearthing these failure scenarios.
It is less useful for determining catastrophic events for multiple
simultaneous\footnote{Multiple simultaneous failures are taken to mean failures that occur within the same detection period.} failures.

\subsection{Impracticality of Field Data for modern systems}

Modern electronic components are generally very reliable, and the systems built from them
are thus very reliable too. Reliable field data on failures will therefore be sparse.
Should we wish to prove a continuous demand system for, say, ${10}^{-7}$ failures\footnote{${10}^{-7}$ failures per hour of operation is the
threshold for S.I.L. 3 reliability~\cite{en61508}. Failure rates are normally measured per $10^9$ hours of operation
and are known as Failure in Time (FIT) values. The maximum FIT value for a SIL 3 system is therefore 100.}
per hour of operation, even with 1000 correctly monitored units in the field
we could only expect one failure per ten thousand hours of field operation (roughly one every fourteen months).
It would be utterly impractical to get statistically significant data for equipment
at these reliability levels.
However, we can use FMEA (more specifically the FMEDA variant, see section~\ref{sec:FMEDA}),
working from known component failure rates, to obtain
statistical estimates of the equipment reliability.
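
As a rough worked illustration of the field data figures quoted in this section, assume that every unit runs continuously
(8760 hours per year, an assumption not stated in the text) and fails at the limiting rate of $10^{-7}$ per hour.
The expected number of failures observed per year across the 1000 monitored units is then approximately
\[
  1000 \times 10^{-7}\,\mathrm{h}^{-1} \times 8760\,\mathrm{h} \approx 0.88 ,
\]
that is, fewer than one observed failure per year, which is far too few for a statistically significant estimate.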


\subsection{FMEA and the  State Explosion Problem}

\paragraph{Rigorous Single Failure FMEA}

FMEA for a safety critical certification~\cite{en298,en61508} will have to be applied
to all known failure modes of all components within a system.

To perform FMEA rigorously we would need to examine every possible interaction
of a failure mode with all other components in a system; in other words,
we would need to look at all possible failure scenarios.
%to do this completely (all failure modes against all components).
The number of checks this requires is given in equation~\ref{eqn:fmea_single},
where $N$ is the total number of components in the system, and
$f$ is the number of failure modes per component.


\begin{equation}
  \label{eqn:fmea_single}
  N.(N-1).f % \\
  %(N^2 - N).f
\end{equation}


This would mean of the order of $O(N^2)$ checks
to undertake a `rigorous~FMEA'. Even small systems typically have
100 components, each typically with 3 or more failure modes.
For such a system this gives $100 \times 99 \times 3 = 29{,}700$ checks.

 \paragraph{Rigorous Double Failure FMEA}
When we look at potential double failure
scenarios\footnote{Certain double failure scenarios are already legal requirements: the European Gas burner standard (EN298:2003) demands the checking of
double failure scenarios (for burner lock-out scenarios).}
(two components failing within a given time frame), the order becomes $O(N^3)$.

\begin{equation}
  \label{eqn:fmea_double}
  N.(N-1).(N-2).f % \\
  %(N^2 - N).f
\end{equation}

For our theoretical example of 100 components with 3 failure modes each, this is
$100 \times 99 \times 98 \times 3 = 2{,}910{,}600$ failure mode scenarios.
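
The growth in the number of checks can be made explicit by expanding the single and double failure expressions.
This is only a sketch, under the simplifying assumption already made in the text that every component has the same number $f$ of failure modes:
\[
  N(N-1)f = (N^2 - N)f , \qquad N(N-1)(N-2)f = (N^3 - 3N^2 + 2N)f .
\]
The ratio of double failure to single failure checks is therefore $N-2$; for the 100 component example, moving from single to double failure analysis multiplies the workload by 98.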


\paragraph{Reliance on experts for meaningful FMEA.}
For practical reasons, FMEA cannot take a rigorous approach.
We define rigorous FMEA as examining the effect of every component failure mode
against the remaining components in the system under investigation.
%
Because we cannot perform rigorous FMEA,
we rely on experts in the system under investigation
to perform a meaningful analysis.


\section{FMEA in practice: Five variants}

\paragraph{Five main Variants of FMEA}
 \begin{itemize}
  \item \textbf{PFMEA - Production}   Car Manufacture etc
    \item \textbf{FMECA - Criticality}   Military/Space
    \item \textbf{FMEDA - Statistical safety}    EN61508/IEC61508  Safety Integrity Levels
   \item \textbf{DFMEA - Design or static/theoretical}    EN298/EN230/UL1998
   \item \textbf{SFMEA - Software FMEA}    Only used in highly critical systems at present
\end{itemize}


\section{PFMEA - Production FMEA : 1940's to present}


Production FMEA (or PFMEA) is FMEA used to prioritise, in terms of
cost, problems to be addressed in product production.

It focuses on known problems, and determines the
frequency with which they occur and their cost to fix.
These two values are multiplied together to give a Risk Priority Number (RPN).
Fixing the problems with the highest RPN
will return the most cost benefit.

% benign example of PFMEA in CARS - make something up.
\subsection{PFMEA Example}
\begin{table}[ht]
\caption{FMEA Calculations} % title of Table
%\centering % used for centering table
\begin{tabular}{|| l | l | c | c | l ||} \hline
 \textbf{Failure Mode} &   \textbf{P}             & \textbf{Cost}        &  \textbf{Symptom} & \textbf{RPN} \\ \hline \hline
      relay 1 n/c      & $1*10^{-5}$              &  38.0                & indicators fail   & 0.00038 \\ \hline
        relay 2 n/c      & $1*10^{-5}$              &  98.0                & doorlocks fail   & 0.00098 \\ \hline
%       rear end crash    &  $14.4*10^{-6}$         & 267,700              & fatal fire       &  3.855 \\
%       ruptured f.tank   &                         &                      &                  &        \\ \hline
\hline
\end{tabular}
\end{table}
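
As a check on the two rows of the table above, the RPN column can be reproduced by multiplying the probability $P$ by the cost
(the unit of cost is not stated in the table, so the values are treated here as plain numbers):
\[
  \mathrm{RPN}_{\mathrm{relay\,1}} = 1 \times 10^{-5} \times 38.0 = 0.00038 , \qquad
  \mathrm{RPN}_{\mathrm{relay\,2}} = 1 \times 10^{-5} \times 98.0 = 0.00098 .
\]
On these figures the relay 2 failure mode has the higher RPN and would be addressed first.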


\section{FMECA - Failure Modes Effects and Criticality Analysis}

\subsection{ FMECA - Failure Modes Effects and Criticality Analysis}
% \begin{figure}
%  \centering
%  %\includegraphics[width=100pt]{./military-aircraft-desktop-computer-wallpaper-missile-launch.jpg}
%  \includegraphics[width=300pt]{./CH2_FMEA/A10_thunderbolt.jpg}
%  % military-aircraft-desktop-computer-wallpaper-missile-launch.jpg: 1024x768 pixel, 300dpi, 8.67x6.50 cm, bb=0 0 246 184
%  \caption{A10 Thunderbolt}
%  \label{fig:f16missile}
% \end{figure}
FMECA places its emphasis on determining the criticality of failures.
It applies some Bayesian statistics (the probabilities of component failures, and of those failures thereby causing given system level failures).

It is very similar to PFMEA, but instead of cost, a criticality or
seriousness factor is ascribed to putative top level incidents.
FMECA has three probability factors for component failures.

\textbf{FMECA ${\lambda}_{p}$ value.}
This is the overall failure rate of a base component.
This will typically be the failure rate per million ($10^6$) or
billion ($10^9$) hours of operation (reference: MIL1991).

\textbf{FMECA $\alpha$ value.}
The failure mode probability, usually denoted by $\alpha$, is the probability of
a particular failure~mode occurring within a component (reference: FMD-91).
%, should it fail.
%A component with N failure modes will thus have
%have an $\alpha$ value associated with each of those modes.
%As the $\alpha$ modes are probabilities, the sum of all $\alpha$ modes for a component must equal one.


\textbf{FMECA $\beta$ value.}
The second probability factor, $\beta$, is the probability that the failure mode
will cause a given system failure.
This corresponds to a `Bayesian' (conditional) probability: given a particular
component failure mode, it is the probability of a given system level failure.

\textbf{FMECA `t' Value}
The time that a system will be operating for, or the working life time of the product is
represented by the variable $t$.
%for probability of failure on demand studies,
%this can be the number of  operating cycles or demands expected.

\textbf{Severity `s' value}
A weighting factor to indicate the seriousness of the putative system level error.
%Typical classifications are as follows:~\cite{fmd91}

\begin{equation}
 C_m  =  {\beta} .  {\alpha} . {{\lambda}_p} . {t} . {s}
\end{equation}

The highest $C_m$ values would be at the top of a `to~do' list
for a project manager.
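
As an illustration of how the criticality value combines the factors defined above, consider a single hypothetical failure mode.
All of the values below are assumptions chosen for easy arithmetic, and are not taken from any published failure rate data:
$\lambda_p = 100 \times 10^{-9}$ failures per hour, $\alpha = 0.5$, $\beta = 0.2$, $t = 10^{5}$ hours and $s = 10$. Then
\[
  C_m = 0.2 \times 0.5 \times (100 \times 10^{-9}) \times 10^{5} \times 10 = 0.01 .
\]
A second failure mode identical in every respect except for $\beta = 0.4$ would give $C_m = 0.02$, and would therefore rank above the first.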


\section{FMEDA - Failure Modes Effects and Diagnostic Analysis}


\subsection{ FMEDA - Failure Modes Effects and Diagnostic Analysis}
% \begin{figure}
%  \centering
%  \includegraphics[width=200pt]{./SIL.png}
%  % SIL.jpg: 350x286 pixel, 72dpi, 12.35x10.09 cm, bb=0 0 350 286
%  \caption{SIL requirements}
% \end{figure}



\begin{itemize}
    \item \textbf{Statistical Safety}   Safety Integrity Level (SIL) standards (EN61508/IEC61508).
    \item \textbf{Diagnostics}          Diagnostic or self checking elements modelled
    \item \textbf{Complete Failure Mode Coverage}    All failure modes of all components must be in the model
   \item \textbf{Guidelines}    To system architectures and development processes
\end{itemize}

FMEDA is the methodology behind statistical (safety integrity level)
type standards (EN61508/IEC61508).
It provides a statistical overall level of safety
and allows diagnostic mitigation for self checking etc.
It provides guidelines for the design and architecture
of computer/software systems for the four levels of
safety integrity.
%For Hardware
%
FMEDA does force the user to consider all hardware components in a system
by requiring that an MTTF value is assigned for each failure~mode;
the MTTF may be statistically mitigated (improved)
if it can be shown that self-checking will detect failure modes.
For software it provides procedural quality guidelines and constraints (such as forbidding certain
programming languages and/or features).


\subsection{ FMEDA - Failure Modes Effects and Diagnostic Analysis}
\label{sec:FMEDA}
\textbf{Failure Mode Classifications in FMEDA.}
 \begin{itemize}
  \item \textbf{Safe or Dangerous}   Failure modes are classified SAFE or DANGEROUS
    \item \textbf{Detectable failure modes}   Failure modes are given the attribute DETECTABLE or UNDETECTABLE
    \item \textbf{Four attributes to Failure Modes}    All failure modes may thus be Safe Detected (SD), Safe Undetected (SU), Dangerous Detected (DD) or Dangerous Undetected (DU)
   \item \textbf{Four statistical properties of a system}     \\
$ \sum \lambda_{SD}$, $\sum \lambda_{SU}$, $\sum \lambda_{DD}$, $\sum \lambda_{DU}$
\end{itemize}

% Failure modes are classified as Safe or Dangerous according
% to the putative system level failure they will cause.
% The Failure modes are also classified as Detected or
% Undetected.
% This gives us four level failure mode classifications:
% Safe-Detected (SD), Safe-Undetected (SU), Dangerous-Detected (DD) or Dangerous-Undetected (DU),
% and the probabilistic failure rate of each classification
% is represented by lambda variables
% (i.e. $\lambda_{SD}$, $\lambda_{SU}$, $\lambda_{DD}$, $\lambda_{DU}$).



\textbf{Diagnostic Coverage.}
The diagnostic coverage is simply the ratio
of the dangerous detected failure rates
against the failure rate of all dangerous failures,
and is normally expressed as a percentage. $\Sigma\lambda_{DD}$ represents
the summed failure rates of the dangerous detected base component failure modes, and
$\Sigma\lambda_D$ the summed failure rates of all dangerous base component failure modes.

$$ \mathit{DiagnosticCoverage} = \frac{\Sigma\lambda_{DD}}{\Sigma\lambda_D} $$
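
To make the ratio concrete, consider a hypothetical system (the figures are assumptions for illustration only) in which
the dangerous detected failure modes sum to $\Sigma\lambda_{DD} = 90$ FIT and the dangerous undetected failure modes sum to $\Sigma\lambda_{DU} = 10$ FIT,
so that $\Sigma\lambda_D = \Sigma\lambda_{DD} + \Sigma\lambda_{DU} = 100$ FIT. The diagnostic coverage is then
\[
  \frac{\Sigma\lambda_{DD}}{\Sigma\lambda_D} = \frac{90}{100} = 90\% .
\]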


The \textbf{diagnostic coverage} for safe failures, where $\Sigma\lambda_{SD}$ represents the summed failure rates of the
safe detected base component failure modes,
and $\Sigma\lambda_S$ the summed failure rates of all safe base component failure modes,
is given as

$$ SF = \frac{\Sigma\lambda_{SD}}{\Sigma\lambda_S} $$


\textbf{Safe Failure Fraction.}
A key concept in  FMEDA is Safe Failure Fraction (SFF).
This is the ratio of safe  and dangerous detected failures
against all safe and dangerous failure probabilities.
Again this is usually expressed as a percentage.

$$ SFF = \big( \Sigma\lambda_S + \Sigma\lambda_{DD} \big) / \big( \Sigma\lambda_S + \Sigma\lambda_D \big) $$

SFF determines how proportionately fail-safe a system is, not how reliable it is!
A weakness in this philosophy is that adding extra safe failures (even unused ones) improves the SFF.
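
This weakness can be seen with a small hypothetical example; all of the failure rates below are assumed values chosen for illustration.
Suppose a system has $\Sigma\lambda_S = 100$ FIT, $\Sigma\lambda_{DD} = 90$ FIT and $\Sigma\lambda_{DU} = 10$ FIT, so that $\Sigma\lambda_D = 100$ FIT:
\[
  SFF = \frac{100 + 90}{100 + 100} = 95\% .
\]
If circuitry contributing a further 200 FIT of safe failures is added, and nothing else changes, then
\[
  SFF = \frac{300 + 90}{300 + 100} = 97.5\% ,
\]
even though the dangerous undetected failure rate is still 10 FIT and the system as a whole now fails more often.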


To achieve SIL levels, diagnostic coverage and SFF levels are prescribed along with
hardware architectures and software techniques.
The overall aim of SIL is to classify the safety of a system,
by statistically determining how frequently it can fail dangerously.



\begin{table}[ht]
\caption{SIL failure probability requirements} % title of Table
%\centering % used for centering table
\begin{tabular}{|| c | c | c ||} \hline
 \textbf{SIL} &   \textbf{Low Demand}     & \textbf{Continuous Demand}          \\
              & Prob of failing on demand & Prob of failure per hour  \\ \hline \hline
      4       & $ 10^{-5}$ to $< 10^{-4}$  &   $ 10^{-9}$ to $< 10^{-8}$                \\ \hline
      3       &  $ 10^{-4}$ to $< 10^{-3}$ &    $ 10^{-8}$ to $< 10^{-7}$             \\ \hline
      2       &  $ 10^{-3}$ to $< 10^{-2}$ &    $ 10^{-7}$ to $< 10^{-6}$             \\ \hline
      1       &  $ 10^{-2}$ to $< 10^{-1}$ &    $ 10^{-6}$ to $< 10^{-5}$                        \\ \hline
\end{tabular}
\end{table}

Table adapted from EN61508-1:2001 [7.6.2.9 p33]


FMEDA is a modern extension of FMEA, in that it will allow for
self checking features, and provides detailed recommendations for computer/software architecture.
It   has a simple final result, a Safety Integrity Level (SIL) from 1 to 4 (where 4 is safest).

%FMEA can be used as a term simple to mean Failure Mode Effects Analysis, and is
%part of product approval for many regulated products in the EU and the USA...


\section{FMEA used for Safety Critical Approvals}


\subsection{DESIGN FMEA: Safety Critical Approvals FMEA}
\begin{figure}[h]
 \centering
 \includegraphics[width=300pt,keepaspectratio=true]{./CH2_FMEA/tech_meeting.png}
 % tech_meeting.png: 350x299 pixel, 300dpi, 2.97x2.53 cm, bb=0 0 84 72
 \caption{FMEA  Meeting}
 \label{fig:tech_meeting}
\end{figure}
This form of analysis is variously called static FMEA, design FMEA or approvals FMEA.

Experts from the approval house and the equipment manufacturer
discuss selected component failure modes
judged to be in critical sections of the product.



% \begin{figure}[h]
%  \centering
%  \includegraphics[width=70pt,keepaspectratio=true]{./tech_meeting.png}
%  % tech_meeting.png: 350x299 pixel, 300dpi, 2.97x2.53 cm, bb=0 0 84 72
%  \caption{FMEA  Meeting}
%  \label{fig:tech_meeting}
% \end{figure}

\begin{itemize}
   \item It is impossible to look at all component failures, let alone apply FMEA rigorously.
   \item In practice, failure scenarios for critical sections are contested, and either justified or extra safety measures implemented.
    \item Often only meeting notes or minutes are kept; it is unusual for detailed arguments to be documented.
\end{itemize}


\section{Literature Review}

%% FOCUS
The focus of this literature review is to establish the practice and applications
of FMEA, and to examine its strengths and weaknesses.
%% GOAL
Its
goal is to identify central issues and to criticise and assess  the current
FMEA methodologies.
%% PERSPECTIVE
The perspective of the author is that of a practitioner of static failure mode analysis techniques
concerning approval of products
to European safety standards, both the prescriptive~\cite{en298,en230} and statistical~\cite{en61508}.
A second perspective is that of a software engineer trained to use formal methods.
Examining FMEA methodologies for mathematical properties, influenced by
formal methods applied to software, should provide an angle not traditionally considered.
%% COVERAGE
The literature reviewed has been restricted to published books, European safety standards (as examples
of current safety measures applied), and traditional research from journal and conference papers.
%% ORGANISATION
The review is organised by concept, that is, FMEA can be applied to hardware, software, software~interfacing and
to multiple failure scenarios etc. Methodologies related to FMEA are briefly covered for the sake of context.
%% AUDIENCE
% Well duh! PhD supervisors and examiners....

\subsection{Related Methodologies}
FTA --- HAZOP  --- ALARP  --- Event Tree Analysis --- bow tie concept
\subsection{Hardware FMEA (HFMEA)}
\subsection{Multiple Failure scenarios and FMEA}
\subsection{Software FMEA (SFMEA)}

\paragraph{Current work on Software FMEA}

SFMEA usually does not seek to integrate
hardware and software models, but to perform
FMEA on the software in isolation~\cite{procsfmea}.
%
Work has been performed using databases
to track the relationships between variables
and system failure modes~\cite{procsfmeadb}, to %work has been performed to
introduce automation into the FMEA process~\cite{appswfmea} and to provide code analysis
automation~\cite{modelsfmea}. Although the SFMEA and hardware FMEAs are performed separately,
some schools of thought aim for Fault Tree Analysis (FTA)~\cite{nasafta,nucfta} (top-down, deductive)
and FMEA (bottom-up, inductive)
to be performed on the same system to provide insight into the
software/hardware interface~\cite{embedsfmea}.
%
Although this
would give a better picture of the failure mode behaviour, it
is by no means a rigorous approach to tracing errors that may occur in hardware
through to the top (and therefore ultimately controlling) layer of software.

\paragraph{Current FMEA techniques are not suitable for software}

The main FMEA methodologies are all based on the concept of taking
base component {\fms}, and translating them into system level events/failures~\cite{sfmea,sfmeaa}.
%
In a complicated system, mapping a component failure mode to a system level failure
will mean a long reasoning distance; that is to say the actions of the
failed component will have to be traced through
several sub-systems, gauging its effects with and on other components.
%
With software at the higher levels of these sub-systems,
we have yet another layer of complication.
%
%In order to integrate software, %in a meaningful way
%we need to re-think the
%FMEA concept of simply mapping a base component failure to a system level event.
%
SFMEA regards the variables used by the programs as the equivalent of hardware components~\cite{procsfmea}.
The failure modes of these variables are that they could become erroneously over-written,
be calculated incorrectly (due to a mistake by the programmer, or a fault in the micro-processor on which the program is running), or
be erroneously altered by external influences such as
ionising radiation.


%


\section{Conclusion}

\paragraph{Where FMEA is now}
FMEA is a useful tool for basic safety: it provides statistics on safety where field data is impractical, and it is
very good at linking single failure modes to top level events.
FMEA has become part of the safety critical and safety certification industries.
%
SFMEA is in its infancy, and there is a gap in current
certification for software: EN61508~\cite{en61508} recommends hardware redundancy architectures in conjunction
with FMEDA for hardware, whereas for software it recommends language constraints and quality procedures
but no inductive fault finding technique.

FMEA has adapted from a cost saving exercise for mass produced items, to incorporating statistical techniques
(FMECA), to allowing for self diagnostic mitigation (FMEDA).
However, it is still based on the single component failure mapped to system level failure.
All these FMEA based methodologies have the following shortcomings:
\begin{itemize}
 \item It is impossible to integrate software and hardware models,
 \item The state explosion problem is exacerbated by increasing complexity due to the density of modern electronics,
 \item It is impossible to consider all multiple component failure modes.
\end{itemize}