
%
% Make the revision and doc number macro's then they are defined in one place

\ifthenelse {\boolean{paper}}
{
\begin{abstract}
A survey of Static Failure Mode analysis Methodologies applicable to safety critical systems.
\end{abstract}
}
{
\section{Overview}
A survey of Static Failure Mode analysis Methodologies applicable to safety critical systems.
}

There are four methodologies in common use for failure mode modelling.
These are FTA, FMEA, FMECA
and FMEDA (a form of statistical assessment).
%
These methodologies date from the 1940s onwards, and were designed for
different application areas and reasons; all have drawbacks and
advantages that are discussed in the next section.
%In short
%FTA, due to its top down nature, can overlook error conditions. FMEA and the Statistical Methods
%lack precision in predicting failure modes at the SYSTEM level.

This \ifthenelse {\boolean{paper}}
{
paper
}
{
chapter
}
presents the design considerations that motivated and provided the specification for
the FMMD methodology.
%

\section{Introduction}

\subsection{Failure Modes and System Failure Symptoms}
describe briefly what a base component failure mode is and what a system level failure mode is.

\subsection{Data Mining for Component failure modes}

\subsubsection{Known component types}
MIL1992 etc
\subsubsection{Data sheets}

\paragraph{Environmental ranges}
Look for temperature ranges and other environmental effects;
these will give clues.
\paragraph{Log scales}
Give the example of a Schottky diode.
\paragraph{Adjacent pin shorting scenarios}

\paragraph{Known device problems}
For example op-amp latch-up, or microprocessors running
slowly when the supply voltage is too low.

\section {Four Current Failure Mode Analysis Methodologies}

\subsection{Working Example: Gas Valve Proving}
For each methodology outlined in the following sections,
an example will be provided using an industrial safeguard
used in the combustion industry. This example is
an arrangement of three valves~\cite{en161} in the gas supply to the burner.
The first two valves in the chain are emergency shutdown valves.
These are mains (line voltage) powered devices that
hold the valve open against a strong spring. If the
power is removed, they snap shut. The third valve
regulates the flow of fuel into the burner.
Between valves 1 and 2 there is a length of pipe
of a given volume.


At start-up this arrangement is tested.
First, valve 2 is opened and the pressure reading between the valves
is checked to be at ambient pressure.

Valve 2 is then closed and valve 1 is opened,
allowing gas to enter and pressurise
the space between it and valve 2. The pressure is monitored for a given time
to ensure that valves 1 and 2 are not leaking.


\subsection { FTA }

\glossary{name={FTA},description={Fault Tree Analysis}}

%, or modelling at
%a too high level of failure mode abstraction.
FTA was invented for use on the Minuteman nuclear defence missile
systems in the early 1960s~\cite{ftahistory} and was not designed as a rigorous
fault/failure mode methodology.
It was designed to look for disastrous top level hazards and
determine how they could be caused.
It is more like a procedure to
be applied when discussing the safety of a system, with a top-down hierarchical
notation using logic symbols that guides the analysis.
This methodology was designed for
experienced engineers sitting around a large diagram and discussing the safety aspects.
Also, the nature of a large rocket with red wire and remote detonation
fail-safes meant that the objective was to iron out common failures,
not to rigorously detect all possible failures.
Consequently it was not designed to guarantee coverage of all component failure modes,
and has no rigorous in-built safeguards to ensure coverage of all possible
system level outcomes~\cite[Section 1.2]{nasafta}.

\paragraph{FTA: Potential to miss a large proportion of base component failure modes}
FTA, like all top-down methodologies, introduces the very serious problem
of potentially missing base component failure modes~\cite[Ch.9]{faa}.
\paragraph{FTA: difficulty in modelling multiple/simultaneous failure modes}
FTA does not lend itself to modelling multiple failure modes.
OR gates are often used where the cases for combinations
of the OR'd failure modes occurring simultaneously are not defined.
It would be more correct, but less intuitive, to use XOR gates instead.


NEED to FORMALISE EACH OF THESE TECHNIQUES AND SHOW THE WEAKNESSES AT EACH STAGE.

\paragraph{Outline of FTA Methodology}
FTA works by taking an undesirable event
(or SYSTEM level failure mode or TOP level failure)
and deciding, top-down, what sub-systems it depends upon, and which
failure events of those sub-systems could cause the top level failure.
It then applies the same process to the sub-systems it identified
from the top level, identifying lower level sub-systems and events.
It is not required to decompose down to base component level.

\paragraph{One FTA Tree per System Level Failure Mode.}
This means that each system level error (or undesirable event) requires its own FTA tree.
This increases the amount of work to do and, in the case of updates to
particular sub-systems, introduces the requirement to update every FTA
tree modelling that sub-system.
From an FTA tree, sets of base level events can be traced to SYSTEM level failures.
These are termed `cut sets'. A further refinement of this is the minimal cut set,
a reduced form of the `cut set' that contains the minimum number of base level
events required to cause a particular SYSTEM failure.
\glossary{name={cut set}, description={A cut set in a fault tree is a set of base component failure modes, whose occurrence ensures that a TOP (or SYSTEM) event occurs} }
\glossary{name={minimal cut set}, description={A cut set in a fault tree that cannot be reduced (i.e. \textbf{all} the base component failure modes are required to cause the SYSTEM level event) } }

\subsubsection{ FTA weaknesses }
\begin{itemize}
\item Possibility to miss component failure modes.
\item Possibility to miss environmental effects.
\item One FTA tree, per system failure mode. Thus there is not one model from which several FTA
trees can be derived. Maintainability and consistency cannot therefore be automatically checked.
\item No possibility to model base component level double failure modes.
\end{itemize}


\subsection {FTA Example}

Fault tree Analysis
Show how it works, top down,

FROM INTERNET HISTORY OF FTA

% A simple fault tree
% Author: Zhang Long, Mail: zhangloong[at]gmail.com
%\def\pgfsysdriver{pgfsys-dvipdfm.def}
%\documentclass{minimal}
%\usepackage{tikz}
%\usetikzlibrary{shapes.gates.logic.US,trees,positioning,arrows}
%\begin{document}

\begin{figure}
\begin{tikzpicture}[
% Gates and symbols style
    and/.style={and gate US,thick,draw,fill=blue!40,rotate=90,
		anchor=east,xshift=-1mm},
    or/.style={or gate US,thick,draw,fill=blue!40,rotate=90,
		anchor=east,xshift=-1mm},
    be/.style={circle,thick,draw,fill=white!60,anchor=north,
		minimum width=0.7cm},
    tr/.style={buffer gate US,thick,draw,fill=white!60,rotate=90,
		anchor=east,minimum width=0.8cm},
% Label style
    label distance=3mm,
    every label/.style={blue},
% Event style
    event/.style={rectangle,thick,draw,fill=yellow!20,text width=2cm,
		text centered,font=\sffamily,anchor=north},
% Children and edges style
    edge from parent/.style={very thick,draw=black!70},
    edge from parent path={(\tikzparentnode.south) -- ++(0,-1.05cm)
			-| (\tikzchildnode.north)},
    level 1/.style={sibling distance=7cm,level distance=1.4cm,
			growth parent anchor=south,nodes=event},
    level 2/.style={sibling distance=7cm},
    level 3/.style={sibling distance=6cm},
    level 4/.style={sibling distance=3cm}
%%  For compatability with PGF CVS add the absolute option:
%   absolute
    ]
%% Draw events and edges
    \node (g1) [event] {No flow to receiver}
	     child{node (g2) {No flow from Component B}
     	child {node (g3) {No flow into Component B}
	     	   child {node (g4) {No flow from Component A1}
	     	      child {node (t1) {No flow from source1}}
	     	      child {node (b2) {Component A1 blocks flow}}
          		}
	     	   child {node (g5) {No flow from Component A2}
	     	      child {node (t2) {No flow from source2}}
	     	      child {node (b3) {Component A2 blocks flow}}
			}
		   }
	     	child {node (b1) {Component B blocks flow}}
		};
%% Place gates and other symbols
%% In the CVS version of PGF labels are placed differently than in PGF 2.0
%% To render them correctly replace '-20' with 'right' and add the 'absolute'
%% option to the tikzpicture environment. The absolute option makes the
%% node labels ignore the rotation of the parent node.
   \node [or]	at (g2.south)	[label=-20:G02]	{};
   \node [and]	at (g3.south)	[label=-20:G03]	{};
   \node [or]	at (g4.south)	[label=-20:G04]	{};
   \node [or]	at (g5.south)	[label=-20:G05]	{};
   \node [be]	at (b1.south)	[label=below:B01]	{};
   \node [be]	at (b2.south)	[label=below:B02]	{};
   \node [be]	at (b3.south)	[label=below:B03]	{};
   \node [tr]	at (t1.south)	[label=below:T01]	{};
   \node [tr]	at (t2.south)	[label=below:T02]	{};
%% Draw system flow diagram
%   \begin{scope}[xshift=-7.5cm,yshift=-5cm,very thick,
%		node distance=1.6cm,on grid,>=stealth',
%		block/.style={rectangle,draw,fill=cyan!20},
%		comp/.style={circle,draw,fill=orange!40}]
%   \node [block] (re)					{Receiver};
%   \node [comp]	 (cb)	[above=of re]			{B}  edge [->] (re);
%   \node [comp]	 (ca1)	[above=of cb,xshift=-0.8cm]	{A1} edge [->] (cb);
%   \node [comp]	 (ca2)	[right=of ca1]			{A2} edge [->] (cb);
%   \node [block] (s1)	[above=of ca1]		{Source1} edge [->] (ca1);
%   \node [block] (s2)	[right=of s1]		{Source2} edge [->] (ca2);
%   \end{scope}
\end{tikzpicture}
\caption{Example FTA for a Gas Supply with two  Shutoff Valves}
\end{figure}
\clearpage
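
As a hedged sketch of how cut sets can be read from a tree, the gates of the example tree in the figure can be written as a Boolean expression.
For this illustration the transfer symbols T01 and T02 are treated as basic events, and all events are assumed to be distinct;
these are assumptions made here and are not stated in the figure.
Gate G02 is an OR gate, G03 an AND gate, and G04 and G05 are OR gates, giving

$$ \mathit{TOP} = \mathit{B01} \vee \big( (\mathit{T01} \vee \mathit{B02}) \wedge (\mathit{T02} \vee \mathit{B03}) \big) $$

Expanding the AND over the two ORs gives

$$ \mathit{TOP} = \mathit{B01} \vee (\mathit{T01} \wedge \mathit{T02}) \vee (\mathit{T01} \wedge \mathit{B03}) \vee (\mathit{B02} \wedge \mathit{T02}) \vee (\mathit{B02} \wedge \mathit{B03}) $$

Each term of this expansion is a minimal cut set:
$\{\mathit{B01}\}$, $\{\mathit{T01}, \mathit{T02}\}$, $\{\mathit{T01}, \mathit{B03}\}$, $\{\mathit{B02}, \mathit{T02}\}$ and $\{\mathit{B02}, \mathit{B03}\}$.
Under these assumptions only B01 causes the top level event on its own; every other cut set requires two events to occur together.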

\subsubsection{A formal analysis of FTA - relationships and data types}

\subsection { FMEA }

\label{pfmea}
This is an early static analysis methodology, and concentrates
on SYSTEM level errors which have been investigated.
The investigation will typically point to a particular failure
of a component.

The methodology is now applied to find the significance of the failure.
It is based on a simple equation where $S$ ranks the severity (or cost~\cite{bfmea}) of the identified SYSTEM failure,
$O$ its occurrence\footnote{The occurrence $O$ is the
probability of the failure happening.},
and $D$ the failure's detectability\footnote{Detectability: often failures
may occur but not be noticed or cause an effect.
Consider an unused feature failing.}. Multiplying these
together
gives a risk priority number (RPN), given by $RPN = S \times O \times D$.
This gives, in effect,
a prioritised `to~do~list', with higher $RPN$ values being the most urgent.
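
As a hedged illustration of how the $RPN$ orders corrective work, consider two hypothetical SYSTEM failures, $A$ and $B$, each scored on 1 to 10 ranking scales.
The scales, the scores, and the convention that a higher $D$ means the failure is harder to detect are all assumptions made for this example,
and are not values taken from the cited sources.

$$ RPN_A = S \times O \times D = 8 \times 3 \times 4 = 96 $$

$$ RPN_B = S \times O \times D = 5 \times 6 \times 5 = 150 $$

Under these assumed scores, failure $B$ would be addressed first even though failure $A$ is the more severe,
because $B$ occurs more often and is harder to detect.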

\fmeagloss
FMEA can be used as a quality improvement tool for industry. In this role,
it is a living document used to log system failures.
Management and quality product maintenance
can use the $RPN$ value for a new SYSTEM failure to decide on the urgency
of corrective action.


\subsubsection{ FMEA weaknesses }
\begin{itemize}
\item Possibility to miss the effects of failure modes at SYSTEM level.
\item Possibility to miss environmental effects.
\item No possibility to model base component level double failure modes.
\item Does not model component failure modes
that may cause more than one type of SYSTEM failure.
\end{itemize}

\paragraph{Note.} FMEA is sometimes used in its literal sense, that is to say
Failure Mode Effects Analysis, simply looking at a system's internal failure
modes and determining what may happen as a result.
The FMEA described in this section (\ref{pfmea}) is sometimes called `production FMEA'.
\subsubsection{A formal analysis of FMEA - relationships and data types}

\subsection{FMECA}

Failure mode, effects, and criticality analysis (FMECA)~\cite{fmd91} extends FMEA
by associating failure probabilities with component failure modes.
Essentially this adds a failure outcome criticality factor to FMEA.
This is a bottom-up methodology, which builds on an existing FMEA
analysis that has already taken individual component failure modes
and traced them to the SYSTEM level failures.
%
Reliability data for components is used to predict the
failure statistics in the design stage.
An openly available source for the reliability of generic
electronic components was published by the DOD
in 1991 (MIL~HDK~1991~\cite{mil1991}) and is a typical
source for MTTF data.
%
FMECA has three probability factors for component failures.
\paragraph{FMECA ${\lambda}_{p}$ value.}
This is the overall failure rate of a base component.
This will typically be the failure rate per million ($10^6$) or
billion ($10^9$) hours of operation.

\paragraph{FMECA $\alpha$ value.}
The failure mode probability, usually denoted by $\alpha$,
is the probability of a particular failure
mode occurring within a component, should it fail.
A component with $N$ failure modes will thus
have an $\alpha$ value associated with each of those modes.
As the $\alpha$ values are probabilities, the sum of all $\alpha$ values for a component must equal one.

\paragraph{FMECA $\beta$ value.}
The second probability factor, $\beta$, is the probability that the failure mode
will cause a given SYSTEM failure.
This corresponds to a Bayesian conditional probability: given a particular
component failure mode, the probability of a system level failure.
%\footnote{for a given component failure mode there will be a $\beta$ value, the
%probability that the component failure mode will cause a given SYSTEM failure}.
%
This lacks precision, or in other words prediction accuracy~\cite{fafmea},
as often the component failure mode cannot be proven to cause a SYSTEM level failure, but is
assigned a probability $\beta$ factor by the design engineer. The use of a $\beta$ factor
is often justified using Bayes' theorem~\cite{probstat}.
%Also, it can miss combinations of failure modes that will cause SYSTEM level errors.
%
\paragraph{FMECA `t' value.}
The time that a system will be operating for, or the working lifetime of the product, is
represented by the variable $t$. For probability of failure on demand studies,
this can be the number of operating cycles or demands expected.

\paragraph{Severity `s' value}
Component failure modes can cause failures that have levels of severity or seriousness.
Typical classifications are as follows:~\cite{fmd91}
\begin{itemize}
 \item Category I - Catastrophic
\item Category II - Critical
\item Category III - Marginal
\item Category IV - Minor.
\end{itemize}
Thus a component, because it may fail in different ways, may cause
different severity SYSTEM level errors on failing.


\paragraph{Results of FMECA}
The results of FMECA are similar to FMEA, in that component errors are
listed according to importance, based on
probability of occurrence and criticality.
% to prevent the SYSTEM fault of given criticallity.
Again this essentially produces a prioritised `to~do~list'
sorted by severity and likelihood.
A criticality number $C_m$
%(where t is the operating time or product life time in hours),
can be calculated for a given component failure mode $cfm$ of a given severity
$s$ thus:


\begin{equation}
 C_m(s) = cfm_{\beta} \, cfm_{\alpha} \, cfm_{{\lambda}_p} \, cfm_t  \quad \mbox{where} \quad cfm\rightarrow severity = s \;.
\end{equation}

%%-WIKI- Failure mode, effects, and criticality analysis (FMECA) is an extension of failure mode and effects analysis (FMEA).
%%-WIKI- FMEA is a a bottom-up, inductive analytical method which may be performed at either the functional or
%%-WIKI- piece-part level. FMECA extends FMEA by including a criticality analysis, which is used to chart the
%%-WIKI- probability  of failure modes against the severity of their consequences. The result highlights failure modes with relatively high probability
%%-WIKI- and severity of consequences, allowing remedial effort to be directed where it will produce the greatest value.
%%-WIKI- FMECA tends to be preferred over FMEA in space and North Atlantic Treaty Organization (NATO) military applications,
%%-WIKI- while various forms of FMEA predominate in other industries.

A second result represents the overall reliability and safety of a component or item $C$~\cite[2-17]{fmd91},
and is termed the criticality number $C_r$ for the component.
We can consider $C$ to be a flat set of component failure modes, using $cfm$ as a variable to represent them.
% where $f \in F$)
The $C_r$ value for a given severity $s$ is calculated thus:
\begin{equation}
 C_r(s) = \sum_{cfm \in C}  cfm_{\beta} \, cfm_{\alpha} \, cfm_{{\lambda}_p} \, cfm_t  \quad \mbox{where} \quad cfm\rightarrow severity = s \;.
\end{equation}
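
As a hedged worked example of the two criticality numbers, assume a hypothetical component with an overall failure rate
$\lambda_p = 0.1$ failures per $10^6$ hours (that is, $10^{-7}$ per hour), an operating time $t = 10^4$ hours,
and two failure modes $cfm_1$ and $cfm_2$ that both lead to SYSTEM failures of the same severity $s$.
All of these figures are invented for illustration.

$$ cfm_1: \quad \alpha = 0.6, \; \beta = 0.5 \quad \Rightarrow \quad C_m = 0.5 \times 0.6 \times 10^{-7} \times 10^{4} = 3.0 \times 10^{-4} $$

$$ cfm_2: \quad \alpha = 0.4, \; \beta = 0.1 \quad \Rightarrow \quad C_m = 0.1 \times 0.4 \times 10^{-7} \times 10^{4} = 4.0 \times 10^{-5} $$

$$ C_r(s) = 3.0 \times 10^{-4} + 4.0 \times 10^{-5} = 3.4 \times 10^{-4} $$

The two $\alpha$ values sum to one, as required, and under these assumed figures the criticality number of the component is dominated by $cfm_1$.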


\subsubsection{ FMECA weaknesses }
\begin{itemize}
\item Possibility to miss the effects of failure modes at SYSTEM level.
\item Component failure modes are tied to one SYSTEM level error.
\item The $\beta$ factor is based on heuristics and does not reflect any rigorous calculations.
\item The $\alpha$ factor is based on heuristics or general data, and may not be specific to the environmental or operational conditions
under which the equipment is operating.
\item Possibility to miss environmental effects.
\item No possibility to model base component level double failure modes.
\item As with all failure mode methodologies based on FMEA, does not model component failure modes
that may cause more than one type of SYSTEM failure.
\end{itemize}
\subsubsection{A formal analysis of FMECA - relationships and data types}


\subsection { FMEDA or Statistical Analysis }

Failure Modes, Effects, and Diagnostic Analysis (FMEDA)
% This
is a process that takes all the components in a system,
and using the  failure modes of those components, the investigating engineer
ties them to possible SYSTEM level events/failure modes.
%
This technique
evaluates a product's statistical level of safety,
taking into account its self-diagnostic ability.
The calculations and procedures for FMEDA are
described in EN61508~\cite[Part 2 App C]{en61508}.
The following gives an outline of the procedure.


\subsubsection{Two statistical perspectives}
FMEDA is a statistical analysis methodology and is used from one of two perspectives,
Probability of Failure on Demand (PFD), and Probability of Failure
in continuous Operation, or Failure in Time (FIT).
\glossary{name={FIT}, description={Failure in Time (FIT). The number of times a particular failure is expected to occur in a $10^{9}$ hour time period.}}


\label{survey:fit}
\paragraph{Failure in Time (FIT).} Continuous operation is measured in failures per billion ($10^9$) hours of operation.
For a continuously running nuclear power station, industrial burner or aircraft engine
we would be interested in its operational FIT values.
\label{survey:pfd}
\paragraph{Probability of Failure on Demand (PFD).} For an anti-lock system in
automobile braking, or another fail-safe measure applied in an emergency, we would be interested in PFD.
That is to say, the probability that it fails
to operate correctly when demanded.

\subsubsection{The FMEDA Analysis Process}

\paragraph{Determine SYSTEM level failures from base components}
The first stage is to apply FMEA to the SYSTEM.
%
Each component is analysed in terms of how its failure
would affect the system.
Failure rates of individual components in the SYSTEM
are calculated based on component type and
environmental conditions. The SYSTEM errors are categorised as `safe' or `dangerous'.
%
%Statistical data exists for most component types \cite{mil1992}.
%
This phase is typically implemented on a spreadsheet
with rows representing each component. A typical component spreadsheet row would
comprise
component type, placement,
part number, environmental stress factors, MTTF, safe/dangerous etc.
%will be a determination of whether the component failing will lead to a `safe'
%or `unsafe' condition.

\paragraph{Overall SYSTEM failure rate.}
The product failure rate is the sum of all component
failure rates. Typically this is the sum of the failure rates
(the reciprocals of the MTTF values, for constant failure rates) for all
components in an FMEDA spreadsheet.
\frategloss
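
As a hedged numerical sketch of this summation, assume three hypothetical components with constant failure rates,
each of which causes a SYSTEM failure when it fails. The MTTF figures are invented for illustration.
Under the constant failure rate assumption $\lambda = 1/MTTF$, so that, with MTTF in hours and one FIT being one failure per $10^9$ hours,

$$ \lambda_1 = \frac{1}{10^{6}} = 1000 \; \mbox{FIT}, \quad
   \lambda_2 = \frac{1}{2 \times 10^{6}} = 500 \; \mbox{FIT}, \quad
   \lambda_3 = \frac{1}{5 \times 10^{5}} = 2000 \; \mbox{FIT} $$

$$ \lambda_{SYSTEM} = \lambda_1 + \lambda_2 + \lambda_3 = 3500 \; \mbox{FIT} $$

This corresponds to a SYSTEM MTTF of $10^{9} / 3500 \approx 2.86 \times 10^{5}$ hours.
It is the failure rates that are summed, not the MTTF values themselves.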
%This is the sum of safe and unsafe
%failures.

\paragraph{Self Diagnostics.}
We  next evaluate the SYSTEM's self-diagnostic ability.

%Each component’s failure modes and failure rate are now available.
Failure modes are now classified  as safe or dangerous.
This is done by taking a component failure mode and determining
if the SYSTEM error it is tied to is dangerous or safe.
The decision for this may be
based on heuristics or field data.
EN61508 uses the $\lambda$ symbol to represent failure rates.
\glossary{name={Lambda $\lambda$},description={Failure rate is often denoted as Lambda ($\lambda$) }}


Because we have statistics for each component failure mode,
we can now classify these in terms of safe and dangerous lambda values.
The failure rates are labelled `$\lambda_D$' (for
dangerous) and `$\lambda_S$' (for safe)~\cite{en61508}.

\paragraph{Determine Detectable and Undetectable Failures.}
Each safe and dangerous failure mode is now
classified as detectable or undetectable.
EN61508 assumes that products have a high level of
self checking features.
%
This gives us four failure mode classifications:
Safe-Detected (SD), Safe-Undetected (SU), Dangerous-Detected (DD) or Dangerous-Undetected (DU),
and the probabilistic failure rate of each classification
is represented by lambda variables
(i.e. $\lambda_{SD}$, $\lambda_{SU}$, $\lambda_{DD}$, $\lambda_{DU}$).

Because it is recognised that some failure modes may  not be discovered theoretically during the static
analysis, the
% admission of how daft it is to take a component failure mode on its own
% and guess how it will affect an ENTIRE complex SYSTEM
% Admission of failure of the process really !!!!
next step  is to investigate using an actual working SYSTEM.

Failures are deliberately caused (by physical intervention), and any new SYSTEM level
failures are added to the model.
Heuristics and MTTF-derived failure rates for the components
are used to calculate probabilities for these new failure modes
along with their safety and detectability classifications (i.e.
$\lambda_{SD}$, $\lambda_{SU}$, $\lambda_{DD}$, $\lambda_{DU}$).
%SD, SU, DD, DU.


\glossary{name={SU},description={Safe Undetected; a SYSTEM level failure mode that is considered safe, and is not detected by self checking mechanisms. See FMEDA~\cite{en61508}}}
\glossary{name={SD},description={Safe Detected; a SYSTEM level failure mode that is considered safe, and is detected by self checking mechanisms. See FMEDA~\cite{en61508}}}
\glossary{name={DD},description={Dangerous Detected; a SYSTEM level failure mode that is considered dangerous, and is detected by self checking mechanisms. See FMEDA~\cite{en61508}}}
\glossary{name={DU},description={Dangerous Undetected; a SYSTEM level failure mode that is considered dangerous, and is not detected by self checking mechanisms. See FMEDA~\cite{en61508}}}

With these classifications, and statistics for each component
we can now calculate statistics for the diagnostic coverage (how good at `self checking' the system is)
and its safe failure fraction (how many of its failures are self detected or safe compared to
all failures possible).

The calculations for these are described below.

\paragraph{Diagnostic Coverage.}
The diagnostic coverage is simply the ratio
of the dangerous detected failure rates
to the total of all dangerous failure rates,
and is normally expressed as a percentage. $\Sigma\lambda_{DD}$ represents
the sum of the failure rates of the dangerous detected base component failure modes, and
$\Sigma\lambda_D$ the sum of the failure rates of all dangerous base component failure modes.

$$ \mathit{DiagnosticCoverage} = \Sigma\lambda_{DD} / \Sigma\lambda_D $$

The diagnostic coverage for safe failures, where $\Sigma\lambda_{SD}$ represents the sum of the failure rates of the
safe detected base component failure modes,
and $\Sigma\lambda_S$ the sum of the failure rates of all safe base component failure modes,
is given as

$$ SF = \frac{\Sigma\lambda_{SD}}{\Sigma\lambda_S} $$


\paragraph{Safe Failure Fraction.}
A key concept in  FMEDA is Safe Failure Fraction (SFF).
This is the ratio of safe  and dangerous detected failures
against all safe and dangerous failure probabilities.
Again this is usually expressed as a percentage.

$$ SFF = \big( \Sigma\lambda_S + \Sigma\lambda_{DD} \big) / \big( \Sigma\lambda_S + \Sigma\lambda_D \big) $$
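
As a hedged worked example of these ratios, assume the following hypothetical failure rate totals, in FIT, for a small SYSTEM.
The figures are invented for illustration and are not taken from EN61508.

$$ \Sigma\lambda_{SD} = 100, \quad \Sigma\lambda_{SU} = 300, \quad \Sigma\lambda_{DD} = 90, \quad \Sigma\lambda_{DU} = 10 $$

so that $\Sigma\lambda_S = 400$ and $\Sigma\lambda_D = 100$. Then

$$ \mathit{DiagnosticCoverage} = 90 / 100 = 90\% $$

$$ SF = 100 / 400 = 25\% $$

$$ SFF = \big( 400 + 90 \big) / \big( 400 + 100 \big) = 490 / 500 = 98\% $$

Under these assumed figures the SFF is high even though only a quarter of the safe failures are detected,
because undetected safe failures still count towards the numerator of the SFF.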

%This is the ratio of
%Step 4 Calculate SFF, SIL and PFD
%The SIL level of the product is finally determined from the Safe Failure Fraction (SFF) and the Probability of Failure on Demand (PFD). The following formulas are used.
%SFF = (lSD + lSU + lDD) / (lSD + lSU + lDD + lDU)
%PFD = (lDU)(Proof Test Interval)/2 + (lDD)(Down Time or Repair Time)

% Often a given component failure mode there will be a $\beta$ value, the
% probability that the component failure mode will cause a given SYSTEM failure.

%\paragraph{Risk Mitigation}
%
%The component may be have its risk factor
%reduced by the checking interval (or $\tau$ time between self checking procedures).
%
%Ultimately this technique calculates a risk factor for each component.
%The risk factors of all the components are summed and
%%give a value for the `safety level' for the equipment in a given environment.


\paragraph{Classification into Safety Integrity Levels (SIL).}
There are four SIL levels, from 1 to 4, with 4 being the highest safety level.
In addition to probabilistic risk factors, the
diagnostic coverage and SFF
have threshold bands becoming stricter for each level.
Demanded software verification and specification techniques and constraints
(such as language subsets, software redundancy etc.)
become stricter for each SIL level.
%%
%% Andrew asked me to expand on this here, but it would take at least two
%% pages. I think its more appropriate for the survey.tex chapter.
%%

Thus FMEDA uses statistical methods to determine
a safety level (SIL), typically used to meet an acceptable risk
value, specified for the environment the SYSTEM must work in.
EN61508 defines, in general terms,
risk assessment and required SIL levels~\cite[5 Annex A]{en61508}.

%the probability of
%failures occurring, and provide an adaquate risk level.
%
%A component failure mode, given its MTTF
%the probability of detecting the fault and its safety relevant validation time $\tau$,
%contributes a simple risk factor that is summed
%in to give a final risk result.
%
Thus an FMEDA
model can be implemented on a spreadsheet, where each component
has a calculated risk, a fault detection time (if any), an estimated risk importance
and other factors such as de-rating and environmental stress.
With one component failure mode per row,
all the statistical factors for SIL rating can be produced\footnote{A SIL rating will apply
to an installed plant, i.e. a complete installed and working SYSTEM. SIL ratings for individual components or
sub-systems are meaningless, and the nearest equivalent would be the FIT/PFD and SFF and diagnostic coverage figures.}.


\glossary{name={FIT}, description={Failure in Time (FIT). The number of times a particular failure is expected to occur in a $10^{9}$ hour time period.}}


\subsubsection{FMEDA and failure outcome prediction accuracy.}
FMEDA suffers from the same problems of
lack of component failure mode outcome prediction accuracy, as FMEA in section \ref{pfmea}.
%
This is because the analyst has to decide how particular components failing will impact on the SYSTEM or top level.
This involves a `leap of faith'. For instance, a resistor failing in a sensor circuit
may be part of a critical monitoring function.
The analyst is now put in a position
where they probably should assign a dangerous failure classification to it.
%
There is no analysis
of how that resistor would/could affect the components close to it, but because the circuitry
is part of a critical section it will most likely
be linked to a dangerous system level failure in an FMEDA study.
%
%%- IS THIS TRUE IS THERE A BETA FACTOR IN FMEDA????
%%-
%A $\beta$ factor, the heuristically defined probability
%of the failure causing the system fault may be applied.
%
%In FMEDA there is no detailed analysis of the failure mode behaviour
%of the component in its local environment
%Component failure modes are traceable directly to the SYSTEM level.
%it becomes more
%guess work than science.
%
With FMEDA, there is no rigorous cause and effect analysis for the failure modes
and how they interact on the micro~scale (the components adjacent to them in terms of functionality).
Unintended side effects that lead to failure can be missed.
Also component failure modes that are not
dangerous, may be wrongly assigned as dangerous simply because they exist in a critical
section of the product.

% some critical component failure
%modes, but we can only guess, in most cases what the safety case outcome
%will be if it occurs.

This leads to the practice of having components within a SYSTEM partitioned into different
safety level zones, as recommended in EN61508~\cite{en61508}. This is a vague way of determining
safety, as it can miss unexpected effects due to `unexpected' component interaction.

The Statistical Analysis methodology is the core philosophy
of the Safety Integrity Levels (SIL) embodied in EN61508~\cite{en61508}
and its international analogue, IEC61508.


\subsubsection{ FMEDA weaknesses }
\begin{itemize}
\item Possibility to miss the effects of failure modes at SYSTEM level.
\item Statistical nature allows a proportion of undetected failures for a given SIL level.
\item Allows a small proportion of `undetectable' error conditions.
\item No possibility to model base component level double failure modes.
\item As with all failure mode methodologies based on FMEA, does not model component failure modes
that may cause more than one type of SYSTEM failure.
\item Because FMEDA is based on one entry per component failure mode, top level symptoms are not grouped, and will be listed in a fragmented way, and may not have the same description.
\end{itemize}
%AND then how we can solve all there problems


\section{FMEDA: Failure Modes, Effects, and Diagnostic Analysis}

This is the main basis of SIL certification for Programmed Electronic Equipment.
It applies FMEA, with classification of criticality of
components, adjustment to MTTF values by self checking mechanisms in the product,
and mitigation for a safe failure fraction. This leads to a probabilistic
mean time to failure or probability of failure on demand, that will
fall within the criteria for a given SIL safety level.
An overview of this method can be found in an EXIDA paper~\cite{fmeda}
and a detailed description of the method for SIL certification in part 2 of
EN61508~\cite{en61508}.

Disadvantage: single component failure is used to determine its effect on
the entire system. This leads to classifying components as safety or non-safety critical
at an early stage in the analysis. This means that complex interactions or side effects
of the components failing may not be taken into account.

Advantage: it introduces the concepts of self checking systems and safe failure fraction\footnote{Safe Failure Fraction (SFF) is the proportion of all failures
that are either safe or dangerous but detected. The thinking here is that if failures are detected,
even where they are not safety critical, the system is self checking a greater proportion of its own systems, and is therefore safer. This
is applying Bayes' theorem for probabilistic error detection.}.

This is a probabilistic methodology.

\subsection{Safe Failure Fraction}

Introduce the idea of coverage.
A good example is RAM in a microprocessor/microcontroller: we cannot give 100\% coverage to it.
We can perform some tests that give us 60\% coverage etc.
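
As a sketch of how coverage enters the calculation, assume the usual EN61508 split of a failure rate into safe ($\lambda_S$), dangerous detected ($\lambda_{DD}$) and dangerous undetected ($\lambda_{DU}$) parts. These symbols, and the diagnostic coverage $DC$, are introduced here for illustration only. Coverage and safe failure fraction can then be written as

$$ DC = \frac{\lambda_{DD}}{\lambda_{DD} + \lambda_{DU}} , \qquad SFF = \frac{\lambda_S + \lambda_{DD}}{\lambda_S + \lambda_{DD} + \lambda_{DU}} . $$

With purely illustrative figures, a RAM with $\lambda_S = 50$ FIT and a total dangerous failure rate of $50$ FIT, tested with 60\% coverage, has $\lambda_{DD} = 30$ FIT and $\lambda_{DU} = 20$ FIT, so that

$$ SFF = \frac{50 + 30}{50 + 30 + 20} = 0.8 = 80\% . $$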

\subsection{Diagnostic interval}

Reducing FIT by detecting a fraction of the faults within an interval. Give formulas etc.
\glossary{name={FIT}, description={Failure in Time (FIT). The number of times a particular failure is expected to occur in a $10^{9}$ hour time period.}}
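
One FIT corresponds to a failure rate of $10^{-9}$ per hour. As a hedged sketch of the reduction, let $\lambda_D$ be the dangerous failure rate of a component and $DC$ the fraction of those failures detected by a diagnostic test. Both symbols are assumptions introduced here, and the test is assumed to run often enough for detected faults to be acted upon within the diagnostic interval. The dangerous failure rate left undetected is then

$$ \lambda_{DU} = (1 - DC) \, \lambda_D . $$

For example, with illustrative figures of $\lambda_D = 100$ FIT and $DC = 0.6$,

$$ \lambda_{DU} = (1 - 0.6) \times 100 = 40 \; \mbox{FIT} = 4 \times 10^{-8} \; \mbox{per hour} . $$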


\subsubsection{A formal analysis of FMEDA - relationships and data types}


\subsection{Redundancy - Models}

1oo1 2oo3 etc

\subsection{Field Data}

OK for EN61508, not OK for the nuclear industry; find refs.

\subsection{Bayes' Theorem in Relation to Failure Modes}

\paragraph{Conditional Probability}
Bayes' theorem describes the probability of causes.

In the context of failure modes in components
we are interested in how they may affect a SYSTEM.
The SYSTEM failure modes can be seen as symptoms of the failure modes of base
components.
For example, let $B$ be a base component failure mode
and let $S$ be a system level failure mode.

The probability of $S$ occurring, given that $B$ has occurred, is
\begin{equation}
\label{eqn:condprob}
 P(S|B) = \frac{P(S \cap B)}{P(B)} .
\end{equation}

%\paragraph{Multiple Events and  conditional Probability}
%
%add copy, describe probabilities for multiple events.....


%Or in other words we can say that the probability of $B$ and $S$ occurring
%divided by the probability of $S$ occurring due to any cause, is the probability
%the $B$ caused $S$.
We can call this the {\em conditional probability} of $S$ given $B$.
Re-arranging equation \ref{eqn:condprob} gives

$$ P(B) P(S|B)   = P(S \cap B) . $$

The inverse condition, $B$ given $S$, is

$$ P(S) P(B|S)   = P(S \cap B) . $$

As both expressions are equal to $P(S \cap B)$,
we can state
\begin{equation}
\label{eqn:bayes0}
 P(S) P(B|S)   = P(S \cap B) = P(B) P(S|B).
\end{equation}

We can now re-arrange equation \ref{eqn:bayes0}~\cite{probstat} to remove the intersection term $P(S \cap B)$,
thus

\begin{equation}
\label{eqn:bayes1}
 P(S|B) =  \frac{P(S) P(B|S)} {P(B)} .
\end{equation}


This equation gives the probability of event $S$ occurring,
given that event $B$ has occurred.
In the context of failure mode analysis, the event $B$ would
be the occurrence of a component failure mode, and $S$ would be a system level error.

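We can expand $P(S)$ over the base component failure modes. Let $B_1 \ldots B_N$
be mutually exclusive failure mode events that together cover the sample space. Then


$$ S = \bigcup_{i=1}^{N} S \cap B_i . $$

Because the events $S \cap B_i$ are mutually exclusive, we can express the probabilities as

$$ P(S) = P \big( \bigcup_{i=1}^{N} S \cap B_i \big) = \sum_{i=1}^{N} P(S|B_i) P(B_i) $$
and
$$ P(S) = P \big( \bigcup_{i=1}^{N} S \cap B_i \big) = \sum_{i=1}^{N} P(B_i|S) P(S) . $$


Re-arranging equation \ref{eqn:bayes1} for the probability of a cause $B_n$ given $S$, and substituting the first of these expansions for $P(S)$, we can express Bayes' theorem thus

\begin{equation}
\label{eqn:bayes2general}
 P(B_n|S) =  \frac{P(B_n) P(S|B_n)} { \sum_{i=1}^{N} P(S|B_i) P(B_i) } .
\end{equation}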

%

%Equation \ref{eqn:bayes1} means, given the event $B$ what is the probability it was caused by $S$.
%Because we are interested in what base component failure modes could have caused $S$
%we need to re-arrange this

%\begin{equation}
%\label{eqn:bayes2}
% P(B|S) =  \frac{P(B) P(S|B)} {P(S)} .
%\end{equation}
%
%Equation \ref{eqn:bayes2} can be read as given the system failure mode $S$

Typically a system level failure will have a number of possible causes,
or base component failure
modes.
For probability we are interested in these failure modes occurring, or rather
the event of the failure modes becoming active.

We can represent the base component failure mode events as a partitioned set~\cite[fig VI-7]{nucfta}, and overlay
a given system failure mode on it.

\paragraph{Bayes Theorem}

Consider a SYSTEM error that has several potential base component causes.
Because a SYSTEM typically has a number of high level errors, let us consider
a specific one and label it $S_k$.
We can call $P(S_k)$ the prior probability of the SYSTEM error. That is to
say, the probability of $S_k$ occurring with no information about possible causes for it.
Consider a number of possible
base component `potential cause' events as $B_n$ where $n$ is an index.
Our sample space $SS$, for investigating the system failure mode/symptom
$S_k$ is thus $ SS = \{B_1 \ldots B_n\} $.
We can apply Bayes' theorem
to determine the statistical likelihood that a given failure mode $B_n$
will cause the system level error $S_k$ using equation \ref{eqn:bayes1}.


\begin{figure}[h]
 \centering
 \includegraphics[width=350pt,keepaspectratio=true]{./survey/partition.jpg}
 % partition.jpg: 510x264 pixel, 72dpi, 17.99x9.31 cm, bb=0 0 510 264
 \caption{Base Component Failure Modes represented as partitioned sets}
 \label{fig:partitionbcfm}
\end{figure}


Figure \ref{fig:partitionbcfm} represents a small theoretical system
with nine base component failure mode events.

\begin{figure}[h]
 \centering
 \includegraphics[width=350pt,keepaspectratio=true]{./survey/partition2.jpg}
 % partition.jpg: 510x264 pixel, 72dpi, 17.99x9.31 cm, bb=0 0 510 264
 \caption{Base Component Failure Modes with Overlaid System Error}
 \label{fig:partitionbcfm2}
\end{figure}

Some base component failure modes may not be able to cause given system failures.
Figure \ref{fig:partitionbcfm2} represents the case where we are looking at a particular
system level failure $S_k$. Looking at the diagram we can see that this system failure
could be, but is not necessarily, caused by base component failure modes $B_1$, $B_2$ or $B_4$.
Should any other base component failure mode (causation event) occur, then according to the diagram
it will not be able to cause the system failure $S_k$.


%IN ENGLEEEESH Inverse causality.....
%Prob $B_n$ caused $S_k$ is the prob $S_k$ caused by $B_n$ divided by prob of $B_n$

%%% \begin{equation}
%%%  P(S_k|B_n) = \frac{P(S_k) \; P(B_n | S_k) }{P(B_n)}
%%% %alternate form of no use to MEEEEEE
%%% %P(B_n|S_k) = \frac{P(B_n) \; P(S_k | B_n) }{P(S_k)}
%%% \end{equation}

For example, were we to have a component that has a failure mode $B_n$ with a failure rate of $10^{-7}$ per hour
and its associated system failure mode $S_k$ has a failure rate of $5 \times 10^{-8}$ per hour, and given that
when the system error $S_k$ occurs, there is a 10\% probability that $B_n$ had occurred (i.e. $P(B_n | S_k) = 0.1$), we can use equation \ref{eqn:bayes1} to determine
the probability that $B_n$, once it has occurred, causes $S_k$ thus


$$
P(S_k|B_n) = \frac{5 \times 10^{-8} \times 0.1  }{ 10^{-7}} = 0.05 = 5\% .
$$


As an example of a base component failure mode that cannot cause a given system failure,
consider figure \ref{fig:partitionbcfm2}, where
events $B_5 \ldots B_9$ cannot cause event $S_k$.
Taking, say, $B_9$, which does not overlap with $S_k$ (i.e. $B_9 \cap S_k = \emptyset  $),
we can see that $P(B_9 | S_k) = 0$.
Bayes' theorem applied to $B_9$ becomes

$$P(S_k|B_9) = \frac{P(S_k) \times 0  }{ P(B_9)} = 0 = 0\% . $$

As $ P(B_9 | S_k)$  is a factor in the numerator,
the application of Bayes' theorem to $B_9$ being a cause for $S_k$ gives a probability
of zero, as we would expect.


%%%%

%% BAYES

Because we are interested in finding the probability of $S_k$ for all
base component failure modes, it is helpful to re-define
$P(S_k)$.

In terms of set intersection, we can express $S_k$ as
$$ S_k = \bigcup_{i=1}^{N} S_k \cap B_i , $$
where $N$ is the number of base component failure modes in the sample space.

Because the events $S_k \cap B_i$ are mutually exclusive, we can express the probabilities as

$$ P(S_k) = P \big( \bigcup_{i=1}^{N} S_k \cap B_i \big) = \sum_{i=1}^{N} P(S_k|B_i) P(B_i) $$
and
$$ P(S_k) = P \big( \bigcup_{i=1}^{N} S_k \cap B_i \big) = \sum_{i=1}^{N} P(B_i|S_k) P(S_k) . $$


Re-arranging equation \ref{eqn:bayes1} for the probability of the cause $B_n$ given the symptom $S_k$, and substituting the first of these expansions for $P(S_k)$, we can express Bayes' theorem thus

\begin{equation}
\label{eqn:bayes2}
 P(B_n|S_k) =  \frac{P(B_n) P(S_k|B_n)} {\sum_{i=1}^{N} P(S_k|B_i) P(B_i)} .
\end{equation}
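
As a worked sketch of the summation form, assume purely illustrative figures (none are taken from a real component) for the three failure modes that overlap $S_k$ in figure \ref{fig:partitionbcfm2}: $P(B_1) = 10^{-7}$, $P(B_2) = 2 \times 10^{-7}$ and $P(B_4) = 5 \times 10^{-8}$, with $P(S_k|B_1) = 0.5$, $P(S_k|B_2) = 0.1$, $P(S_k|B_4) = 1$ and $P(S_k|B_i) = 0$ for every other $B_i$. The law of total probability gives

$$ P(S_k) = \sum_{i} P(S_k|B_i) P(B_i) = 0.5 \times 10^{-7} + 0.1 \times 2 \times 10^{-7} + 1 \times 5 \times 10^{-8} = 1.2 \times 10^{-7} , $$

and the probability that an observed $S_k$ was caused by $B_1$ is then

$$ P(B_1|S_k) = \frac{P(S_k|B_1) P(B_1)}{P(S_k)} = \frac{5 \times 10^{-8}}{1.2 \times 10^{-7}} \approx 0.42 . $$

The corresponding values for $B_2$ and $B_4$ are approximately $0.17$ and $0.42$. Before rounding the three values sum to one, as they must when $B_1$, $B_2$ and $B_4$ are the only possible causes.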


%
% here derive the trad version of bayes with the summation as the denominator
%


NOW also we need the justification of using MTTF
in probabilistic equations.

This is a lambda -exp pow integral....
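
A sketch of that justification, assuming a constant failure rate $\lambda$ (the exponential model; the symbols $\lambda$, $t$ and $F(t)$ are introduced here for illustration): the probability density of failure at time $t$ is $\lambda e^{-\lambda t}$, so the probability of having failed by time $t$ is

$$ F(t) = \int_0^t \lambda e^{-\lambda \tau} \, d\tau = 1 - e^{-\lambda t} , $$

and the mean time to failure is

$$ MTTF = \int_0^\infty t \, \lambda e^{-\lambda t} \, dt = \frac{1}{\lambda} . $$

For $\lambda t \ll 1$ we have $F(t) \approx \lambda t$, so the probability of failure in a one hour interval is approximately $\lambda = 1 / MTTF$. Under this assumption an MTTF of $10^{7}$ hours corresponds to a probability of failure of approximately $10^{-7}$ per hour.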

\section{Applying Bayes Theorem to Failure Mode Analysis}

\subsection{Belief Networks}


%http://www.csse.monash.edu.au/hons/projects/2000/Daniel.Willis/node5.html

RESTRICTIONS:

Because this uses conditional probability for multiple independent events,
complications such as operational states or environmental conditions
cannot be represented by the Bayesian model.
% consider 747 engines and a volcanic ash cloud....

\paragraph{mutually independent events and base component failure statistics}

FMEA, FTA, FMECA and to a great extent FMEDA, apply Bayesian
concepts to individual base~component failure rates, rather than
using base~component failure modes, for the events under
investigation.
This means a lack of precision in interpreting the base failure
modes as statistically independent events.
Typically, a base component may fail in more than one way,
and usually once it has failed it stays in that failure mode.
This violates the principle of the events being statistically independent.
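
To illustrate the point, consider a hypothetical resistor with two failure modes, $OPEN$ and $SHORT$ (the names and figures are illustrative assumptions only). Were the two events statistically independent we would have

$$ P(OPEN \cap SHORT) = P(OPEN) \, P(SHORT) , $$

which, for example, with $P(OPEN) = 9 \times 10^{-8}$ and $P(SHORT) = 10^{-8}$ gives $9 \times 10^{-16}$. A single resistor cannot however be open and shorted at the same time, so in fact

$$ P(OPEN \cap SHORT) = 0 . $$

The failure modes of one component are therefore mutually exclusive rather than independent.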

Show, using area proportional Euler Diagrams, the failure modes and their
possible system level failure outcomes.

Discuss unused sections of hardware in a product.

Discuss protection devices like VDRs and capacitors for smoothing.

Discuss microprocessor watchdog and CRC ROM schemes.

Discuss hardware failsafes (good example: over pressure safety valves).

Keep relating these back to Bayes' theorem.

\subsection{Deterministic FMEA}
FMEA cannot handle simultaneous failure modes.
EN298: no two individual component failures may give rise to a dangerous condition.


%typeset in {\Huge \LaTeX} \today