
% $Id: fmmdset.tex,v 1.7 2009/06/06 11:52:09 robin Exp $

%

\ifthenelse {\boolean{paper}}
{
\begin{abstract}
This paper describes a process for analysing safety critical systems in order to formally prove how safe their
designs and built-in safety measures are. It provides
a rigorous method for creating a fault effects model of a system from the bottom up, using part level fault modes.
From the model, fault trees,
modular re-usable sections of safety critical systems,
and accurate statistical estimates of fault frequency can be derived automatically.
It provides the means to trace the causes of dangerous detected and dangerous undetected faults.
It is intended to be used to formally prove that systems meet EN and UL standards, including but not limited to
EN298, EN61508, EN12067, EN230 and UL1998.
\end{abstract}
}
{}


\section{Introduction}

%This paper describes the  Failure Mode Modular de-Composition (FMMD)  method.
% described here, models a safety critical system from the bottom up.

The purpose of the FMMD methodology is to apply formal techniques to
the assessment of safety critical designs, aiding in identifying detected and undetected faults.\footnote{Undetected faults
are faults which may occur but are not self~detected, or are impossible for the system to detect.}
Formal methods are just beginning to be specified in some safety standards.\footnote{Formal methods
such as the Z notation appear as `highly recommended' techniques in the EN61508 standard, but
currently apply only to software.} However, some standards now imply the handling of
simultaneous faults, which complicates the scenario based approvals that are
currently used.\footnote{Standard EN298:2003 strongly implies that double simultaneous failures must be handled.}

% Some safety critical system assemesment criteria
%are statistical, and require a target failure rate per hour of operation be met \cite{EN61508}.
%Specific safety standards may apply criteria such as no single part failure in a system may lead to
%a dangerous fault.

There are two main philosophies in assessing safety critical systems.
One is to specify an acceptable level of dangerous faults per hour of operation.\footnote{The probability of failure per hour (PFH)
is usually a very small number; part failure rates are therefore often quoted in failures per $10^{9}$ hours of operation.}
This statistical approach is the one taken by the European safety reliability
standard EN61508, commonly referred to as the Safety Integrity Level (SIL)
standard.
The second is to specify
that any single or double part faults cannot lead to a dangerous fault in the system under consideration.
This entails tracing the effects of all part failure modes
 and working out if they can lead to any dangerous faults in the system under consideration.
%For instance, during WWII after operational research teams had analysed  data it was determined that
% an aircraft engine that can, through one part failure cause a catastrophic failure is an unacceptable design.\cite{boffin} .

Both of these methods require a complete fault analysis tree.
The statistical method
requires additional Mean Time To Failure (MTTF) data for all part failure modes.
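
As a minimal sketch of how this additional data might be used, and assuming a constant failure rate
for each part failure mode (an assumption not stated above), the failure rate $\lambda_i$ of
failure mode $i$ is related to its MTTF by

$$ \lambda_i = \frac{1}{MTTF_i} $$

If a dangerous system fault can be caused independently by any one of the part failure modes in a set $D$
(a hypothetical set, introduced here only for illustration), its rate is approximately the sum

$$ \lambda_D \approx \sum_{i \in D} \lambda_i $$

For instance, two such failure modes with MTTFs of $10^6$ and $10^7$ hours would give
$\lambda_D \approx 10^{-6} + 10^{-7} = 1.1 \times 10^{-6}$ failures per hour.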

The FMMD methodology applies defined stages and processes that will
create a modular fault mode hierarchy. From this hierarchy,
complete fault analysis trees can be determined. It uses a modular approach, so that repeated sections
of system design can be modelled once, and re-used.
%formally prove safety critical
%hardware designs.
The  FMMD method creates a hierarchy from
part~fault~mode level up to system level.
%It does this using
%well defined stages, and processes.
%It allows re-use of analysed modules DOH DOH DOH
%, and to create a framework where
%fault causation trees, and statistical likelihood
%of faults occurring are
When a design has been analysed using this method, the statistical likelihoods of failure
and of dangerous~faults can be determined by traversing the fault~tree down to the MTTFs of individual parts.


%Starting with individual part failure modes, to collections of %parts (modules)
%and then to module level fault modes.

\subsection{Basic Concepts Of FMMD}


\paragraph{ Creating a fault hierarchy}

The main idea of the methodology is to build a hierarchy of fault modes from the part
level up to the highest system level.

The first stage is to choose
parts that interact and naturally form {\em functional groups}. {Functional groups} are thus collections of base parts.
%These parts all have associated fault modes. A module is a set fault~modes.

From the point of view of fault analysis, we are not interested in the parts themselves, but in the ways in which they can fail.

For this study a functional group will mean a collection of components.
In order to determine the symptoms or failure modes of a {\em functional group}
we need to consider all failure modes of its parts.
By analysing the fault behaviour of a `functional group' with respect to these failure modes
we can derive a new set of possible failure modes.
%
This new set of faults is the set of derived faults from the module level and is thus at a higher level of
fault~mode abstraction. Thus we can say that the module as a whole entity can fail in a number of well defined ways.

In other words, we have taken a functional group and analysed how it can fail according to the failure modes of its parts.
The ways in which the module can fail now become a new set of fault modes, the fault~modes
being derived from the functional~group. We can now create a new `derived~component' which has
the failure symptoms of the functional~group as its set of failure modes.
This new derived~component is at a higher failure mode abstraction
level than the base components.
%What this means is the `fault~symptoms' of the module have been derived.
%
%When we have determined the fault~modes at the module level these can become a set of derived faults.
%By taking sets of derived faults (module level faults) we can combine these to form modules
%at a higher level of fault abstraction. An entire hierarchy of fault modes can now be built in this way,
%to represent the fault behaviour of the entire system. This can be seen as using the modules we have analysed
%as parts, parts which may now be combined to create new functional groups,
%but as parts at a higher level of fault abstraction.
Applying the same process with derived components we can bring derived components
together to form functional groups and create new derived components
at a higher abstraction level.

\subsubsection { Definitions }

\begin{itemize}
\item base component - a component with a known set of unitary state failure modes
\item functional group -  a collection of components chosen to perform a particular task
\item derived failure mode - a failure symptom of a functional group
\item derived component - a functional group after analysis
\end{itemize}

\subsubsection{An algebraic notation for identifying FMMD entities}
Each component $C$ has an associated set of failure modes.
Let the set of all possible components be $\mathcal{C}$
and let the set of all possible failure modes be $\mathcal{F}$.

We can define a function $FM$ that returns the
set of failure modes $F$ for a component $C$:

\begin{equation}
FM : \mathcal{C} \rightarrow \mathcal{P}\mathcal{F}
\end{equation}

where $\mathcal{P}\mathcal{F}$ is the power set of $\mathcal{F}$. It is defined by

$$  FM ( C ) = F $$

where $C$ is a component and $F$ is its set of failure modes.
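
By way of illustration only, suppose a resistor $R$ is taken to have just the two failure modes
$\mathit{OPEN}$ and $\mathit{SHORT}$ (an assumed, simplified model, not data taken from a parts standard).
Then $R \in \mathcal{C}$ and

$$ FM(R) = \{ \mathit{OPEN}, \mathit{SHORT} \} $$

with $FM(R) \subseteq \mathcal{F}$, that is, $FM(R) \in \mathcal{P}\mathcal{F}$.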


%$$ \mathcal{FM}(C) \rightarrow S $$
%$$ {FM}(C) \rightarrow S $$

We can indicate the abstraction level of a component by using a superscript.
Thus where the component $C$ is a `base component' we can assign it
the abstraction level zero, written $C^0$. Should we wish to index the components
(for example as in a product parts~list) we can use a sub-script.
Our base component (if first in the parts~list) could now be uniquely identified as
$C^0_1$.

A functional group can use the variable name $FG$. A functional group is a collection
of components. We thus define $FG$ as a set of components that have been chosen as members
of a functional~group.
We can further define the abstraction level of a functional group.
We can say that it is the maximum abstraction level of any of its
components. Thus a functional group containing only base components
would have an abstraction level zero and could be represented with a superscript of zero thus
$FG^0$. The functional group set may also be indexed.
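
This rule can be written compactly. Writing $\alpha(C)$ for the abstraction level of a component
(the symbol $\alpha$ is introduced here only for illustration and is not part of the notation defined so far),
a sketch of the definition is

$$ \alpha(FG) = \max_{C \in FG} \alpha(C) $$

For example, a functional group $\{ C^0_3, C^1_1 \}$ combining a base component with a derived component
would have $\alpha(FG) = \max(0,1) = 1$ and would be written $FG^1$.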

We can apply symptom abstraction to a functional group to find
a set of derived failure modes. We are interested in the failure modes
of all the components in the functional group. An analysis process
defined by the symbol  `$\bowtie$' is applied to the functional~group.

$$ \bowtie(FG^N) \rightarrow C^{N+1} $$

The $\bowtie$ function processes each member (component) of the set $FG$ and
extracts all the component failure modes, which are used by the analyst to
determine the derived failure modes. A new derived component is created
where its failure modes are the symptoms from $FG$.
Note that the derived component will have a higher abstraction level than the functional
group from which it was derived.

\subsubsection{FMMD Hierarchy}

By applying stages of analysis to higher and higher abstraction
levels we can converge to a complete failure mode model of the system under analysis.

An example of a simple system will illustrate this.

\subsection {Example FMEA process using an FMEA diagram}

Consider a simple functional~group  $ FG^0_1 $ formed from two base components $C^0_1,C^0_2$.

We can apply $\bowtie$ to the functional~group $FG$
and it will return a derived component at abstraction level 1 (with an index of 1 for completeness)

$$ \bowtie( FG^0_1 ) = C^1_1 $$

To look at this analysis process in more detail, we first apply ${FM}$ to each
component to obtain its failure modes $f_N$. By way of example:


 $$ {FM}(C^0_1) = \{ f_1, f_2 \}  $$
 $$ {FM}(C^0_2) = \{ f_3, f_4, f_5 \}  $$


The analyst now considers failure modes $f_1, \ldots, f_5$ in the context of the functional group.
The result of this process will be a set of derived failure modes.
Let these be $  \{ f_6, f_7, f_8 \} $.
We can now create a derived component $C^1_1$ with this set of failure modes.

Thus:

$$ {FM}(C^1_1) =  \{ f_6, f_7, f_8 \} $$
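
One possible outcome of the analysis, given purely as a hypothetical illustration (the groupings
below are assumptions, not results from a real circuit), is that each component failure mode is
assigned to the symptom it causes:

$$ \{ f_1 \} \rightarrow f_6 , \quad \{ f_2, f_3 \} \rightarrow f_7 , \quad \{ f_4, f_5 \} \rightarrow f_8 $$

In this sketch every one of the five component failure modes is accounted for by one of the three
derived failure modes, so the derived component $C^1_1$ has fewer failure modes than the
functional group had to consider.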


We can represent this analysis process in a diagram (see figure \ref{fig:onestage}).
\begin{figure}[h]
 \centering
 \includegraphics[width=200pt,bb=0 0 268 270]{fmmdset/onestage.jpg}
 % onestage.jpg: 268x270 pixel, 72dpi, 9.45x9.52 cm, bb=0 0 268 270
 \caption{FMMD analysis of functional group}
 \label{fig:onestage}
\end{figure}


% \begin{figure}
% \centering
% \input{fmmdset/fmmdh.tex}
% \caption{FMMD example Hierarchy}
% \label{fig:sdfmea}
% \end{figure}


\begin{figure}[h]
 \centering
 \includegraphics[width=400pt,bb=0 0 555 520,keepaspectratio=true]{fmmdset/fmmdh.jpg}
 % fmmdh.png: 555x520 pixel, 72dpi, 19.58x18.34 cm, bb=0 0 555 520
 \caption{FMMD Example Hierarchy}
 \label{fig:fmmdh}
\end{figure}


\section {Building the Hierarchy - Higher levels \\ of Fault Mode Analysis}

Figure \ref{fig:fmmdh} shows a hierarchy of failure mode de-composition.

It can be seen that the derived fault~mode sets are higher level abstractions of the fault behaviour of the modules.
We can take this one stage further by combining the derived components $C^{1}_{N}$ to form functional~groups. These
$FG^1_{N}$ functional~groups can be used to create $C^2_{N}$ derived components, and so on.
At the top of the hierarchy there will be one final component $C^{t}_{N}$ (where $t$ is the
top level), and {\em its fault modes are the failure modes of the SYSTEM}. The causes of these
system level fault~modes will be traceable down to part fault modes by traversing the tree
through the lower level functional groups and components.
Each SYSTEM level fault may have a number of paths through the
tree to different base component failure modes.
In FTA terminology, these paths through the tree correspond to `minimal cut sets'.


A hierarchy of levels of faults becoming more abstract at each level should
converge to a small sub-set of system level errors.

This thinning out of the number of system level errors is borne out in practice;
real time control systems often have a small number of major reportable faults (typically $ < 50$),
even though they may have accompanying diagnostic data.


\cite{sem}


%\begin{figure}
%\subfigure[Euler Diagram]{\epsfig{file=fmmd_hierarchy_cimg5040.eps,width=4.2cm}\label{fig:exa}}
%\subfigure[Intersection A B ]{\epsfig{file=exampleareasubtraction2.eps,width=4.2cm}\label{fig:exb}}
%\subfigure[area to subtract]{\epsfig{file=exampleareasubtraction3.eps,width=4.2cm}\label{fig:exc}}
%\subfigure[A second graphic]{\epsfig{file=exampleareasubtraction3.eps,width=2cm}}
%{\epsfig{file=fmmd_hierarchy_cimg5040.eps,width=12cm}
%\label{fig:ex}
%\caption{Simple Euler Diagram}
%\end{figure}



\section {Modelling considerations}

\subsection{ Proof of number of part~failure \\ modes preserved in hierarchy build}

Here we need to prove that as an abstract fault is traced higher in the tree, it can only collect more, never fewer,
actual part~failure modes. This is intuitively obvious but needs a proof.
It also means that we may need dummy modules, so that the tree structure is not violated by components jumping levels.
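
One way the property might be stated, as a sketch rather than a proof, is as follows.
Let $\mathit{causes}(f)$ denote the set of part failure modes that can give rise to a failure mode $f$
(a name introduced here for illustration), with $\mathit{causes}(f) = \{ f \}$ when $f$ is itself a part failure mode.
If a derived failure mode $g$ at level $N+1$ is the symptom of a set $S$ of failure modes at level $N$, then

$$ \mathit{causes}(g) = \bigcup_{f \in S} \mathit{causes}(f) $$

so that $\mathit{causes}(f) \subseteq \mathit{causes}(g)$ for every $f \in S$, and hence
$|\mathit{causes}(g)| \geq |\mathit{causes}(f)|$.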

%Complete coverage for all derived hierarch levels can be generalised thus:

%$$ CompleteCoverage = \forall \; h \; \forall \; x  \exists  \; y \; ( \;  x  \; \in  \; \cup  \; {\cal F} \; D^{h}
% \; \Rightarrow \; x  \; \in  \; \cup  \; M^{h}_{y} ) $$


%% CASE STUDY BEGIN

\subsection{Case Study FMMD Hierarchy:\\ Simple RS-232 voltage reader}
\begin{figure}[h]
 \centering
 \includegraphics[width=340pt,bb=0 0 532 192,keepaspectratio=true]{./mvsblock.jpg}
 % mvsblock.png: 532x192 pixel, 72dpi, 18.77x6.77 cm, bb=0 0 532 192
 \caption{Milli-Volt Sensor Block Diagram}
 \label{fig:mvsblock}
\end{figure}


%%% This is the tikz picture ??/
%
%\begin{figure}[h+]
%\centering
%\input{fmmdset/mvsblock.tex}
%\caption{Block Diagram : Example  Milli-Volt Sensor : Block Diagram}
%%\includegraphics[scale=0.20]{ptop.eps}
%\label{fig:mvsblock}
%\end{figure}
%
Consider a simple electronic system that provides, say, two milli-volt amplifiers
and passes their values onward via a serial link, RS232 (see figure \ref{fig:mvsblock}). This is simple in concept: plug in a
computer, run a terminal program, and the instrument will report the milli-volt readings in ASCII,
along with any error messages.

% in CRC checksum protected packets.

It is interesting to look at one of the `functional~groups'. The milli-volt amplifiers are a good example.
Each can be analysed by taking as a functional~group the components surrounding the op-amp:
a few resistors to determine offset and gain,
a safety resistor, and perhaps some smoothing capacitors.
This circuit is then analysed for all the fault combinations
of its parts. This produces a collection of possible symptoms/fault~modes for the milli-volt amplifier.
The two amplifiers are now connected to an ADC which converts the voltages to binary words for the microprocessor.
The micro-processor then uses the values to determine if the readings are valid and then formats text to send
via the RS232 serial line.

%
% \begin{figure}[h+]
% %\centering
% %\input{millivolt_sensor.tex}
% \includegraphics[scale=0.4]{fmmdset/millivolt_sensor.eps}
% \caption{Hierarchical Module Diagram : Milli-Volt Sensor Example}
% \label{fig:mvs}
% \end{figure}

\begin{figure}[h]
 \centering
 \includegraphics[width=400pt,bb=0 0 783 638,keepaspectratio=true]{./millivolt_sensor.jpg}
 % millivolt_sensor.jpg: 783x638 pixel, 72dpi, 27.62x22.51 cm, bb=0 0 783 638
 \caption{FMMD Hierarchy: Milli-volt sensor Example}
 \label{fig:vs}
\end{figure}


%
% \begin{figure}[h]
%  \centering
%  \includegraphics[width=400pt,bb=0 0 749 507,keepaspectratio=true]{fmmdset/millivolt_sensor.png}
%  % millivolt_sensor.png: 749x507 pixel, 72dpi, 26.42x17.89 cm, bb=0 0 749 507
%  \caption{Hierarchial Module Diagram : Millivolt Sensor Example}
%  \label{fig:mvs}
% \end{figure}

The system has a number of obvious functional~groups: the PCB power supply, the milli-volt amplifiers,
the analog to digital conversion circuitry, the micro-processor and the UART (serial link, RS232 transceiver).
It would make sense when analysing this system to take each of these functional~groups in turn and examine it closely.

It would be sensible if the system could detect the most obvious fault~modes  by self testing.
When these have been examined and diagnostic safeguard strategies have been thought up,
we might look at reporting any fault via the RS232 link.
% (if it still works !).

By doing this we have already used a modular approach.
We have analysed each section of the circuitry
and then, using the abstract errors derived from each module,
can fit these into a picture of the
fault~modes of the milli-volt monitor as a whole. However, this type of analysis is not guaranteed
to take all fault~modes into account rigorously.
It is nevertheless useful to follow an example fault through the levels of the abstraction hierarchy.

%The FMMD technique,
%goes further than this by considering all part fault~modes and
%places the analysis phases into a rigid structure.
%Each analysis phase is
%described using set theory in later sections.
%By creating a rigid hierarchy, not only can we traverse back
%down it to find possible causes for system errors, we can also determine
%combinations of fault modes that cause certain high level fault modes.
%For instance, it may be a criteria that no single part failure may cause a fatal error.
%If a fault tree can trace down to a single part fault for a potentially fatal
%fault mode, then a re-design must be undertaken.
%Some standards for automated burner controllers demand that two part failure modes cannot cause
%a dangerous/potentially fatal error. Again having a complete fault analysis tree will reveal these conditions.


\subsection{An example part Fault and its subsequent \\ abstraction to system or top level}

An example of a part fault effect on the example system is given below, showing how this fault
manifests itself at each abstraction level.

%\begin{example}
As an example let us consider a resistor failure in the first milli-volt sensor.

Let us say that this resistor, R48 say, with the particular fault mode `shorted'
causes the amplifier to output 5V.
At the part level, we have one fault mode in one part.
%This is the lowest or zero level of fault abstraction.
Let us say that this amplifier has been designed to amplify the milli-volt input
to between 1 and 4 volts, a convenient voltage for the ADC/microcontroller to read.
Any voltage outside this range will be considered erroneous.
As the resistor short causes the amplifier to output 5V we can detect the error condition.
This resistor is a part in the `millivolt amplifier 1' module.
% (see figure \ref{fig:mvs}).
The fault mode at the derived fault  level (abstraction level 1) is OUTPUT\_HIGH.
Looking higher in the hierarchy, the next abstraction level higher, level 2, will see this as
a  `CHANNEL\_1' input fault.
%The system as a whole (abstraction level 3) will see this as
%a `MILLI\_VOLT\_SENSOR' fault~mode.
%\end{example}/

\subsubsection{Abstraction Layer Summary \\ for example fault.}
\begin{description}
%\begin{list}
\item[Abstraction Level 0 :] Resistor has fault mode `R48\_SHORT' in amplifier 1.
\item[Abstraction Level 1 :] Amplifier 1 has fault mode `OUTPUT\_HIGH'.
\item[Abstraction Level 2 :] Milli-volt sensor has `CHANNEL\_1' fault.
%\item[Abstraction Level 3 :] System has `MILLI\_VOLT\_SENSOR' fault.
%\end{itemize}
%\end{list}
\end{description}

Thus we have looked at a single part fault and analysed its effect from the
bottom up on the system as a whole, going up through the abstraction layers.
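
As a compact sketch of this chain, using an arrow to mean `gives rise to' (a reading assumed here
for illustration), the example fault propagates as

$$ \mbox{R48\_SHORT} \rightarrow \mbox{OUTPUT\_HIGH} \rightarrow \mbox{CHANNEL\_1} $$

where the three failure modes belong to components at abstraction levels 0, 1 and 2 respectively.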

\subsection{Natural Fault finding}

Suppose that we were handed one of these `dual milli-volt' sensors and told that it had a ``Channel 1''
fault and asked to trouble shoot and hopefully fix it.
The natural process would be to work from the top down.
First of all we would look at perhaps a circuit schematic.
We might, not believing the operator that the equipment is actually faulty, feed a known and valid milli-volt signal into the input.
On verifying it was actually faulty,
we could then find the ADC port pins used to make the reading, and measure a voltage on them.
We would find that the voltage was indeed out of range and our attention would turn to
the circuitry between the input milli-volt signal and the ADC/Microcontroller.
On examining this we would probably measure the in circuit resistances
and discover the faulty resistor.
With the natural fault finding process, we have narrowed down until we come to
the faulty component. FMMD analysis works from the bottom~up, and this is
because it must cover all component failure modes.
%%
%% END CASE STUDY
%%


\section{Future Ideas}

\subsection{ Production Quality Control }

A fault causation tree could be used for PCB fault finding (from the fault codes that are reported
by the equipment). This could be used in conjunction with a database to provide
production oriented FMEA.\footnote{The term FMEA, applied to production, is a statistical process of
determining the probability of a fault occurring and multiplying that by the costs incurred from the fault.
This quickly becomes a priority to-do list with the most costly faults at the top.}
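
As a sketch of the ranking described in the footnote, with $p_i$ the estimated probability of fault $i$
and $c_i$ the cost it incurs (symbols assumed here for illustration), the priority of each fault is

$$ \mathit{priority}_i = p_i \times c_i $$

For example, a fault with $p = 0.02$ and $c = 50$ cost units would score $1.0$, and so would rank above
a fault with $p = 0.1$ and $c = 5$, which scores $0.5$.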


\subsection { Test Rigs }

Test rigs apply a rigorous checking process to safety critical equipment before
it can be sold. This is usually a legal or contractual requirement, backed up by inspections
and an approval process.

They are usually a clamp arrangement in which the PCB under test is placed.
Precision calibrated test signals are then applied to the board under test. For PCBs containing
a microprocessor, custom test~rig software may be run on them to exercise
active sections of the PCB (for instance to drive outputs, relays etc).

The main purpose of a test rig is to prevent faulty equipment from being shipped.
However, a test rig will often reveal an easy to fix fault on a board (such as a part not soldered down completely
or missing parts). These boards can be mended and re-submitted to the test rig.

It is often a problem, when a unit fails in a test rig, to quickly determine why it has failed.

A fault causation tree would be useful for identifying which parts may be missing, not soldered down
or simply incorrect. The test rig, armed with the fault analysis tree, could point to parts or combinations of parts that could be checked
to correct the product.

\subsection {Modules - re-usability}

In the example system of the case study, the milli-volt amplifiers
are the same circuit. The set of derived faults for the module may therefore
simply be given a different index number and re-used.
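
In the notation used earlier, and assuming the two amplifier functional groups are labelled
$FG^0_1$ and $FG^0_2$ (labels chosen here for illustration), this re-use can be sketched as

$$ FM(\bowtie(FG^0_1)) = FM(\bowtie(FG^0_2)) $$

up to the renaming of the index, so the analysis need only be performed once.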

\subsection{ Multi Channel Safety Critical Systems }

It is common in safety critical systems to use redundancy.
Two or sometimes three control systems will be assigned to the same process.
An arbitration system, the arbiter, will decide which channel may control
the equipment.
Where a system has several independent parallel control channels, each one can be a separate FMMD hierarchy.

The FMMD trees for the channels can converge
up to a top hierarchy representing the arbiter (which is the sub-system that decides which control channels are valid).
This is commonly referred to as a multi-channel safety critical system.
Where there are two channels and one arbiter, the term 1oo2 is used (one out of two).
The Ericsson AXE telephone exchange hardware is a 1oo2 system, and the arbiter (the AMD)
can detect and switch control within one processor instruction. Should a hardware error
be detected,\footnote{Or in a test plant environment, more likely someone coming along and `borrowing' a CPU board from
your working exchange.} the processor will switch to the redundant side without breaking any telephone calls
in progress or being set up. An alarm will be raised to report that this has happened, but the impact on
the 1oo2 system is a delay of one micro-processor instruction to the entire process.

The premise here is that the arbiter should be able to determine which
of the two control channels is faulty and use the data/allow control from the non-faulty one.
1oo3 systems are common in highly critical systems.

\paragraph{Fault mode model of interfaces}
An advantage of FMMD in this case is that the interface between the channels and the
safety arbiter is defined not only functionally but as a failure model as well.
Thus failures in the interfacing between the safety arbiter and
each channel are modelled.

\paragraph{Re-use of FMMD analysis}
Note that we can reuse the results from analysing one channel to model them all.
Identical channels will have the same high level failure modes.
% \small
% \bibliography{vmgbibliography,mybib}
% \normalsize


% Typeset in \ \ {\huge \LaTeX} \ \ on \ \  \today

% \begin{verbatim}
% CVS Revision Identity $Id: fmmdset.tex,v 1.7 2009/06/06 11:52:09 robin Exp $
% \end{verbatim}

%\end{document}

%\theend