% $Id: fmmdset.tex,v 1.7 2009/06/06 11:52:09 robin Exp $
%
\ifthenelse {\boolean{paper}}
{
\begin{abstract}
This paper describes a process for analysing safety critical systems, to formally prove how safe the
designs and built-in safety measures are. It provides
a rigorous method for creating a fault effects model of a system from the bottom up using part level fault modes.
From the model, fault trees,
modular re-usable sections of safety critical systems,
and accurate statistical estimates of fault frequency can be derived automatically.
It provides the means to trace the causes of dangerous detected and dangerous undetected faults.
It is intended to be used to formally prove that systems meet EN and UL standards, including but not limited to
EN298, EN61508, EN12067, EN230 and UL1998.
\end{abstract}
}
{}
\section{Introduction}
%This paper describes the Failure Mode Modular de-Composition (FMMD) method.
% described here, models a safety critical system from the bottom up.
The purpose of the FMMD methodology is to apply formal techniques to
the assessment of safety critical designs, aiding in identifying detected and undetected faults%
\footnote{Undetected faults
are faults which may occur but are not self~detected, or are impossible to detect by the system.}.
Formal methods are just beginning to be specified in some safety standards\footnote{Formal methods
such as the Z notation appear as `highly recommended' techniques in the EN61508 standard, but
currently apply only to software.}. However, some standards now imply the handling of
simultaneous faults, which complicates the scenario based approvals that are
currently used\footnote{Standard EN298 strongly implies that double simultaneous failures must be handled.}.
% Some safety critical system assemesment criteria
%are statistical, and require a target failure rate per hour of operation be met \cite{EN61508}.
%Specific safety standards may apply criteria such as no single part failure in a system may lead to
%a dangerous fault.
There are two main philosophies in assessing safety critical systems.
One is to specify an acceptable level of dangerous faults per hour of operation\footnote{The probability of failure per hour (PFH).
The underlying failure rates are commonly quoted in failures per $10^{9}$ hours of operation (FIT).}.
This is a statistical approach, and is the one taken by the European safety reliability
standard EN61508, commonly referred to as the Safety Integrity Level (SIL)
standard.
The second is to specify
that no single or double part fault can lead to a dangerous fault in the system under consideration.
This entails tracing the effects of all part failure modes
and working out whether they can lead to any dangerous faults in the system under consideration.
%For instance, during WWII after operational research teams had analysed data it was determined that
% an aircraft engine that can, through one part failure cause a catastrophic failure is an unacceptable design.\cite{boffin} .
Both of these methods require a complete fault analysis tree.%\cite{FMEA}.
The statistical method
requires additional Mean Time To Failure (MTTF) data for all part failure modes.
The FMMD methodology applies defined stages and processes that
create a modular fault mode hierarchy. From this,
complete fault analysis trees can be determined. It uses a modular approach, so that repeated sections
of a system design can be modelled once and re-used.
%formally prove safety critical
%hardware designs.
The FMMD method creates a hierarchy from
part~fault~mode level up to system level.
%It does this using
%well defined stages, and processes.
%It allows re-use of analysed modules DOH DOH DOH
%, and to create a framework where
%fault causation trees, and statistical likelihood
%of faults occurring are
When a design has been analysed using this method, fault~trees may be traversed, and the statistical likelihoods of failure
and of dangerous~faults can be determined by traversing the fault tree down to the MTTFs of individual parts.
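As an illustrative sketch only, and under assumptions that are not part of the method as described
(constant part failure rates, and statistically independent part failure modes), this traversal can be turned into arithmetic.
A part failure mode $f_i$ with mean time to failure $\mathit{MTTF}_i$ has a failure rate
$$ \lambda_i = \frac{1}{\mathit{MTTF}_i} $$
and a system level fault that can be caused by any one of the part failure modes $f_1 \ldots f_n$ found at the leaves
of its fault tree has an approximate rate
$$ \lambda_{sys} \approx \sum_{i=1}^{n} \lambda_i $$
For example, two hypothetical part failure modes with MTTFs of $10^{6}$ and $10^{7}$ hours would give
$\lambda_{sys} \approx 10^{-6} + 10^{-7} = 1.1 \times 10^{-6}$ failures per hour.
The symbols $\lambda_i$ and $\lambda_{sys}$ are introduced here purely for illustration.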
%Starting with individual part failure modes, to collections of %parts (modules)
%and then to module level fault modes.
\subsection{Basic Concepts Of FMMD}
\paragraph{ Creating a fault hierarchy}
The main idea of the methodology is to build a hierarchy of fault modes from the part
level up to the highest system levels.
The first stage is to choose
parts that interact and naturally form {\em functional groups}. Functional groups are thus collections of base parts.
%These parts all have associated fault modes. A module is a set fault~modes.
From the point of view of fault analysis, we are not interested in the parts themselves, but in the ways in which they can fail.
For this study a functional group will mean a collection of components.
In order to determine the symptoms or failure modes of a {\em functional group}
we need to consider all failure modes of its parts.
By analysing the fault behaviour of a `functional group' with respect to these failure modes
we can derive a new set of possible failure modes.
%
This new set of faults is the set of derived faults at the module level and is thus at a higher level of
fault~mode abstraction. Thus we can say that the module as a whole entity can fail in a number of well defined ways.
In other words we have taken a functional group, and analysed how it can fail according to the failure modes of its parts.
The ways in which the module can fail now become a new set of fault modes, the fault~modes
derived from the functional~group. We can now create a new `derived~component' which has
the failure symptoms of the functional~group as its set of failure modes.
This new derived~component is at a higher failure mode abstraction
level than the base components.
%What this means is the `fault~symptoms' of the module have been derived.
%
%When we have determined the fault~modes at the module level these can become a set of derived faults.
%By taking sets of derived faults (module level faults) we can combine these to form modules
%at a higher level of fault abstraction. An entire hierarchy of fault modes can now be built in this way,
%to represent the fault behaviour of the entire system. This can be seen as using the modules we have analysed
%as parts, parts which may now be combined to create new functional groups,
%but as parts at a higher level of fault abstraction.
Applying the same process with derived components we can bring derived components
together to form functional groups and create new derived components
at a higher abstraction level.
\subsubsection { Definitions }
\begin{itemize}
\item base component - a component with a known set of unitary state failure modes
\item functional group - a collection of components chosen to perform a particular task
\item derived failure mode - a failure symptom of a functional group
\item derived component - a functional group after analysis
\end{itemize}
\subsubsection{An algebraic notation for identifying FMMD entities}
Each component $C$ is a set of failure modes for the component.
We can define a function $\mathcal{FM}$ that returns the
set of failure modes $S$ for the component.
$$ \mathcal{FM}(C) \rightarrow S $$
We can indicate the abstraction level of a component by using a superscript.
Thus for the component $C$, where it is a base component, we can assign it
the abstraction level zero thus $C^0$. Should we wish to index the components
(for example as in a product parts~list) we can use a subscript.
Our base component (if first in the parts~list) could now be uniquely identified as
$C^0_1$.
A functional group can use the letter $F$. A functional group is a collection
of components. We thus define $F$ as a set of components.
We can further define the abstraction level of a functional group.
We can say that it is the maximum abstraction level of any of its
components. Thus a functional group containing only base components
would have an abstraction level zero and could be represented with a superscript of zero thus
$F^0$. The functional group set may also be indexed.
We can apply symptom abstraction to a functional group to find
a set of derived failure modes. We are interested in the failure modes
of all the components in the functional group. An analysis process
defined as $\bowtie$ is applied to the functional group.
$$ \bowtie(F^N) \rightarrow C^{N+1} $$
The $\bowtie$ function processes each member (component) of the set $F$ and
extracts all the component failure modes, which are used by the analyst to
determine the derived failure modes. A new derived component is created
where its failure modes are the symptoms from $F$.
Note that the derived component will have a higher abstraction level than the functional
group it was derived from.
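As a small illustration of the abstraction level rule (the particular components chosen here are hypothetical),
suppose a functional group is formed from one derived component and one base component:
$$ F = \{ C^1_1, C^0_3 \} $$
Its abstraction level is the maximum of those of its members, $\max(1,0) = 1$, so it is written $F^1$,
and analysing it yields a derived component one level higher:
$$ \bowtie(F^1) \rightarrow C^{2} $$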
\subsubsection{FMMD Hierarchy}
By applying stages of analysis to higher and higher abstraction
levels we can converge to a complete failure mode model of the system under analysis.
An example of a simple system will illustrate this.
\subsection {Example FMEA process using an FMEA diagram}
Consider a simple functional~group $ F^0_1 $ formed from two base components $C^0_1,C^0_2$.
We can apply $\bowtie$ to the functional~group $F^0_1$
and it will return a derived component at abstraction level 1 (with an index of 1 for completeness)
$$ \bowtie( F^0_1 ) = C^1_1 $$
We now look at this analysis process in more detail.
By way of example, we apply $\mathcal{FM}$ to obtain the failure modes $f_N$ of each base component
$$ \mathcal{FM}(C^0_1) = \{ f_1, f_2 \} $$
$$ \mathcal{FM}(C^0_2) = \{ f_3, f_4, f_5 \} $$
The analyst now considers failure modes $f_{1..5}$ in the context of the functional group.
The result of this process will be a set of derived failure modes.
Let these be $ \{ f_6, f_7, f_8 \} $.
We can now create a derived component $C^1_1$ with this set of failure modes.
Thus:
$$ \mathcal{FM}(C^1_1) = \{ f_6, f_7, f_8 \} $$
We can represent this analysis process in a diagram, see figure \ref{fig:onestage}.
\begin{figure}[h]
\centering
\includegraphics[width=200pt,bb=0 0 268 270]{fmmdset/onestage.jpg}
% onestage.jpg: 268x270 pixel, 72dpi, 9.45x9.52 cm, bb=0 0 268 270
\caption{FMMD analysis of functional group}
\label{fig:onestage}
\end{figure}
% \begin{figure}
% \centering
% \input{fmmdset/fmmdh.tex}
% \caption{FMMD example Hierarchy}
% \label{fig:sdfmea}
% \end{figure}
\begin{figure}[h]
\centering
\includegraphics[width=400pt,bb=0 0 555 520,keepaspectratio=true]{fmmdset/fmmdh.png}
% fmmdh.png: 555x520 pixel, 72dpi, 19.58x18.34 cm, bb=0 0 555 520
\caption{FMMD Example Hierarchy}
\label{fig:fmmdh}
\end{figure}
\section {Building the Hierarchy - Higher levels \\ of Fault Mode Analysis}
Figure \ref{fig:fmmdh} shows a hierarchy of failure mode decomposition.
It can be seen that the derived fault~mode sets are higher level abstractions of the fault behaviour of the modules.
We can take this one stage further by combining the $D^{1}_{N}$ sets to form modules. These
$M^2_{N}$ fault mode collections can be used to create $D^3_{N}$ derived fault~mode sets and so on.
At the top of the hierarchy, there will be one final set $D^{t}_{N}$ (where $t$ is the
top level) of abstract fault modes. The causes of these
system level fault~modes will be traceable down to part fault modes.
A hierarchy of levels of faults, becoming more abstract at each level, should
converge to a small sub-set of system level errors.
This thinning out of the number of system level errors is borne out in practice;
real time control systems often have a small number of major reportable faults (typically $ < 50$),
even though they may have accompanying diagnostic data~\cite{sem}.
%\begin{figure}
%\subfigure[Euler Diagram]{\epsfig{file=fmmd_hierarchy_cimg5040.eps,width=4.2cm}\label{fig:exa}}
%\subfigure[Intersection A B ]{\epsfig{file=exampleareasubtraction2.eps,width=4.2cm}\label{fig:exb}}
%\subfigure[area to subtract]{\epsfig{file=exampleareasubtraction3.eps,width=4.2cm}\label{fig:exc}}
%\subfigure[A second graphic]{\epsfig{file=exampleareasubtraction3.eps,width=2cm}}
%{\epsfig{file=fmmd_hierarchy_cimg5040.eps,width=12cm}
%\label{fig:ex}
%\caption{Simple Euler Diagram}
%\end{figure}
\section {Modelling considerations}
\subsection{ Proof of number of part~failure \\ modes preserved in hierarchy build}
It remains to be proved that, as an abstract fault is carried higher in the tree, the set of
actual part~failure modes that can cause it can only grow, never shrink. This appears obvious but needs a formal proof.
It also means that dummy modules may be needed, so that no fault jumps a level of the tree structure.
%Complete coverage for all derived hierarch levels can be generalised thus:
%$$ CompleteCoverage = \forall \; h \; \forall \; x \exists \; y \; ( \; x \; \in \; \cup \; {\cal F} \; D^{h}
% \; \Rightarrow \; x \; \in \; \cup \; M^{h}_{y} ) $$
\subsection{Cardinality Constrained Powerset }
\label{ccp}
A cardinality constrained powerset is one where sub-sets of a cardinality greater than a threshold
are not included. This threshold is called the cardinality constraint.
To indicate this, the cardinality constraint $cc$ is subscripted to the powerset symbol thus $\mathcal{P}_{cc}$.
Consider the set $S = \{a,b,c\}$. $\mathcal{P}_{2} S $ means all non-empty subsets of $S$ where the cardinality of the subsets is
less than or equal to 2.
$$ \mathcal{P} S = \{ \emptyset, \{a,b,c\}, \{a,b\},\{b,c\},\{c,a\},\{a\},\{b\},\{c\} \} $$
$$ \mathcal{P}_{2} S = \{ \{a,b\},\{b,c\},\{c,a\},\{a\},\{b\},\{c\} \} $$
$$ \mathcal{P}_{1} S = \{ \{a\},\{b\},\{c\} \} $$
A $k$ combination is a subset with $k$ elements.
The number of $k$ combinations (each of size $k$) from a set $S$
with $n$ elements (size $n$) is the binomial coefficient
$$ C^n_k = {n \choose k} = \frac{n!}{k!(n-k)!}$$
To find the number of elements in the cardinality constrained powerset of $S$, with up to $cc$ elements
in each combination sub-set,
we need to sum the combinations
%subtracting $cc$ from the final result
%(repeated empty set counts)
from $1$ to $cc$ thus
%
% $$ {\sum}_{k = 1..cc} {\#S \choose k} = \frac{\#S!}{k!(\#S-k)!} $$
%
$$ \#\mathcal{P}_{cc} S = \sum_{k=1}^{cc} \frac{\#S!}{k!(\#S-k)!} $$
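As a check on this sum, consider again $S = \{a,b,c\}$, so that $\#S = 3$, with $cc = 2$:
$$ \#\mathcal{P}_{2} S = {3 \choose 1} + {3 \choose 2} = 3 + 3 = 6 $$
which agrees with the six sub-sets listed for $\mathcal{P}_{2} S$ earlier in this section.
Similarly $\#\mathcal{P}_{1} S = {3 \choose 1} = 3$.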
\subsection{Actual Number of combinations to check with Unitary State Fault mode sets}
Where all components analysed have only one fault mode, the cardinality constrained powerset
calculation gives the correct number of test case combinations to check.
Where components have several failure modes, each set of failure modes is constrained to be unitary state,
and so the actual number will be less.
What must actually be done is to subtract the number of component `internal combinations'
from the cardinality constrained powerset number.
Consider a simple circuit with two components $R$ and $T$, for which
$\mathcal{FM}(R) = \{R_o, R_s\}$ and $\mathcal{FM}(T) = \{T_o, T_s, T_h\}$.
For a cardinality constraint of 2, because there are 5 fault modes, the powerset calculation
gives $\frac{5!}{1!(5-1)!} + \frac{5!}{2!(5-2)!} = 15$, that is
5 single fault modes and ${5 \choose 2} = 10$ double fault modes.
However we know that the fault modes of a single component are mutually exclusive.
We must then subtract the number of `internal' component fault combinations.
For component $R$ there is only one internal fault combination that cannot exist,
$R_o \wedge R_s$. As a combination ${2 \choose 2} = 1$. For $T$, the component with
three fault modes, ${3 \choose 2} = 3$.
Thus for $cc = 2$ we must subtract $(3+1)$.
Written as a general formula, where $C_j$ is a component (indexed by $j$, where $J$
is the index set of the components under analysis) and $\#C_j$
indicates the number of mutually exclusive fault modes the component has:
%$$ \#\mathcal{P}_{cc} S = \sum^{k}_{1..cc} \frac{\#S!}{k!(\#S-k)!} $$
$$ \#\mathcal{P}_{cc} S = {\sum_{k=1}^{cc} \frac{\#S!}{k!(\#S-k)!}} - {\sum_{j \in J} {\#C_{j} \choose cc}} $$
Note that, as written, the subtracted term counts only the internal combinations of size $cc$,
which is exact for the double fault case $cc = 2$ considered here.
%$$ \#\mathcal{P}_{cc} S = \sum^{k}_{1..cc} \big[ \frac{\#S!}{k!(\#S-k)!} - \sum_{j} (\#C_{j} \choose cc \big] $$
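As a worked check of the general formula, applied to the two component circuit with
$\#S = 5$, $\#R = 2$, $\#T = 3$ and $cc = 2$:
$$ \#\mathcal{P}_{2} S = {5 \choose 1} + {5 \choose 2} - \left[ {2 \choose 2} + {3 \choose 2} \right] = 5 + 10 - (1 + 3) = 11 $$
The 11 remaining combinations are the 5 single fault modes together with the $2 \times 3 = 6$ double faults
that pair one fault mode of $R$ with one fault mode of $T$:
$$ \{R_o,T_o\}, \{R_o,T_s\}, \{R_o,T_h\}, \{R_s,T_o\}, \{R_s,T_s\}, \{R_s,T_h\} $$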
%% CASE STUDY BEGIN
\subsection{Case Study FMMD Hierarchy:\\ Simple RS-232 voltage reader}
%%% This is the tikz picture ??/
%
%\begin{figure}[h+]
%\centering
%\input{fmmdset/mvsblock.tex}
%\caption{Block Diagram : Example Milli-Volt Sensor : Block Diagram}
%%\includegraphics[scale=0.20]{ptop.eps}
%\label{fig:mvsblock}
%\end{figure}
%
Consider a simple electronic system that provides, say, two milli-volt amplifiers
and supplies their readings onward via a serial link (RS232). This is simple in concept: plug in a
computer, run a terminal program, and the instrument will report the milli-volt readings in ASCII
along with any error messages.
% in CRC checksum protected packets.
It is interesting to look at one of the `functional groups'. The milli-volt amplifiers are a good example.
These can be analysed by taking a functional~group, the components surrounding the op-amp:
a few resistors to determine offset and gain,
a safety resistor, and perhaps some smoothing capacitors.
These components form the functional group. The circuit is then analysed for all the fault combinations
of these parts. This produces a large collection of possible fault~modes for the milli-volt amplifier.
The two amplifiers are now connected to the ADC, which converts the voltages to binary words for the microprocessor.
The microprocessor then uses the values to determine if the readings are valid and then formats text to send
via the RS232 serial line.
%
% \begin{figure}[h+]
% %\centering
% %\input{millivolt_sensor.tex}
% \includegraphics[scale=0.4]{fmmdset/millivolt_sensor.eps}
% \caption{Hierarchical Module Diagram : Milli-Volt Sensor Example}
% \label{fig:mvs}
% \end{figure}
\begin{figure}[h]
\centering
\includegraphics[width=400pt,bb=0 0 749 507,keepaspectratio=true]{fmmdset/millivolt_sensor.png}
% millivolt_sensor.png: 749x507 pixel, 72dpi, 26.42x17.89 cm, bb=0 0 749 507
\caption{Hierarchical Module Diagram : Millivolt Sensor Example}
\label{fig:mvs}
\end{figure}
This system has a number of obvious functional~groups: the PCB power supply, the milli-volt amplifiers,
the analog to digital conversion circuitry, the microprocessor and the UART (serial link - RS232 transceiver).
It would make sense when analysing this system to take each one of these functional~groups in turn and examine them closely.
It would be sensible if the system could detect the most obvious fault~modes by self testing.
When these have been examined and diagnostic safeguard strategies have been thought up,
we might look at reporting any fault via the RS232 link.
% (if it still works !).
By doing this we have already used a modular approach.
We have analysed each section of the circuitry,
and then, using the abstract errors derived from each module,
can fit these into a picture of the
fault~modes of the milli-volt monitor as a whole. However this type of analysis is not guaranteed
to rigorously take into account all fault~modes.
It is nevertheless useful to follow an example fault through the levels of the abstraction hierarchy, see below.
%The FMMD technique,
%goes further than this by considering all part fault~modes and
%places the analysis phases into a rigid structure.
%Each analysis phase is
%described using set theory in later sections.
%By creating a rigid hierarchy, not only can we traverse back
%down it to find possible causes for system errors, we can also determine
%combinations of fault modes that cause certain high level fault modes.
%For instance, it may be a criteria that no single part failure may cause a fatal error.
%If a fault tree can trace down to a single part fault for a potentially fatal
%fault mode, then a re-design must be undertaken.
%Some standards for automated burner controllers demand that two part failure modes cannot cause
%a dangerous/potentially fatal error. Again having a complete fault analysis tree will reveal these conditions.
\subsection{An example part Fault and its subsequent \\ abstraction to system or top level}
An example of a part fault effect on the example system is given below, showing how this fault
manifests itself at each abstraction level.
%\begin{example}
As an example let us consider a resistor failure in the first milli-volt sensor.
Let us say that this resistor, R48 say, with the particular fault mode `shorted'
causes the amplifier to output 5V.
At the part level we have one fault mode in one part.
%This is the lowest or zero level of fault abstraction.
Let us say that this amplifier has been designed to amplify the milli-volt input
to between 1 and 4 volts, a convenient voltage for the ADC/microcontroller to read.
Any voltage outside this range will be considered erroneous.
As the resistor short causes the amplifier to output 5V we can detect the error condition.
This resistor is a part in the `millivolt amplifier 1' module.
% (see figure \ref{fig:mvs}).
The fault mode at the derived fault level (abstraction level 1) is OUTPUT\_HIGH.
Looking higher in the hierarchy, the next abstraction level higher, level 2, will see this as
a `CHANNEL\_1' input fault.
%The system as a whole (abstraction level 3) will see this as
%a `MILLI\_VOLT\_SENSOR' fault~mode.
%\end{example}
\subsubsection{Abstraction Layer Summary \\ for example fault.}
\begin{description}
%\begin{list}
\item[Abstraction Level 0 :] Resistor has fault mode `R48\_SHORT' in amplifier 1.
\item[Abstraction Level 1 :] Amplifier 1 has fault mode `OUTPUT\_HIGH'.
\item[Abstraction Level 2 :] Milli-volt sensor has `CHANNEL\_1' fault.
%\item[Abstraction Level 3 :] System has `MILLI\_VOLT\_SENSOR' fault.
%\end{itemize}
%\end{list}
\end{description}
Thus we have looked at a single part fault and analysed its effect from the
bottom up on the system as a whole, going up through the abstraction layers.
%%
%% END CASE STUDY
%%
\section{Future Ideas}
\subsection{ Production Quality Control }
A fault causation tree could be used for PCB fault finding (from the fault codes that are reported
by the equipment). This could be used in conjunction with a database to provide
production oriented FMEA\footnote{The term FMEA, applied to production, is a statistical process of
determining the probability of a fault occurring and multiplying that by the costs incurred from the fault.
This quickly becomes a priority to-do list with the most costly faults at the top.}.
\subsection { Test Rigs }
Test rigs apply a rigorous checking process to safety critical equipment before
it can be sold, and this is usually a legal or contractual requirement, backed up by inspections
and an approval process.
A test rig is usually a clamp arrangement in which the PCB under test is placed.
Precision calibrated test signals are then applied to the board under test. For PCBs containing
a microprocessor, custom test~rig software may be run on them to exercise
active sections of the PCB (for instance to drive outputs, relays etc).
The main purpose of a test rig is to prevent faulty equipment from being shipped.
However, a test rig will often reveal an easy to fix fault on a board (such as a part not soldered down completely
or missing parts). These boards can be mended and re-submitted to the test rig.
It is often a problem, when a unit fails in a test rig, to quickly determine why it has failed.
A fault causation tree would be useful for identifying which parts may be missing, not soldered down
or simply incorrect. The test rig, armed with the fault analysis tree, could point to parts or combinations of parts that could be checked
to correct the product.
\subsection {Modules - re-usability}
In the example system in the introduction, the milli-volt amplifiers
are the same circuit. The set of derived faults for the module may therefore
simply be given a different index number and re-used.
\subsection{ Multi Channel Safety Critical Systems }
Where a system has several independent parallel tasks, each one can be a separate hierarchy.
% \small
% \bibliography{vmgbibliography,mybib}
% \normalsize
% Typeset in \ \ {\huge \LaTeX} \ \ on \ \ \today
% \begin{verbatim}
% CVS Revision Identity $Id: fmmdset.tex,v 1.7 2009/06/06 11:52:09 robin Exp $
% \end{verbatim}
%\end{document}
%\theend