Lunchtime, Andrew Fish Comments from weekend.

Robin Clark 2010-12-06 14:10:41 +00:00
parent 4da27903f4
commit 14a3dc4c34


@ -16,7 +16,7 @@ incremental and rigorous approach.
The four main static failure mode analysis methodologies were examined and,
in the context of newer European safety standards, assessed.
Some of the deficiencies identified in these methodologies lead to
a wish list for a more rigorous methodology.
%% What I have found
%%
@ -24,7 +24,8 @@ From the wish list
%and considering some constraints determined from
%the evaluation of the four established methodologies,
a new
methodology is developed and proposed.
This has been named Failure Mode Modular De-Composition (FMMD).
%% Sell it
%%
@ -58,7 +59,8 @@ From the wish list %
%and considering some constraints determined from
%the evaluation of the four established methodologies,
a new
methodology is developed and proposed.
This has been named Failure Mode Modular De-Composition (FMMD).
%% Sell it
%%
@ -112,7 +114,7 @@ ensuring that all component failure modes must be considered in the model.
%
\paragraph{FMMD Process outline.}
This methodology has been named Failure Mode Modular De-composition (FMMD)
because it decomposes a SYSTEM into a hierarchy of modules or {\dc}s.
This
\ifthenelse {\boolean{paper}}
{
@ -133,10 +135,13 @@ is determined.
%
FMMD works from the bottom up, taking small groups
of components, {\fgs}, and then analysing how they can fail.
\input{./shortfg}
\paragraph{Micro vs. Macro failure mode analysis.}
This analysis is performed using FMEA from a micro rather than a macro perspective.
Thus instead of looking at component failure modes and determining how
they {\em may} cause a failure at SYSTEM level, we are looking at how
they {\em will} affect the component's local {\fg}.
When we know the failure modes of a {\fg} we can treat it as a `black box'
or {\dc}. With {\dc}s we can build {\fgs}
at higher levels of analysis, until we have a complete
@ -168,8 +173,8 @@ a set of undesirable outcomes or `accidents'.
As most accidents are unexpected and the causes unforeseen \cite{safeware}
it is fair to say that a top-down approach is not guaranteed to
predict all possible undesirable outcomes.
Top-down methodologies can miss known component failure modes, by
simply not decomposing down to the base component failure level of detail.
\paragraph{A general problem with bottom-up static failure analysis.}
With the bottom-up techniques we have all the known component failure modes
@ -177,25 +182,29 @@ and the relative freedom to determine how each of these may affect the SYSTEM.
%
A problem with this is that a component typically
interacts in a complex way with several other functionally
adjacent components.
%
To take a component failure mode and then attempt to tie that
to a SYSTEM level outcome is very difficult.
%
The difficulty lies in
the number of components
a failure mode under investigation might interact with,
which is typically very large.
This makes it very difficult to predict the effects of a component
failure mode, because we have to decide which components it could affect,
or
in other words, which components are functionally adjacent to it.
%
We cannot consider all the components in the SYSTEM
when looking at a single failure mode,
and therefore human judgement must be used to
decide which interactions could be important.
Let N be the number of components in our system, and K be the average number of component failure modes
(ways in which the base~component can fail). The total number of base component failure modes
is $N \times K$. To examine the effect that one failure mode has on all
the other components\footnote{A base component failure will typically affect the sub-system
it is part of, and create a failure effect at the SYSTEM level.}
will be $(N-1) \times N \times K$, in effect a set cross product.
@ -207,18 +216,21 @@ Or we may have a mechanical device that has a different
failure mode behaviour for, say, different ambient pressures or temperatures.
If $E$ is the number of applied states or environmental conditions to consider
in a system, the bottom-up analyst is presented with an
additional cross product factor,
$(N-1) \times N \times K \times E$.
If we put some typical very small embedded system numbers\footnote{these figures would
be typical of a very simple temperature controller, with a micro-controller, sensor
and heater circuit} into this, say $N=100$, $K=2.5$ and $E=10$
we have $99 \times 100 \times 2.5 \times 10 = 247500 $.
To look in detail at a quarter of a million test cases is obviously impractical.
If we were to consider multiple simultaneous failure modes,
we have yet another cross product of checks to be performed.
For instance, when looking at double simultaneous failure modes, where $\#C$
is the number of checks to perform,
the equation reads $\#C = (N-2) \times (N-1) \times N \times K \times E$.
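To make the scale of the problem concrete, the following is a minimal sketch
(Python, with function and variable names invented here for illustration)
that reproduces the figures above:
\begin{verbatim}
# N: number of components, K: average failure modes per component,
# E: number of applied states / environmental conditions.

def checks_single(N, K, E=1):
    # each of the N*K failure modes examined against every other component
    return (N - 1) * N * K * E

def checks_double(N, K, E=1):
    # double simultaneous failure modes add another (N-2) factor
    return (N - 2) * (N - 1) * N * K * E

print(checks_single(100, 2.5, 10))   # 247500.0
print(checks_double(100, 2.5, 10))   # 24255000.0
\end{verbatim}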
The bottom-up methodologies FMEA, FMECA and FMEDA take single failure modes and link them
to SYSTEM level failure modes. Because of the astronomical number of possible interactions,
@ -232,7 +244,7 @@ component failure mode to the SYSTEM level).
An ideal static failure mode methodology would build a failure mode model
from which the traditional four models could be derived.
It would address the shortcomings in the other methodologies, and
would have a user-friendly interface, with a visual (rather than symbolic) syntax with icons
to represent the results of analysis phases.
%
%There are four static analysis failure mode methodologies in common use.
@ -251,7 +263,7 @@ systems in the early 1960s and was not designed as a rigorous
fault/failure mode methodology.
It was designed to look for disastrous top level hazards and
determine how they could be caused.
It is more like a procedure to
be applied when discussing the safety of a system, with a top-down hierarchical
notation using logic symbols that guides the analysis.
This methodology was designed for
@ -265,7 +277,7 @@ system level outcomes.
\subsubsection{FTA weaknesses}
\begin{itemize}
\item Possibility to miss component failure modes.
\item Possibility to miss environmental effects.
\item No possibility to model base component level double failure modes.
\end{itemize}
@ -279,7 +291,11 @@ The investigation will typically point to a particular failure
of a component.
The methodology is now applied to find the significance of the failure.
It is based on a simple equation where $S$ ranks the severity (or cost \cite{bfmea}) of the identified SYSTEM failure,
$O$ its occurrence\footnote{The occurrence $O$ is the
probability of the failure happening.},
and $D$ the failure's detectability\footnote{Detectability: often failures
may occur but not be noticed or cause an effect.
Consider an unused feature failing.}. Multiplying these
together,
gives a Risk Priority Number (RPN): $RPN = S \times O \times D$.
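For example, a failure mode scored (on hypothetical 1--10 scales) at $S=7$, $O=4$ and $D=3$
would give $RPN = 7 \times 4 \times 3 = 84$.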
This gives in effect
@ -293,7 +309,7 @@ a prioritised `todo list', with higher the $RPN$ values being the most urgent.
\item No possibility to model base component level double failure modes.
\end{itemize}
\paragraph{Note.} FMEA is sometimes used in its literal sense, that is to say
Failure Mode Effects Analysis: simply looking at a system's internal failure
modes and determining what may happen as a result.
The FMEA described in this section (\ref{pfmea}) is sometimes called `production FMEA'.
@ -311,21 +327,23 @@ electronic components was published by the DOD
in 1991 (MIL HDBK 1991 \cite{mil1991}) and is a typical
source for MTTF data.
%
FMECA has a probability factor for a component error becoming
a SYSTEM level error.
This is termed the $\beta$ factor.
%\footnote{for a given component failure mode there will be a $\beta$ value, the
%probability that the component failure mode will cause a given SYSTEM failure}.
%
This lacks precision, or in other words, determinability prediction accuracy \cite{fafmea},
as often the component failure mode cannot be proven to cause a SYSTEM level failure, but is
assigned a probability $\beta$ factor by the design engineer. The use of a $\beta$ factor
is often justified using Bayes' theorem \cite{probstat}.
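For illustration only (this follows the style of the MIL-STD-1629A criticality
calculation; the symbols $\alpha$, $\lambda_p$ and $t$ are assumptions here, not
taken from the text), a failure mode criticality number can be computed as
$$ C_m = \beta \times \alpha \times \lambda_p \times t $$
where $\alpha$ is the fraction of the component's failures attributable to the
failure mode in question, $\lambda_p$ is the part failure rate and $t$ the
operating time.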
%Also, it can miss combinations of failure modes that will cause SYSTEM level errors.
%
The results of FMECA are similar to FMEA, in that component errors are
listed according to importance, based on
probability of occurrence and criticality.
Again this essentially produces a prioritised `todo' list.
%%-WIKI- Failure mode, effects, and criticality analysis (FMECA) is an extension of failure mode and effects analysis (FMEA).
%%-WIKI- FMEA is a a bottom-up, inductive analytical method which may be performed at either the functional or
@ -362,7 +380,7 @@ The following gives an outline of the procedure.
\subsubsection{Two statistical perspectives}
FMEDA is a statistical analysis methodology and is used from one of two perspectives:
Probability of Failure on Demand (PFD), and Probability of Failure
in continuous operation, or Failure in Time (FIT).
\paragraph{Failure in Time (FIT).} Continuous operation is measured in failures per billion ($10^9$) hours of operation.
@ -372,7 +390,7 @@ we would be interested in its operational FIT values.
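For example, a component with an MTTF of $10^7$ hours would have a failure rate of
$10^9 / 10^7 = 100$ FIT.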
\paragraph{Probability of Failure on Demand (PFD).} For instance with an anti-lock system in
automobile braking, or other fail safe measure applied in an emergency, we would be interested in PFD.
That is to say, the ratio of it failing
to operating correctly on demand.
\subsubsection{The FMEDA Analysis Process}
@ -388,9 +406,10 @@ environmental conditions. The SYSTEM errors are categorised as `safe' or `danger
%Statistical data exists for most component types \cite{mil1992}.
%
This phase is typically implemented on a spreadsheet
with rows representing each component. A typical component spreadsheet row would
comprise
component type, placement,
part number, environmental stress factors, MTTF, safe/dangerous etc.
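A sketch of one such row (the column names and values here are invented for
illustration, not taken from any particular FMEDA tool):
\begin{verbatim}
type      placement       part no.  stress        MTTF (hours)  class
resistor  sensor divider  R104      60C, derated  1.0e9         dangerous
\end{verbatim}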
%will be a determination of whether the component failing will lead to a `safe'
%or `unsafe' condition.
@ -410,6 +429,7 @@ This is done by taking a component failure mode and determining
if the SYSTEM error it is tied to is dangerous or safe.
The decision for this may be
based on heuristics or field data.
EN61508 uses the $\lambda$ symbol to represent failure rates.
Because we have statistics for each component failure mode,
we can now classify these in terms of safe and dangerous lambda values.
These failure rates are labelled `$\lambda_D$' (for
@ -417,8 +437,8 @@ dangerous) and `$\lambda_S$' (for safe) \cite{en61508}.
\paragraph{Determine Detectable and Undetectable Failures.}
Each safe and dangerous failure mode is now
classified as detectable or un-detectable.
EN61508 assumes that products have a high level of
self-checking features.
%
This gives us four failure mode classifications:
@ -436,7 +456,7 @@ next step is to investigate using an actual working SYSTEM.
Failures are deliberately caused (by physical intervention), and any new SYSTEM level
failures are added to the model.
Heuristics and MTTF failure rates for the components
are used to calculate probabilities for these new failure modes
along with their safety and detectability classifications (i.e.
$\lambda_{SD}$, $\lambda_{SU}$, $\lambda_{DD}$, $\lambda_{DU}$).
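Note that these decompose the safe and dangerous failure rates:
$\lambda_D = \lambda_{DD} + \lambda_{DU}$ and
$\lambda_S = \lambda_{SD} + \lambda_{SU}$ (a standard EN61508 relationship).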
@ -454,11 +474,16 @@ The calculations for these are described below.
The diagnostic coverage is simply the ratio
of the dangerous detected failure rates
against the total rate of all dangerous failures,
and is normally expressed as a percentage. $\Sigma\lambda_{DD}$ represents
the sum of the failure rates of the dangerous detected base component failure modes, and
$\Sigma\lambda_D$ the sum over all dangerous base component failure modes.
$$ DiagnosticCoverage = \Sigma\lambda_{DD} / \Sigma\lambda_D $$
The diagnostic coverage for safe failures, where $\Sigma\lambda_{SD}$ represents the sum of
the failure rates of the safe detected base component failure modes,
and $\Sigma\lambda_S$ the sum over all safe base component failure modes,
is given as
$$ SF = \frac{\Sigma\lambda_{SD}}{\Sigma\lambda_S} $$
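As an illustrative sketch only (Python; the $\lambda$ values below are invented
for the example), these ratios could be computed as follows:
\begin{verbatim}
# Failure rates (e.g. in FIT) for each base component failure mode,
# classified as safe/dangerous and detected/undetected.
lambda_DD = [20.0, 5.0]    # dangerous detected
lambda_DU = [2.0]          # dangerous undetected
lambda_SD = [40.0, 10.0]   # safe detected
lambda_SU = [8.0]          # safe undetected

total_D = sum(lambda_DD) + sum(lambda_DU)  # all dangerous failure rates
total_S = sum(lambda_SD) + sum(lambda_SU)  # all safe failure rates

diagnostic_coverage = sum(lambda_DD) / total_D  # 25/27, about 93%
safe_coverage = sum(lambda_SD) / total_S        # 50/58, about 86%

print(f"DC = {diagnostic_coverage:.1%}, SF = {safe_coverage:.1%}")
\end{verbatim}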
@ -498,8 +523,13 @@ There are four SIL levels, from 1 to 4 with 4 being the highest safety level.
In addition to probabilistic risk factors, the
diagnostic coverage and SFF
have threshold bands becoming stricter for each level.
Demanded software verification and specification techniques and constraints
(such as language subsets, s/w redundancy etc.)
become stricter for each SIL level.
%%
%% Andrew asked me to expand on this here, but it would take at least two
%% pages. I think its more appropriate for the survey.tex chapter.
%%
Thus FMEDA uses statistical methods to determine
a safety level (SIL), typically used to meet an acceptable risk
@ -521,7 +551,7 @@ has a calculated risk, a fault detection time (if any), an estimated risk import
and other factors such as de-rating and environmental stress.
With one component failure mode per row,
all the statistical factors for SIL rating can be produced\footnote{A SIL rating will apply
to an installed plant, i.e. a complete installed and working SYSTEM. SIL ratings for individual components or
sub-systems are meaningless, and the nearest equivalent would be the FIT/PFD and SFF and diagnostic coverage figures.}.
@ -541,7 +571,7 @@ where he probably should assign a dangerous failure classification to it.
%
There is no analysis
of how that resistor would/could affect the components close to it, but because the circuitry
it is part of is in a critical section, it will most likely
be linked to a dangerous system level failure in an FMEDA study.
%
%%- IS THIS TRUE IS THERE A BETA FACTOR IN FMEDA????
@ -571,7 +601,7 @@ safety level zones as recomended in EN61508\cite{en61508}. This is a vague way o
safety, as it can miss unexpected effects due to `unexpected' component interaction.
The Statistical Analysis methodology is the core philosophy
of the Safety Integrity Levels (SIL) embodied in EN61508 \cite{en61508}
and its international analogue standard IEC61508.
@ -590,11 +620,12 @@ and its international analog standard IOC5108.
\item All component failure modes must be considered in the model.
\item It should be easy to integrate mechanical, electronic and software models \cite{sccs}[pp.287].
\item It should be re-usable, in that commonly used modules can be re-used in other designs/projects.
\item It should have a formal basis, that is to say, be able to produce mathematical proofs
for its results, such as system level error causation trees, reliability and safety statistics.
\item It should be easy to use, ideally using a
graphical syntax (as opposed to a formal symbolic/mathematical text based language).
\item From the top down, the failure mode model should follow a logical de-composition of the functionality
to smaller and smaller functional groupings \cite{maikowski}.
\item Multiple failure modes may be modelled from the base component level up.
\end{itemize}
@ -608,16 +639,16 @@ and start with the component failure modes.
%
\paragraph{Natural Fault Finding is top down.}
The traditional, or natural, approach to fault finding
is to start at the top with SYSTEM level failure modes/faults.
%
On encountering a
fault, the symptom is first observed at the top or
SYSTEM level. By decomposing the functionality of the faulty system and testing,
we can further decompose the system until we find the
faulty base level component.
Decomposition of electrical circuits is formalised and explored
in \cite{maikowski}. This top-down technique decomposes by functionality.
Simpler and simpler functional groups are discovered as we delve
further into the way the system works and is built.
@ -644,6 +675,9 @@ into manageable and separately testable entities.
A second justification for this is that the design process for a product requires both top-down and bottom-up
thinking. To analyse a system from the bottom-up is a useful
design validation process in itself \cite{sommerville}.
%%
%% CAN we find a ref for both top and bottom up being used
%% as design validaion ????
\paragraph{Design Decision: Methodology must be bottom-up.}
In order to ensure that all component failure modes are handled,
@ -656,10 +690,15 @@ A hierarchy of functional grouping, leading to a system model
still leaves us with the problem of the number of component failure modes.
The base components will typically have several failure modes each.
%
Given that a typical embedded system may have hundreds of components,
we would still have to tie base component failure modes
to SYSTEM level errors.
The problem with this is that the effects of the base component failure mode
under investigation are not rigorously examined in relation to functionally adjacent components.
Thus there is the `possibility to miss failure mode effects
at the much higher SYSTEM level' criticism of the FTA, FMEDA and FMECA methodologies.
%%%
%%% OK Got up to here Lunchtime edit 06DEC2010.............
\paragraph{Design Decision: Methodology must reduce and collate errors at each functional group stage.}
SYSTEMS typically have far fewer failure modes than the sum of their component failure modes.
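As an illustration, the many individual component failure modes in a power supply
might all present at the SYSTEM level as just two symptoms: no output, or output
out of specification.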