From 14a3dc4c34dd4fc9861c4cc70ea6db8f30beab8d Mon Sep 17 00:00:00 2001 From: Robin Clark Date: Mon, 6 Dec 2010 14:10:41 +0000 Subject: [PATCH] Lunchtime, Andrew Fish Comments from weekend. --- fmmd_concept/fmmd_concept.tex | 145 +++++++++++++++++++++------------- 1 file changed, 92 insertions(+), 53 deletions(-) diff --git a/fmmd_concept/fmmd_concept.tex b/fmmd_concept/fmmd_concept.tex index b66cb1b..5ef72f1 100644 --- a/fmmd_concept/fmmd_concept.tex +++ b/fmmd_concept/fmmd_concept.tex @@ -16,7 +16,7 @@ incremental and rigorous approach. The four main static failure mode analysis methodologies were examined and in the context of newer European safety standards, assessed. Some of the deficiencies identified in these methodologies lead to -a wish list for a more ideal methodology. +a wish list for a more rigorous methodology. %% What I have found %% @@ -24,7 +24,8 @@ From the wish list %and considering some constraints determined from %the evaluation of the four established methodologies, a new -methodology is developed and proposed. The has been named Failure Mode Modular De-Composition (FMMD). +methodology is developed and proposed. +This has been named Failure Mode Modular De-Composition (FMMD). %% Sell it %% @@ -58,7 +59,8 @@ From the wish list % %and considering some constraints determined from %the evaluation of the four established methodologies, a new -methodology is developed and proposed. The has been named Failure Mode Modular De-Composition (FMMD). +methodology is developed and proposed. +This has been named Failure Mode Modular De-Composition (FMMD). %% Sell it %% @@ -112,7 +114,7 @@ ensuring that all component failure modes must be considered in the model. % \paragraph{FMMD Process outline.} This methodology has been named Failure Mode Modular De-composition (FMMD) -because it de-composes a SYSTEM into a hierarchy of modules or {\dc}s. +because it decomposes a SYSTEM into a hierarchy of modules or {\dc}s. This \ifthenelse {\boolean{paper}} { @@ -133,10 +135,13 @@ is determined. % FMMD works from the bottom up, taking small groups of components, {\fgs}, and then analysing how they can fail. +\input{./shortfg} + +\paragraph{Micro Vs. Macro failure mode analysis.} This analysis is performed using FMEA from a micro rather than a macro perspective. Thus instead of looking at component failure modes and determining how they {\em may} cause a failure at SYSTEM level, we are looking at how -they {\em will} affect the components local {\fg}. +they {\em will} affect the component's local {\fg}. When we know the failure modes of a {\fg} we can treat it as a `black box' or {\dc}. With {\dc}s we can build {\fgs} at higher levels of analysis, until we have a complete @@ -168,8 +173,8 @@ a set of undesirable outcomes or `accidents'. As most accidents are unexpected and the causes unforeseen \cite{safeware} it is fair to say that a top down approach is not guaranteed to predict all possible undesirable outcomes. -It also can miss known component failure modes, by -simply not de-composing down to the base component failure level of detail. +Top-down methodologies can miss known component failure modes, by +simply not decomposing down to the base component failure level of detail. \paragraph{A general problem with bottom-up static failure analysis.} With the bottom up techniques we have all the known component failure modes @@ -177,25 +182,29 @@ and the relative freedom to determine how each of these may affect the SYSTEM. 
%
A problem with this is that a component typically interacts in a complex way with several other functionally
-adjacent components
+adjacent components.
%
To take a component failure mode and then attempt to tie that to a SYSTEM level outcome
is very difficult.
%
-The difficulty lies in
%
-%Because of
-the number of components
-our failure mode under investigation may interact with is typically very large.
+The number of components
+a failure mode under investigation might interact with is typically very large.
+This makes it very difficult to predict the effects of a component
+failure mode, because we have to decide which components it could affect,
+or, in other words, which components are functionally adjacent to it.
%
We cannot consider all the components in the SYSTEM when looking at a single failure mode,
-and human judgement must be used to
+and therefore human judgement must be used to
decide which interactions could be important.
Let $N$ be the number of components in our system, and $K$ be the average number of component failure modes
-(ways in which the component can fail). The total number of base component failure modes
-is $N \times K$. To examine the effect that one failure mode has on all the other components
+(ways in which the base~component can fail). The total number of base component failure modes
+is $N \times K$. The number of checks required to examine the effect that one failure mode has on all
+the other components\footnote{A base component failure will typically affect the sub-system
+it is part of, and create a failure effect at the SYSTEM level.}
will be $(N-1) \times N \times K$, in effect a set cross product.
@@ -207,18 +216,21 @@ Or we may have a mechanical device that has a different failure mode behaviour for
say, different ambient pressures or temperatures.
If $E$ is the number of applied states or environmental conditions to consider
-in a system, the job of the bottom-up analyst is complicated by a cross product factor again
+in a system, the bottom-up analyst is presented with an
+additional cross product factor,
$(N-1) \times N \times K \times E$.
If we put some typical very small embedded system numbers\footnote{These figures would
-be typical of a very simple temperature controller, with a micro-controller sensor and heater circuit} into this, say $N=100$, $K=2.5$ and $E=10$
+be typical of a very simple temperature controller, with a micro-controller, sensor
+and heater circuit} into this, say $N=100$, $K=2.5$ and $E=10$,
we have $99 \times 100 \times 2.5 \times 10 = 247500$.
To look in detail at a quarter of a million test cases is obviously impractical.
If we were to consider multiple simultaneous failure modes,
-we have yet another complication cross product.
+we would have yet another cross product of checks to perform.

-For instance for looking at double simultaneous failure modes,
-the equation reads $(N-2) \times (N-1) \times N \times K \times E$.
+For instance, for double simultaneous failure modes, where $\#C$
+is the number of checks to perform,
+the equation reads $\#C = (N-2) \times (N-1) \times N \times K \times E$.
The bottom-up methodologies FMEA, FMECA and FMEDA take single failure modes and link
them to SYSTEM level failure modes.
Because of the astronomical number of possible interactions,
@@ -232,7 +244,7 @@ component failure mode to the SYSTEM level).
An ideal static failure mode methodology would build a failure mode model
from which the traditional four models could be derived.
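Before examining what such an ideal methodology would require, it is worth making
the scale of the cross products described above concrete. The short sketch below is
illustrative only: the function names are our own, and the figures are simply the
$N=100$, $K=2.5$, $E=10$ example used above.

\begin{verbatim}
# Illustrative only: evaluates the cross product check counts given above.
def single_failure_checks(n, k, e=1):
    # (N-1) x N x K x E : each failure mode considered against every
    # other component, under each applied/environmental state.
    return (n - 1) * n * k * e

def double_failure_checks(n, k, e=1):
    # #C = (N-2) x (N-1) x N x K x E for double simultaneous failures.
    return (n - 2) * (n - 1) * n * k * e

# Example figures from the text: a very simple temperature controller.
N, K, E = 100, 2.5, 10
print(single_failure_checks(N, K, E))   # 247500.0
print(double_failure_checks(N, K, E))   # 24255000.0
\end{verbatim}

Even for this very small system, covering double simultaneous failure modes would imply
over twenty-four million checks, which is why the traditional bottom-up methodologies
restrict themselves to single failure modes.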
Such an ideal methodology would also address the shortcomings of the other methodologies, and
-would have a user friendly interface, with a visual (rather than mathematical/formal) syntax with icons
+would have a user-friendly interface, with a visual (rather than symbolic) syntax using icons
to represent the results of analysis phases.
%
%There are four static analysis failure mode methodologies in common use.
@@ -251,7 +263,7 @@ systems in the early 1960s
and was not designed as a rigorous
fault/failure mode methodology.
It was designed to look for disastrous top-level hazards and
determine how they could be caused.
-It is more like a structure to
+It is more like a procedure to
be applied when discussing the safety of a system,
with a top-down hierarchical notation using logic symbols that guides the analysis.
This methodology was designed for
@@ -265,7 +277,7 @@ system level outcomes.

\subsubsection{ FTA weaknesses }

\begin{itemize}
-\item Possibility to miss component failure modes
+\item Possibility to miss component failure modes.
\item Possibility to miss environmental effects.
\item No possibility to model base component level double failure modes.
\end{itemize}
@@ -279,7 +291,11 @@ The investigation will typically point to a particular failure
of a component.
The methodology is now applied to find the significance of the failure.
It is based on a simple equation where $S$ ranks the severity (or cost \cite{bfmea}) of the identified SYSTEM failure,
-$O$ its occurrance, and $D$ giving the failures detectability. Muliplying these
+$O$ its occurrence\footnote{The occurrence $O$ is the
+probability of the failure happening.},
+and $D$ the failure's detectability\footnote{Detectability: often failures
+may occur but not be noticed or cause an effect.
+Consider an unused feature failing.}. Multiplying these
together gives a risk priority number (RPN), given by $RPN = S \times O \times D$.
This gives in effect
@@ -293,7 +309,7 @@ a prioritised `todo list', with higher the $RPN$ values being the most urgent.
\item No possibility to model base component level double failure modes.
\end{itemize}

-\paragraph{note.} FMEA is sometimes used in its literal sense, that is to say
+\paragraph{Note.} FMEA is sometimes used in its literal sense, that is to say
Failure Mode Effects Analysis, simply looking at a system's internal failure modes
and determining what may happen as a result.
FMEA described in this section (\ref{pfmea}) is sometimes called `production FMEA'.
@@ -311,21 +327,23 @@
electronic components was published by the DOD in 1991 (MIL HDK 1991 \cite{mil1991}) and is a typical
source for MTTF data.
%
-FMECA has a probability factor for a component causing
+FMECA has a probability factor for a component error becoming % causing
a SYSTEM level error. This is termed the $\beta$ factor.
%\footnote{for a given component failure mode there will be a $\beta$ value, the
%probability that the component failure mode will cause a given SYSTEM failure}.
%
This lacks precision, or in other words, determinability (prediction accuracy) \cite{fafmea},
-as often the component failure mode cannot be proven to cause a SYSTEM level failure, but
+as often the component failure mode cannot be proven to cause a SYSTEM level failure, but
is assigned a probability $\beta$ factor by the design engineer.
The use of a $\beta$ factor is often justified using Bayes' theorem \cite{probstat}.
%Also, it can miss combinations of failure modes that will cause SYSTEM level errors.
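Both FMEA and FMECA ultimately produce a ranked list of failures. The following minimal
sketch shows the prioritisation using the $RPN = S \times O \times D$ formula given in the
FMEA section above; the failure descriptions and the $S$, $O$, $D$ scores are invented purely
for illustration. FMECA produces its ranking in a similar spirit, using the failure
statistics and $\beta$ factor just described.

\begin{verbatim}
# Illustrative only: the prioritised `todo list' produced by production FMEA.
# RPN = S x O x D as given above; descriptions and scores are invented.
failures = [
    # (description, severity S, occurrence O, detectability D)
    ("relay contacts weld closed",    8, 3, 7),
    ("sensor wiring open circuit",    6, 5, 2),
    ("smoothing capacitor dries out", 4, 6, 5),
]

def rpn(severity, occurrence, detectability):
    return severity * occurrence * detectability

# Highest RPN first: the most urgent failures to address.
for desc, s, o, d in sorted(failures, key=lambda f: rpn(*f[1:]), reverse=True):
    print(f"RPN={rpn(s, o, d):3d}  {desc}")
\end{verbatim}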
%
The results of FMECA are similar to FMEA, in that component errors are
-listed according to importance of fixing it to prevent the SYSTEM fault of given criticallity.
-Again this essentially produces a prioritised todo list.
+listed according to importance, based on
+probability of occurrence and criticality.
+% to prevent the SYSTEM fault of given criticallity.
+Again this essentially produces a prioritised `todo' list.

%%-WIKI- Failure mode, effects, and criticality analysis (FMECA) is an extension of failure mode and effects analysis (FMEA).
%%-WIKI- FMEA is a a bottom-up, inductive analytical method which may be performed at either the functional or
@@ -362,7 +380,7 @@ The following gives an outline of the procedure.

\subsubsection{Two statistical perspectives}

-FMEDA is a statistical analysis methodology is used from one of two perspectives,
+FMEDA is a statistical analysis methodology and is used from one of two perspectives,
Probability of Failure on Demand (PFD), and Probability of Failure
in Continuous Operation, or Failure in Time (FIT).
\paragraph{Failure in Time (FIT).} Continuous operation is measured in failures per billion ($10^9$) hours of operation.
@@ -372,7 +390,7 @@ we would be interested in its operational FIT values.
\paragraph{Probability of Failure on Demand (PFD).} For instance with an anti-lock system in
automobile braking, or other fail-safe measure applied in an emergency, we would be interested in PFD.
That is to say, the ratio of it failing
-to succeeding on demand.
+to operating correctly on demand.

\subsubsection{The FMEDA Analysis Process}
@@ -388,9 +406,10 @@ environmental conditions. The SYSTEM errors are categorised as `safe' or `danger
%Statistical data exists for most component types \cite{mil1992}.
%
This phase is typically implemented on a spreadsheet
-with rows representing each component. A typical component spreadshet row would
+with rows representing each component. A typical component spreadsheet row would
comprise
-component type, placing in the system, part number, environmental stress factors, MTTF, safe/dangerous etc.
+component type, placement,
+part number, environmental stress factors, MTTF, safe/dangerous etc.
%will be a determination of whether the component failing will lead to a `safe'
%or `unsafe' condition.
@@ -410,6 +429,7 @@ This is done by taking a component failure mode and determining if the SYSTEM
error it is tied to is dangerous or safe.
The decision for this may be based on
heuristics or field data.
+EN61508 uses the $\lambda$ symbol to represent failure rates (probabilities of failure per unit time).
Because we have statistics for each component failure mode,
we can now classify these in terms of safe and dangerous lambda values.
These failure probabilities are labelled `$\lambda_D$' (for
@@ -417,8 +437,8 @@ dangerous) and `$\lambda_S$' (for safe) \cite{en61508}.

\paragraph{Determine Detectable and Undetectable Failures.}
Each safe and dangerous failure mode is now
-classified as detectable or un-detectable, this
-is determined by the SYSTEM’s
+classified as detectable or undetectable.
+EN61508 assumes that products have a high level of self-checking features.
%
This gives us four failure mode classifications:
@@ -436,7 +456,7 @@
next step is to investigate using an actual working SYSTEM.
Failures are deliberately caused (by physical intervention), and any new SYSTEM level failures are added to the model.
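A minimal sketch of the classification just described follows. The component failure
modes and failure rates are invented, and the two ratios printed at the end correspond
to the diagnostic coverage figures defined in the following paragraphs.

\begin{verbatim}
# Illustrative only: FMEDA-style classification of failure modes into the
# four lambda classes. Component names and failure rates (FIT) are invented.
modes = [
    # (failure mode, failure rate, dangerous?, detected?)
    ("R1 open circuit",      2.0, True,  True),   # counts towards lambda_DD
    ("ADC stuck at zero",    1.5, True,  False),  # counts towards lambda_DU
    ("LED driver short",     3.0, False, True),   # counts towards lambda_SD
    ("pull-up drifts high",  0.5, False, False),  # counts towards lambda_SU
]

def lam(dangerous, detected):
    # Sum the failure rates falling into one of the four classes.
    return sum(rate for _, rate, d, det in modes
               if d == dangerous and det == detected)

lambda_dd, lambda_du = lam(True, True), lam(True, False)
lambda_sd, lambda_su = lam(False, True), lam(False, False)

# The two ratios below correspond to the diagnostic coverage
# figures defined in the next paragraphs.
print("DC (dangerous) =", lambda_dd / (lambda_dd + lambda_du))
print("DC (safe)      =", lambda_sd / (lambda_sd + lambda_su))
\end{verbatim}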
-Hueristics and MTTF failure rates for the components
+Heuristics and MTTF failure rates for the components
are used to calculate probabilities for these new failure modes
along with their safety and detectability classifications (i.e.
$\lambda_{SD}$, $\lambda_{SU}$, $\lambda_{DD}$, $\lambda_{DU}$).
@@ -454,11 +474,16 @@ The calculations for these are described below.
The diagnostic coverage is simply the ratio
of the dangerous detected probabilities
against the probability of all dangerous failures,
-and is normally expressed as a percentage.
+and is normally expressed as a percentage. $\Sigma\lambda_{DD}$ represents
+the sum of the dangerous detected failure probabilities, and
+$\Sigma\lambda_D$ the sum of all dangerous failure probabilities.

$$ DiagnosticCoverage = \Sigma\lambda_{DD} / \Sigma\lambda_D $$

-The diagnostic coverage for safe failures is given as
+The diagnostic coverage for safe failures, where $\Sigma\lambda_{SD}$ represents the sum of
+the safe detected failure probabilities,
+and $\Sigma\lambda_S$ the sum of all safe failure probabilities,
+is given as

$$ SF = \frac{\Sigma\lambda_{SD}}{\Sigma\lambda_S} $$

@@ -498,8 +523,13 @@
There are four SIL levels, from 1 to 4 with 4 being the highest safety level.

In addition to probabilistic risk factors, the diagnostic coverage and SFF
have threshold bands becoming stricter for each level.
-Demanded software verification and specification techniques and constraints (such as language sub-sets, s/w redundancy etc)
+Demanded software verification and specification techniques and constraints
+(such as language subsets, s/w redundancy etc)
become stricter for each SIL level.
+%%
+%% Andrew asked me to expand on this here, but it would take at least two
+%% pages. I think its more appropriate for the survey.tex chapter.
+%%

Thus FMEDA uses statistical methods to determine
a safety level (SIL), typically used to meet an acceptable risk
@@ -521,7 +551,7 @@ has a calculated risk, a fault detection time (if any), an estimated risk import
and other factors such as de-rating and environmental stress.
With one component failure mode per row,
all the statistical factors for SIL rating can be produced\footnote{A SIL rating will apply
-to an installed plant, i.e. A complete SYSTEM. SIL ratings for individual components or
+to an installed plant, i.e. a complete installed and working SYSTEM. SIL ratings for individual components or
sub-systems are meaningless, and the nearest equivalent would be the FIT/PFD and SFF
and diagnostic coverage figures.}.

@@ -541,7 +571,7 @@ where he probably should assign a dangerous failure classification to it.
%
There is no analysis of how that resistor would/could affect
the components close to it, but because the circuitry
-it is part of critical section it will most likely
+is part of a critical section, it will most likely
be linked to a dangerous system level failure in an FMEDA study.
%
%%- IS THIS TRUE IS THERE A BETA FACTOR IN FMEDA????
@@ -571,7 +601,7 @@ safety level zones as recomended in EN61508\cite{en61508}. This is a vague way o
safety, as it can miss unexpected effects due to `unexpected' component interaction.

The Statistical Analysis methodology is the core philosophy
-of the Safety Integrity Levels (SIL) ebodied in EN61508 \cite{en61508}
+of the Safety Integrity Levels (SIL) embodied in EN61508 \cite{en61508}
and its international analogue standard, IEC 61508.


@@ -590,11 +620,12 @@
\item All component failure modes must be considered in the model.
\item It should be easy to integrate mechanical, electronic and software models \cite[pp.~287]{sccs}.
\item It should be re-usable, in that commonly used modules can be re-used in other designs/projects.
-\item It should have a formal basis, that is to say, it should be able to produce mathematical proofs
+\item It should have a formal basis, that is to say, be able to produce mathematical proofs
for its results, such as system level error causation trees, reliability and safety statistics.
-\item It should be easy to use, ideally using a graphical syntax (as oppossed to a formal mathematical one).
+\item It should be easy to use, ideally using a
+graphical syntax (as opposed to a formal symbolic/mathematical text-based language).
\item From the top down, the failure mode model should follow a logical de-composition of the functionality
-to smaller and smaller functional modules \cite{maikowski}.
+to smaller and smaller functional groupings \cite{maikowski}.
\item Multiple failure modes may be modelled from the base component level up.
\end{itemize}

@@ -608,16 +639,16 @@ and start with the component failure modes.
%
\paragraph{Natural Fault Finding is top-down.}
The traditional, or natural, fault finding approach
-is to work from the top down.
+is to start at the top with SYSTEM level failure modes/faults.
%
On encountering a fault, the symptom is first observed at the top or
-SYSTEM level. By de-composing the functionality of the faulty system and testing
-we can further de-compose the system until we find the
+SYSTEM level. By decomposing the functionality of the faulty system and testing
+we can further decompose the system until we find the
faulty base level component.
-De-composition of electrical circuits is formalised and explored
+Decomposition of electrical circuits is formalised and explored
in \cite{maikowski}. This top-down technique decomposes by functionality.
-Simpler and simpler functional blocks are discovered as we delve
+Simpler and simpler functional groups are discovered as we delve
further into the way the system works and is built.


@@ -644,6 +675,9 @@ into manageable and separately testable entities.
A second justification for this is that the design process for a product requires both top-down and bottom-up
thinking. To analyse a system from the bottom-up is a useful design validation process in itself \cite{sommerville}.
+%%
+%% CAN we find a ref for both top and bottom up being used
+%% as design validaion ????

\paragraph{Design Decision: Methodology must be bottom-up.}
In order to ensure that all component failure modes are handled,
@@ -656,10 +690,15 @@
A hierarchy of functional grouping, leading to a system model
still leaves us with the problem of the number of component failure modes.
The base components will typically have several failure modes each.
%
-Given a typical embedded system may have hundreds of components
+A typical embedded system may have hundreds of components.
This means that we would still have to tie base component failure modes
-to SYSTEM level errors. This is the `possibility to miss failure mode effects
-at SYSTEM level' criticism of the FTA, FMEDA and FMECA methodologies.
+to SYSTEM level errors.
+The problem with this is that the effects of the base component failure mode under investigation
+are not rigorously examined in relation to functionally adjacent components.
+Thus the `possibility to miss failure mode effects
+at SYSTEM level' criticism of the FTA, FMEDA and FMECA methodologies applies here as well.
%%%
%%% OK Got up to here Lunchtime edit 06DEC2010.............

\paragraph{Design Decision: Methodology must reduce and collate errors at each functional group stage.}
SYSTEMS typically have far fewer failure modes than the sum of their component failure modes.
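As a toy illustration of this reduction, consider a small hypothetical {\fg}. All component
names, failure modes and symptoms below are invented, and the mapping is intended only to
mirror the kind of analysis a {\fg} undergoes in FMMD.

\begin{verbatim}
# Toy illustration only: component names, failure modes and symptoms are
# invented. Each component failure mode in the functional group is mapped
# to the symptom it produces for that group as a whole.
component_failure_modes = {
    "R1 (resistor)": ["open", "short"],
    "R2 (resistor)": ["open", "short"],
    "IC1 (op-amp)":  ["output stuck high", "output stuck low"],
}

symptom_of = {
    ("R1 (resistor)", "open"):              "output reads high",
    ("R1 (resistor)", "short"):             "output reads low",
    ("R2 (resistor)", "open"):              "output reads low",
    ("R2 (resistor)", "short"):             "output reads high",
    ("IC1 (op-amp)",  "output stuck high"): "output reads high",
    ("IC1 (op-amp)",  "output stuck low"):  "output reads low",
}

base_modes = sum(len(modes) for modes in component_failure_modes.values())
derived_modes = sorted(set(symptom_of.values()))
print(base_modes, "base failure modes collapse to", len(derived_modes),
      "derived failure modes:", derived_modes)
\end{verbatim}

Here six base component failure modes collapse into just two failure modes for the
resulting {\dc}, and it is these collected symptoms that are carried up to the next
level of analysis.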