diff --git a/papers/JOURNAL_fmea_sw_hw/sw_hw_fmea.tex b/papers/JOURNAL_fmea_sw_hw/sw_hw_fmea.tex index 292180d..c210438 100644 --- a/papers/JOURNAL_fmea_sw_hw/sw_hw_fmea.tex +++ b/papers/JOURNAL_fmea_sw_hw/sw_hw_fmea.tex @@ -169,7 +169,7 @@ the ability to model integrated hardware and software systems. % coverage of the combined FMEA techniques. To demonstrate FMMD a small, but complete embedded system -(including both software and hardware), +(including both software and hardware) worked example is presented to show FMMD applied to an integrated electronics/software system. %, the industry standard @@ -183,17 +183,23 @@ integrated electronics/software system. FMEA stands for Failure Mode Effects Analysis. % -All components used to build a system can fail, also +All components used to build a system can fail; also they may fail in more than one way. The ways in which a component can fail, are known as its {\fms}. -At its simplest FMEA means taking taking a {\fm} of a component and predicting +At its simplest FMEA means taking a {\fm} of a component and predicting what problems it may cause for the system it is part of. % One way the electronic component the resistor can fail for instance, is if it were -to go open circuit. It could be because it was not soldered on properly and fell off, +to go open circuit. +% +This open circuit could be because it was not soldered on properly and fell off, it could have had an internal mechanical fault or it could have been destroyed/burnt~off by too much -electrical current. The cause does not matter. The fact that it can fail by going open circuit does. +electrical current. +% +The cause does not matter. +% +The fact that it can fail by going open circuit does. % This then is one of the {\fms} of a resistor, $OPEN$. % @@ -223,7 +229,7 @@ This means looking at every component in the system, and for each of those compo examining all known failure modes in the context of the system that it is part of. 
% Various handbooks and international standards list common components and -their know failure modes, often with accompanying statistics~\cite{en298, fmd91, mil1991}. +their known failure modes often with accompanying statistics~\cite{en298, fmd91, mil1991}. \subsection{Origins of FMEA techniques} %FMEA methodologies trace from the 1940's and were designed to @@ -250,7 +256,7 @@ programmatic/software elements. Software generally sits on top of most modern safety critical control systems and defines its most important system wide behaviour and communications. % -A typical control system, be in in a car or a microwave oven in the kitchen +A typical control system, be it in a car or a microwave oven in the kitchen will generally combine a micro-controller with electronics. It will form a hierarchy where low level electronics is implemented at the bottom, which prepares input/output (IO) @@ -276,8 +282,7 @@ do not specify FMEA for software but instead essentially just specify good pract i.e. review processes and language feature constraints. % That is to say FMEA has no formal framework for following -failure modes from low level hardware elements through into the software models. - +failure modes from low level hardware elements through into software models. % This is a weakness. % @@ -410,7 +415,7 @@ where the aim is to ensure that single component failures (at least) cannot cause unacceptable system level events~\cite{iec60812,boffin}, \item Failure Mode Effect Criticality Analysis (FMECA) is applied to determine the most potentially dangerous or damaging -failure modes to fix, using FMEA in conjunction with severity and failure probability figures~\cite{fmeca,mil1991,fmd91}, +failure modes using FMEA in conjunction with severity and failure probability figures~\cite{fmeca,mil1991,fmd91}, \item Failure Mode Effects and Diagnostics Analysis, is FMEA performed to determine a statistical level of safety. 
This is a fairly standard FMEA but with statistical values attached to each component {\fm}; @@ -436,13 +441,11 @@ When analysing a failure mode of a component, it is reasonable to look at how the failure mode will affect the other components in the system and to put this then into the context of the systems behaviour. % -Components may fail in several ways. European standard EN298~\cite{en298} gives the possible +Components may fail in several ways. European standard EN298~\cite{en298} gives two possible failure modes for a resistor as $OPEN$ and $SHORT$ for instance. % The term $f$ is defined as the number of component failure modes for a given component. -A system will have $N$ number of components. - - +%A system will have $N$ number of components. In the case of the resistor $f$ is two ~\footnote{A resistor is assigned two failure modes by the European Burner standard EN298~\cite{en298} @@ -498,7 +501,7 @@ the sum of these multiplications for all its components. % it contains. Take a hypothetical small system with say 100 components, with three failure modes per component, %this %would give an exhaustive reasoning distance for single failure analysis---of $3 \times 100 \times 99$. -that means to for each {\fm} of every component, i.e. $3$ checks, would have to be made +that means for each {\fm} of every component, i.e. $3$ checks, would have to be made against 99 other components. There are 100 components in this hypothetical example for single failure analysis this means $3 \times 100 \times 99$ checks. % @@ -547,7 +550,7 @@ of a failure mode with all other components in a system would have to be examine Or in other words, all possible failure scenarios considered. % %to do this completely (all failure modes against all components). 
-This is represented in the equation below, %~\ref{eqn:fmea_state_exp}, +This is represented in equation~\ref{eqn:fmea_single} below, %~\ref{eqn:fmea_state_exp}, where $N$ is the total number of components in the system, $RD_{single}$ is the reasoning~distance and $f$ is the number of failure modes per component: % @@ -566,7 +569,7 @@ The hypothetical example described above gives $100 \times 99 \times 3 = 29,700 %%% SANITY CHECK. %%% -When stating a general equation such as equation~\ref{eqn:fmea_single} it can be sanity checked +When stating a general equation such as equation~\ref{eqn:fmea_single}, it can be sanity checked by thinking of common examples. For instance a simple amplifier circuit with a handful of components would have a low $RD_{single}$ count of potential failure mode to components checks. @@ -576,9 +579,9 @@ how it would react to well defined component failure modes. For a larger circuit the problems of tracing side effects of the failure mode through the circuit mean that it is likely to be a far more complex task. - +% The order $O(N^2)$ for FMEA complexity, for single failures, therefore agrees with experience. - +% In general terms, for a very simple small circuit, a better understanding of failure effects is expected, than for a very large system where there are more variables and potential {\fm} interactions. % @@ -591,7 +594,7 @@ scenarios\footnote{Certain double failure scenarios are already legal requirements---The European Gas burner standard (EN298:2003~\cite{en298}) for instance---demands the checking of double failure scenarios (for burner lock-out scenarios).} % -(two components failing within a given time frame) and the order becomes $O(N^3)$. +(two components failing within a given time frame) the order becomes $O(N^3)$. 
Where $RD_{double}$ is the reasoning~distance for double failure scenarios: \begin{equation} \label{eqn:fmea_double} @@ -620,7 +623,7 @@ Current FMEA methodologies cannot consider---for the reason of state explosion-- %\fmmdglossSTATEEX % %Because for practical reasons, -In practical terms XFMEA cannot be performed for anything other than a trivial system, +In practical terms XFMEA cannot be performed for anything other than a trivial system; instead reliance is placed upon experts on the system under investigation to perform a meaningful analysis. % @@ -632,7 +635,7 @@ these experts have to select the areas they see as most critical for detailed FM it is usually impossible, for reasons of time to perform the work, to action a detailed level of analysis on all component {\fms} on anything but a very small %hypothetical -system. +system (i.e. XFMEA). % \subsection{Component Tolerance} % @@ -732,7 +735,8 @@ The automotive industry, because of mass production, must make products that hav but must also be affordable. % This leads to specialist firms producing modules, such as automatic braking systems, -that are bought in and assembled to make an auto-mobile. +that are bought in and assembled % better word than assembled???? included??? +to make an automobile. % Performing failure analysis using the basic component single failure modes to system failure mapping, would thus be very difficult: this would require expert knowledge @@ -745,7 +749,7 @@ of the design behaviour and component types used in each module. % Some modular FMEA techniques are starting to be used and specified, and are described below. -\paragraph{Automotive SIL (ASIL) --- modularisation of FMEDA} +\paragraph{Automotive SIL (ASIL) --- modularisation of FMEDA.} % The EN61508 variant for automotive use, as defined in standard ISO~26262, is known as Automotive SIL (ASIL)~\cite{Kafka20122}. 
% @@ -756,13 +760,13 @@ This allows automotive designers to use pre-certified modules in their designs and applies broad statistical guidelines to achieving particular safety levels by use of redundancy and automated diagnostics etc. % -Note that the ASIL modules are given a relaibility rating which can be enhanced with redundancy. +Note that the ASIL modules are given a reliability rating which can be enhanced with redundancy. It does not introduce traceable {\fm} reasoning in its hierarchy. %% %% IN SOFTWARE THIS WOULD BE TIGHTLY COUPLED AS OPPOSED TO LOOSELY COUPLED FUNCTIONS. % -\paragraph{Indenture levels --- modularisation of FMECA} +\paragraph{Indenture levels --- modularisation of FMECA.} % The US military standard for FMECA~\cite{fmeca}, describes a very broad modularity regime, that it terms `indenture' levels. @@ -776,18 +780,18 @@ an altitude radar: within that finer grained modules may be identified until the base components are listed. % Note that this is a top down approach to modularisation and -this can introduce errors into the reliability calculations~\cite{MILSTD1629short} -and miss-out some component failure modes. +this can introduce errors into the reliability calculations +by missing out some component failure modes~\cite{MILSTD1629short}. % -\paragraph{Integrated Circuits (ICs)} +\paragraph{Integrated Circuits (ICs).} Consider some commonly used ICs an op-amp is a good example. % An op-amp will have a high internal component count. It is mainly a collection of transistors on a chip -and is a complex circuit designed to give a very high and precise gain. +and is a complex circuit designed to give a very high and precise differential gain. %These are made from several components including %ransistos, resistors capactors etc. In order to perform FMEA op-amps are given @@ -851,7 +855,7 @@ and treat those sections as components in their own right. 
\subsection{The problem of Systems using software and FMEA} Software systems are becoming part of everyday life. -It is getting increasingly rarer to find systems where there is not a computer +It is getting increasingly rare to find systems where there is not a computer controlling some part of it. All modern airliners are fly-by wire. The throttle in a modern car is fly-by wire. @@ -967,7 +971,7 @@ in an improved FMEA methodology, \section{Proposed Methodology: Failure Mode Modular De-composition (FMMD)} -The basic concept behind FMMD is to from the bottom-up, modularise the problem. +The basic concept behind FMMD is to, from the bottom-up, modularise the problem. FMEA cannot easily be modularised from the top-down, because it has to deal with component failure modes. @@ -996,7 +1000,7 @@ in the circuit, these modules can then be merged to form bigger modules until there is a hierarchy and one final module representing the whole system. -\paragraph{Broadly FMMD is modularisation from the bottom-up of FMEA} +\paragraph{Broadly FMMD is modularisation from the bottom-up of FMEA.} Firstly modules are identified (for instance common circuitry formations such as amplifiers or digital outputs) and then failure mode analysis is performed on them. @@ -1043,6 +1047,14 @@ They are then considered as higher level components with their own failure mode behaviour. These higher level components are then collected to form {\fgs} and so on until a hierarchy is built representing the entire system. +% +This means that failure modes can be traced through linking the +{\fgs}. This means that the system level {\fms} can be traced back to +the component {\fms} that can cause them. +% +This gives rigorous failure mode traceability through the model. + + % Any new static failure mode methodology must ensure that it represents all component failure modes and it therefore should be bottom-up, @@ -1055,7 +1067,7 @@ bottom-level component failure modes would be handled/used. 
% Starting at the bottom means having to deal with each component failure mode from the beginning. -\section{The proposed Methodology: quick guide or `how~to'.} +\subsection{The proposed Methodology: quick guide or `how~to'.} An FMEA typically begins with a parts list and then from that a series of entries for each component failure mode. @@ -1110,7 +1122,7 @@ can be found in~\cite{clark}. FMMD is described in more detail in the section below. -\paragraph{FMMD process description} +\subsection{FMMD process detailed description} To ensure all component failure modes are modelled and traceable through stages of analysis, the new methodology must be bottom-up. % @@ -1163,11 +1175,11 @@ access to frequency analysis of digital samples called the Fast Fourier Transfor This took the Discrete Fourier Transform (DFT), and applied de-composition to its mesh of (often repeated) complex number calculations~\cite{fpodsadsp}[Ch.8].} % -By doing this it broke the computing order of complexity down from having a polynomial %n exponential +By doing this it breaks the computing order of complexity down from having a polynomial %n exponential %order to logarithmic order~\cite{ctw}[pp.401-3]. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%FFT%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -It also means that modules are re-usable (analogous to software classes). +It also means that {\fgs} are re-usable (analogous to software classes). % Where there are repeated sections of circuitry (as in for instance common types of interface) the analysis for that module may be simply re-used. @@ -1177,7 +1189,7 @@ A practical example of a hardware FMEA performed both traditionally and using FM software and hardware hybrid example is analysed in~\cite{syssafe2012} and examples of `reasoning~distance' efficiency savings can be found in~\cite{clark}[Ch.7]. 
% -\paragraph{Integrating software into the FMMD model.} +\subsection{Integrating software into the FMMD model.} % %With modular FMEA i.e. FMMD %(FMMD) %the concepts of failure~modes @@ -1220,6 +1232,103 @@ For electrical and mechanical systems, although the original system designers concepts of modularity and sub-systems in design may provide guidance, applying FMMD means deciding on the members for {\fgs} and the subsequent hierarchy. +\paragraph{Contract Programming and FMMD.} +% +With electronic components, the literature points to suitable sets of +{\fms}~\cite{fmd91}~\cite{mil1991}~\cite{en298}. %~\cite{en61508}~\cite{en298}. +% +With software only some library functions are well known and rigorously documented +enough to have the equivalent of known failure modes, +most software is `bespoke'. +% +A different strategy is required to +describe the failure mode behaviour of software functions; %. +concepts from contract programming can be used to assist in this. % here. + +\subsection{Contract programming description} +\fmmdglossCONTRACTPROG +Contract programming~\cite{dbcbe} is a discipline for building software functions in a controlled +and traceable way. Each function is subject to pre-conditions (constraints on its inputs), +post-conditions (constraints on its outputs) and function wide invariants (rules). + + +\paragraph{Mapping contract `pre-condition' violations to component failure modes.} +\fmmdglossCONTRACTPROG +A precondition, or requirement for a contract software function +defines the correct ranges of input conditions for the function +to operate successfully. +% +% C Garret said this was unclear so I have added the following two sentences. +% +%If we consider a software function to be a {\fg} in the FMMD sense, i.e. +A software function is considered to be +a collection of code, functions called and %values/ +variables used. +% +In this way it is similar to an electronic circuit, which is a collection +of components connected in a specific way. 
+% +Using this analogy for software, the connections are the function's code, and the +called functions/variables/inputs %and variables +are the components. +% +Erroneous behaviour from called functions and variables/inputs has the same effect as component failure modes +on an electronic {\fg}. +% +% +If it is considered that %consider the +called functions and variables/inputs are the components of a function, +a modular and hierarchical failure mode model +from existing software can be built. +% +Thus for FMMD applied to software, a violation of a pre-condition is considered to be equivalent to a failure mode of `one of its components'. +% +\paragraph{Mapping contract `post-condition' violations to symptoms.} +%\fmmdglossCONTRACTPROG +% +A post-condition is a definition of correct behaviour of a function. +% +A violated post-condition is a symptom of failure, or, in FMMD terms, a derived failure mode, for a function. +% +Post-conditions could relate to either actions performed (i.e. the state of hardware changed) or an output value of a function. +% +In pure contract programming, a violation of a pre-condition would cause the function to \textbf{not} be executed. +% +In implementation code, a pre-condition violation should cause +an error to be generated, and thus a post-condition to fail. +% +A function can fail for reasons other than corruption of its input data (i.e. +failure caused by variables it uses or return values from functions it calls). +% +Variables can become corrupted by radiation affecting RAM~\cite{5488118,5963919} or +by another software function erroneously overwriting variables~\cite{swseatbelt}. +% +Current work on software FMEA generally focuses on mapping +variable corruption to failure modes~\cite{procsfmea,procsfmeadb,sfmeaauto,sfmea}. +However, errors other than variable corruption can occur. 
+% +For instance a microprocessor may have subtle bugs in its instruction set, or +incorrectly handled +interrupt contention~\cite{concurrency_c_tool} which could cause side effects in software. +% +For the failure mode model of any software function, +it must be considered that all failure modes defined by post-condition +violations could simply occur. +%`components'. +% +\paragraph{Mapping contract `invariant' violations to symptoms and failure modes.} +Invariants are conditions that are considered to be relied on throughout the execution of +a program. +% +Here they are taken to mean invariants applying to data +or conditions that the function under analysis deals with or could be affected by. +% +Invariants in contract programming may apply to inputs to the function (where violations can be considered {\fms} in FMMD terminology), +and to outputs (where violations can be considered symptoms, or derived {\fms}, in FMMD terminology). +%\fmmdglossCONTRACTPROG + + + % \section{Example for analysis} % : How can we apply FMEA} @@ -1240,7 +1349,7 @@ The software then applies a PID~\cite{dcods} algorithm to determine the length/m -\section{Closed Loop Control Hardware/Software Hybrid Example} +\subsection{Closed Loop Control Hardware/Software Hybrid Example} It is desirable to model a complete standalone system with FMMD, not only a standalone system, but ideally a hybrid software/hardware system. @@ -1372,13 +1481,16 @@ functions should be called to control a process, or in `C' terms be the main fun Using figure~\ref{fig:contextsoftware} the transform bubble to represent the `main' or controlling function in the software must be chosen. % +All software functions will be written in bold with a pair of brackets +to distinguish them as such. The `C' main function is thus presented as \cf{main}. +% This can be thought of as picking one bubble and holding it up. 
% The other bubbles hang underneath forming the software call tree hierarchy, see figure~\ref{fig:context_calltree}. % From examining the diagram, and in common with established embedded programming practise, -this is clearly going to be the monitor function. +this is clearly going to be the \cf{monitor} function. % \begin{figure}[h]+ \centering @@ -1396,7 +1508,7 @@ The monitor function will orchestrate the control process. Firstly it will examine the timer value, and when appropriate, call the \cf{PID} function. % The \cf{PID} function calls \cf{determine\_set\_point\_error} which calls \cf{convert\_ADC\_to\_T} -which in turn calls \cf{Read\_ADC} (the function developed in the earlier example) +which in turn calls \cf{Read\_ADC} (a function developed and analysed using FMMD in~\cite{syssafe2012}) which reads from hardware. % With the set point error value the \cf{PID} function will return an output control value to its calling @@ -1548,7 +1660,7 @@ level in the hierarchy is found, the Pt100 sensor. Beginning at the bottom, a {\fg} is formed with the function \cf{read\_ADC} and the Pt100. This gives a {\dc}, %which we call -`Read\_Pt100' (see appendix~\ref{sec:readPt100}). +`Read\_Pt100'. % (see appendix~\ref{sec:readPt100}). % % %
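The contract mapping introduced in this change (pre-condition violations treated as failure modes of a function's `components', post-condition violations treated as symptoms) could be sketched in `C' roughly as follows. This is a minimal illustration only, not the code analysed in the paper: the function name `convert_adc_to_t_sketch`, the status names, the assumed 12-bit ADC window of 100 to 4000 counts, the 0 to 300 degree output range and the linear scaling are all hypothetical choices made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes; the names are illustrative, not from the paper. */
typedef enum {
    CONTRACT_OK = 0,
    FM_PRECONDITION_VIOLATED,       /* an input `component' has failed      */
    SYMPTOM_POSTCONDITION_VIOLATED  /* symptom, i.e. a derived failure mode */
} contract_status;

/* Assumed valid ADC window and output range (illustrative values only). */
#define ADC_MIN_VALID 100u
#define ADC_MAX_VALID 4000u
#define T_MIN_C       0.0
#define T_MAX_C       300.0

/*
 * Sketch of a contract-checked conversion from ADC counts to temperature.
 * Pre-condition : output pointer is valid and the reading lies in the window.
 * Post-condition: the computed temperature lies within [T_MIN_C, T_MAX_C].
 */
contract_status convert_adc_to_t_sketch(uint16_t adc_counts, double *temperature_c)
{
    /* Pre-condition check: a violation is reported instead of computing. */
    if (temperature_c == NULL ||
        adc_counts < ADC_MIN_VALID || adc_counts > ADC_MAX_VALID) {
        return FM_PRECONDITION_VIOLATED;
    }

    /* Assumed linear scaling across the valid window. */
    double t = T_MIN_C +
               (double)(adc_counts - ADC_MIN_VALID) * (T_MAX_C - T_MIN_C) /
               (double)(ADC_MAX_VALID - ADC_MIN_VALID);

    /* Post-condition check: kept even though the arithmetic above should
     * satisfy it, because corruption or processor faults are assumed to
     * be possible. */
    if (t < T_MIN_C || t > T_MAX_C) {
        return SYMPTOM_POSTCONDITION_VIOLATED;
    }

    *temperature_c = t;
    return CONTRACT_OK;
}
```

In an FMMD-style hierarchy, a calling function would treat any status other than `CONTRACT_OK` from this sketch as a failure mode of one of its own components. Keeping the two violation kinds distinct preserves the trace from a symptom back to its cause.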