diff --git a/papers/software_fmea/Makefile b/papers/software_fmea/Makefile index 01fc0d4..93f6944 100644 --- a/papers/software_fmea/Makefile +++ b/papers/software_fmea/Makefile @@ -1,5 +1,5 @@ -PNG = fmmdh.png ct1.png hd.png +PNG = fmmdh.png ct1.png hd.png ftcontext.png %.png:%.dia dia -t png $< diff --git a/papers/software_fmea/ftcontext.dia b/papers/software_fmea/ftcontext.dia new file mode 100644 index 0000000..6688251 Binary files /dev/null and b/papers/software_fmea/ftcontext.dia differ diff --git a/papers/software_fmea/software_fmea.tex b/papers/software_fmea/software_fmea.tex index f8966bb..4787ff8 100644 --- a/papers/software_fmea/software_fmea.tex +++ b/papers/software_fmea/software_fmea.tex @@ -138,9 +138,10 @@ component failure modes on a system. It is used both as a design tool (to determine weakness), and is a requirement of certification of safety critical products. FMEA has been successfully applied to mechanical, electrical and hybrid electro-mechanical systems. -Work on software FMEA is begining~\cite{sfmea}~\cite{sfmeaa}, but -at present no technique for Software FMEA that +Work on software FMEA is beginning~\cite{sfmea}~\cite{sfmeaa}, but +at present no technique for software FMEA that integrates hardware and software models known to the authors exists. +% Software generally, sits on top of most modern safety critical control systems and defines its most important system wide behaviour and communications. Standards~\cite{en298}~\cite{en61508} that use FMEA @@ -148,9 +149,10 @@ do not specify it for Software, but do specify, good practise, review processes and language feature constraints. This is a weakness; where FMEA scientifically traces component {\fms} -to resultant system failures; software has been left in a non-analytical +to resultant system failures, software has been left in a non-analytical limbo of best practises and constraints. 
-If software FMEA were possible electro-mechanical-software hybrids could +% +If software FMEA were possible, electro-mechanical-software hybrids could be modelled; and could thus be `complete' failure mode models. %Failure modes in components in say a sensor, could be traced %up through the electronics and then through the controlling software. @@ -164,13 +166,16 @@ and integrate-able with FMEA performed on mechanical and electronic systems. { This paper describes a modular FMEA process that can be applied to software. This modular variant of FMEA is called Failure Mode Modular de-composition (FMMD). -Because this process is based on failure modes of components +% +Because this process is based on failure modes of components, it can be applied to electrical and/or mechanical systems. +% The hierarchical structure of software is then examined, -and then definitions from contract programming are used +and definitions from contract programming are used to define failure modes and failure symptoms in software functions. -With these definitions we can apply FMEA +% +With these definitions we can apply a modular form of FMEA to existing software\footnote{Existing software excluding recursive~\cite{misra}[16.2] code, and unstructured non-functional languages}. } @@ -195,7 +200,7 @@ the failures to fix in order of cost. Deisgn FMEA (DFMEA) is FMEA applied at the design or approvals stage where the aim is to ensure single component failures cannot cause unacceptable system level events. -Failure Mode effect Criticality Analysis (FMECA) is applied to determine the most potentially dangerous or damaging +Failure Mode Effect Criticality Analysis (FMECA) is applied to determine the most potentially dangerous or damaging failure modes to fix. @@ -207,6 +212,40 @@ FMMD is a modularisation of FMEA and can produce failure~mode models that can be all the above variants of FMEA. 
+\subsection{Current FMEA techniques are not suitable for software} + +The main FMEA methodologies are all based on the concept of taking +base component {\fms}, and translating them into system level events/failures. +In a complicated system, mapping a component failure mode to a system level failure +will mean a long reasoning distance; that is to say the actions of the failed component will have to be traced through +several sub-systems, along with the effects of other components on the way. +With software at the higher levels of these sub-systems +we have another layer of complication. + +In order to integrate software in a meaningful way, we need to re-think the +FMEA concept of mapping a base component failure to a system level event. + + +One strategy would be to modularise FMEA, that is, to break down the failure effect +reasoning into small modules. +% +If we pre-analyse modules, they +can then be combined with others into +larger sub-systems, and eventually form a hierarchy of failure mode behaviour for the entire system. +% +With higher level modules, we can approach the level that the software resides in. +% +For instance, to read a voltage into software via an ADC we rely on an electronic sub-system +that conditions the input signal and then routes it through a multiplexer to the ADC. +% +We could easily consider this electronics a module, and a +failure mode model for it makes modelling the software to hardware interface +far simpler. +% +The failure mode model would give us the ways in which the signal conditioning +and multiplexer could fail. We can use this to work out how our software +could fail, and with this create a modular FMEA model of the software. + \section{Modularising FMEA} @@ -219,7 +258,7 @@ We can call these {\fgs}. We can then analyse the failure mode behaviour of a {\ using all the failure modes of all its components. 
% When we have its failure mode behaviour, or the symptoms of failure from the perspective of the {\fg}, -we now treat the {\fg} as a {\dc}; where the failure modes of the {\dc} are the symptoms of failure of the {\fg}. +we now treat the {\fg} as a {\dc}, where the failure modes of the {\dc} are the symptoms of failure of the {\fg}. % % We can now use {\dcs} to build higher level {\fgs} until we have a complete hierarchical model @@ -229,8 +268,8 @@ is given in~\cite{syssafe2011}. \paragraph{FMMD, the process.} The main aim of Failure Mode Modular Discrimination (FMMD) is to build a hierarchy of failure behaviour from the {\bc} -level up to the top, or system level, with analysis stages, {\fgs} %and corresponding {\dcs} -, between each +level up to the top, or system level, with analysis stages ({\fgs}) %and corresponding {\dcs} +between each transition to a higher level in the hierarchy. @@ -242,7 +281,7 @@ From the point of view of fault analysis, we are not interested in the component A {\fg} is a collection of components that perform some simple task or function. % In order to determine how a {\fg} can fail, -we need to consider all failure modes of its components. +we need to consider all the failure modes of its components. % By analysing the fault behaviour of a `{\fg}' with respect to all its components failure modes, we can determine its symptoms of failure. @@ -250,11 +289,16 @@ we can determine its symptoms of failure. %the symptoms of failure for the {\fg}. With these symptoms (a set of derived faults from the perspective of the {\fg}) -we can now state that the {\fg} (as an entity in its own right) can fail in a number of well defined ways. +we can now state that the {\fg} +% (as an entity in its own right) +can fail in a number of well defined ways. % In other words we have taken a {\fg}, and analysed how -\textbf{it} can fail according to the failure modes of its components, and then -determined the {\fg} failure modes. 
+%\textbf{it} +it can fail according to the failure modes of its components, and then +determined the {\fg} failure symptoms. +We then create a new {\dc} which has as its {\fms} the failure symptoms +of the {\fg} that it was derived from. % \paragraph{Creating a derived component.} % We create a new `{\dc}' which has @@ -279,15 +323,17 @@ determined the {\fg} failure modes. We can use the symbol $\bowtie$ to represent the creation of a derived component from a {\fg}. We show an FMMD hierarchy in figure~\ref{fig:fmmdh}. -Using this diagram we can follow the creation of the hierarcy in +Using this diagram, we can follow the creation of the hierarchy in a theoretical system. +% There are three functional groups comprised of {\bcs}. These are analysed individually using FMEA. That is to say their component failure modes are examined, and the -the ways in which the {\fgs} fail; its symptoms of failure are determined. +ways in which each {\fg} fails (its symptoms of failure) are determined. +% The `$\bowtie$' function is now applied to create {\dcs}. These are shown in figure~\ref{fig:fmmdh} above the {\fgs}. -Now that we have {\dcs} we can use them to form a higher level functional group. +Now that we have {\dcs}, we can use them to form a higher level functional group. We apply the same FMEA process to this and can derive a top level derived component (which has the system---or top---level failure modes). @@ -306,17 +352,38 @@ programmatic function call tree. If FMEA can be applied to software we can build complete failure models of typical modern safety critical systems. -With modular FMEA (FMMD) we have the concepts of failure~modes +With modular FMEA i.e. FMMD %(FMMD) +we have the concepts of failure~modes of components, {\fgs} and symptoms of failure for a functional group. -A programmatic function is very similar to a f via hardware interactionunctional group. 
-It calls other functions, and uses data sources via hardware interaction, which could be viewed as its `components'. -It has outputs which will be used by functions that may call it. - map the FMMD concepts of {\fms}, {\fgs} and {\dcs} -to software functions. +A programmatic function has similarities with a {\fg} as defined by the FMMD process. +% +An FMMD {\fg} is placed into a hierarchy. +Similarly, a software function is placed into a hierarchy, that of its call-tree. +A software function typically calls other functions and uses data sources via hardware interaction, which could be viewed as its `components'. +It has outputs, i.e. it can perform actions +on data or hardware, +which will be used by functions that may call it. + +We can map a software function to a {\fg} in FMMD. Its failure modes +are the failure modes of the software components (other functions it calls) +and the hardware it reads values from. +Its outputs are the data it changes, or the hardware actions it performs. + +When we have analysed a software function, initially using its input failure modes, +we can determine its symptoms of failure (how calling functions will see its failure mode behaviour). + +We can thus apply the $\bowtie$ process to software functions, by viewing them in terms of their failure +mode behaviour. This is simplified by the fact that software already fits into a hierarchy. +For electronic and mechanical systems, although we may be guided by the original designers' +concepts of modularity and sub-systems in design, applying FMMD means deciding on the members for {\fgs} +and the subsequent hierarchy. With software already written, that hierarchy is fixed. + +% map the FMMD concepts of {\fms}, {\fgs} and {\dcs} +%to software functions. % %However, we need to map a the FMMD concepts of {\fms}, {\fgs} and {\dcs} -to software functions. +%to software functions. % failure modes of a function in order to %map FMMD to software. 
@@ -328,12 +395,21 @@ Because of this we can assume a direct call tree. Functions call functions from the top down and eventually call the lowest level library or IO functions that interact with hardware/electronics. +What is potentially difficult with a software function is deciding what +its failure modes are, and later what its failure symptoms are. +With electronic components, we can use literature to point us to suitable sets of +{\fms}~\cite{en298}~\cite{fmd91}~\cite{mil1991}~\cite{en61508}. +With software, only some library functions are well known and rigorously documented +enough to have the equivalent of known failure modes. +Most software is `bespoke'. We need a different strategy to +describe the failure mode behaviour of software functions. +We can use definitions from contract programming to assist here. \subsection{Contract programming description} Contract programming is a discipline~\cite{dbcbe} for building software functions in a controlled and traceable way. Each function is subject to pre-conditions (constraints on its inputs), -post-conditions (constraints` on its outpu'ts) and function wide invariants (rules). +post-conditions (constraints on its outputs) and function wide invariants (rules). \paragraph{Mapping contract `pre-condition' violations to failure modes} @@ -343,7 +419,7 @@ defines the correct ranges of input conditions for the function to operate successfully. For a software function, a violation of a pre-condition is -in effect a failure mode of `one of its com'ponents. +in effect a failure mode of `one of its components'. \paragraph{Mapping contract `post-condition' violations to symptoms} @@ -354,19 +430,52 @@ Post conditions could be either actions performed (i.e. 
the state of hardware \paragraph{Mapping contract `invariant' violations to symptoms and failure modes} -Invariants in contract programming may apply to inputs to the function (where the can be considered {\fms} in FMMD terminology), -and to outputs (where the can be considered {failure symptoms} in FMMD terminology). +Invariants in contract programming may apply to inputs to the function (where they can be considered {\fms} in FMMD terminology), +and to outputs (where they can be considered {failure symptoms} in FMMD terminology). \subsection{Software FMEA} +For the purpose of example, we chose a simple common safety critical industrial circuit +that is nearly always used in conjunction with a programmatic element. +A common method for delivering a quantitative value in analogue electronics is +to supply a current signal to represent it~\cite{aoe}[p.849]. +Usually, 4mA represents a zero or starting value and 20mA represents the full scale, +and this is referred to as {\ft} signalling. +% +{\ft} has an electrical advantage as well: because the current in a loop is constant~\cite{aoe}[p.20], +resistance in the wires between the source and the receiving end does not +alter the accuracy of the signal. +% +This circuit has many advantages for safety. If the signal becomes disconnected, +it reads an out of range 0mA at the receiving end. This is outside the {\ft} range, +and is therefore easy to detect as an error rather than an incorrect value. +% +Should the driving electronics go wrong at the source end, it will usually +supply far too little or far too much current, making an error condition easy to detect. +% +At the receiving end, we only require one simple component to convert the +current signal into a voltage that we can read with an ADC: the humble resistor! 
+ + +%BLOCK DIAGRAM HERE WITH FT CIRCUIT LOOP + +\begin{figure}[h] + \centering + \includegraphics[width=230pt]{./ftcontext.png} + % ftcontext.png: 767x385 pixel, 72dpi, 27.06x13.58 cm, bb=0 0 767 385 + \caption{Context Diagram for {\ft} loop} + \label{fig:ftcontext} +\end{figure} + + \subsection{Simple Software Example} Consider a function that reads a {\ft} input, and returns a value between 0 and 999 (i.e. per mil $\permil$) representing the current detected with an additional error indication flag . -Let us assume the {\ft} detection is via a \ohms{220} resistor., and that we read a voltage +Let us assume the {\ft} detection is via a \ohms{220} resistor, and that we read a voltage from an ADC into the software. Let us define any value outside the 4mA to 20mA range as an error condition. % @@ -423,15 +532,20 @@ int read_4_20_input ( int * value ) { %} \label{fig:code_read_4_20_input} \caption{Software Function: \textbf{read\_4\_20\_input}} -\label{fig:420i} +%\label{fig:420i} \end{figure} We now look at the function called by \textbf{read\_4\_20\_input}, \textbf{read\_ADC}, which returns a -voltage for a given ADC channel. This function -deals directly with the hardware in the micro-controller we are running the software on. +voltage for a given ADC channel. +% +This function +deals directly with the hardware in the micro-controller that we are running the software on. +% Its job is to select the correct channel (ADC multiplexer) and then to initiate a conversion by setting an ADC 'go' bit (see code sample in figure~\ref{code_read_ADC}). -It takes the raw ADC reading and converts it into a floating point\footnote{the type, `double' or `double precision', is a standard C language floating point type~\cite{kandr}.} +% +It takes the raw ADC reading and converts it into a +floating point\footnote{the type, `double' or `double precision', is a standard C language floating point type~\cite{kandr}.} voltage value. 
@@ -497,12 +611,17 @@ We now have a very simple software structure, a call tree, shown in figure~\ref{ \label{fig:ct1} \end{figure} -This software is above the hardware in the call tree. -FMEA is always a bottom-up process and so we must being with the hardware. +This software is above the hardware in the conceptual call tree; that is, in software terms, the +software is reading values from the `lower~level' electronics. +% +FMEA is always a bottom-up process and so we must begin with this hardware. +% The hardware is simply a load resistor, connected across an ADC input pin on the micro-controller and ground. +% We can identify the resistor and the ADC module of the micro-controller as the base components in this design. +% We now apply FMMD starting with the hardware. @@ -573,7 +692,7 @@ With these failure modes, we can analyse our first functional group, see table~r We now have the symptoms for the hardware functional group, $\{ HIGH , LOW, V\_ERR \} $. -We can now create a {\dc} to represent this called $CMATV$. +We now create a {\dc}, called $CMATV$, to represent this. As its failure modes, are the symptoms of failure from the functional group we can now state: $$fm ( CMATV ) = \{ HIGH , LOW, V\_ERR \} $$ @@ -604,7 +723,7 @@ $$ fm(RA) = \{ CHAN\_NO, VREF \} $$ As we have a failure mode model for our function, we can now use it in conjunction with with the ADC hardware {\dc} CMATV, to form a {\fg}, where $G=\{ CMSTV, Read\_ADC \}$. -We can now analyse this hardware/software combined {\fg}. +We now analyse this hardware/software combined {\fg}. @@ -647,7 +766,7 @@ We can now analyse this hardware/software combined {\fg}. -We can now see that the symptoms of failure for the {\fg} analysed +We now have the symptoms of failure for the {\fg} analysed (see table~\ref{tbl:radc}) as $\{ VV\_ERR, HIGH, LOW \}$. We can add as well the violation of the postcondition for the function. 
This postcondition, {\em /* ensure: value is voltage input to within 0.1\% */ }, @@ -720,7 +839,7 @@ For single failures these are the two ways in which this function can fail. An $OUT\_OF\_RANGE$ will be flagged by the error flag variable. The $VAL\_ERR$ will simply mean that the value read is simply wrong. -We can now finally make a {\dc} to represent a failure mode model for our function $read\_4\_20\_input$ thus: +We can finally make a {\dc} to represent a failure mode model for our function $read\_4\_20\_input$ thus: $$fm(R420I) = \{OUT\_OF\_RANGE, VAL\_ERR\}$$
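The mapping from contract violations to failure modes and symptoms described in this change can be illustrated with a small C sketch. It is not the paper's \textbf{read\_4\_20\_input} listing. The function name, status enum and constants are hypothetical, and the input is taken as integer millivolts (rather than the double returned by \textbf{read\_ADC}) purely to keep the arithmetic exact. It assumes the \ohms{220} sense resistor stated in the text, so 4mA and 20mA correspond to 880mV and 4400mV.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout: an illustrative sketch only,
 * not the read_4_20_input listing from the paper. */

/* Outcome as seen by a calling function (its view of our symptoms). */
typedef enum { R420_OK = 0, R420_OUT_OF_RANGE = 1 } r420_status;

/* Assumed 220 ohm sense resistor: 4 mA -> 880 mV, 20 mA -> 4400 mV. */
enum { MV_AT_4MA = 880, MV_AT_20MA = 4400, PER_MIL_MAX = 999 };

/* Convert a sensed voltage in millivolts to a per mil value 0..999. */
r420_status sketch_read_4_20(int millivolts, int *value)
{
    long scaled;

    assert(value != NULL); /* caller bug, not a modelled failure mode */

    /* Pre-condition: the input lies inside the 4-20 mA window.
     * A violation is treated as a failure mode of one of this
     * function's `components' (the hardware or the ADC read). */
    if (millivolts < MV_AT_4MA || millivolts > MV_AT_20MA) {
        *value = 0;
        return R420_OUT_OF_RANGE; /* symptom flagged to the caller */
    }

    /* long is at least 32 bits, so the product cannot overflow. */
    scaled = (long)(millivolts - MV_AT_4MA) * PER_MIL_MAX
             / (MV_AT_20MA - MV_AT_4MA);

    /* Post-condition: the result is within 0..999.
     * A value that is in range but wrong (VAL_ERR in the text)
     * cannot be detected from inside this function. */
    assert(scaled >= 0 && scaled <= PER_MIL_MAX);

    *value = (int)scaled;
    return R420_OK;
}
```

Under these assumptions a reading of 2640mV (12mA) yields 499, while 0mV (an open loop) returns the out-of-range status, which corresponds to the $OUT\_OF\_RANGE$ symptom.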