
\label{sec:chap3}

\section*{Introduction}

This chapter examines current FMEA
practice % British English: `practice' is the noun, `practise' is the verb
in a
critical light.
Chapter~\ref{sec:chap2} introduced concepts underlying FMEA, and this chapter seeks to
use these concepts to determine the drawbacks and advantages in its current usage.
%
FMEA is legally mandatory for a large proportion of safety critical systems
in Europe and the USA; at the very least this means that experienced
engineers have to discuss a system at a level of detail starting
at {\bc} {\fms}.
\fmmdglossBC
\fmodegloss
%
This undoubtedly reveals dangers inherent in designs and makes
our lives safer. This chapter aims to identify the deficiencies in current FMEA processes,
and to look for ways in which FMEA could be performed better and more efficiently.

A major problem is the scope of
examination when applying FMEA, i.e. which, and how~many, components
should be checked against a particular failure mode.
%
\fmmdglossSTATEEX
Checking all possible combinations of {\fms} against all components quickly leads to a state explosion problem. %:
%defining limits for the number of components to check for against given {\bc}
%{\fms} could address this.
%
The difficulties of integrating software
and hardware in FMEA failure models mean that FMEA is showing its age.
It was designed in an era of simple electro-mechanical systems, whereas ubiquitous
cheap micro-controllers and processors mean that most of today's systems are
now software/hardware hybrids.
\fmeagloss
%

Even in analogue electronics, the advent of surface mount and miniature components
means that modern circuits are typically far more complex, and have
far higher component counts, than those
of the era when FMEA methodologies were invented.
%

With FMEA it is very difficult to perform %impossibility of performing
meaningful
multiple failure analysis~\cite{FMEAmultiple653556,maikowski}.
The main reasons for this are that, in electronics, each failure
can introduce a circuit topology change, and that state explosion
means there can be extremely large numbers of double failures to check.
\fmmdglossSTATEEX
%
In software, in a similar vein,
one failure can influence the programmatic behaviour and decisions made,
complicating the analysis of additional failures.
%
Dual failure analysis is required by some recent European standards~\cite{en298,en230}
and with increasing demands on safety, additional multiple failure
FMEA requirements are likely.
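
As a rough illustration of the scale involved, consider a simplified count; the symbols $N$ and $f$, and the assumption that every component has the same number of failure modes, are introduced here for the sketch only and are not taken from the cited standards.
With $N$ components, each having $f$ failure modes, there are $Nf$ single failure scenarios, but the number of scenarios in which two different components fail together is
\[
  \frac{N(N-1)}{2} \times f^2 .
\]
For a modest circuit with $N = 20$ and $f = 3$ this gives $60$ single failure scenarios against $190 \times 9 = 1710$ double failure scenarios, before any consideration of which other components each scenario must be checked against.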

Other problems, such as the inability to easily re-use, validate and audit (through
traceable reasoning) FMEA models, are also presented.
%
Finally, a list of deficiencies in current FMEA methodologies and a wish list
for an improved methodology are given.

\section{Historical Origins of FMEA and the {\bc} {\fm} to system level failure/symptom paradigm}

\subsection{FMEA: {\bc} {\fm} to system level failure modelling}
FMEA traces its roots to the 1940s when it was used to identify the most costly
failures arising from car mass-production~\cite{bfmea}.
It was later modified slightly to identify/compare severity levels of the system level failures (FMECA~\cite{fmeca}).
In the 1980s FMEA was extended again (FMEDA~\cite{fmeda}) to provide statistics
for predicting safety~levels/failure~rates.
%
However, a typical entry in each of the above methodologies starts with a
particular component failure mode and associates it with a system level (or top level) failure symptom.
%
This means that there is one analysis case per component failure mode for all the components in the system under investigation.
%
This analysis philosophy has not changed since FMEA was first used.


\subsection{FMEA does not support Traceable Reasoning}
An FMEA report normally assigns one line of a spreadsheet to
each {\bc} {\fm}.
%
This means that the reasoning involved in determining the system level failure/symptom  is described (if at all) very briefly.
%
Ideally supporting documentation would give the reasoning and calculations behind each analysis case,
but the structure of current FMEA reports does not encourage this.
%
\paragraph{Re-use of FMEA analysis.}
%
Given the {\bc} {\fm} to system level failure mode paradigm it is
difficult to re-use FMEA analysis.
%
Several strategies to aid re-use have been proposed~\cite{rudov2009language, patterns6113886, 931423}, but
the fundamental problem remains: with any change
to the component base in a system, it is very difficult to
determine which FMEA test scenarios must be re-worked.
%
It is common in safety critical systems to have repeated circuit topologies.
%
For instance there may be several signal input and output
structures that are repeated.
%
The failure mode behaviour of these repeated structures will be the same.
%
However, with the {\bc} {\fm} to system level failure mode mapping paradigm of FMEA,
the same analysis work is likely to be repeated for each instance.

\subsection{FMEA does not support modularity}
It is common practice in the process control industry to buy in sub-systems,
typically sensors and actuators connected to an industrially hardened computer bus, e.g. CANbus~\cite{can,canspec}, modbus~\cite{modbus} etc.
%
With traditional FMEA it is difficult to deal with
a `plug~and~play' paradigm.
%
The design philosophy of FMEA is to trace {\bc} failures through to system failures.
This is incompatible with a modular approach, where the architecture of a
system may differ between implementation sites.
%
The modularity problem is exacerbated by FMEA's problems modelling software/hardware hybrids, a problem
examined in section~\ref{sec:distributed}.
% Most sensor systems now are `smart'~\cite{smartinstruments}, that is to say, they contain programmatic elements
% even if their outputs are %they supply
% analogue signals. For instance a liquid level sensor that
% supplies a {\ft} output, would have been typically have been implemented
% in analogue electronics before the 1980s. After that time, it would be common to use a micro-processor
% based system to perform the functions of reading the sensor and converting it to a current (\ft) output.
% For the non-safety critical systems integrator this brings with it the advantages
% that come with using a digital system (increased accuracy, self checking and  ease of
% calibration etc. ). For a safety critical systems integrator this can be very problematic when it
% comes to approvals. Even if the sensor manufacturer will let you see the internal workings and software
% we have a problem with tracing the FMEA reasoning through the sensor, through the sensors software
% and then though the system being integrated.
% This problem is compounded by the fact that traditional FMEA cannot integrate software into FMEA models~\cite{sfmea,safeware}.

%% Ideal case where we can determine one to one mapping of failure modes
%% to system level failures

\subsection{FMEA one to many mapping for component failure to system level {\fms}}
\label{sec:onetoone}
Traditional FMEA allows for the possibility of {\bc} {\fms} causing
more than one potential system failure mode.
%
This can be seen as an indicator of the lack of
cause to effect precision possible when analysing
large systems using FMEA.
%
Ideally this relationship would be many ({\bc} {\fms}) to one (system level symptoms).
%
This would be beneficial in terms of validating
precision of analysis, and for by-products of
the process such as developing diagnostic fault trees~\cite[Ch.~6.2]{cbds} from
FMEA results.
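
One way to state this distinction, using notation introduced here purely for illustration, is to let $F$ be the set of {\bc} {\fms} in a system and $S$ the set of its system level symptoms.
Traditional FMEA permits a general relation
\[
  R \subseteq F \times S ,
\]
in which a single $x \in F$ may be paired with several members of $S$.
The ideal described above corresponds instead to a total function
\[
  m : F \rightarrow S ,
\]
so that every {\bc} {\fm} has exactly one system level symptom, while many {\bc} {\fms} may share the same symptom.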

\section{Comparison Complexity}
%\section{Reasoning Distance used to measure Comparison Complexity}
\label{sec:reasoningdistance}
Because of state explosion, traditional FMEA cannot ensure that every failure mode of
every component is checked against all the other components in the system which
it may affect.
\fmmdglossSTATEEX
%
FMEA is therefore performed using heuristics % at best
to decide
which components to check against each component failure mode. % on.
%We could term the number of checks made for each failure mode
%on aspects of the system to be the reasoning distance.
%
Typically FMEA will be performed by following the signal path
of the component failure mode to its system level effect,
echoing fault finding/diagnostic techniques~\cite{garrett}. % reasoning.
\fmmdglossSIGPATH
%
This is less than ideal:
it can easily miss interactions with adjacent components that could cause
other system level symptoms.
%
% Were we to compare the reasoning distance with the theoretical maximum, the sum of all failure
% modes in a system, multiplied by the number of components in it, we could arrive at a maximum
% reasoning distance, which we can use as a comparison complexity figure.
If the number of checks actually made in an analysis, termed here the reasoning distance,
is compared with the theoretical maximum defined in equation~\ref{eqn:fmea_single},
comparison complexity figures can be produced.
%
% This figure would mean we could compare the maximum number of checks (i.e. exhaustive %rigorous
% analysis) with the number actually performed.
Comparison complexity here means the number of checks actually performed,
compared with the maximum number of checks (i.e. exhaustive analysis).
%
It is, in effect, a yardstick for the amount of work performed
by a particular FMEA analysis technique/strategy.
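
A worked figure may make this concrete.
The form of the theoretical maximum used below, and the numbers chosen, are assumptions made for illustration; equation~\ref{eqn:fmea_single} remains the authoritative definition.
Suppose a system has $N$ components, each with $f$ failure modes, and that exhaustive analysis checks every failure mode against each of the other $N-1$ components, giving a maximum of
\[
  N \times (N-1) \times f
\]
checks.
For $N = 20$ and $f = 3$ the maximum is $20 \times 19 \times 3 = 1140$ checks.
If following the signal path leads the analyst to check each failure mode against, say, four other components, then $20 \times 3 \times 4 = 240$ checks are actually performed, a ratio of
\[
  \frac{240}{1140} \approx 0.21 ,
\]
i.e. roughly one fifth of the exhaustive case.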

\paragraph{The ideal of exhaustive FMEA (XFMEA).}
%
Obviously, exhaustively checking every component failure mode in a system,
against all other components is the ideal for finding all possible system level failures.
%
While this is impossible for all but trivial systems, it should be possible
for small groups of components that work together to provide a well defined function.
%
A small group of components performing a well defined function
is termed a `{\fg}'.
%
Using {\fgs} is potentially a way of de-composing
the problem and reducing the $O(N^2)$ state explosion effect (see equation~\ref{eqn:fmea_single}) associated with XFMEA.
%
\fmmdglossSTATEEX
%
That is, if the analysis problem can be broken into smaller steps involving
small groups of components, XFMEA could be applied within those groups without
causing a debilitating state explosion effect.
%
This property is examined in section~\ref{sec:theoreticalperfmodel}.
%
% Thus there would are many smaller reasoning distances, where
% $n_i$ are small {\fgs} of the components in a system $S$
% which has a number of components $N$.
% The reasoning distance $n_i^2$
%
A comparison complexity order, or reasoning distance, of $O(N^2)$
could be seen as acceptable in an automated process such as a search algorithm,
but here it is a time consuming manual process which
demands experienced and highly qualified personnel.
%
It is therefore desirable to reduce this order further.
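
The potential saving from {\fgs} can be sketched with a simple count.
The uniform group size $k$, the uniform failure mode count $f$ and the assumed exhaustive cost of $n(n-1)f$ checks for $n$ components are all simplifying assumptions made here, and the count ignores the further analysis needed to combine the groups, which is treated in section~\ref{sec:theoreticalperfmodel}.
If $N$ components are divided into $N/k$ {\fgs} of $k$ components each, exhaustive analysis within the groups requires
\[
  \frac{N}{k} \times k \times (k-1) \times f \;=\; N \times (k-1) \times f
\]
checks, which grows linearly in $N$ for a fixed group size.
For $N = 20$, $k = 4$ and $f = 3$ this is $20 \times 3 \times 3 = 180$ checks, compared with $20 \times 19 \times 3 = 1140$ checks for exhaustive analysis of the ungrouped system.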


\section{Software and FMEA}

Traditional FMEA deals only with electrical and mechanical components, i.e. it does not have provision for software.
%
Modern control systems nearly always have a significant software/firmware element,
and not being able to model software with current FMEA methodologies
is a cause for criticism~\cite[Ch.~12]{safeware}.
%
Some techniques apply blanket estimates for a given software implementation~\cite[pp.~156--9]{safeware}, based
on the verification techniques applied in its testing,
to aid calculation of system level reliability statistics~\cite{5492693}.
%Even the traditionally conservative nuclear industry is now
%facing up to the ubiquity of software in control systems~\cite{parnas1991assessment}.
Similar difficulties in integrating mechanical and electronic/software
failure models are discussed in~\cite{SMR:SMR580,swassessment}.


\paragraph{Current work on Software FMEA.}
\fmmdglossSFMEA
SFMEA usually does not seek to integrate
hardware and software models, but to perform
FMEA on the software in isolation~\cite{procsfmea}.
%
Work has been performed using databases
to track the relationships between variables
and system failure modes~\cite{procsfmeadb}, to %work has been performed to
introduce automation into the FMEA process~\cite{appswfmea} and to provide code analysis
automation~\cite{modelsfmea}. Although SFMEA and hardware FMEA are performed separately,
some schools of thought aim for Fault Tree Analysis (FTA)~\cite{nasafta,nucfta} (top-down, deductive)
and FMEA (bottom-up, inductive)
to be performed on the same system to provide insight into the
software/hardware interface~\cite{embedsfmea}.
%
Subtle problems in embedded software are often due to interrupt contention causing unintended
corruption of variables: automated tools to aid the detection of this
are becoming available~\cite{concurrency_c_tool}.
%
Although current software FMEA techniques
should give a better picture of the failure mode behaviour,
they are by no means a rigorous approach to tracing errors that occur in hardware
through to the top (and therefore ultimately controlling) layer of software.
%
With micro-controllers increasingly used in place of much analogue electronics
in most new electronic product designs, the poor software integration capabilities of FMEA
are now being seen as deficiencies.

This is becoming apparent in a dilemma now faced
by organisations that deal with highly safety critical systems and have to rely on `smart~instruments'~\cite{justifysmartnuke}
that can no longer be validated using FMEA.
%
Smart instruments are discussed in section~\ref{sec:smart}.
Distributed real time systems, which rely on micro-controllers connected in a network
using a communications protocol, are similarly difficult to analyse meaningfully using FMEA (see section~\ref{sec:distributed}).

\subsection{The rise of the smart instrument}
\label{sec:smart}
\fmmdglossSMARTINSTRUMENT
%% AWE --- Atomic Weapons Establishment have this problem....
A smart instrument is defined as one that uses a micro-processor and software
in conjunction with its sensing electronics, rather than
analogue electronics only~\cite{smart_instruments_1514209}.
%
It is termed `smart' because it has some software, or intelligence, incorporated into it.
%
For instance, an AVO-8 multi-meter circa 1970 uses only analogue electronics, and FMEA can therefore be used
to determine how component failures within it could affect readings.
%
A modern multi-meter will have a small dedicated micro-processor and sensing electronics, all on the same chip,
with firmware to read the user controls, and display results on an LCD.
%
For quality control, many safety critical processes require regular inspections
and measurements of physical characteristics of materials and machinery.
%
For highly critical systems, e.g. in the nuclear industry~\cite{parnas1991assessment},
the instruments used to perform these measurements must be analysed using traditional assessment (which entails
FMEA) to ensure that failure modes within the instrument cannot lead to invalid measurements.
%
Some work has been performed to offer black~box (or functional) testing of these instruments instead of
static analysis~\cite{Bishop:2010:ONT:1886301.1886325}.
%
However, black box testing of smart instruments is not yet an approved method of validation.

Most modern instruments now use highly integrated electronics coupled to micro-controllers, which read and filter the measurements,
and interface to an LCD readout.
%
For highly critical systems, this means that traditional FMEA cannot be used to validate
the design of instruments.
%
Being more modern, these instruments are likely to be more reliable and
accurate than the analogue instruments in use some twenty years ago, but this cannot be validated
to a high level of reliability. This remains an unsolved problem for the industries dealing with highly safety critical
systems. %by traditional FMEA.
%to a high level of reliability by traditional FMEA.
%
Currently the only way that some smart~instruments have been permitted for
use in highly critical systems is to have them extensively
functionally tested~\cite{bishopsmartinstruments}.



\subsection{Distributed real time systems}
\label{sec:distributed}
Distributed real time systems are control systems where
smart sensors/actuators communicate over a communications bus to
a master controller.
%
Most modern cars follow this information technology pattern and use CANbus~\cite{canspec,can}.
%
For instance, in a modern car there will be no mechanical linkage from the throttle pedal to the engine; instead, the pedal
will be linked to a sensor that determines how far down it is pressed.
%
This sensor will be read by a micro-controller, and its values passed via CANbus to the Engine Control Unit (ECU),
which will use that information (along with information from other sensors) to adjust the power required from the engine.
%
This adjustment could be direct, or could  be another CANbus message passed to a micro-controller regulating engine function.
%
In terms of FMEA (see figure~\ref{fig:distcon}), the reasoning path spans at least four interface layers between electronics and software.
%
Traditional FMEA does not cater for the software/hardware interface, and using
a distributed system means the signal path will
cross several hardware/software interfaces\footnote{The complications of introducing a
communications protocol and the failure mode characteristics of the communications
physical~layer must also be considered for a distributed system.}.
%of the communications physical layer..
%
%, and this leads on to the additional complications
%with the additional complications
%of the communications protocol used to transmit data and the failure mode characteristics
%of the communications physical layer.
%

%
\fmmdglossSIGPATH
%(figure~\ref{fig:distcon}
The failure reasoning paths for a distributed real time system, % AF does not like this but I think its OK
with its multiple passes of the hardware/software
interface, mean traditional FMEA, for these systems,
is impossible to perform.
%
The base component failure mode to system failure paradigm is thus
utterly anachronistic in the distributed real time system environment.


\begin{figure}[h]
 \centering
 \includegraphics[width=400pt]{./CH3_FMEA_criticism/distcon.png}
 % distcon.png: 1622x656 pixel, 72dpi, 57.22x23.14 cm, bb=0 0 1622 656
 \caption{Distributed Control System FMEA signal path for a single input.}
 \label{fig:distcon}
\end{figure}


\section{FMEA: general criticism and conclusion}

%\subsection{FMEA - General Criticism}
A summary of deficiencies in current FMEA methodologies is listed below:
\begin{itemize}
   %\item FMEA type methodologies were designed for simple electro-mechanical systems of the 1940's to 1960's,
   \item State explosion: exhaustive FMEA is impossible for all but trivial systems,
   \item Difficult to re-use previous analysis work,
   \item Very difficult to model simultaneous/multiple failures,
   \item Software and hardware models are separate (if the software is modelled at all) meaning the software interface may not be correctly modelled,
   %\item reasoning distance -- component failure to system level symptom process is undefined in regard
   %to the components to check against each given component {\fm},
   \item FMEA methodologies are undefined in regard to which components to check against given failure modes,
   %
   \item Distributed real time systems are very difficult to analyse with FMEA because they typically involve many hardware/software interfaces.
\end{itemize}

Traditional forms of FMEA are no longer % fit for purpose!
of meaningful use for complex modern systems, especially those incorporating programmatic elements.
They were designed to analyse simple electro-mechanical systems,
and even commonplace high component count analogue circuits (which are usually surface mount and therefore physically small) are
becoming too complicated for meaningful analysis using FMEA.
%
%
% \section{Conclusions on current FMEA Methodologies}
%
% %% FOCUS
% The focus of this chapter %literature review
% is to establish the current practice and applications
% of FMEA.
% %, and to examine its strengths and weaknesses.
% %% GOAL
% Its
% goal is to identify central issues and to criticise and assess  the current
% FMEA methodologies.
% %% PERSPECTIVE
% The perspective of the author, is as a practitioner of static failure mode analysis techniques
% concerning approval of product
% to European safety standards, both the prescriptive~\cite{en298,en230} and statistical~\cite{en61508}.
% A second perspective is that of a software engineer trained to use formal methods.
% Examining FMEA methodologies for mathematical properties, influenced by
% formal methods applied to software, should provide a perspective not traditionally considered.
% %% COVERAGE
% The literature reviewed, has been restricted to published books, European safety standards (as examples
% of current safety measures applied), and traditional research, from journal and conference papers.
% %% ORGANISATION
% The review is organised by concept, that is, FMEA can be applied to hardware, software, software~interfacing and
% to multiple failure scenarios etc. Methodologies related to FMEA are briefly covered for the sake of context.
% %% AUDIENCE
% % Well duh! PhD supervisors and examiners....
%
% % \subsection{Related Methodologies}
% % FTA --- HAZOP  --- ALARP  --- Event Tree Analysis --- bow tie concept
% % \subsection{Hardware FMEA (HFMEA)}
% % \subsection{Multiple Failure scenarios and FMEA}
% % \subsection{Software FMEA (SFMEA)}
%
% \paragraph{Current work on Software FMEA}
%
% SFMEA usually does not seek to integrate
% hardware and software models, but to perform
% FMEA on the software in isolation~\cite{procsfmea}.
% %
% Work has been performed using databases
% to track the relationships between variables
% and system failure modes~\cite{procsfmeadb}, to %work has been performed to
% introduce automation into the FMEA process~\cite{appswfmea} and to provide code analysis
% automation~\cite{modelsfmea}. Although the SFMEA and hardware FMEAs are performed separately,
% some schools of thought aim for Fault Tree Analysis (FTA)~\cite{nasafta,nucfta} (top down - deductive)
% and FMEA (bottom-up inductive)
% to be performed on the same system to provide insight into the
% software hardware/interface~\cite{embedsfmea}.
% %
% Although this
% would give a better picture of the failure mode behaviour, it
% is by no means a rigorous approach to tracing errors that may occur in hardware
% through to the top (and therefore ultimately controlling) layer of software~\cite{swassessment}.
%
% \paragraph{Current FMEA techniques are not suitable for software}
%
% The main FMEA methodologies are all based on the concept of taking
% base component {\fms}, and translating them into system level events/failures~\cite{sfmea,sfmeaa}.
% %
% In a complicated system, mapping a component failure mode to a system level failure
% will mean a long reasoning distance; that is to say the actions of the
% failed component will have to be traced through
% several sub-systems, gauging its effects with and on other components.
% %
% With software at the higher levels of these sub-systems,
% we have yet another layer of complication.
% %
% %In order to integrate software, %in a meaningful way
% %we need to re-think the
% %FMEA concept of simply mapping a base component failure to a system level event.
% %
% SFMEA regards, in place of hardware components, the variables used by the programs to be their equivalent~\cite{procsfmea}.
% The failure modes of these variables, are that they could become erroneously over-written,
% calculated incorrectly (due to a mistake by the programmer, or a fault in the micro-processor on which it is running), or
% external influences such as
% ionising radiation causing bits to be erroneously altered.
%
%
% \paragraph{FMEA and Modularity}
% From the 1940's onwards, software has evolved from a simple procedural languages (i.e. assembly language/Fortran~\cite{f77} call return)
% to structured programming ( C~\cite{DBLP:books/ph/KernighanR88}, pascal etc) and then to object oriented models (Java C++...).
% FMEA has undergone no such evolution.
% %
% In a world where sensor systems, often including embedded software components, are brought in to
% create complex systems, FMEA still follows a rigid {\bc} {\fm} to system level error model,
% that is only suitable for simple electro mechanical systems.
%
%
%
% %
%
% %
% % MAYBE MOVE THIS TO CH3, FMEA CRITICISM
% 30JAN2013
%

\subsection{FMEA Criticism: Conclusions.}
FMEA is a useful tool for basic safety --- it provides statistics on safety where field data is impractical ---
and is good with single failure modes linked to top level events.
FMEA has become part of the safety critical and safety certification industries.
%
SFMEA is in its infancy, and there are corresponding gaps in
certification for software. EN61508~\cite{en61508}, a modern standard based
on FMEDA (a modern variant of FMEA), recommends hardware redundancy architectures in conjunction
with FMEDA for hardware; for software it recommends language constraints, software life cycle control, testing regimes and quality procedures,
but no inductive fault finding technique.
%
FMEA has adapted from a cost saving exercise for mass produced
items~\cite{bfmea,generic_automotive_fmea_6034891}, to incorporating statistical techniques
(FMECA), and then to allowing for self diagnostic mitigation (FMEDA).
%
However, it is still based on the concept of  single component failures mapped to top~level/system~failures,
with a one step analysis stage.
% All these FMEA based methodologies have the following short comings:
% \begin{itemize}
%  \item Impossible to integrate Software and hardware models,
%  \item State explosion problem exacerbated by increasing complexity due to density of modern electronics,
%  \item Impossible to consider all multiple component failure modes~\cite{FMEAmultiple653556}
% \end{itemize}
%
%\subsection{FMEA - Better Methodology - Wish List}
%
%
\subsection{FMEA - Better Methodology - Wish List}
%
A wish list is presented, stating the features that should exist
in an improved FMEA methodology:
\begin{itemize}
    \item able to analyse hybrid software/hardware systems,
    \item no state explosion (which is what makes XFMEA of a whole system impractical),
    \item exhaustive checking at a modular level, %(total failure coverage within {\fgs} all interacting component and failure modes checked),
    \item traceable reasoning inherent in system failure models,% to aid repeatability and checking,
    \item re-usable, i.e. it should be possible to re-use analysis,
    \item able to analyse simultaneous/multiple failures,
    \item each {\bc} {\fm} mapped to exactly one system level failure (see section~\ref{sec:onetoone}),
    \item modular, i.e. usable in a distributed system.
  % \item
\end{itemize}
\fmmdglossSTATEEX
%
%FMEDA is a modern extension of FMEA, in that it will allow for
%self checking features, and provides detailed recommendations for computer/software architecture,
%but
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%