introduction text added

Robin Clark 2012-02-03 17:19:47 +00:00
parent 112397b717
commit 3d05321669
2 changed files with 590 additions and 208 deletions


@ -27,16 +27,37 @@
This thesis describes the application of a common mathematical notation to This thesis describes the application of a common mathematical notation to
describe the design of safety critical systems/PEC's from the perspective of failure modes. describe the design of safety critical systems/PEC's from the perspective of failure modes.
The initial motivation for this study was to create a system The initial motivation for this study was to create a system
applicable to industrial burner controllers\footnote{Burner Controllers cover the disiplines of applicable to industrial burner controllers\footnote{Burner Controllers cover the disciplines of
combustion, high pressure steam and hot water, mechanical control, electronics and embedded software.}. combustion, high pressure steam and hot water, mechanical control, electronics and embedded software.}.
The methodology developed was designed to cope with The methodology developed was designed to cope with
both the deterministic\footnote{Deterministic failure mode analysis traces failure mode effects at the SYSTEM level to lower level causes in components or sub-systems.} and probabilistic approaches both the deterministic\footnote{Deterministic failure mode analysis traces failure mode effects at the SYSTEM level to lower level causes in components or sub-systems.} and probabilistic approaches
\footnote{Probabilistic failure mode analysis tries to determine the probability of given SYSTEM failure modes, and pfrom these \footnote{Probabilistic failure mode analysis tries to determine the probability of given SYSTEM failure modes, and from these
can determine an overall failure rate, in terms of probability of failure on demand, or failure in time (or Mean Time to Failure (MTTF))}. can determine an overall failure rate, in terms of probability of failure on demand, or failure in time (or Mean Time to Failure (MTTF))}.
\glossary{name={safety critical},description={A safety critical system is one in which its failure may result in death or serious injury to humans, an environmental catastrophe or severe loss or damage}} \glossary{name={safety critical},description={A safety critical system is one in which its failure may result in death or serious injury to humans, an environmental catastrophe or severe loss or damage}}
\fmodegloss \fmodegloss
\pecgloss \pecgloss
\paragraph{Initial Perspective for thesis}
My initial work on this area~\cite{robin-paper2004} was to use Euler/Spider~\cite{spider}
diagrams to represent failure modes. Euler circles represented failure modes, the feet of the spiders represented test cases
(i.e. instances of the failure mode occurring for examination),
so that a diagram could model multiple failure modes.
The spiders (or joining lines) represented the symptom abstraction process.
A spider thus determined a common symptom which was caused by one or more component failure modes.
% AFTER 5 YEARS FUCKING ABOUT with this CUNTS mathematicians GET A REAL FUCKING JOB
% At the 6 year point in this part time PhD I was finally appointed an electrical engineer.
% and the process of writing a paper for presentation as a result of this
% di-graphs instead were chosen.
As a by-product of writing a paper~\cite{iet2011}, it became apparent
that we could
%it was decided to
restrict the scope of the thesis to modularising FMEA
processes, and to restrict the examples examined to the domain of electronics only.
\footnote{Because FMEA deals with failure modes, in a static context---and all base components, whether mechanical, electrical
or software always have sets of failure modes associated with them---it should
be possible to apply it across all domains, and thus model integrated mechanical/electrical/software systems.}
\paragraph{Safety Critical Controllers, knowledge and culture sub-disciplines} \paragraph{Safety Critical Controllers, knowledge and culture sub-disciplines}
The maturing of the application of the programmable electronic controller (PEC) The maturing of the application of the programmable electronic controller (PEC)
for a wide range of safety critical applications has led to a fragmentation of sub-disciplines for a wide range of safety critical applications has led to a fragmentation of sub-disciplines
@ -66,7 +87,7 @@ to understand the process being controlled, the mechanical and electrical
sensors and actuators and the software. Not only must the sensors and actuators and the software. Not only must the
safety engineer understand more than four potential disciplines, he/she safety engineer understand more than four potential disciplines, he/she
must be able to trace failure modes of components to SYSTEM level failure modes, must be able to trace failure modes of components to SYSTEM level failure modes,
and classify these according to their criticallity. and classify these according to their criticality.
\paragraph{Desire to introduce formal methods to static failure mode analysis} \paragraph{Desire to introduce formal methods to static failure mode analysis}
There has been much work introducing formal methods into There has been much work introducing formal methods into
@ -85,7 +106,7 @@ Having a common failure mode notation across all disciplines in a project
would allow all the specialists to prepare failure mode would allow all the specialists to prepare failure mode
analysis and then bring them together to model the PEC. analysis and then bring them together to model the PEC.
\paragraph{Visual form of the notation} \paragraph{Visual form of the notation}
The visual notation developed was initially designed for electronic fault modeling. The visual notation developed was initially designed for electronic fault modelling.
This notation deals with failure modes of components using concepts derived from This notation deals with failure modes of components using concepts derived from
Euler and Spider diagrams. Euler and Spider diagrams.
However, as the notation dealt with generic failure modes, it was realised that it could be applied to However, as the notation dealt with generic failure modes, it was realised that it could be applied to
@ -139,7 +160,8 @@ FMEA was time consuming, and being directed by
experts undoubtedly ironed out many potential safety faults before the product saw experts undoubtedly ironed out many potential safety faults before the product saw
light of day. light of day.
However it was quickly apparent that only a small proportion However it was quickly apparent that only a small proportion
of component~failure modes was considered. Also there was no formalism. of component~failure modes was considered\footnote{The small proportion of components chosen for approvals FMEA
were generally those in critical sections of the PEC}. Also there was no formalism.
The component~failure~modes investigated were not analysed within The component~failure~modes investigated were not analysed within
any rigorous or mathematically proven framework. any rigorous or mathematically proven framework.
@ -624,13 +646,14 @@ $(N-2)$ components.
NumberOfchecks = \frac{(N^{2} - N) ( N - 2)}{2} NumberOfchecks = \frac{(N^{2} - N) ( N - 2)}{2}
\endequation \endequation
Thus for a 1000 failure mode system, roughly a half billion possible checks would be required for the double simultaneous failure scenario. This astonomical number of potential combinations, has made formal analysis of this Thus for a 1000 failure mode system, roughly a half billion possible checks would be required for the double simultaneous failure scenario.
This astronomical number of potential combinations has made formal analysis of this
type of system, up until now, impractical. Fault simulators %\cite{sim} type of system, up until now, impractical. Fault simulators %\cite{sim}
are commonly used for the gas certification process. Thus to are commonly used for the gas certification process. Thus to
manually check this number of combinations of faults is in practice impossible. manually check this number of combinations of faults is in practice impossible.
A technique of modularising, or breaking down the problem is clearly necessary. A technique of modularising, or breaking down the problem is clearly necessary.
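As a check on the figure quoted above, the formula can be evaluated directly for $N = 1000$. This is a worked evaluation of the stated formula only, and introduces nothing beyond it:
\[
\frac{(1000^{2} - 1000)(1000 - 2)}{2} = \frac{999\,000 \times 998}{2} = 498\,501\,000 \approx 5 \times 10^{8}
\]
which agrees with the `roughly a half billion' checks stated for the double simultaneous failure scenario.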
\section{Examples of disasters caused by designs \\ missing component errors} \section{Famous Examples of disasters caused by missed component errors}
\subsection{Challenger Disaster} \subsection{Challenger Disaster}
@ -717,7 +740,7 @@ temperature being the most typical. Very often what happens to the system outsid
\begin{itemize} \begin{itemize}
\item To create a Bottom up FMEA technique that permits a connected hierarchy to be \item To create a Bottom up FMEA technique that permits a connected hierarchy to be
built representing the fault behavior of a system. built representing the fault behaviour of a system.
\item To create a procedure where no component failure mode can be accidentally ignored. \item To create a procedure where no component failure mode can be accidentally ignored.
\item To create a user friendly formal common visual notation to represent fault modes \item To create a user friendly formal common visual notation to represent fault modes
in Software, Electronic and Mechanical sub-systems. in Software, Electronic and Mechanical sub-systems.
@ -725,10 +748,10 @@ in Software, Electronic and Mechanical sub-systems.
\item To prove that the derived~components used to build the hierarchies \item To prove that the derived~components used to build the hierarchies
provide traceable fault handling from component level to the provide traceable fault handling from component level to the
highest abstract system 'top level'. highest abstract system 'top level'.
\item To formally define the hierarchies and procedure for bulding them. \item To formally define the hierarchies and procedure for building them.
\item To produce a software tool to aid in the drawing of diagrams and \item To produce a software tool to aid in the drawing of diagrams and
ensuring that all fault modes are addressed. ensuring that all fault modes are addressed.
\item to provide a data model that can be used as a source for deterministic and probablistic failure mode analysis reports. \item to provide a data model that can be used as a source for deterministic and probabilistic failure mode analysis reports.
\item To allow the possibility of MTTF calculation for statistical \item To allow the possibility of MTTF calculation for statistical
reliability/safety calculations. reliability/safety calculations.
\end{itemize} \end{itemize}


@ -1,53 +1,354 @@
%
% Structure to introduction
%
%
% Application Area - safety critical controllers - define safety critical - describe
% approval processes - describe static testing
%
% Now start looking at the philosophy of making PEC's
% safer. Describe what can and cannot be done.
%
% Point out errors in currently used techniques.
% Bottom-up vs. top down discussion
%
% No current common notation for static testing that models both software and hardware
%
% How a new methodology should plug these gaps
%
%
\section{Introduction} \section{Introduction}
$$ \int_{0\-}^{\infty} f(t).e^{-s.t}.dt \; | \; s \in C$$ %% $$ \int_{0\-}^{\infty} f(t).e^{-s.t}.dt \; | \; s \in \mathcal{C}$$
This thesis describes the application of, mathematical (formal) techniques to
the design of safety critical systems. \paragraph{Scope of thesis}
This thesis describes the application of a common mathematical notation to
describe the design of safety critical systems/PEC's from the perspective of failure modes.
The initial motivation for this study was to create a system The initial motivation for this study was to create a system
applicable to industrial burner controllers. applicable to industrial burner controllers\footnote{Burner Controllers cover the disciplines of
combustion, high pressure steam and hot water, mechanical control, electronics and embedded software.}.
The methodology developed was designed to cope with The methodology developed was designed to cope with
both the specific `simultaneous failures'\cite{EN298},\cite{EN230},\cite{EN12067} both the deterministic\footnote{Deterministic failure mode analysis traces failure mode effects at the SYSTEM level to lower level causes in components or sub-systems.} and probabilistic approaches
and the probability to dangerous fault approach\cite{EN61508}. \footnote{Probabilistic failure mode analysis tries to determine the probability of given SYSTEM failure modes, and from these
can determine an overall failure rate, in terms of probability of failure on demand, or failure in time (or Mean Time to Failure (MTTF))}.
\glossary{name={safety critical},description={A safety critical system is one in which its failure may result in death or serious injury to humans, an environmental catastrophe or severe loss or damage}}
\fmodegloss
\pecgloss
The visual notation developed was initially designed for electronic fault modelling.
However, it could be appleid to mechanical and software domains as well. \paragraph{Initial Perspective for thesis}
Due to this a common notation/diagram style My initial work on this area~\cite{robin-paper2004} was to use Euler/Spider~\cite{spider}
can be used to model any integrated safety relevant system. diagrams to represent failure modes. Euler circles represented failure modes, the feet of the spiders represented test cases
(i.e. instances of the failure mode occurring for examination),
so that a diagram could model multiple failure modes.
The spiders (or joining lines) represented the symptom abstraction process.
A spider thus determined a common symptom which was caused by one or more component failure modes.
% AFTER 5 YEARS FUCKING ABOUT with this CUNTS mathematicians GET A REAL FUCKING JOB
% At the 6 year point in this part time PhD I was finally appointed an electrical engineer.
% and the process of writing a paper for presentation as a result of this
% di-graphs instead were chosen.
As a by-product of writing a paper~\cite{iet2011}, it became apparent
that we could
%it was decided to
restrict the scope of the thesis to modularising FMEA
processes, and to restrict the examples examined to the domain of electronics only.
\footnote{Because FMEA deals with failure modes, in a static context---and all base components, whether mechanical, electrical
or software always have sets of failure modes associated with them---it should
be possible to apply it across all domains, and thus model integrated mechanical/electrical/software systems.}
\paragraph{Safety Critical Controllers, knowledge and culture sub-disciplines}
The maturing of the application of the programmable electronic controller (PEC)
for a wide range of safety critical applications has led to a fragmentation of sub-disciplines
which speak imperfectly to one another.
This is because
the three main engineering disciplines, Electrical, Software and Mechanical Engineering,
produced equipment that was interfaced at a later time.
Just as electronic circuitry becomes more integrated, and sub-domains
of electrical engineering (analog and digital for instance) are commonly found alongside each other on the same chip,
so modern PEC's are becoming more and more integrated and now typically encompass
input from the three engineering disciplines\footnote{Consider an aircraft: this involves expert knowledge from
Software, Electronic and Mechanical Engineering and requires a high degree of safety validation}.
Additional disciplines are defined by the application area of the PEC. All of these sub-disciplines
are in turn split into even finer units.
The practitioners of these fields tend to view a PEC in different ways.
Discoveries and culture in one field diffuse only slowly into the consciousness of a specialist in another.
Too often, one discipline's unproven assumptions or working methods are treated as firm boundary conditions
for an overlapping field.
For failure mode analysis, a common notation across disciplines is a very desirable and potentially useful
tool.
\paragraph{Safety Assessment/analysis of PEC's}
\glossary{name={safety assessment},description={A critical appraisal, typically following legal or formal guidelines, which will encompass design, and failure effects analysis}}
Anyone responsible for ensuring or proving the safety of a PEC must be able
to understand the process being controlled, the mechanical and electrical
sensors and actuators and the software. Not only must the
safety engineer understand more than four potential disciplines, he/she
must also be able to trace failure modes of components to SYSTEM level failure modes,
and classify these according to their criticality.
\paragraph{Desire to introduce formal methods to static failure mode analysis}
There has been much work introducing formal methods into
the requirements and validation phases of electromechanical systems.
Apart from the ability to check, precisely, that what has been
built behaves correctly and as requested, the process
of formal specification ensures that all important details are analysed
and looked at in detail.
It is an aim of this project to bring formal methods to
static failure mode analysis. This means being able to account for every base
component failure mode in a model, and to be able to represent
mechanical, electrical and software components in a single failure mode model.
\paragraph{Desirability of a common failure mode notation}
Having a common failure mode notation across all disciplines in a project
would allow all the specialists to prepare failure mode
analysis and then bring them together to model the PEC.
\paragraph{Visual form of the notation}
The visual notation developed was initially designed for electronic fault modelling.
This notation deals with failure modes of components using concepts derived from
Euler and Spider diagrams.
However, as the notation dealt with generic failure modes, it was realised that it could be applied to
mechanical and software domains as well.
This changed the target for the study slightly to encompass these three domains in a common notation.
\paragraph{PEC's: Legal and Insurance Issues}
In most safety critical industries the operators of plant have to demonstrate a thorough consideration of safety.
There is also usually a differentiation between the manufacturers
and the plant operators.
The manufacturers have to ensure
that the device is adequately safe for use in its operational context.
This usually means conforming to device specific standards\footnote{In Europe, conformance to European Norms (EN) is a legal requirement
for specific types of controllers, and in the USA conformance to Underwriters Laboratories (UL) standards
is usually a minimum requirement to take out insurance}, and offering training
of operators.
Operators of safety critical plant are concerned with maintenance and legal obligations for
periodic safety checks (both legal and insurance driven).
\section{Background}
I completed an MSc in Software Engineering in 2004 at Brighton University while working for
an Engineering firm as an embedded `C' programmer.
The firm specialises in industrial burner controllers.
Industrial Burners are potentially very dangerous industrial plant.
They are generally left running unattended for long periods.
They are subject to stringent safety regulations and
must conform to specific `EN' standards.
For a non-safety critical product one can merely comply with the standards, and `self~certify' by applying a CE mark sticker.
Safety critical products are categorised and listed. These require
certification by an independent and `competent body' recognised under European law.
The certification process typically involves stress testing with repeated operation cycles
over a specified range of temperatures, electrical stress testing with high voltage interference,
power supply voltage ranges with surges and dips, electrostatic discharge testing, and
EMC (Electromagnetic Compatibility). A significant part
of this process however, is `static testing'. This involves looking at the design of the products,
from the perspective of environmental stresses, natural input fault conditions\footnote{For instance in a burner controller, the gas supply pressure reducing},
components failing, and the effects on safety this could have.
Some static testing involves checking that the germane `EN' standards have
been complied with\footnote{for instance protection levels of an enclosure for the device, or de-rating of electrical components}.
Failure Mode Effects Analysis (FMEA) was also applied. This involved
looking in detail at selected critical sections of the product and proposing
component failure scenarios.
For each failure scenario proposed either a satisfactory
answer was required, or a counter proposal to change the design to cope with
a theoretical component failure eventuality.
FMEA was time consuming, and being directed by
experts undoubtedly ironed out many potential safety faults before the product saw
light of day.
However, it quickly became apparent that only a small proportion
of component~failure modes was considered\footnote{The small proportion of components chosen for approvals FMEA
were generally those in critical sections of the PEC}. Also there was no formalism.
The component~failure~modes investigated were not analysed within
any rigorous or mathematically proven framework.
\subsection{ Blanket Risk Reduction Approach }
The suite of tests applied for a certified product amounts to a `blanket' approach.
That is to say that by applying electrical, repeated operations, and environmental
stress testing it is hoped that the majority of latent faults are discovered.
The FMEA and static testing only looked at the most obviously safety critical
aspects, and a small minority of the total component base for a product.
Systemic faults, or mistakes, are missed by this form of static testing.
\subsection{Possibility of applying mathematical techniques to FMEA}
My MSc project was a diagram editor for Constraint diagrams.
I wanted to apply constraint diagram techniques to FMEA
and began thinking about how this could be done. One
obvious factor was that a typical safety critical system could
have more than 1000 component parts. Each component
would typically have several failure modes.
Trying to apply a rigorous methodology to an entire product
was going to be impractical. For complete coverage,
each component failure mode would have to be checked against
the other thousand or so components for influence, and then
a determination of the effects on the system would have to be
made. Thus millions of checks would have to be performed, and
as FMEA is an `expert only', time consuming technique, this idea was
obviously impractical. Note that most of the checks made would be redundant.
Most components affect the performance of a few that they are placed to work with
to perform some particular low-level function.
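
To give a feel for the scale, a rough count can be sketched under assumed figures. Let $C$ be the number of components and $\bar{m}$ the average number of failure modes per component; both symbols, and the values used below, are illustrative assumptions rather than measured data. Checking every component failure mode against every other component requires
\[
\bar{m}\,C\,(C-1)
\]
checks. With $C = 1000$ and $\bar{m} = 3$ this gives $3 \times 1000 \times 999 = 2\,997\,000$, i.e. roughly three million checks for single failures alone.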
\paragraph{Top down Approach}
A top down approach has several potential problems.
By its nature it means that at the start of the process
a set of system or top level faults or undesirable outcomes are defined.
It then must break the system down into modules and
decide which of these can contribute to a system level fault mode.
Potentially, failure modes, be they from components or the interaction
between modules, can be missed. A disturbing example of this
is the NASA space shuttle disaster of 1986, where the safety analysis missed the fault mode of an `O'
ring. This was made even worse by the fact that the `O' ring had a specified temperature
range, and the probability of this fault occurring was dramatically raised
below that range. This was a known and documented feature of a safety critical component
and it was ignored in the safety analysis.
\paragraph{Bottom-up Approach}
A bottom-up approach looked impractical at first due to the sheer number
of component failure modes in a typical system.
However, were this bottom-up approach to be modular (reducing the order of cross checking), and to build a hierarchy
of modules rising up until all components are covered, we
could model an entire complex system.
This is the core concept behind this study.
By working from the bottom up, at the lowest level taking the
smallest functional~groups of components
and analysing these, we can obtain a set of failure modes
for the functional~groups. We can then treat these
as `higher level' components and combine them
to form new `functional~groups'.
In this way all failure modes from all components must be at the very least considered.
Also, a hierarchy is formed in which the top level errors arise
naturally from the lower levels of analysis.
Unlike a top~down analysis, we cannot miss a top level fault condition.
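
The reduction in cross checking can be illustrated with a simplified count. The following is only a sketch under stated assumptions, not a result of this study: $N$ base components are split into {\fg}s of at most $k$ components, each {\fg} yields a single derived component, and within a group each component is checked against the other $k-1$. The symbols $N$ and $k$ and the counting model are introduced here purely for illustration. The number of checks at the lowest level is then $\frac{N}{k}\,k(k-1) = N(k-1)$, and summing over the levels of the hierarchy gives
\[
N(k-1)\left(1 + \frac{1}{k} + \frac{1}{k^{2}} + \cdots\right) < N(k-1)\,\frac{k}{k-1} = Nk .
\]
For $N = 1000$ and $k = 10$ this is fewer than $10\,000$ checks, against $N(N-1) = 999\,000$ when every component is checked against every other. The count is in terms of components; multiplying by the number of failure modes per component scales both figures alike.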
\paragraph{Repeated Circuitry Sub-Systems}
All of the safety critical real time systems the author has worked with
have repeated sections of hardware,
for instance self checking digital inputs, analog inputs, sections of circuitry to
generate {\ft} loops, and micro-processors with watchdog~\cite[p.~81]{embupsys} secondary
circuitry.
In other words, spending time on analysing these lower level sub-systems
seems worthwhile, since they will be used in many designs, and are often
repeated within a SYSTEM
(and thus the analysis results may be re-used).
In general terms we can describe
these circuitry sub-systems
as collections of components or smaller sub-systems, that interact to perform a given function.
We can call these collections {\fg}s.
In these `safety critical' circuitry sections, especially ones claiming to
be self-checking, the actual level of safety depends upon not
just the MTTF/reliability of the components, but the
{\fg}'s reaction to a component failure
within the circuit.
That is to say how the circuit section or {\fg}
reacts to component failures within it.
We may find for instance that the circuit reacts to most component failure modes
in ways that allow us to detect that there has been a failure.
Some component failure modes in the {\fg} can lead to serious errors, such as an incorrect reading
that we cannot immediately detect.
%
If these specific component
failures occur, we will not know, and will feed incorrect data into our system.
%
Figure \ref{fig:millivolt} shows a typical industrial
circuit to measure and amplify millivolt signals.
It will detect a disconnected millivolt source (the most common
failure, usually due to wiring faults) and some other internal component failures.
It can, however, provide an incorrect (slightly low) reading if
one of two resistors fails in particular ways.
% Although statistically unlikely, in a very critical system
% this may have to be considered.
To the author, it seems that paying attention
to the way {\fg}s of components interact and proving
a safety case for them is a very important aspect
of detecting `undetected failures' in safety critical product design.
\paragraph{Multi-discipline} Most safety critical systems are composed of mechanical, electrical and
computing elements. A tragic example of the mechanical and electrical elements
interfacing to a computer is found in the THERAC-25 x-ray dosage machine.
With no common notation to integrate the safety analysis between the electrical/mechanical and computing
domains, synchronisation errors occurred that were in some cases fatal.
The interfacing between the hardware and software for the THERAC-25 was not considered
in the design phase.
Neil Storey, in the formal methods chapter of ``Safety-Critical Computer Systems'',
describes the different formal languages suitable for hardware and software and
bemoans the fact that no single language is suitable for such a broad range of tasks \cite[p.~287]{sccs}.
\paragraph{Requirements for a rigorous FMEA process}
It was determined that any process to apply
FMEA in a rigorous and complete way (in terms of complete component coverage) had to be
a bottom~up process to eliminate the possibility of missing component failure modes.
It also had to naturally converge to a failure model of the system.
It had to take potentially thousands of component failure modes and simplify
these into system level errors.
To analyse the large number of component failure modes, and resolve these to perhaps a handful
of system failure modes, would require
a process of modularisation from the bottom~up.
\begin{list}{$*$}{}
\item The analysis process must be `bottom~up'
\item The process must be modular and hierarchical
\item The process must be multi-discipline and must be able to represent hardware, electronics and software
\end{list}
\section{Safety Critical Systems} \section{Safety Critical Systems}
\glossary{name={safety critical},description={A safety critical system is one in which its failure may result in death or serious injury to humans, an environmental catastrophe or severe loss or damage}}
%
%How safe is "safe"?
%The word "safety" is too general—it really doesn't mean anything definitive. Therefore, we use terms such as safety-related and safety-critical.
%
%A safety-related device provides or ensures safety. It is required for machines/vehicles, which cause bodily harm or death to human being when they fail. A safe state can be defined (in other words, safety-related). In case of a buzz saw, this could be a motor that seizes all movements immediately. The seizure of movement makes the machine safe at that moment. IEC 61508 defines the likelihood of failures of this mechanism, the Safety Integrity Levels (SIL). SIL 3 is defined as the likelihood of failing less than 10-7% per hour. This is a necessary level of safety integrity for products such as lifts, where several people's lives are endangered. The buzz saw is likely to require SIL 2 only, it endangers just one person.
%
%Safety-critical is a different matter. To understand safety-critical imagine a plane in flight: it is not "safe" to make all movement stop since that would make the plane crash. A safe state for a plane is in the hangar, but this is not an option when you're in flight. Other means of ensuring safety must be found. One method used in maritime applications is the "CANopen flying master" principle, which uses redundancy to prevent failure. For the above example an SIL 4, meaning likelihood of failing less than 10-8% per hour is necessary. This is also true for nuclear power station control systems, among other examples.
%
\subsection{General description of a Safety Critical System} \subsection{General description of a Safety Critical System}
A safety critical system is one in which lives may depend upon it or A safety critical system is one in which lives may depend upon it or
it has the potential to become dangerous. it has the potential to become dangerous\cite{sccs}.
(/usr/share/texmf-texlive/tex/latex/amsmath/amstext.sty %(/usr/share/texmf-texlive/tex/latex/amsmath/amstext.sty)
An industrial burner is typical of plant that is potentially dangerous. %An industrial burner is typical of plant that is potentially dangerous.
An incorrect air/fuel mixture can be explosive. %An incorrect air/fuel mixture can be explosive.
Medical electronics for automatically dispensing drugs or maintaining %Medical electronics for automatically dispensing drugs or maintaining
life support are examples of systems that lives depend upon. %life support are examples of systems that lives depend upon.
\subsection{Two approaches : Probablistic, and Compnent fault tolerant} \subsection{Two approaches : Probabilistic, and Deterministic}
There are two main philosophies applied to safety critical systems.
One is a general number of acceptable failure per hour of operation.
This is the probablistic approach and is embodied in the european standard
EN61508 \cite{EN61508}.
There are two main philosophies applied to safety critical systems certification.
\paragraph{Probabilistic safety Measures}
One is to specify an acceptable number of failures per hour of operation\footnote{The common metric is the Failure in Time (FIT) value: failures per ${10}^{9}$
hours of operation}, or
an acceptable probability of failure on demand.
This is the probabilistic approach and is embodied in the European Standard
EN61508 \cite{en61508} (international standard IEC 61508).
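As a brief worked illustration of the FIT metric (a sketch that assumes a constant failure rate; the symbol $\lambda$ and the figure of 100 FIT are chosen here purely for illustration), a component rated at 100 FIT has
\[
\lambda = 100 \times 10^{-9}\ \mathrm{h}^{-1} = 10^{-7}\ \mathrm{h}^{-1},
\qquad
\mathrm{MTTF} = \frac{1}{\lambda} = 10^{7}\ \mathrm{hours},
\]
or roughly 1140 years of continuous operation.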
\glossary{name={deterministic},description={Deterministic, in the context of failure mode analysis, traces the causes of SYSTEM level events to base level component failure modes}}
\glossary{name={probabilistic},description={Probabilistic, in the context of failure mode analysis, traces the probability of base level failure modes causing SYSTEM level events/failure modes}}
\fmodegloss
\paragraph{Deterministic safety Measures}
The second philosophy, applied to application specific standards, is to investigate The second philosophy, applied to application specific standards, is to investigate
components ior sub-systems in the critical safety path and to look at component failure modes components for sub-systems in the critical safety path and to look at component failure modes
and ensure that they cannot cause dangerous faults. and ensure that they cannot cause dangerous faults.
With the application specific standards detail %With the application specific standards detail
specific to the process are %specific to the process are
This philosophy is first mentioned in aircraft safety operation reseach WWII The simplest deterministic safety measure is to require that no single component failure
studies. Here potential single faults (usually mechanical) are traced to mode can cause a dangerous error.
catastrophic failures This philosophy is first mentioned in aircraft safety operation reseach (WWII)
studies. Here potential single faults (usually mechanical) were traced to
% \cite{boffin}. catastrophic failures \cite{boffin}.
EN298, the European Gas burner standard, goes further than this
and requires that no two single component faults may cause
a dangerous condition.
% %
@ -60,196 +361,215 @@ catastrophic failures
\subsection{Overview of regulation of safety Critical systems} \subsection{Overview of regulation of safety Critical systems}
reference chapter dealing speciifically with this but given a quick overview. Reference chapter dealing specifically with this but given a quick overview.
\subsubsection{Overview system analysis philosophies } \subsubsection{Overview system analysis philosophies }
- General safety standards - General safety standards
- specific safety standards - specific safety standards
\subsubsection{Overview of current testing and certification} \subsubsection{Overview of current testing and certification}
ref chapter speciiffically on this but give an overview now Ref chapter specifically on this but give an overview now
\section{Background to the Industrial Burner Safety Analysis Problem}
An industrial burner is a good example of a safety critical system.
It has the potential for devastating explosions due to boiler overpressure, or
ignition of an explosive mixture, and, because of the large amounts of fuel used,
is a potential fire hazard. They are often left running unattended 24/7.
To add to these problems,
operators are often under pressure to keep them running. A boiler supplying
heat to a large greenhouse complex could ruin crops
should it go off-line. Similarly a production line relying on heat or steam
can be very expensive in production down-time should it fail.
This places extra responsibility on the burner controller.
These are commonplace and account for a very large proportion of the energy usage
in the world today (find and ref stats)
Industrial burners are common enough to have different specific standards
written for the fuel types they use \ref{EN298} \ref{EN230} \ref{EN12067}.
A modern industrial burner has mechanical, electronic and software A modern industrial burner has mechanical, electronic and software
elements, that are all safety critical. That is to say elements, that are all safety critical. That is to say
unhandled failures could create dangerous faults. unhandled failures could create dangerous faults.
To add to these problems %To add to these problems
Operators are often under pressure to keep them running. An boiler supplying %Operators are often under pressure to keep them running. An boiler supplying
heat to a large greenhouse complex could ruin crops %heat to a large greenhouse complex could ruin crops
should it go off-line. Similarly a production line relying on heat or steam %should it go off-line. Similarly a production line relying on heat or steam
can be very expensive in production down-time should it fail. %can be very expensive in production down-time should it fail.
This places extra responsibility on the burner controller. %This places extra responsibility on the burner controller.
%
%
These are commonplace and account for a very large proportion of the energy usage
in the world today (find and ref stats)
Industrial burners are common enough to have different specific standards
written for the fuel types they use \ref{EN298} \ref{EN230} \ref{EN12067}.
A modern industrial burner has mechanical, electronic and software
elements, that are all safety critical. That is to say
unhandled failures could create dangerous faults.
A more detailed description of industrial burner controllers
is dealt with in chapter~\ref{burnercontroller}.
\subsection{Mechanical components}
describe the mechanical parts - gas valves damper s
electronic and software
give a diagram of how it all fits A
together with a
\subsection{electronic Components}
\subsection{Software/Firmware Components}
\subsection{A high level Fault Hierarchy for an Industrial Burner}
This section shows the component level, leading up higher and higher in the abstraction level
to the software levels and finally a top level abstract level. If the system has been
designed correctly no `undetected faults' should be present here.
% This needs to become a chapter
%\subsection{Mechanical components}
%describe the mechanical parts - gas valves damper s
%electronic and software
%give a diagram of how it all fits A
%together with a
%\subsection{electronic Components}
%
%\subsection{Software/Firmware Components}
%
%
%\subsection{A high level Fault Hierarchy for an Industrial Burner}
%
%This section shows the component level, leading up higher and higher in the abstraction level
%to the software levels and finally a top level abstract level. If the system has been
%designed correctly no `undetected faults' should be present here.
%
\section{An Outline of the FMMD Technique} \section{An Outline of the FMMD Technique}
{\fmmdgloss}
The methodology takes a bottom up approach to %\glossary{name={FMMD},description={Failure Mode Modular De-Composition}}
The FMMD methodology takes a bottom up approach to
the design of an integrated system. the design of an integrated system.
%
Each component is assigned a well defined set of failure modes. Each component is assigned a well defined set of failure modes.
The components are formed into modules, or functional groups. The system under inspection is then searched for functional groups of components that
perform simple well defined tasks.
These functional groups are analysed with respect to the failure modes of the These functional groups are analysed with respect to the failure modes of the
components. The `functional group' or module will have a set of derived components.
failure modes. The number of derived failure modes will be %
The `functional group', after analysis, has its own set of derived
failure modes.
\fmodegloss
%
The number of derived failure modes will be
less than or equal to the sum of the failure modes of all its components. less than or equal to the sum of the failure modes of all its components.
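Stated symbolically (a sketch only: the symbols $FG$, $fm$ and $D$ are assumed here for illustration and are not the formal notation of Chapter~\ref{fmmddefinition}), for a functional group $FG = \{c_1, \ldots, c_n\}$ whose components have failure mode sets $fm(c_i)$, the set $D$ of derived failure modes is expected to satisfy
\[
|D| \;\leq\; \sum_{i=1}^{n} |fm(c_i)| .
\]
For example, a functional group of two resistors, each with the failure modes OPEN and SHORT, has four component failure modes in total, so at most four derived failure modes would be expected.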
%
%
A `derived' set of failure modes, is at a higher abstraction level. A `derived' set of failure modes, is at a higher abstraction level.
derived modules may now be used as building blocks, to model the system at %
ever higher levels of abstraction until the top level is reached. Thus we can now treat our `functional group' as a component in its own right,
with its own set of failure~modes. We can create
a `derived component' and assign it the derived failure modes as analysed from the `functional group'.
%
Derived Components may now be used as building blocks, to model the system at
ever higher levels of abstraction, building a hierarchy until the top level is reached.
%
Any unhandled faults will appear at this top level and will be `un-resolved'. Any unhandled faults will appear at this top level and will be `un-resolved'.
A formal description of this process is dealt with in Chapter \ref{fmmddefinition}. A formal description of this process is dealt with in Chapter \ref{fmmddefinition}.
%
%
%This principally focuses %This principally focuses
%on simple control systems for maintaining temperature %on simple control systems for maintaining temperature
%and for industrial burners. It is hoped that a general mathematical %and for industrial burners. It is hoped that a general mathematical
%framework is created that can be applied to other fields of safety critical engineering. %framework is created that can be applied to other fields of safety critical engineering.
\subsection{Automated Systems and Safety}
Automated systems, as opposed to manual ones are now the norm Automated systems, as opposed to manual ones are now the norm
in the home and in industry. in the home and in industry.
%
Automated systems have long been recognised as being more effecient and Automated systems have long been recognised as being more efficient and
more accurate than a human opperator, and the reason for automating a process more accurate than a human operator, and the reason for automating a process
can now be more likely to be cost savings due to better effeciency can now be more likely to be cost savings due to better efficiency
thatn a human operator \ref{burnereffency}. than a not paying a salary to a human operator \ref{burnereffency}.
%
For instance For instance
early automated systems were mechanical, with cams and levers simulating early automated systems were mechanical, with cams and levers simulating
fuel air mixture profile curves over the firing range. control functions.
%
A typical control function could be the
fuel air mixture profile curves over the firing range.
%
Because fuels vary slightly in calorific value, and air density changes with the weather, no fixed tuning can be optimal. Because fuels vary slightly in calorific value, and air density changes with the weather, no fixed tuning can be optimal.
In fact for asethtic reasons (not wanting smoke to appear at the flue) In fact for aesthetic reasons (not wanting smoke to appear at the flue)
the tuning was often air rich, causing air to be heated and the tuning was often air rich, causing air to be heated and
uneccessarily passed through the burner, leading to direct loss of energy. unnecessarily passed through the burner, leading to direct loss of energy.
An automated system analysing the combustions gasses and automatically An automated system analysing the combustion gases and automatically
adjusting the fuel air mix can get the effeciencies very close to theoretical levels. adjusting the fuel air mix can get the efficiencies very close to theoretical levels.
As the automation takes over more and more functions from the human operator it also takes on more responsibility. As the automation takes over more and more functions from the human operator it also takes on more responsibility.
A classic example of an automated system failing, is the therac-25. A classic example of an automated system failing, is the therac-25.
This was an X-ray dosage machine, that, due to software errors This was an X-ray/electron~beam dosage machine, that, due to software errors
caused the deaths of several patients and injured more during the 1980's. caused the deaths of several patients and injured more during the 1980's.
The Therac-25 was designed from a manual system, which had checks and interlocks,
and was subsequently computerised. Software safety interlock problems were the primary causes of the radiation
overdoses
\cite[App.~A]{safeware}.
Any new safety critical analysis methodology should
be able to model software, electrical and hardware faults using
a common notation.
Ideally the tool should be automated so that it can
seamlessly analyse the entire system, and apply
rigorous checking to ensure that no
fault conditions are missed.
% http://en.wikipedia.org/wiki/Autopilot % http://en.wikipedia.org/wiki/Autopilot
To take an example of an Autopilot, simple early autopilots, were (i.e. they \paragraph{Importance of self checking}
prevented the aircraft staying from a compass bearing and kept it flying striaght and level). To take an example of an Aircraft Autopilot, simple early devices\footnote{from the 1920's simple aircraft autopilots were in service},
prevented the aircraft straying from a compass bearing and kept it flying straight and level.
Were they to fail the pilot would notice quite quickly Were they to fail the pilot would notice quite quickly
and resume manual control of the bearing. and resume manual control of the bearing.
Modern autopilots control all aspects of flight including the engines, and take off and landing phases. Modern autopilots control all aspects of flight including the engines, take off and landing phases.
The automated system does not have the The automated system does not have the
common sense of a human pilot either, if fed the wrong sensory information common sense of a human pilot; and if fed the incorrect sensory information
it could make horrendous mistakes. This means that simply reading sensors and applying control can make horrendous mistakes. This means that simply reading sensors and applying control
corrections cannot be enough. corrections cannot be enough.
Checking for error conditions must also be incorporated. Checking for error conditions must also be incorporated.
It could also develop an internal fault, and must be able to cope with this. Equipment can also develop internal faults, and strategies
must be in place to firstly recognise internal faults,
and then cope with them in the safest possible way.
\begin{figure}[h]
Systems such as industrial burners have been partially automated for some time. \centering
A mechanical cam arrangement controls the flow of air and fuel for the range of \includegraphics[width=300pt,keepaspectratio=true]{introduction/mv_opamp_circuit.png}
firing rate (output of the boiler). % mv_opamp_circuit.png: 577x479 pixel, 72dpi, 20.35x16.90 cm, bb=0 0 577 479
\caption{Milli-Volt Amplifier with added Safety Resistor (R18)}
These mechanical systems could suffer failures (such as a mechanical linkage beoming \label{fig:millivolt}
detached) and could then operate in a potentially dangerous state.
More modern burner controllers use a safety critical computer controlling
motors to operate the fuel and air mixture and to control the safety
valves.
In working in the industrial burner industry and submitting product for
North American and European safety approval, it was apparent that
formal techniques could be applied to aspects of the circuit design.
Some safety critical circuitry would be subjected to thought experiments, where
the actions of one or more components failing would be examined.
As a simple example a milli-volt input could become disconnected.
A milli-volt input is typically amplified so that its range matches that
of the A->D converter reading it. Were this signal source to become disconnected
the system would see a floating, amplified signal.
A high impedance safety resistor can be added to the circuit,
to pull the signal high (or out of normal range) upon disconnection.
The system then knows that a fault has occurred and will not use
that sensor reading (see figure \ref{fig:millivolt}).
\begin{figure}
\vskip 7cm
\special{psfile=introduction/millivoltsensor.ps hoffset=0 voffset=0 hscale=35 vscale=35 }\caption[Milli-Volt Sensor with safety resistor]{
Milli-Volt Sensor with safety resistor
\label{fig:millivolt}}
\end{figure} \end{figure}
For exmaple, if the sensor supplies a range of 0 to 40mV, and RG1 and RG2 are such that the op-amp supplies a gain of 100 % \begin{figure}[h]
any signal between 0 and 4 volts on the ADC will be considered in range. Should the sensor become disconnected the % \centering
opamp will supply its maximum voltage, telling the system the sensor reading is invalid. % \includegraphics[width=300pt,bb=0 0 678 690,keepaspectratio=true]{introduction/mv_opamp_circuit.png}
% % mv_opamp_circuit.png: 678x690 pixel, 72dpi, 23.92x24.34 cm, bb=0 0 678 690
% \caption{Milli-volt amplifier with added safety Resistor}
% \label{fig:millivolt}
% \end{figure}
%
% %5
% \begin{figure}
% \vskip 7cm
% \special{psfile=introduction/millivoltsensor.ps hoffset=0 voffset=0 hscale=35 vscale=35 }\caption[Milli-Volt Sensor with safety resistor]{
% Milli-Volt Sensor with safety resistor
% \label{fig:millivolt}}
% \end{figure}
\paragraph{Component added to detect errors}
The op-amp in the circuit in figure \ref{fig:millivolt} supplies a gain of $\approx 184$\footnote{
Applying the formula for non-inverting op-amp gain\cite{aoe}: $\frac{150 \times 10^3}{820}+ 1 \approx 184$.}.
The safety case here is that
any amplified signal in the range of, say, 0.5 to 4 volts at the ADC will be considered in range.
This means that input signals between approximately 3mV and 21mV, correctly amplified,
can be measured\footnote{This would be a typical thermocouple amplifier circuit, where milli-volt signals
are produced by the Seebeck effect\cite{aoe}.}.
Should the sensor become disconnected the input will drift up due to the safety resistor $R18$.
This will cause the op-amp to supply its maximum voltage, telling the system the sensor reading is invalid.
Should the sensor become shorted, the input will fall below 3mV and the op-amp will
supply a voltage below 0.5 volts. Note that detecting the sensor breaking and becoming open, or
becoming disconnected, is the `raison d'être' of this safety addition.
This circuit would typically be used to amplify a thermocouple, which typically
fails by going open circuit.
It {\em does}
detect several other failure modes of this circuit and a full analysis is given in appendix \ref{mvamp}.
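The input range quoted above can be checked with a short calculation (a sketch; the symbols $G$, $V_{in,min}$ and $V_{in,max}$ are introduced here for illustration and do not appear on the circuit diagram). With the gain
\[
G = \frac{150 \times 10^{3}}{820} + 1 \approx 183.9 ,
\]
the ADC window of 0.5 to 4 volts corresponds to an input window of
\[
V_{in,min} = \frac{0.5}{G} \approx 2.7\,\mathrm{mV},
\qquad
V_{in,max} = \frac{4}{G} \approx 21.7\,\mathrm{mV},
\]
which agrees with the rounded figures of 3mV and 21mV. An ADC reading outside this window is then treated as a fault indication rather than as a measurement.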
\fmodegloss
% Note C14 shorting is potentially v dangerous could lead to a high output by the opamp being seen as a
% low temperature.
%
\paragraph{Self Checking}
This introduces a level of self checking into the system. This introduces a level of self checking into the system.
We need to be able to react to not only errors in the process its self, Admittedly this is the simplest failure mode scenario (that the
but also validate and look for internal errors in the control system. sensor is not wired correctly or has become disconnected).
%
This safety resistor has a side effect: it also checks for internal errors
that could occur in this circuit.
Should the input resistor $R22$ go OPEN this would be detected.
Should the gain resistors $R30$ or $R26$ go OPEN or SHORT a fault condition will be detected.
%
\paragraph{Not rigorous, but tested by time}
This is a typical example of an industry standard circuit that has been
thought through, and in practice works and detects most commonly encountered failure modes.
But it is not rigorous: it does not take into account every failure
mode of every component in it.
This leads on to an important concept of three main states of a safety critical system. However it does lead on to an important concept of three main states of a safety critical system.
%
\paragraph{Working, safe fault mode, dangerous fault mode}
% To improve productivity, performance, and cost-effectiveness, we are developing more and more safety-critical systems that are under computer control. And centralized computer control is enabling many safety-critical systems (e.g., chemical and pesticide factories) to grow in size, complexity, and potential for catastrophic failure. We can say that a safety critical system may be said to have three distinct
% We use software to control our factories and refineries as well as power generation and distribution. We also use software in our transportation systems including airplanes, trains, ships, subways, and even in our family automobiles. Software is also a major component of many medical systems in which safe functioning is critical to the safety of patients and operators alike. Even when the software does not directly control safety-critical hardware, software can provide operators and users with safety-critical data with which they must make safety-critical decisions (e.g., air traffic control or medical information such as blood bank records, organ donor information, and patient medical records). As we have come to rely more on software-intensive systems, we have come to rely more on those systems functioning safely.
% Many accidents are caused by problems with system and software requirements, and “empirical evidence seems to validate the commonly stated hypothesis that the majority of safety problems arise from software requirements and not coding errors” [Leveson1995]. Major accidents often result from rare hazards, whereby a hazard is a combination of conditions that increases the likelihood of accidents causing harm to valuable assets (e.g., people, property, and/or the environment). Most requirements specifications are incomplete in that they do not specify requirements to eliminate these rare hazards or mitigate their consequences. Requirements specifications are also typically incomplete in that they do not specify what needs to happen in exceptional “rainy day” situations or as a response to each possible event in each possible system state although accidents are often caused by the incorrect handling of rare combinations of events and states that were considered to be either impossible or too unlikely to worry about, and were therefore never specified. Even when requirements have been specified for such rare combinations of events and conditions, they may well be ambiguous (an unfortunately common characteristic of requirements in practice), partially incomplete (missing assumptions obvious only to subject matter experts), or incorrect, or inconsistently implemented. Thus, the associated hazards are not eliminated or the resulting harm is not properly mitigated when the associated accidents occur. Ultimately, safety related requirements are important requirements that need to be better engineered.
% The goal of this column is to define safety requirements and clarify how they differ from safety constraints and from functional, data, and interface requirements that happen to be safety critical. I start by defining safety in terms of a powerful quality model and show how quality requirements (including safety requirements) can be specified in terms of the components of this quality model. I will then show how to use the quality model to specify safety requirements. Then, I will define and discuss safety constraints and safety-critical requirements. Finally, I will pose a set of questions regarding the engineering of these three kinds of safety-related requirements for future research and experience to answer.
Safety critical systems in the context of this study, means that a safety critical system may be said to be in three distinct
overall states. overall states.
Operating normally, operating in a lockout mode with a detected fault, and operating Operating normally, operating in a safe mode with a fault, and operating
dangerously with an undetected fault. dangerously with a fault.
%
The main role of the system designers of safety critical equipment should be to eliminate the possibility of this last condition. The main role of the system designers of safety critical equipment should be
to reduce the possibility of this last condition.
% Software plays a critical role in almost every aspect facet of our daily lives - from , to driving our cars, to working in our offices. % Software plays a critical role in almost every aspect facet of our daily lives - from , to driving our cars, to working in our offices.
% Some of these systems are safety-critical. % Some of these systems are safety-critical.
@ -263,19 +583,19 @@ The main role of the system designers of safety critical equipment should be to
\section{Motivation for developing a formal methodology} \section{Motivation for developing a formal methodology}
A feature of many safety critical systems specifications, A feature of some newer safety critical systems standards,
including EN298, EN230 \cite{EN298} \cite{EN230} including the gas burner standard EN298~\cite{en298}[Section 9]
is to demand, is to demand,
at the very least that single failures of hardware at the very least that single failures of hardware
or software cannot or software cannot
create an unsafe condition in operational plant. Further to this create an unsafe condition in operational plant. Further to this
a second fault introduced, must not cause an unsafe state, due a second fault introduced, must not cause an unsafe state, due
to the combation of both faults. to the combination of both faults.
\vskip 0.3cm \vskip 0.3cm
This sounds like an entirely reasonable requirement. But to rigorously This sounds like an entirely reasonable requirement. But to rigorously
check the effect a particular component fault has on the system, check the effect a particular component fault has on the system,
we could check its effect on all other components. we could check its effect on all other components.
Should a diode in the powersupply fail in a particular way, by perhaps Should a diode in the power supply fail in a particular way, by perhaps
introducing a ripple voltage, we should have to look at all components introducing a ripple voltage, we should have to look at all components
in the system to see how they will be affected. in the system to see how they will be affected.
@ -314,7 +634,7 @@ for individual component failures and their effects on other components when the
For a very small system with say 1000 failure modes this would demand a potential of 500,000 For a very small system with say 1000 failure modes this would demand a potential of 500,000
checks for any automated checking process. checks for any automated checking process.
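One way to arrive at a figure of this size (an assumption about how the estimate was formed, with $N$ introduced here for the number of failure modes) is to count the unordered pairs of failure modes that would have to be checked against one another:
\[
\frac{N(N-1)}{2} = \frac{1000 \times 999}{2} = 499\,500 \approx 500\,000 .
\]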
\vskip 0.3cm \vskip 0.3cm
European legislation\cite{EN298} directs that a system must be able to react to two component failures European legislation\cite{en298} directs that a system must be able to react to two component failures
and not go into a dangerous state. and not go into a dangerous state.
\vskip 0.3cm \vskip 0.3cm
This raises an interesting problem from the point of view of formal modelling. Here we have a binary cross product of all components This raises an interesting problem from the point of view of formal modelling. Here we have a binary cross product of all components
@ -332,19 +652,44 @@ are commonly used for the gas certification process. Thus to
manually check this number of combinations of faults is in practise impossible. manually check this number of combinations of faults is in practise impossible.
A technique of modularising, or breaking down the problem is clearly necessary. A technique of modularising, or breaking down the problem is clearly necessary.
\section{Challenger Disaster} \section{Examples of disasters caused by designs \\ missing component errors}
\subsection{Challenger Disaster}
One question that anyone developing a safety critical analysis design tool One question that anyone developing a safety critical analysis design tool
could do well to answer, is how the methodology would cope with known previous disasters. could do well to answer, is how the methodology would cope with known previous disasters.
The Challenger disaster is a good example, and was well documented and invistigated. The Challenger disaster is a good example, and was well documented and investigated~\cite{challenger}.
The problem lay in a seal that had an operating temperature range. The problem lay in a seal that had an operating temperature range.
On the day of the launch the temperature of this seal was out of range. On the day of the launch the temperature of this seal was out of range.
A bottom up safety approach would have revealed this as a fault. A bottom up safety approach would have revealed this as a fault.
\section{Problems with Natural Language} The FTA in use by NASA and the US Nuclear regulatory commission
allows for environmental considerations such as temperature\cite{nasafta}\cite{nucfta}.
But because of the top down nature of the FTA technique, the safety designer must be aware of
the environmental constraints of all component parts in order to use this correctly.
This element of FTA is discussed in \ref{surveysc}.
Written natural language desciptions can not only be ambiguous or easy to misinterpret, it \subsection{Therac 25}
The Therac-25 was a computer controlled radiation therapy machine, which
overdosed six people between 1985 and 1987.
An earlier computerised model, the Therac-20, used the same software but kept the
hardware interlocks from the previous manually operated machines. The hardware interlocks
on the Therac-20 functioned correctly and the faulty software in it caused no accidents.
A safety study for the device, using Fault Tree Analysis % \cite{nucfta}
carried out in 1983,
excluded the software \cite[App.~A]{safeware}.
\section{Practical problems in using formal methods}
%% Here need more detail of what therac 25 was and roughly how it failed
%% with refs to nancy
%% and then highlight the fact that the safety analysis did not integrate software and hardware domains.
\subsection{Problems with Natural Language}
Written natural language descriptions can not only be ambiguous or easy to misinterpret, it
is also not possible to apply mathematical checking to them. is also not possible to apply mathematical checking to them.
A mathematical model on the other hand can be checked for A mathematical model on the other hand can be checked for
@ -352,48 +697,62 @@ obvious faults, such as tautologies and contradictions, but also
intermediate results can be extracted and these checked. intermediate results can be extracted and these checked.
Mathematical modeling of systems is not new, the Z language Mathematical modeling of systems is not new, the Z language
has been used to model systems\cite{ince}. However this is not widely has been used to model physical and software systems\cite{ince}. However this is not widely
understood or studied even in engineering and scientific circles. understood or studied even in engineering and scientific circles.
Graphical techniques for representing the mathematics for Graphical techniques for representing the mathematics for
specifying systems, developed at Brighton and Kent university specifying systems, developed at Brighton and Kent university
have been used and extended by this author to create a methodology have been used and extended by this author to create a methodology
for modelling complex safety critical systems, using diagrams. for modelling complex safety critical systems, using diagrams.
This project uses a modified form of euler diagram used to represent propositional logic. This project uses a modified form of Euler diagram used to represent propositional logic.
%The propositional logic is used to analyse system components. %The propositional logic is used to analyse system components.
\section{Ideal System Designers world} \section{Determining Component Failure Modes}
\subsection{Electrical}
Generic component failure modes for common electrical parts can be found in MIL1991.
Most modern electrical components have associated data sheets. Usually these do not explicitly list
failure modes.
% watch out for log axis in graphs !
\subsection{Mechanical}
Find refs
\subsection{Software}
Software must run on a microprocessor/micro-controller, and these devices have a known set of failure modes.
The most common of these are RAM and ROM failures, but bugs in particular machine instructions
can also exist.
These can be checked for periodically.
Software bugs are unpredictable.
However, there are techniques to validate software.
These include monitoring the program timings (with watchdogs~\cite[p.~81]{embupsys} and internal checking)
and applying validation checks (such as independent functions to validate correct operation).
Imagine a world where, when ordering a component, or even a complex module
like a failsafe sensor/scientific instrument, one page of the datasheet
lists the failure modes of the system: all possible ways in which the component can fail,
and how it will react when it does.
\subsection{Environmentally determined failures}
Some systems and components are guaranteed to work within certain environmental constraints,
temperature being the most typical. Very often what happens to the system outside that range is not defined.
Where this is the case, these are undetectable errors.
\section{Project Goals}
\begin{itemize}
\item To create a bottom-up FMEA technique that permits a connected hierarchy to be
built representing the fault behavior of a system.
\item To create a procedure where no component failure mode can be accidentally ignored.
\item To create a user friendly formal common visual notation to represent fault modes
in Software, Electronic and Mechanical sub-systems.
\item To formally define this visual language in concrete and abstract domains.
\item To prove that the derived~components used to build the hierarchies
provide traceable fault handling from component level to the
highest abstract system `top level'.
\item To reduce the complexity of fault mode checking, by modularising and
building complexity-reducing hierarchies.
\item To formally define the hierarchies and the procedure for building them.
\item To produce a software tool to aid in the drawing of diagrams and
ensuring that all fault modes are addressed.
\item To provide a data model that can be used as a source for deterministic and probabilistic failure mode analysis reports.
\item To allow the possibility of MTTF calculation for statistical
reliability/safety calculations.
\end{itemize}
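
As an illustration of the kind of calculation the MTTF goal refers to (a sketch only, under assumptions not stated elsewhere in this chapter): suppose each of the $n$ components in a system has a constant failure rate $\lambda_i$, and the failure of any one component is taken to fail the system. The symbols $n$, $\lambda_i$ and $\lambda_{sys}$ are introduced here purely for the example. Then
\begin{equation}
\lambda_{sys} = \sum_{i=1}^{n} \lambda_i , \qquad MTTF = \frac{1}{\lambda_{sys}} .
\end{equation}
For example, three hypothetical components rated at 10, 20 and 70 FIT (failures per $10^{9}$ hours) give $\lambda_{sys} = 100$~FIT $= 10^{-7}$ failures per hour, and hence an MTTF of $10^{7}$ hours.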
\end{document}