\section{Introduction}
This thesis describes the application of mathematical (formal) techniques to
the design of safety critical systems.
The initial motivation for this study was to create a methodology
applicable to industrial burner controllers.
The methodology developed was designed to cope with
both the specific `simultaneous failures' approach \cite{EN298} \cite{EN230} \cite{EN12067}
and the probability of dangerous failure approach \cite{EN61508}.
|
|
|
|
The visual notation developed was initially designed for electronic fault modelling.
However, it can be applied to the mechanical and software domains as well.
Because of this, a common notation and diagram style
can be used to model any integrated safety relevant system.
|
|
|
|
\section{Safety Critical Systems}
|
|
|
|
\subsection{General description of a Safety Critical System}
|
|
|
|
A safety critical system is one upon which lives may depend,
or one which has the potential to become dangerous.
|
|
|
|
|
|
An industrial burner is typical of plant that is potentially dangerous.
|
|
An incorrect air/fuel mixture can be explosive.
|
|
Medical electronics for automatically dispensing drugs or maintaining
life support are examples of systems upon which lives depend.
|
|
|
|
\subsection{Two approaches: Probabilistic, and Component fault tolerant}
|
|
|
|
There are two main philosophies applied to safety critical systems.
One is to specify an acceptable number of failures per hour of operation.
This is the probabilistic approach and is embodied in the European standard
EN61508 \cite{EN61508}.
|
|
|
|
The second philosophy, applied in application specific standards, is to investigate
components or sub-systems in the critical safety path, to look at their component failure modes,
and to ensure that these cannot cause dangerous faults.
The application specific standards contain detail specific to the process being controlled.
This philosophy is first mentioned in aircraft safety operation research studies of WWII,
where potential single faults (usually mechanical) were traced to
catastrophic failures.
|
|
|
|
% \cite{boffin}.
|
|
|
|
|
|
|
|
|
|
|
|
\subsection{Overview of regulation of safety critical systems}
|
|
|
|
A later chapter deals specifically with this; a quick overview is given here.
|
|
\subsubsection{Overview of system analysis philosophies}

\begin{itemize}
\item General safety standards
\item Specific safety standards
\end{itemize}
|
|
|
|
\subsubsection{Overview of current testing and certification}
|
|
A later chapter deals specifically with this; an overview is given now.
|
|
|
|
\section{Background to the Industrial Burner Safety Analysis Problem}
|
|
|
|
An industrial burner is a good example of a safety critical system.
|
|
It has the potential for devastating explosions due to boiler overpressure or
ignition of an explosive mixture and, because of the large amounts of fuel used,
is a potential fire hazard. Burners are often left running unattended 24/7.
|
|
|
|
To add to these problems,
operators are often under pressure to keep burners running. A boiler supplying
heat to a large greenhouse complex could ruin crops
should it go off-line. Similarly, a production line relying on heat or steam
can incur very expensive production down-time should it fail.
This places extra responsibility on the burner controller.
|
|
|
|
|
|
Industrial burners are commonplace and account for a very large proportion of the energy usage
in the world today (find and ref stats).
They are common enough to have specific standards
written for the different fuel types they use \cite{EN298} \cite{EN230} \cite{EN12067}.
|
|
|
|
A modern industrial burner has mechanical, electronic and software
|
|
elements that are all safety critical. That is to say,
|
|
unhandled failures could create dangerous faults.
|
|
|
|
A more detailed description of industrial burner controllers
is given in chapter~\ref{burnercontroller}.
|
|
|
|
\subsection{Mechanical components}

The mechanical parts include the gas valves and the air dampers.
These are operated by the electronic and software elements;
a diagram of how they all fit together is given in chapter~\ref{burnercontroller}.
|
\subsection{Electronic Components}
|
|
|
|
\subsection{Software/Firmware Components}
|
|
|
|
|
|
\subsection{A high level Fault Hierarchy for an Industrial Burner}
|
|
|
|
This section shows the fault hierarchy from the component level, rising through ever higher levels of abstraction
to the software levels and finally to an abstract top level. If the system has been
designed correctly, no `undetected faults' should be present at this top level.
|
|
|
|
\section{An Outline of the FMMD Technique}
|
|
|
|
The methodology takes a bottom up approach to
|
|
the design of an integrated system.
|
|
Each component is assigned a well defined set of failure modes.
|
|
The components are formed into modules, or functional groups.
|
|
These functional groups are analysed with respect to the failure modes of the
|
|
components. The `functional group' or module will have a set of derived
|
|
failure modes. The number of derived failure modes will be
|
|
less than or equal to the sum of the failure modes of all its components.
|
|
A `derived' set of failure modes is at a higher abstraction level.
Derived modules may now be used as building blocks, to model the system at
ever higher levels of abstraction until the top level is reached.
|
|
|
|
Any unhandled faults will appear at this top level and will be `un-resolved'.
|
|
A formal description of this process is dealt with in Chapter \ref{fmmddefinition}.
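
To make the bottom-up process concrete, the following is a minimal sketch, in Python,
of the composition step described above. The component names, failure modes and the
mapping to derived failure modes are purely illustrative and are not taken from the
burner system analysed later.

\begin{verbatim}
# Sketch of the FMMD composition step: components carry failure modes,
# a functional group is analysed, and the result is a new `derived'
# component usable as a building block one abstraction level up.
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    failure_modes: set = field(default_factory=set)

def analyse_functional_group(name, members, mapping):
    """Map each member failure mode to a derived failure mode of the
    group; the derived set is never larger than the sum of the
    members' failure modes."""
    derived = {mapping[fm] for c in members for fm in c.failure_modes}
    assert len(derived) <= sum(len(c.failure_modes) for c in members)
    return Component(name, derived)

# Illustrative example: two resistors forming a potential divider.
r1 = Component("R1", {"R1 OPEN", "R1 SHORT"})
r2 = Component("R2", {"R2 OPEN", "R2 SHORT"})
divider = analyse_functional_group(
    "potential divider", [r1, r2],
    {"R1 OPEN": "OUTPUT HIGH", "R1 SHORT": "OUTPUT LOW",
     "R2 OPEN": "OUTPUT LOW",  "R2 SHORT": "OUTPUT HIGH"})
print(divider.failure_modes)   # four component failure modes reduce to two
\end{verbatim}

The `potential divider' object can now itself be treated as a component with
two derived failure modes when it is placed in a larger functional group.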
|
|
|
|
|
|
%This principally focuses
|
|
%on simple control systems for maintaining temperature
|
|
%and for industrial burners. It is hoped that a general mathematical
|
|
%framework is created that can be applied to other fields of safety critical engineering.
|
|
|
|
Automated systems, as opposed to manual ones, are now the norm
in the home and in industry.

Automated systems have long been recognised as being more efficient and
more accurate than a human operator, and the reason for automating a process
is now more likely to be the cost savings available from better efficiency
than that achievable by a human operator \ref{burnereffency}.
|
|
|
|
For instance,
early automated systems were mechanical, with cams and levers simulating
fuel air mixture profile curves over the firing range.
Because fuels vary slightly in calorific value, and air density changes with the weather,
no fixed tuning can be optimal.
In fact, for aesthetic reasons (not wanting smoke to appear at the flue)
the tuning was often air rich, causing excess air to be heated and
unnecessarily passed through the burner, leading to a direct loss of energy.
An automated system analysing the combustion gasses and automatically
adjusting the fuel air mix can bring the efficiency very close to theoretical levels.
|
|
|
|
|
|
As automation takes over more and more functions from the human operator, it also takes on more responsibility.
A classic example of an automated system failing is the Therac-25.
This was a radiation therapy machine that, due to software errors,
caused the deaths of several patients and injured others during the 1980s.
|
|
|
|
|
|
% http://en.wikipedia.org/wiki/Autopilot
|
|
To take the example of an autopilot: early autopilots were simple (they
prevented the aircraft straying from a compass bearing and kept it flying straight and level).
Were one to fail, the pilot would notice quite quickly
and resume manual control of the bearing.
|
|
|
|
Modern autopilots control all aspects of flight including the engines, and the take-off and landing phases.
The automated system does not have the
common sense of a human pilot; if fed the wrong sensory information
it could make horrendous mistakes. This means that simply reading sensors and applying control
corrections cannot be enough.
Checking for error conditions must also be incorporated.
The system could also develop an internal fault, and must be able to cope with this.
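
The following is a small sketch of the pattern argued for above: read the sensors,
check them for error conditions, check for internal faults, and only then apply a
control correction. The sensor names, plausible ranges and self test are invented
for illustration only.

\begin{verbatim}
# Sketch of a control step that validates its inputs and itself before
# acting.  Out-of-range readings and failed self tests are treated as
# detected faults rather than being fed into the control law.
def control_step(read_airspeed, read_altitude, self_test, apply_correction):
    if not self_test():                     # internal fault detected
        return "detected fault"

    airspeed = read_airspeed()              # knots, illustrative limits
    altitude = read_altitude()              # feet, illustrative limits
    if not (40.0 <= airspeed <= 400.0) or not (0.0 <= altitude <= 45000.0):
        return "detected fault"             # implausible sensor reading

    apply_correction(airspeed, altitude)    # only now act on the readings
    return "normal"

# Illustrative call with dummy sensor and actuator functions.
state = control_step(lambda: 250.0, lambda: 30000.0, lambda: True,
                     lambda a, h: None)
print(state)   # "normal"
\end{verbatim}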
|
|
|
|
|
|
Systems such as industrial burners have been partially automated for some time.
|
|
A mechanical cam arrangement controls the flow of air and fuel for the range of
|
|
firing rate (output of the boiler).
|
|
|
|
These mechanical systems could suffer failures (such as a mechanical linkage becoming
detached) and could then operate in a potentially dangerous state.
|
|
|
|
More modern burner controllers use a safety critical computer controlling
|
|
motors to operate the fuel and air mixture and to control the safety
|
|
valves.
|
|
|
|
While working in the industrial burner industry and submitting product for
North American and European safety approval, it became apparent that
formal techniques could be applied to aspects of the circuit design.
Some safety critical circuitry would be subjected to thought experiments, where
the effects of one or more components failing would be examined.
As a simple example, consider a milli-volt input becoming disconnected.
A milli-volt input is typically amplified so that its range matches that
of the A to D converter reading it. Were this signal source to become disconnected,
the system would see a floating, amplified signal.
A high impedance safety resistor can be added to the circuit
to pull the signal high (or out of normal range) upon disconnection.
The system then knows that a fault has occurred and will not use
that sensor reading (see figure~\ref{fig:millivolt}).
|
|
|
|
|
|
|
|
\begin{figure}
|
|
\vskip 7cm
|
|
\special{psfile=introduction/millivoltsensor.ps hoffset=0 voffset=0 hscale=35 vscale=35 }\caption[Milli-Volt Sensor with safety resistor]{
|
|
Milli-Volt Sensor with safety resistor
|
|
\label{fig:millivolt}}
|
|
\end{figure}
|
|
|
|
For example, if the sensor supplies a range of 0 to 40mV, and RG1 and RG2 are such that the op-amp supplies a gain of 100,
any signal between 0 and 4 volts on the ADC will be considered in range. Should the sensor become disconnected, the
op-amp will supply its maximum voltage, telling the system the sensor reading is invalid.
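
A minimal sketch, in Python, of the software side of this scheme follows, using the
figures quoted above (0 to 40mV sensor, gain of 100, 0 to 4 volt valid window). The
exact thresholds and the treatment of an out-of-range reading are assumptions for
illustration, not values taken from a particular design.

\begin{verbatim}
# Readings outside the expected 0-4 V window (e.g. the op-amp driven to
# its rail because the safety resistor has pulled a disconnected input
# out of range) are rejected rather than used.
GAIN = 100.0                      # op-amp gain set by RG1 and RG2
VALID_MIN, VALID_MAX = 0.0, 4.0   # volts at the ADC for a 0-40 mV sensor

def read_millivolt_sensor(adc_volts):
    """Return the sensor reading in mV, or None if the reading is out
    of range and must not be used (a detected fault)."""
    if adc_volts < VALID_MIN or adc_volts > VALID_MAX:
        return None
    return 1000.0 * adc_volts / GAIN

print(read_millivolt_sensor(2.0))   # 20.0 mV, in range
print(read_millivolt_sensor(4.9))   # None: sensor assumed disconnected
\end{verbatim}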
|
|
|
|
This introduces a level of self checking into the system.
We need to be able to react not only to errors in the process itself,
but also to validate the control system and look for internal errors within it.

This leads on to an important concept: the three main states of a safety critical system.
|
|
|
|
|
|
% To improve productivity, performance, and cost-effectiveness, we are developing more and more safety-critical systems that are under computer control. And centralized computer control is enabling many safety-critical systems (e.g., chemical and pesticide factories) to grow in size, complexity, and potential for catastrophic failure.
|
|
|
|
% We use software to control our factories and refineries as well as power generation and distribution. We also use software in our transportation systems including airplanes, trains, ships, subways, and even in our family automobiles. Software is also a major component of many medical systems in which safe functioning is critical to the safety of patients and operators alike. Even when the software does not directly control safety-critical hardware, software can provide operators and users with safety-critical data with which they must make safety-critical decisions (e.g., air traffic control or medical information such as blood bank records, organ donor information, and patient medical records). As we have come to rely more on software-intensive systems, we have come to rely more on those systems functioning safely.
|
|
|
|
% Many accidents are caused by problems with system and software requirements, and “empirical evidence seems to validate the commonly stated hypothesis that the majority of safety problems arise from software requirements and not coding errors” [Leveson1995]. Major accidents often result from rare hazards, whereby a hazard is a combination of conditions that increases the likelihood of accidents causing harm to valuable assets (e.g., people, property, and/or the environment). Most requirements specifications are incomplete in that they do not specify requirements to eliminate these rare hazards or mitigate their consequences. Requirements specifications are also typically incomplete in that they do not specify what needs to happen in exceptional “rainy day” situations or as a response to each possible event in each possible system state although accidents are often caused by the incorrect handling of rare combinations of events and states that were considered to be either impossible or too unlikely to worry about, and were therefore never specified. Even when requirements have been specified for such rare combinations of events and conditions, they may well be ambiguous (an unfortunately common characteristic of requirements in practice), partially incomplete (missing assumptions obvious only to subject matter experts), or incorrect, or inconsistently implemented. Thus, the associated hazards are not eliminated or the resulting harm is not properly mitigated when the associated accidents occur. Ultimately, safety related requirements are important requirements that need to be better engineered.
|
|
|
|
% The goal of this column is to define safety requirements and clarify how they differ from safety constraints and from functional, data, and interface requirements that happen to be safety critical. I start by defining safety in terms of a powerful quality model and show how quality requirements (including safety requirements) can be specified in terms of the components of this quality model. I will then show how to use the quality model to specify safety requirements. Then, I will define and discuss safety constraints and safety-critical requirements. Finally, I will pose a set of questions regarding the engineering of these three kinds of safety-related requirements for future research and experience to answer.
|
|
|
|
Safety critical systems, in the context of this study, may be said to occupy one of three distinct
overall states:
operating normally, operating in a lockout mode with a detected fault, or operating
dangerously with an undetected fault.

The main role of the designers of safety critical equipment should be to eliminate the possibility of this last condition.
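
Expressed as a simple enumeration (a sketch only; the state names are a paraphrase of
the three conditions above, not terms taken from a standard):

\begin{verbatim}
from enum import Enum, auto

class SystemState(Enum):
    NORMAL = auto()            # operating normally
    LOCKOUT = auto()           # a fault has been detected; safe shutdown
    UNDETECTED_FAULT = auto()  # operating dangerously with an unhandled fault
    # A correctly analysed design aims to make the last state unreachable.
\end{verbatim}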
|
|
|
|
% Software plays a critical role in almost every aspect facet of our daily lives - from , to driving our cars, to working in our offices.
|
|
% Some of these systems are safety-critical.
|
|
% Failure of software could cause catastrophic consequences for human life.
|
|
% Imagine the antilock brake system (ABS) in your car.
|
|
% A software failure here could render the ABS inoperable at a time when you need it most.
|
|
% For these types of safety-critical systems, having guidelines that define processes and
|
|
% objectives for the creation of software that focus on software quality, or the ability
|
|
% to use software that has been developed under this scrutiny, has tremendous value
|
|
% for developers of safety-critical systems.
|
|
|
|
\section{Motivation for developing a formal methodology}
|
|
|
|
A feature of many safety critical system specifications,
including EN298 and EN230 \cite{EN298} \cite{EN230},
is to demand,
at the very least, that single failures of hardware
or software cannot
create an unsafe condition in operational plant. Further to this,
a second fault introduced must not cause an unsafe state due
to the combination of both faults.
|
|
\vskip 0.3cm
|
|
This sounds like an entirely reasonable requirement. But to rigorously
check the effect a particular component fault has on the system,
we must check its effect on all other components.
|
|
Should a diode in the power supply fail in a particular way, by perhaps
introducing a ripple voltage, we would have to look at all components
in the system to see how they will be affected.
|
|
|
|
%However consider a typical
|
|
%small system with perhaps 1000 components each
|
|
%with an average of say 5 failure modes.
|
|
Thus, to ensure complete coverage, each of the effects of
|
|
the failure modes must be applied
|
|
to all the other components.
|
|
Each component must be checked against the
|
|
failure modes of all other components in the system.
|
|
Mathematically, with components denoted $c$ and failure modes denoted $Fm$:
|
|
|
|
|
|
\begin{equation}
\label{crossprodsingle}
checks = \{ \; (Fm_{\stackrel{\wedge}{c}},c) \; \mid \; \stackrel{\wedge}{c} \neq c \; \}
\end{equation}
|
|
|
|
Where demands
are made for resilience against two
simultaneous failures, this effectively squares the number of checks to be made.
|
|
\begin{equation}
\label{crossproddouble}
doublechecks = \{ \; (Fm_{c_{1}},Fm_{c_{2}},c) \; \mid \; c_{1} \neq c_{2} \; \wedge \; Fm_{c_{1}} \neq Fm_{c_{2}} \; \}
\end{equation}
|
|
|
|
|
|
If we consider a system which has a total of
$N$ failure modes (see equation \ref{crossprodsingle}), this would mean checking a maximum of
\begin{equation}
NumberOfChecks = \frac{N(N-1)}{2}
\end{equation}
|
|
|
|
for individual component failures and their effects on other components when they fail.
|
|
For a very small system with, say, 1000 failure modes this would demand nearly 500,000
checks for any automated checking process.
|
|
\vskip 0.3cm
|
|
European legislation\cite{EN298} directs that a system must be able to react to two component failures
|
|
and not go into a dangerous state.
|
|
\vskip 0.3cm
|
|
This raises an interesting problem from the point of view of formal modelling. Here we have a binary cross product of all components
|
|
(see equation \ref{crossproddouble}).
|
|
This increases the number of checks greatly: the binary cross product gives $(N^{2} - N)/2$ pairs, each of which has to be checked against the remaining
$(N-2)$ components.
|
|
\begin{equation}
\label{numberofchecks}
NumberOfChecks = \frac{(N^{2} - N)(N - 2)}{2}
\end{equation}
|
|
|
|
Thus for a 1000 failure mode system, roughly half a billion possible checks would be required for the double simultaneous failure scenario. This astronomical number of potential combinations has made formal analysis of this
type of system, up until now, impractical. Fault simulators %\cite{sim}
are commonly used for the gas certification process, but to
manually check this number of combinations of faults is in practice impossible.
A technique of modularising, or breaking down, the problem is clearly necessary.
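
The sizes quoted above can be checked directly; the short Python fragment below
evaluates the counts $N(N-1)/2$ and $(N^{2}-N)(N-2)/2$ for $N = 1000$ failure modes.

\begin{verbatim}
# Check the check-count arithmetic for a system of N failure modes.
from math import comb

N = 1000
single = N * (N - 1) // 2            # == comb(N, 2): pairs of failure modes
double = (N**2 - N) * (N - 2) // 2   # each pair checked against N-2 others

print(single)   # 499500      (roughly 500,000)
print(double)   # 498501000   (roughly half a billion)
assert single == comb(N, 2)
\end{verbatim}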
|
|
|
|
\section{Challenger Disaster}
|
|
|
|
One question that anyone developing a safety critical analysis design tool
would do well to answer is how the methodology would cope with known previous disasters.
The Challenger disaster is a good example, and was well documented and investigated.
|
|
|
|
The problem lay in a seal that had an operating temperature range.
|
|
On the day of the launch the temperature of this seal was out of range.
|
|
A bottom up safety approach would have revealed this as a fault.
|
|
|
|
\section{Problems with Natural Language}
|
|
|
|
Written natural language descriptions can not only be ambiguous or easy to misinterpret;
it is also not possible to apply mathematical checking to them.
|
|
|
|
A mathematical model, on the other hand, can be checked for
obvious faults such as tautologies and contradictions; intermediate
results can also be extracted and checked.
|
|
|
|
Mathematical modelling of systems is not new; the Z language
has been used to model systems\cite{ince}. However, this is not widely
understood or studied, even in engineering and scientific circles.
Graphical techniques for representing the mathematics for
specifying systems, developed at Brighton and Kent universities,
have been used and extended by this author to create a methodology
for modelling complex safety critical systems using diagrams.
|
|
|
|
This project uses a modified form of Euler diagram to represent propositional logic.
|
|
%The propositional logic is used to analyse system components.
|
|
|
|
|
|
\section{The Ideal System Designer's World}
|
|
|
|
Imagine a world where, when ordering a component, or even a complex module
like a failsafe sensor or scientific instrument, one page of the datasheet
lists the failure modes of the device: all possible ways in which the component can fail,
and how it will react when it does.
|
|
|
|
\subsection{Environmentally determined failures}
|
|
|
|
Some systems and components are guaranteed to work within certain environmental constraints,
|
|
temperature being the most typical. Very often, what happens to the system outside that range is not defined.
Where this is the case, failures occurring outside that range are undetectable errors.
|
|
|
|
|
|
\section{Project Goals}
|
|
|
|
\begin{itemize}
|
|
\item To create a user friendly, formal, common visual notation to represent fault modes
in Software, Electronic and Mechanical sub-systems.

\item To formally define this visual language.

\item To prove that the modules may be combined into hierarchies that
truly represent the fault handling from component level to the
highest abstract system `top level'.

\item To reduce the complexity of fault mode checking, by modularising and
building complexity reducing hierarchies.

\item To formally define the hierarchies and the procedure for building them.

\item To produce a software tool to aid in the drawing of diagrams and
ensuring that all fault modes are addressed.

\item To allow the possibility of MTTF calculation for statistical
reliability/safety calculations.
|
|
\end{itemize}
|
|
|
|
|
|
|