\section{Introduction}
This thesis describes the application of mathematical (formal) techniques to
the design of safety critical systems.
The initial motivation for this study was to create a system
applicable to industrial burner controllers.
The methodology developed was designed to cope with
both the specific `simultaneous failures' approach \cite{EN298,EN230,EN12067}
and the probability of dangerous failure approach \cite{EN61508}.
The visual notation developed was initially designed for electronic fault modelling.
However, it can be applied to the mechanical and software domains as well.
Because of this, a common notation/diagram style
can be used to model any integrated safety relevant system.
\section{Safety Critical Systems}
\subsection{General description of a Safety Critical System}
A safety critical system is one upon which lives may depend, or
one which has the potential to become dangerous.
An industrial burner is typical of plant that is potentially dangerous.
An incorrect air/fuel mixture can be explosive.
Medical electronics for automatically dispensing drugs or maintaining
life support are examples of systems that lives depend upon.
\subsection{Two approaches: Probabilistic and Component Fault Tolerant}
There are two main philosophies applied to safety critical systems.
One is to specify an acceptable number of failures per hour of operation.
This is the probabilistic approach and is embodied in the European standard
EN61508 \cite{EN61508}.
The second philosophy, applied in application specific standards, is to investigate
components or sub-systems in the critical safety path, to look at component failure modes,
and to ensure that they cannot cause dangerous faults.
The application specific standards incorporate detail
specific to the process being controlled.
This philosophy was first applied in aircraft safety operational research
studies during WWII, where potential single faults (usually mechanical) were traced to
catastrophic failures.
% \cite{boffin}.
\subsection{Overview of regulation of safety critical systems}
The regulation of safety critical systems is dealt with in detail in a later chapter; a brief overview is given here.
\subsubsection{Overview of system analysis philosophies}
\begin{itemize}
\item General safety standards
\item Specific safety standards
\end{itemize}
\subsubsection{Overview of current testing and certification}
Testing and certification are covered in detail in a later chapter; a brief overview is given here.
\section{Background to the Industrial Burner Safety Analysis Problem}
An industrial burner is a good example of a safety critical system.
It has the potential for devastating explosions due to boiler over-pressure, or
ignition of an explosive mixture, and, because of the large amounts of fuel used,
is a potential fire hazard. Burners are often left running unattended 24/7.
To add to these problems,
operators are often under pressure to keep them running. A boiler supplying
heat to a large greenhouse complex could ruin crops
should it go off-line. Similarly, a production line relying on heat or steam
can incur very expensive production down-time should it fail.
This places extra responsibility on the burner controller.
Industrial burners are commonplace and account for a very large proportion of the energy usage
in the world today (find and ref stats).
They are common enough to have specific standards
written for the fuel types they use \cite{EN298,EN230,EN12067}.
A modern industrial burner has mechanical, electronic and software
elements that are all safety critical. That is to say,
unhandled failures could create dangerous faults.
A more detailed description of industrial burner controllers
is dealt with in chapter~\ref{burnercontroller}.
\subsection{Mechanical components}
The mechanical parts include the gas valves and the air damper, which are driven by
the electronic and software elements of the controller; how these parts fit together
is described in chapter~\ref{burnercontroller}.
\subsection{Electronic Components}
\subsection{Software/Firmware Components}
\subsection{A high level Fault Hierarchy for an Industrial Burner}
This section shows the fault hierarchy from the component level, rising through higher levels
of abstraction to the software levels and finally to a top, abstract, system level. If the system has been
designed correctly no `undetected faults' should be present at this top level.
\section{An Outline of the FMMD Technique}
The methodology takes a bottom up approach to
the design of an integrated system.
Each component is assigned a well defined set of failure modes.
The components are formed into modules, or functional groups.
These functional groups are analysed with respect to the failure modes of the
components. The `functional group' or module will have a set of derived
failure modes. The number of derived failure modes will be
less than or equal to the sum of the failure modes of all its components.
A `derived' set of failure modes is at a higher abstraction level.
Derived modules may now be used as building blocks to model the system at
ever higher levels of abstraction until the top level is reached.
Any unhandled faults will appear at this top level and will be `un-resolved'.
A formal description of this process is dealt with in Chapter \ref{fmmddefinition}.
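As an informal illustration of this bottom-up process, the sketch below (in Python, with hypothetical component names, failure modes and mappings that are not taken from any real analysis) shows components with failure modes being collected into a functional group, which then exposes a smaller, derived set of failure modes:
\begin{verbatim}
# A minimal sketch of the bottom-up FMMD idea: components with failure
# modes are grouped, and the group exposes a derived set of failure
# modes. All names below are hypothetical illustrations.

class Component:
    def __init__(self, name, failure_modes):
        self.name = name
        self.failure_modes = set(failure_modes)

class FunctionalGroup:
    def __init__(self, name, components):
        self.name = name
        self.components = components
        self.derived_failure_modes = set()

    def derive(self, mapping):
        # 'mapping' assigns each component failure mode to a derived
        # (group level) failure mode; the derived set can never be
        # larger than the sum of the component failure modes.
        for component in self.components:
            for fm in component.failure_modes:
                self.derived_failure_modes.add(mapping[(component.name, fm)])
        return self.derived_failure_modes

# Hypothetical example: two resistors forming a potential divider.
r1 = Component("R1", ["OPEN", "SHORT"])
r2 = Component("R2", ["OPEN", "SHORT"])
divider = FunctionalGroup("potential_divider", [r1, r2])
derived = divider.derive({
    ("R1", "OPEN"): "OUTPUT_HIGH", ("R1", "SHORT"): "OUTPUT_LOW",
    ("R2", "OPEN"): "OUTPUT_LOW",  ("R2", "SHORT"): "OUTPUT_HIGH",
})
print(sorted(derived))  # ['OUTPUT_HIGH', 'OUTPUT_LOW']
\end{verbatim}
Here four component failure modes collapse into two derived failure modes for the group, which can itself be treated as a building block at the next level of abstraction.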
%This principally focuses
%on simple control systems for maintaining temperature
%and for industrial burners. It is hoped that a general mathematical
%framework is created that can be applied to other fields of safety critical engineering.
Automated systems, as opposed to manual ones, are now the norm
in the home and in industry.
Automated systems have long been recognised as being more efficient and
more accurate than a human operator, and the reason for automating a process
is now more likely to be the cost savings available through better efficiency
than the cost of a human operator \ref{burnereffency}.
For instance,
early automated burner systems were mechanical, with cams and levers simulating
fuel/air mixture profile curves over the firing range.
Because fuels vary slightly in calorific value, and air density changes with the weather, no fixed tuning can be optimal.
In fact, for aesthetic reasons (not wanting smoke to appear at the flue)
the tuning was often air rich, causing excess air to be heated and
unnecessarily passed through the burner, leading to a direct loss of energy.
An automated system analysing the combustion gases and automatically
adjusting the fuel/air mix can bring the efficiency very close to theoretical levels.
As the automation takes over more and more functions from the human operator it also takes on more responsibility.
A classic example of an automated system failing is the Therac-25.
This was a radiation therapy machine that, due to software errors,
caused the deaths of several patients and injured more during the 1980s.
% http://en.wikipedia.org/wiki/Autopilot
To take the example of an autopilot: simple early autopilots merely held a course (i.e. they
prevented the aircraft from straying from a compass bearing and kept it flying straight and level).
Were they to fail, the pilot would notice quite quickly
and resume manual control of the bearing.
Modern autopilots control all aspects of flight, including the engines and the take-off and landing phases.
The automated system does not have the
common sense of a human pilot either; if fed the wrong sensory information
it could make horrendous mistakes. This means that simply reading sensors and applying control
corrections cannot be enough.
Checking for error conditions must also be incorporated.
The system could also develop an internal fault, and must be able to cope with this.
Systems such as industrial burners have been partially automated for some time.
A mechanical cam arrangement controls the flow of air and fuel for the range of
firing rate (output of the boiler).
These mechanical systems could suffer failures (such as a mechanical linkage becoming
detached) and could then operate in a potentially dangerous state.
More modern burner controllers use a safety critical computer controlling
motors to operate the fuel and air mixture and to control the safety
valves.
While working in the industrial burner industry and submitting product for
North American and European safety approval, it became apparent that
formal techniques could be applied to aspects of the circuit design.
Some safety critical circuitry would be subjected to thought experiments, where
the actions of one or more components failing would be examined.
As a simple example, a milli-volt input could become disconnected.
A milli-volt input is typically amplified so that its range matches that
of the A/D converter reading it. Were this signal source to become disconnected,
the system would see a floating, amplified signal.
A high impedance safety resistor can be added to the circuit
to pull the signal high (or out of normal range) upon disconnection.
The system then knows that a fault has occurred and will not use
that sensor reading (see figure~\ref{fig:millivolt}).
\begin{figure}
\vskip 7cm
\special{psfile=introduction/millivoltsensor.ps hoffset=0 voffset=0 hscale=35 vscale=35 }\caption[Milli-Volt Sensor with safety resistor]{
Milli-Volt Sensor with safety resistor
\label{fig:millivolt}}
\end{figure}
For example, if the sensor supplies a range of 0 to 40mV, and RG1 and RG2 are such that the op-amp supplies a gain of 100,
any signal between 0 and 4 volts at the ADC will be considered in range. Should the sensor become disconnected, the
op-amp will output its maximum voltage, telling the system that the sensor reading is invalid.
This introduces a level of self checking into the system.
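As a minimal sketch of this self check (assuming the 0 to 40mV sensor and gain of 100 from the example above; the function name and thresholds are hypothetical, not from a real product):
\begin{verbatim}
# Minimal sketch of the milli-volt input self check described above.
# The limits assume a 0-40mV sensor amplified by a gain of 100; the
# names and thresholds are illustrative only.

SENSOR_MIN_V = 0.0   # 0 mV * gain 100
SENSOR_MAX_V = 4.0   # 40 mV * gain 100

def read_sensor(adc_voltage):
    """Return (value, fault): the value when in range, a fault flag otherwise."""
    if SENSOR_MIN_V <= adc_voltage <= SENSOR_MAX_V:
        return adc_voltage, False
    # Out of range: the safety resistor has pulled the input out of the
    # expected band, so the reading is treated as a detected fault.
    return None, True
\end{verbatim}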
We need to be able to react not only to errors in the process itself,
but also to validate the control system and look for internal errors within it.
This leads on to the important concept of the three main states of a safety critical system.
% To improve productivity, performance, and cost-effectiveness, we are developing more and more safety-critical systems that are under computer control. And centralized computer control is enabling many safety-critical systems (e.g., chemical and pesticide factories) to grow in size, complexity, and potential for catastrophic failure.
% We use software to control our factories and refineries as well as power generation and distribution. We also use software in our transportation systems including airplanes, trains, ships, subways, and even in our family automobiles. Software is also a major component of many medical systems in which safe functioning is critical to the safety of patients and operators alike. Even when the software does not directly control safety-critical hardware, software can provide operators and users with safety-critical data with which they must make safety-critical decisions (e.g., air traffic control or medical information such as blood bank records, organ donor information, and patient medical records). As we have come to rely more on software-intensive systems, we have come to rely more on those systems functioning safely.
% Many accidents are caused by problems with system and software requirements, and “empirical evidence seems to validate the commonly stated hypothesis that the majority of safety problems arise from software requirements and not coding errors” [Leveson1995]. Major accidents often result from rare hazards, whereby a hazard is a combination of conditions that increases the likelihood of accidents causing harm to valuable assets (e.g., people, property, and/or the environment). Most requirements specifications are incomplete in that they do not specify requirements to eliminate these rare hazards or mitigate their consequences. Requirements specifications are also typically incomplete in that they do not specify what needs to happen in exceptional “rainy day” situations or as a response to each possible event in each possible system state although accidents are often caused by the incorrect handling of rare combinations of events and states that were considered to be either impossible or too unlikely to worry about, and were therefore never specified. Even when requirements have been specified for such rare combinations of events and conditions, they may well be ambiguous (an unfortunately common characteristic of requirements in practice), partially incomplete (missing assumptions obvious only to subject matter experts), or incorrect, or inconsistently implemented. Thus, the associated hazards are not eliminated or the resulting harm is not properly mitigated when the associated accidents occur. Ultimately, safety related requirements are important requirements that need to be better engineered.
% The goal of this column is to define safety requirements and clarify how they differ from safety constraints and from functional, data, and interface requirements that happen to be safety critical. I start by defining safety in terms of a powerful quality model and show how quality requirements (including safety requirements) can be specified in terms of the components of this quality model. I will then show how to use the quality model to specify safety requirements. Then, I will define and discuss safety constraints and safety-critical requirements. Finally, I will pose a set of questions regarding the engineering of these three kinds of safety-related requirements for future research and experience to answer.
In the context of this study, a safety critical system may be said to be in one of three distinct
overall states:
operating normally, operating in a lockout mode with a detected fault, or operating
dangerously with an undetected fault.
The main role of the designers of safety critical equipment should be to eliminate the possibility of this last condition.
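The first two of these states can be represented directly in a controller; the third, by definition, cannot be observed at run time, which is why it must be designed out rather than handled. A minimal sketch (with hypothetical names) follows:
\begin{verbatim}
# Sketch of the two observable states of a safety critical controller.
# The third state (operating dangerously with an undetected fault)
# cannot be represented at run time, which is why design analysis
# must eliminate it. Names here are illustrative only.
from enum import Enum

class ControllerState(Enum):
    NORMAL = 1    # operating normally
    LOCKOUT = 2   # a fault has been detected; outputs forced safe

def step(state, fault_detected):
    # Once a fault is detected the controller latches into LOCKOUT
    # and stays there until a manual reset.
    if state is ControllerState.LOCKOUT or fault_detected:
        return ControllerState.LOCKOUT
    return ControllerState.NORMAL
\end{verbatim}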
% Software plays a critical role in almost every aspect facet of our daily lives - from , to driving our cars, to working in our offices.
% Some of these systems are safety-critical.
% Failure of software could cause catastrophic consequences for human life.
% Imagine the antilock brake system (ABS) in your car.
% A software failure here could render the ABS inoperable at a time when you need it most.
% For these types of safety-critical systems, having guidelines that define processes and
% objectives for the creation of software that focus on software quality, or the ability
% to use software that has been developed under this scrutiny, has tremendous value
% for developers of safety-critical systems.
\section{Motivation for developing a formal methodology}
A feature of many safety critical system specifications,
including EN298 and EN230 \cite{EN298,EN230},
is to demand,
at the very least, that single failures of hardware
or software cannot
create an unsafe condition in operational plant. Further to this,
a second fault introduced must not cause an unsafe state, due
to the combination of both faults.
\vskip 0.3cm
This sounds like an entirely reasonable requirement, but to rigorously
check the effect a particular component fault has on the system,
we must check its effect on all other components.
Should a diode in the power supply fail in a particular way, by perhaps
introducing a ripple voltage, we would have to look at all components
in the system to see how they would be affected.
%However consider a typical
%small system with perhaps 1000 components each
%with an average of say 5 failure modes.
Thus, to ensure complete coverage, each of the effects of
the failure modes must be applied
to all the other components;
each component must be checked against the
failure modes of all other components in the system.
Mathematically, with components denoted by $c$ and failure modes by $Fm$:
\begin{equation}
\label{crossprodsingle}
checks = \{ \; (Fm,c) \; \mid \; \hat{c} \; \neq \; c \; \}
\end{equation}
Where demands
are made for resilience against two
simultaneous failures, this effectively squares the number of checks to be made.
\begin{equation}
\label{crossproddouble}
doublechecks = \{ \; (Fm_{1},Fm_{2},c) \; \mid \; c_{1} \; \neq \; c_{2} \; \wedge \; Fm_{1} \neq Fm_{2} \; \}
\end{equation}
If we consider a system which has a total of
$N$ failure modes (see equation \ref{crossprodsingle}) this would mean checking a maximum of
\begin{equation}
NumberOfChecks = \frac{N ( N-1 )}{2}
\end{equation}
for individual component failures and their effects on other components when they fail.
For a very small system with, say, 1000 failure modes this would demand almost 500,000
checks for any automated checking process.
\vskip 0.3cm
European legislation \cite{EN298} directs that a system must be able to react to two component failures
and not go into a dangerous state.
\vskip 0.3cm
This raises an interesting problem from the point of view of formal modelling. Here we have a binary cross product of all components
(see equation \ref{crossproddouble}).
This increases the number of checks greatly. Given that the binary cross product is $(N^{2} - N)/2$ and that each pairing has to be checked against the remaining
$(N-2)$ components, the maximum number of checks becomes:
\begin{equation}
\label{numberofchecks}
NumberOfChecks = \frac{(N^{2} - N) ( N - 2)}{2}
\end{equation}
Thus for a 1000 failure mode system, roughly half a billion possible checks would be required for the double simultaneous failure scenario. This astronomical number of potential combinations has, up until now, made formal analysis of this
type of system impractical. Fault simulators %\cite{sim}
are commonly used for the gas certification process, but to
manually check this number of combinations of faults is in practice impossible.
A technique of modularising, or breaking down the problem, is clearly necessary.
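As a quick numerical check of the two check-count formulas above, the following Python sketch evaluates them for the 1000 failure mode example used in the text:
\begin{verbatim}
# Quick numerical check of the single and double failure check counts
# discussed above, using the 1000 failure mode example from the text.

def single_failure_checks(n):
    # N(N-1)/2: each failure mode checked against every other component
    return n * (n - 1) // 2

def double_failure_checks(n):
    # (N^2 - N)(N - 2)/2: each pair of failure modes checked against
    # the remaining components
    return (n * n - n) * (n - 2) // 2

n = 1000
print(single_failure_checks(n))   # 499500     (~500,000)
print(double_failure_checks(n))   # 498501000  (~half a billion)
\end{verbatim}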
\section{Challenger Disaster}
One question that anyone developing a safety critical analysis and design tool
would do well to answer is how the methodology would cope with known previous disasters.
The Challenger disaster is a good example, as it was well documented and investigated.
The problem lay in an O-ring seal that had a specified operating temperature range;
on the day of the launch the temperature of this seal was out of that range.
A bottom-up safety approach would have revealed this as a fault.
\section{Problems with Natural Language}
Written natural language descriptions can be ambiguous and easy to misinterpret, and it
is not possible to apply mathematical checking to them.
A mathematical model, on the other hand, can be checked for
obvious faults, such as tautologies and contradictions, and
intermediate results can also be extracted and checked.
Mathematical modelling of systems is not new; the Z language, for instance,
has been used to model systems \cite{ince}. However, it is not widely
understood or studied, even in engineering and scientific circles.
Graphical techniques for representing the mathematics of
specifying systems, developed at Brighton and Kent universities,
have been used and extended by this author to create a methodology
for modelling complex safety critical systems using diagrams.
This project uses a modified form of Euler diagram to represent propositional logic.
%The propositional logic is used to analyse system components.
\section{The Ideal System Designer's World}
Imagine a world where, when ordering a component, or even a complex module
like a failsafe sensor or scientific instrument, one page of the datasheet
lists the failure modes of that component: all possible ways in which it can fail
and how it will behave when it does.
\subsection{Environmentally determined failures}
Some systems and components are only guaranteed to work within certain environmental constraints,
temperature being the most typical. Very often what happens to the system outside that range is not defined;
where this is the case, such failures are undetectable errors.
\section{Project Goals}
\begin{itemize}
\item To create a user friendly formal common visual notation to represent fault modes
in software, electronic and mechanical sub-systems.
\item To formally define this visual language.
\item To prove that the modules may be combined into hierarchies that
truly represent the fault handling from component level to the
highest abstract system `top level'.
\item To reduce the complexity of fault mode checking, by modularising and
building complexity reducing hierarchies.
\item To formally define the hierarchies and the procedure for building them.
\item To produce a software tool to aid in the drawing of diagrams and
ensuring that all fault modes are addressed.
\item To allow the possibility of MTTF calculation for statistical
reliability/safety calculations.
\end{itemize}