\ifthenelse{\boolean{paper}}
{
\abstract{
This paper examines current methodologies
for the static analysis of safety critical systems
and the statistical justifications for their application.}
}
{
This chapter looks at the current state of
safety critical systems
and provides background on the concepts and
standard practices involved.
It aims to bridge
}
\section{Introduction}
\section{Product}
\subsection{Life Cycle}
\subsection{Parts List}
The parts list is an important document, used for quality inspection, production validation and similar activities.
\subsubsection{Bill of Materials (BOM)}
\subsection{Components and Sub-systems}
How these can have failure modes.
\section{Safety and Reliability}

\begin{itemize}
\item How these are different.
\item Safety is environmentally sensitive.
\end{itemize}

In order to quantify the difference between safety and reliability we
need to determine which system failure modes are dangerous and which are safe.
Were a burner controller to detect a problem with an air pressure switch
and refuse to start up (and raise an alarm), we can see that this is a safe failure mode.
Were a burner controller to pump fuel into the combustion chamber
and then ignite it after a long delay\footnote{Most gas safety timeouts for detecting a flame under ignition conditions specify $<$ 3 seconds\cite{en298}.},
we would have a clear risk of a dangerous explosion.
Here, the picture is further complicated by the environment.
If the burner were placed in a remote building and operated
remotely, there would be minimal risk to life.
Were the burner located in a busy factory, surrounded by people,
the safety risk would be far higher.

How safety and reliability get confused:
a tale of two customers (for integrated boiler controls).

\begin{itemize}
\item Customer 1: a brewery.
Impact of the boiler going down: delayed production, some cost.
\item Customer 2: a nuclear power station.
Impact of the boiler going down: no CO$_2$ primary coolant available, possible reactor shutdown,
possible resort to emergency shutdown methods. The cost is very high.
\end{itemize}

For the brewery, safety is of the highest importance.
For the nuclear power station, reliability is also of critical importance.

\section{Terms and Concepts in \\ Safety Critical Engineering}
\subsection{Timing And Safety Checking}

\subsubsection{CANopen Timing Definitions}
CAN is a mainstream network technology and was internationally standardized (ISO~11898-1) in 1993.
CANopen is a protocol suite based on the hardware of the CANbus\cite{canspec}.
CANbus is a hardened differential serial communications bus and
is arbitration~free.\footnote{Arbitration is implemented at the physical and data link layers using DOMINANT and PASSIVE bits, with self-monitoring and automatic back-off
by any node that transmits a PASSIVE bit but reads a DOMINANT level on the bus.}
It also has a 15~bit CRC\cite{crcembedd} built into the protocol, which guarantees detection of six consecutive bit errors.
This makes it a very safe and robust messaging medium for safety critical systems.
CAN is a message based protocol, designed originally for automotive applications but
now also used in other areas such as industrial automation, industrial burner controllers and medical equipment.
The CANopen literature discusses concepts concerning the timing relevance
of given items of safety critical data.
\paragraph{Safety Relevant Data Object}
A Safety Relevant Data Object (SRDO)\cite{caninauto} is a data structure describing the status of
a particular feature or attribute of a safety critical system.
For instance, in a burner this could be a flame signal value, or in a nuclear power station,
the measured neutron flux.
\paragraph{Safety Relevant Object Validation Time}
Safety times can be given for SRDOs; these are termed Safety Relevant Object Validation Times (SROVTs)\cite{caninauto}. For instance, were
a flame to fail during the operation of a gas burner,
standards state\cite{en298} that gas may not continue to be fed into the
furnace for more than three seconds.
We can therefore say that the SROVT for the flame signal in a gas burner is 3 seconds.
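A minimal sketch of how such an SROVT might be enforced in controller firmware is given below; the data structure, the timing constant and the actuator and alarm calls are illustrative assumptions, not taken from any cited standard or product.
\begin{verbatim}
/* Minimal sketch: enforcing an SROVT on a flame-signal SRDO.
 * All names and the actuator/alarm hooks are illustrative assumptions. */
#include <stdbool.h>
#include <stdint.h>

#define FLAME_SROVT_MS 3000u            /* assumed 3 second validation time */

typedef struct {
    bool     flame_present;             /* last reported flame status       */
    uint32_t last_update_ms;            /* time the SRDO was last refreshed */
} flame_srdo_t;

/* Hypothetical actuator and alarm hooks provided elsewhere. */
extern void shut_fuel_valve(void);
extern void raise_lockout_alarm(void);

/* An SRDO is only trusted if it is fresh (within its SROVT) and reports
 * a flame; a stale or missing reading is treated as a dangerous condition. */
static bool flame_srdo_valid(const flame_srdo_t *srdo, uint32_t now_ms)
{
    return srdo->flame_present &&
           (now_ms - srdo->last_update_ms) <= FLAME_SROVT_MS;
}

void control_step(const flame_srdo_t *flame, uint32_t now_ms)
{
    if (!flame_srdo_valid(flame, now_ms)) {
        shut_fuel_valve();
        raise_lockout_alarm();
    }
}
\end{verbatim}
The key design point in this sketch is that a stale SRDO is treated exactly like a reported failure: if the flame value has not been refreshed within its SROVT, the controller moves to a safe state.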
\subsection{Single and Double Failure Modes}
A safety critical system must self-check within the relevant SROVTs.
On detecting a failure mode it must react appropriately.
Consider, though, the case where two failures occur within overlapping
time windows of their SROVTs. We can term this a double simultaneous failure mode.
To take an extreme example, were the checking function or mechanism and the object under supervision
both to fail within the SROVT, it may be impossible to detect the failure.
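As a rough illustration of why double failures are normally analysed separately (a sketch assuming two independent random failures with constant rates, not a result taken from the source): if the failures occur at rates $\lambda_1$ and $\lambda_2$, and a coincidence means the two events fall within a window $\tau$ of one another (of the order of the SROVT), then over an operating time $t$ the expected number of such coincidences is approximately
\[
N_{\mathrm{double}} \approx 2\,\lambda_1\,\lambda_2\,\tau\,t ,
\]
which for small failure rates is very much lower than the probability of either single failure alone.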

\section{Interfacing}

Mechanical -- electrical -- software.

Most problems occur at these interfaces (citations needed);
look at some of Nancy's accident papers.

\section{Current Methods for Safety Critical Analysis}

\section{STAMP}

STAMP is a high level technique: it looks at processes with feedback loops and rules, and then at the interfaces between them.

\section{Deterministic Approach}
\paragraph{NOT WRITTEN YET PLEASE IGNORE}
No single component fault may lead to a dangerous condition.
EN298, EN230, etc.

\section{Statistical Approach -- Tolerated Failure Frequencies}

The European standard
EN61508 takes a statistical approach.
It sets out four Safety Integrity Levels (SILs).
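For reference, the levels are usually tabulated against a target average probability of dangerous failure on demand (PFD) for low demand operation; the figures below are those commonly quoted in the EN61508 literature and should be checked against the standard itself.
\begin{table}[h]
\centering
\begin{tabular}{|c|c|}
\hline
SIL & Target average PFD (low demand) \\
\hline
4 & $\geq 10^{-5}$ to $< 10^{-4}$ \\
3 & $\geq 10^{-4}$ to $< 10^{-3}$ \\
2 & $\geq 10^{-3}$ to $< 10^{-2}$ \\
1 & $\geq 10^{-2}$ to $< 10^{-1}$ \\
\hline
\end{tabular}
\caption{Safety Integrity Levels and commonly quoted target PFD bands (low demand mode).}
\end{table}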
\subsection{Bayes Theorem}
\paragraph{NOT WRITTEN YET PLEASE IGNORE}
\label{bayes}
Describe the application: the likelihood of particular faults being the cause of observed symptoms.
This is a probabilistic approach; there are no direct causation paths to the higher~abstraction fault mode.
Often, for instance, a component sits in a module, within a module, within a module, and so on,
and has some probability of causing a SYSTEM level fault.
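For reference, Bayes' theorem gives the probability of a particular fault $F$ being present given an observed symptom $S$:
\[
P(F \mid S) = \frac{P(S \mid F)\,P(F)}{P(S)} .
\]
Here $P(F)$ would come, for example, from component failure rate data, and $P(S \mid F)$ from an analysis of how that fault propagates to the observed symptom.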

This is the philosophy behind FTA\cite{nasafta}\cite{nucfta}:
the idea is that probabilities can be assigned to components
failing and thereby causing system level errors.
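As a sketch of the kind of calculation this involves (assuming independent base events; illustrative, not taken from the cited handbooks): for a fault tree OR gate whose top event $T$ occurs if any of $n$ base events with probabilities $p_1,\dots,p_n$ occurs,
\[
P(T) = 1 - \prod_{i=1}^{n}\left(1 - p_i\right) \approx \sum_{i=1}^{n} p_i \quad \mbox{for small } p_i .
\]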

Problems: it is difficult to obtain reliable statistics
for the probability of a given component failure causing a system failure, because of small sample numbers.

The FMMD approach, by traversing down the tree, can use known component failure figures
to obtain {\em accurate} probabilities and potential causes.
%$$ c1 \cap c2 = \emptyset \;|\; c1 \neq c2 \wedge c1,c2 \in C \wedge C \in U $$

%Thus if the failure~modes are pairwise mutually exclusive they qualify for inclusion into the
%unitary~state set family.

\subsection{Safety Integrity Level Analysis}
\paragraph{NOT WRITTEN YET PLEASE IGNORE}
\label{sil}
This technique looks at all components in the parts list
and asks what the effect of each component failing will be.
Note that the particular failure modes of the component are not considered:
from the perspective of this analysis, the component may fail in any of its failure modes.
The analyst has to make a choice between four conditions:
\begin{itemize}
\item sd -- A safe fault that is detected by an automated system
\item su -- A safe fault that is undetected by an automated system
\item dd -- A potentially dangerous fault that is detected by an automated system
\item du -- A potentially dangerous fault that is not detected by an automated system
\end{itemize}
In practice this is close to how SIL analysis is done:
the base components are listed
and each failure is classified as sd, su, dd or du.
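One way these classifications are then combined is the safe failure fraction; the expression below is the form commonly given in the EN61508 literature (sketched here with $\lambda$ denoting the failure rate attributed to each class) and should be checked against the standard:
\[
\mathit{SFF} = \frac{\sum\lambda_{sd} + \sum\lambda_{su} + \sum\lambda_{dd}}
                     {\sum\lambda_{sd} + \sum\lambda_{su} + \sum\lambda_{dd} + \sum\lambda_{du}} .
\]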

A formula is then applied according to the system architecture (1oo1, 2oo3, 3oo3, etc.).
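For example, a commonly quoted first order approximation for a single channel (1oo1) architecture, in terms of the dangerous undetected failure rate $\lambda_{du}$ and the proof test interval $T_{1}$, is
\[
\mathit{PFD}_{avg} \approx \frac{\lambda_{du}\,T_{1}}{2} ,
\]
given here as an illustration rather than as the exact expression from the standard.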

What is not done is to assign probabilities to these conditions; the SIL analysis
practitioner simply has to decide which classification applies.
Another weakness is that it is very difficult to
extract meaningful statistics
for how likely the detection systems are to pick a fault up, or even to introduce a fault of their own.

\subsection{Tests of Hypotheses and Significance}
\paragraph{NOT WRITTEN YET PLEASE IGNORE}
Linked in with Bayes theorem:
accident analysis,
plane crashes and faults, etc.
In high reliability systems the faults are often logged: strange occurrences,
processors resetting, and so on. What are the common factors? P~values are relevant here.
For instance, very high voltage spikes can reset micro-controllers,
but how do you correlate that with unshielded, unsuppressed contactors?

Perhaps by examining the equipment and testing, at the 5\% significance level,
whether a given condition is causing the error;
i.e.\ using hypothesis tests to search for such conditions.
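As an illustrative sketch of such a test (the scenario and quantities are assumptions, not taken from the source): suppose contactor switching occupies a fraction $f$ of the total operating time, and $k$ of the $n$ logged processor resets fall within those switching windows. Under the null hypothesis that resets are unrelated to the contactors, the probability of seeing at least $k$ such coincidences is
\[
P = \sum_{i=k}^{n} {n \choose i} f^{\,i} (1-f)^{\,n-i} ,
\]
and the association would be considered significant at the 5\% level if $P < 0.05$.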

This could in fact be used to refine the SIL method of section~\ref{sil}
and give probabilities for the four conditions.