\ifthenelse{\boolean{paper}}
{
\abstract{
This paper examines current methodologies
for the static analysis of safety critical systems
and the statistical justifications for their application.}
}
{
This chapter looks at the current state of
safety critical systems
and provides background on the concepts and
standard practices involved.
It aims to bridge
}
\section{Introduction}
\section{Product}
\subsection{Life Cycle}
\subsection{Parts List}
The parts list is an important document, used for quality inspection, production validation and similar activities.
\subsubsection{Bill of Materials (BOM)}
\subsection{Components and Sub-systems}
How these can have failure modes.
\section{Safety and Reliability}

\begin{itemize}
\item How these are different.
\item Safety is environmentally sensitive.
\end{itemize}

In order to quantify the difference between safety and reliability we
need to determine which system failure modes are dangerous and which are safe.
Were a burner controller to detect a problem with an air pressure switch
and refuse to start up (and raise an alarm), we can see that this is a safe failure mode.
Were a burner controller to pump fuel into the combustion chamber
and then ignite it after a long delay\footnote{Most gas safety timeouts for detecting a flame under ignition conditions specify $<$ 3 seconds\cite{en298}.},
we would have a clear risk of a dangerous explosion.
Here, the picture is further complicated by the environment.
If the burner were placed in a remote building and operated
remotely, there would be minimal risk to life.
Were the burner located in a busy factory, surrounded by people,
the safety risk would be far higher.

How safety and reliability get confused:
a tale of two customers (for integrated boiler controls).

\begin{itemize}
\item Customer 1: a brewery.
Impact of the boiler going down: delayed production, some cost.
\item Customer 2: a nuclear power station.
Impact of the boiler going down: no CO$_2$ primary coolant available, possible reactor shutdown,
possible resort to emergency shutdown methods. The cost is very high.
\end{itemize}

For the brewery, safety is of the highest importance.
For the nuclear power station, reliability is also of critical importance.

\section{Terms and Concepts in \\ Safety Critical Engineering}
\subsection{Timing And Safety Checking}

\subsubsection{CANopen Timing Definitions}
CAN is a mainstream network technology and was internationally standardized (ISO~11898-1) in 1993.
CANopen is a protocol suite based on the hardware of the CANbus\cite{canspec}.
CANbus is a hardened differential serial communications bus and
is arbitration~free.\footnote{Arbitration is implemented at the physical and data link layers using DOMINANT and PASSIVE bits, with self-monitoring and automatic back-off
by any node that transmits a PASSIVE bit but reads a DOMINANT level on the bus.}
It also has a 15~bit CRC\cite{crcembedd} built into the protocol, which guarantees detection of six consecutive bit errors.
This makes it a very safe and robust messaging medium for safety critical systems.
CAN is a message based protocol, designed originally for automotive applications but
now also used in other areas such as industrial automation, industrial burner controllers and medical equipment.
The CANopen literature discusses concepts concerning the timing relevance
of given items of safety critical data.
\paragraph{Safety Relevant Data Object}
A Safety Relevant Data Object (SRDO)\cite{caninauto} is a data structure describing the status of
a particular feature or attribute of a safety critical system.
For instance, in a burner this could be a flame signal value, or in a nuclear power station,
the measured neutron flux.
\paragraph{Safety Relevant Object Validation Time}
Safety times can be given for SRDOs; these are termed Safety Relevant Object Validation Times (SROVTs)\cite{caninauto}. For instance, were
a flame to fail during the operation of a gas burner,
standards state\cite{en298} that gas may not continue to be fed into the
furnace for more than three seconds.
We can therefore say that the SROVT for the flame signal in a gas burner is 3 seconds.
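A minimal sketch of how such an SROVT might be enforced in controller firmware is given below; the data structure, the timing constant and the actuator and alarm calls are illustrative assumptions, not taken from any cited standard or product.
\begin{verbatim}
/* Minimal sketch: enforcing an SROVT on a flame-signal SRDO.
 * All names and the actuator/alarm hooks are illustrative assumptions. */
#include <stdbool.h>
#include <stdint.h>

#define FLAME_SROVT_MS 3000u            /* assumed 3 second validation time */

typedef struct {
    bool     flame_present;             /* last reported flame status       */
    uint32_t last_update_ms;            /* time the SRDO was last refreshed */
} flame_srdo_t;

/* Hypothetical actuator and alarm hooks provided elsewhere. */
extern void shut_fuel_valve(void);
extern void raise_lockout_alarm(void);

/* An SRDO is only trusted if it is fresh (within its SROVT) and reports
 * a flame; a stale or missing reading is treated as a dangerous condition. */
static bool flame_srdo_valid(const flame_srdo_t *srdo, uint32_t now_ms)
{
    return srdo->flame_present &&
           (now_ms - srdo->last_update_ms) <= FLAME_SROVT_MS;
}

void control_step(const flame_srdo_t *flame, uint32_t now_ms)
{
    if (!flame_srdo_valid(flame, now_ms)) {
        shut_fuel_valve();
        raise_lockout_alarm();
    }
}
\end{verbatim}
The key design point in this sketch is that a stale SRDO is treated exactly like a reported failure: if the flame value has not been refreshed within its SROVT, the controller moves to a safe state.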
\subsection{Single and Double Failure Modes}
A safety critical system must self-check within the relevant SROVTs.
On detecting a failure mode it must react appropriately.
Consider, though, the case where two failures occur within overlapping
time windows of their SROVTs. We can term this a double simultaneous failure mode.
To take an extreme example, were the checking function or mechanism and the object under supervision
both to fail within the SROVT, it may be impossible to detect the failure.
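As a rough illustration of why double failures are normally analysed separately (a sketch assuming two independent random failures with constant rates, not a result taken from the source): if the failures occur at rates $\lambda_1$ and $\lambda_2$, and a coincidence means the two events fall within a window $\tau$ of one another (of the order of the SROVT), then over an operating time $t$ the expected number of such coincidences is approximately
\[
N_{\mathrm{double}} \approx 2\,\lambda_1\,\lambda_2\,\tau\,t ,
\]
which for small failure rates is very much lower than the probability of either single failure alone.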

\section{Interfacing}

Mechanical -- electrical -- software.

Most problems occur at these interfaces (citations needed);
look at some of Nancy's accident papers.

\section{Current Methods for Safety Critical Analysis}

\section{STAMP}

STAMP is a high level technique: it looks at processes with feedback loops and rules, and then at the interfaces between them.

\section{Deterministic Approach}
\paragraph{NOT WRITTEN YET PLEASE IGNORE}
No single component fault may lead to a dangerous condition.
EN298, EN230, etc.

\section{Statistical Approach -- Tolerated Failure Frequencies}

The European standard
EN61508 takes a statistical approach.
It sets out four Safety Integrity Levels (SILs).
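For reference, the levels are usually tabulated against a target average probability of dangerous failure on demand (PFD) for low demand operation; the figures below are those commonly quoted in the EN61508 literature and should be checked against the standard itself.
\begin{table}[h]
\centering
\begin{tabular}{|c|c|}
\hline
SIL & Target average PFD (low demand) \\
\hline
4 & $\geq 10^{-5}$ to $< 10^{-4}$ \\
3 & $\geq 10^{-4}$ to $< 10^{-3}$ \\
2 & $\geq 10^{-3}$ to $< 10^{-2}$ \\
1 & $\geq 10^{-2}$ to $< 10^{-1}$ \\
\hline
\end{tabular}
\caption{Safety Integrity Levels and commonly quoted target PFD bands (low demand mode).}
\end{table}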
\subsection{Bayes Theorem}
\paragraph{NOT WRITTEN YET PLEASE IGNORE}
\label{bayes}
Describe the application: the likelihood of particular faults being the cause of observed symptoms.
This is a probabilistic approach; there are no direct causation paths to the higher~abstraction fault mode.
Often, for instance, a component sits in a module, within a module, within a module, and so on,
and has some probability of causing a SYSTEM level fault.
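For reference, Bayes' theorem gives the probability of a particular fault $F$ being present given an observed symptom $S$:
\[
P(F \mid S) = \frac{P(S \mid F)\,P(F)}{P(S)} .
\]
Here $P(F)$ would come, for example, from component failure rate data, and $P(S \mid F)$ from an analysis of how that fault propagates to the observed symptom.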

This is the philosophy behind FTA\cite{nasafta}\cite{nucfta}:
the idea is that probabilities can be assigned to components
failing and thereby causing system level errors.
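As a sketch of the kind of calculation this involves (assuming independent base events; illustrative, not taken from the cited handbooks): for a fault tree OR gate whose top event $T$ occurs if any of $n$ base events with probabilities $p_1,\dots,p_n$ occurs,
\[
P(T) = 1 - \prod_{i=1}^{n}\left(1 - p_i\right) \approx \sum_{i=1}^{n} p_i \quad \mbox{for small } p_i .
\]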

Problems: it is difficult to obtain reliable statistics
for the probability of a given component failure causing a system failure, because of small sample numbers.

The FMMD approach, by traversing down the tree, can use known component failure figures
to obtain {\em accurate} probabilities and potential causes.
%$$ c1 \cap c2 = \emptyset \;|\; c1 \neq c2 \wedge c1,c2 \in C \wedge C \in U $$

%Thus if the failure~modes are pairwise mutually exclusive they qualify for inclusion into the
%unitary~state set family.

\subsection{Safety Integrity Level Analysis}
\paragraph{NOT WRITTEN YET PLEASE IGNORE}
\label{sil}
This technique looks at all components in the parts list
and asks what the effect of each component failing will be.
Note that the particular failure modes of the component are not considered:
from the perspective of this analysis, the component may fail in any of its failure modes.
The analyst has to make a choice between four conditions:
\begin{itemize}
\item sd -- A safe fault that is detected by an automated system
\item su -- A safe fault that is undetected by an automated system
\item dd -- A potentially dangerous fault that is detected by an automated system
\item du -- A potentially dangerous fault that is not detected by an automated system
\end{itemize}
In practice this is close to how SIL analysis is done:
the base components are listed
and each failure is classified as sd, su, dd or du.
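One way these classifications are then combined is the safe failure fraction; the expression below is the form commonly given in the EN61508 literature (sketched here with $\lambda$ denoting the failure rate attributed to each class) and should be checked against the standard:
\[
\mathit{SFF} = \frac{\sum\lambda_{sd} + \sum\lambda_{su} + \sum\lambda_{dd}}
                     {\sum\lambda_{sd} + \sum\lambda_{su} + \sum\lambda_{dd} + \sum\lambda_{du}} .
\]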

A formula is then applied according to the system architecture (1oo1, 2oo3, 3oo3, etc.).
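For example, a commonly quoted first order approximation for a single channel (1oo1) architecture, in terms of the dangerous undetected failure rate $\lambda_{du}$ and the proof test interval $T_{1}$, is
\[
\mathit{PFD}_{avg} \approx \frac{\lambda_{du}\,T_{1}}{2} ,
\]
given here as an illustration rather than as the exact expression from the standard.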

What is not done is to assign probabilities to these conditions; the SIL analysis
practitioner simply has to decide which classification applies.
Another weakness is that it is very difficult to
extract meaningful statistics
for how likely the detection systems are to pick a fault up, or even to introduce a fault of their own.

\subsection{Tests of Hypotheses and Significance}
\paragraph{NOT WRITTEN YET PLEASE IGNORE}
Linked in with Bayes theorem:
accident analysis,
plane crashes and faults, etc.
In high reliability systems the faults are often logged: strange occurrences,
processors resetting, and so on. What are the common factors? P~values are relevant here.
For instance, very high voltage spikes can reset micro-controllers,
but how do you correlate that with unshielded, unsuppressed contactors?

Perhaps by examining the equipment and testing, at the 5\% significance level,
whether a given condition is causing the error;
i.e.\ using hypothesis tests to search for such conditions.
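As an illustrative sketch of such a test (the scenario and quantities are assumptions, not taken from the source): suppose contactor switching occupies a fraction $f$ of the total operating time, and $k$ of the $n$ logged processor resets fall within those switching windows. Under the null hypothesis that resets are unrelated to the contactors, the probability of seeing at least $k$ such coincidences is
\[
P = \sum_{i=k}^{n} {n \choose i} f^{\,i} (1-f)^{\,n-i} ,
\]
and the association would be considered significant at the 5\% level if $P < 0.05$.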

This could in fact be used to refine the SIL method of section~\ref{sil}
and give probabilities for the four conditions.