Safety Engineering
1.0 Introduction
Safety engineering
is an applied science strongly related to systems engineering. Safety
engineering assures that a life-critical system behaves as needed even
when components of that system fail.
In
the real world the term "safety engineering" refers to any act of
accident prevention by a person qualified in the field. Safety
engineering is often reactive to adverse events, also described as
"incidents", as reflected in accident statistics. This arises largely
because of the complexity and difficulty of collecting and analysing
data on "near misses".
Increasingly,
the safety review is being recognised as an important
risk management tool. Failure to identify risks to safety, and the
corresponding inability to address or "control" these risks, can result in
massive costs, both human and economic. The multidisciplinary nature of
safety engineering means that a very broad array of professionals are
actively involved in accident prevention or safety engineering.
The
majority of those practicing safety engineering are employed in industry
to keep workers safe on a day-to-day basis. See the American Society of
Safety Engineers publication Scope and Function of the Safety Profession.
Safety
engineers distinguish different extents of defective operation: A
"failure" is "the inability of a system or component to perform its
required functions within specified performance requirements", while a
"fault" is "a defect in a device or component, for example: a short
circuit or a broken wire. System-level failures are caused by
lower-level faults, which are ultimately caused by basic component
faults. (Some texts reverse or confuse these two terms). The unexpected
failure of a device that was operating within its design limits is a
"primary failure", while the expected failure of a component stressed
beyond its design limits is a "secondary failure". A device which
appears to malfunction because it has responded as designed to a bad
input is suffering from a "command fault". A "critical" fault endangers one or a few people. A "catastrophic" fault endangers, harms or kills a significant number of people.
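These distinctions amount to a small taxonomy, and one way to make them concrete is to record faults against it. The sketch below is purely illustrative; the class names and the example fault are invented, not part of any standard.

```python
from dataclasses import dataclass
from enum import Enum, auto

class FaultOrigin(Enum):
    PRIMARY = auto()     # unexpected failure of a device operating within its design limits
    SECONDARY = auto()   # expected failure of a component stressed beyond its design limits
    COMMAND = auto()     # the device responded as designed, but to a bad input

class Severity(Enum):
    CRITICAL = auto()      # endangers one or a few people
    CATASTROPHIC = auto()  # endangers, harms or kills a significant number of people

@dataclass
class Fault:
    component: str
    description: str
    origin: FaultOrigin
    severity: Severity

# Hypothetical record: a controller that opens a relief valve on a spurious sensor reading
fault = Fault("relief-valve controller", "opens valve on spurious sensor input",
              FaultOrigin.COMMAND, Severity.CRITICAL)
print(fault)
```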
Safety
engineers also identify different modes of safe operation: A
"probabilistically safe" system has no single point of failure, and
enough redundant sensors, computers and effectors so that it is very
unlikely to cause harm (usually "very unlikely" means, on average, less
than one human life lost in a billion hours of operation). An inherently
safe system is a clever mechanical arrangement that cannot be made to
cause harm – obviously the best arrangement, but this is not always
possible. A fail-safe system is one that cannot cause harm when it
fails. A "fault-tolerant" system can continue to operate with faults,
though its operation may be degraded in some fashion.
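To see why redundancy is the usual route to a "probabilistically safe" system, consider a rough back-of-the-envelope calculation. The per-channel failure rate below is assumed purely for illustration, and the sketch ignores common-cause failures and repair.

```python
# Illustrative only: an assumed per-hour failure probability for a single channel,
# and the crude assumption that redundant channels fail independently.
per_channel_failure_per_hour = 1e-4

def loss_per_hour(channels: int, p: float = per_channel_failure_per_hour) -> float:
    """Approximate probability per hour that every redundant channel fails at once."""
    return p ** channels

for n in (1, 2, 3):
    print(f"{n} channel(s): ~{loss_per_hour(n):.0e} per hour")
# Only the triplicated arrangement comes in below the one-in-a-billion-hours (1e-9) target.
```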
These
terms combine to describe the safety needed by systems: For example,
most biomedical equipment is only "critical", and often another
identical piece of equipment is nearby, so it can be merely
"probabilistically fail-safe". Train signals can cause "catastrophic"
accidents (imagine chemical releases from tank-cars) and are usually
"inherently safe". Aircraft "failures" are "catastrophic" (at least for
their passengers and crew) so aircraft are usually "probabilistically
fault-tolerant". Without any safety features, nuclear reactors might
have "catastrophic failures", so real nuclear reactors are required to
be at least "probabilistically fail-safe", and some such as pebble bed reactors are "inherently fault-tolerant".
1.1 The Process
Ideally,
safety engineers take an early design of a system, analyze it to find
what faults can occur, and then propose changes to make the system
safer. In an early design stage, often a fail-safe system can be made
acceptably safe with a few sensors and some software to read them.
Probabilistically fault-tolerant systems can often be made by using more,
but smaller and less-expensive pieces of equipment.
Historically,
many organizations viewed "safety engineering" as a process to produce
documentation to gain regulatory approval, rather than a real asset to
the engineering process. These same organizations have often made their
views into a self-fulfilling prophecy by assigning less-able personnel
to safety engineering.
Far too often, rather than actually helping with the design, safety engineers
are assigned to prove that an existing, completed design is safe. If a
competent safety engineer then discovers significant safety problems
late in the design process, correcting them can be very expensive. This
project management error has wasted large sums of money in the
development of commercial nuclear reactors.
Additionally,
failure mitigation can go beyond design recommendations, particularly
in the area of maintenance. There is an entire realm of safety and
reliability engineering known as "Reliability Centered Maintenance"
(RCM), which is a discipline that is a direct result of analyzing
potential failures within a system, and determining maintenance actions
that can mitigate the risk of failure. This methodology is used
extensively on aircraft, and involves understanding the failure modes of
the serviceable replaceable assemblies, in addition to the means to
detect or predict an impending failure. Every automobile owner is
familiar with this concept when they take their car in to have the oil
changed or the brakes checked. Even filling up one's car with gas is a
simple example of a failure mode (failure due to fuel starvation), a
means of detection (fuel gauge), and a maintenance action (fill 'er
up!).
For
large scale complex systems, hundreds if not thousands of maintenance
actions can result from the failure analysis. These maintenance actions
are based on conditions (e.g., a gauge reading or a leaky valve), hard
limits (e.g., a component is known to fail after 100 hours of operation
with 95% certainty), or require inspection to determine the maintenance
action (e.g., metal fatigue). The Reliability Centered Maintenance concept
then analyzes each individual maintenance item for its risk
contribution to safety, mission, operational readiness, or cost to repair
if a failure does occur. The sum total of all the maintenance
actions is then bundled into maintenance intervals so that maintenance is
not occurring around the clock but rather at regular intervals. This
bundling process introduces further complexity, as it might stretch some
maintenance cycles, thereby increasing risk, but shorten others, thereby
potentially reducing risk, with the end result being a comprehensive
maintenance schedule, purpose-built to reduce operational risk and
ensure acceptable levels of operational readiness and availability.
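As a small illustration of the bundling idea (the tasks, intervals, and packages below are invented, and a real RCM analysis weighs the risk of stretching or shortening each cycle):

```python
# Hypothetical RCM-style bundling: each maintenance action has an ideal interval
# (in operating hours) derived from its failure analysis, and actions are grouped
# into standard packages so maintenance happens at regular intervals rather than
# around the clock.
actions = {
    "check fuel-gauge sender (condition)":        400,
    "replace hydraulic pump (hard time, ~100 h)": 100,
    "inspect wing spar for fatigue (inspection)": 900,
}
packages = [100, 250, 500, 1000]   # assumed standard maintenance intervals

def assign_package(ideal_hours: int) -> int:
    """Largest standard interval not exceeding the ideal interval (conservative:
    this rule only shortens a cycle, never stretches it)."""
    fitting = [p for p in packages if p <= ideal_hours]
    return max(fitting) if fitting else packages[0]

for task, hours in actions.items():
    print(f"{task}: ideal {hours} h -> {assign_package(hours)} h package")
```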
1.2 Analysis techniques
The two most common fault modeling techniques are called "failure modes and effects analysis" and "fault tree analysis".
These techniques are just ways of finding problems and of making plans
to cope with failures, as in Probabilistic Risk Assessment (PRA or PSA).
One of the earliest complete studies using PRA techniques on a
commercial nuclear plant was the Reactor Safety Study (RSS), edited by
Prof. Norman Rasmussen.
1.3 Failure modes and effects analysis
In
the technique known as "failure mode and effects analysis" (FMEA), an
engineer starts with a block diagram of a system. The safety engineer
then considers what happens if each block of the diagram fails. The
engineer then draws up a table in which failures are paired with their
effects and an evaluation of the effects. The design of the system is
then corrected, and the table adjusted until the system is not known to
have unacceptable problems. Of course, the engineers may make mistakes.
It is very helpful to have several engineers review the failure modes
and effects analysis.
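A toy version of the resulting table can be kept as plain data, as in the sketch below; the blocks, failure modes, and scoring scheme are invented for illustration (real FMEAs often also score detectability).

```python
# Hypothetical, heavily simplified FMEA worksheet for a small cooling system:
# each row pairs a block's failure mode with its system-level effect and a
# severity/likelihood judgement.
fmea_rows = [
    # (block, failure mode, effect, severity 1-10, likelihood 1-10)
    ("pump motor",   "fails to start",     "no coolant flow",          9, 3),
    ("check valve",  "stuck open",         "reverse flow on shutdown", 6, 2),
    ("level sensor", "reads falsely high", "tank overfill",            8, 4),
]

# Flag rows whose risk score (severity x likelihood) exceeds a chosen threshold;
# these are the failures the design would be corrected to address.
THRESHOLD = 24
for block, mode, effect, sev, like in fmea_rows:
    score = sev * like
    flag = "REVIEW" if score >= THRESHOLD else "ok"
    print(f"{block:12s} | {mode:20s} | {effect:26s} | score {score:3d} {flag}")
```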
1.4 Fault tree analysis
In
the technique known as "fault tree analysis", an undesired effect is
taken as the root ('top event') of a tree of logic. Then, each situation
that could cause that effect is added to the tree as a series of logic
expressions. When fault trees are labelled with actual failure probabilities,
which in practice are often unavailable because of the expense of testing,
computer programs such as "fault tree plus" can calculate failure probabilities from fault trees.
The
tree is usually written out using conventional logic-gate symbols. The
route through the tree between an event and an initiator is
called a cut set. The shortest credible route through the tree from fault
to initiating event is called a minimal cut set.
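The arithmetic behind such a calculation is straightforward once basic-event probabilities are available and assumed independent; the sketch below uses invented numbers and a tiny tree purely to illustrate it (it is not the "fault tree plus" tool).

```python
# A toy fault tree evaluated bottom-up, assuming independent basic events.
# OR-gates combine as 1 - prod(1 - p); AND-gates combine as prod(p).
from math import prod

def or_gate(*ps: float) -> float:
    return 1.0 - prod(1.0 - p for p in ps)

def and_gate(*ps: float) -> float:
    return prod(ps)

# Hypothetical basic-event probabilities (per demand)
pump_fails, valve_sticks = 1e-3, 5e-4
grid_power_lost, backup_power_lost = 1e-2, 2e-2

no_power   = and_gate(grid_power_lost, backup_power_lost)   # both supplies must fail
no_cooling = or_gate(pump_fails, valve_sticks, no_power)    # any one path defeats cooling
print(f"P(top event: loss of cooling) ~ {no_cooling:.2e}")
# The single-event paths (pump_fails, valve_sticks) are the minimal cut sets
# that dominate the result here.
```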
Some
industries use both Fault Trees and Event Trees (see Probabilistic Risk
Assessment). An Event Tree starts from an undesired initiator (loss of
critical supply, component failure, etc.) and follows possible further
system events through to a series of final consequences. As each new
event is considered, a new node on the tree is added with a split of
probabilities of taking either branch. The probabilities of a range of
'top events' arising from the initial event can then be seen.
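A minimal sketch of that branching calculation, with an invented initiator frequency and branch probabilities, might look like this:

```python
# A toy event tree: starting from an initiating event, each mitigating system either
# works or fails, splitting the probability until a final consequence is reached.
# All frequencies and branch probabilities are invented for illustration.
initiator_per_year = 1e-2      # assumed frequency of "loss of coolant supply"
p_alarm_works      = 0.99
p_backup_starts    = 0.95

outcomes = {
    "alarm works, backup starts -> safe shutdown":
        initiator_per_year * p_alarm_works * p_backup_starts,
    "alarm works, backup fails  -> limited damage":
        initiator_per_year * p_alarm_works * (1 - p_backup_starts),
    "alarm fails                -> severe damage":
        initiator_per_year * (1 - p_alarm_works),
}
for outcome, freq in outcomes.items():
    print(f"{outcome}: {freq:.2e} per year")
```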
Classic
programs include the Electric Power Research Institute's (EPRI) CAFTA
software, which is used by almost all of the nuclear power plants in the US
and by a majority of US and international aerospace manufacturers, and
the Idaho National Engineering and Environmental Laboratory's SAPHIRE,
which is used by the U.S. government to evaluate the safety and
reliability of nuclear reactors, the space shuttle, and the
International Space Station.
Unified Modeling Language (UML) activity diagrams have been used as graphical components in fault tree analysis.
1.5 Safety certification
Usually a failure in safety-certified systems is acceptable if, on average, less than one life per 30 years of operation (10⁹
seconds) is lost to failure. Most Western nuclear reactors, medical
equipment, and commercial aircraft are certified to this level. The cost
versus loss of lives has been considered appropriate at this level (by the
FAA for aircraft, under the Federal Aviation Regulations).
1.6 Preventing failure
1.6.1 Probabilistic fault tolerance: adding redundancy to equipment & systems
Once
a failure mode is identified, it can usually be prevented entirely by
adding extra equipment to the system. For example, nuclear reactors emit
dangerous radiation and contain highly toxic materials, and nuclear reactions
can generate so much heat that no substance can contain them. Therefore,
reactors have emergency
core cooling systems to keep the temperature down, shielding to contain
the radiation, and engineered barriers (usually several, nested,
surmounted by a containment building) to prevent accidental leakage.
Most biological organisms have a certain amount of redundancy: multiple organs, multiple limbs, etc.
For any given failure, a fail-over or redundancy can almost always be designed and incorporated into a system.
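One common form this takes is voting between redundant sensors; the sketch below is illustrative, with hypothetical channels standing in for real hardware reads.

```python
# A minimal sketch of triple-redundant sensing with median voting, so any single
# faulty channel is outvoted rather than corrupting the output.
from statistics import median
from typing import Callable, Sequence

def voted_reading(channels: Sequence[Callable[[], float]]) -> float:
    """Return the median of all channel readings; tolerates one bad channel out of three."""
    return median(ch() for ch in channels)

# Two healthy sensors and one failed channel stuck at zero
sensors = [lambda: 101.2, lambda: 100.8, lambda: 0.0]
print(voted_reading(sensors))   # 100.8: the failed channel does not corrupt the output
```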
1.6.2 Inherent fail-safe design
When
adding equipment is impractical (usually because of expense), then the
least expensive form of design is often "inherently fail-safe". The
typical approach is to arrange the system so that ordinary single
failures cause the mechanism to shut down in a safe way. (For nuclear
power plants, this is termed a passively safe design, although more than
ordinary failures are covered.)
One
of the most common fail-safe systems is the elevator brake: the cable
supporting the car holds spring-loaded brakes open. If the cable breaks,
the brakes grab the rails, and the car does not fall.
Inherent
fail-safes are common in medical equipment, traffic and railway
signals, communications equipment, and safety equipment.
1.7 References
- Radatz, Jane (28 September 1990). IEEE Standard Glossary of Software Engineering Terminology (PDF). New York, NY, USA: The Institute of Electrical and Electronics Engineers. 84 pages. ISBN 1-55937-067-X. Retrieved 2006-09-05.
- Vesely, W. E.; Goldberg, F. F.; Roberts, N. H.; Haasl, D. F. (January 1981). Fault Tree Handbook (PDF). Washington, DC, USA: U.S. Nuclear Regulatory Commission. Page V-3. NUREG-0492. Retrieved 2006-08-31.
- Rasmussen, Norman C.; et al. (October 1975). Reactor Safety Study (PDF). Washington, DC, USA: U.S. Nuclear Regulatory Commission. Appendix VI, "Calculation of Reactor Accident Consequences". WASH-1400 (NUREG-75-014). Retrieved 2006-08-31.