EHSQL (Environment, Health, Safety, Quality & Laboratory) Technical services: Improving Alarm Systems with focus on Process and Human Factors

Operator alarms should be the first line of defence in every plant, but all too often they are more of a nuisance than an aid to the operator. This exposes the safety alarms to more process excursions, which increases the probability of a failure on demand and detracts from the plant’s overall safety capability. Poor operator alarms also contribute to poor process economics [Ref 1].

Understanding and using the geometric relationship between an operating envelope and its approximating hypercube eliminates many false alarms. This substantially improves the credibility of the alarm system to the operator and allows earlier annunciation, with more time for the operator to respond. The new alarms give the operator earlier and positive warning of deviation from whatever combination of business, environmental and process performance objectives the operating window has been chosen to represent. They thus contribute to the economic performance of the plant and earn the alarm system a share of the business case for further investment.

There are two major alarm systems in a process plant. The first is the Safety Alarm System responsible for taking control and shutting down the process in extreme process excursions which both the process control system and the operator have been unable to prevent. Its role is to prevent an extreme excursion from turning into a disaster with liabilities and costs that can run into hundreds of millions of dollars. Its costs are viewed as an insurance premium against a disaster that most plants will never experience.

The second is the Operator Alarm System, intended to draw the process operator’s attention to a situation that the process control system cannot prevent and that requires the operator’s considerably greater human intelligence to resolve and correct before the safety system intervenes and shuts down the plant. Automatic plant shutdowns are expensive in lost production time and possible consequential plant damage. Operator alarms give the operator time to intervene and correct the situation to avoid a shutdown.

Most plants accept that ‘Normal’ operation refers to the Operating Envelope within which the desired economic results are achieved, as in Figure 1, and place the operator alarm limits where they imagine the boundary to be.

This would suggest that:

Alarm limits are ideally the same as operating limits and
The economic cost of violating an alarm limit is the difference in product value and operating cost between desired and undesired operation.

Figure 1. Operator alarm limits at the boundary of where the process normally operates

But the first practical problem with the advice to ‘put the operator alarms on the boundary of where the plant normally operates’ is that there has been no way to determine the location of that boundary when the operating objective is to meet all KPIs, including those that cannot be measured in real time, at all times.

The consequence of this is that Figure 2 is a representation of alarm limits as they really are in practice. Some alarm limits are set in the orange recovery space where they will, at best, annunciate late, giving the process disturbance more time to grow and requiring a larger corrective action, or in many cases are set so wide that they can never annunciate. Other alarm limits are set inside the green ‘normal operation’ space where they will annunciate unnecessarily some of the time creating false alarms and leading to their being labelled as ‘bad actors’.

Without knowledge of the location of the boundary, or of how alarms relate to each other, there is little that can be done to cure a bad actor other than to push the alarm limit ‘outwards’ towards or past the guessed position of the boundary.

Figure 2. Operator alarm limits as they usually are since the boundary of normal operation is unknown
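The two failure modes shown in Figure 2 can be illustrated with a toy two-variable sketch. Everything in it is an assumption made for illustration: the variables (a temperature and a flow), the shape of the envelope (a band in which flow tracks temperature) and the numeric limits are hypothetical and are not taken from the referenced work. Independent high/low limits on each variable form a box (the hypercube), and the sketch simply compares that box with the envelope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Independent low/high alarm limits on a single variable."""
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


def in_envelope(temp: float, flow: float) -> bool:
    """Hypothetical operating envelope: flow must stay close to temperature."""
    return 0.0 <= temp <= 10.0 and abs(flow - temp) <= 2.0


def classify(temp: float, flow: float, temp_limits: Limits, flow_limits: Limits) -> str:
    """Compare what the box-shaped limits say with what the envelope says."""
    alarm = not (temp_limits.contains(temp) and flow_limits.contains(flow))
    normal = in_envelope(temp, flow)
    if alarm and normal:
        return "false alarm"           # limit sits inside normal operation
    if not alarm and not normal:
        return "missed or late alarm"  # limit sits out in the recovery space
    return "alarm" if alarm else "quiet"


# Hypothetical limits: the flow limits are tighter than the envelope at some
# temperatures and looser at others, because a box cannot follow the band.
temp_limits = Limits(0.0, 10.0)
flow_limits = Limits(3.0, 7.0)

for point in [(9.0, 8.0), (1.0, 6.0), (5.0, 5.0), (5.0, 9.0)]:
    print(point, classify(*point, temp_limits, flow_limits))
```

No choice of box limits removes both problems in this example, because the envelope is not box-shaped. That is the geometric point the article makes about knowing where the boundary actually lies.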

Lessons from history

Poor alarm management has been implicated in many high profile disasters, for example the explosions at the Texaco Milford Haven oil refinery (HSE, 1997) and the Longford gas plant (Hopkins, 2000). Whilst excessive alarm load was recognised as an important factor in each of these incidents, other inadequacies relating to alarm design, presentation and management were also identified. For example, the investigation into the Longford Gas plant explosion found that the plant was habitually run beyond alarm set points, whilst the Texaco Milford Haven investigation identified poor prioritisation and delayed alarm response as contributory to the subsequent loss of containment and explosion.

Literature and UK regulatory context

EEMUA 191 (EEMUA, 2013) comprehensively outlines the principles of effective alarm system design, including the management of HF issues. For example, it covers how an alarm should be presented within the DCS, and what information and functionality should be available to support operators in navigating to the required controls to execute a response. EEMUA 191 also provides a wealth of information regarding the wider organisational arrangements which should be in place to support the design, maintenance and improvement of alarm systems.

Effective alarm management is particularly important in the process industries given the potential Major Accident Hazard (MAH) implications of their operations. In the UK, onshore high hazard sites are regulated by the Health and Safety Executive (HSE) under the COMAH (Control of Major Accident Hazards) Regulations (HSE, 2006). A core requirement of these regulations is for operators to submit a Safety Report which demonstrates to the regulator that their activities are, as far as is reasonably practicable, safe and that MAH events are suitably controlled. One aspect of this is the need to demonstrate that alarm systems have been properly conceived during plant design and are subject to ongoing management and review, so that alarms continue to support safe and reliable operations.

There are two aspects to this: firstly, duty holders must ensure that their alarm system is safe, and that it offers reliable protection against MAH events. Secondly, they must provide evidence to the regulator that the system is designed in accordance with best practice and that there is a verification process in place which ensures that the system fully supports effective operator response. This includes providing a demonstration that best practice standards for alarm design are being applied on site.

Improving alarm systems – the challenge

It is essential that high hazard sites operate with confidence that, when a high-criticality alarm arises, those charged with the task of responding to the alarm can indeed do so. This confidence is particularly important at times of high workload or elevated alarm levels, for example during a serious plant upset. Failure to ascertain whether a reliable operator response is probable undermines the foundations upon which the entire alarm system is based.

However, providing this verification can be difficult. Firstly, modern process plants are complex, with distributed systems to maintain process control across extensive networks. Secondly, the number of variables associated with the effective design and presentation of an alarm can be significant. In short, there are many alarms to assess and many factors to consider for each alarm.

Many MAH sites with complex alarm systems utilise alarm management software as part of their assurance strategy. Such software provides data for alarm system performance which can be used to judge the overall adequacy of the system (for example average alarm rate, number of alarms following an upset, number and distribution of alarms by priority). This information is important from the perspective of performance monitoring and for developing alarm rationalisation strategies to reduce alarm load and improve system performance. Alarm metrics can also be interrogated at a deeper level to examine, for example, response times to particular alarms. Alarm management software is therefore often viewed as an important tool in the quest to improve alarm systems.

However, such software often provides little insight into how the operator interacts with the DCS to respond to an alarm and whether, and where, the operator encounters any difficulty in doing so. With the exception of drawing conclusions about the overall alarm load, such software rarely provides much analysis regarding which specific features of the alarm system present problems to the operator and the aspects of system design that need to be addressed to improve alarm reliability.

Therefore, in the context of achieving reliable verification that operators will respond to an alarm, the limitations of tools that measure overall alarm load as the sole means to achieve this should be recognised.

Where sites utilise this method as the only means of alarm system analysis it could be argued that the reliability of response, at times of highest need during a serious plant upset, may often be based upon little more than assumption.

Given the complexity of the task facing many MAH operators, a pragmatic solution is therefore required to provide the verification which they, and the Regulator, require: that their alarm system is safe and that a reliable response to the most critical process alarms is possible.

Possible approaches to analysis of HF issues related to alarms

One obvious approach is for the MAH operator to carry out their own full review of the content of EEMUA 191 and assess their most critical alarms against this guidance. This is clearly achievable. However, the time and resource required for such an unstructured analysis may present difficulties, particularly for smaller sites. Whilst EEMUA 191 is an excellent source of information, the presentation of that information within the document does not necessarily support a simple, systematic and consistent analysis process.

For example, in the guide, specific information relating to individual alarm design is often incorporated within wider guidance relating to organisational arrangements to support alarm systems. Moreover, information relating to how alarms should be presented to facilitate prompt and effective identification by operators is distributed throughout the document, rather than being collated in one discrete, easy-to-interpret section.

Unless significant time is spent reviewing the guidance, it may be difficult to identify the key information against which alarms should be assessed to determine that a specific alarm adheres to the various requirements of the guidance. The extensive nature of EEMUA 191 means that this approach, when coupled with the number of potential alarms to be reviewed at any given site, may appear an overwhelming challenge.

An alternative approach is to carry out full task and failure analyses of the highest criticality alarms in the context of response tasks (see, for example, Energy Institute, 2011). While this would represent a thorough approach it may present its own challenges. For example, whilst such analyses should give a fully-rounded analysis of the task in the operating context, these analyses can be complex and potentially time consuming, and will often require external HF support. In addition, whilst such approaches provide an excellent framework for identifying potential failures for the full range of different task types, they do not necessarily provide specific support for assessing the cognitive aspects of alarm response (e.g. diagnosing the causes of alarms and deciding upon appropriate responses). Finally, EEMUA 191 outlines a substantial number of specific design expectations and it is uncertain whether a traditional failure analysis approach would reliably identify all of these factors.

Overview of the alarm review process

The potential complexity associated with assessing alarms, coupled with the inconsistent approach which many MAH operators take to verify that alarm systems optimise operator response, encouraged the authors of this report [Ref 2] to develop an analysis process that could potentially support MAH sites in the analysis of critical alarms.

The Alarm Review Tool, or ART, provides a means for MAH operators to reliably and rapidly analyse critical alarms and their associated management systems against the alarm system design principles described in EEMUA 191. It distils the key guidance from EEMUA 191 into related sections, meaning that the user can be confident that they have considered all of the relevant information for a specific alarm from the guidance without having to hunt through the document.

The process has been designed to provide a comprehensive analysis of the alarm system, and currently comprises four core elements:

Critical alarm screening: This is a facility for alarm filtering to determine whether alarms which are currently assigned highest criticality within the system justify that categorisation. This screening helps, in the first instance, identify alarms which have been wrongly prioritised. This ensures that time spent analysing alarms is initially focused on those alarms which are most important. Such high level screening can also assist with rationalisation by identifying alarms which are not truly critical.
Individual alarm review: This element facilitates a quick but thorough review of individual safety-critical alarms against the usability principles outlined in EEMUA 191. This examines all HF aspects of alarm response from signal presentation, availability of DCS information for diagnosis, to execution of response. This depth of analysis provides the necessary verification that alarm design is optimised.
Alarm management system review: This is an in-depth assessment of the management system which supports the alarm system. It examines the adequacy of organisational arrangements for the ongoing maintenance, development and review of the alarm system. It is envisaged that this review would take place periodically – for example by undertaking an initial management system review then possibly only re-reviewing at a later date if significant organisational changes have occurred which affect the management of the alarm system.
Alarm performance metrics: This provides a facility for recording and trending alarm metrics in relation to ongoing rationalisation provided via the alarm review tool. This charts alarm system improvements in relation to any changes made to problem alarms.

The analysis can be completed as a paper analysis. However, a software tool has also been developed to speed up the assessment process and facilitate the aggregation of multiple analyses. This tool is still being developed, but screenshots from a prototype are included in this article to illustrate the process.

Figure 3. Example statements in the ‘Maintain Salience’ phase of the critical alarm review process

Figure 4. Example summary report for one critical alarm

Ammonia Plants experience

The Caribbean Nitrogen Company (CNC) and Nitrogen 2000 (N2000) ammonia plants are both located in the Point Lisas Industrial Estate in Trinidad, each with a nameplate capacity of 1850 MTPD. The CNC plant was commissioned in 2002 and N2000 in 2004. An alarm management project was initiated, and a white paper describing the approach adopted by CNC/N2000 in implementing this system on an existing facility was presented at AIChE Ammonia Safety in 2010.

The ease with which alarms can be added carries the risk of alarm overload, and the ease with which alarms can be modified or suppressed can, in the absence of proper change-management protocols, lead to serious degradation of an alarm system’s reliability.

With plant safety, environmental safety, regulatory compliance and bottom-line success depending on effective alarming, it has become clear to industry that proper alarm management is an essential part of overall best practices. [Ref 3]

In the earlier days of pneumatic controls, adding a new control room alarm was quite involved and required significant effort. A present-day DCS makes it very easy to add new alarms to process variables. However, once an alarm has been added, change management procedures can make its removal quite involved. Additionally, almost inevitably, a significant number of plant incident investigation reports will recommend adding new alarms.

Alarm Management has been defined in the literature as the “Process by which alarms are engineered, monitored, and managed to ensure safe, reliable operations”. A key misconception is that Alarm Management is only about reducing the number of alarms. The objective is to improve the quality of alarms and the alarm rates experienced in both normal and abnormal operations.

Recognizing that an Alarm Management program would improve operator workload, improve plant reliability and avoid unplanned outages as well as avoid possible safety and environmental incidents, CNC & N2000 embarked on such a project.

Project Execution

The objective of the project was to develop an alarm management philosophy and a rationalized alarm system in alignment with the principles outlined in the EEMUA Standard (Publication 191:1999) and the industry’s best practices.

The project was executed on a phased basis which is described as follows.

Phase 1 – Hardware & Software

The first phase involved procurement of the necessary hardware and software to provide a statistical analysis of the present alarm system.

The hardware allowed data to be collected from the DCS and securely broadcast through a dedicated server on the business LAN. The software selected had two key elements.

Data Collection – This application collects and stores alarm and event history for long-term archiving and data analysis. The software’s analysis and monitoring capabilities instantly identify problematic alarms and help to immediately reduce operator alerts and resulting alarms. Operators can also select alarms in the real-time viewer to see additional documentation on them. This documentation provides the operator with guidance on resolving the abnormal situation. Operators can access and verify information regarding the cause of each alarm, its priority, the appropriate response and the consequences of not responding.

Management of Change – This application serves as the master alarm database during the rationalization process. After the alarm rationalization process is completed, it becomes the means of managing change to any and all alarm configurations. The master alarm database is designed to record and log all alarm changes: what was changed and when, the new configuration settings (and the old ones), who made the change, the reasons behind it and, finally, how the change was authorized.
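The kind of record such a master alarm database keeps can be sketched as a small append-only log. This is only an illustration of the fields listed above, not the vendor software used at CNC/N2000: the class names, field names, the tag "PT-101" and the people named are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AlarmChange:
    """One logged change to an alarm configuration (hypothetical schema)."""
    tag: str              # which alarm was changed
    setting: str          # which configuration setting, e.g. "high_limit"
    old_value: float      # what the setting was
    new_value: float      # what the setting is now
    changed_at: datetime  # when the change was made
    changed_by: str       # who made the change
    reason: str           # why the change was made
    authorized_by: str    # how the change was authorized


class MasterAlarmDatabase:
    """Append-only change log plus the current value of each setting."""

    def __init__(self):
        self._log = []
        self._current = {}

    def record(self, change: AlarmChange) -> None:
        key = (change.tag, change.setting)
        known = self._current.get(key)
        # Reject a change whose "old" value disagrees with what is on record.
        if known is not None and known != change.old_value:
            raise ValueError(f"{key}: expected old value {known}, got {change.old_value}")
        self._current[key] = change.new_value
        self._log.append(change)

    def current(self, tag: str, setting: str) -> float:
        return self._current[(tag, setting)]

    def history(self, tag: str) -> list:
        return [c for c in self._log if c.tag == tag]


db = MasterAlarmDatabase()
db.record(AlarmChange("PT-101", "high_limit", 45.0, 48.0, datetime(2010, 1, 15, 9, 30),
                      "process engineer", "limit was inside normal operation", "MOC approved"))
print(db.current("PT-101", "high_limit"), len(db.history("PT-101")))
```

Checking the stated old value against the stored one is a design choice in this sketch. It makes an unlogged change visible the next time the same setting is touched.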

Phase 2 – Alarm Philosophy

The second phase was to review the facility’s existing alarm philosophy and make the necessary recommendations and subsequent changes to improve it. However, no written alarm philosophy document could be located, so a new alarm philosophy document had to be developed.

This document establishes rules for configuring the DCS to help improve the alarm system.

The purpose is to reduce the number of alarms, eliminate redundant and nuisance alarms, and properly prioritize alarms.

This alarm philosophy document allows CNC/N2000 to have a set of guidelines to assess the need for alarms and the corrective response required by the operators. Through the use of this document, the following objectives are expected to be achieved:

Improved process reliability and safety.
Reduced number and ultimately, cost of abnormal situations.
Assistance in adhering to industry best practices, guidelines and regulations.

Phases 3 & 4 – Implementation

The third phase involved the implementation of the philosophy which was then used in the rationalization of the system through a multidisciplinary review of the existing system. Alarm rationalization has been defined as “Systematic review and documentation of each alarmable tag in the DCS with the objective of optimizing alarm quantity and quality”.

The fourth phase was implementation of the new system with the requisite training of personnel to manage and manipulate the system as required.

Phases 3 and 4 were essentially combined: the selected vendor demonstrated the alarm rationalization process to a team of personnel from CNC & N2000 comprising Process & Electrical Engineers as well as Senior Operators. The specialist contractor makes quarterly visits to the plants to review progress and observe the sessions to maintain effectiveness.

The DCS architecture is such that the plant is subdivided into approximately 26 “areas”.

Teams were then set up and a schedule developed for the rationalization exercises, with a target of completion within 12 months.

On completion of each area rationalization, an MOC is done and circulated for management approval. Before implementation, the Operations team members who were part of the rationalization exercises have to visit all shifts and review each MOC with them in detail for maximum understanding and buy-in.

Alarm Performance Monitoring

In order to quantify the initial and on-going performance of an alarm system, certain metrics are used. In addition to the rationalization exercise, these metrics can be used to get immediate improvement on the alarm system.

Examples are taken from the CNC Plant over a four-month period, November 2009 to February 2010. It should be noted that the plant was shut down for a turnaround and returned to production on 30 October 2009. This means that in early November there would have been a transient period of plant optimization. Additionally, the plant had to be shut down on 22 November 2009 for another issue. Of the four-month period chosen, November 2009 therefore represents an abnormal month, with December 2009, January 2010 and February 2010 representing steady-state months. For this reason, where applicable, key performance indicators (KPIs) are shown separately for the two different states within the chosen period.

Average number of alarms per hour

This is the ratio of the total number of alarms annunciated to an operator during the analysis period to the total number of hours during the same period. It is a measure of an average level of load imposed on the operator by the alarm system. The target by best practices is less than 12 alarms per hour.
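As a minimal sketch of this ratio, assuming the alarm history is available as a list of annunciation timestamps (the function name and the sample data are hypothetical):

```python
from datetime import datetime, timedelta

TARGET_PER_HOUR = 12  # best-practice target quoted in the text


def average_alarms_per_hour(alarm_times, start, end):
    """Alarms annunciated in [start, end) divided by the hours in that period."""
    hours = (end - start) / timedelta(hours=1)
    if hours <= 0:
        raise ValueError("end must be after start")
    count = sum(1 for t in alarm_times if start <= t < end)
    return count / hours


# Hypothetical data: one alarm every 10 minutes for a day (144 alarms).
start = datetime(2010, 1, 1)
end = start + timedelta(hours=24)
alarms = [start + timedelta(minutes=10 * i) for i in range(144)]

rate = average_alarms_per_hour(alarms, start, end)
print(rate, "meets target" if rate < TARGET_PER_HOUR else "above target")
```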

Referring to Figure 5, it can be seen that the November events roughly doubled the average when compared to the steady state period. What can be seen here is that there is room for improvement by reducing the average through alarm rationalization exercises.

Figure 5. Alarm Distribution Over Time

Maximum number of alarms per hour

This is the measure of the worst-case load on an operator during any ten-minute time period. The target by best practices is less than 15 alarms per 10 minutes or 90 alarms per hour.
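A minimal sketch of the ten-minute peak, assuming timestamps as input and fixed (non-overlapping) ten-minute windows counted from the start of the analysis period. A sliding window would report peaks at least as high, so the fixed-window choice is a simplification made here, not something the source specifies.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)
TARGET_PER_WINDOW = 15  # best-practice target quoted in the text


def peak_alarms_per_window(alarm_times, start, window=WINDOW):
    """Largest number of alarms falling in any fixed window counted from start."""
    # Integer division of two timedeltas gives the index of the window.
    counts = Counter((t - start) // window for t in alarm_times if t >= start)
    return max(counts.values(), default=0)


# Hypothetical upset: 20 alarms in the first ten minutes, then 3 in the next.
start = datetime(2009, 11, 22, 8, 0)
burst = [start + timedelta(seconds=30 * i) for i in range(20)]
tail = [start + timedelta(minutes=m) for m in (12, 15, 18)]

peak = peak_alarms_per_window(burst + tail, start)
print(peak, "meets target" if peak < TARGET_PER_WINDOW else "above target")
```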

Referring to Figure 5, it is clear that the peak hourly alarm rate is significantly greater than target and significant benefit can be gained through alarm rationalization.

Percentage of hours with alarms greater than 30

This is a measure of the proportion of time the alarm system was in an upset state. The threshold of 30 alarms per hour (1 alarm every 2 minutes) is taken as the limit of what an operator can reasonably manage. The target by best practices is less than 2% of operating time.
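This indicator can be sketched in the same way, again assuming timestamps as input and whole clock hours counted from the start of the period (the function name and the sample data are hypothetical):

```python
from collections import Counter
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)
TARGET_PERCENT = 2.0  # best-practice target quoted in the text


def percent_hours_above(alarm_times, start, end, threshold=30):
    """Percentage of whole hours in [start, end) with more than threshold alarms."""
    total_hours = (end - start) // HOUR
    if total_hours <= 0:
        raise ValueError("period must span at least one whole hour")
    counts = Counter((t - start) // HOUR for t in alarm_times if start <= t < end)
    upset_hours = sum(1 for n in counts.values() if n > threshold)
    return 100.0 * upset_hours / total_hours


# Hypothetical 50-hour period: 40 alarms in hour 0, exactly 30 in hour 5.
start = datetime(2009, 12, 1)
end = start + 50 * HOUR
alarms = [start + timedelta(minutes=i) for i in range(40)]
alarms += [start + 5 * HOUR + timedelta(minutes=i) for i in range(30)]

share = percent_hours_above(alarms, start, end)
print(share, "meets target" if share < TARGET_PERCENT else "above target")
```

Only hour 0 counts as an upset hour here, because the test is "greater than 30" and hour 5 has exactly 30 alarms.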

By computing these values, each area can be assessed and plotted on the following chart. The above data puts the CNC alarm system in what is referred to as a “stable” mode of operation.

The yellow zone on Figure 6 illustrates CNC’s target zone for alarm system performance.

The target dot is the EEMUA 191 best-practice performance level. The yellow zone is based on industry experience and EEMUA 191 figures 42 & 454.

Figure 6. Alarm performance assessment chart

These KPIs would be evaluated after the results of each rationalized DCS area have been implemented to ascertain that the KPIs are approaching the target criteria.

The Alarm Management project has been implemented and rationalization of the different DCS areas is in progress on a phased basis.

Industry guidelines place the CNC alarm system in what is defined as a “stable” mode of operation and the plant is working to further improve the performance of the system.

References

Robin W. Brooks, PhD, Alan Mahoney, PhD, John Wilson, PhD, CEng, Na Zhao, PhD, Process Plant Computing Limited (PPCL), PO Box 43, Gerrards Cross, Bucks. SL9 8UX, UK. Operator alarms are the first line of defence, IChemE Hazards 23, 2012.
Neil Hunter, Jamie Henderson, David Embrey, Human Reliability Associates Ltd. Development of an alarm analysis process for use within the process industries, Hazards 25, IChemE, 2015.
Tony Sewdass, Natalie Harrichand, Industrial Plant Services Limited. Process Safety Improvements at the CNC & N2000 Ammonia Plants, AICHE Ammonia Safety Manual 2010.


Wednesday 12 December 2018
