Effective Risk Management and Quality Improvement by Application of FMEA and Complementary Techniques
Introduction
This paper provides my expert opinion on the use and effectiveness of Failure Modes and Effects Analysis (FMEA) for managing risks and improving quality in several industrial domains. I also consider and evaluate several other analytical techniques as complementary extensions of FMEA.[1]
The opinions that I express in this paper are based on a thorough review that I conducted of industry standards and procedures for risk management, FMEA techniques, and FMEA applications in aviation and other industries. I also base these opinions on my 25 years of experience in transportation management and analysis, airline flight operations, safety investigation management, safety research, and airline accident investigation. I have ten years of experience on the staff of the U.S. National Transportation Safety Board (NTSB), concluding my service there as the Chief of the Major Investigations Division. In that position, I managed the overall investigative effort for U.S. air carrier accidents from the field investigation to the public board meeting and final accident report. I also managed the U.S. Government’s participation in investigations of foreign aviation accidents. My previous NTSB experience included management of flight operations, air traffic control, and meteorological aspects of air carrier accident investigations; on-scene and follow-up investigations of flight operations for several major accident investigations, including the USAir flight 427 Boeing 737 accident near Pittsburgh and the ValuJet flight 592 DC-9 accident in the Everglades; and management of research programs on flight crew human factors and regional air safety issues, both of which were adopted and published by the NTSB. I am a pilot for a major U.S. air carrier, qualified in the Boeing 737 and two other transport category aircraft types. I have consulted with the National Aeronautics and Space Administration (NASA), the World Bank, the European Bank for Reconstruction and Development, the U.S. President’s Aviation Safety Commission, and several airlines, financial institutions, airport authorities, and other private entities on safety and analytical matters. I received the A.B. degree summa cum laude in Economics from Harvard College and am a member of the Phi Beta Kappa Society.
FMEA: Summary and Definition
According to the Society of Automotive Engineers (SAE) International Aerospace Recommended Practice (ARP) 5580, Recommended Failure Modes and Effects Analysis (FMEA) Practices for Non-Automobile Applications, FMEA is “a formal and systematic approach to identifying potential system failure modes, their causes, and the effects of the failure mode occurrence on the system operation…FMEA provides a basis for identifying potential system failures and unacceptable failure effects that prevent achieving design requirements from postulated failure modes…FMEA is used in many system design analyses including assessing system safety, planning system maintenance activities, defining provisions for fault recovery, fault tolerance, and failure detection and isolation, and identifying design modifications and corrective actions needed to mitigate the effects of a failure on the system.”
The basic FMEA process involves examining each basic hardware, software, personnel, or functional element of a system, identifying all the ways in which that element can fail (failure modes), assessing the effects of each failure mode upon the function of other elements of the system and the entire system (failure effects), and then assessing the criticality of the failure effects. Integral to the FMEA process is the specification of corrective actions that will prevent critical failures or restore critical functions.
FMEA typically uses a worksheet for analyzing data and documenting the results. The worksheet proceeds, left to right, from the component identification, to the associated failure modes, to the failures’ effects at various levels of the system (including detectability of the failure modes/effects), to their risk, reliability, or quality consequences. The following is an example of an FMEA worksheet that was prepared by the SAE for analysis of a fictitious aerospace application:
Source: SAE ARP926B, p. 32.
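The left-to-right layout of such a worksheet can be sketched as a simple record type. This is only an illustrative sketch: the field names, the helper `undetectable`, and the two example rows are hypothetical and are not taken from the SAE worksheet.

```python
from dataclasses import dataclass


@dataclass
class WorksheetRow:
    """One row of a hypothetical FMEA worksheet, read left to right."""
    component: str      # component identification
    failure_mode: str   # one way in which the component can fail
    local_effect: str   # effect on the component or adjacent elements
    system_effect: str  # effect on the entire system
    detection: str      # how the failure mode or effect would be detected
    severity: str       # qualitative severity category
    probability: str    # qualitative or quantitative likelihood


def undetectable(rows):
    """Return the rows that state no means of detecting the failure."""
    return [r for r in rows if r.detection.strip().lower() in ("", "none")]


# Hypothetical example rows, for illustration only.
rows = [
    WorksheetRow("hydraulic pump", "loss of output pressure", "actuator slows",
                 "reduced control authority", "low-pressure warning light",
                 "major", "improbable"),
    WorksheetRow("check valve", "stuck open", "none until pump fails",
                 "loss of backup pressure", "none", "major", "improbable"),
]

for row in undetectable(rows):
    print(row.component, "-", row.failure_mode)
```

Filtering on the detection column is one simple way to surface candidate latent failures for closer review.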
The criticality, or level of risk, from a failure is a combination of the severity of the effect and the probability of its occurrence. Under FMEA the severity is estimated qualitatively, with each effect assigned to one of several categories ranging from none to catastrophic, and the probability is assessed either qualitatively or quantitatively (the latter if failure rate data are available from previous experience or from laboratory or field experimentation). The severity and probability assessments are combined into an overall assessment of the risk level of the failure effect as being acceptable or unacceptable, along the lines of the following graphic from Federal Aviation Administration (FAA) guidance material:
Source: FAA Advisory Circular 25.1309-1A, System Design and Analysis, p. 7.
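The combination of severity and probability into an acceptable/unacceptable judgment can be sketched as a small lookup. The category names and the acceptance thresholds below are assumptions chosen for illustration (the more severe the effect, the less probable it must be); they are not a reproduction of the FAA graphic.

```python
from enum import IntEnum


class Severity(IntEnum):
    NONE = 0
    MINOR = 1
    MAJOR = 2
    CATASTROPHIC = 3


class Probability(IntEnum):
    EXTREMELY_IMPROBABLE = 0
    IMPROBABLE = 1
    PROBABLE = 2


# Hypothetical acceptance rule: the highest probability category that is
# still tolerated for each severity category.
MAX_ACCEPTABLE = {
    Severity.NONE: Probability.PROBABLE,
    Severity.MINOR: Probability.PROBABLE,
    Severity.MAJOR: Probability.IMPROBABLE,
    Severity.CATASTROPHIC: Probability.EXTREMELY_IMPROBABLE,
}


def is_acceptable(severity, probability):
    """Return True if the failure effect falls in the acceptable region."""
    return probability <= MAX_ACCEPTABLE[severity]


print(is_acceptable(Severity.MAJOR, Probability.IMPROBABLE))        # True
print(is_acceptable(Severity.CATASTROPHIC, Probability.IMPROBABLE))  # False
```

In a real analysis the categories and thresholds would come from the governing standard or the organization's own risk criteria.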
One aspect of the FMEA process that is often ignored in discussions of the methodology (perhaps because it is not represented on the FMEA worksheet) is the importance of documenting and retaining all assumptions, including rationales for failure rates and effects categorization, that underlie the FMEA worksheet entries. This is specifically cited by the SAE in its recommended standard ARP4761, appendix G, section 3.2.1.
My review of FMEA utilization in aerospace and several other fields suggests that the most common applications of FMEA are in product design and manufacturing processes. FMEA has not typically been applied to the post-manufacturing environment (such as product distribution and field usage by providers, operators, maintainers, and customers); however, post-manufacturing applications are not specifically excluded in FMEA standards. In fact, in SAE ARP5580 section 6.1.1 (5), “failure conditions caused by the operational and maintenance environment” are specifically cited among the failure modes to be considered.
Cross-industry acceptance and use of FMEA
FMEA is firmly established as a risk analysis and risk management methodology. Originating in the U.S. military during the 1940s and supported by military specification beginning in 1949 (MIL-P-1629, Procedures for Performing a Failure Mode, Effects, and Criticality Analysis), FMEA methods and applications were officially accepted as a recommended practice for aerospace engineering by the SAE beginning in 1967 under ARP926, Fault/Failure Analysis Procedure. FMEA had become a standard part of the design process in the aerospace industry by the 1980s and has been in continuous use through the present. For example, the Boeing Commercial Airplane Group relied upon FMEA to substantiate the safety and reliability of design changes for two generations of the Boeing 737 commercial airliner: the 737-300/400/500 series, first produced in the mid-1980s, and the “next generation” 737-600/700/800/900 series, first produced in the late 1990s and early 2000s. I have personally examined numerous FMEA documents and FMEA-based safety analyses prepared by aircraft manufacturers for original and modified transport-category aircraft designs (these FMEA applications are proprietary to the manufacturers). In addition to these aviation applications of FMEA, the late 1980s saw the application of FMEA to design and manufacturing processes by a major U.S. automobile manufacturer, and these practices were recognized by the automotive industry under the auspices of the Automotive Industry Action Group (AIAG) and the SAE (Surface Vehicle Recommended Practice J-1739, first issued in 1994). Currently, FMEA is recognized by the SAE (ARP5580, Recommended Failure Modes and Effects Analysis (FMEA) Practices for Non-Automobile Applications), the FAA (Advisory Circular 25.1309-1A, System Design and Analysis), and NASA (NPG 8715.3, NASA Safety Manual, and NSTS 22206, Instructions for Preparation of FMEA and CIL). In a subsequent section of this paper, I will provide an example of a successful government-sponsored (and therefore non-proprietary) aviation industry application of FMEA that resulted in a significant improvement in commercial air carrier flight safety.
FMEA has also been applied successfully in a wide range of other domains. For example, FMEA is being used to analyze design and maintenance issues in building structures (Anker Nielson, Ph.D., “Use of FMEA, Failure Modes Effects Analysis on Moisture Problems in Buildings,” Building Physics 2002 - 6th Nordic Symposium). Also, engineers have applied FMEA to design and manufacturing processes in the semiconductor industry (Steven Martin and Bedwyr Humphreys, “FMEA Speeds Time to Market in Photonic IC Manufacturing,” Compound Semiconductor, November 2002). The authors concluded, “The FMEA technique has been successfully implemented at MetroPhotonics, aiding in the rapid development and the successful launch of the SurePath product suite…Time to market and development costs were greatly reduced through the selection of optimum system alternatives (through FMEA), resulting in a successful product launch within four months of concept” (Martin and Humphreys, p. 69).
FMEA has become established as a standard methodology for risk management in the healthcare industry. Under Joint Commission on Accreditation of Healthcare Organizations (JCAHO) Standard LD.5.2, adopted July 1, 2000, healthcare organizations are required to proactively identify and manage potential risks to patient safety, using FMEA and root cause analysis to analyze at least one high-risk process annually. The U.S. Veterans Administration has developed and begun implementation of an application of FMEA that the agency customized for healthcare delivery (Joseph DeRosier, Erik Stalhandske, James P. Bagian, and Tina Nudell, “Using Health Care Failure Mode and Effect Analysis™: The VA National Center for Patient Safety’s Prospective Risk Analysis System,” The Joint Commission Journal on Quality Improvement, Vol. 28, No. 5, May 2002). Private health care organizations (for example, Kaiser Permanente) have begun to implement FMEA-based processes (Kaiser Permanente, Failure Modes and Effects Analysis Team Instruction Guide, March 2002). Although healthcare-related applications of FMEA have considered some aspects of pharmaceutical delivery (for example, Institute for Healthcare Improvement, “Sample FMEA: Comparison of Five Medication Dispensing Scenarios,” 2003), I am not aware that a comprehensive analysis of pharmaceutical distribution, delivery, and use, treating all post-manufacture activities as an integrated system, has been performed to date using FMEA or any alternative, formal risk-management methodology.
Advantages of FMEA
I suggest that FMEA has several general advantages for organizations seeking to improve quality and safety:
First, FMEA is a structured process that promotes disciplined elicitation of ideas about the kinds of failures that may occur, careful analysis of specific risk/hazard areas, proper documentation of sources and assumptions, and identification of interventions that manage risks to an acceptable level. Regarding the ultimate goal of risk management, in most applications the FMEA process requires intervention in each identified adverse outcome until the residual level of risk is acceptable.
Further, as a “bottom-up process” proceeding from the failure of an individual component of a system to the effects on the entire system, FMEA helps organizations identify unforeseen, undesired outcomes. Its best applications are prospective, facilitating the control or mitigation of adverse outcomes before they occur.
Also, FMEA explicitly considers the detectability of failure modes, and thus it promotes consideration of failures that can remain latent; that is, failures that have no immediate effect and (if they remain undetected) are capable of resulting in adverse effects when combined with subsequent failure modes or events (however, as is discussed below, the basic FMEA methodology may need to be modified to fully address latent failures).
Limitations of FMEA
SAE ARP5580 provides the following “cautions” for the application of FMEA:
- First, a FMEA traditionally considers only non-simultaneous failure modes. Each failure mode is considered individually, assuming that all other system components are performing as designed. Hence, a typical FMEA provides limited insight into the following anomalous behaviors:
  - the effects of multiple component failures on system functions, and
  - latent manifestations of defects such as timing, sequencing, etc.
- Second, the prioritization of the failure modes for corrective actions is substantially subjective. Thus, care should be taken in decision making when using any quantitative aspects of the numbers presented in the analysis (SAE ARP5580, Section 3.3).
I concur that the basic approach of FMEA is to consider single failures and that a typical FMEA application handles multiple (simultaneous/sequential) failures with difficulty (later in this paper, I will suggest several extensions to FMEA that are capable of addressing these issues).
Further, I suggest that the following additional general limitations exist for FMEA:
First, as FMEA has typically been applied in aerospace engineering, designers are permitted to rely upon human performance (such as interventions by pilots and mechanics) to mitigate the adverse effects of hardware and software component or system failures. However, in doing so, no consideration is given to imperfect human performance. For example, FAA guidance for aircraft certification states, “If…a potential failure condition can be alleviated or overcome…without requiring exceptional pilot skill or strength, credit can be taken for correct and appropriate action” (FAA AC 25.1309-1A, paragraph 11). The assessment of “exceptional” skill or strength is subjective, and once a specific human response to a failure mode is determined to require unexceptional skill or strength, FMEA typically assumes that the human will intervene reliably every time that the failure mode occurs. I believe that this is an unrealistic assumption for human performance, and as a common treatment of human performance in FMEAs it constitutes a limitation of the typical FMEA methodology.
Also, as FMEA typically has been applied in design/process applications, there is no inherent feedback to the FMEA process from the actual failure modes and outcomes experienced in field use. However, this feedback is not excluded by the FMEA process, and the continuing refinement of an FMEA through feedback has been explicitly recognized as an important aspect of system safety analysis in some applications.
Keys to successful application of FMEA
I believe that several additional issues are important for obtaining satisfactory results from an FMEA.
First, while FMEA is a structured technique that provides a comprehensive analysis, it is difficult (or impossible) to prospectively identify all possible failure modes/adverse outcomes from a complex component or functional element of a system. Because even the best FMEA effort may leave some failure modes and effects undiscovered, after completing an FMEA it is essential to avoid concluding that all risks have been compensated for or controlled. This suggests that FMEA analysts need to maintain an open and creative attitude about identifying failure modes and assessing their effects and consequences. It also establishes the rationale for obtaining, analyzing, and reacting to feedback from field use and operations, and for treating the FMEA as a “living document” that will be revisited and revised on a continuing basis.
Further, while planning and performing an FMEA, it is essential to understand the scope of the analysis and to choose a proper scope that will allow the evaluation of all critical risks that can result from failure modes. For example, many FMEAs are limited to design issues and do not necessarily consider manufacturing variations or errors. An FMEA of an aircraft part that includes several linkages may not consider the cumulative effect (stack-up) of the manufacturing tolerances that are allowed for each individual linkage as a possible contributor to failure modes and effects. Even if the scope of the FMEA for this part is enlarged to include manufacturing processes and therefore considers tolerance stack-up, the analysis still may not consider the effects of failure modes that arise downstream from the processes that have been included within the analytical scope, such as improper maintenance or use. When considering all of a product’s failure modes and effects in all environments, a still broader scope of analysis might reveal additional factors that significantly affect safety and quality. For example, consider a pharmaceutical product with an adverse side effect that poses a risk to some users. One option for controlling the risks of these side effects would be for the Food and Drug Administration (FDA) to withdraw approval for the product. However, because the product also has therapeutic value, withdrawal of the product may actually result in a net reduction of patient health and safety, even considering the adverse consequences of the side effects. The net therapeutic benefit of the product relative to its side effects will not be identified by an FMEA of its design, manufacturing, and use unless the withdrawal of the product is considered as a failure mode and the scope of analysis is broadened to consider the net consequences of non-use.
In addition to considering downstream effects in scoping the analysis, it is essential to recognize that the interventions selected in an FMEA to mitigate an identified risk can also introduce their own failure modes and effects having critical risks. Interventions should be designed to “first, do no harm;” that is, they should introduce no new uncorrected failure modes. This suggests that FMEA should be performed on each intervention, as well. In some cases controlling the hazard from one failure mode can increase the hazard from another, and this may require consideration of multiple simultaneous or sequential failures as an extension of FMEA.
Also, while interpreting the results of an FMEA, it is essential to understand the derivation and limitations of the probability analysis that is incorporated in the evaluation of the risks associated with failure effects. The probability that a failure mode will occur can be obtained from engineering, field, or registry data such as historic component failure rates; the probability that a functional element or complex component will fail can be estimated by combining the failure rates of sub-assemblies or sub-systems. Failure rates may be obtained from laboratory research if actual field data are unavailable. Lacking both field and laboratory data, failure mode probabilities may be estimated. The FMEA analyst’s confidence in the results should depend on the derivation of these probabilities. An additional probabilistic element in some FMEA applications is the likelihood that an effect of stated severity will follow from a failure mode. This element needs to be estimated in a similar manner, with confidence in the results of the analysis once again depending on the source of the probability estimates. Another probabilistic element can enter FMEA when considering interventions to control or mitigate an identified risk; here, the probability that the intervention will successfully address the risk needs to be estimated.
Failure and reliability rates are particularly difficult to estimate when human performance is involved. The FAA states in its design guidance material that “quantitative assessments of the probabilities of crew error are not considered feasible” (FAA AC 25.1309-1A, paragraph 11); as I have already discussed, the FAA then turns at times to the unrealistic assumption that humans perform with perfect reliability. In other domains, performance by trained professionals has been estimated as being satisfactory in 30 to 60 percent of exposures to a demanding task. Although the reliability level of human performance is highly variable depending on the nature of the task, environment, and individual, it is probably best to assume that human performance in systems often may be much less reliable than what is demanded of hardware and software systems, and accordingly to plan compensations when humans may be responsible for detecting primary failure modes or for intervening to mitigate failure effects.
Review of FMEA applications in various industries suggests that there is no standard definition for an acceptable level of risk. Based on the high volume of operations with consequent risk exposure and the public’s low tolerance for mishaps, commercial aviation design and manufacturing is held to a stringent reliability criterion: certification guidance requires that every failure having catastrophic consequences must be demonstrated to be extremely improbable; the FAA defines “extremely improbable failure conditions” as “those having a probability on the order of 1 × 10⁻⁹ or less” (FAA AC 25.1309-1A, paragraph 10). In contrast, FMEA applications in other industrial domains accept catastrophic outcomes with probabilities that may be orders of magnitude more likely. An interesting criterion for aviation design that incorporates both probability and severity factors establishes that “in general, a failure condition resulting from a single failure mode of a device cannot be accepted as being extremely improbable” (FAA AC 25.1309-1A, paragraph 2-g). Thus, every failure mode having catastrophic consequences, regardless of its estimated likelihood, must be mitigated by a redundant system or a means of reliably detecting the failure before it occurs (the FAA guidance does suggest that “…in very unusual cases, however, experienced engineering judgment may enable an assessment that such a failure mode is not a practical possibility.”).
When considering the effectiveness of interventions in mitigating the risks of failure effects, a significant implication of probability analysis is the assumption of independent events. When two events are independent, the probability of both occurring is the probability of one event multiplied by the probability of the other event. For example, consider an aircraft component that FMEA determines to have an unacceptable failure rate. To control this risk, designers require the mechanic to check the component before each flight and also require the pilot to recheck the component during the taxi-out checklist. If there is a 10 percent chance of the mechanic forgetting to check the component and also a 10 percent chance of the pilot skipping the same item on the checklist, the probability of the check being omitted by both persons is only 1 in 100. In this manner, adequate reliability can be obtained from two somewhat unreliable human performances by imposing multiple, redundant interventions. However, this analysis assumes that the pilot and mechanic events are independent, while in reality these events may interact: a pilot who knows that the mechanic is supposed to be checking the component may grow to rely on the mechanic and become less likely to perform the re-check. As another example, consider a pharmaceutical product that requires patients to receive periodic lab tests to detect possible adverse side effects. Multiple, redundant interventions are designed to ensure that patients receive the lab tests: doctors and pharmacists are both instructed to track the due dates for the tests and notify patients. However, if doctors become aware that pharmacists are tracking the due dates, the doctors may become less likely to perform this tracking themselves; the multiple intervention then collapses to a single intervention and the redundancy is lost. Whenever the assumption of independent events is violated and the likelihood of one event becomes a function of another event, it is impossible to conclude that the desired reliability will result from multiple interventions. Therefore, interventions must be designed and implemented so as to provide and preserve the independence of the events.
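The arithmetic of the pilot and mechanic example can be sketched as follows. The 10 percent figures come from the example above; the 50 percent and 100 percent conditional probabilities are hypothetical values chosen only to show how reliance on the other checker erodes the redundancy.

```python
def joint_omission(p_first, p_second_given_first):
    """Probability that both checks are omitted.

    p_first: probability that the first person (mechanic) omits the check.
    p_second_given_first: probability that the second person (pilot) omits
        the check, given that the first person has omitted it. For
        independent events this is simply the second person's own rate.
    """
    return p_first * p_second_given_first


# Independent events: 10 percent each, as in the example above.
independent = joint_omission(0.10, 0.10)

# Hypothetical dependence: the pilot has come to rely on the mechanic.
reliant = joint_omission(0.10, 0.50)

# Limiting case: the pilot never re-checks, so only one check remains.
collapsed = joint_omission(0.10, 1.00)

print(f"{independent:.3f} {reliant:.3f} {collapsed:.3f}")
# prints: 0.010 0.050 0.100
```

In the limiting case the joint probability equals the mechanic's omission rate alone, which is the collapse to a single intervention described above.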
Complementary analytical techniques
In its Safety Manual, NASA states that “risk assessment should use the simplest methods that adequately characterize the probability and severity of undesired events.” The NASA manual further states, “Qualitative methods that characterize hazards and failure modes and effects should be used first…quantitative methods are to be used when qualitative methods do not provide an adequate understanding of failures, consequences, and events” (NASA NPG 8715.3).
A variety of analytical methods are available to apply to risk management, in addition to FMEA. I will briefly define and discuss several of these methods and indicate how they can be used to complement FMEA and extend its applications into areas in which FMEA is otherwise inherently limited.
I have described the FMEA method as a “bottom-up” approach that attempts to identify failure effects (some of which may not yet have occurred in actual use of the product) by starting with individual component failures, imagining the ways the component can fail, and then proceeding up the chain of the system to subsequent failures and consequences. Further, I identified the bottom-up orientation of FMEA as advantageous for a prospective, accident-prevention program.
Some alternative analytical methods are “top-down” in that they begin with the ultimate system consequence or failure event and then proceed down into the system to identify why the failure occurred. These methods perform well as retrospective analyses; for example, investigations of accidents or incidents that have already occurred. However, top-down methods can also be useful in prospective analysis; for example, when concerned about a severe consequence, recognizing that the primary FMEA method may miss some failure effects, it may also be helpful to analyze beginning with the consequence itself and to search creatively for other sub-system functions or component failures that might bring about the undesired result.
The SAE’s recommended standard for the general evaluation of aircraft safety (ARP4761, Guidelines and Methods for Conducting the Safety Assessment Process on Civil Airborne Systems and Equipment) describes an over-arching “System Safety Assessment” (SSA) process. SSA integrates FMEA and some of the following approaches, as required, to thoroughly evaluate all of the failure modes, failure effects, and risks of a system and show that the entire system (the aircraft) operates at the required level of safety/reliability despite all anticipated failure modes.
Functional Hazard Analysis (FHA) is a top-down approach that is most often performed at the beginning of a design effort, when the final specifications for a product have not yet been settled yet its basic functions are already established. Using engineering judgment and knowledge from similar efforts, analysts review the basic functions of a product or process and suggest system-level hazardous outcomes for further analysis. This method allows the safety/quality improvement process to begin early in product development, at least at a level of broad generality.
Methods similar to FHA also can be applied retrospectively, after a product is fielded. One successful application is Hazard Analysis and Critical Control Points (HACCP), which is used in the food services industries to evaluate the entire chain of food production and distribution, identifying and controlling sources of food contamination. This application seems amenable to the simpler FHA methodology rather than a formal FMEA.
Fault Tree Analysis (FTA) is a more formal top-down approach to identifying the causal links between functional breakdowns and their antecedents in events or failures of lower-level components. The FTA begins with the system-level failure or consequence that the analysts want to understand. Proceeding down through the system from the top-end level to the underlying processes and components, the analysis results in a graphical representation of the combinations of subsystem and component failures that can result in the system event. The fault tree (so-named because it resembles the root structure of a tree) uses standard notations of Boolean logic to denote precursor or lower-level events that must occur individually (“or-gate”) or in combination (“and-gate”) to bring about the higher level event. In this manner, FTA directly incorporates multiple causation (simultaneous/sequential) events. Further, when failure rates are added to each component of the tree diagram, the probabilities of each of the lower-level events can be added or multiplied to estimate the probability of the ultimate system-level event.
The following is an example of FTA provided by the SAE:
Source: SAE ARP926B, p. 46.
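The way gate logic turns component probabilities into a top-level probability can be sketched in a few lines. This is a minimal illustration that assumes all basic events are independent; the class names, the example tree, and its probabilities are hypothetical and are not drawn from the SAE example. For the or-gate the sketch uses the exact expression 1 − ∏(1 − p), which is close to the simple sum of the input probabilities when they are small.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Union


@dataclass
class Basic:
    """A basic (lowest-level) event with an assumed probability."""
    name: str
    probability: float


@dataclass
class Gate:
    """An intermediate or top event fed by an and-gate or an or-gate."""
    kind: str  # "and" or "or"
    inputs: List["Node"]


Node = Union[Basic, Gate]


def probability(node):
    """Probability of the event at this node, assuming independent inputs."""
    if isinstance(node, Basic):
        return node.probability
    inputs = [probability(child) for child in node.inputs]
    if node.kind == "and":   # all inputs must occur
        return prod(inputs)
    if node.kind == "or":    # at least one input must occur
        return 1.0 - prod(1.0 - p for p in inputs)
    raise ValueError(f"unknown gate kind: {node.kind}")


# Hypothetical tree: loss of pressure occurs if both pumps fail
# or if a single shared valve fails.
top = Gate("or", [
    Gate("and", [Basic("pump A fails", 0.01), Basic("pump B fails", 0.01)]),
    Basic("shared valve fails", 0.001),
])

print(f"{probability(top):.7f}")
```

In this hypothetical tree the single shared valve dominates the result even though each pump is ten times less reliable, which illustrates how a fault tree exposes single points of failure behind redundant components.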
As a top-down approach, FTA may identify one or more underlying causes of the top-level event but omit others that might be identified in the bottom-up FMEA. An additional limitation of FTA is that the methodology (unlike FMEA) does not represent the severity of consequences; hence, it is difficult to assess the risks of failure and evaluate them with respect to the available countermeasures without also undertaking an FMEA.
Because it handles multiple failures, various multiple causations as expressed through Boolean logic, and the associated probabilities rather naturally, FTA also complements FMEA where the latter is limited. I suggest that FTA notation and techniques should be applied selectively to explore multiple failures and associated probabilities once these factors have been identified in the basic FMEA. Another advantage of FTA when used in combination with FMEA is the top-down check of the bottom-up process that I have already described. FTA might be applied selectively, once again, to confirm that FMEA has not omitted catastrophic outcomes. I would consider selective application of FTA as a complementary extension to the basic FMEA methodology. This is explicitly recognized by the SAE in ARP926B.
Probabilistic Risk Assessment (PRA) has been adopted by NASA as a formal methodology for analyzing “the probability (or frequency) of occurrence of a consequence of interest, and the magnitude of that consequence, including assessment and display of uncertainties” (Michael A. Greenfield, “Risk Management Tools,” NASA Langley Research Center presentation, May 2, 2000). A key contribution of PRA is that it considers, tracks, and documents the current state of knowledge and certainty of the probabilities that are employed in basic FMEA and other analyses. One significant limitation of PRA, as defined by NASA, is that the methodology requires specific experience-based failure rate data for the components and functions that are being analyzed. As a result, I suggest that it may be difficult to apply formal PRA to “softer” areas such as human performance in FMEA interventions.
Markov Analysis (MA) is a specialized probabilistic analysis especially well suited to evaluating the failure effects and consequences of high-technology systems that include self-monitoring, self-repairing, and self-reconfiguring functionalities. MA is capable of handling these complex relationships between failure mode, effect, and consequence by representing the relationship as a chain, with each element in the chain in an operational or non-operational state and the movement between states described by a system of differential equations. I would suggest that MA is a good methodology to employ as a complement to basic FMEA and FTA when the nature of the components, environment, or operators requires it; otherwise, in accordance with the principle of minimizing the complexity of risk analysis, MA does not appear warranted in most applications.
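The simplest possible Markov model, a single self-repairing element with one operational and one non-operational state, can be sketched as follows. The constant failure rate `lam`, the constant repair rate `mu`, and the numeric values are assumptions made for illustration. The governing equation is dP_up/dt = −λ·P_up + μ·(1 − P_up), which the sketch solves both in closed form and by simple numerical integration.

```python
import math


def availability_closed_form(lam, mu, t):
    """Probability of being operational at time t, starting operational."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)


def availability_euler(lam, mu, t, dt=0.01):
    """Same quantity, by forward-Euler integration of the state equation."""
    p_up = 1.0  # the element starts in the operational state
    for _ in range(int(round(t / dt))):
        # Leave the operational state by failure; re-enter it by repair.
        p_up += dt * (-lam * p_up + mu * (1.0 - p_up))
    return p_up


lam, mu = 0.01, 0.1  # hypothetical failures and repairs per hour
print(f"closed form: {availability_closed_form(lam, mu, 50.0):.4f}")
print(f"numerical:   {availability_euler(lam, mu, 50.0):.4f}")
print(f"steady state: {mu / (lam + mu):.4f}")
```

Real applications involve many states and transitions, for which the same equations are written in matrix form and solved numerically; the two-state case is shown only because it can be checked by hand.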
+To summarize these alternative methodologies, it is quite possible to + extend a basic FMEA into areas in which the FMEA method is limited, +including multiply caused events, simultaneous or sequential events, and + the estimation of probabilities of failure modes, effects, and +consequences (and our confidence in the estimated probabilities), by +applying selected aspects of FTA and PRA to the FMEA. I do not suggest +that complete, formal FTA and PRA need to be undertaken in every FMEA +application; rather, these methodologies should be drawn from as +required.
+Complementary field reporting and data analysis systems from aviation
+In a previous section, I mentioned the importance of feeding +information from the post-manufacturing user communities and processes +back into the FMEA to ensure that the consequences of failure modes that + arise only in product use (perhaps because they were rare events and +did not occur during design and testing) are recognized and compensated +for once they have been discovered. There are several fairly recent +developments in aviation industry reporting and analysis systems, +potentially useful for refining and refreshing an FMEA on a continuing +basis, that may also have applications in other industries.
+Aviation Safety Action Programs (ASAP) are cooperative reporting systems for persons active in commercial aviation operations, including pilots, mechanics, and aircraft dispatchers, to report the events that happen in daily line operations. ASAP reports are non-jeopardy; in fact, if a person reports an event to ASAP independently of enforcement action by the regulatory authority (FAA) then the FAA will typically waive sanctions for any regulatory violation related to the event. This waiver of sanctions motivates personnel to report the information. ASAP reflects the aviation system’s recognition that for human failings, obtaining the information is often more important than punishing the transgressions, most of which are inadvertent in any case. A key feature of the ASAP program is the Event Review Team, comprising representatives from the airline, the pilots’ association, and the FAA, which meets periodically to review all submitted ASAP reports and act on the information in the reports. ASAP is considered to be successful in revealing, disseminating, and promoting resolution of adverse events in daily flight operations that would otherwise remain unknown. ASAP applications are increasingly popular in commercial aviation. These programs are described in official FAA guidance (Advisory Circular 120-66B, Aviation Safety Action Program).
+Whereas ASAP obtains information from the personnel in the aviation system, Flight Operations Quality Assurance (FOQA) programs tap into the volumes of parametric data generated during regular flight operations and recorded continuously by on-board solid state recording equipment (similar to, but usually distinct from the crash-hardened Digital Flight Data Recorders that are used in accident investigations). In FOQA, the greatest challenges are handling mass data and then interpreting the information. Initial applications of FOQA concentrated on identifying events in which normal flight parameters (such as airspeed limitations, g-loading, touchdown relative to target) were exceeded. The programs are beginning to delve beyond exceedance monitoring to the consideration of within-specification performance statistics, including both the means and the distributions about them, which can then define the norms of the industry. There is also a growing trend in FOQA programs to link the information obtained from FOQA with information derived from ASAP about the same events. This facilitates the combined analysis of “what” happened (from FOQA) and “why” it happened (ASAP, to the extent that the personnel involved in the event were aware of why they performed the way that they did). A long-term NASA research program, the Automated Performance Management System, is encouraging the establishment of FOQA programs at various U.S. airlines and enhancing data analysis along these lines. Most of the major U.S. air carriers are generating and collecting FOQA data on at least their more modern fleet types (these aircraft are equipped with the required data busses). FOQA programs are described in the Flight Safety Foundation’s Flight Safety Digest, July-September 1998, “Aviation Safety: U.S. Efforts to Implement Flight Operational Quality Assurance Programs.” Although analogous data may not be available in other applications, FOQA demonstrates the value of routine monitoring of the use of products in the field, including the identification of product misuse (exceedances in FOQA) and the characterization of norms for product use.
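The two styles of FOQA analysis described above, exceedance monitoring and characterization of within-specification norms, can be sketched in a few lines. The flight identifiers, the recorded values, and the 1.8 g threshold are all hypothetical; they are not taken from any airline's FOQA program.

```python
from statistics import mean, stdev


def summarize_parameter(samples, limit):
    """Return (exceedances, mean, standard deviation) for one recorded parameter.

    `samples` is a list of (flight_id, value) pairs; an exceedance is any
    value above `limit`.
    """
    exceedances = [(flight, value) for flight, value in samples if value > limit]
    values = [value for _, value in samples]
    return exceedances, mean(values), stdev(values)


# Hypothetical touchdown vertical accelerations (g) by flight identifier.
touchdown_g = [
    ("F001", 1.21),
    ("F002", 1.35),
    ("F003", 1.88),
    ("F004", 1.27),
    ("F005", 1.42),
]
HARD_LANDING_LIMIT_G = 1.8  # assumed threshold, for illustration only

exceedances, average_g, spread_g = summarize_parameter(touchdown_g, HARD_LANDING_LIMIT_G)
```

The exceedance list supports the "what happened" review of individual events, while the mean and spread describe normal use and could serve as feedback to the occurrence assumptions in an FMEA.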
+The Continuing Analysis and Surveillance System (CASS) is an aviation reporting and analysis system that concentrates on tracking product failure modes, effects, and consequences in actual line maintenance operations. CASS is one of the oldest data-driven quality assurance programs, beginning in 1964 and tracing its history to industry concerns about several maintenance-related air carrier accidents during the 1950s. Air carriers are required to implement CASS by Federal aviation regulations (14 CFR Part 121.373); interestingly, CASS is the only safety management/quality assurance system that has been specifically mandated by the FAA. CASS is defined by the FAA as a “structured process to identify factors that could lead to an accident or incident through collection and evaluation of information that can be used as indicators of the degree of maintenance program effectiveness and performance…accomplished through a closed-loop, continuous cycle of surveillance, investigations, data collection and analysis, corrective action, corrective action monitoring, and back to surveillance.” (FAA AC 120-16D, Air Carrier Maintenance Programs, and AC 120-79, Developing and Implementing a Continuing Analysis and Surveillance System).
+Event reporting systems with many similarities to these aviation +systems are being developed and used in other industries, including +healthcare. I think that review of the characteristics and +implementation of ASAP, FOQA, and CASS may enhance similar systems in +alternative industries, particularly as these aviation systems are +applied in combination to obtain information that only the personnel in +the system can report, additional mass data about regular operations, +and specific product and personnel failures in the post-manufacturing +environment. Also, I suggest that information systems with these +characteristics can be effective feedback mechanisms for the ongoing +analysis of failure modes, effects, and consequences through FMEA.
+The Boeing 737 Flight Controls Engineering Test and Evaluation Board: a successful application of extended FMEA
+On September 8, 1994, USAir flight 427, a Boeing 737-300 airplane, +crashed while maneuvering to land at Pittsburgh International Airport, +Pittsburgh, Pennsylvania. All of the 132 persons aboard were killed, and + the airplane was destroyed. The accident occurred in clear weather with + light winds, during the hours of daylight. After a three-year +investigation, the National Transportation Safety Board (NTSB) +determined that the probable cause of this accident was “loss of control + of the airplane resulting from the movement of the rudder surface to +its blowdown limit…The rudder surface most likely deflected in a +direction opposite to that commanded by the pilots as a result of a jam +of the main rudder power control unit servo valve secondary slide to the + servo valve housing offset from its neutral position and overtravel of +the primary slide.” (National Transportation Safety Board, Uncontrolled +Descent and Collision With Terrain, USAir Flight 427, Boeing 737-300, +N513AU, Near Aliquippa, Pennsylvania, September 8, 1994, NTSB AAR-99/01, + adopted on 3/24/99).
+Before this accident the rudder system of the 737 had been evaluated +by Boeing and the FAA, in full compliance with existing certification +requirements, using failure analysis (a less rigorous version of FMEA) +for the original design reviews performed during the 1960s and FMEA for +new-model reviews performed during the 1980s and 90s. Because the rudder + systems had not been completely redesigned in the new model 737s, the +FAA required only a very limited scope for the FMEAs conducted in the +80s and 90s. Despite these analyses and consistent with their limited +scope, the NTSB investigation determined that the airplane’s rudder +system was subject to several previously unidentified single-point +failures that could have catastrophic results. One or more of these +failure modes was most likely involved in the rudder system jam and +reversal, which led to the fatal accidents.
+The NTSB issued numerous safety recommendations related to its +findings regarding the Boeing 737 rudder system and unusual attitude +recovery procedures for flight crews. In Safety Recommendation A-99-21, +the NTSB recommended to the FAA:
+Convene an engineering test and +evaluation board to conduct a failure analysis to identify potential +failure modes, a component and subsystem test to isolate particular +failure modes found during the failure analysis, and a full-scale +integrated systems test of the Boeing 737 rudder actuation and control +system to identify potential latent failures and validate operation of +the system without regard to minimum certification standards and +requirements in 14 Code of Federal Regulations Part 25. Participants in +the engineering test and evaluation board should include the Federal +Aviation Administration (FAA); National Transportation Safety Board +technical advisors; the Boeing Company; other appropriate manufacturers; + and experts from other government agencies, the aviation industry, and +academia. A test plan should be prepared that includes installation of +original and redesigned Boeing 737 main rudder power control units and +related equipment and exercises all potential factors that could +initiate anomalous behavior (such as thermal effects, fluid +contamination, maintenance errors, mechanical failure, system +compliance, and structural flexure). The engineering board’s work should + be completed by March 31, 2000 and published by the FAA.
+In response to this recommendation, the Engineering Test and +Evaluation Board (ETEB) was convened in May 1999 and completed its work +in July 2000 with the issuance of a final report. (Federal Aviation +Administration, 737 Flight Controls Engineering Test and Evaluation Board Final Report, + July 20, 2000.) The staff of the ETEB was detailed from the FAA, Boeing + (Commercial, Space, and Military Airplane divisions), Air Line Pilots +Association, Ford Motor Company, Air Transport Association, Interstate +Aviation Commission (Russia), NASA, and U.S. Navy.
+According to the ETEB’s report, the group conducted:
+-
+
- A failure analysis of the flight control system to identify potential failure modes; +
- Component and subsystem tests to isolate particular failure modes found during the failure analysis; and +
- Full-scale integrated systems tests, including ground and flight +testing, of the … 737 rudder actuation and control system to identify +potential latent failures and to validate the operation of the system +(ETEB Final Report, p. 2-3). +
The ETEB noted that normal certification procedures for aircraft and components require consideration of the probabilities of a failure mode or adverse effect. However, the ETEB chose to evaluate the severity of failure mode consequences without regard to their probability of occurrence. The ETEB’s rationale for this approach was that the Boeing 737 had experienced approximately four serious failures of its rudder system in 100 million flight hours, two of which had resulted in fatal accidents. Therefore, the failures under investigation were extremely rare but had extremely adverse outcomes. Consequently, it was considered appropriate to assign the highest risk level to any failure mode with the potential for catastrophic consequences, regardless of how unlikely the failure mode or effect. A related goal of this new analysis was to “focus…on rare failures that may not have been considered in the original certification requirements” because the failures were considered extremely improbable (ETEB Final Report, p. 2-8). The ETEB described its analytical approach as follows:
+The ETEB conducted a comprehensive and +detailed failure modes and effects analysis (FMEA) for the complete +rudder control system…Preliminary hazard classifications were assigned +to each failure, based on the predicted severity and the ability of the +flight crew to maintain control of the airplane and conduct a safe +landing. For all failures classified as “catastrophic (Class I)” or +“hazardous (Class II),” the ETEB conducted failure simulations using a +detailed high-fidelity simulation of the rudder control system. In +addition, the ETEB conducted pilot-in-the-loop failure simulations using + a motion-base flight simulator. The purpose was to identify the impact +of the failures on the operation of the airplane following flight crew +actions. The hazard classifications of the failures were updated, based +on the combined results from these two simulation activities (ETEB Final + Report, p. 2-7).
+These tests and simulations were used to verify and validate the +hazard levels that had preliminarily been assigned to the failure modes. + Because some failures and interventions had unexpected consequences in +the testing, the feedback from these verifications was extremely +important and influential in the final conclusions and recommendations +of the ETEB. This demonstrates how an FMEA that is open to feedback and +change, either from testing or field experience, can provide much better + results than a one-time evaluation.
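The severity-first ranking and the feedback from simulation described above can be sketched as follows. This is only an illustration of the idea, not the ETEB's actual worksheet logic: the field names and failure modes are hypothetical, hazard classes are represented as integers (1 for Class I, 2 for Class II, and so on), and the sketch assumes that a simulated classification, when available, replaces the preliminary one.

```python
def effective_class(row):
    """Use the simulation-validated class when one exists, else the preliminary class."""
    if row["simulated_class"] is not None:
        return row["simulated_class"]
    return row["preliminary_class"]


def rank_by_severity(worksheet):
    """Order failure modes by hazard class alone (Class 1 first), ignoring probability."""
    return sorted(worksheet, key=effective_class)


# Hypothetical worksheet rows; only modes first rated Class 1 or 2 were simulated.
worksheet = [
    {"failure_mode": "mode B", "preliminary_class": 3, "simulated_class": None},
    {"failure_mode": "mode C", "preliminary_class": 1, "simulated_class": 2},
    {"failure_mode": "mode A", "preliminary_class": 2, "simulated_class": 1},
]

ranked = [row["failure_mode"] for row in rank_by_severity(worksheet)]
```

In this example, "mode A" moves to the top of the list only because testing revised its classification upward, which is the kind of change that a one-time evaluation would miss.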
+The ETEB illustrated the verification and feedback built into the FMEA in the following figure from its final report:
+Source: ETEB Final Report, p. 2-6
+The full range of hazard classifications followed standard FAA practice and was defined as follows by the ETEB:
+Source: ETEB Final Report p. 3-3
+The ETEB used a standard adaptation of the FMEA analysis form (see +table). It is interesting to note how the form explicitly recognized the + mitigating effects of flight crew actions in response to equipment +malfunctions (columns 5, 7, and 8).
+Source: ETEB Final Report, p. 3-2
+Although the possibility of imperfect flight crew performance (a +realistic expectation for human intervention in a complex or stressful +situation) was not explicitly modeled on the FMEA worksheet, the ETEB +accomplished this important extension to the basic FMEA by validating +and revising assumptions about the reliability of flight crew +performance through its testing process. The ETEB found that flight +crews were not able to reliably intervene and mitigate the consequences +of rudder component failures in some operational circumstances, and +these revised expectations were entered into the final versions of the +FMEA worksheets.
+The following figure provides an excerpt of an actual FMEA worksheet. + This worksheet includes a finding of catastrophic severity for a +failure effect that could not be mitigated:
+Source: ETEB Final Report, appendix A, p. 95
+Another useful extension that the ETEB added to the basic FMEA was the explicit consideration of latent (preexisting, undetected) failures combined with active failures. Although FMEA is not considered to be well-suited to the analysis of multiple failure modes, the ETEB was able to readily analyze these sequential failure combinations by treating the latent and active failures as a single combined failure mode for subsequent evaluation of the failure effects and consequences. This manual extension of the FMEA method was effective for linked pairs of failures; I think that it would have been very complicated to use this method to track and display triple or even more complicated failure combinations, but these failure combinations were not required.
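A minimal sketch of this pairing approach, assuming that every latent failure is paired with every active failure and that each pair is then carried forward as a single worksheet row. The failure names are placeholders, not ETEB findings.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class FailureMode:
    name: str
    latent: bool  # True if the failure can exist undetected before the flight


def combine_latent_active(modes):
    """Pair each latent failure with each active failure as one combined mode."""
    latent = [m for m in modes if m.latent]
    active = [m for m in modes if not m.latent]
    return [f"{l.name} + {a.name}" for l, a in product(latent, active)]


# Placeholder failure modes, for illustration only.
modes = [
    FailureMode("latent A", latent=True),
    FailureMode("latent B", latent=True),
    FailureMode("active X", latent=False),
    FailureMode("active Y", latent=False),
]

combined = combine_latent_active(modes)
```

The number of combined rows is the product of the latent and active counts, which is manageable for pairs but grows quickly if triple or higher-order combinations are added.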
+The table that follows (from ETEB Final Report, p. 3-40) provides a +sample of the new latent/active failure combinations that the ETEB was +able to identify and analyze using FMEA:
+The FMEA undertaken by the ETEB was successful in identifying a large + number of previously unknown or unevaluated failure modes, several of +which had the potential to result in catastrophic consequences. The +following are excerpted from the results presented by the ETEB in its +final report:
+The [Boeing] 737 rudder control system is susceptible to a number of:
+-
+
- Failures and jams that can cause uncommanded rudder motion; +
- Failures and jams that affect the operation of both the rudder main + and standby power control units (PCU), thereby defeating the +independence of the two systems; and +
- Latent failures. +
These failure modes are single failures, single jams, or latent failures in combination with a detectable failure or jam.
+The rudder control system of the Initial and Classic Model 737s with +the modifications required by the applicable FAA [Airworthiness +Directives]…have:
+-
+
- 14 single failures and jams, and 12 latent failure combinations, +that have Class I failure effects in the takeoff and landing regimes. +These same failure modes have 4 Class I effects and 22 Class III (major) + effects in the rest of the flight envelope. +
- 8 single failures and jams, and 11 latent failure combinations, that have Class II failure effects. (ETEB Final Report, p. 1-3) +
The ETEB drew strong conclusions about factors influencing the +efficacy of human interventions to mitigate rudder system failures:
+The ETEB conducted 40 hours of pilot-in-the-loop rudder failure simulations with 10 pilot and co-pilot flight crews from four airlines.
+-
+
- In general, the flight crews found the existing Jammed or Restricted Rudder Emergency Procedure difficult to use. +
- The flight crews appeared to have received little training in the +use of the Jammed or Restricted Rudder Emergency Procedure or the +Uncommanded Yaw or Roll Emergency Procedure. +
- The lack of a clear and unambiguous display of rudder position made +it difficult for the crews to diagnose uncommanded rudder deflections +and take prompt corrective actions. +
- Uncommanded rudder hardover deflections during takeoff and landing +resulted in Class I failure effects [i.e., human intervention was not +reliably effective] (ETEB Final Report, p. 1-4). +
The ETEB’s investigation of latent failure effects using extended +FMEA methods resulted in a conclusion that “there are several latent +failures that, when combined with one additional single failure or jam, +result in Class I or Class II failure effects. There are insufficient +inspections for these latent failures” (ETEB Final Report, p. 1-5).
+As I have indicated throughout, no FMEA can be considered complete unless it leads to the mitigation of the unacceptable risks that the analysis identifies. The ETEB’s application of FMEA resulted in the following recommendations for redesign of the rudder system:
+Modify the Boeing Model 737 rudder control system to ensure that:
+-
+
- No single failure or single jam of the rudder control system will +cause uncommanded motion of the rudder surface that results in a Class I + failure effect; +
- No combination of failures or jams will result in a Class I failure +effect, except for those combinations that are shown to be extremely +improbable; and +
- No probable single failure or jam will have an effect worse than Class IV.
In + addition, The Boeing Company should consider providing a fail-safe +rudder control system design that provides protection from latent +failures that contribute to a Class I failure effect (ETEB Final Report, + p. 1-6).
+
As a result of these recommendations (and the preceding accident +investigation causal findings and recommendations of the NTSB), the +Boeing 737 rudder system has been redesigned to provide reliable +redundancy, and a major hardware retrofit program is underway for the +entire fleet.
+To mitigate risks pending completion of this fleet retrofit, the ETEB + also provided the following recommendations to improve the risk +mitigation value of human (pilot and mechanic) interventions following a + rudder system failure:
+-
+
- Revise and simplify the current “Jammed or Restricted Rudder” emergency procedure. +
- Provide additional training to flight crews in the use of the +“Jammed or Restricted Rudder” emergency procedure and the related +“Uncommanded Yaw or Roll” emergency procedure. +
- Display rudder position to the flight crew. +
- Alert flight crews and maintenance crews to the signs of rudder +malfunctions, such as uncommanded pedal motion (ETEB Final Report, p. +1-6). +
These recommendations targeted at improving human performance have +been partially implemented by the aircraft manufacturer and FAA, from +2000 to present. Despite the limitations that remain in human +interventions, it is most significant, I believe, that the result of the + FMEA performed by the ETEB was to render the designers’ expectations +for human performance, and the design’s reliance on human intervention, +much more consistent with realistic human capabilities and limitations. + This was a strong contributor to the accuracy and applicability of the +FMEA’s results and its ability to improve system safety.
+In all, I believe that the ETEB process was a very successful example of the application of FMEA extended with (1) top-down analysis (the program began with foreknowledge that the end-level adverse event to eliminate or mitigate was flight control malfunction leading to loss of aircraft control), (2) consideration of multiple (latent) failures, (3) realistic consideration of human performance during interventions, and (4) feedback from external data sources to FMEA revision. In the ETEB application, FMEA was not supplemented by data-driven analysis of conditional probabilities; this was an appropriate, conservative response to the extremely rare/extremely hazardous nature of the environment and threats.
+The ETEB’s work shows how the basic FMEA combined with complementary +extensions can form a comprehensive safety analysis that results in real + safety improvement. The excellent results of the ETEB program are +equally a testament, I think, to a strong effort to creatively re-think +the failure modes and effects for a system that had been thought to be +completely well-understood and thoroughly time-tested by 100 million +hours of field use. This creativity and openness are necessary +ingredients for any successful analysis.
+Conclusions about FMEA
+Based on the foregoing review, I conclude the following about the Failure Modes and Effects Analysis methodology:
+-
+
- FMEA is a sound methodology for basic, structured risk management and quality improvement analysis. +
- The ideal approach can be to use FMEA as the backbone for analysis +that also includes the integration of complementary methods, as +required; for example, it may be appropriate to apply elements of FTA or + PRA to understand and explore the proper scope of analysis, the +significance of failure effects, and the effectiveness of risk +management interventions. +
- Thoughtful application of FMEA can identify when these extensions are required and can integrate and document the results of an extended analysis. +
- The limited reliability of humans in complex systems argues for +multiple, redundant, independent interventions when relying on humans to + detect failure modes or actively intervene to mitigate failure effects. +
- FMEA, as extended with appropriate top-down, probabilistic, and +feedback methods, is an excellent framework for risk management and +quality improvement in the post-design/post-manufacture (field +distribution, application, or user) environment, including the human +performance aspects of this environment. +
+
[1] I acknowledge and thank ParagonRx, LLC for its support of my review of risk-management methodologies and the writing of this paper. All opinions expressed herein are my own and do not necessarily represent the opinions, policies, and products of ParagonRx, LLC.
++ +