The link between root cause analysis, risk, and reliability

Root Cause Analysis:

Head to the business management section of any bookstore, and you’ll see volumes dedicated to improving the work process. They range from statistical methods like Six Sigma to the Deming Cycle (plan, do, check, act), to reliability-centered maintenance and, of course, root cause analysis.

Although these tools have evolved into valuable, even vital practices, they also have had one unfortunate side effect: people focus on the specifics of each tool rather than the underlying principles. Many believe, for instance, that root cause analysis has no connection with managing risk or improving reliability. If you want to improve your operation’s reliability, well then, you need reliability-centered maintenance, or perhaps Six Sigma. Want to reduce the risk of failure of your systems or machinery? You need failure modes and effects analysis (FMEA), right?

Well, not necessarily.

It’s easy to focus on this book or that method, but the basic principle behind them all is the same: to discover better work processes, or in other words, a better way of doing things. All of these tools, including root cause analysis, ultimately change an organization’s work process. Many assume root cause analysis is mainly about solving problems. But how do you solve a problem? By uncovering the steps of a work process that led to a problem, then changing the work process to prevent the problem from happening again.

Any time you change a work process, you change the degree of risk and reliability associated with it.  For this reason, the methods behind root cause analysis remain solidly linked to reducing risk and improving reliability.
 

Boiling Down the Terminology

Work-process improvement programs have dictionaries-worth of terminology.  Although this isn’t necessarily a bad thing, it can spur some confusing conversations, particularly between those who are immersed in the improvement process and those who aren’t.

Consider Six Sigma (itself a rather daunting phrase), which aims for no more than 3.4 defects per million opportunities (DPMO). The target is based on process capability studies that compare the standard deviation of a statistical sample with the specification limits: the narrower the normal distribution curve, the more reliable the process, and the higher the sigma level. Six Sigma uses improvement methodologies like design of experiments (DOE), which is part of DMAIC (define, measure, analyze, improve, and control), a method inspired by Dr. W. Edwards Deming’s plan-do-check-act (PDCA) cycle, which itself is an iteration of the scientific method.
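To make the DPMO figure concrete, here is a minimal Python sketch of the standard calculation: defects divided by total opportunities, scaled to one million. The function name, its parameters, and the sample counts are hypothetical and chosen only for illustration.

```python
def dpmo(defects: int, units: int, opportunities_per_unit: int) -> float:
    """Return defects per million opportunities (DPMO)."""
    total_opportunities = units * opportunities_per_unit
    if total_opportunities <= 0:
        raise ValueError("units and opportunities_per_unit must be positive")
    # Scale the defect count first so the division stays exact for round numbers.
    return defects * 1_000_000 / total_opportunities


# Hypothetical sample: 17 defects found in 1,000 units,
# each unit offering 5 opportunities for a defect.
sample_dpmo = dpmo(defects=17, units=1_000, opportunities_per_unit=5)
print(sample_dpmo)  # 3400.0
```

In this made-up sample the process runs at 3,400 DPMO, about a thousand times the Six Sigma target of 3.4.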
While all these terms help carry out these methods in their own way, the fundamental issue remains the same: to improve a work process. This doesn’t mean one method is necessarily better than another. They’re just different, but they all work toward the same goal.

To use an analogy, think of all the diet programs on the market today. Each diet has its own rules and measurement systems, but all of these programs work with the same fundamental issue: the physical law of conservation of energy. Energy cannot be created or destroyed, only transferred or converted from one form to another. Any diet is about moving energy into and out of the body, consuming and burning a certain number of calories (energy) to attain the desired weight loss. That’s it.

The same thinking applies to work-process improvement, including root cause analysis. RCA analyzes what happened, that is, the causes that produced the problem, and is then used to improve the work process, which in turn increases reliability and reduces risk. As another problem is identified, the cycle begins again. It’s analogous to continuous improvement. The most severe problems (“waste” in lean manufacturing or continuous improvement terms) are discovered, and work processes are changed to eliminate them. More problems are then discovered through root cause analysis and eliminated through work process improvement. As in continuous improvement, the cycle never stops. And with every cycle, an operation can increase its reliability and decrease its risk.

Why the Disconnect?

It’s easy to see why many don’t see the link between root cause analysis, risk, and reliability. The misconception occurs because people tend to automatically think that, when a problem occurs, somebody made a mistake. It’s not about the risk and reliability of a work process. The procedure’s fine. It’s about a person not doing as he should. If he followed the procedure, they say, the problem wouldn’t have occurred. They peg everything on the person and hold the procedure blameless.

But human error is not abnormal – nobody is perfect. Why exactly did the person not follow the procedure? What were the steps he took that led to the error, and why did those steps lead to the failure? Something about those steps led to the problem. Root cause analysis uncovers those steps and helps develop new ones to improve the operation—again, with the aim to reduce risk and increase reliability.

The more reliable the process, the more reliably people can perform it. This applies to all work processes, anywhere, no matter the complexity. Ever wonder why a hamburger tastes the same at every fast food restaurant in the same chain across the nation? It’s the same reason it’s so rare for a commercial jet to crash. The processes for making that hamburger and landing that jet keep error rates low. Some processes are so finely tuned that even when an error does occur, it is trapped to prevent a catastrophic loss. Even though the jet aircraft is extremely complex and the pilot’s training is extensive, commercial flights carry extremely low risk and high reliability.

This suggests that a person’s level of skill and pay has little or no bearing on the level of reliability an operation can attain. Think about it. Those hamburgers were put together by people making close to minimum wage, yet the product is incredibly consistent, just as consistent, in fact, as those jet landings. The latter is vastly more complex than the former, but because both have well-developed work processes, they have similar levels of risk and reliability.

Evidence versus Anecdote

Such levels of risk and reliability weren’t reached overnight. They were attained by continuing a cycle of work-process improvement, and at the heart of any work process is the relationship between its specific steps: the causes to produce the desired effects.

Such thinking, though seemingly straightforward, doesn’t come naturally for most, and for good reason. As humans, we’re social creatures, and we live day to day experiencing events that happen one after the other, sequentially. Our lives follow a storyline with one perspective: our own. As we all learned in high school English, a story’s plot follows a line that curves up toward the climax and descends quickly to the denouement. The line describes one or a limited number of causes and effects that lead to the final outcome, or solution. It’s linear. Reality, though, is anything but linear. Every person sees things from a different perspective. As part of root cause analysis, these perspectives uncover the fact that every effect has multiple causes.

Uncovering a system of causes in turn reveals a series of solutions, and each has a certain level of risk and reliability. Consider a broad problem: How do we prevent fatalities from auto accidents? This has many possible solutions, and the combination of solutions chosen determines the level of risk and reliability. Using seat belts greatly reduces the risk of a fatality in an accident, but used alone has only a certain level of reliability. A solution involving both seat belts and air bags increases reliability and reduces risk even further, as do crumple zones designed into the vehicle, driving the speed limit, and defensive driving.

Each potential solution has its own level of risk and reliability when used alone and in combination with others. It’s a complicated picture, and organizations can use many tools to find the best set of solutions. But no matter the tool, they all work toward the same goal: Find the most efficient set of solutions with the highest practical reliability and lowest practical risk.

No set of solutions, of course, can have 100% reliability and 0% risk. At the same time, an organization’s overall goals (those foundational drivers that every member of the organization knows and agrees with) must enter the equation, be they zero injuries or remaining a profitable company. Extremely time-consuming solutions can increase reliability and reduce risk, but the rate of return may diminish and the operation’s efficiency may plummet so much that the organization can’t be profitable. To look at it another way, increasing reliability and reducing risk for one operation (e.g., manufacturing) may decrease reliability and increase risk in other areas (on-time delivery, sales, profits).

Reactive and Proactive Improvement

Organizations use improvement tools to react to a problem or to prevent a future problem. Both reactive and proactive programs deal with risk and reliability, but in different places. For instance, root cause analysis, usually thought of as a reactive tool, doesn’t deal with probabilities during the investigation. The events already happened. An event happened or it didn’t; there’s no going halfway. In RCA, a “possible” cause is simply one that still lacks sufficient evidence. Work process improvement resulting from RCA, however, can change the risk and reliability probabilities of a work process.

Consider the Cause Mapping method of RCA, in which causes and effects literally are mapped out in boxes, starting on the left, with a system of related causes and effects splitting to the right. Possible causes can be placed on the Cause Map with a question mark until the team gathers sufficient evidence. Elements not on the Cause Map are still worth investigating, but they’re kept off the Cause Map until they have sufficient evidence to link them to the problem.
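The structure described above can be sketched as a small tree in Python. This is only an illustrative model, not the Cause Mapping method’s own notation: the `Cause` class, its fields, the `render` helper, and the pump example are all hypothetical. Indentation stands in for the left-to-right layout, and a cause with no recorded evidence is shown with a question mark.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cause:
    """One box on a cause map (hypothetical model for illustration)."""
    description: str
    evidence: List[str] = field(default_factory=list)
    causes: List["Cause"] = field(default_factory=list)  # boxes to the right

    def label(self) -> str:
        # A cause without supporting evidence is only a possible cause.
        return self.description if self.evidence else self.description + "?"


def render(node: Cause, depth: int = 0) -> List[str]:
    """Return the map as indented lines, one per box."""
    lines = ["  " * depth + node.label()]
    for cause in node.causes:
        lines.extend(render(cause, depth + 1))
    return lines


# Hypothetical problem and causes, invented for this sketch.
problem = Cause(
    "Pump stopped",
    evidence=["operator log"],
    causes=[
        Cause(
            "Bearing seized",
            evidence=["inspection report"],
            causes=[Cause("Lubrication interval missed")],
        ),
        Cause("Power supply tripped"),
    ],
)

for line in render(problem):
    print(line)
```

Once the team gathers evidence for a possible cause, appending it to that box’s `evidence` list removes the question mark.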
Probabilities do, of course, apply to future events, and here the Cause Mapping tool can be just as effective, particularly if the map is used in conjunction with tools such as failure modes and effects analysis (FMEA). Performing an FMEA assigns values to potential failure modes, or the ways a machine or system can fail. (In cause-and-effect terms, a failure mode would simply be a negative effect.)

Specifically, the FMEA chart assigns a value (usually 1 to 10, with 10 being the highest) to a problem’s severity, the chance that the problem will occur, and how difficult the problem is to detect. These three values are multiplied to produce a so-called risk priority number, or RPN. The higher the number, the greater the risk posed by that failure (i.e., a negative effect on a Cause Map) and the higher its priority for attention. Organizations can use this tool in conjunction with a proactive Cause Map to help predict and prevent future failures. A number from the FMEA can be placed inside each cause box, combining the visual detail of the Cause Mapping method with the predictive measurements of an FMEA.
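The RPN arithmetic is simple enough to sketch in a few lines of Python. The `FailureMode` class, the failure mode names, and the scores below are hypothetical values invented for illustration; a real FMEA would take its scores from the team’s own rating scales.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of an FMEA chart (hypothetical structure for illustration)."""
    name: str
    severity: int    # 1 to 10, 10 = most severe
    occurrence: int  # 1 to 10, 10 = most likely to occur
    detection: int   # 1 to 10, 10 = hardest to detect

    def __post_init__(self) -> None:
        for label in ("severity", "occurrence", "detection"):
            score = getattr(self, label)
            if not 1 <= score <= 10:
                raise ValueError(f"{label} must be between 1 and 10, got {score}")

    @property
    def rpn(self) -> int:
        # Risk priority number: the product of the three scores.
        return self.severity * self.occurrence * self.detection


# Hypothetical failure modes and scores.
modes = [
    FailureMode("Seal leak", severity=7, occurrence=4, detection=3),
    FailureMode("Bearing seizure", severity=9, occurrence=2, detection=6),
    FailureMode("Gauge drift", severity=3, occurrence=5, detection=8),
]

# Rank from highest to lowest RPN to decide what to address first.
ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for mode in ranked:
    print(f"{mode.name}: RPN {mode.rpn}")
```

With these made-up scores, the gauge drift ranks first (RPN 120) despite its low severity, because it occurs often and is hard to detect. That is one reason the RPN is best read as a prioritization aid alongside the Cause Map rather than as a probability.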

Focus on Principles, Avoid the Program of the Month

Let’s be honest. Even mentioning continuous improvement, plan-do-check-act, reliability-centered maintenance, and all the rest has made many an employee’s eyes glaze over. “Here we go again,” they think. “It’s another program of the month.” Apathy reigns because, in their hearts, they think nothing will change. They have little faith in these programs, but this isn’t to say the programs themselves are at fault.

Consider the storytelling analogy again. Almost any narrative, fiction or nonfiction, focuses on people, and for good reasons. For one, we relate and compare ourselves to others; we admire heroes and abhor villains. Also, focusing on all the possible causes of every problem in a story would simply make terrible prose. Instead, we relate a story linearly, with a straight line of causes and effects occurring sequentially. The trick of any work improvement program is to overcome these urges and instead draw from all perspectives, not just a single narrative, to focus on the specific steps that led up to the problem at hand. It’s about focusing on the why, not the who. Problems have many causes, and because people are of paramount importance, their names (that is, who did what) shouldn’t be part of the analysis.

The investigation can be reactive or proactive. It can use root cause analysis as part of a plan-do-check-act initiative, a lean or continuous improvement initiative, an FMEA, a reliability-centered maintenance program, or myriad other methodologies. In the end, what method to use depends on what best suits an organization’s culture. But no matter the tool, the goal never changes: Analyze the nature of causes and effects to create a better work process, with less risk and better reliability.
