Thinking about Three Sigma

Sten Westgard, MS

In a previous lesson on Six Sigma, we discussed some possible actions to take when the Sigma-metric for a method is higher than Six. But what about those methods with low Sigma-metrics? What do you do when Sigma analysis delivers bad news?

Two Thoughts on Troublesome Performance

1. Does performance match the manufacturer claims, is it real? And is there anything we can do about it?
2. Are we being too hard on ourselves, or is the standard for performance appropriate?
Conclusion


Note: This QC application is an extension of the lesson From Method Validation to Six Sigma: Translating Method Performance Claims into Sigma Metrics. This article assumes that you have read that lesson first, and that you are also familiar with the concepts of QC Design, Method Validation, and Six Sigma. If you aren't, follow the links provided.

What do you do about bad news? It’s a fact that we face in laboratories all the time: staff shortages, budget cuts, unreasonable regulatory demands, and the list goes on. With Sigma-metric analysis, now laboratories face another possible source of bad news: a bad Sigma-metric for a method. In a period of history when so many things seem to be going wrong, is it really time to get a judgment on method performance?

Earlier, we took a look at how to handle extremely good news - what to do if you find out your methods are performing at Six Sigma or better. This time, we’re going to look at what to do about bad news from Sigma-metric analysis. If the Sigma-metric is below three Sigma - which in other industries is considered the minimum level of acceptable performance - what’s a laboratory to do?

This discussion will tackle two approaches to dealing with the news. The first approach is to consider the laboratory, the method, and other aspects of the data. The second approach is to consider the standard itself. It could be that either of these approaches may find a “solution” to a poor Sigma-metric rating. But, be warned, it’s also entirely possible that after looking at all the variables, options, and remedies, the laboratory still has a poor method.

1. Does performance match the manufacturer claims, is it real? And is there anything we can do about it?

When looking at the Sigma-metric, there are really just two factors over which you have control: precision and accuracy. So obviously, if the method isn’t performing well, the precision or the accuracy or possibly both are not adequate to the task. The next question to pose is, What can be done about that?

First, is this performance consistent with the manufacturer claims?

That is, if you’re experiencing a CV of 5%, for example, is that what the manufacturer claimed the performance was going to be? Or, if you’ve determined you’ve got a 10% bias from the comparative method, is that what the manufacturer stated the bias would be? If you’ve got poor performance, but the manufacturer stated that’s what performance was going to be, then you can stop asking questions (and possibly start kicking yourself for buying the instrument). Remember, the manufacturer is only required to make performance claims and meet those claims - the FDA does not require that the manufacturer make performance claims at a specific level of quality. If, however, the performance experienced by your method is not the same as what the manufacturer claimed - if your CV is significantly higher than advertised, or your bias against the manufacturer’s claimed comparative method is significantly higher - then you have a real case for manufacturer remedy. The manufacturer needs to fix the method to achieve the performance as listed in the claims. (Note, the bias claim may be harder to pin on the manufacturer; if you chose a different comparative method in your method validation comparison study, the manufacturer can’t be held responsible for a bias estimate different from the claim; your comparison study must match the manufacturer’s comparison study if you want to have your bias match their bias claim.)

If the performance doesn’t match the claims, that’s the easiest scenario, because you don’t have to investigate further. You simply contact the manufacturer and ask them to correct the method performance.

Second, is this performance real?

If you’re basing your Sigma-metric on a very small sample of data, either for bias or imprecision, it’s possible that you’re about to do yourself a world of harm. If there isn’t enough data, or if it’s the wrong data, you could be heading down a path of trouble-shooting, tech support, and frustration for nothing. If you’ve gotten an estimate of bad performance based on some fluke or outlier, that’s no reason to throw out the method. You need to make sure that the performance you’re seeing is consistent.

There are two situations to consider: (1) method validation studies, where the data come from carefully designed experiments to estimate errors, and (2) routine operation, where the data come from the results of normal operation (QC values, PT events, peer group reports, etc.).

With the first scenario, you’re relying on smaller sets of data, but the careful design of the studies is meant to protect you from getting unreliable estimates of error. If the studies are conducted or supplied by the manufacturer, the burden of providing reliable estimates is on them. If their studies are producing estimates that differ from their claims, as we said earlier, they’ve got to do something to bring the performance back in line.

If you’re getting undesirable performance from routine performance data, however, you want to make sure that what you’re seeing isn’t a temporary anomaly. One of the key resources to help you answer this question is going to be your interlaboratory comparison report. If you have a peer group report, or proficiency testing results, you can check to see if the bias (inaccuracy) is occurring consistently - and if it’s occurring just for your laboratory or for all laboratories with your same instrument. If that bias persists for multiple reports, PT events, and/or months, you know this is not a fluke. It’s a real problem.
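As a rough sketch of that consistency check, the short Python example below computes the percent bias against the peer group mean for each PT event or peer report, then asks whether the bias always points the same way. The function names and the numbers are hypothetical, chosen only for illustration; they are not taken from any real survey, and the "same direction every time" test is a simple heuristic, not a formal statistical rule.

```python
from statistics import mean

def percent_bias(lab_result, group_mean):
    """Percent difference between the lab's result and the peer group mean."""
    return 100.0 * (lab_result - group_mean) / group_mean

def summarize_bias(events):
    """events: list of (lab_result, peer_group_mean) pairs, one per PT event.

    Returns the average percent bias and whether every event shows
    bias in the same direction (a hint that the bias is not a fluke).
    """
    biases = [percent_bias(lab, grp) for lab, grp in events]
    same_direction = all(b > 0 for b in biases) or all(b < 0 for b in biases)
    return mean(biases), same_direction

# Hypothetical results from three PT events (illustrative values only)
events = [(105.0, 100.0), (208.0, 200.0), (51.5, 50.0)]
average_bias, persistent = summarize_bias(events)
print(f"average bias: {average_bias:.1f}%, same direction every time: {persistent}")
```

A bias that keeps the same sign across several events is worth taking seriously; one that flips back and forth around zero looks more like random variation.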

If you’ve reached the point where you’ve established that the performance characteristics/problems are real, and that the performance is nevertheless within the claims of the manufacturer (i.e. you’re on your own), then it’s time to determine whether or not you can do something about it.

What’s the biggest problem?

It’s helpful to look at the two performance variables of the Sigma-metric and try to determine which one is the worst - or which one is the variable the laboratory can improve. Is bias the major problem? Or is it imprecision?

One simple way to explore these questions is to assume bias is zero and run the Sigma-metric calculation. Then assume imprecision is a CV of 1.0% (or some other number lower than your observed CV) and run the calculation again. Does eliminating bias give a better Sigma-metric? How much of a change in precision is necessary to get a more desirable Sigma-metric?
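Here is a minimal sketch of that what-if exercise in Python, using the usual Sigma-metric equation, Sigma = (TEa - |bias|) / CV, with everything expressed in percent. The function names and the example figures (TEa of 10%, bias of 2%, CV of 4%) are assumptions made up for illustration.

```python
def sigma_metric(tea_pct, bias_pct, cv_pct):
    """Sigma-metric = (TEa - |bias|) / CV, all terms in percent."""
    if cv_pct <= 0:
        raise ValueError("CV must be positive")
    return (tea_pct - abs(bias_pct)) / cv_pct

def cv_needed(tea_pct, bias_pct, target_sigma):
    """CV (in percent) required to reach a target Sigma at the current bias."""
    if target_sigma <= 0:
        raise ValueError("target Sigma must be positive")
    return (tea_pct - abs(bias_pct)) / target_sigma

# Hypothetical method: TEa 10%, bias 2%, CV 4%
tea, bias, cv = 10.0, 2.0, 4.0

observed = sigma_metric(tea, bias, cv)   # 2.0
no_bias = sigma_metric(tea, 0.0, cv)     # 2.5, bias assumed to be zero
cv_one = sigma_metric(tea, bias, 1.0)    # 8.0, CV assumed to be 1.0%

print(f"observed: {observed:.1f}  zero bias: {no_bias:.1f}  CV of 1.0%: {cv_one:.1f}")
print(f"CV needed for 3 Sigma at current bias: {cv_needed(tea, bias, 3.0):.2f}%")
```

In this made-up case, eliminating bias entirely only moves the method from 2.0 to 2.5 Sigma, so imprecision is where the improvement would have to come from.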

Several years ago, Dr. David Parry introduced a Quality Goal Index, which allows you to assess the relative impact of imprecision and inaccuracy. This simple tool can help bring into focus which characteristic may be the problem.
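The lesson doesn’t spell out the calculation, so the sketch below relies on the form in which the Quality Goal Index is commonly cited: QGI = bias / (1.5 x CV), with a ratio below 0.8 pointing to imprecision, above 1.2 pointing to inaccuracy, and anything in between pointing to both. Treat the formula, the cutoffs, and the function names as assumptions to verify against Dr. Parry’s original description before relying on them.

```python
def quality_goal_index(bias_pct, cv_pct):
    """QGI = |bias| / (1.5 * CV), as commonly cited (assumed form)."""
    if cv_pct <= 0:
        raise ValueError("CV must be positive")
    return abs(bias_pct) / (1.5 * cv_pct)

def likely_problem(qgi):
    """Interpret the QGI using the commonly cited cutoffs (assumed)."""
    if qgi < 0.8:
        return "imprecision"
    if qgi > 1.2:
        return "inaccuracy"
    return "imprecision and inaccuracy"

# Hypothetical method: bias 2%, CV 4%
qgi = quality_goal_index(2.0, 4.0)
print(f"QGI = {qgi:.2f}, main problem: {likely_problem(qgi)}")
```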

Is it possible to reduce bias?

Bias is a tricky thing. Bias against what? When you calculate bias, are you calculating the difference between your method and a true reference method, a recognized standard method (like a gold standard or a NIST certified method), or is it just the last method you had in your laboratory? Are you comparing your method to a peer group or a proficiency testing group? Or are you doing a round-robin test with other labs?

Sometimes, the easiest bias calculations may not be the most useful ones. The comparison of methods study is typically done between a new method and the old method that is being replaced. While that may be practical in the short term - both methods are right at hand and the bias between the two will be felt immediately - eventually the new method is operating by itself and needs to be compared to a test that has more relevance.

For some tests, the method “principles of measurement” themselves are different, so results are not wholly comparable. This discussion leads us into the great debates on standardization and harmonization, arguments which lasted for years for some analytes and continue to rage for many more. We’re not going to attempt to summarize or duplicate those arguments here. In the long run, it would be nice if all laboratory methods were standardized or harmonized so the results were on the same playing field. But in the short run, the laboratory might need to come up with a more immediate solution.

Ultimately, the laboratory director or some authority of the laboratory has to make a decision on what is considered the comparison method. In the absence of a reference method, there is no perfect answer. This is another area where professional judgment and leadership come into play. Make a choice based on the best data available. Then calculate bias and act accordingly.

If all else fails, particularly if no information on bias is available or plausible reasons exist to exclude the bias that can be calculated, assume bias to be zero. Then, as soon as practically possible, find a way to obtain a more realistic bias estimate.

Once you’ve settled on an estimate of bias, there are common ways to improve it. Recalibration is the chief corrective action. It has attendant risks, particularly if you recalibrate without adjusting means, ranges, and other characteristics of the test.

Is it possible to reduce imprecision?

Unlike bias, imprecision is more straightforward to determine, and determining it is entirely within the laboratory’s control. Analysis of the routine daily QC data can easily give you a definitive estimate for imprecision.

However, while it’s easy to determine the CV, it’s more difficult to reduce it.

For most of today’s highly automated instruments, there are few technical aspects left to the customer’s discretion. The engineering is so complex, no one on the bench level is going to have the ability to make modifications that improve performance.

If you have poor CV on a method that isn’t automated, then standardization of all the steps of the process, better training and skills by the operators, etc., will have an impact on imprecision. But for the big boxes, the laboratory has little control over the internal processes. It may be that the instrument manufacturer can send a field service technician to fix, improve, or tweak performance. However, it’s probably unwise to place all of your faith in the power of technical service to improve performance. Field technicians don’t come with magic wands.

One obvious but overlooked technique to reduce imprecision is replicate measures. Duplicate measurements - when the results are averaged - do reduce imprecision significantly. An average of three measurements reduces imprecision even more. Granted, doubling or tripling the specimen volume for a testing process is highly impractical for most laboratories. The cost is too high in volume and resources. However, with some instruments allowing customers to choose cost-per-reportable instead of cost-per-test as a method of payment, these replicate measurements may be feasible for some laboratories. On a cost-per-reportable basis, the laboratory pays no additional cost for duplicate measurements; the instrument manufacturer must absorb that cost. This arrangement has the built-in benefit of providing an incentive for the manufacturer to improve method performance.
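To put rough numbers on the benefit of replicates, the sketch below uses the standard result that the CV of the mean of n independent measurements is the single-measurement CV divided by the square root of n. It assumes the replicates are independent and that averaging leaves bias unchanged; replicates run back-to-back only shrink the within-run part of the variation, so the real gain may be smaller. The function names and example figures are hypothetical.

```python
import math

def cv_of_mean(cv_pct, n_replicates):
    """CV of the average of n independent replicates: CV / sqrt(n)."""
    if n_replicates < 1:
        raise ValueError("need at least one replicate")
    return cv_pct / math.sqrt(n_replicates)

def sigma_metric(tea_pct, bias_pct, cv_pct):
    """Sigma-metric = (TEa - |bias|) / CV, all terms in percent."""
    return (tea_pct - abs(bias_pct)) / cv_pct

# Hypothetical method: TEa 10%, bias 2%, single-measurement CV 4%
tea, bias, cv = 10.0, 2.0, 4.0
for n in (1, 2, 3):
    effective_cv = cv_of_mean(cv, n)
    sigma = sigma_metric(tea, bias, effective_cv)
    print(f"n = {n}: effective CV {effective_cv:.2f}%, Sigma {sigma:.2f}")
```

Note the diminishing returns: the second replicate cuts the CV by about 29%, while the third only brings the total reduction to about 42%.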

At the end of the day, you may still find that

the performance data - and the problem - are real,
the performance is within the expected range stated by the manufacturer,
there is no practical way to reduce the bias or CV.

That is, the performance is what it is. What then?

Well, there’s one last big question to ask. This time it’s not about performance, but about the standard of quality we are trying to achieve.

2. Are we being too hard on ourselves, or is the standard for performance appropriate?

If we find our methods falling short of the goal, it may be worth asking, is it the right goal line?

Many of the quality requirements we apply to methods are not necessarily the most scientific or rational. Take for instance the CLIA requirements. Most of those requirements are suspiciously round - 10%, 15%, 20%, 25%, 30%, etc. It’s a clue that those numbers were arrived at by consensus (guessing). In contrast, when you look at the Desirable Specifications for allowable error generated by Carmen Ricos and colleagues - quality requirements calculated from studies of observed within-subject biologic variation - you won’t find many round numbers at all.

Remember, too, there is a hierarchy of quality requirements - as established in the 1999 Stockholm consensus conference - and the specifications set by PT groups are 4th or 5th (if the specifications are based on “state of the art,” which is essentially what CLIA limits were) out of 5. So the CLIA requirements, while they are the most well-known and commonly-applied, are at the bottom of the quality pyramid. Unfortunately, for US labs, they have the weight of law. Even if a CLIA quality requirement may not be scientific or appropriate, US laboratories still need to meet them.

But since we now have so many non-regulated analytes (including such heavily-tested analytes as HbA1c), and many laboratories are not governed by CLIA, it is possible to evaluate and select more appropriate quality requirements. One of the most useful new sources of quality requirements is the Desirable Specifications for total error, imprecision and bias, derived from Biologic Variation, a database that is updated every two years by Dr. Ricos and colleagues.

To change a quality requirement may make you uncomfortable, like a bending of the rules, or the equivalent of “playing the ref” instead of playing the game. But if there is a demonstrable benefit to choosing a different quality requirement, and if the choice moves you up in the hierarchy of requirements, it’s entirely valid. If a method is not able to reach an acceptable Sigma-metric based on the PT analytical requirement, but can achieve a better Sigma-metric when a biologic total allowable error is calculated, that is fine. Indeed, since the choice of a biologic total allowable error is higher in the Stockholm hierarchy, one can argue this is an improvement.

At the top of the quality requirement hierarchy are specifications based on an evaluation of the effect of analytical performance on clinical outcomes in specific clinical settings - which includes such things as clinical decision intervals, clinical pathways, or treatment guidelines, “evidence-based” quality requirements, if you will. Basically, if you can define how a clinician uses a test, what levels are important, and what intervals are considered medically distinct, you can “reverse-engineer” the quality required by the test.

A clinical QC planning/design model exists to quantify the performance, taking into account such pre-analytical factors as within-subject variation, number of specimens, number of samples, and matrix bias. This QC Design model can assess the ability of QC procedures to detect critical-errors (errors that cause medically important changes). Note that the clinical QC Design planning model takes you beyond the Sigma-metric equation; you won’t be able to quantify performance on a Sigma-scale if you widen the scope of the process to include so many pre-analytic and analytic variables. At that point, you’ll be able to choose an appropriate QC procedure, but you won’t be able to characterize performance on the Sigma-scale.

Of course, it should not be assumed that the further up the quality specification hierarchy you go, the larger the requirement (and easier to achieve). To the contrary, often the clinical quality requirements are harder to achieve, since within-subject biologic variation must be included and frequently consumes a huge share of the error budget. Going up the hierarchy may not be a solution - you may even find judging your method against “better” quality requirements means even lower Sigma-metrics.

Nevertheless, the ability to choose different quality requirements is not a magic bullet. Despite your best efforts, it’s likely that a few methods in your laboratory will not achieve the performance you want. Even though you’ve done all you can to improve CV or bias (if it’s possible at all), and you’ve spent time selecting the most appropriate quality requirement, you will still find you’re stuck with some methods that aren’t performing well. What then?

At that point, it’s probably worthwhile to ask, are there any methods on the market that can achieve the chosen quality requirement? Is there any quality requirement that the method(s) can achieve? It’s possible in this scenario that technology has not caught up with clinical use and need. If doctors are making decisions based on an assumption of method precision that doesn’t exist, essentially that means that decisions are being affected by variation. It might be worth informing the clinicians that they are assuming better performance than the laboratory method can deliver - and that they should change their decision-making process to account for the actual performance of the test. (You can easily work the Sigma-metric equation in the other direction, calculating a feasible quality requirement based on current method performance and an assumed Sigma-metric.) If you are able to adjust the quality requirement based on performance, then a test with poor performance should be interpreted with wider clinical decision intervals. If the clinicians make decisions based on a larger change in the results, that reduces the influence of bias and variation. Admittedly, it’s unlikely that clinicians are going to change their behavior based on laboratory advice - it’s far more likely that the clinician will shoot the messenger and blame the laboratory for poor performance.
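Working the Sigma-metric equation in the other direction can be sketched as follows: rearranging Sigma = (TEa - |bias|) / CV gives TEa = |bias| + Sigma x CV. The function name and the example figures are assumptions for illustration only.

```python
def feasible_tea(bias_pct, cv_pct, assumed_sigma):
    """Quality requirement (TEa, in percent) the method could meet
    at an assumed Sigma-metric: TEa = |bias| + Sigma * CV."""
    if cv_pct <= 0 or assumed_sigma <= 0:
        raise ValueError("CV and Sigma must be positive")
    return abs(bias_pct) + assumed_sigma * cv_pct

# Hypothetical method: bias 2%, CV 4%
for sigma in (3.0, 4.0, 6.0):
    print(f"{sigma:.0f} Sigma would require a TEa of {feasible_tea(2.0, 4.0, sigma):.1f}%")
```

With these made-up numbers, even three Sigma performance would require clinicians to tolerate a total error of 14%, which gives you a concrete figure to bring to that conversation.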

Conclusion

Assessing method performance on a Sigma-scale can be distressing when it determines that a method is suboptimal. This lesson has listed a number of approaches to determine if there is a better way to assess performance, improve performance or apply a more appropriate quality requirement - in an attempt to reach a better Sigma-metric.

In some ways, this is the Kübler-Ross model of handling poor laboratory performance. First, we deny (maybe the data is wrong). Second, we have anger (maybe the method isn’t working as advertised - get the manufacturer to fix it). Next, we bargain (are there changes we can make to improve performance, are there any scenarios where this performance is adequate?). Then comes depression (we may have to tell the doctors to accept less precision and accuracy from our results). Finally, after we’ve exhausted all the options and still we can’t find a way to improve the Sigma-metric, we come to acceptance. We have a poor laboratory method.

Ultimately, any laboratory is going to have some methods that don’t achieve the desirable Sigma-metric performance. It’s a fact of life - not every method in every laboratory is going to be wonderful. With test methodology constantly changing and (we hope) improving, that situation may not have to last long. But until a better method is implemented, the laboratory should choose an appropriate Total QC strategy to provide better non-statistical QC for the troublesome method. Knowing which methods need better care and handling is a valuable piece of information in itself. That knowledge may shape the future capital purchase decisions for the laboratory.

Contributions from David Plaut, MS, and James O. Westgard, PhD
