What about Bias?

The issue of bias in analytical measurements generates a lot of debate. Existential debates (does bias exist? should it?) are often mixed with more practical debates (what's the best way to calculate bias?). Here's a description of the different kinds of bias that (might?) exist in the laboratory.

January 2010

Does Bias Exist? Should we pay attention to it?
What kinds of bias exist?
So, if Bias Exists, when do we assess it?
The Commutability Conundrum and the Matrix Effect
Can't we just forget about Bias?

"The combination of imprecision and bias into a single parameter appears to simplify daily quality assurance. However, no practical benefit of such a combination in daily quality assurance has been demonstrated in comparison with the well-established separate checks for both errors. Laboratorians are used to thinking in terms of imprecision and bias separately. It has been postulated that clinicians may favor combination models. Many clinicians are well aware that laboratory results vary, but they assume that bias can be neglected. They are not used to combining both errors in any model. Therefore, combination models are probably also of no benefit to clinicians."
Benefits of combining bias and imprecision in quality assurance of clinical chemical procedures, Rainer Haeckel and Werner Wosniok, J Lab Med 2007;31(2):87–89

"A majority of the methods used in thyroid function testing have biases that limit their clinical utility. Traditional proficiency testing materials do not adequately reflect these biases."
Analytic Bias of Thyroid Function Tests: Analysis of a College of American Pathologists Fresh Frozen Serum Pool by 3900 Clinical Laboratories. Bernard W. Steele, MD; Edward Wang, PhD; George G. Klee, MD, PhD; Linda M. Thienpont, PhD; Steven J. Soldin, PhD; Lori J. Sokoll, PhD; William E. Winter, MD; Susan A. Fuhrman, MD; Ronald J. Elin, MD, PhD. Arch Pathol Lab Med. 2005;129:310–317.

"Analytic bias caused by assay differences and reagent variations can cause major problems for clinicians trying to interpret the tests results."
Clinical interpretation of reference intervals and reference limits. A plea for assay harmonization. George Klee, Clin Chem Lab Med 2004: 42(7):752-757

"That's the news from Lake Wobegon, where all the women are strong, all the men are good-looking, and all the children are above average."
Garrison Keillor, Prairie Home Companion

In the United States, bias is always a hot issue, particularly in media and politics. "Bias" is the typical accusation thrown by supporters of the Political Party of the "Buffalo" when a report in the media comes out that they believe is somehow favorable to the Political Party of the "Fox". Likewise, if a media outlet criticizes a policy or person associated with "Buffalos", the Buffalos cry foul. Both "Buffalos" and "Foxes" allege that different media outlets, journals, or research groups are biased in favor of their opponents. As a result, networks, newspapers and journalists are vilified by one side or another or frequently both. Objective truth - whether a policy is actually good for the country, or whether a politician has told the truth or lied, for example - is often lost in the finger-pointing.

Back in the laboratory, the fight over bias is not quite as contentious, although at times it seems the conversation is almost as lively. There is both an existential debate (does bias exist? should we allow it to exist when we detect it? should we incorporate bias into our calculations?) and a practical concern (what's the best way to determine bias? what is the "truth" against which we determine our bias?). Often, one part of the argument overshadows the other part. As we argue about whether or not bias should be incorporated into our models and calculations, we may forget to discuss or even consider the best way to practically calculate bias.

Does Bias exist? Should we pay attention to it?

The question of whether or not bias exists has been covered elsewhere in the literature and on this website and the blog. But if you want a quick recap: the ISO GUM model (Guide to the Expression of Uncertainty in Measurement) asserts that Measurement Uncertainty (MU) is the best expression of performance by laboratory tests - and this expression does not include bias. Bias, therefore, should be eliminated whenever found, so that Measurement Uncertainty can be calculated. Attempts have been made since the original formulation of Measurement Uncertainty to include and account for bias [for example, Quality assessment of quantitative analytical results in laboratory medicine by root mean square of measurement deviation, Rainer Macdonald, J Lab Med 2006;30(3):111-117], but these attempts have been found wanting [Calculation of Measurement Uncertainty - Why Bias Should Be Treated Separately, Linda M. Thienpont, Clinical Chemistry 54: 1587-1588, 2008; Letter to the Editor: Benefits of combining bias and imprecision in quality assurance of clinical chemical procedures, Rainer Haeckel and Werner Wosniok, J Lab Med 2007;31(2):87-89].

On the other side of the debate, the Total Error model acknowledges the existence of bias and includes it in the calculations. The Total Error model agrees with the Measurement Uncertainty model that bias should be eliminated where possible, but is not dogmatic on this point. [On the practical side, recommendations for calculating Total Error include an assumption of zero bias when data is not available for this quantity.] And if bias really is zero, the estimates for Total Error and Measurement Uncertainty converge.
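
To make that convergence concrete, here is a minimal sketch in Python. It assumes one commonly used form of the Total Error calculation, TE = |bias| + 2 x SD, and it treats imprecision as the only component of uncertainty. The multiplier of 2, the coverage factor of 2, and the function names are assumptions for illustration; none of them are specified in this essay.

```python
def total_error(bias, sd, z=2.0):
    """Total Error as |bias| + z * SD (z = 2 is an assumed multiplier)."""
    return abs(bias) + z * sd


def expanded_uncertainty(sd, k=2.0):
    """Expanded uncertainty from imprecision only, U = k * SD.

    Assumes bias has already been eliminated, as the MU model expects.
    """
    return k * sd


# Hypothetical method: SD of 2.0 units, with and without a bias of 3.0 units.
te_biased = total_error(bias=3.0, sd=2.0)
te_unbiased = total_error(bias=0.0, sd=2.0)
u = expanded_uncertainty(sd=2.0)
```

With a bias of zero the two estimates are the same number; with a nonzero bias, only the Total Error estimate grows.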

In other words, Measurement Uncertainty is biased against bias, but Total Error is not.

Here's an example of bias in the real world. In 2009, Quest Diagnostics and its subsidiary Nichols Institute Diagnostics agreed to pay $302,000,000 because of bias (a $40,000,000 criminal fine plus $262,000,000 as a civil settlement under the False Claims Act). It was found that the Nichols Advantage Chemiluminescence Intact Parathyroid Hormone Immunoassay "provided inaccurate and unreliable results" and that during "periods of time...provided elevated results." These results caused "some medical providers to submit false claims for reimbursement to federal health programs for unnecessary treatments." In other words, a high bias on this test led to unnecessary operations. That's the real-world impact of bias.
[Quest Diagnostics to Pay U.S. $302 Million to Resolve Allegations That a Subsidiary Sold Misbranded Test Kits, Department of Justice Press Release, April 15, 2009. http://www.usdoj.gov/opa/pr/2009/April/09-civ-350.html]

For those who still contend that bias doesn't exist, because everywhere it's detected a correction is made to eliminate it, there's no need to read further. For those who suspect bias does exist, does affect the laboratory and cannot always be eliminated, read on.

What kinds of bias exist?

Just because we've decided to acknowledge the existence of bias doesn't make life any easier. The harder question is how to measure bias. Since bias is a relative term - you measure it against something else - you have to decide, What is the standard?

There are many possible biases, including, to name just a few:

* Bias from reference material or reference method
* Bias from the all-method mean of a PT or EQA survey
* Bias from the mean of a peer group
* Bias from a comparison method
* Bias between identical instruments in the same laboratory
* Bias between reagent lots

Bias from a reference material, reference method, or standard

For some analytes, there is a gold standard (or reference) method or material. There is, in other words, a "true" value that should be achieved by all methods. To get to this true value, and relate your laboratory method to it, you must enter the world of Metrology.

"Metrology has been very good about identifying reference methods and reference materials and putting together a formal traceability chain so that you can tie your kit calibrator in your clinical lab back to a reference material and a reference method that are internationally recognized...The whole idea is that you can then come close to scientific truth rather than a test result that is a relative truth."
David Armbruster, quoted in The Pursuit of Traceability, Bill Malone, Clinical Laboratory News, October 2009, cover story

When you calculate bias against a reference method and/or reference material, you're figuring out a "true" bias, one that is scientifically true rather than just relatively true. With a true bias, you know you are not getting the true answer. With a relative bias, you only know that you aren't getting the same answer as everyone else.

Bias calculated from PT or EQA

One of the routine ways to determine bias is to compare the results of your laboratory against those of other laboratories through proficiency testing (PT), which is sometimes known as external quality assurance (EQA). Typically, a sample is sent out to all laboratories in the program, all laboratories run the sample and report the result, then the program tabulates the results and issues a report back to the labs. Each report typically states the difference (or bias) between the individual laboratory's result and that of the PT/EQA group method mean. Given that information, each individual laboratory is supposed to decide if the bias is significant and warrants a correction, adjustment, or calibration on their part.
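
The arithmetic behind that reported difference is simple. Here is a minimal Python sketch; the function name and the example numbers are hypothetical, and a real PT/EQA report may express the difference in other ways as well.

```python
def pt_bias(lab_result, group_mean):
    """Return the bias of one PT/EQA result in units and as a percentage.

    The bias is relative to whatever group mean the program reports,
    so it is only as "true" as that mean.
    """
    difference = lab_result - group_mean
    percent = 100.0 * difference / group_mean
    return difference, percent


# Hypothetical survey sample: the lab reports 208, the group mean is 200.
difference, percent = pt_bias(208.0, 200.0)
```

Here the laboratory is 8 units, or 4%, above the group. Whether that is significant is the judgment the laboratory still has to make.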

For some analytes, the PT/EQA program uses reference methods and/or reference materials, so the report includes a definitive value for the sample. This means that all labs should get a specific result, and every increment away from that result is considered bias. If you determine bias against a value assigned by a reference method or reference material, you are measuring the difference between your result and the "true" result.

For many other analytes, no reference methods exist - or, even though a reference method may be available, the PT/EQA group might not run the sample using it - so there is no definitive value reported for the sample. Instead, there is only the "all-method" mean reported. There are other terms for this mean, and sometimes it is simply called the group mean. But the essential meaning is that this mean is the average (albeit trimmed in some way) of all the different laboratory results. In other words, the mean is close to the answer that most of the laboratories reported. This doesn't mean that the answer is the "true" answer, because all the laboratory methods could be biased in the same direction (revisit the concept of "precise but not accurate"). Here, if you determine bias using an all-method mean, you are measuring the difference between your laboratory result and the results that most of the other laboratories got.
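
To show what "trimmed in some way" can look like, here is a minimal Python sketch of one simple trimming rule: drop a fixed fraction of results from each tail, then average what is left. The 10% fraction and the data are assumptions for illustration; PT/EQA programs use their own outlier rules, which may differ.

```python
def trimmed_mean(values, trim_fraction=0.1):
    """Average the values after dropping a fraction from each tail.

    trim_fraction=0.1 is an assumed setting, not a rule taken from
    any particular PT/EQA program.
    """
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("trim_fraction removes every value")
    return sum(kept) / len(kept)


# Hypothetical results from ten laboratories, one of them far off.
results = [98, 99, 100, 100, 101, 101, 102, 102, 103, 150]
group_mean = trimmed_mean(results)           # lowest and highest dropped
untrimmed_mean = trimmed_mean(results, 0.0)  # plain average
```

Trimming keeps the one wild result from dragging the group mean upward. It does nothing about a bias shared by all the laboratories, which is why the group mean is still not a "true" value.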

Bias from a peer group

The next possible way to measure bias is through a peer group. This is very similar to participating in PT or an EQA program, except all the participants in the testing event are, well, peers. A peer group typically is a group of the same instruments using the same controls and/or reagents. So the answers each laboratory obtains should be much closer to each other. Now, again, while there should be smaller differences between participants, the peer group mean is not a "true" mean like you get with a reference method and/or reference material. Peer group means are, in effect, all-method means for a single method. You have more confidence that your bias exists, because if all your peers diverge from your value, there must be something going on (you can't blame the difference on different methods or materials anymore). If the peer group is using a reference material or including a reference method measurement with the results, the value of the report is improved. Still, in the absence of additional information, peer group reports are better, but they cannot tell you if you have a "true" bias.

Bias from a comparative method

Part of the method validation process includes a comparison of methods study, typically done between the new method that has just been purchased and the old method which is being replaced. Note the difference between "comparative" and "reference" method. The comparative method is only a relative comparison; there is no claim to scientific truth here. It could be that the old method was more scientifically true while the new method is less scientifically true, so a new relative bias exists in the wrong direction.

In a sense, any bias determined by a comparison study is still quite real - because test results that span the switch-over to the new method will be shifted up or down even when there is only a relative difference between the new and old method. A patient receiving care before and after the switch could see a rise or fall in their test results, resulting, in the worst case, in misdiagnosis and inappropriate treatment. So this is "real" bias - even if it isn't "true" bias.
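
One simple way to put a number on that relative bias is the average difference between paired results on the same patient samples. The Python sketch below assumes that approach; the function name and the four data pairs are hypothetical, and a real comparison study would use many more samples and often a regression analysis instead.

```python
from statistics import mean, stdev


def paired_bias(new_results, old_results):
    """Mean and SD of the paired differences (new method minus old method)."""
    if len(new_results) != len(old_results):
        raise ValueError("each sample needs a result from both methods")
    differences = [n - o for n, o in zip(new_results, old_results)]
    return mean(differences), stdev(differences)


# Hypothetical patient samples run on both methods.
old_method = [50.0, 100.0, 150.0, 200.0]
new_method = [52.0, 103.0, 153.0, 204.0]
bias, sd_of_differences = paired_bias(new_method, old_method)
```

In this made-up data the new method reads about 3 units higher than the old one. That is the shift a patient would see across the switch-over, whichever method is closer to the truth.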

Bias between identical methods/instruments in the same laboratory

In large health systems, laboratory testing volumes have grown to the point where it's possible that multiple big box analyzers reside in the same laboratory. But even when the same instrument is used, with the same lot of reagents, the same calibrators, and the same lot of quality controls, two "identical" instruments won't be. That is, each instrument will have its own performance, and the same sample run on instrument A will have a different result than when it is run on instrument B. Since patients within the health system can't control on which instrument their samples will run, this is a bias that they undoubtedly will experience.

The question is, how big is that bias? Two identical instruments within the same laboratory are the ultimate peer group, and it should be easy to determine the nature and extent of any bias. It falls to the laboratory professionals to determine the bias between the two instruments and, once that bias is calculated, they must make a judgment on whether or not that bias will impact patient care. If the laboratory decides that a medically important between-instrument bias exists, they then must take some action to account for this bias in reports or they must eliminate the bias in some way.

If you think this is not a big problem, step out of the laboratory for the moment and head for the near-patient testing environment: this same challenge happens writ large with point-of-care (POC) devices. With hundreds if not thousands of different operators and dozens or possibly hundreds of the same POC device, how does a health system ensure that POC device A-1 delivers the same test results as POC device A-34? The US accreditation and regulatory systems have shrunk from the implications of this problem, and health systems and diagnostic manufacturers blanch at the costs that would be required to truly monitor performance of these devices.

Bias between reagent lots (and control lots)

Even if you decided to isolate your instrument and method from the rest of the testing world - using only one instrument in your laboratory, exclusively, without referencing any outside results - you still cannot escape bias. Why? Because you make changes to your instrument periodically and the method itself changes over time. Sometimes this is expressed as growing (or optimistically, declining) imprecision. Other changes occur more distinctly.

Take reagent switches. When you bring in a new lot of reagents, those materials are not the same as the old lot. Manufacturers take great pains to make them as close to identical as possible, but there will always be differences. Everyone hopes that these are small differences. As with the two- or multiple-instrument problem above, laboratories have to identify, calculate, and make a judgment about any difference. Re-calibration usually takes care of the issues with reagent switches, but laboratories need to monitor QC carefully after a reagent switch. In fact, changing reagents is an event that may trigger a run of extra controls.

A shift in values for control lots is also common - but it's not really a bias. Just as with the reagents, the controls can't be manufactured perfectly, so there is always a slightly different mean for each control lot. But this shift is one of the easiest to correct. Good Laboratory Practice (and CLSI guideline C24-A3) recommends a crossover period between the old and new control lots of several weeks to several days, depending on the stability of the control lots (hematology controls have limited stability, so the crossover period may only be a few days). In this way, a laboratory can phase in the new control lot, characterizing the performance of the new lot and providing a comparison with the old lot.
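
Characterizing the new lot during the crossover comes down to a mean, an SD, and a comparison with the old lot. Here is a minimal Python sketch; the function name and the five crossover results are hypothetical, and a real crossover would collect more measurements than this.

```python
from statistics import mean, stdev


def characterize_new_lot(new_lot_results, old_lot_mean):
    """Mean and SD of the new control lot, plus its shift from the old lot mean."""
    new_mean = mean(new_lot_results)
    new_sd = stdev(new_lot_results)
    shift = new_mean - old_lot_mean
    return new_mean, new_sd, shift


# Hypothetical crossover data: the new lot run alongside an old lot with mean 100.
crossover_results = [101.0, 103.0, 102.0, 104.0, 100.0]
new_mean, new_sd, shift = characterize_new_lot(crossover_results, old_lot_mean=100.0)
```

The shift of 2 units here belongs to the control material, not to the method, which is why the new lot gets its own mean and limits rather than a correction to patient results.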

So, Bias Exists. Now, when do we assess it? Do we determine bias at a specific time? At a specific level?

In addition to choosing the source or reference for comparison in the estimation of bias, laboratories also have to answer a question of scale, or time. Over what timeframe does a laboratory want to determine its bias?

With a method validation study, for instance a comparison of methods study, the bias calculations represent a specific window of time (the duration of the study). The bias calculation is valuable for that specific time period, but after that, the value diminishes as more time elapses and the instrument, method, and laboratory staff change.

Likewise, the results of a PT or EQA event are quite specific and few in number. In the US, PT events may happen only two or three times per year, and involve between two and five samples per event. Using just a handful of data points to determine bias may not inspire confidence in the calculations. For a longer, broader view of bias, you may want to average a number of events and samples together. Outside the US, PT and EQA may be more frequent and involve more samples, so confidence in the bias determinations is higher.
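
Averaging events together can be as simple as converting each result to a percent bias and taking the mean. The Python sketch below assumes that approach; the function name and the four samples are hypothetical, and percent bias is used so that samples at different concentrations can be combined.

```python
from statistics import mean


def average_percent_bias(samples):
    """Average percent bias over (lab_result, group_mean) pairs.

    Returns the average and the number of samples behind it, because a
    small count means the estimate deserves less confidence.
    """
    percent_biases = [100.0 * (lab - target) / target for lab, target in samples]
    return mean(percent_biases), len(percent_biases)


# Hypothetical samples pooled from several PT events.
samples = [(102.0, 100.0), (198.0, 200.0), (306.0, 300.0), (51.0, 50.0)]
average_bias, n = average_percent_bias(samples)
```

The individual samples show biases of 2%, -1%, 2%, and 2%, which average to 1.25%. Any single sample taken alone would have told a somewhat different story.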

Here is where peer group evaluation may be more helpful. Often peer groups collect all the data, or at least a lot more data points than PT or EQA challenges.

Another technique of monitoring bias on a continuous basis is patient split-sampling. When there is an available reference/comparative method, you can run the same patient sample on the "test" method and comparative method on an ongoing basis. In our earlier scenario with the two identical instruments in the same laboratory, for example, split-sampling would be a good technique to monitor the differences between the methods/instruments. Many health systems don't have the ability to run that continuous comparison in-house, unfortunately.

The Commutability Conundrum and the Matrix Effect

No, this isn't time to take the blue pill (although after learning about all the biases in the laboratory world, you might wish you could wake up from this metrology Wonderland). There's one last issue when it comes to bias and how we measure it. It's called Commutability.

"The term 'commutability' was first used to describe the ability of a reference or control material to have interassay properties comparable to the properties demonstrated by authentic clinical samples when measured by more than one analytical method ... More recent metrologic documents expand the concept; they describe commutability as the equivalence of the mathematical relationships between the results of different measurement procedures for a reference material and for representative samples from healthy and diseased individuals."
W. Greg Miller, Gary L Myers, Robert Rej, Why Commutability Matters, Clin Chem 52(4): 553. 2006.

Commutability, in layman's terms, means that if a bias is detected by control materials, there is also a bias in real patient samples. The control behaves like the patient sample. The assumption that the control behaves like the patient sample is built into the very foundation of quality control (if the controls were unstable and wholly different in behavior from patient samples, there would be no point in running them).

Unfortunately, it is not easy to build a cost-efficient control material that behaves exactly like a real patient sample. The challenges of creating control materials are a huge topic outside the scope of our focus in this lesson. Suffice it to say that manufacturers try to create controls that are as close as practically possible to patient samples, and laboratories try to put up with the differences between controls and patient samples. For cholesterol and glycated hemoglobin, for example, there is a strong commitment to create a traceability chain and find, minimize, or eliminate matrix effects. For other analytes, however, there is at best mixed success in eliminating matrix biases.

When there is a distinct difference between the method performance using a control material versus a patient sample, usually because of the constituents of the control material, this is called a Matrix Effect (which has absolutely no relationship to the Wachowski brothers, Keanu Reeves, or Laurence Fishburne). This, in effect, is another bias. The control materials are biased from the patient samples. Matrix Effects are most obvious when you plot different methods and instruments in PT or EQA groups; whenever there is a marked difference between instrument A and instrument B on the same event, the answer is often a matrix effect (really, it's a bias of the biases).
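
The "bias of the biases" idea can be sketched in a few lines of Python. The example assumes you have results from two methods on both a control material and a set of patient samples; the function names and data are hypothetical, and formal commutability assessments are considerably more involved than this.

```python
from statistics import mean


def between_method_difference(method_a, method_b):
    """Average difference (A minus B) over paired results on the same samples."""
    return mean([a - b for a, b in zip(method_a, method_b)])


def matrix_effect_estimate(control_a, control_b, patient_a, patient_b):
    """Between-method difference on controls minus the same difference on patients.

    A value near zero suggests the control behaves like patient samples.
    """
    control_difference = between_method_difference(control_a, control_b)
    patient_difference = between_method_difference(patient_a, patient_b)
    return control_difference - patient_difference


# Hypothetical data: the methods disagree by 10 on the control material
# but by only 1 on patient samples.
effect = matrix_effect_estimate(
    control_a=[110.0, 112.0], control_b=[100.0, 102.0],
    patient_a=[101.0, 151.0, 201.0], patient_b=[100.0, 150.0, 200.0],
)
```

In this made-up case, 9 of the 10 units of apparent disagreement on the control material come from the material itself and would not show up in patient results.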

Can't we just forget about bias? Pretend it doesn't exist?

"A fundamental goal of laboratory medicine is that results for patients’ samples will be comparable and independent of the medical laboratory that produced the results. Routine measurement procedures of acceptable analytical specificity that have calibration traceable to the same higher order reference material or reference measurement procedure should produce numerical values for clinical samples that are comparable irrespective of time, place, or laboratory generating the result."
W. Greg Miller, Gary L Myers, Robert Rej, Why Commutability Matters, Clin Chem 52(4): 553. 2006.

In a world where professional guidelines are making global recommendations for cutoff limits, where multinational diagnostic manufacturers are issuing reference ranges that are often used by customers as de facto reference ranges around the world, and where pay-for-performance schemes are implemented with agency-mandated cutoffs, it is not possible to ignore bias.

Furthermore, while one hand of ISO is encouraging measurement uncertainty and the elimination of bias as a factor in the performance of methods, another ISO-driven effort is toward traceability, standardization, and harmonization. This latter effort explicitly recognizes that there are biases between methods and urges some form of standardization to harmonize results. In the US, there is no regulatory mandate for standardization, harmonization, or traceability. Indeed, the FDA has no real power to order medical device manufacturers to supply traceability information with their applications for FDA clearance and approval.

In a perfect world, these two ISO desires would be fulfilled. Methods would be traceable, standardized, and/or harmonized, so that many of the biases discussed here could be eliminated. In that world, measurement uncertainty would be easy to calculate because you wouldn't have to ignore biases.

In our less-than-perfect world, however, not only are methods often un-traceable, un-standardized, and un-harmonious, there are also biases that will still exist even in the presence of national standardization programs (witness HbA1c). We have lots of biases which have not yet been eliminated or reduced sufficiently to assure the comparability of laboratory test results.
