Tools, Technologies and Training for Healthcare Laboratories

The Quality of Laboratory Testing Series

Looking at the Proficiency Testing Data

April 2006

[Note: This QC application is an extension of the lesson From Method Validation to Six Sigma: Translating Method Performance Claims into Sigma Metrics. This article assumes that you have read that lesson first, and that you are also familiar with the concepts of QC Design, Method Validation, and Six Sigma. If you aren't, follow the links provided.]

We have previously presented data and calculations about the quality of laboratory testing for several touchstone tests. The conclusions of those studies stand in marked contrast with the conventional wisdom. Where laboratory professionals usually assume that test performance is more than adequate for medical usefulness, the data tell the opposite story: many test methods are performing poorly, and some are performing far worse than needed for proper medical use.

The methodology of these studies has been presented here previously. To summarize briefly: we selected data from individual proficiency testing events from all the proficiency testing programs with publicly available data.

To study the usefulness of these calculations, we took a look at one proficiency testing program, the New York State Wadsworth program. Rather than look at a single event, we looked at all the events for two years.

Broad measures of Quality from PT data

In addition to the three summary metrics (NTQ, LMQ, and NMQ), we have calculated the Weighted Average CV, the Weighted Average Bias, and the Sigma metric calculated from those weighted average figures.

| Month | National Test Quality (NTQ) | Local Method Quality (LMQ) | National Method Quality (NMQ) | Weighted Average CV | Weighted Average Bias | Weighted Average Sigma |
|-------|------|------|------|------|------|------|
| Application | 4.09 | 4.83 | 4.48 | - | - | - |
| Jan 2004 | 4.095 | 5.041 | 4.677 | 2.135 | 0.721 | 4.326 |
| May 2004 | 2.161 | 4.856 | 3.461 | 2.226 | 2.885 | 3.197 |
| Sept 2004 | 2.273 | 4.856 | 3.536 | 2.257 | 2.899 | 3.147 |
| Jan 2005 | 2.543 | 4.827 | 3.898 | 2.382 | 2.285 | 3.238 |
| May 2005 | 4.150 | 4.524 | 3.797 | 2.448 | 1.528 | 3.461 |
| Sept 2005 | 3.787 | 4.858 | 4.306 | 2.182 | 1.140 | 4.060 |
| Average | 3.168 | 4.827 | 3.946 | 2.272 | 1.910 | 3.575 |
| Std. dev. | 0.939 | 0.167 | 0.467 | 0.120 | 0.919 | 0.506 |

Of these measures, it's clear that the most volatile are the National Test Quality and the Weighted Average Bias. The bias changes most from event to event, and the NTQ takes that bias into account in an unweighted form. The most stable measures are the Weighted Average CV and the Local Method Quality (which is based solely on the imprecision measurement).
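The Weighted Average Sigma column follows the standard Sigma-metric formula, Sigma = (TEa - |bias|) / CV. A minimal sketch in Python, assuming a total allowable error (TEa) of 10% for this analyte - an assumption inferred from the tabled figures, not stated explicitly in the survey data:

```python
def sigma_metric(tea_pct, bias_pct, cv_pct):
    """Sigma metric: (allowable total error - |bias|) / imprecision, all in %."""
    return (tea_pct - abs(bias_pct)) / cv_pct

# May 2004 weighted averages from the table above (CV 2.226, bias 2.885)
sigma = sigma_metric(10.0, 2.885, 2.226)
print(round(sigma, 3))  # ~3.196, matching the tabled 3.197 to rounding
```

With the assumed 10% TEa, the other events reproduce the tabled Sigmas to within rounding as well.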

What do these metrics mean in practice? Using our previous "rules of thumb" on Sigma metrics, we can determine roughly what control rules and N would be required for each event.

| Month | National Test Quality (NTQ) | Rough Rules | Local Method Quality (LMQ) | Rough Rules | National Method Quality (NMQ) | Rough Rules |
|-------|------|------|------|------|------|------|
| Application | 4.09 | N=4, Multirule or 2.5s | 4.83 | N=4, Multirule or 2.5s | 4.48 | N=4, Multirule or 2.5s |
| Jan 2004 | 4.095 | N=4, Multirule or 2.5s | 5.041 | N=2, 2.5s or 3s limits | 4.677 | N=4, Multirule or 2.5s |
| May 2004 | 2.161 | Max QC | 4.856 | N=4, Multirule or 2.5s | 3.461 | Max QC |
| Sept 2004 | 2.273 | Max QC | 4.856 | N=4, Multirule or 2.5s | 3.536 | Max QC |
| Jan 2005 | 2.543 | Max QC | 4.827 | N=4, Multirule or 2.5s | 3.898 | Max QC |
| May 2005 | 4.150 | N=4, Multirule or 2.5s | 4.524 | N=4, Multirule or 2.5s | 3.797 | Max QC |
| Sept 2005 | 3.787 | Max QC | 4.858 | N=4, Multirule or 2.5s | 4.306 | N=4, Multirule or 2.5s |
| Average | 3.168 | Max QC | 4.827 | N=4, Multirule or 2.5s | 3.946 | Max QC |
| Std. dev. | 0.939 | | 0.167 | | 0.467 | |

As you can see, the metrics that include bias (NTQ and NMQ) switch back and forth between N=4 QC procedures and Maximum QC procedures (basically, using as many control rules and control measurements as possible). With LMQ, there is only one switch: when the Sigma rises above 5.0 (Jan 2004), the user can reduce QC to wide limits with N=2.
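The rules of thumb behind the table can be sketched as a simple threshold function. The cutoffs used here (Sigma of 5.0 or better allows N=2 with wide limits; 4.0 or better calls for a multirule or 2.5s limits with N=4; anything lower needs maximum QC) are inferred from the table above and from the earlier Six Sigma lesson - treat them as an approximation, not as EZ Rules output:

```python
def rough_qc_rules(sigma):
    """Map a Sigma metric to a rough QC recommendation (thresholds assumed)."""
    if sigma >= 5.0:
        return "N=2, 2.5s or 3s limits"
    if sigma >= 4.0:
        return "N=4, Multirule or 2.5s"
    return "Max QC"

print(rough_qc_rules(5.041))  # Jan 2004 LMQ -> N=2, 2.5s or 3s limits
print(rough_qc_rules(4.306))  # Sept 2005 NMQ -> N=4, Multirule or 2.5s
print(rough_qc_rules(2.161))  # May 2004 NTQ -> Max QC
```

Running every Sigma in the table through this function reproduces the "Rough Rules" columns.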

Next, with the weighted CV and bias figures, we can actually calculate the specific rule recommendations with EZ Rules:

| Month | Weighted Average CV | Weighted Average Bias | Weighted Average Sigma | QC Recommendation by EZ Rules |
|-------|------|------|------|------|
| Jan 2004 | 2.135 | 0.721 | 4.326 | 1:2.5s with N=4 |
| May 2004 | 2.226 | 2.885 | 3.197 | 1:2.5s with N=4 |
| Sept 2004 | 2.257 | 2.899 | 3.147 | 1:3s/2:2s/R:4s/4:1s/8:x with N=4 (Max QC) |
| Jan 2005 | 2.382 | 2.285 | 3.238 | 1:3s/2:2s/R:4s/4:1s with N=4 |
| May 2005 | 2.448 | 1.528 | 3.461 | 1:3s/2:2s/R:4s/4:1s with N=4 |
| Sept 2005 | 2.182 | 1.140 | 4.060 | 1:2.5s with N=4 |
| Average | 2.272 | 1.910 | 3.575 | 1:3s/2:2s/R:4s/4:1s with N=4 |

You can see the six events represented visually here:

But what if all these laboratories were using the CLIA minimum of two controls with limits set at the traditional 2s? What error detection would they be achieving?

| Month | Weighted Average CV | Weighted Average Bias | Weighted Average Sigma | QC Recommendation by EZ Rules | Error Detection (1:2s rules, N=2) |
|-------|------|------|------|------|------|
| Jan 2004 | 2.135 | 0.721 | 4.326 | 1:2.5s with N=4 | 0.94 |
| May 2004 | 2.226 | 2.885 | 3.197 | 1:2.5s with N=4 | 0.54 |
| Sept 2004 | 2.257 | 2.899 | 3.147 | 1:3s/2:2s/R:4s/4:1s/8:x with N=4 (Max QC) | 0.51 |
| Jan 2005 | 2.382 | 2.285 | 3.238 | 1:3s/2:2s/R:4s/4:1s with N=4 | 0.56 |
| May 2005 | 2.448 | 1.528 | 3.461 | 1:3s/2:2s/R:4s/4:1s with N=4 | 0.70 |
| Sept 2005 | 2.182 | 1.140 | 4.060 | 1:2.5s with N=4 | 0.89 |

Notice the big difference in error detection capability. Instead of catching errors in the first run where they occur, it might take two runs. And remember that when using 2s limits with two controls, you can expect close to 10% false rejections. So, with one run each day, you should see at least three rejections a month. If you're not, your limits (or your standard deviations) are set wider than they should be - and the error detection figures here are even lower.
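Both the false-rejection figure and the "might take two runs" point can be checked with a quick calculation. A sketch, assuming independent control results and the usual normal-theory probability that a single in-control result falls within 2s (about 0.9545):

```python
# Probability a single in-control result falls outside +/-2s (normal theory)
p_out_2s = 1 - 0.9545

# False rejection with 1:2s limits and N=2: either control can fall outside
pfr = 1 - (1 - p_out_2s) ** 2
print(round(pfr, 3))  # ~0.089, i.e. roughly 9% of in-control runs rejected

# With one run per day, expected false rejections over a 30-day month
print(round(30 * pfr, 1))  # ~2.7, close to 3 rejections a month

# Expected number of runs to detect a persistent error, given per-run Ped
for ped in (0.94, 0.51):
    print(round(1 / ped, 2))  # ~1 run at best (Jan 2004), ~2 runs at worst (Sept 2004)
```

This is why the 0.51-0.56 error detection figures in the table are troubling: on average, those errors survive into a second run of patient results before being caught.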

Data on specific instrument groups

Looking at specific instruments, even more interesting details emerge.

| Instrument Group | # of instruments | Avg CV | Avg Bias | Sigma | Sigma w/Bias |
|------|------|------|------|------|------|
| Abbott Aeroset | 9-11 | 1.43 | 1.12 | 7.16 | 6.39 |
| Roche Integra | 10-12 | 2.59 | 1.71 | 3.98 | 3.35 |
| Hitachi Modular | 21-31 | 1.89 | 0.69 | 5.32 | 3.35 |
| Beckman Coulter LX | 31-39 | 2.12 | 0.93 | 4.82 | 4.38 |
| Olympus AU400/AU600/others | 30-39 | 2.1 | 2.98 | 4.86 | 3.47 |
| Beckman Coulter CX | 47-35 | 2.1 | 0.73 | 4.86 | 4.52 |
| J&J Vitros | 67-47 | 2.44 | 5.37 | 4.15 | 1.83 |
| Dade Dimension | 62-71 | 2.37 | 1.75 | 4.26 | 3.51 |

Because of the differing numbers of instruments in the survey, it's best to compare like with like. For instance, comparing the Abbott Aeroset with the Dade Dimension based on this data probably isn't as reliable as comparing the Abbott Aeroset with the Roche Integra. Below, you can see the results of each PT event at the critical level of interest, as well as the resulting average as a solid "operating point":
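The "Sigma" column ignores bias (TEa / CV), while "Sigma w/Bias" applies the full formula. A sketch, again assuming a TEa of 10% (an assumption - the tabled figures likely used unrounded CVs or a slightly different TEa, so the results come out close to but not exactly matching the table), shows how heavily a large bias penalizes a method like the J&J Vitros:

```python
def sigmas(tea_pct, cv_pct, bias_pct):
    """Return (Sigma ignoring bias, Sigma with bias) for given %CV and %bias."""
    return tea_pct / cv_pct, (tea_pct - abs(bias_pct)) / cv_pct

# J&J Vitros group figures from the table, with an assumed TEa of 10%
s, s_bias = sigmas(10.0, 2.44, 5.37)
print(round(s, 2), round(s_bias, 2))  # ~4.10 vs ~1.90: bias cuts Sigma by more than half
```

The same calculation for the Beckman Coulter CX (bias 0.73) barely moves the Sigma, which is why low-bias groups hold up so much better in the "Sigma w/Bias" column.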

Here's a visual display of the three instruments in the middle of the pack:

Finally, here is a comparison of the last three instruments.

What have we learned?

The data in the proficiency testing surveys are fairly stable. Taking a fine-grained approach, the data are even more stable - and even more useful in differentiating between the performance of instruments.