QUANTIFYING EQUIVALENCE FOR INTERLABORATORY COMPARISONS OF FIXED POINTS

A.G. Steele, B.M. Wood and R.J. Douglas
National Research Council of Canada, Ottawa, Canada

ABSTRACT

Debates about measurement equivalence can be simplified by a satisfactory quantification of equivalence. We describe a rigorous, simple method of quantifying equivalence in one parameter. It treats all aspects of the ISO Guide to the Expression of Uncertainty in Measurement for measurement comparisons. We compare this method to the common usage of the normalized error for three “case studies” using published temperature comparisons.

1. INTRODUCTION

Equivalence in measurement is introduced as an ideal to facilitate acceptance of measurements made at one laboratory, avoiding unnecessary re-measurements at another. Equivalence agreements will generally be based on selected measurement comparisons using exchanged artifacts. Temperature scales realised in different laboratories can be compared at fixed points in a manner which may be fully understood by metrologists unfamiliar with the intricacies of comparing temperature scales, and the first objective of the CCT’s Key Comparison program recognises this.

While most international comparison programs will be multilateral, the essential elements appear in a bilateral comparison. Comparison results consist of two parts, the mean difference in temperature recorded by an exchanged thermometer and the uncertainty in the difference. Measurements at two laboratories will have neither exactly the same mean, nor the same uncertainty. In a first comparison, the difference of the means will normally be construed as exactly that: a difference of the means. With much more effort and repetition, it may occasionally develop that some part of this difference can be attributed to a previously unappreciated source of uncertainty.
Some proposals for analysing comparisons can misrepresent the opinions of the metrologists concerned by only allowing the difference in means to be interpreted as an expression of uncertainty. Any treatment that insists on this approach has limited application to the real world where new comparisons should be analysed and acted upon without delay, before the interpretation of a new source of uncertainty could be put on a sound statistical basis.

Our approach to “equivalence” addresses the equality of the means, with the uncertainties used as further descriptions of the means. An equivalence statement must specify the extent to which the two laboratories’ means are equal. Because most laboratories report expanded uncertainties with k = 2, which corresponds approximately to a 95 % confidence interval, we contend that equivalence agreements should normally be based on the same 95 % confidence. The equivalence statement’s wording should be able to assure a wide audience of the tightest equivalence that can be rigorously supported by analysis of the comparison. We advocate the following form:

“On the basis of comparison measurements [reference] performed in the period of [date to date], the results of similar measurements made at [Laboratory 1] and [Laboratory 2] can be expected to agree to within [±QDE0.95], with 95 % confidence.”

This form is unambiguous and can be statistically justified from a measurement comparison. This form of equivalence statement is purely technical. Only with formal approval by the laboratories and oversight agencies could it become an equivalence agreement.

Normally measurement comparisons will be able to demonstrate equivalence, at high confidence, that will satisfy almost all commercial requirements. The need for demonstrated equivalence can be seen in the very substantial efforts being invested in international key comparisons.
Their full utility can be realized with the quantification of their demonstrated equivalence, but a simple procedure is needed that can rigorously encompass all ISO Guide [1] uncertainty budgets with minimal debate.

2. QDE0.95: A RIGOROUS METHOD FOR QUANTIFYING EQUIVALENCE

One impediment to quantifying equivalence has been the lack of an accepted method that can represent the essence of a comparison’s results (the two means and two standard uncertainties), ideally as a single parameter. One proposed method [2] overcomes all concerns of which we are aware. It is mathematically rigorous, properly incorporates the effects of both degrees of freedom and correlations, and is very easy to use. The method is derived from the recommended practices and underlying philosophy of the ISO Guide to the Expression of Uncertainty in Measurement [1], which is followed by most national metrology institutions.

The method is outlined here for two laboratories and their measurements of an exchanged artifact. For each laboratory, the comparison result (its mean and its Guide-compliant uncertainty budget) is used to construct a probability density function for that laboratory’s measurement. After adjustment for known correlations, the probability density functions of any pair of laboratories are then evaluated by convolution to find P(z), the probability density that the laboratories report a difference of means z. P(z) is integrated symmetrically about z = 0 until the interval z = ±d has accumulated the desired confidence, typically 95 %. At 95 % confidence we refer to d as QDE0.95. This confidence interval represents the range within which each laboratory is expected to report the same mean with a probability of 95 %. This is the tightest range within which the comparison can be said to have demonstrated that the laboratories’ means agree at this confidence level, without resorting to ad-hoc corrections or the reintroduction of laboratory units.
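
For the special case of two uncorrelated normal distributions with effectively infinite degrees of freedom, the convolution P(z) is itself normal, with mean m_2 - m_1 and standard deviation equal to the root-sum-square of the two standard uncertainties, so the symmetric integration can be sketched numerically in a few lines. The Python below is only an illustration under those assumptions, not the general procedure of [2]; the function names and the bisection search are choices made for this sketch.

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def coverage(d, mean_diff, u_pair):
    """Probability that the difference z lies within [-d, +d], assuming
    z is normal with mean mean_diff and standard deviation u_pair."""
    upper = normal_cdf((d - mean_diff) / u_pair)
    lower = normal_cdf((-d - mean_diff) / u_pair)
    return upper - lower

def qde_normal(m1, u1, m2, u2, confidence=0.95):
    """Half-width d of the interval, symmetric about z = 0, that holds
    the requested confidence (assumption: uncorrelated normal results)."""
    mean_diff = m2 - m1
    u_pair = math.hypot(u1, u2)  # root-sum-square of standard uncertainties
    lo, hi = 0.0, abs(mean_diff) + 10.0 * u_pair
    # coverage() increases monotonically with d, so bisection converges.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if coverage(mid, mean_diff, u_pair) < confidence:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With equal means this sketch returns about 1.96 times the pair standard uncertainty, and for widely separated means it approaches the difference of the means plus 1.645 times the pair standard uncertainty.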
Figure 1 shows a comparison as the joint probability distribution over all possible pairs of measurement results of the same physical quantity. Projections of this distribution onto x (in Lab 1), y (in Lab 2) and z’ = (x-y)/√2 (a difference-like variable) are also shown. A single-parameter interpretation of their equivalence as described above is shown at the top of the figure and is denoted as QDE0.95. A two-parameter description is illustrated at the bottom of the figure. The two parameters are the difference of the means and the measurement pair uncertainty. The two-parameter form does contain more information than QDE0.95, but end users rarely want this extra information.

Figure 2 is a plot of the values of the confidence interval QDE0.95 versus the parameter Γk=1 = |m_2 - m_1|/u_p for normal probability distributions, with u_p = \sqrt{u_1^2 + u_2^2}. As expected, the curve starts at approximately 2 u_p (that is, a full confidence interval that is 4 u_p wide) in the limit of Γk=1 = 0, and increases linearly with Γk=1 in the limit as Γk=1 becomes large. A function which is simple to calculate and is accurate to better than 1 % of QDE0.95 is given by:

    QDE_{0.95} \approx |m_2 - m_1| + \{ 1.645 + 0.3295 \exp[ -4.05 \, |m_2 - m_1| / u_p ] \} \, u_p        (1)

Another expression is available [2] to treat cases with specified, finite degrees of freedom.

3. COMPARISON WITH THE USE OF THE NORMALIZED ERROR

The normalized error, Γk=1, is often used as a test of the “null hypothesis” when comparison measurements are assumed to have the same means. It is calculated as the difference of the means divided by the combined standard uncertainty of the pair of measurements:

    \Gamma_{k=1} = |m_2 - m_1| / \sqrt{u_1^2 + u_2^2}        (2)

The normalized error’s subscript is a reminder that this definition is not normalized to an expanded uncertainty but uses a coverage factor k=1. Although Γk=1 is well defined numerically, an additional criterion is required to express equivalence [3-5].
If Γk=1 ≤ 1, then the measurements are often deemed to be in “satisfactory agreement”. The measurements are often said to be in “disagreement” if Γk=1 ≥ 2, while the range 1 < Γk=1 < 2 is “questionable” or unresolved.

The normalized error is only a weak test for equivalence, since only within the broadest interval of ±2u_p (including travel uncertainty) and in the most favorable circumstance of Γk=1 = 0 does the demonstrated confidence reach 95 %. Using Γk=1 as the single parameter for quantifying equivalence can be misleading. In particular, the confidence for an “accepted equivalent” interval has not usually been calculated for the measured values of Γk=1. For many practical cases, confidence levels for “agreement” are disturbingly low and much less than the 95 % confidence that is generally associated with calibration reports. A further weakness in using Γk=1 with these criteria is that the only practical way of increasing the confidence for demonstrated equivalence is to persuade one or both parties to inflate their uncertainty budget artificially.

4. EXAMPLES FROM THERMOMETRY

In order to place the discussion of quantified equivalence into a temperature context, tabulated results are given below for three recently published international comparisons. The work at the triple point of water [6] and the triple point of mercury [7] was done with a travelling cell using standard (non-travelling) platinum resistance thermometers, and the high temperature comparison at 1000 °C [8] used a travelling radiation thermometer. In each case, the published data include the observed temperature difference between the participants and the reference value appropriate to the comparison, along with the uncertainty. In the tables below we summarize the measured differences in the means, and the expanded uncertainty (U) with coverage factor two (k=2) as is the normal practice. We calculate the two equivalence parameters: normalized error (Γk=1) and QDE0.95.
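
As a sketch of how the two parameters can be computed from a tabulated difference and its expanded pair uncertainty, the following Python applies Eq. 2 and the approximation of Eq. 1. It assumes normal distributions, that the tabulated U (k=2) is the expanded uncertainty of the pair so that the pair standard uncertainty is U/2, and the acceptance bands quoted in Section 3; the function names are illustrative only.

```python
import math

def normalized_error(mean_diff, u_pair):
    """Eq. 2: |difference of the means| over the pair standard uncertainty."""
    return abs(mean_diff) / u_pair

def qde95_approx(mean_diff, u_pair):
    """Eq. 1: closed-form approximation to QDE0.95 (normal distributions)."""
    gamma = normalized_error(mean_diff, u_pair)
    return abs(mean_diff) + (1.645 + 0.3295 * math.exp(-4.05 * gamma)) * u_pair

def acceptance_band(gamma):
    """Bands commonly applied to the normalized error (Section 3)."""
    if gamma <= 1.0:
        return "satisfactory agreement"
    if gamma >= 2.0:
        return "disagreement"
    return "questionable"

def equivalence_parameters(mean_diff, expanded_u_k2):
    """Return (gamma, QDE0.95, band) from a difference and U (k=2)."""
    u_pair = expanded_u_k2 / 2.0  # assumption: U is the pair value at k=2
    gamma = normalized_error(mean_diff, u_pair)
    return gamma, qde95_approx(mean_diff, u_pair), acceptance_band(gamma)

# Illustrative input: a difference of +0.091 mK with U (k=2) = 0.080 mK.
gamma, qde, band = equivalence_parameters(0.091, 0.080)
print(f"gamma = {gamma:.3f}, QDE0.95 = {qde:.2f} mK, band = {band}")
```

For the illustrative input, the normalized error falls in the “disagreement” band even though the computed QDE0.95 is only about 0.16 mK, which is the kind of contrast the tables below bring out.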
4.1 Triple Point of Water Comparison

For the BIPM comparison of triple point of water cells [6], the tabulated data list the results of five participating laboratories, and the pair uncertainty with respect to the BIPM reference cell. These data are taken from Table 23 in [6], and the conclusion drawn in the report is that there is excellent agreement among the national reference cells of the participating laboratories, the maximum difference being smaller than 0.14 mK. The calculation of the equivalence parameters is straightforward using the equations discussed above, and is shown in Table 1.

Table 1: BIPM triple point of water comparison results [6] for an indirect comparison of national reference cells.

Laboratory   TLab – TBIPM (mK)   Uk=2 (mK)   Γk=1    QDE0.95 (mK)
NPL          -0.030              0.090       0.667   0.11
NIST         +0.091              0.080       2.275   0.16
BNM-INM      +0.088              0.126       1.397   0.19
IMGC         +0.049              0.178       0.551   0.20
VNIIM        -0.046              0.102       0.902   0.13

This comparison illustrates quite clearly the utility of QDE0.95 over Γk=1 and its acceptance band approach to equivalence. Only three of the five NMIs which participated in the comparison would be considered to have “acceptable equivalence” to the reference laboratory, based on their normalized error, and one of these has Γk=1 very nearly equal to unity; one laboratory is in the “questionable” band; and one laboratory is in “disagreement” with the reference. By contrast, each of the five measured differences can be used to calculate a 95 % confidence interval for equivalence with the BIPM reference value. The laboratory which would have been deemed in “disagreement” using the normalized error method is seen to have a QDE0.95 confidence interval tighter than one of the laboratories which would have been deemed to “agree” with the reference value.
Further calculation of all possible bilateral demonstrated equivalence for each pair of nations would complete the analysis of this experiment, and remove the possible bias introduced by naming one laboratory as having the “reference” triple point of water cell for the comparison. This experimental situation highlights the importance of quantifying demonstrated equivalence in a statistically rigorous fashion.

4.2 Comparison at 1000 °C

The comparison of ITS-90 in the range from 800 °C to 2000 °C [8] used a travelling radiation thermometer operating at 0.98 µm which was supplied and calibrated by the pilot laboratory, NPL. This instrument was used to measure local optical sources at several temperatures, including 1000 °C. The expanded uncertainty (k=2) of the calibration of the thermometer for the calculation of the compatibility parameters is 1.0 °C, and is included in the combined pair uncertainty.

In this comparison, all of the laboratories are considered to be “in agreement” with the reference value. The lack of correlation between the magnitude of Γk=1 and a quantified equivalence interval is apparent from an inspection of the results for Nmi/VSL shown in Table 2, which has the lowest value for Γk=1 due to the happy combination of a very small observed difference and a reasonably large uncertainty. In contrast, the confidence interval for demonstrated equivalence between Nmi/VSL and NPL is in the middle of the values calculated for this experiment. Similarly, the IMGC Γk=1 value is the highest, while its QDE0.95 interval is the smallest. These observations confirm the weakness of using normalized error as a figure of merit when evaluating equivalence.

Table 2: NPL travelling standard infrared radiation standard comparison results at 1000 °C [8]. The measured differences in the means were taken from data plotted in separate figures for each laboratory; the expanded uncertainty for each measurement was taken from the associated standard uncertainty table.

Laboratory   TLab – TNPL (°C)   Uk=2 (°C)   Γk=1   QDE0.95 (°C)
IMGC         0.33               0.8         0.83   0.99
INM          0.48               2.0         0.48   2.17
PTB          0.29               1.0         0.58   1.13
Nmi/VSL      0.13               1.4         0.19   1.39
SP           0.45               1.8         0.50   1.97

4.3 Triple Point of Mercury Comparison

Local realizations of the Hg point were compared in eleven laboratories against a travelling cell supplied by the pilot laboratory, BNM-INM. The average temperature difference between a local cell and the circulating cell, with uncertainty, was reported for each of the participating laboratories. Although this comparison was not a Key Comparison of national Hg values, and some specified uncertainty components were omitted from the budgets at each laboratory, the results are nevertheless amenable to equivalence analysis. No data for the experiments at the pilot laboratory were included in [7]. In order to evaluate the compatibility parameters for each laboratory with respect to the pilot, an uncertainty Uk=2 = 0.2 mK is assumed for the BNM-INM difference, whose value is taken to be zero.

The comparison data are summarized in Table 3, and the full matrix of bilateral equivalence and comparison parameters is shown in Table 4. Even for such a large comparison, it is possible to summarize the demonstrated equivalence in a single chart, without resorting to a “comparison reference value”. It is also clear that many of the bilateral equivalences can be calculated even when a pilot laboratory has not published complete results. The values above the diagonal in Table 4 are the QDE0.95 confidence intervals, and the italicized values below the diagonal are Γk=1, all calculated as pairwise equivalence parameters. This same full matrix calculation of bilateral equivalences could also be performed for the comparisons discussed above.
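
The pairwise calculation behind such a matrix can be sketched as follows. This Python is an illustration, not the authors' code: it assumes each laboratory's standard uncertainty is half its tabulated Uk=2, treats the laboratories as uncorrelated, and uses the normal-distribution approximation of Eq. 1. Only the first three rows of Table 3 are used, and the BNM-INM entries are the assumed values marked ‘?’ in that table; the function names are hypothetical.

```python
import math

# (laboratory, difference from circulating cell in mK, U at k=2 in mK)
# First three rows of Table 3; the BNM-INM values are assumed, not from [7].
labs = [
    ("BNM-INM", 0.0, 0.2),
    ("CEM", -0.011, 0.28),
    ("CMA", 0.02, 0.19),
]

def pair_parameters(lab_a, lab_b):
    """Return (gamma, QDE0.95) for one pair via Eq. 2 and Eq. 1."""
    _, d_a, big_u_a = lab_a
    _, d_b, big_u_b = lab_b
    diff = abs(d_b - d_a)
    # Assumption: uncorrelated laboratories, standard uncertainty = U/2.
    u_pair = math.hypot(big_u_a / 2.0, big_u_b / 2.0)
    gamma = diff / u_pair
    qde = diff + (1.645 + 0.3295 * math.exp(-4.05 * gamma)) * u_pair
    return gamma, qde

def bilateral_matrix(entries):
    """QDE0.95 above the diagonal, gamma below it, None on the diagonal."""
    n = len(entries)
    matrix = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            gamma, qde = pair_parameters(entries[i], entries[j])
            matrix[i][j] = qde    # above the diagonal, in mK
            matrix[j][i] = gamma  # below the diagonal, dimensionless
    return matrix

matrix = bilateral_matrix(labs)
for (name, _, _), row in zip(labs, matrix):
    cells = ["   -  " if v is None else f"{v:6.2f}" for v in row]
    print(f"{name:8s}", " ".join(cells))
```

Because every pair is evaluated directly from the two laboratories' own differences and uncertainties, no comparison reference value enters the calculation.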
There is, of course, much current debate over the nature of the comparison reference value, and careful consideration must be taken when determining this number (be it a simple or weighted average, or other construction from the comparison data). Using QDE0.95 in this type of matrix form sidesteps this thorny issue entirely, leaving no laboratory in the position of feeling that its deviation from the “reference” is misrepresentative of its measurement capability. An additional benefit of the full matrix presentation of demonstrated equivalence is that any reader can see directly from the table to what extent any pair of participants’ measurements are compatible, without further calculation: all of the arithmetic has already been done.

Table 3: BNM-INM triple point of mercury comparison results [7]. The ‘?’ indicates that the tabulated values were assumed for the purpose of calculating Table 4, and not obtained from [7].

No.  Laboratory   TLab – TCirc (mK)   Uk=2 (mK)   Γk=1   QDE0.95 (mK)
0    BNM-INM      0?                  0.2?
1    CEM          -0.011              0.28        0.06   0.55
2    CMA          +0.02               0.19        0.14   0.43
3    DM           -0.100              0.16        0.78   0.32
4    IMGC         +0.116              0.15        0.93   0.33
5    IPQ          -0.12               0.22        0.81   0.38
6    NMI          +0.17               0.20        1.20   0.41
7    NPL          +0.06               0.12        0.51   0.29
8    OFMET        -0.03               0.30        0.17   0.49
9    PTB          -0.25               0.34        1.27   0.58
10   SP           -0.42               0.31        2.28   0.72

Table 4: BNM-INM Hg point comparison results [7], expressed as bilateral comparisons. Above the diagonal is shown the demonstrated equivalence at the 95 % confidence level in [-QDE0.95, +QDE0.95]: QDE0.95 (calculated from Table 3 and Eq. 1) in mK. Below the diagonal is shown the dimensionless comparison parameter, Γk=1, calculated from Table 3 and Eq. 2. The ‘?’ indicates that the values used in the Lab 0 calculations were not obtained from [7].

Lab   0      1      2      3      4      5      6      7      8      9      10
0     -      0.34?  0.27?  0.31?  0.32?  0.37?  0.40?  0.26?  0.36?  0.57?  0.72?
1     0.06?  -      0.34   0.36   0.39   0.41   0.46   0.33   0.40   0.60   0.75
2     0.14?  0.18   -      0.33   0.30   0.38   0.38   0.23   0.36   0.59   0.74
3     0.78?  0.55   0.97   -      0.40   0.27   0.48   0.32   0.36   0.46   0.61
4     0.93?  0.80   0.79   1.97   -      0.46   0.27   0.22   0.42   0.67   0.82
5     0.81?  0.61   0.96   0.15   1.77   -      0.53   0.39   0.40   0.47   0.61
6     1.20?  1.05   1.09   2.11   0.43   1.95   -      0.30   0.50   0.74   0.89
7     0.51?  0.47   0.36   1.60   0.58   1.44   0.94   -      0.36   0.61   0.75
8     0.17?  0.09   0.28   0.41   0.87   0.48   1.11   0.56   -      0.59   0.74
9     1.27?  1.09   1.39   0.80   1.97   0.64   2.13   1.72   0.97   -      0.55
10    2.28?  1.96   2.42   1.83   3.11   1.58   3.20   2.89   1.81   0.74   -

5. CONCLUSIONS

Equivalence statements, of use to the clients of calibration laboratories, require quantification and justification based on measurement comparisons. We have outlined a method which allows an unambiguous and statistically justifiable statement of equivalence expressible as a single, easily calculated, bilateral parameter, and have contrasted this method with the use of the normalized difference of the means by including worked examples from recently published thermometric comparisons. By calculating a full matrix of bilateral demonstrated equivalence parameters for each comparison, no artificial “reference value” needs to be calculated from the comparison data, and yet the presentation of all of the information may still be summarized in a concise, easily understood tabular format.

[Figure 1 panels. Top: “One-Parameter 95% Confidence Interval for Quantified Demonstrated Equivalence”, y = x ± QDE0.95, with P(z) plotted against z’ = (x-y)/√2. Bottom: “Two-Parameter 95% Confidence Interval for Quantified Demonstrated Difference”, y = x + (m_2 - m_1) ± 2u_p. Axes: x and y.]

Figure 1: A graphical plot of two comparison results and the interpretation of their equivalence. At the top is a single-parameter description of the equivalence denoted as QDE0.95. At the bottom is the difference of the means and the measurement pair uncertainty.

[Figure 2 plot. Vertical axis: 95% Confidence Interval, 0 to 6. Horizontal axis: |m_2 - m_1|/u_p, 0 to 4.]

Figure 2: The 95 % confidence interval parameter, QDE0.95, versus the difference of the means (each in units of the combined standard uncertainty of the measurement pair, u_p = (u_1^2 + u_2^2)^{1/2}). The 95 % confidence interval is [-QDE0.95, +QDE0.95].

REFERENCES

[1] ISO, Guide to the Expression of Uncertainty in Measurement, International Organization for Standardization, Geneva, Switzerland, 1993.
[2] Wood, B.M., Douglas, R.J., Confidence-Interval Interpretation of a Measurement Pair for Quantifying a Comparison, Metrologia, 1998, 35, pp. 187-196.
[3] ISO, Guide 43-1 and Guide 43-2: Proficiency testing by interlaboratory comparisons, International Organization for Standardization, Geneva, Switzerland, 1996.
[4] EUROMET Guidance Document #3, Guidelines for the organization of comparisons, available as DFM-1997-R20 from Danish Institute of Fundamental Metrology, Lyngby, Denmark, 1997.
[5] NORAMET Document #8, Mutual Recognition of Calibration Services of National Metrology Institutes, available from National Research Council of Canada, Ottawa, Canada, 1998.
[6] Pello, R., Goebel, R., Köhler, R., Report on the international comparison of water triple point cells, Comité Consultatif de Thermométrie, 1996, CCT/96-1.
[7] Hermier, Y., Bonnier, G., Intercomparison of Mercury point cells, EUROMET Report No. 280, 1997.
[8] Machin, G., Ricolfi, T., Battuello, M., Negro, G., Jung, H.-J., Bloembergen, P., Bosma, R., Ivarsson, J., Weckström, T., Comparison of the ITS-90 using a transfer standard infrared radiation thermometer between seven EU national metrological institutes, Metrologia, 1996, 33, pp. 197-206.

Contact point: R.J. Douglas, National Research Council of Canada, Institute for National Measurement Standards, Ottawa, Canada, tel. +1 613 993-5186, fax +1 613 952-1394, e-mail rob.douglas@nrc.ca