A.G. Steele, B.M. Wood and R.J. Douglas
                        National Research Council of Canada, Ottawa, Canada


Debates about measurement equivalence can be simplified by a satisfactory quantification of equivalence. We
describe a rigorous, simple method of quantifying equivalence in one parameter. It treats all aspects of the ISO
Guide to the Expression of Uncertainty in Measurement for measurement comparisons. We compare this
method to the common usage of the normalized error for three “case studies” using published temperature


Equivalence in measurement is introduced as an ideal to facilitate acceptance of measurements made at one
laboratory, avoiding unnecessary re-measurements at another. Equivalence agreements will generally be based
on selected measurement comparisons using exchanged artifacts. Temperature scales realised in different
laboratories can be compared at fixed points in a manner which may be fully understood by metrologists
unfamiliar with the intricacies of comparing temperature scales, and the first objective of the CCT’s Key
Comparison program recognises this. While most international comparison programs will be multilateral, the
essential elements appear in a bilateral comparison.

Comparison results consist of two parts, the mean difference in temperature recorded by an exchanged
thermometer and the uncertainty in the difference. Measurements at two laboratories will have neither exactly
the same mean, nor the same uncertainty. In a first comparison, the difference of the means will normally be
construed as exactly that: a difference of the means. With much more effort and repetition, it may occasionally
develop that some part of this difference can be attributed to a previously unappreciated source of uncertainty.
Some proposals for analysing comparisons can misrepresent the opinions of the metrologists concerned by only
allowing the difference in means to be interpreted as an expression of uncertainty. Any treatment that insists on
this approach has limited application to the real world where new comparisons should be analysed and acted
upon without delay, before the interpretation of a new source of uncertainty could be put on a sound statistical

Our approach to “equivalence” addresses the equality of the means, with the uncertainties used as further
descriptions of the means. An equivalence statement must specify the extent to which the two laboratories’
means are equal. Because most laboratories report expanded uncertainties with k = 2, which corresponds
approximately to a 95 % confidence interval, we contend that equivalence agreements should normally be based
on the same 95 % confidence.

The equivalence statement’s wording should be able to assure a wide audience of the tightest equivalence that
can be rigorously supported by analysis of the comparison. We advocate the following form: “On the basis of
comparison measurements [reference] performed in the period of [date to date], the results of similar
measurements made at [Laboratory 1] and [Laboratory 2] can be expected to agree to within [±QDE 0.95], with
95 % confidence.” This form is unambiguous and can be statistically justified from a measurement comparison.
This form of equivalence statement is purely technical. Only with formal approval by the laboratories and
oversight agencies could it become an equivalence agreement.

Normally measurement comparisons will be able to demonstrate equivalence, at high confidence, that will
satisfy almost all commercial requirements. The need for demonstrated equivalence can be seen in the very
substantial efforts being invested in international key comparisons. Their full utility can be realized with the
quantification of their demonstrated equivalence, but a simple procedure is needed that can rigorously
encompass all ISO Guide [1] uncertainty budgets with minimal debate.

One impediment to quantifying equivalence has been the lack of an accepted method that can represent the
essence of a comparison’s results (the two means and two standard uncertainties), ideally as a single parameter.
One proposed method [2] overcomes all concerns of which we are aware. It is mathematically rigorous, properly
incorporates the effects of both degrees of freedom and correlations, and is very easy to use.

The method is derived from the recommended practices and underlying philosophy of the ISO Guide to the
Expression of Uncertainty in Measurement [1], which is followed by most national metrology institutions. The
method is outlined here for two laboratories and their measurements of an exchanged artifact.

For each laboratory, the comparison result (its mean and its Guide compliant uncertainty budget) is used to
construct a probability density function for that laboratory’s measurement. After adjustment for known
correlations, the probability density functions of any pair of laboratories are then evaluated by convolution to
find P(z), the probability density that the laboratories report a difference of means z. P(z) is integrated
symmetrically about z = 0, out to z = ±d, where d is chosen so that the integral accumulates the desired
confidence, typically 95 %. At 95 % confidence we refer to d as QDE0.95.
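
The integration step can be illustrated numerically. The following is a minimal sketch, not the procedure of [2]: it assumes that both laboratories' probability density functions are normal, so that their convolution P(z) is itself normal with mean m2 − m1 and standard deviation u_p, and it finds d by bisection. The function names and the correlation coefficient `r` are assumptions introduced for this illustration only.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def coverage(d, delta, u_p):
    """Probability that z lies in [-d, +d] when z ~ Normal(delta, u_p**2)."""
    return normal_cdf((d - delta) / u_p) - normal_cdf((-d - delta) / u_p)

def qde_numeric(m1, u1, m2, u2, confidence=0.95, r=0.0):
    """Half-width d of the interval, symmetric about z = 0, that holds
    the requested confidence.

    u1, u2 are standard uncertainties (k = 1). r is a hypothetical
    correlation coefficient between the two results (0 = uncorrelated);
    the pair uncertainty must remain greater than zero.
    """
    delta = abs(m2 - m1)
    u_p = math.sqrt(u1**2 + u2**2 - 2.0 * r * u1 * u2)
    lo, hi = 0.0, delta + 10.0 * u_p   # the answer is bracketed by [lo, hi]
    for _ in range(200):               # bisection on the monotonic coverage
        mid = 0.5 * (lo + hi)
        if coverage(mid, delta, u_p) < confidence:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For two laboratories with equal means the result is close to 1.96 u_p, and for a difference of means much larger than u_p it approaches |m2 − m1| + 1.645 u_p.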

This confidence interval represents the range within which the two laboratories' reported means are expected to agree
with a probability of 95 %. This is the tightest range within which the comparison can be said to have
demonstrated that the laboratories’ means agree at this confidence level, without resorting to ad-hoc corrections
or the reintroduction of laboratory units.

Figure 1 shows a comparison as the joint probability distribution over all possible pairs of measurement results
of the same physical quantity. Projections of this distribution onto x (in Lab 1), y (in Lab 2) and z’ = (x-y)/√2 (a
difference-like variable) are also shown. A single parameter interpretation of their equivalence as described
above is shown at the top of the figure and is denoted as QDE0.95. A two parameter description is illustrated at
the bottom of the figure. The two parameters are the difference of the means and the measurement pair
uncertainty. The two parameter form does contain more information than QDE0.95, but end users rarely want
this extra information.

Figure 2 is a plot of the values of the confidence interval QDE0.95 versus the parameter $\Gamma_{k=1} = |m_2 - m_1|/u_p$ for
normal probability distributions, with $u_p = \sqrt{u_1^2 + u_2^2}$. As expected, the curve starts at $2u_p$ (that is, a full
confidence interval that is $4u_p$ wide) in the limit of $\Gamma_{k=1} = 0$, and increases linearly with $\Gamma_{k=1}$ in the limit as
$\Gamma_{k=1}$ becomes large. A function which is simple to calculate and is accurate to better than 1 % of QDE0.95 is given by:

$$\mathrm{QDE}_{0.95} \approx |m_2 - m_1| + \left\{ 1.645 + 0.3295 \exp\left[ -4.05\,|m_2 - m_1|/u_p \right] \right\} u_p \qquad (1)$$

Another expression is available [2] to treat cases with specified, finite degrees of freedom.
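
As an illustration, Eq. 1 can be transcribed directly into a short function. This is a sketch under the assumptions of Eq. 1 (normal distributions, effectively infinite degrees of freedom); the function name `qde95_approx` and its argument names are introduced here and do not come from [2].

```python
import math

def qde95_approx(delta_m, u_p):
    """Approximate QDE0.95 from Eq. 1.

    delta_m: difference of the two means, m2 - m1 (sign is ignored).
    u_p:     combined standard uncertainty of the pair (k = 1), u_p > 0.
    """
    d = abs(delta_m)
    return d + (1.645 + 0.3295 * math.exp(-4.05 * d / u_p)) * u_p

# With equal means the result is near 2*u_p; for large differences it
# tends to |m2 - m1| + 1.645*u_p.
print(qde95_approx(0.0, 1.0))
print(qde95_approx(5.0, 1.0))
```

Results are returned in the units of the inputs, so differences and uncertainties in mK give QDE0.95 in mK.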


The normalized error, Γk=1 , is often used as a test of the “null hypothesis” when comparison measurements are
assumed to have the same means. It is calculated as the difference of the means divided by the combined
standard uncertainty of the pair of measurements:

$$\Gamma_{k=1} = \frac{|m_2 - m_1|}{\sqrt{u_1^2 + u_2^2}} \qquad (2)$$

The normalized error’s subscript is a reminder that this definition is not normalized to an expanded uncertainty
but uses a coverage factor k=1. Although Γk=1 is well defined numerically, an additional criterion is required to
express equivalence [3-5]. If Γk=1 ≤ 1, then the measurements are often deemed to be in “satisfactory
agreement”. The measurements are often said to be in “disagreement” if Γk=1 ≥ 2, while the range 1 < Γk=1 < 2
is “questionable” or unresolved. The normalized error is only a weak test for equivalence, since only within the
broadest interval of ±2u p (including travel uncertainty) and in the most favorable circumstance of Γk=1 = 0 does
the demonstrated confidence reach 95 %.
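
The weakness of the test can be made concrete with a short calculation. The sketch below assumes normal distributions, evaluates Eq. 2, applies the acceptance bands quoted above, and computes the probability that the two laboratories' results differ by no more than 2u_p for a given Γk=1. All function names are assumptions made for this illustration.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normalized_error(m1, u1, m2, u2):
    """Eq. 2: difference of means over the combined standard uncertainty."""
    return abs(m2 - m1) / math.sqrt(u1**2 + u2**2)

def band(gamma):
    """Acceptance band commonly attached to the normalized error."""
    if gamma <= 1.0:
        return "satisfactory agreement"
    if gamma >= 2.0:
        return "disagreement"
    return "questionable"

def confidence_within_2up(gamma):
    """Probability that the difference of results lies within +/- 2*u_p,
    assuming a normal distribution centred gamma*u_p away from zero."""
    return normal_cdf(2.0 - gamma) - normal_cdf(-2.0 - gamma)

for g in (0.0, 0.5, 1.0, 2.0):
    print(f"gamma = {g:3.1f}  {band(g):22s}  "
          f"confidence within 2u_p = {confidence_within_2up(g):5.3f}")
```

Under these assumptions the confidence is about 95 % only at Γk=1 = 0; it falls to about 84 % at Γk=1 = 1, which is still classed as “satisfactory agreement”.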

Using Γk=1 as the single parameter for quantifying equivalence can be misleading. In particular, the confidence
for an “accepted equivalent” interval has not usually been calculated for the measured values of Γk=1 . For many
practical cases, confidence levels for “agreement” are disturbingly low and much less than the 95% confidence
that is generally associated with calibration reports. A further weakness in using Γk=1 with these criteria is that
the only practical way of increasing the confidence for demonstrated equivalence is to persuade one or both
parties to inflate artificially their uncertainty budget.


In order to place the discussion of quantified equivalence into a temperature context, tabulated results are given
below for three recently published international comparisons. The work at the triple point of water [6] and the
triple point of mercury [7] was done with a travelling cell using standard (non-travelling) platinum resistance
thermometers, and the high temperature comparison at 1000 °C [8] used a travelling radiation thermometer. In
each case, the published data includes the observed temperature difference between the participants and the
reference value appropriate to the comparison, along with the uncertainty. In the tables below we summarize the
measured differences in the means, and the expanded uncertainty (U) with coverage factor two (k=2) as is the
normal practice. We calculate the two equivalence parameters: normalized error (Γk=1 ) and QDE0.95.

4.1 Triple Point of Water Comparison

For the BIPM comparison of triple point of water cells [6], the tabulated data lists the results of five
participating laboratories, and the pair uncertainty with respect to the BIPM reference cell. These data are taken
from Table 23 in [6], and the conclusion drawn in the report is that there is excellent agreement among the
national reference cells of the participating laboratories: the maximum difference being smaller than 0.14 mK.
The calculation of the equivalence parameters is straightforward using the equations discussed above, and is
shown in Table 1.

Table 1: BIPM triple point of water comparison results [6] for an indirect comparison of national reference cells.

           Laboratory        TLab – TBIPM (mK)          Uk=2 (mK)           Γk=1       QDE0.95 (mK)
              NPL                   -0.030                0.090            0.667          0.11
              NIST                 +0.091                 0.080            2.275          0.16
           BNM-INM                 +0.088                 0.126            1.397          0.19
             IMGC                  +0.049                 0.178            0.551          0.20
             VNIIM                  -0.046                0.102            0.902          0.13

This comparison illustrates quite clearly the utility of QDE0.95 over Γk=1 and its acceptance band approach to
equivalence. Only three of the five NMIs which participated in the comparison would be considered to have
“acceptable equivalence” to the reference laboratory, based on their normalized error, and one of these
has Γk=1 very nearly equal to unity; one laboratory is in the “questionable” band; and one laboratory is in
“disagreement” with the reference. By contrast, each of the five measured differences can be used to calculate a
95% confidence interval for equivalence with the BIPM reference value. The laboratory which would have
been deemed in “disagreement” using the normalized error method is seen to have a QDE0.95 confidence interval
tighter than one of the laboratories which would have been deemed to “agree” with the reference value. Further
calculation of all possible bilateral demonstrated equivalence for each pair of nations would complete the
analysis of this experiment, and remove the possible bias introduced by naming one laboratory as having the
“reference” triple point of water cell for the comparison. This experimental situation highlights the importance
of quantifying demonstrated equivalence in a statistically rigorous fashion.
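
The entries of Table 1 can be recomputed from the tabulated differences and expanded uncertainties. The sketch below assumes, as the Γk=1 column implies, that the tabulated Uk=2 is the expanded pair uncertainty, so that u_p = Uk=2/2; the function and variable names are introduced here for illustration.

```python
import math

# (laboratory, T_lab - T_BIPM in mK, expanded pair uncertainty U(k=2) in mK)
# Values copied from Table 1.
TABLE_1 = [
    ("NPL",     -0.030, 0.090),
    ("NIST",    +0.091, 0.080),
    ("BNM-INM", +0.088, 0.126),
    ("IMGC",    +0.049, 0.178),
    ("VNIIM",   -0.046, 0.102),
]

def equivalence_parameters(difference, expanded_u, k=2.0):
    """Return (normalized error, QDE0.95) using Eq. 2 and Eq. 1."""
    u_p = expanded_u / k            # standard pair uncertainty (k = 1)
    delta = abs(difference)
    gamma = delta / u_p
    qde95 = delta + (1.645 + 0.3295 * math.exp(-4.05 * gamma)) * u_p
    return gamma, qde95

for lab, diff, big_u in TABLE_1:
    gamma, qde95 = equivalence_parameters(diff, big_u)
    print(f"{lab:8s} gamma = {gamma:5.3f}   QDE0.95 = {qde95:4.2f} mK")
```

The printed values should agree with the last two columns of Table 1 to within rounding.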

4.2 Comparison at 1000 °C

The comparison of ITS-90 in the range from 800 °C to 2000 °C [8] used a travelling radiation thermometer
operating at 0.98 µm which was supplied and calibrated by the pilot laboratory, NPL. This instrument was used
to measure local optical sources at several temperatures, including 1000 °C. The expanded uncertainty (k=2) of
the calibration of the thermometer for the calculation of the compatibility parameters is 1.0 °C, and is included
in the combined pair uncertainty.

In this comparison, all of the laboratories are considered to be “in agreement” with the reference value. The lack
of correlation between the magnitude of Γk=1 and a quantified equivalence interval is apparent from an
inspection of the results for Nmi/VSL shown in Table 2, which has the lowest value for Γk=1 due to the happy
combination of a very small observed difference and a reasonably large uncertainty. In contrast, the confidence
interval for demonstrated equivalence between Nmi/VSL and NPL is in the middle of the values calculated for
this experiment. Similarly, the IMGC Γk=1 value is the highest, while its QDE0.95 interval is the smallest. These
observations confirm the weakness of using normalized error as a figure of merit when evaluating equivalence.

Table 2: NPL transfer standard infrared radiation thermometer comparison results at 1000 °C [8]. The measured
differences in the means were taken from data plotted in separate figures for each laboratory; the expanded
uncertainty for each measurement was taken from the associated standard uncertainty table.

           Laboratory          TLab – TNPL (°C)        Uk=2 (°C)          Γk=1       QDE0.95 (°C)
             IMGC                     0.33                0.8             0.83          0.99
              INM                     0.48                2.0             0.48          2.17
              PTB                     0.29                1.0             0.58          1.13
            Nmi/VSL                   0.13                1.4             0.19          1.39
               SP                     0.45                1.8             0.50          1.97

4.3 Triple Point of Mercury Comparison

Local realizations of the Hg point were compared in eleven laboratories against a travelling cell supplied by the
pilot laboratory, BNM-INM. The average temperature difference between a local cell and the circulating cell,
with uncertainty, was reported for each of the participating laboratories. Although this comparison was not a
Key Comparison of national Hg values, and some specified uncertainty components were omitted from the
budgets at each laboratory, the results are nevertheless amenable to equivalence analysis. No data for the
experiments at the pilot laboratory was included in [7]. In order to evaluate the compatibility parameters for each
laboratory with respect to the pilot, an uncertainty Uk=2 = 0.2 mK is assumed for the BNM-INM difference,
whose value is taken to be zero. The comparison data is summarized in Table 3, and the full matrix of bilateral
equivalence and comparison parameters is shown in Table 4. Even for such a large comparison, it is possible to
summarize the demonstrated equivalence in a single chart, without resorting to a “comparison reference value”.
It is also clear that many of the bilateral equivalences can be calculated even when a pilot laboratory has not
published complete results. The values above the diagonal in Table 4 are the QDE0.95 confidence intervals, and
the italicized values below the diagonal are Γk=1 , all calculated as pairwise equivalence parameters. This same
full matrix calculation of bilateral equivalences could also be performed for the comparisons discussed above.

There is, of course, much current debate over the nature of the comparison reference value, and careful
consideration must be taken when determining this number (be it a simple or weighted average, or other
construction from the comparison data). Using QDE0.95 in this type of matrix form sidesteps this thorny issue
entirely, leaving no laboratory in the position of feeling that its deviation from the “reference” is
misrepresentative of its measurement capability. An additional benefit of the full matrix presentation of
demonstrated equivalence is that it is simple to read, directly from the table, to what extent any pair of
participants’ measurements are compatible, without the necessity of any further calculations being performed:
all of the arithmetic has already been done for the interested reader.
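
A full bilateral matrix of this kind can be generated with a few lines of code. The sketch below is an illustration only: it takes the differences and expanded uncertainties from Table 3, assumes that each laboratory's standard uncertainty is Uk=2/2 and that the results of different laboratories are uncorrelated, and applies Eq. 1 and Eq. 2 to every pair. The names `LABS`, `bilateral` and `equivalence_matrix` are introduced here.

```python
import math

# (lab number, name, T_lab - T_circ in mK, U(k=2) in mK), from Table 3.
# The lab 0 entries are the values assumed in the text, not data from [7].
LABS = [
    (0, "BNM-INM", 0.0, 0.2),
    (1, "CEM", -0.011, 0.28),
    (2, "CMA", 0.02, 0.19),
    (3, "DM", -0.100, 0.16),
    (4, "IMGC", 0.116, 0.15),
    (5, "IPQ", -0.12, 0.22),
    (6, "NMI", 0.17, 0.20),
    (7, "NPL", 0.06, 0.12),
    (8, "OFMET", -0.03, 0.30),
    (9, "PTB", -0.25, 0.34),
    (10, "SP", -0.42, 0.31),
]

def bilateral(i, j):
    """Return (normalized error, QDE0.95 in mK) for laboratories i and j."""
    _, _, d_i, big_u_i = LABS[i]
    _, _, d_j, big_u_j = LABS[j]
    delta = abs(d_j - d_i)
    u_p = math.hypot(big_u_i / 2.0, big_u_j / 2.0)  # uncorrelated pair
    gamma = delta / u_p                                             # Eq. 2
    qde95 = delta + (1.645 + 0.3295 * math.exp(-4.05 * gamma)) * u_p  # Eq. 1
    return gamma, qde95

def equivalence_matrix():
    """QDE0.95 above the diagonal, normalized error below it."""
    n = len(LABS)
    rows = []
    for i in range(n):
        row = []
        for j in range(n):
            if i == j:
                row.append("  -  ")
            else:
                gamma, qde95 = bilateral(i, j)
                row.append(f"{qde95:5.2f}" if j > i else f"{gamma:5.2f}")
        rows.append(row)
    return rows

for i, row in enumerate(equivalence_matrix()):
    print(f"{i:2d} " + " ".join(row))
```

Because every entry depends only on the two laboratories concerned, the matrix can be produced without choosing a comparison reference value.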

Table 3: BNM-INM triple point of mercury comparison results [7]. The ‘?’ indicates that the tabulated values
were assumed for the purpose of calculating Table 4, and not obtained from [7].

               Laboratory     TLab – TCirc (mK)       Uk=2 (mK)           Γk=1       QDE0.95 (mK)
           0    BNM-INM                0?                0.2?
           1       CEM              -0.011               0.28             0.06            0.55
           2      CMA               +0.02                0.19             0.14            0.43
           3       DM               -0.100               0.16             0.78            0.32
           4      IMGC             +0.116                0.15             0.93            0.33
           5       IPQ               -0.12               0.22             0.81            0.38
           6       NMI              +0.17                0.20             1.20            0.41
           7       NPL              +0.06                0.12             0.51            0.29
           8     OFMET               -0.03               0.30             0.17            0.49
           9       PTB               -0.25               0.34             1.27            0.58
          10        SP               -0.42               0.31             2.28            0.72

Table 4: BNM-INM Hg point comparison results [7], expressed as bilateral comparisons. Above the diagonal is
the demonstrated equivalence at the 95 % confidence level, expressed as the half-width QDE0.95 of the interval
[−QDE0.95, +QDE0.95] in mK (calculated from Table 3 and Eq. 1). Below the diagonal is the dimensionless comparison
parameter, Γk=1, calculated from Table 3 and Eq. 2. The ‘?’ indicates that the values used in the Lab 0 calculations
were not obtained from [7].

  Lab   0         1         2         3         4         5         6         7         8         9         10
  0     !         0.34? mK  0.27? mK  0.31? mK  0.32? mK  0.37? mK  0.40? mK  0.26? mK  0.36? mK  0.57? mK  0.72? mK
  1     0.06?     !         0.34 mK   0.36 mK   0.39 mK   0.41 mK   0.46 mK   0.33 mK   0.40 mK   0.60 mK   0.75 mK
  2     0.14?     0.18      !         0.33 mK   0.30 mK   0.38 mK   0.38 mK   0.23 mK   0.36 mK   0.59 mK   0.74 mK
  3     0.78?     0.55      0.97      !         0.40 mK   0.27 mK   0.48 mK   0.32 mK   0.36 mK   0.46 mK   0.61 mK
  4     0.93?     0.80      0.79      1.97      !         0.46 mK   0.27 mK   0.22 mK   0.42 mK   0.67 mK   0.82 mK
  5     0.81?     0.61      0.96      0.15      1.77      !         0.53 mK   0.39 mK   0.40 mK   0.47 mK   0.61 mK
  6     1.20?     1.05      1.09      2.11      0.43      1.95      !         0.30 mK   0.50 mK   0.74 mK   0.89 mK
  7     0.51?     0.47      0.36      1.60      0.58      1.44      0.94      !         0.36 mK   0.61 mK   0.75 mK
  8     0.17?     0.09      0.28      0.41      0.87      0.48      1.11      0.56      !         0.59 mK   0.74 mK
  9     1.27?     1.09      1.39      0.80      1.97      0.64      2.13      1.72      0.97      !         0.55 mK
  10    2.28?     1.96      2.42      1.83      3.11      1.58      3.20      2.89      1.81      0.74      !


Equivalence statements, of use to the clients of calibration laboratories, require quantification and justification
based on measurement comparisons. We have outlined a method which allows an unambiguous and statistically
justifiable statement of equivalence expressible as a single, easily calculated, bilateral parameter, and have
contrasted this method with the use of the normalized difference of the means by including worked examples
from recently published thermometric comparisons. By calculating a full matrix of bilateral demonstrated
equivalence parameters for each comparison, no artificial “reference value” needs to be calculated from the
comparison data, and yet the presentation of all of the information may still be summarized in a concise, easily-
understood tabular format.

[Figure 1. Top panel: One-Parameter 95% Confidence Interval for Quantified Demonstrated Equivalence, y = x ± QDE0.95.
Bottom panel: Two-Parameter 95% Confidence Interval for Quantified Demonstrated Difference, y = x + (m2 − m1) ± 2u_p.
Axes: x, y and the difference-like projection z’ = (x − y)/√2.]

Figure 1 A graphical plot of two comparison results and the interpretation of their equivalence. At the top is a
single parameter description of the equivalence denoted as QDE 0.95. At the bottom is the difference of the
means and the measurement pair uncertainty.

[Figure 2. Vertical axis: 95% Confidence Interval. Horizontal axis: |m2 − m1|/u_p, from 0 to 4.]

Figure 2 The 95 % confidence interval parameter, QDE0.95, versus the difference of the means (each in units of
the combined standard uncertainty of the measurement pair, $u_p = (u_1^2 + u_2^2)^{1/2}$). The 95 % confidence interval is


[1] ISO Guide to the Expression of Uncertainty in Measurement, (International Organization for
    Standardization), Geneva, Switzerland, 1993

[2] Wood, B.M., Douglas, R.J., Confidence-Interval Interpretation of a Measurement Pair for Quantifying a
    Comparison, Metrologia, 1998, 35, pp. 187-196

[3] ISO, Guide 43-1 and Guide 43-2: Proficiency testing by interlaboratory comparisons, (International
    Organization for Standardization), Geneva, Switzerland, 1996

[4] EUROMET Guidance Document #3, Guidelines for the organization of comparisons, available as DFM-
    1997-R20, from Danish Institute of Fundamental Metrology, Lyngby, Denmark, 1997

[5] NORAMET Document #8, Mutual Recognition of Calibration Services of National Metrology Institutes,
    available from National Research Council of Canada, Ottawa, Canada, 1998

[6] Pello, R., Goebel, R., Köhler, R., Report on the international comparison of water triple point cells, Comité
    Consultatif de Thermométrie, 1996. CCT/96-1

[7] Hermier, Y., Bonnier, G., Intercomparison of Mercury point cells, EUROMET Report No. 280, 1997

[8] Machin, G., Ricolfi, T., Battuello, M., Negro, G., Jung, H.-J., Bloembergen, P., Bosma, R., Ivarsson, J.,
    Weckström, T., Comparison of the ITS-90 using a transfer standard infrared radiation thermometer
    between seven EU national metrological institutes, Metrologia, 1996, 33, pp. 197-206

Contact point:     R.J. Douglas, National Research Council of Canada, Institute for National Measurement
                   Standards, Ottawa, Canada, tel. +1 613 993-5186 fax +1 613 952-1394 e-mail
