VOL. 20 NO. 2 (2008)

TONY DAVIES COLUMN

Back to basics: qualitative analysis, introduction
A.M.C. Davies
Norwich Near Infrared Consultancy, 75 Intwood Road, Cringleford, Norwich NR4 6AA, UK. E-mail: td@nnirc.co.uk


Tom Fearn
Department of Statistical Science, University College London, Gower Street, London WC1E 6BT, UK.
E-mail: tom@stats.ucl.ac.uk



More “Back-to-basics”
In December 2004 I made the decision that this column should make a return visit to topics in quantitative analysis that had been covered (or sometimes just mentioned) in previous columns. Within seconds of that decision I realised that we would need to treat qualitative analysis to the same revision. The quantitative aspects turned out to be a three-year marathon* journey but we have at last arrived at the start of what was conceived as “Part 2”.

* Sorry! Races and currently marathons are uppermost in my mind because I have a place in the London Marathon to run for the homelessness charity “Shelter” on 13th April. Would you like to sponsor me? You can at: www.justgiving.com/tonydavies1

I have been working on problems in qualitative analysis for 40 years! That’s before chemometrics as a topic began and I regard them as being much more demanding than quantitative analysis. There are several reasons for this, some more obvious than others:
■ qualitative analysis is not a single problem,
■ some humans are very good at looking at spectra and making qualitative decisions,
■ solutions require more statistics than are needed for quantitative analysis.

Problems in qualitative analysis
From the classical point of view, qualitative analysis is divided into supervised or unsupervised methods but the number of different objects is also very important. The question “Is this sample compound A or compound B?” is different to the question “Is this sample compound A, or B, or C, or ..., or Z?” and very different to the request “Identify this sample”.

Human skills
Spectroscopists have been looking at spectra, and giving answers to all three types of questions listed above, for a very long time. I do not know of any spectroscopists who would claim to be able to look at a spectrum and estimate the percentage of ingredient x, but computers can, so qualitative analysis must be a less difficult problem!

A recent query from one of our readers (always welcome!) resulted in an e-mail discussion with some of the world experts on IR qualitative analysis. Their view is that qualitative analysis is too difficult to trust to a computer! As Peter Griffiths points out in his recent second edition of Fourier Transform Infrared Spectrometry, “... a library search cannot identify an unknown unless the unknown is present in the library” [1].

Statistics in qualitative analysis
In quantitative analysis if we have the RMSEP then we have all the statistics we need (some others may be useful). In qualitative analysis we need to know standard errors but we also need to know about distance measures, decision boundaries, prior probabilities, misclassification costs, ... . Luckily for me I have had Tom Fearn to guide and advise me for the last 25 years and most of what follows is Tom’s work, much of it previously published in the “Chemometric Space” in NIR news [2] or in our frequently referenced book [3].

Tony Davies

Supervised and unsupervised classification
Statistical classification has a number of interesting applications in spectroscopy. For NIR data in particular, it has been used in a number of scientific publications and practical applications.

There is an important distinction between two different types of classification: so-called unsupervised and supervised classification. The former of these usually goes under the name of cluster analysis and relates to situations with little or no prior information about group structures in the data. The goal of the techniques in this class of methods is to find or identify tendencies of samples to cluster in sub-groups without the use of any prior information. This is a type of analysis that is often used at an early stage of an investigation, to explore, for example, whether there may be samples from different sub-populations in the dataset, for instance different varieties of a grain or samples of chemicals from different suppliers. In this sense, cluster analysis has similarities with the problem of identifying outliers in a quantitative data set.

www.spectroscopyeurope.com                                                                                 SPECTROSCOPYEUROPE 15
Cluster analysis can be performed using very simple visual techniques such as PCA, but it can be done more formally, for instance by one of the hierarchical methods. These are techniques that use distances between objects to identify samples that are close to each other. The hierarchical methods lead to so-called dendrograms, which are visual aids for deciding when to stop a clustering process.

The other type of classification, supervised classification, is also known under the name of discriminant analysis. This is a class of methods primarily used to build classification rules for a number of pre-specified sub-groups. These rules are later used for allocating new and unknown samples to the most probable sub-group. Another important application of discriminant analysis is to help in interpreting differences between groups. Discriminant analysis can be looked upon as a kind of qualitative calibration, where the quantity to be calibrated for is not a continuous measurement value, but a categorical group variable. Discriminant analysis can be done in many different ways; some of these will be described in following columns. Some of the methods are quite model orientated, while others are very flexible and can be used regardless of structures of the sub-groups.

Some of the material in earlier columns on quantitative analysis is also relevant to classification. Topics and techniques such as collinearity, data compression, scatter correction, validation, sample selection, outliers and spectral correction are all as important for this area as they are for quantitative calibration.

Distance measurements used in classification
It seems a good idea before we begin a discussion of techniques to describe some of the ways of measuring distance that we will be using. The message is that there are some very simple though perhaps non-obvious relationships between some of these measures.

Spectra as vectors
A spectrum $x = (x_1, x_2, \ldots, x_p)$ measured at p wavelengths can be thought of as a point in p-dimensional space by taking each of the p measurements as the coordinate in one of the dimensions. We may equally well think of the spectra as vectors, by joining the point representation of the spectrum to the origin with a line. As usual, the trick to understanding the maths is to consider the case p = 3, for which it is easy to draw the picture. Figure 1 shows two vectors in a three-dimensional space.

[Figure 1. Two spectra as vectors x and z in a three-dimensional space.]

Euclidean distance
Euclidean distance, D, is the “natural measurement” of distance between two objects.

Geometrically, D is the length of the line joining the ends of the two vectors in the figure. For the multi-dimensional case it is defined as:

$$D^2 = (x_1 - z_1)^2 + (x_2 - z_2)^2 + \ldots + (x_p - z_p)^2 = \sum_i (x_i - z_i)^2$$

which expands to:

$$D^2 = \sum_i x_i^2 + \sum_i z_i^2 - 2 \sum_i x_i z_i$$

Angles between vectors
Geometrically, we can just measure the angle θ between the two vectors in Figure 1. If the vectors represent spectra, then we can call this the angle between the spectra. It is clear from the picture that the more similar the two spectra, the closer together will be the two points and the smaller will be the angle between the corresponding vectors. Of course it is usually preferable to use a formula to compute the angle between $x = (x_1, x_2, \ldots, x_p)$ and $z = (z_1, z_2, \ldots, z_p)$ directly from the measurements. The relevant formula is the one that relates the so-called dot product of the two vectors

$$x \cdot z = x_1 z_1 + x_2 z_2 + \ldots + x_p z_p = \sum_i x_i z_i$$

to their lengths $|x|$ and $|z|$ and the angle θ between them. The formula is

$$x \cdot z = |x|\,|z| \cos\theta \tag{1}$$

where

$$|x|^2 = x_1^2 + x_2^2 + \ldots + x_p^2 = \sum_i x_i^2$$

and

$$|z|^2 = z_1^2 + z_2^2 + \ldots + z_p^2 = \sum_i z_i^2$$

Thus, to compute the angle we compute the dot product and the two lengths, and then use Equation (1) to find cos θ, and hence θ.

Standardising the length
If we are going to be computing a lot of these angles, it makes sense to standardise all the spectra so that each has length 1. This is achieved for x by dividing each $x_i$ by $|x|$.

In the picture, the vectors keep their direction but are rescaled in length to lie on a sphere of radius 1. Then $|x| = |z| = 1$ and Equation (1) reduces to

$$x \cdot z = \cos\theta \tag{2}$$

Now the angle and the dot product are equivalent measures of distance in the sense that each can be calculated simply from the other. Note though that the maximum dot product, 1, corresponds to the minimum angle, 0, whilst a dot product of 0 corresponds to an angle of π/2 = 90°. This equivalence means that we could equally well define a region of similarity around x as all spectra that have a dot product with x exceeding d, or as all spectra that make an angle of less than $\cos^{-1} d$ with x.

Relation with Euclidean distance
Using standardised spectra, there is a fairly simple relation between these two measures and the Euclidean distance D. If

$$D^2 = \sum_i x_i^2 + \sum_i z_i^2 - 2 \sum_i x_i z_i$$

then when the vectors are standardised and the first two terms are each 1, we have

$$D^2 = 2(1 - x \cdot z) = 2(1 - \cos\theta)$$
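
The relations between dot product, angle and Euclidean distance can be checked numerically. The following is only a minimal sketch in plain Python, not code from the column: the function names and the two three-point "spectra" are made up for illustration.

```python
import math

def dot(x, z):
    """Dot product: the sum of x_i * z_i."""
    return sum(xi * zi for xi, zi in zip(x, z))

def length(x):
    """Length |x|, the square root of the sum of x_i squared."""
    return math.sqrt(dot(x, x))

def standardise(x):
    """Rescale x to length 1 by dividing each element by |x|."""
    n = length(x)
    return [xi / n for xi in x]

def euclidean_sq(x, z):
    """Squared Euclidean distance D^2 between x and z."""
    return sum((xi - zi) ** 2 for xi, zi in zip(x, z))

def angle(x, z):
    """Angle in radians between x and z, from Equation (1)."""
    c = dot(x, z) / (length(x) * length(z))
    c = max(-1.0, min(1.0, c))  # guard against rounding just outside [-1, 1]
    return math.acos(c)

# Two hypothetical spectra measured at p = 3 wavelengths
x = [1.0, 2.0, 2.0]
z = [2.0, 1.0, 2.0]

theta = angle(x, z)
xs, zs = standardise(x), standardise(z)

cos_theta = dot(xs, zs)        # Equation (2): dot product of standardised spectra
d2 = euclidean_sq(xs, zs)      # squared distance between standardised spectra
d2_from_angle = 2.0 * (1.0 - math.cos(theta))
```

For these two vectors the dot product is 8 and both lengths are 3, so cos θ = 8/9 and the squared distance between the standardised spectra is 2/9, whichever way it is computed.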

Thus, for standardised spectra, the dot product, angle and Euclidean distance are all three equivalent measures of distance. A region of similarity defined by any of the three would be all spectra that lie within a circle around x on the surface of the sphere.

The dot product is easily the quickest to calculate, so would be the preferred measure from a computational point of view. For non-standardised spectra the three measures would, of course, all be different.

Relation with correlation
Another measure sometimes used to compare spectra is the correlation coefficient between them. To relate this to the distance measures above we need to centre as well as scale the spectra. Suppose we transform from x to x*, where the ith element $x_i^*$ of x* is given by

$$x_i^* = (x_i - m_x) / l_x \tag{3}$$

where

$$m_x = \sum_i x_i / p$$

is the mean of the elements in x and

$$l_x^2 = \sum_i (x_i - m_x)^2$$

is the squared length of x after it has been centred. Then the dot product between x* and the similarly centred and scaled z* is

$$x^* \cdot z^* = \frac{\sum_i (x_i - m_x)(z_i - m_z)}{\sqrt{\sum_i (x_i - m_x)^2 \sum_i (z_i - m_z)^2}}$$

which, by definition, is the correlation coefficient between x and z. Thus we have yet another equivalence: the correlation is the same as the dot product if we centre and scale the spectra before computing the latter.

The transformation in Equation (3) looks rather similar to the well-known SNV standardisation [4,5]. The only difference is that SNV would normally use $s_x$ as a divisor rather than $l_x$, where

$$s_x^2 = l_x^2 / (p - 1)$$

The only difference this would make is that the dot product now becomes p − 1 times the correlation. This does not change the fact that the two are equivalent; it just introduces a scale factor into the equation relating them. Thus, in this sense, using the correlation coefficient (or its square) as a distance measure is essentially the same as pretreating with SNV and using either the angle or the dot product between the spectra as the distance measure.

References
1. P.R. Griffiths and J.A. de Haseth, Fourier Transform Infrared Spectrometry, 2nd Edn. John Wiley & Sons, Inc., Hoboken, NJ, USA (2007).
2. T. Fearn, NIR news 14(2), 6–7 (2003).
3. T. Næs, T. Isaksson, T. Fearn and T. Davies, A User-Friendly Guide to Multivariate Calibration and Classification. NIR Publications, Chichester, UK (2002).
4. J. Barnes, M.S. Dhanoa and S.J. Lister, Appl. Spectrosc. 43, 772 (1989).
5. A.M.C. Davies and T. Fearn, Spectrosc. Europe 19(6), 15 (2007).
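
Numerical check of the relation with correlation. This is a minimal sketch in plain Python, not code from the column: the function names, the use_snv switch and the two four-point "spectra" are made up for illustration. It assumes each spectrum has at least two points and is not constant, so the divisor is never zero.

```python
import math

def dot(x, z):
    """Dot product: the sum of x_i * z_i."""
    return sum(xi * zi for xi, zi in zip(x, z))

def centre_and_scale(x, use_snv=False):
    """Centre x on its mean m_x, then divide by l_x as in Equation (3).

    With use_snv=True the divisor is s_x = l_x / sqrt(p - 1) instead,
    the divisor the column says SNV would normally use.
    """
    p = len(x)
    m = sum(x) / p
    l = math.sqrt(sum((xi - m) ** 2 for xi in x))
    divisor = l / math.sqrt(p - 1) if use_snv else l
    return [(xi - m) / divisor for xi in x]

def correlation(x, z):
    """Correlation coefficient between x and z, straight from its definition."""
    p = len(x)
    mx, mz = sum(x) / p, sum(z) / p
    num = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)
                    * sum((zi - mz) ** 2 for zi in z))
    return num / den

# Two hypothetical spectra measured at p = 4 wavelengths
x = [1.0, 2.0, 4.0, 3.0]
z = [2.0, 1.0, 3.0, 5.0]
p = len(x)

r = correlation(x, z)
dot_scaled = dot(centre_and_scale(x), centre_and_scale(z))
dot_snv = dot(centre_and_scale(x, use_snv=True),
              centre_and_scale(z, use_snv=True))
```

Here dot_scaled should agree with r, and dot_snv with (p − 1) times r, up to rounding.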



