VOL. 20 NO. 2 (2008) TONY DAVIES COLUMN
SPECTROSCOPYEUROPE www.spectroscopyeurope.com

Back to basics: qualitative analysis, introduction

A.M.C. Davies
Norwich Near Infrared Consultancy, 75 Intwood Road, Cringleford, Norwich NR4 6AA, UK. E-mail: td@nnirc.co.uk

Tom Fearn
Department of Statistical Science, University College London, Gower Street, London WC1E 6BT, UK. E-mail: tom@stats.ucl.ac.uk

More “Back-to-basics”

In December 2004 I made the decision that this column should make a return visit to topics in quantitative analysis that had been covered (or sometimes just mentioned) in previous columns. Within seconds of that decision I realised that we would need to treat qualitative analysis to the same revision. The quantitative aspects turned out to be a three-year marathon* journey but we have at last arrived at the start of what was conceived as “Part 2”.

* Sorry! Races and currently marathons are uppermost in my mind because I have a place in the London Marathon to run for the homelessness charity “Shelter” on 13th April. Would you like to sponsor me? You can at: www.justgiving.com/tonydavies1

I have been working on problems in qualitative analysis for 40 years! That’s before chemometrics as a topic began and I regard them as being much more demanding than quantitative analysis. There are several reasons for this, some more obvious than others:

■ qualitative analysis is not a single problem,
■ some humans are very good at looking at spectra and making qualitative decisions,
■ solutions require more statistics than are needed for quantitative analysis.

Problems in qualitative analysis

From the classical point of view, qualitative analysis is divided into supervised or unsupervised methods but the number of different objects is also very important. The question “Is this sample compound A or compound B?” is different to the question “Is this sample compound A, or B, or C, or ..., or Z?” and very different to the request “Identify this sample”.

Human skills

Spectroscopists have been looking at spectra, and giving answers to all three types of questions listed above, for a very long time. I do not know of any spectroscopists who would claim to be able to look at a spectrum and estimate the percentage of ingredient x, but computers can, so qualitative analysis must be a less difficult problem!

A recent query from one of our readers (always welcome!) resulted in an e-mail discussion with some of the world experts on IR qualitative analysis. Their view is that qualitative analysis is too difficult to trust to a computer! As Peter Griffiths points out in his recent second edition of Fourier Transform Infrared Spectrometry, “... a library search cannot identify an unknown unless the unknown is present in the library” [1].

Statistics in qualitative analysis

In quantitative analysis if we have the RMSEP then we have all the statistics we need (some others may be useful). In qualitative analysis we need to know standard errors but we also need to know about distance measures, decision boundaries, prior probabilities, misclassification costs, ... . Luckily for me I have had Tom Fearn to guide and advise me for the last 25 years and most of what follows is Tom’s work, much of it previously published in the “Chemometric Space” in NIR news [2] or in our frequently referenced book [3].

Tony Davies

Supervised and unsupervised classification

Statistical classification has a number of interesting applications in spectroscopy. For NIR data in particular, it has been used in a number of scientific publications and practical applications.

There is an important distinction between two different types of classification: so-called unsupervised and supervised classification. The former of these usually goes under the name of cluster analysis and relates to situations with little or no prior information about group structures in the data. The goal of the techniques in this class of methods is to find or identify tendencies of samples to cluster in sub-groups without the use of any prior information. This is a type of analysis that is often used at an early stage of an investigation, to explore, for example, whether there may be samples from different sub-populations in the dataset, for instance different varieties of a grain or samples of chemicals from different suppliers. In this sense, cluster analysis has similarities with the problem of identifying outliers in a quantitative data set.

Cluster analysis can be performed using very simple visual techniques such as PCA, but it can be done more formally, for instance by one of the hierarchical methods. These are techniques that use distances between objects to identify samples that are close to each other. The hierarchical methods lead to so-called dendrograms, which are visual aids for deciding when to stop a clustering process.

The other type of classification, supervised classification, is also known under the name of discriminant analysis. This is a class of methods primarily used to build classification rules for a number of pre-specified sub-groups. These rules are later used for allocating new and unknown samples to the most probable sub-group. Another important application of discriminant analysis is to help in interpreting differences between groups. Discriminant analysis can be looked upon as a kind of qualitative calibration, where the quantity to be calibrated for is not a continuous measurement value, but a categorical group variable. Discriminant analysis can be done in many different ways; some of these will be described in following columns. Some of the methods are quite model orientated, while others are very flexible and can be used regardless of structures of the sub-groups.

Some of the material in earlier columns on quantitative analysis is also relevant to classification. Topics and techniques such as collinearity, data compression, scatter correction, validation, sample selection, outliers and spectral correction are all as important for this area as they are for quantitative calibration.

Distance measurements used in classification

It seems a good idea before we begin a discussion of techniques to describe some of the ways of measuring distance that we will be using. The message is that there are some very simple though perhaps non-obvious relationships between some of these measures.

Spectra as vectors

A spectrum $\mathbf{x} = (x_1, x_2, \dots, x_p)$ measured at $p$ wavelengths can be thought of as a point in $p$-dimensional space by taking each of the $p$ measurements as the coordinate in one of the dimensions. We may equally well think of the spectra as vectors, by joining the point representation of the spectrum to the origin with a line. As usual, the trick to understanding the maths is to consider the case $p = 3$, for which it is easy to draw the picture. Figure 1 shows two vectors in a three-dimensional space.

Figure 1. Two spectra as vectors x and z in a three-dimensional space.

Euclidean distance

Euclidean distance, $D$, is the “natural measurement” of distance between two objects. Geometrically, $D$ is the length of the line joining the ends of the two vectors in the figure. For the multi-dimensional case it is defined as:

$$D^2 = (x_1 - z_1)^2 + (x_2 - z_2)^2 + \dots + (x_p - z_p)^2 = \sum_i (x_i - z_i)^2$$

which expands to:

$$D^2 = \sum_i x_i^2 + \sum_i z_i^2 - 2 \sum_i x_i z_i$$

Angles between vectors

Geometrically, we can just measure the angle $\theta$ between the two vectors in Figure 1. If the vectors represent spectra, then we can call this the angle between the spectra. It is clear from the picture that the more similar the two spectra, the closer together will be the two points and the smaller will be the angle between the corresponding vectors. Of course it is usually preferable to use a formula to compute the angle between $\mathbf{x} = (x_1, x_2, \dots, x_p)$ and $\mathbf{z} = (z_1, z_2, \dots, z_p)$ directly from the measurements. The relevant formula is the one that relates the so-called dot product of the two vectors

$$\mathbf{x} \cdot \mathbf{z} = x_1 z_1 + x_2 z_2 + \dots + x_p z_p = \sum_i x_i z_i$$

to their lengths $|\mathbf{x}|$ and $|\mathbf{z}|$ and the angle $\theta$ between them. The formula is

$$\mathbf{x} \cdot \mathbf{z} = |\mathbf{x}|\,|\mathbf{z}| \cos\theta \qquad (1)$$

where

$$|\mathbf{x}|^2 = x_1^2 + x_2^2 + \dots + x_p^2 = \sum_i x_i^2$$

and

$$|\mathbf{z}|^2 = z_1^2 + z_2^2 + \dots + z_p^2 = \sum_i z_i^2$$

Thus, to compute the angle we compute the dot product and the two lengths, and then use Equation (1) to find $\cos\theta$, and hence $\theta$.

Standardising the length

If we are going to be computing a lot of these angles, it makes sense to standardise all the spectra so that each has length 1. This is achieved for $\mathbf{x}$ by dividing each $x_i$ by $|\mathbf{x}|$. In the picture, the vectors keep their direction but are rescaled in length to lie on a sphere of radius 1. Then $|\mathbf{x}| = |\mathbf{z}| = 1$ and Equation (1) reduces to

$$\mathbf{x} \cdot \mathbf{z} = \cos\theta \qquad (2)$$

Now the angle and the dot product are equivalent measures of distance in the sense that each can be calculated simply from the other. Note though that the maximum dot product, 1, corresponds to the minimum angle, 0, whilst a dot product of 0 corresponds to an angle of $\pi/2 = 90°$. This equivalence means that we could equally well define a region of similarity around $\mathbf{x}$ as all spectra that have a dot product with $\mathbf{x}$ exceeding $d$, or as all spectra that make an angle of less than $\cos^{-1} d$ with $\mathbf{x}$.

Relation with Euclidean distance

Using standardised spectra, there is a fairly simple relation between these two measures and the Euclidean distance $D$. If

$$D^2 = \sum_i x_i^2 + \sum_i z_i^2 - 2 \sum_i x_i z_i$$

then when the vectors are standardised and the first two terms are each 1, we have

$$D^2 = 2(1 - \mathbf{x} \cdot \mathbf{z}) = 2(1 - \cos\theta)$$
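The relations between the dot product, the angle and the Euclidean distance can be checked numerically. The following is a minimal Python sketch, not code from the column: the function names and the two three-point "spectra" are made up for illustration, and only the standard library is used.

```python
import math

def dot(x, z):
    """Dot product x.z = sum of x_i * z_i."""
    return sum(xi * zi for xi, zi in zip(x, z))

def length(x):
    """Length |x| = sqrt(sum of x_i^2)."""
    return math.sqrt(dot(x, x))

def angle(x, z):
    """Angle theta in radians, from Equation (1): x.z = |x||z| cos(theta)."""
    cos_theta = dot(x, z) / (length(x) * length(z))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, cos_theta)))

def euclidean_sq(x, z):
    """Squared Euclidean distance D^2 = sum of (x_i - z_i)^2."""
    return sum((xi - zi) ** 2 for xi, zi in zip(x, z))

def standardise(x):
    """Rescale x to length 1 by dividing each element by |x|."""
    norm = length(x)
    return [xi / norm for xi in x]

# Hypothetical three-point "spectra" (illustrative values only).
x = [1.0, 2.0, 2.0]
z = [2.0, 1.0, 2.0]

xs, zs = standardise(x), standardise(z)
theta = angle(x, z)          # standardising changes lengths, not the angle
d2 = euclidean_sq(xs, zs)

print(dot(xs, zs), math.cos(theta))   # Equation (2): the two should agree
print(d2, 2 * (1 - dot(xs, zs)))      # D^2 = 2(1 - x.z) for standardised spectra
```

Each printed pair should agree up to floating-point rounding. For spectra that have not been standardised, the second pair would in general differ.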
Thus, for standardised spectra, the dot product, angle and Euclidean distance are all equivalent measures of distance. A region of similarity defined by any of the three would be all spectra that lie within a circle around $\mathbf{x}$ on the surface of the sphere.

The dot product is easily the quickest to calculate, so would be the preferred measure from a computational point of view. For non-standardised spectra the three measures would, of course, all be different.

Relation with correlation

Another measure sometimes used to compare spectra is the correlation coefficient between them. To relate this to the distance measures above we need to centre as well as scale the spectra. Suppose we transform from $\mathbf{x}$ to $\mathbf{x}^*$, where the $i$th element $x_i^*$ of $\mathbf{x}^*$ is given by

$$x_i^* = (x_i - m_x) / l_x \qquad (3)$$

where

$$m_x = \sum_i x_i / p$$

is the mean of the elements in $\mathbf{x}$ and

$$l_x^2 = \sum_i (x_i - m_x)^2$$

is the squared length of $\mathbf{x}$ after it has been centred. Then the dot product between $\mathbf{x}^*$ and the similarly centred and scaled $\mathbf{z}^*$ is

$$\mathbf{x}^* \cdot \mathbf{z}^* = \frac{\sum_i (x_i - m_x)(z_i - m_z)}{\sqrt{\sum_i (x_i - m_x)^2 \, \sum_i (z_i - m_z)^2}}$$

which, by definition, is the correlation coefficient between $\mathbf{x}$ and $\mathbf{z}$. Thus we have yet another equivalence: the correlation is the same as the dot product if we centre and scale the spectra before computing the latter.

The transformation in Equation (3) looks rather similar to the well-known SNV standardisation [4,5]. The only difference is that SNV would normally use $s_x$ as a divisor rather than $l_x$, where

$$s_x^2 = l_x^2 / (p - 1)$$

The only difference this would make is that the dot product now becomes $p - 1$ times the correlation. This does not change the fact that the two are equivalent; it just introduces a scale factor into the equation relating them. Thus, in this sense, using the correlation coefficient (or its square) as a distance measure is essentially the same as pretreating with SNV and using either the angle or the dot product between the spectra as the distance measure.

References

1. P.R. Griffiths and J.A. de Haseth, Fourier Transform Infrared Spectrometry, 2nd Edn. John Wiley & Sons, Inc., Hoboken, NJ, USA (2007).
2. T. Fearn, NIR news 14(2), 6–7 (2003).
3. T. Næs, T. Isaksson, T. Fearn and T. Davies, A User-Friendly Guide to Multivariate Calibration and Classification. NIR Publications, Chichester, UK (2002).
4. J. Barnes, M.S. Dhanoa and S.J. Lister, Appl. Spectrosc. 43, 772 (1989).
5. A.M.C. Davies and T. Fearn, Spectrosc. Europe 19(6), 15 (2007).
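As a numerical check on the section "Relation with correlation", the sketch below compares the correlation coefficient with the dot product after the centring and scaling of Equation (3), and with the dot product after SNV. It is a minimal illustration under stated assumptions, not code from the column: the function names and the two four-point "spectra" are hypothetical, and SNV is taken here to mean dividing the centred spectrum by s_x with s_x^2 = l_x^2 / (p − 1), as described in the text.

```python
import math

def dot(x, z):
    """Dot product x.z = sum of x_i * z_i."""
    return sum(xi * zi for xi, zi in zip(x, z))

def centre(x):
    """Subtract the mean m_x from every element of x."""
    m = sum(x) / len(x)
    return [xi - m for xi in x]

def centre_and_scale(x):
    """Equation (3): centre x, then divide by l_x, its length after centring."""
    c = centre(x)
    l = math.sqrt(dot(c, c))
    return [ci / l for ci in c]

def snv(x):
    """SNV as described in the text: centre x, then divide by s_x,
    where s_x^2 = l_x^2 / (p - 1)."""
    c = centre(x)
    s = math.sqrt(dot(c, c) / (len(x) - 1))
    return [ci / s for ci in c]

def correlation(x, z):
    """Correlation coefficient computed directly from its definition."""
    cx, cz = centre(x), centre(z)
    return dot(cx, cz) / math.sqrt(dot(cx, cx) * dot(cz, cz))

# Hypothetical four-point "spectra" (illustrative values only).
x = [1.0, 2.0, 4.0, 7.0]
z = [2.0, 1.0, 5.0, 6.0]
p = len(x)

r = correlation(x, z)
r_from_dot = dot(centre_and_scale(x), centre_and_scale(z))
snv_dot = dot(snv(x), snv(z))

print(r, r_from_dot)         # correlation equals the dot product after Equation (3)
print(snv_dot, (p - 1) * r)  # after SNV the dot product is (p - 1) times the correlation
```

Each printed pair should agree up to floating-point rounding, so the choice between l_x and s_x as divisor only rescales the dot product.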