MACHINE LEARNING & PATTERN RECOGNITION An Introduction
Stephen Roberts
sjrob@robots.ox.ac.uk

LECTURE 1, 2: INTRODUCTORY MATERIAL

Books

There are a number of good books on pattern recognition. I would, however, suggest the following:

1. Bishop, Pattern Recognition & Machine Learning, 2006.
2. Bishop, Neural Networks for Pattern Recognition, 1995.
3. Ripley, Pattern Recognition and Neural Networks, 1994.

In combination these cover almost all of the work on this course, and most of them cover the majority of it in isolation.

Background

We, as human beings, can perform a large number of learning & pattern recognition tasks with ease; consider listening to a conversation in a crowded room:

• How do we separate a single source (speaker) from all the others?
• How do we process the words?

Consider further recognising a face in a crowded scene:

• How do we identify faces?
• How do we identify a particular face?

In a medical context:

• How is a tumour recognised in an MRI image?
• How is a prognosis arrived at from patient information?

The last few of these tasks can tax even the best clinicians; can machines ever hope to compete with humans?

Classification and Regression

We will treat both techniques as mapping problems, such that some input variable, x, generates (is mapped to) some output variable y via a mapping function F, i.e.

    F : x → y

Any problem may then be reduced to finding an optimal mapping F. In the case of supervised methods this mapping is 'learned' or optimised through a training data set which consists of input-output pairs (x, t), where each t represents a target or desired output for the corresponding input x. It should be noted that F is conditional on the training data set; if the latter is poor then we cannot expect the mapping to be anything other than poor also.

Classification

A classification problem is one in which we desire to assign each input to one of a closed set of output classes.
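To make the idea of a learned mapping concrete, here is a toy sketch (my own illustration, not part of the course material): a 1-nearest-neighbour rule that builds F directly from the training pairs (x, t).

```python
# Toy illustration (assumed example): learning a mapping F : x -> y from
# input-output pairs (x, t) using a 1-nearest-neighbour rule.

def fit_nearest_neighbour(pairs):
    """Return a mapping F learned from training pairs (x, t)."""
    def F(x):
        # Respond with the target of the closest training input.
        nearest_x, nearest_t = min(pairs, key=lambda p: abs(p[0] - x))
        return nearest_t
    return F

training = [(0.1, 'class 1'), (0.3, 'class 1'), (0.8, 'class 2'), (0.9, 'class 2')]
F = fit_nearest_neighbour(training)
```

The quality of F here depends entirely on the training pairs, illustrating the point that the mapping is conditional on the training set.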
Our data set will therefore contain targets, t, which code the class label of each input. The usual format for each target is 'one-of-n' coding such that, for example in a two-class problem, t = (1, 0) for class 1 and t = (0, 1) for class 2. As we will see, the minimum risk statistical classifier ascribes an unknown input to the class with the highest Bayes' a posteriori probability. Bayes' theorem, a belief update theorem, links a prior belief (the a priori probability) to the a posteriori belief given a new piece of information.

Regression

In the case of regression (or prediction) each input is mapped, not to a probability space, but onto another continuous number sequence. Again, a mapping function may be 'learned' using a set consisting of input-output pairs (x, t), where t may, for example in the case of prediction, be the value of a data series some number of samples in the future (we need off-line data, of course, to construct such a training data set).

Given a set of examples, then, what is the 'best' a system can do? It turns out that, given a training set, our best guess for the output in response to an input x is the conditional average of t evaluated over the training set. We denote this as

    y_best(x) = ⟨t | x⟩

and note that this expression codes a common-sense interpretation: if we are given an input, x, we look at what the target responses were for inputs close to x in the training set, and our 'best' output response is the average of all these values.

There is one word of caution, however. Refer to Figure 1; plot (a) shows a target distribution in which the conditional average gives a correct result, as the distribution is unimodal given any arbitrary input, x_o say. Plot (b), however, has a multimodal distribution. In this case the conditional average gives a poor representation. To tackle this form of data a much more sophisticated methodology is required.
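The conditional average ⟨t | x⟩ can be estimated directly from a training set. The following is a minimal sketch (the window width h is an arbitrary illustrative choice, not something prescribed by these notes):

```python
# Sketch: estimate the conditional average <t | x> by averaging the targets
# of training inputs lying within a small window of width h around x.

def conditional_average(pairs, x, h=0.1):
    nearby = [t for (xi, t) in pairs if abs(xi - x) <= h]
    if not nearby:
        raise ValueError("no training inputs near x")
    return sum(nearby) / len(nearby)

train = [(0.0, 1.0), (0.05, 1.2), (0.5, 3.0), (0.55, 2.8)]
```

Note that exactly as the text warns, this estimate is only sensible when the targets near x form a single clump; averaging a multimodal set of targets gives a value that may lie in a region of low probability.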
Figure 1: (a) Unimodal distribution – correct results obtained for regression. (b) Multimodal distribution – the basic regression problem becomes intractable and more sophisticated techniques are required.

Supervised and unsupervised

Given a space in which to perform a classification, unsupervised classification relies on the fact that data belonging to the same object tend to have similar feature vectors and vice versa. This means that segmentation of the data relies on finding the 'blobs' or 'clumps' which dominate feature space (we hope). Supervised partitioning relies on a set of training data from which we can estimate a mapping function, F, which maps a vector in the feature space to a classification space, i.e.

    F : f → C

'Hard' segmentation of the data set into class r, say, is then achieved by allocating to class r all data whose feature vectors were mapped to C_r.

Figure 2: Blobs or clumps in a feature space – unsupervised classification.

Both unsupervised and supervised segmentations have problems: the former assumes that each class has only one 'blob' in the feature space, and the latter is only as good as the training set (often laboriously created by hand). Both of these methods are intimately related to probability mappings and to Bayes' theorem, which we look at later in the lectures.

The supervised case is clear: if we have any system which we wish to partition into N components, then each component of the system, x, has membership of each of the N classes given by its posterior probability on that class:

    x → Class r iff P(r | x) = max_c {P(c | x)}

If we can estimate the posteriors then we can estimate the partition function.

In the case of unsupervised partitioning, each node/cluster/blob is taken to be a class. Hence if we search for the positions and shapes of K clusters, the data set will be segmented into K classes.
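As an illustration of unsupervised segmentation into K classes, here is a minimal 1-D k-means sketch (an assumed example, not part of the original notes; real feature spaces are multi-dimensional, and cluster shape matters as well as position):

```python
# Illustrative sketch: a minimal 1-D k-means, segmenting data into K classes
# given initial positions for the K cluster centres.

def kmeans_1d(data, centres, iters=20):
    for _ in range(iters):
        # Assign each datum to its nearest centre ...
        groups = [[] for _ in centres]
        for x in data:
            k = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
            groups[k].append(x)
        # ... then move each centre to the mean of its group.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres

data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
centres = kmeans_1d(data, centres=[0.0, 6.0])
```

Each final centre defines a class; a datum is segmented to the class of its nearest centre.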
Generally, we do not know K ahead of time, and methods for assessing the probable number of classes in a data set are a subject of active research.

Generalisation

One of the perennial problems of optimising a pattern classifier using a training set of data is that the apparent performance of the system improves with the number of tunable parameters in the system. Such a system runs the risk of over-fitting to the training data set and would thus perform poorly on new data. What is required is a system, obtained from a training set, which is capable of generalising to new data. This is a classic model-fitting problem, and the pattern recognition community has adopted several methods to improve generalisation performance.

Figure 3: Unsupervised classification has its problems.

Validation – one way of doing it

One of the favoured methods of ensuring decent generalisation performance is to split the training data to provide a separate validation set as well as a (new and smaller) training set. One successful strategy is therefore to successively increase the complexity of the analyser and re-optimise each time on the training set. The performance of the analyser is then evaluated on both the training and validation sets. A typical set of results gives rise to curves such as those in Figure 5, in which the training-set error decreases with increasing complexity whilst the validation error reaches a minimum and thereafter increases. We may perform several 'runs' of such a validation procedure by randomly re-splitting the original data into many different training and validation sets (this is referred to as n-fold validation).

Figure 4: Partitions of a feature space – supervised classification.

We may then choose the analyser of the complexity which, on average, generalises best (i.e. has the lowest validation-set error).
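The re-splitting procedure can be sketched as follows (an illustrative outline only; `train_fn` and `error_fn` are placeholder hooks standing in for whatever analyser and error measure are in use, and the 70/30 split fraction is an assumption, not a course prescription):

```python
import random

# Sketch of n-fold validation for model selection: repeatedly re-split the
# data into training and validation sets, and pick the complexity with the
# lowest average validation error.

def validate(data, complexities, train_fn, error_fn, n_runs=10, frac=0.7, seed=0):
    rng = random.Random(seed)
    scores = {c: [] for c in complexities}
    for _ in range(n_runs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        split = int(frac * len(shuffled))
        train, valid = shuffled[:split], shuffled[split:]
        for c in complexities:
            model = train_fn(train, c)           # optimise on the training set
            scores[c].append(error_fn(model, valid))  # score on the validation set
    # Choose the complexity with the lowest average validation error.
    return min(complexities, key=lambda c: sum(scores[c]) / len(scores[c]))
```

The chosen complexity is then assessed, once, on a held-out test set.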
We must still, however, test the performance of our optimal system on a test set which is independent of both the training and validation sets and is unused in the development procedure. In many 'real-world' problems, however, the available data set may be small. In this situation Bayesian learning methods may be superior (see later), as these allow comparison of different analysers (of differing complexities, for example) using only the training data set. What do we mean by small, however? Some studies address the issue of the number of pieces of information in the training set as a ratio to the number of adaptive parameters in the analyser. The results suggest that a factor of three or more is required. A general rule of thumb is to be even more pessimistic, however, and a factor of 5-10 is often suggested.

Typical application procedure

The most commonly used (validation) approach gives rise to a simple methodology with which an analyser may be trained and tested:

1. A labelled set of data is formed. This set consists of input-output pairs (x, t). We will furthermore assume that each component of the input set (x_1 etc.) has a similar numerical magnitude. This avoids numerically large components being given unfair weighting over those (possibly more useful) components whose magnitudes happen to be smaller. We will also assume that, in the classification case, this set is balanced, i.e. equal numbers of examples from each class are contained in it.

Figure 5: Training set error decreases, on average, monotonically with analyser complexity, whereas validation set error reaches a minimum at the 'optimal' analyser complexity.

2. The data is randomly split into, typically, three data sets: training, validation and test.

3. a. Starting with a 'simple' analyser (low number of free parameters), optimise its free parameters (weights) to minimise the error on the training set.
   b. Obtain performance measures for the analyser on the validation set.
   c. Increase the complexity of the analyser (e.g. more parameters).
   d. Repeat these steps until a levelling or minimum in the validation error vs. number of parameters (e.g.) curve is passed.

4. Repeat steps 2 and 3 with different random splits of the original data set. This may be performed ten times, say.

5. Find the analyser complexity that gives the best average (over the ten runs, say) performance on the validation set. Use an analyser with this configuration to assess performance on the (hitherto unseen) test set.

Generalisation: a quick example

This example is of the (in)famous 'noisy sine wave'. 100 samples with 30% Gaussian noise are used in training and another 100 as a 'test' set. Figure 6 shows the regression using increasing numbers of parameters (actually a number of spline functions). Note that 1 is not enough, 5 is about OK (it captures the basic sine wave pattern) and 100 captures all the noise as well! Figure 7 shows these same analysers applied to the test set (which has different noise on it, remember).

Figure 6: Generalisation: training set (panels show 1, 5, 20 and 100 parameters).

Figure 7: Generalisation: test set (panels show 1, 5, 20 and 100 parameters).

Some examples

In this section I introduce some simple data sets which, along with some others, are made available on line (see later).

Tremor analysis

This data was collected as part of a study trying to identify patients with muscle tremor. The features are auto-regressive coefficients which detail hand tremor resonances. The data set consists of equal numbers of data from patients and a control group. Figure 8 shows the data.

Figure 8: Tremor data set.

Using committees of
flexible classifiers (such as 'neural' networks) we can classify this data with about 85% accuracy.

Predicting the future: chaotic time series

Figure 9 shows the Mackey-Glass chaotic time series. The data has been looked at as a prediction problem, using past samples from the data to try to predict into the future. It is fairly difficult to do well! We can get to a 5% error level fairly easily (Fig. 10).

Figure 9: Chaotic time series data.

Figure 10: Chaotic time series data – predictions and error bars (NMSE = 0.0465).

Wine recognition

The data set consists of 178 13-dimensional exemplars which are a set of chemical analyses of three types of wine. Figure 11 shows the projection of the data onto its first two principal (eigen) components. The data, if we use the labels and know there are three kinds of wine, can be classified with 100% accuracy. If we are not allowed to use the labels, we can infer that there are three wines and classify with about 98% accuracy – not bad!

Figure 11: Wine data set.

Resources

Journals

• There is a good web site with journal information at www.ph.tn.tudelft.nl/PRInfo/journals.home.html

Machine learning archive

• There is a good machine learning and pattern recognition URL at www.aic.nrl.navy.mil/∼aha/research/machine-learning.html The link to the UCI data base is useful – also lots of free software to download.

Data, software for this course

Some example data sets may be downloaded via my web site: http://www.robots.ox.ac.uk/∼sjrob/teaching.html and following the links to pattern recognition. I have made links to some of our group software archives also. Data sets (more extensive) are also available via the machine learning URL (in particular the UCI data repository) for those who are keen.
STATISTICAL PRELIMINARIES

Bayes' theorem – the calculus of probabilities: a brief review

You will all, no doubt, be familiar with the following terms:

• priors, likelihoods, posteriors and evidences,

which describe the component quantities in the Rev. Thomas Bayes' famous and important theorem (mid 18th century).

Bayes' theorem relies upon the notion of conditional probabilities, such as "the chance that I will finish these lecture notes given my other tasks is 0.9". What this implies is that the chance (probability) of one event happening is conditional on another. Note also that statements like this code a belief rather than a probability obtained by multiple trials (the so-called frequentist approach). It is hence possible to allow singular (non-repeatable) events to have a probability. Cox (1946) showed that so long as these beliefs obey some consistency rules (they sum to unity, are strictly non-negative etc.), known as the Cox axioms, then these beliefs can be handled just as probabilities. I will come up front and say that I believe that Bayesian statistics is correct, although there are some who would disagree (they are gradually proved wrong though), so these lectures make this tacit assumption.

Just to refresh the memory... Note: I will use capital 'P' to denote a probability, whether this is a 'degree of belief' (bounded in the closed interval [0,1]) or a probability density function (pdf – which is unbounded above, is non-negative and must integrate to unity). If the argument is discrete, such as P(k) where k is the k-th class of a finite set, then we have a belief. If the argument is continuous, such as P(x), where x is some variable, then we have a pdf.

Formally we write "probability of A given E" as P(A | E). What Bayes' theorem states is that:

    P(A & E) = P(A | E) P(E) = P(E | A) P(A)

Re-arranging the above gives:

    P(A | E) = P(E | A) P(A) / P(E)

which is a basic formulation of Bayes' theorem.
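As a quick computational check, the theorem can be transcribed directly for a finite set of hypotheses, with the evidence obtained as the mixture sum over all hypotheses (the mixture form is developed in the text that follows):

```python
# Bayes' theorem for a finite set of hypotheses A_i:
#   P(A_i | E) = P(E | A_i) P(A_i) / P(E),
# with the evidence P(E) = sum_i P(E | A_i) P(A_i).

def bayes_posteriors(likelihoods, priors):
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Two hypotheses with equal priors:
post = bayes_posteriors([0.9, 0.3], [0.5, 0.5])
```

Note that the posteriors sum to unity by construction, whatever the likelihoods.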
We can hence evaluate the chance of A occurring given E has occurred if we know the chance of E occurring given A has occurred and the chances that A and E occur on their own:

    posterior = (likelihood × prior) / evidence

So we have 'converted' some prior estimate of the chance of A into a better estimate given information about E. This notion of using information in E to refine our belief about A is very elegant. Say we had a series of hypotheses, A_1, A_2, ..., A_n, each dependent upon E; then Bayes' theorem for mixtures states that

    P(E) = Σ_{i=1..n} P(E | A_i) P(A_i)

We may combine this with the previous equation to obtain a set of posterior probabilities, one for each hypothesis A_i. It may be shown that, if we wish to decide the most probable outcome of event E happening, that outcome is the A_i with the largest posterior probability:

    Most likely outcome if E has occurred = argmax_i {P(A_i | E)}

An example

Take a simple, 1-D example. Let E be the gaining of information regarding a datum, x, and let there be two hypotheses, or classes, conditional on x, called A and B. Hence

    P(A | x) = P(x | A) P(A) / P(x)
    P(B | x) = P(x | B) P(B) / P(x)

where, from Bayes' mixture theorem,

    P(x) = P(x | A) P(A) + P(x | B) P(B)

The likelihood terms, e.g. P(x | A), define how likely it is to generate x given class A etc. If we take a simple case, where A and B are taken to be Gaussian (normally) distributed classes, then this example may be drawn simply (see Figure 12).

• P(A | x) = 1 − P(B | x)
• If P(A) = P(B) (no prior preference) then choosing the class with maximum posterior is the same as choosing the maximum likelihood class (as the evidence is common to all classes).

Figure 12: Posteriors and likelihoods.

To give another example, from image processing, let us say that x represents the value at some pixel from an edge detection operation.
A and B can then be thought of as the chances of that pixel belonging to an edge or not:

    P(edge | x) = P(x | edge) P(edge) / P(x)

and if P(edge) = P(no edge) then

    P(edge | x) = P(x | edge) / [P(x | edge) + P(x | no edge)]

The figure shows some edge detection using such a scheme. I have not said yet how to get the likelihoods or how to go about performing classification in practice.

Classification: in more detail

Minimum risk classifiers

We consider a set of regions such that R_k corresponds to class C_k, so if there are c classes we have R_1 ... R_c.

Figure 13: House image and the 'significant' edges (P > 0.95).

The total probability of getting a correct classification is

    P(correct) = Σ_{k=1..c} P(x ∈ R_k, C_k)                              (1)
               = Σ_k P(x ∈ R_k | C_k) P(C_k)
               = Σ_k ∫_{R_k} P(x | C_k) P(C_k) dx

This is maximised by always choosing to classify x to the class with the largest value of P(x | C_k) P(C_k), which is just the posterior probability via Bayes' theorem.

The loss matrix enables different costs to be associated with different wrong decisions. L_{kj} is the loss when x is classified to C_j when in fact x ∈ C_k. The expected loss over C_k is

    loss_k = Σ_{j=1..c} ∫_{R_j} L_{kj} P(x | C_k) dx

and, as

    loss_total = Σ_{k=1..c} loss_k P(C_k)

we have

    loss_total = Σ_{j=1..c} ∫_{R_j} [ Σ_{k=1..c} L_{kj} P(x | C_k) P(C_k) ] dx

This is minimised if the integrand is minimised at each x. We will thus choose to classify to C_j when

    Σ_{k=1..c} L_{kj} P(x | C_k) P(C_k) < Σ_{k=1..c} L_{ki} P(x | C_k) P(C_k)

for all i ≠ j. Equivalently, given a vector of output (posterior) probabilities, p, form p̃ = Lᵀp and classify to the class whose component of p̃ is smallest. Note that the expected loss under this rule (the Bayes' rule) is just E[L p_max (1 − p_max)]. This is known as the Bayes' loss (or error) and is the lower bound on the loss for the data set. You can never do better than this (on average). If we don't follow the Bayes rule then this expectation is just called the expected loss or error.

The reject option

Let's make things a little simpler by assigning no loss to getting the right answer, L_{kk} = 0 ∀k, and the same loss to getting it wrong, L_{kj} = 1, k ≠ j.
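The minimum-risk rule above is easy to transcribe (a sketch only; the posteriors are assumed given, whereas in practice they must themselves be estimated):

```python
# Minimum-risk classification: given posteriors P(C_k | x) and a loss matrix
# L (L[k][j] = loss for deciding C_j when the truth is C_k), choose the class
# j minimising the expected loss sum_k L[k][j] * P(C_k | x).

def min_risk_class(posteriors, L):
    n = len(posteriors)
    expected = [sum(L[k][j] * posteriors[k] for k in range(n)) for j in range(n)]
    return min(range(n), key=lambda j: expected[j])
```

With a 0-1 loss matrix this reduces to choosing the largest posterior; an asymmetric loss matrix can overturn that choice, e.g. when missing class 2 is ten times as costly as missing class 1.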
Now introduce another class, the 'doubt' class. We will classify to this class when we are not certain which other class to classify into. Let the loss for this be L_{k,doubt} = d, ∀k. Remember that the expected loss (for this loss matrix) is just:

    loss = E[1 − P(C_chosen | x)]

To minimise this we must choose to classify to the class with the minimum of

    {1 − P(C_1 | x), 1 − P(C_2 | x), ..., 1 − P(C_K | x), d}

We see that we still choose the class with the largest P_max = P(C | x) unless P_max < 1 − d, in which case we reject to the 'doubt' class. By varying d we can look at the trade-off characteristics between accuracy and the fraction of data not rejected.

Figure 14: Reject option curve (tremor data set).

Receiver operating characteristics (ROC)

The ROC curve is much used in the assessment of systems for use in variable-cost domains, i.e. where the elements of the loss matrix may vary in time. An example might be a medical diagnosis problem in which the cost of falsely diagnosing a disease reduces as the disease becomes benign. The ROC curve is a plot of the true positive rate for a class against the false positive rate. Consider a decision in a two-class system, in which a decision is made for class C_1 if

    P(C_1 | x) > t

By varying the threshold t we obtain a set of classifiers with differing characteristics, and differing true and false positive rates.

Figure 15: ROC curve, showing good and bad classifiers.

Figure 16: ROC curve for the tremor data set.

Uncertainty and information gain

Note that the statistic

    U = 1 − max_k {P(C_k | x)}

naturally denotes the uncertainty in the decision to classify to the class with the largest posterior probability. We can see this quantity as a measure of reliability in the decision.
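This statistic, together with the doubt rule derived earlier (reject whenever P_max < 1 − d), can be sketched as:

```python
# Uncertainty statistic U = 1 - max_k P(C_k | x), combined with the doubt
# rule: reject to the 'doubt' class whenever P_max < 1 - d.

def classify_with_doubt(posteriors, d):
    p_max = max(posteriors)
    U = 1 - p_max            # uncertainty of the would-be decision
    if p_max < 1 - d:
        return 'doubt', U
    return posteriors.index(p_max), U
```

Varying d sweeps out the accuracy-versus-fraction-classified trade-off curve of Figure 14.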
It makes sense that we use the reject or doubt option when this quantity is large. If a datum provides clear information then U is small. Imagine we have no information provided by x; the posteriors just lie at the priors. If x is informative then the uncertainty (inherent typically in the priors) collapses and the posteriors diverge from the priors.

As an aside, the strict measure of uncertainty is the entropy, H, of some random source. Entropy (in bits) is defined, over a continuous space, as:

    H(x) = − ∫ P(x) log2 P(x) dx

and in discrete space as:

    H(k) = − Σ_k P(k) log2 P(k)

For a Gaussian source the entropy is ∝ log2 σ, hence large entropies correspond to large variances and vice versa.

Consider now the observation of some datum x. We know that if we classify to class k then P(k | x) was the maximum posterior. We consider the measure

    r = log2 [ P(k | x) / P(k) ]

By Bayes' theorem this is also equal to:

    r = log2 [ P(x | k) / P(x) ]

For P(x | k) > P(x), the region that class k models must be a subset of the entire data space. This means that we have localised x. The tighter this localisation, with reference to the original data set, the larger the measure r, and hence the lower the uncertainty in the classification decision. Figure 17 shows this for the tremor data set using different classifiers. Note that the gain is bounded above by unity (1 bit = the ability to classify into one of two classes) and below by 0. The latter occurs along the decision boundary. A datum on this boundary is impossible to classify, so seeing it gives no information regarding its class.

Figure 17: Information gain as log2 [P(x | k) / P(x)] for linear, quadratic and 5-spline classifiers.

The outlier option

The reject option allows us to disregard a datum if the subsequent classification has high uncertainty.
If we look again at Figure 12 we see that the posteriors are high (they asymptote to unity) as we go far away from the data (likelihoods). This assumes that the classes represent a complete hypothesis space, i.e. there truly are no other classification options. If we are faced with data which lie far away from all known classes, we may wish to assign them to an outlier class, under the belief that they are unlikely to have been truly generated by our hypothesis set. This approach is also referred to as novelty detection. Figure 18 depicts this.

Figure 18: Rejects and outliers.

Error bars

One of the things you might have noticed is that, for regression, the argument of our probability is a continuous variable, y, rather than a member of a discrete set of classes. Just as a full (most informative) description of a class decision is P(C_k | x), so the most informative regression measure is the density function P(y | x). Note that the latter is a pdf, whilst P(C_k | x) is just a single number representing a degree of belief.

How do we represent a density? One way is to assume that the density is normal and then just give its mean and variance. If we do this then regressors must output the mean, which we expect to be

    y(x) = ⟨t | x⟩

along with an error bar. How to get that error bar is dealt with later on in the course, as it requires an understanding of Bayesian learning methods. Just to make another thing clear: posterior probabilities (beliefs) do not have error bars!
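As a closing sketch (my own illustration, assuming a Gaussian form for P(y | x) and using a simple windowed sample estimate; proper Bayesian error bars are treated later in the course):

```python
import math

# Represent the predictive density P(y | x) by a Gaussian: report the
# conditional mean <t | x> together with an error bar, here taken as the
# sample standard deviation of the targets near x (an illustrative choice).

def predict_with_error_bar(pairs, x, h=0.1):
    nearby = [t for (xi, t) in pairs if abs(xi - x) <= h]
    mean = sum(nearby) / len(nearby)
    var = sum((t - mean) ** 2 for t in nearby) / len(nearby)
    return mean, math.sqrt(var)
```

A wide error bar then signals that the mean alone is an unreliable summary of P(y | x), exactly the situation warned about for multimodal target distributions.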