
Lecture 5: Machine Learning

Outline
5.1 Introduction
5.2 Supervised Learning
5.3 Parametric Methods
5.4 Clustering
5.5 Nonparametric Methods
5.6 Decision Trees

5.1 Introduction

Why "Learn"?
- Machine learning is programming computers to optimize a performance criterion using example data or past experience.
- There is no need to "learn" to calculate payroll.
- Learning is used when:
  - human expertise does not exist (navigating on Mars),
  - humans are unable to explain their expertise (speech recognition),
  - the solution changes over time (routing on a computer network),
  - the solution needs to be adapted to particular cases (user biometrics).

What We Talk About When We Talk About "Learning"
- Learning general models from data of particular examples.
- Data is cheap and abundant (data warehouses, data marts); knowledge is expensive and scarce.
- Example in retail, from customer transactions to consumer behavior: people who bought "Da Vinci Code" also bought "The Five People You Meet in Heaven" (www.amazon.com).
- Build a model that is a good and useful approximation to the data.

Data Mining
- Retail: market basket analysis, customer relationship management (CRM)
- Finance: credit scoring, fraud detection
- Manufacturing: optimization, troubleshooting
- Medicine: medical diagnosis
- Telecommunications: quality-of-service optimization
- Bioinformatics: motifs, alignment
- Web mining: search engines
- ...

What Is Machine Learning?
- Optimize a performance criterion using example data or past experience.
- Role of statistics: inference from a sample.
- Role of computer science: efficient algorithms to
  - solve the optimization problem,
  - represent and evaluate the model for inference.

Applications
- Association
- Supervised learning: classification, regression
- Unsupervised learning
- Reinforcement learning

Learning Associations
- Basket analysis: P(Y | X), the probability that somebody who buys X also buys Y, where X and Y are products/services.
- Example: P(chips | beer) = 0.7

Classification
- Example: credit scoring, differentiating between low-risk and high-risk customers from their income and savings.
- Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk

Classification: Applications
- Also known as pattern recognition.
- Face recognition: pose, lighting, occlusion (glasses, beard), make-up, hair style.
- Character recognition: different handwriting styles.
- Speech recognition: temporal dependency; use of a dictionary or the syntax of the language.
- Sensor fusion: combine multiple modalities, e.g., visual (lip image) and acoustic for speech.
- Medical diagnosis: from symptoms to illnesses.
- ...
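As a concrete illustration of the basket-analysis example above, here is a minimal Python sketch (not part of the lecture) that estimates the association P(Y | X) from a toy list of baskets; the product names and counts are made up.

```python
# Minimal sketch: estimate the association
#   P(Y | X) = #{baskets containing both X and Y} / #{baskets containing X}
# from a toy list of market baskets. Product names and data are made up.

def confidence(baskets, x, y):
    """Estimate P(y | x) over a list of baskets (sets of items)."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0
    with_both = [b for b in with_x if y in b]
    return len(with_both) / len(with_x)

baskets = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"beer", "diapers"},
    {"chips", "soda"},
]

print(confidence(baskets, "beer", "chips"))  # 2 of the 3 beer baskets also contain chips, ~0.67
```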
Face Recognition
- Training examples of a person; test images.
- AT&T Laboratories, Cambridge UK: http://www.uk.research.att.com/facedatabase.html

Regression
- Example: price of a used car.
- x: car attributes, y: price.
- Linear model: y = w1 x + w0.
- In general, y = g(x | θ), where g( ) is the model and θ its parameters.

Regression Applications
- Navigating a car: angle of the steering wheel (CMU NavLab).
- Kinematics of a robot arm: from the tip position (x, y) to the joint angles, α1 = g1(x, y), α2 = g2(x, y).
- Response surface design.

Supervised Learning: Uses
- Prediction of future cases: use the rule to predict the output for future inputs.
- Knowledge extraction: the rule is easy to understand.
- Compression: the rule is simpler than the data it explains.
- Outlier detection: exceptions that are not covered by the rule, e.g., fraud.

Unsupervised Learning
- Learning "what normally happens"; no output.
- Clustering: grouping similar instances.
- Example applications: customer segmentation in CRM, image compression (color quantization), bioinformatics (learning motifs).

Reinforcement Learning
- Learning a policy: a sequence of outputs.
- No supervised output, but delayed reward; the credit assignment problem.
- Game playing, a robot in a maze, multiple agents, partial observability, ...

Resources: Datasets
- UCI Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
- UCI KDD Archive: http://kdd.ics.uci.edu/summary.data.application.html
- Statlib: http://lib.stat.cmu.edu/
- Delve: http://www.cs.utoronto.ca/~delve/

Resources: Journals
- Journal of Machine Learning Research: www.jmlr.org
- Machine Learning
- Neural Computation
- Neural Networks
- IEEE Transactions on Neural Networks
- IEEE Transactions on Pattern Analysis and Machine Intelligence
- Annals of Statistics
- Journal of the American Statistical Association
- ...

Resources: Conferences
- International Conference on Machine Learning (ICML), ICML05: http://icml.ais.fraunhofer.de/
- European Conference on Machine Learning (ECML), ECML05: http://ecmlpkdd05.liacc.up.pt/
- Neural Information Processing Systems (NIPS), NIPS05: http://nips.cc/
- Uncertainty in Artificial Intelligence (UAI), UAI05: http://www.cs.toronto.edu/uai2005/
- Computational Learning Theory (COLT), COLT05: http://learningtheory.org/colt2005/
- International Joint Conference on Artificial Intelligence (IJCAI), IJCAI05: http://ijcai05.csd.abdn.ac.uk/
- International Conference on Artificial Neural Networks (ICANN, Europe), ICANN05: http://www.ibspan.waw.pl/ICANN-2005/
- ...

5.2 Supervised Learning

Learning a Class from Examples
- Class C of a "family car".
- Prediction: is car x a family car?
- Knowledge extraction: what do people expect from a family car?
- Output: positive (+) and negative (–) examples.
- Input representation: x1 = price, x2 = engine power.

Training Set
- X = {x^t, r^t}, t = 1,...,N, with r = 1 if x is a positive example and r = 0 if x is negative; x = [x1, x2]^T.

Class C
- (p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)

Hypothesis Class H
- h(x) = 1 if h classifies x as positive, 0 if h classifies x as negative.
- Error of h on the sample X: E(h | X) = ∑_{t=1}^{N} 1(h(x^t) ≠ r^t)

S, G, and the Version Space
- S: the most specific hypothesis; G: the most general hypothesis.
- Any h ∈ H between S and G is consistent with the training set, and together they make up the version space (Mitchell, 1997).

VC Dimension
- N points can be labeled in 2^N ways as +/–.
- H shatters the N points if, for every such labeling, there is an h ∈ H consistent with it; VC(H) is the largest N that H can shatter.
- An axis-aligned rectangle shatters only 4 points, so its VC dimension is 4.
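Below is a minimal Python sketch (not from the lecture) of the family-car example: an axis-aligned rectangle hypothesis h and its empirical error E(h | X) on a toy labeled sample. The thresholds and data are made up.

```python
# Minimal sketch: an axis-aligned rectangle hypothesis for the "family car"
# example and its empirical error E(h|X) = sum_t 1(h(x^t) != r^t).
# The thresholds (p1, p2, e1, e2) and the toy sample are made up.

def h(x, p1, p2, e1, e2):
    """Return 1 if x = (price, engine_power) falls inside the rectangle."""
    price, power = x
    return 1 if (p1 <= price <= p2) and (e1 <= power <= e2) else 0

def empirical_error(X, r, p1, p2, e1, e2):
    return sum(1 for x, label in zip(X, r) if h(x, p1, p2, e1, e2) != label)

# toy sample: (price in $1000s, engine power in hp); label 1 = family car
X = [(18, 120), (25, 150), (32, 180), (45, 300), (12, 90)]
r = [1, 1, 1, 0, 0]

print(empirical_error(X, r, p1=15, p2=35, e1=100, e2=200))  # 0 errors on this toy set
```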
Probably Approximately Correct (PAC) Learning
- How many training examples N should we have such that, with probability at least 1 − δ, the hypothesis h has error at most ε? (Blumer et al., 1989)
- Each of the four strips between h and C should have probability at most ε/4.
- The probability that one random instance misses a given strip is 1 − ε/4.
- The probability that all N instances miss that strip is (1 − ε/4)^N.
- The probability that the N instances miss any of the 4 strips is at most 4(1 − ε/4)^N.
- Require 4(1 − ε/4)^N ≤ δ; since (1 − x) ≤ exp(−x), it suffices that 4 exp(−εN/4) ≤ δ, that is, N ≥ (4/ε) log(4/δ).

Noise and Model Complexity
- Use the simpler model because it is
  - simpler to use (lower computational complexity),
  - easier to train (lower space complexity),
  - easier to explain (more interpretable),
  - and it generalizes better (lower variance; Occam's razor).

Multiple Classes, Ci, i = 1,...,K
- X = {x^t, r^t}, t = 1,...,N, where r_i^t = 1 if x^t ∈ Ci and r_i^t = 0 if x^t ∈ Cj, j ≠ i.
- Train K hypotheses hi(x), i = 1,...,K, where hi(x^t) = 1 if x^t ∈ Ci and 0 if x^t ∈ Cj, j ≠ i.

Regression
- X = {x^t, r^t}, t = 1,...,N, with r^t ∈ R and r^t = f(x^t) + ε.
- Linear model: g(x) = w1 x + w0; quadratic model: g(x) = w2 x² + w1 x + w0.
- Empirical error: E(g | X) = (1/N) ∑_t [r^t − g(x^t)]²
- For the linear model: E(w1, w0 | X) = (1/N) ∑_t [r^t − (w1 x^t + w0)]²

Model Selection & Generalization
- Learning is an ill-posed problem; the data alone are not sufficient to find a unique solution.
- Hence the need for inductive bias: assumptions about H.
- Generalization: how well a model performs on new data.
- Overfitting: H more complex than C or f. Underfitting: H less complex than C or f.

Triple Trade-Off
- There is a trade-off between three factors (Dietterich, 2003):
  1. the complexity of H, c(H),
  2. the training set size, N,
  3. the generalization error, E, on new data.
- As N increases, E decreases.
- As c(H) increases, E first decreases and then increases.

Cross-Validation
- To estimate generalization error, we need data unseen during training.
- Split the data into a training set (50%), a validation set (25%), and a test ("publication") set (25%).
- Use resampling when there is little data.

Dimensions of a Supervised Learner
1. Model: g(x | θ)
2. Loss function: E(θ | X) = ∑_t L(r^t, g(x^t | θ))
3. Optimization procedure: θ* = arg min_θ E(θ | X)

5.3 Parametric Methods

Parametric Estimation
- X = {x^t}_t where x^t ~ p(x).
- Parametric estimation: assume a form for p(x | θ) and estimate θ, its sufficient statistics, using X.
- E.g., N(μ, σ²), where θ = {μ, σ²}.

Maximum Likelihood Estimation
- Likelihood of θ given the sample X: l(θ | X) = p(X | θ) = ∏_t p(x^t | θ)
- Log likelihood: L(θ | X) = log l(θ | X) = ∑_t log p(x^t | θ)
- Maximum likelihood estimator (MLE): θ* = argmax_θ L(θ | X)

Examples: Bernoulli/Multinomial
- Bernoulli: two states (failure/success), x ∈ {0, 1}, P(x) = p0^x (1 − p0)^(1−x).
- L(p0 | X) = log ∏_t p0^{x^t} (1 − p0)^{1−x^t}; MLE: p0 = ∑_t x^t / N.
- Multinomial: K > 2 states, x_i ∈ {0, 1}, P(x1, x2,...,xK) = ∏_i p_i^{x_i}.
- L(p1, p2,...,pK | X) = log ∏_t ∏_i p_i^{x_i^t}; MLE: p_i = ∑_t x_i^t / N.

Gaussian (Normal) Distribution
- p(x) = N(μ, σ²): p(x) = [1/(√(2π) σ)] exp[−(x − μ)² / (2σ²)]
- MLE for μ and σ²: m = ∑_t x^t / N, s² = ∑_t (x^t − m)² / N

Bias and Variance
- Unknown parameter θ; estimator d_i = d(X_i) on sample X_i.
- Bias: b_θ(d) = E[d] − θ. Variance: E[(d − E[d])²].
- Mean square error: r(d, θ) = E[(d − θ)²] = (E[d] − θ)² + E[(d − E[d])²] = Bias² + Variance

Bayes' Estimator
- Treat θ as a random variable with prior p(θ).
- Bayes' rule: p(θ | X) = p(X | θ) p(θ) / p(X)
- Full Bayes: p(x | X) = ∫ p(x | θ) p(θ | X) dθ
- Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ | X)
- Maximum likelihood (ML): θ_ML = argmax_θ p(X | θ)
- Bayes' estimator: θ_Bayes = E[θ | X] = ∫ θ p(θ | X) dθ

Bayes' Estimator: Example
- x^t ~ N(θ, σ0²) and the prior is θ ~ N(μ, σ²).
- θ_ML = m (the sample mean).
- θ_MAP = θ_Bayes = E[θ | X] = [(N/σ0²) m + (1/σ²) μ] / (N/σ0² + 1/σ²)

Parametric Classification
- Discriminant: g_i(x) = p(x | Ci) P(Ci), or equivalently g_i(x) = log p(x | Ci) + log P(Ci).
- With p(x | Ci) = [1/(√(2π) σ_i)] exp[−(x − μ_i)² / (2σ_i²)]:
  g_i(x) = −(1/2) log 2π − log σ_i − (x − μ_i)² / (2σ_i²) + log P(Ci)
- Given the sample X = {x^t, r^t}, t = 1,...,N, with r_i^t = 1 if x^t ∈ Ci and 0 if x^t ∈ Cj, j ≠ i, the ML estimates are
  P̂(Ci) = ∑_t r_i^t / N, m_i = ∑_t r_i^t x^t / ∑_t r_i^t, s_i² = ∑_t r_i^t (x^t − m_i)² / ∑_t r_i^t
- The discriminant becomes g_i(x) = −(1/2) log 2π − log s_i − (x − m_i)² / (2 s_i²) + log P̂(Ci)
- Equal variances: a single boundary halfway between the means (figure).
- Different variances: two boundaries (figure).
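The following is a minimal Python sketch (not from the lecture) of the parametric classifier above: ML estimates of the priors, means, and variances of two 1-D Gaussian classes, and the resulting discriminants g_i(x). The toy data are made up, and the shared constant −(1/2) log 2π is dropped.

```python
# Minimal sketch: 1-D Gaussian parametric classification with ML estimates.
# Discriminant: g_i(x) = -log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i)
# (the shared constant -0.5*log(2*pi) is dropped). Toy data are made up.

import math

def fit_class(xs):
    """ML estimates m and s^2 for one class."""
    m = sum(xs) / len(xs)
    s2 = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, s2

def g(x, m, s2, prior):
    return -0.5 * math.log(s2) - (x - m) ** 2 / (2 * s2) + math.log(prior)

# toy 1-D feature (say, savings) for low-risk (C1) and high-risk (C2) customers
x1 = [5.1, 4.8, 5.5, 6.0, 5.2]   # class C1
x2 = [2.0, 2.5, 1.8, 2.2]        # class C2

m1, s2_1 = fit_class(x1)
m2, s2_2 = fit_class(x2)
p1 = len(x1) / (len(x1) + len(x2))
p2 = 1.0 - p1

x = 4.0
print("C1" if g(x, m1, s2_1, p1) > g(x, m2, s2_2, p2) else "C2")
```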
Regression (probabilistic view)
- r = f(x) + ε with noise ε ~ N(0, σ²); the estimator is g(x | θ), so p(r | x) ~ N(g(x | θ), σ²).
- Log likelihood: L(θ | X) = log ∏_t p(x^t, r^t) = ∑_t log p(r^t | x^t) + ∑_t log p(x^t), and the second term does not depend on θ.

Regression: From Log Likelihood to Error
- L(θ | X) = log ∏_t [1/(√(2π) σ)] exp[−(r^t − g(x^t | θ))² / (2σ²)]
  = −N log(√(2π) σ) − (1/(2σ²)) ∑_t [r^t − g(x^t | θ)]²
- Maximizing L is therefore equivalent to minimizing E(θ | X) = (1/2) ∑_t [r^t − g(x^t | θ)]²

Linear Regression
- g(x^t | w1, w0) = w1 x^t + w0
- Setting the derivatives to zero gives the normal equations:
  ∑_t r^t = N w0 + w1 ∑_t x^t
  ∑_t r^t x^t = w0 ∑_t x^t + w1 ∑_t (x^t)²
- In matrix form A w = y, with A = [[N, ∑_t x^t], [∑_t x^t, ∑_t (x^t)²]], w = [w0, w1]^T, y = [∑_t r^t, ∑_t r^t x^t]^T, so w = A^(-1) y.

Polynomial Regression
- g(x^t | wk,...,w2, w1, w0) = wk (x^t)^k + ... + w2 (x^t)² + w1 x^t + w0
- With the design matrix D whose row t is [1, x^t, (x^t)², ..., (x^t)^k] and r = [r^1, r^2, ..., r^N]^T, the solution is w = (D^T D)^(-1) D^T r.

Other Error Measures
- Square error: E(θ | X) = ∑_t [r^t − g(x^t | θ)]²
- Relative square error: E(θ | X) = ∑_t [r^t − g(x^t | θ)]² / ∑_t (r^t − r̄)²
- Absolute error: E(θ | X) = ∑_t |r^t − g(x^t | θ)|
- ε-sensitive error: E(θ | X) = ∑_t 1(|r^t − g(x^t | θ)| > ε) (|r^t − g(x^t | θ)| − ε)

Bias and Variance (of a regression estimate)
- E[(r − g(x))² | x] = E[(r − E[r | x])² | x] + (E[r | x] − g(x))², i.e., noise + squared error.
- Averaging over samples X: E_X[(E[r | x] − g(x))² | x] = (E[r | x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²], i.e., bias² + variance.

Estimating Bias and Variance
- M samples X_i = {x^t_i, r^t_i}, i = 1,...,M, are used to fit g_i(x), i = 1,...,M; let ḡ(x) = (1/M) ∑_i g_i(x).
- Bias²(g) = (1/N) ∑_t [ḡ(x^t) − f(x^t)]²
- Variance(g) = (1/(N M)) ∑_t ∑_i [g_i(x^t) − ḡ(x^t)]²

Bias/Variance Dilemma
- Example: g_i(x) = 2 has no variance and high bias; g_i(x) = ∑_t r^t_i / N has lower bias but nonzero variance.
- As we increase complexity, bias decreases (a better fit to the data) and variance increases (the fit varies more with the data).
- Bias/variance dilemma (Geman et al., 1992).
- (Figures: f, f̄, and bias; g_i, ḡ, and variance.)

Polynomial Regression (figures)
- Best fit chosen by minimum validation error; the best fit is at the "elbow" of the error curve.

Model Selection
- Cross-validation: measure generalization accuracy by testing on data unused during training.
- Regularization: penalize complex models, E' = error on data + λ · model complexity.
- Akaike's information criterion (AIC), Bayesian information criterion (BIC).
- Minimum description length (MDL): Kolmogorov complexity, the shortest description of the data.
- Structural risk minimization (SRM).

Bayesian Model Selection
- Prior on models, p(model): p(model | data) = p(data | model) p(model) / p(data)
- This acts as regularization when the prior favors simpler models.
- Bayes: use the MAP of the posterior p(model | data), or average over a number of models with high posterior (voting, ensembles: Chapter 15).
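As a small illustration of polynomial regression and of model selection by cross-validation, here is a NumPy sketch (not from the lecture): it fits polynomials of increasing degree with the least-squares solution w = (D^T D)^(-1) D^T r on synthetic data and compares training and validation error; the underlying function and noise level are made up.

```python
# Minimal sketch: polynomial regression via w = (D^T D)^{-1} D^T r, with
# training vs. validation error across degrees (model selection). Synthetic data.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
r = np.sin(3 * x) + rng.normal(0, 0.2, x.size)   # unknown f plus noise

x_tr, r_tr = x[:30], r[:30]     # training set
x_va, r_va = x[30:], r[30:]     # validation set

def design(x, k):
    """Design matrix D: row t is [1, x^t, (x^t)^2, ..., (x^t)^k]."""
    return np.vander(x, k + 1, increasing=True)

def fit(x, r, k):
    D = design(x, k)
    return np.linalg.solve(D.T @ D, D.T @ r)     # w = (D^T D)^{-1} D^T r

def mse(x, r, w, k):
    return np.mean((r - design(x, k) @ w) ** 2)

for k in range(1, 8):
    w = fit(x_tr, r_tr, k)
    print(k, round(mse(x_tr, r_tr, w, k), 3), round(mse(x_va, r_va, w, k), 3))
# training error keeps shrinking with k; validation error flattens or rises ("elbow")
```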
5.4 Clustering

Semiparametric Density Estimation
- Parametric: assume a single model for p(x | Ci) (Chapters 4 and 5).
- Semiparametric: p(x | Ci) is a mixture of densities; there are multiple possible explanations/prototypes, e.g., different handwriting styles, accents in speech.
- Nonparametric: no model; the data speaks for itself (Chapter 8).

Mixture Densities
- p(x) = ∑_{i=1}^{k} p(x | Gi) P(Gi)
  where the Gi are the components/groups/clusters, P(Gi) are the mixture proportions (priors), and p(x | Gi) are the component densities.
- Gaussian mixture: p(x | Gi) ~ N(μ_i, Σ_i), with parameters Φ = {P(Gi), μ_i, Σ_i}, i = 1,...,k, learned from an unlabeled sample X = {x^t}_t (unsupervised learning).

Classes vs. Clusters
- Supervised, X = {x^t, r^t}_t: classes Ci, i = 1,...,K, with p(x) = ∑_{i=1}^{K} p(x | Ci) P(Ci), p(x | Ci) ~ N(μ_i, Σ_i), and Φ = {P(Ci), μ_i, Σ_i}, i = 1,...,K. The ML estimates use the labels r_i^t:
  P̂(Ci) = ∑_t r_i^t / N
  m_i = ∑_t r_i^t x^t / ∑_t r_i^t
  S_i = ∑_t r_i^t (x^t − m_i)(x^t − m_i)^T / ∑_t r_i^t
- Unsupervised, X = {x^t}_t: clusters Gi, i = 1,...,k, with p(x) = ∑_{i=1}^{k} p(x | Gi) P(Gi), p(x | Gi) ~ N(μ_i, Σ_i), and Φ = {P(Gi), μ_i, Σ_i}, i = 1,...,k. But now the labels r_i^t are unknown.

k-Means Clustering
- Find k reference vectors (prototypes/codebook vectors/codewords) that best represent the data.
- Reference vectors m_j, j = 1,...,k; use the nearest (most similar) reference: ‖x^t − m_i‖ = min_j ‖x^t − m_j‖.
- Reconstruction error: E({m_i}, i = 1,...,k | X) = ∑_t ∑_i b_i^t ‖x^t − m_i‖², where b_i^t = 1 if ‖x^t − m_i‖ = min_j ‖x^t − m_j‖ and 0 otherwise.
- Encoding/decoding: x^t is encoded by the index i of its nearest reference vector and decoded back as m_i.
- (Figures: k-means iterations on example data.)

Expectation-Maximization (EM)
- Log likelihood with a mixture model: L(Φ | X) = ∑_t log p(x^t | Φ) = ∑_t log ∑_{i=1}^{k} p(x^t | Gi) P(Gi)
- Assume hidden variables z which, when known, make the optimization much simpler.
- Complete likelihood Lc(Φ | X, Z), in terms of x and z; incomplete likelihood L(Φ | X), in terms of x only.

E- and M-Steps
- Iterate the two steps:
  1. E-step: estimate z given X and the current Φ: Q(Φ | Φ^l) = E[Lc(Φ | X, Z) | X, Φ^l]
  2. M-step: find the new Φ given z, X, and the old Φ: Φ^(l+1) = arg max_Φ Q(Φ | Φ^l)
- An increase in Q increases the incomplete likelihood: L(Φ^(l+1) | X) ≥ L(Φ^l | X).

EM in Gaussian Mixtures
- z_i^t = 1 if x^t belongs to Gi, 0 otherwise (the counterpart of the labels r_i^t of supervised learning); assume p(x | Gi) ~ N(μ_i, Σ_i).
- E-step: h_i^t ≡ E[z_i^t | X, Φ^l] = p(x^t | Gi, Φ^l) P(Gi) / ∑_j p(x^t | Gj, Φ^l) P(Gj) = P(Gi | x^t, Φ^l)
- M-step, using the estimated soft labels h_i^t in place of the unknown labels:
  P(Gi) = ∑_t h_i^t / N
  m_i^(l+1) = ∑_t h_i^t x^t / ∑_t h_i^t
  S_i^(l+1) = ∑_t h_i^t (x^t − m_i^(l+1))(x^t − m_i^(l+1))^T / ∑_t h_i^t
- (Figure: data colored by P(G1 | x) = h_1; instances on the boundary have h_1 = 0.5.)

Mixtures of Latent Variable Models
- Regularize the clusters: (1) assume shared/diagonal covariance matrices; (2) use PCA/FA to decrease dimensionality: mixtures of PCA/FA.
- p(x^t | Gi) = N(m_i, V_i V_i^T + Ψ_i)
- EM can be used to learn V_i (Ghahramani and Hinton, 1997; Tipping and Bishop, 1999).

After Clustering
- Dimensionality reduction methods find correlations between features and group features; clustering methods find similarities between instances and group instances.
- Clustering allows knowledge extraction through the number of clusters, the prior probabilities, and the cluster parameters, i.e., center and range of features.
- Example: CRM, customer segmentation.

Clustering as Preprocessing
- The estimated group labels h_j (soft) or b_j (hard) may be seen as the dimensions of a new k-dimensional space, in which we can then learn our discriminant or regressor.
- Local representation (only one b_j is 1, all others are 0; only a few h_j are nonzero) vs. distributed representation (after PCA, all z_j are nonzero).
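Two short NumPy sketches (not from the lecture) follow. The first is k-means as described above: alternate between assigning each instance to its nearest reference vector and moving each reference vector to the mean of its assigned instances; the synthetic two-blob data and the initialization scheme are arbitrary choices.

```python
# Minimal sketch: k-means clustering on synthetic 2-D data.
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), k, replace=False)]            # init from the data
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)  # N x k distances
        b = d.argmin(axis=1)                               # hard labels b^t
        for j in range(k):
            if np.any(b == j):
                m[j] = X[b == j].mean(axis=0)              # move reference vectors
    return m, b

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
m, b = kmeans(X, k=2)
print(m)   # roughly the two blob centers, near (0, 0) and (3, 3)
```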
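The second sketch is EM for a mixture of two 1-D Gaussians, again with made-up data and a crude initialization: the E-step computes the soft labels h_i^t = P(Gi | x^t) and the M-step re-estimates the priors, means, and variances using them.

```python
# Minimal sketch: EM for a two-component 1-D Gaussian mixture (synthetic data).
import numpy as np

def em_gmm_1d(x, n_iter=100):
    pi = np.array([0.5, 0.5])                      # crude initialization
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: N x 2 soft labels h_i^t = P(G_i | x^t)
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        h = pi * dens
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means, variances with the soft labels
        nk = h.sum(axis=0)
        pi = nk / len(x)
        mu = (h * x[:, None]).sum(axis=0) / nk
        var = (h * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 0.5, 100)])
print(em_gmm_1d(x))   # priors near (2/3, 1/3), means near (-2, 3)
```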
Mixture of Mixtures
- In classification, the input comes from a mixture of classes (supervised). If each class is in turn a mixture, e.g., of Gaussians (unsupervised), we have a mixture of mixtures:
  p(x | Ci) = ∑_{j=1}^{k_i} p(x | Gij) P(Gij)
  p(x) = ∑_{i=1}^{K} p(x | Ci) P(Ci)

Hierarchical Clustering
- Cluster based on similarities/distances between instances x^r and x^s.
- Minkowski (Lp) distance (Euclidean for p = 2): d_m(x^r, x^s) = [∑_{j=1}^{d} |x_j^r − x_j^s|^p]^(1/p)
- City-block distance: d_cb(x^r, x^s) = ∑_{j=1}^{d} |x_j^r − x_j^s|

Agglomerative Clustering
- Start with N groups, each containing one instance, and merge the two closest groups at each iteration.
- Distance between two groups Gi and Gj:
  Single-link: d(Gi, Gj) = min over x^r ∈ Gi, x^s ∈ Gj of d(x^r, x^s)
  Complete-link: d(Gi, Gj) = max over x^r ∈ Gi, x^s ∈ Gj of d(x^r, x^s)
  Average-link, distance between centroids.
- (Figure: example of single-link clustering and the resulting dendrogram.)

Choosing k
- Defined by the application, e.g., image quantization.
- Plot the data (after PCA) and check for clusters.
- Incremental (leader-cluster) algorithm: add clusters one at a time until an "elbow" in the reconstruction error / log likelihood / intergroup distances.
- Manual check for meaning.

5.5 Nonparametric Methods

Nonparametric Estimation
- Parametric: a single global model. Semiparametric: a small number of local models.
- Nonparametric: similar inputs have similar outputs; functions (pdf, discriminant, regression) change smoothly.
- Keep the training data; "let the data speak for itself". Given x, find a small number of the closest training instances and interpolate from these.
- Also known as lazy / memory-based / case-based / instance-based learning.

Density Estimation
- Given the training set X = {x^t}_t drawn iid from p(x), divide the data into bins of size h.
- Histogram: p̂(x) = #{x^t in the same bin as x} / (N h)
- Naive estimator: p̂(x) = #{x − h < x^t ≤ x + h} / (2 N h), or equivalently
  p̂(x) = (1/(N h)) ∑_{t=1}^{N} w((x − x^t)/h), with w(u) = 1/2 if |u| < 1 and 0 otherwise.

Kernel Estimator
- Kernel function, e.g., the Gaussian kernel: K(u) = [1/√(2π)] exp(−u²/2)
- Kernel estimator (Parzen windows): p̂(x) = (1/(N h)) ∑_{t=1}^{N} K((x − x^t)/h)

k-Nearest Neighbor Estimator
- Instead of fixing the bin width h and counting the number of instances, fix the number of instances (neighbors) k and check the bin width:
  p̂(x) = k / (2 N d_k(x)), where d_k(x) is the distance to the k-th closest instance to x.

Multivariate Data
- Kernel density estimator: p̂(x) = (1/(N h^d)) ∑_{t=1}^{N} K((x − x^t)/h)
- Multivariate Gaussian kernel:
  spheric: K(u) = [1/(2π)^(d/2)] exp(−‖u‖²/2)
  ellipsoid: K(u) = [1/((2π)^(d/2) |S|^(1/2))] exp(−(1/2) u^T S^(-1) u)

Nonparametric Classification
- Estimate p(x | Ci) and use Bayes' rule.
- Kernel estimator: p̂(x | Ci) = (1/(Ni h^d)) ∑_{t=1}^{N} K((x − x^t)/h) r_i^t, with P̂(Ci) = Ni / N,
  so g_i(x) = p̂(x | Ci) P̂(Ci) = (1/(N h^d)) ∑_{t=1}^{N} K((x − x^t)/h) r_i^t
- k-NN estimator: p̂(x | Ci) = k_i / (Ni V^k(x)) and p̂(x) = k / (N V^k(x)), so P̂(Ci | x) = p̂(x | Ci) P̂(Ci) / p̂(x) = k_i / k

Condensed Nearest Neighbor
- The time/space complexity of k-NN is O(N).
- Find a subset Z of X that is small and accurate in classifying X (Hart, 1968): minimize E'(Z | X) = E(X | Z) + λ |Z|
- Incremental algorithm: add an instance to Z only if it is needed, i.e., misclassified by the current Z.

Nonparametric Regression
- Also known as smoothing models.
- Regressogram: ĝ(x) = ∑_{t=1}^{N} b(x, x^t) r^t / ∑_{t=1}^{N} b(x, x^t), where b(x, x^t) = 1 if x^t is in the same bin as x and 0 otherwise.

Running Mean / Kernel Smoother
- Running mean smoother: ĝ(x) = ∑_t w((x − x^t)/h) r^t / ∑_t w((x − x^t)/h), where w(u) = 1 if |u| < 1 and 0 otherwise.
- Kernel smoother: ĝ(x) = ∑_t K((x − x^t)/h) r^t / ∑_t K((x − x^t)/h), where K( ) is, e.g., Gaussian.
- Running line smoother; additive models (Hastie and Tibshirani, 1990).

How to Choose k or h?
- When k or h is small, single instances matter: bias is small, variance is large (undersmoothing), high complexity.
- As k or h increases, we average over more instances: variance decreases but bias increases (oversmoothing), low complexity.
- Cross-validation is used to fine-tune k or h.
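To make the smoothing idea concrete, here is a NumPy sketch (not from the lecture) of the Gaussian kernel smoother above, evaluated with two bandwidths to show under- vs. over-smoothing; the data are synthetic and the bandwidth values are arbitrary.

```python
# Minimal sketch: Gaussian kernel smoother
#   g^(x) = sum_t K((x - x^t)/h) r^t / sum_t K((x - x^t)/h)
# on synthetic data, for two bandwidths h.
import numpy as np

def kernel_smoother(x_query, x_train, r_train, h):
    u = (x_query[:, None] - x_train[None, :]) / h
    K = np.exp(-0.5 * u ** 2)                  # Gaussian kernel (constant cancels)
    return (K * r_train).sum(axis=1) / K.sum(axis=1)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
r = np.sin(x) + rng.normal(0, 0.3, x.size)

xs = np.linspace(0, 2 * np.pi, 5)
print(kernel_smoother(xs, x, r, h=0.05))   # small h: wiggly, high variance
print(kernel_smoother(xs, x, r, h=0.5))    # larger h: smoother, more bias
print(np.sin(xs))                          # underlying function, for comparison
```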
5.6 Decision Trees

Tree Uses Nodes and Leaves
- (Figure: a decision tree with internal decision nodes and leaves.)

Divide and Conquer
- Internal decision nodes:
  Univariate: uses a single attribute x_i. Numeric x_i: binary split, x_i > w_m. Discrete x_i: n-way split for the n possible values.
  Multivariate: uses all attributes, x.
- Leaves: for classification, class labels or proportions; for regression, a numeric value (the average r, or a local fit).
- Learning is greedy; find the best split recursively (Breiman et al., 1984; Quinlan, 1986, 1993).

Classification Trees (ID3, CART, C4.5)
- For node m, N_m instances reach m and N_m^i of them belong to class Ci:
  p̂(Ci | x, m) ≡ p_m^i = N_m^i / N_m
- Node m is pure if p_m^i is 0 or 1.
- A measure of impurity is entropy: I_m = −∑_{i=1}^{K} p_m^i log2 p_m^i

Best Split
- If node m is pure, generate a leaf and stop; otherwise split and continue recursively.
- Impurity after the split: N_mj of the N_m instances take branch j, and N_mj^i of them belong to Ci:
  p̂(Ci | x, m, j) ≡ p_mj^i = N_mj^i / N_mj
  I'_m = −∑_{j=1}^{n} (N_mj / N_m) ∑_{i=1}^{K} p_mj^i log2 p_mj^i
- Find the variable and split that minimize impurity (among all variables, and among all split positions for numeric variables).

Regression Trees
- Error at node m, with b_m(x) = 1 if x ∈ X_m (x reaches node m) and 0 otherwise:
  E_m = (1/N_m) ∑_t b_m(x^t) (r^t − g_m)², where g_m = ∑_t b_m(x^t) r^t / ∑_t b_m(x^t)
- After splitting, with b_mj(x) = 1 if x ∈ X_mj (x reaches node m and takes branch j) and 0 otherwise:
  E'_m = (1/N_m) ∑_j ∑_t b_mj(x^t) (r^t − g_mj)², where g_mj = ∑_t b_mj(x^t) r^t / ∑_t b_mj(x^t)

Model Selection in Trees
- (Figure: trees of different sizes and their fit.)

Pruning Trees
- Remove subtrees for better generalization (to decrease variance).
- Prepruning: early stopping. Postpruning: grow the whole tree, then prune the subtrees that overfit on the pruning set.
- Prepruning is faster; postpruning is more accurate (it requires a separate pruning set).

Rule Extraction from Trees
- C4.5Rules (Quinlan, 1993): each path from the root to a leaf can be written as a conjunctive rule.

Learning Rules
- Rule induction is similar to tree induction, but tree induction is breadth-first whereas rule induction is depth-first: one rule at a time.
- A rule set contains rules; rules are conjunctions of terms. A rule covers an example if all terms of the rule evaluate to true for the example.
- Sequential covering: generate rules one at a time until all positive examples are covered.
- IREP (Fürnkranz and Widmer, 1994), Ripper (Cohen, 1995).

Multivariate Trees
- (Figure: a multivariate split uses all attributes at an internal node, e.g., a linear combination of the inputs.)
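As a closing illustration (not from the lecture), here is a minimal Python sketch of the best-split computation for a classification tree node: it evaluates the entropy impurity after each candidate binary split x > w_m on a single numeric attribute and picks the threshold with minimum impurity. The toy attribute values and class labels are made up.

```python
# Minimal sketch: best binary split x > w_m for a tree node, chosen by
# minimizing the entropy impurity after the split. Toy data are made up.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_impurity(xs, labels, threshold):
    """Weighted entropy of the two branches induced by x > threshold."""
    left = [r for x, r in zip(xs, labels) if x <= threshold]
    right = [r for x, r in zip(xs, labels) if x > threshold]
    n = len(labels)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)

def best_split(xs, labels):
    """Try thresholds midway between consecutive sorted attribute values."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda w: split_impurity(xs, labels, w))

# toy node: one numeric attribute (say, income) and two classes
xs = [12, 15, 18, 22, 30, 35, 40, 42]
labels = ["high", "high", "high", "high", "low", "low", "low", "low"]
print(best_split(xs, labels))   # 26.0 separates the two classes exactly
```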
