Summer Course: Data Mining
Support Vector Machines and other penalization classifiers
Presenter: Georgi Nalbantov
August 2009

Contents
- Purpose
- Linear Support Vector Machines
- Nonlinear Support Vector Machines
- (Theoretical justifications of SVM)
- Marketing Examples
- Other penalization classification methods
- Conclusion and Q & A (some extensions)

Purpose
- Task to be solved (the classification task): classify cases (customers) into "type 1" or "type 2" on the basis of some known attributes (characteristics)
- Chosen tool to solve this task: Support Vector Machines

The Classification Task
- Given data on explanatory and explained variables, where the explained variable can take the two values $\{-1, +1\}$, find a function that gives the "best" separation between the "$-1$" cases and the "$+1$" cases:
  - Given: $(x_1, y_1), \ldots, (x_m, y_m) \in \mathbb{R}^n \times \{-1, +1\}$
  - Find: $f : \mathbb{R}^n \to \{-1, +1\}$
  - "Best" function = the one whose expected error on unseen data $(x_{m+1}, y_{m+1}), \ldots, (x_{m+k}, y_{m+k})$ is minimal
- Existing techniques to solve the classification task: linear and quadratic discriminant analysis; logit choice models (logistic regression); decision trees, neural networks, least squares SVM

Support Vector Machines: Definition
- Support Vector Machines are a non-parametric tool for classification/regression
- Support Vector Machines are used for prediction rather than description purposes
- Support Vector Machines were developed by Vapnik and co-workers

Linear Support Vector Machines
- A direct marketing company wants to sell a new book, "The Art History of Florence" (Nissan Levin and Jacob Zahavi in Lattin, Carroll and Green, 2003)
- Problem: how to identify buyers and non-buyers using two variables:
  - months since last purchase
  - number of art books purchased
- [Figure: scatter plot of buyers (∆) and non-buyers (●), number of art books purchased vs. months since last purchase]

Linear SVM: Separable Case
- Main idea of SVM: separate the groups by a line
- However, there are infinitely many lines that have zero training error... which line shall we choose?
- SVM use the idea of a margin around the separating line: the thinner the margin, the more complex the model, so the best line is the one with the largest margin (a fitted toy example follows below)
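The separable case can be made concrete in a few lines of code. Here is a minimal sketch (not part of the original slides) using scikit-learn's `SVC` with a linear kernel; the buyer/non-buyer points are invented for illustration, and a very large C approximates the hard-margin problem.

```python
# Minimal sketch (not from the slides): a maximal-margin linear SVM fitted
# with scikit-learn. The buyer/non-buyer points are invented; a very large C
# approximates the hard-margin (separable) problem: minimize ||w||^2 / 2.
import numpy as np
from sklearn.svm import SVC

# Columns: (months since last purchase, number of art books purchased)
X = np.array([[2, 5], [3, 6], [4, 5], [2, 7], [3, 4],
              [8, 1], [9, 2], [10, 1], [7, 0], [9, 0]], dtype=float)
y = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1])  # +1 buyer, -1 non-buyer

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # huge C ~ hard margin

w, b = clf.coef_[0], clf.intercept_[0]
print(f"separating line: {w[0]:.3f}*x1 + {w[1]:.3f}*x2 + {b:.3f} = 0")
print(f"margin width 2/||w|| = {2 / np.linalg.norm(w):.3f}")
print("support vectors:\n", clf.support_vectors_)
```

The printed `w` and `b` are exactly the quantities the next slides reason about.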
Linear SVM: Separable Case (continued)
- The line having the largest margin is $w_1 x_1 + w_2 x_2 + b = 0$, where $x_1$ = months since last purchase and $x_2$ = number of art books purchased
- Note:
  - $w_1 x_{i1} + w_2 x_{i2} + b \ge +1$ for every buyer $i$ (∆)
  - $w_1 x_{j1} + w_2 x_{j2} + b \le -1$ for every non-buyer $j$ (●)
- The width of the margin is $\dfrac{1 - (-1)}{\sqrt{w_1^2 + w_2^2}} = \dfrac{2}{\lVert w \rVert}$
- Therefore: maximizing the margin $\Leftrightarrow$ minimizing $\lVert w \rVert$ $\Leftrightarrow$ minimizing $\dfrac{\lVert w \rVert^2}{2}$
- The optimization problem for SVM is: minimize $L(w) = \dfrac{\lVert w \rVert^2}{2}$ subject to the two constraints above

Linear SVM: Support Vectors
- "Support vectors" are those points that lie on the boundaries of the margin
- The decision surface (line) is determined only by the support vectors; all other points are irrelevant

Linear SVM: Nonseparable Case
- Training set: 1000 targeted customers; in the non-separable case there is no line separating the two groups errorlessly
- Here, SVM minimize $L(w, C) = \dfrac{\lVert w \rVert^2}{2} + C \sum_i \xi_i$, i.e. $L(w, C)$ = complexity + errors: the first term maximizes the margin, the second minimizes the training errors
- subject to:
  - $w_1 x_{i1} + w_2 x_{i2} + b \ge +1 - \xi_i$ for $i$ ∈ ∆
  - $w_1 x_{j1} + w_2 x_{j2} + b \le -1 + \xi_j$ for $j$ ∈ ●
  - $\xi_i, \xi_j \ge 0$

Linear SVM: The Role of C
- Bigger C: increased complexity (thinner margin) and a smaller number of training errors (better fit on the data), as in the C = 5 panel of the slide
- Smaller C: decreased complexity (wider margin) and a bigger number of training errors (worse fit on the data), as in the C = 1 panel
- C thus varies both complexity and empirical error by affecting the optimal $w$ and the optimal number of training errors (a sketch follows at the end of this subsection)

Bias–Variance Trade-off
- [Figure-only slide]

From Regression into Classification
- We have a linear model, such as $y = b \cdot x + \text{const}$
- We have to estimate this relation using our training data set, having in mind the so-called "accuracy", or "0-1", loss function (our evaluation criterion)
- The training data set consists of $m$ observations, for instance:

    Output (y)   Input (x)
        -1          0.2
         1          0.5
         1          0.7
       ...          ...
        -1         -0.7

- [Figure: the training data plotted in the (x, y) plane with the fitted line, the "margin" between the $y = +1$ and $y = -1$ levels, and the support vectors marked]
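To make the role of C tangible, here is a hedged sketch (synthetic overlapping data; C = 5 echoes the slide, while C = 0.1 is an assumed smaller value chosen to exaggerate the effect) that refits the same soft-margin SVM with each C and compares margin width against training errors.

```python
# Hedged sketch of the role of C (synthetic data; C = 5 echoes the slide,
# C = 0.1 is an assumed smaller value): a bigger C buys a thinner margin and
# fewer training errors, a smaller C a wider margin and more errors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 4.0], 1.5, size=(50, 2)),   # "buyers"
               rng.normal([5.0, 1.5], 1.5, size=(50, 2))])  # "non-buyers"
y = np.array([1] * 50 + [-1] * 50)

for C in (5.0, 0.1):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])   # margin width = 2/||w||
    errors = int((clf.predict(X) != y).sum())     # training errors
    print(f"C={C:>4}: margin width={margin:.2f}, training errors={errors}")
```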
From Regression into Classification: Support Vector Machines
- For $y = b \cdot x + \text{const}$: a flatter line means greater penalization; equivalently, a smaller slope means a bigger margin
- The same holds with two inputs, $y = b_1 x_1 + b_2 x_2 + \text{const}$: a flatter plane means greater penalization, i.e. a smaller slope means a bigger margin
- [Figure: fitted lines/planes of decreasing slope and the corresponding widening "margin"]

Nonlinear SVM: Nonseparable Case
- Mapping into a higher-dimensional space: each observation $(x_{i1}, x_{i2})$ is mapped to $(x_{i1}^2, \sqrt{2}\,x_{i1}x_{i2}, x_{i2}^2)$:

$$
\begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \vdots & \vdots \\ x_{l1} & x_{l2} \end{pmatrix}
\mapsto
\begin{pmatrix} x_{11}^2 & \sqrt{2}\,x_{11}x_{12} & x_{12}^2 \\ x_{21}^2 & \sqrt{2}\,x_{21}x_{22} & x_{22}^2 \\ \vdots & \vdots & \vdots \\ x_{l1}^2 & \sqrt{2}\,x_{l1}x_{l2} & x_{l2}^2 \end{pmatrix}
$$

- Optimization task: minimize $L(w, C) = \dfrac{\lVert w \rVert^2}{2} + C \sum_i \xi_i$ subject to:
  - $w_1 x_{i1}^2 + w_2 \sqrt{2}\,x_{i1}x_{i2} + w_3 x_{i2}^2 + b \ge +1 - \xi_i$ for $i$ ∈ ∆
  - $w_1 x_{j1}^2 + w_2 \sqrt{2}\,x_{j1}x_{j2} + w_3 x_{j2}^2 + b \le -1 + \xi_j$ for $j$ ∈ ●
- Map the data into the higher-dimensional space: for example, the labelled corner points $(1, 1)$ and $(-1, -1)$ both map to $(1, \sqrt{2}, 1)$, while $(-1, 1)$ and $(1, -1)$ both map to $(1, -\sqrt{2}, 1)$, so the two classes of this XOR-type configuration become linearly separable
- Find the optimal hyperplane in the transformed space
- Observe the decision surface in the original space (optional)

Nonlinear SVM: Dual Formulation
Dual formulation of the (primal) SVM minimization problem, with $y_i \in \{-1, +1\}$:

- Primal: minimize $\dfrac{\lVert w \rVert^2}{2} + C \sum_i \xi_i$ subject to $y_i (w \cdot x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$
- Dual: maximize $\sum_i \alpha_i - \dfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$ subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$

In the transformed space, $x_i \cdot x_j$ is replaced by $\phi(x_i) \cdot \phi(x_j)$, and for the quadratic map above this dot product never has to be computed explicitly:

$$
\phi(x_i) \cdot \phi(x_j)
= (x_{i1}^2, \sqrt{2}\,x_{i1}x_{i2}, x_{i2}^2) \cdot (x_{j1}^2, \sqrt{2}\,x_{j1}x_{j2}, x_{j2}^2)
= \big( (x_{i1}, x_{i2}) \cdot (x_{j1}, x_{j2}) \big)^2
= (x_i \cdot x_j)^2
$$

$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ is called the kernel function, so the dual becomes: maximize $\sum_i \alpha_i - \dfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)^2$ subject to the same constraints (a numerical check of this identity follows below).
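A quick numerical check of the kernel identity just derived (my own snippet, not from the slides): mapping explicitly with φ and taking the dot product gives the same number as evaluating $(x_i \cdot x_j)^2$ in the original space.

```python
# Check the kernel identity phi(xi).phi(xj) == (xi . xj)^2 for the
# quadratic map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) used in the slides.
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

xi, xj = np.array([3.0, 1.0]), np.array([-2.0, 4.0])

lhs = phi(xi) @ phi(xj)   # map explicitly, then take the dot product
rhs = (xi @ xj) ** 2      # evaluate the kernel in the original 2-D space
print(lhs, rhs)           # both equal 4 up to floating-point rounding
assert np.isclose(lhs, rhs)
```

This is the "kernel trick": the dual only ever needs $K(x_i, x_j)$, never the coordinates of $\phi(x)$.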
Strengths and Weaknesses of SVM
- Strengths:
  - training is relatively easy: there are no local minima
  - it scales relatively well to high-dimensional data
  - the trade-off between classifier complexity and error can be controlled explicitly via C
  - robustness of the results
  - the "curse of dimensionality" is avoided
- Weaknesses:
  - what is the best trade-off parameter C?
  - a good transformation of the original space is needed

The Ketchup Marketing Problem
- Two types of ketchup: Heinz and Hunts
- Seven attributes: Feature Heinz, Feature Hunts, Display Heinz, Display Hunts, Feature&Display Heinz, Feature&Display Hunts, and the log price difference between Heinz and Hunts
- Training data: 2498 cases (Heinz is chosen in 89.11%)
- Test data: 300 cases (Heinz is chosen in 88.33%)

The Ketchup Marketing Problem: Choosing a Kernel
- Choose a kernel mapping:
  - linear kernel: $K(x_i, x_j) = x_i \cdot x_j$
  - polynomial kernel: $K(x_i, x_j) = (x_i \cdot x_j + 1)^d$
  - RBF kernel: $K(x_i, x_j) = e^{-\lVert x_i - x_j \rVert^2 / (2\sigma^2)}$
- Do a (5-fold) cross-validation procedure to find the best combination of the manually adjustable parameters (here: C and σ); a code sketch of this search follows the result tables below
- [Figure: cross-validation mean squared errors of the SVM with RBF kernel, plotted over a grid of C and σ values, with the minimum and maximum marked]

The Ketchup Marketing Problem: Training Set Results
(predicted counts, with row percentages in parentheses; rows = actual brand)

    Model                          Actual   Pred. Hunts      Pred. Heinz       Total   Hit rate
    Linear Discriminant Analysis   Hunts       68 (25.00%)     204  (75.00%)     272   89.51%
                                   Heinz       58  (2.61%)    2168  (97.39%)    2226
    Logit Choice Model             Hunts      214 (78.68%)      58  (21.32%)     272   77.79%
                                   Heinz      497 (22.33%)    1729  (77.67%)    2226
    Support Vector Machines        Hunts      255 (93.75%)      17   (6.25%)     272   99.08%
                                   Heinz        6  (0.27%)    2220  (99.73%)    2226
    Majority Voting                Hunts        0  (0.00%)     272 (100.00%)     272   89.11%
                                   Heinz        0  (0.00%)    2226 (100.00%)    2226

The Ketchup Marketing Problem: Test Set Results

    Model                          Actual   Pred. Hunts      Pred. Heinz       Total   Hit rate
    Linear Discriminant Analysis   Hunts        3  (8.57%)      32  (91.43%)      35   88.33%
                                   Heinz        3  (1.13%)     262  (98.87%)     265
    Logit Choice Model             Hunts       29 (82.86%)       6  (17.14%)      35   77.00%
                                   Heinz       63 (23.77%)     202  (76.23%)     265
    Support Vector Machines        Hunts       25 (71.43%)      10  (28.57%)      35   95.67%
                                   Heinz        3  (1.13%)     262  (98.87%)     265
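The 5-fold cross-validation search over C and σ described above might look as follows in scikit-learn. This is a sketch with stand-in data and an assumed parameter grid; the actual ketchup data set is not reproduced here. Note that scikit-learn writes the RBF kernel as $e^{-\gamma \lVert x_i - x_j \rVert^2}$, so $\gamma = 1/(2\sigma^2)$.

```python
# Sketch of the slides' model-selection step: 5-fold cross-validation over
# a hypothetical grid of C and sigma for an RBF-kernel SVM. scikit-learn
# uses gamma = 1 / (2 * sigma^2) in exp(-gamma * ||xi - xj||^2).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data with seven attributes, as in the ketchup problem.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 7))
y = np.where(X[:, 6] + 0.3 * rng.normal(size=300) > 0, 1, -1)

sigmas = np.array([0.5, 1.0, 2.0, 4.0])            # assumed grid
param_grid = {"C": [0.1, 1.0, 10.0, 100.0],
              "gamma": list(1.0 / (2.0 * sigmas ** 2))}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print("best (C, gamma):", search.best_params_)
print("best 5-fold CV accuracy:", round(search.best_score_, 3))
```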
Part II: Penalized Classification and Regression Methods
- Support Hyperplanes
- Nearest Convex Hull classifier
- Soft Nearest Neighbor
- Support Vector Regression
- Application: an example financial study
- Conclusion

Classification: Support Hyperplanes
- Consider a (separable) binary classification case: training data (+, −) and a test point x
- There are infinitely many hyperplanes that are semi-consistent (= commit no error) with the training data
- For the classification of the test point x, use the farthest-away hyperplane that is semi-consistent with the training data: the support hyperplane of x
- The SH decision surface: each point on it has two support h-planes
- [Figure: toy problem experiment with Support Hyperplanes and Support Vector Machines, each with a linear kernel, an RBF kernel with σ = 5, and an RBF kernel with σ = 35]

Classification: Method Comparisons (figure-only slides)
- Support Vector Machines vs. Support Hyperplanes
- Support Vector Machines vs. Nearest Convex Hull classification
- Support Vector Machines vs. Soft Nearest Neighbor
- Support Hyperplanes, Nearest Convex Hull classification, and Soft Nearest Neighbor, each with bigger penalization
- The nonseparable case for Support Vector Machines, Support Hyperplanes, Nearest Convex Hull classification, and Soft Nearest Neighbor

Summary: Penalization Techniques for Classification
- Penalization methods for classification: Support Vector Machines (SVM), Support Hyperplanes (SH), Nearest Convex Hull classification (NCH), and Soft Nearest Neighbor (SNN)
- In all cases, the classification of a test point x is determined using a hyperplane h; equivalently, x is labelled +1 (−1) if it is farther away from the set S− (S+) (a sketch of the NCH variant follows the Conclusion)

Conclusion
- Support Vector Machines (SVM) can be applied to binary and multi-class classification problems
- SVM behave robustly in multivariate problems
- Further research in various marketing areas is needed to justify or refute the applicability of SVM
- Support Vector Regressions (SVR) can also be applied
- http://www.kernel-machines.org
- Email: nalbantov@few.eur.nl
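To make the Part II summary concrete, here is a hedged sketch of the Nearest Convex Hull idea in my own minimal formulation (not the authors' code, and without any penalization term): the distance from x to a class is taken as the distance to the convex hull of that class's training points, found by solving $\min_\lambda \lVert x - \lambda^\top P \rVert^2$ subject to $\lambda \ge 0$, $\sum \lambda = 1$; x then gets the label of the nearer hull.

```python
# Hedged sketch of Nearest Convex Hull classification (my own minimal
# formulation, not the authors' code): the distance from x to a class is
# the distance to the convex hull of its points, min ||x - lam @ P||^2
# with lam >= 0 and sum(lam) = 1, solved here with SciPy's SLSQP.
import numpy as np
from scipy.optimize import minimize

def dist_to_hull(x, P):
    m = len(P)
    res = minimize(lambda lam: np.sum((x - lam @ P) ** 2),
                   np.full(m, 1.0 / m),
                   bounds=[(0.0, None)] * m,
                   constraints=({"type": "eq",
                                 "fun": lambda lam: lam.sum() - 1.0},))
    return np.sqrt(res.fun)

def nch_predict(x, P_pos, P_neg):
    # label +1 iff x is farther from S- than from S+, i.e. nearer to S+
    return 1 if dist_to_hull(x, P_pos) <= dist_to_hull(x, P_neg) else -1

P_pos = np.array([[1.0, 1.0], [2.0, 2.0], [1.0, 2.0]])        # toy S+
P_neg = np.array([[-1.0, -1.0], [-2.0, -2.0], [-1.0, -2.0]])  # toy S-
print(nch_predict(np.array([0.5, 1.0]), P_pos, P_neg))        # prints 1
```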
