
Introduction to WEKA



Haibin Liu
haibin@cs.dal.ca
Faculty of Computer Science, Dalhousie University
Outline

- Introduction
- Architecture
- Preprocessing
- Classification
- Clustering
- Association rules
Introduction

- Java-based, open-source machine learning tool
- Implements numerous algorithms
- Three modes of operation:
    - GUI
    - Command line
    - Java API (Google: "weka java"; see the sketch below)
- Homepage: http://www.cs.waikato.ac.nz/ml/weka/
- To run the GUI:
    java -Xmx1024M -jar weka.jar &
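The Java API mode needs only a few lines. Below is a minimal sketch, assuming weka.jar is on the classpath and that the heart-disease ARFF example shown later is saved locally as heart-disease.arff (an illustrative file name):

    import java.io.BufferedReader;
    import java.io.FileReader;

    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class WekaApiDemo {
        public static void main(String[] args) throws Exception {
            // Read an ARFF file into memory.
            Instances data = new Instances(
                    new BufferedReader(new FileReader("heart-disease.arff")));
            // Tell WEKA which attribute is the class (here: the last one).
            data.setClassIndex(data.numAttributes() - 1);

            // Train a C4.5-style decision tree and print the resulting model.
            J48 tree = new J48();
            tree.buildClassifier(data);
            System.out.println(tree);
        }
    }

Compile and run it with weka.jar on the classpath, e.g. javac -cp weka.jar WekaApiDemo.java.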
Architecture
Preprocessing

- Input
    - ARFF files
    - Database access
    - URLs
- Filter
    - Select attributes
    - Apply filters
Input
An example of the ARFF format:
@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
Filter

- Attribute selection
    - investigates which (subsets of) attributes are the most predictive ones
    - consists of two parts (a Java sketch follows this list):
        - a search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
        - an evaluation method: correlation-based, wrapper, information gain, chi-squared, ...
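As a concrete illustration, here is a minimal Java sketch of attribute selection, pairing the correlation-based evaluation method (CfsSubsetEval) with best-first search; data.arff is a placeholder file name:

    import java.io.BufferedReader;
    import java.io.FileReader;

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.core.Instances;

    public class AttributeSelectionDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(
                    new BufferedReader(new FileReader("data.arff")));
            data.setClassIndex(data.numAttributes() - 1);

            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(new CfsSubsetEval()); // correlation-based evaluation method
            selector.setSearch(new BestFirst());        // best-first search method
            selector.SelectAttributes(data);            // note the capitalized method name

            // Print the names of the selected attributes (class index included).
            for (int index : selector.selectedAttributes()) {
                System.out.println(data.attribute(index).name());
            }
        }
    }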
Classification

- Classifiers in WEKA are models for predicting nominal or numeric quantities
- Implemented learning schemes include:
    - decision trees, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayesian networks, ...
- "Meta"-classifiers include (see the sketch below):
    - bagging, boosting, stacking, error-correcting output codes, locally weighted learning, ...
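Here is a minimal sketch of one meta-classifier, bagging, wrapped around a J48 base learner; data.arff is again a placeholder:

    import java.io.BufferedReader;
    import java.io.FileReader;

    import weka.classifiers.meta.Bagging;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class MetaClassifierDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(
                    new BufferedReader(new FileReader("data.arff")));
            data.setClassIndex(data.numAttributes() - 1);

            // Wrap a J48 base learner in a bagging ensemble.
            Bagging bagger = new Bagging();
            bagger.setClassifier(new J48()); // any base classifier works here
            bagger.setNumIterations(10);     // number of bagged models
            bagger.buildClassifier(data);
            System.out.println(bagger);
        }
    }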
An example to illustrate the use of the C4.5 (J48) classifier in WEKA
@relation bank
@attribute age numeric
@attribute sex {MALE,FEMALE}
@attribute region {INNER_CITY,RURAL,TOWN,SUBURBAN}
@attribute income numeric
@attribute married {YES,NO}
@attribute children {YES,NO}
@attribute car {YES,NO}
@attribute mortgage {YES,NO}
@attribute pep {YES,NO}

@data
48,FEMALE,INNER_CITY,17546,NO,YES,NO,NO,YES
40,MALE,TOWN,30085.1,YES,YES,YES,YES,NO
51,FEMALE,INNER_CITY,16575.4,YES,NO,YES,NO,NO
23,FEMALE,TOWN,20375.4,YES,YES,NO,NO,NO
...
Using the Command Line (Recommended)

- Generate the classification model:
    java weka.classifiers.trees.J48 -C 0.25 -M 2 -t directory-path\bank.arff -d directory-path\bank.model
- Apply the model to the new instances (-p 9 prints predictions together with the value of attribute 9, the class):
    java weka.classifiers.trees.J48 -p 9 -l directory-path\bank.model -T directory-path\bank-new.arff
- Output (instance number, predicted class, prediction confidence, actual class; "?" means the actual class is unknown):
    0 YES 0.75 ?
    1 NO 0.7272727272727273 ?
    2 YES 0.95 ?
    3 YES 0.8813559322033898 ?
    4 NO 0.8421052631578947 ?
    ...
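The same train-then-predict workflow can also be reproduced through the Java API; the -C and -M flags map directly onto setOptions. A minimal sketch, assuming bank.arff and bank-new.arff sit in the working directory:

    import java.io.BufferedReader;
    import java.io.FileReader;

    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class BankJ48Demo {
        public static void main(String[] args) throws Exception {
            Instances train = new Instances(
                    new BufferedReader(new FileReader("bank.arff")));
            train.setClassIndex(train.numAttributes() - 1);

            // Same options as on the command line: -C 0.25 -M 2.
            J48 tree = new J48();
            tree.setOptions(new String[] {"-C", "0.25", "-M", "2"});
            tree.buildClassifier(train);

            // Predict each unlabeled instance, mirroring the "-p" output.
            Instances unlabeled = new Instances(
                    new BufferedReader(new FileReader("bank-new.arff")));
            unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
            for (int i = 0; i < unlabeled.numInstances(); i++) {
                double cls = tree.classifyInstance(unlabeled.instance(i));
                double conf = tree.distributionForInstance(unlabeled.instance(i))[(int) cls];
                System.out.println(i + " "
                        + unlabeled.classAttribute().value((int) cls) + " " + conf);
            }
        }
    }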
Using the Command Line

- Running an N-fold cross-validation experiment:
    java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t trainingdata.arff -x N -i
- Using a predefined test set:
    java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t trainingdata.arff -T testingdata.arff
- Saving the model:
    java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t trainingdata.arff -d output.model
- Classifying a test set with a saved model:
    java -cp weka.jar weka.classifiers.bayes.NaiveBayes -l input.model -T testingdata.arff
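For completeness, a hedged Java-API sketch of the same operations (cross-validation, saving, and reloading a model); SerializationHelper is available in recent WEKA releases:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.SerializationHelper;

    public class NaiveBayesDemo {
        public static void main(String[] args) throws Exception {
            Instances train = new Instances(
                    new BufferedReader(new FileReader("trainingdata.arff")));
            train.setClassIndex(train.numAttributes() - 1);

            // 10-fold cross-validation, like "-x 10".
            NaiveBayes nb = new NaiveBayes();
            Evaluation eval = new Evaluation(train);
            eval.crossValidateModel(nb, train, 10, new Random(1));
            System.out.println(eval.toSummaryString());

            // Train on all the data and save the model, like "-d output.model".
            nb.buildClassifier(train);
            SerializationHelper.write("output.model", nb);

            // Load the model back later, like "-l input.model".
            NaiveBayes loaded = (NaiveBayes) SerializationHelper.read("output.model");
            System.out.println(loaded);
        }
    }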
Clustering

- To find groups of similar instances in a dataset
- Implemented schemes include: k-means, EM, Cobweb, X-means, FarthestFirst (see the k-means sketch below)
    - hard vs. soft cluster assignments
- Clusters can be visualized and compared to "true" clusters (if given)
- Evaluation: within-cluster closeness vs. between-cluster separation
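A minimal k-means sketch via the Java API; data.arff is a placeholder, and note that WEKA clusterers expect data with no class attribute set:

    import java.io.BufferedReader;
    import java.io.FileReader;

    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;

    public class ClusteringDemo {
        public static void main(String[] args) throws Exception {
            // Leave the class index unset; clusterers reject labeled data.
            Instances data = new Instances(
                    new BufferedReader(new FileReader("data.arff")));

            SimpleKMeans kmeans = new SimpleKMeans();
            kmeans.setNumClusters(3); // k, chosen by the user
            kmeans.buildClusterer(data);
            System.out.println(kmeans);

            // Summarize how the instances were assigned to clusters.
            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(kmeans);
            eval.evaluateClusterer(data);
            System.out.println(eval.clusterResultsToString());
        }
    }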
Association rules

- WEKA contains an implementation of the Apriori algorithm for learning association rules
    - works only with nominal attributes
- Can identify statistical dependencies between groups of attributes:
    - milk, butter => bread, eggs (with confidence 0.9 and support 2000)
- Apriori can compute all rules that reach a given minimum support and exceed a given minimum confidence (see the sketch below)
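A minimal Apriori sketch via the Java API; nominal-data.arff is a placeholder for an all-nominal dataset:

    import java.io.BufferedReader;
    import java.io.FileReader;

    import weka.associations.Apriori;
    import weka.core.Instances;

    public class AprioriDemo {
        public static void main(String[] args) throws Exception {
            // Apriori requires a dataset with nominal attributes only.
            Instances data = new Instances(
                    new BufferedReader(new FileReader("nominal-data.arff")));

            Apriori apriori = new Apriori();
            apriori.setLowerBoundMinSupport(0.1); // minimum support, as a fraction
            apriori.setMinMetric(0.9);            // minimum confidence
            apriori.buildAssociations(data);
            System.out.println(apriori);          // prints the discovered rules
        }
    }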
Try it yourself!