Data mining: An overview of techniques and applications

Sunita Sarawagi
IIT Bombay
http://www.it.iitb.ernet.in/~sunita
Data mining

• Process of semi-automatically analyzing
  large databases to find patterns that are:
  – valid: hold on new data with some certainty
  – novel: non-obvious to the system
  – useful: it should be possible to act on the pattern
  – understandable: humans should be able to
    interpret the pattern
• Other names: knowledge discovery in
  databases, data analysis
Relationship with other fields
• Overlaps with machine learning, statistics,
  artificial intelligence, databases, and visualization,
  but with more stress on:
   – scalability in the number of features and instances
   – algorithms and architectures (the foundations of the
     methods and formulations are provided by statistics
     and machine learning)
   – automation for handling large, heterogeneous data
Outline
 • Mining operations (applications and methods for each)
   – Classification
   – Clustering
   – Association rule mining
   – Sequence mining
 • Two applications
   – Intrusion detection
   – Information extraction
Classification

 • Given a table D of rows with
   columns X1..Xn, Y
   – Xi can be numeric or string
   – The special attribute Y is the class label
 • Training
   – Learn a classifier C that can
     predict the label Y in terms of
     X1, X2, .., Xn
   – C must hold for
      • examples in D
      • unseen data
 • Application
   – Use C to predict Y for new X-s

 [Figure: a training table with feature columns X1..Xn and label column Y
 (rows such as "2, 6.7, .., BB, +" and "10, 0.9, .., CX, -") feeds the
 classifier C, which then outputs +/- for a new unlabeled row]
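A minimal sketch of this setup in Python (scikit-learn assumed; the slides name no particular library, and the toy table below is illustrative):

from sklearn.tree import DecisionTreeClassifier

# Toy table D: two numeric features plus a string feature encoded as an int.
X = [[2, 6.7, 0],    # "BB" -> 0
     [5, 3.4, 1],    # "CA" -> 1
     [10, 0.9, 2]]   # "CX" -> 2
Y = ["+", "+", "-"]

C = DecisionTreeClassifier().fit(X, Y)  # training: learn C from D
print(C.predict([[10, 0.9, 2]]))        # application: predict Y for a new row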
Automatic loan approval
• Given old data about customers and their payments,
  predict a new applicant's loan eligibility.

[Figure: previous customers (age, salary, profession, location, customer
type) train a classifier, which yields decision rules such as
"Salary > 5 L and Prof. = Exec → Good/bad"; the rules are then applied
to a new applicant's data (age, salary, profession, location)]
Drug design: molecular Bioactivity
 • Predict activity of compounds binding to
   thrombin
 • Library of compounds included:
   – 1909 known molecules (42 actively binding
     thrombin)
 • 139,351 binary features describe the 3-D
   structure of each compound
 • 636 new compounds with unknown
   capacity to bind thrombin
Automatic webpage classification
• Several large categorized search directories
  – Yahoo!, Dmoz (used in Google/Altavista)
• Web: 2 billion pages, and only a subset is in the
  directories
• Existing taxonomies are manually created
• Need to automatically classify new pages
Several classification methods
• Regression
• Decision tree classifiers
• Rule learners
• Neural networks
• Generative models
• Nearest neighbor
• Support vector machines

Choose based on:
  – data type (numeric, categorical)
  – number of attributes
  – number of classes
  – number of training examples
  – need for interpretation
Nearest neighbor
• Define similarity between instances
• Find neighbors of a new instance in the training data
   – k-NN approach: assign the majority class amongst the
     k nearest neighbours
   – weighted regression: learn a new regression
     equation by weighting each training instance based
     on its distance from the new instance

 • Pros                 • Cons
     + Fast training        – Slow during application
                            – No feature selection
                            – Notion of proximity is vague
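A minimal k-NN sketch (scikit-learn assumed; the data is illustrative): training is just storing the instances, and the cost shifts to query time, which is exactly the pro/con trade-off listed above.

from sklearn.neighbors import KNeighborsClassifier

# Illustrative (age, salary) instances with class labels.
X_train = [[20, 3.0], [25, 4.5], [60, 1.0], [55, 1.5]]
y_train = ["good", "good", "bad", "bad"]

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3
knn.fit(X_train, y_train)                  # "fast training": store the data
print(knn.predict([[30, 4.0]]))            # slow part: neighbour search per query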
Decision tree classifiers
 • Widely used learning method
 • Easy to interpret: can be re-represented
   as if-then-else rules
 • Approximates functions by piecewise
   constant regions
 • Does not require prior knowledge of the data
   distribution; works well on noisy data
Decision trees

 • Tree where internal nodes are simple decision
   rules on one or more attributes and leaf nodes
   are predicted class labels.

   Salary < 20K?
     yes: Profession = teacher?
            yes: Good
            no:  Bad
     no:  Age < 30?
            yes: Bad
            no:  Good
Algorithm for tree building
 • Greedy top-down construction.

   Gen_Tree(node, data):
     if node should be a leaf: stop
     find best attribute and best split on that
       attribute (selection criteria)
     partition data on the split condition
     for each child j of node: Gen_Tree(node_j, data_j)
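The Gen_Tree recursion can be sketched in plain Python. This is an illustrative toy, not the slide's algorithm verbatim: it assumes numeric attributes, binary splits of the form attr < value, and weighted Gini impurity as the selection criterion (which the slide leaves abstract).

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gen_tree(rows, labels, min_size=2):
    # "make node a leaf?": stop on pure or tiny nodes
    if len(set(labels)) == 1 or len(rows) < min_size:
        return Counter(labels).most_common(1)[0][0]
    # "find best attribute and best split on attribute"
    best = None
    for attr in range(len(rows[0])):
        for value in {r[attr] for r in rows}:
            left = [i for i, r in enumerate(rows) if r[attr] < value]
            right = [i for i in range(len(rows)) if i not in left]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left]) +
                     len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, attr, value, left, right)
    if best is None:
        return Counter(labels).most_common(1)[0][0]
    _, attr, value, left, right = best
    # "partition data on split condition", then recurse on each child
    return {"attr": attr, "value": value,
            "left": gen_tree([rows[i] for i in left], [labels[i] for i in left]),
            "right": gen_tree([rows[i] for i in right], [labels[i] for i in right])}

tree = gen_tree([[15], [18], [25], [40]], ["bad", "bad", "good", "good"])
print(tree)  # {'attr': 0, 'value': 25, 'left': 'bad', 'right': 'good'}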
Support vector machines
 • Binary classifier: find the hyper-plane
   providing the maximum margin between
   the vectors of the two classes

 [Figure: points of two classes in the (fi, fj) feature plane, separated
 by the maximum-margin hyperplane]
Support Vector Machines
• Extendable to:
  – Non-separable problems (Cortes & Vapnik, 1995)
  – Non-linear classifiers (Boser et al., 1992)
• Good generalization performance
  – OCR (Boser et al.)
  – Vision (Poggio et al.)
  – Text classification (Joachims)
• Requires tuning: which kernel, what
  parameters?
• Several freely available packages: SVMTorch
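A minimal SVM sketch (scikit-learn's SVC assumed; the data is illustrative): the kernel and its parameters are exactly the tuning knobs the slide mentions.

from sklearn.svm import SVC

# Illustrative 2-D points from two classes.
X = [[0, 0], [1, 1], [2, 0], [3, 1]]
y = [0, 0, 1, 1]

# Linear kernel: maximum-margin separating hyperplane.
clf = SVC(kernel="linear").fit(X, y)

# RBF kernel: non-linear decision boundary via the kernel trick.
clf_rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print(clf.predict([[2.5, 0.5]]), clf_rbf.predict([[2.5, 0.5]]))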
Neural networks
 • Useful for learning complex data like
   handwriting, speech and image
   recognition
 [Figure: decision boundaries learned by linear regression (a straight
 line), a classification tree (axis-parallel regions), and a neural
 network (a smooth non-linear boundary)]
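A minimal sketch of such a non-linear boundary (scikit-learn's MLPClassifier assumed; the data and network size are illustrative):

from sklearn.neural_network import MLPClassifier

# XOR-like data that no linear boundary can separate.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
net.fit(X, y)
print(net.predict(X))  # with enough iterations, usually recovers [0 1 1 0]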
Bayesian learning
 • Assume a probability model on the
   generation of data.
 • Apply Bayes theorem to find the most
   likely class as:

     c = \arg\max_{c_j} p(c_j \mid d) = \arg\max_{c_j} \frac{p(d \mid c_j)\, p(c_j)}{p(d)}

 • Naïve Bayes: assume the attributes are
   conditionally independent given the class
   value:

     c = \arg\max_{c_j} \frac{p(c_j)}{p(d)} \prod_{i=1}^{n} p(a_i \mid c_j)
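A self-contained sketch of this rule (no library assumed; the weather-style attributes are made up). p(d) is constant across classes, so it is dropped; add-one smoothing is added here to guard against zero counts.

from collections import Counter, defaultdict

def train_nb(rows, labels):
    prior = Counter(labels)                  # class counts for p(c_j)
    cond = defaultdict(Counter)              # (class, attr index) -> value counts
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(c, i)][v] += 1
    return prior, cond, len(labels)

def predict_nb(row, prior, cond, n):
    def score(c):
        p = prior[c] / n                     # p(c_j)
        for i, v in enumerate(row):          # product of p(a_i | c_j)
            p *= (cond[(c, i)][v] + 1) / (prior[c] + 2)  # add-one smoothing
        return p
    return max(prior, key=score)             # argmax over classes

rows = [["sunny", "hot"], ["sunny", "mild"], ["rain", "mild"], ["rain", "hot"]]
labels = ["no", "no", "yes", "yes"]
prior, cond, n = train_nb(rows, labels)
print(predict_nb(["rain", "mild"], prior, cond, n))  # -> "yes"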
Meta learning methods
 • No single classifier is good in all cases
 • Difficult to evaluate the conditions in advance
 • Meta learning: combine the outputs of several
   classifiers
   – Voting: sum up the votes of component classifiers
   – Combiners: learn a new classifier on the outcomes
     of the previous ones
   – Boosting: staged classifiers
 • Disadvantage: interpretation is hard
   – Knowledge probing: learn a single classifier to
     mimic the meta classifier
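A minimal voting sketch (scikit-learn's VotingClassifier assumed; the component classifiers and data are illustrative): heterogeneous classifiers combined by majority vote.

from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]
y = [0, 0, 0, 1, 1, 1]

vote = VotingClassifier([("tree", DecisionTreeClassifier()),
                         ("nb", GaussianNB()),
                         ("knn", KNeighborsClassifier(n_neighbors=3))],
                        voting="hard")  # sum up (hard) votes
vote.fit(X, y)
print(vote.predict([[1.5, 1.5], [8.5, 8.5]]))  # -> [0 1]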
Outline
 • Mining operations (applications and methods for each)
   – Classification
   – Clustering
   – Association rule mining
   – Sequence mining
What is Cluster Analysis?
• Cluster: a collection of data objects
   – Similar to one another within the same cluster
   – Dissimilar to the objects in other clusters
• Cluster analysis
   – Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no
  predefined classes
• Typical applications
   – As a stand-alone tool to get insight into data
     distribution
   – As a preprocessing step for other algorithms
Applications
 • Customer segmentation, e.g. for
   targeted marketing
   – Group/cluster existing customers based on
     time series of payment history, such that
     similar customers fall in the same cluster
   – Identify micro-markets and develop
     policies for each
 • Image processing
 • Text clustering, e.g. scatter/gather
 • Compression
Distance functions
 • Numeric data: Euclidean, Manhattan distances
    – Minkowski metric: d(x, y) = \left( \sum_i |x_i - y_i|^m \right)^{1/m}
    – Larger m gives higher weight to larger distances
 • Categorical data: 0/1 to indicate
   presence/absence
    – Euclidean distance: equal weight to 1-1 and 0-0
      matches
    – Hamming distance (# of dissimilarities)
    – Jaccard coefficient: # of matching 1s / # of
      positions with at least one 1 (0-0 matches are
      not important)
    – data-dependent measures: similarity of A and B
      depends on co-occurrence with C
 • Combined numeric and categorical data: weighted
   normalized distance
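Plain-Python sketches of the measures above (illustrative; no library assumed):

def minkowski(x, y, m):
    # m = 1: Manhattan; m = 2: Euclidean; larger m stresses large gaps more.
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)

def hamming(x, y):
    # Number of positions where two 0/1 vectors disagree.
    return sum(a != b for a, b in zip(x, y))

def jaccard(x, y):
    # Matching 1s over positions with at least one 1; 0-0 matches ignored.
    both = sum(a and b for a, b in zip(x, y))
    either = sum(a or b for a, b in zip(x, y))
    return both / either if either else 1.0

print(minkowski([1, 2], [4, 6], 2))   # 5.0 (Euclidean)
print(hamming([1, 0, 1], [1, 1, 0]))  # 2
print(jaccard([1, 0, 1], [1, 1, 0]))  # 0.333...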
Clustering methods
 • Hierarchical clustering
   – agglomerative vs divisive
   – single link vs complete link
 • Partitional clustering
   – distance-based: k-means
   – model-based: EM
   – density-based methods
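A minimal sketch of k-means, the distance-based partitional method above (scikit-learn assumed; the points are illustrative):

from sklearn.cluster import KMeans

X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # one tight group
     [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]]   # another

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment per point
print(km.cluster_centers_)  # one centroid per cluster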
Outline
 • Mining operations (applications and methods for each)
   – Classification
   – Clustering
   – Association rule mining
   – Sequence mining
 • Two applications
   – Intrusion detection
   – Information extraction
Intrusion via privileged programs
 • Attacks exploit a loophole in the
   program to do illegal actions
     – Example: exploit buffer over-flows to run
       user code
 • What should we monitor in an executing
   privileged program to detect attacks?
 • Sequence of system calls
    – |S| = set of all possible system calls, ~100
 • Mining problem: given traces of previous
   normal executions, monitor a new
   execution and flag it as attack or normal
 • Challenge: is this possible given
   widely varying normal conditions?

 [Example trace: open, lseek, lstat, mmap, execve, ioctl, ioctl, close,
 execve, close, unlink]
Detecting attacks on privileged programs
  •   Short sequences of system calls made
      during normal executions of a program are
      very consistent, yet different from the
      sequences of its abnormal executions
  •   Each execution is a trace of system calls
      – ignore online traces for the moment
  •   Two approaches
      – STIDE
         •   Create a dictionary of the unique k-windows in normal traces,
             count what fraction occur in new traces, and threshold.
      – IDS
         •   next...
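A sketch of the STIDE idea (pure Python; the traces and threshold are illustrative): build a dictionary of unique k-windows from normal traces, then score a new trace by the fraction of its windows missing from that dictionary.

def windows(trace, k):
    return {tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)}

def train_stide(normal_traces, k):
    normal = set()
    for t in normal_traces:
        normal |= windows(t, k)   # dictionary of unique k-windows
    return normal

def anomaly_score(trace, normal, k):
    w = windows(trace, k)
    return sum(1 for win in w if win not in normal) / len(w)

normal_traces = [["open", "read", "mmap", "close"],
                 ["open", "mmap", "read", "close"]]
normal = train_stide(normal_traces, k=3)
score = anomaly_score(["open", "read", "execve", "close"], normal, k=3)
print("attack?" if score > 0.5 else "normal")  # threshold is a tuning choice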
Classification models on k-grams
 • When both normal and abnormal data are
   available
    – class label = normal/abnormal

      7-grams              class label
      4 2 66 66 4 138 66   "normal"
      5 5 5 4 59 105 104   "abnormal"
      …                    …

 • When only normal traces are available
    – class label = k-th system call

      6 attributes         class label
      4 2 66 66 4 138      "66"
      5 5 5 4 59 105       "104"
      …                    …

 Learn rules to predict the class label [RIPPER]
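A sketch of how the only-normal training set is built (pure Python; the trace is the slide's example extended with made-up calls): each sliding 7-gram becomes 6 attributes plus the 7th call as the class label, ready for a rule learner such as RIPPER.

def kgram_dataset(trace, k=7):
    rows = []
    for i in range(len(trace) - k + 1):
        gram = trace[i:i + k]
        rows.append((gram[:-1], gram[-1]))  # (attributes, class label)
    return rows

trace = [4, 2, 66, 66, 4, 138, 66, 5, 5]
for attrs, label in kgram_dataset(trace):
    print(attrs, "->", label)
# first row: [4, 2, 66, 66, 4, 138] -> 66, matching the table above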
Examples of output RIPPER rules
 • Both-traces:
    – if the 2nd system call is vtimes and the 7th is
      vtrace, then the sequence is “normal”
    – if the 6th system call is lseek and the 7th is
      sigvec, then the sequence is “normal”
    –…
    – if none of the above, then the sequence is
      “abnormal”
 • Only-normal:
    – if the 3rd system call is lstat and the 4th is write,
      then the 7th is stat
    – if the 1st system call is sigblock and the 4th is
      bind, then the 7th is setsockopt
    –…
    – if none of the above, then the 7th is open
Experimental results on sendmail

 • The output rule sets contain ~250 rules,
   each with 2 or 3 attribute tests
 • Score each trace by counting the fraction of
   mismatches and thresholding

   traces            Only-normal   Both
   sscp-1            13.5          32.2
   sscp-2            13.6          30.4
   sscp-3            13.6          30.4
   syslog-remote-1   11.5          21.2
   syslog-remote-2   8.4           15.6
   syslog-local-1    6.1           11.1
   syslog-local-2    8.0           15.9
   decode-1          3.9           2.1
   decode-2          4.2           2.0
   sm565a            8.1           8.0
   sm5x              8.2           6.5
   sendmail          0.6           0.1

 Summary: only normal traces are sufficient to
 detect intrusions
Information extraction
 • Automatically extract structured fields
   from unstructured documents
   – by learning from examples
 • Technology:
   – Graph models (Hidden Markov Models)
   – Probabilistic parsers
 • Applications:
   – Comparison shopping agents
   – Bibliography databases (citeseer)
   – Address elementization (IIT Bombay)
Problem definition
 Source: concatenation of structured
   elements with limited reordering and
   some missing fields
     – Examples: addresses, bib records

 Address example (element label in brackets):
   156 [House number] Hillside ctype [Building] Scenic drive [Road]
   Powai [Area] Mumbai [City] 400076 [Zip]

 Bib record example:
   P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick
   [Author] (1993) [Year] Protein and Solvent Engineering of Subtilisin
   BPN' in Nearly Anhydrous Organic Media [Title] J. Amer. Chem. Soc.
   [Journal] 115 [Volume] 12231-12237 [Page]
Learning to segment
 Given,
      – a list of structured elements
      – several examples showing the positions of the
        structured elements in text,
 train a model to identify them in unseen text.
 At the top level this is a classification problem.
 Issues:
     What are the input features?
     Build per-element classifiers or a single joint classifier?
     Which type of classifier to use?
     How much training data is required?
Input features
  Content of the element
    Specific keywords like street, zip, vol, pp
    Properties of words like capitalization,
     parts of speech, being a number
  Inter-element sequencing
  Intra-element sequencing
  Element length
IE with Hidden Markov Models
 • Probabilistic models for IE

 [Figure: an HMM over citation elements with states such as Author, Title,
 Year, and Journal. Edges carry transition probabilities (e.g. Author →
 Title 0.9, Year → Journal 0.8) and each state has emission probabilities
 over words (e.g. Title emits A 0.6, B 0.3, C 0.1; Year emits dddd 0.8,
 dd 0.2; Journal emits "journal" 0.4, ACM 0.2, IEEE 0.3)]
HMM Structure
   Naïve model: one state per element

             … Mahatma Gandhi Road Near Parkland ...

Nested model
   Each element is itself an HMM
   Example: [Mahatma Gandhi Road : Road] [Near Parkland : Landmark] ...
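Decoding with either structure is the Viterbi algorithm: find the state (element label) sequence with the highest probability for a word sequence. A minimal sketch with two states and made-up probabilities (all names and numbers here are illustrative):

def viterbi(words, states, start, trans, emit):
    V = [{s: start[s] * emit[s].get(words[0], 1e-6) for s in states}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] * trans[p][s])
            col[s] = V[-1][prev] * trans[prev][s] * emit[s].get(w, 1e-6)
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    best = max(states, key=lambda s: V[-1][s])  # best final state
    path = [best]
    for ptr in reversed(back):                  # follow pointers backwards
        path.append(ptr[path[-1]])
    return list(reversed(path))

states = ["Road", "Landmark"]
start = {"Road": 0.6, "Landmark": 0.4}
trans = {"Road": {"Road": 0.7, "Landmark": 0.3},
         "Landmark": {"Road": 0.3, "Landmark": 0.7}}
emit = {"Road": {"Mahatma": 0.3, "Gandhi": 0.3, "Road": 0.4},
        "Landmark": {"Near": 0.5, "Parkland": 0.5}}
print(viterbi(["Mahatma", "Gandhi", "Road", "Near", "Parkland"],
              states, start, trans, emit))
# -> ['Road', 'Road', 'Road', 'Landmark', 'Landmark']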
Results: Comparative Evaluation

   Dataset                  Instances   Elements
   IITB student addresses   2388        17
   Company addresses        769         6
   US addresses             740         6

            The nested model does best in all three cases
Mining market
 • Around 20 to 30 mining tool vendors
 • Major tool players:
   – SAS's Enterprise Miner
   – IBM's Intelligent Miner
   – SGI's MineSet
 • All offer pretty much the same set of tools
 • Many embedded products:
   – fraud detection
   – electronic commerce applications
   – health care
   – customer relationship management: Epiphany