Data mining in the internet age

Data mining
An overview of techniques and applications

Sunita Sarawagi
IIT Bombay
Data mining

• Process of semi-automatically analyzing
  large databases to find patterns that are:
  – valid: hold on new data with some certainty
  – novel: non-obvious to the system
  – useful: should be possible to act on the pattern
  – understandable: humans should be able to
    interpret the pattern
• Other names: Knowledge discovery in
  databases, Data analysis
Relationship with other fields
• Overlaps with machine learning, statistics,
  artificial intelligence, databases, visualization
  but with more stress on
   – scalability in the number of features and instances
   – algorithms and architectures, whereas the
     foundations of methods and formulations are
     provided by statistics and machine learning
   – automation for handling large, heterogeneous data
 • Mining operations
   – Classification
   – Clustering
   – Association rule mining
   – Sequence mining
 • Two applications
   – Intrusion detection
   – Information extraction
Classification

 • Given a table D of rows with
   columns X1..Xn, Y
   – Xi can be numeric or string
   – Special attribute Y, the class label
 • Training
   – Learn a classifier C that can
     predict the label Y in terms of
     X1, X2, .., Xn
   – C must hold for
      • examples in D
      • unseen data
 • Application
   – Use C to predict Y for new X-s

   [Figure: table of training rows over X1..Xn with class labels Y,
   feeding a classifier C that predicts +/- for a new row]
Automatic loan approval
• Given old data about customers and payments,
  predict new applicant’s loan eligibility.

   [Figure: previous customers (age, salary, profession, customer type)
   train a classifier; the resulting decision rules, e.g. Salary > 5 L
   and Prof. = Exec, label a new applicant's data good/bad]
Drug design: molecular Bioactivity
 • Predict activity of compounds binding to thrombin
 • Library of compounds included:
   – 1909 known molecules (42 actively binding thrombin)
 • 139,351 binary features describe the 3-D
   structure of each compound
 • 636 new compounds with unknown
   capacity to bind thrombin
Automatic webpage classification
• Several large categorized search directories
  – Yahoo, Dmoz (used in Google/Altavista)

• Web: 2 billion pages, and only a subset in these
  directories
• Existing taxonomies manually created
• Need to automatically classify new pages
Several classification methods
• Regression
• Decision tree classifiers
• Rule learners
• Neural networks
• Generative models
• Nearest neighbor
• Support vector machines

Choose based on
  – data type
  – number of attributes
  – number of classes
  – number of training examples
  – need for interpretation
Nearest neighbor
• Define similarity between instances
• Find neighbors of new instance in training data
   – K-NN approach: assign the majority class amongst the k
     nearest neighbours
   – weighted regression: learn a new regression
     equation by weighting each training instance based
     on distance from new instance

 • Pros
     + Fast training
 • Cons
     – Slow during application
     – No feature selection
     – Notion of proximity vague
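The K-NN approach above can be sketched in a few lines of Python — a minimal, illustrative version using Euclidean distance; the function name and the toy data are mine, not from the talk:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training instances, using Euclidean distance.
    `train` is a list of (feature_vector, label) pairs."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbours = sorted(train, key=lambda t: dist(t[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy data: two clusters on a line.
train = [((1.0,), "+"), ((1.2,), "+"), ((0.8,), "+"),
         ((5.0,), "-"), ((5.3,), "-")]
print(knn_predict(train, (1.1,)))  # → +
```

Note how the "slow during application" drawback shows up directly: every prediction scans the whole training set.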
Decision tree classifiers
 • Widely used learning method
 • Easy to interpret: can be re-represented
   as if-then-else rules
 • Approximates function by piece wise
   constant regions
 • Does not require any prior knowledge
   of data distribution; works well on noisy data
Decision trees

 • Tree where internal nodes are simple decision
   rules on one or more attributes and leaf nodes
   are predicted class labels.

   Salary < 20K?
     yes: Profession = teacher?
       yes: Good
       no:  Bad
     no:  Age < 30?
       yes: Bad
       no:  Good
Algorithm for tree building
 • Greedy top-down construction.

   Gen_Tree(node, data):
     if stopping criterion holds: make node a leaf and stop
     find the best attribute and the best split on that attribute
     partition data on the split condition
     for each child j of node: Gen_Tree(node_j, data_j)
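The "find best attribute and best split" step can be sketched as an exhaustive search that minimizes Gini impurity — one common choice; the slides do not fix a particular impurity measure, and the rows here are invented:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows):
    """Greedy step of Gen_Tree: try every (attribute, threshold)
    split and keep the one with the lowest weighted impurity.
    `rows` is a list of (features, label) pairs, numeric features."""
    best = None
    for attr in range(len(rows[0][0])):
        for threshold in {f[attr] for f, _ in rows}:
            left = [lab for f, lab in rows if f[attr] < threshold]
            right = [lab for f, lab in rows if f[attr] >= threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, attr, threshold)
    return best  # (weighted impurity, attribute index, threshold)

# Toy rows: (age, salary) -> good/bad; salary separates perfectly.
rows = [((25, 15), "bad"), ((40, 60), "good"),
        ((35, 8), "bad"), ((28, 50), "good")]
print(best_split(rows))  # → (0.0, 1, 50): split on salary at 50
```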
Support vector machines
 • Binary classifier: find hyper-plane
   providing maximum margin between
   vectors of the two classes


Support Vector Machines
• Extendable to:
  – Non-separable problems (Cortes & Vapnik, 1995)
  – Non-linear classifiers (Boser et al., 1992)
• Good generalization performance
  – OCR (Boser et al.)
  – Vision (Poggio et al.)
  – Text classification (Joachims)
• Requires tuning: which kernel, what parameters
• Several freely available packages: SVMTorch
Neural networks
 • Useful for learning complex data like
   handwriting, speech and images

   [Figure: decision boundaries learned by linear regression,
   a classification tree, and a neural network]
Bayesian learning
 • Assume a probability model on
   generation of data.
 • Apply Bayes theorem to find the most
   likely class as:

     c = argmax_{c_j} p(c_j | d) = argmax_{c_j} p(d | c_j) p(c_j) / p(d)

 • Naïve Bayes: assume attributes
   conditionally independent given the class
   value:

     c = argmax_{c_j} [p(c_j) / p(d)] · ∏_{i=1..n} p(a_i | c_j)
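A minimal Naïve Bayes sketch along the lines of the formula above; p(d) is dropped since it is constant across classes, and the Laplace smoothing and toy data are my additions:

```python
from collections import Counter, defaultdict

def train_nb(data):
    """Estimate p(c_j) and p(a_i | c_j) counts from
    (attributes, class) pairs."""
    prior = Counter(c for _, c in data)
    cond = defaultdict(Counter)      # (class, position) -> value counts
    for attrs, c in data:
        for i, a in enumerate(attrs):
            cond[c, i][a] += 1
    return prior, cond, len(data)

def predict_nb(model, attrs):
    """argmax over c_j of p(c_j) * prod_i p(a_i | c_j), with
    add-one (Laplace) smoothing to avoid zero probabilities."""
    prior, cond, n = model
    def score(c):
        s = prior[c] / n
        for i, a in enumerate(attrs):
            s *= (cond[c, i][a] + 1) / (prior[c] + 2)
        return s
    return max(prior, key=score)

data = [(("exec", "high"), "good"), (("exec", "high"), "good"),
        (("teacher", "low"), "bad"), (("clerk", "low"), "bad")]
model = train_nb(data)
print(predict_nb(model, ("exec", "high")))  # → good
```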
Meta learning methods
 • No single classifier is good in all cases
 • Difficult to evaluate in advance the conditions
   under which each classifier works best
 • Meta learning: combine the predictions of several classifiers
   – Voting: sum up votes of component classifiers
   – Combiners: learn a new classifier on the outcomes
     of previous ones
   – Boosting: staged classifiers
 • Disadvantage: interpretation is hard
   – Knowledge probing: learn a single classifier to
     mimic the meta classifier
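The voting scheme can be sketched as below; the three component classifiers are invented stand-ins, not from the talk:

```python
from collections import Counter

def vote(classifiers, x):
    """Combine component classifiers by unweighted voting:
    each casts one vote, and the majority class wins."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Three hypothetical component classifiers over a numeric score.
clfs = [lambda x: "+" if x > 3 else "-",
        lambda x: "+" if x > 5 else "-",
        lambda x: "+" if x > 4 else "-"]
print(vote(clfs, 4.5))  # two of three say "+"
```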
 • Mining operations
   – Classification
   – Clustering
   – Association rule mining
   – Sequence mining
      What is Cluster Analysis?
• Cluster: a collection of data objects
   – Similar to one another within the same cluster
   – Dissimilar to the objects in other clusters
• Cluster analysis
   – Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no
  predefined classes
• Typical applications
   – As a stand-alone tool to get insight into data
   – As a preprocessing step for other algorithms
 • Customer segmentation, e.g. for
   targeted marketing
   – Group/cluster existing customers based on
     time series of payment history such that
     similar customers fall in the same cluster
   – Identify micro-markets and develop
     policies for each
 • Image processing
 • Text clustering e.g. scatter/gather
 • Compression
Distance functions
 • Numeric data: Euclidean, Manhattan distances
    – Minkowski metric: [sum |xi-yi|^m]^(1/m)
    – Larger m gives higher weight to larger distances
 • Categorical data: 0/1 to indicate presence/absence
    – Euclidean distance: equal weight to 1s and 0s
    – Hamming distance (# of dissimilarities)
    – Jaccard coefficient: # of matching 1s / # of positions
      with at least one 1 (0-0 matches not counted)
    – data-dependent measures: similarity of A and B
      depends on co-occurrence with C
 • Combined numeric and categorical data: weighted
   normalized distance
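The Minkowski metric and Jaccard coefficient above, as minimal Python functions:

```python
def minkowski(x, y, m=2):
    """[sum |xi - yi|^m]^(1/m); m=1 is Manhattan, m=2 Euclidean."""
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1 / m)

def jaccard(x, y):
    """Similarity for 0/1 vectors: matching 1s over positions where
    at least one vector has a 1 (0-0 matches are ignored)."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 1.0

print(minkowski((0, 0), (3, 4)))             # → 5.0 (Euclidean)
print(minkowski((0, 0), (3, 4), m=1))        # → 7.0 (Manhattan)
print(jaccard((1, 1, 0, 0), (1, 0, 1, 0)))   # 1 match of 3 active positions
```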
Clustering methods
 • Hierarchical clustering
   – agglomerative vs divisive
   – single link vs complete link
 • Partitional clustering
   – distance-based: K-means
   – model-based: EM
   – density-based
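A bare-bones sketch of distance-based K-means as listed above — naive initialization from the first k points for brevity; real implementations choose starting centres more carefully, and the data is invented:

```python
def kmeans(points, k, iters=20):
    """Plain K-means: assign each point to the nearest centre,
    then move each centre to the mean of its assigned points."""
    centres = list(points[:k])                # naive initialisation
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(p, centres[j])))
            clusters[nearest].append(p)
        centres = [tuple(sum(col) / len(cl) for col in zip(*cl))
                   if cl else centres[j]
                   for j, cl in enumerate(clusters)]
    return centres, clusters

points = [(1.0,), (1.2,), (0.9,), (8.0,), (8.2,)]
centres, clusters = kmeans(points, 2)
print(sorted(c[0] for c in centres))  # one centre near 1, one near 8
```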
 • Mining operations
   – Classification
   – Clustering
   – Association rule mining
   – Sequence mining
 • Two applications
   – Intrusion detection
   – Information extraction
Intrusion via privileged programs
 • Attacks exploit a loophole in the
   program to do illegal actions
     – Example: exploit buffer over-flows to run
       user code
 • What to monitor of an executing
   privileged program to detect attacks?
 • Sequence of system calls
    – |S| = set of all possible system calls, ~100
      (example trace: open, lseek, lstat, mmap,
      ioctl, ioctl, execve, close)
 • Mining problem: given traces of previous
   normal executions, monitor a new
   execution and flag it attack or normal
 • Challenge: is it possible to do this given
   widely varying normal conditions?
Detecting attacks on privileged programs
  •   Short sequences of system calls made
      during normal executions of a program are
      very consistent, yet different from the
      sequences of its abnormal executions
  •   Each execution a trace of system calls:
      – ignore online traces for the moment
  •   Two approaches
      – STIDE
         •   Create dictionary of unique k-windows in normal traces,
             count what fraction occur in new traces and threshold.
      – IDS
         •   next...
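The STIDE dictionary-of-k-windows idea can be sketched as below; the trace contents are invented:

```python
def kwindows(trace, k):
    """All length-k sliding windows over a system-call trace."""
    return [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]

def stide_score(normal_traces, new_trace, k=3):
    """STIDE as described above: build a dictionary of the unique
    k-windows seen in normal traces, then report the fraction of
    the new trace's windows that are missing from it (to be
    compared against a threshold)."""
    known = {w for t in normal_traces for w in kwindows(t, k)}
    windows = kwindows(new_trace, k)
    misses = sum(1 for w in windows if w not in known)
    return misses / len(windows)

normal = [["open", "read", "write", "close"],
          ["open", "read", "read", "write", "close"]]
print(stide_score(normal, ["open", "read", "write", "close"]))   # → 0.0
print(stide_score(normal, ["open", "mmap", "execve", "close"]))  # → 1.0
```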
Classification models on k-grams
 • When both normal and abnormal data are available:

     7-grams              class labels
     4 2 66 66 4 138 66   “normal”
     5 5 5 4 59 105 104   “abnormal”
     …                    …

    – class label = normal/abnormal

 • When only normal traces are available:

     6 attributes         class labels
     4 2 66 66 4 138      “66”
     5 5 5 4 59 105       “104”
     …                    …

    – class label = k-th system call

 Learn rules to predict the class label [RIPPER]
Examples of output RIPPER rules
 • Both-traces:
    – if the 2nd system call is vtimes and the 7th is
      vtrace, then the sequence is “normal”
    – if the 6th system call is lseek and the 7th is
      sigvec, then the sequence is “normal”
     – if none of the above, then the sequence is
       “abnormal”
 • Only-normal:
    – if the 3rd system call is lstat and the 4th is write,
      then the 7th is stat
    – if the 1st system call is sigblock and the 4th is
      bind, then the 7th is setsockopt
    – if none of the above, then the 7th is open
 Experimental results on sendmail

 • The output rule sets contain ~250 rules,
   each with 2 or 3 attribute tests
 • Score each trace by counting the fraction of
   mismatches and thresholding

   traces            Only-normal   BOTH
   sscp-1            13.5          32.2
   sscp-2            13.6          30.4
   sscp-3            13.6          30.4
   syslog-remote-1   11.5          21.2
   syslog-remote-2   8.4           15.6
   syslog-local-1    6.1           11.1
   syslog-local-2    8.0           15.9
   decode-1          3.9           2.1
   decode-2          4.2           2.0
   sm565a            8.1           8.0
   sm5x              8.2           6.5
   sendmail          0.6           0.1

 Summary: only normal traces are sufficient to
 detect intrusions
Information extraction
 • Automatically extract structured fields
   from unstructured documents
   – by learning from examples
 • Technology:
   – Graph models (Hidden Markov Models)
   – Probabilistic parsers
 • Applications:
   – Comparison shopping agents
   – Bibliography databases (citeseer)
   – Address elementization (IIT Bombay)
Problem definition
 Source: concatenation of structured
   elements with limited reordering and
   some missing fields
     – Example: Addresses, bib records
     Example address, segmented:
       156 [number] Hillside ctype [building] Scenic drive [road]
       Powai [area] Mumbai [city] 400076 [zip]

     Example bib record, segmented:
       P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark,
       J.S. Dordick [Author] (1993) [Year] Protein and Solvent
       Engineering of Subtilisin BPN' in Nearly Anhydrous Organic
       Media [Title] J. Amer. Chem. Soc. 115 [Journal]
Learning to segment
  Given:
      – a list of structured elements
      – several examples showing the positions of structured
        elements in text
  Train a model to identify them in unseen text
  At the top level, a classification problem:
      – What are the input features?
      – Build per-element classifiers or a single joint classifier?
      – Which type of classifier to use?
      – How much training data is required?
Input features
  – Content of the element
     • Specific keywords like street, zip, vol, pp
     • Properties of words like capitalization,
       parts of speech, being a number
  – Inter-element sequencing
  – Intra-element sequencing
  – Element length
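Content features of the kind listed above could be computed per token roughly like this; the exact feature set and names are hypothetical, chosen to mirror the bullets:

```python
import re

def word_features(word):
    """Content features for one token: keyword hints,
    capitalization, and digit patterns."""
    return {
        "is_keyword": word.lower() in {"street", "road", "zip", "vol", "pp"},
        "capitalized": word[:1].isupper(),
        "all_digits": word.isdigit(),
        "digit_pattern": re.sub(r"\d", "d", word),  # 400076 -> dddddd
    }

print(word_features("Road"))
print(word_features("400076"))
```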
IE with Hidden Markov Models
 • Probabilistic models for IE
   [Figure: HMM over states Author, Title, Year, Journal with
   transition probabilities (e.g. Author → Title 0.9, Title → Journal
   0.5) and per-state emission probabilities (e.g. Author emits
   Letter 0.3, Word 0.5, Et. al 0.1; Year emits dddd 0.8, dd 0.2;
   Title emits A 0.6, B 0.3, C 0.1; Journal emits journal 0.4,
   ACM 0.2, IEEE 0.3)]
HMM Structure
 • Naïve model: one state per element

     … Mahatma Gandhi Road Near Parkland ...

 • Nested model: each element is itself an HMM

     … [Mahatma Gandhi Road Near : Landmark] Parkland ...
Results: Comparative Evaluation

   Dataset     instances   Elements
   IITB        2388        17
   Company     769         6
   US          740         6

   The Nested model does best in all three cases
Mining market
 • Around 20 to 30 mining tool vendors
 • Major tool players:
   – SAS’s Enterprise Miner
   – IBM’s Intelligent Miner
   – SGI’s MineSet
 • All pretty much the same set of tools
 • Many embedded products:
   –   fraud detection
   –   electronic commerce applications
   –   health care
   –   customer relationship management: Epiphany
