Data Mining and Scalability


Lauren Massa-Lochridge
Nikolay Kojuharov
Hoa Nguyen
Quoc Le

 Data Mining Overview
 Scalability Challenges & Approaches.
 Overview – Association rules.
 Case study - BIRCH – An Efficient Data
  Clustering Method for VLDB.
 Case Study – Scientific Data Mining.
 Q&A
Data Mining: Rationale
   Data size
     Data in databases is estimated to double every year.
     The number of people who look at the data stays constant.

   Complexity
     The analysis is complex.
     The characteristics and relationships are often
      unexpected and unintuitive.
   Knowledge discovery tools and algorithms are
    needed to make sense and use of data
Data Mining: Rationale (cont’d)
   As of 2003, France Telecom had the largest decision-support
    DB, ~30 TB; AT&T was second with a 26 TB database.
   Some of the largest databases on the Web, as of 2003:
         Alexa internet archive: 7 years of data, 500 TB
         Internet Archive: ~300 TB
         Google: over 4 billion pages, many, many TB
   Applications
         Business – analyze inventory, predict customer acceptance, etc.
         Science – find correlation between genes and diseases, pollution
          and global warming, etc.
         Government – uncover terrorist networks, predict flu pandemic,

Adapted from: Data Mining, and Knowledge Discovery: An Introduction,
Data Mining: Definition
   Semi-automatic discovery of patterns, changes,
    anomalies, rules, and statistically significant structures
    and events in data.

   Nontrivial extraction of implicit, previously unknown, and
    potentially useful information from data

   Data mining is often done on targeted, preprocessed,
    transformed data.
       Targeted: data fusion, sampling.
       Preprocessed: Noise removal, feature selection, normalization.
       Transformed: Dimension reduction.
Data Mining: Evolution

Evolutionary Step       Business Question               Enabling Technologies       Product Providers           Characteristics
----------------------  ------------------------------  --------------------------  --------------------------  ---------------------------
Data Collection ('60s)  "What was my total revenue in   Computers, tapes, disks     IBM, CDC                    Retrospective, static data
                        the last five years?"                                                                   delivery

Data Access ('80s)      "What were unit sales in        RDBMS, SQL, ODBC            Oracle, Sybase, Informix,   Retrospective, dynamic data
                        New England last March?"                                    IBM, Microsoft              delivery at record level

Data Warehousing ('90s) "What were unit sales in        OLAP, multidimensional      Pilot, Comshare, Arbor,     Retrospective, dynamic data
                        New England last March?         databases, data warehouses  Cognos, Microstrategy       delivery at multiple levels
                        Drill down to Boston."

Data Mining (today)     "What’s likely to happen to     Advanced algorithms,        Pilot, Lockheed, IBM, SGI,  Prospective, proactive
                        Boston unit sales next          multiprocessor computers,   numerous startups           information delivery
                        month? Why?"                    massive databases           (nascent industry)
Adapted from: An Introduction to Data Mining,
Data Mining: Approaches
   Clustering – identify natural groupings within the data.
   Classification – learn a function to map a data item into
    one of several predefined classes.
   Summarization – describe groups, summary statistics, etc.
   Association – identify data items that occur frequently
    together.
   Prediction – predict values or distribution of missing
    data.
   Time-series analysis – analyze data to find periodicity,
    trends, deviations.
Scalability & Performance
    Scaling and performance are often considered together in data mining (DM).
     The problem of scalability in DM is not only how to process such large
     sets of data, but how to do it within a useful timeframe.
    Many of the scalability issues in DM and DBMS are similar to the
     performance-scaling issues of data management in general.
    Dr. Gregory Piatetsky-Shapiro and Prof. Gary Parker (P&P) identify the
     main issue for clustering algorithms in general, as an approach to DM,
     as follows: “The main issue in clustering is how to evaluate the quality of
     potential grouping. There are many methods, ranging from manual,
     visual inspection to a variety of mathematical measures that minimize
     the similarity of items within the cluster and maximize the difference
     between the clusters."
Common DM Scaling Problem
 Algorithms generally:
  Operate on data under the assumption that the
   entire data set is processed in memory
  Operate under the assumption that KIWI ("kill it
   with iron", i.e. adding hardware) will be used to
   address I/O and other performance scaling issues
  Or just don't address scalability within
   resource constraints at all
Data Mining: Scalability
   Large Datasets
     Use scalable I/O architecture - minimize I/O, make it
      fit, make it fast.
     Reduce data - aggregation, dimensional reduction,
      compression, discretization.
   Complex Algorithms
     Reduce algorithm complexity
     Exploit parallelism, use specialized hardware
   Complex Results
     Effective visualization
     Increase understanding, trustworthiness
Scalable I/O Architecture
   Shared memory parallel computers: local +
    global memory. Locking is used to synchronize.
   Distributed memory parallel computers:
    message passing / remote memory operations.
   Parallel disk: B records form one block (the unit of
    transfer); D blocks can be read or written at once.
   Primitives: Scatter, Gather, Reduction
   Data parallelism or task parallelism
Scaling – General Approaches
   Question: How can we tackle memory
    constraints and efficiency?
     Statistics: Manipulate data so that it fits into memory –
      sampling, feature selection, partitioning, summarization.
     Database: Reduce the time to access out-of-memory
      data – specialized data structures, block reads, parallel
      block reads.
     High Performance Computing: Use several processors.
     Data Mining Implementation: Efficient DM primitives, pre-computation.
     Misc.: Reduce the amount of data – discretization,
      compression, transformation.
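
As one illustration of the "sampling" approach above, here is a minimal sketch of reservoir sampling, which keeps a fixed-size uniform sample of a data stream that is too large to hold in memory. The technique is a standard one and is not named in the slides; the function name `reservoir_sample` and its parameters are assumptions made for this example.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items drawn uniformly from a stream, in one pass.

    Only k items are ever held in memory, regardless of stream length.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sample: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the n-th item with probability k / n.
            j = rng.randrange(n)  # uniform integer in [0, n)
            if j < k:
                sample[j] = item
    return sample


picked = reservoir_sample(range(1_000_000), k=5)
print(len(picked))
```

The sample can then be mined in memory, at the price of working with an approximation of the full data set.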
Association rules
 Beer-Diapers example.
 Basket data analysis, cross-marketing, sales
  campaigns, Web-log analysis, etc.
 Introduced for the first time in 1993:
     "Mining Association Rules between Sets of Items in
      Large Databases"
      by R. Agrawal et al., SIGMOD 93 Conference.
AR Applications
X ==> Y:
 What are the interesting applications?
   Find all rules with “bagels” as X?
        What should be shelved together with bagels?
        What would be impacted if the store stopped selling bagels?
   Find all rules with “Diet Coke” as Y?
        What should the store do to promote Diet Coke?
   Find all rules relating any items in Aisles 1 and 2?
        Shelf planning, to see whether the two aisles are related.
AR: Input & Output
   Input:
     a database of sales “transactions”:

        TID   Items
        100   1 3 4
        200   2 3 5
        300   1 2 3 5
        400   2 5

     parameters:
        minimal support: say 50%
        minimal confidence: say 100%
   Output:
     rule 1: {2, 3} ==> {5}: s =?, c =?
     rule 2: {3, 5} ==> {2}
     more … …
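
The support (s) and confidence (c) of the two rules can be checked with a short sketch. It assumes the usual definitions (support of X ==> Y is the fraction of transactions containing X and Y together; confidence is that support divided by the support of X) and uses the four transactions with TIDs 100 to 400 from this slide. The function names are hypothetical.

```python
from typing import Dict, Iterable, Set

# The four example transactions from the slide, keyed by TID.
TRANSACTIONS: Dict[int, Set[int]] = {
    100: {1, 3, 4},
    200: {2, 3, 5},
    300: {1, 2, 3, 5},
    400: {2, 5},
}


def support(itemset: Iterable[int]) -> float:
    """Fraction of transactions that contain every item of the itemset."""
    items = set(itemset)
    hits = sum(1 for t in TRANSACTIONS.values() if items <= t)
    return hits / len(TRANSACTIONS)


def confidence(lhs: Iterable[int], rhs: Iterable[int]) -> float:
    """support(lhs and rhs together) / support(lhs); lhs must occur at least once."""
    return support(set(lhs) | set(rhs)) / support(lhs)


print(support({2, 3, 5}), confidence({2, 3}, {5}))  # 0.5 1.0
print(support({2, 3, 5}), confidence({3, 5}, {2}))  # 0.5 1.0
```

Under these definitions both rules meet the 50% support and 100% confidence thresholds.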
Apriori: A Candidate Generation-and-Test Approach
   Apriori pruning principle: If any itemset is
    infrequent, its supersets should not be generated/tested!
    (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’
   Method:
       Initially, scan the DB once to get the frequent 1-itemsets
       Generate length (k+1) candidate itemsets from the length k frequent itemsets
       Test the candidates against the DB
       Terminate when no frequent or candidate set can be generated
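
The generate-and-test loop above can be sketched as follows. This is a minimal, unoptimized illustration rather than the implementation from the cited papers: candidates are formed by joining pairs of frequent k-itemsets, pruned with the Apriori principle, and then counted in one pass over the transactions. The function name `apriori` and its signature are assumptions.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def apriori(transactions: List[Set[int]], min_support: float) -> Dict[FrozenSet[int], int]:
    """Return every frequent itemset with its absolute support count."""
    min_count = min_support * len(transactions)

    # Initial scan: count single items to get the frequent 1-itemsets.
    counts: Dict[FrozenSet[int], int] = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 1
    while frequent:
        # Generate: join frequent k-itemsets into (k+1)-item candidates.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k + 1:
                continue
            # Prune: every k-subset of a candidate must itself be frequent.
            if all(frozenset(sub) in frequent for sub in combinations(union, k)):
                candidates.add(union)

        # Test: count each surviving candidate against the database.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result


db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
found = apriori(db, min_support=0.5)
print(found[frozenset({2, 3, 5})])  # 2
```

On the four example transactions with 50% minimal support, the candidate {1, 2, 3} is never counted because its subset {1, 2} is infrequent.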
Scaling Attempts
   Count Distribution: distribute transaction data among
    processors and count the transactions in parallel.
    Scales linearly with the number of transactions.
   Savasere et al. (VLDB95): Partition data and scan twice
    (local, global).
   Toivonen (VLDB96): Sampling – verification of closure.
   Brin et al. (SIGMOD97): Dynamic itemset counting.
   Pei & Han (SIGMOD00): Compact Description (FP-tree),
    no candidate generation. Scale up with partition-based
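
The Count Distribution idea listed above can be sketched in a few lines: the transactions are scattered across workers, each worker counts the same candidate itemsets on its own partition, and the partial counts are summed in a reduction step. This is only an illustration; threads stand in for the separate processors, and the function names and the round-robin partitioning are assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import FrozenSet, List, Sequence, Set

Itemset = FrozenSet[int]


def local_counts(partition: Sequence[Set[int]], candidates: Sequence[Itemset]) -> Counter:
    """Count candidate occurrences within one partition of the transactions."""
    counts: Counter = Counter()
    for transaction in partition:
        for cand in candidates:
            if cand <= transaction:
                counts[cand] += 1
    return counts


def count_distribution(
    transactions: List[Set[int]], candidates: Sequence[Itemset], n_workers: int = 2
) -> Counter:
    # Scatter: round-robin partition of the transactions over the workers.
    partitions = [transactions[i::n_workers] for i in range(n_workers)]
    # Each worker counts all candidates on its own partition.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda part: local_counts(part, candidates), partitions))
    # Reduction: sum the partial counts into global support counts.
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
cands = [frozenset({2, 5}), frozenset({1, 3}), frozenset({1, 2})]
print(count_distribution(db, cands)[frozenset({2, 5})])  # 3
```

Each worker touches only its share of the transactions, which is why the counting work grows linearly with the number of transactions.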

BIRCH Approach
   Informal definition: "data clustering identifies the sparse and the
    crowded places and hence discovers the overall distribution patterns
    of the data set."
   BIRCH belongs to the category of hierarchical clustering algorithms
    that use a distance measure; K-Means is another example of a
    distance-based clustering method.
   Approach: "statistical identification of clusters, i.e. densely populated
    regions in multi-dimensional dataset, given the desired number of
    clusters K. and a dataset of N points, and a distance based
   Problem: other approaches (distance-based, hierarchical, etc.)
    are all similar in terms of scaling and resource utilization.
BIRCH Novelty
  First algorithm proposed in the database
   area that filters out “noise”, i.e. outliers
  Prior work does not adequately address
   large data sets with minimization of I/O
  Prior work does not address issues of
   data set fit to memory
  Prior work does not address resource
   utilization or resource constraints in
   scalability and performance
Database / DM Oriented Constraints
   Resource utilization means maximizing the use of available
    resources, as opposed to just working within resource
    constraints, which alone does not necessarily
    optimize utilization.
   Resource utilization is important in DM scaling, or in
    any case where the data sets are very large.
   A single BIRCH scan of the data set yields at least a
    “good enough” clustering.
   One or more additional passes are optional and,
    depending on the specific constraints of a particular
    system and application, can be used to improve the
    quality over and above “good enough”.
Database Oriented Constraints

   Database-oriented constraints are what
    differentiate BIRCH from more general DM algorithms
   Limited acceptable response time
   Resource utilization – optimize, not just work
    within available resources – necessary for
    very large data sets
   Fit to available memory
   Minimize I/O costs
   Need I/O cost linear in size of data set
Features of BIRCH Solution:

    Locality of reference: each unit clustering
     decision is made without scanning all data points
     or all existing clusters.
    Clustering decision: measurements reflect the
     natural "closeness" of points.
    Locality enables clusters to be incrementally maintained
     and updated during the clustering process.
    Optional removal of outliers:
       Cluster equals dense region of points.
       Outlier equals point in sparse region.
More Features of BIRCH Solution:

   Optimal memory usage: maximize utilization while
    staying within resource constraints.
   Finest possible sub clusters, given memory resource
    and I/O/time constraints:
      Finest clusters given memory implies best
       accuracy achievable (another type of optimal
   Minimize I/O costs:
      implies efficiency and required response time.
More Features of BIRCH Solution

 Running time scales linearly with the size of the
  data set.
 Optionally, incremental scan of the data set, i.e. the
  entire data set does not have to be scanned in
  advance, and the increments are adjustable.
 Scans the complete data set only once (others
  scan multiple times).
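Background (Single cluster)
   Given N d-dimensional data points {X_i}, i = 1, ..., N:

   “Centroid”:
      $$X_0 = \frac{\sum_{i=1}^{N} X_i}{N}$$

   “Radius”:
      $$R = \left( \frac{\sum_{i=1}^{N} (X_i - X_0)^2}{N} \right)^{1/2}$$

   “Diameter”:
      $$D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (X_i - X_j)^2}{N(N-1)} \right)^{1/2}$$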
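
A direct computation of the three single-cluster quantities may help to fix the definitions. The sketch below assumes the standard BIRCH definitions (centroid as the mean point, radius as the root mean squared distance to the centroid, diameter as the root mean squared pairwise distance) and represents points as plain tuples; the function names are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: List[Point]) -> List[float]:
    n, d = len(points), len(points[0])
    return [sum(p[k] for p in points) / n for k in range(d)]


def radius(points: List[Point]) -> float:
    """Root mean squared distance from the points to the centroid."""
    x0 = centroid(points)
    return math.sqrt(sum(sq_dist(p, x0) for p in points) / len(points))


def diameter(points: List[Point]) -> float:
    """Root mean squared pairwise distance; needs at least two points."""
    n = len(points)
    total = sum(sq_dist(p, q) for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))


cluster = [(0.0, 0.0), (2.0, 0.0)]
print(centroid(cluster), radius(cluster), diameter(cluster))  # [1.0, 0.0] 1.0 2.0
```

Note that the diameter computed this way needs all pairs of points, which is quadratic in N.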
Background (two clusters)
Given the centroids X_0 and Y_0 of two clusters:
   The centroid Euclidean distance D0:
      $$D_0 = \left( (X_0 - Y_0)^2 \right)^{1/2}$$

   The centroid Manhattan distance D1:
      $$D_1 = |X_0 - Y_0| = \sum_{i=1}^{d} \left| X_0^{(i)} - Y_0^{(i)} \right|$$
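Background (two clusters)
   Average inter-cluster distance D2, for clusters {X_i} of size N_1 and {Y_j} of size N_2:
      $$D_2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=1}^{N_2} (X_i - Y_j)^2}{N_1 N_2} \right)^{1/2}$$

   Average intra-cluster distance D3, taken over all pairs of points Z_k of the merged cluster (the X_i and Y_j together):
      $$D_3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (Z_i - Z_j)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}$$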
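
The four two-cluster distances can be compared on a small example. This sketch assumes the definitions from the BIRCH paper; in particular, D3 is taken to be the diameter of the merged cluster, i.e. the average over all pairs of points from both clusters together. Function names are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: List[Point]) -> List[float]:
    n, d = len(points), len(points[0])
    return [sum(p[k] for p in points) / n for k in range(d)]


def d0(xs: List[Point], ys: List[Point]) -> float:
    """Centroid Euclidean distance."""
    return math.sqrt(sq_dist(centroid(xs), centroid(ys)))


def d1(xs: List[Point], ys: List[Point]) -> float:
    """Centroid Manhattan distance."""
    return sum(abs(a - b) for a, b in zip(centroid(xs), centroid(ys)))


def d2(xs: List[Point], ys: List[Point]) -> float:
    """Average inter-cluster distance: pairs with one point from each cluster."""
    total = sum(sq_dist(x, y) for x in xs for y in ys)
    return math.sqrt(total / (len(xs) * len(ys)))


def d3(xs: List[Point], ys: List[Point]) -> float:
    """Average intra-cluster distance: diameter of the merged cluster (assumed)."""
    merged = list(xs) + list(ys)
    n = len(merged)
    total = sum(sq_dist(p, q) for p in merged for q in merged)
    return math.sqrt(total / (n * (n - 1)))


a = [(0.0, 0.0), (2.0, 0.0)]
b = [(4.0, 0.0)]
print(d0(a, b), d1(a, b))  # 3.0 3.0
print(d2(a, b), d3(a, b))
```

D0 and D1 need only the two centroids, while D2 and D3 as written here visit every pair of points.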
Clustering Feature
CF = (N, LS, SS)
     N = |C|: “number of data points”
     $$LS = \sum_{i=1}^{N} X_i$$: “linear sum of N data points”

     $$SS = \sum_{i=1}^{N} X_i^2$$: “square sum of N data points”

 Summarization of cluster
CF Additive Theorem
   Assume CF1 = (N1, LS1, SS1), CF2 = (N2, LS2, SS2). Then
      $$CF_1 + CF_2 = (N_1 + N_2,\; LS_1 + LS_2,\; SS_1 + SS_2)$$
   Information stored in CFs is sufficient to compute:
        Centroids
        Measures for the compactness of clusters

        Distance measure for clusters
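
A small sketch shows why the CF triple suffices. It assumes the identities that follow from expanding the squares: R^2 = SS/N - |LS/N|^2 and D^2 = (2·N·SS - 2·|LS|^2) / (N(N-1)). The class name `CF` and its methods are hypothetical, not taken from the BIRCH paper.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class CF:
    n: int                   # N: number of data points
    ls: Tuple[float, ...]    # LS: linear sum of the points (a vector)
    ss: float                # SS: sum of squared norms of the points

    @staticmethod
    def of_point(x: Sequence[float]) -> "CF":
        return CF(1, tuple(x), sum(v * v for v in x))

    def __add__(self, other: "CF") -> "CF":
        # CF Additive Theorem: merge two clusters by adding componentwise.
        ls = tuple(a + b for a, b in zip(self.ls, other.ls))
        return CF(self.n + other.n, ls, self.ss + other.ss)

    def centroid(self) -> Tuple[float, ...]:
        return tuple(v / self.n for v in self.ls)

    def radius(self) -> float:
        c2 = sum(v * v for v in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

    def diameter(self) -> float:
        if self.n < 2:
            return 0.0  # convention chosen here for a single point
        ls2 = sum(v * v for v in self.ls)
        num = 2 * self.n * self.ss - 2 * ls2
        return math.sqrt(max(num / (self.n * (self.n - 1)), 0.0))


merged = CF.of_point((0.0, 0.0)) + CF.of_point((2.0, 0.0))
print(merged.centroid(), merged.radius(), merged.diameter())  # (1.0, 0.0) 1.0 2.0
```

The merged cluster's statistics are obtained without revisiting any of the original points, which is what allows a cluster to be summarized in constant space.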
CF Tree
   Height-balanced tree
   Two parameters:
       Branching factor
           B: An internal node contains at most B entries [CFi, childi]
           L: A leaf node contains at most L entries [CFi]
       Threshold T
           The diameter of each entry in a leaf node is at most T
   Leaf nodes are connected via prev and next
    pointers, which makes data scans efficient
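
The roles of the threshold T and the leaf capacity L can be illustrated with a much simplified leaf insertion. The sketch assumes a single leaf held as a Python list of (N, LS, SS) tuples, picks the closest entry by centroid distance, and only reports when a split would be needed rather than performing it; a real CF tree would also descend from the root and update the parent entries. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

CF = Tuple[int, Tuple[float, ...], float]  # (N, LS, SS)


def cf_of(x: Sequence[float]) -> CF:
    return (1, tuple(x), sum(v * v for v in x))


def cf_add(a: CF, b: CF) -> CF:
    return (a[0] + b[0], tuple(p + q for p, q in zip(a[1], b[1])), a[2] + b[2])


def cf_diameter(cf: CF) -> float:
    n, ls, ss = cf
    if n < 2:
        return 0.0
    num = 2 * n * ss - 2 * sum(v * v for v in ls)
    return math.sqrt(max(num / (n * (n - 1)), 0.0))


def centroid_dist(a: CF, b: CF) -> float:
    return math.sqrt(sum((p / a[0] - q / b[0]) ** 2 for p, q in zip(a[1], b[1])))


def insert_into_leaf(leaf: List[CF], x: Sequence[float], T: float, L: int) -> str:
    """Absorb x into the closest entry if its diameter stays within T."""
    new = cf_of(x)
    if leaf:
        i = min(range(len(leaf)), key=lambda k: centroid_dist(leaf[k], new))
        merged = cf_add(leaf[i], new)
        if cf_diameter(merged) <= T:
            leaf[i] = merged  # threshold respected: absorb the point
            return "absorbed"
    if len(leaf) < L:
        leaf.append(new)      # room left in the leaf: start a new entry
        return "new entry"
    return "split needed"     # leaf already holds L entries


leaf: List[CF] = []
for point in [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (20.0, 0.0)]:
    print(point, insert_into_leaf(leaf, point, T=1.5, L=2))
```

With T = 1.5 and L = 2, the second point is absorbed into the first entry, the third starts a new entry, and the fourth would force a split.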
CF tree example
BIRCH Algorithm Scaling Details
The CF / CF tree is used to optimize clusters for memory
  and I/O:
   P: page size (page of memory)

   Tree size is a function of T: larger T -> smaller CF tree
   Each node is required to fit in a memory page of size P ->
    split to fit, or merge for optimal utilization
   P can be varied on the system or in the algorithm for
    performance tuning and scaling
BIRCH Algorithm Steps
Phase 1
   1. Start CF tree t1 of initial T.
   2. Continue scanning data and insert into t1.
        On "out of memory": go to step 3.
        On "finish scanning data": go to step 5.
   3. Rebuild with a larger threshold:
      • increase T
      • rebuild CF tree t2 of new T from CF tree t1. If a leaf entry is a
        potential outlier, write it to disk. Otherwise use it.
      • t1 <- t2
   4. Result?
        Out of disk space: re-absorb potential outliers into t1, then go to step 2.
        Otherwise: go to step 2.
   5. Re-absorb potential outliers into t1.
   I/O cost:
        $$d \cdot N \cdot B \left(1 + \log_B \frac{M}{P}\right) + \log_2 \frac{N}{N_0} \cdot d \cdot \frac{M}{ES} \cdot B \left(1 + \log_B \frac{M}{P}\right)$$
       N: Number of data points
       M: Memory size
       P: Page size
       d: dimension
       N0: Number of data points loaded into memory with
        threshold T0
    Data mining: The semi-automatic discovery of
     patterns, associations, anomalies, and
     statistically significant structures of data.
    Pattern recognition: The discovery and
     characterization of patterns
    Pattern: An ordering with an underlying structure
    Feature: Extractable measurement or attribute
Scientific data mining

     Figure 1. Key steps in scientific data mining
Data mining is essential
   Scientific data set is very
     Multi-sensor, multi-resolution, multi-spectral data
     High-dimensional data
     Mesh data from simulation
     Data contaminated with noise
          Sensor noise, clouds, atmospheric
           turbulence, …
Data mining is essential
   Massive dataset
       Advances in technology allow us to collect ever-increasing
        amounts of scientific data (in experiments, observations, and
        simulations)
            Astronomy datasets with tens of millions of galaxies
            Sloan Digital Sky Survey: Assuming a pixel size of about 0.25”,
             the whole sky is 10 terapixels (at 2 bytes/pixel, about 20 terabytes)

       Collection of data made possible by advances in:
            Sensors (telescopes, satellites, …)
            Computers and storages (faster, parallel, …)

     We need fast and accurate data analysis techniques to realize the full
     potential of our enhanced data-collecting ability; manual
     techniques are infeasible at this scale.
Data mining in astronomy
   FIRST: Detecting radio-emitting stars
     Dataset: 100 GB of image data (1996)
                Image Map

    16K image maps, 7.1MB each
Data mining in astronomy
   Example

   Result: Find 20K radio-emitting stars from 400K
Mining climate data (Univ. of
   Research goal:
      Find global climate patterns of
       interest to Earth scientists
      A key interest is finding
       connections between the
       ocean / atmosphere and the

   Global snapshots of values
    for a number of variables on
    land surfaces or water.
   Span a range of 10 to 50

   [Figure: Average Monthly Temperature]
Mining climate data (Univ. of
   EOS satellites provide high
    resolution measurements

       Finer spatial grids
            8 km × 8 km grid produces 10,848,672 data points
            1 km × 1 km grid produces 694,315,008 data points

     More frequent measurements
     Multiple instruments
            Generates terabytes of data per day

   [Figure: Earth Observing System (e.g., Terra and Aqua satellites)]

Questions and Answers
