Rain Forest

Rain Forest - A Framework for Fast Decision Tree
Construction of Large Database
By Johannes Gehrke, Raghu Ramakrishnan, Venkatesh Ganti


Paper presentation by (Group 8)
          Zeng xiaohua                                  HT00-6171L
          Loh Zheng Xuan                                HT00-6211U
          Xu Min                                        HT00-3539X


       Rain Forest - A Framework for Fast Decision Tree Construction of Large Database   1
Agenda

Part I      – Introduction (by Zeng xiaohua)


Part II – Rain Forest (by Loh Zheng Xuan)


Part III – Evaluations (by Xu Min)




                                           Introduction
                                                                by Zeng xiaohua



    Introduction to rain forest
    Previous work
    SPRINT

Rain Forest : Introduction

   Is a framework, NOT a
    classification algorithm
       Q: What is a framework?
       A: A data access method that is NOT tied to
           any particular classification algorithm.

   Produces scalable versions of existing
    algorithms without changing the
    quality of the resulting tree

Previous Work

   Discretization
        Discretize ordered attributes and run the
         algorithm on the discretized data

   Node sampling

   Dataset partitioning
        Each subset fits in main memory

   Assumption : the database fits in main memory


SPRINT : Introduction

   Fastest scalable classification
    algorithm proposed previously

   Features :
     Uses sorted attribute lists
     Removes any dependence between
      main memory size and the size of the
      training dataset
     Works well for very large datasets


Example Input Database
Record Id   Car Type   Age   Number of Children   Subscription
    1       Sedan       23           0                Yes
    2       Sports      31           1                No
    3       Sedan       36           1                Yes
    4       Truck       25           2                No
    5       Sports      30           0                No
    6       Sedan       36           0                No
    7       Sedan       25           0                Yes
    8       Truck       36           1                No
    9       Sedan       30           2                Yes
   10       Sedan       31           1                Yes
   11       Sports      25           0                No
   12       Sedan       45           1                Yes
   13       Sports      23           2                No
   14       Truck       45           0                Yes

Example Decision Tree

                               Car Type
                sedan                                sports, truck

           # Childr.                                     Age

    >0                    0                <=30                      >30
      YES               Age                     NO             # Childr.

         <=30                      >30                     0                >0

               YES                 NO                     YES               NO


  Find split attribute
  Partition
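
The example tree can be written out as nested conditions and checked against the example input database. This is only a sketch: the function name `classify` and the tuple layout of `RECORDS` are assumptions made for illustration, not part of the slides.

```python
# Example input database: (record id, car type, age, number of children, subscription)
RECORDS = [
    (1, "Sedan", 23, 0, "Yes"), (2, "Sports", 31, 1, "No"),
    (3, "Sedan", 36, 1, "Yes"), (4, "Truck", 25, 2, "No"),
    (5, "Sports", 30, 0, "No"), (6, "Sedan", 36, 0, "No"),
    (7, "Sedan", 25, 0, "Yes"), (8, "Truck", 36, 1, "No"),
    (9, "Sedan", 30, 2, "Yes"), (10, "Sedan", 31, 1, "Yes"),
    (11, "Sports", 25, 0, "No"), (12, "Sedan", 45, 1, "Yes"),
    (13, "Sports", 23, 2, "No"), (14, "Truck", 45, 0, "Yes"),
]


def classify(car_type, age, children):
    """Follow the example decision tree from the root to a leaf."""
    if car_type == "Sedan":
        if children > 0:
            return "Yes"
        return "Yes" if age <= 30 else "No"
    # Sports or Truck
    if age <= 30:
        return "No"
    return "Yes" if children == 0 else "No"


# Record ids whose predicted label differs from the stored label.
mismatches = [rid for rid, car, age, kids, label in RECORDS
              if classify(car, age, kids) != label]
print(mismatches)
```

Every leaf of the example tree is pure on this database, so the list of mismatches is expected to be empty.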
         SPRINT : Implementation

Car Type   Subscription   Record Id  |  Age   Subscription   Record Id  |  Num. of Children   Subscription   Record Id
Sedan      Yes             1         |  23    Yes             1         |        0            Yes             1
Sports     No              2         |  23    No             13         |        0            No              5
Sedan      Yes             3         |  25    No              4         |        0            No              6
Truck      No              4         |  25    Yes             7         |        0            Yes             7
Sports     No              5         |  25    No             11         |        0            No             11
Sedan      No              6         |  30    No              5         |        0            Yes            14
Sedan      Yes             7         |  30    Yes             9         |        1            No              2
Truck      No              8         |  31    No              2         |        1            Yes             3
Sedan      Yes             9         |  31    Yes            10         |        1            No              8
Sedan      Yes            10         |  36    Yes             3         |        1            Yes            10
Sports     No             11         |  36    No              6         |        1            Yes            12
Sedan      Yes            12         |  36    No              8         |        2            No              4
Sports     No             13         |  45    Yes            12         |        2            Yes             9
Truck      Yes            14         |  45    Yes            14         |        2            No             13
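
The attribute lists can be derived from the example input database as follows. This is a minimal sketch, assuming each list entry is a tuple (attribute value, class label, record id); the helper name `attribute_list` and its `sort_by_value` flag are hypothetical. The Car Type list is left in record order and the two numeric lists are sorted by value, as on the slide.

```python
# Example input database: (record id, car type, age, number of children, subscription)
RECORDS = [
    (1, "Sedan", 23, 0, "Yes"), (2, "Sports", 31, 1, "No"),
    (3, "Sedan", 36, 1, "Yes"), (4, "Truck", 25, 2, "No"),
    (5, "Sports", 30, 0, "No"), (6, "Sedan", 36, 0, "No"),
    (7, "Sedan", 25, 0, "Yes"), (8, "Truck", 36, 1, "No"),
    (9, "Sedan", 30, 2, "Yes"), (10, "Sedan", 31, 1, "Yes"),
    (11, "Sports", 25, 0, "No"), (12, "Sedan", 45, 1, "Yes"),
    (13, "Sports", 23, 2, "No"), (14, "Truck", 45, 0, "Yes"),
]


def attribute_list(records, column, sort_by_value=True):
    """Build one attribute list of (value, class label, record id) entries."""
    entries = [(rec[column], rec[4], rec[0]) for rec in records]
    if sort_by_value:
        # Equal values keep record-id order, as in the Age list on the slide.
        entries.sort(key=lambda entry: (entry[0], entry[2]))
    return entries


car_list = attribute_list(RECORDS, 1, sort_by_value=False)
age_list = attribute_list(RECORDS, 2)
children_list = attribute_list(RECORDS, 3)
print(age_list[:3])
```

Each list has one entry per record, so three attributes over 14 records give 42 entries in total.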



SPRINT : Implementation, cont.
Car Type attribute list before the split:

   Car Type   Subscription   Record Id
   Sedan      Yes             1
   Sports     No              2
   Sedan      Yes             3
   Truck      No              4
   Sports     No              5
   Sedan      No              6
   Sedan      Yes             7
   Truck      No              8
   Sedan      Yes             9
   Sedan      Yes            10
   Sports     No             11
   Sedan      Yes            12
   Sports     No             13
   Truck      Yes            14

After the split, left branch (sedan):

   Car Type   Subscription   Record Id
   sedan      Yes             1
   sedan      Yes             3
   sedan      No              6
   sedan      Yes             7
   sedan      Yes             9
   sedan      Yes            10
   sedan      Yes            12

After the split, right branch (sports, truck):

   Car Type   Subscription   Record Id
   sports     No              2
   truck      No              4
   sports     No              5
   truck      No              8
   sports     No             11
   sports     No             13
   truck      Yes            14
   SPRINT : Implementation, cont.
             Car    Subscrip. Record       Age   Subscrip. Record       Num. Of    Subscrip. Record
            Type                Id                           Id         Children               Id
           sedan       Yes        1        23       Yes        1           0         Yes       1
           sedan       Yes        3        25       Yes        7           0         No        6
 Left      sedan       No         6        30       Yes        9           0         Yes       7
Branch     sedan       Yes        7        31       Yes       10           1         Yes       3
           sedan       Yes        9        36       Yes        3           1         Yes      10
           sedan       Yes       10        36       No         6           1         Yes      12
           sedan       Yes       12        45       Yes       12           2         Yes       9


             Car    Subscrip. Record       Age   Subscrip. Record       Num. Of    Subscrip. Record
            Type                Id                           Id         Children               Id
           sports      No         2        23       No        13           0         No        5
            truck      No         4        25       No         4           0         No       11
Right      sports      No         5        25       No        11           0         Yes      14
Branch      truck      No         8        30       No         5           1         No        2
           sports      No        11        31       No         2           1         No        8
           sports      No        13        36       No         8           2         No        4
            truck      Yes       14        45       Yes       14           2         No       13
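
The split of all three attribute lists can be sketched as below. The sketch assumes the usual SPRINT approach of collecting the record ids that go to the left branch from the splitting attribute's list and then probing that set while scanning the other lists; the names `build_lists` and `split_lists` are hypothetical.

```python
# Example input database: (record id, car type, age, number of children, subscription)
RECORDS = [
    (1, "Sedan", 23, 0, "Yes"), (2, "Sports", 31, 1, "No"),
    (3, "Sedan", 36, 1, "Yes"), (4, "Truck", 25, 2, "No"),
    (5, "Sports", 30, 0, "No"), (6, "Sedan", 36, 0, "No"),
    (7, "Sedan", 25, 0, "Yes"), (8, "Truck", 36, 1, "No"),
    (9, "Sedan", 30, 2, "Yes"), (10, "Sedan", 31, 1, "Yes"),
    (11, "Sports", 25, 0, "No"), (12, "Sedan", 45, 1, "Yes"),
    (13, "Sports", 23, 2, "No"), (14, "Truck", 45, 0, "Yes"),
]


def build_lists(records):
    """Attribute lists of (value, class label, record id); numeric lists are sorted."""
    by_value = lambda entry: (entry[0], entry[2])
    return {
        "car": [(r[1], r[4], r[0]) for r in records],
        "age": sorted(((r[2], r[4], r[0]) for r in records), key=by_value),
        "children": sorted(((r[3], r[4], r[0]) for r in records), key=by_value),
    }


def split_lists(lists, split_attr, goes_left):
    """Split every attribute list according to a test on one attribute."""
    # Record ids routed to the left child, taken from the splitting attribute's list.
    left_rids = {rid for value, _, rid in lists[split_attr] if goes_left(value)}
    left, right = {}, {}
    for name, entries in lists.items():
        # Filtering preserves order, so sorted lists stay sorted in both children.
        left[name] = [e for e in entries if e[2] in left_rids]
        right[name] = [e for e in entries if e[2] not in left_rids]
    return left, right


left, right = split_lists(build_lists(RECORDS), "car", lambda v: v == "Sedan")
print(left["age"])
print(right["age"])
```

Every list is rewritten for every child, which is the materialization cost that the next slide points out.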

SPRINT : Problems

   Attribute lists are materialized at
    each node

   Maintaining the attribute lists is costly




                                                    Rain Forest
                                                            by Loh Zheng Xuan


   Basic idea
   What is AVC set and AVC group?
   How does it scale up?


Basic idea

   Observation:
       Only aggregate information about the class
        label distribution for each distinct attribute
        value is needed to choose the splitting
        attribute


   The AVC group is introduced to hold
    this aggregate information
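
To illustrate the observation, a split criterion can be evaluated from class counts alone. The sketch below uses the Gini index as the criterion, which is an assumption (the slides do not name one), and the counts are the Car Type class counts of the example database; `gini` and `weighted_gini` are hypothetical helper names.

```python
def gini(counts):
    """Gini impurity of one group, given its class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(class_counts_by_value):
    """Impurity of a multiway split, weighted by the size of each branch."""
    n = sum(sum(counts) for counts in class_counts_by_value.values())
    return sum(sum(counts) / n * gini(counts)
               for counts in class_counts_by_value.values())


# (Yes, No) subscription counts for each car type in the example database.
car_type_counts = {"Sedan": (6, 1), "Sports": (0, 4), "Truck": (1, 2)}
score = weighted_gini(car_type_counts)
print(round(score, 4))
```

No individual record is consulted here: three pairs of counts are enough to score the Car Type split.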

Rain Forest : AVC

   AVC (Attribute-Value, Class label)
       AVC set
        • Counts of individual class
          labels are aggregated
        • Class label distribution for
          each distinct attribute value

          Car Type   Subscription
                      Yes    No
          Sedan        6      1
          Sports       0      4
          Truck        1      2


       AVC group
        • The set of AVC sets (one per attribute) at a node
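
An AVC set can be built with one pass over the records of a node. The following is a minimal sketch; the function name `avc_set` and the nested-dictionary layout are assumptions chosen to mirror the Car Type table.

```python
from collections import defaultdict

# Example input database: (record id, car type, age, number of children, subscription)
RECORDS = [
    (1, "Sedan", 23, 0, "Yes"), (2, "Sports", 31, 1, "No"),
    (3, "Sedan", 36, 1, "Yes"), (4, "Truck", 25, 2, "No"),
    (5, "Sports", 30, 0, "No"), (6, "Sedan", 36, 0, "No"),
    (7, "Sedan", 25, 0, "Yes"), (8, "Truck", 36, 1, "No"),
    (9, "Sedan", 30, 2, "Yes"), (10, "Sedan", 31, 1, "Yes"),
    (11, "Sports", 25, 0, "No"), (12, "Sedan", 45, 1, "Yes"),
    (13, "Sports", 23, 2, "No"), (14, "Truck", 45, 0, "Yes"),
]


def avc_set(records, column, label_column=4):
    """Count class labels for each distinct value of one attribute."""
    counts = defaultdict(lambda: {"Yes": 0, "No": 0})
    for rec in records:
        counts[rec[column]][rec[label_column]] += 1
    return dict(counts)


car_type_avc = avc_set(RECORDS, 1)
print(car_type_avc)
```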


  Rain Forest : AVC sets and AVC group of root node

  Car Type   Subscription  |  Age   Subscription  |  Num. of     Subscription
              Yes    No    |         Yes    No    |  Children     Yes    No
  Sedan        6      1    |  23      1      1    |     0          3      3
  Sports       0      4    |  25      1      2    |     1          3      2
  Truck        1      2    |  30      1      1    |     2          1      2
                           |  31      1      1    |
                           |  36      1      2    |
                           |  45      2      0    |

        Note : With a large database, the AVC
          group is generally much smaller than the
          attribute lists in SPRINT
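
The size difference can be made concrete on the example database. In this sketch the AVC group is stored as one counter of (attribute value, class label) pairs per attribute; the names `avc_group` and `COLUMNS` are hypothetical. The number of AVC rows depends on the number of distinct attribute values, while attribute lists hold one entry per record and attribute.

```python
from collections import Counter

# Example input database: (record id, car type, age, number of children, subscription)
RECORDS = [
    (1, "Sedan", 23, 0, "Yes"), (2, "Sports", 31, 1, "No"),
    (3, "Sedan", 36, 1, "Yes"), (4, "Truck", 25, 2, "No"),
    (5, "Sports", 30, 0, "No"), (6, "Sedan", 36, 0, "No"),
    (7, "Sedan", 25, 0, "Yes"), (8, "Truck", 36, 1, "No"),
    (9, "Sedan", 30, 2, "Yes"), (10, "Sedan", 31, 1, "Yes"),
    (11, "Sports", 25, 0, "No"), (12, "Sedan", 45, 1, "Yes"),
    (13, "Sports", 23, 2, "No"), (14, "Truck", 45, 0, "Yes"),
]
COLUMNS = {"Car Type": 1, "Age": 2, "Num. of Children": 3}


def avc_group(records, columns):
    """Return {attribute name: Counter of (attribute value, class label)}."""
    return {name: Counter((rec[idx], rec[4]) for rec in records)
            for name, idx in columns.items()}


group = avc_group(RECORDS, COLUMNS)
# One AVC row per distinct attribute value.
distinct_values = {name: len({value for value, _ in counts})
                   for name, counts in group.items()}
avc_rows = sum(distinct_values.values())
# One attribute-list entry per record and attribute.
list_rows = len(RECORDS) * len(COLUMNS)
print(avc_rows, list_rows)
```

With only 14 records the gap is modest (12 rows against 42 entries), but the attribute lists grow with every added record whereas the AVC group grows only when a new distinct value appears.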
Top-Down Decision Tree Induction Schema

 BuildTree (Node n, datapartition D, algorithm CL)
 (1)      Apply CL to D to find crit(n)
 (2)      let k be the number of children of n
 (3)      if ( k>0 )
 (4)            Create k children c1, ..., ck of n
 (5)            Use best split to partition D into D1, ..., Dk
 (6)            for (i = 1; i <= k; i++)
 (7)                BuildTree (ci, Di)
 (8)            endfor
 (9)      endif

 Rain Forest refinement of lines (1) and (2)
 (1a)     for each partition attribute p
 (1b)          Call CL.find_best_partitioning (AVC-set of p)
 (1c)     endfor
 (2a)     k = CL.decision_splitting_criteria();
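
The schema can be sketched in runnable form as below. Several things here are assumptions made for illustration only: the role of CL is played by a hypothetical `weighted_gini` score with multiway splits on every distinct value, the two CL calls of the refinement are collapsed into one scoring step, and the tree is represented as nested tuples. The point of the sketch is that the split decision at each node reads only the AVC group, never the individual records.

```python
from collections import Counter, defaultdict

ROWS = [  # (car type, age, number of children, subscription)
    ("Sedan", 23, 0, "Yes"), ("Sports", 31, 1, "No"), ("Sedan", 36, 1, "Yes"),
    ("Truck", 25, 2, "No"), ("Sports", 30, 0, "No"), ("Sedan", 36, 0, "No"),
    ("Sedan", 25, 0, "Yes"), ("Truck", 36, 1, "No"), ("Sedan", 30, 2, "Yes"),
    ("Sedan", 31, 1, "Yes"), ("Sports", 25, 0, "No"), ("Sedan", 45, 1, "Yes"),
    ("Sports", 23, 2, "No"), ("Truck", 45, 0, "Yes"),
]
DATA = [dict(car=c, age=a, children=k, label=y) for c, a, k, y in ROWS]
ATTRIBUTES = ["car", "age", "children"]


def avc_group(partition, attributes):
    """One scan of the partition: class counts per attribute value."""
    group = {a: defaultdict(Counter) for a in attributes}
    for rec in partition:
        for a in attributes:
            group[a][rec[a]][rec["label"]] += 1
    return group


def weighted_gini(avc_set):
    """Hypothetical stand-in for CL: score a multiway split from counts only."""
    n = sum(sum(c.values()) for c in avc_set.values())
    score = 0.0
    for class_counts in avc_set.values():
        size = sum(class_counts.values())
        impurity = 1.0 - sum((x / size) ** 2 for x in class_counts.values())
        score += size / n * impurity
    return score


def build_tree(partition, attributes):
    labels = Counter(rec["label"] for rec in partition)
    group = avc_group(partition, attributes)
    # (1a)-(1c): score each attribute that still has more than one value.
    scores = {a: weighted_gini(s) for a, s in group.items() if len(s) > 1}
    # (2a)/(3): no children if the node is pure or nothing can be split.
    if len(labels) == 1 or not scores:
        return labels.most_common(1)[0][0]
    best = min(scores, key=scores.get)
    # (5): partition D on the chosen attribute.
    parts = defaultdict(list)
    for rec in partition:
        parts[rec[best]].append(rec)
    # (6)-(8): recurse on each child partition.
    return (best, {v: build_tree(p, attributes) for v, p in parts.items()})


def predict(tree, rec):
    """Walk from the root to a leaf; valid for values seen during training."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[rec[attribute]]
    return tree


tree = build_tree(DATA, ATTRIBUTES)
print(tree[0])
```

Because this stand-in criterion uses multiway splits, the resulting tree is not the binary example tree shown earlier, although on this data it also selects Car Type at the root.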



     Rain Forest vs SPRINT
 Original database  (Rid, A1, A2, A3, C)
        |
        v
 Attribute lists  (Rid, A1, C)   (Rid, A2, C)   (Rid, A3, C)
        |
      Crit()
      /    \
 Attribute lists of          Attribute lists of
 one child                   the other child
 (Rid, A1, C)                (Rid, A1, C)
 (Rid, A2, C)                (Rid, A2, C)
 (Rid, A3, C)                (Rid, A3, C)


   Attribute lists for SPRINT of
   child nodes after splitting
             Car    Subscrip. Record       Age   Subscrip. Record       Num. Of    Subscrip. Record
            Type                Id                           Id         Children               Id
           sedan       Yes        1        23       Yes        1           0         Yes       1
           sedan       Yes        3        25       Yes        7           0         No        6
 Left      sedan       No         6        30       Yes        9           0         Yes       7
Branch     sedan       Yes        7        31       Yes       10           1         Yes       3
           sedan       Yes        9        36       Yes        3           1         Yes      10
           sedan       Yes       10        36       No         6           1         Yes      12
           sedan       Yes       12        45       Yes       12           2         Yes       9


             Car    Subscrip. Record       Age   Subscrip. Record       Num. Of    Subscrip. Record
            Type                Id                           Id         Children               Id
           sports      No         2        23       No        13           0         No        5
            truck      No         4        25       No         4           0         No       11
Right      sports      No         5        25       No        11           0         Yes      14
Branch      truck      No         8        30       No         5           1         No        2
           sports      No        11        31       No         2           1         No        8
           sports      No        13        36       No         8           2         No        4
            truck      Yes       14        45       Yes       14           2         No       13

 Rain Forest vs SPRINT
 Original database  (Rid, A1, A2, A3, C)
        |
        v
 AVC sets (AVC group)  (A1, C counts)   (A2, C counts)   (A3, C counts)
        |
      Crit()
      /    \
 Partition P1                            Partition P2
 (Rid, A1, A2, A3, C)                    (Rid, A1, A2, A3, C)
        |                                       |
        v                                       v
 AVC group of P1                         AVC group of P2
 (A1, C counts)                          (A1, C counts)
 (A2, C counts)                          (A2, C counts)
 (A3, C counts)                          (A3, C counts)



     AVC sets and AVC group of child nodes after splitting

 Left Branch
  Car Type   Subscription  |  Age   Subscription  |  Num. of     Subscription
              Yes    No    |         Yes    No    |  Children     Yes    No
  Sedan        6      1    |  23      1      0    |     0          2      1
  Sports       0      0    |  25      1      0    |     1          3      0
  Truck        0      0    |  30      1      0    |     2          1      0
                           |  31      1      0    |
                           |  36      1      1    |
                           |  45      1      0    |

 Right Branch
  Car Type   Subscription  |  Age   Subscription  |  Num. of     Subscription
              Yes    No    |         Yes    No    |  Children     Yes    No
  Sedan        0      0    |  23      0      1    |     0          1      2
  Sports       0      4    |  25      0      2    |     1          0      2
  Truck        1      2    |  30      0      1    |     2          0      2
                           |  31      0      1    |
                           |  36      0      1    |
                           |  45      1      0    |


Rain Forest : Algorithms

   RF-Write
        • Buffers used for
          storing partitions

   RF-Read
        • Scan database to
          decide partitions

   RF-Hybrid
        • Combination of RF-Write
          and RF-Read

   Assumption : AVC group of root node fits in main memory




 Rain Forest : RF-Write
 Original database  (Rid, A1, A2, A3, C)
        |
        v
 Main Memory : AVC group  (A1, C counts)   (A2, C counts)   (A3, C counts)
        |
        CL
        |
        r
    /   |    \
  k1    k2  …… kn

 Buffer for k1            Buffer for k2            Buffer for kn
 (Rid, A1, A2, A3, C)     (Rid, A1, A2, A3, C)     (Rid, A1, A2, A3, C)
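
One step of RF-Write at a single node can be sketched as two scans of the node's partition: the first builds the AVC group in main memory, the second writes each record to the buffer of the child it belongs to. This is only an illustration under assumptions: in-memory lists stand in for the buffers, and `choose_split` is a hypothetical stand-in for CL that is hard-wired here to the Car Type split of the running example.

```python
from collections import Counter, defaultdict

ROWS = [  # (car type, age, number of children, subscription)
    ("Sedan", 23, 0, "Yes"), ("Sports", 31, 1, "No"), ("Sedan", 36, 1, "Yes"),
    ("Truck", 25, 2, "No"), ("Sports", 30, 0, "No"), ("Sedan", 36, 0, "No"),
    ("Sedan", 25, 0, "Yes"), ("Truck", 36, 1, "No"), ("Sedan", 30, 2, "Yes"),
    ("Sedan", 31, 1, "Yes"), ("Sports", 25, 0, "No"), ("Sedan", 45, 1, "Yes"),
    ("Sports", 23, 2, "No"), ("Truck", 45, 0, "Yes"),
]
DATA = [dict(car=c, age=a, children=k, label=y) for c, a, k, y in ROWS]
ATTRIBUTES = ["car", "age", "children"]


def rf_write_step(partition, attributes, choose_split):
    """Process one node: build its AVC group, then partition its records."""
    # Scan 1: aggregate class counts per attribute value in main memory.
    group = {a: defaultdict(Counter) for a in attributes}
    for rec in partition:
        for a in attributes:
            group[a][rec[a]][rec["label"]] += 1
    # CL sees only the AVC group and returns a routing function record -> child.
    route = choose_split(group)
    # Scan 2: write every record to the buffer of its child.
    buffers = defaultdict(list)
    for rec in partition:
        buffers[route(rec)].append(rec)
    return group, dict(buffers)


def choose_split(group):
    # Hypothetical CL: always split on car type, sedan against the rest.
    return lambda rec: "k1" if rec["car"] == "Sedan" else "k2"


group, buffers = rf_write_step(DATA, ATTRIBUTES, choose_split)
print({child: len(records) for child, records in buffers.items()})
```

Each child buffer then serves as the input partition for the same step at the next level.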


     Rain Forest : RF-Read
 Database  (Rid, A1, A2, A3, C)
        |
        v
 Main Memory : one AVC group per node, each passed to CL
      (A1, C counts)   (A2, C counts)   (A3, C counts)   ->  CL
      (A1, C counts)   (A2, C counts)   (A3, C counts)   ->  CL
              :
      (A1, C counts)   (A2, C counts)   (A3, C counts)   ->  CL

 Tree built so far :
                       r
           k1        k2 …… …… kn
 k1   k2 … kn                   k1   k2 … kn
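
The RF-Read idea can be sketched as a single scan of the whole database that builds the AVC groups of all nodes on the current level at once, with no partitions written. The sketch rests on assumptions: `route_to_node` is a hypothetical function that sends a record down the tree built so far, here hard-wired to the Car Type split of the running example, and all AVC groups of the level are assumed to fit in main memory.

```python
from collections import Counter, defaultdict

ROWS = [  # (car type, age, number of children, subscription)
    ("Sedan", 23, 0, "Yes"), ("Sports", 31, 1, "No"), ("Sedan", 36, 1, "Yes"),
    ("Truck", 25, 2, "No"), ("Sports", 30, 0, "No"), ("Sedan", 36, 0, "No"),
    ("Sedan", 25, 0, "Yes"), ("Truck", 36, 1, "No"), ("Sedan", 30, 2, "Yes"),
    ("Sedan", 31, 1, "Yes"), ("Sports", 25, 0, "No"), ("Sedan", 45, 1, "Yes"),
    ("Sports", 23, 2, "No"), ("Truck", 45, 0, "Yes"),
]
DATA = [dict(car=c, age=a, children=k, label=y) for c, a, k, y in ROWS]
ATTRIBUTES = ["car", "age", "children"]


def rf_read_level(database, attributes, route_to_node):
    """One scan of the database: AVC groups for every node on one level."""
    groups = defaultdict(lambda: {a: defaultdict(Counter) for a in attributes})
    for rec in database:
        node = route_to_node(rec)  # node of the current level the record falls into
        for a in attributes:
            groups[node][a][rec[a]][rec["label"]] += 1
    return groups


def route_to_node(rec):
    # Hypothetical tree so far: the root splits on car type, sedan against the rest.
    return "k1" if rec["car"] == "Sedan" else "k2"


groups = rf_read_level(DATA, ATTRIBUTES, route_to_node)
print(dict(groups["k1"]["children"]))
```

The counts obtained this way match the AVC groups of the child nodes shown earlier.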

Rain Forest : Algorithms, cont.


   RF-Vertical
        • Each time, processes as many
          AVC sets as fit in main memory

   Assumption : AVC group of root node doesn't fit in main memory
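
The batching idea can be sketched as follows. It follows the bullet above loosely and is not the authors' exact procedure: attributes are grouped greedily into batches whose AVC sets fit a memory budget, and the database is scanned once per batch. The names `plan_batches`, `avc_sets_in_batches`, `SIZES`, and the budget measured in AVC rows are all assumptions, as is knowing the number of distinct values of each attribute in advance.

```python
from collections import Counter, defaultdict

ROWS = [  # (car type, age, number of children, subscription)
    ("Sedan", 23, 0, "Yes"), ("Sports", 31, 1, "No"), ("Sedan", 36, 1, "Yes"),
    ("Truck", 25, 2, "No"), ("Sports", 30, 0, "No"), ("Sedan", 36, 0, "No"),
    ("Sedan", 25, 0, "Yes"), ("Truck", 36, 1, "No"), ("Sedan", 30, 2, "Yes"),
    ("Sedan", 31, 1, "Yes"), ("Sports", 25, 0, "No"), ("Sedan", 45, 1, "Yes"),
    ("Sports", 23, 2, "No"), ("Truck", 45, 0, "Yes"),
]
DATA = [dict(car=c, age=a, children=k, label=y) for c, a, k, y in ROWS]
SIZES = {"car": 3, "age": 6, "children": 3}  # distinct values per attribute


def plan_batches(avc_set_sizes, budget):
    """Greedily group attributes so that each batch fits the memory budget."""
    batches, current, used = [], [], 0
    for attribute, size in avc_set_sizes.items():
        if current and used + size > budget:
            batches.append(current)
            current, used = [], 0
        current.append(attribute)
        used += size
    if current:
        batches.append(current)
    return batches


def avc_sets_in_batches(scan, avc_set_sizes, budget):
    """Build the AVC group with one database scan per batch of attributes."""
    group, scans = {}, 0
    for batch in plan_batches(avc_set_sizes, budget):
        scans += 1
        sets = {a: defaultdict(Counter) for a in batch}
        for rec in scan():
            for a in batch:
                sets[a][rec[a]][rec["label"]] += 1
        group.update(sets)
    return group, scans


group, scans = avc_sets_in_batches(lambda: iter(DATA), SIZES, budget=9)
print(plan_batches(SIZES, 9), scans)
```

With a budget of 9 rows, the Car Type and Age sets share one scan and the Number of Children set needs a second one, so memory use is traded for extra scans.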




                                                    Evaluations
                                                                             by Xu Min


   Experimental results
   Conclusions
   Others’ work using this paper
   Our opinions

Experimental Result : Scalability




           Experimental Result :
Comparison with SPRINT : Number of Tuples




Experimental Result :
Comparison with SPRINT : Sorting Cost




Experimental Result :
Comparison with SPRINT : Partitioning Cost




Conclusions by the authors

 Rain forest is applicable to all
  decision tree algorithms
 Rain forest offers significant
  improvement over SPRINT




Others’ works using this paper


   [TZ98]
        • Several algorithms implemented using
          Rain Forest schema


   [CFB99]
        • SQL operation upon CC table (similar to
          AVC set) to construct decision tree




Others’ works using this paper, cont.


   [RS98]
        • PUBLIC: integrates the pruning phase into
          the tree building phase
        • PUBLIC: significant performance
          improvements compared to traditional
          classifiers such as SPRINT




Others’ works using this paper, cont.


   [GGRL99]
        • BOAT: first scalable algorithm with the
          ability to incrementally update the tree
          with both insertions and deletions
        • BOAT: faster than the best existing
          algorithm, RF-Hybrid of Rain Forest




Others’ works using this paper, cont.


   [GGR99]
        • Proposed a framework for quantifying the
          difference between two datasets in terms
          of the models they induce
        • Implemented the CART algorithm using
          the Rain Forest framework




Our opinions

   Truly scalable
        • Makes the in-memory structures of a decision tree
          algorithm independent of the size of the database


   AVC set and group
        • Summarizes database


   Difficult to parallelize
        • The control mechanism is too complex for the
          tasks to be divided easily


References
   [SAM96]
          • J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for
             data mining. In Proc. of VLDB, 1996.
   [TZ98]
          • Dongbin Tao, Chun Zhang: Classification in the RainForest Framework.
            CS764 & CS784 Joint-Project Report, Computer Sciences Department,
            University of Wisconsin--Madison, May 11, 1998
   [CFB99]
          • Surajit Chaudhuri, Usama M. Fayyad, Jeff Bernhardt: Scalable Classification
            over SQL Databases. ICDE 1999: 470-479
   [RS98]
          • Rajeev Rastogi, Kyuseok Shim: PUBLIC: A Decision Tree Classifier that
            Integrates Building and Pruning. VLDB 1998: 404-415
   [GGRL99]
          • Johannes Gehrke, Venkatesh Ganti, Raghu Ramakrishnan, Wei-Yin Loh:
            BOAT-Optimistic Decision Tree Construction. SIGMOD Conference 1999: 169-180
   [GGR99]
          • Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishnan: A Framework for
            Measuring Changes in Data Characteristics. PODS 1999: 126-137
   [GRSS99]
          • Minos N. Garofalakis, Rajeev Rastogi, S. Seshadri, Kyuseok Shim: Data
            Mining and the Web: Past, Present and Future. Workshop on Web Information
            and Data Management 1999: 43-47

  Thank You