An Interval Classifier for Database Mining Applications

Rakesh Agrawal, Sakti Ghosh, Tomasz Imielinski, Bala Iyer, Arun Swami
IBM Almaden Research Center

Presentation made by N. Sharygina

Outline of the presentation
• Classification problem (formal statement, requirements)
• Family of classifiers
  – neural nets
  – decision trees
    • ID3
    • Interval Classifier (IC)
• IC classifier generation algorithm
• Empirical evaluation of the performance of IC
• Conclusion

Classification problem (supervised learning)
• Problem formalization
– {G1, G2, …, Gm} - set of m group labels
– {A1, A2, …, An} - set of n attributes; dom(Ai) refers to the set of possible values of Ai
– D - large DB of objects (test data set); each object is an n-tuple <v1, v2, …, vn>, vi ∈ dom(Ai); the group label G is not one of the Ai
– {E1, E2, …, Ek} - example set (training set); each example is a tuple <v1, v2, …, vn, gk>, vi ∈ dom(Ai), gk ∈ {G1, G2, …, Gm}

• Goal: obtain m classification functions fj, one for each group Gj, where
fj: A1 x A2 x … x An -> Gj for j = 1, …, m

• Additional requirements:
– retrieval efficiency (DM application: retrieve all the objects belonging to a group)
– generation efficiency (DM applications: classifiers embedded in an interactive system; systems with frequently changing training data)


Family of classifiers
• Neural nets (NN)
– fixed-size data structure in which the output of one node feeds into one or many other nodes. The classification functions (CF) generated by neural networks are buried in the weights on the inter-node links
– disadvantages of NN classifiers:
  • poor generation efficiency
  • difficult to use
  • NN don't handle non-numeric data well

• Decision trees
– a decision tree consists of nodes that are tests on the attributes
– E - collection of objects
– T - test on an object with possible outcomes (branches) {O1, O2, …, Ow}
– T partitions E into {E1, E2, …, Ew}, with Ei containing those objects that have outcome Oi
– each Ei is replaced by a decision tree for Ei, which gives a decision tree for all of E
– the process terminates: if 2 or more Ei are non-empty, each Ei is smaller than E, and E is finite
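The partitioning step described above can be sketched in Python. This is only an illustration: the function name `partition`, the dictionary representation of objects, and the sample `color` attribute are assumptions made for the example and do not come from the presentation.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List


def partition(examples: List[dict],
              test: Callable[[dict], Hashable]) -> Dict[Hashable, List[dict]]:
    """Split the collection E into subsets, one per outcome of the test T."""
    subsets: Dict[Hashable, List[dict]] = defaultdict(list)
    for obj in examples:
        # The outcome of the test decides which subset the object goes to.
        subsets[test(obj)].append(obj)
    return dict(subsets)


# Hypothetical objects and test (assumed for illustration only).
E = [
    {"color": "red", "group": "G1"},
    {"color": "blue", "group": "G2"},
    {"color": "red", "group": "G1"},
]
parts = partition(E, lambda obj: obj["color"])
```

Because two subsets are non-empty here, each one is strictly smaller than E, which is the property the termination argument relies on.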

Family of classifiers/ID3
• ID3 (greedy growing algorithm)
– Characteristics:
  • decision trees have a branch for every value of a non-numeric attribute
  • a numeric attribute is handled by repeated binary decomposition of its range of values
– Algorithm:
  • start with all the training examples at the root node of the tree
  • select an attribute to partition these examples (the critical step)
  • create a branch for each value of the attribute
  • move the subset of examples that have that attribute value to the newly created child node
  • apply the algorithm recursively to each child node until either
    – all examples at a node are of one class, or
    – all examples at a node have the same value for all the attributes

• As a result, every leaf is a classification rule
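A minimal sketch of this greedy growing procedure, restricted to non-numeric attributes, might look as follows. It assumes information gain as the attribute selection criterion and omits both binary decomposition of numeric attributes and pruning. The names `entropy`, `information_gain`, `id3` and the weather-style training data are hypothetical.

```python
import math
from collections import Counter


def entropy(examples):
    """Entropy of the group labels in a list of (attributes, group) pairs."""
    counts = Counter(group for _, group in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(examples, attr):
    """Reduction in entropy obtained by partitioning the examples on attr."""
    total = len(examples)
    remainder = 0.0
    for value in {attrs[attr] for attrs, _ in examples}:
        subset = [e for e in examples if e[0][attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder


def id3(examples, attributes):
    """Grow a tree; a leaf is a group label, an inner node is a dict."""
    groups = [group for _, group in examples]
    # Stop when all examples at the node are of one class.
    if len(set(groups)) == 1:
        return groups[0]
    # Stop when no attribute can separate the examples any further.
    if not attributes or all(attrs == examples[0][0] for attrs, _ in examples):
        return Counter(groups).most_common(1)[0][0]
    # Select the attribute used to partition the examples.
    best = max(attributes, key=lambda a: information_gain(examples, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {attrs[best] for attrs, _ in examples}:
        # One branch per attribute value; its examples move to the child.
        subset = [e for e in examples if e[0][best] == value]
        branches[value] = id3(subset, remaining)
    return {"attribute": best, "branches": branches}


# Hypothetical training set (assumed for illustration only).
training = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]
tree = id3(training, ["outlook", "windy"])
```

On this data the root tests `outlook`, the `sunny` branch becomes a leaf immediately, and the `rainy` branch needs one more test on `windy`.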

Family of classifiers/ID3
• Creating more accurate trees
– to avoid overfitting the data, the tree is pruned (starting from the leaves)

• Advantages of binary decomposition:
– it removes the bias in favor of attributes with a large number of values

• Disadvantages of binary decomposition:
– can lead to large decision trees
– may cause a large increase in computation

Family of classifiers/Interval Classifier
• IC is a tree classifier that creates k-ary subtrees for attributes (instead of binary subtrees), where k is determined algorithmically at each node
• IC contribution:
– avoids the disadvantages of binary decomposition
– does dynamic pruning as the tree is expanded
– the resulting trees decompose the feature space into nested n-dimensional rectangular sections, so IC can generate SQL queries for CF that
  • can be optimized using relational query optimizers
  • can exploit DB indexes to realize retrieval efficiency

Family of classifiers/Interval classifier
• Problem statement
– typical classification problem (solved using the decision tree approach)
– attributes can be categorical or non-categorical
  • categorical attributes are those with a finite discrete set of possible values ("make of a car", "zip code")
  • non-categorical attributes: "salary", "age", etc.

• Interval definition
– an interval is a range of values for a non-categorical attribute or a single value for a categorical attribute

Interval Classifier generation

• A decision tree is created whose leaves are each labeled with one group label
• A tree traversal algorithm generates a classification function for each group by starting from the root and finding all paths to the leaves labeled with that group
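The traversal can be illustrated with a short Python sketch that collects every root-to-leaf path ending in a given group and renders the paths as one SQL predicate. Everything here is an assumption made for illustration: the `Interval` and `Node` classes, the half-open range convention [low, high), the table name `objects`, and the sample tree. It is not the data structure used by IC itself.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Interval:
    """A range [low, high) of a non-categorical attribute, or a single
    value of a categorical attribute (half-open ranges are an assumption)."""
    attribute: str
    low: object = None
    high: object = None
    value: object = None

    def to_sql(self) -> str:
        if self.value is not None:
            return f"{self.attribute} = '{self.value}'"
        return f"{self.attribute} >= {self.low} AND {self.attribute} < {self.high}"


@dataclass
class Node:
    # Each branch pairs an interval with a subtree or a leaf (group label).
    branches: List[Tuple[Interval, Union["Node", str]]]


def paths_to_group(tree, group, prefix=()):
    """Yield every root-to-leaf path (tuple of intervals) ending in group."""
    if isinstance(tree, str):
        if tree == group:
            yield prefix
        return
    for interval, child in tree.branches:
        yield from paths_to_group(child, group, prefix + (interval,))


def classification_query(tree, group, table="objects"):
    """Render the classification function of one group as a SQL query."""
    clauses = ["(" + " AND ".join(i.to_sql() for i in path) + ")"
               for path in paths_to_group(tree, group)]
    if not clauses:
        # No leaf carries this group label, so the query selects nothing.
        return f"SELECT * FROM {table} WHERE 1 = 0"
    return f"SELECT * FROM {table} WHERE " + " OR ".join(clauses)


# Hypothetical tree over "age" and "salary" (assumed for illustration only).
tree = Node([
    (Interval("age", low=0, high=40), Node([
        (Interval("salary", low=0, high=50000), "G2"),
        (Interval("salary", low=50000, high=100000), "G1"),
    ])),
    (Interval("age", low=40, high=60), "G1"),
    (Interval("age", low=60, high=100), "G2"),
])
query = classification_query(tree, "G1")
```

Each path becomes a conjunction of range predicates and the paths are joined with OR, which is the kind of rectangular predicate a relational optimizer can match against indexes. The sketch builds the SQL text by string interpolation for readability; code that accepts untrusted values should use parameterized queries instead.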

Make_tree procedure functions
• Winner attribute
– the attribute selected as the winner for the classification predicate. A goodness function is used for this determination. Two goodness functions are considered:
  • minimize the resubstitution error rate, which is
    1 - winner_freq(v)/total_freq
  • maximize the information gain ratio
    gain(Ai)/I(Ai)
    where I(Ai) is the information content of the value of the attribute (entropy of the example set)
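Both goodness functions can be sketched over a frequency histogram of one attribute. The sketch rests on stated assumptions: the histogram layout (value or interval -> group -> frequency) is invented for the example, the resubstitution error is aggregated by summing the winner frequencies over all values v of the attribute, and I(Ai) is computed as the entropy of the distribution of examples over the attribute's values. The function names are hypothetical.

```python
import math
from typing import Dict

# value (or interval) -> group label -> frequency; an assumed layout
Histogram = Dict[str, Dict[str, int]]


def _entropy(counts):
    """Entropy (in bits) of a list of frequencies."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def resubstitution_error(hist: Histogram) -> float:
    """1 - (sum of winner frequencies over all values) / total frequency."""
    total_freq = sum(sum(groups.values()) for groups in hist.values())
    winner_freq = sum(max(groups.values()) for groups in hist.values())
    return 1 - winner_freq / total_freq


def gain_ratio(hist: Histogram) -> float:
    """gain(Ai) / I(Ai); assumes at least two values with non-zero counts."""
    value_totals = [sum(groups.values()) for groups in hist.values()]
    total = sum(value_totals)
    group_totals: Dict[str, int] = {}
    for groups in hist.values():
        for g, c in groups.items():
            group_totals[g] = group_totals.get(g, 0) + c
    before = _entropy(list(group_totals.values()))
    after = sum(n / total * _entropy(list(groups.values()))
                for n, groups in zip(value_totals, hist.values()))
    info = _entropy(value_totals)  # information content of the attribute value
    return (before - after) / info


# Hypothetical histogram for one attribute (assumed for illustration only).
hist = {"low": {"G1": 8, "G2": 2}, "high": {"G1": 1, "G2": 9}}
error = resubstitution_error(hist)
ratio = gain_ratio(hist)
```

Under the first goodness function the winner attribute would be the one with the smallest `resubstitution_error`; under the second it would be the one with the largest `gain_ratio`.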

Make_tree procedure functions
• Winner group
– returns the group that has the largest frequency for the value v of the chosen attribute

• Winner strength
– returns the strength as strong if the ratio of the frequency of the winner group to the total frequency for the value v of the attribute is above a certain precision threshold. The precision threshold
  • may have a fixed value (e.g. 1: the winner is strong only if instances of the winning group alone are present)
  • can be an adaptive function of the current depth of the classification tree:
    1 - (curr_depth/max_depth)
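The two functions and the adaptive threshold can be sketched as below. The function names, the "weak" label for a winner that is not strong, and the sample frequencies are assumptions. The comparison uses `>=` so that a fixed threshold of 1 still marks a pure value as strong, as the fixed-value example requires.

```python
def winner_group(freqs):
    """Return the group with the largest frequency for one attribute value v."""
    return max(freqs, key=freqs.get)


def adaptive_threshold(curr_depth, max_depth):
    """Precision threshold that relaxes as the tree grows deeper."""
    return 1 - curr_depth / max_depth


def winner_strength(freqs, threshold):
    """Return "strong" if the winner's share of the frequency reaches the
    threshold, otherwise "weak" (the "weak" label is an assumption)."""
    ratio = freqs[winner_group(freqs)] / sum(freqs.values())
    # ">=" keeps a pure value (ratio 1) strong under a fixed threshold of 1.
    return "strong" if ratio >= threshold else "weak"


# Hypothetical group frequencies for one value v (assumed for illustration).
freqs = {"G1": 9, "G2": 1}
fixed = winner_strength(freqs, 1)
adaptive = winner_strength(freqs, adaptive_threshold(curr_depth=2, max_depth=10))
```

With these frequencies the winner holds 90% of the examples, so it is not strong under the fixed threshold of 1 but is strong under the adaptive threshold at depth 2 of 10.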

Make_tree procedure
• Stopping condition
– all the intervals for the corresponding attribute are found to be strong
– there are no tuples in some range of values for the selected attribute
– branching may be limited by specifying a maximum tree depth

Performance (comparing IC to ID3)
• Generation efficiency of IC
– stems from doing k-ary decomposition and from using dynamic pruning

• Retrieval efficiency of IC
– k-ary trees are smaller and shallower, which leads to better retrieval performance

• Classification accuracy
– classification error was chosen as the measure of quality
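As a rough illustration of the accuracy measure, the sketch below takes the classification error to be the fraction of test objects that a classifier assigns to the wrong group. That definition, the function name `classification_error`, and the toy classifier and test set are assumptions, not details taken from the presentation.

```python
def classification_error(classify, test_set):
    """Fraction of (attributes, group) pairs that classify() gets wrong."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    wrong = sum(1 for attrs, group in test_set if classify(attrs) != group)
    return wrong / len(test_set)


# Hypothetical classifier and test data (assumed for illustration only).
def classify(attrs):
    return "G1" if attrs["age"] < 40 else "G2"


test_set = [
    ({"age": 25}, "G1"),
    ({"age": 35}, "G2"),  # misclassified by the toy classifier
    ({"age": 50}, "G2"),
    ({"age": 61}, "G2"),
]
error_rate = classification_error(classify, test_set)
```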

Comparing 3 versions of IC
• Classification error rates for 4 functions
• Adaptive precision (an adaptive precision threshold is used to choose the winner; max. depth = 10)
• Error pruning - a lookahead = 5
• Fixed precision = 0.9
• Functions 1 - 3 partition the attribute space into n-dimensional rectangular regions
• Function 4 partitions the attribute space into hyper-polyhedra (IC had to approximate the partitioning)

[Figure: classification error rate (axis from 0 to 18) for functions 1 - 4; series: Fixed Precision, Adaptive Precision, Error Pruning]

Comparing IC to ID3 (using IND tree package)
• Classification error rates for 4 functions
• Adaptive precision (an adaptive precision threshold is used to choose the winner; max. depth = 10)
• 5% perturbation in training and test data
• Functions 1 - 3 partition the attribute space into n-dimensional rectangular regions
• Function 4 partitions the attribute space into hyper-polyhedra (IC had to approximate the partitioning)

[Figure: classification error rate (axis from 0 to 18) for functions 1 - 4; series: ID3, Adaptive Precision]

Summary
• The problem of synthesizing CF for retrieving all instances of specified groups from a large database, based on a representative sample of examples, was considered
• A tree-based interval classifier was presented
• IC is designed to be interfaced efficiently with DB systems
• IC has been designed to be efficient in the generation of CF
• The novel aspect of IC is its treatment of non-categorical attributes (IC creates k-ary subtrees instead of binary ones, as is the case with ID3)
• Rather than generating a full tree and pruning it, IC does dynamic pruning as the tree is expanded, to make the classifier generation phase efficient
• IC generates SQL queries for CF that can be optimized using relational optimizers
• Comparison with ID3 indicates that IC compares quite favorably in classification accuracy


						