(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 1, 2011

Map Reduce for DC4.5 and Ensemble Learning in Distributed Data Mining

Dr. E. Chandra, Research Supervisor and Director, Department of Computer Science, D J Academy for Managerial Excellence, Coimbatore, Tamilnadu, India.
P. Ajitha, Research Scholar and Assistant Professor, Department of Computer Science, D J Academy for Managerial Excellence, Coimbatore, Tamilnadu, India.

Abstract— MapReduce, one of the distributed computing techniques, is integrated with the decision tree classifier C4.5 in the distributed environment, together with ensemble learning. This paper proposes an algorithm to classify and predict data using MapReduce for DC4.5 with ensemble learning. The proposed algorithm increases the accuracy and scalability of data handling. Noise handling in decision trees over distributed data is also addressed.

Keywords— C4.5, Distributed Decision Trees, MapReduce, Ensemble learning

I. INTRODUCTION

Classification with decision trees plays a major role in both distributed and centralised environments. Centralised classification techniques are not efficient at handling large volumes of data. A deluge of information is available in today's world, and it has become necessary to analyse that information and mine knowledge from it. For these reasons distributed classification techniques come in handy. One of the classification techniques, C4.5, is discussed here in the distributed environment, along with MapReduce, which is well suited to that environment.

The C4.5 decision tree algorithm is one of the most popular techniques used to predict and classify data. Handling massive data sets may lead to biased attribute selection, and the chances of missing values are high. C4.5 selects unbiased values and prunes the tree in cases of overfitting. Handling discrete and missing values with precision is also possible through C4.5. In a distributed environment, C4.5 selects attributes across the environment, but its inherent disadvantage is that it is an unstable classifier, so the further techniques of ensemble learning and MapReduce are utilised.

In ensemble learning, base classifiers are constructed from the data sets, and new data is classified by combining the predictions of the base classifiers. Ensembles pave the way to combining and perturbing many methods to increase accuracy, and they improve scalability in a significant way, with the ability to generate multiple versions of the classifiers from the training set by any method or with any set of parameters.

The SPRINT algorithm [3] achieves scalable performance in parallel and distributed environments. Prodromidis [4] discusses meta-learning with respect to the distributed environment, JAM [5] targets grid and cloud computing environments, and Weka [6] supports grid-enabled cross validation and testing. These are a few of the algorithms already specified for parallel and distributed environments with ensemble learning.

MapReduce [1] was proposed by Google for handling massive data sets. It has two primary functions, (1) a Map function and (2) a Reduce function, for distributed, parallel and cloud computing environments. The basic work of MapReduce is to iterate over the input, compute key-value pairs for each part of the input, group all intermediate values by key, then iterate over the resulting groups and finally reduce each group. Implementation issues like fault tolerance, load balancing and performance can also be handled in the MapReduce environment. A series of machine learning techniques are used with MapReduce to efficiently solve problems arising in large-scale distributed environments.

Recently ensemble learning has become popular because of its inherent nature of training many learners and combining or grouping their results. In the distributed environment, ensemble learning can increase accuracy and reduce computing effort. Utilising the advantages of MapReduce, ensembles and C4.5, a new classification algorithm, DEC4.5-MR, is proposed here. The distributed ensemble C4.5 with MapReduce algorithm will classify, construct and predict the data.

The rest of the paper is organised as follows. Section 2 discusses the related work on C4.5, ensembles and MapReduce in the distributed environment. Section 3 examines distributed C4.5 with ensembles and MapReduce. Section 4 presents the experimental evaluation, and Section 5 describes the conclusions of this paper.
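The iterate/group/reduce cycle described above can be illustrated with a minimal, single-process sketch (plain Python rather than an actual MapReduce framework; word counting is a stand-in task):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # Compute a (key, value) pair for each part of the input.
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce one group of intermediate values sharing a key.
    return (key, sum(values))

def map_reduce(records):
    # 1) Iterate the input and map it to key-value pairs.
    intermediate = [kv for r in records for kv in map_fn(r)]
    # 2) Group all intermediate values by key (the "shuffle").
    intermediate.sort(key=itemgetter(0))
    groups = groupby(intermediate, key=itemgetter(0))
    # 3) Reduce each group to a final result.
    return dict(reduce_fn(k, (v for _, v in g)) for k, g in groups)

print(map_reduce(["a rose is a rose", "a tree"]))
# → {'a': 3, 'is': 1, 'rose': 2, 'tree': 1}
```

In a real deployment the map calls, the shuffle, and the reduce calls each run in parallel across machines; only the grouping-by-key contract ties them together.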

                                                                                                     ISSN 1947-5500

II. RELATED WORK

A. C4.5 Method

C4.5 is a successor of CLS and ID3. It generates the classifier in the form of a decision tree and generates rules based on the results. The divide-and-conquer strategy utilised in C4.5 is described in Figure 1. Two heuristic evaluation methods, information gain and the Gini index, are used in C4.5; both heuristics have the ability to handle discrete and missing values, and the precision of data handling is high. Numeric or nominal data can also be considered. After construction, pruning can be done to avoid overfitting by using the pessimistic post-pruning method. Suppose a node covers N instances of which E are errors, so the observed error rate is f = E/N, and the node is set to its most frequent class; the error behaviour can be modelled as a Bernoulli experiment, as equation (1) specifies. The pessimistic error estimate can then be computed using equation (2):

\[ \Pr\!\left[\frac{f - q}{\sqrt{q(1-q)/N}} > z\right] = c \qquad (1) \]

\[ e = \frac{f + \dfrac{z^2}{2N} + z\sqrt{\dfrac{f}{N} - \dfrac{f^2}{N} + \dfrac{z^2}{4N^2}}}{1 + \dfrac{z^2}{N}} \qquad (2) \]

The algorithm for the distributed DC4.5 decision tree is specified in Figure 1.

B. Ensemble Learning in DDM

Ensemble learning methods in distributed data mining are of two kinds: either mining inherently distributed data, or scaling up ensemble methods based on partitioning the data and combining results. When the data is geographically distributed by nature, handling it in a centralised environment is not efficient without perturbing the conventional methods. Majority voting and weighted voting can both be combined in distributed decision trees, so the ensemble technique allows vertical partitioning, otherwise called heterogeneous data [7]. This ensemble technique increases accuracy and reduces processing time compared with centralised prediction. Scalability issues are also handled by ensemble methods.

Bagging, boosting and random subspaces are some of the algorithms among ensemble techniques. Here bagging, one of the most popular, is utilised, as it scales well in distributed environments [9] and is well able to handle noise in the data [8].

Bagging ensemble algorithm
Input: L: a classification method, D: a training data set, m: the number of base classifiers;
Output: φ(*): an ensemble classifier;
1: for i = 1 to m
2:   D' = bootstrap sampling from the set D;
3:   φi(*) = L(D');
4: end for;
/* Y is a nominal class label set */
5: φ(*) = arg max_{y∈Y} Σ_{i=1}^{m} [φi(*) = y];
Fig 2. The Bagging Ensemble Algorithm.
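Both pieces of this section can be made concrete in a small plain-Python sketch: the pessimistic error estimate of equation (2), and the bagging vote of Figure 2. The `majority_learner` standing in for C4.5 below is hypothetical, and z = 0.69 assumes C4.5's customary default confidence level of 25%:

```python
import math
import random
from collections import Counter

def pessimistic_error(f, N, z=0.69):
    # Upper confidence bound on a node's true error rate (equation 2);
    # f is the observed error rate E/N, N the instance count, and z the
    # normal deviate (z ≈ 0.69 corresponds to 25% confidence, the usual
    # C4.5 default).
    return ((f + z**2 / (2 * N)
             + z * math.sqrt(f / N - f**2 / N + z**2 / (4 * N**2)))
            / (1 + z**2 / N))

def bagging(L, D, m, rng=random.Random(0)):
    # Figure 2, steps 1-4: train m base classifiers on bootstrap
    # samples of D (sampling with replacement).
    classifiers = [L([rng.choice(D) for _ in range(len(D))])
                   for _ in range(m)]
    # Step 5: the ensemble predicts by majority vote.
    def phi(x):
        votes = Counter(c(x) for c in classifiers)
        return votes.most_common(1)[0][0]
    return phi

# Hypothetical stand-in for the C4.5 base learner: it just
# memorises the majority class of its training sample.
def majority_learner(sample):
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: label

D = [(0, "no"), (1, "yes"), (2, "yes"), (3, "yes")]
phi = bagging(majority_learner, D, m=5)
print(phi(0))
print(round(pessimistic_error(f=0.25, N=16), 3))
```

For a node with 4 errors out of 16 instances (f = 0.25), the pessimistic estimate comes out higher than the observed rate, which is exactly what drives C4.5's decision to prune.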

Input: a training set S, a node T;
Output: a decision tree with root T;
1: If the instances in S belong to the same class, or the number of instances in S is too small, set T as a leaf node and label T with the most frequent class in S;
2: Otherwise, choose a test attribute X with two or more outcomes based on a selection criterion, and label node T with X;
3: Partition S into subsets S1, S2, ..., Sn according to the outcomes of attribute X for each instance; generate T's n children nodes T1, T2, ..., Tn;
4: For every pair (Si, Ti), recursively build a subtree with root Ti;
5: Each group is then integrated and the global tree is generated.
Fig 1. Algorithm for the DC4.5 decision tree.

The DC4.5 algorithm for distributed decision trees describes how to use C4.5 in a distributed environment.

C. MapReduce Distributed Computing Model

The MapReduce distributed computing model and the operation of its two functions, Map and Reduce, are specified by the architecture shown in Figure 3: the framework splits the input into key-value pairs and assigns local Map and Reduce tasks, along with the necessary job scheduling work.

The Map function can be highly parallelised, and so can the Reduce function. Map functions usually deal with parallelised and distributed sub-missions; the Reduce function usually collects the sub-missions and parallelises or distributes them according to need [10]. The computing process of MapReduce is, in short: (i) divide the input into large sets; (ii) each (or several) data set(s) is handled in one cluster for intermediate processing; (iii) the intermediate results are combined in a final cluster. The Map function reads input data sets in the format <key1, value1>. After the analysis, it generates an intermediate result <key2, value2> and submits it to a Reducer; the Reduce function then combines the results to get a final list <key3, value3> according to the list <key2, value2>.


During the Map process, in order to improve combination efficiency, a Combiner can be used, which has a similar function to the Reducer and reduces locally. The transformation between the input and the output looks as follows [6]:

map (key1, value1) → (key2, value2) [ ]
reduce (key2, value2 [ ]) → (key3, value3) [ ]

Hadoop is an open-source framework implementation of the MapReduce parallel computing model for distributed programming. With the help of Hadoop, programmers can easily write parallel and distributed programs that run in computing clusters to deal with massive data [11]. The basic components of an application on Hadoop's MapReduce include a Mapper and a Reducer class, as well as a program to create a JobConf. Some applications also include a Combiner class, which is in effect a Reducer run locally.

Hadoop implements a distributed file system referred to as HDFS. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data, which is suitable for applications with large data sets. HDFS relaxes some requirements of POSIX, allowing streaming access to data in the file system. In addition, Hadoop implements the MapReduce distributed computing paradigm: MapReduce splits the mission of an application into small blocks of work, while HDFS establishes multiple replicas of data blocks for reliability and places them on compute nodes in server groups, so MapReduce can process the data on the associated nodes. Figure 3 shows the Hadoop MapReduce framework.

Fig 3. Hadoop MapReduce framework architecture.

III. PROPOSED ALGORITHM

A. Distributed C4.5 with ensemble and MapReduce

The proposed algorithm, DC4.5 with ensemble and MapReduce, has three phases: the Partition/Map phase, the Build-base-classifier phase, and the Reduce/Ensemble phase. The data set D is divided into n subsets {D1, D2, ..., Dn}, where users determine the value n. In the Map phase, a base classifier BCi is trained into a classifier Ci with the DC4.5 algorithm. In the Reduce/Ensemble phase, the n base classifiers are assembled into the final classifier using bagging.

B. Types of keys and values

The key and value types of MReC4.5 are as follows:
key1: Text
value1: Instances
key2: Text
value2: Iterator of Classifiers
key3: Text
value3: Classifier

key1, key2 and key3 are all the Text type offered by Hadoop, and their values are the file name associated with the input data set D. In the Partition phase, when the data set D is split into m data sets, each data set is formatted as value1 with the Instances type according to the input format of the C4.5 algorithm. In the Map phase, we build a classifier model with the C4.5 algorithm and obtain a classifier model set value2, which belongs to the Iterator of Classifiers type. In the Reduce phase, we assemble classifiers from value2 to obtain a classifier model value3 with the Classifier type.

C. Map/Reduce Phase

Figure 4 specifies the proposed algorithm for the Map operation with respect to the C4.5 algorithm. A change is made to the original algorithm so that Map and Reduce operate on the key-value pairs.

function mapper(key, value)
/* Build base-classifier */
1: Build a C4.5 classifier c with the data set value;
/* Submit intermediate results */
2: Emit(key, c);
3: Generate {D1, D2, ..., Dn} for all subsets;
4: Build and map with Ci;
5: Integrate the (key, c) pairs into one cluster.
Fig 4. The Map Operation.
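A single-process sketch of the Map operation of Figure 4 and the Reduce/Ensemble operation of Figure 5 may help fix ideas (plain Python rather than Hadoop's Java API; the `train_c4_5` stand-in for the C4.5 learner is hypothetical):

```python
from collections import Counter

def train_c4_5(instances):
    # Hypothetical stand-in for building a C4.5 tree: it just
    # memorises the majority class of its partition.
    label = Counter(y for _, y in instances).most_common(1)[0][0]
    return lambda x: label

def mapper(key, value):
    # Figure 4: build a base classifier on one data subset and
    # emit it under the shared key (the input file name).
    c = train_c4_5(value)
    return (key, c)

def reducer(key, value_list):
    # Figure 5: collect each classifier model, then perform the
    # bagging ensemble (majority vote over the base classifiers).
    classifiers = list(value_list)
    def ensemble(x):
        votes = Counter(c(x) for c in classifiers)
        return votes.most_common(1)[0][0]
    return (key, ensemble)

# Partition phase: split D into subsets D1..Dn.
D = [(0, "yes"), (1, "yes"), (2, "no"), (3, "yes"), (4, "yes"), (5, "no")]
subsets = [D[0:2], D[2:4], D[4:6]]

# Map phase, then shuffle by key, then Reduce/Ensemble phase.
intermediate = [mapper("D", s) for s in subsets]
key, model = reducer("D", [c for _, c in intermediate])
print(model(0))  # → yes
```

In Hadoop the shuffle between `mapper` and `reducer` is performed by the framework; here it is simulated by collecting the emitted classifiers under their common key.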

In the mapper function a C4.5 classifier is used, so that the data set value with its key is taken and paired with the distributed data list. After pairing with the value_list, the key is emitted. All the subsets of the data sets adhering to the key are generated and classified; from the classifiers of all the data subsets, C1, C2, ..., Cn are built and mapped to D1, D2, ..., Dn.

D. Reduce/Ensemble Phase

Figure 5 specifies the proposed algorithm for the Reduce operation with the bagging ensemble, which eliminates noise and scales well with the data.

function reducer(key, value_list)
/* Get each classifier model */
1: foreach value in value_list
2:   classifiers[i++] = getClassifier(value);
/* Perform the bagging ensemble */
3: c = baggingEnsemble(classifiers);
4: Emit(key, c);
5: Integrate Ci and value_list;
6: Generate the model and predict the results;
7: Reduce the value_list based on the key pair;
8: Associate Ci and the key with the model; only the key which provides the closest combination to the value_list is generated.
Fig 5. The Reduce Operation for DC4.5 with ensemble and MapReduce.

value_list specifies the combination of all intermediate lists, and the final result is submitted to the MapReduce framework.

DC4.5 partitions the data, and after partitioning the data sets into the required format, the data are transferred to the next phase, the build/classifier phase. In the classifier phase the decision tree classification technique for distributed data is utilised; here the Gini index and information gain are made use of. After the classifier phase, the Map operations are built and mapped to the cluster with reference to the key-value pairs.

In the next phase, the Reduce/Ensemble phase, each of the values is classified, and the bagging ensemble technique is used for further classification and model generation. After generating the models, the results are predicted according to the key-value pairs and integrated based on the key.

IV. EXPERIMENTAL EVALUATION

The computational complexity of the algorithm is O((n+m) log n), where n and m specify the values and the accuracy; the log n factor indicates how it scales compared with centralized decision trees. Compared to other methods this achieves greater accuracy, as it takes O(n), where n is the size of the large data sets.

Fig 6. Accuracy vs. scalability chart for DC4.5 and other methods.

Hadoop MapReduce takes the various data sets with attribute characteristics that are both numeric and categorical. In this experimental evaluation, the computational complexity of each MapReduce phase is on the basis of θ(n log n²), where n specifies the data sets, and accuracy handling is based on the emitted key-value pairs.

V. CONCLUSIONS

This paper deals with distributed C4.5 with ensemble learning for the distributed environment, utilizing the MapReduce framework. The proposed algorithm increases accuracy as well as the scalability of the data handled. The inherent amount of data handled in distributed decision trees can be improved based on the value_list of the key pairs in the MapReduce environment.

Compared to centralized decision trees, the proposed algorithm can be utilised efficiently in terms of processing time and computation cost. Future work may enhance this with a detailed implementation on the Hadoop MapReduce framework, so that millions of data items with their characteristics can also be integrated.

REFERENCES

1. J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in OSDI'04: Proceedings of the 6th Symposium on Operating System Design and Implementation, San Francisco, California, USA. USENIX Association, pp. 137-150, December 6-8, 2004.
2. Apache, "Hadoop", 2006.
3. J. C. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A scalable parallel classifier for data mining," in VLDB'96: Proceedings of the 22nd International Conference on Very Large Data Bases, Mumbai (Bombay), India, Morgan Kaufmann, pp. 544-555, September 3-6, 1996.
4. A. Prodromidis, P. Chan, and S. Stolfo, "Meta-learning in distributed data mining systems: Issues and approaches," Advances in Distributed and Parallel Knowledge Discovery, Vol. 114, 2000.
5. K. Cardona, J. Secretan, M. Georgiopoulos, and G. Anagnostopoulos, "A Grid Based System for Data Mining Using MapReduce", Technical Report, University of Puerto Rico, July 2007.
6. R. Khoussainov, X. Zuo, and N. Kushmerick, "Grid-enabled Weka: A toolkit for machine learning on the grid," ERCIM News No. 59, October


7. D. B. Skillicorn and S. M. McConnell, "Distributed prediction from vertically partitioned data," J. Parallel Distributed Computing, 68: 16-36, 2008.
8. T. G. Dietterich, "Machine-learning research: Four current directions," The AI Magazine, vol. 18, no. 4, 1998, pp. 97-136.
9. B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. Chapman & Hall/CRC, May 1994.
10. D. Borthakur, The Hadoop Distributed File System: Architecture and Design, The Apache Software Foundation, 2007.
11. L. A. Barroso, J. Dean, and U. Holzle, "Web search for a planet: The Google cluster architecture," Micro, IEEE, vol. 23, no. 2, 2003, pp. 22-28.

AUTHORS PROFILE

Dr. E. Chandra received her B.Sc. from Bharathiar University, Coimbatore in 1992 and received her M.Sc. from Avinashilingam University, Coimbatore in 1994. She obtained her M.Phil. in the area of Neural Networks from Bharathiar University in 1999. She obtained her PhD degree in the area of speech recognition systems from Alagappa University, Karaikudi in 2007. She has a total of 15 years of experience in teaching, including 6 months in industry. Presently she is working as Director, Department of Computer Applications at D. J. Academy for Managerial Excellence, Coimbatore. She has published more than 30 research papers in national and international journals and conferences in India and abroad. She has guided more than 20 M.Phil. research scholars; currently 3 M.Phil. scholars and 8 Ph.D. scholars are working under her guidance. She has delivered lectures at various colleges and is a Board of Studies member of various institutions. Her research interest lies in the areas of Data Mining, Artificial Intelligence, Neural Networks, Speech Recognition Systems, Fuzzy Logic and Machine Learning Techniques. She is an active Life member of CSI and the Society of Statistics and Computer Applications, and is currently a Management Committee member of the CSI Coimbatore Chapter.

P. Ajitha received her B.Com from Bharathiar University, Coimbatore in 1998 and received her MCA from Bharathidasan University, Trichy in 2001. She obtained her M.Phil. in the area of Data Mining in 2004. She has 9 years of experience in teaching and 3 months of industrial experience. Currently she is working as Assistant Professor, Department of Computer Applications, D. J. Academy for Managerial Excellence, Coimbatore, and is pursuing her Ph.D. at Bharathiar University, Coimbatore. She has presented more than 10 research papers in national and international conferences and published a paper in an international journal. Her research interest lies in Distributed Data Mining, Machine Learning and Artificial Intelligence. She is a Life member of CSI, a life member of the Institute of Advanced Scientific Research, and also a member
