					Short Paper
                        Proc. of Int. Conf. on Advances in Computing, Control, and Telecommunication Technologies 2011

      Design and Implementation of MapReduce-based
      Image Conversion Module in Cloud Computing
                                 Hyeokju Lee1, Myoungjin Kim2, Joon Her3, Hanku Lee*
                      Konkuk University/Department of Internet & Multimedia Engineering, Seoul, Korea
                       Konkuk University/Department of Internet & Multimedia Engineering, Seoul, Korea
                                      Email: {tough105, herj00n, hlee}

Abstract: In recent years, the rapid advancement of Internet networks and the
growing number of people using SNS (social networking services) have made it
easier for users to share multimedia data on the Internet. However, the
increasing amount of data has put a heavier burden on the computing
infrastructure needed to process multimedia data, for example for transcoding
and transmoding. We therefore propose and evaluate a MapReduce-based image
conversion module for cloud computing environments. The proposed module
consists of two parts: a storage system (HDFS) for image data and a MapReduce
program that uses the JAI library for image transcoding. The module processes
image data in a distributed and parallel manner in a cloud computing
environment, and can thus reduce the overhead on the computing infrastructure.
In this paper, we explain how to implement the proposed module with Hadoop and
JAI. In addition, we evaluate the proposed module in terms of processing time.

Index Terms: Cloud Computing, Hadoop, MapReduce, HDFS, Image Conversion.

                        I. INTRODUCTION

    Multimedia applications have been driving the need for increased computing
capability on devices such as personal computers, smart phones, and consumer
appliances. Multimedia applications are characterized by large amounts of data
and require large amounts of processing, storage, and communication resources.
Recently, the spread of Internet networks and the growing number of people
using SNS (Social Networking Services) have made it easier for users to share
multimedia data on the Internet. This increase in the amount of data has put a
heavier burden on the computing infrastructure necessary to process multimedia
data [1]. For example, users upload multimedia content to an SNS, and the SNS
processes that content so that other users can access it in various formats.
In this process, the SNS needs huge computing resources for media transcoding
and transmoding. In addition, present-day multimedia data is large and high
definition. Therefore, general-purpose devices and methods are limited and are
not a cost-effective solution. At present, there is only a little research on
transcoding based on parallel distributed processing [2][3]. In this paper, we
design and implement an image conversion module based on Hadoop MapReduce and
HDFS (Hadoop Distributed File System) to solve the above problems. The
proposed module consists of two parts. The first part stores image data in
HDFS for distributed parallel processing. The second part processes the image
data stored in HDFS using the Hadoop MapReduce framework and JAI. We use the
SequenceFile method in the Map function to address the small-files problem [4].
    We performed two experiments to evaluate the proposed module. In the first
experiment, we compared the proposed module with a non-Hadoop-based
single-machine program. In the second experiment, we evaluated the proposed
module with the JVM (Java Virtual Machine) reuse option, which targets the
many-small-files problem.
    The rest of this paper is organized as follows. Section 2 introduces
Hadoop HDFS, MapReduce, and JAI. Section 3 presents the module architecture
and its features. Section 4 describes how the module is implemented. Section 5
presents the evaluation results. Lastly, Section 6 concludes the paper with
suggestions for future research.

                        II. RELATED WORK

A. HDFS (Hadoop Distributed File System)
    Hadoop Distributed File System (HDFS) is the primary storage system used
by Hadoop applications [5]. HDFS creates multiple replicas of data blocks and
distributes them on compute nodes throughout a cluster to enable reliable,
extremely rapid computations. HDFS has a master-slave structure and uses the
TCP/IP protocol to communicate with each node. Figure 1 shows the structure of
HDFS.

                   Figure 1. The HDFS Structure

  *Corresponding Author
© 2011 ACEEE
DOI: 02.ACT.2011.03.51
In Figure 1, the name node manages the file system namespace and controls file
access by clients. Each data node manages the storage attached to its node in
the cluster. In addition, the data nodes perform block operations as
instructed by the name node.

B. MapReduce
    MapReduce is a programming model for processing large-scale data in a
distributed and parallel manner [6]. MapReduce processes a large data set by
dividing it among multiple servers. Figure 2 shows the structure.

                   Figure 2. The MapReduce Structure

    A MapReduce job is divided into two stages, Map and Reduce. In the Map
stage, the input records are read as <key, value> pairs and transformed in
parallel. In the Reduce stage, the data produced by the Map stage is
summarized or combined according to the user's definition to produce the
result. The developer implements only the Map and Reduce functions; the rest
of the process, including distribution and parallel processing, is handled
automatically by the MapReduce framework. Hadoop MapReduce implements the
MapReduce programming model together with a framework and runtime environment.

C. JAI (Java Advanced Imaging)
    JAI is an open-source Java library for image processing [7]. JAI supports
various image formats (BMP, JPEG, PNG, PNM, TIFF) and provides encoder and
decoder functions. In addition, most of its features are provided through an
API and a simple framework for image processing.

               III. IMAGE CONVERSION MODULE ARCHITECTURE

    In this paper, we design and implement a MapReduce-based image conversion
module to solve the following problem: the burden on Internet infrastructure
is increasing because more and more multimedia data is shared through the
Internet. The traditional method of multimedia transcoding usually relies on
general-purpose devices and offline processing. However, multimedia data
processing is time-consuming and requires a lot of computing resources. To
solve this problem, we designed an image conversion module based on cloud
computing power. The proposed module can resize images and convert their
format. It uses HDFS as storage for distributed parallel processing, so the
image data is distributed across HDFS. For the distributed parallel processing
itself, the proposed module uses the Hadoop MapReduce framework and JAI.
Figure 3 shows the proposed module architecture.

          Figure 3. The Image Conversion Module Architecture

    In Figure 3, the proposed module stores image data in HDFS, and HDFS
automatically distributes the data to each node. The Map function processes
the entire image data set in a distributed and parallel manner. The proposed
module implements only the Map function. In the Map function, the proposed
module uses JAI for image resizing and format conversion.

            IV. IMPLEMENTATION OF IMAGE CONVERSION MODULE

    In this paper, the image conversion module is implemented on Hadoop.
However, small files such as image files are a problem for Hadoop MapReduce
processing. Map tasks usually process a block of input at a time. If the files
are very small and there are a lot of them, then each map task processes very
little input, and there are a lot more map tasks, each of which imposes extra
bookkeeping overhead. Compare a 1 GB file broken into 16 blocks of 64 MB with
10,000 or so 100 KB files. The 10,000 files use one map task each, and the job
can be tens or hundreds of times slower than the equivalent job with a single
input file. There are a couple of features that help alleviate the bookkeeping
overhead. The first is task JVM reuse: we run multiple map tasks in one JVM,
thereby avoiding some JVM startup overhead, and we evaluate the performance of
this method. The other is the SequenceFile method in the Map function. The
SequenceFile method uses the filename as the key and the file contents as the
value. This method is an optimization for many small files. In the proposed
module, we use Hadoop's BytesWritable class to hold the content of the input
image data. The proposed module converts the size and format of an image
according to user options (maxWidth, maxHeight, Image Format).
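The conversion step described in Section IV (resizing an image to fit within
maxWidth and maxHeight, then re-encoding it in the target format) can be
sketched as follows. This is a minimal illustration and not the module's code:
it uses the standard javax.imageio and java.awt classes in place of JAI, and
it assumes that the aspect ratio is preserved and that images are never
enlarged, neither of which is stated in this paper. The class and method names
(ImageConvertSketch, fitWithin, convert) are hypothetical.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical sketch: standard-library stand-in for the JAI-based conversion.
class ImageConvertSketch {

    /** Largest size that fits in maxWidth x maxHeight, keeping the aspect ratio (assumption). */
    static int[] fitWithin(int width, int height, int maxWidth, int maxHeight) {
        double scale = Math.min(1.0,  // never enlarge (assumption)
                Math.min((double) maxWidth / width, (double) maxHeight / height));
        int w = Math.max(1, (int) Math.round(width * scale));
        int h = Math.max(1, (int) Math.round(height * scale));
        return new int[] { w, h };
    }

    /** Decodes image bytes, resizes them, and re-encodes them in the given format. */
    static byte[] convert(byte[] input, int maxWidth, int maxHeight, String format)
            throws IOException {
        BufferedImage src = ImageIO.read(new ByteArrayInputStream(input));
        if (src == null) {
            throw new IOException("unsupported or corrupt image data");
        }
        int[] size = fitWithin(src.getWidth(), src.getHeight(), maxWidth, maxHeight);
        // TYPE_INT_RGB has no alpha channel, so the result can also be written as JPEG.
        BufferedImage dst = new BufferedImage(size[0], size[1], BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, size[0], size[1], null);
        g.dispose();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (!ImageIO.write(dst, format, out)) {
            throw new IOException("no writer for format: " + format);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Build a synthetic 400x200 PNG in memory instead of reading a real file.
        BufferedImage sample = new BufferedImage(400, 200, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream png = new ByteArrayOutputStream();
        ImageIO.write(sample, "png", png);

        byte[] jpg = convert(png.toByteArray(), 100, 100, "jpg");
        BufferedImage result = ImageIO.read(new ByteArrayInputStream(jpg));
        System.out.println(result.getWidth() + "x" + result.getHeight()); // 100x50
    }
}
```

In a Map function, the input bytes would come from the record value (the file
contents) rather than from a synthetic image, and the filename key could be
used to name the converted output.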
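The small-files comparison in Section IV (one 1 GB file versus about 10,000
files of 100 KB) can be worked through with a short calculation. The sketch
below is only an estimate under stated assumptions: one map task per 64 MB
block, at least one map task per file, and no per-record overhead when the
small files are packed into a single container such as a SequenceFile. It does
not reproduce Hadoop's actual split logic, and all names are hypothetical.

```java
// Hypothetical sketch: estimates map-task counts; not Hadoop's InputFormat logic.
class SmallFilesSketch {
    static final long KB = 1024L;
    static final long MB = 1024L * KB;
    static final long GB = 1024L * MB;
    static final long BLOCK_SIZE = 64L * MB; // block size used in the experiments

    /** Map tasks for one file: one per block, and at least one even for a tiny file. */
    static long mapTasksForFile(long fileBytes) {
        return Math.max(1L, (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE);
    }

    /** Map tasks when every small file is a separate input. */
    static long mapTasksSeparate(long fileCount, long bytesPerFile) {
        return fileCount * mapTasksForFile(bytesPerFile);
    }

    /** Map tasks when all files are packed into one container (record overhead ignored). */
    static long mapTasksPacked(long fileCount, long bytesPerFile) {
        return mapTasksForFile(fileCount * bytesPerFile);
    }

    public static void main(String[] args) {
        System.out.println(mapTasksForFile(1 * GB));            // 16
        System.out.println(mapTasksSeparate(10_000, 100 * KB)); // 10000
        System.out.println(mapTasksPacked(10_000, 100 * KB));   // 16
    }
}
```

Under these assumptions, packing the small files brings the number of map
tasks back to the same order as for a single large file, which is the
motivation for storing filename and contents as key and value in one
container.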

                          V. EVALUATION

    The cloud server used in the experiments is a single enterprise-scale
cluster that consists of 28 computational nodes. Table 1 shows the setup of
the evaluation cluster. Since the structure of the cluster is homogeneous, it
provides a uniform evaluation environment.

    Nine data sets are used to verify our proposed module; the average size of
one image file is approximately 19.8 MB. Table 2 shows the details of the data
sets.

                       TABLE II. IMAGE DATASETS

    The other Hadoop options are left at their defaults: 1) the block
replication factor is 3; 2) the block size is 64 MB.
    We evaluated the performance of the proposed module and of its
optimization in two experiments. In the first experiment, we compare our
proposed module with a non-Hadoop-based single program on two different single
machines. The specifications of the two machines are shown in Table 3. We
measure the running time of the proposed module using MapReduce programming,
and the running times on machines A and B using only sequential programming
with JAI and without Hadoop. Figure 4 shows the result of the first
experiment.

     Figure 4. Elapsed Time for proposed module with two different machines.

    The elapsed times on machines A and B are shorter than the running times
of our cluster when it uses only 1 or 2 nodes. The single machines without
Hadoop perform better in these cases because distributed processing with
MapReduce introduces overhead related to the creation of map tasks, job
scheduling, and the transfer speed of HDFS.
    In the second experiment, we compare JVM reuse option 1 with JVM reuse
option -1. With option 1, each JVM runs only one task, that is, JVMs are not
reused. With option -1, a JVM can be reused for an unlimited number of tasks.
Figure 5 shows the result of the second experiment.

              Figure 5. Elapsed Time for JVM reuse option

    The elapsed times do not differ up to 520 files. From 1040 files onward,
however, the performance difference grows. Once the number of files to process
exceeds a certain level, creating a JVM for each map task causes noticeable
overhead. As the above result shows, the JVM reuse option is one solution.
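To illustrate why the gap between the two settings widens with the number of
files, the following sketch counts JVM launches under a deliberately rough
model. It assumes one map task per input file and a fixed number of
concurrently running map tasks (slots); the value of 56 slots is purely
hypothetical and is not taken from the cluster configuration. The values 1 and
-1 correspond to the values of Hadoop's mapred.job.reuse.jvm.num.tasks
property. The sketch says nothing about elapsed time, only about how many JVMs
must be started.

```java
// Hypothetical sketch: a rough model of JVM launches, not a measurement.
class JvmReuseSketch {

    /**
     * Approximate number of JVMs started for a job.
     * tasksPerJvm mirrors the reuse setting: 1 = no reuse, -1 = unlimited reuse.
     * slots is the number of map tasks that can run at the same time (assumed).
     */
    static long jvmLaunches(long mapTasks, int tasksPerJvm, long slots) {
        if (mapTasks <= 0) {
            return 0;
        }
        long concurrent = Math.min(mapTasks, slots);
        if (tasksPerJvm == -1) {
            return concurrent; // each slot keeps one JVM for the whole job
        }
        if (tasksPerJvm < 1) {
            throw new IllegalArgumentException("tasksPerJvm must be -1 or >= 1");
        }
        long byLimit = (mapTasks + tasksPerJvm - 1) / tasksPerJvm;
        return Math.max(byLimit, concurrent);
    }

    public static void main(String[] args) {
        long slots = 56; // assumed value, for illustration only
        for (long files : new long[] { 520, 1040 }) {
            System.out.println(files + " files: no reuse = "
                    + jvmLaunches(files, 1, slots)
                    + ", unlimited reuse = " + jvmLaunches(files, -1, slots));
        }
        // 520 files: no reuse = 520, unlimited reuse = 56
        // 1040 files: no reuse = 1040, unlimited reuse = 56
    }
}
```

In this model, the number of JVM launches without reuse grows linearly with
the number of files, while with unlimited reuse it stays bounded by the number
of slots, which is consistent with the growing difference observed above.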

               CONCLUSIONS & FUTURE WORK

    Recently, the spread of Internet networks and the growing number of people
using SNS have made it easier for users to share multimedia data on the
Internet. However, this increase in the amount of data has put a heavier
burden on the Internet infrastructure necessary to process multimedia data. We
therefore designed and implemented a MapReduce-based image conversion module
for cloud computing environments. The proposed module is based on Hadoop HDFS
and the MapReduce framework for distributed parallel processing of large-scale
image data, and it uses the JAI library for image format conversion and
resizing. In this way, it uses cloud computing power to handle multimedia data
processing. We performed two experiments to evaluate the proposed module. In
the first experiment, we compared the proposed module with a non-Hadoop-based
single program that uses the JAI library. The proposed module outperforms the
single program when the cluster uses more than 2 nodes. In the second
experiment, we changed the mapred.job.reuse.jvm.num.tasks option in the
mapred-site.xml file and evaluated its effect on performance. The results show
that JVM reuse gives better performance when the module processes many small
files. Future research should address video data as well as image data. We
will implement an integrated multimedia processing system and a multimedia
sharing system for SNS in a cloud computing environment. In addition, we will
study optimizations for a multimedia file system.

                     ACKNOWLEDGMENT

    This research was supported by the MKE (Ministry of Knowledge Economy),
Korea, under the ITRC (Information Technology Research Center) support program
supervised by the NIPA (National IT Industry Promotion Agency)
(NIPA-2011-(C1090-1101-0008)).
    This work was also supported by the Seoul Metropolitan Government 'Seoul
R&BD Program (SS100006)'.

                       REFERENCES

[1] Sun-Moo Kang, Bu-Ihl Kim, Hyun-Sok Lee, Young-so Cho, Jae-Sup Lee,
Byeong-Nam Yoon, "A study on a public multimedia service provisioning
architecture for enterprise networks", Network Operations and Management
Symposium, 1998, NOMS 98., IEEE, 15-20 Feb 1998, 44-48 vol.1,
ISBN: 0-7803-4351-4
[2] Hari Kalva, Aleksandar Colic, Garcia, Borko Furht, "Parallel programming
for multimedia applications", Multimedia Tools and Applications, volume 51,
number 2, 801-818, DOI: 10.1007/s11042-010-0656-2
[3] Garcia, A., Kalva, H., "Cloud transcoding for mobile video content
delivery", Consumer Electronics (ICCE), 2011 IEEE International Conference on,
9-12 Jan. 2011, 379-380, ISSN: 2158-3994
[4] problem/
[5] Hadoop Distributed File System :
[6] Jeffrey Dean, Sanjay Ghemawat, "MapReduce: Simplified Data Processing on
Large Clusters", OSDI'04: Sixth Symposium on Operating System Design and
Implementation, San Francisco, CA, December, 2004.
[7] Java Advanced Imaging Library :

