                      B629 CLOUD COMPUTING
                           PROJECT 2
          Kaushik Chandrasekaran, Nabeel Ahamed Akheel


GOAL

The goal of this project is to learn the concepts of the iterative MapReduce
programming model and to try out one implementation of it, Twister Kmeans with
multiple reducers. We are required to implement an automated Twister Kmeans
program that runs with different initial centroids and reports the best case.

PROJECT DESCRIPTION

The main objective of this project is to run the K-Means data generation to
create 10 initial centroid files. Once these centroid files are obtained, we run
K-Means on each of them and report the best centroid locations and the least
objective value among those 10 runs. Hence a total of 10 x 10 runs would be
required.
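
The selection described above can be written compactly. The notation here is
our own assumption and does not come from the project handout: x_i are the N
data points, c_j^(r) are the K final centroids of run r, and each point is
assumed to be measured against its nearest centroid.

```latex
J^{(r)} = \sum_{i=1}^{N} \min_{1 \le j \le K} \bigl\lVert x_i - c^{(r)}_j \bigr\rVert_2,
\qquad
r^{*} = \arg\min_{1 \le r \le 10} J^{(r)}
```

Under these assumptions, run r* is the best case and J^(r*) is the least
objective value.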

STEPS INVOLVED

   1.) Follow the instructions given in the initial document to start Twister
       and confirm that it is running.

   2.) Generate the 10 initial centroid files using the following command:

       ./gen_data.sh [init clusters file][num of clusters][vector length][sub dir]
       [data file prefix][number of files to generate][number of data points]

   3.) Create the partition files.

   4.) Obtain the final centroid points for each file.

   5.) Calculate the Euclidean distance between the data points and the final
       centroids.

   6.) Sum the Euclidean distances over all points for each file.

   7.) Repeat step 6 for all the files and find the least sum.

   8.) The minimum value obtained is the least objective function value.
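
Steps 5 to 8 can be sketched in plain Python as shown here. This is only an
illustration and not the Twister code used in the project: the function names
(nearest_distance, objective_value, best_run) are hypothetical, the points and
centroids are assumed to be already loaded as tuples of floats, and each point
is assumed to be measured against its nearest final centroid.

```python
import math


def nearest_distance(point, centroids):
    """Euclidean distance from one point to its closest centroid (step 5)."""
    return min(math.dist(point, c) for c in centroids)


def objective_value(points, centroids):
    """Sum of nearest-centroid distances over all points (step 6)."""
    return sum(nearest_distance(p, centroids) for p in points)


def best_run(points, final_centroid_sets):
    """Score every run and return (index, objective, centroids) of the
    run with the least objective value (steps 7 and 8)."""
    scored = [
        (objective_value(points, centroids), index)
        for index, centroids in enumerate(final_centroid_sets)
    ]
    best_value, best_index = min(scored)
    return best_index, best_value, final_centroid_sets[best_index]


if __name__ == "__main__":
    # Toy data standing in for the partition file and two finished runs.
    points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    runs = [
        [(0.0, 1.0), (10.0, 1.0)],
        [(5.0, 0.0), (5.0, 2.0)],
    ]
    index, value, centroids = best_run(points, runs)
    print(index, value, centroids)
```

In the project itself the list passed to best_run would hold the 10 sets of
final centroids, one per initial centroid file.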


COMPARISON WITH THE EXISTING SYSTEM

The algorithm followed is similar to the existing algorithm. The key differences
are as follows:

   1.) The calculation is performed on 10 initial centroid files.
   2.) The objective function value is calculated for each file.
   3.) The least of the 10 objective function values is obtained.

OBSERVATIONS

   a.) The sequential complexity per iteration for K centers and N points is O(NK).
       Since 80 mapper tasks are used in this program, the time complexity per
       mapper is O(NK/80).


   b.) The time complexity of the Reduce task would be 80 * N * (K/80) = O(NK).


   c.) A speedup is expected as the value of N increases. Since Twister is an
       iterative version of MapReduce, the performance gain should grow as N
       increases.


   d.) First we obtain the objective function value for each of the centroid files
       (10 in our case). To do so, we sum the distances from the final centroid
       points to the 80,000 points in the partition file. The minimum value
       obtained is the best solution.
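
The way the per-iteration work is divided among mappers can be illustrated with
a small single-process sketch. It is not the Twister API: split, map_task,
reduce_task and kmeans_iteration are hypothetical names, and the sketch assumes
that each mapper emits per-centroid partial sums which a reducer then combines.
With M mappers, each map_task call handles about N/M points and evaluates K
distances per point.

```python
import math


def split(points, num_mappers):
    """Deal the points round-robin into num_mappers partitions."""
    return [points[m::num_mappers] for m in range(num_mappers)]


def map_task(partition, centroids):
    """Assign each point to its nearest centroid and emit partial sums.

    Evaluates len(partition) * len(centroids) distances.
    """
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in partition:
        j = min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def reduce_task(partials, old_centroids):
    """Combine the partial sums from all mappers into new centroids."""
    dim = len(old_centroids[0])
    new_centroids = []
    for j, old in enumerate(old_centroids):
        total = sum(counts[j] for _, counts in partials)
        if total == 0:
            # No point was assigned to this centroid; keep it unchanged.
            new_centroids.append(tuple(old))
            continue
        new_centroids.append(tuple(
            sum(sums[j][d] for sums, _ in partials) / total
            for d in range(dim)
        ))
    return new_centroids


def kmeans_iteration(points, centroids, num_mappers):
    """One K-means iteration: map over partitions, then reduce."""
    partials = [map_task(part, centroids) for part in split(points, num_mappers)]
    return reduce_task(partials, centroids)
```

Calling kmeans_iteration repeatedly until the centroids stop moving gives the
iterative behaviour; the result does not depend on the number of mappers, only
the amount of work per mapper does.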

FLOW CHART

                         Start
                           |
                           v
             Generate initial cluster files
                           |
                           v
                Generate partition file
                           |
                           v
                 Run K-means for the
             generated input cluster files
                           |
                           v
              Get final centroid of each
                   input cluster file
                           |
                           v
             Calculate Euclidean distance
                           |
                           v
             Sum the Euclidean distances
              for each point in each file
                           |
                           v
                  Done for 10 files?  -- No --> (repeat for the next file)
                           |
                          Yes
                           |
                           v
            Find the least summed Euclidean
                        distance
                           |
                           v
             Print the centroid with the
              least Euclidean distance
                           |
                           v
                         Stop

REFERENCES

  [1] CS machine assignment,
      https://docs.google.com/spreadsheet/ccc?key=0AtR8aHmmVF3ydDdncnRucVhrYXQ5VkVMYnd0U3E0MEE&hl=en_US#gid=0
  [2] Twister 0.9 package,
      http://salsahpc.indiana.edu/csci-b649-2011/files/Twister-0.9.tar.gz
  [3] Kmeans Wiki, http://en.wikipedia.org/wiki/Kmeans_clustering
  [4] Twister official website, http://www.iterativemapreduce.org/

								