
B629 CLOUD COMPUTING PROJECT 2 Kaushik Chandrasekaran by dandanhuanghuang


									                 B629 CLOUD COMPUTING
                       PROJECT 2
              Kaushik Chandrasekaran, Nabeel Ahamed Akheel


The goal of this project is to learn the iterative MapReduce programming
model by working with one implementation of it, Twister Kmeans with Multiple
Reducers. We are required to implement an automated Twister Kmeans program
that runs with different initial centroids and reports the best result.


The main objective of this project is to run the K-Means data generation to
create 10 initial centroid files. Once these centroid files are obtained, we
run K-Means on each of them and report the best centroid locations and the
least objective function value among those 10 runs. Hence a total of 10 x 10 runs would be


   1.) Follow the instructions given in the initial document and make sure
       Twister is started and running.

   2.) Generate the initial 10 centroid files using the following command:

       ./ [init clusters file][num of clusters][vector length][sub dir][data file
       prefix][number of files to generate][number of data points]

   3.) Create the partition files.

   4.) Obtain the final centroid points for each of the files.

   5.) Calculate the Euclidean distance.

   6.) Find the sum of the Euclidean distances over all points in a file.

   7.) Repeat step 6 for all the files and find the least summation value.

   8.) The minimum value obtained is the least objective function value.
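
Steps 5 to 8 can be sketched in plain Python as follows. This is a minimal
illustration, not the Twister program itself: the function names `objective`
and `best_run` are hypothetical, and the sketch assumes that the distance
counted for each data point is the one to its nearest final centroid.

```python
import math


def objective(points, centroids):
    """Sum, over all points, of the Euclidean distance to the nearest centroid.

    Assumption: each point is charged the distance to its closest centroid.
    """
    return sum(min(math.dist(p, c) for c in centroids) for p in points)


def best_run(points, final_centroid_sets):
    """Return (index, objective value) of the centroid set with the least objective."""
    scores = [objective(points, cs) for cs in final_centroid_sets]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy data standing in for the partition file and two of the final centroid sets.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
runs = [
    [(0.0, 1.0), (10.0, 1.0)],  # one centroid in the middle of each group
    [(5.0, 0.0), (5.0, 2.0)],   # centroids placed between the groups
]
index, value = best_run(points, runs)
print(index, value)  # prints: 0 4.0
```

With 10 final centroid sets, `best_run` would be called with a list of 10
entries, and the returned value corresponds to the least objective function
value of step 8.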


 The algorithm followed is similar to the existing algorithm. The vital differences
are as follows:

   1.) The calculation is performed on 10 initial centroid files.
   2.) The objective function value is calculated for each file.
   3.) The least objective function value amongst 10 Objective function values are


   a.) The sequential complexity per iteration for K centers and N points is O(NK).
       Since 80 Mapper tasks are used in this program, the time complexity per
       Mapper is O(NK/80).

   b.) The time complexity of the Reduce task would be 80*N*(K/80) = O(NK).

   c.) A positive speedup is obtained as the value of N increases. Since Twister
       is an iterative version of MapReduce, the performance gain would grow as
       N increases.

   d.) We obtain the objective function value for each of the centroid files we have
       (10 in our case): for each file, we sum the distances between its centroid
       points and the 80,000 points in the partition file. The minimum of these
       values is the best solution.
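
The way one K-means iteration divides into map and reduce work can be sketched
in plain Python. This is an illustration under stated assumptions, not the
Twister API: the names `map_partition`, `reduce_partials` and
`kmeans_iteration` are hypothetical, and the map calls run one after another
here, whereas a framework would run them in parallel (80 Mapper tasks in this
project).

```python
import math


def map_partition(points, centroids):
    """Map task: assign each point of one partition to its nearest centroid.

    Returns {centroid index: (coordinate-wise sum, point count)}.
    Costs O(n * K) distance computations for the n points of this partition.
    """
    partial = {}
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda k: math.dist(p, centroids[k]))
        total, count = partial.get(nearest, ([0.0] * len(p), 0))
        partial[nearest] = ([t + x for t, x in zip(total, p)], count + 1)
    return partial


def reduce_partials(partials, centroids):
    """Reduce task: merge the per-partition sums and emit the new centroids."""
    new_centroids = []
    for k, old in enumerate(centroids):
        total, count = [0.0] * len(old), 0
        for partial in partials:
            if k in partial:
                s, n = partial[k]
                total = [t + x for t, x in zip(total, s)]
                count += n
        # A centroid with no assigned points keeps its previous position.
        new_centroids.append(
            tuple(t / count for t in total) if count else tuple(old))
    return new_centroids


def kmeans_iteration(partitions, centroids):
    """One iteration: map over every partition, then reduce once."""
    partials = [map_partition(part, centroids) for part in partitions]
    return reduce_partials(partials, centroids)


partitions = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
print(kmeans_iteration(partitions, [(1.0, 1.0), (9.0, 1.0)]))
# prints: [(0.0, 1.0), (10.0, 1.0)]
```

Splitting the N points evenly over the Mapper tasks is what gives each Mapper
roughly N/80 points, and hence the O(NK/80) map cost quoted in a.).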


             Generate initial cluster files
                          |
                          v
                Generate partition file
                          |
                          v
     Run K-means for the generated input cluster files
                          |
                          v
       Get final centroid of each input cluster file
                          |
                          v
              Calculate Euclidean distance
                          |
                          v
   Sum the Euclidean distance for each point in each file
                          |
                          v
                 Done for 10 files? --- No: repeat for the next file
                          |
                         Yes
                          |
                          v
     Compute the Euclidean distance for the least value
                          |
                          v
    Print the centroid with the least Euclidean distance

