B629 CLOUD COMPUTING
Kaushik Chandrasekaran, Nabeel Ahamed Akheel
The goal of this project is to learn the concepts of the iterative MapReduce
programming model and to try out an implementation of iterative MapReduce:
Twister Kmeans with multiple reducers. In this project we are required to
implement an automated Twister Kmeans program that runs with different
initial centroids and reports the best case.
The main objective of this project is to run the K-Means data generation to
create 10 centroid files initially. Once these centroid files are obtained, we
run K-Means on each of them and obtain the best centroid location and the
least objective value among those 10 runs. Hence a total of 10 x 10 runs would be
1.) Follow the instructions given in the initial document and have
Twister started and running.
2.) Generate the initial 10 centroid files using the following command:
./gen_data.sh [init clusters file] [num of clusters] [vector length] [sub dir] [data file prefix] [number of files to generate] [number of data points]
3.) Create the partition files.
4.) Obtain the final centroid points for each file.
5.) Calculate the Euclidean distance between the data points and the final centroids.
6.) Find the sum of the Euclidean distances over all points for the file.
7.) Repeat step 6 for all the files and find the least summation value.
8.) The minimum value obtained is the least objective function value.
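Steps 5 to 8 can be illustrated with a minimal sketch in plain Python. It is not the Twister code used in the project: the function names and the toy data are hypothetical, and it assumes that the objective value of a run is the sum, over all data points, of the Euclidean distance from each point to its nearest final centroid.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def objective_value(points, centroids):
    """Assumed objective: sum of each point's distance to its nearest centroid."""
    return sum(min(euclidean(p, c) for c in centroids) for p in points)

def best_run(points, runs):
    """Return (index, objective) of the centroid set with the smallest objective."""
    scores = [objective_value(points, centroids) for centroids in runs]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy data (hypothetical): four 2-D points and the final centroids of two runs.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
runs = [
    [(0.0, 1.0), (10.0, 1.0)],   # one centroid in the middle of each pair
    [(5.0, 1.0), (20.0, 20.0)],  # a poor result
]
index, value = best_run(points, runs)
print(index, value)  # prints: 0 4.0
```

In the project itself, `runs` would hold the final centroids produced from each of the 10 initial centroid files, and `points` would be the data points read from the partition files.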
COMPARISON WITH THE EXISTING SYSTEM
The algorithm followed is similar to the existing algorithm. The key differences
are as follows:
1.) The calculation is performed on 10 initial centroid files.
2.) The objective function value for each file has been calculated.
3.) The least objective function value amongst 10 Objective function values are
a.) The sequential complexity per iteration for K centers and N points is O(NK). Since
80 mapper tasks are used in this program, the time complexity per mapper is O(NK/80).
b.) The time complexity of the Reduce task would be 80 * N * (K/80) = O(NK).
c.) A positive speedup is obtained as the value of N increases. Since
Twister is designed for iterative MapReduce, the speedup over the sequential
version would increase with the value of N.
d.) Initially we obtain the objective function value for each of the centroid files we have
(i.e. 10 in our case). For each file, we calculate the sum of the distances between
its centroid points and the 80,000 points in the partition file; the minimum value
obtained is the best solution.
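The way the per-iteration work in item a.) is divided among mapper tasks can be sketched as follows. This is a plain Python illustration, not the Twister API: `map_task`, `reduce_task` and `split` are hypothetical names, and the sketch assumes that each mapper returns per-centroid partial sums and counts, which a reducer then combines into new centroids.

```python
def map_task(partition, centroids):
    """Assign each point of one partition to its nearest centroid.

    Performs len(partition) * K distance evaluations and returns
    per-centroid partial coordinate sums and point counts.
    """
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in partition:
        # Squared distance is enough to find the nearest centroid.
        nearest = min(
            range(k),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
        )
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += p[d]
    return sums, counts

def reduce_task(partials, centroids):
    """Combine the partial results of all map tasks into new centroids."""
    k, dim = len(centroids), len(centroids[0])
    new_centroids = []
    for j in range(k):
        total = sum(counts[j] for _, counts in partials)
        if total == 0:
            # No point was assigned: keep the old centroid unchanged.
            new_centroids.append(tuple(centroids[j]))
            continue
        new_centroids.append(tuple(
            sum(sums[j][d] for sums, _ in partials) / total for d in range(dim)
        ))
    return new_centroids

def split(points, num_mappers):
    """Distribute the points round-robin over num_mappers partitions."""
    return [points[i::num_mappers] for i in range(num_mappers)]

# Toy data (hypothetical): 4 points, 2 centroids, 2 mappers.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(1.0, 1.0), (9.0, 1.0)]
partials = [map_task(part, centroids) for part in split(points, 2)]
new_centroids = reduce_task(partials, centroids)
print(new_centroids)  # prints: [(0.0, 1.0), (10.0, 1.0)]
```

With the 80,000 points and 80 mapper tasks mentioned above, an even split would give each `map_task` call 1,000 points, i.e. 1,000 * K distance evaluations per iteration.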
Flowchart of the program:
Generate initial cluster files
-> Generate partition file
-> Run K-means for the generated input cluster files
-> Get final centroid of each input cluster file
-> Calculate Euclidean distance
-> Sum the Euclidean distance for each point in each file
-> Done for all files? (if No, repeat)
-> Compute the Euclidean distance for the least value
-> Print the centroid with the least Euclidean distance
- CS machine assignment
- Twister 0.9 package, http://salsahpc.indiana.edu/csci-b649-2011/files/Twister-
- Kmeans Wiki, http://en.wikipedia.org/wiki/Kmeans_clustering
- Twister Official website, http://www.iterativemapreduce.org/