# chenpeng_proj2_report

```
Twister Kmeans Project Report
Peng Chen, Yuan Gao

1 Dataflow of an automatic Twister Kmeans Program:

[Figure: dataflow diagram. Stages, from top to bottom:]

  1. Centroid generation: generate 10 sets of centroids.
  2. Data generation (Map() ... Map()): use a "map-only" operation to
     generate the data points concurrently.
  3. Kmeans iteration (Map() ... Map(), then Reduce()), repeated until the
     difference is less than the threshold:
       - Compute the distance from each data point to the centroid of each
         cluster, find the nearest center, and assign the point to that
         cluster center.
       - Compute new centroids by calculating the center of the points in
         each cluster.
       - Compute the difference between the new centroids and the previous
         ones. If it is less than the threshold, break the iteration.
  4. Objective function (Map() ... Map()): compute the objective function.
  5. Choose the best centroids, those with the lowest value of the
     objective function.

2 Comparison Analysis

Below is the original kmeans twister dataflow:

[Figure: dataflow diagram. Stages, from top to bottom:]

  1. Data generation (Map() ... Map()): use a "map-only" operation to
     generate the data points concurrently.
  2. Kmeans iteration (Map() ... Map(), then Reduce()), repeated until the
     difference is less than the threshold:
       - Compute the distance from each data point to the centroid of each
         cluster, find the nearest center, and assign the point to that
         cluster center.
       - Compute new centroids by calculating the center of the points in
         each cluster.
       - Compute the difference between the new centroids and the previous
         ones. If it is less than the threshold, break the iteration.

To implement an automatic Twister Kmeans program that runs with different initial centroids and keeps
the best case, we need to find the minimum value of the objective function across the ten runs. Compared
with the original kmeans, we therefore add a part that calculates the value of the objective function.
Specifically, we use assg[i] to store the nearest (minimum-distance) centroid from the previous iteration,
calculate the sum of the Euclidean distances between the data points and assg[i], and store the result in
an additional column of the centroid array. As a result, the map() we use after generating the data points
differs from the original map(). After the main iteration for kmeans clustering, we add another round of
map() to calculate the final value of the objective function. To make the program automatic, we initially
run a loop to generate 10 sets of centroids.

a. The sequential complexity per iteration is O(NK) for K centers and N points. What is the time
complexity of each Map task?

Assuming we have p mappers, the time complexity of each map task is O(NK/p).

b. What is the time complexity of the Reduce task?

O(K)

c. What speedup would you expect from the Twister version when N is large?

Assuming we have p mappers, the speedup is

s = ( O(NK) + O(K) ) / ( O(NK/p) + O(K) )

  ≈ O(NK) / O(NK/p)       (since N is large, the O(K) reduce term is negligible)

  = p

d. In your best solution with lowest objective function value, could you explain or describe the
reason?

The k-means algorithm we use is a heuristic that converges to a local optimum. Each iteration
decreases the within-cluster sum of squares (WCSS), which is our objective function, until the
stopping threshold is met. However, the final result depends on the location of the initial
centroids, because each set of new centroids is computed from the previous one. So if we run
k-means on the same dataset with 10 different sets of initial centroids, we can expect up to 10
different results. We chose the one with the lowest objective function (WCSS) value, which is
the best among the 10.

```
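The restart-and-select procedure described in the report can be sketched in a few lines of plain Python. This is a single-process illustration only, not the authors' Twister code: the function names (`nearest`, `kmeans`, `wcss`, `best_of_restarts`), the default threshold, and the choice to draw initial centroids from the data points are all assumptions made for the example. The objective here is WCSS as named in answer (d).

```python
import math
import random


def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def kmeans(points, centroids, threshold=1e-6, max_iter=100):
    """Repeat assignment and update until no centroid moves more than threshold."""
    for _ in range(max_iter):
        # Assignment step (the role of the map tasks in the report).
        assg = [nearest(p, centroids) for p in points]
        # Update step: each new centroid is the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assg) if a == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # assumption: an empty cluster keeps its centroid
        # Largest movement of any centroid in this iteration.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < threshold:
            break
    return centroids


def wcss(points, centroids):
    """Within-cluster sum of squares: squared distance of each point to its nearest centroid."""
    return sum(math.dist(p, centroids[nearest(p, centroids)]) ** 2 for p in points)


def best_of_restarts(points, k, restarts=10, seed=0):
    """Run k-means from several random initializations; return (lowest WCSS, its centroids)."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        init = rng.sample(points, k)  # assumption: initial centroids drawn from the data
        centroids = kmeans(points, init)
        score = wcss(points, centroids)
        if best is None or score < best[0]:
            best = (score, centroids)
    return best
```

Calling `best_of_restarts(points, k=2)` on a list of coordinate tuples returns the lowest objective value found and the centroids that produced it. In a distributed version the assignment step and the final objective computation are the parts that parallelize across mappers, which is where the O(NK/p) term comes from.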