Twister Kmeans Project Report
Peng Chen, Yuan Gao


1 Dataflow of an automatic Twister Kmeans Program:

Stages of the flowchart, top to bottom:

  Centroid generation (generate 10 sets of centroids)
  -> Data generation
  -> Map() ... Map()
  -> Data points
  -> Map() ... Map()
  -> Reduce()
     (repeat until less than threshold)
  -> Map() ... Map()

Annotations beside the stages, top to bottom:

  - Use a "map-only" operation to generate data concurrently.
  - Compute the distance from each data point to the centroid of each cluster, find the
    nearest center, and assign the point to that cluster center.
  - Compute new centroids by calculating the center of the points in each cluster.
  - Compute the difference between the new centroids and the previous ones. If it is less
    than the threshold, break the iteration.
  - Compute the objective function.
  - Choose the best centroids, those with the lowest value of the objective function.

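The iterative part of the dataflow above can be sketched in plain Python. This is only an illustration, not the Twister (Java) code of the project: the names map_task, reduce_task and kmeans are hypothetical, the split of work between map and reduce (partial sums in the map tasks, new centroids in the reduce task) is an assumption, and so is measuring the "difference" as the largest distance any centroid moved.

```python
from math import dist

def map_task(points, centroids):
    """Assign each point of one partition to its nearest centroid; return partial sums."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        nearest = min(range(k), key=lambda j: dist(p, centroids[j]))
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += p[d]
    return sums, counts

def reduce_task(partials, old_centroids):
    """Combine the partial sums into new centroids and measure how far they moved."""
    k, dim = len(old_centroids), len(old_centroids[0])
    new_centroids = []
    for j in range(k):
        total = sum(counts[j] for _, counts in partials)
        if total == 0:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(list(old_centroids[j]))
            continue
        new_centroids.append(
            [sum(sums[j][d] for sums, _ in partials) / total for d in range(dim)]
        )
    shift = max(dist(n, o) for n, o in zip(new_centroids, old_centroids))
    return new_centroids, shift

def kmeans(partitions, centroids, threshold=1e-6, max_iter=100):
    """Repeat map and reduce until the centroids move less than the threshold."""
    for _ in range(max_iter):
        # In Twister these map tasks would run concurrently; here they run in a loop.
        partials = [map_task(part, centroids) for part in partitions]
        centroids, shift = reduce_task(partials, centroids)
        if shift < threshold:
            break
    return centroids

# Two partitions of two points each, and one set of initial centroids.
partitions = [[(0.0, 0.0), (0.0, 1.0)], [(4.0, 0.0), (4.0, 1.0)]]
result = kmeans(partitions, [(1.0, 0.0), (3.0, 1.0)])
print(result)
```

Each map task only sees its own partition and the current centroids, so the reduce task needs just K partial sums and counts per mapper to rebuild the new centroids.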
2 Comparison Analysis

Below is the original kmeans twister dataflow.

Stages of the flowchart, top to bottom:

  Data generation
  -> Map() ... Map()
  -> Data points
  -> Map() ... Map()
  -> Reduce()
     (repeat until less than threshold)

Annotations beside the stages, top to bottom:

  - Use a "map-only" operation to generate data concurrently.
  - Compute the distance from each data point to the centroid of each cluster, find the
    nearest center, and assign the point to that cluster center.
  - Compute new centroids by calculating the center of the points in each cluster.
  - Compute the difference between the new centroids and the previous ones. If it is less
    than the threshold, break the iteration.


To implement an automatic Twister Kmeans program that runs with different initial centroids and keeps
the best case, we need to find the minimum value of the objective function across the ten runs. Compared
with the original kmeans, we therefore add a part that calculates the value of the objective function.
Specifically, we use assg[i] to store the nearest (minimum-distance) centroid from the last iteration, then
calculate the sum of the Euclidean distances between the data points and assg[i], and store the result in
an additional column of the centroid array. As a result, the map() we use after generating the data points
is different from the original map(). After the main iteration for kmeans clustering, we add another round
of map() to calculate the final value of the objective function. To run the program automatically, we
start with a loop that generates 10 sets of centroids.
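As a minimal sketch of the objective-function step described above: the function name objective is hypothetical, and assg is assumed here to hold, for each data point i, the index of its nearest centroid. The value is the sum of Euclidean distances, as described in the paragraph; squaring each distance before summing would give the within-cluster sum of squares instead.

```python
from math import dist

def objective(points, centroids):
    """Return (assg, value): nearest-centroid index per point and the summed distance."""
    assg = [
        min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        for p in points
    ]
    # Sum of the Euclidean distances between each point and its assigned centroid.
    value = sum(dist(p, centroids[assg[i]]) for i, p in enumerate(points))
    return assg, value

points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
centroids = [(0.0, 1.0), (4.0, 1.0)]
assg, value = objective(points, centroids)
print(assg, value)
```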
3 Questions and Answers

  a. The sequential complexity per iteration is O(NK) for K centers and N points. What is the time
     complexity of each Map task?

      Answer:

      Assuming we have p mappers, the time complexity of each map task is O(NK/p).



  b. What is the time complexity of the Reduce task?

      Answer:

      O(K)



  c. What speedup would you expect when N is large for the Twister version?

      Answer:

      Assuming we have p mappers, the speedup is

      s = (O(NK) + O(K)) / (O(NK/p) + O(K))

        ≈ O(NK) / O(NK/p)       (since N is large, so NK/p >> K)

        = p
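The limit above can be checked numerically with a simple cost model. This is a sketch under stated assumptions: the function name speedup is hypothetical, every distance computation and every centroid update is given unit cost, and communication overhead is ignored.

```python
def speedup(n, k, p):
    """Speedup of p mappers under a unit-cost model: (NK + K) / (NK/p + K)."""
    sequential = n * k + k      # map work plus reduce work on one machine
    parallel = n * k / p + k    # map work split over p mappers, reduce unchanged
    return sequential / parallel

small = speedup(100, 10, 8)        # N is not large: the reduce cost is still visible
large = speedup(1_000_000, 10, 8)  # N is large: the speedup approaches p = 8
print(small, large)
```

With N = 100 the model gives a speedup of about 7.5, while with N = 1,000,000 it is within 0.01 of p = 8.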



  d. In your best solution with lowest objective function value, could you explain or describe the
     reason?

      Answer:

      The k-means algorithm we are using is a heuristic algorithm that converges to a local optimum.
      Each iteration decreases the within-cluster sum of squares (WCSS), which is our objective
      function, until the threshold is satisfied. However, the final result depends on the location
      of the initial centroids, because the location of the new centroids in each iteration depends
      on the location of the old ones. So if we run the k-means algorithm on the same dataset with 10
      different sets of initial centroids, we can end up in up to 10 different local optima. We chose
      the one with the lowest objective function (WCSS) value, which is the best among the 10.
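The dependence on the initial centroids can be seen on a tiny hand-made example. The sketch below is a plain sequential k-means, not the Twister program; the data set, the two initial centroid sets (instead of the 10 used in the project) and the name kmeans are chosen only for illustration.

```python
from math import dist

def kmeans(points, centroids, max_iter=100):
    """Plain sequential k-means; returns (final centroids, WCSS)."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            clusters[j].append(p)
        # New centroid = mean of the cluster; an empty cluster keeps its old centroid.
        new = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
        if new == centroids:
            break
        centroids = new
    wcss = sum(min(dist(p, c) for c in centroids) ** 2 for p in points)
    return centroids, wcss

# Four corners of a 4 x 1 rectangle, clustered with K = 2.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
inits = [
    [(2.0, 0.0), (2.0, 1.0)],  # splits the points into bottom and top rows
    [(0.0, 0.5), (4.0, 0.5)],  # splits the points into left and right columns
]
results = [kmeans(points, init) for init in inits]
best = min(results, key=lambda r: r[1])
print([w for _, w in results], best)
```

Both runs are already at a fixed point, but the first one has a WCSS of 16.0 and the second a WCSS of 1.0, so keeping the run with the lowest objective value selects the second.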
