K-means Clustering Algorithm with Meliorated Initial Center

Computer Engineering, Vol.33, No.3, February 2007

Article ID: 1000-3428(2007)03-0065-02    Document code: A    CLC number: TP39





                K-means Clustering Algorithm with Meliorated Initial Center
                                                                        YUAN Fang, ZHOU Zhiyong, SONG Xin
                                                            College of Mathematics and Computer, Hebei University, Baoding 071002

Abstract: The traditional k-means algorithm is sensitive to the choice of the initial cluster centers. To solve this problem, a new method for finding the initial centers is proposed. It first computes the density of the region in which each data object lies, then finds k data objects that all belong to high-density regions and are as far from each other as possible, and uses these k data objects as the initial centers. Experiments on standard data sets from the UCI repository show that the proposed method produces high-purity clustering results and eliminates the sensitivity to the initial centers.

Key words: Data mining; Clustering; K-means algorithm; Clustering center


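Received: 2006-04-27    E-mail: yuanfang@mail.hbu.edu.cn
Project numbers: 05213573; 2004406    Author biography: (1965- )

Clustering algorithms in common use include k-means [2], STING [3], CLIQUE [4], and CURE [5] (see also [1], [6]).

1

Let X = {x_i | x_i ∈ R^p, i = 1, 2, …, n} be the set of n data objects, z_1, z_2, …, z_k the k cluster centers, and w_j (j = 1, 2, …, k) the k clusters.

Definition 1. The Euclidean distance between two data objects:

$$d(x_i, x_j) = \sqrt{(x_i - x_j)^T (x_i - x_j)}$$

Definition 2. The center of cluster w_j, where N_j is the number of data objects in w_j:

$$z_j = \frac{1}{N_j} \sum_{x \in w_j} x$$

Definition 3. The clustering criterion function, where the inner sum runs over the N_i data objects x_j of cluster w_i:

$$J = \sum_{i=1}^{k} \sum_{j=1}^{N_i} d(x_j, z_i)$$

The k-means algorithm [1] takes the number of clusters k and n data objects as input and outputs k clusters:
(1) Select k of the n data objects as the initial cluster centers.
(2) Repeat steps (3) and (4) until J no longer changes.
(3) Assign each data object to the cluster whose center (mean) is nearest to it.
(4) Recompute the center (mean) of each cluster.

Step (1) is where the k-means algorithm is sensitive: different choices of the k initial centers can lead to different clusterings of the same data.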
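As an illustration of the quantities defined in this section, here is a minimal Python sketch of the standard k-means iteration, using the Euclidean distance d, the cluster mean z_j, and the criterion J. The function names, the `max_iter` and `tol` safeguards, and the toy data are assumptions made for the example; they are not taken from the paper.

```python
import math

def dist(a, b):
    """Euclidean distance d(x_i, x_j) between two points given as sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def mean(points):
    """Cluster center z_j: component-wise mean of the points in one cluster."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def objective(clusters, centers):
    """Criterion J: sum of distances from every object to its cluster center."""
    return sum(dist(x, z) for w, z in zip(clusters, centers) for x in w)

def kmeans(data, centers, max_iter=100, tol=1e-9):
    """Standard k-means started from the given initial centers.

    max_iter and tol are illustrative safeguards (assumptions).
    """
    centers = [tuple(c) for c in centers]
    prev_j = None
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest center.
        clusters = [[] for _ in centers]
        for x in data:
            nearest = min(range(len(centers)), key=lambda i: dist(x, centers[i]))
            clusters[nearest].append(tuple(x))
        # Recompute each center; an empty cluster keeps its old center.
        centers = [mean(w) if w else z for w, z in zip(clusters, centers)]
        j = objective(clusters, centers)
        # Stop once J no longer changes.
        if prev_j is not None and abs(prev_j - j) <= tol:
            break
        prev_j = j
    return clusters, centers, j

# Toy data (hypothetical): two well-separated groups of three points.
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
clusters, centers, j = kmeans(data, centers=[data[0], data[3]])
print(centers, j)
```

Passing the initial centers in explicitly, rather than drawing them inside `kmeans`, makes it easy to compare different initializations on the same data.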

The density of the region around each data object x_i is computed from a neighborhood parameter Minpts.

Table 1  Results of the traditional k-means algorithm (ten runs with different initial centers) and of the improved k-means algorithm. For each data set, "Centers" lists the initial centers as object indices.

Run       Iris                 Balance-scale          New-thyroid            Haberman           Wine
          Centers    Accuracy  Centers      Accuracy  Centers      Accuracy  Centers  Accuracy  Centers     Accuracy
1         18,9,101   82.00%    466,561,599  49.92%    185,206,59   62.79%    198,291  51.96%    47,105,3    56.74%
2         35,6,21    88.67%    384,49,465   46.88%    110,25,122   78.14%    74,120   51.96%    25,117,141  70.22%
3         12,11,66   53.33%    616,496,589  50.88%    84,73,183    84.19%    19,182   50.00%    104,55,19   57.30%
4         143,36,85  57.33%    362,18,139   46.88%    204,202,91   65.12%    240,222  51.96%    35,119,34   57.30%
5         11,56,53   84.67%    22,115,401   50.72%    198,194,171  62.79%    161,216  50.98%    15,89,131   70.22%
6         16,2,29    57.33%    247,121,529  44.96%    94,200,46    86.05%    158,202  51.96%    133,35,101  70.22%
7         47,5,42    52.00%    472,111,1    48.48%    1,102,202    69.77%    78,246   51.63%    116,153,86  57.30%
8         14,10,127  82.00%    316,128,345  50.40%    21,207,152   62.79%    108,145  75.82%    113,18,96   57.87%
9         12,49,134  76.77%    281,510,362  47.04%    55,116,207   62.79%    18,176   50.00%    26,112,169  70.22%
10        13,10,96   66.67%    501,127,10   46.72%    160,6,39     84.19%    162,190  52.29%    31,154,59   57.30%
Average              70.08%                 48.29%                 71.86%             53.86%                62.47%
Improved  17,61,126  88.67%    1,475,601    51.20%    26,202,154   84.19%    146,63   75.82%    61,19,3     57.30%
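As a quick arithmetic check on Table 1, the short Python sketch below averages the ten Iris results of the traditional k-means runs and compares the mean with the 88.67% obtained from the initial centers 17,61,126. The variable names are chosen for this example only.

```python
# Iris results of the ten traditional k-means runs, copied from Table 1 (percent).
iris_runs = [82.00, 88.67, 53.33, 57.33, 84.67, 57.33, 52.00, 82.00, 76.77, 66.67]

# Result obtained on Iris with the initial centers 17,61,126 (last row of Table 1).
iris_improved = 88.67

average = sum(iris_runs) / len(iris_runs)
spread = max(iris_runs) - min(iris_runs)

print(f"average of the ten runs: {average:.2f}%")            # 70.08%
print(f"spread between best and worst run: {spread:.2f} points")
print(f"improved minus average: {iris_improved - average:+.2f} points")
```

The mean reproduces the 70.08% given in the table, and the spread of more than 36 points between the best and worst run shows how strongly the result depends on the initial centers.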
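3.3

For the 3 data sets Iris, New-thyroid, and Haberman, Table 1 shows that the improved k-means algorithm reaches 88.67%, 84.19%, and 75.82%, against averages of 70.08%, 71.86%, and 53.86% for the traditional k-means algorithm. On Balance-scale the improved algorithm reaches 51.20% against an average of 48.29%, and on Wine 57.30% against an average of 62.47%.

The Wine data set has 178 data objects, 3 classes, and 14 attributes. The attributes Alcohol (attribute 1), Flavanoids (attribute 8), and Color_intensity (attribute 11) take values in the ranges 11.03~14.83, 0.34~5.08, and 1.28~13 respectively, whereas Proline (attribute 13) takes values in the range 278~1680. With an unscaled Euclidean distance, the distance between two Wine objects is therefore dominated by Proline.

Let D be the set of data objects that lie in high-density regions. The data object in the highest-density region is taken as the 1st center z_1, and the data object of D farthest from z_1 as the 2nd center z_2. The 3rd center z_3 is the data object x_i ∈ D that attains

$$\max\bigl(\min(d(x_i, z_1),\, d(x_i, z_2))\bigr), \quad i = 1, 2, \dots, n$$

and in general the m-th center z_m is the data object x_i ∈ D that attains

$$\max\bigl(\min(d(x_i, z_1),\, d(x_i, z_2),\, \dots,\, d(x_i, z_{m-1}))\bigr), \quad i = 1, 2, \dots, n$$

This is repeated until k data objects have been selected.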
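The selection of initial centers from the high-density set D by the max-min distance rule can be sketched in Python as follows. How the density is measured is an assumption of this sketch: the density of x_i is taken to be the radius of the smallest ball around x_i that contains `minpts` other objects (a smaller radius means a denser region), and D is taken to be the objects whose radius does not exceed the average radius. The function names and the toy data are likewise hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def initial_centers(data, k, minpts):
    """Return the indices of k mutually distant objects taken from D.

    Assumptions of this sketch (not confirmed by the text):
    density is the distance to the minpts-th nearest other object,
    and D holds the objects whose radius is at most the average radius.
    """
    n = len(data)
    d = [[dist(data[i], data[j]) for j in range(n)] for i in range(n)]
    # sorted(row)[0] is the object itself (distance 0), so index minpts
    # is the distance to the minpts-th nearest other object.
    radius = [sorted(row)[minpts] for row in d]
    avg = sum(radius) / n
    D = [i for i in range(n) if radius[i] <= avg]
    if len(D) < k:
        raise ValueError("fewer than k high-density objects")
    # z_1: the object in the densest region.
    chosen = [min(D, key=lambda i: radius[i])]
    # z_2 ... z_k: maximize the minimum distance to the centers chosen so far.
    while len(chosen) < k:
        candidates = [i for i in D if i not in chosen]
        nxt = max(candidates, key=lambda i: min(d[i][c] for c in chosen))
        chosen.append(nxt)
    return chosen

# Toy data (hypothetical): two tight groups and one isolated outlier.
data = [(1.0, 1.0), (1.0, 1.2), (1.2, 1.0),
        (5.0, 5.0), (5.0, 5.2), (5.2, 5.0),
        (20.0, 20.0)]
chosen = initial_centers(data, k=2, minpts=2)
print([data[i] for i in chosen])
```

On this toy data the outlier is the point farthest from everything else, yet it is never selected because it lies outside D. Restricting the max-min rule to high-density objects is what keeps isolated points from becoming initial centers.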
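2

The improved k-means algorithm takes the number of clusters k and n data objects as input and outputs k clusters:
(1) Compute the distance d(x_i, x_j) between every pair of data objects.
(2) Compute the density of the region around each data object and put the data objects that lie in high-density regions into the set D.
(3) Take the data object in the highest-density region as the 1st center z_1.
(4) Take the high-density data object farthest from z_1 as the 2nd center z_2, with z_2 ∈ D.
(5) Take as z_3 the data object x_i that satisfies max(min(d(x_i, z_1), d(x_i, z_2))), i = 1, 2, …, n, with z_3 ∈ D.
(6) Take as z_4 the data object x_i that satisfies max(min(d(x_i, z_1), d(x_i, z_2), d(x_i, z_3))), i = 1, 2, …, n, with z_4 ∈ D.
…
(7) Take as z_k the data object x_i that satisfies max(min(d(x_i, z_j))), i = 1, 2, …, n, j = 1, 2, …, k-1, with z_k ∈ D.
(8) Run the k-means algorithm starting from these k initial centers.

3

3.1

The experiments use 5 data sets from the UCI repository: Iris, Balance-scale, New-thyroid, Haberman, and Wine.

3.2

The results of the traditional k-means algorithm and of the improved k-means algorithm are listed in Table 1.

4

References

1 [M]. 2002: 138-139.
2 MacQueen J. Some Methods for Classification and Analysis of Multivariate Observations[C]//Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967.
3 Wang Wei, Yang Jiong, Muntz R. STING: A Statistical Information Grid Approach to Spatial Data Mining[C]//Proc. of the 23rd International Conference on Very Large Data Bases, 1997.
4 Agrawal R, Gehrke J, Gunopulos D. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications[C]//Proc. of ACM SIGMOD Int. Conf. on Management of Data, Seattle, WA, 1998: 94-105.
5 Guha S, Rastogi R, Shim K. CURE: An Efficient Clustering Algorithm for Large Databases[C]//Proc. of ACM SIGMOD Int. Conf. on Management of Data, Seattle, Washington, 1998: 73-84.
6 [J]. 2003, 19(1).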
