Accelerating Fuzzy Clustering


                                       Christian Borgelt

                    Intelligent Data Analysis and Graphical Models Research Unit
                                 European Center for Soft Computing
                         c/ Gonzalo Gutierrez Quiros s/n, 33600 Mieres, Spain

                             christian.borgelt@softcomputing.es
                                  http://www.borgelt.net/




                                                  Overview

           • Brief Review of Neural Network Training
                    ◦   Standard Error Backpropagation and Momentum Term
                    ◦   (Super) Self-adaptive Error Backpropagation
                    ◦   Resilient Error Backpropagation
                    ◦   Quickpropagation

           • Brief Review of Fuzzy Clustering
                    ◦ Basic Idea and Objective Function
                    ◦ Alternating Optimization
                    ◦ Fuzzy C-Means and Gustafson–Kessel Algorithm

           • Transfer of NN Techniques to Fuzzy Clustering
           • Comparing Clustering Results
           • Experimental Results
           • Summary

                         Review: Neural Network Training

        General approach: gradient descent on the error function.
           • The error is a function of the network weights.
           • Approach a minimum by small weight changes in the direction opposite to the gradient.

        [Figure: surface plot of z = f(x, y) with the gradient at the point (x0, y0).]

        Illustration of the gradient of a real-valued function z = f(x, y) at a point (x0, y0).
        It is

            \nabla z\big|_{(x_0,y_0)} = \left( \frac{\partial z}{\partial x}\Big|_{x_0},\; \frac{\partial z}{\partial y}\Big|_{y_0} \right).

                    Neural Network Gradient Descent: Variants

        Weight update rule:

                                  w(t + 1) = w(t) + ∆w(t)

        Standard backpropagation:

            \Delta w(t) = -\eta \, \nabla_{\!w} e(t)

        Manhattan training:

            \Delta w(t) = -\eta \, \operatorname{sgn}\!\left(\nabla_{\!w} e(t)\right)

        Momentum term:

            \Delta w(t) = -\eta \, \nabla_{\!w} e(t) + \beta \, \Delta w(t-1)
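
        As a quick illustration, here is a minimal NumPy sketch of these three update
        rules; the function and parameter names are illustrative, not part of the
        original material.

```python
import numpy as np

def standard_step(w, grad, eta=0.1):
    """Standard backpropagation: step against the error gradient."""
    return w - eta * grad

def manhattan_step(w, grad, eta=0.01):
    """Manhattan training: only the sign of the gradient is used."""
    return w - eta * np.sign(grad)

def momentum_step(w, grad, prev_dw, eta=0.1, beta=0.9):
    """Momentum term: add a fraction of the previous weight change."""
    dw = -eta * grad + beta * prev_dw
    return w + dw, dw   # return the change so it can be reused in the next step
```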



                    Neural Network Gradient Descent: Variants


        (Super) Self-adaptive error backpropagation:

            \eta_w(t) = \begin{cases}
                \gamma^{-} \cdot \eta_w(t-1), & \text{if } \nabla_{\!w} e(t) \cdot \nabla_{\!w} e(t-1) < 0, \\
                \gamma^{+} \cdot \eta_w(t-1), & \text{if } \nabla_{\!w} e(t) \cdot \nabla_{\!w} e(t-1) > 0 \\
                                              & \;\wedge\; \nabla_{\!w} e(t-1) \cdot \nabla_{\!w} e(t-2) \ge 0, \\
                \eta_w(t-1),                  & \text{otherwise.}
            \end{cases}

        Resilient error backpropagation:

            \Delta w(t) = \begin{cases}
                \gamma^{-} \cdot \Delta w(t-1), & \text{if } \nabla_{\!w} e(t) \cdot \nabla_{\!w} e(t-1) < 0, \\
                \gamma^{+} \cdot \Delta w(t-1), & \text{if } \nabla_{\!w} e(t) \cdot \nabla_{\!w} e(t-1) > 0 \\
                                                & \;\wedge\; \nabla_{\!w} e(t-1) \cdot \nabla_{\!w} e(t-2) \ge 0, \\
                \Delta w(t-1),                  & \text{otherwise.}
            \end{cases}

        Typical values: γ⁻ ∈ [0.5, 0.7] and γ⁺ ∈ [1.05, 1.2].
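
        A minimal NumPy sketch of both sign-based schemes with per-weight step sizes;
        it follows the case distinctions above literally, and all names are
        illustrative (classic resilient backpropagation additionally takes the step
        direction from the current gradient sign, which is omitted here).

```python
import numpy as np

def adapt_factor(grad, prev_grad, prev_prev_grad, gamma_minus=0.7, gamma_plus=1.2):
    """Per-weight scaling factor from the signs of successive gradients."""
    prod = grad * prev_grad
    grow = (prod > 0) & (prev_grad * prev_prev_grad >= 0)
    return np.where(prod < 0, gamma_minus, np.where(grow, gamma_plus, 1.0))

def self_adaptive_step(w, eta, grad, prev_grad, prev_prev_grad):
    """(Super) self-adaptive backpropagation: adapt per-weight learning rates."""
    eta = adapt_factor(grad, prev_grad, prev_prev_grad) * eta
    return w - eta * grad, eta

def resilient_step(w, dw, grad, prev_grad, prev_prev_grad):
    """Resilient backpropagation: adapt the previous weight changes directly."""
    dw = adapt_factor(grad, prev_grad, prev_prev_grad) * dw
    return w + dw, dw
```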


                    Neural Network Gradient Descent: Variants

        Quickpropagation

        The error function is locally approximated by a parabola.
        The weight update “jumps” to the apex of the parabola.

        [Figure: parabolic approximation of the error function e around w(t−1) and w(t),
        with its apex m determining w(t+1), and the corresponding gradients
        ∇w e(t−1) and ∇w e(t).]

        The weight update rule can be derived from the triangles:

            \Delta w(t) = \frac{\nabla_{\!w} e(t)}{\nabla_{\!w} e(t-1) - \nabla_{\!w} e(t)} \cdot \Delta w(t-1).
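
        A sketch of the resulting update for a single weight; the safeguard against a
        vanishing denominator and the growth limit are common practice, not taken
        from this slide.

```python
def quickprop_step(w, dw_prev, grad, prev_grad, max_growth=1.75, eps=1e-12):
    """Quickpropagation: jump towards the apex of the local parabola."""
    denom = prev_grad - grad
    if abs(denom) < eps:                  # degenerate parabola: skip the jump
        return w, 0.0
    dw = grad / denom * dw_prev
    limit = max_growth * abs(dw_prev)     # keep the jump within a sane range
    dw = max(-limit, min(limit, dw))
    return w + dw, dw
```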

                           Review: Standard Fuzzy Clustering

           • Allow degrees of membership of a datum to different clusters.
             (Classical c-means clustering assigns data crisply.)
           • Objective Function:                  (to be minimized)

                 J(X, C, U) = \sum_{i=1}^{c} \sum_{j=1}^{n} h(u_{ij}) \, d^2(c_i, x_j)
           • U = [uij ] is the c × n fuzzy partition matrix,
               uij ∈ [0, 1] is the membership degree of the data point xj to the i-th cluster.
           • C = {c1, . . . , cc} is the set of cluster prototypes.
           • Usually h(uij) = uij^α is chosen, where α is the so-called “fuzzifier”
             (the higher α, the softer the cluster boundaries).
           • Constraints:

                 \forall i \in \{1,\dots,c\}: \; \sum_{j=1}^{n} u_{ij} > 0
                 \qquad\text{and}\qquad
                 \forall j \in \{1,\dots,n\}: \; \sum_{i=1}^{c} u_{ij} = 1
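
        A minimal NumPy sketch of evaluating this objective for Euclidean distances
        and h(u) = u^α (the array shapes and names are my own choices, not from the
        slides).

```python
import numpy as np

def fcm_objective(X, C, U, alpha=2.0):
    """J(X, C, U) = sum_i sum_j u_ij^alpha * d^2(c_i, x_j)
    with X: n x m data, C: c x m centers, U: c x n partition matrix."""
    d2 = ((C[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # c x n squared distances
    return float((U ** alpha * d2).sum())
```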

                              Review: Alternating Optimization

           • Problem: The objective function J cannot be minimized directly.
           • Therefore: Alternating Optimization
                    ◦ Optimize membership degrees for fixed cluster parameters.
                    ◦ Optimize cluster parameters for fixed membership degrees.
                      (Update formulae are derived by differentiating the objective function J)
                    ◦ Iterate until convergence (checked, e.g., by change of cluster center).
           • Update Rules:              (for Euclidean distance and only centers, i.e. ci = (µi))

                 \forall i;\, 1 \le i \le c: \; \forall j;\, 1 \le j \le n: \quad
                     u_{ij} = \frac{d_{ij}^{\frac{2}{1-\alpha}}}{\sum_{k=1}^{c} d_{kj}^{\frac{2}{1-\alpha}}}

                 \forall i;\, 1 \le i \le c: \quad
                     \mu_i = \frac{\sum_{j=1}^{n} u_{ij}^{\alpha}\, x_j}{\sum_{j=1}^{n} u_{ij}^{\alpha}}
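
        A sketch of one alternating-optimization step implementing these two update
        rules (same conventions as in the objective sketch above; eps only guards
        against division by zero when a data point coincides with a center).

```python
import numpy as np

def fcm_step(X, C, alpha=2.0, eps=1e-12):
    """One AO step of fuzzy c-means: memberships for fixed centers,
    then centers for the new memberships."""
    d2 = ((C[:, None, :] - X[None, :, :]) ** 2).sum(axis=2) + eps   # c x n
    U = d2 ** (1.0 / (1.0 - alpha))       # d^(2/(1-alpha)), computed from d^2
    U /= U.sum(axis=0, keepdims=True)     # normalize so that sum_i u_ij = 1
    W = U ** alpha
    C_new = (W @ X) / W.sum(axis=1, keepdims=True)
    return U, C_new
```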

                    Review: Gustafson–Kessel Fuzzy Clustering

           • Introduce a covariance matrix to describe the cluster shape.

           • Objective Function:           (to be minimized)

                 J(X, C, U) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{\alpha}\, (x_j - \mu_i)^{\top} \Sigma_i^{-1} (x_j - \mu_i)

           • Update Rule for the covariance matrix of cluster i:

                 \Sigma_i = S_i\, |S_i|^{-\frac{1}{m}}
                 \qquad\text{where}\qquad
                 S_i = \sum_{j=1}^{n} u_{ij}^{\alpha}\, (x_j - \mu_i)(x_j - \mu_i)^{\top}

             (m is the dimension of the data space; the factor |S_i|^{-1/m} normalizes the determinant to 1.)

           • Axes-parallel version of Gustafson–Kessel Fuzzy Clustering:
             Restrict the covariance matrix to a diagonal matrix:

                 \Sigma_i = \operatorname{diag}(\sigma_{i1}^2, \dots, \sigma_{im}^2).
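
        A sketch of this covariance update, per cluster and normalized to determinant 1;
        the diagonal flag implements the axes-parallel variant (all names are
        illustrative).

```python
import numpy as np

def gk_covariances(X, M, U, alpha=2.0, diagonal=False):
    """Fuzzy scatter matrix per cluster, scaled to determinant 1.
    X: n x m data, M: c x m centers, U: c x n memberships."""
    m = X.shape[1]
    W = U ** alpha
    covs = []
    for i in range(M.shape[0]):
        diff = X - M[i]                                  # n x m
        S = (W[i, :, None] * diff).T @ diff              # m x m fuzzy scatter matrix
        if diagonal:                                     # axes-parallel restriction
            S = np.diag(np.diag(S))
        covs.append(S * np.linalg.det(S) ** (-1.0 / m))  # det(Sigma_i) = 1
    return np.array(covs)
```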


                            Transfer to Fuzzy Clustering

           • Compute one update step of fuzzy clustering,
             i.e., compute new membership degrees and new centers.

           • Compute change of centers (cluster parameters),
             i.e., difference of coordinates to preceding step.

           • Consider this difference as a gradient and
             apply the improvements from neural network training.


           • Note: Even the analog of standard backpropagation yields a modification:
             the introduction of a learning rate η ≥ 1.
             (This approach is generally known as over-relaxation.)

           • Expectation: This transfer of neural network methods
             leads to a speed-up of fuzzy clustering.
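
        A minimal sketch of the resulting scheme (the loop structure and the
        step_fn/modify hooks are my own framing of the idea; step_fn could be
        the fcm_step sketch shown earlier).

```python
def accelerated_clustering(X, C0, step_fn, modify, steps=100):
    """Alternating optimization in which every parameter update is
    post-processed by a 'modify' function (over-relaxation, momentum, ...)."""
    C, prev_step = C0.copy(), 0.0
    for _ in range(steps):
        U, C_new = step_fn(X, C)           # one standard AO step
        delta = C_new - C                  # parameter change, treated like a gradient step
        step = modify(delta, prev_step)    # e.g. lambda d, p: 1.5 * d  (over-relaxation)
        C, prev_step = C + step, step
    return C, U
```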


                    Transfer to Fuzzy Clustering: Variants

        General parameter update rule:

                                     θ(t + 1) = θ(t) + ∆θ(t)

        Update step expansion:

                              ∆θ(t) = ηδθ(t)         with η ∈ [1.05, 2],

        where δθ(t) is the change of the cluster parameter θ
        as it is computed with the standard update rule in step t.

        Momentum term:

                        ∆θ(t) = δθ(t) + β∆θ(t − 1)               with β ∈ [0, 1).

        ∆θ(t) is clamped to [δθ(t), η_max δθ(t)] with η_max = 1.8 for robustness.
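
        A sketch of the momentum variant with this clamping, applied elementwise to
        an array of cluster parameters (handling the bounds for negative components
        via min/max is my interpretation, not spelled out on the slide).

```python
import numpy as np

def momentum_update(delta, prev_step, beta=0.5, eta_max=1.8):
    """delta: standard change of the parameters, prev_step: previously applied change.
    The result is kept between delta and eta_max * delta, per component."""
    step = delta + beta * prev_step
    lo = np.minimum(delta, eta_max * delta)   # bounds swap for negative components
    hi = np.maximum(delta, eta_max * delta)
    return np.clip(step, lo, hi)
```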

                    Transfer to Fuzzy Clustering: Variants

        Self-adaptive step expansion:

            \eta_\theta(t) = \begin{cases}
                \gamma^{-} \cdot \eta_\theta(t-1), & \text{if } \delta\theta(t) \cdot \delta\theta(t-1) < 0, \\
                \gamma^{+} \cdot \eta_\theta(t-1), & \text{if } \delta\theta(t) \cdot \delta\theta(t-1) > 0, \\
                \eta_\theta(t-1),                  & \text{otherwise.}
            \end{cases}

        Resilient update:

            \Delta\theta(t) = \begin{cases}
                \gamma^{-} \cdot \Delta\theta(t-1), & \text{if } \delta\theta(t) \cdot \delta\theta(t-1) < 0, \\
                \gamma^{+} \cdot \Delta\theta(t-1), & \text{if } \delta\theta(t) \cdot \delta\theta(t-1) > 0, \\
                \Delta\theta(t-1),                  & \text{otherwise.}
            \end{cases}

        Quickpropagation analog:

            \Delta\theta(t) = \frac{\delta\theta(t)}{\delta\theta(t-1) - \delta\theta(t)} \cdot \Delta\theta(t-1).

        In my experiments I used γ⁻ = 0.7 and γ⁺ = 1.2 and clamping.
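
        A sketch of the sign-based analogs for an array of cluster parameters; note
        that, unlike the neural-network versions, only the two most recent changes
        δθ(t) and δθ(t−1) are compared, as in the formulas above.

```python
import numpy as np

def scale_by_sign(previous, delta, prev_delta, gamma_minus=0.7, gamma_plus=1.2):
    """Scale a per-parameter quantity -- the step width eta_theta(t-1) for the
    self-adaptive variant, or the change Delta_theta(t-1) for the resilient
    variant -- by the sign agreement of successive standard changes."""
    prod = delta * prev_delta
    factor = np.where(prod < 0, gamma_minus, np.where(prod > 0, gamma_plus, 1.0))
    return factor * previous
```

        For the self-adaptive variant the expanded step is then η_θ(t) · δθ(t), in
        analogy to the fixed step expansion; the resilient variant applies the scaled
        ∆θ(t) directly. The clamping mentioned above can be added on top.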

                                Updating Covariance Matrices

           • Center coordinates can be updated independently and arbitrarily.

           • (Co)variances, however, have a bounded range of values
             and depend on each other (e.g. s_xy² ≤ s_x² s_y²).

           • (Co)variances are updated before normalization to determinant 1.

           • Variances:             (axes-parallel Gustafson–Kessel clustering)
                    ◦ Are treated independently of each other.
                    ◦ Check for a positive value (a variance must be > 0);
                      otherwise do a standard update step.

           • Covariances:              (general Gustafson–Kessel clustering)
                    ◦ Check for a positive definite matrix with a Cholesky decomposition.
                    ◦ If the updated matrix is not positive definite,
                      do a standard update step for the matrix as a whole.
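
        A sketch of such a check (np.linalg.cholesky raises an error for matrices
        that are not positive definite); the candidate/fallback framing is mine.

```python
import numpy as np

def accept_if_positive_definite(candidate, fallback):
    """Return the accelerated covariance update if it is positive definite,
    otherwise fall back to the matrix from the standard update step."""
    try:
        np.linalg.cholesky(candidate)
        return candidate
    except np.linalg.LinAlgError:
        return fallback
```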

                                    Convergence Evaluation

           • General idea: Use relative cluster evaluation measures.

           • Simplest approach:

                 Q_{\mathrm{diff}}(U^{(1)}, U^{(2)}) = \min_{\pi \in \Pi(c)}
                     \frac{1}{cn} \sum_{i=1}^{c} \sum_{j=1}^{n}
                     \left( u_{ij}^{(1)} - u_{\pi(i)j}^{(2)} \right)^{2}.

             U^{(k)} = (u_{ij}^{(k)})_{1 \le i \le c,\, 1 \le j \le n} for k = 1, 2 are the two partition matrices to compare,
             n is the number of data points, c the number of clusters, and Π(c) is the set
             of all permutations of the numbers 1 to c.  (A small computation sketch is given after this list.)

           • Other possibilities:
               ◦ (cross-classification) accuracy         ◦   F1-measure
               ◦ Rand statistic / Rand index            ◦   Jaccard coefficient / Jaccard index
               ◦ Fowlkes–Mallows index                  ◦   Hubert index / Hubert-Arabie index
               The last four measures are based on evaluating coincidence matrices.
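
        A sketch of computing Q_diff for two c × n partition matrices; the explicit
        minimization over all c! permutations is only feasible for small numbers of
        clusters.

```python
import numpy as np
from itertools import permutations

def q_diff(U1, U2):
    """Mean squared membership difference, minimized over all
    permutations of the cluster labels of the second partition."""
    c = U1.shape[0]
    return min(float(np.mean((U1 - U2[list(p), :]) ** 2))
               for p in permutations(range(c)))
```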

                     Experimental Results: Clustering Trials

        [Figure: log(difference) versus iteration for the individual trials and their
        average; left column: fuzzy c-means, right column: general Gustafson–Kessel;
        top row: iris data, 3 clusters; bottom row: wsel data, 6 clusters.]

                        Experimental Results: iris, 3 clusters, FCM

        [Figure: log(difference) versus iteration.]
        Clustering the iris data with the fuzzy c-means algorithm and update
        modifications; top left: step expansion (factors 1.1/1.3/1.5 and 1.6),
        top right: momentum term (0.1/0.15/0.4 and 0.5), bottom left: other methods
        (adaptive, resilient, quick); each compared with the unmodified update ("none").


                        Experimental Results: iris, 3 clusters, GK

        [Figure: log(difference) versus iteration.]
        Clustering the iris data with the Gustafson–Kessel algorithm and update
        modifications for all parameters; top left: step expansion (factors
        1.2/1.4/1.9 and 2.0), top right: momentum term (0.2/0.3/0.8 and 0.9),
        bottom left: other methods (adaptive, resilient, quick); each compared
        with the unmodified update ("none").


                        Experimental Results: iris, 3 clusters, GK

        [Figure: log(difference) versus iteration.]
        Clustering the iris data with the Gustafson–Kessel algorithm and update
        modifications applied only for cluster centers; top left: step expansion
        (factors 1.2/1.4/1.9 and 2.0), top right: momentum term (0.2/0.3/0.5 and 0.6),
        bottom left: other methods (adaptive, resilient, quick); each compared
        with the unmodified update ("none").


                    Experimental Results: wine, 6 clusters, ap. GK

        [Figure: log(difference) versus iteration.]
        Clustering the wine data with the axes-parallel Gustafson–Kessel algorithm
        and 6 clusters; top left: step expansion (factors 1.2/1.5/1.9 and 2.0),
        top right: momentum term (0.15/0.3/0.8 and 0.9), bottom left: other methods
        (adaptive, resilient, quick); each compared with the unmodified update ("none").


                        Experimental Results: wine, 6 clusters, GK

        [Figure: log(difference) versus iteration.]
        Clustering the wine data with the general Gustafson–Kessel algorithm
        and 6 clusters; top left: step expansion (factors 1.2/1.5/1.9 and 2.0),
        top right: momentum term (0.2/0.3/0.8 and 0.9), bottom left: other methods
        (adaptive, resilient, quick); each compared with the unmodified update ("none").


                Experimental Results: abalone, 3 clusters, FCM

        [Figure: log(difference) versus iteration.]
        Clustering the abalone data with the fuzzy c-means algorithm and 3 clusters;
        top left: step expansion (factors 1.2/1.4/1.7 and 1.8), top right: momentum
        term (0.1/0.2/0.4 and 0.5), bottom left: other methods (adaptive, resilient,
        quick); each compared with the unmodified update ("none").


                    Experimental Results: abalone, 3 clusters, GK

        [Figure: log(difference) versus iteration.]
        Clustering the abalone data with the general Gustafson–Kessel algorithm and
        3 clusters; top left: step expansion (factors 1.2/1.5/1.9 and 2.0), top right:
        momentum term (0.15/0.3/0.5 and 0.6), bottom left: other methods (adaptive,
        resilient, quick); each compared with the unmodified update ("none").


                                           Summary


           • Fuzzy clustering and neural network training are both iterative procedures.
           • One step of alternating optimization can be seen as providing a gradient step.
           • Thus all variants of neural network gradient descent become applicable.
           • Some of these variants lead to a considerable speed-up.
           • A transfer to estimating a mixture of Gaussians is also possible.


        An implementation of these techniques can be retrieved free of charge at
               http://www.borgelt.net/cluster.html

        The full set of diagrams for the experiments is available at:
               http://www.borgelt.net/papers/nndvl.pdf   (color)
               http://www.borgelt.net/papers/nndvl g.pdf (greyscale)
