Christian Borgelt

Intelligent Data Analysis and Graphical Models Research Unit
European Center for Soft Computing
c/ Gonzalo Gutierrez Quiros s/n, 33600 Mieres, Spain

christian.borgelt@softcomputing.es
http://www.borgelt.net/

Overview

• Brief Review of Neural Network Training
◦   Standard Error Backpropagation and Momentum Term
◦   Resilient Error Backpropagation
◦   Quickpropagation

• Brief Review of Fuzzy Clustering
◦ Basic Idea and Objective Function
◦ Alternating Optimization
◦ Fuzzy C-Means and Gustafson–Kessel Algorithm

• Transfer of NN Techniques to Fuzzy Clustering
• Comparing Clustering Results
• Experimental Results
• Summary

Review: Neural Network Training

General approach: gradient descent on the error function.
• The error is a function of the network weights.
• Approach minimum by small weight changes opposite to the gradient.

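[Figure: illustration of the gradient of a real-valued function $z = f(x, y)$ at a point $(x_0, y_0)$, drawn with its two components along the x- and y-axes.]

The gradient is
\[ \nabla z|_{(x_0,y_0)} = \left( \frac{\partial z}{\partial x}\Big|_{x_0},\; \frac{\partial z}{\partial y}\Big|_{y_0} \right). \]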

Weight update rule:

\[ w(t+1) = w(t) + \Delta w(t) \]

Standard backpropagation:

\[ \Delta w(t) = -\eta \nabla_w e(t) \]

Manhattan training:

\[ \Delta w(t) = -\eta \,\mathrm{sgn}(\nabla_w e(t)) \]

Momentum term:

\[ \Delta w(t) = -\eta \nabla_w e(t) + \beta \Delta w(t-1) \]
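The three update rules above can be compared on a toy problem. The following is a minimal sketch, assuming a one-dimensional error function e(w) = (w - 3)^2; the function names, the error function, and the parameter values are illustrative assumptions, not part of the original slides.

```python
def sgn(x):
    """Sign function: -1, 0 or +1."""
    return (x > 0) - (x < 0)

def grad(w):
    """Gradient of the assumed toy error function e(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def train(rule, w=0.0, eta=0.1, beta=0.5, steps=50):
    """Iterate w(t+1) = w(t) + dw(t) with the chosen update rule."""
    dw_prev = 0.0
    for _ in range(steps):
        g = grad(w)
        if rule == "standard":
            dw = -eta * g                   # step proportional to the gradient
        elif rule == "manhattan":
            dw = -eta * sgn(g)              # fixed step width, only the sign is used
        elif rule == "momentum":
            dw = -eta * g + beta * dw_prev  # add a fraction of the previous step
        else:
            raise ValueError("unknown rule: " + rule)
        w += dw
        dw_prev = dw
    return w
```

With these settings the standard and momentum rules approach the minimum at w = 3 closely, while Manhattan training ends up oscillating around it with the fixed step width eta.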

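Self-adaptive error backpropagation:

\[ \eta_w(t) = \begin{cases}
\gamma^- \cdot \eta_w(t-1), & \text{if } \nabla_w e(t) \cdot \nabla_w e(t-1) < 0, \\
\gamma^+ \cdot \eta_w(t-1), & \text{if } \nabla_w e(t) \cdot \nabla_w e(t-1) > 0 \;\wedge\; \nabla_w e(t-1) \cdot \nabla_w e(t-2) \ge 0, \\
\eta_w(t-1), & \text{otherwise.}
\end{cases} \]

Resilient error backpropagation:

\[ \Delta w(t) = \begin{cases}
\gamma^- \cdot \Delta w(t-1), & \text{if } \nabla_w e(t) \cdot \nabla_w e(t-1) < 0, \\
\gamma^+ \cdot \Delta w(t-1), & \text{if } \nabla_w e(t) \cdot \nabla_w e(t-1) > 0 \;\wedge\; \nabla_w e(t-1) \cdot \nabla_w e(t-2) \ge 0, \\
\Delta w(t-1), & \text{otherwise.}
\end{cases} \]

Typical values: $\gamma^- \in [0.5, 0.7]$ and $\gamma^+ \in [1.05, 1.2]$.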
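Both rules share the same case distinction on the signs of successive gradients. The sketch below factors it into one helper and applies it to a per-weight learning rate and to a step width. It assumes the toy error function e(w) = (w - 3)^2, and for the resilient variant it assumes that the step direction is taken from the sign of the current gradient (as in the usual formulation of Rprop), which the formula above does not state explicitly. All names and starting values are illustrative.

```python
def sgn(x):
    return (x > 0) - (x < 0)

def grad(w):
    """Gradient of the assumed toy error function e(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def adapt_factor(g_t, g_t1, g_t2, gamma_minus=0.5, gamma_plus=1.2):
    """Factor for the previous learning rate or step width."""
    if g_t * g_t1 < 0:
        return gamma_minus   # sign change: the minimum was jumped over
    if g_t * g_t1 > 0 and g_t1 * g_t2 >= 0:
        return gamma_plus    # consistent direction: accelerate
    return 1.0               # otherwise keep the previous value

def train_self_adaptive(w=0.0, eta=0.05, steps=60):
    g1 = g2 = 0.0            # gradients of the two preceding steps
    for _ in range(steps):
        g = grad(w)
        eta *= adapt_factor(g, g1, g2)
        w -= eta * g
        g2, g1 = g1, g
    return w

def train_resilient(w=0.0, width=0.1, steps=60):
    g1 = g2 = 0.0
    for _ in range(steps):
        g = grad(w)
        width *= adapt_factor(g, g1, g2)
        w -= width * sgn(g)  # assumption: direction from the gradient sign
        g2, g1 = g1, g
    return w
```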

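Quickpropagation

The error function is locally approximated by a parabola.
The weight update "jumps" to the apex of the parabola.

[Figure, top: error $e$ over weight $w$ with the values $e(t-1)$ and $e(t)$ at $w(t-1)$ and $w(t)$, and the apex $m$ of the parabola at $w(t+1)$.]
[Figure, bottom: gradient $\nabla_w e$ over weight $w$ with the values $\nabla_w e(t-1)$ and $\nabla_w e(t)$ at $w(t-1)$ and $w(t)$, and the zero crossing at $w(t+1)$.]

The weight update rule can be derived from the triangles:

\[ \Delta w(t) = \frac{\nabla_w e(t)}{\nabla_w e(t-1) - \nabla_w e(t)} \cdot \Delta w(t-1). \]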

Review: Standard Fuzzy Clustering

• Allow degrees of membership of a datum to different clusters.
(Classical c-means clustering assigns data crisply.)
• Objective Function:   (to be minimized)
\[ J(X, C, U) = \sum_{i=1}^{c} \sum_{j=1}^{n} h(u_{ij}) \, d^2(c_i, x_j) \]
• $U = [u_{ij}]$ is the $c \times n$ fuzzy partition matrix,
$u_{ij} \in [0, 1]$ is the membership degree of the data point $x_j$ to the $i$-th cluster.
• $C = \{c_1, \ldots, c_c\}$ is the set of cluster prototypes.
• Usually $h(u_{ij}) = u_{ij}^{\alpha}$ is chosen, where $\alpha$ is the so-called "fuzzifier"
(the higher $\alpha$, the softer the cluster boundaries).
• Constraints:
\[ \forall i \in \{1, \ldots, c\}: \sum_{j=1}^{n} u_{ij} > 0 \qquad \text{and} \qquad \forall j \in \{1, \ldots, n\}: \sum_{i=1}^{c} u_{ij} = 1 \]
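To make the objective function concrete, here is a minimal sketch that evaluates J for the common choice h(u) = u^alpha with squared Euclidean distance. The function names and the tiny data set are illustrative assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def objective(X, C, U, alpha=2.0):
    """J(X, C, U) = sum_i sum_j u_ij^alpha * d^2(c_i, x_j)."""
    return sum(U[i][j] ** alpha * sq_dist(C[i], X[j])
               for i in range(len(C))
               for j in range(len(X)))

X = [(0.0, 0.0), (2.0, 0.0)]      # two data points
C = [(0.0, 0.0), (2.0, 0.0)]      # two cluster centers on the data points
crisp = [[1.0, 0.0], [0.0, 1.0]]  # each point belongs fully to its own center
fuzzy = [[0.5, 0.5], [0.5, 0.5]]  # each point is shared equally

j_crisp = objective(X, C, crisp)  # 0.0: every point sits on its center
j_fuzzy = objective(X, C, fuzzy)  # 2.0: two terms of 0.25 * 4
```

Both membership matrices satisfy the constraints above: every row sum is positive and every column sums to 1.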

Review: Alternating Optimization

• Problem: The objective function J cannot be minimized directly.
• Therefore: Alternating Optimization
◦ Optimize membership degrees for fixed cluster parameters.
◦ Optimize cluster parameters for fixed membership degrees.
(Update formulae are derived by differentiating the objective function J)
◦ Iterate until convergence (checked, e.g., by change of cluster center).
• Update Rules:   (for Euclidean distance and only centers, i.e. $c_i = (\mu_i)$, with $d_{ij} = d(c_i, x_j)$)
\[ \forall i; 1 \le i \le c: \forall j; 1 \le j \le n: \qquad u_{ij} = \frac{d_{ij}^{\frac{2}{1-\alpha}}}{\sum_{k=1}^{c} d_{kj}^{\frac{2}{1-\alpha}}} \]
\[ \forall i; 1 \le i \le c: \qquad \mu_i = \frac{\sum_{j=1}^{n} u_{ij}^{\alpha} x_j}{\sum_{j=1}^{n} u_{ij}^{\alpha}} \]
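The alternating scheme with these two update rules can be sketched in a few lines. This is an illustrative pure-Python version, not the author's implementation; the function names, the handling of data points that coincide with a center, and the stopping threshold are assumptions.

```python
def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def update_memberships(X, C, alpha=2.0):
    """u_ij proportional to d_ij^(2/(1-alpha)), normalized per data point."""
    U = [[0.0] * len(X) for _ in C]
    for j, x in enumerate(X):
        d2 = [sq_dist(c, x) for c in C]
        zeros = [i for i, d in enumerate(d2) if d == 0.0]
        if zeros:                        # assumption: a point on a center
            for i in zeros:              # belongs to that center only
                U[i][j] = 1.0 / len(zeros)
            continue
        w = [d ** (1.0 / (1.0 - alpha)) for d in d2]   # (d^2)^(1/(1-alpha))
        total = sum(w)
        for i in range(len(C)):
            U[i][j] = w[i] / total
    return U

def update_centers(X, U, alpha=2.0):
    """mu_i = weighted mean of the data with weights u_ij^alpha."""
    centers = []
    for row in U:
        wts = [u ** alpha for u in row]
        total = sum(wts)
        centers.append(tuple(sum(wt * x[k] for wt, x in zip(wts, X)) / total
                             for k in range(len(X[0]))))
    return centers

def fuzzy_c_means(X, C, alpha=2.0, eps=1e-12, max_steps=1000):
    """Alternate both updates until the centers move less than eps (squared)."""
    for _ in range(max_steps):
        U = update_memberships(X, C, alpha)
        new_C = update_centers(X, U, alpha)
        change = max(sq_dist(a, b) for a, b in zip(C, new_C))
        C = new_C
        if change < eps:
            break
    return C, U   # U is the matrix computed in the last step

X = [(0.0,), (0.2,), (10.0,), (10.2,)]
centers, U = fuzzy_c_means(X, [(1.0,), (9.0,)])
```

On this toy data the two centers settle close to 0.1 and 10.1, the means of the two well-separated groups.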

Review: Gustafson–Kessel Fuzzy Clustering

• Introduce a covariance matrix to describe the cluster shape.

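• Objective Function:   (to be minimized)
\[ J(X, C, U) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{\alpha} \, (x_j - \mu_i)^\top \Sigma_i^{-1} (x_j - \mu_i) \]

• Update Rule for the covariance matrix of cluster $i$ ($m$ is the number of dimensions of the data space):
\[ \Sigma_i = S_i \, |S_i|^{-\frac{1}{m}} \qquad \text{where} \qquad S_i = \sum_{j=1}^{n} u_{ij}^{\alpha} (x_j - \mu_i)(x_j - \mu_i)^\top \]

• Axes-parallel version of Gustafson–Kessel Fuzzy Clustering:
Restrict the covariance matrix to a diagonal matrix.
\[ \Sigma_i = \mathrm{diag}(\sigma_{i1}^2, \ldots, \sigma_{im}^2). \]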
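For the axes-parallel case the normalization to determinant 1 is easy to spell out, because the determinant of a diagonal matrix is the product of its entries. The following sketch handles a single cluster; the function name and the example data are assumptions for illustration.

```python
import math

def axes_parallel_covariance(X, mu, u, alpha=2.0):
    """Diagonal of the scatter matrix S, scaled by |S|^(-1/m) to determinant 1.

    X  : list of data points (tuples of length m)
    mu : center of the cluster
    u  : membership degrees of the data points to this cluster
    """
    m = len(mu)
    weights = [v ** alpha for v in u]
    # weighted sum of squared deviations per dimension (diagonal of S)
    s = [sum(w * (x[k] - mu[k]) ** 2 for w, x in zip(weights, X))
         for k in range(m)]
    det = math.prod(s)            # determinant of a diagonal matrix
    scale = det ** (-1.0 / m)
    return [sk * scale for sk in s]

variances = axes_parallel_covariance(
    X=[(-2.0, -1.0), (2.0, 1.0)], mu=(0.0, 0.0), u=[1.0, 1.0])
# diagonal of S is [8, 2], |S| = 16, scale = 1/4, result [2.0, 0.5]
```

The result describes only the shape of the cluster; its size is fixed by the determinant constraint.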

Transfer to Fuzzy Clustering

• Compute one update step of fuzzy clustering,
i.e., compute new membership degrees and new centers.

• Compute change of centers (cluster parameters),
i.e., difference of coordinates to preceding step.

• Consider this difference as a gradient and
apply the improvements from neural network training.

• Note: Standard backpropagation also yields a modification:
Introduction of a learning rate η ≥ 1.
(This approach is generally known as over-relaxation.)

• Expectation: This transfer of neural network methods
leads to a speed-up of fuzzy clustering.

Transfer to Fuzzy Clustering: Variants

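General parameter update rule:

\[ \theta(t+1) = \theta(t) + \Delta\theta(t) \]

Update step expansion:

\[ \Delta\theta(t) = \eta \, \delta\theta(t) \qquad \text{with } \eta \in [1.05, 2], \]

where $\delta\theta(t)$ is the change of the cluster parameter $\theta$
as it is computed with the standard update rule in step $t$.

Momentum term:

\[ \Delta\theta(t) = \delta\theta(t) + \beta \Delta\theta(t-1) \qquad \text{with } \beta \in [0, 1). \]

$\Delta\theta(t)$ is clamped to $[\delta\theta(t), \eta_{\max}\delta\theta(t)]$ with $\eta_{\max} = 1.8$ for robustness.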
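The step expansion and the clamped momentum term can be written as small functions acting on a single parameter coordinate. This is a sketch under assumptions: the function names are invented, and for a negative standard step the clamping interval is interpreted as the range between the standard step and eta_max times the standard step.

```python
def expanded_step(delta_std, eta=1.5):
    """Update step expansion: eta times the standard change."""
    return eta * delta_std

def clamp_step(step, delta_std, eta_max=1.8):
    """Clamp a step to the range between delta_std and eta_max * delta_std."""
    lo, hi = sorted((delta_std, eta_max * delta_std))
    return min(max(step, lo), hi)

def momentum_step(delta_std, prev_step, beta=0.3, eta_max=1.8):
    """Standard change plus a fraction of the previous step, then clamped."""
    return clamp_step(delta_std + beta * prev_step, delta_std, eta_max)

small = momentum_step(0.2, 0.1)     # 0.23, inside the allowed range
capped = momentum_step(0.2, 1.0)    # 0.5 is cut down to 1.8 * 0.2
floored = momentum_step(0.2, -1.0)  # -0.1 is raised to the standard step 0.2
```

The clamping guarantees that a modified step never points against the standard update and never exceeds it by more than the factor eta_max.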

Transfer to Fuzzy Clustering: Variants

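Self-adaptive learning rate:

\[ \eta_\theta(t) = \begin{cases}
\gamma^- \cdot \eta_\theta(t-1), & \text{if } \delta\theta(t) \cdot \delta\theta(t-1) < 0, \\
\gamma^+ \cdot \eta_\theta(t-1), & \text{if } \delta\theta(t) \cdot \delta\theta(t-1) > 0, \\
\eta_\theta(t-1), & \text{otherwise.}
\end{cases} \]

Resilient update:

\[ \Delta\theta(t) = \begin{cases}
\gamma^- \cdot \Delta\theta(t-1), & \text{if } \delta\theta(t) \cdot \delta\theta(t-1) < 0, \\
\gamma^+ \cdot \Delta\theta(t-1), & \text{if } \delta\theta(t) \cdot \delta\theta(t-1) > 0, \\
\Delta\theta(t-1), & \text{otherwise.}
\end{cases} \]

Quickpropagation analog:

\[ \Delta\theta(t) = \frac{\delta\theta(t)}{\delta\theta(t-1) - \delta\theta(t)} \cdot \Delta\theta(t-1). \]

In my experiments I used $\gamma^- = 0.7$ and $\gamma^+ = 1.2$ and clamping.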
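A sketch of two of these variants for a single parameter coordinate follows. It is illustrative only: the function names are invented, the clamping range between the standard change and eta_max times the standard change (with eta_max = 1.8) is carried over from the momentum variant as an assumption, and falling back to the standard step when the denominator vanishes is also an assumption.

```python
def adapt_eta(eta_prev, d_t, d_prev, gamma_minus=0.7, gamma_plus=1.2):
    """Self-adaptive factor: shrink on a sign change, grow on equal signs."""
    if d_t * d_prev < 0:
        return gamma_minus * eta_prev
    if d_t * d_prev > 0:
        return gamma_plus * eta_prev
    return eta_prev

def quick_step(d_t, d_prev, step_prev, eta_max=1.8):
    """Quickpropagation analog with clamping (assumed, see lead-in)."""
    denom = d_prev - d_t
    if denom == 0.0:
        return d_t                   # assumption: use the standard step
    step = d_t / denom * step_prev
    lo, hi = sorted((d_t, eta_max * d_t))
    return min(max(step, lo), hi)

eta_up = adapt_eta(1.0, 0.1, 0.2)    # same direction: 1.2
eta_down = adapt_eta(1.0, 0.1, -0.2) # direction changed: 0.7
inside = quick_step(0.1, 0.4, 0.4)   # 0.1 / 0.3 * 0.4, about 0.133, kept as is
capped = quick_step(0.1, 0.2, 0.2)   # about 0.2, cut down to 1.8 * 0.1
```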

Updating Covariance Matrices

• Center coordinates can be updated independently and arbitrarily.

• (Co)variances, however, have a bounded range of values
and depend on each other (e.g. $s_{xy}^2 \le s_x^2 s_y^2$).

• (Co)variances are updated before normalization to determinant 1.

• Variances:             (axes-parallel Gustafson–Kessel clustering)
◦ Are treated independently of each other.
◦ Check for a positive value (a variance must be > 0),
otherwise do standard update step.

• Covariances:              (general Gustafson–Kessel clustering)
◦ Check for positive definite matrix with Cholesky decomposition.
◦ If the updated matrix is not positive definite,
do a standard update step for the matrix as a whole.
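The positive definiteness check can be sketched with a plain Cholesky decomposition, which fails exactly when the symmetric input matrix is not positive definite. The function names and the way the fallback is expressed are illustrative assumptions, not the author's code.

```python
import math

def cholesky(A):
    """Return lower-triangular L with A = L L^T, or None if the symmetric
    matrix A (list of rows) is not positive definite."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:          # non-positive pivot: not positive definite
                    return None
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def choose_update(modified, standard):
    """Keep the modified matrix only if it is positive definite."""
    return modified if cholesky(modified) is not None else standard

ok = [[2.0, 1.0], [1.0, 2.0]]   # s_xy^2 = 1 <= s_x^2 * s_y^2 = 4
bad = [[1.0, 2.0], [2.0, 1.0]]  # s_xy^2 = 4 >  s_x^2 * s_y^2 = 1
```

Here `choose_update(bad, ok)` returns `ok`, mirroring the fallback to the standard update step for the matrix as a whole.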

Convergence Evaluation

• General idea: Use relative cluster evaluation measures.

• Simplest approach:
\[ Q_{\mathrm{diff}}(U^{(1)}, U^{(2)}) = \min_{\pi \in \Pi(c)} \frac{1}{cn} \sum_{i=1}^{c} \sum_{j=1}^{n} \left( u_{ij}^{(1)} - u_{\pi(i)j}^{(2)} \right)^2. \]

$U^{(k)} = (u_{ij}^{(k)})_{1 \le i \le c,\, 1 \le j \le n}$ for $k = 1, 2$ are the two partition matrices to compare,
$n$ is the number of data points, $c$ the number of clusters, and $\Pi(c)$ is the set
of all permutations of the numbers 1 to $c$.

• Other possibilities:
◦ (cross-classification) accuracy
◦ F1-measure
◦ Rand statistic / Rand index
◦ Jaccard coefficient / Jaccard index
◦ Fowlkes–Mallows index
◦ Hubert index / Hubert-Arabie index
The last four measures are based on evaluating coincidence matrices.
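The simplest measure, the mean squared difference of membership degrees minimized over all cluster permutations, can be sketched directly from its definition. The function name is an assumption, and the brute-force search over all c! permutations is only practical for a small number of clusters.

```python
from itertools import permutations

def q_diff(U1, U2):
    """Mean squared membership difference, minimized over cluster permutations.

    U1, U2: partition matrices as lists of c rows with n entries each.
    """
    c, n = len(U1), len(U1[0])
    best = min(
        sum((U1[i][j] - U2[perm[i]][j]) ** 2
            for i in range(c) for j in range(n))
        for perm in permutations(range(c)))
    return best / (c * n)

A = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.0, 1.0], [1.0, 0.0]]   # same partition with the cluster labels swapped
H = [[0.5, 0.5], [0.5, 0.5]]   # completely undecided memberships

same = q_diff(A, B)    # 0.0: the swap of labels is undone by the permutation
differ = q_diff(A, H)  # 0.25: every entry differs by 0.5
```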

Experimental Results: Clustering Trials

[Figure: four plots of log(difference), each showing the individual trials and their average.
Top row: iris data, 3 clusters; bottom row: wsel data, 6 clusters.
Left column: fuzzy c-means; right column: general Gustafson–Kessel.]

Experimental Results: iris, 3 clusters, FCM

[Figure: three plots of log(difference).
Clustering the iris data with the fuzzy c-means algorithm and update modifications;
top left: step expansion (curves: none, 1.1/1.3/1.5, 1.6),
top right: momentum term (curves: none, 0.1/0.15/0.4, 0.5),
bottom left: other methods (curves: none, resilient, quick).]

Experimental Results: iris, 3 clusters, GK

[Figure: three plots of log(difference).
Clustering the iris data with the Gustafson–Kessel algorithm and update modifications for all parameters;
top left: step expansion (curves: none, 1.2/1.4/1.9, 2.0),
top right: momentum term (curves: none, 0.2/0.3/0.8, 0.9),
bottom left: other methods (curves: none, resilient, quick).]

Experimental Results: iris, 3 clusters, GK

[Figure: three plots of log(difference).
Clustering the iris data with the Gustafson–Kessel algorithm and update modifications applied only for cluster centers;
top left: step expansion (curves: none, 1.2/1.4/1.9, 2.0),
top right: momentum term (curves: none, 0.2/0.3/0.5, 0.6),
bottom left: other methods (curves: none, resilient, quick).]

Experimental Results: wine, 6 clusters, ap. GK

[Figure: three plots of log(difference).
Clustering the wine data with the axes-parallel Gustafson–Kessel algorithm and 6 clusters;
top left: step expansion (curves: none, 1.2/1.5/1.9, 2.0),
top right: momentum term (curves: none, 0.15/0.3/0.8, 0.9),
bottom left: other methods (curves: none, resilient, quick).]

Experimental Results: wine, 6 clusters, GK

[Figure: three plots of log(difference).
Clustering the wine data with the general Gustafson–Kessel algorithm and 6 clusters;
top left: step expansion (curves: none, 1.2/1.5/1.9, 2.0),
top right: momentum term (curves: none, 0.2/0.3/0.8, 0.9),
bottom left: other methods (curves: none, resilient, quick).]

Experimental Results: abalone, 3 clusters, FCM

[Figure: three plots of log(difference).
Clustering the abalone data with the fuzzy c-means algorithm and 3 clusters;
top left: step expansion (curves: none, 1.2/1.4/1.7, 1.8),
top right: momentum term (curves: none, 0.1/0.2/0.4, 0.5),
bottom left: other methods (curves: none, resilient, quick).]

Experimental Results: abalone, 3 clusters, GK

[Figure: three plots of log(difference).
Clustering the abalone data with the general Gustafson–Kessel algorithm and 3 clusters;
top left: step expansion (curves: none, 1.2/1.5/1.9, 2.0),
top right: momentum term (curves: none, 0.15/0.3/0.5, 0.6),
bottom left: other methods (curves: none, resilient, quick).]

Summary

• Fuzzy clustering as well as neural network training are iterative processes.
• Executing alternating optimization can be seen as providing a gradient step.
• Thus all variants of neural network gradient descent become applicable.
• Some of these variants lead to a considerable speed-up.
• A transfer to estimating a mixture of Gaussians is also possible.

An implementation of these techniques can be retrieved free of charge at
http://www.borgelt.net/cluster.html

The full set of diagrams for the experiments is available at:
http://www.borgelt.net/papers/nndvl.pdf   (color)
http://www.borgelt.net/papers/nndvl g.pdf (greyscale)

