Digital Signal Processing by huanglianjiang1


Final Project 2007/08

Irene Moreno González 100039158
José Manuel Camacho Camacho 100038938
IMG/JMCC                                                                                        DSP08-G95

Table of Contents
Visualisation of Data
Linear Classifiers
   Single Perceptron with ADALINE Learning Rule
   Sequential Gradient Rule & Soft Activation
Non-Linear, Non-Parametric Classification
Reducing Costs via Clustering



To get a general view of the problem, the cloud of points is plotted first, with the colour of
each point showing which class it belongs to.


Single Perceptron with ADALINE Learning Rule

This machine has the same structure as the hard SLP (single layer perceptron). The
structure that will be used is as follows:

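[Figure: single perceptron with inputs x1 and x2, weights w1 and w2, and bias weight w0,
followed by a hard decider acting on z with outputs +1 and -1. In vector form,
x = [1, x1, x2]^T and w = [w0, w1, w2]^T.]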



Given a set of already classified samples, the perceptron can be trained with the
Widrow-Hoff algorithm, which is driven by the error measured at the input of the decider:
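In its usual textbook form, and with notation assumed here for illustration (w[k] is the
weight vector at iteration k, x[k] the extended input sample [1, x1, x2]^T, d[k] its label
in {-1, +1} and \mu the step), the Widrow-Hoff update can be written as:

```latex
\mathbf{w}[k+1] = \mathbf{w}[k] + \mu \, \bigl( d[k] - \mathbf{w}[k]^{T} \mathbf{x}[k] \bigr) \, \mathbf{x}[k]
```

Some texts absorb a factor of 2 into the step, so the exact scaling used in this project
may differ.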

In each iteration, the weights are recalculated as shown above, using the training samples as
input. The step controls the speed of convergence.

Some design decisions were taken during the implementation:
   -   Weights are initialized randomly; for this reason, even if the training samples and
       the iterations do not change, different results are obtained each time the algorithm
       is run.

   -   A pocket algorithm with zipper is used to obtain the optimum weights. Although
       the weights are calculated with the training samples as inputs, the validation
       set is the one used to select the optimum weights to keep. In this way, better
       generalization is reached, since the perceptron receives more information about
       the distribution of the data. The drawback is a higher computational complexity,
       since for each recalculation of the weights the error over the whole validation
       set has to be obtained.

   -   In each iteration both training and validation sets are reordered randomly,
       but the initial weights are no longer random, since the ones from the previous
       iteration are used. A total of 100 iterations is chosen, although the algorithm
       converges much earlier, as will be shown later.

   -   Although it is important to know which samples were correctly classified, the
       MSE rather than the percentage error is considered during the training phase:
       not only does the success rate have to be optimized, but an intermediate position
       of the border between the clouds of patterns is also sought.

   -   The value of the step was set by trial and error: different steps were used until the
       best results were reached. For the case of the ADALINE algorithm, a final step of
       0.0001 was chosen.
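The design decisions above can be sketched as follows. This is only a minimal illustration,
not the code used in the project: it assumes NumPy, labels coded as -1/+1, and a pocket
that keeps the weights with the lowest validation MSE. The function name
train_adaline_pocket and its parameters are hypothetical.

```python
import numpy as np

def train_adaline_pocket(x_trn, d_trn, x_val, d_val, step=1e-4, epochs=100, seed=0):
    """Sequential Widrow-Hoff training with a pocket chosen on validation MSE."""
    rng = np.random.default_rng(seed)
    # Extended samples [1, x1, x2], so that w0 acts as the bias weight.
    xe_trn = np.hstack([np.ones((len(x_trn), 1)), x_trn])
    xe_val = np.hstack([np.ones((len(x_val), 1)), x_val])
    w = rng.normal(scale=0.1, size=xe_trn.shape[1])   # random initial weights
    pocket_w, pocket_mse = w.copy(), np.inf
    for _ in range(epochs):
        for i in rng.permutation(len(xe_trn)):        # reorder the samples each epoch
            err = d_trn[i] - xe_trn[i] @ w            # error at the input of the decider
            w = w + step * err * xe_trn[i]            # Widrow-Hoff update
            val_mse = np.mean((d_val - xe_val @ w) ** 2)
            if val_mse < pocket_mse:                  # renew the pocket only on improvement
                pocket_w, pocket_mse = w.copy(), val_mse
    return pocket_w, pocket_mse
```

Evaluating the validation MSE after every single update is what makes the pocket costly,
which is the drawback mentioned in the list.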

Step: 0.0001
Weights: w0=0.196
Error rate|TRN = 3.23 %
Error rate|VAL = 3.31 %
Error rate|TEST = 3.15 %


In order to obtain the error rate for each set, every point is multiplied by the weights, and
the decision is based on the sign of the result. The decision border and the error rates that
were obtained are shown above.
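A minimal sketch of this error-rate computation is given below. It assumes NumPy, labels
coded as -1/+1 and samples stored one per row. The name error_rate is hypothetical.

```python
import numpy as np

def error_rate(w, x, d):
    """Percentage of samples whose label differs from sign(w0 + w1*x1 + w2*x2)."""
    xe = np.hstack([np.ones((len(x), 1)), x])    # extended samples [1, x1, x2]
    decisions = np.where(xe @ w >= 0, 1, -1)     # hard decision on the sign
    return 100.0 * np.mean(decisions != d)
```

The same function can be applied to the training, validation and test sets in turn.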

The next plot shows the evolution of the validation error rate through the iterations of
the algorithm. The fact that the test error rate is lower than the training and validation
ones is not relevant: since linear classification is performed, the problem can be more or
less separable depending on the position of the clouds of data in each set. Nevertheless, as
pointed out before, these results depend on the initialization, which is random and
therefore changes from run to run.

As can be seen from the figures, the algorithm reaches its minimum before the 15th iteration,
so the weights stored in the pocket are no longer renewed from that moment (right side
figure). The MSE of the validation set starts increasing after that minimum (left side
figure).

With this algorithm, the number of errors is not minimized, but convergence to a local
minimum is ensured.

Sequential Gradient Rule & Soft Activation

The advantage of the sequential gradient rule is that the hard decision is replaced by a
differentiable approximation. In this case, a soft hyperbolic tangent activation function
was used. In this way, a Least Mean Squares (LMS) algorithm can be implemented for the
recalculation of the weights:
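With a hyperbolic tangent activation, the standard sequential gradient update of the squared
error takes the form below. The notation is assumed here for illustration (w[k] the weight
vector, x[k] the extended sample, d[k] the label in {-1, +1}, \mu the step), and constant
factors may be absorbed into the step:

```latex
\mathbf{w}[k+1] = \mathbf{w}[k] + \mu \, \bigl( d[k] - \tanh(\mathbf{w}[k]^{T}\mathbf{x}[k]) \bigr)
                  \bigl( 1 - \tanh^{2}(\mathbf{w}[k]^{T}\mathbf{x}[k]) \bigr) \, \mathbf{x}[k]
```

The extra factor 1 - tanh^2 is the derivative of the activation, which is what the hard
decider lacks.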

Parameters and design decisions remain the same as in the ADALINE case, but better
performance and better results are obtained with this learning rule, as shown in the
following graphs and percentages. A step of 0.05 was used.


Step: 0.05
Weights: w0=3.262, w2=-1.292
Error rate|TRN = 3.23 %
Error rate|VAL = 3.31 %
Error rate|TEST = 3.15 %

As can be seen in the figures, convergence is fast and optimal for this rule. The algorithm
reaches its minimum around the 20th iteration, but the recalculation of the weights is more
stable than in the ADALINE case.



The k-NN classifier was implemented as required in the guide. The following sections describe
the classifier's properties in terms of error rates, expressiveness and computational cost.
A comparison with the linear classifier results is provided as well.

    a. Plot the training and validation classification error rates of this classifier as a
       function of k, for k = 1, 3, 5, 7, …, 25.

  Fig. 1   Training and Validation Error Rates (as percentage) for each value of ‘k’, for k=1,3…25

At first glance, it can be clearly seen that error rates are slightly lower in training than
in validation, for any value of k. This is because, when a training sample is classified, its
set of k neighbours contains at least one training point of the true class, namely the sample
itself. This does not hold for validation. Moreover, if the training points are not
representative enough, this method may generalise badly and the training error rates may not
be reliable.

The validation set was used to estimate the optimum k parameter. Classification resulted in
k = 9.

This value is employed to compute results for the test set later on.
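The procedure of part a can be sketched as follows. This is a brute-force illustration, not
the project's implementation: it assumes NumPy, two classes coded as -1/+1 and Euclidean
distance. The names knn_predict and knn_error_curve are hypothetical.

```python
import numpy as np

def knn_predict(x_trn, y_trn, x_query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    # Squared distances between every query point and every training point.
    d2 = ((x_query[:, None, :] - x_trn[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]          # indices of the k closest points
    votes = y_trn[nearest]                           # labels in {-1, +1}
    return np.where(votes.sum(axis=1) >= 0, 1, -1)   # an odd k avoids ties

def knn_error_curve(x_trn, y_trn, x_val, y_val, ks=range(1, 26, 2)):
    """Training and validation error rates (%) per k, and the k with lowest validation error."""
    trn = [100 * np.mean(knn_predict(x_trn, y_trn, x_trn, k) != y_trn) for k in ks]
    val = [100 * np.mean(knn_predict(x_trn, y_trn, x_val, k) != y_val) for k in ks]
    best_k = list(ks)[int(np.argmin(val))]
    return trn, val, best_k
```

When the training set is classified against itself, each sample is its own nearest
neighbour, which is why the 1-NN training error is zero.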

    b. Depict the decision borders for the classifiers with k = 1, 5, 25.

Fig. 3, Fig. 2 and Fig. 4 show the classification border for three different values of k. In
terms of expressiveness, this method has better properties than linear classification: error
rates are reduced thanks to a better fit of the classification borders to the data spread
over the plane, especially the samples on the right side, which were always missed by the
linear border.


Fig. 3    Classification Border for the 1-NN Classifier

Given a training set from which the classification border is built, the effect of the choice
of k is discussed here. The 1-NN method has a null training error rate, so its border
separates the training set perfectly, creating very expressive shapes, as can be seen. Its
validation error rate, however, is not the best; hence it can be concluded that the 1-NN
classifier over-fits the training set, giving rather poor generalization in validation (and,
presumably, in testing too).

The 25-NN classifier has exactly the opposite behavior. For each classification it takes into
account so many training points that some of them introduce wrong information into the
classifier. In fact, this classifier fits neither the training nor the validation set well.

In between is the 5-NN classifier. Its training error rate is not as good, but it
establishes a smoother classification border than 1-NN. In principle, that can lead to
better generalization and less sensitivity to outliers.

Furthermore, it seems clear that the choice of k depends on how close or far the training
points of different classes are to each other.

Fig. 2    Classification Border for the 5-NN Classifier

In order to obtain the results for this part of the assignment, several simulations with the
implemented k-NN classifier were carried out. Time consumption was much higher than in linear
classification. If a fixed number of operations is needed to calculate the distance between
one sample and a training point, and k operations to select the k closest points and count
their labels, then classifying n samples against a training set of similar size has a cost
that grows quadratically with n.

The same computational cost is required for validation and test if those sets have a similar
size to the training set.
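As a rough operation count under assumed notation (c_d for the operations per distance
computation and N for the number of training points, both symbols introduced here only for
illustration), classifying n samples costs about:

```latex
C_{k\text{-NN}} \approx n \, ( N \, c_d + k ), \qquad N \approx n \;\Rightarrow\; C_{k\text{-NN}} = O(n^{2})
```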

Fig. 4    25-NN’s Classification Border


    c. Obtain the optimum value for k according to the validation set, and give the test
       classification error rate that would be obtained in that case.

As mentioned, the value of k that minimizes the validation error rate is k=9. In that
case, 9-NN classification had the following error rate,

This error rate is lower than the one for the linear classifier, consistent with the
improvement in expressiveness. The test error for this classifier is slightly higher than the
training and validation errors, which may be due to the dependence of the classification on
the training set.


    - Implement the classifiers corresponding to kC = 1, 2, 3, …, 50, and compute their
      training and validation classification error rates.

Fig. 5   Training (orange) and Validation (blue) error rates, expressed as missed samples
out of set size on the vertical axis, for kC=1,2,…,50

One run of the algorithm for each possible number of centroids is executed and plotted. As
can be seen from the figure, error rates have a reasonable value when more than 3 centroids
per class are used. Due to the shape of the cloud of the given points, three or fewer
centroids are not enough to represent the data. In addition, training error rates are always
lower than validation ones, since the centroids seek their optimum position with reference
to the training set.
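A clustering-based classifier of this kind can be sketched as below. The sketch assumes a
plain k-means run separately on each class, with centroids initialized at randomly chosen
samples, and NumPy for the arithmetic. The report does not name its clustering algorithm
or give its code, so the algorithm and all function names here are assumptions.

```python
import numpy as np

def kmeans(x, kc, iters=50, seed=0):
    """Plain k-means; centroids start at kc randomly chosen samples."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=kc, replace=False)].copy()
    for _ in range(iters):
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        owner = d2.argmin(axis=1)                # closest centroid for each sample
        for j in range(kc):
            members = x[owner == j]
            if len(members):                     # empty clusters stay where they are
                centroids[j] = members.mean(axis=0)
    return centroids

def fit_centroids(x, y, kc, seed=0):
    """Run k-means on each class; return all centroids and the class of each one."""
    cents, labels = [], []
    for c in np.unique(y):
        cents.append(kmeans(x[y == c], kc, seed=seed))
        labels.append(np.full(kc, c))
    return np.vstack(cents), np.concatenate(labels)

def classify(x, centroids, labels):
    """Assign each sample the class of its nearest centroid."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return labels[d2.argmin(axis=1)]
```

Only the centroids are compared at classification time, so the cost per sample depends on
kC rather than on the size of the training set.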


    - Select the value kC* that minimizes the following objective function

          J(kC) = Te + log10(kC)

      where Te is the validation error rate. Obtain the test classification error rate when
      using kC*, and plot the classification border of this classifier.

                                                         Fig. 6   Objective Function J(kC) for kC = 1,2,…,50.

In order to obtain the optimum number of centroids, as a compromise between the error rate
and the complexity of the problem, the objective function J is computed for each value of
kC. It reaches its minimum at kC*=20. The test error is computed for this case. Using this
objective function, large numbers of centroids are avoided. If Te alone were used as the
figure of merit, kC would tend to the number of training points, because error rates would
converge to the 1-NN error, and therefore computational cost would not be reduced. Instead,
by adding log10(kC) to the error percentage, the optimum kC depends not only on the error
rate, and the computational cost is kept low.
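The selection of kC* can be sketched in a few lines. The sketch assumes NumPy and that Te
is expressed as a percentage, as the paragraph above suggests. The name select_kc is
hypothetical.

```python
import numpy as np

def select_kc(val_error_pct, kcs):
    """Return the kC minimizing J(kC) = Te + log10(kC), together with all J values."""
    kcs = np.asarray(kcs)
    j = np.asarray(val_error_pct, dtype=float) + np.log10(kcs)
    return int(kcs[np.argmin(j)]), j
```

Because log10(kC) grows slowly, the penalty only matters once the validation error has
flattened out, which is where the extra centroids stop paying for themselves.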

The following picture depicts the test points and the classification border for kC=20.

The border of the classifier adapts quite well to the samples. In fact, this border is the
set of points equidistant from the closest centroids of different classes.

An issue found during the development of the algorithm is that, with a relatively high kC,
after some iterations some of the centroids did not get any sample assigned to them, since
all the nearby samples found a closer centroid to join. Our implementation does not
re-allocate these centroids. However, different adjustments can be made to the algorithm to
handle this problem: those "dummy" centroids can either be assigned a new random value,
nudging them, or clusters with greater dispersion can be split into two groups, producing
two new clusters.
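One of the adjustments mentioned above, assigning a new random value to the "dummy"
centroids, could look like the sketch below. It is not part of the implementation described
in this report. It assumes NumPy and re-seeds each empty centroid on a randomly chosen
sample. The name reseed_empty_centroids is hypothetical.

```python
import numpy as np

def reseed_empty_centroids(x, centroids, rng):
    """Move every centroid that owns no sample onto a randomly chosen sample."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    owner = d2.argmin(axis=1)                               # closest centroid per sample
    counts = np.bincount(owner, minlength=len(centroids))   # samples owned by each centroid
    centroids = centroids.copy()
    for j in np.flatnonzero(counts == 0):                   # the "dummy" centroids
        centroids[j] = x[rng.integers(len(x))]
    return centroids
```

Such a step would be called between k-means iterations, after the assignment of samples to
centroids.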
Fig. 7


    - Repeat now the experiment 100 times, and give the following results:

        - Plot the average value and the standard deviation of the training and
          validation error rates as a function of kC.

Since the centroids are initialized randomly, different error rates and performance are
obtained from different runs of the algorithm. For this reason, a good way to get reliable
results is to repeat the experiment a given number of times, extracting the average value and
the standard deviation of the error rates. In this case, 100 runs were performed, and the
figures below show the results:
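The repetition can be organised as in this sketch, which assumes NumPy and a user-supplied
function run_once(seed) returning one error rate per value of kC. Both names are
hypothetical.

```python
import numpy as np

def repeat_experiment(run_once, n_runs=100):
    """Mean and standard deviation, per kC, of the error rates over n_runs runs."""
    # errors[r, i] is the error rate of run r for the i-th value of kC.
    errors = np.array([run_once(seed) for seed in range(n_runs)], dtype=float)
    return errors.mean(axis=0), errors.std(axis=0)
```

Passing the run index as the seed makes each repetition start from a different but
reproducible initialization.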

Fig. 8   Mean and Standard Deviation of Error Rates (no percentages) for different values of kC. Although the right graph is
labelled with the letter v, it shows the standard deviation (sigma), not the variance.

              From the figures it can be concluded that:

- There is a strong dependency of the error rates on the initialization when using a small
number of centroids (fewer than 10). This is because the shape of the data, spread all over
the plane, makes those few centroids not representative of the data set. A bad
initialization may cause the error to grow even more (large confidence interval).

- For higher values of kC, both the mean and the standard deviation decrease steadily
towards the 1-NN behavior. As discussed later, these results show that the relatively small
improvement of the error rate for large kC may not be enough to justify the increased cost
of employing so many centroids.

- When there is one centroid per class, the algorithm has its worst performance. Convergence
is reached at the first iteration, since the centroid is simply the center of mass of all the
samples of its class. Hence, the error is the same in all repetitions: the centroid is always
the same, so the error rate is deterministic and its variance is null.


        - Using a histogram, give an approximate representation of the distribution of
          the 100 values obtained for kC*. Obtain the average value for kC*, and for
          its corresponding test error rate.

In order to carry out this section, we have assumed that, for each initialization of the
centroids, kC* can be viewed as a different discrete random variable. When 100 such
independent random variables are added, the central limit theorem states that their sum (and
hence their average) is approximately Gaussian, regardless of the distribution of each
variable.

The histogram shows the results obtained through the 100 runs: for each of them, one value
of kC turned out to be the optimum, and the histogram represents the frequency of those
selected values. Fitting a Gaussian p.d.f. to the displayed histogram, the expectation of
kC is,

These results are consistent with the previous analysis. The average error is higher than the
1-NN error. Furthermore, the optimum choice for kC lies at the beginning of the flat region
of Fig. 8, where, as mentioned, good error rates can be achieved without increasing the
complexity of the algorithm.

Regarding computational cost, as in the previous section, if a fixed cost is assumed for
computing the Euclidean distance between two points, and k is the cost of finding the
closest centroid among k centroids, the total cost of classifying an entire n-sample set
grows only linearly with n for a fixed number of centroids.
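As a rough operation count under assumed notation (c_d for the cost of one Euclidean
distance and K for the total number of centroids, both symbols introduced here only for
illustration), classifying n samples costs about:

```latex
C_{\text{clust}} \approx n \, ( K \, c_d + K ) = O(n K)
```

For a fixed K this is linear in n, whereas comparing each sample against roughly n training
points, as k-NN does, gives a quadratic count.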


Thus, once the centroids have been calculated (there may be no reduction of computational
cost during training), this clustering approach reduces the classification cost to one that
is linear in n, instead of the quadratic cost of k-NN, providing more scalability while
keeping good performance in terms of errors. As mentioned, kC also affects the complexity.
The classification border of one classifier, chosen at random among the 100 runs that used
the optimum kC estimated earlier, is shown below.



    - Linear Classifiers: they provide a simple tool for data classification, with reasonable
      performance which can be enough for certain applications. Their main drawbacks
      are the lack of expressiveness (linear borders) and sensitivity to step size. Soft
      activation has shown better performance (almost the same error rates but less

    - The k-NN classifier gave the best performance for this classification problem. Its
      ability to depict complex classification borders provided the best classification
      pattern. A good choice of k is critical for good generalisation. The main shortcoming
      of this algorithm is its very high computational cost, because it performs a global
      sweep for each sample in order to choose the most suitable class for that sample.

    - Clustering classification provides, in this particular case, the chance to set up a
      trade-off between computational cost and accuracy. The number of centroids
      calculated during training for each class determines the accuracy of the results.
      A small number of centroids requires low computational cost but gives poor
      performance, whereas if every training point becomes a centroid (extreme case) the
      same performance as the k-NN algorithm is achieved, at the cost of the computational
      effort already mentioned.

