Fuzzy Clustering - A Versatile Mean to Explore Medical Databases


           Georg Berks, Diedrich Graf v. Keyserlingk, Jan Jantzen*, Mariagrazia Dotoli**,
                                          Hubertus Axer
                                        Department of Anatomy I
                                              RWTH Aachen
                               Pauwelsstr. 30, D-52057 Aachen, Germany
                           Phone: ++49-241-8089100, Fax: ++49-241-8888431
                     email: {georg, keyser, hubertus }@cajal.medizin.rwth-aachen.de
                                       *Department of Automation
                                    Technical University of Denmark
                                       DK-2800 Lyngby, Denmark
                             Phone: ++45-4525-3561, Fax: ++45-4588-1295
                                        email: Jantzen@iau.dtu.dk
                         **DEE – Dipartimento di Elettrotecnica ed Elettronica
                                            Politecnico di Bari
                                            Via Re David, 200
                                             70125 Bari, Italy
                           Phone: ++39-080-5963312, Fax: ++39-080-5963410
                                         email: dotoli@poliba.it


   ABSTRACT: A clinical syndrome is a set or cluster of concurrent symptoms which together indicate the
   presence and the nature of a disease. Looking for concurrent symptoms is therefore one of the main tasks in
   medical diagnosis. In medicine, imprecise conditions are the rule, and fuzzy methods are therefore more suitable
   than crisp ones. We used fuzzy c-means clustering to assign symptoms to the different categories of aphasia.
   The results were compared with the results of some subtests of the Aachen Aphasia Test (AAT). The
   polarization of the five main factors leads to at least 10 different categories. The description of language failures
   by c-means classification of the analyzed factors corresponds in many, but not in all, cases to the traditional
   diagnostic scheme.


   KEYWORDS: Aphasia, fuzzy c-means clustering, classification.


   INTRODUCTION

   A symptom is a visible or even measurable condition indicating the presence of a disease and hence can be
   regarded as an aid in diagnosis. Symptoms are the smallest units indicating the existence of a disease. A
   syndrome, on the other hand, is a collection, a set, or a cluster of concurrent symptoms, which together indicate the
   presence and the nature of the disease. The history of a syndrome includes its first description, its confirmation,
   and the acknowledgement of its usefulness. In many cases it is named after the author who first described it. Joining single
   symptoms together into one syndrome is one of the main tasks in medical diagnosis. Classification and clustering
   are therefore basic concerns in medicine. Classification depends on the definition of the classes and on the
   required degree of affiliation of their elements, i.e. the cases' symptoms. Although classification is a traditional
   approach in medicine, many ambiguities exist in finding exact diagnoses. In a mathematical or statistical
   environment a value either belongs to a class or it does not. In medicine, conditions are usually imprecise, and
   fuzzy methods therefore seem to be more suitable than crisp ones.




ESIT 2000, 14-15 September 2000, Aachen, Germany                                                                           453
   FUZZY C-MEANS CLUSTERING

   Cluster analysis is a large field, both within fuzzy sets and beyond it. Many algorithms have been developed to
   obtain hard clusters from a given data set. Among those, the c-means algorithms and the ISODATA clustering
   methods are probably the most widely used. Both approaches are iterative. Hard c-means algorithms assume that
   the number of classes c is known, whereas c is unknown in the case of the ISODATA algorithms. Hard c-means
   performs a sharp classification, in which each object is either assigned to a class or not. The membership of an
   object in a class therefore amounts to either 1 or 0. The application of fuzzy sets in a classification function makes this
   class membership a relative one, and consequently an object can belong to several classes at the same
   time, but to different degrees. The c-means algorithms are prototype-based procedures, which minimize the total
   of the distances between the prototypes and the objects by constructing an objective function. Both methods,
   sharp and fuzzy classification, determine class centers and minimize, e.g., the sum of squared distances between
   these centers and the objects, which are characterized by their features. Thus classes have to be developed which
   are as dissimilar as possible.
   Fuzzy c-means clustering is a simple and well-established tool, which has been applied in many medical fields.
   However, in c-means algorithms, as in all other optimization procedures that look for the global minimum of
   a function, there is the danger of getting trapped in a local minimum. Therefore the result of such a classification has to be
   regarded as an optimal solution only to a certain degree of accuracy.
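
   For reference, the objective function that fuzzy c-means is usually taken to minimize (Bezdek, 1981) can be written as below. The formula is not spelled out in the text above; the symbols are chosen here for illustration: c classes, A objects x_k, class centers v_i, memberships µ_ik, a weighting exponent m > 1, and a Euclidean distance d_ik, which is an assumption.

```latex
J_m(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{A} (\mu_{ik})^{m} \, d_{ik}^{2},
\qquad d_{ik} = \lVert x_k - v_i \rVert,
\qquad \text{subject to } \sum_{i=1}^{c} \mu_{ik} = 1 \ \text{for every } k .
```

   Hard c-means corresponds to the special case in which every µ_ik is restricted to 0 or 1. Because the function is not convex in the memberships and centers jointly, an iterative minimization may end in a local minimum.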

   MEDICAL BACKGROUND OF APHASIA

   Aphasia is a disturbance in the communicative use of language, which can occur in different forms (Axer et al.,
   2000). It is produced by damage to regions of the cerebral cortex that are related to language functions. In
   contrast, a disturbance of articulation alone is called dysarthria. This means that in aphasia higher
   neuropsychological functions are affected.


   Major clinical entities of aphasia

   In aphasiology, there are many inconsistencies concerning the definition and interpretation of aphasic syndromes.
   In a clinical setting, the following aphasic syndromes are distinguished. These syndromes are strictly empirical
   and based on a statistically reliable co-occurrence of a set of symptoms.

   •   Broca's Aphasia (also called Motor or Expressive Aphasia) (Broca, 1861): Motor aphasia is caused by
       a lesion within the third frontal convolution. The disturbances mostly affect expressive language functions. The
       patients speak non-fluently and in a so-called telegraphic style.
   •   Wernicke's Aphasia (also called Sensory or Receptive Aphasia) (Wernicke, 1874): Sensory aphasia is
       caused by a lesion near the auditory center, with the consequence that the patient does not understand words
       and may also fail to notice the defects of his otherwise fluent speech.
   •   Global Aphasia (also called Total Aphasia): In global aphasia, loss of expression and understanding is
       caused by an extended destruction of both of the centers above. Hence global aphasia is a very severe
       language disturbance. Often communication is not possible at all.
   •   Anomic Aphasia: The spontaneous speech of anomic patients is fluent and grammatically correct, but these
       patients have difficulties in the retrieval of words.
   •   Conduction Aphasia: Conduction aphasia is caused by damage to the connection between the sensory and
       the motor center, the so-called Fasciculus arcuatus. While spoken language is understood, the repetition of
       spoken words is severely disturbed or even impossible.


   MATERIALS AND METHODS

   The 265 AAT test profiles (Huber et al., 1983, 1984) collected in the Aphasia Database since 1986 (Axer et al.,
   2000) were taken as the input for a factor analysis. Factor analysis was applied to a correlation matrix of 26
   symptoms of language disorders and led to five factors (Keyserlingk et al., 2000). These factors provided
   meaningful indications of the disease (Table I).




   Factor No.   Meaning
   I            severity of disturbance
   II           expressive vs. comprehensive
   III          granularity of phonetic mistakes
   IV           awareness of disease
   V            deficits in communication

   Table I: Factors derived from the factor analysis.

   After the factors have been extracted, they are usually transformed into 'simple structure' to make their
   significance easier to interpret. The principle of 'simple structure' is to work out, from all possible feature
   configurations, however scattered they may be, the ideal configuration in which each variable has the lowest
   complexity, i.e., it can be described by only one single factor. We treated the factors with the so-called varimax
   method (Weber, 1980).
   If the 'simplicity' of a single factor $f_p$ is defined as the variance $s_p^2$ of its squared loadings $a_{ip}^2$, then this
   variance has to become a maximum to increase the 'simplicity' of the respective factor.

   \[
   s_p^2 = \frac{1}{m} \sum_{i=1}^{m} \left( a_{ip}^2 \right)^2 - \frac{1}{m^2} \left( \sum_{i=1}^{m} a_{ip}^2 \right)^2 , \qquad p = 1, \ldots, k \tag{1}
   \]

   To increase the 'simplicity' of the complete matrix $A = \sum_{p=1}^{k} a_{ip} f_p$, the sum of all single 'simplicities' has to be
   increased, i.e.

   \[
   s^2 = \sum_{p=1}^{k} s_p^2 = \frac{1}{m} \sum_{p=1}^{k} \sum_{i=1}^{m} a_{ip}^4 - \frac{1}{m^2} \sum_{p=1}^{k} \left( \sum_{i=1}^{m} a_{ip}^2 \right)^2 \tag{2}
   \]
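
   The simplicity criterion of equations (1) and (2) can be illustrated with a minimal Python sketch. It only evaluates the criterion for a given loading matrix; it does not perform the varimax rotation itself. The function names and the two small loading matrices are hypothetical and are not taken from the study.

```python
def factor_simplicity(loadings):
    """Variance of the squared loadings of one factor, as in equation (1)."""
    m = len(loadings)
    squares = [a * a for a in loadings]
    mean_of_fourth_powers = sum(s * s for s in squares) / m
    squared_mean_of_squares = (sum(squares) / m) ** 2
    return mean_of_fourth_powers - squared_mean_of_squares


def total_simplicity(loading_matrix):
    """Sum of the per-factor simplicities, as in equation (2).

    Rows are variables (symptoms), columns are factors.
    """
    n_factors = len(loading_matrix[0])
    return sum(
        factor_simplicity([row[p] for row in loading_matrix])
        for p in range(n_factors)
    )


# Hypothetical loadings of 4 variables on 2 factors (illustration only).
# In `simple`, every variable loads on exactly one factor.
simple = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# In `mixed`, every variable loads equally on both factors.
r = 0.5 ** 0.5
mixed = [[r, r], [r, r], [r, r], [r, r]]

print(total_simplicity(simple))
print(total_simplicity(mixed))
```

   Under these assumptions the configuration in which each variable is described by a single factor yields the larger criterion value, which is the sense in which varimax prefers 'simple structure'.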
   The weights of the resulting five factors were transferred to membership functions of symptoms. In this way the
   symptoms reveal different memberships in the different aspects of language disorders. Fuzzy c-means clustering
   (Bezdek, 1981) was then used to assign the symptoms to the different entities. Because each of the five
   factors is polarized, this results in at least 10 categories.
   The algorithm comprises the following steps (cf. Zimmermann, 1996):
   Step 1. Choose the number of classes c, the number of objects A, and a weighting factor m, so that 1 < m < ∞.
   Initialize the membership matrix $U^{0}$.
   Step 2. Calculate the c fuzzy cluster centers $v_i$ from the current memberships $\mu_{ik}$ and the objects $x_k$:

   \[
   v_i = \frac{\sum_{k=1}^{A} (\mu_{ik})^{m} \, x_k}{\sum_{k=1}^{A} (\mu_{ik})^{m}} , \qquad i = 1, \ldots, c \tag{3}
   \]

   Step 3. Calculate the new membership of all objects to the c classes, where $d_{ik}$ denotes the distance between
   object $x_k$ and cluster center $v_i$:

   \[
   \mu_{ik} = \frac{1}{\sum_{j=1}^{c} \left( \frac{d_{ik}}{d_{jk}} \right)^{\frac{2}{m-1}}} , \qquad i = 1, \ldots, c; \quad k = 1, \ldots, A \tag{4}
   \]

   Step 4. Compare the membership matrices before and after the iteration:

   \[
   \left\lVert U^{n} - U^{n-1} \right\rVert \le \varepsilon \tag{5}
   \]

   where n is the number of the current iteration. If the difference between the respective membership matrices does
   not exceed a predefined threshold ε, then stop; else go back to step 2.
   The resulting partition should separate the different clusters sufficiently. For practical reasons we
   examined only two clusters in each run, i.e., for each factor.
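
   The four steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the Euclidean distance, the random initialization of the membership matrix, the maximum norm in the stopping test, and the name `fuzzy_c_means` are all choices made here, and the example scores are hypothetical.

```python
import random


def fuzzy_c_means(points, c=2, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Minimal fuzzy c-means sketch; returns (centers, memberships).

    points: list of equal-length tuples. memberships[i][k] is the
    membership of object k in class i.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Step 1: random initial memberships, each column normalised to sum to 1
    u = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(c)]
    for k in range(n):
        col = sum(u[i][k] for i in range(c))
        for i in range(c):
            u[i][k] /= col

    centers = []
    for _ in range(max_iter):
        # Step 2: centers as membership-weighted means, equation (3)
        centers = []
        for i in range(c):
            w = [u[i][k] ** m for k in range(n)]
            total = sum(w)
            centers.append(tuple(
                sum(w[k] * points[k][d] for k in range(n)) / total
                for d in range(dim)
            ))

        # Step 3: new memberships from relative distances, equation (4)
        new_u = [[0.0] * n for _ in range(c)]
        for k in range(n):
            dist = [
                sum((points[k][d] - centers[i][d]) ** 2 for d in range(dim)) ** 0.5
                for i in range(c)
            ]
            zero = [i for i in range(c) if dist[i] == 0.0]
            for i in range(c):
                if zero:
                    # object coincides with a center: split membership among those centers
                    new_u[i][k] = 1.0 / len(zero) if i in zero else 0.0
                else:
                    new_u[i][k] = 1.0 / sum(
                        (dist[i] / dist[j]) ** (2.0 / (m - 1.0)) for j in range(c)
                    )

        # Step 4: stop when the largest membership change is at most eps, equation (5)
        change = max(abs(new_u[i][k] - u[i][k]) for i in range(c) for k in range(n))
        u = new_u
        if change <= eps:
            break

    return centers, u


# Hypothetical one-dimensional scores (illustration only, not data from the study)
scores = [(0.05,), (0.10,), (0.15,), (0.85,), (0.90,), (0.95,)]
centers, memberships = fuzzy_c_means(scores, c=2)
print(sorted(v[0] for v in centers))
```

   With c = 2, as in the procedure described above, each object receives two memberships that sum to 1. Since the result may depend on the initialization, running the sketch with several values of `seed` is one simple way to check for local minima.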




   RESULTS

   The resulting classes of the clustering method are presented in Figures 1 and 2. For graphical interpretation the
   different symptoms were ordered according to their membership in the respective feature, e.g., severe or
   moderate overall severity of disturbance. It can be seen that the clustering procedure leads to clearly
   distinguishable classes of symptoms. The clusters can be separated easily, as indicated by the small areas of
   overlap between the respective features. Moreover, the description of language failures by c-means classification
   of the analyzed factors corresponds in many, but not in all, cases to the traditional diagnostic scheme.
   However, differences between the factors are also visible. The slope in the presentation of factor
   III is too steep to be the basis of a clinical interpretation.


   [Figure 1, left panel: Factor I; legend: severe, moderate. Right panel: Factor II; legend: motoric, sensoric.
   Horizontal axis: symptoms; vertical axis: 0 to 1.2.]

   Figure 1: Graphical presentation of the results of the c-means clustering. Factor I (left) represents
   the overall severity of disturbance whereas factor II (right) indicates the more expressive or more
   comprehensive character of the language disorder.


   [Figure 2, upper left panel: Factor III; legend: high, low. Upper right panel: Factor IV; legend: highly aware,
   not aware. Lower panel: Factor V; legend: severe, moderate. Horizontal axis: symptoms; vertical axis: 0 to 1.2.]

   Figure 2: Graphical presentation of the results of the c-means clustering. Factors III and IV (upper row) represent
   the granularity of the phonetic language disorders and the patients' awareness of the disease; factor V (below)
   shows the deficits in communication.



   DISCUSSION AND CONCLUSION

   Classification is a common, pragmatic tool in clinical medicine. It is the basis for diagnostic and hence for
   therapeutic decisions. We used fuzzy c-means clustering for classification after feature extraction from an aphasia
   database. The additional feature extraction helps to ensure the statistical validity of the factors. It is obvious that
   the information contained in the factors is already present in the original state, that is, before extraction. The
   extracted factors have been polarized. Consequently, the factors can be transferred to membership functions of
   symptoms using fuzzy c-means, and the five factors lead to at least 10 categories.
   The clustering method seems to be insufficient to distinguish the granularity of the phonetic mistakes correctly.
   Nevertheless, overall severity of the disease and the character of the language disorder can be distinguished much
   better. For practical reasons, these points seem to be of greater importance. The former point determines the
   clinical outcome and the prognosis for the patient. The latter includes the differentiation between the different
   entities of aphasia and hence is the major input for the determination of the therapy.


   Fuzzy clustering of uncertain data: a model for dealing with medical ambiguities

   The ambiguities inherent in the definition of the aphasic syndromes (Marshall, 1986) cannot be resolved
   completely by the applied algorithm. Syndromes are defined probabilistically rather than crisply
   (Marshall, 1986), and the classification features overlap between different categories. A symptom may belong to
   more than one class. This is in accordance with the classical taxonomy of aphasia, which is also polytypic in
   nature (cf. Axer et al., 2000). This taxonomy is based upon anatomical models developed more than a century ago
   (Broca, 1861; Wernicke, 1874). The design of all neuropsychological language tests is based upon the classical
   classification scheme above. Despite the emphasis on standardization, the uncertainty inherent in neuropsychological
   testing leads to some inconsistencies across all tests. As these ambiguities exist, the application of fuzzy
   methods seems to be an adequate means of exploring the results of the patients' clinical investigations.
   Moreover, the described ambiguities can be generalized to many problems of classification in medicine.
   The question is: What is the benefit of using methods of soft computing in this field? Does it make sense to use
   artificial procedures for exploring data, when even the clinical expert cannot resolve the ambiguities of the
   clinical syndromes? During the last decade much research has been focused on the advance of computational
   methods to analyze large data collections. A physician, who has to work with large collections of medical data
   should know the possibilities and dangers of computational methods in dealing with this kind of information. On
   the other hand computer scientists working on medical software should be exposed to medical data analysis as
   well as to the specific purposes of medical knowledge. Computers in medicine cannot replace the medical expert
   in diagnostic or therapeutic decision making. However, computers in general, and especially Fuzzy techniques,
   may facilitate standardization of classification routines and hence can be important supportive tools for the
   physician in practice as well as valuable tools in medical quality control and medical training. In addition, the
   communication between medical scientists and computer engineers may lead to an interdisciplinary advance in
   the analysis of inconsistencies in medical classifications. In this way, soft computing can be used to generate
   models to be used for different medical disciplines.


   REFERENCES

   Axer, H., Jantzen, J., Berks, G., Südfeld, G., Keyserlingk, D.G.v., 2000, "The Aphasia Database on the Web:
      Description of a Model for Problems of Classification in Medicine." Proc. ESIT 2000.
   Bezdek, J.C., 1981, "Pattern Recognition with Fuzzy Objective Function Algorithms." Plenum Press, New York,
      London.
   Broca, P., 1861, "Remarques sur le siège de la faculté de langage articulé, suivie d'une observation d'aphémie
      (perte de la parole)." Bull Soc Anat 6, pp. 330-57.
   Huber, W., Poeck, K., Weniger, D., 1983, "Aachener Aphasie Test (AAT)." Hogrefe, Göttingen.
   Huber, W., Poeck, K., Weniger, D., 1984, "The Aachen Aphasia Test." In: Rose, F.C., "Advances in Neurology.
      Vol. 42: Progress in Aphasiology." Raven, New York.
   Keyserlingk, D.G.v., Jantzen, J., Berks, G., Keyserlingk, A.G.v., Axer, H., 2000, "Critical Data Analysis Precedes Soft
      Computing of Medical Data." Proc. ESIT 2000.
   Marshall, J.C., 1986, "The description and interpretation of aphasic language disorder." Neuropsychologia 24,
      pp. 5-24.
   Weber, E., 1980, "Grundriss der Biologischen Statistik." Gustav Fischer Verl., Jena.
   Wernicke, C., 1874, "Der aphasische Symptomenkomplex. Eine psychologische Studie auf anatomischer Basis."
      Max Cohn & Weigert, Breslau.
   Zimmermann, H.-J., 1996, "Fuzzy Set Theory." 3rd Ed., Kluwer Acad. Publ., Boston, MA, USA.



