Lobachevsky State University of Nizhny Novgorod
Faculty of Computational Mathematics and Cybernetics
Chair of the Mathematical Support of Computer

Adaptation of Hierarchical clustering by areas
  for automatic construction of electronic catalogue

                               Prepared by
                                 Fedor Vladimirovich Borisyuk

                               Scientific adviser
                                  Doctor of technical sciences,
                                  Vladimir Ivanovich Shvetsov
                  Nizhny Novgorod, 2010
           Problem definition
Initial data
 Collection of text documents.

Task
 Automatically construct a hierarchical catalogue
 which reflects the thematic areas of the given
 initial collection.
       Examples of web catalogues:
           Yandex Catalogue

  Yandex Catalogue stores information on tens of
  thousands of Russian web sites.

  Uses 16 major topics, each of them no more than
  six levels in depth.

  Yandex Catalogue was compiled and is updated
  manually by editors.
     Examples of web catalogues:
Russian scientific electronic library eLIBRARY.RU

  12 million scientific publications.

  Uses the library classificator GRNTI (State
  rubricator of scientific and technical
  information).

  Maintained by a group of experts.

  Depth of catalogue is no more than 3.
   Why to construct electronic
    catalogue automatically
Big amounts of text data are accumulated and are
continuously growing.

Most of the catalogues are maintained with the
support of human experts. High labor costs!

Subjectivity of human experts.

Traditional, manually prepared catalogues and
classifiers cannot reflect the high rates of
informational progress in the required areas.
            Related works
Tao Li and Shenghuo Zhu used a linear
discriminant projection approach to transform
the document space onto a lower-dimensional
space, and then clustered the documents into a
hierarchy using the Hierarchical agglomerative
clustering algorithm.
O. Peskova develops a modification of the
layerwise clustering method of Ayvazyan. A 4%
advantage in average F-measure was found for the
developed clustering method over the Hierarchical
agglomerative clustering algorithm.
  Mechanism of automatic construction
        of electronic catalogue

Unclassified collection
  -> Preparation of document images for clustering
  -> Hierarchical clustering by areas
     (guided by parameters of the clustering algorithm)
  -> Post-processing
  -> Hierarchical structure of catalogue
    Preparation of document images:
Suggested algorithm of keywords selection
 1. For all words of the document, the stem is extracted
    using the Porter algorithm.
 2. Stop words are removed, along with words whose frequency
    is more than a predefined maximum frequency or less than
    a predefined minimum frequency.
 3. The weight of stem_i in the document D is calculated
    using a modified TFxIDF formula.
 4. No more than 300 stems with the highest weight are
    selected as keywords to represent the document.
 5. The number of keywords is reduced using the suggested
    selective feature reduction algorithm.
               Weighting formula
TFxIDF weighting formulas

   IDF_i = log(1 + TDN / DN_i)

   Weight_D(stem_i) = (0.5 + 0.5 * TF_i / MaxStemFreq_D) * IDF_i

  TF_i - term frequency of stem_i in document D
  MaxStemFreq_D - max frequency among all stems in D
  TDN - total number of documents in collection
  DN_i - number of documents where this stem occurs
  IDF_i - inversed document frequency.
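The weighting can be sketched in a few lines. This is a minimal illustration of the formulas above as reconstructed from the slide (the IDF denominator DN_i is recovered from the symbol definitions; the logarithm base is not specified, so the natural log is assumed):

```python
import math

def idf(total_docs, doc_freq):
    # IDF_i = log(1 + TDN / DN_i); natural log assumed.
    return math.log(1 + total_docs / doc_freq)

def stem_weight(tf, max_stem_freq, total_docs, doc_freq):
    # Weight_D(stem_i) = (0.5 + 0.5 * TF_i / MaxStemFreq_D) * IDF_i
    # The (0.5 + 0.5 * ...) part is the augmented term frequency,
    # normalized by the most frequent stem in the document.
    return (0.5 + 0.5 * tf / max_stem_freq) * idf(total_docs, doc_freq)
```

For example, a stem that is the most frequent in its document (TF_i = MaxStemFreq_D) and occurs in every document of the collection gets weight (0.5 + 0.5) * log(2).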
Suggested selective feature space reduction
  Purpose: select keywords with the best discrimination
     power in relation to possible catalogue areas.
  Selective feature space reduction algorithm:
  1.   Cluster the document collection using the modified
       Hierarchical by areas algorithm. Each area in the tree
       is characterized by a keywords vector.
  2.   Execute the keywords extraction algorithm on the areas'
       keywords vectors to select the keywords of each area
       in relation to the other areas.
  3.   Remove from the documents the keywords which are not
       present in the areas' feature space.
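Step 3 of the reduction is the simplest to illustrate. The sketch below assumes steps 1-2 have already produced the union of the areas' keyword vectors (here `area_feature_space`, a hypothetical name); document keywords are assumed to be stem-to-weight mappings:

```python
def reduce_document_keywords(doc_keywords, area_feature_space):
    # Keep only the document keywords that are present in the
    # areas' feature space (step 3 of the reduction algorithm).
    # doc_keywords: dict stem -> weight; area_feature_space: set of stems.
    return {stem: w for stem, w in doc_keywords.items()
            if stem in area_feature_space}
```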
 Basics of Hierarchical by areas clustering
1. The object of clustering is a text document.
2. A document is characterized by a vector of keywords.
3. Tree of Areas.

[Diagram: tree of areas with child areas B and D;
 B has children E and F, D has children G and H.]
          Characteristics of an Area
1. An Area is characterized by a vector of keywords, which
    is prepared from the keywords of the documents in this
    area. Each keyword has a weight.
2. Documents which belong to the Area. There is a limit on
    the number of documents in an Area.
3. An Area can have children. There is a limit on the
   number of children.
Suppose we have an incoming flow of documents.

The algorithm incrementally builds the Area tree
from the incoming documents.
     Hierarchical by areas clustering
Step 1. Area = Root area.
        Verify the possibility to insert the document Doc.
        Put Doc into the RecycleBin if its proximity is less
        than the minimum.
Step 2. Search for the child of Area closest to Doc.
Step 3. IF the Child is closer to Doc than Area is,
        THEN Area = Child and go to step 2.
Step 4. Insert the document into Area.
Step 5. Verify limits: IF Area is crowded, THEN divide it.
        IF the number of children is more than the limit,
        THEN integrate them.
Step 6. Update the sets of keywords of the areas which are
        located on the path to the resulting area.

  All documents which do not meet the entry criteria of the
areas of a certain level are temporarily stored in a special
area on the same level - the RecycleBin.

  When the number of objects in the RecycleBin exceeds the
predefined limit, the RecycleBin is divided and the detached
area is connected to the current level.
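The descent of steps 1-4 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Area` class, the shared-keyword `proximity` measure, and the function names are assumptions (the paper's actual proximity over weighted keyword vectors is not specified on these slides), and the divide/integrate checks of steps 5-6 are omitted:

```python
class Area:
    """Hypothetical area node: keyword set, child areas, documents."""
    def __init__(self, keywords):
        self.keywords = set(keywords)
        self.children = []
        self.documents = []

def proximity(doc_keywords, area):
    # Illustrative proximity: number of shared keywords.
    return len(set(doc_keywords) & area.keywords)

def insert_document(root, doc, min_proximity, recycle_bin):
    # Step 1: documents not close enough to the root area go to
    # the RecycleBin.
    if proximity(doc, root) < min_proximity:
        recycle_bin.append(doc)
        return None
    area = root
    # Steps 2-3: descend while some child is closer to the document
    # than the current area is.
    while area.children:
        child = max(area.children, key=lambda c: proximity(doc, c))
        if proximity(doc, child) > proximity(doc, area):
            area = child
        else:
            break
    # Step 4: insert into the chosen area.
    area.documents.append(doc)
    return area
```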
                 Divide operation

Reason: too many documents in the area.

  Divide the area using the K-means algorithm into two parts.

  Connect the new areas C and D to the tree. Area B will host
their integrated characteristics.

[Diagram: area B under A is divided into child areas C and D.]
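The two-way split at the heart of the divide operation can be sketched with a plain k-means (k = 2). This is an illustration under assumed helpers: `dist` and `mean` stand in for whatever distance and centroid operations the paper applies to keyword-weight vectors:

```python
import random

def two_means(docs, dist, mean, iters=10):
    # Split a crowded area's documents into two groups with k-means,
    # k = 2, as the divide operation requires.
    random.seed(0)                      # deterministic start for the sketch
    centers = random.sample(docs, 2)
    groups = [[], []]
    for _ in range(iters):
        groups = [[], []]
        # Assign each document to its nearest center.
        for d in docs:
            i = 0 if dist(d, centers[0]) <= dist(d, centers[1]) else 1
            groups[i].append(d)
        # Recompute centers; keep the old center if a group is empty.
        centers = [mean(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return groups
```

Here documents can be anything `dist` and `mean` understand; a one-dimensional toy run with numbers already shows the two clusters separating.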
              Integrate operation

Reason: the number of children is more than the predefined
limit.

  Find the two closest areas (B and C) and unite them under
one parent area.

   The parent area D will have as its center the average of
the keywords vectors of both integrated areas B and C.

[Diagram: children B, C, X of area A; B and C are united
 under a new parent area D.]

The Areas tree is filled like a pyramid.
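A minimal sketch of the integrate operation, under assumptions: the `Area` class here is a stand-in (a center plus children), and `dist` and `average` abstract the distance and averaging over keyword vectors that the slides describe:

```python
from itertools import combinations

class Area:
    """Hypothetical area node with a center and child areas."""
    def __init__(self, center):
        self.center = center
        self.children = []

def integrate(area, dist, average):
    # Find the two closest child areas (B and C in the slide) and
    # unite them under a new parent whose center is the average of
    # their centers.
    b, c = min(combinations(area.children, 2),
               key=lambda pair: dist(pair[0].center, pair[1].center))
    parent = Area(average(b.center, c.center))
    parent.children = [b, c]
    area.children = [ch for ch in area.children if ch not in (b, c)]
    area.children.append(parent)
    return parent
```

This reduces the child count of the crowded area by one per call, so repeating it restores the children limit.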
   Post-processing of the generated
      hierarchical structure
Set each area's title to the three first keywords from the
top of the area's keywords vector sorted by weight.
For the whole tree, make links between areas at the same
level if the proximity between the keyword vectors of these
areas exceeds a calculated threshold. Purpose: referring the
user to similar or related areas.
         Test collections

Collection name   Collection characteristics
20NewsGroups      20000 articles evenly divided
                  among 20 Usenet newsgroups.
NNSU8             1302 scientific articles taken
                  from the portal of Nizhny
                  Novgorod State University.
       External clustering evaluation

For each pair of documents (Di, Dj), count:
  tp (true positive)  - Di and Dj are in one cluster of the
                        automatic partitioning and in one
                        cluster of the "sample" partitioning
  fp (false positive) - in one cluster of the automatic
                        partitioning, but in different
                        clusters of the sample partitioning
  fn (false negative) - in different clusters of the
                        automatic partitioning, but in one
                        cluster of the sample partitioning
  tn (true negative)  - in different clusters of both
                        partitionings

Recall    = tp / (tp + fn)
Precision = tp / (tp + fp)
F-measure = 2 * Recall * Precision / (Precision + Recall)
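The pairwise evaluation above translates directly into code. A self-contained sketch (function and variable names are illustrative; partitionings are assumed to be document-to-label mappings):

```python
from itertools import combinations

def pairwise_scores(sample, automatic):
    # sample, automatic: dict doc_id -> cluster label.
    tp = fp = fn = tn = 0
    for di, dj in combinations(sorted(sample), 2):
        same_s = sample[di] == sample[dj]        # same "sample" cluster?
        same_a = automatic[di] == automatic[dj]  # same automatic cluster?
        if same_a and same_s:
            tp += 1
        elif same_a:
            fp += 1
        elif same_s:
            fn += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f = (2 * recall * precision / (recall + precision)
         if recall + precision else 0.0)
    return recall, precision, f
```

For instance, with sample clusters {1, 2 | 3} and an automatic partitioning that merges everything into one cluster, recall is 1.0 but precision drops to 1/3.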
     Computational experiments:
     Evaluation of average metrics

             Hierarchical by areas    Agglomerative
Metric       20NewsGroups   NNSU8     20NewsGroups   NNSU8
Recall           0.79       0.66          0.1        0.40
Precision        0.35       0.59          0.11       0.38
F-measure        0.48       0.6           0.1        0.33
                 2505       2391          2896       45116
     Top levels of catalogue generated by
     the hierarchical clustering by areas

NNSU8 collection             20NewsGroups collection
Area  Members                Area  Members
1     Law, philosophy        1     talk.politics.mideast, talk.politics.guns
2     Mathematics            2     ...
3     Sociology              3     ...
4     Economics              4     sci.crypt, ...
5     Physics                5     ...
6     Biology, Chemistry     6     soc.religion.christian
                             8     talk.religion.misc
            Conclusions
An effective method of keywords extraction from
text documents for the purpose of text clustering
is presented.

The conducted computational experiments showed
the efficiency of the suggested approach using
the Hierarchical clustering by areas algorithm
for automatic construction of an electronic
catalogue.
  Thank you! Questions? – Fedor Vladimirovich Borisyuk - Vladimir Ivanovich Shvetsov