Lobachevsky State University of Nizhny Novgorod
Faculty of Computational Mathematics and Cybernetics
Chair of Mathematical Support of Computers


Adaptation of Hierarchical clustering by areas
for automatic construction of an electronic catalogue

Prepared by
  Fedor Vladimirovich Borisyuk

Scientific adviser
  Doctor of Technical Sciences, Professor
  Vladimir Ivanovich Shvetsov

Nizhny Novgorod, 2010
Problem definition

Initial data
  A collection of text documents.

Purpose
  Automatically construct a hierarchical catalogue
  that reflects the thematic areas of the given
  initial collection.
Examples of web catalogues:
  Yandex Catalogue

  Yandex Catalogue stores information on tens of
  thousands of Russian websites.

  It uses 16 major topics, each no more than six
  levels deep.

  Yandex Catalogue was compiled and is updated
  manually.
Examples of web catalogues:
  eLibrary.ru

  Russian scientific electronic library eLIBRARY.RU:

  12 million scientific articles.

  Uses the GRNTI library classifier (State Rubricator
  of Scientific and Technical Information).

  Maintained by a group of experts.

  Catalogue depth is no more than 3.
Why construct an electronic catalogue automatically

Large amounts of text data are accumulated and are
continuously growing.

Most catalogues are maintained with the support of
human experts. High labor costs!

Subjectivity of human experts.

Traditional, manually prepared catalogues and
classifiers cannot keep pace with the high rate of
informational growth in the required areas.
Related works

Tao Li and Shenghuo Zhu used a linear discriminant
projection approach to transform the document space
into a lower-dimensional space, and then clustered
the documents into a hierarchy using the Hierarchical
agglomerative clustering algorithm.

O. Peskova developed a modification of Ayvazyan's
layerwise clustering method. The developed clustering
method showed a 4% advantage in average F-measure
over the Hierarchical agglomerative clustering
algorithm.
Mechanism of automatic construction of electronic catalogue

Unclassified text collection
  -> Preparation of document images for clustering
  -> Hierarchical clustering by areas
     (driven by the parameters of the clustering algorithm)
  -> Post-processing of the hierarchical structure
  -> Hierarchical structure of the electronic catalogue
Preparation of document images:
Suggested keyword selection algorithm

1. For every word of the document, a stem is extracted
   using the Porter algorithm.
2. Stop words are removed, together with words whose
   frequency is above a predefined maximum or below a
   predefined minimum.
3. The weight of stem_i in document D is calculated
   using a modified TFxIDF formula.
4. No more than 300 stems with the highest weight are
   selected as keywords to represent the document.
5. The number of keywords is further reduced using the
   suggested selective feature reduction algorithm.
Weighting formula

TFxIDF weighting formulas:

  IDF_i = log(1 + TDN / DN_i)

  Weight_D(stem_i) = (0.5 + 0.5 * TF_i / MaxStemFreq_D) * IDF_i

Notations:
  TF_i          - term frequency of stem_i in document D
  MaxStemFreq_D - maximum frequency among all stems in D
  TDN           - total number of documents in the collection
  DN_i          - number of documents in which the stem occurs
  IDF_i         - inverse document frequency
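The weighting formula above can be sketched in Python. The function name, the dict-based sparse representation, and the tie-breaking of equally weighted stems are illustrative assumptions; only the formula itself and the 300-keyword cap come from the slides.

```python
import math
from collections import Counter

def stem_weights(stems, total_docs, doc_freq, max_keywords=300):
    """Weight the stems of one document with the modified TFxIDF formula
    and keep at most max_keywords of the highest-weighted stems.

    stems      -- stems of document D (after stop-word and frequency filtering)
    total_docs -- TDN, total number of documents in the collection
    doc_freq   -- dict: stem -> DN_i, number of documents containing the stem
    """
    tf = Counter(stems)            # TF_i for every stem of D
    max_freq = max(tf.values())    # MaxStemFreq_D
    weights = {
        stem: (0.5 + 0.5 * freq / max_freq)
              * math.log(1 + total_docs / doc_freq[stem])
        for stem, freq in tf.items()
    }
    # Keep only the top max_keywords stems as the document's keyword vector.
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return dict(top[:max_keywords])
```

A rarer stem gets a larger IDF factor, so it can outweigh a more frequent but common stem.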
Suggested selective feature space reduction

Purpose: select the keywords with the best discrimination
power in relation to the possible catalogue areas.

Selective feature space reduction algorithm:
1.   Cluster the document collection using the modified
     Hierarchical by areas algorithm. Each area in the tree
     is characterized by a keyword vector.
2.   Execute the keyword extraction algorithm on the areas'
     keyword vectors to select the keywords of each area
     in relation to the other areas.
3.   Remove from the documents those keywords which are not
     present in the areas' feature space.
Basics of Hierarchical by areas clustering

1. The object of clustering is a text document.
2. A document is characterized by a vector of
   keywords.
3. Tree of Areas:
                                    A

                            B               D
                                    C

                        E       F       G       H
Characteristics of an Area

1. An area is characterized by a vector of keywords,
   prepared from the keywords of the documents in
   this area. Each keyword has a weight.
2. The documents which belong to the area. There is
   a limit on the number of documents in an area.
3. An area can have children. There is a limit on
   the number of children.
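The three characteristics above suggest a simple data structure. This is a minimal sketch; the field names and the concrete limit values are assumptions, since the slides do not specify them.

```python
from dataclasses import dataclass, field

MAX_DOCUMENTS = 50   # assumed per-area document limit (not given on the slides)
MAX_CHILDREN = 8     # assumed per-area children limit (not given on the slides)

@dataclass
class Area:
    keywords: dict = field(default_factory=dict)    # stem -> weight, aggregated from member documents
    documents: list = field(default_factory=list)   # documents assigned directly to this area
    children: list = field(default_factory=list)    # child areas

    def is_crowded(self):
        """True when the document limit is exceeded (triggers a divide)."""
        return len(self.documents) > MAX_DOCUMENTS

    def needs_integration(self):
        """True when the children limit is exceeded (triggers an integrate)."""
        return len(self.children) > MAX_CHILDREN
```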
Startup

Suppose we have an incoming flow of documents.

The Area tree is built incrementally from the
incoming documents.
Hierarchical by areas clustering

Step 1. Area = root area.
        Verify the possibility of inserting the document Doc.
        Put Doc into the RecycleBin if its proximity is below the minimum.
Step 2. Search for the child of Area closest to Doc.
Step 3. IF the child is closer to Doc than Area is,
        THEN Area = child and go to step 2.
Step 4. Insert the document into Area.
Step 5. Verify limits: IF Area is crowded, divide it.
        IF the number of children exceeds the limit,
        integrate them.
Step 6. Update the keyword sets of the areas located
        on the path to the resulting area.
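The descent steps above can be sketched as follows. The `SimpleArea` holder, the cosine proximity measure, and the `min_proximity` argument are assumptions for illustration; steps 5-6 (limit checks and keyword updates along the path) are deliberately left out.

```python
import math

class SimpleArea:
    """Minimal area node, just enough for the sketch below."""
    def __init__(self, keywords):
        self.keywords = keywords    # stem -> weight
        self.children = []
        self.documents = []

def proximity(u, v):
    """Cosine similarity between two sparse keyword vectors (an assumed measure)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def insert_document(root, doc, min_proximity, recycle_bin):
    # Step 1: start at the root; too-distant documents go to the RecycleBin.
    if proximity(doc, root.keywords) < min_proximity:
        recycle_bin.append(doc)
        return None
    area = root
    while True:
        # Step 2: find the child of the current area closest to the document.
        child = max(area.children,
                    key=lambda c: proximity(doc, c.keywords), default=None)
        # Step 3: descend while the child is closer to the document than the area.
        if child is not None and proximity(doc, child.keywords) > proximity(doc, area.keywords):
            area = child
            continue
        # Step 4: insert the document into the current area.
        area.documents.append(doc)
        # Steps 5-6 (divide/integrate on limit violation, keyword updates
        # along the path) are omitted from this sketch.
        return area
```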
RecycleBin

All documents which do not meet the entry criteria of
the areas at a certain level are temporarily stored in
a special area at the same level: the RecycleBin.

When the number of objects in the RecycleBin exceeds a
predefined limit, the RecycleBin is divided and the
detached area is connected to the current level.
Divide operation

Reason: too many documents in the area.

The area is divided into two parts using the K-means
algorithm.

The resulting areas C and D are connected to the tree;
area B hosts their integrated characteristics.
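A minimal 2-means split over keyword vectors might look like the sketch below. The slides only say that K-means is used; the cosine similarity measure and the crude seeding with the first and last documents are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse keyword vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_vector(docs):
    """Average several sparse keyword vectors into one center."""
    total = {}
    for d in docs:
        for k, w in d.items():
            total[k] = total.get(k, 0.0) + w
    return {k: w / len(docs) for k, w in total.items()}

def divide(docs, iterations=10):
    """Split a crowded area's documents into two groups with 2-means."""
    centers = [docs[0], docs[-1]]       # crude seeding with two documents
    groups = ([], [])
    for _ in range(iterations):
        groups = ([], [])
        for d in docs:
            best = 0 if cosine(d, centers[0]) >= cosine(d, centers[1]) else 1
            groups[best].append(d)
        # Recompute each center as the mean of its group (keep it if empty).
        centers = [mean_vector(g) if g else c for g, c in zip(groups, centers)]
    return groups
```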
Integrate operation

Reason: the number of children exceeds the predefined
limit.

Find the two closest areas (B and C) and unite them
under one parent area.

The parent area D takes as its center the average of
the keyword vectors of the two integrated areas B and C.
The Area tree is filled like a pyramid of champagne
glasses.
Post-processing of the generated hierarchical structure

Set each area's title to the first three keywords from
the top of the area's keyword vector sorted by weight.

For the whole tree, make links between areas at the
same level if the proximity between the keyword vectors
of these areas is greater than a calculated threshold.
Purpose: referring the reader to similar or related
rubrics.
Test collections

Collection name   Collection characteristics

20NewsGroups      20,000 articles evenly divided
                  among 20 Usenet newsgroups.
                  Language: English

NNSU8             1302 scientific articles taken from
                  the portal of Nizhny Novgorod
                  State University.
                  Language: Russian
External clustering evaluation

For each pair of documents (Di, Dj):

                                   Di and Dj in one      Di and Dj in different
                                   cluster of "sample"   clusters of "sample"
                                   partitioning          partitioning

Di and Dj in one cluster           tp (true positive)    fp (false positive)
of automatic clustering

Di and Dj in different clusters    fn (false negative)   tn (true negative)
of automatic clustering

Recall    = tp / (tp + fn)
Precision = tp / (tp + fp)
F-measure = 2 * Recall * Precision / (Precision + Recall)
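The pairwise metrics above can be computed directly from two label assignments. This is a straightforward sketch; the function name and the label-list input format are assumptions.

```python
from itertools import combinations

def pairwise_scores(sample_labels, auto_labels):
    """Pairwise Recall, Precision and F-measure between a reference
    ("sample") partitioning and an automatic clustering, both given as
    lists of cluster labels indexed by document."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(sample_labels)), 2):
        same_sample = sample_labels[i] == sample_labels[j]
        same_auto = auto_labels[i] == auto_labels[j]
        if same_auto and same_sample:
            tp += 1           # together in both partitionings
        elif same_auto:
            fp += 1           # together only in the automatic clustering
        elif same_sample:
            fn += 1           # together only in the sample partitioning
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f = (2 * recall * precision / (recall + precision)
         if recall + precision else 0.0)
    return recall, precision, f
```

A clustering identical to the sample partitioning scores 1.0 on all three metrics.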
Computational experiments:
Evaluation of average metrics

              Hierarchical by areas     Hierarchical agglomerative
Metric        20NewsGroups   NNSU8      20NewsGroups   NNSU8

Recall        0.79           0.66       0.10           0.40
Precision     0.35           0.59       0.11           0.38
F-measure     0.48           0.60       0.10           0.33
Time (msec)   2505           2391       2896           45116
Top levels of catalogue generated by
the hierarchical clustering by areas

NNSU8 collection:
  Area 1: Law, philosophy
  Area 2: Mathematics
  Area 3: Sociology
  Area 4: Economics
  Area 5: Physics
  Area 6: Biology, Chemistry

20NewsGroups collection:
  Area 1: talk.politics.mideast, talk.politics.guns, talk.politics.misc
  Area 2: comp.graphics, comp.os.ms-windows.misc
  Area 3: rec.sport.baseball, rec.sport.hockey, rec.autos, rec.motorcycles
  Area 4: sci.crypt, sci.med, sci.space
  Area 5: comp.sys.ibm.pc.hardware, comp.sys.mac.hardware
  Area 6: soc.religion.christian
  Area 7: sci.electronics, misc.forsale, comp.windows.x
  Area 8: talk.religion.misc
Conclusions

An effective method of keyword extraction from text
documents for the purpose of text clustering has been
presented.

The computational experiments showed the efficiency
of the suggested approach, using the Hierarchical
clustering by areas algorithm for automatic
construction of an electronic catalogue.
  Thank you! Questions?



fedorvb@gmail.com – Fedor Vladimirovich Borisyuk
shvetsov@unn.ru - Vladimir Ivanovich Shvetsov

				