					Classification and clustering methods
development and implementation for
unstructured documents collections
           by
           Osipova Nataly
    St.Petesburg State University
    Faculty of Applied Mathematics and Control Processes
    Department of Programming Technology
Contents

  Introduction
  Methods description
  Information Retrieval System
  Experiments
Contextual Document Clustering

   was developed in a joint project of the
   Applied Mathematics and Control
   Processes Faculty, St. Petersburg State
   University, and the
   Northern Ireland Knowledge Engineering
   Laboratory (NIKEL), University of Ulster.
Definitions

  Document
  Terms dictionary
  Dictionary
  Cluster
  Word context
  Context or document conditional
   probability distribution
  Entropy
Document conditional probability
distribution

  Document x

  y       tf(y)   p(y|x)
  word1   5       5/m
  word2   10      10/m
  word3   6       6/m
  …
  wordn   16      16/m

  y – words
  tf(y) – frequency of y in document x
  p(y|x) – conditional probability of y in document x
  m – size of document x

  (5/m, 10/m, 6/m, …, 16/m) – document conditional probability
  distribution
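
The table above can be sketched in a few lines of Python. This is only an illustration: the function name `document_distribution` and the sample words are hypothetical, and the document size m is assumed to be the total number of word occurrences.

```python
from collections import Counter

def document_distribution(words):
    """Return p(y|x) = tf(y) / m for a tokenized document x."""
    tf = Counter(words)       # tf(y): how many times each word y occurs
    m = sum(tf.values())      # m: document size (assumed: total word count)
    return {y: count / m for y, count in tf.items()}

# Hypothetical four-word document.
doc = ["law", "court", "law", "family"]
p = document_distribution(doc)
```

The values of `p` sum to 1, as a probability distribution must.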
Word context
  Word w

  Document x1
  y        tf(y)   p(y|x1)
  word1    5       5/m1
  word2    10      10/m1
  …
  wordn1   16      16/m1

  Document x2
  y        tf(y)   p(y|x2)
  word1    7       7/m2
  word3    12      12/m2
  …
  wordn2   4       4/m2

  …

  Document xk
  y        tf(y)   p(y|xk)
  word1    20      20/mk
  word4    9       9/mk
  …
  wordnk   3       3/mk

  Context conditional probability distribution
  y        tf(y)         p(y|w)
  word1    5+7+20=32     32/m
  word2    10            10/m
  word3    12            12/m
  …
  wordnk   3             3/m
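
A minimal sketch of how the context distribution p(y|w) might be computed. It assumes that the context of w is made up of the documents containing w and that m is the total size of those documents; neither is stated explicitly on the slide. The function name and sample data are hypothetical.

```python
from collections import Counter

def word_context_distribution(w, documents):
    """Pool term frequencies over all documents containing w, then normalize."""
    pooled = Counter()
    for words in documents:
        if w in words:            # assumption: the context is the set of documents containing w
            pooled.update(words)  # add tf(y) of this document for every y
    m = sum(pooled.values())      # assumption: m is the total size of the pooled documents
    return {y: count / m for y, count in pooled.items()}

# Hypothetical collection of three tokenized documents.
docs = [["tax", "law", "law"], ["tax", "trade"], ["water", "law"]]
ctx = word_context_distribution("tax", docs)
```

Here the third document does not contain "tax", so "water" gets no weight in the context of "tax".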
Methods

  document clustering method
  dictionary construction methods
  document classification method using a training set

   Information retrieval methods:
  keyword search method
  cluster based search method
  similar documents search method
Contextual Documents Clustering

  [Diagram: Documents, Dictionary and Narrow context words feed the
  Distance calculation step, which produces the Clusters]
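
The pipeline in the diagram could look roughly as follows. This is a simplified sketch, not the authors' implementation: it assumes that "narrow" means the k contexts with the lowest entropy, that the distance is the Jensen-Shannon divergence, and that each document joins its nearest narrow context. All names and data are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (base 2) of a distribution stored as {word: probability}."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def js_divergence(p1, p2):
    """Jensen-Shannon divergence between two distributions."""
    words = set(p1) | set(p2)
    avg = {y: 0.5 * p1.get(y, 0.0) + 0.5 * p2.get(y, 0.0) for y in words}
    return entropy(avg) - 0.5 * entropy(p1) - 0.5 * entropy(p2)

def cluster_documents(doc_dists, word_contexts, k):
    # Assumption: the k lowest-entropy contexts serve as cluster centres.
    narrow = sorted(word_contexts, key=lambda w: entropy(word_contexts[w]))[:k]
    clusters = {w: [] for w in narrow}
    for doc_id, p in doc_dists.items():
        # Assign the document to the narrow context at the smallest distance.
        nearest = min(narrow, key=lambda w: js_divergence(word_contexts[w], p))
        clusters[nearest].append(doc_id)
    return clusters

# Hypothetical contexts: "the" has a broad context and is not selected.
word_contexts = {
    "water": {"water": 0.8, "pipe": 0.2},
    "court": {"court": 0.7, "law": 0.3},
    "the": {"water": 0.25, "pipe": 0.25, "court": 0.25, "law": 0.25},
}
doc_dists = {"d1": {"water": 1.0}, "d2": {"law": 0.5, "court": 0.5}}
clusters = cluster_documents(doc_dists, word_contexts, k=2)
```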
Entropy

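 Context conditional probability distribution of word y:

                $(p_1, p_2, \dots, p_n)$, where $p_1 + p_2 + \dots + p_n = 1$

                $H(y) = H(p_1, p_2, \dots, p_n) = -\sum_{i=1}^{n} p_i \log p_i$

   Entropy is an uncertainty measure; here it is used to characterize the
   commonness (narrowness) of the word context.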
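
A small sketch of the entropy calculation. The base of the logarithm is not given in the slides; base 2 is assumed here, and the two sample distributions are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy of a list of probabilities (base 2 assumed)."""
    # Terms with p_i = 0 contribute nothing, so they are skipped.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

narrow = entropy([0.97, 0.01, 0.01, 0.01])   # mass concentrated on one word
broad = entropy([0.25, 0.25, 0.25, 0.25])    # mass spread evenly
```

The concentrated distribution has lower entropy than the even one, which is the sense in which low entropy indicates a narrow context.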
Contextual Document Clustering




  $\max H(y) = H(1/n, 1/n, \dots, 1/n) = \log n$
Entropy




  [Figure: entropy H of three distributions, shown at α = 0, 0.5 and 1]
Word Context - Document Distance

 p1 – context conditional probability distribution of word y
 (p1 + p2)/2 – average conditional probability distribution
 p2 – conditional probability distribution of document x
Word Context - Document Distance


    Jensen-Shannon divergence:

    $JS[p_1, p_2] = H\left(\frac{p_1 + p_2}{2}\right) - 0.5\,H(p_1) - 0.5\,H(p_2)$
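
The divergence can be sketched directly from the definition. Distributions are stored as hypothetical `{word: probability}` dictionaries and the logarithm base 2 is an assumption.

```python
import math

def entropy(p):
    """Shannon entropy (base 2) of a distribution stored as {word: probability}."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def js_divergence(p1, p2):
    """JS[p1, p2] = H((p1 + p2) / 2) - 0.5 * H(p1) - 0.5 * H(p2)."""
    words = set(p1) | set(p2)
    # Average distribution; a word missing from one side has probability 0 there.
    avg = {y: 0.5 * p1.get(y, 0.0) + 0.5 * p2.get(y, 0.0) for y in words}
    return entropy(avg) - 0.5 * entropy(p1) - 0.5 * entropy(p2)

same = js_divergence({"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5})
disjoint = js_divergence({"a": 1.0}, {"b": 1.0})
```

Identical distributions are at distance 0, and with base 2 two distributions with no words in common are at the maximum distance of 1.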
Dictionary construction

 Why:
 - large volumes:
   60,000 documents, 50,000 words => 15,000
   words in a context
 - importance of words with narrow contexts
Dictionary construction

 Delete words with
 1. High or low frequency
 2. High or low document frequency
 3. Both 1 and 2
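
A minimal sketch of the third variant, filtering on both frequency and document frequency. The threshold parameters `min_tf`, `max_tf`, `min_df` and `max_df` are hypothetical; the slides give no concrete values.

```python
from collections import Counter

def build_dictionary(documents, min_tf, max_tf, min_df, max_df):
    """Keep words whose frequency and document frequency are both within bounds."""
    tf = Counter()   # total occurrences of each word in the collection
    df = Counter()   # number of documents containing each word
    for words in documents:
        tf.update(words)
        df.update(set(words))
    return {
        y for y in tf
        if min_tf <= tf[y] <= max_tf and min_df <= df[y] <= max_df
    }

# Hypothetical collection: "the" is too frequent, "tax" and "rare" too rare.
docs = [["the", "law", "the"], ["the", "tax"], ["the", "law", "rare"]]
dictionary = build_dictionary(docs, min_tf=2, max_tf=3, min_df=2, max_df=2)
```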
Retrieval algorithms

  keyword search method
  cluster based search method
  search by example method
Keyword search method

 Document 1        Document 2          Document 3   Document 4
 word 1            word 10             word 15      word 11
 word 2            word 25             word 2       word 21
 word 3            word 30             word 32      word 3
 …                 …                   …            …
 word n1           word n2             word n3      word n4




 Request: word 2                Result set:
                                document 1
                                document 3
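
The example above can be reproduced with a simple inverted index. This is a sketch in Python, whereas the system itself is described as using database tables; the function names are hypothetical and the word lists are shortened to the words shown on the slide.

```python
from collections import defaultdict

def build_index(documents):
    """Map every word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, words in documents.items():
        for y in words:
            index[y].add(doc_id)
    return index

def keyword_search(index, word):
    """Return the documents containing the requested word, sorted by name."""
    return sorted(index.get(word, set()))

documents = {
    "document 1": ["word 1", "word 2", "word 3"],
    "document 2": ["word 10", "word 25", "word 30"],
    "document 3": ["word 15", "word 2", "word 32"],
    "document 4": ["word 11", "word 21", "word 3"],
}
index = build_index(documents)
result = keyword_search(index, "word 2")
```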
Cluster based search method

  Cluster context words:

  Documents        Documents          Documents

  Cluster 1        Cluster 2          Cluster 3
  word 1           word 12            word 1
  word 2           word 26            word 23
  …                …                  …
  word n1          word n2            word n3

  Request: word 1        Result set:
                         Cluster 1
                         Cluster 3
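
A sketch of the same lookup at cluster level, assuming each cluster is represented by the set of words in its context. The function name is hypothetical and the word sets are shortened to the words shown on the slide.

```python
def cluster_search(cluster_words, word):
    """Return the clusters whose context words include the requested word."""
    return sorted(name for name, words in cluster_words.items() if word in words)

cluster_words = {
    "Cluster 1": {"word 1", "word 2"},
    "Cluster 2": {"word 12", "word 26"},
    "Cluster 3": {"word 1", "word 23"},
}
hits = cluster_search(cluster_words, "word 1")
```

The result is a list of clusters rather than documents; the documents of each returned cluster can then be shown to the user.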
Similar documents search
  [Figure: Minimal Spanning Tree of a cluster, rooted at the cluster name
  and connecting documents 1 to 7]

    Request: document 3    Result set:
                           document 6
                           document 7
Document classification: method 1

  [Diagram: Test documents -> Clusters; List of topics and Training set ->
  Topic contexts]

  Distances between topic contexts and cluster contexts

  Classification result:
    cluster 1 – topic 10
    cluster 2 – topic 3
    …
    cluster n – topic 30
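
One way to read method 1 is to label each cluster with the topic whose context is nearest. The sketch below assumes the distance is the Jensen-Shannon divergence, which the slide does not confirm; all names and distributions are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (base 2) of a distribution stored as {word: probability}."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def js_divergence(p1, p2):
    """Jensen-Shannon divergence between two distributions."""
    words = set(p1) | set(p2)
    avg = {y: 0.5 * p1.get(y, 0.0) + 0.5 * p2.get(y, 0.0) for y in words}
    return entropy(avg) - 0.5 * entropy(p1) - 0.5 * entropy(p2)

def classify_clusters(cluster_contexts, topic_contexts):
    """Assign every cluster the topic whose context is at the smallest distance."""
    result = {}
    for cluster, ctx in cluster_contexts.items():
        result[cluster] = min(
            topic_contexts,
            key=lambda topic: js_divergence(ctx, topic_contexts[topic]),
        )
    return result

# Hypothetical topic contexts (built from a training set) and cluster contexts.
topic_contexts = {
    "Family law": {"marriage": 0.6, "court": 0.4},
    "Water industry": {"water": 0.7, "pipe": 0.3},
}
cluster_contexts = {
    "cluster 1": {"water": 0.5, "pipe": 0.5},
    "cluster 2": {"marriage": 0.5, "court": 0.5},
}
labels = classify_clusters(cluster_contexts, topic_contexts)
```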
Document classification: method 2

  [Diagram elements: Test documents, Training set, All documents set,
  Topics list, Clusters]

  Classification result:
    cluster 1 – topic 10
    cluster 2 – topic 3
    …
    cluster n – topic 30
Information Retrieval System

  Architecture
  Features
  Use
Information Retrieval System architecture

 database server
 client
IRS architecture

  Database server (MS SQL Server 2000, with the database)
      <-> Local Area Network <->
  “thick” client (C#)
IRS architecture

 DBMS MS SQL Server 2000:
  High-performance
  Scalable
  Secure
  Handles huge volumes of data
  T-SQL
  Stored procedures
IRS features
    The IRS solves the following problems:
  document clustering
  keyword search
  cluster based search
  similar documents search
  document classification using a training set
DB structure

The IRS database consists of the following tables:
   documents
   all words dictionary
   dictionary
   document-word: relations between documents and words
   word contexts
   words with narrow contexts
   clusters
   intermediate tables used to build the main tables and to implement retrieval
Algorithms implementation
  [Diagram: database tables (Documents, All words dictionary, Dictionary,
  “document-word”, Words contexts, Words with narrow contexts, Clusters,
  Centroid) and the algorithms built on them (Keyword search, Cluster based
  search, Similar documents search)]
Similar documents search

  [Figure: weighted graph over document1 to document5 within a cluster;
  edge weights shown: 0.26967, 0.211, 0.57231, 0.1011, 0.16285, 0.7231,
  0.8731, 0.23851, 0.98154]
Minimal Spanning Tree


  [Figure: minimal spanning tree of the cluster, rooted at the cluster name
  and connecting documents 1 to 5]
Similar documents search

  Clusters table -> Distances table -> Tree table -> Similar documents search
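
The chain from distances to tree to search can be sketched in Python. This uses Prim's algorithm for the minimal spanning tree and assumes that the similar documents are the tree neighbours of the requested document, which the slides suggest but do not state. The distances below are hypothetical and are not the weights from the figure.

```python
def minimum_spanning_tree(nodes, dist):
    """Prim's algorithm; dist maps frozenset({a, b}) to the distance between a and b."""
    nodes = list(nodes)
    in_tree = {nodes[0]}
    edges = []
    while len(in_tree) < len(nodes):
        # Pick the cheapest edge leading from the tree to a node outside it.
        a, b = min(
            ((u, v) for u in in_tree for v in nodes if v not in in_tree),
            key=lambda edge: dist[frozenset(edge)],
        )
        edges.append((a, b))
        in_tree.add(b)
    return edges

def similar_documents(edges, query):
    """Assumption: similar documents are the neighbours of the query in the tree."""
    return sorted({b if a == query else a for a, b in edges if query in (a, b)})

# Hypothetical pairwise distances inside one cluster.
dist = {
    frozenset({"d1", "d2"}): 0.1, frozenset({"d1", "d3"}): 0.5,
    frozenset({"d1", "d4"}): 0.9, frozenset({"d2", "d3"}): 0.2,
    frozenset({"d2", "d4"}): 0.8, frozenset({"d3", "d4"}): 0.3,
}
tree = minimum_spanning_tree(["d1", "d2", "d3", "d4"], dist)
neighbours = similar_documents(tree, "d3")
```

Storing the tree once, as the Tree table does, means a request only needs a neighbour lookup rather than a fresh distance calculation.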
IRS use
Experiments

 The test goals were:
  algorithm accuracy test
  comparison of different classification methods
  algorithm efficiency evaluation
Experiments

  60,000 documents
  100 topics
  Training set volume = 5% of the collection size
Experiments
Result analysis

 - Russian Information Retrieval Evaluation
   Seminar
 - The following macro-averaged measures were calculated:
  recall
  precision
  F-measure
Recall
Precision
F-measure
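
The formulas on these slides did not survive extraction. The sketch below uses the usual definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F = 2PR / (P + R), each averaged over categories. It assumes one category per document and a macro F-measure taken as the mean of per-category F values; the seminar's exact conventions may differ. The labels are hypothetical.

```python
def macro_scores(true_labels, predicted_labels):
    """Return macro-averaged (precision, recall, F-measure)."""
    pairs = list(zip(true_labels, predicted_labels))
    categories = sorted(set(true_labels) | set(predicted_labels))
    precisions, recalls, f_measures = [], [], []
    for c in categories:
        tp = sum(1 for t, p in pairs if t == c and p == c)   # correctly assigned to c
        fp = sum(1 for t, p in pairs if t != c and p == c)   # wrongly assigned to c
        fn = sum(1 for t, p in pairs if t == c and p != c)   # missed members of c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f_measure = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_measures.append(f_measure)
    n = len(categories)
    return sum(precisions) / n, sum(recalls) / n, sum(f_measures) / n

# Hypothetical labels for four test documents.
true_labels = ["a", "a", "b", "b"]
predicted_labels = ["a", "b", "b", "b"]
macro_p, macro_r, macro_f = macro_scores(true_labels, predicted_labels)
```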
Result analysis
 List of some topics the test documents were classified in:

  №    Category
  1    Family law
  2    Inheritance law
  3    Water industry
  4    Catering
  5    Inhabitants’ consumer services
  6    Rent truck
  7    International law of the space
  8    Territory in international law
  9    Off-economic relations fellows
  10   Off-economic dealerships
  11   Economy free trade zones. Customs unions.
Result analysis
Recall results for every category.
The best result for each category is shown in bold.
All results are given in percent.

              V \ С      1     2    3     4     5     6    7     8     9     10   11
              textan     33    34   35    60    46    26   27    98    75    25   100
              xxxx       1     0    0.2   3     4     0    0.9   0     3     0    2
              xxxx       0     0    4.3   2.3   0     5    0.9   8     3     0    0.8
              xxxx       55    86   75    19    59    51   80    0     41    82   0
              xxxx       21    39   2     22    15    6    0     1.4   0     5    0
              xxxx       40    43   16    11    25    23   10    1.4   1.2   5    0
              xxxx       23    4    2.5   1.1   18    7    0.9   0     1.2   10   0
              xxxx       2.7   0    0     0     1.5   0    0     0     0     0    0
              xxxx       2.2   0    0     0     1.5   0    0     0     0     0    0
              xxxx       37    21   12    22    18    27   51    0     0     0    0
Thank you for your attention!


              

				