
Exploiting Wikipedia as External Knowledge for Document Clustering



   Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K.
            Park, and Xiaohua Zhou

   Proceedings of the International Conference on Knowledge
    Discovery and Data Mining, ACM SIGKDD, 2009

2011/6/22            Presenter: 吳建良
                        Outline
   Motivation
   Framework of Wikipedia-based clustering
       Concept mapping schemes
       Category mapping
       Document clustering
   Experiments
   Conclusions

                           Motivation
   Traditional text clustering algorithms
       Based on BOW (bag of words)
       Ignore the semantic relationships among words
            e.g., synonyms or words semantically associated in other forms
   One way to resolve this problem
       Use background knowledge to enrich document
        representation
       Background knowledge is described by an ontology
       Ontology: concepts, attributes, relationships
                            Motivation (cont.)
   Problems with the ontology-based approach
       Difficult to find a comprehensive ontology that covers
        all the concepts
       Previous work has adopted WordNet and MeSH
            Replacing original content with ontology terms
                 causes information loss
            Adding ontology terms to the original document vector
                 brings noise into the dataset



                            Goal
   Adopt a more comprehensive ontology
       Wikipedia
   Fully leverage ontology terms and relations
    without introducing more noise
   Two matching methods
       Exact-match
       Relatedness-match

Framework




         Mapping Document to Wikipedia
            Concepts and Categories
   Mapping process includes three steps:
    1.   Build the connection between Wikipedia concepts
         and categories
    2.   Map each document into a vector of Wikipedia
         concepts
    3.   Match each document to a set of Wikipedia
         categories



        Figure of the three steps
   (Figure with panels labeled Step 1, Step 2, and Step 3)
             Concept-Category Matrix
   In Wikipedia, each topic is described by only one
    article
   Title of the article → preferred concept
   Each article (concept) has corresponding categories
   Example:
       Concept: Cluster Analysis
       Categories: Data mining | Data analysis | Cluster analysis |
        Geostatistics | Machine learning | Multivariate statistics |
        Knowledge discovery in databases
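
The concept-category connection can be pictured as a 0/1 matrix. The following is only a minimal sketch: the function name `build_concept_category_matrix` and the second concept with its categories are made up for illustration, while the first concept and its categories are the example above.

```python
# Toy data: concept -> categories. The first entry is the example above;
# the second entry is a made-up concept added only to give the matrix two rows.
concept_categories = {
    "cluster analysis": [
        "Data mining", "Data analysis", "Cluster analysis", "Geostatistics",
        "Machine learning", "Multivariate statistics",
        "Knowledge discovery in databases",
    ],
    "k-means clustering": ["Cluster analysis", "Machine learning"],
}

def build_concept_category_matrix(concept_categories):
    """Return (concepts, categories, matrix); matrix[i][j] is 1 if concept i is in category j."""
    concepts = sorted(concept_categories)
    categories = sorted({cat for cats in concept_categories.values() for cat in cats})
    column = {cat: j for j, cat in enumerate(categories)}
    matrix = [[0] * len(categories) for _ in concepts]
    for i, concept in enumerate(concepts):
        for cat in concept_categories[concept]:
            matrix[i][column[cat]] = 1
    return concepts, categories, matrix

concepts, categories, cc_matrix = build_concept_category_matrix(concept_categories)
```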

                Document-Concept Matrix
   The matrix is built through two matching schemes
       Exact-match
       Relatedness-match
   Exact-match
       Issue: how to map synonymous phrases to the same concept
       Use redirect links in Wikipedia
       Example:
            Preferred concept: cluster analysis
            Redirected concepts: data clustering,… are redirected to the same article
       Use preferred and redirected concepts to construct a dictionary
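
A possible shape for such a dictionary is sketched below. The redirect table is a toy stand-in for Wikipedia's redirect links, and `build_concept_dictionary` is a hypothetical helper name; lowercasing the phrases is an assumption, not something stated in the slides.

```python
# Toy redirect table: redirected title -> preferred concept (article title).
redirects = {
    "data clustering": "cluster analysis",
}
# "machine learning" is an illustrative extra concept.
preferred_concepts = {"cluster analysis", "machine learning"}

def build_concept_dictionary(preferred_concepts, redirects):
    """Map every known phrase (lowercased) to its preferred concept."""
    dictionary = {concept.lower(): concept.lower() for concept in preferred_concepts}
    for alias, target in redirects.items():
        dictionary[alias.lower()] = target.lower()
    return dictionary

concept_dictionary = build_concept_dictionary(preferred_concepts, redirects)
```

With this layout, synonymous phrases resolve to the same key, which is what lets their counts be pooled later.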
                    Exact-Match Scheme
   Each document is scanned to find concepts in the dictionary
   Only preferred concepts are used to build the concept
    vector for each document
       Each entry is the frequency of the preferred concept plus the
        frequencies of all its redirected concepts
                Preferred_concept1                   Preferred_concept2
        Doc1    Freq_pre_con1 + Freq_all_redi_con1   Freq_pre_con2 + Freq_all_redi_con2

   Based on this frequency matrix
       Further calculate the document-concept TFIDF matrix
   Efficient, but has low recall
       Produces good results only when Wikipedia has good coverage
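
The exact-match counting can be sketched as follows. This is an illustration under assumptions: the toy dictionary and documents are invented, the helper names are hypothetical, and the TF-IDF variant (raw count times log(N / df)) is one common choice, since the slides do not say which weighting is used.

```python
import math
import re

def concept_frequencies(text, dictionary):
    """Count dictionary phrases in text, crediting each hit to its preferred concept."""
    text = text.lower()
    counts = {}
    for phrase, preferred in dictionary.items():
        hits = len(re.findall(r"\b" + re.escape(phrase) + r"\b", text))
        if hits:
            counts[preferred] = counts.get(preferred, 0) + hits
    return counts

def tfidf(freq_rows):
    """Turn per-document concept counts into TF-IDF weights with idf = log(N / df)."""
    n_docs = len(freq_rows)
    df = {}
    for row in freq_rows:
        for concept in row:
            df[concept] = df.get(concept, 0) + 1
    return [{c: f * math.log(n_docs / df[c]) for c, f in row.items()} for row in freq_rows]

# Toy dictionary: phrase -> preferred concept (redirects share a target).
dictionary = {
    "cluster analysis": "cluster analysis",
    "data clustering": "cluster analysis",
    "machine learning": "machine learning",
}
docs = [
    "Data clustering is also called cluster analysis. "
    "Cluster analysis is used in machine learning.",
    "Machine learning needs data.",
]
freqs = [concept_frequencies(doc, dictionary) for doc in docs]
weights = tfidf(freqs)
```

In the first toy document, the single "data clustering" hit is pooled with the two "cluster analysis" hits under the preferred concept. Overlapping or nested phrases are not handled in this sketch.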
                 Relatedness-Match Scheme
   Consists of two steps
    1.       First, create a Wikipedia term-concept matrix from the Wikipedia
             article collection
               Each word token is represented by a concept vector
               Values of the vector are TFIDF scores
               For each word, only the top k=5 concepts with the highest TFIDF
                scores are kept
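
The top-k pruning in this step might look like the sketch below. The scores are invented numbers, and `top_k_concepts` is a hypothetical helper, not the authors' code.

```python
import heapq

def top_k_concepts(term_concept_scores, k=5):
    """Keep, for each word, only the k concepts with the highest TF-IDF score."""
    return {
        word: dict(heapq.nlargest(k, scores.items(), key=lambda item: item[1]))
        for word, scores in term_concept_scores.items()
    }

# Invented TF-IDF scores of the word "cluster" in seven Wikipedia articles.
term_concept_scores = {
    "cluster": {
        "cluster analysis": 0.9, "computer cluster": 0.8, "star cluster": 0.7,
        "galaxy cluster": 0.6, "cluster headache": 0.5, "data mining": 0.2,
        "statistics": 0.1,
    },
}
pruned = top_k_concepts(term_concept_scores, k=5)
```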




         Relatedness-Match Scheme (cont.)
2.       Use the term-concept matrix as a bridge to associate documents
         with Wikipedia concepts
           Calculate the relatedness of each Wikipedia preferred concept
            to each document in the document collection
           For each document, select the top M=200 concepts with the highest
            relatedness scores
           The concept relatedness score vector is normalized
           Especially useful when Wikipedia concepts have low coverage for a
            dataset
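
The relatedness formula itself is not reproduced in this text, so the sketch below rests on an assumption: a concept's score for a document is the sum, over the document's words, of the word's weight in the document times the word's TFIDF score for that concept. Normalizing the kept scores to sum to 1 is also an assumption. All names and numbers are illustrative.

```python
import heapq

def concept_relatedness(doc_word_weights, word_concepts, top_m=200):
    """Score concepts for one document and return a normalized top-M vector.

    Assumed scoring: sum over the document's words of
    (word weight in the document) * (word's TF-IDF score for the concept).
    """
    scores = {}
    for word, weight in doc_word_weights.items():
        for concept, concept_score in word_concepts.get(word, {}).items():
            scores[concept] = scores.get(concept, 0.0) + weight * concept_score
    top = heapq.nlargest(top_m, scores.items(), key=lambda item: item[1])
    total = sum(score for _, score in top)
    # Assumed normalization: scale the kept scores so they sum to 1.
    return {concept: score / total for concept, score in top} if total else {}

# Toy term-concept matrix (already pruned to the top concepts per word).
word_concepts = {
    "cluster": {"cluster analysis": 0.5, "computer cluster": 0.5},
    "document": {"cluster analysis": 0.25},
}
doc = {"cluster": 2.0, "document": 4.0}  # invented word weights for one document
vector = concept_relatedness(doc, word_concepts, top_m=2)
```

Because the scores come from individual words rather than whole phrases, a document can be linked to concepts whose titles never appear in it, which is why this scheme helps when exact matches are sparse.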
         Category Mapping for Exact-Match
   Document-category frequency matrix
       Derived from document-concept frequency matrix
       Replace each concept with its corresponding categories
       Calculate frequency of a category:

        Concept-category matrix      Document-concept frequencies
             CAT1   CAT2                  C1     C2
        C1   1      0                D1   9      2
        C2   1      1                D2   3      5

        Resulting document-category frequencies
             CAT1   CAT2
        D1   9+2    2
        D2   3+5    5

       Further derive the document-category TFIDF matrix
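
The category frequencies in this example amount to multiplying the document-concept matrix by the concept-category matrix. A minimal sketch using the slide's numbers (`map_to_categories` is a hypothetical helper name):

```python
def map_to_categories(doc_concept, concept_category):
    """Multiply a document-concept matrix by a 0/1 concept-category matrix."""
    n_categories = len(concept_category[0])
    return [
        [sum(row[i] * concept_category[i][j] for i in range(len(row)))
         for j in range(n_categories)]
        for row in doc_concept
    ]

concept_category = [[1, 0],   # C1 belongs to CAT1
                    [1, 1]]   # C2 belongs to CAT1 and CAT2
doc_concept = [[9, 2],        # D1: frequencies of C1 and C2
               [3, 5]]        # D2: frequencies of C1 and C2
doc_category = map_to_categories(doc_concept, concept_category)
# doc_category == [[11, 2], [8, 5]], i.e. 9+2 and 2 for D1, 3+5 and 5 for D2
```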
         Category Mapping for Relatedness-Match
   Document-category matrix
       Derived from document-concept relatedness matrix
       Replace each concept with its corresponding categories
       Calculate relatedness score of a category:

        Concept-category matrix      Document-concept relatedness
             CAT1   CAT2                  C1     C2
        C1   1      0                D1   0.3    0.7
        C2   1      1                D2   0.63   0.37

        Resulting document-category relatedness
             CAT1        CAT2
        D1   0.3+0.7     0.7
        D2   0.63+0.37   0.37




                Document Clustering
   Agglomerative clustering algorithm
    1.   Initially, each document starts as a cluster
    2.   Repeatedly merge closest pair of clusters
    3.   Until only one cluster is formed covering all documents




   Similarity measure



    Closest Pair of Clusters Calculation
                                 C1   C2


   Single linkage



   Complete linkage
        Adopted in this paper


   Average linkage
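
The three linkage criteria and the merge loop can be sketched as below. The similarity measure is not reproduced in this text, so the sketch uses Euclidean distance as a stand-in (an assumption), where "closest" means smallest distance. The function names are illustrative.

```python
import math
from itertools import combinations

def single_linkage(a, b, dist):
    """Distance between the two closest members of clusters a and b."""
    return min(dist(x, y) for x in a for y in b)

def complete_linkage(a, b, dist):
    """Distance between the two farthest members of clusters a and b."""
    return max(dist(x, y) for x in a for y in b)

def average_linkage(a, b, dist):
    """Mean distance over all cross-cluster pairs."""
    return sum(dist(x, y) for x in a for y in b) / (len(a) * len(b))

def agglomerative(points, linkage, dist=math.dist, n_clusters=1):
    """Start with one cluster per point and merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]], dist),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
result = agglomerative(points, complete_linkage, n_clusters=2)
```

Stopping at `n_clusters=1` reproduces the full hierarchy described in the algorithm; stopping earlier simply cuts it at the desired number of clusters.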

                Partitional Clustering
   K-means clustering algorithm (illustrated with K=2)
    1.   Arbitrarily choose K objects as the initial cluster centers
    2.   Assign each object to the most similar center
    3.   Update the cluster means
    4.   Reassign objects and update the cluster means again
               Partitional Clustering (cont.)
   Similarity measure



   Clustering result is influenced by initial
    selection of cluster centroids
       Evaluation:
            Run ten times with random initialization
            Take average as the final clustering result
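
A minimal sketch of K-means with repeated random initialization follows. The similarity measure is not reproduced in this text, so Euclidean distance is used as a stand-in (an assumption). `average_over_runs` and the `score` callback are hypothetical names; any evaluation metric could be plugged in as `score`.

```python
import math
import random

def kmeans(points, k, rng, max_iter=100):
    """Plain K-means with Euclidean distance; returns a cluster index per point."""
    centers = rng.sample(points, k)  # arbitrarily choose K objects as initial centers
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

def average_over_runs(points, k, score, runs=10, seed=0):
    """Run K-means several times with random initializations and average a score."""
    rng = random.Random(seed)
    return sum(score(kmeans(points, k, rng)) for _ in range(runs)) / runs

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = kmeans(points, 2, random.Random(0))
```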
                      Experiments
   Wikipedia data
       Download from http://download.wikipedia.org
       911,028 articles and 29,000 categories
   Clustering dataset
       TDT2: 7,094 documents, 10 classes
       LA Times (from TREC): 18,547 documents from top
        ten sections, 10 classes
       20-newsgroups (20NG): 19,997 documents, 20 classes
                       Experiments (cont.)
   For each dataset, five small datasets are created
       Method:
            For each small dataset, randomly pick 100 documents
             from each selected class of a given dataset
            Merge them into a big pool
       Cluster each small dataset separately
       The average result is taken as the clustering result for the
        whole dataset
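
The sampling procedure can be sketched as follows. The corpus below is a placeholder of document ids, and `make_small_datasets` is a hypothetical helper; how classes are selected is not covered here.

```python
import random

def make_small_datasets(docs_by_class, n_datasets=5, per_class=100, seed=0):
    """Build n_datasets pools, each with per_class random documents from every class."""
    rng = random.Random(seed)
    datasets = []
    for _ in range(n_datasets):
        pool = []
        for label, docs in docs_by_class.items():
            pool.extend((doc, label) for doc in rng.sample(docs, per_class))
        rng.shuffle(pool)  # merge the picks into one mixed pool
        datasets.append(pool)
    return datasets

# Placeholder corpus: 3 classes with 150 document ids each.
docs_by_class = {f"class{c}": [f"doc{c}_{i}" for i in range(150)] for c in range(3)}
small = make_small_datasets(docs_by_class)
```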

                                      Evaluation Metrics
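   Purity
       Average percentage of the dominant class label in each cluster
       $\text{purity} = \frac{1}{K} \sum_{i=1}^{K} \frac{|C_i^d|}{|C_i|} \times 100\%$
       where $|C_i^d|$ is the number of documents in cluster $C_i$ with the dominant class label
   F-score
       Combine precision and recall to compute score
       $F = \frac{2 \times (\text{precision} \times \text{recall})}{\text{precision} + \text{recall}}$
   Normalized mutual information (NMI)
       $\mathrm{NMI}(X, Y) = \frac{I(X;Y)}{(\log k + \log c) / 2}$
       where $k$ is the number of clusters and $c$ is the number of classes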
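
The three metrics can be sketched as below, following the definitions on this slide. Assumptions: purity is the unweighted average over clusters as stated above (not the size-weighted variant often seen elsewhere), and in NMI k and c are taken to be the numbers of clusters and classes. The function names are illustrative.

```python
import math
from collections import Counter

def purity(clusters, classes):
    """Average, over clusters, of the share of the dominant class label, in percent."""
    members = {}
    for cluster, label in zip(clusters, classes):
        members.setdefault(cluster, []).append(label)
    shares = [
        Counter(labels).most_common(1)[0][1] / len(labels)
        for labels in members.values()
    ]
    return 100.0 * sum(shares) / len(shares)

def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def nmi(clusters, classes):
    """I(X;Y) / ((log k + log c) / 2); needs more than one cluster or class."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    px = Counter(clusters)
    py = Counter(classes)
    mutual = sum(
        (nxy / n) * math.log(n * nxy / (px[x] * py[y]))
        for (x, y), nxy in joint.items()
    )
    return mutual / ((math.log(len(px)) + math.log(len(py))) / 2)
```

For example, `purity([0, 0, 0, 1], ["a", "a", "b", "b"])` averages 2/3 and 1 over the two clusters.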



Agglomerative Clustering Results




                    Summary of this result
   Word_Category performs better than Word_Concept_Category
   Combining Word and Category significantly improves the clustering
    result
       Category information is more useful than concept information
   Word_Concept improves the clustering result, but not significantly
   Clustering based only on Concept performs worse than the
    baseline
       Concept vectors still contain too much noise
       Concept senses are not disambiguated during the concept mapping process


Partitional Clustering Results




                Summary of this result
   For 20 Newsgroups, the Word_Category scheme still significantly
    improves the clustering result
   F-score and purity of Word_Concept_Category based clustering
    are significantly improved
   For 20 Newsgroups, RM (Relatedness-Match) always produces better results
    than EM (Exact-Match)
   For LA Times and TDT2, EM always outperforms RM




                          Conclusion
   A framework
       Leverage Wikipedia concept and category information to
        improve text clustering performance
   Mapping Schemes
       Exact-Match and Relatedness-Match
   Concept vector and Category vector
   Two clustering approaches on three datasets
       Agglomerative and partitional clustering



								