
   Prototype Hierarchy Based Clustering
   for the Categorization and Navigation
            of Web Collections
       Zhao-Yan Ming, Kai Wang and Tat-Seng Chua
    School of Computing, National University of Singapore
                        SIGIR 2010
                 Speaker: Tom Chao Zhou
                   2010.10.26, Tuesday




              Outline


• Motivation
• Prototype Hierarchy Based Clustering
• Problem Formulation and Approach
• Experiments

                   Motivation

• Utility of user-generated content
 • Quality: distinguishing good-quality from
    poor-quality content.
 • Accessibility:
   •   Question search.
   •   Organizing huge collections of data for information
       navigation: categorization and hierarchical clustering, with labels
       and descriptions of clusters.



          Categorization
• Users construct fine-grained topic
  hierarchies and assign objects to them
 • Open Directory Project and Wikipedia.
 • Disadvantage: too much manual
    effort.
• Coarse-grained hierarchies
 • Yahoo! Answers’ categories.
 • Disadvantage: too coarse, e.g., there is
    no “IPod” category.
          Categorization

• Supervised techniques: not appropriate
  for dynamic Web services.
• Unsupervised techniques
 • Cluster the collection into smaller
    groups.
 • Extract labels for the clustered groups.
 Prototype Hierarchy Based
      Clustering (PHC)

• Tackles the Web collection categorization
  and navigation problem.
• PHC utilizes world knowledge in the
  form of prototype hierarchies, while
  adapting to the underlying topic structure
  of the collection.


 Prototype Hierarchy Based
      Clustering (PHC)
• Advantages
 •   Eliminates the problems of determining the
     number of clusters and assigning initial
     clusters, by following the structure of the
     prototype hierarchy.
 •   Results are interpretable, comprehensive,
     and organized.
 •   Flexible forms of supervision: the prototype
     hierarchy can come at different levels of
     granularity.

 Prototype Hierarchy Based
         Clustering
• Prototype Hierarchy (PH)
 • A hierarchy whose node set V
    represents a set of <l, p> tuples, where
    the prototype p serves as a description
    of the concept l.
• Data Hierarchy (DH)
 • A hierarchy that organizes a collection of
    objects d. Each node represents a
    category of objects CO.
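The two definitions above can be sketched as plain data structures. This is only an illustration, not the authors' implementation: the class names `PrototypeNode` and `DataNode`, the helper `mirror`, and the bag-of-words form chosen for the prototype p are all assumptions made here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PrototypeNode:
    """One <l, p> tuple of the prototype hierarchy (PH)."""
    label: str                    # concept label l
    prototype: Dict[str, float]   # prototype p, assumed here to be weighted terms
    children: List["PrototypeNode"] = field(default_factory=list)


@dataclass
class DataNode:
    """One node of the data hierarchy (DH): a category of objects CO."""
    label: str
    objects: List[str] = field(default_factory=list)   # objects d placed directly here
    children: List["DataNode"] = field(default_factory=list)

    def all_objects(self) -> List[str]:
        """Objects of this category, including those of descendant categories."""
        collected = list(self.objects)
        for child in self.children:
            collected.extend(child.all_objects())
        return collected


def mirror(ph: PrototypeNode) -> DataNode:
    """Create an empty DH with the same shape and labels as the PH."""
    return DataNode(label=ph.label, children=[mirror(c) for c in ph.children])


# Hypothetical two-level prototype hierarchy for illustration.
ph = PrototypeNode("IPod", {"ipod": 1.0}, [
    PrototypeNode("battery", {"battery": 1.0, "charge": 0.5}),
    PrototypeNode("sync", {"itunes": 1.0, "sync": 1.0}),
])
dh = mirror(ph)
dh.children[0].objects.append("How long does the battery last?")
print(dh.all_objects())
```

Starting the DH as a mirror of the PH reflects the idea that the clusters follow the predefined categories; the later slides describe how the DH may then gain or lose nodes.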
     Problem Formulation

• Given a collection D of objects on a
  topic τ, PHC partitions D and maps it into
  the categories predefined by a
  PH on τ, such that the resulting object
  clusters CO1, CO2, ..., COk are
  organized in a DH whose structure is
  similar to that of the PH.


• Some PH nodes have no objects assigned to them.
• Some questions have no appropriate category to be assigned to.




          Requirements

• The data hierarchy evolves into a
  compact structure that encodes the
  underlying topics of the collection.
• The data and prototype hierarchies are
  matched at both the node and relation levels.
• Distances between objects are
  measured by appropriate metrics.


  Problem Formulation and
         Approach
• Hierarchy Metric and Information
  Function
 • A hierarchy metric is a function that
    operates on all nodes.
 • h: V×V → R+, applied to adjacent pairs of
    nodes.
 • The quality of the structure is measured by
    the amount of information carried in H.
      Minimum Evolution
• Minimum Evolution (obj1)
 • Intuition: the DH that compactly “encodes”
    the collection into topic categories is
    the best.
 • Monitors the structural evolution of the
    data hierarchy.
 • The optimal DH on a collection is the
    one that contains the least information.
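A rough sketch of how the minimum evolution criterion could be evaluated is given below. The slides do not give the formula for the information function, so two assumptions are made here: the hierarchy metric h is taken to be cosine distance between node centroids, and the information of a hierarchy is taken to be the sum of h over adjacent (parent, child) pairs. The function names and the toy candidates are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]
Edge = Tuple[str, str]  # (parent label, child label)


def cosine_distance(a: Vector, b: Vector) -> float:
    """Stand-in hierarchy metric h (assumption): 1 - cosine similarity."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def information(edges: List[Edge], centroids: Dict[str, Vector]) -> float:
    """Assumed information function: sum of h over adjacent node pairs."""
    return sum(cosine_distance(centroids[p], centroids[c]) for p, c in edges)


def least_information(
    candidates: Dict[str, Tuple[List[Edge], Dict[str, Vector]]]
) -> str:
    """Return the name of the candidate DH that carries the least information."""
    return min(candidates, key=lambda name: information(*candidates[name]))


# Two hypothetical candidate data hierarchies for the same small collection.
root = {"battery": 1.0, "charge": 1.0}
candidates = {
    "compact": ([("root", "power")],
                {"root": root, "power": {"battery": 1.0, "charge": 1.0}}),
    "scattered": ([("root", "power"), ("root", "misc")],
                  {"root": root, "power": {"battery": 1.0},
                   "misc": {"screen": 1.0}}),
}
print(least_information(candidates))
```

Under these assumptions the hierarchy whose children stay close to their parent is preferred, which matches the intuition that a compact encoding of the collection is best.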
Matching of Prototype Data
        Hierarchy
• Data Hierarchy Centroid
 • Centroids of DH nodes are generated
   in an incremental manner.
 • A new object in a leaf node
   automatically becomes a member of its
   ancestor nodes.
 • The magnitude of the change decreases
   with the number of levels from the leaf node.
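The incremental update can be illustrated with a running mean. The slides do not state the update rule, so this sketch assumes that each centroid is the mean term vector of all member objects, descendants included. Under that assumption ancestors hold more members and therefore typically move less than the leaf when a new object arrives. The `Node` class and `add_object` function are hypothetical names.

```python
from typing import Dict, Optional

Vector = Dict[str, float]


class Node:
    def __init__(self, label: str, parent: Optional["Node"] = None) -> None:
        self.label = label
        self.parent = parent
        self.count = 0              # member objects, descendants included
        self.centroid: Vector = {}


def add_object(leaf: Node, vector: Vector) -> None:
    """Add an object to a leaf and update the running-mean centroid of the
    leaf and of every ancestor up to the root."""
    node: Optional[Node] = leaf
    while node is not None:
        node.count += 1
        for term in set(node.centroid) | set(vector):
            old = node.centroid.get(term, 0.0)
            # Running mean: move the old value toward the new one by 1/count.
            node.centroid[term] = old + (vector.get(term, 0.0) - old) / node.count
        node = node.parent


root = Node("root")
leaf_a = Node("A", parent=root)
leaf_b = Node("B", parent=root)
add_object(leaf_a, {"x": 1.0})
add_object(leaf_b, {"y": 1.0})
add_object(leaf_a, {"x": 3.0})
print(leaf_a.centroid, root.centroid)
```

In this toy run the last object shifts the "x" weight of leaf A by 1.0 but that of the root by only about 0.83, since the root averages over three objects instead of two.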
      Prototype Centrality

• Prototype Centrality (obj2)
 • Intuition: add a data object to a
    node such that the updated centroids
    are most similar to their
    corresponding prototypes.
 • A prototype is located at the center of
    an object cluster.
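The intuition can be sketched as a selection rule: tentatively add the object to each candidate node, recompute the centroid, and keep the node whose updated centroid is closest to its prototype. This is a simplified illustration under stated assumptions: cosine similarity is used as the closeness measure, centroids are means of term vectors, and only the target node is scored (the slide speaks of updated centroids in the plural, which would include ancestors). All names are hypothetical.

```python
import math
from typing import Dict, Tuple

Vector = Dict[str, float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def updated_centroid(centroid: Vector, count: int, obj: Vector) -> Vector:
    """Mean of the node's `count` current members plus the new object."""
    terms = set(centroid) | set(obj)
    return {t: (centroid.get(t, 0.0) * count + obj.get(t, 0.0)) / (count + 1)
            for t in terms}


def most_central_node(obj: Vector,
                      nodes: Dict[str, Tuple[Vector, int, Vector]]) -> str:
    """`nodes` maps a label to (centroid, member count, prototype)."""
    def score(label: str) -> float:
        centroid, count, prototype = nodes[label]
        return cosine(updated_centroid(centroid, count, obj), prototype)
    return max(nodes, key=score)


# Hypothetical nodes: (current centroid, member count, prototype).
nodes = {
    "battery": ({"battery": 1.0}, 1, {"battery": 1.0, "charge": 1.0}),
    "sync": ({"itunes": 1.0}, 1, {"itunes": 1.0, "sync": 1.0}),
}
print(most_central_node({"charge": 1.0}, nodes))
```

Adding the "charge" object to the battery node pulls its centroid onto the battery prototype, while adding it to the sync node pulls that centroid away from the sync prototype, so the battery node is selected.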

  Prototype-Data Hierarchy
        Resemblance
• Matching between two hierarchies H1
  and H2
 • Full match: V1 = V2 and R1 = R2.
 • Partial match
   •   Common hierarchy: the matched nodes and relations.
   •   Incomplete match: V1 + Vin = V2, R1 + Rin = R2.
   •   Excess match: V1 = V2 + Vin, R1 = R2 + Rin.
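These match types translate directly into set comparisons when V is a set of node labels and R is a set of (parent, child) relations. The sketch below is one possible reading; the function names and the "partial" fallback for hierarchies where neither contains the other are assumptions.

```python
from typing import Set, Tuple

Relation = Tuple[str, str]  # (parent, child)


def match_type(v1: Set[str], r1: Set[Relation],
               v2: Set[str], r2: Set[Relation]) -> str:
    """Classify how hierarchy H1 = (V1, R1) matches H2 = (V2, R2)."""
    if v1 == v2 and r1 == r2:
        return "full"
    if v1 <= v2 and r1 <= r2:
        return "incomplete"   # H1 plus some Vin, Rin gives H2
    if v1 >= v2 and r1 >= r2:
        return "excess"       # H1 equals H2 plus some Vin, Rin
    return "partial"          # neither hierarchy contains the other


def common_hierarchy(v1: Set[str], r1: Set[Relation],
                     v2: Set[str], r2: Set[Relation]):
    """Nodes and relations that appear in both hierarchies."""
    return v1 & v2, r1 & r2


# Hypothetical hierarchies: H1 lacks the "screen" category present in H2.
v1 = {"IPod", "battery", "sync"}
r1 = {("IPod", "battery"), ("IPod", "sync")}
v2 = v1 | {"screen"}
r2 = r1 | {("IPod", "screen")}

print(match_type(v1, r1, v2, r2))
print(match_type(v2, r2, v1, r1))
print(common_hierarchy(v1, r1, v2, r2))
```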



  Prototype-Data Hierarchy
        Resemblance


• Prototype-Data Hierarchy Resemblance
  (obj3)
 • Based on the common part of the data hierarchy
   and the prototype hierarchy.



Partially Matched Prototype
          Hierarchy
• PH is an incomplete match of DH
 • Add dummy child nodes to the
   existing nodes in PH.
 • Employ label extraction algorithms to name them.
• PH is an excess match of DH
 • Empty nodes are removed.
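Both repairs can be sketched on a small tree. The code below is an illustration only: the `Node` class, the `<dummy>` placeholder label, and the use of the most frequent term as a label are assumptions, the last one being a deliberately simple stand-in for a real label extraction algorithm.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List

DUMMY = "<dummy>"  # placeholder label, hypothetical


@dataclass
class Node:
    label: str
    objects: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def add_dummy_children(node: Node) -> None:
    """Incomplete match: give every existing node one dummy child that can
    receive objects that fit none of the predefined children."""
    for child in node.children:
        add_dummy_children(child)
    node.children.append(Node(DUMMY))


def label_from_objects(objects: List[str]) -> str:
    """Most frequent term (stand-in for label extraction)."""
    counts = Counter(w for text in objects for w in text.lower().split())
    return counts.most_common(1)[0][0]


def name_filled_dummies(node: Node) -> None:
    for child in node.children:
        name_filled_dummies(child)
        if child.label == DUMMY and child.objects:
            child.label = label_from_objects(child.objects)


def prune_empty(node: Node) -> bool:
    """Excess match: drop descendants that hold no objects. Returns True if
    `node` itself still holds objects directly or through its children."""
    node.children = [c for c in node.children if prune_empty(c)]
    return bool(node.objects or node.children)


root = Node("IPod", children=[Node("battery", objects=["battery dies"]),
                              Node("sync")])
add_dummy_children(root)
root.children[-1].objects += ["screen cracked", "screen scratch"]
name_filled_dummies(root)
prune_empty(root)
print([c.label for c in root.children])
```

In this toy run the unused "sync" node and the unfilled dummy nodes disappear, while the dummy node that received objects survives under an extracted label.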
           Object Metric

• M(di, dj) is defined as the similarity
  between a pair of objects di and dj
  within a node.
• Translation-based language model
 • Semantic similarity.
• Syntactic tree kernel matching
 • Syntactic similarity.
   Category Cohesiveness


• Category Cohesiveness (obj4)
 • Objects in the same category are
    similar to each other.
 • Objects in different categories are
    dissimilar to each other.
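One simple way to turn these two statements into a score is to subtract the mean similarity across categories from the mean similarity within categories. The sketch below does this with Jaccard word overlap, which is only a stand-in for the object metric M (the slides name a translation-based language model and syntactic tree kernel matching). The function names and the difference-of-means form are assumptions.

```python
from itertools import combinations
from typing import Callable, Dict, List


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity, a stand-in for the object metric M."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def cohesiveness(categories: Dict[str, List[str]],
                 sim: Callable[[str, str], float] = jaccard) -> float:
    """Mean within-category similarity minus mean cross-category similarity."""
    def mean(values: List[float]) -> float:
        return sum(values) / len(values) if values else 0.0

    intra = [sim(a, b)
             for objs in categories.values()
             for a, b in combinations(objs, 2)]
    inter = [sim(a, b)
             for (_, xs), (_, ys) in combinations(categories.items(), 2)
             for a in xs for b in ys]
    return mean(intra) - mean(inter)


# Hypothetical questions, grouped well and grouped badly.
good = {"battery": ["battery dies fast", "battery dies quickly"],
        "sync": ["itunes sync fails", "itunes sync error"]}
mixed = {"a": ["battery dies fast", "itunes sync fails"],
         "b": ["battery dies quickly", "itunes sync error"]}
print(cohesiveness(good), cohesiveness(mixed))
```

The well-grouped assignment scores higher than the mixed one, which is the behavior the objective asks for.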


 Multi-Criterion Optimization
           Function

• Minimum evolution.
• Prototype centrality.
• Prototype-Data Hierarchy Resemblance.
• Category cohesiveness.
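The slides list the four objectives but do not say how they are combined. A weighted sum is one common choice and is used here purely as an assumption, with hypothetical names, weights, and scores. Minimum evolution is a quantity to minimize, so its score is entered as negated information.

```python
from typing import Dict

OBJECTIVES = ("minimum_evolution", "prototype_centrality",
              "resemblance", "cohesiveness")


def combined_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the four objective scores (higher is better)."""
    return sum(weights[name] * scores[name] for name in OBJECTIVES)


def best_candidate(candidates: Dict[str, Dict[str, float]],
                   weights: Dict[str, float]) -> str:
    """Pick the candidate placement with the highest combined score."""
    return max(candidates, key=lambda c: combined_score(candidates[c], weights))


# Hypothetical scores for two ways of placing one object.
candidates = {
    "existing_node": {"minimum_evolution": -0.2, "prototype_centrality": 0.9,
                      "resemblance": 1.0, "cohesiveness": 0.5},
    "new_node": {"minimum_evolution": -0.6, "prototype_centrality": 0.2,
                 "resemblance": 0.5, "cohesiveness": 0.8},
}
equal = {name: 1.0 for name in OBJECTIVES}
only_cohesiveness = {name: 0.0 for name in OBJECTIVES}
only_cohesiveness["cohesiveness"] = 1.0

print(best_candidate(candidates, equal))
print(best_candidate(candidates, only_cohesiveness))
```

Setting a weight to zero removes that objective from the decision, which is a convenient way to run the kind of ablation over objectives reported in the experiments.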


                             Datasets



•   Hierarchy
    •   Dental: Wikipedia.
    •   IPod: manually constructed by combining a Wikipedia article, WordNet, and product specifications.

•   Dataset diversity
    •   CS: deep hierarchy. The hierarchy is noisy.
    •   RS: broad hierarchy, abstract domain. The hierarchy is noisy.
    •   IPod: concrete domain.
    •   Dental: the hierarchy is well constructed.


     Experimental Setting
• proKmeans
 • Prototype-hierarchy-enhanced K-means
    divisive hierarchical clustering.
• LiveClassifier
• PHC
• CFC Classifier
 • Supervised text categorization
    technique.
• Given a prototype hierarchy for a collection, even a simple method can categorize
  the collection reasonably well.
• PHC is superior in terms of utilizing the prototype hierarchy.
• PHC is comparable with the supervised method.
• PHC introduces new nodes into the predefined hierarchy.
• PHC works better in concrete domains than in abstract domains.


          Ablation Study on
        Optimization Objectives



•   Prototype Centrality (obj2)
•   Category Cohesiveness (obj4)
•   Prototype-Data Hierarchy Resemblance (obj3)
•   Minimum Evolution (obj1)
    •   Without minimum evolution, the data hierarchy deviates less from the
        prototype hierarchy (new nodes are created with minimum evolution).
    •   The minimum evolution objective leads to a self-contained data hierarchy.



        Robustness with Mismatched
           Prototype Hierarchy




• PHC is robust against overfitted prototype hierarchies.
• PHC has only a limited ability to create categories.


