Generic Summarization and Keyphrase
Extraction Using Mutual Reinforcement
Principle and Sentence Clustering

                    Hongyuan Zha
     Department of Computer Science & Engineering
             Pennsylvania State University
              University Park, PA 16802

                      SIGIR ’02
   Informally, the goal of text summarization is to take a textual
    document, extract content from it and present important
    content to the user in a condensed form and in a manner
    sensitive to the user’s or application’s needs.

   Two basic approaches to sentence extraction:
        Supervised approaches need human-generated summary extracts
         for feature extraction and parameter estimation; sentence
         classifiers are trained using human-generated sentence-summary
         pairs as training examples.
        We adopt the unsupervised approach and explicitly model both
         keyphrases and the sentences that contain them using weighted
         undirected and weighted bipartite graphs.
   We first cluster the sentences of a document (or set of documents)
    into topical groups, and then select keyphrases and sentences by
    their saliency scores within each group.

   Major contributions:
        Proposing the use of sentence link priors, which result from the
         linear order of sentences, to enhance sentence clustering quality.
        Developing the mutual reinforcement principle for simultaneous
         keyphrase and sentence saliency score computation.
The Mutual Reinforcement Principle
   For each document we generate two sets of objects: the set of
    terms T = {t1,…,tn} and the set of sentences S = {s1,…,sm}.

   Build a weighted bipartite graph: if term ti appears in sentence
    sj, create an edge between them and assign it a nonnegative
    weight wij.
        We can simply choose wij to be the number of times ti appears in sj;
         more sophisticated schemes will be discussed later.

   Hence G(T, S, W) is a weighted bipartite graph, where W = [wij] is
    the weight matrix with one row per term and one column per
    sentence (n-by-m with the indexing above), and we wish to compute
    saliency scores u(ti) and v(sj).
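As a minimal sketch of the simplest weighting scheme above (raw counts), the following builds W from tokenized sentences. The function name `build_weight_matrix`, the toy tokens, and the layout (rows are terms, columns are sentences) are assumptions made for illustration; stop-word removal and stemming are assumed to have been applied already.

```python
from collections import Counter


def build_weight_matrix(sentences):
    """Return (terms, W) where W[i][j] counts how often terms[i] occurs in sentences[j].

    `sentences` is a list of token lists (hypothetical, already preprocessed).
    """
    # Vocabulary in a fixed (sorted) order so row indices are reproducible.
    terms = sorted({tok for sent in sentences for tok in sent})
    counts = [Counter(sent) for sent in sentences]
    # Counter returns 0 for missing keys, i.e. no edge between term and sentence.
    W = [[c[t] for c in counts] for t in terms]
    return terms, W


# Toy example (made-up tokens).
sentences = [
    ["dna", "mitochondrial", "dna"],
    ["mitochondrial", "ancestor"],
    ["ancestor", "eve", "ancestor"],
]
terms, W = build_weight_matrix(sentences)
```

Each row of `W` is the edge-weight profile of one term across the sentences, and a zero entry means the term and the sentence are not connected in the bipartite graph.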
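   Mutual reinforcement principle:

   The saliency score of a term is determined by the saliency scores
    of the sentences it appears in, and the saliency score of a
    sentence is determined by the saliency scores of the terms it
    contains. Mathematically,

        $$u(t_i) \propto \sum_{j:\, t_i \in s_j} w_{ij}\, v(s_j), \qquad v(s_j) \propto \sum_{i:\, t_i \in s_j} w_{ij}\, u(t_i).$$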
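   Written in matrix format,

        $$u = \frac{1}{\sigma} W v, \qquad v = \frac{1}{\sigma} W^T u,$$

    so u and v are the left and right singular vectors of W
    corresponding to its largest singular value σ.

   We can rank terms and sentences in decreasing order of their
    saliency scores and choose the top-ranked terms or sentences.

   Choose the initial value of v to be the vector of all ones, and
    alternate between the following two steps until convergence:
        1. Compute and normalize: $u = W v$, $u = u / \|u\|$
        2. Compute and normalize: $v = W^T u$, $v = v / \|v\|$
        σ can be computed as $\sigma = u^T W v$ upon convergence.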
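The alternating iteration above can be sketched in NumPy as follows. This is a minimal illustration, assuming W is stored with one row per term and one column per sentence; the function name, the tolerance `tol`, the iteration cap `max_iter`, and the toy matrix are hypothetical choices, not part of the original slides.

```python
import numpy as np


def mutual_reinforcement(W, tol=1e-10, max_iter=1000):
    """Alternate u = Wv and v = W^T u (each normalized) until v stops changing.

    Assumes W is a nonzero, nonnegative terms-by-sentences matrix.
    Returns (u, sigma, v): term scores, largest singular value, sentence scores.
    """
    W = np.asarray(W, dtype=float)
    v = np.ones(W.shape[1])  # initial sentence scores: all ones
    for _ in range(max_iter):
        u = W @ v
        u /= np.linalg.norm(u)
        v_new = W.T @ u
        v_new /= np.linalg.norm(v_new)
        converged = np.linalg.norm(v_new - v) < tol
        v = v_new
        if converged:
            break
    sigma = u @ W @ v
    return u, sigma, v


# Toy 4-term, 3-sentence weight matrix (made-up counts).
W = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [1, 0, 2]], dtype=float)
u, sigma, v = mutual_reinforcement(W)
sentence_ranking = np.argsort(-v)  # sentence indices, most salient first
term_ranking = np.argsort(-u)      # term indices, most salient first
```

Because W is nonnegative and the iteration starts from a positive vector, the scores stay nonnegative, so they can be sorted directly to rank terms and sentences.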
   The above weighted bipartite graph can be extended by adding
    vertex weights to the sentences and/or the terms.

   For example, the weight of a sentence vertex can be increased
    if it contains certain bonus words.

   In general, let DT and DS be two diagonal matrices whose diagonal
    elements represent the weights of the terms and the sentences,
    respectively. We then compute the largest singular value triplet
    {u, σ, v} of the scaled matrix DT W DS.
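A small sketch of the vertex-weighted variant, assuming W has one row per term and one column per sentence and using `numpy.linalg.svd` instead of the iteration. The function name and the example weights (sentence 2 doubled, as if it contained a bonus word) are hypothetical.

```python
import numpy as np


def weighted_saliency(W, term_weights, sentence_weights):
    """Largest singular triplet (u, sigma, v) of D_T W D_S.

    term_weights and sentence_weights hold the diagonals of D_T and D_S.
    """
    W = np.asarray(W, dtype=float)
    dt = np.asarray(term_weights, dtype=float)
    ds = np.asarray(sentence_weights, dtype=float)
    scaled = dt[:, None] * W * ds[None, :]  # same as diag(dt) @ W @ diag(ds)
    U, s, Vt = np.linalg.svd(scaled, full_matrices=False)
    u, sigma, v = U[:, 0], s[0], Vt[0, :]
    if u.sum() < 0:  # singular vectors are defined up to a joint sign flip
        u, v = -u, -v
    return u, sigma, v


# Two sentences that are symmetric in W; the second gets a doubled vertex weight.
W = np.array([[1, 0],
              [1, 1],
              [0, 1]], dtype=float)
u, sigma, v = weighted_saliency(W, term_weights=[1, 1, 1], sentence_weights=[1, 2])
```

With unit weights the two sentences would score identically; doubling the second sentence's vertex weight moves it ahead in the ranking.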
Clustering Sentences into Topical Groups
   The saliency scores can be more effective if they are computed
    within each topical group of a document.

   For sentence clustering we build an undirected weighted graph
    whose vertices represent sentences. Two sentences si and sj are
    linked if they share terms, and the edge weight wij indicates the
    similarity between si and sj; there are many different ways to
    specify it.

   Sentences are arranged in linear order, and near-by sentences tend
    to be about the same topic.

   That topical groups are usually made of sections of consecutive
    sentences is a strong prior, which we call the sentence link prior.
Incorporating Sentence Link Priors
   We call si and sj near-by if si is followed by sj. A simple
    approach to take advantage of the sentence link prior is to modify
    the weights:

        $$\hat{w}_{ij} = \begin{cases} w_{ij} + \alpha & \text{if } s_i \text{ and } s_j \text{ are near-by,} \\ w_{ij} & \text{otherwise.} \end{cases}$$

   We call α the sentence link prior, and use the idea of
    generalized cross-validation (GCV) to choose α. The modified
    weight matrix is denoted WS(α).

   Note that incorporating the sentence link prior is different from
    text segmentation: we do allow several sections of consecutive
    sentences to form a single topical group.
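The weight modification can be sketched as below, under the assumption that the prior is additive (α is added to the similarity of each pair of adjacent sentences) and that the result is kept symmetric. The function name `add_link_prior` and the toy similarity values are hypothetical.

```python
import numpy as np


def add_link_prior(WS, alpha):
    """Return a copy of the sentence similarity matrix with alpha added to near-by pairs.

    Sentences i and i+1 (adjacent in the document's linear order) are treated
    as near-by; this additive, symmetric form is an assumption of the sketch.
    """
    WS = np.asarray(WS, dtype=float)
    out = WS.copy()
    idx = np.arange(WS.shape[0] - 1)
    out[idx, idx + 1] += alpha  # entry for (s_i, s_{i+1})
    out[idx + 1, idx] += alpha  # mirror entry keeps the matrix symmetric
    return out


# Toy 3-sentence similarity matrix (made-up values).
WS = np.array([[1.0, 0.2, 0.0],
               [0.2, 1.0, 0.5],
               [0.0, 0.5, 1.0]])
WS_alpha = add_link_prior(WS, alpha=3.5)
```

Only the entries just above and below the diagonal change, so sentences that are not adjacent can still end up in the same topical group through their original similarity.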
   For fixed α, apply the spectral clustering technique to obtain a
    partition Π*(α). Define γ(Π) to be the number of consecutive
    sentence segments that Π generates; we compute a function of α as:

   We then select the α that maximizes the above function as the
    estimated optimal α value.
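The quantity γ(Π) can be illustrated with a short sketch, assuming a partition is given as one cluster label per sentence in document order and that a segment is a maximal run of consecutive sentences with the same label. The function name `count_segments` and the example labels are hypothetical.

```python
from itertools import groupby


def count_segments(labels):
    """Number of maximal runs of consecutive sentences sharing a cluster label."""
    return sum(1 for _ in groupby(labels))


# Seven sentences in three clusters; cluster 0 is split into two separate runs.
labels = [0, 0, 1, 1, 0, 2, 2]
segments = count_segments(labels)
```

A clustering that follows the linear order closely produces few segments, while one that scatters each cluster across the document produces many.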
Sum-of-Squares Cost Function and
Spectral Relaxation
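   In the bipartite graph G(T, S, W) each sentence is represented
    by a column of W = [w1,…,wn], which we call a sentence vector
    (here W has m rows, one per term, and n columns, one per
    sentence). A partition Π of the sentences into k clusters can be
    written as:

        $$W E = [W_1, \ldots, W_k],$$

        where E is a permutation matrix and Wi is m-by-ni.

   For a given partition, the sum-of-squares cost function is:

        $$\sum_{i=1}^{k} \sum_{w \in W_i} \| w - m_i \|^2, \qquad m_i = \frac{1}{n_i} \sum_{w \in W_i} w,$$

        where w ranges over the columns of Wi, mi is the centroid of the
         sentence vectors in cluster i, and ni is the number of sentences
         in cluster i.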
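As a small sketch of the sum-of-squares cost, assuming sentence vectors are the columns of W and the partition is given as one cluster label per column. The function name and the toy data are hypothetical.

```python
import numpy as np


def sum_of_squares_cost(W, labels):
    """Sum over clusters of squared distances from each sentence vector to its centroid."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    cost = 0.0
    for c in np.unique(labels):
        Wi = W[:, labels == c]               # sentence vectors in cluster c
        mi = Wi.mean(axis=1, keepdims=True)  # centroid of cluster c
        cost += float(np.sum((Wi - mi) ** 2))
    return cost


# Three sentence vectors (columns): (1, 0), (3, 0), (0, 4).
W = np.array([[1.0, 3.0, 0.0],
              [0.0, 0.0, 4.0]])
cost = sum_of_squares_cost(W, labels=[0, 0, 1])
```

Here the first cluster has centroid (2, 0), contributing 1 + 1, and the singleton cluster contributes 0, so the cost is 2.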
   The traditional K-means algorithm is iterative, and in each
    iteration the following is performed:
        For each w, find the mi that is closest to it, and associate w with this mi.
        Compute a new set of centers.
        Its major drawback is that it is prone to local minima, which can
         give rise to clusters with very few data points.

   An equivalent formulation can be derived as a matrix trace
    maximization problem, which also makes the K-means method easily
    adaptable to utilizing the sentence link priors.
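The K-means iteration described above can be sketched as follows, assuming sentence vectors are the columns of W and the initial centers are supplied by the caller. The function name, the stopping rule, and the toy data are assumptions for illustration.

```python
import numpy as np


def kmeans(W, centers, n_iter=100):
    """Plain K-means on the columns of W (m-by-n), starting from centers (m-by-k)."""
    W = np.asarray(W, dtype=float)
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(n_iter):
        # Squared distance from every center to every sentence vector: shape (k, n).
        dist = ((W[:, None, :] - centers[:, :, None]) ** 2).sum(axis=0)
        labels = dist.argmin(axis=0)  # assign each w to its closest center
        new_centers = centers.copy()
        for i in range(centers.shape[1]):
            members = labels == i
            if members.any():         # an empty cluster keeps its old center
                new_centers[:, i] = W[:, members].mean(axis=1)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Four sentence vectors (columns) forming two obvious groups.
W = np.array([[0.0, 0.0, 5.0, 6.0],
              [0.0, 1.0, 5.0, 5.0]])
init = np.array([[0.0, 6.0],
                 [0.0, 5.0]])         # initial centers: first and last column
labels, centers = kmeans(W, init)
```

The result depends on the initial centers, which is exactly where the local-minimum problem mentioned above comes from.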
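   Let e be a vector of appropriate dimension with all elements
    equal to one, so that

        $$m_i = \frac{W_i e}{n_i}.$$

   The sum-of-squares cost function can be written as:

        $$\sum_{i=1}^{k} \left( \operatorname{trace}(W_i^T W_i) - \frac{e^T W_i^T W_i e}{n_i} \right) = \operatorname{trace}(W^T W) - \operatorname{trace}(X^T W^T W X),$$

        where, with the columns of W ordered by cluster,
         $X = \operatorname{diag}(e/\sqrt{n_1}, \ldots, e/\sqrt{n_k})$ is the n-by-k
         block-diagonal matrix of normalized cluster indicator vectors.

   Since $\operatorname{trace}(W^T W)$ does not depend on the partition, its
    minimum is equivalent to the following problem over X of the above form:

        $$\max_{X} \operatorname{trace}(X^T W^T W X).$$

   Letting X be an arbitrary orthonormal matrix, we obtain a relaxed
    matrix trace maximization problem:

        $$\max_{X^T X = I_k} \operatorname{trace}(X^T W^T W X).$$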
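   An extension of the Rayleigh-Ritz characterization of
    eigenvalues of symmetric matrices shows that the above
    maximum is achieved by the eigenvectors corresponding to the k
    largest eigenvalues of the Gram matrix $W^T W$. We also have the
    following inequality:

        $$\min_{\Pi} \sum_{i=1}^{k} \sum_{w \in W_i} \| w - m_i \|^2 \;\ge\; \operatorname{trace}(W^T W) - \sum_{i=1}^{k} \lambda_i(W^T W),$$

        where $\lambda_1 \ge \lambda_2 \ge \cdots$ are the eigenvalues of $W^T W$.

   This gives a lower bound for the minimum of the sum-of-squares
    cost function. In particular, we can replace $W^T W$ by
    WS(α) after incorporating the link strength.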
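The lower bound can be checked numerically on a small example. The sketch below assumes sentence vectors are the columns of W, uses `numpy.linalg.eigvalsh` for the eigenvalues of the Gram matrix, and compares the bound with the best cost found by brute-force enumeration of all 2-cluster partitions; the random data and function names are hypothetical.

```python
import itertools

import numpy as np


def sum_of_squares_cost(W, labels):
    """Sum of squared distances of each column of W to its cluster centroid."""
    labels = np.asarray(labels)
    cost = 0.0
    for c in np.unique(labels):
        Wi = W[:, labels == c]
        cost += float(np.sum((Wi - Wi.mean(axis=1, keepdims=True)) ** 2))
    return cost


def spectral_lower_bound(W, k):
    """trace(W^T W) minus the sum of the k largest eigenvalues of W^T W."""
    gram = W.T @ W
    eigvals = np.linalg.eigvalsh(gram)  # returned in ascending order
    return float(np.trace(gram) - eigvals[-k:].sum())


rng = np.random.default_rng(0)
W = rng.random((5, 6))                  # 5 terms, 6 sentences (made-up data)
k = 2
bound = spectral_lower_bound(W, k)
best_cost = min(
    sum_of_squares_cost(W, labels)
    for labels in itertools.product(range(k), repeat=W.shape[1])
    if len(set(labels)) == k            # keep only partitions with k nonempty clusters
)
```

The gap between `best_cost` and `bound` indicates how much is lost by relaxing the cluster indicator matrix to an arbitrary orthonormal matrix on this particular data.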
   The cluster label assignment is done by QR decomposition
    with pivoting:
       Compute the k eigenvectors Vk = [v1,…,vk] of WS(α) corresponding to
        the largest k eigenvalues.
       Compute the pivoted QR decomposition of $V_k^T$ as

            $$V_k^T P = Q R = Q\,[R_{11}, R_{12}],$$

        where Q is a k-by-k orthogonal matrix, R11 is a k-by-k upper
        triangular matrix, and P is a permutation matrix.
       Compute

            $$\hat{R} = R_{11}^{-1} [R_{11}, R_{12}] P^T = [I_k, R_{11}^{-1} R_{12}] P^T.$$

   Then the cluster label of each sentence is determined by the row
    index of the largest element in absolute value of the corresponding
    column of $\hat{R}$.
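The label assignment can be sketched in NumPy as follows. This is an illustration under stated assumptions: the pivot columns are chosen greedily by largest residual norm, which is the selection rule of QR with column pivoting, and the sketch uses the identity that the product of R11 inverse, [R11, R12] and the transpose of P equals the inverse of the k pivot columns of Vk^T applied to Vk^T, so Q and R are never formed explicitly. The function names and the toy similarity matrix are hypothetical.

```python
import numpy as np


def pivot_columns(A):
    """Indices of the k pivot columns that QR with column pivoting would select."""
    R = np.asarray(A, dtype=float).copy()
    pivots = []
    for _ in range(R.shape[0]):
        norms = np.linalg.norm(R, axis=0)
        j = int(np.argmax(norms))      # column with the largest residual norm
        pivots.append(j)
        q = R[:, j] / norms[j]
        R = R - np.outer(q, q @ R)     # remove the direction q from every column
    return pivots


def spectral_labels(WS, k):
    """Cluster labels from the top-k eigenvectors of the symmetric matrix WS."""
    _, vecs = np.linalg.eigh(WS)       # eigenvalues in ascending order
    A = vecs[:, -k:].T                 # V_k^T, shape (k, n)
    piv = pivot_columns(A)
    R_hat = np.linalg.solve(A[:, piv], A)
    return np.argmax(np.abs(R_hat), axis=0)


# Toy similarity matrix: sentences 0-2 and sentences 3-4 form two tight groups.
WS = np.array([[1.00, 0.90, 0.90, 0.05, 0.05],
               [0.90, 1.00, 0.90, 0.05, 0.05],
               [0.90, 0.90, 1.00, 0.05, 0.05],
               [0.05, 0.05, 0.05, 1.00, 0.80],
               [0.05, 0.05, 0.05, 0.80, 1.00]])
labels = spectral_labels(WS, k=2)
```

The label values themselves are arbitrary (they follow the pivot order), so only the grouping of sentences is meaningful.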
Experimental Results
   Evaluation is a challenging task: human-generated summaries
    tend to differ from one another. Another approach is to
    evaluate summarization methods extrinsically, by their
    performance on, for example, document retrieval or text
    categorization.

   We collect 10 documents, manually divide each one into topical
        Note that the clustering is not unique: some clusters can merge
         into a bigger cluster and some can be split into finer structures.
   In processing the documents, we
       Delete stop words and apply Porter’s stemming.
       Construct WS = (wij): each sentence is represented by a column of
        W, and wij is equal to the dot product of si and sj.
       Weight the sentence vectors with tf.idf weighting and
        normalize them to have Euclidean length one.
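The construction of WS can be sketched as below. The slides do not give the exact tf.idf formula, so this sketch assumes idf = log(number of sentences / number of sentences containing the term), with sentences playing the role of documents; the function name and the toy counts are hypothetical.

```python
import numpy as np


def sentence_similarity(counts):
    """WS = W^T W for tf.idf-weighted, unit-length sentence vectors.

    counts is a terms-by-sentences matrix of raw term counts.
    """
    counts = np.asarray(counts, dtype=float)
    n_sentences = counts.shape[1]
    df = np.count_nonzero(counts, axis=1)         # sentences containing each term
    idf = np.log(n_sentences / np.maximum(df, 1)) # assumed idf variant
    W = counts * idf[:, None]
    norms = np.linalg.norm(W, axis=0)
    W = W / np.where(norms > 0, norms, 1.0)       # leave all-zero columns untouched
    return W.T @ W


# 4 terms, 3 sentences (made-up counts); sentences 0 and 2 share no terms.
counts = np.array([[2, 0, 0],
                   [1, 1, 0],
                   [0, 1, 2],
                   [0, 0, 1]])
WS = sentence_similarity(counts)
```

Because every column has unit length, each entry of `WS` is the cosine similarity between two sentences, and sentences with no shared terms get similarity zero.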

   To measure the quality of clustering, we assume the manually
    generated section number is the true cluster label.

   Here we use a greedy algorithm to compute a sub-optimal
   For a sequence of α values, apply the spectral clustering
    algorithm to the weight matrix WS(α) of the document dna.

   We also plot the clustering accuracy against α, and contrast
    the clustering results with and without sentence link priors.

   The clustering algorithm matches the section structure poorly
    when there are no near-by sentence constraints (i.e., α = 0); with
    too large an α, sentence similarities are overwhelmed by the link
    strength, and the results are also poor.
   The GCV method is quite effective at choosing a good α. In Table 1,
    the estimated α may differ from the optimal α, but it still
    produces accuracy that matches the best values well.
   For the computation of keyphrase and sentence saliency scores,
    we apply sentence weights when applying the mutual reinforcement
    principle. The i-th sentence receives the weight:

   The idea is to mitigate the influence of long sentences by
    scaling by a factor proportional to the sentence length; at the
    same time, sentences close to the beginning of the document
    get a small boost.
   We use the document dna for illustration. For α = 3.5, the
    clustering matches the section structure well except for cluster 8.
        Sentences 1 to 4 discuss issues related to the common ancestor “Eve”,
         and section 4, with the heading “Defining mitochondrial ancestors”,
         is about the same topic.
        Here sentence similarities win over sentence link strength.

   We also applied the mutual reinforcement principle to all
    sentences and extracted the first few keywords and sentences.
Conclusions
   We presented a novel method for simultaneous keyphrase
    extraction and generic text summarization.
        We explore the sentence link priors embedded in the linear order of a
         document to enhance clustering quality.
        We also develop the mutual reinforcement principle to compute
         keyphrase and sentence saliency scores within each topical group.

   Many issues need further investigation:
        More research is needed on choosing the optimal α.
        Other possible ways of clustering, for example a 2-stage
         method: 1) segment sentences, 2) cluster into topical groups.
        Replacing simple terms with noun phrases, which will affect
         W and WS.
        Extension to translingual summarization.