									            Extraction
Chapter 3
in
Automatic Summarization

Kyoung-Soo Han
2001-11-08
Natural Language Processing Lab, Korea University
                            Contents
   Introduction
   The Edmundsonian paradigm
   Corpus based sentence extraction
         General considerations
         Aspects of learning approaches
   Coherence of extracts
   Conclusion




                                                                    Introduction

                              Extraction
   Extraction (discussed here)
         Analysis phase dominates.
         This analysis is relatively shallow.
         Discourse level information, if used at all, is mostly for …
           – establishing coreference between proper names
           – pronoun resolution
   Extraction is not appropriate for every summarization task.
         At high compression rates
           – extraction seems less likely to be effective, unless some pre-
             existing highly compressed summary material is found.
         In multi-document summarization
           – both differences and similarities between documents need to be
             characterized.
         Human abstractors produce abstracts, not extracts.


                      Extraction element
   The basic unit of extraction is the sentence.
   Reasons for preferring the sentence to the paragraph
         It offers better control over compression
         Linguistic motivation
           – The sentence has historically served as a prominent unit in syntactic
             and semantic analysis.
           – Logical accounts of meaning offer precise notions of sentential
             meaning.
               o Sentences can be represented in a logical form, and taken to denote
                 propositions.

   The extraction of elements below the sentence level
         The extracts will often be fragmentary in nature.
   The sentence seems a natural unit to consider in the
    general case.
                                                            The Edmundsonian paradigm

    Classic work of Edmundson (1969)
   Used a corpus of 200 scientific papers on chemistry.
         Each paper between 100 and 3900 words long.
         The target extracts were prepared manually.
   Features
         Title words
           – Words from the title, subtitles, and headings
           – given a hand-assigned weight
         Cue words
           – Extracted from the training corpus based on selection ratio
               o Selection ratio = # of occurrences in extract / # of occurrences in all
                 sentences of the corpus
           – Bonus words
               o Evidence for selection: above an upper selection ratio threshold
               o comparatives, superlatives, adverbs of conclusion, value terms, relative
                 interrogatives, causality terms
           – Stigma words
               o Evidence for non-selection: below a lower selection ratio cutoff
               o Anaphoric expressions, belittling expressions, insignificant detail expressions,
                 hedging expressions
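
The selection ratio above can be sketched in a few lines of Python. This is a minimal illustration, not Edmundson's procedure: the toy corpus, the function names, and the `upper` and `lower` thresholds are all assumptions made for the example.

```python
from collections import Counter

def selection_ratios(corpus):
    """corpus: list of (sentence_tokens, in_extract) pairs.

    Returns, for each word, occurrences in extract sentences divided by
    occurrences in all sentences of the corpus.
    """
    in_extract = Counter()
    overall = Counter()
    for tokens, selected in corpus:
        for tok in tokens:
            overall[tok] += 1
            if selected:
                in_extract[tok] += 1
    return {w: in_extract[w] / overall[w] for w in overall}

def cue_dictionaries(ratios, upper=0.6, lower=0.1):
    """Split words into bonus and stigma sets (thresholds are hypothetical)."""
    bonus = {w for w, r in ratios.items() if r > upper}
    stigma = {w for w, r in ratios.items() if r < lower}
    return bonus, stigma

# Toy training corpus: each sentence is marked as in or out of the extract.
corpus = [
    (["results", "significant"], True),
    (["perhaps", "results"], False),
    (["significant", "improvement"], True),
    (["perhaps", "unclear"], False),
]
ratios = selection_ratios(corpus)
bonus, stigma = cue_dictionaries(ratios)
```

Words whose ratio falls between the two thresholds end up in neither set, which matches the idea that only strong evidence for selection or non-selection counts.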


    Classic work of Edmundson (1969)
   Features (continued)
         Keywords
           – The word frequencies were tabulated in descending order
               o Until a given cutoff percentage of all the word occurrences in the document
                 were reached
           – Non-cue words above that threshold were extracted as key words.
           – Each word's weight is its frequency in the document.
         Sentence location
           – Heading weight
               o Short list of particular section headings was constructed.
                     •   Like “Introduction” and “Conclusion”
               o Sentences occurring under such headings were assigned a positive weight.
           – Ordinal weight
               o Sentences were assigned weights based on their ordinal position.
               o If they occurred in the first or last paragraph, or if they were the first or last
                 sentences of paragraphs, they were assigned a positive weight.





    Classic work of Edmundson (1969)
   Sentence scoring
         Based on a linear function of the weights of each feature
                W(s) = \alpha C(s) + \beta K(s) + \gamma L(s) + \delta T(s)
                where C, K, L, and T are the cue, key, location, and title scores of sentence s
         Edmundson adjusted by hand the feature weights and the
          tuning parameters
           – by feedback from comparisons against manually created training
             extracts
   Evaluations
         Key words were poorer than the other 3 features.
         The combination of cue-title-location was the best
           – The best individual feature: location, the worst: key words
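
The linear scoring scheme can be sketched as follows. This is an illustrative sketch only: the feature values, the weight settings, and the function names are assumptions, and the weights are set by hand here just as Edmundson tuned his by hand.

```python
def edmundson_score(features, weights):
    """Weighted sum of the cue, key, location, and title scores of one sentence."""
    return sum(weights[name] * features[name] for name in weights)

def extract(sentences, weights, n):
    """Return the indices of the n highest-scoring sentences, in document order."""
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: edmundson_score(sentences[i], weights),
        reverse=True,
    )
    return sorted(ranked[:n])

# Hypothetical per-sentence feature scores.
sentences = [
    {"cue": 1.0, "key": 0.2, "location": 1.0, "title": 1.0},
    {"cue": 0.0, "key": 0.9, "location": 0.0, "title": 0.0},
    {"cue": -1.0, "key": 0.5, "location": 0.0, "title": 1.0},
    {"cue": 1.0, "key": 0.1, "location": 1.0, "title": 0.0},
]

# Cue-title-location combination: the key word weight is switched off.
weights = {"cue": 1.0, "key": 0.0, "location": 1.0, "title": 1.0}
selected = extract(sentences, weights, n=2)
```

Setting a weight to zero is a simple way to compare feature combinations such as cue-title-location against the full set.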





    Feature reinterpretation: cue words
   Cue words → cue phrases
   Cue phrases
         Expressions
           – like “I conclude by”, “this paper is concerned with”, …
         Bonus words, stigma words
         In-text summary cues (indicator phrases)
           – E.g. beginning with “in summary”
   Useful for specific technical domains
   Indicator phrases can be extracted by a pattern
    matching process
         Black(1990): p.49 example





      Feature reinterpretation: key words
   Key words → presence of thematic term features
         Selected based on term frequency
         Including key words of Edmundson
   Thematic Term Assumption
         Relatively more frequent terms are more salient.
         Luhn(1958)
           – Find content words in a document by filtering against a stoplist of
             function words
           – Arrange them by frequency
           – Suitable high-frequency and low-frequency cutoffs were
             estimated from a collection of articles and their abstracts.
   A variant of the thematic term assumption: tf*idf
         Its use in automatic summarization is somewhat less well-
          motivated.
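
A rough sketch of the thematic term assumption is shown below. It is a simplification, not Luhn's method: the stoplist, the `low` and `high` cutoffs, and the plain sum used as a sentence score are all assumptions chosen to keep the example short.

```python
from collections import Counter

# Tiny illustrative stoplist of function words (hypothetical).
STOPLIST = {"the", "a", "of", "in", "and", "is", "to"}

def thematic_terms(tokens, low=2, high=None):
    """Content words whose frequency lies between the low and high cutoffs."""
    counts = Counter(t.lower() for t in tokens if t.lower() not in STOPLIST)
    return {
        w: c
        for w, c in counts.items()
        if c >= low and (high is None or c <= high)
    }

def sentence_score(sentence_tokens, terms):
    """Sum of the document frequencies of the thematic terms in the sentence."""
    return sum(terms.get(t.lower(), 0) for t in sentence_tokens)

doc = "the extraction of sentences is shallow and the extraction of sentences is fast".split()
terms = thematic_terms(doc)
score = sentence_score(["Extraction", "is", "fast"], terms)
```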


          Feature reinterpretation: location
   Baxendale(1958)
         Found that important sentences were located at the beginning
          or end of paragraphs.
         Salient sentences were likely to occur as …
           – First sentence in the paragraph 85% of the time
           – Last sentence 7% of the time
   Brandow et al.(1995)
         Compared their thematic term based extraction system for
          news (ANES) against Searchable Lead, a system which just
          outputs the leading sentences in order.
         Searchable Lead outperformed ANES
           – Acceptable 87% to 96% of the time
           – Unacceptable case
               o anecdotal, human-interest style lead-ins, documents that contained
                 multiple news stories, stories with unusual structural/stylistic features,
                 …



          Feature reinterpretation: location
   Lin & Hovy(1997)
         Defined Optimal Position Policy(OPP).
         OPP
           – A list of positions in the text in which salient sentences were likely
             to occur.
         For 13,000 Ziff-Davis news articles
           – Title, 1st sentence of 2nd paragraph, 1st sent of 3rd para, …
         For Wall Street Journal
           – Title, 1st sentence of 1st paragraph, 2nd sent of 1st para, …
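
A position policy can be represented as an ordered list of positions that is walked until enough sentences have been picked. The sketch below is an assumption-laden illustration: it uses 0-based (paragraph, sentence) pairs, leaves the title out, and encodes only the positions listed above; the names `WSJ_LIKE` and `ZIFF_LIKE` are hypothetical.

```python
def apply_position_policy(paragraphs, policy, n):
    """Pick up to n sentences by walking the policy in priority order.

    paragraphs: list of paragraphs, each a list of sentence strings.
    policy: ordered list of 0-based (paragraph, sentence) positions.
    """
    picked = []
    for p, s in policy:
        if len(picked) == n:
            break
        # Skip positions that do not exist in this document.
        if p < len(paragraphs) and s < len(paragraphs[p]):
            picked.append(paragraphs[p][s])
    return picked

# Only the positions named on the slide are encoded (title omitted).
WSJ_LIKE = [(0, 0), (0, 1)]    # 1st and 2nd sentence of 1st paragraph
ZIFF_LIKE = [(1, 0), (2, 0)]   # 1st sentence of 2nd and 3rd paragraph

doc = [["A1", "A2"], ["B1", "B2"], ["C1"]]
wsj_pick = apply_position_policy(doc, WSJ_LIKE, 2)
ziff_pick = apply_position_policy(doc, ZIFF_LIKE, 2)
```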





            Feature reinterpretation: title
   Title words → Add Term
         Weight is assigned to a sentence based on terms in it that are also present
          in the title, article headline, or the user's profile or query.
         A user-focused summary
           – Relatively heavy weight for the Add Term feature
           – Will favor the relevance of the summary to the query or topic.
           – Must be balanced against the fidelity to the source document.
               o Need for the summary to represent information in the document





                                 Criticism
   The Edmundsonian equation is inadequate for
    summarization for the following reasons
         Extracts only single elements in isolation, rather than
          extracting sequences of elements.
           – Incoherent summaries
           – Knowing that a particular sentence has been selected should
             affect the choice of subsequent sentences.
         Compression rate isn't directly referenced in the equation
           – The compression rate should be part of the summarization
             process, not just an afterthought.
               o E.g.
                   • most salient concept A – s1, s2
                   • Next-to-most salient concept B – s3
                   • One-sentence summary: s3
                   • Two-sentence summary: s1, s2





                                Criticism
         A linear equation may not be a powerful enough model for
          summarization.
           – A non-linear model is required for certain applications
               o Spreading activation between words
               o Other probabilistic models
         Uses only shallow, morphological-level features for words
          and phrases in the sentence, along with the sentence's
          location.
           – There has been a body of work which explores different linear
             combinations of syntactic, semantic, and discourse-level
             features.
         Is rather ad hoc.
           – Doesn't tell us anything theoretically interesting about what makes
             a summary a summary.



                                            Corpus based sentence extraction

                 General considerations
   The most interesting empirical work in Edmundsonian
    paradigm has used some variant of Edmundson’s
    equation, leveraging a corpus to estimate the weights.
   Basic methodology for a corpus-based approach to sentence
    extraction
         Figure 3.1 (p. 54)





                                   Labeling
   A training extract is also preferred to a training abstract
         Because it is somewhat less likely to vary across human
          summarizers.
   Producing an extract from an abstract
         Mani & Bloedorn(1998)
           – Treat the abstract as a query.
           – Rank the sentences for similarity to the abstract.
               o Combined-match
                   • Each source sentence is matched against the entire abstract treated as a
                     single sentence.
                   • Equation 3.2 (p. 56)
               o Individual-match
                   •   Each source sentence is compared against each sentence of the
                       abstract.
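
The difference between the two matching schemes can be sketched as below. Equation 3.2 is not reproduced on the slide, so this example substitutes a plain word-overlap (Jaccard) measure as a stand-in; taking the maximum over abstract sentences in `individual_match` is likewise an assumption made for illustration.

```python
def overlap(a, b):
    """Jaccard word overlap, used here as a stand-in similarity measure."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def combined_match(source, abstract):
    """Score each source sentence against the whole abstract as one sentence."""
    whole = [tok for sent in abstract for tok in sent]
    return [overlap(s, whole) for s in source]

def individual_match(source, abstract):
    """Score each source sentence by its best match among abstract sentences."""
    return [max(overlap(s, a) for a in abstract) for s in source]

source = [["cats", "sleep"], ["dogs", "bark", "loudly"]]
abstract = [["cats", "sleep"], ["dogs", "bark"]]

combined = combined_match(source, abstract)
individual = individual_match(source, abstract)
```

Ranking the source sentences by either score and keeping the top ones gives a training extract derived from the abstract.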





                                Labeling
   Producing an extract from an abstract (continued)
         Marcu(1999)
           – Prunes away the clause in the source that is least similar to the
             abstract.
         Jing & McKeown(1999)
           – Word-sequence alignment using HMM
           – Refer to section 3 in Kyoung-Soo's Technical Note KS-TN-200103
   Can result in a score for each sentence
         Yes/no label
         Labeling can be left as a continuous function.





                Learning representation
   The result of learning can be represented as …
         Rules
         Mathematical functions
   If a human is to trust a machine’s summaries
         The machine has to have some way of explaining why it
          produced the summary it did.
   Logical rules are usually preferred to mathematical
    functions.





              Compression & Evaluation
   Compression
         Typically, it is applied at the time of testing.
         It is possible to train a summarizer for a particular
          compression.
           – Different feature combinations may be used for different
             compression rates.
   Evaluation
         Precision, recall, accuracy, F-measure
         Table 3.1/3.2 (p. 59)
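
The standard definitions of these measures, applied to sentence extraction, can be sketched as follows. The function name and the toy index sets are assumptions for illustration; sentences are identified by their index in the source.

```python
def extract_metrics(predicted, gold, n_sentences):
    """Precision, recall, F-measure, and accuracy of an extract.

    predicted, gold: sets of sentence indices chosen by the system and
    by the reference extract. n_sentences: length of the source document.
    """
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)          # selected and in the reference
    fp = len(predicted - gold)          # selected but not in the reference
    fn = len(gold - predicted)          # missed reference sentences
    tn = n_sentences - tp - fp - fn     # correctly left out
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / n_sentences if n_sentences else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

metrics = extract_metrics(predicted={0, 2, 5}, gold={0, 1, 2}, n_sentences=10)
```

Because most sentences in a document are not in the extract, accuracy is dominated by true negatives and tends to look high even for a weak extractor.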





          Aspects of learning approaches
   Sentence extraction as Bayesian classification
         Kupiec et al.(1995)
         188 full text/summary pairs
           – drawn from 21 different collections of scientific articles
           – Summary was written by a professional abstractor and was 3
             sentences long on average.
         Features
           – Sentence length, presence of fixed cue phrases, location,
             presence of thematic terms, presence of proper names
         Bayesian classifier (Equation 3.4, p. 60)
         Producing an extract from the abstract
           – Direct match(79%)
               o identical, or considered to have the same content
           – Direct join(3%)
               o two or more document sentences appear to have the same content
                 as a single summary sentence.



          Aspects of learning approaches
   Sentence extraction as Bayesian classification (cont’d)
         Evaluation
           – 43% recall
           – As the summaries were lengthened performance improved.
               o 84% recall at 25% of the full text length
           – Location was the best feature
           – Location-cue phrase-sentence length was the best combination
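
Sentence extraction as Bayesian classification can be sketched with a small naive Bayes scorer over discrete features. This is an illustrative sketch, not the classifier of Kupiec et al.: the feature names, the toy training data, and the add-one smoothing are assumptions. The score multiplies the prior by the ratio P(F | s in S) / P(F) for each feature, assuming the features are independent.

```python
from collections import Counter

def train(examples):
    """examples: list of (feature_dict, in_summary) pairs."""
    n = len(examples)
    n_pos = sum(1 for _, y in examples if y)
    count_all = Counter()   # how often each (feature, value) occurs overall
    count_pos = Counter()   # how often it occurs in summary sentences
    for feats, y in examples:
        for item in feats.items():
            count_all[item] += 1
            if y:
                count_pos[item] += 1
    return {"n": n, "n_pos": n_pos, "all": count_all, "pos": count_pos}

def score(feats, model):
    """Prior times the product of P(F | in summary) / P(F) over all features."""
    s = model["n_pos"] / model["n"]
    for item in feats.items():
        # Add-one smoothing (an assumption) avoids zero probabilities.
        p_f_given_s = (model["pos"][item] + 1) / (model["n_pos"] + 2)
        p_f = (model["all"][item] + 1) / (model["n"] + 2)
        s *= p_f_given_s / p_f
    return s

# Toy training set with a location feature and a cue phrase feature.
examples = [
    ({"loc": "first", "cue": True}, True),
    ({"loc": "first", "cue": False}, True),
    ({"loc": "mid", "cue": False}, False),
    ({"loc": "mid", "cue": False}, False),
]
model = train(examples)
lead_score = score({"loc": "first", "cue": True}, model)
mid_score = score({"loc": "mid", "cue": False}, model)
```

Sentences are then ranked by score and the top ones are kept up to the desired summary length. With smoothing and the independence assumption, the score is best read as a ranking value rather than a calibrated probability.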





          Aspects of learning approaches
   Classifier combination
         Myaeng & Jang(1999)
           – Tagged each sentence in the Introduction and Conclusion sections
               o According to whether the sentence represented …
                   • Background
                   • Main theme
                   • Explanation of the document structure
                   • Description of future work
           – 96% of the summary sentences were main theme sentences.
           – Training method
               o Used a Bayesian classifier to determine whether a sentence belonged
                 to a main theme
               o Combined evidence from multiple Bayesian feature classifiers using
                 voting
               o Applied a filter to eliminate redundant sentences.
           – Evaluation
               o Cue words-location-title words was the best combination
               o Suggests that the Edmundsonian features are not language-specific.


          Aspects of learning approaches
   Term aggregation
         In a document about a certain topic,
           – There would be many references to that topic.
           – The references need not result in verbatim repetition.
               o Synonym, more specialized word, related term, …
         Aone et al.(1999)
           – Different methods of term aggregation can impact summarization
             performance.
               o Treat morphological variants, synonyms, name aliases as instances
                 of the same term.
           – Performance can be improved
               o When place names and organization names are identified as terms,
               o And when person names are filtered out
               o Reason: document topics are generally not about people.





          Aspects of learning approaches
   Topic-focused summaries
         Lin(1999)
           – Used a corpus, called the Q&A corpus
               o 120 texts (4 topics * 30 relevant docs/topic)
               o Human-created, topic-focused passage extraction summary
           – Features
               o Add-Term: query term
                   •   Sentences are weighted based on the number of query terms they
                       contained.
               o Additional relevance feature
                   •   Relevance feedback weight for terms that occurred in documents most
                       relevant to the topic.
               o Presence of proper name, sentence length
               o Cohesion features
                   •   Number of terms shared with other sentences
               o Numerical expression, pronoun, adjective, reference to specific
                 weekdays or months, presence of quoted speech


          Aspects of learning approaches
   Topic-focused summaries (continued)
         Lin(1999) (continued)
           – Feature combination
               o Naïve combination with each feature given equal weight
               o Decision tree learner
           – Naïve method outperformed the decision tree learner on 3 out of
             4 topics.
           – Baseline method(based on sentence order) also performed well
             on all topics.





          Aspects of learning approaches
   Topic-focused summaries (continued)
         Mani & Bloedorn(1998)
           – Cmp-lg corpus: a set of 198 pairs of full-text docs/abstracts
           – Labeling
               o The overall information need for a user was defined by a set of docs.
               o A subject was told to pick a sample of 10 docs that matched his
                 interests.
               o Top content words were extracted from each doc.
               o Words for the 10 docs were sorted by their scores.
               o All words more than 2.5 standard deviations above the mean of
                 these words' scores were treated as a representation of the user's
                 interest, or topic.
                   •   There were 72 such words.
               o Relevance match
                   • Used spreading activation based on cohesion information to weight word
                     occurrences in the document related to the topic.
                   • Each sentence was weighted based on the average of its word weights.
                   • The top C% of these sentences were picked as positive examples





          Aspects of learning approaches
   Topic-focused summaries (continued)
         Mani & Bloedorn(1998) (continued)
           – Features
              o 2 additional user-interest-specific features
                  • Number of reweighted words(topic keywords) in the sentence
                  • Number of topic keywords / number of content words in the sentence
                  • Specific topic keywords weren't used as features, since it is preferable to
                    learn rules that could transfer across user-interests.
                  • Topic keywords are similar to 'relevance feedback' terms in Lin's study.
              o Location, thematic features
              o cohesion features
                  • Synonymy: judged by using WordNet
                  • Statistical cooccurrence: scores between content words i and j up to 40
                    words apart were computed using mutual information.
                  • Equation 3.5 (p. 65)
                  • Association table only stores scores for tf counts greater than 10 and
                    association scores greater than 10.




          Aspects of learning approaches
   Topic-focused summaries (continued)
         Mani & Bloedorn(1998) (continued)
           – Evaluation
               o In user-focused summaries, the number of topic keywords in a
                 sentence was the single most influential feature.
               o The cohesion features contributed the least,
                   •   Perhaps because the cohesion calculation was too imprecise.
           – Some sample rules (Table 3.4 p.66)
               o The learned rules are highly intelligible, and can perhaps be edited
                 in accordance with human intuitions.
               o The discretization of the features degraded performance by about
                 15%
                   •   There is a tradeoff between accuracy and transparency.





          Aspects of learning approaches
   Case study: Noisy channel model
         There has been a surge of interest in language modeling
          approaches to summarization. (Berger & Mittal 2000)
         The problem of automatic summarization as a translation
          problem
           – translating between a verbose language(of source documents)
             and a succinct language(of summaries)
           – This idea is related to the notion of the abstractor reconstructing
             the author's ideas in order to produce a summary.
         Generic summarization
               s^* = \arg\max_s P(s \mid d) = \arg\max_s P(d \mid s) \, P(s)

               Noisy channel: s → P(d | s) → d → decoder → s*


          Aspects of learning approaches
   Case study: Noisy channel model (continued)
         User-focused summarization
           s^* = \arg\max_s P(s \mid d, q) = \arg\max_s P(q \mid s, d) \, P(s \mid d)
               \approx \arg\max_s P(q \mid s) \, P(s \mid d)
             where P(q | s) measures relevance and P(s | d) measures fidelity
           – fidelity
             P(s \mid d) = l_d(m) \prod_{i=1}^{m} P_d(s_i)
           – relevance
             P(q \mid s) = l_s(k) \prod_{i=1}^{k} P_s(q_i)
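
The relevance-times-fidelity decomposition can be sketched with smoothed unigram models. This is a simplified illustration, not the Berger & Mittal system: the length models l_d and l_s are dropped (treated as constant), add-alpha smoothing is an assumption, and the candidate summaries are supplied by hand. Scores are computed in log space.

```python
import math
from collections import Counter

def unigram_log_prob(tokens, model_tokens, vocab_size, alpha=1.0):
    """Log probability of tokens under a smoothed unigram model of model_tokens."""
    counts = Counter(model_tokens)
    total = len(model_tokens)
    return sum(
        math.log((counts[t] + alpha) / (total + alpha * vocab_size))
        for t in tokens
    )

def best_summary(candidates, document, query):
    """Pick the candidate s maximizing log P(q | s) + log P(s | d)."""
    vocab = set(document) | set(query)
    for c in candidates:
        vocab |= set(c)
    v = len(vocab)

    def score(s):
        fidelity = unigram_log_prob(s, document, v)   # log P(s | d)
        relevance = unigram_log_prob(query, s, v)     # log P(q | s)
        return relevance + fidelity

    return max(candidates, key=score)

document = "the cat sat on the mat the dog barked".split()
candidates = [["the", "cat", "sat"], ["the", "dog", "barked"]]
query = ["dog"]
chosen = best_summary(candidates, document, query)
```

Both candidates are equally faithful to the document here, so the query decides between them. Without a length model, the fidelity term favors shorter candidates, which is one reason the original formulation includes l_d and l_s.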





          Aspects of learning approaches
   Case study: Noisy channel model (continued)
         Training
           – Use FAQ pages on WWW
               o Lists a sequence of question-answer pairs (10,395)
               o Culled from 201 usenet FAQs and 4 call-center FAQs
               o View each answer as the query-focused summary of the document
         Evaluation
           – Assigns the correct summary, on the average, a rank of …
               o 1.41 for usenet
               o 4.3 for the call center data
         Criticism
           – The noisy channel model is appealing
               o Because it decomposes the summarization problem for generic and
                 user-focused summarization in a theoretically interesting way
           – However, the model tends to rely on large quantities of training
             data.



          Aspects of learning approaches
   Conclusion
         The corpus-based approach to sentence extraction is
          attractive because …
           – It allows one to tune the summarizer to the characteristics of the
             corpus or genre of text.
           – Well-established
           – The capability to learn interesting and often quite intelligible rules
         But,
           – Lots of design choices and parameters involved in training
         Issues
           – How is the training to be utilized in an end-application?
           – Learning sequences of sentences to extract deserves more
             attention.
           – Evaluation


                   Coherence of extracts
   When extracting sentences from a source,
         An obvious problem is preserving context.
         Picking sentences out of context can result in incoherent
          summaries
   Coherence problems
         Dangling anaphors
           – If an anaphor is present in a summary extract, the extract may
             not be entirely intelligible if the referent isn't included as well.
         Gaps
           – Breaking the connection between the ideas in a text can cause
             problems.
         Structured environments
           – Itemized lists, tables, logical arguments, etc., cannot be
             arbitrarily divided.


                                  Conclusion
   Abstracts vs. extracts
         The most important aspect of an abstract …
           – Is not so much that it paraphrases the input in its own words,
           – But that some level of abstraction of the input has been carried out
                o Providing a degree of compression
                o Requires knowledge of the meaning of the information talked about
                o And the ability to make inferences at the semantic level
         Extraction methods
           – While knowledge-poor, are not entirely knowledge-free.
           – Knowledge about a particular domain is represented
                o In terms of features specific to that domain
                o In the particular rules or functions learned for that domain
           – The knowledge here is entirely internal.
         There is a fundamental limitation to the capabilities of extraction
          systems.
           – Current attention is focused on achieving compression more
             effectively by producing abstracts automatically.
