Course overview: Introduction to summarization



              Lecture 1
   Instructor: Ani Nenkova
    –   505 Levine, nenkova@seas.upenn.edu
    –   Office hours: Tuesdays 3:15 to 4:15, or by
        appointment


   TA: Annie Louis
    –   lannie@seas.upenn.edu
Textbook

   No required text
     –   Slides/lecture notes and handouts will be given in class

   Recommended
     –   Speech and Language Processing (second edition, 2007,
         Prentice-Hall), by Daniel Jurafsky and James Martin

   Also see
     –   Christopher Manning and Hinrich Schütze, "Foundations of
         Statistical Natural Language Processing"
     –   Advances in Automatic Text Summarization
         Edited by Inderjeet Mani and Mark T. Maybury
Grading

   5 homeworks (65%)
    –   One will be a literature overview assignment
    –   One will be at the end of the semester, instead of
        a final
   You are encouraged to form teams for the
    homework (programming) assignments, but
    all write-ups should be individual
   Midterm (20%)
   Class participation (15%)
    –   "Submit" 5 questions each week
Late submission policy

   5 late days for the semester
    –   Can be used for any assignment with no penalty


   Late submissions after "late days" have been
    used up will not be graded
What you will learn

   A lot about summarization and natural
    language techniques used in summarization

   Tools and resources
    –   Part-of-speech and named entity taggers, parsers,
        WordNet, WEKA
   Problem formalization/distributions
    –   Distributions: Zipfian, Binomial, Multinomial
    –   Graph representations


   System comparisons
    –   Statistical significance and statistical tests
   Reading scientific articles
    –   Part of the assigned readings
    –   Useful skill, regardless of your future job plans


   Improving writing skills
    –   Immensely useful, regardless of your future job plans
    –   The literature overview assignment will focus on this, but in
        other assignments the way you describe your work will also
        be evaluated
What is summarization?
Columbia Newsblaster


   The academic version
What is the input?

   News, or clusters of news
    –   a single article or several articles on a related
        topic
   Email and email thread
   Scientific articles
   Health information: patients and doctors
   Meeting summarization
   Video
What is the output?

   Keywords
   Highlight information in the input
   Chunks of text or speech taken directly from the
    input, or paraphrase and aggregation of the input
    in novel ways
   Modality: text, speech, video, graphics
Ideal stages of summarization

   Analysis
    –   Input representation and understanding


   Transformation
    –   Selecting important content


   Realization
    –   Generating novel text corresponding to the gist of the input
Most current systems

   Use shallow analysis methods
    –   Rather than full understanding


   Work by sentence selection
    –   Identify important sentences and piece them
        together to form a summary
Data-driven approaches

   Relying on features of the input documents
    that can be easily computed from statistical
    analysis

   Word statistics
   Cue phrases
   Section headers
   Sentence position
Knowledge-based systems

   Use more sophisticated natural language
    processing

   Discourse information
    –   Resolve anaphora, text structure
   Use external lexical resources
    –   WordNet, adjective polarity lists, opinion
   Using machine learning
What are summaries useful for?

   Relevance judgments
    –   Does this document contain information I am
        interested in?
    –   Is this document worth reading?


   Save time
   Reduce the need to consult the full document
Multi-document summarization

   Very useful for presenting and organizing
    search results
    –   Many results are very similar, and grouping
        closely related documents helps cover more
        event facets
    –   Summarizing similarities and differences between
        documents
Scientific article summarization

   Not only what the article is about, but also
    how it relates to work it cites

   Determine which approaches are criticized
    and which are supported
    –   Automatic genre-specific summaries are more
        useful than original paper abstracts
Other uses

   Document indexing for information retrieval

   Automatic essay grading, topic identification
    module
Data-driven summarization
Frequency as indicator of importance

   The topic of a document will be repeated
    many times

   In multi-document summarization, important
    content is repeated in different sources
Greedy frequency method

   Compute word probability from input

   Compute sentence weight as function of
    word probability

   Pick best sentence
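
A minimal sketch of the three steps above. The slide does not say how sentence weight is derived from word probability; averaging the probabilities of the words in a sentence is an assumption made here. The function names, the crude tokenizer and the toy input are illustrative only.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep runs of letters; a deliberately crude tokenizer."""
    return re.findall(r"[a-z]+", text.lower())

def word_probabilities(sentences):
    """p(w) = occurrences of w in the input / total tokens in the input."""
    counts = Counter(w for s in sentences for w in tokenize(s))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def sentence_weight(sentence, prob):
    """Assumed weighting: mean probability of the words in the sentence."""
    words = tokenize(sentence)
    if not words:
        return 0.0
    return sum(prob[w] for w in words) / len(words)

def pick_best(sentences):
    """Return the sentence with the highest weight."""
    prob = word_probabilities(sentences)
    return max(sentences, key=lambda s: sentence_weight(s, prob))

# Toy input, made up for illustration.
toy_input = [
    "The strike hurt the airline.",
    "The airline lost money.",
    "Pilots were angry.",
]
best = pick_best(toy_input)
print(best)
```

Repeating the pick on the remaining sentences would build a longer summary, but it would also keep selecting sentences that say the same thing.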
How to deal with redundancy?

 Author JK Rowling has won her legal battle in a
  New York court to get an unofficial Harry Potter
  encyclopaedia banned from publication.

 A U.S. federal judge in Manhattan has sided with
   author J.K. Rowling and ruled against the
   publication of a Harry Potter encyclopedia created
   by a fan of the book series.

 –   Shallow techniques not likely to work well
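
One way to see why shallow techniques struggle here is to measure plain word overlap between the two sentences. The sketch below uses Jaccard similarity over lowercased word sets; the choice of measure and the tokenizer are assumptions, not something the slide prescribes.

```python
import re

def word_set(sentence):
    """Set of lowercased letter-only tokens."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

def jaccard(a, b):
    """Size of the shared vocabulary divided by size of the combined vocabulary."""
    wa, wb = word_set(a), word_set(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

s1 = ("Author JK Rowling has won her legal battle in a New York court "
      "to get an unofficial Harry Potter encyclopaedia banned from publication.")
s2 = ("A U.S. federal judge in Manhattan has sided with author J.K. Rowling "
      "and ruled against the publication of a Harry Potter encyclopedia "
      "created by a fan of the book series.")

overlap = jaccard(s1, s2)
print(round(overlap, 2))
```

The two sentences report the same event, yet fewer than half of their words are shared. Surface variation such as "JK" versus "J.K." and "encyclopaedia" versus "encyclopedia" does not match at all under this tokenizer.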
Global optimization for content selection

   What is the best summary? vs What is the
    best sentence?

   Form all summaries and choose the best
    –   What is the problem with this approach?
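
One possible answer to the question above is cost: the number of candidate summaries grows combinatorially with the input size. The sketch below assumes a summary is any set of exactly k sentences (real length limits are usually in words) and uses a hypothetical scoring function that counts distinct words.

```python
from itertools import combinations
from math import comb

def count_candidate_summaries(n, k):
    """Number of distinct k-sentence subsets of an n-sentence input."""
    return comb(n, k)

def distinct_words(summary):
    """Hypothetical summary score: how many different words it contains."""
    return len(set(" ".join(summary).lower().split()))

def best_summary_exhaustive(sentences, k, score=distinct_words):
    """Form every k-sentence summary and keep the highest-scoring one."""
    return max(combinations(sentences, k), key=score)

# Toy input, made up for illustration.
toy_input = ["a b", "a c", "d e"]
best = best_summary_exhaustive(toy_input, 2)

# Choosing 5 sentences out of 100 already gives tens of millions of candidates.
n_candidates = count_candidate_summaries(100, 5)
print(best, n_candidates)
```

Exhaustive search is only feasible for very small inputs, which is one reason to look for greedy or approximate selection methods.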
Sentence clustering for theme identification

1. PAL was devastated by a pilots' strike in June and
   by the region's currency crisis.

2. In June, PAL was embroiled in a crippling three-week
   pilots' strike.

3. Tan wants to retain the 200 pilots because they
   stood by him when the majority of PAL's pilots
   staged a devastating strike in June.
   Cluster sentences from the input into similar
    themes

   Choose one sentence to represent a theme

   Consider bigger themes as more important
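
The three steps above can be sketched as a single-pass clustering. Everything specific here is an assumption: Jaccard word overlap as the similarity, a threshold of 0.2, comparing each sentence only with the first member of a cluster, and using that first member as the theme's representative.

```python
import re

def word_set(sentence):
    return set(re.findall(r"[a-z]+", sentence.lower()))

def similarity(a, b):
    """Jaccard overlap of the two sentences' word sets."""
    wa, wb = word_set(a), word_set(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def cluster_sentences(sentences, threshold=0.2):
    """Put each sentence in the first cluster it resembles, else start a new one."""
    clusters = []
    for s in sentences:
        for c in clusters:
            if similarity(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    # Bigger themes first.
    return sorted(clusters, key=len, reverse=True)

sentences = [
    "PAL was devastated by a pilots' strike in June and by the region's "
    "currency crisis.",
    "In June, PAL was embroiled in a crippling three-week pilots' strike.",
    "Tan wants to retain the 200 pilots because they stood by him when the "
    "majority of PAL's pilots staged a devastating strike in June.",
    "Author JK Rowling has won her legal battle in a New York court to get "
    "an unofficial Harry Potter encyclopaedia banned from publication.",
]
themes = cluster_sentences(sentences)
representatives = [c[0] for c in themes]
for rep, theme in zip(representatives, themes):
    print(len(theme), rep)
```

With these settings the three PAL sentences fall into one theme and the unrelated Rowling sentence forms its own, smaller theme.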
Using graph representations

   Nodes
    –   Sentences
    –   Discourse entities


   Arcs
    –   Between similar sentences
    –   Between related entities
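
A minimal sketch of the sentence variant of this representation: one node per sentence, and an arc between two sentences whose word overlap passes a threshold. The similarity measure, the threshold of 0.2 and the use of node degree as an importance signal are assumptions; the slide only names the nodes and arcs.

```python
import re

def word_set(sentence):
    return set(re.findall(r"[a-z]+", sentence.lower()))

def similarity(a, b):
    """Jaccard overlap of the two sentences' word sets."""
    wa, wb = word_set(a), word_set(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def build_sentence_graph(sentences, threshold=0.2):
    """Adjacency sets keyed by sentence index; arcs are undirected."""
    graph = {i: set() for i in range(len(sentences))}
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if similarity(sentences[i], sentences[j]) >= threshold:
                graph[i].add(j)
                graph[j].add(i)
    return graph

sentences = [
    "PAL was devastated by a pilots' strike in June and by the region's "
    "currency crisis.",
    "In June, PAL was embroiled in a crippling three-week pilots' strike.",
    "Tan wants to retain the 200 pilots because they stood by him when the "
    "majority of PAL's pilots staged a devastating strike in June.",
    "Author JK Rowling has won her legal battle in a New York court to get "
    "an unofficial Harry Potter encyclopaedia banned from publication.",
]
graph = build_sentence_graph(sentences)
# Assumed importance signal: a sentence similar to many others has high degree.
degree = {node: len(neighbours) for node, neighbours in graph.items()}
print(degree)
```

A graph over discourse entities could be built the same way, with entity mentions as nodes and some relatedness test in place of sentence similarity.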
Using machine learning

   Ask people to select sentences
   Use these as training examples for machine
    learning
    –   Each sentence is represented as a number of
        features
    –   Based on the features distinguish sentences that
        are appropriate for a summary and sentences that
        are not
   Run on new inputs
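
The procedure above can be sketched with a tiny perceptron. The slides do not name a learning algorithm or a feature set, so both are assumptions here: the features are a bias term, "is the first sentence" and "contains a cue phrase". The cue list and the labelled sentences are made up for illustration and stand in for human selections.

```python
CUE_PHRASES = ("in summary", "in conclusion")  # hypothetical cue list

def features(sentence, index):
    """[bias, is first sentence, contains a cue phrase]"""
    text = sentence.lower()
    has_cue = any(cue in text for cue in CUE_PHRASES)
    return [1.0, 1.0 if index == 0 else 0.0, 1.0 if has_cue else 0.0]

def predict(weights, x):
    """1 = include in the summary, 0 = leave out."""
    return 1 if sum(w * v for w, v in zip(weights, x)) > 0 else 0

def train_perceptron(examples, epochs=20):
    """Standard perceptron updates on (feature vector, label) pairs."""
    weights = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            error = y - predict(weights, x)
            if error:
                weights = [w + error * v for w, v in zip(weights, x)]
    return weights

# (sentence, position in its document, 1 if a person selected it)
labelled = [
    ("The airline reported record losses.", 0, 1),
    ("In summary, the strike crippled operations.", 3, 1),
    ("Analysts met on Tuesday.", 1, 0),
    ("In conclusion, recovery will take years.", 0, 1),
]
examples = [(features(s, i), y) for s, i, y in labelled]
weights = train_perceptron(examples)

# Run on a new input: the same sentence at two different positions.
print(predict(weights, features("Weather was mild.", 0)),
      predict(weights, features("Weather was mild.", 2)))
```

A real system would use many more features, such as the word statistics and section headers mentioned earlier, and far more training data.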
Information ordering

   In what order to present the selected
    sentences?
    –   An article with permuted sentences will not be
        easy to understand


   Very important for multi-document
    summarization
    –   Sentences coming from different documents
Automatic summary edits

   Some expressions might not be appropriate
    in the new context
    –   References:
              –   he
              –   Putin
              –   Russian Prime Minister Vladimir Putin
    –   Discourse connectives
            However, moreover, subsequently
   Requires more sophisticated NLP techniques
Before


Pinochet was placed under arrest in London Friday by
British police acting on a warrant issued by a Spanish
judge. Pinochet has immunity from prosecution in
Chile as a senator-for-life under a new constitution that
his government crafted. Pinochet was detained in the
London clinic while recovering from back surgery.
After


Gen. Augusto Pinochet, the former Chilean dictator,
  was placed under arrest in London Friday by British
  police acting on a warrant issued by a Spanish
  judge. Pinochet has immunity from prosecution in
  Chile as a senator-for-life under a new constitution
  that his government crafted. Pinochet was detained
  in the London clinic while recovering from back
  surgery.
Before


Turkey has been trying to form a new government
  since a coalition government led by Yilmaz collapsed
  last month over allegations that he rigged the sale of
  a bank. Ecevit refused even to consult with the
  leader of the Virtue Party during his efforts to form a
  government. Ecevit must now try to build a
  government. Demirel consulted Turkey's party
  leaders immediately after Ecevit gave up.
After

Turkey has been trying to form a new government
  since a coalition government led by Prime Minister
  Mesut Yilmaz collapsed last month over allegations
  that he rigged the sale of a bank. Premier-designate
  Bulent Ecevit refused even to consult with the leader
  of the Virtue Party during his efforts to form a
  government. Ecevit must now try to build a
  government. President Suleyman Demirel consulted
  Turkey's party leaders immediately after Ecevit gave
  up.
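
The rewrites in the Before/After pairs follow one pattern: the first mention of a person gets the full description and later mentions keep the short name. Below is a naive sketch of that pattern. It assumes the mapping from short name to full form is already known; a real system would need coreference resolution to find it. The function name is illustrative.

```python
def expand_first_mentions(text, full_forms):
    """Replace only the first occurrence of each short name with its full form."""
    for short, full in full_forms.items():
        text = text.replace(short, full, 1)
    return text

before = ("Pinochet was placed under arrest in London Friday by British "
          "police acting on a warrant issued by a Spanish judge. Pinochet "
          "has immunity from prosecution in Chile as a senator-for-life "
          "under a new constitution that his government crafted.")

# Assumed to be supplied by an earlier analysis step.
full_forms = {"Pinochet": "Gen. Augusto Pinochet, the former Chilean dictator,"}

after = expand_first_mentions(before, full_forms)
print(after)
```

Plain string replacement breaks down when a full form contains another short name or when the first mention is a pronoun, which is part of why this step needs more sophisticated NLP techniques.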