       Initial processing: Documents, Queries

                       Scott Farrar
       CLMA, University of Washington farrar@uw.edu

                       April 6, 2010

  Today’s lecture

       1   Review

       2   Strategies
             Document preparation
             Query/Topic processing

       3   CLASSY 2006, 2007

  Human summaries

       Conroy et al. (2006a)

       For single document summaries, Kupiec, Pedersen, and
       Chen (1995) reported that 79% of the sentences in a
       human-generated abstract were a “direct match” to a
       sentence in a document.
       In contrast, for multi-document summarization, Copeck
       and Szpakowicz (2004) report that no more than 55% of
       the vocabulary contained in human-generated abstracts can
       be found in the given documents.

  Last time

       Key points:

           Access to training data: DUC2006, TREC, etc.
           Complex queries can be transformed into simple queries
             that can be input into a Q/A system
           See website for an open-source Java-based Q/A system
           TF-IDF can be used to find terms that are relevant to a
             given query.

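To make the TF-IDF point concrete, here is a minimal sketch, assuming the common weighting tf = count / document length and idf = log(N / df). The toy documents and the function name `tf_idf` are made up for illustration; other weighting variants are equally valid.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

# Toy collection (hypothetical), already tokenized and lower-cased.
docs = [
    "toyota recall safety fine".split(),
    "toyota sales rise".split(),
    "toyota safety agency fine".split(),
]
scores = tf_idf(docs)
top = max(scores[0], key=scores[0].get)
print(top)  # the term that best distinguishes the first document
```

A term that occurs in every document ("toyota" here) gets an idf of zero, so it contributes nothing; terms concentrated in few documents score highest.
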
  Initial steps of the summarization process (D2)

  Document processing

       (not necessarily ordered)

           sentence segmentation: see the NIST segmenter tool
             (write your own heuristics, cf. Conroy et al., 2007)
           sentence normalization: tokenization, stemming,
             stop-word removal, de-capitalization (except for proper
             nouns)
           sentence re-writing/expansion: produce alternative
             sentences for extraction (Conroy et al., 2007; Toutanova
             et al., 2007)
           sentence reduction: shorten sentences based on
             heuristics (Pingali et al., 2007)
           indexing: sentences or documents with an IR system
             (Stokes et al., 2007)
           sort sentences with respect to the query (include in
             summary, do not include, ...)

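The indexing and sorting steps above can be sketched with a tiny inverted index over sentences. This is only an illustration, not the IR system used by Stokes et al. (2007); the function names `build_index` and `rank` and the sample sentences are hypothetical.

```python
from collections import defaultdict

def build_index(sentences):
    """Map each lower-cased token to the set of sentence ids containing it."""
    index = defaultdict(set)
    for sid, sentence in enumerate(sentences):
        for token in sentence.lower().split():
            index[token.strip(".,;:!?\"'")].add(sid)
    return index

def rank(index, query):
    """Sort sentence ids by how many distinct query terms they contain."""
    hits = defaultdict(int)
    for term in set(query.lower().split()):
        for sid in index.get(term, ()):
            hits[sid] += 1
    # Most matching terms first; ties broken by document order.
    return sorted(hits, key=lambda sid: (-hits[sid], sid))

sentences = [
    "Toyota faces a record fine.",
    "The agency cited a reporting violation.",
    "Toyota said it will pay the fine.",
]
index = build_index(sentences)
print(rank(index, "reporting violation toyota"))
```

Sentences that match no query term never appear in the ranking, which gives a crude "include / do not include" split.
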
  Sentence segmentation

       Punctuation

           Sammy Davis, Jr. said that...
           The average temperature is 56.3 and ...
           They made off with one million! dollars...

       See the NIST segmenter tool.

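The three punctuation cases above can be handled by a small hand-written splitter. The following is a minimal sketch and not the NIST tool: it assumes a sentence boundary is `.`, `!` or `?` followed by whitespace and an upper-case letter or quote, and it uses a short, hypothetical abbreviation list.

```python
import re

# Hypothetical abbreviation list; a real splitter needs a much longer one.
ABBREVIATIONS = {"jr", "sr", "mr", "mrs", "ms", "dr", "prof", "st", "inc", "u.s"}

# Candidate boundary: ., ! or ? followed by whitespace and an upper-case
# letter or an opening quote.
BOUNDARY = re.compile(r"([.!?])\s+(?=[A-Z\"'])")

def split_sentences(text):
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        # Word immediately before the punctuation mark.
        before = text[start:match.start()].split()
        last_word = before[-1].lower() if before else ""
        if match.group(1) == "." and last_word in ABBREVIATIONS:
            continue  # "Dr." and similar do not end the sentence
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = ("Sammy Davis, Jr. said that Dr. Smith was late. "
        "The average temperature is 56.3 and rising. "
        "They made off with one million! dollars, police said.")
for sentence in split_sentences(text):
    print(sentence)
```

The decimal point in "56.3" is never a candidate because no whitespace follows it, and "Jr. said" and "million! dollars" are skipped because a lower-case word follows. The abbreviation list is only needed when a capitalized word follows, as in "Dr. Smith".
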
  Sentence segmentation

       Heuristics

       See Conroy et al. (2006):

           erroneous splits on foreign words, especially names, that
             appear to be English abbreviations;
           erroneous splits on less commonly used abbreviations
             due to their omission from the sentence splitter’s
             abbreviation list as well as to “unexpected” use;
           erroneous splits due to missing or bad punctuation;
           erroneous splits due to ellipsis at sentence end (our
             sentence splitter does not look far enough ahead).

  Stemming to cast a wider net

    Strategy
    Increase recall by treating all forms of a given lemma as the
    same. Recall that the Porter stemmer is based on
    cascaded re-write rules.

    Original
    “Federal safety agency seeks to hit Toyota with largest
    possible fine for a single reporting violation.”

    Stop words
    Remove stop words and transform to lower case:
    “federal safety agency seeks hit toyota largest possible fine
    single reporting violation”

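The stop-word step can be reproduced with a few lines of Python. This is a minimal sketch: the stop list below is hypothetical and only large enough for this sentence, and the tokenizer is a simple regular expression.

```python
import re

# Hypothetical stop list, just large enough for this example; a real system
# would use a fuller list.
STOP_WORDS = {"a", "an", "the", "to", "with", "for", "of", "and", "in", "on"}

def normalize(sentence):
    """Lower-case, strip punctuation, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

original = ("Federal safety agency seeks to hit Toyota with largest "
            "possible fine for a single reporting violation.")
print(" ".join(normalize(original)))
```

Note that this lower-cases everything, as in the slide's example; keeping capitalization for proper nouns would need an extra check before the `lower()` call.
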
                                       Initial processing:
  Results of stemming                     Documents,
                                             Queries

                                         Scott Farrar
                                       CLMA, University
        Original Word   Stemmed Word    of Washington
                                        farrar@uw.edu
        federal         feder
                                       Review
        safety          safeti
                                       Strategies
        agency          agenc          Document preparation
                                       Query/Topic
        seeks           seek           processing

        hit             hit            CLASSY 2006,
                                       2007
        toyota          toyota
        largest         largest
        possible        possibl
        fine             fine
        single          singl
        reporting       report
        violation       violat
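
To make "cascaded re-write rules" concrete, here is a toy cascade in the spirit of the Porter Stemmer. It is an assumption-laden sketch, not the Porter algorithm itself: only a handful of suffix rules are included (picked to cover the words in this table), the conditions are simplified to a single "measure" test, and the names `shape`, `measure`, `STAGES` and `toy_stem` are invented for this example.

```python
import re

def shape(stem):
    """Map letters to c/v, e.g. 'feder' -> 'cvcvc'; y after a consonant is a vowel."""
    out = []
    for ch in stem:
        vowel = ch in "aeiou" or (ch == "y" and bool(out) and out[-1] == "c")
        out.append("v" if vowel else "c")
    return "".join(out)

def measure(stem):
    """Count vowel-run + consonant-run pairs (the 'm' of Porter's conditions)."""
    return len(re.findall(r"v+c+", shape(stem)))

# Each stage lists (suffix, replacement, minimum measure of the remaining stem).
# Only the first suffix that matches in a stage is considered.
STAGES = [
    [("sses", "ss", 0), ("ies", "i", 0), ("ss", "ss", 0), ("s", "", 0)],
    [("ing", "", 1), ("ed", "", 1)],
    [("y", "i", 1)],
    [("ation", "ate", 1), ("enci", "ence", 1)],
    [("al", "", 2)],
]

def apply_stage(word, rules):
    for suffix, replacement, min_m in rules:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) >= min_m else word
    return word

def drop_final_e(word):
    """Remove a final 'e' unless the stem is short and ends consonant-vowel-consonant."""
    if not word.endswith("e"):
        return word
    stem = word[:-1]
    m = measure(stem)
    ends_cvc = shape(stem)[-3:] == "cvc" and stem[-1] not in "wxy"
    return stem if m > 1 or (m == 1 and not ends_cvc) else word

def toy_stem(word):
    # The output of each stage is the input of the next: that is the cascade.
    for rules in STAGES:
        word = apply_stage(word, rules)
    return drop_final_e(word)

for w in ["federal", "safety", "agency", "seeks", "fine", "violation"]:
    print(w, "->", toy_stem(w))
```

The cascade is visible in a word like "agency": the y-rule yields "agenci", the next stage rewrites "enci" to "ence", and the last stage drops the final "e". For real work a full Porter implementation from an established library is the safer choice.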



  Reduce info in sentences

    Trimming
    The more you can reduce the pool of information (document
    set, sentence set, length of docs/sentences), the less
    processing there is down the line. This is known as
    trimming.

    Technique: use linguistically-motivated heuristics to remove
    low content syntactic constituents until a length threshold is
    reached.

    Example
    after 15 years and an investigation involving thousands of
    interviews, canada’s police have arrested the men they say
    masterminded the deadliest-ever bombing of an airplane
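
The "remove until a length threshold is reached" loop can be sketched as follows. This is only an illustration: the `trim` function and the `removable` list are hypothetical, and the candidate spans are listed by hand here, whereas a real system would derive them from a parse or from heuristics.

```python
def trim(sentence, removable, max_words):
    """Delete candidate spans in priority order until the sentence is short enough."""
    for span in removable:
        if len(sentence.split()) <= max_words:
            break  # threshold reached, stop trimming
        # Remove the span and normalize the whitespace it leaves behind.
        sentence = " ".join(sentence.replace(span, " ").split())
    return sentence

sentence = ("after 15 years and an investigation involving thousands of "
            "interviews, canada's police have arrested the men they say "
            "masterminded the deadliest-ever bombing of an airplane")

# Hand-listed, hypothetical low-content candidates, most expendable first.
removable = [
    "after 15 years and an investigation involving thousands of interviews,",
    "of an airplane",
]

print(trim(sentence, removable, max_words=15))
```

With a threshold of 15 words only the preposed adjunct is removed; a lower threshold would continue down the candidate list.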


  Deep processing

    Headline generation
    Algorithm from Dorr et al. (2003) generates headlines in
    TREC:

    Steps
        1   Choose lowest leftmost S with NP, VP
        2   Remove low content units
                 some determiners
                 time expressions
        3   Iterative shortening:
                 XP Reduction
                 Remove preposed adjuncts
                 Remove trailing PPs
                 Remove trailing SBARs
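
The steps above can be sketched over a parse tree stored as nested lists. This is a rough approximation under several assumptions, not Dorr et al.'s implementation: "lowest leftmost S" is read as the deepest qualifying S on the leftmost branch, "some determiners" is taken to mean articles, time expressions are assumed to carry a Penn Treebank `-TMP` function tag, XP Reduction is omitted, and the parse below is hand-built and abbreviated. All function names are hypothetical.

```python
def is_leaf(node):
    return len(node) == 2 and isinstance(node[1], str)

def words(node):
    if is_leaf(node):
        return [node[1]]
    return [w for child in node[1:] for w in words(child)]

def has_np_vp(node):
    labels = [child[0] for child in node[1:]]
    return node[0] == "S" and "NP" in labels and "VP" in labels[labels.index("NP"):]

def lowest_leftmost_s(tree):
    # Step 1: walk down the leftmost branch, keeping the deepest S with NP and VP.
    best, node = None, tree
    while not is_leaf(node):
        if has_np_vp(node):
            best = node
        node = node[1]
    return best or tree

def low_content(node):
    # Step 2: articles, plus anything tagged as a temporal constituent.
    if node[0].endswith("-TMP"):
        return True
    return is_leaf(node) and node[0] == "DT" and node[1].lower() in {"a", "an", "the"}

def prune(node, drop):
    # Copy the tree, leaving out every constituent for which drop() is true.
    if is_leaf(node):
        return node
    kept = [prune(c, drop) for c in node[1:] if not drop(c)]
    return [node[0]] + [c for c in kept if len(c) > 1]

def remove_preposed_adjunct(s):
    # Step 3: drop everything that precedes the subject NP of the chosen S.
    labels = [c[0] for c in s[1:]]
    if "NP" not in labels or labels.index("NP") == 0:
        return False
    del s[1:labels.index("NP") + 1]
    return True

def remove_trailing(s, label):
    # Step 3: drop the deepest constituent with this label on the rightmost branch.
    target, node = None, s
    while not is_leaf(node) and len(node) > 1:
        if node[-1][0] == label:
            target = node
        node = node[-1]
    if target is None:
        return False
    target.pop()
    return True

def headline(tree, max_words):
    s = prune(lowest_leftmost_s(tree), low_content)
    steps = [remove_preposed_adjunct,
             lambda n: remove_trailing(n, "PP"),
             lambda n: remove_trailing(n, "SBAR")]
    for step in steps:
        while len(words(s)) > max_words and step(s):
            pass
    return " ".join(words(s))

# Hand-built, abbreviated parse of the trimming example sentence.
np_air = ["NP", ["DT", "an"], ["NN", "airplane"]]
np_bomb = ["NP", ["NP", ["DT", "the"], ["NN", "bombing"]], ["PP", ["IN", "of"], np_air]]
sbar = ["SBAR", ["S", ["NP", ["PRP", "they"]],
                 ["VP", ["VBP", "say"], ["VP", ["VBD", "masterminded"], np_bomb]]]]
np_men = ["NP", ["NP", ["DT", "the"], ["NNS", "men"]], sbar]
tree = ["S",
        ["PP", ["IN", "after"], ["NP", ["CD", "15"], ["NNS", "years"]]],
        [",", ","],
        ["NP", ["NNP", "canada's"], ["NN", "police"]],
        ["VP", ["VBP", "have"], ["VP", ["VBN", "arrested"], np_men]]]

print(headline(tree, max_words=6))
```

Each shortening rule is applied repeatedly until it no longer fires or the threshold is met, and only then does the next rule get a turn, which mirrors the ordering of the list on the slide.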


  Shallow processing

    Trimming
    Shallow algorithm from Conroy et al. (2006):

    Steps
        1   We remove extraneous words that appear in a sentence, including
            date lines, editor’s comments, and so on.
        2   We remove many adverbs and all conjunctions, including phrases
            such as “As a matter of fact,” and “At this point,” that occur at
            the start of a sentence.
        3   We remove small selections of words that occur in the middle of a
            sentence, such as “, however,” and “, also,” (not always requiring
            the commas).
        4   For DUC 2006, we added the removal of ages such as “, 51,” or “,
            aged 24,”.
        5   We remove gerund phrases (phrases starting with the -ing form of
            a word) from the start, middle, or end of a sentence when possible.
        6   We remove relative clause attributives (clauses beginning with
            “who(m), which, when, where”) wherever possible.
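
Because these rules need no parser, several of them can be approximated with regular expressions. The sketch below covers only steps 2 to 4; the pattern lists are hypothetical stand-ins (the slide names the rule types, not the actual word lists), and steps 1, 5 and 6 are left out because they need more context than a pattern can safely provide.

```python
import re

# Hypothetical, hand-picked patterns; not the lists used in the original system.
LEAD_INS = r"^(?:As a matter of fact|At this point|However|But|And),?\s+"   # step 2
MID_WORDS = r",?\s+(?:however|also),?(?=\s)"                                # step 3
AGES = r",\s+(?:aged\s+)?\d{1,3},"                                          # step 4

def shallow_trim(sentence):
    s = re.sub(LEAD_INS, "", sentence, flags=re.IGNORECASE)
    s = re.sub(AGES, "", s)
    s = re.sub(MID_WORDS, "", s, flags=re.IGNORECASE)
    s = re.sub(r"\s+", " ", s).strip()
    # Restore sentence-initial capitalization after a lead-in was removed.
    return s[:1].upper() + s[1:]

print(shallow_trim("At this point, John Smith, 51, however, declined to comment."))
print(shallow_trim("As a matter of fact, the mayor, aged 24, also resigned."))
```

Patterns this loose will over-delete in some contexts (the age rule would also fire inside a comma-separated list of numbers), which is the usual trade-off of shallow trimming against parse-based trimming.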
                                                                   Initial processing:
  Query/topic processing                                              Documents,
                                                                         Queries

                                                                     Scott Farrar
                                                                   CLMA, University
                                                                    of Washington
                                                                    farrar@uw.edu
    (not necessarily ordered)
                                                                   Review
    Ask, “what sentences in the doc set are most like the query,   Strategies
    based on bag of terms” (Conroy et al., 2006/7).                Document preparation
                                                                   Query/Topic
                                                                   processing
        query normalization: tokenization, stemming,               CLASSY 2006,
        stop-word removal                                          2007




15/31
                                                                   Initial processing:
  Query/topic processing                                              Documents,
                                                                         Queries

                                                                     Scott Farrar
                                                                   CLMA, University
                                                                    of Washington
                                                                    farrar@uw.edu
    (not necessarily ordered)
                                                                   Review
    Ask, “what sentences in the doc set are most like the query,   Strategies
    based on bag of terms” (Conroy et al., 2006/7).                Document preparation
                                                                   Query/Topic
                                                                   processing
        query normalization: tokenization, stemming,               CLASSY 2006,
        stop-word removal                                          2007


        query expansion: research to (trial, effort, step,... )




15/31
                                                                   Initial processing:
  Query/topic processing                                              Documents,
                                                                         Queries

                                                                     Scott Farrar
                                                                   CLMA, University
                                                                    of Washington
                                                                    farrar@uw.edu
    (not necessarily ordered)
                                                                   Review
    Ask, “what sentences in the doc set are most like the query,   Strategies
    based on bag of terms” (Conroy et al., 2006/7).                Document preparation
                                                                   Query/Topic
                                                                   processing
        query normalization: tokenization, stemming,               CLASSY 2006,
        stop-word removal                                          2007


        query expansion: research to (trial, effort, step,... )
        query decomposition approach: refer to Harabagiu et
        al. (2006)




15/31

  Query/topic processing

    (not necessarily ordered)

    Ask, “what sentences in the doc set are most like the query,
    based on bag of terms” (Conroy et al., 2006/7).

        query normalization: tokenization, stemming,
        stop-word removal
        query expansion: e.g., “research” expands to (trial, effort, step, ...)
        query decomposition approach: refer to Harabagiu et
        al. (2006)
        knowledge-rich approach: use ontologies, WordNet,
        etc.

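
A minimal sketch of query normalization followed by a bag-of-terms comparison. The stop list and the suffix-stripping `crude_stem` function are assumptions made for illustration (a real system would use a proper stemmer such as Porter's), and `overlap` is a hypothetical scoring function, not the one from the cited papers.

```python
import re
from collections import Counter

# Tiny illustrative stop list; an assumption, not the list used by any cited system.
STOP_WORDS = {"a", "an", "and", "are", "be", "for", "in", "is", "of", "the", "to", "what", "why"}


def crude_stem(token: str) -> str:
    """Strip a few common suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text: str) -> Counter:
    """Tokenize, drop stop words, stem, and return a bag of terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)


def overlap(query: str, sentence: str) -> int:
    """Count query terms (with multiplicity) that also occur in the sentence."""
    q, s = normalize(query), normalize(sentence)
    return sum(min(count, s[term]) for term, count in q.items())


print(overlap("Why are wetlands important?", "Wetlands protect the coast."))
```

Stemming is what lets "wetlands" in the query match "wetland" in a document sentence; without it the two bags would share no term.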

  DUC2006 queries

    steroid use among female athletes
    Discuss the prevalence of steroid use among female athletes
    over the years. Include information regarding trends, side
    effects and consequences of such use.

    wetlands value and protection
    Why are wetlands important? Where are they threatened?
    What steps are being taken to preserve them? What
    frustrations and setbacks have there been?

    anticipation of and reaction to the premier of Star Wars
    Episode I – The Phantom Menace
    How did fans, media, the marketplace, and critics prepare for
    and react to the movie? Include preparations and reactions
    outside the United States.


  Query types

    Wh-, Y/N
    why?, what?, how?, which?, who?, BE...,?

    Directives
    describe, discuss, include, provide, note, track

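
A minimal sketch that labels a query sentence by its first word, following the two groups above. The word sets come from the slide, except "where" and "when" and the listed forms of BE, which are assumptions added so that DUC-style questions such as "Where are they threatened?" are covered.

```python
import re

WH_WORDS = {"why", "what", "how", "which", "who", "where", "when"}  # last two are additions
BE_FORMS = {"is", "are", "was", "were"}  # assumed forms of BE for yes/no questions
DIRECTIVES = {"describe", "discuss", "include", "provide", "note", "track"}


def query_type(sentence: str) -> str:
    """Return 'wh', 'yn', 'directive', or 'other' based on the first word."""
    words = re.findall(r"[a-z]+", sentence.lower())
    if not words:
        return "other"
    first = words[0]
    if first in WH_WORDS:
        return "wh"
    if first in BE_FORMS:
        return "yn"
    if first in DIRECTIVES:
        return "directive"
    return "other"


print(query_type("Discuss the prevalence of steroid use among female athletes."))
```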

  Topic Signatures

    Definition
    Content selection refers to selecting some sentences (or
    content) for inclusion in the summary, and rejecting others.

    Definition
    Determining what is important or salient is based on the
    terms contained in the individual sentences. The set of
    relevant terms is referred to as the topic signature.


  Signature terms

    Definition
    Signature terms are those terms which occur significantly
    more than expected.

    Some examples:
    tropical storms & train wrecks


  Signature terms

    Tropical storm
    ahmed, allison, andrew, bahamas, bangladesh, bn, caribbean,
    carolina, caused, cent, coast, coastal, croix, cyclone,
    damage, destroyed, devastated, disaster, dollars, drowned,
    flood, flooded, flooding, floods, florida, gulf, ham, hit, homeless,
    homes, hugo, hurricane, insurance, insurers, island, islands,
    lloyd, losses, louisiana, manila, miles, nicaragua, north, port,
    pounds, rain, rains, rebuild, rebuilding, relief, remnants,
    residents, roared, salt, st, storm, storms, supplies, tourists,
    trees, tropical, typhoon, virgin, volunteers, weather, west,
    winds, yesterday.


  Signature terms

    Train wrecks
    accident, accidents, ammunition, beach, bernardino, board,
    boulevard, brake, brakes, braking, cab, car, cargo, cars,
    caused, collided, collision, conductor, coroner, crash, crew,
    crossing, curve, derail, derailed, driver, emergency, engineer,
    engineers, equipment, fe, fire, freight, grade, hit, holland,
    injured, injuries, investigators, killed, line, locomotives,
    maintenance, mechanical, miles, morning, nearby, ntsb,
    occurred, officials, pacific, passenger, passengers, path, rail,
    railroad, railroads, railway, routes, runaway, safety, san,
    santa, shells, sheriff, signals, southern, speed, station, train,
    trains, transportation, truck, weight, wreck


  Finding signature terms

    Two shallow methods
        TF-IDF is a way to combine inverse document frequency (IDF)
        with simple term frequency (TF). It prefers words that are
        frequent in a given doc, but rare in the entire collection.
        Log-likelihood ratio (LLR) is a better way of finding
        these informative words, especially given sparse data. It
        compares how likely a word is to appear in general with
        how likely it is to appear associated with a certain topic.

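
A minimal sketch of TF-IDF over tokenized documents. It assumes raw counts for TF and log(N / df) for IDF, which is one common variant among several; the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: tf * idf} dict per tokenized document.

    tf is the raw count in the document; idf is log(N / df), where N is the
    number of documents and df the number of documents containing the term.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append(
            {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in counts.items()}
        )
    return scores


docs = [["storm", "storm", "coast"], ["train", "coast"]]
scores = tf_idf(docs)
print(scores[0])
```

In the toy input, "coast" occurs in every document, so its IDF is log(1) = 0 and it drops out, while "storm" is frequent in one document and absent from the other.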

  LLR

    Definition
    Use log-likelihood ratio (LLR) of some word, or λ(w), to
    assign it a weight:

    H1: The probability of observing w in the input and in the
    background corpus assuming equal probabilities in both
    corpora.

    divided by:

    H2: The probability of observing w in both assuming
    different probabilities for w in the input and the background
    corpus.

    See
    Lin & Hovy (2000) or Manning & Schütze (1999), p. 173
    for how to calculate the ratio. The NLTK also has an
    implementation.

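
A minimal sketch of the ratio under a binomial model, the usual formulation of this test. The names `k1`, `n1`, `k2`, `n2` are assumptions: the count of w and the total token count in the input, and the same two numbers for the background corpus. The function returns −2 log λ(w) rather than λ(w) itself, since that is the quantity normally thresholded.

```python
import math


def _log_likelihood(k: int, n: int, p: float) -> float:
    """Binomial log-likelihood of k hits in n trials (binomial coefficient omitted)."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total


def neg2_log_lambda(k1: int, n1: int, k2: int, n2: int) -> float:
    """Return -2 log(lambda) for a word seen k1/n1 times in the input and k2/n2 in the background."""
    p = (k1 + k2) / (n1 + n2)    # H1: one shared probability
    p1, p2 = k1 / n1, k2 / n2    # H2: separate probabilities
    log_h1 = _log_likelihood(k1, n1, p) + _log_likelihood(k2, n2, p)
    log_h2 = _log_likelihood(k1, n1, p1) + _log_likelihood(k2, n2, p2)
    return -2.0 * (log_h1 - log_h2)


print(neg2_log_lambda(50, 1000, 100, 100000))
```

The binomial coefficients cancel in the ratio, which is why they are omitted. The statistic is two-sided: a word that is much rarer in the input than in the background also scores high, so a signature-term filter may additionally require `k1 / n1 > k2 / n2`.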

  Assigning weights

    weight(w_i)
        1: if −2 log λ(w_i) > 10
        0: otherwise

    Sentence score
    The sentence score is the average weight of its non-stop
    words (more on that later).

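
A minimal sketch of the binary weight and the sentence score described above. It assumes the −2 log λ values have already been computed and stored in a dictionary (`llr`, a hypothetical name); words missing from the dictionary are treated as non-signature terms.

```python
THRESHOLD = 10.0  # cutoff on -2 log(lambda) from the slide


def weight(neg2_log_lambda: float) -> int:
    """Return 1 for a signature term, 0 otherwise."""
    return 1 if neg2_log_lambda > THRESHOLD else 0


def sentence_score(tokens: list[str], llr: dict[str, float], stop_words: set[str]) -> float:
    """Average weight of the sentence's non-stop words (0.0 if there are none)."""
    content = [t for t in tokens if t not in stop_words]
    if not content:
        return 0.0
    return sum(weight(llr.get(t, 0.0)) for t in content) / len(content)


llr = {"storm": 42.0, "coast": 12.5, "said": 1.3}  # made-up values for illustration
score = sentence_score(["the", "storm", "hit", "the", "coast"], llr, {"the"})
print(score)
```

Here "storm" and "coast" pass the threshold and "hit" does not, so two of the three non-stop words count.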

  Training data and LLR

    More data
    Use a larger background corpus (not just docs from other
    topic clusters in DUC2006) (Conroy et al., 200x):
        AQUAINT: /corpora/LDC/LDC02T31
        TREC: /corpora/TREC/TREC
        TIPSTER: /corpora/TREC/TIPSTER
    Include stop words (but only if a larger background corpus is
    used).


  TIPSTER

    Contents
    Source (Vol)            Year    Approx. # Words (Millions)
    ---------------------------------------------------------
    Associated Press (1)    1989    40
    Associated Press (2)    1988    37
    Associated Press (3)    1990    37
    Wall Street Journal (1) 1987    20
                            1988    17
                            1989     6
    Wall Street Journal (2) 1990    11
                            1991    22
                            1992     5
    Dept. Of Energy (1)             28
    Federal Register (1)    1989    38
    Federal Register (2)    1988    30
    Ziff/Davis (1)                  36
    Ziff/Davis (2)          1989-90 26
    Ziff/Davis (3)          1991-92 50
    San Jose Mercury (3)    1991    45


  CLASSY, Conroy et al. (2007)

    First key steps
        1   Data preparation/sentence trimming
        2   Query term selection from the topic descriptions.
        3   Signature term computation for each of the document
            sets
        4   ...


  CLASSY, Conroy et al. (2007)

    STEP 1: sent. preprocessing
    Use stemming for processing query terms. Experimentation
    shows that better ROUGE scores are achieved with
    stemming. Use a larger background corpus (AQUAINT) for
    computing signature terms; also include stop words.

    Prior to scoring, sentences are split into (Conroy et al.,
    2006):
         1: candidates for inclusion in summary
         0: non-candidates that may provide useful terms
         -1: non-candidates to be ignored
    Log-likelihood using signature tokens (those more likely to
    occur in doc than in corpus at large).


  CLASSY, Conroy et al. (2007)

    STEP 1: trimming
    Next, sentences are trimmed; the following phrase
    eliminations were made:
        gerund clauses;
        restricted relative-clause appositives;
        intra-sentential attribution;
        lead adverbs.


  Query bias

    Katragadda and Varma (2009)
    Query-biased means that at least one query term appears in
    the extracted sentence.

    Query-focused means that the summary answers the query
    (like what humans are assumed to do).

    Most systems are better at finding important sentences only
    from query-biased sentences. On average, 76.67% of the
    sentences picked by any automated summarizer are
    query-biased.

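
A minimal sketch of the query-biased test defined above: a sentence counts as query-biased if it shares at least one term with the query. The function names are hypothetical, matching is on lowercased surface forms without stemming, and the query terms are assumed to be content words already.

```python
import re


def terms(text: str) -> set[str]:
    """Lowercased alphabetic tokens of a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def is_query_biased(sentence: str, query_terms: set[str]) -> bool:
    """True if at least one query term appears in the sentence."""
    return bool(terms(sentence) & query_terms)


def query_biased_share(summary_sentences: list[str], query_terms: set[str]) -> float:
    """Fraction of summary sentences that are query-biased (0.0 for an empty summary)."""
    if not summary_sentences:
        return 0.0
    hits = sum(is_query_biased(s, query_terms) for s in summary_sentences)
    return hits / len(summary_sentences)


query_terms = {"steroid", "athletes"}
summary = ["Steroid use rose in the 1990s.", "The weather was mild."]
print(query_biased_share(summary, query_terms))
```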

								