First Story Detection: Combining Similarity and Novelty Based Approaches

Martin Franz, Abraham Ittycheriah, J. Scott McCarley, Todd Ward

IBM T. J. Watson Research Center
What is First Story Detection
•Have we seen this before?
   If it is not old it must be new.

[Figure: timeline from past to future showing news topics: New York politics, Russia, Presidential race, Netanyahu, Greenspan, Milosevic, NFL, Gore, Bush]

•Novelty measured at three levels:
   •word: “Bloomberg”
   •story: Bloomberg wins! … (NYT 11/7/2001, page1)
   •story cluster: NYC mayoral elections in 2001
Outline

•our first participation in FSD

•combined approach:
   •story similarity (unsupervised clustering)
   •term novelty
FSD with Unsupervised Clustering

for each story:
   for each cluster:
      compute story/cluster similarity score
   if best score > threshold:
      merge story into cluster
   else:
      start new cluster

FSD confidence = 1 / best similarity score
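
The loop above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `detect_first_stories` and `dot_similarity` are hypothetical, stories are assumed to be term-count dictionaries, and a plain dot product stands in for the real similarity score. When no cluster matches at all, the confidence is reported as infinity, which is also an assumption.

```python
from collections import Counter


def dot_similarity(story, centroid):
    """Placeholder similarity (assumption): dot product of term counts."""
    return sum(count * centroid.get(term, 0.0) for term, count in story.items())


def detect_first_stories(stories, threshold, similarity=dot_similarity):
    """Single-pass clustering; returns one (is_first, confidence) pair per story."""
    clusters = []  # each cluster: {"sum": Counter of term counts, "n": story count}
    results = []
    for story in stories:
        best_score, best_cluster = 0.0, None
        for cluster in clusters:
            # "Mean story": average term counts over the cluster's members.
            centroid = {t: c / cluster["n"] for t, c in cluster["sum"].items()}
            score = similarity(story, centroid)
            if score > best_score:
                best_score, best_cluster = score, cluster
        if best_cluster is not None and best_score > threshold:
            # Merge the story into the best-matching cluster.
            best_cluster["sum"].update(story)
            best_cluster["n"] += 1
            is_first = False
        else:
            # No cluster is similar enough: the story seeds a new cluster.
            clusters.append({"sum": Counter(story), "n": 1})
            is_first = True
        # FSD confidence = 1 / best similarity score; guard against zero.
        confidence = float("inf") if best_score == 0.0 else 1.0 / best_score
        results.append((is_first, confidence))
    return results
```

A low best similarity yields a high confidence that the story is new, so the confidence can be thresholded separately from the clustering decision.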
Story/Cluster Similarity
 cluster representation: “mean story”

symmetrized Okapi formula

$$\mathrm{Ok}(s, c) = \sum_{t} \mathrm{cnt}_s(t) \cdot \mathrm{cnt}_c(t) \cdot \mathrm{idf}(t)$$

cnt is the warped, length-scaled term count
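
A possible reading of the formula in Python is sketched below. The slide does not give the exact warping or idf, so both are assumptions here: the warped count uses the common Okapi form tf / (0.5 + 1.5 * length / avg_length + tf), and the idf is a smoothed logarithm. All function names are hypothetical.

```python
import math


def warped_counts(term_counts, avg_length):
    """Assumed Okapi-style warping: tf / (0.5 + 1.5 * length / avg_length + tf)."""
    length = sum(term_counts.values())
    norm = 0.5 + 1.5 * length / avg_length
    return {t: tf / (norm + tf) for t, tf in term_counts.items()}


def idf(term, doc_freq, num_docs):
    """Smoothed inverse document frequency (one common variant, assumed)."""
    return math.log((num_docs + 1.0) / (doc_freq.get(term, 0) + 0.5))


def okapi_similarity(story, cluster, doc_freq, num_docs, avg_length):
    """Sum over shared terms of cnt_s(t) * cnt_c(t) * idf(t)."""
    cnt_s = warped_counts(story, avg_length)
    cnt_c = warped_counts(cluster, avg_length)
    shared = cnt_s.keys() & cnt_c.keys()
    return sum(cnt_s[t] * cnt_c[t] * idf(t, doc_freq, num_docs) for t in shared)
```

Because both sides are warped the same way, swapping the story and the cluster leaves the score unchanged, which is what "symmetrized" refers to.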
Text Pre-Processing

•tokenizing

•part-of-speech tagging

•morphing
   word_tag -> morph

   computers_NNS -> computer
   computed_VBD -> compute

•unigrams and noun bigrams
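
The last two steps can be illustrated with a short Python sketch. It assumes tokenizing and tagging have already produced (word, tag) pairs, uses a toy lookup table in place of a real morphological analyzer, and interprets "noun bigram" as two adjacent tokens whose tags both start with NN. The names `MORPHS`, `morph`, and `extract_terms` are hypothetical.

```python
# Toy morphological lexicon (hypothetical); a real system would use a full analyzer.
MORPHS = {
    ("computers", "NNS"): "computer",
    ("computed", "VBD"): "compute",
}


def morph(word, tag):
    """Map a tagged word to its morph, falling back to the lowercased word."""
    return MORPHS.get((word.lower(), tag), word.lower())


def extract_terms(tagged_tokens):
    """Return unigram morphs plus bigrams of adjacent nouns (assumed definition)."""
    morphs = [(morph(w, t), t) for w, t in tagged_tokens]
    terms = [m for m, _ in morphs]
    for (m1, t1), (m2, t2) in zip(morphs, morphs[1:]):
        if t1.startswith("NN") and t2.startswith("NN"):
            terms.append(m1 + "_" + m2)
    return terms
```

For the tagged input `[("Desktop", "NN"), ("computers", "NNS"), ("computed", "VBD")]` this yields the three unigram morphs followed by the single noun bigram `desktop_computer`.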
Refinement: Cluster Recency
[Figure: distance from the first story (TDT2, January-March)]

correct reject: flat
FA: decreasing with the distance from the seed story
Clusters more “attractive” shortly after they are created.


$$\mathrm{score}' = \mathrm{score} \cdot \left(1 + 2^{-\mathrm{age} / \text{half-time}}\right)$$


half-time ~ 2 days ~ 860 stories
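
As a small worked sketch of the recency adjustment: the function name is hypothetical, and the age is assumed to be measured in stories since the cluster was created, so the default half-time is the 860 stories quoted above.

```python
def recency_adjusted_score(score, age, half_time=860.0):
    """Boost a story/cluster similarity score for recently created clusters.

    age and half_time must be in the same unit (assumed here: story count).
    """
    return score * (1.0 + 2.0 ** (-age / half_time))
```

A brand-new cluster (age 0) has its score doubled, a cluster one half-time old gets a factor of 1.5, and the boost decays toward 1 for old clusters.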
After Incorporating Cluster Recency
Effect of Cluster Recency
[Figure: two panels, before (baseline) and after (cluster recency), with the half-time marked; TDT2, first 10 000 stories]
Baseline vs. Cluster Recency
[Figure: TDT3, ASR, reference boundaries]
Effect of Cluster Recency
[Figure: DET curves, miss probability (in %) vs. false alarm probability (in %), with Cnorm = 0.20, 0.40, 0.60 isolines; TDT3, ASR, reference boundaries]

                    min Norm(Cost)
baseline            0.7108
cluster recency     0.6790
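
The slides report min Norm(Cost) without restating its definition. The sketch below assumes the usual TDT detection cost, C_det = C_miss * P_miss * P_target + C_fa * P_fa * (1 - P_target), normalized by the cheaper of the two trivial systems, with the commonly used constants C_miss = 1, C_fa = 0.1, P_target = 0.02. Those constants and both function names are assumptions, not taken from the slides.

```python
def normalized_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Normalized detection cost under the assumed TDT constants."""
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    return cost / min(c_miss * p_target, c_fa * (1.0 - p_target))


def min_normalized_cost(scored):
    """Minimum normalized cost over all decision thresholds.

    scored: list of (fsd_score, is_first_story) pairs, where a higher score
    means "more likely a first story". Needs at least one story of each kind.
    """
    n_target = sum(1 for _, is_first in scored if is_first)
    n_other = len(scored) - n_target
    best = normalized_cost(1.0, 0.0)  # threshold above every score: all missed
    for threshold in sorted({s for s, _ in scored}):
        detected = sum(1 for s, f in scored if f and s >= threshold)
        false_alarms = sum(1 for s, f in scored if not f and s >= threshold)
        p_miss = 1.0 - detected / n_target
        p_fa = false_alarms / n_other
        best = min(best, normalized_cost(p_miss, p_fa))
    return best
```

Under these constants a system that never flags a first story scores exactly 1.0, so values below 1.0 indicate that the system is doing better than that trivial baseline.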
Processing Very Short Stories, Automatic Boundaries

Problem:
       numerous segmentation false alarms, resulting in short “stories”, causing FSD false alarms.

Solution:
       finding and connecting similar neighboring stories

       “catch all” cluster
Processing Very Short Stories

Problem:
      short “stories”, causing FSD false alarms.

Solution:
if
   best similarity score = 0
       or
   story vocabulary size < 20
then
   story -> “catch all” cluster
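
The rule above is simple enough to state directly in code. In this sketch the function name and the `CATCH_ALL` marker are hypothetical, and vocabulary size is assumed to mean the number of distinct terms in the story.

```python
CATCH_ALL = "catch-all"


def route_short_story(story_terms, best_similarity, min_vocabulary=20):
    """Return CATCH_ALL for stories that match nothing or are too short.

    Returns None when the story should go through normal clustering.
    """
    if best_similarity == 0 or len(set(story_terms)) < min_vocabulary:
        return CATCH_ALL
    return None
```

Stories routed to the catch-all cluster would not be reported as first stories, which is how this rule suppresses false alarms from segmentation fragments.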
 Term Novelty Feature

                new story ~ new words and phrases

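  $$\mathrm{score}(t) = \left(1 - 2^{-\mathrm{distance} / \text{half-time}}\right) \cdot \mathrm{tf} \cdot \mathrm{idf}$$
  $$\text{half-time} = \frac{\text{dev\_corpus\_size}}{\mathrm{df}} \cdot c$$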

[Figure: min. Norm(Cfsd) (y-axis, 0.65 to 0.9) as a function of c (x-axis, 10 to 500); TDT2, Jan-March, clean]
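
The term novelty score can be sketched as below. Several details are assumptions because the slides do not spell them out: distance is taken to be the number of stories since the term was last seen (infinite for a never-seen term), idf is log(dev_corpus_size / df), df is clamped to at least 1, the default c = 100 is just one value from the plotted range, and a story's score is the sum over its terms. The function names are hypothetical.

```python
import math


def term_novelty(tf, distance, df, dev_corpus_size, c=100.0):
    """(1 - 2^(-distance / half_time)) * tf * idf for a single term."""
    df = max(df, 1)  # assumption: avoid division by zero for unseen terms
    half_time = (dev_corpus_size / df) * c
    idf = math.log(dev_corpus_size / df)  # assumed idf variant
    recency = 1.0 - 2.0 ** (-distance / half_time)
    return recency * tf * idf


def story_novelty(term_counts, last_seen, story_index, df, dev_corpus_size, c=100.0):
    """Sum of term novelty scores over a story (assumed aggregation)."""
    total = 0.0
    for term, tf in term_counts.items():
        if term in last_seen:
            distance = story_index - last_seen[term]
        else:
            distance = math.inf  # never seen before: full novelty
        total += term_novelty(tf, distance, df.get(term, 0), dev_corpus_size, c)
    return total
```

Rare terms get a long half-time, so they stay "old" for longer after being seen, while a frequent term regains its novelty quickly but contributes little because of its low idf.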
Combining Similarity and Novelty Scores




scoreFSD = 0.8 * scoreSim + 0.2 * scoreNov
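
As a worked instance of the weighted sum, assuming both scores have already been brought to comparable scales (the slides do not say how), with a hypothetical function name:

```python
def combine_scores(score_sim, score_nov, w_sim=0.8, w_nov=0.2):
    """Linear combination of similarity-based and novelty-based FSD scores."""
    return w_sim * score_sim + w_nov * score_nov
```

For example, a similarity-based score of 1.0 and a novelty-based score of 0.5 combine to 0.8 * 1.0 + 0.2 * 0.5 = 0.9.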
Combining Similarity and Novelty Scores
[Figures: DET curves, miss probability (in %) vs. false alarm probability (in %), with Cnorm = 0.20, 0.40, 0.60 isolines]

min Norm(Cost)                novelty-based   similarity-based   combined
TDT3, manual                  0.8904          0.6300             0.6293
TDT3, ASR                     0.8940          0.6641             0.6485
TDT3, ASR, auto boundaries    0.8989          0.7597             0.6950
FSD on Mandarin (Systran) Data
[Figures: DET curves, miss probability (in %) vs. false alarm probability (in %), with Cnorm = 0.20, 0.40, 0.60 isolines]

min Norm(Cost)          novelty-based   similarity-based   combined
reference boundaries    0.8375          0.6845             0.6136
automatic boundaries    0.8368          0.6927             0.6125

det_SR=nwt+bnasr_TE=mul,eng.ndx
October-December
Mandarin only
99 topics
FSD on Mandarin (Systran) and English Data
[Figures: DET curves, miss probability (in %) vs. false alarm probability (in %), with Cnorm = 0.20, 0.40, 0.60 isolines]

min Norm(Cost)          novelty-based   similarity-based   combined
reference boundaries    0.8781          0.7303             0.7011
automatic boundaries    0.8832          0.7949             0.7457

det_SR=nwt+bnasr_TE=mul,eng.ndx
October-December
Mandarin (Systran) + English
240 topics, 39 have Mandarin first story
Conclusion


• Cluster recency feature brings moderate performance gain.

• Term novelty approach shows acceptable performance, more robust to noise.

• Combining the two algorithms improves performance under most conditions.

• As the noise level grows, the performance gain obtained by combining novelty and similarity systems increases.
Lessons Learned
•Automatic FSD is a hard problem




•Solution: deeper story understanding?

						