First Story Detection: Combining Similarity and Novelty Based Approaches
Martin Franz, Abraham Ittycheriah, J. Scott McCarley, Todd Ward
IBM T. J. Watson Research Center
What is First Story Detection
•Have we seen this before?
If it is not old it must be new.
[Figure: timeline from past to future with example topics: New York politics, Russia, Presidential race, Netanyahu, Greenspan, Milosevic, NFL, Gore, Bush]
•Novelty measured at three levels:
•word: “Bloomberg”
•story: Bloomberg wins! … (NYT 11/7/2001, page1)
•story cluster: NYC mayoral elections in 2001
Outline
•our first participation in FSD
•combined approach:
•story similarity (unsupervised clustering)
•term novelty
FSD with Unsupervised Clustering
for each story
    for each cluster
        compute story/cluster similarity score
    best score > threshold?
        yes: merge story into cluster
        no: start new cluster
    FSD confidence = 1 / best similarity score
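The loop above can be sketched in Python as follows. This is a minimal illustration, not the authors' system: the `similarity` function is a plain dot product standing in for the Okapi score, the names `detect_first_stories` and `similarity` are hypothetical, and an infinite confidence is returned when the best score is 0 to avoid dividing by zero.

```python
import math
from collections import Counter


def similarity(story_counts, cluster_mean):
    # Placeholder similarity (assumption): dot product of term counts.
    return sum(cnt * cluster_mean.get(term, 0.0)
               for term, cnt in story_counts.items())


def detect_first_stories(stories, threshold):
    """Return one (is_first_story, confidence) pair per story, in order."""
    clusters = []  # each cluster keeps summed term counts and a story count
    results = []
    for story in stories:
        counts = Counter(story)
        best_score, best_cluster = 0.0, None
        for cluster in clusters:
            # Cluster representation: the "mean story" of its members.
            mean = {t: c / cluster["n"] for t, c in cluster["sum"].items()}
            score = similarity(counts, mean)
            if score > best_score:
                best_score, best_cluster = score, cluster
        # FSD confidence = 1 / best similarity score (infinite if score is 0).
        confidence = math.inf if best_score == 0 else 1.0 / best_score
        if best_cluster is not None and best_score > threshold:
            best_cluster["sum"].update(counts)  # merge story into cluster
            best_cluster["n"] += 1
            results.append((False, confidence))
        else:
            clusters.append({"sum": Counter(counts), "n": 1})  # new cluster
            results.append((True, confidence))
    return results
```

A story that starts a new cluster is reported as a first story; the confidence grows as the best similarity score shrinks.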
Story/Cluster Similarity
cluster representation: “mean story”
symmetrized Okapi formula
$\text{Ok}(s,c) = \sum_t \text{cnt}_s(t) \cdot \text{cnt}_c(t) \cdot \text{idf}(t)$
cnt is a warped, length-scaled term count
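A hedged sketch of the symmetrized score: the slide does not define the warping, so `warped_count` below assumes the common Okapi-style form tf / (0.5 + 1.5 * length / avg_length + tf). The function names and the `avg_length` parameter are illustrative assumptions.

```python
def warped_count(tf, length, avg_length):
    # Assumed Okapi-style warping and length scaling (not given in the slides).
    return tf / (0.5 + 1.5 * length / avg_length + tf)


def okapi_similarity(story_tf, cluster_tf, idf, avg_length):
    """Sum over shared terms of cnt_s(t) * cnt_c(t) * idf(t)."""
    story_len = sum(story_tf.values())
    cluster_len = sum(cluster_tf.values())
    shared_terms = story_tf.keys() & cluster_tf.keys()
    return sum(
        warped_count(story_tf[t], story_len, avg_length)
        * warped_count(cluster_tf[t], cluster_len, avg_length)
        * idf.get(t, 0.0)
        for t in shared_terms
    )
```

Because the same warping is applied to both sides, swapping story and cluster gives the same score, which is the sense in which the formula is symmetrized.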
Text Pre-Processing
•tokenizing
•part-of-speech tagging
•morphing
word_tag -> morph
computers_NNS -> computer
computed_VBD -> compute
•unigrams and noun bigrams
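The last two steps can be illustrated on already tagged input. This sketch makes several assumptions: `MORPH_TABLE` is a toy lookup holding only the two examples from the slide, a "noun bigram" is taken to mean two adjacent tokens whose tags both start with `NN`, and the tokenizer and part-of-speech tagger are not shown.

```python
# Toy morphing table (assumption): only the two examples from the slide.
MORPH_TABLE = {
    ("computers", "NNS"): "computer",
    ("computed", "VBD"): "compute",
}


def morph(word, tag):
    # Map word_tag to its morph; fall back to the lowercased word.
    return MORPH_TABLE.get((word.lower(), tag), word.lower())


def extract_terms(tagged_tokens):
    """Return unigram morphs followed by noun bigrams."""
    morphs = [(morph(word, tag), tag) for word, tag in tagged_tokens]
    unigrams = [m for m, _ in morphs]
    noun_bigrams = [
        f"{m1}_{m2}"
        for (m1, t1), (m2, t2) in zip(morphs, morphs[1:])
        if t1.startswith("NN") and t2.startswith("NN")
    ]
    return unigrams + noun_bigrams
```

For example, `extract_terms([("election", "NN"), ("results", "NNS")])` yields the two unigrams plus the bigram `election_results`.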
Refinement: Cluster Recency
Distance from the first story (TDT2, January-March):
correct rejections: flat
false alarms (FA): decreasing with the distance from the seed story
Clusters are more “attractive” shortly after they are created.
$\text{score}' = \text{score} \cdot \left(1 + 2^{-\text{age}/\text{half-time}}\right)$
half-time ~ 2 days ~ 860 stories
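As a small worked sketch of the recency boost, assuming cluster age is counted in stories since the cluster was created (the function name and the default of 860 stories are taken from the approximate figure above, not from a published implementation):

```python
def recency_adjusted_score(score, age, half_time=860.0):
    """Boost a story/cluster score for young clusters.

    The multiplier is 2 for a brand-new cluster (age 0), 1.5 at
    age == half_time, and approaches 1 as the cluster gets old.
    """
    return score * (1.0 + 2.0 ** (-age / half_time))
```

The boost never lowers a score; it only makes recently created clusters up to twice as attractive.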
After Incorporating Cluster Recency
[Figure: "Effect of Cluster Recency"; panels: before (baseline) and after (cluster recency); axis label: half-time; TDT2, first 10 000 stories]
Baseline vs. Cluster Recency
TDT3, ASR, reference boundaries
[Figure: DET plot "Effect of Cluster Recency", miss probability (in %) vs. false alarm probability (in %); TDT3, ASR, reference boundaries]
baseline: min Norm(Cost) = 0.7108
cluster recency: min Norm(Cost) = 0.6790
Processing Very Short Stories, Automatic Boundaries
Problem:
numerous segmentation false alarms, resulting in
short “stories”, causing FSD false alarms.
Solution:
finding and connecting similar neighboring stories
“catch all” cluster
Processing Very Short Stories
Problem:
short “stories”, causing FSD false alarms.
Solution:
if best similarity score = 0 or story vocabulary size < 20
then story -> “catch all” cluster
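The routing rule is simple enough to state as a predicate. In this sketch, "vocabulary size" is assumed to mean the number of distinct terms in the story, and the function name is hypothetical.

```python
def goes_to_catch_all(best_score, story_terms, min_vocab=20):
    """Return True if the story should be sent to the "catch all" cluster."""
    vocabulary_size = len(set(story_terms))  # distinct terms (assumption)
    return best_score == 0 or vocabulary_size < min_vocab
```

Stories routed this way never start a new cluster, so they cannot be reported as first stories.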
Term Novelty Feature
new story ~ new words and phrases
$\text{score}(t) = \left(1 - 2^{-\text{distance}/\text{half-time}}\right) \cdot \text{tf} \cdot \text{idf}$
half-time = (dev_corpus_size / df) * c
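A minimal sketch of the two formulas above. The slides do not define `distance`; here it is assumed to be the number of stories since the term was last seen, with `math.inf` for a term never seen before. How per-term scores are aggregated into a story score is not stated, so it is left out.

```python
import math


def term_half_time(dev_corpus_size, df, c):
    # Rare terms (small df) get a long half-time, frequent terms a short one.
    return (dev_corpus_size / df) * c


def term_novelty(distance, tf, idf, half_time):
    """(1 - 2^(-distance / half_time)) * tf * idf."""
    return (1.0 - 2.0 ** (-distance / half_time)) * tf * idf


# Example: a term seen 200 stories ago, with a half-time of 200 stories,
# keeps half of its tf * idf weight.
half_time = term_half_time(dev_corpus_size=1000, df=10, c=2)
half_weight = term_novelty(distance=200, tf=1, idf=2.0, half_time=half_time)
never_seen = term_novelty(distance=math.inf, tf=1, idf=2.0, half_time=half_time)
```

A term repeated immediately (distance 0) scores 0, and a never-seen term scores its full tf * idf.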
[Figure: min. Norm(Cfsd) (y-axis, 0.65 to 0.9) as a function of c (x-axis, 10 to 500); TDT2, Jan-March, clean]
Combining Similarity and Novelty Scores
$\text{score}_{FSD} = 0.8 \cdot \text{score}_{Sim} + 0.2 \cdot \text{score}_{Nov}$
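The combination is a fixed linear interpolation. The sketch below assumes both input scores are already on comparable scales; the slides do not say how, or whether, they are normalized, and the function name is illustrative.

```python
def combined_fsd_score(score_sim, score_nov, w_sim=0.8, w_nov=0.2):
    """Weighted sum of the similarity-based and novelty-based FSD scores."""
    return w_sim * score_sim + w_nov * score_nov
```

With the default weights, the similarity-based score dominates and the novelty score acts as a correction.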
Combining Similarity and Novelty Scores
[Figure: two DET plots, miss probability (in %) vs. false alarm probability (in %); panels: TDT3, manual and TDT3, ASR]
min Norm(Cost):
                     TDT3, manual   TDT3, ASR
  novelty-based      0.8904         0.8940
  similarity-based   0.6300         0.6641
  combined           0.6293         0.6485
TDT3, ASR, auto boundaries
[Figure: DET plot, miss probability (in %) vs. false alarm probability (in %)]
novelty-based: min Norm(Cost) = 0.8989
similarity-based: min Norm(Cost) = 0.7597
combined: min Norm(Cost) = 0.6950
FSD on Mandarin (Systran) Data
[Figure: two DET plots, miss probability (in %) vs. false alarm probability (in %); panels: reference boundaries and automatic boundaries]
min Norm(Cost):
                     reference boundaries   automatic boundaries
  novelty-based      0.8375                 0.8368
  similarity-based   0.6845                 0.6927
  combined           0.6136                 0.6125
det_SR=nwt+bnasr_TE=mul,eng.ndx
October-December
Mandarin only
99 topics
FSD on Mandarin (Systran) and English Data
[Figure: two DET plots, miss probability (in %) vs. false alarm probability (in %); panels: reference boundaries and automatic boundaries]
min Norm(Cost):
                     reference boundaries   automatic boundaries
  novelty-based      0.8781                 0.8832
  similarity-based   0.7303                 0.7949
  combined           0.7011                 0.7457
det_SR=nwt+bnasr_TE=mul,eng.ndx
October-December
Mandarin (Systran) + English
240 topics, 39 have Mandarin first story
Conclusion
• Cluster recency feature brings moderate performance gain.
• The term novelty approach shows acceptable performance and is more robust
to noise.
• Combining the two algorithms improves performance under most
conditions.
• As the noise level grows, the performance gain obtained by
combining novelty and similarity systems increases.
Lessons Learned
•Automatic FSD is a hard problem
•Solution: deeper story understanding?