Speech Summarization
Julia Hirschberg (thanks to Sameer
Maskey for some slides)
CS4706
Summarization Distillation
• ‘…the process of distilling the most important
information from a source (or sources) to
produce an abridged version for a particular user
(or users) and task (or tasks)’ [Mani and
Maybury, 1999]
• Why summarize? Too much data!
Types of Summarization
• Indicative
– Describes the document and its contents
• Informative
– ‘Replaces’ the document
• Extractive
– Concatenate pieces of existing document
• Generative
– Creates a new document
• Document compression
SOME SUMMARIZATION TECHNIQUES BASED ON TEXT (LEXICAL FEATURES)
• Sentence Extraction, Similarity Measures [Salton, et al., 1995]
• Extraction Training w/ manual Summaries [McKeown, et al., 2001]
• Concept Level: Extract concept units [Hovy & Lin, 1999]
• Generate Words/Phrases [Witbrock & Mittal, 1999]
• Use of Structured Data [Maybury, 1995]
Sentence Extraction/Similarity measures
[Salton, et al. 1995]
• Extract sentences by their similarity to a topic
sentence and their dissimilarity to sentences
already in summary (Maximal Marginal
Relevance)
• Similarity measures
– Cosine Measure
– Vocabulary Overlap
– Topic word overlap
– Content Signatures Overlap
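The first two measures in the list can be sketched in a few lines of Python. This is only an illustration: the whitespace tokenizer, the raw term-frequency weighting, and the function names are assumptions, not details taken from [Salton, et al., 1995].

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is good enough for a sketch.
    return text.lower().split()


def cosine_measure(a, b):
    """Cosine of the angle between the raw term-frequency vectors of two texts."""
    ta, tb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ta[w] * tb[w] for w in ta.keys() & tb.keys())
    norm_a = math.sqrt(sum(c * c for c in ta.values()))
    norm_b = math.sqrt(sum(c * c for c in tb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vocabulary_overlap(a, b):
    """Shared word types divided by all word types (Jaccard coefficient)."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```

Topic-word and content-signature overlap could reuse `vocabulary_overlap` after restricting both word sets to a topic vocabulary.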
Concept/content level extraction [Hovy & Lin,
1999]
• Present key-words as summary
• Builds concept signatures by finding relevant
words in 30,000 WSJ documents, each
categorized into different topics
• Phrase concatenation of relevant
concepts/content
• Sentence planning for generation
Feature-based statistical models
[Kupiec, et al., 1995]
• Create manual summaries
• Extract features
• Train statistical model using various ML techniques
• Use the trained model to score each sentence in the test
data
• Extract N highest-scoring sentences
$$P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\, P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$$
• Where S is the summary given k features $F_j$, and $P(F_j)$ & $P(F_j \mid s \in S)$
can be computed by counting occurrences
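A minimal sketch of this counting-based scorer, assuming every sentence has already been reduced to a tuple of discrete feature values. The feature values, function names, and the choice to skip unseen values are illustrative assumptions, not taken from [Kupiec, et al., 1995].

```python
from collections import Counter


def train(examples):
    """examples: list of (features, in_summary) pairs; features is a tuple of discrete values."""
    n = len(examples)
    n_summary = sum(1 for _, in_summary in examples if in_summary)
    k = len(examples[0][0])
    counts = [Counter() for _ in range(k)]          # counts behind P(F_j)
    counts_summary = [Counter() for _ in range(k)]  # counts behind P(F_j | s in S)
    for features, in_summary in examples:
        for j, value in enumerate(features):
            counts[j][value] += 1
            if in_summary:
                counts_summary[j][value] += 1
    return n, n_summary, counts, counts_summary


def score(model, features):
    """P(s in S) * prod_j P(F_j | s in S) / P(F_j), with all terms estimated by counting."""
    n, n_summary, counts, counts_summary = model
    p = n_summary / n
    for j, value in enumerate(features):
        if counts[j][value] == 0:
            continue  # assumption: ignore feature values never seen in training
        p *= (counts_summary[j][value] / n_summary) / (counts[j][value] / n)
    return p


def extract(model, sentences, n):
    """sentences: list of (text, features); returns the n highest-scoring texts."""
    ranked = sorted(sentences, key=lambda item: score(model, item[1]), reverse=True)
    return [text for text, _ in ranked[:n]]
```

Because the features are treated as independent, the score can exceed 1 on small samples; it is best read as a ranking value. The sketch also assumes the training data contains at least one summary sentence.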
Structured Database [Maybury, 1995]
• Summarize text represented in structured form:
database, templates
– E.g. generation of a medical history from a
database of medical ‘events’
$$\text{Relative frequency of } E = \frac{\text{\# of occurrences of event } E}{\text{Total \# of all events}}$$
• Link analysis (semantic relations within the
structure)
• Domain dependent importance of events
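The relative-frequency measure is simple enough to sketch directly. The event labels in the usage note are hypothetical, and a real system would combine this value with the link-analysis and domain-importance cues listed above.

```python
from collections import Counter


def relative_frequencies(events):
    """Map each event type to (# of occurrences) / (total # of all events)."""
    counts = Counter(events)
    total = sum(counts.values())
    return {event: count / total for event, count in counts.items()}
```

For example, `relative_frequencies(["fever", "fever", "x-ray", "surgery"])` assigns `"fever"` a relative frequency of 0.5.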
Comparing Speech and Text Summarization
• Alike
– Identifying important information
– Some lexical, discourse features
– Extraction or generation or compression
• Different
– Speech Signal
– Prosodic features
– NLP tools?
– Segments – sentences?
– Generation?
– Errors
– Data size
Text vs. Speech Summarization (NEWS)
• Text
– Error-free Text
– Transcript: Manual
– Lexical Features
– Segmentation: sentences
– NLP tools
• Speech
– Speech Signal
– Speech Channels: phone, remote satellite, station
– Transcripts: ASR, Close Captioned
– Many Speakers: speaking styles
– Some Lexical Features
– Story presentation style
– Structure: Anchor, Reporter Interaction
– Prosodic Features: pitch, energy, duration
– Commercials, Weather Report
Speech Summarization Today
• Mostly extractive:
– Words, sentences, content units
• Some compression methods
• Generation-based summarization difficult
– Text or synthesized speech?
Generation or Extraction?
• SENT27 a trial that pits the cattle industry against tv talk show host oprah winfrey is under
way in amarillo , texas.
• SENT28 jury selection began in the defamation lawsuit began this morning .
• SENT29 winfrey and a vegetarian activist are being sued over an exchange on her April
16, 1996 show .
• SENT30 texas cattle producers claim the activists suggested americans could get mad
cow disease from eating beef .
• SENT31 and winfrey quipped , this has stopped me cold from eating another burger
• SENT32 the plaintiffs say that hurt beef prices and they sued under a law banning false
and disparaging statements about agricultural products
• SENT33 what oprah has done is extremely smart and there's nothing wrong with it she
has moved her show to amarillo texas , for a while
• SENT34 people are lined up , trying to get tickets to her show so i'm not sure this hurts
oprah .
• SENT35 incidentally oprah tried to move it out of amarillo . she's failed and now she has
brought her show to amarillo .
• SENT36 the key is , can the jurors be fair
• SENT37 when they're questioned by both sides, by the judge , they will be asked, can
you be fair to both sides
• SENT38 if they say , there's your jury panel
• SENT39 oprah winfrey's lawyers had tried to move the case from amarillo , saying they
couldn't get an impartial jury
• SENT40 however, the judge moved against them in that matter …
story summary
SPEECH SUMMARIZATION TECHNIQUES
• Sentence extraction with similarity measures [Christensen et al., 2004]
• Word scoring with dependency structure [Hori C. et al., 1999, 2002], [Hori T. et al., 2003]
• Classification [Koumpis & Renals, 2004]
• User access information [He et al., 1999]
• Removing disfluencies [Zechner, 2001]
• Weighted finite state transducers [Hori T. et al., 2003]
Content/Context sentence level extraction for
speech summary [Christensen et al., 2004]
Find sentences similar to the lead topic sentences
Use position features to find the relevant nearby sentences after
detecting a topic sentence
$$\hat{S}_k = \arg\max_{s_i \in D \setminus E} \{ \mathrm{Sim}(s_1, s_i) \}$$
$$\hat{S}_k = \arg\max_{s_i \in D \setminus E} \{ \mathrm{Sim}(D, s_i) \}$$
where Sim is a similarity measure between two sentences or a
sentence and a document (D) and E is the set of sentences
already in the summary
Choose a new sentence which is most like D and most
different from E
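One way to read "most like D and most different from E" is as a greedy loop that trades similarity to the document against redundancy with the sentences already chosen. The sketch below is an assumption-laden illustration: the `trade_off` weight, the word-overlap similarity, and the function names are not from [Christensen et al., 2004].

```python
def word_overlap(a, b):
    # Assumed stand-in for Sim: shared word types over all word types.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def greedy_summary(sentences, n, trade_off=0.7, sim=word_overlap):
    """Repeatedly pick the candidate most like D and least like the summary E."""
    document = " ".join(sentences)  # D
    summary = []                    # E
    candidates = list(sentences)    # D \ E
    while candidates and len(summary) < n:

        def gain(s):
            redundancy = max((sim(s, e) for e in summary), default=0.0)
            return trade_off * sim(document, s) - (1 - trade_off) * redundancy

        best = max(candidates, key=gain)
        summary.append(best)
        candidates.remove(best)
    return summary
```

With `trade_off=1.0` the loop reduces to picking the sentences most similar to D; lower values penalize near-duplicates of what is already in E.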
Weighted finite state transducers for speech
summarization
[Hori T. et al., 2003]
• Summarization includes speech recognition, paraphrasing, sentence
compaction integrated into single Weighted Finite State Transducer
• Decoder can use all knowledge sources in one-pass strategy
• Speech recognition using WFST: $R = H \circ C \circ L \circ G$
– Where H is state network of triphone HMMs, C is triphone
connection rules, L is pronunciation and G is trigram language
model
• Paraphrasing can be looked at as a kind of machine translation with
translation probability P(W|T) where W is source language and T is
the target language
• If S is the WFST representing translation rules and D is the
language model of the target language speech summarization can
be looked at as the following composition
$$Z = \overbrace{\underbrace{H \circ C \circ L \circ G}_{\text{Speech recognizer}} \circ \underbrace{S \circ D}_{\text{Translator}}}^{\text{Speech Translator}}$$
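The composition operator can be illustrated with a toy, epsilon-free implementation. Everything here is an assumption made for the sketch: the `Fst` tuple layout, the additive (negative-log) weights, and the exhaustive search. Production systems rely on dedicated toolkits that also handle epsilon transitions and optimization, and the sketch does not reproduce the decoder of [Hori T. et al., 2003].

```python
from collections import namedtuple

# A transducer: start state, set of final states, and a list of arcs
# (source, input label, output label, weight, destination).
Fst = namedtuple("Fst", "start finals arcs")


def compose(a, b):
    """Epsilon-free composition: output labels of `a` must match input labels of `b`.

    Weights are added, as with negative log probabilities.
    """
    start = (a.start, b.start)
    arcs, finals = [], set()
    stack, seen = [start], {start}
    while stack:
        state = stack.pop()
        qa, qb = state
        if qa in a.finals and qb in b.finals:
            finals.add(state)
        for (sa, ia, oa, wa, da) in a.arcs:
            if sa != qa:
                continue
            for (sb, ib, ob, wb, db) in b.arcs:
                if sb == qb and ib == oa:
                    dest = (da, db)
                    arcs.append((state, ia, ob, wa + wb, dest))
                    if dest not in seen:
                        seen.add(dest)
                        stack.append(dest)
    return Fst(start, finals, arcs)


def best_output(fst, inputs):
    """Lowest-cost (cost, outputs) pair for an input sequence, by exhaustive search."""
    best = None
    frontier = [(fst.start, 0, [], 0.0)]  # state, input position, outputs so far, cost
    while frontier:
        state, pos, outputs, cost = frontier.pop()
        if pos == len(inputs):
            if state in fst.finals and (best is None or cost < best[0]):
                best = (cost, outputs)
            continue
        for (s, i, o, w, d) in fst.arcs:
            if s == state and i == inputs[pos]:
                frontier.append((d, pos + 1, outputs + [o], cost + w))
    return best
```

Composing a toy recognizer (sound labels to words) with a toy translator (words to paraphrased words) gives a single transducer that maps sound labels straight to summary words, which is the one-pass idea behind Z.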
User Access Identifies What to Include
[He et al., 1999]
• Summarize lectures or shows by extracting parts that
have been viewed the longest
• Needs multiple users of the same show, meeting or
lecture for training
• E.g. To summarize lectures compute the time spent on
each slide
• Summarizer based on user access logs did as well as
summarizers that used linguistic and acoustic features
– Average score of 4.5 on a scale of 1 to 8 for the
summarizer (subjective evaluation)
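The time-per-slide idea can be sketched as a simple aggregation over viewing logs. The `(user, slide, seconds)` record format and the function names are hypothetical; [He et al., 1999] does not prescribe this layout.

```python
from collections import defaultdict


def time_per_slide(log):
    """log: list of (user, slide, seconds) viewing records (hypothetical format)."""
    totals = defaultdict(float)
    for _user, slide, seconds in log:
        totals[slide] += seconds
    return dict(totals)


def top_slides(log, n):
    """Return the n slides viewed the longest, restored to presentation order."""
    totals = time_per_slide(log)
    ranked = sorted(totals, key=lambda slide: totals[slide], reverse=True)[:n]
    return sorted(ranked)  # assumption: slide ids sort in presentation order
```

The selected slides, or the audio and video segments aligned with them, would then form the summary.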
Word level extraction by scoring/classifying words
[Hori C. et al., 1999, 2002]
Score each word in the sentence and extract a set of words to form
a sentence whose total score is the product/sum of the scores of
each word
Example:
Word Significance score (topic words)
Linguistic Score (bigram probability)
Confidence Score (from ASR)
Word Concatenation Score (dependency structure grammar)
$$S(V) = \sum_{m=1}^{M} \{ L(v_m \mid \ldots v_{m-1}) + \lambda_I I(v_m) + \lambda_C C(v_m) + \lambda_T T_r(v_{m-1}, v_m) \}$$
Where M is the number of words to be extracted, and $\lambda_I$, $\lambda_C$, $\lambda_T$
are weighting factors for balancing among L, I, C, and $T_r$
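A sketch of the summation form of this score for one candidate word sequence, followed by a brute-force search over M-word extractions. The four scoring functions are passed in as callables, the `"<s>"` start symbol is an assumption, and the linguistic score is simplified to a bigram. [Hori C. et al., 1999, 2002] search with dynamic programming rather than by enumeration.

```python
from itertools import combinations


def summary_score(words, ling, sig, conf, concat, lam_i=1.0, lam_c=1.0, lam_t=1.0):
    """Sum over words of L + lam_i * I + lam_c * C + lam_t * Tr."""
    total = 0.0
    prev = "<s>"  # assumed sentence-start symbol for the first word
    for word in words:
        total += (ling(prev, word)
                  + lam_i * sig(word)
                  + lam_c * conf(word)
                  + lam_t * concat(prev, word))
        prev = word
    return total


def best_extraction(words, m, **scorers):
    """Try every order-preserving choice of m words and keep the best-scoring one."""
    return max(combinations(words, m), key=lambda v: summary_score(v, **scorers))
```

Enumeration is only practical for short sentences, since the number of candidates grows combinatorially with sentence length.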
Segmentation Using Discourse Cues
[Maybury, 1998]
Discourse Cue-Based Story Segmentation
Discourse Cues in CNN
Start and end of broadcast
Anchor/Reporter handoff, Reporter/Anchor handoff
Cataphoric Segment (“still ahead …”)
Time Enhanced Finite State Machine representing discourse states
such as anchor segment, reporter segment, advertisement
Other features: named entities, part of speech, discourse shifts
“>>” speaker change, “>>>” subject change
Source            Precision   Recall
ABC               90          94
CNN               95          75
Jim Lehrer Show   77          52
CU: Summarization without Words: Does the
importance of ‘what’ is said correlate with ‘how’ it
is said?
• Hypothesis: “Speakers change their amplitude, pitch,
speaking rate to signify importance of words, phrases,
sentences.”
– If so, then the labels predicted for sentences
using acoustic features (A) should correlate with labels
predicted using lexical features (L)
– In fact, this seems to be true (corr. 0.74 between predictions
of A and L)
Is It Possible to Build ‘good’ Automatic Speech
Summarization Without Any Transcripts?
Feature Set   F-Measure   ROUGE-avg
L+S+A+D       0.54        0.80
L             0.49        0.70
S+A           0.49        0.68
A             0.47        0.63
Baseline      0.43        0.50
• Just using S+A without any lexical features we get 6 points higher F-
measure (0.49 vs. 0.43) and 18 points higher ROUGE-avg (0.68 vs. 0.50) than the baseline
Evaluation using ROUGE
• F-measure too strict
– Predicted summary sentences must match
summary sentences exactly
– What if content is similar but not identical?
• ROUGE(s)…
ROUGE metric
• Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
• ROUGE-N (where N=1,2,3,4 grams)
• ROUGE-L (longest common subsequence)
• ROUGE-S (skip bigram)
• ROUGE-SU (skip bigram counting unigrams as well)
• Does ROUGE solve the problem?
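A minimal sketch of ROUGE-N as n-gram recall against a single reference summary. The whitespace tokenization and the single-reference restriction are simplifying assumptions; the official ROUGE toolkit supports multiple references and options such as stemming.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate (clipped counts)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total
```

Unlike exact sentence matching, a candidate that rewords part of a reference sentence still receives partial credit here.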
Next Class
• Emotional speech
• HW 4 assigned