

Recommendations Based on
Speech Classification
(and examples of what recommender
systems can learn from signal processing)
Christian Müller
German Research Center for Artificial Intelligence
International Computer Science Institute, Berkeley, CA
Overview
Speech as a source of information for non-intrusive user modeling
Speech/signal processing | Take-away messages
Vocal aging -> features for speaker age recognition | Knowledge-driven feature selection
GMM/SVM supervector approach for acoustic speech features | Classification methods for independent “bag of observations” features
Detection task and pseudo-NIST evaluation procedure | Valid application-independent evaluation
Rank and polynomial rank normalization | Feature space warping normalization
Conclusions
Speech as a Source for Non-Intrusive UM
Adaptive speech dialog system, example utterance: "Now it’s time to get to gate 38."
Information about the user comes from
inference from sensors (not intrusive): speech = sensor, speaker -> classification -> user model
explicit statement (intrusive)
Based on the user model, the system
A: adapts its dialog behavior (e.g. detailed map with shops vs. arrows)
B: provides recommendations (e.g. a different route to the gate)
Speaker Classification Systems
Input: audio segment (telephone quality)
Cognitive Load: Best Research Paper Award UM 2001
Age and Gender: Voice Award 2007, Telekom live operation 2009
Language: 14 languages + dialects, NIST evaluation 2007
Identity: Project with BKA 2009, NIST* Evaluation 2008
Acoustic Events: Project with VW 2008, Interspeech 2008
Recommendations Based on
Speech Classification
Recommendation targets: products, media, services, actions, strategies
Speaker characteristics: age, gender, emotions, language, dialect, accent, identity, acoustic events
Product Recommendations
Based on Age and Gender
Michael Feld and Christian Müller. Speaker Classification for Mobile Devices.
In Proceedings of the 2nd IEEE International Interdisciplinary Conference on
Portable Information Devices (Portable 2008). 2008
How can you find features for building your models by explicitly studying the underlying phenomena?
Proposing knowledge-driven feature selection on the example of features for speaker age recognition
Speaker Classification as an Interdisciplinary Area of Research
Which are the manifestations of age (and gender) in the speaker’s voice and speaking style?
How can the age (and the gender) of a speaker be automatically recognized?
Which are the requirements of a speaker classification system and how can they be solved on the implementation layer?
Disciplines contributing to Speaker Classification:
Speech Technology / Artificial Intelligence
Phonetics / Voice Pathology
Software Technology
Impact of Aging on the Human Speech
Production
Speech breathing
thorax: stiffer
lungs: lighter, less elastic, lower position
effects:
lower expiratory volume
more speech pauses
lower amplitude
Impact of Aging on the Human Speech
Production
laryngeal area
larynx: calcification and ossification
vocal folds: loss of tissue, stiffening
effects:
rise of fundamental frequency (in men)
reduced voice quality
Impact of Aging on the Human Speech
Production
supralaryngeal area
facial bones and muscles: degeneration, reduced elasticity
effects:
imprecise articulation, for example vowel centralization
Impact of Aging on the Human Speech
Production
neurological effects
loss of tissue in the cortex
reduced performance of the neurotransmitters
effects:
reduced articulation rate
defective coordination between the articulators
vowel centralization
Development of F0 in Men / Women
[Figure: F0 in Hz (axis 90 to 170) versus age in years (20 to 90), one curve for men and one for women; annotations: "only non-smokers" and "smokers and non-smokers". Source: Linville (2001)]
Age Classes
Class | Age | Female | Male
Children | <= 13 years | CF | CM
Youth | 14 - 19 years | YF | YM
Adults | 20 - 64 years | AF | AM
Seniors | >= 65 years | SF | SM
Features
fundamental frequency (pitch)
mean: pitch_mean
standard deviation: pitch_stddev
min, max and difference: pitch_min / pitch_max / pitch_diff
voice quality
shimmer: shim_l / shim_ldb / shim_apq3 / shim_apq11 / shim_ddp
jitter: jitt_l / jitt_la / jitt_rap / jitt_ppq / jitt_ddp
harmonics-to-noise ratio: harm_mean / harm_stddev
articulation rate: ar_rate
speech pauses: pause_num / pause_dur
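A few of the features listed above can be sketched in code. The example below is only an illustration under assumptions: it takes an already extracted list of voiced-frame F0 values, uses the population standard deviation, and assumes that jitt_l means local jitter in the usual sense (mean absolute difference of consecutive periods divided by the mean period). The function names and toy values are hypothetical and not taken from the talk.

```python
from statistics import mean, pstdev

def pitch_features(f0_hz):
    """Summary statistics over voiced-frame F0 values (Hz)."""
    return {
        "pitch_mean": mean(f0_hz),
        "pitch_stddev": pstdev(f0_hz),  # assumption: population std. deviation
        "pitch_min": min(f0_hz),
        "pitch_max": max(f0_hz),
        "pitch_diff": max(f0_hz) - min(f0_hz),
    }

def local_jitter(periods_s):
    """Mean absolute difference of consecutive periods, relative to the mean period."""
    diffs = [abs(a - b) for a, b in zip(periods_s[1:], periods_s[:-1])]
    return mean(diffs) / mean(periods_s)

# Hypothetical toy contour; real values would come from a pitch tracker.
f0 = [110.0, 112.0, 108.0, 115.0, 105.0]
feats = pitch_features(f0)
jitt_l = local_jitter([1.0 / f for f in f0])
```

Each utterance yields one such fixed-length feature vector, regardless of its duration.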
Features
voice:
fundamental frequency (pitch): mean, standard deviation, min, max and difference
voice quality: shimmer, jitter, harmonics-to-noise ratio
speaking style:
articulation rate
speech pauses
Example Results
[Figure: per-class plots over the eight classes CF, CM, YF, YM, AF, AM, SF, SM for jitter (high jitter value = low voice quality), fundamental frequency (F0) and speech pauses; group labels in the figure: C_YF, AF, SF, YM_AM_SM]
Christian Müller. Zweistufige kontextsensitive Sprecherklassifikation am Beispiel von Alter und Geschlecht [Two-layered Context-Sensitive Speaker Classification on the Example of Age and Gender]. AKA, Berlin, 2006
Hierarchical Feature Model
High-level features (learned characteristics)
semantics
dialog
idiolect: <s> how shall I say this <c> <s> yeah I know...
phonetics: /S/ /oU/ /m/ /i:/ /D/ /&/ /m/ / / /n/ /i:/ ...
prosody
spectrum
Low-level features (physical characteristics)
How can your features be modeled, assuming that they
are multi-dimensional
represent repeating observations of the same kind
can be assumed to be independent (“bag” of observations)
Proposing the GMM/SVM Supervector Approach on the example of frame-by-frame acoustic features
General Classification Scheme
Preprocessing: e.g. channel compensation (not addressed in this talk)
Feature Extraction
Classification: e.g. multilayer perceptron networks, support-vector machines
Fusion
Top-Down Knowledge
[Figure: small multilayer perceptron with inputs x1, x2, hidden units y1, y2, output zk and weights wji, wkj]
Modeling Acoustics and Prosodics
semantics
dialog
idiolect: <s> how shall I say this <c> <s> yeah I know...
phonetics: /S/ /oU/ /m/ /i:/ /D/ /&/ /m/ / / /n/ /i:/ ...
prosody
spectrum
Annotation in the figure: no ASR
Generative Approach: Gaussian Mixture Model (GMM)
training: frames of speech labeled “emergency vehicle” -> feature extraction -> probability density -> “emergency vehicle” model
test: unknown frames of speech -> feature extraction -> “emergency vehicle” model -> avg likelihood over all frames for class “emergency vehicle”
Generative Approach: Gaussian Mixture Model (GMM)
test with background model: unknown frames of speech -> feature extraction -> “emergency vehicle” model and background model -> avg. log likelihood ratio over all frames for class “emergency vehicle”
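The test-time scoring against a background model can be sketched as follows. This is a minimal illustration, assuming scalar features and hand-set toy models; in a real system the frames would be multi-dimensional acoustic vectors and the model parameters would come from training. The names `siren` and `background` and all numbers are hypothetical.

```python
import math

def gmm_logpdf(x, weights, means, variances):
    """Log density of scalar x under a 1-D Gaussian mixture."""
    terms = [
        math.log(w) - 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def avg_llr(frames, target, background):
    """Average per-frame log-likelihood ratio: target model vs. background model."""
    total = sum(gmm_logpdf(x, *target) - gmm_logpdf(x, *background) for x in frames)
    return total / len(frames)

# Hypothetical toy models: (weights, means, variances)
siren = ([0.5, 0.5], [2.0, 4.0], [0.5, 0.5])
background = ([1.0], [0.0], [4.0])

score_siren = avg_llr([2.1, 3.9, 2.4, 4.2], siren, background)
score_other = avg_llr([-0.3, 0.5, -1.0, 0.2], siren, background)
```

Because the score is an average over frames, segments of different lengths produce comparable values.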
A Mixture of Gaussians
Means, variances, and mixture weights are
optimized in training
Black line = mixture of 3 Gaussians
Discriminative Method:
Support Vector Machine (SVM)
training: “em. vehic.” (1) and “not em. vehic.” (-1) examples -> feature extraction -> “em. vehic.” model
Features are transformed into a higher-dimensional space where the problem is linear
The discriminating hyperplane is learned by maximizing the margin between the two classes
Trade-off between training error and width of margin
Model is stored in the form of “support vectors” (data points on the margin)
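The trade-off between training error and margin width can be illustrated with a linear SVM trained by subgradient descent on the hinge loss. This is a simplified sketch, not the solver used in the talk: it has no kernel, keeps the weight vector rather than support vectors, and the parameter `c`, the learning rate and the toy data are assumptions.

```python
def train_linear_svm(xs, ys, c=1.0, lr=0.01, epochs=200):
    """Minimise 0.5*||w||^2 + c * mean(hinge loss) with per-sample subgradient steps."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Margin violated: shrink w and pull the plane toward the sample.
                w = [wi - lr * (wi - c * y * xi) for wi, xi in zip(w, x)]
                b += lr * c * y
            else:
                # Margin satisfied: only the regulariser acts (wider margin).
                w = [wi - lr * wi for wi in w]
    return w, b

def score(w, b, x):
    """Signed score; dividing by ||w|| would give the distance to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Hypothetical 2-D toy data: +1 = "em. vehic.", -1 = "not em. vehic."
xs = [[2.0, 2.0], [3.0, 3.0], [2.5, 3.0], [-2.0, -2.0], [-3.0, -2.5], [-2.0, -3.0]]
ys = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(xs, ys)
```

A larger `c` penalises training errors more strongly, a smaller `c` favours a wider margin.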
Discriminative Method:
Support Vector Machine (SVM)
test: unknown segment -> feature extraction -> score (distance to hyperplane)
Discriminative methods have been shown to be superior to generative methods for similar tasks
Feature vectors have to be of the same length (sensitive to variable segment lengths)
Solutions:
feature statistics calculated over the entire utterance
fixed portion of the segment
sequential kernels
GMM/SVM Supervector Approach
feature extraction -> Gaussian means (MAP adapted) -> SVM
Combines the discriminative power of SVMs with the length independence of GMMs
Very successful with similar tasks such as speaker recognition
GMM is trained using MAP adaptation
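How a segment of any length becomes a fixed-length vector can be sketched with means-only relevance MAP adaptation. The sketch below assumes scalar frames, a two-component background model and a relevance factor of 16; these values and all names are illustrative assumptions, not the configuration of the system described in the talk.

```python
import math

def posteriors(x, weights, means, variances):
    """Responsibility of each background-model component for scalar frame x."""
    dens = [
        w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    ]
    total = sum(dens)
    return [d / total for d in dens]

def map_adapted_means(frames, weights, means, variances, relevance=16.0):
    """Shift each background mean toward the frames it explains (means-only MAP)."""
    n = [0.0] * len(means)      # soft frame counts per component
    first = [0.0] * len(means)  # posterior-weighted sums of frames
    for x in frames:
        for k, g in enumerate(posteriors(x, weights, means, variances)):
            n[k] += g
            first[k] += g * x
    adapted = []
    for k, mu in enumerate(means):
        alpha = n[k] / (n[k] + relevance)  # more data -> trust the data more
        data_mean = first[k] / n[k] if n[k] > 0 else mu
        adapted.append(alpha * data_mean + (1 - alpha) * mu)
    return adapted

# Hypothetical background model: (weights, means, variances)
ubm = ([0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
short = map_adapted_means([0.5, 5.5], *ubm)
long_ = map_adapted_means([0.5, 5.5, 0.4, 5.6, 0.6, 5.4] * 10, *ubm)
```

Both segments yield a vector with one entry per Gaussian, so an SVM can compare them directly. With multi-dimensional features the adapted mean vectors would be concatenated into one long supervector.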
Evaluation Results
[Figure: bar chart of DCF (axis 0 to 25) for GMM-UBM and GMM-SVM on three conditions (entire set, matched, unmatched); bar values shown: 23.41, 19.55, 14.58, 10.22, 8.09, 3.45]
Christian Müller, Joan-Isaac Biel, Edward Kim, and Daniel Rosario, “Speech-overlapped Acoustic Event Detection
for Automotive Applications,” in Proceedings of the Interspeech 2008, Brisbane, Australia, 2008.
How can you evaluate your multi-class models independently from the given application?
How can you establish an appropriate evaluation procedure in order to obtain valid results?
Proposing the detection task and the “pseudo NIST” evaluation procedure on the example of acoustic event detection and speaker age recognition.
Background
With multi-class recognition problems, many test/analysis methods are very application specific, e.g. confusion matrices.
We want a method that allows results to be generalized across a large set of applications.
With home-grown databases, parameter tuning on the evaluation set often compromises the validity of the results/inferences.
We want a fair “one shot” evaluation.
The Detection Task
Example: "emergency vehicle?" -> system -> yes, 1.324326
Given
a speech segment (s)
and an acoustic event to be detected (target event, ET)
the task is to decide whether ET is present in s (yes or no).
The system's output shall also contain a score indicating its confidence, with more positive scores indicating greater confidence.
Terminology
Segment class
e.g. segment event, segment age-class.
ground truth (not known).
Target
the hypothesized class.
Trial
a combination of segment and target.
Evaluation
Target | Decision | Score
emergency vehicle? | yes | 1.32432
music? | no | -0.3212
talking? | no | 1.8463
laughing? | no | -2.5773
phone? | yes | 0.00132
no event? | no | 2.20122
The system performance is evaluated by presenting it with a set of trials.
Each test segment is used for multiple trials.
The absence of all targets is explicitly included.
Type of Errors
“MISS”: segment “em. vehic.”, target “em. vehic.”?, system: no
“FALSE ALARM”: segment “em. vehic.”, target “phone”?, system: yes
Decision-Error Tradeoff
[Figure: misses versus false alarms, drawn as a dotted line, with the “equal error rate” point marked]
Selecting an operating point (decision threshold) along the dotted line trades off misses against false alarms.
The optimal operating point is application dependent.
Low false alarm rates are desirable for most applications.
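The effect of moving the decision threshold can be sketched by sweeping it over a set of trial scores. The example below is a simplified illustration with hypothetical scores; it picks the observed threshold where miss and false alarm rates are closest, which approximates the equal error rate on small data.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Miss and false-alarm rates when 'yes' means score >= threshold."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return p_miss, p_fa

def equal_error_rate(target_scores, nontarget_scores):
    """Sweep thresholds over observed scores; keep the one where both rates are closest."""
    best = None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        p_miss, p_fa = error_rates(target_scores, nontarget_scores, t)
        if best is None or abs(p_miss - p_fa) < abs(best[1] - best[2]):
            best = (t, p_miss, p_fa)
    threshold, p_miss, p_fa = best
    return threshold, (p_miss + p_fa) / 2

# Hypothetical scores for target and non-target trials.
targets = [2.0, 1.5, 0.2, -0.1]
nontargets = [0.4, -0.5, -1.0, -2.0]
threshold, eer = equal_error_rate(targets, nontargets)
```

Raising the threshold above this point lowers the false alarm rate at the price of more misses.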
Decision Cost Function
$C(E_T, E_N) = C_{Miss} \cdot P_{Target} \cdot P_{Miss}(E_T) + C_{FA} \cdot (1 - P_{Target}) \cdot P_{FA}(E_T, E_N)$
where $E_T$ and $E_N$ are the target and non-target events, and $C_{Miss}$, $C_{FA}$ and $P_{Target}$ are application model parameters.
The application parameters for EER are: $C_{Miss} = C_{FA} = 1$ and $P_{Target} = 0.5$
Weighted sum of misses and false alarms using variable costs and priors.
Application model parameters are selected according to the application.
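The cost function can be evaluated directly from a list of trials. The sketch below assumes each trial is stored as a (segment class, hypothesized target, decision) triple; that layout, the class names and the numbers are hypothetical.

```python
def rates_from_trials(trials, target, nontarget):
    """P_Miss for the target and P_FA for one target/non-target pair."""
    tgt = [d for seg, hyp, d in trials if hyp == target and seg == target]
    non = [d for seg, hyp, d in trials if hyp == target and seg == nontarget]
    p_miss = tgt.count(False) / len(tgt)  # target present, system said no
    p_fa = non.count(True) / len(non)     # non-target present, system said yes
    return p_miss, p_fa

def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Weighted sum of misses and false alarms; defaults are the EER parameters."""
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa

# Hypothetical trials: (segment class, hypothesized target, decision)
trials = [
    ("siren", "siren", True), ("siren", "siren", False),
    ("siren", "siren", True), ("siren", "siren", True),
    ("phone", "siren", True), ("phone", "siren", False),
]
p_miss, p_fa = rates_from_trials(trials, "siren", "phone")
cost = detection_cost(p_miss, p_fa)
```

Changing `c_miss`, `c_fa` or `p_target` re-weights the same error rates for a different application without re-running the system.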
Example DET-Plot
[Figure: DET plot, miss probability versus false alarm probability]
Christian Müller, Joan-Isaac Biel, Edward Kim, and Daniel Rosario, “Speech-overlapped Acoustic Event Detection
for Automotive Applications,” in Proceedings of the Interspeech 2008, Brisbane, Australia, 2008.
Example Cost Chart
COSTS: (At, An), columns = target class At, rows = non-target class An
An \ At | C | YF | YM | AF | AM | SF | SM
C | -- | 0.220 | 0.092 | 0.145 | 0.083 | 0.133 | 0.069
YF | 0.166 | -- | 0.081 | 0.201 | 0.080 | 0.198 | 0.070
YM | 0.076 | 0.084 | -- | 0.130 | 0.203 | 0.108 | 0.188
AF | 0.088 | 0.161 | 0.110 | -- | 0.095 | 0.219 | 0.082
AM | 0.064 | 0.083 | 0.254 | 0.139 | -- | 0.105 | 0.228
SF | 0.096 | 0.150 | 0.100 | 0.249 | 0.091 | -- | 0.095
SM | 0.065 | 0.085 | 0.238 | 0.117 | 0.246 | 0.118 | --
Avg Cost (At) | 0.092 | 0.130 | 0.146 | 0.164 | 0.133 | 0.147 | 0.122
Avg Cost (overall): 0.133
Acoustic GMM/SVM Supervector system on 7-class age task
Pseudo NIST Evaluation Procedure
ERL provided development and evaluation data as
representative as possible for the application.
Three months before the evaluation, ICSI was provided with
the development data.
At a pre-determined date, the blind evaluation data was
provided to ICSI for processing.
The system's output was submitted to ERL in NIST format.
ERL downloaded the scoring software from NIST’s website and made the necessary modifications due to the changes in the labels.
ERL ran the software on the submitted system output.
The results were then disclosed to ICSI along with the keys
(truth) for further analysis.
--> Fair “one-shot” evaluation, no parameter tuning on the
evaluation set.
How can you normalize your features in order to obtain a uniform scale and a uniform distribution?
Proposing rank normalization and polynomial rank normalization
Background
Fundamental frequency (pitch): 75-200 Hz
Jitter: 0.001324 PPQ
--> differing value ranges cause implicit feature weighting
Mean/Variance Normalization
$a_i = \frac{v_i - \min(v_i)}{\max(v_i) - \min(v_i)}$
[Figure: plot with horizontal axis from -1 to 1]
uniform scale
non-uniform distribution
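The scaling formula on this slide can be written out as a short function. This is a minimal sketch of that formula only; the handling of a constant feature (all values equal) is an added assumption, and the pitch values are hypothetical.

```python
def min_max_normalize(values):
    """Scale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical pitch values in Hz, clustered at the low end.
pitch = [75.0, 80.0, 85.0, 100.0, 200.0]
scaled = min_max_normalize(pitch)
```

All features end up on the same scale, but the clustering survives: four of the five scaled values lie in the lowest fifth of the range, so the distribution is still not uniform.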
Rank-Normalization
feature -> background model -> normalized feature
create an ordered list of values per feature using background data
rank = position in list / number of values
no occurrence mapped to 0
Example background model for feature 0101 (value -> rank): 0.01 -> 0.25, 0.06 -> 0.5, 0.13 -> 0.75, 0.29 -> 1
Example: feature 0101 with value 0.13 -> normalized feature 0101 with value 0.75
Further normalized features shown: 0123 0.4, 2317 0.2, ...
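The lookup described above can be sketched with a sorted list and binary search. The example reuses the background values shown on the slide for feature 0101; removing duplicate values and mapping unseen in-between values to the rank of the next lower entry (a step function instead of interpolation) are simplifying assumptions.

```python
from bisect import bisect_right

def build_rank_table(background_values):
    """Ordered list of distinct values observed in the background data."""
    return sorted(set(background_values))

def rank_normalize(value, table):
    """rank = position in list / number of values; below the smallest value -> 0."""
    return bisect_right(table, value) / len(table)

# Background values for feature 0101 as shown on the slide.
table = build_rank_table([0.01, 0.01, 0.06, 0.13, 0.29, 0.06, 0.13, 0.29])
rank_seen = rank_normalize(0.13, table)
rank_below = rank_normalize(0.0, table)
rank_above = rank_normalize(0.5, table)
```

One such table is needed per feature, which is where the storage cost of the lookup approach comes from.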
Rank Normalization
[Figure: two plots, horizontal axes from -1 to 1]
(+) uniform distribution
(-) large three-dimensional lookup tables
(-) linear interpolation for unseen values: what about larger values? smaller values?
Polynomial Rank Normalization
use ranks to train a polynomial
apply polynomial instead of look-up tables
better interpolation
no need to store look-up tables
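Training a polynomial on (value, rank) pairs can be sketched with an ordinary least-squares fit. The degree, the solver (normal equations with Gauss-Jordan elimination) and the toy data are assumptions chosen to keep the example self-contained; the talk does not specify how the polynomial is fitted.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest order first) via normal equations."""
    n = degree + 1
    # Augmented normal-equation system [A | b].
    rows = [
        [sum(x ** (i + j) for x in xs) for j in range(n)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * p for a, p in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]

def polyval(coeffs, x):
    """Evaluate the polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

# (value, rank) pairs as they would come out of a rank table.
values = [0.01, 0.06, 0.13, 0.29]
ranks = [0.25, 0.5, 0.75, 1.0]
coeffs = polyfit(values, ranks, degree=2)
normalized = polyval(coeffs, 0.10)  # unseen value, no table lookup needed
```

Only the coefficients have to be stored per feature. A fitted polynomial can leave the range of the training ranks for extreme inputs, so clipping the output may be worth considering.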
Christian Müller and Joan-Isaac Biel. The ICSI 2007 Language Recognition System. In Proceedings of the Odyssey 2008 Workshop on Speaker and Language Recognition. Stellenbosch, South Africa, 2008
Conclusions
Speech as a source of information for non-intrusive user modeling
Speech/signal processing | Take-away messages
Vocal aging -> features for speaker age recognition | Knowledge-driven feature selection
GMM/SVM supervector approach for acoustic speech features | Classification methods for independent “bag of observations” features
Detection task and pseudo-NIST evaluation procedure | Valid application-independent evaluation
Rank and polynomial rank normalization | Feature space warping normalization
Thank you!