                         Towards
          Musical Query-by-Semantic-Description
                using the CAL500 Dataset


                    Douglas Turnbull
                    Computer Audition Lab
                        UC San Diego


  Work with Luke Barrington, David Torres, and Gert Lanckriet


                            SIGIR
                        June 25, 2007
How do we find music?
• Query-by-Metadata - artist, song, album, year
  – We must know what we want

• Query-by-(Humming, Tapping, Beatboxing)
  – Requires talent

• Query-by-Song-Similarity
  – We must possess ‘acoustically’ similar songs

• Query-by-Semantic-Description
  – Google seems to work pretty well for text
  – Semantic Image Labeling is a hot topic in Computer Vision
  – Can it work for music?

                                                                1
Semantic Music Annotation and Retrieval
Our goal is to build a system that can
1. Annotate a song with meaningful words
2. Retrieve songs given a text-based query

  [Diagram: Frank Sinatra, ‘Fly Me to the Moon’ ↔ {‘Jazz’, ‘Male Vocals’,
   ‘Sad’, ‘Mellow’, ‘Slow Tempo’}; annotation maps the song to words,
   retrieval maps a word query back to songs]



Plan: Learn a probabilistic model that captures a relationship between
    the audio content of a song and words that describe the song.
We treat this as a supervised multi-class, multi-label problem.

                                                                     2
System Overview
  Data → Features → Modeling → Evaluation
  [Diagram: training data annotations yield a vocabulary and annotation
   vectors (y); audio-feature extraction yields feature vectors (X); a
   parametric model (set of GMMs) is fit by EM parameter estimation;
   inference then annotates a novel song (music review) and ranks songs
   for a text query (retrieval), followed by evaluation.]
                                                                            3
System Overview
  Data
  [Diagram: the Data stage — a training set of annotated songs — highlighted.]
                  4
The CAL500 data set
The Computer Audition Lab 500-song (CAL500) data set
•    500 ‘Western Popular’ songs
•    174-word vocabulary
     –   genre, emotion, usage, instrumentation, rhythm, pitch, vocal
         characteristics
•    3 or more annotations per song
•    55 paid undergrads annotate music for 120 hours


Other Techniques
1.   Text-mining of web documents
2.   ‘Human Computation’ Games - (e.g., Listen Game )



                                                                    5
System Overview
  Data → Features
  [Diagram: the Features stage — the vocabulary, document vectors (y), and
   audio-feature extraction (X) — highlighted.]
                                        6
Semantic Representation
We choose a vocabulary of ‘musically relevant’ words
Each annotation is converted to a real-valued vector.
    – Each element represents the ‘semantic association’ between a
      word and the song.


Example: Frank Sinatra’s ‘Fly Me to the Moon’
Vocab = {funk, jazz, guitar, female vocals, sad, passionate }
Annotation Vector = [0/4 , 3/4, 4/4 , 0/4 , 2/4, 1/4]
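The conversion above is a straightforward vote count. A minimal Python sketch (the `votes` layout and 4-annotator count are taken from the example; the variable names are illustrative):

```python
# Convert per-word annotator votes into an annotation vector,
# as in the 'Fly Me to the Moon' example above (4 annotators).
vocab = ["funk", "jazz", "guitar", "female vocals", "sad", "passionate"]
votes = {"jazz": 3, "guitar": 4, "sad": 2, "passionate": 1}
n_annotators = 4

annotation_vector = [votes.get(w, 0) / n_annotators for w in vocab]
# → [0.0, 0.75, 1.0, 0.0, 0.5, 0.25]
```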




                                                                 7
Acoustic Representation
Each song is represented as a bag-of-feature-vectors
   – Pass a short time window over the audio signal
   – Extract a feature vector for each short-time audio segment



Specifically, we calculate Delta MFCC feature vectors
   – Mel-frequency Cepstral Coefficients (MFCCs) represent the shape of a
     short-term (23 msec) spectrum
   – Popular for representing speech, music, and sound effects
   – Instantaneous derivatives (deltas) encode short-time temporal info
   – 10,000 39-dimensional vectors per minute




                                                                           8
System Overview
  Data → Features → Modeling
  [Diagram: the Modeling stage — a parametric model (set of GMMs) with
   parameters estimated by the EM algorithm — highlighted.]
                                                            9
Statistical Model
We adapt the Supervised Multi-class Labeling
 (SML) model
   – Set of probability distributions over the audio feature space
   – One Gaussian Mixture Model (GMM) per word - p(x|w)
   – Estimate parameters for GMM using the set of training songs
     that are positively associated with the word


Notes:
   – Developed for image annotation by Carneiro and Vasconcelos
   – Modified for real-valued semantic weights rather than binary
     class labels
   – Extended formulation to handle multi-word queries

                                                                     10
Gaussian Mixture Model (GMM)
A GMM is often used to model arbitrary probability
  distributions over high-dimensional spaces:

     p(x) = Σ_{r=1..R} π_r N(x; μ_r, Σ_r)

A GMM is a weighted combination of R Gaussian distributions
  • π_r is the r-th mixing weight
  • μ_r is the r-th mean
  • Σ_r is the r-th covariance matrix
These parameters are usually estimated using the
Expectation Maximization (EM) algorithm.
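As a concrete, library-based illustration (not the authors' implementation), scikit-learn's `GaussianMixture` fits a GMM with EM; the 2-D toy data here is an assumed stand-in for audio feature vectors:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy stand-in for audio feature vectors: two well-separated clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)),
               rng.normal(5.0, 1.0, (200, 2))])

# Fit an R = 2 component GMM; fit() runs the EM algorithm internally
gmm = GaussianMixture(n_components=2, covariance_type="diag",
                      random_state=0).fit(X)
```

After fitting, `gmm.weights_`, `gmm.means_`, and the covariances hold the estimated π_r, μ_r, Σ_r.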
                                                        11
Step 1 - Song GMMs
 To model each song:
 1. Segment audio signal
 2. Extract short-time feature vectors
 3. Estimate the GMM distribution using a ‘standard’ EM

                     [Figure: a bag of MFCC feature vectors (scatter of
                      points) fitted with a GMM density]
                                                                                        12
Word GMMs - p(x|w)
For each word w, we learn a word model p(x|w)
1. Identify all songs associated with w
   –   i.e., all ‘romantic’ songs
2. Estimate song-level GMMs
3. Use Weighted Mixture Hierarchies EM to estimate p(x|w)
   – Soft clustering of Gaussian components from song GMMs

  [Diagram: song GMMs for ‘romantic’ songs → efficient hierarchical
   estimation → semantic class model p(x|w)]

                                                                 13
System Overview
  Data → Features → Modeling
  [Diagram: inference highlighted — a novel song is annotated to produce
   a music review.]
                                                                       14
Annotation
Given a set of word GMMs and a novel song X = {x1, …, xT}, we
   calculate the likelihood of the song under each word GMM:

      P(w|X) ∝ P(w) ∏_{t=1..T} p(xt|w)

Assuming
1.   a uniform word prior P(w)
2.   feature vectors are conditionally independent given a word (Naïve Bayes)

These likelihoods can be interpreted as a semantic multinomial
distribution over the vocabulary of words.
Annotation involves picking the peaks of the semantic multinomial.
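A minimal numpy sketch of this inference step, with single diagonal Gaussians standing in for the word GMMs p(x|w) (the word names and parameters are purely illustrative):

```python
import numpy as np

def song_log_likelihood(X, mu, var):
    # log p(X|w) = sum_t log N(x_t; mu, var*I)  (naive Bayes over frames)
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var)
                                + (X - mu) ** 2 / var)))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (50, 2))          # T = 50 two-dim feature vectors

word_models = {"mellow": (0.0, 1.0), "loud": (5.0, 1.0)}  # (mean, var)
loglik = {w: song_log_likelihood(X, mu, var)
          for w, (mu, var) in word_models.items()}

# With a uniform word prior, normalizing the likelihoods gives the
# semantic multinomial; annotation picks its peaks.
vals = np.array(list(loglik.values()))
probs = np.exp(vals - vals.max())
probs /= probs.sum()
semantic_multinomial = dict(zip(loglik, probs))
```

Because the toy song's frames sit near 0, the ‘mellow’ model dominates the multinomial.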

                                                                           15
Annotation
Semantic Multinomial for “Give it Away” by the Red Hot Chili Peppers




                                                                       16
Annotation: Automatic Music Reviews
Dr. Dre (feat. Snoop Dogg) - Nuthin' but a 'G' thang
This is a dance poppy, hip-hop song that is arousing and exciting. It
  features drum machine, backing vocals, male vocal, a nice acoustic
  guitar solo, and rapping, strong vocals. It is a song that is very
  danceable and with a heavy beat that you might like listen to while
  at a party.


Frank Sinatra - Fly me to the moon
This is a jazzy, singer / songwriter song that is calming and sad. It
  features acoustic guitar, piano, saxophone, a nice male vocal solo,
  and emotional, high-pitched vocals. It is a song with a light beat and
  a slow tempo that you might like listen to while hanging with friends.


                                                                     17
System Overview
  Data → Features → Modeling
  [Diagram: inference highlighted — annotation of a novel song (music
   review) and retrieval for a text query.]
                                                                       18
Retrieval
1. Annotate each song in the corpus with a semantic multinomial p
   •   p = {P(w1|X), …, P(wV|X)}
2. Given a text-based query, construct a query multinomial q
   –   qw = 1/|w| , if word w appears in the query string
   –   qw = 0, otherwise
3. Rank order all songs by the Kullback-Leibler (KL) divergence
   between the query multinomial and each song’s semantic multinomial




 The compact semantic multinomial representation of a song allows us
 to quickly rank order songs.
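The three steps above can be sketched in a few lines of numpy; the song multinomials below are made-up toy values, not CAL500 results:

```python
import numpy as np

vocab = ["pop", "female vocals", "tender", "loud"]
# toy semantic multinomials for two hypothetical songs (each sums to 1)
songs = {
    "A": np.array([0.4, 0.3, 0.25, 0.05]),
    "B": np.array([0.1, 0.1, 0.1, 0.7]),
}
# query multinomial: uniform mass on the query words, 0 elsewhere
query_words = ["pop", "tender"]
q = np.array([1 / len(query_words) if w in query_words else 0.0
              for w in vocab])

def kl(q, p, eps=1e-12):
    # KL(q || p); terms with q_w = 0 contribute nothing
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / (p[mask] + eps))))

ranked = sorted(songs, key=lambda s: kl(q, songs[s]))  # best match first
```

Song A, whose multinomial puts mass on ‘pop’ and ‘tender’, ranks first.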


                                                                     19
Retrieval
The top 3 results for - “pop, female vocals, tender”

  1. Shakira - The One
  2. Alicia Keys - Fallin’
  3. Evanescence - My Immortal

  [Figure: the semantic multinomial for each retrieved song]
                                                       20
Retrieval: Query-by-Semantic-Description
     Query         Retrieved Songs

    ‘Tender’       Crosby, Stills and Nash - Guinnevere
                   Jewel - Enter from the East
                   Art Tatum - Willow Weep for Me
                   John Lennon - Imagine
                   Tom Waits - Time

 ‘Female Vocals’   Alicia Keys - Fallin’
                   Shakira - The One
                   Christina Aguilera - Genie in a Bottle
                   Junior Murvin - Police and Thieves
                   Britney Spears - I'm a Slave 4 U

    ‘Tender’       Jewel - Enter from the East
      AND          Evanescence - My Immortal
                   Cowboy Junkies - Postcard Blues
 ‘Female Vocals’   Everly Brothers - Take a Message to Mary
                   Sheryl Crow - I Shall Believe
                                                              21
System Overview
  Data → Features → Modeling → Evaluation
  [Diagram: the Evaluation stage — evaluating both annotation (music
   reviews) and retrieval — highlighted.]
                                                                            22
Quantifying Annotation
Our system annotates the CAL500 songs with 10 words
   from our vocabulary of 174 words.

     Model            Precision       Recall
     Random             0.14          0.06
     Our System         0.27          0.16
     Human              0.30          0.15




                                                      23
Quantifying Retrieval
We rank order songs by KL divergence from the query
   multinomial, once for each query.

        Model                      AROC
        Random                      0.50
        Our System - 1 Word         0.71
        Our System - 2 Words        0.72
        Our System - 3 Words        0.73




                                                      24
Demos


• CAL Music Search Engine - a content-based
  semantic music search engine.
• Listen Game - a ‘game with a purpose’ to collect
  semantic annotations of music.




                                                25
What’s on tap…
Large-scale system
  – Web-based, large-scale collection of reliable human annotations
     => Multiplayer, online game - Listen Game - ISMIR 07
  – Prune and extend vocabulary (automatically) - ISMIR 07
  – Novel Applications - Music Search Engine / Radio Player
Personalized search
  – Model homogeneous groups / individuals rather than population
     • Personalized Audio Search
  – Adjust to affective state of the user
Novel query paradigms
  - Query-by-semantic-example - ICASSP 07
  - Heterogeneous queries



                                                                26
“Talking about music is like dancing
about architecture”
                              - origins unknown




            cosmal.ucsd.edu/cal
                                                  27
System Overview
  Data → Features → Modeling → Evaluation
  [Diagram: the full system pipeline, repeated as a recap before the
   backup slides.]
                                                                            28
Acoustic Representation
Calculating Delta MFCC feature vectors
  –   Calculate a time-series for 13 MFCCs
  –   Append 1st and 2nd instantaneous derivatives
  –   5,200 39-dimensional feature vectors per minute of audio content
  –   Denoted by X = {x1,…, xT} where T depends on the length of the song
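The stacking step can be sketched with numpy alone; the MFCC matrix here is random stand-in data (in practice it would come from an MFCC extractor), and `np.gradient` is an assumed approximation of the instantaneous derivatives:

```python
import numpy as np

# Assumed shapes: a 13 x T time series of MFCCs over T frames.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(13, 200))          # 13 coefficients, 200 frames

d1 = np.gradient(mfcc, axis=1)             # 1st instantaneous derivative
d2 = np.gradient(d1, axis=1)               # 2nd instantaneous derivative
delta_mfcc = np.vstack([mfcc, d1, d2])     # 39 x T Delta MFCC matrix

X = delta_mfcc.T                           # bag of T 39-dim feature vectors
```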


                                                  [Figure: (top) short-time
                                                  Fourier transform; (middle)
                                                  time series of MFCCs;
                                                  (bottom) spectrum
                                                  reconstructed from the
                                                  MFCCs (log frequency)]
                                                                            29
Three Approaches to Parameter Estimation
For each word w, we want to estimate the parameters of
   a GMM p(x|w).
1. Direct Estimation
   1. Take the union of sets of feature vectors for each song that
      are semantically associated with w.
   2. Use the ‘standard’ Expectation Maximization (EM) algorithm to
      estimate the parameters of the GMM

Direct Estimation is computationally difficult and
    empirically converges to poor local optima.




                                                                     30
Three Approaches to Parameter Estimation
For each word w, we want to estimate the parameters of
   a GMM.
2. Model Averaging
   1. For each song associated with w, estimate a ‘song GMM’
      using the standard EM algorithm - p(x|s)
   2. Concatenate mixture components and renormalize mixture
      weights
Model averaging produces a distribution with a variable
  number of mixture components.
As the training set size grows, evaluating this distribution
    becomes prohibitively expensive.
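A minimal numpy sketch of the averaging step, using two made-up 1-D song GMMs (the dict layout is an assumption, not the authors' data structure):

```python
import numpy as np

# Two hypothetical 'song GMMs', each with component weights and means
song_gmms = [
    {"weights": np.array([0.6, 0.4]), "means": np.array([[0.0], [1.0]])},
    {"weights": np.array([0.5, 0.5]), "means": np.array([[4.0], [5.0]])},
]

# Concatenate mixture components across songs ...
weights = np.concatenate([g["weights"] for g in song_gmms])
means = np.vstack([g["means"] for g in song_gmms])

# ... and renormalize the mixture weights so they sum to 1
weights /= weights.sum()
```

The component count grows linearly with the number of training songs, which is exactly why evaluation becomes expensive.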


                                                               31
Mixture Hierarchies Density Estimation
For each word w, we want to estimate the parameters of a GMM.
Mixture Hierarchies EM
    1.   For each song associated with w, estimate a ‘song GMM’ using the
         standard EM algorithm.
    2.   Learn a ‘mixture of mixture components’ using the Mixture Hierarchies
         EM algorithm [Vasconcelos01]


Notes:
•   Computationally efficient for both parameter estimation and inference.
•   Each song is re-represented as a ‘smoothed’ estimate of its bag of
    feature vectors.
•   Combining song models abstracts the semantics of a common word.




                                                                             32
Quantifying Annotation
Our system annotates the CAL500 songs with 10 words
   from our vocabulary of 174 words.
   –   ‘Population Annotation’ Ground Truth


Metric: ‘Word’ Precision & Recall
Consider word w,
   Precision = (# songs correctly annotated with w) / (# songs annotated with w)

   Recall    = (# songs correctly annotated with w) / (# songs that should
               have been annotated with w)

Mean Word Recall and Word Precision are the averages
   over all words in our vocabulary.
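The two definitions above translate directly into code; the song sets below are toy examples for a single hypothetical word:

```python
# Per-word precision and recall for annotation, per the definitions above
def word_pr(annotated, truth):
    tp = len(annotated & truth)          # songs correctly annotated with w
    precision = tp / len(annotated) if annotated else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

# Toy example: songs our system tagged with a word vs. the ground truth
annotated = {"s1", "s2", "s3"}
truth = {"s1", "s3", "s4", "s5"}
p, r = word_pr(annotated, truth)         # p = 2/3, r = 0.5
```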
                                                              33
Quantifying Annotation
Our system annotates the CAL500 songs with 10 words
   from our vocabulary of 174 words.

       Model                Precision           Recall
       Random                  0.14               0.06
       Upper Bound             0.71               0.38
       Our System              0.27               0.16
       Human                   0.30               0.15


Compared with a human, our model is
  •   worse on objective categories - instrumentation, genre

  •   about the same on subjective categories - emotion, usage   34
Quantifying Retrieval
For each 1-, 2-, and 3-word query for which there are at least 5
   songs in the ground truth, we rank order test set songs
   according to the KL divergence between the query
   multinomial and the semantic multinomial.

Metric: Area under the ROC Curve (AROC)
   –   An ROC curve is a plot of the true positive rate as a function of
       the false positive rate as we move down this ranked list of
       songs.
   –   Integrating the curve gives us a scalar between 0 and 1 where
       0.5 is the expected value when randomly guessing.
Mean AROC is the average AROC over a large number of queries.
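As a library-based illustration (scikit-learn, not the authors' code), AROC for one toy query is just `roc_auc_score` over the ranked list; the labels and scores below are made up:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy ranked list of 6 songs for one query: 1 = relevant in ground truth,
# scores are the ranking criterion (e.g., negative KL divergence)
y_true  = np.array([1, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])

auc = roc_auc_score(y_true, y_score)     # 8 of 9 relevant/irrelevant
                                         # pairs are ordered correctly
```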


                                                                      35
A biased view of Music Classification
2000-03: Music classification (by genre, emotion, instrumentation)
  becomes a popular MIR task
   – Undergrad Thesis on Genre Classification with G. Tzanetakis
2003-04: MIR community starts to criticize music classification
  problems
   – ill-posed problem due to subjectivity
   – not an end in itself
   – performance ‘glass ceiling’
2004-06: Focus turns to Music Similarity research
   – Recommendation
   – Playlist generation
2006-07: We view Music Annotation as a supervised multi-class
  labeling problem
   – Like classification but with large, less-restrictive vocabulary


                                                                       36
Modeling Semantic Classes
Given a set of word models p(x|w) over a vocabulary of
  words,

Annotation: Given a novel song, we pick words by
  comparing the likelihood of the audio features under
  each word model.
Retrieval: Given a text query, we pick songs that are likely
  under the word models associated with the words in the
  query.




                                                          37