EVENT IDENTIFICATION IN SOCIAL MEDIA
Hila Becker, Luis Gravano (Columbia University)
Mor Naaman (Rutgers University)

Social Media Sites Host Many “Event” Documents
    “Event” = something that occurs at a certain time in a certain place [Yang et al. ’99]
        Popular, widely known events
            Presidential Inauguration, Thanksgiving Day Parade
        Smaller events, without traditional news coverage
            Local food drive, street fair
        …

    [Figure: social media documents for the “All Points West” festival, Liberty State Park, New Jersey, 8/8/08, from a photo-sharing site (Flickr), a video-sharing site (YouTube), and a social networking site (Facebook)]

Identifying Events and Associated Social Media Documents
    Applications
        Event search and browsing
        Local search
        …
    General approach: group similar documents via clustering
        Each cluster corresponds to one event and its associated social media documents

Event Identification: Challenges
    Uneven data quality
        Missing, short, uninformative text
        … but revealing structured context available: tags, date/time, geo-coordinates
    Scalability
    Dynamic data stream of event information
    Unknown number of events
        The number of events is a required input for many clustering algorithms
        Difficult to estimate

Clustering Social Media Documents
    Social media document representation
    Social media document similarity
    Social media document clustering
        Clustering task: definition
        Ensemble algorithm: combining multiple clustering results
    Preliminary evaluation

Social Media Document Representation
    Title
    Description
    Tags
    Date/Time
    Location
    All-Text

Social Media Document Similarity
    Features: Title, Description, Tags, Date/Time-Keywords, Location-Keywords, Date/Time-Proximity, Location-Proximity, All-Text
    Text: tf-idf weights, cosine similarity
    Time: proximity in minutes
    Location: geo-coordinate proximity
    [Figure: documents labeled A and B placed along a time axis]
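
The three similarity measures named on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the smoothed idf formula, the one-year `window_minutes` normalizer, the haversine distance, and the `max_km` cutoff are all assumptions, chosen only so that each score falls in [0, 1].

```python
import math
from collections import Counter
from datetime import datetime

def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumption) keeps every weight strictly positive.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def time_similarity(t1, t2, window_minutes=60 * 24 * 365):
    """Proximity in minutes, mapped to [0, 1]; window_minutes is hypothetical."""
    minutes = abs((t1 - t2).total_seconds()) / 60.0
    return max(0.0, 1.0 - minutes / window_minutes)

def geo_similarity(p1, p2, max_km=100.0):
    """Geo-coordinate proximity from haversine distance; max_km is hypothetical."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    km = 2 * 6371.0 * math.asin(math.sqrt(min(1.0, a)))
    return max(0.0, 1.0 - km / max_km)

# Toy documents invented for illustration.
docs = [["festival", "liberty", "park"], ["festival", "park", "stage"], ["food", "drive"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))
print(time_similarity(datetime(2008, 8, 8, 18, 0), datetime(2008, 8, 8, 19, 30)))
print(geo_similarity((40.70, -74.05), (40.71, -74.04)))
```

Keeping one similarity function per feature type matches the deck's idea of treating each document representation separately before combining them.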

Social Media Document Clustering Framework
    Social media documents → Document feature representation → Event clusters

Clustering: Ensemble Algorithm
    Individual clusterers, each with a weight learned in a training step
        C_title, weight W_title
        C_tags, weight W_tags
        C_time, weight W_time
    Consensus function f(C, W): combine ensemble similarities
    Output: ensemble clustering solution
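
One way to read the consensus function f(C, W) is sketched below, following the deck's later description of it as a weighted average of the clusterers' binary predictions. The function name `consensus_similarity`, the dict-based data layout, and the example weights are assumptions made for illustration, not the authors' code.

```python
def consensus_similarity(doc_a, doc_b, clusterings, weights):
    """Weighted average of binary 'same cluster' votes for a pair of documents.

    clusterings: {clusterer name: {document id: cluster id}}
    weights:     {clusterer name: weight}, e.g. learned in a training step
    """
    total_weight = sum(weights[name] for name in clusterings)
    score = 0.0
    for name, assignment in clusterings.items():
        # Each clusterer votes 1 if it placed both documents in the same cluster.
        vote = 1.0 if assignment[doc_a] == assignment[doc_b] else 0.0
        score += weights[name] * vote
    return score / total_weight

# Toy cluster assignments and weights invented for illustration.
clusterings = {
    "title": {"d1": 0, "d2": 0, "d3": 1},
    "tags":  {"d1": 0, "d2": 1, "d3": 1},
    "time":  {"d1": 0, "d2": 0, "d3": 0},
}
weights = {"title": 0.2, "tags": 0.5, "time": 0.3}

for pair in [("d1", "d2"), ("d2", "d3"), ("d1", "d3")]:
    print(pair, consensus_similarity(*pair, clusterings, weights))
```

The resulting pairwise scores lie in [0, 1] and can serve as the similarity input to a final clustering step.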

Clustering: Measuring Quality
    Homogeneous clusters
    Complete clusters
    Metric: Normalized Mutual Information (NMI)
        Shared information between clustering solution and “ground truth”
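
NMI can be computed directly from two label lists, as in the sketch below. Several normalizations of mutual information are in use; dividing by the mean of the two entropies, as done here, is one common choice and is an assumption about the variant intended on the slide.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two label assignments over the same items."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((nij / n) * math.log(nij * n / (count_a[x] * count_b[y]))
               for (x, y), nij in joint.items())

def nmi(predicted, truth):
    """Mutual information normalized by the mean entropy (assumed variant)."""
    h = entropy(predicted) + entropy(truth)
    if h == 0.0:
        return 1.0  # both assignments put every item in a single cluster
    return 2.0 * mutual_information(predicted, truth) / h

# Toy labels invented for illustration: cluster ids vs. event labels.
print(nmi([0, 0, 1, 1], ["parade", "parade", "fair", "fair"]))
print(nmi([0, 1, 0, 1], ["parade", "parade", "fair", "fair"]))
```

A score near 1 means the clustering and the ground truth share nearly all their information; a score near 0 means the clustering says almost nothing about the true events.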

Experimental Setup
    Data: >270K Flickr photos
        Event labels from Yahoo!’s “Upcoming” event database
        Split into 3 parts for training/validation/testing
    Clusterers: single pass algorithm with centroid similarity
    Weighting scheme: Normalized Mutual Information (NMI) scores on validation set
    Consensus function: weighted average of clusterers’ binary predictions
    Final prediction step: single pass clustering algorithm
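
A single pass clusterer with centroid similarity can be sketched as below. This is a generic version of the technique under stated assumptions: documents are sparse term-weight dicts, similarity is cosine, and the `threshold` value is hypothetical. The deck does not give the authors' threshold or update rule.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def single_pass_cluster(vectors, threshold=0.5):
    """Assign each document, in arrival order, to the most similar cluster
    or open a new cluster when no centroid is similar enough."""
    # Each entry is the running sum of its members' vectors. Cosine ignores
    # scale, so the sum points in the same direction as the mean centroid.
    centroid_sums = []
    assignments = []
    for vec in vectors:
        best, best_sim = None, 0.0
        for idx, centroid in enumerate(centroid_sums):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is not None and best_sim >= threshold:
            for term, weight in vec.items():
                centroid_sums[best][term] = centroid_sums[best].get(term, 0.0) + weight
            assignments.append(best)
        else:
            centroid_sums.append(dict(vec))
            assignments.append(len(centroid_sums) - 1)
    return assignments

# Toy documents invented for illustration.
stream = [
    {"concert": 1.0, "nj": 1.0},
    {"concert": 1.0, "nj": 1.0, "stage": 1.0},
    {"parade": 1.0, "nyc": 1.0},
]
print(single_pass_cluster(stream))
```

Because each document is seen once and the number of clusters is never fixed in advance, this style of algorithm suits the scalability and unknown-number-of-events challenges listed earlier in the deck.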

Preliminary Evaluation Results
    Individual clusterer performance
        Highest NMI: Tags, All-Text
        Lowest NMI: Description, Title
    Ensemble performance, compared against all individual clusterers
        Highest overall performance in terms of NMI
        More homogeneous clusters: each event is spread over fewer clusters
    Details in paper

Future Work: Alternative Choices
    Document similarity metric
        Ensemble approach
            Weight assignment
            Choice of clusterers
        Train a classifier to predict document similarity
            Features correspond to similarity scores
                All-text, title, tags, time, location, etc.
                Numeric values in [0,1]
            State-of-the-art classifiers: SVM, Logistic Regression, …
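
The classifier idea can be illustrated with a small logistic regression trained by plain gradient descent. Everything specific here is an assumption for illustration: the feature order (all-text, tags, time similarity), the toy training pairs, and the `lr` and `epochs` settings. A real system would more likely use an established library implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_same_event(w, b, x):
    """Estimated probability that a document pair belongs to the same event."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Toy pairs invented for illustration. Each row holds similarity scores in
# [0, 1] for one document pair; label 1 means "same event".
X = [[0.9, 0.8, 0.95], [0.8, 0.9, 0.9], [0.1, 0.2, 0.1], [0.2, 0.1, 0.3]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
print(predict_same_event(w, b, [0.85, 0.9, 0.9]))
print(predict_same_event(w, b, [0.1, 0.1, 0.2]))
```

The predicted probability plays the same role as the ensemble's consensus score: a learned pairwise similarity in [0, 1].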

Future Work: Alternative Choices
    Final clustering step
        Apply graph partitioning algorithms
            Requires estimating the number of clusters
    Evaluation metrics: beyond NMI
    Datasets
        Flickr, LastFM, YouTube
        Exploit social network connections

Conclusions
    Identified events and their corresponding social media documents
        Proposed a clustering solution
        Leveraged different representations of social media documents
        Employed various social media similarity metrics
    Developed a weighted ensemble clustering approach
    Reported preliminary results of our event identification approach on a large-scale dataset of Flickr photographs

								