Contextual Analysis of User Interests in Social Media Sites
– An Exploration with Micro-blogs
Nilanjan Banerjee, Dipanjan Chakraborty, Koustuv Dasgupta, Anupam
    Joshi, Sameer Madan, Sumit Mittal, Seema Nagar, Angshu Rai
                          [CIKM ’09]

                                           Advisor: Dr. Koh Jia-Ling
                                          Reporter: Che-Wei, Liang
                                             Date: 2009/10/26
                     Outline
•   Introduction
•   Data Set
•   Mining Real-Time User Interests
•   Discovering Associations in User Interests
•   Pattern Discovery in Interest Clusters
•   Conclusion and Future Work



                      Introduction
• Social media
   – A growing trend on the web: a majority of new
     content is user-generated
      • Share thoughts (blogs, microblogs)
      • Multimedia (YouTube, Flickr)
      • Personal information (Facebook, Orkut)

• But, “What attracts people initially is not what keeps
  them vested in the long run”
   – Need to provide more value-centric usage
      • Include search, advertising, commercial transactions
                  Introduction
• Contextual communication is emerging
  – Consumers are actively using end devices
     • Updating status, mood, current interests

  – The rich context of a user is not limited to her location and
    availability, but extends directly to her personality
  – An analysis of microblogs can potentially infer
    what interests a person



                   Introduction
• Challenges
  – Tweets tend to be stream-of-consciousness fragments
  – Tweets lack structure and description

• In this paper,
  – Report the results of analyzing tweets using
    unstructured text mining techniques



                       Data Set
• Collect data from Twitter
  – Select the most active users across 10 cities
  – Collect tweets over four weeks
     • From March 2009 to April 2009
     • Tweet: <user name, tweet, time of publishing the tweet>




  Mining Real-Time User Interests
• Tweets usually have the following properties
  – ephemeral:
     • the interest in an activity changes over time
  – descriptive:
     • the interest can be described using one or more
       indicative keywords or terms
  – localized:
     • the interest (or activity) is usually associated with
       (contextual) location information


  Mining Real-Time User Interests
• Identify tweets expressing interests by
  content-indicative and usage-indicative keywords
   – Content-indicative keywords (category words)
      • Express the broad class (category) of user interests, e.g.
        movie, sports, etc.
   – Usage-indicative keywords
      • Characterize the activity associated with a particular
        interest
      • Can be either temporal or action keywords


  Mining Real-Time User Interests
• First, explore which keywords Twitter users use most




• Exclude pronouns, prepositions, helping verbs,
  question words, non-indicative words
• Stem the words using the Porter stemming algorithm
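
The filtering and counting step can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the stop list is a small hypothetical sample, and `keyword_counts` and its `stem` parameter are names invented here. The `stem` argument defaults to the identity function; a Porter stemmer (for example `nltk.stem.PorterStemmer().stem`, if NLTK is installed) could be passed instead.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop list; the slides do not give the real one.
STOP_WORDS = {
    "i", "you", "he", "she", "it", "we", "they",      # pronouns
    "to", "in", "on", "at", "of", "for", "with",      # prepositions
    "is", "am", "are", "was", "be", "have", "will",   # helping verbs
    "what", "when", "where", "who", "why", "how",     # question words
    "a", "an", "the", "and",                          # non-indicative words
}


def keyword_counts(tweets, stem=lambda word: word):
    """Count keywords across tweets after stop-word removal and stemming."""
    counts = Counter()
    for text in tweets:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token not in STOP_WORDS:
                counts[stem(token)] += 1
    return counts


tweets = ["I will watch a movie with friends",
          "What movie are you going to watch"]
counts = keyword_counts(tweets)
top_keywords = counts.most_common(2)
```

Sorting the resulting counter by frequency gives the kind of "most used keywords" ranking this slide refers to.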
  Mining Real-Time User Interests
• Content-indicative Keywords
  – Form an initial list of category keywords
     • Consult WordNet and IMDb
  – Enrich the seed list of keywords by
     • Manually inspecting thousands of tweets and including
       “interest-indicative words”
  – Finally, identify five seed categories from the list
    of category keywords
     • movie, music, food, sports, dance

  Mining Real-Time User Interests
• Usage-indicative keywords
  – Words qualifying the content-indicative keywords
    can provide valuable insights into user interests
     • E.g. “movie” occurs along with “go” and “tomorrow”

  – Two kinds of category keyword neighboring terms
     • Action keywords (e.g. go, see, look)
     • Temporal keywords (e.g., today, tomorrow)



  Mining Real-Time User Interests
• Use a term-frequency-based measure
  – Estimate the occurrences of temporal and action words
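
One way such a frequency estimate could look is sketched below. It assumes the measure is the fraction of tweets mentioning a category keyword that also contain a given action or temporal word; the slides do not spell out the exact definition. The word lists reuse the examples from the slides, and `usage_word_frequencies` is a hypothetical helper name.

```python
from collections import Counter

# Seed lists taken from the examples on the slides; real lists would be larger.
ACTION_WORDS = {"go", "see", "look"}
TEMPORAL_WORDS = {"today", "tomorrow"}


def usage_word_frequencies(tweets, category_word):
    """Relative frequency of action/temporal words in tweets mentioning category_word."""
    action, temporal = Counter(), Counter()
    matching = 0
    for text in tweets:
        tokens = text.lower().split()
        if category_word not in tokens:
            continue
        matching += 1
        present = set(tokens)  # count each word at most once per tweet
        action.update(present & ACTION_WORDS)
        temporal.update(present & TEMPORAL_WORDS)
    if matching == 0:
        return {}, {}
    return ({w: c / matching for w, c in action.items()},
            {w: c / matching for w, c in temporal.items()})


tweets = ["go see a movie tomorrow", "movie today anyone", "go for a run tomorrow"]
action_freq, temporal_freq = usage_word_frequencies(tweets, "movie")
```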




  Mining Real-Time User Interests
• Context-based discovery of keywords
  – Consider non-stemmed words to enrich the knowledge base of
    keywords
     • Stemming loses tense information
  – Discover similar words by
     • Finding matches that are contextually similar to
       the seed dictionary words




      Mining Real-Time User Interests
  • POS-based discovery of action verbs
       – Use a POS analyser to extract action verbs
       – Identify the relevant action verbs that show a high
         correlation with identified category words
       – Add them to the existing set of usage-indicative keywords



• D represents the total number of tweets
• A = { tweets containing the category word “cw” }
• B = { tweets containing the action word “aw” }
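
The sets A and B and the count D can be computed directly from the tweets, as in the sketch below. The two scores it returns (Jaccard overlap and lift) are assumptions chosen purely for illustration; they are common co-occurrence measures and are not claimed to be the correlation formula used in the paper. `cooccurrence_scores` is a hypothetical name.

```python
def cooccurrence_scores(tweets, cw, aw):
    """Return (Jaccard, lift) for category word cw and action word aw."""
    tokenized = [set(t.lower().split()) for t in tweets]
    d = len(tokenized)                                         # D: total number of tweets
    a = {i for i, toks in enumerate(tokenized) if cw in toks}  # A: tweets containing cw
    b = {i for i, toks in enumerate(tokenized) if aw in toks}  # B: tweets containing aw
    both = len(a & b)
    # Assumed measures; the slide's own formula is not reproduced here.
    jaccard = both / len(a | b) if (a | b) else 0.0
    lift = both * d / (len(a) * len(b)) if (a and b) else 0.0
    return jaccard, lift


tweets = ["watch movie tonight", "watch the game", "movie was great", "lunch time"]
jaccard, lift = cooccurrence_scores(tweets, cw="movie", aw="watch")
```

A lift above 1 would suggest the two words appear together more often than independence predicts; action verbs scoring high against a category word would be candidates for the usage-indicative list.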
Discovering Associations in User Interests

• Goal:
   – Explore different latent semantic associations
     between content-indicative category words and
     usage-indicative action/temporal words

• N-Gram Analysis
• Contextual Analysis using k-means clustering
• Temporal Analysis


                   N-Gram Analysis
• If a user intends to act on an interest, he/she is likely to
  use indicative action and/or temporal words to
  express it
   – E.g. “I want to watch a movie tonight”

• Employ bigram-based analysis of category words
   – Co-occurring words can be at a variable distance
     (a tolerance limit of 5 words)
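
A rough sketch of this variable-distance bigram extraction follows. It assumes the tolerance means a neighbor may sit up to 5 token positions before or after the category word, and that whitespace tokenization is adequate; both are assumptions, and `windowed_pairs` is a hypothetical name.

```python
from collections import Counter


def windowed_pairs(tweets, category_word, tolerance=5):
    """Count (neighbor, category_word) pairs within `tolerance` tokens of each other."""
    pairs = Counter()
    for text in tweets:
        tokens = text.lower().split()
        for i, token in enumerate(tokens):
            if token != category_word:
                continue
            lo = max(0, i - tolerance)
            hi = min(len(tokens), i + tolerance + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs[(tokens[j], category_word)] += 1
    return pairs


tweets = ["I want to watch a movie tonight"]
wide = windowed_pairs(tweets, "movie")                 # default tolerance of 5
narrow = windowed_pairs(tweets, "movie", tolerance=1)  # adjacent words only
```

With the default tolerance, "watch" pairs with "movie" even though "a" sits between them; with a tolerance of 1 only the adjacent words "a" and "tonight" are counted.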



                N-Gram Analysis




• People tend to tweet about activities that
  are planned for different times of the day
   – E.g. “party tonight”

Contextual Analysis using k-means clustering

• To discover new groups of tweets and
  perform a contextual analysis
  – Clustering is a well-accepted technique for grouping
    similar documents
  – Use k-means clustering
  – Analyze clusters to discover latent associations of
    cluster tags with other words in the cluster
     • Tag each cluster with its most frequently occurring words
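
The clustering and tagging steps can be sketched in plain Python as below. The slides do not state the feature representation, the distance, the value of k, or the initialization, so everything here is an assumption: term-count vectors, Euclidean distance, and centroids seeded with the first k tweets to keep the example deterministic. The helper names `vectorize`, `kmeans`, and `tag_clusters` are hypothetical.

```python
from collections import Counter
import math


def vectorize(tweets):
    """Turn each tweet into a term-count vector over a shared vocabulary."""
    docs = [Counter(t.lower().split()) for t in tweets]
    vocab = sorted(set().union(*docs))
    return [[doc[w] for w in vocab] for doc in docs], vocab


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's algorithm; seeds centroids with the first k vectors (assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def tag_clusters(tweets, labels, k, top_n=2):
    """Tag each cluster with its most frequently occurring words."""
    tags = []
    for c in range(k):
        words = Counter()
        for text, label in zip(tweets, labels):
            if label == c:
                words.update(text.lower().split())
        tags.append([w for w, _ in words.most_common(top_n)])
    return tags


tweets = ["movie tonight movie", "pizza dinner pizza", "movie tonight", "pizza dinner"]
vectors, vocab = vectorize(tweets)
labels = kmeans(vectors, k=2)
tags = tag_clusters(tweets, labels, k=2)
```

In practice stop words would be removed before tagging, otherwise the most frequent words in every cluster tend to be function words.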



Contextual Analysis using k-means clustering
             -Cluster Analysis




Contextual Analysis using k-means clustering
            Sub-Cluster Analysis
• Analyzed the content of clusters having content-indicative tags,
  temporal words, and action words
   – Ran k-means on them and gathered the predominant sub-clusters




             Temporal Analysis
• Real-time interests have a significant temporal
  component; capturing it can yield insights into word
  associations with the temporal aspect of interests




 Pattern Discovery in Interest Clusters
• A microscopic analysis of select content-
  indicative clusters

• Built a benchmark set
  – 5000 tweets, a mix
     • from the party, food, sports, and movie clusters
  – Manually tagged those that indicate a real-time
    interest (i.e. positive tweets)


 Patterns in Real-time Interest Tweets
• Patterns can be of several types:
  1. Word occurrence-based
     • e.g. “gym” occurs with “go” in positive tweets
  2. Grammar-based
     • e.g. “party” is preceded by a verb of the form “going for”
       in positive tweets
  3. Precedence-based
     • e.g. “tonight” succeeds “movie”
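
Two of these pattern types, word occurrence and precedence, can be checked with simple token tests, as in the sketch below. It assumes whitespace tokenization and exact word matches. Grammar-based patterns would additionally need a part-of-speech tagger and are not covered. The function names `cooccurs` and `succeeds` are hypothetical.

```python
def cooccurs(tweet, first, second):
    """Word occurrence-based pattern: both words appear in the tweet."""
    tokens = tweet.lower().split()
    return first in tokens and second in tokens


def succeeds(tweet, earlier, later):
    """Precedence-based pattern: `later` appears somewhere after `earlier`."""
    tokens = tweet.lower().split()
    if earlier not in tokens:
        return False
    return later in tokens[tokens.index(earlier) + 1:]


gym_hit = cooccurs("Time to go to the gym", "gym", "go")
movie_hit = succeeds("movie tonight anyone", "movie", "tonight")
movie_miss = succeeds("tonight no movie for me", "movie", "tonight")
```

Counting how often each check fires on positive versus non-positive benchmark tweets would indicate which patterns are discriminative.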



   Patterns in Real-time Interest Tweets
• Sports category
   – An intention to play a sport or go and watch a game
• Food category
   – A real-time intention to have food or go to a restaurant




Patterns in Real-time Interest Tweets
• Party category
   – Depicts the user’s intention to get involved in a party
• Movie category
   – Expresses an intention to watch a movie in the near future




Differentiating Intentions from Tweets
       -Word Affinity measure
• Affinity of a word “w” to a set of tweets “T”
   – Defined as the probability that “w” occurs in “T”
   – Used to compute the associations of frequently used
     words in tweets
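
A minimal sketch of the affinity measure follows. It assumes the probability is estimated as the fraction of tweets in T that contain w at least once; a token-level estimate would be an equally plausible reading of the slide. The `affinity` function name and the sample tweets are illustrative only.

```python
def affinity(word, tweets):
    """Estimate P(word occurs in a tweet of T) as the fraction of tweets containing it."""
    if not tweets:
        return 0.0
    hits = sum(1 for t in tweets if word in t.lower().split())
    return hits / len(tweets)


positive = ["going to a movie tonight", "movie with friends tonight"]
negative = ["that movie was awful", "tonight i sleep early"]

tonight_pos = affinity("tonight", positive)
tonight_neg = affinity("tonight", negative)
going_pos = affinity("going", positive)
```

Comparing a word's affinity to the positive set against its affinity to the remaining tweets gives a simple signal for separating intentions from other mentions.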




     Real-time Interest Classification
            -Initial Evaluation
• An evaluation of how some traditional text
  classification algorithms perform in classifying tweets




• Need to further exploit several mechanisms
   – Word-usage based heuristics, rule-based filtering



     Conclusion And Future Work
• Investigated and evaluated microblogs by
   – Using contextual information about their users to capture
     real-time user interests
      •   Revealed enough keywords that express interests
      •   Used statistical techniques to discover associations
      •   Clustering revealed words indicative of user interests
      •   Discovered patterns from clusters


• There exists ample scope for research
   – Identifying user context
      • Emotions, presence, location

				