Tweet Classification

Sub-topic clustering on tweets and
generating brief pseudo summaries
Summarization
    Team Members


§   Anil Sutrala
§   Snigdha Verma
§   Dinesh Singla
§   Raghav K
    Introduction


§ Summarizing Twitter posts can be viewed as an
  instance of the more general problem of
  automated text summarization.
§ A Twitter post, or tweet, is at most 140 characters
  long. In this study we consider only English
  posts.
     Basic Idea


§ Identify the important entities in a cluster of
  tweets.
§ For each cluster, identify the most important
  entities of each type (geographic location,
  person, etc.) using TF-IDF scores.
§ Find the most important tweets using these
  important entities.
§ Generate a brief pseudo summary for each
  cluster from the important entities and important
  tweets.
       Dataset


§ Labeled tweets are taken from the RepLab dataset.
§ RepLab is a competitive evaluation exercise for
  Online Reputation Management systems.
§ The dataset provides a set of tweets, each
  with a tweet id, author, entity id, and text.
§ The labeled dataset contains the fields tweet id,
  author, entity id, filtering, polarity, topic, and topic
  priority.
MVP Model
Named Entity Recognition


   § Using the labeled data, we generate base
     tweet clusters for further processing, keyed by the
     tweet topic name.
   § We then use the Aritter NLP tool to identify named
     entities, attributes and attribute relations.
   § We generate TF-IDF scores for the recognized
     entities.
   § Location (the “geo-loc” named entity type in the Aritter
     classification) is treated as the highest-priority
     type among all named entities.
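
The clustering and scoring steps above can be sketched roughly as follows. This is a minimal illustration, not the team's implementation: the slides do not give the TF-IDF formula, so the sketch assumes each cluster is treated as one document and uses tf * (1 + log(N / df)). The tweet dictionary keys ("topic", "entities") and both function names are hypothetical, and the entities are assumed to have already been extracted by the NER tool as (name, type) pairs.

```python
import math
from collections import Counter, defaultdict


def cluster_by_topic(tweets):
    """Group tweets into base clusters keyed by their labeled topic name."""
    clusters = defaultdict(list)
    for tweet in tweets:
        clusters[tweet["topic"]].append(tweet)
    return dict(clusters)


def entity_tfidf(clusters):
    """Return {topic: {(name, ne_type): score}}.

    Assumption: each cluster counts as one document, so tf is the number
    of times an entity occurs in the cluster and df is the number of
    clusters in which it occurs.
    """
    n_clusters = len(clusters)
    counts = {
        topic: Counter(ent for tweet in tweets for ent in tweet["entities"])
        for topic, tweets in clusters.items()
    }
    # Each Counter yields its distinct entities once, so this counts clusters.
    doc_freq = Counter(ent for c in counts.values() for ent in c)
    return {
        topic: {
            ent: tf * (1 + math.log(n_clusters / doc_freq[ent]))
            for ent, tf in c.items()
        }
        for topic, c in counts.items()
    }
```

Entities that appear in only one cluster receive the largest weight under this formula, which matches the goal of finding entities that characterize a cluster.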
Generate Summary Per Cluster


     § Generate a map from named entity type to named
       entities over the list of all tweets; call this map
       NETYPE_MAP.
     § The tweet summary takes one of three broad forms:
        – Case I: the named entities with the maximum
          TF-IDF are all of location type
        – Case II: no location-type named entity
          has the maximum TF-IDF, and only the top
          (max TF-IDF) named entity type is “important”
        – Case III: the named entities with the maximum
          TF-IDF are of location and other
          types
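
One possible reading of NETYPE_MAP and the three-way case split is sketched below. The function names, the tweet dictionary layout, and the (name, type) entity pairs are assumptions made for illustration; only the name NETYPE_MAP and the "geo-loc" type come from the slides. The sketch assumes a non-empty score dictionary.

```python
import math
from collections import defaultdict


def build_netype_map(tweets):
    """Map each named entity type to the set of entity names of that type."""
    netype_map = defaultdict(set)
    for tweet in tweets:
        for name, ne_type in tweet["entities"]:
            netype_map[ne_type].add(name)
    return dict(netype_map)


def summary_case(scores, location_type="geo-loc"):
    """Return 1, 2 or 3 for a cluster's {(name, ne_type): tfidf} scores.

    Assumes scores is non-empty; max() raises ValueError otherwise.
    """
    top = max(scores.values())
    # Types of all entities tied for the maximum TF-IDF score.
    top_types = {
        ne_type
        for (_, ne_type), score in scores.items()
        if math.isclose(score, top)
    }
    if top_types == {location_type}:
        return 1  # only location entities reach the maximum
    if location_type not in top_types:
        return 2  # no location entity reaches the maximum
    return 3      # mixed: location and other types tie at the maximum
```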
       Case 1


§ This case occurs when the named entities
  with the maximum TF-IDF are all of location type.
§ In this case the summary is the
  collection of tweet texts that contain the
  location-type named entities with the maximum
  TF-IDF count.
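
A minimal sketch of this selection, assuming each tweet is a dictionary with hypothetical "text" and "entities" keys and that the cluster's scores are given as {(name, type): tfidf}:

```python
import math


def case1_summary(tweets, scores, location_type="geo-loc"):
    """Return the texts of tweets mentioning a top-scoring location entity."""
    top = max(scores.values())
    top_locations = {
        ent
        for ent, score in scores.items()
        if ent[1] == location_type and math.isclose(score, top)
    }
    # Keep a tweet if any of its (name, type) pairs is a top location.
    return [
        tweet["text"]
        for tweet in tweets
        if top_locations.intersection(tweet["entities"])
    ]
```

The tweets are returned in their original order; the slides do not say whether the summary is ranked or truncated further.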
       Case 2


§ This case occurs when no location-type
  named entity has the maximum TF-IDF and
  only the top (max TF-IDF) named entity
  type is “important”.
§ A named entity type is marked as
  “important” only if its TF-IDF count is at
  least half of the max TF-IDF count.
§ Max is the largest TF-IDF count for an NE type in the
  cluster.
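
The "important" rule can be written as a one-line filter. The sketch below takes a per-type score dictionary as input, since the slides do not say how an entity type's TF-IDF count is derived from its entities; the function name is hypothetical.

```python
def important_types(type_scores):
    """Return the NE types whose score is at least half of the maximum.

    type_scores: {ne_type: tfidf_count} for one cluster (non-empty).
    """
    top = max(type_scores.values())
    return {ne_type for ne_type, score in type_scores.items() if score >= top / 2}
```

For example, with scores of 4.0, 2.0 and 1.9, the first two types are important (2.0 is exactly half of 4.0) and the third is not.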
       Case 3


§ This case occurs when the named entities with the
  maximum TF-IDF are of location and other types (mixed
  case).
§ It is handled as a subcase of Case 2: a named
  entity type is marked as “important” only if
  its TF-IDF count is at least half of the
  max TF-IDF count.
Web UI for summaries


 § We generate a web UI for the tweet cluster
   summaries.
 § Clusters are listed in alphabetical
   order, and each summary is shown in the
   following format:
 § Cluster Label: @ <Location>
 § <List of Named entities>
 § <Tweet Cluster Summary>
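
As a rough plain-text illustration of this layout (the actual UI is a web page, and its markup is not shown in the slides), the format could be rendered as below. The sketch assumes "Cluster Label" stands for each cluster's own label, sorts labels case-insensitively, and uses hypothetical function names and input layout.

```python
def render_cluster(label, location, entities, summary_tweets):
    """Render one cluster: label and location, entity list, then summary."""
    lines = [f"{label}: @ {location}", ", ".join(entities)]
    lines.extend(summary_tweets)
    return "\n".join(lines)


def render_all(clusters):
    """clusters: {label: (location, entities, summary_tweets)}.

    Clusters are emitted in alphabetical order of their labels.
    """
    return "\n\n".join(
        render_cluster(label, *clusters[label])
        for label in sorted(clusters, key=str.casefold)
    )
```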
      Results


§ We have generated pseudo summaries for the
  tweet clusters and will compare these
  summaries with those of the TextRank tool.
§ A sample screenshot is shown on the next
  slide.
Results

				
DOCUMENT INFO
posted:4/26/2014
language:English
pages:14