									 National Conference on Role of Cloud Computing Environment in Green Communication 2012

                                 Opinion Mining in Twitter
                                                     Siva Krishna Kumar.E
                                       Department of Information Science and Technology
                                               College Of Engineering, Guindy,

                                       Department of Information Science and Technology
                                               College Of Engineering, Guindy
Abstract— The evolution of web 2.0 has made social media networks more prevalent. Blogs, posts, forums and tweets have become an
important medium for internet users to share views. This approach is mainly used to find public opinion on products such as laptops and
mobile phones on the basis of the evaluative factor. The major issues in obtaining opinions are the presence of abbreviated words,
colloquial words and polysemous words. This project focuses on the preprocessing of tweets, followed by polarity checking, and finally
summarizes the opinion about a product. The system has been tested on a large number of tweets, and the result analysis shows that the
system performs well for different kinds of tweets.

   Keywords-opinion mining; social media networks; product feature mining

                                                        I.     INTRODUCTION
        The explosion of Web 2.0 has resulted in a phenomenal change in the way communication takes place.
The growing use of forums, blogs and other social media has made opinion mining essential. Social media
networks pave the way for people to communicate and express opinions. Michael Henley defined a social media
network as “A group of internet based applications built on technological foundations of Web 2.0 that
enable creation and exchange of User Generated information”. These social media networks have introduced
pervasive changes in the way organizations communicate. Social media networks exist in several forms such
as internet forums, web blogs, social blogs, microblogging, podcasts and social bookmarking. Twitter and
Facebook are ranked as the most used social media networks. Analysis reveals that Twitter processed
more than one billion tweets in December 2009 and an average of 40 million tweets per day. By the latest
count, Facebook has approximately 700 million users.
Texts can contain two general types of information: facts and opinions. While traditional text mining
focuses on the analysis of facts, opinion mining concentrates on attitudes. Three main fields of research
predominate in opinion mining: sentiment classification, feature-based opinion mining and comparative
opinion mining. Sentiment classification deals with classifying entire documents according to the opinions
they express towards certain objects. There are three basic factors for arriving at an opinion, namely the
evaluative factor, the potency factor and the activity factor. The analysis currently focuses on mining tweets
about mobile phones, laptops and bikes. The mining strategy used is feature based (evaluative factor).
                                                      II.     EXISTING WORKS
        Tian-jie Zhan and Chun-hung Li in 2010 proposed corpus-level feature extraction using
semantic structure parsing. In this approach the finer-grained product features are represented as a pair of
clusters: a cluster of nouns and a cluster of semantic neighbours. Freimut Bodendorf et al. in 2010 proposed a
feature-based opinion mining approach for extracting and analyzing customer opinions on automobile
products posted in blogs. This approach comprises four basic steps, namely selection, extraction,
aggregation and analysis.
Department of CSE, Sun College of Engineering and Technology

        William B. Claster et al. in 2010 proposed Bayes and ANN based social media data analysis. They
aimed at mining Twitter blogs for tourist sentiment. The strength of sentiment is measured using a binary
choice keyword algorithm and multi-knowledge self-organizing maps. The rating of sentiments using the
binary choice algorithm varies between positive orientation and negative orientation. This measure is plotted
against a timeline, and the peaks and valleys are monitored using Kohonen self-organizing maps.
        Peter D. Turney et al. in 2009 described the PMI method, which analyses the co-occurrence of terms x
and y. The log of the ratio of the joint probability of occurrence of x and y to the product of their individual
probabilities of occurrence gives the PMI score for the terms x and y. This method also makes lexical
resources available in positive and negative orientation forms. However, this method is less efficient than the
conjunctive method since it does not make full use of the corpora, but it scales to large corpora without
degrading performance and accuracy.
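
The PMI score described above can be illustrated with a minimal Python sketch. It assumes that probabilities are estimated as the fraction of tweets containing a term (or a pair of terms) and that the logarithm is taken to base 2; the function name pmi_scores and the sample tweets are hypothetical and are not taken from the cited work.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(documents):
    """Return {(x, y): PMI} for every term pair that co-occurs in at least one document."""
    n_docs = len(documents)
    term_counts = Counter()   # number of documents containing each term
    pair_counts = Counter()   # number of documents containing both terms of a pair
    for doc in documents:
        terms = sorted(set(doc.lower().split()))
        term_counts.update(terms)
        pair_counts.update(combinations(terms, 2))
    scores = {}
    for (x, y), joint in pair_counts.items():
        p_xy = joint / n_docs
        p_x = term_counts[x] / n_docs
        p_y = term_counts[y] / n_docs
        # PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Hypothetical sample tweets, already reduced to a feature word and an opinion word.
tweets = [
    "battery excellent",
    "battery excellent",
    "screen poor",
    "screen excellent",
]
scores = pmi_scores(tweets)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

A positive score means the two terms occur together more often than chance would predict, and a negative score means they occur together less often. Pairs are keyed in alphabetical order, so ("poor", "screen") is stored rather than ("screen", "poor").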

                                                      III.     SYSTEM ARCHITECTURE
The overall architecture comprises a preprocessing module, part-of-speech tagging, orientation analysis
(positive / negative / neutral) and finally summarization of the opinion in a form appealing to the user.
The key processes involved are downloading appropriate tweets, performing stemming and stop word
removal, handling abbreviations, POS tagging and finally calculating the orientation of the opinion.

Fig. 3.1 Overall System Architecture of opinion mining System.

       A. Preprocessing

       Preprocessing is essential as this is the process which makes
       the tweet suitable for mining. Abbreviation handling, anaphora resolution and word sense
       disambiguation are all important for this purpose. In the preprocessing step, abbreviation
       handling, stemming and stop word removal are performed. Figure 3.3 provides the various
       stages involved in preprocessing.

         1) Stop Word Removal and Stemming
       The first stage in preprocessing is stop word removal. Stop words are those words which are
       not useful in generating patterns for mining opinions. Adjectives and nouns are the parts of
       speech which help in identifying opinion. Hence conjunctions and prepositions are filtered from
       the tweet by the process of stop word removal. Stemming is then the process of reducing a term
       to its root word.


      Algorithm for Stop word removal

      Input: Tweets with stop words
      Output: Tweets without stop words

          •   Read the tokens of each tweet.
          •   For each token
                  o If the token matches a stop word, remove the token.
          •   Update the tweet database with the filtered tokens, which contain no stop words.
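
The stop word removal steps can be sketched in Python as follows. The stop word list and the function name remove_stop_words are hypothetical; the paper does not publish the list it uses, and the tweet database is replaced here by a plain list of tokens.

```python
# Hypothetical stop word list (conjunctions, prepositions, articles and similar words).
STOP_WORDS = {"a", "an", "and", "but", "for", "in", "is", "of", "on", "or", "the", "to", "with"}

def remove_stop_words(tweet):
    """Return the tokens of the tweet with every stop word filtered out."""
    tokens = tweet.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]

filtered = remove_stop_words("The camera of the phone is good but the battery is poor")
print(filtered)
```

Keeping the stop words in a set makes each membership test a constant-time lookup, which matters when a large number of tweets is processed.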

      Algorithm for Stemming
      Input: Stop word removed Tweet
      Output: Tweet with root word
          •   Read each token.
          •   Look the token up in the WordNet dictionary.
                  o If a word in the dictionary matches the token, the token is written to the
                    output file unchanged.
                  o Otherwise, the token is given to the WordNet Stemmer to obtain the root word,
                    and the output is written into the database.
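
A minimal sketch of this dictionary-first stemming step is given below. It is an illustration under stated assumptions, not the WordNet stemmer itself: a small hypothetical word list stands in for the WordNet dictionary, and a handful of suffix rules stand in for the stemmer.

```python
# Hypothetical stand-ins: a tiny word list in place of the WordNet dictionary
# and a few suffix rules in place of the WordNet stemmer.
DICTIONARY = {"good", "camera", "charge", "work", "battery", "phone"}
SUFFIXES = ("ing", "ed", "es", "s")

def stem(token):
    """Return the token itself if it is a dictionary word, else its root word if one is found."""
    if token in DICTIONARY:
        return token
    for suffix in SUFFIXES:
        if token.endswith(suffix):
            base = token[: -len(suffix)]
            # Try the bare base first, then the base with a restored final "e".
            for candidate in (base, base + "e"):
                if candidate in DICTIONARY:
                    return candidate
    return token  # no root word found; leave the token unchanged

roots = [stem(t) for t in ["cameras", "charging", "worked", "good", "xyz"]]
print(roots)
```

Checking each candidate root against the dictionary avoids producing non-words, which a purely rule-based suffix stripper can do.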

        2) Handling Abbreviations
      Almost every tweet contains abbreviations. A list of abbreviations is maintained as a text file,
      and additions to it are propagated to the abbreviation database.

       Algorithm for Resolving Abbreviations

      Input: Tweets with Abbreviated words
      Output: Tweets with Unabbreviated words
         • Pass every word in the tweet to WordNet.
                  o If a valid synonym exists, skip the word.
                  o Else, queue it into the list of abbreviations.
         • Find the type of each abbreviation by matching it with the standard patterns
              defined in the abbreviation database.
         • Replace the acronyms with their expansions.

      As soon as a word is added to the abbreviation text file, a trigger is generated which updates
      the abbreviation database. Hence the overhead incurred in manual updating is eliminated.
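
The abbreviation resolution steps can be sketched as below. The sketch assumes a hypothetical set of valid words in place of the WordNet lookup and a hypothetical dictionary in place of the abbreviation database; the pattern-matching step that determines the type of abbreviation is omitted because the paper does not list the patterns.

```python
# Hypothetical stand-ins for the WordNet lookup and the abbreviation database.
KNOWN_WORDS = {"the", "phone", "is", "good", "battery", "life"}
ABBREVIATIONS = {"gr8": "great", "imo": "in my opinion", "btw": "by the way"}

def expand_abbreviations(tweet):
    """Replace every known abbreviation in the tweet with its expansion."""
    expanded = []
    for word in tweet.lower().split():
        if word in KNOWN_WORDS:
            expanded.append(word)  # valid word: skip it
        else:
            # Candidate abbreviation: expand it if the database has an entry,
            # otherwise keep the word as it is.
            expanded.append(ABBREVIATIONS.get(word, word))
    return " ".join(expanded)

result = expand_abbreviations("imo the phone is gr8")
print(result)
```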

      B. Transliteration
      In social media networks the posts are highly informal, except for some official blogs. Most
      tweets contain words in transliterated form, for instance “eppadi irukku indha feature”,
      which is actually a Tamil phrase written in English characters. In order to extract feature
      scores from such words, translation to the corresponding English words has to be performed.
      Algorithm for Transliteration
      Input: colloquial word
      Output: transliterated word
          • Pass every word in the tweet to the WordNet
                  o If valid synonym exists, skip the word.
                  o Else, queue it into list to be transliterated.
          • Check if a match for the word exists in the transliteration database.
                  o If a match is found, replace the transliterated word by the corresponding English word.
                  o Else, queue it to a list for a future update.
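
The lookup described above can be sketched as follows. The set of valid English words and the three-entry transliteration table are hypothetical stand-ins for WordNet and the transliteration database; words with no match are collected in a pending list for a later update of the database.

```python
# Hypothetical stand-ins for the WordNet lookup and the transliteration database.
KNOWN_WORDS = {"feature", "camera", "good"}
TRANSLITERATIONS = {"eppadi": "how", "irukku": "is", "indha": "this"}

def translate_transliterated(tweet, pending):
    """Replace transliterated words with English words; queue unmatched words in `pending`."""
    words = []
    for word in tweet.lower().split():
        if word in KNOWN_WORDS:
            words.append(word)                    # valid English word: skip it
        elif word in TRANSLITERATIONS:
            words.append(TRANSLITERATIONS[word])  # match found: replace it
        else:
            pending.append(word)                  # no match: queue for a future update
            words.append(word)
    return " ".join(words)

pending = []
english = translate_transliterated("eppadi irukku indha feature nalla", pending)
print(english)
print(pending)
```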
      C. POS Tagging
      POS tagging is important as this process identifies the parts of each sentence that need to be
      processed. The main focus lies in the processing of adjectives. Tagging of the tweets is performed
      using the Stanford POS tagger.

      Algorithm for POS Tagging

      Input: Preprocessed Tweet
      Output: POS tagged tweet

          •   Check if the word is valid by passing it to WordNet.
                   o If it is valid, assign the most likely tag to the word.
          •   If the word is not valid, assign a proper noun tag to the word.
          •   Define lexical rules which are predefined for the parts of speech.
          •   Calculate error score for the deviation that occurs between the rule and already present
          •   Reapply the Rule until the error score falls below the threshold.
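
The first steps of this algorithm (most likely tag for valid words, proper noun tag for the rest) can be sketched as below. This is not the Stanford POS tagger: the lexicon is a hypothetical stand-in that maps each known word to a Penn Treebank tag, and the rule-based correction loop is left out because the paper does not specify the rules or the threshold.

```python
# Hypothetical lexicon mapping each known word to its most likely Penn Treebank tag.
LEXICON = {
    "the": "DT",
    "battery": "NN",
    "phone": "NN",
    "of": "IN",
    "is": "VBZ",
    "good": "JJ",
}

def tag_tweet(tweet):
    """Tag each token with its most likely tag; unknown tokens get the proper noun tag NNP."""
    return [(token, LEXICON.get(token.lower(), "NNP")) for token in tweet.split()]

tagged = tag_tweet("The battery of Nokia is good")
print(tagged)
```

Tagging unknown words as proper nouns works reasonably for product and brand names such as "Nokia", which are unlikely to appear in a general dictionary.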

      D. Orientation Checking and Opinion Summarisation
      The main goal of this paper is to find the sentiment of the user. The user could have posted
      the tweet with a positive or negative attitude. Certain tweets do not take either side; they are
      generic and hence considered to be neutral. In order to find the overall opinion about a
      product, the opinion of every individual tweet in relation to the product has to be obtained.

      Algorithm for Orientation Checking

      Input: Adjective words in a tweet
      Output: Weight of the tweet
         • Obtain each tweet from the POS tagged database.
         • Extract the words with the tag JJ, i.e. adjectives.
         • Obtain the weight of each adjective from the predefined list.
         • Replace the word with its corresponding weight.
         • Check the weight of the tweet:
                  o Weight>0
                              Increment the positive opinion count for the product.
                  o Weight<0
                              Increment the negative opinion count for the product.
                  o Weight=0
                              Increment the neutral opinion count for the product.
         • Represent the total opinion count in a pie chart for the users to analyze.
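
The orientation check can be sketched as below. Two points are assumptions rather than details given in the paper: the adjective weights are hypothetical, and the weight of a tweet is taken to be the sum of the weights of its adjectives. The pie chart is replaced by a plain count of positive, negative and neutral tweets.

```python
from collections import Counter

# Hypothetical adjective weights; the paper's predefined list is not given.
ADJECTIVE_WEIGHTS = {"good": 1, "excellent": 2, "poor": -1, "terrible": -2}

def tweet_weight(tagged_tweet):
    """Sum the weights of the adjectives (tag JJ) in one POS-tagged tweet (assumed aggregation)."""
    return sum(
        ADJECTIVE_WEIGHTS.get(word.lower(), 0)
        for word, tag in tagged_tweet
        if tag == "JJ"
    )

def summarize(tagged_tweets):
    """Count the positive, negative and neutral tweets for one product."""
    counts = Counter(positive=0, negative=0, neutral=0)
    for tagged in tagged_tweets:
        weight = tweet_weight(tagged)
        if weight > 0:
            counts["positive"] += 1
        elif weight < 0:
            counts["negative"] += 1
        else:
            counts["neutral"] += 1
    return counts

tagged_tweets = [
    [("battery", "NN"), ("excellent", "JJ")],
    [("screen", "NN"), ("poor", "JJ"), ("terrible", "JJ")],
    [("bought", "VBD"), ("phone", "NN")],
    [("good", "JJ"), ("camera", "NN"), ("poor", "JJ"), ("battery", "NN")],
]
counts = summarize(tagged_tweets)
print(dict(counts))
```

With summation, a tweet that contains one positive and one equally weighted negative adjective is counted as neutral, as is a tweet with no adjectives at all.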
      E. Temporal Opinion Mining
          The opinion of the public changes every day and every minute. Hence time-based opinion
         mining is required. In this analysis the opinion of the public over a time period can be

             The system performance was tested on a large number of inputs, and the accuracy of the
      system is high. Figure 4.1 shows the accuracy of the system for 200 tweets about the
      product Nokia. Figure 4.2 shows the accuracy of the system for the product Dell. The system
      is more accurate when the number of tweets is high.

      No. of tweets     No. of tweets correctly classified     No. of tweets wrongly classified     Accuracy

      Fig.4.1 Accuracy graph for product “Nokia”


      Fig. 4.2 Accuracy graph for the product “Dell”


         The system developed for mining public opinion from product-oriented tweets has been
      implemented successfully. This work will be further extended to resolve anaphora occurrences,
      to extract features from the product specification and to mine opinions on those features.


          [1]    A. Go, L. Huang, R. Bhayani, “Twitter Sentiment Analysis”, Volume: 2009, Issue: June, Publisher: Association for
                 Computational Linguistics, Pages: 1-17.
          [2]    Finn Årup Nielsen, “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs”.
          [3]    Andrea Esuli, “Opinion Mining” Language and Intelligence Reading Group.
          [4]    Andrea Esuli, Fabrizio Sebastiani, 2005, “Determining the Semantic Orientation of Terms through Gloss Classification”, Conference
                 on Information and Knowledge Management, 2005.
          [5]    Andrea Esuli and Fabrizio Sebastiani, Conference paper.
          [6]    Bing Liu, “Opinion Mining”.
          [7]    Bing Liu, Minqing Hu, Junsheng Cheng , “Opinion Observer: Analyzing and Comparing Opinions on the Web”, ACM 1-59593-046-
          [8]    Bo Pang and Lillian Lee, Shivakumar Vaithyanathan “Thumbs up? Sentiment Classification using Machine Learning Techniques”.
          [9]    Casey Whitelaw, Navendu Garg, Shlomo Argamon, 2005, “Using Appraisal Groups for Sentiment Analysis”, CIKM’05, ACM,
                 October 31–November 5, 2005, Bremen, Germany.
          [10]   Christopher Scaffidi, Kevin Bierhoff, Eric Chang, Mikhael Felker, Herman Ng, Chun Jin, “Red Opal: Product-Feature Scoring from
                 Reviews” International ACM conference 2007.
          [11]   Chun-hung Li, (2009), “Sentence Factorization for Opinion Feature Mining”, International Conference on Computational aspects of
                 social networks, IEEE, pp 129-132.
          [12]   David D.Lewis , “Feature Selection and Feature Extraction for Text Categorization”.
          [13]   Ding Zhou, Jiang Bian, Shuyi Zheng, Hongyuan Zha, C. Lee Giles, “Exploring Social Annotations for Information Retrieval”,
                 Refereed Track: Social Networks & Web 2.0 - Applications & Infrastructures for Web 2.0, ACM978-1-60558-085-2/08/04.
          [14]   Ding, X., Liu, B. and Yu, P. A Holistic Lexicon-Based Approach to Opinion Mining. Proceedings of the first ACM International
                 Conference on Web search and Data Mining (WSDM’08), 2008.
          [15]   Freimut Bodendorf, Carolin Kaiser (2010), “Mining Customer Opinions on the Internet: A Case Study in the Automotive Industry”,
                 Third International Conference on Knowledge Discovery and Data Mining, IEEE pp 24-27.
          [16]   Hatzivassiloglou, V. and McKeown, K. R. 1997. Predicting the semantic orientation of adjectives. In Proceedings of the Eighth
                 Conference on European Chapter of the Association for Computational Linguistics (Madrid, Spain, July 07 - 12, 1997). European
                 Chapter Meeting of the ACL. Association for Computational Linguistics, Morristown, NJ, 174-181.
          [17]   Hu, M and Liu, B. Mining and Summarizing Customer Reviews. Proceedings of ACM SIGKDD International Conference on
                 Knowledge Discovery and Data Mining (KDD’04), 2004.
          [18]   Jindal, N. and Liu, B. Mining Comparative Sentences and Relations. Proceedings of National Conference on Artificial Intelligence
                 (AAAI’06), 2006.
          [19]   Judita Preiss ,(2001),” Machine Learning for Anaphora Resolution”,2001
          [20]   Judita Preiss, “Machine Learning for Anaphora Resolution” , 2001.
          [21]   Kanayama, H. and Nasukawa, T. Fully Automatic Lexicon Expansion for Domain-Oriented Sentiment Analysis. Proceedings of the
                 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP’06), 2006.
          [22]   Kim, S. and Hovy, E. Determining the Sentiment of Opinions. Proceedings of the 20th International Conference on Computational
                 Linguistics (COLING’04), 2004.
          [23]   Klaar Vanopstal, Bart Desmet, Véronique Hoste, “Towards a learning approach for abbreviation detection and resolution”.
          [24]   Kushal Dave, D., Lawrence, A., and Pennock, D. Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of
                 Product Reviews. Proceedings of International World Wide Web Conference (WWW’03), 2003.
          [25]   Minqing Hu and Bing Liu, 2004, “Mining Opinion Features in Customer Reviews”, American Association for Artificial Intelligence.
          [26]   Bo Pang and Lillian Lee, “Opinion Mining and Sentiment Analysis”. Ravi Parikh and Matin Movassate, “Sentiment Analysis of
                 User-Generated Twitter Updates using Various Classification Techniques”.
          [27]   P. Turney, Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews, Proc. of the 40th
                 Ann. Meeting of the Assoc. for Computational Linguistics, 2002.
          [28]   Pang, B., Lee, L. and Vaithyanathan, S. Thumbs up? Sentiment Classification Using Machine Learning Techniques. Proceedings of
                 the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP’02), 2002.
          [29]   Peter D. Turney and Michael L. Littman. 2002. Unsupervised learning of semantic orientation from a hundred-billion-word corpus.
                 Technical Report EGB-1094, National Research Council Canada.
          [30]   Peter Turney. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proc. of
                 the ACL.
          [31]   Popescu, A.-M. and Etzioni, O. Extracting Product Features and Opinions from Reviews. Proceedings of the 2005 Conference on
                 Empirical Methods in Natural Language Processing (EMNLP’05), 2005.
          [32]   Pulse: Mining Customer Opinions from Free Text Michael Gamon, Anthony Aue, Simon Corston-Oliver, and Eric Ringger.
          [33]   Ruslan Mitkov,” ANAPHORA RESOLUTION: THE STATE OF THE ART”.
          [34]   Theresa Wilson , Janyce Wiebe, Paul Hoffmann, “Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis”, Proceedings
                 of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP),
                 pages 347–354, Vancouver, October 2005.


          [35] Tian-jie Zhan Chun-hung Li,2010, IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology,
               pp 465-467.
          [36] Turney, P. D. and Littman, M. L. 2003. Measuring praise and criticism: Inference of semantic orientation from association. ACM
               Trans. Inf. Syst. 21, 4 (Oct. 2003), 315-346.
          [37] Turney, P. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. ACL’02, 2002.
          [38] Vasileios Hatzivassiloglou and Janyce Wiebe. 2000. Effects of adjective orientation and gradability on Sentence subjectivity. In Proc.
               of COLING.
          [39] William B. Claster, Hung Dinh, Malcolm Cooper, (2010), “Naive Bayes and Unsupervised Artificial Neural Nets for Cancun Tourism
               Social Media Data Analysis”, Second World Congress on Nature and Biologically Inspired Computing Dec. 15-17,2010 in
               Kitakyushu, Fukuoka, Japan,pp 158-163.
          [40] Wilson, T., Wiebe, J. and Hwa, R. Just How Mad Are You? Finding Strong and Weak Opinion Clauses. Proceedings of National
               Conference on Artificial Intelligence (AAAI’04), 2004.
          [41] Wilson, T., Wiebe, J. and Hoffmann.P, “Recognizing contextual polarity in phrase level sentiment analysis”,Proceedings of the conf
               on Human Language Technology and Empirical methods in Natural Language Processing.
          [42] Yorick Wilks and Mark Stevenson. 1998. The grammar of sense: Using part-of-speech tags as a first step in semantic disambiguation.
               Journal of Natural Language Engineering, 4(2):135–144.
          [43] Yun-Qing Xia, Rui-Feng Xu, Kam-Fai Wong, Fang Zheng, “The Unified Collocation Framework For Opinion Mining”.
