BlogVox2: A Modular Domain Independent
Sentiment Analysis System

Sandeep Balijepalli
Master's Thesis, 2007
                   Overview
•   Introduction / Motivation
•   Problem Statement & Contribution
•   Related Work
•   Framework
•   Sentiment Filters
•   Search and Trend Analysis
•   Experiments & Results
•   Conclusion & Future Work

                           9/14/2012   Page 2
                         Social Media & Blogs

Social media defines the socialization of information as well as the
tools that facilitate conversations. [1] (Examples: MySpace, YouTube,
Wikipedia…)

Blogs are popular because they let authors express opinions and
critique topics.

We focus on political blogs since they carry a high density of
sentiment-bearing words.

Examples: Hillary Clinton, Obama and Howard Dean are just some of the
famous politicians who use blogs.

     [1] http://en.wikipedia.org/wiki/Social_media
                  Motivation
• Lack of a domain independent framework for
  sentiment analysis

• Upcoming elections
  • A better tool for politicians.
  • A better tool for the average American.

• Need for sentence-level analysis in sentiment
  classification

• Opinmind was proprietary.
                   Overview
•   Introduction / Motivation
•   Problem Statement & Contribution
•   Related Work
•   Framework
•   Sentiment Filters
•   Search and Trend Analysis
•   Experiments & Results
•   Conclusion & Future Work

        Problem Statement

• Analyze sentiment detection at the
  sentence level.
• Examine the performance of various
  classification techniques.

• Develop a sentiment analysis framework
  that is domain independent.
                   Contribution
• Sentence level sentiment analysis framework.
• Prototype applications to use the framework.

• Performance analysis of different filter techniques.
• Worked with Justin Martineau to develop trend
  analysis.

• Akshay Java provided the political URL dataset.




                   Overview
•   Introduction / Motivation
•   Problem Statement & Contribution
•   Related Work
•   Framework
•   Sentiment Filters
•   Search and Trend Analysis
•   Experiments & Results
•   Conclusion & Future Work

                         Related Work
     BlogVox1 [1]
     • Document-level scoring module; sentence-level analysis should be
       the focus
     • Classification is based on the bag-of-words approach; other
       machine learning techniques would improve the results


     Turney (2002) [2]
     • Unsupervised review classification.
     • Operates at the paragraph level, and it is difficult to classify
       individual blog sentences with their method.

[1] Akshay Java, Pranam Kolari, Tim Finin, James Mayfield, Anupam Joshi, and Justin Martineau. BlogVox: Separating Blog Wheat from Blog Chaff. January 2007.

[2] Peter D. Turney. Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, July 7-12, 2002, Philadelphia, Pennsylvania.
                           Related Work cont…
Pang, Lee & Vaithyanathan (2002) [1]
• Different techniques are analyzed; unigrams are shown to
  perform well in the movie domain.
• But according to Engstrom [2], these techniques are domain
  dependent.

Soo-Min Kim and Eduard Hovy [3]
• They use a seed wordlist and a unigram approach to identify
  sentence sentiments.
• This is not sufficient: the seed wordlist drawn from the WordNet
  dataset introduces a lot of noise [4]

  [1] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. 2002.
  [2] Charlotte Engstrom. Topic dependence in sentiment classification. Master's thesis, University of Cambridge, July 2004.
  [3] Soo-Min Kim and Eduard Hovy. Determining the Sentiment of Opinions. Proceedings of the 20th International Conference on Computational Linguistics (COLING), August 23-27, Geneva, Switzerland. 2004.
  [4] Brian Eriksson. Sentiment Classification of Movie Reviews using Linguistic Parsing. 2005.
                  Overview
•   Introduction / Motivation
•   Problem Statement & Contribution
•   Related Work
•   Framework
•   Sentiment Filters
•   Search and Trend Analysis
•   Experiments & Results
•   Conclusion & Future Work
                       Framework

[Framework diagram: blog posts from political sites (www.dailykos.com,
www.mediamatters.com; e.g. http://www.dailykos.com/storyonly/2007/6/5/1211/30670)
are split into sentences ("Obama is good.", "I like Edwards.", "President Bush
is good.", "Edwards is hasty.", "If President Bush and Vice President Cheney
can blurt out vulgar language…"), run through the sentiment filters, and
indexed; queries such as "Hillary Clinton" are then matched against tracked
politicians (Bush, Clinton, Obama).]
                 Datasets (Political URLs)
Datasets employed

   • Lada A. Adamic Political Dataset – 3028 political URLs.
   • Lada A. Adamic Labeled Dataset – 1490 blogs.
   • Twitter Dataset [1]
   • Spinn3r Dataset – live feeds [2]


  Experimental analysis

     109 feeds were used for experimental analysis

    [1] www.twitter.com     [2] www.tailrank.com

                   Overview
•   Introduction / Motivation
•   Problem Statement & Contribution
•   Related Work
•   Framework
•   Sentiment Filters
•   Search and Trend Analysis
•   Experiments & Results
•   Conclusion & Future Work
Overview of Filters in sentiment analysis

[Filter pipeline diagram: sentences enter the Pattern Recognizer; sentences it
does not match (No) go to the Naïve Bayes filter (unigram and bigram), then to
the Parts of Speech filter; sentences rejected by all three filters are
discarded as objective sentences. Sentences accepted by any filter (Yes) pass
through the Named Entity Checker and bag-of-words scoring to the Multiple
Indexer, which maintains one index per day (Index_20070604 – Index_20070607).]
                     Datasets (filter)
Pattern Matching Dataset (classified manually)
        92 positive patterns
       163 negative patterns

Training (Naïve Bayes)
  Movie Dataset                    Political Dataset (classified manually)
    5331 negative sentences          273 negative sentences
    5000 neutral sentences           320 neutral sentences
    5331 positive sentences          178 positive sentences

The political wordlist contains: [1]
       2712 negative words
        915 positive words

[1] Akshay Java, Pranam Kolari, Tim Finin, James Mayfield, Anupam Joshi, and Justin Martineau. BlogVox: Separating Blog Wheat from Blog Chaff. January 2007.
Pattern Recognition filter - Overview
"The pattern recognizer filter is a custom-developed, domain-based filter for
identifying patterns."

[Filter pipeline diagram, as on the "Overview of Filters" slide.]
Pattern Recognition filter – Working Model

Chunked sentences such as:
  "She is well respected and won many admirers for her staunch support for women."
  "I hate George Bush."
  "John Edwards is my least favorite."
are fed to the Pattern Recognizer. Sentences that match no pattern (e.g. "I want
to be like Hillary.") are passed on to the next filter; matching sentences (e.g.
"I admire Hillary.", "Obama is annoying") go through the Named Entity Checker
and bag-of-words scoring to the Multiple Indexer, which writes records to the
current index, e.g.:

  Sentence : "they hate Bush"
  Date     : Thu Apr 19, 2007 at 08:14:12 PM PDT
  Url      : www.mediamatters.com
  Permalink: http://mediamatters.org/items/200508290005
  Polarity : negative
  Strength : 1

  Sentence : "I like Clinton."
  Date     : Thu Apr 19, 2007 at 08:14:12 PM PDT
  Url      : www.dailykos.com
  Permalink: http://www.dailykos.com/story/2007/4/13/114310/235
  Polarity : positive
  Strength : 1
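The pattern filter's behavior can be sketched as a lookup over seed patterns. This is an illustrative reconstruction, not the thesis code: the two pattern lists below are hypothetical stand-ins for the 92 positive and 163 negative manually classified patterns, and the strength rule is an assumption.

```python
import re

# Stand-in seed patterns (hypothetical; the real filter used a manually
# classified list of 92 positive and 163 negative patterns).
POSITIVE_PATTERNS = [r"\bi (admire|like|love)\b", r"\bis (great|exciting)\b"]
NEGATIVE_PATTERNS = [r"\bi hate\b", r"\bis (annoying|my least favorite)\b"]

def match_patterns(sentence):
    """Return (polarity, strength) if any seed pattern fires, else None
    (the sentence is then handed to the next filter in the pipeline)."""
    text = sentence.lower()
    for polarity, patterns in (("positive", POSITIVE_PATTERNS),
                               ("negative", NEGATIVE_PATTERNS)):
        hits = sum(1 for p in patterns if re.search(p, text))
        if hits:
            return polarity, hits  # strength = number of patterns matched
    return None

print(match_patterns("I admire Hillary."))           # ('positive', 1)
print(match_patterns("I want to be like Hillary."))  # None
```

A `None` result corresponds to the "No" branch in the pipeline diagram: the sentence falls through to the naïve Bayes filter.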
      Naïve Bayes Filter - Overview
"The Naïve Bayes classifier is a simple probabilistic classifier based on applying
Bayes' theorem with strong (naïve) independence assumptions."

[Filter pipeline diagram, as on the "Overview of Filters" slide.]
     Naïve Bayes Analysis - Outline
Each document "d" is represented by the document vector

         d = (n1(d), …, nm(d))

where {f1, f2, …, fm} is the set of predefined features and ni(d) is the
number of times feature fi occurs in d.

If a sentence S consists of words Wi, i = 1 … n, then

        Probability of the sentence being positive:
        P(Spos) = (1/n) ∑i Wi,pos / (Wi,neg + Wi,pos + Wi,neu)

        Probability of the sentence being negative:
        P(Sneg) = (1/n) ∑i Wi,neg / (Wi,neg + Wi,pos + Wi,neu)

where Wi,pos, Wi,neg and Wi,neu denote how often word Wi occurs in the
positive, negative and neutral training data.

        "This is a slight modification of the naïve Bayes method."
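Read this way, the scoring rule above can be sketched in a few lines. The word counts below are hypothetical, and the add-one smoothing is an assumption added so that unseen words do not divide by zero:

```python
def sentence_scores(sentence, pos_counts, neg_counts, neu_counts):
    """Average each word's per-class probability, following the modified
    naive Bayes rule on this slide. Counts are per-class word frequencies
    from the training data; add-one smoothing is an assumption."""
    words = sentence.lower().rstrip(".").split()
    pos = neg = 0.0
    for w in words:
        p = pos_counts.get(w, 0) + 1
        n = neg_counts.get(w, 0) + 1
        u = neu_counts.get(w, 0) + 1
        total = p + n + u
        pos += p / total
        neg += n / total
    return pos / len(words), neg / len(words)

p, n = sentence_scores("Hillary is an exciting leader",
                       {"exciting": 8, "leader": 5},   # hypothetical counts
                       {"exciting": 1},
                       {"is": 3, "an": 3})
# p ≈ 0.43, n ≈ 0.19: the positive score outweighs the negative one
```

The returned scores are per-word averages, so they can be compared directly against the threshold discussed on the next slides.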
              Naïve Bayes Analysis – Working model

   Example – Unigram analysis

       "Hillary is an exciting leader."

   Summing each word's probability of occurring in the negative training
   set gives 1.28 / 5 ≈ 0.26 < .6, so the sentence is not indexed as negative.

   Summing each word's probability of occurring in the positive training
   set gives a per-word average of ≈ .6, which reaches the threshold.

   Hence, the sentence is positive.

        Similarly, for bigrams we use two words together instead of one.
 Threshold analysis for the naïve Bayes filter

[Chart: threshold analysis results]

• A threshold of .7 misses many subjective sentences, so it will not capture
  the expected number of subjective sentences.
• A threshold of .5 indexes many sentences, both objective and subjective.
  Indexing unwanted sentences must be avoided, which is why we do not
  choose .5 as our threshold value.
• According to our experimental analysis, the optimal threshold value is .6.
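The threshold choice can be checked empirically with a simple sweep over labeled sentences. This sketch is an assumption about how such a sweep might be run (the thesis reports .6 as optimal for its data; the F1 criterion and toy data here are illustrative):

```python
def pick_threshold(scored, thresholds=(0.5, 0.6, 0.7)):
    """scored: list of (positive_score, is_subjective) pairs.
    Returns the threshold with the best F1 for the subjective class."""
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        tp = sum(1 for s, y in scored if s >= t and y)
        fp = sum(1 for s, y in scored if s >= t and not y)
        fn = sum(1 for s, y in scored if s < t and y)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

labeled = [(0.65, True), (0.62, True), (0.55, False), (0.45, False)]
print(pick_threshold(labeled))  # 0.6 on this toy data
```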
Parts of Speech Filter - Overview

[Filter pipeline diagram, as on the "Overview of Filters" slide.]
   Parts of speech analysis - Outline
"Part-of-speech tagging, also called grammatical tagging, is the process of marking up
the words in a text as corresponding to a particular part of speech, based on both its
definition and its context." - wiki

For example, the sentence

  Mr. Bill Clinton, the former president of the United States, will become
  personal advisor of Hillary, Clinton announced yesterday in New York.

is tagged as:

  Mr.$NNP Bill$NNP Clinton$NNP ,$, the$DT former$JJ president$NN of$IN the$DT
  United$NNP States$NNP ,$, will$MD become$VB personal$JJ advisor$NN of$IN
  Hillary$NN ,$, Clinton$NN announced$VBD yesterday$RB in$IN New$NNP
  York$NNP .$.

  NN    singular or mass noun        JJ    adjective
  NNP   proper noun                  IN    preposition
  DT    singular determiner          VB    verb, base form
                                     VBD   verb, past tense

Working model:
 • The unigrams and bigrams are tagged with parts of speech for
   analysis. [1]
 • Each sentence is passed through, and experiments are carried out
   against the tagged naïve Bayes for analysis.
 • The working is similar to the Naïve Bayes filter.

                                                   [1] www.lingpipe.com
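One way to feed tags into the naïve Bayes filter is to use word$TAG pairs as features, analogous to the tagged example above. This is a hypothetical sketch (the thesis used LingPipe for the tagging itself; the exact feature format is an assumption):

```python
def tagged_features(tagged_sentence):
    """Build word$TAG unigram and bigram features from a pre-tagged
    sentence, in the word$TAG notation shown above."""
    unis = [f"{w}${t}" for w, t in tagged_sentence]
    bis = [f"{a} {b}" for a, b in zip(unis, unis[1:])]
    return unis + bis

feats = tagged_features([("I", "PRP"), ("hate", "VBP"), ("Bush", "NNP")])
print(feats)
# ['I$PRP', 'hate$VBP', 'Bush$NNP', 'I$PRP hate$VBP', 'hate$VBP Bush$NNP']
```

Attaching the tag lets the classifier separate, say, the noun and verb senses of the same surface word.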
     Named Entity – Overview

[Filter pipeline diagram, as on the "Overview of Filters" slide.]
      Named Entity – Overview cont…

Problem:
      "I hate Bush, but I like Obama"

• Current approaches discard such sentences.
• Our solution – named entity detection reduces the score of the sentence.
   • Here the named entity checker returns 2 entities.
   • Whenever more than one entity is returned, the system reduces the
     sentence's score rather than removing it from the search results.
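The score-reduction rule might look like the following sketch; the 0.5 penalty factor is an illustrative assumption, not a value from the thesis:

```python
def damp_score(base_score, entities, penalty=0.5):
    """Reduce the sentence score for each extra named entity, so mixed-
    target sentences like "I hate Bush, but I like Obama" are demoted in
    the results rather than removed. The 0.5 factor is an assumption."""
    if len(entities) > 1:
        return base_score * penalty ** (len(entities) - 1)
    return base_score

print(damp_score(1.0, ["Bush", "Obama"]))  # 0.5
print(damp_score(0.8, ["Bush"]))           # 0.8 (single target, unchanged)
```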
                   Overview
•   Introduction / Motivation
•   Problem Statement & Contribution
•   Related Work
•   Framework
•   Sentiment Filters
•   Search and Trend Analysis
•   Experiments & Results
•   Conclusion & Future Work
        Search and Trend Analysis
Search Analysis

   • Queries are "boosted" to improve result ranking.

Query:
       "George Bush"

Results:
     "George Bush is a great guy"          (terms together – high score)

     "George's last name Bush is …"        (terms within 10 words – medium score)

     "I dislike Bush"
     "I love George"                       (either one of the terms – low score)
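The three boost levels can be sketched as a toy two-term scorer. The real system presumably relied on a search engine's query boosting; the integer weights and window logic here are assumptions for illustration:

```python
def boost_score(term_a, term_b, sentence):
    """Toy two-term scorer mirroring the boost levels above: adjacent
    terms -> 3, terms within a 10-word window -> 2, one term -> 1.
    The integer weights are illustrative assumptions."""
    words = sentence.lower().replace(",", "").rstrip(".").split()
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    if pos_a and pos_b:
        pairs = [(i, j) for i in pos_a for j in pos_b]
        if any(j - i == 1 for i, j in pairs):
            return 3                      # terms together, in query order
        if any(abs(j - i) <= 10 for i, j in pairs):
            return 2                      # terms within ten words
    return 1 if pos_a or pos_b else 0     # only one term present

print(boost_score("George", "Bush", "George Bush is a great guy"))  # 3
print(boost_score("George", "Bush", "I dislike Bush"))              # 1
```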
Search & Trend Analysis – Search screen shots

[Screenshot: Two Panel View]
Search & Trend Analysis – Search screen shots cont…

[Screenshot: Four Panel View]
Search & Trend Analysis – Search screen shots cont…

[Screenshot: Polarity Distribution]
 Search and Trend Analysis cont…
"Top topics are terms that have consistently been a point of
 discussion in the blogosphere." (e.g. Bush, Iraq, Bomb)
    • Terms are computed by analyzing word frequencies in the index.
    • The top 100 English words, dates and numbers are screened out.


"Hot topics are terms that are currently a point of discussion
 in the blogosphere." (e.g. Virginia, Immigration)
   • Computed by employing the K-L divergence:

                Dkl(P||Q) = ∑i P(i) log(P(i)/Q(i))

              Dkl – Kullback-Leibler divergence
       P – observed term distribution    Q – target (background) term distribution
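The K-L divergence computation is straightforward. In this sketch, P is the recent term distribution and Q the long-run background; the example distributions are made up for illustration:

```python
import math

def kl_divergence(p, q):
    """Dkl(P||Q) = sum over terms of P(i) * log(P(i)/Q(i)). Terms whose
    recent frequency (P) jumps relative to the background (Q) contribute
    most, flagging them as hot topics. Skipping zero-probability terms
    is a simplifying assumption."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

recent = {"virginia": 0.4, "bush": 0.3, "iraq": 0.3}
background = {"virginia": 0.1, "bush": 0.5, "iraq": 0.4}
print(round(kl_divergence(recent, background), 3))  # 0.315
```

Here "virginia" dominates the divergence because its recent share (0.4) is four times its background share (0.1).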
Search & Trend Analysis – Search screen shots cont…

[Screenshot: Top Term Analysis]
                   Overview
•   Introduction / Motivation
•   Problem Statement & Contribution
•   Related Work
•   Framework
•   Sentiment Filters
•   Search and Trend Analysis
•   Experiments & Results
•   Conclusion & Future Work
Effect of Pattern Matching Analysis
Pattern Matching analysis

[Chart: pattern matching results per blog]

 Pattern matching does not capture most of the subjective sentences.
Effect of Pattern Matching Analysis Cont…
Problem:
• Bloggers do not write in a formal manner.
• Bloggers generally do not care about grammar, spelling and
  punctuation in their blogs.
• The pattern dataset collected is small [95 pos and 162 neg].

 Examples that caused problems: slang terms
    • "Bush suckz"

 Possible solutions:
 • A spelling checker is one way to improve the results.
 • A larger pattern dataset is required to improve the analysis.
Effect of Pattern Matching Analysis Cont…
 Confusion Matrix

[Confusion matrix table]

•   Accuracy = 58%
•   True Positive Rate (Recall) = 18%
•   False Positive Rate (FP) = 2%
•   True Negative Rate = 98%
•   False Negative Rate (FN) = 82%
•   Precision = 92%
(Positive – subjective, Negative – objective sentence)
      Effect of Naive Bayes Analysis
Unigram analysis

[Chart: "Unigram Analysis" – total number of sentences (0–200) per blog for 20
blogs, showing sentences in blog, subjective sentences, machine-classified
subjective sentences, and sentences the machine classified correctly.]

  •   Unigram captures most of the subjective sentences.
                Unigram vs Patterns

[Chart: "Patterns vs Unigrams" – total sentences (0–80) per blog for 20 blogs,
comparing pattern matching, unigram, and total subjective counts.]

• The graph shows that unigrams perform better than
  pattern matching techniques.
Effect of Naive Bayes Analysis cont…
Confusion matrix (Unigrams)

[Confusion matrix table]

 •   Accuracy = 77%
 •   True Positive Rate (Recall) = 63%
 •   False Positive Rate (FP) = 10%
 •   True Negative Rate = 90%
 •   False Negative Rate (FN) = 37%
 •   Precision (Positive) = 86%
     (Positive – subjective, Negative – objective sentence)
    Effect of Naive Bayes Analysis
Bigram analysis

[Chart: bigram filter results per blog]

        • Bigrams perform better than pattern matching.
        • Bigrams do not perform as well as unigrams.
        • The lack of a domain independent dataset affects the results.
Effect of Naive Bayes Analysis cont…
Confusion Matrix (Bigram)

[Confusion matrix table]

  •   Accuracy = 70%
  •   True Positive Rate (Recall) = 50%
  •   False Positive Rate (FP) = 10%
  •   True Negative Rate = 91%
  •   False Negative Rate (FN) = 50%
  •   Precision (Positive) = 83%
      (Positive – subjective, Negative – objective sentence)
    Effect of Naive Bayes Analysis
Unigram + Bigram analysis

[Chart: "Unigram + Bigram Analysis" – total number of sentences (0–200) per
blog for 20 blogs, showing sentences in blog, subjective sentences,
machine-classified subjective sentences, and sentences the machine classified
correctly.]

     • Results are similar to unigrams alone, which implies that the addition
       of bigrams does not make a significant difference.
Effect of Naive Bayes Analysis cont…
Confusion Matrix (Unigram + Bigram)

[Confusion matrix table]

 •   Accuracy = 77%
 •   True Positive Rate (Recall) = 64%
 •   False Positive Rate (FP) = 10%
 •   True Negative Rate = 90%
 •   False Negative Rate (FN) = 36%
 •   Precision (Positive) = 86%
 (Positive – subjective, Negative – objective sentence)
Effect of Naive Bayes Analysis cont…
Problem:
• We used the movie training dataset [1] along with the custom-
  developed political dataset.


Possible solutions:
• A larger domain-specific dataset should be collected to improve
  this technique.
• Analysis of trigrams would be useful for comparison.


   [1] http://www.cs.cornell.edu/People/pabo/movie-review-data/
       Effect of Parts of Speech Analysis
Parts of Speech analysis

[Chart: parts-of-speech filter results per blog]

        • Parts of speech does not perform as well as
          unigrams.
Effect of Parts of Speech Analysis Cont…
Problem:
• Currently, the training data for this analysis is not blog specific;
  it is collected from news articles, which follow a standard format
  and procedure.


Possible solutions:
 • Develop or obtain a blog-specific training dataset.
 • Combining this filter with others could improve the results.
Effect of Parts of Speech Analysis Cont…
Confusion matrix (Parts of Speech)

[Confusion matrix table]

 •   Accuracy = 73%
 •   True Positive Rate (Recall) = 60%
 •   False Positive Rate (FP) = 13%
 •   True Negative Rate = 87%
 •   False Negative Rate (FN) = 40%
 •   Precision (Positive) = 82%
 (Positive – subjective, Negative – objective sentence)
                         Results

[Summary table of filter results]

• "Unigram" & "Unigram + Bigram" outperform all the other filters.
• Although parts of speech tagging performs well, its precision
  is lower than that of the other filters.
• The pattern matching technique can be improved by obtaining a
  larger dataset, which is a non-trivial task.
                   Overview
•   Introduction / Motivation
•   Problem Statement & Contribution
•   Related Work
•   Framework
•   Sentiment Filters
•   Search and Trend Analysis
•   Experiments & Results
•   Conclusion & Future Work
                      Conclusion
• Analyzed the sentence level classification of sentiments.

• Focused on pattern matching, naïve Bayes and parts of
  speech filters for opinion classification.

• Analyzed and presented the performance for each sentiment
  filter.

• Developed a robust framework which is domain independent.

• Developed two different prototype applications (two-panel
  view & four-panel view).
                 Future Work

• Incorporate other filters (such as SVMs) and add
  stemming and a spelling checker.
• Identify ways to deal with sarcastic sentences.
• Negations need to be captured.
• Developing and improving the dataset should
  significantly improve the results.
                Acknowledgements

• Lada A. Adamic for her datasets.

• Tailrank & twitter for their dataset.

• Justin Martineau.

• Akshay Java and Pranam Kolari for their help in
  compiling the datasets.

• Alark Joshi

Thank You !!



• Questions?




                Experimental Information

Confusion Matrix:
 "A confusion matrix [1] contains information about actual and
 predicted classifications done by a classification system.
 Performance of such systems is commonly evaluated using
 the data in the matrix." [2]


 [1] Provost, F., & Kohavi, R. (1998). On applied research in machine learning. Machine Learning, 30, 127-132.

 [2] http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/confusion_matrix.html
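The metrics quoted throughout the results slides follow directly from the four confusion-matrix counts. A small helper (the counts below are illustrative, chosen to roughly match the unigram slide's percentages, not the thesis's actual test counts):

```python
def confusion_metrics(tp, fp, fn, tn):
    """Derive the metrics quoted in the results slides from the four
    confusion-matrix counts (positive = subjective sentence)."""
    return {
        "accuracy":  (tp + tn) / (tp + fp + fn + tn),
        "recall":    tp / (tp + fn),   # true positive rate
        "fp_rate":   fp / (fp + tn),   # false positive rate
        "tn_rate":   tn / (fp + tn),
        "fn_rate":   fn / (tp + fn),
        "precision": tp / (tp + fp),
    }

# Illustrative counts (not the thesis's actual test counts).
m = confusion_metrics(tp=63, fp=10, fn=37, tn=90)
print(m["recall"], m["precision"])
```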

								