					Sentiment Analysis
        Balamurali A R
IITB-Monash Research Academy
     {balamurali@cse.iitb.ac.in}

    Acknowledgment: Aditya Joshi
                                                     Image from wikimedia commons
                                                     Source: Wikipedia




                            Smile of Mona Lisa

                            Is she smiling at all?
                                Is she happy?


                            What is she smiling about?
                              What is she happy about?




     Mona Lisa
     16th century
Artist: Leonardo da Vinci
      Sentiment analysis (SA)
The task of tagging text with the orientation of the opinion it expresses
• Subjective: “This is a good movie.” / “This is a bad movie.”
• Objective: “The movie is set in Australia.”
  Disciplines which form the core of AI: inner circle
  Fields which draw from these disciplines: outer circle

  • Inner circle: Search, Reasoning, Learning
  • Outer circle: Robotics, NLP (including Sentiment Analysis), Expert Systems, Planning, Computer Vision
                       Outline
Motivation & Introduction        Applications




    Approaches to SA             SA @ CFILT
                              Outline
    Motivation & Introduction           Applications

Need of SA: Why is SA needed?
Variants of SA: What forms does it
exist in?
Challenges of SA: Why is SA not
trivial?


         Approaches to SA               SA @ CFILT
       User-generated content
• Web 2.0 empowers the user of the internet

• They are most likely to express their opinion
  there

• Temporal nature of UGC: ‘Live Web’
• Can SA tap it?
                 SA: Where?
•   Blogs: A website, usually maintained by an individual, with regular
    entries of commentary and descriptions of events.
    Some SPs: Blogger, LiveJournal, Wordpress
•   Review websites: Multiple review websites offering specific to
    general-topic reviews.
    Some SPs: mouthshut, burrrp, bollywoodhungama
•   Social networks: Websites that allow people to connect with one
    another and exchange thoughts
•   User conversations: Conversations between users on one of the above
                                  Reference : www.technorati.com/state-of-the-blogosphere/




               SA: How much?
• Size of blogosphere
   – Through the ‘eyes’ of the blog trackers

• Technorati : 112.8 million blogs (excluding 72.82
  million blogs in Chinese as counted by a
  corresponding Chinese Center)
• A blog crawler could extract 88 million blog URLs
  from blogger.com alone
• 12,000 new weblogs daily
SA: How much opinion?




      Chart created using : www.technorati.com/chart/
               Flavours of SA
•   Subjective/Objective
      “Taj Mahal was constructed by Shah Jahan in the memory of his wife Mumtaz.”
      “Taj Mahal is a masterpiece of an architecture and symbolizes unparalleled beauty.”
•   Emotion analysis
      “dude.. just get lost.”
      “Whoa! Super!!”
•   SA with magnitude
      “The movie is good.”
      “This movie is awesome.”
•   Entity-specific SA
      “India defeated England badly in the cricket match yesterday.”
•   Feature-based SA
      “The camera is the best in its price range. However, a pathetically slow
      interface ruins it for this cell phone.”
•   Perspectivization
      “The Leftists were arrested by the police.”
      “People say that the movie is good.”
             Challenges of SA
•   Domain dependent
      Sentiment of a word is w.r.t. the domain.
      Example: ‘unpredictable’, for steering of a car versus for a movie review
•   Sarcasm
      Sarcasm uses words of a polarity to represent another polarity.
      Example: “The perfume is so amazing that I suggest you wear it
      with your windows shut”
•   Thwarted expressions
      The sentences/words that contradict the overall sentiment of the set are in majority.
      Example: “The actors are good, the music is brilliant and appealing.
      Yet, the movie fails to strike a chord.”
•   Negation
      “I did not like the movie.”
      “Not only is the movie boring, it is also the biggest waste of producer’s money.”
      “Not withstanding the pressure of the public, let me admit that I loved the movie.”
•   Implicit polarity
      “The camera of the mobile phone is less than one mega-pixel – quite
      uncommon for a phone of today.”
      “This phone allows me to send SMS.”
•   Time-bounded
      “This phone has a touch-screen.”
        SA Challenges: Sample Review 1
                            (This, that and this)

FLY E300 is a good mobile which i purchased recently with lots of hesitation. Since this Brand
is not familiar in Market as well known as Sony Ericsson. But i found that E300 was cheap
with almost all the features for a good mobile. Any other brand with the same set of features
would come around 19k Indian Ruppees.. But this one is only 9k.
Touch Screen, good resolution, good talk time, 3.2Mega Pixel camera, A2DP, IRDA and so on...

BUT BEWARE THAT THE CAMERA IS NOT THAT GOOD, THOUGH IT FEATURES 3.2 MEGA PIXEL,
ITS NOT AS GOOD AS MY PREVIOUS MOBILE SONY ERICSSION K750i which is just 2Mega
Pixel.
Sony ericsson was excellent with the feature of camera. So if anyone is thinking for Camera,
please excuse. This model of FLY is not apt for you.. Am fooled in this regard..
Audio is not bad, infact better than Sony Ericsson K750i.
FLY is not user friendly probably since we have just started to use this Brand.

Observations:
• ‘Touch screen’ today signifies a positive feature. Will it be the same in the future?
• Comparing old products
• The confused conclusion




                                               From: www.mouthshut.com
      SA Challenges: Sample Review 2
Hi,
    I have Haier phone.. It was good when i was buing this
  phone.. But I invented A lot of bad features by this phone
  those are It’s cost is low but Software is not good and Battery
  is very bad..,,Ther are no signals at out side of the city..,,
  People can’t understand this type of software..,, There aren’t
  better features in this phone, Design is not good..,, Sound also
  bad..So I’m not intrest this side.They are giving heare phones
  it is good. They are giving more talktime and validity these
  are also good.They are giving colour screen at display time it
  is also good because other phones aren’t this type of
  feature.It is also low wait.

Observations:
• Lack of punctuation marks, Grammatical errors
• Wait.. err.. Come again

                                  From: www.mouthshut.com
                                 Outline
     Motivation & Introduction             Applications

Need of SA: Why is SA needed?
Variants of SA: What forms does it
exist in?
Challenges of SA: Why is SA not
trivial?


          Approaches to SA                 SA @ CFILT

Basics: What is classification

Approaches: What are ways to do SA
               Task Definition
• Marking reviews as positive or negative at the
  document level
  – Lexicon-based classifiers
  – ML-based classifiers
                  What is classification?
 A machine learning task that deals with identifying the class to which an
 instance belongs

 A classifier performs classification

 Test instance: attributes (a1, a2,… an) -> Classifier -> Discrete-valued class label

 Example test instances and class labels:
 • ( Age, Marital status, Health status, Salary ) -> Issue Loan? {Yes, No}
 • ( Perceptive inputs ) -> Steer? { Left, Straight, Right }
 • ( Textual features : Ngrams ) -> Category of document? {Politics, Science, Biology}

 Example classifiers: SVM Classifier, MaxEnt Classifier, Naïve Bayes Classifier, Combination
        Classification learning

• Training phase: learning the classifier from the available data,
  the ‘Training set’ (labeled)
• Testing phase: testing how well the classifier performs on the
  ‘Testing set’
                 Testing phase
Methods:
  – Holdout (2/3rd training, 1/3rd testing)
  – Cross validation (n – fold)
     • Divide into n parts
     • Train on (n-1) parts, test on the remaining part
     • Repeat so that a different part is held out each time
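
The two testing methods above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the experiments reported later; the function names `holdout_split` and `n_fold_splits` are made up for this example, and in practice the data would be shuffled before splitting.

```python
def holdout_split(instances):
    """Holdout: first 2/3rd of the data for training, last 1/3rd for testing."""
    cut = (2 * len(instances)) // 3
    return instances[:cut], instances[cut:]


def n_fold_splits(instances, n):
    """Cross validation: yield (train, test) pairs, one per held-out part."""
    # Part i takes every n-th instance starting at position i.
    folds = [instances[i::n] for i in range(n)]
    for held_out in range(n):
        test = folds[held_out]
        train = [x for i, fold in enumerate(folds) if i != held_out for x in fold]
        yield train, test


data = list(range(9))
train, test = holdout_split(data)
splits = list(n_fold_splits(data, 3))
```

Every instance is tested exactly once across the n folds, so the n accuracies can be averaged into one estimate.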
Approaches to SA and Text granularity
 Based on text granularity
 • Document level
 • Sentence level
 • Phrase level
 • Word level
            ………..Approaches to SA will differ
       Generic Approaches
 Document level sentiment classifier

 SA
 • Machine Learning based
   – Supervised
   – Unsupervised
 • Rule based
   – Lexicon based
                                 Outline
     Motivation & Introduction             Applications

Need of SA: Why is SA needed?
Variants of SA: What forms does it
exist in?
Challenges of SA: Why is SA not
trivial?


          Approaches to SA                 SA @ CFILT

Basics: What is classification

Approaches: What are ways to do SA
      1) Rule based
      2) Machine Learning based
Rule based System: Resources for SA
SentiWordNet
  – WordNet synsets marked with three types of
    scores: positive, negative, objective

               I am feeling happy.
           Seed-set expansion in SWN




Seed words: Lp (positive seed set) and Ln (negative seed set)

The sets at the end of the kth step are called Tr(k,p) and Tr(k,n).
Tr(k,o) is the set of synsets present in neither Tr(k,p) nor Tr(k,n).
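
A toy sketch of k-step seed-set expansion follows. The slide does not say which lexical relations drive the expansion, so the two relation tables here (`SAME_POLARITY`, `OPPOSITE_POLARITY`) and the small vocabulary are hypothetical stand-ins for WordNet relations, and plain words stand in for synsets.

```python
# Hypothetical relation tables; a real system would read these from WordNet.
SAME_POLARITY = {"good": {"nice"}, "nice": {"pleasant"}, "bad": {"awful"}}
OPPOSITE_POLARITY = {"good": {"bad"}, "nice": {"nasty"}}
VOCABULARY = {"good", "nice", "pleasant", "bad", "awful", "nasty", "table"}


def expand_seeds(lp, ln, k):
    """Return (Tr(k,p), Tr(k,n)) after k expansion steps from seeds Lp and Ln."""
    pos, neg = set(lp), set(ln)
    for _ in range(k):
        new_pos, new_neg = set(pos), set(neg)
        for term in pos:
            new_pos |= SAME_POLARITY.get(term, set())
            new_neg |= OPPOSITE_POLARITY.get(term, set())
        for term in neg:
            new_neg |= SAME_POLARITY.get(term, set())
            new_pos |= OPPOSITE_POLARITY.get(term, set())
        pos, neg = new_pos, new_neg
    return pos, neg


tr_p, tr_n = expand_seeds({"good"}, {"bad"}, k=2)
# Tr(k,o): everything that ended up in neither expanded set.
tr_o = VOCABULARY - tr_p - tr_n
```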
          Building SentiWordnet
• Classifier alternatives used: Rocchio (BowPackage) and
  SVM (LibSVM)
      • Different training data based on expansion
      • POS-NOPOS and NEG-NONEG classification

• Total of eight classifiers
   – For different combinations of k and classifiers

• Synsets not in the expanded seed set are used
  as test synsets
   – Score is the average of the scores returned by the classifiers
  Rule based System: An Example
C-FEEL-IT
   An entity-based opinion search engine on Twitter

How does it work?
1. The user enters a search string to get its “public vibe”
2. Tweets are fetched based on the search string
3. Based on a sentiment lexicon, each tweet is marked with a
   sentiment score using the majority rule
4. Each tweet is categorized into a sentiment category using a
   threshold value
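
Steps 3 and 4 can be sketched as below. This is one possible reading of "majority rule" (positive lexicon hits minus negative hits), not the actual C-Feel-It implementation; the tiny `POSITIVE` and `NEGATIVE` word lists and the default threshold of 1 are assumptions made for the example.

```python
from collections import Counter

# Hypothetical miniature lexicon; a real system would load a full resource.
POSITIVE = {"good", "great", "awesome", "love"}
NEGATIVE = {"bad", "boring", "hate", "pathetic"}


def tweet_score(tweet):
    """Majority rule: count of positive words minus count of negative words."""
    words = tweet.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg


def categorize(score, threshold=1):
    """Map a score to a sentiment category using a threshold value."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


tweets = [
    "this movie is awesome",
    "boring and bad plot",
    "the movie is set in australia",
]
public_vibe = Counter(categorize(tweet_score(t)) for t in tweets)
```

Raising the threshold moves weakly opinionated tweets into the neutral category.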
C-FeeL-IT: Preprocessing and heuristics
• Feeds from Twitter are used to obtain 50 tweets:
    –   In English
    –   About the keyword
•   Normalization is done using:
    –   Mapping between chat lingo and dictionary words [1]
    –   Mapping between emoticons and direct sentiment
        prediction [1]
    –   Extensions of words replaced by contracted forms
    –   Negation handling

[1] http://chat.reichards.net/
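
The four normalization heuristics might look roughly like this in Python. The mapping tables are small hypothetical samples, and marking the word after a negation with a `NOT_` prefix is only one common way to handle negation; the slides do not say how C-Feel-It does it.

```python
import re

# Hypothetical sample mappings; the real tables would be much larger.
CHAT_LINGO = {"gr8": "great", "u": "you", "luv": "love"}
EMOTICONS = {":)": "positive", ":(": "negative"}
NEGATIONS = {"not", "no", "never"}


def contract_extension(word):
    """Shorten any run of 3+ identical characters to two characters."""
    return re.sub(r"(.)\1{2,}", r"\1\1", word)


def normalize(tweet):
    """Return (normalized tokens, sentiment hints taken from emoticons)."""
    tokens, emoticon_hints = [], []
    negate = False
    for raw in tweet.split():
        if raw in EMOTICONS:
            emoticon_hints.append(EMOTICONS[raw])
            continue
        word = contract_extension(raw.lower())
        word = CHAT_LINGO.get(word, word)
        if word in NEGATIONS:
            negate = True  # flag the next word as negated
            continue
        tokens.append("NOT_" + word if negate else word)
        negate = False
    return tokens, emoticon_hints


tokens, hints = normalize("u r not goooood :(")
```

Contracting to two characters does not always yield a dictionary word ('happppyyyyy' becomes 'happyy'), so a spelling lookup would still be needed afterwards.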
                C-FeeL-IT: Resources used
             • SentiWordNet (Esuli & Sebastiani, 2006)
             • Subjectivity clues (Wiebe et al., 2004)
             • Taboada (Taboada & Grieve, 2004)
             • Inquirer (Stone et al., 1966)




Reference given in Notes section
            C-FeeL-IT: Demo
• Available at:
  http://www.clia.iitb.ac.in:8080/cfeelit-2/
                  Supervised System


Training Data -> Feature Engineering -> Learner -> Apply on Test Data -> Evaluate

Things to consider:
  1. Select suitable domain: product, travel, politics, movies
  2. Select the text granularity

• Popular features: Term presence/term frequency, unigram/bigram, adjectives
• Learners: SVM, Naïve Bayes, MaxEnt, Ensemble etc
• Evaluation metrics: Accuracy, Recall, Precision, F-Score etc
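
To make the pipeline concrete, here is a minimal sketch with unigram term-presence features and a Naïve Bayes learner with add-one smoothing. The four training sentences and two test sentences are invented for illustration, and the learner is a plain textbook version, not the system described in the following slides.

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Feature engineering: unigram term presence."""
    return set(text.lower().split())


def train(labeled_docs):
    """Learner: collect per-class document and word counts."""
    class_docs, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for text, label in labeled_docs:
        class_docs[label] += 1
        for w in features(text):
            word_counts[label][w] += 1
            vocab.add(w)
    return class_docs, word_counts, vocab


def predict(model, text):
    """Apply on test data: pick the class with the highest log-probability."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_logp = None, -math.inf
    for label in class_docs:
        logp = math.log(class_docs[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in features(text):
            if w in vocab:  # ignore words never seen in training
                logp += math.log((word_counts[label][w] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


train_set = [("a good movie", "pos"), ("brilliant and good acting", "pos"),
             ("a bad movie", "neg"), ("boring and bad plot", "neg")]
test_set = [("good acting", "pos"), ("bad boring movie", "neg")]

model = train(train_set)
# Evaluate: fraction of test documents labeled correctly.
accuracy = sum(predict(model, t) == y for t, y in test_set) / len(test_set)
```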
    Supervised System: Our System
Existing approaches do not consider the
  ‘sense/meaning’ of the word:
  • Bag-of-words features - Pang et al. (2002), Martineau & Finn (2009), Paltoglou
    & Thelwall (2010)
  • Syntactic features - Matsumoto et al. (2005), Kennedy & Inkpen (2006), Whitelaw et
    al. (2005)
However, a word may have:
1   sentiment bearing and non sentiment bearing senses
       “Her face fell when she heard that she had been fired.”
       “The fruit fell from the tree.”
2   senses with opposing polarity
       “The snake bite proved to be deadly for the young boy.”
       “Shane Warne is a deadly spinner.”
3   been abstracted
       “He speaks a vulgar language.”
       “Now that's real crude behavior!”
                                                                 Image source: Wikimedia commons




  Supervised System: Our System
Lexical space v/s sense space

   Lexical Space:
     There are also fire-pits available if you want to have a bonfire with your friends .

   Sense Space (after Word Sense Disambiguation, WSD):
     There are also_347757 fire_pits_19147259 available_4203394 if you want_21808093 to have
     a bonfire_17203241 with your friends_19962226 .

   Senses are either manually annotated (M) or automatically annotated (I).

   fire_pits : 19147259
     (1: POS identifier : Noun, 9147259: Wordnet Synset offset)
                                                                 Image source: Wikimedia commons
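
A small sketch of how sense-annotated tokens could be turned into the three feature representations (words only, senses only, words plus senses). The token format `word_<POS digit><synset offset>` is read off the example above; the `represent` function, its mode names, and the choice to drop unannotated tokens in the sense-only mode are assumptions for illustration.

```python
import re

# word, then "_", then one POS digit followed by the synset offset digits
SENSE_TOKEN = re.compile(r"^(.+)_(\d)(\d+)$")


def parse(token):
    """Split an annotated token into (word, sense id); sense id is None if absent."""
    match = SENSE_TOKEN.match(token)
    if match is None:
        return token, None
    word, pos_id, offset = match.groups()
    return word, pos_id + offset


def represent(tokens, mode):
    """mode 'W': words only, 'S': senses only, 'W+S': both."""
    feats = []
    for token in tokens:
        word, sense = parse(token)
        if mode in ("W", "W+S"):
            feats.append(word)
        if mode in ("S", "W+S") and sense is not None:
            feats.append(sense)
    return feats


sentence = ("There are also_347757 fire_pits_19147259 available_4203394 if you "
            "want_21808093 to have a bonfire_17203241 with your friends_19962226 .")
tokens = sentence.split()
sense_only = represent(tokens, "S")
```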




    Supervised System: Our Approach
Corpus
  • Classifier Training -> Classifier W
  • Manual Annotation -> Manual Sense-annotated Corpus
      – Only-sense Filter -> Classifier Training -> Classifier M
      – Classifier Training -> Classifier W+S(M)
  • A WSD Engine -> Automatic Sense-annotated Corpus
      – Only-sense Filter -> Classifier Training -> Classifier I
      – Classifier Training -> Classifier W+S(I)
                                                                   Image source: Wikimedia commons




          Results: Overall Classification
Feature
Representation   Accuracy   Pos Precision   Neg Precision   Pos Recall   Neg Recall
W                84.90      84.95           84.92           85.19        84.60
M                89.10      91.50           87.07           85.18        91.24
W+S(M)           90.20      92.02           88.55           87.71        92.39
I                85.48      87.17           83.93           83.53        87.46
W+S(I)           86.08      85.87           86.38           86.69        85.46


    • Senses give better overall accuracy
    • Negative Recall increases
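
For reference, the columns of the table can be computed from gold and predicted labels as sketched below. The five-document example is invented and unrelated to the reported numbers; `evaluate` is a name chosen for this illustration.

```python
def evaluate(gold, predicted, labels=("pos", "neg")):
    """Return accuracy plus per-class precision and recall, in percent."""
    pairs = list(zip(gold, predicted))
    report = {"accuracy": 100.0 * sum(g == p for g, p in pairs) / len(pairs)}
    for label in labels:
        true_pos = sum(g == label and p == label for g, p in pairs)
        predicted_as = sum(p == label for p in predicted)
        actually = sum(g == label for g in gold)
        # Precision: of the documents labeled `label`, how many were right.
        report[label + "_precision"] = (
            100.0 * true_pos / predicted_as if predicted_as else 0.0)
        # Recall: of the documents that are `label`, how many were found.
        report[label + "_recall"] = (
            100.0 * true_pos / actually if actually else 0.0)
    return report


gold = ["pos", "pos", "pos", "neg", "neg"]
predicted = ["pos", "pos", "neg", "neg", "pos"]
report = evaluate(gold, predicted)
```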
                                 Outline
     Motivation & Introduction                   Applications

Need of SA: Why is SA needed?          Cross-lingual SA
                                       Cross-domain SA
Variants of SA: What forms does it     Opinion Spam
exist in?
                                       SA for tweets
Challenges of SA: Why is SA not
trivial?


          Approaches to SA                        SA @ CFILT

Basics: What is classification

Approaches: What are ways to do SA
           1) Machine Learning based
           2) Rule based
           Cross-lingual SA
• Multilingual content on the internet is growing
• How can the sentiment it carries be identified?
• Can we take help of the ‘rich cousin’ English?

  Hindi / English document -> Sentiment Analysis System -> Sentiment Label
Alternatives to Cross-lingual SA

Strategies for SA for target language:
• Use corpus in target language
• Translate to a ‘rich’ source language
• Develop resources for target language
   Domain-dependence of words
• ‘deadly’
  – It was one deadly match!
  – There are some deadly poisonous snakes in the
    jungles of Amazon.
           General Approach
• Retain the ‘common-to-all-domain’ words
• Learn only the ‘special domain’ words



• Domain differences can be substantial
  Opinion spam: A side-effect of UGC
• Reviews contain rich user opinions on products
  and services
• Anyone can write anything on the Web
  – No quality control
• Result
  – Low quality reviews, review spam / opinion spam
• Incentives
  – Positive opinion -> Financial gain for organization
                                                            Reference : [Jindal et al, 2008]




   Different types of spam reviews
• Type 1 (untruthful opinions)
  – Giving undeserving reviews to some target objects in order
    to promote/demote the object
  – hyper spam - undeserving positive reviews
  – defaming spam - malicious negative reviews
  – DUPLICATES
• Type 2 (reviews on brands only)
  – No comment on the product
  – Comments on brands, manufacturer or sellers of the product
  – “It’s from nikon, what more you want..”
  – “Although you should not expect prompt shippin.
    (It took 3 weeks and several e-mails before I received my order.)
    I would order again from this merchant,
    just because the price was right” - http://www.pricegrabber.com
• Type 3 (non-reviews)
  – Advertisements
  – Other irrelevant reviews containing no opinions,
    e.g. questions, answers and random text
        Challenges with tweets
• Ill-formed
  – Spelling mistakes
  – Informal words/emoticons
  – Extensions of words (‘happppyyyyy’)


• Vague topics
                                 Outline
     Motivation & Introduction                   Applications

Need of SA: Why is SA needed?          Cross-lingual SA
                                       Cross-domain SA
Variants of SA: What forms does it     Opinion Spam
exist in?
                                       SA for tweets
Challenges of SA: Why is SA not
trivial?


          Approaches to SA                        SA @ CFILT
                                       Twitter based SA, Sense based SA,
Basics: What is classification
                                       Cross-Lingual SA and many
                                       more…..
Approaches: What are ways to do SA
           1) Machine Learning based
           2) Rule based
SA @ CFILT
• English
  – Twitter: Trend Analysis
  – Cross-Domain
  – Sense-based: Similarity Metric
  – Discourse
  – Detecting Thwarting
• Other Languages
  – Indian Languages: Cross Lingual (Hindi, Marathi)
  – European Languages: Cross Lingual (French, Spanish, German)
Thank you!
    &
Questions?
       Extra Reading- Classifiers
• Naïve Bayes
• SVM
• Committee-based classifiers
         Naïve Bayes classifiers
• Based on Bayes rule
• Naïve Bayes : Conditional independence assumption
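
Written out for an instance with attributes (a1, a2,… an) and class label c, the two bullets above correspond to the standard formulation below. The symbols c and c* are notation chosen here, not taken from the slides.

```latex
% Bayes rule
P(c \mid a_1, \dots, a_n) = \frac{P(a_1, \dots, a_n \mid c)\, P(c)}{P(a_1, \dots, a_n)}

% Conditional independence assumption
P(a_1, \dots, a_n \mid c) = \prod_{i=1}^{n} P(a_i \mid c)

% Resulting decision rule (the denominator is the same for every class)
c^{*} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(a_i \mid c)
```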
      Support vector machines
• Basic idea: “Maximum separating-margin classifier”
  – Margin
  – Support vectors
  – Separating hyperplane : wx+b = 0
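
For linearly separable data, the maximum-margin idea is usually stated as the optimization problem below. This is the standard hard-margin form, assuming labels y_i in {-1, +1}; the slides give only the hyperplane equation.

```latex
% Separating hyperplane
w \cdot x + b = 0

% Every training instance (x_i, y_i) must lie on the correct side, outside the margin
y_i \,(w \cdot x_i + b) \ge 1, \qquad y_i \in \{-1, +1\}

% The margin width is 2 / \lVert w \rVert, so maximizing it is equivalent to
\min_{w,\, b} \; \tfrac{1}{2} \lVert w \rVert^{2}
\quad \text{subject to} \quad y_i \,(w \cdot x_i + b) \ge 1 \;\; \forall i
```

The support vectors are the training instances for which the constraint holds with equality.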
              Multi-class SVM
• Multiple SVMs are trained:
  – True/false classifiers for each of the class labels
  – Pair-wise classifiers for the class labels
                                     Reference : Scribe by Rahul Gupta, IIT Bombay
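
The two multi-class schemes can be contrasted with a short sketch. The binary classifiers are replaced here by hypothetical stand-in functions that read a precomputed score per label; only the way their outputs are combined is the point of the example.

```python
from itertools import combinations

LABELS = ["Politics", "Science", "Biology"]

# Stand-ins for trained binary SVMs (hypothetical): x is a dict of label scores.
one_vs_rest_scorers = {label: (lambda x, label=label: x[label]) for label in LABELS}
pairwise_deciders = {
    (a, b): (lambda x, a=a, b=b: a if x[a] >= x[b] else b)
    for a, b in combinations(LABELS, 2)
}


def predict_one_vs_rest(x):
    """One true/false classifier per label: pick the most confident one."""
    return max(LABELS, key=lambda label: one_vs_rest_scorers[label](x))


def predict_pairwise(x):
    """One classifier per pair of labels: each casts a vote, most votes wins."""
    votes = {label: 0 for label in LABELS}
    for pair, decide in pairwise_deciders.items():
        votes[decide(x)] += 1
    return max(LABELS, key=lambda label: votes[label])


x = {"Politics": 0.2, "Science": 0.7, "Biology": 0.1}
```

With k class labels, the first scheme trains k classifiers and the second trains k(k-1)/2.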




         Combining Classifiers
• ‘Ensemble’ learning
• Use a combination of models for prediction
  – Bagging : Majority votes
  – Boosting : Attention to the ‘weak’ instances
• Goal : An improved combined model
                                                             Reference : Scribe by Rahul Gupta, IIT Bombay
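
A minimal sketch of bagging with majority votes. The three "models" are hypothetical threshold rules standing in for classifiers trained on different bootstrap samples; `bootstrap_sample` shows how such samples would be drawn.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) instances from data, with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Return the label predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]


def bagging_predict(models, x):
    """Combine the models' predictions for x by majority vote."""
    return majority_vote([model(x) for model in models])


# Hypothetical stand-ins for classifiers trained on different samples.
models = [
    lambda x: "pos" if x > 0 else "neg",
    lambda x: "pos" if x > 1 else "neg",
    lambda x: "pos" if x > -1 else "neg",
]

sample = bootstrap_sample([1, 2, 3, 4, 5], random.Random(0))
```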




                     Boosting (AdaBoost)
1. Training dataset D (total set): initialize weights of instances to 1/d
2. Sample D1: selection based on weight. May use bootstrap
   sampling with replacement
3. Classifier learning scheme -> classifier model M1 -> error. If error > 0.5?
4. Weights of correctly classified instances multiplied
   by error / (1 – error)
5. Repeat up to classifier model Mn
6. Test set: weighted vote of the classifier models -> class label
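
The weight update can be traced on a four-instance example. The multiplication by error / (1 – error) comes from the slide; renormalizing the weights afterwards and using log((1 – error) / error) as a model's vote weight are the usual AdaBoost choices and are assumptions here, since the slide does not spell them out.

```python
import math


def weighted_error(weights, correct):
    """Error of a model: total weight of the misclassified instances."""
    return sum(w for w, ok in zip(weights, correct) if not ok)


def update_weights(weights, correct, error):
    """Shrink the weights of correctly classified instances, then renormalize."""
    factor = error / (1.0 - error)
    updated = [w * factor if ok else w for w, ok in zip(weights, correct)]
    total = sum(updated)  # renormalization step (assumed, see lead-in)
    return [w / total for w in updated]


def vote_weight(error):
    """Weight of a model's vote at test time (assumed form, see lead-in)."""
    return math.log((1.0 - error) / error)


d = 4
weights = [1.0 / d] * d              # initialize weights of instances to 1/d
correct = [True, True, True, False]  # the model misclassifies the last instance
error = weighted_error(weights, correct)
new_weights = update_weights(weights, correct, error)
```

After the update the misclassified instance carries half of the total weight, so the next sample is more likely to contain it.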

				