Sentiment…Human Intelligence

					Sentiment Analysis


    Sivaji Bandyopadhyay


    Jadavpur University
       Kolkata, India


                    Overview
         Sentiment Analysis is a multifaceted problem
   Sentiment Knowledge Acquisition

   Sentiment / Subjectivity Detection

   Sentiment Polarity Detection

   Sentiment Structurization

   Sentiment Summary
    Sentiment Knowledge Acquisition
      Involving Human Intelligence

   Prior Polarity Sentiment Lexicon
   Automatic Computational Processes
     WordNet
     Dictionary Based
     Antonym
     Involving Human Intelligence
         Dr Sentiment
     Cross Lingual Projection of Sentiment Lexicons
         Sentiment Analysis
                    IITH is a very good institution.

  I love Hyderabad, the city is famous for its Biriyani, Pearl and old
                        Mughal architecture!

               Summer in Hyderabad is too scorching.

Sentiment Analysis
   Sentiment Detection
   Sentiment Classification



Andrea Esuli and Fabrizio Sebastiani. SentiWordNet: A publicly
available lexical resource for opinion mining. In Proceedings of
Language Resources and Evaluation (LREC), 2006.

What is SentiWordNet?
       Prior Polarity Lexicon


     POS     Offset    Positivity   Negativity     Synset

Adjective    1006361     0.875         0.0         happy

    Noun     4466580     0.375         0.0       friendliness

    Adverb   214589      0.625        0.125        sharply

     Verb    2471993      0.0         0.125        shame


     Prior Polarity Lexicon

Sentiment Bearing Words:        love, hate, good, favorite




Challenges for Polarity Identification:

Context Information (Pang et al., 2002)
Domain Pragmatic Knowledge (Aue and Gamon, 2005)
Time Dimension (Read, 2005)
Language/Culture Properties (Wiebe and Mihalcea, 2006)



                            Continue….(2)


        Prior Polarity Lexicon
Context Information
   I prefer a limousine as it is longer than a Mercedes.
   Avoid long baggage during an excursion in the Amazon.

Language/Culture Properties
   सेहरा (Sehra: a marriage-wear of India)
   দুর্গাপূজা (Durgapujo: a festival of Bengal)

Domain Pragmatic Knowledge
  The Sensex goes high.
  The price goes high.

Time Dimension
   During the 90s, mobile phone users' online reviews generally
   centered on their color phones, but in recent times a color phone
   is not enough: people are fascinated and influenced by touch
   screens and the software installation facilities of these
   new-generation gadgets.
                          Continue….(3)


       Prior Polarity Lexicon

Suppose the total number of occurrences of the word "long" in a domain
corpus is n, and its positive and negative occurrences are Sp and Sn
respectively.

Then, in the developed sentiment lexicon, the positivity and negativity
scores assigned to that word are:

Positivity     :        Sp / n
Negativity     :        Sn / n

These associated positive and negative scores are called the prior
polarity.
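As a minimal sketch, this scoring can be computed as follows; the counts for the word "long" are hypothetical:

```python
def prior_polarity(n, sp, sn):
    """Prior polarity of a word from domain-corpus counts.

    n  -- total occurrences of the word in the corpus
    sp -- number of positive occurrences
    sn -- number of negative occurrences
    """
    if n == 0:
        return (0.0, 0.0)   # unseen word: no prior
    return (sp / n, sn / n)

# Hypothetical counts for the word "long":
positivity, negativity = prior_polarity(200, 120, 50)
# positivity == 0.6, negativity == 0.25
```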




  Source Lexicon Acquisition
Available Resources for English
  SentiWordNet (Esuli et al., 2006)
         SentiWordNet is an automatically constructed lexical
          resource for English that assigns a positivity score and a
          negativity score to each WordNet synset.
  WordNet Affect List (Strapparava et al., 2004)
         WordNet synsets tagged with six basic emotions:
          anger, disgust, fear, joy, sadness, surprise.
  Taboada’s Adjective List (Voll et al., 2006)
         An automatically constructed adjective list with
          positivity and negativity polarity assignment.
  Subjectivity Word List (Wilson et al., 2005)
         The entries in the subjectivity word list have been
          manually labeled with part of speech (POS) tags as well
          as either strong or weak subjective tag depending on
          the reliability of the subjective nature of the entry.

                          Continue….(1)

Source Language Acquisition
 Chosen Source Lexicon Resources

 SentiWordNet
   SentiWordNet is the most widely used resource in applications
     such as sentiment analysis, opinion mining and emotion
     analysis.


 Subjectivity Word List (Wilson et al., 2005)
    The Subjectivity Word List is the most trusted resource, as the
     opinion mining system OpinionFinder, which uses it, has
     reported the highest scores for opinion/sentiment
     subjectivity (Wiebe and Riloff, 2006) (Das and
     Bandyopadhyay, 2010)



                         Continue….(2)

Source Language Acquisition
 Noise-Reduction

    A merged sentiment lexicon has been developed from both
   resources by removing the duplicates.

    It has been observed that 64% of the single-word entries are
   common to the Subjectivity Word List and SentiWordNet.

    The new merged sentiment lexicon consists of 14,135 tokens.

    Several filtering techniques have been applied to generate
   the new list.
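A minimal sketch of the duplicate-removing merge, assuming each lexicon maps a word to a (positivity, negativity) pair; keeping the SentiWordNet score for duplicates is an assumption of this sketch, not the slide's stated filtering technique:

```python
def merge_lexicons(sentiwordnet, subjectivity_list):
    """Merge two prior-polarity lexicons, dropping duplicate entries.

    Each lexicon maps word -> (positivity, negativity).  When a word
    occurs in both resources, the SentiWordNet scores are kept
    (an arbitrary tie-breaking choice for this sketch).
    """
    merged = dict(subjectivity_list)   # start from the subjectivity list
    merged.update(sentiwordnet)        # SentiWordNet wins on duplicates
    return merged

# Tiny illustrative lexicons:
swn = {"happy": (0.875, 0.0), "shame": (0.0, 0.125)}
subj = {"happy": (0.8, 0.0), "sharply": (0.625, 0.125)}
lexicon = merge_lexicons(swn, subj)
# The duplicate "happy" appears once, with the SentiWordNet scores.
```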




                         Continue….(3)


Source Language Acquisition

                    SentiWordNet            Subjectivity Word List

                 Single        Multi        Single        Multi

 Unambiguous     115424        79091         5866           990
    Words
                 20789         30000         4745           963

 Discarded      Threshold   Orientation   Subjectivity      POS
 Ambiguous                   Strength       Strength
   Words
                 86944         30000         2652           928
Target Language Generation

   Generation Strategies

      Bilingual Dictionary Based Approach

      WordNet Based Approach

      Antonym Generation

      Corpus Based Approach

      Dr Sentiment (A Gaming Approach)



                          Continue….(1)


Target Language Generation

Bilingual Dictionary Based Approach
  A word-level translation technique is adopted.

  Robust and reliable synsets (approx. 9,966) were created by
   native speakers as well as linguistic experts of the specific
   languages as part of the English to Indian Languages Machine
   Translation (EILMT) systems.

  Various language-specific dictionaries were acquired.




                          Continue….(2)

Target Language Generation
             Bilingual Dictionary Based Approach

 Hindi (90,872)
    SHABDKOSH (http://www.shabdkosh.com/)
    Shabdanjali
    (http://www.shabdkosh.com/content/category/downloads/)
 Bengali (102,119)
    Samsad Bengali-English Dictionary
    (http://dsal.uchicago.edu/dictionaries/biswas_bengali/)
 Telugu (112,310)
    Charles Philip Brown English-Telugu Dictionary
    (http://dsal.uchicago.edu/dictionaries/brown/)
    Aksharamala English-Telugu Dictionary
    (https://groups.google.com/group/aksharamala)
    English-Telugu Dictionary
    (http://ltrc.iiit.ac.in/onlineServices/Dictionaries/Dict_Frame.html)
                        Continue….(3)


Target Language Generation
             Bilingual Dictionary Based Approach



 Hindi
     The translation process resulted in 22,708 Hindi entries.

 Bengali
     The translation process resulted in 34,117 Bengali entries.

 Telugu
     The translation process resulted in 30,889 Telugu entries.
     Almost 88% of the Telugu SentiWordNet was generated by this
    process.



                          Continue….(4)

Target Language Generation
WordNet Based Expansion Approach

Synonymy Expansion
   The WordNet based expansion technique produces additional synset
  members: inactive, motionless, static for the source word still.
   Prior polarity scores are copied directly.
Antonymy Expansion
   The WordNet based expansion technique produces additional sentiment
  lexemes: ugly for the source word beautiful.
   Prior polarities are calculated as:
                                  Tp = 1 - Sp
                                  Tn = 1 - Sn
where Sp, Sn are the positivity and negativity scores for the source
language (i.e., English) and Tp, Tn are the positivity and negativity
scores for the target languages.
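A minimal sketch of this score assignment, using the slide's formulas; the scores for "beautiful" are hypothetical:

```python
def antonym_scores(sp, sn):
    """Flip prior polarity for an antonym per the formulas above:
    Tp = 1 - Sp, Tn = 1 - Sn."""
    return (1 - sp, 1 - sn)

# Hypothetical scores for 'beautiful' -> scores assigned to 'ugly':
tp, tn = antonym_scores(0.75, 0.0)
# tp == 0.25, tn == 1.0
```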


                         Continue….(5)

Target Language Generation
           WordNet Based Expansion Approach



 Hindi
   Hindi WordNet (Jha et al., 2001)
    (http://www.cfilt.iitb.ac.in/wordnet/webhwn/) is a well
    structured, manually compiled resource that has been updated
    continuously over the last nine years.
   Almost 60% of the entries were generated by this process.

 Bengali
   The Bengali WordNet (http://bn.asianwordnet.org/) contains
  only 1,775 noun synsets, as reported in (Robkop et al., 2010).
   Only 5% of the new lexicon entries were generated by this
  process.


                            Continue….(6)

Target Language Generation
                   Antonymy Generation
             Affix/Suffix         Word         Antonym
         abX                Normal        Ab-normal
         misX               Fortune       Mis-fortune
         imX-exX            Im-plicit     Ex-plicit
         antiX              Clockwise     Anti-clockwise
         nonX               Aligned       Non-aligned
         inX-exX            In-trovert    Ex-trovert
         disX               Interest      Dis-interest
         unX                Biased        Un-biased
         upX-downX          Up-hill       Down-hill
         imX                Possible      Im-possible
         ilX                Legal         Il-legal
         overX-underX       Overdone      Under-done
         inX                Consistent    In-consistent
         X-irX              Regular       Ir-regular
         Xless-Xful         Harm-less     Harm-ful
         malX               Function      Mal-function


   About 8% of Bengali, 7% of Hindi and 11% of Telugu
    SentiWordNet entries are generated by this process.
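The affix rules in the table can be sketched as a candidate generator; in practice the candidates would be validated against a dictionary or WordNet, which this sketch omits, so the outputs are unfiltered guesses:

```python
# Affix pairs and negating prefixes taken from the table above.
PREFIX_PAIRS = [
    ("im", "ex"),      # implicit -> explicit
    ("in", "ex"),      # introvert -> extrovert
    ("up", "down"),    # uphill -> downhill
    ("over", "under"), # overdone -> underdone
]
NEGATING_PREFIXES = ["ab", "mis", "anti", "non", "dis", "un",
                     "im", "il", "in", "ir", "mal"]

def antonym_candidates(word):
    """Generate antonym candidates by swapping or adding affixes."""
    cands = []
    for a, b in PREFIX_PAIRS:        # swap paired prefixes
        if word.startswith(a):
            cands.append(b + word[len(a):])
    for p in NEGATING_PREFIXES:      # add a negating prefix
        cands.append(p + word)
    if word.endswith("less"):        # Xless <-> Xful
        cands.append(word[:-4] + "ful")
    elif word.endswith("ful"):
        cands.append(word[:-3] + "less")
    return cands

# "harmless" yields the candidate "harmful", among others.
```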
                             Continue….(7)

Target Language Generation
Corpus Based Approach

Language/culture specific words:
      सेहरा (Sehra: a marriage-wear)
      দুর্গাপূজা (Durgapujo: a festival of Bengal)

Technique
      The generated sentiment lexicon is used as a seed list.
      Tag-Set
              SWP (Sentiment Word Positive)
              SWN (Sentiment Word Negative)
      Corpus
              EILMT language-specific corpus: approximately 10K
      sentences.
      Model
              Conditional Random Field (CRF)
              An n-gram (n=4) sequence labeling model has been
      used for the present task.
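The CRF itself is not reproduced here; the sketch below shows the kind of n-gram (n = 4) window features such a sequence labeler (tagging SWP / SWN) might consume. The tokens and feature names are illustrative assumptions:

```python
def window_features(tokens, i, n=4):
    """Build n-gram context features for token i, as a CRF sequence
    labeler might consume: the current token plus up to n-1
    preceding tokens, with a sentence-start padding symbol."""
    feats = {"w0": tokens[i]}
    for k in range(1, n):
        feats["w-%d" % k] = tokens[i - k] if i - k >= 0 else "<S>"
    return feats

tokens = ["the", "festival", "was", "joyous"]
f = window_features(tokens, 3)
# {'w0': 'joyous', 'w-1': 'was', 'w-2': 'festival', 'w-3': 'the'}
```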
                    Limitations

Issues in Cross Lingual Projection

   The sentiment score may not be equal to the source language score

   A relative sentiment score is needed rather than an absolute score

   Language / culture specific lexicons should be included

   Sentiment scores should be updated over time




                      Involving Human
                        Intelligence
                      WORLD INTERNET USAGE AND POPULATION STATISTICS

 World Regions           Population       Internet Users   Internet Users    Penetration      Growth      Users %
                         (2010 Est.)      Dec. 31, 2000    Latest Data       (% Population)   2000-2010   of Table

 Africa                  1,013,779,050        4,514,400      110,931,700        10.9 %       2,357.3 %      5.6 %
 Asia                    3,834,792,852      114,304,000      825,094,396        21.5 %         621.8 %     42.0 %
 Europe                    813,319,511      105,096,093      475,069,448        58.4 %         352.0 %     24.2 %
 Middle East               212,336,924        3,284,800       63,240,946        29.8 %       1,825.3 %      3.2 %
 North America             344,124,450      108,096,800      266,224,500        77.4 %         146.3 %     13.5 %
 Latin America/Caribbean   592,556,972       18,068,919      204,689,836        34.5 %       1,032.8 %     10.4 %
 Oceania / Australia        34,700,201        7,620,480       21,263,990        61.3 %         179.0 %      1.1 %
 WORLD TOTAL             6,845,609,960      360,985,492    1,966,514,816        28.7 %         444.8 %    100.0 %
Dr. Sentiment
     Q1




  Dr. Sentiment
            Q2




Word     Positivity   Negativity
Good       0.625         0.0
Better     0.875         0.0
Best       0.980         0.0
Dr. Sentiment
     Q3




Dr. Sentiment
      Q4




Sentiment…Un-Explored
      Dimensions




                       Geo-Spatial
Blue in Islam: In verse 20:102 of the Qur'an, the word زرق
zurq (plural of azraq 'blue') is used metaphorically for
evildoers whose eyes are glazed with fear
Sentiment…Un-Explored
      Dimensions




     Age-Wise Senti-Mentality
Sentiment…Un-Explored
      Dimensions




    Gender-Specific Senti-Mentality
   Expected Impact of the
         Resources
Resources are useful in multiple aspects
   Mono-lingual Sentiment/Opinion/Emotion Analysis tasks

The generated language-specific SentiWordNet(s) could be
expanded by the other proposed methods (Dictionary, WordNet,
Antonym and Corpus Based Approaches)

The other dimensions
   Geospatial information retrieval
   Personalized search
   Recommender systems, etc.
   Stylometry: a writer's senti-mentality
   Plagiarism / spamming techniques: geo-spatial and user
   perspectives
                    The Road Ahead
    Basic SentiWordNet has been developed for 56 languages

                                          Languages
    Afrikaans     Bulgarian    Dutch      German         Irish       Malay      Russian       Thai

     Albanian     Catalan     Estonian     Greek        Italian     Maltese     Serbian      Turkish

      Arabic      Chinese     Filipino    Haitian      Japanese    Norwegian     Slovak     Ukrainian

    Armenian      Croatian    Finnish     Hebrew       Korean       Persian     Slovenian     Urdu

    Azerbaijani    Creole     French     Hungarian     Latvian       Polish     Spanish     Vietnamese

      Basque       Czech      Galician   Icelandic    Lithuanian   Portuguese    Swahili      Welsh

    Belarusian     Danish     Georgian   Indonesian   Macedonian   Romanian     Swedish      Yiddish




A. Das and S. Bandyopadhyay. Towards The Global SentiWordNet, In the
Workshop on Model and Measurement of Meaning (M3), PACLIC 24, November 4,
Sendai, Japan, 2010. (Accepted)
                       References
    Resources

I. A. Das and S. Bandyopadhyay. Towards The Global SentiWordNet, In the
   Workshop on Model and Measurement of Meaning (M3), PACLIC 24,
   November 4, Sendai, Japan, 2010.

II. A. Das and S. Bandyopadhyay. SentiWordNet for Indian Languages, In
    the 8th Workshop on Asian Language Resources (ALR), August 21-22,
    Beijing, China, 2010.

III. A. Das and S. Bandyopadhyay. SentiWordNet for Bangla, In Knowledge
     Sharing Event-4: Task 2: Building Electronic Dictionary , February 23rd-24th,
     2010, Mysore.




Sentiment / Subjectivity Detection


   Solution Architecture Explored
     Rule-Based
     Machine Learning
     Hybrid
     Adaptive Genetic Algorithm: Multiple Objective
    Optimization, The Evolutionary Technique to Detect
    Sentiment



 Adaptive Genetic Algorithm: the Multiple Objective Optimization
technique outperformed all the other techniques
   Sentence subjectivity: An objective sentence expresses some factual
    information about the world, while a subjective sentence expresses some
    personal feelings or beliefs.

Example: Type: Film Review, Film Name: Deep Blue Sea, Holder:
     Arbitrary, outside the theatre
                     Oh, this is blue!
Is this statement objective or subjective?
     • "blue" is not an evaluative expression
     • Different cultures have different colour schemes (is blue
        positive or negative?)
Example: Type: Comment, Holder: Governor of WB, Issue:
 Nandigram.
   Governor said the government should keep
                       patience.

Is this statement objective or subjective?
    • "keep patience" regarding what?
    • How do we determine whether the Governor's comment is important?
•   Subjectivity is a social norm

•   Subjectivity knowledge is pragmatic

•   Prior knowledge always helps to
    identify subjectivity
•   A rule-based approach
•   Use Themes and Ontology as pragmatic
    knowledge
•   SentiWordNet (Bengali): a prior polarity lexicon

•   Features
    •   Frequency
    •   Average Distribution
    •   Functional Word
    •   Positional Aspect
    •   Theme Identification
    •   Ontology List
    •   Stemming Cluster
    •   Part of Speech
    •   Chunk
    •   SentiWordNet (Bengali)
Features                   Overall performance
                           incremented by

Stemming Cluster                4.05%
Part of Speech                  3.62%
Chunk                           4.07%
Functional Word                 1.88%
SentiWordNet (Bengali)          5.02%
Ontology List                   3.66%

                       Feature wise System Performance

[Bar chart: system performance for English and Bengali under the
Base-Line, POS-Chunk, Ontology, Position and Distribution feature
configurations]

               Domain           Precision          Recall
                NEWS             72.16%            76.00%
                 BLOG            74.60%            80.40%

                        Overall System Performance


Observations
 • Subjectivity detection is easier for the blog corpus than for the news
   corpus
 • Performance improved by only 2% over the rule-based system when using the
   CRF technique with the same feature set
GBML is used to automatically identify the best feature set, based on the
principle of natural selection and survival of the fittest.
The identified fittest feature set is then optimized locally, and global
optimization is then obtained by a multi-objective optimization technique.
The local optimization identifies the best range of values for a particular
feature.
The global optimization technique identifies the best ranges of values of
multiple features taken together.
       Types                Features
                     POS
                     SentiWordNet
  Lexico-Syntactic
                     Frequency
                     Stemming
                     Chunk Label
      Syntactic
                     Dependency Parsing
                     Title of the Document
                     First Paragraph
   Discourse Level
                     Average Distribution
                     Theme Word

Experimentally Best Identified Feature Set
GAs are characterized by the five basic components as follows
I. Chromosome representation for the feasible solutions to the optimization
     problem.
II. Initial population of the feasible solutions.
III. A fitness function that evaluates each solution.
IV. Genetic operators that generate a new population from the existing
     population.
V. Control parameters such as population size, probability of genetic
     operators, number of generations, etc.
                               f_s = Σ_{i=0..N} f_i

where f_s is the resultant subjectivity function to be calculated and f_i is
the i-th feature function. If the present model is represented in a vector
space model, then the above function can be re-written as:

                       f_s = f_i · f_{i+1} · f_{i+2} · … · f_n

This equation specifies what is known as the dot product between vectors.

The GBML provides the facility to search in the Pareto-optimal set of
possible features.

To make Pareto optimality mathematically rigorous, we state that a feature
vector x is partially less than feature vector y, symbolically x <_p y, when
the following condition holds:

            x <_p y  ⟺  (∀i)(x_i ≤ y_i) ∧ (∃i)(x_i < y_i)
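The partial-order test can be sketched directly from the definition; the vectors are illustrative:

```python
def partially_less(x, y):
    """x <_p y: every component of x is <= the matching component of y,
    and at least one is strictly smaller (Pareto dominance)."""
    return (all(xi <= yi for xi, yi in zip(x, y))
            and any(xi < yi for xi, yi in zip(x, y)))

# (1, 2, 3) <_p (1, 2, 4): equal in two coordinates, smaller in one.
# Equal vectors and incomparable vectors are not partially less.
```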
Imperialism/NNP is/VBZ the/DT source/NN of/IN war/NN and/CC the/DT
disturber/NN of/IN peace/NN.

NNP     VBZ       DT       NN      IN        NN       CC      DT    NN   IN   NN
  1      12        6        2      18         2        4       6     2   18    2

                Features                           Real-Values
          POS                     1-21 (Bengali)/1-45 (English)
          SentiWordNet            -1 to +1
          Frequency               0 or 1
          Stemming                1 to 17176/ 1 to 1235
          Chunk Label             1-11 (Bengali) / 1-21 (English)
          Dependency Parsing      1-30 (Bengali) / 1-55 (English)
          Title of the Document   Varies document wise
          First Paragraph         Varies document wise
          Average Distribution    Varies document wise
          Theme Word              Varies document wise
Crossover
The intuition behind crossover is the exploration of new solutions and
exploitation of old solutions. GAs construct a better solution by mixing the
good characteristic of chromosomes together.

Mutation
Mutation involves the modification of the values of each gene of a solution with some
probability (mutation probability).
For example, in the following chromosome a random mutation occurs at position 10.

                   Before:   101111110111101
                   Result:   101111110011101
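A minimal sketch of single-point crossover and the bit-flip mutation illustrated above; the crossover point and bit strings are illustrative:

```python
def crossover(a, b, point):
    """Single-point crossover: swap the tails of two bit-string parents."""
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chrom, position):
    """Flip the bit at the given 1-indexed position."""
    i = position - 1
    flipped = "0" if chrom[i] == "1" else "1"
    return chrom[:i] + flipped + chrom[i + 1:]

# The slide's example: a mutation at position 10.
print(mutate("101111110111101", 10))   # -> 101111110011101
```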
The following cost-to-fitness transformation is commonly used with GAs:

                  f(x) = Cmax - g(x)   when g(x) < Cmax
                       = 0             otherwise

There are a variety of ways to choose the coefficient Cmax: it may be taken
as an input coefficient, as the largest g value observed thus far, as the
largest g value in the current population, or as the largest of the last k
generations.

A problem arises with negative utility values, as can occur during the
fitness evaluation of n features. To overcome this, we simply transform
fitness according to the equation:

                  f(x) = u(x) + Cmin   when u(x) + Cmin > 0
                       = 0             otherwise
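The two fitness transformations above can be sketched as follows; the numeric values are illustrative:

```python
def cost_to_fitness(g_x, c_max):
    """f(x) = Cmax - g(x) when g(x) < Cmax, else 0."""
    return c_max - g_x if g_x < c_max else 0

def utility_to_fitness(u_x, c_min):
    """f(x) = u(x) + Cmin when u(x) + Cmin > 0, else 0."""
    return u_x + c_min if u_x + c_min > 0 else 0

# cost_to_fitness(3, 10) == 7   (low cost -> high fitness)
# utility_to_fitness(-5, 2) == 0   (negative utility clamped to zero)
```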
     Languages      Domain        Precision      Recall
                     MPQA         90.22%        96.01%
       English
                     IMDB         93.00%        98.55%
                     NEWS         87.65%        89.06%
       Bengali
                     BLOG          90.6%        92.40%


Experimental Result of Subjectivity Detection by Genetic Algorithm
                       References
    Subjectivity

I. A. Das and S. Bandyopadhyay. Subjectivity Detection using Genetic
   Algorithm. In the 1st Workshop on Computational Approaches to
   Subjectivity and Sentiment Analysis (WASSA10), Lisbon, Portugal, August
   16-20, 2010.

II. A. Das and S. Bandyopadhyay. Subjectivity Detection in English and
    Bengali: A CRF‐based Approach, In Proceeding of ICON 2009, December
    14th-17th, 2009, Hyderabad.

III. A. Das and S. Bandyopadhyay. Theme Detection an Exploration of
     Opinion Subjectivity, In Proceeding of Affective Computing & Intelligent
     Interaction (ACII) 2009.

IV. A. Das and S. Bandyopadhyay. Extracting Opinion Statements from
    Bengali Text Documents through Theme Detection, In Proceeding of 17th
    International Conference on Computing (CIC-09), Mexico City, Mexico.
Sentiment Polarity Detection:
Inspired by Structural Human
         Intelligence


    Explored Solution Architectures
      Rule-Based
      Machine Learning
      Structural
Observations

  • Sentence level polarity is simply the integration of
    phrase level polarity.

  • Word level polarity depends on four POS
    categories.

  • Polarity mainly depends on syntactic modifier-
    modified relations.

  • Dependency relations are a vital clue.

  • Negative words are a direct clue.
Identifies polarity at phrase level
Using Support Vector Machine (SVM)

Features
    • Part Of Speech (POS)
    • Chunk label
    • Functional Word
    • SentiWordNet (Bengali) orientation
    • Stemming Cluster
    • Negative Word Feature
    • Dependency Tree feature
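As an illustrative sketch, a phrase might be encoded for the SVM by combining lexicon scores with the negative-word feature; the helper name, tiny lexicon and negation list are hypothetical, and the real system uses the full feature set listed above:

```python
NEGATIONS = frozenset({"not", "no", "never"})

def phrase_features(words, lexicon):
    """Encode a phrase as a numeric vector an SVM could consume:
    summed positivity, summed negativity (from a prior-polarity
    lexicon), plus a flag for the negative-word feature."""
    pos = sum(lexicon.get(w, (0.0, 0.0))[0] for w in words)
    neg = sum(lexicon.get(w, (0.0, 0.0))[1] for w in words)
    has_negation = 1.0 if any(w in NEGATIONS for w in words) else 0.0
    return [pos, neg, has_negation]

lex = {"good": (0.625, 0.0)}
print(phrase_features(["not", "good"], lex))   # [0.625, 0.0, 1.0]
```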
 •   Solution Architecture
      Rule Based
      Statistical
       Malt
       CRF (Conditional Random Field)
      Hybrid
       Statistical + Post-process
Features
   • Part Of Speech (POS)
   • Chunk label
   • Root Word
   • Vibhakti
   • Morph Features
   • Animate-Inanimate gazetteer
   • Verb argument structure
Dependency Tree Feature
Polarity             Precision                  Recall
 Overall               70.04%               63.02%
Positive               56.59%               52.89%
Negative               75.57%               65.87%


           Results on Polarity Identification
                     References
    Polarity

I. A. Das and S. Bandyopadhyay. Phrase-level Polarity Identification for
   Bengali, In International Journal of Computational Linguistics and
   Applications (IJCLA), Vol. 1, No. 1-2, Jan-Dec 2010, ISSN 0976-0962,
   Pages 169-182.

II. A. Das and S. Bandyopadhyay. Opinion-Polarity Identification in Bengali,
    In ICCPOL 2010, California, USA.




     Sentiment Structurization:
    Rational Human Intelligence


   Proposed Generalized Structurization using 5W

                 Who, What, When, Where, Why
              Overview
 SRL Contemporary Theories
 Motivation
 Paninian Karaka Theory
 Fillmore's Case Grammar
 The Proposed Concept of 5W
 Resource Acquisition
 Challenges
 System Overview
 Statistical Model – MEMM
 Rule Based Post Processing

 SRL Contemporary Theories
 SRL has been extensively studied for English
 SRL has not been studied for Indian
  Languages
    Hindi SRL-Genitive Marker (Sharma et al., 2009)
 Domain Specific
    FROM_DESTINATION
    TO_DESTINATION
    DEPARTURE_TIME etc.
 Verb Specific
      EATER
      EATEN
                    Continue….(1)

SRL Contemporary Theories
 PropBank   (Approx 11K Semantic Role Labels)
 Verb Specific Semantic Roles
  Agent, patient or theme etc.
 More general semantic roles
 FrameNet   (Approx 7K Semantic Role Labels)
 Verb-frame-specific semantic roles
 Frame-to-frame semantic relations
 Inheritance, Perspective_on, Subframe, Precedes,
  Inchoative_of, Causative_of and Using.
 VerbNet   (Approx 23 Semantic Role Labels)
 Thematic roles specify the semantic relationship
  between a predicate and its arguments:
 agent, patient, theme (From PropBank), experiencer,
  stimulus, instrument, location, source, goal, recipient,
  benefactive etc.
                  Continue….(2)


SRL Contemporary Theories


 Conclusion
 No adequate set of semantic role labels exists!
   Across various domains
   Across various languages




                   Motivation
 Historical: Panini's Karaka Theory
   The first grammarian
   Astadhyayi, 300-600 BC
   Still influential
 Syntactic-Semantic    Relationship

Karta: central to the action of the verb
Karma: the one most desired by the karta
Karana: instrument, essential for the action to
take place
Sampradaan: recipient of the action
Apaadaan: movement away from a source
Adhikarana: location of the action
                          Continue….(1)

                      Motivation
 Fillmore's Case Grammar
 He posited the following preliminary list of cases,
 noting however that 'Additional cases will surely be
 needed':

  Agent: The typically animate perceived instigator of the
   action.
  Instrument: Inanimate force or object causally involved in
   the action or state.
  Dative: The animate being affected by the state or action.
  Factitive: The object or being resulting from the action or
   state.
  Locative: The location or time-spatial orientation of the state
   or action.
  Objective: The semantically most neutral case; conceivably
   the concept should be limited to things which are affected by
   the action or state.
  The Proposed Concept of 5W
The 5W task seeks to extract the semantic
information of nouns in a natural language sentence
by distilling it into the answers to the 5W questions:
Who, What, When, Where and Why.
   Who? Who was involved?
   What? What happened?
   When? When did it take place?
   Where? Where did it take place?
   Why? Why did it happen?

                                        Continue….(1)



                Resource Acquisition
   Annotation


Madhabilata [Who] was keeping her wrist watch [What] then [When]
on the table [Where] as she was about to sleep [Why].

Original Bengali: মাধবীলতা (শোবে বলে) তখন (হাতের ঘড়ি) খুলে টেবিলে রাখছিল
                                                                               67
                  Continue….(2)


      Resource Acquisition

Annotation




     Tag       Agreement (Annotators X and Y)
     Who       88.45%
     What      64.66%
     When      76.45%
     Where     75.23%
     Why       56.23%



                                                    68
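Pairwise agreement figures of this kind reduce to token-level label matching between two annotators; a minimal sketch with toy annotations (not the actual corpus data):

```python
def agreement(labels_a, labels_b):
    """Percentage of tokens on which two annotators assign the same tag."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)

# Toy per-token 5W tags from two annotators over the same five tokens.
x = ["Who", "What", "O", "When", "O"]
y = ["Who", "O",    "O", "When", "O"]
print(round(agreement(x, y), 2))  # 80.0
```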
              Overview
 SRL Contemporary Theories
 Motivation
 Paninian Karaka Theory
 Fillmore's Case Grammar
 The Proposed Concept of 5W
 Resource Acquisition
 Challenges
 System Overview
 Statistical Model – MEMM
 Rule Based Post Processing

                               69
                       Challenges
 Irregular   Occurrences

     Tag       Who       What      When      Where     Why       Overall
     Who       -         58.56%    73.34%    78.01%    28.33%    73.50%
     What      58.56%    -         62.89%    70.63%    64.91%    64.23%
     When      73.34%    62.89%    -         48.63%    23.66%    57.23%
     Where     78.01%    70.63%    48.63%    -         12.02%    68.65%
     Why       28.33%    64.91%    23.66%    12.02%    -         32.00%


                                                                 70
              Overview
 SRL Contemporary Theories
 Motivation
 Paninian Karaka Theory
 Fillmore's Case Grammar
 The Proposed Concept of 5W
 Resource Acquisition
 Challenges
 System Overview
 Statistical Model – MEMM
 Rule Based Post Processing

                               71
            Proposed System
 Machine   Learning
 Maximum Entropy (MEMM)
 CONLL 2005 SRL Shared Task
    8 of the 19 participating systems used MEMM
    The two highest-performing systems used MEMM
 Rule-Based   Post Processing
 Heterogeneous problem structure
    The label-bias problem of ML techniques
    Bengali is morphosyntactically rich
    Avoids the limitations of available tools


                                                    72
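The MEMM choice can be illustrated with a tiny decoder: each tag is scored conditioned on the current word and the previous tag, and Viterbi-style search recovers the best tag sequence. The conditional probabilities below are hand-set toys standing in for a trained maximum-entropy model over the features on the next slide:

```python
import math

TAGS = ["Who", "What", "O"]

# P(tag | word, prev_tag): toy hand-set table; missing entries get a floor.
PROB = {
    ("Madhabilata", "<s>", "Who"): 0.8,
    ("keeps", "Who", "O"): 0.7,
    ("watch", "O", "What"): 0.6,
}

def viterbi(words):
    """Most probable tag sequence under the toy MEMM (log-space search)."""
    floor = 0.05
    best = {"<s>": (0.0, [])}  # prev_tag -> (log prob, tag sequence so far)
    for w in words:
        nxt = {}
        for prev, (lp, seq) in best.items():
            for t in TAGS:
                p = PROB.get((w, prev, t), floor)
                cand = (lp + math.log(p), seq + [t])
                if t not in nxt or cand[0] > nxt[t][0]:
                    nxt[t] = cand
        best = nxt
    return max(best.values())[1]

print(viterbi(["Madhabilata", "keeps", "watch"]))  # ['Who', 'O', 'What']
```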
            Proposed System
 Machine    Learning
 Feature Engineering
              Types               Features
              Lexico-Syntactic    POS, Root Word
              Morphological       Noun: Gender, Number, Person, Case
                                  Verb: Root Word, Modality
              Syntactic           Head Noun, Chunk Label, Dependency Relation
                                                     73
           Experimental Result
      Machine Learning (1)
      Machine Learning + Rule Based (2)

 Tag     System    Precision   Recall      F-measure
 Who     1         76.23%      64.33%      69.77%
         2         79.56%      72.62%      75.93%
 What    1         61.23%      51.34%      55.85%
         2         65.45%      59.64%      62.41%
 When    1         69.23%      58.56%      63.44%
         2         73.35%      65.96%      69.45%
 Where   1         70.01%      60.00%      64.61%
         2         77.66%      69.66%      73.44%
 Why     1         61.45%      53.87%      57.41%
         2         63.50%      55.56%      59.26%

 Average F-measure: System 1: 62.22%, System 2: 68.10%

                                                                           74
                     References
    Structurization

I. A. Das and S. Bandyopadhyay. Customized Opinion Summarization and
   Visualization: 5Ws Dimensions, In the International Workshop on Topic
   Feature Discovery and Opinion Mining (TFDOM 2010), Joint with the 10th
   IEEE International Conference on Data Mining (ICDM 2010), 13 December
   2010, Sydney, Australia.

II. A. Das, A. Ghosh and S. Bandyopadhyay. Semantic Role Labeling for
    Bengali Noun using 5Ws, In the International Conference on Natural
    Language Processing and Knowledge Engineering (IEEE NLP-KE2010),
    August 21-23, Beijing, China, 2010.




                                                                           75
 Sentiment Summary: Can a
Machine Aggregate Sentiment
  as Human Intelligence?

   Sentiment Summarization: the problem
     Single Doc?
     Multi-Doc?
     Extractive?
     Generative?
     Proper output format?

   Explored Solution Architectures
     Topic-Opinion multi-doc Summary
     Visualization
     Tracking
The Topic-Opinion Summarizer

The present system follows a topic-sentiment model for sentiment
identification and aggregation. Topics are identified as discourse-level
themes; topic-sentiment aggregation is then achieved by theme clustering
(k-means) and a document-level Theme Relational Graph representation.
The Theme Relational Graph is finally used for candidate summary
sentence selection via the standard PageRank algorithm used in
Information Retrieval (IR).
                                   Annotation
 Agreement drops quickly as the number of annotators increases.
 Inter-annotator agreement is higher for theme-word annotation than for
candidate sentence identification for the summary.

 Discussion with the annotators reveals that they try to capture as many
theme words as possible during annotation, but the same annotators are more
cautious during sentence identification for the summary, since they try to
find the most concise set of sentences that best describes the opinionated
snapshot of a document.

               Annotators    X vs. Y   X vs. Z   Y vs. Z   Avg
               Percentage    82.64%    71.78%    80.47%    78.30%
               All Agree     69.06%
                      Agreement of Annotators at Theme Words Level

               Annotators    X vs. Y   X vs. Z   Y vs. Z   Avg
               Percentage    73.87%    69.06%    60.44%    67.80%
               All Agree     58.66%
                       Agreement of Annotators at Sentence Level
                 Theme Identification
The theme detection technique identifies the most relevant discourse-level
topic-semantic nodes, in terms of words or expressions, using a standard
machine learning technique: Conditional Random Fields (CRF). Theme word
detection is defined as a sequence labeling problem: based on a set of input
features, each word is tagged as either Theme Word (TW) or Other (O).

                   Types               Features
                   Lexico-Syntactic    POS, SentiWordNet, Frequency, Stemming
                   Syntactic           Chunk Label, Dependency Parsing Depth
                   Discourse Level     Title of the Document, First Paragraph,
                                       Term Distribution, Collocation
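Features of this kind are typically handed to the CRF as one feature dictionary per token; a minimal sketch (the POS tags and title words are illustrative, and the SentiWordNet, chunk and dependency features are omitted):

```python
from collections import Counter

def token_features(tokens, pos_tags, title_words):
    """Build one feature dict per token for a CRF-style theme tagger."""
    freq = Counter(t.lower() for t in tokens)  # term-distribution proxy
    feats = []
    for i, (tok, pos) in enumerate(zip(tokens, pos_tags)):
        feats.append({
            "word": tok.lower(),
            "pos": pos,
            "frequency": freq[tok.lower()],
            "in_title": tok.lower() in title_words,  # discourse feature
            "first_paragraph": i < 10,               # crude position proxy
        })
    return feats

tokens = ["Election", "results", "shook", "the", "election", "board"]
pos = ["NN", "NNS", "VBD", "DT", "NN", "NN"]
fs = token_features(tokens, pos, {"election"})
print(fs[0]["frequency"], fs[0]["in_title"])  # 2 True
```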
                Theme Clustering
The cluster hypothesis (Jardine and van Rijsbergen, 1971)
 A reasonable cluster is defined as the one that maximizes the within-
cluster document similarity and minimizes between-cluster similarities.
 Standard bottom-up soft k-means clustering technique
 Document represented as a theme matrix
                  | election    cricket   hotel    |
             A =  | parliament  sachin    vacation |
                  | governor    soccer    tourist  |


        s(q_k, d_j) = q_k · d_j = Σ_{i=1}^{N} w_{i,k} × w_{i,j}              (1)


        s(q_k, d_j) = (Σ_{i=1}^{N} w_{i,k} × w_{i,j}) /
                      (√(Σ_{i=1}^{N} w_{i,k}²) × √(Σ_{i=1}^{N} w_{i,j}²))    (2)
                      Continue……(1)

         Theme Clustering
ID   Themes                          Cluster 1   Cluster 2   Cluster 3
1    প্র঱া঳ন (administration)         0.63   0.12   0.04
1    ঳ু঱া঳ন (good-government)         0.58   0.11   0.06
1    ঳ভা (Society)                    0.58   0.12   0.03
1    আইন (Law)                        0.55   0.14   0.08
2    গজফলণা (Research)                0.11   0.59   0.02
2    কজর (College)                    0.15   0.55   0.01
2    উচ্চড়঱ক্ষা (Higher Study)        0.12   0.66   0.01
3    শ ঴াড়দ (Jehadi)                  0.13   0.05   0.58
3    ভ঳ড় দ (Mosque)                   0.05   0.01   0.86
3    ভু঱াযপ (Musharaf)                0.05   0.01   0.86
3    কাশ্মীয (Kashmir)                0.03   0.01   0.93
3    ঩াড়কস্তান (Pakistan)             0.06   0.02   0.82
3    নয়াড়দল্লী (New Delhi)            0.12   0.04   0.65
3    ফর্গায (Border)                  0.08   0.03   0.79
Document Level Theme Relational Graph
 Document graph G = <V, E> is built from a given source document.
 The input document d is parsed and split into a number of text fragments
(sentences) using sentence delimiters (the Bengali sentence marker "।", "?"
or "!").
 Cosine similarity for inter-document similarity
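Graph construction can be sketched as: split the text on the Bengali sentence delimiters, then connect sentence nodes by cosine similarity of their term vectors (toy whitespace tokenization and raw counts; the actual system uses theme-weighted vectors):

```python
import re
import math
from collections import Counter

def split_sentences(text):
    """Split on the Bengali sentence marker and the ?/! delimiters."""
    parts = re.split(r"[।?!]", text)
    return [p.strip() for p in parts if p.strip()]

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_graph(text):
    """Return sentence nodes and weighted edges (i, j, similarity)."""
    sents = split_sentences(text)
    vecs = [Counter(s.lower().split()) for s in sents]
    edges = [(i, j, cosine(vecs[i], vecs[j]))
             for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sents, edges

sents, edges = build_graph("the vote was fair। the vote was rigged! turnout rose?")
print(len(sents), round(edges[0][2], 2))  # 3 0.75
```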
           Summarization System
 Extractive opinion summarization system
 Extraction is based on sentence importance in representing the shared
subtopic (cluster); this is the key issue, and it regulates the quality of
the output summary.
 An adaptive PageRank algorithm (Page et al., 1998) is used for sentence
selection among documents in the same cluster.
 The summation of edge scores reflects the correlation between two nodes.
 Sentences are presented in their original sequence to keep the summary
coherent.
                          Candidate Sentence                                      IR Score
       ভ঴ম্মদ আড়ভজনয ভজতা ঩ড়রেফুুজযায 'নফীনতভ' ঳দ঳ুজকও ড়কন্তু ফয়জ঳য ড়দক
                                                                                    151
       ঴ইজত নফীন বাফা কঠিন।
       এফায ড়িন্তা আযওএকেু শফড়঱, কাযণ এই ভূরুফৃড়িয ড়঩িজন শমভন শদজ঱য ড়বতজয
       ড় ড়ন঳঩জেয শ াগান কজভ মাওয়া আজি, শতভনই আজি আন্ত গ াড়তক ফা াজয                 167
       ভূরুফৃড়িয প্রফণতা।
       স্বাধীনতায ঩য লাে ফিয গত ঴ইর, এখনও প্রায় ঳কর ঳যকাড়য ঩ড়যকল্পনায
       ড়঩িজন এই একটিই বাফাদ঱গ কা কজয: ড়ফড়বন্ন শবােফুাঙ্কজক তু ষ্ট কড়যয়া শমন শতন     130
       প্রকাজযণ ড়নজ জদয দরীয় ড়িড়ত ড়নড়িত কযা।
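Sentence selection can be sketched with a few steps of power iteration over the weighted sentence graph (uniform damping; the adaptive weighting of the actual system is simplified away, and the sentences and edge weights are toys):

```python
def pagerank(n, edges, damping=0.85, iters=50):
    """Power iteration over a weighted undirected sentence graph."""
    weights = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        weights[i][j] = weights[j][i] = w
    out = [sum(row) or 1.0 for row in weights]  # guard isolated nodes
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [(1 - damping) / n +
                damping * sum(rank[j] * weights[j][i] / out[j]
                              for j in range(n))
                for i in range(n)]
    return rank

def summarize(sentences, edges, k):
    """Pick the top-k ranked sentences, presented in original order."""
    rank = pagerank(len(sentences), edges)
    top = sorted(range(len(sentences)), key=lambda i: rank[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

sents = ["A ties B", "B ties A and C", "C stands alone"]
edges = [(0, 1, 0.9), (1, 2, 0.4)]
print(summarize(sents, edges, 2))  # ['A ties B', 'B ties A and C']
```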
                            Evaluation
                         Evaluation of Theme Detection

                  Metrics     X         Y         Z         Avg
                  Precision   87.65%    85.06%    78.06%    83.60%
                  Recall      80.78%    76.06%    72.46%    76.44%
                  F-Score     84.07%    80.30%    75.16%    79.85%
                         Evaluation on Summarization
                  Metrics     X         Y         Z         Avg
                  Precision   77.65%    67.22%    71.57%    72.15%
                  Recall      68.76%    64.53%    68.68%    67.32%
                  F-Score     72.94%    65.85%    70.10%    69.65%
Sentiment Visualization and Tracking
 Semantic Document Clustering
        s(d_k, d_j) = d_k · d_j = Σ_{i=1}^{N} w_{i,k} × w_{i,j}

                               Generated Clusters

 5Ws     5W Opinion Constituents   Doc1   Doc2   Doc3   Doc4   Doc5
 Who     Mamata Banerjee           0.63   0.01   0.55   0.93   0.02
         CM                        0.00   0.12   0.37   0.10   0.17
 What    Gyaneswari Express        0.98   0.79   0.58   0.47   0.36
         Derailment                0.98   0.76   0.35   0.23   0.15
 When    24th May 2010             0.94   0.01   0.01   0.01   0.01
         Midnight                  0.68   0.78   0.01   0.01   0.01
 Where   Jhargram                  0.76   0.25   0.01   0.13   0.76
         Khemasoli                 0.87   0.01   0.01   0.01   0.01
 Why     Maoist                    0.78   0.89   0.06   0.10   0.14
         Bomb Blast                0.13   0.78   0.01   0.01   0.78
   Dimension Wise Opinion Summary
          and Visualization


 Real-life users do not always require an overall summary or
visualization; rather, they seek the opinion changes of any "Who" during
"When", depending upon "What", "Where" and "Why".

 For example, a market surveyor from company A may need to find out the
changes in public opinion about their product X after the release of product
Y by company B; an overall opinion summary on product X may miss the
valuable information the surveyor needs. As another example, a voter may be
interested in the rate of change of public opinion about a leader or a
public event before and after an election.
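Dimension-wise tracking of this kind reduces to filtering opinion records by one W and aggregating polarity over another; a minimal sketch with illustrative records and polarity scores (not actual system output):

```python
from collections import defaultdict

def track(records, who, by="when"):
    """Average polarity of opinions about `who`, grouped by another W."""
    buckets = defaultdict(list)
    for r in records:
        if r["who"] == who:
            buckets[r[by]].append(r["polarity"])
    return {k: sum(v) / len(v) for k, v in buckets.items()}

records = [
    {"who": "product X", "when": "pre-launch",  "polarity": 0.6},
    {"who": "product X", "when": "post-launch", "polarity": -0.2},
    {"who": "product X", "when": "post-launch", "polarity": 0.0},
    {"who": "product Y", "when": "post-launch", "polarity": 0.8},
]
print(track(records, "product X"))
# {'pre-launch': 0.6, 'post-launch': -0.1}
```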
Dimension Wise Opinion Summary
       and Visualization
      A Snapshot of the Visualize-Tracking System
          Evaluation

                        Average Scores

 Tag      Who    What   When   Where  Why    Overall
 Who      -      3.20   3.30   3.30   2.50   3.08
 What     3.20   -      3.33   3.80   2.60   3.23
 When     3.30   3.33   -      2.00   2.50   3.00
 Where    3.30   3.80   2.00   -      2.00   2.77
 Why      2.50   2.60   2.50   2.00   -      2.40
                      References
    Summarization, Visualization and Tracking

I. A. Das and S. Bandyopadhyay. Event-Sentiment Visual Tracking, In the
   FALA 2010 "VI Jornadas en Tecnologia del Habla", November 10-12, Vigo,
   Spain, 2010.

II. A. Das and S. Bandyopadhyay. Opinion Summarization in Bengali: A
    Theme Network Model, In the Second IEEE International Conference on
    Social Computing (SocialCom-2010), Minneapolis, USA, August 20-22,
    2010.

III. A. Das and S. Bandyopadhyay. Topic-Based Bengali Opinion
     Summarization, In the 23rd International Conference on Computational
     Linguistics (COLING 2010), August 23-27, 2010, Beijing, China.


                                                                             90
                      Conclusion


 Sentiment beyond Human Intelligence.

 Sentiment Innate Human Intelligence.

 Sentiment: the Psychology of Human Intelligence.
Thank You

				