A Study of Evaluative Language in SMS Messages

A STUDY OF EVALUATIVE LANGUAGE IN SMS MESSAGES:
TOWARDS A CHARACTERIZATION OF OPINION

The Search Engine Meeting 2009, Boston




              Marguerite Leenhardt
 PhD candidate in Applied Linguistics and Natural
            Language Processing
    Université Paris 3 Sorbonne Nouvelle
         SYLED/CLA²T, Paris, France
1. CONTEXT OF THE STUDY
   Corpus of the UCL (Université Catholique de Louvain, Belgium)
     30,000 SMS
     French speakers from Belgium


   Corpus Linguistics and NLP
     Textual statistics methods
     Corpus automatic processing



   Clues of evaluative language (opinions)
     Knowledge representation
     Industrial perspectives for IE (Information Extraction)
METHOD




   1. Textometric analysis and context observation
   2. Representation of linguistic phenomena
   3. Towards the development of a symbolic grammar
ANALYSIS OF OPINIONS, WORKING
HYPOTHESIS EVALUATION

   Current phase
     No specific research concerning opinion expression in SMS
     Working hypotheses on electronic communication in
      Applied Linguistics are based on small corpora
     No linguistic resources adapted to indexing


   This pilot study’s contribution
       Innovative approach to the linguistic study of SMS
           Work on the largest existing SMS corpus
           Towards the elaboration of linguistic resources
       Hypothesis evaluation
           Complexity and frequency of smileys (Pierozak, 2007)
           Non-specificity of smileys for message interpretation
            (Marcoccia and Gauducheau, 2007a and b)
STANDARDIZATION AND ANALYSIS
PROBLEMS

   SMS: non-standard inputs
     Abbreviations, smileys
     Existing linguistic resources are unsuitable

   Web applications: a challenge for NLP tools
     Analysing subjective linguistic productions
     Worthwhile research for the marketing industry

   Adapting NLP applications to non-standard
    inputs is a major target (Fairon et al., 2007)
       Linguistic description is important for elaborating
        description models
EXTERNAL RESOURCE DEPENDENCE

   Textometry: a computational and statistical, but
    non-linguistic, description model
     •   Textual analysis that does not depend on the
         standardization level
     •   Relevant tools to build suitable linguistic resources
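
The point about independence from the standardization level can be illustrated with a minimal sketch. It assumes the simplest possible segmentation (splitting on whitespace), which is not necessarily what the textometric tools used in the study do. The function names `raw_forms` and `form_frequencies` are hypothetical, and the three messages are examples quoted elsewhere in these slides.

```python
from collections import Counter


def raw_forms(message):
    """Split on whitespace only, so abbreviations and smileys stay intact."""
    return message.split()


def form_frequencies(messages):
    """Count every graphical form across the messages, without normalization."""
    counts = Counter()
    for message in messages:
        counts.update(raw_forms(message))
    return counts


# Three messages quoted elsewhere in these slides.
sample = [
    "Merci ;-)",
    "Lol c'est malin ça :P",
    "Félicitations ;o) bonne soirée.",
]
freqs = form_frequencies(sample)
for form, count in freqs.most_common(5):
    print(form, count)
```

Because nothing is rewritten before counting, forms such as « Lol » or « ;o) » are counted exactly as the sender typed them. This is the property that makes frequency lists usable on non-standard inputs.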
2. COMMENTS AND RELATED WORK
   Change in practices: from constraint to
    style

   Smileys: non-specific clues for identifying
    evaluations

   Textometric approach: towards the building of
    symbolic grammars
CHANGE IN PRACTICES
   Web apps and SMS
     Twitter
     Mail messages on cell phones
     (pictured: HTC TyTN II)

   Generation Y
     Explosion of mobile telephony in the last 10 years
     Computer-like operating systems on cell phones

   Adaptation of the market
     Through CRM systems in airlines, banks…
     Work by Ogle (2005) on CRM systems for night clubs
     Summize Twitter Search, Twittratr
SMILEYS AS NON-SPECIFIC CLUES
   Lexical chunks (complete and reduced forms) are more
    reliable than smileys for determining message orientation

   « intelligibility and coherence of messages are not much
    influenced by the presence of smileys » (Marcoccia and
    Gauducheau, 2007a)

   « message interpretation mechanisms […] tend to rely on
    the verbal part of the message » (Marcoccia and
    Gauducheau, 2007b)

   Examples
     « OK, c'est parfait pour moi??? À jeudi :-) » (OK, that's perfect for me??? See you Thursday)
     « Lol c'est malin ça :P » (Lol, that's clever)
     « Félicitations ;o) bonne soirée. » (Congratulations, have a good evening.)
     « Merci ;-) » (Thanks)
     « J'ai pas pu venir :""( » (I couldn't come)
     « Bah tant pis...:-( Dommage... » (Oh well, too bad... A pity...)
     « Je sais pas..parce que quand même,t'as pas été sympa avec moi :-( » (I don't know.. because, all the same, you weren't nice to me)
     « Ça me soûle trop l'école, les cours, les profs,... Pff» (School, classes, teachers... it all annoys me so much. Pff)
TOWARDS A SYMBOLIC GRAMMAR
   Symbolic grammar example

   A link between symbolic and local grammars

       Distributional behavior of lexical chunks as an input
        to linguistic description
3. RESULTS
   Reduced forms (« pfff », « arf », « lol », « mdr ») are
    reliable clues for determining the orientation of opinions

   Adverbial modifiers reduce the coverage of the
    local grammars
       9% coverage without adverbs vs. 2% coverage with
        adverbs

   Negative evaluations of the addressee are not
    common
       The insult local grammar has only 3% coverage
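
The coverage figures above can be read as the share of messages matched by a local grammar. The sketch below illustrates that reading under strong assumptions: the two regular expressions are hypothetical stand-ins for local grammars (the study's actual grammars are not shown in these slides), "with adverbs" is taken to mean that an adverb slot is required, and the third and fifth messages are invented for illustration.

```python
import re


def coverage(pattern, messages):
    """Share of messages in which the pattern matches at least once."""
    if not messages:
        return 0.0
    hits = sum(1 for message in messages if pattern.search(message))
    return hits / len(messages)


# Hypothetical mini-grammars written as regular expressions.
ADJ = r"(?:triste|content|fatiguée)"
ADV = r"(?:trop|très|vraiment)"
without_adverb = re.compile(rf"\bje suis {ADJ}\b", re.IGNORECASE)
with_adverb = re.compile(rf"\bje suis {ADV} {ADJ}\b", re.IGNORECASE)

messages = [
    "Oh je suis triste :(",      # quoted in these slides
    "Je suis dans le train.",    # quoted in these slides
    "je suis trop fatiguée",     # invented for illustration
    "Je ne viens pas demain.",   # quoted in these slides
    "Je suis content",           # invented for illustration
]

print(coverage(without_adverb, messages))
print(coverage(with_adverb, messages))
```

On this toy sample the pattern that requires an adverb matches fewer messages than the one without, which is the direction of the 9% versus 2% contrast reported above.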
TALKING ABOUT ONESELF
   First-person personal pronoun collocations
        Two major communicative functions:
           Expressing a negative opinion on one's state of mind
              Exasperation grammar: over 19% coverage
           Expressing factual information
              Waiting grammar: 49% coverage

   Examples
     « Chérie je m’ennuie en classe. » (Darling, I'm bored in class.)
     « Oh je suis triste :(» (Oh, I'm sad)
     « Je ne viens pas demain. » (I'm not coming tomorrow.)
     « je la vois quasi jamais à cause de ses parents :'-( » (I almost never see her because of her parents)
     « Je suis dans le train. » (I'm on the train.)
     « Je vais en ville avec Jona faire les magasins. » (I'm going into town with Jona to go shopping.)
     « Sorry pas pu décrocher j'étais en réunion. » (Sorry, couldn't pick up, I was in a meeting.)
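
A first look at first-person collocations can be obtained by counting which form immediately follows « je ». The sketch below is a simplification, not the study's procedure: the function name `right_collocates` is hypothetical, tokens are plain runs of word characters, and the input is the seven example messages from this slide.

```python
import re
from collections import Counter


def right_collocates(messages, pivot="je"):
    """Count the form that immediately follows each occurrence of the pivot."""
    counts = Counter()
    for message in messages:
        tokens = re.findall(r"\w+", message.lower())
        for current, following in zip(tokens, tokens[1:]):
            if current == pivot:
                counts[following] += 1
    return counts


messages = [
    "Chérie je m'ennuie en classe.",
    "Oh je suis triste :(",
    "Je ne viens pas demain.",
    "je la vois quasi jamais à cause de ses parents :'-(",
    "Je suis dans le train.",
    "Je vais en ville avec Jona faire les magasins.",
    "Sorry pas pu décrocher j'étais en réunion.",
]

collocates = right_collocates(messages)
print(collocates.most_common(3))
```

With this tokenization the elided form « j' » in the last message is not counted as an occurrence of « je ». A fuller treatment would need to handle elision and spelling variants explicitly.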
A LOOK AT LOVE DECLARATIONS
   Examples (males)
     « ça va déjà mieux. Jtm mon petit cœur. » (It's already better. Love you, my little sweetheart.)
     « J'ai décidé de te dire un truc,jtm... » (I've decided to tell you something, love you...)

   Examples (females)
     « Je suis bleue de toi mon chéri!!!Je t'aime » (I'm crazy about you, my darling!!! I love you)
     « bonne nuit pleine de beaux rêves *bisou doux* je t'aime fort:$ » (Good night, full of sweet dreams *soft kiss* I love you lots)

   Reduced form (« jtm » for « je t'aime »): in proportion, most
    frequent in male messages
   Expanded / complete form: in proportion, most
    frequent in female messages

   Example of Repeated Segments calculus on a sample of 1000 SMS
    for « [Jj]tm »: 17 occurrences in the female part versus
    21 occurrences in the male part
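
The « [Jj]tm » count can be reproduced in spirit with a regular expression applied to each part of the corpus. The slide gives raw counts (17 and 21) but not the size of each part, so the sketch below normalizes by the number of messages. That is an assumption about what "in proportion" means here. The function names are hypothetical and the two parts contain only the four examples from this slide.

```python
import re

# Same pattern as on the slide, with word boundaries added.
JTM = re.compile(r"\b[Jj]tm\b")


def occurrences(messages):
    """Total number of matches of the reduced form in a list of messages."""
    return sum(len(JTM.findall(message)) for message in messages)


def relative_frequency(messages):
    """Occurrences per message, so parts of different sizes can be compared."""
    if not messages:
        return 0.0
    return occurrences(messages) / len(messages)


male_part = [
    "ça va déjà mieux. Jtm mon petit cœur.",
    "J'ai décidé de te dire un truc,jtm...",
]
female_part = [
    "Je suis bleue de toi mon chéri!!!Je t'aime",
    "bonne nuit pleine de beaux rêves *bisou doux* je t'aime fort:$",
]

print(occurrences(male_part), relative_frequency(male_part))
print(occurrences(female_part), relative_frequency(female_part))
```

Comparing relative rather than raw frequencies matters whenever the male and female parts differ in size, since a larger part yields more occurrences even at the same rate of use.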
CONCERNING SMILEYS
Example on a sample of 1000 SMS

   Most frequent smileys
     Form            Total frequency
      ;-)                 424
      :-)                 306
      )                   208
      ;o)                 174
      :P                  130
      :)                  126
      ;)                  125
      :-(                 113
      :-D                  64
      :p                   64

   Less frequent smileys, hapax
     Form            Total frequency
      XD                    2
      XP                    2
      *-)                   1
      *’’;o)                1
      *=/;’’                1
      ,)                    1
      ,-))                  1
      ‘:-)                  1
      -(                    1
      -_-                   1

   Complex smileys are the fewest (without
    duplication of the same template)
   Most frequent smileys are 2 or 3 characters long
   Smiley complexity is inversely proportional
    to frequency (validating Pierozak's hypothesis)
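
The relation between smiley length and frequency can be checked directly on the two tables above. The sketch below takes the number of characters as a rough proxy for complexity, which is an assumption and not necessarily the measure used by Pierozak. The form on the third row of the first table, which did not survive intact, is left out. The helper name `mean_length` is hypothetical.

```python
from statistics import mean

# Frequencies copied from the tables above (sample of 1000 SMS).
frequent = {
    ";-)": 424, ":-)": 306, ";o)": 174, ":P": 130, ":)": 126,
    ";)": 125, ":-(": 113, ":-D": 64, ":p": 64,
}
rare = {
    "XD": 2, "XP": 2, "*-)": 1, "*’’;o)": 1, "*=/;’’": 1,
    ",)": 1, ",-))": 1, "‘:-)": 1, "-(": 1, "-_-": 1,
}


def mean_length(table):
    """Average number of characters per smiley form (one vote per form)."""
    return mean(len(form) for form in table)


print(max(len(form) for form in frequent))
print(mean_length(frequent), mean_length(rare))
```

On these figures every frequent form has at most three characters, and the rare forms are longer on average. Both observations are consistent with the slide's conclusion.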
4. PERSPECTIVES
   Concerning linguistic studies on electronic
    communication
       Acquire knowledge on linguistic uses
       Hypothesis evaluation (Pierozak, 2007)
       Adaptation of existing linguistic resources (Unitex,
        TreeTagger)
       Development of symbolic grammars for IE applications
         Boosting the document score for indexing (Attardi and
          Simi, 2006)
         Question answering
         Automatic summarization
SHORT BIBLIOGRAPHY
   (Attardi and Simi, 2006), Attardi, G. and Simi, M. (2006), Blog mining through opinionated words, Dipartimento di Informatica, Università di Pisa, in Proceedings of TREC 2006 Blog Task
   (Das and Chen, 2001), Das, S.R. and Chen, M.Y. (2001), Yahoo! for Amazon: sentiment parsing from small talk on the web, EFA 2001 Barcelona Meetings
   (Fairon et al., 2006a), Fairon, C., Klein, J. and Paumier, S. (2006), Le Corpus SMS pour la science. Base de données de 30.000 SMS et logiciels de consultation, CD-Rom, Presses Universitaires de Louvain, Cahiers du Cental, 3.2
   (Fairon et al., 2006b), Fairon, C., Klein, J. and Paumier, S. (2006), Le langage SMS. Etude d’un corpus informatisé à partir de l’enquête ‘Faites don de vos SMS à la science’, Presses Universitaires de Louvain, Cahiers du Cental, 3.1, 136p.
   (Fairon et al., 2007), Fairon, C., Klein, J-R. and Paumier, S. (2007), Un corpus transcrit de 30.000 SMS français, in (Gerbault (Ed.), 2007), pp. 173-182
   (Gamon, 2004), Gamon, M. (2004), Sentiment classification on customer feedback data: noisy data, large feature vectors, and the role of linguistic analysis, Microsoft Research
   (Gerbault (Ed.), 2007), Gerbault, J. (Ed.) (2007), La langue du cyberspace : de la diversité aux normes, L’Harmattan, 295p.
   (Guimier De Neef, 2004), Guimier De Neef, E. (2004), 1pw1srlakestion, Tutoriel TAL des NFCE, Journée d’étude de l’ATALA « Le traitement automatique des nouvelles formes de communication écrite (e-mails, forums, chats, SMS, etc.) » du 5 juin 2004, France Télécom R&D
   (Hiroshima et al., 2006), Hiroshima, N., Yamada, S., Furuse, O. and Kataoka, R. (2006), Searching for sentences expressing opinions by using declaratively subjective clues, NTT Cyber Solutions Laboratories, NTT Corporation
   (Marcoccia and Gauducheau, 2007b), Marcoccia, M. and Gauducheau, N. (2007), L’analyse du rôle des smileys en production et en réception : un retour sur la question de l’oralité des écrits numériques, in GLOTTOPOL, numéro 10
   (Martin and White, 2005), Martin, J.R. and White, P.R.R. (2005), The language of evaluation: appraisal in English, Palgrave, London
   (Ogle, 2005), Ogle, T. (2005), Creative uses of information extracted from SMS messages. A project to investigate Information Extraction from SMS messages and potential use of the extracted information, Undergraduate dissertation project, Department of Computer Science, University of Sheffield
   (Pierozak, 2007), Pierozak, I. (2007), Et le smiley sous un angle émique ? Coénonciation et accommodation, remarquable et complexité, in (Gerbault (Ed.), 2007), pp. 75-88
          THANK YOU