					                                              1




Blog Mining – Market Research made easy?

Bettina Berendt, K.U.Leuven, www.berendt.de
                                                        2
About me ...




               : Computer Science




               : Information Systems
               : Computer Science / Cognitive Science
               : Artificial Intelligence
               : Business Science
               : Economics
                                  3
Motivation / Executive summary




          Listen to how
          ordinary people
           really talk:
                                        4
 Agenda

Concepts


From “online buzz“ to text mining


Text mining: first steps


Text mining: going deeper


Closing the loop: from blogs to blogs


The representativeness challenge

So ...
                                                                     6
What's a blog?

   a (more or less) frequently updated publication on the Web,
    sorted in (usually reverse) chronological order of the
    constituent blog posts.
   The content may reflect any interests including personal,
    journalistic or corporate.
   Usually textual, but multimedia forms exist (photoblog, vblog,
    …)
                                                                           7
Blogs and other social media (“Web 2.0“)




   “Annotation platforms“ (e.g., del.icio.us)
   Wikis (e.g., Wikipedia)
   Microblogging (e.g., Twitter)
   Social network sites (e.g., MySpace)
   Blogs (e.g., Livejournal; Huffington Post)
        Sharing / linking by: hyperlinks, comments, blogroll, trackback links
                                                                              9
  Blogs and other social media,
  and some of their origins in older media
Social media:
   “Annotation platforms“ (e.g., del.icio.us)
   Wikis (e.g., Wikipedia)
   Microblogging (e.g., Twitter)
   Social network sites (e.g., MySpace)
   Blogs (e.g., Livejournal; Huffington Post)
        Sharing / linking by: hyperlinks, comments, blogroll, trackback links
Older media:
   Computer-supported cooperative work
   Bookmarks; www.dmoz.org
   Chatrooms
   Usenet
   Dating sites
   Diaries
   (Often political) journalism
   PR; press releases
                                                                 11
What's market research?

   identification, collection, analysis, and dissemination of
    information
   for the purpose of assisting management in decision making
    related to the identification and solution of problems and
    opportunities in marketing
                                                                              12
Traditional methods of (consumer) market research

   Based on questioning:
        Focus groups, surveys, questionnaires, ...
   Based on observations:
        Ethnographic studies - observe social phenomena in their natural
         setting - observations can occur cross-sectionally or
         longitudinally - examples include product-use analysis and
         computer cookie traces.
        Experimental techniques - create a quasi-artificial environment to
         try to control spurious factors, then manipulate at least one of
         the variables - examples include purchase laboratories and test
         markets
                                  14
Capturing “online buzz“:
Bursty communication activities
                 15
Comparing search volume, news and blogs
                                                                            16
 The idea of text mining ...

... is to go beyond frequency-counting
... is to go beyond the search-for-documents framework
... is to find patterns (of meaning) within and across documents


(yes, there is text mining behind some of the things the above tools do!)
                                                 17
 The steps of text mining

1.   Application understanding
2.   Corpus generation
3.   Data understanding
4.   Text preprocessing
5.   Search for patterns / modelling
          Topical analysis
          Sentiment analysis / opinion mining
6.   Evaluation
7.   Deployment
                                        18
 Agenda

Concepts


From “online buzz“ to text mining


Text mining: first steps


Text mining: going deeper


Closing the loop: from blogs to blogs


The representativeness challenge

So ...
                                                               19
Application understanding; Corpus generation

   What is the question?
   What is the context?
   What could be interesting sources, and where can they be
    found?


   Crawl
   Use a search engine and/or archive
        Google blogs search
        Technorati
        Blogdigger
        ...
                                                         20
 Preprocessing (1)

Data cleaning
    Goal: get clean ASCII text
    Remove HTML markup, pictures, advertisements, ...
    Automate this: wrapper induction
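
A minimal sketch of the data-cleaning step, assuming the blog posts are already downloaded as HTML strings. It only strips markup and script/style content with Python's standard `html.parser`; it does not perform wrapper induction, and the names `TextExtractor` and `clean_html` are illustrative.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def clean_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace left behind by removed markup.
    return " ".join(" ".join(parser.parts).split())
```

Advertisements and navigation menus are ordinary markup to this parser, so removing them would still need site-specific rules or a learned wrapper.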
                                                                          21
 Preprocessing (2)

Further text preprocessing
    Goal: get processable lexical / syntactical units
    Tokenize (find word boundaries)
    Lemmatize / stem
          e.g., buyers, buyer → buyer / buyer, buying, ... → buy
    Remove stopwords
    Find Named Entities (people, places, companies, ...); filtering
    Resolve polysemy and homonymy: word sense disambiguation;
     “synonym unification“
    Part-of-speech tagging; filtering of nouns, verbs, adjectives, ...
    ...


    Most steps are optional and application-dependent!
    Many steps are language-dependent; coverage of non-English varies
    Free and/or open-source tools or Web APIs exist for most steps
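
The first three steps can be sketched as follows. This is a toy illustration only: the stopword list and the suffix rules are made up for the example and are far cruder than a real stemmer or lemmatizer, and the function names are hypothetical.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "and", "of", "is", "are", "this", "to"}
# Toy suffix rules, checked in this order; not a real stemming algorithm.
SUFFIXES = ("ers", "ing", "er", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(token):
    """Strip the first matching suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stopwords, then stem what is left."""
    return [toy_stem(t) for t in tokenize(text) if t not in STOPWORDS]
```

With these rules, `preprocess("The buyers are buying this camera")` yields `['buy', 'buy', 'camera']`, which mirrors the buyer / buying example above.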
                                                                                22
 Preprocessing (3)

Creation of text representation
    Goal: a representation that the modelling algorithm can work on
    Most common forms: A text as
         a set or (more usually) bag of words / vector-space representation:
          term-document matrix with weights reflecting occurrence,
          importance, ...
         a sequence of words
         a tree (parse trees)
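
A minimal sketch of the bag-of-words representation, assuming documents are already tokenized. The weighting used here (raw term frequency times log inverse document frequency) is one common choice among several, and the matrix is stored with one row per document, i.e. transposed relative to a term-document layout.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """docs: list of token lists. Returns (vocabulary, rows), one row per document."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for doc in docs for t in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        rows.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows
```

A term that occurs in every document gets weight 0 under this scheme, which is why very common words carry no information in the resulting vectors.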
                                      23
Recall text data pre-processing ...
                                      24
An important part of preprocessing:
Named-entity recognition (1)
                                                               25
An important part of preprocessing:
Named-entity recognition (2)
   Technique: Lexica, heuristic rules, syntax parsing
   Re-use lexica and/or develop your own
        configurable tools such as GATE
   A challenge: multi-document named-entity recognition
        See proposal in Subašić & Berendt (Proc. ICDM 2008)
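
The lexicon-based technique can be sketched in a few lines. This is not GATE and not the multi-document method cited above; the lexicon entries and the function name are hypothetical, and matching is plain case-insensitive string lookup.

```python
import re

# Hypothetical lexicon: surface form -> (canonical name, entity type).
LEXICON = {
    "michelle obama": ("Michelle Obama", "PERSON"),
    "dell": ("Dell", "COMPANY"),
    "atlanta": ("Atlanta", "PLACE"),
}


def find_entities(text, lexicon=LEXICON):
    """Return (canonical name, type, start offset) for every lexicon match."""
    hits = []
    for surface, (name, etype) in lexicon.items():
        pattern = r"\b" + re.escape(surface) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((name, etype, match.start()))
    # Report matches in the order they appear in the text.
    return sorted(hits, key=lambda hit: hit[2])
```

Anything missing from the lexicon is simply not found, which is the lexicon dependence that heuristic rules and syntax parsing are meant to reduce.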
                                                                                          26
The simplest form of content analysis is based on NER




Berendt, Schlegel and Koch
In Zerfaß et al. (eds.) Kommunikation, Partizipation und Wirkungen im Social Web, 2008
                  28
 More about named entities: co-occurrence

Source: Discussion boards
 similar to blogs, but (more) clearly communication-related

Feldman et al., Proc. ICDM 2007
                    29

Co-occurrence of brands and attributes

  Feldman et al., Proc. ICDM 2007
                                                                 33
 More advanced text modelling:
 Summarization – of time-indexed documents
Recall “Michelle Obama“




    Google Trends, Blogpulse etc. associate documents /
     document sets with “bursts“
    But: this means the user has to read the documents!
    Can we do better and create a concise summary of what was
     discussed in that period?
    Can we allow the user to ask for as much detail as s/he is
     interested in?
                      34


    Yes – with STORIES

(Subašić & Berendt, Proc. ICDM 2008)
                                                                                          37
 Salient story elements

1.   Identify content-bearing terms (e.g. 150 top-TF.IDF over whole corpus)
2.   Split whole corpus T by atomic time period (e.g., week)
3.   For each time period (atomic or moving-average)
           Compute the weights for corpus t for this period
           Weight = support of co-occurrence of 2 content-bearing terms w1, w2 in t =
            (# articles from t containing both w1, w2 in window) / (# all articles in t)
4.   Threshold
           Number of occurrences of co-occurrence(w1, w2) in t ≥ θ1 (e.g., 5)
           Time-relevance TR of co-occurrence(w1, w2) =
            (support(co-occurrence(w1, w2)) in t) / (support(co-occurrence(w1, w2)) in T) ≥ θ2 (e.g., 2)
           Thresholds are set dynamically + interactively by the user
5.   Story elements = relationships = all these edges
      Story basics = terms = all nodes connected by these edges
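
Steps 3 to 5 can be sketched as below. This is an illustration under simplifying assumptions, not the STORIES implementation: each document is reduced to its set of content-bearing terms, the co-occurrence window is the whole document, and the function names and corpus layout are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(docs):
    """Count, per unordered term pair, the documents containing both terms."""
    counts = Counter()
    for terms in docs:
        counts.update(combinations(sorted(set(terms)), 2))
    return counts


def salient_edges(corpus, period, theta1=5, theta2=2.0):
    """corpus: dict mapping period -> list of documents (each a list of terms).

    Returns {pair: time relevance} for pairs passing both thresholds in `period`.
    """
    docs_t = corpus[period]
    docs_all = [doc for docs in corpus.values() for doc in docs]
    counts_t, counts_all = pair_counts(docs_t), pair_counts(docs_all)
    edges = {}
    for pair, n in counts_t.items():
        support_t = n / len(docs_t)                      # support in period t
        support_all = counts_all[pair] / len(docs_all)   # support in whole corpus T
        tr = support_t / support_all                     # time relevance
        if n >= theta1 and tr >= theta2:
            edges[pair] = tr
    return edges


def story_basics(edges):
    """Nodes of the story graph: all terms touched by a salient edge."""
    return {term for pair in edges for term in pair}
```

A time relevance of 3, for example, means the pair co-occurs three times as often (relative to the number of articles) in this period as in the corpus overall.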
                                                                      38
 Salient story stages, and story evolution

6.   Story stage = the story graph made of basics and elements in t
7.   Story evolution = how story stages evolve over the t in T
                            39
An event: a missing child
                                                        40
A central figure emerges in the police investigations
                          41
Uncovering more details
                          42
Uncovering more details
                    43
An eventless time
                                         44
The story and the underlying documents
                                                       45
    Navigating between documents; relating different
    source types to one another




(Berendt & Trümper,
in press)
                                                                                    46
     A simple form of opinion mining:
     Feature-based Summary
     (Hu and Liu, Proc. SIGKDD'04)

     Source: Product reviews
      similar to blogs, but (more) clearly product-related

Review:
GREAT Camera., Jun 3, 2004
Reviewer: jprice174 from Atlanta, Ga.
        I did a lot of research last year before I bought this camera... It
kinda hurt to leave behind my beloved nikon 35mm SLR, but I was
going to Italy, and I needed something smaller, and digital.
         The pictures coming out of this camera are amazing. The 'auto'
feature takes great pictures most of the time. And with digital, you're not
wasting film if the picture doesn't come out. …
….

Summary:
Feature1: picture
Positive: 12
    The pictures coming out of this camera are amazing.
    Overall this is a good camera with a really good picture clarity.
    …
Negative: 2
    The pictures come out hazy if your hands shake even for a moment during the
    entire process of taking a picture.
    Focusing on a display rack about 20 feet away in a brightly lit room during day
    time, pictures produced by this camera were blurry and in a shade of orange.
Feature2: battery life
…
                                                                  48
 An application: Crisis PR – Step 1:
 Use blogs to observe public discussions
Detect products about which there is controversial discussion
    sentiment mining from text
 and/or
    use the structure of blogs (e.g., structure of blog post +
     comments; Mishne & Glance, Proc. WWW 2006)
 and/or
    discussion in the mainstream media (may be later though)
                                                                                49
     An application: Crisis PR – Step 2:
     Use blogs to communicate facts + own concerns
         Example: Dell's “exploding laptops“ – product recall and aftermath
         Dell launched a blog at that time (much maligned at first, but they
          learned ...)
         Evaluation of „all English-language consumer commentary on the
          Web“ before and after (methodology based on Reichhold 1996, The
          Loyalty Effect):




Market sentinel,
2007
                                                                     51
First ...

   ... the imperfect nature of automatic text analysis presents a
    challenge
   But: human inter-rater agreement on various aspects of texts
    also tends to be rather low!
                                                                           52
Some findings

   Only a fraction of the population blogs (8% of adult Internet
    users - Pew Internet & American Life Project July 2006)
   Most blogs are personal
        US survey (Business Week July 2006)
        German-language blogosphere is less “mature“, esp. less
         politicized, than the US blogosphere (Berendt, Schlegel & Koch,
         2008)
        This includes fewer mentions of companies


   But what about those personal blogs ...?
                           53
What makes people happy?
                           54
Happiness in blogosphere
                                                                          55



Well kids, I had an awesome birthday thanks to you. =D Just wanted to so
thank you for coming and thanks for the gifts and junk. =) I have many
pictures and I will post them later. hearts
current mood:

What are the characteristic words of these two moods?

Home alone for too many hours, all week long ... screaming child,
headache, tears that just won’t let themselves loose.... and now I’ve
lost my wedding band. I hate this.
current mood:

[Mihalcea, R. & Liu, H. (2006). In Proc. AAAI Spring Symposium CAAW.]
Slides based on Rada Mihalcea's presentation.
                                                                  56
 Data, data preparation and learning

LiveJournal.com – optional mood annotation
10,000 blog entries:
    5,000 happy entries / 5,000 sad entries
    average size 175 words / entry
    post-processing – remove SGML tags, tokenization, part-of-speech tagging
quality of automatic “mood separation”
    Naïve Bayes text classifier
         five-fold cross-validation
    Accuracy: 79.13% (>> 50% baseline)
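
A minimal sketch of a Naïve Bayes text classifier evaluated with k-fold cross-validation. It assumes a multinomial model with add-one smoothing and a simple interleaved fold split; the study's exact variant and fold assignment are not given here, so this will not reproduce the reported accuracy. All function names are illustrative.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """examples: list of (tokens, label). Returns a multinomial Naive Bayes model."""
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocab = {t for tokens, _ in examples for t in tokens}
    return label_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest log-posterior for the given tokens."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(count / total)
        for t in tokens:
            if t in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][t] + 1) / denom)  # add-one smoothing
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(examples, k=5):
    """Mean accuracy over k folds; fold i holds out every k-th example starting at i."""
    accuracies = []
    for i in range(k):
        test = examples[i::k]
        train = [ex for j, ex in enumerate(examples) if j % k != i]
        model = train_nb(train)
        correct = sum(predict_nb(model, tokens) == label for tokens, label in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k
```

With two balanced classes, always guessing one label scores 50%, which is the baseline the reported accuracy is compared against.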
                                               57
Results: Corpus-derived happiness factors


    yay         86.67      goodbye     18.81
    shopping    79.56      hurt        17.39
    awesome     79.71      tears       14.35
    birthday    78.37      cried       11.39
    lovely      77.39      upset       11.12
    concert     74.85      sad         11.11
    cool        73.72      cry         10.56
    cute        73.20      died        10.07
    lunch       73.02      lonely       9.50
    books       73.02      crying       5.50
                                                                       59
Conclusion

   [A brief glance given:] (Semi-)automated forms of text analysis,
    applied to blogs, can produce useful insights
   [Only briefly mentioned today:] It can profit from the
    simultaneous analysis of link structures and/or tags
   It is the only way of analysing large-scale corpora; mining
    methods are improving continuously
   However, machine language understanding is not human
    language understanding
   Also, representativeness is questionable
 → Need to combine blog mining with other forms of market
 research!
        NB: These forms include Web usage mining, query mining, ...
        But mining remains exploratory analysis
60
          61




The end
                                                          62
Outlook

   Evaluation
        Usually lacking from temporal text
         mining!
        Information retrieval quality: How
         to find the/a ground truth?
        (Subašić & Berendt, in press):
         encouraging results

   What are the implications of genre
    (e.g., news vs. scientific)?
        Narrative vs. “declarative“?
        Different register (including
         vocabulary choices)?
        One story vs. multiple story lines?
                                               THANKS !
                                                               63
Our interest in
THE WEB
   What's out there?
   How do people use it?
   How can we make it more useful? (→ information literacy)
                                                              64
Our interest in
(STORY) EVOLUTION
    “What happened?“
          Presidential elections
          Crime stories
    How did Saddam become an al-Qaida member?
    What did genes do before they transmitted information?


   Link to anyone who likes the Web + programming
   Link to other theories of evolution?
                                                                          65
 Understanding:
 A continuous interplay between different representations

Saddam is friend of US       reasons        text 1, text 2, text 3, ...


<something happens>


Saddam is foe of US          reasons        text 10, text 11, ...
                                           66
The application problem




                          What happened?
                                                     67
... very related to a typical bibliometric problem




                                What happened?




                                 What “happened“?
                                                       68



                        Agenda




The problem

Ingredients of a solution – and the STORIES approach

Demonstration

Evaluation

Differences between application areas?!
                                                     69
  A case study




http://www.telegraph.co.uk/
news/main.jhtml?xml=/news/2007/05/22/nmaddy122.xml
                                                          Manual!   70
The story unfolds
– new actors enter the stage (and old ones change their roles)
                                           Manual!   71
  Basic observation:
  A story is about relational statements




Robert Murat – suspect




Kate McCann – suspect
                                                                          72
Solution approach 1: Find latent topics




                     • temporal development only by
                     comparative statics
                     • no “drill down“ possible
                     • no fine-grained relational information
                     → lacks structure




                                   Tool: Blaž Fortuna : http://docatlas.ijs.sii
                                                                73
Solution approach 2: Temporal latent topics




                     • no fine-grained relational information
                     • “themes“ are fixed by the algorithm
                     • no “drill down“ possible
                     → no combination of machine and
                     human intelligence




                                  Mei & Zhai, PKDD 2005
                                                                               74
 The ETP3 problem



Evolutionary theme patterns discovery, summary and exploration


1.   identify topical sub-structure in a set (generally, a time-indexed
     stream) of documents constrained by being about a common topic
2.   show how these substructures emerge, change, and disappear (and
     maybe re-appear) over time
3.   give users intuitive and interactive interfaces for exploring the topic
     landscape and the underlying documents – use machine-generated
     summarization only as a starting point!
                                                                       75
 Ingredients of a solution



Document / text pre-processing
• Template recognition
• Multi-document named entities
• Stopword removal, lemmatization

Document summarization strategy
• no topics, but salient concepts & relations
• time window; word-span window

Selection approach for concepts
• concepts = words or named entities
• salient concept = high TF & involved in a salient relation, time-indexed

Similarity measure to determine relations
• bursty co-occurrence

Burstiness measure
• time relevance, a “temporal co-occurrence lift”

Interaction approach
• Graphs (& layout)
• Comparative statics or morphing
• Drill-down: “uncovering” relations
• Links to documents (in progress)

ETP3
STORIES
                                                                  76
Data collection and preprocessing

   Articles from Google News 05/2007 – 11/2007 for search term
    “madeleine mccann“
        (there was a Google problem in the December archive)
   Only English-language articles
   For each month, the first 100 hits
   Of these, all that were freely available → 477 documents


   Preprocessing:
        HTML cleaning
        tokenization
        stopword removal
                                                  77
Story elements

   content-bearing words
        the 150 top-TF words without stopwords
                                                                78
Story stages:
co-occurrence in a window




               “mother“ and “suspect“ co-occur
               • in a window of size ≥ 6 (all words)
               • in a window of size ≥ 2 (non-stopwords only)
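
A minimal sketch of a window-based co-occurrence test. How the window size is measured is not fully specified above, so this sketch assumes that two words co-occur if their token positions are at most `window` apart; the function name is hypothetical.

```python
def cooccur_in_window(tokens, w1, w2, window):
    """True if w1 and w2 occur at most `window` token positions apart."""
    positions1 = [i for i, t in enumerate(tokens) if t == w1]
    positions2 = [i for i, t in enumerate(tokens) if t == w2]
    return any(abs(i - j) <= window for i in positions1 for j in positions2)
```

Removing stopwords before the test shrinks the distances, which is why a smaller window is quoted for the non-stopword case than for the all-words case.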
                                                                                          79
 Salient story elements

1.   Split whole corpus T by week (week 17 = 30 Apr onwards, until week 44 = 12 Nov onwards)
2.   For each week
           Compute the weights for corpus t for this week

3.   Weight =
           Support of co-occurrence of 2 content-bearing words w1, w2 in t =
           (# articles from t containing both w1, w2 in window) / (# all articles in t)
4.   Threshold
           Number of occurrences of co-occurrence(w1, w2) in t ≥ θ1 (e.g., 5)
           Time-relevance TR of co-occurrence(w1, w2) =
            (support(co-occurrence(w1, w2)) in t) / (support(co-occurrence(w1, w2)) in T) ≥ θ2 (e.g., 2)
5.   Rank by TR, for each week identify top 2
6.   Story elements = peak words = all elements of these top 2 pairs (# = 38)
                                                                          80
 Salient story stages, and story evolution

7.   Story stage = co-occurrences of peak words in t
          For each week t: aggregate over t-2, t-1, t → moving average


8.   Story evolution = how story stages evolve over the t in T
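
The aggregation over t-2, t-1, t can be sketched as follows. The exact aggregation is not spelled out above; this sketch assumes the weekly supports of each word pair are averaged, and the data layout and function name are hypothetical.

```python
def moving_average(weekly_support, week, span=3):
    """Average each pair's support over `week` and the span-1 weeks before it.

    weekly_support: dict mapping week number -> {pair: support}.
    Weeks missing from the dict are skipped; a pair absent in a week counts as 0.
    """
    weeks = [w for w in range(week - span + 1, week + 1) if w in weekly_support]
    pairs = {pair for w in weeks for pair in weekly_support[w]}
    return {
        pair: sum(weekly_support[w].get(pair, 0.0) for w in weeks) / len(weeks)
        for pair in pairs
    }
```

Pooling the documents of the three weeks and recomputing support would be an alternative reading; the two differ when the weeks contain different numbers of articles.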
                81
Demonstration
                                                                      84
Blogs and other social media:
Where tagging (= adding keywords) is most prominent
                       Bookmarks

                     “Annotation platforms“
  Reader tags        (e.g., del.icio.us)




                                                                  Usenet


                    Blogs
  Author tags       (e.g., Livejournal; Huffington Post)
                    Sharing / linking by:
                    Hyperlinks, comments, blogroll, trackback links



     Diaries    (Often political) journalism       PR; press releases
85
                                                                                86
Traditional methods of market research

   Based on questioning:
        Qualitative MR - generally used for exploratory purposes - small
         number of respondents - not generalizable to the whole population
         - statistical significance and confidence not calculated - examples
         include focus groups, in-depth interviews, and projective
         techniques
        Quantitative MR - generally used to draw conclusions - tests a
         specific hypothesis - uses random sampling techniques so as to
         infer from the sample to the population - involves a large number
         of respondents - examples include surveys and questionnaires.
         Techniques include choice modelling, maximum difference
         preference scaling, and covariance analysis.
   Based on observations:
        Ethnographic studies - observe social phenomena in their
         natural setting - observations can occur cross-sectionally or
         longitudinally - examples include product-use analysis and
         computer cookie traces.
        Experimental techniques - create a quasi-artificial environment
         to try to control spurious factors, then manipulate at least one of
         the variables - examples include purchase laboratories and test
         markets
                                                                      87
Lexicon dependence ...
(does not work for the Clintons either!! Family relations must be stated
in the text)
88
                            89
Who does market research?

   Companies
   Journalists
   Politicians
   ...
                                                      92
Capturing “online buzz“ – pattern type 1 : “bursty“
                   93
Comparing search
volume, news and
blogs
                                                        94
Capturing “online buzz“ – pattern type 1 (in news and
blogs) / type 2 (in search) : “smooth trend“
                                                        95
Capturing “online buzz“ – pattern type 2 (in news and
search) / type 3 (in blogs): “cyclic“
                                                       97
 The steps of text mining (e.g., for blogs analysis)

1.   Application understanding
2.   Corpus generation
3.   Data understanding
4.   Text preprocessing
5.   Search for patterns / modelling
          Topical analysis
          Sentiment analysis / opinion mining
6.   Evaluation
7.   Deployment
                                                                           98
What's market research?

   a form of business research
   identification, collection, analysis, and dissemination of
    information
   for the purpose of assisting management in decision making
    related to the identification and solution of problems and
    opportunities in marketing.
   two categories: consumer market research and business-to-
    business (B2B) market research
   Consumer market research
        aims to understand the behaviours and preferences of consumers in a
         market-based economy, and the effects and comparative success of
         marketing campaigns.

				