

  Propagation of Distrust
             to find
Untrustworthy Web Neighborhoods

        Panagiotis Takis Metaxas
        Computer Science Department
          Wellesley College, USA

                       ICIW2009 – Venice, Italy (May)
Outline of the Talk

  The role of Search Engines in Web experience

  What is Web Spam

  Why Search Engines evolve?

  Web Graph vs. Societal Trust

  Evolution of Search Engines: 1993-2003

  Backward Propagation of Distrust
Have you used the Web…
  to get informed?
  to help you make decisions?
    Financial
    Medical
    Political
    Religious…

We depend on search engines to find information

The Web is huge
  > 1 trillion (! ?)
     static pages publicly available,
    … and growing every day
    Much larger,
     if you count the “deep web”
    Infinite,
     if you count pages created
Web information can be unreliable

               Anyone can be an author on the web!
You know e-mail Spam…
The Web has Spam too!

                    Search results for steroid drug HGH
                    (human growth hormone)
Any controversial issue will be spammed

                    Search results for mental disease ADHD
                    (attention-deficit/hyperactivity disorder)
Political issues will be spammed

                  Search results for Senatorial candidate
                  John N. Kennedy, 2008 USA Elections
… you like it or not!

                               Famous search results for
                               “miserable failure”

            But Google is usually so good at finding info…
            Why does it do that?

 Web Spam:
     Attempt to modify the web (its structure and contents),
        and thus influence search engine results
                in ways beneficial to web spammers
How Google (and the other search engines) Work

  [Diagram: crawl the Web; document IDs; inverted index; search engine servers]
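
The inverted index in the diagram can be illustrated with a minimal sketch. This is not how any real search engine is implemented: the page texts, the function names `build_inverted_index` and `search`, and the "every query word must appear" rule are all assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def search(index, query):
    """Return IDs of documents containing every query word (assumed AND semantics)."""
    postings = [set(index.get(word, ())) for word in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical crawled pages, keyed by document ID.
crawled = {
    1: "growth hormone research",
    2: "buy growth hormone online",
    3: "college research department",
}
index = build_inverted_index(crawled)
hits = search(index, "growth hormone")
```

The index is built once after crawling; each query then touches only the posting lists of its own words instead of scanning every page.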

A Brief History of Search Engines
  1st Generation (ca 1994):
     AltaVista, Excite, Infoseek…
     Ranking based on Content:
        Pure Information Retrieval

  2nd Generation (ca 1996):
     Lycos
     Ranking based on Content + Structure
        Site Popularity

  3rd Generation (ca 1998):
     Google, Teoma, Yahoo
     Ranking based on Content + Structure + Value
        Page Reputation

  In the Works
     Ranking based on “the user’s need behind the query”
1st Generation: Content Similarity

  Content Similarity Ranking:
  The more rare words two documents share,
  the more similar they are

  Documents are treated as “bags of words”
  (no effort to “understand” the contents)

  Similarity is measured by vector angles
  Query Results are ranked
  by sorting the angles
  between query and documents
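
A rough sketch of this ranking, assuming a TF-IDF style weighting so that rare words count more. The smoothing formula, the choice to include the query when counting document frequencies, and the names `tfidf_vectors`, `cosine` and `rank` are illustrative assumptions, not taken from the talk.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Turn each text into a bag-of-words vector, weighting rare words higher."""
    bags = [Counter(text.lower().split()) for text in texts]
    n = len(bags)
    doc_freq = Counter(word for bag in bags for word in bag)
    # Smoothed inverse document frequency (an assumed formula).
    idf = {w: math.log((1 + n) / (1 + df)) + 1.0 for w, df in doc_freq.items()}
    return [{w: tf * idf[w] for w, tf in bag.items()} for bag in bags]

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return document positions, smallest angle (largest cosine) first."""
    vectors = tfidf_vectors(docs + [query])
    query_vec = vectors[-1]
    scores = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors[:-1])]
    return [i for _, i in sorted(scores, key=lambda pair: (-pair[0], pair[1]))]

# Hypothetical documents.
docs = ["casino poker games", "college computer science", "poker strategy guide"]
order = rank("casino poker", docs)
```

Sorting by decreasing cosine is the same as sorting by increasing angle, since the weights are never negative.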

  How To Spam?

  1st Generation: How to Spam

     “Keyword stuffing”:
     Add keywords, text, to increase content similarity
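
To see why stuffing works against a purely content-based ranker, here is a small sketch using raw word counts and the cosine measure. The page texts and the number of repetitions are made up for illustration.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between two texts treated as bags of words."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0

query = "casino poker"
honest_page = "family hotel with a small casino"
# The spammer appends the query keywords many times (20 is an arbitrary choice).
stuffed_page = honest_page + " casino poker" * 20

honest_score = cosine(query, honest_page)
stuffed_score = cosine(query, stuffed_page)
```

The stuffed page scores far higher for the query even though it says nothing new to a human reader.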

Page stuffed
with casino-
2nd Generation: Add Popularity

  A hyperlink
  from a page in site A
  to some page in site B
  is considered a popularity vote
  from site A to site B

  Rank similar documents
  according to popularity
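
The vote counting can be sketched as follows. It assumes that a "site" is the host part of a URL, that links inside one site do not count, and that each voting site counts once; none of these details are stated in the talk, and the URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

def site_popularity(links):
    """Count, for each site, how many distinct other sites link to it."""
    voters = defaultdict(set)
    for source_url, target_url in links:
        source_site = urlparse(source_url).netloc
        target_site = urlparse(target_url).netloc
        if source_site != target_site:  # assumed: internal links are not votes
            voters[target_site].add(source_site)
    return {site: len(sources) for site, sources in voters.items()}

# Hypothetical (source page, target page) hyperlinks.
links = [
    ("http://a.example/p1", "http://b.example/x"),
    ("http://a.example/p2", "http://b.example/y"),
    ("http://c.example/", "http://b.example/x"),
    ("http://b.example/x", "http://b.example/y"),
    ("http://b.example/x", "http://a.example/p1"),
]
popularity = site_popularity(links)
```

Documents with similar content scores could then be ordered by the popularity of the site they belong to.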

  How To Spam?                       
 2nd Generation: How to Spam
     Create “Link Farms”:
     Heavily interconnected sites under one owner spam popularity

sites owned by
promote main site
3rd Generation: Add Reputation…
 The reputation “PageRank” of a page Pi =
 the sum
      of a fraction of the reputations
              of all pages Pj that point to Pi

 Idea similar to academic co-citations

 Beautiful Math behind it
     PR = principal eigenvector
      of the web’s link matrix
     PR equivalent to the chance
      of randomly surfing to the page
 HITS algorithm tries to recognize
    “authorities” and “hubs”
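
The "sum of a fraction of the reputations" can be sketched with a simple power iteration. The damping factor of 0.85, the fixed iteration count, and the handling of pages without outgoing links are assumptions chosen for illustration, not details given in the talk.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes an equal fraction of its reputation to each target.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumed: a page with no links spreads its reputation evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical three-page web: A and B point to C, C points back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
```

In this toy graph C collects two votes and ends up with the highest reputation, A inherits most of C's reputation, and B, which nobody links to, keeps only the random-jump share.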

 How To Spam?
3rd Generation: How to Spam

 Organize Mutual Admiration Societies:
 “link farms” of irrelevant reputable sites
Mutual Admiration Societies
via Link Exchange
An Industry is Born
  “Search Engine Optimization” Companies
  Advertisement Consultants
3rd Generation: Reputation & Anchor Text

     Anchor text tells
     you what the
     reputation is about

     [Figure: Page A links to Page B through anchor text.
      Example text on linking pages:
        “Armonk, NY-based computer giant IBM announced today”
        “Joe’s computer hardware links”
        “Big Blue today announced record profits for the quarter”]

     How To Spam?
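
A minimal sketch of how anchor text could be credited to the page a link points to, rather than to the page containing the link. The site names, the triple format and the function names are hypothetical.

```python
from collections import defaultdict

def index_anchor_text(links):
    """Credit each anchor-text word to the target page of the link."""
    index = defaultdict(lambda: defaultdict(int))
    for _source, anchor, target in links:
        for word in anchor.lower().split():
            index[word][target] += 1
    return index

def lookup(index, word):
    """Return target pages for a word, most frequently anchored first."""
    targets = index.get(word.lower(), {})
    return sorted(targets, key=lambda t: (-targets[t], t))

# Hypothetical (source page, anchor text, target page) triples.
links = [
    ("news.example/a", "computer giant IBM", "ibm.example"),
    ("joe.example", "computer hardware", "ibm.example"),
    ("news.example/b", "Big Blue", "ibm.example"),
    ("blog.example", "computer repair", "repair.example"),
]
anchor_index = index_anchor_text(links)
computer_pages = lookup(anchor_index, "computer")
```

Note that the target page never has to contain the word itself, which is what makes this signal open to manipulation by anyone who controls enough linking pages.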
“Google-bombs” spam Anchor Text
 Business weapons
     “more evil than satan”

 Political weapon in pre-election season
     “miserable failure”
     “waffles”
     “Clay Shaw” (+ 50 Republicans)

     Promote steroids
     Discredit AD/HD research

 Activism / online protest
     “Egypt”
     “Jew”

 Other uses we do not know?
     “views expressed by the sites in your results are not in any way
      endorsed by Google…”
   … mostly for political purposes

                                    “miserable failure” hits
                                    Obama in January 2009

Activists openly collaborating to
Google-bomb search results of
political opponents in 2006
Search Engines vs Web Spam

 Search Engine’s Action                Web Spammers’ Reaction

 1st Generation: Similarity            Add keywords so as
      Content                          to increase content similarity

 2nd Generation: + Popularity          + Create “link farms” of heavily
      Content + Structure              interconnected sites

 3rd Generation: + Reputation          + Organize “mutual admiration
                 + Anchor Text         societies” of irrelevant reputable
      Content + Structure + Value      sites
                                       + Googlebombs

 4th Generation (in the Works)         Can you guess what
      Ranking based on the user’s      they will do?
      “need behind the query”

 Is there a pattern on how to spam?
And Now For Something Completely(?) Different

   Propaganda:
       Attempt to modify human behavior,
         and thus influence people’s actions
                   in ways beneficial to propagandists

   Theory of Propaganda
       Developed by the Institute for Propaganda Analysis 1938-42

   Propagandistic Techniques (and ways of detecting propaganda)
       Word games - associate good/bad concept with social entity
          Glittering Generalities — Name Calling
       Transfer - use special privileges (e.g., office) to breach trust
       Testimonial - famous non-experts’ claims
       Plain Folk - people like us think this way
       Bandwagon - everybody’s doing it, jump on the wagon
       Card Stacking - use of bad logic
     Societal Trust is (also) a Graph

  Web Spam:
Attempt to modify the Web Graph,
and thus influence users through search engine results
in ways beneficial to web spammers

                                  Propaganda:
                                  Attempt to modify the Societal Trust Graph
                                  and thus influence people
                                  in ways beneficial to propagandists
Web Spammers as Propagandists
 Web Spammers can be seen as
    employing propagandistic techniques
           in order to modify the Web Graph
 There is a pattern on how to spam!
Propaganda in Graph Terms
 Word Games                    Modify Node weights
    Name Calling                 Decrease node weight
    Glittering Generalities      Increase node weight
 Transfer                      Modify Node content + keep weights
 Testimonial                   Insert Arcs b/w irrelevant nodes
 Plain Folk                    Modify Arcs
 Card stacking                 Mislabel Arcs
 Bandwagon                     Modify Arcs
                               & generate nodes
Anti-Propagandistic Lessons for the Web
  How do you deal with propaganda in real life?

  Backwards propagation of distrust
  The recommender of an untrustworthy
  message becomes untrustworthy

  Can you transfer this technique to the web?
An Anti-Propagandistic Algorithm
Start from untrustworthy site s
S = {s}
Using BFS for depth D do:
    Find the set U of sites
     linking to sites in S
     (using the Google API
     for up to B backlinks per site)
    Ignore blogs, directories, .edu sites
    S = S ∪ U
Find the bi-connected component
BCC of S
     that includes s

     BCC shows multiple paths
     to boost the reputation of s
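
The steps above can be sketched in plain Python under several assumptions. The `backlinks` dictionary stands in for the Google API queries, the `skip` predicate is a hypothetical filter for blogs, directories and .edu sites, links are treated as undirected when computing components, and when s belongs to several bi-connected components the largest one is taken. The recursive search is only suitable for small graphs.

```python
def explore_backlinks(start, backlinks, depth, max_backlinks, skip):
    """Breadth-first walk against the direction of links, `depth` levels deep."""
    seen, frontier, edges = {start}, [start], set()
    for _ in range(depth):
        next_frontier = []
        for site in frontier:
            for linker in backlinks.get(site, [])[:max_backlinks]:
                if skip(linker) or linker == site:
                    continue
                edges.add(frozenset((linker, site)))
                if linker not in seen:
                    seen.add(linker)
                    next_frontier.append(linker)
        frontier = next_frontier
    return seen, edges

def biconnected_components(adj):
    """Vertex sets of the bi-connected components (depth-first, edge stack)."""
    index, low, stack, components = {}, {}, [], []

    def visit(v, parent):
        index[v] = low[v] = len(index)
        for w in adj[v]:
            if w not in index:
                stack.append((v, w))
                visit(w, v)
                low[v] = min(low[v], low[w])
                if low[w] >= index[v]:  # v separates w's subtree from the rest
                    component = set()
                    while True:
                        edge = stack.pop()
                        component.update(edge)
                        if edge == (v, w):
                            break
                    components.append(component)
            elif w != parent and index[w] < index[v]:
                stack.append((v, w))
                low[v] = min(low[v], index[w])

    for v in adj:
        if v not in index:
            visit(v, None)
    return components

def find_bcc(start, backlinks, depth, max_backlinks, skip=lambda site: False):
    """Return (BCC containing start, periphery = explored sites outside it)."""
    sites, edges = explore_backlinks(start, backlinks, depth, max_backlinks, skip)
    adj = {site: set() for site in sites}
    for edge in edges:
        a, b = tuple(edge)
        adj[a].add(b)
        adj[b].add(a)
    candidates = [c for c in biconnected_components(adj) if start in c]
    bcc = max(candidates, key=len) if candidates else {start}
    return bcc, sites - bcc

# Hypothetical backlink data: site -> sites that link to it.
backlinks = {"s": ["a", "x.edu", "b"], "a": ["b", "c"], "b": ["a"], "c": ["d"]}
bcc, periphery = find_bcc("s", backlinks, depth=3, max_backlinks=10,
                          skip=lambda site: site.endswith(".edu"))
```

In this toy neighborhood, s, a and b link to each other in a cycle and form the BCC, while c and d hang off it as periphery.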
BCC vs Periphery
Since the BCC reveals
multiple paths to boost the
reputation of s,
we expect it to contain
a higher percentage of
untrustworthy sites

The Periphery of the BCC,
on the other hand,
should have
a significantly lower percentage
of untrustworthy sites

[Figure: the BCC surrounded by its Periphery]
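
The comparison comes down to two percentages, which can be computed as in the sketch below. The site names and labels are invented purely to show the arithmetic and are not results from the talk; the function name and the label strings are also assumptions.

```python
def untrustworthy_share(sites, labels):
    """Fraction of the labeled sites in `sites` that were judged untrustworthy."""
    judged = [labels[site] for site in sites if site in labels]
    if not judged:
        return 0.0
    return sum(1 for label in judged if label == "untrustworthy") / len(judged)

# Invented example: manual judgments for five explored sites.
labels = {
    "s": "untrustworthy",
    "a": "untrustworthy",
    "b": "trustworthy",
    "c": "trustworthy",
    "d": "trustworthy",
}
bcc_sites = {"s", "a", "b"}
periphery_sites = {"c", "d"}

bcc_share = untrustworthy_share(bcc_sites, labels)
periphery_share = untrustworthy_share(periphery_sites, labels)
```

Sites without a judgment are left out of both the numerator and the denominator, which is one possible convention among several.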
Explored neighborhoods
Evaluated Experimental Results
 The trustworthiness of the
 starting site is a very
 good predictor for the
 trustworthiness of the BCC

 The BCC is significantly
 more predictive of
 untrustworthiness than
 the Periphery
Living in Cyberspace

  Critical Thinking, Education
     Realize how we know what we know
     “Of course it’s true; I saw it on the Internet!”

  Cyber-social Structures that mimic Societal ones
     Know why to trust or distrust
     Who do you trust on a particular subject?

  A Search Engine per Browser
     Easier to fool one search engine than to fool millions of readers
     Enable the reader to keep track of her trust network
     Tools of cyber trust
     How would you avoid the Emulex hoax?
         Thank You!
How (not) To Solve The Problem
Link Farms vs MAS
