					New challenges of Search Engines

           Katarzyna Wegrzyn-Wolska

                 katarzyna.wegrzyn@esigetel.fr


                       ESIGETEL
 Ecole Supérieure d'Ingénieurs en Informatique et Génie des
                    Télécommunications
    Outline

        Introduction
        SE : general problems
             Future of Search
             Economics of SE
             Search Quality
             Personalisation and Profiling
             Privacy and Search Engines
             Intellectual Property and Copyright
             Detecting Spam Indexing
             Multimedia Search
             Mobility, Local and Social Media
        Conclusion


Bratislava, 31 March 2008    Katarzyna Wegrzyn-Wolska
    Introduction

Subject:
   New challenges of Future Search Engines

Motivation
   Importance of topic ...
   Participation in European Commission projects
        Expert in FP6 & FP7
Objectives
   Discuss the Problems

        How to search and evaluate the data ?
Solutions or ... ?


          SE a Key Enabling Technology

85% of all Internet traffic comes from search engines
10 Bn text pages accessible through SE like Google
9.6 Bn searches (US, December 2007)
   up 15% over the previous year (Google: 30%)
   total: 113 billion searches in 2007
SE ADS:
   > 10 Bn € worldwide today, expected 22 Bn € in 2010
   seen as extremely cost-effective by business players, with clear and
   measurable ROI (Return on Investment)
Cultural data: digital libraries (indexed by powerful search
tools)
SE: key to ensuring cultural and language diversity

      Other problems ….

Multiplicity of data formats and indexing:
    text, images, audio, 3D...
Integration of other technologies:
    satellite/airplane pictures (e.g. Google Earth);
New forms of data exchange:
    peer-to-peer vs client-server;
Integration with protected/encrypted formats:
   DRM interfaces;
User-tagged networks of knowledge;
Personalization according to user search “history”




                  Search Landscape in 2007

                              Enid Burns, Search Engine Watch, Feb 1, 2008

Three major “Mainframes”
   Google, Yahoo, and MSN
>800 M searches daily
   60% international
   106 machines
$20 Bn in Paid Search Revenues
Large indices
   Billions of documents
   Petabytes of data

[Figure: US web search share, NetRatings, August 2007. Source: Search Engine Land]



    Power of GOOGLE ?




           What we expect from SE in future


Most challenging (research) issues:
   economic opportunity and markets
   cultural diversity “spectrum”
   data/formats and content explosion in the future
   demands for audiovisual (multimedia)
   mobile search
   impact of user behaviour and the way users interact with
   online information systems
   specialisation versus generic search models and technologies
   (vertical search):
      (e.g. in the context of specific application environments such as
      health or education)



          Future of Search

                            John Battelle: SearchBlog and Battelle Media
SE as a Platform
… and more than a Platform
   a new interface to computing
   beginning of a new customer-driven culture
Rise of Conversational Media
   users interact with services…
   conversational models (conversation economy)
      business transition to conversational models
      smart companies see an opportunity online…
      possibility to have a conversation with the customers…
Web 2.0 : Architecture of Participation
   user-generated content
      the force of many to create advantage and build network effects


      Future of Search : SE as a Platform

                   John Battelle: SearchBlog and Battelle Media

SE a new Platform :
   Remember DOS?


   After DOS….. Windows ...
   And now ?
       Search is an Interface



   In future ?
   New platform to computing ?




          Future of Search : SE as a Platform

Platform to computing
   Like Spotlight (Mac OS)




               Future of Search : Conversation

                           John Battelle: SearchBlog and Battelle Media


[Figure: participants (millions) and industry size ($ billions), 1970-2010, growing through three stages: talk with the back office; talk between front and back office; talk with customers (Web 2.0…)]
       Economics of SE

                  Prof. Hal Varian; Chief Economist, Google, and
                  Professor at UC Berkeley


What services do search engines provide?
   Google as matchmaker
       Matches up those seeking info to those having info
       Matches up buyers with sellers
Ads are highly effective due to high relevance
   But even so, advertising still requires scale
       2% of ads might get clicks
       2% of clicks might convert
       So only 4 out of ten thousand who see an ad actually buy
       so the price per click (PPC) will not be large
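
The funnel arithmetic on this slide can be checked with a short sketch. The two 2% rates come from the slide; the function names and the margin per sale used to bound a break-even price per click are hypothetical values chosen only for illustration.

```python
def buyers_per_impressions(click_rate, conversion_rate, impressions):
    """Expected number of buyers among `impressions` ad views."""
    return impressions * click_rate * conversion_rate


def break_even_price_per_click(conversion_rate, margin_per_sale):
    """Highest price per click an advertiser can pay without losing money."""
    return conversion_rate * margin_per_sale


CLICK_RATE = 0.02        # from the slide: 2% of ads might get clicks
CONVERSION_RATE = 0.02   # from the slide: 2% of clicks might convert
ASSUMED_MARGIN = 25.0    # hypothetical profit per sale, in any currency

buyers = buyers_per_impressions(CLICK_RATE, CONVERSION_RATE, 10_000)
max_ppc = break_even_price_per_click(CONVERSION_RATE, ASSUMED_MARGIN)

print(f"{buyers:.0f} buyers per 10,000 ad views")
print(f"break-even price per click: {max_ppc:.2f}")
```

With such a small share of viewers buying, an advertiser needs very many impressions to make sales, which is why the slide says advertising still requires scale.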



             Economics of SE

Google
   Brin & Page tried to sell algorithm to Yahoo for $1 million
       they wouldn’t buy
   Formed Google with no real idea of how they would make money
   Put a lot of effort into improving algorithm
      Availability of real time data allows for fine tuning, constant
      improvement:
           each query is tested on 4000 new algorithms (Google)
Why online businesses are different
   Online businesses can continually experiment
      Japanese term: kaizen = “continuous improvement”
      Hard to really do continuously for offline companies
           Manufacturing, Services
      Very easy to do online
           Leads to very rapid (and subtle) improvement
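
One common way to run such continuous online experiments is an A/B test on click-through rate. The slide does not say which method any search engine uses, so the sketch below is only an assumption: it compares two variants with a standard two-proportion z-test, and all counts are made up.

```python
import math


def two_proportion_z(clicks_a, views_a, clicks_b, views_b):
    """z statistic for the difference in click-through rate (B minus A)."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    # Pooled rate under the hypothesis that both variants perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    return (p_b - p_a) / std_err


# Hypothetical experiment: current ranking (A) vs. a tweaked ranking (B).
z = two_proportion_z(clicks_a=2000, views_a=100_000,
                     clicks_b=2150, views_b=100_000)

# |z| > 1.96 corresponds to significance at the usual 5% level (two-sided).
significant = abs(z) > 1.96
print(f"z = {z:.2f}, significant: {significant}")
```

Because an online service sees large volumes of traffic, even a small improvement like this one can be detected quickly, which is harder for offline manufacturing or services.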

          Search Quality


What’s the Goal?
  User Satisfaction
      Understand user intent
           Problems: Ambiguity and Context
      Generate relevant matches
           Problems: Scale and accuracy
      Present useful information
           Problems: Ranking and Presentation
Quality Dimensions
  Ranking
  Freshness
  Presentation



    Search Quality


                             Dr. Jan Pedersen; Yahoo Search


                             Eye Tracking Studies
                                 Golden Triangle
                                      Top left corner
                                 Quick scan
                                      For candidate
                                 Longer scan
                                      For relevance




     Search Quality

                      Dr. Daniel Russell; Google


What do people think when searching?
   “Jaguar” - Mac OS? car? cat?
        In Central America: more likely the cat (probably not the car)
        a good response is a personalisation problem
   Special studies:
        how people think
        mental model
        qualitative reactions
        expectations
        analysing user behaviour:
              e.g. why do 50% of clicks go to the Advanced Search page?




          Personalisation and Profiling

                   Dr. Jaime Teevan: Microsoft Research
User profiling - User Data
   Persistent demographic information (age, gender, zip code, …)
   Dynamic interests (music, travel, …)
   User environment (locations, browser, connection speed, …)
   User transaction history (seasonal purchases, spending
   patterns, …)
   User behavior at the Web site
The means of gathering user data varies widely:
   Static form-based profile:
      input by the user (explicit involvement of the user)
   Dynamic profile:
      automatically derived by the server, based on
      tracking user behaviors (implicit involvement of the user)
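
The two kinds of profile can be sketched as data structures. This is a minimal illustration, not any search engine's actual schema: the class names, field names, and sample values are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class StaticProfile:
    """Explicit profile: values the user typed into a form."""
    age: int
    gender: str
    zip_code: str


@dataclass
class DynamicProfile:
    """Implicit profile: derived by the server from tracked behaviour."""
    interest_counts: Counter = field(default_factory=Counter)

    def record_visit(self, topic: str) -> None:
        # Called by the server each time the user views a page on `topic`.
        self.interest_counts[topic] += 1

    def top_interests(self, n: int = 2) -> List[str]:
        return [topic for topic, _ in self.interest_counts.most_common(n)]


static = StaticProfile(age=34, gender="F", zip_code="94110")
dynamic = DynamicProfile()
for topic in ["music", "travel", "music", "music", "travel", "sports"]:
    dynamic.record_visit(topic)

print(dynamic.top_interests())  # ['music', 'travel']
```

The static profile changes only when the user edits it, while the dynamic profile changes with every tracked action and needs no involvement from the user.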


         Privacy and Search Engines

                   Chris Jay Hoofnagle; Samuelson Clinic; Berkeley Ctr. for
                  Law and Tech

Collecting personal data ?
Search engines mediate access to content
   central point of privacy vulnerability
Search query:
   Access or Retention ?
What is personally identifiable information ?
   Information to identify ?
   Metadata, data about others may identify you too
Personalization to Customization
   Tracking is present, even to sites with “sensitive” topics
   Goal : to present ads across multiple platforms (desktop, laptop, xbox)




            Privacy and Search Engines


AOL Query Search
  AOL published 20M queries from 600,000 users
  (users are uniquely numbered)
  Uncensored queries from three months of the AOL search service,
  spring 2006
  Essentially public domain
  Contains dangerous private information
        Some users are easy to identify
              Users vanity-searched their own names, SSNs

Ex. grep for credit-card patterns produces the following:
     grep -i -e "[0-9]\{4\}-[0-9]\{4\}-[0-9]\{4\}-[0-9]\{4\}" *.txt
          * 9006-0512-xxxx-xxx
          * 1550-0905-xxxx-xxxx

             Privacy and Search Engines

Looking for Social Security numbers (SSN)
       grep -i -e "\b[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}\b" *.txt


  * kristy nicole vega hammond la. social secruity number 437-67-xxxx birth date
03 08 xx drivers license number la. 00765xxxx address 41178 rene dr. hammond
la.
 * pamela button 079-60-xxxx
 * thomas j finney socsec 370-40-xxxx
 * 419-94-xxxx thomas black
 * 458-87-xxxx seguro social


Grep for email addresses
    ([a-zA-Z0-9_\-]*@[a-zA-Z0-9_\-]*\.)
    turns up another 60 results
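
The same kind of scan can be sketched in Python. The regular expressions below are adapted from the grep patterns on these slides, not copied: the email pattern is tightened so that it cannot match an empty name. The function name and the sample queries are invented for illustration and do not come from the AOL data.

```python
import re

# Patterns adapted from the slides' grep commands (assumed equivalents).
PII_PATTERNS = {
    "credit_card": re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"[A-Za-z0-9_.-]+@[A-Za-z0-9_-]+\.[A-Za-z]{2,}"),
}


def find_pii(queries):
    """Return (label, query) pairs for every query matching a PII pattern."""
    hits = []
    for query in queries:
        for label, pattern in PII_PATTERNS.items():
            if pattern.search(query):
                hits.append((label, query))
    return hits


# Made-up queries; the numbers are placeholders, not real identifiers.
sample_queries = [
    "cheap flights to paris",
    "jane doe 123-45-6789",
    "card 1111-2222-3333-4444 balance",
    "mail someone@example.com",
]

for label, query in find_pii(sample_queries):
    print(f"{label}: {query}")
```

A pattern match only flags a candidate: a string shaped like an SSN is not necessarily one, so results like these would still need manual review.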



         Privacy and Search Engines




Google’s Search Policy
  Source: Search Privacy Practices: A Work In Progress, CDT Report
  - August 2007




          Intellectual Property and Copyright

                 Jason Schultz, Intellectual Property Attorney,
                Electronic Frontier Foundation (EFF)


Copyright threats to Search
   Search Engines copy, index, and distribute information to
   millions of people
   What about Spiders, Linking, Images, Books ?
Search Engine strategies
   Implied permission; linking, not hosting (for the most part)
Linking to copyrighted works is generally not an infringement,
unless
   you knew the link leads directly to infringing material



          Intellectual Property and Copyright

Image Search




          Copyright: Image Search

Copyright Issues in Image Search
   Capturing image,
   Making and storing thumbnail
   Displaying thumbnails in response to keyword searches
   Providing Link to original picture page
Is it legal ? - Perfect 10 v. Google
   Google says:
       They spider everything
       They can’t tell who’s infringing until somebody notifies them
       It’s a fair use to make an image directory
       Image search is important public resource
   Court says:
       First decision: it's legal
       P10 Opposition (opinion amended on December 3, 2007):
             "Image Search" tool illegally reproduced and displayed P10
             photos when it returned thumbnail results and framed third-
             party websites in response to search terms


          Intellectual Property and Copyright



Google Book Search




3 kinds of books:
   classic, fully in the public domain, out of copyright
   in copyright, with the publisher's permission to index
   in copyright, without the publisher's permission to index


         Copyright: Google Book Search

Authors Guild v. Google
   The Authors Guild says:
      We sell books
      You borrowed books from the libraries and copied them without
      paying us
      You make money
      We want money
      Pay us
      This will help you sell books
   Google says:
      We had to copy books to make an index
      No one sees > a few lines at a time
      We link to where you can buy/borrow
      Book search is important to public access
      This will help you sell books

         Detecting Spam Indexing

                   Dr. Marc Najork: Microsoft Research
Only highly placed sites in SE results (for some queries)
benefit from SE referrals
How to increase SE referrals:
   Buy keyword-based advertisements
   Improve the ranking of your pages
      Provide genuinely better content, or
      “Game” the system
   SEO business (Search Engine Optimization)
      Some SEOs are ethical
      Some are not …
Taxonomy of web spam techniques :
   Keyword stuffing, link spam, cloaking
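
Of the three techniques, keyword stuffing is the easiest to illustrate. The sketch below is a deliberately naive heuristic, not the detection method presented in the talk: it flags a page when a single word makes up a large share of its text. The function names and the 0.3 threshold are hypothetical.

```python
import re
from collections import Counter


def keyword_density(text):
    """Return the most frequent word and its share of all words in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return "", 0.0
    word, count = Counter(words).most_common(1)[0]
    return word, count / len(words)


def looks_stuffed(text, threshold=0.3):
    """Naive check: one word dominating the page suggests keyword stuffing."""
    _, density = keyword_density(text)
    return density >= threshold


stuffed = "cheap flights cheap flights cheap cheap cheap deals cheap"
normal = "the museum opens at nine and closes at five on weekdays"

print(keyword_density(stuffed), looks_stuffed(stuffed))
print(keyword_density(normal), looks_stuffed(normal))
```

A single threshold like this is easy to evade and can misfire on short pages, so it would only be one signal among many. Link spam and cloaking need different evidence, such as the link graph or a comparison of what crawlers and browsers are served.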


         Multimedia Search


                    Dr. Lynn Wilcox: FXPal
What is Multimedia?
   ACM Special Interest Group on Multimedia 2003
       More than one media (text, images, audio, video) that are
       correlated
       Examples:
            Time correlated: Video with text transcript of the audio
            Spatially correlated: Images on a page with associated text
   A less strict definition:
       Not “Just” Text : Images, Audio, Video
Interface to Search:
   Images, Audio, Video



              Multimedia Search

Text Search
   Keywords
Image Search
   Search based on tags (Flickr, Facebook)
   Search based on surrounding text (Google)
   Content based search
      Using image features
      Using faces
Audio Search
   Search based on metadata (iTunes)
   Content based search (MuscleFish, Foote)
Video Search
   Search based on text (Google/YouTube)
                                                               MediaMagic
   Search based on associated media (Lectures with slides)
   Search based on content (TrecVid News Search)
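
Tag-based search, the simplest of the approaches listed, can be sketched as an inverted index from tags to media items. This is only an illustration of the idea, not how any of the named services works; the function names, file names, and tags are hypothetical.

```python
from collections import defaultdict


def build_tag_index(tagged_items):
    """Map each lower-cased tag to the set of item ids carrying it."""
    index = defaultdict(set)
    for item_id, tags in tagged_items.items():
        for tag in tags:
            index[tag.lower()].add(item_id)
    return index


def search_by_tags(index, query_tags):
    """Return the items that carry every tag in `query_tags`."""
    matches = [index.get(tag.lower(), set()) for tag in query_tags]
    if not matches:
        return set()
    return set.intersection(*matches)


# Hypothetical user-tagged photos.
photos = {
    "img1.jpg": ["Jaguar", "cat"],
    "img2.jpg": ["jaguar", "car"],
    "img3.jpg": ["cat", "kitten"],
}

index = build_tag_index(photos)
print(search_by_tags(index, ["jaguar", "cat"]))  # {'img1.jpg'}
```

The search never looks at the pixels, so its quality depends entirely on the tags users supply. Content-based search instead compares image, audio, or video features directly.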

          How about Mobility & Mobile Search?

~2 Bn mobile users today, 1.5 Bn GSM users worldwide (3 Bn in 2010)
~75% of terminals equipped with Internet access in the medium term
mobility imposes very specific search needs
content search & other technologies (location)
heterogeneous mobile-fixed environments
Mobile search
   iPhone: mobile traffic has become a real possibility for real-time
   search needs.
   WML (Wireless Markup Language)
      real chance to thread local search into mobile media needs.



         Local Search

Local search
   30 % of all search engine queries contained a zip code, city
   name, or state.
   local needs and mobile search:
      potential to turn local search into a "modern Yellow Pages" in
      real-time.
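
A statistic like the 30% above presupposes some way of recognising a local query. The sketch below shows one simple possibility, assuming US-style ZIP codes and a tiny hand-written list of place names; it is not the method behind the figure on the slide, and every name and sample query in it is hypothetical.

```python
import re

ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")   # US ZIP or ZIP+4
KNOWN_PLACES = {"boston", "chicago", "texas", "ohio"}  # toy gazetteer


def has_local_intent(query):
    """True if the query contains a ZIP code or a known place name."""
    if ZIP_RE.search(query):
        return True
    tokens = re.findall(r"[a-z]+", query.lower())
    return any(token in KNOWN_PLACES for token in tokens)


def local_share(queries):
    """Fraction of queries that show local intent."""
    if not queries:
        return 0.0
    return sum(has_local_intent(q) for q in queries) / len(queries)


sample_queries = [
    "pizza 60614",
    "plumber chicago",
    "python tutorial",
    "weather forecast",
]

print(local_share(sample_queries))  # 0.5
```

A realistic version would need a full gazetteer and handling of multi-word place names. On a mobile device the handset's own location could stand in for a typed place name.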




           Social Media Explosion

Facebook




                                          http://www.visualcomplexity.com

          Social Media: Explosion

Facebook
   And other …




   How do they earn money ?
      access to the data for ad
      targeting purposes




                 Social Media: Explosion of the Blogs

   > 60 million blogs

[Figure: The Hyperbolic Blogosphere 2007, Matthew Hurst, http://tinyurl.com/2nbwo6]
   link connections:
       green: one-way
       blue: reciprocal
   white dots: individual blogs
1 - DailyKos 500K/day
2 - Boingboing
3 - LiveJournal (isolated community)
4 - “blue blob”: balanced discourse (most links are reciprocal)
5&6 - “outlying blue island”


          SE : 2007 Predictions and Scorecard

                 Sage Lewis, Search Engine Watch, Jan 17, 2008
                 ReadWriteWeb, Web Marketing Watch, WebProNews


Top SE year-end predictions for 2007 :
   RSS will go mainstream in a big way
   The explosion of widgets
   Semantic Web products (Twine)
   Browser wars between IE7 and Firefox
   Virtual world businesses
   AOL acquired
   And most of all: the social revolution!
2007 Scorecard:
   interesting and thoughtful

       Search Marketing Predictions for 2008

                   Kevin Newcomb, Search Engine Watch, Jan 23, 2008

Local search starts to make an impact
Social search will finally be useful
    not just for friends
Education and training will be important
Google's policy on privacy issues
"People-driven", "brain-power" SE: success or failure ?
    Cha-Cha, Mahalo
Vertical searches will have shake-ups (maybe health?)
Widgets: another online presence
   now just like a website, a blog, or a social page
China's growing participation in global search share
   Baidu or even Google and Yahoo China.
Yahoo will be someone to watch in 2008


       Search Engine : General Predictions




What will be the future
scorecard ?

Who knows the answers ?




Thank you very much
 for your time today
     and for your
      attention
