Searching and the Web

      Information jungle on the Web:
   finding and evaluating information sources

             Tefko Saracevic, PhD
              Rutgers University
                tefko@scils.rutgers.edu
 http://www.scils.rutgers.edu/people/faculty/tefko.html
                    Web & information:
                      key problems
    SEARCHING the Web for information
    Retrieving a MANAGEABLE AMOUNT
    Selecting the most RELEVANT sources
    EVALUATING sources & information
    Three laws for information on the Web:
    1. EVALUATE
    2. EVALUATE
    3. EVALUATE
Tefko Saracevic, Rutgers University                 2
              Characteristics of
           information on the Web
        VARIETY - amazing
              rich source on myriad topics & subjects
        DISTRIBUTION - all over, global
              information scattered across great many sites
        LINKAGE - many hyperlinks, hypertexts
                    elaborate web of connections, paths, and mazes
        AMOUNT - huge, growing exponentially
              millions of sites, billions of pages
         Characteristics … (cont.)
     CONTENT VALUE NEUTRAL - anything goes
           no control of content
           some accurate, trustworthy, verifiable
           some biased, self-serving, propaganda, promotional
           some false accidentally
           some false deliberately, some even with evil intent


                                 Thus, the three Web laws
                           Size of the Web
   Over 16 million web servers; 800 million pages
               83% commercial; 6% scientific or educational; 3% health;
               2.5% personal; 2% societies; 1.5% government;
               about 1% each for community and religion; 1.5% pornographic
               Growth 1997-99: public sites +179%
               Countries of origin:
                     U.S. 55% (59% in 1997), Germany 6%, Canada 5%, UK 5%, Japan 3%,
                      Australia, Brazil, France, Italy 2% each, all others 18%
               Languages: 80% English (84% in 1997)
              US sites & English language predominate, but % falling steadily
   Sources: Lawrence & Giles, Nature (1999): http://www.wwwmetrics.com/
                  OCLC Web Characterization Project
                  http://oclc.org/oclc/research/projects/webstats/index.htm
      Organization of Web sites
        Metatags - to enable retrieval by fields - low use
              HTML “keywords”, “description”
                    34% of sites use them
              Dublin Core - 0.3% of sites use it
        No standardization across sources
        Classification a predominant approach
              many types used
        Lack of organization major hindrance to retrieval
              also faked contents to force retrieval
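
The HTML "keywords" and "description" metatags mentioned above can be read with a few lines of code. The following is a minimal sketch, assuming Python's standard html.parser module; the names MetaTagParser, extract_meta and sample_page are hypothetical and the sample page is made up for illustration.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect the content of named <meta> tags (e.g. keywords, description)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Only <meta name="..." content="..."> tags are of interest here.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name")
        content = attr.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


def extract_meta(html):
    """Return a dict mapping metatag names to their content strings."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta


# Hypothetical page, invented for this example.
sample_page = """<html><head>
<title>Sample page</title>
<meta name="keywords" content="web searching, evaluation, search engines">
<meta name="description" content="Notes on finding and evaluating Web sources.">
</head><body></body></html>"""

fields = extract_meta(sample_page)
print(fields.get("keywords"))
print(fields.get("description"))
```

A page without these tags simply yields an empty dictionary, which is the common case when only about a third of sites use them.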
   Comparison: Web & library
   or information retrieval searching
        SIMILARITIES in searching
              Basic principles of approach are the same
                    human-human interaction - mediated or introspection
                            to determine content, explore information need for a task
                    preparation of search concepts, terms, logic
                    determination of range, restrictions
                    estimation of relevance




                                      Differences
        Vastly different sources
              as to contents, authority, reliability, persistence
              variation in amounts, depth, breadth
        Very different organization
              little standardization, few if any fields
        Quite different search engines
        Differing search strategies needed
        Presence of many links; complex connections
        Evaluation more complex
   Needed for Web searching
        Knowledge & competencies
              about great variety of sources
              great variety in their organization
              search engines
              search strategies; search dynamics
              exploring & exploiting links & networks
              keeping up: constant changes, innovations
              Web economics - no such thing as free lunch
        Effectiveness proportional to that knowledge
              Criteria for evaluation
        http://www.otterbein.edu/learning/libpages/subeval.htm

  Authority
        Author - possible bias? Publisher - reputation?
                    Professional society? Academic source?
              Reason on the Web?
                    Vanity pages? Sponsor? Advocacy association?
        Domain name - who put up the site?
  Accuracy - possible independent verification? Sources?
  Currency - verification
  Prior review, experiences - checking review sources
  Critical thinking & constant verification
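
As a first, rough step in the "Domain name" check above, the host and top-level domain of a Web address can be pulled out automatically. This is only a sketch, assuming Python's standard urllib.parse module; the TLD_HINTS table and the domain_hint function are hypothetical, and the hints are illustrative, not an authoritative classification. A domain ending never replaces the evaluation criteria themselves.

```python
from urllib.parse import urlparse

# Illustrative hints only (an assumption for this sketch), not an
# authoritative mapping of who stands behind a site.
TLD_HINTS = {
    "edu": "educational institution",
    "gov": "government",
    "org": "organization (often non-profit or advocacy)",
    "com": "commercial",
}


def domain_hint(url):
    """Return (host, hint) for a URL; the hint is a starting point only."""
    host = urlparse(url).hostname or ""
    # Take the text after the last dot as the top-level domain.
    tld = host.rsplit(".", 1)[-1].lower() if "." in host else ""
    return host, TLD_HINTS.get(tld, "unknown or country-code domain")


for address in (
    "http://www.otterbein.edu/learning/libpages/subeval.htm",
    "http://www.state.gov/",
    "http://www.ccc.govt.nz/Library/Resources/Newspapers/index.asp",
):
    print(domain_hint(address))
```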
  Ways to search & retrieve
        Most popular: search engines
              global, regional, country, specialized engines
        Following links from major sites & portals
                    e.g. from Library of Congress to many libraries
                    from newspapers to archives
        Reference sites - growing numbers
        Library sites - becoming ever richer sources
        Web addresses in print sources, newspapers
        Referrals, emails, bookmarks
  Web sites & search engines
        Indexed by search engines (public sites)
              by keywords, classification, links, registration
         Hard to find
              most domain sources will not be found, e.g. digital
               libraries, online journals, reference sources
              many commercial sites
        Differing approaches to inclusion/selection
              mostly automatic; also generic source providers
              increasingly added human evaluation
         Search engine coverage
No US engine covers more than 16% of the Web
     Very hard to discern coverage
     In respect to combined coverage of 11 top engines:
                 Northern Light 38.3%; Snap 37.1%; AltaVista 37.1%; HotBot 27.1%;
                  MS 20.3%; Infoseek 19.2%; Google 18.6%; Yahoo 17.6%; Excite 13.5%;
                  Lycos 5.9%; EuroSeek 5.2%
                 HotBot, MS, Snap & Yahoo use Inktomi as search provider, but have
                  different filtering & Inktomi databases

Large European engines geared to country coverage
      E.g. Wanadoo (France), T-online (Germany)
           highest use among engines in their countries

            Unique search engines
    Number of specialized engines - looking for niche
          good for scientific, technical, professional searches
          include manual evaluation & selection of sources
    Northern Light has “special collections”
          not found on publicly indexable Web
              http://www.northernlight.com

    Oingo has word associations, evaluations
          includes elaborate classification   http://www.oingo.com/

           Search features among
                  engines
        Some search features are the same across all, but
         details differ - particularly in advanced search
              Boolean available
                    but sometimes AND sometimes OR default
              Differences may be found in:
                    phrases, proximity, truncation, case sensitivity,
                     relevance feedback, field searching, special features
                    some have term expansion to concepts & lists of
                     associated terms (e.g. latent semantic indexing)
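
The effect of the default Boolean operator can be seen in a toy example. This is a minimal sketch, not how any particular engine works: the DOCS collection and the search function are hypothetical, and matching is on whole words only.

```python
# Tiny made-up collection for illustration.
DOCS = {
    1: "evaluating information sources on the web",
    2: "web search engines and coverage",
    3: "library information retrieval",
}


def search(query, docs, default_op="AND"):
    """Return ids of documents matching the query terms.

    default_op decides how several terms typed without an explicit
    operator are combined: "AND" requires all terms, "OR" any term.
    """
    if default_op not in ("AND", "OR"):
        raise ValueError("default_op must be 'AND' or 'OR'")
    terms = query.lower().split()
    if not terms:
        return []
    match = all if default_op == "AND" else any
    hits = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        if match(term in words for term in terms):
            hits.append(doc_id)
    return sorted(hits)


print(search("web information", DOCS, default_op="AND"))
print(search("web information", DOCS, default_op="OR"))
```

The same two-word query retrieves one document under an AND default and all three under an OR default, which is why it pays to check which default an engine uses.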

                  Search strategies &
                       outputs
        Geared toward very short searches
              big majority of searches 2-3 terms (av. 2.5)
              big majority of users view one page only
        Geared toward limited top outputs
        Ranking output by relevance predominates
              relevance calculations differ & are kept secret
        Also heavy & increasing use of classification
        Browsing a big component
                Meta search engines
        Search engines that cover search engines e.g.
              All4one                http://all4one.com/
                    four windows - good for comparison
              Savvy Search                   http://www.savvysearch.com/
                    indicates search engine source
        More on the horizon & differing
        Search Engine Watch http://www.searchenginewatch.com/
              listing, reviews, ratings, tests, resources, tutorials
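
The core idea of a meta search engine, combining result lists from several engines while indicating the source engine, can be sketched as follows. This is an assumption-laden illustration, not the method of All4one or Savvy Search: the merge_results function, the engine names and the example.org addresses are all hypothetical, and no real engine is queried.

```python
def merge_results(results_by_engine):
    """Merge ranked URL lists from several engines into one list.

    Each merged entry records which engines returned the URL and the
    best (lowest) rank it received. URLs found by more engines come first.
    """
    merged = {}
    for engine, urls in results_by_engine.items():
        for rank, url in enumerate(urls, start=1):
            entry = merged.setdefault(
                url, {"url": url, "engines": [], "best_rank": rank}
            )
            if engine not in entry["engines"]:
                entry["engines"].append(engine)
            entry["best_rank"] = min(entry["best_rank"], rank)
    return sorted(
        merged.values(),
        key=lambda e: (-len(e["engines"]), e["best_rank"], e["url"]),
    )


# Made-up result lists standing in for two engines' answers to one query.
sample = {
    "engine_a": ["http://example.org/a", "http://example.org/b"],
    "engine_b": ["http://example.org/b", "http://example.org/c"],
}

combined = merge_results(sample)
for entry in combined:
    print(entry["url"], entry["engines"], entry["best_rank"])
```

Ranking by the number of engines that agree is one simple design choice; real meta search engines differ in how they combine and order results.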

            Reference sites - facts
Reference services & access changing drastically
Several models in reference services:
     Martindale’s Reference Desk - comprehensive
           http://www-sci.lib.uci.edu/~martindale/Ref.html
     Ask Jeeves! - natural language http://www.ask.com/
           over 2 million queries per day; growing 46% per quarter

     Electric Library - membership http://www.elibrary.com/
     Review of several reference sites -
         http://www.libraryjournal.com/articles/multimedia/webwatch/19991101_12593.asp

                                 Reference ...
        Sources … continued
              Information Please - almanacs
                  http://www.infoplease.com/

              Reference Desk - rich http://www.refdesk.com/
              Encyclopedia Britannica
                  http://www.britannica.com/
                    great many cross-references & other sources
              Webhelp - “real people, real answers, real time”
                    live conversation with one of the 1000+
                     “Web wizards” www.webhelp.com

         Libraries as Web sources
    Libraries providing open collections & services
          growth of digital libraries & Web access
          models vary; parts open to all, parts only to own users
    One example, among great many:
          Rutgers libraries - large & long term effort
              http://www.libraries.rutgers.edu/
          various sources & links involved
                e.g. for domain information & sources go to:
                      Electronic Ready Reference Shelf; Research Guides; Social Sciences
                       & Law; Library & Information Science
             Virtual libraries on the
                       Web
    Libraries emerging only on the Web
          More & more libraries & organizations involved
    Examples of libraries rich in sources & links
           Virtual Library - Switzerland, US, UK & other countries,
              started by Tim Berners-Lee, the creator of the Web http://vlib.org
          Toronto Public Library http://vrl.tpl.toronto.on.ca/
          Internet Public Library, Michigan http://www.ipl.org/
          Academic Info - “Gateway to Quality Educational
           Resources.” International http://academicinfo.net/
              New modes of access
Libraries, agencies, companies, developing reference
 & service models - new, rich, innovative e.g.
     For & about children: Los Angeles Public Library - great
      fun! http://www.lapl.org/kidsweb/
     Parenting: Parenttime http://www.parenttime.com/home/homepage.cgi
     Fathom - consortium of six leading institutions in US & UK
            beta testing - top quality research coverage http://www.fathom.com/
     Course on Internet use with links http://www.newbie.org/

                                Domain sites
   Many domain/issue specific sites
         rich & often unique coverage & services
          different approaches & requirements
   Examples in health related domains:
         Medscape - registration required
               http://www.medscape.com/
         Rxlist - The Internet Drug Index
               http://www.rxlist.com/
         Mayo Clinic HealthOasis http://www.mayohealth.org/
       Societies, organizations,
              publishers
   Great many rich sources for searching
         differences in requirements, depth, richness
   Examples from variety of organizations:
         Assoc. for Computing Machinery http://www.acm.org/
               Digital Library; subscription or registration, searchable
         State Department http://www.state.gov/
               about the U.S. & other countries
         R.R. Bowker http://www.bowker.com/
               free sections - Yours for the Asking; Library Resource Guide
         Genealogy: http://www.familysearch.org/
                                  Newspapers
   Various online newspaper models are being explored,
    beyond just putting the print copy on the Web
         subscription; links; archives; more elaborate stories …
         e.g. San Francisco Examiner - http://examiner.com/
               articles, in depth projects, area guide (SF Gate), archive ...
   Finding stories & papers: Excite News Tracker
       http://nt.excite.com/
         Includes: World Newspapers Resources
   Index of some major world newspapers (from New
    Zealand) http://www.ccc.govt.nz/Library/Resources/Newspapers/index.asp
                                      Summary
        Web is:
               rapidly evolving, changing, expanding
              unpredictable, rich, and valuable source
        Knowledge & competencies needed to use it
         effectively, also common sense & flexibility
         Three Web laws always in effect!
        Web economics
              rewards big, but costs significant
                        But … limitations
    The public Web does not have it all
     Many rich resources not accessible without paying
          DIALOG covers many fields & is larger than the Web
          similarly Lexis-Nexis, Data Star, etc.
    Majority of content in libraries is NOT on Web
    Majority of archives, old newspapers NOT on Web

      WEB IS RICH, BUT NOT A BEGINNING & END
             OF INFORMATION SOURCES