Finding Information On The Internet - General by wuyunyi


                Finding Information On The Internet

              M.R. Lalitha
National Centre for Science Information
      Indian Institute of Science
         Bangalore - 560 012
                Finding Information On The Internet
        •    Internet growth
        •    Search tools
        •    Types of Search tools
        •    Current trends in Internet Searching
        •    Steps in Internet searching
        •    Before starting an Internet search
        •    Case studies
        •    Related sources

IISc 22-26 Nov’99              IIRML P9             2
                        Internet Growth
        • A dynamic network of a variety of information sources.
        • Size is unknown and perhaps unknowable.
        • Holds an estimated 550 million documents or more.
        • The number of documents at least doubles each year.
        • Heterogeneous in nature.
        • Large percentage of data is unstructured and unorganised.
        • Contents highly volatile - the websites containing
          information may be removed from the web or may be
          shifted to a new location.

                        Search Tools
        • Imagine the Internet as a huge library with no catalog
          and no staff to assist.
        • To access a site/source on the Internet, one should
          know its URL or address.
        • It is not possible to remember the addresses of all
          the required sites/sources.
        • Information retrieval therefore becomes a major concern.

                      Search Tools…
        • For this purpose, a number of search tools or
          services have been developed.
        • These maintain a directory and/or database of
          Internet resources.
        • They provide a Browse/Search interface to retrieve and
          access the required information.
        • About 2,500 search services are presently
          available on the Web.

                              Search Tools…
        • Both commercial and free search tools are available.
        • Each tool differs from the others.
        • One can maximize the chances of finding the
          required information by understanding
              –     How these databases are constructed,
              –     How sites are indexed,
              –     How best to query them, and
              –     How they report their findings.

                       Types Of Search Tools
        • The search tools can be categorised as:
              –     Directories
              –     Search Engines
              –     Meta Search Engines
              –     Virtual libraries/Subject Gateways
              –     Tools for Finding People

                            Directories
        • Also called Subject Trees or Catalogs.
        • Resources organised in a hierarchy from general
          to specific topic
              – Eg: 'Architecture' will be hierarchically placed as
        • Maintained manually by administrators.
        • New sites are added under the appropriate subject tree
          after manual verification.
        • Outdated documents & dead links are weeded out.
       • Some directories contain reviewed/rated websites
       • Both general & subject-specific directories exist.
       • ADVANTAGES:
             – A broad spectrum of available information can be
               viewed and the most suitable source can be
               selected for perusal.
             – A novice netsurfer can avoid the hassles of Boolean
               searching.
             – Owing to manual selection, the directories are likely to
               provide links to higher quality documents.
             – Saves time: provides quick access.
        • DISADVANTAGES:
              – Database size is comparatively small.
              – Update frequency is low, so the sources may not
                be current.
              – They contain many invalid/dead links.
              – Classification may not match the user's ideas, so
                the information, though available, may go unaccessed.
                    • To overcome this problem, some directories provide a search
                      interface.

        • EXAMPLES
        • General
              –     Yahoo!
              –     Galaxy
              –     Argus Clearinghouse
              –     Magellan

        • EXAMPLES
        • Subject-specific directories:

              – TIPTOP Pilot physics
              – BioSites
              – SciSeek

        • A directory is suitable when
              – The purpose of the search is to get a general idea of the
                topic. Or
              – The searcher is not very clear about what he or she wants;
                browsing will help to focus. Or
              – It is known that a particular piece of information is available
                from a definite source, say a University or an
                Institution, which is likely to be listed in catalogs. Or
              – Information based on geographical or local sources is
                required. Or
              – The searcher is interested in general information on
                entertainment, jobs, advertisements, games.
                     Search Engines
       • Work on the principle that the information
         content of a document can be summarized by
         the words extracted from its title or text.
       • Search engines perform keyword searches
         against the database and retrieve a set of
         webpages matching the query.
       • Some maintain directories also.
       • Specialized search engines to search specific
         subjects/topics are also available.

                            Search Engines…
        • A search engine consists of:
              – A Spider:
                    • which continuously traverses the web and picks up the newly
                      added webpages/ documents
              – The Index:
                    • which indexes the webpages gathered by the spider and builds
                      the database. Indexing is done by extracting the words from
                      the webpages.
              – The Search Interface:
                    • which provides a front-end to the user to specify the
                      query/requirement. The search is performed in the database
                      maintained by the search engine and the retrieved webpages
                      are displayed.
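
The index and search interface described above can be illustrated with a minimal sketch. It assumes the spider has already gathered the pages; the `PAGES` dictionary, its URLs and the function names are hypothetical stand-ins, and a real search engine is far more elaborate.

```python
import re
from collections import defaultdict

# Hypothetical pages standing in for what a spider would have gathered.
PAGES = {
    "http://example.org/a": "Gas insulated substations and switchgear",
    "http://example.org/b": "Neem extracts: market information",
    "http://example.org/c": "Market for gas turbines",
}

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Index: map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Search interface: return the URLs that contain every query word."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

INDEX = build_index(PAGES)
print(sorted(search(INDEX, "gas market")))
```

Only the third page contains both "gas" and "market", so it is the only page retrieved for that query.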

                            Search Engines…
        • While Indexing,
              – Some search engines do full-text indexing: they index all the
                words in the webpage, including the webpage title and
                the URL (E.g.: AltaVista).
              – Some index only the titles & URLs of webpages.
              – Some index the title, metatags and first paragraph of the
                webpage.
                    • (Metatags are presently not recognised by most of the search
                      engines.)

                        Search Engines…
        • While indexing, the types of websites indexed may vary:
              – All types: HTTP, Gopher, Telnet, FTP, Usenet groups
              – Only a particular type, e.g. Dejanews indexes only
                Usenet group postings
              – Types of files: HTML files, audio, video, images

                        Search Engines…
        • Stop words
              – Some search engines ignore certain common words.
        • Spam blockers
              – To increase the relevance score, some web authors
                repeat terms many times in their webpages.
                This is called spamming. Some search engines
                have spam blockers, which check whether a term appears
                more than a fixed number of times. If a robot
                recognises this, it may ignore that page entirely or
                downgrade its relevance score.
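
As a rough illustration of both ideas, the sketch below drops stop words before indexing and flags a page when any remaining term repeats more than a fixed number of times. The stop-word list and the threshold of 5 are assumptions made for this example, not values used by any particular search engine.

```python
import re
from collections import Counter

# Assumed stop-word list and repetition threshold, for illustration only.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}
MAX_REPEATS = 5

def index_terms(text):
    """Return the words of the text in order, with stop words ignored."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def looks_like_spam(text, max_repeats=MAX_REPEATS):
    """Flag the page if any indexed term occurs more than max_repeats times."""
    counts = Counter(index_terms(text))
    return any(n > max_repeats for n in counts.values())

print(index_terms("The history of the Internet"))
print(looks_like_spam("cheap " * 6))
```

A real spam blocker could downgrade the relevance score instead of returning a yes/no answer; the boolean result here just keeps the sketch short.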

                        Search Engines…
        • Search engines generally support
              – Boolean searches (AND, OR, NOT)
              – Implied Boolean: term inclusion/exclusion with + (plus) and
                - (minus)
              – Proximity searches (NEAR, ADJ, BEFORE)
              – Phrase searches (“…” double quotes)
              – Use of parentheses to group the search terms
              – Advanced features like limiting searches to Title, URL
              – Some support natural language queries (accepting queries
                like 'Why is the sky blue')
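
Implied Boolean matching can be sketched as follows. The exact semantics differ between search engines; this example assumes that +term must be present, -term must be absent, and that a query without any +term matches when at least one plain term is present. The function names are hypothetical.

```python
import re

def words_of(text):
    """Return the set of lower-cased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches(query, text):
    """Apply an implied Boolean query (+include, -exclude) to one document."""
    doc = words_of(text)
    required, excluded, optional = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+"):
            required.add(token[1:])
        elif token.startswith("-"):
            excluded.add(token[1:])
        else:
            optional.add(token)
    if excluded & doc:
        return False          # an excluded term is present
    if not required <= doc:
        return False          # a required term is missing
    # With no required terms, at least one plain term must be present.
    return bool(required) or bool(optional & doc)

print(matches("+neem -patent market", "Market information on neem extracts"))
print(matches("+neem -patent", "Patent on neem synthesis"))
```

The first query matches because "neem" is present and "patent" is absent; the second fails because the excluded term appears in the document.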
                            Search Engines…
        • Search engines sort/order the retrieved websites.
        • Relevance ranking
              – How many of the query terms appear?
                    • Webpages containing all the query terms will be ranked higher
              – How many times does a term appear in the document?
                    • A webpage in which the term occurs 5 times will be ranked
                      higher than one in which it occurs 4 times
              – Where does the term appear in the webpage?
                    • Webpages containing any of the query terms in the title will be
                      ranked higher
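
The three ranking criteria above can be combined into a single score, as in the sketch below. The weights (100 per distinct query term found, 1 per occurrence, 10 per query term in the title) are arbitrary assumptions chosen so that term coverage dominates; real search engines use their own undisclosed formulas.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, title, body, coverage_weight=100, title_bonus=10):
    """Score one page against the query using the three criteria."""
    terms = set(tokenize(query))
    title_words = tokenize(title)
    all_words = title_words + tokenize(body)
    coverage = sum(1 for t in terms if t in all_words)    # how many terms appear
    frequency = sum(all_words.count(t) for t in terms)    # how often they appear
    in_title = sum(1 for t in terms if t in title_words)  # where they appear
    return coverage_weight * coverage + frequency + title_bonus * in_title

def rank(query, pages):
    """Sort (title, body) pairs so that the highest-scoring page comes first."""
    return sorted(pages, key=lambda p: score(query, p[0], p[1]), reverse=True)

PAGES = [
    ("Home", "neem neem neem neem"),
    ("Neem market", "neem extracts"),
    ("Home", "neem neem neem neem neem"),
]
for title, body in rank("neem market", PAGES):
    print(score("neem market", title, body), title, "|", body)
```

The page with both query terms in its title ranks first, and of the remaining two the page with five occurrences ranks above the one with four.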

                        Search Engines…
        • Criteria for Sorting/Ordering
              – User specified terms for sorting
              – Date, Domain & such other criteria.

                         Search Engines…
        • Display of Search results
              – The retrieved pages are first sorted as mentioned above.
              – The documents are displayed in sets of 10-20 items per
                page.
        • The display usually consists of:
              – The title in hypertext format, a brief description, the URL of the
                webpage, the last-modified date and the size of the webpage.
              – Additional elements include the relevance-ranking
                score, options to search for similar websites, access to
                translation services, etc.

                        Search Engines…
        • ADVANTAGES:
              – Best suited for complex/interdisciplinary search topics
              – Control over search: search terms can be combined as
                required.
              – Searches can be limited to a period of time.
              – Currency of information: web spiders traverse the web
                almost every day, so the latest additions to the web can
                be retrieved.
              – Exhaustive information is retrieved on a particular
                topic.

                        Search Engines…
        • DISADVANTAGES:
              – The search is time consuming: it normally
                results in too many hits, which include a lot of
                irrelevant documents as well.
              – The searcher should be familiar with the search
                syntax.
              – Search engines differ from one another.
              – Redundant links: the same documents are displayed more
                than once.
              – The retrieved set may contain dead links.
              – The results may include spam pages.

                            Search Engines…
        • Some New developments:
              – The majority of search engines are now called portals. Portal
                is generally synonymous with gateway; a portal offers
                services like
                    • a Yahoo-style directory of Web sites, a facility to search for
                      other sites, news, weather information, e-mail, stock quotes,
                      phone and map information, and sometimes a community
                      forum.
                    • Leading portals include Yahoo, Excite, Netscape, Lycos, CNet
              – Many have channels: a channel is a preselected Web site that
                automatically sends updated information for immediate
                display or viewing on request.
                    • E.g.: Latest News
                           Search Engines…
        • EXAMPLES
        • General search engines
              –     AltaVista
              –     Excite
              –     Infoseek
              –     Northern Light

                       Search Engines…
        • Special/Subject-specific
              – ChemFinder
              – FileZ
              – Medical Search Tools
              – Engineering Resources Online

                    Metasearch Engines
        • Do not have a database of their own.
        • Send the search request to several search engines.
        • Aggregate the resulting webpages in some way,
          remove the duplicates and display them in a single
          list grouped according to the search engine.
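
The aggregation step can be sketched as below. The engine names and result lists are hypothetical and hard-coded; a real metasearch engine would send the query to each search engine over the network. Ordering by best rank is one possible way of aggregating, assumed here for illustration.

```python
def merge_results(results_by_engine, top_n=10):
    """Merge ranked URL lists from several engines into one de-duplicated list.

    results_by_engine maps an engine name to its ranked list of URLs.
    Returns (url, [engines that returned it]) pairs, best-ranked first.
    """
    merged = {}  # url -> {"best_rank": int, "engines": [names]}
    for engine, urls in results_by_engine.items():
        # Only the top-ranking hits of each engine are considered.
        for position, url in enumerate(urls[:top_n]):
            entry = merged.setdefault(url, {"best_rank": position, "engines": []})
            entry["best_rank"] = min(entry["best_rank"], position)
            if engine not in entry["engines"]:
                entry["engines"].append(engine)
    ordered = sorted(
        merged.items(),
        key=lambda kv: (kv[1]["best_rank"], -len(kv[1]["engines"]), kv[0]),
    )
    return [(url, info["engines"]) for url, info in ordered]

# Hypothetical results from two engines for the same query.
RESULTS = {
    "engine_a": ["u1", "u2", "u3"],
    "engine_b": ["u2", "u4"],
}
for url, engines in merge_results(RESULTS):
    print(url, engines)
```

Each URL appears once, together with the engines that returned it, which also shows why hits below `top_n` in any engine never reach the merged list.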

                    Metasearch Engines…
        • ADVANTAGES
              – Same query can be run across multiple search engines
              – These tools use their own query language and hence the
                user need not learn the syntax and Boolean operators of
                the individual search engines
              – Better results: Retrieves the top-ranking pages from the
                individual search engines.
              – Saves time

                    Metasearch Engines…
        • DISADVANTAGES
              – The unique features of individual search engines are lost.
              – Not exhaustive: not all the pages retrieved by each search
                engine are displayed; only the top-ranking hits are.
              – Each search engine varies in quality, quantity, speed
                and other capabilities. Since a group of such search
                engines is accessed at a time, the functioning or
                otherwise of each search engine affects the results.

                       Metasearch Engines…
        • EXAMPLES
              –     ProFusion
              –     MetaCrawler
              –     Proteus
              –     Copernic

                    Virtual Libraries/Subject Gateways
        • Subject-specific lists/directories
        • Provide links to all types of information sources
        • Maintained by professionals working in the field
        • Contain selected documents
        • Follow a hierarchical organisation of resources
        • Some follow organisation by type of information
          (like institutions, organizations, listservs, ftp sites,
          WWW pages, journals etc.)
        • Many have a Search/Browse interface
                     Virtual Libraries/Subject Gateways…
        • ADVANTAGES
              –     Access to selected, high quality information sources
              –     Faster access to resources
              –     Will not contain redundant links
              –     Saves time
        • DISADVANTAGES
              – The user should be aware of the existence of such Virtual
                Libraries.
              – The listing may not be up to date.
              – Smaller database size.
                    Virtual Libraries/Subject Gateways…
        • EXAMPLES
              – BUBL
              – SciCentral
              – EEVL: Edinburgh Engineering Virtual Library
              – INFOMINE
              – The World Wide Web Virtual Library - Engineering
              – The World Wide Web Virtual Library - Chemistry

                    Tools For Finding People
        • Specialized tools/services which let people
              – Register their names and addresses.
              – Locate a person's e-mail address.
        • ADVANTAGES
              – Search will be very fast and simple
        • DISADVANTAGES
              – The person should have registered with the service
                being used.
              – The searcher should know the surname and forename
                of the person. Otherwise, too many irrelevant hits will
                be generated.
                Tools For Finding People…
       • Often, the required e-mail address cannot be
         retrieved through these tools.
       • Alternatively, any search engine can be used to
         search for the e-mail address.
             – Phrase search using the person's name.
       • Some search engines provide a special option for
         people search.
       • Some search engines recognize common names
         (E.g.: Infoseek).
       • If the person's affiliation is known, the Yahoo directory
         can be used to locate the institution & e-mail address.
                  Tools For Finding People…
        • EXAMPLES
              –     Four11
              –     Switchboard
              –     WhoWhere
              –     PeopleSearch
              –     Yahoo! People Search

         Current Trends in Internet Searching
        • Development of supplementary browser tools
              – Intelligent search filters/agents
                (E.g.: Alexa)
              – Offline/desktop browsers
                (E.g.: Netferret)
              – Automatic content delivery packages
                (E.g.: Wingflier)
              – Page retrievers
                (E.g.: WebWhacker)

                    Current Trends in Internet Searching…
        • Improvement in quality and structure of
          information on the Web
              – Exhaustive coverage v/s selected quality-oriented
                coverage
              – (E.g.: NorthernLight)
        • Standards for web documents
              – Metadata
              – XML

                    Current Trends in Internet Searching…
        • Thus, such emerging trends indicate
              – better organisation of information,
              – improved indexing techniques and tools,
              – and we can look forward to an enhanced search
                experience.

                    Steps in Internet Searching
        • Let us consider steps involved in a typical search
              – Analysis of the information requirement.
              – Selecting the search engine.
              – Translation of the concepts into syntactical search
                statements of the selected search engine.
              – Performing the search.
              – Refining the search based on the results.
              – Visiting the actual site and saving the information. (Use
                the File - Save option of the browser.)

         Before starting an Internet search
        • Consider what is not included in Internet search tools
              – Content of Adobe PDF and formatted files
              – The content in sites requiring a log in
              – CGI output such as search results from a database
              – Information within Intranets/Firewalls
              – Institutional resources requiring membership
              – Materials protected by copyright
              – Commercial resources with domain limitations
              – Sites that use a robots.txt file to keep files and/or
                directories off limits
              – Non-Web resources (E.g.: Text files)
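
To show how a robots.txt file keeps directories off limits, the sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the URLs and the spider name `ExampleSpider` are made up for this example; a real spider would fetch the file from the site itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download it from
# the site's /robots.txt address instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_index(url, agent="ExampleSpider"):
    """Return True if the robots.txt rules allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

print(may_index("http://example.org/index.html"))
print(may_index("http://example.org/private/report.html"))
```

Pages under `/private/` are skipped by a well-behaved spider and therefore never reach the search engine's database.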
                            Case Studies
        • Description of a few searches done at NCSI

              – Details of postgraduate courses in Pharmacy/
                Pharmacology offered by colleges affiliated to
                American universities.
              – Introduction to electronic commerce.
              – Manufacturers of dental materials
              – Environmental factors in activated sludge modelling
              – Adult learning: motivation, learning principles,
                teaching principles, models

                          Case Studies ...
        • Description of a few searches done at NCSI

              – Books and Conference proceedings on 'Gas Insulated
              – Packaging Agricultural Products: Methods and
              – Email Address of A J Epstein

                          Case Studies ...
        • Some other interesting searches done

              – Studies on Credit cards
              – Patents on Synthesis and characterisation methods of
              – Market information on Neem and Neem extracts
              – Money management practices of home-makers in India
                & abroad

                       Related Resources
        • Learning to search Internet
              – AskScott
              – The Tutorial: Guide to effective searching on the
                Internet
              – Learn the Net
              – Search engine tutorials
              – Terena guide to network resource tools
              – Kathy Schrock's Guide for Educators: Slide shows for
                learning & training
                      Related Resources ...
        • Keeping current

              – SearchEngineWatch
              – Search Engine Forums
              – A listserv is available for search engine developers
              – ResearchBuzz: offers a free subscription-based
                newsletter over e-mail.
