					 Searching and Indexing
  (tools and techniques)
                         Miroslav Milinovic
          Croatian Academic and Research Network - CARNet
                           Zagreb, Croatia
                          <miro@srce.hr>



6th CEENet Workshop on Network Technology, Budapest, Hungary, August 2000.



                                   MM-SI/1
                            Content
•   Part 1: Searching
     – Internet information space
     – Searching with the WWW
     – Searching the WWW
     – Search engines, Subject catalogs, other tools, Portals
     – Directory services
     – Summary
•   Part 2: Indexing
     – Concepts
     – Gathering, indexing, searching
     – Robots
     – Available tools
     – Summary
                                   MM-SI/2
Part 1 - Searching



        MM-SI/3
         Internet information space
•   is NOT unified
•   many subjects
•   different formats
•   different resources
    (information services)
•   various tools and techniques
    for searching and information retrieval
•   some information is not (yet):
    –   published electronically
    –   available on the Net
[Figure: labels "Internet", "WWW", "printed"]

                                   MM-SI/4
         Web information space
•   Publicly indexable Web:
    –   about 800 million web pages and 15 TB of data (about 6 TB of plain text)
    –   83% commercial; 6% scientific/educational; 1.5% pornographic
    –   60% of the Web is indexed / catalogued by major search
        engines & catalogues
•   85% of users use search engines / catalogues to
    locate info
           Steve Lawrence, Lee Giles (NEC Research Institute, February 1999)


•   30% of Web pages are copied or mirrored
                             Shivakumar and Garcia-Molina (1998)

                                                                 
                              MM-SI/5
           Web information space
•   “deep” Web
    –   public info 400 to 550 times bigger than “surface” Web
    –   7500 TB of data
                  The Deep Web: Surfacing Hidden Value; White Paper;
                                             BrightPlanet.com, July 2000


•   Users see Internet as key information source
    –   2/3 of users consider Internet as “important” or “extremely
        important” source of information
    –   53% ranked TV, and 47% ranked radio, at the same level of importance
                 Center for Communication Policy, UCLA, August 2000


                                 MM-SI/6
             Searching with the Web
•   searching tools:
    –   many different tools
    –   various concepts
    –   specialized for chosen resources
         •   Web, USENET, FTP, databases, ...
    –   global or local scope
    –   main problems: quality and currency
    –   there is NO perfect tool
•   user needs a strategy



                                    MM-SI/7
              Searching the Web
•   Search Engines
    –   Search Engines
    –   Metasearch Engines (Unified Search Interfaces)

•   Subject Catalogs (Virtual Libraries)

•   Other tools
    –   Multiple Search Interfaces
    –   Information Gateways
    –   ...

•   Portals
                               MM-SI/8
                     Search Engines
•   automated systems
•   specially designed programs
    –   robots, crawlers, spiders
    –   fetch WWW documents
    –   index those documents to build database
•   provide interface for user to search the database
    –   query syntax
    –   searching features
    –   presentation of the results - hits (format, ranking)
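
The fetch, index, and search cycle listed above can be sketched roughly as follows. This is a minimal illustration, assuming a tiny in-memory set of pages in place of documents fetched by a robot; the function names and the example.org URLs are hypothetical and do not describe any particular search engine.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits

# Hypothetical pages standing in for documents fetched by a robot.
documents = {
    "http://example.org/a": "Green vegetable recipes",
    "http://example.org/b": "Fruit and vegetable market",
}
index = build_index(documents)
```

Calling `search(index, "green vegetable")` returns only the first page, because every query word must be present.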



                                                               
                                     MM-SI/9
              Search Engines

[Diagram: a robot fetches WWW documents (http:// ...) and stores them in the search engine database]
                      MM-SI/10
              Search Engines
Alta Vista - http://altavista.digital.com/
excite! NetSearch - http://www.excite.com/
FAST - http://alltheweb.com/
Google - http://www.google.com/
HotBot - http://www.hotbot.com/
InfoSeek (GO.com) - http://www.infoseek.com/
Lycos Search - http://www.lycos.com/
Northern Light Search - http://www.northernlight.com/
WebCrawler - http://www.webcrawler.com/

local (regional) search engines

                                                       
                            MM-SI/11
                   Search Engines
•   query syntax and searching features:
    –   upper and lower case letters
         •   John December
         •   island
    –   phrases (text in quotes - “...”)
         •   “NASA space shuttle program”
         •   “John December”
    –   Boolean operators (AND, OR, NOT) and parentheses (...)
         •   vegetable AND green
         •   fruit NOT apple
    –   keyword control (+, -)
         •   +film +noir -"pinot noir"
         •   +python -monty
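
A rough sketch of how keyword control (+, -) and quoted phrases might be evaluated against a single document. The parsing rules here are an assumption made for illustration; real engines differ in syntax details, and `parse_query` and `matches` are hypothetical names.

```python
import re

# One query term: optional sign, then a quoted phrase or a single word.
TERM = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')

def parse_query(query):
    """Split a query into required (+), excluded (-) and optional terms."""
    required, excluded, optional = [], [], []
    for sign, phrase, word in TERM.findall(query):
        term = (phrase or word).lower()
        if not term:
            continue
        if sign == "+":
            required.append(term)
        elif sign == "-":
            excluded.append(term)
        else:
            optional.append(term)
    return required, excluded, optional

def contains(text, term):
    """True if the term (word or phrase) occurs on word boundaries."""
    return re.search(r"\b" + re.escape(term) + r"\b", text.lower()) is not None

def matches(text, query):
    """All + terms present, no - term present; else any optional term."""
    required, excluded, optional = parse_query(query)
    if any(contains(text, term) for term in excluded):
        return False
    if required:
        return all(contains(text, term) for term in required)
    return any(contains(text, term) for term in optional)
```

With the query `+film +noir -"pinot noir"`, a page about classic film noir matches, while a page that mentions a film about pinot noir wine is rejected by the excluded phrase.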

                                         MM-SI/12
                       Search Engines
•   query syntax and searching features:
    –   proximity search (NEAR)
         •   Internet NEAR training (Alta Vista)
    –   keyword truncation (* %)
         •   alumi*um
         •   comput*
    –   cascade search (Infoseek)
    –   resource control (AltaVista, HotBot, Infoseek)
         •   title:"Internet training"
    –   natural language searching (Ask Jeeves! - http://www.ask.com/)
    –   new approaches:
         •   Ditto.com - http://www.ditto.com/
         •   Simpli.com - http://www.simpli.com/
         •   Oingo - http://www.oingo.com/
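
Keyword truncation can be pictured as expanding a pattern against the vocabulary of the index. The sketch below handles only the `*` character and assumes it stands for any run of word characters; the vocabulary list is made up for the example.

```python
import re

def truncation_to_regex(pattern):
    """Turn a truncated keyword such as 'comput*' into a compiled regex."""
    escaped = re.escape(pattern.lower())
    # re.escape turns '*' into '\*'; replace it with "any word characters".
    return re.compile(escaped.replace(r"\*", r"\w*"))

def expand(pattern, vocabulary):
    """Return the vocabulary words matched by the truncated keyword."""
    regex = truncation_to_regex(pattern)
    return sorted(word for word in vocabulary if regex.fullmatch(word.lower()))

# Hypothetical index vocabulary.
vocabulary = ["aluminum", "aluminium", "computer", "computing", "compute", "commute"]
```

Here `expand("alumi*um", vocabulary)` covers both spellings of the metal, and `expand("comput*", vocabulary)` covers compute, computer and computing but not commute.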

                                         MM-SI/13
                    Search Engines
•   important characteristics:
    –   database (quantity & quality)
         •   INKTOMI - 500 million web pages
         •   AltaVista - 250 million web pages
         •   Northern Light - 240 million web pages
    –   query language
    –   response time
    –   ranking (various approaches)
    –   output (format, available info)
    –   additional features (cascade search, refine, …)
    –   ...

                                                          
                                   MM-SI/14
                     Search Engines
•   advantages
    –   vast number of documents
    –   highly efficient searching and retrieval
    –   automated production
•   disadvantages
    –   no quality control
    –   no classification
    –   hits can be out of context
    –   dead or out-of-date links, junk



                                    MM-SI/15
             Metasearch Engines
•   Unified Search Interfaces
•   automated systems
•   DO NOT build databases of their own
•   query other search engines
•   provide unified interface for user to search a number of
    databases (search engines) with one query
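
One possible way to merge hits from several engines into a single list is sketched below. It uses a simple reciprocal-rank sum chosen for illustration; the slides do not say how any actual metasearch engine merges results, and the engine names and URLs are hypothetical.

```python
from collections import defaultdict

def merge_results(results_by_engine):
    """Merge ranked URL lists; URLs ranked high by several engines come first."""
    scores = defaultdict(float)
    for ranked_urls in results_by_engine.values():
        for position, url in enumerate(ranked_urls, start=1):
            scores[url] += 1.0 / position
    # Sort by descending score, then by URL so that ties are deterministic.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical ranked hit lists returned by two engines for one query.
results_by_engine = {
    "engine_a": ["http://example.org/1", "http://example.org/2"],
    "engine_b": ["http://example.org/2", "http://example.org/3"],
}
merged = merge_results(results_by_engine)
```

Duplicates collapse into one entry, and a URL returned by both engines rises above one returned by a single engine.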




                                                          
                             MM-SI/16
             Metasearch Engines
•   examples:
    All4one - http://all4one.com/
    Mamma - http://www.mamma.com/
    MetaCrawler - http://www.metacrawler.com/
    (SavvySearch) CNET Search.com - http://www.savvysearch.com/




                                                            
                              MM-SI/17
             Metasearch Engines
•   important characteristics:
    –   number and selection of search engines covered
    –   query language
    –   response time
    –   ranking (hits)
    –   results (hits) merging
    –   output (format, available info)
    –   additional features
    –   ...



                                                         
                               MM-SI/18
                Metasearch Engines
•   advantages
    –   same as search engines
    –   make use of search engines easier
•   disadvantages
    –   same as search engines
    –   unified query for all search engines means loss of additional
        capabilities of particular search engine
    –   searching is slower




                                   MM-SI/19
                   Subject Catalogs
•   Virtual Libraries, Subject Directories
•   collections of Internet resources descriptions
    –   names, URLs, abstracts, ratings, ...
•   organized within hierarchical subject scheme
    –   heuristic (subject based)
    –   UDC, Dewey, ...
•   manually maintained
•   internal search



                                                     
                                    MM-SI/20
                  Subject Catalogs
•   examples:
    Yahoo - http://www.yahoo.com/
    EINet Galaxy - http://galaxy.einet.net/
    Magellan - http://magellan.excite.com/
    NetGuide - http://www.netguide.com/
    LookSmart - http://www.looksmart.com/
    About.com - http://www.about.com/
    Open Directory - http://dmoz.org/
    Britannica.com - http://www.britannica.com/

    local (regional) catalogs
                                                  
                                 MM-SI/21
                      Subject Catalogs
•   important characteristics:
    –   size
         •   Yahoo (1999) - 150 editors, 1.2 million Web links
         •   Open Directory - 16000 editors, 1 million Web links
    –   classification method
    –   available info (about classified resources)
    –   ranking
    –   (internal) searching
    –   additional features
    –   ...

                                                                   
                                      MM-SI/22
                    Subject Catalogs
•   advantages
    –   classified into subject areas
    –   manually reviewed resources (no junk)
    –   internal search
•   disadvantages
    –   manual maintenance
    –   out-of-date information
    –   some parts of the catalogue are not professionally maintained




                                  MM-SI/23
                         Other tools
•   Multiple Search Interfaces
    –   simple Web pages with interfaces to number of search tools
    –   enable user to choose among listed tools
    –   DO NOT build databases of their own
    –   DO NOT act as Metasearch Engines
    –   examples:
         All-in-One - http://www.allonesearch.com/
         Easy Searcher - http://www.easysearcher.com/
•   Information (Subject) Gateways
    –   usually dedicated to one subject (e.g. Social Sciences)
    –   examples:
         SOSIG - http://sosig.ac.uk/
         OMNI - http://www.omni.ac.uk/
                                                                     
                                    MM-SI/24
                           Other tools
•   Databases search
    Invisible Web - http://www.invisibleweb.com/
    Lycos Searchable Databases - http://dir.lycos.com/Reference/Searchable_Databases/
    INFOMINE - http://infomine.ucr.edu/
    Terraserver - http://terraserver.com/

•   And ...
    –   electronic dictionaries, encyclopedias, guides,
        software collections, map collections,
        tools for searching non-www resources, …


                                  PORTALS


                                      MM-SI/25
                            Portals
•   hybrid tool - winning solution
•   user’s entry point into Internet information space
•   brings together many (all) services
•   based on search engine and/or catalog
•   general or specialized (subject or community)
    –   http://cnn.com/
    –   http://www.excite.com/
    –   http://www.altavista.com/
    –   http://www.yahoo.com/
    –   http://www.ihlth.com/
    –   http://www.digitalessays.com/
    –   ...

                                MM-SI/26
Conclusion on Web search tools
•   each tool has advantages and disadvantages
•   new systems appear, old stagnate
•   CAUTION: tools are text oriented
•   non-WWW resources are also covered
•   quality and currency
•   precision vs. recall
•   cooperation between tools is a necessity
•   winner: portal

•   useful URL: http://searchenginewatch.com/
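
The precision vs. recall trade-off mentioned above can be made concrete with the usual definitions: precision is the share of retrieved documents that are relevant, recall is the share of relevant documents that were retrieved. The document identifiers below are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: four documents returned, three actually relevant.
retrieved = ["a", "b", "c", "d"]
relevant = ["a", "b", "e"]
precision, recall = precision_recall(retrieved, relevant)
```

In this example two of the four hits are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3). Returning more hits tends to raise recall and lower precision.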

                           MM-SI/27
                       Selecting a tool
•   Portals
•   Search Engines
    –   when you have good (precise) keywords (narrow topic)
•   Subject Catalogs
    –   for “look and feel”
    –   when you don’t have good keywords (broad topic)
•   Information Gateways or other specialized tools
    –   for quality (if you can find one)
•   Multiple Search Interfaces
    –   useful to see “what is available”



                                      MM-SI/28
                        A Strategy?
•   no searching system is perfect
    –   be flexible and try different tools
    –   compare results and gain experience
•   learn vocabulary, read HELP and FAQ
•   be focused (don’t wander around)
•   concentrate on problem, not on tool (query)
•   use stepwise approach
    –   refine query (keywords)




                                  MM-SI/29
      What is a directory service?
•   holds information about:
     – people - individuals (White Pages)
     – other things (Yellow Pages)
•   analogy with the telephone directory
•   service for locating information about individuals,
    companies, resources, ...
•   has searchable database with corresponding
    information



                            MM-SI/30
              Directory services
•   can be:
     – global or local
     – distributed or centralized
•   typically accessed through:
     – Web pages (interfaces)
     – telnet
     – e-mail clients


                             MM-SI/31
     Current standards (services)
•   LDAP
•   X.500
•   Whois / Whois++
•   Netfind
•   CCSO (ph)
•   RWhois
•   Web-based services
•   other services


                         MM-SI/32
                 Basic concepts
•   data model
•   distributed or centralized?
•   query language
•   access control and security
•   maintenance




                           MM-SI/33
                          Data model
•   NO general standard
•   almost all services use:
    –   attribute-value pair model
    –   database consists of records
    –   special field to identify the type of record
         •   whois++ uses template field
         •   X.500 uses object class field
    –   list of record types, attributes and their possible values
        depends on actual service
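
The attribute-value model can be pictured as a list of records, each carrying one field that names its type. In this sketch the `type` key plays the role that the template field has in whois++ and the object class field has in X.500; the attribute names and values are hypothetical.

```python
# Each record is a set of attribute-value pairs; "type" identifies the record kind.
records = [
    {"type": "person", "name": "Ana Example", "email": "ana@example.org"},
    {"type": "person", "name": "Ivan Example", "email": "ivan@example.org"},
    {"type": "organization", "name": "Example Institute", "city": "Zagreb"},
]

def records_of_type(records, record_type):
    """Select the records whose type field equals record_type."""
    return [record for record in records if record.get("type") == record_type]

def attributes_in_use(records):
    """List the attribute names used by each record type."""
    usage = {}
    for record in records:
        usage.setdefault(record["type"], set()).update(record.keys() - {"type"})
    return usage
```

Which record types and attributes exist is decided by the service, so `attributes_in_use(records)` would give a different picture for each directory.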




                                     MM-SI/34
     Distributed or centralized?
•   distributed:
    –   many servers tied together
    –   administrative structure (hierarchical)
    –   X.500, LDAP, WHOIS++, RWhois
•   centralized:
    –   NETFIND
    –   services based on WWW
    –   can have mirror (peer) sites
•   independent services - local (CCSO)



                                  MM-SI/35
      Access control and security
•   ability to control:
     – who sees what data
     – who updates what data
•   privacy
•   protocols enable filtering of attributes
•   main issue: to be easy but safe
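
Attribute filtering can be sketched as a lookup of what each kind of viewer may read. The viewer roles and the policy table below are assumptions made for illustration, not part of any of the protocols named in these slides.

```python
# Hypothetical policy: which attributes each kind of viewer may read.
VISIBLE = {
    "anonymous": {"name"},
    "member": {"name", "email"},
    "owner": {"name", "email", "phone"},
}

def filter_attributes(record, viewer):
    """Return only the attributes this viewer is allowed to see."""
    allowed = VISIBLE.get(viewer, set())
    return {key: value for key, value in record.items() if key in allowed}

record = {"name": "Ana Example", "email": "ana@example.org", "phone": "555-0100"}
```

An unknown viewer role gets an empty record, which errs on the side of privacy.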




                             MM-SI/36
                   Query language
•   servers enable query on values of attributes
•   exact matches, substring matches
•   depends on:
    –   service
    –   implementation (client or server capabilities)
•   Web interfaces:
    –   make searching easier
    –   do not provide full functionality
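
Exact and substring matching on attribute values might look like this in a minimal form. The records and the `find` helper are hypothetical; an actual service defines its own query syntax and matching rules.

```python
def find(records, attribute, value, mode="exact"):
    """Return records whose attribute matches value, ignoring case."""
    if mode not in ("exact", "substring"):
        raise ValueError("mode must be 'exact' or 'substring'")
    wanted = value.lower()
    result = []
    for record in records:
        actual = str(record.get(attribute, "")).lower()
        if mode == "exact" and actual == wanted:
            result.append(record)
        elif mode == "substring" and wanted in actual:
            result.append(record)
    return result

# Hypothetical White Pages entries.
people = [
    {"name": "Ana Example", "email": "ana@example.org"},
    {"name": "Ivan Example", "email": "ivan@example.org"},
]
```

A substring query for "example" on the name attribute finds both entries, while an exact query needs the full value.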




                                   MM-SI/37
        Good directory service
3 main features:
• easy and efficient access, searching and updating of
  information
• access control (“who sees/updates what”)
• privacy (“right to be unlisted”)




                         MM-SI/38
                 Current situation
•   Directory services are a current Internet problem
•   NO standards for cooperation between services
•   privacy vs. currency of information
•   global Internet Directory service:
     – Do we need it?
     – Can it be done efficiently?
     – Who will enter the information and keep it current?
•   Yellow pages services:
     – solved with Web Catalogs and Search Engines
     – DNS - intuitive directory service (?!)
•   emphasis on White Pages services
                              MM-SI/39
              Web-based services
•   centralized
•   part of (good) portals
•   standalone tools:
    –   http://www.four11.com/
    –   http://www.iaf.net/
    –   http://www.whowhere.com/




                              MM-SI/40
                    Who will win?
•   LDAP
•   main questions:
    –   how to proceed from this point?
    –   standard for cooperation between services
•   useful URL:
    –   http://www.dante.net/nameflow.html


         “Pity the poor fanatic, when he loses sight of his objective
         he doubles his efforts”, Einar Stefferud




                                 MM-SI/41
                     Summary
•   Internet information space
•   Searching with the Web
•   Searching the Web
•   Search engines, Subject catalogs, other tools, Portals
•   Directory services




                            MM-SI/42
Part 2 - Indexing



       MM-SI/43
             Concepts in indexing
•   Gathering
    –   process of collecting resources
•   Indexing
    –   building a database
•   Searching
    –   servicing user queries




                                 MM-SI/44
                         Gathering
•   some issues to think about:
    –   besides HTTP, include other protocols/resources
        (FTP, Usenet, …)
    –   use caches to get data (ICP ?)
    –   distributed gathering
    –   list of resources (where to start and where to stop)
    –   update frequency
    –   resources used for gathering
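
The "where to start and where to stop" question can be illustrated with a breadth-first gatherer that is limited by a host list and a link depth. The link graph below is a hypothetical stand-in for real pages, so the sketch does no network access; a real gatherer would also have to honor update frequency and server load.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real pages and their outgoing links.
LINKS = {
    "http://example.org/": ["http://example.org/a", "http://other.example/x"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/"],
    "http://example.org/b": ["http://example.org/c"],
    "http://example.org/c": [],
}

def gather(start_urls, allowed_hosts, max_depth):
    """Breadth-first gathering limited by host list and link depth."""
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # stop rule: do not follow links any deeper
        for link in LINKS.get(url, []):
            if urlparse(link).hostname in allowed_hosts and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Starting from the root page with `allowed_hosts={"example.org"}` and `max_depth=2`, the gatherer skips the foreign host and stops before the page that is three links away.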




                                  MM-SI/45
                         Indexing
•   some issues to think about:
    –   what type of documents are we indexing (only HTML or …)
    –   full text, or ?
    –   metadata ?
    –   multilinguality
    –   including data from other indexes
    –   resources used for building index




                               MM-SI/46
                           Searching
•   some issues to think about:
    –   query features and syntax
         •   methods used for database searching and retrieving results
    –   presentation of search results
    –   user interface
    –   multilinguality
    –   resources used for searches
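
As one illustration of how results could be ordered for presentation, the sketch below ranks documents by how often the query words occur in them. Plain term counting is an assumption chosen for simplicity; the slides do not prescribe a ranking method, and the document identifiers are hypothetical.

```python
import re
from collections import Counter

def words_of(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank(documents, query):
    """Order matching documents by total occurrences of the query words."""
    query_words = words_of(query)
    scored = []
    for url, text in documents.items():
        counts = Counter(words_of(text))
        score = sum(counts[word] for word in query_words)
        if score > 0:
            scored.append((score, url))
    # Highest score first; URL breaks ties so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]

# Hypothetical indexed documents.
documents = {
    "u1": "Internet training for Internet users",
    "u2": "Internet basics",
    "u3": "Cooking recipes",
}
```

For the query "internet training" the first document scores 3 and the second scores 1, so they are returned in that order and the third is left out.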




                                    MM-SI/47
        Robots and similar beasts
•   can place a heavy load on network and server
    –   act accordingly
    –   use others’ resources with care
•   there is a “robot ethics”:
    –   robot exclusion protocol
    –   ROBOTS META tag
•   useful URLs:
    http://www.searchenginewatch.com/webmasters/spiderchart.html
    http://info.webcrawler.com/mak/projects/robots/robots.html



                                   MM-SI/48
        Robot Exclusion Protocol
•   can be used by Web site administrator
•   robots.txt file
    –   should be placed in the document root directory
        (at URL http://hostname/robots.txt)
    –   has special syntax
•   example:
         User-agent: *
         Disallow: /archives/
         Disallow: /working/
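
A well-behaved robot could check the example rules above with Python's standard `urllib.robotparser` module. The sketch parses the rules from a string instead of fetching them, and the robot name `ExampleBot` and the page paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from this slide.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archives/
Disallow: /working/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, agent="ExampleBot"):
    """True if the rules permit this robot to fetch the URL."""
    return parser.can_fetch(agent, url)
```

With these rules, `allowed("http://hostname/index.html")` is permitted while anything under `/archives/` or `/working/` is refused.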




                                MM-SI/49
          ROBOTS META tag
• can be used by Web page author
• <META NAME="ROBOTS" CONTENT="content">
       content = ALL | NONE | directive ["," directive]
       directive = index | follow
       index     = "INDEX" | "NOINDEX"
       follow = "FOLLOW" | "NOFOLLOW"
•   default: INDEX, FOLLOW
•   example:
      <meta name="robots" content="index,nofollow">
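
A robot might read the tag roughly as follows, using the standard `html.parser` module. The class and function names are hypothetical, and the handling of unknown directives (they are ignored) is an assumption of this sketch.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the INDEX/FOLLOW decision from a ROBOTS META tag."""

    def __init__(self):
        super().__init__()
        self.index = True   # default: INDEX
        self.follow = True  # default: FOLLOW

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag != "meta" or (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for directive in (part.strip().upper() for part in content.split(",")):
            if directive == "ALL":
                self.index, self.follow = True, True
            elif directive == "NONE":
                self.index, self.follow = False, False
            elif directive in ("INDEX", "NOINDEX"):
                self.index = directive == "INDEX"
            elif directive in ("FOLLOW", "NOFOLLOW"):
                self.follow = directive == "FOLLOW"

def robots_policy(html):
    """Return (may_index, may_follow) for a page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow
```

For the example tag above, `robots_policy` reports that the page may be indexed but its links should not be followed; a page without the tag gets the defaults.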




                                MM-SI/50
               Available software
•   Harvest and related tools (Glimpse, Swish)
•   WebGlimpse (“mini Harvest”)
•   HTDIG
•   home-made solutions
•   commercial robots/solutions (AltaVista Scooter, …)
•   service offered by ASPs
•   useful URLs
    –   http://www.tardis.ed.ac.uk/harvest/
    –   http://webglimpse.org/
    –   http://www.htdig.org/
    –   http://sunsite.berkeley.edu/SWISH-E/

                                MM-SI/51
                      Summary
•   Concepts
•   Gathering, indexing, searching
•   Robots
•   Available tools




                           MM-SI/52

				