A Reflection on Spiders, Bots and Aggregators by cmk16156


A Reflection on Spiders, Bots and Aggregators
An Independent Study by Jeff Heaton
Advised by Bill Darte
Presented by
Jeff Heaton
  Email: heatonj@heat-on.com
  Web: http://www.jeffheaton.com
Upcoming Book by Presenter
  Spiders, Bots, and Aggregators in Java
  by Jeff Heaton
  Published in March
  Paperback - 512 pages
  1st edition (March 2002)
  Sybex; ISBN:
Basic Terminology
  Robot (Bot)
  Intelligent agent
An Overview
The HTTP Protocol
  Bots must navigate web sites
  The HTTP protocol is the basic mode of
  transportation for web pages
  A bot must use the HTTP protocol
An HTTP Request
 GET /grindex.asp HTTP/1.1
 Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg,
 application/vnd.ms-powerpoint, application/vnd.ms-excel,
 application/msword, */*
 Accept-Language: en-us
 Accept-Encoding: gzip, deflate
 User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
 Host: www.classinfo.net
 Connection: Keep-Alive
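A bot has to produce the same kind of text itself. The following is a minimal sketch, not the presenter's code, of how a request like the one above could be assembled in Java. The class name, method name and the "ExampleBot/0.1" agent string are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch: assemble the text of an HTTP/1.1 GET request.
class HttpRequestSketch {

    // Builds the request line, the Host header, any extra headers,
    // and the blank line that ends the header section.
    static String buildGet(String host, String path, Map<String, String> headers) {
        StringBuilder sb = new StringBuilder();
        sb.append("GET ").append(path).append(" HTTP/1.1\r\n");
        sb.append("Host: ").append(host).append("\r\n");
        for (Map.Entry<String, String> h : headers.entrySet()) {
            sb.append(h.getKey()).append(": ").append(h.getValue()).append("\r\n");
        }
        sb.append("\r\n"); // an empty line ends the headers
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "ExampleBot/0.1"); // hypothetical bot name
        headers.put("Connection", "close");
        System.out.print(buildGet("www.classinfo.net", "/grindex.asp", headers));
    }
}
```

The sketch only builds the text. Sending it would mean writing the string to a socket connected to port 80 of the host.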
Types of HTTP Requests
  GET – most common, used to download
  a single resource.
  POST – used to respond to a FORM.
  HEAD – least common, used to verify
  the existence of a web page.
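In Java the request type can be selected on the standard HttpURLConnection class. This is a small sketch under assumptions: the class and method names are made up for illustration, and nothing is sent until the caller asks for the response.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Minimal sketch: choose the HTTP request type before connecting.
class RequestMethodSketch {

    // Prepares, but does not send, a request that uses the given method.
    static HttpURLConnection prepare(String url, String method) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod(method); // "GET", "POST" or "HEAD"
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection head = prepare("http://www.example.com/", "HEAD");
        System.out.println(head.getRequestMethod());
        // head.getResponseCode() would contact the server; a 200 result
        // would indicate that the page exists.
    }
}
```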
An HTTP Response
HTTP/1.1 200 OK
Connection: Keep-Alive
Server: Microsoft-IIS/4.0
Content-Type: text/html
Cache-control: private
Transfer-Encoding: chunked
Via: 1.1 c760 (NetCache 4.1R4D1)
Date: Tue, 13 Mar 2001 03:55:05 GMT

... the rest of the HTML document ...
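A bot has to take a response like this apart before it can use the page. The sketch below, with an assumed class name and layout, splits a raw response into status code, headers and body. It does not decode a chunked body such as the one announced above, so it should be read as an illustration rather than a complete client.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Minimal sketch: split a raw HTTP response into its parts.
class HttpResponseSketch {
    final int status;
    final Map<String, String> headers = new LinkedHashMap<>();
    final String body;

    HttpResponseSketch(String raw) {
        // The headers end at the first blank line; the body follows it.
        int split = raw.indexOf("\r\n\r\n");
        String head = split >= 0 ? raw.substring(0, split) : raw;
        body = split >= 0 ? raw.substring(split + 4) : "";

        String[] lines = head.split("\r\n");
        // The status line looks like: HTTP/1.1 200 OK
        status = Integer.parseInt(lines[0].split(" ")[1]);

        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                // Header names are case-insensitive, so store them lower-cased.
                String name = lines[i].substring(0, colon).trim().toLowerCase(Locale.ROOT);
                headers.put(name, lines[i].substring(colon + 1).trim());
            }
        }
    }

    public static void main(String[] args) {
        HttpResponseSketch r = new HttpResponseSketch(
            "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>");
        System.out.println(r.status + " " + r.headers.get("content-type"));
    }
}
```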
Retrieving a Web Page
  Most web pages are a mixture of text
  and graphics.
  First the web browser downloads the
  HTML page.
  Then every image contained in that
  page is downloaded.
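The second step, finding the images to download, can be sketched as follows. This is an assumption-laden illustration: it uses a regular expression instead of a real HTML parser, handles only quoted src attributes, and the class and method names are invented.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: list the images a browser would fetch after the HTML page.
class ImageLinkSketch {
    // Matches <img ... src="..."> with single or double quotes, in any case.
    private static final Pattern IMG_SRC = Pattern.compile(
        "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns the absolute address of every image referenced by the page.
    static List<URI> imageUrls(URI pageUrl, String html) {
        List<URI> found = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            // Relative addresses are resolved against the page's own address.
            // resolve() throws IllegalArgumentException on a malformed address.
            found.add(pageUrl.resolve(m.group(1)));
        }
        return found;
    }

    public static void main(String[] args) {
        URI page = URI.create("http://www.example.com/dir/index.html");
        String html = "<IMG SRC=\"logo.gif\"><img alt='x' src='/img/a.jpg'>";
        for (URI u : imageUrls(page, html)) {
            System.out.println(u);
        }
    }
}
```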
A Typical Web Page
HTTP Messages
Example: HTTP
Building a Bot
  A bot retrieves information from a web site.
  Often a bot can be used to monitor a
  The BBS bot.
A Typical BBS
Example: Watch BBS Bot
What is a Spider?
  A spider is a specialized bot that moves
  from web page to web page.
  A spider takes its name from the insect
Page Queues
 Waiting queue – the page is waiting to be
 downloaded.
 Running queue – the page is downloading.
 Error queue – the page resulted in an error.
 Complete queue – the page has been
 downloaded, and should not be
 downloaded again.
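The four queues can be modeled as states that each page moves through. The sketch below is one possible arrangement, not the design from the presentation: a single map records each page's state, and all names are assumptions.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Minimal sketch: track which queue each page is in.
class PageQueueSketch {
    enum State { WAITING, RUNNING, ERROR, COMPLETE }

    private final Map<String, State> states = new HashMap<>();
    private final Queue<String> waiting = new ArrayDeque<>();

    // Adds a page to the waiting queue unless the spider has seen it before.
    boolean add(String url) {
        if (states.containsKey(url)) {
            return false;
        }
        states.put(url, State.WAITING);
        waiting.add(url);
        return true;
    }

    // Moves the next waiting page to the running queue; null if none wait.
    String next() {
        String url = waiting.poll();
        if (url != null) {
            states.put(url, State.RUNNING);
        }
        return url;
    }

    // Records the outcome of a download for a page in the running queue.
    void finish(String url, boolean ok) {
        if (states.get(url) != State.RUNNING) {
            throw new IllegalStateException(url + " is not running");
        }
        states.put(url, ok ? State.COMPLETE : State.ERROR);
    }

    State stateOf(String url) {
        return states.get(url);
    }

    public static void main(String[] args) {
        PageQueueSketch q = new PageQueueSketch();
        q.add("http://www.example.com/");
        String url = q.next();
        q.finish(url, true);
        System.out.println(url + " -> " + q.stateOf(url));
    }
}
```

Because a page that has reached any queue is never added again, the spider does not loop when pages link back to each other.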
Page State Transition
A Typical Web
Spider Flowchart
Example: Download a Site
Business Issues of Spiders, Bots and Aggregators
  Two sides of the same coin

     How your company can use bots
     How bots can be used against your company
Uses for Bots
  Tracking shipments
  Account aggregation
  Reputation monitoring
  Monitoring reliability
Sites That Extensively Use
  AltaVista Babel Fish
AltaVista Babel Fish
  Used to translate a site to/from any
  Directly integrated with AltaVista
  Located at http://world.altavista.com
Yodlee
  Aggregates many on-line accounts into
  ASP model, most users access Yodlee
  through an intermediary
  Located at http://www.yodlee.Com
Pricewatch
  Used to compare different prices from
  multiple vendors
  Has encountered some legal problems
  Located at http://www.pricewatch.Com
  Primary business is searching
  Bots index web pages and indexes
Why Be Friendly to Bots
  Allow your site to be indexed into
  search engines
  Allow customers to access your data in
  new ways
  If you use bots yourself
Bot Friendly Sites
  Meta tags for bots to locate
  Friendly robots.txt files
  Terms of service agreements that allow
  bot usage
Meta Tags
<BASE HREF="http://www.wustl.edu/">
<META HTTP-EQUIV="content-type" content="text/html;charset=iso-8859-1">
<TITLE>Washington University in St. Louis Home Page</TITLE>
<META NAME="description" content="Washington University in St. Louis Official Home Page">
<META NAME="keywords" content="Washington University, Washington University in St. Louis, Wash U,
WU, WUSTL, WUStL, Wash. U., University Washington, university, universities, American Universities, St. Louis,
 Missouri, education, research, higher education, undergraduate, university libraries, academic, college, colleges,
Midwestern universities, online applications, health care, medicine, academic, academics, campus, students,
 university students, college students, Washington University School of Architecture, Washington University School of Art,
Washington University Arts and Sciences, Washington University School of Business, John M. Olin School of Business,
Washington University School of Engineering and Applied Science, Washington University School of Law,
Washington University School of Medicine, Washington University School of Social Work, George Warren Brown School of
Social Work, University College, College of Arts and Sciences">
A robots.txt File
# robots, scram
User-agent: *
Disallow: /cgi-bin
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /java
Disallow: /shockwave
Disallow: /JOBS
Disallow: /pr
Disallow: /Interactive
Disallow: /alt_index.html
Disallow: /webmaster_logs
Disallow: /newscenter
Disallow: /virtual
Disallow: /DIGEST
Disallow: /QUICKNEWS
Disallow: /SEARCH
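A well-behaved spider checks rules like these before requesting a page. The following sketch is a deliberately simplified reading of the format: it honors only the "User-agent: *" record, treats each Disallow value as a case-sensitive path prefix, and ignores everything else. The class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Minimal sketch: honor the Disallow lines that apply to every robot.
class RobotsTxtSketch {
    private final List<String> disallowed = new ArrayList<>();

    RobotsTxtSketch(String robotsTxt) {
        boolean applies = false; // true while inside a "User-agent: *" record
        for (String raw : robotsTxt.split("\\r?\\n")) {
            // Everything after '#' is a comment.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                applies = value.equals("*");
            } else if (field.equals("disallow") && applies && !value.isEmpty()) {
                disallowed.add(value);
            }
        }
    }

    // A path is allowed unless it starts with one of the disallowed prefixes.
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        RobotsTxtSketch robots = new RobotsTxtSketch(
            "# robots, scram\nUser-agent: *\nDisallow: /cgi-bin\nDisallow: /beta\n");
        System.out.println(robots.isAllowed("/cgi-bin/search"));
        System.out.println(robots.isAllowed("/index.html"));
    }
}
```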
A Friendly TOS Agreement
Use of Data and Products
   The information on government servers are in the
public domain, unless specifically annotated otherwise,
and may be used freely by the public. Before using
information obtained from this server special attention
should be given to the date & time of the data and
products being displayed. This information shall not be
modified in content and then presented as official
government material.
(from the National Weather Service)
An Unfriendly TOS Agreement
User will not access any software or data
 provided via indirect means or any
 method not intended or agreed upon by
 PCQuote. Robot programs (automated
 query systems) are strictly prohibited
 and any use of such systems will result
 in immediate termination of access.
(From PCQuote.Com)
Bot Ethics
  Do unto others as you would have them
  do unto you.
  Do unto others as the law would have
  you do unto them.
Detecting Bots
  User Agent Name - What user agent name are you specifying for your
  bot? If you are not using anonymous access, your bot will stand out
  easily on an access log.
  Frequency of Access - How often are you accessing the site, and is it
  always from the same IP address? A very large volume of accesses from
  the same IP address is usually a telltale sign of a bot or spider.
  Access Method - How is the bot accessing the site? Is it only pulling
  text files and not downloading any images? Web browsers used by human
  users will almost always download all of the images too. A bot
  typically only goes after the text.
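The frequency check is easy to picture from the web master's side. The sketch below counts requests per IP address in an access log and reports the addresses above a threshold. It assumes each log line starts with the client's IP address followed by a space, as in the common log format. The class name, method name and threshold are illustrative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Minimal sketch: flag IP addresses that appear unusually often in a log.
class AccessLogSketch {

    // Returns every address that made more than 'threshold' requests.
    static Set<String> suspects(List<String> logLines, int threshold) {
        Map<String, Integer> hits = new HashMap<>();
        for (String line : logLines) {
            int space = line.indexOf(' ');
            if (space <= 0) {
                continue; // not in the assumed "address first" layout
            }
            hits.merge(line.substring(0, space), 1, Integer::sum);
        }
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : hits.entrySet()) {
            if (e.getValue() > threshold) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "192.0.2.1 - - \"GET /a.html HTTP/1.1\" 200 512",
            "192.0.2.1 - - \"GET /b.html HTTP/1.1\" 200 734",
            "192.0.2.1 - - \"GET /c.html HTTP/1.1\" 200 120",
            "192.0.2.7 - - \"GET /a.html HTTP/1.1\" 200 512");
        System.out.println(suspects(log, 2));
    }
}
```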
Web Site Hostility
  Usenet Postings - The web master can make Usenet postings to defame
  your bot and site. If your bot is an annoyance, most web masters will
  want to warn other web masters.
  Legal Measures - If you are violating their terms of service, they may
  bring legal action against you.
  Bot Exclusion File - We’ve already examined the robots.txt file. By
  using this file the web master can request your spider to leave the
  site alone. This is the least severe means of thwarting a spider or
  bot. If this method fails, a more severe alternative will likely be
  pursued.
  Filter based on IP - If a large volume of traffic is coming from a
  single IP address, that IP address could be denied access.
  Filter based on Agent Name - If a large volume of traffic is coming
  from a single agent name, that agent name can be denied access.
The Future of Bots
  More consistent web sites
What is SOAP?
Main Entry: 1soap
  Pronunciation: 'sOp
  Function: noun
  Etymology: Middle English sope, from Old English
  sApe; akin to Old High German seifa soap
  Date: before 12th century
  1 a : a cleansing and emulsifying agent made usually
  by action of alkali on fat or fatty acids and consisting
  essentially of sodium or potassium salts of such acids
  b : a salt of a fatty acid and a metal
(from Merriam-Webster's 10th Collegiate Dictionary)
No, Really. What is SOAP?
Main Entry: 1soap
  Pronunciation: 'sOp
  Function: protocol
  Etymology: Acronym for Simple Object Access Protocol
  Date: before 20th century
  1 : A cross-platform, XML-based protocol used to
  access objects that may not reside on the same local
  2 : SOAP OPERA : When different vendors'
  implementations are not compatible.
What does SOAP Offer
  XML based
  Can use many different transfer
  protocols (e.g. HTTP, SMTP)
  Distributed Systems
Who is involved in SOAP
  Submitted to the W3C, which now governs it.
  Sun is incorporating SOAP into Java via
  JAXM (Java API for XML Messaging)
  Microsoft is incorporating SOAP into the
  .NET platform.
SOAP Components
 XML Based
 SOAP Messages
 Web Services Description Language (WSDL)
What is XML?
  Another format for text files
  Hierarchical, not flat.
  Like HTML, based on SGML.
  Much stricter than HTML.
  Supported by a wide variety of tools
What does XML look like?
 <Student id="555">
 </Student>
 <Student id="556">

 Node – A single element that can enclose other elements.
 Attribute – A value that is stored as part of a node's begin tag.
 Beginning Tag – A tag that does not start or end with a /.
 Ending Tag – A tag that starts with a /.
 Begin-End Tag – A tag that ends, but does not start, with a /.
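Reading nodes and attributes takes only a few lines with the XML parser that ships with Java. The sketch below collects the id attribute of every Student node and shows how strictly XML is checked: a mismatched ending tag is rejected outright. The class name, the method name and the surrounding Students element in the usage are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Minimal sketch: read the id attribute of every Student node.
class XmlSketch {

    // Throws IllegalArgumentException if the text is not well-formed XML.
    static List<String> studentIds(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList students = doc.getElementsByTagName("Student");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < students.getLength(); i++) {
                ids.add(((Element) students.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Student 555 uses a beginning and an ending tag,
        // student 556 uses a single begin-end tag.
        String xml = "<Students><Student id=\"555\"></Student>"
                   + "<Student id=\"556\"/></Students>";
        System.out.println(studentIds(xml));
    }
}
```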
What are SOAP Messages
 Blocks of data that make up SOAP
 requests and responses
 Blocks of data are in XML
 SOAP can use a variety of underlying
 transfer protocols.
How are SOAP Messages Sent
  Always sent Asynchronous
What does SOAP Look Like?
        <m:GetCurrentTemperature xmlns:m="Some-URI">
        </m:GetCurrentTemperature>
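The fragment above is only the innermost part of a message. In SOAP 1.1 it travels inside a Body element, which sits inside an Envelope element. The sketch below wraps a method call that way using plain string building. The class and method names are assumptions, and the inputs are assumed to be trusted names that need no XML escaping.

```java
// Minimal sketch: wrap a method call in a SOAP 1.1 envelope.
class SoapEnvelopeSketch {
    // Namespace that identifies SOAP 1.1 envelope elements.
    static final String SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/";

    // Builds a request with an empty method element (no parameters).
    static String call(String method, String serviceNs) {
        return "<SOAP-ENV:Envelope xmlns:SOAP-ENV=\"" + SOAP_ENV_NS + "\">"
             + "<SOAP-ENV:Body>"
             + "<m:" + method + " xmlns:m=\"" + serviceNs + "\">"
             + "</m:" + method + ">"
             + "</SOAP-ENV:Body>"
             + "</SOAP-ENV:Envelope>";
    }

    public static void main(String[] args) {
        System.out.println(call("GetCurrentTemperature", "Some-URI"));
    }
}
```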
What is WSDL?
 Web Services Description Language
 A SOAP roadmap
 Used to describe the kinds of SOAP
 messages a service expects.
