A Reflection on Spiders, Bots and Aggregators
An Independent Study by Jeff Heaton
Advised by Bill Darte

Presented by Jeff Heaton
Email: firstname.lastname@example.org
Web: http://www.jeffheaton.com

Upcoming Book by Presenter
Programming Spiders, Bots, and Aggregators in Java
by Jeff Heaton
Published in March 2002.
Paperback - 512 pages
1st edition (March 2002)
Sybex; ISBN: 0782140408

Basic Terminology
• Spider
• Robot (Bot)
• Aggregator
• Agent
• Intelligent agent

An Overview

The HTTP Protocol
• Bots must navigate web sites
• The HTTP protocol is the basic mode of transportation for web pages
• A bot must use the HTTP protocol

An HTTP Request
    GET /grindex.asp HTTP/1.1
    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, */*
    Accept-Language: en-us
    Accept-Encoding: gzip, deflate
    User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
    Host: www.classinfo.net
    Connection: Keep-Alive
    Cookie: ASPSESSIONIDGGGGQHPK=BHLGFGOCHAPALILEEMNIMAFG

Types of HTTP Requests
• GET – most common, used to download a single resource.
• POST – used to submit the data entered into a FORM.
• HEAD – least common, returns only the headers and is used to verify the existence of a web page.

An HTTP Response
    HTTP/1.1 200 OK
    Connection: Keep-Alive
    Server: Microsoft-IIS/4.0
    Content-Type: text/html
    Cache-control: private
    Transfer-Encoding: chunked
    Via: 1.1 c760 (NetCache 4.1R4D1)
    Date: Tue, 13 Mar 2001 03:55:05 GMT

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
    <HTML>
    ... the rest of the HTML document ...

Retrieving a Web Page
• Most web pages are a mixture of text and graphics.
• First the web browser downloads the HTML page.
• Then every image contained in that page is downloaded.

A Typical Web Page

HTTP Messages

Example: HTTP

Building a Bot
• A bot retrieves information from a web site.
• Often a bot can be used to monitor a page.
• The BBS bot.

A Typical BBS

Example: Watch BBS Bot

What is a Spider?
• A spider is a specialized bot that moves from web page to web page.
• A spider takes its name from the arachnid: both travel across a web.

Page Queues
• Waiting queue – the page is waiting to be downloaded.
• Running queue – the page is downloading.
• Error queue – the page resulted in an error.
• Complete queue – the page has been downloaded, and should not be redownloaded.

Page State Transition

A Typical Web Spider Flowchart

Example: Download a Site

Business Issues of Spiders, Bots and Aggregators
• Two sides of the same coin
• How your company can use bots
• How bots can be used against your company

Uses for Bots
• Tracking shipments
• Account aggregation
• Reputation monitoring
• Indexing/searching
• Monitoring reliability

Sites That Extensively Use Bots
• AltaVista Babel Fish
• Yodlee
• Pricewatch
• Google

AltaVista Babel Fish
• Used to translate a site between a number of languages
• Directly integrated with AltaVista
• Located at http://world.altavista.com

Yodlee
• Aggregates many on-line accounts into one
• ASP model: most users access Yodlee through an intermediary
• Located at http://www.yodlee.com

Pricewatch
• Used to compare prices from multiple vendors
• Has encountered some legal problems
• Located at http://www.pricewatch.com

Google
• Primary business is searching
• Bots crawl web pages and build the search index

Why Be Friendly to Bots
• Allow your site to be indexed into search engines
• Allow customers to access your data in new ways
• If you use bots yourself

Bot Friendly Sites
• Meta tags for bots to locate
• Friendly robots.txt files
• Terms of service agreements that allow bot usage

Meta Tags
    <BASE HREF="http://www.wustl.edu/">
    <HTML>
    <HEAD>
    <META HTTP-EQUIV="content-type" content="text/html;charset=iso-8859-1">
    <TITLE>Washington University in St. Louis Home Page </TITLE>
    <META NAME="description" content="Washington University in St. Louis Official Home Page">
    <META NAME="keywords" content="Washington University, Washington University in St. Louis,
    Wash U, WU, WUSTL, WUStL, Wash. U., University Washington, university, universities,
    American Universities, St.
    Louis, Missouri, education, research, higher education, undergraduate, university libraries,
    academic, college, colleges, Midwestern universities, online applications, health care,
    medicine, academic, academics, campus, students, university students, college students,
    Washington University School of Architecture, Washington University School of Art,
    Washington University Arts and Sciences, Washington University School of Business,
    John M. Olin School of Business, Washington University School of Engineering and Applied Science,
    Washington University School of Law, Washington University School of Medicine,
    Washington University School of Social Work, George Warren Brown School of Social Work,
    University College, College of Arts and Sciences">

Robots.txt
    # robots, scram
    User-agent: *
    Disallow: /cgi-bin
    Disallow: /TRANSCRIPTS
    Disallow: /development
    Disallow: /third
    Disallow: /beta
    Disallow: /java
    Disallow: /shockwave
    Disallow: /JOBS
    Disallow: /pr
    Disallow: /Interactive
    Disallow: /alt_index.html
    Disallow: /webmaster_logs
    Disallow: /newscenter
    Disallow: /virtual
    Disallow: /DIGEST
    Disallow: /QUICKNEWS
    Disallow: /SEARCH

A Friendly TOS Agreement
Use of Data and Products
The information on government servers are in the public domain, unless specifically annotated otherwise, and may be used freely by the public. Before using information obtained from this server special attention should be given to the date & time of the data and products being displayed. This information shall not be modified in content and then presented as official government material.
(from the National Weather Service)

An Unfriendly TOS Agreement
User will not access any software or data provided via indirect means or any method not intended or agreed upon by PCQuote. Robot programs (automated query systems) are strictly prohibited and any use of such systems will result in immediate termination of access.
(from PCQuote.com)

Bot Ethics
• Do unto others as you would have them do unto you.
• Do unto others as the law would have you do unto them.

Detecting Bots
• User Agent Name - What user agent name are you specifying for your bot? If you are not using anonymous access, your bot will stand out easily on an access log.
• Frequency of Access - How often are you accessing the site, and is it always from the same IP address? A very large volume of accesses from the same IP address is usually a telltale sign of a bot or spider.
• Access Method - How is the bot accessing the site? Is it only pulling text files and not downloading any images? Web browsers used by human users will almost always download all of the images too. A bot typically only goes after the text.

Web Site Hostility
• Usenet Postings - The web master can make Usenet postings to defame your bot and site. If your bot is an annoyance, most web masters will want to warn other web masters.
• Legal Measures - If you are violating their terms of service, they may bring legal action against you.
• Bot Exclusion File - We've already examined the robots.txt file. By using this file the web master can request your spider to leave the site alone. This is the mildest means of thwarting a spider or bot. If this method fails, a more severe alternative will likely be pursued.
• Filter based on IP - If a large volume of traffic is coming from a single IP address, that IP address could be denied access.
• Filter based on Agent Name - If a large volume of traffic is coming from a single agent name, that agent name can be denied access.

The Future of Bots
• More consistent web sites
• XML
• SOAP

What is SOAP?
Main Entry: 1soap
Pronunciation: 'sOp
Function: noun
Etymology: Middle English sope, from Old English sApe; akin to Old High German seifa soap
Date: before 12th century
1 a : a cleansing and emulsifying agent made usually by action of alkali on fat or fatty acids and consisting essentially of sodium or potassium salts of such acids b : a salt of a fatty acid and a metal
2 : SOAP OPERA
(from Merriam-Webster's 10th Collegiate Dictionary)

No, Really. What is SOAP?
Main Entry: 1soap
Pronunciation: 'sOp
Function: protocol
Etymology: Acronym for Simple Object Access Protocol.
Date: before 21st century
1 : A cross-platform, XML-based protocol used to access objects that may not reside on the same local system.
2 : SOAP OPERA : When different vendors' implementations are not compatible.

What does SOAP Offer
• XML based
• Can use many different transfer protocols (e.g. HTTP, SMTP)
• Asynchronous
• Distributed Systems

Who is involved in SOAP
• Originated with Microsoft and its partners; submitted to and governed by the W3C.
• Sun is incorporating SOAP into Java via JAXM (Java API for XML Messaging)
• Microsoft is incorporating SOAP into the .NET platform.

SOAP Components
• XML Based SOAP Messages
• Web Services Description Language (WSDL)

What is XML?
• Another format for text files
• Hierarchical, not flat.
• Like HTML, based on SGML.
• Much stricter than HTML.
• Supported by a wide variety of tools

What does XML look like?
    <StudentList>
      <Student id="555">
        <first>Tom</first>
        <last>Smith</last>
        <middle></middle>
      </Student>
      <Student id="556">
        <first>Regina</first>
        <last>Smith</last>
        <middle>A</middle>
      </Student>
    </StudentList>
• Node – A single element that can enclose other elements.
• Attribute – A value that is stored as part of a node's beginning tag.
• Beginning Tag – A tag that does not start or end with a /.
• Ending Tag – A tag that starts with a /.
• Begin-End Tag – A tag that ends, but does not start, with a /.
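
The XML terms above can be illustrated with a short program. What follows is only a minimal sketch, written in Python as an illustrative choice (the talk itself is about Java) and using the standard library's xml.etree.ElementTree parser. The helper name list_students and the constant STUDENT_XML are hypothetical and are not part of the original slides.

```python
import xml.etree.ElementTree as ET

# The student list from the slide, with matching beginning and ending tags.
STUDENT_XML = """
<StudentList>
  <Student id="555">
    <first>Tom</first>
    <last>Smith</last>
    <middle></middle>
  </Student>
  <Student id="556">
    <first>Regina</first>
    <last>Smith</last>
    <middle>A</middle>
  </Student>
</StudentList>
"""


def list_students(xml_text):
    """Return one dict per <Student> node (hypothetical helper)."""
    root = ET.fromstring(xml_text)  # the root node, <StudentList>
    students = []
    for node in root.findall("Student"):  # each enclosed <Student> node
        students.append({
            # The id attribute is stored in the node's beginning tag.
            "id": node.get("id"),
            # findtext returns "" for an empty element such as <middle></middle>.
            "first": node.findtext("first", default=""),
            "last": node.findtext("last", default=""),
            "middle": node.findtext("middle", default=""),
        })
    return students


if __name__ == "__main__":
    for student in list_students(STUDENT_XML):
        print(student)
```

Because XML is much stricter than HTML, a mismatched ending tag makes the parser raise an error rather than guess at the intended structure, which is one reason bots find XML easier to consume than HTML.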
What are SOAP Messages
• Blocks of data that make up SOAP requests and responses
• Blocks of data are in XML
• SOAP can use a variety of underlying transfer protocols.

How are SOAP Messages Sent
• Always sent asynchronously
• HTTP
• SMTP

What does SOAP Look Like?
    <SOAP-ENV:Envelope
        xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
        SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
      <SOAP-ENV:Body>
        <m:GetCurrentTemperature xmlns:m="Some-URI">
          <symbol>KSTL</symbol>
        </m:GetCurrentTemperature>
      </SOAP-ENV:Body>
    </SOAP-ENV:Envelope>

What is WSDL?
• Web Services Description Language
• A SOAP roadmap
• Used to describe the kinds of SOAP messages a service expects.

Questions?
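
As a closing illustration of the envelope shown under "What does SOAP Look Like?", the sketch below builds the same GetCurrentTemperature request and reads the symbol back out of it. It is written in Python with the standard library only, as an assumption for illustration. The helper names build_request and read_symbol are hypothetical, "Some-URI" is the placeholder namespace from the slide, and no message is actually sent over HTTP or SMTP.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
SOAP_ENC = "http://schemas.xmlsoap.org/soap/encoding/"
SERVICE_NS = "Some-URI"  # placeholder namespace, exactly as on the slide

# Prefix map used when searching a parsed message.
NAMESPACES = {"soap": SOAP_ENV, "m": SERVICE_NS}


def build_request(symbol):
    """Build a GetCurrentTemperature request as bytes (hypothetical helper)."""
    # Ask ElementTree to serialize with the prefixes used on the slide.
    ET.register_namespace("SOAP-ENV", SOAP_ENV)
    ET.register_namespace("m", SERVICE_NS)

    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    envelope.set(f"{{{SOAP_ENV}}}encodingStyle", SOAP_ENC)
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}GetCurrentTemperature")
    ET.SubElement(call, "symbol").text = symbol
    return ET.tostring(envelope, encoding="utf-8")


def read_symbol(message):
    """Extract the <symbol> value from a request built above."""
    root = ET.fromstring(message)
    call = root.find("soap:Body/m:GetCurrentTemperature", NAMESPACES)
    if call is None:
        raise ValueError("message has no GetCurrentTemperature call")
    return call.findtext("symbol")


if __name__ == "__main__":
    request = build_request("KSTL")
    print(request.decode("utf-8"))
    print(read_symbol(request))
```

A real client would additionally transmit these bytes over one of the transfer protocols listed above and parse the response envelope in the same way; that step is left out here because the slides do not name a live service.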