Posted: 11/20/2011
4Test Web Spider

A Web spider is a tool that recursively scans all links on a given Web site, down to a specified
level. This kind of tool is extremely useful for finding ‘dead links’ within your Web site. I have
provided a separate file (webscanner.t) for you with the remainder of the code.



Please note: there are two caveats. First, this code is not perfect. Some links may not be
traversed, but it is a good starting point in helping you use 4Test for a common task. Second,
this code runs extremely well on Windows NT but not as well on Windows 9x.



Here is the code:



First, some type definitions, dll function declarations, and global variables:

// Type Definitions
type BOOL is INT
type BYTE is UNSIGNED CHAR
type WORD is UNSIGNED SHORT
type DWORD is UNSIGNED LONG
type UINT is UNSIGNED INT
type HWND is UINT
type LPSTR is STRING
type LPCSTR is STRING
type WPARAM is UINT
type LPARAM is LONG, STRING



dll "wininet.dll"
    INT InternetOpenA(LPCSTR sAgent, DWORD AccessType, LPCSTR sProxy optional, LPCSTR sProxyBypass optional, DWORD Flags)
    INT InternetConnectA(INT hConnection, LPCSTR sHost, INT iPort, LPCSTR sUsername optional, LPCSTR sPassword optional, DWORD Service, DWORD Flags, DWORD Context)
    INT InternetOpenUrlA(INT hConnection, LPCSTR sURL, LPCSTR sHeaders optional, DWORD iHeaderLength, DWORD Flags, DWORD Context)
    BOOL InternetReadFile(INT hConnection, out LPCSTR sData, DWORD iNumBytes, out DWORD iLength)
    BOOL InternetCloseHandle(INT hConnection)
    BOOL HttpQueryInfoA(INT hFile, DWORD InfoLevel, LPCSTR sBuffer, DWORD iLength, inout DWORD iIndex)
    BOOL InternetQueryDataAvailable(INT hConnection, out DWORD iNumBytes, DWORD Flags, DWORD Context)
    INT HttpOpenRequestA(INT hConnection, LPCSTR sVerb null, LPCSTR sObject, LPCSTR sVersion, LPCSTR sRefererUrl optional, LPCSTR sAcceptTypes optional, DWORD Flags, DWORD Context)
    BOOL HttpSendRequestA(INT hRequest, LPCSTR sHeaders optional, DWORD dHeaderLength, LPCSTR lOptional optional, DWORD dOptLength)



dll "kernel32.dll"
    DWORD GetLastError()

type MSG_TYPE is enum
    MSG_SUCCESS
    MSG_OUTSIDEDOMAIN
    MSG_MAILADDR
    MSG_RETEST
    MSG_ERROR

type ProcessedLink is record
    STRING sUrlName
    STRING sMsgCode
    MSG_TYPE mtMsgType

LIST OF ProcessedLink lplBeenProcessed = {}
LIST OF ProcessedLink lplMailLinks = {}
LIST OF ProcessedLink lplToRetest = {}
LIST OF ProcessedLink lplErrorLinks = {}
LIST OF ProcessedLink lplOutsideDomain = {}
INTEGER iDepthToProcess





The above code may appear confusing at first. We are mapping C functions from WININET
(Microsoft’s library of Internet functions) and KERNEL32 to functions that 4Test can call. The
dll declaration lets us point to a given DLL and reference any public function that the DLL
exposes.
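
To make the mapping concrete, here is one of the declarations from above placed next to the C prototype it stands for. The C prototype is the one published for WinINet; representing the HINTERNET handle as a 4Test INT follows the convention this article already uses, and the BOOL type comes from the type definitions above. This is only an illustration and does not need to be added to the script again.

```4test
// C prototype (wininet.h):
//     BOOL InternetCloseHandle(HINTERNET hInternet);
//
// Equivalent 4Test declaration, as used in this article.
// HINTERNET is an opaque handle; here it is carried in an INT.
dll "wininet.dll"
    BOOL InternetCloseHandle (INT hConnection)
```

Once declared this way, the function is called like any other 4Test function, for example InternetCloseHandle (hConn).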

Let’s discuss the function we use to check for the existence of a given Web page:



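BOOLEAN IsWebPageAvailable(STRING sUrl, inout INTEGER iErrorLevel)
    INT hConn
    INT hFile
    STRING sUserAgent=""
    DWORD iMaxLength=1
    DWORD Flags=0
    BOOLEAN bAvailable=FALSE

    // Open a WinINet session; access type 1 is INTERNET_OPEN_TYPE_DIRECT
    hConn = InternetOpenA(sUserAgent, 1, null, null, 0)
    if (hConn == 0)
        iErrorLevel = GetLastError()
        return FALSE

    // Try to open the URL; a zero handle means the page could not be opened
    hFile = InternetOpenUrlA(hConn, sUrl, null, iMaxLength, Flags, 1)
    if (hFile != 0)
        iErrorLevel = 0
        bAvailable = TRUE
        InternetCloseHandle(hFile)
    else
        // GetLastError is only meaningful after a failed call
        iErrorLevel = GetLastError()

    // Always release the session handle before returning
    InternetCloseHandle(hConn)
    return bAvailable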





This function takes two arguments: a STRING representing the URL we are interested in
processing, and an INOUT variable that receives the error code returned from the WININET
function calls. Note that INOUT variables are passed into a function, can be modified inside it,
and are passed back out. I originally declared this argument inout because I was not sure whether
the function would need to read it as well as set it. In practice the function only sets it, and I
rarely use anything but the return value, so it really should have been an OUT variable.
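
The difference between the pass modes can be shown with a few throwaway functions. This is a minimal sketch; ShowValue, GetAnswer, and AddOne are hypothetical names invented for the illustration and are not part of webscanner.t.

```4test
// in (the default): the function receives a copy of the value
ShowValue (INTEGER iValue)
    Print (iValue)

// out: the function only assigns the argument
GetAnswer (out INTEGER iValue)
    iValue = 42

// inout: the function reads the argument and then updates it
AddOne (inout INTEGER iValue)
    iValue = iValue + 1

main ()
    INTEGER i
    GetAnswer (i)    // i is now 42
    AddOne (i)       // i is now 43
    ShowValue (i)    // prints 43
```

Under this reading, an error code that is only ever written by the function, as in IsWebPageAvailable (), fits the out mode.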



Let me briefly describe the flow of the function. First, we open an Internet connection using the
InternetOpenA function. If this function fails it returns 0 (hence the check for 0); otherwise it
returns a connection handle. We then open the requested page (passed in with sUrl) using
InternetOpenUrlA. Because we never call InternetReadFile, the page body is not read, which gives
us a fairly quick response and minimizes read time. (Note that iMaxLength is passed in the
iHeaderLength position of InternetOpenUrlA, so it does not itself limit the amount of data read.)
We then close the Internet connection and return TRUE if we successfully connected or FALSE if
we failed.
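
A call to the function might look like the following. This is a sketch only: the testcase name and the URL are placeholders chosen for the example, and it assumes the declarations and IsWebPageAvailable () shown above are in the same script.

```4test
testcase CheckOnePage () appstate none
    INTEGER iErrorLevel = 0
    STRING sUrl = "http://www.example.com/"    // placeholder URL

    if (IsWebPageAvailable (sUrl, iErrorLevel))
        Print ("{sUrl} is available")
    else
        // Record the failure in the results file along with the error code
        LogError ("{sUrl} is not available (error {iErrorLevel})")
```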



The next function is used to extract the Web page’s HTML so that I can later parse it for the links:



//This function is passed a URL and returns, as a string, the HTML of the document
STRING ReturnWebPage(STRING sUrl)
    INT hConn
    INT hFile
    STRING sUserAgent=""
    STRING sData
    DWORD iNumBytes
    DWORD iMaxLength=255
    DWORD iIndex=0
    DWORD Flags=0
    DWORD Context=0
    STRING sOutput=""
    INT iLength=0

    hConn = InternetOpenA(sUserAgent, 1, null, null, 0)

    if (hConn > 0)
        hFile = InternetOpenUrlA(hConn, sUrl, null, iMaxLength, Flags, Context)
        if (hFile > 0)
            //Read the content of the file pointed to by the URL we're dealing with
            while(InternetReadFile(hFile, sData, iMaxLength, iNumBytes) && (iNumBytes > 0))
                //Read iMaxLength of data at a time, constructing the string as we go
                iLength += iNumBytes
                sOutput += Left(sData, iNumBytes)
            InternetCloseHandle (hFile)
        InternetCloseHandle (hConn)

    return (sOutput)





Note that this function is nearly identical to IsWebPageAvailable(). We are again opening an
Internet connection and opening the URL. Only this time we read the file up to 255 bytes at a
time (the value of iMaxLength), building up a STRING that holds the page’s HTML. We use the
function InternetReadFile to read the HTML of the page we are interested in from our file stream.
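
As a quick check that the function returns something sensible, it can be called on its own before any parsing is attempted. A minimal sketch, assuming ReturnWebPage () and the declarations above are in the same script; the URL is a placeholder.

```4test
main ()
    STRING sHtml

    sHtml = ReturnWebPage ("http://www.example.com/")    // placeholder URL
    if (Len (sHtml) == 0)
        // An empty string means the connection, open, or read failed
        Print ("No data returned")
    else
        Print ("Read {Len (sHtml)} characters")
        Print (Left (sHtml, 80))    // show the start of the page
```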



As I stated, there is too much code to cover all of it in this article, but let me discuss what
some of the other functions do:



LIST OF STRING GetLinks(STRING)



The GetLinks () function processes the HTML for the given page. It scans through the
document, looking for every href attribute, and then extracts the link that is found within the
double quotes.
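
The scanning idea can be sketched in a few lines. This is not the GetLinks () from webscanner.t; GetLinksSketch is a hypothetical name, and the sketch assumes the HTML has already had its whitespace removed so that each link appears as href="...".

```4test
// Hypothetical sketch of href extraction; the real code is in webscanner.t
LIST OF STRING GetLinksSketch (STRING sHtml)
    LIST OF STRING lsLinks = {}
    STRING sQuote
    STRING sMarker
    STRING sRest
    INTEGER iStart
    INTEGER iEnd

    sQuote = Chr (34)               // the double-quote character
    sMarker = "href=" + sQuote
    sRest = sHtml

    iStart = StrPos (sMarker, Lower (sRest))
    while (iStart > 0)
        // Skip past href=" so that sRest begins with the link text
        sRest = SubStr (sRest, iStart + Len (sMarker))
        iEnd = StrPos (sQuote, sRest)
        if (iEnd == 0)
            break                   // no closing quote: stop scanning
        if (iEnd > 1)
            ListAppend (lsLinks, SubStr (sRest, 1, iEnd - 1))
        // Continue after the closing quote
        sRest = SubStr (sRest, iEnd + 1)
        iStart = StrPos (sMarker, Lower (sRest))

    return lsLinks
```

Links written with single quotes or without quotes are not found by this sketch.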



private STRING StripChars(STRING)



The StripChars () function is used to strip out all spaces, line feeds, and carriage returns from
the returned HTML page. We do this so that we can more easily parse out the links on a page.
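
One way to do this kind of stripping is with the built-in StrTran function. A minimal sketch under that assumption; StripCharsSketch is a hypothetical name and the real StripChars () in webscanner.t may work differently.

```4test
// Hypothetical sketch: remove spaces, line feeds, and carriage returns
STRING StripCharsSketch (STRING sHtml)
    STRING sOut

    sOut = StrTran (sHtml, " ", "")        // spaces
    sOut = StrTran (sOut, Chr (10), "")    // line feeds
    sOut = StrTran (sOut, Chr (13), "")    // carriage returns
    return sOut
```

Note that removing every space also removes spaces inside the links themselves, which is one reason a spider built this way can miss some links.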



AddThisURLToProcessedList (STRING, STRING, MSG_TYPE)



The AddThisURLToProcessedList () function is used to add the link to its appropriate LIST
(see the LISTs defined in the declarations section at the beginning).
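
A sketch of how such a function could route a link to a list is shown below. The mapping from message type to list is an assumption based only on the list names; AddThisURLToProcessedListSketch is a hypothetical name, and the real function in webscanner.t may behave differently.

```4test
// Hypothetical sketch; the type-to-list mapping is assumed from the list names
AddThisURLToProcessedListSketch (STRING sUrl, STRING sMsgCode, MSG_TYPE mtType)
    ProcessedLink plLink

    plLink.sUrlName = sUrl
    plLink.sMsgCode = sMsgCode
    plLink.mtMsgType = mtType

    switch (mtType)
        case MSG_MAILADDR
            ListAppend (lplMailLinks, plLink)
        case MSG_RETEST
            ListAppend (lplToRetest, plLink)
        case MSG_ERROR
            ListAppend (lplErrorLinks, plLink)
        case MSG_OUTSIDEDOMAIN
            ListAppend (lplOutsideDomain, plLink)
        default
            // MSG_SUCCESS and anything else
            ListAppend (lplBeenProcessed, plLink)
```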



BOOLEAN HasThisURLBeenProcessed (STRING)



The HasThisURLBeenProcessed () function is used to determine whether or not we have already
processed this page.
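
A check like this can be a simple scan of a global list. The sketch below assumes that only lplBeenProcessed needs to be searched, which may not match webscanner.t; HasThisURLBeenProcessedSketch is a hypothetical name.

```4test
// Hypothetical sketch: look for the URL in the list of processed links
BOOLEAN HasThisURLBeenProcessedSketch (STRING sUrl)
    ProcessedLink plLink

    for each plLink in lplBeenProcessed
        if (plLink.sUrlName == sUrl)
            return TRUE
    return FALSE
```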



BOOLEAN IsKnownProtocol (STRING)

The IsKnownProtocol () function takes a given link, passed in as a string, and returns a
Boolean indicating whether or not the link is using a protocol that we’re aware of (such as HTTP,
HTTPS, FTP, etc.).
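
A prefix comparison is enough to sketch this check. The list of prefixes below is an assumption covering only the three protocols named above, and IsKnownProtocolSketch is a hypothetical name.

```4test
// Hypothetical sketch: compare the start of the link with known prefixes
BOOLEAN IsKnownProtocolSketch (STRING sLink)
    LIST OF STRING lsProtocols = {"http://", "https://", "ftp://"}
    STRING sProtocol

    for each sProtocol in lsProtocols
        // Compare in lower case so that HTTP:// also matches
        if (Lower (Left (sLink, Len (sProtocol))) == sProtocol)
            return TRUE
    return FALSE
```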



STRING FixRelativeLinks(STRING, STRING)



The function FixRelativeLinks () is the weak link in the Web spider. This function takes a
relative link (e.g. ../document.html) from a page and attempts to create an absolute link from it
(since we need the absolute link to actually walk the link). I currently have approximately 80
lines of code to process the most common formats of relative links (such as ../document.html,
/document.html, etc.), but unfortunately there are many that this code does not yet handle. Since
it would take another whole article to cover this function alone, I will not be able to cover it in
this column.
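
To give a feel for the problem, here is a much smaller sketch that handles only three shapes of link: /document.html, ../document.html, and document.html. It is not the FixRelativeLinks () from webscanner.t. The function names, the argument order, and the assumption that sPageUrl is an absolute URL containing "://" are all made up for this illustration.

```4test
// Hypothetical helper: position of the last "/" in s, or 0 if there is none
INTEGER LastSlashPos (STRING s)
    INTEGER i

    for i = Len (s) to 1 step -1
        if (SubStr (s, i, 1) == "/")
            return i
    return 0

// Hypothetical sketch; sPageUrl is the absolute URL of the page holding sLink
STRING FixRelativeLinkSketch (STRING sPageUrl, STRING sLink)
    STRING sRoot    // for example "http://host"
    STRING sDir     // for example "http://host/dir/"
    STRING sRest
    INTEGER iHostStart
    INTEGER iSlash

    // Split "scheme://host" from the path part of the page URL
    iHostStart = StrPos ("://", sPageUrl) + 3
    iSlash = StrPos ("/", SubStr (sPageUrl, iHostStart))
    if (iSlash == 0)
        sRoot = sPageUrl    // the page URL has no path at all
        sDir = sPageUrl + "/"
    else
        sRoot = Left (sPageUrl, iHostStart + iSlash - 2)
        sDir = Left (sPageUrl, LastSlashPos (sPageUrl))

    // A leading "/" means the link is relative to the site root
    if (Left (sLink, 1) == "/")
        return sRoot + sLink

    // Each leading "../" removes one directory, never going above the root
    sRest = sLink
    while (Left (sRest, 3) == "../")
        sRest = SubStr (sRest, 4)
        if (Len (sDir) > Len (sRoot) + 1)
            sDir = Left (sDir, LastSlashPos (Left (sDir, Len (sDir) - 1)))

    return sDir + sRest
```

With the page http://host/dir/page.html, the link ../document.html resolves to http://host/document.html and /document.html resolves to the same address. Forms such as ./document.html, query strings, and anchors are left unhandled here.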



BOOLEAN IsMailAddress (STRING)



The IsMailAddress () function is used to determine if the link is using the mailto protocol,
indicating that this is an email address that must be tested manually.
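
This check reduces to a prefix test. A minimal sketch; IsMailAddressSketch is a hypothetical name.

```4test
// Hypothetical sketch: a link is a mail address if it starts with "mailto:"
BOOLEAN IsMailAddressSketch (STRING sLink)
    return (Lower (Left (sLink, 7)) == "mailto:")
```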



DisplayAddresses (LIST OF ProcessedLink)



DisplayAddresses () is simply a very small helper function used by the main testcase to
assist in writing out the list of URLs processed.
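
A helper of this kind can be little more than a loop around Print. The output format below is an assumption, and DisplayAddressesSketch is a hypothetical name.

```4test
// Hypothetical sketch: write each link and its message code to the results
DisplayAddressesSketch (LIST OF ProcessedLink lplLinks)
    ProcessedLink plLink

    for each plLink in lplLinks
        Print ("{plLink.sUrlName}  [{plLink.sMsgCode}]")
```

The main testcase could then call it once per list, for example DisplayAddressesSketch (lplErrorLinks).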



ProcessSubLevelLinks (STRING, INTEGER, LIST OF STRING)



This is the main recursive function for the Web spider. This function recursively gets all links on
a given page, checks whether each link is a mail address, stores the link once it has been
processed, and then moves down to the next level in the site.
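
The overall shape of the recursion can be sketched using the functions described above. The meaning of the three arguments (the page URL, the current level, and the links found on that page), the argument order of FixRelativeLinks (), and the message codes are all assumptions; ProcessSubLevelLinksSketch is a hypothetical name, and the real function in webscanner.t also has to deal with links outside the domain, which this sketch ignores.

```4test
// Hypothetical sketch of the recursive walk; argument meanings are assumed
ProcessSubLevelLinksSketch (STRING sUrl, INTEGER iLevel, LIST OF STRING lsLinks)
    STRING sLink
    STRING sAbsolute
    INTEGER iErrorLevel

    // Stop once the requested depth has been exceeded
    if (iLevel > iDepthToProcess)
        return

    for each sLink in lsLinks
        // Mail links are recorded for manual testing and not followed
        if (IsMailAddress (sLink))
            AddThisURLToProcessedList (sLink, "", MSG_MAILADDR)
            continue

        // Turn relative links into absolute ones before walking them
        if (IsKnownProtocol (sLink))
            sAbsolute = sLink
        else
            sAbsolute = FixRelativeLinks (sUrl, sLink)

        // Skip pages we have already visited to avoid endless loops
        if (HasThisURLBeenProcessed (sAbsolute))
            continue

        if (IsWebPageAvailable (sAbsolute, iErrorLevel))
            AddThisURLToProcessedList (sAbsolute, "OK", MSG_SUCCESS)
            // Recurse into the links found on the page we just reached
            ProcessSubLevelLinksSketch (sAbsolute, iLevel + 1, GetLinks (StripChars (ReturnWebPage (sAbsolute))))
        else
            AddThisURLToProcessedList (sAbsolute, "{iErrorLevel}", MSG_ERROR)
```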








