4Test Web Spider
A Web spider is a tool that recursively scans all the links on a given Web site, down to a
specified depth. This kind of tool is extremely useful for finding ‘dead links’ within your Web
site. I have provided a separate file (webscanner.t) for you with the remainder of the code.
Please note two caveats. First, this code is not perfect: some links may not be traversed, but it
is a good starting point for using 4Test on a common task. Second, this code runs extremely well
on Windows NT but not as well on Windows 9x.
Here is the code, starting with some type definitions, DLL function declarations, and global
variables:
// Type Definitions
type BOOL is INT
type BYTE is UNSIGNED CHAR
type WORD is UNSIGNED SHORT
type DWORD is UNSIGNED LONG
type UINT is UNSIGNED INT
type HWND is UINT
type LPSTR is STRING
type LPCSTR is STRING
type WPARAM is UINT
type LPARAM is LONG, STRING

// DLL function declarations
dll "wininet.dll"
    INT InternetOpenA(LPCSTR sAgent, DWORD AccessType, LPCSTR sProxy optional, LPCSTR sProxyBypass optional, DWORD Flags)
    INT InternetConnectA(INT hConnection, LPCSTR sHost, INT iPort, LPCSTR sUsername optional, LPCSTR sPassword optional, DWORD Service, DWORD Flags, DWORD Context)
    INT InternetOpenUrlA(INT hConnection, LPCSTR sURL, LPCSTR sHeaders optional, DWORD iHeaderLength, DWORD Flags, DWORD Context)
    BOOL InternetReadFile(INT hConnection, out LPCSTR sData, DWORD iNumBytes, out DWORD iLength)
    BOOL InternetCloseHandle(INT hConnection)
    BOOL HttpQueryInfoA(INT hFile, DWORD InfoLevel, LPCSTR sBuffer, DWORD iLength, inout DWORD iIndex)
    BOOL InternetQueryDataAvailable(INT hConnection, out DWORD iNumBytes, DWORD Flags, DWORD Context)
    INT HttpOpenRequestA(INT hConnection, LPCSTR sVerb null, LPCSTR sObject, LPCSTR sVersion, LPCSTR sRefererUrl optional, LPCSTR sAcceptTypes optional, DWORD Flags, DWORD Context)
    BOOL HttpSendRequestA(INT hRequest, LPCSTR sHeaders optional, DWORD dHeaderLength, LPCSTR lOptional optional, DWORD dOptLength)
dll "kernel32.dll"
    DWORD GetLastError()

type MSG_TYPE is enum
    MSG_SUCCESS
    MSG_OUTSIDEDOMAIN
    MSG_MAILADDR
    MSG_RETEST
    MSG_ERROR
type ProcessedLink is record
    STRING sUrlName
    STRING sMsgCode
    MSG_TYPE mtMsgType

// Global variables
LIST OF ProcessedLink lplBeenProcessed = {}
LIST OF ProcessedLink lplMailLinks = {}
LIST OF ProcessedLink lplToRetest = {}
LIST OF ProcessedLink lplErrorLinks = {}
LIST OF ProcessedLink lplOutsideDomain = {}
INTEGER iDepthToProcess
The above code may appear confusing at first. We are mapping the C function calls from
WININET (Microsoft’s library of Internet functions) and KERNEL32 to functions that may be
called by 4Test. The use of the DLL declaration above enables us to point to a given DLL and
reference any public functions exposed by the DLL.
Let’s discuss the function we use to check for the existence of a given Web page:
BOOLEAN IsWebPageAvailable(STRING sUrl, inout INTEGER iErrorLevel)
    INT hConn
    INT hFile
    STRING sUserAgent=""
    STRING sData
    DWORD iNumBytes
    DWORD iMaxLength=1
    DWORD Flags=0
    DWORD Context=0
    // Open an Internet session; a return value of 0 means failure
    hConn = InternetOpenA(sUserAgent, 1, null, null, 0)
    if (hConn > 0)
        hFile = InternetOpenUrlA(hConn, sUrl, null, iMaxLength, Flags, 1)
        iErrorLevel = GetLastError()
        // Release the page handle, if we got one, and then the session handle
        if (hFile > 0)
            InternetCloseHandle(hFile)
        InternetCloseHandle(hConn)
        if (iErrorLevel != 0 || hFile <= 0)
            return FALSE
        return TRUE
    // InternetOpenA itself failed
    iErrorLevel = GetLastError()
    return FALSE
This function takes two arguments, a STRING representing the URL we are interested in
processing and an INOUT variable representing the error code returned from the WININET
function calls. Note that INOUT variables are passed into a function, can be modified inside it,
and are then passed back out. I originally declared iErrorLevel as INOUT because I was not sure
how the function would use it. In practice the function only writes to it, and I generally rely on
the return value and rarely, if ever, look at the error code, so it really should have been an OUT
variable.
Let me briefly describe the flow of the function. First, we open an Internet connection using the
InternetOpenA function. If this function fails it returns 0 (hence the check for 0); otherwise it
returns a connection handle. We then open the requested page (passed in with sUrl) using
InternetOpenUrlA, passing iMaxLength as 1. Note that this argument is the length of the optional
headers string (null here) rather than a read limit; the check is quick because we never read the
body of the page. We then close the Internet connection and return TRUE if we successfully
connected or FALSE if we failed.
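To make the flow easier to try outside SilkTest, here is a minimal sketch of the same open-then-close check in Python rather than 4Test. It is not the code from webscanner.t: the function name, the opener parameter (which lets a stand-in be used instead of a real network call), and the -1 code for "no response" are all assumptions made for this example.

```python
import urllib.error
import urllib.request


def is_web_page_available(url, opener=urllib.request.urlopen, timeout=10):
    """Return (available, error_code) without reading the page body."""
    try:
        # Opening the URL sends the request; leaving the block closes the
        # connection straight away, so the body is never downloaded.
        with opener(url, timeout=timeout):
            return True, 0
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return False, err.code
    except (urllib.error.URLError, ValueError):
        # No answer at all (bad host, refused connection) or a malformed URL.
        # The -1 value is a hypothetical placeholder, not a WinInet code.
        return False, -1
```

As in the 4Test version, the caller gets both a yes/no answer and an error code, but here the code is returned rather than passed back through an argument.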
The next function is used to extract the Web page’s HTML so that I can later parse it for the
links:
//This function is passed a URL and returns, as a string, the HTML of the document
STRING ReturnWebPage(STRING sUrl)
    INT hConn
    INT hFile
    STRING sUserAgent=""
    STRING sData
    DWORD iNumBytes
    DWORD iMaxLength=255
    DWORD iIndex=0
    DWORD Flags=0
    DWORD Context=0
    STRING sOutput=""
    INT iLength=0
    hConn = InternetOpenA(sUserAgent, 1, null, null, 0)
    if (hConn > 0)
        hFile = InternetOpenUrlA(hConn, sUrl, null, iMaxLength, Flags, Context)
        if (hFile > 0)
            //Read the content of the file pointed to by the URL we're dealing with
            while(InternetReadFile(hFile, sData, iMaxLength, iNumBytes) && (iNumBytes > 0))
                //Read iMaxLength of data at a time, constructing the string as we go
                iLength += iNumBytes
                sOutput += Left(sData, iNumBytes)
            InternetCloseHandle (hFile)
        InternetCloseHandle (hConn)
    return (sOutput)
Note that this function is nearly identical to IsWebPageAvailable(). We again open an Internet
connection and open the URL, only this time we read the page up to 255 bytes at a time (the value
of iMaxLength), appending each chunk to a STRING that ends up holding the page’s HTML. The
function InternetReadFile does the reading from our file stream.
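The read loop is the part worth seeing in isolation. The following is a small Python sketch (not 4Test, and not taken from webscanner.t) of reading a stream in fixed-size chunks until an empty read signals the end. The name read_all and the use of an in-memory stream in place of a real connection are assumptions for illustration.

```python
import io


def read_all(stream, chunk_size=255):
    """Read a binary stream chunk_size bytes at a time and return it as text."""
    pieces = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read means end of data, like iNumBytes == 0
            break
        pieces.append(chunk)
    # Decode once at the end so that a multi-byte character split across two
    # chunks is not corrupted.
    return b"".join(pieces).decode("utf-8", errors="replace")


page = io.BytesIO(b"<html>" + b"x" * 600 + b"</html>")
html = read_all(page)
print(len(html))  # 613
```

Collecting the chunks in a list and joining them once avoids repeatedly copying a growing string, which matters on large pages.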
As I stated, there is too much code to cover in its entirety in this article, but let me discuss
what some of the other functions do:
LIST OF STRING GetLinks(STRING)
The GetLinks () function processes the HTML for the given page. It scans through the
document, looking for all ‘href’ attributes, and extracts the link found within the double quotes
of each one.
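One way to picture this kind of extraction is with a regular expression. The sketch below is in Python rather than 4Test and is not the author's GetLinks () implementation; the name get_links and the pattern are assumptions. Like the description above, it only handles values in double quotes.

```python
import re

# Matches href="..." in any letter case, tolerating spaces around "=".
HREF_PATTERN = re.compile(r'href\s*=\s*"([^"]*)"', re.IGNORECASE)


def get_links(html):
    """Return every double-quoted href value in html, in document order."""
    return HREF_PATTERN.findall(html)


sample = '<a href="index.html">Home</a> <A HREF="mailto:me@example.com">Mail</A>'
print(get_links(sample))  # ['index.html', 'mailto:me@example.com']
```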
private STRING StripChars(STRING)
The StripChars () function is used to strip out all spaces, line feeds, and carriage returns from the
returned HTML page. We do this so that we can more easily parse out the links on a page.
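As a rough illustration of this step, here is a Python one-liner (again not 4Test, and with an assumed name, strip_chars) that removes exactly those three characters.

```python
def strip_chars(text):
    """Remove every space, line feed, and carriage return from text."""
    return text.translate(str.maketrans("", "", " \n\r"))


print(strip_chars('<a\r\n  href = "a.html">'))  # <ahref="a.html">
```

Note that this also removes any spaces inside a link itself, which is one reason a spider built this way can miss some links.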
AddThisURLToProcessedList (STRING, STRING, MSG_TYPE)
The AddThisURLToProcessedList () function is used to add the link to its appropriate LIST
(see the LISTs defined in the declarations section at the beginning).
BOOLEAN HasThisURLBeenProcessed (STRING)
The HasThisURLBeenProcessed () function is used to determine whether or not we have already
processed this page.
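The bookkeeping behind these two functions can be sketched as follows. This is a Python illustration under my own assumptions, not the 4Test code from webscanner.t: it keeps one list per message type plus a set of every URL seen, and the names MsgType, links_by_type, and seen_urls are hypothetical.

```python
from enum import Enum


class MsgType(Enum):
    SUCCESS = 1
    OUTSIDE_DOMAIN = 2
    MAIL_ADDR = 3
    RETEST = 4
    ERROR = 5


# One list per outcome, loosely mirroring the global lists declared earlier.
links_by_type = {msg_type: [] for msg_type in MsgType}
# Assumption: a set of all URLs seen so far makes the "already done?" test cheap.
seen_urls = set()


def add_url_to_processed_list(url, msg_code, msg_type):
    """File the link under its outcome and remember that it was handled."""
    links_by_type[msg_type].append((url, msg_code))
    seen_urls.add(url)


def has_url_been_processed(url):
    """Return True if the link has already been filed under any outcome."""
    return url in seen_urls


add_url_to_processed_list("mailto:me@example.com", "", MsgType.MAIL_ADDR)
print(has_url_been_processed("mailto:me@example.com"))   # True
print(has_url_been_processed("http://www.example.com/"))  # False
```

Checking this before fetching a page is what keeps a spider from looping forever when two pages link to each other.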
BOOLEAN IsKnownProtocol (STRING)
The IsKnownProtocol () function takes a given link, passed in as a string, and returns a
Boolean indicating whether or not the link uses a protocol that we’re aware of (such as HTTP,
HTTPS, FTP, etc.).
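A minimal Python sketch of such a check (not 4Test) might look like this. The set of protocols is an assumption, since the article only names HTTP, HTTPS, and FTP as examples, and is_known_protocol is a hypothetical name.

```python
from urllib.parse import urlsplit

# Assumed set of protocols; the full list used by the spider is not shown.
KNOWN_PROTOCOLS = {"http", "https", "ftp"}


def is_known_protocol(link):
    """Return True if link starts with a scheme from KNOWN_PROTOCOLS."""
    return urlsplit(link).scheme.lower() in KNOWN_PROTOCOLS


print(is_known_protocol("https://www.example.com/a.html"))  # True
print(is_known_protocol("../document.html"))                # False
```

A relative link has no scheme at all, so it reports False here and has to be made absolute before it can be followed.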
STRING FixRelativeLinks(STRING, STRING)
The function FixRelativeLinks () is the weak link in the Web spider. This function takes a
relative link (e.g. ../document.html) from a page and attempts to create an absolute link from it
(since we need the absolute link to actually walk the link). I currently have approximately 80
lines of code to process the most common formats of relative links (such as ../document.html,
/document.html, etc.) but unfortunately there are many that this code does not yet handle. Since
it would take another whole article to cover this function alone, I will not be able to cover it
in this column.
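To show what "creating an absolute link" means for the formats mentioned above, here is a short Python example (not 4Test). It swaps in the standard library's urljoin routine rather than reproducing the roughly 80 lines of FixRelativeLinks (), and the wrapper name fix_relative_link and the example.com addresses are assumptions.

```python
from urllib.parse import urljoin


def fix_relative_link(page_url, link):
    """Resolve link against the URL of the page it was found on."""
    return urljoin(page_url, link)


base = "http://www.example.com/docs/guide/page.html"
print(fix_relative_link(base, "../document.html"))
# http://www.example.com/docs/document.html
print(fix_relative_link(base, "/document.html"))
# http://www.example.com/document.html
print(fix_relative_link(base, "other.html"))
# http://www.example.com/docs/guide/other.html
```

The three cases are a parent-directory link, a link relative to the site root, and a link relative to the current directory.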
BOOLEAN IsMailAddress (STRING)
The IsMailAddress () function is used to determine if the link is using the mailto protocol,
indicating that this is an email address that must be tested manually.
DisplayAddresses (LIST OF ProcessedLink)
DisplayAddresses () is simply a very small helper function used by the main testcase to
assist in writing out the list of URLs processed.
ProcessSubLevelLinks (STRING, INTEGER, LIST OF STRING)
This is the main recursive function for the Web spider. It gets all links on a given page, checks
whether each link is a mail address, stores the link once it has been processed, and then moves
down to the next level in the site.
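The overall recursion can be sketched in a few lines. What follows is a Python illustration (not 4Test, and not the ProcessSubLevelLinks () code from webscanner.t). The names process_links, fetch, and results are assumptions, and a small in-memory dictionary stands in for real HTTP requests so the example can be run anywhere.

```python
import re
from urllib.parse import urljoin, urlsplit

HREF_PATTERN = re.compile(r'href\s*=\s*"([^"]*)"', re.IGNORECASE)


def process_links(url, depth, fetch, results, domain=None):
    """Walk url and its links recursively, at most depth levels deep.

    fetch(url) must return the page HTML, or None if the page is unavailable.
    results maps each visited link to "ok", "error", "mail", or "outside".
    """
    domain = domain or urlsplit(url).netloc
    if url in results:
        return results  # already processed; this also breaks link cycles
    if url.lower().startswith("mailto:"):
        results[url] = "mail"  # must be checked by hand
        return results
    if urlsplit(url).netloc != domain:
        results[url] = "outside"  # recorded but not followed
        return results
    html = fetch(url)
    if html is None:
        results[url] = "error"  # a dead link
        return results
    results[url] = "ok"
    if depth > 0:
        for link in HREF_PATTERN.findall(html):
            process_links(urljoin(url, link), depth - 1, fetch, results, domain)
    return results


# A tiny in-memory "site" (hypothetical addresses) instead of a live server.
site = {
    "http://site.test/": '<a href="a.html">A</a> <a href="mailto:me@site.test">M</a>',
    "http://site.test/a.html": '<a href="/">Home</a> <a href="gone.html">Gone</a>',
}
for link, status in process_links("http://site.test/", 2, site.get, {}).items():
    print(status, link)
```

With the sample site, the run reports the two existing pages as ok, gone.html as an error, and the mailto link as mail. The depth argument plays the role of iDepthToProcess: a link found at depth 0 is still checked, but its own links are not followed.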