CSE 401 Senior Design Project


                        Automated Price Comparison Shopping Search Engine
                                          PriceHunter

                                         Elwin Chai, Rick Jones
                                     {elwin, rjones}@seas.upenn.edu
                                     Faculty Advisor: Dr. Zachary Ives
                                          zives@cis.upenn.edu

Abstract
In this paper, we explore the possibility of creating a product search engine that is able to
dynamically find commercial sites, independent of merchant feeds and other human involvement
in the management of internal databases. We briefly evaluate the constraints of current shopping
search engines and the benefits of offering a fully automated version. In addition, we consider
the application of JTidy, stemmers and wrappers to extract the relevant information from a
commercial website.

Introduction
In the past, the Internet could be thought of as just a repository of static information, and search
engines merely offered Internet users basic information retrieval. However, as the web has
evolved into a bustling marketplace where online transactions are the norm, there is a need for
more specific search capabilities. Ultimately, it is hoped that the ideal search engine reduces
search costs, in terms of both time and money, for consumers in a perfectly efficient market.

Numerous shopping search engines already exist (Sullivan, 2003), but they are mostly
constrained by an essentially static database of available products. PriceScan[4] lists products
from a manually updated database1, classified under static categories. Kelkoo[3] and Yahoo!
Shopping[5] use similar database frameworks, where merchants submit their products to be
classified manually2 by the search companies according to a predetermined structure.
Amazon[1] is a distributor that sells a wide range of products, but in reality it maintains a finite
database of either products in its inventory or registered re-sale products.

One search engine drew our attention because it seems to use a more dynamic approach to
searching. Froogle[2] led us to believe that a fully automated search engine was possible, because
we initially thought that it scours the web for relevant products on sale instead of utilizing a
static database. After closer investigation, we discovered that it too relies on merchant feeds,
like Yahoo! Shopping, and offers free listing of products. Moreover, Froogle has other
shortcomings that could be improved on (Mills, 2003).

In order to search over a database that is as dynamic as the growth of the Internet, we need to be
able to construct such a database directly from web content. Hence, we divided the process into 5
main steps: 1) exploring the web, 2) deciding relevancy of sites, 3) information extraction, 4)
database management and 5) information retrieval.

1
  http://www.pricescan.com/shoppingguides.asp
2
  Any method of database management that involves a case-by-case human decision is considered manual, whether
it is the merchant or the search company who makes the decision.

Given the extensive amount of research on the features of a search engine, there is already an
established base of methods for crawling the web, database management, and information
retrieval, which includes ranking a page based on a query (Lee et al, 1997). As such, the steps
that require more research are the areas of site relevancy and information extraction. Some
research has been done on the automatic classification of websites (Pierre, 2001), but it has not
concentrated specifically on commercial sites. Nonetheless, an important observation was made
there: when determining the relevancy of a web page, metadata provide critical information
beyond the plain content of the page. An example of metadata is whether a word is displayed in
bold or in the title. Google captures some pieces of metadata in a bitmap format for every
keyword (Brin and Page, 1998).

Information extraction, on the other hand, has been explored through the use of wrappers
(Kushmerick et al, 1997). There are even proposed toolkits to help construct a wrapper
(Sahuguet and Azavant, 1999), which laid out the fundamentals of wrapper creation that helped
us in our own practical implementation. It should be noted that even though there is general
consensus as to the need for wrappers, the initially proposed wrappers seem far too site-specific
and idealistic to be implemented in practice. Moreover, there is still debate over how effective
and practical they can be (Kushmerick, 2000). Nevertheless, we sought to integrate concepts
from wrappers to aid us in data extraction.

The paper will first present the architecture with which we have chosen to build our search
engine. Then we proceed with a detailed explanation of our design choices in line with the steps
of the workflow, as well as the challenges we faced. Finally, we conclude with how future work can
augment the effectiveness of this project.

Architecture
To lay out the framework of the program, we track the flow of a single document (or web page)
through the system.

[Diagram 1: System architecture. A page flows from the Internet through the Webcrawler; its
links are handled by the Frontier Manager, and the Heuristics Manager decides its relevancy.
Extracted keywords pass through the Stemmer and the Keyword Manager to the Database
Manager, which maintains the database tables (docInfo, DB, docSearch, robots), and Search
answers user queries against the database.]
                                            Diagram 1

A web page is first processed by the Webcrawler, which extracts the links. The links are in turn
managed by the Frontier Manager to ensure the crawler has a steady supply of documents to
process. The document is then passed on to the Heuristics Manager, which decides if the page is
a commercial website selling a product. The Heuristics Manager passes the page on to extract its
information. Information extraction occurs in two stages: 1) extracting the price information of
the website and 2) extracting the keywords within the document. The keywords are first
stemmed and then packaged using Keyword Manager before they are inserted into the database
through the Database Manager. Finally Search uses the dynamically constructed database to
answer user queries.

Webcrawler
All operations begin with the Webcrawler. We based our crawler design on the Mercator model
(Heydon and Najork, 1999), as illustrated in the figure below.




                 [Diagram 2: The Mercator crawler model (Heydon and Najork, 1999)]

Thus, the Webcrawler starts with a given set of seed URLs. From these URLs it proceeds to
extract the actual pages from the web to be processed. Processing in this case includes extracting
links to other pages and passing the page on to decide its desirability. If the page is found
to be desirable, it is passed down the pipeline in order to extract its information. The crawler
terminates after repeating this process for a predetermined number of pages. If any page has
already been processed by the crawler, it is not passed on to the rest of the workflow. However,
its links are still extracted to ensure that the crawler always has links to follow and documents to
process. In the processing of a page, there are two main obstacles that need to be overcome.
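
Before turning to those obstacles, the overall loop can be sketched as follows. This is a minimal
illustration rather than our actual implementation: the queue types and the helper methods
(isAllowed, alreadySeen, fetch, extractLinks, process) are stand-ins for the components described
in this paper.

import java.util.List;
import java.util.concurrent.BlockingQueue;

// Minimal sketch of the crawl loop. The abstract methods stand in for robots
// checking, duplicate detection, fetching, link extraction and the downstream
// heuristics/extraction pipeline.
public abstract class CrawlLoop implements Runnable {
    protected final BlockingQueue<String> frontierIn;   // URLs waiting to be crawled
    protected final BlockingQueue<String> frontierOut;  // newly extracted links
    private final int maxPages;

    protected CrawlLoop(BlockingQueue<String> in, BlockingQueue<String> out, int maxPages) {
        this.frontierIn = in;
        this.frontierOut = out;
        this.maxPages = maxPages;
    }

    public void run() {
        int processed = 0;
        try {
            while (processed < maxPages) {
                String url = frontierIn.take();          // blocks while the frontier is empty
                if (!isAllowed(url)) continue;           // politeness: robots.txt / meta robots
                String page = fetch(url);
                if (page == null) continue;
                for (String link : extractLinks(page)) {
                    frontierOut.put(link);               // links are kept even for seen pages
                }
                if (!alreadySeen(url)) {
                    process(url, page);                  // heuristics + information extraction
                    processed++;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    protected abstract boolean isAllowed(String url);
    protected abstract boolean alreadySeen(String url);
    protected abstract String fetch(String url);
    protected abstract List<String> extractLinks(String page);
    protected abstract void process(String url, String page);
}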

The first obstacle is that of politeness. In particular, crawlers have to be aware when a page does
not want to be searched or indexed. This information is conveyed in two forms: 1) as a robots.txt file
on the server or 2) in the meta tag of the HTML document.

Initially, the Webcrawler must check to see if a robots.txt file exists at the base URL. In order to
reduce the processing time, the crawler maintains a special robots table in the database. The
crawler will first check the robots table to decide if a website should be crawled. If no entry
exists, the crawler will attempt to obtain and parse a robots.txt file for the host site. Thus, the
crawler will only have to obtain and parse the robots.txt file once. Finally the robots table is
recreated on every new crawl. This is done for two reasons: 1) restrictions may have increased,
e.g. a site previously marked as having no robots.txt file might have added one, and 2) restrictions
may have changed or been reduced, e.g. a news site changing its robots.txt to match its changing
content. The
Webcrawler then proceeds to obtain the document and search for the robots meta tag that
specifies restrictions for the crawler to obey.

 User-agent: *
 Disallow: /~yikesinc/
 Disallow: /~gravinaj/
 (http://www.seas.upenn.edu/robots.txt)

 <html>
    <head>
         <meta name="robots" content="noindex,nofollow"/>
         <title> … </title>
    </head> …
 (robots meta tag)
                                             Diagram 3
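
A rough sketch of this robots.txt handling is shown below. The in-memory map stands in for the
robots table in the database, and the parser is deliberately simplified (it collects every Disallow
line rather than honouring per-agent sections); the class and method names are illustrative.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RobotsChecker {
    // In-memory stand-in for the robots table: host -> disallowed path prefixes
    private final Map<String, List<String>> robotsTable = new HashMap<>();

    public boolean isAllowed(URL url) {
        String host = url.getProtocol() + "://" + url.getHost();
        List<String> disallowed = robotsTable.computeIfAbsent(host, this::fetchRules);
        for (String prefix : disallowed) {
            if (url.getPath().startsWith(prefix)) {
                return false;   // the site asked crawlers to stay out of this path
            }
        }
        return true;
    }

    // Fetch and parse robots.txt once per host. Simplification: every Disallow
    // line is honoured, regardless of which User-agent section it appears in.
    private List<String> fetchRules(String host) {
        List<String> rules = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(host + "/robots.txt").openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring("disallow:".length()).trim();
                    if (!path.isEmpty()) {
                        rules.add(path);
                    }
                }
            }
        } catch (Exception e) {
            // no robots.txt or unreachable host: treat the site as unrestricted
        }
        return rules;
    }
}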

The second obstacle is that of memory. Every document the Webcrawler processes generates
multiple links to other documents. Since the crawler is the sole generator and processor of links,
it is clear that the number of links will continue to grow until crawling is completed. The solution
is the creation of a separate thread called the FrontierManager.


[Diagram 4: The FrontierManager. The Webcrawler reads links from the FrontierIn queue and
writes extracted links to the FrontierOut queue; the FrontierManager moves links between the
two queues and spills to or reads from disk (FrontierInFile, FrontierOutFile, FrontierTmpFile)
via storeInFrontier, readDiskIntoIn and writeOutToDisk.]
                                               Diagram 4

The Webcrawler obtains links from the FrontierIn queue. The initial seed URLs are placed in the
FrontierIn queue. When the crawler extracts links from a page it pushes these links into the
FrontierOut queue. The FrontierManager handles the movement of links from the FrontierOut
queue into the FrontierIn queue. This is accomplished by checking the sizes of both queues.
When the size of FrontierIn falls below a certain threshold, the manager attempts to read links
from disk and places them into the queue. If no information is found on disk, the
FrontierManager could move links directly from FrontierOut into FrontierIn. On the other hand,
if the size of FrontierOut exceeds a certain point, the manager writes the links from the queue
onto disk. When crawling is complete, the FrontierManager will store any unprocessed links in
CSE 401                                        Senior Design Project                       Elwin Chai & Rick Jones


the FrontierIn queue back to disk. Multiple FrontierManagers can be created to handle multiple
Webcrawlers. This design prevents the Webcrawler from consuming unbounded memory due to
link extraction. However, there is still the possibility of generating contention between the
FrontierManager and the Webcrawler over the FrontierIn and FrontierOut queues.

To avoid creating such contention, the following observations are made. When the Webcrawler
extracts links and places them in the FrontierOut queue, it does not need access to FrontierIn.
Conversely, when the Webcrawler is obtaining a new document to search and checking its
permissions it does not need access to FrontierOut. Finally, the Webcrawler needs access to
neither queue when forwarding it down the pipeline. Given these observations, the manager
chooses the following strategy: when the Webcrawler signals the FrontierManager to proceed, it
is assumed that the crawler does not need to access the given queue. Thus the FrontierIn queue can
be processed while the Webcrawler is extracting links. However, disk access is expensive, and
depending on robots.txt permissions the Webcrawler may need extended access to a particular
queue or may skip certain steps. Thus, to avoid these costs, the FrontierManager will only be
signaled when the queues meet a certain threshold. The threshold is also set such that should the
queue in question not receive immediate attention from the FrontierManager, this does not cause
any delays. In addition, such a schedule allows the FrontierManager to more efficiently read or
write blocks of data to disk. Finally should the FrontierIn queue become empty or the
FrontierOut queue become full, the crawler can block until the FrontierManager can process the
given request.
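
The threshold-based transfer can be illustrated as follows. The queue names follow the text, but
the threshold values, the block size and the in-memory stand-in for the frontier disk files are
assumptions rather than the actual implementation.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Queue;

// Illustrative sketch of the FrontierManager's threshold logic.
public class FrontierManager {
    private static final int IN_LOW_WATER = 100;    // refill FrontierIn below this size
    private static final int OUT_HIGH_WATER = 1000; // spill FrontierOut above this size
    private static final int BLOCK = 500;           // links moved per disk read/write

    private final Queue<String> frontierIn;
    private final Queue<String> frontierOut;
    private final Deque<String> disk = new ArrayDeque<>();  // stands in for the frontier files

    public FrontierManager(Queue<String> in, Queue<String> out) {
        this.frontierIn = in;
        this.frontierOut = out;
    }

    // Called when the Webcrawler signals that a queue has crossed a threshold.
    public synchronized void balance() {
        if (frontierIn.size() < IN_LOW_WATER) {
            if (disk.isEmpty()) {
                // nothing on disk: move links straight from FrontierOut to FrontierIn
                while (!frontierOut.isEmpty() && frontierIn.size() < IN_LOW_WATER) {
                    frontierIn.add(frontierOut.poll());
                }
            } else {
                for (int i = 0; i < BLOCK && !disk.isEmpty(); i++) {
                    frontierIn.add(disk.pollFirst());       // block read from "disk"
                }
            }
        }
        if (frontierOut.size() > OUT_HIGH_WATER) {
            for (int i = 0; i < BLOCK && !frontierOut.isEmpty(); i++) {
                disk.addLast(frontierOut.poll());            // block write to "disk"
            }
        }
    }
}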

Apart from interacting with the FrontierManager, the Webcrawler is also responsible for parsing
the web pages into a DOM (Document Object Model) document using JTidy (Marchal, 2003).
Since most web pages are not well formed according to proper XML structure, it is
difficult to traverse the document meaningfully without heavy processing. JTidy produces a
DOM tree that can be easily traversed to extract nodes, identifiable by their names. In the context of
HTML documents, the nodes can represent the HTML entity tags3, like <a href=“…”> and <img
src=“…”>.
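
The following minimal example shows how a fetched page can be cleaned with JTidy and the
resulting DOM tree traversed by tag name; the URL is only a placeholder.

import java.io.InputStream;
import java.net.URL;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class TidyExample {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new URL("http://www.example.com/").openStream()) {
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);
            tidy.setShowWarnings(false);
            tidy.setXHTML(true);                     // coerce the page into well-formed XHTML
            Document doc = tidy.parseDOM(in, null);  // null: do not write the cleaned page out

            NodeList anchors = doc.getElementsByTagName("a");   // all <a href="..."> nodes
            for (int i = 0; i < anchors.getLength(); i++) {
                Node href = anchors.item(i).getAttributes().getNamedItem("href");
                if (href != null) {
                    System.out.println(href.getNodeValue());    // candidate links for the frontier
                }
            }
        }
    }
}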

Heuristics
Not all of the web sites obtained through crawling are relevant to product searches. In fact, we
are only concerned with sites that offer products or services for sale. Therefore, in order to avoid
processing irrelevant web pages, there needs to be a way of deciding if a page is relevant or not.

In addition to merely being relevant, we have grouped sites with products to offer into 3 broad
categories: 1) online stores that permit buyers to execute the purchase transaction online, 2)
auction sites that offer potential buyers the environment to bid for items they want to buy, and 3)
offline stores that provide buyers the information to eventually acquire the items from a physical
store, or through an offline exchange. In general, products offered by sites like Amazon.com and
Yahoo! Shopping fall under the first category. Products listed on Ebay.com or AuctionFire.com
belong to the second category, while classified ads and services like tutoring or childcare are
grouped in the third category.


3
 For our purposes, the terms ‘tag’ and ‘node’ are used interchangeably to refer to both the elements in the HTML
document and the nodes of the corresponding DOM tree.
CSE 401                                                                                   Senior Design Project                                                                                 Elwin Chai & Rick Jones


The main method that we elected to use is a bitwise scoring system. In essence, a 19-bit score is
calculated for every page and can be divided into 4 main sections. The first 4 bits record the
occurrence of generic traits of a commercial site, e.g. the most significant bit notes the existence
of an HTML input button on the page. The next 7 bits of the score are used to measure the extent
to which a page fits the characteristics of an online store, whereas the following 5 and 3 bits
analyze the same page with respect to auction sites and offline stores respectively. For any one
category, the characteristics that distinguish it are ranked in descending order, so that the most
important ones occupy the most significant bits of the score, thereby facilitating sorting using the
aggregate score.

Taking the identification of online stores as an example, we believe that any site that fits this
category should have a shopping cart (or an equivalent feature) at the minimum, followed by a
check out option, and so on. Some of the signs we hoped to find on the page were simply text
that mentions the availability of the product, shipping costs and return policy. Nevertheless, there
were a lot of false positives in using text-based identifiers since they do not uniquely recognize
only pages that are truly online stores. A bogus web site that simply comments on return policies
in general may still be scored on that characteristic, albeit erroneously.

Ultimately, we decided to only use non-text identifiers, i.e. explicit HTML entity tags like
<input>, <form> and <button> to provide more accuracy in the site classification process. The
bits used to encode text-identifiers are thus left unused. Nonetheless, there remains a trade-off
between using String.indexOf and StringTokenizer matching when searching for sub-strings. IndexOf
allows a searcher to find “bid” within the string “submitBid”, but does not ignore the string
“obidos” (which occurs on all Amazon sites). The reverse is true for using StringTokenizer and
matching each token. Additionally, StringTokenizer matching does not allow for matching
multiple words like “shopping cart” in “Add to shopping cart”. In the end, we decided to
primarily rely on IndexOf to find sub-strings, because it is more accurate and more efficient.
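
The sketch below combines the two ideas above: indexOf-based sub-string matching feeding a
bitwise score. Only the input button is fixed by our description as the most significant bit; the
remaining bit assignments and the exact sub-strings searched for are illustrative.

// Illustrative combination of the 19-bit score and indexOf matching.
public class HeuristicsSketch {
    private static final int INPUT_BUTTON   = 1 << 18;  // bit 19: top of the generic section
    private static final int FORM_TAG       = 1 << 17;  // illustrative generic bits
    private static final int BUTTON_TAG     = 1 << 16;
    private static final int SHOPPING_CART  = 1 << 14;  // illustrative online-store bit

    public static int score(String html) {
        String page = html.toLowerCase();
        int score = 0;
        if (page.indexOf("<input") >= 0)        score |= INPUT_BUTTON;
        if (page.indexOf("<form") >= 0)         score |= FORM_TAG;
        if (page.indexOf("<button") >= 0)       score |= BUTTON_TAG;
        if (page.indexOf("shopping cart") >= 0) score |= SHOPPING_CART; // finds "Add to shopping cart"
        return score;   // important traits sit in high bits, so sorting by score ranks pages
    }

    public static void main(String[] args) {
        String page = "<form><input type=\"submit\" value=\"Add to shopping cart\"/></form>";
        System.out.println(Integer.toBinaryString(score(page)));  // 1100100000000000000
    }
}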

Furthermore, only the most important and significant characteristics were preserved and they
provide adequate heuristics information to classify websites. Nonetheless, because we are
operating under an open-world assumption, the results are in no way conclusive or perfect.

The 19-bit score layout (bit 19 is the most significant):

   Bits 19-16 (Generic):  Input button, Price, Item ID, Quantity
   Bits 15-9  (Online):   Shopping cart, Check out, Availability, Shipping, Return Policy,
                          Payment methods, Account
   Bits 8-4   (Auction):  Bid, Seller, Time left, Buy it now, Auction
   Bits 3-1   (Offline):  Contact, Directions, Opening hours

Identifiers in grey in the original figure indicate text-identifiers that were eventually unused.
                                             Diagram 5

Price Extraction
Price is a critical piece of information that needs to be extracted from a given page. However,
there is no standard format with which to identify the price that corresponds to any given item on
a page. Thus two sets of wrapper functions have been developed to help automatic identification
and extraction of prices: one for online stores and the other for auction sites. Our Price Extractor
utilizes the DOM tree of the parsed HTML document and exploits the information contained
within the structure.

There are generally two strategies to extract the price: 1) identify the true price4 through a set of
criteria or 2) attempt to eliminate all prices but the true price. Given that for an online store the
only two characteristics of the true price are being in a table and being isolated, these are
insufficient pieces of information for identification. Hence, the second strategy is adopted, which
can be viewed as passing the list of all prices through a sequence of filters. In contrast, auction
sites do not have well-defined competing prices to the true price. Thus, the Extractor simply tries
to identify the true price using the first strategy.

In order to construct such a list of prices, every text node is first checked to see if it possesses a
well-formatted price, such as $17.99. The price and the node that contains the price are then
stored in a linked list. In addition each price is marked as being isolated or not isolated. An
isolated price is one that appears alone within a given tag. For example, <tag>$17.99</tag> from
the diagram below is isolated, while <tag>$12.00 (%40)</tag> is not. Through our empirical
observations, the majority of sites present the true price as isolated, making it an important
characteristic of a true price. After the document is searched, this list of prices can be processed
to extract the true price. If there is only one price on the page, then it is trivially the true price.
Otherwise, if there is exactly one isolated price, it is then considered to be the true price
of the item.
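
A sketch of how such a price list can be built from the DOM tree is given below. The price
pattern and the isolation test are simplifications, and the class and field names are illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Builds the list of candidate prices: every text node is scanned for a
// well-formatted price, and a price is marked "isolated" when it is the only
// content of its tag.
public class PriceLister {
    private static final Pattern PRICE = Pattern.compile("\\$\\d{1,3}(,\\d{3})*\\.\\d{2}");

    public static class Candidate {
        final double price;
        final Node node;
        final boolean isolated;
        Candidate(double price, Node node, boolean isolated) {
            this.price = price; this.node = node; this.isolated = isolated;
        }
    }

    public static List<Candidate> collect(Document doc) {
        List<Candidate> prices = new ArrayList<>();
        walk(doc.getDocumentElement(), prices);
        return prices;
    }

    private static void walk(Node node, List<Candidate> prices) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue();
            Matcher m = PRICE.matcher(text);
            while (m.find()) {
                // isolated: e.g. <tag>$17.99</tag>, unlike <tag>$12.00 (%40)</tag>
                boolean isolated = text.trim().equals(m.group());
                double value = Double.parseDouble(m.group().replace("$", "").replace(",", ""));
                prices.add(new Candidate(value, node.getParentNode(), isolated));
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, prices);
        }
    }
}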




[Diagram 6: Example product page showing an isolated true price ($17.99), a non-isolated price
($12.00), and a struck-out list price ($29.99)]
                                                         Diagram 6
4
    The true price of a site is what a human user is able to pick out as the price of the item sold.

In the event that there is more than one isolated price, the list of prices is considered for possible
true prices. At any point, if only one isolated price remains in the list after eliminating others, it
is deemed to be the true price. Other types of prices that may appear on a page are: 1) prices of
items recommended by the commercial site, 2) strike out prices, and 3) list prices. Each of these
can be systematically removed from the list through a series of filters.

As seen at Amazon.com, a list of “recommended” items or items that consumers are likely to be
interested in is often included in a page containing one main item for sale. From our experience,
these recommendation lists contain more than four products. Even if there are fewer than four
recommended items, all the prices in the list are typically not isolated. In addition, pages do not
typically group more than two prices around the main price, since the extra prices only help to indicate the
“good deal” or savings the consumer is getting. Hence, prices that occur in a table with more
than 4 prices or in a table only containing non-isolated prices are all eliminated.




    [Diagram 7: Example of a recommendation list containing multiple non-isolated prices]
                                                     Diagram 7

After filtering out recommended prices, the remaining prices are scanned for strike out tags, i.e.
<strike> or <s>, such as the list price $29.99 in Diagram 6. This tag indicates that the price is
crossed out and is not the true price. Due to the fact that the tags must occur around a price in the
document, they can be easily located by searching through all of the sibling tags5 for any given
price. Once the tags are found, the prices contained within them are removed from consideration.

The system then searches for all “list prices”, which, instead of being the true price of an item,
are the manufacturer’s recommended price. This price is often placed within close proximity of
the true price. Its position most probably highlights the generous discount that the store is
offering. While there are synonyms such as MSRP and retail price, there is a limited number of
such terms. Nevertheless, searching for keywords such as “list” proves more difficult than
searching for a strikeout tag. The keyword “list” need only appear before the price and inside the
same table tag, but its actual location may be hard to find. In order to find the keywords, a
recursive depth-first traversal of the document is implemented. From a given price in the list, the
function recursively traverses up through its parent nodes until it reaches the table tag. From each
parent node, it searches the children nodes in a depth-first manner to identify a node containing
sub-strings like "list price". Because multiple prices may be under the same table, we avoid
misidentifying a price by not searching any children beyond the one that contains the initial price.
If there is, in fact, a parent node that contains a "list price" child node, the initial price is
eliminated. The Price Extractor will also attempt to remove prices in the list that indicate shipping
cost or the discount the consumer is receiving, in the same manner as "list prices".

5
    There may be more than one sibling tag around a price.
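
The traversal just described can be sketched as follows; the keyword list and method names are
illustrative, and the real Extractor applies the same pattern to shipping and discount terms.

import org.w3c.dom.Node;

// Rough sketch of the "list price" filter: from a candidate price node (the tag
// holding the price), walk up towards the enclosing table, and at each level
// search the preceding children depth-first for tell-tale keywords.
public class ListPriceFilter {
    private static final String[] KEYWORDS = { "list price", "msrp", "retail price" };

    // Returns true if the candidate price should be eliminated as a list price.
    public static boolean isListPrice(Node priceNode) {
        Node child = priceNode;
        for (Node parent = priceNode.getParentNode(); parent != null;
                 child = parent, parent = parent.getParentNode()) {
            // search only children before the branch holding the price, so other
            // prices under the same table are not misattributed
            for (Node sibling = parent.getFirstChild();
                     sibling != null && sibling != child;
                     sibling = sibling.getNextSibling()) {
                if (containsKeyword(sibling)) return true;
            }
            if ("table".equalsIgnoreCase(parent.getNodeName())) break;  // stop at the table tag
        }
        return false;
    }

    // Depth-first search of a subtree for any of the keywords.
    private static boolean containsKeyword(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().toLowerCase();
            for (String k : KEYWORDS) {
                if (text.indexOf(k) >= 0) return true;   // indexOf matching, as above
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (containsKeyword(c)) return true;
        }
        return false;
    }
}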

If no appropriate price is found, the Price Extractor returns null instead of trying to return an
average of the remaining prices within the list, because the result of such a calculation would not
be based on any defensible logic.

As for auction sites, it should be noted that these sites often state two important prices: the
currently offered bid and a price you can pay to immediately buy the item. The decision was
made to extract the currently offered bid since this can be substantially lower than the immediate
buy out price and offers a more accurate representation of the lowest price the buyer could pay.
Thus, to identify the current bid price, the technique used to identify “list prices” is reused, only
this time the prices matching the criteria are kept. The key term is “current bid” or “starting bid”.
If this fails, the system defaults to looking for any price that can be associated with just the word
“bid”.

The most surprising feature of these extraction functions is how well they perform. Since it is
impossible to collect every shopping page on the web, our results are based on the pages we
actually crawled. Essentially, price extraction for online sites returns a price more often than not
and when it does, it almost always returns the true price. The results for auction sites, on the
other hand, were not as encouraging. Some auction sites do not even mention the word “bid” and
are thus eliminated by our Extractor.

Database
As the back-end of our search engine, we require a database to store all the information gathered
from crawling the web. We selected MySQL as our database of choice as it is a free SQL server
that we can install locally on our systems. In addition, MySQL is a relational database, which we
believe offers flexibility for the project.

In order to process the main body of the web pages, pre-processing is performed on the relevant
text nodes found within the document. This involves removing punctuation marks as well as
common character entities used in HTML documents, e.g. &nbsp; for a space, which are found
and replaced with regular ASCII equivalents where necessary. Then, stemming is done using an
implementation of Porter’s algorithm for suffix stripping (Porter, 1980). In doing so, words like
“run”, “runs” and “running” are all stored as “run” and hashed to the same keyword ID. Hence, searches are
made more versatile because searching for “running” will also return pages containing “runs”.

The database takes in keyword-site pairs6 and processes them into the proper tables in the
database. We used the SHA-1 algorithm to hash the URI and the keyword to facilitate the storing
and retrieval of both items. The insertion of site-keyword pairs into the database mostly involves
accessing the wordlist and termlist tables (see Appendix I for all tables and the E-R diagram). It
should be noted that the insertion of words into these tables is not entirely independent, since new
words may signal a need to update both tables. Given this constraint, the number of threads that
can run simultaneously is quite limited. Thus, while the design can accommodate many threads, it
is built practically expecting only a few.

6
    Each keyword is associated with the URL from which it is extracted.
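
As an illustration of the SHA-1-based identifiers, the sketch below derives a 64-bit ID from a
stemmed keyword or a URI using java.security.MessageDigest; truncating the 160-bit digest to a
bigint is an assumption about how the hash maps onto the schema.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Derives bigint-sized IDs (wordid/docid) from the SHA-1 hash of a value.
public class Ids {
    public static long id(String value) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(value.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, digest).longValue();   // keep the low 64 bits of the digest
    }

    public static void main(String[] args) throws Exception {
        System.out.println(id("run"));                          // wordid for the stem "run"
        System.out.println(id("http://www.example.com/item"));  // docid for a crawled URI
    }
}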

The design for the interaction of the Webcrawler and the database is a simplified version of the
design for the FrontierManager (see Diagram 4). Essentially keywords are extracted from a
document and placed into an Out queue. Then, a thread called KeywordManager moves
keywords from the Out queue to the In queue. These are in turn read by the Database for
insertion or updating. However, there are certain key differences. The primary difference is that
the Webcrawler is a separate thread from the database. Thus, how many keywords need to be
offloaded to disk depends entirely on whether the Webcrawler is generating keywords faster than
the database can process them, or vice versa. This is the classic producer-consumer problem. In
addition, the KeywordManager is not built to accommodate multiple instances. Finally, whereas
the FrontierManager can be set to run at times when the Webcrawler is not using a given queue,
the KeywordManager simply empties and fills the queues as requested. This is because the
KeywordManager is transferring data between two independently running threads, so even
having the threads invoke the KeywordManager only when they are not actively using the queues
can create contention for the KeywordManager to get work done. Nevertheless, the
KeywordManager follows the FrontierManager in getting to the queues before they are full or
empty, and performs block reading and writing to avoid unnecessary disk I/O.

Search
The main purpose of PriceHunter is to provide users with search capabilities. To this end, we
have defined a basic set of search features. Firstly, in order to determine if a page is relevant
enough to be returned as a result of a search, its score is calculated based on the vector space
model (Lee et al, 1997). The model measures the vector distance between the HTML document
and the search query as a proxy for how similar the two are to each other.
results from processing the query are ranked by default based on the vector space score. Even if
two documents obtain the same score, they are further divided based on their alternate score,
which records the number of times each word in the search query occurs in the document itself.
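
A bare-bones version of this scoring is sketched below: documents and queries are treated as
term-weight vectors and compared by cosine similarity. The weights here are plain term
frequencies, whereas the actual termlist/wordlist tables also carry idf values.

import java.util.HashMap;
import java.util.Map;

public class VectorSpace {
    // Turns a piece of text into a term-frequency vector.
    public static Map<String, Double> vector(String text) {
        Map<String, Double> v = new HashMap<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            v.merge(term, 1.0, Double::sum);
        }
        return v;
    }

    // Cosine similarity between two sparse term vectors.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
            na += e.getValue() * e.getValue();
        }
        for (double w : b.values()) nb += w * w;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        Map<String, Double> doc = vector("canon i320 printer on sale canon ink included");
        Map<String, Double> query = vector("canon i320");
        System.out.println(cosine(doc, query));   // higher score = more relevant result
    }
}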

Users are able to specify which words they require every page in the search results to have. By
default, words entered in the search line do not all have to appear in every web site in the results
list, i.e. searching is based on disjunctive keywords. However, in order to narrow down the
search results, users are able to refine their search using quotation marks. All words appearing
within quotation marks are considered to be conjunctive conditionals, whereas words that are
otherwise separated by white space are disjunctive conditionals. For example, the string (“one
two” three “four five”) can be interpreted as a request for sites containing both “one” and “two”,
or simply “three”, or both “four” and “five”.
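
A simple sketch of this query syntax is shown below: quoted phrases become conjunctive groups
and unquoted words stand alone as disjunctive groups. The parser is illustrative and ignores the
price-range and site-type shortcuts discussed next.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A page matches if it contains every word of at least one group.
public class QueryParser {
    public static List<List<String>> parse(String query) {
        List<List<String>> groups = new ArrayList<>();
        boolean inQuotes = false;
        for (String part : query.split("\"")) {     // split alternates outside/inside quotes
            part = part.trim();
            if (!part.isEmpty()) {
                if (inQuotes) {
                    groups.add(Arrays.asList(part.split("\\s+")));   // conjunctive group
                } else {
                    for (String word : part.split("\\s+")) {
                        groups.add(Arrays.asList(word));             // disjunctive single word
                    }
                }
            }
            inQuotes = !inQuotes;
        }
        return groups;
    }

    public static void main(String[] args) {
        // prints [[one, two], [three], [four, five]]
        System.out.println(parse("\"one two\" three \"four five\""));
    }
}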

Whether the user wishes to streamline his search using additional query line shortcuts, e.g.
“canon i320” $50-100 [online], or to select his preferences using an online form, there is great
flexibility in the search system. Essentially, the user is offered the option of specifying the type
of sites in the results set and the range of prices, within which the products must fall.

Since PriceHunter is ultimately a web-based search engine, we designed an appropriately
user-friendly web interface that is hosted on a Tomcat server. It allows the user to submit
queries to the SQL server and view the results. In addition, the user is able to sort through the
results list and refine the search based on a price range or by specifying the type of sites.




                      [Diagram 8: The PriceHunter web search interface]
                                             Diagram 8

To further avoid returning invalid or irrelevant results, the system automatically calculates the
median price of all initial search results and the standard deviation of the same group to
determine the range of prices the engine should concentrate on. While this method does not
immediately guarantee the uniformity of search results, it provides an initial filtering of pages
that are clearly irrelevant to the search.
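
The sketch below illustrates this filter: results whose price falls outside a band around the median
are dropped. Using one standard deviation as the band width is an assumption for illustration.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PriceBandFilter {
    public static List<Double> filter(List<Double> prices) {
        if (prices.isEmpty()) return prices;
        double[] sorted = prices.stream().mapToDouble(Double::doubleValue).sorted().toArray();
        double median = (sorted.length % 2 == 1)
                ? sorted[sorted.length / 2]
                : (sorted[sorted.length / 2 - 1] + sorted[sorted.length / 2]) / 2.0;

        double mean = Arrays.stream(sorted).average().orElse(0);
        double variance = Arrays.stream(sorted).map(p -> (p - mean) * (p - mean)).average().orElse(0);
        double sd = Math.sqrt(variance);

        List<Double> kept = new ArrayList<>();
        for (double p : prices) {
            if (Math.abs(p - median) <= sd) kept.add(p);   // keep prices near the median
        }
        return kept;
    }

    public static void main(String[] args) {
        // the $3.99 accessory is dropped from results for a ~$90 printer
        System.out.println(filter(Arrays.asList(89.99, 94.50, 92.00, 3.99, 99.99)));
    }
}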

Taking into account the fact that MySQL does not have an optimizer, and that processing a
search query from scratch may require lengthy database access to calculate the results, completed
searches are cached in the database as well. Therefore, whenever the same search is made again,
determined by the SHA-1 hash of the search string, the results in the cache are simply returned,
if they are present. In order to allow for changes in web pages, each search cache is also
timestamped. When the timestamp of a search query is more than 2 weeks old, the system forces
a fresh processing of the search. In theory, the expiration period of the timestamp should be as
long as the lag time between complete web crawls.

Challenges
Throughout the design and implementation of our search engine, we faced several challenges and
learnt many important lessons. The most important lesson is that the web is essentially
unstructured and contains largely ill-formed HTML pages. In addition, the content is
continuously changing and does not follow any convention. While Tidy was supposed to convert
ill-formed HTML into XHTML, i.e. well-formed XML versions of HTML, there are certain
irregularities that are not appropriately converted by Tidy. For example, child nodes that exist
within the title tag are relocated by Tidy to be part of the body tag.

Before Tidy:
 <head>
    <title>2004 Mitsubishi Galant –
         <a href=“http://…”>Cars</a>
    </title>
 </head>
 <body>
 </body>

After Tidy:
 <head>
    <title>2004 Mitsubishi Galant –
    </title>
 </head>
 <body>
    <a href=“http://…”>Cars</a>
 </body>
                                             Diagram 9

This problem is also partially a cause of the difficulty in identifying relevant search results. Due
to the excessive amount of information presented on any one page, extracting all the words that
exist on the page will return too many false positives in the results. On the other hand, extracting
only the base set of keywords from the meta tags and title tags may provide too few keywords
for the user to find enough relevant websites. In the case above, the word “cars” has been
dropped from the title, and losing this piece of critical information will cause the page not to be
returned when the user searches for “cars”. In addition, since the user is likely to sort the results
by prices, it is highly undesirable for the result set to contain irrelevant pages with low prices.

Future Work
While we attempted to provide for scalability, there remains room for improvement in making
the entire search engine completely expandable. In particular, the number of Webcrawlers can be
increased to distribute the crawling workload. Similarly, more parallelism could potentially
alleviate the currently heavy workload of the back-end, which is due to the fact that disk access
is relatively expensive.

As mentioned before, Tidy has shortcomings in processing the ill-formed web. In order to extract
information based on what the user sees, Tidy should be improved to properly reflect the HTML
document as presented visually by most browsers and to preserve the intended structure.

In terms of searching, the location of keywords on the page could be recorded so that positional
querying can be performed. While keywords from the meta tag do not inherently have positional
data associated with them, those within the title tag do. As such, it is non-trivial to assign the
positions to a word. Furthermore, metadata of keywords, e.g. formatting data, can be used to
augment our current model to return relevant results.

Building on our method of extracting prices, other pieces of information can be obtained from
the web page, e.g. shipping, item number, seller etc. In particular, the category of a certain
product may also be dynamically assigned, thereby creating an automatic category list of all
products in the database. Ultimately, to become a fully viable product search engine, this method
would have to be able to extract more than just prices from the web pages.

Conclusion
There is sufficient evidence that there is some level of convention and structure among
commercial sites for wrapper functions to extract information reasonably. We were surprised to
find that the functions that we have developed have achieved a substantial amount of success.
Nonetheless, we have also restricted ourselves to a small set of web sites. The diversity of the
wider web and the constant change that it undergoes only means that further investigation is
necessary to determine the viability of a fully automated product search engine. The content of
these websites may simply become too diversified and irregular for any set of heuristics to remain
applicable in the long term. Nonetheless, as the web develops, there may be greater
standardization among online stores.

References

 [1]      Amazon, http://www.amazon.com

 [2]      Froogle, http://froogle.google.com

 [3]      Kelkoo, http://www.kelkoo.co.uk

 [4]      Pricescan, http://www.pricescan.com

 [5]      Yahoo! Shopping, http://shopping.yahoo.com

 [6]      Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine,
          Computer Networks and ISDN Systems, 1998

 [7]      Allan Heydon and Marc Najork. Mercator: A Scalable, Extensible Web Crawler. In World Wide Web
          Journal, December 1999, pp. 219 - 229, http://research.compaq.com/SRC/mercator/papers/www/paper.html

 [8]      Nicholas Kushmerick, Daniel S. Weld and Robert Doorenbos, Wrapper induction for information
          extraction, Intl. Joint Conference on Artificial Intelligence, 1997, pp 729-735

 [9]      Nicholas Kushmerick, Wrapper verification, World Wide Web Journal 3(2), 2000, pp 79-94

[10]      Dik L. Lee, Huei Chuang and Kent Seamons, Document Ranking and the Vector-Space Model, IEEE
          Software, March/April 1997, pp 67-75

[11]      Benoit Marchal, Tip: Convert from HTML to XML with HTML Tidy, 18 Sep 2003,
          http://www-106.ibm.com/developerworks/library/x-tiptidy.html?ca=dgr-lnxw02TidyUp

[12]      Jason Mills. Early Froogle BETA Shortcomings. Top Site Listings, 15 January 2003
          http://www.topsitelistings.com/news/froogle-beta.php

[13]      John M. Pierre. On the Automated Classification of Web Sites. Electronic Transactions on Artificial
          Intelligence, 2001, p. 6, http://citeseer.ist.psu.edu/559123.html

[14]      Martin F. Porter, An algorithm for suffix stripping, Program, Vol. 14, no. 3, 1980, pp 130-137

[15]      Arnaud Sahuguet and Fabien Azavant, Building Light-Weight Wrappers for Legacy Web Data-Sources
          Using W4F, Proceedings of the 25th International Conference on VLDB, 1999, pp 738-741

[16]      Danny Sullivan. Shopping Search Engines. Search Engine Watch, 5 December 2003
          http://searchenginewatch.com/links/article.php/2156331

Appendix I

-- Crawled pages, with their heuristic score, site type, title and description
CREATE TABLE document (
 docid bigint,
 uri varchar(255),
 score char(24),
 type char(10),
 title varchar(255),
 descr varchar(255),
 PRIMARY KEY (docid));

-- Extracted price information for a document
CREATE TABLE documentinfo(
 price float,
 docid bigint,
 PRIMARY KEY (docid),
 FOREIGN KEY (docid) REFERENCES document (docid));

-- Occurrences of a keyword within a document, with hit count and weight
CREATE TABLE termlist(
 docid bigint,
 wordid bigint,
 word varchar(255),
 hit bigint,
 weight float,
 PRIMARY KEY (docid, wordid),
 FOREIGN KEY (docid) REFERENCES document (docid),
 FOREIGN KEY (wordid) REFERENCES wordlist (wordid));

-- Global keyword statistics (total hits and inverse document frequency)
CREATE TABLE wordlist(
 wordid bigint,
 word varchar(255),
 nhits bigint,
 idf float,
 PRIMARY KEY (wordid));

-- Disallowed paths parsed from each host's robots.txt
CREATE TABLE robots(
 host bigint,
 disallow varchar(128),
 PRIMARY KEY(host, disallow));

-- Cached, timestamped search results
CREATE TABLE documentsearch(
 searchid bigint,
 docid bigint,
 ranking float,
 alt_ranking int,
 timestamp timestamp,
 PRIMARY KEY (searchid, docid),
 FOREIGN KEY (docid) REFERENCES document (docid));

                             ER Diagram for PriceHunter

 Document        (docid bigint(20), uri varchar(255), score char(24), type char(10),
                  title varchar(255), descr varchar(255))
 DocumentInfo    (docid@Document bigint(20), price float)
 DocumentSearch  (searchid bigint(20), docid@Document bigint(20), ranking float,
                  alt_ranking int, timestamp timestamp)
 Termlist        (docid@Document bigint(20), wordid@Wordlist bigint(20), word varchar(255),
                  hit bigint(20), weight float)
 Wordlist        (wordid bigint(20), word varchar(255), nhits bigint(20), idf float)
 Robots          (host bigint(20), disallow varchar(128))

 Relationships: Document "contains" DocumentInfo; Document "is searched" via DocumentSearch.
 Notation: attribute@Entity denotes a foreign key; primary keys (bold and underlined in the
 original diagram) are as defined in the table schemas above.

								