Web Page Categorization based on Document Structure


                        Arul Prakash Asirvatham              Kranthi Kumar Ravi
                         arul@gdit.iiit.net                  kranthi@gdit.iiit.net

                                  Centre for Visual Information Technology
                              International Institute of Information Technology
                                   Gachibowli, Hyderabad, INDIA 500019

Abstract

The web is a huge repository of information and there is a need to categorize web documents to facilitate the indexing, search and retrieval of pages. Existing algorithms rely solely on the text content of web pages for categorization. However, web documents carry a lot of additional information in their structure and in the images, audio, video and other multimedia present in them. In this paper, we propose a method for the automatic categorization of web pages into a few broad categories based on the structure of the web document and the images present in it.

1. Introduction

There has been an exponential increase in the amount of data available on the web in recent years. According to [1], the number of pages available on the web is around 1 billion, with almost another 1.5 million being added daily. This enormous amount of data, in addition to the interactive and content-rich nature of the web, has made it very popular. However, these pages vary to a great extent in both information content and quality. Moreover, the organization of these pages does not allow for easy search. So an efficient and accurate method for classifying this huge amount of data is essential if the web is to be exploited to its full potential. This need has been felt for a long time, and many approaches have been tried to solve the problem.

To start with, classification was done manually by domain experts. But very soon, classification began to be done semi-automatically or automatically. The approaches used include text categorization based on statistical and machine-learning algorithms [3], the K-Nearest Neighbor approach [2], Bayesian probabilistic models [4][6][7], inductive rule learning [5], decision trees [7], neural networks [8] and support vector machines [9]. An effort has also been made to classify web content based on hierarchical structure [10].

However, besides the text content of a web page, the images, video and other multimedia content, together with the structure of the document, also provide a lot of information that can aid in the classification of a page. Existing classification algorithms, which rely solely on text content, do not exploit these features. We have developed a new approach for automatic web page classification based on these features.

The paper is organized as follows. In Section 2, we discuss the existing algorithms for web page classification and their drawbacks. We describe our approach in Section 3, followed by our implementation of this approach in Section 4. Our results are presented in Section 5. We present our conclusions and avenues for future work in Section 6.

2. Web Page Categorization Algorithms

Several attempts have been made to categorize web pages, with varying degrees of success. The major approaches fall into the following broad categories:

a) Manual categorization: The traditional manual approach to classification involved the analysis of the contents of a web page by a number of domain experts, with the classification based on the textual content, as is done to some extent by Yahoo [11]. The sheer volume of data on the web rules out this approach. Moreover, such a classification would be subjective and hence open to question.

b) Clustering approaches: Clustering algorithms have been used widely, as the clusters can be formed directly without any background information. However, most clustering algorithms, such as K-Means, require the number of clusters to be specified in advance. Moreover, clustering approaches are computationally more expensive.

c) META tag based categorization: These classification techniques rely solely on the attributes of the meta tags (<META name="keywords"> and <META name="description">). However, a web page author may include keywords that do not reflect the content of the page, just to increase the hit rate of the page in search engine results.

d) Text content based categorization: Text-based approaches use the text content of the web page for classification. First, a database of keywords for each category is prepared as follows: the frequency of occurrence of words, phrases etc. in a category is computed from an existing corpus (a large collection of text). Commonly occurring words (called stop words) are removed from this list. The remaining words are the keywords for that particular category and can be used for classification. To classify a document, all stop words are removed and the remaining keywords/phrases are represented as a feature vector. The document is then classified into an appropriate category using the K-Nearest Neighbor classification algorithm.

These approaches rely on a number of high-quality training documents for accurate classification. However, the contents of web pages vary greatly in quality as well as quantity. It has been observed [12] that 94.65% of web pages contain fewer than 500 distinct words. Also, the average word frequency of almost all documents is less than 2.0, which means that most of the words in a web document rarely appear more than twice. Hence the traditional method based on keyword frequency analysis cannot be used for web documents.

e) Link and context analysis: The link-based approach is an automatic web page categorization technique based on the fact that a web page that refers to a document must contain enough hints about its content to induce someone to read it. Such hints can be used to classify the document being referred to, as has been done in [13].

We observe that the methods used so far are based to a great extent on the textual information contained in the page. We now present our approach, based on the structure of the page and the image and multimedia content in the page.

3. Structure based approach

The structure based approach exploits the fact that there are many features apart from the text content which form the whole gamut of information present in a web document. A human mind categorizes pages into a few broad categories at first sight, without knowing the exact content of the page. It uses other features, like the structure of the page, the images and links contained in the page, their placement etc., for classification.

Web pages belonging to a particular category have some similarity in their structure. This general structure of web pages can be deduced from the placement of links, text and images (including equations and graphs). This information can be easily extracted from an HTML document.

We observe that some features, like the ratio of link text to normal text, are present in all categories of documents but to varying degrees, while other features are particular to only some kinds of documents. Hence, for the classification we have assumed a priori that a certain feature contributes to the classification of a page into a particular category by, say, some x%, whereas the same feature contributes to the classification of the same page into some other category by some (x±y)%. These a priori values are then modified based on learning from a set of sample pages. The final weights are then used for categorization.

4. A Specific Implementation

We have implemented a structure based categorization system to categorize web pages into the following broad categories.
         1. Information Pages.
         2. Research Pages.
         3. Personal Home Pages.

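The weighting scheme sketched in Section 3 — a priori per-category feature contributions, refined by feedback from a set of labelled sample pages — could look roughly like the following. This is a minimal sketch: the feature names, the a priori weight values and the additive update rule are our own illustrative assumptions, not the paper's actual code.

```python
# Illustrative sketch (not the authors' implementation): a priori feature
# weights per category, refined by error feedback on labelled sample pages.

CATEGORIES = ["information", "research", "personal"]  # assumed names

def classify(weights, features):
    """Score each category as a weighted sum of feature values; pick the best."""
    scores = {c: sum(w * features[f] for f, w in weights[c].items())
              for c in CATEGORIES}
    return max(scores, key=scores.get)

def train(weights, samples, rate=0.1):
    """Nudge weights toward correct labels (simple, assumed update rule)."""
    for features, label in samples:
        predicted = classify(weights, features)
        if predicted != label:
            for f, value in features.items():
                weights[label][f] += rate * value      # reward the true category
                weights[predicted][f] -= rate * value  # penalize the wrong one
    return weights

# A priori guesses: link-text ratio favours 'information', text amount 'research'.
weights = {
    "information": {"link_ratio": 0.8, "text_amount": -0.3, "num_colours": 0.5},
    "research":    {"link_ratio": -0.5, "text_amount": 0.9, "num_colours": -0.4},
    "personal":    {"link_ratio": 0.2, "text_amount": -0.2, "num_colours": 0.1},
}
samples = [({"link_ratio": 0.7, "text_amount": 0.2, "num_colours": 0.9}, "information"),
           ({"link_ratio": 0.1, "text_amount": 0.9, "num_colours": 0.2}, "research")]
weights = train(weights, samples)
print(classify(weights, {"link_ratio": 0.6, "text_amount": 0.1, "num_colours": 0.8}))
# → information
```

In this sketch the feedback loop only fires on misclassified samples, mirroring the paper's description of error being used as feedback to adjust the initial heuristic weights.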
A typical information page has a logo at the top, followed by a navigation bar linking the page to other important pages. We have observed that the ratio of link text (the amount of text with links) to normal text also tends to be relatively high in these kinds of pages.

Research pages generally contain huge amounts of text, along with equations and graphs in the form of images. The presence of equations and graphs can be detected by processing the histograms of the images.

Personal home pages also tend to have a common layout. The name, address and photo of the person appear prominently at the top of the page. Also, towards the bottom of the page, the person provides links to his publications, if there are any, and other useful references or links to his favourite destinations on the web.

Our implementation takes a start page and a domain as input and spiders the web, retrieving each of the pages along with its images. The retrieval is breadth-first and care has been taken to avoid cyclic search. The pages and images are stored on the local machine for local processing and feature extraction.

The categorization is carried out in two phases. The first phase is the feature extraction phase and the second is the classification phase. These are explained in detail in the following sub-sections.

4.1 Feature Extraction

The feature extraction phase is the most important phase in the system, as the classification is based on the features extracted during this phase. The features should provide some valuable information about the document and at the same time be computationally inexpensive. Very small images (approximately 20x20 pixels or less) have not been considered for feature extraction, as they usually correspond to buttons, icons etc.

The following features were taken into consideration for classification of the pages.

(a) Textual Information: The number and placement of links in a page provide valuable information about the broad category the page belongs to. We have computed the ratio of the number of characters in links to the total number of characters in the page. A high ratio means the probability of the page being an information page is high. Some information pages contain links followed by a brief description of the document referred to by the link; in such cases, this ratio turns out to be low. So the placement of the links is also an important parameter.

The amount of text in a page gives an indication of the type of page. Generally, information and personal home pages are sparse in text compared to research pages. Hence we have counted the number of characters in the document (after removing some of the commonly used words) and used this information to grade the text as sparse, moderate or high in information content.

(b) Image Information: The number of distinct colours in the images has been computed. Since there is no need to take into account all 16.7 million colour shades of a true colour image, we have considered only the higher-order 4 bits of each colour channel (giving a total of 4096 colours). Information pages have more colours than personal home pages, which in turn have more colours than research pages.

The histogram of an image has been used to differentiate between natural and synthetic images. The histogram of a synthetic image generally tends to be concentrated in a few bands of colour shades; in contrast, the histogram of a natural image is spread over a larger area. Information pages usually contain many natural images, while research pages contain a number of synthetic images representing graphs, mathematical equations etc.

Research pages usually contain binary images, and further information can be extracted from them. Graphs are generally found in research pages; we have used the histogram of the image to detect the presence of a graph in the image.

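Two of the features above — the link-text ratio of Section 4.1(a) and the quantized colour count of Section 4.1(b) — can be sketched with the Python standard library alone. This is our own illustrative code, not the authors' implementation; the class and function names are assumptions, and pixels are represented as plain (r, g, b) tuples rather than decoded image files.

```python
# Illustrative sketch of two features from Section 4.1 (assumed names,
# standard library only).
from html.parser import HTMLParser

class LinkTextRatio(HTMLParser):
    """Ratio of characters inside <a> tags to all text characters (4.1a)."""
    def __init__(self):
        super().__init__()
        self.in_link = 0
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.in_link:
            self.link_chars += n

    def ratio(self):
        return self.link_chars / self.total_chars if self.total_chars else 0.0

def distinct_colours(pixels):
    """Count distinct colours keeping only the high 4 bits of each R, G, B
    channel, so at most 16 * 16 * 16 = 4096 colours (Section 4.1b)."""
    return len({(r >> 4, g >> 4, b >> 4) for (r, g, b) in pixels})

parser = LinkTextRatio()
parser.feed('<a href="x.html">Home</a> Welcome to our site')
print(round(parser.ratio(), 2))   # 4 link chars out of 23 → 0.17
print(distinct_colours([(255, 0, 0), (250, 3, 1), (0, 0, 255)]))  # → 2
```

Note how the 4-bit quantization folds the two near-identical reds into a single colour, which is the point of dropping the low-order bits before counting.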
4.2 Classification

We have used the following approach to classify a page according to the features obtained. In this algorithm, each feature contributes either positively or negatively towards a page being classified into a particular category. The actual procedure is as follows. Let W be a (c x n) weight matrix and F an (n x 1) feature value matrix. Then V = WF is a (c x 1) matrix:

W = \begin{pmatrix} W_{11} & \cdots & W_{1n} \\ \vdots & & \vdots \\ W_{c1} & \cdots & W_{cn} \end{pmatrix}, \quad
F = \begin{pmatrix} F_1 \\ \vdots \\ F_n \end{pmatrix}, \quad
V = WF = \begin{pmatrix} V_1 \\ \vdots \\ V_c \end{pmatrix}

where n is the number of features, c is the number of categories, W_{ji} \in [-1, +1] is the contribution of the i-th feature to the j-th category, and F_i is the value of the i-th feature.

For each category, W_{ji} varies between -1.0 and +1.0. For example, in the case of a home page, a weight of 1.0 could be assigned to the feature 'presence of an image with a human face', whereas the same feature would have a weight of -0.8 for a research page. Thus the presence of a photograph increases the chance of a page being classified as a personal home page and at the same time decreases the chance of it being classified as a research page or an information page.

Initially, these weights were assigned based on heuristics. They were later modified based on implementation runs on sample pages: the accuracy of the output is checked, and the error is used as feedback to modify the weights.

The classification is done based on the elements of the matrix V obtained by the above procedure. The document belongs to the category j for which the value of V_j is the highest.

5. Results

We tested our implementation of this approach on a sample space of about 4000 pages gathered from various domains. The results and the various parameters used are shown in Table 1.

    Details                                       Results
    -----------------------------------------------------
    No. of pages on which we have tested          ~4000
    our implementation
    Pages categorized                             ~3700
    Pages categorized correctly                   ~3250
    % categorized correctly                       87.83%

                          Table 1

    Figure 1: a wrongly categorized page (screenshot omitted)

We have manually tested the results obtained by our implementation and found some results that are out of sync with the actual category the page should belong to. For example, the page shown in Figure 1 was classified as a research page though it should have been classified as an information page.

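The scoring step of Section 4.2 — computing V = WF and assigning the page to the highest-scoring category — can be sketched as follows. The weights and feature values here are made-up illustrations, not the trained weights of the actual system.

```python
# Sketch of the classification step of Section 4.2: V = W x F, then pick the
# category with the highest score. All numbers below are illustrative only.

CATEGORIES = ["information", "research", "personal home"]

# W is c x n: W[j][i] in [-1, +1] is the contribution of feature i to category j.
W = [
    [0.9, -0.4, 0.6],   # information: many links, little text, many colours
    [-0.6, 0.9, -0.5],  # research: few links, much text, few colours
    [0.3, -0.2, 0.2],   # personal home page
]

# F is n x 1: one value per feature, e.g. (link ratio, text amount, colour
# count), assumed here to be normalized into [0, 1].
F = [0.8, 0.1, 0.7]

# V = W x F gives a c x 1 vector of category scores.
V = [sum(W[j][i] * F[i] for i in range(len(F))) for j in range(len(W))]

best = max(range(len(V)), key=V.__getitem__)
print(CATEGORIES[best])
# → information
```

A link-heavy, colourful, text-sparse page scores highest for the information category, which matches the layout cues described in Section 4.1.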
The reason can be attributed to the fact that the page is a print version of the original: there is not a single link in the page, no images are present, and the amount of text is very large.

6. Conclusions and Future Work

We have described an approach for the automatic categorization of web pages that uses the images and the structure of the page for classification. The results obtained are quite encouraging. This approach, augmented with traditional text based approaches, could be used by search engines for effective categorization of web pages.

Our current implementation uses a method for categorization wherein the weights assigned to each feature are set manually during the initial training phase. A neural network based approach could be employed to automate the training process.

Adding a few more features based on heuristics (e.g. the classification of a page as a home page by detecting a face at the top) would increase the classification accuracy.

Currently, we have used only images in addition to the structural information present in web pages. We are not exploiting other information present in the form of video, audio etc. These could also be used to obtain valuable information about a page.

We have currently used our approach to categorize web pages into very broad categories. The same algorithm could also be used to classify pages into more specific categories by changing the feature set.

7. References

[1] John M. Pierre, Practical Issues for Automated Categorization of Web Pages, September 2000.
[2] Yiming Yang, Xin Liu, A Re-examination of Text Categorization Methods, Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (SIGIR'99), pp. 42-49, University of California, Berkeley, USA, 1999.
[3] Oh-Woog Kwon, Jong-Hyeok Lee, Web Page Classification Based on k-Nearest Neighbor Approach.
[4] Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naïve Bayes Text Classification, AAAI-98 Workshop on Learning for Text Categorization, 1998.
[5] Chidanand Apte and Fred Damerau, Automated Learning of Decision Rules for Text Categorization, ACM Transactions on Information Systems, Vol. 12, No. 3, pp. 233-251, 1994.
[6] Koller, D. and Sahami, M., Hierarchically Classifying Documents Using Very Few Words, Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97), pp. 170-178, 1997.
[7] Lewis, D.D. and Ringuette, M., A Comparison of Two Learning Algorithms for Text Categorization, Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR'94), pp. 81-93, 1994.
[8] Weigend, A.S., Wiener, E.D. and Pedersen, J.O., Exploiting Hierarchy in Text Categorization, Information Retrieval, 1(3), pp. 193-216, 1999.
[9] Dumais, S.T., Platt, J., Heckerman, D., and Sahami, M., Inductive Learning Algorithms and Representations for Text Categorization, Proceedings of the Seventh International Conference on Information and Knowledge Management (CIKM'98), pp. 148-155, 1998.
[10] Susan Dumais, Hao Chen, Hierarchical Classification of Web Content.
[11] http://www.yahoo.com
[12] Wai-Chiu Wong, Ada Wai-Chee Fu, Incremental Document Clustering for Web Page Classification, Chinese University of Hong Kong, July 2000.
[13] Giuseppe Attardi, Antonio Gulli, Fabrizio Sebastiani, Automatic Web Page Categorization by Link and Context Analysis.