Web Page Categorization based on Document Structure

Arul Prakash Asirvatham, Kranthi Kumar Ravi
email@example.com, firstname.lastname@example.org
Centre for Visual Information Technology
International Institute of Information Technology
Gachibowli, Hyderabad, INDIA 500019

Abstract

The web is a huge repository of information, and there is a need to categorize web documents to facilitate the indexing, search and retrieval of pages. Existing algorithms rely solely on the text content of web pages for categorization. However, web documents carry a lot of additional information in their structure and in the images, audio and video present in them. In this paper, we propose a method for automatic categorization of web pages into a few broad categories based on the structure of the web document and the images present in it.

1. Introduction

The amount of data available on the web has been increasing exponentially. According to [ ], the number of pages available on the web is around 1 billion, with almost another 1.5 million being added daily. This enormous amount of data, in addition to the interactive and content-rich nature of the web, has made it very popular. However, these pages vary to a great extent in both information content and quality. Moreover, the organization of these pages does not allow for easy search. An efficient and accurate method for classifying this huge amount of data is therefore essential if the web is to be exploited to its full potential. This need has been felt for a long time, and many approaches have been tried to solve the problem.

To start with, classification was done manually by domain experts. But very soon, classification began to be done semi-automatically or automatically. The approaches used include text categorization based on statistical and machine-learning algorithms: the K-Nearest Neighbor approach [ ], Bayesian probabilistic models [ ], inductive rule learning [ ], decision trees [ ], neural networks [ ] and support vector machines [ ]. An effort was also made to classify web content based on hierarchical structure [ ].

However, besides the text content of a web page, the images, video and other multimedia content, together with the structure of the document, also provide a lot of information that can aid the classification of a page. Existing classification algorithms, which rely solely on text content, do not exploit these features. We have developed a new approach to automatic web-page classification based on these features.

The paper is organized as follows. In Section 2, we discuss the existing algorithms for web page classification and their drawbacks. We present our approach in Section 3, followed by our implementation of this approach in Section 4. Our results are discussed in Section 5. We present our conclusions and avenues for future work in Section 6.

2. Web Page Categorization Algorithms

Several attempts have been made to categorize web pages, with varying degrees of success. The major approaches fall into the following broad categories:

a) Manual categorization: The traditional manual approach involves analysis of the contents of a web page by a number of domain experts, with classification based on the textual content, as is done to some extent by Yahoo [ ]. The sheer volume of data on the web rules out this approach. Moreover, such a classification is subjective and hence open to question.

b) Clustering approaches: Clustering algorithms have been used widely, as clusters can be formed directly without any background information. However, most clustering algorithms, such as K-Means, require the number of clusters to be specified in advance. Moreover, clustering approaches are computationally more expensive.
c) META tag based categorization: These techniques rely solely on attributes of the meta tags (<META name="keywords"> and <META name="description">). However, a web page author may include keywords that do not reflect the content of the page, merely to increase the hit-rate of the page in search engine results.

d) Text content based categorization: Text-based approaches use the text content of the web page for classification. First, a database of keywords for each category is prepared as follows: the frequency of occurrence of words, phrases etc. in a category is computed from an existing corpus (a large collection of text). Commonly occurring words (called stop words) are removed from this list. The remaining words are the keywords for that category and can be used for classification. To classify a document, all stop words are removed and the remaining keywords/phrases are represented as a feature vector. The document is then assigned to an appropriate category using, for example, the K-Nearest Neighbor classification algorithm.

These approaches rely on a number of high-quality training documents for accurate classification. However, the contents of web pages vary greatly in quality as well as quantity. It has been observed [ ] that 94.65% of web pages contain fewer than 500 distinct words. Also, the average word frequency of almost all documents is less than 2.0, which means that most words in a web document rarely appear more than twice. Hence the traditional method based on keyword frequency analysis cannot be applied directly to web documents.

e) Link and context analysis: The link-based approach is an automatic categorization technique based on the observation that a web page that refers to a document must contain enough hints about its content to induce someone to read it. Such hints can be used to classify the referred document, as has been done in [ ].

We observe that the methods used so far are based to a great extent on the textual information contained in the page. We now present our approach, based on the structure of the page and the image and multimedia content in the page.

3. Structure based approach

The structure based approach exploits the fact that there are many features apart from text content which form the whole gamut of information present in a web document. A human categorizes pages into a few broad categories at first sight, without knowing the exact content of the page, using features such as the structure of the page, the images and links contained in the page, and their placement.

Web pages belonging to a particular category have some similarity in their structure. This general structure can be deduced from the placement of links, text and images (including equations and graphs), and this information can be easily extracted from an HTML document.

We observe that some features, such as the ratio of link text to normal text, are present in all categories of documents but to varying degrees, while other features are particular to only some kinds of documents. Hence, for classification we assume a priori that a certain feature contributes to classifying a page into a particular category by, say, some x%, whereas the same feature contributes to classifying the same page into some other category by some (x±y)%. These a priori values are then modified by learning from a set of sample pages, and the final weights are used for categorization.

4. A Specific Implementation

We have implemented a structure based categorization system that classifies web pages into the following broad categories:
1. Information pages
2. Research pages
3. Personal home pages
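The structural cues used throughout this approach can be pulled out of an HTML document with a simple parser. The following is a minimal sketch, not the authors' actual implementation, of computing the link-text-to-total-text ratio and counting images using Python's standard html.parser; the sample HTML is hypothetical.

```python
from html.parser import HTMLParser

class StructureFeatures(HTMLParser):
    """Collects simple structural features: link text, total text, image count."""
    def __init__(self):
        super().__init__()
        self.in_link = 0      # nesting depth of <a> tags
        self.link_chars = 0   # characters appearing inside links
        self.text_chars = 0   # all text characters
        self.images = 0       # number of <img> tags

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1
        elif tag == "img":
            self.images += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.text_chars += n
        if self.in_link:
            self.link_chars += n

def link_text_ratio(html):
    """Ratio of characters in links to total characters in the page."""
    p = StructureFeatures()
    p.feed(html)
    return p.link_chars / max(p.text_chars, 1)

html = '<p>News</p><a href="/a">Research</a><a href="/b">People</a>'
print(round(link_text_ratio(html), 2))  # → 0.78
```

A full system would of course also record the placement of links and images, not just their counts, as discussed above.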
A typical information page has a logo at the top, followed by a navigation bar linking the page to other important pages. We have observed that the ratio of link text (the amount of text within links) to normal text also tends to be relatively high in these pages.

Research pages generally contain large amounts of text, along with equations and graphs in the form of images. The presence of equations and graphs can be detected by processing the histograms of the images.

Personal home pages also tend to have a common layout. The name, address and photo of the person appear prominently at the top of the page. Towards the bottom of the page, the person typically provides links to his publications, if any, and to other useful references or favorite destinations on the web.

Our implementation takes a start page and a domain as input and spiders the web, retrieving each page along with its images. The retrieval is breadth-first, and care is taken to avoid cyclic search. The pages and images are stored on the local machine for processing and feature extraction.

The categorization is carried out in two phases: a feature extraction phase followed by a classification phase. These are explained in detail in the following sub-sections.

4.1 Feature Extraction

The feature extraction phase is the most important phase in the system, as classification is based on the features extracted here. The features should provide valuable information about the document while remaining computationally inexpensive to obtain. Very small images (approximately 20x20 pixels or less) are not considered for feature extraction, as they usually correspond to buttons, icons etc.

The following features were taken into consideration for the classification of the pages.

(a) Textual information: The number and placement of links in a page provide valuable information about the broad category the page belongs to. We compute the ratio of the number of characters in links to the total number of characters in the page. A high ratio means a high probability of the page being an information page. However, some information pages contain links followed by a brief description of the document each link refers to; in such cases this ratio turns out to be low. So the placement of the links is also an important parameter.

The amount of text in a page gives an indication of the type of page. Generally, information and personal home pages are sparse in text compared to research pages. Hence we count the number of characters in the document (after removing some commonly used words) and use this information to grade the text content as sparse, moderate or high.

(b) Image information: The number of distinct colours in the images is computed. Since there is no need to take into account all 16.7 million colour shades of a true-colour image, we consider only the higher-order 4 bits of each colour channel (giving a total of 4096 colours). Information pages tend to have more colours than personal home pages, which in turn have more colours than research pages.

The histogram of an image is used to differentiate between natural and synthetic images. The histogram of a synthetic image generally concentrates in a few bands of colour shades; in contrast, the histogram of a natural image is spread over a larger area. Information pages usually contain many natural images, while research pages contain a number of synthetic images representing graphs, mathematical equations etc.

Research pages usually contain binary images, from which further information can be extracted. Graphs are generally found in research pages; we use the histogram of an image to detect the presence of a graph.

4.2 Classification

We use the following approach to classify a page according to the features obtained. In this algorithm, each feature contributes either positively or negatively towards a page being classified into a particular category.
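The colour quantization and histogram-based image features described in (b) can be sketched as follows. This is an illustrative sketch, not the paper's exact procedure: the pixel data is hypothetical, and the "fraction of pixels in the top few bins" concentration measure is one plausible way to capture how peaked a histogram is.

```python
from collections import Counter

def quantized_histogram(pixels):
    """Quantize (r, g, b) pixels to the high-order 4 bits per channel (4096 bins)."""
    return Counter(((r >> 4) << 8) | ((g >> 4) << 4) | (b >> 4) for r, g, b in pixels)

def distinct_colours(pixels):
    """Number of distinct quantized colours in the image."""
    return len(quantized_histogram(pixels))

def concentration(pixels, top=8):
    """Fraction of pixels falling in the `top` most populated bins.
    A high value suggests a synthetic image (graph, equation);
    a low value suggests a natural image with a spread-out histogram."""
    hist = quantized_histogram(pixels)
    total = sum(hist.values())
    return sum(c for _, c in hist.most_common(top)) / total

# A flat two-colour "synthetic" patch vs. a more varied "natural" patch (made-up data)
synthetic = [(255, 255, 255)] * 90 + [(0, 0, 0)] * 10
natural = [(i * 16 % 256, i * 32 % 256, i * 48 % 256) for i in range(100)]

print(distinct_colours(synthetic), concentration(synthetic))   # → 2 1.0
print(concentration(natural) < concentration(synthetic))       # → True
```

On real pages these functions would run over the pixels of each downloaded image, with the per-image values averaged into page-level features.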
The actual procedure is as follows. Let W be a (c x n) weight matrix and F an (n x 1) feature value vector, where n is the number of features and c is the number of categories. Then

    V = W x F

is a (c x 1) vector of category scores. The entry Wji ∈ [-1, +1] of W is the contribution of the ith feature to the jth category, and Fi is the value of the ith feature.

For each category, Wji varies between -1.0 and 1.0. For example, in the case of a home page, a weight of 1.0 could be assigned to the feature 'presence of an image with a human face', whereas the same feature would have a weight of -0.8 for a research page. Thus the presence of a photograph increases the chance of a page being classified as a personal home page and at the same time decreases the chance of it being classified as a research page or an information page.

Initially, these weights were assigned based on heuristics. They were later modified based on runs over sample pages: the accuracy of the output is checked, and the error is used as feedback to modify the weights.

The classification is done based on the elements of the vector V obtained by the above procedure: the document belongs to the category for which the corresponding entry of V is highest.

5. Results

We tested our implementation of this approach on a sample space of about 4000 pages, gathered from various domains. The results and the parameters used are shown in Table 1.

    Details                                              Results
    No. of pages on which we tested our implementation   ~4000
    Pages categorized                                    ~3700
    Pages categorized correctly                          ~3250
    % categorized correctly                              87.83%

    Table 1

We manually inspected the results obtained by our implementation and found some that are out of sync with the actual category the page should belong to. For example, the page shown in Figure 1 was classified as a research page though it should have been an information page. The reason can be attributed to the fact that it is a print version of the original page, and hence there is not a single link in it; also, no images are present and the amount of text is very large.

Figure 1: A wrongly categorized page
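The V = W x F scoring and the error-feedback adjustment of the weights can be sketched as follows. The update rule shown is a simple perceptron-style correction chosen for illustration; the paper does not spell out its exact update, and the weights, features and categories below are hypothetical.

```python
def classify(W, F):
    """V = W x F; return the index of the category with the largest score."""
    V = [sum(w * f for w, f in zip(row, F)) for row in W]
    return max(range(len(V)), key=V.__getitem__)

def update_weights(W, F, predicted, actual, rate=0.1):
    """Error feedback on a misclassified page: nudge weights toward the
    true category and away from the wrongly predicted one, keeping each
    weight clamped to [-1, +1] (a perceptron-style rule, not necessarily
    the authors' exact update)."""
    if predicted == actual:
        return
    for i, f in enumerate(F):
        W[actual][i] = min(1.0, W[actual][i] + rate * f)
        W[predicted][i] = max(-1.0, W[predicted][i] - rate * f)

# Hypothetical setup: 2 categories (0 = home page, 1 = research page),
# 2 features (face image present, text density), c x n weight matrix W.
W = [[1.0, -0.5],    # home page weights
     [-0.8, 0.9]]    # research page weights
print(classify(W, [1.0, 0.2]))  # → 0 (home page)
```

A page with a face image and little text scores 0.9 for the home-page category against -0.62 for the research-page category, so the home-page entry of V wins, as in the photograph example above.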
6. Conclusions and Future Work

We have described an approach for the automatic categorization of web pages that uses the images and the structure of the page for classification. The results obtained are quite encouraging. This approach, augmented with traditional text based approaches, could be used by search engines for effective categorization of web pages.

Our current implementation uses a method wherein the weights assigned to each feature are set manually during the initial training phase. A neural network based method could be employed to automate the training process.

Adding a few more heuristic features (e.g. classifying a page as a home page by detecting a face at the top) would increase the classification accuracy.

Currently, we use only images in addition to the structural information present in web pages; we are not exploiting other information present in the form of video, audio etc. These could also be used to obtain valuable information about the page.

We have so far used our approach to categorize web pages into very broad categories. The same algorithm could also be used to classify pages into more specific categories by changing the feature set.

7. References

John M. Pierre, Practical Issues for Automated Categorization of Web Pages, September 2000.
Yiming Yang and Xin Liu, A Re-examination of Text Categorization Methods, Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (SIGIR'99), pp. 42-49, University of California, Berkeley, USA, 1999.
Oh-Woog Kwon and Jong-Hyeok Lee, Web Page Classification Based on k-Nearest Neighbor Approach.
Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, AAAI-98 Workshop on Learning for Text Categorization, 1998.
Chidanand Apte and Fred Damerau, Automated Learning of Decision Rules for Text Categorization, ACM Transactions on Information Systems, Vol. 12, No. 3, pp. 233-251, 1994.
Koller, D. and Sahami, M., Hierarchically Classifying Documents Using Very Few Words, Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97), pp. 170-178, 1997.
Lewis, D.D. and Ringuette, M., A Comparison of Two Learning Algorithms for Text Categorization, Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR'94), pp. 81-93, 1994.
Weigend, A.S., Wiener, E.D. and Peterson, J.O., Exploiting Hierarchy in Text Categorization, Information Retrieval, 1(3), pp. 193-216, 1999.
Dumais, S.T., Platt, J., Heckerman, D. and Sahami, M., Inductive Learning Algorithms and Representations for Text Categorization, Proceedings of the Seventh International Conference on Information and Knowledge Management (CIKM'98), pp. 148-155, 1998.
Susan Dumais and Hao Chen, Hierarchical Classification of Web Content.
http://www.yahoo.com
Wai-Chiu Wong and Ada Wai-Chee Fu, Incremental Document Clustering for Web Page Classification, Chinese University of Hong Kong, July 2000.
Giuseppe Attardi, Antonio Gulli and Fabrizio Sebastiani, Automatic Web Page Categorization by Link and Context Analysis.