Improving Document Classification Using Anchor Text
Improving Document Classification Using Anchor Text

Anuj Kumar and Anupam Shukla
Department of Information Technology, Indian Institute of Information Technology and Management, Gwalior, India

ABSTRACT

The evolution of the Web and the increasing availability of information have led to diverse information needs. Standard Information Retrieval techniques fail to serve the need for specificity of search results in such a scenario. Document classification improves result specificity to a great extent, but generally relies on text only. This paper describes how link analysis combined with text can be used to improve classification. The paper studies the use of external evidence such as anchor text with a size-5 window technique and co-citation, combined with content- and link-based classification. Text-based classification is done using the Naïve Bayes probabilistic model, and is combined with link analysis, anchor text and co-citation evidence for improved subject similarity. Experiments reveal a significant improvement in classification results, and thus in search result effectiveness.

Keywords
Document Classification, Information Retrieval, Text Processing, Pattern Classification, Probabilistic Model

1. INTRODUCTION

The explosive evolution of the Web has led to a huge availability of information, and it is therefore critical to develop effective techniques for finding relevant information. The major issue in Information Retrieval has in the recent past turned from finding relevant results to providing results specific to the user query, in the closest possible proximity to relevance. The number of text documents in digital form has grown enormously, and consequently the amount of information available is huge. In practice, it is hard to organise such enormous data sets and extract relevant information from them. Beyond finding relevant results, finding results specific to the user's information need has gained much importance. A similar concept has been supported by earlier work stating that web users are interested not only in relevant results but also in authoritative ones, which has been conceptualised as result specificity in this paper: a term we use to signify 'authoritative' results that are specific to the user's information need.

A lot of research has been conducted on improving the result specificity of retrieval systems, especially by considering the link structure of a web page. Artificial Intelligence and Machine Learning techniques play a major role in this context. Several Machine Learning algorithms such as Neural Networks, Support Vector Machines, Naïve Bayes and K-Nearest Neighbour have been successfully applied to automatic content management and extraction in Information Retrieval. Significant IR processes such as Document Classification have been found well suited to machine learning techniques.

This paper focuses on improving result specificity by enhancing classification for document retrieval. Extensive work has been done on using document classification to improve the relevance of results. Instead of relying only on content-based classification, research has focused on using the link structure of documents to improve classification. The area of Link Analysis Ranking was first introduced by Kleinberg, demonstrating that a web page cited or linked by other 'authoritative' pages is relatively more important than other unlinked sources. A joint probabilistic model for document classification using both content and link structure as combined evidence for better predictions has been presented by Cohn and Hofmann. The link structure does provide additional evidence for the interdependency of documents, but there has been little focus on using other external evidence for classification. External evidence such as anchor text is known to be a strong indicator of topical relevance. Co-citation also provides relevant evidence for the similarity between two documents: if two documents are cited by the same document, it is highly likely that they are similar. Co-citation has been considered as one of the link analysis techniques, and combined with content-based classification models it has been found to give improved classification results.

In this paper we propose a model quite close to that of Calado et al., which combines content with link structure, co-citation and anchor text for classification. A size-5 window technique is used when considering the anchor text: the 5 words on either side of where the anchor text is placed in the document are also considered. Calado et al. describe a combined probabilistic model but do not consider external evidence such as anchor text as a measure for classification. We propose a new probabilistic model which uses the anchor text and the text in close proximity to it as one of the significant measures for classification, along with content and link structure.

In the following section we describe the related work. In section 3, different classification measures are described. The combined model for classification is proposed in section 4. Section 5 gives a brief description of the experiments and the evaluation of the classification technique. Conclusion and future work are discussed in the last two sections.

2. RELATED WORK

Great interest has been shown in using machine learning techniques to improve document retrieval and to automate the extraction of relevant information. Cunningham et al. survey various applications of machine learning in IR, describing standard and specialised applications such as information extraction, latent semantic indexing and information filtering. Besides focusing on content, research has also addressed the use of hyperlinks as one of the major sources of evidence for ranking results. Document classification also finds its roots in Information Retrieval as a means of achieving authoritativeness (specificity) of results.

The area of link analysis (also called link mining) was first brought to attention by Page et al. with the PageRank algorithm for improving the ranking of results. They showed that the link structure, i.e. the in-coming and out-going links of a document, reveals significant information about its relative importance over other unlinked documents. An extensive survey of the theory, algorithms and experiments of link analysis has been provided by Borodin et al. They propose algorithms based on a machine learning approach, in contrast to the usual algebraic and graph-based algorithms.

A good amount of research has also been done on combining content and link structure for document classification, eventually leading to improved document retrieval. A good study of how link mining helps improve classification accuracy has been presented by Fisher and Everson. It shows that link analysis comes in handy when the documents have a high link density and the links are of good quality.

Calado et al. combine link-based and content-based methods for document classification and present a combined probabilistic model. They combine four different link analysis methods with content-based classification methods such as Naïve Bayes, KNN and Support Vector Machines. Co-citation combined with content-based methods seems to give the best results. A closely related model has been presented by Cohn and Hofmann. It separately quantifies the content-based and link-based methods by the PLSA and PHITS techniques respectively, and combines them linearly into a single model. Interestingly, their model finds application not only in retrieval but also in other significant tasks such as predictive navigation, intelligent crawling and topic identification.

A framework for modelling link distributions has been proposed by Lu and Getoor. It presents experiments in a variety of domains using various link statistics such as mode-link, binary-link and count-link statistics. A latent semantic approach to automatic concept extraction for text categorisation has been suggested by Cai and Hofmann. A link-analysis-based classification framework for labelled and unlabelled data has also been proposed by Lu and Getoor.

The work in this paper is closely related to the probabilistic model presented by Calado et al. There has been relatively little interest in using external evidence such as anchor text. Anchor text has been shown to be quite useful in providing relevant evidence for judging the subject similarity of two documents. Anchor text is the hyperlinked words on a web page: the words users click on to navigate to a different page. It usually gives useful information about the content of the target page and an indication of what that page is about. It is interesting to explore the idea that not only the anchor text but also the text in close proximity to it could be of great relevance in understanding the link between two documents. We describe the combined model in the subsequent sections.

3. CLASSIFICATION MEASURES

In this section we first give a brief description of the classification measures based on content, links, co-citation and external evidence (anchor text). Later, a combined model for classification is proposed.
3.1 Content-Based Classification

There have been questions about the sufficiency of content-only classifiers. Content has been reported to be unreliable and noisy when used alone. By contrast, it has also been shown that in certain contexts, such as home-page finding, a content-only scenario gives the best results and is not helped by the hyperlink structure. Nevertheless, content can hardly be ruled out when classifying documents, as the text of a document is one of the prime pieces of evidence for studying subject similarity.

We used the Naïve Bayes probabilistic model for content-based classification. The Naïve Bayes method is based on the basic assumption of term independence, and primarily relies on term frequency. Naïve Bayes has been successfully applied to document classification in the past and gives promising results. The model is based on Bayes' Theorem, which gives us a conditional probability p(dT) that a given document d belongs to class c:

p(dT) = p(c|d) = p(d|c).p(c)    (1)

The division factor p(d) is not considered, as it is the same for every class. Assuming the independence of words, the probability p(d|c) is defined as:

p(d|c) = ∏i p(wi|c)    (2)

where wi is the i-th word in document d. The Naïve Bayes technique is also termed Bag of Words (BOW), as it uses word frequency as a parameter for classification.

3.2 Link Analysis and Co-Citation

There has been much focus on link mining in the IR community since Page et al. introduced the PageRank algorithm for ranking web documents. It is highly relevant in the context of the Web, which is a huge structure of hyperlinked documents where some documents point to others, thereby reflecting a pseudo-authentication of one web page by another 'authoritative' web page. The more links point to a document, the more important it is considered, barring pages engineered for the manipulation of search engine rankings.

Kleinberg described a link analysis method called Hypertext Induced Topic Selection (HITS), which was probabilistically modelled (PHITS) by Cohn and Hofmann. A simple model for quantifying link structure in a data set is to count the in-links and out-links of each document. The probability can then be defined as:

p(dL) = p(IL|c).p(OL|c)    (3)

where p(IL|c) is the probability that an incoming link of d belongs to documents in class c, and similarly p(OL|c) for the out-links. Either probability can be calculated by a simple count of the number of links (say, in-links) divided by the total number of incoming links in a particular class.

In this paper we consider co-citation as another measure alongside link analysis for document classification. Co-citation occurs when two documents are cited or linked by the same document; it suggests that the two documents are closely related. Calado et al. considered co-citation as one of the link-based similarity measures in their experiments, defining it as:

p(dC) = |P(d1) ∩ P(d2)| / |P(d1) ∪ P(d2)|    (4)

where P(di) is the set of parents (citing documents) of di. Eq. (4) tells us that the more parents d1 and d2 have in common, the more related they are.

It is interesting to note that both p(dL) and p(dC) are affected by which documents are considered, i.e. whether we consider only links within the corpus or also links from documents outside it. We feel that better results could be achieved if external citations were also considered for the co-citation measure p(dC). We combined the link analysis and co-citation probabilities to get a single measure of subject similarity based on link structure:

p(dLC) = p(dL).p(dC)    (5)

Eq. (5) is later used to present a combined model of content, subject similarity and external evidence (anchor text) for classification.
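The measures of Eqs. (1)-(5) can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: the toy data structures, the add-one smoothing, and the fallback values for empty link lists are our additions, not from the paper.

```python
import math
from collections import Counter

def train_naive_bayes(docs_by_class):
    """Estimate class priors p(c) and word probabilities p(w|c) (Eq. (2)),
    with add-one smoothing (our addition, to avoid zero probabilities)."""
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    priors, word_probs = {}, {}
    for c, docs in docs_by_class.items():
        priors[c] = len(docs) / total_docs
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        word_probs[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return priors, word_probs

def p_dT(words, c, priors, word_probs):
    """Eq. (1): p(c) * prod p(w|c), computed in log space for stability."""
    log_p = math.log(priors[c])
    log_p += sum(math.log(word_probs[c].get(w, 1e-9)) for w in words)
    return math.exp(log_p)

def p_dL(in_links, out_links, c, class_of):
    """Eq. (3): fraction of d's in-links coming from class-c documents,
    times the same fraction for its out-links. Empty link lists are
    treated as neutral evidence (factor 1.0, our assumption)."""
    def frac(links):
        if not links:
            return 1.0
        return sum(1 for x in links if class_of.get(x) == c) / len(links)
    return frac(in_links) * frac(out_links)

def p_dC(parents_d1, parents_d2):
    """Eq. (4): co-citation overlap of two documents' parent sets."""
    union = parents_d1 | parents_d2
    return len(parents_d1 & parents_d2) / len(union) if union else 0.0

def p_dLC(p_link, p_cocit):
    """Eq. (5): combined link-structure measure."""
    return p_link * p_cocit
```

Note that Eq. (4) compares two documents; for classification, p(dC) between a document and a class would need to be aggregated, e.g. averaged over the class members. The paper leaves this aggregation implicit.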
3.3 External Evidence

Anchor text has found its place in information retrieval systems as an effective means of enhancing the ranking of results, and it is believed that many search engines use anchor text as a form of external evidence. There have been experiments in TREC (the Text REtrieval Conference) that focused on exploring the possibilities of using links and link text (anchor text) as external evidence for improved ranking of home pages. Craswell et al. demonstrated that ranking based on link anchor text is twice as effective as ranking based on document content, and that the links as such are not of much help.

In this paper we propose the use of anchor text supported by a size-5 window on either side of the anchor text.

Fig. 1: Anchor text with other text on either side.

Fig. 1 shows the relevance of using the text on either side of the anchor text as support for the link text, as it provides valuable information about the link (for example, that the target site is about research papers and literature). We keep the window small (size 5) so that it does not outweigh the anchor text probability measure. The anchor text measure is a probability value for the occurrence of the anchor text and the 5 words on either side of it, relative to the total number of words in the class. It is quite similar to the Bayes model, except that only the anchor text and the neighbouring text are considered. The probability is defined as:

p(dA) = p(dA|c).p(c)    (6)

p(dA|c) = ∏A p(wA|c)    (7)

where wA is the A-th word in the anchor text and the neighbouring text (up to five words on either side).

4. THE COMBINED MODEL

The content probability measure and the link probability measure are combined in a probabilistic model similar to that of Calado et al. This gives us a new measure, p(dLCC), calculated for each class:

p(dLCC) = 1 − (1 − p(dT))(1 − p(dLC))    (8)

We observed that the anchor text probability values are quite small compared with the other probability values; thus, in order to include the anchor text as a significant contributor, we used a multiplying factor α to combine p(dLCC) and p(dA). The combined model becomes:

P(d) = p(dLCC) + α.p(dA)    (9)

Eq. (9) gives us a combined probabilistic model for document classification, supported by important external evidence in the form of anchor text.
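The window extraction of Section 3.3 and the combined scoring of Eqs. (8)-(9) can be sketched as follows. The whitespace tokenisation, the fallback when the anchor is not found, and the default α = 3 are our own illustrative choices; the paper does not specify them.

```python
def anchor_window(text, anchor, k=5):
    """Return the anchor text plus up to k words on either side of its
    first occurrence (the paper's size-5 window). Naive whitespace
    tokenisation; the paper does not specify a tokeniser."""
    words = text.split()
    a = anchor.split()
    for i in range(len(words) - len(a) + 1):
        if words[i:i + len(a)] == a:
            return words[max(0, i - k): i + len(a) + k]
    return a  # anchor not found verbatim: fall back to the anchor alone

def combined_score(p_T, p_LC, p_A, alpha=3.0):
    """Eqs. (8)-(9): noisy-OR combination of the content and link
    measures, plus the anchor measure boosted by alpha (the paper
    reports best results for 2 < alpha < 4)."""
    p_LCC = 1.0 - (1.0 - p_T) * (1.0 - p_LC)  # Eq. (8)
    return p_LCC + alpha * p_A                # Eq. (9)
```

A document would then be assigned to the class maximising combined_score, with p(dA) estimated from the window words as in Eqs. (6)-(7).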
5. EXPERIMENTS

The experiments were not carried out extensively, but rather on a small data set (nearly 3000 documents) collected locally from university websites. The predefined classes are given in Table 1. The training set was constructed manually by choosing documents at random and putting them into the predefined classes; the remaining documents were put together in a single folder for classification. The results are encouraging and deserve closer investigation; it would be worth testing the system on a large, standard data set.

Table 1. Predefined genre classes

Genre Classes       Uni1   Uni2   Uni3   Uni4
Project Pages        342    243    264    244
Home Pages            50     34     66     42
Course Pages         204    232    176    230
Publication Pages     65     44     80     64
Lab Info Pages       144    154    130     90

We noted that as the number of genre classes increased, the classification results degraded a little. Three different experiments were conducted:

1. Content-only classification
2. Content and link analysis based classification
3. The proposed model, with anchor text as external evidence

It was observed that including external evidence (anchor text) along with content and link structure gives better results than the other two experiments. The parameter α in Eq. (9) was tuned and gives the best results for 2 < α < 4. A percentage of the training data, called the holdout data, was reserved for parameter tuning, that is, for finding the most acceptable value of α for the resultant model.
It is important to observe that the anchor text and the neighbouring text are unlikely to contain similar words; thus the term frequency of each word in the anchor text and the neighbouring text is generally 1, and 2 < α < 4 effectively increases these word counts for better results. The parameter tuning was done so that the anchor and window information are given more weight in the combined classification model. This lets us test whether external evidence such as anchor text helps improve document classification.

5.1 Evaluation Measures

The standard evaluation measures in IR are Precision and Recall. An interesting measure for classifier evaluation is the F1 measure, the harmonic mean of Precision (P) and Recall (R), defined as:

F1 = 2PR / (P + R)    (10)

We evaluated our classification model using Precision, Recall and the F1 measure.
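Eq. (10) is straightforward to compute. Below is a minimal helper, with precision and recall derived from per-class counts; the count-based framing is our own illustration, not taken from the paper.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive and
    false-negative counts for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f1_score(precision, recall):
    """Eq. (10): the harmonic mean of precision P and recall R."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, precision_recall(8, 2, 8) gives P = 0.8 and R = 0.5, for which f1_score returns about 0.615.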
5.2 Results

Table 2 gives a comparison of results for the three runs.

Table 2. Result comparison for the three runs

Run Type                                               P      R      F1
Content-only                                         0.876  0.433  0.558
Content and link analysis combined                   0.905  0.431  0.583
Combined model with external evidence (anchor text)  0.912  0.459  0.611

The table shows that the results do improve when anchor text is considered, though we cannot be sure that this would be representative of a large data set. Although the experiments were on a smaller data set, we believe that anchor text does act as external evidence and can be used in the classification context.

6. CONCLUSION

Result specificity has gained importance in information search and delivery over the past few years, with the increasing availability of information on the Web. Relying only on the content of a web page does not help in such a scenario, and link analysis has therefore gained serious attention in the field of document classification and its application in retrieval. In this paper we discussed various measures for classification and proposed that, along with content and link analysis, it is worth considering external evidence such as the anchor text and its neighbouring text within a size-5 window.

Content-only classifiers are thought to be noisy and less efficient, but give interesting results in certain scenarios such as home-page finding; the Naïve Bayes probabilistic model works best in a content-only classification scenario. Content combined with link analysis (incoming links, outgoing links and co-citation) does improve classification, but fails in such scenarios (home-page finding). The combined model proposed in this paper joins the content and link analysis measures with the anchor text measure. The experiments, though on a smaller scale, reveal that the inclusion of anchor text improves classification to some extent. We combined the measures using a parameter α, and note that finding an appropriate value of α requires manual or automatic parameter tuning; at this point it seems hard to propose a supervised learning scenario for automatically identifying the optimum value of α. The anchor text measure was intentionally favoured over the content and link analysis measures, to check the worth of including external evidence in the context of document classification. We conclude that external evidence (anchor text) and the neighbouring text help improve classification in the context of the Web, and improve result specificity in a document retrieval model.

7. FUTURE WORK

Future work can be done in conducting experiments on a standard web data set. A critical problem is calculating the tuning parameter α automatically in a supervised learning scenario. Other techniques for combining the various classification measures are also to be looked at. Neural networks are considered good for text categorisation; we aim to study the possibility of including content (text) and anchor text as input nodes and conducting the relevant experiments.

8. REFERENCES

L. Page, S. Brin, R. Motwani and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library working paper SIDL-WP-1999-0120 (version of 11/11/1999). See: http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-1999-0120

T. Upstill, N. Craswell and D. Hawking. Query independent evidence in home page finding. ACM Transactions on Information Systems (TOIS), pages 286-313, 2003.

L. Cai and T. Hofmann. Text Categorization by Boosting Automatically Extracted Concepts. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2003.

D. A. Cohn and T. Hofmann. The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity. In Proceedings of Neural Information Processing Systems 2000, pages 430-436, December 2001.

A. Borodin, J. S. Rosenthal, G. O. Roberts and P. Tsaparas. Link analysis ranking: Algorithms, theory and experiments. ACM Transactions on Internet Technologies, pages 231-297, 2005.

P. Calado, M. Cristo, E. S. Moura, N. Ziviani, B. A. Ribeiro-Neto and M. A. Gonçalves. Combining link-based and content-based methods for web document classification. In Proceedings of the Conference on Information and Knowledge Management 2003, pages 394-401, November 2003.

J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, pages 604-632, 1999.

M. Fisher and R. Everson. When are Links Useful? Experiments in Text Classification. In Proceedings of the Twenty-fifth European Conference on IR Research, pages 41-56, April 2003.

S. Chakrabarti, B. Dom and P. Indyk. Enhanced hypertext categorization using hyperlinks. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 307-318, June 1998.

N. Craswell, D. Hawking and S. Robertson. Effective site finding using link anchor information. In Proceedings of the 24th SIGIR, pages 250-257, 2001.

Q. Lu and L. Getoor. Link-based classification. In Proceedings of the Twentieth International Conference on Machine Learning, pages 496-503, August 2003.

Q. Lu and L. Getoor. Link-based classification using labeled and unlabeled data. In Proceedings of the International Conference on Machine Learning, August 2003.

D. Hawking and N. Craswell. Overview of the TREC 2001 Web Track. In Voorhees, E., Harman, D. (Eds), NIST Special Publication 500-250, 2002.

R. A. Calvo, J. Lee and X. Li. Managing Content with Automatic Document Classification. Journal of Digital Information, Volume 5, Issue 2, August 2004.

F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1-47, 2002.

S. J. Cunningham, J. Littin and I. H. Witten. Applications of machine learning in information retrieval. Technical Report 97/6, University of Waikato, February 1997.