Improving Document Classification Using Anchor Text

Anuj Kumar and Anupam Shukla
Department of Information Technology, Indian Institute of Information Technology and Management, Gwalior, India

ABSTRACT

The evolution of the Web and the increasing availability of information have led to diverse information needs. Standard Information Retrieval techniques fail to deliver the specificity of search results that such a scenario requires. Document classification improves result specificity considerably, but generally relies on text only. This paper describes how link analysis combined with text can be used to improve classification. It studies the use of other external evidence, namely anchor text with a window of size 5 and co-citation, combined with content-based and link-based classification. Text-based classification is done using the Naïve Bayes probabilistic model and is combined with link analysis, anchor text and co-citation evidence for improved subject similarity. Experiments reveal significant improvement in the classification results, and thus improvement in search result effectiveness.

Keywords

Document Classification, Information retrieval, Text Processing, Pattern Classification, Probabilistic model

1. INTRODUCTION

The explosive evolution of the Web has led to a huge availability of information, and it is therefore critical to develop effective techniques for finding relevant information. The major issue faced by Information Retrieval in the recent past has turned from finding relevant results to providing results specific to the user query, as close as possible to what is relevant. The number of text documents in digital form has grown enormously and, consequently, the amount of information available is huge. In practice, it is hard to organise such enormous data sets and extract relevant information from them. Beyond finding relevant results, finding results specific to the user's information need has gained much importance. A similar view is supported by [11], which states that web users are interested not only in relevant results but in results that are also authoritative; this is conceptualised as result specificity in this paper. Result specificity is the term we use to signify 'authoritative' results which are specific to the user's information need.

A lot of research has been conducted on improving the result specificity of retrieval systems, especially by considering the link structure of a web-page. Artificial Intelligence and Machine Learning techniques seem to play a major role in this context. Several Machine Learning algorithms such as Neural Networks, Support Vector Machines, Naïve Bayes and K-nearest Neighbour have been successfully applied to automatic content management and extraction in the context of Information Retrieval [5]. Significant IR processes such as Document Classification have been found suited to machine learning techniques [7].

This paper focuses on improving result specificity by enhancing classification for document retrieval. Extensive work has been done on using document classification to improve the relevance of results. Instead of relying only on content-based classification, research has focused on using the link structure of documents to improve classification. The area of Link Analysis Ranking was first introduced by [8], demonstrating that a web-page cited or linked by other 'authoritative' pages is relatively more important than other unlinked sources. A joint probabilistic model for document classification that uses both content and link structure as combined evidence for better predictions has been presented by [1]. The link structure does provide additional evidence for the interdependency of documents, but there has been little focus on using other external evidence for classification. External evidence such as anchor text is known to be a strong indicator of topical relevance [9]. Co-citation also provides relevant evidence for similarity
between two documents, i.e. if two documents are cited by the same document, it is highly likely that the documents are similar. Co-citation has been considered as one of the link analysis techniques, and combined with content-based classification models it has been found to give improved classification results [2].

In this paper we propose a model quite close to [2], which combines content with link structure, co-citation and anchor text for classification. A window of size 5 is used while considering the anchor text: the 5 words closest to the anchor text on each side of it are also considered. [2] describes a combined probabilistic model but does not consider external evidence such as anchor text as a measure for classification. We propose a new probabilistic model which uses the anchor text and the text in close proximity to it as one of the significant measures for classification, along with content and link structure.

In the following section we describe the related work. In section 3, the different classification measures are described. The combined model for classification is proposed in section 4. Section 5 gives a brief description of the experiments and the evaluation of the classification technique. Conclusion and future work are discussed in the last two sections.

2. RELATED WORK

Great interest has been shown in using machine learning techniques to improve document retrieval and to automate the extraction of relevant information. Cunningham et al. [7] survey various applications of machine learning in IR. They describe standard and specialised applications of machine learning such as information extraction, latent semantic indexing and information filtering. Instead of focusing only on content, research has also addressed the use of hyperlinks as one of the major sources of evidence for ranking results. Document classification also finds its roots in Information Retrieval as a means of achieving authoritativeness (specificity) of results.

The area of link analysis (also called link mining) was first brought to attention by [8], which presented the PageRank algorithm for improving the ranking of results. They showed that the link structure, i.e. the in-coming and out-going links of a document, reveals significant information about its importance relative to other un-linked documents. An extensive survey of the theory, algorithms and experiments on link analysis has been provided by [11]. They propose algorithms based on a machine learning approach, in contrast to the usual algebraic and graph-based algorithms.

A good amount of research has also been done on combining content and link structure for document classification, eventually leading to improved document retrieval. A good study of how link mining helps improve classification accuracy has been presented by [13]. It shows that link analysis comes in handy when the documents have a high link density and the links are of good quality.

[2] combines link-based and content-based methods for document classification and presents a combined probabilistic model. It combines 4 different link analysis methods with content-based classification methods such as Naïve Bayes, KNN and Support Vector Machines. Co-citation combined with content-based methods seems to give the best results. A closely related model has been presented by [1]. It separately quantifies the content-based and link-based evidence using the PLSA and PHITS techniques respectively, and combines them linearly into a single model. It is interesting that their model finds application not only in retrieval but also in various other significant tasks such as predictive navigation, intelligent crawling and topic identification.

A framework for modelling link distributions has been proposed in [4]. It presents experiments in a variety of domains using various link statistics such as mode of links, binary link statistics and count-link statistics. A latent semantic approach of automatic concept extraction for text categorization has been suggested by [10]. A link analysis based classification framework for labelled and unlabelled data has also been proposed by [15].

The work in this paper is closely related to the probabilistic model presented in [2]. There has been relatively little interest in using external evidence such as anchor text. Anchor text has been shown to be quite useful in providing relevant evidence for judging the subject similarity of two documents [9]. Anchor text is the hyperlinked words on a web page - the words users click on to navigate to a different page. Anchor text usually gives useful information
about the content of the page and gives an indication of what the page is about. It is interesting to explore whether not only the anchor text but also the text in close proximity to it could be of great relevance in understanding the link between two documents. We describe the combined model in the subsequent sections.

3. CLASSIFICATION MEASURES

In this section we first give a brief description of the classification methods based on content, link, co-citation and external evidence (anchor text). Later a combined model for classification is proposed.

3.1 Content-Based Classification

There have been questions about the sufficiency of content-only classifiers. Content has been described as unreliable [3] and noisy [2] when used alone. By contrast, it has also been shown that in certain contexts, such as home-page finding, the content-only scenario gives the best results and is not helped by the hyperlink structure [16]. Nevertheless, content can hardly be ruled out while classifying documents, as the text of a document is one of the prime sources of evidence for its subject similarity.

We used the Naïve Bayes probabilistic model for content-based classification. The Naïve Bayes method is based on the basic assumption of term independence and primarily relies on term frequency. Naïve Bayes has been successfully applied to document classification in the past and gives promising results [6]. The Naïve Bayes model is based on Bayes' theorem, which gives us the conditional probability p(d_T) that a given document d belongs to class c:

p(d_T) = p(c|d) ∝ p(d|c) · p(c)    (1)

The normalising factor p(d) is not considered, as it is the same for every class. Assuming the independence of words, the probability p(d|c) is defined as:

p(d|c) = ∏ p(w_i|c)    (2)

where w_i is the i-th word in document d. The Naïve Bayes technique is also termed Bag of Words (BOW), as it uses the word frequency as a parameter for classification.

3.2 Link Analysis and Co-Citation

There has been much focus on link mining in the IR community since [8] introduced the PageRank algorithm for ranking web documents. It is highly relevant in the context of the Web, as the Web is a huge structure of hyperlinked documents in which some documents point to others, thereby reflecting a pseudo-authentication of one web-page by another 'authoritative' web-page. The more links point to a document, the more important it is considered, barring pages engineered to manipulate search engine rankings.

[12] described a link analysis method called Hypertext Induced Topic Selection (HITS), which was probabilistically modelled (PHITS) in [1]. A simple way of quantifying link structure in a data set is to count the in-links and out-links related to each document. The probability can thus be defined as:

p(d_L) = p(I_L|c) · p(O_L|c)    (3)

where p(I_L|c) is the probability that an incoming link of d belongs to documents in class c, and similarly p(O_L|c) for the out-links. Either probability can be calculated by a simple count of the number of links (say in-links) divided by the total number of incoming links in a particular class.

In this paper we consider co-citation to be another measure, along with link analysis, for document classification. Co-citation occurs when two documents are cited or linked by the same document; it suggests that the two documents are closely related to each other. In its experiments, [2] considered co-citation as one of the link-based similarity measures. It defines co-citation as:

p(d_C) = |P(d1) ∩ P(d2)| / |P(d1) ∪ P(d2)|    (4)

where P(d) is the set of parents of d, i.e. the documents that link to d. Eq. (4) tells us that the more parents d1 and d2 have in common, the more related they are.

It is interesting to note that both p(d_L) and p(d_C) are affected by the documents considered, i.e. whether we consider only links within the corpus or also links from documents outside the corpus. We feel that better results could be achieved if external citations are also considered in the case of the co-citation measure p(d_C). We combined the link analysis and co-citation probabilities to get a single measure of subject similarity based on link structure. The resulting measure becomes:
p(d_LC) = p(d_L) · p(d_C)    (5)

Eq. (5) is later used to present a combined model for content, subject similarity and external evidence (anchor text) for classification.

3.3 External Evidence

Anchor text has found its place in Information Retrieval systems for enhancing the ranking of results. It is also believed that many search engines use anchor text as external evidence [13]. There have been experiments in TREC (Text Retrieval Conference) that focused on exploring the possibilities of using links and link text (anchor text) as external evidence for improved rankings of home-pages. [14] demonstrated that ranking based on link anchor text is twice as effective as ranking based on document content, while experiments show that the links as such are not much of a help.

In this paper we propose the use of anchor text supported by a window of 5 words on either side of the anchor text.

Fig.1 Anchor Text with other texts on either side.

Fig.1 shows the relevance of using the text on either side of the anchor text as a support for the link text, as it provides valuable information about the link (that the site is about research papers and literature). We use a small window (size 5) so that the neighbouring text does not outweigh the anchor text in the probability measure. The anchor text measure is also a probability value: the occurrence of the anchor text and the 5 words on either side of it with respect to the total number of words in the class. It is quite similar to the Bayes model, except that only the anchor text and the neighbouring text are considered. The probability value is defined as:

p(d_A) = p(d_A|c) · p(c)    (6)

p(d_A|c) = ∏ p(w_A|c)    (7)

where w_A is the A-th word in the anchor text and the neighbouring text (up to five words on either side).

4. THE COMBINED MODEL

The content probability measure and the link probability measure are combined in a probabilistic model similar to [2]. This gives us a new measure, which we call p(d_LCC), calculated for each class:

p(d_LCC) = 1 - (1 - p(d_T))(1 - p(d_LC))    (8)

We observed that the anchor text probability values would be quite small compared to the other probability values. In order to include the anchor text as one of the significant contributors, we therefore used a multiplying factor α to combine p(d_LCC) and p(d_A). The combined model becomes:

P(d) = ∑ p(d_LCC) + α · p(d_A)    (9)

Eq. (9) gives us a combined probabilistic model for document classification, supported by important external evidence such as anchor text.

5. EXPERIMENTS

The experiments were not carried out extensively, but rather on a small data set (nearly 3000 documents) collected locally from university websites.

The predefined classes are given in Table 1. The training set was manually constructed by choosing documents at random and putting them in the pre-defined classes. The remaining documents were put together in a single folder for classification. The results are encouraging and deserve closer investigation. It would be worth testing the system on a large, standard data set.

Table1. Pre-defined Genre classes

Genre Classes        Uni1   Uni2   Uni3   Uni4
Project Pages         342    243    264    244
Home Pages             50     34     66     42
Course Pages          204    232    176    230
Publication Pages      65     44     80     64
Lab info Pages        144    154    130     90

We noted that as the number of genre classes was increased the classification results
degraded a little. Three different experiments were conducted:

1. Content-only classification
2. Content and Link analysis based
3. The proposed Model with external evidence as anchor text.

It was observed that the inclusion of external evidence (anchor text) with content and link structure gives better results than the other two experiments. The parameter α in Eq. (9) was tuned and gives the best results for 2<α<4. A percentage of the training data, called the holdout data, was reserved for parameter tuning, that is, for finding the most acceptable value of α for the resultant model. It is important to observe that the anchor text and the neighbouring text are unlikely to contain repeated words, so the term frequency of each word in the anchor text and the neighbouring text would generally be 1; a value of 2<α<4 effectively increases this word count for better results. The parameter tuning was done so that the anchor and window information are given more weight in the combined classification model. This helps us to test whether external evidence such as anchor text helps improve document classification or not.

5.1 Evaluation Measures

The most standard evaluation measures in IR have been Precision and Recall. One of the interesting measures for classifier evaluation is the F1 measure, which is the harmonic mean of Precision (P) and Recall (R) [5]. It is defined as:

F1 = 2PR / (P + R)    (10)

We evaluated our classification model using the F1 measure, Precision and Recall.

5.2 Results

Table2. Result comparison for 3 different runs.

Run Type                                               P       R       F1
Content-Only                                           0.876   0.433   0.558
Content and Link Analysis Combined                     0.905   0.431   0.583
Combined Model with External evidence (Anchor Text)    0.912   0.459   0.611

Table 2 gives a comparison of the results of the three experiments.

The table above shows that the results do improve when anchor text is considered. We cannot be sure, however, that this would be representative of a large data set. Although the experiments used a small data set, we believe that anchor text does act as external evidence and could be used in the classification context.

6. CONCLUSION

Result specificity has gained importance in information search and delivery over the past few years, with the increasing availability of information on the Web. Relying only on the content of a web-page does not help in such a scenario, so link analysis has gained serious attention in the field of Document Classification and its application in retrieval. In this paper we discussed the various measures for classification and proposed that, along with content and link analysis, it is worth considering external evidence such as anchor text and the neighbouring text within a window of size 5.

Content-only classifiers are thought to be noisy and less efficient, but give interesting results in certain scenarios such as home-page finding. The Naïve Bayes probabilistic model works best in a content-only classification scenario. Content combined with link analysis (incoming links, outgoing links and co-citation) does improve classification but fails in such scenarios (home-page finding). The combined model for classification proposed in this paper combines the content and link analysis measures with the anchor text measure. Experiments, though on a small scale, reveal that the inclusion of anchor text improves classification to some extent.

We combined the measures using a parameter α, and note that finding an appropriate value of α requires manual or automatic parameter tuning. At this point it seems hard to propose a supervised learning scenario for automatic identification of the optimum value of α. The anchor text measure was intentionally favoured over the content and link analysis measures, to check the worth of including external evidence in the context of document classification.

Thus, we conclude that external evidence (anchor text) and neighbouring text in the context of the Web help improve classification and improve result specificity in a document retrieval model.
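As an illustration of how the measures fit together, the following minimal Python sketch follows the anchor window of section 3.3, the combination of Eq. (8) and Eq. (9), and the F1 measure of Eq. (10). It is not the implementation used in the experiments. All function names are hypothetical, add-one smoothing is an assumption that the paper does not specify, Eq. (9) is read here as a per-class score without the summation, and alpha = 3.0 is simply one value inside the reported range 2<α<4.

```python
import math
from collections import Counter

WINDOW = 5  # words kept on each side of the anchor text (section 3.3)


def anchor_window(tokens, start, end, size=WINDOW):
    """Return the anchor tokens[start:end] plus up to `size` words on each side."""
    return tokens[max(0, start - size):min(len(tokens), end + size)]


def word_likelihood(words, class_counts, vocab_size):
    """Product of p(w|c) as in Eq. (2) and Eq. (7).

    Add-one smoothing is an assumption made here so that unseen
    words do not force the product to zero.
    """
    total = sum(class_counts.values())
    log_p = sum(
        math.log((class_counts.get(w, 0) + 1) / (total + vocab_size))
        for w in words
    )
    return math.exp(log_p)


def combine_content_and_links(p_text, p_link_cocitation):
    """Eq. (8): 1 - (1 - p(d_T)) * (1 - p(d_LC))."""
    return 1 - (1 - p_text) * (1 - p_link_cocitation)


def combined_score(p_text, p_link_cocitation, p_anchor, alpha=3.0):
    """Eq. (9), read as a score for one class: p(d_LCC) + alpha * p(d_A)."""
    return combine_content_and_links(p_text, p_link_cocitation) + alpha * p_anchor


def f1(precision, recall):
    """Eq. (10): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Usage with an invented sentence; the anchor text is "research papers".
tokens = ("see the list of recent research papers and literature "
          "on web mining here").split()
window = anchor_window(tokens, 5, 7)

# Invented word counts for one class, for illustration only.
counts = Counter({"research": 3, "papers": 2, "literature": 1})
p_anchor_given_class = word_likelihood(window, counts, vocab_size=50)

print(window)
print(combined_score(0.5, 0.5, p_anchor_given_class))
```

A document would be assigned to the class with the highest combined score. Working in log space inside `word_likelihood` only delays underflow for short windows; for whole documents the log value itself would normally be compared across classes.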
7. FUTURE WORK

Future work includes conducting experiments on a standard web data set. A critical problem is to calculate the tuning parameter α automatically in a supervised learning scenario. Other techniques for combining the various classification measures remain to be explored. Neural Networks are considered good for text categorization; we aim to study the possibility of including content (text) and anchor text as input nodes and to conduct relevant experiments.

8. REFERENCES

[1] D. A. Cohn and T. Hofmann. The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity. In the Proceedings of Neural Information Processing Systems 2000, pages 430-436, December 2001.
[2] P. Calado, M. Cristo, E. S. Moura, N. Ziviani, B. A. Ribeiro-Neto, and M. A. Gonçalves. Combining link-based and content-based methods for web document classification. In the Proceedings of Conference on Information and Knowledge Management 2003, pages 394-401, November 2003.
[3] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 307-318, June 1998.
[4] Q. Lu and L. Getoor. Link-based classification. In Proceedings of The Twentieth International Conference on Machine Learning, pages 496-503, August 2003.
[5] R. A. Calvo, J. Lee, and X. Li. Managing Content with Automatic Document Classification. Journal of Digital Information, Volume 5, issue 2, August 2004.
[6] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1-47, 2002.
[7] S. J. Cunningham, J. Littin, and I. H. Witten. Applications of machine learning in information retrieval. Technical Report 97/6, University of Waikato, February 1997.
[8] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library working paper SIDL-WP-1999-0120 (version of 11/11/1999). See: http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-1999-0120
[9] T. Upstill, N. Craswell, and D. Hawking. Query independent evidence in home page finding. ACM Transactions on Information Systems (TOIS), pages 286-313, 2003.
[10] L. Cai and T. Hofmann. Text Categorization by Boosting Automatically Extracted Concepts. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2003.
[11] A. Borodin, J. S. Rosenthal, G. O. Roberts, and P. Tsaparas. Link analysis ranking: Algorithms, theory and experiments. ACM Transactions on Internet Technologies, pages 231-297, 2005.
[12] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, pages 604-632, 1999.
[13] M. Fisher and R. Everson. When are Links Useful? Experiments in Text Classification. Twenty-fifth European conference on IR Research, pages 41-56, April 2003.
[14] N. Craswell, D. Hawking, and S. Robertson. Effective site finding using link anchor information. In Proceedings of 24th SIGIR, pages 250-257, 2001.
[15] Q. Lu and L. Getoor. Link-based Classification using labeled and unlabeled data. In Proceedings of the International Conference On Machine Learning, August 2003.
[16] D. Hawking and N. Craswell. Overview of the TREC 2001 Web Track. In Voorhees, E., Harman, D. (Eds), NIST Special Publication 500-250, 2002.