(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 8, No. 2, 2010, pp. 168-173, ISSN 1947-5500
http://sites.google.com/site/ijcsis/
Webpage Classification based on URL
Features and Features of Sibling Pages
Sara Meshkizadeh
Department of Computer Engineering, Science and Research Branch, Islamic Azad University (IAU), Khouzestan, Iran
Sara_meshkizadeh@yahoo.com

Dr. Amir Masoud Rahmani
Department of Computer Engineering, Science and Research Branch, Islamic Azad University (IAU), Tehran, Iran
rahmani@srbiau.ac.ir

Dr. Mashallah Abbasi Dezfuli
Department of Computer Engineering, Science and Research Branch, Islamic Azad University (IAU), Khouzestan, Iran
Abbasi_masha@yahoo.com

Abstract: Webpage classification plays an important role in information organization and retrieval. It involves assigning a webpage to one or more predetermined categories. The uncontrolled nature of web content means that webpage classification requires more work than traditional text classification. The interconnected nature of hypertext, however, provides features that help the process, for example the URL features of a webpage. This study shows that combining such features with features of sibling pages, i.e. pages that share the same parent, by means of a Bayesian algorithm improves the accuracy of webpage classification.

Keywords: classification, hypertext, URL, sibling pages, Bayesian algorithm

I. INTRODUCTION

Traditional classification is a form of supervised learning in which a set of labeled data is used to train a classifier that is then applied to future examples. Webpage classification differs from traditional text classification in a number of respects. First, traditional text classification is usually performed on structured documents written in a fixed style (e.g. news articles), whereas webpage content is far from having such characteristics. Second, webpages are documents with an HTML structure that is rendered visually for the user. Classification plays a significant role in information management and retrieval tasks. On the web, webpage classification is essential for focused crawling, web directories, topic-specific web links, contextual advertising, and analysis of topical structure. It can be of great help in improving search quality on the web.

Using URL-related information, [1] achieved 48% accuracy in classification. [2] uses URL information as well as webpage pictures to provide a classification with a tree structure. [3] combines parent page information with HTML features to arrive at a classification. [4] reaches 80% accuracy through a combination of web mining information: combining features of HTML and URL, nine features in total, classification is made with 80% accuracy. This study introduces a novel method for webpage classification that enhances existing algorithms. The method combines three different features of the webpage URL, and it also employs information from sibling pages, i.e. neighboring pages with the same parent, to increase classification accuracy.

In this paper, Section II describes related concepts. Section III introduces the proposed algorithm in detail. Section IV evaluates the proposed algorithm, and Section V provides conclusions and insights towards future work.

II. RELATED CONCEPTS

A. Selecting Database

As webpage classification is usually treated as supervised learning, classified samples are needed for training. In addition, some samples are required for testing, so that the classifier can be evaluated. Manual labeling requires a great deal of human effort; therefore, available web directories are commonly used in research. One of the most widely used is ODP [12]. In this directory, 4519050 websites are classified by 84430 editors. This research has been performed on the universities, shopping, forums, and FAQ (frequently asked questions) categories of ODP.

B. Feature Extraction from Webpage URL

The textual content is the most important feature available on a webpage. However, given the wide range of noise found on webpages, using a bag of words directly may not lead to higher
efficiency. Researchers have proposed a number of methods for making better use of textual features; feature selection is a popular one. In addition to features of HTML tags, a webpage can be classified based on its own URL. URLs are highly effective in classification. First, a URL is easily retrievable; each URL identifies exactly one webpage, and each webpage has its own URL. Second, if this method is used on its own, classifying a webpage from its URL removes the need to download the whole page. This makes it an appropriate method for classifying pages that no longer exist or cannot be downloaded, or for cases where time or space is critical, for example real-time classification.

C. Bayesian Algorithm

Bayesian inference provides a probabilistic approach to inference. It is built on the assumption that the quantities of interest follow probability distributions, and that optimal decisions can be made by reasoning about these probabilities together with observed data. As it is a quantitative method for weighing the evidence that supports different hypotheses, it is of great importance in machine learning. Bayesian inference provides a direct method for dealing with probabilities in learning algorithms, and it also creates a framework for analyzing the performance of algorithms that are not directly related to probabilities. In many cases, the problem is finding the best hypothesis in a hypothesis space H given the training data D. One way to express the best hypothesis is to say that we are looking for the most probable hypothesis given the data D together with any initial knowledge about the prior probabilities of the hypotheses in H. Bayes' theorem is a direct method for calculating these probabilities.

To define Bayes' theorem, P(h) denotes the initial probability that hypothesis h is true, before the training data are observed. P(h) is usually called the prior probability, and it expresses any prior knowledge about the chance that h is correct. If there is no such knowledge, the same probability can be assigned to every hypothesis in H. Likewise, P(D) denotes the prior probability that the data D are observed, in other words the probability of observing D with no knowledge of which hypothesis is correct. P(D|h) denotes the probability of observing D in a world where hypothesis h is true. In machine learning we look for P(h|D), i.e. the probability that hypothesis h is correct given the observed training data D. P(h|D) is called the posterior probability, as it expresses our confidence in h after D has been observed.

Bayes' theorem is the main building block of Bayesian learning, as it provides a method for calculating the posterior probability P(h|D) from P(h) along with P(D) and P(D|h):

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)} \qquad (1)$$

As expected, P(h|D) increases with P(h) and with P(D|h). It is also reasonable that P(h|D) decreases as P(D) increases, because the more probable it is that D is observed independently of h, the less evidence D provides in support of h. In many learning scenarios, the learner considers a set of hypotheses H and is interested in finding the hypothesis h ∈ H that is the most probable, or at least one of the most probable. Any such hypothesis is called a maximum a posteriori (MAP) hypothesis. The MAP hypothesis can be found by using Bayes' theorem to calculate the posterior probability of each candidate. In other words, h_MAP is the hypothesis for which

$$h_{MAP} = \arg\max_{h_j \in H} P(h_j \mid D_i) = \arg\max_{h_j \in H} \frac{P(D_i \mid h_j)\, P(h_j)}{P(D_i)} = \arg\max_{h_j \in H} P(D_i \mid h_j)\, P(h_j) \qquad (2)$$

Note that P(D_i) is dropped in the final step, as it does not depend on h and is therefore a constant. For a problem described by several feature values D_1, D_2, …, D_n, and assuming these values are conditionally independent given the hypothesis, the rule generalizes to a product over all of them:

$$h_{MAP} = \arg\max_{h_j \in H} P(h_j) \prod_{i=1}^{n} P(D_i \mid h_j) \qquad (3)$$

P(D_i|h_j) is calculated as follows:

$$P(D_i \mid h_j) = \frac{n_c + m\,p}{n + m} \qquad (4)$$

where:
n = number of examples for which h = h_j
n_c = number of examples for which D = D_i and h = h_j
p = initial (prior) estimate of P(D_i|h_j)
m = size of the sample space

D. Using Features of Neighbor Webpages

Although webpage URLs contain useful features, these features may be missing, misleading, or unrecognizable in a particular URL. For example, some pages have very short addresses or concise textual content. In such cases, it is difficult for a classifier to decide correctly from the URL features of the webpage alone. To solve the problem, features can be extracted from neighbor pages that are related to the webpage under classification. In this study, sibling pages, i.e. pages with the same parent, are used to achieve higher classification accuracy.
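
Equations (3) and (4) of Section II.C can be made concrete with a small sketch. The Python code below is illustrative only and is not the authors' implementation (the paper reports a VB.NET system). The function names, the toy training tokens, and the choices m = vocabulary size and p = 1/m are assumptions introduced here, and n is taken to be the total token count of a category. Logarithms are summed instead of multiplying probabilities, which leaves the arg max unchanged and avoids numerical underflow.

```python
from collections import Counter, defaultdict
from math import log


def m_estimate(n_c, n, p, m):
    """Equation (4): smoothed estimate of P(D_i | h_j)."""
    return (n_c + m * p) / (n + m)


def train(samples):
    """Count categories and tokens from (tokens, category) pairs."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, category in samples:
        class_counts[category] += 1
        token_counts[category].update(tokens)
        vocabulary.update(tokens)
    return class_counts, token_counts, vocabulary


def classify(tokens, class_counts, token_counts, vocabulary):
    """Equation (3): return the category maximising P(h_j) * prod P(D_i | h_j)."""
    total_pages = sum(class_counts.values())
    m = len(vocabulary)   # assumption: m is the vocabulary size
    p = 1.0 / m           # assumption: uniform prior estimate for every token
    best_category, best_score = None, float("-inf")
    for category, pages in class_counts.items():
        n = sum(token_counts[category].values())
        score = log(pages / total_pages)  # log P(h_j)
        for token in tokens:
            n_c = token_counts[category][token]  # 0 for unseen tokens
            score += log(m_estimate(n_c, n, p, m))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Toy data, invented for illustration only.
TRAINING = [
    (["edu", "univers"], "university"),
    (["shop", "cart"], "shopping"),
    (["forum", "thread"], "forums"),
]
model = train(TRAINING)
print(classify(["univers", "edu"], *model))
```

With the smoothing term m·p in the numerator, a token that never occurred in a category still receives a small non-zero probability, so a single unseen word cannot force the whole product in equation (3) to zero.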
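
Section II.B argues that a URL can be used for classification without downloading the page. A minimal sketch of how tokens might be pulled from a URL is given below, assuming Python's standard `urllib.parse` module. The function name `url_tokens` and the default stopword set are hypothetical; the three-character minimum follows the filtering rule the paper states for its pre-processing step, and stemming is left out.

```python
import re
from urllib.parse import urlparse


def url_tokens(url, stopwords=frozenset({"www", "http", "https"})):
    """Split a URL into lowercase alphabetic tokens without fetching the page."""
    if "://" not in url:
        url = "http://" + url  # urlparse only fills netloc when a scheme is present
    parts = urlparse(url)
    raw = re.split(r"[^A-Za-z]+", parts.netloc + "/" + parts.path)
    # Drop digits (removed by the split), short tokens, and stopwords.
    return [t.lower() for t in raw
            if len(t) >= 3 and t.lower() not in stopwords]


print(url_tokens("http://www.washington.edu/news/nytimes"))
# ['washington', 'edu', 'news', 'nytimes']
```

Note that the length filter discards two-letter parts such as "ac" and "ir", so a suffix like .ac.ir would have to be examined separately from these tokens.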
III. Proposed Algorithm

A. Pre-processing of Web Content

In most studies, pre-processing is performed before the web content is fed to the classifier. In the proposed method, only the URL address and the page Title are required by the classification algorithm. First, the URL address and page Title are pre-processed: words shorter than 3 characters, numbers, and the conjunctions and prepositions stored in a table called Stopwords are removed. Using the Porter Stemmer [13], the remaining useful words are reduced to their stems (to avoid data redundancy), and then they are stored in the relevant table of the program's data bank, each with its frequency as its value.

B. Using Extracted Features from Webpage URL

1) First Feature

URLs have features that are well suited to classification. Two sets of features that are easily extracted from URLs are Postfix and Directory. The general format of a Postfix is usually Abbreviation or Abbreviation.Abbreviation. For example, .edu, .ac.ir, or .ac.uk indicate pages related to universities or academic websites. The general format of a Directory feature is mostly a Word (or Abbreviation) followed by a slash. For example, a directory named FAQ or Forum, appearing as /Faq/ or /Forum/, can indicate the relation between the current page and the relevant class. These features are listed in Table 1.

Table 1: Features of URL Addresses

URL feature                  Specification                                                 Number
Postfix: .edu, .ac.ir        To find class of pages based on Stem URL Address              1
Directory: Forums/, Faq/     To find class of pages based on rest of Stem URL Address      2

2) Second Feature

The second important feature of a webpage URL concerns domains that the system has already observed and whose correct class has been identified. Once the class of a webpage has been determined at the last step and confirmed by the user, the page's address is registered in the URL table along with the correct class. Afterwards, if a webpage with the same domain is given to the system, the system can recognize it more simply and quickly from the similarity of the addresses. Take for example the address www.aut.ac.ir/sites/e-shopping/raja.ir, whose class is recognized as shopping; it is therefore registered under the shopping class in the URL table. If a webpage such as www.aut.ac.ir/sites/e-shopping/iranair.ir is then given to the system as a test, the presence of the domain www.aut.ac.ir/sites/e-shopping/ in the table lets the system quickly and simply assign the shopping class to the second address. It should be noted that using domain similarity of webpage URLs is a new idea which has not previously been taken into account in webpage classification.

3) Third Feature

To enhance the efficiency of the proposed algorithm, another method based on the URL address can also be employed. This method involves expanding the URL of a webpage to make better use of its existing elements. For example, considering http://www.washington.edu/news/nytimes together with the page title, i.e. NewYorkTimes, a human easily understands that the address refers to the NewYorkTimes news database. The machine, however, misses the point. To solve the problem and bring human-like guessing into the procedure, a likelihood function is presented; it is based on the function used in [1], with some changes in definition and usage. Table 2 illustrates the method. The likelihood function compares a URL token, letter by letter, with a similar word in the page Title.

Table 2. Likelihood Function

Number  Rank  Condition
1       2     First letter of URL token is the same as first letter of similar word at page Title
2       1     First letter of URL token is the same as another letter of similar word at page Title
3       1     A letter of URL token is the same as a letter of similar word at page Title
4       2     Last letter of URL token is the same as last letter of similar word at page Title
5       0     Ignoring a character of URL token

For example, consider the news website Newyorktimes with the address http://www.nytimes.com, and take the title "The New York Times-Breaking News, World News & Multimedia". For the URL token Nytimes and the similar word Newyorktimes in the page Title, the function calculates the likelihood as below:

(Condition 1)N (Condition 5)E (Condition 5)W
(Condition 1)Y (Condition 5)O (Condition 5)R
(Condition 5)K (Condition 1)T (Condition 3)I
(Condition 3)M (Condition 3)E (Condition 4)S

In this case, the word Newyorktimes receives a score of 11 (three letters under Condition 1, three under Condition 3, and one under Condition 4, i.e. 3×2 + 3×1 + 1×2 = 11), which is the highest score among the words in the page Title, and according to the likelihood
function, if Nytimes is seen in future test samples, it is replaced with Newyorktimes. To work with the third feature, the system runs the likelihood function on the tokens of the URL and the page Title. If a token in the URL can be expanded, the expanded token is used for classification instead of the original one.

C. Webpage Classification

1) Training Step

In this step, about 2000 pages from the considered classes of the ODP database are used for training the system. The URL address and Title of each page, along with its category, are fed to the system. Useful words are extracted from the considered features of the page URL and stored in the table of the program's information bank that belongs to the page's class, along with the frequency of each word in each feature as the value of that word in that feature. Information about the pages in each category is thus stored in the table related to that category. This completes the training of the system and the feeding of useful information.

2) Test Step

In this step, the system is tested on a number of pages different from those used in the Training step. The URL address and page Title are given to the system. Using the first feature, the system checks whether the category can be recognized from the address by examining the Postfix and Directory features. If it fails to find a previously observed Postfix or Directory, it turns to the second feature and compares the page address with all addresses observed so far. Clearly, the accuracy of this step increases considerably as the number of learned pages increases. If the address of the page is not similar to any address in the bank, the algorithm examines the third feature. In other words, the classification system gives first priority to the first feature, followed by the second feature and finally the third one. To analyze the third feature, the system runs the likelihood function on the tokens in the URL and page Title. If a token in the URL can be expanded, the expanded expression is used for classification instead of the original token. Afterwards, the frequency of each word is calculated as the frequency feature. Then, using Bayes' rule, the posterior probability P(h|D) is calculated from the prior probability P(h) along with P(D) and P(D|h), according to the frequency of each word on the page, in order to predict how likely it is that the page belongs to the category in which the largest number of its features occur. After the probability has been calculated for all extracted words of a page for all four categories, the category with the highest probability over all words is recognized as the main category and is announced to the user.

3) Neighbor Effect Step

Because most addresses and many webpage Titles are very short, classification based only on features extracted from the webpage itself reaches 72.8% accuracy, as shown in Table 3. To improve classification accuracy, information extracted from neighbor pages is also used: the URL addresses and Titles of 500 neighbor pages related to the Test step pages are employed. As illustrated in Table 3, the average accuracy rises to 85.8%, which implies that the emphasis on neighbor pages is helpful. As described earlier, the system first analyses the page using the first feature; if that fails it turns to the second feature, and ultimately the third feature is used for classification. The average accuracy of the algorithm is shown in Table 3, where the result of combining the former features with sibling page features, and their influence, can be observed.

Table 3. Results of Algorithm Accuracy based on different Features

Feature                                                          Shopping  University  Forums  FAQ     F-measure
Based on feature 1                                               48.4%     65.8%       61.4%   63.2%   59.7%
Based on feature 2                                               76.2%     74.5%       70.7%   69.8%   72.8%
Based on feature 3                                               66.3%     61.5%       68.4%   64.6%   65.2%
Based on combining former features and sibling pages features    84.5%     85.6%       84.7%   88.4%   85.8%

Figure 1 illustrates the results of algorithm evaluation for different categories and based on different features.

Figure 1. Results of Algorithm Evaluation for different Categories and Features
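
The second feature of Section III.B, a table of confirmed addresses that is consulted by domain similarity, can be sketched as follows. This is a minimal illustration and not the authors' code: the names `domain_prefix` and `UrlTable` are hypothetical, and the rule that the "domain" is the address with its last path segment removed is an assumption inferred from the www.aut.ac.ir/sites/e-shopping/ example in the paper.

```python
def domain_prefix(url):
    """Drop the scheme and the last path segment, keeping a trailing slash."""
    address = url.split("://", 1)[-1]
    head, sep, _ = address.rpartition("/")
    return head + "/" if sep else address + "/"


class UrlTable:
    """Maps an observed domain prefix to the class confirmed by the user."""

    def __init__(self):
        self._classes = {}

    def register(self, url, category):
        self._classes[domain_prefix(url)] = category

    def lookup(self, url):
        # Returns None when the domain has not been observed yet, in which
        # case the classifier would move on to the third feature.
        return self._classes.get(domain_prefix(url))


table = UrlTable()
table.register("www.aut.ac.ir/sites/e-shopping/raja.ir", "shopping")
print(table.lookup("www.aut.ac.ir/sites/e-shopping/iranair.ir"))  # shopping
```

An exact match on the prefix keeps the lookup to a single dictionary access; a looser notion of address similarity would need a different data structure.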
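
The evaluation criteria that the paper defines in equations (5) to (7) of Section IV, and the F-measure column of Table 3, can be checked with a short sketch. The function names and the counts a = 8, b = 2, c = 2 are hypothetical values chosen for illustration; the percentages are copied from Table 3. The code does not guard against zero denominators.

```python
def precision(a, b):
    """Equation (5): a correct pages, b wrongly classified pages."""
    return a / (a + b)


def recall(a, c):
    """Equation (6): a correct pages, c wrongly ignored pages."""
    return a / (a + c)


def f_measure(a, b, c):
    """Equation (7): harmonic mean of precision and recall."""
    p, r = precision(a, b), recall(a, c)
    return 2 * r * p / (r + p)


# Category values from Table 3, in the order Shopping, University, Forums, FAQ.
TABLE3 = {
    "feature 1": [48.4, 65.8, 61.4, 63.2],
    "feature 2": [76.2, 74.5, 70.7, 69.8],
    "feature 3": [66.3, 61.5, 68.4, 64.6],
    "combined with sibling pages": [84.5, 85.6, 84.7, 88.4],
}


def mean(values):
    return sum(values) / len(values)


print(round(f_measure(8, 2, 2), 3))
for name, values in TABLE3.items():
    print(name, round(mean(values), 1))
```

The unweighted means of the four category values come out as 59.7, 72.8, 65.2 and 85.8, which agrees with the last column of Table 3.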
IV. Evaluation of Proposed Algorithm

In this study, 500 different pages, covering the training and test steps and downloaded from ODP categories such as universities, shopping, forums, and FAQ, have been employed to evaluate the proposed algorithm. The algorithm was implemented in the VB.NET 2005 programming environment with a SQL Server 2000 database, on a computer with a 2.13 GHz Core 2 CPU, 2 GB of RAM, and a 300 GB hard disk, running Windows XP Service Pack 3. After the downloaded pages are pre-processed and useful features are extracted from the URL address and Title of each page, the probability that the webpage belongs to each category is calculated using the Bayesian algorithm, and then recalculated based on the information of neighbor pages; the category with the highest probability is recognized as the main category of the page and is announced to the user. If the category is recognized correctly, all the extracted features are added to the corresponding table of the program's information bank, and the page is counted as a correct recognition, i.e. parameter a of the accuracy criterion is incremented. Otherwise, the correct category is specified by the user, and the information extracted from the page is added to the correct table. Furthermore, one case is added to the wrongly recognized cases of the first category, i.e. parameter b of the accuracy criterion, and one case is added to the cases wrongly ignored in the correct category, i.e. parameter c. The accuracy of the algorithm is evaluated with the following criteria:

$$\text{Precision} = \frac{a}{a+b} \qquad (5)$$

$$\text{Recall} = \frac{a}{a+c} \qquad (6)$$

$$\text{F-measure} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \qquad (7)$$

where a = number of sample pages that are correctly classified, b = number of sample pages that are wrongly classified, and c = number of sample pages that are wrongly ignored.

In this study, the proposed algorithm has been evaluated in four phases. In the first phase, running the algorithm on the first feature of the webpage URL alone, with no reference to the other URL features, gave an F-measure of 59.7%. In the second phase, using the second selected feature raised the accuracy to 72.8%. The second feature, i.e. domain similarity in the webpage address, is a new feature not employed in similar studies, and the higher accuracy is evidence of its value. The third phase evaluates the use of the likelihood function in classification; compared with [1], which shows 43% accuracy with the same feature, it reached 65% accuracy thanks to its combination with the Bayesian algorithm, which suggests that this approach merits further work. The fourth phase evaluates the combination of all three proposed URL features together with the features of neighbor pages, which reached 85.8% accuracy. Comparing this accuracy with [5], which reached 80%, or [4], which topped out at 63% with a Bayesian implementation, indicates that the proposed method is more efficient and promising. The algorithm's average accuracy (F-measure) for the different categories is shown in Figure 2.

Figure 2. Algorithm Average Accuracy based on different Features

V. Conclusion and Future Works

In this study, a new method for webpage classification based on URL features is presented. The method combines three different features of the page URL with information from neighbor pages to achieve higher classification accuracy. The proposed algorithm ultimately achieved 85.8% accuracy, and it can be enhanced further by combining it with methods based on the HTML features of webpages.

VI. References

[1] M. Kan, 2004, "Web page categorization without the web page", Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, ACM, pp. 262-263.
[2] L. K. Shih, D. R. Karger, 2004, "Using URLs and Table Layout for Web Classification Task", Proceedings of the 13th International Conference on World Wide Web, ACM, pp. 193-202.
[3] X. Qi, B. D. Davison, 2008, "Classifiers without borders: incorporating fielded text from neighboring webpages", Proceedings of the 31st Annual International ACM SIGIR Conference on Research & Development on Information Retrieval, Singapore, July 2008, pp. 643-650.
[4] S. Morales, H. Fandino, J. Rodriguez, 2009, "Hypertext Classification to filtrate information on the web", Proceedings of the 2009 Euro American
Conference on Telematics and Information Systems: New Opportunities to increase Digital Citizenship, Prague, Czech Republic, June 3-5, 2009, Article No. 1.
[5] Ch. Lindemann, L. Littig, 2006, "Coarse-grained classification of websites by their structural properties", Proceedings of the 8th Annual ACM International Workshop on Web Information and Data Management, Arlington, Virginia, USA, pp. 35-42.
[6] F. Sebastiani, 2002, "Machine learning in automated text categorization", ACM Computing Surveys (CSUR), Volume 34, Issue 1 (March 2002), pp. 1-47.
[7] S. Chakrabarti, 2003, "Mining the Web: Discovering Knowledge from Hypertext Data", Morgan-Kaufmann Publishers, San Francisco, CA.
[8] D. Mladenic, 1999, "Text-learning and related intelligent agents: A survey", Intelligent Systems and their Applications, IEEE, Jul/Aug 1999, Volume 14, Issue 4, pp. 44-54.
[9] L. Getoor, C. Diehl, 2005, "Link mining: A survey", ACM SIGKDD Explorations Newsletter (Special Issue on Link Mining), December 2005, Volume 7, Issue 2, pp. 3-12.
[10] J. Fürnkranz, 2005, "Web mining", in The Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., Springer, Berlin, Germany, pp. 899-920.
[11] B. Choi, Z. Yao, 2005, "Web page classification", in Foundations and Advances in Data Mining, W. Chu and T. Y. Lin, Eds., Studies in Fuzziness and Soft Computing, vol. 180, Springer-Verlag, Berlin, Germany, pp. 221-274.
[12] www.dmoz.org
[13] http://tartarus.org/~martin/PorterStemmer/