(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 2, 2010




         Webpage Classification based on URL Features and Features of Sibling Pages

Sara Meshkizadeh, Department of Computer Engineering, Science and Research Branch, Islamic Azad University (IAU), Khouzestan, Iran (Sara_meshkizadeh@yahoo.com)
Dr. Amir Masoud Rahmani, Department of Computer Engineering, Science and Research Branch, Islamic Azad University (IAU), Tehran, Iran (rahmani@srbiau.ac.ir)
Dr. Mashallah Abbasi Dezfuli, Department of Computer Engineering, Science and Research Branch, Islamic Azad University (IAU), Khouzestan, Iran (Abbasi_masha@yahoo.com)



Abstract: Webpage classification plays an important role in information organization and retrieval. It involves the assignment of a webpage to one or more predetermined categories. The uncontrolled nature of web content means that webpage classification requires more work than traditional text classification. The interconnected nature of hypertext, however, carries features which contribute to the process, for example the URL features of a webpage. This study illustrates that by using such features, together with the features of sibling pages (pages sharing the same parent) and a Bayesian algorithm for combining the results of these features, it is possible to improve the accuracy of webpage classification.

    Keywords: classification, hypertext, URL, sibling pages, Bayesian algorithm

                 I.        INTRODUCTION

Traditional classification is supervised: a set of labeled data is used to train a classifier which is then employed for future classification. Webpage classification differs from traditional classification in a number of respects. First, traditional text classification is usually performed on structured documents written in a fixed style (e.g. news articles), whereas webpage content is far less regular. Second, web pages are documents with an HTML structure which is rendered visually for the user. Classification plays a significant role in information management and retrieval tasks. On the web, webpage classification is essential for focused crawling, web directories, topic-specific web links, contextual advertising, and topical structure analysis, and it can be of great help in increasing the quality of web search. Using URL-related information, [1] achieved 48% classification accuracy. [2] uses URL information as well as webpage images to provide a classification with a tree structure. [3] combines parent-page information with HTML features to arrive at a classification. [4] reaches 80% accuracy through a combination of web-mining information: combining some HTML and URL features, nine features in total, classification is performed with 80% accuracy. This study introduces a novel method for webpage classification which enhances existing algorithms. The method combines three different features of the webpage URL, and it also employs information from neighboring pages with the same parent, i.e. sibling pages, to increase classification accuracy.

The rest of this paper is organized as follows: Section 2 describes related concepts, Section 3 introduces the proposed algorithm in detail, Section 4 evaluates the proposed algorithm, and Section 5 provides conclusions and directions for future work.

                 II.       RELATED CONCEPTS

A.    Selecting a Database

As webpage classification is usually treated as supervised learning, classified samples are needed for training, and further samples are required to test the classifier for evaluation. Manual labeling requires considerable human effort, so freely available web directories are often used in research. One of the most widely used is ODP [12], in which 4,519,050 websites are classified by 84,430 editors. This research has been performed on the universities, shopping, forums, and FAQ (frequently asked questions) categories of ODP.

B.    Feature Extraction from the Webpage URL

The textual content is the most important feature available on a webpage. However, given the diverse range of noise on webpages, using a bag of words directly may not lead to higher efficiency.



                                                               168                            http://sites.google.com/site/ijcsis/
                                                                                              ISSN 1947-5500



Researchers have proposed a number of methods to make better use of textual characteristics; feature selection is a popular one. In addition to the features of HTML tags, a webpage can be classified based on its own URL. URLs are highly effective in classification. First, a URL is easily retrievable, each URL is limited to one webpage, and each webpage has one specific URL. Second, if this method is employed on its own, classifying a webpage by its URL avoids downloading the whole page. This makes it an appropriate method for classifying pages which no longer exist, whose download is impossible, or for which time or space is critical, for example in real-time classification.

C.    Bayesian Algorithm

Bayesian inference provides a probabilistic method for inference. It is built on the hypothesis that the values under consideration follow a probability distribution, and that optimal decisions can be made by reasoning about these probabilities together with the observed data. As a quantitative method for weighing the evidence supporting different hypotheses, it is of great importance in machine learning. Bayesian inference provides a direct method of dealing with probabilities for learning algorithms, and it also creates a framework for analyzing the performance of algorithms which are not directly related to probabilities. In many cases, the problem is finding the best hypothesis in a hypothesis space H given the available learning data D. One way to express the best hypothesis is to say that we are looking for the most probable hypothesis given the data D in addition to any initial knowledge about the prior probabilities of the hypotheses in H. Bayes' theorem provides a direct method for calculating such probabilities.

To state Bayes' theorem, P(h) denotes the initial probability that hypothesis h is true, before the learning data are observed. P(h) is usually called the prior probability, and it expresses any prior knowledge about the chance that hypothesis h is correct. If there is no prior knowledge about the hypotheses, the same probability can be assigned to every hypothesis in H. Likewise, P(D) denotes the prior probability that data D are observed, i.e. the probability of observing D when nothing is known about which hypothesis holds. P(D|h) denotes the probability of observing data D in a world where hypothesis h is true. In machine learning we look for P(h|D), i.e. the probability that hypothesis h is correct given that the learning data D have been observed. P(h|D) is called the posterior probability, as it expresses our confidence in hypothesis h after observing the data D.

Bayes' theorem is the main building block of Bayesian learning, as it provides a method for calculating the posterior probability P(h|D) from P(h) along with P(D) and P(D|h):

   P(h|D) = P(D|h) P(h) / P(D)                                  (1)

As expected, P(h|D) increases with P(h) and with P(D|h). It is also reasonable that P(h|D) decreases as P(D) increases: the more probable the observation of D independently of h, the less evidence D provides in support of h. In many learning scenarios, the learner considers a set of hypotheses H and is interested in finding the most probable hypothesis h ∈ H, or at least one of the most probable ones. Any hypothesis with this property is called a maximum a posteriori (MAP) hypothesis. Using Bayes' theorem, the MAP hypothesis is found by calculating the posterior probability of each candidate. In other words, hMAP is the hypothesis for which

   hMAP = argmax hj ∈ H  P(hj | D)
        = argmax hj ∈ H  P(D | hj) P(hj) / P(D)                 (2)
        = argmax hj ∈ H  P(D | hj) P(hj)

Note that P(D) is dropped in the final step: it is independent of hj, so it is a constant. The theorem can be generalized over the different features of a problem; for the feature values D1, D2, ..., Dn (assuming the features are conditionally independent):

   h = argmax hj ∈ H  P(hj) Π P(Di | hj)                        (3)

where P(Di | hj) is estimated as follows:

   P(Di | hj) = (nc + m p) / (n + m)                            (4)

where:
   n  = number of training examples for which h = hj
   nc = number of examples for which D = Di and h = hj
   p  = prior estimate of P(Di | hj)
   m  = equivalent sample size

D.    Using Features of Neighboring Webpages

Although the URL addresses of webpages contain useful features, in a particular URL those features may be missing, misleading, or unrecognizable for various reasons. For example, some pages have very short addresses or little textual content. In such cases it is difficult for a classifier to make an accurate decision from the URL features of the webpage alone. To solve this problem, the same features can be extracted from the neighboring pages related to the webpage under classification. In this study, pages with the same parent, i.e. sibling pages, are employed to achieve higher classification accuracy.
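To make Equations (2)-(4) concrete, the following sketch shows MAP classification over word features with the m-estimate. The paper's implementation is in VB.NET; this Python fragment is only an illustration, and the choices p = 1/|V| and m = |V| (Laplace-style smoothing) are assumptions, not values specified in the paper.

```python
import math
from collections import Counter

def train(documents):
    """documents: list of (tokens, label) pairs. Returns per-class word
    counts, per-class token totals, per-class document counts, and the
    vocabulary, for use with the m-estimate of Eq. (4)."""
    word_counts = {}          # class -> Counter of token frequencies
    class_totals = Counter()  # class -> total number of tokens (n in Eq. 4)
    class_docs = Counter()    # class -> number of documents (for the prior)
    for tokens, label in documents:
        word_counts.setdefault(label, Counter()).update(tokens)
        class_totals[label] += len(tokens)
        class_docs[label] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_totals, class_docs, vocab

def classify(tokens, word_counts, class_totals, class_docs, vocab):
    """MAP classification (Eq. 2-3) with m-estimate smoothing (Eq. 4):
    P(Di|hj) = (nc + m*p) / (n + m), here with p = 1/|V| and m = |V|.
    Log-probabilities are summed to avoid numeric underflow."""
    total_docs = sum(class_docs.values())
    m = len(vocab)
    p = 1.0 / m
    best, best_score = None, float("-inf")
    for label in word_counts:
        score = math.log(class_docs[label] / total_docs)  # log P(hj)
        n = class_totals[label]
        for w in tokens:
            nc = word_counts[label][w]
            score += math.log((nc + m * p) / (n + m))     # log P(Di|hj)
        if score > best_score:
            best, best_score = label, score
    return best
```

The m-estimate keeps P(Di|hj) nonzero for words never seen in a class, which is why the product in Eq. (3) does not collapse to zero on unseen tokens.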







                 III.      Proposed Algorithm

A.    Pre-processing of Web Content

In most studies, pre-processing is performed before the web content is fed to the classifier. In the method proposed here, only the URL address and the page Title are required to implement the classification algorithm. First, the URL address and page Title are pre-processed: words shorter than 3 characters, numbers, and the conjunctions and prepositions stored in a table called Stopwords are removed from the content. Using the Porter Stemmer function [13], the useful words are reduced to their stems (to avoid data redundancy), and they are then stored in the relevant table of the program's data bank together with a value recording the frequency of each word.

B.    Using Features Extracted from the Webpage URL

1) First Feature

URLs have features which are well suited to classification. Two sets of features that are easily extracted from URLs are the Postfix and the Directory. The general format of a postfix is usually Abbreviation or Abbreviation.Abbreviation; for example, .edu, .ac.ir, or .ac.uk indicate pages related to universities or academic websites. The general format of a Directory feature is mostly Word(Abbreviation) followed by a slash; for example, a directory named FAQ or Forum, represented as /Faq/ or /Forum/, can indicate the relation between the current page and the relevant class. These features are listed in Table 1.

           Table 1: Features of URL Addresses

   Number   URL feature                   Specification
   1        Postfix: .edu, .ac.ir        To find the class of a page from the
                                         stem of its URL address
   2        Directory: /Forums/, /Faq/   To find the class of a page from the
                                         rest of the stem of its URL address

2) Second Feature

The second important feature of a webpage URL concerns domains which have already been observed by the system and whose correct class has been identified. Once the class of a webpage has been determined in the previous step and its accuracy confirmed by the user, the page's address is registered in the URL table along with the correct class. Afterwards, if a webpage with the same domain is given to the system, the system can recognize its class more simply and quickly from the similarity of the addresses. Take for example the address www.aut.ac.ir/sites/e-shopping/raja.ir, whose domain is recognized as shopping and is therefore registered under the shopping class in the URL table. If a webpage such as www.aut.ac.ir/sites/e-shopping/iranair.ir is later given to the system as a test, then, because the domain www.aut.ac.ir/sites/e-shopping/ is present in the table, the system quickly and simply assigns the shopping class to the second address. It should be noted that using the domain similarity of webpage URLs is a new idea which has not previously been exploited in webpage classification.

3) Third Feature

To enhance the efficiency of the proposed algorithm, another method based on the URL address can also be employed. This method involves expanding the URL of a webpage in order to make better use of its elements. For example, given http://www.washington.edu/news/nytimes and the page title, i.e. NewYorkTimes, a human easily understands that the address refers to the NewYorkTimes news database; the machine, however, misses the point. To solve this problem and bring such human guesses into the procedure, a likelihood function is presented, built on the function used in [1] but with some changes in definition and usage. Table 2 illustrates the method. The likelihood function compares a URL token, letter by letter, with the similar word in the page Title.

                Table 2. Likelihood Function

   Number   Condition                                                  Rank
   1        First letter of the URL token is the same as the first       2
            letter of the similar word in the page Title
   2        First letter of the URL token is the same as another         1
            letter of the similar word in the page Title
   3        A letter of the URL token is the same as a letter of         1
            the similar word in the page Title
   4        Last letter of the URL token is the same as the last         2
            letter of the similar word in the page Title
   5        A character of the URL token is ignored                      0

For example, consider the news website NewYorkTimes with the address http://www.nytimes.com and the title "The New York Times - Breaking News, World News & Multimedia". The function above computes the likelihood between the URL token Nytimes and the similar word Newyorktimes in the page Title as follows:

   N(Condition 1) E(Condition 5) W(Condition 5) Y(Condition 1)
   O(Condition 5) R(Condition 5) K(Condition 5) T(Condition 1)
   I(Condition 3) M(Condition 3) E(Condition 3) S(Condition 4)

In this case the word Newyorktimes receives a score of 11 from the words in the page Title, which is the highest score among the tokens; according to the likelihood function, if Nytimes is later seen in a test sample, it is therefore replaced with Newyorktimes.
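The pre-processing of Section III.A (short-word, number, and stopword removal, stemming, frequency counting) might be sketched as follows. The stopword list and the crude suffix stripper, a stand-in for the Porter stemmer [13], are illustrative assumptions, not the paper's actual tables or stemmer.

```python
import re
from collections import Counter

# Illustrative stopword set; the paper stores these in a Stopwords table.
STOPWORDS = {"the", "and", "for", "with", "from"}

def crude_stem(word):
    """Very rough suffix stripping; a real implementation would call the
    Porter stemmer instead of this placeholder."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(url, title):
    """Tokenize the URL and Title, drop words shorter than 3 characters,
    numbers, and stopwords, stem the survivors, and count frequencies."""
    tokens = re.split(r"[^A-Za-z0-9]+", (url + " " + title).lower())
    kept = [t for t in tokens
            if len(t) >= 3 and not t.isdigit() and t not in STOPWORDS]
    return Counter(crude_stem(t) for t in kept)
```

The resulting Counter plays the role of the word/frequency table stored in the program's data bank.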

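The second feature's URL table (Section III.B.2) amounts to a prefix lookup over previously confirmed addresses. A minimal sketch follows, using an in-memory dictionary in place of the paper's database table; the helper names are hypothetical.

```python
# Maps a confirmed URL domain prefix to its class, as the URL table is
# built up once the user confirms a classification.
url_table = {}

def register(url, category):
    """Store the directory part of a confirmed URL with its class."""
    prefix = url.rsplit("/", 1)[0] + "/"   # drop the last path segment
    url_table[prefix] = category

def lookup(url):
    """Return the class of the longest registered prefix of url, if any."""
    best = None
    for prefix, category in url_table.items():
        if url.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, category)
    return best[1] if best else None
```

With the paper's example, registering www.aut.ac.ir/sites/e-shopping/raja.ir as shopping lets a later lookup of www.aut.ac.ir/sites/e-shopping/iranair.ir resolve to shopping without downloading the page.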

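The likelihood function of Table 2 can be sketched as a greedy letter-by-letter alignment of a URL token against a Title word. This is one plausible reading of the description; the paper's own worked example assigns Condition 1 to some interior matches, so the scores below are not guaranteed to reproduce the score of 11 reported for Nytimes against Newyorktimes.

```python
def likelihood(token, word):
    """Score how well a URL token matches a Title word using the ranks of
    Table 2: matching first letters score 2 (Condition 1); the token's
    first letter matching a later letter of the word scores 1
    (Condition 2); matching last letters score 2 (Condition 4); any other
    match scores 1 (Condition 3); skipped letters score 0 (Condition 5).
    A greedy left-to-right alignment is assumed."""
    token, word = token.lower(), word.lower()
    score, i = 0, 0  # i indexes the next unmatched letter of the token
    for j, letter in enumerate(word):
        if i < len(token) and token[i] == letter:
            if i == 0 and j == 0:
                score += 2          # Condition 1
            elif i == 0:
                score += 1          # Condition 2
            elif i == len(token) - 1 and j == len(word) - 1:
                score += 2          # Condition 4
            else:
                score += 1          # Condition 3
            i += 1
        # else: Condition 5, the letter is ignored and scores 0
    return score
```

The Title word with the highest score would then replace the URL token before classification, as described for the third feature.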





To work with the third feature, the system runs the likelihood function on the tokens of the URL and the page Title. If a token in the URL can be expanded, the expanded token is used for classification instead of the original one.

C.    Webpage Classification

1) Training Step

In this step, about 2000 pages from the considered classes of the ODP database are used for system training. The webpage URL address and the page Title, together with the category, are fed to the system. Useful words from the considered features of the page URL are extracted and stored in the table of the program's information bank corresponding to the considered class, along with the frequency of each word in each feature as the value of that word in that feature. In this step, the information of the pages of each category is stored in the table related to that category. Up to this point, we have been training the system and feeding it useful information.

2) Test Step

In this step, the system is tested on a number of pages different from those of the Training step. The URL address and the page Title are given to the system. Based on the first feature, the system analyses the possibility of recognizing the category from the address by checking the Postfix and Directory features. If it fails to find a previously observed Postfix or Directory, it moves on to the second feature and compares the page address with all addresses observed so far; clearly, the accuracy of this step increases considerably as the number of learned pages grows. If the address of the page is not similar to any existing address in the bank, the algorithm examines the third feature. The first priority of the classification system is thus the first feature, followed by the second feature, and finally the third. To analyze the third feature, the system applies the likelihood function to the tokens of the URL and the page Title; if the tokens in the URL can be expanded, the expanded expressions are used for classification instead of the original tokens. Afterwards, the frequency of each word is calculated as the frequency feature. Then, using Bayes' rule, the posterior probability P(h|D) is calculated from the prior probability P(h) along with P(D) and P(D|h), according to the frequency of each word on the page, in order to predict the category to which the page most probably belongs, i.e. the one in which the largest number of its features occur. After calculating this probability over all extracted words of a page for all four categories, the category with the highest probability over all words is recognized as the main category and is announced to the user.

3) Neighbor Effect Step

Because most addresses, and the Titles of many webpages, are concise, classification based only on features extracted from the webpage itself achieved 72.8% accuracy, as shown in Table 3. To improve classification accuracy, it was decided to use information extracted from neighboring pages. To this end, the URL address and Title of 500 neighbor pages related to the Test-step pages were employed. As illustrated in Table 3, the average accuracy rose to 85.8%, which shows that the emphasis on neighbor pages is helpful. As described earlier, the system first analyses the first feature; if that fails, it refers to the second feature, and ultimately the third feature is used for classification. The average accuracies of the algorithm are shown in Table 3, where the result of combining the former features with the sibling-page features, and their influence, can be observed.

   Table 3. Results of Algorithm Accuracy based on Different Features

   Feature                        Shopping   University   Forums    FAQ       F-measure
   Based on feature 1             48.4%      65.8%        61.4%     63.2%     59.7%
   Based on feature 2             76.2%      74.5%        70.7%     69.8%     72.8%
   Based on feature 3             66.3%      61.5%        68.4%     64.6%     65.2%
   Combining the former
   features with sibling-page     84.5%      85.6%        84.7%     88.4%     85.8%
   features

Figure 1 illustrates the results of the algorithm evaluation for the different categories, based on the different features.

      Figure 1. Results of Algorithm Evaluation for different Categories and Features
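The three-feature cascade of the Test step can be outlined as follows. All parameter names and helper callables are hypothetical stand-ins for the paper's tables and Bayesian classifier, not functions from the actual VB.NET implementation.

```python
def classify_page(url, title, postfix_table, domain_table, bayes_model):
    """Cascade of Section III.C.2: try feature 1 (Postfix/Directory),
    then feature 2 (previously confirmed domains), then feature 3
    (likelihood-expanded tokens fed to Bayes' rule).
    postfix_table: marker substring -> class; domain_table: confirmed
    domain prefix -> class; bayes_model: callable(url, title) -> class."""
    # Feature 1: known postfix (.edu, .ac.ir, ...) or directory (/Faq/, ...)
    for marker, category in postfix_table.items():
        if marker in url.lower():
            return category
    # Feature 2: longest previously confirmed domain prefix
    for prefix, category in sorted(domain_table.items(),
                                   key=lambda kv: -len(kv[0])):
        if url.startswith(prefix):
            return category
    # Feature 3: fall back to the Bayesian classifier over the word
    # frequencies of the (likelihood-expanded) URL tokens and Title.
    return bayes_model(url, title)
```

The ordering encodes the stated priority: the cheap, deterministic features are tried first, and the probabilistic classifier is only consulted when both fail.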







        IV.       Evaluation of Proposed Algorithm                  its combination with Bayesian algorithm, and
                                                                    implying that this approach waits for further work.
In this study, to evaluate the proposed algorithm, 500              The fourth phase is the evaluation of combination of
different pages with training and test steps, and                   all three proposed features related to URL and using
categories downloaded from ODP such as universities,                the features of neighbor pages in classification which
shopping, forums, and FAQ have been employed. To                    arrived at 85.8% accuracy in classification.
implement the algorithm, Vb.net 2005 programming                    Comparing this accuracy with [5] that reached 80%,
environment as well as Sqlserver 2000 database on a                 or [4] that toped at 63% through Bayesian
computer with specifications cpu core 2,2.13 GHz,                   implementation, it is approved that the proposed
and 2 GB Ram, and 300 GB Hard disk which ran the                    method is more efficient and promising. The
operating system Windows XP service pack 3 is used.                 algorithm's average accuracy, F-measure is shown in
After pre-processing downloaded pages and extraction                different categories in Figure 2.
of useful features from URL address and Title of
pages, and calculating the probability of belonging of
the webpage to each one of categories using Bayesian
algorithm, and then calculating the above probability
based on neighbor pages information, the category
with the highest probability is recognized as main
page category and it is announced to the user. If the
category is recognized correctly, all its extracted
features are added to the table of program's
information bank, and is added as a correct case to the
statistics of correct recognition, parameter a from
calculation criterion of algorithm's general accuracy is
added too. Otherwise, the correct category is
recognized by the user, and the extracted information
is added from the page to the correct table.
Furthermore, one case to wrongly recognized cases                    Figure 2. Algorithm Average Accuracy based on different
from the first category, i.e. parameter b from                                              Features
calculation criterion of algorithm's general accuracy,
                                                                               V.        Conclusion and Future Works
and also one case to cases which is wrongly ignored in
the correct category, i.e. parameter c from calculation             In this study, based on URL features, a new method is
criterion of algorithm's general accuracy are added. To             presented for webpage classification. This method
recognize the algorithm accuracy, algorithm accuracy                involves combining three different features from
evaluation criterion has been used as follows:                      pages URL as well as information of neighbor page
                                                                    information for higher accuracy in classification. The
                Precision= a / (a+b) (5)                            proposed algorithm ultimately achieved 85.8%
                 Recall= a / (a+c) (6)                              accuracy, and it can be enhanced through combining
F-measure = 2 *Recall *Precision / (Recall + Precision) (7)
                                                                    this method with the methods based on HTML
                                                                    features of webpage.
where a= number of pages of sample examples which
are correctly classified, b= number of pages of sample
examples which are wrongly classifies, and c= number                                   VI.          Refrences
of sample examples which are wrongly ignored.                        [1]- M.kan,2004,”Web page categorization without
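Equations (5)-(7) can be computed directly from the three counters. A minimal sketch (the function name and tuple return are illustrative):

```python
def evaluate(a, b, c):
    """Compute equations (5)-(7) from the accumulated counters.

    a: sample pages correctly classified
    b: sample pages wrongly classified
    c: sample pages wrongly ignored
    """
    precision = a / (a + b)   # equation (5)
    recall = a / (a + c)      # equation (6)
    # equation (7): harmonic mean of precision and recall
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure
```

For example, with a = 6, b = 2 and c = 2, all three criteria evaluate to 0.75.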
In this study, the proposed algorithm has been evaluated in four phases. In the first phase, running the algorithm on only the first feature of the webpage URL, with no reference to the other URL features, yielded an F-measure of 59.7%. In the second phase, adding the second selected feature raised the accuracy to 72.8%. This second feature, the domain likelihood in the webpage address, is a new feature not employed in similar studies, and the higher accuracy is evidence of its value. The third phase evaluated the use of the likelihood function in classification, which achieved 65% accuracy, compared with the 43% reported in [1] for the same feature.

Figure 2. Algorithm Average Accuracy based on different Features

V. Conclusion and Future Works

In this study, a new method based on URL features is presented for webpage classification. The method combines three different features of the page's URL with information from neighboring pages to achieve higher classification accuracy. The proposed algorithm ultimately achieved 85.8% accuracy, and it can be further improved by combining it with methods based on the HTML features of the webpage.

VI. References

[1] M. Kan, 2004, "Web page categorization without the web page", Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, ACM, pp. 262-263.
[2] L. K. Shih, D. R. Karger, 2004, "Using URLs and Table Layout for Web Classification Tasks", Proceedings of the 13th International Conference on World Wide Web, ACM, pp. 193-202.
[3] X. Qi, B. D. Davison, 2008, "Classifiers without borders: incorporating fielded text from neighboring webpages", Proceedings of the 31st Annual International ACM SIGIR Conference on Research & Development in Information Retrieval, Singapore, July 2008, pp. 643-650.
[4] S. Morales, H. Fandino, J. Rodriguez, 2009, "Hypertext Classification to filtrate information on the web", Proceedings of the 2009 Euro American Conference on Telematics and Information Systems: New Opportunities to Increase Digital Citizenship, Prague, Czech Republic, June 03-05, 2009, Article No. 1.



http://sites.google.com/site/ijcsis/
ISSN 1947-5500
[5] Ch. Lindemann, L. Littig, 2006, "Coarse-grained classification of websites by their structural properties", Proceedings of the 8th Annual ACM International Workshop on Web Information and Data Management, Arlington, Virginia, USA, pp. 35-42.
[6] F. Sebastiani, 2002, "Machine learning in automated text categorization", ACM Computing Surveys (CSUR), Volume 34, Issue 1 (March 2002), pp. 1-47.
[7] S. Chakrabarti, 2003, "Mining the Web: Discovering Knowledge from Hypertext Data", Morgan Kaufmann Publishers, San Francisco, CA.
[8] D. Mladenic, 1999, "Text-learning and related intelligent agents: A survey", IEEE Intelligent Systems and their Applications, Jul/Aug 1999, Volume 14, Issue 4, pp. 44-54.
[9] L. Getoor, C. Diehl, 2005, "Link mining: A survey", ACM SIGKDD Explorations Newsletter (Special Issue on Link Mining), December 2005, Volume 7, Issue 2, pp. 3-12.
[10] J. Furnkranz, 2005, "Web mining", in The Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., Springer, Berlin, Germany, pp. 899-920.
[11] B. Choi, Z. Yao, 2005, "Web page classification", in Foundations and Advances in Data Mining, W. Chu and T. Y. Lin, Eds., Studies in Fuzziness and Soft Computing, Volume 180, Springer-Verlag, Berlin, Germany, pp. 221-274.
[12] www.dmoz.org
[13] http://tartarus.org/~martin/PorterStemmer/
