

(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 11, No. 4, April 2013, ISSN 1947-5500

Two-Level Approach for Web Information Retrieval
S. Subatra Devi, Research Scholar, PSVP Engineering College, Chennai, Tamil Nadu, India.
Dr. P. Sheik Abdul Khader, Professor & HOD, BSA Crescent Engineering College, Chennai, Tamil Nadu, India.

Abstract - One of the most challenging issues for web search engines is finding high-quality web pages, or pages with high popularity, for users. The Web grows day by day, and retrieving information that satisfies the user has become a difficult task. The main goal of this paper is to retrieve a larger number of the most relevant pages. For this, a two-level approach is applied. In the first level, the topic keywords are verified against the title of the document, the snippet, and the URL path. In the second level, the page content is verified. The algorithm produces efficient results, which is proved experimentally at the different levels.

Keywords - Information Retrieval; Crawler; Snippet.

I. INTRODUCTION

Crawling has been the subject of widespread research, and web crawling is presently studied from diverse aspects. A web crawler is a program that searches for information related to the user's topic [13] and provides reliable results. It is not necessary for the crawler to collect all web pages. The crawler selects only the required pages [10] and retrieves the relevant pages that satisfy the user.

In this paper, the topic keywords are given to three search engines, namely Google, Yahoo and MSN. The top 10 URLs that exist in common across all three search engines are considered as the seed URLs during the initial iteration. Here, the crawler checks three possibilities in the first level. For the given top 10 URLs, the three possibilities, namely the title of the document, the snippet and the URL path, are verified against the given topic. If the topic keywords exist in any two or in all three possibilities, then the pages are considered relevant pages for the next iteration. If the keywords exist in only one of the three possibilities, then the page is considered irrelevant and is not included in the next iteration. This makes the algorithm consider the most relevant pages during the initial stage of the crawling.

In the second level of crawling, the topic keywords are verified against the page content. If the content of the page contains a larger number of topic keywords, then it is considered a relevant page and the crawler moves to the next stage of crawling. With every stage of crawling, the irrelevant pages are filtered out in the first level. This makes the crawling efficient and retrieves the most relevant pages effectively.

This paper is structured as follows. Section 2 presents the related work. In Section 3, the novel algorithm for the web crawling process is proposed. Section 4 shows the experimental results and the performance evaluation of the proposed work. Finally, Section 5 concludes the paper.

II. RELATED WORK

"Fish Search" is one of the first dynamic search heuristics; it capitalizes on the intuition that relevant documents often have relevant neighbors. This algorithm [1] evaluates pages for the query dynamically with the values 0 and 1 and finds the information in the distributed hypertext. Search results can also be ranked based on user preferences in content and link, which are integrated to rank the results [4]. The TF-IDF method is the base method for retrieving keywords from page content; in addition to that, a vector similarity method [2] is applied.

The topic keyword is used as a base in several algorithms. Topic distillation is performed in [3]. Focused crawling [12] analyzes its crawl boundary for the links that are likely to be most relevant for the crawler [10]. Text search based on keywords [8] is the basic concept of information retrieval algorithms. The hyperlink, linking a parent to a child URL, is exploited by several methods. A link score is calculated based on the division score in algorithm [11]. Based on multi-information, the relevant pages are retrieved in [9]. There are several algorithms based on content and link strategy. The algorithm based on hyperlink and content relevance

and on HITS is presented as a heuristic search [5]. A comparative study of two ranking algorithms, namely PageRank and users rank, is given in [13]. Multiple sources of information are used to improve the Shark-Search algorithm [7]. The breadth-first method is used to produce highly relevant pages [6], and it is applied in the proposed method.

III. PROPOSED WORK

In the proposed method, the topic is given to the search engines and the top 10 URLs are considered as the input to the proposed method. For these URLs, the initial preference is given to the title of the document, the snippet and the parent URLs, which are considered at Level1. If any two or all three of these possibilities contain the keyword, then these URLs are considered relevant URLs and are given to Level2. In Level2, the page content of the document is verified against the frequency of the keyword.

A. Seed URL Extraction

Initially, the topic keywords are given to three different search engines, Google, Yahoo and MSN. The top ten URLs that exist commonly in all three search engines are taken and considered for evaluation. These URLs are considered as the seed URLs.

B. Relevancy Prediction

The relevancy of the document is predicted based on the title of the document, the snippet and the parent URL at Level1, and on the page content method at Level2. These possibilities are discussed below. This approach specifies the relevancy more precisely.

1) The Title of the Document

The title of the document is verified as to whether it contains the topic keywords. The document title consists of a set of words. Each word wi is compared with the given keyword KWi.

         Ti = {w1, w2, w3, …, wn}

Here, Ti represents the title of the document, which consists of the set of words wi. The relevancy of the title of the document is computed as

                   KWT
         RSTOD = --------
                    WT

where KWT represents the number of keywords in the title and WT represents the total number of words present in the title of the document. The relevancy of the title is thus evaluated as the ratio of the number of keywords existing in the title to the total number of words in the title.

2) The Snippet

Here, the snippet is checked for whether it contains the topic keywords. The snippet gives brief information about what the document page consists of. The relevancy is determined as follows:

                   KWSNIP
         RSSNIP = ----------
                    WSNIP

KWSNIP represents the total number of keywords present in the snippet and WSNIP represents the total number of words in the snippet. The relevancy is higher if all the keywords are present in the snippet.

3) The Parent URL

The top 10 URLs generated from the search engine are considered as the parent URLs. These URLs are checked for whether they contain the anchor text, and the number of keywords appearing in the parent URL is counted. For this, the division method is used. If all the keywords are present in the parent URL, then its relevancy is 1; otherwise, the relevancy depends on the percentage of the anchor text appearing in the parent link.

During the initial iteration, the URL acts as the parent URL. In the forthcoming iterations, the outgoing link of the parent URL becomes the child URL, i.e., the link URL.

The relevancy of the parent URL is calculated as follows:

                  KWPU
         RSPU = --------
                   WPU

where KWPU represents the number of keywords in the parent URL and WPU represents the total number of words in the parent URL.

4) The Page Content

The text content, or page content, of the document is given the next preference, which is considered at Level2. The keywords are extracted

from the page content using stop word removal and stemming, and finally the tokens are extracted. The frequencies of the tokens are found, and the tokens are arranged in order such that the token with the highest frequency occurs first. The given topic keywords are compared with this set of tokens arranged by frequency.

If the frequency of the keyword is high, then that particular document is considered a highly relevant document. If the frequency of the keyword is average, then it is considered a relevant document. Otherwise, it is considered an irrelevant document.

The relevancy score of the page content of the document is computed as follows:

                  KWPC
         RSPC = --------
                   WPC

Here, KWPC represents the frequency of the topic keywords present in the page content and WPC represents the total number of tokens present in the content of the document.

5) The Relevancy Score

The relevancy score of the document for each URL is computed based on the methods specified above. The aggregate of these relevancies is formed by summing the weighted individual relevancy scores:

Relevancy-Score = α1(RSTOD*wt1 + RSSNIP*wt2 + RSPU*wt3) + α2(RSPC*wt4)

Here, wt1, wt2, wt3 and wt4 are the weights used to normalize the relevancy scores. The values of these weights vary between 0 and 1, inclusive. The value of a weight can be increased to increase the importance of the corresponding relevancy.

After finding the relevancy score, it is compared with a specified threshold value. If the relevancy score is greater than the threshold value, then the document is considered a highly relevant document, and its URL is placed in the URL queue. The outgoing links of the parent URL are fetched based on the relevancy and placed in the URL queue. The same process described in the earlier steps is performed sequentially for all the pages in the URL queue until the queue becomes empty.

IV. EXPERIMENTAL RESULTS

The top 10 URLs selected from the search engines are given as input to the proposed algorithm. In the Level1 process, the title, the snippet and the parent URL are checked for whether they contain the topic keywords. The topic given for evaluation is 'query processing and optimization'.

For this topic, three of the URLs satisfy all three possibilities of Level1, four URLs contain the keywords in the title and snippet, and three URLs contain the topic keywords in the title alone. URLs with the topic keywords present in all three possibilities, namely the title, the snippet and the parent URL, are considered the most relevant. URLs with the topic keywords occurring in any two possibilities are considered relevant, and those with the keywords in only one possibility are considered least relevant and are not considered for Level2.

After the completion of Level1, Level2 is considered, which is the combination of Level1 and the page content relevancy. The relevant pages at Level1 are considered for Level2, and the irrelevant pages are not considered during the initial iteration for Level2. This removes the unwanted pages in the initial stage of crawling and skews the search toward more relevant pages. The relevancy scores for the different URLs are listed in Table I.

TABLE I.  RELEVANCY SCORE AT LEVEL1 AND LEVEL2
(Parent URL paths are truncated as in the original.)

S.No.  Parent URL                                   Relevancy score   Relevancy score
                                                    at Level1         at Level2
1      orts/TR-20080105-1.pdf                       2.27              3.0
2      Book/slides/ch5revised.ppt                   1.60              2.85
3      r/query-processing-and-optimisation          2.35              3.15
4      and-optimization/ch0...                      2.56              3.35
5      s470/query-processing.pdf                    2.47              3.20
6      zaiane/courses/cmput391-02/.../sld004.htm    1.30              -
7      ?v=GYQZpYEaNvk                               1.25              -
8      (not legible)                                1.21              -

9      bkin/teach/dbs12/set5.pdf                    1.42              2.56
10     (not legible)                                1.35              2.45

This relevancy score is calculated for the parent URL, which is the seed URL during the initial iteration. The same process is repeated for each outgoing link, and the relevancy is checked. The URLs having the lowest relevancy scores at Level2 are discarded, and the URLs having higher relevancies are taken into consideration during the next iteration.

The total number of relevant pages retrieved at the various levels is indicated in Figure 1. The graph compares the total number of pages crawled with the number of relevant pages retrieved.

[Figure 1 plots the number of relevant pages crawled (0-1000) against the total number of pages crawled, for Level1 and Level2.]

Figure 1: Relevant pages crawled during Level 1 and Level 2

The graph indicates that a larger number of relevant pages is retrieved at Level1, since the irrelevant pages are discarded in the initial crawling. After Level2, the most relevant pages are retrieved for the given topic.

For evaluating the efficiency, the experimentation is performed on different topics and the relevancy is checked. Different values of the weights are applied to check the efficiency. The results clearly indicate that the proposed algorithm retrieves the most relevant pages.

V. CONCLUSION & FUTURE WORK

In this paper, two levels are considered for an efficient result. At Level1, the title, the snippet and the parent URL are verified for relevancy. Based on the Level1 relevancy, the URLs are moved to Level2. This retrieves a larger number of the most relevant pages at the beginning of the crawling. It has been proved experimentally that the proposed algorithm retrieves the most relevant pages efficiently from the initial stage of crawling.

A major issue for future work is to test with a large volume of web pages. The future work also includes optimizing the code and the URL queue, which will make the crawler retrieve the maximum number of relevant pages in a faster way.

REFERENCES

[1] P. De Bra, G-J. Houben, Y. Kornatzky, and R. Post, "Information Retrieval in Distributed Hypertexts", in Proceedings of RIAO'94, Intelligent Multimedia, Information Retrieval Systems and Management, New York, NY, 1994.

[2] Yang Yongsheng and Wang Hui, "Implementation of Focused Crawler", Journal of Computers, Vol. 6, No. 1, January 2011.

[3] K. Bharat and M. Henzinger, "Improved Algorithms for Topic Distillation in a Hyperlinked Environment", in Proc. of the ACM SIGIR '98 Conference on Research and Development in Information Retrieval.

[4] J. Jayanthi and K. S. Jayakumar, "An Integrated Page Ranking Algorithm for Personalized Web Search", International Journal of Computer Applications, 2011.

[5] Lili Yan, Wencai Du, Yingbin Wei and Henian Chen, "A Novel Heuristic Search Algorithm Based on Hyperlink and Relevance Strategy for Web Search", Advances in Intelligent and Soft Computing, 2012.

[6] M. Najork and J. L. Wiener, "Breadth-first Crawling Yields High-quality Pages", in Proceedings of the Tenth Conference on World Wide Web, Hong Kong, Elsevier Science, May 2001, pp. 114-118.

[7] Zhumin Chen, Jun Ma, Jingsheng Lei, Bo Yuan and Li Lian, "An Improved Shark-Search Algorithm Based on Multi-information", Fourth International Conference on Fuzzy Systems and Knowledge Discovery, pp. 659-658, Aug 24-27, 2007.


[8] Lixin Han and Guihai Chen, "The HWS Hybrid Web Search", Information and Software Technology, Vol. 48, No. 8, pp. 687-695.

[9] Shalin Shah, "Implementing an Effective Web Crawler", September 2006.

[10] S. Chakrabarti, M. van den Berg, and B. Dom, "Focused Crawling: A New Approach for Topic-Specific Resource Discovery", in Proc. 8th WWW Conference, 1999.

[11] Debashis Hati and Amritesh Kumar, "An Approach for Identifying URLs Based on Division Score and Link Score in Focused Crawler", International Journal of Computer Applications, Vol. 2, No. 3, May 2010.

[12] Ahmed Patel and Nikita Schmidt, "Application of Structured Document Parsing to Focused Web Crawling", Computer Standards & Interfaces, Vol. 33, No. 3, pp. 325-331, March 2011.

[13] Akshata D. Deore and R. L. Paikrao, "Ranking Based Web Search Algorithms", International Journal of Scientific and Research Publications, Oct 2012.
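APPENDIX: ILLUSTRATIVE SKETCH

To make the procedure of Section III concrete, the Level1 two-of-three check and the aggregate relevancy score can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the whitespace tokenization, the substring keyword matching, and the weight and α values are assumptions made here for demonstration only.

```python
# Minimal sketch of the two-level relevancy scoring (Section III).
# Tokenization, matching rules, weights and alphas are illustrative
# assumptions, not values prescribed by the paper.

def ratio_score(keywords, words):
    """Keyword ratio RS = KW / W, mirroring RSTOD, RSSNIP, RSPU and RSPC.

    `keywords` is a set of lowercase topic keywords; `words` is a token list.
    """
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip('.,;:').lower() in keywords)
    return hits / len(words)

def level1_passes(keywords, title, snippet, url_path):
    """Level1 rule: keywords must appear in at least two of the three
    possibilities (title, snippet, URL path)."""
    kws = [k.lower() for k in keywords]
    fields = [title.lower(), snippet.lower(), url_path.lower()]
    present = sum(1 for f in fields if any(k in f for k in kws))
    return present >= 2

def relevancy_score(keywords, title, snippet, url_path, content,
                    weights=(0.4, 0.3, 0.3, 1.0), alphas=(1.0, 1.0)):
    """Aggregate score:
    alpha1*(RSTOD*wt1 + RSSNIP*wt2 + RSPU*wt3) + alpha2*(RSPC*wt4)."""
    kw = {k.lower() for k in keywords}
    rs_tod = ratio_score(kw, title.split())
    rs_snip = ratio_score(kw, snippet.split())
    # Treat '/' and '-' as separators so URL path segments become tokens.
    rs_pu = ratio_score(kw, url_path.replace('/', ' ').replace('-', ' ').split())
    rs_pc = ratio_score(kw, content.split())
    wt1, wt2, wt3, wt4 = weights
    a1, a2 = alphas
    return a1 * (rs_tod * wt1 + rs_snip * wt2 + rs_pu * wt3) + a2 * (rs_pc * wt4)
```

In this sketch, a page would first be screened with level1_passes; only pages passing the two-of-three rule would be scored with relevancy_score and compared against a threshold before their outgoing links are queued.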

