International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 1, Issue 3, September – October 2012 ISSN 2278-6856

A Comparative Study Of Different Approaches For Improving Search Engine Performance

Surabhi Lingwal, Bhumika Gupta
Dept. of Computer Science & Engineering, G.B.P.E.C. Pauri, Uttarakhand, India


Abstract: The enormous growth and the diverse, dynamic and unstructured nature of the web make searching the internet, retrieving relevant information and presenting query results extremely difficult. To address this problem, many researchers are turning to web mining. Noise on web pages, such as advertisements, navigation bars and copyright notices, is irrelevant to the main content of the pages being mined. The presence of near-duplicate web pages also degrades performance when integrating data from heterogeneous sources, as it increases the index storage space and thereby the serving cost. Classifying and mining noise-free web pages and removing redundant web pages improve both the accuracy of search results and search speed.
This paper presents a comparative study of different approaches for improving search engine performance and speed. The results show that the system readily establishes relevance and delivers dominant text extraction, supporting users in efficiently examining and making the most of the available web data sources. Experimental results reveal that the mathematical approach performs better than the statistical and signed approaches.

Keywords: web content mining, outliers, redundant web pages, relevant, precision

1. INTRODUCTION
Due to the presence of a large amount of web data, the Web has become a prevalent tool for most e-activities such as e-commerce, e-learning, e-government and e-science; its use has pervaded everyday work. The Web is an enormous, widely scattered, global source of information services, hyperlink information, access and usage data, and website contents and organizations. With the rapid development of the Web, it is imperative to provide users with tools for efficient and effective resource and knowledge discovery. Search engines have assumed a central role in the World Wide Web's infrastructure as its scale and impact have escalated [13]. This useful knowledge discovery is provided by web mining. The web mining process is given in Figure 1.

Figure 1. Web mining process

Figure 2. Structure of Web mining

Web mining is categorized as:
Web Structure Mining: This is the technique of analyzing and explaining the links between different web pages and web sites [1]. It works on hyperlinks and mines the topology of their arrangement, trying to discover useful knowledge from the structure and the hyperlinks. The goal of web structure mining is to generate a structured summary of websites and web pages. It uses a tree-like structure to analyze and describe HTML or XML documents.
Web Content Mining: This focuses on extracting knowledge from the contents of web documents or their descriptions. It involves techniques for summarizing, classifying and clustering web contents, and can provide useful and interesting patterns about user needs and contribution behavior [1]. It is related to text mining because much of the web content is text based; text mining focuses on unstructured texts, whereas web content mining deals with the semi-structured nature of the web. Technologies used in web content mining include NLP and IR.
Web Usage Mining: This focuses on digging the usage of web contents out of the logs maintained on web servers, cookie logs, application server logs, etc. [1]. Web usage mining is the process of identifying browsing patterns by analyzing the navigational behavior of users. It focuses on techniques that can predict user behavior while the user interacts with the web, and it uses the secondary data on the web. The activity involves the automatic discovery of user access patterns from one or more web servers and consists of three phases: pre-processing, pattern discovery and pattern analysis. Web servers, proxies and client applications can quite easily capture data about web usage.
1.1 Web Content Mining
Web content mining is the process of extracting useful information from the contents of web documents. The web documents may consist of text, images, audio, video or structured records such as tables and lists. Mining can be applied to web documents as well as to the result pages produced by a search engine. There are two types of approach in content mining, the agent-based approach and the database-based approach [3][16]. The agent-based approach concentrates on searching for relevant information using the characteristics of a particular domain to interpret and organize the collected information. The database approach is used for retrieving semi-structured data from the web. Two groups of web content mining are those that directly mine the content of documents and those that improve on the content search of other tools such as search engines [11].
Web content mining is involved in:
Structured Data Extraction: Data extraction is the act or process of retrieving data out of data sources for further data processing or data storage.
Unstructured Text Extraction: Typical unstructured data sources include web pages, email, documents, PDFs, scanned text, mainframe reports, spool files, etc.
Web Information Integration and Schema Matching: Although the web contains a huge amount of data, each web site represents similar information differently. Identifying or matching semantically similar data is a very important problem with much practical application.
Building Concept Hierarchies: Concept hierarchies are important in many generalized data mining applications, such as multiple-level association rule mining.
Segmentation and Noise Detection: In many web applications, one wants only the main content of the web page, without advertisements, navigation links and copyright notices. Automatically segmenting a web page to extract its main content is an interesting problem.
Opinion Extraction: Mining opinions is of great importance for marketing intelligence and product benchmarking.

1.2 Outliers Detection
Pages on the Web carry additional template information (we call it noise) that does not add value to the actual content of the page. Even worse, it can harm the effectiveness of web mining techniques; these templates can be eliminated by preprocessing. Templates form one popular type of noise on the Internet [2]. Web content outlier mining is focused on detecting an irrelevant web page among the web pages under the same category. Web outlier mining algorithms are applicable to varying types of data such as text, hypertext, video, audio, images and HTML tags [18]. There are two groups of web content outlier mining strategies: those that directly mine the content outliers of documents to discover information about the outliers, and those that reject outliers to improve the content search of other tools such as search engines. In many web applications, one wants only the main content of the web page, without advertisements, navigation links and copyright notices. Outliers are observations that deviate so much from other observations as to arouse suspicion that they were generated by a different mechanism, or data objects that are inconsistent with the rest of the data objects [9]. Web content outliers are web documents that show significantly different characteristics from other web documents taken from the same category. Outliers identified in web data are referred to as web outliers, and the mining of such outliers is called web content outlier mining.
Outlier detection methods [5] broadly fall into the following categories:
Distribution-based methods come from the statistics community. They deploy a known distribution model and detect as outliers the points that deviate from the model.
Depth-based algorithms organize objects in convex hull layers in data space according to peeling depth; outliers are expected to have shallow depth values.
Deviation-based techniques examine the characteristics of objects and identify as an outlier any object that deviates from these features.
Distance-based algorithms rank all points by the distance of each point from its k-th nearest neighbor and identify the top n points in the ranked list as outliers. Alternative approaches compute the outlier factor as the sum of distances from the k nearest neighbors.
Density-based methods rely on the local outlier factor (LOF) of each point, which depends on the local density of its neighborhood. Points with a high factor are flagged as outliers.

1.3 Redundant Web Pages
The performance and reliability of web search engines face huge problems due to the presence of an extraordinarily large amount of web data. The voluminous number of web documents has caused problems for search engines, to the point that search results are of less relevance to the user. In addition, the presence of duplicate and near-duplicate web documents creates additional overhead for search engines, critically affecting their performance [15]. The demand for integrating data from heterogeneous sources leads to the problem of near-duplicate web pages. Near-duplicate data bear high similarity to each other, yet they are not bitwise identical [17][12]. "Near duplicates" are documents (web
pages) that differ only slightly in content. The difference between these documents can be due to elements that are included but not inherent to the main content of the page. For example, advertisements on web pages, or timestamps of when a page was updated, are both information that is not important to a user searching for the page, and thus not informative for the search engine when crawling and indexing it. Near-duplicate web pages arise from exact replicas of an original site, mirrored sites, versioned sites, multiple representations of the same physical object, and plagiarized documents [5].

2. RELATED WORK
An algorithm has been proposed for mining web content [6] using a clustering technique and mathematical set formulae such as subset, union and intersection to detect outliers. The outlying data is then removed from the original web content to obtain the content required by the user; removing outliers also improves the quality of the results from the search page. Another paper [9] proposed two statistical approaches, based on proportions (Z-test hypothesis) and the chi-square test (T-test), for mining this outlying content, and presented a comparative study of the two methods. Eliminating this outlying content during searching further improves the quality of search engines. Another paper proposed a mathematical approach [4] based on signed and rectangular representation to detect and remove redundancy between unstructured web documents. This method optimizes the indexing of web documents and improves the quality of search engines. In this approach, web documents are extracted and preprocessed, and an n x m matrix is generated for each extracted document. Each page is mined individually to detect redundant content by similarity computation of a word taken from all the 4-tuples of the n x m matrix; redundancy between two documents is then found using a signed approach. A further paper proposes a new algorithm for mining web content by detecting redundant links [5] in web documents using set theory (classical mathematics), such as subset, union and intersection. The redundant links are then removed from the original web content to obtain the information required by the user. The obtained web document D is divided into 'n' web pages based on the links. All the pages are preprocessed, and each page is mined individually to detect redundant links using set theory concepts. Initially, the contents of the first page are compared with the contents of the second page, and this process is repeated up to the nth page. In general, if any redundant link is noted, the corresponding web page is removed from the web document. Finally, a modified web document is obtained which contains the required information catering to the user's needs.

3. ARCHITECTURAL DESIGN
The proposed system can be divided into five modules: 1) user input, 2) pre-processing, 3) term frequency, 4) comparison of the term frequencies of similar words between both documents, and 5) relevance computation. In the first module the user gives the input query, and based on that query documents are retrieved from the search engine; the retrieved documents may or may not be relevant to the user query [9]. The second module is pre-processing, whose steps are stemming, stop word removal and tokenization. Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. Stop words are common words that carry less important meaning than keywords; search engines usually remove stop words from a keyword phrase to return the most relevant results. Tokenization is the process of breaking a stream of text up into words, phrases, symbols or other meaningful elements called tokens; the list of tokens becomes the input for further processing. The third module is the term frequency calculation: the words present in the document are compared with the words present in the domain dictionary, and the words that match the dictionary are taken for the term frequency calculation. The fourth module compares the term frequencies of the dictionary-matched words between both documents Di and Dj. The fifth module is the relevance calculation, where the comparison is based on the precision calculation.

3.1 Precision
It is the ratio between the number of relevant documents returned originally and the total number of retrieved documents returned after eliminating irrelevant documents [8]. Here the relevant documents are the required documents which satisfy the user's needs.

Precision = Relevant / (Retrieved after refinement)

3.2 Recall
It is the ratio between the number of relevant documents returned originally and the total number of relevant documents returned after eliminating irrelevant documents [8].

Recall = Relevant / (Relevant after refinement)

3.3 Time Taken
The time taken by the entire process is the initial time taken by the general-purpose search engine plus the time taken by the refinement algorithm to process the results [8].
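As a rough illustration, the pre-processing module (tokenization, stop word removal and stemming) can be sketched as follows. The tiny stop-word list and the crude suffix-stripping stemmer are placeholder assumptions, not the actual implementation used in the paper:

```python
import re

# Placeholder stop-word list; a real system would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for"}

def stem(word):
    # Naive suffix stripping as a stand-in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Tokenize: lower-case and split on non-alphabetic characters.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stop words, then stem the remaining tokens.
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Mining the contents of web documents"))
# -> ['min', 'content', 'web', 'document']
```

The resulting token list is what the term frequency module would count against the domain dictionary.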
Volume 1, Issue 3, September – October 2012                                                                   Page 125
       Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 1, Issue 3, September – October 2012                                    ISSN 2278-6856


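The two ratios defined in sections 3.1 and 3.2 can be computed directly from the document counts; the counts used below are invented for illustration:

```python
def precision(relevant, retrieved_after_refinement):
    # Ratio of originally relevant documents to the total documents
    # returned after eliminating irrelevant ones (section 3.1).
    return relevant / retrieved_after_refinement

def recall(relevant, relevant_after_refinement):
    # Ratio of originally relevant documents to the relevant documents
    # remaining after refinement (section 3.2).
    return relevant / relevant_after_refinement

# Invented counts for illustration.
print(precision(8, 10))  # 0.8
print(recall(8, 8))      # 1.0
```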
4. STATISTICAL APPROACH
Each document is mined to retrieve relevant web documents through a test hypothesis using proportions. When the value of Z is less than or equal to 1.645, the documents are relevant [10]. Finally, a mined web document is obtained which contains the required information catering to the user's needs. In this algorithm, web documents are extracted based on the user query. The extracted documents are pre-processed to simplify the remaining processing. Following this, the term frequency of the words present in the document against the domain dictionary is computed for the ith and jth (i+1th) documents. Then similar words from these documents, along with their term frequencies, are retrieved for computing the test statistic (Z) using proportions. Finally, the Z value is compared with the critical value at the 95% confidence level, which is obtained from the statistical table. If the calculated value is less than or equal to 1.645, the two documents are considered relevant; otherwise they are considered irrelevant. The above process is repeated for all the remaining documents to compute their relevance.

4.1 Algorithm for retrieving relevant documents through test hypothesis

Input: Web documents.
Method: Statistical method.
Output: Extraction of relevant web documents.
Step 1: Extract the input web documents Di, where 1 <= i <= N.
Step 2: Pre-process the entire extracted document.
Step 3: Initialize i = 1.
Step 4: Initialize j = i + 1.
Step 5: Consider the documents Di and Dj.
Step 6: Find the term frequency for all the words TF(Wik) in Di and TF(Wjk) in Dj that exist in the domain dictionary.
Step 7: Calculate the total numbers of words, N1 and N2, in Di and Dj that match the domain dictionary.
Step 8: Perform the proportionate calculation for the common words between Di and Dj through the following steps:
        P1 = (sum of Xik) / N1 and P2 = (sum of Yjk) / N2,
        where Xik and Yjk are the term frequencies of Di and Dj.
        Compute the standard error:
        S.E.(P1 - P2) = SQRT[ P1 * (1 - P1) / N1 + P2 * (1 - P2) / N2 ]
        Calculate the test statistic:
        Z = (P1 - P2) / S.E.(P1 - P2)
Step 9: Compare the Z value with the critical value Z at the 95% level of confidence.
Step 10: If the Z value is less than the critical value, then Di and Dj are relevant documents; else Di and Dj are irrelevant.
Step 11: Increment j and repeat from step 5 to step 10 until j <= N.
Step 12: Increment i and repeat from step 4 to step 11 until i < N.

Figure 3. Proposed Architecture of Statistical Approach

Nomenclature
Variables   Description
SE          Standard error
P1          Sample proportion for the ith document
P2          Sample proportion for the jth document

4.2 Experimental Results
Here, the five web documents listed in Table 1 are taken for the test study. Initially these documents are preprocessed, and then the term frequencies for the similar words taken from the first two documents are computed [10]. Following that, the statistical test hypothesis using proportions is applied to those two documents to check the relevancy between them. Similarly, the relevancy of the remaining documents is computed. In this approach, the critical value at the 95% confidence level, 1.645, is obtained from the statistical table.
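A minimal sketch of the proportion test of step 8, assuming P1 and P2 are formed from the term-frequency counts of the common words as described; the counts and document sizes below are invented for illustration:

```python
from math import sqrt

def z_statistic(x_counts, y_counts, n1, n2):
    # P1 and P2: proportions of dictionary-matched term frequencies
    # in documents Di and Dj (step 8 of the algorithm).
    p1 = sum(x_counts) / n1
    p2 = sum(y_counts) / n2
    # Standard error of the difference between two proportions.
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se

# Invented term-frequency counts for the common words of Di and Dj.
z = z_statistic([3, 2, 1], [2, 2, 1], n1=40, n2=38)
# Documents are judged relevant when Z <= 1.645 (95% confidence level).
print(abs(z) <= 1.645)  # True
```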
Table 1: Input documents
D.No   Document Name
D1     Wcm.pdf
D2     Page Content rank an approach to the web content mining.pdf
D3     Neural Analysis.pdf
D4     Deep_WCM.pdf
D5     Medical Mining.pdf

Table 2: Experimental Results
       D1    D2       D3        D4        D5
D1     *     1.310    2.27447   0.84306   2.9657
D2     *     *        4.53130   1.51089   4.8332
D3     *     *        *         2.79123   2.5671
D4     *     *        *         *         3.4015
D5     *     *        *         *         *

From Table 2 it is clear that the values for the pairs among documents 1, 2 and 4 are less than or equal to 1.645; therefore these documents are relevant. On the other hand, the pairs involving documents 3 and 5 have values greater than 1.645, so those documents are concluded to be irrelevant. Experimental results show that memory space is reduced and the accuracy of search results improves after eliminating the irrelevant documents. As the efficiency of the web content increases, the quality of the search engine also increases, and the precision and recall of the refined documents increase considerably.

5. SIGNED APPROACH
In the proposed system, web documents are extracted from the search engines through the query given by the user to the web. The obtained web document D is then preprocessed. The output is a set of documents with white-space-separated words, indexed in a two-dimensional format (i, j), where 'i' represents web pages and 'j' represents words. Therefore, the first word of the first web page is indexed as (1, 1), the second word of the first page as (1, 2), and so on. The domain dictionary is arranged in such a way that all 1-letter words are indexed first, followed by 2-letter words, then 3-letter words, and so on up to 15-letter words, which is a very reasonable upper bound on the number of characters in a word [8]. Each page is mined individually to detect relevant and irrelevant documents using the signed approach. Finally, a relevant web document is obtained which contains the required information catering to the user's needs.
The proposed algorithm exploits the advantages of full-word matching and the signed approach using an organized domain dictionary in which indexing is done by word length. The full word profile for the document is generated in matrix form (i.e., W1,4 represents the 4th word in the 1st page). The jth word of the ith page is taken, its length |Wji| is calculated, and depending on the number of characters, the respective index in the domain dictionary is searched. If the word Wji is found in the dictionary, the positive count is incremented by one; otherwise the negative count is incremented by one. This process is carried out for all words in the web page. Finally, the positive count is compared with the negative count to check the relevancy of the page: if the positive count is less than the negative count, the page is irrelevant; otherwise it is considered relevant. The approach is shown in Figure 4.

Figure 4. Proposed Architecture of Signed Approach

5.1 Algorithm for retrieving relevant documents through the signed approach

Input: Domain dictionary, web document Di
Output: Relevant pages and irrelevant pages
Other variables: Pos_count, Neg_count

Extract the input web document D after preprocessing.
Read the contents of web page Pi.
Generate the full word profile.
for (i = 1; i <= n; i++)
{
    Pos_count = 0; Neg_count = 0;
    for (j = 1; j <= m; j++)
    {
        if (jth word exists in dictionary)
            Pos_count++;
        else
            Neg_count++;
    }
    if (Pos_count >= Neg_count)
        Print Pi as a relevant web page;
    else
        Print Pi as an irrelevant web page;
}

Nomenclature:
Variables   Description
D           Web document to be mined
Pi          Web page
Wj,i        jth word in the ith web page

5.2 Experimental Results
Experimental results show that memory space, search time and run time are reduced by using the organized domain dictionary rather than a normally indexed dictionary for checking the relevancy of the web documents. As the efficiency of the web content increases, the quality of the search engine also increases [8]. This method is very simple to implement. The proposed algorithm can be used by business personnel to keep track of all the positive and negative aspects related to their business. The documents retrieved originally with respect to the computer domain are shown in Table 3, and the documents after relevancy computation are shown in Table 4.

Table 3: Experimental Results: Documents Retrieved Originally With Respect To Computer Domain
File name    Positive count   Negative count   Status
File1.html   52               39               Relevant
File2.html   100              145              Not relevant
File3.html   127              107              Relevant
File4.html   13               49               Not relevant
File5.html   68               140              Not relevant
File6.html   184              113              Relevant
File7.html   169              120              Relevant
File8.html   36               75               Not relevant
File9.html   87               200              Not relevant

Table 4: Documents retrieved after relevancy computation
File name    Positive count   Negative count   Status
File1.html   52               39               Relevant
File3.html   127              107              Relevant
File6.html   184              113              Relevant
File7.html   169              120              Relevant

6. MATHEMATICAL APPROACH
The proposed work provides a mathematical approach based on signed, correlation and rectangular representation of trust rating to mine related web content without duplication, for both structured and unstructured web documents. In the proposed system, web documents are extracted from the search engines based on the user query to the web. The extracted web document D is sliced into 'n' web pages, and each page is divided into 'm' words [7]. The sliced web document is then preprocessed, after which the term frequency of all the words is calculated. Following that, relevancy checking of the web document is performed using the signed approach. Then redundancy checking of the web documents is performed using the correlation and signed approaches of trust rating. Finally, a mined web document is obtained which contains the desired information of the end user. The system design of the mathematical approach is shown in figure 6.

                                                               6.1 Signed Approach for relevancy computation
                                                               The proposed algorithm explores the advantages of full
                                                               word matching and signed approach using organized
                                                               domain dictionary where the indexing is done based on
                                                               the length of the word. First, the input web document is
                                                               preprocessed and separated into white spaced words. The
                                                               full word profile for the document is generated in matrix
                                                               form (i.e., W1,5 - represents 5th word in 1st document).
                                                               Following the above process, term frequency for all the
                                                               words are found out. Then the jth word from ith document
                                                               is taken and its length is calculated (| Wij |) and depending
                                                               on the number of characters, the respective index on the
                                                               domain dictionary is searched [7]. If the word ( Wij ) is
    Figure 5.    Flow diagram of the proposed system           found in the dictionary, then positive count is

incremented by its term frequency; otherwise the negative count is incremented by its term frequency. This process is carried out for every word in the web document. Finally, the positive count is compared with the negative count to decide the relevancy of the web page: if the positive count is less than the negative count, the page is irrelevant; otherwise it is considered relevant.

6.2 Two way rectangular representation for checking relevant document
In the two-way representation, only an n x m matrix is considered for relevancy computation, where n is the total number of documents and m is the maximum number of words taken from any of the extracted documents. All columns consisting of only zero entries are ignored [7], as represented in Table 5.

Table 5: Rectangular representation for checking relevancy in given document
  Word / Document   W1      W2      W3      --   --      Wm
  D1                (1,0)   (1,0)   (0,1)   --   --      (0,0)
  D2                (0,1)   (0,1)   (1,0)   --   --      (0,1)
  D3                (1,0)   (1,0)   (0,1)   --   (1,0)   (0,0)
  --                --      --      --      --   --      --
  --                --      --      --      --   --      --
  Dn                (1,0)   (0,1)   (1,0)   --   (0,0)   (0,0)
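The counting procedure of Section 6.1 and the (1, 0)/(0, 1) entries of Table 5 can be sketched together in a few lines of Python. This is a simplified illustration rather than the authors' implementation: the whitespace tokenizer and the tiny five-word domain dictionary are placeholder assumptions.

```python
from collections import Counter

def build_length_index(domain_words):
    """Organize the domain dictionary by word length, so that a lookup
    for a word of length L only scans the bucket indexed by L."""
    index = {}
    for w in domain_words:
        index.setdefault(len(w), set()).add(w.lower())
    return index

def signed_relevancy(document, length_index):
    """Return (PC, NC, pair_row): positive/negative counts weighted by
    term frequency, plus the (1,0)/(0,1) pair per distinct word as in
    the two-way rectangular representation of Table 5."""
    words = document.lower().split()      # pre-process into white-spaced words
    tf = Counter(words)                   # term frequency of every word
    pc = nc = 0
    pair_row = {}
    for word, freq in tf.items():
        bucket = length_index.get(len(word), set())
        if word in bucket:                # full-word match in the dictionary
            pc += freq
            pair_row[word] = (1, 0)
        else:
            nc += freq
            pair_row[word] = (0, 1)
    return pc, nc, pair_row

# A page is kept when its positive count is not below its negative count.
domain = {"mining", "web", "data", "search", "engine"}   # illustrative dictionary
idx = build_length_index(domain)
pc, nc, row = signed_relevancy("web mining improves search engine results", idx)
print("relevant" if pc >= nc else "irrelevant")          # prints: relevant
```

Indexing the dictionary by length matches the paper's point about the organized dictionary: a lookup never touches words whose length cannot match.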
  Figure 6. System Design of the mathematical approach

Nomenclature
  (Di, Wj)   jth word from the ith web document.
  PC         Positive Count
  NC         Negative Count
  Di         ith web document.
  TF         Term Frequency

Algorithm 1: Relevancy computation through the signed approach
Input: Domain dictionary, web documents Di.
Output: Relevant web documents and irrelevant web documents.
Step 1: extract the input web documents Di, where 1 ≤ i ≤ N.
Step 2: pre-process the entire extracted document.
Step 3: generate the full word profile.
Step 4: initialize i = 1.
Step 5: consider the document Di.
Step 6: initialize PC = 0, NC = 0.
Step 7: compute the term frequency TF(Wj) for all words in Di, where 1 ≤ j ≤ m.
Step 8: if Wj exists in the domain dictionary then update PC = PC + TF(Wj), else update NC = NC + TF(Wj).
Step 9: increment j.
Step 10: repeat steps 8 and 9 while j ≤ m.
Step 11: compare the positive count (PC) with the negative count (NC): if PC < NC then Di is an outlying (irrelevant) web document, else Di is a relevant web document.
Step 12: increment i.
Step 13: repeat from step 5 while i ≤ N.

6.3 Correlation method for redundancy checking of web document
The relevant documents extracted in the previous phase are sent to this phase for further processing. First, preprocessing is done for all the documents. Then the ith and (i+1)th documents are taken for redundancy computation. The words common to these documents are extracted, and the term frequency of every common word is found. Next, the correlation coefficient between the two documents is computed. If the correlation value is 1, the two documents are exactly redundant, and the second document is therefore removed from the original document set. This process is repeated for the remaining documents.

Algorithm 2: Redundancy computation using linear correlation
Input: Web documents.
Output: Identification and elimination of redundant web documents.
Step 1: extract the relevant web documents Di, where 1 ≤ i ≤ N.
Step 2: pre-process the entire extracted document.
Step 3: initialize i = 1.

Volume 1, Issue 3, September – October 2012                                                                       Page 129
       Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 1, Issue 3, September – October 2012                                    ISSN 2278-6856


Step 4: initialize j = i + 1.
Step 5: consider the documents Di and Dj.
Step 6: extract the common words present in Di and Dj. Let T be the total number of common words.
Step 7: compute the term frequency TF(Wk) for the common words in Di and Dj, where 1 ≤ k ≤ T.
Step 8: perform the correlation between Di and Dj:
        determine: Xi, the term frequencies of the common words in document Di, and Yj, the term frequencies of the common words in document Dj;
        calculate: ΣXi, ΣYj, Σ(Xi)², Σ(Yj)², ΣXiYj;
        compute: R1 = Σ(Xi)² − (ΣXi)²/T, R2 = Σ(Yj)² − (ΣYj)²/T, R3 = ΣXiYj − (ΣXi · ΣYj)/T;
        perform: Rxy = R3 / √(R1 · R2).
Step 9: if Rxy is equal to 1 then Di and Dj are redundant, hence eliminate Dj from the set of documents; else Di and Dj are not redundant, hence retain both documents.
Step 10: increment j, and repeat from step 5 to step 9 while j ≤ N.
Step 11: increment i, and repeat from step 4 to step 10 while i < N.
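Step 8 above is the standard Pearson correlation computed over the term frequencies of the common words. A minimal Python sketch follows; the whitespace tokenizer is an assumption of this example, and the handling of the zero-variance case (all common words occurring equally often) is an added safeguard, not part of the paper's algorithm.

```python
from collections import Counter

def correlation(doc_i, doc_j):
    """Pearson-style correlation over the term frequencies of the words
    common to both documents (Step 8 of Algorithm 2)."""
    tf_i = Counter(doc_i.lower().split())
    tf_j = Counter(doc_j.lower().split())
    common = sorted(set(tf_i) & set(tf_j))
    T = len(common)
    if T == 0:
        return 0.0                          # no common words: no correlation
    X = [tf_i[w] for w in common]
    Y = [tf_j[w] for w in common]
    r1 = sum(x * x for x in X) - sum(X) ** 2 / T
    r2 = sum(y * y for y in Y) - sum(Y) ** 2 / T
    r3 = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / T
    if r1 == 0 or r2 == 0:
        # Degenerate case: constant frequencies. Treat exact text
        # duplicates as fully correlated (an assumption of this sketch).
        return 1.0 if doc_i == doc_j else 0.0
    return r3 / (r1 * r2) ** 0.5

def remove_redundant(docs):
    """Keep Di and drop any later Dj whose correlation with a kept
    document equals 1, mirroring Steps 9-11."""
    kept = []
    for d in docs:
        if all(abs(correlation(k, d) - 1.0) > 1e-9 for k in kept):
            kept.append(d)
    return kept
```

When Di and Dj are exact copies, X equals Y, so R3 = R1 = R2 and Rxy = 1, which is why the duplicate Copy of Deep_WCM.pdf correlates perfectly with Deep_WCM.pdf in Table 9.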
6.4 Two way rectangular representation for checking redundant document
The two-way rectangular representation for checking redundant documents has two important characteristics:
- it is an upper triangular matrix;
- its diagonal corresponds to a path containing all vertices.
It is shown in Table 6.

Table 6: Rectangular representation for checking redundancy in given documents
  Web documents   D2      D3      D4      D5      D6
  D1              (+,-)   (+,-)   (+,-)   (+,-)   (+,+)
  D2              (0,0)   (+,-)   (+,-)   (+,-)   (+,-)
  D3              (0,0)   (0,0)   (+,-)   (+,-)   (+,-)
  D4              (0,0)   (0,0)   (0,0)   (+,-)   (+,-)
  D5              (0,0)   (0,0)   (0,0)   (0,0)   (+,-)

Table 7: Explanation and comparative study with the signed approach
  Page Comparison   Signed Values   Result
  (Di, Dj)          (+, +)          Redundant document
  (Di, Dj)          (+, -)          Not redundant

F(Di, Dj) = (+, +) or (+, -) if j > i; (0, 0) otherwise.

Note: Since the first co-ordinate is always the one compared with the other pages, it can never carry a negative sign; therefore (-, -) and (-, +) are not considered.

Nomenclature
  (Di, Dj)   the ith web document is compared with the jth web document for redundancy checking.

6.5 Experimental Results
An experimental analysis has been carried out with 150 documents extracted from the web for the web mining domain. These documents are first pre-processed, and then relevancy computation using the signed approach is performed. After that, redundancy computation based on the correlation method is done only for the relevant documents [7]. The results obtained for 15 input documents after relevancy computation are listed in Table 8, and the results of the redundancy computation are projected in Table 9. Finally, the mined web documents without redundancy are computed in Table 10. The precision and recall of the web documents after relevancy computation and redundancy computation are given in Figure 7.

Table 8: Experimental results of Relevancy Computation
  D.No   Document Name                          Result
  D1     An integrated framework for WCM.pdf    Relevant
  D2     Software engineering.pdf               Irrelevant
  D3     Deep_WCM.pdf                           Relevant
  D4     Copy of Deep_WCM.pdf                   Relevant
  D5     Elimination of Redundant Links.pdf     Irrelevant
  D6     Framework_WCOM.pdf                     Relevant
  D7     Identify duplicated content.pdf        Irrelevant
  D8     Medical Mining.pdf                     Irrelevant
  D9     Neural Analysis.pdf                    Irrelevant
  D10    Outlier_lattice.pdf                    Irrelevant
  D11    Page content rank.pdf                  Relevant
  D12    Signed Approach.pdf                    Irrelevant
  D13    WCM.pdf                                Relevant
  D14    Fuzzy approach.pdf                     Irrelevant
  D15    Copy Page Content rank.pdf             Relevant

Table 9: Experimental results of Redundancy Computation
  D.No   D3      D4      D6      D11     D13
  D1     0.277   0.277   0.306   0.056   0.3046
  D3     *       1       0.152   0.174   0.1733
  D4     *       *       0.152   0.174   0.1733
  D6     *       *       *       0.064   0.3739
  D11    *       *       *       *       0.2527

6.6 Signed approach for retrieving unique document
The signed graph approach is applied to the output of the relevance analysis and redundancy computation phases. The trust rating for the signed weight (+, -) is assigned a + sign, and for the remaining signed weights ((+, +), (-, +), (-, -)) the trust rating is assigned a - sign. The final decision is made on the basis of the trust rating: if the trust rating holds a + sign, it indicates a mined web document. A relevant document without redundancy implies a unique mined document.

Table 12: Signed approach for Mined Document
  Page Comparison   Signed Values   Result
  (Di, Dj)          (+, +)          Redundant document
  (Di, Dj)          (+, -)          Not redundant
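The trust-rating rule of Section 6.6 reduces to a small lookup. The following Python sketch is illustrative only; encoding each signed weight as a pair of "+"/"-" strings is an assumption of this example, not the paper's representation.

```python
def trust_rating(signed_weight):
    """Section 6.6 rule: only the (+, -) pattern, meaning relevant and
    not redundant, earns a positive trust rating; the remaining patterns
    (+,+), (-,+) and (-,-) are rated negative."""
    return "+" if signed_weight == ("+", "-") else "-"

def mined_documents(comparisons):
    """comparisons maps (Di, Dj) pairs to their signed weights; a pair
    with a positive trust rating denotes a unique mined document."""
    return [pair for pair, w in comparisons.items() if trust_rating(w) == "+"]

# Example: D2 duplicates D1's content, so (D1, D2) is rated (+, +) and
# only the (D1, D3) comparison survives with positive trust.
example = {("D1", "D2"): ("+", "+"), ("D1", "D3"): ("+", "-")}
print(mined_documents(example))   # prints: [('D1', 'D3')]
```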
Table 10: Resultant Document
  D.No   Document Name
  D1     An integrated framework for WCM.pdf
  D3     Deep_WCM.pdf
  D6     Framework_WCOM.pdf
  D11    Page content rank.pdf
  D13    WCM.pdf

Table 11: Precision of the proposed approach
  Dataset   Relevant documents        Relevant documents   Precision
  size      through signed approach   computed manually
  50        30                        35                   0.88
  75        50                        58                   0.88
  100       60                        70                   0.88
  150       85                        100                  0.87
  200       112                       130                  0.88

  Figure 7. Precision and Recall

7. CONCLUSION
Web content mining is a growing research area within web mining for information retrieval, and retrieving relevant content from the huge web data repository is an important task. This paper presents a comparative study of three approaches, namely the statistical, signed and mathematical approaches. The statistical approach ensures that memory space is reduced and improves the accuracy of search results by eliminating irrelevant documents through hypothesis testing; as the efficiency of the web content increases, the quality of the search engine results also improves. In the signed approach, the positive count is compared with the negative count to check the relevancy of a web page: if the positive count is less than the negative count, the page is irrelevant, otherwise it is relevant. Memory space, search time and run time are all reduced by using an organized domain dictionary rather than a normal indexed dictionary for checking the relevancy of web documents; the method is simple to implement, and the algorithm can be used by business personnel to keep track of the positive and negative aspects related to their business. The mathematical approach is based on the signed approach for relevancy computation, correlation for redundancy computation over the relevant documents, and a rectangular representation of trust ratings to mine related web content without duplication for both structured and unstructured web documents.

Although the results were computed on different data sets, the mathematical approach is considered the best among the three, as it not only searches the relevant documents efficiently but also performs the redundancy computation on the relevant pages; its precision and recall are the highest among the three, and the quality of the search results obtained through this approach is accurate. The statistical approach is also efficient in generating accurate results. In future work, further modifications can be made to the mathematical and statistical algorithms to remove noise and
redundancy, in order to improve the performance of search engines.

REFERENCES
[1] A. Singh, "Agent Based Framework for Semantic Web Content Mining," International Journal of Advancements in Technology, 2012.
[2] D. Alassi, R. Alhajj, "Effectiveness of template detection on noise reduction and websites summarization," Information Sciences, 2012, www.elsevier.com/locate/ins.
[3] F. Johnson, S.K. Gupta, "Web Content Mining Techniques: A Survey," International Journal of Computer Applications, Vol. 47, No. 11, June 2012.
[4] G. Poonkuzhali, "Detection And Removal Of Redundant Web Content Through Rectangular And Signed Approach," International Journal of Engineering Science and Technology, Vol. 2(9), pp. 4026-4032, 2010.
[5] G. Poonkuzhali, K. Thiagarajan, K. Sarukesi, "Elimination of Redundant Links in Web Pages - Mathematical Approach," World Academy of Science, Engineering and Technology 28, 2009.
[6] G. Poonkuzhali, K. Thiagarajan, K. Sarukesi, "Set Theoretical Approach for Mining Web Content through Outliers Detection," International Journal of Engineering Research & Industrial Applications (IJERIA), Vol. 2, No. I, pp. 131-138, 2009.
[7] G. Poonkuzhali, K. Sarukesi, G.V. Uma, "Web Content Outlier Mining Through Mathematical Approach and Trust Rating," Recent Researches in Applied Computer and Applied Computational Science, 2010.
[8] G. Poonkuzhali, K. Thiagarajan, K. Sarukesi, G.V. Uma, "Signed Approach for Mining Web Content Outliers," World Academy of Science, Engineering and Technology 56, 2009.
[9] G. Poonkuzhali, R.K. Kumar, R.K. Keshav, K. Thiagarajan, K. Sarukesi, "Effective Algorithms for Improving the Performance of Search Engine Results," Issue 3, Volume 5, 2011.
[10] G. Poonkuzhali, R.K. Kumar, R.K. Keshav, K. Thiagarajan, K. Sarukesi, "Statistical Approach for Improving the Quality of Search Results," Recent Researches in Applied Computer and Applied Computational Science, 2011.
[11] G. Poonkuzhali, R.K. Kumar, P. Sudhakar, G.V. Uma, K. Sarukesi, "Relevance Ranking and Evaluation of Search Results through Web Content Mining," Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS), Hong Kong, Vol. 1, March 14-16, 2012.
[12] M. Mathew, S.N. Das, T.R.L. Narayanan, "A Novel Approach for Near-Duplicate Detection of Web Pages using TDW Matrix," International Journal of Computer Applications, Vol. 19, No. 7, April 2011.
[13] N.S. Kumar, P.M. Duraj Raj Vincent, "Web Mining - An Integrated Approach," International Journal of Advanced Research in Computer Science & Software Engineering, Vol. 2, Issue 3, March 2012.
[14] P. Sivakumar, R.M.S. Parvathi, "An Efficient Approach of Noise Removal from Web Page for Effectual Web Content Mining," European Journal of Scientific Research, Vol. 50, No. 3, pp. 340-351, 2011.
[15] S.N. Das, M. Mathew, P.K. Vijayaraghavan, "An Efficient Approach for Finding Near Duplicate Web Pages using Minimum Weight Overlapping Method," International Journal of Electrical and Computer Engineering (IJECE), Vol. 1, No. 2, pp. 187-194, December 2011.
[16] S.N. Mishra, A. Jaiswal, A. Ambhaikar, "An Effective Algorithm for Web Mining Based on Topic Sensitive Link Analysis," International Journal of Advanced Research in Computer Science & Software Engineering, Vol. 2, Issue 4, April 2012.
[17] T. Gupta, L. Banda, "A Hybrid Model For Detection And Elimination Of Near-Duplicates Based On Web Provenance For Effective Web Search," International Journal of Advances in Engineering & Technology, Vol. 4, Issue 1, pp. 192-205, July 2012.
[18] W.R.W. Zulkifeli, N. Mustapha, A. Mustapha, "Classic Term Weighting Technique for Mining Web Content Outliers," International Conference on Computational Techniques and Artificial Intelligence (ICCTAI), Penang, Malaysia, 2012.
