Revista Informatica Economică, nr. 4(48)/2008

                       Web Metasearch Result Clustering System
                                             Adina LIPAI
                            Academy of Economic Studies, Bucharest, România

       The paper presents a web search result clustering algorithm that was integrated into a desktop application. The application aims to increase the performance of web search engines by reducing the user effort in finding a web page in the list of results returned by the search engine.
Keywords: clustering, web search, search engines.

Introduction
The paper presents a web search results clustering system. The purpose of the application is to minimize the user effort in finding the required web page in the result list returned for a common query. The system is a desktop application that creates a series of subject-oriented groups, based on the subject of the web pages. Semantically similar pages are placed together in the same group. The task of finding the appropriate web page is thus reduced to finding the subject cluster in which it was placed. Each cluster is identified by a descriptive label.
A web document search result clustering system has the following functions: user query retrieval, interrogation of one or several online search engines, query results retrieval, web document processing, web document clustering and results visualisation. In the following pages we present the architectural components that achieve this functionality.

Main architectural components
User interface: its task is to retrieve the user query and other user preferences, such as the search engines to be used in the metasearch, visualization preferences, and clustering parameters. Figure 1 shows the user interface of the application. The clustering process can be started after all search engines are loaded into the browser and the query has been passed to them. The results offered by the search engines are parsed from the web browsers embedded in the application, using a series of text parsing operations. The application offers a few specific user preferences for both the level of processing visualisation and the clustering parameters. Some of these are: language preference, number of search engines to be used, detailed visual output, number of clusters, and others.
Processing modules: the application has three major processing modules: the document processing module, the document vector space representation module, and the clustering module.

                        Fig.1. Web result clustering software: user interface

Document processing: this task is essential in obtaining high quality results. The main function of this module is to obtain a condensed description of the web results by eliminating the words and characters that have no informative value. The description is used to obtain the term index vector. The term index vector is used to obtain the document-word matrix, which is used in the clustering process.
The document-index term matrix contains the documents in numeric form, obtained by applying the vector space model transformation. This representation is used for the clustering process. Two different formulas are used for calculating the weight of a word in a document: term frequency-inverse document frequency and term frequency-linear inverse document frequency ([Osiński, 2003]). The clustering is performed using the k-means clustering algorithm.

Application functioning principles
Figure 2 shows the main functioning principles of the application ([Lipai, 2007]). The user has access to the application interface, which is used to provide a query. The k-means algorithm used for clustering is adapted for document processing and uses a similarity measure as the distance evaluator. The application forms 6 clusters by default, but it can form between 2 and 10 clusters, according to user preferences. An additional cluster, number N + 1 where N is the requested number of clusters, is formed from the unclassified instances.
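
The term weighting used to build the document-index term matrix can be illustrated with a minimal sketch. It assumes documents arrive as lists of already cleaned words, and it follows Formula 1 and Formula 2 given later in the paper. The function names (`build_index`, `document_frequency`, `weight_matrix`) are hypothetical, and multiplying the term frequency by the linear factor of Formula 2 is an assumption, since the paper states only the factor itself.

```python
import math


def build_index(docs):
    """Sorted list of the unique words across all tokenized documents."""
    return sorted({word for doc in docs for word in doc})


def document_frequency(docs, index):
    """df[i]: number of documents that contain index term i."""
    return [sum(1 for doc in docs if term in doc) for term in index]


def weight_matrix(docs, linear=False):
    """Build the matrix with documents as rows and index terms as columns.

    linear=False: w = tf * log(n / df)        (Formula 1)
    linear=True:  w = tf * (n - df) / (n - 1) (tf times Formula 2, assumed)
    """
    index = build_index(docs)
    df = document_frequency(docs, index)
    n = len(docs)
    matrix = []
    for doc in docs:
        row = []
        for term, df_i in zip(index, df):
            tf = doc.count(term)  # raw frequency of the term in this document
            if linear:
                idf = (n - df_i) / (n - 1) if n > 1 else 0.0
            else:
                idf = math.log(n / df_i)
            row.append(tf * idf)
        matrix.append(row)
    return index, matrix
```

A term that occurs in every document receives weight zero under both formulas, which is why very common words contribute nothing to the similarity between documents.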

[Figure 2 diagram. User interface, user query module: search engine selection, user query submission, optional search preferences and clustering parameters. Search engine interface module: search engine query and result retrieval (Google, AltaVista, Yahoo!, Lycos, MSN). Document preprocessing: web page processing, HTML tag cleansing, stop word cleansing, text processing, document transformation into high-level information words. Vector space model implementation: word vector index, word weight calculation in document, document-index term matrix construction. Clustering module: initial centroid selection, cluster building, similarity calculation, document distribution in clusters, centroid recalculation. User interface, result visualization module: result visualization.]

                         Fig.2. Application normal execution schema
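
The document preprocessing stage of Figure 2 (HTML tag cleansing and stop word cleansing, plus duplicate elimination) can be sketched as follows. This is an illustration under stated assumptions, not the application's code: the stop-word list is a tiny hypothetical English sample, results are assumed to be `(url, snippet)` pairs, duplicates are detected by identical URL, and root extraction is left out because the paper does not detail its rules.

```python
import html
import re

# Hypothetical, tiny stop-word list; a real list would be language specific.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

TAG_RE = re.compile(r"<[^>]+>")     # anything that looks like an HTML tag
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters, Unicode aware


def clean_snippet(raw):
    """Return the informative lowercase words of one raw HTML snippet."""
    text = TAG_RE.sub(" ", raw)     # HTML tag cleansing
    text = html.unescape(text)      # decode entities such as &amp;
    words = WORD_RE.findall(text.lower())
    return [w for w in words if w not in STOP_WORDS]  # stop word cleansing


def preprocess(results):
    """Clean (url, snippet) pairs, keeping the first result per URL."""
    seen = set()
    cleaned = []
    for url, snippet in results:
        if url in seen:             # duplicate site elimination
            continue
        seen.add(url)
        cleaned.append((url, clean_snippet(snippet)))
    return cleaned
```

Digits and punctuation are dropped by the word pattern, which plays the role of the stop character elimination mentioned in the paper.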

Document pre-processing
Results are obtained by retrieving the result lists from several search engines. Document processing involves: HTML tag elimination, elimination of duplicate sites, elimination of stop words and stop characters, root extraction and other natural language processing. The root extraction module was built using basic Romanian grammatical rules.

Vector space model implementation
Implementing the vector space model for a set of documents retrieved as the result of a web search consists of transforming the string of words that makes up a snippet into an equivalent numeric vector. One document is represented by n numeric elements, where n is the number of unique words present in all retrieved documents. The value of each element is given by the presence of that word in the document corpus.
The first step in implementing the vector space model is the construction of the index term vector. The index term vector is built from all the unique words of the snippets and titles of the retrieved documents. Each retrieved result is divided into its component words, which are added to the index vector ([Weiss, 2001]).
The second step is document vectorisation: each document is transformed into its numeric vector representation. In the vector space model each document is represented by a numeric vector of length n, where n is the dimension of the index term vector. The value of each word is calculated with two different formulas: term frequency-inverse document frequency and term frequency-linear inverse document frequency ([Osiński, 2003], [Wróblewski, 2003]).
Term frequency-inverse document frequency:
$$w_{ij} = tf_{ij} \cdot \log\frac{n}{df_i}$$ (Formula 1)
Term frequency-linear inverse document frequency:
$$idfl_i = 1 - \frac{df_i - 1}{n - 1} = \frac{n - df_i}{n - 1}$$ (Formula 2)
Where:
• $w_{ij}$: the weight of index term $c_i$ in document $D_j$;
• $tf_{ij}$: the frequency of term $c_i$ in document $D_j$;
• $df_i$: the number of documents in which the term $c_i$ appears;
• $n$: the total number of documents.
After document vectorisation and calculation of the word weights, the words that appear in only one document are eliminated. The last step in representing the documents according to the vector space model is constructing the document-term matrix, which is used by the clustering algorithm.
In this matrix the documents are represented by rows, in the form of weighted vectors, and the index terms by columns.

Clustering algorithm implementation
Initial centroid calculation
One of the major disadvantages of the k-means clustering algorithm is its high computation time, caused by the large number of iterations it performs. This disadvantage can be unacceptable in web results clustering applications, where the computation time has to be very low. In this application we have implemented an initial centroid calculation method whose purpose is to reduce the computation time. The algorithm consists of dividing the data set into k groups, where k is the number of clusters we want to obtain. For each group, we calculate the average of each index term. These average vectors are the initial centroids. The algorithm follows the basic principles of partitioning clustering algorithms.

Implementation of the clustering algorithm
The clustering task consists of calculating the similarity between each document and each cluster. Each document is assigned to the closest centroid. The most important adaptations of the k-means algorithm that had to be made for web document processing are: the use of a similarity measure as the distance measure, and the use of soft assignment for cluster forming. In the end we have a set of k + 1 overlapping clusters, where the last cluster consists of the unclassified documents.

Cluster representation
After distributing each document to its cluster, the clusters have to be visualized. The first step in forming a visual representation is label formation. The label can be made up of one or more representative words. The label is chosen from the centroid, and consists of the index terms with the highest weights.
Figure 3 shows the output returned by the application for the user query “refractie”, showing the label representation in detail.

               Fig.3. Detailed representation of cluster labels: application output
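
The clustering and labelling steps described above can be sketched in a few functions. This is a minimal illustration under assumptions, not the application's implementation: cosine similarity stands in for the unspecified similarity measure, the data set is split into k consecutive groups, and the `threshold` used for soft assignment is a hypothetical parameter that the paper does not give.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Average of each index term over a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def initial_centroids(vectors, k):
    """Split the data set into k consecutive groups and average each one."""
    n = len(vectors)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of documents")
    return [mean_vector(vectors[i * n // k:(i + 1) * n // k]) for i in range(k)]


def soft_kmeans(vectors, k, threshold=0.5, max_iter=20):
    """Return k overlapping clusters plus a last cluster of unclassified docs.

    Each cluster is a list of document positions. `threshold` is an assumed
    cut-off for soft assignment, not a value taken from the paper.
    """
    centroids = initial_centroids(vectors, k)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k + 1)]
        for d, vec in enumerate(vectors):
            sims = [cosine(vec, c) for c in centroids]
            members = [j for j, s in enumerate(sims) if s >= threshold]
            for j in members or [k]:  # no match: goes to cluster k + 1
                clusters[j].append(d)
        updated = [
            mean_vector([vectors[d] for d in clusters[j]]) if clusters[j] else centroids[j]
            for j in range(k)
        ]
        if updated == centroids:      # centroids stable: converged
            break
        centroids = updated
    return clusters, centroids


def label(centroid, index, size=2):
    """Index terms with the highest centroid weights."""
    ranked = sorted(range(len(index)), key=lambda i: centroid[i], reverse=True)
    return [index[i] for i in ranked[:size]]
```

Because a document joins every cluster whose centroid is similar enough, clusters may overlap, and a document that matches none of them ends up in the final, unclassified cluster.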

Conclusion
The paper presented a web metasearch result clustering system. The clustering application has the purpose of reducing the user effort in finding a specific web document among the hundreds of documents returned by a search engine for a common query. The paper presented the main architectural components of a clustering system and its functionality. The steps necessary to cluster a result list were also presented, ending with the visualisation of the clusters.

References
1. [Osiński, 2003] Osiński, S. - An Algorithm for Clustering of Web Search Results, Master's thesis, Poznań University of Technology, Poland, 2003.
2. [Weiss, 2001] Weiss, D. - A Clustering Interface for Web Search Results in Polish and English, Master's thesis, Poznań University of Technology, Poland, June 2001.
3. [Wróblewski, 2003] Wróblewski, M. - A Hierarchical WWW Pages Clustering Algorithm Based on the Vector Space Model, Master's thesis, Department of Computing Science, Poznań University of Technology, Poland, 2003.
4. [Lipai, 2007] Lipai, A. - Tehnici de clusterizare a rezultatelor returnate într-o căutare web, in Proceedings of The Eighth International Conference on Informatics in Economy, Informatics in Knowledge Society, ASE, Bucharest, May 2007.
