Leverage the BinRank’s performance using HubRank

ISSN 2320 2610
Volume 1, No.2, November - December 2012
International Journal of Multidisciplinary in Cryptology and Information Security
Dayananda P et al., International Journal of Multidisciplinary in Cryptology and Information Security, 1 (2), November - December 2012, 13-16
                           Available Online at

                   Leverage the BinRank’s performance using HubRank
                                and Parallel Computing
                                                       Dayananda P1, Thnga Selvi A2
                      Assistant Professor, Department of Information Science and Engg, MSRIT, Bangalore-54
                          Department of Information Science and Engg, MSRIT, Bangalore-54,

Abstract: Over the past decade, the amount of data generated and posted on the web has increased exponentially. High processing speed, quick retrieval and efficient handling of data are of utmost importance. Searching the web with a keyword and retrieving the relevant documents has become an important and interesting task; these two issues, fast processing and relevant retrieval, should not be compromised. This paper proposes an approach that approximates BinRank by integrating it with HubRank and parallelizes the work, i.e. executes the activities simultaneously, to reduce query execution time and also increase the relevance of the results.

Keywords: BinRank, HubRank, ObjectRank.

INTRODUCTION

Dynamic authority-based keyword search algorithms such as ObjectRank and Personalized PageRank (PPR) leverage semantic link information to provide high-recall searches on the web. Most of these algorithms perform iterative computation over the full graph. If the graph is too large, such computation becomes infeasible at query time. The concept of BinRank was introduced to address this. BinRank is a system that approximates ObjectRank results by using a hybrid approach, in which a number of relatively small subsets of the data graph are materialized. Any query is answered by running ObjectRank on only one of the subgraphs, which lets BinRank achieve subsecond query execution time. The HubRank system presents a viable way to dynamically personalize PageRank [1] at query time on ER graphs by utilizing clever hubset selection strategies and early termination bounds. HubRank is highly scalable for small graphs when compared to ObjectRank. Also, since BinRank construction is precomputed offline, we plan to parallelize the bin construction activity and execute HubRank [4] on the subgraph that BinRank generates.

There are many existing ranking mechanisms. In this section we discuss some of them. PageRank is a popular and simple algorithm used by Google's web search. It works as follows: it models a random Web surfer who starts at a random Web page and follows outgoing links with uniform probability. The biggest advantage of PageRank is its simplicity. Its disadvantage is that the search returns only the documents that contain the keyword; documents that may be more relevant to the search but do not contain the keyword are ignored. Dynamic versions of the PageRank algorithm, such as Personalized PageRank (PPR) for Web graph datasets, modify PageRank to perform a search personalized on a base set that contains the web pages a user is interested in. However, Personalized PageRank suffers from scalability problems.

The ObjectRank system applies the random walk model, the effectiveness of which is proven by Google's PageRank, to keyword search in databases modeled as labeled graphs. The system ranks the database objects with respect to the user-provided keywords. ObjectRank extends Personalized PageRank (PPR) to perform keyword search in databases. ObjectRank uses a query term posting list as a set of random walk starting points and conducts the walk on the instance graph of the database. ObjectRank has successfully been applied to databases that have social networking components, such as bibliographic data and collaborative product design. ObjectRank suffers from the same scalability issues as Personalized PageRank, as it requires multiple iterations over all nodes and links of the entire database graph. The original ObjectRank system has two modes: online and offline. The online mode runs the ranking algorithm once the query is received, which takes too long on large graphs. In the offline mode, ObjectRank precomputes top-k results for a query workload in advance. This precomputation is very expensive and requires a lot of storage space for the precomputed results.

HubRank is a search system based on ObjectRank that improves the scalability of ObjectRank by combining the hub-based approach and the Monte Carlo approach [2]. It initially selects a fixed number of hub nodes by using a greedy hub selection algorithm that utilizes a query workload in order to minimize the query execution time.
                                                             @2012, IJMCIS All Rights Reserved

HubRank is highly scalable for smaller graphs because fewer hubs are considered and the computation terminates early.
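
The random-walk ranking shared by PageRank, Personalized PageRank and ObjectRank can be illustrated with a short power iteration. This is only a minimal sketch under stated assumptions: the toy graph, the function name `personalized_pagerank` and the damping factor of 0.85 are illustrative choices and are not taken from any of the cited systems.

```python
def personalized_pagerank(graph, base_set, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration for a random walk that restarts in `base_set`.

    graph: dict mapping each node to the list of its out-neighbours.
    base_set: nodes the surfer jumps back to (for example, nodes that
    contain the query keyword). All names here are hypothetical.
    """
    nodes = list(graph)
    # Restart distribution: uniform over the base set, zero elsewhere.
    restart = {n: (1.0 / len(base_set) if n in base_set else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(max_iter):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                # Follow each outgoing link with uniform probability.
                share = damping * score[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: send its mass back to the base set.
                for m in nodes:
                    nxt[m] += damping * score[n] * restart[m]
        delta = sum(abs(nxt[n] - score[n]) for n in nodes)
        score = nxt
        if delta < tol:
            break
    return score


toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = personalized_pagerank(toy_graph, base_set={"a"})
top = sorted(scores, key=scores.get, reverse=True)
```

Every iteration touches all nodes and links of the graph, which is the scalability problem the text attributes to Personalized PageRank and ObjectRank on large graphs.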

  The implementation procedure comprises a generic algorithm [4] and a parallel computing procedure. The algorithm has two main phases: a pre-computation phase followed by a query processing phase.

Pre-computation phase
This phase happens in two steps. Step 1 takes as input the set of keywords in the entire database, also called the workload w, and outputs a set of term bins. Step 2 takes the output of step 1 as input and returns the set of materialized subgraphs (MSGs) as output.

Bin construction
The bin construction algorithm packs terms into bins by partitioning the workload w into a set of bins composed of frequently co-occurring terms. The algorithm takes a single parameter, maxBinSize, which limits the size of a bin's posting list, i.e. the union of the posting lists of all terms in the bin. During bin construction, the bin identifier of each term is inserted into the Lucene index as an additional field. This allows us to identify the corresponding bin, and hence the MSG, for a given query at query time.
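
The packing step can be sketched as a greedy procedure. The code below is a simplified illustration, not the published BinRank algorithm: the function `build_bins`, the toy posting lists and the tie-breaking rule are assumptions, and a plain dictionary (`term_to_bin`) stands in for the extra field stored in the Lucene index.

```python
def build_bins(posting_lists, max_bin_size):
    """Greedily pack terms into bins whose merged posting list stays small.

    posting_lists: dict term -> set of ids of documents containing the term.
    max_bin_size: upper bound on the size of a bin's merged posting list.
    Returns (bins, term_to_bin). All names are hypothetical.
    """
    bins = []          # each bin: {"terms": [...], "docs": set(...)}
    term_to_bin = {}   # stand-in for the bin identifier field in the index
    # Place terms with large posting lists first (ties broken by term name).
    for term in sorted(posting_lists, key=lambda t: (-len(posting_lists[t]), t)):
        docs = posting_lists[term]
        best, best_overlap = None, -1
        for i, b in enumerate(bins):
            # The term fits if the union of posting lists stays within the limit.
            if len(b["docs"] | docs) <= max_bin_size:
                # Prefer the bin it co-occurs with most (largest overlap).
                overlap = len(b["docs"] & docs)
                if overlap > best_overlap:
                    best, best_overlap = i, overlap
        if best is None:
            # No bin can take the term; open a new one. A term whose own
            # posting list exceeds the limit ends up alone in its bin.
            bins.append({"terms": [], "docs": set()})
            best = len(bins) - 1
        bins[best]["terms"].append(term)
        bins[best]["docs"] |= docs
        term_to_bin[term] = best
    return bins, term_to_bin


postings = {
    "xml": {1, 2, 3},
    "query": {2, 3, 4},
    "graph": {7, 8},
    "rank": {8, 9},
}
bins, term_to_bin = build_bins(postings, max_bin_size=4)
```

With these toy posting lists, "xml" and "query" share documents and land in one bin, while "graph" and "rank" share another, so a query on any of these terms can be routed to a single bin.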

MSG generation
BinRank uses the ObjectRank algorithm to generate a Materialized SubGraph (MSG) for each bin. Since the HubRank algorithm is more scalable than the ObjectRank algorithm for smaller graphs, we plan to use HubRank instead of ObjectRank to generate the MSG itself. We need to keep each MSG as small as possible to achieve higher efficiency with HubRank. For this purpose we plan to produce a larger number of bins, i.e. MSGs, so that each bin is small enough for the HubRank algorithm to process efficiently. Because there are more bins, the query processing time might be delayed. To overcome this we also parallelize the MSG generation. Since each bin, and hence its MSG, is independent of the others, the MSG generation process is well suited to parallel computing.

Fig 1 shows the stages in the pre-computation phase.

Fig 1: Block diagram of Pre-computation phase

Query processing phase
For a given keyword query we find the base set q and the bin identifier. With these two pieces of information we determine the MSG on which HubRank is to be applied to return the top-k results.

Multi-keyword queries are processed by taking each individual keyword separately. For a query that is a union of keywords, we get the MSG for each individual keyword and run HubRank separately on each MSG to return the top-k relevant entries. Since parallel computing is an emerging technique, we execute HubRank on the MSGs of a multi-keyword query by running the HubRank algorithm on multiple cores simultaneously.
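
The two phases can be sketched end to end as follows. This is an illustration under several assumptions: `build_msg` (a bounded breadth-first expansion) and `rank_on_msg` (a short random walk with restart) are hypothetical stand-ins for the actual MSG generation and for HubRank, the toy data is invented, and a thread pool is used only to keep the sketch portable; CPU-bound work would normally run on separate processes or on a GPU.

```python
from concurrent.futures import ThreadPoolExecutor


def build_msg(graph, base_nodes, max_hops=2):
    """Stand-in for MSG generation: keep nodes within max_hops of the bin's base set."""
    keep, frontier = set(base_nodes), set(base_nodes)
    for _ in range(max_hops):
        frontier = {m for n in frontier for m in graph.get(n, ())} - keep
        keep |= frontier
    # Induced subgraph: drop links that leave the kept node set.
    return {n: [m for m in graph.get(n, ()) if m in keep] for n in keep}


def rank_on_msg(msg, base_set, k=3, damping=0.85, iterations=50):
    """Stand-in ranker: random walk with restart, run on the subgraph only."""
    base = [n for n in base_set if n in msg]
    restart = {n: (1.0 / len(base) if n in base else 0.0) for n in msg}
    score = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in msg}
        for n, out in msg.items():
            for m in out:  # nodes without out-links simply lose their mass here
                nxt[m] += damping * score[n] / len(out)
        score = nxt
    return sorted(score, key=score.get, reverse=True)[:k]


def precompute(graph, bins):
    """Pre-computation: one MSG per bin; bins are independent, so run them concurrently."""
    with ThreadPoolExecutor() as pool:
        futures = {b: pool.submit(build_msg, graph, docs) for b, docs in bins.items()}
        return {b: f.result() for b, f in futures.items()}


def answer(query_terms, term_to_bin, postings, msgs, k=3):
    """Query processing: rank each keyword on its own bin's MSG, one task per keyword."""
    with ThreadPoolExecutor() as pool:
        futures = {
            t: pool.submit(rank_on_msg, msgs[term_to_bin[t]], postings[t], k)
            for t in query_terms
        }
        return {t: f.result() for t, f in futures.items()}


graph = {1: [2], 2: [3], 3: [1], 4: [5], 5: [4], 6: [1]}
postings = {"xml": {1}, "query": {2}, "graph": {4}}   # term -> base set
term_to_bin = {"xml": 0, "query": 0, "graph": 1}      # bin identifier per term
bins = {0: {1, 2}, 1: {4}}                            # bin -> merged posting list
msgs = precompute(graph, bins)
results = answer(["xml", "graph"], term_to_bin, postings, msgs)
```

Each query touches only the small subgraph of its bin rather than the full graph, which is the property the approach relies on for fast query execution.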
HubRank is a search system based on ObjectRank that improves the scalability of ObjectRank by combining the hub-based and Monte Carlo approaches [2]. It first selects a fixed number of hub nodes by using a greedy hub selection algorithm that utilizes a query workload in order to minimize the query execution time.


Given a set of hub nodes H, it materializes the fingerprints of the hub nodes in H. At query time, it generates an active subgraph by expanding the base set with its neighbors. It stops following a path when it encounters a hub node whose Personalized PageRank vector (PPV) was materialized, or when the distance from the base set exceeds a fixed maximum length. The efficiency of query processing and the quality of query results are very sensitive to the size of H and the hub selection scheme. The dynamic pruning plays a key role in outperforming ObjectRank by a noticeable margin. Fig 2 shows the stages in the query processing phase.

Fig 2: Block diagram of Query processing phase

Parallel computing approach
The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. The GPU, as a specialized processor, addresses the demands of compute-intensive, real-time, high-resolution 3D graphics tasks. As of 2012, GPUs have evolved into highly parallel multicore systems allowing very efficient manipulation of large blocks of data. This design is more effective than general-purpose CPUs for algorithms where processing of large blocks of data is done in parallel.

In this context, let us see how we intend to utilize parallel computing. To achieve our goal we propose the use of the Single Instruction, Multiple Data (SIMD) architecture, in which several processors follow the same set of instructions but each processor feeds different data into those instructions. This is useful for analyzing large chunks of data based on the same criteria. Many complex computational problems do not fit this model. Our algorithm does fit it, especially the precomputation phase, which requires us to build subgraphs based on keywords. If the same algorithm is implemented using the SIMD concept on a parallel computing device, the speed of precomputation increases markedly.

The Compute Unified Device Architecture (CUDA) programming model provides a straightforward means of describing inherently parallel computations, and NVIDIA's Tesla GPU architecture delivers high computational throughput on massively parallel problems.

Fig 3: Block diagram of process flow of CUDA

As shown in Fig 3, all the processors in the GPU execute the same logic but on a different data instance. For each bin of terms a materialized subgraph has to be constructed. For each individual bin, a separate processor will build a materialized subgraph. Though there is an initial investment in hardware, it is traded off against the speed of execution. This parallelization will eventually decrease the overall time required to execute the algorithm.

Let us consider a statistical computation of the algorithm's execution time and compare it with its parallel computation [6]. If a parallel program is executed on a computer having p processors, the least possible execution time will be equal to the sequential time divided by the number of processors.
If Tp is the parallel execution time, Ts is the sequential execution time and p is the number of processors in the computer, then

$$T_p = \frac{T_s}{p} \qquad (1)$$
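
The ideal relation between sequential and parallel time, Tp = Ts/p, can be checked numerically. The sketch below uses the 12-hour, 4-processor figures that this paper quotes for the Wikipedia data set; the Amdahl-style helper and the serial fraction of 0.10 are purely illustrative assumptions and are not measurements or formulas taken from the paper.

```python
def ideal_parallel_time(sequential_time, processors):
    """Ideal case: T_p = T_s / p (no communication cost, balanced workload)."""
    return sequential_time / processors


def amdahl_parallel_time(sequential_time, processors, serial_fraction):
    """Amdahl-style estimate: only the (1 - alpha) part of the work is divided
    among the processors; the alpha part still runs sequentially."""
    alpha = serial_fraction
    return sequential_time * (alpha + (1.0 - alpha) / processors)


def speedup(sequential_time, parallel_time):
    """Speedup S = T_s / T_p."""
    return sequential_time / parallel_time


t_s = 12.0   # hours on a single CPU, as quoted in the paper
p = 4        # number of processors, as in the paper's example
t_ideal = ideal_parallel_time(t_s, p)
t_amdahl = amdahl_parallel_time(t_s, p, serial_fraction=0.10)  # alpha is hypothetical

print(t_ideal, speedup(t_s, t_ideal))
print(t_amdahl, speedup(t_s, t_amdahl))
```

In the ideal case the speedup equals the number of processors; any nonzero serial fraction pushes the achievable speedup below p.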



A measure called the speedup value is the ratio of the sequential execution time to the parallel execution time. The maximum speedup value could be achieved in an ideal multiprocessor system where there are no communication costs and the workload of the processors is balanced. In such a system, every processor needs Ts/p time units in order to complete its job, so the speedup value will be as follows.

Let the speedup value be S:

$$S = \frac{T_s}{T_p} = \frac{T_s}{T_s / p} = p \qquad (2)$$

This leads to S = p. Previous statistics claim that for a Wikipedia data set, precomputing about a thousand subgraphs takes about 12 hours on a single CPU [4]. If the same is implemented using parallel computing with a SIMD architecture, then with Ts = 12 hours and the formula Tp = Ts/p discussed previously, and taking the number of processors as 4, Tp will ideally be approximately 3 hours to build the same 1000 subgraphs.

In general, if Ts is the sequential time taken for implementing one subgraph, then when done in parallel using n processors, the total time taken to construct x subgraphs will be, considering ideally:

(3)

Tp is the time taken to build 1 subgraph in parallel.

(4)

The above formula holds good in ideal conditions only. According to Amdahl's law, it is very difficult, even in an ideal parallel system, to obtain a speedup value equal to the number of processors, because each program, in terms of running time, has a fraction that must be executed sequentially. The total time would become

(5)

Tt is the total time taken to build the subgraphs, α is the fraction of code which has to be executed sequentially, (1-α) is the part of the code which builds subgraphs and can be done in parallel, and β accounts for the internal tradeoff.

(6)

Ts will be the total time taken to execute the pre-computation phase when done in parallel.

CONCLUSION

In this paper, we proposed an approach to increase the performance of BinRank using HubRank and to parallelize, i.e. execute simultaneously, the creation of bins, in order to reduce query execution time and also increase the relevance of the results. This work can be further enhanced by providing a threat detection system to some extent: potentially illegal keywords are stored in a database, and the database is checked before a query is submitted to ensure that the search is not on these words.

REFERENCES

[1] D. Fogaras, B. Rácz, K. Csalogány, and T. Sarlós, “Towards Scaling Fully Personalized PageRank: Algorithms, Lower Bounds, and Experiments,” Internet Math., vol. 2, no. 3, pp. 333-358, 2005.
[2] K. Avrachenkov, N. Litvak, D. Nemirovsky, and N. Osipova, “Monte Carlo Methods in PageRank Computation: When One Iteration Is Sufficient,” SIAM J. Numerical Analysis, vol. 45, no. 2, pp. 890-904, 2007.
[3] A. Balmin, V. Hristidis, and Y. Papakonstantinou, “ObjectRank: Authority-Based Keyword Search in Databases,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 2004.
[4] H. Hwang, A. Balmin, B. Reinwald, and E. Nijkamp, “BinRank: Scaling Dynamic Authority-Based Search Using Materialized Subgraphs,” IEEE Trans. Knowledge and Data Engineering, vol. 22, no. 8, pp. 1176-1190, Aug. 2010.
[5] S. Chakrabarti, “Dynamic Personalized PageRank in Entity-Relation Graphs,” Proc. Int’l World Wide Web Conf. (WWW), 2007.
[6] F. Alecu, “Performance Analysis of Parallel Algorithms,” Journal of Applied Quantitative Methods, vol. 2, no. 1, 2007.
[7] V. Hristidis, H. Hwang, and Y. Papakonstantinou, “Authority-Based Keyword Search in Databases,” ACM Trans. Database Systems, vol. 33, no. 1, pp. 1-40, 2008.
[8] M.R. Garey and D.S. Johnson, “A 71/60 Theorem for Bin Packing,” J. Complexity, vol. 1, pp. 65-106, 1985.
[9] H. Hwang, A. Balmin, H. Pirahesh, and B. Reinwald, “Information Discovery in Loosely Integrated Data,” Proc. ACM SIGMOD, 2007.
[10] J. Cho and U. Schonfeld, “Rankmass Crawler: A Crawler with High PageRank Coverage Guarantee,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 2007.
