
Dayananda P et al., International Journal of Multidisciplinary in Cryptology and Information Security, 1 (2), November - December 2012, 13-16
ISSN 2320-2610
Available Online at http://warse.org/pdfs/ijmcis02122012.pdf

Leverage the BinRank's performance using HubRank and Parallel Computing

Dayananda P1, Thnga Selvi A2
1 Assistant Professor, Department of Information Science and Engg, MSRIT, Bangalore-54
2 Department of Information Science and Engg, MSRIT, Bangalore-54, selvi2103@gmail.com

Abstract: Over the past decade, the amount of data generated and posted on the web has increased exponentially. High processing speeds, quick retrieval, and efficient handling of data are therefore of utmost importance. Searching the web with a keyword and retrieving the relevant documents has become an important and interesting task, and these two concerns must not be compromised. This paper proposes an approach that approximates BinRank by integrating it with HubRank and parallelizing it, i.e., executing its activities simultaneously, to reduce query execution time and increase the relevance of the results.

Keywords: BinRank, HubRank, ObjectRank.

INTRODUCTION

Dynamic authority-based keyword search algorithms such as ObjectRank and Personalized PageRank (PPR) exploit semantic link information to provide high-recall searches on the web. Most of these algorithms perform iterative computation over the full graph. If the graph is too large, such computation at query time becomes too expensive to be feasible. This motivated BinRank, a system that approximates ObjectRank results using a hybrid approach in which a number of relatively small subsets of the data graph are materialized; any query is then answered by running ObjectRank on one of these subgraphs. This lets BinRank achieve subsecond query execution time.

The HubRank system presents a viable way to dynamically personalize PageRank [1] at query time on entity-relation (ER) graphs by utilizing clever hubset selection strategies and early termination bounds. HubRank is highly scalable for small graphs compared to ObjectRank. Since BinRank's bin construction is precomputed offline, we plan to parallelize the bin construction activity and execute HubRank [4] on the subgraphs that BinRank generates.

RELATED WORK

There are many existing ranking mechanisms; in this section we discuss several of them.

PageRank is a popular and simple algorithm used by Google's web search. It models a random web surfer who starts at a random web page and follows outgoing links with uniform probability. PageRank's biggest advantage is its simplicity. Its disadvantage is that it returns only documents that contain the keyword; documents that may be more relevant to the search but do not contain the keyword are ignored. Dynamic versions of the algorithm, such as Personalized PageRank (PPR) for web graph datasets, modify PageRank to personalize the search over a base set of web pages the user is interested in. Personalized PageRank, however, suffers from poor scalability.

The ObjectRank system applies the random walk model, whose effectiveness is proven by Google's PageRank, to keyword search in databases modeled as labeled graphs. The system ranks database objects with respect to the user-provided keywords. ObjectRank extends Personalized PageRank (PPR) to perform keyword search in databases: it uses a query term's posting list as the set of random walk starting points and conducts the walk on the instance graph of the database. ObjectRank has been applied successfully to databases with social networking components, such as bibliographic data and collaborative product design. However, ObjectRank suffers from the same scalability issues as Personalized PageRank, as it requires multiple iterations over all nodes and links of the entire database graph. The original ObjectRank system has two modes. The online mode runs the ranking algorithm once the query is received, which takes too long on large graphs. The offline mode precomputes top-k results for a query workload in advance; this precomputation is very expensive and requires a lot of storage space for the precomputed results.

HubRank is a search system based on ObjectRank that improves ObjectRank's scalability by combining hub-based approaches with a Monte Carlo approach [2]. It initially selects a fixed number of hub nodes using a greedy hub selection algorithm that exploits a query workload in order to minimize query execution time. HubRank is highly scalable for smaller graphs because only a few hubs are considered and because of early termination.

IMPLEMENTATION

The implementation comprises a generic algorithm [4] and a parallel computing procedure. The algorithm has two main phases: a pre-computation phase followed by a query processing phase.

Pre-computation phase

This phase happens in two steps. Step 1 takes as input the set of keywords in the entire database, also called the workload W, and outputs a set of term bins. Step 2 takes the output of step 1 as input and returns the set of materialized subgraphs (MSGs) as output.
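The random-surfer model described above can be sketched as a simple power iteration. This is a generic illustration rather than the paper's implementation; the damping factor 0.85 and the toy three-page graph are assumptions:

```python
# Minimal power-iteration sketch of the random-surfer model behind PageRank.
def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping node -> list of outgoing links."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {node: (1.0 - damping) / n for node in graph}
        for node, links in graph.items():
            if links:
                # Follow an outgoing link with uniform probability.
                share = damping * rank[node] / len(links)
                for target in links:
                    new_rank[target] += share
            else:
                # Dangling node: redistribute its rank uniformly.
                for target in new_rank:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
```

Note that the ranking depends only on the link structure, which is exactly the limitation the text points out: a document's keyword relevance plays no role in the score.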
Bin construction

The bin construction algorithm packs terms into bins by partitioning the workload W into a set of bins composed of frequently co-occurring terms. The algorithm takes a single parameter, maxBinSize, which limits the size of a bin's posting list, i.e., the union of the posting lists of all terms in the bin. During bin construction, the bin identifier of each term is inserted into the Lucene index as an additional field. This allows us to identify the corresponding bin, and hence the MSG, at query time for a given query.

MSG generation

BinRank uses the ObjectRank algorithm to generate a materialized subgraph (MSG) for each bin. Since the HubRank algorithm is more scalable than ObjectRank for smaller graphs, we plan to use HubRank instead of ObjectRank to generate the MSGs themselves. To achieve high efficiency with HubRank, we need to keep each constructed MSG as small as possible. For this purpose we plan to produce a larger number of bins (and hence MSGs), so that each bin is small enough for the HubRank algorithm to process efficiently. Because there are more bins, query processing time might be delayed; to overcome this we also parallelize MSG generation. Since each bin, and hence its MSG, is independent of the others, the MSG generation process is well suited to parallel computing.

Fig 1 shows the stages in the pre-computation phase.

Fig 1: Block diagram of Pre-computation phase

Query processing phase

For a given keyword query we find the base set q and the bin identifier. With these two pieces of information we determine the MSG on which HubRank is to be applied to return the top-k results.

Multi-keyword queries are processed by taking each individual keyword separately. For a union-of-keywords query we get the MSG for each individual keyword and run HubRank separately on each MSG to return the top-k relevant entries.
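The bin construction step above can be sketched as a greedy packing loop. The paper specifies only the maxBinSize limit on the union of posting lists; the frequency-based ordering and the toy posting lists below are assumptions:

```python
# Greedy sketch of term-bin construction (illustrative heuristic only).
def build_bins(posting_lists, max_bin_size):
    """posting_lists: dict term -> set of document ids.
    Returns a list of (terms, union_of_postings) bins; each bin's
    union of posting lists stays within max_bin_size documents."""
    bins = []
    # Process frequent terms first so co-occurring terms tend to share a bin.
    for term in sorted(posting_lists, key=lambda t: -len(posting_lists[t])):
        docs = posting_lists[term]
        placed = False
        for terms, union in bins:
            # Adding the term must not push the bin past max_bin_size.
            if len(union | docs) <= max_bin_size:
                terms.add(term)
                union |= docs
                placed = True
                break
        if not placed:
            bins.append(({term}, set(docs)))
    return bins

lists = {"graph": {1, 2, 3}, "rank": {2, 3}, "cuda": {7, 8}}
bins = build_bins(lists, max_bin_size=4)
```

Here "graph" and "rank" land in one bin because their postings overlap, while "cuda" starts a new bin; in the real system each term's resulting bin identifier would then be stored as an extra field in the Lucene index.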
Since parallel computing hardware is widely available these days, we execute HubRank on the MSGs of a multi-keyword query by running the HubRank algorithm on multiple cores simultaneously.

Recall that HubRank first selects a fixed number of hub nodes using a greedy hub selection algorithm driven by a query workload. Given the set of hub nodes H, it materializes the fingerprints of the hub nodes in H. At query time, it generates an active subgraph by expanding the base set with its neighbors. It stops following a path when it encounters a hub node whose PPV was materialized, or when the distance from the base set exceeds a fixed maximum length. The efficiency of query processing and the quality of query results are very sensitive to the size of H and to the hub selection scheme. This dynamic pruning plays a key role in outperforming ObjectRank by a noticeable margin.

Fig 2 shows the stages in the query processing phase.

Fig 2: Block diagram of Query processing phase

Parallel computing approach

The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. The GPU, as a specialized processor, addresses the demands of real-time, high-resolution 3D graphics and other compute-intensive tasks. As of 2012, GPUs have evolved into highly parallel multicore systems allowing very efficient manipulation of large blocks of data. This design is more effective than general-purpose CPUs for algorithms where large blocks of data are processed in parallel. The Compute Unified Device Architecture (CUDA) programming model provides a straightforward means of describing inherently parallel computations, and NVIDIA's Tesla GPU architecture delivers high computational throughput on massively parallel problems.

Fig 3: Block diagram of process flow of CUDA
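HubRank's query-time expansion described above (grow the active subgraph from the base set, stopping at materialized hubs or at a maximum distance) can be sketched as a bounded breadth-first search; the graph, hub set, and depth limit below are illustrative:

```python
from collections import deque

# Sketch of HubRank-style active-subgraph expansion (names are illustrative).
def active_subgraph(graph, base_set, hubs, max_depth):
    """graph: dict node -> list of neighbours; hubs: nodes whose PPVs
    are precomputed. Returns the set of nodes in the active subgraph."""
    visited = set(base_set)
    queue = deque((node, 0) for node in base_set)
    while queue:
        node, depth = queue.popleft()
        # Stop following a path at a hub (its PPV is already materialized)
        # or once the distance from the base set reaches max_depth.
        if node in hubs or depth == max_depth:
            continue
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, depth + 1))
    return visited

g = {"q": ["h", "x"], "h": ["y"], "x": ["z"], "z": ["w"]}
sub = active_subgraph(g, base_set={"q"}, hubs={"h"}, max_depth=2)
```

The expansion includes the hub "h" itself but never looks past it, and likewise never goes beyond two hops from the base set; this pruning is what keeps the active subgraph small.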
In this context, let us see how we intend to utilize parallel computing. To achieve our goal we propose the use of the Single Instruction, Multiple Data (SIMD) architecture, in which several processors follow the same set of instructions but each processor feeds different data into those instructions. This is useful for analyzing large chunks of data against the same criteria. Many complex computational problems do not fit this model, but our algorithm does, particularly the pre-computation phase, which requires us to build subgraphs based on keywords. If the same algorithm is implemented using the SIMD concept on a parallel computing device, the speed of pre-computation increases markedly.

As shown in Fig 3 above, all the processors in the GPU execute the same logic, each on a different data instance. For each bin of terms a materialized subgraph has to be constructed, so for each individual bin a separate processor builds a materialized subgraph. Though there is an initial investment in hardware, it is traded off against execution speed. This parallelization will ultimately decrease the overall time required to execute the algorithm.

Let us now estimate the algorithm's execution time statistically and compare it with its parallel counterpart [6].
If a parallel program is executed on a computer having p processors, the least possible execution time equals the sequential time divided by the number of processors. If Tp is the parallel execution time, Ts the sequential execution time, and p the number of processors, then

    Tp = Ts / p                                  (1)

Speedup is the ratio of sequential execution time to parallel execution time. The maximum speedup is achieved in an ideal multiprocessor system with no communication costs and a balanced workload across processors. In such a system, every processor needs Ts/p time units to complete its job, so the speedup value S is

    S = Ts / Tp = Ts / (Ts / p) = p              (2)

This leads to S = p. Previous statistics claim that for a Wikipedia data set, precomputing about a thousand subgraphs takes about 12 hours on a single CPU [4]. If the same work is implemented using parallel computing on a SIMD architecture, then ideally, with Ts = 12 hours and, say, p = 4 processors, formula (1) gives Tp of approximately 3 hours to build the same 1000 subgraphs.

In general, if Ts is the sequential time taken to build one subgraph, then when x subgraphs are built in parallel on n processors, the total time taken is, ideally,

    Tt = (x / n) * Ts                            (3)

so the amortized time Tp to build one subgraph in parallel is

    Tp = Tt / x = Ts / n                         (4)

The formulas above hold only under ideal conditions. According to Amdahl's law it is very difficult, even on an ideal parallel system, to obtain a speedup equal to the number of processors, because every program contains a fraction of code that must run sequentially. The total time then becomes

    Tt = a * Ts + (1 - a) * Ts / n               (5)

where Tt is the total time taken to build the subgraphs, a is the fraction of the code that must execute sequentially, and (1 - a) is the fraction that builds the subgraphs in parallel. Accounting also for an internal overhead b,

    Tt = a * Ts + (1 - a) * Ts / n + b           (6)

where Ts is now the total time taken to execute the pre-computation phase when done in parallel.

CONCLUSION

In this paper, we proposed an approach to increase the performance of BinRank by using HubRank and by parallelizing, i.e., executing simultaneously, the creation of bins, to reduce query execution time and also increase the relevance of the results. A further enhancement would be to provide a degree of threat detection by storing potentially illegal keywords in a database and ensuring, by checking that database before submitting the query, that no search is performed on those words.

REFERENCES

[1] D. Fogaras, B. Racz, K. Csalogany, and T. Sarlos, "Towards Scaling Fully Personalized PageRank: Algorithms, Lower Bounds, and Experiments," Internet Math., vol. 2, no. 3, pp. 333-358, 2005.
[2] K. Avrachenkov, N. Litvak, D. Nemirovsky, and N. Osipova, "Monte Carlo Methods in PageRank Computation: When One Iteration Is Sufficient," SIAM J. Numerical Analysis, vol. 45, no. 2, pp. 890-904, 2007.
[3] A. Balmin, V. Hristidis, and Y. Papakonstantinou, "ObjectRank: Authority-Based Keyword Search in Databases," Proc. Int'l Conf. Very Large Data Bases (VLDB), 2004.
[4] H. Hwang, A. Balmin, B. Reinwald, and E. Nijkamp, "BinRank: Scaling Dynamic Authority-Based Search Using Materialized Subgraphs," IEEE Trans. Knowledge and Data Engineering, vol. 22, no. 8, pp. 1176-1190, Aug. 2010.
[5] S. Chakrabarti, "Dynamic Personalized PageRank in Entity-Relation Graphs," Proc. Int'l World Wide Web Conf. (WWW), 2007.
[6] F. Alecu, "Performance Analysis of Parallel Algorithms," Journal of Applied Quantitative Methods, vol. 2, no. 1, 2007.
[7] V. Hristidis, H. Hwang, and Y. Papakonstantinou, "Authority-Based Keyword Search in Databases," ACM Trans. Database Systems, vol. 33, no. 1, pp. 1-40, 2008.
[8] M.R. Garey and D.S. Johnson, "A 71/60 Theorem for Bin Packing," J. Complexity, vol. 1, pp. 65-106, 1985.
[9] H. Hwang, A. Balmin, H. Pirahesh, and B. Reinwald, "Information Discovery in Loosely Integrated Data," Proc. ACM SIGMOD, 2007.
[10] J. Cho and U. Schonfeld, "Rankmass Crawler: A Crawler with High PageRank Coverage Guarantee," Proc. Int'l Conf. Very Large Data Bases (VLDB), 2007.

@2012, IJMCIS All Rights Reserved
