(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 2, February 2011
http://sites.google.com/site/ijcsis/ ISSN 1947-5500

WEB-OBJECT RANK ALGORITHM FOR EFFICIENT INFORMATION COMPUTING

Dr. Pushpa R. Suri
Department of Computer Science and Applications, Kurukshetra University, Kurukshetra, Haryana- 136119, India.
pushpa.suri@yahoo.com

Harmunish Taneja
Department of Information Technology, Maharishi Markendeshwar University, Mullana, Haryana- 133203, India
harmunish.taneja@gmail.com

Abstract - In recent years there has been considerable interest in analyzing the relative trust level of web objects. The web contains both facts and assumptions on a global scale, which gives rise to various criteria for trusting a web page. In this paper an algorithm is proposed that assigns a rank to every web object, such as a requested document on the web; the rank specifies the quality of that object, or the relative level of trust one can place in that web page. It is used for object-level information extraction for ranking search results and is implemented in C++. The behavior of the object rank for different values of the moister factor in a domain is analyzed. The results emphasize that the moister factor can be useful in rank computation and in exploring more web pages in alignment with the user's requirements.

Keywords- Random Surfer Model, Information Computing, Web Objects, Information Retrieval System, Web Graph, Ranking, Object Rank.

I. INTRODUCTION

Information computing in various web domains is broadly the extraction of web objects of unstructured nature, such as text objects that satisfy an information need, from within large collections using document-level ranking, and likewise of the structured information about real-world objects that is embedded in static web pages. Online databases exist on the web in huge amounts and are of unstructured nature. Unstructured data refers to data which does not have a clear, semantically obvious structure [7]. In other words, information computing constitutes the process of searching, recovering, and understanding information from huge amounts of stored data. Information can be retrieved from the web by implementing searching techniques such as Keyword-based Searching, Concept-based Searching, Hybrid Search, and Knowledge Base Search. In the case of object-level information computing, domain-based search is required. Every commercial information retrieval system tries to facilitate a user's access to information that is relevant to the user's information needs. This paper highlights the ranking problem for domain-based information retrieval: every document owner wants to improve the ranking of the document and, to that end, may manipulate it in many ways, for example by increasing the number of links to the page from dummy pages [1]. Object-based information computing maintains the integrity of the search results based upon various lexicons. As the web contains contradictions and hypotheses on a huge scale, finding the relevant information using search engines is a tedious job. With the help of object-level ranking [22], the various objects in a domain can be prioritized, independently of the query, by a rank that describes the relative trust of the web page. The object rank of a page depends upon various factors associated with the web object.

The organization of the paper is as follows. Related work is presented in Section 2. Section 3 discusses the challenges of high quality search results. In Section 4, the Web_Object_Rank algorithm is proposed and discussed. The algorithm is implemented in Section 5. Finally, Section 6 concludes the paper on the basis of the results obtained.

II. RELATED WORK

Google is a prototype of a large-scale search engine that makes heavy use of the structure present in hypertext [1]. Google is designed to crawl and index the web efficiently and to produce much more satisfying search results than existing systems. Link Analysis Ranking [16] emphasizes that hyperlink structures are used to determine the relative authority of a web page and to produce improved algorithms for the ranking of search results. The prototype, with a full text and hyperlink database of web pages, is available at [8]. There is currently much interest in using random graph models for the web. The Random Surfer model [9] and the PageRank-based selection model [11] are described as two major models [10]. The PageRank-based selection model tries to capture the effect that search engines have on the growth of the web by adding new links according to PageRank. The PageRank algorithm is used in the Google search engine [12] for ranking search results. PageRank is a link analysis algorithm used by the Google Internet search engine that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web (WWW), with the purpose of "measuring" its relative importance within the set. Google is designed to be a scalable search engine with the primary goal of providing high quality search results over a rapidly growing WWW [18]. The PageRank theory suggests that even an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the surfer will continue is the damping factor α [2]. The damping factor is eminently empirical, and in most cases the value of α can be taken as 0.85 [1]. PageRank is the stationary state of a Markov chain [2, 7]. The chain is obtained by perturbing the transition matrix induced by a web graph with a damping factor that spreads part of the rank uniformly. The behavior of PageRank with respect to changes in α is useful in link-spam detection [3]. The mathematical analysis of PageRank under changes in α shows that, contrary to popular belief, for real-world graphs values of α close to 1 do not give a more meaningful ranking [2, 21]. The order of displayed web pages is computed by the search engine Google as the PageRank vector, whose entries are the PageRanks of the web pages [4]. The PageRank vector is the stationary distribution of a stochastic matrix, the Google matrix. The Google matrix in turn is a convex combination of two stochastic matrices: one matrix represents the link structure of the web graph, and a second, rank-one matrix mimics the random behavior of web surfers and can also be used to fight web spamming. As a consequence, PageRank depends mainly on the link structure of the web graph, not on the contents of the web pages. Also, the PageRank of the first vertex, the root of the graph, follows a power law [10]. However, the power undergoes a phase transition as the parameters of the model vary.

Link-based ranking algorithms rank web pages by using the dominant eigenvector of certain matrices, such as the co-citation matrix or its variations [17]. Distributed page ranking on top of structured peer-to-peer networks is needed because the size of the web grows at a remarkable speed and centralized page ranking is not scalable [5].

Page ranking can use propagation rates that depend on the types of the links and on the user's specific set of interests [6]. Page filtering can be decided based on link types combined with some other information relevant to the links. For ranking, a profile containing a set of ranking rules to be followed in the task can be specified to reflect the user's specific interests [20]. Similarities of contents between hyperlinked pages are useful to produce a better global ranking of web pages [19].

III. CHALLENGES

The primary focus of a Web Information Retrieval Support System (WIRSS) is to address the aspects of search that consider the specific needs and goals of the individuals conducting web searches [15]. The major goal is to provide high quality search results over a rapidly growing World Wide Web. Google employs a number of techniques to improve search quality, including page rank, anchor text, and proximity information. Decentralized content publishing is the main reason for the explosive growth of the web. For a given user query there are many documents that can be retrieved by a search engine, and every document owner wants to improve the ranking of the document. Commercial search engines have to maintain the integrity of their search results, and this is one reason why the efforts they make are not publicly available. Democratization of content creation on the web generates new challenges in WIRSS. This gives rise to the question of the integrity of web pages. In a simplistic approach, one might argue that only some publishers are trustworthy and others are not. One more challenge is that fast crawling technology is needed to gather the web objects and keep them up to date.

IV. WEB_OBJECT_RANK ALGORITHM AND IMPLEMENTATION

The PageRank of a web object can be defined as the fraction of time that the surfer spends, on average, on that object. The probability that the random surfer visits a web page is its PageRank [1]. Evidently, web objects that are hyperlinked by many other pages are visited more often. The random surfer gets bored and restarts from another random web object with a probability termed the moister factor (m). The probability that the surfer follows a randomly chosen outlink is (1-m).

A Markov chain is a discrete-time stochastic process: a process that occurs in a series of time-steps, in each of which a random choice is made [7]. There is one state corresponding to each web object. Hence, a Markov chain consists of N states if there are N web objects in the collection. A Markov chain is characterized by an N × N Probability Transition Matrix P, each of whose entries is in the interval [0, 1]; the entries in each row of P add up to 1. The Markov property states that each entry Pij is a transition probability that depends only on the current state i. A Markov chain's probability distribution over its states may be viewed as a Probability Vector: a vector all of whose entries are in the interval [0, 1] and add up to 1. The problem of computing bounds on the conditional steady-state Probability Vector of a subset of states in finite, discrete-time Markov chains is considered in [7, 14].

A. Web_Object_Rank Algorithm: Features

Features of the Object Rank algorithm are as follows:
- It is a query-independent algorithm (it assigns a value to every document independent of the query).
- It is a content-independent algorithm.
- It is concerned with the static quality of a web page.
- The Object Rank value can be computed offline using only the web graph.
- Object Rank is based upon the linking structure of the whole web.
- Object Rank does not rank a website as a whole; it is determined for each web page individually.
- The Object Ranks of the web pages Ti which link to page A do not influence the rank of page A uniformly.
- The more outbound links a page T has, the less page A benefits from a link to it on T.
- Object Rank is a model of the user's behavior.

B. Web_Object_Rank Algorithm: Assumptions

- If there are multiple links between two web objects, only a single edge is placed.
- No self loops are allowed.
- The edges could be weighted, but we assume that no weight is assigned to the edges in the graph.
- Links within the same web site are removed.
- Isolated nodes are removed from the graph.
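The assumptions of the Web_Object_Rank algorithm can be read as a preprocessing pass over raw link data. The following Python sketch is one possible reading, not the authors' C++ implementation: the function name clean_edges, the representation of links as (source URL, target URL) pairs, and the use of the URL host name to decide whether two pages belong to the same web site are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def clean_edges(links):
    """Apply the graph assumptions to raw (source_url, target_url) pairs.

    Hypothetical helper: returns (nodes, edges) as sorted lists.
    """
    edges = set()
    for src, dst in links:
        if src == dst:
            continue  # no self loops allowed
        if urlsplit(src).netloc == urlsplit(dst).netloc:
            continue  # links within the same web site are removed (host-based, assumed)
        edges.add((src, dst))  # a set keeps a single unweighted edge per pair
    # Nodes are collected from the surviving edges only, so isolated
    # nodes never enter the graph.
    nodes = sorted({page for edge in edges for page in edge})
    return nodes, sorted(edges)


raw_links = [
    ("http://a.example/p1", "http://b.example/x"),
    ("http://a.example/p1", "http://b.example/x"),   # duplicate link
    ("http://a.example/p1", "http://a.example/p2"),  # same web site
    ("http://c.example/", "http://c.example/"),      # self loop
]
nodes, edges = clean_edges(raw_links)
```

Deriving the node set from the surviving edges is a design choice of this sketch: it removes isolated nodes without a separate pass.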
C. Web_Object_Rank Algorithm: Variables

This algorithm is basically a query-independent algorithm that takes a web graph as input and assigns to every object a rank that specifies the relative authority of that web page. The proposed algorithm uses the following variables:
- moist_fact (m) is the moister factor: the probability that the random surfer restarts the search from another web object.
- 1-m is the probability that the random surfer searches web objects by following a randomly chosen outlink.
- outlinks is the number of web objects linked from a particular page.
- N is the number of objects in the domain.
- prob[i][j] is the Probability Transition Matrix, for all i, j ∈ 1 to N.
- adj[i][j] is the Adjacency Matrix, for all i, j ∈ 1 to N.
- x is the Probability Vector.
- itr is the iteration counter.

D. Web_Object_Rank Algorithm: Steps

Step 1. Create a web graph of the various objects in a domain.
Step 2. Set prob[i][j] = adj[i][j].
Step 3. Compute the number of outlinks from each node, say counter. If a web object has no outlinks, prob[i][j] is distributed equally over all j; otherwise the prob values are distributed according to the number of outlinks.
    For all i, j
        IF (counter = 0)
        THEN prob[i][j] = 1/N
        ELSE IF (prob[i][j] = 1)
        THEN prob[i][j] = 1.0/counter
Step 4. Multiply the resulting matrix by 1 − m.
Step 5. Add m/N to every entry of the resulting matrix to obtain the Probability Transition Matrix.
    For all i, j
        Do prob[i][j] = (prob[i][j] * (1 - m)) + (m / N);
Step 6. Randomly select a node from 0 to N-1, say s_int, at which to start the walk.
Step 7. Initialize the random surfer, and set itr, which keeps count of the number of iterations required, to 0.
Step 8. Try to reach the steady state within 200 iterations; otherwise toggling occurs.
Step 9. Multiply the Probability Transition Matrix with the Probability Vector to approach the steady state.
Step 10. Check whether the system has entered the steady state.
Step 11. Print the ranks stored in the Probability Vector x and EXIT.

V. IMPLEMENTATION

This implementation is based upon the random surfer model [7] and Markov chains [13, 14]. The random surfer visits the objects in the web graph according to a probability distribution, and at any time the system can be in one of the following four possible states.

Initial state is the state from which the system starts its walk. The system is set in a random state by selecting an object with a random function; the value corresponding to that web object in the Probability Vector is set to unity, and the rest of the values in the Probability Vector are zero. Steady state is the state reached when the Markov chain fulfills the properties of irreducibility and aperiodicity. To check whether the system has reached the steady state, two successive values of the Probability Vector must be the same. Ideal state is the state in which the system achieves the steady state but, at the same time, the web object ranks are distributed uniformly over all documents. Toggling state is reached when the system is not able to reach the steady state and just toggles between two sets of object ranks.

Fig. 1. Web Graph (ten web objects, O1 to O10)

C. Results and Discussion

The web graph shown in Fig. 1 is used for analyzing various factors of the proposed algorithm. Variation in the graph structures used for analysis changes the performance of the algorithm. The graph shows 10 web objects in a domain that are interlinked as a strongly connected graph. Every two nodes of the graph are joined by a path with a small number of links. Oi is the ith web object in the domain, where i varies from 1 to 10. The adjacency matrix for the web graph of Fig. 1 is shown in Fig. 2.
0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 1 0 0 1 0
0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 1 0 0

Fig. 2. Adjacency Matrix for all i, j ∈ 1 to 10

To analyze the convergence speed, the number of iterations required by the random surfer to reach a steady state is recorded in Table 1, and the corresponding graph is shown in Fig. 3. In Fig. 3 the value infinity is represented by a large number of iterations (200 or more). The results show that, in general, the number of iterations falls as the moister factor approaches 1.

Table 1: Moister Factor vs No. of Iterations

Moister Factor    No. of Iterations
0                 Infinity
0.05              Infinity
0.1               Infinity
0.15              Infinity
0.2               83
0.25              73
0.3               62
0.35              46
0.4               41
0.45              33
0.5               35
0.55              39
0.6               24
0.65              21
0.7               20
0.75              22
0.8               16
0.85              12
0.9               11
0.95              10
1                 2

Fig. 3. Moister Factor vs Number of Iterations (x-axis: moister factor, 0 to 1; y-axis: number of iterations, 0 to 250)

Further analysis shows that when the moister factor equals 1, the random surfer enters the ideal state, and the corresponding rank values of the web objects are those listed in Table 2. The graph for the ideal state is shown in Fig. 4.

Table 2: Ranks of objects at moister factor 1

Object    Computed Rank
O1        0.1
O2        0.1
O3        0.1
O4        0.1
O5        0.1
O6        0.1
O7        0.1
O8        0.1
O9        0.1
O10       0.1

Fig. 4. Random Surfer Ideal State (computed rank at moister factor 1 for web objects O1 to O10)

Figure 5 shows that for a moister factor less than 0.2, no rank is provided to any web object and the system enters the toggling state with a large number of iterations for the given domain.
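As a rough illustration of Steps 2 to 11 of the Web_Object_Rank algorithm, the following Python sketch builds the Probability Transition Matrix and repeatedly multiplies it with the Probability Vector. It is not the authors' C++ implementation. The function names, the fixed seed, the tolerance tol used in place of an exact equality test between successive vectors, and the outlink lists OUT (our reading of the adjacency matrix of Fig. 2, which is hard to read in places) are assumptions made for this example.

```python
import random

# Outlinks of O1..O10 as read from Fig. 2 (assumed reading of the figure).
OUT = {1: [2], 2: [3], 3: [4], 4: [5], 5: [7],
       6: [5], 7: [6, 9], 8: [6], 9: [10], 10: [8]}
ADJ = [[1 if j in OUT[i] else 0 for j in range(1, 11)] for i in range(1, 11)]


def transition_matrix(adj, m):
    """Steps 2-5: spread each row over its outlinks, scale by (1 - m), add m/N."""
    n = len(adj)
    prob = []
    for row in adj:
        counter = sum(row)  # number of outlinks of this web object
        if counter == 0:
            base = [1.0 / n] * n  # no outlinks: distribute equally
        else:
            base = [a / counter for a in row]
        prob.append([b * (1.0 - m) + m / n for b in base])
    return prob


def web_object_rank(adj, m, max_itr=200, tol=1e-9, seed=0):
    """Steps 6-11: return (ranks, iterations), or (last vector, None) if
    no steady state is detected within max_itr iterations."""
    n = len(adj)
    prob = transition_matrix(adj, m)
    x = [0.0] * n
    x[random.Random(seed).randrange(n)] = 1.0  # initial state: one object set to unity
    for itr in range(1, max_itr + 1):
        # Probability Vector times Probability Transition Matrix
        nxt = [sum(x[i] * prob[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt, itr  # two successive vectors agree: steady state
        x = nxt
    return x, None


ranks_ideal, itr_ideal = web_object_rank(ADJ, 1.0)
```

Under these assumptions, a moister factor of 1 yields the uniform rank 0.1 for every object after two iterations, in line with Table 2 and the last row of Table 1. The iteration counts for other values of m depend on the assumed tolerance and need not match Table 1.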
Figure 5 also shows the ranks computed by the proposed algorithm for moister factor values from 0.2 to 1.

Fig. 5. Computed object ranks of web objects O1 to O10 at various moister factors (MF = 0.2 to 1.0; y-axis: computed rank, 0 to 0.25)

From the above graphs and analysis, we can say that the moister factor plays a main role in this algorithm, and that the performance of the algorithm can be improved if this factor is selected properly. The value of the moister factor can vary from 0 to 1, but in most cases the system enters the toggling state if the value selected is less than 0.2, and at the value 1 the system enters the ideal state, giving insignificant results. The value must be close to 1 but cannot be 1. As shown in Fig. 3, the system achieves a steady state in fewer iterations if the moister factor value is closer to 1.

VI. CONCLUSION

The current study was conducted to demonstrate how the link structure of the web can be used to rank various documents. This ranking can be computed offline. With the help of this approach one can prioritize the various documents on the web independent of the query. However, a complete score computation is based on various other factors. The proposed algorithm uses a moister factor that plays a very important role in the analysis of the algorithm. The analysis leads to the conclusion that the moister factor must not be selected close to zero. At a moister factor of one, the system enters the ideal state and the ranking provided is insignificant. As per the evaluation, the moister factor must be selected greater than or equal to 0.5. However, if convergence speed is considered as the only factor for evaluating performance, then the best moister factor is 0.95. The proposed algorithm is a query-independent algorithm and does not consider the query during ranking.

REFERENCES

[1] Sergey Brin, Lawrence Page, "The anatomy of a large-scale hypertextual web search engine", Proceedings of the 7th International Conference on World Wide Web 7, pp. 107-117, April 1998, Brisbane, Australia.
[2] Paolo Boldi, Massimo Santini, S. Vigna, "PageRank as a Function of the Damping Factor", Proceedings of the 14th International Conference on World Wide Web, Chiba, Japan, pp. 557-566, 2005.
[3] Hui Zhang, Ashish Goel, Ramesh Govindan, Kahn Mason, and Benjamin Van Roy, "Making eigenvector-based reputation systems robust to collusion", in Stefano Leonardi, editor, Proceedings WAW 2004, number 3243 in LNCS, pp. 92-104, Springer-Verlag, 2004.
[4] Nie Z., Wu F., Wen J.R., and Ma W.Y., "Extracting Objects from the Web", 22nd International Conference on Data Engineering (ICDE'06), pp. 1-3, 2006.
[5] Jianfeng Zheng, Zaiqing Nie, "Architecture of an Object-level Vertical Search", IEEE, Proceedings of the International Conference on Web Information Systems and Mining, pp. 51-55, 2009.
[6] Zhanzi qui, Matthias Hemmje, Erich J. Neuhold, "Using Link types in web page ranking and filtering", IEEE Computer Society, Proceedings of the Second International Conference on Web Information Systems Engineering (WISE'01), Volume 1, p. 311, 2001.
[7] Christopher D. Manning, Prabhakar Raghavan, Hinrich Schutze, "An Introduction to Information Retrieval", Cambridge University Press, New York, NY, USA, pp. 461-470, 2008.
[8] http://google.stanford.edu/
[9] Blum, T.-H. H. Chan, and M. R. Rwebangira, "A random-surfer web-graph model", in ANALCO '06: Proceedings of the 8th Workshop on Algorithm Engineering and Experiments and the 3rd Workshop on Analytic Algorithmics and Combinatorics, pp. 238-246, Philadelphia, PA, USA, 2006, Society for Industrial and Applied Mathematics.
[10] Prasad Chebolu, Páll Melsted, "PageRank and the random surfer model", Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1010-1018, 2008.
[11] Gopal Pandurangan, Prabhakar Raghavan, Eli Upfal, "Using PageRank to Characterize Web Structure", Proceedings of the 8th Annual International Conference on Computing and Combinatorics, pp. 330-339, August 15-17, 2002.
[12] Google technology overview, http://www.google.com/intl/en/corporate/tech.html, 2004.
[13] R. Montenegro, P. Tetali, "Mathematical aspects of mixing times in Markov chains", Foundations and Trends in Theoretical Computer Science, Volume 1, Issue 3, pp. 237-354, May 2006.
[14] Tugrul Dayar, Nihal Pekergin, Sana Younes, "Conditional steady-state bounds for a subset of states in Markov chains", ACM International Conference Proceeding Series, Vol. 201, Proceedings of the 2006 Workshop on Tools for Solving Structured Markov Chains, Article No. 3, 2006.
[15] Orland Hoeber, "Web Information Retrieval Support Systems: The Future of Web Search, Web Intelligence & Intelligent Agent", Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Volume 03, pp. 29-32, 2008.
[16] Allan Borodin, Gareth O. Roberts, Jeffrey S. Rosenthal, Panayiotis Tsaparas, "Link analysis ranking: algorithms, theory, and experiments", ACM Transactions on Internet Technology (TOIT), Volume 5, Issue 1, pp. 231-297, Feb. 2005.
[17] R. Lempel, S. Moran, "Rank-Stability and Rank-Similarity of Link-Based Web Ranking Algorithms in Authority-Connected Graphs", Information Retrieval, Volume 8, Issue 2, pp. 245-264, Kluwer Academic Publishers, April 2005.
[18] Sehgal, Umesh; Kaur, Kuljeet; Kumar, Pawan, "The Anatomy of a Large-Scale Hyper Textual Web Search Engine", Second International Conference on Computer and Electrical Engineering (ICCEE '09), Volume 2, pp. 491-495, 28-30 Dec. 2009.
[19] Kritikopoulos, A., Sideri, M., Varlamis, "Wordrank: A Method for Ranking Web Pages Based on Content Similarity", 24th British National Conference on Databases (BNCOD '07), pp. 92-100, 3-5 July 2007.
[20] Zaiqing Nie, Ji-Rong Wen and Wei-Ying Ma, "Object-level Vertical Search", 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 7-10, 2007, Asilomar, California, USA.
[21] Zhi-Xiong Zhang, Jian Xu, Jian-Hua Liu, Qi Zhao, Na Hong, Si-Zhu Wu, Dai-Qing Yang, "Extraction knowledge objects in scientific web resource for research profiling", IEEE, Eighth International Conference on Machine Learning and Cybernetics, Baoding, 12-15 July 2009, pp. 3475-3480.
[22] Nie Z., Zhang Y., Wen J.R., and Ma W.Y., "Object-level Ranking: Bringing Order to web Objects", in Proceedings of World Wide Web (WWW), 2007.

Dr. Pushpa R. Suri received her Ph.D. degree from Kurukshetra University, Kurukshetra. She is working as Associate Professor in the Department of Computer Science and Applications at Kurukshetra University, Kurukshetra, Haryana, India. She has many publications in International and National Journals and Conferences. Her teaching and research activities include Discrete Mathematical Structure, Data Structure, Information Computing and Database Systems.

Harmunish Taneja received his M.Phil. degree (Computer Science) from Algappa University, Tamil Nadu, and his Master of Computer Applications from Guru Jambeshwar University of Science and Technology, Hissar, Haryana, India. Presently he is working as Assistant Professor in the Information Technology Department of M.M. University, Mullana, Haryana, India. He is pursuing a Ph.D. (Computer Science) from Kurukshetra University, Kurukshetra. He has published 11 papers in International / National Conferences and Seminars. His teaching and research areas include Database Systems, Web Information Retrieval, and Object Oriented Information Computing.