A Novel Technique to Predict Oftenly used Web Pages from Usage Patterns

					    International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
       Web Site: Email:,
Volume 1, Issue 4, November – December 2012                                    ISSN 2278-6856

   A Novel Technique to Predict Oftenly used Web
            Pages from Usage Patterns
                                           Niky Singhai1, Prof Rajesh Kumar Nigam2

                               M-Tech (Computer science and Engineering) TIT Bhopal, Bhopal, (M.P.) India
                       Associate Professor, (Computer science and Engineering) TIT Bhopal, Bhopal, (M.P.) India

Abstract: In this paper, we analyze the Kth-order Markov model, formulate its general accuracy, and present a complete framework to predict Web page usage patterns from Web log files. The exponential proliferation of Web usage has dramatically increased the volume of Internet traffic and has caused serious performance degradation in terms of user latency. Two techniques used for improving user latency are caching and Web pre-fetching. Our studies have been conducted on pre-fetching models based on decision trees, Markov chains, and path analysis. However, the increased use of dynamic pages and frequent changes in site structure and user access patterns have limited the efficacy of static techniques. Approaches that bank solely on caching offer limited performance improvement because it is difficult for caching to handle the large number of increasingly diverse files. To perform successful pre-fetching, we must be able to predict the next set of pages that will be accessed by users. This paper aims to predict the next action depending on the result of previous actions. In Web prediction, the next action corresponds to the next page to be visited, and the previous actions correspond to the pages that have already been visited. The Kth-order Markov model gives the probability that a user will visit the kth page provided that the user has visited the ordered k - 1 previous pages. In this paper, a novel Oftenly used Pages (Oft) Rank-like algorithm is proposed for conducting Web page prediction. As the tool for implementing the algorithm we chose MATLAB, the “language of choice in industrial world”.

Keywords- Oftenly used pages (Oft), Decision trees, Markov chains, path analysis, Page Rank, pre-fetching, predict, Web page, Web content.

1. INTRODUCTION
The exponential proliferation of Web usage has dramatically increased the volume of Internet traffic and has caused serious performance degradation in terms of user latency and bandwidth on the Internet. The use of the World Wide Web has become indispensable in everybody’s life, which has also made it critical to look for ways to accommodate an increasing number of users while preventing excessive delays and congestion.
Web mining has been classified into three areas: Web content mining, Web structure mining and Web usage mining. The most common applications include the ranking of the results of a web search engine and the provision of recommendations to users of (usually commercial) web sites, known as web personalization. Even with the speed of today's Internet, web latency is still one of the major concerns of its users. Web servers collect a huge amount of data every day. When users search for information, the relevant data is prefetched from the web server. Reducing latency is particularly important for online businesses, since if their web pages cannot be opened within about eight seconds, they might lose customers.
The most common approaches used for web user browsing pattern prediction are Markov models, sequential association rules and clustering. However, if most prefetched web pages are not visited by the user in their subsequent accesses, the limited network bandwidth and server resources will not be used efficiently, which may worsen the access latency problem. The objective of a Web personalization system is to provide users with the information they want or need, without expecting them to ask for it explicitly [1]. Page Rank is used to rank web pages based on the results returned by a search engine after a user query. The ranking is performed by evaluating the importance of a page in terms of its connectivity to and from other important pages. In the context of navigating a web site, a page or path is important if many users have visited it before. We propose a novel approach that is based on a personalized version of Page Rank, applied to the navigational tree created by the previous users' navigations. Page Rank [2] determines the significance of Web pages and helps a search engine to choose high quality pages more efficiently.
As businesses move online, the competition between businesses to keep the loyalty of their old customers and to lure new customers is even more important, since a competitor’s Web site may be only one click away. The fast pace and large amounts of data available in these online settings have recently made it imperative to use
automated data mining or knowledge discovery techniques to discover Web user profiles. These different modes of usage, the so-called mass user profiles, can be discovered using Web usage mining techniques that automatically extract frequent access patterns from the history of previous user click streams stored in Web log files. These profiles can later be harnessed toward personalizing the Web site to the user or to support targeted marketing.
Although there have been considerable advances in Web usage mining, there have been no detailed studies presenting a fully integrated approach to mine a real Web site with the challenging characteristics of today’s Web sites, such as evolving profiles, dynamic content, and the availability of taxonomy or databases in addition to Web logs.
The Web site in our study is managed by a nonprofit organization that does not sell anything but only provides free information that is ideally complete, accurate, and up to date. Hence, it was crucial to understand the different modes of usage and to know what kind of information the visitors seek and read on the Web site and how this information evolves with time. For this reason, we perform clustering of the user sessions extracted from the Web logs to partition the users into several homogeneous groups with similar activities and then extract user profiles from each cluster as a set of relevant URLs.
This procedure is repeated in subsequent new periods of Web logging (such as biweekly); the previously discovered user profiles are then tracked, and their evolution pattern is categorized. When clustering the user sessions, we exploit the Web site hierarchy to give partial weights in the session similarity between URLs that are distinct and yet located close together on this hierarchy. The Web site hierarchy is inferred both from the URL address and from a Web site database that organizes most of the dynamic URLs along an “is-a” ontology of items. We also enrich the cluster profiles with various facets, including search queries submitted just before landing on the Web site, and inquiring and inquired companies, in case users from (inquiring) companies inquire about any of the (inquired) companies listed on the Web site, which provide related services.

2. LITERATURE SURVEY
With the rising growth of Web users [3], Web-based organizations are keen to analyze the on-line browsing behavior of the users of their web site and to identify their interest instantly within a session. The analysis of the user’s current interest based on the navigational behavior may help the organizations to guide the users in their browsing activity so that they obtain relevant information in a shorter span of time. However, the patterns obtained through data mining techniques did not perform well in predicting future browsing patterns, due to the low matching rate between the resulting rules and the users’ browsing behavior. The work in [3] focuses on recommender systems based on the user’s navigational patterns and provides suitable recommendations to cater to the current needs of the user. Its experimental results, obtained on real usage data from a commercial web site, show a significant improvement in the recommendation effectiveness of the proposed system. Identification of the current interests of the user based on short-term navigational patterns, instead of explicit user information, has proved to be one of the potential sources for recommendation of pages which may be of interest to the user. In that work, the authors classify and match an online user based on his browsing interests using statistical techniques, and suggest a novel approach for recommending unvisited pages. An offline data preprocessing and clustering approach is used to determine groups of users with similar browsing patterns.
The research paper [4] addresses the need for improved Web page prediction accuracy, which would profit many applications, e-business in particular. Different Web usage mining frameworks have been implemented for this purpose, specifically association rules and Markov models. Each of these frameworks has its own strengths and weaknesses, and it has been shown that using each of them individually does not provide a suitable solution to today's Web page prediction needs. The authors endeavor to provide improved Web page prediction accuracy by using a novel approach that integrates clustering, association rules and Markov models according to some constraints.
Merging Web pages by web services according to functionality reduces the number of unique pages and, accordingly, the number of sessions. The categorized sessions were divided into 7 clusters using the k-means algorithm with the cosine distance measure. The Markov model was implemented for the original data in each cluster. Each cluster was divided into a training set and a test set, and the 2-Markov model accuracy was calculated accordingly.
Then, using the test set, each transaction was considered as a new point and distance measures were calculated in order to determine the cluster that the point belongs to. Next, 2-Markov model prediction accuracy was computed
considering the transaction as a test set and only the cluster that the transaction belongs to as a training set. Since association rule techniques require the determination of a minimum support factor and a confidence factor, the experimental data was used to help determine such factors. Only rules with a certain support factor and above a certain confidence threshold can be considered.
In [5], Web prediction is treated as a classification problem in which we attempt to predict the next set of Web pages that a user may visit based on the knowledge of the previously visited pages. Predicting a user’s behavior while surfing the Internet can be applied effectively in various critical applications. Such applications have traditional tradeoffs between modeling complexity and prediction accuracy. The authors show that their framework can improve the prediction time without compromising prediction accuracy. They used standard benchmark data sets to analyze, compare, and demonstrate the effectiveness of their techniques using variations of Markov models and association rule mining. Their experiments show the effectiveness of the modified Markov model in reducing the number of paths without compromising accuracy. Additionally, the results support their analytical conclusion that accuracy improves with higher orders of all-Kth models.
In [5], the authors reviewed the current state-of-the-art solutions for the WPP. They analyzed the all-Kth Markov model and formulated its general accuracy and PR. Moreover, they proposed and presented the modified Markov model to reduce the complexity of the original Markov model. The modified Markov model successfully reduces the size of the Markov model while achieving comparable prediction accuracy. The prediction process in the two-tier model is shown in Figure 1. Additionally, they proposed and presented a two-tier prediction framework for Web prediction. They showed that their two-tier framework contributed to preserving accuracy (although only one classifier was consulted) and reducing prediction time.

Figure 1 Prediction process in the two-tier model [5].

They conducted an extensive set of experiments using different prediction models, namely, Markov, ARM, all-Kth Markov, all-Kth ARM, and a combination of them. They performed their experiments using three different data sets, namely, NASA, UOFS, and UAEU, with various parameters such as rank, partition percentage, and the maximum number of N-grams.
The results also show that smaller N-gram models perform better than higher N-gram models in terms of accuracy. This is because of the small number of experiences / sessions obtained during data processing of large N-grams. They also applied ranking to improve the prediction accuracy and to enhance its applicability. Their results show that increasing the rank improves prediction accuracy. However, they found that individual higher ranks contributed less to the prediction accuracy. In addition, the results clearly show that better prediction is always achieved when combining all-Kth ARM and all-Kth Markov models. Finally, for the two-tier framework, their results show the efficacy of the EC in reducing the prediction time without compromising the prediction accuracy.

Figure 2 Summary of NASA and UOFS data sets [5].

They considered three data sets, namely, the NASA data set, the University of Saskatchewan’s (UOFS) data set, and the United Arab Emirates University (UAEU) data set [6]. Figure 2 shows brief statistics of each data set. In addition to many other items, the preprocessing of a data set includes the following: grouping of sessions, identifying the beginning and the end of each session, assigning a unique session ID to each session, and filtering irrelevant records. In these experiments, they follow the cleaning steps and the session identification techniques introduced in [7].

3. PROPOSED TECHNIQUE
In our method, we classify Web pages based on hyperlink relations and the site structure. Links are made by Web
designers based on relevance of content and certain interests of their own. We use this concept to build a category based dynamic prediction model. For example, in a general portal all pages under the movies section fall under a single unique class.

Figure 3 Snapshot of log file

We assume that a user will preferably visit a next page that belongs to the same class as the current page. To apply this concept we consider a set of dominant links that point to pages that define a particular category. All the pages followed by that particular link remain in the same class. The pages are categorized further into levels according to the page rank in the initial period and, later, the users’ access frequency [8] [9] [10].
To begin with, HTTP requests arrive at the Predictor Algorithm. The Predictor Algorithm uses the data from the data structure for prediction. We categorize the users on the basis of the related pages they access. Our model is divided into levels based on the popularity of the pages. Each level is a collection of disjoint classes and each class contains related pages. A page placed in a higher level has a higher probability of being predicted.

Algorithm for Access data from Log file:
Get input arguments from log file
Analyze data format
 if format is true
     Detect line termination character
     Open file and set position indicator to end of header
         Find line break
         Find line positions Li
         Filter data lines
         Find line break positions
         Update line break indices
         Read data Di
         for i=0 to length Li
             Sx = P(Di)
             count Sx
     return Sx
 else
     File data format is not true

The major problem in this field is that the prediction models have been dependent on history data or log data [11]. They were unable to make predictions in the initial stages [12]. We present the algorithm of our Oft based prediction model.

Oft Algorithm:
        Read data from dataset Dsi
        Find Page Pi from Dsi of the requested URL
        Get Class number Clx from dataset Dsi
        Get the links associated Ast with the page (Pi)
                Oftx ← Pi and Clx
                Prediction-Value (Pvalue) ↔ Oftx
                Pvalue ↔ [Oft Page, Class,]
                Oft(Pvalue)I = Pi with the URLs’ Oftx
        Compare the class numbers of the links with
        from Dsi of the requested URL
        The link having the same class number will get
        if Oftx is higher
            Predicted links to be sent to the users’ cache.
            Set Oft(Pvalue)I = Pi with the URLs’ in Rlti

In our approach, the Oft Page ranking system resides on the server side, and all the information that is required for the computation of our personalized Oft Page Rank algorithm can be derived from the website’s Web log file. We assume that these Web logs are preprocessed and all user sessions are identified. The access frequency (oft) of a page m, and the number of times page n was visited right after page m, can be obtained simply by counting the number of times page m appears and the number of times pages m and n appear consecutively in all user sessions.
In the case that page m is the last page of a user session, for which the access duration cannot be calculated, we compute the average time-length spent on page m over all user sessions and use it as the access time-length. Likewise, when the access time-length spent on a Web page by a user exceeds the average time-length spent on this page by a large percentage, we use this average access time-length as the user’s access time-length on this Web page.
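The counting and time-length rules described for the Oft Page Rank inputs can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation. The session layout (a list of `(page, seconds)` pairs, with `None` for the last page of a session), the function names, and the `cap_ratio` threshold standing in for "a large percentage" are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def count_visits(sessions):
    """Count how often each page appears and how often page n follows page m."""
    page_count = Counter()
    pair_count = Counter()
    for session in sessions:
        pages = [page for page, _seconds in session]
        page_count.update(pages)
        # Consecutive pairs (m, n): page n visited right after page m.
        pair_count.update(zip(pages, pages[1:]))
    return page_count, pair_count


def fill_time_lengths(sessions, cap_ratio=3.0):
    """Replace unknown or unusually long time-lengths by the page average.

    cap_ratio is a hypothetical threshold: a duration longer than
    cap_ratio times the page average is treated as an outlier.
    """
    total = defaultdict(float)
    seen = Counter()
    for session in sessions:
        for page, seconds in session:
            if seconds is not None:
                total[page] += seconds
                seen[page] += 1
    average = {page: total[page] / seen[page] for page in total}

    cleaned = []
    for session in sessions:
        row = []
        for page, seconds in session:
            mean = average.get(page)
            # Last page of a session (None) or outlier: fall back to the average.
            if mean is not None and (seconds is None or seconds > cap_ratio * mean):
                seconds = mean
            row.append((page, seconds))
        cleaned.append(row)
    return cleaned
```

In this sketch the average is computed over all observed durations, including any outliers; a stricter variant could exclude them before averaging.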

Let us consider a web domain as a directed graph G, where the nodes represent the web pages and the edges represent the links. Both nodes and edges carry weights: the weight w_m on node m is the total time-length all previous users spent on browsing page m, while the weight w_(n,m) on edge n→m represents the sum of the time-lengths spent on visiting page m when pages n and m were visited consecutively. If we consider all user sessions as 1st-order Markov Chains (in this case, the next page to be visited by a user only depends on the page the user is visiting currently), then w_m is the sum of the weights of the edges that point to node m. Let B_m be the set of pages that point to page m; we have the equation:

$$ w_m = \sum_{n \in B_m} w_{(n,m)} $$
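As a worked illustration of this relation, the following Python sketch accumulates the edge weights w_(n,m) from sessions and then sums the incoming edges of each node to obtain w_m. The session layout (lists of `(page, seconds)` pairs with all time-lengths already known) and the function names are assumptions for illustration, not part of the original method description.

```python
from collections import defaultdict


def edge_weights_from_sessions(sessions):
    """w_(n,m): total time spent on page m when m was visited right after n."""
    w_edge = defaultdict(float)
    for session in sessions:
        for (n, _), (m, seconds_on_m) in zip(session, session[1:]):
            w_edge[(n, m)] += seconds_on_m
    return dict(w_edge)


def node_weights(w_edge):
    """w_m: sum of the weights of all edges n -> m that point to page m."""
    w_node = defaultdict(float)
    for (_n, m), weight in w_edge.items():
        w_node[m] += weight
    return dict(w_node)
```

Under the 1st-order assumption, the first page of a session has no incoming edge, so in this sketch the time spent on it does not contribute to any w_m.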
From the definition of w_(n,m) we can see that if more previous users follow the path n→m and stay on page m for a longer time, the value of w_(n,m) will be larger; thus w_(n,m) covers both the access time-length and the access frequency of a page m.
In order to include the access frequency (oft) and access time-length of a page in the computation of our personalized Oft Page Rank algorithm (OPR), we adopt w_(n,m) as the biasing factor. When distributing its ranking value to its out links, page n will now propagate:

$$ \frac{w_{(n,m)}}{\sum_{k \in F_n} w_{(n,k)}} $$

units of its importance to page m in a non-uniform way, where F_n is the set of pages that page n points to. We also guarantee that the web domain's directed graph G is strongly connected, so that the calculation of OPR can converge to a certain value, by including the damping factor (1-α). Then we eliminate all dangling pages from G by adding a link to all other pages in G for pages with no out links [13].

4. EXPERIMENT AND RESULT ANALYSIS
In this research paper we use the Web logs of the website. We obtained the Web logs of a 2 week period in Feb 2012 to March 2012 and used the Web logs from 22/Feb/2012 to 07/Mar/2012 as the training data set.

Figure 4 Oft based web page prediction

We filtered the records and only reserved the hits requesting Web pages (such as *.htm, *.html, and *.aspx). When identifying user sessions.

Figure 5 Estimate for a day with system load

All required information about the pages of the Website is indexed using their URLs in a table where a URL acts as the key. When a request is received, a search on the table is conducted and the information thus obtained is analyzed. We chose the most popular paths because using these most accessed paths allowed us to provide a better representation of the typical navigational behaviors of Web users than paths with a low access frequency (oft).
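The ranking that these experiments rely on, as described in Section 3, can be illustrated with a short power-iteration sketch in Python. It is a hedged reading of the description, not the authors' MATLAB code. It assumes that α is the probability of following a link and (1-α) the uniform jump probability, that a page with no out links distributes its rank evenly to all other pages, and that the function name `oft_page_rank` and its parameters are hypothetical.

```python
def oft_page_rank(w_edge, alpha=0.85, iterations=100, tol=1e-10):
    """Biased PageRank: page n passes w_(n,m) / sum_k w_(n,k) of its rank to m."""
    pages = sorted({page for edge in w_edge for page in edge})
    n_pages = len(pages)
    if n_pages == 0:
        return {}
    if n_pages == 1:
        return {pages[0]: 1.0}

    # Total outgoing weight per page (denominator of the propagation share).
    out_total = {page: 0.0 for page in pages}
    for (n, _m), weight in w_edge.items():
        out_total[n] += weight

    rank = {page: 1.0 / n_pages for page in pages}
    for _ in range(iterations):
        # Uniform jump term contributed by the damping factor (1 - alpha).
        new_rank = {page: (1.0 - alpha) / n_pages for page in pages}
        for (n, m), weight in w_edge.items():
            if out_total[n] > 0:
                new_rank[m] += alpha * rank[n] * weight / out_total[n]
        # Dangling pages: treated as linking to every other page.
        for n in pages:
            if out_total[n] == 0:
                share = alpha * rank[n] / (n_pages - 1)
                for m in pages:
                    if m != n:
                        new_rank[m] += share
        delta = sum(abs(new_rank[page] - rank[page]) for page in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank
```

With edge weights built from time-lengths, a page reached often and read for longer receives a larger share of its predecessors' rank, which is the bias the text describes.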


Figure 6 Oft based prediction model

For each test session, its corresponding page is chosen, and the dynamic support-pruned all-Kth-order Markov model is applied only to that data set. Figure 6 shows that the prediction accuracy of the proposed model is higher. Out of the 15 days of log data, accurate predictions were made by the proposed model. Here we selected the most popular paths that previous users followed in the training data set, and each path was expanded to construct the corresponding sub-graph according to the sessions in the training data set.

5. CONCLUSION
The time-length spent on visiting a Web page and the frequency (oft) with which the Web page is accessed were used to bias Oft Page Rank so that it favors the pages that were visited for a longer time and more frequently than others. The growth of Web based applications has increased interest in analyzing Web usage data to better understand Web usage and in applying that knowledge to better serve users. This has led to a number of open issues in the Web usage mining area. This paper aims to serve as a source of ideas for people working on personalization of information systems. It proposes a simple approach to be used for user behavior pattern discovery. This outcome is based on experimental evaluation of several web log files over several periods.
However, in our experimental setup, we only use the kth-order Markov Chain model, which is “memory less”, to calculate the weights of edges in the directed graph of a website and to expand the sub-graph for a user’s current navigational path. In future work we will also take into account higher order Markov Chain models to improve the prediction accuracy, and apply the proposed approach to other data sets to evaluate its reliability and performance. We have also applied ranking to improve the prediction accuracy and to enhance its applicability. Our results show that increasing the rank improves prediction accuracy. However, we have found that individual higher ranks contribute less to the prediction accuracy. In addition, the results clearly show that better prediction is always achieved when all-Kth Markov models are applied. Finally, we obtained accurate predictions of Web pages from the data set.

REFERENCES
   [1]. M. S. Aktas, M.A. Nacar and F. Menczer. Personalizing Page Rank Based on domain Profiles, Proc. of WEBKDD 2004 Workshop, 2004.
   [2]. M. D. Mulvenna, S. S. Anand and A.G Buchner. Personalization on the net using web mining, Commun. ACM, 43, 8 (August), 123–125, 2000.
   [3]. C.P. Sumathi, R. Padmaja Valli, T. Santhanam. “Automatic Recommendation of Web Pages in Web Usage Mining”, (IJCSE) International Journal on Computer Science and Engineering Vol. 02, No. 09, 2010, 3046-3052.
   [4]. Faten Khalil, Jiuyong Li, Hua Wang, "Integrating Recommendation Models for Improved Web Page Prediction Accuracy", Thirty-First Australasian Computer Science Conference (ACSC2008), Wollongong, Australia. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 74, 2008.
   [5]. Mamoun A. Awad and Issa Khalil, "Prediction of User’s Web Browsing Behavior: Application of Markov Model", IEEE, 2012.
   [6]. Internet Traffic Archive. [Online]. Available:
   [7]. R. Cooley, B. Mobasher, and J. Srivastava, “Data preparation for mining World Wide Web browsing patterns,” J. Knowl. Inf. Syst., vol. 1, no. 1, pp. 5–32, 1999.
   [8]. Brin, S., Page, L.: The Anatomy of a Large-scale Hypertextual Web Search Engine. 7th WWW Int. Conf., Brisbane, Australia (1998) 107-117.
   [9]. Kleinberg, J.: Authoritative sources in a hyperlinked environment. 9th ACM-SIAM Symposium on Discrete Algorithms, ACM Press (1998) 668-677.
   [10]. Mukhopadhyay, D., Biswas, P.: FlexiRank: An Algorithm Offering Flexibility and Accuracy for Ranking the Web Pages. Lecture Notes in Computer Science, Vol. 3816. Springer-Verlag, Berlin Heidelberg New York (2005) 308-313.
   [11]. Su, Z., Yang, Q., Lu, Y., Zhang, H.: WhatNext: A Prediction System for Web Requests using N-gram Sequence Models. 1st Int. Conf. on Web Information System and Engineering (2000)
   [12]. Davison, B.D.: Learning Web Request Patterns. Web Dynamics: Adapting to Change in Content, Size, Topology and Use. Springer-Verlag, Berlin Heidelberg New York (2004) 435-460.
   [13]. Yong Zhen Guo, Kotagiri Ramamohanarao and Laurence A. F. Park, “Personalized PageRank for Web Page Prediction Based on Access Time-Length and Frequency”, WIC/ACM International Conference on Web Intelligence, IEEE, 2007.


Description: International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) Web Site: Email:, Volume 1, Issue 4, November – December 2012, ISSN 2278-6856, Impact Factor of IJETTCS for year 2012: 2.524