A Novel Technique to Predict Oftenly used Web Pages from Usage Patterns
International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 1, Issue 4, November – December 2012, ISSN 2278-6856, Impact Factor of IJETTCS for year 2012: 2.524
Niky Singhai¹, Prof. Rajesh Kumar Nigam²
¹ M-Tech (Computer Science and Engineering), TIT Bhopal, Bhopal (M.P.), India
² Associate Professor (Computer Science and Engineering), TIT Bhopal, Bhopal (M.P.), India
Abstract: In this paper, we analyze the Kth-order Markov model, formulate its general accuracy, and present a complete framework to predict Web page usage patterns from Web log files. The exponential proliferation of Web usage has dramatically increased the volume of Internet traffic and has caused serious performance degradation in terms of user latency. Two techniques used to improve user latency are caching and Web pre-fetching. Studies have been conducted on pre-fetching models based on decision trees, Markov chains, and path analysis. However, the increased use of dynamic pages, frequent changes in site structure, and changing user access patterns have limited the efficacy of static techniques. Approaches that rely solely on caching offer limited performance improvement because it is difficult for caching to handle the large number of increasingly diverse files. To perform successful prefetching, we must be able to predict the next set of pages that will be accessed by users. This paper aims to predict the next action from the results of previous actions. In Web prediction, the next action corresponds to the next page to be visited, and the previous actions correspond to the pages that have already been visited. The Kth-order Markov model gives the probability that a user will visit the kth page provided that the user has visited the ordered k-1 pages. In this paper, a novel Oftenly used Pages (Oft) Rank-like algorithm is proposed for conducting Web page prediction. As the tool for the algorithm implementations we chose the "language of choice in industrial world", MATLAB.
Keywords- Oftenly used pages (Oft), Decision trees, Markov chains, path analysis, Page Rank, pre-fetching, predict, Web page, Web content.
1. INTRODUCTION
The exponential proliferation of Web usage has dramatically increased the volume of Internet traffic and has caused serious performance degradation in terms of user latency and bandwidth on the Internet. The use of the World Wide Web has become indispensable in everybody's life, which has also made it critical to look for ways to accommodate an increasing number of users while preventing excessive delays and congestion.
Web mining has been classified into three areas: Web content mining, Web structure mining and Web usage mining. The most common applications include the ranking of the results of a web search engine and the provision of recommendations to users of (usually commercial) web sites, known as web personalization.
Even with the speed of today's Internet, web latency is still one of the major concerns of its users. Web servers collect huge amounts of data every day. When users search for information, the relevant data is prefetched from the web server. Reducing latency is particularly important for online businesses, since if their web pages cannot be opened within about eight seconds, they might lose customers.
However, if most prefetched web pages are not visited by the user in subsequent accesses, the limited network bandwidth and server resources will not be used efficiently, which may worsen the access latency problem. The most common approaches used for predicting web users' browsing patterns are the Markov model, sequential association rules, and clustering. The objective of a Web personalization system is to provide users with the information they want or need, without expecting them to ask for it explicitly [1]. Page Rank is used to rank web pages based on the results returned by a search engine after a user query. The ranking is performed by evaluating the importance of a page in terms of its connectivity to and from other important pages. In the context of navigating a web site, a page or path is important if many users have visited it before. We propose a novel approach that is based on a personalized version of Page Rank, applied to the navigational tree created by the previous users' navigations. Page Rank [2] determines the significance of Web pages and helps a search engine to choose high quality pages more efficiently.
As businesses move online, the competition between businesses to keep the loyalty of their old customers and to lure new customers is even more important, since a competitor's Web site may be only one click away. The fast pace and large amounts of data available in these online settings have recently made it imperative to use
automated data mining or knowledge discovery techniques to discover Web user profiles. These different modes of usage, the so-called mass user profiles, can be discovered using Web usage mining techniques that automatically extract frequent access patterns from the history of previous user click streams stored in Web log files. These profiles can later be harnessed toward personalizing the Web site to the user or to support targeted marketing.
Although there have been considerable advances in Web usage mining, there have been no detailed studies presenting a fully integrated approach to mine a real Web site with the challenging characteristics of today's Web sites, such as evolving profiles, dynamic content, and the availability of taxonomy or databases in addition to Web logs.
The Web site in our study is managed by a nonprofit organization that does not sell anything but only provides free information that is ideally complete, accurate, and up to date. Hence, it was crucial to understand the different modes of usage and to know what kind of information the visitors seek and read on the Web site and how this information evolves with time. For this reason, we perform clustering of the user sessions extracted from the Web logs to partition the users into several homogeneous groups with similar activities and then extract user profiles from each cluster as a set of relevant URLs.
This procedure is repeated in subsequent new periods of Web logging (such as biweekly); the previously discovered user profiles are then tracked, and their evolution pattern is categorized. When clustering the user sessions, we exploit the Web site hierarchy to give partial weights in the session similarity between URLs that are distinct and yet located closer together on this hierarchy. The Web site hierarchy is inferred both from the URL address and from a Web site database that organizes most of the dynamic URLs along an "is-a" ontology of items. We also enrich the cluster profiles with various facets, including search queries submitted just before landing on the Web site, and inquiring and inquired companies, in case users from (inquiring) companies inquire about any of the (inquired) companies listed on the Web site, which provide related services.
2. LITERATURE SURVEY
With the rising growth of Web users [3], Web-based organizations are keen to analyze the on-line browsing behavior of the users in their web site and learn (identify) their interest instantly in a session. The analysis of the user's current interest based on the navigational behavior may help the organizations to guide the users in their browsing activity and obtain relevant information in a shorter span of time. However, the resulting patterns obtained through data mining techniques did not perform well in predicting the future browsing patterns due to the low matching rate of the resulting rules and users' browsing behavior. The paper [3] focuses on recommender systems based on the user's navigational patterns and provides suitable recommendations to cater to the current needs of the user. The experimental results performed on real usage data from a commercial web site show a significant improvement in the recommendation effectiveness of the proposed system. Identification of the current interests of the user based on the short-term navigational patterns instead of explicit user information has proved to be one of the potential sources for recommendation of pages which may be of interest to the user. In that work, they classify and match an online user based on his browsing interests using statistical techniques. A novel approach for recommendations of unvisited pages has been suggested in that work. An offline data preprocessing and clustering approach is used to determine groups of users with similar browsing patterns.
The research in [4] addresses the need for improved Web page prediction accuracy, which would profit many applications, e-business in particular. Different Web usage mining frameworks have been implemented for this purpose, specifically association rules and Markov models. Each of these frameworks has its own strengths and weaknesses, and it has been shown that using each of these frameworks individually does not provide a suitable solution that answers today's Web page prediction needs. The authors endeavor to provide improved Web page prediction accuracy by using a novel approach that integrates clustering, association rules and Markov models according to some constraints.
Merging Web pages by web services according to functionality reduces the number of unique pages and, accordingly, the number of sessions. The categorized sessions were divided into 7 clusters using the k-means algorithm and according to the Cosine distance measure. Markov model implementation was carried out for the original data in each cluster. The clusters were divided into a training set and a test set each and 2-Markov model accuracy was calculated accordingly.
Then, using the test set, each transaction was considered as a new point and distance measures were calculated in order to define the cluster that the point belongs to. Next, 2-Markov model prediction accuracy was computed
considering the transaction as a test set and only the cluster that the transaction belongs to as a training set. Since association rule techniques require a minimum support factor and a confidence factor to be determined, the experimental data were used to help determine these factors. Only rules with a certain support factor and above a certain confidence threshold are considered.
In [5], Web prediction is treated as a classification problem in which we attempt to predict the next set of Web pages that a user may visit based on the knowledge of the previously visited pages. Predicting a user's behavior while surfing the Internet can be applied effectively in various critical applications. Such applications involve traditional tradeoffs between modeling complexity and prediction accuracy. They show that their framework can improve the prediction time without compromising prediction accuracy. They have used standard benchmark data sets to analyze, compare, and demonstrate the effectiveness of their techniques using variations of Markov models and association rule mining. Their experiments show the effectiveness of their modified Markov model in reducing the number of paths without compromising accuracy. Additionally, the results support their analysis conclusions that accuracy improves with higher orders of the all-Kth model.
In [5], they have reviewed the current state-of-the-art solutions for the WPP. They analyzed the all-Kth Markov model and formulated its general accuracy and PR. Moreover, they proposed and presented the modified Markov model to reduce the complexity of the original Markov model. The modified Markov model successfully reduces the size of the Markov model while achieving comparable prediction accuracy. The prediction process in the two-tier model is shown in Figure 1. Additionally, they proposed and presented a two-tier prediction framework in Web prediction. They showed that their two-tier framework contributed to preserving accuracy (although one classifier was consulted) and reducing prediction time.
They conducted an extensive set of experiments using different prediction models, namely, Markov, ARM, all-Kth Markov, all-Kth ARM, and a combination of them. They performed their experiments using three different data sets, namely, NASA, UOFS, and UAEU, with various parameters such as rank, partition percentage, and the maximum number of N-grams.
The results also show that smaller N-gram models perform better than higher N-gram models in terms of accuracy. This is because of the small number of experiences/sessions obtained during data processing of large N-grams. They have also applied ranking to improve the prediction accuracy and to enhance its applicability. Their results show that increasing the rank improves prediction accuracy.
Figure 1 Prediction process in the two-tier model [5].
However, they have found that individual higher ranks have contributed less to the prediction accuracy. In addition, the results clearly show that better prediction is always achieved when combining all-Kth ARM and all-Kth Markov models. Finally, for the two-tier framework, their results show the efficacy of the EC to reduce the prediction time without compromising the prediction accuracy.
Figure 2 Summary of NASA and UOFS data sets [5].
They considered three data sets, namely, the NASA data set, the University of Saskatchewan's (UOFS) data set, and the United Arab Emirates University (UAEU) data set [6]. Figure 2 shows brief statistics of each data set. In addition to many other items, the preprocessing of a data set includes the following: grouping of sessions, identifying the beginning and the end of each session, assigning a unique session ID for each session, and filtering irrelevant records. In these experiments, they follow the cleaning steps and the session identification techniques introduced in [7].
3. PROPOSED TECHNIQUE
In our method, we classify Web pages based on hyperlink relations and the site structure. Links are made by Web
designers based on relevance of content and certain interests of their own. We use this concept to build a category based dynamic prediction model. For example, in a general portal www.njiffy.com all pages under the movies section fall under a single unique class.
Figure 3 Snapshot of log file
We assume that a user will preferably visit a next page that belongs to the same class as the current page. To apply this concept we consider a set of dominant links that point to pages that define a particular category. All the pages followed by that particular link remain in the same class. The pages are categorized further into levels according to the page rank in the initial period and, later, the users' access frequency [8] [9] [10].
To begin with, HTTP requests arrive at the Predictor Algorithm. The Predictor Algorithm uses the data from the data-structure for prediction. We categorize the users on the basis of the related pages they access. Our model is divided into levels based on the popularity of the pages. Each level is a collection of disjoint classes and each class contains related pages. Each page placed in a higher level has a higher probability of being predicted.
Algorithm for Access data from Log file:
Get input arguments from log file
Analyze data format
if format is true
    Detect line termination character
    Open file and set position indicator to end of header
    Find line break
    Find line positions Li
    Filter data lines
    Find line break positions
    Update line break indices
    Read data Di
    for i=0 to length Li
        Sx = P(Di )
        count Sx
    return Sx
else
    File data format is not true
The major problem in this field is that the prediction models have been dependent on history data or log data [11]. They were unable to make predictions in the initial stages [12]. We present the algorithm of our Oft based prediction model.
Oft Algorithm:
Read data from dataset Dsi
Find Page Pi from Dsi of the requested URL
Get Class number Clx from dataset Dsi
Get the links associated Ast with the page (Pi )
Oftx ← Pi and Clx
Prediction-Value (Pvalue) ↔ Oftx
Pvalue ↔ [Oft Page, Class,]
Oft(Pvalue )I = Pi with the URLs' Oftx
Compare the class numbers of the links with the class number from Dsi of the requested URL
The link having the same class number will get preference.
if Oftx is higher
    Predicted links to be sent to the users' cache.
else
    Set Oft(Pvalue )I = Pi with the URLs' in Rlti
return
In our approach, the Oft Page ranking system resides on the server side and all the information that is required for the computation of our personalized Oft Page Rank algorithm can be derived from the website's Web log file. We assume that these Web logs are preprocessed and all user sessions are identified. The access oft (access frequency) of a page m, and the number of times page n was visited right after page m, can be obtained simply by counting the number of times page m appears and the number of times pages m and n appear consecutively in all user sessions, respectively.
In the case that page m is the last page of a user session, for which the access duration cannot be calculated, we can compute the average time-length spent on page m from all user sessions as its access time-length. Likewise, when the access time-length spent on a Web page by a user exceeds the average time-length spent on this page by a large percentage, we use this average access time-length as the user's access time-length on this Web page.
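As a minimal sketch of the counting step described above, assume sessions are already available as lists of (page, seconds) pairs. This data layout and the names `session_stats` and `cap_ratio` are illustrative assumptions, not part of the paper. An unknown duration (such as that of the last page of a session) is represented by `None` and replaced by the page average. The paper does not state what counts as "a large percentage", so `cap_ratio` is only a placeholder threshold.

```python
from collections import Counter, defaultdict

def session_stats(sessions, cap_ratio=3.0):
    """Count page visits and consecutive pairs, and clean time-lengths.

    sessions: list of sessions; each session is a list of (page, seconds),
    where seconds is None when the duration is unknown (e.g. last page).
    cap_ratio: hypothetical threshold for "exceeds the average by a
    large percentage"; the paper does not give a value.
    """
    visits = Counter()           # access oft of each page
    pairs = Counter()            # (m, n): times n was visited right after m
    totals = defaultdict(float)  # summed known time-lengths per page
    known = Counter()            # number of known time-lengths per page

    for session in sessions:
        for i, (page, secs) in enumerate(session):
            visits[page] += 1
            if secs is not None:
                totals[page] += secs
                known[page] += 1
            if i + 1 < len(session):
                pairs[(page, session[i + 1][0])] += 1

    average = {p: totals[p] / known[p] for p in known}

    cleaned = []
    for session in sessions:
        out = []
        for page, secs in session:
            avg = average.get(page, 0.0)
            if secs is None or secs > cap_ratio * avg:
                secs = avg  # fall back to the page's average time-length
            out.append((page, secs))
        cleaned.append(out)
    return visits, pairs, average, cleaned
```

The function returns the raw counts together with a cleaned copy of the sessions, so the original log data stays untouched.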
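The Oft Algorithm given in this section is stated only loosely, so the following is one possible reading rather than the authors' implementation. It assumes each page record carries a class number, an access count (oft), and its outgoing links. The dictionary layout `site`, the function name `predict_links`, and the `min_oft` threshold standing in for the "if Oftx is higher" test are all hypothetical.

```python
def predict_links(site, url, min_oft=1):
    """Return links of `url` to prefetch, same-class links first.

    site: dict mapping url -> {"cls": class number, "oft": access count,
          "links": list of linked urls}. This layout is an assumption.
    min_oft: stand-in for the pseudocode's "if Oftx is higher" test.
    """
    page = site[url]
    # Ignore links that point to pages missing from the data set.
    candidates = [u for u in page["links"] if u in site]
    # Links whose class matches the current page get preference;
    # within each group, a higher access oft ranks first.
    ranked = sorted(
        candidates,
        key=lambda u: (site[u]["cls"] != page["cls"], -site[u]["oft"]),
    )
    predicted = [u for u in ranked if site[u]["oft"] >= min_oft]
    # Fallback branch of the pseudocode: if no link passes the test,
    # return the related links in preference order.
    return predicted if predicted else ranked
```

The returned list would be what the server sends to the user's cache. How many links to send is left to the caller.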
Let us consider a web domain as a directed graph G, where the nodes represent the web pages and the edges represent the links. Both nodes and edges carry weights: the weight w_m on node m is the total time-length all previous users spent on browsing page m, while the weight w(n,m) on edge n→m represents the sum of the time-lengths spent on visiting page m when pages n and m were visited consecutively. If we consider all user sessions as 1st-order Markov chains (in this case, the next page to be visited by a user only depends on the page the user is visiting currently), then w_m is the sum of the weights of edges that point to node m. Let B_m be the set of pages that point to page m; we have the equation:
$$ w_m = \sum_{n \in B_m} w_{(n,m)} $$
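To make these definitions concrete, the sketch below builds the edge weights w(n,m) and the node weights w_m from timed sessions. The input layout (lists of (page, seconds) pairs with all durations already known) and the function name `graph_weights` are assumptions made for illustration. The first page of a session has no predecessor, so under the 1st-order reading its time is not attributed to any edge.

```python
from collections import defaultdict

def graph_weights(sessions):
    """Compute edge weights w(n,m) and node weights w_m.

    sessions: list of sessions, each a list of (page, seconds) pairs.
    Returns (edge_w, node_w) where edge_w[(n, m)] is the total time spent
    on m when m was visited right after n, and node_w[m] is the sum of
    the weights of all edges pointing to m.
    """
    edge_w = defaultdict(float)
    for session in sessions:
        # Walk over consecutive page pairs (n, m) in the session.
        for (n, _), (m, secs_m) in zip(session, session[1:]):
            edge_w[(n, m)] += secs_m

    node_w = defaultdict(float)
    for (n, m), w in edge_w.items():
        node_w[m] += w  # w_m = sum over n in B_m of w(n,m)
    return dict(edge_w), dict(node_w)
```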
From the definition of w(n,m) we can see that if more previous users follow the path n→m and stay on page m for a longer time, the value of w(n,m) will be larger; thus w(n,m) covers both the access time-length and the access frequency of page m.
In order to include the access oft and the access time-length of a page in the computation of our personalized Oft Page Rank algorithm (OPR), we adopt w(n,m) as the biasing factor. When distributing its ranking value to its out-links, page n will now propagate
$$ \frac{w_{(n,m)}}{\sum_{k \in F_n} w_{(n,k)}} $$
units of its importance to page m in a non-uniform way, where F_n is the set of pages that page n points to. We also guarantee that the web domain's directed graph G is strongly connected, so that the calculation of OPR can converge to a certain value, by including the damping factor (1-α). Then we eliminate all dangling pages from G by adding a link to all other pages in G for pages with no out-links [13].
4. EXPERIMENT AND RESULT ANALYSIS
In this research paper we use the Web logs of the www.bhumisoft.com website. We obtained the Web logs of a 2 week period in Feb 2012 to March 2012 and used the Web logs from 22/Feb/2012 to 07/Mar/2012 as the training data set.
Figure 4 Oft based web page prediction
We filtered the records and only reserved the hits requesting Web pages (such as *.htm, *.html, and *.aspx). When identifying user sessions.
Figure 5 Estimate for a day with system load
All required information about the pages of the Website is indexed using their URLs in a table where a URL acts as the key. When a request is received, a search on the table is conducted and the information thus obtained is analyzed. We chose the most popular paths because using these most accessed paths allowed us to provide a better representation of the typical navigational behaviors of Web users than paths with low access oft.
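The ranking computation described in Section 3 can be sketched as a weighted power iteration. This is an illustration under stated assumptions, not the authors' MATLAB code. Each page n passes on its rank in proportion to w(n,m), pages with no out-links are treated as linking to all other pages, and a uniform teleport term keeps the iteration convergent. The function name `oft_page_rank`, the default `alpha=0.15` (a conventional choice giving a damping factor 1-α of 0.85), and the stopping rule are assumptions.

```python
def oft_page_rank(edge_w, alpha=0.15, iters=100, tol=1e-10):
    """Weighted PageRank over edge weights edge_w[(n, m)] = w(n,m).

    alpha: teleport probability, so the damping factor is (1 - alpha).
    The value 0.15 is a conventional default, not taken from the paper.
    """
    pages = sorted({p for edge in edge_w for p in edge})
    n_pages = len(pages)
    if n_pages == 0:
        return {}
    if n_pages == 1:
        return {pages[0]: 1.0}

    # Total outgoing weight of each page: sum over k in F_n of w(n,k).
    out_total = {p: 0.0 for p in pages}
    for (n, _), w in edge_w.items():
        out_total[n] += w

    rank = {p: 1.0 / n_pages for p in pages}
    for _ in range(iters):
        new = {p: alpha / n_pages for p in pages}
        # Non-uniform propagation proportional to w(n,m).
        for (n, m), w in edge_w.items():
            if out_total[n] > 0:
                new[m] += (1 - alpha) * rank[n] * w / out_total[n]
        # Dangling pages: spread their rank over all other pages.
        for p in pages:
            if out_total[p] == 0:
                share = (1 - alpha) * rank[p] / (n_pages - 1)
                for q in pages:
                    if q != p:
                        new[q] += share
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank
```

Sorting the resulting ranks in descending order gives one way to pick the pages most worth prefetching.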
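Since the experiments rely on the Kth-order Markov model, a small next-page predictor of that kind is sketched here for reference. It counts which page follows each context of up to k pages and, in the spirit of the all-Kth model, predicts from the longest context seen in training. The function names and the representation of sessions as lists of page identifiers are assumptions. Support pruning, which the paper mentions, is not included.

```python
from collections import Counter, defaultdict

def train_all_kth(sessions, k):
    """Count next-page occurrences for every context of length 1..k."""
    model = defaultdict(Counter)
    for pages in sessions:
        for order in range(1, k + 1):
            for i in range(len(pages) - order):
                context = tuple(pages[i:i + order])
                model[context][pages[i + order]] += 1
    return model

def predict_next(model, history, k):
    """Predict the next page from the longest matching context.

    Falls back to shorter contexts when the longer one was never seen.
    Returns None if no context of the history appears in the model.
    """
    for order in range(min(k, len(history)), 0, -1):
        context = tuple(history[-order:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None
```

For example, after training with k=2 on sessions where "a", "b" is always followed by "c", the history ["a", "b"] yields "c" even if "b" alone is more often followed by another page.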
Figure 6 Oft based prediction model
For the test session, its corresponding page is chosen, and the dynamic support pruned all-Kth-order Markov model is applied only on that data set. Figure 6 shows that the prediction accuracy of the proposed model is higher. Out of the 15 days of log data, accurate predictions were made by the proposed model. Here we selected the most popular paths the previous users followed from the training data set, and each path was expanded to construct the corresponding sub-graph according to the sessions in the training data set.
5. CONCLUSION
The time-length spent on visiting a Web page and the oft with which the Web page is accessed were used to bias Oft Page Rank so that it favors the pages that were visited for a longer time and more frequently than others. The growth of Web-based applications has motivated analyzing Web usage data to better understand Web usage and applying that knowledge to better serve users. This has led to a number of open issues in the Web usage mining area. This paper aims to serve as a source of ideas for people working on personalization of information systems. It proposes an easy, simple approach to be used for user behavior pattern discovery. This outcome is based on experimental evaluation of several web log files over periods.
However, in our experimental setup, we only use the kth-order Markov chain model, which is "memory less", to calculate the weights of edges in the directed graph of a website and to expand the sub-graph for a user's current navigational path. In future work we will take into account higher order Markov chain models to improve the prediction accuracy and will apply the proposed approach to other data sets to evaluate its reliability and performance. We have also applied ranking to improve the prediction accuracy and to enhance its applicability. Our results show that increasing the rank improves prediction accuracy. However, we have found that individual higher ranks have contributed less to the prediction accuracy. In addition, the results clearly show that better prediction is always achieved when the all-Kth Markov model is applied. Finally, we obtained accurate predictions of Web pages from the data set.
REFERENCES
[1]. M. S. Aktas, M. A. Nacar and F. Menczer. Personalizing Page Rank Based on Domain Profiles. Proceedings of WEBKDD 2004 Workshop, 2004.
[2]. M. D. Mulvenna, S. S. Anand and A. G. Buchner. Personalization on the net using web mining. Commun. ACM, 43, 8 (August), 123-125, 2000.
[3]. C. P. Sumathi, R. Padmaja Valli, T. Santhanam. "Automatic Recommendation of Web Pages in Web Usage Mining", (IJCSE) International Journal on Computer Science and Engineering, Vol. 02, No. 09, 2010, 3046-3052.
[4]. Faten Khalil, Jiuyong Li, Hua Wang. "Integrating Recommendation Models for Improved Web Page Prediction Accuracy", Thirty-First Australasian Computer Science Conference (ACSC2008), Wollongong, Australia. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 74, 2008.
[5]. Mamoun A. Awad and Issa Khalil. "Prediction of User's Web Browsing Behavior: Application of Markov Model", IEEE, 2012.
[6]. Internet Traffic Archive. [Online]. Available: http://ita.ee.lbl.gov/html/traces.html
[7]. R. Cooley, B. Mobasher, and J. Srivastava, "Data preparation for mining World Wide Web browsing patterns," J. Knowl. Inf. Syst., vol. 1, no. 1, pp. 5-32, 1999.
[8]. Brin, S., Page, L.: The Anatomy of a Large-scale Hypertextual Web Search Engine. 7th WWW Int. Conf., Brisbane, Australia (1998) 107-117.
[9]. Kleinberg, J.: Authoritative sources in a hyperlinked environment. 9th ACM-SIAM Symposium on Discrete Algorithms, ACM Press (1998) 668-677.
[10]. Mukhopadhyay, D., Biswas, P.: FlexiRank: An Algorithm Offering Flexibility and Accuracy for Ranking the Web Pages. Lecture Notes in Computer Science, Vol. 3816. Springer-Verlag, Berlin Heidelberg New York (2005) 308-313.
[11]. Su, Z., Yang, Q., Lu, Y., Zhang, H.: WhatNext: A Prediction System for Web Requests using N-gram Sequence Models. 1st Int. Conf. on Web Information System and Engineering (2000) 200-207.
[12]. Davison, B.D.: Learning Web Request Patterns. Web Dynamics: Adapting to Change in Content, Size, Topology and Use. Springer-Verlag, Berlin Heidelberg New York (2004) 435-460.
[13]. Yong Zhen Guo, Kotagiri Ramamohanarao and Laurence A. F. Park, "Personalized PageRank for Web Page Prediction Based on Access Time-Length and Frequency", WIC/ACM International Conference on Web Intelligence, IEEE, 2007.