(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 10, October 2011
ISSN 1947-5500

Survey on Web Usage Mining: Pattern Discovery and Applications

Ms. C. Thangamani, Research Scholar
Mother Teresa Women's University
Kodaikanal

Dr. P. Thangaraj, Prof. & Head
Department of Computer Science & Engineering
Bannari Amman Institute of Technology, Sathy

Abstract--- The past decade has been characterized by unprecedented growth of the Web, both in the number of Web sites and in the number of users accessing them. This growth has generated huge quantities of data about user interaction with Web sites, recorded in Web log files. In addition, Web site owners have expressed the need to understand their visitors better so as to provide them with satisfying Web sites. Web Usage Mining (WUM) has been developed in recent years to discover knowledge from such data. WUM consists of three phases: the preprocessing of raw data, the discovery of patterns and the analysis of results. A WUM technique extracts usage behavior from Web usage data. The large amount of Web usage data makes the analysis difficult. When applied to large quantities of data, existing data mining techniques usually give unsatisfactory results regarding the behavior of Web site users. This paper analyzes the various Web usage mining techniques. This analysis will help researchers to develop better techniques for Web usage mining.

Keywords--- Web Usage Mining, World Wide Web, Pattern Discovery, Data Cleaning

1. Introduction

Web Usage Mining is a component of Web Mining, which in turn is a part of Data Mining. Just as Data Mining involves extracting significant and valuable information from huge quantities of data, Web Usage Mining involves extracting the access patterns of the users of a Web site. The extracted information can then be used in various ways, such as improving the application or detecting fraudulent activity.

Web Usage Mining [16, 17] is often regarded as part of the Business Intelligence of an organization rather than as a purely technical aspect. It is used for deciding business strategies through the efficient use of Web applications. It is also essential for Customer Relationship Management (CRM), as it can help ensure customer satisfaction as far as the interaction between the customer and the organization is concerned.

The main difficulty with Web Mining in general, and Web Usage Mining in particular, is the kind of data that must be processed. With the increase in Internet usage, the number of Web sites has grown greatly and a large number of transactions take place every second. Apart from the volume of the data, the data is not completely structured. It is semi-structured, so it requires considerable preprocessing and parsing before the necessary information can be extracted from it.

Web Data

In Web Usage Mining [18], data can be gathered from server logs, browser logs, proxy logs, or obtained from an organization's database. These data collections differ in the location of the data source, the types of data available, the user population from which the data was gathered, and the method of implementation.

There are various kinds of data that can be utilized in Web Mining.

   i.      Content
  ii.      Structure
 iii.      Usage

Data Sources

The data sources utilized in Web Usage Mining may include web data repositories such as:

Web Server Logs - These are logs which record the history of page requests. The World Wide Web Consortium (W3C) maintains a standard format for web server log files, but other non-standard formats also exist. New entries are typically appended to the end of the file. Information about each request, including client IP address, request date/time, page requested, HTTP code, bytes served, user agent, and referrer, is normally recorded. This information can be gathered into a single file, or split into separate logs such as an access log, error log, or referrer log. On the

other hand, server logs usually do not record user-specific data. These files are typically not available to regular Internet users; they are accessible only to the webmaster or other administrative staff. A statistical analysis of the server log may be used to examine traffic patterns by time of day, day of week, referrer, or user agent.

Proxy Server Logs - A Web proxy is a caching mechanism that sits between client browsers and Web servers. It helps to reduce the load time of Web pages and also the network traffic at both ends (server and client). A proxy server log contains the HTTP requests made by various clients. It may serve as a data source for discovering the usage patterns of a group of anonymous users sharing the same proxy server.

Browser Logs - Browsers such as Mozilla and Internet Explorer can be modified, or JavaScript and Java applets can be used, to gather client-side information. Client-side data collection needs the cooperation of the user, either to enable the JavaScript and Java applets or to use the modified browser voluntarily. Client-side collection has an advantage over server-side collection because it reduces both the bot detection and the session identification problems.

Web log mining usually involves the following phases:

    •    Preprocessing

    •    Pattern Discovery

    •    Pattern Analysis

This paper analyzes the various existing techniques in terms of the phases described above.

    2. Related Works

Web usage mining and statistical analysis are two ways to evaluate the usage of a Web site. Among Web usage mining techniques, graph mining covers complex Web browsing patterns such as parallel browsing. Among statistical analysis techniques, analyzing page browsing time yields valuable information about a Web site, its usage and its users. Heydari et al. [1] suggested a graph-based Web usage mining technique which combines Web usage mining and statistical analysis by taking client-side data into account. In other words, it combines graph-based Web usage mining with browsing time analysis by considering client-side data. It helps Web site owners to predict user sessions accurately and to improve the Web site. It is intended to predict Web usage patterns more accurately.

Web usage mining is a data mining technique for mining the information in Web server log files. It can determine the browsing behavior of users and certain correlations among Web pages. Web usage mining supports Web site design, personalization services and other business decision making. Web mining applies data mining, artificial intelligence, graph techniques and so on to Web data, profiles the visiting characteristics of users, and then extracts their browsing patterns. Han et al. [2] studied a Web mining algorithm based on usage mining and also set out the design approach for applying the technique to an electronic business Web site. The technique is simple, efficient and easy to understand, and it suits the Web usage mining requirements of building a low-budget Web site.

Web usage mining takes advantage of data mining methods to extract valuable information from the usage behavior of World Wide Web (WWW) users. The required characteristics are captured by Web servers and stored in Web usage data logs. The initial stage of Web usage mining is the preprocessing stage, in which irrelevant data is first cleared from the logs. Preprocessing is an important step in Web usage mining, because its outcome feeds further processing such as transaction identification, path analysis, association rule mining and sequential pattern mining. Inbarani et al. [3] proposed rough set based feature selection for Web log mining. Feature extraction is a preprocessing phase in Web usage mining, and it is highly effective in reducing high-dimensional data to low dimensions by removing irrelevant data, which increases learning accuracy and improves comprehensibility.

Web usage mining has become popular in different business fields associated with Web site improvement. In Web usage mining, frequent navigational patterns of interest are extracted in terms of Web page addresses from the Web server visit logs, and the patterns are used in various applications, including recommendation. The semantic information of the Web page content is usually not integrated in Web usage mining. Salin and Senkul [4] proposed a framework that uses semantic information for Web usage mining based recommendation. The frequent browsing paths are extracted in terms of ontology instances instead of Web page addresses, and the outcome is used to generate Web page recommendations for the user. Additionally, an evaluation mechanism is implemented in order to test the success of the prediction. The experimental results suggest that highly precise


prediction can be achieved by considering semantic data in Web usage mining.

In Web Usage Mining, web session clustering plays a major role in categorizing web users according to their browsing behavior and a similarity measure. Swarm-based web session clustering helps to manage web resources efficiently in several ways, such as web personalization, layout modification, website modification and web server performance. Hussain et al. [5] proposed a hierarchical cluster based preprocessing methodology for Web Usage Mining. The framework covers the data preprocessing phase, which prepares the web log data and converts the categorical web log data into numerical data. A session vector is generated so that a suitable similarity measure and swarm optimization can be used to cluster the web log data. The hierarchical cluster based technique improves on conventional web session methods by providing more structured information about user sessions.

Mining the Web server log files can reveal the session behavior of users and several types of correlations among Web pages. Web usage mining supports Web site construction, personalization services and other business decision making. Web server log files store many navigation sessions, in which each page attribute is a Boolean value. Fang et al. [6] suggested a double algorithm of Web Usage Mining based on sequence number, with the aim of improving the efficiency of existing techniques and reducing the time spent scanning the database. It is well suited to extracting user browsing behavior. The technique converts the session pattern of a user into binary form and then uses an up-and-down search approach to generate candidate frequent itemsets in both directions. It calculates support by the sequence number dimension when scanning the session patterns of users, which differs from the existing double search mining technique. The evaluation indicates that the proposed system is faster and more accurate than existing algorithms.

Huge quantities of information are collected automatically by Web servers and stored in access log files. Analysis of server access logs can provide considerable and useful information. Web Usage Mining is the application of data mining techniques to the discovery of usage patterns from Web data. It analyzes the secondary data derived from the behavior of users during Web sessions. Web usage mining consists of three stages: preprocessing, pattern discovery, and pattern analysis. Etminani et al. [7] proposed a web usage mining technique for discovering users' navigational patterns using Kohonen's Self Organizing Map (SOM). The authors apply SOM to preprocessed Web logs and gather the frequent patterns.

Web usage mining [19] uses data mining approaches to discover interesting usage patterns from the available web data. Web personalization uses web usage mining approaches to achieve customization. Customization is concerned with acquiring knowledge through the analysis of a user's navigational activities. A user who goes online is more likely to follow links that suit his or her needs on the Web site being browsed. The next business requirement in the online industry will therefore be to personalize or customize Web pages to satisfy each individual's needs. Personalizing Web pages involves clustering Web pages that share a common usage pattern. As clusters keep growing because of the increase in users or in their interests, optimizing the clusters becomes an inevitable requirement. Alphy Anna and Prabakaran [8] developed a cluster optimization methodology based on the nestmate recognition capability of ants, which is used to remove the data redundancies that may arise after the clustering performed by web usage mining techniques. An ART1 neural network based technique is used for the clustering itself. The "AntNestmate approach for cluster optimization" is presented to personalize the web page clusters of target users.

The Internet has become an essential tool for everyone, and Web usage mining [20] has likewise become a hot topic. It uses the huge amounts of data in Web server logs and other relevant data sets for mining analysis and obtains a valuable knowledge model about the usage of the Web site concerned. Much research has been done on positive association rules in Web usage mining; however, negative association rules are also significant, and Yang Bin et al. [9] have therefore applied negative association rules to Web usage mining. Experimental results show that negative association rules play a significant role in describing the access patterns of Web visitors and help to resolve problems that arise when only positive association rules are considered.

Web usage mining (WUM) is a kind of Web mining which uses data mining techniques to obtain useful information from the navigation patterns of Web users. The data must be preprocessed to improve the efficiency of the mining process and to simplify it. Preprocessing is therefore an important step before applying data mining techniques to discover user access patterns from the Web log. The main purpose of data preprocessing is to prune noisy and irrelevant data and to reduce the data volume for the pattern discovery stage. Aye [10] concentrates chiefly on the data preprocessing stage, the initial phase of


Web usage mining, with activities such as field extraction and data cleaning. Field extraction separates the fields from each single line of the log file. Data cleaning removes inconsistent or unwanted items from the data to be analyzed.

The Internet is one of the fastest growing areas of intelligence gathering. When users browse a Web site, they leave many records of their actions. This enormous amount of data can be a valuable source of knowledge. Sophisticated mining processes are required to extract, understand and use this knowledge effectively. Web Usage Mining (WUM) systems are designed specifically to perform this task by analyzing the usage data of a particular Web site. WUM can model user behavior and, consequently, predict users' future navigation. Online prediction is one of the major applications of Web Usage Mining. However, the accuracy of prediction and classification in existing architectures for predicting users' future needs still cannot satisfy users, particularly on large Web sites. In order to provide online prediction effectively, Jalali et al. [11] proposed an advanced architecture for online prediction in a Web Usage Mining system and developed a novel method based on the LCS (longest common subsequence) algorithm for classifying user navigation patterns in order to predict users' future needs.

Web Usage Mining is one of the significant approaches to web recommendation, but most studies are restricted to using the web server log, and their applications are limited to serving a specific web site. To address this, Yu Zhang et al. [12] proposed a novel WWW-oriented web recommendation system based on mining the enterprise proxy log. The authors first analyze the differences between the web server log and the enterprise proxy log, and then develop an incremental data cleaning approach according to these differences. In the data mining phase, the technique uses a clustering algorithm with hierarchical URL similarity. Experiments show that the system can apply Web Usage Mining effectively in this new setting.

Data mining concentrates on techniques for the non-trivial extraction of implicit, previously unknown, and potentially useful information from extremely large amounts of data. Web mining is simply the application of data mining techniques to Web data. Web Usage Mining (WUM) is a significant category of Web mining. It is an essential and rapidly developing field in which much research has already been done. Jianxi Zhang et al. [13] enhanced the fuzzy clustering approach to discover groups which share common interests and behaviors by analyzing the data collected in Web servers.

Web usage mining is one of the major applications of data mining techniques to the logs of large Web data repositories, with the aim of producing results that can be used in areas such as Web site design, user classification, designing adaptive Web sites and Web site personalization. Data preprocessing is a vital phase in Web usage mining. The outcome of data preprocessing is significant for the next phases, such as transaction identification, path analysis, association rule mining and sequential pattern mining. Zhang Huiying and Liang Wei [14] developed the "USIA" algorithm and examined its merits and demerits. Experiments show that USIA is not only more efficient but also identifies users and sessions accurately.

Web personalization systems are typical applications of Web usage mining. A Web personalization system is structured around an online component and an offline component. The offline component builds the knowledge base by analyzing past user profiles, and this knowledge base is then used by the online component. Common Web personalization systems generally use offline data preprocessing, and the mining procedure is not time-limited. However, this approach is not suitable for real-time dynamic environments. Consequently, there is a need for high-performance online Web usage mining approaches to solve these problems. Chao et al. [15] developed a comprehensive online data preprocessing procedure using Stochastic Timed Petri Nets (STPN). Their approach defines an architecture for online Web usage mining in a data stream environment and also provides an online Web usage mining system, built with STPN, that offers personalized online Web services.

3.   PROBLEMS AND DIRECTIONS

Web usage mining helps in predicting the pages of a Web site that will interest its users. Design guidance can be derived from the mined data so as to increase the number of users. At the same time, the gathered data need to be reliable enough to yield accurate predictions.

Several researchers have proposed ideas to enhance web usage mining. The existing works can be extended in the following ways to satisfy these requirements:

First, preprocessing can be improved by considering additional information when removing irrelevant web log records. This can be done by using information such as browsing time, number of visits, etc.
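The preprocessing direction described above, removing irrelevant web log records with the help of browsing time and number of visits, can be illustrated with a minimal sketch. It assumes that each cleaned log record already carries an estimated browsing time. The PageView record, the filter_irrelevant function, and the thresholds min_seconds and min_visits are hypothetical names and values chosen for illustration; they are not taken from any of the surveyed works.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PageView:
    """One cleaned log record; all field names are illustrative assumptions."""
    user: str               # e.g. client IP address or a session identifier
    url: str                # page requested
    seconds_on_page: float  # browsing time estimated for this view


def filter_irrelevant(views, min_seconds=2.0, min_visits=2):
    """Keep views that lasted long enough on pages that were visited often enough.

    min_seconds and min_visits are hypothetical thresholds, not values
    reported in the survey.
    """
    # Count visits per page over all records, before any filtering.
    visits = Counter(v.url for v in views)
    return [
        v for v in views
        if v.seconds_on_page >= min_seconds and visits[v.url] >= min_visits
    ]


views = [
    PageView("10.0.0.1", "/index.html", 30.0),
    PageView("10.0.0.1", "/products.html", 0.5),  # too brief to count as interest
    PageView("10.0.0.2", "/index.html", 12.0),
    PageView("10.0.0.2", "/rare.html", 40.0),     # page visited only once overall
]
kept = filter_irrelevant(views)
print([(v.user, v.url) for v in kept])
```

With these sample records, only the two views of /index.html are kept. In practice the thresholds would have to be tuned for each Web site, and other signals could be combined in the same way.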


Next, the focus is on grouping the browsing patterns, which will assist in better prediction. The clustering algorithm used should therefore be chosen appropriately so as to give better prediction. Also, when determining user behavior, repeated sessions can be eliminated so as to avoid redundancy.

4.   CONCLUSION

Web mining is the extraction of interesting and useful information and implicit knowledge from the behavior of users on the WWW. Web servers record and gather data about user interactions every time requests for web pages are received. Analysis of these Web access logs can assist in understanding user behavior and the web structure. From a business and applications viewpoint, information gathered from Web usage patterns can be used directly to manage activities efficiently in e-business, e-services, e-education, on-line communities, etc. Accurate Web usage data can help to attract new customers, retain present customers, enhance cross marketing and sales, improve the effectiveness of promotional campaigns, track leaving customers and identify an efficient logical structure for the Web space. User profiles can be constructed by combining users' navigation paths with other data characteristics such as page viewing time, hyperlink structure, and page content. However, as the size and complexity of the data increase, the statistics provided by conventional Web log analysis techniques may prove insufficient, and more intelligent mining methods will be required. This paper discusses some of the existing web usage mining techniques and is intended to assist researchers in developing better strategies for web usage mining.

REFERENCES

[1]   Heydari, M., Helal, R.A. and Ghauth, K.I., "A graph-based web usage mining method considering client side data", International Conference on Electrical Engineering and Informatics, Pp. 147-153, 2009.

[2]   Qingtian Han, Xiaoyan Gao and Wenguo Wu, "Study on Web Mining Algorithm based on Usage Mining", 9th International Conference on Computer-Aided Industrial Design and Conceptual Design, Pp. 1121-1124, 2008.

[3]   Inbarani, H.H., Thangavel, K. and Pethalakshmi, A., "Rough Set Based Feature Selection for Web Usage Mining", International Conference on Computational Intelligence and Multimedia Applications, Pp. 33-38, 2007.

[4]   Salin, S. and Senkul, P., "Using semantic information for web usage mining based recommendation", 24th International Symposium on Computer and Information Sciences, Pp. 236-241, 2009.

[5]   Hussain, T., Asghar, S. and Fong, S., "A hierarchical cluster based preprocessing methodology for Web Usage Mining", 6th International Conference on Advanced Information Management and Service (IMS), Pp. 472-477, 2010.

[6]   Gang Fang, Jia-Le Wang, Hong Ying and Jiang Xiong, "A Double Algorithm of Web Usage Mining Based on Sequence Number", International Conference on Information Engineering and Computer Science, 2009.

[7]   Etminani, K., Delui, A.R., Yanehsari, N.R. and Rouhani, M., "Web usage mining: Discovery of the users' navigational patterns using SOM", First International Conference on Networked Digital Technologies, Pp. 224-249, 2009.

[8]   Alphy Anna and Prabakaran, S., "Cluster optimization for improved web usage mining using ant nestmate approach", International Conference on Recent Trends in Information Technology (ICRTIT), Pp. 1271-1276, 2011.

[9]   Yang Bin, Dong Xiangjun and Shi Fufu, "Research of WEB Usage Mining Based on Negative Association Rules", International Forum on Computer Science-Technology and Applications, Pp. 196-199, 2009.

[10]  Aye, T.T., "Web log cleaning for mining of web usage patterns", 3rd International Conference on Computer Research and Development (ICCRD), Pp. 490-494, 2011.

[11]  Jalali, M., Mustapha, N., Sulaiman, N.B. and Mamat, A., "A Web Usage Mining Approach Based on LCS Algorithm in Online Predicting Recommendation Systems", 12th International Conference Information Visualisation, Pp. 302-307, 2008.

[12]  Yu Zhang, Li Dai and Zhi-Jie Zhou, "A New Perspective of Web Usage Mining: Using Enterprise Proxy Log", International Conference on Web Information Systems and Mining (WISM), Pp. 38-42, 2010.

[13]  Jianxi Zhang, Peiying Zhao, Lin Shang and Lunsheng Wang, "Web usage mining based on fuzzy clustering in identifying target group", International Colloquium on Computing, Communication, Control, and Management, Pp. 209-212, 2009.

[14]  Zhang Huiying and Liang Wei, "An intelligent algorithm of data pre-processing in Web usage mining", Intelligent Control and Automation, Pp. 3119-3123, 2004.

[15]  Chao, Ching-Ming, Yang, Shih-Yang, Chen, Po-Zung and Sun, Chu-Hao, "An Online Web Usage Mining System Using Stochastic Timed Petri Nets", 4th International Conference on Ubi-Media Computing (U-Media), Pp. 241-246, 2011.

[16]  Hogo, M., Snorek, M. and Lingras, P., "Temporal Web usage mining", International Conference on Web Intelligence, Pp. 450-453, 2003.

[17]  DeMin Dong, "Exploration on Web Usage Mining and its Application", International Workshop on Intelligent Systems and Applications, Pp. 1-4, 2009.

[18]  Chih-Hung Wu, Yen-Liang Wu, Yuan-Ming Chang and Ming-Hung Hung, "Web Usage Mining on the Sequences of Clicking Patterns in a Grid Computing Environment", International Conference on Machine Learning and Cybernetics (ICMLC), Vol. 6, Pp. 2909-2914, 2010.

[19]  Tzekou, P., Stamou, S., Kozanidis, L. and Zotos, N., "Effective Site Customization Based on Web Semantics and Usage Mining", Third International IEEE Conference on Signal-Image Technologies and Internet-Based System, Pp. 51-59, 2007.

[20]  Wu, K.L., Yu, P.S. and Ballman, A., "SpeedTracer: A Web usage mining and analysis tool", IBM Systems Journal, Vol. 37, No. 1, Pp. 89-105, 1998.

                             AUTHORS' PROFILE

          1. Ms. C. Thangamani

                Research Scholar

                Mother Teresa Women's University


          2. Dr. P. Thangaraj, Prof. & Head

                Department of Computer Science & Engineering

                Bannari Amman Institute of Technology

