Web Usage Mining Processes and Applications


  Qiaoyuan Jiang
  CSE 8331
  November 24, 2003


   Brief overview of Web mining
   Web usage mining
   Application areas of Web usage mining
   Future research directions
   Conclusions

Web Mining
   Web mining is the application of
    data mining techniques to discover
    and retrieve useful information and
    patterns from World Wide Web
    documents and services [Etzioni,

Web Mining Categories
   Web Content Mining: extracting
    knowledge from the content of the Web
   Web Structure Mining: discovering
    the model underlying the link structures
    of the Web
   Web Usage Mining: discovering
    users’ navigation patterns and predicting
    users’ behavior

Web Usage Mining Processes
   Preprocessing: conversion of the raw data into the
    data abstractions (users, sessions, episodes,
    clickstreams, and pageviews) needed before a
    data mining algorithm can be applied.
   Pattern Discovery: the key component of WUM,
    which brings together algorithms and techniques from
    data mining, machine learning, statistics, pattern
    recognition, and related research areas.
   Pattern Analysis: validation and interpretation of
    the mined patterns.

Web Usage Mining -
Preprocessing
   Data Cleaning: remove outliers and/or irrelevant data
   User Identification: associate page references with
    different users
   Session Identification: divide all pages accessed
    by a user into sessions
   Path Completion: add important page access
    records that are missing in the access log due to
    browser and proxy server caching
   Formatting: format the sessions according to the
    type of data mining to be accomplished.
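
The user and session identification steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log records, the use of an (IP address, user agent) pair as a user key, and the 30-minute inactivity timeout are all hypothetical choices, not something the slides prescribe.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical, already-cleaned log records: (ip, user_agent, timestamp, url).
LOG = [
    ("1.2.3.4", "Mozilla", "2003-11-24 10:00:00", "/index.html"),
    ("1.2.3.4", "Mozilla", "2003-11-24 10:05:00", "/products.html"),
    ("1.2.3.4", "Mozilla", "2003-11-24 11:30:00", "/index.html"),
    ("5.6.7.8", "Opera", "2003-11-24 10:01:00", "/index.html"),
]

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed threshold, not from the slides


def sessionize(records, timeout=SESSION_TIMEOUT):
    """Group requests by user, then split each user's requests into sessions."""
    # User identification: treat each (IP, user agent) pair as one user.
    by_user = defaultdict(list)
    for ip, agent, ts, url in records:
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        by_user[(ip, agent)].append((when, url))

    # Session identification: start a new session after a long gap.
    sessions = []
    for user, hits in by_user.items():
        hits.sort()
        current = [hits[0][1]]
        for (prev_time, _), (time, url) in zip(hits, hits[1:]):
            if time - prev_time > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions


for user, pages in sessionize(LOG):
    print(user, pages)
```

Path completion and formatting are left out of the sketch because they depend on the site topology and on the mining algorithm that consumes the sessions.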

Web Usage Mining -
Pattern Discovery Tasks
   Statistical Analysis
   Clustering
   Classification
   Association Rules
   Sequential Patterns
   Dependency Modeling

Web Usage Mining -
Pattern Discovery Tasks (Cont.)
   Statistical Analysis: frequency analysis, mean,
    median, etc.
       Improve system performance
       Provide support for marketing decisions
       Simplify site modification task
   Clustering:
       Clustering of users helps discover groups of
        users with similar navigation patterns => provide
        personalized Web content
       Clustering of pages helps discover groups of
        pages having related content => search engine
Web Usage Mining -
Pattern Discovery Tasks (Cont.)
   Classification: a technique for mapping a data
    item into one of several predefined classes
       Develop profiles of users belonging to a particular
        class or category
   Association Rules: discover correlations
    among pages accessed together by a client
       Help restructure the Web site
       Page prefetching
       Develop e-commerce marketing strategies
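
As a rough illustration of association rules over pages accessed together, the sketch below computes support and confidence for page pairs. The session data and the function names are hypothetical, and only rules with a single page on each side are considered; a real miner such as Apriori would also handle larger itemsets.

```python
from itertools import combinations

# Hypothetical sessions, each reduced to the set of pages it contains.
SESSIONS = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]


def count(itemset, sessions):
    """Number of sessions that contain every page in itemset."""
    return sum(1 for s in sessions if itemset <= s)


def pair_rules(sessions, min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) for rules lhs => rhs."""
    pages = sorted(set().union(*sessions))
    rules = []
    for x, y in combinations(pages, 2):
        both = count({x, y}, sessions)
        pair_support = both / len(sessions)
        if pair_support < min_support:
            continue
        for lhs, rhs in ((x, y), (y, x)):
            # Confidence: share of sessions with lhs that also contain rhs.
            confidence = both / count({lhs}, sessions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


for lhs, rhs, sup, conf in pair_rules(SESSIONS, 0.5, 0.7):
    print(f"{lhs} => {rhs}  support={sup:.2f} confidence={conf:.2f}")
```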

Web Usage Mining -
Pattern Discovery Tasks (Cont.)
   Sequential Patterns: extract frequently occurring
    inter-session patterns in which the presence of a set of
    items is followed by another item in time order
       Predict future user visit patterns => placing ads or
       Page prefetching
   Dependency Modeling: determine if there are any
    significant dependencies among the variables in the Web
       Predict future Web resource consumption
       Develop business strategies to increase sales
       Improve navigational convenience of users
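
To make the idea of a sequential pattern concrete, the following sketch counts ordered page pairs, where one page is visited before another in the same session, and keeps those that meet a minimum support. The sessions and the function name are hypothetical, and the sketch is restricted to patterns of length two; full sequential pattern miners handle longer sequences.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions as time-ordered page lists.
SESSIONS = [
    ["home", "catalog", "cart"],
    ["home", "catalog", "help"],
    ["catalog", "home", "cart"],
    ["home", "cart"],
]


def frequent_ordered_pairs(sessions, min_support):
    """Map (earlier, later) page pairs to the share of sessions containing them."""
    counts = Counter()
    for pages in sessions:
        # combinations() keeps input order, so (a, b) means a came before b.
        # The set ensures each pair is counted once per session.
        seen = {(a, b) for a, b in combinations(pages, 2) if a != b}
        counts.update(seen)
    total = len(sessions)
    return {pair: c / total for pair, c in counts.items() if c / total >= min_support}


for (first, later), sup in frequent_ordered_pairs(SESSIONS, 0.5).items():
    print(f"{first} -> {later}  support={sup:.2f}")
```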

Web Usage Mining -
Pattern Analysis
   Pattern Analysis is the final stage of WUM,
    which involves the validation and interpretation
    of the mined patterns
   Validation: eliminate the irrelevant rules or
    patterns and extract the interesting rules or
    patterns from the output of the pattern
    discovery process
   Interpretation: the output of mining algorithms
    is mainly in mathematical form and not suitable
    for direct human interpretation
Web Usage Mining -
Pattern Analysis Methodologies and Tools

   Visualization: help people to understand both
    real and abstract concepts
       WebViz: the Web is visualized as a directed graph
   Query mechanism: allow analysts to extract
    only relevant and useful patterns by
    specifying constraints.
       WEBMINER
   On-Line Analytical Processing (OLAP): enable
    analysts to perform ad hoc analysis of data in
    multiple dimensions for decision-making
       WebLogMiner

WEBMINER Query Example
   Find all ARs with a minimum support of 1% and a minimum
    confidence of 90%. The analyst is interested only in
    clients from the ".edu" domain and in data from Nov. 1,
    2003 onward, where page accesses start with URL A and
    contain B and C in that order:

SELECT association-rules(A*B*C*)
FROM log.data
WHERE date>=031101 AND domain="edu"
  AND support = 1.0 AND confidence = 90.0

Application Areas for
Web Usage Mining
   Personalized: discover the preferences and needs of
    individual Web users in order to provide a personalized
    Web site for certain types of users
   Impersonalized: examine general user navigation
    patterns in order to understand how general users
    use the site
       System Improvement
       Site Modification
       Business Intelligence
       Usage Characterization

System Improvement
   High performance of a web application is
    expected since it directly affects user’s
   WUM provides a key to understanding Web
    traffic behavior
   Applications
       Develop policies for Web caching, network
        transmission, load balancing, or data distribution
       Detect intrusion, fraud, and attempted break-ins to
        the system

Site Modification
   Besides content, the structure of a Web site is
    another crucial attribute for attracting users
   WUM can provide detailed feedback on users’
    navigation behavior, which can be used to
    redesign the Web site structure for users’
    navigational convenience
   Adaptive Web site project [Perkowitz &
    Etzioni, 1998-1999]
Business Intelligence
   Information on how customers are
    using a Web site is critical
    for marketers of e-commerce
    businesses
   WUM can support business process
    optimization and marketing decisions
   Business intelligence includes
    personalization for C2B systems
Usage Characterization
   Mining general usage patterns (without
    focusing on any specific users or Web sites)
    helps in the study of how browsers are
    used and how users interact with a
    browser interface.
   Enables the ability to look at the
    dynamics of the Web and how it is
Personalization
   Choosing among thousands of options is a
    challenge for Web users
   Goal: provide users with dynamic content
    tailored to their individual interests
   Form: recommending one or more items or
    pages to a user, based on the user’s profile and
    usage behavior, or on the patterns of past visitors
    who have similar profiles.
   Performance Measurement:
       Effectiveness: accuracy + coverage
       Scalability
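
One possible reading of the effectiveness measures is sketched below. It assumes that accuracy is measured as precision (the share of recommended pages the user actually went on to visit) and that coverage is the share of actually visited pages that were recommended. These definitions and the function name are assumptions for illustration; the slides do not define the measures.

```python
def precision_and_coverage(recommended, actually_visited):
    """Return (precision, coverage) for one recommendation set."""
    recommended = set(recommended)
    actually_visited = set(actually_visited)
    hits = recommended & actually_visited
    # Precision: how many recommendations were useful.
    precision = len(hits) / len(recommended) if recommended else 0.0
    # Coverage: how many of the visited pages were anticipated.
    coverage = len(hits) / len(actually_visited) if actually_visited else 0.0
    return precision, coverage


precision, coverage = precision_and_coverage(
    recommended=["a", "b", "c", "d"],
    actually_visited=["b", "d", "e"],
)
print(f"precision={precision:.2f} coverage={coverage:.2f}")
```

The two measures pull against each other: recommending fewer, safer pages tends to raise precision and lower coverage.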
Applications of Personalization
   Customizing access to information sources
   Filtering news or e-mails
   Recommendation services for the browsing
   Tutoring systems
   Search
   More ...

3 phases of Personalization
   Data preparation and transformation:
    data cleaning, filtering, transaction
   Pattern discovery: discovery usage
   Recommendation: generate personalized
    content for a user based on matching the
    user’s session. (online process)

Personalization Techniques –
Collaborative Filtering (CF)

   Pattern discovery: an online kNN algorithm is applied to
    user profiles in a given domain to match people who
    have the same taste.
   Recommendation: pages or items that interest
    the k nearest neighbors are assumed to interest the
    active user as well.
   Drawbacks:
       Online process =>Lack of scalability
       Static user profiles => low quality of recommendations
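
A minimal sketch of kNN-based collaborative filtering follows. The user profiles, the choice of Jaccard similarity, and the function names are hypothetical; the slides do not say which similarity measure is used.

```python
from collections import Counter

# Hypothetical static user profiles: the set of pages each user liked.
PROFILES = {
    "u1": {"p1", "p2", "p3"},
    "u2": {"p1", "p2", "p4"},
    "u3": {"p5", "p6"},
    "u4": {"p2", "p3", "p7"},
}


def jaccard(a, b):
    """Set overlap in [0, 1]; an assumed similarity measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(active_pages, profiles, k=2, top_n=3):
    """Recommend pages liked by the k profiles most similar to the active user."""
    # The neighbor search scans every profile at request time, which is
    # the online cost behind the scalability drawback.
    neighbors = sorted(
        profiles,
        key=lambda user: jaccard(active_pages, profiles[user]),
        reverse=True,
    )[:k]
    votes = Counter()
    for user in neighbors:
        votes.update(profiles[user] - active_pages)
    return [page for page, _ in votes.most_common(top_n)]


print(recommend({"p1", "p2"}, PROFILES))
```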

Personalization Techniques –
Clustering
   Technique: clustering user transactions
    and pageviews.
   Advantages:
       User preference is automatically learned
        from usage data and therefore up-to-date.
       Better scalability through clustering
   Drawbacks:
       Low accuracy

Personalization Techniques –
Association Rules (ARs)
   Technique:
       For each user, create a transaction that contains all the items
        the user has ever accessed.
       Find all rules that satisfy the given support and confidence.
       For each active user, find all the rules supported by the user. Items
        predicted by these rules are the candidate recommendations.
   Drawbacks:
       All association rules must be discovered prior to generating
        recommendations. This can be improved by generating ARs in
        real time from a subset of transactions within the active users
       A high support threshold => better scalability and accuracy, but
        low coverage.
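
The recommendation step of this technique can be sketched as follows, assuming the rules have already been mined offline. The rule set, its confidence values, and the function name are hypothetical.

```python
# Hypothetical mined rules: (antecedent pages, recommended page, confidence).
RULES = [
    (frozenset({"home", "catalog"}), "cart", 0.8),
    (frozenset({"home"}), "catalog", 0.6),
    (frozenset({"help"}), "contact", 0.9),
    (frozenset({"catalog"}), "cart", 0.5),
]


def recommend_from_rules(active_session, rules, top_n=2):
    """Fire every rule whose antecedent the active user supports."""
    active = set(active_session)
    best = {}
    for antecedent, page, confidence in rules:
        # A rule is supported when all antecedent pages were visited;
        # pages the user has already seen are not recommended again.
        if antecedent <= active and page not in active:
            best[page] = max(confidence, best.get(page, 0.0))
    return sorted(best, key=best.get, reverse=True)[:top_n]


print(recommend_from_rules(["home", "catalog"], RULES))
print(recommend_from_rules(["help", "home"], RULES))
```

When no rule matches the active session the result is empty, which is how the low-coverage drawback shows up in practice.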

Personalization Techniques –
Sequential Patterns (SPs)
   Technique: Markov Model
   Advantages:
       Better accuracy: SPs contain more precise
        information about user navigation behavior.
   Drawbacks:
       Low recommendation coverage
   More suitable for predictive tasks, e.g., Web
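
As an illustration of the Markov model technique, the sketch below fits a first-order model (the next page depends only on the current page) to a few sessions and predicts the most likely next page. The sessions and function names are hypothetical, and the slides do not state the order of the model.

```python
from collections import Counter, defaultdict

# Hypothetical sessions as time-ordered page lists.
SESSIONS = [
    ["home", "catalog", "cart"],
    ["home", "catalog", "help"],
    ["home", "catalog", "cart"],
    ["home", "help"],
]


def fit_transitions(sessions):
    """Estimate P(next page | current page) from consecutive page pairs."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, nxt in zip(pages, pages[1:]):
            counts[current][nxt] += 1
    return {
        page: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for page, followers in counts.items()
    }


def predict_next(transitions, page):
    """Most probable next page, or None if the page was never left in training."""
    followers = transitions.get(page)
    if not followers:
        return None
    return max(followers, key=followers.get)


model = fit_transitions(SESSIONS)
print(predict_next(model, "home"))
print(predict_next(model, "cart"))
```

The `None` result for pages with no observed successor mirrors the low recommendation coverage listed under the drawbacks.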

Personalization Techniques –
Hybrid Models
   Hybrid Models automatically switch among
    different personalization models based on
    localized degree of hyperlink connectivity.
       High connectivity degree => Non-SP models
       Low connectivity degree and deeper navigation
        path => SP models
   Performance: better than any individual

Future Research Directions
   Usage Mining on Semantic Web
       Help to build semantic Web
       With semantic Web, WUM can be improved
   Multimedia Web Data Mining
       Representation, problem solving and
        learning from Multimedia data is indeed a

Future Research Directions
   Soft Computing Technology for Web Mining
       Fuzzy logic: deals with imprecise and conceptual data. Used
        in clustering Web log data and mining ARs.
       Neural network:
           Adaptive to new data and information
           Suitable for parallel processing
           Robust to missing, confusing, ill-defined data
           Capable of modeling non-linear decision boundaries
           Effective for learning user profiles
       Genetic algorithm: randomized search and optimization guided
        by evaluation criteria.
           Efficient, adaptive, robust, suited to parallel processing
           Used in search and query optimization and in predicting user preferences

Future Research Directions
   Analysis of Discovered Patterns
       Research on efficient, flexible and powerful
        analysis tools
   More Applications
       Temporal evolutions of usage behavior
       Improving Web services
       Detect credit card fraud
       Privacy issues


