Data Mining for Web Personalization

Presented by the Highflyers group
Who are the Highflyers?
• Irfan Butt – Introduction and Traditional
  approaches to Web Personalization
• Joel Gascoigne – Data Collection,
  Preprocessing and Modelling
• James Silver – Pattern Discovery Predictive
  Web User Modelling Part 1
• Aaron John-Baptiste – Pattern Discovery
  Predictive Web User Modelling Part 2
• Asad Qazi – Evaluating Personalized Models
  and Conclusion
• Paper titled: Data Mining for Web Personalization
• Author: Bamshad Mobasher
                  Irfan Butt
Introduction and Traditional approaches to Web Personalization
Introduction to Web Personalization
• Personalization
 ▫ Delivery of content tailored to a particular user
• Web Personalization
 ▫ Delivery of dynamic content, such as text or links,
   tailored to a particular user or segment of users
Automatic Personalization Vs Customization
• Similarity: Both refer to delivery of content
• Difference: how the user profile is created and updated
• Examples
  ▫ Customization: My Yahoo, Dell Website
  ▫ Automatic Personalization: Amazon
Personalization in Traditional Approaches

• Two phases in the process of personalization
  1) Data Collection Phase    2) Learning Phase
• Classification based on learning from data
1. Memory Based Learning (Lazy)
  ▫ Examples: User-based collaborative system,
    Content-based filtering system
2. Model Based Learning (Eager)
   ▫ Examples: Item-based System
Memory Based Learning VS Model Based Learning

• Memory Based Learning (Lazy)
  ▫ Huge memory required
  ▫ Scalability issue
  ▫ Adaptable to changes

• Model Based Learning (Eager)
  ▫   Limited memory required
  ▫   Easily scalable
  ▫   Learning phase offline
  ▫   Not adaptable to changes
Traditional Approaches to Web Personalization
• Rule Based Personalization Systems
 ▫ Rules are used to recommend item
 ▫ Rules based on personal characteristics of user
 ▫ Static profiles result in degradation of system
Traditional Approaches to Web Personalization
• Content-based Filtering Systems
 ▫ User profile built on content descriptions of items
 ▫ Profile based on previous rating of items
Traditional Approaches to Web Personalization
• Collaborative Filtering Systems
 ▫ Each individual profile is built in the same way as in
   content-based filtering systems
 ▫ Items from more than one profile are used to
   recommend a new item or content
 ▫ These profiles are the k nearest neighbours of the
   target user, found by comparing the previous item
   ratings in each profile
 ▫ Poor results as the system grows
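
The k-nearest-neighbour idea above can be sketched roughly as follows. This is a minimal illustration, not the method of any particular system: the cosine similarity measure, the similarity-weighted average, the value of k and the toy ratings are all assumptions made for the example.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two rating profiles (dicts of item -> rating)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def predict(target, others, item, k=2):
    """Similarity-weighted average of the k nearest neighbours' ratings for item."""
    # Only profiles that actually rated the item can contribute.
    scored = [(cosine(target, p), p[item]) for p in others if item in p]
    neighbours = sorted(scored, key=lambda t: t[0], reverse=True)[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return None  # no usable neighbour
    return sum(sim * rating for sim, rating in neighbours) / total

# Hypothetical ratings on a 1-5 scale.
profiles = {
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob": {"a": 5, "b": 3, "c": 5, "d": 4},
    "carol": {"a": 1, "b": 5, "d": 2},
}
target = {"a": 5, "b": 3}
prediction = predict(target, list(profiles.values()), "d")
```

Every prediction scans all stored profiles, which is one way to see why this memory-based approach degrades as the system grows.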
Data Mining Approach to Personalization
• Data Mining (or Web Usage Mining)
 ▫ The automatic discovery and analysis of patterns
   in click stream and associated data collected or
   generated as a result of user interactions with Web
   resources on one or more Web sites
• Data Mining Cycle:
 ▫ Data preparation and transformation phase.
 ▫ Pattern discovery phase
 ▫ Recommendation phase
              Joel Gascoigne
Data Collection, Preprocessing and Modelling
Data Modelling and Representation
• Assume the existence of a set of m users:
  ▫ U = {u1, u2, …, um}

• Set of n items:
  ▫ I = {i1, i2, …, in}
Data Modelling and Representation
• The profile for a user u ∈ U is an n-dimensional
  vector of ordered pairs:
  ▫ u(n) = {(i1, su(i1)), (i2, su(i2)), …, (in, su(in))}

• Typically, such profiles are collected over time
  and stored
  ▫ Can be represented as an m x n matrix, UP
Data Modelling and Representation
• A Personalisation System, PS can be viewed as a
  mapping of user profiles and items to obtain a
  rating of interest

• The mapping is not generally defined for the
  whole domain of user-item pairs
 ▫ System must predict interest scores
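
As a rough sketch of this framework, UP can be held as a small matrix in which undefined user-item pairs are marked explicitly, and PS as a function that returns a stored score or falls back to a prediction. The function name `ps`, the use of `None` for undefined pairs and the item-mean predictor are assumptions made for illustration; a real system would substitute one of the models discussed later.

```python
# Rows are users u1..um, columns are items i1..in; None marks an undefined pair.
UP = [
    [5, 3, None],
    [4, None, 2],
    [None, 1, 4],
]

def ps(up, user, item):
    """Return the stored interest score, or a predicted one when it is undefined."""
    known = up[user][item]
    if known is not None:
        return known
    # Placeholder predictor (an assumption): mean of the other users' scores.
    column = [row[item] for row in up if row[item] is not None]
    return sum(column) / len(column) if column else None

stored = ps(UP, 0, 0)      # pair with a recorded score
predicted = ps(UP, 0, 2)   # pair the system has to predict
```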
Data Modelling and Representation
• This general framework can be used with most
  approaches to personalisation

• In the data mining approach:
 ▫ A variety of machine learning techniques are
   applied to UP to discover aggregate user models
 ▫ These user models are used to make a prediction
   for the target user
Data Sources for Web Usage Mining
• Main data sources used in web usage mining are
  server log files
  ▫ Clickstream data

• Other data sources include the site files and meta-data
Data Sources for Web Usage Mining
• This data needs to be abstracted
  ▫ Pageview
     ▪ Representation of a collection of web objects
  ▫ Session
     ▪ A sequence of pageviews by a single user

• All sessions belonging to a user can be
  aggregated to create the profile for that user
Data Sources for Web Usage Mining
• Content data
 ▫ Collection of objects and relationships conveyed to
   the user
    ▪ Text
    ▪ Images
 ▫ Also, semantic or structural meta-data embedded
   within the site
    ▪ Domain ontology
      ▪ Could use an ontology language such as RDF
      ▪ Or a database schema
Data Sources for Web Usage Mining
• Also, operational databases for the site may
  include additional information about user and
 ▫ Geographic information
 ▫ User ratings
Primary Tasks in Data Preprocessing for
Web Usage Mining
Data Preprocessing for Web Usage Mining
• Goal:
 ▫ Transform click-stream data into a set of user
   profiles

• This “sessionized” data can be used as the input
  for a variety of data mining algorithms or further
  analysis
Data Preprocessing for Web Usage Mining
• Tasks in usage data preprocessing:
 ▫   Data Fusion
 ▫   Data Cleaning
 ▫   Pageview Identification
 ▫   Sessionization
 ▫   Episode Identification
Data Preprocessing for Web Usage Mining
• Data Fusion:
 ▫ Merging of log files from web and application
   servers

• Data Cleaning:
 ▫ Tasks such as:
    ▪ Removing extraneous references to embedded
      objects
    ▪ Removing references due to spider navigations
Data Preprocessing for Web Usage Mining
• Pageview Identification:
 ▫ Aggregation of collection of objects or pages,
   which should be considered a unit
 ▫ This process is dependent on the linkage structure
   of the site
 ▫ In the simplest case, each HTML file has a
   one-to-one correspondence with a pageview
 ▫ Must distinguish between users
    ▪ Authentication system or cookies
Data Preprocessing for Web Usage Mining
• Sessionization:
 ▫ Process of segmenting the user activity log of each
   user into sessions, each representing a single visit
   to the site

• Episode Identification:
 ▫ Episode is a subsequence of a session comprised
   of related pageviews
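
One common sessionization heuristic is to start a new session whenever a user has been inactive for longer than a fixed timeout. The sketch below assumes that heuristic; the 30-minute threshold, the `(user, timestamp, page)` record layout and the sample log are illustrative assumptions, not something the slides specify.

```python
from collections import defaultdict

TIMEOUT = 30 * 60  # seconds; an assumed inactivity threshold

def sessionize(log, timeout=TIMEOUT):
    """Split each user's time-ordered pageviews into sessions at long gaps."""
    by_user = defaultdict(list)
    for user, timestamp, page in sorted(log, key=lambda e: (e[0], e[1])):
        by_user[user].append((timestamp, page))
    sessions = []
    for user, views in by_user.items():
        current = [views[0][1]]
        for (prev_t, _), (t, page) in zip(views, views[1:]):
            if t - prev_t > timeout:
                # Gap too long: close the current visit and start a new one.
                sessions.append((user, current))
                current = []
            current.append(page)
        sessions.append((user, current))
    return sessions

# Hypothetical cleaned log entries: (user id, seconds since start, pageview).
log = [
    ("u1", 0, "/home"),
    ("u1", 120, "/products"),
    ("u2", 60, "/home"),
    ("u1", 4000, "/home"),
    ("u1", 4100, "/cart"),
]
sessions = sessionize(log)
```

Here user u1 ends up with two sessions because of the long gap between the second and third requests.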
Data Preprocessing for Web Usage Mining
• These tasks ultimately result in a set of n pageviews
  ▫ P = {p1, p2, …, pn}
• A set of v user transactions
  ▫ T = {t1, t2, …, tv}

• A user transaction captures the activity of a user
  during a particular session
Data Preprocessing for Web Usage Mining
• Finally, one or more transactions or sessions
  associated with a given user can be aggregated to
  form the final profile for that user
 ▫ If the profile is generated from a single session, it
   represents short-term interests
 ▫ Aggregation of multiple sessions results in profiles
   that capture long-term interests
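
A minimal sketch of this aggregation step, assuming each profile weight is simply the number of times the user visited a pageview across all of their sessions. Other weightings (binary visits, time spent on a page) are equally plausible; the function name and data are hypothetical.

```python
from collections import Counter

pageviews = ["/home", "/products", "/cart"]   # P = {p1, ..., pn}
sessions = [                                   # (user, pageviews in one session)
    ("u1", ["/home", "/products"]),
    ("u1", ["/home", "/cart"]),
    ("u2", ["/home"]),
]

def build_profiles(sessions, pageviews):
    """Aggregate each user's sessions into one row of visit counts per pageview."""
    counts = {}
    for user, pages in sessions:
        counts.setdefault(user, Counter()).update(pages)
    users = sorted(counts)
    # One row per user, one column per pageview; unseen pageviews count as 0.
    matrix = [[counts[u][p] for p in pageviews] for u in users]
    return users, matrix

users, UP = build_profiles(sessions, pageviews)
```

Passing only a user's most recent session would give a short-term profile, while passing all of them gives the long-term one.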
Data Preprocessing for Web Usage Mining
• The collection of these profiles comprises the m
  x n matrix UP which can be used to perform
  various data mining tasks

• After basic clickstream preprocessing steps, data
  from other sources is integrated:
 ▫ Content, structure and user data
                 James Silver
Pattern Discovery Predictive Web User Modelling
                      Part 1
Model-Based Collaborative Techniques

• Two-stage recommendation process:
  ▫ (A) Offline model-building (B) Real-time scoring
   (Explicit & Implicit user behavioural data used)
• Offline model-building algorithms:
   (1) Clustering,
   (2) Association Rule Discovery,
   (3) Sequential Pattern Discovery,
   (4) Latent Variable Models (part 2)

 We also look at hybrid models (part 2)
(1) Clustering
• Clustering divides data into groups where:
 ▫ Inter-cluster similarities are minimised
 ▫ Intra-cluster similarities are maximised

• Generalization to Web usage mining
  ▫ User-based vs. Item-based clustering
  ▫ Efficiency and scalability improvements
(1) Clustering: User-based
• User profiles
• Partitions Matrix UP
 ▫ Clusters represent user segments based on
   common navigational behaviour
• Recommendations (target user u, target item i)
 ▫ Centroid vector vk computed for each cluster Ck
 ▫ Neighbourhood: All user segments that have a
   score for i and whose vk is most similar to u
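
The centroid-based recommendation step might look roughly like this. The sketch assumes the clusters have already been found by some clustering algorithm, uses cosine similarity, and simplifies the neighbourhood to the single most similar segment; all of these are assumptions, as are the toy profiles.

```python
from math import sqrt

def centroid(rows):
    """Mean vector v_k of the profiles assigned to one cluster C_k."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend_score(u, item, centroids):
    """Score for item from the most similar centroid that has a weight for it."""
    candidates = [v for v in centroids if v[item] > 0]
    if not candidates:
        return None
    best = max(candidates, key=lambda v: cosine(u, v))
    return best[item]

# Hypothetical binary profiles over four items, already split into two segments.
clusters = [
    [[1, 1, 0, 0], [1, 1, 1, 0]],
    [[0, 0, 1, 1], [0, 1, 1, 1]],
]
centroids = [centroid(rows) for rows in clusters]
u = [1, 1, 0, 0]
score = recommend_score(u, 2, centroids)
```

Only the centroids are compared against the target user at recommendation time, which is where the efficiency gain over scanning every profile comes from.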
(1) Clustering: Other
• Fuzzy Clustering
 ▫ Desirable to let a user belong to several categories
   at once
• Distance issues
 ▫ Consider web-transactions as sequences

• Association Rule Hypergraph Partitioning
(2) Association Rule Discovery
 Finding groups of pages or items that are commonly
            accessed or purchased together
• Originally for mining supermarket basket data
• Discovering Association Rules involves:
 1) Discovering frequent itemsets
    ▪ Satisfying a minimum support threshold
 2) Discovering association rules
    ▪ Satisfying a minimum confidence threshold
(2) Association Rules: Concepts
• Transactions set T
• Itemsets I = {I1,I2,...,Ik} over T
• Association rule r has the form X => Y (sr, cr)
 ▫ sr = the support of X U Y
   (i.e. probability that X and Y occur together in a
   transaction)
 ▫ cr = the confidence of the rule r
   (i.e. the conditional probability that Y occurs in a
   transaction, given that X has occurred in that transaction)
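
Support and confidence can be computed directly from these definitions. The sketch below uses hypothetical page-visit transactions; the function names are chosen for the example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Support of X ∪ Y divided by the support of X (X must occur at least once)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

# Hypothetical transactions, each a set of visited pages.
transactions = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "about"},
    {"products", "cart"},
]
s_r = support({"products", "cart"}, transactions)       # support of X ∪ Y
c_r = confidence({"products"}, {"cart"}, transactions)  # confidence of X => Y
```

For the rule {products} => {cart}, two of the four transactions contain both pages and three contain "products", so the support is 0.5 and the confidence is 2/3.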
(2) Recommendations
• Matching rule antecedents with target user profiles
  ▫ Sliding window solution
  ▫ Naive approach
  ▫ Frequent Itemset Graph
• Finding Candidate pages:
  ▫ Match current user session window with previously
    discovered frequent itemsets
• Recommendation Value
  ▫ Confidence of corresponding association rule
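
A rough sketch of the naive matching approach: take the last few pages of the active session as the window, find rules whose antecedent is contained in it, and use the rule confidence as the recommendation value. The rule list, window size and tie handling are assumptions for illustration; a frequent itemset graph would avoid scanning every rule.

```python
def recommend(session, rules, window=2):
    """Match the last `window` pages of the session against rule antecedents."""
    active = set(session[-window:])
    scores = {}
    for antecedent, consequent, conf in rules:
        # Skip pages the user has already seen in this session.
        if set(antecedent) <= active and consequent not in session:
            scores[consequent] = max(conf, scores.get(consequent, 0.0))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical previously mined rules: (antecedent, consequent, confidence).
rules = [
    (("products",), "cart", 0.67),
    (("home", "products"), "specials", 0.40),
    (("about",), "contact", 0.90),
]
recs = recommend(["home", "products"], rules)
```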
(3) Sequential Models
• Now we consider the order when discovering
  frequently occurring itemsets.
• So: given the user transaction {i1,i2,i3}
  ▫ Association rules (i1=>i2) and (i2=>i1) are fine
  ▫ But sequential pattern (i2=>i1) not supported
• Two types of sequences:
  ▫ Contiguous (closed) sequence
  ▫ Open Sequence
• Example from the slide: i1,i2 => {i1,i2,i3}

• Frequent Navigational Paths
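
The difference between the two sequence types can be made concrete with two small containment checks. This is a sketch with hypothetical function names: an open sequence allows other items between the pattern's items, a contiguous one does not, and both respect order.

```python
def is_open_subsequence(pattern, transaction):
    """True if the pattern's items appear in order, with gaps allowed."""
    remaining = iter(transaction)
    # Each membership test consumes the iterator up to the matched item.
    return all(item in remaining for item in pattern)

def is_contiguous_subsequence(pattern, transaction):
    """True if the pattern's items appear in order with nothing in between."""
    n = len(pattern)
    return any(list(transaction[k:k + n]) == list(pattern)
               for k in range(len(transaction) - n + 1))

t = ["i1", "i2", "i3"]
open_match = is_open_subsequence(["i1", "i3"], t)
contiguous_match = is_contiguous_subsequence(["i1", "i3"], t)
reversed_match = is_open_subsequence(["i2", "i1"], t)
```

Here i1 followed by i3 is supported as an open sequence but not as a contiguous one, and i2 followed by i1 is supported by neither, since order now matters.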
(3) Recommendations
• Trie-structure (aggregate tree)
 ▫ Each node is an item, root is the empty sequence
• Recommendation Generation
 ▫ Found in O(s) by traversing the tree
   ‘s’ = the length of the current user transaction deemed to be useful in
   recommending the next set of items
 ▫ Sliding window w
    ▪ Maximum depth of tree therefore is |w|+1
 ▫ Controlling the size of the tree
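
A minimal sketch of such an aggregate tree for contiguous sequences, assuming a window of size 2 so that no path is deeper than 3. The class and function names, the count-based confidence and the sample transactions are assumptions made for the example.

```python
class Node:
    """One trie node: an item plus how often the path ending here was seen."""
    def __init__(self):
        self.count = 0
        self.children = {}

def build_trie(transactions, window=2):
    """Insert every contiguous subsequence of length <= window + 1."""
    root = Node()  # the root stands for the empty sequence
    for t in transactions:
        for start in range(len(t)):
            node = root
            for item in t[start:start + window + 1]:
                node = node.children.setdefault(item, Node())
                node.count += 1
    return root

def next_items(root, session, window=2):
    """Follow the last `window` items, then rank the children by confidence."""
    node = root
    for item in session[-window:]:
        node = node.children.get(item)
        if node is None:
            return []
    if node.count == 0:
        return []
    ranked = [(item, child.count / node.count)
              for item, child in node.children.items()]
    return sorted(ranked, key=lambda kv: kv[1], reverse=True)

# Hypothetical transactions (ordered pageview sequences).
transactions = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"], ["b", "c"]]
root = build_trie(transactions)
recs = next_items(root, ["a", "b"])
```

The lookup walks one node per item in the active window, which is the O(s) traversal mentioned above; limiting the window also bounds the size of the tree.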
(3) Sequential Models: Contiguous
• Contiguous sequence patterns are particularly
 ▫ Valuable in page pre-fetching applications
 ▫ Rather than in general context of recommendation
(3) Sequential Models: Markov
• Another approach for sequential modelling
 ▫ Based on Stochastic methods
• Modelling the navigational activity in the website
  as a Markov chain
(3) Sequential Models: Markov
• A Markov model is represented by the 3-tuple <A, S, T>
 ▫ A: set of possible actions (items)
 ▫ S: set of n states for which the model is built
   (visitor’s navigation history)
 ▫ T = [pi,j] (n x n): Transition Probability Matrix
    ▪ pi,j: probability of a transition from state si to state sj
• Order : Number of prior events used in
  predicting each future event
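
For a first-order model, where each state is a single pageview, the transition probabilities can be estimated from session data by counting. The sketch below assumes maximum-likelihood estimates with no smoothing, and the sessions are hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(sessions):
    """Estimate p[i][j] as count(i -> j) / count(i -> anything)."""
    counts = defaultdict(Counter)
    for s in sessions:
        for current, nxt in zip(s, s[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in counts.items()
    }

# Hypothetical sessions, each an ordered list of pageviews.
sessions = [
    ["home", "products", "cart"],
    ["home", "products", "home"],
    ["home", "about"],
]
T = transition_probabilities(sessions)
most_likely = max(T["home"], key=T["home"].get)  # predicted next action after "home"
```

A higher-order model would use tuples of the last k pageviews as states, at the cost of a much larger and sparser matrix.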
(3) Markov for Web-mining
• Designed to predict the next user action based
  on the user’s previous surfing behaviour
• Also used to discover high-probability user
  navigational paths in a website
 ▫ User-preferred trails
• Various optimization methods
• Apart from Markov: Mixture Models
            Aaron John-Baptiste
Pattern Discovery Predictive Web User Modelling
                      Part 2
(4) Latent Variable Models (LVMs)
• Latent Variables are variables that haven't been
  directly observed but have rather been inferred.
 ▫ E.g. Morale is not measured directly but inferred

• Have more recently become popular as a
  modelling approach in web usage mining
• Two commonly used LVMs
 ▫ Finite Mixture Models (FMM)
 ▫ Factor Analysis (FA)
(4) FA and FMM
• Factor Analysis
 ▫ Aims to summarise and find relationships within
   observed data (all data)
 ▫ Used in pattern recognition, collaborative filtering
   and personalization based web usage mining
• Finite Mixture Models (FMM)
 ▫ Use a finite number of components to model the
   probability of an observation (a pageview or a
   user rating)
(4) Drawbacks to pure usage based models
• Pure usage based models have drawbacks
 ▫ Process relies on user transactions or rating data
 ▫ New items or pages are therefore never
   recommended (“new item problem”)
 ▫ Also do not use knowledge from the underlying
   domain and so cannot make more complex
   recommendations
(5) Hybrid models
• Use a combination of usage-based and
  content-based modelling.
• Three main types used in web mining
 ▫ Integrating content features
 ▫ Integrating semantic knowledge
 ▫ Using Linkage structure
(5) Integrating content features with
usage-based models
• Solves “new item problem”
  ▫ Use content characteristics of pages with user-
    based data
  ▫ Extract keywords from content to be used to
    discover patterns
  ▫ Because the model does not rely on usage data
    alone, new pages with relevant content can be
    recommended
  ▫ Users' interests can be mapped to content
    (concepts or topics)
(5) Integrating structured semantic
knowledge with usage-based models
• Content feature integration is useful when pages
  are rich in text and keywords
• However, it cannot capture more complex
  relationships where items have underlying
  meanings
• Idea is to take the underlying meanings of
  objects and add them to the usage-based data.
  Recommendations can then be made for pages or
  items with similar semantic meanings
(5) Using Linkage structure for model
learning and selection
• Other semantic data can be used such as
  relational databases and the hyperlink structure
  on a web page
• Mobasher proposes a hybrid recommendation
  system that switches between different
  algorithms based on the degree of connectivity
  in the site and user
• E.g. in a highly connected website with short
  paths, non-sequential models performed better
                 Asad Qazi
Evaluating Personalized Models and Conclusion
Evaluating Personalization models

The primary goal of this section is to evaluate the
accuracy and effectiveness of web personalization
models
Why Evaluate?
•   More complex web-based applications and more complex user interaction
    require the selection of more sophisticated models
•   Need to further explore the impact of recommended model on user
•   There are several different modelling approaches to web personalization
•   Evaluating personalized models is an inherently challenging task: first,
    different models require different evaluation metrics; second, the
    required personalization actions may be quite different depending on the
    underlying domain, relevant data and intended application
•   Finally, there is also a lack of consensus among researchers as to what
    factors affect quality of service in personalized systems and of what
    elements contribute to user satisfaction
Common evaluation approaches
• A number of metrics have been proposed in the literature for
  evaluating the robustness and predictive accuracy of a
  recommender system. These include:
  ▫ Mean Absolute Error (MAE)
  ▫ Classification metrics (Precision and Recall)
  ▫ Receiver Operating Characteristic (ROC)
• Business metrics that measure customer loyalty and
  satisfaction, such as Recency Frequency Monetary (RFM)
• Other key dimensions used alongside these metrics:
  Accuracy, Coverage, Utility, Explainability, Robustness,
  Scalability and User Satisfaction
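
MAE, precision and recall are straightforward to compute once predictions and ground truth are available. The sketch below uses their standard definitions; the sample ratings and item lists are made up for illustration.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def precision_recall(recommended, relevant):
    """Precision: share of recommendations that are relevant.
    Recall: share of relevant items that were recommended."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical predicted vs. actual ratings, and recommended vs. relevant items.
mae = mean_absolute_error([4.0, 3.5, 2.0], [5.0, 3.0, 2.0])
precision, recall = precision_recall(["a", "b", "c", "d"], ["b", "d", "e"])
```

MAE suits systems that predict ratings, while precision and recall suit systems that return a list of recommended pages or items.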
Conclusion
•   Web personalisation is viewed as an application of data mining which
    dynamically serves customized content (pages, products,
    recommendations, etc.) to users based on their profiles, preferences, or
    expected interests
•   We have also discussed the various sources of data available to
    personalization systems, the modelling approaches employed and the
    current approaches to evaluating these systems
•   Recent user studies have found that a number of issues can affect the
    perceived usefulness of personalization systems including, trust in the
    system, transparency of the recommendation logic, ability for a user to
    refine the system generated profile and diversity of recommendations
•   Most personalization systems tend to use a static profile of the user.
    However, user interests are not static; they change with time and context.
    Few systems have attempted to handle the dynamics within the user
    profile
Any Questions?
