					             Analysing Clickstream Data:
            From Anomaly Detection to
                  Visitor Profiling
ECML/PKDD Discovery Challenge October 7, 2005 Porto, Portugal



                                                          Peter I. Hofgesang
                                                                hpi@few.vu.nl
                                                           Wojtek Kowalczyk
                                                        wojtek@few.vu.nl
                Web server data
•   7 internet shops (home electronics)
•   80.000 visitors (IP-addresses) in 25 days
•   0.5 million sessions
•   3 million clicks (records in a log file)
•   Example record:
    11;1076262912;193.170.198.122;
    eb5cbe50997fcb7f9155c6c194c832a8;/znacka/?c=162&tisk=ano;
    http://www.google.com./search?hl=cs&q=Sennheiser+HD+650
    &btnG=Vyhledat+Googlem&lr=lang_cs

• Objective: discover interesting patterns !!!
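A minimal sketch of how the example record could be parsed. The field order (shop id, Unix time, IP address, session id, requested page, referrer) is an assumption read off the example; the slide does not name the fields. The names `Click` and `parse_click` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Click:
    shop_id: int
    timestamp: datetime
    ip: str
    session_id: str
    page: str
    referrer: str


def parse_click(line: str) -> Click:
    # Split into at most six fields so a ';' inside the referrer would survive.
    shop, unix_time, ip, sid, page, referrer = line.strip().split(";", 5)
    return Click(
        shop_id=int(shop),
        # Assumption: the second field is seconds since the Unix epoch (UTC).
        timestamp=datetime.fromtimestamp(int(unix_time), tz=timezone.utc),
        ip=ip,
        session_id=sid,
        page=page,
        referrer=referrer,
    )


record = (
    "11;1076262912;193.170.198.122;"
    "eb5cbe50997fcb7f9155c6c194c832a8;/znacka/?c=162&tisk=ano;"
    "http://www.google.com./search?hl=cs&q=Sennheiser+HD+650"
    "&btnG=Vyhledat+Googlem&lr=lang_cs"
)
click = parse_click(record)
```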
              Data Mining Process

[Figure: processing pipeline in four stages]
• INPUT DATA: web access log data, shop information
• DATA PREPARATION: preprocessing, database, basic statistics
• SESSION IDENTIFICATION & DETECTION OF ANOMALIES: identified sessions
  based on a new definition, detection of anomalies
• PROFILE MINING: mixture model, content types, tree of profile sequences
      Anomalies/Strange things I
• Multiple IP-addresses per session
  –   2 IP-addresses: 3.051 sessions
  –   3 IP-addresses: 362 sessions
  –   4 IP-addresses: 113 sessions
  –   ………………
  –   22 IP-addresses:    1 session
  –   Some sessions involve IP-addresses from different countries


• A few sessions (12) refer to multiple shops
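A count like the one above (sessions per number of distinct IP-addresses) could be obtained roughly as follows. This is only a sketch: it assumes the clicks are available as (session id, IP-address) pairs, and the function name `ip_count_histogram` is hypothetical.

```python
from collections import Counter, defaultdict


def ip_count_histogram(clicks):
    """clicks: iterable of (session_id, ip) pairs.

    Returns a Counter mapping "number of distinct IPs" to "number of sessions".
    """
    ips_by_session = defaultdict(set)
    for session_id, ip in clicks:
        ips_by_session[session_id].add(ip)
    return Counter(len(ips) for ips in ips_by_session.values())


sample = [
    ("s1", "1.1.1.1"),
    ("s1", "2.2.2.2"),
    ("s2", "1.1.1.1"),
    ("s1", "1.1.1.1"),
]
hist = ip_count_histogram(sample)
```

Sessions that fall into a bucket of 2 or more are the candidates for the anomaly listed above.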
    Anomalies/Strange things II
• Sessions with long duration
   – 476 sessions longer than 24 hours (up to 18 days)
• Very Intensive Sessions
   – 2.865 sessions with more than 100 visited pages
   – 19 sessions with more than 1.000 visited pages
   – 2 sessions with more than 10.000 visited pages
• Frequent IP-addresses with short sessions
   – E.g.: 29.320 sessions in less than 20 hours from 147.229.205.80
• “Parallel sessions”
   – Overlapping sequences of clicks from the same IP to the same
     shop within a short period with multiple SIDs
     (Opening a new window? Making a transaction? )
   Anomalies/Strange things III
• Sequences of short sessions that together form a single session
Example: clicks from 62.209.194.163 (31 Jan 04)
   09:40:09   /dt/?c=13654;http://www.shop5.cz/
   09:41:21   /dt/param.php?id=115;
   09:41:21   /;
   09:41:37   /ls/?id=20;http://www.shop5.cz/dt/?c=13654
   09:41:42   /;
   09:42:24   /ls/?&id=20&view=1,2,3,8&pozice=20;http://www.shop5.cz/ls/…
   09:42:25   /;
   09:42:48   /ls/?&id=20&view=1,2,3,8;http://www.shop5.cz/ls/?&id=20& …
   09:42:48   /;
   09:42:53   /ls/?&id=20&view=1,2,3,8&pozice=40;http://www.shop5.cz/ls/…

       Each click has a different session identifier !!!
              Fixing the data
• A new definition of “session”:

  A chronologically ordered sequence of “clicks”
  from the same IP-address to the same shop
  with no gaps longer than 30 minutes

• Sessions longer than 50 clicks ignored (12.000)
• Number of sessions dropped from 522.410 to 281.153
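The new session definition can be sketched as follows. The input layout (tuples of Unix time, IP-address, shop id, page) and the function name `sessionize` are assumptions made for illustration; only the 30-minute gap and the 50-click limit come from the slide.

```python
from collections import defaultdict

GAP_SECONDS = 30 * 60  # a longer gap starts a new session
MAX_CLICKS = 50        # sessions longer than this are ignored


def sessionize(clicks):
    """clicks: iterable of (unix_time, ip, shop_id, page) tuples."""
    by_visitor = defaultdict(list)
    for t, ip, shop, page in clicks:
        by_visitor[(ip, shop)].append((t, page))

    sessions = []
    for key, events in by_visitor.items():
        events.sort()  # chronological order
        current = [events[0]]
        for prev, nxt in zip(events, events[1:]):
            if nxt[0] - prev[0] > GAP_SECONDS:
                sessions.append((key, current))
                current = []
            current.append(nxt)
        sessions.append((key, current))

    # Drop the very long sessions.
    return [(key, s) for key, s in sessions if len(s) <= MAX_CLICKS]


clicks = [
    (0, "a", 1, "/"),
    (60, "a", 1, "/x"),
    (60 + 1801, "a", 1, "/y"),  # gap of more than 30 minutes: new session
    (10, "a", 2, "/"),          # same IP, other shop: separate session
]
result = sessionize(clicks)
```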
 Old and New Sessions
Session Length   Count Old   Count New
      1           318.523     65.258
      2           24.762      31.821
      3           17.353      18.828
      4           15.351      16.332
      5           15.361      15.509
      6           13.455      13.448
      7           10.958      10.883
      8            9.045       9.095
      9            7.939       8.070
     10            7.028       7.091
      ...           ...         ...
          Visitor Profiling


Motivation: On the internet each shop is
just “one click away”. If a user is not
satisfied with the service, he/she simply moves
on to the next shop and will likely never return.
         Visitor Profiling Scheme
I.     Clustering of user
       sessions

II.    Analysis/interpretation
       of the clusters

III.   Assign a cluster label
       to each session

IV.    Analysis of the profile
       sequences
                   Clustering
• Cadez et al. (2001) - predictive profiles from
  historical transaction data

• Mixture of multinomials:
    $p(y_{ij}) = \sum_{k=1}^{K} \alpha_k \prod_{c=1}^{C} \theta_{kc}^{n_{ijc}}$

• Full data likelihood:
    $p(D \mid \Theta) = \prod_{i=1}^{N} p(D_i \mid \Theta)$

• The unknown parameters $\{\alpha_1, \ldots, \alpha_K\}$ and
  $\{\theta_1, \ldots, \theta_K\}$ are estimated by the expectation
  maximization (EM) algorithm.
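A minimal EM sketch for a mixture of multinomials, not the authors' implementation. It makes three assumptions: each session is reduced to a single vector of page-category counts, the components are initialised deterministically from evenly spaced sessions, and a small smoothing constant keeps all parameters positive. The function name `em_multinomial_mixture` is hypothetical.

```python
import math


def em_multinomial_mixture(counts, K, iters=50, smooth=1e-3):
    """counts: list of per-session count vectors (each of length C)."""
    N, C = len(counts), len(counts[0])
    alpha = [1.0 / K] * K  # mixture weights
    theta = []             # theta[k][c]: category probabilities of component k
    for k in range(K):
        # Assumption: initialise component k from session (k * N) // K.
        row = [v + 1.0 for v in counts[(k * N) // K]]
        total = sum(row)
        theta.append([v / total for v in row])

    resp = []
    for _ in range(iters):
        # E-step: responsibility of each component for each session.
        resp = []
        for n in counts:
            logp = [
                math.log(alpha[k])
                + sum(n[c] * math.log(theta[k][c]) for c in range(C))
                for k in range(K)
            ]
            m = max(logp)
            w = [math.exp(v - m) for v in logp]
            z = sum(w)
            resp.append([v / z for v in w])

        # M-step: re-estimate the weights and the smoothed multinomials.
        for k in range(K):
            alpha[k] = (smooth + sum(r[k] for r in resp)) / (K * smooth + N)
            row = [
                smooth + sum(r[k] * n[c] for r, n in zip(resp, counts))
                for c in range(C)
            ]
            total = sum(row)
            theta[k] = [v / total for v in row]

    # Label each session with its most probable component.
    labels = [max(range(K), key=lambda k: r[k]) for r in resp]
    return alpha, theta, labels


data = [[9, 1, 0], [8, 2, 0], [10, 0, 0], [0, 1, 9], [0, 2, 8], [1, 0, 9]]
alpha, theta, labels = em_multinomial_mixture(data, K=2)
```

The returned `labels` correspond to step III of the profiling scheme: each session is labelled with its most probable cluster.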
             Interpretation of the clusters
Profile 1 General overview of the products




Profile 2 Focused search




Profile 3 Potential buyers




Profile 4 Parameter based search
 The transitions of profiles

      P1       P2       P3       P4

P1   0.7208   0.1592   0.0621   0.0579

P2   0.5908   0.2828   0.0710   0.0553

P3   0.5022   0.1616   0.2873   0.0489

P4   0.6000   0.1702   0.0685   0.1613
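A table of this kind could be estimated from the profile sequences along the following lines. The sketch assumes that rows are the "from" profile and columns the "to" profile, and that transitions are counted between consecutive sessions of the same visitor; the slide does not spell this out. The function name `transition_matrix` is hypothetical.

```python
from collections import Counter


def transition_matrix(sequences, labels):
    """sequences: list of profile-label sequences, one per visitor."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))  # consecutive label pairs

    matrix = {}
    for a in labels:
        row_total = sum(counts[(a, b)] for b in labels)
        matrix[a] = {
            b: (counts[(a, b)] / row_total if row_total else 0.0)
            for b in labels
        }
    return matrix


profiles = ["P1", "P2", "P3", "P4"]
toy_sequences = [["P1", "P1", "P2"], ["P1", "P3"]]  # made-up toy input
m = transition_matrix(toy_sequences, profiles)
```

Each non-empty row sums to 1, as the rows of the table above do up to rounding.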
Tree of user profiles
Tree of potential buyers
                  Conclusion

• We spotted several anomalies → background information
  about pre-processing & data preparation is important
• Important features were missing (who is a buyer?)
• Four clear user profiles

				