WIRED - Web Analytics Week
• Web Logs overview
• Web Analytics
  - Understanding Queries
  - Tracking Users
• Web Log Reliability
• Web Log Data Mining & KDD
Web Analytics
• Evaluation of Web Information Retrieval (& Web
  Information Seeking)

• What can we learn?
   - IR systems use
   - Web server administration
• Who are the users?
   - Types of users
   - User situations
• How does it affect or help IR?
Web Server Overview
• Any application that can serve files using the HTTP
  protocol
   -   Text, HTML, XHTML, XML…
   -   Graphics
   -   CGI, applets, servlets
   -   other media & MIME types
• Common examples: Apache and MS IIS, which serve primarily Web pages
• Servers create ASCII text log files showing:
   - Date, time, bytes transferred, (cache status)
   - Status/error codes, user IP address, (domain name)
   - Server method, URI, misc comments
Web Log Overview
• Access Log
  - Logs information such as page served or time
    served
• Referer Log
  - Logs name of the server and page that links to
    current served page
  - Not always present (browsers may omit it)
  - Can be from any Web site
• Agent Log
  - Logs browser type and operating system
     • Mozilla
     • Windows
What can we learn from Web logs?
• Every time a Web browser requests a file, it
  gets logged
  - Where the user came from
  - What kind of browser was used to access the server
  - Referring URL
• Every time a page gets served, it gets logged
  - Request time, serve time, bytes transferred, URI,
    status code (tallied in the sketch below)
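Each logged request carries a status code and byte count, so a first-pass summary needs only a short script. A minimal sketch, assuming a Common Log Format file named access.log (both the filename and the exact format are assumptions; servers vary):

```python
import re
from collections import Counter

# Common Log Format: host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

status_counts = Counter()
total_bytes = 0

with open("access.log") as log:
    for line in log:
        m = CLF.match(line)
        if m is None:
            continue  # skip corrupt log lines (see the Analog report below)
        status_counts[m.group("status")] += 1
        if m.group("bytes") != "-":
            total_bytes += int(m.group("bytes"))

print(status_counts.most_common())              # successes, errors, redirects
print(f"{total_bytes / 2**30:.3f} GB transferred")
```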
Web Log Analysis in Action
• UT Web log reports
(Figures in parentheses refer to the 7 days to 28-Mar-2004 03:00).
Successful requests: 39,826,634 (39,596,364)
Average successful requests per day: 5,690,083 (5,656,623)
Successful requests for pages: 4,189,081 (4,154,717)
Average successful requests for pages per day: 598,499 (593,530)
Failed requests: 442,129 (439,467)
Redirected requests: 1,101,849 (1,093,606)
Distinct files requested: 479,022 (473,341)
Corrupt logfile lines: 427
Data transferred: 278.504 Gbytes (276.650 Gbytes)
Average data transferred per day: 39.790 Gbytes (39.521 Gbytes)
Problems with Web Servers
•   Actual user or intent not known
•   Paths difficult to determine
•   Infrequent access challenging to uncover
•   No State Information
•   Server Hits not Representative
    - Counters inaccurate
•   DoS attacks, floods, and bandwidth limits can stop “intended” usage
•   Robots, etc.
•   ISP Proxy servers
•   “5.3 Unsound inferences from data that is logged”
    Haigh & Megarity, 1998.
Web Server Configuration
•   Unique file & directory names = “at a glance analysis”
•   Hierarchical directory structure
•   Redirect CGI to find referrer
•   Use a database
    - store web content
    - record usage data with context of content logged
• Create state information with programming (see the
  session-cookie sketch below)
    - Servlets, ActiveX, JavaScript
    - Custom server or log format
• Log rollover, report frequency, special case testing
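One way to realize the “create state information” bullet above: mint a session ID, hand it to the browser as a cookie, and write it into a custom log field so hits can be tied together later. A minimal stdlib sketch (a generic illustration, not a specific servlet/ActiveX/JavaScript API):

```python
import uuid
from http import cookies

# Reuse the session ID the browser sent, or mint one for a new visitor.
def session_id(cookie_header=""):
    jar = cookies.SimpleCookie(cookie_header)
    if "session_id" not in jar:
        jar["session_id"] = uuid.uuid4().hex   # new visitor
        jar["session_id"]["path"] = "/"
    return jar["session_id"].value, jar.output(header="Set-Cookie:")

sid, set_cookie = session_id()                 # first visit: no cookie yet
print(sid)         # value to write into a custom log field
print(set_cookie)  # header that sends the ID back to the browser
print(session_id(f"session_id={sid}")[0] == sid)   # later hit: same session
```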
Log File Format
• Extended Log File Format - W3C Working
  Draft WD-logfile-960323
  192.117.240.3 - - [24/Jul/1998:00:00:04 -0400] "GET /10/3/a3-160-e.html HTTP/1.0" 200 2308 "http://www.amicus.nlc-bnc.ca/wbin/resanet/itemdisp/l=0/d=1/r=1/e=0/h=10/i=11683503" "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"
• Every server generates slightly different logs
   - Versions & operating system issues
   - Admin tweaks to log formats
• Extended Log Format most common
   - W3C standard (the Apache default); the sample line
     above is parsed in the sketch below
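Because the combined/extended format carries the referer and user agent inline, one regular expression recovers everything the access, referer, and agent logs would each hold separately. A sketch that parses the sample line above (the regex is a common idiom for this format, not taken from the W3C draft itself):

```python
import re

# The sample log line from the slide above, rejoined onto one line.
SAMPLE = ('192.117.240.3 - - [24/Jul/1998:00:00:04 -0400] '
          '"GET /10/3/a3-160-e.html HTTP/1.0" 200 2308 '
          '"http://www.amicus.nlc-bnc.ca/wbin/resanet/itemdisp/'
          'l=0/d=1/r=1/e=0/h=10/i=11683503" '
          '"Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"')

COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

m = COMBINED.match(SAMPLE)
print(m.group("ip"))        # 192.117.240.3
print(m.group("referer"))   # the linking page (what a referer log holds)
print(m.group("agent"))     # browser + OS (what an agent log holds)
```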
Let’s Look at some logs
• http://www.ischool.utexas.edu/analog-monthly.html
• http://www.ischool.utexas.edu/analog-weekly.html
Log Analysis Tools
•   Analog
•   Webalizer
•   Sawmill
•   WebTrends
•   AWStats
•   WWWStat
•   GetStats
•   Perl scripts (quick one-offs; a Python equivalent is
    sketched below)
•   Data Mining & Business Intelligence tools
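The “Perl scripts” entry stands for the quick one-off reports admins write themselves. A Python equivalent for a top-ten-pages report might look like this (access.log and a CLF-style request field are assumptions):

```python
import re
from collections import Counter

REQUEST = re.compile(r'"(?:GET|POST|HEAD) (\S+)')   # URI inside the request field

pages = Counter()
with open("access.log") as log:
    for line in log:
        m = REQUEST.search(line)
        if m:
            pages[m.group(1)] += 1

for uri, hits in pages.most_common(10):             # ten most-requested URIs
    print(f"{hits:>8}  {uri}")
```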
WebTrends
• A whole industry of analytics
• Most popular commercial application
Measuring Web Site Usage
• Now that the Web is a primary source,
  understanding its use is critical
• There are few external cues that a Web site is being
  used
• What - pages and their content/subject
• How - browsers
• Who - userid or IP
• When - trends, daily, weekly, yearly
• Where - the user is and what page they came
  from
What can’t you measure?
• Who the user is
  - At least, not always
  - Whether the user’s needs have changed
• If they’re using the information
  - Browsing vs. Reading vs. Acting on the
    information
• Changes to site and how they affect each user
• Pages not used at all - and why
Analysis of a Very Large Search Log
• What kinds of patterns can we find?
• Request = query and results page
• 280 GB – Six Weeks of Web Queries
   - Almost 1 Billion Search Requests; ~850M valid, ~575M queries
   - 285 Million User Sessions (cookie issues)
   - Large volume, less trendy
   - Why are unique queries important?
• Web Users:
   - Use Short Queries in short sessions - 63.7% one request
   - Mostly Look at the First Ten Results only
   - Seldom Modify Queries
• Traditional IR Isn’t Accurately Describing Web Search
• Phrase Searching Could Be Augmented
                               • Silverstein, Henzinger, Marais, Moricz (1998)
Analysis of a Very Large Search Log
• 2.35 Average Terms Per Query
  - 0 terms = 20.6% (empty queries)
  - 1 term = 25.8%
  - 2 terms = 26.0% (0–2 combined = 72.4%; see the
    sketch below)
• Operators Per Query
  - 0 = 79.6%
• Terms Predictable
• First Set of Results Viewed Only = 85%
• Some (Single Term Phrase) Query Correlation
  - Augmentation
  - Taxonomy Input
  - Robots vs. Humans
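A terms-per-query distribution like the one above is straightforward to compute from a query log. A sketch assuming a hypothetical queries.txt with one query string per line (the actual AltaVista log format differs):

```python
from collections import Counter

lengths = Counter()
total = 0
with open("queries.txt") as f:
    for line in f:
        lengths[len(line.split())] += 1   # an empty line is a 0-term query
        total += 1

for n in sorted(lengths):
    print(f"{n} terms: {lengths[n] / total:.1%}")
mean = sum(n * c for n, c in lengths.items()) / total
print(f"average terms per query: {mean:.2f}")
```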
Web Analytics and IR?
• Knowing access patterns of users
• Lists of search terms
  - Number of words
  - Words, concepts to add (synonyms)
  - Types of queries
• Success of searching a site
  - Was a result link clicked on? (estimated in the sketch
    below)
  - How many pages per user after a search?
• Is a new or better search interface needed?
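Estimating “was a result link clicked on?” means joining query events to the clicks that follow them. A hedged sketch assuming a hypothetical events.tsv of tab-separated (user, timestamp, kind) rows already sorted by time; real search logs are rarely this clean:

```python
import csv

searches = clicked = 0
awaiting_click = {}                         # user -> query not yet clicked

with open("events.tsv") as f:
    for user, ts, kind in csv.reader(f, delimiter="\t"):
        if kind == "query":
            searches += 1
            awaiting_click[user] = True
        elif kind == "click" and awaiting_click.get(user):
            clicked += 1                    # first click after the query
            awaiting_click[user] = False

print(f"queries followed by at least one click: {clicked / searches:.1%}")
```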
Real Life Information Retrieval
• 51K Queries from Excite (1997)
• Search Terms = 2.21
• Number of Terms
   - 1 term = 31%, 2 terms = 31%, 3 terms = 18% (80% Combined)
• Logic & Modifiers (by User)
   - Infrequent
   - AND, “+”, “-”
• Logic & Modifiers (by Query)
   - 6% of Users
   - Less Than 10% of Queries
   - Lots of Mistakes
• Uniqueness of Queries (classified in the sketch below)
   - 35% successive
   - 22% modified
   - 43% identical
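The identical/modified split above can be approximated by comparing each query’s terms with the same user’s previous query. A sketch (term overlap as the test for “modified” is an assumption, not necessarily the study’s exact definition):

```python
def classify(pairs):
    """pairs: (user, query) tuples in session order."""
    prev, counts = {}, {"identical": 0, "modified": 0, "new": 0}
    for user, query in pairs:
        terms = set(query.lower().split())
        if user in prev:
            if terms == prev[user]:
                counts["identical"] += 1
            elif terms & prev[user]:        # shares at least one term
                counts["modified"] += 1
            else:
                counts["new"] += 1
        prev[user] = terms
    return counts

print(classify([("u1", "star wars"), ("u1", "star wars"), ("u1", "star trek")]))
# {'identical': 1, 'modified': 1, 'new': 0}
```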
Real Life Information Retrieval
• Queries per user: 2.8
• Sessions
   - Flawed Analysis (User ID)
   - Some Revisits to Query (Result Page Revisits)
• Page Views
   - Accurate, but not by User
• Use of Relevance Feedback (more like this)
   - Not Used Much (~11%)
• Terms used are typical & frequent
• Mistakes
   - Typos
   - Misspellings
   - Bad (Advanced) Query Formulation
                                 •   Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998)
KDD for Extracting Knowledge
• Knowledge extraction, information discovery, information
  extraction, data archeology, data pattern processing, OLAP, HV
  statistical analysis
• Sounds as if “knowledge” is there to be
  found.
• User and usage context help find the
  knowledge
• Hypothesis before analysis
• Why KDD, why now?
   - Data storage, analysis costs
   - Visualization
KDD Process
• Database for structured data and queries
  - How it is structured; algorithms for queries
  - How results can be understood and visualized
  - Iterative & Interactive, hypothesis driven &
    hypothesis generating
KDD Efforts
• Data Cleaning
• Formulating the Questions
• “Finding useful features to represent the
  data” (p. 30)
• Models:
  -   Classification to fit data into pre-defined classes
  -   Regressions to fit predictions & values
  -   Clustering to group data into classes found in the data
      (sketched below)
  -   Summarization to briefly describe data
  -   Dependency discovery of variable relationships
  -   Sequence analysis for time or interaction patterns
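To make the clustering entry above concrete, a toy k-means over two session features; real usage mining would use richer features and a proper library:

```python
import random

def kmeans(points, k, iters=20):
    """Cluster 2-D points (e.g., pages viewed, minutes on site)."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign to nearest center
            i = min(range(k), key=lambda j: (p[0] - centers[j][0]) ** 2 +
                                            (p[1] - centers[j][1]) ** 2)
            clusters[i].append(p)
        centers = [(sum(p[0] for p in c) / len(c),     # recompute means
                    sum(p[1] for p in c) / len(c)) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

sessions = [(2, 1), (3, 2), (2, 2), (25, 40), (30, 35), (28, 42)]
centers, clusters = kmeans(sessions, k=2)
print(centers)   # roughly one "quick visit" and one "long session" center
```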
Data Prep for Mining the WWW
• Processing the data before mining
• WEBMINER system - site topology
  -   Cleaning
  -   User identification
  -   Session identification (episodes; see the timeout
      sketch below)
  -   Path completion
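Session identification is commonly done with a timeout heuristic: hits from the same IP and agent belong to one session until the gap between hits exceeds, say, 30 minutes. A sketch of that heuristic (one common approach, not necessarily WEBMINER’s exact algorithm):

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)

def sessionize(hits):
    """hits: (ip, agent, timestamp, uri) tuples."""
    sessions, last_seen = {}, {}
    for ip, agent, ts, uri in sorted(hits, key=lambda h: h[2]):
        user = (ip, agent)                            # crude user identification
        if user not in last_seen or ts - last_seen[user] > TIMEOUT:
            sessions.setdefault(user, []).append([])  # start a new session
        sessions[user][-1].append(uri)
        last_seen[user] = ts
    return sessions

t = datetime(2004, 3, 28, 3, 0)
hits = [("1.2.3.4", "Mozilla", t, "/"),
        ("1.2.3.4", "Mozilla", t + timedelta(minutes=5), "/a.html"),
        ("1.2.3.4", "Mozilla", t + timedelta(hours=2), "/b.html")]
print(sessionize(hits))
# {('1.2.3.4', 'Mozilla'): [['/', '/a.html'], ['/b.html']]}
```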
Web Usage Mining
• VL Verification
• Data Mining to Discover Patterns of Use
   - Pre-Processing
   - Pattern Discovery
   - Pattern Analysis
• Site Analysis, Not User Analysis
• Srivastava, J., Cooley, R., Deshpande, M., & Tan, P.-N. (2000)
Web Usage Discovery
  - Content
     • Text
     • Graphics
     • Features
  - Structure
     • Content Organization
     • Templates and Tags
  - Usage
     • Patterns
     • Page References
     • Dates and Times
  - User Profile
     • Demographics
     • Customer Information
Web Usage Collection
• Types of Data
  - Web Servers
  - Proxies
  - Web Clients
• Data Abstractions
  -   Sessions
  -   Episodes
  -   Clickstreams
  -   Page Views
• The Tools for Web Use Verification
Web Usage Preprocessing
• Usage Preprocessing
  - Understanding the Web Use Activities of the Site
  - Extract from Logs
• Content Preprocessing
  - Converting Content Into Formats for Processing
  - Understanding Content (Working with Dev Team)
• Structure Preprocessing
  - Mining Links and Navigation from the Site (see the
    link-extraction sketch below)
  - Understanding Page Content and Link Structures
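Mining links from a site’s pages can start with nothing more than walking the HTML for anchor tags. A sketch using Python’s standard-library parser:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets so navigation structure can be
    compared against observed usage paths."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/a.html">A</a> <a href="/b.html">B</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)   # ['/a.html', '/b.html']
```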
Web Usage Pattern Discovery
• Clustering for Similarities
   - Pages
   - Users
   - Links
• Classification
   -   Mapping Data to Pre-defined Classes
   -   Rule Discovery
    -   Association Rules
    -   Computationally Intensive
    -   Many Paths to Similar Answers
• Pattern Detection
   - Ordering By Time (see the page-pair sketch below)
   - Predicting Use With Time
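“Ordering By Time” can start with counting how often one page precedes another within a session, a crude form of sequential pattern discovery. A sketch over toy sessions:

```python
from collections import Counter
from itertools import combinations

def ordered_pairs(sessions):
    """Count (earlier page, later page) pairs within each session."""
    counts = Counter()
    for pages in sessions:
        for x, y in combinations(pages, 2):   # combinations preserves order
            if x != y:
                counts[(x, y)] += 1
    return counts

sessions = [["/", "/search", "/results"],
            ["/", "/search", "/help"],
            ["/", "/results"]]
print(ordered_pairs(sessions).most_common(3))
# [(('/', '/search'), 2), (('/', '/results'), 2), (('/search', '/results'), 1)]
```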
Web Usage Mining as Evaluation?
• Mining Goals
    - Improved Design
    - Improved Delivery
    - Improved Content
•   Personalization (XMod Data)
•   System Improvement (Tech Data)
•   Site Modification (IA Data)
•   Business Intelligence (Market Data)
•   Usage Characterization (User Behavior Data)
Web Analytics Wrap-up
• What can we learn about users?
• What can we learn about services?
• How can we help users improve their use?
• How can IR models benefit from this
  analysis?
• What kinds of improvements in Web IR systems
  and their interfaces can be taken from this
  analysis?

				