Web log analysis
l Presented by: Zhan Wu
l Guided by: Dr. Bettina Berendt
l Seminar: Web Mining
What is a log file?
Log file: “A file that lists all the actions
that have occurred”
Every time you visit a site, the web server
writes a record of the HTTP transaction to
a log file.
Why web log analysis?
l Is anyone looking at your Web site?
l Do they like what they see?
l Do all your links work well?
l How much traffic does your web site get?
Why web log analysis (cont.)
l Web designers: the incentives of visitors, what makes them stay and what makes them …
l Web administrators: all clicks lead to …; multimedia files, scripts and applets are loaded and displayed properly
l Companies that place advertisements: make their investment effectively, refuse to waste …
Log file type
Log file from
pd9e0e981.dip.t-dialin.net - - [01/Dec/2001:00:17:42 +0100]
"GET /db/stellenliste.html HTTP/1.1" 200 8038
Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
The access log, agent log and referrer log are usually kept together;
this is called the extended log file. However, some servers turn off the
agent log and referrer log and keep only the access log, which is
called the common log file.
Address / DNS pd9e0e981.dip.t-dialin.net
timestamp [01/Dec/2001:00:17:42 +0100]
Request page "GET /db/stellenliste.html HTTP/1.1"
Status code 200
Transfer volume 8038
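The field breakdown above can be sketched as a small parser. This is only a minimal sketch, assuming Python and the standard common log layout with an optional quoted referrer and agent at the end; the names LINE_RE and parse_line are illustrative and not part of any tool mentioned in these slides.

```python
import re
from datetime import datetime

# Common log fields, optionally followed by a quoted referrer and a quoted agent.
LINE_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # Timestamp such as 01/Dec/2001:00:17:42 +0100 (English month names assumed).
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    # A "-" in the size field means no body was transferred.
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec

sample = ('pd9e0e981.dip.t-dialin.net - - [01/Dec/2001:00:17:42 +0100] '
          '"GET /db/stellenliste.html HTTP/1.1" 200 8038')
record = parse_line(sample)
print(record["host"], record["request"], record["status"], record["size"])
```

Lines that do not fit the pattern return None, so a caller can count or skip them instead of failing.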
Four series of status codes
l Success (200 series)
l Redirect (300 series)
l Failure (400 series)
l Server Error (500 series)
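The four series can be told apart by the first digit of the code. A minimal sketch in Python follows; the function name status_class and the sample list of codes are made up for illustration.

```python
from collections import Counter

SERIES = {2: "Success", 3: "Redirect", 4: "Failure", 5: "Server Error"}

def status_class(code):
    """Map an HTTP status code to its series name using the hundreds digit."""
    return SERIES.get(code // 100, "Other")

# Hypothetical status codes taken from some access log.
codes = [200, 200, 304, 404, 500, 200]
summary = Counter(status_class(c) for c in codes)
print(summary)
```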
l The agent log has information about the
browser version and operating system of
the visitor
l “Mozilla” is the original code name
of Netscape. Now almost all browsers
compatible with Netscape use “Mozilla”
as their code name.
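As an illustration, the agent string from the earlier example entry can be split into its parts. This is a rough sketch in Python, assuming the common "product/version (detail; detail; ...)" layout; real agent strings vary widely, and parse_agent is a hypothetical helper name.

```python
import re

AGENT_RE = re.compile(
    r'(?P<product>[^/\s]+)/(?P<version>\S+)\s*(?:\((?P<details>[^)]*)\))?'
)

def parse_agent(agent):
    """Split an agent string into product, version and the parenthesised details."""
    m = AGENT_RE.match(agent)
    if m is None:
        return None
    raw = m.group("details") or ""
    details = [d.strip() for d in raw.split(";") if d.strip()]
    return {"product": m.group("product"),
            "version": m.group("version"),
            "details": details}

info = parse_agent("Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)")
print(info["product"], info["version"], info["details"])
```

Here the product is reported as "Mozilla" even though the details name MSIE as the browser, which matches the point made above about the code name.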
l The referrer log indicates the page where the
visitor was located when making the next
request
l How is your site categorized in search engines?
Referrer log (cont.)
l What path does the visitor take through your site?
l pd9e0e981.dip.t-dialin.net … "GET
f209b8c73ba53df99dc574a HTTP/1.1" …
l www.job.zeit.de → joblist → information on a certain job
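The idea of following a visitor through the site via referrers can be sketched as below. This is a simplified sketch in Python: it assumes the hits of a single visitor are already in time order and that referrers and requested pages are written in the same form. The page names in the sample are hypothetical.

```python
def navigation_path(hits):
    """Chain (referrer, requested_page) pairs of one visitor into a path.

    When a hit's referrer is not the last page of the path (for example
    after the back button), the referrer is added again before the page.
    """
    path = []
    for referrer, page in hits:
        if not path or referrer != path[-1]:
            path.append(referrer)
        path.append(page)
    return path

# Hypothetical hits of one visitor, in time order.
hits = [
    ("/", "/joblist"),
    ("/joblist", "/job/123"),
    ("/joblist", "/job/456"),
]
path = navigation_path(hits)
print(" -> ".join(path))
```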
l The error log is another standard and important log,
kept separately from the other three logs
example from www.schulweb.de
[Wed Jan 16 13:40:45 2002] [error]
[client 188.8.131.52] File does not
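An entry of this shape can be split into timestamp, level, client and message. The following is only a sketch in Python. Because the example above is truncated, the sample line uses a documentation IP address and a hypothetical file path, and parse_error_line is an illustrative name.

```python
import re
from datetime import datetime

ERROR_RE = re.compile(
    r'\[(?P<time>[^\]]+)\] \[(?P<level>\w+)\] '
    r'(?:\[client (?P<client>[^\]]+)\] )?(?P<message>.*)'
)

def parse_error_line(line):
    """Split one error log line into its bracketed fields and the message."""
    m = ERROR_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # Timestamp such as Wed Jan 16 13:40:45 2002 (English names assumed).
    rec["time"] = datetime.strptime(rec["time"], "%a %b %d %H:%M:%S %Y")
    return rec

# Hypothetical line: the client address and the path are placeholders.
line = ("[Wed Jan 16 13:40:45 2002] [error] [client 192.0.2.1] "
        "File does not exist: /hypothetical/path/page.html")
err = parse_error_line(line)
print(err["level"], err["client"], err["message"])
```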
Overview of log analysis software
l Writing your own program
l Free software (top 3 by Google PageRank)
“eXTReMe Tracking”, “The Webalizer”, “Analog”
l Commercial software and solution packages
(top 3 by Google PageRank)
“Wusage”, “WebTrends”, “AccessWatch”
Three steps of web log analysis
Decide what we need
Choose a log analysis software
Analyze the output of program
Step 1: what we need
The traffic of the site
The distribution of the domains
The referrer site…
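The three quantities listed above can each be obtained by simple counting once the log lines are parsed. The sketch below, in Python, assumes each record is a dict with "date", "host" and "referrer" keys; this layout and the sample records are assumptions for illustration, not the output format of any particular tool.

```python
from collections import Counter

def top_level_domain(host):
    """Return the last term of a host name, or a marker for a bare IP address."""
    last = host.rsplit(".", 1)[-1]
    return "(unresolved IP)" if last.isdigit() else "." + last.lower()

def summarize(records):
    """Count requests per day, per top-level domain and per referrer."""
    per_day = Counter(r["date"] for r in records)
    per_domain = Counter(top_level_domain(r["host"]) for r in records)
    per_referrer = Counter(r["referrer"] for r in records if r.get("referrer"))
    return per_day, per_domain, per_referrer

# Hypothetical parsed records.
records = [
    {"date": "2001-12-01", "host": "pd9e0e981.dip.t-dialin.net",
     "referrer": "http://search.example/?q=jobs"},
    {"date": "2001-12-01", "host": "host1.example.com", "referrer": ""},
    {"date": "2001-12-02", "host": "192.0.2.7",
     "referrer": "http://search.example/?q=jobs"},
]
per_day, per_domain, per_referrer = summarize(records)
print(per_day.most_common())
print(per_domain.most_common())
print(per_referrer.most_common(1))
```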
Step 1 (cont.): what we don’t need
l We don’t care about the error log; this problem will be
left to the web administrator.
l We don’t care about the browser or operating system
of the visitors
l User sessions are not important either.
Step 2: which way should I choose?
l A limited budget and a weak background in computer
science mean that I have to choose the free software
l I chose Analog:
there are different versions for Macintosh, Unix, DOS, …
Also, while the default configuration gives a great
report, Analog is easily customizable to produce exactly
the report you want.
Step 3: get the output-traffic
l All the data come
from the results of
l The average number of
requests is 912,615 per
month and 30,420 per
day; the month by month …
Step 3(cont.) Domain distribution
l Why doesn't the domain “eduserver” appear?
l Separating in-house from external
l Thanks to Dr. Berendt for filtering out all the entries
from the ‘eduserver’ itself
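Separating in-house from external requests amounts to a filter on the host name. A minimal sketch in Python follows; the suffix "eduserver.example" is a placeholder and not the real domain, and this is not necessarily how the entries were filtered for this talk.

```python
def is_in_house(host, own_suffix):
    """True if host is the site's own domain or one of its subdomains."""
    host = host.lower()
    return host == own_suffix or host.endswith("." + own_suffix)

OWN_SUFFIX = "eduserver.example"  # placeholder for the site's own domain

hosts = [
    "www.eduserver.example",
    "pd9e0e981.dip.t-dialin.net",
    "noteduserver.example",
]
external = [h for h in hosts if not is_in_house(h, OWN_SUFFIX)]
print(external)
```

Matching on "." plus the suffix avoids treating an unrelated host that merely ends with the same letters as in-house.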
Limitations of log Analysis
l User Sessions
l Not all information is captured
l Confusion of domains
l The popular methods of measuring user
sessions are as follows:
1. Authenticated user
3. IP address of the visitor
All these above have problems!!!
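To make the IP address method concrete, here is a minimal sketch in Python. It assumes that hits from the same address belong to one session as long as consecutive hits are at most 30 minutes apart; the timeout, the function name and the sample data are assumptions for illustration. It also shows the weakness: visitors who share an address are merged into one session.

```python
from datetime import datetime, timedelta

def sessions_by_ip(hits, timeout=timedelta(minutes=30)):
    """Group (ip, timestamp) hits into sessions per IP address.

    A new session starts when the gap to the previous hit from the same
    address is longer than the timeout.
    """
    sessions = {}
    for ip, t in sorted(hits, key=lambda h: h[1]):
        ip_sessions = sessions.setdefault(ip, [])
        if ip_sessions and t - ip_sessions[-1][-1] <= timeout:
            ip_sessions[-1].append(t)
        else:
            ip_sessions.append([t])
    return sessions

# Hypothetical hits; the addresses are documentation addresses.
hits = [
    ("192.0.2.1", datetime(2001, 12, 1, 10, 0)),
    ("192.0.2.1", datetime(2001, 12, 1, 10, 10)),
    ("192.0.2.2", datetime(2001, 12, 1, 10, 5)),
    ("192.0.2.1", datetime(2001, 12, 1, 11, 0)),
]
result = sessions_by_ip(hits)
for ip, ip_sessions in result.items():
    print(ip, [len(s) for s in ip_sessions])
```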
Not all entries are captured
l ISPs “cache” specific pages
l Web browsers also have their own local
cache
Confusion of domains
l There is nothing to stop a commercial entity
from registering a site in the .org domain.
l Sites in the .com domain and other domains
can also be located in foreign countries, so you
cannot tell exactly which requests are coming
from users in other countries.
l .edu domains exist only in the USA. We cannot tell
a German educational site from the last term of …
l There is a great deal of
useful information you
can get from web logs.
l There are still a lot of
things to do in this
field in the future.