					         Web log analysis


-   Presented by: Zhan Wu
-   Guided by: Dr. Bettina Berendt
-   Seminar: Web Mining




    What is a log file?

    Log file:      “A file that lists all the actions
                    that have occurred”

        Every time you visit a site, the web server
      writes a record of the HTTP transaction to
      a log file.



    Why web log analysis?

    -   Is anyone looking at your web site?
    -   Do they like what they see?
    -   Do all your links work?
    -   How much traffic does your web site get?




     Why web log analysis? (cont.)
    Web designers          The incentives of visitors:
                           what makes them stay and
                           what makes them leave
    Web administrators     That all clicks lead to
                           documents, and that images,
                           multimedia files, scripts
                           and applets are loaded
                           and displayed properly
    Companies that place   Investing effectively and
    advertisements         not wasting money
Log file types

      Access log
      Referrer log
      Agent log
      Error log

     Log file from www.eduserver.de

pd9e0e981.dip.t-dialin.net - - [01/Dec/2001:00:17:42 +0100]
"GET /db/stellenliste.html HTTP/1.1" 200 8038
Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
http://www.jobs.zeit.de/akad.html


  The access log, agent log and referrer log are usually written together;
this combination is called the extended log file. However, some servers turn
off the agent log and referrer log and keep only the access log, which is
called the common log file.


                      Access log

    Address / DNS        pd9e0e981.dip.t-dialin.net
    identification       -
    authuser             -
    timestamp            [01/Dec/2001:00:17:42 +0100]
    Request              "GET /db/stellenliste.html
                         HTTP/1.1"
    Status code          200
    Transfer volume      8038 (bytes)
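
    The field breakdown above can be sketched as a small parser. This is a
    minimal illustration, not the method used in the talk: the pattern name,
    function name and dictionary keys are assumptions chosen to mirror the
    field names on the slide.

```python
import re

# Access-log layout: host ident authuser [timestamp] "request" status volume
# CLF_PATTERN and the group names are illustrative, not from the slides.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<volume>\d+|-)'
)

def parse_access_entry(line):
    """Return the access-log fields of one entry as a dict, or None."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" transfer volume means that no body was sent.
    fields["volume"] = 0 if fields["volume"] == "-" else int(fields["volume"])
    return fields

SAMPLE = ('pd9e0e981.dip.t-dialin.net - - [01/Dec/2001:00:17:42 +0100] '
          '"GET /db/stellenliste.html HTTP/1.1" 200 8038')
entry = parse_access_entry(SAMPLE)
print(entry["host"], entry["status"], entry["volume"])
```

    Lines that do not match the layout return None, so they can be counted
    or skipped instead of stopping the analysis.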
        Four series of status codes



-   Success (200 series)
-   Redirect (300 series)
-   Failure (400 series)
-   Server Error (500 series)
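
    The four series can be told apart by the first digit of the status code.
    A minimal sketch; the function name and the "Other" fallback for codes
    outside the four series are assumptions.

```python
def status_series(status):
    """Map an HTTP status code to the series named on the slide."""
    series = {2: "Success", 3: "Redirect", 4: "Failure", 5: "Server Error"}
    # Integer division by 100 keeps only the leading digit of the code.
    return series.get(status // 100, "Other")

for code in (200, 302, 404, 500):
    print(code, status_series(code))
```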
    Agent log

    -   The agent log records the browser, browser
        version and operating system of the
        visitor.
    -   “Mozilla” was the original code name
        of Netscape. Now almost all browsers
        compatible with Netscape use “Mozilla”
        as their code name.
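
    An agent entry can be split into its product token and the details in
    parentheses. This is only a sketch for strings shaped like the sample
    entry; the function name is an assumption, and real agent strings vary
    far more than this handles.

```python
import re

def parse_agent(agent):
    """Split an agent string into (product, [detail tokens])."""
    match = re.match(r'(?P<product>\S+)\s*(?:\((?P<details>[^)]*)\))?', agent)
    if match is None:
        return "", []
    details = match.group("details") or ""
    # Details are separated by semicolons, e.g. browser and operating system.
    tokens = [t.strip() for t in details.split(";") if t.strip()]
    return match.group("product"), tokens

product, tokens = parse_agent(
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)")
print(product)
print(tokens)
```

    For the sample entry, the product token is the "Mozilla" code name, and
    the actual browser (MSIE 5.5) and operating system (Windows NT 5.0) only
    show up in the detail tokens.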


     Referrer log

     -   The referrer log indicates the page the
         visitor was on when making the next
         request.
     -   How is your site categorized in search engines?

         http://de.dir.yahoo.com/Bildung_und_Ausbildung/Portale_und_Linksammlungen/Bildungsserver

        Referrer log (cont.)

-   What path does the visitor take through your site?

-   pd9e0e981.dip.t-dialin.net … "GET
    /db/set.html?Id=221&KATEGORIE=stellenangebot&s=aeadf3e55f209b8c73ba53df99dc574a
    HTTP/1.1" …
    http://www.bildungsserver.de/db/stellenliste.html

-   www.jobs.zeit.de -> job list -> a specific job posting
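
The path above can be rebuilt by chaining each request to the referrer of the next one. A minimal sketch, assuming the entries belong to one visitor, are in time order, and that the site root is http://www.bildungsserver.de (taken from the referrer in the second entry); the function name is hypothetical.

```python
def navigation_path(entries, site="http://www.bildungsserver.de"):
    """Chain (request_path, referrer_url) pairs into one navigation path."""
    path = []
    for request_path, referrer in entries:
        if not path:
            # The first referrer is where the visitor came from.
            path.append(referrer)
        if referrer == path[-1]:
            # The request continues from the page reached last.
            path.append(site + request_path)
    return path

entries = [
    ("/db/stellenliste.html", "http://www.jobs.zeit.de/akad.html"),
    ("/db/set.html?Id=221&KATEGORIE=stellenangebot"
     "&s=aeadf3e55f209b8c73ba53df99dc574a",
     "http://www.bildungsserver.de/db/stellenliste.html"),
]
for step in navigation_path(entries):
    print(step)
```

Requests whose referrer does not match the last page reached are ignored in this sketch; a fuller version would start a new path for them.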


     Error log

     -   Another standard and important log, kept
         separately from the other three logs.
         Example from www.schulweb.de:
         [Wed Jan 16 13:40:45 2002] [error]
         [client 194.51.47.214] File does not
         exist:
         /home/schulweb/html/images/dot_so.gif
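
     An error entry of this shape can be split into time, level, client and
     message. A minimal sketch that only covers entries laid out like the
     example above; the pattern and function names are assumptions.

```python
import re
from datetime import datetime

# Layout: [time] [level] [client address] message
ERROR_PATTERN = re.compile(
    r'\[(?P<time>[^\]]+)\] \[(?P<level>\w+)\] '
    r'\[client (?P<client>[^\]]+)\] (?P<message>.*)'
)

def parse_error_entry(line):
    """Return the fields of one error-log entry as a dict, or None."""
    match = ERROR_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Example time: "Wed Jan 16 13:40:45 2002" (English day/month names).
    fields["time"] = datetime.strptime(fields["time"], "%a %b %d %H:%M:%S %Y")
    return fields

SAMPLE = ('[Wed Jan 16 13:40:45 2002] [error] [client 194.51.47.214] '
          'File does not exist: /home/schulweb/html/images/dot_so.gif')
error = parse_error_entry(SAMPLE)
print(error["client"], error["message"])
```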


         Overview of log analysis software

     -   Writing your own program

     -   Free software (top 3 by Google PageRank):
         “eXTReMe Tracking”, “The Webalizer”, “Analog”

     -   Commercial software and solution packages
         (top 3 by Google PageRank):
         “Wusage”, “WebTrends”, “AccessWatch”

     Three steps of web log analysis


         1. Decide what we need

         2. Choose a log analysis software package

         3. Analyze the output of the program


     Step 1: what we need

     The traffic of the site
     The distribution of the domains
     The referrer site…
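
     These three figures can all be counted in one pass over the log. A
     minimal sketch, assuming each entry has already been reduced to a
     (host, referrer URL) pair; the function names are hypothetical, and the
     third sample entry is invented for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

def top_level_domain(host):
    """Return the last label of a host name, or 'unresolved' for bare IPs."""
    last = host.rsplit(".", 1)[-1]
    return "unresolved" if last.isdigit() else last.lower()

def summarize(entries):
    """Count total requests, requests per domain and per referrer site."""
    domains, referrers, total = Counter(), Counter(), 0
    for host, referrer in entries:
        total += 1
        domains[top_level_domain(host)] += 1
        referrers[urlsplit(referrer).netloc] += 1
    return total, domains, referrers

entries = [
    ("pd9e0e981.dip.t-dialin.net", "http://www.jobs.zeit.de/akad.html"),
    ("pd9e0e981.dip.t-dialin.net",
     "http://www.bildungsserver.de/db/stellenliste.html"),
    # Hypothetical entry: an address that was not resolved to a name.
    ("194.51.47.214", "http://www.bildungsserver.de/db/stellenliste.html"),
]
total, domains, referrers = summarize(entries)
print(total, dict(domains), dict(referrers))
```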




     Step 1 (cont.): what we don’t need

     -   We don’t care about the error log; that problem is
         left to the web administrator.
     -   We don’t care about the browser or operating system
         of the visitors.
     -   User sessions are not important either.




     Step 2: which way should I choose?

     -   A limited budget and little background in computer
         science mean that I have to choose free
         software!

     -   I chose Analog:

         There are versions for Macintosh, Unix, DOS and
         Windows.

         Also, while the default configuration gives a great
         report, Analog is easily customizable to produce exactly
         the report you want.
     Step 3: get the output (traffic)
     -   All the data come
         from the results of
         Analog.
     -   The average is
         912,615 requests
         per month and
         30,420 per day; the
         traffic increased
         month by month
         last year.
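
     The two averages agree if a month is counted as 30 days. That is an
     assumption: the slide does not say how the daily figure was derived. A
     short arithmetic check, with a hypothetical function name:

```python
def daily_average(requests_per_month, days_per_month=30):
    """Average requests per day, assuming a fixed-length month."""
    return requests_per_month / days_per_month

per_day = daily_average(912_615)
print(per_day)        # 30420.5
print(int(per_day))   # 30420, the per-day figure on the slide
```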



     Step 3 (cont.): domain distribution




     Data cleaning

     -   Why doesn't the domain “eduserver” appear?
     -   Separating in-house requests from external ones




     -   Thanks to Dr. Berendt for filtering out all the entries
         from ‘eduserver’ itself.
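
     Separating in-house from external requests amounts to dropping every
     entry whose host belongs to the site's own domain. A minimal sketch,
     not the filter that was actually applied; it assumes the in-house
     domain is eduserver.de, and the function names are hypothetical.

```python
def is_in_house(host, in_house_domains=("eduserver.de",)):
    """True if the host is the in-house domain or one of its subdomains."""
    host = host.lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in in_house_domains)

def external_hosts(hosts, in_house_domains=("eduserver.de",)):
    """Keep only the hosts that are not in-house."""
    return [h for h in hosts if not is_in_house(h, in_house_domains)]

hosts = ["www.eduserver.de", "pd9e0e981.dip.t-dialin.net", "194.51.47.214"]
print(external_hosts(hosts))
```

     Matching on "." plus the domain keeps unrelated hosts that merely end
     in the same letters from being filtered out by mistake.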

     Limitations of log analysis

     -   User sessions
     -   Not all information is captured
     -   Confusion of domains




     User sessions

     -   Popular methods for identifying user
         sessions are:
     1. Authenticated user
     2. Cookies
     3. IP address of the visitor

     All of these have problems!
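
     The third method can be sketched as follows: requests from the same IP
     address belong to one session until a long enough gap occurs. The
     30-minute timeout, the function name and the sample timestamps are
     assumptions, not values from the talk.

```python
from datetime import datetime, timedelta

def split_sessions(requests, timeout=timedelta(minutes=30)):
    """Group (ip, timestamp) pairs, sorted by time, into sessions per IP."""
    sessions = {}
    for ip, timestamp in requests:
        visitor = sessions.setdefault(ip, [])
        if visitor and timestamp - visitor[-1][-1] <= timeout:
            # Short gap since the last request: same session.
            visitor[-1].append(timestamp)
        else:
            # First request from this IP, or the gap exceeded the timeout.
            visitor.append([timestamp])
    return sessions

start = datetime(2001, 12, 1, 0, 17, 42)
requests = [
    ("194.51.47.214", start),
    ("194.51.47.214", start + timedelta(minutes=5)),
    ("194.51.47.214", start + timedelta(hours=2)),
]
sessions = split_sessions(requests)
print(len(sessions["194.51.47.214"]))
```

     One of the problems is visible in the sketch itself: several visitors
     behind the same proxy share one address and would be merged into a
     single session.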

     Not all entries are captured

     -   ISPs cache specific pages



     -   Web browsers also have their own local
         caches




     Confusion of domains

     -   There is nothing to stop a commercial entity
         from registering a site in the .org domain.
     -   Sites in the .com domain and other domains
         can also be located in foreign countries, so you
         cannot tell exactly which requests are coming
         from users in other countries.
     -   .edu domains exist only in the USA, so we cannot
         identify a German educational site from the last
         part of its domain name.

     Conclusion

     -   There is a great deal of
         useful information you
         can get from web logs.
     -   There is still a lot
         to do in this
         field in the future.



