          Basic WWW Technologies &
          Mathematical Background
              (Chap 2 & 1, Baldi)

                 Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering,
              National Cheng Kung University
           World Wide Web
• The World Wide Web (Web) is a network
  of information resources.
• The Web relies on three mechanisms to
  make these resources available:
  1. A uniform naming scheme for locating
     resources on the web (e.g., URIs).
  2. Protocols, for access to named resources
     over the web (e.g., HTTP).
  3. Hypertext, for easy navigation among
     resources (e.g., HTML).
              Internet vs. Web
• Internet:
   – Internet is a more general term
   – Includes physical aspect of underlying networks and
     mechanisms such as email, FTP, HTTP…
• Web:
   – Associated with information stored on the Internet
   – Refers to a broader class of networks, i.e. Web of
     English Literature
• Both Internet and web are networks

Essential Components of WWW
• Resources (HTML, HyperText Markup Language):
  – Conceptual mappings to concrete or abstract entities, which do
    not change in the short term
  – Tagging support for structuring and laying out documents
• Resource identifiers (hyperlinks):
  – Strings of characters that represent generalized addresses and may
    contain instructions for accessing the identified resource
  – For example, a URL is used to identify the Google homepage
• Transfer protocols (HTTP, Hypertext Transfer Protocol):
  – Conventions that regulate the communication between a
    browser (web user agent) and a server

   Standard Generalized Markup
        Language (SGML)
• Based on GML (generalized markup language),
  developed by IBM in the 1960s
• An international standard (ISO 8879:1986)
  defines how descriptive markup should be
  embedded in a document
  – Markup: extra information characterizing structure of
    a document
• Gave birth to the extensible markup language
  (XML), W3C recommendation in 1998

          SGML Components
• SGML documents have three parts:
  – Declaration: specifies which characters and delimiters
    may appear in the application
  – DTD (Document Type Definition)/ style sheet: defines
    the syntax of markup constructs
  – Document instance: actual text (with the tags) of the document
• More info could be found:

         HTML Background
• HTML was originally developed by Tim Berners-
  Lee while at CERN, and popularized by the
  Mosaic browser developed at NCSA.
• The Web depends on Web page authors and
  vendors sharing the same conventions for HTML.
  This has motivated joint work on specifications
  for HTML.
• HTML standards are organized by W3C :

         HTML Functionalities
• HTML gives authors the means to:
  – Publish online documents with headings, text, tables,
    lists, photos, etc
     • Include spread-sheets, video clips, sound clips, and other
       applications directly in their documents
  – Link information via hypertext links, at the click of a button
  – Design forms for conducting transactions with remote
    services, for use in searching for information, making
    reservations, ordering products, etc

Sample Webpage

     Sample Webpage: HTML
<HTML>
  <HEAD>
    <TITLE>The title of the webpage</TITLE>
  </HEAD>
  <BODY> <P>Body of the webpage
  </BODY>
</HTML>

              HTML Structure
• An HTML document is divided into a head section
  (here, between <HEAD> and </HEAD>) and a body
  (here, between <BODY> and </BODY>)
• The title of the document appears in the head (along
  with other information about the document)
• The content of the document appears in the body.
  The body in this example contains just one paragraph,
  marked up with <P>

             HTML Hyperlink
• <a href="relations/alumni">alumni</a>
• A link is a connection from one Web resource
  to another
• It has two ends, called anchors, and a direction
• Starts at the "source" anchor and points to the
  "destination" anchor, which may be any Web
  resource (e.g., an image, a video clip, a sound
  bite, a program, an HTML document)
       Resource Identifiers
• Uniform Resource Identifiers (URIs)
  include two overlapping subsets:
  – URLs: Uniform Resource Locators
  – URNs: Uniform Resource Names

             Introduction to URIs
•   Every resource available on the Web has an
    address that may be encoded by a URI
•   URIs typically consist of three pieces:
    –   The naming scheme of the mechanism used to
        access the resource. (HTTP, FTP)
    –   The name of the machine hosting the resource
    –   The name of the resource itself, given as a path

             URI Example
• There is a document available via the HTTP protocol
• Residing on the machines hosting
• Accessible via the path "/TR"
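
As a minimal sketch of the three pieces of a URI, Python's standard `urllib.parse.urlsplit` can separate the scheme, the host and the path. The host `www.example.org` is a hypothetical stand-in (an assumption, not the machine named on the slide); only the HTTP scheme and the path "/TR" come from the example above.

```python
from urllib.parse import urlsplit

# Hypothetical URI: the host is a placeholder, not taken from the slide.
uri = "http://www.example.org/TR"

parts = urlsplit(uri)
print(parts.scheme)  # naming scheme of the access mechanism
print(parts.netloc)  # name of the machine hosting the resource
print(parts.path)    # name of the resource itself, given as a path
```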

• Describe how messages are encoded and
• Different Layering Architectures
• ISO OSI 7-Layer Architecture
• TCP/IP 4-Layer Architecture

ISO OSI Layering Architecture

  TCP/IP Layering Architecture
• A simplified model that provides an end-to-end
  reliable connection
• The network layer
  – Hosts drop packets into this layer, which
    routes them toward the destination
  – Only promises best-effort delivery ("try my best")
• The transport layer
  – Reliable byte-oriented stream

Hypertext Transfer Protocol (HTTP)
• Runs over a connection-oriented protocol (TCP) and is
  used to carry WWW traffic between a browser
  and a server
• One of the application layer protocols
  supported by the Internet
• HTTP communication is established via a
  TCP connection, by default to server port 80
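
The exchange can be sketched end to end without leaving the local machine. The example below is only an illustration under stated assumptions: it starts a throwaway server from Python's standard `http.server` module on an ephemeral localhost port (not port 80), then plays the role of the browser by opening a TCP connection and writing a GET request by hand. `HelloHandler` and `fetch_raw` are hypothetical names.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical server side: answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<HTML><BODY><P>hello</BODY></HTML>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet

def fetch_raw(host, port, path="/"):
    """Browser side: open a TCP connection and send a GET request by hand."""
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks)

# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

response = fetch_raw("127.0.0.1", server.server_port)
server.shutdown()
server.server_close()

status_line = response.split(b"\r\n", 1)[0].decode("ascii")
print(status_line)
```

The first line of the response is the status line; the headers and the HTML body follow after it.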

GET Method in HTTP


<HTML>
  <Form action="..." method="post">
  [1] Median Eminence (multiple selections allowed):
  1.<input type="checkbox" name="Median Eminence" value="secretion">secretion
  2.<input type="checkbox" name="Median Eminence" value="general">general
  3.<input type="checkbox" name="Median Eminence" value="王錫崗">王錫崗
  4.<input type="checkbox" name="Median Eminence" value="pituitary">pituitary
  Other:<input type="text" name="Median Eminence">
  <input type="submit" value="Confirm">
  </Form>
</HTML>
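
When a form like the one above is submitted, the browser encodes the checked boxes as name=value pairs. A rough sketch with the standard `urllib.parse` module follows; the selected values and the CGI address `http://www.example.org/cgi-bin/survey` are assumptions made for illustration. With GET the encoded string is appended to the URL after "?", while with POST the same string travels in the request body.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical selection: two boxes checked under the same field name.
fields = [
    ("Median Eminence", "secretion"),
    ("Median Eminence", "pituitary"),
]

encoded = urlencode(fields)
print(encoded)  # Median+Eminence=secretion&Median+Eminence=pituitary

# GET: the pairs become the query string of a (hypothetical) CGI address.
get_url = "http://www.example.org/cgi-bin/survey?" + encoded

# Server side: decode the pairs back into field name -> list of values.
decoded = parse_qs(encoded)
print(decoded)
```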

CGI processing

CGI (Common Gateway Interface)

                   Service Request

  Web Browser                           Web Server

                   Service Response
                                             Service Processing
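
The request, processing and response cycle above can be sketched as a tiny CGI-style program. Under the CGI convention the web server passes the query string in the `QUERY_STRING` environment variable and sends whatever the program writes to standard output back to the browser. The function name `cgi_response` and the field name `q` are assumptions made for this illustration.

```python
import html
import os
import sys
from urllib.parse import parse_qs

def cgi_response(environ):
    """Service processing: build the response text for one request."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    term = query.get("q", [""])[0]  # "q" is a hypothetical field name
    body = f"<HTML><BODY><P>You searched for: {html.escape(term)}</BODY></HTML>"
    # A CGI response is a header block, a blank line, then the document.
    return "Content-Type: text/html\r\n\r\n" + body

if __name__ == "__main__":
    # When run by a web server, the request arrives through the environment.
    sys.stdout.write(cgi_response(os.environ))
```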


HTTP Request Processing

GNU Wget

CGI: Get query search-results from
       Google using Wget

            Homework (1)
• Meta-search engine: dispatch the user
  query to several engines at the same time, then
  collect and merge the results into one list
  for the user.
• Homework: Develop a meta-search engine
  that responds to a user query with combined
  search results from a few search engines.
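
One possible shape for the dispatch and merge steps is sketched below. It is a starting point under stated assumptions, not a reference solution: `engine_a` and `engine_b` are hypothetical stubs returning fixed ranked lists, where a real system would fetch and parse result pages, and round-robin interleaving with duplicate removal is just one of many merging policies.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import zip_longest

# Hypothetical stand-ins for real search engines: each returns a ranked URL list.
def engine_a(query):
    return ["http://a.example/1", "http://shared.example/x", "http://a.example/2"]

def engine_b(query):
    return ["http://shared.example/x", "http://b.example/1"]

def meta_search(query, engines):
    # Dispatch the query to all engines at the same time.
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))

    # Merge: take rank 1 from every engine, then rank 2, and so on,
    # keeping only the first occurrence of each URL.
    merged, seen = [], set()
    for same_rank in zip_longest(*result_lists):
        for url in same_rank:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

results = meta_search("web crawler", [engine_a, engine_b])
print(results)
```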

       Domain Name System
• DNS (Domain Name System): maps
  domain names to IP addresses
• IPv4:
  – Initially deployed on January 1, 1983, and
    still the most commonly used version
  – 32-bit address, written as 4 decimal numbers
    separated by dots, each ranging from 0 to 255
• IPv6:
  – Revision of IPv4 with 128-bit addresses
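
The address sizes can be checked with Python's standard `ipaddress` module. The two addresses below come from ranges reserved for documentation and are used here only as examples; `resolve` is a hypothetical helper that wraps the system's name lookup and needs a working resolver to return anything.

```python
import ipaddress
import socket

v4 = ipaddress.ip_address("192.0.2.1")    # example IPv4 address
v6 = ipaddress.ip_address("2001:db8::1")  # example IPv6 address

print(v4.version, v4.max_prefixlen)  # address length in bits for IPv4
print(v6.version, v6.max_prefixlen)  # address length in bits for IPv6

# The dotted form is just a readable spelling of one 32-bit integer.
as_integer = int(v4)
print(as_integer)

def resolve(name):
    """Map a domain name to an IPv4 address using the system resolver."""
    return socket.gethostbyname(name)
```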
      Top Level Domains (TLD)
• Top-level domain names include .com, .edu, .gov and
  ISO 3166 country codes such as .de, .fr, .it
• There are three types of top-level domains:
  – Generic domains were created for use by the Internet public
  – Country code domains were created to be used by
    individual countries
  – The .arpa (Address and Routing Parameter Area) domain
    is designated to be used exclusively for Internet
    infrastructure purposes

               Server Log Files
• Server Transfer Log: transactions between a
  browser and server are logged
  – IP address and the time of the request
  – Method of the request (GET, HEAD, POST…)
  – Status code, a response from the server
  – Size in bytes of the transaction
• Referrer Log: where the request originated
• Agent Log: browser software making the request
• Error Log: requests that resulted in errors (e.g., 404)
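
The fields of a transfer log entry can be pulled apart with a regular expression. The sketch below assumes the widely used Common Log Format; actual servers may be configured differently, and the sample line is invented for illustration.

```python
import re

# Assumed layout: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Invented sample entry.
line = '192.0.2.7 - - [10/Oct/2003:13:55:36 +0800] "GET /index.html HTTP/1.0" 200 2326'

match = LOG_PATTERN.match(line)
entry = match.groupdict()
print(entry["host"], entry["time"])
print(entry["method"], entry["path"])
print(entry["status"], entry["size"])
```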
        Server Log Analysis
• Most and least visited web pages
• Entry and exit pages
• Referrals from other sites or search engines
• What the searched keywords are
• How many clicks/page views a page receives
• Error reports, such as broken links
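
Two of these reports, page views per page and broken links, reduce to counting. The sketch below works on a small invented list of (path, status) pairs standing in for parsed log entries.

```python
from collections import Counter

# Invented requests: (requested path, status code).
requests = [
    ("/index.html", 200),
    ("/about.html", 200),
    ("/index.html", 200),
    ("/old.html", 404),
    ("/index.html", 200),
    ("/old.html", 404),
]

# Page views: successful requests per page.
page_views = Counter(path for path, status in requests if status == 200)

# Broken links: paths that produced "404 Not Found".
broken_links = Counter(path for path, status in requests if status == 404)

most_visited = page_views.most_common(1)[0]
print(most_visited)        # ('/index.html', 3)
print(dict(broken_links))  # {'/old.html': 2}
```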
            Search Engines
• According to Pew Internet & American Life
  Project Report (2002), search engines are the
  most popular way to locate information online
• About 33 million U.S. Internet users query on
  search engines on a typical day.
• More than 80% have used search engines
• Search Engines are measured by coverage and
              Web Crawler
• A crawler is a program that picks up a
  page and follows all the links on that page
• Crawler = Spider
• Types of crawler:
  – Breadth First
  – Depth First

       Breadth First Crawlers
• Use breadth-first search (BFS) algorithm
• Get all links from the starting page, and
  add them to a queue
• Pick the first link from the queue, get all
  links on that page and add them to the queue
• Repeat the above step until the queue is empty
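
The steps above can be sketched on a small in-memory link graph. `LINKS` is a hypothetical stand-in for fetching a page and extracting its links; a real crawler would download each page instead. A set of already seen pages is added so that a page linked from several places is visited only once.

```python
from collections import deque

# Hypothetical link graph: page -> links found on that page.
LINKS = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": [], "E": []}

def bfs_crawl(start, links):
    order = [start]
    seen = {start}
    queue = deque(links[start])  # all links from the starting page
    while queue:                 # repeat until the queue is empty
        page = queue.popleft()   # pick the first link from the queue
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        queue.extend(links[page])  # add that page's links to the queue
    return order

bfs_order = bfs_crawl("A", LINKS)
print(bfs_order)  # ['A', 'B', 'C', 'D', 'E']
```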

        Depth First Crawlers
• Use depth first search (DFS) algorithm
• Get the first unvisited link from the start page
• Visit that link and get its first unvisited link
• Repeat the above step until there are no unvisited links
• Go to the next unvisited link in the previous
  level and repeat the second step
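
A recursive sketch of the same idea follows, again on a hypothetical in-memory link graph rather than real pages. Returning from a recursive call corresponds to going back to the previous level and taking its next unvisited link.

```python
# Hypothetical link graph: page -> links found on that page.
LINKS = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": [], "E": []}

def dfs_crawl(page, links, visited=None):
    if visited is None:
        visited = []
    visited.append(page)
    for link in links[page]:
        if link not in visited:  # follow the first unvisited link, as deep as possible
            dfs_crawl(link, links, visited)
    return visited

dfs_order = dfs_crawl("A", LINKS)
print(dfs_order)  # ['A', 'B', 'D', 'C', 'E']
```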

• Overlap analysis used for estimating
  the size of the indexable web
• W: set of webpages
• Wa, Wb: pages crawled by two
  independent engines a and b
• P(Wa), P(Wb): probabilities that a page
  was crawled by a or b
  – P(Wa)=|Wa| / |W|
  – P(Wb)=|Wb| / |W|
            Overlap Analysis
• P(Wa ∩ Wb | Wb) = P(Wa ∩ Wb) / P(Wb)
                  = |Wa ∩ Wb| / |Wb|
• If a and b are independent:
  – P(Wa ∩ Wb) = P(Wa) * P(Wb)
  – P(Wa ∩ Wb | Wb) = P(Wa) * P(Wb) / P(Wb)
                    = (|Wa| / |W|) * (|Wb| / |W|) / (|Wb| / |W|)
                    = |Wa| / |W|
                    = P(Wa)

           Overlap Analysis
• Using |W| = |Wa| / P(Wa), the researchers estimated that:
  – The Web had at least 320 million pages in 1997
  – 60% of the Web was covered by six major search engines
  – Maximum coverage of a single engine was
    1/3 of the Web
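
The estimator can be tried on a toy example. Everything below is invented for illustration: a "web" of 1000 numbered pages, engine a indexing the even-numbered pages and engine b indexing the multiples of 5, which makes the two crawls independent in the sense used above. Only |Wa|, |Wb| and the overlap are used for the estimate, as would be the case in practice.

```python
# Invented toy web and two invented, independent crawls.
web = set(range(1000))
wa = {page for page in web if page % 2 == 0}  # pages crawled by engine a
wb = {page for page in web if page % 5 == 0}  # pages crawled by engine b

# P(Wa) is estimated from the overlap: |Wa ∩ Wb| / |Wb|
p_wa_estimate = len(wa & wb) / len(wb)

# |W| = |Wa| / P(Wa)
w_estimate = len(wa) / p_wa_estimate

print(p_wa_estimate)  # 0.5
print(w_estimate)     # 1000.0
```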

How to Improve the Coverage?
• Meta-search engine: dispatch the user
  query to several engines at the same time, then
  collect and merge the results into one list
  for the user.
• Any suggestions?
• Homework: Develop a meta-search engine
  that responds to a user query with combined
  search results from a few search engines.
• Model uncertainty: make inferences about
  events given observed data
• An event e: proposition or statement about the
  world at large
  – “the number of Web pages in existence on 1 January
    2003 was greater than five billion”
• A probability P(e): can be viewed as a number
  that reflects our uncertainty about whether e is
  true or false in the real world, given whatever
  information we have available.
         Learning from a Bayesian Perspective
• A conditional probability P(e | D): represent the degree
  of belief (Bayesian interpretation of probability), where D
  is the background information (data) on which our belief
  is based.
• Bayesian approach: probability as being a dynamic
  entity updated when more data arrive
$$P(e \mid D) = \frac{P(D \mid e)\, P(e)}{P(D)}$$
   – Prior probability: P(e) is your belief in the event e before you see
     any data
   – Posterior probability: P(e | D) reflects your updated belief in
     event e given the observed data D
   – Likelihood: P(D | e) is the probability of the data under the
     assumption that e is true
       • How to model P(D | e)?
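 Standard Probabilistic Distributions
• Discrete distributions
  – Binomial: $P(X = k \mid p, n) = \binom{n}{k}\, p^{k} (1 - p)^{n - k}$
  – Multinomial: $P(X_1 = k_1, \ldots, X_m = k_m) = \frac{n!}{k_1! \cdots k_m!}\, p_1^{k_1} \cdots p_m^{k_m}$
  – Geometric: $P(X = k) = p (1 - p)^{k - 1}$
  – Poisson: $P(X = k \mid \lambda) = e^{-\lambda} \frac{\lambda^{k}}{k!}$
• Continuous distributions
  – Gaussian: $N(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
  – Exponential: $f(x \mid \lambda) = \lambda e^{-\lambda x}$
  – Gamma: $\Gamma(x \mid \alpha, \lambda) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\lambda x}$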
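
Three of the discrete distributions above (binomial, geometric, Poisson) can be written directly from their textbook definitions. This is a small sketch for experimenting with the formulas; the function names are chosen for this example, and the parameter values in the usage line are arbitrary.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k | p, n): k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def geometric_pmf(k, p):
    """P(X = k): the first success happens on trial k (k = 1, 2, ...)."""
    return p * (1 - p) ** (k - 1)

def poisson_pmf(k, lam):
    """P(X = k | lambda): k events when lam events are expected."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Sanity check: the binomial probabilities over k = 0..n add up to 1.
total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))
print(total)
```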

       Learning from a Bayesian
          Perspective (cont.)
$$P(e \mid D) = \frac{P(D \mid e)\, P(e)}{P(D)}$$

• Take logarithms for easier operations

$$\log P(e \mid D) = \log P(D \mid e) + \log P(e) - \log P(D)$$

• Obtain more data D2 (second data set)

$$P(e \mid D, D_2) = \frac{P(D_2 \mid e, D)\, P(e \mid D)}{P(D_2 \mid D)}$$
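
The update with a second data set can be checked numerically. The numbers below are invented: the event e is "the coin is biased with P(heads) = 0.8", the alternative is a fair coin, and D and D2 are each a single observed head. The sketch also assumes that D2 is independent of D once e is known, so that P(D2 | e, D) = P(D2 | e). Updating twice, with the first posterior serving as the new prior, gives the same answer as one update on both observations.

```python
import math

def bayes_update(prior, likelihood_e, likelihood_not_e):
    """Return P(e | data) from P(e), P(data | e) and P(data | not e)."""
    evidence = likelihood_e * prior + likelihood_not_e * (1 - prior)
    return likelihood_e * prior / evidence

# Invented numbers for illustration.
P_HEADS_IF_E = 0.8      # biased coin
P_HEADS_IF_NOT_E = 0.5  # fair coin
prior = 0.5

# Sequential: the posterior after D becomes the prior for D2.
posterior_d = bayes_update(prior, P_HEADS_IF_E, P_HEADS_IF_NOT_E)
posterior_d_d2 = bayes_update(posterior_d, P_HEADS_IF_E, P_HEADS_IF_NOT_E)

# Batch: both heads at once.
posterior_batch = bayes_update(prior, P_HEADS_IF_E**2, P_HEADS_IF_NOT_E**2)

print(posterior_d, posterior_d_d2, posterior_batch)
print(math.isclose(posterior_d_d2, posterior_batch))  # True
```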
  Parameter Estimation from Data
• Maximum a posteriori (MAP)
   – The objective of parameter estimation is to find or approximate
     the best set of parameters for a model, i.e., to find the set of
     parameters $\theta$ maximizing the posterior $P(\theta \mid D)$, or $\log P(\theta \mid D)$.
     This is called maximum a posteriori (MAP) estimation.
   – To deal with positive quantities, we can minimize $-\log P(\theta \mid D)$:
     $$\mathcal{E}(\theta) = -\log P(\theta \mid D) = -\log P(D \mid \theta) - \log P(\theta) + \log P(D)$$
   – $P(D)$ plays the role of a normalizing constant and is thus
     irrelevant for the optimization, which becomes the minimization of
     $$\mathcal{E}(\theta) = -\log P(D \mid \theta) - \log P(\theta)$$
   – If the prior $P(\theta)$ is uniform over the sample space, then the problem
     reduces to finding the maximum of $P(D \mid \theta)$, or $\log P(D \mid \theta)$. This is
     known as maximum likelihood (ML) estimation.
   – The ML estimation procedure is simpler, i.e., the minimization of
     $$\mathcal{E}(\theta) = -\log P(D \mid \theta)$$
                Basic Formula
$$P(x) = \sum_h P(x, h)$$
$$P(x \mid y) = \sum_h P(x, h \mid y)$$

$$P(x, h \mid y) = P(h \mid y)\, P(x \mid y, h)$$
$$P(x \mid y) = \sum_h P(h \mid y)\, P(x \mid y, h)$$

When x is independent of y given h:
$$P(x \mid y) = \sum_h P(h \mid y)\, P(x \mid h)$$

                          WMMKS Lab
