
Lecture2


									Basic WWW Technologies &
 Mathematical Background
              (Chap 2 & 1, Baldi)

                 Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering,
              National Cheng Kung University
                        2006/10/5
           World Wide Web
• The World Wide Web (Web) is a network
  of information resources.
• The Web relies on three mechanisms to
  make these resources available:
  1. A uniform naming scheme for locating
     resources on the web (e.g., URIs).
  2. Protocols, for access to named resources
     over the web (e.g., HTTP).
  3. Hypertext, for easy navigation among
     resources (e.g., HTML).
                                                2
              Internet vs. Web
• Internet:
   – Internet is a more general term
   – Includes physical aspect of underlying networks and
     mechanisms such as email, FTP, HTTP…
• Web:
   – Associated with information stored on the Internet
   – Can also refer to a broader class of networks, e.g., the
     web of English literature
• Both the Internet and the Web are networks

                                                           3
Essential Components of WWW
• Resources (HTML, HyperText Markup
  Language)
  – Conceptual mappings to concrete or abstract entities, which do
    not change in the short term
  – Tagging support for structuring and laying out documents
• Resource identifiers (hyperlinks):
  – Strings of characters that represent generalized addresses and may
    contain instructions for accessing the identified resource
  – http://www.google.com/ is used to identify the Google homepage
• Transfer protocols (HTTP, HyperText
  Transfer Protocol)
  – Conventions that regulate the communication between a
    browser (web user agent) and a server

                                                                     4
   Standard Generalized Markup
        Language (SGML)
• Based on GML (generalized markup language),
  developed by IBM in the 1960s
• An international standard (ISO 8879:1986)
  defines how descriptive markup should be
  embedded in a document
  – Markup: extra information characterizing structure of
    a document
• Gave birth to the extensible markup language
  (XML), W3C recommendation in 1998

                                                            5
          SGML Components
• SGML documents have three parts:
  – Declaration: specifies which characters and delimiters
    may appear in the application
  – DTD (Document Type Definition)/ style sheet: defines
    the syntax of markup constructs
  – Document instance: actual text (with the tags) of the
    document
• More information can be found at:
  http://www.W3.Org/markup/SGML

                                                         6
         HTML Background
• HTML was originally developed by Tim Berners-
  Lee while at CERN, and popularized by the
  Mosaic browser developed at NCSA.
• The Web depends on Web page authors and
  vendors sharing the same conventions for HTML.
  This has motivated joint work on specifications
  for HTML.
• HTML standards are organized by W3C :
  http://www.w3.org/MarkUp/

                                               7
         HTML Functionalities
• HTML gives authors the means to:
  – Publish online documents with headings, text, tables,
    lists, photos, etc
     • Include spread-sheets, video clips, sound clips, and other
       applications directly in their documents
  – Link information via hypertext links, at the click of a
    button
  – Design forms for conducting transactions with remote
    services, for use in searching for information, making
    reservations, ordering products, etc


                                                                    8
Sample Webpage




                 9
     Sample Webpage: HTML
            Structure
<HTML>
   <HEAD>
     <TITLE>The title of the webpage</TITLE>
   </HEAD>
   <BODY> <P>Body of the webpage
   </BODY>
</HTML>

                                             10
              HTML Structure
• An HTML document is divided into a head section
  (here, between <HEAD> and </HEAD>) and a body
  (here, between <BODY> and </BODY>)
• The title of the document appears in the head (along
  with other information about the document)
• The content of the document appears in the body.
  The body in this example contains just one paragraph,
  marked up with <P>


                                                     11
             HTML Hyperlink
• <a href="relations/alumni">alumni</a>
• A link is a connection from one Web resource
  to another
• It has two ends, called anchors, and a direction
• Starts at the "source" anchor and points to the
  "destination" anchor, which may be any Web
  resource (e.g., an image, a video clip, a sound
  bite, a program, an HTML document)
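The link on this slide can be pulled out of a page and turned into an absolute address. The following is a minimal sketch using Python's standard library; the base URL `http://www.example.edu/index.html` is a made-up assumption, since the slide does not say which page contains the link.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the destination anchors of all <a href=...> links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # address of the page holding the source anchors
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve a relative destination against the source page.
                    self.links.append(urljoin(self.base_url, value))


# Hypothetical source page; only the <a> element comes from the slide.
parser = LinkExtractor("http://www.example.edu/index.html")
parser.feed('<a href="relations/alumni">alumni</a>')
print(parser.links)
```

The same extraction step is what a crawler would apply to each fetched page.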
                                                 12
       Resource Identifiers
• Uniform Resource Identifiers (URI):
  include two overlapping subsets of
  identifiers
  – URL: Uniform Resource Locators
  – URN: Uniform Resource Names




                                        13
             Introduction to URIs
•   Every resource available on the Web has an
    address that may be encoded by a URI
•   URIs typically consist of three pieces:
    –   The naming scheme of the mechanism used to
        access the resource. (HTTP, FTP)
    –   The name of the machine hosting the resource
    –   The name of the resource itself, given as a path


                                                           14
             URI Example
• http://www.w3.org/TR
• There is a document available via the HTTP
  protocol
• Residing on the machines hosting
  www.w3.org
• Accessible via the path "/TR"
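As an illustration of the three pieces, the example URI can be split with Python's standard `urllib.parse` module. The helper name `split_uri` is introduced here only for illustration.

```python
from urllib.parse import urlparse


def split_uri(uri):
    """Return the (scheme, host, path) pieces of a URI."""
    parts = urlparse(uri)
    return parts.scheme, parts.netloc, parts.path


scheme, host, path = split_uri("http://www.w3.org/TR")
print(scheme)  # naming scheme of the access mechanism
print(host)    # machine hosting the resource
print(path)    # name of the resource, given as a path
```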



                                               15
              Protocols
• Describe how messages are encoded and
  exchanged
• Different Layering Architectures
• ISO OSI 7-Layer Architecture
• TCP/IP 4-Layer Architecture




                                      16
ISO OSI Layering Architecture




                                17
TCP/IP Layering Architecture




                               18
  TCP/IP Layering Architecture
• A simplified model that provides an end-to-
  end reliable connection
• The network layer
  – Hosts drop packets into this layer, which
    routes them toward the destination
  – Promises only best-effort delivery (“try my best”)
• The transport layer
  – Reliable byte-oriented stream

                                                 19
Hypertext Transfer Protocol (HTTP)
• Runs over a connection-oriented protocol (TCP) and
  carries WWW traffic between a browser
  and a server
• One of the application-layer protocols
  supported by the Internet
• HTTP communication is established via a
  TCP connection, by default to server port 80
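To make the message format concrete, here is a minimal sketch of what a browser writes to the TCP connection for a GET request and how the first line of the reply can be read. It does not open a network connection; `sample_response` is a canned reply invented for the example, and both function names are assumptions.

```python
def build_get_request(host, path="/"):
    """Build the bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, path, protocol version
        f"Host: {host}",         # the Host header is required in HTTP/1.1
        "Connection: close",
        "",                      # an empty line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status_line(response):
    """Extract (version, status code, reason) from raw response bytes."""
    status_line = response.split(b"\r\n", 1)[0].decode("ascii")
    version, code, reason = status_line.split(" ", 2)
    return version, int(code), reason


request = build_get_request("www.w3.org", "/TR")
# Canned reply used instead of a live server (an assumption of this sketch).
sample_response = b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n"
print(request.decode("ascii"))
print(parse_status_line(sample_response))
```

In a real client these bytes would be sent over a TCP socket connected to port 80 of the server.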


                                         20
GET Method in HTTP




                     21
Form




       22
                           Form
<HTML>
<Form action="http://140.116.246.174/cgi-bin/meshdb.cgi" method="post">
[1] Median Eminence (multiple selections allowed):
1.<input type="checkbox" name="Median Eminence" value="分泌">Secretion
2.<input type="checkbox" name="Median Eminence" value="一般">General
3.<input type="checkbox" name="Median Eminence" value="王錫崗">王錫崗
4.<input type="checkbox" name="Median Eminence" value="垂體">Pituitary
Other:<input type="text" name="Median Eminence">
<input type="submit" value="Confirm">
</Form>
</HTML>
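When such a form is submitted with method=post, the browser sends the checked boxes and the text field as name=value pairs in the request body. The sketch below shows that encoding with Python's `urllib.parse`; the three values are English placeholders chosen for readability, not the values used by the form above.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical submission: two boxes ticked plus free text in the "other" field.
# Every control shares the name "Median Eminence", so the name repeats.
fields = [
    ("Median Eminence", "secretion"),
    ("Median Eminence", "pituitary"),
    ("Median Eminence", "other text"),
]

body = urlencode(fields)  # application/x-www-form-urlencoded request body
print(body)

decoded = parse_qs(body)  # what the receiving program can recover
print(decoded)
```

Spaces are sent as `+` and repeated names are preserved, which is how the server side can tell that several boxes were selected.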



                                                                23
CGI processing




                 24
CGI (Common Gateway Interface)


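[Figure: the Web browser sends a service request to the Web server; the
server hands service processing to a CGI program, which may query a
database; the CGI output is returned through the server to the browser
as the service response.]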
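The CGI side of this exchange can be sketched as follows. Under CGI the server passes request details in environment variables such as REQUEST_METHOD, CONTENT_LENGTH and QUERY_STRING, feeds the POST body to standard input, and relays whatever the program prints. The function name, the field names `q` and `engine`, and the use of a text stream in place of the real byte stream are simplifying assumptions.

```python
import io
from urllib.parse import parse_qs


def handle_cgi_request(environ, stdin):
    """Read form data the way a CGI program receives it and build a reply."""
    if environ.get("REQUEST_METHOD") == "POST":
        # POST: the form data is on standard input, CONTENT_LENGTH long.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length)
    else:
        # GET: the form data is in the query string.
        raw = environ.get("QUERY_STRING", "")
    form = parse_qs(raw)
    lines = [f"{name}: {', '.join(values)}" for name, values in sorted(form.items())]
    # A CGI reply is a header section, an empty line, then the body.
    return "Content-Type: text/plain\r\n\r\n" + "\n".join(lines)


# Simulated request (ASCII only, so character count equals byte count).
body = "q=web+crawler&engine=a&engine=b"
env = {"REQUEST_METHOD": "POST", "CONTENT_LENGTH": str(len(body))}
response = handle_cgi_request(env, io.StringIO(body))
print(response)
```

A deployed script would pass `os.environ` and `sys.stdin` instead of the simulated values, and would typically consult a database before producing its output.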
                                                             25
HTTP Request Processing




                          26
GNU Wget




           27
CGI: Get query search-results from
       Google using Wget




                                 28
            Homework (1)
• Meta-search engine: dispatches the user
  query to several engines at the same time, then
  collects and merges the results into one list
  for the user.
• Homework: Develop a meta-search engine
  that responds to a user query with combined
  search results from a few search engines.
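Only the merging step is sketched here, under the assumption that each engine has already returned a ranked list of URLs. Round-robin interleaving with duplicate removal is one possible merging strategy among many; the URLs are invented.

```python
from itertools import zip_longest


def merge_results(*ranked_lists):
    """Interleave ranked result lists round-robin, dropping duplicate URLs."""
    merged, seen = [], set()
    # zip_longest yields the 1st hit of every engine, then the 2nd, and so on,
    # padding shorter lists with None.
    for tier in zip_longest(*ranked_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Hypothetical result lists from two engines.
engine_a = ["http://a.example/1", "http://shared.example/", "http://a.example/2"]
engine_b = ["http://shared.example/", "http://b.example/1"]
print(merge_results(engine_a, engine_b))
```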


                                           29
       Domain Name System
• DNS (Domain Name System): mapping from
  domain names to IP addresses
• IPv4:
  – Initially deployed on January 1, 1983, and
    still the most commonly used version
  – 32-bit address, written as four decimal numbers
    separated by dots, ranging from 0.0.0.0 to
    255.255.255.255
• IPv6:
  – Revision of IPv4 with 128-bit addresses
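The address sizes can be checked with Python's standard `ipaddress` module. The helper `ipv4_to_int` is an illustrative name showing how the four dotted numbers pack into one 32-bit value; no DNS lookup is performed.

```python
import ipaddress


def ipv4_to_int(dotted):
    """Pack four dotted-decimal numbers into one 32-bit integer."""
    a, b, c, d = (int(part) for part in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


lowest = ipaddress.IPv4Address("0.0.0.0")
highest = ipaddress.IPv4Address("255.255.255.255")
print(int(lowest), int(highest))  # full IPv4 range as integers
print(highest.max_prefixlen)      # number of bits in an IPv4 address

loopback_v6 = ipaddress.IPv6Address("::1")
print(loopback_v6.max_prefixlen)  # number of bits in an IPv6 address
```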
                                                 30
      Top Level Domains (TLD)
• Top level domain names, .com, .edu, .gov and
  ISO 3166 country codes .de, .fr, .it
• There are three types of top-level domains:
  – Generic domains were created for use by the Internet
    public
  – Country code domains were created to be used by
    individual countries
  – The .arpa (Address and Routing Parameter Area)
    domain is designated to be used exclusively for
    Internet-infrastructure purposes

                                                          31
               Server Log Files
• Server Transfer Log: transactions between a
  browser and server are logged
  – IP address and time of the request
  – Method of the request (GET, HEAD, POST…)
  – Status code returned by the server
  – Size of the transaction in bytes
• Referrer Log: where the request originated
• Agent Log: browser software making the request
    (e.g., a spider)
• Error Log: requests that resulted in errors (e.g., 404)
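A transfer-log entry can be split into the fields listed above with a regular expression. The sketch assumes the server writes the widely used Common Log Format, which the slides do not specify, and the sample line is invented.

```python
import re

# Assumed layout (Common Log Format):
# host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # Servers write "-" when no body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Hypothetical log line.
sample = '192.0.2.1 - - [05/Oct/2006:10:15:30 +0800] "GET /index.html HTTP/1.0" 200 2326'
record = parse_log_line(sample)
print(record)
```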
                                                   32
        Server Log Analysis
• Most and least visited web pages
• Entry and exit pages
• Referrals from other sites or search
  engines
• What are the searched keywords
• How many clicks/page views a page
  received
• Error reports, like broken links
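Two of these reports, most visited pages and broken links, can be sketched with a counter over parsed log records. The records below are invented, and the (path, status, referrer) tuple layout is an assumption of this example.

```python
from collections import Counter

# Hypothetical parsed records: (requested path, status code, referrer).
records = [
    ("/index.html", 200, "http://search.example/?q=web+crawler"),
    ("/papers.html", 200, "/index.html"),
    ("/index.html", 200, "-"),
    ("/old.html", 404, "/papers.html"),
]


def page_views(records):
    """Count successful requests per page."""
    return Counter(path for path, status, _ in records if status == 200)


def broken_links(records):
    """Return (referrer, missing path) pairs for requests that ended in 404."""
    return [(referrer, path) for path, status, referrer in records if status == 404]


views = page_views(records)
print(views.most_common(1))   # most visited page
print(broken_links(records))  # pages that link to missing resources
```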
                                         33
Server Log Analysis




                      34
            Search Engines
• According to Pew Internet & American Life
  Project Report (2002), search engines are the
  most popular way to locate information online
• About 33 million U.S. Internet users query on
  search engines on a typical day.
• More than 80% have used search engines
• Search Engines are measured by coverage and
  recency
                                              35
              Web Crawler
• A crawler is a program that picks up a
  page and follows all the links on that page
• Crawler = Spider
• Types of crawler:
  – Breadth First
  – Depth First



                                            36
       Breadth First Crawlers
• Use breadth-first search (BFS) algorithm
• Get all links from the starting page, and
  add them to a queue
• Pick the 1st link from the queue, get all
  links on the page and add to the queue
• Repeat above step till queue is empty
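The steps above can be sketched as follows, with a small in-memory link graph standing in for the Web so that the example runs without network access. The graph and the function names are assumptions; the `seen` set is an addition that the slide does not mention but that is needed so the crawl terminates when pages link back to each other.

```python
from collections import deque

# Hypothetical link graph: page -> links found on that page.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": ["A"],
}


def get_links(page):
    """Stand-in for fetching a page and extracting its links."""
    return LINKS.get(page, [])


def bfs_crawl(start):
    """Visit pages in breadth-first order starting from `start`."""
    order = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()  # pick the first link from the queue
        order.append(page)
        for link in get_links(page):
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                queue.append(link)
    return order


print(bfs_crawl("A"))
```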



                                              37
Breadth First Crawlers




                         38
        Depth First Crawlers
• Use depth first search (DFS) algorithm
• Get the 1st link not visited from the start
  page
• Visit link and get 1st non-visited link
• Repeat above step till no non-visited links
• Go to next non-visited link in the previous
  level and repeat 2nd step
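A recursive sketch of these steps, again over a made-up in-memory link graph: following the first non-visited link corresponds to the recursive call, and returning from the call corresponds to going back to the previous level. The graph and the names are assumptions.

```python
# Hypothetical link graph: page -> links found on that page.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": ["A"],
}


def get_links(page):
    """Stand-in for fetching a page and extracting its links."""
    return LINKS.get(page, [])


def dfs_crawl(page, seen=None):
    """Visit pages in depth-first order starting from `page`."""
    if seen is None:
        seen = set()
    seen.add(page)
    order = [page]
    for link in get_links(page):
        if link not in seen:
            # Go one level deeper before looking at the next link here.
            order.extend(dfs_crawl(link, seen))
    return order


print(dfs_crawl("A"))
```

On a real crawl the recursion depth can grow very large, so an explicit stack is a common alternative.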

                                                39
Depth First Crawlers




                       40
               Coverage
• Overlap analysis used for estimating
  the size of the indexable web
• W: set of webpages
• Wa, Wb: pages crawled by two
  independent engines a and b
• P(Wa), P(Wb): probabilities that a page
  was crawled by a or b
  – P(Wa)=|Wa| / |W|
  – P(Wb)=|Wb| / |W|
                                            41
            Overlap Analysis
• P(Wa ∩ Wb | Wb) = P(Wa ∩ Wb) / P(Wb)
                  = |Wa ∩ Wb| / |Wb|
• If a and b are independent:
  – P(Wa ∩ Wb) = P(Wa) * P(Wb)
  – P(Wa ∩ Wb | Wb) = P(Wa) * P(Wb) / P(Wb)
                    = (|Wa| / |W|) * (|Wb| / |W|) / (|Wb| / |W|)
                    = |Wa| / |W|
                    = P(Wa)




                                                         42
           Overlap Analysis
• Using |W| = |Wa| / P(Wa), with P(Wa) estimated by
  |Wa ∩ Wb| / |Wb|, the researchers found:
  – The Web had at least 320 million pages in 1997
  – 60% of the Web was covered by six major
    engines
  – The maximum coverage of a single engine was
    1/3 of the Web
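The estimate can be tried on a toy example. The sketch assumes the two crawls are independent, as on the previous slide, and estimates P(Wa) by the fraction of b's pages that a also crawled. The 100-page "web" and both crawl sets are invented.

```python
def estimate_web_size(crawled_a, crawled_b):
    """Estimate |W| from two independent crawls via overlap analysis."""
    overlap = crawled_a & crawled_b
    if not overlap:
        raise ValueError("the crawls share no pages, so P(Wa) cannot be estimated")
    p_a = len(overlap) / len(crawled_b)  # P(Wa) ~ |Wa ∩ Wb| / |Wb|
    return len(crawled_a) / p_a          # |W| = |Wa| / P(Wa)


# Toy web of pages 0..99: engine a crawls the even pages,
# engine b crawls the multiples of 5.
crawled_a = set(range(0, 100, 2))
crawled_b = set(range(0, 100, 5))
print(estimate_web_size(crawled_a, crawled_b))
```

Here the overlap is the multiples of 10, so P(Wa) is estimated as 10/20 and the estimate recovers the true size of the toy web.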



                                                 43
How to Improve the Coverage?
• Meta-search engine: dispatches the user
  query to several engines at the same time, then
  collects and merges the results into one list
  for the user.
• Any suggestions?
• Homework: Develop a meta-search engine
  that responds to a user query with combined
  search results from a few search engines.
                                           44
                 Probability
• Model uncertainty: make inferences about
  events given observed data
• An event e: proposition or statement about the
  world at large
  – “the number of Web pages in existence on 1 January
    2003 was greater than five billion”
• A probability P(e): can be viewed as a number
  that reflects our uncertainty about whether e is
  true or false in the real world, given whatever
  information we have available.
                                                     45
         Learning from a Bayesian
                Perspective
• A conditional probability P(e | D) represents the degree
  of belief (the Bayesian interpretation of probability), where D
  is the background information (data) on which our belief
  is based.
• Bayesian approach: probability is treated as a dynamic
  quantity that is updated as more data arrive:
  $$P(e \mid D) = \frac{P(D \mid e)\,P(e)}{P(D)}$$
   – Prior probability: P(e) is your belief in the event e before you see
     any data
   – Posterior probability: P(e | D) reflects your updated belief in
     event e given the observed data D
   – Likelihood: P(D | e) is the probability of the data under the
     assumption that e is true
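A small numeric sketch of the update for an event e that is either true or false. The numbers are invented, and P(D) is expanded as P(D | e) P(e) + P(D | not e) P(not e), which is a step the slide leaves implicit.

```python
def posterior(prior, likelihood, likelihood_if_false):
    """Return P(e | D) from P(e), P(D | e) and P(D | not e) via Bayes' rule."""
    evidence = likelihood * prior + likelihood_if_false * (1 - prior)  # P(D)
    return likelihood * prior / evidence


# Hypothetical numbers: prior belief 0.2; the data are 9 times more
# probable when e is true than when it is false.
p_e_given_d = posterior(prior=0.2, likelihood=0.9, likelihood_if_false=0.1)
print(p_e_given_d)
```

With these numbers the belief in e rises from 0.2 to 0.18 / 0.26, roughly 0.69.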
                                                                       46
       • How to model P(D | e)?
 Standard Probabilistic Distribution
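• Discrete distributions
  – Binomial: $P(X = k \mid p, n) = \binom{n}{k} p^k (1 - p)^{n-k}$
  – Multinomial: $P(X_1 = k_1, \ldots, X_m = k_m) = \frac{n!}{k_1! \cdots k_m!}\, p_1^{k_1} \cdots p_m^{k_m}$
  – Geometric: $P(X = k) = p (1 - p)^{k-1}$
  – Poisson: $P(X = k \mid \lambda) = e^{-\lambda} \frac{\lambda^k}{k!}$
• Continuous distributions
  – Gaussian: $N(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
  – Exponential: $f(x \mid \lambda) = \lambda e^{-\lambda x}$
  – Gamma: $\Gamma(x \mid \alpha, \lambda) = \frac{\lambda^\alpha}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\lambda x}$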
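Three of these distributions written out directly with Python's `math` module, as a sketch of how a likelihood P(D | e) might be evaluated. The function names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k | p, n): probability of k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k | lambda): probability of k events at average rate lambda."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def gaussian_pdf(x, mu, sigma):
    """N(x | mu, sigma): Gaussian density with mean mu and std. deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)


# Likelihood of observing 7 heads in 10 flips if the coin is fair.
print(binomial_pmf(7, 10, 0.5))
print(poisson_pmf(0, 2.0))
print(gaussian_pdf(0.0, 0.0, 1.0))
```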




                                                                                                                         47
       Learning from a Bayesian
          Perspective (cont.)
         $$P(e \mid D) = \frac{P(D \mid e)\,P(e)}{P(D)}$$

• Take logarithms for easier operations
         $$\log P(e \mid D) = \log P(D \mid e) + \log P(e) - \log P(D)$$

• Obtain more data D2 (second data set)

         $$P(e \mid D, D_2) = \frac{P(D_2 \mid e, D)\,P(e \mid D)}{P(D_2 \mid D)}$$
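The sequential update can be checked numerically: updating on D and then on D2 gives the same posterior as updating once on both data sets. The sketch uses two invented hypotheses about a coin and assumes the flips are independent given the hypothesis, so that P(D2 | e, D) = P(D2 | e).

```python
def update(prior, likelihoods):
    """One Bayes step over a finite set of hypotheses."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    evidence = sum(unnormalized.values())  # plays the role of P(D)
    return {h: value / evidence for h, value in unnormalized.items()}


def likelihood(p_heads, flips):
    """Probability of a flip sequence such as 'HHT' for a coin with P(H) = p_heads."""
    result = 1.0
    for flip in flips:
        result *= p_heads if flip == "H" else 1 - p_heads
    return result


# Hypothetical setup: the coin is either fair or biased toward heads.
HYPOTHESES = {"fair": 0.5, "biased": 0.8}
prior = {"fair": 0.5, "biased": 0.5}
d1, d2 = "HHT", "HH"

after_d1 = update(prior, {h: likelihood(p, d1) for h, p in HYPOTHESES.items()})
after_d2 = update(after_d1, {h: likelihood(p, d2) for h, p in HYPOTHESES.items()})
batch = update(prior, {h: likelihood(p, d1 + d2) for h, p in HYPOTHESES.items()})
print(after_d2)
print(batch)
```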
                                                               48
  Parameter Estimation from Data
• Maximum a posteriori (MAP)
   – The objective of parameter estimation is to find or approximate
     the best set of parameters for a model, i.e., to find the set of
     parameters $\theta$ maximizing the posterior $P(\theta \mid D)$, or $\log P(\theta \mid D)$.
     This is called maximum a posteriori (MAP) estimation.
   – To deal with positive quantities, we can minimize $-\log P(\theta \mid D)$:
     $$\mathcal{E}(\theta) = -\log P(\theta \mid D) = -\log P(D \mid \theta) - \log P(\theta) + \log P(D)$$
   – P(D) plays the role of a normalizing constant and is thus
     irrelevant for the optimization, i.e., the minimization of
     $$\mathcal{E}(\theta) = -\log P(D \mid \theta) - \log P(\theta)$$
   – If the prior $P(\theta)$ is uniform over the sample space, then the problem
     reduces to finding the maximum of $P(D \mid \theta)$, or $\log P(D \mid \theta)$. This is
     known as maximum likelihood (ML) estimation.
   – Simpler ML estimation procedure, i.e., the minimization of
     $$\mathcal{E}(\theta) = -\log P(D \mid \theta)$$
                                                                       49
                Basic Formula
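$$P(x) = \sum_h P(x, h)$$
$$P(x \mid y) = \sum_h P(x, h \mid y)$$

$$P(x, h \mid y) = P(h \mid y)\, P(x \mid y, h)$$
$$P(x \mid y) = \sum_h P(h \mid y)\, P(x \mid y, h)$$

$$P(x \mid y) = \sum_h P(h \mid y)\, P(x \mid h)$$
(the last line holds when x is conditionally independent of y given h)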
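The first identity, and its chain-rule form P(x) = Σ_h P(h) P(x | h), can be checked on a small joint table. The table values and function names are invented for this sketch.

```python
# Hypothetical joint distribution P(x, h) with x in {0, 1} and h in {"a", "b"}.
JOINT = {(0, "a"): 0.1, (0, "b"): 0.3, (1, "a"): 0.4, (1, "b"): 0.2}


def marginal_x(x):
    """P(x) = sum over h of P(x, h)."""
    return sum(p for (xi, _), p in JOINT.items() if xi == x)


def marginal_h(h):
    """P(h) = sum over x of P(x, h)."""
    return sum(p for (_, hi), p in JOINT.items() if hi == h)


def conditional_x_given_h(x, h):
    """P(x | h) = P(x, h) / P(h)."""
    return JOINT[(x, h)] / marginal_h(h)


def marginal_x_via_chain_rule(x):
    """P(x) = sum over h of P(h) P(x | h)."""
    hidden_values = {h for _, h in JOINT}
    return sum(marginal_h(h) * conditional_x_given_h(x, h) for h in hidden_values)


print(marginal_x(1), marginal_x_via_chain_rule(1))
```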


                          WMMKS Lab

								