   HTTP – HyperText Transfer Protocol
 Representation and Management
     of Data on the Internet



                                 1
         A Common Protocol
• In order for two remote machines to
  “understand” each other they should
  – “speak the same language”
  – coordinate their “talk”
• The solution is to use protocols
• Examples:
  –   FTP – File Transfer Protocol
  –   SMTP – Simple Mail Transfer Protocol
  –   NNTP – Network News Transfer Protocol
  –   HTTP – HyperText Transfer Protocol
                                              2
      Why Was HTTP Needed?
• According to Tim Berners-Lee (1991), a
  protocol was needed with the following
  features:
  –   A subset of the file transfer protocol
  –   The ability to request an index search
  –   Automatic format negotiation
  –   The ability to refer the client to another
      server
                                                   3
[Figure: an HTTP request for http://www.cs.huji.ac.il/~dbi travels through a proxy server to the web server www.cs.huji.ac.il:80, which reads the resource from its file system; the HTTP response returns along the same path.]
                                                             4
[Figure: a request to the web server www.w3.org:80 passes through a chain of proxy servers: the department proxy, the university proxy and the Israel proxy.]
                               5
              Terminology
• User agent: client which initiates a request
  (browser, editor, web robot, …)
• Origin server: the server on which a given
  resource resides (web server a.k.a. HTTP
  server)
• Proxy: acts as both a server and a client
• Gateway: server which acts as intermediary
  for other servers
• Tunnel: acts as a blind relay between two
  connections
                                                 6
              Resources
• A resource is a chunk of information
  that can be identified by a URL
  (Uniform Resource Locator)
• A resource can be
  – A file
  – A dynamically created page
• What we see on the browser can be a
  combination of some resources
                                         7
    Uniform Resource Locator
 protocol://host:port/path?parameters#anchor

http://www.cs.huji.ac.il/~dbi/index.html#info

http://www.google.com/search?hl=en&q=blabla


  • There are other types of URLs
    – mailto:<account@site>
    – news:<newsgroup-name>
                                               8
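
As a rough illustration of the URL parts named above, the sketch below splits a URL with Python's standard urllib.parse module. The helper name describe_url is an assumption made for this example, and the second URL is the search URL that appears later in these slides.

```python
from urllib.parse import urlsplit, parse_qs

def describe_url(url):
    """Split a URL into the parts named on the slide (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,                   # None when no explicit port is given
        "path": parts.path,
        "parameters": parse_qs(parts.query),  # decodes %xx and "+" in values
        "anchor": parts.fragment,
    }

info = describe_url("http://www.cs.huji.ac.il/~dbi/index.html#info")
search = describe_url("http://www.google.com/search?hl=en&q=war%26peace+Tolstoy")
print(info["host"], info["path"], info["anchor"])
print(search["parameters"])
```

Note that the anchor is handled by the client only; it is not part of what is sent to the server.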
              In a URL
• In the parameters part of a URL, spaces
  are represented by “+”
• Characters such as &, +, % are encoded
  in the form “%xx”, where xx is the ASCII
  value in hexadecimal; for example, “&”
  = “%26”
• The inputs to the parameters are given
  as a list of pairs of a parameter and a
  value:
    var1=value1&var2=value2&var3=value3
                                            9
war&peace Tolstoy




            10
http://www.google.com/search?hl=en&q=war%26peace+Tolstoy
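
The encoding rules can be tried out with Python's standard urllib.parse.urlencode, which applies the “+” and “%xx” conventions to a list of parameter/value pairs. This is only a sketch; the helper name build_query is an assumption for this example.

```python
from urllib.parse import urlencode, parse_qs

def build_query(params):
    """Encode parameter/value pairs as var1=value1&var2=value2 (hypothetical helper)."""
    return urlencode(params)

query = build_query({"hl": "en", "q": "war&peace Tolstoy"})
print("http://www.google.com/search?" + query)

# Decoding reverses the process and recovers the original text
print(parse_qs(query)["q"][0])
```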




                                                   11
     Nesting in Page
[Figure: Index.html contains a left frame and a right frame; the frames in turn contain the images Jumping fish, Fairy icon and HUJI icon.]

 What we see on the browser can be
 a combination of some resources
                                         12
           Nested Objects
• Suppose a client accesses a page containing
  10 inline images. How many sessions will be
  required to display the page completely?
• The answer is 11 HTTP sessions – why?
• Some browsers/servers support a feature
  called keep-alive which can keep the
  connection open until it is explicitly closed
• How can this help?

                                              13
           HTTP Session
• A basic HTTP session has four phases:
  1. Client opens the connection (a TCP
     connection)
  2. Client makes the request
  3. Server sends a response
  4. Server closes the connection
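
The four phases can be sketched with a raw socket. To keep the example self-contained it talks to a tiny local server built from Python's http.server module; HelloHandler, fetch and run_demo are names made up for this sketch, and the local server only stands in for a real web server.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    """Tiny stand-in for a web server (assumption, for local testing only)."""
    def do_GET(self):
        body = b"<h1>Hello World</h1>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet

def fetch(host, port, path):
    # Phase 1: client opens a TCP connection
    with socket.create_connection((host, port)) as sock:
        # Phase 2: client makes the request
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        # Phase 3: server sends a response
        received = []
        while True:
            data = sock.recv(4096)
            if not data:
                # Phase 4: server closed the connection
                break
            received.append(data)
    return b"".join(received)

def run_demo():
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch("127.0.0.1", server.server_port, "/index.html")
    finally:
        server.shutdown()
        server.server_close()

if __name__ == "__main__":
    print(run_demo().decode("iso-8859-1"))
```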



                                         14
         Stateless Protocol
• HTTP is a stateless protocol, which means
  that once a server has delivered the
  requested data to a client, the connection is
  closed, and the server retains no memory of
  what has just taken place
• What are the difficulties in working with a
  stateless protocol?
• How would you implement a site for buying
  some items?
• So why don't we have states in HTTP?
                                                  15
       Format of Request and
            Response
•   An initial line
•   Zero or more header lines
•   A blank line (i.e., a CRLF by itself), and
•   An optional message body (e.g., a file, query
    data, or query output)

    Note: CRLF = “\r\n”
    (ASCII 13 followed by ASCII 10)
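
A minimal sketch of how a message in this format could be taken apart: everything up to the first blank line is the initial line plus headers, and the rest is the body. The function name parse_message is an assumption for this example, and the sketch ignores details such as folded or repeated headers.

```python
def parse_message(raw):
    """Split a raw HTTP message (bytes) into initial line, headers and body."""
    head, _, body = raw.partition(b"\r\n\r\n")   # the blank line ends the headers
    lines = head.decode("iso-8859-1").split("\r\n")
    initial_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # header names are case-insensitive, so store them in lower case
        headers[name.strip().lower()] = value.strip()
    return initial_line, headers, body

raw = (b"HTTP/1.0 200 OK\r\n"
       b"Content-Type: text/html\r\n"
       b"Content-Length: 20\r\n"
       b"\r\n"
       b"<h1>Hello World</h1>")
initial_line, headers, body = parse_message(raw)
print(initial_line, headers, body)
```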
                                                    16
                        Format of Request

method  sp  URL  sp  version  cr lf      (initial line)
header  :   value  cr lf                 (header lines,
  ...                                     zero or more)
header  :   value  cr lf
cr lf                                    (blank line)

Entity Body

                                                17
         Request Example
GET /index.html HTTP/1.1 [CRLF]
Accept: image/gif, image/jpeg [CRLF]
User-Agent: Mozilla/4.0 [CRLF]
Host: www.cs.huji.ac.il:80 [CRLF]
Connection: Keep-Alive [CRLF]
[CRLF]



                                       18
         Request Example
GET /index.html HTTP/1.1            method, request URL, version
Accept: image/gif, image/jpeg       headers
User-Agent: Mozilla/4.0
Host: www.cs.huji.ac.il:80
Connection: Keep-Alive
[blank line here]

                                             19
         Request Methods
• GET returns the contents of the
  indicated document
• HEAD returns the header information
  for the indicated document
  – Useful for finding out info about a resource
    without retrieving it
• POST treats the document as an
  application and sends some data to it

                                               20
            More Methods
• PUT replaces the content of the document
  with some data
• DELETE deletes the indicated document
• TRACE invokes a remote loop-back of the
  request. The final recipient SHOULD reflect
  the message back to the client

• Servers usually do not allow these methods

                                                21
           GET Request
• A request to get a resource from the
  Web
• The most frequently used method
• The request has no message body, but
  parameters can be sent in the request
  URL (i.e., the URL without the host
  part)

                                          22
            HEAD Request
• A HEAD request asks the server to return the
  response headers only, and not the actual
  resource (i.e., no message body)
• This is useful for checking characteristics of a
  resource without actually downloading it, thus
  saving bandwidth
• Used for testing hypertext links for validity,
  accessibility and recent modification

                                                 23
                    Post
• POST request can send data to the
  server
• POST is mostly used in form-filling
  – The data filled into the form are translated
    by the browser into some special format
    and sent to a program on the server using
    the POST command


                                                   24
               Post (cont.)
• There is a block of data sent with the request,
  in the message body
• There are usually extra headers to describe
  this message body, like Content-Type: and
  Content-Length:
• The request URL is a URL of a program to
  handle the sent data, not a file
• The HTTP response is normally the output of
  a program, not a static file

                                                25
             Post Example
• Here's a typical form submission, using
  POST:
POST /path/register.cgi HTTP/1.0
From: frog@cs.huji.ac.il
User-Agent: HTTPTool/1.0
Content-Type: application/x-www-form-urlencoded
Content-Length: 35


home=Ross+109&favorite+flavor=flies
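
The message above can be rebuilt in a few lines, which also shows where the value 35 in Content-Length comes from: it is the number of bytes in the encoded body. The helper name build_post is an assumption for this sketch.

```python
from urllib.parse import urlencode

def build_post(path, fields, extra_headers=()):
    """Build a form-submission request as bytes (hypothetical helper)."""
    body = urlencode(fields).encode("ascii")      # e.g. home=Ross+109&...
    lines = [f"POST {path} HTTP/1.0"]
    lines += [f"{name}: {value}" for name, value in extra_headers]
    lines.append("Content-Type: application/x-www-form-urlencoded")
    lines.append(f"Content-Length: {len(body)}")  # counted in bytes
    head = "\r\n".join(lines) + "\r\n\r\n"        # a single blank line ends the headers
    return head.encode("ascii") + body

message = build_post(
    "/path/register.cgi",
    {"home": "Ross 109", "favorite flavor": "flies"},
    [("From", "frog@cs.huji.ac.il"), ("User-Agent", "HTTPTool/1.0")],
)
print(message.decode("ascii"))
```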

                                                  26
                Headers
• HTTP 1.0 defines 16 headers
  – none are required
• HTTP 1.1 defines 46 headers
  – one header (Host:) is required in requests
    that are sent to Web servers




                                                 27
       Examples of Headers
• If an HTTP message includes a body, there
  are usually header lines in the message that
  describe the body, for example
• Content-Type:
  – gives the MIME-type of the data in the body, such
    as text/html or image/gif
• Content-Length:
  – gives the number of bytes in the body

   Why would we like to use these headers?
                                                    28
   Another Header Example
• Last-Modified:
  – Gives the modification date of the resource
    that is being returned
  – It's used in caching and other bandwidth-
    saving activities
  – Greenwich Mean Time should be used and
    the format is
    Last-Modified: Fri, 31 Dec 1999 23:59:59 GMT
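
A small sketch of producing and reading this date format with Python's standard email.utils module. The helper name http_date is an assumption; it expects a timezone-aware datetime and converts it to GMT first.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def http_date(dt):
    """Format a timezone-aware datetime the way HTTP headers expect."""
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)

moment = datetime(1999, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
header_value = http_date(moment)
print("Last-Modified:", header_value)

# Parsing the header value gives back the same instant
parsed = parsedate_to_datetime(header_value)
```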

                                                   29
             Host Header
• In HTTP 1.1
  – A request that is sent to a Web server must
    include a Host header
  – A request that is sent to a proxy does not
    have to include the Host header
• In HTTP 1.0
  – A request does not have to include the
    Host header
                How do we know which host is meant when
                      there is no Host header?
                                                 30
   Initial Line of a Response
• The initial line of a response is also
  called the status line
• The initial line consists of
  – HTTP version
  – response status code
  – reason phrase that describes the status
    code

                                              31
                          Format of Response

version  sp  status code  sp  phrase  cr lf      (status line)
header   :   value  cr lf                        (header lines,
  ...                                             zero or more)
header   :   value  cr lf
cr lf                                            (blank line)

Entity Body

                                                        32
           Response Example
HTTP/1.0 200 OK
Date: Fri, 31 Dec 1999 23:59:59 GMT
Content-Type: text/html
Content-Length: 1354

<html>
<body>
<h1>Hello World</h1>
(more file contents) . . .
</body>
</html>
                                      33
            Response Example
 HTTP/1.0 200 OK                          version, status code, reason phrase
 Date: Fri, 31 Dec 1999 23:59:59 GMT      headers
 Content-Type: text/html
 Content-Length: 1354

 <html>                                   message body
 <body>
 <h1>Hello World</h1>
 (more file contents) . . .
 </body>
 </html>
                                                34
                 Status Code
• The status code is a three-digit integer,
  and the first digit identifies the general
  category of response:
  – 1xx indicates an informational message
  – 2xx indicates success of some kind
  – 3xx redirects the client to another URL
  – 4xx indicates an error on the client's part
       • Yes, the system blames it on the client if a
         resource is not found (i.e., 404)
  – 5xx indicates an error on the server's part
                                                        35
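
A short sketch of how a client might split a status line and classify the code by its first digit. The names parse_status_line, status_class and STATUS_CLASSES are assumptions for this example.

```python
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def parse_status_line(line):
    """Split 'HTTP/1.1 404 Not Found' into version, code and reason phrase."""
    version, code, reason = line.split(" ", 2)  # the reason phrase may contain spaces
    return version, int(code), reason

def status_class(code):
    """Classify a three-digit status code by its first digit."""
    return STATUS_CLASSES.get(code // 100, "unknown")

version, code, reason = parse_status_line("HTTP/1.1 404 Not Found")
print(code, reason, "->", status_class(code))
```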
          Status Code 1xx
• The 100 (Continue) Status
  – Allows a client to determine if the Server is
    willing to accept the request (based on the
    request headers) before the client sends
    the request body
  – The client's request must have the header
           Expect: 100-continue


  What is this good for?
                                                36
         Status Code 2xx
Status codes 2xx – Success

• The action was successfully received,
  understood, and accepted
• Usually upon success a status code 200
  and a message OK are sent


                                       37
         Status Code 3xx
Status codes 3xx – Redirection

• Further action must be taken in order to
  complete the request
• The client is redirected to get the
  resource from another URL


                                         38
               3xx Codes
• 301 – Moved Permanently
• 302 – Moved Temporarily
  – The Location: header in the response gives
    the correct URL for either 301 or 302
  – Most browsers retry the new location
    automatically
• 304 – Not Modified
  – This is a response to an If-Modified-Since:
    header in the request
                                               39
         Status Code 4xx
Status codes 4xx – Client error
• The request contains bad syntax or
  cannot be fulfilled



  404 File not found

                                       40
               4xx Codes
•   400 – Bad Request (syntax error)
•   401 – Unauthorized
•   403 – Forbidden – “permission denied”
•   404 – Not Found




                                            41
          Status Code 5xx
Status codes 5xx – Server error

• The server failed to fulfill an apparently
  valid request

   For example,
   502 Bad gateway

                                               42
               5xx Codes
• 502 – Bad Gateway
• 503 – Service Unavailable
  – The response may include a Retry-After:
    header to indicate when the client might try
    again




                                               43
        Response Information
• Description of information in the headers:
  –   Server              Type of server
  –   Date                Date and time
  –   Content-Length      Number of bytes
  –   Content-Type        MIME type
  –   Content-Language    English, for example
  –   Content-Encoding    Data compression
  –   Last-Modified       Date when last modified
  –   Expires             Date when file becomes invalid
                                                     44
        Manually Experimenting
             with HTTP
>host www
www.cs.huji.ac.il is a nickname for vafla.cs.huji.ac.il
vafla.cs.huji.ac.il has address 132.65.80.39
vafla.cs.huji.ac.il mail is handled (pri=10) by cs.huji.ac.il

>telnet www.cs.huji.ac.il 80
Trying 132.65.80.39…
Connected to vafla.cs.huji.ac.il.
Escape character is '^]'.
                                                            45
       Sending a Request
>GET /~dbi/index.html HTTP/1.0
[blank line]




                                 46
           The Response
HTTP/1.1 200 OK
Date: Sun, 11 Mar 2001 21:42:15 GMT
Server: Apache/1.3.9 (Unix)
Last-Modified: Sun, 25 Feb 2001 21:42:15 GMT
Content-Length: 479
Content-Type: text/html

<html>
       (html code …)
</html>                                    47
GET /~dbi/index.html HTTP/1.0

        HTTP/1.1 200 OK


            HTML code




                          48
GET /~dbi/no-such-page.html HTTP/1.0

            HTTP/1.1 404 Not Found

                      HTML code




                                49
           GET /index.html HTTP/1.1

          HTTP/1.1 400 Bad Request


                  HTML code


Why is it a Bad Request?

It is an HTTP/1.1 request without a Host header
                               50
       Redirection Process
• Client asks for /foo, which is really a
  directory
• Server guesses that client meant /foo/
  and so it replies with
  – 302 Moved
  – Location: /foo/
• Most browsers retry the new location
  automatically
                                            51
  Advantages of Redirection
• Simple uses: fix clients' naming errors
• Complex uses: server can send client
  dynamically to a different page
  depending on
  – Who they are
  – What server is managing their session, etc.
• Note the changing URL in the browser

                                              52
    Why Is the Location Header
              Needed?
• Client must know of the new URL, so
  that it will convert relative URLs
  correctly
• Suppose there is a relative URL
  “a.html”
  – Client should convert it to “/foo/a.html” and
    not to “/a.html”
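
The effect of the trailing slash on relative URLs can be checked with Python's standard urllib.parse.urljoin. The host name below is only an assumption, since the slide gives just the paths /foo and /foo/.

```python
from urllib.parse import urljoin

# Base URL before and after the redirect (host name assumed for illustration)
before_redirect = "http://www.cs.huji.ac.il/foo"
location_header = "/foo/"

# The client resolves the Location header against the URL it asked for
after_redirect = urljoin(before_redirect, location_header)

# The same relative link resolves differently against the two bases
wrong = urljoin(before_redirect, "a.html")
right = urljoin(after_redirect, "a.html")
print(wrong)
print(right)
```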

                                                53
  New Features in HTTP 1.1
• Persistent connections
• Virtual Hosts
  – That is why the Host header is required




                                              54
 Persistent vs. Non-Persistent
          Connection
• A page that we see on the browser can
  include more than one resource
• The resources are sent from the server
  to the client one after the other
• Sending the resources to the browser
  can be by using a persistent connection
  or by using a non-persistent connection

                                        55
  Non-Persistent Connection
1. Browser opens TCP connection to port 80 of
   server (handshake)
2. Browser sends http request message
3. Server receives request, locates object,
   sends response
4. Server closes TCP connection
5. Browser receives response, parses object
6. Browser repeats steps 1-5 for each
   embedded object
                                            56
      Persistent Connection
1. Browser opens TCP connection to port 80 of
   server (handshake)
2. Browser sends http request message
3. Server receives request, locates object,
   sends response
4. Browser receives response, parses object
5. Browser repeats steps 2-4 for each
   embedded object
6. TCP connection closes on demand or
   timeout
                                            57
   Advantages of Persistent
         Connection
• CPU time saved in routers and hosts
• HTTP requests and responses can be
  pipelined on a connection
• Network congestion is reduced
• Latency on subsequent requests is
  reduced
     What are the disadvantages of
        persistent connection?
                                        58
                  Pipelines
• 2 types of persistent connections
   – without pipelining
     • the client issues a new request only after the
       previous response has arrived
  – with pipelining
     • client sends the request as soon as it
       encounters a reference
     • multiple requests/responses
         – on the same IP packet, or
         – on back-to-back packets

                                                        59
              Virtual Hosts
• With HTTP 1.1, one server at one IP address
  can be multi-homed:
  – “www.cs.huji.ac.il” and “www.math.huji.ac.il” can
    live on the same server
  – These are called virtual hosts
  – Without this mechanism, we have to use 2
    different IP addresses
• It is like several people sharing one phone
• An HTTP request must specify the host name
  (and possibly port) for which the request is
  intended (this is done using the Host header)
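
A minimal sketch of how one server could choose between virtual hosts using the Host header. The document roots in SITES and the function name select_site are made up for this example; real servers configure this mapping in their own way.

```python
# Hypothetical mapping from host name to document root on one machine
SITES = {
    "www.cs.huji.ac.il": "/var/www/cs",
    "www.math.huji.ac.il": "/var/www/math",
}

def select_site(headers, sites=SITES):
    """Pick the document root for a request based on its Host header."""
    host = headers.get("Host")
    if host is None:
        return None                           # an HTTP/1.1 server would answer 400 Bad Request
    hostname = host.split(":", 1)[0].lower()  # drop an optional :port
    return sites.get(hostname)

print(select_site({"Host": "www.cs.huji.ac.il:80"}))
print(select_site({"Host": "www.math.huji.ac.il"}))
```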
                                                    60
       Virtual Hosting (cont.)
• Virtual hosting
   – reduces hardware expenditures
   – extends the ability to support additional servers
   – makes load balancing and capacity planning much
     easier
• Without it
   – each host name requires a unique IP address, and
     we are quickly running out of IP addresses with
     the explosion of new domains

                                                    61
               Caching
Caching improves performance
• Eliminates the need to send requests in
  many cases (reduces network round-
  trips), using an expiration mechanism
• Eliminates the need to send full
  responses in other cases (reduces
  network bandwidth), using a validation
 mechanism
                                        62
For example, how much traffic is reduced if
 it is not required to send the Google icon
            on each search result?
                                              63
                  Client Caching
• Client sends GET /fruit/apple.gif
• Server responds with
  Last-Modified: ...
• Client caches the object
  and its last-modified date
• Client later sends
  GET /fruit/apple.gif …
  If-Modified-Since: …
• Server returns either
      304 Not Modified
      or the object
[Figure: a client with a local cache exchanging these messages with the server.]
                                                 64
              Network Caches
[Figure: several clients send GET /fruit/apple.gif through a shared proxy server, which forwards requests to the origin servers.]
                                                                 65
               Benefit of Caching
[Figure: clients on a 10 Mbps LAN generate 15 requests/sec of 100 Kbits each; the LAN reaches the servers on the Internet through routers (R) over a 1.5 Mbps link; a proxy server on the LAN has a 40% hit rate.]
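
The numbers in the figure can be turned into a small worked calculation. The sketch assumes that every request that misses the proxy cache crosses the 1.5 Mbps link and that 1 Mbit = 1000 Kbits; the function name access_link_load is an assumption for this example.

```python
def access_link_load(req_per_sec, kbits_per_req, hit_rate):
    """Traffic (in Mbps) that still has to cross the access link."""
    offered_mbps = req_per_sec * kbits_per_req / 1000  # total demand of the clients
    return offered_mbps * (1 - hit_rate)               # only cache misses leave the LAN

LINK_MBPS = 1.5

without_proxy = access_link_load(15, 100, hit_rate=0.0)
with_proxy = access_link_load(15, 100, hit_rate=0.4)

print(f"without proxy: {without_proxy:.2f} Mbps, utilization {without_proxy / LINK_MBPS:.0%}")
print(f"with proxy:    {with_proxy:.2f} Mbps, utilization {with_proxy / LINK_MBPS:.0%}")
```

Without the proxy the clients ask for as much as the link can carry, so it is saturated; with a 40% hit rate only about 0.9 Mbps crosses the link, which is about 60% of its capacity.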

                                                         66
           Expiration Model
• Servers may provide an expiration time using
  the Expires header
  – By checking the expiration time, the cache can
    return a fresh response without contacting the
    server
• If the expiration time is not specified, the
  cache can heuristically estimate the
  expiration times (e.g., using header values,
  such as the Last-Modified time)
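
A rough sketch of the freshness check a cache might perform. When no expiration time is given, the sketch estimates the lifetime as 10% of the time between Last-Modified and the response date; this fraction and the function name is_fresh are assumptions for illustration, not something the slides specify.

```python
from datetime import datetime, timedelta, timezone

def is_fresh(now, date, expires=None, last_modified=None, heuristic_fraction=0.1):
    """Decide whether a cached response may be used without contacting the server."""
    if expires is not None:
        return now < expires                  # explicit expiration time
    if last_modified is not None:
        # Heuristic: a resource unchanged for a long time will probably stay unchanged
        lifetime = (date - last_modified) * heuristic_fraction
        return now - date < lifetime
    return False                              # nothing to go on: treat as stale

date = datetime(2001, 3, 11, 21, 42, 15, tzinfo=timezone.utc)
last_modified = datetime(2001, 2, 25, 21, 42, 15, tzinfo=timezone.utc)

# 14 days since modification, so the estimated lifetime is 1.4 days
print(is_fresh(date + timedelta(days=1), date, last_modified=last_modified))
print(is_fresh(date + timedelta(days=2), date, last_modified=last_modified))
```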

                                                     67
       The Risk in Caching
• Response might not be
      “semantically transparent”
  – the response is different from what would
    have been returned by the origin server
• The cache should verify that the copy is
  fresh (i.e., expiration time has not
  passed)
• The copy is stale if it is not fresh
                                                68
               Validators
• A validator is any mechanism that may
  help in determining whether a copy is
  fresh or stale
  – A strong validator is, for example, a
    counter that is incremented whenever the
    resource is changed
  – A weak validator is, for example, a counter
    that is incremented only when a significant
    change is made
           For example, if the only change in the
              site is the number of visitors …
                                                    69
           Using the Cache
• To check whether a copy is fresh, the cache
  must either
  – Use the expiration model, or
  – Compare the Last-Modified time or some
    validator with the origin server
• In the second case, the origin server either
  – Responds with the message 304 (Not
    Modified), or
  – Sends a full response with the entity body

                                                 70
Some Cache-Control Headers
• Cache-control headers specify directives to
  the cache
   – Can be included in either requests or responses
• max-age=[seconds] – max amount of time
  that an object will be considered fresh
• s-maxage=[seconds] – only applies to proxy
  caches
• must-revalidate – tells caches that they must
  obey freshness information
• proxy-revalidate – only applies to proxy
  caches
                                                71
Old Way not to Use the Cache
• The Pragma: no-cache request
  header indicates that the request should
  not be satisfied from a cache
• Same as the no-cache cache directive
• Directive applies to any recipient along
  the request/response chain
Don’t use Pragma: it only applies to requests and exists
just for compatibility with HTTP 1.0
                                                         72
    If-Modified-Since Header
• The If-Modified-Since: header is used
  with a GET request
• If the requested resource has been modified
  since the given date, the server returns the
  resource as it normally would (i.e., header is
  ignored)
• Otherwise, the server returns a
  304 Not Modified response, including the
  Date: header, but with no message body
             HTTP/1.1 304 Not Modified
             Date: Fri, 31 Dec 1999 23:59:59 GMT
             [blank line]
                                                   73
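
The server-side decision can be sketched as a date comparison. The function name conditional_get is an assumption for this example; the dates are parsed with Python's standard email.utils module.

```python
from email.utils import parsedate_to_datetime

def conditional_get(last_modified, if_modified_since, body):
    """Return (status code, message body) for a GET that may carry If-Modified-Since."""
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since)
        if parsedate_to_datetime(last_modified) <= since:
            return 304, b""        # not modified: no message body is sent
    return 200, body               # modified, or no condition given

LAST_MODIFIED = "Sun, 25 Feb 2001 21:42:15 GMT"
page = b"<html>...</html>"

print(conditional_get(LAST_MODIFIED, "Sun, 11 Mar 2001 21:42:15 GMT", page))
print(conditional_get(LAST_MODIFIED, "Fri, 31 Dec 1999 23:59:59 GMT", page))
```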
  If-Unmodified-Since Header
• The If-Unmodified-Since: header can
  be used with any method
• If the requested resource has not been
  modified since the given date, the server
  returns the resource as it normally would
• Otherwise, the server returns a
  412 Precondition Failed response

  HTTP/1.1 412 Precondition Failed
  [blank line]
                                              74
     If-None-Match Header
• ETag is a validator generated by the
  server (i.e., unique identifier)
  – It is part of the HTTP 1.1 specification
• If the ETag given in an If-None-Match
  header matches the current ETag of the
  object, then the object is really the same,
  and is not returned
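
A sketch of validation with ETags. Here the ETag is derived from a hash of the content, which is only one possible choice (an assumption for this example); servers are free to generate the identifier in other ways. The names make_etag and respond are also assumptions.

```python
import hashlib

def make_etag(body):
    """Derive an identifier from the content (hash-based; an assumption)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def respond(body, if_none_match=None):
    """Return (status, headers, body), honoring an If-None-Match header."""
    etag = make_etag(body)
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if etag in candidates:
            return 304, {"ETag": etag}, b""   # same object: do not send it again
    return 200, {"ETag": etag}, body

status, headers, _ = respond(b"version 1")
cached_tag = headers["ETag"]
print(respond(b"version 1", if_none_match=cached_tag)[0])   # resource unchanged
print(respond(b"version 2", if_none_match=cached_tag)[0])   # resource changed
```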

                                               75
Cooperative Caching




                      76
  Cooperative Caching (cont.)
• Higher level cache (e.g., national cache)
  – larger user population
  – higher hit rates
• Multiple Web caches which cooperate
  improve overall performance
• Cooperative caches are usually built from
  clusters
  – divide the traffic overhead
  – improve storage capacity

                                             77
 Cooperative Caching (cont.)
• Which caches should be asked for a
  particular doc?

• Hash routing (of URLs) – an object will
  not be present in more than one cache


                                            78
              Hop by Hop
• HTTP/1.1 introduces the concept of
  hop-by-hop headers:
  – Message headers that apply only to a
    given connection, and not to the entire path
  – It enables much more power with the
    usage of proxies (caches)
  – The headers give information that is
    directed to the proxies on the way to the
    client
                                              79
          Chunked Encoding
• Music, video clips and other multimedia
  content are sent to the client in chunks of data
• Among other problems, there are the difficulties
  that
   – One data chunk varies in size and composition
     from the next
   – The size of the chunks may not be specified in the
     headers and so it may be difficult to recognize the
     end of a chunk
   – There should be a way to deal with “infinite”
     responses in order to deal with data chunks that
     are very big (or with infinite files that are created
     dynamically)
                                                           80
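
HTTP/1.1 addresses these difficulties with the chunked transfer coding: each chunk is preceded by its size in hexadecimal, and a chunk of size zero marks the end, so the total length need not be known in advance. The sketch below is a simplified encoder and decoder (it ignores trailers); the function names are assumptions for this example.

```python
def encode_chunked(chunks):
    """Encode an iterable of byte strings using the chunked transfer coding."""
    out = bytearray()
    for chunk in chunks:
        if chunk:                                   # a zero-length chunk would end the body
            out += f"{len(chunk):X}\r\n".encode("ascii") + chunk + b"\r\n"
    out += b"0\r\n\r\n"                             # last chunk: size 0, then a blank line
    return bytes(out)

def decode_chunked(data):
    """Reassemble the body from chunked data (trailers are not handled)."""
    body = bytearray()
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        size_field = data[pos:line_end].split(b";")[0]   # drop optional chunk extensions
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            break
        body += data[pos:pos + size]
        pos += size + 2                                  # skip the data and its CRLF
    return bytes(body)

encoded = encode_chunked([b"Hello, ", b"World"])
print(encoded)
print(decode_chunked(encoded))
```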
           Compression
• Most image and video formats (GIF, JPEG,
  MPEG) are precompressed
• Many other data types used in the Web
  are not precompressed
• Compression could save almost 40% of
  the bytes sent via HTTP
• There is a need for negotiating the type
  of encoding of the compressed resource
                                         81
        Compression (cont.)
• Client sends the header Accept-Encoding
  – The header indicates the content-encodings that
    the client can handle and the ones that the client
    prefers
• Server sends
  – Content-Encoding header – for end-to-end
    encoding indication
  – Transfer-Encoding header – for hop-by-hop
    encoding indication (supported only in HTTP/1.1)
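
A simplified sketch of the negotiation: the server looks at Accept-Encoding, compresses the body with gzip if the client lists it, and labels the result with Content-Encoding. The function names are assumptions, and the sketch ignores the preference values (q=...) that a client may attach to each encoding.

```python
import gzip

def choose_encoding(accept_encoding, supported=("gzip",)):
    """Pick the first server-supported encoding that the client lists."""
    offered = [item.split(";")[0].strip().lower() for item in accept_encoding.split(",")]
    for coding in supported:
        if coding in offered:
            return coding
    return None

def encode_body(body, accept_encoding):
    """Return (extra headers, body as sent) for the given Accept-Encoding value."""
    if choose_encoding(accept_encoding) == "gzip":
        return {"Content-Encoding": "gzip"}, gzip.compress(body)
    return {}, body                       # send the body unchanged

html = b"<p>Hello World</p>\n" * 200
headers, sent = encode_body(html, "gzip, deflate")
print(headers, len(html), "->", len(sent), "bytes")
```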
                                                         82
          Authentication
• Many sites require users to provide a
  username and password in order to
  access the documents housed on the
  server
• This requirement provides a mechanism
  for keeping track of users (more than
  just a security mechanism)

                                      83
              Authentication
• Client sends ordinary request message
• Server responds with
   – 401 Authorization Required status code
   – WWW-Authenticate header which specifies how
     to perform authentication
• Client resends the request, but this time
  including the Authorization header (e.g., user
  name & password)
• Client continues to add this header for each following
  request to that server
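
The slides do not say which authentication scheme is used. As one concrete possibility, the sketch below assumes the Basic scheme, in which the Authorization header carries “username:password” encoded in base64 (an encoding, not encryption). The function names are assumptions for this example.

```python
import base64

def basic_authorization(username, password):
    """Build the value of an Authorization header for the Basic scheme."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")

def parse_basic_authorization(value):
    """Recover (username, password) on the server side."""
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic credential")
    username, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return username, password

header_value = basic_authorization("snoopy", "passwordofsnoopy")
print("Authorization:", header_value)
print(parse_basic_authorization(header_value))
```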

                                                      84
            Authentication

client -> server (www.cs.huji.ac.il):
          GET ~dbi/getGrade.jsp
server -> client:
          Authorization Required
client -> server:
          GET ~dbi/getGrade.jsp
          Authorization: user snoopy:passwordofsnoopy
server -> client:
          Response


                                                       85
                Cookies
• Cookies are an alternative way to
  identify browsers
• Cookies are essentially small files that
  are saved in the file system of the client
• A cookie can store information on the
  client and thus helps in recognizing the
  client and getting required information
  about the client
        How can cookies solve the problem
        that HTTP is stateless?
                                            86
                  Cookies
• Server response includes the Set-Cookie
  header that has the attributes
  –   name = VALUE
  –   expires = DATE STRING
  –   domain = DOMAIN NAME
  –   path = PATH
  –   secure
• Client returns a cookie only to a server with
  matching URL (the server that put the cookie)
                                               87
                 Cookies
• Example:
  – Client contacts a web site for the first time
  – Server response includes the header:
               Set-cookie : 1678453
  – Client stores the cookie value and the
    server name in a special “cookie file”
  – For each further request for that server, the
    client will add the header
                 Cookie : 1678453
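
The exchange can be sketched with Python's standard http.cookies module. In practice a cookie is a name=value pair, so the sketch gives the value from the example a name, id, which is an assumption made here; the function names are assumptions as well.

```python
from http.cookies import SimpleCookie

def make_set_cookie(name, value, path="/"):
    """Server side: build a Set-Cookie header line for the response."""
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["path"] = path
    return cookie.output()

def read_cookie(cookie_header, name):
    """Server side: read one value from the Cookie header of a later request."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    return cookie[name].value if name in cookie else None

print(make_set_cookie("id", "1678453"))
print(read_cookie("id=1678453", "id"))
```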

                                                    88
             Cookies (cont.)
• Usage:
  – Server requires authentication, but doesn't want to
    hassle a user with a user-name and password
  – Remembering user's preferences for advertising
  – Cookies enable creating a virtual shopping cart
• Problems
  – Users who access the same site from different
    machines
  – Privacy

                                                     89
                     Links
• For specifications and additional
  information:
  –   http://www.w3.org/Protocols/
  –   http://www.w3.org/Protocols/Specs.html
  –   http://www.jmarshall.com/easy/http/
  –   http://wdvl.com/Internet/Protocols/HTTP/article.html


                                                90

				