Internet Programming by mmcsx


									Internet Programming

Dr. S.F. Wu (R323, x706,

Continuous Assessment – 30% (Tests 10%, Assignment 10%, Lab 10%)
Examination – 70%

Reference book: Programming the World Wide Web, R.W. Sebesta, 2nd edition, Addison Wesley
  • ARPAnet - late 1960s and early 1970s
  • Network reliability
  • For ARPA-funded research organizations

  BITnet, CSnet - late 1970s & early 1980s
  • Email and file transfer for other institutions

  NSFnet - 1986
  • Originally for non-DOD funded places
  • Initially connected five supercomputer centers
  • By 1990, it had replaced ARPAnet for non-military uses
  • Soon became the network for all (by 1990)

Internet Programming                                                                        Slide 1
What is Internet?
  A world-wide network of computer networks

  At the lowest level, since 1982, all connections use TCP/IP

  TCP/IP hides the differences among devices connected to the Internet

  Internet Protocol (IP) Addresses
  • Every node has a unique numeric address
  • Form: 32-bit binary number (E.g.
  • New standard, IPv6, has 128 bits (1998)
  • Organizations are assigned groups of IPs for their computers

  Domain names
  • Form: host-name.domain-names (e.g.
  • First domain is the smallest; last is the largest
  • Last domain specifies the type of organization
  • Fully qualified domain name - the host name and all of the domain names
  • DNS servers - convert fully qualified domain names to IPs.
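
As a small illustration of the naming rule above, the sketch below splits a fully qualified domain name into its host name and its domain names. The name `www.example.com` and the helper `split_fqdn` are assumptions chosen for illustration; they are not part of the course material.

```python
def split_fqdn(fqdn):
    """Split a fully qualified domain name into (host, domains).

    The first label is the host name; the remaining labels are the
    domain names, ordered from smallest to largest.
    """
    labels = fqdn.rstrip(".").split(".")
    return labels[0], labels[1:]


host, domains = split_fqdn("www.example.com")
print("host name:     ", host)
print("domain names:  ", domains)
print("largest domain:", domains[-1])
```

Turning such a name into an IP address is the job of a DNS server; in Python the standard call `socket.gethostbyname(name)` performs that lookup, but it needs network access, so it is left out of the sketch.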

World Wide Web
  By the mid-1980s, several different protocols had been invented and were being used on the
  Internet, all with different user interfaces (e.g. telnet, ftp, usenet). They all run on top of
  TCP/IP.
  The WWW was designed as a possible solution to the proliferation of different protocols
  being used on the Internet.
  • Tim Berners-Lee at CERN proposed the Web in 1989
  • Purpose: to allow scientists to have access to many databases of scientific work through
     their own computers
  Document form: hypertext as HTML
  Browser and Web Server


Web Programming
  HTML
  • Text file to describe the general form and layout of documents
  • An HTML document is a mix of content and controls
  • Controls have tags and their attributes
  • Tags often delimit content and specify something about how the content should be arranged
    in the document
  • Attributes provide additional information about the content of a tag
  • Text editor (Notepad) or HTML editor (FrontPage)
  • Cascading Style Sheets (CSS)
  JavaScript
  • A client-side HTML-embedded scripting language
  • Only related to Java through syntax
  • Dynamically typed and not object-oriented
  • Provides a way to access elements of HTML documents and dynamically change them
  Perl
  • Provides server-side computation for HTML documents, through CGI
  • Perl is good for CGI programming because it offers:
       Direct access to operating system functions
       Powerful character string pattern-matching operations
       Access to database systems
  • Perl is highly platform independent, and has been ported to all common platforms
  • Perl is not just for CGI
  PHP
  • A server-side scripting language
  • An alternative to CGI
  • Great for form processing and database access through the Web
  Database Connectivity
  • Dynamic contents
  Socket Programming
  • The guts
  • An Introduction to Sockets
  • Java Connection-Oriented Classes
  • Java Datagram Classes
  • An HTTP Server Application

Operation of the Web
The Web software is designed around a distributed client-server architecture. A Web
client (called a Web browser if it is intended for interactive use) is a program which can send
requests for documents to any Web server. A Web server is a program that, upon receipt of a
request, sends the document requested (or an error message if appropriate) back to the requesting
client. Using a distributed architecture means that a client program may be running on a
completely separate machine from that of the server, possibly in another room or even in
another country. Because the task of document storage is left to the server and the task of
document presentation is left to the client, each program can concentrate on those duties and
progress independently of each other. Because servers usually operate only when documents are
requested, they put a minimal amount of workload on the computers they run on.


Operation of the Web
  Running a Web client, the user selects a hyperlink in a piece of hypertext connecting to
  another document – “The history of computers”, for example.
  The Web client uses the address associated with that hyperlink to connect to the Web server
  at a specified network address and asks for the document associated with “The history of
  computers”.
  The server responds by sending the text and any other media within that text (pictures, sounds,
  or movies) to the client, which the client then renders for presentation on the user's screen.

The language that the Web clients and servers use to communicate with each other is called the
Hypertext Transfer Protocol (HTTP). All Web clients and servers must be able to speak HTTP in
order to send and receive hypermedia documents.
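
The request and response cycle described above can be tried on a single machine. The following is a minimal sketch, assuming Python's standard `http.server` and `http.client` modules stand in for the Web server and the Web client; the handler class `HistoryHandler`, the page content and the path `/history.html` are made up for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body><h1>The history of computers</h1></body></html>"


class HistoryHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that serves one fixed hypertext document."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Server side: port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HistoryHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: connect, request the document, read the response.
connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
connection.request("GET", "/history.html")
response = connection.getresponse()
status = response.status
document = response.read()
connection.close()

server.shutdown()
server.server_close()

print(status)
print(document.decode("ascii"))
```

Here both programs run in one process for convenience; in practice the client and the server would usually run on separate machines, as the text explains.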

A popular web server is Apache.

Open Systems Interconnection (OSI)
  In the early 1980s, manufacturers began to standardize networking so that networks from different
  manufacturers could communicate.
  • International Organization for Standardization (ISO)
  • Institute of Electrical and Electronics Engineers (IEEE)
  Open Systems Interconnection (OSI)
  • A networking model developed by ISO to identify and standardize all the levels of
     communication needed in networking
  A layered model:
  • enables modular engineering
  • reduces complexity
  • accelerates evolution
  • standardizes interfaces
  • supports interoperable technology
  • simplifies learning

OSI Model: Concept

OSI Model: The Seven Layers
  Application Layer
  • Interfaces with application software such as web browsers or web servers.
  Presentation Layer
  • Receives requests for files from the Application layer and presents the requests to the
    Session layer.
  • Reformats, compresses, or encrypts data as necessary.
  Session Layer
  • Establishes and maintains a session between two networked stations or hosts.
  Transport Layer
  • Responsible for error checking.
  • Requests a resend when the data is corrupted.
  • Guarantees successful delivery of data.
  Network Layer
  • Divides a block of data into segments (data packets or datagrams) that are small enough to
    travel over a network.
  • Responsible for routing (finding the best possible route by which to send the data
    packets over a group of networks)
  • Reassembles the packets once they reach their destination
  Data Link Layer
  • Disassembles packets of data into smaller packets as needed to transport over the network.
  • Reassembles the data at the other end.
  Physical Layer
  • Passes data packets onto the cabling media.

Network Device and OSI

  False. IPX belongs to network layer. Error recovery belongs to transport layer.

The focus in the TCP/IP world is on agreeing on a protocol standard which can be made to work
in diverse heterogeneous networks. The focus in the OSI world has always been more on the
standard than the implementation of the standard.

TCP: connection-oriented
UDP: connectionless
How about IP?

TCP (Transmission Control Protocol) is a connection-based protocol that provides a reliable
flow of data between two computers.

When two applications want to communicate with each other reliably, they establish a connection
and send data back and forth over that connection. This is analogous to making a telephone call.
If you want to speak to Aunt Beatrice in Kentucky, a connection is established when you dial
her phone number and she answers. You send data back and forth over the connection by
speaking to one another over the phone lines. Like the phone company, TCP guarantees that
data sent from one end of the connection actually gets to the other end and in the same order it
was sent. Otherwise, an error is reported.

TCP provides a point-to-point channel for applications that require reliable communications.
The Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), and Telnet are all
examples of applications that require a reliable communication channel. The order in which the
data is sent and received over the network is critical to the success of these applications. When
HTTP is used to read from a URL, the data must be received in the order in which it was sent.
Otherwise, you end up with a jumbled HTML file, a corrupt zip file, or some other invalid data.
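
The connection-based behaviour described above can be observed with a short experiment on the local machine. This is a sketch only, assuming Python's standard `socket` module; the echo server `serve_one_client` and the two messages are invented for the example.

```python
import socket
import threading

# The server binds a listening socket; port 0 lets the OS choose a free port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]


def serve_one_client():
    """Accept a single connection and echo back everything received."""
    conn, _addr = listener.accept()
    with conn:
        while True:
            data = conn.recv(1024)
            if not data:          # empty read: the client closed its side
                break
            conn.sendall(data)


worker = threading.Thread(target=serve_one_client)
worker.start()

# The client establishes the connection, then data flows in both directions.
with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
    client.sendall(b"first ")
    client.sendall(b"second")
    client.shutdown(socket.SHUT_WR)   # signal that nothing more will be sent
    received = b""
    while True:
        chunk = client.recv(1024)
        if not chunk:
            break
        received += chunk

worker.join()
listener.close()
print(received)
```

The bytes come back in the order they were sent, which is the ordering guarantee the telephone analogy refers to.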

UDP (User Datagram Protocol) is a protocol that sends independent packets of data, called
datagrams, from one computer to another with no guarantees about arrival. UDP is not
connection-based like TCP.

UDP provides communication between two applications on the network that is not guaranteed.
Sending datagrams is much like sending a letter through the postal service: the order of delivery
is not important and is not guaranteed, and each message is independent of any other.
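
For comparison with TCP, here is a minimal sketch of datagram communication, again assuming Python's standard `socket` module and the loopback interface. The message contents are made up. On a real network either datagram could be lost or arrive out of order, so the sketch sorts the results instead of relying on arrival order.

```python
import socket

# The receiver binds to a port; no connection is ever established.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(5)
address = receiver.getsockname()

# Each sendto() call posts one independent, self-addressed datagram.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"letter one", address)
sender.sendto(b"letter two", address)

messages = []
for _ in range(2):
    data, _source = receiver.recvfrom(1024)
    messages.append(data)

sender.close()
receiver.close()
print(sorted(messages))
```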

For many applications, the guarantee of reliability is critical to the success of the transfer of
information from one end of the connection to the other. However, other forms of
communication don't require such strict standards. In fact, they may be slowed down by the
extra overhead or the reliable connection may invalidate the service altogether.

Many firewalls and routers have been configured not to allow UDP packets. If you're having
trouble connecting to a service outside your firewall, or if clients are having trouble connecting
to your service, ask your system administrator if UDP is permitted.

Generally speaking, a computer has a single physical connection to the network. All data
destined for a particular computer arrives through that connection. However, the data may be
intended for different applications running on the computer. So how does the computer know to
which application to forward the data? Through the use of ports.

Data transmitted over the Internet is accompanied by addressing information that identifies the
computer and the port for which it is destined. The computer is identified by its 32-bit IP
address, which IP uses to deliver data to the right computer on the network. Ports are identified
by a 16-bit number, which TCP and UDP use to deliver the data to the right application.

In connection-based communication such as TCP, a server application binds a socket to a
specific port number. This has the effect of registering the server with the system to receive all
data destined for that port. A client can then meet with the server at the server's port, as
illustrated here:

In connectionless communication such as UDP, the datagram packet contains the port number of
its destination and UDP routes the packet to the appropriate application.

Port numbers range from 0 to 65,535 because ports are represented by 16-bit numbers. The port
numbers ranging from 0 to 1023 are restricted; they are reserved for use by well-known services
such as HTTP (80) and FTP (20 for data, 21 for control) and other system services. These ports
are called well-known ports. Your applications should not attempt to bind to them.
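
The port ranges above can be checked with a few lines of code. This is a sketch under stated assumptions: the helper `describe_port` and its labels are invented for illustration, and the second half relies on the common behaviour that binding to port 0 makes the operating system choose a free port outside the well-known range.

```python
import socket

WELL_KNOWN_MAX = 1023
PORT_MAX = 2**16 - 1   # 65,535, the largest 16-bit value


def describe_port(port):
    """Classify a port number according to the ranges described above."""
    if not 0 <= port <= PORT_MAX:
        raise ValueError(f"port {port} does not fit in 16 bits")
    if port <= WELL_KNOWN_MAX:
        return "well-known"
    return "available to applications"


print(80, describe_port(80))
print(8080, describe_port(8080))

# Binding to port 0 asks the operating system to pick a free port.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.bind(("127.0.0.1", 0))
    assigned = server.getsockname()[1]
print("assigned port:", assigned, describe_port(assigned))
```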

TCP and UDP Datagram

TCP datagram                                 UDP datagram

What can you conclude from these two datagrams?

IP Datagram

The datagram fields are clarified below:
VERS - the IP version number (binary 0100 for version 4; version 6 is also in use). All
nodes must use the same version.
HLEN - header length in 32-bit words. If the number is 6, the header holds 6 x 32-bit words,
i.e. 24 bytes. The minimum size is 20 bytes (5 x 32-bit words); because HLEN is a 4-bit field,
the maximum size is 60 bytes (15 x 32-bit words).
Type of Service - how the datagram should be handled, e.g. delay, precedence, reliability,
minimum cost, throughput. This TOS field is now used by Differentiated Services and is
called the Differentiated Services Code Point (DSCP).

Total Length - is the number of octets that the IP datagram takes up including the header. The
maximum size that an IP datagram can be is 65,535 octets.
Identification - a number assigned to a datagram and copied into each of its fragments, so that
the receiver can tell which fragments belong together when reassembling fragmented datagrams.

Flags - Bit 0 is reserved and is always 0. Bit 1 (Don't Fragment) indicates whether a datagram
can be fragmented (0) or not (1). Bit 2 (More Fragments) indicates to the receiving unit whether
the fragment is the last one in the datagram (0) or if there are still more fragments to come (1).

Frag Offset - specifies, in units of 8 octets (64 bits), where each fragment's data belongs in the
original datagram during reassembly. Fragmentation is needed because different sized Maximum
Transmission Units (MTUs) are used throughout the Internet.

TTL - the time that the datagram is allowed to exist on the network. A router that processes the
packet decrements this by one. Once the value reaches 0, the packet is discarded.

Protocol - the protocol carried in the datagram's data area. UDP uses the number 17, TCP uses 6,
ICMP uses 1, EIGRP uses 88 and OSPF uses 89.

Header Checksum - error control for the header only.

IP Options - this field is for testing, debugging and security.
Padding - padding is sometimes added to make sure that the header ends on a 32-bit boundary,
i.e. that its length is a multiple of 32 bits.
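
To make the field layout concrete, the sketch below builds a 20-byte IPv4 header without options and decodes it again using Python's standard `struct` module. The helper names (`checksum`, `build_header`, `parse_header`), the sample addresses and the default field values are assumptions chosen for illustration; the field widths and the ones-complement checksum follow the IPv4 header format described above.

```python
import socket
import struct

HEADER_FORMAT = "!BBHHHBBH4s4s"   # 20 bytes, network byte order


def checksum(header):
    """Ones-complement sum of the header taken as 16-bit words."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_header(src, dst, payload_len, ttl=64, protocol=17, ident=1):
    """Build an IPv4 header with no options (HLEN = 5 words)."""
    vers_hlen = (4 << 4) | 5
    fields = [vers_hlen, 0, 20 + payload_len, ident, 0, ttl, protocol, 0,
              socket.inet_aton(src), socket.inet_aton(dst)]
    header = struct.pack(HEADER_FORMAT, *fields)
    fields[7] = checksum(header)          # fill in the Header Checksum
    return struct.pack(HEADER_FORMAT, *fields)


def parse_header(header):
    """Decode the fixed 20-byte part of an IPv4 header into a dict."""
    (vers_hlen, tos, total_length, ident, flags_frag, ttl, protocol,
     check, src, dst) = struct.unpack(HEADER_FORMAT, header[:20])
    return {
        "version": vers_hlen >> 4,
        "header_bytes": (vers_hlen & 0x0F) * 4,   # HLEN counts 32-bit words
        "tos": tos,
        "total_length": total_length,
        "identification": ident,
        "dont_fragment": bool(flags_frag & 0x4000),
        "more_fragments": bool(flags_frag & 0x2000),
        "fragment_offset_bytes": (flags_frag & 0x1FFF) * 8,
        "ttl": ttl,
        "protocol": protocol,
        "checksum": check,
        "source": socket.inet_ntoa(src),
        "destination": socket.inet_ntoa(dst),
    }


header = build_header("128.2.7.9", "192.0.2.1", payload_len=8)
info = parse_header(header)
print(info)
# Summing a header that already contains its checksum yields zero.
print("checksum valid:", checksum(header) == 0)
```

The sketch only covers the fixed part of the header; a header with IP Options would have an HLEN greater than 5 and extra bytes after the destination address.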

IP Address
To be able to identify a host on the internet, each host is assigned an address, the IP address, or
Internet Address. When the host is attached to more than one network, it is called multi-homed
and it has one IP address for each network interface. The IP address consists of a pair of
numbers:

IP address = <network number><host number>

IP addresses are 32-bit numbers usually represented in a dotted decimal form (as the decimal
representation of four 8-bit values concatenated with dots). For example, 128.2.7.9 is an IP
address with 128.2 being the network number and 7.9 being the host number. The rules used to
divide an IP address into its network and host parts are explained below.

The binary format of the IP address is:
     10000000 00000010 00000111 00001001
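
The relationship between the binary form and the dotted decimal form can be checked with Python's standard `ipaddress` module. This is only a sketch; the address 128.2.7.9 is the dotted decimal reading of the binary value shown above.

```python
import ipaddress

address = ipaddress.IPv4Address("128.2.7.9")
as_int = int(address)                 # the address as one 32-bit number
octets = address.packed               # four bytes, one per 8-bit value
binary = " ".join(f"{octet:08b}" for octet in octets)

print(as_int)
print(binary)
print(ipaddress.IPv4Address(as_int))  # back to dotted decimal form
```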

IP addresses are used by the IP protocol to uniquely identify a host
on the internet. IP datagrams (the basic data packets exchanged between hosts) are transmitted
by some physical network attached to the host and each IP datagram contains a source IP
address and a destination IP address.

Classes of Address
The first few bits of the IP address specify how the rest of the address should be separated
into its network and host part.
  Class A addresses use 7 bits for the network number giving 126 possible networks (we shall
  see below that out of every group of network and host numbers, two have a special meaning).
  The remaining 24 bits are used for the host number, so each network can have up to
  2^24 - 2 (16,777,214) hosts.
  Class B addresses use 14 bits for the network number, and 16 bits for the host number giving
  16,382 networks each with a maximum of 65,534 hosts.
  Class C addresses use 21 bits for the network number and 8 for the host number giving
  2,097,150 networks each with up to 254 hosts.
  Class D addresses are reserved for multicasting, which is used to address groups of hosts in a
  limited area.
  Class E addresses are reserved for future use.
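
The class rules and the network and host counts quoted above can be reproduced with a little arithmetic. The sketch below is illustrative only: the helper names `address_class` and `usable_count` are invented, and the first-octet thresholds follow from the leading bit patterns of the classful scheme (0, 10, 110, 1110, 1111).

```python
def address_class(dotted):
    """Return the class letter implied by the leading bits of the first octet."""
    first = int(dotted.split(".")[0])
    if first < 128:      # leading bit  0
        return "A"
    if first < 192:      # leading bits 10
        return "B"
    if first < 224:      # leading bits 110
        return "C"
    if first < 240:      # leading bits 1110
        return "D"
    return "E"           # leading bits 1111


def usable_count(bits):
    """Values available in a field of this width, minus the two special ones."""
    return 2**bits - 2


print(address_class("128.2.7.9"))
for name, net_bits, host_bits in (("A", 7, 24), ("B", 14, 16), ("C", 21, 8)):
    print(f"class {name}: {usable_count(net_bits):,} networks,",
          f"{usable_count(host_bits):,} hosts each")
```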

IP Routing
On a LAN, every host sees every packet that is sent by every other host on that LAN. Normally,
it will only do something with that packet if it is addressed to itself, or if the destination is a
broadcast address.
A router is different. A router examines every packet, and compares the destination address with
a table of addresses that it holds in memory. If it finds an exact match, it forwards the packet to
an address associated with that entry in the table. This associated address may be the address of
another network in a point-to-point link, or it may be the address of the next-hop router.
If the router doesn’t find a match, it runs through the table again, this time looking for a match
on just the network ID part of the address. Again, if a match is found, the packet is sent on to the
address associated with that entry.
If a match still isn’t found, the router looks to see if a default next-hop address is present. If so,
the packet is sent there. If no default address is present, the router sends a "host unreachable" or
"network unreachable" message back to the sender. If you see this message, it usually indicates a
router failure at some point in the network.

The difficult part of a router’s job is not how it routes packets, but how it builds up its table. In
the simplest case, the router table is static: it is read in from a file at start-up. This is adequate
for simple networks.
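
The three-step lookup described above (exact match, then network match, then default) can be sketched with a small static table. Everything in the table is hypothetical: the addresses, the next hops and the function name `next_hop` are chosen only to show the order of the checks.

```python
import ipaddress

# Hypothetical static routing table: destination -> next hop.
HOST_ROUTES = {"10.1.2.3": "192.168.0.2"}
NETWORK_ROUTES = {
    "10.1.0.0/16": "192.168.0.3",
    "172.16.0.0/16": "192.168.0.4",
}
DEFAULT_ROUTE = "192.168.0.1"


def next_hop(destination, default=DEFAULT_ROUTE):
    """Return the next-hop address for a destination IP address."""
    # 1. Exact match on the full destination address.
    if destination in HOST_ROUTES:
        return HOST_ROUTES[destination]
    # 2. Match on the network part of the address only.
    address = ipaddress.ip_address(destination)
    for network, hop in NETWORK_ROUTES.items():
        if address in ipaddress.ip_network(network):
            return hop
    # 3. Fall back to the default next hop, if one is configured.
    if default is not None:
        return default
    raise LookupError(f"network unreachable: {destination}")


print(next_hop("10.1.2.3"))    # exact host entry
print(next_hop("10.1.9.9"))    # network entry
print(next_hop("203.0.113.5")) # default route
```

Real routers typically pick the most specific matching network when several entries match; the sketch avoids that question by using networks that do not overlap.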

The traceroute utility can be used to trace the route from a source to a destination. It works by
sending UDP datagrams to an invalid port address at the remote host. Using the default
settings, three datagrams are sent first, each with a Time-To-Live (TTL) field value set to one. The
TTL value of 1 causes the datagram to time out as soon as it hits the first router in the path. The
router then responds with an ICMP Time Exceeded message indicating that the time has expired.
Another three UDP datagrams are sent with TTL set to 2, causing the second router to respond. This
process continues until the packets eventually reach the destination, which reports that the
invalid port is unreachable.
$ ./traceroute
traceroute to (, 30 hops max, 38 byte packets
 1 ( 0.353 ms 0.294 ms 0.237 ms
 2 ( 0.770 ms 0.881 ms 0.740 ms
 3 ( 1.331 ms 1.682 ms 1.694 ms
 4 ( 2.151 ms 2.420 ms 2.582 ms
 5 ( 2.143 ms 1.645 ms 1.668 ms
 6 ( 2.126 ms 1.716 ms 1.529 ms
 7 ( 137.982 ms 137.494 ms 137.559 ms
 8 ( 137.499 ms 138.248 ms 146.431 ms
 9 ( 137.599 ms 137.483 ms 137.659 ms
10 ( 138.606 ms 138.111 ms 139.159 ms
11 ( 145.910 ms 146.847 ms 146.898 ms
12 ( 146.500 ms 146.571 ms 146.741 ms
13 ( 147.135 ms 151.027 ms 146.581 ms
14 ( 146.413 ms 146.493 ms 146.381 ms
15 ( 146.850 ms 146.803 ms 147.090 ms
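
A real traceroute needs raw network access, so the sketch below only simulates the TTL mechanism. The path and its router names are placeholders, not real hosts, and `send_probe` and `trace` are invented helper names.

```python
# Hypothetical path: router names are placeholders, not real hosts.
PATH = ["router-1", "router-2", "router-3", "destination"]


def send_probe(ttl):
    """Simulate one probe datagram; return (responder, message)."""
    for hop in PATH[:-1]:
        ttl -= 1                      # each router decrements the TTL
        if ttl == 0:
            return hop, "time exceeded"
    # The probe survived every router and reached the destination,
    # where nothing listens on the invalid port.
    return PATH[-1], "port unreachable"


def trace(max_hops=30):
    """Raise the TTL one step at a time until the destination answers."""
    route = []
    for ttl in range(1, max_hops + 1):
        responder, message = send_probe(ttl)
        route.append(responder)
        if message == "port unreachable":
            break
    return route


for number, hop in enumerate(trace(), start=1):
    print(number, hop)
```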

HTTP stands for Hypertext Transfer Protocol. It's the network protocol used to deliver virtually
all files and other data (collectively called resources) on the World Wide Web, whether they're
HTML files, image files, query results, or anything else. Usually, HTTP takes place through
TCP/IP sockets.

A browser is an HTTP client because it sends requests to an HTTP server (Web server), which
then sends responses back to the client. The standard (and default) port for HTTP servers to
listen on is 80, though they can use any port.

HTTP is the network protocol of the Web. It is both simple and powerful. Knowing HTTP
enables you to write Web browsers, Web servers, automatic page downloaders, link-checkers,
and other useful tools.

HTTP is used to transmit resources, not just files. A resource is some chunk of information that
can be identified by a URL (it's the R in URL). The most common kind of resource is a file, but
a resource may also be a dynamically-generated query result, the output of a CGI script, a
document that is available in several languages, or something else.

HTTP Transactions
An HTTP client opens a connection and sends a request message to an HTTP server; the server
then returns a response message, usually containing the resource that was requested. After
delivering the response, the server closes the connection (making HTTP a stateless protocol, i.e.
not maintaining any connection information between transactions).

The formats of the request and response messages are similar, and English-oriented. Both kinds
of messages consist of:
   an initial line,
   zero or more header lines,
   a blank line (i.e. a CRLF by itself), and
   an optional message body (e.g. a file, or query data, or query output).
Initial lines and headers should end in CRLF, though you should gracefully handle lines ending
in just LF. (More exactly, CR and LF here mean ASCII values 13 and 10, even though some
platforms may use different characters.)
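
The message structure above (initial line, header lines, blank line, optional body) can be taken apart in a few lines. This is a sketch, not a full HTTP parser: the function name `parse_message` is invented, header names are lowercased on the assumption that they are case-insensitive, and repeated or folded headers are not handled.

```python
def parse_message(raw):
    """Split raw HTTP message bytes into (initial_line, headers, body)."""
    # The blank line separates the head from the body; accept bare LF too.
    head, separator, body = raw.partition(b"\r\n\r\n")
    if not separator:
        head, separator, body = raw.partition(b"\n\n")
    lines = head.decode("iso-8859-1").splitlines()
    initial_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return initial_line, headers, body


BODY = b"<h1>Happy New Millennium!</h1>"
raw = (
    b"HTTP/1.0 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: " + str(len(BODY)).encode("ascii") + b"\r\n"
    b"\r\n" + BODY
)

initial_line, headers, body = parse_message(raw)
print(initial_line)
print(headers)
print(body)
```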

Initial Request Line
The initial line is different for the request than for the response. A request line has three parts,
separated by spaces: a method name, the local path of the requested resource, and the version of
HTTP being used. A typical request line is:
GET /path/to/file/index.html HTTP/1.0
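
As a sketch of how a client might produce such a request line, the code below derives the path from a URL (the part after the host name) and joins the three parts with spaces. The URL `http://www.example.com/...` and the helper names `request_line` and `parse_request_line` are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def request_line(method, url, version="HTTP/1.0"):
    """Build a request line: method, local path, HTTP version."""
    parts = urlsplit(url)
    path = parts.path or "/"          # an empty path means the root
    if parts.query:
        path += "?" + parts.query
    return f"{method.upper()} {path} {version}"


def parse_request_line(line):
    """Split a request line into its three space-separated parts."""
    method, path, version = line.split(" ")
    if not method.isupper() or not version.startswith("HTTP/"):
        raise ValueError(f"malformed request line: {line!r}")
    return method, path, version


line = request_line("get", "http://www.example.com/path/to/file/index.html")
print(line)
print(parse_request_line(line))
```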

GET is the most common HTTP method; it says "give me this resource". Other methods include
POST and HEAD. Method names are always uppercase.
The path is the part of the URL after the host name. The HTTP version always takes the form
"HTTP/x.x", uppercase.

The POST Method
The POST method is used to send data to the server to be processed in some way, like by a CGI
script. A POST request is different from a GET request in the following ways:
   There's a block of data sent with the request, in the message body.
   There are usually extra headers to describe this message body, like Content-Type: and
   Content-Length:.
   The request URL is not a resource to retrieve; it's usually a program to handle the data you're
   sending.
   The HTTP response is normally program output, not a static file.
The most common use of POST is to submit HTML form data to CGI scripts. In this case, the
Content-Type: header is usually application/x-www-form-urlencoded, and the Content-Length:
header gives the length of the URL-encoded form data. The CGI script receives the message
body through STDIN, and decodes it. Here's a typical form submission, using POST:
POST /path/script.cgi HTTP/1.0
User-Agent: HTTPTool/1.0
Content-Type: application/x-www-form-urlencoded
Content-Length: 32

You can use a POST request to send whatever data you want, not just form submissions. Just
make sure the sender and the receiving program agree on the format.
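
The sketch below shows how the Content-Length: value of a form submission can be derived from the URL-encoded body, using Python's standard `urllib.parse.urlencode`. The form fields are hypothetical and are not the ones from the original slide; the request line and headers mirror the example above.

```python
from urllib.parse import urlencode

# Hypothetical form fields, chosen only for illustration.
form = {"name": "Aunt Beatrice", "state": "Kentucky"}
body = urlencode(form).encode("ascii")

lines = [
    "POST /path/script.cgi HTTP/1.0",
    "User-Agent: HTTPTool/1.0",
    "Content-Type: application/x-www-form-urlencoded",
    f"Content-Length: {len(body)}",   # number of bytes in the body
    "",                               # blank line ends the headers
]
request = "\r\n".join(lines).encode("ascii") + b"\r\n" + body

print(request.decode("ascii"))
```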

The Header Lines
Header lines provide information about the request or response, or about the object sent in the
message body.

HTTP 1.0 defines 16 headers, though none are required. HTTP 1.1 defines 46 headers, and one
(Host:) is required in requests. For Net-politeness, consider including these headers in your
requests:

  The From: header gives the email address of whoever's making the request, or running the
  program doing so. (This must be user-configurable, for privacy concerns.)

  The User-Agent: header identifies the program that's making the request, in the form
  "Program-name/x.xx", where x.xx is the (mostly) alphanumeric version of the program. For
  example, Netscape 3.0 sends the header "User-agent: Mozilla/3.0Gold".

These headers help webmasters troubleshoot problems. They also reveal information about the
user. When you decide which headers to include, you must balance the webmasters' logging
needs against your users' needs for privacy.

The Message Body
An HTTP message may have a body of data sent after the header lines. In a response, this is
where the requested resource is returned to the client (the most common use of the message
body), or perhaps explanatory text if there's an error. In a request, this is where user-entered data
or uploaded files are sent to the server.

If an HTTP message includes a body, there are usually header lines in the message that describe
the body. In particular:

  The Content-Type: header gives the MIME-type of the data in the body, such as text/html or
  image/gif.
  The Content-Length: header gives the number of bytes in the body.
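
A short sketch of how the two headers might be computed, assuming Python's standard `mimetypes` module is used to guess the MIME type from a file name. The file name and the body text are invented. Note that Content-Length: counts bytes, which can differ from the number of characters once the text is encoded.

```python
import mimetypes

# Guess the MIME type from a (hypothetical) file name.
content_type, _encoding = mimetypes.guess_type("index.html")

text = "<p>café</p>"
body = text.encode("utf-8")
content_length = len(body)            # bytes, not characters

print("Content-Type:", content_type)
print("Content-Length:", content_length)
print("characters:", len(text))
```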

Initial Response Line (Status Line)
The initial response line, called the status line, also has three parts separated by spaces: the
HTTP version, a response status code that gives the result of the request, and an English reason
phrase describing the status code. Typical status lines are:
HTTP/1.0 200 OK
HTTP/1.0 404 Not Found

The most common status codes are:

200 OK
The request succeeded, and the resulting resource (e.g. file or script output) is returned in the
message body.

404 Not Found
The requested resource doesn't exist.

500 Server Error
An unexpected server error. The most common cause is a server-side script that has bad syntax,
fails, or otherwise can't run correctly.
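
The three parts of a status line can be separated with a single split, as sketched below. The helper `parse_status_line` is an invented name; `http.HTTPStatus` is Python's standard table of status codes and is used here only to look up the conventional reason phrases.

```python
from http import HTTPStatus


def parse_status_line(line):
    """Split a status line into (version, numeric code, reason phrase)."""
    version, code, reason = line.split(" ", 2)   # the reason may contain spaces
    return version, int(code), reason


for line in ("HTTP/1.0 200 OK", "HTTP/1.0 404 Not Found"):
    version, code, reason = parse_status_line(line)
    print(version, code, reason, "| standard phrase:", HTTPStatus(code).phrase)
```

Because the reason phrase is meant for human readers, a client should base its decisions on the numeric code.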

Sample HTTP Exchange
GET /path/file.html HTTP/1.0
User-Agent: HTTPTool/1.0
[blank line here]

The server should respond with something like the following,

HTTP/1.0 200 OK
Date: Fri, 31 Dec 1999 23:59:59 GMT
Content-Type: text/html
Content-Length: 1354

<h1>Happy New Millennium!</h1>
(more file contents)

After sending the response, the server closes the connection.
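
The exchange above can be reproduced by writing the request bytes directly to a TCP socket. The sketch below is an assumption-laden demo: it starts a throwaway server from Python's standard `http.server` module on the loopback interface, with an invented handler class `FileHandler` and a made-up body, and then reads until the server closes the connection.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"<h1>Happy New Millennium!</h1>\n"


class FileHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with the same body."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), FileHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The request: initial line, one header line, then a blank line.
request = b"GET /path/file.html HTTP/1.0\r\nUser-Agent: HTTPTool/1.0\r\n\r\n"
with socket.create_connection(server.server_address, timeout=5) as sock:
    sock.sendall(request)
    response = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:          # the server closed the connection
            break
        response += chunk

server.shutdown()
server.server_close()

head, _, body = response.partition(b"\r\n\r\n")
status_line = head.split(b"\r\n")[0].decode("ascii")
print(status_line)
print(body.decode("ascii"))
```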
