Computer Networking - Kurose, Ross

									 Table of Contents

                                    Computer Networking
                A Top-Down Approach Featuring the Internet
                                                 James F. Kurose and Keith W. Ross

Link to the Addison-Wesley WWW site for this book
Link to overheads for this book

Online Forum Discussion About This Book - with Voice!

   1. Computer Networks and the Internet
         1. What is the Internet?
         2. What is a Protocol?
         3. The Network Edge
         4. The Network Core
                 - Interactive Programs for Tracing Routes in the Internet
                 - Java Applet: Message Switching and Packet Switching
         5. Access Networks and Physical Media
         6. Delay and Loss in Packet-Switched Networks
         7. Protocol Layers and Their Service Models
         8. Internet Backbones, NAPs and ISPs
         9. A Brief History of Computer Networking and the Internet
        10. ATM
        11. Summary
        12. Homework Problems and Discussion Questions
   2. Application Layer
         1. Principles of Application-Layer Protocols
         2. The World Wide Web: HTTP
         3. File Transfer: FTP
         4. Electronic Mail in the Internet
        5. The Internet's Directory Service: DNS
                - Interactive Programs for Exploring DNS
        6. Socket Programming with TCP
        7. Socket Programming with UDP
        8. Building a Simple Web Server
        9. Summary
       10. Homework Problems and Discussion Questions
  3. Transport Layer
        1. Transport-Layer Services and Principles
        2. Multiplexing and Demultiplexing Applications
        3. Connectionless Transport: UDP
        4. Principles of Reliable Data Transfer
                - Java Applet: Flow Control in Action
        5. Connection-Oriented Transport: TCP
        6. Principles of Congestion Control
        7. TCP Congestion Control
        8. Summary
        9. Homework Problems and Discussion Questions
  4. Network Layer and Routing
        1. Introduction and Network Service Model
        2. Routing Principles
        3. Hierarchical Routing
        4. Internet Protocol
                - Java Applet: IP Fragmentation
        5. Routing in the Internet
        6. What is Inside a Router?
        7. IPv6
        8. Multicast Routing
        9. Summary
       10. Homework Problems and Discussion Questions
  5. Link Layer and Local Area Networks
        1. The Data Link Layer: Introduction, Services
        2. Error Detection and Correction
        3. Multiple Access Protocols and LANs
        4. LAN Addresses and ARP
        5. Ethernet
                - CSMA/CD Applet
        6. Hubs, Bridges and Switches
        7. Wireless LANs: IEEE 802.11
        8. The Point-to-Point Protocol
        9. ATM
       10. X.25 and Frame Relay
       11. Summary
       12. Homework Problems and Discussion Questions
   6. Multimedia Networking
         1. Multimedia Networking Applications
         2. Streaming Stored Audio and Video
         3. Making the Best of the Best-Effort Service: An Internet Phone Example
         4. RTP
         5. Beyond Best Effort
         6. Scheduling and Policing Mechanisms for Providing QoS Guarantees
         7. Integrated Services
         8. RSVP
         9. Differentiated Services
        10. Summary
        11. Homework Problems and Discussion Questions
   7. Security in Computer Networks
         1. What is Network Security?
         2. Principles of Cryptography
         3. Authentication: Who are You?
         4. Integrity
         5. Key Distribution and Certification
         6. Secure E-Mail
         7. Internet Commerce
         8. Network-Layer Security: IPsec
                 - 1999 Panel Discussion on Internet Security
         9. Summary
        10. Homework Problems and Discussion Questions
   8. Network Management
         1. What is Network Management?
         2. The Infrastructure for Network Management
         3. The Internet Network Management Framework
         4. ASN.1
         5. Firewalls
         6. Summary
         7. Homework Problems and Discussion Questions


    -   Lab: Building a multi-threaded Web server in Java
    -   Lab: Building a mail user agent in Java
    -   Lab: Implementing a reliable transport protocol
    -   Lab: Implementing a distributed, asynchronous distance vector routing algorithm

Some relevant online audio material:
Unix Network Programming, Jim Kurose

Introduction to Computer Networks, Jim Kurose

Internet Protocols, Keith Ross

Distribution of Stored Information in the Web, Keith Ross

Asynchronous learning links:
The Web of Asynchronous Learning Networks

Copyright 1996-2000 James F. Kurose and Keith W. Ross
Computer Networking

                                        Computer Networking: A
                                        Top-Down Approach
                                        Featuring the Internet

                                       Instructor and student resources for this book are
                                         available at

                                                         Jim Kurose
                                                         Department of Computer Science
                                                         University of Massachusetts
                                                         Amherst MA 01003 USA
                                                         ph: 413-545-2742, FAX: 413-545-1249

Jim Kurose received a B.A. degree in physics from Wesleyan University in 1978 and his M.S. and
Ph.D. degrees in computer science from Columbia University in 1980 and 1984, respectively. He is
currently a Professor of Computer Science at the University of Massachusetts, where he is also co-
director of the Networking Research Laboratory of the Multimedia Systems Laboratory. He is currently
serving a term as Chairman of the Department of Computer Science. Professor Kurose was a Visiting
Scientist at IBM Research during the 1990/91 academic year, and at INRIA and at EURECOM, both in
Sophia Antipolis, France, during the 1997/98 academic year.

His research interests include real-time and multimedia communication, network and operating system
support for servers, and modeling and performance evaluation. Dr. Kurose is the past Editor-in-Chief of
the IEEE Transactions on Communications and of the IEEE/ACM Transactions on Networking. He has
been active in the program committees for IEEE Infocom, ACM SIGCOMM, and ACM SIGMETRICS
conferences for a number of years.

He is the six-time recipient of the Outstanding Teacher Award from the National Technological
University (NTU), the recipient of the Outstanding Teacher Award from the College of Natural Science and
Mathematics at the University of Massachusetts, and the recipient of the 1996 Outstanding
Teaching Award of the Northeast Association of Graduate Schools. He has been the recipient of a GE
Fellowship, IBM Faculty Development Award, and a Lilly Teaching Fellowship. He is a Fellow of the
IEEE, and a member of ACM, Phi Beta Kappa, Eta Kappa Nu, and Sigma Xi.

He is currently working on an on-line introductory networking textbook, "Computer Networking, a top
down approach featuring the Internet," with Keith Ross. The book is available on-line, and is to be
published by Addison-Wesley Longman in 2000.

Sept. 1999
 Keith Ross

Keith W. Ross

Professor Keith ROSS
Dept. of Multimedia Communications
Institute Eurécom
06904 Sophia Antipolis

Telephone: +33 (0)4 93 00 26 97 (from US dial 011-33-4-93-00-26-97)
Fax: +33 (0)4 93 00 26 27

New! Wimba Voice Forums and Voice E-mail

Try out our new voice forum at . It is Java-based, so there is nothing
to install. You can also use Wimba to send streaming voice e-mails to anyone.

Brief Biography

Keith Ross received his Ph.D. from the University of Michigan in 1985 (Program in Computer,
Information and Control Engineering). He was a professor at the University of Pennsylvania
from 1985 through 1997. At the University of Pennsylvania, his primary appointment was in
the Department of Systems Engineering and his secondary appointment was in the Wharton
School. He joined the Multimedia Communications Dept. at Institut Eurecom in January 1998,
and became department chairman in October 1998. In Fall 1999, while remaining a professor
at Institut Eurecom, he co-founded and became CEO of

Keith Ross has published over 50 papers and written two books. He has served on editorial
boards of five major journals, and has served on the program committees of major networking
conferences, including Infocom and Sigcomm. He has supervised more than ten Ph.D. theses.

His research and teaching interests include multimedia networking, asynchronous learning,
Web caching, streaming audio and video, and traffic modeling.

Along with Jim Kurose, he recently completed the preliminary edition of "Computer Networking:
A Top-Down Approach Featuring the Internet," a textbook published by Addison-Wesley. The
final edition and interactive Web edition will be available in August 2000.

Multimedia Networking Group

Our research group studies Web caching, streaming of stored audio and video over the Internet,
multimedia asynchronous messaging, and QoS traffic modeling.

Recent publications


Computer Networking: A Top-Down Approach Featuring the Internet, James F. Kurose
and Keith W. Ross.

Multiservice Loss Networks for Broadband Telecommunication Networks, Keith W. Ross.

Fall 99 Courses at Eurecom

       Multimedia Networking Part I (a.k.a. High-Speed Networking)
       Multimedia Networking Part II

Online Presentations

      Distribution of Stored Information in the Web: An in-depth tutorial on Web caching. Includes synchronized
RealAudio served from Eurécom.

       Multimedia Networking: Short course, including material on CBR/VBR video encoding, residential access
technologies, near video on demand, statistical multiplexing and prefetching of prerecorded video, smoothing of
prerecorded video, and modeling the disk subsystem in video servers.

      Audio and Video in the Internet: Extended lecture covering multimedia streaming, Internet phone, Internet QoS,

       Internet Protocols: Lectures on demand covering introductory material on Internet protocols. Includes
synchronized audio served from UPenn.

                Preface and Acknowledgments
Welcome to our online textbook, Computer Networking: A Top-Down Approach. We (Jim Kurose,
Keith Ross, and Addison-Wesley-Longman) think you will find this textbook to be very different from
the other computer networking books that are currently available. Perhaps the most distinctive and
innovative feature of this textbook is that it is online and accessible through a Web browser. We
believe that our online format has several things going for it. First, an online text can be accessed from
any browser in the world, so a student (or any other reader) can gain access to the book at any time from
any place. Second, as all of us Internet enthusiasts know, much of the best material describing the
intricacies of the Internet is in the Internet itself. Our hyperlinks, embedded in a coherent context,
provide the reader direct access to some of the best sites relating to computer networks and Internet
protocols. The links point not only to RFCs but also to sites that are more pedagogic in nature,
including home-brewed pages on particular aspects of Internet technology and articles appearing in
online trade magazines. Being online also allows us to include many interactive features, including direct
access to the Traceroute program, direct access to search engines for Internet Drafts, Java applets that
animate difficult concepts, and (in the near future) direct access to streaming audio. Being online enables
us to use more fonts and colors (both within the text and in diagrams), making the text both perky and
cheerful. Finally, an online format will allow us to frequently release new editions (say, every year),
which will enable the text to keep pace with this rapidly changing field.

Another unusual feature of the text is its Internet focus. Most of the existing textbooks begin with a
broader perspective and address the Internet as just one of many computer network technologies. We
instead put Internet protocols in the spotlight, and use the Internet protocols as motivation for studying
some of the more fundamental computer networking concepts. But why put the Internet in the spotlight?
Why not some other networking technology such as ATM? Most computer networking students have already
had significant "hands on" experience with the Internet (e.g., surfing the Web and sending e-mail at
the very least) before taking a course on computer networks. We have found that modern-day students in
computer science and electrical engineering, being intensive users of the Internet, are enormously curious
about what is under the hood of the Internet. Thus, it is easy to get students excited about computer
networking when using the Internet as your guiding vehicle. A second reason for the Internet focus is
that in recent years computer networking has become synonymous with the Internet. This wasn't the case
five-to-ten years ago, when there was a lot of talk about ATM LANs and applications directly interfacing
with ATM (without passing through TCP/IP). But we have now reached the point where just about all
data traffic is carried over the Internet or intranets. Furthermore, streaming audio and video have recently
become commonplace in the Internet, and someday telephony may be too. Because our book has an
Internet focus, it is organized around a five-layer Internet architecture rather than around the more
traditional seven-layer OSI architecture.

Another distinctive feature of this book is that its content is organized top-down. As we
mentioned above, this text, like almost all computer networking textbooks, uses a layered architectural model
to organize the content. However, unlike other texts, this text begins at the application layer and works
its way down the protocol stack. The rationale behind this top-down organization is that once one
understands the applications, one can then understand the network services needed to support these
applications. One can then, in turn, examine the various ways in which such services might be
provided/implemented by a network architecture. Covering applications early thus provides motivation
for the remainder of the text.

An early emphasis on application-layer issues differs from the approaches taken in most other texts,
which have only a small (or nonexistent) amount of material on network applications, their requirements,
application-layer paradigms (e.g., client/server), and the application programming interfaces (e.g.,
sockets). Studying application-layer protocols first allows students to develop an intuitive feel for what
protocols are (the role of message exchange and the actions taken on events) in the context of network
applications (e.g., the Web, FTP and e-mail) which they use daily. Furthermore, the inclusion of a
significant amount of material at the application layer reflects our own belief that there has been, and will
continue to be, a significant growth in emphasis (in the research community, and in industry) in the
higher levels of network architecture. These higher layers -- as exemplified by the Web as an application
layer protocol -- are the true "growth area" in computer networking.

This textbook also contains material on application programming development - material not covered
in depth by any introductory computer networks textbook. (While there are books devoted to network
programming, e.g., the texts by Stevens, they are not introductory networking textbooks.) There are
several compelling reasons for including this material. First, anyone wanting to write a network
application must know about socket programming - the material is thus of great practical interest.
Second, early exposure to socket programming is valuable for pedagogical reasons as well - it allows
students to write actual network application-level programs and gain first-hand experience with many of
the issues involved in having multiple geographically distributed processes communicate. We present
the material on application programming in a Java context rather than a C context, because socket
programming in Java is simpler, and allows students to quickly see the forest for the trees.

It has been said that computer networking textbooks are even more boring than accounting texts.
Certainly, one seed of truth in the statement is that many books are simply a compendium of facts about a
myriad of computer networking technologies and protocols, such as packet formats or service interfaces
(and given the wealth of protocol standards, there is no shortage of such facts!). What is missing in such
accounting-like textbooks is an identification of the important, underlying issues that must be solved by a
network architecture, and a methodical study of the various approaches taken towards addressing these
issues. Many texts focus on what a network does, rather than why. Addressing the principles, rather than
just the dry standards material, can make a textbook more interesting and accessible. (A sense of humor,
use of analogies, and real-world examples also help.)

The field of networking is now mature enough that a number of fundamentally important issues can be
identified. For example, in the transport layer, the fundamental issues include reliable communication
over an unreliable channel, connection establishment/teardown and handshaking, congestion and flow
control, and multiplexing. In the network layer, two fundamentally important issues are how to find
"good" paths between two routers, and how to deal with large, heterogeneous systems. In the data link
layer, a fundamental problem is how to share a multiple access channel. This text identifies fundamental
networking issues as well as approaches towards addressing these issues. We believe that the
combination of using the Internet to get the student's foot in the door and then emphasizing the issues and
solution approaches will allow the student to quickly understand just about any networking technology.
For example, reliable data transfer is a fundamental issue in both the transport and data link layers.
Various mechanisms (e.g., error detection, use of timeouts and retransmissions, positive and negative
acknowledgments, and forward error correction) have been designed to provide reliable data transfer
service. Once one understands these approaches, the data transfer aspects of protocols like TCP and
various reliable multicast protocols can be seen as case studies illustrating these mechanisms.

How an Instructor Can Use this Online Book
This online book can be used as the textbook for a course on computer networking just like any other
textbook. The instructor can assign readings and homework problems, and base lectures on the material
within the text. However, the textbook is also ideally suited for asynchronous online courses. Such
courses are particularly appealing to students who commute to school or have difficulty scheduling
classes due to course time conflicts. The authors already have significant experience in leading
asynchronous online courses, using an earlier draft of this online text. They have found that one
successful asynchronous format is to have students do weekly asynchronous readings (and listenings!)
and to have students participate in weekly newsgroup discussions about the readings. Students can have a
virtual presence by sharing the URLs of their personal Web pages with the rest of the class. Students
can even collaborate on joint projects, such as research papers and network application development,
asynchronously over the Internet. Readers are encouraged to visit the following sites which are devoted
to asynchronous online education:

The Web of Asynchronous Learning Networks

Journal of Asynchronous Learning Networks

Asynchronous Learning Networks Magazine

Lots of people have given us invaluable help on this project since it began in 1996. For now, we simply
say "Thanks!" and list some of the names alphabetically.

Paul Amer
Daniel Brushteyn
John Daigle
Wu-chi Feng
Albert Huang
Jussi Kangasharju
Hyojin Kim
Roberta Lewis
William Liang
Willis Marti
Deep Medhi
George Polyzos
Martin Reisslein
Despina Saparilla
Subin Shrestra
David Turner
Ellen Zegura
Shuchun Zhang

and all the UPenn, UMass and Eurecom students who have suffered through earlier drafts!

(List is incomplete. Will be adding names shortly.)
 Instructor Overheads: Computer Networking: A Top Down Approach Featuring the Internet

                                      Computer Networking
                A Top-Down Approach Featuring the Internet
                                                    James F. Kurose and Keith W. Ross

                                        Instructor Overheads
You'll find links below to overheads (powerpoint files, compressed postscript and PDF format) for the textbook,
Computer Networking: A Top-Down Approach Featuring the Internet, by Jim Kurose and Keith Ross, published by
Addison Wesley Longman. If you want to find out more about the book, you can check out the on-line version of the
text at or at The
publisher's WWW site for the book is

Note that the overheads below are being made available in powerpoint format (as well as postscript and pdf, shortly)
so that instructors can modify the overheads to suit their own teaching needs. While we hope that many instructors
will make use of the overheads (regardless of whether or not our text is used for the course), we ask that you use the
overheads for educational purposes only. Please respect the intellectual property represented in the overheads and do
not use them for your own direct commercial benefit.

Questions or comments to Jim Kurose or Keith Ross

Chapter 1: Computer Networks and the Internet

    -   chapter1a.ppt (Part 1, powerpoint format, 1.178M, last update: 21-Dec-99)
    -   chapter1b.ppt (Part 2, powerpoint format, 215K, last update: 21-Dec-99)

Chapter 2: The Application Layer

    -   chapter2a.ppt (Part 1, powerpoint format, 568K, last update: 21-Dec-99)
    -   chapter2b.ppt (Part 2, powerpoint format, 276K, last update: 21-Dec-99)

Chapter 3: The Transport Layer

    -   chapter3a.ppt (Part 1, powerpoint format, 1.201M, last update: 28-Dec-99)
    -   chapter3b.ppt (Part 2, powerpoint format, 640K, last update: 2-Jan-00)

Chapter 4: The Network layer and Routing

    -   chapter4a.ppt (Part 1, powerpoint format, 951K, last update: 25-Feb-00)
    -   chapter4b.ppt (Part 2, up through section 4.4, last update: 25-Feb-00)

The following slides (and those for chapters 5 and 6) are courtesy of Mario Gerla and Medy Sanadidi, UCLA. They
taught a networking course based on our text last Fall and developed the overheads below. They were kind enough to
allow us to post their overheads here. A big "thanks" to both of them!

    -   chapter4c_ucla.ppt (powerpoint, 250K)
    -   chapter4d_ucla.ppt (powerpoint, 258K)

Chapter 5: The Link Layer and Local Area Networks

    -   chapter5a_ucla.ppt (powerpoint, 641K)
    -   chapter5b_ucla.ppt (powerpoint, 256K)
    -   chapter5c_ucla.ppt (powerpoint, 653K)
    -   chapter5d_ucla.ppt (powerpoint, 777K)

Chapter 6: Multimedia Networking

    -   chapter6a_ucla.ppt (powerpoint, 410K)
    -   chapter6b_ucla.ppt (powerpoint, 704K)

Chapter 7: Security in Computer Networks
Chapter 8: Network Management

More Overheads are being added daily ......

Copyright 1996-2000 James F. Kurose and Keith W. Ross
  What is the Internet?

                                     1.1 What is the Internet?
In this book we use the public Internet, a specific computer network (and one that most readers have probably used), as our
principal vehicle for discussing computer networking protocols. But what is the Internet? We would like to give you a one-sentence
definition of the Internet, a definition that you can take home and share with your family and friends. Alas, the Internet is very
complex, both in terms of its hardware and software components and in terms of the services it provides.

A Nuts and Bolts Description

Instead of giving a one-sentence definition, let's try a more descriptive approach. There are a couple of ways to do this. One way is
to describe the nuts and bolts of the Internet, that is, the basic hardware and software components that make up the Internet. Another
way is to describe the Internet in terms of a networking infrastructure that provides services to distributed applications. Let's begin
with the nuts-and-bolts description, using Figure 1.1-1 to illustrate our discussion.

                                               Figure 1.1-1: Some "pieces" of the Internet

    -   The public Internet is a world-wide computer network, i.e., a network that interconnects millions of computing devices
        throughout the world. Most of these computing devices are traditional desktop PCs, Unix-based workstations, and so-called
        "servers" that store and transmit information such as WWW pages and e-mail messages. Increasingly, non-traditional
        computing devices such as Web TVs, mobile computers, pagers and toasters are being connected to the Internet. (Toasters are
        not the only rather unusual devices to have been hooked up to the Internet; see The Future of the Living Room.) In
        Internet jargon, all of these devices are called hosts or end systems. The Internet applications with which many of us are
        familiar, such as the WWW and e-mail, are network application programs that run on such end systems. We will look into
        Internet end systems in more detail in section 1.3 and then delve deeply into the study of network applications in Chapter 2.

    -   End systems, as well as most other "pieces" of the Internet, run protocols that control the sending and receiving of
        information within the Internet. TCP (the Transmission Control Protocol) and IP (the Internet Protocol) are two of the most
        important protocols in the Internet. The Internet's principal protocols are collectively known as the TCP/IP protocols. We begin
        looking into protocols in section 1.2. But that's just a start -- much of this entire book is concerned with computer network
        protocols.

    -   End systems are connected together by communication links. We'll see in section 1.5 that there are many types of
        communication links. Links are made up of different types of physical media: coaxial cable, copper wire, fiber optics, and
        radio spectrum. Different links can transmit data at different rates. The link transmission rate is often called the link
        bandwidth, and is typically measured in bits/second.

    -   Usually, end systems are not directly attached to each other via a single communication link. Instead, they are indirectly
        connected to each other through intermediate switching devices known as routers. A router takes information arriving on
        one of its incoming communication links and then forwards that information on one of its outgoing communication links.
        The IP protocol specifies the format of the information that is sent and received among routers and end systems. The path
        that transmitted information takes from the sending end system, through a series of communication links and routers, to the
        receiving end system is known as a route or path through the network. We introduce routing in more detail in section 1.4,
        and study the algorithms used to determine routes, as well as the internal structure of a router itself, in Chapter 4.

    -   Rather than provide a dedicated path between communicating end systems, the Internet uses a technique known as packet
        switching that allows multiple communicating end systems to share a path, or parts of a path, at the same time. We will see
        that packet switching can often use a link more "efficiently" than circuit switching (where each pair of communicating end
        systems gets a dedicated path). The earliest ancestors of the Internet were the first packet-switched networks; today's public
        Internet is the grande dame of all existing packet-switched networks.

    -   The Internet is really a network of networks. That is, the Internet is an interconnected set of privately and publicly owned
        and managed networks. Any network connected to the Internet must run the IP protocol and conform to certain naming and
        addressing conventions. Other than these few constraints, however, a network operator can configure and run its network
        (i.e., its little "piece" of the Internet) however it chooses. Because of the universal use of the IP protocol in the Internet, the
        IP protocol is sometimes referred to as the Internet dial tone.

    -   The topology of the Internet, i.e., the structure of the interconnection among the various pieces of the Internet, is loosely
        hierarchical. Roughly speaking, from bottom to top, the hierarchy consists of end systems connected to local Internet
        Service Providers (ISPs) through access networks. An access network may be a so-called local area network within a
        company or university, a dial-up telephone line with a modem, or a high-speed cable-based or phone-based access network.
        Local ISPs are in turn connected to regional ISPs, which are in turn connected to national and international ISPs. The
        national and international ISPs are connected together at the highest tier in the hierarchy. New tiers and branches (i.e., new
        networks, and new networks of networks) can be added just as a new piece of Lego can be attached to an existing Lego
        construction. In the first half of 1996, approximately 40,000 new network addresses were added to the Internet [Network
        1996] - an astounding growth rate.

    -   At the technical and developmental level, the Internet is made possible through the creation, testing and implementation of
        Internet Standards. These standards are developed by the Internet Engineering Task Force (IETF). The IETF standards
        documents are called RFCs (requests for comments). RFCs started out as general requests for comments (hence the name) to
        resolve architecture problems which faced the precursor to the Internet. RFCs, though not formally standards, have evolved
        to the point where they are cited as such. RFCs tend to be quite technical and detailed. They define protocols such as TCP,
        IP, HTTP (for the Web) and SMTP (for open-standards e-mail). There are more than 2000 different RFCs.
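
As a rough illustration of the router bullet above, the following sketch forwards a packet hop by hop until it reaches the receiving end system. It is a toy model under stated assumptions, not a description of how real routers are built: the topology (hosts A and B, routers R1 and R2), the table contents, and the names FORWARDING_TABLES, FIRST_HOP and route are all hypothetical, and each router simply looks up the destination in a fixed table.

```python
# Hypothetical topology (an assumption for illustration): A -- R1 -- R2 -- B.
# Each router's forwarding table maps a destination host to the next hop.
FORWARDING_TABLES = {
    "R1": {"A": "A", "B": "R2"},
    "R2": {"A": "R1", "B": "B"},
}

# Each end system hands its packets to the router it is attached to.
FIRST_HOP = {"A": "R1", "B": "R2"}


def route(source, destination):
    """Return the list of nodes a packet visits from source to destination."""
    path = [source]
    node = FIRST_HOP[source]
    while node != destination:
        path.append(node)
        # The router forwards the packet on one outgoing link, chosen here
        # by looking up the destination in its own forwarding table.
        node = FORWARDING_TABLES[node][destination]
    path.append(destination)
    return path


print(route("A", "B"))
```

In this hypothetical topology, route("A", "B") yields the path A, R1, R2, B, and route("B", "A") yields the reverse. How real routers fill in their tables is the subject of Chapter 4.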

The public Internet (i.e., the global network of networks discussed above) is the network that one typically refers to as the Internet.
There are also many private networks, such as certain corporate and government networks, whose hosts are not accessible from (i.e.,
they cannot exchange messages with) hosts outside of that private network. These private networks are often referred to as
intranets, as they often use the same "internet technology" (e.g., the same types of hosts, routers, links, protocols, and standards) as
the public Internet.

A Service Description

The discussion above has identified many of the pieces that make up the Internet. Let's now leave the nuts and bolts description and
take a more abstract, service-oriented, view:

    q   The Internet allows distributed applications running on its end systems to exchange data with each other. These
        applications include remote login, file transfer, electronic mail, audio and video streaming, real-time audio and video
        conferencing, distributed games, the World Wide Web, and much much more [AT&T 1998]. It is worth emphasizing that the
        Web is not a separate network but rather just one of many distributed applications that use the communication services
        provided by the Internet. The Web could also run over a network besides the Internet. One reason that the Internet is the
        communication medium of choice for the Web, however, is that no other existing packet-switched network connects more
        than 43 million [Network 1999] computers together and has 100 million or so users [Almanac 1998]. (By the way, determining the
        number of computers hooked up to the Internet is a very difficult task, as no one is responsible for maintaining a list of who's
        connected. When a new network is added to the Internet, its administrators do not need to report which end systems are
        connected to that network. Similarly, an existing network does not report its changes in connected end systems to any central
        authority.)

    q   The Internet provides two services to its distributed applications: a connection-oriented service and a connectionless
        service. Loosely speaking, connection-oriented service guarantees that data transmitted from a sender to a receiver will
        eventually be delivered to the receiver in-order and in its entirety. Connectionless service does not make any guarantees
        about eventual delivery. Typically, a distributed application makes use of one or the other of these two services and not both.
        We examine these two different services in Section 1.3 and in great detail in Chapter 3.

    q   Currently the Internet does not provide a service that makes promises about how long it will take to deliver the data from
        sender to receiver. And except for increasing your access bit rate to your Internet Service Provider (ISP), you currently
        cannot obtain better service (e.g., shorter delays) by paying more -- a state of affairs that some (particularly Americans!) find
        odd. We'll take a look at state-of-the-art Internet research that is aimed at changing this situation in Chapter 6.

Our second description of the Internet -- in terms of the services it provides to distributed applications -- is a non-traditional, but
important, one. Increasingly, advances in the "nuts and bolts" components of the Internet are being driven by the needs of new
applications. So it's important to keep in mind that the Internet is an infrastructure in which new applications are being constantly
invented and deployed.

We have given two descriptions of the Internet, one in terms of the hardware and software components that make up the Internet, the
other in terms of the services it provides to distributed applications. But perhaps you are even more confused as to what the Internet
is. What are packet switching, TCP/IP and connection-oriented service? What are routers? What kinds of communication links are
present in the Internet? What is a distributed application? What does the Internet have to do with children's toys? If you feel a bit
overwhelmed by all of this now, don't worry - the purpose of this book is to introduce you to both the nuts and bolts of the Internet,
as well as the principles that govern how and why it works. We will explain these important terms and questions in the subsequent
sections and chapters.

Some Good Hyperlinks

As every Internet researcher knows, some of the best and most accurate information about the Internet and its protocols is not in
hard copy books, journals, or magazines. The best stuff about the Internet is in the Internet itself! Of course, there's really too much
material to sift through, and sometimes the gems are few and far between. Below, we list a few generally excellent WWW sites for
network- and Internet-related material. Throughout the book, we will also present links to relevant, high quality URLs that
provide background, original (i.e., a citation), or advanced material related to the particular topic under study. Here is a set of key
links that you will want to consult while you proceed through this book:

Internet Engineering Task Force (IETF): The IETF is an open international community concerned with the development and
operation of the Internet and its architecture. The IETF was formally established by the Internet Architecture Board (IAB) in 1986.
The IETF meets three times a year; much of its ongoing work is conducted via mailing lists by working groups. Typically, the
working groups convene at IETF meetings to discuss their work, building on previous IETF proceedings.
The IETF is administered by the Internet Society, whose WWW site contains lots of high-quality, Internet-related material.

The World Wide Web Consortium (W3C): The W3C was founded in 1994 to develop common protocols for the evolution of the
World Wide Web. This is an outstanding site with fascinating information on emerging Web technologies, protocols and standards.

The Association for Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE): These are the
two main international professional societies that have technical conferences, magazines, and journals in the networking area. The
ACM Special Interest Group in Data Communications (SIGCOMM), the IEEE Communications Society, and the IEEE Computer
Society are the groups within these bodies whose efforts are most closely related to networking.

Connected: An Internet Encyclopedia: An attempt to take the Internet tradition of open, free protocol specifications, merge it with a
1990s Web presentation, and produce a readable and useful reference to the technical operation of the Internet. The site contains
material on over 100 Internet topics.

Data communications tutorials from the online magazine Data Communications: One of the better magazines for data
communications technology. The site includes many excellent tutorials.

Media History Project: You may be wondering how the Internet got started. Or you may wonder how electrical communications got
started in the first place. And you may even wonder about what preceded electrical communications! Fortunately, the Web contains
an abundance of excellent resources available on these subjects. This site promotes the study of media history from petroglyphs to
pixels. It covers the history of digital media, mass media, electrical media, print media, and even oral and scribal culture.


[Almanac 1998] Computer Industry Almanac, December 1998.
[AT&T 1998] "Killer Apps," AT&T WWW page.
[Network 1996] Network Wizards, Internet Domain Survey, July 1996.
[Network 1999] Network Wizards, Internet Domain Survey, January 1999.

Copyright Keith W. Ross and Jim Kurose 1996-2000

                               1.2 What is a Protocol?
Now that we've got a bit of a feel for what the "Internet" is, let's consider another important word in the
title of this book: "protocol." What is a protocol? What does a protocol do? How would you recognize a
protocol if you met one?

A Human Analogy

It is probably easiest to understand the notion of a computer network protocol by first considering some
human analogies, since we humans execute protocols all of the time. Consider what you do when you
want to ask someone for the time of day. A typical exchange is shown in Figure 1.2-1. Human protocol
(or good manners, at least) dictates that one first offers a greeting (the first "Hi" in Figure 1.2-1) to initiate
communication with someone else. The typical response to a "Hi" message (at least outside of New York
City) is a returned "Hi" message. Implicitly, one then takes a cordial "Hi" response as an indication that
one can proceed ahead and ask for the time of day. A different response to the initial "Hi" (such as "Don't
bother me!", or "I don't speak English," or an unprintable reply that one might receive in New York City)
might indicate an unwillingness or inability to communicate. In this case, the human protocol would be to
not ask for the time of day. Sometimes one gets no response at all to a question, in which case one typically
gives up asking that person for the time. Note that in our human protocol, there are specific messages we
send, and specific actions we take in response to the received reply messages or other events (such as no
reply within some given amount of time). Clearly, transmitted and received messages, and actions taken
when these messages are sent or received or other events occur, play a central role in a human protocol. If
people run different protocols (e.g., if one person has manners but the other does not, or if one understands
the concept of time and the other does not) the protocols do not interoperate and no useful work can be
accomplished. The same is true in networking -- it takes two (or more) communicating entities running
the same protocol in order to accomplish a task.

Let's consider a second human analogy. Suppose you're in a college class (a computer networking class,
for example!). The teacher is droning on about protocols and you're confused. The teacher stops to ask,
"Are there any questions?" (a message that is transmitted to, and received by, all students who are not
sleeping). You raise your hand (transmitting an implicit message to the teacher). Your teacher
acknowledges you with a smile, saying "Yes ......." (a transmitted message encouraging you to ask your
question - teachers love to be asked questions) and you then ask your question (i.e., transmit your message
to your teacher). Your teacher hears your question (receives your question message) and answers
(transmits a reply to you). Once again, we see that the transmission and receipt of messages, and a set of
conventional actions taken when these messages are sent and received, are at the heart of this
question-and-answer protocol.

Network Protocols

A network protocol is similar to a human protocol, except that the entities exchanging messages and
taking actions are hardware or software components of a computer network, components that we will
study shortly in the following sections. All activity in the Internet that involves two or more
communicating remote entities is governed by a protocol. Protocols in routers determine a packet's path
from source to destination; hardware-implemented protocols in the network interface cards of two
physically connected computers control the flow of bits on the "wire" between the two computers; a
congestion control protocol controls the rate at which packets are transmitted between sender and receiver.
Protocols are running everywhere in the Internet, and consequently much of this book is about computer
network protocols.

                          Figure 1.2-1: A human protocol and a computer network protocol

As an example of a computer network protocol with which you are probably familiar, consider what
happens when you make a request to a WWW server, i.e., when you type in the URL of a WWW page
into your web browser. The scenario is illustrated in the right half of Figure 1.2-1. First, your computer
will send a so-called "connection request" message to the WWW server and wait for a reply. The WWW
server will eventually receive your connection request message and return a "connection reply" message.
Knowing that it is now OK to request the WWW document, your computer then sends the name of the
WWW page it wants to fetch from that WWW server in a "get" message. Finally, the WWW server
returns the contents of the WWW document to your computer.
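
The exchange in the right half of Figure 1.2-1 can be sketched as a very small state machine. This is only an illustration of "messages in a required order, with an action for each": the message names follow the figure, while the ToyWebServer class, its handle() method and the PAGES table are hypothetical names, and nothing here is real TCP or HTTP.

```python
# Toy model of the message exchange in Figure 1.2-1 (not real TCP or HTTP).
# ToyWebServer, handle() and PAGES are hypothetical names used for illustration.

PAGES = {"/index.html": "<html>Hello</html>"}


class ToyWebServer:
    def __init__(self):
        self.connected = False  # state kept by the server end system

    def handle(self, message, argument=None):
        """Return the reply the protocol dictates for one received message."""
        if message == "connection request":
            self.connected = True
            return "connection reply"
        if message == "get" and self.connected:
            return PAGES.get(argument, "not found")
        # A message sent out of order is a protocol violation.
        return "error"


def fetch(server, page):
    """Client side: send the messages in the order the protocol requires."""
    reply = server.handle("connection request")
    if reply != "connection reply":
        return None  # the server is unwilling or unable to communicate
    return server.handle("get", page)


document = fetch(ToyWebServer(), "/index.html")
out_of_order = ToyWebServer().handle("get", "/index.html")
print(document)
print(out_of_order)
```

A client that skips the connection request and sends "get" first receives "error", which mirrors the point that both sides must follow the same message order for useful work to be done.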

Given the human and networking examples above, the exchange of messages and the actions taken when
these messages are sent and received are the key defining elements of a protocol:

  A protocol defines the format and the order of messages exchanged between two or more
communicating entities, as well as the actions taken on the transmission and/or receipt of a message.

The Internet, and computer networks in general, make extensive use of protocols. Different protocols are
used to accomplish different communication tasks. As you read through this book, you will learn that
some protocols are simple and straightforward, while others are complex and intellectually deep.
Mastering the field of computer networking is equivalent to understanding the what, why and how of
networking protocols.

                                    1.3 The Network Edge
In the previous sections we presented a high-level description of the Internet and networking protocols.
We are now going to delve a bit more deeply into the components of the Internet. We begin in this section
at the edge of the network and look at the components with which we are most familiar -- the computers (e.g.,
PCs and workstations) that we use on a daily basis. In the next section we will move from the network
edge to the network core and examine switching and routing in computer networks. Then in Section 1.5
we will discuss the actual physical links that carry the signals sent between the computers and the
switches.

1.3.1 End Systems, Clients and Servers
In computer networking jargon, the computers that we use on a daily basis are often referred to as
hosts or end systems. They are referred to as "hosts" because they host (run) application-level programs
such as a Web browser or server program, or an e-mail program. They are also referred to as "end
systems" because they sit at the "edge" of the Internet, as shown in Figure 1.3-1. Throughout this book we
will use the terms hosts and end systems interchangeably, that is, host = end system.

Hosts are sometimes further divided into two categories: clients and servers. Informally, clients often
tend to be desktop PCs or workstations, while servers are more powerful machines. But there is a more
precise meaning of a client and a server in computer networking. In the so-called client-server model, a
client program running on one end system requests and receives information from a server running on
another end system. This client-server model is undoubtedly the most prevalent structure for Internet
applications. We will study the client-server model in detail in Chapter 2. The Web, e-mail, file transfer,
remote login (e.g., Telnet), newsgroups and many other popular applications adopt the client-server model.
Since a client typically runs on one computer and the server runs on another computer, client-server
Internet applications are, by definition, distributed applications. The client and the server interact with
each other by communicating (i.e., sending each other messages) over the Internet. At this level of
abstraction, the routers, links and other "pieces" of the Internet serve as a "black box" that transfers
messages between the distributed, communicating components of an Internet application. This is the
level of abstraction depicted in Figure 1.3-1.

                                                  Figure 1.3-1: End system Interaction

Computers (e.g., a PC or a workstation), operating as clients and servers, are the most prevalent type of
end system. However, an increasing number of alternative devices, such as so-called network computers
and thin clients [Thinworld 1998], Web TVs and set top boxes [Mills 1998], digital cameras, and other
devices are being attached to the Internet as end systems. An interesting discussion of the continuing
evolution of Internet applications is [AT&T 1998].

1.3.2 Connectionless and Connection-Oriented Services
We have seen that end systems exchange messages with each other according to an application-level
protocol in order to accomplish some task. The links, routers and other pieces of the Internet provide the
means to transport these messages between the end system applications. But what are the characteristics
of this communication service that is provided? The Internet, and more generally TCP/IP networks,
provide two types of services to their applications: connectionless service and connection-oriented
service. A developer creating an Internet application (e.g., an email application, a file transfer
application, a Web application or an Internet phone application) must program the application to use one
of these two services. Here, we only briefly describe these two services; we shall discuss them in much
more detail in Chapter 3, which covers transport layer protocols.

Connection-Oriented Service

When an application uses the connection-oriented service, the client and the server (residing in different
end systems) send control packets to each other before sending packets with real data (such as e-mail
messages). This so-called handshaking procedure alerts the client and server, allowing them to prepare
for an onslaught of packets. It is interesting to note that this initial handshaking procedure is similar to
the protocol used in human interaction. The exchange of "hi's" we saw in Figure 1.2-1 is an example of a
human "handshaking protocol" (even though handshaking is not literally taking place between the two
people). The two TCP messages that are exchanged as part of the WWW interaction shown in Figure
1.2-1 are two of the three messages exchanged when TCP sets up a connection between a sender and
receiver. The third TCP message (not shown) that forms the final part of the TCP three-way handshake
(see Section 3.7) is contained in the get message shown in Figure 1.2-1.

Once the handshaking procedure is finished, a "connection" is said to be established between the two end
systems. But the two end systems are connected in a very loose manner, hence the terminology
"connection-oriented". In particular, only the end systems themselves are aware of this connection; the
packet switches (i.e., routers) within the Internet are completely oblivious to the connection. This is
because a TCP connection is nothing more than allocated resources (buffers) and state variables in the end
systems. The packet switches do not maintain any connection state information.

The Internet's connection-oriented service comes bundled with several other services, including reliable
data transfer, flow control and congestion control. By reliable data transfer, we mean that an application
can rely on the connection to deliver all of its data without error and in the proper order. Reliability in the
Internet is achieved through the use of acknowledgments and retransmissions. To get a preliminary idea
about how the Internet implements the reliable transport service, consider an application that has
established a connection between end systems A and B. When end system B receives a packet from A, it
sends an acknowledgment; when end system A receives the acknowledgment, it knows that the
corresponding packet has definitely been received. When end system A doesn't receive an
acknowledgment, it assumes that the packet it sent was not received by B; it therefore retransmits the
packet.

Flow control makes sure that neither side of a connection overwhelms the other side by sending
too many packets too fast. Indeed, the application at one side of the connection may not be able to
process information as quickly as it receives the information. Therefore, there is a risk of overwhelming
either side of an application. The flow-control service forces the sending end system to reduce its rate
whenever there is such a risk. We shall see in Chapter 3 that the Internet implements the flow control
service by using sender and receiver buffers in the communicating end systems.

The Internet's congestion control service helps prevent the Internet from entering a state of gridlock.
When a router becomes congested, its buffers can overflow and packet loss can occur. In such
circumstances, if every pair of communicating end systems continues to pump packets into the network
as fast as they can, gridlock sets in and few packets are delivered to their destinations. The Internet
avoids this problem by forcing end systems to diminish the rate at which they send packets into the
network during periods of congestion. End systems are alerted to the existence of severe congestion
when they stop receiving acknowledgments for the packets they have sent.
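
The acknowledgment-and-retransmission idea described above can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not how TCP is actually implemented: only data packets are lost (acknowledgments always arrive), the sender waits for each acknowledgment before sending the next packet, and the function name, loss rate and seed are invented for the example.

```python
import random

# Toy sender A and receiver B: A retransmits each packet until it is acknowledged.
# Assumptions for illustration: only data packets are lost, ACKs always arrive,
# and A sends one packet at a time. Real TCP (Chapter 3) is far more elaborate.


def reliable_transfer(data, loss_rate=0.3, seed=1):
    rng = random.Random(seed)
    received = []      # what end system B has delivered to its application
    transmissions = 0  # how many packets A has put into the network
    for payload in data:
        acked = False
        while not acked:
            transmissions += 1
            if rng.random() < loss_rate:
                continue  # packet lost: no ACK arrives, so A retransmits
            received.append(payload)  # B receives the packet ...
            acked = True              # ... and its ACK reaches A
    return received, transmissions


message = ["p0", "p1", "p2", "p3"]
delivered, sent = reliable_transfer(message)
print(delivered == message, sent >= len(message))
```

Every packet is eventually delivered, in order, at the cost of extra transmissions whenever the simulated network drops a packet.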

We emphasize here that although the Internet's connection-oriented service comes bundled with reliable
data transfer, flow control and congestion control, these three features are by no means essential
components of a connection-oriented service. A different type of computer network may provide a
connection-oriented service to its applications without bundling in one or more of these features. Indeed,
any protocol that performs handshaking between the communicating entities before transferring data is a
connection-oriented service [Iren].

The Internet's connection-oriented service has a name -- TCP (Transmission Control Protocol); the initial
version of the TCP protocol is defined in the Internet Request for Comments RFC 793 [RFC 793]. The
services that TCP provides to an application include reliable transport, flow control and congestion
control. It is important to note that an application need only care about the services that are provided; it
need not worry about how TCP actually implements reliability, flow control, or congestion control.
We, of course, are very interested in how TCP implements these services and we shall cover these topics
in detail in Chapter 3.

Connectionless Service

There is no handshaking with the Internet's connectionless service. When one side of an application wants
to send packets to another side of an application, the sending application simply sends the packets. Since
there is no handshaking procedure prior to the transmission of the packets, data can be delivered faster.
But there are no acknowledgments either, so a source never knows for sure which packets arrive at the
destination. Moreover, the service makes no provision for flow control or congestion control. The
Internet's connectionless service is provided by UDP (User Datagram Protocol); UDP is defined in the
Internet Request for Comments RFC 768 [RFC 768].

Most of the more familiar Internet applications use TCP, the Internet's connection-oriented service. These
applications include Telnet (remote login), SMTP (for electronic mail), FTP (for file transfer), and HTTP
(for the Web). Nevertheless, UDP, the Internet's connectionless service, is used by many applications,
including many of the emerging multimedia applications, such as Internet phone, audio-on-demand, and
video conferencing.
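
To make the contrast between the two services concrete, here is a minimal sketch using Python's standard socket module on the local loopback interface. The function names and payloads are arbitrary choices for this example, and port 0 simply asks the operating system for any free port. With UDP the sender just sends; with TCP the client must connect (the handshake) before any data flows.

```python
import socket
import threading

# Loopback sketch: UDP sends with no handshake, TCP connects first.
# udp_demo() and tcp_demo() are illustrative names, not part of any library.

HOST = "127.0.0.1"


def udp_demo():
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind((HOST, 0))
    receiver.settimeout(2.0)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # No connection setup: the datagram is simply sent, with no delivery guarantee.
    sender.sendto(b"hello via UDP", receiver.getsockname())
    data, _ = receiver.recvfrom(1024)
    sender.close()
    receiver.close()
    return data


def tcp_demo():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((HOST, 0))
    listener.listen(1)
    listener.settimeout(2.0)
    received = []

    def serve():
        conn, _ = listener.accept()  # returns once a connection is established
        with conn:
            conn.settimeout(2.0)
            chunks = []
            while True:
                chunk = conn.recv(1024)
                if not chunk:  # empty read: the client closed the connection
                    break
                chunks.append(chunk)
            received.append(b"".join(chunks))

    server = threading.Thread(target=serve)
    server.start()
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(2.0)
    client.connect(listener.getsockname())  # handshake happens before any data
    client.sendall(b"hello via TCP")
    client.close()
    server.join()
    listener.close()
    return received[0]


print(udp_demo())
print(tcp_demo())
```

On the loopback interface the UDP datagram will normally arrive, but nothing in the UDP code would notice if it did not; the TCP code relies on the operating system's TCP implementation for acknowledgments and retransmissions.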


[AT&T 1998] "Killer Apps," AT&T WWW page.
[Iren] S. Iren, P. Amer, P. Conrad, "The Transport Layer: Tutorial and Survey," ACM Computing Surveys,
June 1999.
[Thinworld 1998] Thinworld homepage.
[Mills 1998] S. Mills, "TV set-tops set to take off," CNET, Oct. 1998.
[RFC 768] J. Postel, "User Datagram Protocol," RFC 768, Aug. 1980.
[RFC 793] J. Postel, "Transmission Control Protocol," RFC 793, September 1981.

                                 1.4 The Network Core
Having examined the end systems and end-to-end transport service model of the Internet in Section 1.3, let
us now delve more deeply into the "inside" of the network. In this section we study the network core --
the mesh of routers that interconnect the Internet's end systems. Figure 1.4-1 highlights the network core
in red.

                                                Figure 1.4-1: The network core

1.4.1 Circuit Switching, Packet Switching and Message Switching
There are two fundamental approaches towards building a network core: circuit switching and packet
switching. In circuit-switched networks, the resources needed along a path (buffers, link bandwidth) to
provide for communication between the end systems are reserved for the duration of the session. In
packet-switched networks, these resources are not reserved; a session's messages use the resources on
demand, and as a consequence, may have to wait (i.e., queue) for access to a communication link. As a
simple analogy, consider two restaurants -- one which requires reservations and another which neither
requires reservations nor accepts them. For the restaurant that requires reservations, we have to go
through the hassle of first calling (or sending an e-mail!) before we leave home. But when we arrive at
the restaurant we can, in principle, immediately communicate with the waiter and order our meal. For the
restaurant that does not require reservations, we don't need to bother to reserve a table. But when we
arrive at the restaurant, we may have to wait for a table before we can communicate with the waiter.

The ubiquitous telephone networks are examples of circuit-switched networks. Consider what happens
when one person wants to send information (voice or facsimile) to another over a telephone network.
Before the sender can send the information, the network must first establish a connection between the
sender and the receiver. In contrast with the TCP connection that we discussed in the previous section,
this is a bona fide connection for which the switches on the path between the sender and receiver
maintain connection state for that connection. In the jargon of telephony, this connection is called a
circuit. When the network establishes the circuit, it also reserves a constant transmission rate in the
network's links for the duration of the connection. This reservation allows the sender to transfer the data
to the receiver at the guaranteed constant rate.

Today's Internet is a quintessential packet-switched network. Consider what happens when one host
wants to send a packet to another host over a packet-switched network. As with circuit-switching, the
packet is transmitted over a series of communication links. But with packet-switching, the packet is sent
into the network without reserving any bandwidth whatsoever. If one of the links is congested because
other packets need to be transmitted over the link at the same time, then our packet will have to wait in a
buffer at the sending side of the transmission line, and suffer a delay. The Internet makes its best effort to
deliver the data in a timely manner. But it does not make any guarantees.
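
The waiting described above can be sketched with a toy first-come, first-served buffer in front of a single link. The function name, arrival times, packet size and link rate are made-up values chosen only to show how delay appears when packets contend for the same link; this is not a model of any particular router.

```python
# Toy FIFO output buffer for one link of a packet-switched network.
# Arrival times, packet size and link rate are made-up illustrative values.


def queueing_delays(arrival_times, packet_bits, link_bps):
    """Return how long each packet waits in the buffer before its transmission starts."""
    transmit_time = packet_bits / link_bps
    link_free_at = 0.0
    delays = []
    for arrival in arrival_times:
        start = max(arrival, link_free_at)  # wait if the link is still busy
        delays.append(start - arrival)
        link_free_at = start + transmit_time
    return delays


# Three 8,000-bit packets arrive at the same instant on a 1 Mbps link;
# a fourth arrives one second later, after the link has gone idle.
delays = queueing_delays([0.0, 0.0, 0.0, 1.0], packet_bits=8000, link_bps=1_000_000)
print(delays)
```

The first and fourth packets find the link idle and wait nothing, while the second and third wait for the packets ahead of them. No bandwidth was reserved, so the delay depends entirely on what else happens to be using the link.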

Not all telecommunication networks can be neatly classified as pure circuit-switched networks or pure
packet-switched networks. For example, for networks based on the ATM technology, a connection can
make a reservation and yet its messages may still wait for congested resources! Nevertheless, this
fundamental classification into packet- and circuit-switched networks is an excellent starting point in
understanding telecommunication network technology.

Circuit Switching

This book is about computer networks, the Internet and packet switching, not about telephone networks
and circuit switching. Nevertheless, it is important to understand why the Internet and other computer
networks use packet switching rather than the more traditional circuit-switching technology used in the
telephone networks. For this reason, we now give a brief overview of circuit switching.

Figure 1.4-2 illustrates a circuit-switched network. In this network the three circuit switches are
interconnected by two links; each of these links has n circuits, so that each link can support n
simultaneous connections. The end systems (e.g., PCs and workstations) are each directly connected to
one of the switches. (Ordinary telephones are also connected to the switches, but they are not shown in
the diagram.) Notice that some of the hosts have analog access to the switches, whereas others have
direct digital access. For analog access, a modem is required. When two hosts desire to communicate, the
network establishes a dedicated end-to-end circuit between the two hosts. (Conference calls between more
than two devices are, of course, also possible. But to keep things simple, let's suppose for now that there
are only two hosts for each connection.) Thus in order for host A to send messages to host B, the network
must first reserve one circuit on each of two links.
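
The reservation step can be sketched as follows. The Link class and the reserve_path() and release_path() functions are hypothetical names for illustration; the sketch only captures the bookkeeping idea that a connection needs one free circuit on every link of its path, and that a link with n circuits can carry at most n connections at once.

```python
# Toy model of circuit reservation along a path of links.
# Link, reserve_path() and release_path() are hypothetical names for illustration.


class Link:
    def __init__(self, n_circuits):
        self.free = n_circuits  # circuits not yet reserved on this link


def reserve_path(links):
    """Reserve one circuit on every link of the path, or none at all."""
    if any(link.free == 0 for link in links):
        return False  # the connection is blocked: some link has no free circuit
    for link in links:
        link.free -= 1
    return True


def release_path(links):
    """Give the circuits back when the connection ends."""
    for link in links:
        link.free += 1


# Two links with n = 2 circuits each, as on the path from host A to host B.
path = [Link(2), Link(2)]
results = [reserve_path(path) for _ in range(3)]
print(results)  # [True, True, False]
```

The third connection attempt is refused because both circuits on each link are already reserved, even if the two existing connections are momentarily silent.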

                           Figure 1.4-2: A simple circuit-switched network consisting of three circuit switches
                           interconnected with two links. Each link has n circuits; each end-to-end circuit over
                           a link gets the fraction 1/n of the link's bandwidth for the duration of the circuit.
                           The n circuits in a link can be either TDM or FDM circuits.

A circuit in a link is implemented with either frequency-division multiplexing (FDM) or time-division
multiplexing (TDM). With FDM, the frequency spectrum of a link is shared among the connections
established across the link. Specifically, the link dedicates a frequency band to each connection for the
duration of the connection. In telephone networks, this frequency band typically has a width of 4 kHz.
The width of the band is called, not surprisingly, the bandwidth. FM radio stations also use FDM to
share the FM broadcast band (roughly 88 MHz to 108 MHz in most countries).

The trend in modern telephony is to replace FDM with TDM. The majority of the links in most telephone
systems in the United States and in other developed countries currently employ TDM. For a TDM link,
time is divided into frames of fixed duration and each frame is divided into a fixed number of time slots.
When the network establishes a connection across a link, the network dedicates one time slot in every
frame to the connection. These slots are dedicated for the sole use of that connection, with a time slot
available for use (in every frame) to transmit the connection's data.

Figure 1.4-3 illustrates FDM and TDM for a specific network link. For FDM, the frequency domain is
segmented into a number of circuits, each of bandwidth 4 kHz (i.e., 4,000 Hertz or 4,000 cycles per
second). For TDM, the time domain is segmented into four circuits; each circuit is assigned the same
dedicated slot in the revolving TDM frames. The transmission rate of a circuit is equal to the frame rate
multiplied by the number of bits in a slot. For example, if the link transmits 8,000 frames per second and
each slot consists of 8 bits, then the transmission rate of a circuit is 64 Kbps.
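
The arithmetic in this example can be checked with a few lines of code. The function names are invented for the sketch, and the 24-slot frame used for the whole-link rate is an assumed value chosen for illustration; as elsewhere in this text, 1 Kbps is taken to mean 1,000 bits per second.

```python
# Arithmetic behind the TDM example. The 24-slot frame is an assumed value.


def tdm_circuit_rate_bps(frames_per_second, bits_per_slot):
    """One circuit owns one slot per frame, so its rate is frames/s times bits/slot."""
    return frames_per_second * bits_per_slot


def tdm_link_rate_bps(frames_per_second, bits_per_slot, slots_per_frame):
    """The whole link carries every slot of every frame."""
    return tdm_circuit_rate_bps(frames_per_second, bits_per_slot) * slots_per_frame


circuit_rate = tdm_circuit_rate_bps(8000, 8)
link_rate = tdm_link_rate_bps(8000, 8, slots_per_frame=24)
print(circuit_rate)  # 64000 bits per second, i.e., 64 Kbps
print(link_rate)     # 1536000 bits per second, i.e., 1.536 Mbps
```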

                                     Figure 1.4-3: With FDM, each circuit continuously gets a fraction of the
                                     bandwidth. With TDM, each circuit gets all of the bandwidth periodically
                                     during brief intervals of time (i.e., during slots).

Proponents of packet switching have always argued that circuit switching is wasteful because the
dedicated circuits are idle during silent periods. For example, when one of the conversants in a
telephone call stops talking, the idle network resources (frequency bands or slots in the links along the
connection's route) cannot be used by other ongoing connections. As another example of how these
resources can be underutilized, consider a radiologist who uses a circuit-switched network to remotely
access a series of x-rays. The radiologist sets up a connection, requests an image, contemplates the
image, and then requests a new image. Network resources are wasted during the radiologist's
contemplation periods. Proponents of packet switching also enjoy pointing out that establishing end-to-
end circuits and reserving end-to-end bandwidth is complicated and requires complex signaling software
to coordinate the operation of the switches along the end-to-end path.

Before we finish our discussion of circuit switching, let's work through a numerical example that should
shed further light on the matter. Let us consider how long it takes to send a file of 640 Kbits from host
A to host B over a circuit-switched network. Suppose that all links in the network use TDM with 24
slots and have bit rate 1.536 Mbps. Also suppose that it takes 500 msec to establish an end-to-end circuit
before A can begin to transmit the file. How long does it take to send the file? Each circuit has a
transmission rate of (1.536 Mbps)/24 = 64 Kbps, so it takes (640 Kbits)/(64 Kbps) = 10 seconds to
transmit the file. To these 10 seconds we add the circuit establishment time, giving 10.5 seconds to
send the file. Note that the transmission time is independent of the number of links: the transmission time
would be 10 seconds if the end-to-end circuit passes through one link or one-hundred links. AT&T Labs
provides an interactive site [AT&T 1998] to explore transmission delay for various file types and
transmission technologies.
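
The numerical example above can be reproduced with a short Python sketch. The function and parameter names are our own choices for illustration. Propagation delay is ignored, as it is in the example.

```python
def circuit_transfer_time(file_bits, link_bps, slots, setup_seconds):
    """Time to send a file over a TDM circuit, including circuit establishment.

    Each circuit gets 1/slots of the link rate. The number of links on the
    path does not appear because it does not affect the transmission time.
    """
    circuit_bps = link_bps / slots
    return setup_seconds + file_bits / circuit_bps


# 640 Kbit file, 1.536 Mbps links with 24 slots, 500 msec circuit setup.
total = circuit_transfer_time(640_000, 1_536_000, 24, 0.5)
print(total)  # 10.5 seconds
```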

Packet Switching

We saw in sections 1.2 and 1.3 that application-level protocols exchange messages in accomplishing
their task. Messages can contain anything the protocol designer desires. Messages may perform a
control function (e.g., the "hi" messages in our handshaking example) or can contain data, such as an
ASCII file, a Postscript file, a Web page, or a digital audio file. In modern packet-switched networks, the
source breaks long messages into smaller packets. Between source and destination, each of these packets
traverses communication links and packet switches (also known as routers). Packets are transmitted over
each communication link at a rate equal to the full transmission rate of the link. Most packet switches use
store-and-forward transmission at the inputs to the links. Store-and-forward transmission means that
the switch must receive the entire packet before it can begin to transmit the first bit of the packet onto the
outbound link. Thus store-and-forward packet-switches introduce a store-and-forward delay at the
input to each link along the packet's route. This delay is proportional to the packet's length in bits. In
particular, if a packet consists of L bits, and the packet is to be forwarded onto an outbound link of R bps,
then the store-and-forward delay at the switch is L/R seconds.

Within each router there are multiple buffers (also called queues), with each link having an input buffer
(to store packets that have just arrived to that link) and an output buffer. The output buffers play a key
role in packet switching. If an arriving packet needs to be transmitted across a link but finds the link
busy with the transmission of another packet, the arriving packet must wait in the output buffer. Thus, in
addition to the store-and-forward delays, packets suffer output buffer queueing delays. These delays are
variable and depend on the level of congestion in the network. Since the amount of buffer space is finite,
an arriving packet may find that the buffer is completely filled with other packets waiting for
transmission. In this case, packet loss will occur - either the arriving packet or one of the already-
queued packets will be dropped. Returning to our restaurant analogy from earlier in this section, the
queueing delay is analogous to the amount of time one spends waiting for a table. Packet loss is
analogous to being told by the waiter that you must leave the premises because there are already too
many other people waiting at the bar for a table.

Figure 1.4-4 illustrates a simple packet-switched network. Suppose Hosts A and B are sending packets to
Host E. Hosts A and B first send their packets along 28.8 Kbps links to the first packet switch. The
packet switch directs these packets to the 1.544 Mbps link. If there is congestion at this link, the packets
queue in the link's output buffer before they can be transmitted onto the link. Consider now how Host A
and Host B packets are transmitted onto this link. As shown in Figure 1.4-4, the sequence of A and B
packets does not follow any periodic ordering; the ordering is random or statistical -- packets are sent
whenever they happen to be present at the link. For this reason, we often say that packet switching
employs statistical multiplexing. Statistical multiplexing sharply contrasts with time-division
multiplexing (TDM), for which each host gets the same slot in a revolving TDM frame.

                                                Figure 1.4-4: Packet switching

Let us now consider how long it takes to send a packet of L bits from host A to host E across a
packet-switched network. Let us suppose that there are Q links between A and E, each of rate R bps.
Assume that queueing delays and end-to-end propagation delays are negligible and that there is no
connection establishment. The packet must first be transmitted onto the first link emanating from host A;
this takes L/R seconds. It must then be transmitted on each of the Q-1 remaining links, that is, it must be
stored-and-forwarded Q-1 times. Thus the total delay is QL/R seconds.
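
As a quick check of the QL/R formula, here is a minimal Python sketch. The function name and the sample values (a 1,500-bit packet over three 1.5 Mbps links) are our own illustrative assumptions. Queueing and propagation delays are taken to be zero, as in the text.

```python
def store_and_forward_delay(packet_bits, link_bps, num_links):
    """End-to-end delay Q*L/R for one packet over Q store-and-forward links."""
    per_link = packet_bits / link_bps   # L/R seconds on each link
    return num_links * per_link


# Hypothetical values: L = 1,500 bits, R = 1.5 Mbps, Q = 3 links.
delay = store_and_forward_delay(1_500, 1_500_000, 3)
print(f"{delay * 1000:.1f} msec")
```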

Packet Switching versus Circuit Switching

Having described circuit switching and packet switching, let us compare the two. Opponents of packet
switching have often argued that packet switching is not suitable for real-time services (e.g.,
telephone calls and video conference calls) due to its variable and unpredictable delays. Proponents of
packet switching argue that (1) it offers better sharing of bandwidth than circuit switching and (2) it is
simpler, more efficient, and less costly to implement than circuit-switching. Generally speaking, people
who do not like to hassle with restaurant reservations prefer packet switching to circuit switching.

Why is packet-switching more efficient? Let us look at a simple example. Suppose users share a 1 Mbps
link. Also suppose that each user alternates between periods of activity (when it generates data at a
constant rate of 100 Kbps) and periods of inactivity (when it generates no data). Suppose further
that a user is active only 10% of the time (and is idle drinking coffee during the remaining 90% of the
time). With circuit-switching, 100 Kbps must be reserved for each user at all times. Thus, the link can
support only ten simultaneous users. With packet switching, if there are 35 users, the probability that
there are 11 or more simultaneously active users is approximately 0.0004. If there are 10 or fewer simultaneously
active users (which happens with probability 0.9996), the aggregate arrival rate of data is at most 1 Mbps
(the output rate of the link). Thus, users' packets flow through the link essentially without delay, as is the
case with circuit switching. When there are more than 10 simultaneously active users, then the aggregate
arrival rate of packets will exceed the output capacity of the link, and the output queue will begin to grow
(until the aggregate input rate falls back below 1 Mbps, at which point the queue will begin to diminish in
length). Because the probability of having more than ten simultaneously active users is very small,
packet-switching almost always has the same delay performance as circuit switching, but does so while
allowing for more than three times the number of users.
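
The probability quoted in this example can be computed directly. The sketch below assumes that the 35 users are active independently of one another, each with probability 0.1, so that the number of active users follows a binomial distribution. That independence assumption is ours; the text does not state it explicitly. The function name is also our own.

```python
from math import comb


def prob_more_than(k, n, p):
    """P(more than k of n independent users are active), each active with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1, n + 1))


# 35 users, each active 10% of the time; the link can carry 10 active users.
p_overload = prob_more_than(10, 35, 0.1)
print(f"P(11 or more active) = {p_overload:.6f}")
print(f"P(10 or fewer active) = {1 - p_overload:.6f}")
```

The first value comes out a little above 0.0004, which matches the figure used in the example.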

Although packet switching and circuit switching are both very prevalent in today's telecommunication
networks, the trend is certainly in the direction of packet switching. Even many of today's circuit-
switched telephone networks are slowly migrating towards packet switching. In particular, telephone
networks often convert to packet switching for the expensive overseas portion of a telephone call.

Message Switching

In a modern packet-switched network, the source host segments long messages into smaller packets and
sends the smaller packets into the network; the receiver reassembles the packets back into the original
message. But why bother to segment the messages into packets in the first place, only to have to
reassemble packets into messages? Doesn't this place an additional and unnecessary burden on the source
and destination? Although the segmentation and reassembly do complicate the design of the source and
receiver, researchers and network designers concluded in the early days of packet switching that the
advantages of segmentation greatly compensate for its complexity. Before discussing some of these
advantages, we need to introduce some terminology. We say that a packet-switched network performs
message switching if the sources do not segment messages, i.e., they send a message into the network as
a whole. Thus message switching is a specific kind of packet switching, whereby the packets traversing
the network are themselves entire messages.

Figure 1.4-5 illustrates message switching in a route consisting of two packet switches (PSs) and three
links. With message switching, the message stays intact as it traverses the network. Because the switches
are store-and-forward packet switches, a packet switch must receive the entire message before it can
begin to forward the message on an outbound link.

                                   Figure 1.4-5: A simple message-switched network

Figure 1.4-6 illustrates packet switching for the same network. In this example the original message has
been divided into five distinct packets. In Figure 1.4-6, the first packet has arrived at the destination, the
second and third packets are in transit in the network, and the last two packets are still in the source.
Again, because the switches are store-and-forward packet switches, a packet switch must receive an
entire packet before it can begin to forward the packet on an outbound link.

                                    Figure 1.4-6: A simple packet-switched network

One major advantage of packet switching (with segmented messages) is that it achieves end-to-end
delays that are typically much smaller than the delays associated with message-switching. We illustrate
this point with the following simple example. Consider a message that is 7.5 Mbits long. Suppose that
between source and destination there are two packet switches and three links, and that each link has a
transmission rate of 1.5 Mbps. Assuming there is no congestion in the network, how much time is
required to move the message from source to destination with message switching? It takes the source 5
seconds to move the message from the source to the first switch. Because the switches use store-and-
forward, the first switch cannot begin to transmit any bits in the message onto the link until this first
switch has received the entire message. Once the first switch has received the entire message, it takes 5
seconds to move the message from the first switch to the second switch. Thus it takes ten seconds to
move the message from the source to the second switch. Following this logic we see that a total of 15
seconds is needed to move the message from source to destination. These delays are illustrated in Figure
1.4-7.

     Figure 1.4-7: Timing of message transfer of a 7.5 Mbit message in a message-switched network

Continuing with the same example, now suppose that the source breaks the message into 5000 packets,
with each packet being 1.5 Kbits long. Again assuming that there is no congestion in the network, how
long does it take to move the 5000 packets from source to destination? It takes the source 1 msec to
move the first packet from the source to the first switch. And it takes the first switch 1 msec to move this
first packet from the first to the second switch. But while the first packet is being moved from the first
switch to the second switch, the second packet is simultaneously moved from the source to the first
switch. Thus the second packet reaches the first switch at time = 2 msec. Following this logic we see that
the last packet is completely received at the first switch at time = 5000 msec = 5 seconds. Since this last
packet has to be transmitted on two more links, the last packet is received by the destination at 5.002
seconds (see Figure 1.4-8).

 Figure 1.4-8: Timing of packet transfer of a 7.5 Mbit message, divided into 5000 packets, in a packet-
                                           switched network

Amazingly enough, packet-switching has reduced the message-switching delay by a factor of three! But
why is this so? What is packet-switching doing that is different from message switching? The key
difference is that message switching is performing sequential transmission whereas packet switching is
performing parallel transmission. Observe that with message switching, while one node (the source or
one of the switches) is transmitting, the remaining nodes are idle. With packet switching, once the first
packet reaches the last switch, three nodes transmit at the same time.
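
The two delay calculations in this example can be compared side by side with a short Python sketch. The function names are our own. The sketch assumes no congestion, no propagation delay, and a message that divides evenly into packets, as in the example.

```python
def message_switching_delay(message_bits, link_bps, num_links):
    """The whole message is stored and forwarded at every hop."""
    return num_links * message_bits / link_bps


def packet_switching_delay(message_bits, packet_bits, link_bps, num_links):
    """Packets are pipelined: the last packet leaves the source after
    num_packets transmission times, then crosses the remaining links."""
    num_packets = message_bits // packet_bits      # assumes an even split
    per_packet = packet_bits / link_bps
    return (num_packets + num_links - 1) * per_packet


# 7.5 Mbit message, three 1.5 Mbps links, 1.5 Kbit packets.
print(message_switching_delay(7_500_000, 1_500_000, 3))
print(packet_switching_delay(7_500_000, 1_500, 1_500_000, 3))
```

The two calls give about 15 seconds and about 5.002 seconds, in agreement with the text.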

Packet switching has yet another important advantage over message switching. As we will discuss later
in this book, bit errors can be introduced into packets as they transit the network. When a switch detects
an error in a packet, it typically discards the entire packet. So, if the entire message is a packet and one
bit in the message gets corrupted, the entire message is discarded. If, on the other hand, the message is
segmented into many packets and one bit in one of the packets is corrupted, then only that one packet is
discarded.

Packet switching is not without its disadvantages, however, with respect to message switching. We will
see that each packet or message must carry, in addition to the data being sent from the sending
application to the receiving application, an amount of control information. This information, which is
carried in the packet or message header, might include the identity of the sender and receiver and a
packet or message identifier (e.g., number). Since the amount of header information would be
approximately the same for a message or a packet, the amount of header overhead per byte of data is
higher for packet switching than for message switching.

Before moving on to the next subsection, you are highly encouraged to explore the Message Switching
Java Applet. This applet will allow you to experiment with different message and packet sizes, and will
allow you to examine the effect of additional propagation delays.

1.4.2 Routing in Data Networks
There are two broad classes of packet-switched networks: datagram networks and virtual-circuit
networks. They differ according to whether they route packets according to host destination addresses
or according to virtual circuit numbers. We shall call any network that routes packets according to host
destination addresses a datagram network. The IP protocol of the Internet routes packets according to
the destination addresses; hence the Internet is a datagram network. We shall call any network that routes
packets according to virtual-circuit numbers a virtual-circuit network. Examples of packet-switching
technologies that use virtual circuits include X.25, frame relay, and ATM.

Virtual Circuit Networks

A virtual circuit (VC) consists of (1) a path (i.e., a series of links and packet switches) between the
source and destination hosts, (2) virtual circuit numbers, one number for each link along the path, and (3)
entries in VC-number translation tables in each packet switch along the path. Once a VC is established
between source and destination, packets can be sent with the appropriate VC numbers. Because a VC has
a different VC number on each link, an intermediate packet switch must replace the VC number of each
traversing packet with a new one. The new VC number is obtained from the VC-number translation
table.

To illustrate the concept, consider the network shown in Figure 1.4-9. Suppose host A requests that the
network establish a VC between itself and host B. Suppose that the network chooses the path A - PS1 -
PS2 - B and assigns VC numbers 12, 22, 32 to the three links in this path. Then, when a packet as part of
this VC leaves host A, the value in the VC number field is 12; when it leaves PS1, the value is 22; and
when it leaves PS2, the value is 32. The numbers next to the links of PS1 are the interface numbers.

                                      Figure 1.4-9: A simple virtual circuit network

How does the switch determine the replacement VC number for a packet traversing the switch? Each
switch has a VC number translation table; for example, the VC number translation table in PS1 might
look something like this:

                     Incoming       Incoming       Outgoing       Outgoing
                     Interface      VC#            Interface      VC#

                     1              12             3              22
                     2              63             1              18
                     3              7              2              17
                     1              97             3              87
                     ...            ...            ...            ...

Whenever a new VC is established across a switch, an entry is added to the VC number table. Similarly,
whenever a VC terminates, the entries in each table along its path are removed.

You might be wondering why a packet doesn't just keep the same VC number on each of the links along
its route? The answer to this question is twofold. First, by replacing the number from link to link, the
length of the VC field is reduced. Second, and more importantly, by permitting a different VC number
for each link along the path of the VC, a network management function is simplified. Specifically, with
the multiple VC numbers, each link in the path can choose a VC number independently of what the other
links in the path chose. If a common number were required for all links along the path, the switches
would have to exchange and process a substantial number of messages to agree on the VC number to be
used for a connection.

If a network employs virtual circuits, then the network's switches must maintain state information for
the ongoing connections. Specifically, each time a new connection is established across a switch, a new
connection entry must be added to the switch's VC-number translation table; and each time a connection
is released, an entry must be removed from the table. Note that even if there is no VC number translation,
it is still necessary to maintain state information that associates VC numbers to interface numbers. The
issue of whether or not a switch or router maintains state information for each ongoing connection is a
crucial one - one to which we return shortly.

Datagram Networks

Datagram networks are analogous in many respects to the postal services. When a sender sends a letter to
a destination, the sender wraps the letter in an envelope and writes the destination address on the
envelope. This destination address has a hierarchical structure. For example, letters sent to a location in
the United States include the country (the USA), the state (e.g., Pennsylvania), the city (e.g.,
Philadelphia), the street (e.g., Walnut Street) and the number of the house on the street (e.g., 421). The
postal services use the address on the envelope to route the letter to its destination. For example, if the
letter is sent from France, then a post office in France will first direct the letter to a postal center in the
USA. This postal center in the USA will then send the letter to a postal center in Philadelphia. Finally a
mail person working in Philadelphia will deliver the letter to its ultimate destination.

In a datagram network, each packet that traverses the network contains in its header the address of the
destination. As with postal addresses, this address has a hierarchical structure. When a packet arrives at a
packet switch in the network, the packet switch examines a portion of the packet's destination address
and forwards the packet to an adjacent switch. More specifically, each packet switch has a routing table
which maps destination addresses (or portions of the destination addresses) to an outbound link. When a
packet arrives at a switch, the switch examines the address and indexes its table with this address to find
the appropriate outbound link. The switch then sends the packet into this outbound link.
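
The routing-table lookup just described can be sketched in Python. The text says only that the table maps destination addresses, or portions of them, to outbound links. The sketch below assumes IP-style address prefixes and picks the most specific matching prefix (a longest-prefix match), which is one common way to realize this idea. The prefixes and link names are hypothetical.

```python
import ipaddress

# Hypothetical routing table: address prefix -> outbound link.
routing_table = {
    ipaddress.ip_network("10.0.0.0/8"): "link-1",
    ipaddress.ip_network("10.1.0.0/16"): "link-2",
    ipaddress.ip_network("0.0.0.0/0"): "link-3",   # default route, matches everything
}


def outbound_link(table, destination):
    """Choose the link whose prefix matches the most leading bits of the address."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in table if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return table[best]


print(outbound_link(routing_table, "10.1.2.3"))     # link-2
print(outbound_link(routing_table, "10.9.9.9"))     # link-1
print(outbound_link(routing_table, "192.168.1.1"))  # link-3
```

Note that the switch keeps no per-connection state here: each packet is looked up on its own, using nothing but its destination address.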

The whole routing process is also analogous to the car driver who does not use maps but instead prefers
to ask for directions. For example, suppose Joe is driving from Philadelphia to 156 Lakeside Drive in
Orlando, Florida. Joe first drives to his neighborhood gas station and asks how to get to 156 Lakeside
Drive in Orlando, Florida. The gas station attendant extracts the Florida portion of the address and tells
Joe that he needs to get onto the interstate highway I-95 South, which has an entrance just next to the gas
station. He also tells Joe that once he enters Florida he should ask someone else there. Joe then takes I-95
South until he gets to Jacksonville, Florida, at which point he asks another gas station attendant for
directions. The attendant extracts the Orlando portion of the address and tells Joe that he should continue
on I-95 to Daytona Beach and then ask someone else. In Daytona Beach another gas station attendant
also extracts the Orlando portion of the address and tells Joe that he should take I-4 directly to Orlando.
Joe takes I-4 and gets off at the Orlando exit. Joe goes to another gas station attendant, and this time the
attendant extracts the Lakeside Drive portion of the address, and tells Joe the road he must follow to get
to Lakeside Drive. Once Joe reaches Lakeside Drive he asks a kid on a bicycle how to get to his
destination. The kid extracts the 156 portion of the address and points to the house. Joe finally reaches
his ultimate destination.

We will be discussing routing in datagram networks in great detail in this book. But for now we mention
that, in contrast with VC networks, datagram networks do not maintain connection state information in
their switches. In fact, a switch in a pure datagram network is completely oblivious to any flows of traffic
that may be passing through it -- it makes routing decisions for each individual packet. Because VC
networks must maintain connection state information in their switches, opponents of VC networks argue
that VC networks are overly complex. These opponents include most researchers and engineers in the
Internet community. Proponents of VC networks feel that VCs can offer applications a wider variety of
networking services. Many researchers and engineers in the ATM community are outspoken advocates
for VCs.

How would you like to actually see the route packets take in the Internet? We now invite you to get your
hands dirty by interacting with the Traceroute program.

Network Taxonomy

We have now introduced several important networking concepts: circuit switching, packet switching,
message switching, virtual circuits, connectionless service, and connection-oriented service. How does it
all fit together?

First, in our simple view of the world, a telecommunications network either employs circuit-switching or
packet-switching:

Figure 1.4-10: Highest-level distinction among telecommunication networks: circuit-switched or packet-switched?

A link in a circuit-switched network can employ either FDM or TDM:

                         Figure 1.4-11: Circuit switching implementation: FDM or TDM?

Packet-switched networks are either virtual-circuit networks or datagram networks. Switches in
virtual-circuit networks route packets according to the packets' VC numbers and maintain connection state.
Switches in datagram networks route packets according to the packets' destination addresses and do not
maintain connection state:

                Figure 1.4-12: Packet switching implementation: virtual circuits or datagrams?

Examples of packet-switched networks which use VCs include X.25, frame relay, and ATM. A packet-
switched network either (1) uses VCs for all of its message routing, or (2) uses destination addresses for
all of its message routing. It doesn't employ both routing techniques. (This last statement is a bit of a
white lie, as there are networks that use datagram routing "on top of" VC routing. This is the case for "IP
over ATM," as we shall cover later in the book.)

A datagram network is not, however, either a connectionless or a connection-oriented network. Indeed, a
datagram network can provide the connectionless service to some of its applications and the
connection-oriented service to other applications. For example, the Internet, which is a datagram network,
provides both connectionless and connection-oriented service to its applications.
We saw in section 1.3 that these services are provided in the Internet by the UDP and TCP protocols,
respectively. Networks with VCs - such as X.25, Frame Relay, and ATM - are always, however,
connection-oriented.

Copyright Keith W. Ross and Jim Kurose 1996-2000

                   Tracing Routes in the Internet
Traceroute is a popular program for tracing a packet's route from any source host to any destination host
in the Internet. Before we explain what traceroute does and how it works, first try running the traceroute
program. In the box below, enter the name of any host. The host
name that you enter will be sent to a server located at IBM Israel in Tel-Aviv, Israel. The host in Tel-Aviv will
respond with the route taken from Tel-Aviv to the host you have listed in the box below. After running the
program, return to this page for a discussion of the traceroute program.

After having traced the route from Tel-Aviv to your favorite host, try it again with a new starting place --
Dana Point in sunny southern California.

What Traceroute Does and How It Works

The main packet switches in the Internet are called routers, and routers use datagram routing.
Specifically, when a source constructs a packet, it appends the destination address onto the packet.
When the packet arrives at a router, the router determines the appropriate outgoing link for the packet by
examining the packet's destination address.

Traceroute is a little program that can run in any Internet host. When the user specifies a destination host
name, the program sends multiple packets towards that destination. As these packets work their way
towards the destination, they pass through a series of routers. When a router receives one of these
packets, it sends a little message back to the source. This message contains the name and address of the
router.

More specifically, suppose there are N-1 routers between the source and the destination. Then the source
will send N packets into the network, with each packet addressed to the ultimate destination. These
packets are also marked 1 through N, with the first of the N packets marked 1 and the last of the N
packets marked N. When the nth router receives the packet marked n, the router destroys the packet
and sends a message to the source. And when the destination host receives the Nth packet, the destination
destroys it as well, but again returns a message back to the source. The source records the time that
elapses from when it sends a packet until when it receives the corresponding return message; it also
records the name and address of the router (or the destination host) that returns the message. In this
manner, the source can reconstruct the route taken by packets flowing from source to destination, and the
source can determine the round-trip delays to all the intervening routers. Traceroute actually repeats the
experiment just described three times, so the source actually sends 3*N packets to the destination.
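
The marking scheme described above can be illustrated with a small simulation. This is a toy model in plain Python, not the real traceroute program: it sends no packets, and the path, node names, and function name are hypothetical.

```python
def simulate_traceroute(path):
    """Simulate one round of the experiment.

    path lists the N-1 routers in order, followed by the destination host.
    The packet marked n is destroyed by the nth node on the path, which
    then reports back to the source.
    """
    replies = []
    for mark in range(1, len(path) + 1):
        # The packet travels hop by hop until the hop number equals its mark.
        for hop_number, node in enumerate(path, start=1):
            if hop_number == mark:
                replies.append((mark, node))   # node returns a message to the source
                break
    return replies


# Hypothetical path with N-1 = 2 routers and one destination, so N = 3.
path = ["router-1", "router-2", "destination"]
for mark, node in simulate_traceroute(path):
    print(mark, node)

# Traceroute repeats the experiment three times, so it sends 3*N packets.
print("packets sent:", 3 * len(path))
```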

[RFC 1393] describes traceroute in detail. The Internet Encyclopedia also gives an overview of
how traceroute works.

Here is an example of the output of the traceroute program, where the route is being traced from a
source host at the University of Pennsylvania to a destination host at the University of
Paris VI. The output has six columns: the first column is the n value described above, i.e., the number of
the router along the route; the second column is the name of the router; the third column is the address of
the router; the last three columns are the round-trip delays for three
experiments. If the source receives fewer than three messages from any given router, because of packet
loss in the network, traceroute places an asterisk just after the router number and reports fewer than three
round-trip times for that router.

1 GW.CIS.UPENN.EDU ( 3 ms 2 ms 1 ms

2 DEFAULT7-GW.UPENN.EDU ( 3 ms 1 ms 2 ms

3 ( 3 ms 4 ms 3 ms

4 ( 6 ms 6 ms 6 ms

5 ( 7 ms 6 ms 6 ms

6 ( 16 ms 305 ms 192 ms

7 ( 20 ms 196 ms 18 ms

8 sl-dc-6-H2/ ( 19 ms 18 ms 24 ms

9 ( 19 ms 24 ms 18 ms

10 gsl-dc-3-Fddi0/ ( 19 ms 18 ms 20 ms

11 * ( 133 ms 94 ms

12 ( 93 ms 95 ms 97 ms

13 ( 200 ms 94 ms 209 ms

14 ( 105 ms 101 ms 105 ms

15 ( 108 ms 102 ms 95 ms

16 ( 110 ms 97 ms 91 ms

17 ( 94 ms 96 ms 100 ms

18 ( 100 ms 94 ms 100 ms

19 ( 96 ms 100 ms 94 ms

20 ( 121 ms 100 ms 97 ms

21 * ( 105 ms 102 ms

In the above trace there are no routers between the source and the destination. Most of these routers have
a name, and all of them have addresses. For example, the name of router 8 is sl-dc-6-H2/0- and its address is Looking at the data provided for this same router, we
see that in the first of the three trials the roundtrip delay between the source and the router 8 was 19
msec. The roundtrip delays for the subsequent two trials were 18 and 24 msec. These roundtrip delays
include packet propagation delays, router processing delays, and queueing delays due to congestion in
the Internet. Because the congestion is varying with time, the roundtrip delay to a router n can actually be
longer than the roundtrip delay to router n+1. Note in the above example that there is a big jump in the
round-trip delay when going from router 10 to router 11. This is because the link between routers 10 and
11 is a transatlantic link.
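
The two observations above can be checked directly against the trace. The following is a minimal sketch, not part of traceroute itself: the round-trip samples are copied from the trace above for routers 6, 7, 10 and 11, and the helper names (`min_rtt`, `mean_rtt`, `jump`) are made up for illustration.

```python
# Round-trip samples in ms, copied from the trace above.
# Router 11 has only two samples because one probe was lost.
rtt_ms = {
    6: [16, 305, 192],
    7: [20, 196, 18],
    10: [19, 18, 20],
    11: [133, 94],
}

def min_rtt(hop):
    """Smallest observed round-trip time to a router, in ms."""
    return min(rtt_ms[hop])

def mean_rtt(hop):
    """Average observed round-trip time to a router, in ms."""
    samples = rtt_ms[hop]
    return sum(samples) / len(samples)

def jump(hop_a, hop_b):
    """Increase in minimum round-trip time from hop_a to hop_b, in ms."""
    return min_rtt(hop_b) - min_rtt(hop_a)

# Congestion makes the average delay to router 6 exceed that to router 7.
print(mean_rtt(6), mean_rtt(7))   # 171.0 78.0

# The minimum delay jumps sharply across the transatlantic link.
print(jump(10, 11))               # 76
```

Using the minimum rather than the average filters out most of the queueing delay, which is why the jump between routers 10 and 11 stands out so clearly.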

Want to try out traceroute from some other starting points besides Tel-Aviv and Dana Point? Then visit
Yahoo's List of sites offering route tracing.


[RFC 1393] G. Malkin, "Traceroute Using an IP Option," RFC 1393, January 1993.


Copyright Keith W. Ross and Jim Kurose 1996-1998
 Message Switching

                            Interactive Java Applet:
  Message Switching & Packet Switching
This interactive applet enables you to actually see why packet switching can have much smaller delays
than message switching when packets pass through store-and-forward switches. In this applet there are
four nodes: a source (node A), a destination (node B), and two store-and-forward switches. Each packet
sent from the source must be transmitted over three links before it reaches the destination. Each of these
links has a transmission rate of 4 Kbps and an optional propagation delay of one second.

Each small rectangle represents 1 Kbit of data. When you press Start, the rectangles are grouped into one
packet in the transmit buffer of the source. The packet is transmitted to the first switch, where it must be
stored before it is forwarded. The packet then continues towards the destination.

To simulate message switching, set the packet size equal to the message size. To simulate packet
switching, set the packet size to less than the message size. To examine the effect of link propagation
delays, check the appropriate boxes for optional propagation delays. For a variety of scenarios, it is
highly recommended that you calculate the end-to-end delay analytically and then verify your calculation
with the applet.
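
For the analytical calculation, the following sketch may help. It assumes the applet's setup as described above (three links at 4 Kbps each) plus several simplifications that are not stated in the text: no packet headers, no queueing from other traffic, and a message that divides into equal-sized packets. The function name and the 16 Kbit message size are hypothetical.

```python
def end_to_end_delay(message_bits, packet_bits, rate_bps=4_000,
                     links=3, prop_delay_s=0.0):
    """Seconds until the last bit of the message reaches the destination."""
    if message_bits % packet_bits != 0:
        raise ValueError("this sketch assumes equal-sized packets")
    packets = message_bits // packet_bits
    per_hop = packet_bits / rate_bps  # time to push one packet onto one link
    # The first packet needs `links` transmissions; with store-and-forward
    # pipelining, each later packet arrives one per-hop time after the
    # packet before it.
    return (links + packets - 1) * per_hop + links * prop_delay_s

MESSAGE = 16_000  # hypothetical 16 Kbit message

print(end_to_end_delay(MESSAGE, 16_000))  # message switching: 12.0
print(end_to_end_delay(MESSAGE, 4_000))   # four 4 Kbit packets: 6.0
print(end_to_end_delay(MESSAGE, 1_000))   # sixteen 1 Kbit packets: 4.5
print(end_to_end_delay(MESSAGE, 4_000, prop_delay_s=1.0))  # 9.0
```

Smaller packets let the three links work in parallel, which is where the reduction in delay comes from; the propagation delay is simply added once per link.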
  Access Networks and Physical Media

             1.5 Access Networks and Physical Media
In sections 1.3 and 1.4 we have examined the roles of end systems and routers in a network architecture. In this section we
consider the access network - the physical link(s) that connect an end system to its edge router, i.e., the first router on a path
from the end system to any other distant end system. Since access network technology is closely tied to physical media
technology (fiber, coaxial cable, twisted pair telephone wire, radio spectrum), we consider these two topics together in this
section.

1.5.1 Access Networks
Figure 1.5-1 shows the access networks' links highlighted in red.

                                                Figure 1.5-1: Access networks
Access networks can be loosely divided into three categories:

    •   residential access networks, connecting a home end system into the network;
    •   institutional access networks, connecting an end system in a business or educational institution into the network;
    •   mobile access networks, connecting a mobile end system into the network.

These categories are not hard and fast; some corporate end systems may well use the access network technology that we ascribe
to residential access networks, and vice versa. Our descriptions below are meant to hold for the common (if not every) case.

Residential Access Networks

A residential access network connects a home end system (typically a PC, but perhaps a Web TV or other residential system) to
an edge router. Probably the most common form of home access is using a modem over a POTS (plain old telephone system)
dialup line to an Internet service provider (ISP). The home modem converts the digital output of the PC into analog format for
transmission over the analog phone line. A modem in the ISP converts the analog signal back into digital form for input to the
ISP router. In this case, the "access network" is simply a point-to-point dialup link into an edge router. The point-to-point link
is your ordinary twisted-pair phone line. (We will discuss twisted pair later in this section.) Today's modem speeds allow dialup
access at rates up to 56 Kbps. However, due to the poor quality of twisted-pair line between many homes and ISPs, many users
get an effective rate significantly less than 56 Kbps. For an in-depth discussion of the practical aspects of modems see the
Institute for Global Communications (IGC) web page on Modems and Data Communications.

While dialup modems require conversion of the end system's digital data into analog form for transmission, so-called
narrowband ISDN technology (Integrated Services Digital Network) [Pacific Bell 1998] allows for all-digital transmission of
data from a home end system over ISDN "telephone" lines to a phone company central office. Although ISDN was originally
conceived as a way to carry digital data from one end of the phone system to another, it is also an important network access
technology that provides higher speed access (e.g., 128 Kbps) from the home into a data network such as the Internet. In this
case, ISDN can be thought of simply as a "better modem" [NAS 1995]. A good source for additional WWW information on
ISDN is Dan Kegel's ISDN page.

Dialup modems and narrowband ISDN are already widely deployed technologies. Two new technologies, Asymmetric Digital
Subscriber Line (ADSL) [ADSL 1998] and hybrid fiber coaxial cable (HFC) [Cable 1998] are currently being deployed.
ADSL is conceptually similar to dialup modems: it is a new modem technology again running over existing twisted pair
telephone lines, but can transmit at rates of up to about 8 Mbps from the ISP router to a home end system. The data rate in the
reverse direction, from the home end system to the central office router, is less than 1 Mbps. The asymmetry in the access
speeds gives rise to the term "Asymmetric" in ADSL. The asymmetry in the data rates reflects the belief that home users are
more likely to be consumers of information (bringing data into their homes) than producers of information.

ADSL uses frequency division multiplexing, as described in the previous section. In particular, ADSL divides the
communication link between the home and the ISP into three non-overlapping frequency bands:

             •   a high-speed downstream channel, in the 50 KHz to 1 MHz band;
             •   a medium-speed upstream channel, in the 4 KHz to 50 KHz band;
             •   and an ordinary POTS two-way telephone channel, in the 0 to 4 KHz band.
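
The band plan above can be written down as a small lookup table. This is only an illustrative sketch: the band edges are the ones listed above, but treating each band as a half-open interval (so that exactly 4 KHz counts as upstream) is an assumption made here, and the names `ADSL_BANDS` and `adsl_band` are hypothetical.

```python
# (low edge in Hz, high edge in Hz, channel) taken from the list above.
ADSL_BANDS = [
    (0, 4_000, "POTS telephone"),
    (4_000, 50_000, "upstream"),
    (50_000, 1_000_000, "downstream"),
]

def adsl_band(freq_hz):
    """Return the channel that a frequency falls into, or None if outside."""
    for low, high, name in ADSL_BANDS:
        if low <= freq_hz < high:  # half-open interval: an assumption
            return name
    return None

print(adsl_band(1_000))      # POTS telephone
print(adsl_band(20_000))     # upstream
print(adsl_band(300_000))    # downstream
```

Because the three bands do not overlap, a voice call and data traffic in both directions can share the same twisted pair at the same time.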

One of the features of ADSL is that the service allows the user to make an ordinary telephone call, using the POTS channel,
while simultaneously surfing the Web. This feature is not available with standard dialup modems. The actual amount of
downstream and upstream bandwidth available to the user is a function of the distance between the home modem and the ISP
modem, the gauge of the twisted pair line, and the degree of electrical interference. For a high-quality line with negligible
electrical interference, an 8 Mbps downstream transmission rate is possible if the distance between the home and the ISP is less
than 3,000 meters; the downstream transmission rate drops to about 2 Mbps for a distance of 6,000 meters. The upstream rate
ranges from 16 Kbps to 1 Mbps.

While ADSL, ISDN and dialup modems all use ordinary phone lines, HFC access networks are extensions of the current cable
network used for broadcasting cable television. In a traditional cable system, a cable head end station broadcasts through a
distribution of coaxial cable and amplifiers to residences. (We discuss coaxial cable later in this chapter.) As illustrated in
Figure 1.5-2, fiber optics (also to be discussed soon) connect the cable head end to neighborhood-level junctions, from which
traditional coaxial cable is then used to reach individual houses and apartments. Each neighborhood junction typically supports
500 to 5000 homes.

                                           Figure 1.5-2: A hybrid fiber-coax access network

As with ADSL, HFC requires special modems, called cable modems. Companies that provide cable Internet access require their
customers to either purchase or lease a modem. One such company is CyberCable, which uses Motorola's CyberSurfer Cable
Modem and provides high-speed Internet access to most of the neighborhoods in Paris. Typically, the cable modem is an
external device and connects to the home PC through a 10-BaseT Ethernet port. (We will discuss Ethernet in great detail in
Chapter 5.) Cable modems divide the HFC network into two channels, a downstream and an upstream channel. As with ADSL,
the downstream channel is typically allocated more bandwidth and hence a larger transmission rate. For example, the
downstream rate of the CyberCable system is 10 Mbps and the upstream rate is 768 Kbps. However, with HFC (and not with
ADSL), these rates are shared among the homes, as we discuss below.

One important characteristic of the HFC is that it is a shared broadcast medium. In particular, every packet sent by the headend
travels downstream on every link to every home; and every packet sent by a home travels on the upstream channel to the
headend. For this reason, if several users are receiving different Internet videos on the downstream channel, the actual rate at which
each user receives its video will be significantly less than the downstream rate. On the other hand, if all the active users are Web
surfing, then each of the users may actually receive Web pages at the full downstream rate, as a small collection of users will
rarely receive a Web page at exactly the same time. Because the upstream channel is also shared, packets sent by two different
homes at the same time will collide, which further decreases the effective upstream bandwidth. (We will discuss this collision
issue in some detail when we discuss Ethernet in Chapter 5.) Advocates of ADSL are quick to point out that ADSL is a point-to-
point connection between the home and ISP, and therefore all the ADSL bandwidth is dedicated rather than shared. Cable
advocates, however, argue that a reasonably dimensioned HFC network provides higher bandwidths than ADSL [@Home
1998]. The battle between ADSL and HFC for high speed residential access has clearly begun, e.g., [@Home 1998].
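
To get a feel for the shared-versus-dedicated argument, the sketch below divides the 10 Mbps CyberCable downstream rate quoted above evenly among users who are all receiving at the same moment. The even split is a simplifying assumption, not a description of how an HFC network actually schedules traffic, and the function name is hypothetical.

```python
def per_user_rate_bps(shared_rate_bps, active_users):
    """Even share of a broadcast channel among simultaneously active users."""
    if active_users < 1:
        raise ValueError("need at least one active user")
    return shared_rate_bps / active_users

HFC_DOWNSTREAM_BPS = 10_000_000  # CyberCable downstream rate from the text

for users in (1, 10, 50):
    share_mbps = per_user_rate_bps(HFC_DOWNSTREAM_BPS, users) / 1e6
    print(f"{users} simultaneous video streams: {share_mbps} Mbps each")
```

With bursty Web traffic few users are receiving at any instant, so the effective number of simultaneous users stays close to one and each user sees nearly the full rate.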

Enterprise Access Networks

 In enterprise access networks, a local area network (LAN) is used to connect an end system to an edge router. As we will see
in Chapter 5, there are many different types of LAN technology. However, Ethernet technology is currently by far the most
prevalent access technology in enterprise networks. Ethernet operates at 10 Mbps or 100 Mbps (and now even at 1 Gbps). It uses
either twisted-pair copper wire or coaxial cable to connect a number of end systems with each other and with an edge router.
The edge router is responsible for routing packets that have destinations outside of that LAN. Like HFC, Ethernet uses a shared
medium, so that end users share the transmission rate of the LAN. More recently, shared Ethernet technology has been
migrating towards switched Ethernet technology. Switched Ethernet uses multiple coaxial cable or twisted pair Ethernet
segments connected at a "switch" to allow the full bandwidth of an Ethernet to be delivered to different users on the same LAN
simultaneously [Cisco 1998]. We will explore shared and switched Ethernet in some detail in Chapter 5.

Mobile Access Networks

Mobile access networks use the radio spectrum to connect a mobile end system (e.g., a laptop PC or a PDA with a wireless
modem) to a base station, as shown in Figure 1.5-1. This base station, in turn, is connected to an edge router of a data network.

An emerging standard for wireless data networking is Cellular Digital Packet Data (CDPD) [Wireless 1998]. As the name
suggests, a CDPD network operates as an overlay network (i.e., as a separate, smaller "virtual" network, as a piece of the larger
network) within the cellular telephone network. A CDPD network thus uses the same radio spectrum as the cellular phone
system, and operates at speeds in the 10's of Kbits per second. As with cable-based access networks and shared Ethernet,
CDPD end systems must share the transmission media with other CDPD end systems within the cell covered by a base station.
A media access control (MAC) protocol is used to arbitrate channel sharing among the CDPD end systems; we will cover MAC
protocols in detail in Chapter 5.

The CDPD system supports the IP protocol, and thus allows an IP end system to exchange IP packets over the wireless channel
with an IP base station. A CDPD network can actually support multiple network layer protocols; in addition to IP, the ISO
CLNP protocol is also supported. CDPD does not provide for any protocols above the network layer. From an Internet
perspective, CDPD can be viewed as extending the Internet dialtone (i.e., the ability to transfer IP packets) across a wireless
link between a mobile end system and an Internet router. An excellent introduction to CDPD is [Waung 1998].

1.5.2 Physical Media
In the previous subsection we gave an overview of some of the most important access network technologies in the Internet.
While describing these technologies, we also indicated the physical media used. For example, we said that HFC uses a
combination of fiber cable and coaxial cable. We said that ordinary modems, ISDN, and ADSL use twisted-pair copper wire.
And we said that mobile access networks use the radio spectrum. In this subsection we provide a brief overview of these and
other transmission media that are commonly employed in the Internet.

In order to define what is meant by a "physical medium," let us reflect on the brief life of a bit. Consider a bit traveling from
one end system, through a series of links and routers, to another end system. This poor bit gets transmitted many, many times!
The source end-system first transmits the bit and shortly thereafter the first router in the series receives the bit; the first router
then transmits the bit and shortly afterwards the second router receives the bit, etc. Thus our bit, when traveling from source to
destination, passes through a series of transmitter-receiver pairs. For each transmitter-receiver pair, the bit is sent by
propagating electromagnetic waves across a physical medium. The physical medium can take many shapes and forms, and
does not have to be of the same type for each transmitter-receiver pair along the path. Examples of physical media include
twisted-pair copper wire, coaxial cable, multimode fiber optic cable, terrestrial radio spectrum and satellite radio spectrum.
Physical media fall into two categories: guided media and unguided media. With guided media, the waves are guided along a
solid medium, such as a fiber-optic cable, a twisted-pair copper wire or a coaxial cable. With unguided media, the waves
propagate in the atmosphere and in outer space, such as in a digital satellite channel or in a CDPD system.

Some Popular Physical Media

Suppose you want to wire a building to allow computers to access the Internet or an intranet -- should you use twisted-pair
copper wire, coaxial cable, or fiber optics? Which of these media gives the highest bit rates over the longest distances? We shall
address these questions below.

But before we get into the characteristics of the various guided medium types, let us say a few words about their costs. The
actual cost of the physical link (copper wire, fiber optic cable, etc.) is often relatively minor compared with the other
networking costs. In particular, the labor cost associated with the installation of the physical link can be orders of magnitude
higher than the cost of the material. For this reason, many builders install twisted pair, optical fiber, and coaxial cable to every
room in a building. Even if only one medium is initially used, there is a good chance that another medium could be used in the
near future, and so money is saved by not having to lay additional wires.

Twisted-Pair Copper Wire

The least-expensive and most commonly-used transmission medium is twisted-pair copper wire. For over one-hundred years it
has been used by telephone networks. In fact, more than 99% of the wired connections from the telephone handset to the local
telephone switch use twisted-pair copper wire. Most of us have seen twisted pair in our homes and work environments. Twisted
pair consists of two insulated copper wires, each about 1 mm thick, arranged in a regular spiral pattern; see Figure 1.5-3. The
wires are twisted together to reduce the electrical interference from similar pairs close by. Typically, a number of pairs are
bundled together in a cable by wrapping the pairs in a protective shield. A wire pair constitutes a single communication link.

                                                        Figure 1.5-3: Twisted Pair

Unshielded twisted pair (UTP) is commonly used for computer networks within a building, that is, for local area networks
(LANs). Data rates for LANs using twisted pair today range from 10 Mbps to 100 Mbps. The data rates that can be achieved
depend on the thickness of the wire and the distance between transmitter and receiver. Two types of UTP are common in LANs:
category 3 and category 5. Category 3 corresponds to voice-grade twisted pair, commonly found in office buildings. Office
buildings are often prewired with two or more parallel pairs of category 3 twisted pair; one pair is used for telephone
communication, and the additional pairs can be used for additional telephone lines or for LAN networking. 10 Mbps Ethernet,
one of the most prevalent LAN types, can use category 3 UTP. Category 5, with more twists per centimeter and Teflon
insulation, can handle higher bit rates. 100 Mbps Ethernet running on category 5 UTP has become very popular in recent years,
and category 5 UTP has become common for preinstallation in new office buildings.

When fiber-optic technology emerged in the 1980s, many people disparaged twisted-pair because of its relatively low bit rates.
Some people even felt that fiber optic technology would completely replace twisted pair. But twisted pair did not give up so
easily. Modern twisted-pair technology, such as category 5 UTP, can achieve data rates of 100 Mbps for distances up to a few
hundred meters. Even higher rates are possible over shorter distances. In the end, twisted-pair has emerged as the dominant
solution for high-speed LAN networking.

As discussed in Section 1.5.1, twisted-pair is also commonly used for residential Internet access. We saw that dial-up modem
technology enables access at rates of up to 56 Kbps over twisted pair. We also saw that ISDN is available in many communities,
providing access rates of about 128 Kbps over twisted pair. We also saw that ADSL (Asymmetric Digital Subscriber Line)
technology has enabled residential users to access the Web at rates in excess of 6 Mbps over twisted pair.


Coaxial Cable

Like twisted pair, coaxial cable consists of two copper conductors, but the two conductors are concentric rather than parallel.
With this construction and a special insulation and shielding, coaxial cable can have higher bit rates than twisted pair. Coaxial
cable comes in two varieties: baseband coaxial cable and broadband coaxial cable.

Baseband coaxial cable, also called 50-ohm cable, is about a centimeter thick, lightweight, and easy to bend. It is commonly
used in LANs; in fact, the computer you use at work or at school is probably connected to a LAN with either baseband coaxial
cable or with UTP. Take a look at the connection to your computer's interface card. If you see a telephone-like jack and
some wire that resembles telephone wire, you are using UTP; if you see a T-connector and a cable running out of both sides of
the T-connector, you are using baseband coaxial cable. The terminology "baseband" comes from the fact that the stream of bits
is dumped directly into the cable, without shifting the signal to a different frequency band. 10 Mbps Ethernets can use either
UTP or baseband coaxial cable. As we will discuss in Chapter 5, it is a little more expensive to use UTP for 10 Mbps
Ethernet, as UTP requires an additional networking device, called a hub.

Broadband coaxial cable, also called 75-ohm cable, is quite a bit thicker, heavier, and stiffer than the baseband variety. It was
once commonly used in LANs and can still be found in some older installations. For LANs, baseband cable is now preferable,
since it is less expensive, easier to physically handle, and does not require attachment cables. Broadband cable, however, is
quite common in cable television systems. As we saw in Section 1.5.1, cable television systems have recently been
coupled with cable modems to provide residential users with Web access at rates of 10 Mbps or higher. With broadband coaxial
cable, the transmitter shifts the digital signal to a specific frequency band, and the resulting analog signal is sent from the
transmitter to one or more receivers. Both baseband and broadband coaxial cable can be used as a guided shared medium.
Specifically, a number of end systems can be connected directly to the cable, and all the end systems receive whatever any one
of the computers transmits. We will look at this issue in more detail in Chapter 5.

Fiber Optics

An optical fiber is a thin, flexible medium that conducts pulses of light, with each pulse representing a bit. A single optical fiber
can support tremendous bit rates, up to tens or even hundreds of gigabits per second. Optical fibers are immune to electromagnetic
interference, have very low signal attenuation up to 100 kilometers, and are very hard to tap. These characteristics have made
fiber optics the preferred long-haul guided transmission medium, particularly for overseas links. Many of the long-distance
telephone networks in the United States and elsewhere now use fiber optics exclusively. Fiber optics is also prevalent in the
backbone of the Internet. However, the high cost of optical devices -- such as transmitters, receivers, and switches -- has
hindered their deployment for short-haul transport, such as in a LAN or into the home in a residential access network. AT&T
Labs provides an excellent site on fiber optics, including several nice animations.

Terrestrial and Satellite Radio Channels

Radio channels carry signals in the electromagnetic spectrum. They are an attractive medium because they require no physical "wire"
to be installed, can penetrate walls, provide connectivity to a mobile user, and can potentially carry a signal for long distances.
The characteristics of a radio channel depend significantly on the propagation environment and the distance over which a signal is
to be carried. Environmental considerations determine path loss and shadow fading (which decrease the signal strength as the signal
travels over a distance and around/through obstructing objects), multipath fading (due to signal reflection off of interfering
objects), and interference (due to other radio channels or electromagnetic signals).

Terrestrial radio channels can be broadly classified into two groups: those that operate as local area networks (typically
spanning 10's to a few hundred meters) and wide-area radio channels that are used for mobile data services (typically operating
within a metropolitan region). A number of wireless LAN products are on the market, operating in the 1 to 10's of Mbps range.
Mobile data services (such as the CDPD standard we touched on in Section 1.5.1) typically provide channels that operate at 10's
of Kbps. See [Goodman 1997] for a survey and discussion of the technology and products.

A communication satellite links two or more earth-based microwave transmitter/receivers, known as ground stations. The
satellite receives transmissions on one frequency band, regenerates the signal using a repeater (discussed below), and transmits
the signal on another frequency. Satellites can provide bandwidths in the gigabit per second range. Two types of satellites are
used in communications: geostationary satellites and low-altitude satellites.

Geostationary satellites permanently remain above the same spot on the Earth. This stationary presence is achieved by placing
the satellite in orbit at 36,000 kilometers above the Earth's surface. This huge distance from ground station through
satellite back to ground station introduces a substantial signal propagation delay of 250 milliseconds. Nevertheless, satellite
links are often used in telephone networks and in the backbone of the Internet.
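
The size of this delay can be checked with a quick calculation. The sketch below assumes the signal travels at 3*10^8 meters/sec and that both ground stations sit directly beneath the satellite, so the path is exactly twice the altitude; real ground stations are usually off to the side, which lengthens the path and is consistent with the somewhat larger 250 millisecond figure quoted above. The names are hypothetical.

```python
ALTITUDE_M = 36_000_000   # 36,000 km, from the text
SPEED_OF_LIGHT_MPS = 3e8  # assumed propagation speed

def geo_hop_delay_s(altitude_m=ALTITUDE_M, speed_mps=SPEED_OF_LIGHT_MPS):
    """Ground station -> satellite -> ground station propagation delay."""
    return 2 * altitude_m / speed_mps

print(f"{geo_hop_delay_s() * 1000:.0f} ms")  # prints "240 ms"
```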

Low-altitude satellites are placed much closer to the Earth and do not remain permanently above one spot on the Earth. They
rotate around the Earth just as the Moon rotates around the Earth. To provide continuous coverage to an area, many satellites need to
be placed in orbit. There are currently many low-altitude communication systems in development. The Iridium system, for
example, consists of 66 low-altitude satellites. Lloyd's satellite constellations provides and collects information on Iridium as
well as other satellite constellation systems. The low-altitude satellite technology may be used for Internet access sometime in
the future.



[@Home 1998] @Home, "xDSL vs. @HOME™'S Hybrid-fiber-coaxial (HFC) Cable Modem Network: the Facts," 1998.
[ADSL 1998] ADSL Forum, ADSL Tutorial, 1998.
[Cable 1998] Cable Data News, "Overview of Cable Modem Technology and Services," 1998.
[Cisco 1998] Cisco, "Designing Switched LAN Internetworks," 1998.
[Goodman 1997] D. Goodman (Chair), "The Evolution of Untethered Communications," National Academy Press, December
[NAS 1995] National Academy of Sciences, "The Unpredictable Certainty: Information Infrastructure Through 2000," 1995.
[Pacific Bell 1998] Pacific Bell, "ISDN Users Guide,"
[Waung 1998] W. Waung, "Wireless Mobile Data Networking The CDPD Approach," Wireless Data Forum, 1998.
[Wireless 1998] Wireless Data Forum, "CDPD System Specification Release 1.1," 1998

Copyright Keith W. Ross and Jim Kurose 1996-2000
 Delay and Loss in Packet-Switched Networks

    1.6 Delay and Loss in Packet-Switched Networks
Having now briefly considered the major "pieces" of the Internet architecture - the applications, end
systems, end-to-end transport protocols, routers, and links - let us now consider what can happen to a
packet as it travels from its source to its destination. Recall that a packet starts in a host (the source),
passes through a series of routers, and ends its journey in another host (the destination). As a packet
travels from one node (host or router) to the subsequent node (host or router) along this path, the packet
suffers from several different types of delays at each node along the path. The most important of these
delays are the nodal processing delay, queuing delay, transmission delay and propagation delay;
together, these delays accumulate to give a total nodal delay. In order to acquire a deep understanding of
packet switching and computer networks, we must understand the nature and importance of these delays.

                                              Figure 1.6-1: The delay through router A

Let us explore these delays in the context of Figure 1.6-1. As part of its end-to-end route between source
and destination, a packet is sent from the upstream node through router A to router B. Our goal is to
characterize the nodal delay at router A. Note that router A has three outbound links, one leading to
router B, another leading to router C, and yet another leading to router D. Each link is preceded by a queue
(also known as a buffer). When the packet arrives at router A (from the upstream node), router A
examines the packet's header to determine the appropriate outbound link for the packet, and then directs
the packet to the link. In this example, the outbound link for the packet is the one that leads to router B.
A packet can only be transmitted on a link if there is no other packet currently being transmitted on the
link and if there are no other packets preceding it in the queue; if the link is currently busy or if there are
other packets already queued for the link, the newly arriving packet will then join the queue.

The time required to examine the packet's header and determine where to direct the packet is part of the
processing delay. The processing delay can also include other factors, such as the time needed to check
for bit-level errors in the packet that occurred in transmitting the packet's bits from the upstream router to
router A. After this nodal processing, the router directs the packet to the queue that precedes the link to
router B. (In section 4.7 we will study the details of how a router operates.) At the queue, the packet
experiences a queuing delay as it waits to be transmitted onto the link. The queuing delay of a specific
packet will depend on the number of other, earlier-arriving packets that are queued and waiting for
transmission across the link; the delay of a given packet can vary significantly from packet to packet. If
the queue is empty and no other packet is currently being transmitted, then our packet's queuing delay is
zero. On the other hand, if the traffic is heavy and many other packets are also waiting to be transmitted,
the queuing delay will be long. We will see shortly that the number of packets that an arriving packet
might expect to find on arrival (informally, the average number of queued packets, which is proportional
to the average delay experienced by packets) is a function of the intensity and nature of the traffic
arriving to the queue.

Assuming that packets are transmitted in first-come-first-served manner, as is common in the Internet, our
packet can be transmitted once all the packets that have arrived before it have been transmitted. Denote
the length of the packet by L bits and denote the transmission rate of the link (from router A to router B)
by R bits/sec. The rate R is determined by the transmission rate of the link to router B. For example, for a 10
Mbps Ethernet link, the rate is R=10 Mbps; for a 100 Mbps Ethernet link, the rate is R=100 Mbps. The
transmission delay (also called the store-and-forward delay, as discussed in Section 1.4) is L/R. This is
the amount of time required to transmit all of the packet's bits into the link.
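
As a quick numerical illustration of L/R, the sketch below evaluates the transmission delay for the two Ethernet rates mentioned above. The packet length of 8,000 bits (1,000 bytes) is a made-up example value, and the function name is hypothetical.

```python
def transmission_delay_s(length_bits, rate_bps):
    """Time to push all of a packet's bits onto the link: L/R."""
    return length_bits / rate_bps

PACKET_BITS = 8_000  # hypothetical 1,000-byte packet

for rate_bps in (10e6, 100e6):  # 10 Mbps and 100 Mbps Ethernet
    delay_ms = transmission_delay_s(PACKET_BITS, rate_bps) * 1e3
    print(f"R = {rate_bps / 1e6:.0f} Mbps: L/R = {delay_ms:.2f} ms")
```

A tenfold increase in R cuts the transmission delay by a factor of ten, from 0.80 ms to 0.08 ms for this packet.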

Once a bit is pushed onto the link, it needs to propagate to router B. The time required to propagate from
the beginning of the link to router B is the propagation delay. The bit propagates at the propagation
speed of the link. The propagation speed depends on the physical medium of the link (i.e., multimode
fiber, twisted-pair copper wire, etc.) and is in the range of

                                              2*10^8 meters/sec to 3*10^8 meters/sec,

equal to, or a little less than, the speed of light. The propagation delay is the distance between two routers
divided by the propagation speed. That is, the propagation delay is d/s, where d is the distance between
router A and router B and s is the propagation speed of the link. Once the last bit of the packet propagates
to node B, it and all the preceding bits of the packet are stored in router B. The whole process then
continues with router B now performing the forwarding.

Newcomers to the field of computer networking sometimes have difficulty understanding the difference
between transmission delay and propagation delay. The difference is subtle but important. The
transmission delay is the amount of time required for the router to push out the packet; it is a function of
the packet's length and the transmission rate of the link, but has nothing to do with the distance between
the two routers. The propagation delay, on the other hand, is the time it takes a bit to propagate from one
router to the next; it is a function of the distance between the two routers, but has nothing to do with the
packet's length or the transmission rate of the link.
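
To make the distinction concrete, the following minimal sketch computes L/R and d/s side by side. The
packet size, link lengths and propagation speed are assumed values chosen only for illustration, and the
function names are hypothetical; only the two formulas come from the text.

```python
def transmission_delay(packet_bits, rate_bps):
    """Time to push every bit of the packet onto the link: L/R seconds."""
    return packet_bits / rate_bps


def propagation_delay(distance_m, speed_mps):
    """Time for a bit to travel the length of the link: d/s seconds."""
    return distance_m / speed_mps


# Assumed values, chosen only for illustration.
L = 12_000        # packet length in bits (a 1,500-byte packet)
s = 2.5e8         # propagation speed in meters/sec, inside the range quoted above

for R in (10e6, 100e6):                  # 10 Mbps and 100 Mbps Ethernet
    for d in (100.0, 1_000_000.0):       # a 100 m link and a 1,000 km link
        print(f"R = {R / 1e6:5.0f} Mbps, d = {d / 1000:7.1f} km: "
              f"d_trans = {transmission_delay(L, R) * 1e3:.3f} ms, "
              f"d_prop = {propagation_delay(d, s) * 1e3:.4f} ms")
```

Changing R alters only the transmission delay, and changing d alters only the propagation delay, which is
exactly the independence described above.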

An analogy might clarify the notions of transmission and propagation delay. Consider a highway which
has a toll booth every 100 kilometers. You can think of the highway segments between toll booths as
links and the toll booths as routers. Suppose that cars travel (i.e., propagate) on the highway at a rate of
100 km/hour (i.e., when a car leaves a toll booth it instantaneously accelerates to 100 km/hour and
maintains that speed between toll booths). Suppose that there is a caravan of 10 cars that are traveling
together, and that these ten cars follow each other in a fixed order. You can think of each car as a bit and
the caravan as a packet. Also suppose that each toll booth services (i.e., transmits) a car at a rate of one
car per 12 seconds, and that it is late at night so that the caravan's cars are the only cars on the highway.
Finally, suppose that whenever the first car of the caravan arrives at a toll booth, it waits at the entrance
until the nine other cars have arrived and lined up behind it. (Thus the entire caravan must be "stored" at
the toll booth before it can begin to be "forwarded".) The time required for the toll booth to push the
entire caravan onto the highway is (10 cars)/(5 cars/minute) = 2 minutes. This time is analogous to the
transmission delay in a router. The time required for a car to travel from the exit of one toll booth to the
next toll booth is 100 km/(100 km/hour) = 1 hour. This time is analogous to propagation delay.
Therefore the time from when the caravan is "stored" in front of a toll booth until the caravan is "stored"
in front of the next toll booth is the sum of the "transmission delay" and the "propagation delay" - in this
example, 62 minutes.

Let's explore this analogy a bit more. What would happen if the toll-booth service time for a caravan
were greater than the time for a car to travel between toll booths? For example, suppose cars travel at rate
1000 km/hr and the toll booth services cars at rate one car per minute. Then the traveling delay between
toll booths is 6 minutes and the time to serve a caravan is 10 minutes. In this case, the first few cars in the
caravan will arrive at the second toll booth before the last cars in the caravan leave the first toll booth. This
situation also arises in packet-switched networks - the first bits in a packet can arrive at a router while
many of the remaining bits in the packet are still waiting to be transmitted by the preceding router.
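
The arithmetic of both caravan scenarios can be checked with a short sketch. The function name and its
return layout are hypothetical; the numbers are the ones used in the analogy.

```python
def caravan_timeline(n_cars, seconds_per_car, segment_km, speed_kmh):
    """Return (service, travel, first_car_arrival) in minutes for one highway segment."""
    service_min = n_cars * seconds_per_car / 60       # "transmission delay" of the caravan
    travel_min = segment_km * 60 / speed_kmh          # "propagation delay" of one car
    # The first car leaves the booth after its own service time, then drives the segment.
    first_car_arrival_min = seconds_per_car / 60 + travel_min
    return service_min, travel_min, first_car_arrival_min


# Scenario 1: one car per 12 seconds, 100 km/hour.
service, travel, first = caravan_timeline(10, 12, 100, 100)
print(f"service {service} min + travel {travel} min = {service + travel} min")

# Scenario 2: one car per minute, 1000 km/hour.
service, travel, first = caravan_timeline(10, 60, 100, 1000)
print(f"first car reaches the next booth at minute {first}; "
      f"last car leaves the first booth at minute {service}")
```

In the second scenario the first car arrives before the last car has left, which is the overlap described
in the paragraph above.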

If we let dproc, dqueue, dtrans and dprop denote the processing, queuing, transmission and propagation
delays, then the total nodal delay is given by

                                              dnodal = dproc + dqueue + dtrans + dprop .

The contribution of these delay components can vary significantly. For example, dprop can be negligible
(e.g., a couple of microseconds) for a link connecting two routers on the same university campus;
however, dprop is hundreds of milliseconds for two routers interconnected by a geostationary satellite
link, and can be the dominant term in dnodal. Similarly, dtrans can range from negligible to significant.
Its contribution is typically negligible for transmission rates of 10 Mbps and higher (e.g., for LANs);
however, it can be hundreds of milliseconds for large Internet packets sent over 28.8 kbps modem links.
The processing delay, dproc , is often negligible; however, it strongly influences a router's maximum
throughput, which is the maximum rate at which a router can forward packets.
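
A rough sketch of how the four components add up in the situations just mentioned is given below. All
link lengths, the 1,500-byte packet size and the zero processing and queuing delays are assumptions made
for illustration; the satellite path length assumes roughly 36,000 km up and 36,000 km down.

```python
def nodal_delay(d_proc, d_queue, packet_bits, rate_bps, distance_m, speed_mps):
    """Return the four delay components (in seconds) and their sum, d_nodal."""
    d_trans = packet_bits / rate_bps
    d_prop = distance_m / speed_mps
    return {
        "d_proc": d_proc,
        "d_queue": d_queue,
        "d_trans": d_trans,
        "d_prop": d_prop,
        "d_nodal": d_proc + d_queue + d_trans + d_prop,
    }


L = 12_000  # assumed packet length in bits (1,500 bytes)

# (description, rate in bits/sec, distance in meters, propagation speed in meters/sec)
scenarios = [
    ("campus link, 10 Mbps, 500 m", 10e6, 500.0, 2.5e8),
    ("geostationary satellite, 10 Mbps, 72,000 km", 10e6, 72_000_000.0, 3e8),
    ("28.8 kbps modem, 5 km", 28_800.0, 5_000.0, 2.5e8),
]

for name, R, d, s in scenarios:
    parts = nodal_delay(0.0, 0.0, L, R, d, s)
    print(f"{name}: d_trans = {parts['d_trans'] * 1e3:.3f} ms, "
          f"d_prop = {parts['d_prop'] * 1e3:.3f} ms, "
          f"d_nodal = {parts['d_nodal'] * 1e3:.3f} ms")
```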

Queuing Delay

The most complicated and interesting component of nodal delay is the queuing delay dqueue. In fact,
queuing delay is so important and interesting in computer networking that thousands of papers and
numerous books have been written about it [Bertsekas 1992] [Daigle 1991] [Kleinrock 1975]
[Kleinrock 1976] [Ross 1995]! We only give a high-level, intuitive discussion of queuing delay here; the
more curious reader may want to browse through some of the books (or even eventually write a Ph.D.
thesis on the subject!). Unlike the other three delays (namely, dproc , dtrans and dprop ), the queuing delay
can vary from packet to packet. For example, if ten packets arrive to an empty queue at the same time,
the first packet transmitted will suffer no queuing delay, while the last packet transmitted will suffer a
relatively large queuing delay (while it waits for the other nine packets to be transmitted). Therefore,
when characterizing queuing delay, one typically uses statistical measures, such as average queuing
delay, variance of queuing delay and the probability that the queuing delay exceeds some specified value.

When is the queuing delay big and when is it insignificant? The answer to this question depends largely
on the rate at which traffic arrives to the queue, the transmission rate of the link, and the nature of the
arriving traffic, i.e., whether the traffic arrives periodically or whether it arrives in bursts. To gain some
insight here, let a denote the average rate at which packets arrive to the queue (a is in units of packets/sec).
Recall that R is the transmission rate, i.e., it is the rate (in bits/sec) at which bits are pushed out of the
queue. Also suppose, for simplicity, that all packets consist of L bits. Then the average rate at which bits
arrive to the queue is La bits/sec. Finally, assume that the queue is very big, so that it can hold essentially
an infinite number of bits. The ratio La/R, called the traffic intensity, often plays an important role in
estimating the extent of the queuing delay. If La/R > 1, then the average rate at which bits arrive to the
queue exceeds the rate at which the bits can be transmitted from the queue. In this unfortunate situation,
the queue will tend to increase without bound and the queuing delay will approach infinity! Therefore,
one of the golden rules in traffic engineering is: design your system so that the traffic intensity is no
greater than one.

Now consider the case La/R ≤ 1. Here, the nature of the arriving traffic impacts the queuing delay. For
example, if packets arrive periodically, i.e., one packet arrives every L/R seconds, then every packet will
arrive to an empty queue and there will be no queuing delay. On the other hand, if packets arrive in
bursts but periodically, there can be a significant average queuing delay. For example, suppose N packets
arrive at the same time every (L/R)N seconds. Then the first packet transmitted has no queuing delay; the
second packet transmitted has a queuing delay of L/R seconds; and more generally, the nth packet
transmitted has a queuing delay of (n-1)L/R seconds. We leave it as an exercise for the reader to calculate
the average queuing delay in this example.
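
Readers who want to check their answer to this exercise numerically can use the sketch below. It simply
lists the per-packet delays (n-1)L/R given above and averages them; the values of N, L and R are assumed
for illustration and the function names are hypothetical.

```python
def traffic_intensity(packet_bits, arrival_rate_pps, rate_bps):
    """Traffic intensity La/R (dimensionless)."""
    return packet_bits * arrival_rate_pps / rate_bps


def burst_queuing_delays(n_packets, packet_bits, rate_bps):
    """Queuing delay of each packet in a burst: the nth packet waits (n-1)L/R seconds."""
    return [(n - 1) * packet_bits / rate_bps for n in range(1, n_packets + 1)]


# Assumed values for illustration.
N, L, R = 10, 12_000, 1e6

delays = burst_queuing_delays(N, L, R)
average_delay = sum(delays) / N
print(f"average queuing delay = {average_delay * 1e3:.1f} ms")

# N packets every (L/R)N seconds corresponds to a = R/L packets/sec, so La/R = 1.
a = N / ((L / R) * N)
print(f"traffic intensity = {traffic_intensity(L, a, R):.2f}")
```

The average printed by the sketch can be compared against a closed-form expression derived by hand.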

The two examples described above of periodic arrivals are a bit academic. Typically the arrival process
to a queue is random, i.e., the arrivals do not follow any pattern; packets are spaced apart by random
amounts of time. In this more realistic case, the quantity La/R is not usually sufficient to fully
characterize the delay statistics. Nonetheless, it is useful in gaining an intuitive understanding of the
extent of the queuing delay. In particular, if traffic intensity is close to zero, then packets are pushed out
at a rate much higher than the packet arrival rate; therefore, the average queuing delay will be close to
zero. On the other hand, when the traffic intensity is close to 1, there will be intervals of time when the
arrival rate exceeds the transmission capacity (due to the burstiness of arrivals), and a queue will form.
As the traffic intensity approaches 1, the average queue length gets larger and larger. The qualitative
dependence of average queuing delay on the traffic intensity is shown in Figure 1.6-2 below.

One important aspect of Figure 1.6-2 is the fact that as the traffic intensity approaches 1, the average
queuing delay increases rapidly. A small percentage increase in the intensity will result in a much larger
percentage-wise increase in delay. Perhaps you have experienced this phenomenon on the highway. If
you regularly drive on a road that is typically congested, the fact that the road is typically congested
means that its traffic intensity is close to 1. If some event causes an even slightly-larger-than-usual
amount of traffic, the delays you experience can be huge.

                      Figure 1.6-2: Dependence of average queuing delay on traffic intensity.
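
The qualitative shape of this curve can be reproduced with a small simulation. The text does not commit
to a particular arrival model, so the sketch below makes its own assumptions: packets arrive at random
with exponentially distributed gaps, every packet takes the same time L/R to transmit, and the queue is
unbounded. The function name, seed and packet count are hypothetical choices.

```python
import random


def average_queuing_delay(intensity, n_packets=100_000, service_time=1.0, seed=1):
    """Simulate one FIFO queue and return the mean queuing delay.

    Assumptions (not from the text): exponential gaps between arrivals,
    a fixed transmission time L/R per packet, and an infinite buffer.
    The delay is returned in the same time unit as service_time.
    """
    rng = random.Random(seed)
    arrival_rate = intensity / service_time   # a = (La/R) / (L/R)
    wait = 0.0                                # queuing delay of the current packet
    total = 0.0
    for _ in range(n_packets):
        total += wait
        gap = rng.expovariate(arrival_rate)   # time until the next arrival
        # The next packet waits for whatever work is still unfinished when it arrives.
        wait = max(0.0, wait + service_time - gap)
    return total / n_packets


for rho in (0.1, 0.5, 0.9, 0.99):
    print(f"La/R = {rho:4.2f}: average queuing delay = "
          f"{average_queuing_delay(rho):.2f} x (L/R)")
```

The printed delays stay small at low intensity and grow sharply as La/R approaches 1, in line with the
figure.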

Packet Loss

In our discussions above, we have assumed that the queue is capable of holding an infinite number of
packets. In reality a queue preceding a link has finite capacity, although the queuing capacity greatly
depends on the switch design and cost. Because the queue capacity is finite, packet delays do not really
approach infinity as the traffic intensity approaches one. Instead, a packet can arrive to find a full queue.
With no place to store such a packet, a router will drop that packet; that is, the packet will be lost. From
an end-system viewpoint, this will look like a packet having been transmitted into the network core, but
never emerging from the network at the destination. The fraction of lost packets increases as the traffic
intensity increases. Therefore, performance at a node is often measured not only in terms of delay, but
also in terms of the probability of packet loss. As we shall discuss in the subsequent chapters, a lost
packet may be retransmitted on an end-to-end basis, either by the application or by the transport layer.
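
To see how the fraction of lost packets grows with the traffic intensity, one can evaluate a standard
queuing model. The sketch below is only an illustration under strong assumptions that the text does not
make: Poisson arrivals, exponentially distributed transmission times, and room for at most K packets in
the system (the classical M/M/1/K queue). The function name is hypothetical.

```python
def loss_probability(intensity, capacity):
    """Probability that an arriving packet finds the system full (M/M/1/K model).

    intensity: traffic intensity La/R
    capacity:  K, the maximum number of packets held, including the one in transmission
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    if intensity == 1.0:
        return 1.0 / (capacity + 1)
    return ((1 - intensity) * intensity ** capacity
            / (1 - intensity ** (capacity + 1)))


K = 10  # assumed capacity, for illustration only
for rho in (0.5, 0.8, 0.95, 1.0, 1.2):
    print(f"La/R = {rho:4.2f}: loss probability = {loss_probability(rho, K):.4f}")
```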

End-to-End Delay

Our discussion up to this point has been focused on the nodal delay, i.e., the delay at a single router. Let
us conclude our discussion by briefly considering the delay from source to destination. To get a handle
on this concept, suppose there are Q-1 routers between the source host and the destination host. Let us
also suppose that the network is uncongested (so that queuing delays are negligible), the processing delay
at each router and at the source host is dproc, the transmission rate out of each router and out of the
source host is R bits/sec, and the propagation delay between each pair of routers and between the source
host and the first router is dprop. The nodal delays accumulate and give an end-to-end delay,

                                              dendend = Q (dproc + dtrans + dprop) ,

where once again dtrans = L/R, where L is the packet size. We leave it to the reader to generalize this
formula to the case of heterogeneous delays at the nodes and to the presence of an average queuing delay
at each node.
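
The formula above, and one possible way to generalize it, can be sketched as follows. The second function
is only one reading of the generalization left to the reader: it assumes the per-node delays simply add
up, and its tuple layout is a hypothetical choice.

```python
def end_to_end_delay(Q, d_proc, packet_bits, rate_bps, d_prop):
    """Homogeneous case from the text: Q * (d_proc + L/R + d_prop)."""
    return Q * (d_proc + packet_bits / rate_bps + d_prop)


def end_to_end_delay_general(hops, packet_bits):
    """Heterogeneous case (an assumed generalization).

    hops: one (d_proc, d_queue, rate_bps, d_prop) tuple per node on the path,
          starting with the source host.
    """
    return sum(d_proc + d_queue + packet_bits / rate_bps + d_prop
               for d_proc, d_queue, rate_bps, d_prop in hops)


# Assumed values: Q - 1 = 2 routers, 12,000-bit packets, 10 Mbps links.
Q, L = 3, 12_000
same = end_to_end_delay(Q, 20e-6, L, 10e6, 1e-3)
general = end_to_end_delay_general([(20e-6, 0.0, 10e6, 1e-3)] * Q, L)
print(f"{same * 1e3:.3f} ms, {general * 1e3:.3f} ms")
```

With identical hops and no queuing delay the two functions agree, as expected.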



[Bertsekas 1992] D. Bertsekas and R. Gallager, Data Networks, 2nd Edition, Prentice Hall, Englewood
Cliffs, N.J., 1992.
[Daigle 1991] J.N. Daigle, Queuing Theory for Telecommunications, Addison-Wesley, Reading,
Massachusetts, 1991.
[Kleinrock 1975] L. Kleinrock, Queuing Systems, Volume 1, John Wiley, New York, 1975.
[Kleinrock 1976] L. Kleinrock, Queuing Systems, Volume 2, John Wiley, New York, 1976.
[Ross 1995] K.W. Ross, Multiservice Loss Models for Broadband Telecommunication Networks,
Springer, Berlin, 1995.

Copyright Keith W. Ross and James F. Kurose 1996-2000

       1.7 Protocol Layers and Their Service Models
From our discussion thus far, it is apparent that the Internet is an extremely complicated system. We have seen that there are
many "pieces" to the Internet: numerous applications and protocols, various types of end systems and connections between end
systems, routers, and various types of link-level media. Given this enormous complexity, is there any hope of organizing
network architecture, or at least our discussion of network architecture? Fortunately, the answer to both questions is "yes."

Before attempting to organize our thoughts on Internet architecture, let's look for a human analogy. Actually, we deal with
complex systems all the time in our everyday life. Imagine if someone asked you to describe, for example, the airline system.
How would you find the structure to describe this complex system that has ticketing agents, baggage checkers, gate personnel,
pilots and airplanes, air traffic control, and a worldwide system for routing airplanes? One way to describe this system might be
to describe the series of actions you take (or others take for you) when you fly on an airline. You purchase your ticket, check
your bags, go to the gate and eventually get loaded onto the plane. The plane takes off and is routed to its destination. After
your plane lands, you de-plane at the gate and claim your bags. If the trip was bad, you complain about the flight to the ticket
agent (getting nothing for your effort). This scenario is shown in Figure 1.7-1.

                                             Figure 1.7-1: Taking an airplane trip: actions

Already, we can see some analogies here with computer networking: you are being shipped from source to destination by the
airline; a packet is shipped from source host to destination host in the Internet. But this is not quite the analogy we are after.
We are looking for some structure in Figure 1.7-1. Looking at Figure 1.7-1, we note that there is a ticketing function at each
end; there is also a baggage function for already ticketed passengers, and a gate function for already-ticketed and already-
baggage-checked passengers. For passengers who have made it through the gate (i.e., passengers who are already ticketed,
baggage-checked, and through the gate), there is a takeoff and landing function, and while in flight, there is an airplane routing
function. This suggests that we can look at the functionality in Figure 1.7-1 in a horizontal manner, as shown in Figure 1.7-2.

                                         Figure 1.7-2: horizontal "layering" of airline functionality

Figure 1.7-2 has divided the airline functionality into layers, providing a framework in which we can discuss airline travel.
Now, when we want to describe a part of airline travel we can talk about a specific, well-defined component of airline travel.
For example, when we discuss gate functionality, we know we are discussing functionality that sits "below" baggage handling,
and "above" takeoff and landing. We note that each layer, combined with the layers below it, implements some functionality,
some service. At the ticketing layer and below, airline-counter-to-airline-counter transfer of a person is accomplished. At the
baggage layer and below, baggage-check-to-baggage-claim transfer of a person and their bags is accomplished. Note that the
baggage layer provides this service only to an already ticketed person. At the gate layer, departure-gate-to-arrival-gate transfer
of a person and their bags is accomplished. At the takeoff/landing layer, runway-to-runway transfer of a person (actually, many
people) and their bags, is accomplished. Each layer provides its functionality (service) by (i) performing certain actions within
that layer (e.g., at the gate layer, loading and unloading people from an airplane) and by (ii) using the services of the layer
directly below it (e.g., in the gate layer, using the runway-to-runway passenger transfer service of the takeoff/landing layer).

As noted above, a layered architecture allows us to discuss a well-defined, specific part of a large and complex system. This
itself is of considerable value. When a system has a layered structure it is also much easier to change the implementation of the
service provided by the layer. As long as the layer provides the same service to the layer above it, and uses the same services
from the layer below it, the remainder of the system remains unchanged when a layer's implementation is changed. (Note that
changing the implementation of a service is very different from changing the service itself!) For example, if the gate functions
were changed (e.g., to have people board and disembark by height), the remainder of the airline system would remain
unchanged since the gate layer still provides the same function (loading and unloading people); it simply implements that
function in a different manner after the change. For large and complex systems that are constantly being updated, the ability to
change the implementation of a service without affecting other components of the system is another important advantage of
layering.

But enough with airlines. Let's now turn our attention to network protocols. To reduce design complexity, network designers
organize protocols -- and the network hardware and software that implements the protocols -- in layers. With a layered
protocol architecture, each protocol belongs to one of the layers. It's important to realize that a protocol in layer n is distributed
among the network entities (including end systems and packet switches) that implement that protocol, just as the functions in
our layered airline architecture were distributed between the departing and arriving airports. In other words, there's a "piece" of
layer n in each of the network entities. These "pieces" communicate with each other by exchanging layer-n messages. These
messages are called layer-n protocol data units, or more commonly n-PDUs. The contents and format of an n-PDU, as well as
the manner in which the n-PDUs are exchanged among the network elements, are defined by a layer-n protocol. When taken
together, the protocols of the various layers are called the protocol stack.

 When layer n of Host A sends an n-PDU to layer n of Host B, layer n of Host A passes the n-PDU to layer n-1 and then lets
layer n-1 deliver the n-PDU to layer n of B; thus layer n is said to rely on layer n-1 to deliver its n-PDU to the destination. A
key concept is that of the service model of a layer. Layer n-1 is said to offer services to layer n. For example, layer n-1 might
guarantee that the n-PDU will arrive without error at layer n in the destination within one second, or it might only guarantee
that the n-PDU will eventually arrive at the destination without any assurances about error.

The concept of protocol layering is fairly abstract and is sometimes difficult to grasp at first. This concept will become clear
as we study the Internet layers and their constituent protocols in greater detail. But let us now try to shed some light on
protocol layering and protocol stacks with an example. Consider a network which organizes its communication protocols in four
layers. Because there are four layers, there are four types of PDUs: 1-PDUs, 2-PDUs, 3-PDUs and 4-PDUs. As shown in
Figure 1.7-3, the application, operating at the highest layer, layer 4, creates a message, M. Any message created at this highest
layer is a 4-PDU. The message M itself may consist of many different fields (in much the same way as a structure or record in a
programming language may contain different fields); it is up to the application to define and interpret the fields in the message.
The fields might contain the name of the sender, a code indicating the type of the message, and some additional data.

Within the source host, the entire message M is then "passed" down the protocol stack to layer 3. In the example
in Figure 1.7-3, layer 3 in the source host divides a 4-PDU, M, into two parts, M1 and M2. Layer 3 in the source host then
adds so-called headers to M1 and M2 to create two layer-3 PDUs. Headers contain the additional information needed by the
sending and receiving sides of layer 3 to implement the service that layer 3 provides to layer 4. The procedure continues in the
source, adding another header at each layer, until the 1-PDUs are created. The 1-PDUs are sent out of the source host onto a
physical link. At the other end, the destination host receives 1-PDUs and directs them up the protocol stack. At each layer, the
corresponding header is removed. Finally, M is reassembled from M1 and M2 and then passed on to the application.

                               Figure 1.7-3: different PDUs at different layers in the protocol architecture

Note that in Figure 1.7-3, layer n uses the services of layer n-1. For example, once layer 4 creates the message M, it passes the
message down to layer 3 and relies on layer 3 to deliver the message to layer 4 at the destination.
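
The walk through the four-layer stack above can be sketched in a few lines. This is a toy model, not a
real protocol: the header format "[H3:1]", the even split of M and all function names are assumptions
made purely for illustration.

```python
def add_header(layer, payload, info=""):
    """Prepend a toy header such as "[H3:1]" to a PDU."""
    return f"[H{layer}{info}]{payload}"


def strip_header(layer, pdu):
    """Remove the outermost header and return (header_info, payload)."""
    end = pdu.index("]")
    header = pdu[1:end]
    if not header.startswith(f"H{layer}"):
        raise ValueError(f"expected a layer-{layer} header, got {header!r}")
    return header[len(f"H{layer}"):], pdu[end + 1:]


def send(message):
    """Source host: layer 4 creates M, layer 3 splits it, each lower layer adds a header."""
    half = (len(message) + 1) // 2
    parts = [message[:half], message[half:]]                  # M1 and M2
    pdus = [add_header(3, part, f":{i}") for i, part in enumerate(parts, start=1)]
    for layer in (2, 1):
        pdus = [add_header(layer, pdu) for pdu in pdus]
    return pdus                                               # the 1-PDUs


def receive(pdus):
    """Destination host: strip headers bottom-up, then reassemble M from M1 and M2."""
    for layer in (1, 2):
        pdus = [strip_header(layer, pdu)[1] for pdu in pdus]
    parts = dict(strip_header(3, pdu) for pdu in pdus)        # {":1": M1, ":2": M2}
    return "".join(parts[key] for key in sorted(parts))


one_pdus = send("hello, world")
print(one_pdus)
print(receive(one_pdus))
```

The sequence numbers carried in the layer-3 headers are what allow the receiving side to put M1 and M2
back in order, an example of the "additional information" that headers carry.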

Interestingly enough, this notion of relying on lower layer services is prevalent in many other forms of communication. For
example, consider ordinary postal mail. When you write a letter, you include envelope information such as the destination
address and the return address with the letter. The letter along with the address information can be considered a PDU at the
highest layer of the protocol stack. You then drop the PDU in a mailbox. At this point, the letter is out of your hands. The
postal service may then add some of its own internal information onto your letter, essentially adding a header to your letter. For
example, in the United States a barcode is often printed on your letter.

Once you drop your envelope into a mailbox, you rely on the services of the postal service to deliver the letter to the correct
destination in a timely manner. For example, you don't worry about whether a postal truck will break down while carrying the
letter. Instead the postal service takes care of this, presumably with well-defined plans to recover from such failures.
Furthermore, within the postal service itself there are layers, and the protocols at one layer rely on and use the services of the
layer below.

In order for one layer to interoperate with the layer below it, the interfaces between the two layers must be precisely defined.
Standards bodies define precisely the interfaces between adjacent layers (e.g., the format of the PDUs passed between the
layers) and permit the developers of networking software and hardware to implement the interior of the layers as they please.
Therefore, if a new and improved implementation of a layer is released, the new implementation can replace the old
implementation and, in theory, the layers will continue to interoperate.

In a computer network, each layer may perform one or more of the following generic set of tasks:

     •   Error control, which makes the logical channel between the layers in two peer network elements more reliable.
     •   Flow control, which avoids overwhelming a slower peer with PDUs.
     •   Segmentation and Reassembly, which at the transmitting side divides large data chunks into smaller pieces; and at the
         receiving side reassembles the smaller pieces into the original large chunk.
     •   Multiplexing, which allows several higher-level sessions to share a single lower-level connection.
     •   Connection setup, which provides the handshaking with a peer.

Protocol layering has conceptual and structural advantages. We mention, however, that some researchers and networking
engineers are vehemently opposed to layering [Wakeman 1992]. One potential drawback of layering is that one layer may
duplicate lower-layer functionality. For example, many protocol stacks provide error recovery on both a link basis and an
end-to-end basis. A second potential drawback is that functionality at one layer may need information (e.g., a timestamp value) that is
present only in another layer; this violates the goal of separation of layers.

1.7.1 The Internet Protocol Stack
The Internet stack consists of five layers: the physical, data link, network, transport and application layers. Rather than use the
cumbersome terminology n-PDU for each of the five layers, we instead give special names to the PDUs in four of the five
layers: frame, datagram, segment, and message. We avoid naming a data unit for the physical layer, as no name is commonly
used at this layer. The Internet stack and the corresponding PDU names are illustrated in Figure 1.7-4.

                                             Figure 1.7-4: The protocol stack, and protocol data units

A protocol layer can be implemented in software, in hardware, or using a combination of the two. Application-layer protocols --
such as HTTP and SMTP -- are almost always implemented in software in the end systems; so are transport layer protocols.
Because the physical layer and data link layers are responsible for handling communication over a specific link, they are
typically implemented in a network interface card (e.g., Ethernet or ATM interface cards) associated with a given link. The
network layer is often a mixed implementation of hardware and software.

We now summarize the Internet layers and the services they provide:

     •   Application layer: The application layer is responsible for supporting network applications. The application layer
         includes many protocols, including HTTP to support the Web, SMTP to support electronic mail, and FTP to support file
         transfer. We shall see in Chapter 2 that it is very easy to create our own new application-layer protocols.

     •   Transport layer: The transport layer is responsible for transporting application-layer messages between the client and
         server sides of an application. In the Internet there are two transport protocols, TCP and UDP, either of which can
         transport application-layer messages. TCP provides a connection-oriented service to its applications. This service
         includes guaranteed delivery of application-layer messages to the destination and flow control (i.e., sender/receiver
         speed matching). TCP also segments long messages into shorter segments and provides a congestion control
         mechanism, so that a source throttles its transmission rate when the network is congested. The UDP protocol provides its
         applications a connectionless service, which (as we saw in section 1.3) is very much a no-frills service.

     •   Network layer: The network layer is responsible for routing datagrams from one host to another. The Internet's network
         layer has two principal components. First it has a protocol that defines the fields in the IP datagram as well as how the
         end systems and routers act on these fields. This protocol is the celebrated IP protocol. There is only one IP protocol,
         and all Internet components that have a network layer must run the IP protocol. The Internet's network layer also
         contains routing protocols that determine the routes that datagrams take between sources and destinations. The Internet
         has many routing protocols. As we saw in section 1.4, the Internet is a network of networks and within a network, the
         network administrator can run any routing protocol desired. Although the network layer contains both the IP protocol
         and numerous routing protocols, it is often simply referred to as the IP layer, reflecting the fact that IP is the glue that
         binds the Internet together.

         The Internet transport layer protocols (TCP and UDP) in a source host pass a transport layer segment and a destination
         address to the IP layer, just as you give the postal service a letter with a destination address. The IP layer then provides
         the service of routing the segment to its destination. When the packet arrives at the destination, IP passes the segment to
         the transport layer within the destination.

     •   Link layer: The network layer routes a packet through a series of packet switches (i.e., routers) between the source and
         destination. To move a packet from one node (host or packet switch) to the next node in the route, the network layer
         must rely on the services of the link layer. In particular, at each node IP passes the datagram to the link layer, which
         delivers the datagram to the next node along the route. At this next node, the link layer passes the IP datagram to the
         network layer. The process is analogous to the postal worker at a mailing center who puts a letter into a plane, which will
         deliver the letter to the next postal center along the route. The services provided at the link layer depend on the specific
         link-layer protocol that is employed over the link. For example, some protocols provide reliable delivery on a link basis,
         i.e., from transmitting node, over one link, to receiving node. Note that this reliable delivery service is different from the
         reliable delivery service of TCP, which provides reliable delivery from one end system to another. Examples of link
         layers include Ethernet and PPP; in some contexts, ATM and frame relay can be considered link layers. As datagrams
         typically need to traverse several links to travel from source to destination, a datagram may be handled by different
         link-layer protocols at different links along its route. For example, a datagram may be handled by Ethernet on one link and
         then PPP on the next link. IP will receive a different service from each of the different link-layer protocols.

     •   Physical layer: While the job of the link layer is to move entire frames from one network element to an adjacent
         network element, the job of the physical layer is to move the individual bits within the frame from one node to the next.
         The protocols in this layer are again link dependent, and further depend on the actual transmission medium of the link
         (e.g., twisted-pair copper wire, single mode fiber optics). For example, Ethernet has many physical layer protocols: one
         for twisted-pair copper wire, another for coaxial cable, another for fiber, etc. In each case, a bit is moved across the link
         in a different way.
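
The nesting of the four named PDUs can be pictured with a small sketch. The class and function names and
the header strings are hypothetical placeholders; real headers carry many fields that are covered in
later chapters. The sketch also shows a datagram being re-framed when it moves from an Ethernet link to a
PPP link, as described in the link-layer summary above.

```python
from dataclasses import dataclass


@dataclass
class Segment:            # transport-layer PDU
    header: str
    payload: str          # the application-layer message


@dataclass
class Datagram:           # network-layer PDU
    header: str
    payload: Segment


@dataclass
class Frame:              # link-layer PDU
    header: str
    payload: Datagram


def encapsulate(message, link):
    """Wrap an application-layer message for transmission over the given first link."""
    segment = Segment("TCP header", message)
    datagram = Datagram("IP header", segment)
    return Frame(f"{link} header", datagram)


def forward(frame, next_link):
    """At a router: discard the old frame header and re-frame the same datagram."""
    return Frame(f"{next_link} header", frame.payload)


first_hop = encapsulate("GET /index.html", "Ethernet")
second_hop = forward(first_hop, "PPP")
print(first_hop)
print(second_hop)
```

Only the frame changes from link to link; the datagram, and the segment and message inside it, travel
unchanged from source to destination in this simplified picture.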

If you examine the Table Of Contents, you will see that we have roughly organized this book using the layers of the Internet
protocol stack. We take a top-down approach, first covering the application layer and then proceeding downwards.

1.7.2 Network Entities and Layers
The most important network entities are end systems and packet switches. As we shall discuss later in this book, there are
two types of packet switches: routers and bridges. We presented an overview of routers in the earlier sections. Bridges will be
discussed in detail in Chapter 5 whereas routers will be covered in more detail in Chapter 4. Similar to end systems, routers and
bridges organize the networking hardware and software into layers. But routers and bridges do not implement all of the layers
in the protocol stack; they typically only implement the bottom layers. As shown in Figure 1.7-5, bridges implement layers 1
and 2; routers implement layers 1 through 3. This means, for example, that Internet routers are capable of implementing the IP
protocol (a layer 3 protocol), while bridges are not. We will see later that while bridges do not recognize IP addresses, they are
capable of recognizing layer 2 addresses, such as Ethernet addresses. Note that hosts implement all five layers; this is
consistent with the view that the Internet architecture puts much of its complexity at the "edges" of the network. Repeaters, yet
another kind of network entity to be discussed in Chapter 5, implement only layer 1 functionality.

  Figure 1.7-5: Hosts, routers and bridges - each contain a different set of layers, reflecting their differences in functionality
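
The division of labor in Figure 1.7-5 can be summarized in a few lines of Python. This is only an illustrative sketch: the
names LAYERS, TOP_LAYER, implemented_layers and can_process are our own (they come from no library), and the numbering
assumes the five-layer Internet stack described above, with 1 = physical and 5 = application.

```python
# Illustrative sketch only; all names here are hypothetical.
LAYERS = {1: "physical", 2: "link", 3: "network", 4: "transport", 5: "application"}

# Highest layer implemented by each kind of network entity, as stated in the text.
TOP_LAYER = {"repeater": 1, "bridge": 2, "router": 3, "host": 5}


def implemented_layers(entity):
    """Return the names of the layers an entity implements, bottom up."""
    return [LAYERS[n] for n in range(1, TOP_LAYER[entity] + 1)]


def can_process(entity, layer_number):
    """True if the entity implements the given layer (for example, 3 for IP)."""
    return layer_number <= TOP_LAYER[entity]


if __name__ == "__main__":
    for entity in ("repeater", "bridge", "router", "host"):
        print(entity, implemented_layers(entity))
```

For instance, can_process("router", 3) is true while can_process("bridge", 3) is false, which restates the observation
that routers implement IP and bridges do not.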


[Wakeman 1992] Ian Wakeman, Jon Crowcroft, Zheng Wang, and Dejan Sirovica, "Layering considered harmful," IEEE
Network, January 1992, p. 7.

Copyright Keith W. Ross and Jim Kurose 1996-2000

   1.8 Internet Backbones, NAPs and ISPs
Our discussion of layering in the previous section has perhaps given the impression that the Internet is a
carefully organized and highly intertwined structure. This is certainly true in the sense that all of the
network entities (end systems, routers and bridges) use a common set of protocols, enabling the entities
to communicate with each other. If one wanted to change, remove, or add a protocol, one would have to
follow a long and arduous procedure to get approval from the IETF, which will (among other things)
make sure that the changes are consistent with the highly intertwined structure. However, from a
topological perspective, to many people the Internet seems to be growing in a chaotic manner, with new
sections, branches and wings popping up in random places on a daily basis. Indeed, unlike the protocols,
the Internet's topology can grow and evolve without approval from a central authority. Let us now try to
get a grip on the seemingly nebulous Internet topology.

As we mentioned at the beginning of this chapter, the topology of the Internet is loosely hierarchical.
Roughly speaking, from bottom-to-top the hierarchy consists of end systems (PCs, workstations, etc.)
connected to local Internet Service Providers (ISPs). The local ISPs are in turn connected to regional
ISPs, which are in turn connected to national and international ISPs. The national and international ISPs
are connected together at the highest tier in the hierarchy. New tiers and branches can be added just as a
new piece of Lego can be attached to an existing Lego construction.

In this section we describe the topology of the Internet in the United States as of 1999. Let's begin at the
top of the hierarchy and work our way down. Residing at the very top of the hierarchy are the national
ISPs, which are called National Backbone Providers (NBPs). The NBPs form independent backbone
networks that span North America (and typically abroad as well). Just as there are multiple long-distance
telephone companies in the USA, there are multiple NBPs that compete with each other for traffic and
customers. The existing NBPs include internetMCI, SprintLink, PSINet, UUNet Technologies, and
AGIS. The NBPs typically have high-bandwidth transmission links, with bandwidths ranging from 1.5
Mbps to 622 Mbps and higher. Each NBP also has numerous hubs which interconnect its links and at
which regional ISPs can tap into the NBP.

The NBPs themselves must be interconnected to each other. To see this, suppose one regional ISP, say
MidWestnet, is connected to the MCI NBP and another regional ISP, say EastCoastnet, is connected to
Sprint's NBP. How can traffic be sent from MidWestnet to EastCoastnet? The solution is to introduce
switching centers, called Network Access Points (NAPs), which interconnect the NBPs, thereby
allowing each regional ISP to pass traffic to any other regional ISP. To keep us all confused, some of the
NAPs are not referred to as NAPs but instead as MAEs (Metropolitan Area Exchanges). In the United
States, many of the NAPs are run by RBOCs (Regional Bell Operating Companies); for example,
PacBell has a NAP in San Francisco and Ameritech has a NAP in Chicago. For a list of major NBPs
(those connected to at least three NAPs/MAEs), see [Haynal 99].
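
To see concretely why the NAP is needed, the MidWestnet/EastCoastnet example can be modeled as a tiny graph. The sketch
below is ours and makes several simplifying assumptions: the topology contains only the four links named in the example,
a single NAP joins the two NBPs, and a breadth-first search stands in for real Internet routing (which does not work
this way). The names LINKS, build_graph and find_path are hypothetical.

```python
from collections import deque

# Hypothetical topology mirroring the example in the text.
LINKS = [
    ("MidWestnet", "MCI NBP"),
    ("EastCoastnet", "Sprint NBP"),
    ("MCI NBP", "NAP"),
    ("Sprint NBP", "NAP"),
]


def build_graph(links):
    """Build an undirected adjacency map from a list of (a, b) links."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def find_path(graph, src, dst):
    """Breadth-first search; returns a fewest-hop path as a list, or None."""
    if src not in graph or dst not in graph:
        return None
    queue = deque([[src]])
    visited = {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path
        for nxt in sorted(graph[node]):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


if __name__ == "__main__":
    with_nap = build_graph(LINKS)
    without_nap = build_graph(LINKS[:2])  # drop the two links into the NAP
    print(find_path(with_nap, "MidWestnet", "EastCoastnet"))
    print(find_path(without_nap, "MidWestnet", "EastCoastnet"))
```

With the NAP in place the search finds a path through both NBPs; with the two NAP links removed it finds none, which is
exactly the problem the NAPs were introduced to solve.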

Because the NAPs relay and switch tremendous volumes of Internet traffic, they are typically in
themselves complex high-speed switching networks concentrated in a small geographical area (for
example, a single building). Often the NAPs use high-speed ATM switching technology in the heart of
the NAP, with IP riding on top of ATM. (We provide a brief introduction to ATM at the end of this
chapter, and discuss IP-over-ATM in Chapter 5.) Figure 1.8-1 illustrates PacBell's San Francisco NAP.
The details of Figure 1.8-1 are unimportant for us now; it is worthwhile to note, however, that the NBP
hubs can themselves be complex data networks.

              Figure 1.8-1: The PacBell NAP Architecture (courtesy of the Pacific Bell Web site).

The astute reader may have noticed that ATM technology, which uses virtual circuits, can be found at
certain places within the Internet. But earlier we said that the "Internet is a datagram network and does
not use virtual circuits". We admit now that this statement stretches the truth a little bit. We made this
statement because it helps the reader to see the forest for the trees by not having the main issues
obscured. The truth is that there are virtual circuits in the Internet, but they are in localized pockets of the
Internet and they are buried deep down in the protocol stack, typically at layer 2. If you find this
confusing, just pretend for now that the Internet does not employ any technology that uses virtual
circuits. This is not too far from the truth.

Running an NBP is not cheap. In June 1996, the cost of leasing 45 Mbps fiber optics from coast-to-coast,
as well as the additional hardware required, was approximately $150,000 per month. And the fees that an
NBP pays the NAPs to connect to the NAPs can exceed $300,000 annually. NBPs and NAPs also have
significant capital costs in equipment for high-speed networking. An NBP earns money by charging a
monthly fee to the regional ISPs that connect to it. The fee that an NBP charges to a regional ISP
typically depends on the bandwidth of the connection between the regional ISP and the NBP; clearly a
1.5 Mbps connection would be charged less than a 45 Mbps connection. Once the fixed-bandwidth
connection is in place, the regional ISP can pump and receive as much data as it pleases, up to the
bandwidth of the connection, at no additional cost. If an NBP has significant revenues from the regional
ISPs that connect to it, it may be able to cover the high capital and monthly costs of setting up and
maintaining an NBP.
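
The arithmetic behind this paragraph can be made explicit. The sketch below uses only the two recurring figures quoted
above (roughly $150,000 per month for the leased coast-to-coast link and about $300,000 per year in NAP fees, treated
here as a lower bound); it ignores capital costs entirely. The monthly fee charged to each regional ISP is a purely
hypothetical number chosen for illustration, as are the function names.

```python
# Recurring costs quoted in the text (1996 figures); capital costs are ignored.
LEASE_PER_MONTH = 150_000    # 45 Mbps coast-to-coast link plus hardware, per month
NAP_FEES_PER_YEAR = 300_000  # fees paid to NAPs, per year (text says "can exceed")


def annual_recurring_cost(lease_per_month=LEASE_PER_MONTH,
                          nap_fees_per_year=NAP_FEES_PER_YEAR):
    """Yearly recurring cost of running the NBP under the stated assumptions."""
    return 12 * lease_per_month + nap_fees_per_year


def break_even_customers(monthly_fee_per_isp, annual_cost):
    """Smallest number of regional ISPs whose fees cover the annual cost."""
    revenue_per_isp = 12 * monthly_fee_per_isp
    return -(-annual_cost // revenue_per_isp)  # integer ceiling division


if __name__ == "__main__":
    cost = annual_recurring_cost()
    # 25_000 dollars per month is an assumed fee, not a figure from the text.
    print(cost, break_even_customers(25_000, cost))
```

Under these assumptions the recurring cost is $2.1 million per year, so an NBP charging a (hypothetical) $25,000 per
month would need seven regional ISP customers just to cover it, before any equipment is paid for.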

A regional ISP is also a complex network, consisting of routers and transmission links with rates ranging
from 64 Kbps upward. A regional ISP typically taps into an NBP (at an NBP hub), but it can also tap
directly into a NAP, in which case the regional ISP pays a monthly fee to a NAP instead of to an NBP.
A regional ISP can also tap into the Internet backbone at two or more distinct points (for example, at an
NBP hub or at a NAP). How does a regional ISP cover its costs? To answer this question, let's jump to
the bottom of the hierarchy.

End systems gain access to the Internet by connecting to a local ISP. Universities and corporations can
act as local ISPs, but backbone service providers can also serve as a local ISP. Many local ISPs are small
"mom and pop" companies, however. A popular WWW site known simply as "The List" contains links to
nearly 8000 local, regional, and backbone ISPs [List 1999]. Each local ISP taps into one of the regional
ISPs in its region. Analogous to the fee structure between the regional ISP and the NBP, the local ISP
pays a monthly fee to its regional ISP which depends on the bandwidth of the connection. Finally, the
local ISP charges its customers (typically) a flat, monthly fee for Internet access: the higher the
transmission rate of the connection, the higher the monthly fee.

We conclude this section by mentioning that any one of us can become a local ISP as soon as we have an
Internet connection. All we need to do is purchase the necessary equipment (for example, router and
modem pool) that is needed to allow other users to connect to our so-called "point of presence." Thus,
new tiers and branches can be added to the Internet topology just as a new piece of Lego can be attached
to an existing Lego construction.


[Haynal 99] R. Haynal, "Internet Backbones,"
[List 1999] "The List: The Definitive ISP Buyer's Guide,"


   1.9 A Brief History of Computer Networking and the Internet
Sections 1.1-1.8 presented an overview of the technology of computer networking and the Internet. You
should know enough now to impress your family and friends. However, if you really want to be a big hit
at the next cocktail party, you should sprinkle your discourse with tidbits about the fascinating history of
the Internet.

1961-1972: Development and Demonstration of Early Packet Switching

The field of computer networking and today's Internet trace their beginnings back to the early 1960s, a
time at which the telephone network was the world's dominant communication network. Recall from
Section 1.3 that the telephone network uses circuit switching to transmit information from a sender to
receiver -- an appropriate choice given that voice is transmitted at a constant rate between sender and
receiver. Given the increasing importance (and great expense) of computers in the early 1960's and the
advent of timeshared computers, it was perhaps natural (at least with perfect hindsight!) to consider the
question of how to hook computers together so that they could be shared among geographically
distributed users. The traffic generated by such users was likely to be "bursty" -- intervals of activity,
e.g., the sending of a command to a remote computer, followed by periods of inactivity, while waiting for
a reply or while contemplating the received response.

Three research groups around the world, all unaware of the others' work [Leiner 98], began inventing the
notion of packet switching as an efficient and robust alternative to circuit switching. The first published
work on packet-switching techniques was the work by Leonard Kleinrock [Kleinrock 1961, Kleinrock
1964], at that time a graduate student at MIT. Using queuing theory, Kleinrock's work elegantly
demonstrated the effectiveness of the packet-switching approach for bursty traffic sources. At the same
time, Paul Baran at the Rand Corporation had begun investigating the use of packet switching for secure
voice over military networks [Baran 1964], while at the National Physical Laboratory in England, Donald
Davies and Roger Scantlebury were also developing their ideas on packet switching.

The work at MIT, Rand, and NPL laid the foundations for today's Internet. But the Internet also has a
long history of a "Let's build it and demonstrate it" attitude that also dates back to the early 1960's. J.C.R.
Licklider [DEC 1990] and Lawrence Roberts, both colleagues of Kleinrock's at MIT, went on to lead
the computer science program at the Advanced Research Projects Agency (ARPA) in the United States.
Roberts published an overall plan for the so-called ARPAnet [Roberts 1967], the first packet-
switched computer network and a direct ancestor of today's public Internet. The early packet switches
were known as Interface Message Processors (IMPs) and the contract to build these switches was
awarded to BBN. On Labor Day in 1969, the first IMP was installed at UCLA, with three additional IMPs
being installed shortly thereafter at the Stanford Research Institute, UC Santa Barbara, and the University
of Utah. The fledgling precursor to the Internet was four nodes large by the end of 1969. Kleinrock
recalls that the very first use of the network, a remote login from UCLA to SRI, crashed the system
[Kleinrock 1998].

                     Figure 1.9-1: The first Interface Message Processor (IMP), with L. Kleinrock

By 1972, ARPAnet had grown to approximately 15 nodes, and was given its first public demonstration by
Robert Kahn at the 1972 International Conference on Computer Communications. The first host-to-host
protocol between ARPAnet end systems, known as the Network Control Protocol (NCP), was completed
[RFC 001]. With an end-to-end protocol available, applications could now be written. The first e-mail
program was written by Ray Tomlinson at BBN in 1972.

1972 - 1980: Internetworking, and New and Proprietary Networks

The initial ARPAnet was a single, closed network. In order to communicate with an ARPAnet host, one
had to actually be attached to another ARPAnet IMP. In the early to mid 1970's, additional packet-
switching networks besides ARPAnet came into being: ALOHAnet, a satellite network linking together
universities on the Hawaiian islands [Abramson 1972]; Telenet, a BBN commercial packet-switching
network based on ARPAnet technology; Tymnet; and Transpac, a French packet-switching network.
The number of networks was beginning to grow. In 1973, Robert Metcalfe's PhD thesis laid out the
principle of Ethernet, which would later lead to a huge growth in so-called Local Area Networks (LANs)
that operated over a small distance based on the Ethernet protocol.

Once again, with perfect hindsight one might now see that the time was ripe for developing an
encompassing architecture for connecting networks together. Pioneering work on interconnecting
networks (once again under the sponsorship of DARPA), in essence creating a network of networks, was
done by Vinton Cerf and Robert Kahn [Cerf 1974]; the term "internetting" was coined to describe this
work. The architectural principles that Kahn articulated for creating a so-called "open network
architecture" are the foundation on which today's Internet is built [Leiner 98]:

     q   minimalism, autonomy: a network should be able to operate on its own, with no internal changes
         required for it to be internetworked with other networks;
     q   best effort service: internetworked networks would provide best effort, end-to-end service. If
         reliable communication was required, this could be accomplished by retransmitting lost messages
         from the sending host;
     q   stateless routers: the routers in the internetworked networks would not maintain any per-flow
         state about any ongoing connection;
     q   decentralized control: there would be no global control over the internetworked networks.

These principles continue to serve as the architectural foundation for today's Internet, even 25 years later --
a testament to the insight of the early Internet designers.

These architectural principles were embodied in the TCP protocol. The early versions of TCP, however,
were quite different from today's TCP. The early versions of TCP combined a reliable in-sequence
delivery of data via end system retransmission (still part of today's TCP) with forwarding functions
(which today are performed by IP). Early experimentation with TCP, combined with the recognition of
the importance of an unreliable, non-flow-controlled end-to-end transport service for applications such as
packetized voice, led to the separation of IP out of TCP and the development of the UDP protocol. The
three key Internet protocols that we see today -- TCP, UDP and IP -- were conceptually in place by the
end of the 1970's.

In addition to the DARPA Internet-related research, many other important networking activities were
underway. In Hawaii, Norman Abramson was developing ALOHAnet, a packet-based radio network that
allowed multiple remote sites on the Hawaiian islands to communicate with each other. The ALOHA
protocol [Abramson 1970] was the first so-called multiple access protocol, allowing geographically
distributed users to share a single broadcast communication medium (a radio frequency). Abramson's
work on multiple access protocols was built upon by Robert Metcalfe in the development of the Ethernet
protocol [Metcalfe 1976] for wire-based shared broadcast networks. Interestingly, Metcalfe's Ethernet
protocol was motivated by the need to connect multiple PCs, printers, and shared disks together [Perkins
1994]. Twenty-five years ago, well before the PC revolution and the explosion of networks, Metcalfe and
his colleagues were laying the foundation for today's PC LANs. Ethernet technology represented an
important step for internetworking as well. Each Ethernet local area network was itself a network, and as
the number of LANs proliferated, the need to internetwork these LANs together became all the more
important. An excellent source for information on Ethernet is Spurgeon's Ethernet Web Site, which
includes Metcalfe's drawing of his Ethernet concept, as shown below in Figure 1.9-2. We discuss
Ethernet, Aloha, and other LAN technologies in detail in Chapter 5.

Figure 1.9-2: A 1976 drawing by R. Metcalfe of the Ethernet concept (from Charles Spurgeon's Ethernet
Web Site)

In addition to the DARPA internetworking efforts and the Aloha/Ethernet multiple access networks, a
number of companies were developing their own proprietary network architectures. Digital Equipment
Corporation (Digital) released the first version of the DECnet in 1975, allowing two PDP-11
minicomputers to communicate with each other. DECnet has continued to evolve since then, with
significant portions of the OSI protocol suite being based on ideas pioneered in DECnet. Other
important players during the 1970's were Xerox (with the XNS architecture) and IBM (with the SNA
architecture). Each of these early networking efforts would contribute to the knowledge base that would
drive networking in the 80's and 90's.

It is also worth noting here that in the 1980's (and even before), researchers (see, e.g., [Fraser 1983,
Turner 1986, Fraser 1993]) were also developing a "competitor" technology to the Internet architecture.
These efforts have contributed to the development of the ATM (Asynchronous Transfer Mode)
architecture, a connection-oriented approach based on the use of fixed size packets, known as cells. We
will examine portions of the ATM architecture throughout this book.

1980 - 1990: A Proliferation of Networks

By the end of the 1970's approximately 200 hosts were connected to the ARPAnet. By the end of the
1980's the number of hosts connected to the public Internet, a confederation of networks looking much
like today's Internet, would reach 100,000. The 1980's would be a time of tremendous growth.

Much of the growth in the early 1980's resulted from several distinct efforts to create computer networks
linking universities together. BITnet (Because It's There NETwork) provided email and file transfers
among several universities in the Northeast. CSNET (Computer Science NETwork) was formed to link
together university researchers without access to ARPAnet. In 1986, NSFNET was created to provide
access to NSF-sponsored supercomputing centers. Starting with an initial backbone speed of 56Kbps,
NSFNET's backbone would be running at 1.5 Mbps by the end of the decade, and would be serving as a
primary backbone linking together regional networks.

In the ARPAnet community, many of the final pieces of today's Internet architecture were falling into
place. January 1, 1983 saw the official deployment of TCP/IP as the new standard host protocol for
ARPAnet (replacing the NCP protocol). The transition [Postel 1981] from NCP to TCP/IP was a "flag
day" type event -- all hosts were required to transfer over to TCP/IP as of that day. In the late 1980's,
important extensions were made to TCP to implement host-based congestion control [Jacobson 1988].
The Domain Name System, used to map between a human-readable Internet name (e.g., and its 32-bit IP address, was also developed [Mockapetris 1983, Mockapetris 1987].

Paralleling this development of the ARPAnet (which was for the most part a US effort), in the early
1980s the French launched the Minitel project, an ambitious plan to bring data networking into everyone's
home. Sponsored by the French government, the Minitel system consisted of a public packet-switched
network (based on the X.25 protocol suite, which uses virtual circuits), Minitel servers, and inexpensive
terminals with built-in low speed modems. The Minitel became a huge success in 1984 when the French
government gave away a free Minitel terminal to each French household that wanted one. Minitel sites
included free sites -- such as a telephone directory site -- as well as private sites, which collected a usage-
based fee from each user. At its peak in the mid 1990s, it offered more than 20,000 different services,
ranging from home banking to specialized research databases. It was used by over 20% of France's
population, generated more than $1 billion each year, and created 10,000 jobs. The Minitel was in a large
fraction of French homes ten years before most Americans had ever heard of the Internet. It still enjoys
widespread use in France, but is increasingly facing stiff competition from the Internet.

The 1990s: Commercialization and the Web

The 1990's were ushered in with two events that symbolized the continued evolution and the soon-to-arrive
commercialization of the Internet. First, ARPAnet, the progenitor of the Internet, ceased to exist.
MILNET and the Defense Data Network had grown in the 1980's to carry most of the US Department of
Defense related traffic and NSFNET had begun to serve as a backbone network connecting regional
networks in the United States and national networks overseas. Also, in 1990, The World
( became the first public dialup Internet Service Provider (ISP). In 1991, NSFNET
lifted its restrictions on use of NSFNET for commercial purposes. NSFNET itself would be
decommissioned in 1995, with Internet backbone traffic being carried by commercial Internet Service
Providers.

The main event of the 1990's, however, was to be the release of the World Wide Web, which brought the
Internet into the homes and businesses of millions and millions of people, worldwide. The Web also
served as a platform for enabling and deploying hundreds of new applications, including on-line stock
trading and banking, streamed multimedia services, and information retrieval services. For a brief history
of the early days of the WWW, see [W3C 1995].

The WWW was invented at CERN by Tim Berners-Lee in 1989-1991 [Berners-Lee 1989], based on ideas
originating in earlier work on hypertext from the 1940's by Bush [Bush 1945] and since the 1960's by
Ted Nelson [Ziff-Davis 1998]. Berners-Lee and his associates developed initial versions of HTML,
HTTP, a Web server and a browser -- the four key components of the WWW. The original CERN
browsers only provided a line-mode interface. Around the end of 1992 there were about 200 Web servers
in operation, this collection of servers being the tip of the iceberg for what was about to come. At about
this time several researchers were developing Web browsers with GUI interfaces, including Marc
Andreessen, who developed the popular GUI browser Mosaic for X. He released an alpha version of his
browser in 1993, and in 1994 formed Mosaic Communications, which later became Netscape
Communications Corporation. By 1995 university students were using Mosaic and Netscape browsers to
surf the Web on a daily basis. At about this time the US government began to transfer the control of the
Internet backbone to private carriers. Companies -- big and small -- began to operate Web servers and
transact commerce over the Web. In 1996 Microsoft got into the Web business in a big way, and in the
late 1990s it was sued for making its browser a central component of its operating system. In 1999 there
were over two-million Web servers in operation. And all of this happened in less than ten years!

During the 1990's, networking research and development also made significant advances in the areas of
high-speed routers and routing (see, e.g., Chapter 4) and local area networks (see, e.g., Chapter 5). The
technical community struggled with the problems of defining and implementing an Internet service model
for traffic requiring real-time constraints, such as continuous media applications (see, e.g., Chapter 6).
The need to secure and manage Internet infrastructure (see, e.g., Chapters 7 and 8) also became of
paramount importance as e-commerce applications proliferated and the Internet became a central
component of the world's telecommunications infrastructure.


Two excellent discussions of the history of the Internet are [Hobbes 1998] and [Leiner 98].

[Abramson 1970] N. Abramson, The Aloha System - Another Alternative for Computer
Communications, Proceedings of Fall Joint Computer Conference, AFIPS Conference, 1970, p.37.
[Baran 1964] P. Baran, "On Distributed Communication Networks," IEEE Transactions on
Communication Systems, March, 1964. Rand Corporation Technical report with the same title
(Memorandum RM-3420-PR, 1964).
[Berners-Lee 1989] Tim Berners-Lee, CERN, "Information Management: A Proposal," March 1989,
May 1990
[Bush 1945] V. Bush, "As We May Think," The Atlantic Monthly, July 1945.
[Cerf 1974] V. Cerf and R. Kahn, "A protocol for packet network interconnection," IEEE Transactions
on Communications Technology, Vol. COM-22, Number 5 (May 1974) , pp. 627-641.
[DEC 1990] Digital Equipment Corporation, "In Memoriam: J.C.R. Licklider 1915-1990," SRC
Research Report 61, August 1990.
[Hobbes 1998] R. Hobbes Zakon, "Hobbes Internet Timeline", Version 3.3, 1998.
[Fraser 1983] Fraser, A. G. (1983). Towards a universal data transport system. IEEE Journal on
Selected Areas in Communications, SAC-1(5):803-816.
[Fraser 1993] Fraser, A. G. (1993). Early experiments with asynchronous time division networks. IEEE
Network Magazine, 7(1):12-27.
[Jacobson 1988] V. Jacobson, "Congestion Avoidance and Control," Proc. ACM Sigcomm 1988
in Computer Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988
[Kleinrock 1961] L. Kleinrock, "Information Flow in Large Communication Networks," RLE Quarterly
Progress Report, July 1961.
[Kleinrock 1964] L. Kleinrock, 1964 Communication Nets: Stochastic Message Flow and Delay,
McGraw-Hill 1964, later re-issued by Dover Books.
[Kleinrock 1998] L. Kleinrock, "The Birth of the Internet,"
[Leiner 98] B. Leiner, V. Cerf, D. Clark, R. Kahn, L. Kleinrock, D. Lynch, J. Postel, L. Roberts, S.
Woolf, "A Brief History of the Internet,"
[Metcalfe 1976] Robert M. Metcalfe and David R. Boggs, "Ethernet: Distributed Packet Switching for
Local Computer Networks," Communications of the Association for Computing Machinery, Vol. 19, No. 7,
July 1976.
[Mockapetris 1983] P.V. Mockapetris, "Domain names: Implementation specification," RFC 883, Nov-
[Mockapetris 1987] P.V. Mockapetris, "Domain names - concepts and facilities," RFC 1034, Nov-01-
[Perkins 1994] A. Perkins, "Networking with Bob Metcalfe," The Red Herring Magazine, November
[Postel 1981] J. Postel, "NCP/TCP Transition Plan," RFC 801, November 1981.
[RFC 001] S. Crocker, "Host Software," RFC 001 (the very first RFC!).
[Roberts 1967] L. Roberts, T. Merril "Toward a Cooperative Network of Time-Shared Computers," Fall
AFIPS Conference, Oct. 1966.
[Turner 1986] J. Turner, "New Directions in Communications (or Which Way to the Information
Age?)," Proceedings of the Zurich Seminar on Digital Communication, pp. 25-32, 3/86.
[W3C 1995] The World Wide Web Consortium, "A Little History of the World Wide Web," 1995.
[Ziff-Davis 1998] Ziff-Davis Publishing, "Ted Nelson: hypertext pioneer,"

1.10 Asynchronous Transfer Mode (ATM)
Thus far, our focus has been on the Internet and its protocols. But many other existing packet-switching
technologies can also provide end-to-end networking solutions. Among these alternatives to the Internet,
so-called Asynchronous Transfer Mode (ATM) networks are perhaps the most well-known. ATM
arrived on the scene in the early 1990s. It is useful to discuss ATM for two reasons. First, it provides an
interesting contrast to the Internet, and by exploring its differences, we will gain more insight into the
Internet. Second, ATM is often used as a link-layer technology in the backbone of the Internet. Since we
will refer to ATM throughout this book, we end this chapter with a brief overview of ATM.

The Original Goals of ATM

The standards for ATM were first developed in the mid 1980s. For those too young to remember, at this
time there were predominantly two types of networks: telephone networks, which were (and still are)
primarily used to carry real-time voice; and data networks, which were primarily used to transfer text files,
support remote login, and provide email. There were also dedicated private networks available for video
conferencing. The Internet existed at this time, but few people were thinking about using it to transport
phone calls, and the WWW was as yet unheard of. It was therefore natural to design a networking
technology that would be appropriate for transporting real-time audio and video as well as text, email and
image files.

ATM achieved this goal. Two standards bodies, the ATM Forum [ATM Forum] and the International
Telecommunications Union [ITU] have developed ATM standards for Broadband Integrated Services
Digital Networks (BISDNs). The ATM standards call for packet switching with virtual circuits (called
virtual channels in ATM jargon); the standards define how applications directly interface with ATM, so
that ATM provides a complete networking solution for distributed applications. Paralleling the
development of the ATM standards, major companies throughout the world made significant investments
in ATM research and development. These investments have led to a myriad of high-performing ATM
technologies, including ATM switches that can switch terabits per second. In recent years, ATM
technology has been deployed very aggressively within both telephone networks and the Internet.

Although ATM has been deployed within networks, it has been unsuccessful in extending itself all the
way to desktop PCs and workstations. And it is now questionable whether ATM will ever have a
significant presence at the desktop. Indeed, while ATM was brewing in the standards committees and
research labs in the late 1980s and early 1990s, the Internet and its TCP/IP protocols were already
operational and making significant headway:

     q   The TCP/IP protocol suite was integrated into all of the most popular operating systems.
     q   Companies began to transact commerce (e-commerce) over the Internet.
     q   Residential Internet access became very cheap.
     q   Many wonderful desktop applications were developed for TCP/IP networks, including the World
         Wide Web, Internet phone, and interactive streaming video. Thousands of companies are
         currently developing new applications and services for the Internet.

Furthermore, throughout the 1990s, several low-cost high-speed LAN technologies were developed,
including 100 Mbps Ethernet and more recently Gigabit Ethernet, mitigating the need for ATM use in
high-speed LAN applications. Today, we live in a world where almost all networking application
products interface directly with TCP/IP. Nevertheless, ATM switches can switch packets at very high
rates, and consequently ATM has been deployed in Internet backbone networks, where the need to transport
traffic at high rates is most acute. We will discuss the topic of IP over ATM in Section 5.8.

Principal Characteristics of ATM

We shall discuss ATM in some detail in subsequent chapters. For now we briefly outline its principal
characteristics:

     •   The ATM standard defines a full suite of communication protocols, from the transport layer all
         the way down through the physical layer.
     •   It uses packet switching with fixed-length packets of 53 bytes. In ATM jargon these packets are
         called cells. Each cell has 5 bytes of header and 48 bytes of "payload". The fixed-length cells and
         simple headers have facilitated high-speed switching.
     •   ATM uses virtual circuits (VCs). In ATM jargon, virtual circuits are called virtual channels. The
         ATM header includes a field for the virtual channel number, which is called the virtual channel
         identifier (VCI) in ATM jargon. As discussed in Section 1.3, packet switches use the VCI to
         route cells towards their destinations; ATM switches also perform VCI translation.
     •   ATM provides no retransmissions on a link-by-link basis. If a switch detects an error in an ATM
         cell, it attempts to correct the error using error-correcting codes. If it cannot correct the error, it
         drops the cell and does not ask the preceding switch to retransmit the cell.
     •   ATM provides congestion control on an end-to-end basis. That is, the transmission of ATM cells
         is not directly regulated by the switches in times of congestion. However, the network switches
         themselves do provide feedback to a sending end system to help it regulate its transmission rate
         when the network becomes congested.
     •   ATM can run over just about any physical layer. It often runs over fiber optics using the SONET
         standard at speeds of 155.52 Mbps, 622 Mbps and higher.
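
To make the cell arithmetic and the idea of VCI translation a little more concrete, here is a minimal Python sketch. The cell sizes are the ones given in the list above; the translation-table entries, the port numbers, and the function names are hypothetical and chosen only for illustration. The timing calculation uses the raw link rate and ignores any framing overhead added by the physical layer.

```python
# Cell layout as given in the text: 53 bytes = 5 header bytes + 48 payload bytes.
CELL_BYTES = 53
HEADER_BYTES = 5
PAYLOAD_BYTES = CELL_BYTES - HEADER_BYTES


def header_overhead() -> float:
    """Fraction of every cell (and hence of link bandwidth) taken by the header."""
    return HEADER_BYTES / CELL_BYTES


def cell_time_seconds(link_bps: float) -> float:
    """Time needed to transmit one whole cell at the given raw link rate."""
    return CELL_BYTES * 8 / link_bps


# Hypothetical translation table for a single switch:
# (input port, incoming VCI) -> (output port, outgoing VCI).
VC_TABLE = {
    (1, 12): (3, 22),
    (2, 63): (1, 18),
}


def forward(in_port: int, vci: int):
    """Look up where a cell goes next; None means no virtual channel is set up."""
    return VC_TABLE.get((in_port, vci))


if __name__ == "__main__":
    print(f"header overhead: {header_overhead():.1%}")
    print(f"cell time at 155.52 Mbps: {cell_time_seconds(155.52e6) * 1e6:.2f} microseconds")
    print("cell arriving on port 1 with VCI 12 leaves as", forward(1, 12))
```

Under these assumptions the header accounts for a little over 9 percent of each cell, and a cell takes less than 3 microseconds to transmit at 155.52 Mbps. The lookup also shows why the VCI in a cell can change from link to link: each switch rewrites it according to its own table.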

Overview of the ATM Layers

As shown in Figure 1.10-1, the ATM protocol stack consists of three layers: the ATM adaptation layer
(AAL), the ATM Layer, and the ATM Physical Layer:

                                                 ATM Adaptation Layer (AAL)
                                                             ATM Layer
                                                       ATM Physical Layer
                                            Figure 1.10-1: The three ATM layers.

The ATM Physical Layer deals with voltages, bit timings, and framing on the physical medium. The
ATM Layer is the core of the ATM standard. It defines the structure of the ATM cell. The ATM
Adaptation Layer is analogous to the transport layer in the Internet protocol stack. ATM includes many
different types of AALs to support many different types of services.

Currently, ATM is often used as a link-layer technology within localized regions of the Internet. A
special AAL type, AAL5, has been developed to allow TCP/IP to interface with ATM. At the IP-to-ATM
interface, AAL5 prepares IP datagrams for ATM transport; at the ATM-to-IP interface, AAL5
reassembles ATM cells into IP datagrams. Figure 1.10-2 shows the protocol stack for the regions of the
Internet that use ATM.

                                            Application Layer (HTTP, FTP, etc.)
                                                Transport Layer (TCP or UDP)
                                                        Network Layer (IP)
                                                             ATM Layer
                                                       ATM Physical Layer
                                    Figure 1.10-2: Internet-over-ATM protocol stack.

Note that in this configuration, the three ATM layers have been squeezed into the lower two layers of the
Internet protocol stack. In particular, the Internet's network layer "sees" ATM as a link-layer protocol.
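
As a rough illustration of what carrying an IP datagram in ATM cells costs, the following Python sketch counts the cells needed for one datagram. It rests on an assumption that is not stated in this section: that AAL5 appends an 8-byte trailer to the datagram and pads the result to a multiple of the 48-byte cell payload. The function names are ours, and the sketch only counts bytes; it does not build real cells.

```python
CELL_BYTES = 53
PAYLOAD_BYTES = 48
# Assumption (not from this section): AAL5 adds an 8-byte trailer, then pads
# the datagram plus trailer up to a whole number of 48-byte cell payloads.
AAL5_TRAILER_BYTES = 8


def cells_for_datagram(datagram_bytes: int) -> int:
    """Number of ATM cells needed to carry one IP datagram."""
    if datagram_bytes < 0:
        raise ValueError("datagram size cannot be negative")
    total = datagram_bytes + AAL5_TRAILER_BYTES
    # Integer ceiling division: round up to a whole number of cells.
    return -(-total // PAYLOAD_BYTES)


def wire_efficiency(datagram_bytes: int) -> float:
    """Datagram bytes divided by the bytes actually sent in cells."""
    return datagram_bytes / (cells_for_datagram(datagram_bytes) * CELL_BYTES)


if __name__ == "__main__":
    for size in (40, 41, 1500):
        print(size, "bytes ->", cells_for_datagram(size), "cells,",
              f"efficiency {wire_efficiency(size):.2f}")
```

Note how a datagram that just exceeds a cell boundary (41 bytes in this sketch) needs a whole extra cell, which is one reason small datagrams are carried less efficiently than large ones.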

This concludes our brief introduction to ATM. We will return to ATM from time to time throughout this
book.


[ATM Forum] The ATM Forum Web site,
[ITU] The ITU Web site,


Copyright Keith W. Ross and Jim Kurose 1996-2000
 Chapter 1 summary

                                            1.11 Summary
In this chapter we've covered a tremendous amount of material! We've looked at the various pieces of
hardware and software that make up the Internet in particular, and computer networks in general. We
started at the "edge" of the network, looking at end systems and applications, and at the transport service
provided to the applications running on the end systems. Using network-based distributed applications
as examples, we introduced the notion of a protocol - a key concept in networking. We then dove deeper
inside the network, into the network core, identifying packet-switching and circuit switching as the two
basic approaches for transporting data through a telecommunication network, and examining the
strengths and weaknesses of each approach. We then looked at the lowest (from an architectural
standpoint) parts of the network -- the link layer technologies and physical media typically found in the
access network.

In the second part of this introductory chapter we then took the broader view on networking. From a
performance standpoint, we identified the causes of packet delay and packet loss in the Internet. We
identified key architectural principles (layering, service models) in networking. We then examined the
structure of today's Internet. We finished our introduction to networking with a brief history of computer
networking. The first chapter in itself constitutes a mini-course in computer networking.

So, we have indeed covered a tremendous amount of ground in this first chapter! If you're a bit
overwhelmed, don't worry. In the following chapters we will revisit all of these ideas, covering them in
much more detail (that's a promise, not a threat!). At this point, we hope you leave this chapter with a
still-developing intuition for the pieces that make up a network, a still-developing command of the
vocabulary of networking (don't be shy to refer back to this chapter), and an ever-growing desire to learn
more about networking. That's the task ahead of us for the rest of this book.

Roadmapping This Book

Before starting any trip, we should always glance at a roadmap in order to become familiar with the
major roads and junctures that lie between us and our ultimate destination. For the trip we are about to
embark on, the ultimate destination is a deep understanding of the how, what and why of computer
networks. Our roadmap is the sequence of chapters of this book:

    1.   Computer Networks and the Internet
    2.   Application Layer
    3.   Transport Layer
    4.   Network Layer and Routing
    5.   Link Layer and Local Area Networks
    6.   Multimedia Networking
    7.   Security in Computer Networks
    8.   Network Management

Taking a look at this roadmap, we identify Chapters 2 through 5 as the four core chapters of this book.
You should notice that there is one chapter for each of the top four layers of the Internet protocol stack.
Further note that our journey will begin at the top of the Internet protocol stack, namely, the application
layer, and will work its way downward. The rationale behind this top-down journey is that once we
understand the applications, we can then understand the network services needed to support these
applications. We can then, in turn, examine the various ways in which such services might be
implemented by a network architecture. Covering applications early thus provides motivation for the
remainder of the text.

The second half of the book -- Chapters 6 through 8 -- zooms in on three enormously important (and
somewhat independent) topics in modern computer networking. In Chapter 6 (Multimedia Networking),
we examine audio and video applications -- such as Internet phone, video conferencing, and streaming
of stored media. We also look at how a packet-switched network can be designed to provide consistent
quality of service to audio and video applications. In Chapter 7 (Security in Computer Networks), we
first look at the underpinnings of encryption and network security, and then examine how the basic
theory is being applied in a broad range of Internet contexts, including electronic mail and Internet
commerce. The last chapter (Network Management) examines the key issues in network management as
well as the Internet protocols that address these issues.

 Chapter 1 Homework and Discussion Questions

      Homework Problems and Discussion Questions
                                                     Chapter 1
Review Questions

Sections 1.1-1.4

1) What are the two types of services that the Internet provides to its applications? What are some of
the characteristics of each of these services?

2) It has been said that flow control and congestion control are equivalent. Is this true for the Internet's
connection-oriented service? Are the objectives of flow control and congestion control the same?

3) Briefly describe how the Internet's connection-oriented service provides reliable transport.

4) What advantage does a circuit-switched network have over a packet-switched network?

4) What advantages does TDM have over FDM in a circuit-switched network?

5) Suppose that between a sending host and a receiving host there is exactly one packet switch. The
transmission rates between the sending host and the switch and between the switch and the receiving host
are R1 and R2, respectively. Assuming that the switch uses store-and-forward packet switching, what is
the total end-to-end delay to send a packet of length L? (Ignore queuing and propagation delay.)

6) What are some of the networking technologies that use virtual circuits? Find good URLs that discuss
and explain these technologies.

7) What is meant by connection state information in a virtual-circuit network?

8) Suppose you are developing a standard for a new type of network. You need to decide whether your
network will use VCs or datagram routing. What are the pros and cons for using VCs?

Sections 1.5-1.7

9) Is HFC bandwidth dedicated or shared among users? Are collisions possible in a downstream HFC
channel? Why or why not?

10) What are the transmission rates of Ethernet LANs? For a given transmission rate, can each user on the
LAN continuously transmit at that rate?

11) What are some of the physical media that Ethernet can run over?

12) Dial-up modems, ISDN, HFC and ADSL are all used for residential access. For each of these access
technologies, provide a range of transmission rates and comment on whether the bandwidth is shared or
dedicated.

13) Consider sending a series of packets from a sending host to a receiving host over a fixed route. List
the delay components in the end-to-end delay for a single packet. Which of these delays are constant and
which are variable?

14) Review the car-caravan analogy in Section 1.6. Again assume a propagation speed of 100 km/hour.

        a) Suppose the caravan travels 200 km, beginning in front of one toll booth, passing through a
        second toll booth, and finishing just before a third toll booth. What is the end-to-end delay?

        b) Repeat (a), now assuming that there are 7 cars in the caravan instead of 10.

15) List five tasks that a layer can perform. Is it possible that one (or more) of these tasks could be
performed by two (or more) layers?

16) What are the five layers in the Internet protocol stack? What are the principal responsibilities of
each of these layers?

17) Which layers in the Internet protocol stack does a router process?


Problems

1) Design and describe an application-level protocol to be used between an Automatic Teller Machine
and a bank's centralized computer. Your protocol should allow a user's card and password to be verified,
the account balance (which is maintained at the centralized computer) to be queried, and an account
withdrawal (i.e., when money is given to the user) to be made. Your protocol entities should be able to
handle the all-too-common case in which there is not enough money in the account to cover the
withdrawal. Specify your protocol by listing the messages exchanged, and the action taken by the
Automatic Teller Machine or the bank's centralized computer on transmission and receipt of messages.
Sketch the operation of your protocol for the case of a simple withdrawal with no errors, using a diagram
similar to that in Figure 1.2-1. Explicitly state the assumptions made by your protocol about the
underlying end-to-end transport service.

2) Consider an application that transmits data at a steady rate (e.g., the sender generates an N-bit unit of
data every k time units, where k is small and fixed). Also, when such an application starts, it will stay on
for a relatively long period of time. Answer the following questions, briefly justifying your answer:

     •   Would a packet-switched network or a circuit-switched network be more appropriate for this
         application? Why?
     •   Suppose that a packet-switched network is used and the only traffic in this network comes from
         such applications as described above. Furthermore, assume that the sum of the application data
         rates is less than the capacities of each and every link. Is some form of congestion control needed?

3) Consider sending a file of F = M * L bits over a path of Q links. Each link transmits at R bps. The
network is lightly loaded so that there are no queueing delays. When a form of packet switching is used,
the M * L bits are broken up into M packets, each packet with L bits. Propagation delay is negligible.

         a) Suppose the network is a packet-switched virtual-circuit network. Denote the VC set-up time
         by ts seconds. Suppose the sending layers add a total of h bits of header to each packet. How long
         does it take to send the file from source to destination?

         b) Suppose the network is a packet-switched datagram network, and a connectionless service is
         used. Now suppose each packet has 2h bits of header. How long does it take to send the file?

         c) Repeat (b), but assume message switching is used (i.e., 2h bits are added to the message, and the
         message is not segmented).

         d) Finally, suppose that the network is a circuit-switched network. Further suppose that the
         transmission rate of the circuit between source and destination is R bps. Assuming ts set-up time
         and h bits of header appended to the entire file, how long does it take to send the file?

4) Experiment with the message-switching Java applet in this chapter. Do the delays in the applet
correspond to the delays in the previous question? How do link propagation delays affect the overall
end-to-end delay for packet switching and for message switching?

5) Consider sending a large file of F bits from Host A to Host B. There are two links (and one switch)
between A and B, and the links are uncongested (i.e., no queueing delays). Host A segments the file into
segments of S bits each and adds 40 bits of header to each segment, forming packets of L = 40 + S bits.
Each link has a transmission rate of R bps. Find the value of S that minimizes the delay of moving the
file from Host A to Host B. Neglect propagation delay.

6) This elementary problem begins to explore propagation delay and transmission delay, two central
concepts in data networking. Consider two hosts, Hosts A and B, connected by a single link of rate R
bps. Suppose that the two hosts are separated by m meters, and suppose the propagation speed along the
link is s meters/sec. Host A is to send a packet of size L bits to Host B.

        a) Express the propagation delay, dprop, in terms of m and s.
        b) Determine the transmission time of the packet, dtrans, in terms of L and R.
        c) Ignoring processing and queuing delays, obtain an expression for the end-to-end delay.
        d) Suppose Host A begins to transmit the packet at time t = 0. At time t = dtrans, where is the last bit
        of the packet?
        e) Suppose dprop is greater than dtrans. At time t = dtrans, where is the first bit of the packet?
        f) Suppose dprop is less than dtrans. At time t = dtrans, where is the first bit of the packet?
        g) Suppose s = 2.5*10^8 meters/sec, L = 100 bits and R = 28 kbps. Find the distance m so that dprop
        equals dtrans.

7) In this problem we consider sending voice from Host A to Host B over a packet-switched network
(e.g., Internet phone). Host A converts analog voice on the fly to a digital 64 kbps bit stream. Host A
then groups the bits into 48-byte packets. There is one link between Hosts A and B; its transmission rate is
1 Mbps and its propagation delay is 2 msec. As soon as Host A gathers a packet, it sends it to Host B. As
soon as Host B receives an entire packet, it converts the packet's bits to an analog signal. How much time
elapses from when a bit is created (from the original analog signal at A) until a bit is decoded (as part of
the analog signal at B)?

8) Suppose users share a 1 Mbps link. Also suppose each user requires 100 kbps when transmitting, but
each user only transmits 10% of the time. (See the discussion on "Packet Switching versus Circuit
Switching" in Section 1.4.1.)

        a) When circuit-switching is used, how many users can be supported?

        b) For the remainder of this problem, suppose packet-switching is used. Find the probability that a
        given user is transmitting.

        c) Suppose there are 40 users. Find the probability that at any given time, n users are transmitting.

        d) Find the probability that there are 10 or more users transmitting simultaneously.

9) Consider the queueing delay in a router buffer (preceding an outbound link). Suppose all packets are L
bits, the transmission rate is R bps and that N packets arrive at the buffer every LN/R seconds. Find the
average queueing delay of a packet.

10) Consider the queueing delay in a router buffer. Let I denote traffic intensity, that is, I = La/R.
Suppose that the queueing delay takes the form LR/(1-I) for I < 1. (a) Provide a formula for the "total
delay," that is, the queueing delay plus the transmission delay. (b) Plot the transmission delay as a
function of L/R.

11) (a) Generalize the end-to-end delay formula in Section 1.6 for heterogeneous processing rates,
transmission rates, and propagation delays. (b) Repeat (a), but now also suppose that there is an average
queuing delay of dqueue at each node.

12) Consider an application that transmits data at a steady rate (e.g., the sender generates one packet of N
bits every k time units, where k is small and fixed). Also, when such an application starts, it will stay on
for a relatively long period of time.

        a) Would a packet-switched network or a circuit-switched network be more appropriate for this
        application? Why?
        b) Suppose that a packet-switched network is used and the only traffic in this network comes from
        such applications as described above. Furthermore, assume that the sum of the application data
        rates is less than the capacities of each and every link. Is some form of congestion control needed?
        Why or why not?

13) Perform a traceroute between source and destination on the same continent at three different hours of
the day. Find the average and standard deviation of the delays. Do the same for a source and destination
on different continents.

14) Recall that ATM uses 53-byte packets consisting of 5 header bytes and 48 payload bytes. Fifty-three
bytes is unusually small for fixed-length packets; most networking protocols (IP, Ethernet, frame relay,
etc.) use packets that are, on average, significantly larger. One of the drawbacks of a small packet size is
that a large fraction of link bandwidth is consumed by overhead bytes; in the case of ATM, almost ten
percent of the bandwidth is "wasted" by the ATM header. In this problem we investigate why such a
small packet size was chosen. To this end, suppose that the ATM cell consists of L payload bytes (possibly
different from 48) and 5 bytes of header.

        a) Consider sending a digitally encoded voice source directly over ATM. Suppose the source is
        encoded at a constant rate of 64 kbps. Assume each cell is entirely filled before the source sends
        the cell into the network. The time required to fill a cell is the packetization delay. In terms of L,
        determine the packetization delay in milliseconds.

        b) Packetization delays greater than 20 msec can cause noticeable and unpleasant echo.
        Determine the packetization delay for L = 1,500 bytes (roughly corresponding to a maximum-size
        Ethernet packet) and for L = 48 bytes (corresponding to an ATM cell).

        c) Calculate the store-and-forward delay at a single ATM switch for a link rate of R = 155 Mbps
        (a popular link speed for ATM) for L = 1500 bytes and L = 48 bytes.

        d) Comment on the advantages of using a small cell size.

Discussion Questions

1) Write a one-paragraph description for each of three major projects currently under way at the W3C.

2) What is Internet phone? Describe some of the existing products for Internet phone. Find some of the
Web sites of companies that are in the Internet phone business.

3) What is Internet audio-on-demand? Describe some of the existing products for Internet audio-on-
demand. Find some of the Web sites of companies that are in the Internet audio-on-demand business.
Find some Web sites which provide audio-on-demand content.

4) What is Internet video conferencing? Describe some of the existing products for Internet video
conferencing. Find some of the Web sites of companies that are in the Internet video-conferencing
business.

5) Surf the Web to find a company that is offering HFC Internet access. What is the transmission rate of the
cable modem? Is this rate always guaranteed for each user on the network?

6) Discussion question: Suppose you are developing an application for the Internet. Would you have your
application run over TCP or UDP? Elaborate. (We will explore this question in some detail in subsequent
chapters. For now appeal to your intuition to answer the question.)

7) Discussion question: What are some of the current activities of the World Wide Web Consortium
(W3C)? What are some of the current activities of the National Laboratory for Applied Network
Research or NLANR?

8) Discussion question: What does the current topological structure of the Internet (i.e., backbone ISPs,
regional ISPs, and local ISPs) have in common with the topological structure of the telephone networks
in the USA? How is pricing in the Internet the same as or different from pricing in the phone system?

 Network Applications: Terminology and Basic Concepts

                                             2.1 Principles of
                        Application Layer Protocols
Network applications are the raisons d'etre of a computer network. If we couldn't conceive of any useful
applications, there wouldn't be any need to design networking protocols to support them. But over the past
thirty years, many people have devised numerous ingenious and wonderful networking applications. These
applications include the classic text-based applications that became popular in the 1980s, including remote
access to computers, electronic mail, file transfers, newsgroups, and chat. But they also include more
recently conceived multimedia applications, such as the World Wide Web, Internet telephony, video
conferencing, and audio and video on demand.

Although network applications are diverse and have many interacting components, software is almost
always at their core. Recall from Section 1.2 that a network application's software is distributed among
two or more end systems (i.e., host computers). For example, with the Web there are two pieces of
software that communicate with each other: the browser software in the user's host (PC, Mac or
workstation), and the Web server software in the Web server. With Telnet, there are again two pieces of
software in two hosts: software in the local host and software in the remote host. With multiparty video
conferencing, there is a software piece in each host that participates in the conference.

In the jargon of operating systems, it is not actually software pieces (i.e., programs) that
communicate but rather processes. A process can be thought of as a program
that is running within an end system. When communicating processes are running on the same end system,
they communicate with each other using interprocess communication. The rules for interprocess
communication are governed by the end system's operating system. But in this book we are not interested
in how processes on the same host communicate, but instead in how processes running on different end
systems (with potentially different operating systems) communicate. Processes on two different end
systems communicate with each other by exchanging messages across the computer network. A sending
process creates and sends messages into the network; a receiving process receives these messages and
possibly responds by sending messages back. Networking applications have application-layer protocols
that define the format and order of the messages exchanged between processes, as well as the actions taken
on the transmission or receipt of a message.

The application layer is a particularly good place to start our study of protocols. It's familiar ground.
We're acquainted with many of the applications that rely on the protocols we will study. It will give us a
good feel for what protocols are all about, and will introduce us to many of the same issues that we'll see
again when we study transport, network, and data link layer protocols.

2.1.1 Application-Layer Protocols

It is important to distinguish between network applications and application-layer protocols. An
application-layer protocol is only one piece (albeit, a big piece) of a network application. Let's look at a
couple of examples. The Web is a network application that allows users to obtain "documents" from Web
servers on demand. The Web application consists of many components, including a standard for document
formats (i.e., HTML), Web browsers (e.g., Netscape Navigator and Internet Explorer), Web servers (e.g.,
Apache, Microsoft and Netscape servers), and an application-layer protocol. The Web's application-layer
protocol, HTTP (the HyperText Transfer Protocol [RFC 2068]), defines how messages are passed between
browser and Web server. Thus, HTTP is only one piece (albeit, a big piece) of the Web application. As
another example, consider the Internet electronic mail application. Internet electronic mail also has many
components, including mail servers that house user mailboxes, mail readers that allow users to read and
create messages, a standard for defining the structure of an email message (i.e., MIME) and application-
layer protocols that define how messages are passed between servers, how messages are passed between
servers and mail readers, and how the contents of certain parts of the mail message (e.g., a mail message
header) are to be interpreted. The principal application-layer protocol for electronic mail is SMTP (Simple
Mail Transfer Protocol [RFC 821]). Thus, SMTP is only one piece (albeit, a big piece) of the email
application.

As noted above, an application layer protocol defines how an application's processes, running on different
end systems, pass messages to each other. In particular, an application layer protocol defines:

     •   the types of messages exchanged, e.g., request messages and response messages;
     •   the syntax of the various message types, i.e., the fields in the message and how the fields are
         delineated;
     •   the semantics of the fields, i.e., the meaning of the information in the fields;
     •   rules for determining when and how a process sends messages and responds to messages.
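
The four items in this list can be seen side by side in a toy protocol. The Python sketch below is entirely hypothetical: the message layout, the field names, and the "balance" example are invented for illustration and do not correspond to any real Internet protocol. It shows one possible choice of message types, a syntax for laying out fields, the meaning attached to each field, and a rule for when a process replies.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Message types: this toy protocol has exactly two.
MESSAGE_TYPES = {"REQUEST", "RESPONSE"}


@dataclass
class Message:
    msg_type: str      # semantics: who is speaking (asker or answerer)
    name: str          # semantics: which item is being asked about
    value: str = ""    # semantics: the answer (empty in a request)


def encode(msg: Message) -> bytes:
    """Syntax: three fields separated by single spaces, ended by a newline."""
    return f"{msg.msg_type} {msg.name} {msg.value}\n".encode("ascii")


def decode(raw: bytes) -> Message:
    """Parse one encoded message, rejecting anything that breaks the syntax."""
    parts = raw.decode("ascii").rstrip("\n").split(" ", 2)
    if len(parts) != 3 or parts[0] not in MESSAGE_TYPES:
        raise ValueError("malformed message")
    return Message(parts[0], parts[1], parts[2])


def handle(msg: Message, store: Dict[str, str]) -> Optional[Message]:
    """Rule: answer every REQUEST with one RESPONSE; never answer a RESPONSE."""
    if msg.msg_type == "REQUEST":
        return Message("RESPONSE", msg.name, store.get(msg.name, "unknown"))
    return None


if __name__ == "__main__":
    request = decode(encode(Message("REQUEST", "balance")))
    print(handle(request, {"balance": "100"}))
```

Two processes that agree on all four of these points can interoperate even if they are written by different people in different languages, which is the sense in which a protocol is only one piece of an application.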

Some application-layer protocols are specified in RFCs and are therefore in the public domain. For
example, HTTP is available as an RFC. If a browser developer follows the rules of the HTTP RFC, the
browser will be able to retrieve Web pages from any Web server (more precisely, any Web server that has
also followed the rules of the HTTP RFC). Many other application-layer protocols are proprietary and
intentionally not available in the public domain. For example, many of the existing Internet phone
products use proprietary application-layer protocols.
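
To give a feel for what "following the rules of the HTTP RFC" means in practice, here is a minimal Python sketch that builds a simple HTTP/1.1 request and reads the first line of a response. It covers only a tiny corner of the protocol; the function names are ours, and www.example.com is a placeholder host. No network access is involved: the sketch only constructs and parses the text that would be exchanged.

```python
from typing import Tuple


def build_get_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request for the given host and path."""
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, path, protocol version
        f"Host: {host}",          # HTTP/1.1 requests must name the host
        "Connection: close",      # ask the server to close after responding
        "",                       # an empty line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status_line(response: bytes) -> Tuple[str, int, str]:
    """Split the first response line into (version, status code, reason)."""
    first_line = response.split(b"\r\n", 1)[0].decode("ascii")
    parts = first_line.split(" ", 2)
    if len(parts) < 2:
        raise ValueError("not an HTTP status line")
    reason = parts[2] if len(parts) > 2 else ""
    return parts[0], int(parts[1]), reason


if __name__ == "__main__":
    print(build_get_request("www.example.com", "/index.html").decode("ascii"))
    print(parse_status_line(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n"))
```

Because both the request layout and the status line are fixed by the public specification, any browser that produces the former and any server that produces the latter can talk to each other without knowing anything else about one another.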

Clients and Servers

A network application protocol typically has two parts or "sides", a client side and a server side. The
client side in one end system communicates with the server side in another end system. For example, a
Web browser implements the client side of HTTP and a Web server implements the server side of HTTP.
In another example, e-mail, the sending mail server implements the client side of SMTP and the receiving
mail server implements the server side of SMTP.

For many applications, a host will implement both the client and server sides of an application. For
example, consider a Telnet session between Hosts A and B. (Recall that Telnet is a popular remote login
application.) If Host A initiates the Telnet session (so that a user at Host A is logging onto Host B), then
Host A runs the client side of the application and Host B runs the server side. On the other hand, if Host B
initiates the Telnet session, then Host B runs the client side of the application. FTP, used for transferring
files between two hosts, provides another example. When an FTP session exists between two hosts, then
either host can transfer a file to the other host during the session. However, as is the case for almost all
network applications, the host that initiates the session is labeled the client. Furthermore, a host can
actually act as both a client and a server at the same time for a given application. For example, a mail
server host runs the client side of SMTP (for sending mail) as well as the server side of SMTP (for
receiving mail).

Processes Communicating Across a Network

As noted above, an application involves two processes in two different hosts communicating with each
other over a network. (Actually, a multicast application can involve communication among more than two
hosts. We shall address this issue in Chapter 4.) The two processes communicate with each other by
sending and receiving messages through their sockets. A process's socket can be thought of as the process's
door: a process sends messages into, and receives messages from, the network through its socket. When a
process wants to send a message to another process on another host, it shoves the message out its door.
The process assumes that there is a transportation infrastructure on the other side of the door that will
transport the message to the door of the destination process.

               Figure 2.1-1: Application processes, sockets, and the underlying transport protocol.

Figure 2.1-1 illustrates socket communication between two processes that communicate over the Internet.
(The figure assumes that the underlying transport protocol is TCP, although the UDP protocol could be
used as well in the Internet.) As shown in this figure, a socket is the interface between the application
layer and the transport layer within a host. It is also referred to as the API (application programming
interface) between the application and the network, since the socket is the programming interface with
which networked applications are built in the Internet. The application developer has control of
everything on the application-layer side of the socket but has little control of the transport-layer side of the
socket. The only control that the application developer has on the transport-layer side is (i) the choice of
transport protocol and (ii) perhaps the ability to fix a few transport-layer parameters such as maximum
buffer and maximum segment sizes. Once the application developer chooses a transport protocol (if a
choice is available), the application is built using the transport-layer services offered by that protocol.
We will explore sockets in some detail in Sections 2.6 and 2.7.
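
The two kinds of control described above can be sketched with Python's standard socket module. This is a minimal illustration rather than a recommended configuration: the helper name open_socket and the 32 KB buffer size are assumptions made for the example, and the operating system is free to adjust the buffer size it actually grants.

```python
import socket


def open_socket(use_tcp: bool, send_buffer_bytes: int) -> socket.socket:
    """Create a socket, choosing the transport protocol and one buffer size.

    open_socket is a hypothetical helper written for this illustration.
    """
    # (i) Choice of transport protocol: SOCK_STREAM selects TCP,
    # SOCK_DGRAM selects UDP.
    kind = socket.SOCK_STREAM if use_tcp else socket.SOCK_DGRAM
    sock = socket.socket(socket.AF_INET, kind)
    # (ii) A transport-layer parameter: request a send-buffer size.
    # The operating system may round or otherwise adjust the value it uses.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, send_buffer_bytes)
    return sock


tcp_sock = open_socket(use_tcp=True, send_buffer_bytes=32 * 1024)
udp_sock = open_socket(use_tcp=False, send_buffer_bytes=32 * 1024)
# Report the buffer size the operating system actually granted.
print(tcp_sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
tcp_sock.close()
udp_sock.close()
```

Everything else on the transport-layer side of the socket, such as how segments are formed and delivered, stays outside the application's control.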

Addressing Processes

In order for a process on one host to send a message to a process on another host, the sending process must
identify the receiving process. To identify the receiving process, one must typically specify two pieces of
information: (i) the name or address of the host machine, and (ii) an identifier that specifies the identity of
the receiving process on the destination host.

Let us first consider host addresses. In Internet applications, the destination host is specified by its IP
address. We will discuss IP addresses in great detail in Chapter 4. For now, it suffices to know that the IP
address is a 32-bit quantity that uniquely identifies the end-system (more precisely, it uniquely identifies
the interface that connects that host to the Internet). Since the IP address of any end system connected to
the public Internet must be globally unique, the assignment of IP addresses must be carefully managed, as
discussed in section 4.4. ATM networks have a different addressing standard. The ITU-T has specified
telephone number-like addresses, called E.164 addresses [ITU 1997], for use in public ATM networks.
E.164 addresses consist of between seven and 15 digits, with each digit encoded as a byte (yielding an
address of between 56 and 120 bits in length). The assignment of these addresses is carefully managed by
country- or region-specific standards bodies; in the United States, the American National Standards
Institute (ANSI) provides this address registration service. We will not cover ATM end-system addressing
in depth in this book; see [Fritz 1997, Cisco 1999] for more details.

In addition to knowing the address of the end system to which a message is destined, a sending application
must also specify information that will allow the receiving end system to direct the message to the
appropriate process on that system. A receive-side port number serves this purpose in the Internet.
Popular application-layer protocols have been assigned specific port numbers. For example, a Web server
process (which uses the HTTP protocol) is identified by port number 80. A mail server process (which uses the
SMTP protocol) is identified by port number 25. A list of well-known port numbers for all Internet standard
protocols can be found in [RFC 1700]. When a developer creates a new network application, the
application must be assigned a new port number.
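
As a rough sketch of the addressing ideas in this subsection, the Python fragment below pairs a host address with a port number, views an IPv4 address as a 32-bit quantity, and reproduces the E.164 length arithmetic. The function names and the sample address 192.0.2.10 (taken from a range reserved for documentation) are assumptions made for the example; only the port numbers 80 and 25 come from the text.

```python
import ipaddress
from typing import Tuple

# Port numbers quoted in the text above.
WELL_KNOWN_PORTS = {"HTTP": 80, "SMTP": 25}


def process_address(host_ip: str, protocol: str) -> Tuple[str, int]:
    """Return the (IP address, port) pair that identifies a receiving process."""
    ipaddress.IPv4Address(host_ip)  # raises ValueError for an invalid address
    return (host_ip, WELL_KNOWN_PORTS[protocol])


def ipv4_as_int(host_ip: str) -> int:
    """View a dotted-decimal IPv4 address as the 32-bit quantity it encodes."""
    return int(ipaddress.IPv4Address(host_ip))


def e164_bits(digits: int) -> int:
    """Length in bits of an E.164 address with each digit encoded as a byte."""
    return digits * 8


print(process_address("192.0.2.10", "HTTP"))  # ('192.0.2.10', 80)
print(ipv4_as_int("192.0.2.10").bit_length() <= 32)  # True
print(e164_bits(7), e164_bits(15))  # 56 120
```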

User Agents

Before we begin a more detailed study of application-layer protocols, it is useful to discuss the notion of a
user agent. The user agent is an interface between the user and the network application. For example,
consider the Web. For this application, the user agent is a browser such as Netscape Navigator or
Microsoft Explorer. The browser allows a user to view Web pages, to navigate in the Web, to provide
input to forms, to interact with Java applets, etc. The browser also implements the client side of the HTTP
protocol. Thus, when activated, the browser is a process that, along with providing an interface to the user,
sends messages into a socket. As another example, consider the electronic mail application. In this case,
the user agent is a "mail reader" that allows a user to compose and read messages. Many companies
market mail readers (e.g., Eudora, Netscape Messenger) with a graphical user interface that can run on
PCs, Macs and workstations. Mail readers running on PCs also implement the client side of application
layer protocols; typically they implement the client side of SMTP for sending mail and the client side of a
mail retrieval protocol, such as POP3 or IMAP (see section 2.4), for receiving mail.

2.1.2 What Services Does an Application Need?
Recall that a socket is the interface between the application process and the transport protocol. The
application at the sending side sends messages through the door. At the other side of the door, the transport
protocol has the responsibility of moving the messages across the network to the door at the receiving
process. Many networks, including the Internet, provide more than one transport protocol. When you
develop an application, you must choose one of the available transport protocols. How do you make this
choice? Most likely, you will study the services that are provided by the available transport protocols, and
you will pick the protocol with the services that best match the needs of your application. The situation is
similar to choosing either train or airplane transport for travel between two cities (say New York City and
Boston). You have to choose one or the other, and each transport mode offers different services. (For
example, the train offers downtown pick up and drop off, whereas the plane offers shorter transport time.)

What services might a network application need from a transport protocol? We can broadly classify an
application's service requirements along three dimensions: data loss, bandwidth, and timing.

     q   Data Loss. Some applications, such as electronic mail, file transfer, remote host access, Web
         document transfers, and financial applications require fully reliable data transfer, i.e., no data loss.
         In particular, a loss of file data, or data in a financial transaction, can have devastating
         consequences (in the latter case, for either the bank or the customer!). Other loss-tolerant
         applications, most notably multimedia applications such as real-time audio/video or stored
         audio/video, can tolerate some amount of data loss. In these latter applications, lost data might
         result in a small glitch in the played-out audio/video - not a crucial impairment. The effects of such
         loss on application quality, and the actual amount of tolerable packet loss, will depend strongly on the
         coding scheme used.
     q   Bandwidth. Some applications must be able to transmit data at a certain rate in order to be
         "effective". For example, if an Internet telephony application encodes voice at 32 Kbps, then it
         must be able to send data into the network, and have data delivered to the receiving application, at
         this rate. If this amount of bandwidth is not available, the application needs to either encode at a
         different rate (and receive enough bandwidth to sustain this different coding rate) or should give up,
         since receiving half of the needed bandwidth is of no use to such a bandwidth-sensitive application.
         While many current multimedia applications are bandwidth sensitive, future multimedia
         applications may use adaptive coding techniques to encode at a rate that matches the currently
         available bandwidth. While bandwidth-sensitive applications require a given amount of bandwidth,
         elastic applications can make use of as much or as little bandwidth as happens to be available.
         Electronic mail, file transfer, remote access, and Web transfers are all elastic applications. Of
         course, the more bandwidth, the better. There's an adage that says that one cannot be too rich, too
         thin, or have too much bandwidth.
     q   Timing. The final service requirement is that of timing. Interactive real-time applications, such as
         Internet telephony, virtual environments, teleconferencing, and multiplayer games require tight
         timing constraints on data delivery in order to be "effective." For example, many of these
         applications require that end-to-end delays be on the order of a few hundred milliseconds or less.
         (See Chapter 6 and [Gauthier 1999, Ramjee 1994].) Long delays in Internet telephony, for example,
         tend to result in unnatural pauses in the conversation; in a multiplayer game or virtual interactive
         environment, a long delay between taking an action and seeing the response from the environment
         (e.g., from another player on the end of an end-to-end connection) makes the application feel less
         "realistic." For non-real-time applications, lower delay is always preferable to high delay, but no
         tight constraint is placed on the end-to-end delays.
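
The difference between a bandwidth-sensitive and an elastic application can be sketched in a few lines of Python. This is a toy model under stated assumptions: the set of alternative encoding rates (32, 16 and 8 Kbps), the 20 msec packet interval, and all function names are hypothetical, and 1 Kbps is taken to mean 1000 bits per second.

```python
from typing import Optional, Sequence


def choose_encoding_rate(available_kbps: float,
                         supported_rates_kbps: Sequence[float]) -> Optional[float]:
    """Bandwidth-sensitive case: pick the highest supported encoding rate
    that fits in the available bandwidth, or None to give up."""
    usable = [rate for rate in supported_rates_kbps if rate <= available_kbps]
    return max(usable) if usable else None


def elastic_rate(available_kbps: float) -> float:
    """Elastic case: use as much or as little bandwidth as is available."""
    return available_kbps


def payload_bytes_per_packet(rate_kbps: float, packet_interval_ms: float) -> float:
    """Bytes of encoded data produced during one packet interval.

    Kbps multiplied by milliseconds gives bits; dividing by 8 gives bytes.
    """
    return rate_kbps * packet_interval_ms / 8


rates = [32, 16, 8]  # hypothetical encoder settings, in Kbps
print(choose_encoding_rate(40, rates))   # 32
print(choose_encoding_rate(20, rates))   # 16
print(choose_encoding_rate(4, rates))    # None
print(payload_bytes_per_packet(32, 20))  # 80.0
```

With only 4 Kbps available the telephony application in this sketch gives up, whereas an elastic application such as file transfer would simply proceed at 4 Kbps.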

Figure 2.1-2 summarizes the reliability, bandwidth, and timing requirements of some popular and
emerging Internet applications.

                  Application              Data Loss        Bandwidth                  Time sensitive?
                  file transfer            no loss          elastic                    no
                  electronic mail          no loss          elastic                    no
                  Web documents            no loss          elastic                    no
                  real-time audio/video    loss-tolerant    audio: few Kbps to         yes: 100's of msec
                                                            video: 10's Kbps to 5
                  stored audio/video       loss-tolerant    same as interactive        yes: few seconds
                  interactive games        loss-tolerant    few Kbps to 10's Kbps      yes: 100's msecs
                  financial applications   required         elastic                    yes and no

                              Figure 2.1-2: Requirements of selected network applications.

Figure 2.1-2 outlines only a few of the key requirements of a few of the more popular Internet
applications. Our goal here is not to provide a complete classification, but simply to identify a few of the
most important axes along which network application requirements can be classified.

2.1.3 Services Provided by the Internet Transport Protocols
The Internet (and more generally TCP/IP networks) makes available two transport protocols to
applications, namely, UDP (User Datagram Protocol) and TCP (Transmission Control Protocol). When a
developer creates a new application for the Internet, one of the first decisions that the developer must
make is whether to use UDP or TCP. Each of these protocols offers a different service model to the
invoking applications.

TCP Services

The TCP service model includes a connection-oriented service and a reliable data transfer service. When
an application invokes TCP for its transport protocol, the application receives both of these services from
TCP.

     q   Connection-oriented service: TCP has the client and server exchange transport-layer control
         information with each other before the application-level messages begin to flow. This so-called
         handshaking procedure (that is part of the TCP protocol) alerts the client and server, allowing them
         to prepare for an onslaught of packets. After the handshaking phase, a TCP connection is said to
         exist between the sockets of the two processes. The connection is a full-duplex connection in that
         the two processes can send messages to each other over the connection at the same time. When the
         application is finished sending messages, it must tear down the connection. The service is referred
         to as a "connection-oriented" service rather than a "connection" service (or a "virtual circuit"
         service), because the two processes are connected in a very loose manner. In Chapter 3 we will
         discuss connection-oriented service in detail and examine how it is implemented.
     q   Reliable transport service: The communicating processes can rely on TCP to deliver all
         messages sent without error and in the proper order. When one side of the application passes a
         stream of bytes into a socket, it can count on TCP to deliver the same stream of data to the
         receiving socket, with no missing or duplicate bytes.
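
The two services just listed can be observed with a small loopback experiment in Python. This is a sketch under stated assumptions: both processes are modeled as threads on one machine, the function names are hypothetical, and the loopback interface stands in for a real network. The call that opens the connection triggers the handshake, and the receiver sees the same byte stream that the sender wrote, in order.

```python
import socket
import threading
from typing import List


def _receive_all(listener: socket.socket, received: List[bytes]) -> None:
    """Accept one connection and collect bytes until the sender closes it."""
    conn, _ = listener.accept()
    with conn:
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # empty read: the peer has torn down the connection
                break
            received.append(chunk)


def send_over_tcp(messages: List[bytes]) -> bytes:
    """Send messages over a loopback TCP connection; return the bytes received."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    received: List[bytes] = []
    receiver = threading.Thread(target=_receive_all, args=(listener, received))
    receiver.start()

    # Connection-oriented service: the handshake happens inside this call,
    # before any application data flows.
    client = socket.create_connection(("127.0.0.1", port), timeout=5)
    for message in messages:
        client.sendall(message)
    client.close()  # the application tears down the connection when finished

    receiver.join()
    listener.close()
    # TCP delivers a byte stream, so message boundaries are not preserved;
    # only the concatenated bytes are compared.
    return b"".join(received)


print(send_over_tcp([b"abc", b"def", b"ghi"]))  # b'abcdefghi'
```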

TCP also includes a congestion control mechanism, a service for the general welfare of the Internet rather
than for the direct benefit of the communicating processes. The TCP congestion control mechanism
throttles a process (client or server) when the network is congested. In particular, as we shall see in
Chapter 3, TCP congestion control attempts to limit each TCP connection to its fair share of network
bandwidth.

The throttling of the transmission rate can have a very harmful effect on real-time audio and video
applications that have minimum bandwidth requirements. Moreover, real-time applications are loss-tolerant
and do not need a fully reliable transport service. In fact, the TCP acknowledgments and
retransmissions that provide the reliable transport service (discussed in Chapter 3) can further slow down
the transmission rate of useful real-time data. For these reasons, developers of real-time applications
usually run their applications over UDP rather than TCP.

Having outlined the services provided by TCP, let us say a few words about the services that TCP does not
provide. First, TCP does not guarantee a minimum transmission rate. In particular, a sending process is not
permitted to transmit at any rate it pleases; instead the sending rate is regulated by TCP congestion control,
which may force the sender to send at a low average rate. Second, TCP does not provide any delay
guarantees. In particular, when a sending process passes a message into a TCP socket, the message will
eventually arrive at the receiving socket, but TCP guarantees absolutely no limit on how long the message
may take to get there. As many of us have experienced with the World Wide Wait, one can sometimes
wait tens of seconds or even minutes for TCP to deliver a message (containing, for example, an HTML
file) from Web server to Web client. In summary, TCP guarantees delivery of all data, but provides no
guarantees on the rate of delivery or on the delays experienced by individual messages.

UDP Services

UDP is a no-frills, lightweight transport protocol with a minimalist service model. UDP is connectionless,
so there is no handshaking before the two processes start to communicate. UDP provides an unreliable
data transfer service, that is, when a process sends a message into a UDP socket, UDP provides no
guarantee that the message will ever reach the receiving socket. Furthermore, messages that do arrive at
the receiving socket may arrive out of order. Returning to our houses/doors analogy for processes/sockets,
UDP is like having a long line of taxis waiting for passengers on the other side of the sender's door. When
a passenger (analogous to an application message) exits the house, it hops in one of the taxis. Some of the
taxis may break down, so they don't ever deliver the passenger to the receiving door; taxis may also take
different routes, so that passengers arrive at the receiving door out of order.

On the other hand, UDP does not include a congestion control mechanism, so a sending process can pump
data into a UDP socket at any rate it pleases. Although all the data may not make it to the receiving socket,
a large fraction of the data may arrive. Also, because UDP does not use acknowledgments or
retransmissions that can slow down the delivery of useful real-time data, developers of real-time
applications often choose to run their applications over UDP. Similar to TCP, UDP provides no guarantee
on delay. As many of us know, a taxi can be stuck in a traffic jam for a very long time (while the meter
continues to run!).
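
For contrast with TCP, the following Python sketch sends a few datagrams over UDP on the loopback interface. The function name and the half-second timeout are assumptions made for the example. Note that there is no connection setup before the first sendto call, and that the receiver simply stops waiting after the timeout, because UDP itself gives no indication of whether a datagram was lost.

```python
import socket
from typing import List


def send_over_udp(messages: List[bytes], timeout_s: float = 0.5) -> List[bytes]:
    """Send each message as one UDP datagram; return whatever arrives."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    receiver.settimeout(timeout_s)
    address = receiver.getsockname()

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for message in messages:
        # Connectionless: no handshake precedes the first datagram.
        sender.sendto(message, address)
    sender.close()

    received: List[bytes] = []
    try:
        while len(received) < len(messages):
            data, _ = receiver.recvfrom(2048)
            received.append(data)
    except socket.timeout:
        # Unreliable service: a missing datagram is never reported, so the
        # receiver can only give up after waiting.
        pass
    receiver.close()
    return received


sent = [b"one", b"two", b"three"]
arrived = send_over_udp(sent)
print(len(arrived), "of", len(sent), "datagrams arrived")
```

On a loopback interface loss and reordering are rare, so this run will usually report that everything arrived; the code is written so that it remains correct when that is not the case.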

         Application                Application-layer protocol           Underlying Transport Protocol
         electronic mail            SMTP [RFC 821]                       TCP
         remote terminal access     Telnet [RFC 854]                     TCP
         Web                        HTTP [RFC 2068]                      TCP
         file transfer              FTP [RFC 959]                        TCP
         remote file server         NFS [McKusik 1996]                   UDP or TCP
         streaming multimedia       proprietary (e.g., Real Networks)    UDP or TCP
         Internet telephony         proprietary (e.g., Vocaltec)         typically UDP

     Figure 2.1-3: Popular Internet applications, their application-layer protocols, and their underlying
                                            transport protocols.

Figure 2.1-3 indicates the transport protocols used by some popular Internet applications. We see that
email, remote terminal access, the Web and file transfer all use TCP. These applications have chosen TCP
primarily because TCP provides the reliable data transfer service, guaranteeing that all data will eventually
get to its destination. We also see that Internet telephony typically runs over UDP. Each side of an Internet
phone application needs to send data across the network at some minimum rate (see Figure 2.1-2); this is
more likely to be possible with UDP than with TCP. Also, Internet phone applications are loss-tolerant, so
they do not need the reliable data transfer service (and the acknowledgments and retransmissions that
implement the service) provided by TCP.

As noted earlier, neither TCP nor UDP offers timing guarantees. Does this mean that time-sensitive
applications cannot run in today's Internet? The answer is clearly no - the Internet has been hosting
time-sensitive applications for many years. These applications often work pretty well because they have been
designed to cope, to the greatest extent possible, with this lack of guarantee. We shall investigate several
of these design tricks in Chapter 6. Nevertheless, clever design has its limitations when delay is excessive,
as is often the case in the public Internet. In summary, today's Internet can often provide satisfactory
service to time-sensitive applications, but it cannot provide any timing or bandwidth guarantees. In
Chapter 6, we shall also discuss emerging Internet service models that provide new services, including
guaranteed delay service for time-sensitive applications.

2.1.4 Network Applications Covered in this Book
New public domain and proprietary Internet applications are being developed every day. Rather than
treating a large number of Internet applications in an encyclopedic manner, we have chosen to focus on a
small number of important and popular applications. In this chapter we discuss in some detail four popular
applications: the Web, file transfer, electronic mail, and directory service. We first discuss the Web, not
only because the Web is an enormously popular application, but also because its application-layer
protocol, HTTP, is relatively simple and illustrates many key principles of network protocols. We then
discuss file transfer, as it provides a nice contrast to HTTP and enables us to highlight some additional
principles. We discuss electronic mail, the Internet's first killer application. We shall see that modern
electronic mail makes use of not one, but of several, application-layer protocols. The Web, file transfer,
and electronic mail have common service requirements: they all require a reliable transfer service, none of
them have special timing requirements, and they all welcome an elastic bandwidth offering. The services
provided by TCP are largely sufficient for these three applications. The fourth application, Domain Name
System (DNS), provides a directory service for the Internet. Most users do not interact with DNS directly;
instead, users invoke DNS indirectly through other applications (including the Web, file transfer, and
electronic mail). DNS illustrates nicely how a distributed database can be implemented in the Internet.
None of the four applications discussed in this chapter are particularly time sensitive; we will defer our
discussion of such time-sensitive applications until Chapter 6.


[Cisco 1999] Cisco Systems Inc., "ATM Signaling and Addressing," July 1999.
[Gauthier 1999] L. Gauthier, C. Diot, J. Kurose, "End-to-end Transmission Control Mechanisms for
Multiparty Interactive Applications on the Internet," Proc. IEEE Infocom 99, April 1999.
[Fritz 1997] J. Fritz, "Demystifying ATM Addressing," Byte Magazine, December 1997.
[ITU 1997] International Telecommunications Union, "Recommendation E.164/I.331 - The international
public telecommunication numbering plan," May 1997.
[McKusik 1996] Marshall Kirk McKusick, Keith Bostic, Michael Karels, and John Quarterman, "The
Design and Implementation of the 4.4BSD Operating System," Addison-Wesley Publishing Company,
Inc. (0-201-54979-4), 1996. Chapter 9 of this text is entitled 'The Network File System' and is on-line at
[Ramjee 1994] R. Ramjee, J. Kurose, D. Towsley, H. Schulzrinne, "Adaptive Playout Mechanisms for
Packetized Audio Applications in Wide-Area Networks", Proc. IEEE Infocom 94.
[RFC 821] J.B. Postel, "Simple Mail Transfer Protocol," RFC 821, August 1982.
[RFC 854] J. Postel, J. Reynolds, "TELNET Protocol Specification," RFC 854, May 1983.
[RFC 959] J. Postel, J. Reynolds, "File Transfer Protocol (FTP)," RFC 959, Oct. 1985.
[RFC 1035] P. Mockapetris, "Domain Names - Implementation and Specification", RFC 1035, Nov. 1987.
[RFC 1700] J. Reynolds, J. Postel, "Assigned Numbers," RFC 1700, Oct. 1994.
[RFC 2068] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, and T. Berners-Lee, "Hypertext Transfer
Protocol -- HTTP/1.1," RFC 2068, January 1997.

Copyright Keith W. Ross and James F. Kurose 1996-2000. All rights reserved.
  The HyperText Transfer Protocol

                                2.2 The World Wide Web: HTTP
In the 1980s the Internet was used by researchers, academics and university students to log in to remote hosts, to transfer files from local
hosts to remote hosts and vice versa, to receive and send news, and to receive and send electronic mail. Although these applications were
(and continue to be) extremely useful, the Internet was essentially unknown outside the academic and research communities. Then in the early
1990s the Internet's killer application arrived on the scene -- the World Wide Web. The Web is the Internet application that caught the
general public's eye. It is dramatically changing how people interact inside and outside their work environments. It has spawned thousands
of start-up companies. It has elevated the Internet from just one of many data networks (including online networks such as Prodigy,
America On Line and Compuserve, national data networks such as Minitel/Transpac in France, and private X.25 and frame relay networks)
to essentially the one and only data network.

History is sprinkled with the arrival of electronic communication technologies that have had major societal impacts. The first such
technology was the telephone, invented in the 1870s. The telephone allowed two persons to orally communicate in real-time without being
in the same physical location. It had a major impact on society -- both good and bad. The next electronic communication technology was
broadcast radio/television, which arrived in the 1920s and 1930s. Broadcast radio/television allowed people to receive vast quantities of
audio and video information. It also had a major impact on society -- both good and bad. The third major communication technology that
has changed the way people live and work is the Web. Perhaps what appeals the most to users about the Web is that it is on demand. Users
receive what they want, when they want it. This is unlike broadcast radio and television, which force users to "tune in" when the content
provider makes the content available. In addition to being on demand, the Web has many other wonderful features that people love and
cherish. It is enormously easy for any individual to make any information available over the Web; everyone can become a publisher at
extremely low cost. Hyperlinks and search engines help us navigate through an ocean of Web sites. Graphics and animated graphics
stimulate our senses. Forms, Java applets, Active X components, as well as many other devices enable us to interact with pages and sites.
And more and more, the Web provides a menu interface to vast quantities of audio and video material stored in the Internet, audio and
video that can be accessed on demand.

2.2.1 Overview of HTTP
The Hypertext Transfer Protocol (HTTP), the Web's application-layer protocol, is at the heart of the Web. HTTP is implemented in two
programs: a client program and server program. The client program and server programs, executing on different end systems, talk to each
other by exchanging HTTP messages. HTTP defines the structure of these messages and how the client and server exchange the messages.
Before explaining HTTP in detail, it is useful to review some Web terminology.

A Web page (also called a document) consists of objects. An object is simply a file -- such as an HTML file, a JPEG image, a GIF image, a
Java applet, an audio clip, etc. -- that is addressable by a single URL. Most Web pages consist of a base HTML file and several referenced
objects. For example, if a Web page contains HTML text and five JPEG images, then the Web page has six objects: the base HTML file
plus the five images. The base HTML file references the other objects in the page with the objects' URLs. Each URL has two components:
the host name of the server that houses the object and the object's path name. For example, the URL


has for a host name and /someDepartment/picture.gif for a path name. A browser is a user agent
for the Web; it displays to the user the requested Web page and provides numerous navigational and configuration features. Web browsers
also implement the client side of HTTP. Thus, in the context of the Web, we will interchangeably use the words "browser" and "client".
Popular Web browsers include Netscape Communicator and Microsoft Explorer. A Web server houses Web objects, each addressable by
a URL. Web servers also implement the server side of HTTP. Popular Web servers include Apache, Microsoft Internet Information Server,
and the Netscape Enterprise Server. (Netcraft provides a nice survey of Web server penetration [Netcraft].)
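
The split of a URL into a host name and a path name can be illustrated with Python's standard urllib.parse module. The URL below is a hypothetical one chosen for the example (www.example.com is a reserved example domain), and the helper name url_components is likewise an assumption.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def url_components(url: str) -> Tuple[Optional[str], str]:
    """Return the (host name, path name) pair of a URL."""
    parts = urlsplit(url)
    return parts.hostname, parts.path


# Hypothetical URL, used only to show the two components.
print(url_components("http://www.example.com/someDepartment/picture.gif"))
# ('www.example.com', '/someDepartment/picture.gif')
```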

HTTP defines how Web clients (i.e., browsers) request Web pages from servers (i.e., Web servers) and how servers transfer Web pages to
clients. We discuss the interaction between client and server in detail below, but the general idea is illustrated in Figure 2.2-1. When a user
requests a Web page (e.g., clicks on a hyperlink), the browser sends HTTP request messages for the objects in the page to the server. The
server receives the requests and responds with HTTP response messages that contain the objects. Through 1997 essentially all browsers
and Web servers implemented version HTTP/1.0, which is defined in [RFC 1945]. Beginning in 1998 Web servers and browsers began to
implement version HTTP/1.1, which is defined in [RFC 2068]. HTTP/1.1 is backward compatible with HTTP/1.0; a Web server running
1.1 can "talk" with a browser running 1.0, and a browser running 1.1 can "talk" with a server running 1.0.

                                                Figure 2.2-1: HTTP request-response behavior

Both HTTP/1.0 and HTTP/1.1 use TCP as their underlying transport protocol (rather than running on top of UDP). The HTTP client first
initiates a TCP connection with the server. Once the connection is established, the browser and the server processes access TCP through
their socket interfaces. As described in Section 2.1, on the client side the socket interface is the "door" between the client process and the
TCP connection; on the server side it is the "door" between the server process and the TCP connection. The client sends HTTP request
messages into its socket interface and receives HTTP response messages from its socket interface. Similarly, the HTTP server receives
request messages from its socket interface and sends response messages into the socket interface. Once the client sends a message into its
socket interface, the message is "out of the client's hands" and is "in the hands of TCP". Recall from Section 2.1 that TCP provides a
reliable data transfer service to HTTP. This implies that each HTTP request message emitted by a client process eventually arrives intact
at the server; similarly, each HTTP response message emitted by the server process eventually arrives intact at the client. Here we see one
of the great advantages of a layered architecture - HTTP need not worry about lost data, or the details of how TCP recovers from loss or
reordering of data within the network. That is the job of TCP and the protocols in the lower layers of the protocol stack.
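
The request-response exchange through the socket "doors" can be tried out on a single machine with the Python sketch below. It is an illustration under stated assumptions, not a description of any production server: the tiny server comes from Python's standard http.server module, it runs in a thread on the loopback interface, and the page contents and the names OneObjectHandler and fetch_once are hypothetical. The client writes an HTTP/1.0 request into its socket and reads the response from the same socket until the server closes the connection.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body>hello</body></html>"  # hypothetical object to serve


class OneObjectHandler(BaseHTTPRequestHandler):
    """Server side: answer every GET with the same small HTML object."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_once(path: str = "/someDepartment/home.index") -> bytes:
    """Client side: send one request into a TCP socket and read the reply."""
    server = HTTPServer(("127.0.0.1", 0), OneObjectHandler)
    worker = threading.Thread(target=server.handle_request)  # serve one request
    worker.start()
    port = server.server_address[1]

    # The client first initiates the TCP connection, then uses the socket.
    with socket.create_connection(("127.0.0.1", port), timeout=5) as sock:
        request = f"GET {path} HTTP/1.0\r\nHost: 127.0.0.1\r\n\r\n"
        sock.sendall(request.encode("ascii"))  # message goes "out the door"
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection after responding
                break
            chunks.append(data)

    worker.join()
    server.server_close()
    return b"".join(chunks)


response = fetch_once()
print(response.split(b"\r\n", 1)[0].decode("ascii"))  # the status line
```

Neither side contains any code for retransmission or reordering; as the text notes, that work is left entirely to TCP.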

TCP also employs a congestion control mechanism which we shall discuss in detail in Chapter 3. We only mention here that this
mechanism forces each new TCP connection to initially transmit data at a relatively slow rate, but then allows each connection to ramp up
to a relatively high rate when the network is uncongested. The initial slow-transmission phase is referred to as slow start.

It is important to note that the server sends requested files to clients without storing any state information about the client. If a particular
client asks for the same object twice in a period of a few seconds, the server does not respond by saying that it just served the object to the
client; instead, the server resends the object, as it has completely forgotten what it did earlier. Because an HTTP server maintains no
information about the clients, HTTP is said to be a stateless protocol.

2.2.2 Non-Persistent and Persistent Connections
HTTP can use both non-persistent connections and persistent connections. Non-persistent connections are the default mode for HTTP/1.0.
Conversely, persistent connections are the default mode for HTTP/1.1.

Non-Persistent Connections

Let us walk through the steps of transferring a Web page from server to client for the case of non-persistent connections. Suppose the page
consists of a base HTML file and 10 JPEG images, and that all 11 of these objects reside on the same server. Suppose the URL for the base
HTML file ends in the path /someDepartment/home.index.

Here is what happens:

    1. The HTTP client initiates a TCP connection to the server. Port number 80 is used as the default port
       number at which the HTTP server will be listening for HTTP clients that want to retrieve documents using HTTP.
    2. The HTTP client sends an HTTP request message into the socket associated with the TCP connection that was established in step 1.
       The request message either includes the entire URL or simply the path name /someDepartment/home.index. (We will
       discuss the HTTP messages in some detail below.)
    3. The HTTP server receives the request message via the socket associated with the connection that was established in step 1, retrieves
       the object /someDepartment/home.index from its storage (RAM or disk), encapsulates the object in an HTTP response
       message, and sends the response message into the TCP connection.
    4. The HTTP server tells TCP to close the TCP connection. (But TCP doesn't actually terminate the connection until the client has
       received the response message intact.)
    5. The HTTP client receives the response message. The TCP connection terminates. The message indicates that the encapsulated
       object is an HTML file. The client extracts the file from the response message, parses the HTML file and finds references to the ten
       JPEG objects.
    6. The first four steps are then repeated for each of the referenced JPEG objects.

As the browser receives the Web page, it displays the page to the user. Two different browsers may interpret (i.e., display to the user) a
Web page in somewhat different ways. HTTP has nothing to do with how a Web page is interpreted by a client. The HTTP specifications
([RFC 1945] and [RFC 2068]) only define the communication protocol between the client HTTP program and the server HTTP program.

The steps above use non-persistent connections because each TCP connection is closed after the server sends the object -- the connection
does not persist for other objects. Note that each TCP connection transports exactly one request message and one response message. Thus,
in this example, when a user requests the Web page, 11 TCP connections are generated.

In the steps described above, we were intentionally vague about whether the client obtains the 10 JPEGs over ten serial TCP connections,
or whether some of the JPEGs are obtained over parallel TCP connections. Indeed, users can configure modern browsers to control the
degree of parallelism. In their default modes, most browsers open five to ten parallel TCP connections, and each of these connections
handles one request-response transaction. If the user prefers, the maximum number of parallel connections can be set to one, in which case
the ten connections are established serially. As we shall see in the next chapter, the use of parallel connections shortens the response time
since it cuts out some of the RTT and slow-start delays. Parallel TCP connections can also allow the requesting browser to obtain more than
its fair share of the end-to-end transmission bandwidth.

Before continuing, let's do a back-of-the-envelope calculation to estimate the amount of time from when a client requests the base HTML
file until the file is received by the client. To this end we define the round-trip time RTT, which is the time it takes for a small packet to
travel from client to server and then back to the client. The RTT includes packet propagation delays, packet queuing delays in intermediate
routers and switches, and packet processing delays. (These delays were discussed in Section 1.6.) Now consider what happens when a user
clicks on a hyperlink. This causes the browser to initiate a TCP connection between the browser and the Web server; this involves a "three-
way handshake" -- the client sends a small TCP message to the server, the server acknowledges and responds with a small message, and
finally the client acknowledges back to the server. One RTT elapses after the first two parts of the three-way handshake. After completing
the first two parts of the handshake, the client sends the HTTP request message into the TCP connection, and TCP "piggybacks" the last
acknowledgment (the third part of the three-way handshake) onto the request message. Once the request message arrives at the server, the
server sends the HTML file into the TCP connection. This HTTP request/response eats up another RTT. Thus, roughly, the total response
time is 2RTT plus the transmission time at the server of the HTML file.
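
The estimate above can be written out as a short calculation. The following Python sketch is only illustrative: the function name and the sample RTT, file size and link rate are assumptions chosen for the example, not values taken from the text.

```python
def nonpersistent_response_time(rtt_s, file_bits, rate_bps):
    """Rough response time for one object fetched over a fresh TCP connection.

    One RTT covers the first two parts of the three-way handshake, a second
    RTT covers the request and the start of the response, and the remainder
    is the time the server needs to transmit the file.
    """
    transmission_time_s = file_bits / rate_bps
    return 2 * rtt_s + transmission_time_s


# Hypothetical numbers: 100 ms RTT, 100 Kbit HTML file, 1 Mbps server link.
estimate = nonpersistent_response_time(rtt_s=0.1, file_bits=100_000, rate_bps=1_000_000)
print(f"{estimate:.2f} seconds")
```

Like the estimate in the text, this ignores slow start and any queuing at the server.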

Persistent Connections

Non-persistent connections have some shortcomings. First, a brand new connection must be established and maintained for each requested
object. For each of these connections, TCP buffers must be allocated and TCP variables must be kept in both the client and server. This can
place a serious burden on the Web server, which may be serving requests from hundreds of different clients simultaneously. Second, as we
just described, each object suffers two RTTs -- one RTT to establish the TCP connection and one RTT to request and receive an object.
Finally, each object suffers from TCP slow start because every TCP connection begins with a TCP slow-start phase. However, the
accumulation of RTT and slow start delays is partially alleviated by the use of parallel TCP connections.

With persistent connections, the server leaves the TCP connection open after sending responses. Subsequent requests and responses
between the same client and server can be sent over the same connection. In particular, an entire Web page (in the example above, the base
HTML file and the ten images) can be sent over a single persistent TCP connection; moreover, multiple Web pages residing on the same
server can be sent over one persistent TCP connection. Typically, the HTTP server closes the connection when it isn’t used for a certain
time (the timeout interval), which is often configurable. There are two versions of persistent connections: without pipelining and with
pipelining. For the version without pipelining, the client issues a new request only when the previous response has been received. In this
case, each of the referenced objects (the ten images in the example above) experiences one RTT in order to request and receive the object.
Although this is an improvement over the two RTTs of non-persistent connections, the RTT delay can be further reduced with pipelining. Another
disadvantage of not pipelining is that after the server sends an object over the persistent TCP connection, the connection hangs -- does
nothing -- while it waits for another request to arrive. This hanging wastes server resources.

The default mode of HTTP/1.1 uses persistent connections with pipelining. In this case, the HTTP client issues a request as soon as it
encounters a reference. Thus the HTTP client can make back-to-back requests for the referenced objects. When the server receives the
requests, it can send the objects back-to-back. If all the requests are sent back-to-back and all the responses are sent back-to-back, then
only one RTT is expended for all the referenced objects (rather than one RTT per referenced object when pipelining isn't used).
Furthermore, the pipelined TCP connection hangs for a smaller fraction of time. In addition to reducing RTT delays, persistent connections
(with or without pipelining) have a smaller slow-start delay than non-persistent connections. This is because, after sending the first
object, the persistent server does not have to send the next object at the initial slow rate since it continues to use the same TCP connection.
Instead, the server can pick up at the rate where the first object left off. We shall quantitatively compare the performance of non-persistent
and persistent connections in the homework problems of Chapter 3. The interested reader is also encouraged to see [Heidemann 1997] and
[Nielsen 1997].
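
As a rough illustration of the RTT accounting in this subsection, the Python sketch below counts RTTs for a page made of a base file plus a number of referenced objects. It is a simplification under stated assumptions: transmission times, slow start and parallel connections are ignored, non-persistent connections are opened serially, and the function and mode names are made up for the example.

```python
def rtts_for_page(num_referenced, mode):
    """Count the RTTs needed to fetch a base file plus its referenced objects."""
    base = 2  # one RTT to set up TCP, one to request and receive the base file
    if mode == "non-persistent":
        return base + 2 * num_referenced      # a new connection for every object
    if mode == "persistent":
        return base + 1 * num_referenced      # one request/response RTT per object
    if mode == "pipelined":
        return base + (1 if num_referenced else 0)  # requests sent back-to-back
    raise ValueError(f"unknown mode: {mode}")


for mode in ("non-persistent", "persistent", "pipelined"):
    print(mode, rtts_for_page(10, mode))
```

For the page with ten images used in the example, this counts 22, 12 and 3 RTTs respectively.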

2.2.3 HTTP Message Format
The HTTP/1.0 and HTTP/1.1 specifications ([RFC 1945] and [RFC 2068], respectively) define the HTTP message formats. There are two types of HTTP
messages, request messages and response messages, both of which are discussed below.

HTTP Request Message

Below we provide a typical HTTP request message:

       GET /somedir/page.html HTTP/1.1
       Connection: close
       User-agent: Mozilla/4.0
       Accept: text/html, image/gif, image/jpeg
       Accept-language: fr

    (extra carriage return, line feed)

We can learn a lot by taking a good look at this simple request message. First of all, we see that the message is written in ordinary ASCII
text, so that your ordinary computer-literate human being can read it. Second, we see that the message consists of five lines, each followed
by a carriage return and a line feed. The last line is followed by an additional carriage return and line feed. Although this particular request
message has five lines, a request message can have many more lines or as few as one line. The first line of an HTTP request message is
called the request line; the subsequent lines are called the header lines. The request line has three fields: the method field, the URL field,
and the HTTP version field. The method field can take on several different values, including GET, POST, and HEAD. The great majority of
HTTP request messages use the GET method. The GET method is used when the browser requests an object, with the requested object
identified in the URL field. In this example, the browser is requesting the object /somedir/page.html. (The browser doesn't have to
specify the host name in the URL field since the TCP connection is already connected to the host (server) that serves the requested file.)
The version is self-explanatory; in this example, the browser implements version HTTP/1.1.

Now let's look at the header lines in the example. By including the Connection:close header line, the browser is telling the server
that it doesn't want to use persistent connections; it wants the server to close the connection after sending the requested object. Thus the
browser that generated this request message implements HTTP/1.1 but it doesn't want to bother with persistent connections. The
User-agent: header line specifies the user agent, i.e., the browser type that is making the request to the server. Here the user agent is
Mozilla/4.0, a Netscape browser. This header line is useful because the server can actually send different versions of the same object
to different types of user agents. (Each of the versions is addressed by the same URL.) The Accept: header line tells the server the type
of objects the browser is prepared to accept. In this case, the client is prepared to accept HTML text, a GIF image or a JPEG image. If the
file /somedir/page.html contains a Java applet (and who says it can't!), then the server shouldn't send the file, since the browser
cannot handle that object type. Finally, the Accept-language: header indicates that the user prefers to receive a French version of the
object, if such an object exists on the server; otherwise, the server should send its default version.
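
To make the structure of the request message concrete, here is a minimal Python sketch that splits a request into its request line and header lines. The function name is hypothetical, and the Accept-language value "fr" is an assumption standing in for the French preference described above; a real server would need to handle malformed input far more carefully.

```python
def parse_request(raw):
    """Split an HTTP request message into its request line and header lines."""
    head, _, body = raw.partition("\r\n\r\n")    # the blank line ends the headers
    lines = head.split("\r\n")
    method, url, version = lines[0].split(" ")    # the three request-line fields
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, url, version, headers, body


example = (
    "GET /somedir/page.html HTTP/1.1\r\n"
    "Connection: close\r\n"
    "User-agent: Mozilla/4.0\r\n"
    "Accept: text/html, image/gif, image/jpeg\r\n"
    "Accept-language: fr\r\n"
    "\r\n"
)
method, url, version, headers, body = parse_request(example)
print(method, url, version)
print(headers["connection"])
```

Header names are lower-cased here only so that lookups do not depend on how the browser capitalized them.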

Having looked at an example, let us now look at the general format for a request message, as shown in Figure 2.2-2:

                                             Figure 2.2-2: General format of a request message

We see that the general format follows closely the example request message above. You may have noticed, however, that after the header
lines (and the additional carriage return and line feed) there is an "Entity Body". The Entity Body is not used with the GET method, but is
used with the POST method. The HTTP client uses the POST method when the user fills out a form -- for example, when a user gives
search words to a search engine such as Yahoo. With a POST message, the user is still requesting a Web page from the server, but the
specific contents of the Web page depend on what the user wrote in the form fields. If the value of the method field is POST, then the
entity body contains what the user typed into the form fields. The HEAD method is similar to the GET method. When a server receives a
request with the HEAD method, it responds with an HTTP message but it leaves out the requested object. The HEAD method is often used
by HTTP server developers for debugging.
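
The role of the entity body in a POST request can be sketched in a few lines of Python. This is an illustration only: the path /search and the form field name q are hypothetical, and encoding the fields as application/x-www-form-urlencoded is an assumption about how the browser packages the form, which the text above does not specify.

```python
from urllib.parse import urlencode


def build_post_request(path, form_fields):
    """Build a POST request whose entity body carries the form fields."""
    body = urlencode(form_fields)               # e.g. "q=computer+networks"
    header_lines = [
        f"POST {path} HTTP/1.0",
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body.encode('ascii'))}",
    ]
    # A blank line separates the header lines from the entity body.
    return "\r\n".join(header_lines) + "\r\n\r\n" + body


request = build_post_request("/search", {"q": "computer networks"})
print(request)
```

A GET request built the same way would simply end after the blank line, with no entity body.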

HTTP Response Message

Below we provide a typical HTTP response message. This response message could be the response to the example request message just discussed:

       HTTP/1.1 200 OK
       Connection: close
       Date: Thu, 06 Aug 1998 12:00:15 GMT
       Server: Apache/1.3.0 (Unix)
       Last-Modified: Mon, 22 Jun 1998 09:23:24 GMT
       Content-Length: 6821
       Content-Type: text/html
       data data data data data ...

Let's take a careful look at this response message. It has three sections: an initial status line, six header lines, and then the entity body.
The entity body is the meat of the message -- it contains the requested object itself (represented by data data data data data ...). The status
line has three fields: the protocol version field, a status code, and a corresponding status message. In this example, the status line indicates
that the server is using HTTP/1.1 and that everything is OK (i.e., the server has found, and is sending, the requested object).

Now let's look at the header lines. The server uses the Connection: close header line to tell the client that it is going to close the
TCP connection after sending the message. The Date: header line indicates the time and date when the HTTP response was created and
sent by the server. Note that this is not the time when the object was created or last modified; it is the time when the server retrieves the
object from its file system, inserts the object into the response message and sends the response message. The Server: header line
indicates that the message was generated by an Apache Web server; it is analogous to the User-agent: header line in the HTTP request
message. The Last-Modified: header line indicates the time and date when the object was created or last modified. The
Last-Modified: header, which we cover in more detail below, is critical for object caching, both in the local client and in network cache
(a.k.a. proxy) servers. The Content-Length: header line indicates the number of bytes in the object being sent. The
Content-Type: header line indicates that the object in the entity body is HTML text. (The object type is officially indicated by the
Content-Type: header and not by the file extension.)
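
The three sections of the response message can be pulled apart with a short Python sketch. The function name is hypothetical, and the entity body here is just the placeholder text from the example, so its length does not match the Content-Length value shown.

```python
def parse_response(raw):
    """Split an HTTP response into status line fields, headers and entity body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("ascii").split("\r\n")
    version, code, phrase = lines[0].split(" ", 2)   # the phrase may contain spaces
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")         # split at the first colon only
        headers[name.strip().lower()] = value.strip()
    return version, int(code), phrase, headers, body


example = (
    b"HTTP/1.1 200 OK\r\n"
    b"Connection: close\r\n"
    b"Date: Thu, 06 Aug 1998 12:00:15 GMT\r\n"
    b"Server: Apache/1.3.0 (Unix)\r\n"
    b"Last-Modified: Mon, 22 Jun 1998 09:23:24 GMT\r\n"
    b"Content-Length: 6821\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"data data data data data ..."
)
version, code, phrase, headers, body = parse_response(example)
print(version, code, phrase)
print(headers["content-type"])
```

The entity body is kept as bytes because, unlike the status line and headers, it need not be ASCII text.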

Note that if the server receives an HTTP/1.0 request, it will not use persistent connections, even if it is an HTTP/1.1 server. Instead the
HTTP/1.1 server will close the TCP connection after sending the object. This is necessary because an HTTP/1.0 client expects the server to
close the connection.

                                             Figure 2.2-3: General format of a response message

Having looked at an example, let us now examine the general format of a response message, which is shown in Figure 2.2-3. This general
format of the response message matches the previous example of a response message. Let's say a few additional words about status codes
and their phrases. The status code and associated phrase indicate the result of the request. Some common status codes and associated
phrases include:

    •   200 OK: Request succeeded and the information is returned in the response.
    •   301 Moved Permanently: Requested object has been permanently moved; new URL is specified in Location: header of
        the response message. The client software will automatically retrieve the new URL.
    •   400 Bad Request: A generic error code indicating that the request could not be understood by the server.
    •   404 Not Found: The requested document does not exist on this server.
    •   505 HTTP Version Not Supported: The requested HTTP protocol version is not supported by the server.
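
For readers who want to look up the phrase that goes with a status code, Python's standard library carries a table of them in http.HTTPStatus. The short sketch below uses it for the codes listed above; the helper name status_line is made up for the example.

```python
from http import HTTPStatus


def status_line(code, version="HTTP/1.1"):
    """Build a status line from a numeric code using the standard phrase."""
    return f"{version} {code} {HTTPStatus(code).phrase}"


for code in (200, 301, 400, 404, 505):
    print(status_line(code))
```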

How would you like to see a real HTTP response message? This is very easy to do! First Telnet into your favorite WWW server. Then
type in a one-line request message for some object that is housed on the server. For example, if you can logon to a Unix machine, type:

        telnet 80
        GET /~ross/index.html HTTP/1.0

(Hit the carriage return twice after typing the second line.) This opens a TCP connection to port 80 of the host and
then sends the HTTP GET command. You should see a response message that includes the base HTML file of Professor Ross's homepage.
If you'd rather just see the HTTP message lines and not receive the object itself, replace GET with HEAD. Finally, replace /~ross/index.html
with /~ross/banana.html and see what kind of response message you get.
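
The same experiment can be done without Telnet. The Python sketch below opens a TCP connection, sends a one-line HTTP/1.0 request followed by a blank line, and reads until the server closes the connection. The function names are hypothetical, and the host in the usage comment is a placeholder to be replaced with a server of your choice.

```python
import socket


def build_request(path, method="GET"):
    """Return a one-line HTTP/1.0 request followed by the blank line."""
    return f"{method} {path} HTTP/1.0\r\n\r\n".encode("ascii")


def fetch_raw(host, path, port=80, method="GET"):
    """Send the request over a TCP connection and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(path, method))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:              # the server has closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)


# Example use (placeholder host name):
# print(fetch_raw("www.example.com", "/index.html", method="HEAD").decode("latin-1"))
```

Passing method="HEAD" returns only the status line and header lines, as described above.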

In this section we discussed a number of header lines that can be used within HTTP request and response messages. The HTTP
specification (especially HTTP/1.1) defines many, many more header lines that can be inserted by browsers, Web servers and network
cache servers. We have only covered a small fraction of the totality of header lines. We will cover a few more below and another small
fraction when we discuss network Web caching at the end of this chapter. A readable and comprehensive discussion of HTTP headers and
status codes is given in [Luotonen 1998]. An excellent introduction to the technical issues surrounding the Web is [Yeager 1996].

How does a browser decide which header lines it includes in a request message? How does a Web server decide which header lines it
includes in a response message? A browser will generate header lines as a function of the browser type and version (e.g., an HTTP/1.0
browser will not generate any 1.1 header lines), user configuration of browser (e.g., preferred language) and whether the browser currently
has a cached, but possibly out-of-date, version of the object. Web servers behave similarly: there are different products, versions, and
configurations, all of which influence which header lines are included in response messages.

2.2.4 User-Server Interaction: Authentication and Cookies
We mentioned above that an HTTP server is stateless. This simplifies server design, and has permitted engineers to develop very high-
performing Web servers. However, it is often desirable for a Web site to identify users, either because the server wishes to restrict user
access or because it wants to serve content as a function of the user identity. HTTP provides two mechanisms to help a server identify a
user: authentication and cookies.

Authentication


Many sites require users to provide a username and a password in order to access the documents housed on the server. This requirement is
referred to as authentication. HTTP provides special status codes and headers to help sites perform authentication. Let us walk through an
example to get a feel for how these special status codes and headers work. Suppose a client requests an object from a server, and the server
requires user authorization.

    1. The client first sends an ordinary request message with no special header lines.
    2. The server then responds with an empty entity body and with a 401 Authorization Required status code. In this response
       message the server includes the WWW-Authenticate: header, which specifies the details about how to perform authentication.
       (Typically, it indicates that the user needs to provide a username and a password.)
    3. The client receives the response message and prompts the user for a username and password. The client resends the request
       message, but this time includes an Authorization: header line, which includes the username and password.

After obtaining the first object, the client continues to send the username and password in subsequent requests for objects on the server.
(This typically continues until the client closes his browser. However, while the browser remains open, the username and password are
cached, so the user is not prompted for a username and password for each object it requests!) In this manner, the site can identify the user
for every request.
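
As a concrete illustration of step 3, the Python sketch below builds an Authorization: header line. It assumes the Basic scheme, in which the username and password are joined with a colon and base64-encoded; the text above does not name a scheme, so treat that choice, the function name and the sample credentials as assumptions.

```python
import base64


def basic_authorization_header(username, password):
    """Build an Authorization: header line using the Basic scheme."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Authorization: Basic {token}"


header = basic_authorization_header("alice", "secret")
print(header)
```

Note that base64 is an encoding, not encryption: anyone who observes the header can decode it and recover the password.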

We will see in Chapter 7 that HTTP performs a rather weak form of authentication, one that would not be difficult to break. We will study
more secure and robust authentication schemes later in Chapter 7.

Cookies


Cookies are an alternative mechanism for sites to keep track of users. They are defined in RFC 2109. Some Web sites use cookies and
others don't. Let's walk through an example. Suppose a client contacts a Web site for the first time, and this site uses cookies. The server’s
response will include a Set-cookie: header. Often this header line contains an identification number generated by the Web server. For
example, the header line might be:

        Set-cookie: 1678453

When the HTTP client receives the response message, it sees the Set-cookie: header and identification number. It then appends a
line to a special cookie file that is stored in the client machine. This line typically includes the host name of the server and the user's associated
identification number. In subsequent requests to the same server, say one week later, the client includes a Cookie: request header, and
this header line specifies the identification number for that server. In the current example, the request message includes the header line:

        Cookie: 1678453

In this manner, the server does not know the username of the user, but the server does know that this user is the same user that made a
specific request one week ago.
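
The client-side bookkeeping just described can be sketched as a small Python class. It is a toy model under stated assumptions: the cookie file is held in memory as a dictionary, headers are passed as (name, value) pairs, and the class name and the host www.example.com are hypothetical.

```python
class CookieFile:
    """Toy stand-in for the client's cookie file: host name -> identification number."""

    def __init__(self):
        self._ids = {}

    def handle_response(self, host, response_headers):
        """Record the identification number if the response carries Set-cookie:."""
        for name, value in response_headers:
            if name.lower() == "set-cookie":
                self._ids[host] = value.strip()

    def request_headers(self, host):
        """Return the Cookie: header to add to a request for this host, if any."""
        if host in self._ids:
            return [("Cookie", self._ids[host])]
        return []


cookies = CookieFile()
cookies.handle_response("www.example.com", [("Set-cookie", "1678453")])
print(cookies.request_headers("www.example.com"))
```

Because the lookup is keyed by host name, the identification number is sent back only to the server that set it.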

Web servers use cookies for many different purposes:

    •   If a server requires authentication but doesn't want to hassle a user with a username and password prompt every time the user visits
        the site, it can set a cookie.
    •   If a server wants to remember a user's preferences so that it can provide targeted advertising during subsequent visits, it can set a
        cookie.
    •   If a user is shopping at a site (e.g., buying several CDs), the server can use cookies to keep track of the items that the user is
        purchasing, i.e., to create a virtual shopping cart.

We mention, however, that cookies pose problems for mobile users who access the same site from different machines. The site will treat
the same user as a different user for each different machine used. We conclude by pointing the reader to the page Persistent Client State
HTTP Cookies, which provides an in-depth but readable introduction to cookies. We also recommend Cookies Central, which includes
extensive information on the cookie controversy.

2.2.5 The Conditional GET
By storing previously retrieved objects, Web caching can reduce object-retrieval delays and diminish the amount of Web traffic sent over
the Internet. Web caches can reside in a client or in an intermediate network cache server. We will discuss network caching at the end of
this chapter. In this subsection, we restrict our attention to client caching.

Although Web caching can reduce user-perceived response times, it introduces a new problem -- a copy of an object residing in the cache
may be stale. In other words, the object housed in the Web server may have been modified since the copy was cached at the client.
Fortunately, HTTP has a mechanism that allows the client to employ caching while still ensuring that all objects passed to the browser are
up-to-date. This mechanism is called the conditional GET. An HTTP request message is a so-called conditional GET message if (i) the
request message uses the GET method and (ii) the request message includes an If-Modified-Since: header line.

To illustrate how the conditional GET operates, let's walk through an example. First, a browser requests an uncached object from some
Web server:

        GET /fruit/kiwi.gif HTTP/1.0
        User-agent: Mozilla/4.0
        Accept: text/html, image/gif, image/jpeg

Second, the Web server sends a response message with the object to the client:

        HTTP/1.0 200 OK
        Date: Wed, 12 Aug 1998 15:39:29
        Server: Apache/1.3.0 (Unix)
        Last-Modified: Mon, 22 Jun 1998 09:23:24
        Content-Type: image/gif
        data data data data data ...

The client displays the object to the user but also saves the object in its local cache. Importantly, the client also caches the last-modified
date along with the object. Third, one week later, the user requests the same object and the object is still in the cache. Since this object may
have been modified at the Web server in the past week, the browser performs an up-to-date check by issuing a conditional GET.
Specifically, the browser sends

        GET /fruit/kiwi.gif HTTP/1.0
        User-agent: Mozilla/4.0
        Accept: text/html, image/gif, image/jpeg
        If-modified-since: Mon, 22 Jun 1998 09:23:24

Note that the value of the If-modified-since: header line is exactly equal to the value of the Last-Modified: header line that
was sent by the server one week ago. This conditional GET is telling the server to only send the object if the object has been modified
since the specified date. Suppose the object has not been modified since 22 Jun 1998 09:23:24. Then, fourth, the Web server sends
a response message to the client:

        HTTP/1.0 304 Not Modified
        Date: Wed, 19 Aug 1998 15:39:29
        Server: Apache/1.3.0 (Unix)
        (empty entity body)

We see that in response to the conditional GET, the Web server still sends a response message, but it doesn't bother to include the
requested object in the response message. Including the requested object would only waste bandwidth and increase user-perceived response
time, particularly if the object is large (such as a high resolution image). Note that this last response message has in the status line 304
Not Modified, which tells the client that it can go ahead and use its cached copy of the object.
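
The server's side of this exchange comes down to one date comparison, sketched below in Python. The function names are hypothetical, and the date format is assumed to be the one used in the example headers; a production server would accept several date formats.

```python
from datetime import datetime

HTTP_DATE = "%a, %d %b %Y %H:%M:%S"    # format used in the example headers


def parse_http_date(value):
    """Parse a date such as 'Mon, 22 Jun 1998 09:23:24' (a trailing ' GMT' is ignored)."""
    return datetime.strptime(value.replace(" GMT", "").strip(), HTTP_DATE)


def conditional_get_status(if_modified_since, last_modified):
    """Return 304 if the object is unchanged since the given date, else 200."""
    if parse_http_date(last_modified) <= parse_http_date(if_modified_since):
        return 304    # Not Modified: reply with an empty entity body
    return 200        # OK: reply with the object itself


print(conditional_get_status("Mon, 22 Jun 1998 09:23:24", "Mon, 22 Jun 1998 09:23:24"))
```

With the dates from the example the object is unchanged, so the sketch selects the 304 response.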

2.2.6 Web Caches
A Web cache -- also called a proxy server -- is a network entity that satisfies HTTP requests on behalf of a client. The Web cache has
its own disk storage, and keeps in this storage copies of recently requested objects. As shown in Figure 2.2-4, users configure their
browsers so that all of their HTTP requests are first directed to the Web cache. (This is a straightforward procedure with Microsoft and
Netscape browsers.) Once a browser is configured, each browser request for an object is first directed to the Web cache. As an example,
suppose a browser is requesting the object .

    •   The browser establishes a TCP connection to the proxy server and sends an HTTP request for the object to the Web cache.
    •   The Web cache checks to see if it has a copy of the object stored locally. If it does, the Web cache forwards the object within an
        HTTP response message to the client browser.
    •   If the Web cache does not have the object, the Web cache opens a TCP connection to the origin server. The Web cache then sends an
        HTTP request for the object into the TCP connection. After receiving this request, the origin server sends the object within an HTTP
        response to the Web cache.
    •   When the Web cache receives the object, it stores a copy in its local storage and forwards a copy, within an HTTP response
        message, to the client browser (over the existing TCP connection between the client browser and the Web cache).

                                       Figure 2.2-4: Clients requesting objects through a Web cache.
Note that a cache is both a server and a client at the same time. When it receives requests from and sends responses to a browser, it is a
server. When it sends requests to and receives responses from an origin server it is a client.
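
The request handling described above can be modeled in a few lines of Python. This is a toy sketch, not a real proxy: storage is an in-memory dictionary, the origin server is represented by a callable supplied by the caller, and the class name and the sample path are assumptions made for illustration.

```python
class WebCache:
    """Toy Web cache: serves stored copies, otherwise asks the origin server."""

    def __init__(self, fetch_from_origin):
        self._fetch_from_origin = fetch_from_origin    # callable: url -> object bytes
        self._store = {}

    def get(self, url):
        if url in self._store:                  # hit: answer from local storage
            return self._store[url]
        obj = self._fetch_from_origin(url)      # miss: act as a client to the origin
        self._store[url] = obj                  # keep a copy for later requests
        return obj


origin_requests = []


def fake_origin(url):
    """Stand-in for the origin server; records each request it receives."""
    origin_requests.append(url)
    return b"contents of " + url.encode("ascii")


cache = WebCache(fake_origin)
cache.get("/fruit/kiwi.gif")
cache.get("/fruit/kiwi.gif")
print(len(origin_requests))
```

The second request is answered from the cache's own storage, so the origin server sees only one request.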

So why bother with a Web cache? What advantages does it have? Web caches are enjoying wide-scale deployment in the Internet for at
least three reasons. First, a Web cache can substantially reduce the response time for a client request, particularly if the bottleneck
bandwidth between the client and the origin server is much less than the bottleneck bandwidth between the client and the cache. If there is
a high-speed connection between the client and the cache, as there often is, and if the cache has the requested object, then the cache will be
able to rapidly deliver the object to the client. Second, as we will soon illustrate with an example, Web caches can substantially reduce
traffic on an institution's access link to the Internet. By reducing traffic, the institution (e.g., a company or a university) does not have to
upgrade bandwidth as quickly, thereby reducing costs. Furthermore, Web caches can substantially reduce Web traffic in the Internet as a
whole, thereby improving performance for all applications. In 1998, over 75% of Internet traffic was Web traffic, so a significant reduction
in Web traffic can translate into a significant improvement in Internet performance [Claffy 1998]. Third, an Internet dense with Web
caches -- e.g., at institutional, regional and national levels -- provides an infrastructure for rapid distribution of content, even for content
providers who run their sites on low-speed servers behind low-speed access links. If such a "resource-poor" content provider suddenly has
popular content to distribute, this popular content will quickly be copied into the Internet caches, and high user demand will be satisfied.

To gain a deeper understanding of the benefits of caches, let us consider an example in the context of Figure 2.2-5. In this figure, there are
two networks - the institutional network and the Internet. The institutional network is a high-speed LAN. A router in the institutional
network and a router in the Internet are connected by a 1.5 Mbps access link. The origin servers are attached to the Internet, but located
all over the globe.
Suppose that the average object size is 100 Kbits and that the average request rate from the institution's browsers to the origin servers is 15
requests per second. Also suppose that the amount of time it takes from when the router on the Internet side of the access link in Figure 2.2-5
forwards an HTTP request (within an IP datagram) until it receives the IP datagram (typically, many IP datagrams) containing the
corresponding response is two seconds on average. Informally, we refer to this last delay as the "Internet delay".

The total response time -- that is the time from when a browser requests an object until the browser receives the object -- is the sum of the
LAN delay, the access delay (i.e., the delay between the two routers) and the Internet delay. Let us now do a very crude calculation to
estimate this delay. The traffic intensity on the LAN (see Section 1.6), taking the LAN rate to be 10 Mbps, is

                                             (15 requests/sec)*(100 Kbits/request)/(10 Mbps) = .15

whereas the traffic intensity on the access link (from Internet router to institution router) is

                                             (15 requests/sec)*(100 Kbits/request)/(1.5 Mbps) = 1

A traffic intensity of .15 on a LAN typically results in at most tens of milliseconds of delay; hence, we can neglect the LAN delay.
However, as discussed in Section 1.6, as the traffic intensity approaches 1 (as is the case of the access link in Figure 2.2-5), the delay on a
link becomes very large and grows without bound. Thus, the average response time to satisfy requests is going to be on the order of
minutes, if not more, which is unacceptable for the institution's users. Clearly something must be done.
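
The two traffic intensities above can be reproduced with a few lines of Python. This is only a sketch of the crude calculation in the text; the function and constant names are our own, and the 10 Mbps LAN rate is the one used in the first expression above.

```python
def traffic_intensity(requests_per_sec: float, bits_per_request: float, link_bps: float) -> float:
    """Average rate at which bits arrive, divided by the rate at which the link can send them."""
    return requests_per_sec * bits_per_request / link_bps

REQUEST_RATE = 15        # requests per second from the institution's browsers
OBJECT_SIZE = 100e3      # average object size: 100 Kbits
LAN_RATE = 10e6          # 10 Mbps LAN
ACCESS_RATE = 1.5e6      # 1.5 Mbps access link

lan_intensity = traffic_intensity(REQUEST_RATE, OBJECT_SIZE, LAN_RATE)
access_intensity = traffic_intensity(REQUEST_RATE, OBJECT_SIZE, ACCESS_RATE)

print(f"LAN: {lan_intensity:.2f}, access link: {access_intensity:.2f}")
```

An intensity at or near 1 on the access link is what makes the queuing delay there dominate the response time.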

                                    Figure 2.2-5: Bottleneck between institutional network and the Internet.

One possible solution is to increase the access rate from 1.5 Mbps to, say, 10 Mbps. This will lower the traffic intensity on the access link
to .15, which translates to negligible delays between the two routers. In this case, the total response time will be roughly 2
seconds, that is, the Internet delay. But this solution also means that the institution must upgrade its access link from 1.5 Mbps to 10 Mbps,
which can be very costly.

Now consider the alternative solution of not upgrading the access link but instead installing a Web cache in the institutional network. This
solution is illustrated in Figure 2.2-6. Hit rates -- the fraction of requests that are satisfied by a cache -- typically range from .2 to .7 in
practice. For illustrative purposes, let us suppose that the cache provides a hit rate of .4 for this institution. Because the clients and the
cache are connected to the same high-speed LAN, 40% of the requests will be satisfied almost immediately, say within 10 milliseconds, by
the cache. Nevertheless, the remaining 60% of the requests still need to be satisfied by the origin servers. But with only 60% of the
requested objects passing through the access link, the traffic intensity on the access link is reduced from 1.0 to .6. Typically a traffic
intensity less than .8 corresponds to a small delay, say tens of milliseconds, on a 1.5 Mbps link, which is negligible compared with the 2
second Internet delay. Given these considerations, the average delay is

                                                    .4*(0.010 seconds) + .6*(2.01 seconds)

which is just slightly larger than 1.2 seconds. Thus, this second solution provides an even lower response time than the first solution, and it
doesn't require the institution to upgrade its access rate. The institution does, of course, have to purchase and install a Web cache. But this
cost is low -- many caches use public-domain software that runs on inexpensive servers and PCs.
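
As a quick check of the arithmetic, the weighted average can be evaluated directly. The sketch below uses only the numbers from the example; the names are our own, and the 10 millisecond access-link delay on a miss is the assumption that yields the 2.01 second figure in the expression above.

```python
HIT_RATE = 0.4               # fraction of requests satisfied by the cache
CACHE_HIT_DELAY = 0.010      # seconds: object served over the LAN by the cache
INTERNET_DELAY = 2.0         # seconds: the "Internet delay"
ACCESS_DELAY = 0.010         # seconds: assumed small delay on the access link

def average_delay(hit_rate: float, hit_delay: float, miss_delay: float) -> float:
    """Expected response time when a fraction hit_rate of requests is served locally."""
    return hit_rate * hit_delay + (1 - hit_rate) * miss_delay

miss_delay = INTERNET_DELAY + ACCESS_DELAY
avg_with_cache = average_delay(HIT_RATE, CACHE_HIT_DELAY, miss_delay)

# Only the misses cross the access link, so its traffic intensity drops accordingly.
access_intensity_with_cache = (1 - HIT_RATE) * 15 * 100e3 / 1.5e6

print(f"average delay with cache: {avg_with_cache:.3f} s")
print(f"access link intensity with cache: {access_intensity_with_cache:.2f}")
```

With these inputs the average comes out at about 1.21 seconds, compared with roughly 2 seconds for the upgraded access link without a cache.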

                                         Figure 2.2-6: Adding a cache to the institutional network.

Cooperative Caching

Multiple Web caches, located at different places in the Internet, can cooperate and improve overall performance. For example, an
institutional cache can be configured to send its HTTP requests to a cache in a backbone ISP at the national level. In this case, when the
institutional cache does not have the requested object in its storage, it forwards the HTTP request to the national cache. The national cache
then retrieves the object from its own storage or, if the object is not in storage, from the origin server. The national cache then sends the
object (within an HTTP response message) to the institutional cache, which in turn forwards the object to the requesting browser.
Whenever an object passes through a cache (institutional or national), the cache leaves a copy in its local storage. The advantage of passing
through a higher-level cache, such as a national cache, is that it has a larger user population and therefore higher hit rates.
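
The forwarding behavior just described can be sketched as a chain of caches, each with an optional parent. This is an illustrative model only, not the implementation of any real cache: the class name, the `fetch_origin` callback and the URL are hypothetical, and there is no expiry or consistency checking.

```python
from typing import Callable, Dict, List, Optional

class Cache:
    """A cache that asks its parent cache (if any) before going to the origin server."""

    def __init__(self, name: str, parent: Optional["Cache"],
                 fetch_origin: Callable[[str], bytes]) -> None:
        self.name = name
        self.parent = parent
        self.fetch_origin = fetch_origin
        self.store: Dict[str, bytes] = {}

    def get(self, url: str) -> bytes:
        if url in self.store:                 # hit: serve from local storage
            return self.store[url]
        if self.parent is not None:           # miss: forward to the higher-level cache
            obj = self.parent.get(url)
        else:                                 # top of the hierarchy: go to the origin server
            obj = self.fetch_origin(url)
        self.store[url] = obj                 # leave a copy as the object passes through
        return obj

origin_requests: List[str] = []

def fetch_from_origin(url: str) -> bytes:
    """Stand-in for an HTTP request to the origin server; records each call."""
    origin_requests.append(url)
    return f"object at {url}".encode()

national = Cache("national", None, fetch_from_origin)
institution_a = Cache("institution-a", national, fetch_from_origin)
institution_b = Cache("institution-b", national, fetch_from_origin)

url = "http://example.org/index.html"
institution_a.get(url)   # misses everywhere, so the national cache contacts the origin
institution_b.get(url)   # misses locally, but hits in the national cache
print(len(origin_requests))
```

The second institution's request never reaches the origin server, which is the benefit of the larger user population behind the national cache.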

An example of a cooperative caching system is the NLANR caching system, which consists of a number of backbone caches in the US
providing service to institutional and regional caches from all over the globe [NLANR]. The NLANR caching hierarchy is shown in Figure
2.2-7 [Huffaker 1998]. The caches obtain objects from each other using a combination of HTTP and ICP (Internet Caching Protocol). ICP
is an application-layer protocol that allows one cache to quickly ask another cache if it has a given document [RFC 2186]; a cache can then
use HTTP to retrieve the object from the other cache. ICP is used extensively in many cooperative caching systems, and is fully supported
by Squid, a popular public-domain software package for Web caching [Squid]. If you are interested in learning more about ICP, you are
encouraged to see [Luotonen 1998], [Ross 1998] and the ICP RFC [RFC 2186].

                                    Figure 2.2-7: The NLANR caching hierarchy. (Courtesy of [Huffaker 1998]).

An alternative form of cooperative caching involves clusters of caches, often co-located on the same LAN. A single cache is often replaced
with a cluster of caches when the single cache is not sufficient to handle the traffic or provide sufficient storage capacity. Although cache
clustering is a natural way to scale as traffic increases, it introduces a new problem: when a browser wants to request a particular object,
to which cache in the cache cluster should it send the request? This problem can be elegantly solved using hash routing. (If you are not
familiar with hash functions, you can read about them in Chapter 7.) In the simplest form of hash routing, the browser hashes the URL, and
depending on the result of the hash, the browser directs its request message to one of the caches in the cluster. By having all the browsers
use the same hash function, an object will never be present in more than one cache in the cluster, and if the object is indeed in the cache
cluster, the browser will always direct its request to the correct cache. Hash routing is the essence of the Cache Array Routing Protocol
(CARP). If you are interested in learning more about hash routing or CARP, see [Valloppillil 1997], [Luotonen 1998], [Ross 1998] and
[Ross 1997].
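
A minimal sketch of the simplest form of hash routing follows. It is not the CARP algorithm itself: the choice of SHA-256, the modulo mapping and the cache host names are assumptions made purely for illustration.

```python
import hashlib
from typing import List

def choose_cache(url: str, caches: List[str]) -> str:
    """Map a URL to one cache in the cluster; every browser using this function agrees."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(caches)
    return caches[index]

# Hypothetical cluster of three caches on the same LAN.
cluster = ["cache1.example.edu", "cache2.example.edu", "cache3.example.edu"]

first_choice = choose_cache("http://example.org/index.html", cluster)
second_choice = choose_cache("http://example.org/index.html", cluster)
print(first_choice == second_choice)
```

Because the mapping depends only on the URL and the list of caches, two browsers requesting the same object always contact the same cache, so the object is stored at most once in the cluster. A drawback of the plain modulo mapping is that adding or removing a cache reassigns most URLs.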

Web caching is a rich and complex subject; over two thirds (40 pages) of the HTTP/1.1 RFC is devoted to Web caching [RFC 2068]! Web
caching has also enjoyed extensive research and product development in recent years. Furthermore, caches are now being built to handle
streaming audio and video. Caches will likely play an important role as the Internet begins to provide an infrastructure for the large-scale,
on-demand distribution of music, television shows and movies in the Internet.

References

Some of the best information about HTTP can be found in the W3C pages. Their overview page is an excellent starting point for a wealth
of information about the HTTP activities at the W3C. You will also find material on HTTP-Next Generation and Web caching. If you are
interested in HTTP, the W3C site will keep you busy for a long, long time.

[Claffy 1998] K. Claffy, G. Miller and K. Thompson, "The Nature of the Beast: Recent Traffic Measurements from the Internet Backbone,"
CAIDA Web site, 1998.
[Heidemann 1997] J. Heidemann, K. Obraczka and J. Touch, "Modeling the Performance of HTTP Over Several Transport Protocols,"
IEEE/ACM Transactions on Networking, Vol. 5, No. 5, October 1997, pp. 616-630.
[Huffaker 1998] B. Huffaker, J. Jung, D. Wessels and K. Claffy, "Visualization of the Growth and Topology of the NLANR Caching
Hierarchy," 1998.
[Luotonen 1998] A. Luotonen, "Web Proxy Servers," Prentice Hall, New Jersey, 1998.
[Netcraft] Survey of Web Server Penetration, Netcraft Web Site.
[NLANR] A Distributed Testbed for National Information Provisioning.
[Nielsen 1997] H. F. Nielsen, J. Gettys, A. Baird-Smith, E. Prud'hommeaux, H.W. Lie, C. Lilley, "Network Performance Effects of
HTTP/1.1, CSS1, and PNG," W3C Document, 1997 (also appeared in SIGCOMM '97).
[RFC 1945] T. Berners-Lee, R. Fielding, and H. Frystyk, "Hypertext Transfer Protocol -- HTTP/1.0," [RFC 1945], May 1996.
[RFC 2068] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, and T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1," [RFC 2068],
January 1997.
[RFC 2109] D. Kristol and L. Montulli, "HTTP State Management Mechanism," [RFC 2109], February 1997.
[RFC 2186] K. Claffy and D. Wessels, "Internet Caching Protocol (ICP), version 2," [RFC 2186], September 1997.
[Ross 1997] K.W. Ross, "Hash-Routing for Collections of Shared Web Caches," IEEE Network Magazine, Nov-Dec 1997.
[Ross 1998] K.W. Ross, "Distribution of Stored Information in the Web," An Online Tutorial, 1998.
[Squid] Squid Internet Object Cache.
[Valloppillil 1997] V. Valloppillil and K.W. Ross, "Cache Array Routing Protocol," Internet Draft, <draft-vinod-carp-v1-03.txt>, June
[Yeager 1996] N.J. Yeager and R.E. McGrath, "Web Server Technology," Morgan Kaufmann Publishers, San Francisco, 1996.

Copyright Keith W. Ross and James F. Kurose 1996-2000. All rights reserved.

                                    2.3 File Transfer: FTP
FTP (File Transfer Protocol) is a protocol for transferring a file from one host to another host. The protocol
dates back to 1971 (when the Internet was still an experiment), but remains enormously popular. FTP is
described in [RFC 959]. Figure 2.3-1 provides an overview of the services provided by FTP.

                            Figure 2.3-1: FTP moves files between local and remote file systems.

In a typical FTP session, the user is sitting in front of one host (the local host) and wants to transfer files to or
from a remote host. In order for the user to access the remote account, the user must provide a user identification
and a password. After providing this authorization information, the user can transfer files from the local file
system to the remote file system and vice versa. As shown in Figure 2.3-1, the user interacts with FTP through
an FTP user agent. The user first provides the hostname of the remote host, which causes the FTP client process
in the local host to establish a TCP connection with the FTP server process in the remote host. The user then
provides the user identification and password, which get sent over the TCP connection as part of FTP
commands. Once the server has authorized the user, the user copies one or more files stored in the local file
system into the remote file system (or vice versa).

HTTP and FTP are both file transfer protocols and have many common characteristics; for example, they both
run on top of TCP, the Internet's connection-oriented, transport-layer, reliable data transfer protocol. However,
the two application-layer protocols have some important differences. The most striking difference is that FTP
uses two parallel TCP connections to transfer a file, a control connection and a data connection. The control
connection is used for sending control information between the two hosts -- information such as user
identification, password, commands to change remote directory, and commands to "put" and "get" files. The data
connection is used to actually send a file. Because FTP uses a separate control connection, FTP is said to send its
control information out-of-band. In Chapter 6 we shall see that the RTSP protocol, which is used for controlling
the transfer of continuous media such as audio and video, also sends its control information out-of-band. HTTP,
as you recall, sends request and response header lines into the same TCP connection that carries the transferred
file itself. For this reason, HTTP is said to send its control information in-band. In the next section we shall see
that SMTP, the main protocol for electronic mail, also sends control information in-band. The FTP control and
data connections are illustrated in Figure 2.3-2.

                                           Figure 2.3-2: Control and data connections

When a user starts an FTP session with a remote host, FTP first sets up a control TCP connection on server port
number 21. The client side of FTP sends the user identification and password over this control connection. The
client side of FTP also sends, over the control connection, commands to change the remote directory. When the
user requests a file transfer (either to, or from, the remote host), FTP opens a TCP data connection on server
port number 20. FTP sends exactly one file over the data connection and then closes the data connection. If,
during the same session, the user wants to transfer another file, FTP opens another data TCP connection. Thus,
with FTP, the control connection remains open throughout the duration of the user session, but a new data
connection is created for each file transferred within a session (i.e., the data connections are non-persistent).

Throughout a session, the FTP server must maintain state about the user. In particular, the server must associate
the control connection with a specific user account, and the server must keep track of the user's current directory
as the user wanders about the remote directory tree. Keeping track of this state information for each ongoing
user session significantly constrains the total number of sessions that FTP can maintain simultaneously. HTTP, on
the other hand, is stateless -- it does not have to keep track of any user state.

FTP Commands and Replies

We end this section with a brief discussion of some of the more common FTP commands. The commands, from
client to server, and replies, from server to client, are sent across the control TCP connection in 7-bit ASCII
format. Thus, like HTTP commands, FTP commands are readable by people. In order to delineate successive
commands, a carriage return and line feed end each command (and reply). Each command consists of four
uppercase ASCII characters, some with optional arguments. Some of the more common commands are given
below (with options in italics):

     -   USER username : Used to send the user identification to the server.
     -   PASS password : Used to send the user password to the server.
     -   LIST : Used to ask the server to send back a list of all the files in the current remote directory. The list of
         files is sent over a (new and non-persistent) data TCP connection and not over the control TCP
         connection.
     -   RETR filename : Used to retrieve (i.e., get) a file from the current directory of the remote host.
     -   STOR filename : Used to store (i.e., put) a file into the current directory of the remote host.
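
To make the wire format concrete, here is a small sketch that builds the exact bytes a client would write to the control connection for a few commands. The helper name, user name, password and file name are hypothetical; only the format (an uppercase command, an optional argument, and a terminating carriage return and line feed, all in ASCII) comes from the description above.

```python
def ftp_command(verb: str, argument: str = "") -> bytes:
    """Format one FTP command line as it is sent on the control connection."""
    line = verb.upper() if not argument else f"{verb.upper()} {argument}"
    # encode("ascii") raises UnicodeEncodeError for anything outside 7-bit ASCII.
    return (line + "\r\n").encode("ascii")

# Commands a client might send during a session (hypothetical credentials and file).
session = [
    ftp_command("USER", "alice"),
    ftp_command("PASS", "secret"),
    ftp_command("LIST"),
    ftp_command("RETR", "notes.txt"),
]

for raw in session:
    print(raw)
```

Note that only the commands travel over the control connection; the directory listing and the file contents themselves would arrive on separate data connections.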

There is typically a one-to-one correspondence between the command that the user issues and the FTP command
sent across the control connection. Each command is followed by a reply, sent from server to client. The replies
are three-digit numbers, with an optional message following the number. This is similar in structure to the status
code and phrase in the status line of the HTTP response message; the inventors of HTTP intentionally included
this similarity in the HTTP response messages. Some typical replies, along with their possible messages, are as
follows:

     -   331    Username OK, password required
     -   125    Data connection already open; transfer starting
     -   425    Can't open data connection
     -   452    Error writing file
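
A client has to split each reply into its three-digit code and optional message. The following is a minimal sketch of that step; the type and function names are our own, and multi-line replies, which RFC 959 also allows, are deliberately not handled.

```python
from typing import NamedTuple

class FtpReply(NamedTuple):
    code: int
    message: str

def parse_reply(line: bytes) -> FtpReply:
    """Parse a single-line FTP reply such as b'331 Username OK, password required\\r\\n'."""
    text = line.decode("ascii").rstrip("\r\n")
    code, _, message = text.partition(" ")
    if len(code) != 3 or not code.isdigit():
        raise ValueError(f"not an FTP reply: {text!r}")
    return FtpReply(int(code), message.strip())

def is_negative(reply: FtpReply) -> bool:
    """In RFC 959, codes starting with 4 or 5 report that the command was not carried out."""
    return reply.code >= 400

ok = parse_reply(b"331 Username OK, password required\r\n")
failed = parse_reply(b"425 Can't open data connection\r\n")
print(ok.code, ok.message)
print(failed.code, is_negative(failed))
```

Working from the numeric code rather than the message text matters because the message wording may differ from one server to another.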

Readers who are interested in learning about the other FTP commands and replies are encouraged to read [RFC 959].


[RFC 959] J.B. Postel and J.K. Reynolds, "File Transfer Protocol," [RFC 959], October 1985.


                          2.4 Electronic Mail in the Internet
Along with the Web, electronic mail is one of the most popular Internet applications. Just like ordinary "snail mail," email is
asynchronous -- people send and read messages when it is convenient for them, without having to coordinate with other people's
schedules. In contrast with snail mail, electronic mail is fast, easy to distribute, and inexpensive. Moreover, modern electronic
mail messages can include hyperlinks, HTML formatted text, images, sound and even video. In this section we will examine the
application-layer protocols that are at the heart of Internet electronic mail. But before we jump into an in-depth discussion of
these protocols, let's take a bird's eye view of the Internet mail system and its key components.

                                    Figure 2.4-1: A bird's eye view of the Internet e-mail system.

Figure 2.4-1 presents a high-level view of the Internet mail system. We see from this diagram that it has three major
components: user agents, mail servers, and the Simple Mail Transfer Protocol (SMTP). We now describe each of these
components in the context of a sender, Alice, sending an email message to a recipient, Bob. User agents
allow users to read, reply to, forward, save, and compose messages. (User agents for electronic mail are sometimes called mail
readers, although we will generally avoid this term in this book.) When Alice is finished composing her message, her user
agent sends the message to her mail server, where the message is placed in the mail server's outgoing message queue. When
Bob wants to read a message, his user agent obtains the message from his mailbox in his mail server. In the late 1990s, GUI
(graphical user interface) user agents became popular, allowing users to view and compose multimedia messages. Currently,
Eudora, Microsoft's Outlook Express, and Netscape's Messenger are among the popular GUI user agents for email. There are
also many text-based email user interfaces in the public domain, including mail, pine and elm.

Mail servers form the core of the e-mail infrastructure. Each recipient, such as Bob, has a mailbox located in one of the mail
servers. Bob's mailbox manages and maintains the messages that have been sent to him. A typical message starts its journey in
the sender's user agent, travels to the sender's mail server, and then travels to the recipient's mail server, where it is deposited in
the recipient's mailbox. When Bob wants to access the messages in his mailbox, the mail server containing the mailbox
authenticates Bob (with user names and passwords). Alice's mail server must also deal with failures in Bob's mail server. If
Alice's server cannot deliver mail to Bob's server, Alice's server holds the message in a message queue and attempts to transfer
the message later. Reattempts are often done every 30 minutes or so; if there is no success after several days, the server removes
the message and notifies the sender (Alice) with an email message.

The Simple Mail Transfer Protocol (SMTP) is the principal application-layer protocol for Internet electronic mail. It uses the
reliable data transfer service of TCP to transfer mail from the sender's mail server to the recipient's mail server. As with most
application-layer protocols, SMTP has two sides: a client side, which executes on the sender's mail server, and a server side, which
executes on the recipient's mail server. Both the client and server sides of SMTP run on every mail server. When a mail server
sends mail (to other mail servers), it acts as an SMTP client. When a mail server receives mail (from other mail servers) it acts
as an SMTP server.

2.4.1 SMTP
SMTP, defined in [RFC 821], is at the heart of Internet electronic mail. As mentioned above, SMTP transfers messages from
senders' mail servers to the recipients' mail servers. SMTP is much older than HTTP. (The SMTP RFC dates back to 1982, and
SMTP was around long before that.) Although SMTP has numerous wonderful qualities, as evidenced by its ubiquity in the
Internet, it is nevertheless a legacy technology that possesses certain "archaic" characteristics. For example, it restricts the body
(not just the headers) of all mail messages to be in simple seven-bit ASCII. This restriction was not bothersome in the early
1980s when transmission capacity was scarce and no one was emailing large attachments or large image, audio or video files.
But today, in the multimedia era, the seven-bit ASCII restriction is a bit of a pain -- it requires binary multimedia data to be
encoded to ASCII before being sent over SMTP; and it requires the corresponding ASCII message to be decoded back to
binary after SMTP transport. Recall from Section 2.2 that HTTP does not require multimedia data to be ASCII encoded before
transfer.

To illustrate the basic operation of SMTP, let's walk through a common scenario. Suppose Alice wants to send Bob a simple
ASCII message:

     -   Alice invokes her user agent for email, provides Bob's email address, composes a message
         and instructs the user agent to send the message.
     -   Alice's user agent sends the message to her mail server, where it is placed in a message queue.
     -   The client side of SMTP, running on Alice's mail server, sees the message in the message queue. It opens a TCP
         connection to an SMTP server, running on Bob's mail server.
     -   After some initial SMTP handshaking, the SMTP client sends Alice's message into the TCP connection.
     -   At Bob's mail server host, the server side of SMTP receives the message. Bob's mail server then places the message in
         Bob's mailbox.
     -   Bob invokes his user agent to read the message at his convenience.

The scenario is summarized in Figure 2.4-2.

                             Figure 2.4-2: Alice's mail server transfers Alice's message to Bob's mail server.

It is important to observe that SMTP does not use intermediate mail servers for sending mail, even when the two mail servers
are located at opposite ends of the world. If Alice's server is in Hong Kong and Bob's server is in Mobile, Alabama, the TCP
"connection" is a direct connection between the Hong Kong and Mobile servers. In particular, if Bob's mail server is down, the
message remains in Alice's mail server and waits for a new attempt -- the message does not get placed in some intermediate
mail server.

Let's now take a closer look at how SMTP transfers a message from a sending mail server to a receiving mail server. We will
see that the SMTP protocol has many similarities with protocols that are used for face-to-face human interaction. First, the
client SMTP (running on the sending mail server host) has TCP establish a connection on port 25 to the server SMTP (running
on the receiving mail server host). If the server is down, the client tries again later. Once this connection is established, the
server and client perform some application-layer handshaking. Just as humans often introduce themselves before transferring
information from one to another, SMTP clients and servers introduce themselves before transferring information. During this
SMTP handshaking phase, the SMTP client indicates the email address of the sender (the person who generated the message)
and the email address of the recipient. Once the SMTP client and server have introduced themselves to each other, the client
sends the message. SMTP can count on the reliable data transfer service of TCP to get the message to the server without errors.
The client then repeats this process over the same TCP connection if it has other messages to send to the server; otherwise, it
instructs TCP to close the connection.

Let us take a look at an example transcript between client (C) and server (S). The lines of ASCII text prefaced with C: are exactly
the lines the client sends into its TCP socket, and the lines prefaced with S: are exactly the lines the server sends into its TCP
socket. The following transcript begins as soon as the TCP connection is established:

        S:   220
        C:   HELO
        S:   250 Hello, pleased to meet you
        C:   MAIL FROM: <>
        S:   250 Sender ok
        C:   RCPT TO: <>
        S:   250 ... Recipient ok
        C:   DATA
        S:   354 Enter mail, end with "." on a line by itself
        C:   Do you like ketchup?
        C:   How about pickles?
        C:   .
        S:   250 Message accepted for delivery
        C:   QUIT
        S:   221 closing connection

In the above example, the client sends a message ("Do you like ketchup? How about pickles?") from the sending mail server to
the receiving mail server. The client issued five commands: HELO (an abbreviation for HELLO), MAIL FROM, RCPT TO,
DATA, and QUIT. These commands are self-explanatory. The server issues replies to each command, with each reply having
a reply code and some (optional) English-language explanation. We mention here that SMTP uses persistent connections: if the
sending mail server has several messages to send to the same receiving mail server, it can send all of the messages over the
same TCP connection. For each message, the client begins the process with a new MAIL FROM command and issues QUIT only
after all messages have been sent.
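
The client side of this dialogue can be sketched as a function that produces the lines the client writes to its socket, in order. The function name, host name and addresses below are hypothetical, and the sketch ignores the server's replies, which a real client must read and check after each command. Following RFC 821, HELO is sent once per connection and each message starts with its own MAIL FROM.

```python
from typing import List, Tuple

Message = Tuple[str, str, List[str]]   # (sender address, recipient address, body lines)

def smtp_client_lines(client_host: str, messages: List[Message]) -> List[str]:
    """Return the lines an SMTP client sends to deliver all messages over one connection."""
    lines = [f"HELO {client_host}"]
    for sender, recipient, body_lines in messages:
        lines.append(f"MAIL FROM: <{sender}>")
        lines.append(f"RCPT TO: <{recipient}>")
        lines.append("DATA")
        lines.extend(body_lines)
        lines.append(".")              # a line with only a period ends the message
    lines.append("QUIT")               # issued once, after the last message
    return lines

dialogue = smtp_client_lines(
    "client.example.org",
    [("alice@example.org", "bob@example.com",
      ["Do you like ketchup?", "How about pickles?"])],
)
for line in dialogue:
    print("C:", line)
```

On the wire, each of these lines would be terminated by a carriage return and line feed.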

It is highly recommended that you use Telnet to carry out a direct dialogue with an SMTP server. To do this, issue telnet
serverName 25 . When you do this, you are simply establishing a TCP connection between your local host and the mail
server. After typing this line, you should immediately receive the 220 reply from the server. Then issue the SMTP commands
HELO, MAIL FROM, RCPT TO, DATA, and QUIT at the appropriate times. If you Telnet into your friend's SMTP
server, you should be able to send mail to your friend in this manner (i.e., without using your mail user agent).

Comparison with HTTP

Let us now briefly compare SMTP to HTTP. Both protocols are used to transfer files from one host to another; HTTP transfers
files (or objects) from Web server to Web user agent (i.e., the browser); SMTP transfers files (i.e., email messages) from one
mail server to another mail server. When transferring the files, both persistent HTTP and SMTP use persistent connections, that
is, they can send multiple files over the same TCP connection. Thus the two protocols have common characteristics. However,
there are important differences. First, HTTP is principally a pull protocol -- someone loads information on a Web server and
users use HTTP to pull the information off the server at their convenience. In particular, the TCP connection is initiated by the
machine that wants to receive the file. On the other hand, SMTP is primarily a push protocol -- the sending mail server pushes
the file to the receiving mail server. In particular, the TCP connection is initiated by the machine that wants to send the file.

A second important difference, which we alluded to earlier, is that SMTP requires each message, including the body of each
message, to be in seven-bit ASCII format. Furthermore, the SMTP RFC requires the body of every message to end with a line
consisting of only a period -- i.e., in ASCII jargon, the body of each message ends with "CRLF.CRLF", where CR and LF
stand for carriage return and line feed, respectively. In this manner, while the SMTP server is receiving a series of messages
from an SMTP client over a persistent TCP connection, the server can delineate the messages by searching for "CRLF.CRLF"
in the byte stream. (This operation of searching through a character stream is referred to as "parsing".) Now suppose that the
body of one of the messages is not ASCII text but instead binary data (for example, a JPEG image). It is possible that this
binary data might accidentally have the bit pattern associated with the ASCII representation of "CR LF . CR LF" in the middle of
the bit stream. This would cause the SMTP server to incorrectly conclude that the message has terminated. To get around this
and related problems, binary data is first encoded to ASCII in such a way that certain ASCII characters (including ".") are not
used. Returning to our comparison with HTTP, we note that neither non-persistent nor persistent HTTP has to bother with the
ASCII conversion. For non-persistent HTTP, each TCP connection transfers exactly one object; when the server closes the
connection, the client knows it has received one entire response message. For persistent HTTP, each response message includes
a Content-length: header line, enabling the client to delineate the end of each message.
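
The delineation problem is easy to reproduce. The sketch below splits a byte stream at every "CRLF.CRLF" and shows that raw binary data containing that pattern is cut in two, whereas the same data survives once it is encoded. Base64 is used here as one example of an ASCII encoding whose alphabet contains neither "." nor CR or LF; the function name and sample bytes are our own.

```python
import base64
from typing import List

END_OF_MESSAGE = b"\r\n.\r\n"   # CRLF.CRLF

def split_messages(stream: bytes) -> List[bytes]:
    """Split a received byte stream into message bodies at each CRLF.CRLF."""
    parts = stream.split(END_OF_MESSAGE)
    return parts[:-1]            # whatever follows the last terminator is not a complete message

# Two ASCII messages sent back to back over a persistent connection.
text_stream = (b"Do you like ketchup?\r\nHow about pickles?" + END_OF_MESSAGE
               + b"Yes to both" + END_OF_MESSAGE)
text_messages = split_messages(text_stream)

# Binary data that happens to contain the terminator pattern.
binary_body = b"\xff\xd8\xff" + b"\r\n.\r\n" + b"\x00\x10"
raw_messages = split_messages(binary_body + END_OF_MESSAGE)

# The same data, encoded to ASCII before it is sent.
encoded_body = base64.b64encode(binary_body)
encoded_messages = split_messages(encoded_body + END_OF_MESSAGE)

print(len(text_messages), len(raw_messages), len(encoded_messages))
```

The raw binary body is wrongly reported as two messages, while the encoded body arrives as one and decodes back to the original bytes.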

A third important difference concerns how a document consisting of text and images (along with possibly other media types) is
handled. As we learned in Section 2.2, HTTP encapsulates each object in its own HTTP response message. Internet mail, as we
shall discuss in greater detail below, places all of the message's objects into one message.

2.4.2 Mail Message Formats and MIME

When Alice sends an ordinary snail-mail letter to Bob, she puts the letter into an envelope, on which there are all kinds of
peripheral information such as Bob's address, Alice's return address, and the date (supplied by the postal service). Similarly,
when an email message is sent from one person to another, a header containing peripheral information precedes the body of the
message itself. This peripheral information is contained in a series of header lines, which are defined in [RFC 822]. The header
lines and the body of the message are separated by a blank line (i.e., by CRLF). RFC 822 specifies the exact format for mail header
lines as well as their semantic interpretations. As with HTTP, each header line contains readable text, consisting of a keyword
followed by a colon followed by a value. Some of the keywords are required and others are optional. Every header must have a
From: header line and a To: header line; a header may include a Subject: header line as well as other optional header
lines. It is important to note that these header lines are different from the SMTP commands we studied in section 2.4.1 (even
though they contain some common words such as "from" and "to"). The commands in section 2.4.1 were part of the SMTP
handshaking protocol; the header lines examined in this section are part of the mail message itself.

A typical message header looks like this:

        Subject: Searching for the meaning of life.

After the message header, a blank line follows, and then the message body (in ASCII). The message terminates with a line
containing only a period, as discussed above. It is highly recommended that you use Telnet to send to a mail server a message
that contains some header lines, including the Subject: header line. To do this, issue telnet serverName 25 . The
actual message is sent into the TCP connection right after the SMTP DATA command. The message consists of the message
headers, the blank line, and the message body. The final line with a single period indicates the end of the message.
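
The structure just described (header lines, a blank line, then the body) can be pulled apart with a few lines of code. This is a simplified sketch rather than a full RFC 822 parser: it does not handle header lines folded across several lines or repeated keywords, and the addresses in the sample message are hypothetical.

```python
from typing import Dict, Tuple

def parse_message(raw: str) -> Tuple[Dict[str, str], str]:
    """Split a mail message into its header lines and body at the first blank line."""
    header_text, _, body = raw.partition("\r\n\r\n")
    headers: Dict[str, str] = {}
    for line in header_text.split("\r\n"):
        keyword, _, value = line.partition(":")   # keyword, colon, value
        headers[keyword.strip()] = value.strip()
    for required in ("From", "To"):
        if required not in headers:
            raise ValueError(f"missing required header line: {required}")
    return headers, body

raw_message = (
    "From: alice@example.org\r\n"
    "To: bob@example.com\r\n"
    "Subject: Searching for the meaning of life.\r\n"
    "\r\n"
    "Do you like ketchup?\r\n"
    "How about pickles?\r\n"
)

headers, body = parse_message(raw_message)
print(headers["Subject"])
print(body.splitlines())
```

These From: and To: header lines are part of the message that follows the DATA command; they are distinct from the MAIL FROM and RCPT TO commands of the SMTP handshake.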

The MIME Extension for Non-ASCII Data

While the message headers described in RFC 822 are satisfactory for sending ordinary ASCII text, they are not sufficiently rich
for multimedia messages (e.g., messages with images, audio and video) or for carrying non-ASCII text formats (e.g.,
characters used by languages other than English). To send content different from ASCII text, the sending user agent must
include additional headers in the message. These extra headers are defined in [RFC 2045] and [RFC 2046], the MIME
extension to [RFC 822]. Two key MIME headers for supporting multimedia are the Content-Type: header and the
Content-Transfer-Encoding: header. The Content-Type: header allows the receiving user agent to take an
appropriate action on the message. For example, by indicating that the message body contains a JPEG image, the receiving user
agent can direct the message body to a JPEG decompression routine. To understand the need for the
Content-Transfer-Encoding: header, recall that non-ASCII text messages must be encoded to an ASCII format that isn't going to confuse
SMTP. The Content-Transfer-Encoding: header alerts the receiving user agent that the message body has been ASCII
encoded and the type of encoding used. Thus, when a user agent receives a message with these two headers, it first uses the
value of the Content-Transfer-Encoding: header to convert the message body to its original non-ASCII form, and
then uses the Content-Type: header to determine what actions it should take on the message body.

Let's take a look at a concrete example. Suppose Alice wants to send Bob a JPEG image. To do this, Alice invokes her user
agent for email, specifies Bob's email address, specifies the subject of the message, and inserts the JPEG image into the
message body of the message. (Depending on the user agent Alice uses, she might insert the image into the message as an
"attachment".) When Alice finishes composing her message, she clicks on "Send". Alice's user agent then generates a MIME
message, which might look something like this:

             Subject: Picture of yummy crepe.
             MIME-Version: 1.0
             Content-Transfer-Encoding: base64
             Content-Type: image/jpeg

             base64 encoded data .....
             ......base64 encoded data


We observe from the above MIME message that Alice's user agent encoded the JPEG image using base64 encoding. This is one
of several encoding techniques standardized in the MIME [RFC 2045] for conversion to an acceptable seven-bit ASCII format.
Another popular encoding technique is quoted-printable content-transfer-encoding, which is typically used to convert an
ordinary ASCII message to ASCII text free of undesirable character strings (e.g., a line with a single period).

When Bob reads his mail with his user agent, his user agent operates on this same MIME message. When Bob's user agent
observes the Content-Transfer-Encoding: base64 header line, it proceeds to decode the base64-encoded message
body. The message also includes a Content-Type: image/jpeg header line; this indicates to Bob's user agent that the
message body (after base64 decoding) should be JPEG decompressed. Finally, the message includes the MIME-Version:
header, which, of course, indicates the MIME version that is being used. Note that the message otherwise follows the standard
RFC 822/SMTP format. In particular, after the message header there is a blank line and then the message body; and after the
message body, there is a line with a single period.
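
The two-step handling at the receiver (undo the transfer encoding first, then act on the content type) can be tried out with Python's standard email package. The sketch below is an illustration under stated assumptions: the bytes standing in for the JPEG image are made up, and the variable names are ours, not part of any mail standard.

```python
import email
from email.mime.image import MIMEImage

# Made-up bytes standing in for a real JPEG file (assumption for illustration).
fake_jpeg = bytes([0xFF, 0xD8, 0xFF, 0xE0]) + bytes(range(256))

# Sender side: MIMEImage base64-encodes the body and adds the MIME headers.
outgoing = MIMEImage(fake_jpeg, _subtype="jpeg")
outgoing["Subject"] = "Picture of yummy crepe."
wire_text = outgoing.as_string()          # what travels over SMTP, all ASCII

# Receiver side: parse the text, then read the two key MIME headers.
received = email.message_from_string(wire_text)
transfer_encoding = received["Content-Transfer-Encoding"]
content_type = received.get_content_type()

# Step 1: reverse the transfer encoding to recover the original bytes.
decoded_body = received.get_payload(decode=True)
# Step 2: the content type tells the user agent what to do with those bytes.
print(transfer_encoding, content_type, len(decoded_body))
```

Note that wire_text contains only ASCII characters even though the image bytes do not, which is exactly what the base64 encoding is for.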

Let's now take a closer look at the Content-Type: header. According to the MIME specification, [RFC 2046], this header
has the following format:

                                      Content-Type: type/subtype ; parameters

where the "parameters" (along with the semicolon) are optional. Paraphrasing [RFC 2046], the Content-Type field is used to
specify the nature of the data in the body of a MIME entity, by giving media type and subtype names. After the type and
subtype names, the remainder of the header field is a set of parameters. In general, the top-level type is used to declare the
general type of data, while the subtype specifies a specific format for that type of data. The parameters are modifiers of the
subtype, and as such do not fundamentally affect the nature of the content. The set of meaningful parameters depends on the
type and subtype. Most parameters are associated with a single specific subtype. MIME has been carefully designed to be
extensible, and it is expected that the set of media type/subtype pairs and their associated parameters will grow significantly
over time. In order to ensure that the set of such types/subtypes is developed in an orderly, well-specified, and public manner,
MIME sets up a registration process which uses the Internet Assigned Numbers Authority (IANA) as a central registry for
MIME's various areas of extensibility. The registration process for these areas is described in [RFC 2048].
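
To make the type/subtype ; parameters format concrete, here is a minimal sketch that uses Python's standard email package to split a Content-Type: value into its pieces. The header values are examples chosen for illustration, and the variable names are hypothetical.

```python
from email.message import Message

# A header with a type, a subtype, and one parameter.
with_params = Message()
with_params["Content-Type"] = 'text/plain; charset="ISO-8859-1"'

main_type = with_params.get_content_maintype()   # top-level type
sub_type = with_params.get_content_subtype()     # specific format of that type
charset = with_params.get_param("charset")       # parameter value, quotes removed

# The parameters are optional: this header has none.
without_params = Message()
without_params["Content-Type"] = "image/jpeg"
missing = without_params.get_param("charset")    # None when the parameter is absent

print(main_type, sub_type, charset, missing)
```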

Currently there are seven top-level types defined. For each type, there is a list of associated subtypes, and the lists of subtypes
are growing every year. We describe five of these types below:

     •   text: The text type is used to indicate to the receiving user agent that the message body contains textual information. One
         extremely common type/subtype pair is text/plain. The subtype plain indicates plain text containing no formatting
         commands or directives. Plain text is to be displayed as is; no special software is required to get the full meaning of the
         text, aside from support for the indicated character set. If you take a glance at the MIME headers in some of the
         messages in your mailbox, you will almost certainly see content type header lines with text/plain; charset=us-ascii
         or text/plain; charset="ISO-8859-1". The parameters indicate the character set used to generate
         the message. Another type/subtype pair that is gaining popularity is text/html. The html subtype indicates to the mail
         reader that it should interpret the embedded HTML tags that are included in the message. This allows the receiving user
         agent to display the message as a Web page, which might include a variety of fonts, hyperlinks, applets, etc.
     •   image: The image type is used to indicate to the receiving user agent that the message body is an image. Two popular
         type/subtype pairs are image/gif and image/jpeg. When the receiving user agent encounters image/gif, it knows that it
         should decode the GIF image and then display it.
     •   audio: The audio type requires an audio output device (such as a speaker or a telephone) to render the contents. Some of
         the standardized subtypes include basic (basic 8-bit mu-law encoded) and 32kadpcm (a 32 Kbps format defined in [RFC
     •   video: The subtypes of the video type include mpeg and quicktime.
     •   application: The application type is for data that does not fit in any of the other categories. It is often used for data that
         must be processed by an application before it is viewable or usable by a user. For example, when a user attaches a
         Microsoft Word document to an email message, the sending user agent typically uses application/msword for the
         type/subtype pair. When the receiving user agent observes the content type application/msword, it launches the
         Microsoft Word application and passes the body of the MIME message to the application. A particularly important
         subtype for the application type is octet-stream, which is used to indicate that the body contains arbitrary binary data.
         Upon receiving this type, a mail reader will prompt the user, providing the option to save the message to disk for later

There is one MIME type that is particularly important and requires special discussion, namely, the multipart type. Just as a
Web page can contain many objects (text, images, applets, etc.), so can an email message. Recall that the Web sends each of the
objects within independent HTTP response messages. Internet email, on the other hand, places all the objects (or "parts") in the
same message. In particular, when a multimedia message contains more than one object (such as multiple images or some
ASCII text and some images) the message typically has Content-type: multipart/mixed. This content type header
line indicates to the receiving user agent that the message contains multiple objects. With all the objects in the same message,
the receiving user agent needs a means to determine (i) where each object begins and ends, (ii) how each non-ASCII object was
transfer encoded, and (iii) the content type of each object. This is done by placing boundary characters between objects
and preceding each object in the message with Content-type: and Content-Transfer-Encoding: header lines.

To obtain a better understanding of multipart/mixed, let's look at an example. Suppose that Alice wants to send a message to
Bob consisting of some ASCII text, followed by a JPEG image, followed by more ASCII text. Using her user agent, Alice types
some text, attaches a JPEG image, and then types some more text. Her user agent then generates a message something like this:

    Subject: Picture of yummy crepe with commentary
    MIME-Version: 1.0
    Content-Type: multipart/mixed; Boundary=StartOfNextPart

    --StartOfNextPart
    Dear Bob,
    Please find a picture of an absolutely scrumptious crepe.

    --StartOfNextPart
    Content-Transfer-Encoding: base64
    Content-Type: image/jpeg

    base64 encoded data .....
    ......base64 encoded data

    --StartOfNextPart
    Let me know if you would like the recipe.

Examining the above message, we note that the Content-Type: line in the header indicates how the various parts in the
message are separated. The separation always begins with two dashes and ends with CRLF.
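
A multipart/mixed message of this shape can be generated and taken apart again with Python's standard email package. The following is a sketch, not a description of what any particular user agent does: the image bytes are placeholders, and the exact headers the library adds to each part (for example a charset parameter on the text parts) may differ from the hand-written example above.

```python
import email
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Outer message: declares the multipart type and the boundary string.
outer = MIMEMultipart("mixed", boundary="StartOfNextPart")
outer["Subject"] = "Picture of yummy crepe with commentary"

# Three parts: ASCII text, a (placeholder) JPEG image, more ASCII text.
outer.attach(MIMEText(
    "Dear Bob,\nPlease find a picture of an absolutely scrumptious crepe.\n",
    "plain", "us-ascii"))
outer.attach(MIMEImage(b"\xff\xd8\xff\xe0 placeholder bytes", _subtype="jpeg"))
outer.attach(MIMEText("Let me know if you would like the recipe.\n",
                      "plain", "us-ascii"))

wire_text = outer.as_string()

# Receiver side: the parser uses the boundary to split the body into parts.
received = email.message_from_string(wire_text)
part_types = [part.get_content_type() for part in received.get_payload()]
print(received.get_boundary(), part_types)
```

Printing wire_text shows each part introduced by a line consisting of two dashes followed by the boundary string.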

As mentioned earlier, the list of registered MIME types grows every year. [RFC 2048] describes the registration procedures
which use the Internet Assigned Numbers Authority (IANA) as a central registry for such values. A list of the current MIME
subtypes is maintained at numerous sites. The reader is also encouraged to glance at Yahoo's MIME Category Page.

The Received Message

As we have discussed, an email message consists of many components. The core of the message is the message body, which is
the actual data being sent from sender to receiver. For a multipart message, the message body itself consists of many parts,
with each part preceded with one or more lines of peripheral information. Preceding the message body is a blank line and then a
number of header lines. These header lines include RFC 822 header lines such as From:, To: and Subject: header lines.
The header lines also include MIME header lines such as Content-type: and Content-transfer-encoding:
header lines. But we would be remiss if we didn't mention another class of header lines that are inserted by the SMTP receiving
server. Indeed, the receiving server, upon receiving a message with RFC 822 and MIME header lines, adds a Received:
header line to the top of the message; this header line specifies the name of the SMTP server that sent the message ("from"), the
name of the SMTP server that received the message ("by") and the time at which the receiving server received the message.
Thus the message seen by the destination user takes the following form:

        Received: from by ; 12 Oct 98 15:27:39 GMT
        Subject: Picture of yummy crepe.
        MIME-Version: 1.0
        Content-Transfer-Encoding: base64
        Content-Type: image/jpeg

        base64 encoded data .......
        .......base64 encoded data

Almost everyone who has used electronic mail has seen the Received: header line (along with the other header lines)
preceding email messages. (This line is often directly seen on the screen or when the message is sent to a printer.) You may
have noticed that a single message sometimes has multiple Received: header lines and a more complex Return-Path:
header line. This is because a message may be received by more than one SMTP server in the path between sender and
recipient. For example, if Bob has instructed his email server to forward all his messages to, then the
message read by Bob's user agent would begin with something like:

        Received: from by; 12 Oct 98 15:30:01 GMT
        Received: from by ; 12 Oct 98 15:27:39 GMT

These header lines provide the receiving user agent a trace of the SMTP servers visited as well as timestamps of when the visits
occurred. You can learn more about the syntax of these header lines in the SMTP RFC, which is one of the more readable of the
many RFCs.
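
As a rough illustration of how a user agent could turn these header lines into a trace, the sketch below reads the Received: lines of a message and lists the hops in the order they happened. All host names are hypothetical placeholders, and the simple "from X by Y; time" parsing is an assumption that matches only the simplified form shown in this section, not every Received: line found in practice.

```python
import email

# Hypothetical host names; the newest Received: line is at the top.
raw_message = (
    "Received: from relay.example.net by mail.example.org; 12 Oct 98 15:30:01 GMT\n"
    "Received: from sender.example.com by relay.example.net; 12 Oct 98 15:27:39 GMT\n"
    "Subject: Picture of yummy crepe.\n"
    "\n"
    "message body\n"
)

message = email.message_from_string(raw_message)

hops = []
# Each server adds its line at the top, so reverse to get chronological order.
for line in reversed(message.get_all("Received")):
    route, _, timestamp = line.partition(";")
    words = route.split()                 # ["from", sender, "by", receiver]
    hops.append((words[1], words[3], timestamp.strip()))

for sender, receiver, timestamp in hops:
    print(f"{sender} -> {receiver} at {timestamp}")
```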

2.4.3 Mail Access Protocols
Once SMTP delivers the message from Alice's mail server to Bob's mail server, the message is placed in Bob's mailbox.
Throughout this discussion we have tacitly assumed that Bob reads his mail by logging onto the server host (most likely
through Telnet) and then executes a mail reader (e.g., mail, elm, etc.) on that host. Up until the early 1990s this was the standard
way of doing things. But today a typical user reads mail with a user agent that executes on his or her local PC (or Mac),
whether that PC be an office PC, a home PC, or a portable PC. By executing the user agent on a local PC, users enjoy a rich set
of features, including the ability to view multimedia messages and attachments. Popular mail user agents that run on local PCs
include Eudora, Microsoft's Outlook Express, and Netscape's Messenger.

Given that Bob (the recipient) executes his user agent on his local PC, it is natural to consider placing a mail server on
his local PC as well. There is a problem with this approach, however. Recall that a mail server manages mailboxes and runs the
client and server sides of SMTP. If Bob's mail server were to reside on his local PC, then Bob's PC would have to remain
constantly on, and connected to the Internet, in order to receive new mail, which can arrive at any time. This is impractical for
the great majority of Internet users. Instead, a typical user runs a user agent on the local PC but accesses a mailbox from a
shared mail server - a mail server that is always running, that is always connected to the Internet, and that is shared with other
users. The mail server is typically maintained by the user's ISP, which could be a residential or an institutional (university,
company, etc.) ISP.

With user agents running on users' local PCs and mail servers hosted by ISPs, a protocol is needed to allow the user agent and
the mail server to communicate. Let us first consider how a message that originates at Alice's local PC makes its way to Bob's
SMTP mail server. This task could simply be done by having Alice's user agent communicate directly with Bob's mail server in
the language of SMTP: Alice's user agent would initiate a TCP connection to Bob's mail server, issue the SMTP handshaking
commands, upload the message with the DATA command, and then close the connection. This approach, although perfectly
feasible, is not commonly employed, primarily because it doesn't offer Alice any recourse to a crashed destination mail
server. Instead, Alice's user agent initiates an SMTP dialogue with her own mail server (rather than with the recipient's mail
server) and uploads the message. Alice's mail server then establishes a new SMTP session with Bob's mail server and relays the
message to Bob's mail server. If Bob's mail server is down, then Alice's mail server holds the message and tries again later. The
SMTP RFC defines how the SMTP commands can be used to relay a message across multiple SMTP servers.

But there is still one missing piece to the puzzle! How does a recipient like Bob, running a user agent on his local PC, obtain his
messages, which are sitting on a mail server within Bob's ISP? The puzzle is completed by introducing a special access protocol
that transfers the messages from Bob's mail server to the local PC. There are currently two popular mail access protocols: POP3
(Post Office Protocol - Version 3) and IMAP (Internet Message Access Protocol). We shall discuss both of these protocols below.
Note that Bob's user agent can't use SMTP to obtain the messages: obtaining the messages is a pull operation whereas SMTP is
a push protocol. Figure 2.4-3 provides a summary of the protocols that are used for Internet mail: SMTP is used to transfer mail
from the sender's mail server to the recipient's mail server; SMTP is also used to transfer mail from the sender's user agent to the
sender's mail server. POP3 or IMAP is used to transfer mail from the recipient's mail server to the recipient's user agent.

                                  Figure 2.4-3: E-mail protocols and their communicating entities.


POP3, defined in [RFC 1939], is an extremely simple mail access protocol. Because the protocol is so simple, its functionality
is rather limited. POP3 begins when the user agent (the client) opens a TCP connection to the mail server (the server) on
port 110. With the TCP connection established, POP3 progresses through three phases: authorization, transaction and update.
During the first phase, authorization, the user agent sends a user name and a password to authenticate the user downloading the
mail. During the second phase, transaction, the user agent retrieves messages. During the transaction phase, the user agent can
also mark messages for deletion, remove deletion marks, and obtain mail statistics. The third phase, update, occurs after the
client has issued the quit command ending the POP3 session; at this time, the mail server deletes the messages that were
marked for deletion.

In a POP3 transaction, the user agent issues commands, and the server responds to each command with a reply. There are two
possible responses: +OK (sometimes followed by server-to-client data), whereby the server is saying that the previous
command was fine; and -ERR, whereby the server is saying that something was wrong with the previous command.

The authorization phase has two principal commands: user <user name> and pass <password>. To illustrate these two
commands, we suggest that you Telnet directly into a POP3 server, using port 110, and issue these commands. Suppose that
mailServer is the name of your mail server. You will see something like:

        telnet mailServer 110
        +OK POP3 server ready
        user alice
        +OK
        pass hungry
        +OK user successfully logged on

If you misspell a command, the POP3 server will reply with an -ERR message.

Now let's take a look at the transaction phase. A user agent using POP3 can often be configured (by the user) to "download and
delete" or to "download and keep". The sequence of commands issued by a POP3 user agent depends on which of these two
modes the user agent is operating in. In the download-and-delete mode, the user agent will issue the list, retr and dele
commands. As an example, suppose the user has two messages in his or her mailbox. In the dialogue below C: (standing for
client) is the user agent and S: (standing for server) is the mail server. The transaction will look something like:

        C: list
        S: 1 498
        S: 2 912
        S: .
        C: retr 1
        S: blah blah ...
        S: .................
        S: ..........blah
        S: .
        C: dele 1
        C: retr 2
        S: blah blah ...
        S: .................
        S: ..........blah
        S: .
        C: dele 2
        C: quit
        S: +OK POP3 server signing off

The user agent first asks the mail server to list the size of each of the stored messages. The user agent then retrieves and deletes
each message from the server. Note that after the authorization phase, the user agent employed only four commands: list,
retr, dele, and quit. The syntax for these commands is defined in [RFC 1939]. After issuing the quit command, the POP3
server enters the update phase and removes messages 1 and 2 from the mailbox.
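
The download-and-delete dialogue can also be written as a short client routine. The sketch below uses the method names of Python's standard poplib module (list, retr, dele, quit, user, pass_); the function names download_and_delete and fetch_from_server are our own, and the server name, user name and password are simply the ones used in the Telnet example.

```python
import poplib

def download_and_delete(connection):
    """Run the transaction phase in download-and-delete mode.

    `connection` is an already authorized poplib.POP3 object (or any
    object offering the same list/retr/dele/quit methods).
    """
    _response, listing, _octets = connection.list()    # e.g. [b"1 498", b"2 912"]
    messages = {}
    for entry in listing:
        number = int(entry.split()[0])
        _response, lines, _octets = connection.retr(number)
        messages[number] = b"\r\n".join(lines)          # keep a local copy
        connection.dele(number)                         # only marks for deletion
    connection.quit()      # server enters the update phase and deletes the marks
    return messages

def fetch_from_server(host="mailServer", user="alice", password="hungry"):
    """Authorization phase followed by download-and-delete (not run here)."""
    connection = poplib.POP3(host, 110)
    connection.user(user)
    connection.pass_(password)
    return download_and_delete(connection)
```

A download-and-keep user agent would follow the same outline but skip the dele call.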

A problem with this download-and-delete mode is that the recipient, Bob, may be nomadic and want to access his mail from
multiple machines, including the office PC, the home PC and a portable computer. The download-and-delete mode scatters
Bob's mail over all the local machines; in particular, if Bob first reads a message on a home PC, he will not be able to reread the
message on his portable later in the evening. In the download-and-keep mode, the user agent leaves the messages on the mail
server after downloading them. In this case, Bob can reread messages from different machines; he can access a message from
work, and then access it again later in the week from home.

During a POP3 session between a user agent and the mail server, the POP3 server maintains some state information; in particular, it
keeps track of which messages have been marked deleted. However, the POP3 server is not required to carry state information
across POP3 sessions. For example, no message is marked for deletion at the beginning of each session. This lack of state
information across sessions greatly simplifies the implementation of a POP3 server.
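
To see why per-session state keeps the server simple, consider the toy model below. It is a sketch under loud assumptions: the class Pop3Session, its replies, and the dictionary used as a mailbox are invented for illustration and cover only a handful of commands; a real server follows [RFC 1939]. The only state kept outside the mailbox itself is the set of deletion marks, and that set starts empty in every session.

```python
class Pop3Session:
    """Toy model of one POP3 session over a mailbox shared across sessions."""

    def __init__(self, mailbox, password):
        self.mailbox = mailbox          # message number -> text; outlives the session
        self.password = password
        self.phase = "authorization"
        self.marked = set()             # deletion marks: per-session state only

    def handle(self, line):
        command, _, argument = line.partition(" ")
        if self.phase == "authorization":
            return self._authorize(command, argument)
        if self.phase == "transaction":
            return self._transact(command, argument)
        return "-ERR session closed"

    def _authorize(self, command, argument):
        if command == "user":
            return "+OK"
        if command == "pass" and argument == self.password:
            self.phase = "transaction"
            return "+OK user successfully logged on"
        if command == "quit":
            self.phase = "closed"
            return "+OK POP3 server signing off"
        return "-ERR"

    def _transact(self, command, argument):
        visible = [n for n in sorted(self.mailbox) if n not in self.marked]
        if command == "list":
            rows = [f"{n} {len(self.mailbox[n])}" for n in visible]
            return "\r\n".join(["+OK"] + rows + ["."])
        if command == "dele" and argument.isdigit() and int(argument) in visible:
            self.marked.add(int(argument))      # mark now, delete at update
            return "+OK"
        if command == "rset":
            self.marked.clear()                 # remove all deletion marks
            return "+OK"
        if command == "quit":
            self.phase = "update"
            for number in self.marked:          # update phase: apply the marks
                del self.mailbox[number]
            return "+OK POP3 server signing off"
        return "-ERR"
```

If a session ends without a quit (say, the connection drops), its marks are simply forgotten, and the next session sees the full mailbox again.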


Once Bob has downloaded his messages to the local machine using POP3, he can create mail folders and move the downloaded
messages into the folders. Bob can then delete messages, move messages across folders, and search for messages (say by sender
name or subject). But this paradigm -- folders and messages in the local machine -- poses a problem for the nomadic user, who
would prefer to maintain a folder hierarchy on a remote server that can be accessed from any computer. This is not possible
with POP3.

To solve this and other problems, the Internet Message Access Protocol (IMAP), defined in [RFC 1730], was invented. Like POP3,
IMAP is a mail access protocol. It has many more features than POP3, but it is also significantly more complex. (And thus the
client and server side implementations are significantly more complex.) IMAP is designed to allow users to manipulate remote
mailboxes as if they were local. In particular, IMAP enables Bob to create and maintain multiple message folders at the mail
server. Bob can put messages in folders and move messages from one folder to another. IMAP also provides commands that
allow Bob to search remote folders for messages matching specific criteria. One reason why an IMAP implementation is much
more complicated than a POP3 implementation is that the IMAP server must maintain a folder hierarchy for each of its users.
This state information persists across a particular user's successive accesses to the IMAP server. Recall that a POP3 server, by
contrast, does not maintain anything about a particular user once the user quits the POP3 session.

Another important feature of IMAP is that it has commands that permit a user agent to obtain components of messages. For
example, a user agent can obtain just the message header of a message or just one part of a multipart MIME message. This
feature is useful when there is a low-bandwidth connection between the user agent and its mail server, for example, a wireless
or slow-speed modem connection. With a low-bandwidth connection, the user may not want to download all the messages in his or her
mailbox, particularly avoiding long messages that might contain, for example, an audio or video clip.

An IMAP session consists of the establishment of a connection between the client (i.e., the user agent) and the IMAP server, an
initial greeting from the server, and client-server interactions. The client/server interactions are similar to, but richer than, those
of POP3. They consist of a client command, server data, and a server completion result response. The IMAP server is always in
one of four states. In the non-authenticated state, which starts when the connection starts, the user must supply a user name and
password before most commands will be permitted. In the authenticated state, the user must select a folder before sending
commands that affect messages. In the selected state, the user can issue commands that affect messages (retrieve, move, delete,
retrieve a part in a multipart message, etc.). Finally, the logout state is when the session is being terminated. The IMAP
commands are organized by the state in which the command is permitted. You can read all about IMAP at the official IMAP
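
The four states and the idea that each command is permitted only in certain states can be captured in a small table-driven sketch. This is an illustration, not the protocol definition: it lists only a small subset of IMAP commands, assumes every permitted command succeeds, and the names ImapState, PERMITTED, NEXT_STATE and step are our own.

```python
from enum import Enum

class ImapState(Enum):
    NON_AUTHENTICATED = "non-authenticated"
    AUTHENTICATED = "authenticated"
    SELECTED = "selected"
    LOGOUT = "logout"

# A small subset of IMAP commands, grouped by where they are permitted.
ANY_STATE = {"CAPABILITY", "NOOP", "LOGOUT"}
MAILBOX_COMMANDS = {"SELECT", "CREATE", "LIST"}                 # work on folders
MESSAGE_COMMANDS = {"FETCH", "SEARCH", "COPY", "STORE", "CLOSE"}  # work on messages

PERMITTED = {
    ImapState.NON_AUTHENTICATED: ANY_STATE | {"LOGIN"},
    ImapState.AUTHENTICATED: ANY_STATE | MAILBOX_COMMANDS,
    ImapState.SELECTED: ANY_STATE | MAILBOX_COMMANDS | MESSAGE_COMMANDS,
    ImapState.LOGOUT: set(),
}

# Commands that move the session to another state (assuming they succeed).
NEXT_STATE = {
    "LOGIN": ImapState.AUTHENTICATED,
    "SELECT": ImapState.SELECTED,
    "CLOSE": ImapState.AUTHENTICATED,
    "LOGOUT": ImapState.LOGOUT,
}

def step(state, command):
    """Return the state after `command`, or raise if it is not permitted."""
    if command not in PERMITTED[state]:
        raise ValueError(f"{command} is not permitted in the {state.value} state")
    return NEXT_STATE.get(command, state)
```

For example, a FETCH issued right after the connection opens is rejected, whereas LOGIN, SELECT, FETCH in that order walks the session through the non-authenticated, authenticated and selected states.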


More and more users today are using browser-based email services such as Hotmail or Yahoo! Mail. With these services, the
user agent is an ordinary Web browser and the user communicates with his or her mailbox on the mail server via HTTP. When a
recipient, such as Bob, wants to access the messages in his mailbox, the messages are sent from Bob's mail server to Bob's
browser using the HTTP protocol rather than the POP3 or IMAP protocol. When a sender with an account on an HTTP-based
email server, such as Alice, wants to send a message, the message is sent from her browser to her mail server over HTTP rather
than over SMTP. The mail server, however, still sends messages to, and receives messages from, other mail servers using
SMTP. This solution to mail access is enormously convenient for the user on the go. The user need only be able to access a
browser in order to send and receive messages. The browser can be in an Internet cafe, in a friend's house, in a hotel room with
a Web TV, etc. As with IMAP, users can organize their messages in a hierarchy of folders on the remote server. In fact,
Web-based email is so convenient that it may replace POP3 and IMAP access in the upcoming years. Its principal disadvantage is
that it can be slow, as the server is typically far from the client and interaction with the server is done through CGI scripts.

2.4.4 Continuous Media Email
Continuous-media (CM) email is email that includes audio or video. CM email is appealing for asynchronous communication
among friends and family. For example, a young child who cannot type would prefer sending an audio message to his or her
grandparents. Furthermore, CM email can be desirable in many corporate contexts, as an office worker may be able to record a
CM message more quickly than typing a text message. (English can be spoken at a rate of 180 words per minute, whereas the
average office worker types words at a much slower rate.) Continuous-media e-mail resembles in some respects ordinary voice-
mail messaging in the telephone system. However, continuous-media e-mail is much more powerful. Not only does it provide
the user with a graphical interface to the user's mailbox, but it also allows the user to annotate and reply to CM messages and to
forward CM messages to a large number of recipients.

CM e-mail differs from traditional text mail in many ways. These differences include much larger messages, more stringent end-
to-end delay requirements, and greater sensitivity to recipients with highly heterogeneous Internet access rates and local storage
capabilities. Unfortunately, the current e-mail infrastructure has several inadequacies that obstruct the widespread adoption of
CM e-mail. First, many existing mail servers do not have the capacity to store large CM objects; recipient mail servers typically
reject such messages, which makes sending CM messages to such recipients impossible. Second, the existing mail paradigm of
transporting entire messages to the recipient's mail server before recipient rendering can lead to excessive waste of bandwidth
and storage. Indeed, stored CM is often not rendered in its entirety [Padhye 1999], so that bandwidth and recipient storage are
wasted by receiving data that is never rendered. (For example, one can imagine listening to the first fifteen seconds of a long
audio email from a rather long-winded colleague, and then deciding to delete the remaining 20 minutes of the message without
listening to it.) Third, current mail access protocols (POP3, IMAP and HTTP) are inappropriate for streaming CM to recipients.
(Streaming CM is discussed in detail in Chapter 6.) In particular, the current mail access protocols do not provide functionality
that allows a user to pause/resume a message or to reposition within a message; furthermore, streaming over TCP often leads
to poor reception (see Chapter 6). These inadequacies will hopefully be addressed in the upcoming years. Possible solutions are
discussed in [Gay 1997] [Hess 1998] [Schurmann 1996] and [Turner 1999].


In addition to the references below, a readable but detailed overview of modern electronic mail is given in [Hughes 1998].

[Gay 1997] V. Gay and B. Dervella, "MHEGAM - A Multimedia Messaging System," IEEE Multimedia Magazine, Oct.-Dec.
1997, pp. 22-29.
[Hess 1998] C. Hess, D. Lin and K. Nahrstedt, "VistaMail: An Integrated Multimedia Mailing System," IEEE Multimedia
Magazine, Oct.-Dec. 1998, pp. 13-23.
[Hughes 1998] L. Hughes, Internet E-mail: Protocols, Standards and Implementation, Artech House, Norwood, MA, 1998.
[Padhye 1999] J. Padhye and J. Kurose, "An Empirical Study of Client Interactions with a Continuous-Media Courseware
Server," IEEE Internet Computing, April 1999.
[RFC 821] J.B. Postel, "Simple Mail Transfer Protocol," [RFC 821], August 1982.
[RFC 822] D.H. Crocker, "Standard for the Format of ARPA Internet Text Messages," [RFC 822], August 1982.
[RFC 977] B. Kantor and P. Lapsley, "Network News Transfer Protocol," [RFC 977], February 1986.
[RFC 1730] M. Crispin, "Internet Message Access Protocol - Version 4," [RFC 1730], December 1994.
[RFC 1911] G. Vaudreuil, "Voice Profile for Internet Mail," [RFC 1911], February 1996.
[RFC 1939] J. Myers and M. Rose, "Post Office Protocol - Version 3," [RFC 1939], May 1996.
[RFC 2045] N. Borenstein and N. Freed, "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet
Message Bodies," [RFC 2045], November 1996.
[RFC 2046] N. Borenstein and N. Freed, "Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types," [RFC
2046], November 1996.
[RFC 2048] N. Freed, J. Klensin and J. Postel "Multipurpose Internet Mail Extensions (MIME) Part Four: Registration
Procedures," [RFC 2048], November 1996.
[Schurmann 1996] G. Schurmann, "Multimedia Mail," Multimedia Systems, ACM Press, Oct. 1996, pp. 281-295.
[Turner 1999] D.A. Turner and K.W. Ross, "Continuous-Media Internet E-Mail: Infrastructure Inadequacies and Solutions,


Copyright Keith W. Ross and James F. Kurose 1996-2000. All rights reserved.
  The Domain Name System

           2.5 DNS - The Internet's Directory Service
We human beings can be identified in many ways. For example, we can be identified by the names that appear on our birth
certificates. We can be identified by our social security numbers. We can be identified by our driver's license numbers.
Although each of these identifiers can be used to identify people, within a given context, one identifier may be more
appropriate than another. For example, the computers at the IRS (the infamous tax collecting agency in the US) prefer to use
fixed-length social security numbers rather than birth-certificate names. On the other hand, ordinary people prefer the more
mnemonic birth-certificate names rather than social security numbers. (Indeed, can you imagine saying, "Hi. My name is 132-
67-9875. Please meet my husband, 178-87-1146.")

Just as humans can be identified in many ways, so too can Internet hosts. One identifier for a host is its hostname. Hostnames --
such as,, and -- are mnemonic and are therefore appreciated by
humans. However, hostnames provide little, if any, information about the location within the Internet of the host. (A hostname
such as, which ends with the country code .fr, tells us that the host is in France, but doesn't say much more.)
Furthermore, because hostnames can consist of variable-length alpha-numeric characters, they would be difficult to process by
routers. For these reasons, hosts are also identified by so-called IP addresses. We will discuss IP addresses in some detail in
Chapter 4, but it is useful to say a few brief words about them now. An IP address consists of four bytes and has a rigid
hierarchical structure. An IP address looks like, where each period separates one of the bytes expressed in
decimal notation from 0 to 255. An IP address is hierarchical because as we scan the address from left to right, we obtain more
and more specific information about where (i.e., within which network, in the network of networks) the host is located in the
Internet. (Just as when we scan a postal address from bottom to top we obtain more and more specific information about where
the residence is located). An IP address is included in the header of each IP datagram, and Internet routers use this IP address
to route a datagram towards its destination.
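
The four-byte structure can be made concrete with a short Java sketch. The class name DottedDecimal and the sample address 192.0.2.17 (taken from a range reserved for documentation) are assumptions made for illustration and do not come from the text.

```java
// Hypothetical helper: converts dotted-decimal text into the four bytes of an IP address.
class DottedDecimal {
    /** Parses "a.b.c.d" into four values, each in the range 0..255. */
    static int[] parse(String text) {
        String[] parts = text.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four bytes: " + text);
        }
        int[] bytes = new int[4];
        for (int i = 0; i < 4; i++) {
            bytes[i] = Integer.parseInt(parts[i]);
            if (bytes[i] < 0 || bytes[i] > 255) {
                throw new IllegalArgumentException("byte out of range: " + parts[i]);
            }
        }
        return bytes;
    }

    /** Packs the four bytes into one 32-bit value, leftmost byte most significant. */
    static long toLong(int[] bytes) {
        return ((long) bytes[0] << 24) | (bytes[1] << 16) | (bytes[2] << 8) | bytes[3];
    }

    public static void main(String[] args) {
        int[] bytes = parse("192.0.2.17");
        System.out.println(toLong(bytes));
    }
}
```

Because the leftmost byte is the most significant, addresses that share a leading prefix are numerically close together, which is one way to picture the left-to-right hierarchy.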

2.5.1 Services Provided by DNS
We have just seen that there are two ways to identify a host -- a hostname and an IP address. People prefer the more
mnemonic hostname identifier, while routers prefer fixed-length, hierarchically-structured IP addresses. In order to reconcile
these different preferences, we need a directory service that translates hostnames to IP addresses. This is the main task of
the Internet's Domain Name System (DNS). The DNS is (i) a distributed database implemented in a hierarchy of name
servers and (ii) an application-layer protocol that allows hosts and name servers to communicate in order to provide the
translation service. Name servers are usually Unix machines running the Berkeley Internet Name Domain (BIND) software.
The DNS protocol runs over UDP and uses port 53. Following this chapter we provide interactive links to DNS programs that
allow you to translate arbitrary hostnames, among other things.

DNS is commonly employed by other application-layer protocols -- including HTTP, SMTP and FTP -- to translate
user-supplied host names to IP addresses. As an example, consider what happens when a browser (i.e., an HTTP client), running on
some user's machine, requests the URL In order for the user's machine to be able to send an
HTTP request message to the Web server, the user's machine must obtain the IP address of This is done as follows. The same user machine runs the client-side of the DNS application. The
browser extracts the hostname,, from the URL and passes the hostname to the client-side of the DNS
application. As part of a DNS query message, the DNS client sends a query containing the hostname to a DNS server. The
DNS client eventually receives a reply, which includes the IP address for the hostname. The browser then opens a TCP
connection to the HTTP server process located at that IP address. All IP datagrams sent from the client to the server as part of
this connection will include this IP address in the destination address field of the datagrams. In particular, the IP datagram(s)
that encapsulate the HTTP request message use this IP address. We see from this example that DNS adds an additional delay --
sometimes substantial -- to the Internet applications that use DNS. Fortunately, as we shall discuss below, the desired IP
address is often cached in a "nearby" DNS name server, which helps to reduce the DNS network traffic as well as the average
DNS delay.

Like HTTP, FTP, and SMTP, the DNS protocol is an application-layer protocol since (i) it runs between communicating end
systems (again using the client-server paradigm), and (ii) it relies on an underlying end-to-end transport protocol (i.e., UDP) to
transfer DNS messages between communicating end systems. In another sense, however, the role of the DNS is quite
different from Web, file transfer, and email applications. Unlike these applications, the DNS is not an application with which
a user directly interacts. Instead, the DNS provides a core Internet function -- namely, translating hostnames to their
underlying IP addresses, for user applications and other software in the Internet. We noted earlier in Section 1.2 that much of
the "complexity" in the Internet architecture is located at the "edges" of the network. The DNS, which implements the critical
name-to-address translation process using clients and servers located at the edge of the network, is yet another example of that
design philosophy.

DNS provides a few other important services in addition to translating hostnames to IP addresses:

    •   Host aliasing: A host with a complicated hostname can have one or more alias names. For example, a hostname such
        as could have, say, two aliases such as and In
        this case, the hostname is said to be the canonical hostname. Alias hostnames, when
        present, are typically more mnemonic than a canonical hostname. DNS can be invoked by an application to obtain the
        canonical hostname for a supplied alias hostname as well as the IP address of the host.
    •   Mail server aliasing: For obvious reasons, it is highly desirable that email addresses be mnemonic. For example, if
        Bob has an account with Hotmail, Bob's email address might be as simple as However, the
        hostname of the Hotmail mail server is more complicated and much less mnemonic than simply (e.g., the
        canonical hostname might be something like DNS can be invoked by a mail
        application to obtain the canonical hostname for a supplied alias hostname as well as the IP address of the host. In fact,
        DNS permits a company's mail server and Web server to have identical (aliased) hostnames; for example, a company's
        Web server and mail server can both be called
    •   Load distribution: Increasingly, DNS is also being used to perform load distribution among replicated servers, such
        as replicated Web servers. Busy sites, such as, are replicated over multiple servers, with each server running
        on a different end system and having a different IP address. For replicated Web servers, a set of IP addresses is thus
        associated with one canonical hostname. The DNS database contains this set of IP addresses. When clients make a
        DNS query for a name mapped to a set of addresses, the server responds with the entire set of IP addresses, but rotates
        the ordering of the addresses within each reply. Because a client typically sends its HTTP request message to the IP
        address that is listed first in the set, DNS rotation distributes the traffic among all the replicated servers. DNS rotation is
        also used for email so that multiple mail servers can have the same alias name.
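
The rotation described under load distribution can be sketched in Java as follows. This is a simplified model under stated assumptions, not the behavior of any particular name server implementation: the class name RotatingNameServer is hypothetical, the addresses come from a range reserved for documentation, and the address set is assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of a name server that rotates a replicated site's address set.
class RotatingNameServer {
    private final List<String> addresses;
    private int replies = 0;

    RotatingNameServer(List<String> addresses) {
        this.addresses = new ArrayList<>(addresses);
    }

    /** Returns the entire address set, rotated one position further on each reply. */
    List<String> reply() {
        List<String> answer = new ArrayList<>(addresses);
        // A negative distance shifts elements toward the front of the list.
        Collections.rotate(answer, -(replies % answer.size()));
        replies++;
        return answer;
    }

    public static void main(String[] args) {
        RotatingNameServer server = new RotatingNameServer(
                Arrays.asList("192.0.2.1", "192.0.2.2", "192.0.2.3"));
        for (int i = 0; i < 4; i++) {
            // A typical client would contact the first address in each reply.
            System.out.println(server.reply());
        }
    }
}
```

Every reply still carries all three addresses; only the order changes, so clients that pick the first entry are spread across the replicas.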

The DNS is specified in [RFC 1034] and [RFC 1035], and updated in several additional RFCs. It is a complex system, and we
only touch upon key aspects of its operation here. The interested reader is referred to these RFCs and the book [Abitz 1993].

2.5.2 Overview of How DNS Works
We now present a high-level overview of how DNS works. Our discussion shall focus on the hostname to IP address
translation service. From the client's perspective, the DNS is a black box. The client sends a DNS query message into the black
box, specifying the hostname that needs to be translated to an IP address. On many Unix-based machines,
gethostbyname() is the library routine that an application calls in order to issue the query message. In Section 2.7, we
shall present a Java program that begins by issuing a DNS query. After a delay, ranging from milliseconds to tens of seconds,
the client receives a DNS reply message that provides the desired mapping. Thus, from the client's perspective, DNS is a
simple, straightforward translation service. But in fact, the black box that implements the service is complex, consisting of
a large number of name servers distributed around the globe, as well as an application-layer protocol that specifies how the
name servers and querying hosts communicate.
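
From an application's point of view, the black box is a single library call. In Java, the standard class java.net.InetAddress plays a role similar to gethostbyname(). The sketch below is only an illustration: the class name LookupDemo is hypothetical, and the default hostname localhost is chosen so that the program runs without network access.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical demo class: asks the platform's resolver to translate a hostname.
class LookupDemo {
    /** Returns every IP address the resolver reports for the given hostname. */
    static String[] resolve(String hostname) {
        try {
            InetAddress[] addresses = InetAddress.getAllByName(hostname);
            String[] result = new String[addresses.length];
            for (int i = 0; i < addresses.length; i++) {
                result[i] = addresses[i].getHostAddress();
            }
            return result;
        } catch (UnknownHostException e) {
            // The resolver could not translate the hostname.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String hostname = args.length > 0 ? args[0] : "localhost";
        for (String address : resolve(hostname)) {
            System.out.println(hostname + " -> " + address);
        }
    }
}
```

The caller never sees which name servers were contacted or whether the answer came from a cache; it only observes the delay and the result.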

A simple design for DNS would have one Internet name server that contains all the mappings. In this centralized design,
clients simply direct all queries to the single name server, and the name server responds directly to the querying clients.
Although the simplicity of this design is attractive, it is completely inappropriate for today's Internet, with its vast (and
growing) number of hosts. The problems with a centralized design include:

    •   A single point of failure. If the name server crashes, so too does the entire Internet!
    •   Traffic volumes. A single name server would have to handle all DNS queries (for all the HTTP requests, email
        messages, etc. generated from millions of hosts).
    •   Distant centralized database. A single name server cannot be "close" to all the querying clients. If we put the single
        name server in New York City, then all queries from Australia must travel to the other side of the globe, perhaps over
        slow and congested links. This can lead to significant delays (thereby increasing the "world wide wait" for the Web and
        other applications).
    •   Maintenance. The single name server would have to keep records for all Internet hosts. Not only would this
        centralized database be huge, but it would have to be updated frequently to account for every new host. There are also
        authentication and authorization problems associated with allowing any user to register a host with the centralized
        database.

In summary, a centralized database in a single name server simply doesn't scale. Consequently, the DNS is distributed by
design. In fact, the DNS is a wonderful example of how a distributed database can be implemented in the Internet.

In order to deal with the issue of scale, the DNS uses a large number of name servers, organized in a hierarchical fashion and
distributed around the world. No one name server has all of the mappings for all of the hosts in the Internet. Instead, the
mappings are distributed across the name servers. To a first approximation, there are three types of name servers: local name
servers, root name servers, and authoritative name servers. These name servers, again to a first approximation, interact with
each other and with the querying host as follows:

    •   Local name servers: Each ISP -- such as a university, an academic department, an employee's company or a residential
        ISP -- has a local name server (also called a default name server). When a host issues a DNS query message, the
        message is first sent to the host's local name server. The IP address of the local name server is typically configured
        by hand in a host. (On a Windows 95/98 machine, you can find the IP address of the local name server used by your PC
        by opening the Control Panel, and then selecting "Network", then selecting an installed TCP/IP component, and then
        selecting the DNS configuration folder tab.) The local name server is typically "close" to the client; in the case of an
        institutional ISP, it may be on the same LAN as the client host; for a residential ISP, the name server is typically
        separated from the client host by no more than a few routers. If a host requests a translation for another host that is part
        of the same local ISP, then the local name server will be able to immediately provide the requested IP address. For
        example, when the host requests the IP address for, the local name server at Eurecom
        will be able to provide the requested IP address without contacting any other name servers.
    •   Root name servers: In the Internet there are a dozen or so "root name servers," most of which are currently located
        in North America. A February 1998 map of the root servers is shown in Figure 2.5-1. When a local name server cannot
        immediately satisfy a query from a host (because it does not have a record for the hostname being requested), the local
        name server behaves as a DNS client and queries one of the root name servers. If the root name server has a record for
        the hostname, it sends a DNS reply message to the local name server, and the local name server then sends a DNS reply
        to the querying host. But the root name server may not have a record for the hostname. Instead, the root name server
        knows the IP address of an "authoritative name server" that has the mapping for that particular hostname.
    •   Authoritative name servers: Every host is registered with an authoritative name server. Typically, the authoritative
        name server for a host is a name server in the host's local ISP. (Actually, each host is required to have at least two
        authoritative name servers, in case of failures.) By definition, a name server is authoritative for a host if it always has a
        DNS record that translates the host's hostname to that host's IP address. When an authoritative name server is queried
        by a root server, the authoritative name server responds with a DNS reply that contains the requested mapping. The
        root server then forwards the mapping to the local name server, which in turn forwards the mapping to the requesting
        host. Many name servers act as both local and authoritative name servers.

Figure 2.5-1: A February 1998 map of the DNS root servers. Obtained from the WIA alliance Web site (

Let's take a look at a simple example. Suppose the host desires the IP address of Also
suppose that Eurecom's local name server is called and that an authoritative name server for
is called As shown in Figure 2.5-2, the host first sends a DNS query message to its local name
server, The query message contains the hostname to be translated, namely, The local name
server forwards the query message to a root name server. The root name server forwards the query message to the name server
that is authoritative for all the hosts in the domain, namely, to The authoritative name server then
sends the desired mapping to the querying host, via the root name server and the local name server. Note that in this example,
in order to obtain the mapping for one hostname, six DNS messages were sent: three query messages and three reply
messages.

                           Figure 2.5-2: Recursive queries to obtain the mapping for

Our discussion up to this point has assumed that the root name server knows the IP address of an authoritative name server for
every hostname. This assumption may be incorrect. For a given hostname, the root name server may only know the IP address
of an intermediate name server that in turn knows the IP address of an authoritative name server for the hostname. To illustrate
this, consider once again the above example with the host querying for the IP address of
Suppose now that the University of Massachusetts has a name server for the university, called Also suppose
that each of the departments at University of Massachusetts has its own name server, and that each departmental name server
is authoritative for all the hosts in the department. As shown in Figure 2.5-3, when the root name server receives a query for a
host with hostname ending with it forwards the query to the name server This name server forwards
all queries with hostnames ending with to the name server, which is authoritative for all
hostnames ending with The authoritative name server sends the desired mapping to the intermediate name
server,, which forwards the mapping to the root name server, which forwards the mapping to the local name
server,, which forwards the mapping to the requesting host! In this example, eight DNS messages are sent.
Actually, even more DNS messages can be sent in order to translate a single hostname - there can be two or more intermediate
name servers in the chain between the root name server and the authoritative name server!

      Figure 2.5-3: Recursive queries with an intermediate name server between the root and authoritative name servers.

The examples up to this point assumed that all queries are recursive queries. When a host or name server A makes a recursive
query to a name server B, then name server B obtains the requested mapping on behalf of A and then forwards the mapping to
A. The DNS protocol also allows for iterative queries at any step in the chain between requesting host and authoritative name
server. When a name server A makes an iterative query to name server B, if name server B does not have the requested
mapping, it immediately sends a DNS reply to A that contains the IP address of the next name server in the chain, say, name
server C. Name server A then sends a query directly to name server C.
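
The difference between the two query styles can be sketched with a small Java model. It is a simplification under stated assumptions: the class ChainServer, the hostname host.example.edu and the address 192.0.2.10 are hypothetical, each server knows only one "next" server, and messages are counted instead of being sent over a network.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a chain of name servers; not the API of any real DNS library.
class ChainServer {
    final Map<String, String> records = new HashMap<>(); // hostname -> IP address
    ChainServer next;                                     // next server in the chain, or null

    /** Recursive query: this server obtains the mapping on the caller's behalf. */
    String recursiveQuery(String hostname, int[] messages) {
        messages[0]++;                                    // the query arriving at this server
        String address = records.get(hostname);
        if (address == null && next != null) {
            address = next.recursiveQuery(hostname, messages);
        }
        messages[0]++;                                    // the reply this server sends back
        return address;
    }

    /** Iterative resolution: the requester itself contacts each server in turn. */
    static String iterativeResolve(ChainServer first, String hostname, int[] messages) {
        for (ChainServer server = first; server != null; server = server.next) {
            messages[0] += 2;                             // one query, one reply (answer or referral)
            String address = server.records.get(hostname);
            if (address != null) {
                return address;
            }
        }
        return null;
    }

    /** Builds a chain of the given length whose last server holds the mapping. */
    static ChainServer chain(int length, String hostname, String address) {
        ChainServer first = new ChainServer();
        ChainServer last = first;
        for (int i = 1; i < length; i++) {
            last.next = new ChainServer();
            last = last.next;
        }
        last.records.put(hostname, address);
        return first;
    }

    public static void main(String[] args) {
        int[] messages = {0};
        // Local, root and authoritative name servers: a chain of three.
        chain(3, "host.example.edu", "192.0.2.10").recursiveQuery("host.example.edu", messages);
        System.out.println(messages[0] + " messages");
    }
}
```

With a chain of three servers the recursive query accounts for six messages, and adding one intermediate server raises the count to eight, matching the counts given for Figures 2.5-2 and 2.5-3. In the iterative variant the work of following the chain falls on the requester, while each queried server answers once and is done.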

In the sequence of queries that are required to translate a hostname, some of the queries can be iterative and others
recursive. Such a combination of recursive and iterative queries is illustrated in Figure 2.5-4. Typically, all queries in the query
chain are recursive except for the query from the local name server to the root name server, which is iterative. (Because root
servers handle huge volumes of queries, it is preferable to use the less burdensome iterative queries for root servers.)

                                  Figure 2.5-4: A query chain with recursive and iterative queries.

Our discussion thus far has not touched on one important feature of the DNS: DNS caching. In reality, DNS extensively
exploits caching in order to improve the delay performance and to reduce the number of DNS messages in the network. The
idea is very simple. When a name server receives a DNS mapping for some hostname, it caches the mapping in local memory
(disk or RAM) while passing the message along the name server chain. Given a cached hostname/IP address translation pair, if
another query arrives at the name server for the same hostname, the name server can provide the desired IP address, even if it
is not authoritative for the hostname. In order to deal with ephemeral hosts, a cached record is discarded after a period of
time (often set to two days). As an example, suppose that queries the DNS for the IP address for the hostname Furthermore suppose that a few hours later, another Eurecom host, say, also queries DNS with the
same hostname. Because of caching, the local name server at Eurecom will be able to immediately return the IP address to the
requesting host without having to query name servers on another continent. Any name server may cache DNS mappings.
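
A cache with expiry can be sketched in a few lines of Java. This is a minimal illustration, not the data structure of any real name server: the class DnsCache is hypothetical, the hostname and address are placeholders, and the current time is passed in explicitly so that the expiry behavior is easy to follow.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache of hostname-to-address mappings that expire after a lifetime.
class DnsCache {
    private static final class Entry {
        final String address;
        final long expiresAtMillis;

        Entry(String address, long expiresAtMillis) {
            this.address = address;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Stores a mapping that stays valid for lifetimeMillis after nowMillis. */
    void put(String hostname, String address, long lifetimeMillis, long nowMillis) {
        entries.put(hostname, new Entry(address, nowMillis + lifetimeMillis));
    }

    /** Returns the cached address, or null if it is absent or has expired. */
    String get(String hostname, long nowMillis) {
        Entry entry = entries.get(hostname);
        if (entry == null) {
            return null;
        }
        if (nowMillis >= entry.expiresAtMillis) {
            entries.remove(hostname);   // discard the stale record
            return null;
        }
        return entry.address;
    }

    public static void main(String[] args) {
        long hour = 60L * 60 * 1000;
        DnsCache cache = new DnsCache();
        cache.put("host.example.edu", "192.0.2.10", 48 * hour, 0);
        System.out.println(cache.get("host.example.edu", 3 * hour));   // still cached
        System.out.println(cache.get("host.example.edu", 49 * hour));  // expired
    }
}
```

A name server using such a cache would consult it before querying any other server and would fall back to the query chain only when get returns null.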

2.5.3 DNS Records
The name servers that together implement the DNS distributed database store Resource Records (RR) for the hostname to IP
address mappings. Each DNS reply message carries one or more resource records. In this and the following subsection, we
provide a brief overview of DNS resource records and messages; more details can be found in [Abitz 1993] or in the DNS RFCs
[RFC 1034] [RFC 1035].

A resource record is a four-tuple that contains the following fields:

                                               (Name, Value, Type, TTL)
TTL is the time to live of the resource record; it determines the time at which a resource record should be removed from a cache. In
the example records given below, we will ignore the TTL field. The meaning of Name and Value depends on Type:

    •   If Type=A, then Name is a hostname and Value is the IP address for the hostname. Thus, a Type A record provides the
        standard hostname to IP address mapping. As an example, (,, A) is
        a Type A record.
    •   If Type=NS, then Name is a domain (such as and Value is the hostname of a server that knows how to
        obtain the IP addresses for hosts in the domain. This record is used to route DNS queries further along in the query
        chain. As an example, (,, NS) is a Type NS record.
    •   If Type=CNAME, then Value is a canonical hostname for the alias hostname Name. This record can provide querying
        hosts the canonical name for a hostname. As an example, (,, CNAME) is a
        CNAME record.
    •   If Type=MX, then Value is a hostname of a mail server that has an alias hostname Name. As an example, (, MX) is an
        MX record. MX records allow the hostnames of mail servers to have simple aliases.

If a name server is authoritative for a particular hostname, then the name server will contain a Type A record for the hostname.
(Even if the name server is not authoritative, it may contain a Type A record in its cache.) If a server is not authoritative for a
hostname, then the server will contain a Type NS record for the domain that includes the hostname; it will also contain a Type
A record that provides the IP address of the name server in the Value field of the NS record. As an example, suppose a root
server is not authoritative for the host Then the root server will contain a record for a domain that includes
the host, e.g.,
                                           (,, NS).

The root server would also contain a type A record which maps the name server to an IP address, e.g.,

                                         (,, A).
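
The way a server that is not authoritative combines a Type NS record with a Type A record can be sketched as follows. The class RecordStore, the domain example.edu, the name server dns.example.edu and both addresses are hypothetical values chosen for illustration; the TTL field is ignored, as in the examples above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory store of (Name, Value, Type) resource records.
class RecordStore {
    static final class ResourceRecord {
        final String name;
        final String value;
        final String type;

        ResourceRecord(String name, String value, String type) {
            this.name = name;
            this.value = value;
            this.type = type;
        }
    }

    private final List<ResourceRecord> records = new ArrayList<>();

    void add(String name, String value, String type) {
        records.add(new ResourceRecord(name, value, type));
    }

    /** Returns the first record with the given name and type, or null. */
    ResourceRecord find(String name, String type) {
        for (ResourceRecord record : records) {
            if (record.name.equals(name) && record.type.equals(type)) {
                return record;
            }
        }
        return null;
    }

    /** Answers from a Type A record, or else refers the caller to a name server. */
    String answerOrReferral(String hostname) {
        ResourceRecord direct = find(hostname, "A");
        if (direct != null) {
            return "answer " + direct.value;
        }
        for (ResourceRecord ns : records) {
            // An NS record applies when the hostname lies inside its domain.
            if (ns.type.equals("NS") && hostname.endsWith("." + ns.name)) {
                ResourceRecord serverAddress = find(ns.value, "A");
                if (serverAddress != null) {
                    return "referral " + serverAddress.value;
                }
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        RecordStore root = new RecordStore();
        root.add("example.edu", "dns.example.edu", "NS");
        root.add("dns.example.edu", "192.0.2.53", "A");
        System.out.println(root.answerOrReferral("host.example.edu"));
    }
}
```

The NS record names the server to ask next and the accompanying A record supplies that server's IP address, which is why the two records are stored together.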

2.5.4 DNS Messages
Earlier in this section we alluded to DNS query and reply messages. These are the only two kinds of DNS messages.
Furthermore, both query and reply messages have the same format, as shown in Figure 2.5-5.

                                                  Figure 2.5-5: DNS message format

The semantics of the various fields in a DNS message are as follows:

    •   The first 12 bytes are the header section, which has a number of fields. The first field is a 16-bit number that identifies
        the query. This identifier is copied into the reply message to a query, allowing the client to match received replies with
        sent queries. There are a number of flags in the flag field. A one-bit query/reply flag indicates whether the message is a
        query (0) or a reply (1). A one-bit authoritative flag is set in a reply message when a name server is an authoritative
        server for a queried name. A one-bit recursion-desired flag is set when a client (host or name server) desires that the
        name server perform recursion when it doesn't have the record. A one-bit recursion-available field is set in a reply if
        the name server supports recursion. In the header, there are also four "number of" fields. These fields indicate the
        number of occurrences of the four types of "data" sections that follow the header.
    •   The question section contains information about the query that is being made. This section includes (i) a name field
        that contains the name that is being queried, and (ii) a type field that indicates the type of question being asked about
        the name (e.g., a host address associated with a name - type "A", or the mail server for a name - type "MX").
    •   In a reply from a name server, the answer section contains the resource records for the name that was originally
        queried. Recall that in each resource record there is the Type (e.g., A, NS, CNAME and MX), the Value and the TTL.
        A reply can return multiple RRs in the answer, since a hostname can have multiple IP addresses (e.g., for replicated
        Web servers, as discussed earlier in this section).
    •   The authority section contains records of other authoritative servers.
    •   The additional section contains other "helpful" records. For example, the answer field in a reply to an MX query will
        contain the hostname of a mail server associated with the alias name Name. The additional section will contain a
        Type A record providing the IP address for the canonical hostname of the mail server.
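
The header and question sections can be assembled byte by byte, as the following Java sketch shows. Several details are not given in the description above and are taken from [RFC 1035]: the bit position of the recursion-desired flag, the encoding of the name as length-prefixed labels ending with a zero byte, the numeric type codes (1 for A, 15 for MX) and the class field (1 for Internet). The class name DnsQueryBuilder and the hostname www.example.com are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical builder for a DNS query message containing a single question.
class DnsQueryBuilder {
    static final int TYPE_A = 1;
    static final int TYPE_MX = 15;
    static final int FLAG_RECURSION_DESIRED = 0x0100;

    static byte[] build(int id, String name, int type, boolean recursionDesired) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Header section: 12 bytes.
        writeShort(out, id);                                             // 16-bit identifier
        writeShort(out, recursionDesired ? FLAG_RECURSION_DESIRED : 0);  // query/reply bit 0 = query
        writeShort(out, 1);                                              // number of questions
        writeShort(out, 0);                                              // number of answer RRs
        writeShort(out, 0);                                              // number of authority RRs
        writeShort(out, 0);                                              // number of additional RRs
        // Question section: name, type, class.
        for (String label : name.split("\\.")) {
            byte[] bytes = label.getBytes(StandardCharsets.US_ASCII);
            out.write(bytes.length);                                     // length prefix of the label
            out.write(bytes, 0, bytes.length);
        }
        out.write(0);                                                    // end of name
        writeShort(out, type);
        writeShort(out, 1);                                              // class 1 = Internet
        return out.toByteArray();
    }

    /** Writes a 16-bit value with the most significant byte first. */
    private static void writeShort(ByteArrayOutputStream out, int value) {
        out.write((value >> 8) & 0xFF);
        out.write(value & 0xFF);
    }

    public static void main(String[] args) {
        byte[] query = build(0x1234, "www.example.com", TYPE_A, true);
        System.out.println(query.length + " bytes");
    }
}
```

A reply would reuse the same layout, copy the identifier, set the query/reply bit to 1, and fill in the answer, authority and additional sections after the question.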

The discussion above has focused on how data is retrieved from the DNS database. You might be wondering how data gets
into the database in the first place. Until recently, the contents of each DNS server were configured statically, e.g., from a
configuration file created by a system manager. More recently, an UPDATE option has been added to the DNS protocol to
allow data to be dynamically added to or deleted from the database via DNS messages. [RFC 2136] specifies DNS dynamic updates.

DNSNet provides a nice collection of documents pertaining to DNS [DNSNet]. The Internet Software Consortium provides
many resources for BIND, a popular public-domain name server for Unix machines [BIND].


[Abitz 1993] Paul Albitz and Cricket Liu, DNS and BIND, O'Reilly & Associates, Petaluma, CA, 1993.
[BIND] Internet Software Consortium page on BIND,
[DNSNet] DNSNet page on DNS resources,
[RFC 1034] P. Mockapetris, "Domain Names - Concepts and Facilities," RFC 1034, Nov. 1987.
[RFC 1035] P. Mockapetris, "Domain Names - Implementation and Specification," RFC 1035, Nov. 1987.
[RFC 2136] P. Vixie, S. Thomson, Y. Rekhter, J. Bound, "Dynamic Updates in the Domain Name System," RFC 2136, April


  Interactive Programs for Exploring DNS
There are at least three client programs available for exploring the contents of name servers in the
Internet. The most widely available program is nslookup; two other programs, which are a little more
powerful than nslookup, are dig and host. Lucky for us, several institutions and individuals have made
these client programs available through Web browsers.

We strongly encourage you to get your hands dirty and play with these programs. They can give
significant insight into how DNS works. All of these programs mimic DNS clients. They send a DNS
query message to a name server (which can often be supplied by the user), and they receive a
corresponding DNS response. They then extract information (e.g., IP addresses, whether the response is
authoritative, etc.) and present the information to the user.


Some of the nslookup sites provide only the basic nslookup service, i.e., they allow you to enter a
hostname and they return an IP address. Visit some of the nslookup sites below and try entering
hostnames for popular hosts (such as or as well as hostnames for the less
popular hosts. You will see that the popular hostnames typically return numerous IP addresses, because
the site is replicated in numerous servers. (See the discussion in Section 2.5 on DNS rotation.) Some of
the nslookup sites also return the hostname and IP address of the name server that provides the
information. Also, some of the nslookup sites indicate whether the result is non-authoritative (i.e.,
obtained from a cache).

Some of the nslookup sites allow the user to supply more information. For example, the user can request
to receive the canonical hostname and IP address for a mail server. And the user can also indicate the
name server at which it wants the chain of queries to begin.

dig and host

The programs dig and host allow the user to further refine the query by indicating, for example, whether
the query should be recursive or iterative. There are currently not as many Web sites that provide the
dig and host service. But there are a few:

  Socket Programming in Java

                       2.6 Socket Programming with TCP
This and the subsequent sections provide an introduction to network application development. Recall from Section 2.1 that the
core of a network application consists of a pair of programs -- a client program and a server program. When these two programs
are executed, a client and server process are created, and these two processes communicate with each other by reading from
and writing to sockets. When creating a network application, the developer's main task is to write the code for both the
client and server programs.

There are two sorts of client-server applications. One sort is a client-server application that is an implementation of a protocol
standard defined in an RFC. For such an implementation, the client and server programs must conform to the rules dictated by
the RFC. For example, the client program could be an implementation of the FTP client, defined in [RFC 959], and the server
program could be an implementation of the FTP server, also defined in [RFC 959]. If one developer writes code for the client
program and an independent developer writes code for the server program, and both developers carefully follow the rules of the
RFC, then the two programs will be able to interoperate. Indeed, most of today's network applications involve communication
between client and server programs that have been created by independent developers. (For example, a Netscape browser
communicating with an Apache Web server, or an FTP client on a PC uploading a file to a Unix FTP server.) When a client or
server program implements a protocol defined in an RFC, it should use the port number associated with the protocol. (Port
numbers were briefly discussed in Section 2.1. They will be covered in more detail in the next chapter.)

The other sort of client-server application is a proprietary client-server application. In this case the client and server programs
do not necessarily conform to any existing RFC. A single developer (or development team) creates both the client and server
programs, and the developer has complete control over what goes in the code. But because the code does not implement a
public-domain protocol, other independent developers will not be able to develop code that interoperates with the application.
When developing a proprietary application, the developer must be careful not to use one of the well-known port numbers
defined in the RFCs.

In this and the next section, we will examine the key issues for the development of a proprietary client-server application.
During the development phase, one of the first decisions the developer must make is whether the application is to run over TCP
or over UDP. TCP is connection-oriented and provides a reliable byte stream channel through which data flows between two
end systems. UDP is connectionless and sends independent packets of data from one end system to the other, without any
guarantees about delivery. In this section we develop a simple client application that runs over TCP; in the subsequent section,
we develop a simple client application that runs over UDP.

We present these simple TCP and UDP applications in Java. We could have written the code in C or C++, but we opted for Java
for several reasons. First, the applications are more neatly and cleanly written in Java; with Java there are fewer lines of code,
and each line can be explained to the novice programmer without much difficulty. Second, client-server programming in Java is
becoming increasingly popular, and may even become the norm in upcoming years. Java is platform independent, it has
exception mechanisms for robust handling of common problems that occur during I/O and networking operations, and its
threading facilities provide a way to easily implement powerful servers. But there is no need to be frightened if you are not
familiar with Java. You should be able to follow the code if you have experience programming in another language.

For readers who are interested in client-server programming in C, there are several good references available, including
[Stevens 1990] , [Frost 1994] and [Kurose 1996] .

2.6.1 Socket Programming with TCP
Recall from Section 2.1 that processes running on different machines communicate with each other by sending messages into
sockets. We said that each process was analogous to a house and the process's socket is analogous to a door. As shown in Figure
2.6-1, the socket is the door between the application process and TCP. The application developer has control of everything on
the application-layer side of the socket; however, it has little control of the transport-layer side. (At the very most, the
application developer has the ability to fix a few TCP parameters, such as maximum buffer and maximum segment sizes.)
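
To make the remark about TCP parameters concrete, here is a minimal sketch of how a Java program might request socket buffer sizes. The class name SocketOptionsSketch and the sizes chosen are our own illustrative assumptions. The values passed in are only hints; the operating system may round or cap them, which is why the sketch reads the granted values back. The standard Socket class does not appear to offer a setter for the maximum segment size, so only buffer sizes are shown.

```java
import java.net.Socket;

class SocketOptionsSketch {
    // Requests buffer sizes on an unconnected socket and returns what the OS granted.
    static int[] requestBufferSizes(int sendBytes, int receiveBytes) throws Exception {
        try (Socket socket = new Socket()) {            // unconnected: no handshake happens here
            socket.setSendBufferSize(sendBytes);        // a hint; the OS may adjust it
            socket.setReceiveBufferSize(receiveBytes);  // usually set before connecting
            return new int[] { socket.getSendBufferSize(), socket.getReceiveBufferSize() };
        }
    }

    public static void main(String[] args) throws Exception {
        int[] granted = requestBufferSizes(32 * 1024, 64 * 1024);
        System.out.println("send buffer: " + granted[0] + " bytes");
        System.out.println("receive buffer: " + granted[1] + " bytes");
    }
}
```

Everything else on the transport-layer side of the socket (segmentation, retransmission, congestion control) remains out of the application's hands.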

                                    Figure 2.6-1: Processes communicating through TCP sockets.

Now let's take a closer look at the interaction of the client and server programs. The client has the job of initiating contact
with the server. In order for the server to be able to react to the client's initial contact, the server has to be ready. This implies
two things. First, the server program cannot be dormant; it must be running as a process before the client attempts to initiate
contact. Second, the server program must have some sort of door (i.e., socket) that welcomes some initial contact from a client
(running on an arbitrary machine). Using our house/door analogy for a process/socket, we will sometimes refer to the client's
initial contact as "knocking on the door".

With the server process running, the client process can initiate a TCP connection to the server. This is done in the client
program by creating a socket object. When the client creates its socket object, it specifies the address of the server process,
namely, the IP address of the server and the port number of the process. Upon creation of the socket object, TCP in the client
initiates a three-way handshake and establishes a TCP connection with the server. The three-way handshake is completely
transparent to the client and server programs.

During the three-way handshake, the client process knocks on the welcoming door of the server process. When the server
"hears" the knocking, it creates a new door (i.e., a new socket) that is dedicated to that particular client. In our example below,
the welcoming door is a ServerSocket object that we call the welcomeSocket. When a client knocks on this door, the program
invokes welcomeSocket's accept() method, which creates a new door for the client. At the end of the handshaking phase, a TCP
connection exists between the client's socket and the server's new socket. Henceforth, we refer to the new socket as the server's
"connection socket".

From the application's perspective, the TCP connection is a direct virtual pipe between the client's socket and the server's
connection socket. The client process can send arbitrary bytes into its socket; TCP guarantees that the server process will
receive (through the connection socket) each byte in the order sent. Furthermore, just as people can go in and out the same door,
the client process can also receive bytes from its socket and the server process can also send bytes into its connection socket.
This is illustrated in Figure 2.6-2.

                                Figure 2.6-2: Client socket, welcoming socket and connection socket.
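
The relationship between the three sockets can be observed directly. The following is a minimal sketch, assuming both ends run in one process over the loopback interface; the class name WelcomeSocketSketch and the method knockOnce are hypothetical names of our own. It shows that accept() hands back a new socket that shares the welcoming socket's local port, and that this connection socket's remote port is the client socket's local port.

```java
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class WelcomeSocketSketch {
    // Returns {welcoming port, connection socket local port,
    //          connection socket remote port, client socket local port}.
    static int[] knockOnce() throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket welcomeSocket = new ServerSocket(0, 1, loopback);   // port 0: OS picks a free port
             Socket clientSocket = new Socket(loopback, welcomeSocket.getLocalPort()); // the "knock"
             Socket connectionSocket = welcomeSocket.accept()) {              // new door for this client
            return new int[] {
                welcomeSocket.getLocalPort(),
                connectionSocket.getLocalPort(),
                connectionSocket.getPort(),
                clientSocket.getLocalPort()
            };
        }
    }

    public static void main(String[] args) throws Exception {
        int[] ports = knockOnce();
        System.out.println("welcoming socket listens on port " + ports[0]);
        System.out.println("connection socket local port     " + ports[1]);
        System.out.println("connection socket remote port    " + ports[2]);
        System.out.println("client socket local port         " + ports[3]);
    }
}
```

The client can connect before accept() is called because the operating system completes the handshake and queues the connection until the server program picks it up.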

Because sockets play a central role in client-server applications, client-server application development is also referred to as
socket programming. Before providing our example client-server application, it is useful to discuss the notion of a stream. A
stream is a flowing sequence of characters that flows into or out of a process. Each stream is either an input stream for the
process or an output stream for the process. If the stream is an input stream, then it is attached to some input source for the
process, such as standard input (the keyboard) or a socket into which characters flow from the Internet. If the stream is an
output stream, then it is attached to some output destination for the process, such as standard output (the monitor) or a socket out of
which characters flow into the Internet.
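
Because a stream hides what it is attached to, the same stream classes work on a socket, on the keyboard, or on a plain block of memory. The following sketch, under the assumption that an in-memory byte array stands in for the socket, writes a line into an output stream and reads it back through an input stream. The class name StreamSketch and the method throughStreams are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class StreamSketch {
    // Sends a line into an output stream, then reads it back from an input stream.
    static String throughStreams(String sentence) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();  // stands in for the socket's output side
        DataOutputStream out = new DataOutputStream(sink);
        out.writeBytes(sentence + '\n');   // writes the low-order byte of each character
        out.flush();

        ByteArrayInputStream source = new ByteArrayInputStream(sink.toByteArray()); // stands in for the input side
        BufferedReader in =
            new BufferedReader(new InputStreamReader(source, StandardCharsets.US_ASCII));
        return in.readLine();              // returns the line without its terminator
    }

    public static void main(String[] args) throws IOException {
        System.out.println(throughStreams("hello streams"));
    }
}
```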

2.6.2 An Example Client-Server Application in Java
We shall use the following simple client-server application to demonstrate socket programming for both TCP and UDP:

    1.   A client reads a line from its standard input (keyboard) and sends the line out its socket to the server.
    2.   The server reads a line from its connection socket.
    3.   The server converts the line to uppercase.
    4.   The server sends the modified line out its connection socket to the client.
    5.   The client reads the modified line from its socket and prints the line on its standard output (monitor).

Below we provide the client-server program pair for a TCP implementation of the application. We provide a detailed, line-by-line
analysis after each program. The client program is called TCPClient.java, and the server program is called TCPServer.java.
In order to emphasize the key issues, we intentionally provide code that is to the point but not bulletproof. "Good code" would
certainly have a few more auxiliary lines.

Once the two programs are compiled on their respective hosts, the server program is first executed at the server, which
creates a process at the server. As discussed above, the server process waits to be contacted by a client process. When the client
program is executed, a process is created at the client, and this process contacts the server and establishes a TCP connection
with it. The user at the client may then "use" the application to send a line and then receive a capitalized version of the line.

Here is the code for the client side of the application:

import java.io.*;
import java.net.*;

class TCPClient {

    public static void main(String argv[]) throws Exception
    {
        String sentence;
        String modifiedSentence;

        // Stream attached to standard input (the keyboard).
        BufferedReader inFromUser =
          new BufferedReader(new InputStreamReader(System.in));

        // Creating the socket initiates the TCP connection to the server.
        Socket clientSocket = new Socket("hostname", 6789);

        DataOutputStream outToServer =
          new DataOutputStream(clientSocket.getOutputStream());

        BufferedReader inFromServer =
          new BufferedReader(new InputStreamReader(clientSocket.getInputStream()));

        sentence = inFromUser.readLine();

        // Send the line, terminated by a newline, to the server.
        outToServer.writeBytes(sentence + '\n');

        // Wait for the server's reply.
        modifiedSentence = inFromServer.readLine();

        System.out.println("FROM SERVER: " + modifiedSentence);

        clientSocket.close();
    }
}

The program TCPClient creates three streams and one socket, as shown in Figure 2.6-3.

                                      Figure 2.6-3: TCPClient has three streams and one socket.

The socket is called clientSocket. The stream inFromUser is an input stream to the program; it is attached to the standard
input, i.e., the keyboard. When the user types characters on the keyboard, the characters flow into the stream inFromUser. The
stream inFromServer is another input stream to the program; it is attached to the socket. Characters that arrive from the
network flow into the stream inFromServer. Finally, the stream outToServer is an output stream from the program; it is also
attached to the socket. Characters that the client sends to the network flow into the stream outToServer.

Let's now take a look at the various lines in the code.

       import java.io.*;
       import java.net.*;

java.io and java.net are Java packages. The java.io package contains classes for input and output streams. In particular, the java.io
package contains the BufferedReader and DataOutputStream classes, classes that the program uses to create the three
streams illustrated above. The java.net package provides classes for network support. In particular, it contains the Socket and
ServerSocket classes. The clientSocket object of this program is an instance of the Socket class.

    class TCPClient {
         public static void main(String argv[]) throws Exception

The above is standard stuff that you see at the beginning of most Java code. The first line is the beginning of a class definition
block. The keyword class begins the class definition for the class named TCPClient. A class contains variables and methods.
The variables and methods of the class are embraced by the curly brackets that begin and end the class definition block. The
class TCPClient has no class variables and exactly one method, the main( ) method. Methods are similar to the functions or
procedures in languages such as C; the main method in the Java language is similar to the main function in C and C++. When
the Java interpreter executes an application (by being invoked upon the application's controlling class), it starts by calling the
class's main method. The main method then calls all the other methods required to run the application. For this introduction to
socket programming in Java, you may ignore the keywords public, static, void, main, throws Exception (although you must
include them in the code).

      String sentence;
      String modifiedSentence;

The above two lines declare objects of type String. The object sentence is the string typed by the user and sent to the server.
The object modifiedSentence is the string obtained from the server and sent to the user's standard output.

       BufferedReader inFromUser =
                 new BufferedReader(new InputStreamReader(System.in));

The above line creates the stream object inFromUser of type BufferedReader. The input stream is initialized with System.in,
which attaches the stream to the standard input. The command allows the client to read text from its keyboard.

    Socket clientSocket = new Socket("hostname", 6789);

The above line creates the object clientSocket of type Socket. It also initiates the TCP connection between client and server.
The string "hostname" must be replaced with the host name of the server. Before the TCP
connection is actually initiated, the client performs a DNS lookup on the hostname to obtain the host's IP address. The number
6789 is the port number. You can use a different port number, but you must make sure that you use the same port number at the
server side of the application. As discussed earlier, the host's IP address along with the application's port number identifies the
server process.

    DataOutputStream outToServer =
       new DataOutputStream(clientSocket.getOutputStream());

    BufferedReader inFromServer =
       new BufferedReader(new InputStreamReader(clientSocket.getInputStream()));

The above two lines create stream objects that are attached to the socket. The outToServer stream provides the process output
to the socket. The inFromServer stream provides the process input from the socket. (See diagram above.)

   sentence = inFromUser.readLine();

The above line places a line typed by the user into the string sentence. The string sentence continues to gather characters until the
user ends the line by typing a carriage return. The line passes from standard input through the stream inFromUser into the
string sentence.

   outToServer.writeBytes(sentence + '\n');

The above line sends the string sentence augmented with a newline character into the outToServer stream. The augmented
sentence flows through the client's socket and into the TCP pipe. The client then waits to receive characters from the server.

    modifiedSentence = inFromServer.readLine();

When characters arrive from the server, they flow through the stream inFromServer and get placed into the string
modifiedSentence. Characters continue to accumulate in modifiedSentence until the line ends with a newline character.

       System.out.println("FROM SERVER: " + modifiedSentence);

The above line prints to the monitor the string modifiedSentence returned by the server.

       clientSocket.close();

This last line closes the socket and, hence, closes the TCP connection between the client and the server. It causes TCP in the
client to send a TCP message to TCP in the server (see Section 3.5).

Now let's take a look at the server program.


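import java.io.*;
import java.net.*;

class TCPServer {

    public static void main(String argv[]) throws Exception
    {
        String clientSentence;
        String capitalizedSentence;

        // Welcoming socket: waits for clients to knock on port 6789.
        ServerSocket welcomeSocket = new ServerSocket(6789);

        while(true) {

            // accept() returns a new socket dedicated to this client.
            Socket connectionSocket = welcomeSocket.accept();

            BufferedReader inFromClient =
              new BufferedReader(new
              InputStreamReader(connectionSocket.getInputStream()));

            DataOutputStream outToClient =
              new DataOutputStream(connectionSocket.getOutputStream());

            clientSentence = inFromClient.readLine();

            capitalizedSentence = clientSentence.toUpperCase() + '\n';

            // Send the capitalized line back to the client.
            outToClient.writeBytes(capitalizedSentence);
        }
    }
}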

TCPServer has many similarities with TCPClient. Let us now take a look at the lines in TCPServer.java. We will not comment
on the lines which are identical or similar to commands in TCPClient.java.

The first line in TCPServer that is substantially different from what we saw in TCPClient is:

    ServerSocket welcomeSocket = new ServerSocket(6789);

The above line creates the object welcomeSocket, which is of type ServerSocket. The welcomeSocket, as discussed above, is a
sort of door that waits for a knock from some client. The port number 6789 identifies the process at the server. The following
line is:

    Socket connectionSocket = welcomeSocket.accept();

The above line creates a new socket, called connectionSocket, when some client knocks on welcomeSocket. TCP then
establishes a direct virtual pipe between clientSocket at the client and connectionSocket at the server. The client and server can
then send bytes to each other over the pipe, and all bytes sent arrive at the other side in order. With connectionSocket
established, the server can continue to listen for other requests from other clients for the application using welcomeSocket.
(This version of the program serves one client at a time; it does not accept a new connection request until it has finished
with the current client. But it can be modified with threads to serve several clients at once.)
The program then creates several stream objects, analogous to the stream objects created in TCPClient. Now consider:

     capitalizedSentence = clientSentence.toUpperCase() + '\n';

This command is the heart of the application. It takes the line sent by the client, capitalizes it and appends a newline character. It uses the
method toUpperCase(). All the other commands in the program are peripheral; they are used for communication with the client.

That completes our analysis of the TCP program pair. Recall that TCP provides a reliable data transfer service. This implies, in
particular, that if one of the user's characters gets corrupted in the network, then the client host will retransmit the character,
thereby providing correct delivery of the data. These retransmissions are completely transparent to the application programs.
The DNS lookup is also transparent to the application programs.
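
The walkthrough above notes that the server could be modified with threads so that it keeps listening while a client is being served. One way this might look is sketched below. This is not the book's TCPServer; the class name ThreadedUpperCaseServer and the methods serve, start and ask are our own hypothetical names, and both ends run in a single process over the loopback interface so that the sketch can be tried on one machine.

```java
import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class ThreadedUpperCaseServer {
    // Serves one client: read a line, reply with the uppercase version.
    static void serve(Socket connectionSocket) {
        try (Socket s = connectionSocket;
             BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
             DataOutputStream out = new DataOutputStream(s.getOutputStream())) {
            String line = in.readLine();
            if (line != null) out.writeBytes(line.toUpperCase() + '\n');
        } catch (IOException e) {
            System.err.println("client failed: " + e);
        }
    }

    // Accept loop: every new connection socket is handed to its own thread,
    // so the welcoming socket goes straight back to listening.
    static Thread start(ServerSocket welcomeSocket) {
        Thread acceptor = new Thread(() -> {
            try {
                while (true) {
                    Socket connectionSocket = welcomeSocket.accept();
                    new Thread(() -> serve(connectionSocket)).start();
                }
            } catch (IOException closed) {
                // accept() throws once welcomeSocket is closed; the loop ends here
            }
        });
        acceptor.setDaemon(true);
        acceptor.start();
        return acceptor;
    }

    // Client side, as in TCPClient, but returning the reply instead of printing it.
    static String ask(int port, String sentence) throws IOException {
        try (Socket clientSocket = new Socket(InetAddress.getLoopbackAddress(), port);
             DataOutputStream out = new DataOutputStream(clientSocket.getOutputStream());
             BufferedReader in =
                 new BufferedReader(new InputStreamReader(clientSocket.getInputStream()))) {
            out.writeBytes(sentence + '\n');
            return in.readLine();
        }
    }

    public static void main(String[] args) throws Exception {
        try (ServerSocket welcomeSocket =
                 new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            start(welcomeSocket);
            System.out.println(ask(welcomeSocket.getLocalPort(), "hello"));
            System.out.println(ask(welcomeSocket.getLocalPort(), "world"));
        }
    }
}
```

One thread per connection is the simplest design; a server expecting many clients would more likely hand connections to a fixed pool of worker threads.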

To test the program pair, you install and compile TCPClient.java in one host and TCPServer.java in another host. Be sure to
include the proper host name of the server in TCPClient.java. You then execute TCPServer.class, the compiled server program,
in the server. This creates a process in the server which idles until it is contacted by some client. Then you execute
TCPClient.class, the compiled client program, in the client. This creates a process in the client and establishes a TCP connection
between the client and server processes. Finally, to use the application, you type a sentence followed by
a carriage return.

To develop your own client-server application, you can begin by slightly modifying the programs. For example, instead of
converting all the letters to uppercase, the server can count the number of times the letter "s" appears and return this number.
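
As a small illustration of the modification just suggested, the line of TCPServer that builds capitalizedSentence could be replaced by a call to a helper like the one sketched below. The class name CountLetterSketch and the method replyFor are hypothetical; the comparison is case-sensitive, which is an assumption on our part since the text does not say whether "S" should count.

```java
class CountLetterSketch {
    // Builds the reply line the server would send back: the number of times
    // `letter` occurs in the client's line, followed by a newline.
    static String replyFor(String clientSentence, char letter) {
        int count = 0;
        for (int i = 0; i < clientSentence.length(); i++) {
            if (clientSentence.charAt(i) == letter) count++;
        }
        return count + "\n";
    }

    public static void main(String[] args) {
        // In the server this string would be passed to outToClient.writeBytes().
        System.out.print(replyFor("socket streams", 's'));
    }
}
```

The client needs no change at all, because it simply prints whatever line the server returns.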


In this section we provided an introduction to TCP socket programming in Java. Several good online introductions to C socket
programming are available, including Kurose and KeshevRef. A comprehensive reference on C socket programming for Unix
hosts is Stevens.

[RFC 959] J.B. Postel and J.K. Reynolds, "File Transfer Protocol," RFC 959, October 1985.
[Stevens 1990] W.R. Stevens, Unix Network Programming, Prentice-Hall, Englewood Cliffs, N.J.
[Frost 1994] J. Frost, BSD Sockets: A Quick and Dirty Primer,
[Kurose 1996] J.F. Kurose, Unix Network Programming,

Copyright Keith W. Ross and James F. Kurose 1996-2000

           2.7 Socket Programming with UDP
We learned in the previous section that when two processes communicate over TCP, from the
perspective of the processes it is as if there is a pipe between the two processes. This pipe remains in
place until one of the two processes closes it. When one of the processes wants to send some bytes to the
other process, it simply inserts the bytes into the pipe. The sending process does not have to attach a
destination address to the bytes because the pipe is logically connected to the destination. Furthermore,
the pipe provides a reliable byte stream channel -- the sequence of bytes received by the receiving
process is exactly the sequence of bytes that the sender inserted into the pipe.

UDP also allows two (or more) processes running on different hosts to communicate. However, UDP
differs from TCP in many fundamental ways. First, UDP is a connectionless service -- there isn't an
initial handshaking phase during which a pipe is established between the two processes. Because UDP
doesn't have a pipe, when a process wants to send a batch of bytes to another process, the sending
process must attach the destination process's address to the batch of bytes. And this must be done
for each batch of bytes the sending process sends. Thus UDP is similar to a taxi service -- each time a
group of people get in a taxi, the group has to inform the driver of the destination address. As with TCP,
the destination address is a tuple consisting of the IP address of the destination host and the port number
of the destination process. We shall refer to the batch of information bytes along with the IP destination
address and port number as the "packet".

After having created a packet, the sending process pushes the packet into the network through a socket.
Continuing with our taxi analogy, at the other side of the socket, there is a taxi waiting for the packet.
The taxi then drives the packet in the direction of the packet's destination address. However, the taxi does
not guarantee that it will eventually get the datagram to its ultimate destination; the taxi could break
down. In other terms, UDP provides an unreliable transport service to its communicating processes -- it
makes no guarantees that a datagram will reach its ultimate destination.
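
Since a datagram may never arrive, a receiving process that waits forever can hang. A common precaution, sketched below under our own assumptions, is to put a timeout on the socket so that the wait gives up after a while. The class name UdpTimeoutSketch, the method receiveOrGiveUp and the 200 millisecond timeout are hypothetical choices for illustration; the programs in this section do not do this.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketTimeoutException;

class UdpTimeoutSketch {
    // Waits up to timeoutMillis for one datagram; returns its text,
    // or null if nothing arrived in time.
    static String receiveOrGiveUp(DatagramSocket socket, int timeoutMillis) throws IOException {
        byte[] receiveData = new byte[1024];
        DatagramPacket receivePacket = new DatagramPacket(receiveData, receiveData.length);
        socket.setSoTimeout(timeoutMillis);   // receive() now throws instead of blocking forever
        try {
            socket.receive(receivePacket);
            return new String(receivePacket.getData(), 0, receivePacket.getLength());
        } catch (SocketTimeoutException lost) {
            return null;                      // the "taxi" never showed up
        }
    }

    public static void main(String[] args) throws IOException {
        try (DatagramSocket socket = new DatagramSocket(0, InetAddress.getLoopbackAddress())) {
            // Nobody sends to this socket, so the call gives up after 200 ms.
            System.out.println("received: " + receiveOrGiveUp(socket, 200));
        }
    }
}
```

What to do after a timeout (resend, report an error, give up) is left to the application, because UDP itself offers no help.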

In this section we will illustrate UDP client-server programming by redeveloping the same application of
the previous section, but this time over UDP. We shall also see that the Java code for UDP is different
from the TCP code in many important ways. In particular, we shall see that there is (i) no initial
handshaking between the two processes, and therefore no need for a welcoming socket, (ii) no streams
are attached to the sockets, (iii) the sending host creates "packets" by attaching the IP destination
address and port number to each batch of bytes it sends, and (iv) the receiving process must unravel each
received packet to obtain the packet's information bytes. Recall once again our simple application:

    1. A client reads a line from its standard input (keyboard) and sends the line out its socket to the
       server.
    2. The server reads a line from its socket.
    3. The server converts the line to uppercase.
    4. The server sends the modified line out its socket to the client.
    5. The client reads the modified line through its socket and prints the line on its standard output
       (monitor).

Here is the code for the client side of the application:


import java.io.*;
import java.net.*;

class UDPClient {
    public static void main(String args[]) throws Exception
    {
           BufferedReader inFromUser =
             new BufferedReader(new InputStreamReader(System.in));

           DatagramSocket clientSocket = new DatagramSocket();

           InetAddress IPAddress = InetAddress.getByName("hostname");

           byte[] sendData = new byte[1024];
           byte[] receiveData = new byte[1024];

           String sentence = inFromUser.readLine();

           sendData = sentence.getBytes();

           // Each packet carries the server's IP address and port number.
           DatagramPacket sendPacket =
              new DatagramPacket(sendData, sendData.length, IPAddress, 9876);

           clientSocket.send(sendPacket);

           DatagramPacket receivePacket =
              new DatagramPacket(receiveData, receiveData.length);

           // Blocks until a packet arrives.
           clientSocket.receive(receivePacket);

           // getData() returns the whole buffer, including any unused trailing bytes.
           String modifiedSentence =
               new String(receivePacket.getData());

           System.out.println("FROM SERVER:" + modifiedSentence);

           clientSocket.close();
    }
}


The program UDPClient.java constructs one stream and one socket, as shown in Figure 2.7-1. The socket
is called clientSocket, and it is of type DatagramSocket. Note that UDP uses a different kind of socket
than TCP at the client. In particular, with UDP our client uses a DatagramSocket whereas with TCP our
client used a Socket. The stream inFromUser is an input stream to the program; it is attached to the
standard input, i.e., the keyboard. We had an equivalent stream in our TCP version of the program. When
the user types characters on the keyboard, the characters flow into the stream inFromUser. But in
contrast with TCP, there are no streams (input or output) attached to the socket. Instead of feeding bytes
to a stream attached to a Socket object, UDP will push individual packets through the DatagramSocket
object.

                            Figure 2.7-1: UDPClient.java has one stream and one socket.

Let's now take a look at the lines in the code that differ significantly from TCPClient.java.

          DatagramSocket clientSocket = new DatagramSocket();

The above line creates the object clientSocket of type DatagramSocket. In contrast with TCPClient.java,
this line does not initiate a TCP connection. In particular, the client host does not contact the server host
upon execution of this line. For this reason, the constructor DatagramSocket() does not take the server
hostname or port number as arguments. Using our door/pipe analogy, the execution of the above line
creates a door for the client process but does not create a pipe between the two processes.

          InetAddress IPAddress = InetAddress.getByName("hostname");

In order to send bytes to a destination process, we shall need to obtain the address of the process. Part of
this address is the IP address of the destination host. The above line invokes a DNS look up that
translates "hostname" (supplied in the code by the developer) to an IP address. DNS was also invoked by
the TCP version of the client, although it was done there implicitly rather than explicitly. The method
getByName() takes as an argument the hostname of the server and returns the IP address of this same
server. It places this address in the object IPAddress of type InetAddress.
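
The lookup step can be tried on its own. Below is a minimal sketch; the class name LookupSketch and the method resolve are hypothetical. When the argument is already a dotted-decimal address, getByName() only checks its format and performs no DNS query, which makes the second call in main() safe to run without network access. What "localhost" resolves to depends on the host's configuration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class LookupSketch {
    // Resolves a host name to the textual form of its IP address.
    static String resolve(String hostname) throws UnknownHostException {
        InetAddress ipAddress = InetAddress.getByName(hostname); // may trigger a DNS query
        return ipAddress.getHostAddress();
    }

    public static void main(String[] args) throws UnknownHostException {
        String name = args.length > 0 ? args[0] : "localhost";
        System.out.println(name + " -> " + resolve(name));
        System.out.println("127.0.0.1 -> " + resolve("127.0.0.1")); // literal: no lookup needed
    }
}
```

If the name cannot be resolved, getByName() throws an UnknownHostException, which the programs in this section simply let propagate out of main.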

          byte[] sendData = new byte[1024];
          byte[] receiveData = new byte[1024];

The byte arrays sendData and receiveData will hold the data the client sends and receives, respectively.

       sendData = sentence.getBytes();

The above line essentially performs a type conversion. It takes the string sentence and converts it to an
array of bytes, which is assigned to sendData.

       DatagramPacket sendPacket =
          new DatagramPacket(sendData, sendData.length, IPAddress, 9876);

The above line constructs the packet, sendPacket, that the client will pop into the network through
its socket. This packet includes the data that is contained in the packet, sendData, the length of this data,
the IP address of the server, and the port number of the application (which we have set to 9876). Note
that sendPacket is of type DatagramPacket.


       clientSocket.send(sendPacket);

In the above line the method send() of the object clientSocket takes the packet just constructed and pops
it into the network through clientSocket. Once again, note that UDP sends the line of characters in a
manner very different from TCP. TCP simply inserted the line into a stream, which had a logical direct
connection to the server; UDP creates a packet which includes the address of the server. After sending
the packet, the client then waits to receive a packet from the server.

      DatagramPacket receivePacket =
         new DatagramPacket(receiveData, receiveData.length);

In the above line, while waiting for the packet from the server, the client creates a place holder for the
packet, receivePacket, an object of type DatagramPacket.


      clientSocket.receive(receivePacket);

The client idles until it receives a packet; when it does receive a packet, it puts the packet in
receivePacket.

      String modifiedSentence =
         new String(receivePacket.getData());

The above line extracts the data from receivePacket and performs a type conversion, converting an array
of bytes into the string modifiedSentence.
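
One detail of this conversion is worth a closer look. getData() returns the entire 1024-byte buffer, not just the bytes that arrived, so the resulting string can carry unused trailing bytes. The sketch below shows a variant that converts only the received portion by consulting getLength(). The class name PacketTextSketch and the method textOf are hypothetical, the choice of UTF-8 is our assumption, and the arrival of a packet is simulated by copying a payload into the buffer.

```java
import java.net.DatagramPacket;
import java.nio.charset.StandardCharsets;

class PacketTextSketch {
    // Converts only the bytes that were actually received, not the whole buffer.
    static String textOf(DatagramPacket receivePacket) {
        return new String(receivePacket.getData(), receivePacket.getOffset(),
                          receivePacket.getLength(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] receiveData = new byte[1024];
        DatagramPacket receivePacket = new DatagramPacket(receiveData, receiveData.length);

        // Simulate what receive() does: fill in the payload and record its length.
        byte[] payload = "HELLO".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(payload, 0, receiveData, 0, payload.length);
        receivePacket.setLength(payload.length);

        String wholeBuffer = new String(receivePacket.getData()); // padding included
        System.out.println("whole buffer length:  " + wholeBuffer.length());
        System.out.println("received text length: " + textOf(receivePacket).length());
    }
}
```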

      System.out.println("FROM SERVER:" + modifiedSentence);

The above, which is also present in TCPClient, prints out the string modifiedSentence at the client's
monitor.

      clientSocket.close();

This last line closes the socket. Because UDP is connectionless, this line does not cause the client to send
a transport-layer message to the server (in contrast with TCPClient).

Let's now take a look at the server side of the application:


import java.io.*;
import java.net.*;

class UDPServer {

    public static void main(String args[]) throws Exception
    {
           DatagramSocket serverSocket = new DatagramSocket(9876);

           byte[] receiveData = new byte[1024];
           byte[] sendData = new byte[1024];

           while(true)
           {
                   DatagramPacket receivePacket =
                      new DatagramPacket(receiveData, receiveData.length);

                   // Blocks until a packet arrives from some client.
                   serverSocket.receive(receivePacket);

                   String sentence = new String(receivePacket.getData());

                   // The return address is taken from the received packet.
                   InetAddress IPAddress = receivePacket.getAddress();

                   int port = receivePacket.getPort();

                   String capitalizedSentence = sentence.toUpperCase();

                   sendData = capitalizedSentence.getBytes();

                   DatagramPacket sendPacket =
                      new DatagramPacket(sendData, sendData.length, IPAddress,
                                         port);

                   serverSocket.send(sendPacket);
           }
    }
}

The program UDPServer.java constructs one socket, as shown in Figure 2.7-2. The socket is called
serverSocket. It is an object of type DatagramSocket, as was the socket in the client side of the
application. Once again, no streams are attached to the socket.

                                      Figure 2.7-2: UDPServer.java has one socket.

Let's now take a look at the lines in the code that differ from TCPServer.java.

       DatagramSocket serverSocket = new DatagramSocket(9876);

The above line constructs the DatagramSocket serverSocket at port 9876. All data sent and received will
pass through this socket. Because UDP is connectionless, we do not have to spawn a new socket and
continue to listen for new connection requests, as done in TCPServer.java. If multiple clients access this
application, they will all send their packets into this single door, serverSocket.

       String sentence = new String(receivePacket.getData());

       InetAddress IPAddress = receivePacket.getAddress();

       int port = receivePacket.getPort();

The above three lines unravel the packet that arrives from the client. The first of the three lines extracts
the data from the packet and places the data in the String sentence; it has an analogous line in
UDPClient. The second line extracts the IP address; the third line extracts the client port number, which
is chosen by the client and is different from the server port number 9876. (We will discuss client port
numbers in some detail in the next chapter.) It is necessary for the server to obtain the address (IP address
and port number) of the client, so that it can send the capitalized sentence back to the client.
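
The way the reply finds its way back can be tried on a single machine. The sketch below, under our own assumptions, runs a one-shot server and a client in the same process over the loopback interface. The class name UdpLoopbackSketch and the methods serveOnce, ask and roundTrip are hypothetical, and the two-second timeout is an arbitrary safeguard. Note that the server learns where to send its reply only from the received packet itself.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

class UdpLoopbackSketch {
    // Server side: receive one datagram and send the uppercase text back to its sender.
    static void serveOnce(DatagramSocket serverSocket) throws IOException {
        byte[] receiveData = new byte[1024];
        DatagramPacket receivePacket = new DatagramPacket(receiveData, receiveData.length);
        serverSocket.receive(receivePacket);
        String sentence = new String(receivePacket.getData(), 0,
                                     receivePacket.getLength(), StandardCharsets.UTF_8);
        byte[] sendData = sentence.toUpperCase().getBytes(StandardCharsets.UTF_8);
        // The return address comes from the packet, not from any connection.
        serverSocket.send(new DatagramPacket(sendData, sendData.length,
                                             receivePacket.getAddress(),
                                             receivePacket.getPort()));
    }

    // Client side: send one line and wait (with a timeout) for the reply.
    static String ask(InetAddress serverAddress, int serverPort, String sentence)
            throws IOException {
        try (DatagramSocket clientSocket = new DatagramSocket()) {
            byte[] sendData = sentence.getBytes(StandardCharsets.UTF_8);
            clientSocket.send(new DatagramPacket(sendData, sendData.length,
                                                 serverAddress, serverPort));
            byte[] receiveData = new byte[1024];
            DatagramPacket receivePacket = new DatagramPacket(receiveData, receiveData.length);
            clientSocket.setSoTimeout(2000);
            clientSocket.receive(receivePacket);
            return new String(receivePacket.getData(), 0,
                              receivePacket.getLength(), StandardCharsets.UTF_8);
        }
    }

    // Runs server and client in one process and returns the client's reply.
    static String roundTrip(String sentence) throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (DatagramSocket serverSocket = new DatagramSocket(0, loopback)) {
            Thread server = new Thread(() -> {
                try {
                    serveOnce(serverSocket);
                } catch (IOException e) {
                    System.err.println("server failed: " + e);
                }
            });
            server.start();
            String reply = ask(loopback, serverSocket.getLocalPort(), sentence);
            server.join();
            return reply;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("FROM SERVER: " + roundTrip("hello over udp"));
    }
}
```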

That completes our analysis of the UDP program pair. To test the application, you install and compile UDPClient.java in one
host and UDPServer.java in another host. (Be sure to include the proper hostname
of the server in UDPClient.java.) Then execute the two programs on their respective hosts. Unlike with
TCP, you can first execute the client side and then the server side. This is because, when you execute the
client side, the client process does not attempt to initiate a connection with the server. Once you have
executed the client and server programs, you may use the application by typing a line at the client.


              2.8 Building a Simple Web Server
Now that we have studied HTTP in some detail and have learned how to write client-server applications
in Java, let us combine this new-found knowledge and build a simple Web server in Java. We will see
that the task is remarkably easy.

Our goal is to build a server that does the following:

     -   Handles only one HTTP request.
     -   Accepts and parses the HTTP request.
     -   Gets the requested file from the server's file system.
     -   Creates an HTTP response message consisting of the requested file preceded by header lines.
     -   Sends the response directly to the client.

Let's try to make the code as simple as possible in order to shed insight on the networking concerns. The
code that we present will be far from bullet proof! For example, let's not worry about handling
exceptions. And let's assume that the client requests an object that is in the server's file system.
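
Two of the goals listed above, parsing the request and creating the header lines, involve no networking at all and can be sketched as plain string handling. The class name HttpSketch and the methods fileNameFrom and headerFor are hypothetical, and the Content-Type values shown for .jpg and .gif files are our assumption about what such a server would send; treat this as an illustration rather than as the server's actual code.

```java
import java.util.StringTokenizer;

class HttpSketch {
    // Extracts the file name from a request line such as "GET /index.html HTTP/1.0".
    // Returns null when the method is not GET or the line has too few fields.
    static String fileNameFrom(String requestMessageLine) {
        StringTokenizer tokenizedLine = new StringTokenizer(requestMessageLine);
        if (tokenizedLine.countTokens() < 2 || !tokenizedLine.nextToken().equals("GET")) {
            return null;
        }
        String fileName = tokenizedLine.nextToken();
        return fileName.startsWith("/") ? fileName.substring(1) : fileName;
    }

    // Builds the status line and header lines that precede the file's bytes.
    static String headerFor(String fileName, int numOfBytes) {
        StringBuilder header = new StringBuilder("HTTP/1.0 200 Document Follows\r\n");
        if (fileName.endsWith(".jpg")) header.append("Content-Type: image/jpeg\r\n");
        if (fileName.endsWith(".gif")) header.append("Content-Type: image/gif\r\n");
        header.append("Content-Length: ").append(numOfBytes).append("\r\n");
        header.append("\r\n");   // a blank line separates the headers from the file
        return header.toString();
    }

    public static void main(String[] args) {
        String fileName = fileNameFrom("GET /logo.gif HTTP/1.0");
        System.out.println("requested file: " + fileName);
        System.out.print(headerFor(fileName, 1234));
    }
}
```

Keeping these steps separate from the socket code makes them easy to try without a browser or a network.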

Here is the code for a simple Web server:

import java.io.*;
import java.net.*;
import java.util.*;

class WebServer{

         public static void main(String argv[]) throws Exception {

                   String requestMessageLine;
                   String fileName;

                   ServerSocket listenSocket = new ServerSocket(6789);
                   Socket connectionSocket = listenSocket.accept();

                   BufferedReader inFromClient =
                     new BufferedReader(new
                     InputStreamReader(connectionSocket.getInputStream()));

                   DataOutputStream outToClient =
                     new DataOutputStream(connectionSocket.getOutputStream());

                   // First line of the HTTP request, e.g. "GET /file HTTP/1.0".
                   requestMessageLine = inFromClient.readLine();

                   StringTokenizer tokenizedLine =
                     new StringTokenizer(requestMessageLine);

                   if (tokenizedLine.nextToken().equals("GET")){

                       fileName = tokenizedLine.nextToken();

                       if (fileName.startsWith("/") == true )
                           fileName = fileName.substring(1);

                       File file = new File(fileName);
                       int numOfBytes = (int) file.length();

                       FileInputStream inFile = new FileInputStream (fileName);

                       // Read the whole file into memory.
                       byte[] fileInBytes = new byte[numOfBytes];
                       inFile.read(fileInBytes);

                       outToClient.writeBytes("HTTP/1.0 200 Document Follows\r\n");

                       if (fileName.endsWith(".jpg"))
                           outToClient.writeBytes("Content-Type: image/jpeg\r\n");
                       if (fileName.endsWith(".gif"))
                           outToClient.writeBytes("Content-Type: image/gif\r\n");

                       outToClient.writeBytes("Content-Length: " + numOfBytes +
                                              "\r\n");
                       // Blank line ends the header lines.
                       outToClient.writeBytes("\r\n");

                       outToClient.write(fileInBytes, 0, numOfBytes);

                       connectionSocket.close();
                   }

                   else System.out.println("Bad Request Message");

         }
}


Let us now take a look at the code. The first half of the program is almost identical to the TCPServer
program of Section 2.6. As with TCPServer, we import the java.io and java.net packages. In addition to
these two packages we also import the java.util package, which contains the StringTokenizer class, which
is used for parsing HTTP request messages. Looking now at the lines within the class WebServer, we
define two string objects:

      String requestMessageLine;
       String fileName;

The object requestMessageLine is a string that will contain the first line in the HTTP request message.
The object fileName is a string that will contain the file name of the requested file. The next set of
commands is identical to the corresponding set of commands in TCPServer:

       ServerSocket listenSocket = new ServerSocket(6789);
       Socket connectionSocket = listenSocket.accept();

    BufferedReader inFromClient =
      new BufferedReader(new InputStreamReader(
        connectionSocket.getInputStream()));
    DataOutputStream outToClient =
      new DataOutputStream(connectionSocket.getOutputStream());

Two socket-like objects are created. The first of these objects is listenSocket, which is of type
ServerSocket. The object listenSocket is created by the server program before receiving a request for a
TCP connection from a client. It listens at port 6789, and waits for a request from some client to establish
a TCP connection. When a request for a connection arrives, the accept() method of listenSocket creates a
new object, connectionSocket, of type Socket. Next two streams are created: the BufferedReader
inFromClient and the DataOutputStream outToClient. The HTTP request message comes from the
network, through connectionSocket and into inFromClient; the HTTP response message goes into
outToClient, through connectionSocket and into the network. The remaining portion of the code differs
significantly from TCPServer:

      requestMessageLine = inFromClient.readLine();

The above command reads the first line of the HTTP request message. This line is supposed to be of the form:

         GET file_name HTTP/1.0

Our server must now parse the line to extract the filename.

   StringTokenizer tokenizedLine = new StringTokenizer(requestMessageLine);

     if (tokenizedLine.nextToken().equals("GET")){

          fileName = tokenizedLine.nextToken();

          if (fileName.startsWith("/") == true )
             fileName = fileName.substring( 1 );

The above commands parse the first line of the request message to obtain the requested filename. The
object tokenizedLine can be thought of as the original request line with each of the "words" GET,
file_name and HTTP/1.0 placed in a separate placeholder called a token. The server knows from the
HTTP RFC that the file name for the requested file is contained in the token that follows the token
containing "GET". This file name is put in a string called fileName. The purpose of the last if statement
in the above code is to remove the leading slash ("/") that may precede the filename.
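
The parsing steps above can be tried in isolation. The following is a minimal sketch rather than part of the server itself; the class name RequestLineParser and the method name parseFileName are hypothetical, and the checks for missing tokens are an addition that the simple server omits.

```java
import java.util.StringTokenizer;

class RequestLineParser {

    // Returns the requested file name, or null if the line is not a usable GET request.
    static String parseFileName(String requestLine) {
        StringTokenizer tokenizedLine = new StringTokenizer(requestLine);
        // Guard against an empty line or a missing file name before reading tokens.
        if (!tokenizedLine.hasMoreTokens()
                || !tokenizedLine.nextToken().equals("GET")
                || !tokenizedLine.hasMoreTokens()) {
            return null;
        }
        String fileName = tokenizedLine.nextToken();
        // Strip the leading slash so the name is relative to the working directory.
        if (fileName.startsWith("/")) {
            fileName = fileName.substring(1);
        }
        return fileName;
    }

    public static void main(String[] args) {
        System.out.println(parseFileName("GET /somefile.html HTTP/1.0"));
        System.out.println(parseFileName("POST /form HTTP/1.0"));
    }
}
```

Returning null for anything other than a GET request corresponds to the "Bad Request Message" branch of the server.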

              FileInputStream inFile                         = new FileInputStream (fileName);

The above command attaches a stream, inFile, to the file fileName.

              File file = new File(fileName);
              int numOfBytes = (int) file.length();
              byte[] fileInBytes = new byte[numOfBytes];
              inFile.read(fileInBytes);

The above commands determine the size of the file and construct an array of bytes of that size. The name
of the array is fileInBytes. The last command reads from the stream inFile to the byte array fileInBytes.
The program must convert to bytes because the output stream outToClient may only be fed with bytes.
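
One caveat is worth a small sketch: a single call to read on an input stream is not guaranteed to fill the whole array, although for small local files it usually does. The version below is one possible alternative, not the code used in this section; it relies on DataInputStream.readFully, and the class name FileBytes and method name readAll are hypothetical.

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

class FileBytes {

    // Reads the whole file into a byte array sized from the file's length.
    static byte[] readAll(String fileName) throws IOException {
        File file = new File(fileName);
        int numOfBytes = (int) file.length();
        byte[] fileInBytes = new byte[numOfBytes];
        // readFully keeps reading until the array is full (or throws EOFException),
        // whereas a single InputStream.read call may return fewer bytes.
        try (DataInputStream inFile = new DataInputStream(new FileInputStream(file))) {
            inFile.readFully(fileInBytes);
        }
        return fileInBytes;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java FileBytes somefile.html
        byte[] bytes = readAll(args[0]);
        System.out.println(bytes.length + " bytes read");
    }
}
```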

Now we are ready to construct the HTTP response message. To this end we must first send the HTTP
response header lines into the DataOutputStream outToClient:

            outToClient.writeBytes("HTTP/1.0 200 Document Follows\r\n");

             if (fileName.endsWith(".jpg"))
                  outToClient.writeBytes("Content-Type: image/jpeg\r\n");
             if (fileName.endsWith(".gif"))
                  outToClient.writeBytes("Content-Type: image/gif\r\n");

             outToClient.writeBytes("Content-Length: " + numOfBytes + "\r\n");
             outToClient.writeBytes("\r\n");

The above set of commands is particularly interesting. These commands prepare the header lines for the
HTTP response message and send the header lines to the TCP send buffer. The first command sends the
mandatory status line: HTTP/1.0 200 Document Follows, followed by a carriage return and a line feed.
The next two command lines prepare a single content-type header line. If the server is to transfer a jpeg
image, then the server prepares the header line Content-Type: image/jpeg. If, on the other hand, the
server is to transfer a gif image, then the server prepares the header line Content-Type: image/gif. (In
this simple Web server, no content-type line is sent if the object is neither a gif nor a jpeg image.) The
server then prepares and sends a content-length header line and a mandatory blank line to precede the
object itself that is to be sent. We now must send the file fileName into the DataOutputStream outToClient.
Because outToClient works with bytes, we send the byte array fileInBytes into which the file was read earlier.

We can now send the requested file:

        outToClient.write(fileInBytes, 0, numOfBytes);

The above command sends the requested file, fileInBytes, to the TCP send buffer. TCP will concatenate
the file, fileInBytes, to the header lines just created, segment the concatenation if necessary, and send the
TCP segments to the client.
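
To see exactly which bytes precede the object, the header lines can be gathered into a single string. This is only an illustrative sketch: the class name ResponseHeader and the method name build are hypothetical, and the server in this section writes the lines directly to outToClient instead.

```java
class ResponseHeader {

    // Builds the status line, the optional Content-Type line, the Content-Length
    // line, and the blank line that separates the header lines from the object.
    static String build(String fileName, int numOfBytes) {
        StringBuilder header = new StringBuilder();
        header.append("HTTP/1.0 200 Document Follows\r\n");
        if (fileName.endsWith(".jpg"))
            header.append("Content-Type: image/jpeg\r\n");
        if (fileName.endsWith(".gif"))
            header.append("Content-Type: image/gif\r\n");
        header.append("Content-Length: ").append(numOfBytes).append("\r\n");
        header.append("\r\n");
        return header.toString();
    }

    public static void main(String[] args) {
        System.out.print(build("photo.jpg", 1024));
    }
}
```

Because each line ends with a carriage return and a line feed, the final "\r\n" on its own produces the mandatory blank line.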


After serving one request for one file, the server performs some housekeeping by closing the socket
connectionSocket:

        connectionSocket.close();

To test this Web server, install it on a host. Also put some files on the host. Then use a browser running on
any machine to request a file from the server. When you request a file, you will need to use the port
number that you include in the server code (e.g., 6789). So if the file is somefile.html and the port number
is 6789, then the browser should request a URL of the form http://<hostname>:6789/somefile.html, where
<hostname> stands for the name of the host on which your server is running.
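
A browser is the simplest test client, but the same request can also be sent from a small program. The sketch below is an assumption-laden illustration, not part of the text's server: the class name WebClient and the method name requestStatusLine are hypothetical, and the host, port and file name are supplied on the command line.

```java
import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;

class WebClient {

    // Sends an HTTP/1.0 GET request and returns the status line of the response.
    static String requestStatusLine(String host, int port, String fileName)
            throws IOException {
        try (Socket clientSocket = new Socket(host, port)) {
            DataOutputStream outToServer =
                new DataOutputStream(clientSocket.getOutputStream());
            BufferedReader inFromServer = new BufferedReader(
                new InputStreamReader(clientSocket.getInputStream()));
            // Request line followed by the blank line that ends the request.
            outToServer.writeBytes("GET /" + fileName + " HTTP/1.0\r\n\r\n");
            return inFromServer.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java WebClient <hostname> 6789 somefile.html
        System.out.println(
            requestStatusLine(args[0], Integer.parseInt(args[1]), args[2]));
    }
}
```

Keep in mind that the server presented in this section serves a single request and then exits, so it must be restarted before each run of the client.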

Copyright 1996-2000 Keith W. Ross and James F. Kurose

                                           2.10 Summary
In this chapter we've studied both the conceptual and the implementation aspects of network applications.
We've learned about the ubiquitous client-server paradigm adopted by Internet applications and seen its
use in the HTTP, FTP, SMTP, POP3 and DNS protocols. We've studied these important application-
level protocols, and their associated applications (the Web, file transfer, e-mail, and the domain name
system) in some detail. We've examined how the socket API can be used to build network applications.
We've not only walked through the use of sockets over connection-oriented (TCP) and connectionless
(UDP) end-to-end transport services, but also built a simple Web server using this API. The first step in
our top-down journey down the layered network architecture is complete.

At the very beginning of this book, in section 1.3, we gave a rather vague, bare bones definition of a
protocol as defining "the format and the order of messages exchanged between two communicating
entities, as well as the actions taken on the transmission and/or receipt of a message." The material in
this chapter, and in particular the detailed study of the HTTP, FTP, SMTP, POP3 and DNS protocols, has
now added considerable substance to this definition. Protocols are a key concept in networking; our study
of application protocols has now given us the opportunity to develop a more intuitive feel for what
protocols are all about.

In Section 2.1 we described the service models that TCP and UDP offer to applications that invoke them.
We took an even closer look at these service models when we developed simple applications that run
over TCP and UDP in Sections 2.6-2.7. However, we have said little about how TCP and UDP provide
these service models. For example, we have said very little about how TCP provides a reliable data
transfer service to its applications. In the next chapter we shall take a careful look at not only the what,
but also the how and why, of transport protocols.

Armed with knowledge of Internet application structure and application-level protocols, we're now
ready to head further down the protocol stack and examine the transport layer in Chapter 3.


      Homework Problems and Discussion Questions
                                                     Chapter 2
Review Questions

Section 2.1

1) List five non-proprietary Internet applications and the application-layer protocols that they use.

2) For a communication session between two hosts, which host is the client and which is the server?

3) What information is used by a process running on one host to identify a process running on
another host?

4) List the various network-application user agents that you use on a daily basis.

5) Referring to Figure 2.1-2, we see that none of the applications listed in the table require both "no data
loss" and "timing". Can you conceive of an application that requires no data loss and that is also highly
time sensitive?

Sections 2.2-2.5

6) What is meant by a handshaking protocol?

7) Why do HTTP, FTP, SMTP, POP3 and IMAP run on top of TCP rather than UDP?

8) Consider an e-commerce site that wants to keep a purchase record for each of its customers. Describe
how this can be done with HTTP authentication. Describe how this can be done with cookies.

9) What is the difference between persistent HTTP with pipelining and persistent HTTP without
pipelining? Which of the two is used by HTTP/1.1?

10) Telnet into a Web server and send a multi-line request message. Include in the request message the
If-Modified-Since: header line to force a response message with the 304 Not Modified
status code.

11) Why is it said that FTP sends control information "out of band"?

12) Suppose Alice with a Web-based e-mail account (such as Yahoo! mail or Hotmail) sends a message
to Bob, who accesses his mail from his mail server using POP3. Discuss how the message gets from
Alice's host to Bob's host. Be sure to list the series of application-layer protocols that are used to move
the message between the two hosts.

13) Suppose that you send an e-mail message whose only data is a Microsoft Excel attachment. What
might the header lines (including MIME lines) look like?

14) Print out the header of a message that you have recently received. How many Received: header
lines are there? Analyze each of the header lines in the message.

15) From a user's perspective, what is the difference between the download-and-delete mode and the
download-and-keep mode in POP3?

16) Redraw Figure 2.5-4 for when all queries from the local nameserver are iterative.

17) Each Internet host will have at least one local name server and one authoritative name server. What
role does each of these servers have in DNS?

18) Is it possible that an organization's Web server and mail server have exactly the same alias for a
hostname? What would be the "type" for the RR that contains the hostname of the mail server?

19) Use nslookup to find a Web server that has multiple IP addresses. Does the Web server of your
institution (school, company, etc.) have multiple IP addresses?

Sections 2.6-2.9

20) The UDP server described in Section 2.7 only needed one socket, whereas the TCP server described
in Section 2.6 needed two sockets. Why? If the TCP server were to support n simultaneous connections,
each from a different client host, how many sockets would the TCP server need?

21) For the client-server application over TCP described in Section 2.6, why must the server program be
executed before the client program? For the client-server application over UDP described in Section 2.7,
why may the client program be executed before the server program?


Problems

1) True or false.

        a) Suppose a user requests a Web page that consists of some text and two images. For this page
        the client will send one request message and receive three response messages.

        b) Two distinct Web pages can be sent over the same persistent connection.

        c) With non-persistent connections between browser and origin server, it is possible for a single
        TCP segment to carry two distinct HTTP request messages.

        d) The Date: header in the HTTP response message indicates when the object in the response
        was last modified.

2) Read RFC 959 for FTP. List all of the client commands that are supported by the RFC.

3) Read RFC 1700. What are the well-known port numbers for the "simple file transfer protocol" (sftp)?
For the "network news transfer protocol" (nntp)?

4) Suppose within your web browser you click on a link to obtain a web page. Suppose that the IP
address for the associated URL is not cached in your local host, so that a DNS look up is necessary to
obtain the IP address. Suppose that n DNS servers are visited before your host receives the IP address
from DNS; the successive visits incur an RTT of RTT1, ..., RTTn. Further suppose that the Web page
associated with the link contains exactly one object, a small amount of HTML text. Let RTT0 denote the
RTT between the local host and the server containing the object. Assuming zero transmission time of the
object, how much time elapses from when the client clicks on the link until the client receives the object?

5) Referring to question (4), suppose the page contains three very small objects. Neglecting transmission
times, how much time elapses with (a) nonpersistent HTTP with no parallel TCP connections, (b)
nonpersistent HTTP with parallel connections, (c) persistent HTTP with pipelining?

6) Two HTTP request methods are GET and POST. Are there any other methods in HTTP/1.0? If so,
what are they used for? How about HTTP/1.1 ?

7) Write a simple TCP program for a server that accepts lines of input from a client and prints the lines
onto the server's standard output. (You can do this by modifying the TCPServer program in the
text.) Compile and execute your program. On any other machine which contains a Web browser, set the
proxy server in the browser to the machine in which your server program is running; also configure the
port number appropriately. Your browser should now send its GET request messages to your server, and
your server should display the messages on its standard output. Use this platform to determine whether
your browser generates conditional GET messages for objects that are locally cached.

8) Read the POP3 RFC, RFC 1939. What is the purpose of the UIDL POP3 command?

9) Install and compile the Java programs TCPClient and UDPClient on one host and TCPServer and
UDPServer on another host.

        a) Suppose you run TCPClient before you run TCPServer. What happens? Why?
        b) Suppose you run UDPClient before you run UDPServer. What happens? Why?
        c) What happens if you use different port numbers for the client and server sides?

10) Rewrite TCPServer so that it can accept multiple connections. (Hint: You will need to use threads.)

Discussion Questions

1) What is a CGI script? Give examples of two popular Web sites that use CGI scripts. Explain how
these sites use CGI. Which languages are CGI scripts typically written in?

2) How can you configure your browser for local caching? What kinds of options do you have?

3) Can you configure your browser to open multiple simultaneous connections to a Web site? What are
the advantages and disadvantages of having a large number of simultaneous TCP connections?

4) Consider SMTP, POP3 and IMAP. Are these stateless protocols? Why or why not?

5) We have seen that Internet TCP sockets treat the data being sent as a byte stream but UDP sockets
recognize message boundaries. What is one advantage and one disadvantage of a byte-oriented API versus
having the API explicitly recognize and preserve application-defined message boundaries?

6) Would it be possible to implement a connection-oriented service (e.g., SMTP or HTTP) on top of a
connectionless service? What would be some of the difficulties involved in doing so, and how could
these be overcome?


                3.1 Transport-Layer Services and Principles
Residing between the application and network layers, the transport layer is in the core of the layered network
architecture. It has the critical role of providing communication services directly to the application processes
running on different hosts. In this chapter we'll examine the possible services provided by a transport layer
protocol and the principles underlying various approaches towards providing these services. We'll also look at
how these services are implemented and instantiated in existing protocols; as usual, particular emphasis will
be given to the Internet protocols, namely, TCP and UDP transport layer protocols.

In the previous two chapters we have touched on the role of the transport layer and the services that it
provides. Let's quickly review what we have already learned about the transport layer:

     q   A transport layer protocol provides for logical communication between application processes running
         on different hosts. By "logical" communication, we mean that although the communicating application
         processes are not physically connected to each other (indeed, they may be on different sides of the
         planet, connected via numerous routers and a wide range of link types), from the applications'
         viewpoint, it is as if they were physically connected. Application processes use the logical
         communication provided by the transport layer to send messages to each other, free from the worry of
         the details of the physical infrastructure used to carry these messages. Figure 3.1-1 illustrates the
         notion of logical communication.

     q   As shown in Figure 3.1-1, transport layer protocols are implemented in the end systems but not in
         network routers. Network routers only act on the network-layer fields of the layer-3 PDUs; they do
         not act on the transport-layer fields.

     q   At the sending side, the transport layer converts the messages it receives from a sending application
         process into 4-PDUs (that is, transport-layer protocol data units). This is done by (possibly) breaking
         the application messages into smaller chunks and adding a transport-layer header to each chunk to
         create 4-PDUs. The transport layer then passes the 4-PDUs to the network layer, where each 4-PDU is
         encapsulated into a 3-PDU. At the receiving side, the transport layer receives the 4-PDUs from the
         network layer, removes the transport header from the 4-PDUs, reassembles the messages and passes
         them to a receiving application process.

     q   A computer network can make more than one transport layer protocol available to network
         applications. For example, the Internet has two protocols -- TCP and UDP. Each of these protocols
         provides a different set of transport layer services to the invoking application.

     q   All transport layer protocols provide an application multiplexing/demultiplexing service. This service
         will be described in detail in the next section. As discussed in Section 2.1, in addition to
         multiplexing/demultiplexing service, a transport protocol can possibly provide other services to
         invoking applications, including reliable data transfer, bandwidth guarantees, and delay guarantees.
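
The conversion of application messages into 4-PDUs described in the list above can be sketched in a few lines of Java. This is a toy illustration under stated assumptions: the class names Segmenter and Pdu are hypothetical, the "header" is reduced to a single sequence-number field, and the receiving side assumes that all PDUs arrive complete and in order.

```java
import java.util.ArrayList;
import java.util.List;

class Segmenter {

    // A hypothetical 4-PDU: a one-field header (sequence number) plus a chunk of data.
    static class Pdu {
        final int seq;
        final String chunk;
        Pdu(int seq, String chunk) { this.seq = seq; this.chunk = chunk; }
    }

    // Sending side: break the message into chunks and add a header to each chunk.
    static List<Pdu> segment(String message, int maxChunk) {
        List<Pdu> pdus = new ArrayList<>();
        for (int i = 0, seq = 0; i < message.length(); i += maxChunk, seq++) {
            int end = Math.min(i + maxChunk, message.length());
            pdus.add(new Pdu(seq, message.substring(i, end)));
        }
        return pdus;
    }

    // Receiving side: remove the headers and reassemble the message.
    // Assumes the PDUs arrive complete and in order.
    static String reassemble(List<Pdu> pdus) {
        StringBuilder message = new StringBuilder();
        for (Pdu pdu : pdus) message.append(pdu.chunk);
        return message.toString();
    }

    public static void main(String[] args) {
        List<Pdu> pdus = segment("GET /somefile.html HTTP/1.0", 8);
        System.out.println(pdus.size() + " PDUs");
        System.out.println(reassemble(pdus));
    }
}
```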

Figure 3.1-1: The transport layer provides logical rather than physical communication between applications.

3.1.1 Relationship between Transport and Network Layers
From the perspective of network applications, the transport layer is the underlying communication
infrastructure. Of course, there is more to the communication infrastructure than just the transport layer. For
example, the network layer lies just below the transport layer in the protocol stack. Whereas a transport layer
protocol provides logical communication between processes running on different hosts, a network layer
protocol provides logical communication between hosts. This distinction is subtle but important. Let's
examine this distinction with the aid of a household analogy.

Consider two houses, one on the East Coast and the other on the West Coast, with each house being home to
a dozen kids. The kids in the East Coast household are cousins with the kids in the West Coast household.
The kids in the two households love to write each other -- each kid writes each cousin every week, with each
letter delivered by the traditional postal service in a separate envelope. Thus, each household sends 144 letters
to the other household every week. (These kids would save a lot of money if they had e-mail!). In each of the
households there is one kid -- Alice in the West Coast house and Bob in the East Coast house -- responsible
for mail collection and mail distribution. Each week Alice visits all her brothers and sisters, collects the mail,
and gives the mail to a postal-service mail person who makes daily visits to the house. When letters arrive at
the West Coast house, Alice also has the job of distributing the mail to her brothers and sisters. Bob has a
similar job on the East Coast.

In this example, the postal service provides logical communication between the two houses -- the postal
service moves mail from house to house, not from person to person. On the other hand, Alice and Bob
provide logical communication between the cousins -- Alice and Bob pick up mail from, and deliver mail to,
their brothers and sisters. Note that, from the cousins' perspective, Alice and Bob are the mail service, even
though Alice and Bob are only a part (the end system part) of the end-to-end delivery process. This household
example serves as a nice analogy for explaining how the transport layer relates to the network layer:

     q   hosts (also called end systems) = houses
     q   processes = cousins
     q   application messages = letters in envelopes
     q   network layer protocol = postal service (including mail persons)
     q   transport layer protocol = Alice and Bob

Continuing with this analogy, observe that Alice and Bob do all their work within their respective homes;
they are not involved, for example, in sorting mail in any intermediate mail center or in moving mail from one
mail center to another. Similarly, transport layer protocols live in the end systems. Within an end system, a
transport protocol moves messages from application processes to the network edge (i.e., the network layer)
and vice versa; but it doesn't have any say about how the messages are moved within the network core. In
fact, as illustrated in Figure 3.1-1, intermediate routers neither act on, nor recognize, any information that the
transport layer may have appended to the application messages.

Continuing with our family saga, suppose now that when Alice and Bob go on vacation, another cousin pair --
say, Susan and Harvey -- substitute for them and provide the household-internal collection and delivery of
mail. Unfortunately for the two families, Susan and Harvey do not do the collection and delivery in exactly
the same way as Alice and Bob. Being younger kids, Susan and Harvey pick up and drop off the mail less
frequently and occasionally lose letters (which are sometimes chewed up by the family dog). Thus, the cousin-
pair Susan and Harvey do not provide the same set of services (i.e., the same service model) as Alice and Bob.
In an analogous manner, a computer network may make available multiple transport protocols, with each
protocol offering a different service model to applications.

The possible services that Alice and Bob can provide are clearly constrained by the possible services that the
postal service provides. For example, if the postal service doesn't provide a maximum bound on how long it
can take to deliver mail between the two houses (e.g., three days), then there is no way that Alice and Bob can
guarantee a maximum delay for mail delivery between any of the cousin pairs. In a similar manner, the
services that a transport protocol can provide are often constrained by the service model of the underlying
network-layer protocol. If the network layer protocol cannot provide delay or bandwidth guarantees for
4-PDUs sent between hosts, then the transport layer protocol cannot provide delay or bandwidth guarantees for
the messages sent between processes.

Nevertheless, certain services can be offered by a transport protocol even when the underlying network
protocol doesn't offer the corresponding service at the network layer. For example, as we'll see in this chapter,
a transport protocol can offer reliable data transfer service to an application even when the underlying
network protocol is unreliable, that is, even when the network protocol loses, garbles and duplicates packets.
As another example (which we'll explore in Chapter 7 when we discuss network security), a transport protocol
can use encryption to guarantee that application messages are not read by intruders, even when the network
layer cannot guarantee the secrecy of 4-PDUs.

3.1.2 Overview of the Transport Layer in the Internet
The Internet, and more generally a TCP/IP network, makes available two distinct transport-layer protocols to
the application layer. One of these protocols is UDP (User Datagram Protocol), which provides an unreliable,
connectionless service to the invoking application. The second of these protocols is TCP (Transmission
Control Protocol), which provides a reliable, connection-oriented service to the invoking application. When
designing a network application, the application developer must specify one of these two transport protocols.
As we saw in Sections 2.6 and 2.7, the application developer selects between UDP and TCP when creating
sockets.

To simplify terminology, when in an Internet context, we refer to the 4-PDU as a segment. We mention,
however, that the Internet literature (e.g., the RFCs) also refers to the PDU for TCP as a segment but often
refers to the PDU for UDP as a datagram. But this same Internet literature also uses the terminology datagram
for the network-layer PDU! For an introductory book on computer networking such as this one, we believe
that it is less confusing to refer to both TCP and UDP PDUs as segments, and reserve the terminology
datagram for the network-layer PDU.

Before proceeding with our brief introduction of UDP and TCP, it is useful to say a few words about the
Internet's network layer. (The network layer is examined in detail in Chapter 4.) The Internet's network-layer
protocol has a name -- IP, which abbreviates "Internet Protocol". IP provides logical communication between
hosts. The IP service model is a best-effort delivery service. This means that IP makes its "best effort" to
deliver segments between communicating hosts, but it makes no guarantees. In particular, it does not
guarantee segment delivery, it does not guarantee orderly delivery of segments, and it does not guarantee the
integrity of the data in the segments. For these reasons, IP is said to be an unreliable service. We also
mention here that every host has an IP address. We will examine IP addressing in detail in Chapter 4; for this
chapter we need only keep in mind that each host has a unique IP address.

Having taken a glimpse at the IP service model, let's now summarize the service model of UDP and TCP. The
most fundamental responsibility of UDP and TCP is to extend IP's delivery service between two end
systems to a delivery service between two processes running on the end systems. Extending host-to-host
delivery to process-to-process delivery is called application multiplexing and demultiplexing. We'll discuss
application multiplexing and demultiplexing in the next section. UDP and TCP also provide integrity
checking by including error detection fields in their headers. These two minimal transport-layer services --
process-to-process data delivery and error checking -- are the only two services that UDP provides! In
particular, like IP, UDP is an unreliable service -- it does not guarantee that data sent by one process will
arrive intact at the destination process. UDP is discussed in detail in Section 3.3.
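
To give a feel for what an error detection field can look like, here is a minimal sketch of a 16-bit one's-complement checksum in the style of RFC 1071. It is an illustration only: the class name Checksum16 is hypothetical, and the sketch covers just an array of data bytes, so it should not be read as the exact computation performed by UDP or TCP.

```java
class Checksum16 {

    // 16-bit one's-complement sum of the data, complemented.
    static int compute(byte[] data) {
        int sum = 0;
        for (int i = 0; i < data.length; i += 2) {
            int high = data[i] & 0xFF;
            // Pad with a zero byte if the data has an odd length.
            int low = (i + 1 < data.length) ? (data[i + 1] & 0xFF) : 0;
            sum += (high << 8) | low;
            // Fold any carry out of the low 16 bits back into the sum.
            sum = (sum & 0xFFFF) + (sum >>> 16);
        }
        return ~sum & 0xFFFF;
    }

    // The receiver recomputes the checksum and compares it with the value
    // carried in the header; a mismatch means the data was corrupted.
    static boolean verify(byte[] data, int checksum) {
        return compute(data) == checksum;
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes();
        int checksum = compute(data);
        System.out.println(verify(data, checksum));
        data[0] ^= 0x01; // flip one bit "in transit"
        System.out.println(verify(data, checksum));
    }
}
```

Note that detecting an error is all such a field does; what to do about a corrupted segment is a separate matter.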

TCP, on the other hand, offers several additional services to applications. First and foremost, it provides
reliable data transfer. Using flow control, sequence numbers, acknowledgments and timers (techniques we'll
explore in detail in this chapter), TCP's guarantee of reliable data transfer ensures that data is delivered from
sending process to receiving process, correctly and in order. TCP thus converts IP's unreliable service
between end systems into a reliable data transport service between processes. TCP also uses congestion
control. Congestion control is not so much a service provided to the invoking application as it is a service for
the Internet as a whole -- a service for the general good. In loose terms, TCP congestion control prevents any
one TCP connection from swamping the links and switches between communicating hosts with an excessive
amount of traffic. In principle, TCP permits TCP connections traversing a congested network link to equally
share that link's bandwidth. This is done by regulating the rate at which the sending-side TCPs can send
traffic into the network. UDP traffic, on the other hand, is unregulated. An application using UDP transport
can send traffic at any rate it pleases, for as long as it pleases.

A protocol that provides reliable data transfer and congestion control is necessarily complex. We will need
several sections to cover the principles of reliable data transfer and congestion control, and additional sections
to cover the TCP protocol itself. These topics are investigated in Sections 3.4 through 3.8. The approach taken
in this chapter is to alternate between the basic principles and the TCP protocol. For example, we first
discuss reliable data transfer in a general setting and then discuss how TCP specifically provides reliable data
transfer. Similarly, we first discuss congestion control in a general setting and then discuss how TCP uses
congestion control. But before getting into all this good stuff, let's first look at application multiplexing and
demultiplexing in the next section.


              3.2 Multiplexing and Demultiplexing Applications
In this section we discuss the multiplexing/demultiplexing of messages by the transport layer from/to the
application layer. In order to keep the discussion concrete, we'll discuss this basic service in the context of the
Internet's transport layer. We emphasize, however, that multiplexing and demultiplexing services are provided in
almost every protocol architecture ever designed. Moreover, multiplexing/demultiplexing are generic services,
often found in several layers within a given protocol stack.

Although the multiplexing/demultiplexing service is not among the most exciting services that can be provided by
a transport layer protocol, it is an absolutely critical one. To understand why it is so critical, consider the fact that IP
delivers data between two end systems, with each end system identified with a unique IP address. IP does not
deliver data between the application processes that run on these end systems. Extending host-to-host delivery to a
process-to-process delivery is the job of the transport layer's application multiplexing and demultiplexing service.

At the destination host, the transport layer receives segments (i.e., transport-layer PDUs) from the network layer
just below. The transport layer has the responsibility of delivering the data in these segments to the appropriate
application process running in the host. Let's take a look at an example. Suppose you are sitting in front of your
computer, and you are downloading Web pages while running one FTP session and two Telnet sessions. You
therefore have four network application processes running -- two Telnet processes, one FTP process, and one
HTTP process. When the transport layer in your computer receives data from the network layer below, it needs to
direct the received data to one of these four processes. Let's now examine how this is done.

Each transport-layer segment has a field that contains information that is used to determine the process to which
the segment's data is to be delivered. At the receiving end, the transport layer can then examine this field to
determine the receiving process, and then direct the segment to that process. This job of delivering the data in a
transport-layer segment to the correct application process is called demultiplexing. The job of gathering data at the
source host from different application processes, enveloping the data with header information (which will later be
used in demultiplexing) to create segments, and passing the segments to the network layer is called multiplexing.

To illustrate the demultiplexing job, let us return to the household saga in the previous section. Each of the kids is
distinguished by his or her name. When Bob receives a batch of mail from the mail person, he performs a
demultiplexing operation by observing to whom the letters are addressed and then hand delivering the mail to his
brothers and sisters. Alice performs a multiplexing operation when she collects letters from her brothers and sisters
and gives the collected mail to the mail person.

UDP and TCP perform the demultiplexing and multiplexing jobs by including two special fields in the segment
headers: the source port number field and the destination port number field. These two fields are illustrated in
Figure 3.2-1. When taken together, the fields uniquely identify an application process running on the destination
host. (The UDP and TCP segments have other fields as well, and they will be addressed in the subsequent sections
of this chapter.)

                  Figure 3.2-1: Source and destination port number fields in a transport layer segment.

The notion of port numbers was briefly introduced in Sections 2.6-2.7, in which we studied application
development and socket programming. The port number is a 16-bit number, ranging from 0 to 65535. The
port numbers ranging from 0 - 1023 are called well-known port numbers and are restricted, which means that
they are reserved for use by well-known application protocols such as HTTP and FTP. HTTP uses port number 80;
FTP uses port number 21. The list of well-known port numbers is given in [RFC 1700]. When we develop a new
application (such as one of the applications developed in Sections 2.6-2.8), we must assign the application a port number.

Given that each type of application running on an end system has a unique port number, then why is it that the
transport-layer segment has fields for two port numbers, a source port number and a destination port number? The
answer is simple: An end system may be running two processes of the same type at the same time, and thus the port
number of an application may not suffice to identify a specific process. For example, many Web servers spawn a
new HTTP process for every request they receive; whenever such a Web server is servicing more than one request
(which is by no means uncommon), the server is running more than one process with port number 80. Therefore, in
order to uniquely identify processes, a second port number is needed.

How is this second port number created? Which port number goes in the source port number field of a segment?
Which goes in the destination port number field of a segment? To answer these questions, recall from Section 2.1
that networked applications are organized around the client-server model. Typically, the host that initiates the
application is the client and the other host is the server. Now let's look at a specific example. Suppose the
application has port number 23 (the port number for Telnet). Consider a transport layer segment leaving the client
(i.e., the host that initiated the Telnet session) and destined for the server. What are the destination and source port
numbers for this segment? For the destination port number, this segment has the port number of the application,
namely, 23. For the source port number, the client uses a number that is not being used by any of its other
processes. (This can be done automatically by the transport-layer software running on the client and is
transparent to the application developer. An application can also explicitly request a specific port number using the
bind() system call on many Unix-like systems.) Let's say the client chooses port number x. Then each segment
that this process sends will have its source port number set to x and destination port number set to 23. When the
segment arrives at the server, the source and destination port numbers in the segment enable the server host to pass
the data of the segment to the correct application process: the destination port number 23 identifies a Telnet process
and the source port number x identifies the specific Telnet process.
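The automatic choice of a source port can be observed with Python's standard socket module. The following is a minimal sketch, assuming a loopback-only setup; the helper name pick_ephemeral_port is hypothetical. Binding to port 0 asks the operating system to choose an unused port number, which plays the role of x in the discussion above.

```python
import socket

def pick_ephemeral_port(host="127.0.0.1"):
    """Ask the OS for an unused TCP port and return it (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Port 0 means "let the transport-layer software choose".
        s.bind((host, 0))
        # getsockname() returns the (address, port) pair actually assigned.
        return s.getsockname()[1]

x = pick_ephemeral_port()
print("source port chosen by the OS:", x)
```

Passing a specific nonzero port to bind() instead corresponds to the explicit request mentioned in the text.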

The situation is reversed for the segments flowing from the server to the client. The source port number is now the
application port number, 23. The destination port number is now x. (The same x used for the source port number
for the segments sent from client to server.) When a segment arrives at the client, the source and destination port
numbers in the segment will enable the client host to pass the data of the segment to the correct application
process, which is identified by the port number pair. Figure 3.2-2 summarizes the discussion:

                 Figure 3.2-2: Use of source and destination port numbers in a client-server application

Now you may be wondering, what happens if two different clients establish a Telnet session to a server, and each
of these clients chooses the same source port number x? How will the server be able to demultiplex the segments
when the two sessions have exactly the same port number pair? The answer to this question is that the server also uses
the IP addresses in the IP datagrams carrying these segments. (We will discuss IP datagrams and addressing in
detail in Chapter 4.) The situation is illustrated in Figure 3.2-3, in which host A initiates two Telnet sessions to host
C, and host B initiates one Telnet session to host C. Hosts A, B and C each have their own unique IP address; host
A has IP address A, host B has IP address B, and host C has IP address C. Host A assigns two different source port
(SP) numbers (x and y) to the two Telnet connections emanating from host A. But because host B is choosing
source port numbers independently from A, it can also assign SP=x to its Telnet connection. Nevertheless, host C is
still able to demultiplex the two connections since the two connections have different source IP addresses. In
summary, we see that when a destination host receives data from the network layer, the triplet [source IP address,
source port number, destination port number] is used to forward the data to the appropriate process.

     Figure 3.2-3: Two clients, using the same port numbers to communicate with the same server application
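The triplet-based lookup can be sketched as a small table. This is only an illustration under stated assumptions: the process names, the placeholder addresses "A" and "B", and the port values chosen for x and y are made up, and a real transport layer keeps this state inside the operating system rather than in a dictionary.

```python
# Maps the triplet [source IP, source port, destination port] to a process.
# All concrete values below are hypothetical.
demux_table = {}

def register(src_ip, src_port, dst_port, process):
    """Record which process serves the connection identified by the triplet."""
    demux_table[(src_ip, src_port, dst_port)] = process

def demultiplex(src_ip, src_port, dst_port):
    """Return the process that should receive the segment, or None."""
    return demux_table.get((src_ip, src_port, dst_port))

X, Y = 1025, 1026                  # stand-ins for source ports x and y
register("A", X, 23, "telnet-1")   # first session from host A
register("A", Y, 23, "telnet-2")   # second session from host A
register("B", X, 23, "telnet-3")   # host B happens to pick x as well
```

Because the source IP address is part of the key, the two sessions that share source port x still map to different processes.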

Now that we understand how the transport layer can multiplex and demultiplex messages from/to network
applications, let's move on and discuss one of the Internet's transport protocols, UDP. In the next section we shall
see that UDP adds little more to the network layer protocol than multiplexing/demultiplexing service.


[RFC 1700] J. Reynolds and J. Postel, "Assigned Numbers," RFC 1700, October 1994.


          3.3 Connectionless Transport: UDP
The Internet makes two transport protocols available to its applications, UDP and TCP. In this section
we take a close look at UDP: how it works and what it does. The reader is encouraged to refer back to
material in Section 2.1, which includes an overview of the UDP service model, and to the material in
Section 2.7, which discusses socket programming over UDP.

To motivate our discussion about UDP, suppose you were interested in designing a no-frills, bare-bones
transport protocol. How might you go about doing this? You might first consider using a vacuous
transport protocol. In particular, on the sending side, you might consider taking the messages from the
application process and passing them directly to the network layer; and on the receiving side, you might
consider taking the messages arriving from the network layer and passing them directly to the
application process. But as we learned in the previous section, we have to do a little more than nothing.
At the very least, the transport layer must provide a multiplexing/demultiplexing service in order to pass
data between the network layer and the correct application.

UDP, defined in [RFC 768], does just about as little as a transport protocol can. Aside from the
multiplexing/demultiplexing function and some light error checking, it adds nothing to IP. In fact, if the
application developer chooses UDP instead of TCP, then the application is talking almost directly with
IP. UDP takes messages from the application process, attaches source and destination port number fields for
the multiplexing/demultiplexing service, adds two other fields of minor importance, and passes the
resulting "segment" to the network layer. The network layer encapsulates the segment into an IP
datagram and then makes a best-effort attempt to deliver the segment to the receiving host. If the segment
arrives at the receiving host, UDP uses the port numbers and the IP source and destination addresses to
deliver the data in the segment to the correct application process. Note that with UDP there is no
handshaking between sending and receiving transport-layer entities before sending a segment. For this
reason, UDP is said to be connectionless.
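The absence of handshaking is visible at the socket interface. The sketch below, which assumes both endpoints run on the loopback address of one machine, sends a datagram with Python's standard socket module without any prior connection setup; the variable names are illustrative.

```python
import socket

# Receiver: bind a UDP socket to a loopback port chosen by the OS.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)  # avoid blocking forever if the datagram is lost
server_addr = receiver.getsockname()

# Sender: no connect(), no handshake; the first call already carries data.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"hello", server_addr)

# recvfrom() also reports the source IP address and source port,
# which is where a reply would be addressed.
data, (src_ip, src_port) = receiver.recvfrom(2048)
sender_port = sender.getsockname()[1]

sender.close()
receiver.close()
```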

DNS is an example of an application-layer protocol that uses UDP. When the DNS application (see
Section 2.5) in a host wants to make a query, it constructs a DNS query message and passes the message
to a UDP socket (see Section 2.7). Without performing any handshaking, UDP adds header fields to the
message and passes the resulting segment to the network layer. The network layer encapsulates the UDP
segment into a datagram and sends the datagram to a name server. The DNS application at the querying
host then waits for a reply to its query. If it doesn't receive a reply (possibly because UDP lost the query
or the reply), it either tries sending the query to another nameserver, or it informs the invoking
application that it can't get a reply. We mention that the DNS specification permits DNS to run over TCP
instead of UDP; in practice, however, DNS almost always runs over UDP.
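The wait-then-try-another-server behavior can be sketched as follows. This is not a DNS client: the function name query_with_fallback, the timeout value, and the treatment of the message as opaque bytes are all assumptions made for illustration.

```python
import socket

def query_with_fallback(message, servers, timeout=0.5):
    """Send `message` over UDP to each server in turn; return the first reply.

    Returns None if no server answers within `timeout` seconds, so the
    caller can report the failure. (Hypothetical helper, not a DNS client.)
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for server in servers:
            try:
                sock.sendto(message, server)     # no handshake first
                reply, _ = sock.recvfrom(2048)   # wait for a reply
                return reply
            except OSError:
                # Timeout or a reported network error: try the next server.
                continue
    return None
```

A more careful version would also check that a reply really comes from the server just queried; the sketch omits this to stay short.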

Now you might be wondering why an application developer would ever choose to build an application
over UDP rather than over TCP. Isn't TCP always preferable to UDP since TCP provides a reliable data
transfer service and UDP does not? The answer is no, as many applications are better suited for UDP for
the following reasons:

     •   No connection establishment. As we shall discuss in Section 3.5, TCP uses a three-way
         handshake before it starts to transfer data. UDP just blasts away without any formal preliminaries.
         Thus UDP does not introduce any delay to establish a connection. This is probably the principal
         reason why DNS runs over UDP rather than TCP -- DNS would be much slower if it ran over
         TCP. HTTP uses TCP rather than UDP, since reliability is critical for Web pages with text. But,
         as we briefly discussed in Section 2.2, the TCP connection establishment delay in HTTP is an
         important contributor to the "world wide wait".
     •   No connection state. TCP maintains connection state in the end systems. This connection state
         includes receive and send buffers, congestion control parameters, and sequence and
         acknowledgment number parameters. We will see in Section 3.5 that this state information is
         needed to implement TCP's reliable data transfer service and to provide congestion control. UDP,
         on the other hand, does not maintain connection state and does not track any of these parameters.
         For this reason, a server devoted to a particular application can typically support many more
         active clients when the application runs over UDP rather than TCP.
     •   Small segment header overhead. The TCP segment has 20 bytes of header overhead in every
         segment, whereas UDP only has 8 bytes of overhead.
     •   Unregulated send rate. TCP has a congestion control mechanism that throttles the sender when
         one or more links between sender and receiver become excessively congested. This throttling can
         have a severe impact on real-time applications, which can tolerate some packet loss but require a
         minimum send rate. On the other hand, the speed at which UDP sends data is only constrained by
         the rate at which the application generates data, the capabilities of the source (CPU, clock rate,
         etc.) and the access bandwidth to the Internet. We should keep in mind, however, that the
         receiving host does not necessarily receive all the data - when the network is congested, a
         significant fraction of the UDP-transmitted data could be lost due to router buffer overflow. Thus,
         the receive rate is limited by network congestion even if the sending rate is not constrained.

Figure 3.1-1 lists popular Internet applications and the transport protocols that they use. As we expect,
e-mail, remote terminal access, the Web and file transfer run over TCP -- these applications need the
reliable data transfer service of TCP. Nevertheless, many important applications run over UDP rather
than TCP. UDP is used for RIP routing table updates (see Chapter 4 on the network layer), because the
updates are sent periodically, so that lost updates are replaced by more up-to-date updates. UDP is used
to carry network management (SNMP - see Chapter 8) data. UDP is preferred to TCP in this case, since
network management must often run when the network is in a stressed state - precisely when reliable,
congestion-controlled data transfer is difficult to achieve. Also, as we mentioned earlier, DNS runs over
UDP, thereby avoiding TCP's connection establishment delays.

           Application                Application-layer protocol    Underlying Transport Protocol
           electronic mail            SMTP                          TCP
           remote terminal access     Telnet                        TCP
           Web                        HTTP                          TCP
           file transfer              FTP                           TCP
           remote file server         NFS                           typically UDP
           streaming multimedia       proprietary                   typically UDP
           Internet telephony         proprietary                   typically UDP
           Network Management         SNMP                          typically UDP
           Routing Protocol           RIP                           typically UDP
           Name Translation           DNS                           typically UDP
              Figure 3.1-1: Popular Internet applications and their underlying transport protocols.

As shown in Figure 3.1-1, UDP is also commonly used today with multimedia applications, such as
Internet phone, real-time video conferencing, and streaming of stored audio and video. We shall take a
close look at these applications in Chapter 6. We just mention now that all of these applications can
tolerate a small fraction of packet loss, so that reliable data transfer is not absolutely critical for the
success of the application. Furthermore, interactive real-time applications, such as Internet phone and
video conferencing, react very poorly to TCP's congestion control. For these reasons, developers of
multimedia applications often choose to run the applications over UDP instead of TCP. Finally, because
TCP cannot be employed with multicast, multicast applications run over UDP.

Although commonly done today, running multimedia applications over UDP is controversial to say the
least. As we mentioned above, UDP lacks any form of congestion control. But congestion control is
needed to prevent the network from entering a congested state in which very little useful work is done. If
everyone were to start streaming high bit-rate video without using any congestion control, there would be
so much packet overflow at routers that no one would see anything. Thus, the lack of congestion control
in UDP is a potentially serious problem. Many researchers have proposed new mechanisms to force all
sources, including UDP sources, to perform adaptive congestion control [Mahdavi].

Before discussing the UDP segment structure, we mention that it is possible for an application to have
reliable data transfer when using UDP. This can be done if reliability is built into the application itself
(e.g., by adding acknowledgement and retransmission mechanisms, such as those we shall study in the
next section). But this is a non-trivial task that would keep an application developer busy debugging for a
long time. Nevertheless, building reliability directly into the application allows the application to "have
its cake and eat it too" -- that is, application processes can communicate reliably without being
constrained by the transmission rate constraints imposed by TCP's congestion control mechanism.
Application-level reliability also allows an application to tailor its own application-specific form of error
control. An interactive real-time application may occasionally choose to retransmit a lost message, provided that
round trip network delays are small enough to avoid adding significant playout delays [Papadopoulos 1996].

Many of today's proprietary streaming applications do just this -- they run over UDP, but they have built
acknowledgements and retransmissions into the application in order to reduce packet loss.

UDP Segment Structure
The UDP segment structure, shown in Figure 3.3-2, is defined in [RFC 768].

                                            Figure 3.3-2: UDP segment structure

The application data occupies the data field of the UDP datagram. For example, for DNS, the data field
contains either a query message or a response message. For a streaming audio application, audio samples
fill the data field. The UDP header has only four fields, each consisting of two bytes. As discussed in the
previous section, the port numbers allow the destination host to pass the application data to the correct
process running on that host (i.e., perform the demultiplexing function). The checksum is used by the
receiving host to check if errors have been introduced into the segment during the course of its
transmission from source to destination. (Basic principles of error detection are described in Section
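As a rough illustration of the segment layout, the sketch below packs the four 16-bit header fields specified in RFC 768 (source port, destination port, length, checksum; 8 bytes in total, matching the header overhead mentioned earlier) in front of the application data. The function names are hypothetical, and the checksum is left at zero, which RFC 768 uses to mean that no checksum was computed.

```python
import struct

def build_udp_segment(src_port, dst_port, payload, checksum=0):
    """Pack the four 16-bit header fields followed by the application data."""
    length = 8 + len(payload)  # header (8 bytes) plus data, in bytes
    # "!HHHH": network byte order, four unsigned 16-bit integers.
    header = struct.pack("!HHHH", src_port, dst_port, length, checksum)
    return header + payload

def parse_udp_header(segment):
    """Return (src_port, dst_port, length, checksum) from the first 8 bytes."""
    return struct.unpack("!HHHH", segment[:8])

segment = build_udp_segment(1025, 53, b"abc")
print(parse_udp_header(segment))
```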

UDP Checksum
The UDP checksum provides for error detection. UDP at the sender side performs the one's complement
of the sum of all the 16-bit words in the segment. This result is put in the checksum field of the UDP
segment. (In truth, the checksum is also calculated over a few of the fields in the IP header in addition to
the UDP segment. But we ignore this detail in order to see the forest through the trees.) When the
segment arrives (if it arrives!) at the receiving host, all 16-bit words are added together, including the
checksum. If this sum equals 1111111111111111, then the segment has no detected errors. If one of the
bits is a zero, then we know that errors have been introduced into the segment.

Here we give a simple example of the checksum calculation. You can find details about efficient
implementation of the calculation in the [RFC 1071]. As an example, suppose that we have the following
three 16-bit words:


The sum of the first two of these 16-bit words is:


Adding the third word to the above sum gives


The 1's complement is obtained by converting all the 0s to 1s and converting all the 1s to 0s. Thus the 1's
complement of the sum 1100101011001010 is 0011010100110101, which becomes the checksum. At the
receiver, all four 16-bit words are added, including the checksum. If no errors are introduced into the
segment, then clearly the sum at the receiver will be 1111111111111111. If one of the bits is a zero, then
we know that errors have been introduced into the segment. In Section 5.1, we'll see that the Internet
checksum is not foolproof -- even if the sum equals 1111111111111111, it is still possible that there are
undetected errors in the segment. For this reason, a number of protocols use more sophisticated error
detection techniques than simple checksumming.
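The calculation can be checked with a few lines of Python. This is a minimal sketch: the three input words are an assumption, chosen only because they add up to the sum 1100101011001010 quoted above, and the function names are hypothetical. Wrapping a carry out of the top bit back into the low bit follows RFC 1071; no such carry occurs for these particular words.

```python
def ones_complement_sum(words):
    """Add 16-bit words, wrapping any carry out of bit 15 back into bit 0."""
    total = 0
    for w in words:
        total += w
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return total

def internet_checksum(words):
    """One's complement of the one's complement sum of the 16-bit words."""
    return ~ones_complement_sum(words) & 0xFFFF

def verify(words, checksum):
    """Receiver check: the sum over all words plus checksum must be all 1s."""
    return ones_complement_sum(list(words) + [checksum]) == 0xFFFF

# Illustrative words (an assumption) that give the sum quoted in the text.
words = [0b0110011001100110, 0b0101010101010101, 0b0000111100001111]
print(format(ones_complement_sum(words), "016b"))  # 1100101011001010
print(format(internet_checksum(words), "016b"))    # 0011010100110101
```

Flipping a single bit in any word makes verify() return False, while errors that cancel out in the sum would go undetected, which is the weakness noted above.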

You may wonder why UDP provides a checksum in the first place, as many link-layer protocols
(including the popular Ethernet protocol) also provide error checking? The reason is that there is no
guarantee that all the links between source and destination provide error checking -- one of the links may
use a protocol that does not provide error checking. Because IP is supposed to run over just about any
layer-2 protocol, it is useful for the transport layer to provide error checking as a safety measure.
Although UDP provides error checking, it does not do anything to recover from an error. Some
implementations of UDP simply discard the damaged segment; others pass the damaged segment to the
application with a warning.

That wraps up our discussion of UDP. We will soon see that TCP offers reliable data transfer to its
applications as well as other services that UDP doesn't offer. Naturally, TCP is also more complex than
UDP. Before discussing TCP, however, it will be useful to step back and first discuss the underlying
principles of reliable data transfer, which we do in the subsequent section. We will then explore TCP in
Section 3.5, where we will see that TCP has its foundations in these underlying principles.


[Papadopoulos 1996] C. Papadopoulos and G. Parulkar, "Retransmission-Based Error Control for
Continuous Media Applications," Proceedings of the 6th International Workshop on Network and
Operating System Support for Digital Audio and Video (NOSSDAV), April 1996.
[Mahdavi] J. Mahdavi and S. Floyd, "The TCP-Friendly Website,"
[RFC 768] J. Postel, "User Datagram Protocol," RFC 768, August 1980.
[RFC 1071] R. Braden, D. Borman, C. Partridge, "Computing The Internet Checksum," RFC 1071,
September 1988.


                   3.4 Principles of Reliable Data Transfer
In this section, we consider the problem of reliable data transfer in a general context. This is appropriate since the problem of
implementing reliable data transfer occurs not only at the transport layer, but also at the link layer and the application layer.
The general problem is thus of central importance to networking. Indeed, if one had to identify a "top-10" list of fundamentally
important problems in all of networking, this would be a top candidate to lead that list. In the next section we will examine TCP
and show, in particular, that TCP exploits many of the principles that we are about to describe.

                               Figure 3.4-1: Reliable data transfer: service model and service implementation.

Figure 3.4-1 illustrates the framework for our study of reliable data transfer. The service abstraction provided to the upper layer
entities is that of a reliable channel through which data can be transferred. With a reliable channel, no transferred data bits are
corrupted (flipped from 0 to 1, or vice versa) or lost, and all are delivered in the order in which they were sent. This is precisely the
service model offered by TCP to the Internet applications that invoke it.

It is the responsibility of a reliable data transfer protocol to implement this service abstraction. This task is made difficult by the
fact that the layer below the reliable data transfer protocol may be unreliable. For example, TCP is a reliable data transfer protocol that
is implemented on top of an unreliable (IP) end-to-end network layer. More generally, the layer beneath the two reliably
communicating endpoints might consist of a single physical link (e.g., as in the case of a link-level data transfer protocol) or a
global internetwork (e.g., as in the case of a transport-level protocol). For our purposes, however, we can view this lower layer
simply as an unreliable point-to-point channel.

In this section, we will incrementally develop the sender and receiver sides of a reliable data transfer protocol, considering
increasingly complex models of the underlying channel. Figure 3.4-1(b) illustrates the interfaces for our data transfer protocol. The
sending side of the data transfer protocol will be invoked from above by a call to rdt_send(). It will be passed the data to be
delivered to the upper-layer at the receiving side. (Here rdt stands for "reliable data transfer" protocol and _send indicates that
the sending side of rdt is being called. The first step in developing any protocol is to choose a good name!) On the receiving side,
rdt_rcv() will be called when a packet arrives from the receiving side of the channel. When the rdt protocol wants to deliver
data to the upper-layer, it will do so by calling deliver_data(). In the following we use the terminology "packet" rather than
"segment" for the protocol data unit. Because the theory developed in this section applies to computer networks in general, and not
just to the Internet transport layer, the generic term "packet" is perhaps more appropriate here.

In this section we consider only the case of unidirectional data transfer, i.e., data transfer from the sending to receiving side. The
case of reliable bidirectional (i.e., full duplex) data transfer is conceptually no more difficult but considerably more tedious.
Although we consider only unidirectional data transfer, it is important to note that the sending and receiving sides of our protocol
will nonetheless need to transmit packets in both directions, as indicated in Figure 3.4-1. We will see shortly that in addition to
exchanging packets containing the data to be transferred, the sending and receiving sides of rdt will also need to exchange
control packets back and forth. Both the send and receive sides of rdt send packets to the other side by a call to udt_send()
(unreliable data transfer).

3.4.1 Building a Reliable Data Transfer Protocol
Reliable Data Transfer over a Perfectly Reliable Channel: rdt1.0

We first consider the simplest case in which the underlying channel is completely reliable. The protocol itself, which we will call
rdt1.0, is trivial. The finite state machine (FSM) definitions for the rdt1.0 sender and receiver are shown in Figure 3.4-2.
The sender and receiver FSMs in Figure 3.4-2 each have just one state. The arrows in the FSM description indicate the transition of
the protocol from one state to another. (Since each FSM in Figure 3.4-2 has just one state, a transition is necessarily from the one
state back to itself; we'll see more complicated state diagrams shortly.) The event causing the transition is shown above the
horizontal line labeling the transition, and the action(s) taken when the event occurs are shown below the horizontal line.

The sending side of rdt simply accepts data from the upper-layer via the rdt_send(data) event, puts the data into a packet
(via the action make_pkt(packet,data)) and sends the packet into the channel. In practice, the rdt_send(data) event
would result from a procedure call (e.g., to rdt_send()) by the upper layer application.

On the receiving side, rdt receives a packet from the underlying channel via the rdt_rcv(packet) event, removes the data
from the packet (via the action extract(packet,data)) and passes the data up to the upper-layer. In practice, the
rdt_rcv(packet) event would result from a procedure call (e.g., to rdt_rcv()) from the lower layer protocol.

In this simple protocol, there is no difference between a unit of data and a packet. Also, all packet flow is from the sender to
receiver - with a perfectly reliable channel there is no need for the receiver side to provide any feedback to the sender since nothing
can go wrong!

                                        Figure 3.4-2: rdt1.0 - a protocol for a completely reliable channel
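The one-state sender and receiver described above can be sketched in a few lines. This is an illustration only: the channel is modelled as a Python list that never loses, reorders, or corrupts packets, the packet is a dictionary, and make_pkt() and extract() return their results instead of filling in an output argument as the two-argument forms in the text suggest.

```python
# Sketch of rdt1.0 over a perfectly reliable channel (modelled as a list).
channel = []      # packets in flight, delivered in order, never corrupted
delivered = []    # data handed to the upper layer at the receiver

def make_pkt(data):
    """Wrap the data in a packet (here simply a dictionary)."""
    return {"data": data}

def extract(packet):
    """Remove the data from a packet."""
    return packet["data"]

def udt_send(packet):
    """Put the packet into the underlying channel."""
    channel.append(packet)

def deliver_data(data):
    """Pass the data up to the upper layer."""
    delivered.append(data)

def rdt_send(data):
    """Sender's only transition: make a packet and send it."""
    udt_send(make_pkt(data))

def rdt_rcv(packet):
    """Receiver's only transition: extract the data and deliver it."""
    deliver_data(extract(packet))

for message in ["a", "b", "c"]:
    rdt_send(message)
while channel:
    rdt_rcv(channel.pop(0))
```

No feedback path appears anywhere in the sketch, which reflects the observation that a perfectly reliable channel needs none.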

Reliable Data Transfer over a Channel with Bit Errors: rdt2.0

A more realistic model of the underlying channel is one in which bits in a packet may be corrupted. Such bit errors typically occur
in the physical components of a network as a packet is transmitted, propagates, or is buffered. We'll continue to assume for the
moment that all transmitted packets are received (although their bits may be corrupted) in the order in which they were sent.

Before developing a protocol for reliably communicating over such a channel, first consider how people might deal with such a
situation. Consider how you yourself might dictate a long message over the phone. In a typical scenario, the message taker might
say "OK" after each sentence has been heard, understood, and recorded. If the message taker hears a garbled sentence, you're asked
to repeat the garbled sentence. This message dictation protocol uses both positive acknowledgements ("OK") and negative
acknowledgements ("Please repeat that"). These control messages allow the receiver to let the sender know what has been
received correctly, and what has been received in error and thus requires repeating. In a computer network setting, reliable data
transfer protocols based on such retransmission are known as ARQ (Automatic Repeat reQuest) protocols.

Fundamentally, two additional protocol capabilities are required in ARQ protocols to handle the presence of bit errors:

    •   Error detection. First, a mechanism is needed to allow the receiver to detect when bit errors have occurred. Recall from
        Section 3.3 that the UDP transport protocol uses the Internet checksum field for exactly this purpose. In Chapter 5 we'll
        examine error detection and correction techniques in greater detail; these techniques allow the receiver to detect, and
        possibly correct, packet bit errors. For now, we need only know that these techniques require that extra bits (beyond the bits
        of original data to be transferred) be sent from the sender to receiver; these bits will be gathered into the packet checksum
        field of the rdt2.0 data packet.
    •   Receiver feedback. Since the sender and receiver are typically executing on different end systems, possibly separated by
        thousands of miles, the only way for the sender to learn of the receiver's view of the world (in this case, whether or not a
        packet was received correctly) is for the receiver to provide explicit feedback to the sender. The positive (ACK) and
        negative acknowledgement (NAK) replies in the message dictation scenario are an example of such feedback. Our rdt2.0
        protocol will similarly send ACK and NAK packets back from the receiver to the sender. In principle, these packets need
        only be one bit long, e.g., a zero value could indicate a NAK and a value of 1 could indicate an ACK.
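As a rough illustration of the error detection capability, the sketch below computes a 16-bit one's-complement checksum in the style of the Internet checksum and uses it to test a packet for corruption. The text does not say which checksum rdt2.0 uses, so the choice of algorithm, the packet layout (a two-byte checksum in front of the data) and the helper names make_pkt() and corrupt() as written here are assumptions made for the example.

```python
def internet_checksum(data: bytes) -> int:
    """One's-complement of the one's-complement sum of the 16-bit words in data."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # wrap any carry around
    return ~total & 0xFFFF


def make_pkt(data: bytes) -> bytes:
    # Assumed layout: 2-byte checksum followed by the data
    return internet_checksum(data).to_bytes(2, "big") + data


def corrupt(packet: bytes) -> bool:
    # The receiver recomputes the checksum over the data and compares it
    # with the value carried in the packet
    stored = int.from_bytes(packet[:2], "big")
    return internet_checksum(packet[2:]) != stored


pkt = make_pkt(b"hello world")
damaged = bytearray(pkt)
damaged[5] ^= 0x01  # flip a single bit "in the channel"
print(corrupt(pkt), corrupt(bytes(damaged)))
```

The extra two bytes are exactly the "extra bits" the text refers to: they carry no user data and exist only so that the receiver can detect bit errors.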

Figure 3.4-3 shows the FSM representation of rdt2.0, a data transfer protocol employing error detection, positive
acknowledgements (ACKs), and negative acknowledgements (NAKs).

The send side of rdt2.0 has two states. In one state, the send-side protocol is waiting for data to be passed down from the upper
layer. In the other state, the sender protocol is waiting for an ACK or a NAK packet from the receiver. If an ACK packet is received
(the notation rdt_rcv(rcvpkt) && isACK(rcvpkt) in Figure 3.4-3 corresponds to this event), the sender knows the most
recently transmitted packet has been received correctly and thus the protocol returns to the state of waiting for data from the upper
layer. If a NAK is received, the protocol retransmits the last packet and waits for an ACK or NAK to be returned by the receiver in
response to the retransmitted data packet. It is important to note that when the sender is in the wait-for-ACK-or-NAK state, it
cannot get more data from the upper layer; that will only happen after the sender receives an ACK and leaves this state. Thus, the
sender will not send a new piece of data until it is sure that the receiver has correctly received the current packet. Because of this
behavior, protocols such as rdt2.0 are known as stop-and-wait protocols.

The receiver-side FSM for rdt2.0 still has a single state. On packet arrival, the receiver replies with either an ACK or a NAK,
depending on whether or not the received packet is corrupted. In Figure 3.4-3, the notation rdt_rcv(rcvpkt) &&
corrupt(rcvpkt) corresponds to the event where a packet is received and is found to be in error.

                                        Figure 3.4-3: rdt2.0 - a protocol for a channel with bit-errors
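A minimal stop-and-wait sketch of rdt2.0 follows. It assumes, as rdt2.0 itself does, that ACK and NAK replies are never corrupted. The toy one-byte checksum, the flaky_channel() helper and the function name rdt20_send() are hypothetical; the one-bit ACK and NAK values follow the suggestion in the text.

```python
ACK, NAK = 1, 0  # one-bit feedback values


def checksum(data: bytes) -> int:
    # Toy stand-in for a real checksum (assumption): sum of bytes modulo 256
    return sum(data) % 256


def make_pkt(data: bytes) -> bytes:
    return bytes([checksum(data)]) + data


def corrupt(pkt: bytes) -> bool:
    return len(pkt) < 1 or checksum(pkt[1:]) != pkt[0]


def flaky_channel(pattern):
    """Channel that flips one bit of the i-th packet whenever pattern[i] is True."""
    it = iter(pattern)

    def channel(pkt: bytes) -> bytes:
        if next(it, False):
            return pkt[:-1] + bytes([pkt[-1] ^ 0x01])
        return pkt

    return channel


class Rdt20Receiver:
    def __init__(self):
        self.delivered = []

    def rdt_rcv(self, pkt: bytes) -> int:
        if corrupt(pkt):
            return NAK                   # ask for a retransmission
        self.delivered.append(pkt[1:])   # pass the data up
        return ACK


def rdt20_send(data: bytes, receiver, channel) -> int:
    """Send one piece of data; return how many transmissions were needed."""
    sndpkt = make_pkt(data)
    transmissions = 0
    while True:  # the sender stays in the wait-for-ACK-or-NAK state
        transmissions += 1
        if receiver.rdt_rcv(channel(sndpkt)) == ACK:
            return transmissions  # only now may new data be accepted


receiver = Rdt20Receiver()
print(rdt20_send(b"hi", receiver, flaky_channel([True, True, False])))
```

The function does not return until an ACK arrives, which is the stop-and-wait property: no new data is accepted from above while a packet is outstanding.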

Protocol rdt2.0 may look as if it works but unfortunately has a fatal flaw. In particular, we haven't accounted for the possibility
that the ACK or NAK packet could be corrupted! (Before proceeding, you should think about how this problem may be fixed.)
Unfortunately, our slight oversight is not as innocuous as it may seem. Minimally, we will need to add checksum bits to ACK/NAK
packets in order to detect such errors. The more difficult question is how the protocol should recover from errors in ACK or NAK
packets. The difficulty here is that if an ACK or NAK is corrupted, the sender has no way of knowing whether or not the receiver
has correctly received the last piece of transmitted data.

Consider three possibilities for handling corrupted ACKs or NAKs:

    •   For the first possibility, consider what a human might do in the message dictation scenario. If the speaker didn't understand
        the "OK" or "Please repeat that" reply from the receiver, the speaker would probably ask "What did you say?" (thus
        introducing a new type of sender-to-receiver packet to our protocol). The receiver would then repeat the reply. But what if
        the speaker's "What did you say?" is corrupted? The receiver, having no idea whether the garbled sentence was part of the
        dictation or a request to repeat the last reply, would probably then respond with "What did you say?" And then, of course,
        that response might be garbled. Clearly, we're heading down a difficult path.
    •   A second alternative is to add enough checksum bits to allow the sender to not only detect, but recover from, bit errors. This
        solves the immediate problem for a channel which can corrupt packets but not lose them.
    •   A third approach is for the sender to simply resend the current data packet when it receives a garbled ACK or NAK packet.
        This, however, introduces duplicate packets into the sender-to-receiver channel. The fundamental difficulty with duplicate
        packets is that the receiver doesn't know whether the ACK or NAK it last sent was received correctly at the sender. Thus, it
        cannot know a priori whether an arriving packet contains new data or is a retransmission!

A simple solution to this new problem (and one adopted in almost all existing data transfer protocols including TCP) is to add a
new field to the data packet and have the sender number its data packets by putting a sequence number into this field. The receiver
then need only check this sequence number to determine whether or not the received packet is a retransmission. For this simple
case of a stop-and-wait protocol, a 1-bit sequence number will suffice, since it will allow the receiver to know whether the sender is
resending the previously transmitted packet (the sequence number of the received packet is the same as that of the most
recently received packet) or a new packet (the sequence number changes, i.e., moves "forward" in modulo 2 arithmetic). Since we
are currently assuming a channel that does not lose packets, ACK and NAK packets do not themselves need to indicate the
sequence number of the packet they are ACKing or NAKing, since the sender knows that a received ACK or NAK packet (whether
garbled or not) was generated in response to its most recently transmitted data packet.

                                                          Figure 3.4-4: rdt2.1 sender

                                                          Figure 3.4-5: rdt2.1 receiver

Figures 3.4-4 and 3.4-5 show the FSM description for rdt2.1, our fixed version of rdt2.0. The rdt2.1 sender and receiver
FSMs each now have twice as many states as before. This is because the protocol state must now reflect whether the packet
currently being sent (by the sender) or expected (at the receiver) should have a sequence number of 0 or 1. Note that the actions in
those states where a 0-numbered packet is being sent or expected are mirror images of those where a 1-numbered packet is being
sent or expected; the only differences have to do with the handling of the sequence number.
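The role of the 1-bit sequence number can be seen in the following sketch of rdt2.1, in which both data packets and ACK/NAK replies may be garbled. It is a simplified model rather than a transcription of the FSMs in Figures 3.4-4 and 3.4-5: the toy checksum, the seal() and garbler() helpers, the byte layout of the packets and the function name rdt21_send() are all assumptions made for illustration.

```python
def checksum(body: bytes) -> int:
    return sum(body) % 256  # toy checksum (assumption)


def seal(body: bytes) -> bytes:
    # Prepend a checksum so that either side can detect bit errors
    return bytes([checksum(body)]) + body


def corrupt(pkt: bytes) -> bool:
    return len(pkt) < 2 or checksum(pkt[1:]) != pkt[0]


def garbler(pattern):
    """Channel that flips one bit of the i-th packet whenever pattern[i] is True."""
    it = iter(pattern)

    def channel(pkt: bytes) -> bytes:
        if next(it, False):
            return pkt[:-1] + bytes([pkt[-1] ^ 0x01])
        return pkt

    return channel


class Rdt21Receiver:
    def __init__(self):
        self.expected = 0  # sequence number (0 or 1) of the next new packet
        self.delivered = []

    def rdt_rcv(self, pkt: bytes) -> bytes:
        if corrupt(pkt):
            return seal(b"NAK")
        seq, data = pkt[1], pkt[2:]
        if seq == self.expected:  # new data: deliver it and flip the bit
            self.delivered.append(data)
            self.expected ^= 1
        # A duplicate is acknowledged again but not delivered a second time
        return seal(b"ACK")


def rdt21_send(messages, receiver, data_channel, feedback_channel) -> int:
    seq, transmissions = 0, 0
    for data in messages:
        sndpkt = seal(bytes([seq]) + data)
        while True:
            transmissions += 1
            reply = feedback_channel(receiver.rdt_rcv(data_channel(sndpkt)))
            if not corrupt(reply) and reply[1:] == b"ACK":
                break  # anything else (NAK or garbled reply) means resend
        seq ^= 1  # next packet carries the other sequence number
    return transmissions
```

When an ACK is garbled, the sender retransmits a packet the receiver already has; the receiver recognizes the repeated sequence number, acknowledges it again, and does not deliver the data twice.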

Protocol rdt2.1 uses both positive and negative acknowledgements from the receiver to the sender. A negative acknowledgement is
sent whenever a corrupted packet, or an out of order packet, is received. We can accomplish the same effect as a NAK if instead of
sending a NAK, we instead send an ACK for the last correctly received packet. A sender that receives two ACKs for the same
packet (i.e., receives duplicate ACKs) knows that the receiver did not correctly receive the packet following the packet that is
being ACKed twice. Many TCP implementations use the receipt of so-called "triple duplicate ACKs" (three ACK packets all
ACK'ing the same packet) to trigger a retransmission at the sender. Our NAK-free reliable data transfer protocol for a channel with
bit errors is rdt2.2, shown in Figures 3.4-6 and 3.4-7.

                                                     Figure 3.4-6: rdt2.2 sender

                                                    Figure 3.4-7: rdt2.2 receiver

Reliable Data Transfer over a Lossy Channel with Bit Errors: rdt3.0

Suppose now that in addition to corrupting bits, the underlying channel can lose packets as well, a not uncommon event in today's
computer networks (including the Internet). Two additional concerns must now be addressed by the protocol: how to detect packet
loss and what to do when this occurs. The use of checksumming, sequence numbers, ACK packets, and retransmissions - the
techniques already developed in rdt2.2 - will allow us to answer the latter concern. Handling the first concern will require
adding a new protocol mechanism.

There are many possible approaches towards dealing with packet loss (several more of which are explored in the exercises at the
end of the chapter). Here, we'll put the burden of detecting and recovering from lost packets on the sender. Suppose that the sender
transmits a data packet and either that packet, or the receiver's ACK of that packet, gets lost. In either case, no reply is forthcoming
at the sender from the receiver. If the sender is willing to wait long enough so that it is certain that a packet has been lost, it can
simply retransmit the data packet. You should convince yourself that this protocol does indeed work.

But how long must the sender wait to be certain that something has been lost? It must clearly wait at least as long as a round trip
delay between the sender and receiver (which may include buffering at intermediate routers or gateways) plus whatever amount of
time is needed to process a packet at the receiver. In many networks, this worst case maximum delay is very difficult to even
estimate, much less know with certainty. Moreover, the protocol should ideally recover from packet loss as soon as possible;
waiting for a worst case delay could mean a long wait until error recovery is initiated. The approach thus adopted in practice is for
the sender to "judiciously" choose a time value such that packet loss is likely, although not guaranteed, to have happened. If an ACK
is not received within this time, the packet is retransmitted. Note that if a packet experiences a particularly large delay, the sender
may retransmit the packet even though neither the data packet nor its ACK has been lost. This introduces the possibility of
duplicate data packets in the sender-to-receiver channel. Happily, protocol rdt2.2 already has enough functionality (i.e.,
sequence numbers) to handle the case of duplicate packets.

From the sender's viewpoint, retransmission is a panacea. The sender does not know whether a data packet was lost, an ACK was
lost, or if the packet or ACK was simply overly delayed. In all cases, the action is the same: retransmit. In order to implement a
time-based retransmission mechanism, a countdown timer will be needed that can interrupt the sender after a given amount of
time has expired. The sender will thus need to be able to (i) start the timer each time a packet (either a first time packet, or a
retransmission) is sent, (ii) respond to a timer interrupt (taking appropriate actions), and (iii) stop the timer.

The existence of sender-generated duplicate packets and packet (data, ACK) loss also complicates the sender's processing of any
ACK packet it receives. If an ACK is received, how is the sender to know if it was sent by the receiver in response to its (sender's)
own most recently transmitted packet, or is a delayed ACK sent in response to an earlier transmission of a different data packet?
The solution to this dilemma is to augment the ACK packet with an acknowledgement field. When the receiver generates an ACK,
it will copy the sequence number of the data packet being ACK'ed into this acknowledgement field. By examining the contents of
the acknowledgment field, the sender can determine the sequence number of the packet being positively acknowledged.

                                               Figure 3.4-8: rdt3.0 sender FSM

                                        Figure 3.4-9: Operation of rdt3.0, the alternating bit protocol

Figure 3.4-8 shows the sender FSM for rdt3.0, a protocol that reliably transfers data over a channel that can corrupt or lose
packets. Figure 3.4-9 shows how the protocol operates with no lost or delayed packets, and how it handles lost data packets. In
Figure 3.4-9, time moves forward from the top of the diagram towards the bottom of the diagram; note that a receive time for a
packet is necessarily later than the send time for a packet as a result of transmission and propagation delays. In Figures 3.4-9(b)-(d),
the send-side brackets indicate the times at which a timer is set and later times out. Several of the more subtle aspects of this
protocol are explored in the exercises at the end of this chapter. Because packet sequence numbers alternate between 0 and 1,
protocol rdt3.0 is sometimes known as the alternating bit protocol.
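The alternating bit behavior can be sketched as below. To keep the example short, a timeout is modeled simply as "nothing came back" (the channel returns None), and corruption is folded into loss, since a sender that ignores a garbled ACK ends up retransmitting on timeout anyway. The lossy() helper, the tuple packet format and the function name rdt30_send() are hypothetical.

```python
def lossy(pattern):
    """Channel that drops the i-th item whenever pattern[i] is True."""
    it = iter(pattern)

    def channel(item):
        return None if next(it, False) else item

    return channel


class AltBitReceiver:
    def __init__(self):
        self.expected = 0
        self.delivered = []

    def rdt_rcv(self, pkt):
        seq, data = pkt
        if seq == self.expected:
            self.delivered.append(data)
            self.expected ^= 1
        # The acknowledgement field carries the sequence number being ACKed;
        # for a duplicate this is the number of the packet already delivered
        return ("ACK", seq)


def rdt30_send(messages, receiver, data_channel, ack_channel) -> int:
    seq, transmissions = 0, 0
    for data in messages:
        while True:
            transmissions += 1  # send (or resend) and start the timer
            arrived = data_channel((seq, data))
            ack = None
            if arrived is not None:
                ack = ack_channel(receiver.rdt_rcv(arrived))
            if ack == ("ACK", seq):
                break  # expected ACK arrived: stop the timer
            # Otherwise the timer expires and the same packet is resent
        seq ^= 1  # sequence numbers alternate between 0 and 1
    return transmissions
```

In this synchronous model the sender cannot tell a lost data packet from a lost ACK, and it does not need to: in both cases it retransmits, and the sequence number lets the receiver discard the duplicate.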

We have now assembled the key elements of a data transfer protocol. Checksums, sequence numbers, timers, and positive and
negative acknowledgement packets each play a crucial and necessary role in the operation of the protocol. We now have a working
reliable data transfer protocol!

3.4.2 Pipelined Reliable Data Transfer Protocols
Protocol rdt3.0 is a functionally correct protocol, but it is unlikely that anyone would be happy with its performance,
particularly in today's high speed networks. At the heart of rdt3.0's performance problem is the fact that it is a stop-and-wait
protocol.

To appreciate the performance impact of this stop-and-wait behavior, consider an idealized case of two end hosts, one located on
the west coast of the United States and the other located on the east coast. The speed-of-light propagation delay, Tprop, between these
two end systems is approximately 15 milliseconds. Suppose that they are connected by a channel with a capacity, C, of 1 Gigabit
(10**9 bits) per second. With a packet size, SP, of 1K bytes per packet including both header fields and data, the time needed to
actually transmit the packet into the 1Gbps link is

                                        Ttrans = SP/C = (8 Kbits/packet)/ (10**9 bits/sec) = 8 microseconds

With our stop and wait protocol, if the sender begins sending the packet at t = 0, then at t = 8 microsecs the last bit enters the
channel at the sender side. The packet then makes its 15 msec cross country journey, as depicted in Figure 3.4-10a, with the last bit
of the packet emerging at the receiver at t = 15.008 msec. Assuming for simplicity that ACK packets are the same size as data
packets and that the receiver can begin sending an ACK packet as soon as the last bit of a data packet is received, the last bit of the
ACK packet emerges back at the sender at t = 30.016 msec. Thus, in 30.016 msec, the sender was only busy (sending or
receiving) for .016 msec. If we define the utilization of the sender (or the channel) as the fraction of time the sender is actually
busy sending bits into the channel, we have a rather dismal sender utilization, Usender, of

                                                        Usender = (.008/ 30.016) = 0.00027

That is, the sender was busy only 2.7 hundredths of one percent of the time. Viewed another way, the sender was only able to send
1K bytes in 30.016 milliseconds, an effective throughput of only 33KB/sec - even though a 1 Gigabit per second link was
available! Imagine the unhappy network manager who just paid a fortune for a gigabit capacity link but manages to get a
throughput of only 33KB/sec! This is a graphic example of how network protocols can limit the capabilities provided by the underlying
network hardware. Also, we have neglected lower layer protocol processing times at the sender and receiver, as well as the
processing and queueing delays that would occur at any intermediate routers between the sender and receiver. Including these
effects would only serve to further increase the delay and further accentuate the poor performance.

                                            Figure 3.4-10: Stop-and-wait versus pipelined protocols
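The stop-and-wait arithmetic can be reproduced with a short calculation. The function name and parameter names below are illustrative; the numbers are the ones used in the example (8 Kbits per packet, a 1 Gbps link, a 15 msec one-way propagation delay, and ACK packets the same size as data packets).

```python
def stop_and_wait(packet_bits: float, capacity_bps: float, t_prop_s: float):
    """Return (Ttrans, cycle time, sender utilization, throughput in bytes/sec)."""
    t_trans = packet_bits / capacity_bps
    # One full cycle: send the packet, let it propagate, then send the
    # (equally sized) ACK and let it propagate back
    cycle = 2 * (t_prop_s + t_trans)
    utilization = t_trans / cycle
    throughput = (packet_bits / 8) / cycle
    return t_trans, cycle, utilization, throughput


t_trans, cycle, u_sender, throughput = stop_and_wait(8000, 1e9, 0.015)
print(f"Ttrans     = {t_trans * 1e6:.1f} microseconds")
print(f"cycle      = {cycle * 1e3:.3f} msec")
print(f"Usender    = {u_sender:.5f}")
print(f"throughput = {throughput / 1e3:.1f} KB/sec")
```

With these inputs the sender utilization comes out at roughly 0.00027 and the throughput at roughly 33 KB/sec, however fast the link is: the cycle time is dominated by the 30 msec round trip, not by the 8 microseconds spent transmitting.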

The solution to this particular performance problem is a simple one: rather than operate in a stop-and-wait manner, the sender is
allowed to send multiple packets without waiting for acknowledgements, as shown in Figure 3.4-10(b). Since the many in-transit
sender-to-receiver packets can be visualized as filling a pipeline, this technique is known as pipelining. Pipelining has several
consequences for reliable data transfer protocols:

    •   The range of sequence numbers must be increased, since each in-transit packet (not counting retransmissions) must have a
        unique sequence number and there may be multiple, in-transit, unacknowledged packets.
    •   The sender and receiver sides of the protocol may have to buffer more than one packet. Minimally, the sender will have to
        buffer packets that have been transmitted, but not yet acknowledged. Buffering of correctly-received packets may also be
        needed at the receiver, as discussed below.

The range of sequence numbers needed and the buffering requirements will depend on the manner in which a data transfer protocol
responds to lost, corrupted, and overly delayed packets. Two basic approaches towards pipelined error recovery can be identified:
Go-Back-N and selective repeat.

3.4.3 Go-Back-N (GBN)

                                        Figure 3.4-11: Sender's view of sequence numbers in Go-Back-N

In a Go-Back-N (GBN) protocol, the sender is allowed to transmit multiple packets (when available) without waiting for an
acknowledgment, but is constrained to have no more than some maximum allowable number, N, of unacknowledged packets in the
pipeline. Figure 3.4-11 shows the sender's view of the range of sequence numbers in a GBN protocol. If we define base to be the
sequence number of the oldest unacknowledged packet and nextseqnum to be the smallest unused sequence number (i.e., the
sequence number of the next packet to be sent), then four intervals in the range of sequence numbers can be identified. Sequence
numbers in the interval [0,base-1] correspond to packets that have already been transmitted and acknowledged. The interval
[base,nextseqnum-1] corresponds to packets that have been sent but not yet acknowledged. Sequence numbers in the interval
[nextseqnum,base+N-1] can be used for packets that can be sent immediately, should data arrive from the upper layer. Finally,
sequence numbers greater than or equal to base+N can not be used until an unacknowledged packet currently in the pipeline has
been acknowledged.
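The four intervals can be expressed directly as a small classification function. The function name classify() and the returned labels are made up for this sketch, and sequence numbers are treated as unbounded integers (no wraparound).

```python
def classify(seq: int, base: int, nextseqnum: int, N: int) -> str:
    """Place a sequence number in one of the four GBN sender intervals."""
    if seq < base:
        return "already acknowledged"      # [0, base-1]
    if seq < nextseqnum:
        return "sent, not yet acknowledged"  # [base, nextseqnum-1]
    if seq < base + N:
        return "usable, not yet sent"      # [nextseqnum, base+N-1]
    return "not usable"                    # >= base+N


# Example: window of size 5 starting at 4, with packets 4..6 in flight
for seq in (3, 4, 6, 7, 8, 9):
    print(seq, classify(seq, base=4, nextseqnum=7, N=5))
```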

As suggested by Figure 3.4-11, the range of permissible sequence numbers for transmitted but not-yet-acknowledged packets can
be viewed as a "window" of size N over the range of sequence numbers. As the protocol operates, this window slides forward over
the sequence number space. For this reason, N is often referred to as the window size and the GBN protocol itself as a sliding
window protocol. You might be wondering why we would limit the number of outstanding, unacknowledged packets to a value of
N in the first place. Why not allow an unlimited number of such packets? We will see in Section 3.5 that flow control is one
reason to impose a limit on the sender. We'll examine another reason to do so in Section 3.7, when we study TCP congestion
control.

In practice, a packet's sequence number is carried in a fixed length field in the packet header. If k is the number of bits in the packet
sequence number field, the range of sequence numbers is thus [0,2^k-1]. With a finite range of sequence numbers, all arithmetic
involving sequence numbers must then be done using modulo 2^k arithmetic. (That is, the sequence number space can be thought of
as a ring of size 2^k, where sequence number 2^k-1 is immediately followed by sequence number 0.) Recall that rdt3.0 had a 1-bit
sequence number and a range of sequence numbers of [0,1]. Several of the problems at the end of this chapter explore consequences
of a finite range of sequence numbers. We will see in Section 3.5 that TCP has a 32-bit sequence number field, where TCP
sequence numbers count bytes in the byte stream rather than packets.
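One way to picture the ring of sequence numbers is with two small helpers, sketched below. The names next_seq() and in_window() are hypothetical; the only technique used is arithmetic modulo 2^k on a k-bit sequence number.

```python
def next_seq(seq: int, k: int) -> int:
    """Successor of seq on a ring of size 2**k."""
    return (seq + 1) % (2 ** k)


def in_window(seq: int, base: int, N: int, k: int) -> bool:
    """True if seq is one of the N sequence numbers starting at base,
    counting around the ring of size 2**k."""
    # The distance from base to seq, measured forward around the ring
    return (seq - base) % (2 ** k) < N


# With k = 3 the ring is 0..7; a window of 4 starting at 6 covers 6, 7, 0, 1
print([s for s in range(8) if in_window(s, base=6, N=4, k=3)])
```

Measuring the forward distance from base, rather than comparing sequence numbers directly, is what lets the window straddle the point where the numbering wraps back to 0.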

                                        Figure 3.4-12: Extended FSM description of GBN sender

                                        Figure 3.4-13: Extended FSM description of GBN receiver

Figures 3.4-12 and 3.4-13 give an extended-FSM description of the sender and receiver sides of an ACK-based, NAK-free, GBN
protocol. We refer to this FSM description as an extended-FSM since we have added variables (similar to programming language
variables) for base and nextseqnum, and also added operations on these variables and conditional actions involving these variables.
Note that the extended-FSM specification is now beginning to look somewhat like a programming language specification.
[Bochman 84] provides an excellent survey of additional extensions to FSM techniques as well as other programming language-
based techniques for specifying protocols.

The GBN sender must respond to three types of events:

    •   Invocation from above. When rdt_send() is called from above, the sender first checks to see if the window is full, i.e.,
        whether there are N outstanding, unacknowledged packets. If the window is not full, a packet is created and sent, and
        variables are appropriately updated. If the window is full, the sender simply returns the data back to the upper layer, an
        implicit indication that the window is full. The upper layer would presumably then have to try again later. In a real
        implementation, the sender would more likely have either buffered (but not immediately sent) this data, or would have a
        synchronization mechanism (e.g., a semaphore or a flag) that would allow the upper layer to call rdt_send() only when
        the window is not full.
    •   Receipt of an ACK. In our GBN protocol, an acknowledgement for a packet with sequence number n will be taken to be a
        cumulative acknowledgement, indicating that all packets with a sequence number up to and including n have been
        correctly received at the receiver. We'll come back to this issue shortly when we examine the receiver side of GBN.
    •   A timeout event. The protocol's name, "Go-Back-N," is derived from the sender's behavior in the presence of lost or overly
        delayed packets. As in the stop-and-wait protocol, a timer will again be used to recover from lost data or acknowledgement
        packets. If a timeout occurs, the sender resends all packets that have been previously sent but that have not yet been
        acknowledged. Our sender in Figure 3.4-12 uses only a single timer, which can be thought of as a timer for the oldest
        transmitted-but-not-yet-acknowledged packet. If an ACK is received but there are still additional
        transmitted-but-yet-to-be-acknowledged packets, the timer is restarted. If there are no outstanding unacknowledged
        packets, the timer is stopped.
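The three sender events can be sketched as methods of a small class. This is an illustrative model, not the extended FSM of Figure 3.4-12: sequence numbers are unbounded integers rather than modulo 2^k, corrupted ACKs are not modeled, the timer is reduced to a flag that an outside driver would act on, and the names GBNSender, rdt_rcv_ack(), timeout() and the udt_send callback are assumptions.

```python
class GBNSender:
    def __init__(self, N, udt_send):
        self.N = N                # window size
        self.udt_send = udt_send  # callback that puts a packet on the channel
        self.base = 0             # oldest unacknowledged sequence number
        self.nextseqnum = 0       # smallest unused sequence number
        self.sndpkt = {}          # sent-but-unacknowledged packets, by seq
        self.timer_running = False

    def rdt_send(self, data) -> bool:
        """Invocation from above. Returns False if the window is full."""
        if self.nextseqnum >= self.base + self.N:
            return False          # refuse the data; upper layer retries later
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.udt_send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True  # first packet in flight starts timer
        self.nextseqnum += 1
        return True

    def rdt_rcv_ack(self, n: int) -> None:
        """Receipt of a cumulative ACK for every packet up to and including n."""
        if not (self.base <= n < self.nextseqnum):
            return                # duplicate or unexpected ACK: ignore
        for seq in range(self.base, n + 1):
            self.sndpkt.pop(seq, None)
        self.base = n + 1
        # Restart the timer if packets are still outstanding, else stop it
        self.timer_running = self.base != self.nextseqnum

    def timeout(self) -> None:
        """Timer expiry: go back and resend every unacknowledged packet."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.udt_send(self.sndpkt[seq])
```

A single cumulative ACK can slide the window forward by several packets at once, while a single timeout resends everything from base up to nextseqnum-1.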

The receiver's actions in GBN are also simple. If a packet with sequence number n is received correctly and is in-order (i.e., the
data last delivered to the upper layer came from a packet with sequence number n-1), the receiver sends an ACK for packet n and
delivers the data portion of the packet to the upper layer. In all other cases, the receiver discards the packet and resends an ACK for
the most recently received in-order packet. Note that since packets are delivered one-at-a-time to the upper layer, if packet k has
been received and delivered, then all packets with a sequence number lower than k have also been delivered. Thus, the use of
cumulative acknowledgements is a natural choice for GBN.

In our GBN protocol, the receiver discards out-of-order packets. While it may seem silly and wasteful to discard a correctly
received (but out-of-order) packet, there is some justification for doing so. Recall that the receiver must deliver data, in-order, to the
upper layer. Suppose now that packet n is expected, but packet n+1 arrives. Since data must be delivered in order, the receiver
could buffer (save) packet n+1 and then deliver this packet to the upper layer after it had later received and delivered packet n.
However, if packet n is lost, both it and packet n+1 will eventually be retransmitted as a result of the GBN retransmission rule at
the sender. Thus, the receiver can simply discard packet n+1. The advantage of this approach is the simplicity of receiver buffering -
the receiver need not buffer any out-of-order packets. Thus, while the sender must maintain the upper and lower bounds of its
window and the position of nextseqnum within this window, the only piece of information the receiver need maintain is the
sequence number of the next in-order packet. This value is held in the variable expectedseqnum, shown in the receiver FSM in
Figure 3.4-13. Of course, the disadvantage of throwing away a correctly received packet is that the subsequent retransmission of
that packet might be lost or garbled and thus even more retransmissions would be required.
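The receiver's rule is short enough to sketch in full. As with the other examples here, this is an illustrative model with assumed names (GBNReceiver, the deliver_data and udt_send callbacks, the tuple packet format); sequence numbers start at 0 and are not taken modulo 2^k, and the choice to send nothing when no in-order packet has yet been received is an assumption of the sketch.

```python
class GBNReceiver:
    def __init__(self, deliver_data, udt_send):
        self.deliver_data = deliver_data  # callback to the upper layer
        self.udt_send = udt_send          # callback that sends an ACK back
        self.expectedseqnum = 0           # the only state the receiver keeps

    def rdt_rcv(self, pkt, corrupted: bool = False) -> None:
        seq, data = pkt
        if not corrupted and seq == self.expectedseqnum:
            # In-order and intact: deliver and acknowledge this packet
            self.deliver_data(data)
            self.udt_send(("ACK", seq))
            self.expectedseqnum += 1
        elif self.expectedseqnum > 0:
            # Out of order or corrupted: discard the packet and re-ACK the
            # most recently received in-order packet
            self.udt_send(("ACK", self.expectedseqnum - 1))


delivered, acks = [], []
rcv = GBNReceiver(delivered.append, acks.append)
for seq in (0, 1, 3, 4, 5):  # packet 2 is lost in the channel
    rcv.rdt_rcv((seq, f"d{seq}"))
print(delivered, acks)
```

In the usage example, packets 3, 4 and 5 are discarded and each triggers a repeated ACK for packet 1, so nothing past packet 1 reaches the upper layer until packet 2 is retransmitted.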

                                                 Figure 3.4-14: Go-Back-N in operation

Figure 3.4-14 shows the operation of the GBN protocol for the case of a window size of four packets. Because of this window size
limitation, the sender sends packets 0 through 3 but then must wait for one or more of these packets to be acknowledged before
proceeding. As each successive ACK (e.g., ACK0 and ACK1) is received, the window slides forwards and the sender can transmit
one new packet (pkt4 and pkt5, respectively). On the receiver side, packet 2 is lost and thus packets 3, 4, and 5 are found to be out-
of-order and are discarded.

Before closing our discussion of GBN, it is worth noting that an implementation of this protocol in a protocol stack would likely be
structured similarly to the extended FSM in Figure 3.4-12. The implementation would also likely be in the form of various
procedures that implement the actions to be taken in response to the various events that can occur. In such event-based
programming, the various procedures are called (invoked) either by other procedures in the protocol stack, or as the result of an
interrupt. In the sender, these events would be (i) a call from the upper layer entity to invoke rdt_send(), (ii) a timer interrupt,
and (iii) a call from the lower layer to invoke rdt_rcv() when a packet arrives. The programming exercises at the end of this
chapter will give you a chance to actually implement these routines in a simulated, but realistic, network setting.

We note here that the GBN protocol incorporates almost all of the techniques that we will encounter when we study the reliable data
transfer components of TCP in Section 3.5: the use of sequence numbers, cumulative acknowledgements, checksums, and a time-
out/retransmit operation. Indeed, TCP is often referred to as a GBN style of protocol. There are, however, some differences.
Many TCP implementations will buffer correctly-received but out-of-order segments [Stevens 1994]. A proposed modification to
TCP, the so-called selective acknowledgment [RFC 2018], will also allow a TCP receiver to selectively acknowledge a single
out-of-order packet rather than cumulatively acknowledge the last correctly received packet. The notion of a selective acknowledgment
is at the heart of the second broad class of pipelined protocols: the so-called selective repeat protocols.

3.4.4 Selective Repeat (SR)
The GBN protocol allows the sender to potentially "fill the pipeline" in Figure 3.4-10 with packets, thus avoiding the channel
utilization problems we noted with stop-and-wait protocols. There are, however, scenarios in which GBN itself will suffer from
performance problems. In particular, when the window size and bandwidth-delay product are both large, many packets can be in
the pipeline. A single packet error can thus cause GBN to retransmit a large number of packets, many of which may be
unnecessary. As the probability of channel errors increases, the pipeline can become filled with these unnecessary retransmissions.
Imagine, in our message dictation scenario, that every time a word was garbled, the surrounding 1000 words (e.g., a window size of
1000 words) had to be repeated. The dictation would be slowed by all of the reiterated words.

As the name suggests, Selective Repeat (SR) protocols avoid unnecessary retransmissions by having the sender retransmit only
those packets that it suspects were received in error (i.e., were lost or corrupted) at the receiver. This individual, as-needed,
retransmission will require that the receiver individually acknowledge correctly-received packets. A window size of N will again be
used to limit the number of outstanding, unacknowledged packets in the pipeline. However, unlike GBN, the sender will have
already received ACKs for some of the packets in the window. Figure 3.4-15 shows the SR sender's view of the sequence number
space. Figure 3.4-16 details the various actions taken by the SR sender.

The SR receiver will acknowledge a correctly received packet whether or not it is in-order. Out-of-order packets are buffered until
any missing packets (i.e., packets with lower sequence numbers) are received, at which point a batch of packets can be delivered
in-order to the upper layer. Figure 3.4-17 itemizes the various actions taken by the SR receiver. Figure 3.4-18 shows an
example of SR operation in the presence of lost packets. Note that in Figure 3.4-18, the receiver initially buffers packets 3 and 4,
and delivers them together with packet 2 to the upper layer when packet 2 is finally received.

                                  Figure 3.4-15: SR sender and receiver views of sequence number space

    1. Data received from above. When data is received from above, the SR sender checks the next available sequence number
       for the packet. If the sequence number is within the sender's window, the data is packetized and sent; otherwise it is either
       buffered or returned to the upper layer for later transmission, as in GBN.
    2. Timeout. Timers are again used to protect against lost packets. However, each packet must now have its own logical timer,
       since only a single packet will be transmitted on timeout. A single hardware timer can be used to mimic the operation of
       multiple logical timers.
    3. ACK received. If an ACK is received, the SR sender marks that packet as having been received, provided it is in the
       window. If the packet's sequence number is equal to sendbase, the window base is moved forward to the unacknowledged
       packet with the smallest sequence number. If the window moves and there are untransmitted packets with sequence numbers
       that now fall within the window, these packets are transmitted.

                                                     Figure 3.4-16: Selective Repeat sender actions

    1. Packet with sequence number in [rcvbase, rcvbase+N-1] is correctly received. In this case, the received packet falls
       within the receiver's window and a selective ACK packet is returned to the sender. If the packet was not previously received,
       it is buffered. If this packet has a sequence number equal to the base of the receive window (rcvbase in Figure 3.4-15), then
       this packet, and any previously buffered and consecutively numbered (beginning with rcvbase) packets are delivered to the
       upper layer. The receive window is then moved forward by the number of packets delivered to the upper layer. As an
       example, consider Figure 3.4-18. When a packet with a sequence number of rcvbase=2 is received, it and packets
       rcvbase+1 and rcvbase+2 can be delivered to the upper layer.
    2. Packet with sequence number in [rcvbase-N,rcvbase-1] is received. In this case, an ACK must be generated, even though
       this is a packet that the receiver has previously acknowledged.
    3. Otherwise. Ignore the packet.
                                            Figure 3.4-17: Selective Repeat Receiver Actions
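
The three receiver cases in Figure 3.4-17 can be sketched in Java as follows. This is a minimal illustration, not a full protocol implementation: it assumes sequence numbers never wrap around, that every packet passed in was received without corruption, and the class and method names (SrReceiver, onPacket) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class SrReceiver {
    private final int windowSize;                                     // N
    private int rcvBase = 0;                                          // base of the receive window
    private final TreeMap<Integer, String> buffer = new TreeMap<>();  // received, not yet delivered
    private final List<String> delivered = new ArrayList<>();         // data handed to the upper layer

    public SrReceiver(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Handles one correctly received packet; returns true if an ACK is sent for it. */
    public boolean onPacket(int seq, String data) {
        if (seq >= rcvBase && seq <= rcvBase + windowSize - 1) {
            // Case 1: inside the window. Buffer it, then deliver any in-order run starting at rcvBase.
            buffer.putIfAbsent(seq, data);
            while (buffer.containsKey(rcvBase)) {
                delivered.add(buffer.remove(rcvBase));
                rcvBase++;
            }
            return true;
        }
        if (seq >= rcvBase - windowSize && seq <= rcvBase - 1) {
            // Case 2: already delivered, but the sender may have missed the ACK, so re-acknowledge.
            return true;
        }
        return false; // Case 3: ignore the packet.
    }

    public int rcvBase() { return rcvBase; }

    public List<String> delivered() { return delivered; }

    public static void main(String[] args) {
        SrReceiver receiver = new SrReceiver(4);
        receiver.onPacket(0, "p0");
        receiver.onPacket(1, "p1");
        receiver.onPacket(3, "p3"); // out of order: buffered
        receiver.onPacket(4, "p4"); // out of order: buffered
        System.out.println(receiver.delivered() + " rcvbase=" + receiver.rcvBase());
        receiver.onPacket(2, "p2"); // fills the gap: packets 2, 3 and 4 are delivered together
        System.out.println(receiver.delivered() + " rcvbase=" + receiver.rcvBase());
    }
}
```

The arrival order in main mirrors the situation described for Figure 3.4-18: packets 3 and 4 wait in the buffer until packet 2 arrives.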

It is important to note that in step 2 in Figure 3.4-17, the receiver re-acknowledges (rather than ignores) already received packets
with certain sequence numbers below the current window base. You should convince yourself that this re-acknowledgement is
indeed needed. Given the sender and receiver sequence number spaces in Figure 3.4-15 for example, if there is no ACK for packet
sendbase propagating from the receiver to the sender, the sender will eventually retransmit packet sendbase, even though it is clear
(to us, not the sender!) that the receiver has already received that packet. If the receiver were not to ACK this packet, the sender's
window would never move forward! This example illustrates an important aspect of SR protocols (and many other protocols as
well): the sender and receiver will not always have an identical view of what has been received correctly and what has not. For SR
protocols, this means that the sender and receiver windows will not always coincide.

                                                     Figure 3.4-18: SR Operation

                  Figure 3.4-19: SR receiver dilemma with too large windows: a new packet or a retransmission?

The lack of synchronization between sender and receiver windows has important consequences when we are faced with the reality
of a finite range of sequence numbers. Consider what could happen, for example, with a finite range of four packet sequence
numbers, 0,1,2,3 and a window size of three. Suppose packets 0 through 2 are transmitted and correctly received and acknowledged
at the receiver. At this point, the receiver's window is over the fourth, fifth and sixth packets, which have sequence numbers 3, 0,
and 1, respectively. Now consider two scenarios. In the first scenario, shown in Figure 3.4-19(a), the ACKs for the first three
packets are lost and the sender retransmits these packets. The receiver thus next receives a packet with sequence number 0 - a copy
of the first packet sent.

In the second scenario, shown in Figure 3.4-19(b), the ACKs for the first three packets are all delivered correctly. The sender thus
moves its window forward and sends the fourth, fifth and sixth packets, with sequence numbers 3, 0, 1, respectively. The packet
with sequence number 3 is lost, but the packet with sequence number 0 arrives - a packet containing new data.

Now consider the receiver's viewpoint in Figure 3.4-19, which has a figurative curtain between the sender and the receiver, since
the receiver can not "see" the actions taken by the sender. All the receiver observes is the sequence of messages it receives from the
channel and sends into the channel. As far as it is concerned, the two scenarios in Figure 3.4-19 are identical. There is no way of
distinguishing the retransmission of the first packet from an original transmission of the fifth packet. Clearly, a window size that is
one smaller than the size of the sequence number space won't work. But how small must the window size be? A problem at the end
of the chapter asks you to show that the window size must be less than or equal to half the size of the sequence number space.
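
The dilemma can be checked mechanically for small cases. The sketch below is only an illustration of the worst-case scenario described above, not a proof: it assumes the receiver has advanced its window by a full window of N packets while the sender may still retransmit the previous N packets, and it reports whether any sequence number appears in both sets. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class SrWindowCheck {
    /**
     * Returns true if a retransmission of one of the first `window` packets could carry the
     * same sequence number as one of the next `window` (new) packets.
     */
    public static boolean ambiguous(int seqSpace, int window) {
        Set<Integer> oldNumbers = new HashSet<>();
        for (int i = 0; i < window; i++) {
            oldNumbers.add(i % seqSpace);            // packets the sender may still retransmit
        }
        for (int i = window; i < 2 * window; i++) {
            if (oldNumbers.contains(i % seqSpace)) { // packets the receiver now expects as new
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int seqSpace = 4; // sequence numbers 0,1,2,3 as in the example above
        for (int window = 1; window <= 3; window++) {
            System.out.println("window " + window + ": ambiguous = " + ambiguous(seqSpace, window));
        }
    }
}
```

With four sequence numbers, a window of three is ambiguous (numbers 0 and 1 appear in both sets), while a window of two is not.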

Let us conclude our discussion of reliable data transfer protocols by considering one remaining assumption in our underlying
channel model. Recall that we have assumed that packets can not be re-ordered within the channel between the sender and receiver.
This is generally a reasonable assumption when the sender and receiver are connected by a single physical wire. However, when
the "channel" connecting the two is a network, packet reordering can occur. One manifestation of packet reordering is that old copies
of a packet with a sequence or acknowledgement number of x can appear, even though neither the sender's nor the receiver's
window contains x. With packet reordering, the channel can be thought of as essentially buffering packets and spontaneously
emitting these packets at any point in the future. Because sequence numbers may be reused, some care must be taken to guard
against such duplicate packets. The approach taken in practice is to ensure that a sequence number is not reused until the sender is
relatively "sure" that any previously sent packets with sequence number x are no longer in the network. This is done by assuming
that a packet can not "live" in the network for longer than some fixed maximum amount of time. A maximum packet lifetime of
approximately three minutes is assumed in the TCP extensions for high-speed networks [RFC 1323]. Sunshine [Sunshine 1978]
describes a method for using sequence numbers such that reordering problems can be completely avoided.


[Bochman 84] G.V. Bochmann and C.A. Sunshine, "Formal Methods in Communication Protocol Design," IEEE Transactions on
Communications, Vol. COM-28, No. 4, (April 1980), pp. 624-631.
[RFC 1323] V. Jacobson, R. Braden, D. Borman, "TCP Extensions for High Performance," RFC 1323, May 1992.
[RFC 2018] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, "TCP Selective Acknowledgment Options," RFC 2018, October
[Stevens 1994] W.R. Stevens, TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley, Reading, MA, 1994.
[Sunshine 1978] C. Sunshine and Y.K. Dalal, "Connection Management in Transport Protocols," Computer Networks, Amsterdam,
The Netherlands: North-Holland, 1978.

Copyright 1999 Keith W. Ross and James F. Kurose, All Rights Reserved.

                                                  TCP Flow Control

        NOTES :

             1. Host B consumes data in 2 Kbyte chunks at random times.
             2. When Host A receives an acknowledgment with WIN=0, Host A sends a packet with one
                byte of data. It is assumed, for simplicity, that this one byte is not consumed by the
                receiver.
  Transmission Control Protocol

                         3.5 Connection-Oriented Transport: TCP
Now that we have covered the underlying principles of reliable data transfer, let's turn to TCP -- the Internet's transport-layer, connection-oriented,
reliable transport protocol. In this section, we'll see that in order to provide reliable data transfer, TCP relies on many of the underlying principles
discussed in the previous section, including error detection, retransmissions, cumulative acknowledgements, timers and header fields for sequence and
acknowledgement numbers. TCP is defined in [RFC 793], [RFC 1122], [RFC 1323], [RFC 2018] and [RFC 2581].

3.5.1 The TCP Connection
TCP provides multiplexing, demultiplexing, and error detection (but not recovery) in exactly the same manner as UDP. Nevertheless, TCP and UDP
differ in many ways. The most fundamental difference is that UDP is connectionless, while TCP is connection-oriented. UDP is connectionless
because it sends data without ever establishing a connection. TCP is connection-oriented because before one application process can begin to send data
to another, the two processes must first "handshake" with each other -- that is, they must send some preliminary segments to each other to establish the
parameters of the ensuing data transfer. As part of the TCP connection establishment, both sides of the connection will initialize many TCP "state
variables" (many of which will be discussed in this section and in Section 3.7) associated with the TCP connection.

The TCP "connection" is not an end-to-end TDM or FDM circuit as in a circuit-switched network. Nor is it a virtual circuit (see Chapter 1), as the
connection state resides entirely in the two end systems. Because the TCP protocol runs only in the end systems and not in the intermediate network
elements (routers and bridges), the intermediate network elements do not maintain TCP connection state. In fact, the intermediate routers are
completely oblivious to TCP connections; they see datagrams, not connections.

A TCP connection provides for full duplex data transfer. That is, application-level data can be transferred in both directions between two hosts - if
there is a TCP connection between process A on one host and process B on another host, then application-level data can flow from A to B at the same
time as application-level data flows from B to A. A TCP connection is also always point-to-point, i.e., between a single sender and a single receiver.
So-called "multicasting" (see Section 4.8) -- the transfer of data from one sender to many receivers in a single send operation -- is not possible with TCP.
With TCP, two hosts are company and three are a crowd!

Let us now take a look at how a TCP connection is established. Suppose a process running in one host wants to initiate a connection with another
process in another host. Recall that the host that is initiating the connection is called the client host, while the other host is called the server host. The
client application process first informs the client TCP that it wants to establish a connection to a process in the server. Recall from Section 2.6 that a Java
client program does this by issuing the command:

                                  Socket clientSocket = new Socket("hostname", portNumber);
The TCP in the client then proceeds to establish a TCP connection with the TCP in the server. We will discuss in some detail the connection
establishment procedure at the end of this section. For now it suffices to know that the client first sends a special TCP segment; the server responds
with a second special TCP segment; and finally the client responds again with a third special segment. The first two segments contain no "payload,"
i.e., no application-layer data; the third of these segments may carry a payload. Because three segments are sent between the two hosts, this connection
establishment procedure is often referred to as a three-way handshake.
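
To see connection establishment from the application's point of view, the following self-contained Java sketch opens a TCP connection over the loopback interface and sends one byte in each direction. It assumes the loopback interface is available and lets the operating system pick a free port; the class and method names (LoopbackEcho, echoOnce) are hypothetical. The three-way handshake itself is performed by the operating system's TCP when the client Socket is constructed, so it is not visible in the code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class LoopbackEcho {
    /** Sends one byte from client to server over loopback and returns the byte echoed back. */
    public static int echoOnce(int value) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket welcomeSocket = new ServerSocket(0, 1, loopback); // port 0: any free port
             Socket clientSocket = new Socket(loopback, welcomeSocket.getLocalPort());
             Socket connectionSocket = welcomeSocket.accept()) {
            // Client-to-server direction.
            clientSocket.getOutputStream().write(value);
            int received = connectionSocket.getInputStream().read();
            // Server-to-client direction on the same connection (full duplex).
            connectionSocket.getOutputStream().write(received);
            return clientSocket.getInputStream().read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println((char) echoOnce('C'));
    }
}
```

Running both ends in one thread works here only because a single byte easily fits in the connection's buffers; a real client and server would run as separate processes.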

Once a TCP connection is established, the two application processes can send data to each other; because TCP is full-duplex they can send data at the
same time. Let us consider the sending of data from the client process to the server process. The client process passes a stream of data through the
socket (the door of the process), as described in Section 2.6. Once the data passes through the door, the data is now in the hands of TCP running in the
client. As shown in Figure 3.5-1, TCP directs this data to the connection's send buffer, which is one of the buffers that is set aside during the initial
three-way handshake. From time to time, TCP will "grab" chunks of data from the send buffer. The maximum amount of data that can be grabbed and
placed in a segment is limited by the Maximum Segment Size (MSS). The MSS depends on the TCP implementation (determined by the operating
system) and can often be configured; common values are 1,500 bytes, 536 bytes and 512 bytes. (These segment sizes are often chosen in order to avoid
IP fragmentation, which will be discussed in the next chapter.) Note that the MSS is the maximum amount of application-level data in the segment, not
the maximum size of the TCP segment including headers. (This terminology is confusing, but we have to live with it, as it is well entrenched.)

                                                       Figure 3.5-1: TCP send and receive buffers

TCP encapsulates each chunk of client data with a TCP header, thereby forming TCP segments. The segments are passed down to the network layer,
where they are separately encapsulated within network-layer IP datagrams. The IP datagrams are then sent into the network. When TCP receives a
segment at the other end, the segment's data is placed in the TCP connection's receive buffer. The application reads the stream of data from this buffer.
Each side of the connection has its own send buffer and its own receive buffer. The send and receive buffers for data flowing in one direction are
shown in Figure 3.5-1.

We see from this discussion that a TCP connection consists of buffers, variables and a socket connection to a process in one host, and another set of
buffers, variables and a socket connection to a process in another host. As mentioned earlier, no buffers or variables are allocated to the connection in
the network elements (routers, bridges and repeaters) between the hosts.

3.5.2 TCP Segment Structure
Having taken a brief look at the TCP connection, let's examine the TCP segment structure. The TCP segment consists of header fields and a data field.
The data field contains a chunk of application data. As mentioned above, the MSS limits the maximum size of a segment's data field. When TCP sends
a large file, such as an encoded image as part of a Web page, it typically breaks the file into chunks of size MSS (except for the last chunk, which will
often be less than the MSS). Interactive applications, however, often transmit data chunks that are smaller than the MSS; for example, with remote
login applications like Telnet, the data field in the TCP segment is often only one byte. Because the TCP header is typically 20 bytes (12 bytes more
than the UDP header), segments sent by Telnet may only be 21 bytes in length.

Figure 3.5-2 shows the structure of the TCP segment. As with UDP, the header includes source and destination port numbers, which are used for
multiplexing/demultiplexing data from/to upper layer applications. Also as with UDP, the header includes a checksum field. A TCP segment header
also contains the following fields:

    •   The 32-bit sequence number field and the 32-bit acknowledgment number field are used by the TCP sender and receiver in implementing a
        reliable data transfer service, as discussed below.
    •   The 16-bit window size field is used for the purposes of flow control. We will see shortly that it is used to indicate the number of bytes that a
        receiver is willing to accept.
    •   The 4-bit length field specifies the length of the TCP header in 32-bit words. The TCP header can be of variable length due to the TCP options
        field, discussed below. (Typically, the options field is empty, so that the length of the typical TCP header is 20 bytes.)
    •   The optional and variable length options field is used when a sender and receiver negotiate the maximum segment size (MSS) or as a window
        scaling factor for use in high-speed networks. A timestamping option is also defined. See [RFC 793], [RFC 1323] for additional details.
    •   The flag field contains 6 bits. The ACK bit is used to indicate that the value carried in the acknowledgment field is valid. The RST, SYN and
        FIN bits are used for connection setup and teardown, as we will discuss at the end of this section. When the PSH bit is set, this is an indication
        that the receiver should pass the data to the upper layer immediately. Finally, the URG bit is used to indicate there is data in this segment that
        the sending-side upper layer entity has marked as "urgent." The location of the last byte of this urgent data is indicated by the 16-bit urgent
        data pointer. TCP must inform the receiving-side upper layer entity when urgent data exists and pass it a pointer to the end of the urgent data.
        (In practice, the PSH, URG and pointer to urgent data are not used. However, we mention these fields for completeness.)

                                                            Figure 3.5-2: TCP segment structure
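
As an illustration of the fields listed above, the following Java sketch reads them out of a 20-byte TCP header. The byte offsets and flag bit positions are not given in the text; they are taken from the header layout in RFC 793. The class name, field names and the sample header values are hypothetical, and the sample's checksum is left as zero, so it is not a valid segment on the wire.

```java
import java.nio.ByteBuffer;

public class TcpHeaderFields {
    public final int sourcePort, destPort, headerLengthBytes, flags, window;
    public final long sequenceNumber, ackNumber;

    public TcpHeaderFields(byte[] header) {
        ByteBuffer buf = ByteBuffer.wrap(header);            // ByteBuffer is big-endian by default
        sourcePort = buf.getShort(0) & 0xFFFF;               // 16-bit source port
        destPort = buf.getShort(2) & 0xFFFF;                 // 16-bit destination port
        sequenceNumber = buf.getInt(4) & 0xFFFFFFFFL;        // 32-bit sequence number
        ackNumber = buf.getInt(8) & 0xFFFFFFFFL;             // 32-bit acknowledgment number
        headerLengthBytes = ((buf.get(12) >> 4) & 0x0F) * 4; // 4-bit length, in 32-bit words
        flags = buf.get(13) & 0x3F;                          // URG, ACK, PSH, RST, SYN, FIN
        window = buf.getShort(14) & 0xFFFF;                  // 16-bit window size
    }

    public boolean urg() { return (flags & 0x20) != 0; }
    public boolean ack() { return (flags & 0x10) != 0; }
    public boolean psh() { return (flags & 0x08) != 0; }
    public boolean rst() { return (flags & 0x04) != 0; }
    public boolean syn() { return (flags & 0x02) != 0; }
    public boolean fin() { return (flags & 0x01) != 0; }

    /** Builds an illustrative header with no options; checksum and urgent pointer stay zero. */
    public static byte[] sampleHeader() {
        ByteBuffer buf = ByteBuffer.allocate(20);
        buf.putShort((short) 1025);  // source port (made up)
        buf.putShort((short) 23);    // destination port 23, the Telnet port
        buf.putInt(42);              // sequence number
        buf.putInt(79);              // acknowledgment number
        buf.put((byte) 0x50);        // header length of 5 words = 20 bytes
        buf.put((byte) 0x10);        // only the ACK flag set
        buf.putShort((short) 4096);  // window size
        return buf.array();
    }

    public static void main(String[] args) {
        TcpHeaderFields h = new TcpHeaderFields(sampleHeader());
        System.out.println("seq=" + h.sequenceNumber + " ack=" + h.ackNumber
                + " headerBytes=" + h.headerLengthBytes + " ACK=" + h.ack() + " window=" + h.window);
    }
}
```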

3.5.3 Sequence Numbers and Acknowledgment Numbers
Two of the most important fields in the TCP segment header are the sequence number field and the acknowledgment number field. These fields are a
critical part of TCP's reliable data transfer service. But before discussing how these fields are used to provide reliable data transfer, let us first explain
what exactly TCP puts in these fields.

TCP views data as an unstructured, but ordered, stream of bytes. TCP's use of sequence numbers reflects this view in that sequence numbers are over
the stream of transmitted bytes and not over the series of transmitted segments. The sequence number for a segment is the byte-stream number of the
first byte in the segment. Let's look at an example. Suppose that a process in host A wants to send a stream of data to a process in host B over a TCP
connection. The TCP in host A will implicitly number each byte in the data stream. Suppose that the data stream consists of a file consisting of
500,000 bytes, that the MSS is 1,000 bytes, and that the first byte of the data stream is numbered zero. As shown in Figure 3.5-3, TCP constructs 500
segments out of the data stream. The first segment gets assigned sequence number 0, the second segment gets assigned sequence number 1000, the
third segment gets assigned sequence number 2000, and so on. Each sequence number is inserted in the sequence number field in the header of the
appropriate TCP segment.

                                                    Figure 3.5-3: Dividing file data into TCP segments.
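
The arithmetic of this example is easy to reproduce. The short Java sketch below computes the number of segments and the sequence number of each segment from the file size, the MSS and the initial sequence number. It assumes every segment except possibly the last carries exactly MSS bytes and ignores wraparound of the 32-bit sequence number field; the class and method names are hypothetical.

```java
public class SegmentNumbers {
    /** Number of segments needed to carry fileBytes bytes with at most mss bytes per segment. */
    public static long segmentCount(long fileBytes, int mss) {
        return (fileBytes + mss - 1) / mss; // ceiling division
    }

    /** Sequence number of the segment at the given zero-based index. */
    public static long sequenceNumber(long initialSequenceNumber, int mss, long segmentIndex) {
        return initialSequenceNumber + segmentIndex * mss;
    }

    public static void main(String[] args) {
        long fileBytes = 500_000;
        int mss = 1_000;
        System.out.println("segments: " + segmentCount(fileBytes, mss));
        for (int i = 0; i < 3; i++) {
            System.out.println("segment " + (i + 1) + " has sequence number " + sequenceNumber(0, mss, i));
        }
    }
}
```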

Now let us consider acknowledgment numbers. These are a little trickier than sequence numbers. Recall that TCP is full duplex, so that host A may be
receiving data from host B while it sends data to host B (as part of the same TCP connection). Each of the segments that arrive from host B has a
sequence number for the data flowing from B to A. The acknowledgment number that host A puts in its segment is the sequence number of the next byte
host A is expecting from host B. It is good to look at a few examples to understand what is going on here. Suppose that host A has received all bytes
numbered 0 through 535 from B and suppose that it is about to send a segment to host B. In other words, host A is waiting for byte 536 and all the
subsequent bytes in host B's data stream. So host A puts 536 in the acknowledgment number field of the segment it sends to B.

As another example, suppose that host A has received one segment from host B containing bytes 0 through 535 and another segment containing bytes
900 through 1,000. For some reason host A has not yet received bytes 536 through 899. In this example, host A is still waiting for byte 536 (and
beyond) in order to recreate B's data stream. Thus, A's next segment to B will contain 536 in the acknowledgment number field. Because TCP only
acknowledges bytes up to the first missing byte in the stream, TCP is said to provide cumulative acknowledgements.
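
The two examples above can be reproduced with a small Java sketch that tracks which bytes have arrived and reports the first missing one as the acknowledgment number. It assumes the byte stream starts at byte 0 and that the receiver keeps out-of-order bytes; the class and method names are hypothetical.

```java
import java.util.BitSet;

public class CumulativeAck {
    private final BitSet received = new BitSet(); // bit i is set once byte i has arrived

    /** Records the arrival of a segment carrying bytes firstByte through lastByte, inclusive. */
    public void onSegment(int firstByte, int lastByte) {
        received.set(firstByte, lastByte + 1); // BitSet.set takes an exclusive upper bound
    }

    /** The acknowledgment number: the sequence number of the first byte not yet received. */
    public int ackNumber() {
        return received.nextClearBit(0);
    }

    public static void main(String[] args) {
        CumulativeAck hostA = new CumulativeAck();
        hostA.onSegment(0, 535);
        System.out.println("ack = " + hostA.ackNumber());
        hostA.onSegment(900, 1000); // arrives out of order; bytes 536 through 899 are still missing
        System.out.println("ack = " + hostA.ackNumber());
        hostA.onSegment(536, 899);  // the gap is filled
        System.out.println("ack = " + hostA.ackNumber());
    }
}
```

The acknowledgment number stays at 536 until the gap is filled, and then jumps directly to 1001.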

This last example also brings up an important but subtle issue. Host A received the third segment (bytes 900 through 1,000) before receiving the second
segment (bytes 536 through 899). Thus, the third segment arrived out of order. The subtle issue is: What does a host do when it receives out of order
segments in a TCP connection? Interestingly, the TCP RFCs do not impose any rules here, and leave the decision up to the people programming a TCP
implementation. There are basically two choices: either (i) the receiver immediately discards out-of-order bytes; or (ii) the receiver keeps the out-of-
order bytes and waits for the missing bytes to fill in the gaps. Clearly, the latter choice is more efficient in terms of network bandwidth, whereas the
former choice significantly simplifies the TCP code. Throughout the remainder of this introductory discussion of TCP, we focus on the former
implementation, that is, we assume that the TCP receiver discards out-of-order segments.

In Figure 3.5-3 we assumed that the initial sequence number was zero. In truth, both sides of a TCP connection randomly choose an initial sequence
number. This is done to minimize the possibility that a segment that is still present in the network from an earlier, already-terminated connection between
two hosts is mistaken for a valid segment in a later connection between these same two hosts (who also happen to be using the same port numbers as
the old connection) [Sunshine 1978].

3.5.4 Telnet: A Case Study for Sequence and Acknowledgment Numbers
Telnet, defined in [RFC 854], is a popular application-layer protocol used for remote login. It runs over TCP and is designed to work between any pair
of hosts. Unlike the bulk-data transfer applications discussed in Chapter 2, Telnet is an interactive application. We discuss a Telnet example here, as it
nicely illustrates TCP sequence and acknowledgment numbers.

Suppose one host,, initiates a Telnet session with host (Anticipating our discussion on IP addressing in the next chapter, we
take the liberty to use IP addresses to identify the hosts.) Because host initiates the session, it is labeled the client and host is
labeled the server. Each character typed by the user (at the client) will be sent to the remote host; the remote host will send back a copy of each
character, which will be displayed on the Telnet user's screen. This "echo back" is used to ensure that characters seen by the Telnet user have already
been received and processed at the remote site. Each character thus traverses the network twice between when the user hits the key and when the
character is displayed on the user's monitor.

Now suppose the user types a single letter, 'C', and then grabs a coffee. Let's examine the TCP segments that are sent between the client and server. As
shown in Figure 3.5-4, we suppose the starting sequence numbers are 42 and 79 for the client and server, respectively. Recall that the sequence number
of a segment is the sequence number of the first byte in the data field. Thus the first segment sent from the client will have sequence number 42; the first
segment sent from the server will have sequence number 79. Recall that the acknowledgment number is the sequence number of the next byte of data
that the host is waiting for. After the TCP connection is established but before any data is sent, the client is waiting for byte 79 and the server is waiting
for byte 42.

                            Figure 3.5-4: Sequence and acknowledgment numbers for a simple Telnet application over TCP

As shown in Figure 3.5-4, three segments are sent. The first segment is sent from the client to the server, containing the one-byte ASCII representation
of the letter 'C' in its data field. This first segment also has 42 in its sequence number field, as we just described. Also, because the client has not yet
received any data from the server, this first segment will have 79 in its acknowledgment number field.

The second segment is sent from the server to the client. It serves a dual purpose. First it provides an acknowledgment for the data the server has
received. By putting 43 in the acknowledgment field, the server is telling the client that it has successfully received everything up through byte 42 and
is now waiting for bytes 43 onward. The second purpose of this segment is to echo back the letter 'C'. Thus, the second segment has the ASCII
representation of 'C' in its data field. This second segment has the sequence number 79, the initial sequence number of the server-to-client data flow of
this TCP connection, as this is the very first byte of data that the server is sending. Note that the acknowledgement for client-to-server data is carried in
a segment carrying server-to-client data; this acknowledgement is said to be piggybacked on the server-to-client data segment.

The third segment is sent from the client to the server. Its sole purpose is to acknowledge the data it has received from the server. (Recall that the
second segment contained data -- the letter 'C' -- from the server to the client.) This segment has an empty data field (i.e., the acknowledgment is not
being piggybacked with any client-to-server data). The segment has 80 in the acknowledgment number field because the client has received the stream
of bytes up through byte sequence number 79 and it is now waiting for bytes 80 onward. You might think it odd that this segment also has a sequence
number since the segment contains no data. But because TCP has a sequence number field, the segment needs to have some sequence number.
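
The three-segment exchange can be traced with a small Java sketch in which each endpoint keeps two counters: the sequence number of the next byte it will send and the sequence number of the next byte it expects. The sketch assumes no loss or reordering and starts after the connection is established; the class, method and field names are hypothetical.

```java
public class TelnetSeqAck {
    static final class Segment {
        final long seq, ack;
        final int length; // number of data bytes carried

        Segment(long seq, long ack, int length) {
            this.seq = seq;
            this.ack = ack;
            this.length = length;
        }
    }

    static final class Endpoint {
        long nextSeq;  // sequence number of the next byte this side will send
        long expected; // sequence number of the next byte this side is waiting for

        Endpoint(long nextSeq, long expected) {
            this.nextSeq = nextSeq;
            this.expected = expected;
        }

        Segment send(int length) {
            Segment s = new Segment(nextSeq, expected, length);
            nextSeq += length;
            return s;
        }

        void receive(Segment s) {
            if (s.seq == expected) {
                expected += s.length; // in-order data advances the acknowledgment number
            }
        }
    }

    /** Returns {seq, ack, dataLength} for each of the three segments, in order. */
    public static long[][] exchange() {
        Endpoint client = new Endpoint(42, 79);
        Endpoint server = new Endpoint(79, 42);
        Segment first = client.send(1);  // the letter 'C'
        server.receive(first);
        Segment second = server.send(1); // echo of 'C', with a piggybacked acknowledgment
        client.receive(second);
        Segment third = client.send(0);  // acknowledgment only, no data
        return new long[][] {
            {first.seq, first.ack, first.length},
            {second.seq, second.ack, second.length},
            {third.seq, third.ack, third.length},
        };
    }

    public static void main(String[] args) {
        for (long[] s : exchange()) {
            System.out.println("seq=" + s[0] + " ack=" + s[1] + " dataBytes=" + s[2]);
        }
    }
}
```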

3.5.5 Reliable Data Transfer
Recall that the Internet's network layer service (IP service) is unreliable. IP does not guarantee datagram delivery, does not guarantee in-order delivery
of datagrams, and does not guarantee the integrity of the data in the datagrams. With IP service, datagrams can overflow router buffers and never reach
their destination, datagrams can arrive out of order, and bits in the datagram can get corrupted (flipped from 0 to 1 and vice versa). Because transport-layer
segments are carried across the network by IP datagrams, transport-layer segments can suffer from these problems as well.

TCP creates a reliable data transfer service on top of IP's unreliable best-effort service. Many popular application protocols -- including FTP, SMTP,
NNTP, HTTP and Telnet -- use TCP rather than UDP primarily because TCP provides reliable data transfer service. TCP's reliable data transfer
service ensures that the data stream that a process reads out of its TCP receive buffer is uncorrupted, without gaps, without duplication, and in
sequence, i.e., the byte stream is exactly the same byte stream that was sent by the end system on the other side of the connection. In this subsection we
provide an informal overview of how TCP provides reliable data transfer. We shall see that the reliable data transfer service of TCP uses many of the
principles that we studied in Section 3.4.

Retransmissions

Retransmission of lost and corrupted data is crucial for providing reliable data transfer. TCP provides reliable data transfer by using positive
acknowledgments and timers in much the same way as we studied in Section 3.4. TCP acknowledges data that has been received correctly, and
retransmits segments when segments or their corresponding acknowledgements are thought to be lost or corrupted. Just as in the case of our reliable
data transfer protocol, rdt3.0, TCP can not itself tell for certain if a segment, or its ACK, is lost, corrupted, or overly delayed. In all cases, TCP's
response will be the same: retransmit the segment in question.

TCP also uses pipelining, allowing the sender to have multiple transmitted but yet-to-be-acknowledged segments outstanding at any given time. We
saw in the previous section that pipelining can greatly improve the throughput of a TCP connection when the ratio of the segment size to round trip
delay is small. The specific number of outstanding unacknowledged segments that a sender can have is determined by TCP's flow control and
congestion control mechanisms. TCP flow control is discussed at the end of this section; TCP congestion control is discussed in Section 3.7. For the
time being, we must simply be aware that the sender can have multiple transmitted, but unacknowledged, segments at any given time.

       /* assume sender is not constrained by TCP flow or congestion control,
          that data from above is less than MSS in size, and that data transfer is
          in one direction only */

       sendbase = initial_sequence_number                        /* see Figure 3.4-11 */
       nextseqnum = initial_sequence_number

       loop (forever) {

                event: data received from application above
                      create TCP segment with sequence number nextseqnum
                      start timer for segment nextseqnum
                      pass segment to IP
                      nextseqnum = nextseqnum + length(data)

                event: timer timeout for segment with sequence number y
                      retransmit segment with sequence number y
                      compute new timeout interval for segment y
                      restart timer for sequence number y

                event: ACK received, with ACK field value of y
                      if (y > sendbase) { /* cumulative ACK of all data up to y */
                          cancel all timers for segments with sequence numbers < y
                          sendbase = y
                      }
                      else { /* a duplicate ACK for already ACKed segment */
                          increment number of duplicate ACKs received for y
                          if (number of duplicate ACKs received for y == 3) {
                              /* TCP fast retransmit */
                              resend segment with sequence number y
                              restart timer for segment y
                          }
                      }
       } /* end of loop forever */

                                                           Figure 3.5-5: simplified TCP sender

Figure 3.5-5 shows the three major events related to data transmission/retransmission at a simplified TCP sender. Let us consider a TCP connection
between host A and B and focus on the data stream being sent from host A to host B. At the sending host (A), TCP is passed application-layer data,
which it frames into segments and then passes on to IP. The passing of data from the application to TCP and the subsequent framing and transmission
of a segment is the first important event that the TCP sender must handle. Each time TCP releases a segment to IP, it starts a timer for that segment. If
this timer expires, an interrupt event is generated at host A. TCP responds to the timeout event, the second major type of event that the TCP sender
must handle, by retransmitting the segment that caused the timeout.

The third major event that must be handled by the TCP sender is the arrival of an acknowledgement segment (ACK) from the receiver (more
specifically, a segment containing a valid ACK field value). Here, the sender's TCP must determine whether the ACK is a first-time ACK for a
segment that the sender has yet to receive an acknowledgement for, or a so-called duplicate ACK that re-acknowledges a segment for which the
sender has already received an earlier acknowledgement. In the case of the arrival of a first-time ACK, the sender now knows that all data up to the
byte being acknowledged has been received correctly at the receiver. The sender can thus update its TCP state variable that tracks the sequence number
of the last byte that is known to have been received correctly and in-order at the receiver.

To understand the sender's response to a duplicate ACK, we must look at why the receiver sends a duplicate ACK in the first place. Table 3.5-1
summarizes the TCP receiver's ACK generation policy. When a TCP receiver receives a segment with a sequence number that is larger than the next,
expected, in-order sequence number, it detects a gap in the data stream - i.e., a missing segment. Since TCP does not use negative acknowledgements,
the receiver cannot send an explicit negative acknowledgement back to the sender. Instead, it simply re-acknowledges (i.e., generates a duplicate ACK
for) the last in-order byte of data it has received. If the TCP sender receives three duplicate ACKs for the same data, it takes this as an indication that
the segment following the segment that has been ACKed three times has been lost. In this case, TCP performs a fast retransmit [RFC 2581],
retransmitting the missing segment before that segment's timer expires.
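
The ACK-arrival branch of Figure 3.5-5 can be sketched in Java as follows. This is only a minimal illustration under stated assumptions, not an actual TCP implementation: the class name SimplifiedAckHandler, the Action values and the per-ACK-value duplicate counter are hypothetical names of our own, and the timers and the actual (re)transmission of segments are left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the "ACK received" event of the simplified TCP sender. */
class SimplifiedAckHandler {
    enum Action { ADVANCE, DUPLICATE, FAST_RETRANSMIT }

    private long sendBase;  // smallest sequence number not yet acknowledged
    private final Map<Long, Integer> dupAckCounts = new HashMap<>();

    SimplifiedAckHandler(long initialSendBase) {
        this.sendBase = initialSendBase;
    }

    long sendBase() { return sendBase; }

    /** Handles an ACK whose ACK field value is y and reports what the sender should do. */
    Action onAck(long y) {
        if (y > sendBase) {
            // First-time (cumulative) ACK: everything below y has been received.
            sendBase = y;
            dupAckCounts.clear();  // assumption: older duplicate counts no longer matter
            return Action.ADVANCE;
        }
        // Duplicate ACK for data that was already acknowledged.
        int count = dupAckCounts.merge(y, 1, Integer::sum);
        return (count == 3) ? Action.FAST_RETRANSMIT : Action.DUPLICATE;
    }
}
```

A caller would cancel the timers of acknowledged segments on ADVANCE, and resend the segment with sequence number y (restarting its timer) on FAST_RETRANSMIT.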

       Event                                               TCP receiver action

       Arrival of in-order segment with expected           Delayed ACK. Wait up to 500 ms for arrival of
       sequence number. All data up to expected            another in-order segment. If next in-order
       sequence number already acknowledged.               segment does not arrive in this interval,
       No gaps in the received data.                       send an ACK.

       Arrival of in-order segment with expected           Immediately send single cumulative ACK,
       sequence number. One other in-order                 ACKing both in-order segments.
       segment waiting for ACK transmission.
       No gaps in the received data.

       Arrival of out-of-order segment with                Immediately send duplicate ACK, indicating
       higher-than-expected sequence number.               sequence number of next expected byte.
       Gap detected.

       Arrival of segment that partially or                Immediately send ACK, provided that segment
       completely fills in gap in received data.           starts at the lower end of gap.

                                     Table 3.5-1: TCP ACK generation recommendations [RFC 1122, RFC 2581]
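
The four rows of Table 3.5-1 can be read as a small decision procedure. The Java sketch below is one possible encoding, given only as an illustration: the class ReceiverAckPolicy, the AckAction names and the two boolean parameters are hypothetical, and the handling of a segment that lies entirely below the expected sequence number is our own assumption, since the table does not cover that case.

```java
/** Hypothetical encoding of the receiver's ACK generation policy in Table 3.5-1. */
class ReceiverAckPolicy {
    enum AckAction { DELAYED_ACK, CUMULATIVE_ACK, DUPLICATE_ACK, IMMEDIATE_ACK }

    /**
     * @param segSeq      sequence number of the arriving segment
     * @param expectedSeq next in-order sequence number the receiver expects
     * @param ackPending  true if one earlier in-order segment is still waiting for its delayed ACK
     * @param gapBuffered true if out-of-order data above expectedSeq is already buffered
     */
    static AckAction decide(long segSeq, long expectedSeq, boolean ackPending, boolean gapBuffered) {
        if (segSeq > expectedSeq) {
            return AckAction.DUPLICATE_ACK;   // row 3: gap detected
        }
        if (segSeq < expectedSeq) {
            return AckAction.DUPLICATE_ACK;   // not in the table; assumed here to be re-acknowledged
        }
        if (gapBuffered) {
            return AckAction.IMMEDIATE_ACK;   // row 4: segment starts at the lower end of the gap
        }
        if (ackPending) {
            return AckAction.CUMULATIVE_ACK;  // row 2: ACK both in-order segments at once
        }
        return AckAction.DELAYED_ACK;         // row 1: wait up to 500 ms for another segment
    }
}
```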

A Few Interesting Scenarios

We end this discussion by looking at a few simple scenarios. Figure 3.5-6 depicts the scenario where host A sends one segment to host B. Suppose that
this segment has sequence number 92 and contains 8 bytes of data. After sending this segment, host A waits for a segment from B with
acknowledgment number 100. Although the segment from A is received at B, the acknowledgment from B to A gets lost. In this case, the timer
expires, and host A retransmits the same segment. Of course, when host B receives the retransmission, it will observe that the bytes in the segment
duplicate bytes it has already deposited in its receive buffer. Thus TCP in host B will discard the bytes in the retransmitted segment.

                                             Figure 3.5-6: Retransmission due to a lost acknowledgment

In a second scenario, host A sends two segments back to back. The first segment has sequence number 92 and 8 bytes of data, and the second segment
has sequence number 100 and 20 bytes of data. Suppose that both segments arrive intact at B, and B sends a separate acknowledgment for each of
these segments. The first of these acknowledgments has acknowledgment number 100; the second has acknowledgment number 120. Suppose now
that neither of the acknowledgments arrives at host A before the timeout of the first segment. When the timer expires, host A resends the first segment
with sequence number 92. Now, you may ask, does A also resend the second segment? According to the rules described above, host A resends the second
segment only if its timer expires before the arrival of an acknowledgment with an acknowledgment number of 120 or greater. Thus, as shown in Figure 3.5-7, if
the second acknowledgment does not get lost and arrives before the timeout of the second segment, A does not resend the second segment.

                          Figure 3.5-7: Segment is not retransmitted because its acknowledgment arrives before the timeout.

In a third and final scenario, suppose host A sends the two segments, exactly as in the second example. The acknowledgment of the first segment is lost
in the network, but just before the timeout of the first segment, host A receives an acknowledgment with acknowledgment number 120. Host A
therefore knows that host B has received everything up through byte 119; so host A does not resend either of the two segments. This scenario is
illustrated in Figure 3.5-8.

                                   Figure 3.5-8: A cumulative acknowledgment avoids retransmission of first segment
Recall that in the previous section we said that TCP is a Go-Back-N style protocol. This is because acknowledgements are cumulative and correctly-
received but out-of-order segments are not individually ACKed by the receiver. Consequently, as shown in Figure 3.5-5 (see also Figure 3.4-11), the
TCP sender need only maintain the smallest sequence number of a transmitted but unacknowledged byte (sendbase) and the sequence number of the
next byte to be sent (nextseqnum). But the reader should keep in mind that although the reliable-data-transfer component of TCP resembles Go-
Back-N, it is by no means a pure implementation of Go-Back-N. To see that there are some striking differences between TCP and Go-Back-N, consider
what happens when the sender sends a sequence of segments 1, 2,..., N, and all of the segments arrive in order without error at the receiver. Further
suppose that the acknowledgment for packet n < N gets lost, but the remaining N-1 acknowledgments arrive at the sender before their respective
timeouts. In this example, Go-Back-N would retransmit not only packet n, but also all the subsequent packets n+1, n+2,...,N. TCP, on the other hand,
would retransmit at most one segment, namely, segment n. Moreover, TCP would not even retransmit segment n if the acknowledgement for segment
n+1 arrives before the timeout for segment n.
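
The counting argument of the paragraph above can be written down directly. The following Java fragment is a rough sketch only; the class and method names are hypothetical, and it simply restates the numbers given in the text for the case in which the acknowledgment for packet n (of N) is lost.

```java
/** Hypothetical helper comparing retransmissions after a single lost ACK. */
class LostAckComparison {
    /** Go-Back-N, as described above, resends packets n, n+1, ..., N. */
    static int goBackNRetransmissions(int n, int total) {
        return total - n + 1;
    }

    /**
     * The simplified TCP sender resends at most segment n, and nothing at all
     * if the ACK for segment n+1 arrives before segment n's timeout.
     */
    static int tcpRetransmissions(boolean laterAckBeforeTimeout) {
        return laterAckBeforeTimeout ? 0 : 1;
    }
}
```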

There have recently been several proposals [RFC 2018, Fall 1996, Mathis 1996] to extend the TCP ACKing scheme to be more similar to a selective
repeat protocol. The key idea in these proposals is to provide the sender with explicit information about which segments have been received correctly,
and which are still missing at the receiver.

3.5.6 Flow Control
Recall that the hosts on each side of a TCP connection each set aside a receive buffer for the connection. When the TCP connection receives bytes that
are correct and in sequence, it places the data in the receive buffer. The associated application process will read data from this buffer, but not
necessarily at the instant the data arrives. Indeed, the receiving application may be busy with some other task and may not even attempt to read the data
until long after it has arrived. If the application is relatively slow at reading the data, the sender can very easily overflow the connection's receive buffer
by sending too much data too quickly. TCP thus provides a flow control service to its applications by eliminating the possibility of the sender
overflowing the receiver's buffer. Flow control is thus a speed matching service - matching the rate at which the sender is sending to the rate at which the
receiving application is reading. As noted earlier, a TCP sender can also be throttled due to congestion within the IP network; this form of sender
control is referred to as congestion control, a topic we will explore in detail in Sections 3.6 and 3.7. While the actions taken by flow and congestion
control are similar (the throttling of the sender), they are obviously taken for very different reasons. Unfortunately, many authors use the terms
interchangeably, and the savvy reader would do well to distinguish between the two cases. Let's now discuss how TCP provides its flow control
service.

TCP provides flow control by having the sender maintain a variable called the receive window. Informally, the receive window is used to give the
sender an idea about how much free buffer space is available at the receiver. In a full-duplex connection, the sender at each side of the connection
maintains a distinct receive window. The receive window is dynamic, i.e., it changes throughout a connection's lifetime. Let's investigate the receive
window in the context of a file transfer. Suppose that host A is sending a large file to host B over a TCP connection. Host B allocates a receive buffer to
this connection; denote its size by RcvBuffer. From time to time, the application process in host B reads from the buffer. Define the following variables:

       LastByteRead = the number of the last byte in the data stream read from the buffer by the application process in B.

       LastByteRcvd = the number of the last byte in the data stream that has arrived from the network and has been placed in the receive buffer at
       B.

Because TCP is not permitted to overflow the allocated buffer, we must have:

       LastByteRcvd - LastByteRead <= RcvBuffer

The receive window, denoted RcvWindow, is set to the amount of spare room in the buffer:

       RcvWindow = RcvBuffer - [ LastByteRcvd - LastByteRead]

Because the spare room changes with time, RcvWindow is dynamic. The variable RcvWindow is illustrated in Figure 3.5-9.

                                  Figure 3.5-9: The receive window (RcvWindow) and the receive buffer (RcvBuffer)
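
To make the bookkeeping at host B concrete, here is a minimal Java sketch of the receive buffer and its window. It is an illustration under assumptions, not real TCP code: the class ReceiveBufferModel and its method names are hypothetical, byte numbering starts at 0, and only the counters (not the data itself) are modeled.

```java
/** Hypothetical model of host B's receive buffer accounting. */
class ReceiveBufferModel {
    private final long rcvBuffer;   // RcvBuffer: total buffer size in bytes
    private long lastByteRcvd = 0;  // LastByteRcvd
    private long lastByteRead = 0;  // LastByteRead

    ReceiveBufferModel(long rcvBuffer) {
        this.rcvBuffer = rcvBuffer;
    }

    /** RcvWindow = RcvBuffer - [LastByteRcvd - LastByteRead]. */
    long rcvWindow() {
        return rcvBuffer - (lastByteRcvd - lastByteRead);
    }

    /** n in-order bytes arrive from the network and are placed in the buffer. */
    void onBytesArrived(long n) {
        if (n < 0 || n > rcvWindow()) {
            throw new IllegalArgumentException("would overflow the receive buffer");
        }
        lastByteRcvd += n;
    }

    /** The application process reads n bytes out of the buffer. */
    void onBytesRead(long n) {
        if (n < 0 || n > lastByteRcvd - lastByteRead) {
            throw new IllegalArgumentException("cannot read more than is buffered");
        }
        lastByteRead += n;
    }
}
```

With a 4096-byte buffer, the window starts at 4096, shrinks to 3096 after 1000 bytes arrive, and grows back to 3496 once the application reads 400 of them.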

How does the connection use the variable RcvWindow to provide the flow control service? Host B informs host A of how much spare room it has in
the connection buffer by placing its current value of RcvWindow in the window field of every segment it sends to A. Initially host B sets
RcvWindow = RcvBuffer. Note that to pull this off, host B must keep track of several connection-specific variables.

Host A in turn keeps track of two variables, LastByteSent and LastByteAcked, which have obvious meanings. Note that the difference between
these two variables, LastByteSent - LastByteAcked, is the amount of unacknowledged data that A has sent into the connection. By keeping
the amount of unacknowledged data no greater than the value of RcvWindow, host A is assured that it is not overflowing the receive buffer at host B. Thus
host A makes sure throughout the connection's life that

       LastByteSent - LastByteAcked <= RcvWindow.
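
The sender-side rule can be sketched in the same spirit. In the Java fragment below, the class FlowControlledSender and its methods are hypothetical names chosen for illustration; the sketch assumes that an arriving acknowledgment reports both the number of the last byte acknowledged and the receiver's currently advertised RcvWindow.

```java
/** Hypothetical model of host A's flow-control check. */
class FlowControlledSender {
    private long lastByteSent = 0;   // LastByteSent
    private long lastByteAcked = 0;  // LastByteAcked
    private long rcvWindow;          // most recently advertised RcvWindow

    FlowControlledSender(long initialRcvWindow) {
        this.rcvWindow = initialRcvWindow;
    }

    /** Amount of data sent into the connection but not yet acknowledged. */
    long unacked() {
        return lastByteSent - lastByteAcked;
    }

    /** Sends n bytes only if LastByteSent - LastByteAcked <= RcvWindow still holds afterwards. */
    boolean trySend(long n) {
        if (n <= 0 || unacked() + n > rcvWindow) {
            return false;
        }
        lastByteSent += n;
        return true;
    }

    /** Processes an acknowledgment carrying a new advertised window. */
    void onAck(long lastByteAckedByReceiver, long advertisedWindow) {
        if (lastByteAckedByReceiver > lastByteAcked) {
            lastByteAcked = lastByteAckedByReceiver;
        }
        rcvWindow = advertisedWindow;
    }
}
```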

There is one minor technical problem with this scheme. To see this, suppose host B's receive buffer becomes full so that RcvWindow = 0. After
advertising RcvWindow = 0 to host A, also suppose that B has nothing to send to A. As the application process at B empties the buffer, TCP does
not send new segments with new RcvWindows to host A -- TCP will only send a segment to host A if it has data to send or if it has an
acknowledgment to send. Therefore host A is never informed that some space has opened up in host B's receive buffer: host A is blocked and can
transmit no more data! To solve this problem, the TCP specification requires host A to continue to send segments with one data byte when B's receive
window is zero. These segments will be acknowledged by the receiver. Eventually the buffer will begin to empty and the acknowledgements will
contain non-zero RcvWindow.
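
The zero-window rule can be folded into the sender's decision of how many bytes to put in its next segment. The Java sketch below is a simplification under assumptions: the class and method names are hypothetical, and the pacing of the one-byte segments (how often they are sent) is not modeled at all.

```java
/** Hypothetical helper choosing the size of host A's next segment. */
class ZeroWindowProbe {
    /**
     * @param rcvWindow window most recently advertised by host B
     * @param unacked   LastByteSent - LastByteAcked
     * @param pending   bytes the application has handed to TCP but that are not yet sent
     */
    static long nextSegmentBytes(long rcvWindow, long unacked, long pending) {
        if (pending <= 0) {
            return 0;  // nothing to send
        }
        if (rcvWindow == 0) {
            return 1;  // one data byte, so that B's ACK can report a reopened window
        }
        long room = rcvWindow - unacked;
        return Math.max(0L, Math.min(pending, room));
    }
}
```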

Having described TCP's flow control service, we briefly mention here that UDP does not provide flow control. To understand the issue here, consider
sending a series of UDP segments from a process on host A to a process on host B. For a typical UDP implementation, UDP will append the segments
(more precisely, the data in the segments) to a finite-size queue that "precedes" the corresponding socket (i.e., the door to the process). The process
reads one entire segment at a time from the queue. If the process does not read the segments fast enough from the queue, the queue will overflow and
segments will get lost.

Following this section we provide an interactive Java applet which should provide significant insight into the TCP receive window.

3.5.7 Round Trip Time and Timeout

Recall that when a host sends a segment into a TCP connection, it starts a timer. If the timer expires before the host receives an acknowledgment for the
data in the segment, the host retransmits the segment. The time from when the timer is started until when it expires is called the timeout of the timer.
A natural question is, how large should timeout be? Clearly, the timeout should be larger than the connection's round-trip time, i.e., the time from
when a segment is sent until it is acknowledged. Otherwise, unnecessary retransmissions would be sent. But the timeout should not be much larger
than the round-trip time; otherwise, when a segment is lost, TCP would not quickly retransmit the segment, thereby introducing significant data transfer
delays into the application. Before discussing the timeout interval in more detail, let us take a closer look at the round-trip time (RTT). The discussion
below is based on the TCP work in [Jacobson 1988].

Estimating the Average Round-Trip Time

The sample RTT, denoted SampleRTT, for a segment is the time from when the segment is sent (i.e., passed to IP) until an acknowledgment for the
segment is received. Each segment sent will have its own associated SampleRTT. Obviously, the SampleRTT values will fluctuate from segment to
segment due to congestion in the routers and to the varying load on the end systems. Because of this fluctuation, any given SampleRTT value may be
atypical. In order to estimate a typical RTT, it is therefore natural to take some sort of average of the SampleRTT values. TCP maintains an average,
called EstimatedRTT, of the SampleRTT values. Upon receiving an acknowledgment and obtaining a new SampleRTT, TCP updates
EstimatedRTT according to the following formula:

        EstimatedRTT = (1-x) EstimatedRTT + x SampleRTT.

The above formula is written in the form of a programming language statement - the new value of EstimatedRTT is a weighted combination of the
previous value of EstimatedRTT and the new value for SampleRTT. A typical value of x is x = .1, in which case the above formula becomes:

        EstimatedRTT = .9 EstimatedRTT + .1 SampleRTT.

Note that EstimatedRTT is a weighted average of the SampleRTT values. As we will see in the homework, this weighted average puts more weight
on recent samples than on old samples. This is natural, as the more recent samples better reflect the current congestion in the network. In statistics, such
an average is called an exponential weighted moving average (EWMA). The word "exponential" appears in EWMA because the weight of a given
SampleRTT decays exponentially fast as the updates proceed. In the homework problems you will be asked to derive the exponential term in
EstimatedRTT.
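
The update rule translates almost literally into code. The Java sketch below is illustrative only; the class EwmaRtt and its method names are hypothetical. The second method states the weight obtained by unrolling the recursion: a sample taken k updates ago contributes with weight x(1-x)^k, which is where the exponential decay comes from.

```java
/** Hypothetical helper for the EstimatedRTT update. */
class EwmaRtt {
    /** EstimatedRTT = (1-x) EstimatedRTT + x SampleRTT. */
    static double update(double estimatedRtt, double sampleRtt, double x) {
        return (1 - x) * estimatedRtt + x * sampleRtt;
    }

    /** Weight carried by the sample taken k updates ago (k = 0 is the newest sample). */
    static double weightOfSample(int k, double x) {
        return x * Math.pow(1 - x, k);
    }
}
```

For example, with x = .1 the newest sample carries weight .1, the one before it .09, the one before that .081, and so on.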

Setting the Timeout

The timeout should be set so that a timer expires early (i.e., before the delayed arrival of a segment's ACK) only on rare occasions. It is therefore
natural to set the timeout equal to the EstimatedRTT plus some margin. The margin should be large when there is a lot of fluctuation in the
SampleRTT values; it should be small when there is little fluctuation. TCP uses the following formula:

        Timeout = EstimatedRTT + 4*Deviation,

where Deviation is an estimate of how much SampleRTT typically deviates from EstimatedRTT:

        Deviation = (1-x) Deviation + x | SampleRTT - EstimatedRTT |

Note that Deviation is an EWMA of how much SampleRTT deviates from EstimatedRTT. If the SampleRTT values have little fluctuation,
then Deviation is small and Timeout is hardly more than EstimatedRTT; on the other hand, if there is a lot of fluctuation, Deviation will be
large and Timeout will be much larger than EstimatedRTT.
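
Putting the two estimators together gives a small timeout calculator. The Java sketch below is a minimal illustration: the class TimeoutEstimator is hypothetical, the same x is used for both averages as in the formulas above, and the order of the two updates (Deviation first, measured against the previous EstimatedRTT) is our own assumption, since the text does not fix it.

```java
/** Hypothetical sketch of the Timeout computation. */
class TimeoutEstimator {
    private final double x;       // EWMA gain, e.g. 0.1
    private double estimatedRtt;  // EstimatedRTT
    private double deviation;     // Deviation

    TimeoutEstimator(double initialRtt, double initialDeviation, double x) {
        this.estimatedRtt = initialRtt;
        this.deviation = initialDeviation;
        this.x = x;
    }

    /** Folds a new SampleRTT into both running averages. */
    void onSample(double sampleRtt) {
        // Assumption: Deviation is updated against the EstimatedRTT from before this sample.
        deviation = (1 - x) * deviation + x * Math.abs(sampleRtt - estimatedRtt);
        estimatedRtt = (1 - x) * estimatedRtt + x * sampleRtt;
    }

    /** Timeout = EstimatedRTT + 4*Deviation. */
    double timeout() {
        return estimatedRtt + 4 * deviation;
    }
}
```

Starting from EstimatedRTT = 100 ms and Deviation = 0 with x = .1, a single sample of 200 ms yields Deviation = 10 ms and EstimatedRTT = 110 ms, so the timeout becomes 150 ms.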

3.5.8 TCP Connection Management
In this subsection we take a closer look at how a TCP connection is established and torn down. Although this particular topic may not seem particularly
exciting, it is important because TCP connection establishment can significantly add to perceived delays (for example, when surfing the Web). Let's
now take a look at how a TCP connection is established. Suppose a process running in one host wants to initiate a connection with another process in
another host. The host that is initiating the connection is called the client host whereas the other host is called the server host. The client application
process first informs the client TCP that it wants to establish a connection to a process in the server. Recall from Section 2.6 that a Java client program
does this by issuing the command:

                         Socket clientSocket = new Socket("hostname", portNumber);

The TCP in the client then proceeds to establish a TCP connection with the TCP in the server in the following manner:

    Step 1. The client-side TCP first sends a special TCP segment to the server-side TCP. This special segment contains no application-layer data. It
        does, however, have one of the flag bits in the segment's header (see Figure 3.3-2), the so-called SYN bit, set to 1. For this reason, this special
        segment is referred to as a SYN segment. In addition, the client chooses an initial sequence number (client_isn) and puts this number in the
        sequence number field of the initial TCP SYN segment. This segment is encapsulated within an IP datagram and sent into the Internet.
    Step 2. Once the IP datagram containing the TCP SYN segment arrives at the server host (assuming it does arrive!), the server extracts the TCP
        SYN segment from the datagram, allocates the TCP buffers and variables to the connection, and sends a connection-granted segment to the client
        TCP. This connection-granted segment also contains no application-layer data. However, it does contain three important pieces of information
        in the segment header. First, the SYN bit is set to 1. Second, the acknowledgment field of the TCP segment header is set to client_isn+1. Finally, the
        server chooses its own initial sequence number (server_isn) and puts this value in the sequence number field of the TCP segment header. This
        connection-granted segment is saying, in effect, "I received your SYN packet to start a connection with your initial sequence number, client_isn.
        I agree to establish this connection. My own initial sequence number is server_isn." The connection-granted segment is sometimes referred to
        as a SYNACK segment.
    Step 3. Upon receiving the connection-granted segment, the client also allocates buffers and variables to the connection. The client host then
        sends the server yet another segment; this last segment acknowledges the server's connection-granted segment (the client does so by putting the
        value server_isn+1 in the acknowledgment field of the TCP segment header). The SYN bit is set to 0, since the connection is established.

Once the preceding three steps have been completed, the client and server hosts can send segments containing data to each other. In each of these future
segments, the SYN bit will be set to zero. Note that in order to establish the connection, three packets are sent between the two hosts, as illustrated in
Figure 3.5-10. For this reason, this connection establishment procedure is often referred to as a three-way handshake. Several aspects of the TCP three-
way handshake (Why are initial sequence numbers needed? Why is a three-way handshake, as opposed to a two-way handshake, needed?) are explored
in the homework problems.
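
The header fields exchanged in the three steps can be summarized in a few lines of Java. This is a sketch for illustration only: the classes HandshakeSketch and Seg are hypothetical, nothing is actually sent over a network, and the sequence number of the third segment (client_isn+1, as in standard TCP) is not stated in the steps above and is filled in here as an assumption.

```java
/** Hypothetical model of the three segments of the TCP three-way handshake. */
class HandshakeSketch {
    /** Only the header fields discussed in the text. */
    static final class Seg {
        final boolean syn;   // SYN bit
        final boolean ack;   // true if the acknowledgment field is valid
        final long seq;      // sequence number field
        final long ackNum;   // acknowledgment field (meaningful only if ack is true)

        Seg(boolean syn, boolean ack, long seq, long ackNum) {
            this.syn = syn;
            this.ack = ack;
            this.seq = seq;
            this.ackNum = ackNum;
        }
    }

    static Seg[] threeWay(long clientIsn, long serverIsn) {
        Seg synSeg = new Seg(true, false, clientIsn, 0);            // Step 1: SYN
        Seg synAck = new Seg(true, true, serverIsn, clientIsn + 1); // Step 2: SYNACK
        // Step 3: SYN bit is 0; seq = client_isn + 1 is an assumption (see lead-in).
        Seg finalAck = new Seg(false, true, clientIsn + 1, serverIsn + 1);
        return new Seg[] { synSeg, synAck, finalAck };
    }
}
```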

                                              Figure 3.5-10: TCP three-way handshake: segment exchange

All good things must come to an end, and the same is true with a TCP connection. Either of the two processes participating in a TCP connection can
end the connection. When a connection ends, the "resources" (i.e., the buffers and variables) in the hosts are de-allocated. As an example, suppose the
client decides to close the connection. The client application process issues a close command. This causes the client TCP to send a special TCP
segment to the server process. This special segment has a flag bit in the segment's header, the so-called FIN bit (see Figure 3.3-2), set to 1. When the
server receives this segment, it sends the client an acknowledgment segment in return. The server then sends its own shut-down segment, which has the
FIN bit set to 1. Finally, the client acknowledges the server's shut-down segment. At this point, all the resources in the two hosts are now de-allocated.

During the life of a TCP connection, the TCP protocol running in each host makes transitions through various TCP states. Figure 3.5-11 illustrates a
typical sequence of TCP states that are visited by the client TCP. The client TCP begins in the closed state. The application on the client side initiates a
new TCP connection (by creating a Socket object in our Java examples). This causes TCP in the client to send a SYN segment to TCP in the server.
After having sent the SYN segment, the client TCP enters the SYN_SENT state. While in the SYN_SENT state the client TCP waits for a segment from the
server TCP that includes an acknowledgment for the client's previous segment as well as the SYN bit set to 1. Once having received such a segment,
the client TCP enters the ESTABLISHED state. While in the ESTABLISHED state, the TCP client can send and receive TCP segments containing
payload (i.e., application-generated) data.

Suppose that the client application decides it wants to close the connection. This causes the client TCP to send a TCP segment with the FIN bit set to 1
and to enter the FIN_WAIT_1 state. While in the FIN_WAIT_1 state, the client TCP waits for a TCP segment from the server with an acknowledgment.
When it receives this segment, the client TCP enters the FIN_WAIT_2 state. While in the FIN_WAIT_2 state, the client waits for another segment
from the server with the FIN bit set to 1; after receiving this segment, the client TCP acknowledges the server's segment and enters the TIME_WAIT
state. The TIME_WAIT state lets the TCP client resend the final acknowledgment in the case the ACK is lost. The time spent in the TIME_WAIT state
is implementation dependent, but typical values are 30 seconds, 1 minute and 2 minutes. After the wait, the connection formally closes and all
resources on the client side (including port numbers) are released.
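
The client-side walk through the states can be captured as a small transition function. The Java sketch below covers only the typical sequence described above; the class ClientTcpStates and the Event names are hypothetical, and any transition outside that sequence is simply rejected rather than modeled.

```java
/** Hypothetical transition function for the typical client-side TCP state sequence. */
class ClientTcpStates {
    enum State { CLOSED, SYN_SENT, ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, TIME_WAIT }
    enum Event { APP_OPEN, RCV_SYN_ACK, APP_CLOSE, RCV_ACK, RCV_FIN, WAIT_ELAPSED }

    static State next(State state, Event event) {
        switch (state) {
            case CLOSED:
                if (event == Event.APP_OPEN) return State.SYN_SENT;       // client sends SYN
                break;
            case SYN_SENT:
                if (event == Event.RCV_SYN_ACK) return State.ESTABLISHED; // client sends ACK
                break;
            case ESTABLISHED:
                if (event == Event.APP_CLOSE) return State.FIN_WAIT_1;    // client sends FIN
                break;
            case FIN_WAIT_1:
                if (event == Event.RCV_ACK) return State.FIN_WAIT_2;
                break;
            case FIN_WAIT_2:
                if (event == Event.RCV_FIN) return State.TIME_WAIT;       // client sends ACK
                break;
            case TIME_WAIT:
                if (event == Event.WAIT_ELAPSED) return State.CLOSED;     // resources released
                break;
        }
        throw new IllegalStateException("no transition for " + event + " in state " + state);
    }
}
```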

                                         Figure 3.5-11: A typical sequence of TCP states visited by a client TCP

Figure 3.5-12 illustrates the series of states typically visited by the server-side TCP; the transitions are self-explanatory. In these two state transition
diagrams, we have only shown how a TCP connection is normally established and shut down. We are not going to describe what happens in certain
pathological scenarios, for example, when both sides of a connection want to shut down at the same time. If you are interested in learning about this
and other advanced issues concerning TCP, you are encouraged to see Stevens' comprehensive book [Stevens 1994].

                                      Figure 3.5-12: A typical sequence of TCP states visited by a server-side TCP

This completes our introduction to TCP. In Section 3.7 we will return to TCP and look at TCP congestion control in some depth. Before doing so, in
the next section we step back and examine congestion control issues in a broader context.


[Fall 1996] K. Fall, S. Floyd, "Simulation-based Comparisons of Tahoe, Reno and SACK TCP", ACM Computer Communication Review, July 1996.
[Jacobson 1988] V. Jacobson, "Congestion Avoidance and Control," Proc. ACM Sigcomm 1988 Conference,
in Computer Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988
[Mathis 1996] M. Mathis, J. Mahdavi, "Forward Acknowledgment: Refining TCP Congestion Control", Proceedings of ACM SIGCOMM'96, August
1996, Stanford, CA.
[RFC 793] "Transmission Control Protocol," RFC 793, September 1981.
[RFC 854] J. Postel and J. Reynolds, "Telnet Protocol Specifications," RFC 854, May 1983.
[RFC 1122] R. Braden, "Requirements for Internet Hosts -- Communication Layers," RFC 1122, October 1989.
[RFC 1323] V. Jacobson, R. Braden, D. Borman, "TCP Extensions for High Performance," RFC 1323, May 1992.
[RFC 2018] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, "TCP Selective Acknowledgement Options," RFC 2018, October 1996.
[RFC 2581] M. Allman, V. Paxson, W. Stevens, "TCP Congestion Control," RFC 2581, April 1999.
[Stevens 1994] W.R. Stevens, TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley, Reading, MA, 1994.

Copyright Keith W. Ross and James F. Kurose 1996-2000

                    3.6 Principles of Congestion Control
In the previous sections, we've examined both the general principles and specific TCP mechanisms used to provide for a
reliable data transfer service in the face of packet loss. We mentioned earlier that, in practice, such loss typically results
from the overflowing of router buffers as the network becomes congested. Packet retransmission thus treats a symptom of
network congestion (the loss of a specific transport-layer packet) but does not treat the cause of network congestion -- too
many sources attempting to send data at too high a rate. To treat the cause of network congestion, mechanisms are needed
to throttle the sender in the face of network congestion.

In this section, we consider the problem of congestion control in a general context, seeking to understand why congestion is
a "bad thing," how network congestion is manifested in the performance received by upper-layer applications, and various
approaches that can be taken to avoid, or react to, network congestion. This more general study of congestion control is
appropriate since, as with reliable data transfer, it is high on the "top-10" list of fundamentally important problems in
networking. We conclude this section with a discussion of congestion control in the ATM ABR protocol. The following
section contains a detailed study of TCP's congestion control algorithm.

3.6.1 The Causes and the "Costs" of Congestion
Let's begin our general study of congestion control by examining three increasingly complex scenarios in which congestion
occurs. In each case, we'll look at why congestion occurs in the first place, and the "cost" of congestion (in terms of
resources not fully utilized and poor performance received by the end systems).

Scenario 1: Two senders, a router with infinite buffers

We begin by considering perhaps the simplest congestion scenario possible: two hosts (A and B) each have a connection
that shares a single hop between source and destination, as shown in Figure 3.6-1.

                  Figure 3.6-1: Congestion scenario 1: two connections sharing a single hop with infinite buffers

Let's assume that the application in Host A is sending data into the connection (e.g., passing data to the transport-level
protocol via a socket) at an average rate of λin bytes/sec. These data are "original" in the sense that each unit of data is sent
into the socket only once. The underlying transport-level protocol is a simple one: data is encapsulated and sent; no error
recovery (e.g., retransmission), flow control, or congestion control is performed. Host B operates in a similar manner and
we assume for simplicity that it too is sending at a rate of λin bytes/sec. Packets from hosts A and B pass through a router
and over a shared outgoing link of capacity C. The router has buffers that allow it to store incoming packets when the
packet arrival rate exceeds the outgoing link's capacity. In this first scenario, we'll assume that the router has an infinite
amount of buffer space.

                   Figure 3.6-2: Congestion scenario 1: throughput and delay as a function of host sending rate

Figure 3.6-2 plots the performance of Host A's connection under this first scenario. The left graph plots the per-connection
throughput (number of bytes per second at the receiver) as a function of the connection sending rate. For a sending rate
between zero and C/2, the throughput at the receiver equals the sender's sending rate - everything sent by the sender is
received at the receiver with a finite delay. When the sending rate is above C/2, however, the throughput is only C/2. This
upper limit on throughput is a consequence of the sharing of link capacity between two connections - the link simply cannot
deliver packets to a receiver at a steady state rate that exceeds C/2. No matter how high Hosts A and B set their sending
rates, they will each never see a throughput higher than C/2.
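
The left graph of Figure 3.6-2 amounts to a one-line function. The Java fragment below is a sketch under the assumptions of scenario 1 (two identical senders, infinite buffers, steady state); the class and method names are hypothetical.

```java
/** Hypothetical helper for the per-connection throughput in scenario 1. */
class SharedLinkScenario1 {
    /**
     * @param lambdaIn sending rate of each host, in bytes/sec
     * @param c        capacity C of the shared outgoing link, in bytes/sec
     * @return per-connection throughput at the receiver, in bytes/sec
     */
    static double throughput(double lambdaIn, double c) {
        // Below C/2 everything sent is delivered; above it each connection is capped at C/2.
        return Math.min(lambdaIn, c / 2);
    }
}
```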

Achieving a per-connection throughput of C/2 might actually appear to be a "good thing," as the link is fully utilized in
delivering packets to their destinations. The right graph in Figure 3.6-2, however, shows the consequences of operating
near link capacity. As the sending rate approaches C/2 (from the left), the average delay becomes larger and larger. When
the sending rate exceeds C/2, the average number of queued packets in the router is unbounded and the average delay
between source and destination becomes infinite (assuming that the connections operate at these sending rates for an infinite
period of time). Thus, while operating at an aggregate throughput of near C may be ideal from a throughput standpoint, it is
far from ideal from a delay standpoint. Even in this (extremely) idealized scenario, we've already found one cost of a
congested network - large queueing delays are experienced as the packet arrival rate nears the link capacity.

Scenario 2: Two senders, a router with finite buffers

Let us now slightly modify scenario 1 in the following two ways. First, the amount of router buffering is assumed to be
finite. Second, we assume that each connection is reliable. If a packet containing a transport-level segment is dropped at
the router, it will eventually be retransmitted by the sender. Because packets can be retransmitted, we must now be more
careful with our use of the term "sending rate." Specifically, let us again denote the rate at which the application sends
original data into the socket by λin bytes/sec. The rate at which the transport layer sends segments (containing original data
or retransmitted data) into the network will be denoted λin' bytes/sec. λin' is sometimes referred to as the offered load to
the network.

                      Figure 3.6-3: Scenario 2: two hosts (with retransmissions) and a router with finite buffers

                                       Figure 3.6-4: Scenario 2 performance: (a) no retransmissions
                                 (b) only needed retransmissions (c) extraneous, unneeded retransmissions

The performance realized under scenario 2 will now depend strongly on how retransmission is performed. First, consider
the unrealistic case that Host A is able to somehow (magically!) determine whether or not a buffer is free in the router and
thus sends a packet only when a buffer is free. In this case, no loss would occur, λin would be equal to λin', and the
throughput of the connection would be equal to λin. This case is shown in Figure 3.6-4(a). From a throughput standpoint,
performance is ideal - everything that is sent is received. Note that the average host sending rate can not exceed C/2 under
this scenario, since packet loss is assumed never to occur.

Consider next the slightly more realistic case that the sender retransmits only when a packet is known for certain to be lost.
(Again, this assumption is a bit of a stretch. However, it is possible that the sending host might set its timeout large enough
to be virtually assured that a packet that has not been ACKed has been lost.) In this case, the performance might look
something like that shown in Figure 3.6-4(b). To appreciate what is happening here, consider the case that the offered load,
λin' (the rate of original data transmission plus retransmissions), equals .6C. According to Figure 3.6-4(b), at this value of
the offered load, the rate at which data are delivered to the receiver application is C/3. Thus, out of the .6C units of data
transmitted, .333C bytes/sec (on average) are original data and .267C bytes/sec (on average) are retransmitted data.
We see here another "cost" of a congested network - the sender must perform retransmissions in order to compensate for
dropped (lost) packets due to buffer overflow.
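
The arithmetic in this example can be checked with a short sketch. The variable names are illustrative only, and all rates are written as fractions of the link capacity C.

```python
from fractions import Fraction

# All rates are expressed as fractions of the link capacity C (illustrative convention).
offered_load = Fraction(6, 10)   # lambda_in' = .6C: original data plus retransmissions
throughput = Fraction(1, 3)      # rate delivered to the receiving application, C/3

# Whatever is sent beyond the delivered original data must be retransmitted data.
retransmitted = offered_load - throughput

print(f"original data:      {float(throughput):.3f} C")
print(f"retransmitted data: {float(retransmitted):.3f} C")
```

The retransmitted share works out to 4/15 of C, or roughly .267C.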

Finally, let us consider the more realistic case that the sender may time out prematurely and retransmit a packet that has been
delayed in the queue, but not yet lost. In this case, the original data packet and the retransmission may both reach the
receiver. Of course, the receiver needs but one copy of this packet and will discard the retransmission. In this case, the
"work" done by the router in forwarding the retransmitted copy of the original packet was "wasted," as the receiver will
have already received the original copy of this packet. The router would have better used the link transmission capacity
transmitting a different packet instead. Here then is yet another "cost" of a congested network - unneeded retransmissions
by the sender in the face of large delays may cause a router to use its link bandwidth to forward unneeded copies of a
packet. Figure 3.6-4(c) shows the throughput versus offered load when each packet is assumed to be forwarded (on
average) twice by the router. Since each packet is forwarded twice, the throughput achieved will be bounded above
by the two-segment curve with the asymptotic value of C/4.
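
One way to read the two-segment bound is sketched below. This is an illustration under stated assumptions, not the authors' model: the function name `throughput_bound` and its parameters are hypothetical, each of the two connections is assumed to get at most C/2 of the link, and only one of every `copies` forwarded packets is assumed to be useful.

```python
def throughput_bound(offered_load: float, capacity: float, copies: int = 2) -> float:
    """Upper bound on per-connection throughput when every packet is
    forwarded `copies` times on average over a link shared by two connections."""
    # The router cannot forward more than C/2 for each connection.
    forwarded = min(offered_load, capacity / 2)
    # Only one of every `copies` forwarded packets carries new data.
    return forwarded / copies


# Rising segment (offered load below C/2) and flat segment (at or above C/2).
for load in (0.2, 0.5, 1.0, 5.0):
    print(load, throughput_bound(load, capacity=1.0))
```

With `capacity=1.0`, the bound rises with slope 1/2 and then flattens at 0.25, that is, at C/4.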

Scenario 3: Four senders, routers with finite buffers, and multihop paths

In our final congestion scenario, four hosts transmit packets, each over overlapping two-hop paths, as shown in Figure
3.6-5. We again assume that each host uses a timeout/retransmission mechanism to implement a reliable data transfer service,
that all hosts have the same value of λin, and that all router links have capacity C bytes/sec.

                                Figure 3.6-5: Four senders, routers with finite buffers, and multihop paths

Let us consider the connection from Host A to Host C, passing through Routers R1 and R2. The A-C connection shares
router R1 with the D-B connection and shares router R2 with the B-D connection. For extremely small values of λin,
buffer overflows are rare (as in congestion scenarios 1 and 2), and the throughput approximately equals the offered load.
For slightly larger values of λin, the corresponding throughput is also larger, as more original data is being transmitted into
the network and delivered to the destination, and overflows are still rare. Thus, for small values of λin, an increase in λin
results in an increase in λout.

Having considered the case of extremely low traffic, let us next examine the case that λin (and hence λin') is extremely
large. Consider router R2. The A-C traffic arriving to router R2 (which arrives at R2 after being forwarded from R1) can
have an arrival rate at R2 that is at most C, the capacity of the link from R1 to R2, regardless of the value of λin. If λin' is
extremely large for all connections (including the B-D connection), then the arrival rate of B-D traffic at R2 can be much
larger than that of the A-C traffic. Because the A-C and B-D traffic must compete at router R2 for the limited amount of
buffer space, the amount of A-C traffic that successfully gets through R2 (i.e., is not lost due to buffer overflow) becomes
smaller and smaller as the offered load from B-D gets larger and larger. In the limit, as the offered load approaches infinity,
an empty buffer at R2 is immediately filled by a B-D packet and the throughput of the A-C connection at R2 goes to zero.
This, in turn, implies that the A-C end-end throughput goes to zero in the limit of heavy traffic. These considerations give
rise to the offered load versus throughput tradeoff shown below in Figure 3.6-6.

                              Figure 3.6-6: Scenario 3 performance with finite buffers and multihop paths
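
The rise and fall of the A-C throughput can be illustrated with a deliberately crude toy model. Everything here is an assumption made for illustration: the function name is hypothetical, and the model simply supposes that when R2 is overloaded, its capacity is divided between A-C and B-D traffic in proportion to their arrival rates. The text itself specifies no such sharing rule.

```python
def ac_throughput_at_r2(offered_load: float, capacity: float) -> float:
    """Toy estimate of A-C throughput through R2 (proportional-share assumption)."""
    # A-C traffic reaches R2 over the R1-R2 link, so it can arrive at no more than C.
    ac_arrival = min(offered_load, capacity)
    # B-D traffic enters the network at R2, so it arrives at the full offered load.
    bd_arrival = offered_load
    total = ac_arrival + bd_arrival
    if total <= capacity:
        return ac_arrival                     # light load: nothing is dropped
    return capacity * ac_arrival / total      # overload: share shrinks as B-D grows


for load in (0.1, 0.5, 1.0, 10.0, 1000.0):
    print(load, ac_throughput_at_r2(load, capacity=1.0))
```

Under these assumptions the A-C throughput first tracks the offered load and then decays toward zero as the B-D load grows, which is the qualitative shape described above.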

The reason for the eventual decrease in throughput with increasing offered load is evident when one considers the amount
of wasted "work" done by the network. In the high traffic scenario outlined above, whenever a packet is dropped at a
second-hop router, the "work" done by the first-hop router in forwarding a packet to the second-hop router ends up being
"wasted." The network would have been equally well off (more accurately, equally badly off) if the first router had simply
discarded that packet and remained idle. More to the point, the transmission capacity used at the first router to forward the
packet to the second router could have been much more profitably used to transmit a different packet. (For example, when
selecting a packet for transmission, it might be better for a router to give priority to packets that have already traversed some
number of upstream routers.) So here we see yet another cost of dropping a packet due to congestion - when a packet is
dropped along a path, the transmission capacity that was used at each of the upstream routers to forward that packet to the
point at which it is dropped ends up having been wasted.

3.6.2 Approaches Toward Congestion Control
In Section 3.7, we will examine TCP's specific approach towards congestion control in great detail. Here, we identify the
two broad approaches that are taken in practice towards congestion control, and discuss specific network architectures and
congestion control protocols embodying these approaches.

At the broadest level, we can distinguish among congestion control approaches based on whether or not the network
layer provides any explicit assistance to the transport layer for congestion control purposes:

    •   End-end congestion control. In an end-end approach towards congestion control, the network layer provides no
        explicit support to the transport layer for congestion control purposes. Even the presence of congestion in the
        network must be inferred by the end systems based only on observed network behavior (e.g., packet loss and delay).
        We will see in Section 3.7 that TCP must necessarily take this end-end approach towards congestion control, since
        the IP layer provides no feedback to the end systems regarding network congestion. TCP segment loss (as indicated
        by a timeout or a triple duplicate acknowledgement) is taken as an indication of network congestion and TCP
        decreases its window size accordingly. We will also see that new proposals for TCP use increasing round-trip delay
        values as indicators of increased network congestion.

    •   Network-assisted congestion control. With network-assisted congestion control, network-layer components (i.e.,
        routers) provide explicit feedback to the sender regarding the congestion state in the network. This feedback may be
        as simple as a single bit indicating congestion at a link. This approach was taken in the early IBM SNA [Schwartz
        1982] and DEC DECnet [Jain 1989] [Ramakrishnan 1990] architectures, was recently proposed for TCP/IP networks
        [Floyd 1994] [Ramakrishnan 1998], and is used in ATM ABR congestion control as well, as discussed below. More
        sophisticated network feedback is also possible. For example, one form of ATM ABR congestion control that we
        will study shortly allows a router to explicitly inform the sender of the transmission rate it (the router) can support on
        an outgoing link.

For network-assisted congestion control, congestion information is typically fed back from the network to the sender in one
of two ways, as shown in Figure 3.6-7. Direct feedback may be sent from a network router to the sender. This form of
notification typically takes the form of a choke packet (essentially saying, "I'm congested!"). The second form of
notification occurs when a router marks/updates a field in a packet flowing from sender to receiver to indicate congestion.
Upon receipt of a marked packet, the receiver then notifies the sender of the congestion indication. Note that this latter
form of notification takes up to a full round-trip time.

                             Figure 3.6-7: Two feedback pathways for network-indicated congestion information

3.6.3 ATM ABR Congestion Control

Our detailed study of TCP congestion control in Section 3.7 will provide an in-depth case study of an end-end approach
towards congestion control. We conclude this section with a brief case study of the network-assisted congestion control
mechanisms used in ATM ABR (Available Bit Rate) service. ABR has been designed as an elastic data transfer service in a
manner reminiscent of TCP. When the network is underloaded, ABR service should be able to take advantage of the spare
available bandwidth; when the network is congested, ABR service should throttle its transmission rate to some
predetermined minimum transmission rate. A detailed tutorial on ATM ABR congestion control and traffic management is
provided in [Jain 1996].

Figure 3.6-8 shows the framework for ATM ABR congestion control. In our discussion below we adopt ATM terminology
(e.g., using the term "switch" rather than "router," and the term "cell" rather than "packet"). With ATM ABR service, data
cells are transmitted from a source to a destination through a series of intermediate switches. Interspersed with the data cells
are so-called RM (Resource Management) cells; we will see shortly that these RM cells can be used to convey
congestion-related information among the hosts and switches. When an RM cell arrives at a destination, it will be "turned around" and sent
back to the sender (possibly after the destination has modified the contents of the RM cell). It is also possible for a switch to
generate an RM cell itself and send this RM cell directly to a source. RM cells can thus be used to provide both direct
network feedback and network-feedback-via-the-receiver, as shown in Figure 3.6-8.

                                     Figure 3.6-8: Congestion control framework for ATM ABR service

ATM ABR congestion control is a rate-based approach. That is, the sender explicitly computes a maximum rate at which it
can send and regulates itself accordingly. ABR provides three mechanisms for signaling congestion-related information
from the switches to the receiver:

     •   EFCI bit. Each data cell contains an EFCI (Explicit Forward Congestion Indication) bit. A congested network
         switch can set the EFCI bit in a data cell to 1 to signal congestion to the destination host. The destination must
         check the EFCI bit in all received data cells. When an RM cell arrives at the destination, if the most recently
         received data cell had the EFCI bit set to 1, then the destination sets the CI (Congestion Indication) bit of the RM
         cell to 1 and sends the RM cell back to the sender. Using the EFCI in data cells and the CI bit in RM cells, a sender
         can thus be notified about congestion at a network switch.
     •   CI and NI bits. As noted above, sender-to-receiver RM cells are interspersed with data cells. The rate of RM cell
         interspersion is a tunable parameter, with one RM cell every 32 data cells being the default value. These RM cells
         have a CI bit and an NI (No Increase) bit that can be set by a congested network switch. Specifically, a switch can set
         the NI bit in a passing RM cell to 1 under mild congestion and can set the CI bit to 1 under severe congestion
         conditions. When a destination host receives an RM cell, it will send the RM cell back to the sender with its CI and
         NI bits intact (except that CI may be set to 1 by the destination as a result of the EFCI mechanism described above).
     •   Explicit Rate (ER) setting. Each RM cell also contains a 2-byte ER (Explicit Rate) field. A congested switch may
         lower the value contained in the ER field in a passing RM cell. In this manner, the ER field will be set to the
         minimum supportable rate of all switches on the source-to-destination path.
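
The three signaling mechanisms can be sketched as operations on an RM cell. This is a simplified illustration, not the ATM Forum specification: the class `RMCell`, the functions, and the congestion labels "mild" and "severe" are hypothetical names, and the ER field is modeled as a plain number rather than a 2-byte encoded value.

```python
from dataclasses import dataclass


@dataclass
class RMCell:
    ci: bool = False            # Congestion Indication bit
    ni: bool = False            # No Increase bit
    er: float = float("inf")    # Explicit Rate field (plain number in this sketch)


def switch_process(cell: RMCell, supportable_rate: float, congestion: str = "none") -> None:
    """Update a passing RM cell at one switch.
    `congestion` is one of "none", "mild", "severe" (hypothetical labels)."""
    if congestion == "mild":
        cell.ni = True
    elif congestion == "severe":
        cell.ci = True
    # Lower ER if this switch supports less than what the cell currently advertises.
    cell.er = min(cell.er, supportable_rate)


def destination_turnaround(cell: RMCell, last_data_cell_efci: bool) -> RMCell:
    """Turn the RM cell around; set CI if the latest data cell carried EFCI = 1."""
    if last_data_cell_efci:
        cell.ci = True
    return cell


# An RM cell crossing three switches, then turned around at the destination.
cell = RMCell(er=100.0)
switch_process(cell, supportable_rate=80.0)
switch_process(cell, supportable_rate=50.0, congestion="mild")
switch_process(cell, supportable_rate=70.0)
returned = destination_turnaround(cell, last_data_cell_efci=True)
print(returned)
```

In this run the returned cell carries the smallest supportable rate seen on the path (50.0), NI set by the mildly congested switch, and CI set by the destination through the EFCI mechanism.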

An ATM ABR source adjusts the rate at which it can send cells as a function of the CI, NI and ER values in a returned RM
cell. The rules for making this rate adjustment are rather complicated and tedious. The interested reader is referred to [Jain
1996] for details.


[Floyd 1994] S. Floyd, "TCP and Explicit Congestion Notification," ACM Computer Communication Review, Vol. 24, No. 5,
October 1994, pp. 10-23.
[Jain 1989] R. Jain, "A Delay-Based Approach for Congestion Avoidance in Interconnected Heterogeneous Computer
Networks," ACM Computer Communication Review, Vol. 19, No. 5, 1989, pp. 56-71.
[Jain 1996] R. Jain, S. Kalyanaraman, S. Fahmy, R. Goyal, S. Kim, "Tutorial Paper on ABR Source Behavior," ATM
Forum/96-1270, October 1996.
[Ramakrishnan 1990] K. K. Ramakrishnan and R. Jain, "A Binary Feedback Scheme for Congestion Avoidance in
Computer Networks," ACM Transactions on Computer Systems, Vol. 8, No. 2, May 1990, pp. 158-181.
[Ramakrishnan 1998] K. K. Ramakrishnan and S. Floyd, "A Proposal to Add Explicit Congestion Notification (ECN) to IP,"
Internet draft draft-kksjf-ecn-03.txt, October 1998, work in progress.
[Schwartz 1982] M. Schwartz, "Performance Analysis of the SNA Virtual Route Pacing Control," IEEE Transactions on
Communications, Vol. COM-30, No. 1, January 1982, pp. 172-184.
  TCP Congestion Control

                              3.7 TCP Congestion Control
In this section we return to our study of TCP. As we learned in Section 3.5, TCP provides a reliable transport service between
two processes running on different hosts. Another extremely important component of TCP is its congestion control
mechanism. As we indicated in the previous section, TCP must use end-to-end congestion control rather than
network-assisted congestion control, since the IP layer provides no feedback to the end systems regarding network congestion. Before
diving into the details of TCP congestion control, let's first get a high-level view of TCP's congestion control mechanism, as
well as the overall goal that TCP strives for when multiple TCP connections must share the bandwidth of a congested link.

A TCP connection controls its transmission rate by limiting its number of transmitted-but-yet-to-be-acknowledged segments.
Let us denote this number of permissible unacknowledged segments as w, often referred to as the TCP window size. Ideally,
TCP connections should be allowed to transmit as fast as possible (i.e., to have as large a number of outstanding
unacknowledged packets as possible) as long as segments are not lost (dropped at routers) due to congestion. In very broad
terms, a TCP connection starts with a small value of w and then "probes" for the existence of additional unused link
bandwidth at the links on its end-to-end path by increasing w. A TCP connection continues to increase w until a segment loss
occurs (as detected by a timeout or duplicate acknowledgements). When such a loss occurs, the TCP connection reduces w to
a "safe level" and then begins probing again for unused bandwidth by slowly increasing w.

An important measure of the performance of a TCP connection is its throughput - the rate at which it transmits data from the
sender to the receiver. Clearly, throughput will depend on the value of w. If a TCP sender transmits all w segments back-to-back,
it must then wait for one round trip time (RTT) until it receives acknowledgments for these segments, at which point
it can send w additional segments. If a connection transmits w segments of size MSS bytes every RTT seconds, then the
connection's throughput, or transmission rate, is (w*MSS)/RTT bytes per second.
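
As a quick numerical illustration of this formula, the sketch below evaluates (w*MSS)/RTT. The function name and the sample values (10 segments, an MSS of 1460 bytes, an RTT of 250 ms) are assumptions chosen for the example.

```python
def tcp_throughput(w: int, mss_bytes: int, rtt_seconds: float) -> float:
    """Throughput in bytes/sec when w segments of MSS bytes are sent every RTT."""
    return w * mss_bytes / rtt_seconds


# Hypothetical connection: window of 10 segments, MSS of 1460 bytes, RTT of 0.25 sec.
rate = tcp_throughput(w=10, mss_bytes=1460, rtt_seconds=0.25)
print(f"{rate:.0f} bytes/sec")
```

Doubling w doubles the throughput, while doubling the RTT halves it, which is why the window size is the quantity TCP adjusts to control its rate.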

Suppose now that K TCP connections are traversing a link of capacity R. Suppose also that there are no UDP packets flowing
over this link, that each TCP connection is transferring a very large amount of data, and that none of these TCP connections
traverse any other congested link. Ideally, the window sizes in the TCP connections traversing this link should be such that
each connection achieves a throughput of R/K. More generally, if a connection passes through N links, with link n having
transmission rate Rn and supporting a total of Kn TCP connections, then ideally this connection should achieve a rate of
Rn/Kn on the nth link. However, this connection's end-to-end average rate cannot exceed the minimum rate achieved at all of
the links along the end-to-end path. That is, the end-to-end transmission rate for this connection is r = min{R1/K1,...,RN/KN}.
The goal of TCP is to provide this connection with this end-to-end rate, r. (In actuality, the formula for r is more complicated,
as we should take into account the fact that one or more of the intervening connections may be bottlenecked at some other
link that is not on this end-to-end path and hence can not use their bandwidth share, Rn/Kn. In this case, the value of r would
be higher than min{R1/K1,...,RN/KN}. )
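
The bottleneck computation r = min{R1/K1,...,RN/KN} can be sketched as follows. The function name and the three-link path are hypothetical, and the sketch uses the simple formula only, without the refinement mentioned in the parenthetical remark above.

```python
def end_to_end_rate(links: list[tuple[float, int]]) -> float:
    """links holds one (R_n, K_n) pair per link: the link's transmission rate
    and the number of TCP connections it supports. Returns min over n of R_n / K_n."""
    return min(rate / connections for rate, connections in links)


# Hypothetical three-link path, rates in bits/sec.
path = [
    (10_000_000, 5),    # fair share of 2,000,000
    (100_000_000, 20),  # fair share of 5,000,000
    (1_500_000, 3),     # fair share of 500,000, the bottleneck
]
print(end_to_end_rate(path))
```

The connection is limited by the link with the smallest per-connection share, here the third link.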

3.7.1 Overview of TCP Congestion Control
In Section 3.5 we saw that each side of a TCP connection consists of a receive buffer, a send buffer, and several variables
(LastByteRead, RcvWin, etc.) The TCP congestion control mechanism has each side of the connection keep track of two
additional variables: the congestion window and the threshold. The congestion window, denoted CongWin, imposes an
additional constraint on how much traffic a host can send into a connection. Specifically, the amount of unacknowledged data
that a host can have within a TCP connection may not exceed the minimum of CongWin and RcvWin, i.e.,
                                  LastByteSent - LastByteAcked <= min{CongWin, RcvWin}.
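
A minimal sketch of how a sender might apply this constraint is shown below. The function name `usable_window` is hypothetical; the arguments mirror the variables in the inequality, all measured in bytes.

```python
def usable_window(last_byte_sent: int, last_byte_acked: int,
                  cong_win: int, rcv_win: int) -> int:
    """Bytes the sender may still transmit without violating
    LastByteSent - LastByteAcked <= min{CongWin, RcvWin}."""
    unacked = last_byte_sent - last_byte_acked
    return max(0, min(cong_win, rcv_win) - unacked)


# CongWin is the tighter limit here: 4000 - 2000 bytes in flight leaves 2000.
print(usable_window(last_byte_sent=5000, last_byte_acked=3000, cong_win=4000, rcv_win=8000))
```

Whichever of CongWin and RcvWin is smaller determines the limit, so congestion control and flow control can each throttle the sender independently.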

The threshold, which we discuss in detail below, is a variable that affects how CongWin grows.

Let us now look at how the congestion window evolves throughout the lifetime of a TCP connection. In order to focus on
congestion control (as opposed to flow control), let us assume that the TCP receive buffer is so large that the receive window
constraint can be ignored. In this case, the amount of unacknowledged data that a host can have within a TCP connection is
solely limited by CongWin. Further let's assume that a sender has a very large amount of data to send to a receiver.

Once a TCP connection is established between the two end systems, the application process at the sender writes bytes to the
sender's TCP send buffer. TCP grabs chunks of size MSS, encapsulates each chunk within a TCP segment, and passes the
segments to the network layer for transmission across the network. The TCP congestion window regulates the times at which
the segments are sent into the network (i.e., passed to the network layer). Initially, the congestion window is equal to one
MSS. TCP sends the first segment into the network and waits for an acknowledgement. If this segment is acknowledged
before its timer times out, the sender increases the congestion window by one MSS and sends out two maximum-size
segments. If these segments are acknowledged before their timeouts, the sender increases the congestion window by one
MSS for each of the acknowledged segments, giving a congestion window of four MSS, and sends out four maximum-sized
segments. This procedure continues as long as (1) the congestion window is below the threshold and (2) the
acknowledgements arrive before their corresponding timeouts.

During this phase of the congestion control procedure, the congestion window increases exponentially fast, i.e., the
congestion window is initialized to one MSS, after one RTT the window is increased to two segments, after two round-trip
times the window is increased to four segments, after three round-trip times the window is increased to eight segments, etc.
This phase of the algorithm is called slow start because it begins with a small congestion window equal to one MSS. (The
transmission rate of the connection starts slowly but accelerates rapidly.)

The slow start phase ends when the window size exceeds the value of threshold. Once the congestion window is larger than
the current value of threshold, the congestion window grows linearly rather than exponentially. Specifically, if w is the
current value of the congestion window, and w is larger than threshold, then after w acknowledgements have arrived, TCP
replaces w with w + 1. This has the effect of increasing the congestion window by one in each RTT for which an entire
window's worth of acknowledgements arrives. This phase of the algorithm is called congestion avoidance.

The congestion avoidance phase continues as long as the acknowledgements arrive before their corresponding timeouts. But
the window size, and hence the rate at which the TCP sender can send, can not increase forever. Eventually, the TCP rate
will be such that one of the links along the path becomes saturated, at which point loss (and a resulting timeout at the
sender) will occur. When a timeout occurs, the value of threshold is set to half the value of the current congestion window,
and the congestion window is reset to one MSS. The sender then again grows the congestion window exponentially fast using
the slow start procedure until the congestion window hits the threshold.

In summary:

    •   When the congestion window is below the threshold, the congestion window grows exponentially.
    •   When the congestion window is above the threshold, the congestion window grows linearly.
    •   Whenever there is a timeout, the threshold is set to one half of the current congestion window and the congestion
        window is then set to one MSS.

If we ignore the slow start phase, we see that TCP essentially increases its window size by 1 each RTT (and thus increases its
transmission rate by an additive factor) when its network path is not congested, and decreases its window size by a factor of
two each RTT when the path is congested. For this reason, TCP is often referred to as an additive-increase,
multiplicative-decrease (AIMD) algorithm.

                                        Figure 3.7-1: Evolution of TCP's congestion window

The evolution of TCP's congestion window is illustrated in Figure 3.7-1. In this figure, the threshold is initially equal to
8*MSS. The congestion window climbs exponentially fast during slow start and hits the threshold at the third transmission.
The congestion window then climbs linearly until loss occurs, just after transmission 7. Note that the congestion window is
12*MSS when loss occurs. The threshold is then set to .5*CongWin = 6*MSS and the congestion window is set to 1 MSS. And the
process continues. This congestion control algorithm is due to V. Jacobson [Jac88]; a number of modifications to Jacobson's
initial algorithm are described in [Stevens 1994, RFC 2581].
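
The trace described for Figure 3.7-1 can be reproduced with a small round-by-round simulation of the three summary rules. This is a simplified sketch, not Jacobson's algorithm as implemented: the function name and parameters are hypothetical, the window is measured in MSS units, transmission rounds are numbered from 0, a timeout is assumed to follow each round listed in `loss_rounds`, and slow-start growth is assumed to stop exactly at the threshold.

```python
def tahoe_windows(threshold: float, loss_rounds: set[int], rounds: int) -> list[float]:
    """Congestion window (in MSS) used in each transmission round."""
    cong_win = 1.0
    history = []
    for rnd in range(rounds):
        history.append(cong_win)
        if rnd in loss_rounds:
            # Timeout: halve the threshold relative to the current window, restart at 1 MSS.
            threshold = cong_win / 2
            cong_win = 1.0
        elif cong_win < threshold:
            # Slow start: double each round, but do not overshoot the threshold (assumption).
            cong_win = min(2 * cong_win, threshold)
        else:
            # Congestion avoidance: grow by one MSS per round.
            cong_win += 1
    return history


# Threshold of 8 MSS and a loss just after transmission 7, as in the figure's description.
print(tahoe_windows(threshold=8, loss_rounds={7}, rounds=12))
```

The window reaches the threshold of 8 at round 3, climbs linearly to 12 at round 7, and then restarts from 1 with a threshold of 6.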

A Trip to Nevada: Tahoe, Reno and Vegas

The TCP congestion control algorithm just described is often referred to as Tahoe. One problem with the Tahoe algorithm is
that when a segment is lost the sender side of the application may have to wait a long period of time for the timeout. For this
reason, a variant of Tahoe, called Reno, is implemented by most operating systems. Like Tahoe, Reno sets its congestion
window to one segment upon the expiration of a timer. However, Reno also includes the fast retransmit mechanism that we
examined in Section 3.5. Recall that fast retransmit triggers the transmission of a dropped segment if three duplicate ACKs
for a segment are received before the occurrence of the segment's timeout. Reno also employs a fast recovery mechanism,
which essentially cancels the slow start phase after a fast retransmission. The interested reader is encouraged to see [Stevens
1994, RFC 2581] for details.

Most TCP implementations currently use the Reno algorithm. There is, however, another algorithm in the literature, the
Vegas algorithm, that can improve Reno's performance. Whereas Tahoe and Reno react to congestion (i.e., to overflowing
router buffers), Vegas attempts to avoid congestion while maintaining good throughput. The basic idea of Vegas is to (1)
detect congestion in the routers between source and destination before packet loss occurs, and (2) lower the rate linearly
when this imminent packet loss is detected. Imminent packet loss is predicted by observing the round-trip times -- the longer
the round-trip times of the packets, the greater the congestion in the routers. The Vegas algorithm is discussed in detail in
[Brakmo 1995]; a study of its performance is given in [Ahn 1995]. As of 1999, Vegas is not a part of the most popular TCP
implementations.

We emphasize that TCP congestion control has evolved over the years, and is still evolving. What was good for the Internet
when the bulk of the TCP connections carried SMTP, FTP and Telnet traffic is not necessarily good for today's Web-
dominated Internet or for the Internet of the future, which will support who-knows-what kinds of services.

Does TCP Ensure Fairness?

In the above discussion, we noted that the goal of TCP's congestion control mechanism is to share a bottleneck link's
bandwidth evenly among the TCP connections traversing that link. But why should TCP's additive increase, multiplicative
decrease algorithm achieve that goal, particularly given that different TCP connections may start at different times and thus
may have different window sizes at a given point in time? [Chiu 1989] provides an elegant and intuitive explanation of why
TCP congestion control converges to provide an equal share of a bottleneck link's bandwidth among competing TCP
connections.

Let's consider the simple case of two TCP connections sharing a single link with transmission rate R, as shown in Figure
3.7-2. We'll assume that the two connections have the same MSS and RTT (so that if they have the same congestion window size,
then they have the same throughput), that they have a large amount of data to send, and that no other TCP connections or
UDP datagrams traverse this shared link. Also, we'll ignore the slow start phase of TCP, and assume the TCP connections are
operating in congestion avoidance mode (additive increase, multiplicative decrease) at all times.

                                Figure 3.7-2: Two TCP connections sharing a single bottleneck link

Figure 3.7-3 plots the throughput realized by the two TCP connections. If TCP is to equally share the link bandwidth between
the two connections, then the realized throughput should fall along the 45 degree arrow ("equal bandwidth share") emanating
from the origin. Ideally, the sum of the two throughputs should equal R (certainly, each connection receiving an equal, but
zero, share of the link capacity is not a desirable situation!), so the goal should be to have the achieved throughputs fall
somewhere near the intersection of the "equal bandwidth share" line and the "full bandwidth utilization" line in Figure 3.7-3.

Suppose that the TCP window sizes are such that at a given point in time, connections 1 and 2 realize throughputs indicated
by point A in Figure 3.7-3. Because the amount of link bandwidth jointly consumed by the two connections is less than R,
no loss will occur, and both connections will increase their window by 1 per RTT as a result of TCP's congestion avoidance
algorithm. Thus, the joint throughput of the two connections proceeds along a 45 degree line (equal increase for both
connections) starting from point A. Eventually, the link bandwidth jointly consumed by the two connections will be greater
than R and eventually packet loss will occur. Suppose that connections 1 and 2 experience packet loss when they realize
throughputs indicated by point B. Connections 1 and 2 then decrease their windows by a factor of two. The resulting
throughputs realized are thus at point C, halfway along a vector starting at B and ending at the origin. Because the joint
bandwidth use is less than R at point C, the two connections again increase their throughputs along a 45 degree line starting
from C. Eventually, loss will again occur, e.g., at point D, and the two connections again decrease their window sizes by a
factor of two. And so on. You should convince yourself that the bandwidth realized by the two connections eventually
fluctuates along the equal bandwidth share line. You should also convince yourself that the two connections will converge to
this behavior regardless of where they begin in the two-dimensional space! Although a number of idealized assumptions lie
behind this scenario, it still provides an intuitive feel for why TCP results in an equal sharing of bandwidth among
connections.

                                   Figure 3.7-3: Throughput realized by TCP connections 1 and 2
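
The convergence argument can be checked with a short simulation under the same idealized assumptions: both connections add one unit of throughput per round while the link is not overloaded, and both halve their throughput in any round where their combined rate exceeds R. The function name and starting values are hypothetical.

```python
def aimd_two_flows(x1: float, x2: float, capacity: float, rounds: int) -> tuple[float, float]:
    """Throughputs of two AIMD connections sharing one link after `rounds` RTTs."""
    for _ in range(rounds):
        if x1 + x2 > capacity:
            # Both connections see loss and cut their rates by a factor of two.
            x1, x2 = x1 / 2, x2 / 2
        else:
            # No loss: both connections increase additively by the same amount.
            x1, x2 = x1 + 1, x2 + 1
    return x1, x2


# Start far from the equal-share line: connection 2 has 50 times the throughput.
final1, final2 = aimd_two_flows(1.0, 50.0, capacity=100.0, rounds=1000)
print(final1, final2, abs(final1 - final2))
```

Additive increase leaves the gap between the two throughputs unchanged, while each multiplicative decrease halves it, so the gap shrinks toward zero no matter where the connections start.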

In our idealized scenario, we assumed that only TCP connections traverse the bottleneck link, and that only a single TCP
connection is associated with a host-destination pair. In practice, these two conditions are typically not met, and client-server
applications can thus obtain very unequal portions of link bandwidth.

Many network applications run over TCP rather than UDP because they want to make use of TCP's reliable transport service.
But an application developer choosing TCP gets not only reliable data transfer but also TCP congestion control. We have just
seen how TCP congestion control regulates an application's transmission rate via the congestion window mechanism. Many
multimedia applications do not run over TCP for this very reason -- they do not want their transmission rate throttled, even if
the network is very congested. In particular, many Internet telephone and Internet video conferencing applications typically
run over UDP. These applications prefer to pump their audio and video into the network at a constant rate and occasionally
lose packets, rather than reduce their rates to "fair" levels at times of congestion and not lose any packets. From the
perspective of TCP, the multimedia applications running over UDP are not being fair -- they do not cooperate with the other
connections nor adjust their transmission rates appropriately. A major challenge in the upcoming years will be to develop
congestion control mechanisms for the Internet that prevent UDP traffic from bringing the Internet's throughput to a grinding
halt.

But even if we could force UDP traffic to behave fairly, the fairness problem would still not be completely solved. This is
because there is nothing to stop an application running over TCP from using multiple parallel connections. For example, Web
browsers often use multiple parallel TCP connections to transfer a Web page. (The exact number of multiple connections is
configurable in most browsers.) When an application uses multiple parallel connections, it gets a larger fraction of the
bandwidth in a congested link. As an example, consider a link of rate R supporting 9 on-going client-server applications, with
each of the applications using one TCP connection. If a new application comes along and also uses one TCP connection, then
each application gets approximately the same transmission rate of R/10. But if this new application instead uses 11 parallel
TCP connections, then the new application gets an unfair allocation of more than R/2 (it holds 11 of the 20 connections).
Because Web traffic is so pervasive in the Internet, multiple parallel connections are not uncommon.
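
The arithmetic behind this example can be sketched as follows, assuming (as the idealized fairness argument does) that every TCP connection on the link receives an equal share. The helper name `bandwidth_share` is hypothetical.

```python
from fractions import Fraction


def bandwidth_share(own_connections, other_connections):
    """Fraction of the link rate R obtained by an application that opens
    `own_connections` parallel TCP connections while `other_connections`
    connections from other applications share the same link.
    Assumes every connection gets an equal share of R."""
    total = own_connections + other_connections
    return Fraction(own_connections, total)


print(bandwidth_share(1, 9))    # one connection among ten
print(bandwidth_share(11, 9))   # eleven connections among twenty
```

With one connection the newcomer gets 1/10 of R; with 11 parallel connections it gets 11/20 of R, while each of the 9 original applications drops to 1/20 of R.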

Macroscopic Description of TCP Dynamics

Consider sending a very large file over a TCP connection. If we take a macroscopic view of the traffic sent by the source, we
can ignore the slow start phase. Indeed, the connection is in the slow-start phase for a relatively short period of time because
the connection grows out of the phase exponentially fast. When we ignore the slow-start phase, the congestion window grows
linearly, gets chopped in half when loss occurs, grows linearly, gets chopped in half when loss occurs, etc. This gives rise to
the saw-tooth behavior of TCP [Stevens 1994] shown in Figure 3.7-1.

Given this sawtooth behavior, what is the average throughput of a TCP connection? During a particular round-trip interval,
the rate at which TCP sends data is a function of the congestion window and the current RTT: when the window size is w*MSS
and the current round-trip time is RTT, then TCP's transmission rate is (w*MSS)/RTT. During the congestion avoidance
phase, TCP probes for additional bandwidth by increasing w by one each RTT until loss occurs; denote by W the value of w
at which loss occurs. Assuming that the RTT and W are approximately constant over the duration of the connection, the TCP
transmission rate ranges from (W*MSS)/(2RTT) to (W*MSS)/RTT.

These assumptions lead to a highly simplified macroscopic model for the steady-state behavior of TCP: the network drops a
packet from the connection when the connection's window size increases to W*MSS; the congestion window is then cut in
half and then increases by one MSS per round-trip time until it again reaches W. This process repeats itself over and over
again. Because the TCP throughput increases linearly between the two extreme values, we have:

                                     average throughput of a connection = (.75*W*MSS)/RTT.
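
The factor .75 can be made explicit. Under the model's assumptions the rate climbs linearly from its minimum to its maximum within each cycle, so the time average is simply the midpoint of the two extremes:

```latex
\[
\text{average throughput}
  = \frac{1}{2}\left(\frac{W \cdot MSS}{2\,RTT} + \frac{W \cdot MSS}{RTT}\right)
  = \frac{3}{4}\cdot\frac{W \cdot MSS}{RTT}.
\]
```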

Using this highly idealized model for the steady-state dynamics of TCP, we can also derive an interesting expression that
relates a connection's loss rate to its available bandwidth [Mahdavi 1997]. This derivation is outlined in the homework problems.

3.7.2 Modeling Latency: Static Congestion Window
Many TCP connections transport relatively small files from one host to another. For example, with HTTP/1.0 each object in
a Web page is transported over a separate TCP connection, and many of these objects are small text files or tiny icons. When
transporting a small file, TCP connection establishment and slow start may have a significant impact on the latency. In this
section we present an analytical model that quantifies the impact of connection establishment and slow start on latency. For a
given object, we define the latency as the time from when the client initiates a TCP connection until when the client receives
the requested object in its entirety.

The analysis presented here assumes that the network is uncongested, i.e., the TCP connection transporting the object
does not have to share link bandwidth with other TCP or UDP traffic. (We comment on this assumption below.) Also, in
order not to obscure the central issues, we carry out the analysis in the context of the simple one-link network as shown in
Figure 3.7-4. (This link might model a single bottleneck on an end-to-end path. See also the homework problems for an
explicit extension to the case of multiple links.)

                             Figure 3.7-4: A simple one-link network connecting a client and a server

We also make the following simplifying assumptions:

    1. The amount of data that the sender can transmit is solely limited by the sender's congestion window. (Thus, the TCP
       receive buffers are large.)
    2. Packets are neither lost nor corrupted, so that there are no retransmissions.
    3. All protocol header overheads -- including TCP, IP and link-layer headers -- are negligible and ignored.
    4. The object (that is, file) to be transferred consists of an integer number of segments of size MSS (maximum segment
       size).
    5. The only packets that have non-negligible transmission times are packets that carry maximum-size TCP segments.
       Request packets, acknowledgements and TCP connection establishment packets are small and have negligible
       transmission times.
    6. The initial threshold in the TCP congestion control mechanism is a large value which is never attained by the
       congestion window.

We also introduce the following notation:

    1.   The size of the object to be transferred is O bits.
    2.   The MSS (maximum segment size) is S bits (e.g., 536 bytes).
    3.   The transmission rate of the link from the server to the client is R bps.
    4.   The round-trip time is denoted by RTT.

In this section we define the RTT to be the time elapsed for a small packet to travel from client to server and then back to the
client, excluding the transmission time of the packet. It includes the two end-to-end propagation delays between the two end
systems and the processing times at the two end systems. We shall assume that the RTT is also equal to the roundtrip time of
a packet beginning at the server.

Although the analysis presented in this section assumes an uncongested network with a single TCP connection, it
nevertheless sheds light on the more realistic case of a multi-link congested network. For a congested network, R roughly
represents the amount of bandwidth received in steady state in the end-to-end network connection; and RTT represents a
round-trip delay that includes queueing delays at the routers preceding the congested links. In the congested network case, we
model each TCP connection as a constant-bit-rate connection of rate R bps preceded by a single slow-start phase. (This is
roughly how TCP Tahoe behaves when losses are detected with triplicate acknowledgements.) In our numerical examples we
use values of R and RTT that reflect typical values for a congested network.

Before beginning the formal analysis, let us try to gain some intuition. What would the latency be if there
were no congestion window constraint, that is, if the server were permitted to send segments back-to-back until the entire
object is sent? To answer this question, first note that one RTT is required to initiate the TCP connection. After one RTT the
client sends a request for the object (which is piggybacked onto the third segment in the three-way TCP handshake). After a
total of two RTTs the client begins to receive data from the server. The client receives data from the server for a period of
time O/R, the time for the server to transmit the entire object. Thus, in the case of no congestion window constraint, the total
latency is 2 RTT + O/R. This represents a lower bound; the slow start procedure, with its dynamic congestion window, will
of course elongate this latency.

Static Congestion Window

Although TCP uses a dynamic congestion window, it is instructive to first analyze the case of a static congestion window. Let
W, a positive integer, denote a fixed-size static congestion window. For the static congestion window, the server is not
permitted to have more than W unacknowledged outstanding segments. When the server receives the request from the client,
the server immediately sends W segments back-to-back to the client. The server then sends one segment into the network for
each acknowledgement it receives from the client. The server continues to send one segment for each acknowledgement until
all of the segments of the object have been sent. There are two cases to consider:

    1. WS/R > RTT + S/R. In this case, the server receives an acknowledgement for the first segment in the first window
       before the server completes the transmission of the first window.
    2. WS/R < RTT + S/R. In this case, the server transmits the first window's worth of segments before the server receives
       an acknowledgement for the first segment in the window.

Let us first consider Case 1, which is illustrated in Figure 3.7-5. In this figure the window size is W = 4 segments.

                                       Figure 3.7-5: the case that WS/R > RTT + S/R
One RTT is required to initiate the TCP connection. After one RTT the client sends a request for the object (which is
piggybacked onto the third segment in the three-way TCP handshake). After a total of two RTTs the client begins to receive
data from the server. Segments arrive periodically from the server every S/R seconds, and the client acknowledges every
segment it receives from the server. Because the server receives the first acknowledgement before it completes sending a
window's worth of segments, the server continues to transmit segments after having transmitted the first window's worth of
segments. And because the acknowledgements arrive periodically at the server every S/R seconds from the time when the
first acknowledgement arrives, the server transmits segments continuously until it has transmitted the entire object. Thus,
once the server starts to transmit the object at rate R, it continues to transmit the object at rate R until the entire object is
transmitted. The latency therefore is 2 RTT + O/R.

Now let us consider Case 2, which is illustrated in Figure 3.7-6. In this figure, the window size is W=2 segments.

                                            Figure 3.7-6: the case that WS/R < RTT + S/R

Once again, after a total of two RTTs the client begins to receive segments from the server. These segments arrive
periodically every S/R seconds, and the client acknowledges every segment it receives from the server. But now the server
completes the transmission of the first window before the first acknowledgement arrives from the client. Therefore, after
sending a window, the server must stall and wait for an acknowledgement before resuming transmission. When an
acknowledgement finally arrives, the server sends a new segment to the client. Once the first acknowledgement arrives, a
window's worth of acknowledgements arrive, with each successive acknowledgement spaced by S/R seconds. For each of
these acknowledgements, the server sends exactly one segment. Thus, the server alternates between two states: a transmitting
state, during which it transmits W segments; and a stalled state, during which it transmits nothing and waits for an
acknowledgement. The latency is equal to 2 RTT plus the time required for the server to transmit the object, O/R, plus the
amount of time that the server is in the stalled state. To determine the amount of time the server is in the stalled state, let K =
O/(WS); if O/(WS) is not an integer, then round K up to the nearest integer. Note that K is the number of windows of data there
are in the object of size O. The server is in the stalled state between the transmission of each of the windows, that is, for K-1
periods of time, with each period lasting RTT - (W-1)S/R (see above diagram). Thus, for Case 2,

                                       Latency = 2 RTT + O/R + (K-1)[S/R + RTT - W S/R] .

Combining the two cases, we obtain

                                       Latency = 2 RTT + O/R + (K-1) [S/R + RTT - W S/R]+

where [x]+ = max(x,0).
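
The combined static-window formula, Latency = 2 RTT + O/R + (K-1)[S/R + RTT - W S/R]+, can be evaluated with a few lines of code. This is a minimal sketch; the function name `static_window_latency` and the sample values (an 8-segment object on a 1 Mbps link with RTT = 100 msec) are illustrative assumptions. O and S are in bits, R in bits per second, RTT in seconds.

```python
import math


def static_window_latency(O, S, R, RTT, W):
    """Latency in seconds for a fixed congestion window of W segments.

    O: object size (bits), S: segment size (bits),
    R: link rate (bps), RTT: round-trip time (seconds).
    """
    K = math.ceil(O / (W * S))                   # windows that cover the object
    stall = max(S / R + RTT - W * S / R, 0.0)    # [x]+ = max(x, 0)
    return 2 * RTT + O / R + (K - 1) * stall


S = 536 * 8          # 536-byte segments, expressed in bits
O = 8 * S            # an object of exactly 8 segments
for W in (2, 4, 30):
    latency = static_window_latency(O, S, R=1e6, RTT=0.1, W=W)
    print(f"W = {W:2d}  latency = {latency:.4f} sec")
```

With W = 30 the window covers the whole object (and WS/R exceeds RTT + S/R), so the result is the lower bound 2 RTT + O/R. With W = 2 the server stalls between each of the four windows and the latency more than doubles.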

This completes our analysis of static windows. The analysis below for dynamic windows is more complicated, but parallels
the analysis for static windows.

3.7.3 Modeling Latency: Dynamic Congestion Window
We now investigate the latency for a file transfer when TCP's dynamic congestion window is in force. Recall that the server
first starts with a congestion window of one segment and sends one segment to the client. When it receives an
acknowledgement for the segment, it increases its congestion window to two segments and sends two segments to the client
(spaced apart by S/R seconds). As it receives the acknowledgements for the two segments, it increases the congestion
window to four segments and sends four segments to the client (again spaced apart by S/R seconds). The process continues,
with the congestion window doubling every RTT. A timing diagram for TCP is illustrated in Figure 3.7-7.

                                              Figure 3.7-7: TCP timing during slow start

Note that O/S is the number of segments in the object; in the above diagram, O/S =15. Consider the number of segments that
are in each of the windows. The first window contains 1 segment; the second window contains 2 segments; the third window
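contains 4 segments. More generally, the kth window contains 2^(k-1) segments. Let K be the number of windows that cover the
object; in the preceding diagram K=4. In general we can express K in terms of O/S as follows:

\[ K = \min\{k : 2^0 + 2^1 + \cdots + 2^{k-1} \ge O/S\} = \min\{k : 2^k - 1 \ge O/S\} = \lceil \log_2(O/S + 1) \rceil . \]
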
After transmitting a window's worth of data, the server may stall (i.e., stop transmitting) while it waits for an
acknowledgement. In the preceding diagram, the server stalls after transmitting the first and second windows, but not after
transmitting the third. Let us now calculate the amount of stall time after transmitting the kth window. The time from when
the server begins to transmit the kth window until when the server receives an acknowledgement for the first segment in the
window is S/R + RTT. The transmission time of the kth window is (S/R) 2^(k-1). The stall time is the difference of these two
quantities, that is,

\[ \left[ S/R + RTT - 2^{k-1}(S/R) \right]^{+} . \]

The server can potentially stall after the transmission of each of the first K-1 windows. (The server is done after the
transmission of the Kth window.) We can now calculate the latency for transferring the file. The latency has three
components: 2RTT for setting up the TCP connection and requesting the file; O/R, the transmission time of the object; and
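the sum of all the stalled times. Thus,

\[ \text{Latency} = 2\,RTT + \frac{O}{R} + \sum_{k=1}^{K-1} \left[ \frac{S}{R} + RTT - 2^{k-1}\frac{S}{R} \right]^{+} . \]
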
The reader should compare the above equation with the latency equation for static congestion windows; all the terms are
exactly the same except that the term WS/R for static windows has been replaced by 2^(k-1) S/R for dynamic windows. To obtain a
more compact expression for the latency, let Q be the number of times the server would stall if the object contained an
infinite number of segments:

\[ Q = \max\left\{k : RTT + \frac{S}{R} - 2^{k-1}\frac{S}{R} \ge 0\right\} = \left\lfloor \log_2\left(1 + \frac{RTT}{S/R}\right) \right\rfloor + 1 . \]

The actual number of times the server stalls is P = min{Q,K-1}. In the preceding diagram P=Q=2. Combining the above two
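equations gives

\[ \text{Latency} = 2\,RTT + \frac{O}{R} + \sum_{k=1}^{P} \left( \frac{S}{R} + RTT - 2^{k-1}\frac{S}{R} \right) . \]

We can further simplify the above formula for latency by noting

\[ \sum_{k=1}^{P} 2^{k-1} = 2^{P} - 1 . \]

Combining the above two equations gives the following closed-form expression for the latency:

\[ \text{Latency} = 2\,RTT + \frac{O}{R} + P\left[ RTT + \frac{S}{R} \right] - (2^{P} - 1)\frac{S}{R} . \]
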
Thus to calculate the latency, we simply must calculate K and Q, set P = min{Q,K-1}, and plug P into the above formula.
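
The recipe can be sketched in code. The closed form used here is Latency = 2 RTT + O/R + P[RTT + S/R] - (2^P - 1) S/R, with K = ceil(log2(O/S + 1)) and Q = floor(log2(1 + RTT/(S/R))) + 1, as derived from the stall time of each window. The function names are hypothetical, and the sample parameters are chosen only so that the result matches the timing diagram's O/S = 15, K = 4, P = Q = 2. O and S are in bits, R in bps, RTT in seconds.

```python
import math


def slow_start_latency(O, S, R, RTT):
    """Return (K, Q, P, latency) for the idealized slow-start model."""
    K = math.ceil(math.log2(O / S + 1))                # windows covering the object
    Q = math.floor(math.log2(1 + RTT / (S / R))) + 1   # stalls for an infinite object
    P = min(Q, K - 1)                                  # actual number of stalls
    latency = 2 * RTT + O / R + P * (RTT + S / R) - (2 ** P - 1) * S / R
    return K, Q, P, latency


def latency_by_summation(O, S, R, RTT):
    """Same latency, computed by summing the stall time after each window."""
    K = math.ceil(math.log2(O / S + 1))
    stalls = sum(max(S / R + RTT - 2 ** (k - 1) * S / R, 0.0)
                 for k in range(1, K))                 # windows 1 .. K-1
    return 2 * RTT + O / R + stalls


S = 536 * 8                       # 536-byte segments in bits
O = 15 * S                        # 15 segments, as in the timing diagram
K, Q, P, latency = slow_start_latency(O, S, R=100e3, RTT=0.1)
print(K, Q, P, round(latency, 5))
print(round(latency_by_summation(O, S, R=100e3, RTT=0.1), 5))
```

Computing the latency both ways is a cheap sanity check: the closed form and the explicit sum of stall times should agree whenever P = min{Q, K-1}.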

It is interesting to compare the TCP latency to the latency that would occur if there were no congestion control (that is, no
congestion window constraint). Without congestion control, the latency is 2RTT + O/R, which we define to be the Minimum
Latency. It is a simple exercise to show that

\[ \frac{\text{Latency}}{\text{Minimum Latency}} \le 1 + \frac{P}{(O/R)/RTT + 2} . \]

We see from the above formula that TCP slow start will not significantly increase latency if RTT << O/R, that is, if the round-
trip time is much less than the transmission time of the object. Thus, if we are sending a relatively large object over an
uncongested, high-speed link, then slow start has an insignificant effect on latency. However, with the Web we are often
transmitting many small objects over congested links, in which case slow start can significantly increase latency (as we shall
see in the following subsection).
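
For readers who want the exercise spelled out, one way to bound the ratio of Latency to Minimum Latency is sketched below. It starts from the closed form Latency = 2 RTT + O/R + P[RTT + S/R] - (2^P - 1) S/R, writes Minimum Latency = 2 RTT + O/R, and uses only the fact that 2^P - 1 is at least P.

```latex
\begin{align*}
\text{Latency}
  &= \text{Minimum Latency} + P\,RTT + \bigl(P - (2^{P} - 1)\bigr)\frac{S}{R} \\
  &\le \text{Minimum Latency} + P\,RTT
     \qquad \text{since } 2^{P} - 1 \ge P \text{ for every integer } P \ge 0, \\[4pt]
\frac{\text{Latency}}{\text{Minimum Latency}}
  &\le 1 + \frac{P\,RTT}{2\,RTT + O/R}
   = 1 + \frac{P}{(O/R)/RTT + 2}.
\end{align*}
```

The bound approaches 1 as (O/R)/RTT grows, which is the observation made above about RTT << O/R.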

Let us now take a look at some example scenarios. In all the scenarios we set S = 536 bytes, a common default value for
TCP. We shall use an RTT of 100 msec, which is not an atypical value for a continental or inter-continental delay over
moderately congested links. First consider sending a rather large object of size O = 100 Kbytes. The number of windows that
cover this object is K=8. For a number of transmission rates, the following chart examines the effect of the slow-start
mechanism on the latency.

          R               O/R          P     Minimum Latency (O/R + 2 RTT)     Latency with Slow Start
          28 Kbps         28.6 sec     1     28.8 sec                          28.9 sec
          100 Kbps        8 sec        2     8.2 sec                           8.4 sec
          1 Mbps          800 msec     5     1 sec                             1.4 sec
          10 Mbps         80 msec      7     .28 sec                           .93 sec

We see from the above chart that for a large object, slow-start adds appreciable delay only when the transmission rate is high.
If the transmission rate is low, then acknowledgments come back relatively quickly, and TCP quickly ramps up to its
maximum rate. For example, when R = 100 Kbps, the number of stall periods is P=2 whereas the number of windows to
transmit is K=8; thus the server stalls only after the first two of eight windows. On the other hand, when R = 10 Mbps, the
server stalls between each window, which causes a significant increase in the delay.

Now consider sending a small object of size O = 5 Kbytes. The number of windows that cover this object is K= 4. For a
number of transmission rates, the following chart examines the effect of the slow-start mechanism.

          R               O/R          P     Minimum Latency (O/R + 2 RTT)     Latency with Slow Start
          28 Kbps         1.43 sec     1     1.63 sec                          1.73 sec
          100 Kbps        .4 sec       2     .6 sec                            .757 sec
          1 Mbps          40 msec      3     .24 sec                           .52 sec
          10 Mbps         4 msec       3     .20 sec                           .50 sec

Once again slow start adds an appreciable delay when the transmission rate is high. For example, when R = 1Mbps the
server stalls between each window, which causes the latency to be more than twice that of the minimum latency.

For a larger RTT, the effect of slow start becomes significant for small objects even at smaller transmission rates. The following
chart examines the effect of slow start for RTT = 1 second and O = 5 Kbytes (K=4).

          R               O/R          P     Minimum Latency (O/R + 2 RTT)     Latency with Slow Start
          28 Kbps         1.43 sec     3     3.4 sec                           5.8 sec
          100 Kbps        .4 sec       3     2.4 sec                           5.2 sec
          1 Mbps          40 msec      3     2.0 sec                           5.0 sec
          10 Mbps         4 msec       3     2.0 sec                           5.0 sec

In summary, slow start can significantly increase latency when the object size is relatively small and the RTT is relatively
large. Unfortunately, this is often the scenario when sending objects over the World Wide Web.
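
Latency figures of this kind can be cross-checked without the closed-form expression by walking the timing diagram window by window. The sketch below does that under the section's simplifying assumptions (no loss, negligible header and ACK transmission times, an object made of whole segments). The function name `simulate_slow_start` and the 10-segment object, which is close to but not exactly 5 Kbytes, are illustrative assumptions.

```python
import math


def simulate_slow_start(O, S, R, RTT):
    """Latency in seconds, obtained by stepping through successive windows.

    O: object size (bits), S: segment size (bits),
    R: link rate (bps), RTT: round-trip time (seconds).
    """
    segments_left = math.ceil(O / S)
    t = 2 * RTT          # data starts flowing after a total of two RTTs
    window = 1           # slow start begins with one segment
    while True:
        burst = min(window, segments_left)
        finished = t + burst * S / R          # this window is fully transmitted
        segments_left -= burst
        if segments_left == 0:
            return finished
        first_ack = t + S / R + RTT           # ACK for the window's first segment
        t = max(finished, first_ack)          # stall only if the ACK is late
        window *= 2                           # window doubles every round


S = 536 * 8
O = 10 * S
for R in (28e3, 100e3, 1e6, 10e6):
    latency = simulate_slow_start(O, S, R, RTT=1.0)
    print(f"R = {R:>10.0f} bps   latency = {latency:.2f} sec")
```

Because the walk never uses K, Q or P, agreement with the closed-form expression gives independent confidence in both.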

An Example: HTTP

As an application of the latency analysis, let's now calculate the response time for a Web page sent over non-persistent
HTTP. Suppose that the page consists of one base HTML page and M referenced images. To keep things simple, let us
assume that each of the M+1 objects contains exactly O bits.

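With non-persistent HTTP, each object is transferred independently, one after the other. The response time of the Web page is
therefore the sum of the latencies for the individual objects. Thus

\[ \text{response time} = (M+1)\left\{ 2\,RTT + \frac{O}{R} + P\left[ RTT + \frac{S}{R} \right] - (2^{P} - 1)\frac{S}{R} \right\} . \]
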
Note that the response time for non-persistent HTTP takes the form:
          response time = (M+1)O/R + 2(M+1)RTT + latency due to TCP slow-start for each of the M+1 objects.
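
As a rough worked example, take the small-object scenario from the charts above (O = 5 Kbytes, R = 1 Mbps, RTT = 100 msec, per-object latency of about .52 sec against a minimum of .24 sec) and assume, purely for illustration, a page with M = 10 images:

```latex
\begin{align*}
\text{response time} &\approx (M+1) \times 0.52\ \text{sec} = 11 \times 0.52\ \text{sec} \approx 5.7\ \text{sec}, \\
(M+1)\left(O/R + 2\,RTT\right) &= 11 \times 0.24\ \text{sec} = 2.64\ \text{sec}.
\end{align*}
```

In this example slow start accounts for roughly 3 of the 5.7 seconds.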

Clearly if there are many objects in the Web page and if RTT is large, then non-persistent HTTP will have poor response-
time performance. In the homework problems we will investigate the response time for other HTTP transport schemes,
including persistent connections and non-persistent connections with parallel connections. The reader is also encouraged to
see [Heidemann 1997] for a related analysis.

TCP congestion control has enjoyed a tremendous amount of study and pampering since its original adoption in 1988. This is
not surprising as it is both an important and interesting topic. There is currently a large and growing literature on the subject.
Below we provide references for the citations in this section as well as references to some other important works.

[Ahn 1995] J.S. Ahn, P.B. Danzig, Z. Liu and Y. Yan, Experience with TCP Vegas: Emulation and Experiment, Proceedings
of ACM SIGCOMM '95, Boston, August 1995.
[Brakmo 1995] L. Brakmo and L. Peterson, TCP Vegas: End to End Congestion Avoidance on a Global Internet, IEEE
Journal of Selected Areas in Communications, 13(8):1465-1480, October 1995.
[Chiu 1989] D. Chiu and R. Jain, "Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer
Networks," Computer Networks and ISDN Systems, Vol. 17, pp. 1 - 14.
[Floyd 1991] S. Floyd, Connections with Multiple Congested Gateways in Packet-Switched Networks, Part 1: One-Way
Traffic, ACM Computer Communications Review, Vol. 21, No. 5, October 1991, pp. 30-47.
[Heidemann 1997] J. Heidemann, K. Obraczka and J. Touch, "Modeling the Performance of HTTP Over Several Transport
Protocols," IEEE/ACM Transactions on Networking, Vol. 5, No. 5, October 1997, pp. 616-630.
[Hoe 1996] J.C. Hoe, Improving the Start-up Behavior of a Congestion Control Scheme for TCP. Proceedings of ACM
SIGCOMM'96, Stanford, August 1996.
[Jacobson 1988] V. Jacobson, Congestion Avoidance and Control. Proceedings of ACM SIGCOMM '88. August 1988, p.
[Lakshman 1995] T.V. Lakshman and U. Madhow, "Performance Analysis of Window-Based Flow Control Using TCP/IP:
the Effect of High Bandwidth-Delay Products and Random Loss", IFIP Transactions C-26, High Performance Networking
V, North Holland, 1994, pp. 135-150.
[Mahdavi 1997] J. Mahdavi and S. Floyd, TCP-Friendly Unicast Rate-Based Flow Control. Unpublished note, January 1997.
[Nielsen 1997] H. F. Nielsen, J. Gettys, A. Baird-Smith, E. Prud'hommeaux, H.W. Lie, C. Lilley, Network Performance
Effects of HTTP/1.1, CSS1, and PNG, W3C Document, 1997 (also appeared in SIGCOMM' 97).
[RFC 793] "Transmission Control Protocol," RFC 793, September 1981.
[RFC 854] J. Postel and J. Reynolds, "Telnet Protocol Specifications," RFC 854, May 1983.
[RFC 1122] R. Braden, "Requirements for Internet Hosts -- Communication Layers," RFC 1122, October 1989.
[RFC 1323] V. Jacobson, S. Braden, D. Borman, "TCP Extensions for High Performance," RFC 1323, May 1992.

[RFC 2581] M. Allman, V. Paxson, W. Stevens, "TCP Congestion Control," RFC 2581, April 1999.
[Shenker 1990] S. Shenker, L. Zhang and D.D. Clark, "Some Observations on the Dynamics of a Congestion Control
Algorithm", ACM Computer Communications Review, 20(4), October 1990, pp. 30-39.
[Stevens 1994] W.R. Stevens, TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley, Reading, MA, 1994.
[Zhang 1991] L. Zhang, S. Shenker, and D.D. Clark, Observations on the Dynamics of a Congestion Control Algorithm: The
Effects of Two Way Traffic, ACM SIGCOMM '91, Zurich, 1991.

Copyright Keith W. Ross and James F. Kurose 1996-2000. All rights reserved.

                                               3.8 Summary
We began this chapter by studying the services that a transport layer protocol can provide to network
applications. At one extreme, the transport layer protocol can be very simple and offer a no-frills service
to applications, providing only the multiplexing/demultiplexing function for communicating processes.
The Internet's UDP protocol is an example of such a no-frills (and no-thrills, from the perspective of
someone interested in networking) transport-layer protocol. At the other extreme, a transport layer
protocol can provide a variety of guarantees to applications, such as reliable delivery of data, delay
guarantees and bandwidth guarantees. Nevertheless, the services that a transport protocol can provide are
often constrained by the service model of the underlying network-layer protocol. If the network layer
protocol cannot provide delay or bandwidth guarantees to transport-layer segments, then the transport
layer protocol cannot provide delay or bandwidth guarantees for the messages sent between processes.

We learned in Section 3.4 that a transport layer protocol can provide reliable data transfer even if the
underlying network layer is unreliable. We saw that providing reliable data transfer has many subtle
points, but that the task can be accomplished by carefully combining acknowledgments, timers,
retransmissions and sequence numbers.

Although we covered reliable data transfer in this chapter, we should keep in mind that reliable data
transfer can be provided by link, network, transport or application layer protocols. Any of the upper four
layers of the protocol stack can implement acknowledgments, timers, retransmissions and sequence
numbers and provide reliable data transfer to the layer above. In fact, over the years, engineers and
computer scientists have independently designed and implemented link, network, transport and
application layer protocols that provide reliable data transfer (although many of these protocols have
quietly disappeared).

In Section 3.5 we took a close look at TCP, the Internet's connection-oriented and reliable transport-layer
protocol. We learned that TCP is complex, involving connection management, flow control, round-trip
time estimation, as well as reliable data transfer. In fact, TCP is actually more complex than we made it
out to be -- we intentionally did not discuss a variety of TCP patches, fixes, and improvements that are
widely implemented in various versions of TCP. All of this complexity, however, is hidden from the
network application. If a client on one host wants to reliably send data to a server on another host, it
simply opens a TCP socket to the server and then pumps data into that socket. The client-server
application is oblivious to all of TCP's complexity.

In Section 3.6 we examined congestion control from a broad perspective, and in Section 3.7 we showed
how TCP implements congestion control. We learned that congestion control is imperative for the well-being of
the network. Without congestion control, a network can easily become gridlocked, with little or no data
being transported end-to-end. In Section 3.7 we learned that TCP implements an end-to-end congestion
control mechanism that additively increases its transmission rate when the TCP connection's path is
judged to be congestion-free, and multiplicatively decreases its transmission rate when loss occurs. This
mechanism also strives to give each TCP connection passing through a congested link an equal share of
the link bandwidth. We also examined in some depth the impact of TCP connection establishment and
slow start on latency. We observed that in many important scenarios, connection establishment and slow
start significantly contribute to end-to-end delay. We emphasize once more that TCP congestion control
has evolved over the years, remains an area of intensive research, and will likely continue to evolve in
the upcoming years.

In Chapter 1 we said that a computer network can be partitioned into the "network edge" and the
"network core". The network edge covers everything that happens in the end systems. Having now
covered the application layer and the transport layer, our discussion of the network edge is now
complete. It is time to explore the network core! This journey begins in the next chapter, where we'll
study the network layer, and continues into Chapter 5, where we'll study the link layer.

Copyright 1999. Keith W. Ross and James F. Kurose. All Rights Reserved.

              Homework Problems and Discussion Questions
                                                                 Chapter 3
Review Questions
Sections 3.1-3.3

1) Consider a TCP connection between host A and host B. Suppose that the TCP segments traveling from host A to host B have source port number x
and destination port number y. What are the source and destination port numbers for the segments traveling from host B to host A?

2) Describe why an application developer may choose to run its application over UDP rather than TCP.

3) Is it possible for an application to enjoy reliable data transfer even when the application runs over UDP? If so, how?

Section 3.5

4) True or False:

       a) Host A is sending host B a large file over a TCP connection. Assume host B has no data to send A. Host B will not send acknowledgements
       to host A because B cannot piggyback the acknowledgements on data?

       b) The size of the TCP RcvWindow never changes throughout the duration of the connection?

       c) Suppose host A is sending host B a large file over a TCP connection. The number of unacknowledged bytes that A sends cannot exceed the
       size of the receive buffer?

       d) Suppose host A is sending a large file to host B over a TCP connection. If the sequence number for a segment of this connection is m, then
       the sequence number for the subsequent segment will necessarily be m+1?

       e) The TCP segment has a field in its header for RcvWindow?

       f) Suppose that the last SampleRTT in a TCP connection is equal to 1 sec. Then Timeout for the connection will necessarily be set to a value
       >= 1 sec.

       g) Suppose host A sends host B one segment with sequence number 38 and 4 bytes of data. Then in this same segment the acknowledgement
       number is necessarily 42?

5) Suppose A sends two TCP segments back-to-back to B. The first segment has sequence number 90; the second has sequence number 110. a) How
much data is the first segment? b) Suppose that the first segment is lost, but the second segment arrives at B. In the acknowledgement that B sends to
A, what will be the acknowledgment number?

6) Consider the Telnet example discussed in Section 3.5. A few seconds after the user types the letter 'C' the user types the letter 'R'. After typing the
letter 'R' how many segments are sent and what is put in the sequence number and acknowledgement fields of the segments?

Section 3.7

7) Suppose two TCP connections are present over some bottleneck link of rate R bps. Both connections have a huge file to send (in the same
direction over the bottleneck link). The transmissions of the files start at the same time. What is the transmission rate that TCP would like to give to
each of the connections?

8) True or False: Consider congestion control in TCP. When a timer expires at the sender, the threshold is set to one half of its previous value?

1) Suppose client A initiates an FTP session with server S. At about the same time, client B also initiates an FTP session with server S. Provide
possible source and destination port numbers for:

       (a) the segments sent from A to S?
       (b) the segments sent from B to S?
       (c) the segments sent from S to A?
       (d) the segments sent from S to B?

(e) If A and B are different hosts, is it possible that the source port numbers in the segments from A to S are the same as those from B to S? (f) How
about if they are the same host?

2) UDP and TCP use 1's complement for their checksums. Suppose you have the following three 8-bit words: 01010101, 01110000, 11001100. What
is the 1's complement of the sum of these words? Show all work. Why is it that UDP takes the 1's complement of the sum, i.e., why not just use the
sum? With the 1's complement scheme, how does the receiver detect errors? Is it possible that a 1-bit error will go undetected? How about a 2-bit error?

3) Protocol rdt2.1 uses both ACK's and NAKs. Redesign the protocol, adding whatever additional protocol mechanisms are needed, for the case that
only ACK messages are used. Assume that packets can be corrupted, but not lost. Give the sender and receiver FSMs, and a trace of your protocol in
operation (using traces as in Figure \ref{fig57}). Show also how the protocol works in the case of no errors, and show how your protocol recovers
from channel bit errors.

4) Consider the following (incorrect) FSM for the receiver for protocol rdt2.1.

Show that this receiver, when operating with the sender shown in Figure 3.4-5 can lead the sender and receiver to enter into a deadlock state, where
each is waiting for an event that will never occur.

5) In protocol rdt3.0, the ACK packets flowing from the receiver to the sender do not have sequence numbers (although they do have an ACK field
that contains the sequence number of the packet they are acknowledging). Why is it that our ACK packets do not require sequence numbers?

6) Draw the FSM for the receiver side of protocol rdt3.0.

7) Give a trace of the operation of protocol rdt3.0 when data packets and acknowledgement packets are garbled. Your trace should be similar to that
used in Figure 3.4-9.

8) Consider a channel that can lose packets but has a maximum delay that is known. Modify protocol rdt2.1 to include sender timeout and
retransmit. Informally argue why your protocol can communicate correctly over this channel.

9) The sender side of rdt3.0 simply ignores (i.e., takes no action on) all received packets which are either in error, or have the wrong value in the
acknum field of an acknowledgement packet. Suppose that in such circumstances, rdt3.0 were to simply retransmit the current data packet. Would the
protocol still work? (Hint: Consider what would happen in the case that there are only bit errors; no packet losses and no premature timeouts occur.
Consider how many times the nth packet is sent, in the limit as n approaches infinity.)

10) Consider the cross-country example shown in Figure 3.4-10. How big would the window size have to be for the channel utilization to be greater
than 90%?

11) Design a reliable, pipelined, data transfer protocol that uses only negative acknowledgements. How quickly will your protocol respond to lost
packets when the arrival rate of data to the sender is low? Is high?

12) Consider transferring an enormous file of L bytes from host A to host B. Assume an MSS of 1460 bytes.

       a) What is the maximum value of L such that TCP sequence numbers are not exhausted? Recall that the TCP sequence number field has four bytes.

       b) For the L you obtain in (a), find how long it takes to transmit the file. Assume that a total of 66 bytes of transport, network and data-link
       header are added to each segment before the resulting packet is sent out over a 10 Mbps link. Ignore flow control and congestion control, so
       A can pump out the segments back-to-back and continuously.

13) In Figure 3.5-5, we see that TCP waits until it has received three duplicate ACKs before performing a fast retransmit. Why do you think the TCP
designers chose not to perform a fast retransmit after the first duplicate ACK for a segment is received?

14) Consider the TCP procedure for estimating RTT. Suppose that x = .1. Let SampleRTT1 be the most recent sample RTT, let SampleRTT2 be the
next most recent sample RTT, etc. (a) For a given TCP connection, suppose 4 acknowledgements have been returned with corresponding sample
RTTs SampleRTT4, SampleRTT3, SampleRTT2, and SampleRTT1. Express EstimatedRTT in terms of the four sample RTTs. (b) Generalize your
formula for n sample round-trip times. (c) For the formula in part (b) let n approach infinity. Comment on why this averaging procedure is called an
exponential moving average.

15) Refer to Figure 3.7-3 that illustrates the convergence of TCP's additive increase, multiplicative decrease algorithm. Suppose that instead of a
multiplicative decrease, TCP decreased the window size by a constant amount. Would the resulting additive increase additive decrease converge to
an equal share algorithm? Justify your answer using a diagram similar to Figure 3.7-3.

16) Recall the idealized model for the steady-state dynamics of TCP. In the period of time from when the connection's window size varies from
(W*MSS)/2 to W*MSS, only one packet is lost (at the very end of the period). (a) Show that the loss rate is equal to

                                                         L = loss rate = 1/[(3/8)*W^2 - W/4] .

(b) Use the above result to show that if a connection has loss rate L, then its average bandwidth is approximately given by:

                                         average bandwidth of connection ~ 1.22 * MSS / (RTT * sqrt(L) ).

17) Consider sending an object of size O = 100 Kbytes from server to client. Let S=536 bytes and RTT=100msec. Suppose the transport protocol uses
static windows with window size W.

a) For a transmission rate of 28 Kbps, determine the minimum possible latency. Determine the minimum window size that achieves this latency.

b) Repeat (a) for 100 Kbps.

c) Repeat (a) for 1 Mbps.

d) Repeat (a) for 10 Mbps.

18) Suppose TCP increased its congestion window by two rather than by one for each received acknowledgement during slow start. Thus the first
window consists of one segment, the second of three segments, the third of nine segments, etc. For this slow-start procedure:

a) Express K in terms of O and S.

b) Express Q in terms of RTT, S and R.

c) Express latency in terms of P = min(K-1,Q), O, R and RTT.

19) Consider the case RTT = 1 second and O = 100 kBytes. Prepare a chart (similar to the charts in Section 3.5.2) that compares the minimum latency
(O/R + 2 RTT) with the latency with slow start for R=28Kbps, 100 Kbps, 1 Mbps and 10 Mbps.

20) True or False.

       a) If a Web page consists of exactly one object, then non-persistent and persistent connections have exactly the same response time?

       b) Consider sending one object of size O from server to browser over TCP. If O > S, where S is the maximum segment size, then the server
       will stall at least once?

       c) Suppose a Web page consists of 10 objects, each of size O bits. For persistent HTTP, the RTT portion of the response time is 20 RTT ?

       d) Suppose a Web page consists of 10 objects, each of size O bits. For non-persistent HTTP with 5 parallel connections, the RTT portion of the
       response time is 12 RTT ?

21) The analysis for dynamic windows in the text assumes that there is one link between server and client. Redo the analysis for T links between
server and client. Assume the network has no congestion, so the packets experience no queueing delays. The packets do experience a store-and-
forward delay, however. The definition of RTT is the same as that given in the section on TCP congestion control. (Hint: The time for the server to
send out the first segment until it receives the acknowledgement is TS/R + RTT.)

22) Recall the discussion at the end of Section 3.7.3 on the response time for a Web page. For the case of non-persistent connections, determine a
general expression for the fraction of the response time that is due to TCP slow start.

23) With persistent HTTP, all objects are sent over the same TCP connection. As we discussed in Chapter 2, one of the motivations behind persistent
HTTP (with pipelining) is to diminish the effects of TCP connection establishment and slow start on the response time for a Web page. In this
problem we investigate the response time for persistent HTTP. Assume that the client requests all the images at once, but only when it has
received the entire HTML base page. Let M+1 denote the number of objects and let O denote the size of each object.

       a) Argue that the response time takes the form (M+1)O/R + 3RTT + latency due to slow-start. Compare the contribution of the RTTs in this
       expression with that in non-persistent HTTP.
       b) Assume that K = log2(O/S+1) is an integer; thus, the last window of the base HTML file transmits an entire window's worth of segments,
       i.e., window K transmits 2^(K-1) segments. Let P' = min{Q,K'-1} and

       Note that K' is the number of windows that cover an object of size (M+1)O and P' is the number of stall periods when sending the large object
       over a single TCP connection. Suppose (incorrectly) the server can send the images without waiting for the formal request for the images from
       the client. Show that the response time is that of sending one large object of size (M+1)O:

       c) The actual response time for persistent HTTP is somewhat larger than the approximation. This is because the server must wait for a request
       for the images before sending the images. In particular, the stall time between the Kth and (K+1)st window is not [S/R + RTT - 2^(K-1)(S/R)]^+
       but is instead RTT. Show that

24) Consider the scenario of RTT = 100 msec, O = 5 Kbytes, and M= 10. Construct a chart that compares the response times for non-persistent and
persistent connections for 28 kbps, 100 kbps, 1 Mbps and 10 Mbps. Note that persistent HTTP has substantially lower response time than
non-persistent HTTP for all the transmission rates except 28 Kbps.

25) Repeat the above question for the case of RTT = 1 sec, O = 5 Kbytes , M= 10. Note that for these parameters, persistent HTTP gives a
significantly lower response time than non-persistent HTTP for all the transmission rates.

26) Consider now non-persistent HTTP with parallel TCP connections. Recall that browsers typically operate in this mode when using HTTP/1.0. Let
X denote the maximum number of parallel connections that the client (browser) is permitted to open. In this mode, the client first uses one TCP
connection to obtain the base HTML file. Upon receiving the base HTML file, the client establishes M/X sets of TCP connections, with each set
having X parallel connections. Argue that the total response time takes the form:

                                 response time = (M+1)O/R + 2(M/X+1) RTT + latency due to slow-start stalling.
Compare the contribution of the term involving RTT to that of persistent connections and non-persistent (non-parallel) connections.
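
For readers who want to experiment with the checksum arithmetic in Problem 2, the following is a minimal sketch of 1's complement addition with end-around carry. The function names and the `bits` parameter are illustrative choices, not taken from the text, and the sample words in the usage note are deliberately different from those in the problem.

```python
def ones_complement_sum(words, bits=8):
    """Add words using 1's complement addition (end-around carry)."""
    mask = (1 << bits) - 1
    total = 0
    for w in words:
        total += w & mask
        # Fold any carry out of the top bit back into the low-order bit.
        total = (total & mask) + (total >> bits)
    return total


def checksum(words, bits=8):
    """Return the 1's complement of the 1's complement sum of the words."""
    mask = (1 << bits) - 1
    return ~ones_complement_sum(words, bits) & mask


def verify(words, received_checksum, bits=8):
    """Receiver-side check: words plus checksum must sum to all 1s."""
    mask = (1 << bits) - 1
    return ones_complement_sum(list(words) + [received_checksum], bits) == mask
```

For example, `checksum([0b11110000, 0b00110011])` gives `0b11011011`, and `verify` on the same two words with that checksum returns `True`. Flipping a single bit in one of the words makes `verify` return `False`.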

Discussion Questions
1) Consider streaming stored audio. Does it make sense to run the application over UDP or TCP? Which one does RealNetworks use? Why? Are
there any other streaming stored audio products? Which transport protocol do they use and why?

Programming Assignment
In this programming assignment, you will be writing the sending and receiving transport-level code for implementing a
simple reliable data transfer protocol - for either the alternating bit protocol or a Go-Back-N protocol. This should be FUN since your implementation
will differ very little from what would be required in a real-world situation.

Since you presumably do not have standalone machines (with an OS that you can modify), your code will have to execute in a simulated
hardware/software environment. However, the programming interface provided to your routines (i.e., the code that would call your entities from
above (i.e., from layer 5) and from below (i.e., from layer 3)) is very close to what is done in an actual UNIX environment. (Indeed, the software
interfaces described in this programming assignment are much more realistic than the infinite loop senders and receivers that many textbooks
describe). Stopping and starting of timers are also simulated, and timer interrupts will cause your timer handling routine to be activated.

You can find full details of the programming assignment, as well as C code that you will need to create the simulated hardware/software environment
at
  Network Layer: Introduction and Service Models

     4.1 Introduction and Network Service Models
We saw in the previous chapter that the transport layer provides communication service between two processes running on
two different hosts. In order to provide this service, the transport layer relies on the services of the network layer, which
provides a communication service between hosts. In particular, the network-layer moves transport-layer segments from one
host to another. At the sending host, the transport layer segment is passed to the network layer. The network layer then
"somehow" gets the segment to the destination host and passes the segment up the protocol stack to the transport layer.
Exactly how the network layer moves a segment from the transport layer of an origin host to the transport layer of the
destination host is the subject of this chapter. We will see that unlike the transport layer, the network layer requires the
coordination of each and every host and router in the network. Because of this, network layer protocols are among the most
challenging (and therefore interesting!) in the protocol stack.

Figure 4.1-1 shows a simple network with two hosts (H1 and H2) and four routers (R1, R2, R3 and R4). The role of the
network layer in a sending host is to begin the packet on its journey to the receiving host. For example, if H1 is sending
to H2, the network layer in host H1 transfers these packets to its nearby router, R2. At the receiving host (e.g., H2), the
network layer receives the packet from its nearby router (in this case, R3) and delivers the packet up to the transport layer at
H2. The primary role of the routers is to "switch" packets from input links to output links. Note that the routers in Figure
4.1-1 are shown with a truncated protocol stack, i.e., with no upper layers above the network layer, since routers do not run
transport and application layer protocols such as those we examined in Chapters 2 and 3.

                                                    Figure 4.1-1: The network layer

The role of the network layer is thus deceptively simple -- to transport packets from a sending host to a receiving host. To
do so, three important network layer functions can be identified:

     *   Path Determination. The network layer must determine the route or path taken by packets as they flow from a
         sender to a receiver. The algorithms that calculate these paths are referred to as routing algorithms. A routing
         algorithm would determine, for example, whether packets from H1 to H2 flow along the path R2-R1-R3 or path
         R2-R4-R3 (or any other path between H1 and H2). Much of this chapter will focus on routing algorithms. In Section
         4.2 we will study the theory of routing algorithms, concentrating on the two most prevalent classes of routing
         algorithms: link state routing and distance vector routing. We will see that the complexity of routing algorithms
         grows considerably as the number of routers in the network increases. This motivates the use of hierarchical
         routing, a topic we cover in Section 4.3. In Section 4.8 we cover multicast routing -- the routing algorithms,
         switching function, and call setup mechanisms that allow a packet that is sent just once by a sender to be delivered to
         multiple destinations.
     *   Switching. When a packet arrives at the input to a router, the router must move it to the appropriate output link. For
         example, a packet arriving from host H1 to router R2 must be forwarded towards H2 either along the link from
         R2 to R1 or along the link from R2 to R4. In Section 4.6, we look inside a router and examine how a packet is
         actually switched (moved) from an input link to an output link.
     *   Call Setup. Recall that in our study of TCP, a three-way handshake was required before data actually flowed from
         sender to receiver. This allowed the sender and receiver to set up the needed state information (e.g., sequence
         number and initial flow control window size). In an analogous manner, some network layer architectures (e.g.,
         ATM) require that the routers along the chosen path from source to destination handshake with each other in order
         to set up state before data actually begins to flow. In the network layer, this process is referred to as call setup. The
         network layer of the Internet architecture does not perform any such call setup.
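
The path determination function described above can be illustrated with a small sketch that enumerates the loop-free paths between H1 and H2. The link list is an assumption inferred from the paths named in the text (R2-R1-R3 and R2-R4-R3); the figure itself may contain other links. The function names are illustrative. A real routing algorithm would go on to choose one path by some cost metric, which this sketch does not attempt.

```python
# Links of Figure 4.1-1 as inferred from the text (an assumption).
LINKS = [("H1", "R2"), ("R2", "R1"), ("R2", "R4"),
         ("R1", "R3"), ("R4", "R3"), ("R3", "H2")]


def build_graph(links):
    """Build an undirected adjacency map from a list of links."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    return graph


def simple_paths(graph, src, dst, path=None):
    """Yield every loop-free path from src to dst (depth-first search)."""
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for neighbor in graph.get(src, []):
        if neighbor not in path:  # never revisit a node on the current path
            yield from simple_paths(graph, neighbor, dst, path)


if __name__ == "__main__":
    for p in simple_paths(build_graph(LINKS), "H1", "H2"):
        print("-".join(p))
```

With the assumed links, the sketch finds exactly the two candidate paths mentioned in the text, one through R1 and one through R4.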

Before delving into the details of the theory and implementation of the network layer, however, let us first take the broader
view and consider what different types of service might be offered by the network layer.

4.1.1 Network Service Model
When the transport layer at a sending host transmits a packet into the network (i.e., passes it down to the network layer at
the sending host), can the transport layer count on the network layer to deliver the packet to the destination? When multiple
packets are sent, will they be delivered to the transport layer in the receiving host in the order in which they were sent?
Will the amount of time between the sending of two sequential packet transmissions be the same as the amount of time
between their reception? Will the network provide any feedback about congestion in the network? What is the abstract
view (properties) of the channel connecting the transport layer in the two hosts? The answers to these questions and others
are determined by the service model provided by the network layer. The network service model defines the characteristics
of end-to-end transport of data between one "edge" of the network and the other, i.e., between sending and receiving end systems.

Datagram or Virtual Circuit?

Perhaps the most important abstraction provided by the network layer to the upper layers is whether or not the network
layer uses virtual circuits (VCs). You may recall from Chapter 1 that a virtual-circuit packet network behaves much
like a telephone network, which uses "real circuits" as opposed to "virtual circuits". There are three identifiable phases in a
virtual circuit:

     *   VC setup. During the setup phase, the sender contacts the network layer, specifies the receiver address, and waits
         for the network to set up the VC. The network layer determines the path between sender and receiver, i.e., the series
         of links and switches through which all packets of the VC will travel. As discussed in Chapter 1, this typically
         involves updating tables in each of the packet switches in the path. During VC setup, the network layer may also
         reserve resources (e.g., bandwidth) along the path of the VC.
     *   Data transfer. Once the VC has been established, data can begin to flow along the VC.
     *   Virtual circuit teardown. This is initiated when the sender (or receiver) informs the network layer of its desire to
         terminate the VC. The network layer will then typically inform the end system on the other side of the network of
         the call termination, and update the tables in each of the packet switches on the path to indicate that the VC no
       longer exists.
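
The three phases above can be sketched with a toy model in which each packet switch keeps a table of the VCs passing through it. Everything here is hypothetical and for illustration only: the class and function names, the use of a single VC identifier along the whole path, and the choice of R2-R1-R3 as the path. Resource reservation and signaling messages are not modeled.

```python
class VCSwitch:
    """A packet switch that keeps per-VC state (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.vc_table = {}  # vc_id -> next hop on the VC's path

    def setup(self, vc_id, next_hop):
        self.vc_table[vc_id] = next_hop

    def forward(self, vc_id):
        # Data packets are switched on the VC identifier alone;
        # a KeyError here means no such VC was ever set up.
        return self.vc_table[vc_id]

    def teardown(self, vc_id):
        self.vc_table.pop(vc_id, None)


def setup_vc(switches, path, vc_id, receiver):
    """VC setup: install a table entry in every switch on the chosen path."""
    hops = path[1:] + [receiver]
    for name, next_hop in zip(path, hops):
        switches[name].setup(vc_id, next_hop)


def send(switches, first_switch, vc_id, receiver):
    """Data transfer: follow the table entries until the receiver is reached."""
    route, node = [], first_switch
    while node != receiver:
        route.append(node)
        node = switches[node].forward(vc_id)
    return route


def teardown_vc(switches, path, vc_id):
    """VC teardown: remove the entry from every switch on the path."""
    for name in path:
        switches[name].teardown(vc_id)
```

After `setup_vc(switches, ["R2", "R1", "R3"], 7, "H2")`, every packet sent on VC 7 follows R2, R1, R3, and a switch that is not on the path (R4, say) holds no state for the VC. After `teardown_vc`, no switch holds state for it.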

There is a subtle but important distinction between VC setup at the network layer and connection setup at the transport layer
(e.g., the TCP 3-way handshake we studied in Chapter 3). Connection setup at the transport layer only involves the two end
systems. The two end systems agree to communicate and together determine the parameters (e.g., initial sequence number,
flow control window size) of their transport level connection before data actually begins to flow on the transport level
connection. Although the two end systems are aware of the transport-layer connection, the switches within the network are
completely oblivious to it. On the other hand, with a virtual-circuit network layer, packet switches are involved in
virtual-circuit setup, and each packet switch is fully aware of all the VCs passing through it.

The messages that the end systems send to the network to indicate the initiation or termination of a VC, and the messages
passed between the switches to set up the VC (i.e., to modify switch tables), are known as signaling messages, and the
protocols used to exchange these messages are often referred to as signaling protocols. VC setup is shown pictorially in
Figure 4.1-2.

                                                  Figure 4.1-2: Virtual circuit service model

We mentioned in Chapter 1 that ATM uses virtual circuits, although virtual circuits in ATM jargon are called virtual
channels. Thus ATM packet switches receive and process VC setup and tear down messages, and they also maintain VC
state tables. Frame relay and X.25, which will be covered in Chapter 5, are two other networking technologies that use
virtual circuits.

With a datagram network layer, each time an end system wants to send a packet, it stamps the packet with the address of
the destination end system, and then pops the packet into the network. As shown in Figure 4.1-3, this is done without any
VC setup. Packet switches (called "routers" in the Internet) do not maintain any state information about VCs because there
are no VCs! Instead, packet switches route a packet towards its destination by examining the packet's destination address,
indexing a routing table with the destination address, and forwarding the packet in the direction of the destination. (As
discussed in Chapter 1, datagram routing is similar to routing ordinary postal mail.) Because routing tables can be modified
at any time, a series of packets sent from one end system to another may follow different paths through the network and
may arrive out of order. The Internet uses a datagram network layer.
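
The datagram forwarding just described can be sketched as follows. The routing tables are hypothetical entries for the routers of Figure 4.1-1, and the function name is illustrative. Each router consults only the packet's destination address, so a table change between two packets is enough to send them along different paths.

```python
# Hypothetical routing tables: router -> {destination address -> next hop}.
routing_tables = {
    "R2": {"H2": "R1"},
    "R1": {"H2": "R3"},
    "R4": {"H2": "R3"},
    "R3": {"H2": "H2"},
}


def forward(tables, first_router, destination):
    """Route one datagram hop by hop using only its destination address."""
    route, node = [], first_router
    while node != destination:
        route.append(node)
        node = tables[node][destination]  # index the table by destination
    return route


first = forward(routing_tables, "R2", "H2")
routing_tables["R2"]["H2"] = "R4"  # a routing update between two packets
second = forward(routing_tables, "R2", "H2")
print(first, second)
```

No router holds any per-flow state here. The first packet travels via R1 and the second via R4 only because R2's table entry changed in between.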

                                                  Figure 4.1-3: Datagram service model

You may recall from Chapter 1 that a packet-switched network typically offers either a VC service or a datagram service to
the transport layer, and not both services. For example, an ATM network offers only a VC service to the ATM transport
layer (more precisely, to the ATM adaptation layer), and the Internet offers only a datagram service to the transport layer.
The transport layer in turn offers services to communicating processes at the application layer. For example, TCP/IP
networks (such as the Internet) offer a connection-oriented service (using TCP) and a connectionless service (using UDP) to their
communicating processes.

An alternative terminology for VC service and datagram service is network-layer connection-oriented service and
network-layer connectionless service, respectively. Indeed, the VC service is a sort of connection-oriented service, as it
involves setting up and tearing down a connection-like entity, and maintaining connection state information in the packet
switches. The datagram service is a sort of connectionless service in that it doesn't employ connection-like entities. Both
sets of terminology have advantages and disadvantages, and both sets are commonly used in the networking literature. We
decided to use in this book the "VC service" and "datagram service" terminology for the network layer, and reserve the
"connection-oriented service" and "connectionless service" terminology for the transport layer. We believe this decision
will be useful in helping the reader delineate the services offered by the two layers.

The Internet and ATM Network Service Models

Network       Service      Bandwidth                 No Loss    Ordering            Timing          Congestion
Architecture  Model        Guarantee                 Guarantee                                      Indication
Internet      Best Effort  None                      None       Any order possible  Not maintained  None
ATM           CBR          Guaranteed constant rate  Yes        In order            Maintained      Congestion will not occur
ATM           VBR          Guaranteed rate           Yes        In order            Maintained      Congestion will not occur
ATM           ABR          Guaranteed minimum        None       In order            Not maintained  Indication provided
ATM           UBR          None                      None       In order            Not maintained  None

                                       Table 4.1-1: Internet and ATM Network Service Models

The key aspects of the service model of the Internet and ATM network architectures are summarized in Table 4.1-1. We
do not want to delve deeply into the details of the service models here (it can be quite "dry" and detailed discussions can be
found in the standards themselves [ATM Forum 1997]). A comparison between the Internet and ATM service models is,
however, quite instructive.

The current Internet architecture provides only one service model, the datagram service, which is also known as "best
effort service." From Table 4.1-1, it might appear that best effort service is a euphemism for "no service at all." With best
effort service, timing between packets is not guaranteed to be preserved, packets are not guaranteed to be received in the
order in which they were sent, nor is the eventual delivery of transmitted packets guaranteed. Given this definition, a
network which delivered no packets to the destination would satisfy the definition of best effort delivery service. (Indeed,
today's congested public Internet might sometimes appear to be an example of a network that does so!) As we will discuss
shortly, however, there are sound reasons for such a minimalist network service model. The Internet's best-effort only
service model is currently being extended to include so-called "integrated services" and "differentiated service." We will
cover these still evolving service models later in Chapter 6.

Let us next turn to the ATM service models. As noted in our overview of ATM in Chapter 1, there are two ATM standards
bodies (the ITU and the ATM Forum). Their network service model definitions contain only minor differences and we
adopt here the terminology used in the ATM Forum standards. The ATM architecture provides for multiple service models
(that is, each of the two ATM standards has multiple service models). This means that within the same network,
different connections can be provided with different classes of service.

Constant bit rate (CBR) network service was the first ATM service model to be standardized, probably reflecting the
fact that telephone companies were the early prime movers behind ATM, and CBR network service is ideally suited for
carrying real-time, constant-bit-rate, streaming audio (e.g., a digitized telephone call) and video traffic. The goal of CBR
service is conceptually simple -- to make the network connection look like a dedicated copper or fiber connection between
the sender and receiver. With CBR service, ATM cells are carried across the network in such a way that the end-end delay
experienced by a cell (the so-called cell transfer delay, CTD), the variability in the end-end delay (often referred to as
"jitter" or "cell delay variation," CDV), and the fraction of cells that are lost or delivered late (the so-called cell loss rate,
CLR) are guaranteed to be less than some specified values. Also, an allocated transmission rate (the peak cell rate, PCR) is
defined for the connection and the sender is expected to offer data to the network at this rate. The values for the PCR, CTD,
CDV, and CLR are agreed upon by the sending host and the ATM network when the CBR connection is first established.

A second conceptually simple ATM service class is Unspecified Bit Rate (UBR) network service. Unlike CBR service,
which guarantees rate, delay, delay jitter, and loss, UBR makes no guarantees at all other than in-order delivery of cells
(that is, cells that are fortunate enough to make it to the receiver). With the exception of in-order delivery, UBR service is
thus equivalent to the Internet best effort service model. As with the Internet best effort service model, UBR also provides
no feedback to the sender about whether or not a cell is dropped within the network. For reliable transmission of data over
a UBR network, higher layer protocols (such as those we studied in the previous chapter) are needed. UBR service might
be well suited for non-interactive data transfer applications such as email and newsgroups.

If UBR can be thought of as a "best effort" service, then Available Bit Rate (ABR) network service might best be
characterized as a "better" best effort service model. The two most important additional features of ABR service over UBR
service are:

     q   A minimum cell transmission rate (MCR) is guaranteed to a connection using ABR service. If, however, the
         network has enough free resources at a given time, a sender may actually be able to successfully send traffic at a
         higher rate than the MCR.

     q   Congestion feedback from the network. An ATM network provides feedback to the sender (in terms of a congestion
         notification bit, or a lower rate at which to send) that controls how the sender should adjust its rate between the
         MCR and some peak cell rate (PCR). ABR senders must decrease their transmission rates in accordance with such
         feedback.

ABR provides a minimum bandwidth guarantee, but on the other hand will attempt to transfer data as fast as possible (up to
the limit imposed by the PCR). As such, ABR is well suited for data transfer where it is desirable to keep the transfer
delays low (e.g., Web browsing).

The final ATM service model is Variable Bit Rate (VBR) network service. VBR service comes in two flavors (and in the
ITU specification, VBR-like service comes in four flavors -- perhaps indicating a service class with an identity crisis!). In
real-time VBR service, the acceptable cell loss rate, delay, and delay jitter are specified as in CBR service. However, the
actual source rate is allowed to vary according to parameters specified by the user to the network. The declared variability
in rate may be used by the network (internally) to more efficiently allocate resources to its connections, but in terms of the
loss, delay and jitter seen by the sender, the service is essentially the same as CBR service. While early efforts in defining
VBR service models were clearly targeted towards real-time services (e.g., as evidenced by the PCR, CTD, CDV and CLR
parameters), a second flavor of VBR service is now targeted towards non-real-time services and provides a cell loss rate
guarantee. An obvious question with VBR is what advantages it offers over CBR (for real-time applications) and over UBR
and ABR for non-real-time applications. Currently, there is not enough (any?) experience with VBR service to answer this
question.

An excellent discussion of the rationale behind various aspects of the ATM Forum's Traffic Management Specification 4.0
[ATM Forum 1996] for CBR, VBR, ABR and UBR service is [Garrett 1996].

4.1.2 Origins of Datagram and Virtual Circuit Service
The evolution of the Internet and ATM network service models reflects their origins. With the notion of a virtual circuit as
a central organizing principle, and an early focus on CBR services, ATM reflects its roots in the telephony world (which
uses "real circuits"). The subsequent definition of UBR and ABR service classes acknowledges the importance of the types
of data applications developed in the data networking community. Given the VC architecture and a focus on supporting
real-time traffic with guarantees about the level of received performance (even with data-oriented services such as ABR),
the network layer is significantly more complex than the best effort Internet. This, too, is in keeping with ATM's
telephony heritage. Telephone networks, by necessity, had their "complexity" within the network, since they were
connecting "dumb" end-system devices such as rotary telephones. (For those too young to know, a rotary phone is a
non-digital telephone with no buttons - only a dial.)

The Internet, on the other hand, grew out of the need to connect computers (i.e., more sophisticated end devices) together.
With sophisticated end-system devices, the Internet architects chose to make the network service model (best effort) as
simple as possible and to implement any additional functionality (e.g., reliable data transfer), as well as any new
application-level network services, at a higher layer, at the end systems. This inverts the model of the telephone network, with some
interesting consequences:

    q   The resulting network service model which made minimal (no!) service guarantees (and hence posed minimal
        requirements on the network layer) also made it easier to interconnect networks that used very different link layer
        technologies (e.g., satellite, Ethernet, fiber, or radio) which had very different characteristics (transmission rates,
        loss characteristics). We will address the interconnection of IP networks in detail in Section 4.4.
    q   As we saw in Chapter 2, applications such as email, the Web, and even a network-layer-centric service such as the
        DNS are implemented in hosts (servers) at the edge of the network. The ability to add a new service simply by
        attaching a host to the network and defining a new higher layer protocol (such as HTTP) has allowed new services
        such as the WWW to be adopted in a breathtakingly short period of time.

As we will see in Chapter 6, however, there is considerable debate in the Internet community about how the network layer
architecture must evolve in order to support real-time services such as multimedia. An interesting comparison of the
ATM and the proposed next generation Internet architecture is given in [Crowcroft 1995].


[ATM Forum 1996] ATM Forum, "Traffic Management 4.0," ATM Forum document af-tm-0056.0000.
[ATM Forum 1997] ATM Forum, "Technical Specifications: Approved ATM Forum Specifications."
[Crowcroft 1995] J. Crowcroft, Z. Wang, A. Smith, J. Adams, "A Comparison of the IETF and ATM Service Models,"
IEEE Communications Magazine, Nov./Dec. 1995, pp. 12 - 16. Compares the Internet Engineering Task Force int-serv
service model with the ATM service model.
[Garrett 1996] M. Garrett, "A Service Architecture for ATM: From Applications to Scheduling," IEEE Network Magazine,
May/June 1996, pp. 6 - 14. A thoughtful discussion of the ATM Forum's recent TM 4.0 specification of CBR, VBR,
ABR and UBR service.

Copyright Keith W. Ross and Jim Kurose, 1996-2000. All rights reserved.
  Point-to-Point Routing Algorithms

                                               4.2 Routing Principles
In order to transfer packets from a sending host to the destination host, the network layer must determine the path or route that the packets are
to follow. Whether the network layer provides a datagram service (in which case different packets between a given host-destination pair may
take different routes) or a virtual circuit service (in which case all packets between a given source and destination will take the same path), the
network layer must nonetheless determine the path for a packet. This is the job of the network layer routing protocol.

At the heart of any routing protocol is the algorithm (the "routing algorithm") that determines the path for a packet. The purpose of a routing
algorithm is simple: given a set of routers, with links connecting the routers, a routing algorithm finds a "good" path from source to
destination. Typically, a "good" path is one which has "least cost," but we will see that in practice, "real-world" concerns such as policy issues
(e.g., a rule such as "router X, belonging to organization Y should not forward any packets originating from the network owned by
organization Z") also come into play to complicate the conceptually simple and elegant algorithms whose theory underlies the practice of
routing in today's networks.

                                                     Figure 4.2-1: Abstract model of a network

The graph abstraction used to formulate routing algorithms is shown in Figure 4.2-1. (To view some graphs representing real network maps,
see [Dodge 1999]; for a discussion of how well different graph-based models model the Internet, see [Zegura 1997]). Here, nodes in the graph
represent routers - the points at which packet routing decisions are made - and the lines ("edges" in graph theory terminology) connecting
these nodes represent the physical links between these routers. A link also has a value representing the "cost" of sending a packet across the
link. The cost may reflect the level of congestion on that link (e.g., the current average delay for a packet across that link) or the physical
distance traversed by that link (e.g., a transoceanic link might have a higher cost than a terrestrial link). For our current purposes, we will
simply take the link costs as a given and won't worry about how they are determined.

Given the graph abstraction, the problem of finding the least cost path from a source to a destination requires identifying a series of links such
that:

    q   the first link in the path is connected to the source
    q   the last link in the path is connected to the destination
    q   for all i, the i-th and (i-1)-st links in the path are connected to the same node
    q   for the least cost path, the sum of the cost of the links on the path is the minimum over all possible paths between the source and
        destination. Note that if all link costs are the same, the least cost path is also the shortest path (i.e., the path crossing the smallest
        number of links between the source and the destination).

In Figure 4.2-1, for example, the least cost path between nodes A (source) and C (destination) is along the path ADEC. (We will find it
notationally easier to refer to the path in terms of the nodes on the path, rather than the links on the path).
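
The conditions above can be checked by brute force on a small graph. The following Python sketch enumerates every loop-free path between two nodes and keeps the cheapest. Figure 4.2-1 is not reproduced in this text, so the link costs below are an assumption, chosen to agree with the values worked out for this network in section 4.2.1. The names LINKS, path_cost and least_cost_path are purely illustrative.

```python
# ASSUMED link costs for the six-node network of Figure 4.2-1; the figure is
# not reproduced in the text, so treat these values as illustrative only.
LINKS = {
    ("A", "B"): 2, ("A", "C"): 5, ("A", "D"): 1,
    ("B", "C"): 3, ("B", "D"): 2, ("C", "D"): 3,
    ("C", "E"): 1, ("C", "F"): 5, ("D", "E"): 1, ("E", "F"): 2,
}

def build_graph(links):
    """Turn undirected (u, v) -> cost pairs into an adjacency map."""
    graph = {}
    for (u, v), cost in links.items():
        graph.setdefault(u, {})[v] = cost
        graph.setdefault(v, {})[u] = cost
    return graph

def path_cost(graph, path):
    """Sum the costs of the links joining consecutive nodes on the path."""
    return sum(graph[u][v] for u, v in zip(path, path[1:]))

def simple_paths(graph, source, dest, visited=None):
    """Yield every loop-free path from source to dest (exhaustive search)."""
    visited = (visited or []) + [source]
    if source == dest:
        yield visited
        return
    for neighbor in graph[source]:
        if neighbor not in visited:
            yield from simple_paths(graph, neighbor, dest, visited)

def least_cost_path(graph, source, dest):
    """Return the cheapest loop-free path by checking all of them."""
    return min(simple_paths(graph, source, dest),
               key=lambda p: path_cost(graph, p))

GRAPH = build_graph(LINKS)
best = least_cost_path(GRAPH, "A", "C")
print("".join(best), path_cost(GRAPH, best))
```

Under the assumed costs this selects the path ADEC. Exhaustive enumeration grows very quickly with the size of the network, which is why practical routing algorithms do not work this way.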

Classification of Routing Algorithms

As a simple exercise, try finding the least cost path from node A to node F, and reflect for a moment on how you calculated that path. If you are
like most people, you found the path from A to F by examining Figure 4.2-1, tracing a few routes from A to F, and somehow convincing
yourself that the path you had chosen was the least cost among all possible paths (Did you check all of the 12 possible paths between A and
F? Probably not!). Such a calculation is an example of a centralized routing algorithm. Broadly, one way in which we can classify routing
algorithms is according to whether they are global or decentralized:

    q   A global routing algorithm computes the least cost path between a source and destination using complete, global knowledge about the
        network. That is, the algorithm takes the connectivity between all nodes and all link costs as inputs. This then requires that the
        algorithm somehow obtain this information before actually performing the calculation. The calculation itself can be run at one site (a
        centralized global routing algorithm) or replicated at multiple sites. The key distinguishing feature here, however, is that a global
        algorithm has complete information about connectivity and link costs. In practice, algorithms with global state information are often
        referred to as link state algorithms, since the algorithm must be aware of the state (cost) of each link in the network. We will study a
        global link state algorithm in section 4.2.1.
    q   In a decentralized routing algorithm, the calculation of the least cost path is carried out in an iterative, distributed manner. No node
        has complete information about the costs of all network links. Instead, each node begins with only knowledge of the costs of its own
        directly attached links and then through an iterative process of calculation and exchange of information with its neighboring nodes (i.e.,
        nodes which are at the "other end" of links to which it itself is attached) gradually calculates the least cost path to a destination, or set
        of destinations. We will study a decentralized routing algorithm known as a distance vector algorithm in section 4.2.2. It is called a
        distance vector algorithm because a node never actually knows a complete path from source to destination. Instead, it only knows the
        direction (which neighbor) to which it should forward a packet in order to reach a given destination along the least cost path, and the
        cost of that path from itself to the destination.

A second broad way to classify routing algorithms is according to whether they are static or dynamic. In static routing algorithms, routes
change very slowly over time, often as a result of human intervention (e.g., a human manually editing a router's forwarding table). Dynamic
routing algorithms change the routing paths as the network traffic loads (and the resulting delays experienced by traffic) or topology change. A
dynamic algorithm can be run either periodically or in direct response to topology or link cost changes. While dynamic algorithms are more
responsive to network changes, they are also more susceptible to problems such as routing loops and oscillation in routes, issues we will
consider in section 4.2.2.

Only two types of routing algorithms are typically used in the Internet: a dynamic global link state algorithm, and a dynamic decentralized
distance vector algorithm. We cover these algorithms in sections 4.2.1 and 4.2.2, respectively. Other routing algorithms are surveyed briefly in
section 4.2.3.

4.2.1 A Link State Routing Algorithm
Recall that in a link state algorithm, the network topology and all link costs are known, i.e., available as input to the link state algorithm. In
practice this is accomplished by having each node broadcast the identities and costs of its attached links to all other routers in the network.
This link state broadcast [Perlman 1999] can be accomplished without the nodes having to initially know the identities of all other nodes
in the network. A node need only know the identities and costs to its directly-attached neighbors; it will then learn about the topology of the
rest of the network by receiving link state broadcasts from other nodes. (In Chapter 5, we will learn how a router learns the identities of its
directly attached neighbors.) The result of the nodes' link state broadcasts is that all nodes have an identical and complete view of the network.
Each node can then run the link state algorithm and compute the same set of least cost paths as every other node.

The link state algorithm we present below is known as Dijkstra's algorithm, named after its inventor (a closely related algorithm is Prim's
algorithm; see [Corman 1990] for a general discussion of graph algorithms). It computes the least cost path from one node (the source, which
we will refer to as A) to all other nodes in the network. Dijkstra's algorithm is iterative and has the property that after the kth iteration of the
algorithm, the least cost paths are known to k destination nodes, and among the least cost paths to all destination nodes, these k paths will have
the k smallest costs. Let us define the following notation:

    q   c(i,j): link cost from node i to node j. If nodes i and j are not directly connected, then c(i,j) = infty. We will assume for simplicity that
        c(i,j) equals c(j,i).
    q   D(v): the cost of path from the source node to destination v that has currently (as of this iteration of the algorithm) the least cost.
    q   p(v): previous node (neighbor of v) along current least cost path from source to v
    q   N: set of nodes whose shortest path from the source is definitively known

The link state algorithm consists of an initialization step followed by a loop. The number of times the loop is executed is equal to the number
of nodes in the network. Upon termination, the algorithm will have calculated the shortest paths from the source node to every other node in
the network.

Link State (LS) Algorithm:

1 Initialization:
2   N = {A}
3   for all nodes v
4     if v adjacent to A
5       then D(v) = c(A,v)
6       else D(v) = infty
8  Loop
9    find w not in N such that D(w) is a minimum
10   add w to N
11   update D(v) for all v adjacent to w and not in N:
12      D(v) = min( D(v), D(w) + c(w,v) )
13   /* new cost to v is either old cost to v or known
14    shortest path cost to w plus cost from w to v */
15 until all nodes in N
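
The pseudocode above translates almost line for line into Python. The sketch below is one possible rendering, not a definitive implementation. It assumes a connected network, keeps the document's names D, p and N, and uses link costs for Figure 4.2-1 that are an assumption (the figure is not reproduced in this text). The helper path_to, which walks the predecessor map p(v) backwards to recover a full path, is a hypothetical addition.

```python
import math

# ASSUMED link costs for Figure 4.2-1 (illustrative; the figure is not
# reproduced in the text). Costs are symmetric: c(i,j) == c(j,i).
LINKS = {
    ("A", "B"): 2, ("A", "C"): 5, ("A", "D"): 1,
    ("B", "C"): 3, ("B", "D"): 2, ("C", "D"): 3,
    ("C", "E"): 1, ("C", "F"): 5, ("D", "E"): 1, ("E", "F"): 2,
}

def cost(links, i, j):
    """c(i,j): cost of the direct link, or infinity if i and j are not adjacent."""
    return links.get((i, j), links.get((j, i), math.inf))

def link_state(nodes, links, source):
    """Compute D(v) and p(v) for every node v, following the LS pseudocode."""
    # Initialization: N = {source}; D(v) = c(source, v), or infinity.
    N = {source}
    D = {v: cost(links, source, v) for v in nodes if v != source}
    p = {v: source for v in D if D[v] < math.inf}
    # Loop until all nodes are in N.
    while len(N) < len(nodes):
        # Find w not in N such that D(w) is a minimum, then add w to N.
        w = min((v for v in D if v not in N), key=lambda v: D[v])
        N.add(w)
        # Update D(v) for all v adjacent to w and not in N.
        for v in D:
            if v not in N and D[w] + cost(links, w, v) < D[v]:
                D[v] = D[w] + cost(links, w, v)
                p[v] = w
    return D, p

def path_to(p, source, dest):
    """Walk the predecessor map backwards from dest to rebuild the path."""
    path = [dest]
    while path[-1] != source:
        path.append(p[path[-1]])
    return path[::-1]

NODES = ["A", "B", "C", "D", "E", "F"]
D, p = link_state(NODES, LINKS, "A")
print(D)
print(path_to(p, "A", "F"))
```

Ties for the minimum are broken by iteration order here, so nodes with equal cost may be added to N in a different order than in a hand calculation; the final costs are unaffected.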

As an example, let us consider the network in Figure 4.2-1 and compute the shortest path from A to all possible destinations. A tabular
summary of the algorithm's computation is shown in Table 4.2-1, where each line in the table gives the values of the algorithm's variables at
the end of the iteration. Let us consider the first few steps in detail:

step         N                D(B),p(B)        D(C),p(C)        D(D),p(D)        D(E),p(E)        D(F),p(F)
0            A                2,A              5,A              1,A              infty            infty
1            AD               2,A              4,D                               2,D              infty
2            ADE              2,A              3,E                                                4,E
3            ADEB                              3,E                                                4,E
4            ADEBC                                                                                4,E
5            ADEBCF
                                    Table 4.2-1: Steps in running the link state algorithm on network in Figure 4.2-1

      q   In the initialization step, the currently known least path costs from A to its directly attached neighbors, B, C and D are initialized to 2,
          5 and 1 respectively. Note in particular that the cost to C is set to 5 (even though we will soon see that a lesser cost path does indeed
          exist) since this is the cost of the direct (one hop) link from A to C. The costs to E and F are set to infinity since they are not directly
          connected to A.
      q   In the first iteration, we look among those nodes not yet added to the set N and find that node with the least cost as of the end of the
          previous iteration. That node is D, with a cost of 1, and thus D is added to the set N. Line 12 of the LS algorithm is then performed to
          update D(v) for all nodes v, yielding the results shown in the second line (step 1) in Table 4.2-1. The cost of the path to B is
          unchanged. The cost of the path to C (which was 5 at the end of the initialization) through node D is found to have a cost of 4. Hence
          this lower cost path is selected and C's predecessor along the shortest path from A is set to D. Similarly, the cost to E (through D) is
          computed to be 2, and the table is updated accordingly.
      q   In the second iteration, nodes B and E are found to have the shortest path costs (2), and we break the tie arbitrarily and add E to the
          set N so that N now contains A, D, and E. The cost to the remaining nodes not yet in N, i.e., nodes B, C and F, are updated via line 12
          of the LS algorithm, yielding the results shown in the third row in the above table.
      q   and so on ...

When the LS algorithm terminates, we have for each node, its predecessor along the least cost path from the source node. For each
predecessor, we also have its predecessor, and so in this manner we can construct the entire path from the source to all destinations.

What is the computational complexity of this algorithm? That is, given n nodes (not counting the source), how much computation must be done
in the worst case to find the least cost paths from the source to all destinations? In the first iteration, we need to search through all n nodes to
determine the node, w, not in N that has the minimum cost. In the second iteration, we need to check n-1 nodes to determine the minimum
cost; in the third iteration n-2 nodes and so on. Overall, the total number of nodes we need to search through over all the iterations is
n*(n+1)/2, and thus we say that the above implementation of the link state algorithm has worst case complexity of order n squared: O(n^2). (A
more sophisticated implementation of this algorithm, using a data structure known as a heap, can find the minimum in line 9 in logarithmic
rather than linear time, thus reducing the complexity).
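
As a rough illustration of the heap-based variant, the sketch below uses Python's standard heapq module. It is one common way to organize the computation, not necessarily the implementation the text has in mind: instead of updating entries in place, it pushes a new entry whenever a cost improves and skips stale entries when they are popped. The adjacency map GRAPH encodes assumed link costs for Figure 4.2-1, and the name link_state_heap is hypothetical.

```python
import heapq
import math

# ASSUMED link costs for Figure 4.2-1, stored as node -> {neighbor: cost}.
GRAPH = {
    "A": {"B": 2, "C": 5, "D": 1},
    "B": {"A": 2, "C": 3, "D": 2},
    "C": {"A": 5, "B": 3, "D": 3, "E": 1, "F": 5},
    "D": {"A": 1, "B": 2, "C": 3, "E": 1},
    "E": {"C": 1, "D": 1, "F": 2},
    "F": {"C": 5, "E": 2},
}

def link_state_heap(graph, source):
    """Least cost from source to every node, using a binary heap for the minimum."""
    D = {v: math.inf for v in graph}
    D[source] = 0
    heap = [(0, source)]      # entries are (tentative cost, node)
    done = set()              # plays the role of the set N
    while heap:
        d, w = heapq.heappop(heap)    # minimum in logarithmic time
        if w in done:
            continue                  # stale entry: w was finalized earlier
        done.add(w)
        for v, c in graph[w].items():
            if v not in done and d + c < D[v]:
                D[v] = d + c
                heapq.heappush(heap, (D[v], v))
    return D

print(link_state_heap(GRAPH, "A"))
```

With this organization each link causes at most one push per direction, so the work is roughly proportional to the number of links times the logarithm of the heap size, rather than to n squared.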

Before completing our discussion of the LS algorithm, let us consider a pathology that can arise with the use of link state routing. Figure 4.2-2
shows a simple network topology where link costs are equal to the load carried on the link, e.g., reflecting the delay that would be experienced.
In this example, link costs are not symmetric, i.e., c(A,B) equals c(B,A) only if the load carried in both directions on the AB link is the
same. In this example, node D originates a unit of traffic destined for A, node B also originates a unit of traffic destined for A, and node C
injects an amount of traffic equal to e, also destined for A. The initial routing is shown in Figure 4.2-2a, with the link costs corresponding to
the amount of traffic carried.

                                                Figure 4.2-2: Oscillations with Link State routing

When the LS algorithm is next run, node C determines (based on the link costs shown in Figure 4.2-2a) that the clockwise path to A has a cost
of 1, while the counterclockwise path to A (which it had been using) has a cost of 1+e. Hence C's least cost path to A is now clockwise.
Similarly, B determines that its new least cost path to A is also clockwise, resulting in the routing and path costs shown in
Figure 4.2-2b. When the LS algorithm is run next, nodes B, C and D all detect a zero cost path to A in the counterclockwise direction and all
route their traffic to the counterclockwise routes. The next time the LS algorithm is run, B, C, and D all then route their traffic to the clockwise
routes.

What can be done to prevent such oscillations in the LS algorithm? One solution would be to mandate that link costs not depend on the
amount of traffic carried -- an unacceptable solution since one goal of routing is to avoid highly congested (e.g., high delay) links. Another
solution is to ensure that not all routers run the LS algorithm at the same time. This seems a more reasonable solution, since we would hope
that even if routers run the LS algorithm with the same periodicity, the execution instants of the algorithm would not be the same at each
node. Interestingly, researchers have recently noted that routers in the Internet can self-synchronize among themselves [Floyd 1994], i.e.,
even though they initially execute the algorithm with the same period but at different instants of time, the algorithm execution instants can
eventually become, and remain, synchronized at the routers. One way to avoid such self-synchronization is to purposefully introduce
randomization into the period between execution instants of the algorithm at each node.
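
A minimal sketch of such randomization is shown below. The function name next_interval, the 30-second period and the 25 percent jitter are all hypothetical values chosen for illustration; the text does not prescribe any particular distribution or range.

```python
import random

def next_interval(period, jitter=0.25, rng=random):
    """Return a delay drawn uniformly from [period*(1-jitter), period*(1+jitter)]."""
    return rng.uniform(period * (1 - jitter), period * (1 + jitter))

# Each router would draw its own delay before every run of the LS algorithm,
# so that execution instants drift apart instead of locking together.
rng = random.Random(1)        # seeded only to make the example repeatable
intervals = [next_interval(30.0, rng=rng) for _ in range(5)]
print(intervals)
```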

Having now studied the link state algorithm, let's next consider the other major routing algorithm that is used in practice today - the distance
vector routing algorithm.

4.2.2 A Distance Vector Routing Algorithm

While the LS algorithm is an algorithm using global information, the distance vector (DV) algorithm is iterative, asynchronous, and
distributed. It is distributed in that each node receives some information from one or more of its directly attached neighbors, performs a
calculation, and may then distribute the results of its calculation back to its neighbors. It is iterative in that this process continues on until no
more information is exchanged between neighbors. (Interestingly, we will see that the algorithm is self terminating -- there is no "signal" that
the computation should stop; it just stops). The algorithm is asynchronous in that it does not require all of the nodes to operate in lock step
with each other. We'll see that an asynchronous, iterative, self terminating, distributed algorithm is much more "interesting" and "fun" than a
centralized algorithm.

The principal data structure in the DV algorithm is the distance table maintained at each node. Each node's distance table has a row for each
destination in the network and a column for each of its directly attached neighbors. Consider a node X that is interested in routing to
destination Y via its directly attached neighbor Z. Node X's distance table entry, DX(Y,Z), is the sum of the cost of the direct one hop link
between X and Z, c(X,Z), plus neighbor Z's currently known minimum cost path from itself (Z) to Y. That is:

                                        DX(Y,Z) = c(X,Z) + minw {DZ(Y,w)}                                (4-1)

The minw term in equation 4-1 is taken over all of Z's directly attached neighbors (including X, as we shall soon see).

Equation 4-1 suggests the form of the neighbor-to-neighbor communication that will take place in the DV algorithm -- each node must know
the cost of each of its neighbors' minimum cost path to each destination. Thus, whenever a node computes a new minimum cost to some
destination, it must inform its neighbors of this new minimum cost.

Before presenting the DV algorithm, let's consider an example that will help clarify the meaning of entries in the distance table. Consider the
network topology and the distance table shown for node E in Figure 4.2-3. This is the distance table in node E once the DV algorithm has
converged. Let's first look at the row for destination A.

    q   Clearly the cost to get to A from E via the direct connection to A has a cost of 1. Hence DE(A,A) = 1.
    q   Let's now consider the value of DE(A,D) - the cost to get from E to A, given that the first step along the path is D. In this case, the
        distance table entry is the cost to get from E to D (a cost of 2) plus the minimum cost to get from D to A. Note that the
        minimum cost from D to A is 3 -- a path that passes right back through E! Nonetheless, we record the fact that the minimum cost from
        E to A given that the first step is via D has a cost of 5. We're left, though, with an uneasy feeling that the fact the path from E via D
        loops back through E may be the source of problems down the road (it will!).
    q   Similarly, we find that the distance table entry via neighbor B is DE(A,B) = 14. Note that the cost is not 15. (why?)
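
The three entries above are direct applications of equation 4-1, as the short Python sketch below shows. The costs c(E,A) = 1 and c(E,D) = 2 and D's minimum cost of 3 come from the text. The values c(E,B) = 8 and B's minimum cost of 6 are assumptions, since Figure 4.2-3 is not reproduced here; they were chosen only so that the sum matches the stated entry of 14. The function name distance_entry is hypothetical.

```python
def distance_entry(link_cost_to_neighbor, neighbor_min_cost):
    """Equation 4-1: DX(Y,Z) = c(X,Z) + minw DZ(Y,w)."""
    return link_cost_to_neighbor + neighbor_min_cost

# Node E's row for destination A.
link_cost = {"A": 1, "D": 2, "B": 8}       # c(E, neighbor); the B value is ASSUMED
min_cost_to_A = {"A": 0, "D": 3, "B": 6}   # neighbor's least cost to A; B value ASSUMED

row_A = {z: distance_entry(link_cost[z], min_cost_to_A[z]) for z in link_cost}
next_hop = min(row_A, key=row_A.get)       # neighbor with the smallest entry in the row
print(row_A, next_hop)
```

The smallest entry in the row identifies both the least cost from E to A and the neighbor through which that cost is achieved.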

                                                      Figure 4.2-3: A distance table example

A circled entry in the distance table gives the cost of the least cost path to the corresponding destination (row). The column with the circled
entry identifies the next node along the least cost path to the destination. Thus, a node's routing table (which indicates which outgoing link
should be used to forward packets to a given destination) is easily constructed from the node's distance table.

In discussing the distance table entries for node E above, we informally took a global view, knowing the costs of all links in the network. The
distance vector algorithm we will now present is decentralized and does not use such global information. Indeed, the only information a node
will have are the costs of the links to its directly attached neighbors, and information it receives from these directly attached neighbors. The
distance vector algorithm we will study is also known as the Bellman-Ford algorithm, after its inventors. It is used in many routing algorithms
in practice, including: Internet BGP, ISO IDRP, Novell IPX, and the original ARPAnet.

Distance Vector (DV) Algorithm. At each node, X:

1 Initialization:
2   for all adjacent nodes v:
3     DX(*,v) = infty          /* the * operator means "for all rows" */
4     DX(v,v) = c(X,v)
5   for all destinations, y
6     send minw DX(y,w) to each neighbor   /* w over all X's neighbors */
8 loop
9   wait (until I see a link cost change to neighbor V
10        or until I receive update from neighbor V)
12  if (c(X,V) changes by d)
13    /* change cost to all dest's via neighbor V by d */
14    /* note: d could be positive or negative */
15    for all destinations y: DX(y,V) = DX(y,V) + d
17  else if (update received from V wrt destination Y)
18    /* shortest path from V to some Y has changed */
19    /* V has sent a new value for its minw DV(Y,w) */
20    /* call this received new value "newval" */
21    for the single destination Y: DX(Y,V) = c(X,V) + newval
23  if we have a new minw DX(Y,w) for any destination Y
24    send new value of minw DX(Y,w) to all neighbors
26 forever

The key steps are lines 15 and 21, where a node updates its distance table entries in response to either a change of cost of an attached link or
the receipt of an update message from a neighbor. The other key step is line 24, where a node sends an update to its neighbors if its minimum
cost path to a destination has changed.

Figure 4.2-4 illustrates the operation of the DV algorithm for the simple three node network shown at the top of the figure. The operation of
the algorithm is illustrated in a synchronous manner, where all nodes simultaneously receive messages from their neighbors, compute new
distance table entries, and inform their neighbors of any changes in their new least path costs. After studying this example, you should
convince yourself that the algorithm operates correctly in an asynchronous manner as well, with node computations and update
generation/reception occurring at any time.

The circled distance table entries in Figure 4.2-4 show the current least path cost to a destination. An entry circled in red indicates that a new
minimum cost has been computed (in either line 4 of the DV algorithm (initialization) or line 21). In such cases an update message will be
sent (line 24 of the DV algorithm) to the node's neighbors as represented by the red arrows between columns in Figure 4.2-4.

                                              Figure 4.2-4: Distance Vector Algorithm: example

The leftmost column in Figure 4.2-4 shows the distance table entries for nodes X, Y, and Z after the initialization step.

Let us now consider how node X computes the distance table shown in the middle column of Figure 4.2-4 after receiving updates from nodes
Y and Z. As a result of receiving the updates from Y and Z, X computes in line 21 of the DV algorithm:

       DX(Y,Z) = c(X,Z) + minw DZ(Y,w)
               = 7      +   1
               = 8
       DX(Z,Y) = c(X,Y) + minw DY(Z,w)
               = 2      +   1
               = 3

It is important to note that the only reason that X knows about the terms minw DZ(Y,w) and minw DY(Z,w) is because nodes Z and Y
have sent those values to X (and they are received by X in line 10 of the DV algorithm). As an exercise, verify the distance tables computed by Y
and Z in the middle column of Figure 4.2-4.

The value DX(Z,Y) = 3 means that X's minimum cost to Z has changed from 7 to 3. Hence, X sends updates to Y and Z informing them
of this new least cost to Z. Note that X need not update Y and Z about its cost to Y since this has not changed. Note also that Y's
recomputation of its distance table in the middle column of Figure 4.2-4 does result in new distance entries, but does not result in a change of
Y's least cost path to nodes X and Z. Hence Y does not send updates to X and Z.

The process of receiving updated costs from neighbors, recomputing distance table entries, and informing neighbors of changed costs of
the least cost path to a destination continues until no update messages are sent. At this point, since no update messages are sent, no further
distance table calculations will occur and the algorithm enters a quiescent state, i.e., all nodes are performing the wait in line 9 of the DV
algorithm. The algorithm would remain in the quiescent state until a link cost changes, as discussed below.

The Distance Vector Algorithm: Link Cost Changes and Link Failure

When a node running the DV algorithm detects a change in the link cost from itself to a neighbor (line 12) it updates its distance table (line
15) and, if there is a change in the cost of the least cost path, updates its neighbors (lines 23 and 24). Figure 4.2-5 illustrates this behavior for
a scenario where the link cost from Y to X changes from 4 to 1. We focus here only on Y and Z's distance table entries to destination (row) X.

    q   At time t0, Y detects the link cost change (the cost has changed from 4 to 1) and informs its neighbors of this change since the cost of a
        minimum cost path has changed.
    q   At time t1, Z receives the update from Y and then updates its table. Since it computes a new least cost to X (it has decreased from a cost
        of 5 to a cost of 2), it informs its neighbors.
    q   At time t2, Y has received Z's update and has updated its distance table. Y's least costs have not changed (although its cost to X via Z
        has changed) and hence Y does not send any message to Z. The algorithm comes to a quiescent state.

                                              Figure 4.2-5: Link cost change: good news travels fast

In Figure 4.2-5, only two iterations are required for the DV algorithm to reach a quiescent state. The "good news" about the decreased cost
between X and Y has propagated fast through the network.

Let's now consider what can happen when a link cost increases. Suppose that the link cost between X and Y increases from 4 to 60.

                                      Figure 4.2-6: Link cost changes: bad news travels slow and causes loops

    q    At time t0 Y detects the link cost change (the cost has changed from 4 to 60). Y computes its new minimum cost path to X to have a
        cost of 6 via node Z. Of course, with our global view of the network, we can see that this new cost via Z is wrong. But the only
        information node Y has is that its direct cost to X is 60 and that Z has last told Y that Z could get to X with a cost of 5. So in order to
        get to X, Y would now route through Z, fully expecting that Z will be able to get to X with a cost of 5. As of t1 we have a routing loop
        -- in order to get to X, Y routes through Z, and Z routes through Y. A routing loop is like a black hole -- a packet arriving at Y or Z as
        of t1 will bounce back and forth between these two nodes forever ... or until the routing tables are changed.

    q   Since node Y has computed a new minimum cost to X, it informs Z of this new cost at time t1.
    q   Sometime after t1, Z receives the new least cost to X via Y (Y has told Z that Y's new minimum cost is 6). Z knows it can get to Y
        with a cost of 1 and hence computes a new least cost to X (still via Y) of 7. Since Z's least cost to X has increased, it then informs Y of
        its new cost at t2.
    q   In a similar manner, Y then updates its table and informs Z of a new cost of 8. Z then updates its table and informs Y of a new cost of
        9, etc.

How long will the process continue? You should convince yourself that the loop will persist for 44 iterations (message exchanges between Y
and Z) -- until Z eventually computes its path via Y to be larger than 50. At this point, Z will (finally!) determine that its least cost path to X
is via its direct connection to X. Y will then route to X via Z. The result of the "bad news" about the increase in link cost has indeed traveled
slowly! What would have happened if the link cost change of c(Y,X) had been from 4 to 10,000 and the cost c(Z,X) had been 9,999?
Because of such scenarios, the problem we have seen is sometimes referred to as the "count-to-infinity" problem.
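
The exchange between Y and Z can be replayed with a small simulation. The sketch below is a simplification, not the DV algorithm listing itself: it tracks only the two nodes' least costs to X, assumes the link costs used in the text (c(Y,Z) = 1, c(Z,X) = 50, and initial least costs of 4 and 5), and lets the two nodes take strict turns. The function name `simulate` and its parameters are hypothetical. How an "iteration" is counted is a matter of definition, so the message count it reports need not equal the figure quoted above.

```python
def simulate(new_cost_yx, cost_yz=1, cost_zx=50):
    """Replay updates about destination X between Y and Z until quiescent.

    Returns (Y's final cost to X, Z's final cost to X, update messages sent).
    """
    d_y, d_z = 4, 5        # least costs to X before the link cost changes
    heard_from_z = d_z     # last cost to X that Z advertised to Y
    heard_from_y = d_y     # last cost to X that Y advertised to Z
    messages = 0
    turn = "Y"             # Y is the node that detects the change
    while True:
        if turn == "Y":
            new = min(new_cost_yx, cost_yz + heard_from_z)
            changed = new != d_y
            d_y = new
            if changed:
                heard_from_y = d_y   # Y sends its new least cost to Z
        else:
            new = min(cost_zx, cost_yz + heard_from_y)
            changed = new != d_z
            d_z = new
            if changed:
                heard_from_z = d_z   # Z sends its new least cost to Y
        if not changed:              # nothing sent, so nothing further happens
            return d_y, d_z, messages
        messages += 1
        turn = "Z" if turn == "Y" else "Y"

print(simulate(1))    # link cost decreases from 4 to 1
print(simulate(60))   # link cost increases from 4 to 60
```

With the decrease, the sketch settles after two messages; with the increase, the advertised costs creep upward by one per message until Z's cost via Y exceeds the cost of its direct link to X.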

Distance Vector Algorithm: Adding Poisoned Reverse.

The specific looping scenario illustrated in Figure 4.2-6 can be avoided using a technique known as poisoned reverse. The idea is simple -- if
Z routes through Y to get to destination X, then Z will advertise to Y that its (Z's) distance to X is infinity. Z will continue telling this little
"white lie" to Y as long as it routes to X via Y. Since Y believes that Z has no path to X, Y will never attempt to route to X via Z, as long as
Z continues to route to X via Y (and lie about doing so).

                                                          Figure 4.2-7: Poisoned reverse

Figure 4.2-7 illustrates how poisoned reverse solves the particular looping problem we encountered before in Figure 4.2-6. As a result of the
poisoned reverse, Y's distance table indicates an infinite cost when routing to X via Z (the result of Z having informed Y that Z's cost to X was
infinity). When the cost of the XY link changes from 4 to 60 at time t0, Y updates its table and continues to route directly to X, albeit at a
higher cost of 60, and informs Z of this change in cost. After receiving the update at t1, Z immediately shifts its route to X to be via the direct
ZX link at a cost of 50. Since this is a new least cost to X, and since the path no longer passes through Y, Z informs Y of this new least cost
path to X at t2. After receiving the update from Z, Y updates its distance table to route to X via Z at a least cost of 51. Also, since Z is now on
Y's least path to X, Y poisons the reverse path from Z to X by informing Z at time t3 that it (Y) has an infinite cost to get to X. The algorithm
becomes quiescent after t4, with distance table entries for destination X shown in the rightmost column in Figure 4.2-7.
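
The sequence just described can be traced with a short sketch. It is an illustration under assumptions, not the book's algorithm listing: only destination X is tracked, the link costs are those of the example (c(Y,Z) = 1, c(Z,X) = 50, c(X,Y) changing from 4 to 60), the nodes take strict turns, and the helper names `advertise`, `best` and `poisoned_reverse_trace` are hypothetical.

```python
INF = float("inf")

def advertise(cost, next_hop, neighbor):
    """Cost to X that a node reports to `neighbor` under poisoned reverse."""
    return INF if next_hop == neighbor else cost

def best(direct_cost, neighbor, link_cost, heard):
    """Choose between the direct link to X and the path via `neighbor`."""
    via = link_cost + heard
    return (direct_cost, "X") if direct_cost <= via else (via, neighbor)

def poisoned_reverse_trace(new_cost_yx=60, cost_yz=1, cost_zx=50):
    y = (4, "X")                      # Y: (cost to X, next hop)
    z = (5, "Y")                      # Z routes to X via Y
    told_to_z = advertise(*y, "Z")    # what Y last told Z
    told_to_y = advertise(*z, "Y")    # infinity, because Z routes via Y
    log = []                          # (sender, advertised cost) per message
    turn = "Y"
    while True:
        if turn == "Y":
            y = best(new_cost_yx, "Z", cost_yz, told_to_y)
            msg = advertise(*y, "Z")
            if msg == told_to_z:
                break                 # nothing new to report
            told_to_z = msg
        else:
            z = best(cost_zx, "Y", cost_yz, told_to_z)
            msg = advertise(*z, "Y")
            if msg == told_to_y:
                break
            told_to_y = msg
        log.append((turn, msg))
        turn = "Z" if turn == "Y" else "Y"
    return y, z, log

y, z, log = poisoned_reverse_trace()
print(y, z, log)
```

In this sketch a node sends a message whenever the value it would advertise to its neighbor changes, which is what lets Y's final "infinite cost" message to Z appear in the log.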

Does poisoned reverse solve the general count-to-infinity problem? It does not. You should convince yourself that loops involving three or
more nodes (rather than simply two immediately neighboring nodes, as we saw in Figure 4.2-7) will not be detected by the poisoned reverse
technique.

A Comparison of Link State and Distance Vector Routing Algorithms

Let us conclude our study of link state and distance vector algorithms with a quick comparison of some of their attributes.

    q   Message Complexity. We have seen that LS requires each node to know the cost of each link in the network. This requires O(nE)
        messages to be sent, where n is the number of nodes in the network and E is the number of links. Also, whenever a link cost changes,
        the new link cost must be sent to all nodes. The DV algorithm requires message exchanges between directly connected neighbors at
        each iteration. We have seen that the time needed for the algorithm to converge can depend on many factors. When link costs change,
        the DV algorithm will propagate the results of the changed link cost only if the new link cost results in a changed least cost path for one
        of the nodes attached to that link.
    q   Speed of Convergence. We have seen that our implementation of LS is an O(n^2) algorithm requiring O(nE) messages, and that it can
        potentially suffer from oscillations. The DV algorithm can converge slowly (depending on the relative path costs, as we saw in Figure
        4.2-6) and can have routing loops while the algorithm is converging. DV also suffers from the count-to-infinity problem.
    q   Robustness. What can happen if a router fails, misbehaves, or is sabotaged? Under LS, a router could broadcast an incorrect cost for
        one of its attached links (but no others). A node could also corrupt or drop any LS broadcast packets it receives as part of link state
        broadcast. But an LS node is only computing its own routing tables; other nodes are performing similar calculations for
        themselves. This means route calculations are somewhat separated under LS, providing a degree of robustness. Under DV, a node can
        advertise incorrect least path costs to any/all destinations. (Indeed, in 1997 a malfunctioning router in a small ISP provided national
        backbone routers with erroneous routing tables. This caused other routers to flood the malfunctioning router with traffic, and caused
        large portions of the Internet to become disconnected for up to several hours [Neumann 1997].) More generally, we note that at each
        iteration, a node's calculation in DV is passed on to its neighbor and then indirectly to its neighbor's neighbor on the next iteration. In
        this sense, an incorrect node calculation can be diffused through the entire network under DV.

In the end, neither algorithm is a "winner" over the other; as we will see in Section 4.4, both algorithms are used in the Internet.

4.2.3 Other Routing Algorithms

The LS and DV algorithms we have studied are not only widely used in practice, they are essentially the only routing algorithms used in
practice today.

Nonetheless, many routing algorithms have been proposed by researchers over the past 30 years, ranging from the extremely simple to the
very sophisticated and complex. One of the simplest routing algorithms proposed is hot potato routing. The algorithm derives its name from
its behavior -- a router tries to get rid of (forward) an outgoing packet as soon as it can. It does so by forwarding it on any outgoing link that is
not congested, regardless of destination. Although initially proposed quite some time ago, interest in hot-potato-like routing has recently been
revived for routing in highly structured networks, such as the so-called Manhattan street network [Brassil 1994].

Another broad class of routing algorithms is based on viewing packet traffic as flows between sources and destinations in a network. In this
approach, the routing problem can be formulated mathematically as a constrained optimization problem known as a network flow problem
[Bertsekas 1991]. Let us define λ ij as the amount of traffic (e.g., in packets/sec) entering the network for the first time at node i and destined
for node j. The set of flows, {λ ij} for all i,j, is sometimes referred to as the network traffic matrix. In a network flow problem, traffic flows
must be assigned to a set of network links subject to constraints such as:

    q   the sum of the flows between all source destination pairs passing through link m must be less than the capacity of link m;
    q   the amount of λ ij traffic entering any router r (either from other routers, or directly entering that router from an attached host) must
        equal the amount of λ ij traffic leaving router r either via one of r's outgoing links or to an attached host at that router. This is a flow
        conservation constraint.

Let us define λ ijm as the amount of source i, destination j traffic passing through link m. The optimization problem then is to find the set of
link flows, {λ ijm} for all links m and all sources, i, and destinations, j, that satisfies the constraints above and optimizes a performance
measure that is a function of {λ ijm }. The solution to this optimization problem then defines the routing used in the network. For example, if
the solution to the optimization problem is such that λ ijm = λ ij for some link m, then all i-to-j traffic will be routed over link m. In
particular, if link m is attached to node i, then m is the first hop on the optimal path from source i to destination j.

But what performance function should be optimized? There are many possible choices. If we make certain assumptions about the size of
packets and the manner in which packets arrive at the various routers, we can use the so-called M/M/1 queueing theory formula [Kleinrock
1975] to express the average delay at link m as:

                                                             Dm = 1 / (Rm - ΣiΣj λ ijm),

where Rm is link m's capacity (measured in terms of the average number of packets/sec it can transmit) and ΣiΣj λ ijm is the total arrival rate of
packets (in packets/sec) that arrive to link m. The overall network wide performance measure to be optimized might then be the sum of all link
delays in the network, or some other suitable performance metric. A number of elegant distributed algorithms exist for computing the
optimum link flows (and hence determining the routing paths, as discussed above). The reader is referred to [Bertsekas 1991] for a
detailed study of these algorithms.
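
To make the performance measure concrete, the following sketch evaluates the M/M/1 delay formula for a given assignment of flows to links and sums the result over all links. It only evaluates the objective; it does not solve the optimization problem. The link names, capacities and flow values are invented for illustration, and `link_delay` is a hypothetical helper name.

```python
def link_delay(capacity, flows):
    """Average delay D_m = 1 / (R_m - total arrival rate) for one link.

    capacity: R_m in packets/sec.
    flows: mapping (source, destination) -> packets/sec carried on this link.
    """
    arrival_rate = sum(flows.values())
    if arrival_rate >= capacity:
        # The formula is only meaningful while the link is not overloaded.
        raise ValueError("arrival rate must stay below link capacity")
    return 1.0 / (capacity - arrival_rate)

# Hypothetical capacities and per-pair flows, in packets/sec.
capacity = {"m1": 100.0, "m2": 50.0}
flows = {
    "m1": {("A", "C"): 30.0, ("B", "C"): 20.0},
    "m2": {("A", "D"): 10.0},
}

delays = {m: link_delay(capacity[m], flows[m]) for m in capacity}
total_delay = sum(delays.values())   # one possible network-wide measure
print(delays, total_delay)
```

Because the delay grows without bound as the arrival rate approaches the capacity, an optimizer using this measure is pushed away from assignments that load any single link too heavily.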

The final set of routing algorithms we mention here are those derived from the telephony world. These circuit-switched routing algorithms are
of interest to packet-switched data networking in cases where per-link resources (e.g., buffers, or a fraction of the link bandwidth) are to be
reserved (i.e., set aside) for each connection that is routed over the link. While the formulation of the routing problem might appear quite
different from the least cost routing formulation we have seen in this chapter, we will see that there are a number of similarities, at least as far
as the path finding algorithm (routing algorithm) is concerned. Our goal here is to provide a brief introduction for this class of routing
algorithms. The reader is referred to [Ash 1998], [Ross 1995], [Girard 1990] for a detailed discussion of this active research area.

The circuit-switched routing problem formulation is illustrated in Figure 4.2-8. Each link has a certain amount of resources (e.g., bandwidth).
The easiest (and a quite accurate) way to visualize this is to consider the link to be a bundle of circuits, with each call that is routed over the
link requiring the dedicated use of one of the link's circuits. A link is thus characterized both by its total number of circuits, as well as the
number of these circuits currently in use. In Figure 4.2-8, all links except AB and BD have 20 circuits; the number to the left of the number of
circuits indicates the number of circuits currently in use.

                                                     Figure 4.2-8: Circuit-switched routing

Suppose now that a call arrives at node A, destined to node D. What path should it take? In shortest path first (SPF) routing, the shortest
path (least number of links traversed) is taken. We have already seen how the Dijkstra LS algorithm can be used to find shortest path routes.
In Figure 4.2-8, either the ABD or ACD path would thus be taken. In least loaded path (LLP) routing, the load at a link is defined as the
ratio of the number of used circuits at the link to the total number of circuits at that link. The path load is the maximum of the loads of all
links in the path. In LLP routing, the path taken is that with the smallest path load. In Figure 4.2-8, the LLP path is ABCD. In maximum
free circuit (MFC) routing, the number of free circuits associated with a path is the minimum of the number of free circuits at each of the
links on a path. In MFC routing, the path with the maximum number of free circuits is taken. In Figure 4.2-8 the path ABD would be taken with
MFC routing.
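
The three selection rules can be compared side by side in a short sketch. The circuit counts below are invented for illustration and are not the values shown in Figure 4.2-8; they were merely chosen so that links AB and BD have more circuits than the others. The names `links`, `paths` and the helper functions are likewise hypothetical.

```python
# (circuits in use, total circuits) per link; hypothetical values.
links = {
    ("A", "B"): (30, 100),
    ("B", "D"): (45, 100),
    ("A", "C"): (10, 20),
    ("C", "D"): (5, 20),
    ("B", "C"): (6, 20),
}
paths = ["ABD", "ACD", "ABCD"]   # candidate paths from A to D

def path_links(path):
    """Return the (used, total) pairs for each link along the path."""
    return [links[(a, b)] for a, b in zip(path, path[1:])]

def hop_count(path):
    return len(path) - 1

def path_load(path):
    """LLP metric: the largest used/total ratio over the path's links."""
    return max(used / total for used, total in path_links(path))

def free_circuits(path):
    """MFC metric: the smallest number of free circuits over the path's links."""
    return min(total - used for used, total in path_links(path))

fewest = min(hop_count(p) for p in paths)
spf = [p for p in paths if hop_count(p) == fewest]   # SPF may have ties
llp = min(paths, key=path_load)
mfc = max(paths, key=free_circuits)
print(spf, llp, mfc)
```

With these made-up numbers the three rules pick different paths, which shows that "best" depends on whether hop count, relative load, or absolute spare capacity is being optimized.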

Given these examples from the circuit switching world, we see that the path selection algorithms have much the same flavor as LS routing.
All nodes have complete information about the network's link states. Note however, that the potential consequences of old or inaccurate state
information are more severe with circuit-oriented routing -- a call may be routed along a path only to find that the circuits it had been
expecting to be allocated are no longer available. In such a case, the call setup is blocked and another path must be attempted. Nonetheless,
the main differences between connection-oriented, circuit-switched routing and connectionless packet-switched routing come not in the path
selection mechanism, but rather in the actions that must be taken when a connection is set up, or torn down, from source to destination.


[Ash 1998] G. R. Ash, Dynamic Routing in Telecommunications Networks, McGraw Hill, 1998.
[Bertsekas 1991] D. Bertsekas, R. Gallager, Data Networks, Prentice Hall, 1991.
[Brassil 1994] J. T. Brassil, A. K. Choudhury, N. F. Maxemchuk, "The Manhattan Street Network: A High Performance, Highly Reliable
Metropolitan Area Network," Computer Networks and ISDN Systems, Mar. 1994.
[Corman 1990] T. Corman, C. Leiserson, R. Rivest, Introduction to Algorithms, (The MIT Press, Cambridge,
[Dodge 1999] M. Dodge, "An Atlas of Cyberspaces,"
[Girard 1990] A. Girard, Routing and Dimensioning in Circuit-Switched Networks, Addison Wesley, 1990.
[Ross 1995] K.W. Ross, "Multiservice Loss Models for Broadband Telecommunications Networks," Springer-Verlag, 1995.
[Floyd 1994] S. Floyd, V. Jacobson, "Synchronization of Periodic Routing Messages," IEEE/ACM Transactions on Networking, Vol. 2 No. 2,
pp. 122-136, April 1994.
[Kleinrock 1975] L. Kleinrock, Queueing Systems: Theory, John Wiley and Sons, 1975.
[Neumann 1997] R. Neumann, "Internet Routing Black Hole," The Risks Digest: Forum on Risks to the Public in Computers and Related
Systems, Vol. 19, No. 12 (2-May-1997).
[Perlman 1999] R. Perlman, Interconnections, Second Edition: Bridges, Routers, Switches, and Internetworking Protocols (Addison-Wesley
Professional Computing Series), 1999.
[Zegura 1997] E. Zegura, K. Calvert, M. Donahoo, "A Quantitative Comparison of Graph-based Models for Internet Topology," IEEE/ACM
Transactions on Networking, Volume 5, No. 6, December 1997. See also for a software package that
generates networks with realistic structure.

Copyright Keith W. Ross and James F. Kurose, 1996-2000. All rights reserved.
 The Network Layer:hierarchical networks

                              4.3 Hierarchical Routing
In the previous section, we viewed "the network" simply as a collection of interconnected routers. One
router was indistinguishable from another in the sense that all routers executed the same routing
algorithm to compute routing paths through the entire network. In practice, this model and its view of a
homogeneous set of routers all executing the same routing algorithm is a bit simplistic for at least two
important reasons:

     q   Scale. As the number of routers becomes large, the overhead involved in computing, storing, and
         communicating the routing table information (e.g., link state updates or least cost path changes)
         becomes prohibitive. Today's public Internet consists of millions of interconnected routers and
         more than 50 million hosts. Storing routing table entries to each of these hosts and routers would
         clearly require enormous amounts of memory. The overhead required to broadcast link state
         updates among millions of routers would leave no bandwidth for sending the data packets! A
         distance vector algorithm that iterated among millions of routers would surely never converge!
         Clearly, something must be done to reduce the complexity of route computation in networks as
         large as the public Internet.
     q   Administrative autonomy. Although engineers tend to ignore issues such as a company's desire
         to run its routers as it pleases (e.g., to run whatever routing algorithm it chooses), or to "hide"
         aspects of the network's internal organization from the outside, these are important
         considerations. Ideally, an organization should be able to run and administer its network as it
         wishes, while still being able to connect its network to other "outside" networks.

Both of these problems can be solved by aggregating routers into "regions" or "autonomous systems"
(ASs). Routers within the same AS all run the same routing algorithm (e.g., an LS or DV algorithm) and
have full information about each other -- exactly as was the case in our idealized model in the previous
section. The routing algorithm running within an autonomous system is called an intra-autonomous
system routing protocol. It will be necessary, of course, to connect ASs to each other, and thus one or
more of the routers in an AS will have the added task of being responsible for routing packets to
destinations outside the AS. Routers in an AS that have the responsibility of routing packets to
destinations outside the AS are called gateway routers. In order for gateway routers to route packets
from one AS to another (possibly passing through multiple other ASs before reaching the destination
AS), the gateways must know how to route (i.e., determine routing paths) among themselves. The
routing algorithm that gateways use to route among the various ASs is known as an inter-autonomous
system routing protocol.

In summary, the problems of scale and administrative authority are solved by defining autonomous
systems. Within an AS, all routers run the same intra-autonomous system routing protocol. Special
gateway routers in the various ASs run an inter-autonomous system routing protocol that determines
routing paths among the ASs. The problem of scale is solved since an intra-AS router need only know
about routers within its AS and the gateway router(s) in its AS. The problem of administrative authority
is solved since an organization can run whatever intra-AS routing protocol it chooses, as long as the AS's
gateway(s) can run an inter-AS routing protocol that can connect the AS to other ASs.

                                           Figure 4.3-1: Intra-AS and Inter-AS routing.

Figure 4.3-1 illustrates this scenario. Here, there are three routing ASs, A, B and C. Autonomous system
A has four routers, A.a, A.b, A.c and A.d, which run the intra-AS routing protocol used within
autonomous system A. These four routers have complete information about routing paths within
autonomous system A. Similarly, autonomous systems B and C have three and two routers,
respectively. Note that the intra-AS routing protocols running in A, B and C need not be the same. The
gateway routers are A.a, A.c, B.a and C.b. In addition to running the intra-AS routing protocol in
conjunction with other routers in their ASs, these four routers run an inter-AS routing protocol among
themselves. The topological view they use for their inter-AS routing protocol is shown at the higher
level, with "links" shown in light gray. Note that a "link" at the higher layer may be an actual physical
link, e.g., the link connecting A.c and B.a, or a logical link, such as the link connecting A.c and A.a.
Figure 4.3-2 illustrates that the gateway router A.c must run an intra-AS routing protocol with its
neighbors A.b and A.d, as well as an inter-AS protocol with gateway router B.a.

                                  Figure 4.3-2: Internal architecture of gateway router A.c

Suppose now that a host h1 attached to router A.d needs to route a packet to destination h2 in
autonomous system B, as shown in Figure 4.3-3. Assuming that A.d's routing table indicates that router
A.c is responsible for routing its (A.d's) packets outside the AS, the packet is first routed from A.d to A.c
using A's intra-AS routing protocol. It is important to note that router A.d does not know about the
internal structure of autonomous systems B and C and indeed need not even know about the topology
connecting autonomous systems A, B and C. Router A.c will receive the packet and see that it is destined
to an autonomous system outside of A. A.c's routing table for the inter-AS protocol would indicate that a
packet destined to autonomous system B should be routed along the A.c to B.a link. When the packet
arrives at B.a, B.a's inter-AS routing protocol sees that the packet is destined for autonomous system B. The
packet is then "handed over" to the intra-AS routing protocol within B, which routes the packet to its
final destination, h2. In Figure 4.3-3, the portion of the path routed using A's intra-AS protocol is shown
in red, the portion using the inter-AS routing protocol is shown in blue, and the portion of the path routed
using B's intra-AS protocol is shown in green. We will examine specific inter-AS and intra-AS routing
protocols used in the Internet in Section 4.5.

                 Figure 4.3-3: The route from A.d to B.b: intra-AS and inter-AS path segments.
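
The hand-off between intra-AS and inter-AS routing along this path can be sketched as a pair of table lookups. The tables below are hypothetical: router names follow Figure 4.3-1, but the adjacency between B.a and B.b and the table layout are assumptions made only for illustration, and the destination is identified by the router it attaches to rather than by a host address.

```python
# Hypothetical forwarding state; the B.a-to-B.b adjacency is assumed.
INTRA_AS = {                       # router -> {router in same AS: next hop}
    "A.d": {"A.c": "A.c"},
    "B.a": {"B.b": "B.b"},
}
GATEWAY_FOR = {"A.d": "A.c"}       # gateway an interior router uses for outside traffic
INTER_AS = {"A.c": {"B": "B.a"}}   # gateway -> {destination AS: next-hop gateway}

def as_of(router):
    """Return the AS name, e.g. 'A' for router 'A.d'."""
    return router.split(".")[0]

def route(src, dst):
    """Return the hops from src to dst as (from, to, protocol used) tuples."""
    hops = []
    here = src
    while here != dst:
        if as_of(here) == as_of(dst):
            # Destination is inside this AS: intra-AS routing finishes the job.
            nxt, kind = INTRA_AS[here][dst], "intra-AS " + as_of(here)
        elif here in INTER_AS:
            # A gateway forwards toward the destination AS.
            nxt, kind = INTER_AS[here][as_of(dst)], "inter-AS"
        else:
            # An interior router only knows how to reach its own gateway.
            nxt, kind = INTRA_AS[here][GATEWAY_FOR[here]], "intra-AS " + as_of(here)
        hops.append((here, nxt, kind))
        here = nxt
    return hops

for hop in route("A.d", "B.b"):
    print(hop)
```

Note that A.d's tables contain no entry for any router in B; it needs only the identity of its gateway, which is the point of the hierarchy.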

Copyright Keith W. Ross and James F. Kurose, 1996-2000. All Rights Reserved.
  Point-toPoint Routing in the Internet

                                          4.4 Internet Protocol
So far in this chapter we have examined the underlying principles of the network layer. We have discussed the network layer
service models, including virtual circuit service and datagram service, the routing algorithms commonly used to determine paths
between origin and destination hosts, and how problems of scale are addressed with hierarchical routing. We are now going to
turn our attention to the Internet's network layer.

As we mentioned in Section 4.1, the Internet's network layer does not provide a virtual-circuit service, but instead a
connectionless datagram service. When the network layer at the sending host receives a segment from the transport layer, it
encapsulates the segment within an IP datagram, writes the destination address of the host (as well as other fields) on the
datagram, and drops the datagram into the network. As we mentioned in Chapter 1, this process is similar to a person writing a
letter, inserting the letter in an envelope, writing the destination address on the envelope, and dropping the envelope into a
mailbox. Neither the Internet's network layer nor the postal service makes any kind of preliminary contact with the destination
before moving its "parcel" to the destination. Furthermore, as discussed in Section 4.1, the network layer service is a best effort
service. It does not guarantee that the datagram will arrive within a certain time, it does not guarantee that a series of datagrams
will arrive in the same order sent; in fact, it does not even guarantee that the datagram will ever arrive at its destination.

As we discussed in Section 4.1, the network layer for a datagram network, such as the Internet, has two major components.
First, it has a network protocol component, which defines network-layer addressing, the fields in the datagram (i.e., the network
layer PDU), and how the end systems and routers act on these fields. The network protocol in the Internet is called the Internet
Protocol, or more commonly, the IP Protocol. There are currently two versions of the IP protocol in use. In this section
we examine the more widespread version, namely, Internet Protocol version 4, which is specified in [RFC 791] and which is
more commonly known as IPv4. In Section 4.7 we shall examine IPv6, which is expected to slowly replace IPv4 in the
upcoming years. The second major component of the network layer is the path determination component, which determines the
route a datagram follows from origin to destination. We study the path determination component in the next section.

4.4.1 IP Addressing
Before discussing IP addressing, we need to say a few words about hosts and routers. A host (also called an end system) has
one link into the network. When IP in the host wants to send a datagram, it passes the datagram to its link. The boundary
between the host and the link is called the interface. A router is fundamentally different from a host in that it has two or more
links that connect to it. When a router forwards a datagram, it forwards the datagram over one of its links. The boundary
between the router and any one of its links is also called an interface. Thus, a router has multiple interfaces, one for each of its
links. Because every interface (for a host or router) is capable of sending and receiving IP datagrams, IP requires each
interface to have an IP address.

Each IP address is 32 bits (equivalently, four bytes) long. IP addresses are typically written in so-called "dot-decimal
notation", whereby each byte of the address is written in its decimal form and is separated by a period. For example, a typical IP
address would be 193.32.216.9. The 193 is the decimal equivalent for the first 8 bits of the address; the 32 is the decimal
equivalent for the second 8 bits of the address, etc. Thus, the address 193.32.216.9 in binary notation is:

                                               11000001 00100000 11011000 00001001

(A space has been added between the bytes for visual purposes.) Because each IP address is 32 bits long, there are 2^32 possible IP
addresses.
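
The conversion between the two notations is mechanical, as the following sketch shows. The binary string shown above corresponds to the dotted-decimal address 193.32.216.9, which is used as the test value here; the function names `to_binary` and `to_dotted` are hypothetical.

```python
def to_binary(address):
    """Convert dot-decimal notation to four space-separated 8-bit groups."""
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError("expected four decimal bytes in the range 0-255")
    return " ".join(format(o, "08b") for o in octets)

def to_dotted(bits):
    """Convert four space-separated 8-bit groups back to dot-decimal notation."""
    return ".".join(str(int(group, 2)) for group in bits.split())

binary = to_binary("193.32.216.9")
print(binary)
print(to_dotted(binary))
print(2 ** 32)   # number of distinct 32-bit addresses
```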

                                             Figure 4.4-1: LANs are networks in IP jargon.

Figure 4.4-1 provides an example of IP addressing and interfaces. In this figure there is one router which interconnects three
LANs. (LANs, also known as local area networks, were briefly discussed in Chapter 1 and will be studied in detail in the next
chapter.) In the jargon of IP, each of these LANs is called an IP network or more simply a "network". There are several things
to observe from this diagram. First, the router has three interfaces, labeled 1, 2 and 3. Each of the router interfaces has its own
IP address, which are provided in Figure 4.4-2; each host also has its own interface and IP address. Second, all of the interfaces
attached to LAN 1, including a router interface, have an IP address of the form . Similarly, all the interfaces
attached to LAN 2 and LAN 3 have IP addresses of the form and, respectively. In other words, each
address has two parts: the first part (the first three bytes in this example) that specifies the network; and the second part (the last
byte in this example) that addresses a specific host on the network.

                                              Router Interface                       IP Address
                                            Figure 4.4-2: IP addresses for router interfaces.

The IP definition of a "network" is not restricted to a LAN. To get some insight here, let us now take a look at another example.
Figure 4.4-3 shows several LANs interconnected with three routers. All of the interfaces attached to LAN 1, including the
router R1 interface that is attached to LAN 1, have an IP address of the form Similarly, all the interfaces attached
to LAN 2 and to LAN 3 have the form and, respectively. Each of the three LANs again constitutes its
own network (i.e., IP network). But note that there are three additional "networks" in this example: one network for the
interfaces that connect Router 1 to Router 2; another network for the interfaces that connect Router 2 to Router 3; and a third
network for the interfaces that connect Router 3 to Router 1.

                                      Figure 4.4-3: An interconnected system consisting of six networks.

For a general interconnected system of routers and hosts (such as the Internet), we use the following recipe to define the
"networks" in the system. We first detach each router interface from its router and each host interface from its host. This creates
"islands" of isolated networks, with "interfaces" terminating all the leaves of the isolated networks. We then call each of these
isolated networks a network. Indeed, if we apply this procedure to the interconnected system in Figure 4.4-3, we get six
islands or "networks". The current Internet consists of millions of networks. (In the next chapter we will consider bridges. We
mention here that when applying this recipe, we do not detach interfaces from bridges. Thus each bridge lies within the interior
of some network.)
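
The recipe above can be read as a connected-components computation. The following Python sketch is one possible rendering under stated assumptions: the interface names, link names and the topology dictionary are invented for illustration (loosely modeled on three LANs and three router-to-router links), and bridges are modeled simply as pairs of link segments that stay joined.

```python
def find_networks(interface_segment, bridges=()):
    """Return the networks as a sorted list of sorted interface-name lists.

    interface_segment: dict mapping interface name -> link segment it plugs into.
    bridges: iterable of (segment_a, segment_b) pairs joined by a bridge.
    Routers and hosts never appear here: detaching their interfaces is
    exactly what separates the system into islands.
    """
    parent = {segment: segment for segment in interface_segment.values()}
    for a, b in bridges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)

    def find(segment):
        # Follow parent pointers to the island representative (path halving).
        while parent[segment] != segment:
            parent[segment] = parent[parent[segment]]
            segment = parent[segment]
        return segment

    # Interfaces are not detached from bridges, so bridged segments merge.
    for a, b in bridges:
        parent[find(a)] = find(b)

    islands = {}
    for interface, segment in interface_segment.items():
        islands.setdefault(find(segment), []).append(interface)
    return sorted(sorted(members) for members in islands.values())


# Hypothetical topology: three LANs plus three router-to-router links.
topology = {
    "hostA": "lan1", "hostB": "lan1", "R1.lan": "lan1",
    "hostC": "lan2", "R2.lan": "lan2",
    "hostD": "lan3", "R3.lan": "lan3",
    "R1.toR2": "link12", "R2.toR1": "link12",
    "R2.toR3": "link23", "R3.toR2": "link23",
    "R3.toR1": "link31", "R1.toR3": "link31",
}

print(len(find_networks(topology)))  # 6
```

Passing bridges=[("lan1", "lan2")] to the same function merges those two segments into a single network, which mirrors the remark that a bridge lies in the interior of a network.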

Now that we have defined a network, we are ready to discuss IP addressing in more detail. IP addresses are globally unique, that
is, no two interfaces in the world have the same IP address. Figure 4.4-4 shows the four possible formats of an IP address. (A
fifth format, beginning with 11110, is reserved for future use.) In general, each interface (for a host or router) belongs to a
network; the network part of the address identifies the network to which the interface belongs. The host part identifies the
specific interface within the network. (We would prefer to use the terminology "interface part of the address" rather than "host
part of the address" because an IP address is really for an interface and not for a host; but the terminology "host part" is commonly
used in practice.) For a class A address, the first 8 bits identify the network, and the last 24 bits identify the interface within that
network. Thus with class A we can have up to 2^7 networks (the first of the eight bits is fixed as 0), each with up to 2^24 interfaces.
Note that the interfaces in Figures 4.4-1 and 4.4-3 use class C addresses. The class B address space allows for 2^14 networks,
with up to 2^16 interfaces within each network. A class C address uses 21 bits to identify the network and leaves only 8 bits for
the interface identifier. Class D addresses are reserved for so-called multicast addresses. As we will see in Section 4.7, these
addresses do not identify a specific interface but rather provide a mechanism through which multiple hosts can receive a copy
of each single packet sent by a sender.

                                                   Figure 4.4-4: IPv4 address formats.
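
As a rough illustration of the address formats, the Python sketch below derives the class from the leading bits of the first byte and splits an address into its network part and interface part. The function names and the sample address 223.1.1.1 are assumptions made for this example. The bit counts returned are the full lengths of the network part, including the fixed class bits.

```python
def classify(address):
    """Return (class, network-part bits, interface-part bits) for a dotted address."""
    first_byte = int(address.split(".")[0])
    if first_byte < 128:       # leading bit   0
        return ("A", 8, 24)
    if first_byte < 192:       # leading bits  10
        return ("B", 16, 16)
    if first_byte < 224:       # leading bits  110
        return ("C", 24, 8)
    if first_byte < 240:       # leading bits  1110: multicast, no network/host split
        return ("D", None, None)
    return ("reserved", None, None)


def split_address(address):
    """Split a class A, B or C address into (network part, interface part)."""
    _, network_bits, _ = classify(address)
    if network_bits is None:
        raise ValueError(f"{address} has no network/interface split")
    parts = address.split(".")
    boundary = network_bits // 8
    return ".".join(parts[:boundary]), ".".join(parts[boundary:])


print(classify("223.1.1.1"))       # ('C', 24, 8)
print(split_address("223.1.1.1"))  # ('223.1.1', '1')
```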

Assigning Addresses

Having introduced IP addressing, one question that immediately comes to mind is how does a host get its own IP address? We
have just learned that an IP address has two parts, a network part and a host part. The host part of the address can be assigned in
several different ways, including:

     •   Manual configuration: The IP address is configured into the host (typically in a file) by the system administrator.
     •   Dynamic Host Configuration Protocol (DHCP) [RFC 2131]: DHCP is an extension of the BOOTP [RFC 1542]
         protocol, and is sometimes referred to as Plug and Play. With DHCP, a DHCP server in a network (e.g., in a LAN)
         receives DHCP requests from a client and, in the case of dynamic address allocation, allocates an IP address to the
         requesting client. DHCP is used extensively in LANs and in residential Internet access.

The network part of the address is the same for all the hosts in the network. To obtain the network part of the address for a
network, the network administrator might first contact the network's ISP, which would provide addresses from a larger block
of addresses that has already been allocated to the ISP. But how does an ISP get a block of addresses? IP addresses are
managed under the authority of the Internet Assigned Numbers Authority (IANA), under the guidelines set forth in [RFC
2050]. The actual assignment of addresses is now managed by regional Internet registries. As of mid-1998, there are three such
regional registries: the American Registry for Internet Numbers (ARIN), which handles registrations for North and South
America as well as parts of Africa, and which has recently taken over a number of the functions previously provided by Network
Solutions; the Réseaux IP Européens (RIPE), which covers Europe and nearby countries; and the Asia Pacific Network
Information Center (APNIC).

Before leaving our discussion of addressing, we want to mention that mobile hosts may change the network to which they are
attached, either dynamically while in motion or on a longer time scale. Because routing is to a network first, and then to a host
within the network, this means that the mobile host's IP address must change when the host changes networks. Techniques for
handling such issues are now under development within the IETF and the research community [RFC 2002] [RFC 2131].

4.4.2 The Big Picture: Transporting a Datagram from Source to Destination

Now that we have defined interfaces and networks, and that we have a basic understanding of IP addressing, we take a step
back and discuss how IP transports a datagram from source to destination. To this end, a high level view of an IP datagram is
shown in Figure 4.4-5. Note that every IP datagram has a destination address field and a source address field. The source host
fills the source address field with its own 32-bit IP address and fills the destination address field with the 32-bit IP address of
the host to which it wants to send the datagram. Note that these actions are analogous to what you do when you send a letter: on
the envelope of the letter, you provide a destination address and a return (source) address. The data field of the datagram is
typically filled with a TCP or UDP segment. We will discuss the remaining IP datagram fields a little later in this section.

                                            Figure 4.4-5: The key fields in an IP datagram.

Once the source host creates the IP datagram, how does the network layer transport the datagram from the source host to the
destination host? Let us answer this question in the context of the network in Figure 4.4-1. First suppose host A wants to send an IP
datagram to host B. The datagram is transported from host A to host B as follows. IP in host A first extracts the network
portion of the destination address, 223.1.1., and scans its routing table, which is shown in Figure 4.4-6. In this table, the "number of hops
to destination" is defined to be the number of networks that need to be traversed, including the destination network. Scanning
the table, host A finds a match in the first row, and observes that the number of hops to the destination is 1. This indicates to
host A that the destination host is on the same network. Host A then passes the IP datagram to the link layer protocol and
indicates to the link layer protocol that the destination is on the same LAN. The link layer protocol then has the responsibility of
transporting the datagram to host B. (We will study how the link layer transports a datagram between two interfaces on the same
network in the next chapter.)

                              destination network    next router    number of hops to destination
                              223.1.1.               -              1
                              223.1.2.                              2
                              223.1.3.                              2
                                                 Figure 4.4-6: Routing table in host A.
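
The lookup that host A performs on this table can be sketched in a few lines of Python. This is only an illustration under assumptions: the next-router address is not legible in the table, so a placeholder string stands in for the address of router interface 1, and the host addresses 223.1.1.2 and 223.1.2.2 are invented for the example.

```python
# Placeholder for the address of router interface 1 (not legible in the table).
ROUTER_INTERFACE_1 = "<address of router interface 1>"

# Rows of host A's table: (destination network, next router, number of hops).
HOST_A_TABLE = [
    ("223.1.1.", None, 1),
    ("223.1.2.", ROUTER_INTERFACE_1, 2),
    ("223.1.3.", ROUTER_INTERFACE_1, 2),
]


def next_hop(table, destination):
    """Return the IP address that the link layer should deliver the datagram to."""
    for network, next_router, hops in table:
        if destination.startswith(network):
            # One hop means the destination is on our own network: deliver directly.
            return destination if hops == 1 else next_router
    raise LookupError(f"no route to {destination}")


print(next_hop(HOST_A_TABLE, "223.1.1.2"))  # same LAN: delivered directly
print(next_hop(HOST_A_TABLE, "223.1.2.2"))  # other LAN: handed to the router
```

Matching on the dotted prefix, trailing dot included, is a simplification that works here because the network part is exactly the first three bytes.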

Now consider the more interesting case of host A sending an IP datagram to host E, whose IP address has the network portion
223.1.2. and which is therefore on a different LAN. Host A again scans its routing table, but now finds a match in the second row. Because the number of hops to
the destination is 2, host A knows that the destination is on another network. The routing table also tells host A that in order to
get the datagram to host E, host A should first send the datagram to the IP address of router interface 1. IP in host A then passes the
datagram down to the link layer, and indicates to the link layer that it should first send the datagram to that IP address. The
link layer then transports the datagram to router interface 1. The datagram is now in the router, and it is the job of the router to
move the datagram towards the datagram's ultimate destination. The router extracts the network portion of the destination
address of the IP datagram, namely 223.1.2., and scans its routing table, which is shown in Figure 4.4-7. The router finds a
match in the second row of the table. The table tells the router that the datagram should be forwarded on router interface 2; also
the number of hops to the destination is 1, which indicates to the router that the destination host is on the LAN directly attached
to interface 2. The router moves the datagram to interface 2. (The moving of a datagram from an input interface to an output
interface within a router will be covered in Section 4.6.) Once the datagram is at interface 2, the router passes the datagram to
the link layer protocol and indicates to the link layer protocol that the destination host is on the same LAN. The link layer protocol
has the job of transporting the datagram from router interface 2 to host E, both of which are attached to the same LAN.

                     destination network    next router    number of hops to destination    interface
                     223.1.1.               -              1                                1
                     223.1.2.               -              1                                2
                     223.1.3.               -              1                                3
                                                  Figure 4.4-7: Routing table in router.

In Figure 4.4-7, note that the entries in the "next router" column are all empty. This is because all of the networks (223.1.1.,
223.1.2., and 223.1.3.) are each directly attached to the router, that is, there is no need to go through an intermediate router to
get to the destination host. However, if host A and host E were separated by two routers, then within the routing table of the first
router along the path from A to E, the appropriate row would indicate 2 hops to the destination and would specify the IP address
of the second router along the path. The first router would then forward the datagram to the second router, using the link layer
protocol that connects the two routers. The second router then forwards the datagram to the destination host, using the link layer
protocol that connects the second router to the destination host.
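
A router's forwarding decision can be sketched the same way. In the Python illustration below, the first three rows follow the router table above. The fourth row, the network 223.1.4., the placeholder for the second router's address and the sample host addresses are all hypothetical, added only to show the two-router case just described.

```python
# Placeholder for the address of a hypothetical second router.
SECOND_ROUTER = "<address of second router>"

# Rows: (destination network, next router, number of hops, outgoing interface).
ROUTER_TABLE = [
    ("223.1.1.", None, 1, 1),
    ("223.1.2.", None, 1, 2),
    ("223.1.3.", None, 1, 3),
    ("223.1.4.", SECOND_ROUTER, 2, 3),  # hypothetical network behind a second router
]


def forward(table, destination):
    """Return (outgoing interface, address to hand to the link layer)."""
    for network, next_router, hops, interface in table:
        if destination.startswith(network):
            # An empty "next router" entry means the network is directly attached.
            target = destination if next_router is None else next_router
            return interface, target
    raise LookupError(f"no route to {destination}")


print(forward(ROUTER_TABLE, "223.1.2.2"))  # (2, '223.1.2.2')
print(forward(ROUTER_TABLE, "223.1.4.7"))  # sent out interface 3 towards the second router
```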

You may recall from Chapter 1 that we said that routing a datagram in the Internet is similar to a person driving a car and asking
gas station attendants at each intersection along the way how to get to the ultimate destination. It should now be clear why this
is an appropriate analogy for routing in the Internet. As a datagram travels from source to destination, it visits a series of routers.
At each router in the series, it stops and asks the router how to get to its ultimate destination. Unless the router is on the same
LAN as the ultimate destination, the routing table essentially says to the datagram: "I don't know exactly how to get to the
ultimate destination, but I do know that the ultimate destination is in the direction of the link (analogous to a road) connected to
interface 3." The datagram then sets out on the link connected to interface 3, arrives at a new router, and again asks for new
directions.

From this discussion we see that the routing tables in the routers play a central role in routing datagrams through the Internet.
But how are these routing tables configured and maintained for large networks with multiple paths between sources and
destinations (such as in the Internet)? Clearly, these routing tables should be configured so that the datagrams follow good (if
not optimal) routes from source to destination. As you probably guessed, routing algorithms - like those studied in Section 4.2 -
have the job of configuring and maintaining the routing tables. Furthermore, as discussed in Section 4.3, the Internet is
partitioned into autonomous systems (ASs): intra-AS routing algorithms independently configure the routing tables within the
autonomous systems; inter-AS routing algorithms have the job of configuring routing tables so that datagrams can pass through
multiple autonomous systems. We will discuss the Internet's intra-AS and inter-AS routing algorithms in Section 4.5. But before
moving on to routing algorithms, we cover three more important topics for the IP protocol, namely, the datagram format,
datagram fragmentation, and the Internet Control Message Protocol (ICMP).

4.4.3 Datagram Format
The IPv4 datagram format is shown in Figure 4.4-8.

                                                   Figure 4.4-8: IPv4 datagram format

The key fields in the IPv4 datagram are the following:

     •   Version Number: These 4 bits specify the IP protocol version of the datagram. By looking at the version number, the
         router can then determine how to interpret the remainder of the IP datagram. Different versions of IP use different
         datagram formats. The datagram format for the "current" version of IP, IPv4, is shown in Figure 4.4-8. The datagram
         format for the "new" version of IP (IPv6) is discussed in Section 4.7.
     •   Header Length: Because an IPv4 datagram can contain a variable number of options (which are included in the IPv4
         datagram header) these 4 bits are needed to determine where in the IP datagram the data actually begins. Most IP
         datagrams do not contain options so the typical IP datagram has a 20-byte header.
     •   TOS: The type of service (TOS) bits were included in the IPv4 header to allow different "types" of IP datagrams to be
         distinguished from each other, presumably so that they could be handled differently in times of overload. When the
         network is overloaded, for example, it would be useful to be able to distinguish network control datagrams (e.g., see the
         ICMP discussion in Section 4.4.5) from datagrams carrying data (e.g., HTTP messages). It would also be useful to
         distinguish real-time datagrams (e.g., used by an IP telephony application) from non-real-time traffic (e.g., FTP). More
         recently, one major routing vendor (Cisco) interprets the first three TOS bits as defining differential levels of service that
         can be provided by the router. The specific level of service to be provided is a policy issue determined by the router's
         administrator. We shall explore the topic of differentiated service in detail in Chapter 6.
     •   Datagram Length: This is the total length of the IP datagram (header plus data) measured in bytes. Since this field is
         16 bits long, the theoretical maximum size of the IP datagram is 65,535 bytes. However, datagrams are rarely greater
         than 1500 bytes, and are often limited in size to 576 bytes.
     •   Identifier, Flags, Fragmentation Offset: These three fields have to do with so-called IP fragmentation, a topic we will
         consider in depth shortly. Interestingly, the new version of IP, IPv6, simply does not allow for fragmentation at routers.
     •   Time-to-live: The time-to-live (TTL) field is included to ensure that datagrams do not circulate forever (due to, for
         example, a long-lived routing loop) in the network. This field is decremented by one each time the datagram is processed
         by a router. If the TTL field reaches 0, the datagram must be dropped.
     •   Protocol: This field is only used when an IP datagram reaches its final destination. The value of this field indicates the
         transport-layer protocol at the destination to which the data portion of this IP datagram will be passed. For example, a
         value of 6 indicates that the data portion is passed to TCP, while a value of 17 indicates that the data is passed to UDP.
         For a listing of all possible numbers, see [RFC 1700]. Note that the protocol number in the IP datagram has a role
         that is fully analogous to the role of the port number field in the transport-layer segment. The protocol number is the
         "glue" that holds the network and transport layers together, whereas the port number is the "glue" that holds the transport
         and application layers together. We will see in Chapter 5 that the link layer frame also has a special field which glues the
         link layer to the network layer.
     •   Header Checksum: The header checksum aids a router in detecting bit errors in a received IP datagram. The header
         checksum is computed by treating each 2 bytes in the header as a number and summing these numbers using 1's
         complement arithmetic. As discussed in Section 3.3, the 1's complement of this sum, known as the Internet checksum,
         is stored in the checksum field. A router computes the Internet checksum for each received IP datagram and detects an
         error condition if the checksum carried in the datagram does not equal the computed checksum. Routers typically discard
         datagrams for which an error has been detected. Note that the checksum must be recomputed and stored again at each router,
         as the TTL field, and possibly options fields as well, may change. An interesting discussion of fast algorithms for
         computing the Internet checksum is given in [RFC 1071]. A question often asked at this point is, why does TCP/IP perform error
         checking at both the transport and network layers? There are many reasons for this. First, routers are not required to
         perform error checking, so the transport layer cannot count on the network layer to do the job. Second, TCP/UDP and IP
         do not necessarily have to both belong to the same protocol stack. TCP can, in principle, run over a different protocol
         (e.g., ATM) and IP can carry data without passing through TCP/UDP (e.g., RIP data).
     •   Source and Destination IP Address: These fields carry the 32-bit IP address of the source and final destination for
         this IP datagram. The use and importance of the destination address is clear. The source IP address (along with the
         source and destination port numbers) is used at the destination host to direct the application data to the proper socket.
     •   Options: The options fields allow an IP header to be extended. Header options were meant to be used rarely
         -- hence the decision to save overhead by not including the information in options fields in every datagram header.
         However, the mere existence of options does complicate matters -- since datagram headers can be of variable length, one
         cannot determine a priori where the data field will start. Also, since some datagrams may require options processing and
         others may not, the amount of time needed to process an IP datagram can vary greatly. These considerations become
         particularly important for IP processing in high performance routers and hosts. For these reasons and others, IP options
         were dropped in the IPv6 header.
     •   Data (payload): Finally, we come to the last, and most important field - the raison d'être for the datagram in the first
         place! In most circumstances, the data field of the IP datagram contains the transport-layer segment (TCP or UDP) to be
         delivered to the destination. However, the data field can carry other types of data, such as ICMP messages (discussed in
         Section 4.4.5).
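
To make the field list concrete, here is a minimal Python sketch that decodes the fixed 20-byte part of an IPv4 header with the standard struct module. The function name, the dictionary keys and the sample header values (including the two addresses) are assumptions chosen for illustration. Two details come from the IPv4 specification rather than from the list above: the header length field counts 32-bit words, and the fragmentation offset field counts 8-byte units.

```python
import socket
import struct

IPV4_FIXED_HEADER = "!BBHHHBBH4s4s"  # network byte order, 20 bytes in total


def parse_ipv4_header(packet):
    """Decode the fixed 20-byte part of an IPv4 header into a dict."""
    (version_ihl, tos, datagram_length, identifier, flags_offset,
     ttl, protocol, checksum, source, destination) = struct.unpack(
        IPV4_FIXED_HEADER, packet[:20])
    return {
        "version": version_ihl >> 4,
        "header_length": (version_ihl & 0x0F) * 4,        # stored in 32-bit words
        "tos": tos,
        "datagram_length": datagram_length,
        "identifier": identifier,
        "flags": flags_offset >> 13,
        "fragment_offset": (flags_offset & 0x1FFF) * 8,   # stored in 8-byte units
        "ttl": ttl,
        "protocol": protocol,                             # 6 = TCP, 17 = UDP
        "checksum": checksum,
        "source": socket.inet_ntoa(source),
        "destination": socket.inet_ntoa(destination),
    }


# Hypothetical header: no options, 40-byte datagram carrying TCP.
# The checksum field is left at 0 here; it is not computed in this sketch.
sample = struct.pack(
    IPV4_FIXED_HEADER, (4 << 4) | 5, 0, 40, 777, 0, 64, 6, 0,
    socket.inet_aton("223.1.1.1"), socket.inet_aton("223.1.2.2"))

fields = parse_ipv4_header(sample)
print(fields["version"], fields["header_length"], fields["protocol"])  # 4 20 6
```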

Note that an IP datagram has a total of 20 bytes of header (assuming it has no options). If the IP datagram carries a TCP segment,
then each (non-fragmented) datagram carries a total of 40 bytes of header (20 IP bytes and 20 TCP bytes) along with the
application-layer data.
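
The header checksum described in the field list can be sketched in Python as follows. The 20-byte sample header is an arbitrary illustration, not taken from any figure in this section, and decrement_ttl is a hypothetical helper that mimics what a router does when it forwards a datagram. The sketch assumes a header without options and a TTL of at least 1.

```python
def internet_checksum(data):
    """Return the 16-bit 1's complement of the 1's complement sum of `data`."""
    if len(data) % 2:
        data += b"\x00"                            # pad to whole 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF


def decrement_ttl(header):
    """Return a copy of a 20-byte header with TTL reduced by one and a fresh checksum."""
    ttl = header[8] - 1                            # byte 8 holds the TTL
    cleared = header[:8] + bytes([ttl]) + header[9:10] + b"\x00\x00" + header[12:]
    return cleared[:10] + internet_checksum(cleared).to_bytes(2, "big") + cleared[12:]


# Sample header with the checksum field (bytes 10 and 11) set to zero.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
checksum = internet_checksum(header)
print(hex(checksum))  # 0xb861

# A receiver sums the whole header, checksum included; a result of 0 means no error detected.
stamped = header[:10] + checksum.to_bytes(2, "big") + header[12:]
print(internet_checksum(stamped))                 # 0
print(internet_checksum(decrement_ttl(stamped)))  # 0
```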

4.4.4 IP Fragmentation and Reassembly
We will see in Chapter 5 that not all link layer protocols can carry packets of the same size. Some protocols can carry "big"
packets whereas other protocols can only carry "little" packets. For example, Ethernet packets can carry no more than 1500
bytes of data, whereas packets for many wide-area links can carry no more than 576 bytes. The maximum amount of data that a
link-layer packet can carry is called the MTU (maximum transmission unit). Because each IP datagram is encapsulated within the
link-layer packet for transport from one router to the next router, the MTU of the link-layer protocol places a hard limit on the
length of an IP datagram. Having a hard limit on the size of an IP datagram is not much of a problem. What is a problem is that
each of the links along the route between sender and destination can use different link-layer protocols, and each of these
protocols can have different MTUs.

To understand the problem better, imagine that you are a router that interconnects several links, each running different link-layer
protocols with different MTUs. Suppose you receive an IP datagram from one link, check your routing table to
determine the outgoing link, and find that this outgoing link has an MTU that is smaller than the length of the IP datagram. Time to panic
-- how are you going to squeeze this oversized IP packet into the payload field of the link-layer packet? The solution to this
problem is to "fragment" the data in the IP datagram among two or more smaller IP datagrams, and then send these smaller
datagrams over the outgoing link. Each of these smaller datagrams is referred to as a fragment.

Fragments need to be reassembled before they reach the transport layer at the destination. Indeed, both TCP and UDP are
expecting to receive from the network layer complete, un-fragmented segments. The designers of IPv4 felt that reassembling
(and possibly re-fragmenting) datagrams in the routers would introduce significant complication into the protocol and put a
damper on router performance. (If you were a router, would you want to be reassembling fragments on top of everything else
you have to do?) Sticking to the end-to-end principle for the Internet, the designers of IPv4 decided to put the job of datagram
reassembly in the end systems rather than in the network interior.

When a destination host receives a series of datagrams from the same source, it needs to determine if any of these datagrams are
fragments of some "original" bigger datagram. If it does determine that some datagrams are fragments, it must further determine
when it has received the last fragment and how the fragments it has received should be pieced back together to form the original
datagram. To allow the destination host to perform these reassembly tasks, the designers of IP (version 4) put identification,
flag and fragmentation offset fields in the IP datagram. When a datagram is created, the sending host stamps the datagram with an
identification number as well as a source and destination address. The sending host increments the identification number for
each datagram it sends. When a router needs to fragment a datagram, each resulting datagram (i.e., "fragment") is stamped with
the source address, destination address and identification number of the original datagram. When the destination receives a
series of datagrams from the same sending host, it can examine the identification numbers of the datagrams to determine which
of the datagrams are actually fragments of the same bigger datagram. Because IP is an unreliable service, one or more of the
fragments may never arrive at the destination. For this reason, in order for the destination host to be absolutely sure it has
received the last fragment of the original datagram, the last fragment has a flag bit set to 0 whereas all the other fragments have
this flag bit set to 1. Also, in order for the destination host to determine if a fragment is missing (and also to be able to
reassemble the fragments in the proper order), the offset field is used to specify where the fragment fits within the original IP
datagram.

                                                    Figure 4.4-9: IP Fragmentation

Figure 4.4-9 illustrates an example. A datagram of 4,000 bytes arrives at a router, and this datagram must be forwarded to a link
with an MTU of 1,500 bytes. This implies that the 3,980 data bytes in the original datagram must be allocated to three separate
fragments (each of which is also an IP datagram). Suppose that the original datagram is stamped with an identification number
of 777. Then the characteristics of the three fragments are as follows:

        1st fragment
             - 1,480 bytes in the data field of the IP datagram
             - identification = 777
             - offset = 0 (meaning the data should be inserted beginning at byte 0)
             - flag = 1 (meaning there is more)

        2nd fragment
             - 1,480 bytes in the data field of the IP datagram
             - identification = 777
             - offset = 1,480 (meaning the data should be inserted beginning at byte 1,480)
             - flag = 1 (meaning there is more)

        3rd fragment
             - 1,020 bytes (= 3,980 - 1,480 - 1,480) in the data field of the IP datagram
             - identification = 777
             - offset = 2,960 (meaning the data should be inserted beginning at byte 2,960)
             - flag = 0 (meaning this is the last fragment)
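
The arithmetic of this example can be reproduced with a short Python sketch. The function name and the dictionary layout are assumptions made for illustration, and a 20-byte header without options is assumed throughout. Offsets are reported in bytes, as in the example; the actual header field stores the offset divided by 8, which is why every fragment except the last carries a multiple of 8 data bytes.

```python
def fragment(datagram_length, mtu, identification, header_length=20):
    """Describe the fragments needed to send a datagram over a link with the given MTU."""
    data_left = datagram_length - header_length
    # Largest data size per fragment that fits the MTU and is a multiple of 8.
    max_data = (mtu - header_length) // 8 * 8
    fragments = []
    offset = 0
    while data_left > 0:
        size = min(max_data, data_left)
        data_left -= size
        fragments.append({
            "data_bytes": size,
            "identification": identification,
            "offset": offset,
            "flag": 1 if data_left > 0 else 0,  # 1 = more fragments follow
        })
        offset += size
    return fragments


for piece in fragment(4000, 1500, 777):
    print(piece)
# {'data_bytes': 1480, 'identification': 777, 'offset': 0, 'flag': 1}
# {'data_bytes': 1480, 'identification': 777, 'offset': 1480, 'flag': 1}
# {'data_bytes': 1020, 'identification': 777, 'offset': 2960, 'flag': 0}
```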

The payload of the datagram is only passed to the transport layer once the IP layer has fully reconstructed the original IP
datagram. If one or more of the fragments does not arrive at the destination, the datagram is "lost" and not passed to the
transport layer. But, as we learned in the previous chapter, if TCP is being used at the transport layer, then TCP will recover
from this loss by having the source retransmit the data in the original datagram.
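
The destination's side of the story can be sketched in Python as well. This is a simplified illustration, not a full reassembly algorithm: the fragment dictionaries are a hypothetical representation, all fragments are assumed to share one source, destination and identification number, and duplicated or overlapping fragments are not handled.

```python
def reassemble(fragments):
    """Return the original payload, or None while any fragment is still missing.

    Each fragment is a dict with 'offset' (in bytes), 'flag' (1 = more
    fragments follow, 0 = last fragment) and 'data' (bytes).
    """
    ordered = sorted(fragments, key=lambda frag: frag["offset"])
    if not ordered or ordered[-1]["flag"] != 0:
        return None                    # the last fragment has not arrived yet
    payload = b""
    for frag in ordered:
        if frag["offset"] != len(payload):
            return None                # a gap: an earlier fragment is missing
        payload += frag["data"]
    return payload


# The 3,980 data bytes of the example, split as described above.
original = bytes(i % 256 for i in range(3980))
pieces = [
    {"offset": 0, "flag": 1, "data": original[:1480]},
    {"offset": 1480, "flag": 1, "data": original[1480:2960]},
    {"offset": 2960, "flag": 0, "data": original[2960:]},
]

print(reassemble(pieces[::-1]) == original)   # True: arrival order does not matter
print(reassemble([pieces[0], pieces[2]]))     # None: the middle fragment is missing
```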

Fragmentation and reassembly put an additional burden on Internet routers (the additional effort to create fragments out of a
datagram) and on the destination hosts (the additional effort to reassemble fragments). For this reason it is desirable to keep
fragmentation to a minimum. This is often done by limiting the TCP and UDP segments to a relatively small size, so that the
fragmentation of the corresponding datagrams is unlikely. Because all data link protocols supported by IP are supposed to have
MTUs of at least 576 bytes, fragmentation can be entirely eliminated by using an MSS of 536 bytes: 536 bytes of data plus 20 bytes
of TCP segment header and 20 bytes of IP datagram header make a 576-byte datagram. This is why most TCP segments for bulk data
transfer (such as with HTTP) are 512 to 536 bytes long. (You may have noticed while surfing the Web that 500 or so bytes of data often arrive at a time.)

Following this section we provide a Java applet that generates fragments. You provide the incoming datagram size, the MTU
and the incoming datagram identification. It automatically generates the fragments for you.

4.4.5 ICMP: Internet Control Message Protocol
We conclude this section with a discussion of the Internet Control Message Protocol, ICMP, which is used by hosts, routers,
and gateways to communicate network layer information to each other. ICMP is specified in [RFC 792]. The most typical use
of ICMP is for error reporting. For example, when running a Telnet, FTP, or HTTP session, you may have encountered an error
message such as "Destination network unreachable." This message had its origins in ICMP. At some point, an IP router was
unable to find a path to the host specified in your Telnet, FTP or HTTP application. That router created and sent a type-3
ICMP message to your host indicating the error. Your host received the ICMP message and returned the error code to the TCP
code that was attempting to connect to the remote host. TCP in turn returned the error code to your application.

ICMP is often considered part of IP, but architecturally it lies just above IP, as ICMP messages are carried inside IP packets. That
is, ICMP messages are carried as IP payload, just as TCP or UDP packets are carried as IP payload. Similarly, when a host
receives an IP packet with ICMP specified as the upper layer protocol, it demultiplexes the packet to ICMP, just as it would
demultiplex a packet to TCP or UDP.

ICMP messages have a type and a code field, and also contain the header and first 8 bytes of data of the IP packet that caused the ICMP message to be
generated in the first place (so that the sender can determine which packet caused the error). Selected ICMP
messages are shown below in Figure 4.4-10. Note that ICMP messages are used not only for signaling error conditions. The
well-known ping [ping man page] program uses ICMP. ping sends an ICMP type 8 code 0 message to the specified host.
The destination host, seeing the echo request, sends back a type 0 code 0 ICMP echo reply. Another interesting ICMP message
is the source quench message. This message is seldom used in practice. Its original purpose was to perform congestion control
-- to allow a congested router to send an ICMP source quench message to a host to force that host to reduce its transmission rate.
We have seen in Chapter 3 that TCP has its own congestion control mechanism that operates at the transport layer, without the
use of network layer support such as the ICMP source quench message.

                                    ICMP type     code        description
                                    0             0           echo reply (to ping)
                                    3             0           destination network unreachable
                                    3             1           destination host unreachable
                                    3             2           destination protocol unreachable
                                    3             3           destination port unreachable
                                    3             6           destination network unknown
                                    3             7           destination host unknown
                                    4             0           source quench (congestion control)
                                    8             0           echo request
                                    9             0           router advertisement
                                    10            0           router discovery
                                    11            0           TTL expired
                                    12            0           IP header bad
                                                Figure 4.4-10: Selected ICMP messages
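
As an illustration of the type and code fields, the Python sketch below builds the bytes of an echo request like the one ping sends. It only constructs the message; actually sending it would need a raw socket and usually administrator privileges, which is left out. The identifier and sequence number fields follow the echo message layout in [RFC 792] and are not discussed in the text above; the function names and sample values are assumptions.

```python
import struct


def internet_checksum(data):
    """Return the 16-bit 1's complement of the 1's complement sum of `data`."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(identifier, sequence, payload=b""):
    """Build an ICMP echo request (type 8, code 0) as raw bytes."""
    icmp_type, code = 8, 0
    # The checksum covers the whole ICMP message, computed with the field set to 0.
    unchecked = struct.pack("!BBHHH", icmp_type, code, 0, identifier, sequence)
    checksum = internet_checksum(unchecked + payload)
    return struct.pack("!BBHHH", icmp_type, code, checksum, identifier, sequence) + payload


message = build_echo_request(identifier=1, sequence=1, payload=b"ping")
print(message[0], message[1])      # 8 0
print(internet_checksum(message))  # 0, so the message verifies
```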

In Chapter 1 we introduced the Traceroute program, which enabled you to trace the route from a few given hosts to any host in
the world. Interestingly enough, Traceroute also uses ICMP messages. To determine the names and addresses of the routers
between source and destination, Traceroute in the source sends a series of ordinary IP datagrams to the destination. The first of
these datagrams has a TTL of 1, the second of 2, the third of 3, etc. The source also starts timers for each of the datagrams.
When the nth datagram arrives at the nth router, the nth router observes that the TTL of the datagram has just expired.
According to the rules of the IP protocol, the router discards the datagram (because there may be a routing loop) and sends an
ICMP warning message to the source (type 11 code 0). This warning message includes the name of the router and its IP address.
When the ICMP message corresponding to the nth datagram arrives at the source, the source obtains the round-trip time from
the timer and the name and IP address from the ICMP message. Now that you understand how Traceroute works, you may want
to go back and play with it some more.
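
The TTL mechanism described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the function name simulate_traceroute and the list-of-names path model are hypothetical, no real packets are sent, and the destination's reply is reduced to a "destination reached" marker because the text above does not describe it.

```python
def simulate_traceroute(path, max_ttl=30):
    """Simulate Traceroute over a list of node names.

    path[:-1] are the routers in order; path[-1] is the destination host.
    Returns a list of (ttl, responder, event) tuples, one per probe sent.
    """
    results = []
    for ttl in range(1, max_ttl + 1):
        remaining = ttl
        responder, event = None, None
        for index, node in enumerate(path):
            if index == len(path) - 1:
                # The probe survived every router and reached the destination.
                responder, event = node, "destination reached"
                break
            remaining -= 1  # each router decrements the TTL by one
            if remaining == 0:
                # The router discards the datagram and reports back to the source.
                responder, event = node, "ICMP type 11 code 0 (TTL expired)"
                break
        results.append((ttl, responder, event))
        if event == "destination reached":
            break
    return results


# Hypothetical three-router path.
for probe in simulate_traceroute(["R1", "R2", "R3", "dest"]):
    print(probe)
```

The probe with TTL n is answered by the nth router, so the source learns one router per probe until the destination itself is reached.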


[Arin 1996] ARIN, "IP allocation report"
[Bradner 1996] S. Bradner, A. Mankin, IPng: Internet Protocol Next Generation, Addison Wesley, 1996.
[Cisco 97] Cisco - Advanced QoS Services for the Intelligent Internet.
[RFC 791] J. Postel, "Internet Protocol: DARPA Internet Program Protocol Specification," RFC 791, Sept 1981.
[RFC 950] J. Mogul, J. Postel , "Internet Standard Subnetting Procedure," RFC 950, August 1985.
[RFC 1071] R. Braden, D. Borman, and C. Partridge , "Computing the Internet Checksum," RFC 1071, September 1988.
[RFC 1542] W. Wimer, "Clarifications and Extensions for the Bootstrap Protocol," RFC 1542, October 1993.
[RFC 1700] J. Reynolds, J. Postel, "Assigned Numbers" RFC 1700, Oct. 1994.
[RFC 2002] C. Perkins, "IP Mobility Support," RFC 2002, 1996.
[RFC 2131] R. Droms, "Dynamic Host Configuration Protocol," RFC 2131, March 1997.
[RFC 2050] K. Hubbard, M. Kosters, D. Conrad, D. Karrenberg, J. Postel, "Internet Registry IP Allocation Guidelines", RFC
2050, Nov. 1996.

Copyright Keith W. Ross and James F. Kurose, 1996-2000. All rights reserved.
 IP Fragmentation

Fragmentation Applet
Provide an MTU (maximum transfer unit) and an incoming datagram size, and the applet will generate
all the fragments for you.
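
The computation such an applet performs can be approximated in a few lines. This is a minimal sketch, assuming a 20-byte IP header with no options and fragment offsets expressed in 8-byte units; the function name fragment and the dictionary layout of its result are hypothetical.

```python
def fragment(datagram_size, mtu, header_len=20):
    """Split an IP datagram into fragments that fit within the given MTU.

    datagram_size is the total length in bytes, header included.
    Each fragment is described by its total length, its offset field
    (in 8-byte units) and its more-fragments flag.
    """
    if datagram_size < header_len:
        raise ValueError("datagram is smaller than its header")
    if datagram_size <= mtu:
        # No fragmentation needed.
        return [{"total_length": datagram_size, "offset": 0, "more_fragments": False}]
    if mtu - header_len < 8:
        raise ValueError("MTU too small to carry any data")

    payload = datagram_size - header_len
    # Every fragment except the last must carry a multiple of 8 data bytes.
    max_data = (mtu - header_len) // 8 * 8
    fragments = []
    sent = 0
    while sent < payload:
        data_len = min(max_data, payload - sent)
        fragments.append({
            "total_length": data_len + header_len,
            "offset": sent // 8,
            "more_fragments": sent + data_len < payload,
        })
        sent += data_len
    return fragments


# A 4000-byte datagram crossing a link with a 1500-byte MTU.
for frag in fragment(4000, 1500):
    print(frag)
```

With these inputs the sketch yields three fragments of 1500, 1500 and 1040 bytes, with offset fields 0, 185 and 370.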
 Point-toPoint Routing in the Internet

                              4.5 Routing in the Internet
The Internet consists of interconnected autonomous systems (ASs). An AS typically consists of many
networks, where a network (also called an IP network) was defined in the previous section. Recall from
Section 4.3 that each autonomous system is administered independently. The administrator of an
autonomous system chooses the intra-AS routing algorithm for that AS, and is responsible for
administering that AS and no others. Datagrams must also be routed among the ASs, and this is the job of
inter-AS routing protocols. As discussed in Section 4.3, this hierarchical organization of the Internet has
permitted the Internet to scale. In this section we examine the intra-AS and inter-AS routing protocols
that are commonly used in the Internet.

4.5.1 Intra-Autonomous System Routing in the Internet
An intra-AS routing protocol is used to configure and maintain the routing tables within an autonomous
system (AS). Once the routing tables are configured, datagrams are routed within the AS as described in
the previous section. Intra-AS routing protocols are also known as interior gateway protocols.
Historically, three routing protocols have been used extensively for routing within an autonomous system
in the Internet: RIP (the Routing Information Protocol), OSPF (Open Shortest Path First), and IGRP
(Cisco's proprietary Interior Gateway Routing Protocol).

RIP: Routing Information Protocol

The Routing Information Protocol (RIP) was one of the earliest intra-AS Internet routing protocols and is
still in widespread use today. It traces its origins and its name to the Xerox Network Systems (XNS)
architecture. The widespread deployment of RIP was due in great part to its inclusion in the 1982
Berkeley Software Distribution (BSD) version of UNIX supporting TCP/IP. RIP version 1 is defined in
[RFC 1058], with a backwards compatible version 2 defined in [RFC 1723].

RIP is a distance vector protocol that operates in a manner very close to the idealized protocol we
examined in Section 4.2.3. The version of RIP specified in RFC 1058 uses hop count as a cost metric,
i.e., each link has a cost of 1, and limits the maximum cost of a path to 15. This limits the use of RIP to
autonomous systems that are less than 15 hops in diameter. Recall that in distance vector protocols,
neighboring routers exchange routing information with each other. In RIP, the routing tables are
exchanged between neighbors every 30 seconds using RIP's so-called response
message, with each response message containing that host's routing table entries for up to 25 destination
networks. These response messages containing routing tables are also called advertisements.
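
To make the 25-entry limit concrete, the sketch below splits a routing table into RIP version 1 response messages. The byte layout (a 4-byte header followed by 20-byte entries) follows our reading of RFC 1058 rather than anything stated in this section, and the function name build_responses and the sample network addresses are hypothetical.

```python
import socket
import struct

RIP_RESPONSE = 2   # command code for a response message
RIP_VERSION = 1
AFI_IP = 2         # address family identifier for IP
MAX_ENTRIES = 25   # at most 25 destination networks per response message


def build_responses(routes):
    """Encode (network_address, hop_count) pairs as RIP response messages.

    Returns a list of byte strings, one per message, each holding
    at most MAX_ENTRIES routing table entries.
    """
    messages = []
    for start in range(0, len(routes), MAX_ENTRIES):
        # Header: command, version, two bytes that must be zero.
        message = struct.pack("!BBH", RIP_RESPONSE, RIP_VERSION, 0)
        for network, hops in routes[start:start + MAX_ENTRIES]:
            # Entry: address family, zero, IP address, zero, zero, metric.
            message += struct.pack(
                "!HH4sIII", AFI_IP, 0, socket.inet_aton(network), 0, 0, hops
            )
        messages.append(message)
    return messages


# A hypothetical table with 30 destination networks needs two messages.
table = [("10.0.%d.0" % i, 1 + i % 15) for i in range(30)]
print([len(m) for m in build_responses(table)])
```

In a real router each of these byte strings would be the payload of a UDP datagram sent to a neighbor.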

Let us take a look at a simple example of how RIP advertisements work. Consider the portion of an AS
shown in Figure 4.5-2. In this figure, the rectangles denote routers and the lines connecting the rectangles
denote networks. Note that the routers are labeled A, B, etc. and the networks are labeled 1, 10, 20, 30,
etc. For visual convenience, some of the routers and networks are not labeled. Dotted lines in the figure
indicate that the autonomous system continues on and perhaps loops back. Thus this autonomous system
has many more routers and links than are shown in the figure.

                                         Figure 4.5-2: A portion of an autonomous system.

Now suppose that the routing table for router D is as shown in Figure 4.5-3. Note that the routing table
has three columns. The first column is for the destination network, the second column indicates the next
router along the shortest path to the destination network, and the third column indicates the number of
hops (i.e., the number of networks that have to be traversed, including the destination network, to get to
the destination network along the shortest path). For this example, the table indicates that to send a
datagram from router D to destination network 1, the datagram should be first sent to neighboring router
A; moreover, the table indicates that destination network 1 is two hops away along the shortest path. Also
note that the table indicates that network 30 is seven hops away via router B. In principle, the routing
table should have one row for each network in the AS. (Although aggregation, a topic beyond the scope
of this book, can be used to aggregate entries.) It should also have at least one row for networks that are
outside of the AS. The table in Figure 4.5-3, and the subsequent tables to come, are only partially
complete.

                                          destination        next        number of hops
                                            network         router       to destination
                                               1              A                2
                                              20              B                2
                                              30              B                7
                                              10              --               1
                                             ....            ....             ....

              Figure 4.5-3: Routing table in router D before receiving advertisement from router A.

Now suppose that 30 seconds later, router D receives from router A the advertisement shown in Figure
4.5-4. Note that this advertisement is nothing other than the routing table in router A! This routing table
says, in particular, that network 30 is only 4 hops away from router A.

                                          destination        next        number of hops
                                            network         router       to destination
                                              30              C                4
                                               1              --               1
                                              10              --               1
                                             ....            ....             ....
                                         Figure 4.5-4: Advertisement from router A.

Router D, upon receiving this advertisement, merges the advertisement (Figure 4.5-4) with the "old"
routing table (Figure 4.5-3). In particular, router D learns that there is now a path through router A to
network 30 that is shorter than the path through router B. Thus, router D updates its routing table to
account for the "shorter" shortest path, as shown in Figure 4.5-5. How is it, you might ask, that the
shortest path to network 30 became shorter? This is because either this decentralized distance vector
algorithm was still in the process of converging (see Section 4.2), or new links and/or routers were added
to the AS, which changed the actual shortest paths in the network.

                                          destination        next        number of hops
                                            network         router       to destination
                                               1              A                2
                                              20              B                2
                                              30              A                5
                                             ....            ....             ....
               Figure 4.5-5: Routing table in router D after receiving advertisement from router A.
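
The merge that router D performs can be sketched as follows, using the values from Figures 4.5-3 and 4.5-4. This is an illustrative distance vector update rather than a RIP implementation: the function name merge_advertisement and the dictionary representation are hypothetical, and treating a cost of 16 as "unreachable" is an assumption consistent with the 15-hop limit mentioned earlier.

```python
def merge_advertisement(table, neighbor, advertisement, infinity=16):
    """Merge a neighbor's advertisement into a routing table.

    table maps network -> (next_router, hops); next_router is None for a
    directly attached network. advertisement maps network -> hops as seen
    by the neighbor. Returns a new table and leaves the input untouched.
    """
    updated = dict(table)
    for network, advertised_hops in advertisement.items():
        # Reaching the network through the neighbor costs one extra hop.
        candidate = min(advertised_hops + 1, infinity)
        current = updated.get(network)
        if current is None:
            if candidate < infinity:
                updated[network] = (neighbor, candidate)
        elif candidate < current[1] or current[0] == neighbor:
            # Adopt a strictly shorter path, or refresh a route that
            # already goes through this neighbor.
            updated[network] = (neighbor, candidate)
    return updated


# Router D's table before the advertisement (Figure 4.5-3).
table_d = {1: ("A", 2), 20: ("B", 2), 30: ("B", 7), 10: (None, 1)}
# Advertisement received from router A (Figure 4.5-4).
from_a = {30: 4, 1: 1, 10: 1}

print(merge_advertisement(table_d, "A", from_a))
```

Only the entry for network 30 changes: it now points to router A with a cost of 5 hops, matching Figure 4.5-5.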

Returning now to the general properties of RIP, if a router does not hear from its neighbor at least once
every 180 seconds, that neighbor is considered to be no longer reachable, i.e., either the neighbor has died
or the connecting link has gone down. When this happens, RIP modifies its local routing table and then
propagates this information by sending advertisements to its neighboring routers (the ones that are still
reachable). A router can also request information about its neighbor's cost to a given destination using
RIP's request message. Routers send RIP request and response messages to each other over UDP using
port number 520. The UDP packet is carried between routers in a standard IP packet. The fact that RIP
uses a transport layer protocol (UDP) on top of a network layer protocol (IP) to implement network
layer functionality (a routing algorithm) may seem rather convoluted (it is!). Looking a little deeper at
how RIP is implemented will clear this up.

Figure 4.5-6 sketches how RIP is typically implemented in a UNIX system, for example, a UNIX
workstation serving as a router. A process called routed (pronounced "route dee") executes the RIP
protocol, i.e., maintains the routing table and exchanges messages with routed processes running in
neighboring routers. Because RIP is implemented as an application-layer process (albeit a very special
one that is able to manipulate the routing tables within the UNIX kernel), it can send and receive
messages over a standard socket and use a standard transport protocol. Thus, RIP is an application-layer
protocol (see Chapter 2), running over UDP.

                                 Figure 4.5-6: Implementation of RIP as the routed daemon

Finally, let us take a quick look at a RIP routing table. The RIP routing table below in Table 4.5-7 is
taken from a UNIX router, giroflee. If you give a netstat -rn command on a UNIX
system, you can view the routing table for that host or router. Performing a netstat on giroflee yields the
following routing table:

         Destination          Gateway              Flags   Ref     Use  Interface
         -------------------- -------------------- ----- ----- ------ ---------
                                                   UH        0  26492  lo0
         192.168.2.                                U         2     13  fa0
         193.55.114.                               U         3  58503  le0
         192.168.3.                                U         2     25  qaa0
                                                   U         3      0  le0
         default                                   UG        0 143454
                       Table 4.5-7: RIP routing table from giroflee

The router giroflee is connected to three networks. The second, third and fourth rows in the table tell us
that these three networks are attached to giroflee via giroflee's network interfaces fa0, le0 and
qaa0, each of which has its own IP address. To transmit a packet to any host belonging to one of these
three networks, giroflee will simply send the outgoing IP datagram over the appropriate interface. Of
particular interest to us is the default route. Any IP datagram that is not destined for one of the networks
explicitly listed in the routing table will be forwarded to the default router; this router is reached
by sending the datagram over the default network interface. The first entry in the routing table is the
so-called loopback interface. When IP sends a datagram to the loopback interface, the packet is simply
returned to IP; this is useful for debugging purposes. The destination of the fifth entry is a special multicast
(Class D) IP address. We will examine IP multicast in Section 4.8.
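
The forwarding decision described above, including the fallback to the default route, can be sketched with Python's ipaddress module. Several details are assumptions made only for illustration: the three attached networks are treated as /24 prefixes, the default gateway address 193.55.114.1 is hypothetical, and the default route is assumed to use interface le0.

```python
import ipaddress

# (destination network, gateway, interface); a gateway of None means
# the network is directly attached to that interface.
ROUTES = [
    (ipaddress.ip_network("192.168.2.0/24"), None, "fa0"),
    (ipaddress.ip_network("193.55.114.0/24"), None, "le0"),
    (ipaddress.ip_network("192.168.3.0/24"), None, "qaa0"),
    # Default route: matches every address (gateway address is hypothetical).
    (ipaddress.ip_network("0.0.0.0/0"), "193.55.114.1", "le0"),
]


def lookup(destination):
    """Return (gateway, interface) for a destination IP address."""
    address = ipaddress.ip_address(destination)
    matches = [route for route in ROUTES if address in route[0]]
    # Prefer the most specific matching network; the default route has
    # prefix length 0 and therefore wins only when nothing else matches.
    network, gateway, interface = max(matches, key=lambda r: r[0].prefixlen)
    return gateway, interface


print(lookup("192.168.3.7"))   # directly attached network
print(lookup("8.8.8.8"))       # falls through to the default route
```

A datagram for a listed network goes straight out of the matching interface, while any other destination is handed to the default router.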

OSPF: Open Shortest Path First

Like RIP, the Open Shortest Path First (OSPF) routing protocol is used for intra-AS routing. The "Open" in OSPF
indicates that the routing protocol specification is publicly available (e.g., as opposed to Cisco's IGRP
protocol). The most recent version of OSPF, version 2, is defined in RFC 2178 - a public document.

OSPF was conceived as the successor to RIP and as such has a number of advanced features. At its heart,
however, OSPF is a link-state protocol that uses flooding of link state information and a Dijkstra least
cost path algorithm. With OSPF, a router constructs a complete topological map (i.e., a directed graph) of
the entire autonomous system. The router then locally runs Dijkstra's shortest path algorithm to determine
a shortest path tree to all networks with itself as the root node. The router's routing table is then obtained
from this shortest path tree. Individual link costs are configured by the network administrator.
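
The local computation each OSPF router performs can be sketched as below: Dijkstra's algorithm builds a shortest path tree rooted at the router, and the routing table is read off that tree. This is a textbook-style sketch, not OSPF itself; the function names shortest_path_tree and routing_table, the four-router topology, and its link costs are all hypothetical.

```python
import heapq


def shortest_path_tree(graph, root):
    """Run Dijkstra's algorithm on graph = {node: {neighbor: link_cost}}.

    Returns (cost, parent): the least cost from root to each reachable
    node, and each node's predecessor in the shortest path tree.
    """
    cost = {root: 0}
    parent = {root: None}
    finished = set()
    heap = [(0, root)]
    while heap:
        current_cost, node = heapq.heappop(heap)
        if node in finished:
            continue  # stale heap entry for an already finalized node
        finished.add(node)
        for neighbor, link_cost in graph.get(node, {}).items():
            new_cost = current_cost + link_cost
            if neighbor not in cost or new_cost < cost[neighbor]:
                cost[neighbor] = new_cost
                parent[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    return cost, parent


def routing_table(parent, root):
    """Map each destination to the first hop on its shortest path."""
    table = {}
    for destination in parent:
        if destination == root:
            continue
        hop = destination
        while parent[hop] != root:
            hop = parent[hop]  # walk back toward the root
        table[destination] = hop
    return table


# Hypothetical topology with administrator-configured link costs.
topology = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 2, "D": 4},
    "C": {"A": 5, "B": 2, "D": 1},
    "D": {"B": 4, "C": 1},
}
cost, parent = shortest_path_tree(topology, "A")
print(cost)
print(routing_table(parent, "A"))
```

In this topology router A reaches every destination through B, because the path A-B-C-D (cost 4) is cheaper than any path using the direct A-C link.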

Let us now contrast and compare the advertisements sent by RIP and OSPF. With OSPF, a router
periodically sends routing information to all other routers in the autonomous system, not just to its
neighboring routers. This routing information sent by a router has one entry for each of the router's
neighbors; the entry gives the distance (i.e., link state) from the router to the neighbor. On the other hand,
a RIP advertisement sent by a router contains information about all the networks in the autonomous
system, although this information is only sent to its neighboring routers. In a sense, the advertising
techniques of RIP and OSPF are duals of each other.

Some of the advances embodied in OSPF include the following:

     •   Security. All exchanges between OSPF routers (e.g., link state updates) are authenticated. This
         means that only trusted routers can participate in the OSPF protocol within a domain, thus
         preventing malicious intruders (or networking students taking their newfound knowledge out for a
         joyride) from injecting incorrect information into router tables.
     •   Multiple same-cost paths. When multiple paths to a destination have the same cost, OSPF
         allows multiple paths to be used (i.e., a single path need not be chosen for carrying all traffic
         when multiple equal cost paths exist).
     •   Different cost metrics for different TOS traffic. OSPF allows each link to have different costs
         for different TOS (type of service) IP packets. For example, a high bandwidth satellite link might
         be configured to have a low cost (and hence be attractive) for non-time critical traffic, but a very
         high cost metric for delay-sensitive traffic. In essence, OSPF sees different network topologies for
         different classes of traffic, and hence can compute different routes for each type of traffic.
     •   Integrated support for unicast and multicast routing. Multicast OSPF (RFC 1584) provides
         simple extensions to OSPF to provide for multicast routing (a topic we cover in more depth in
         Section 4.8). MOSPF uses the existing OSPF link database and adds a new type of link state
         advertisement to the existing OSPF link state broadcast mechanism.
     •   Support for hierarchy within a single routing domain. Perhaps the most significant advance in
         OSPF is the ability to hierarchically structure an autonomous system. Section 4.3 has already
         looked at the many advantages of hierarchical routing structures. We cover the implementation of
         OSPF hierarchical routing in the remainder of this section.

An OSPF autonomous system can be configured into "areas." Each area runs its own OSPF link state
routing algorithm, with each router in an area broadcasting its link state to all other routers in that area.
The internal details of an area thus remain invisible to all routers outside the area. Intra-area routing
involves only those routers within the same area.

Within each area, one or more area border routers are responsible for routing packets outside the area.
Exactly one OSPF area in the AS is configured to be the backbone area. The primary role of the
backbone area is to route traffic between the other areas in the AS. The backbone always contains all
area border routers in the AS and may contain non-border routers as well. Inter-area routing within the
AS requires that the packet be first routed to an area border router (intradomain routing), then routed
through the backbone to the area border router that is in the destination area, and then routed to the final
destination.

                              Figure 4.5-7: Hierarchically structured OSPF AS with four areas.

A diagram of a hierarchically structured OSPF network is shown in Figure 4.5-7. We can identify four
types of OSPF routers in Figure 4.5-7:

     •   internal routers. These routers, shown in black, are in non-backbone areas and only perform
         intra-AS routing.
     •   area border routers. These routers, shown in blue, belong to both an area and the backbone.
     •   backbone routers (non-border routers). These routers, shown in gray, perform routing within the
         backbone but themselves are not area border routers. Within a non-backbone area, internal routers
         learn of the existence of routes to other areas from information (essentially a link state
         advertisement, but advertising the cost of a route to another area, rather than a link cost) broadcast
         within the area by its backbone routers.
     •   boundary routers. A boundary router, shown in blue, exchanges routing information with routers
         belonging to other autonomous systems. This router might, for example, use BGP to perform
         inter-AS routing. It is through such a boundary router that other routers learn about paths to external
         networks.

IGRP: Interior Gateway Routing Protocol

The Interior Gateway Routing Protocol (IGRP) [Cisco97] is a proprietary routing algorithm developed by
Cisco Systems, Inc. in the mid-1980s as a successor to RIP. IGRP is a distance vector protocol. Several
cost metrics (including delay, bandwidth, reliability, and load) can be used in making routing decisions,
with the weight given to each of the metrics being determined by the network administrator. This ability
to use administrator-defined costs in making route selections is an important difference from RIP; we will
see shortly that so-called policy-based interdomain Internet routing protocols such as BGP also allow
administratively defined routing decisions to be made. Other important differences from RIP include the
use of a reliable transport protocol to communicate routing information, the use of update messages that
are sent only when routing table costs change (rather than periodically), and the use of a distributed
diffusing update routing algorithm [Garcia-Luna-Aceves 1991] to quickly compute loop-free routing paths.

4.5.2 Inter-Autonomous System Routing
The Border Gateway Protocol version 4, specified in RFC 1771 (see also RFC 1772, RFC 1773), is the de
facto standard interdomain routing protocol in today's Internet. It is commonly referred to as BGP4 or
simply as BGP. As an inter-autonomous system routing protocol, it provides for routing between
autonomous systems (that is, administrative domains).

While BGP has the same general flavor as the distance vector protocol that we studied in Section 4.2, it is
more appropriately characterized as a path vector protocol. This is because BGP in a router does not
propagate cost information (e.g., number of hops to a destination), but instead propagates path
information, such as the sequence of ASs on a route to a destination AS. We will examine the path
information in detail shortly. We note though that while this information includes the names of the ASs
on a route to the destination, it does not contain cost information. Nor does BGP specify how a specific
route to a particular destination should be chosen among the routes that have been advertised. That
decision is a policy decision that is left up to the domain's administrator. Each domain can thus choose its
routes according to whatever criteria it chooses (and need not even inform its neighbors of its policy!) --
allowing a significant degree of autonomy in route selection. In essence, BGP provides the mechanisms
to distribute path information among the interconnected autonomous systems, but leaves the policy for
making the actual route selections up to the network administrator.

Let's begin with a grossly simplified description of how BGP works. This will help us see the forest
for the trees. As discussed in Section 4.3, as far as BGP is concerned, the whole Internet is a graph of
ASs, and each AS is identified by an AS number. At any instant of time, a given AS X may, or may not,
know of a path of ASs that leads to a given destination AS Z. As an example, suppose X has listed in its
BGP table such a path XY1Y2Y3Z from itself to Z. This means that X knows that it can send datagrams
to Z through the ASs Y1, Y2 and Y3. When X sends updates to its BGP neighbors (i.e., the
neighbors in the graph), X actually sends the entire path information, XY1Y2Y3Z, to its neighbors (as
well as other paths to other ASs). If, for example, W is a neighbor of X, and W receives an advertisement
that includes the path XY1Y2Y3Z, then W can list a new entry WXY1Y2Y3Z in its BGP table. However,
we should keep in mind that W may decide to not create this new entry for one of several reasons. For
example, W would not create this entry if W is equal to (say) Y2, thereby creating an undesirable loop in
the routing; or if W already has a path to Z in its tables, and this existing path is preferable (with respect
to the metric used by BGP at W) to WXY1Y2Y3Z; or, finally, if W has a policy decision to not forward
datagrams through (say) Y2.
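
The three reasons W might reject an advertised path can be sketched in a few lines. This is an illustration only: the function name accept_path is hypothetical, and the preference rule used here (keep the existing path unless the new one has fewer ASs) is merely an assumed stand-in, since BGP leaves the choice of metric to each AS.

```python
def accept_path(local_as, advertised_path, current_path=None, avoid=frozenset()):
    """Decide whether local_as installs a path advertised by a peer.

    advertised_path is the sequence of ASs from the peer to the destination.
    Returns the new path with local_as prepended, or None if rejected.
    """
    # Reason 1: local_as already appears on the path, so using it would
    # create a routing loop.
    if local_as in advertised_path:
        return None
    # Reason 3: local policy forbids forwarding through some AS on the path.
    if avoid.intersection(advertised_path):
        return None
    candidate = (local_as,) + tuple(advertised_path)
    # Reason 2: an existing path is preferable. Fewer ASs is an assumed
    # preference used only for this example.
    if current_path is not None and len(current_path) <= len(candidate):
        return None
    return candidate


advertised = ("X", "Y1", "Y2", "Y3", "Z")
print(accept_path("W", advertised))                   # accepted
print(accept_path("Y2", advertised))                  # rejected: loop
print(accept_path("W", advertised, avoid={"Y2"}))     # rejected: policy
```

Because the whole AS path travels with each advertisement, the loop check is a simple membership test rather than a property that must emerge from converging costs.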

In BGP jargon, the immediate neighbors in the graph of ASs are called peers. BGP information is
propagated through the network by exchanges of BGP messages between peers. The BGP protocol
defines four types of messages: OPEN, UPDATE, NOTIFICATION and KEEPALIVE.

     •   OPEN: BGP peers communicate using the TCP protocol and port number 179. TCP thus
         provides for reliable and congestion controlled message exchange between peers. In contrast,
         recall that we earlier saw that two RIP peers, e.g., the routed processes in Figure 4.5-6, communicate via
         unreliable UDP. When a BGP gateway wants to first establish contact with a BGP peer (e.g., after
         the gateway itself or a connecting link has just been booted), an OPEN message is sent to the peer.
         The OPEN message allows a BGP gateway to identify and authenticate itself, and provide timer
         information. If the OPEN is acceptable to the peer, it will send back a KEEPALIVE message.
     •   UPDATE: A BGP gateway uses the UPDATE message to advertise a path to a given destination
         (e.g., XY1Y2Y3Z) to the BGP peer. The UPDATE message can also be used to withdraw routes
         that had previously been advertised (that is, to tell a peer that a route that it had previously
         advertised is no longer a valid route).
     •   KEEPALIVE: This BGP message is used to let a peer know that the sender is alive but that the
         sender doesn't have other information to send. It also serves as an acknowledgment to a received
         OPEN message.
     •   NOTIFICATION: This BGP message is used to inform a peer that an error has been detected
         (e.g., in a previously transmitted BGP message) or that the sender is about to close the BGP
         session.
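
The four message types share a common fixed header, which the sketch below encodes and decodes. The header layout (a 16-byte marker, a 2-byte length and a 1-byte type) and the numeric type codes are taken from our reading of the BGP-4 specification, not from this section; the function names build_keepalive and parse_header are hypothetical, and no TCP connection is opened.

```python
import struct

# Type codes as assigned in the BGP-4 specification.
MESSAGE_TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}

# Fixed header: 16-byte marker, 2-byte total length, 1-byte type.
HEADER = struct.Struct("!16sHB")


def build_keepalive():
    """Build a KEEPALIVE message, which consists of the header alone."""
    marker = b"\xff" * 16  # all ones when no authentication data is carried
    return HEADER.pack(marker, HEADER.size, 4)


def parse_header(data):
    """Return (length, type_name) from the first bytes of a BGP message."""
    marker, length, type_code = HEADER.unpack(data[:HEADER.size])
    return length, MESSAGE_TYPES.get(type_code, "UNKNOWN")


message = build_keepalive()
print(len(message), parse_header(message))
```

A receiver reads the length field to find the message boundary in the TCP byte stream and then dispatches on the type.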

Recall from our discussion above that BGP provides mechanisms for distributing path information but
does not mandate policies for selecting a route from those available. Within this framework, it is thus
possible for an AS such as the Hatfields' to implement a policy such as "traffic from my AS should not
cross the McCoys' AS," since it knows the identities of all ASs on the path. (The Hatfields and the
McCoys are two famous feuding families in the US.) But what about a policy that would prevent the
McCoys from sending traffic through the Hatfields' network? The only means for an AS to control the
traffic it passes through its AS (known as "transit" traffic - traffic that neither originates in, nor is destined
for, the network, but instead is "just passing through") is by controlling the paths that it advertises. For
example, if the McCoys are immediate neighbors of the Hatfields, the Hatfields could simply not
advertise any routes to the McCoys that contain the Hatfield network. But restricting transit traffic by
controlling an AS's route advertisement can only be partially effective. For example, if the Joneses are
between the Hatfields and the McCoys, and the Hatfields advertise routes to the Joneses that pass through
the Hatfields, then the Hatfields cannot prevent (using BGP mechanisms) the Joneses from advertising
these routes to the McCoys.
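
The limits of controlling transit traffic through advertisements can be seen in a small sketch. Everything here is hypothetical illustration: the function name routes_to_advertise, the AS names used as strings, and the destination AS "Z" are not part of BGP.

```python
def routes_to_advertise(known_paths, peer, no_transit_for=frozenset()):
    """Select the paths an AS advertises to a given peer.

    known_paths maps destination -> tuple of ASs starting at the advertiser.
    Nothing is advertised to peers listed in no_transit_for, and a path
    that already contains the peer is never sent back to it.
    """
    if peer in no_transit_for:
        return {}
    return {dest: path for dest, path in known_paths.items() if peer not in path}


hatfield_paths = {"Z": ("Hatfield", "Y", "Z")}

# The Hatfields advertise nothing to the McCoys directly...
to_mccoy = routes_to_advertise(hatfield_paths, "McCoy", no_transit_for={"McCoy"})
# ...but they do advertise the route to the Joneses.
to_jones = routes_to_advertise(hatfield_paths, "Jones", no_transit_for={"McCoy"})

# The Joneses prepend themselves and apply their own (unrestricted) policy.
jones_paths = {dest: ("Jones",) + path for dest, path in to_jones.items()}
jones_to_mccoy = routes_to_advertise(jones_paths, "McCoy")

print(to_mccoy)
print(jones_to_mccoy)
```

The McCoys still learn a path that crosses the Hatfield network, because the Hatfields' filter applies only to their own advertisements.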

Very often an AS will have multiple gateway routers that provide connections to other ASs. Even though
BGP is an inter-AS protocol, it can still be used inside an AS as a pipe to exchange BGP updates among
gateway routers belonging to the same AS. BGP connections inside an AS are called Internal BGP
(IBGP), whereas BGP connections between ASs are called External BGP (EBGP).

As noted above, BGP, which is the successor to EGP, is becoming the de facto standard for inter-AS
routing for the public Internet. BGP is used, for example, at the major network access points (NAPs)
where major Internet carriers connect to each other and exchange traffic. The BGP routing tables at
the major NAPs in the US (which include Chicago and San Francisco) are large.

This completes our brief introduction to BGP. Although BGP is complex, it plays a central role in the
Internet. We encourage readers to see the references [Halabi 97] and [Huitema 95] to learn more about
BGP.

4.5.3 Why are there Different Inter-AS and Intra-AS Routing Protocols?

Having now studied the details of specific inter-AS and intra-AS routing protocols deployed in today's
Internet, let us conclude by considering perhaps the most fundamental question we could ask about these
protocols in the first place (hopefully, you have been wondering this all along, and have not lost the forest
for the trees!):

         Why are different inter-AS and intra-AS routing protocols used?

The answer to this question gets at the heart of the differences between the goals of routing within an AS
and among ASs:

     •   Policy. Among ASs, policy issues dominate. It may well be important that traffic originating in a
         given AS specifically not be able to pass through another specific AS. Similarly, a given AS may
         well want to control what transit traffic it carries between other ASs. We have seen that BGP
         specifically carries path attributes and provides for controlled distribution of routing information so
         that such policy-based routing decisions can be made. Within an AS, everything is nominally
         under the same administrative control, and thus policy issues play a much less important role in
         choosing routes within the AS.
     •   Scale. The ability of a routing algorithm and its data structures to scale to handle routing
         to/among large numbers of networks is a critical issue in inter-AS routing. Within an AS,
         scalability is less of a concern. For one thing, if a single administrative domain becomes too large,
         it is always possible to divide it into two ASs and perform inter-AS routing between the two new
         ASs. (Recall that OSPF allows such a hierarchy to be built by splitting an AS into "areas").
     •   Performance. Because inter-AS routing is so policy-oriented, the quality (e.g., performance) of
         the routes used is often of secondary concern (i.e., a longer or more costly route that satisfies
         certain policy criteria may well be taken over a route that is shorter but does not meet those
         criteria). Indeed, we saw that among ASs, there is not even the notion of preference or costs
         associated with routes. Within a single AS, however, such policy concerns can be ignored,
         allowing routing to focus more on the level of performance realized on a route.


[Cisco97] Cisco, "Interior Gateway Routing Protocol and Enhanced IGRP."
[RFC 792] J. Postel, "Internet Control Message Protocol," RFC 792, Sept. 1981.
[RFC 904] D. Mills, "Exterior Gateway Protocol Formal Specification," RFC 904, April 1984.
[RFC 1058] C. Hedrick, "Routing Information Protocol," RFC 1058, June 1988.
[RFC 1256] S. Deering, "ICMP Router Discovery Messages," RFC 1256, Sept. 1991.
[RFC 1584] J. Moy, "Multicast Extensions to OSPF," RFC 1584, March 1994.
[RFC 1723] G. Malkin, "RIP Version 2 - Carrying Additional Information," RFC 1723, November 1994.
[RFC 1771] Y. Rekhter and T. Li, "A Border Gateway Protocol 4 (BGP-4)," RFC 1771, March 1995.
[RFC 1772] Y. Rekhter and P. Gross, "Application of the Border Gateway Protocol in the Internet," RFC
1772, March 1995.
[RFC 1773] P. Traina, "Experience with the BGP-4 protocol," RFC 1773, March 1995.
[RFC 2002] C. Perkins, "IP Mobility Support," RFC 2002, 1996.
[RFC 2178] J. Moy, "Open Shortest Path First Version 2," RFC 2178, July 1997.
[Halabi 97] B. Halabi, Internet Routing Architectures, Cisco Systems Publishing, Indianapolis, 1997.
[Huitema 95] C. Huitema, Routing in the Internet, Prentice Hall, New Jersey, 1995.


Copyright Keith W. Ross and James F. Kurose, 1996-2000. All rights reserved.
 What's inside a router?

                           4.6 What's inside a router?
Our study of the network layer so far has focussed on network layer service models, the routing
algorithms that control the routes taken by packets through the network, and the protocols that embody
these routing algorithms. These topics, however, are only part (albeit an important part) of what goes on in
the network layer. We have yet to consider the switching function of a router - the actual transfer of
datagrams from a router's incoming links to the appropriate outgoing links. Studying just the control and
service aspects of the network layer is like studying a company and considering only its management
(which controls the company but typically performs very little of the actual "grunt" work that makes a
company run!) and its public relations ("Our product will provide this wonderful service to you!"). To
fully appreciate what really goes on within a company, one needs to consider the workers. In the
network layer, the real work (that is, the reason the network layer exists in the first place) is the
forwarding of datagrams. A key component in this forwarding process is the transfer of a datagram from
a router's incoming link to an outgoing link. In this section we study how this is accomplished. Our
coverage here is necessarily brief, as an entire course would be needed to cover router design in depth.
Consequently, we'll make a special effort in this section to provide pointers to material that covers this
topic in more depth.

A high level view of a generic router architecture is shown in Figure 4.6-1. Four components of a router
can be identified:

     q   Input ports. The input port performs several functions. It performs the physical layer
         functionality (shown in light blue in Figure 4.6-1) of terminating an incoming physical link to a
         router. It performs the data link layer functionality (shown in dark blue) needed to interoperate
         with the data link layer functionality (see Chapter 5) on the other side of the incoming link. It
         also performs a lookup and forwarding function (shown in red) so that a datagram forwarded into
         the switching fabric of the router emerges at the appropriate output port. Control packets (e.g.,
         packets carrying routing protocol information such as RIP, OSPF or BGP) are forwarded from
         the input port to the routing processor. In practice, multiple ports are often gathered together on a
         single line card within a router.
     q   Switching fabric. The switching fabric connects the router's input ports to its output ports. This
         switching fabric is completely contained within the router - a network inside of a network router!
     q   Output ports. An output port stores the datagrams that have been forwarded to it through the
         switching fabric, and then transmits the datagrams on the outgoing link. The output port thus
         performs the reverse of the data link and physical layer functionality of the input port.
     q   Routing processor. The routing processor executes the routing protocols (e.g., the protocols we
         studied in section 4.4), maintains the routing tables, and performs network management functions
         (see chapter 8) within the router. Since we cover these topics elsewhere in this book, we do not
         discuss them further here.

                                               Figure 4.6-1: Router architecture

In the following, we'll take a look at input ports, the switching fabric, and output ports in more detail.
[Turner 1988, Giacopelli 1990, McKeown 1997a, Partridge 1998] provide a discussion of some specific
router architectures. [McKeown 1997b] provides a particularly readable overview of modern router
architectures, using the Cisco 12000 router as an example.

4.6.1 Input ports

A more detailed view of input port functionality is given in Figure 4.6-2. As discussed above, the input
port's line termination function and data link processing implement the physical and data link layers
associated with an individual input link to the router. The lookup/forwarding function of the input port is
central to the switching function of the router. In many routers, it is here that the router determines the
output port to which an arriving datagram will be forwarded via the switching fabric. The choice of the
output port is made using the information contained in the routing table. Although the routing table is
computed by the routing processor, a "shadow copy" of the routing table is typically stored at each input
port and updated, as needed, by the routing processor. With local copies of the routing table, the
switching decision can be made locally, at each input port, without invoking the centralized routing
processor. Such decentralized switching avoids creating a forwarding bottleneck at a single point within
the router.

In routers with limited processing capabilities at the input port, the input port may simply forward the
packet to the centralized routing processor, which will then perform the routing table lookup and
forward the packet to the appropriate output port. This is the approach taken when a workstation or
server serves as a router (e.g., [Microsoft 1998]); here, the "routing processor" is really just the
workstation's CPU and the "input port" is really just a network interface card (e.g., an Ethernet card).

                                              Figure 4.6-2: Input port processing

Given the existence of a routing table, the routing table lookup is conceptually simple -- we just search
through the routing table, looking for a destination entry that matches the destination address of the
datagram, or a default route if the destination entry is missing. In practice, however, life is not so
simple. Perhaps the most important complicating factor is that backbone routers must operate at high
speeds, being capable of performing millions of lookups per second. Indeed, it is desirable for the input
port processing to be able to proceed at line speed, i.e., that a lookup can be done in less than the amount
of time needed to receive a packet at the input port. In this case, input processing of a received packet
can be completed before the next receive operation is complete. To get an idea of the performance
requirements for lookup, consider that a so-called OC48 link runs at 2.5 Gbps. With 256 byte long
packets, this implies a lookup speed of approximately a million lookups per second.
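
The arithmetic behind this estimate can be checked with a short sketch. The function name and the per-lookup time budget printed at the end are our own illustrative additions; the link rate and packet size are the ones used in the text.

```python
def lookups_per_second(link_bps: float, packet_bytes: int) -> float:
    """Packets (and hence lookups) per second needed to keep up with the link."""
    return link_bps / (packet_bytes * 8)

# OC48 link at 2.5 Gbps carrying 256-byte packets, as in the text.
rate = lookups_per_second(2.5e9, 256)
print(f"{rate:,.0f} lookups/sec")         # a little over one million
print(f"{1e9 / rate:.1f} ns per lookup")  # time available for each lookup
```

Shorter packets raise the required lookup rate proportionally, which is why the per-packet time budget, rather than the raw link speed, is the figure a lookup design has to meet.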

Given the need to operate at today's high link speeds, a linear search through a large routing table is
impractical. A more reasonable technique is to store the routing table entries in a tree data structure.
Each level in the tree can be thought of as corresponding to a bit in the destination address. To look up an
address, one simply starts at the root node of the tree. If the first address bit is a zero, then the left
subtree will contain the routing table entry for the destination address; otherwise it will be in the right
subtree. The appropriate subtree is then traversed using the remaining address bits -- if the next address
bit is a zero the left subtree of the initial subtree is chosen; otherwise, the right subtree of the initial
subtree is chosen. In this manner, one can look up the routing table entry in N steps, where N is the
number of bits in the address. (The reader will note that this is essentially a binary search through an
address space of size 2^N.) Refinements of this approach are discussed in [Doeringer 1996].
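
The bit-by-bit tree walk just described can be sketched as follows. This is a minimal illustration, not the data structure of any particular router: the names TrieNode, to_bits, insert and lookup, the example prefixes and the port labels are all hypothetical. We also assume that table entries are address prefixes and that the longest matching prefix wins, which goes slightly beyond the description above.

```python
class TrieNode:
    def __init__(self):
        self.children = [None, None]  # index 0: next bit is 0 (left); 1: right
        self.port = None              # output port if a table entry ends here

def to_bits(addr: str) -> str:
    """Dotted-decimal IPv4 address -> 32-character string of '0'/'1'."""
    return "".join(f"{int(octet):08b}" for octet in addr.split("."))

def insert(root: TrieNode, prefix_bits: str, port: str) -> None:
    """Add a table entry for the given bit prefix."""
    node = root
    for b in prefix_bits:
        i = int(b)
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.port = port

def lookup(root: TrieNode, addr: str, default=None):
    """Walk at most N = 32 levels, remembering the last entry passed."""
    node = root
    best = root.port if root.port is not None else default
    for b in to_bits(addr):
        node = node.children[int(b)]
        if node is None:
            break                     # no deeper entry can match
        if node.port is not None:
            best = node.port          # a longer matching prefix was found
    return best

root = TrieNode()
insert(root, to_bits("10.0.0.0")[:8], "port 1")    # hypothetical entry 10.0.0.0/8
insert(root, to_bits("10.1.0.0")[:16], "port 2")   # hypothetical entry 10.1.0.0/16
print(lookup(root, "10.1.2.3", default="port 0"))     # port 2
print(lookup(root, "10.9.9.9", default="port 0"))     # port 1
print(lookup(root, "192.168.0.1", default="port 0"))  # port 0 (default route)
```

Each loop iteration corresponds to one level of the tree, so a lookup touches at most 32 nodes regardless of how many entries the table holds.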

But even with N=32 (e.g., a 32-bit IP address) steps, the lookup speed is not fast enough for today's
backbone routing requirements. For example, assuming a memory access at each step, less than a million
address lookups/sec could be performed with 40 ns memory access times. Several techniques have thus
been explored to increase lookup speeds. Content addressable memories (CAMs) allow a 32-bit IP
address to be presented to the CAM, which then returns the content of the routing table entry for that
address in essentially constant time. The Cisco 8500 series router [Cisco 1998a] has a 64K CAM for
each input port. Another technique for speeding lookup is to keep recently accessed routing table entries
in a cache [Feldmeier 1988]. Here, the potential concern is the size of the cache. Measurements in
[Thompson 1997] suggest that even for an OC-3 speed link, approximately 256,000 source-destination
pairs might be seen in one minute in a backbone router. Most recently, even faster data structures, which
allow a routing table entry to be located in log(N) steps [Waldvogel 1997], or which compress routing
tables in novel ways [Degemark 1997], have been proposed. A hardware-based approach to lookup that
is optimized for the common case that the address being looked up has 24 or fewer significant bits is
discussed in [Gupta 1998].
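
To see where the "less than a million lookups/sec" figure comes from, the following sketch works through the numbers. The variable names are ours. The comparison with a log(N)-step scheme assumes a base-2 logarithm and one memory access per step, which is an illustrative simplification rather than a measured result.

```python
import math

ADDRESS_BITS = 32        # N, the number of steps in the bit-by-bit tree walk
MEMORY_ACCESS_NS = 40    # memory access time assumed in the text

# One memory access per tree level.
tree_lookup_ns = ADDRESS_BITS * MEMORY_ACCESS_NS
tree_rate = 1e9 / tree_lookup_ns
print(f"tree walk: {tree_lookup_ns} ns per lookup, {tree_rate:,.0f} lookups/sec")

# Rate required by an OC48 link with 256-byte packets (from the earlier example).
required_rate = 2.5e9 / (256 * 8)
print(f"required:  {required_rate:,.0f} lookups/sec")

# Assumption: a log(N)-step scheme takes log2(32) = 5 memory accesses.
log_steps = int(math.log2(ADDRESS_BITS))
log_rate = 1e9 / (log_steps * MEMORY_ACCESS_NS)
print(f"log(N):    {log_steps} steps, {log_rate:,.0f} lookups/sec")
```

Under these assumptions the 32-step walk falls short of the OC48 requirement, while a scheme needing only five accesses per lookup would exceed it with room to spare.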

Once the output port for a packet has been determined via the lookup, the packet can be forwarded into
the switching fabric. However, as we'll see below, a packet may be temporarily blocked from entering
the switching fabric (due to the fact that packets from other input ports are currently using the fabric). A
blocked packet must thus be queued at the input port and then scheduled to cross the switching fabric at a
later point in time. We'll take a closer look at the blocking, queueing and scheduling of packets (at both
input ports and output ports) within a router in section 4.6.4 below.

4.6.2 Switching Fabrics

The switching fabric is at the very heart of a router. It is through this switching fabric that the datagrams are
actually moved from an input port to an output port. Switching can be accomplished in a number of
ways, as indicated in Figure 4.6-3:

                                         Figure 4.6-3: Three switching techniques

    q   Switching via memory. The simplest, earliest routers were often traditional computers, with
        switching between input and output ports being done under direct control of the CPU (routing
        processor). Input and output ports functioned as traditional I/O devices in a traditional operating
        system. An input port with an arriving datagram first signaled the routing processor via an
        interrupt. The packet was then copied from the input port into processor memory. The routing
        processor then extracted the destination address from the header, looked up the appropriate output
        port in the routing table, and copied the packet to the output port's buffers. Note that if the
        memory bandwidth is such that B packets/sec can be written into, or read from, memory, then
        the overall switch throughput (the total rate at which packets are transferred from input ports to
        output ports) must be less than B/2, since each packet must be written into memory once and
        read out of it once.

        Many modern routers also switch via memory. A major difference from early routers, however, is
        that the lookup of the destination address and the storing (switching) of the packet into the
        appropriate memory location are performed by processors on the input line cards. In some ways,
        routers that switch via memory look very much like shared memory multiprocessors, with the
        processors on a line card storing datagrams into the memory of the appropriate output port.
        Cisco's Catalyst 8500 series switches [Cisco 1998a] and Bay Networks Accelar 1200 Series
        routers switch packets via a shared memory.

    q   Switching via a bus. In this approach, the input ports transfer a datagram directly to the output
        port over a shared bus, without intervention by the routing processor. (Note that when switching
        via memory, the datagram must also cross the system bus going to/from memory.) Although the
        routing processor is not involved in the bus transfer, since the bus is shared, only one packet
        can be transferred over the bus at a time. A datagram arriving at an input port and finding
        the bus busy with the transfer of another datagram is blocked from passing through the switching
        fabric and queued at the input port. Because every packet must cross the single bus, the switching
        bandwidth of the router is limited to the bus speed.

        Given that bus bandwidths of over a gigabit per second are possible in today's technology,
        switching via a bus is often sufficient for routers that operate in access and enterprise networks
        (e.g., local area and corporate networks). Bus-based switching has been adopted in a number of
        current router products, including the Cisco 1900 [Cisco 1997b], which switches packets over a
        1 Gbps Packet Exchange Bus. 3Com's CoreBuilder 5000 systems [Kapoor 1997] interconnect
        ports that reside on different switch modules over the PacketChannel data bus, with a bandwidth
        of 2 Gbps.

    q   Switching via an interconnection network. One way to overcome the bandwidth limitation of a
        single, shared bus is to use a more sophisticated interconnection network, such as those that have
        been used in the past to interconnect processors in a multiprocessor computer architecture. A
        crossbar switch is an interconnection network consisting of 2N busses that connect N input ports
        to N output ports, as shown in Figure 4.6-3. A packet arriving at an input port travels along the
        horizontal bus attached to the input port until it intersects with the vertical bus leading to the
        desired output port. If the vertical bus leading to the output port is free, the packet is transferred to
        the output port. If the vertical bus is being used to transfer a packet from another input port to this
        same output port, the arriving packet is blocked and must be queued at the input port.

         Delta and Omega switching fabrics have also been proposed as an interconnection network
         between input and output ports. See [Tobagi 1990] for a survey of switch architectures. Cisco 12000
         Family switches [Cisco 1998b] use an interconnection network, providing up to 60 Gbps through
         the switching fabric. One current trend in interconnection network design [Keshav 1998] is to
         fragment a variable length IP datagram into fixed length cells, and then tag and switch the fixed
         length cells through the interconnection network. The cells are then reassembled into the original
         datagram at the output port. The fixed length cell and internal tag can considerably simplify and
         speed up the switching of the packet through the interconnection network.
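
The crossbar rule described above (a packet crosses only if the vertical bus to its output port is free) can be sketched for a single transfer interval as follows. The function name crossbar_slot and the tie-breaking rule, under which the lowest-numbered input port wins a contested output, are assumptions made for illustration; real fabrics use a variety of arbitration schemes.

```python
def crossbar_slot(head_packets):
    """Decide which head-of-queue packets cross the crossbar in one interval.

    head_packets maps input port -> desired output port.
    Returns (transferred, blocked): transferred maps output port -> the input
    port that used its vertical bus; blocked lists inputs that must wait.
    """
    transferred = {}
    blocked = []
    for inp in sorted(head_packets):      # assumption: lowest input number wins
        out = head_packets[inp]
        if out in transferred:            # vertical bus to this output is busy
            blocked.append(inp)
        else:
            transferred[out] = inp
    return transferred, blocked

# Inputs 0 and 2 both want output 2; input 1 wants output 0.
transferred, blocked = crossbar_slot({0: 2, 1: 0, 2: 2})
print(transferred)   # {2: 0, 0: 1}
print(blocked)       # [2]
```

Unlike a single shared bus, the crossbar moves the packets from inputs 0 and 1 in parallel; only the packet contending for an already busy output port is held back.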

4.6.3 Output Ports

Output port processing, shown in Figure 4.6-4, takes the datagrams that have been stored in the output
port's memory and transmits them over the outgoing link. The data link protocol processing and line
termination are the send-side link layer and physical layer functionality that interact with the input port on the
other end of the outgoing link, as discussed above in section 4.6.1. The queueing and buffer management
functionality are needed when the switch fabric delivers packets to the output port at a rate that exceeds
the output link rate; we'll cover output port queueing below.

                                             Figure 4.6-4: Output Port processing

4.6.4 Where Does Queueing Occur?

Looking at the input and output port functionality and the configurations shown in Figure 4.6-3, it is
evident that packet queues can form at both the input ports and the output ports. It is important to
consider these queues in a bit more detail, since as these queues grow large, the router's buffer space will
eventually be exhausted and packet loss will occur. Recall that in our earlier discussions, we said rather
vaguely that packets were lost "within the network" or "dropped at a router." It is here, at these queues
within a router, that such packets are dropped and lost. The actual location of packet loss (either at the
input port queues or the output port queues) will depend on the traffic load, the relative speed of the
switching fabric and the line speed, as discussed below.

Suppose that the input line speeds and output line speeds are all identical, and that there are n input ports
and n output ports. If the switching fabric speed is at least n times as fast as the input line speed, then no
queueing can occur at the input ports. This is because even in the worst case that all n input lines are
receiving packets, the switch will be able to transfer n packets from input port to output port in the time it
takes each of the n input ports to (simultaneously) receive a single packet. But what can happen at the
output ports? Let us suppose still that the switching fabric is at least n times as fast as the line speeds. In
the worst case, the packets arriving at each of the n input ports will be destined to the same output port.
In this case, in the time it takes to receive (or send) a single packet, n packets will arrive at this output
port. Since the output port can only transmit a single packet in a unit of time (the packet transmission
time), the n arriving packets will have to queue (wait) for transmission over the outgoing link. Another n
packets can then possibly arrive in the time it takes to transmit just one of the n packets that had
previously been queued. And so on. Eventually, the queue can grow large enough to exhaust the memory
space at the output port, in which case packets are dropped.
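
The worst case just described can be traced with a small sketch. The function name, the buffer size of 10 packets and the choice of n = 3 are hypothetical values picked for illustration; the model simply assumes that in every time unit the output port transmits one queued packet and the fabric then delivers n new ones.

```python
def output_queue_trace(n, buffer_size, time_units):
    """Worst case: all n inputs send to the same output port in every time unit.

    Returns (trace, dropped): the queue length at the end of each time unit
    and the total number of packets dropped because the buffer was full.
    """
    queue, dropped, trace = 0, 0, []
    for _ in range(time_units):
        if queue > 0:
            queue -= 1                       # transmit one packet on the outgoing link
        accepted = min(n, buffer_size - queue)
        dropped += n - accepted              # no room left: these packets are lost
        queue += accepted                    # fabric delivers n packets per time unit
        trace.append(queue)
    return trace, dropped

trace, dropped = output_queue_trace(n=3, buffer_size=10, time_units=6)
print(trace)     # [3, 5, 7, 9, 10, 10]
print(dropped)   # 3
```

The queue grows by n - 1 packets per time unit until the buffer fills, after which n - 1 of every n arriving packets are lost for as long as the overload persists.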

                                              Figure 4.6-5: Output port queueing

Output port queueing is illustrated in Figure 4.6-5. At time t, a packet has arrived at each of the
incoming input ports, each destined for the uppermost outgoing port. Assuming identical line speeds and
a switch operating at three times the line speed, one time unit later (i.e., in the time needed to receive or
send a packet), all three original packets have been transferred to the outgoing port and are queued
awaiting transmission. In the next time unit, one of these three packets will have been transmitted over
the outgoing link. In our example, two new packets have arrived at the incoming side of the switch; one
of these packets is destined for this uppermost output port.

A consequence of output port queueing is that a packet scheduler at the output port must choose one
packet among those queued for transmission. This selection might be done on a simple basis such as first-
come-first-served (FCFS) scheduling, or a more sophisticated scheduling discipline such as weighted fair
queueing (WFQ), which shares the outgoing link "fairly" among the different end-to-end connections that
have packets queued for transmission. Packet scheduling plays a crucial role in providing quality of
service guarantees. We will cover this topic extensively in section 6.6. A discussion of output port
packet scheduling disciplines used in today's routers is given in [Cisco 1997a].

If the switch fabric is not fast enough (relative to the input line speeds) to transfer all arriving packets
through the fabric without delay, then packet queueing will also occur at the input ports, as packets must
join input port queues to wait their turn to be transferred through the switching fabric to the output port.
To illustrate an important consequence of this queueing, consider a crossbar switching fabric and suppose
that (i) all link speeds are identical, (ii) one packet can be transferred from any one input port to a
given output port in the same amount of time it takes for a packet to be received on an input link, and (iii)
packets are moved from a given input queue to their desired output queue in a FCFS manner. Multiple
packets can be transferred in parallel, as long as their output ports are different. However, if two packets
at the front of two input queues are destined to the same output queue, then one of the packets will be blocked
and must wait at the input queue - the switching fabric can only transfer one packet to a given output port
at a time.

                                 Figure 4.6-6: HOL blocking at an input queued switch

Figure 4.6-6 shows an example where two packets (red) at the front of their input queues are destined for
the same upper right output port. Suppose that the switch fabric chooses to transfer the packet from the
front of the upper left queue. In this case, the red packet in the lower left queue must wait. But not only
must this red packet wait, but so too must the green packet that is queued behind that packet in the lower
left queue, even though there is no contention for the middle right output port (the destination for the
green packet). This phenomenon is known as head-of-the-line (HOL) blocking in an input-queued
switch - a queued packet in an input queue must wait for transfer through the fabric (even though its
output port is free) due to the blocking of another packet at the head-of-the-line. [Karol 1987] shows that,
under certain assumptions, HOL blocking causes the input queue to grow to unbounded length (informally, this is equivalent to
saying that significant packet loss will occur) as soon as the packet arrival rate on the input links reaches
only 58% of their capacity. A number of solutions to HOL blocking are discussed in [McKeown 1997b].
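
A rough feel for this limit can be obtained from a small simulation. The sketch below is our own illustration, not the analysis in [Karol 1987]: it assumes that every input queue always has a packet waiting, that each packet's output port is chosen uniformly at random, and that a contested output picks one of its contenders at random. The function name and parameter values are hypothetical.

```python
import random

def saturated_hol_throughput(n_ports, n_slots, seed=0):
    """Fraction of output link capacity used by a FCFS input-queued switch.

    Assumptions: every input queue is always backlogged, destinations are
    uniform at random, and each output serves one randomly chosen contender
    per time slot. Blocked head-of-line packets keep their destination.
    """
    rng = random.Random(seed)
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]  # head packet destinations
    delivered = 0
    for _ in range(n_slots):
        contenders = {}
        for inp, out in enumerate(hol):
            contenders.setdefault(out, []).append(inp)
        for out, inputs in contenders.items():
            winner = rng.choice(inputs)
            hol[winner] = rng.randrange(n_ports)  # next packet moves to the head
            delivered += 1                        # one packet per output per slot
    return delivered / (n_ports * n_slots)

print(f"{saturated_hol_throughput(32, 2000):.3f}")
```

With a few dozen ports the printed value should come out close to 0.6, well below 1. The shortfall is due entirely to packets that sit behind a blocked head-of-line packet even though their own output port is idle.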


[Cisco 1997a] Cisco Systems, "Queue Management," 1997.
[Cisco 1997b] Cisco Systems, "Next Generation ClearChannel Architecture for Catalyst 1900/2820
Ethernet Switches," 1997.
[Cisco 1998a] Cisco Systems, "Catalyst 8500 Campus Switch Router Architecture," 1998.
[Cisco 1998b] Cisco Systems, "Cisco 12000 Series Gigabit Switch Routers," 1998.
[Degemark 1997] M. Degemark et al., "Small Forwarding Tables for Fast Router Lookup," Proc. 1997
ACM SIGCOMM Conference, (Cannes, France, Sept. 1997).
[Doeringer 1996] W. Doeringer, G. Karjoth, M. Nassehi, "Routing on Longest Matching Prefixes,"
IEEE/ACM Transactions on Networking, Vol. 4, No. 1 (Feb. 1996), pp. 86-97.
[Giacopelli 1990] J. Giacopelli, M. Littlewood, W.D. Sincoskie, "Sunshine: A high performance self-
routing broadband packet switch architecture," 1990 International Switching Symposium.
[Gupta 1998] P. Gupta, S. Lin, N. McKeown, "Routing lookups in hardware at memory access speeds,"
Proc. IEEE Infocom 1998, pp. 1241-1248.
[Kapoor 1997] H. Kapoor, "CoreBuilder 5000 SwitchModule Architecture," 1997.
[Karol 1987] M. Karol, M. Hluchyj, A. Morgan, "Input Versus Output Queueing on a Space-Division
Packet Switch," IEEE Transactions on Communications, Vol. COM-35, No. 12, pp. 1347-1356,
December 1987.
[Keshav 1998] S. Keshav, R. Sharma, "Issues and Trends in Router Design," IEEE Communications
Magazine, Vol. 36, No. 5 (May 1998), pp. 144-151.
[Microsoft 1998] Microsoft Corp., "Microsoft Routing and Remote Access Service for Windows NT
Server 4.0."
[McKeown 1997a] N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, M. Horowitz, "The Tiny
Tera: A Packet Switch Core," IEEE Micro Magazine, Jan-Feb 1997.
[McKeown 1997b] N. McKeown, "A Fast Switched Backplane for a Gigabit Switched Router," Business
Communications Review, Vol. 27, No. 12.
[Partridge 1998] C. Partridge et al., "A Fifty Gigabit per second IP Router," IEEE/ACM Transactions
on Networking, 1998.
[Tobagi 1990] F. Tobagi, "Fast Packet Switch Architectures for Broadband Integrated Networks," Proc.
IEEE, Vol. 78, No. 1, pp. 133-167.
[Turner 1988] J. S. Turner, "Design of a Broadcast packet switching network," IEEE Trans Comm, June
1988, pp. 734-743.
[Feldmeier 1988] D. Feldmeier, "Improving Gateway Performance with a Routing Table Cache," Proc.
1988 IEEE Conference, (New Orleans LA, March 1988).
[Thomson 1997] K. Thomson, G. Miller, R. Wilder, "Wid