TUTORIALS
Sunday, June 25, 2000

8:30 AM – 12:00 NOON
TUTORIAL 1:
Building Dependable Systems: the Power of Negative Thinking
Chuck Howell, MITRE Corporation

There is a natural human tendency to be optimistic and to focus on the positive functional capabilities a new software-intensive system will provide. However, for systems that must be trustworthy or dependable, there is much to be gained from “negative thinking”: at each stage of development, considering all the ways things could go wrong. This half-day tutorial will use case studies to illustrate the importance of hazard analysis, error-handling design, stress testing, fault injection, and other “negative” tasks. The tutorial is divided into three sections corresponding to stereotyped development stages.

• Requirements:
A variety of tools and techniques have evolved to support the identification and management of potential hazards in complex critical systems. Notable examples of these techniques include Hazard and Operability Studies (HAZOP), Failure Modes and Effects Analysis (FMEA), Fault Tree Analysis (FTA), and Deviation Analysis. We will explore a range of requirements issues to be considered from the perspective of what could go wrong and how the system should deal with it.

• Design and Construction:
The focus of this section is on error-handling models, graceful degradation, and the use of techniques such as assertions and Design By Contract. Subtle mismatches among error-handling portions of subsystems can be extremely difficult to uncover, and are therefore an important cause of latent software defects.

• Testing and Validation:
Testing to provide evidence of robustness differs in focus from general software testing to demonstrate functional behavior. A great deal of emphasis must be placed on demonstrating that, even under stressful conditions, the software does not exhibit fragile behavior. This requires a considerable amount of fault injection, boundary condition and out-of-range testing, and exercising of those portions of the input space that are related to potential failures (e.g., critical operator functions and interactions, deliberate security attacks). It also includes test coverage analysis (both functional and white box) to ensure that the error detection and recovery aspects of the system are well

Chuck Howell is Chief Engineer of the Joint and Defense Wide Systems Division at the MITRE Corporation. Previously he was Chief Engineer in the Systems Technology Center of Mitretek Systems, Inc., Director of Consulting Services at Reliable Software Technologies, a Java Technologist at Sun Microsystems, and a
Principal Scientist at the MITRE Corporation focusing on Critical Software Assurance for organizations such as the U.S. Department of Defense, FAA, Nuclear Regulatory Commission, and commercial organizations such as electric utilities and FDA-regulated manufacturers. His current interests include techniques to calibrate and reduce residual doubt about the behavior of critical systems, and approaches to making large Networked Information Systems more robust (i.e., less fragile). He is a Senior Member of the IEEE and a Member of the ACM.

8:30 AM – 12:00 NOON
TUTORIAL 2:
Requirements, Technologies, and Architectures for Large Web Sites
Steven Hunter, IBM Corporation

As computing capability continues to track Moore’s law and networking bandwidth grows, the network and the computer are becoming indistinguishable. One example of this is the web-hosting environment. The technology required to deploy a web site is very much a mixture of what was once considered network or server specific. This tutorial will provide insight into the workings of a large web site and will offer an understanding of the key requirements and technologies that are used as components. Some of the topics covered will include: Dynamic Load Balancing, Quality of Service (QoS), and System Area Networks, such as

After completing his bachelor’s degree at Auburn University, Steve joined IBM in March 1984 and worked on the development of networking products, such as the 3708 Protocol Converter and 3710 Controller. In 1986, Steve began working on 3745 Communication Controller products where he was involved with several hardware and software projects. In 1988, Steve completed his Master’s degree from NC State University, which focused on networking architecture and technology. In 1991, Steve moved into an area to focus on the architecture and design of network routing products. In this position, Steve had responsibilities in both hardware and software with emphasis on hardware architecture, networking protocols and fault tolerance. In 1994, Steve began a 2-year resident study term at Duke University to complete his Ph.D. His primary research interests at Duke were in the areas of high-speed networking, distributed architectures and fault tolerance. Upon returning to IBM in 1996, Steve worked on ISP networking technology, then moved into the Netfinity server organization. As part of Netfinity, Steve is in the group with server architecture and technology responsibility and has had involvement and responsibilities with multiprocessing systems, clustering, server/network attachment, system area networks (e.g., InfiniBand), systems management and RAS. Steve holds several patents and has published papers and presented at a variety of conferences and symposiums.

1:30 PM – 5:00 PM
TUTORIAL 3:
Fault Tolerant CORBA
Shalini Yajnik, Lucent-Bell

CORBA (Common Object Request Broker Architecture) is a platform for object-oriented
distributed computing, standardized by the Object Management Group (OMG). It provides middleware upon which distributed applications can be built very quickly and easily. However, until recently CORBA did not provide any tools for enhancing the reliability of applications. As a result, use of CORBA in building highly reliable distributed systems was limited. OMG recognized this problem and issued an RFP for fault tolerant CORBA in April 1998. For the past year, an industry group has been working to respond to this RFP. In January 2000 the group presented a proposal for Fault Tolerant CORBA to the OMG, which will move towards standardization in the coming months. The proposed standard will provide a standard and efficient way for developers to build highly reliable and available distributed applications.

In this tutorial I will give a brief overview of CORBA and then go on to discuss the proposed Fault Tolerant CORBA specification, which makes use of replication of CORBA objects to provide desired levels of reliability and availability to a system. I will talk about the wide spectrum of fault tolerance covered by the specification and how it can be used to build applications that require widely varying degrees of fault tolerance, e.g., (1) from stateless systems like replicated web servers whose clients require simple failover mechanisms, to complex defense and telecom systems that require five nines reliability and availability along with strong data consistency, and (2) from systems that do not want to deal with fault tolerance issues and would like the fault tolerance to be fully automated, to systems that require a high degree of application-specific control under failure conditions.

BIOGRAPHY:
Shalini Yajnik is a Member of the Technical Staff in the Distributed Software Research department at Lucent Technologies, Bell Laboratories. She graduated with a Ph.D. degree from Princeton University in 1994. Her research interests are software level fault tolerance and distributed object systems, with the primary focus on studying the impact of failures on distributed object systems and developing fault tolerance solutions for distributed object platforms like CORBA and Java RMI.

1:30 PM – 5:00 PM
TUTORIAL 4:
Exception Handling and Software Fault Tolerance
Jie Xu, University of Durham
Brian Randell, University of Newcastle upon Tyne

As the use of computer systems becomes more and more widespread in applications that demand high levels of dependability, these applications themselves are growing in size and complexity at a rapid rate, especially in areas that require concurrent and distributed computing. Such complex systems are very prone to faults and errors from a variety of sources, including the devices and people in the environment of the computer system, the computer and communications hardware, etc. Moreover, given such complexity, no matter how rigorously fault avoidance and fault removal techniques are applied, software design faults often remain in systems when they are delivered to the customers. There is a tremendous need for systematic techniques for building dependable software for such systems.

This half-day tutorial will focus on some basic concepts, state-of-the-art techniques, and state-of-practice approaches to exception handling and software fault tolerance in both sequential programs and complex concurrent systems. The first part of the tutorial will start with an account of both programmed and default exception handling methods in sequential modular programs, and then go on to describe the recovery block approach to software fault tolerance and subsequent extensions to this scheme. The second part of the tutorial will talk about coordinated exception handling and software fault tolerance in concurrent and distributed systems. It will cover the generalized conversation scheme for handling exceptions in process-oriented systems and present the coordinated atomic (CA) action scheme for concurrent object-oriented systems, illustrated with an industrial control application. The CA action approach is in fact based on a very sophisticated exception handling scheme, capable of dealing appropriately even with very complex situations, including multiple concurrent faults.

BIOGRAPHIES:
Jie Xu is a Lecturer in the Department of Computer Science, University of Durham, UK. He received the Ph.D. degree from the University of Newcastle upon Tyne on Advanced Fault-Tolerant Software. From 1990 to 1998, Dr. Xu was with the Computing Laboratory at Newcastle where he was promoted to a Senior Researcher in 1995. He moved to a Lectureship at Durham in 1998 and cofounded the Distributed Systems Engineering group and the DPART laboratory supporting highly dependable enterprise computing. Dr. Xu has published more than 90 academic papers and refereed reports in areas of system-level fault diagnosis, exception handling, software fault tolerance, and large-scale distributed applications. He has been involved in several research projects on dependable distributed computing systems, including two EC-sponsored ESPRIT BRA projects and one ESPRIT Long Term Research project. He is Principal Investigator of the FTNMS project on fault-tolerant mechanisms for multiprocessors and co-Investigator of the EPSRC Flexx project on highly flexible software.

Brian Randell graduated in Mathematics from Imperial College, London in 1957 and joined the English Electric Company where he led a team which implemented a number of compilers, including the Whetstone KDF9 Algol compiler. From 1964 to 1969 he was with IBM, mainly at the IBM T.J. Watson Research Center in the United States, working on operating systems, the design of ultra-high speed computers and system design methodology. He then became Professor of Computing Science at the University of Newcastle upon Tyne, where in 1971 he set up the project which initiated research into the possibility of software fault tolerance, and introduced the “recovery block” concept. Subsequent major developments included the Newcastle Connection, and the prototype Distributed Secure System. He has been Principal Investigator on a succession of research projects in reliability and security funded by the Science Research Council (now Engineering and Physical Sciences Research Council), the Ministry of Defence, and the European Strategic Programme of Research in Information Technology (ESPRIT). Most recently he has had the role of Project Director of DeVa, the ESPRIT Long Term Research
Project on Design for Validation, and of CaberNet, the ESPRIT Network of Excellence on Distributed Computing Systems Architectures, and is now leading an IST Research Project on Malicious- and Accidental-Fault Tolerance for Internet Applications (MAFTIA). He has published nearly two hundred technical papers and reports, and is coauthor or editor of seven books.

1:30 PM – 5:00 PM
TUTORIAL 5:
Performance of TCP on Error-Prone Wireless Links
Nitin H. Vaidya, Texas A&M University

This tutorial deals with the impact of wireless transmission errors on the performance of TCP, and techniques for improving performance in the presence of such errors. It will provide the attendees with an overview of the state of the art in TCP for error-prone wireless links, and an understanding of the basic principles that guide the design of mechanisms to improve TCP performance in wireless environments.

Topics will include an overview of wireless technologies, an overview of relevant TCP features, the impact of wireless errors on TCP performance, and classification of approaches to improve TCP performance in the presence of transmission errors. A detailed discussion of a few representative approaches will be presented, followed by a summary of other

Topics to be discussed include the impact of link layer retransmission on TCP performance, the split connection approach, explicit notification schemes, the Snoop protocol, and the delayed dupacks protocol to improve TCP performance in the presence of transmission errors.

BIOGRAPHY:
Nitin Vaidya received the Ph.D. degree from the University of Massachusetts at Amherst. He is presently an Associate Professor of Computer Science at Texas A&M University. He has held visiting positions at Microsoft Research, Sun Microsystems, and Indian Institute of Technology-Bombay. His research interests include wireless networking, mobile computing, and fault-tolerant computing. He is a speaker for the Distinguished Visitor Program of the IEEE Computer Society, a recipient of a CAREER award from the National Science Foundation, a coauthor of the Best Student Paper Award-winning paper at MOBICOM ’98, and of a 1999 IETF PILC working group internet-draft dealing with performance implications of transmission errors. Nitin served as program cochair for the 1999 International Workshop on Discrete Algorithms and Methods for Mobile Computing and Communication (DIAL M). He has also served on the program committees of several other conferences and workshops, including INFOCOM 2000, 1999 ACM Symposium on Principles of Distributed Computing (PODC), 1999 Workshop on Data Engineering for Wireless and Mobile Access (MobiDE), and 1998 International Workshop on Satellite-based Information Services (WOSBIS). Vaidya is a senior member of the IEEE and member of the