

Workshop on Dependability of e-Business Systems

Internet Performance / Availability
from an end-user perspective

Eric Siegel

The Internet Performance Authority
2855 Campus Drive
San Mateo, CA 94403
(650) 522-1000

•   The importance of performance
•   A quick web-technology and Internet-technology tutorial
•   Web page performance factors and benchmarks
•   Transaction performance factors and benchmarks
•   Performance measurement goals, technologies, and issues
•   Load testing for web transactions

Performance Is Important!

“Twenty-eight percent of shoppers who have suffered failed performance
  attempts said they stopped shopping at the web site where they had
  problems, and six percent said they stopped buying at that particular
  company’s off-line store.” (Boston Consulting Group, quoted in Infoworld /
  Computerworld 3/00)
“It takes only 8 ½ seconds for half of the subjects to [give up]” (Peter
  Bickford, “Worth the Wait?” in Netscape/View Source Magazine 10/97)
“Perhaps as much as $4.35 billion in e-commerce sales in the U.S. may be
  lost each year due to unacceptable download speeds and resulting user
  bailout behaviors.” (Zona Research 4/99)
“Fifty-eight percent of online customers surveyed indicated quick
  download time as a key factor in determining whether they would return
  to a web site.” (Forrester Research 1/99)
“One of the top three reasons cited by online shoppers for dissatisfaction
  with a web site is slow site performance.” (Jupiter Communications / NFO
  Worldwide 1/99)
“At one site, the abandonment rate fell from 30% to 6-8% because of a one
  second improvement in load time.” (Zona Research 4/99)

Effects of Poor Performance

• Lost prospective customer
   – If the site didn’t work, or took too long, your prospect may not
     return for a long time – if ever.
• Lost sale
   – If your competitor’s site was up and responsive, you may have
     lost a single sale.
• Lost customer
   – If this happens repeatedly, you’ve lost a customer,
   – AND the customer may stop going to associated web sites and
     physical locations!
• Lost reputation
   – People talk about poor performance; word spreads.
   – People are looking for a few good sites that they can trust!

E-Commerce Performance Challenges

• 24x7 availability and geographic distribution; expectation of
  universal access
• A shared network resource
• No control over customers’ environment
• Multiple servers, which may be geographically distributed,
  participate in a single user interaction
• Dynamic, complex content
• Poor support for session structures
• Potentially massive peak volumes
• Difficult to predict workload mix

An Instant Web Tutorial

• The Domain Name System (DNS), a worldwide hierarchy of
  directories, translates a hostname into an Internet (IP) address.
• TCP/IP carries the data between your browser and the server; it
  detects errors and corrects them by retransmitting.
• The data consists of HTTP, HTML, and the page’s information.
• HTTP (Hypertext Transfer Protocol) carries the Hypertext Markup
  Language (HTML) and provides the basic Web page commands:
    – GET
    – POST
    – Query String
• HTML describes the page:
    – Formatting
    – Content, and the servers/files
      from which that content can be downloaded
    – Links
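
As a rough illustration of how a query string rides on a GET request, the sketch below builds and then parses one with Python's standard library. The host name, path, and parameters are hypothetical and chosen only for the example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical host, path, and parameters, chosen only for illustration.
base = "http://www.example.com/catalog/search"
params = {"item": "widget", "page": "2"}

# A GET request carries its parameters in the URL's query string.
url = base + "?" + urlencode(params)
print(url)

# The server side splits the URL back into a path and its parameters.
parts = urlsplit(url)
query = parse_qs(parts.query)
print(parts.path, query)
```

A POST would carry the same encoded parameters in the request body instead of the URL.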

An Instant Internet Tutorial

[Diagram: Browser, Access Devices, Cache, and Routers connect to the Internet, shown as peering ISPs (PSInet, Verio, Digex, UUnet, Sprint, BBN, GTE, Worldcom, Mindspring); Routers connect the Server and additional Servers.]

Some of the additional servers provide third-party ads; others are distributed
content providers.

Internet Routing Within An ISP

• Routers read every packet’s header and select
  an outgoing path for the next hop
    – Each hop adds delay
    – Routing information is imperfect
        – Packets may use non-optimal paths
        – Packets may loop
        – Routers may not notice moderate
          congestion
• Packets can be discarded by routers or
  otherwise lost
    – Noise in the communications link can corrupt
      packets, causing them to be discarded
    – Hopelessly looping packets are discarded
    – Temporary overflow of router buffers causes
      packet loss
    – Severely overloaded routers tend to lose
      massive numbers of packets in waves

Internet Routing Between ISPs (Peering)

• Internet Service Providers enter into legal contracts
  to carry each other’s traffic
     – Traffic transfer between ISPs occurs at peering
       points
     – Some peering points are public; e.g., MAE-EAST
       (and MAE-WEST!)
     – Other peering points are privately arranged
     – Peering philosophies differ among ISPs
• Congestion may occur at peering points,
  especially public ones!
     – The primary inter-ISP “routing” protocol,
       BGP-4, usually does not look at congestion
• The end-to-end route in one direction is usually
  different from the end-to-end route in the other
     – Depends on legal and financial arrangements
       between ISPs, etc.

Internet Access Providers, Caching, and Distributed Content Providers

• End-users (customers!) connect to
  an Access Device maintained by an
  Internet Access Provider or by their
  corporate IS department
     – Dial-in, xDSL from home
     – LAN link at the office
• Access Device connects to
  routers and then to the Internet
• End-users convert a hostname
  into an Internet address
  by using the Domain Name
  System (DNS) distributed directory
     – A worldwide hierarchy of directories
     – Controlled by authoritative record
        created by hostname’s owner
• Cache or Distributed Content system
  may also be locally available

Domain Name System

• The end-user’s browser asks the end-user’s local DNS server to
  translate the hostname into an IP address
     – The end-user’s local DNS server may be owned by the Access
       Service Provider or by the end-user’s corporation
     – It may be very close, or it may be geographically distant
• DNS servers retain translation information for a period of time
  (“time to live”) controlled by the authoritative name server
     – The authoritative name server is controlled by the name’s owner
• If a DNS server doesn’t have the information, it looks elsewhere in
  the hierarchy
     – It may need to go all the way to the authoritative name server
• Authoritative Name Servers can furnish multiple addresses, etc.
     – For “round-robin” load balancing
• Some proprietary load balancers provide a DNS server function
     – Can make a sophisticated choice of an IP address to send
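
The interplay of local caching, time to live, and round-robin answers can be sketched with a toy model. This is not how a real DNS server is built (real authoritative servers typically return the whole address list in rotated order); the class names, the host name, and the addresses are assumptions made for illustration.

```python
import itertools
import time


class AuthoritativeServer:
    """Toy authoritative name server: hands out its addresses round-robin."""

    def __init__(self, records, ttl_seconds):
        # records maps a hostname to its list of IP addresses.
        self.ttl_seconds = ttl_seconds
        self._cycles = {name: itertools.cycle(addrs)
                        for name, addrs in records.items()}

    def lookup(self, hostname):
        # Return the next address in rotation, plus the time to live.
        return next(self._cycles[hostname]), self.ttl_seconds


class LocalDnsServer:
    """Toy local DNS server that caches answers until their TTL expires."""

    def __init__(self, authoritative, clock=time.monotonic):
        self.authoritative = authoritative
        self.clock = clock
        self._cache = {}  # hostname -> (address, expiry time)

    def resolve(self, hostname):
        cached = self._cache.get(hostname)
        if cached and self.clock() < cached[1]:
            return cached[0]  # still within its time to live
        address, ttl = self.authoritative.lookup(hostname)
        self._cache[hostname] = (address, self.clock() + ttl)
        return address
```

With a 60-second time to live, every user behind the same local DNS server gets the same address until the cached answer expires, so a short time to live spreads load more evenly at the cost of more lookups.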

Caching and Distributed Content Providers

• Many ISPs install Caching systems that retain commonly-
  requested web pages locally.
    – Usually, only static, unchanging content is cached.
    – The web page designer can try to influence the behavior of remote
      caching devices.
    – Caching decreases the amount of traffic that the ISP must pull
      through adjacent ISPs.
    – Caching improves speed to the users.
    – Remember, the browser also caches content.
• Distributed Content Providers have constructed worldwide
  systems of caching and content distribution devices.
    – Usually in partnership with local ISPs
    – These systems may also be able to handle streaming media and
      some dynamically-generated web pages.
    – In some cases, they use distribution systems (leased lines,
      satellite, etc.) that completely bypass the Internet’s core.
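
One way a page designer can try to influence remote caches is through HTTP response headers. The sketch below assumes the HTTP/1.1 Cache-Control header; the function name and the one-day default lifetime are illustrative assumptions, and caches are not obliged to honor the hints.

```python
def cache_headers(is_static, max_age_seconds=86400):
    """Return HTTP response headers that hint how caches should treat a page."""
    if is_static:
        # Static, unchanging content: let shared caches and browsers keep it.
        return {"Cache-Control": f"public, max-age={max_age_seconds}"}
    # Dynamically generated or per-user content: ask caches not to store it.
    return {"Cache-Control": "no-store"}


print(cache_headers(is_static=True))
print(cache_headers(is_static=False))
```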

Server Farm Architecture Summary

                                Security Control

                              Load-sharing Devices

                Web Server   Web Server   Web Server   Web Server

                               Database Back-End

A Definition of Performance

• Web e-commerce performance measures the user's experience
  interacting with your web site, not your in-house experience or
  the experience inside your web hosting center.
    – Download time
    – Transaction Time
        – banking, stock trading, purchasing
    – Availability
    – Errors
        – Failed connection attempts
        – Missing pages
        – Missing page components
        – Broken links
        – Transaction failure
        – Fulfillment failure (product delivery failure)

Web Performance Factors

• The web page seen by the browser is often generated from a
  number of different sources:
   – Ad servers
   – Geographically-distributed content servers
   – Static content servers
   – Dynamic content servers
   – Back-end databases
• Download performance is therefore affected by:
   – Geographic location of the browser
   – Congestion and latency between servers and browser
   – Performance of load-distribution and load-sharing schemes
   – Performance of the servers and their back-end databases
• For example...

Components of Web Page Download Time

[Figure: components of one web page's download time, with the external ad server, the Akamai server redirection delay, and the application delay marked.]

This page includes “Akamaized” content distribution and external ad servers

What Is “Good” Performance?

• Commonly-cited “Eight-Second Rule”
• But a better measure is your competitors’ performance
   – What is your end-user’s frame of reference?
       – Competitors
       – Commonly-accessed consumer sites (Yahoo!, etc.)
       – (how will you measure these sites?)
   – What does your end-user care about?
       – Browsing your catalog quickly
       – Placing orders quickly, without failures or delays
• The location of your end-user affects expectations.
   – On a corporate LAN with high-speed access
   – At home, on a 28.8k modem
   – Using a laptop in the rain at a gas station in Milan

For Comparison...
Performance of Major Web Sites

                       Business Day                  24x7
                  Mean Download  Error Rate   Mean Download  Error Rate
Amazon in           4.5 sec        1.1 %        3.7 sec        0.5 %
IBM in US           3.4 sec        4.8 %        2.5 sec        4.6 %
Yahoo in US         0.7 sec        0.3 %        0.6 sec        0.4 %
KB 40 in US         4.7 sec                     3.8 sec
KB 40 in            7.3 sec                     6.5 sec
Euro 20 in          6.8 sec                     4.0 sec

•   Measured May 1 – 31, 2000 over high-speed links
•   67 measurement locations in U.S., 22 in Europe; each location measures every 15 min
•   “Business Day” is 10:00 a.m. to 4:00 p.m. CDT or MET, M – F.
•   Benchmark measurements available at

Improving Web Page Performance

• Decrease the use of frames and Java.
• Avoid complex, deeply-nested tables.
• Decrease the number of individual page components.
• Decrease the size of each component; “thin” the images.
• Give the viewer something to look at while the page is loading; minimize
  perceived delay.
• Consider using a distributed hosting solution, at least for static page content
     – This is particularly important for cross-oceanic connections!
     – You may be able to serve graphics locally while going to the central
       server system to handle the transaction itself.
• Use flat files (and a naming convention) instead of databases.
• If you’re dynamically generating pages:
     – Be sure to tune the system (add memory; use server cache; etc.)
     – Dynamically starting a process takes time.
• Be sure your ISP has good peering to your customers’ ISPs.

Transaction Performance Factors

• Scaling transactions is much more difficult than scaling simple
  web page delivery!
   – Need to maintain transaction context between web pages,
     associating a user with a transaction-in-progress
       – Use IP address?
          – Different users of one ISP can have same address
          – Users can switch IP addresses in mid-transaction
       – Use a cookie?
          – Users can set browser to refuse cookies
       – Embed user and state information within each page, or each link?
          – Requires dynamic page generation logic
       – Use Secure Socket Layer (SSL) session ID?
          – Load balancer may need to be one end of the secure connection
• Need to detect and handle abandoned transactions
   – Each “active” session consumes server memory
   – A timeout value is a reasonable technique
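
A minimal sketch of the timeout technique follows, assuming the session ID reaches the server by one of the mechanisms above (a cookie, or a value embedded in each link). The class name, the 15-minute default, and the in-memory dictionary are illustrative assumptions, not a production design.

```python
import time
import uuid


class SessionStore:
    """Keeps per-user transaction state and discards abandoned sessions."""

    def __init__(self, timeout_seconds=900, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self._sessions = {}  # session ID -> (state dict, last-seen time)

    def create(self):
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = ({}, self.clock())
        return session_id

    def get(self, session_id):
        """Return the session state and refresh its timer, or None."""
        entry = self._sessions.get(session_id)
        if entry is None:
            return None
        state, last_seen = entry
        if self.clock() - last_seen > self.timeout_seconds:
            del self._sessions[session_id]  # timed out: treat as abandoned
            return None
        self._sessions[session_id] = (state, self.clock())
        return state

    def purge_abandoned(self):
        """Drop sessions idle longer than the timeout; return how many."""
        now = self.clock()
        stale = [sid for sid, (_, seen) in self._sessions.items()
                 if now - seen > self.timeout_seconds]
        for sid in stale:
            del self._sessions[sid]
        return len(stale)
```

Calling `purge_abandoned()` periodically bounds the memory held by sessions whose users have walked away. With several web servers behind a load-sharing device, the store would have to be shared or the user pinned to one server.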

Components of a Transaction’s Download Time

Keynote Broker Trading Index

• Average response times and success rates for creating a
  standard stock-order transaction
    – Enters the brokerage’s home page, then logs on, obtains a stock
       quote, creates an order to buy stock, and logs out before
       confirming the order.
• Measurements are performed every 15 minutes between 9 a.m.
  and 4 p.m. EST during market trading days.
• From 10 major metropolitan areas in the U.S.
• Unsuccessful transactions include those in which any Web page
  fails to download completely and those that do not complete
  within a specified time limit.
    – The time limit for a transaction is calculated by multiplying the
       number of Web pages in a transaction by 12 seconds.
    – Each week, individual brokerage error rates typically range from
       0.2% to 30%
• Available at
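
The success rule described above can be written down directly. The 12-second per-page allowance comes from the index definition; the function names and the data layout (a list of per-page times, with None for a page that failed to download) are assumptions for this sketch.

```python
SECONDS_PER_PAGE = 12  # per-page allowance from the index definition


def transaction_succeeded(page_times):
    """page_times: seconds for each page, or None if the page failed."""
    if any(t is None for t in page_times):
        return False  # a page failed to download completely
    time_limit = len(page_times) * SECONDS_PER_PAGE
    return sum(page_times) <= time_limit


def error_rate(transactions):
    """Percentage of measured transactions that were unsuccessful."""
    failures = sum(not transaction_succeeded(t) for t in transactions)
    return 100.0 * failures / len(transactions)
```

For a five-page transaction the limit is 60 seconds, so one very slow page can be absorbed by four fast ones, while a single failed page always counts as an error.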

For Comparison...
Keynote Broker Trading Index

[Chart: monthly Total Time (mean transaction time in seconds, left axis 0 to 30) and Error Rate (percent, right axis 0 to 16), Aug '99 through May '00.]

Improving Transaction Performance

• Decrease the number of pages required per transaction
    – Each page is a new chance for connection failure
• Measure performance
    – Detect performance issues and triage them quickly!
    – Use proxy Agents in geographic locations of customer groups
    – Count abandoned shopping carts, etc.
• Plan for failure
    – How will customer get help?
    – Number to call; transaction ID

Performance Measurement Goals

• Evaluation of improvements and competition
    – From a stable, representative set of measurement agents
    – Long-term trending and benchmarks
• Quick diagnostics and triage when problems occur.
    – Get the problem assigned to the proper support groups quickly.
    – Use a “white box” unloaded server for comparisons
• System tuning
    – Where, in the complex system, are the bottlenecks?
    – How are response time and availability affected by site traffic?
        – Complex if users and servers are geographically distributed!
    – How are response time and availability affected by background
      traffic and events on the Internet?
• Prepare for and evaluate load testing
    – Understand load details
    – Validate load-test results against production performance

A Note About Availability

• Combination of MTBF (Mean Time Between Failures) and MTTR
  (Mean Time To Repair)
• Affected by error rate?
    – At what error rate or pattern is the system “unavailable”?
• Affected by time of day or date?
    – Do you care if the system is down at 1am Eastern Time on Sundays?
• What must be “available”, and from where?
    – Designated servers?
    – Access to backbone routers?
    – Access to specific gateways or routers?
    – Designated end-to-end paths?
• How is it measured?
    – Sampling by testing devices?
    – Sampling from the designated servers, etc?
    – What is measurement granularity?
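
The usual way to combine the two figures is the steady-state ratio MTBF / (MTBF + MTTR), which assumes MTBF is measured as mean uptime between failures. The sketch below applies it; the function names and the sample figures are illustrative only.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability, treating MTBF as mean uptime per failure."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail, hours_per_year=24 * 365):
    """Expected hours of unavailability in a (non-leap) year."""
    return (1.0 - avail) * hours_per_year


# Illustrative figures: one failure per 999 hours of uptime, 1 hour to repair.
a = availability(mtbf_hours=999, mttr_hours=1)
print(f"availability = {a:.3%}, downtime = {downtime_hours_per_year(a):.2f} h/year")
```

The single number hides the questions on this slide: it says nothing about when the downtime falls, what was unreachable, or from where.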

Measurement Technologies

• Element vs. End-to-End
• Active vs. Passive
• Quick Overviews of:
    – Element Measurement (usually Passive)
    – Active End-to-End
    – Passive End-to-End

Element vs. End-to-End Measurement

• Element (“point measurement”)
    – Shows only the behavior within a particular network element
      (router, switch, link, server)
    – Network internal measures are crucial for problem solving.
    – Must be correlated with End-to-End View, for quick fixes to
      problems seen by end users.
• End-to-End
    – End-to-end . . . but . . . where are the “ends” in “end-to-end”?
    – Network internals are usually irrelevant and confusing to network
      users
    – Used in constructing Service Level Agreements (SLAs)

Active vs. Passive Measurement

• Active Measurement adds traffic to the system
   – Special software or hardware / software Agents perform scripted
      transactions, “pings,” and other simulated end-user actions
   – Based on sampling
• Passive Measurement watches real users
   – Watches existing traffic and system components
   – Can sample or can look at every packet and at other data
   – Great for Network Operators

Element Measurement

• This is measurement of individual network and server elements
    – Great for system operators and for triage
    – Necessary for tracking load vs. element utilization
• Typical element measures:
    – Device status: CPU, memory, link utilization; queue sizes
    – Identity of heavy users, hardware port flows
    – Bandwidth usage measurement
    – Application statistics (page hits, user counts, abandoned shopping
      carts, etc.)
• Measurement technology can be active or passive.
    – Passive measurement is the most common and may not need to
      be based on sampling.
    – Some passive measurement tools can examine frame or packet
      headers to track response times, etc.
    – Active measurement helps correlation with end-user views.

Typical Passive Measurement of Elements

[Diagram: passive element measurement points, including passive disk usage measures and a passive router SNMP component.]

Active Measurement of the End-User’s Experience

• Network-level pings, etc. are useful for debugging, but are not a
  true measurement of end-user experience.
    – Reaches only outskirts of web hosting system, not the application
    – Does not indicate the health of application
    – Usually is not directly correlated with end-user’s web page
      performance
• Automated measurement agents run scripts to download web
  pages and run transactions.
    – Includes non-network (e.g., server) time
    – May include detailed component measurements that are useful for
      triage and trending
    – Finds errors as seen by users
• Active measurement of the end-user’s experience builds a
  baseline that can be used to evaluate any single-site or
  distributed web serving solution, even if the web serving
  solution’s technology changes over time.

Typical Active End-User Measurement

• Details of download can be
  timed and displayed.
• Download details can be
  trended over time.
• Includes:
   –   DNS lookup
   –   TCP connection complete
   –   Redirections complete
   –   First packet of base page
   –   Base page complete
   –   Content (gifs, etc.) complete
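
A very rough sketch of how an agent might timestamp these milestones for a base page is shown below, using only Python's socket module and plain HTTP/1.0. It does not follow redirections or fetch embedded content, and the function name and parameters are assumptions for illustration rather than any measurement product's method.

```python
import socket
import time


def time_page_download(host, port=80, path="/", timeout=10.0):
    """Return seconds elapsed since the start at each download milestone."""
    marks = {}
    start = time.perf_counter()

    # DNS lookup: translate the hostname into an address.
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM)[0]
    marks["dns_lookup"] = time.perf_counter() - start

    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(addr)
        marks["tcp_connect"] = time.perf_counter() - start

        request = (f"GET {path} HTTP/1.0\r\nHost: {host}\r\n"
                   "Connection: close\r\n\r\n")
        sock.sendall(request.encode("ascii"))

        # The first data to arrive marks the first packet of the base page.
        chunk = sock.recv(4096)
        marks["first_packet"] = time.perf_counter() - start

        # Keep reading until the server closes the connection.
        received = len(chunk)
        while chunk:
            chunk = sock.recv(4096)
            received += len(chunk)
        marks["base_page_complete"] = time.perf_counter() - start

    marks["bytes_received"] = received  # headers plus body
    return marks
```

Run from several locations at a fixed interval, the differences between successive milestones show whether a slowdown sits in DNS, in connection setup, or in the server's response.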

Active Measurement Issues

• What should be measured? (Page URLs? Transaction scripts?)
   – How will you measure your competitors?
• What sampling rate is sufficient?
• How many Agents are needed, where should they be located, and
  how should they be built?
   – Stand-alone Agents, on dedicated workstations, located at sites
     you control
   – Applets downloaded into user machines
       – Do you have permission?
       – Do you have control? (Are these user machines portable? Do
         users reconfigure them without telling you? How will you build
         long-term trending baselines using this data?)
       – What if there aren’t any active users in an important location?
         How do you detect communications failures?
   – Measurement services (e.g., Keynote Systems) that can represent
     “Internet” users in the world at large

Passive Measurement of the End-User’s Experience

• Watches actual end-user performance
    – Can be embedded in end-user’s browser
    – An intermediate network device can examine packet headers to
      track response times, etc.
    – Server application can use an API to send signals to the measurement tool.
    – This can’t usually be used to measure competitors
• Some passive tools can examine web server data logs to track
  locations of users and their web site activity.
    – This can track every user, without delaying production (if analysis
      is done off-line).
    – But this won’t see failures and time delays caused by page
      elements that are delivered from a different geographical location
      (e.g., ad servers)
    – And it can’t be used to measure competitors
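
Off-line log analysis of this kind can be sketched in a few lines. The example assumes the server writes the Common Log Format; the regular expression, the function name, and the sample addresses are illustrative, and real logs often need more tolerant parsing.

```python
import re
from collections import Counter

# host ident authuser [date] "request line" status bytes
CLF = re.compile(r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
                 r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
                 r'(?P<status>\d{3}) (?P<size>\S+)')


def summarize(log_lines):
    """Count hits per page, errors per page, and requests per client."""
    hits, errors, clients = Counter(), Counter(), Counter()
    for line in log_lines:
        m = CLF.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        hits[m["path"]] += 1
        clients[m["host"]] += 1
        if int(m["status"]) >= 400:
            errors[m["path"]] += 1
    return hits, errors, clients


sample = [
    '192.0.2.10 - - [01/May/2000:10:00:00 -0500] "GET /index.html HTTP/1.0" 200 5120',
    '192.0.2.11 - - [01/May/2000:10:00:02 -0500] "GET /index.html HTTP/1.0" 200 5120',
    '192.0.2.10 - - [01/May/2000:10:00:03 -0500] "GET /missing.gif HTTP/1.0" 404 210',
]
hits, errors, clients = summarize(sample)
print(hits.most_common(), errors.most_common(), clients.most_common())
```

As the slide notes, a log like this records only what reached this server; it has no entry for an ad that failed to load from someone else's server.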

Passive Measurement Issues

• What should be measured? (Page URLs? Transaction scripts?)
• How will you track true response time as seen by end-user, not just
  pieces of that response time?
• How will you standardize measurements for long-term trending?
• Location of passive measurement probes?
    – Will the sampled pages and transactions be representative of all
    – For probes downloaded into user machines:
       – Do you have permission?
       – Do you have control? (Are these user machines portable? Do
          users reconfigure them without telling you? How will you build
          long-term trending baselines using this data?)
       – What if there aren’t any active users in an important location?
          How do you detect communications failures?
• Does passive measurement affect response time?

Load Characterization and Measurement

• Understanding load patterns and measuring load is as important
  as measuring performance.
    – Customer response to marketing campaigns
    – Changes in usage patterns of web site
    – Correlation of load, system element utilization, and response time
    – Gathering data for testing
• Characterization (understanding load patterns) gathers data
  about how individual end-users travel through the site.
• Load Measurement gathers data on the load presented to the
  server system
    – The server system may be geographically distributed.

Site Testing

• If your site breaks under load, it’s very easy for your customers
  to click away ... and they will!
• Unfortunately, it’s risky to predict behavior beyond what you’ve
  already seen.
     – System performance is non-linear at best.
        – Just one additional user may reveal a bug in the system.
        – Response time vs. load is probably exponential.
        – Internet traffic is worse than exponential; it’s fractal.
     – Advertising or other factors may change the pattern of accesses to
       your site.
        – The total count of visitors may stay the same, but their paths
          through the site may change.
        – If people suddenly start to buy, instead of browse, will that
          break your site?

Types of Testing – 1

• Functional testing – does the site work at all? – is critically important,
  but it’s not enough.
    – Functional tests find missing elements, broken links, errors.
    – Most functional tests succeed even if they have to wait for a long
       time; most real users have abandoned the site by then.
• Load testing measures the response of the site to a specified load.
    – This type of testing can be used to measure the effects of
       changes to the web system.
    – The load in a “load test” doesn’t need to stress the system; many
       load tests are designed to emulate a normal load.
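
A load test in this sense can be sketched as a set of simulated users issuing requests at the same time and recording what they see. In the sketch below, `request_fn` is a placeholder for whatever performs one page download or transaction; the function name and the summary fields are assumptions for illustration.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(request_fn, concurrent_users, requests_per_user):
    """Call request_fn from several simulated users; return a timing summary."""

    def one_user(_user_number):
        timings, failures = [], 0
        for _ in range(requests_per_user):
            started = time.perf_counter()
            try:
                request_fn()
            except Exception:
                failures += 1  # count any error as a failed request
            else:
                timings.append(time.perf_counter() - started)
        return timings, failures

    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        results = list(pool.map(one_user, range(concurrent_users)))

    timings = [t for user_timings, _ in results for t in user_timings]
    failures = sum(f for _, f in results)
    total = concurrent_users * requests_per_user
    return {
        "requests": total,
        "error_rate_percent": 100.0 * failures / total,
        "mean_seconds": statistics.mean(timings) if timings else None,
        "max_seconds": max(timings) if timings else None,
    }
```

Running the same test before and after a change to the web system, at a load that emulates normal traffic, gives a like-for-like comparison of response time and error rate.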

Types of Testing – 2

• Stress testing finds the instantaneous breaking points.
    – Under what load level, and what type of load, does the system fail
      or provide unacceptable response times?
    – Will load-sharing devices fail?
    – Will database replies time-out and result in empty pages?
• Endurance testing measures system performance after a sustained
  high load.
    – System performance may degrade after a large number of users
        – Poor re-use of system resources
        – Poor handling of abandoned sessions
    – Some systems may break entirely after a sustained high load.

Testing Tools

• Most testing is done within the server site.
    – Functional Testing (if all page components are within the site)
    – Initial stress testing
    – Testing of web server and back-end database integration
• Final testing should be done across the network.
    – Find problems with geographically-distributed systems
        – Distributed servers
        – DNS difficulties
    – Find problems in Internet connectivity
        – ISP connectivity
        – Network aggregation bottlenecks (routers, etc.)
        – Peering to ISPs that are used by customers
• Use realistic scripts!

Scripts for Testing and Measuring

• What is the definition of “response time”?
    – Will you get statistics for each page, not just for the total transaction?
• Can your scripts react to the received web pages?
    – Different “think time” for different responses
    – Transaction abandonment if response time is too long
• Which transactions are used by important users?
    – Which transactions are used by customers who are buying?
    – Which transactions are used by irritable, politically-powerful users?
• The transaction sets should exercise all major parts of the system.
• Create different transaction sets for different situations, e.g.,
    – Monthly cycles
    – Special advertising promotions
• System designers and administrators will optimize the system to
  make the measurements look good; be sure that when they do that,
  they’ll also help the end users!
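
A script that reacts to what it receives might look like the sketch below: it records a response time for every page, chooses a think time based on the response, and abandons the transaction when a page takes too long. The `fetch` and `think_time_for` callables and all names are placeholders assumed for illustration.

```python
import time


def run_script(pages, fetch, think_time_for, abandon_after_seconds,
               sleep=time.sleep, clock=time.perf_counter):
    """Run one scripted transaction.

    Returns (list of (page, response time in seconds), completed flag).
    """
    page_times = []
    for page in pages:
        started = clock()
        response = fetch(page)
        elapsed = clock() - started
        page_times.append((page, elapsed))
        if elapsed > abandon_after_seconds:
            return page_times, False  # a real user would have given up
        # Pause as a user would, depending on what came back.
        sleep(think_time_for(page, response))
    return page_times, True
```

Because the per-page times are kept separately, the result shows which page of the transaction caused an abandonment, not just that the total was slow.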

... and remember ...

Even after the site seems to be running perfectly,
  you must measure and monitor continually

To Avoid the Dreaded . . .

Nightmare on Web Street!

[Chart annotations: “Webmaster goes home” and “Webmaster arrives at work”]



Keynote Systems, “The Internet Performance Authority,” is the world’s
leading supplier of Internet performance measurement, diagnostic, and consulting services to
companies with e-commerce web sites. Keynote captures over 24 million performance
measurements daily, using Keynote’s global infrastructure of nearly 500 measurement computers
connected to the major Internet backbones from over 120 statistically selected Internet access
locations representing 50 metropolitan areas worldwide. Internet performance and availability data
are collected at Keynote’s sophisticated operations center and are instantly available to customers
through any Web browser. Keynote currently measures individual web pages as well as
transactions and streaming media. Keynote also supplies web load testing services through its
recent acquisition of Velogic, Inc.

Eric Siegel is a Senior Internet Consultant with Keynote Systems, Inc. and is the author of Designing
Quality of Service Solutions for the Enterprise (John Wiley & Sons, 1999). Before joining Keynote
Systems, Mr. Siegel was a Senior Network Analyst at NetReference, Inc., which specializes in
network architectural design and strategic planning, and he was a Senior Network Architect with
Tandem Computers, where he was the technical leader and coordinator for all of Tandem's data
communications specialists worldwide. Mr. Siegel also worked for Network Strategies, Inc. and for
the MITRE Corporation, where he specialized in computer network design and performance
evaluation. Mr. Siegel received both his B.S. and M.E.E. degrees in Electrical Engineering from
Cornell University, and he has been a member of the Internet community since 1978.

