performance
Document Sample


Metrics and Techniques for
Evaluating the Performability
of Internet Services
Pete Broadwell
pbwell@cs.berkeley.edu
Outline
1. Introduction to performability
2. Performability metrics for
Internet services
• Throughput-based metrics (Rutgers)
• Latency-based metrics (ROC)
3. Analysis and future directions
Motivation
• Goal of ROC project: develop metrics to
evaluate new recovery techniques
• Problem: concept of availability
assumes system is either “up” or
“down” at a given time
• Availability doesn’t capture system’s
capacity to support degraded service
– degraded performance during failures
– reduced data quality during high load
What is “performability”?
• Combination of performance and
dependability measures
• Classical defn: probabilistic (model-
based) measure of a system’s “ability to
perform” in the presence of faults1
– Concept from traditional fault-tolerant
systems community, ca. 1978
– Has since been applied to other areas, but
still not in widespread use
1 J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
Performability Example
Discrete-time Markov chain (DTMC) model
of a RAID-5 disk array1
w0(t) (D+1)l w1(t) Dl w2(t)
p0(t) p1(t) p2(t)
m
Normal 1 disk failed, Failure -
operation repair necessary data loss
D = number of data disks pi(t) = probability that system is in state i at time t
m = disk repair rate wi(t) = reward (disk I/O operations/sec)
l = failure rate of a single disk drive
1 Hannu H. Kari, Ph.D. Thesis, Helsinki University of Technology, 1997
Performability for Online
Services: Rutgers Study
• Rich Martin (UCB alum) et al. wanted to
quantify tradeoffs between web server
designs, using a single metric for both
performance and availability
• Approach:
– Performed fault injection on PRESS, a
locality-aware, cluster-based web server
– Measured throughput of cluster during
simulated faults and normal operation
Degraded Service During a
PRESS Component Fault
Throughput
FAILURE RESET
RECOVER (optional)
Requests/sec
STABILIZE
REPAIR
(human
operator)
DETECT
Time
Calculation of Average
Throughput, Given Faults
Throughput
Normal throughput
Requests/sec
Average throughput
Time
Degraded throughput
Behavior of a
Performability Metric
Effect of improving degraded performance
Performability
Performance during faults
Behavior of a
Performability Metric
Effect of improving component availability
(shorter MTTR, longer MTTF)
Performability
Aavailability = MTTF
MTTF + MTTR
MTTR MTTF
Behavior of a
Performability Metric
Effect of improving overall performance
Performability
Overall performance (includes normal operation)
Most performability metrics scale linearly as
component availability, degraded performance
and overall performance increase
Results of Rutgers Study:
Design Comparisons
90
80
Reduced human monitoring
70
Performability
Original system
60
RAID storage subsystem
50
40
30
20
10
0
I-PRESS TCP-PRESS ReTCP-PRESS VIA-PRESS
Web server version
An Alternative Metric:
Response Latency
• Originally, performability metrics were
meant to capture end-user experience1
• Latency better describes the
experience of an end user of a web site
– response time >8 sec = site abandonment
= lost income $$2
• Throughput describes the raw
processing ability of a service
– best used to quantify expenses
1 J.F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
2 Zona Research and Keynote Systems, The Need for Speed II, 2001
Effect of Component Failure
on Response Latency
Response
latency
(sec) Abandonment
region
8s
Annoyance
region?
Time
FAILURE REPAIR
Issues With Latency As a
Performability Metric
• Modeling concerns:
– Human element: retries and abandonment
– Queuing issues: buffering and timeouts
– Unavailability of load balancer due to faults
– Burstiness of workload
• Latency is more accurately modeled at
service, rather than end-to-end1
• Alternate approach: evaluate an
existing system
1 M. Merzbacher and D. Patterson, Measuring End-User Availability on the Web:
Practical Experience, 2002
Analysis
• Queuing behavior may have a
significant effect on latency-based
performability evaluation
– Long component MTTRs = longer
waits, lower latency-based score
– High performance in normal case =
faster queue reduction after repair,
higher latency-based score
• More study is needed!
Future Work
• Further collaboration with Rutgers on
collecting new measurements for
latency-based performability analysis
• Development of more realistic fault and
workload models, other performability
factors such as data quality
• Research into methods for conducting
automated performability evaluations of
web services
Metrics and Techniques for
Evaluating the Performability
of Internet Services
Pete Broadwell
pbwell@cs.berkeley.edu
Back-of-the-Envelope
Latency Calculations
• Attempted to infer average request latency for
PRESS servers from Rutgers data set
– Required many simplifying assumptions, relying
upon knowledge of PRESS server design
– Hoped to expose areas in which throughput- and
latency-based performability evaluations differ
• Assumptions:
– FIFO queuing w/no timeouts, overflows
– Independent faults, constant workload (also the case
for throughput-based model)
• Current models do not capture “completeness”
of data returned to user
Comparison of
Performability Metrics
35000
30000
Performability
25000
Latency-based
20000 peformability
15000 Throughput-based
performability
10000
5000
0
I-PRESS TCP- ReTCP- VIA-
PRESS PRESS PRESS
Web server versions
Rutgers calculations for
long-term performability
Goal: metric that scales linearly with both
- performance (throughput) and
- availability [MTTF / (MTTF + MTTR)]
Tn = normal throughput for server
AI = ideal availability (.99999)
Average throughput (AT) =
Tn during normal operation + per-
component throughput during failure
Average availability (AA) = AT / Tn
Performability = Tn x [log(AI) / log(AA)]
Results of Rutgers study:
performance comparison
Throughput
7000
Requests/sec
6000
5000
4000
3000
2000
1000
0
I-PRESS TCP-PRESS ReTCP-PRESS VIA-PRESS
PRESS Version
Results of Rutgers study:
availability comparison
Unavailability by Component
0.005 application crash
% Unavailability
node freeze
0.004 node crash
scsi hang
0.003 scsi timeout
internal switch
internal link
0.002
0.001
0
I-PRESS TCP-PRESS ReTCP-PRESS VIA-PRESS
PRESS Version
Results of Rutgers study:
performability comparison
Performability
Scaled Availability
60
Throughput X
50
40
30
20
10
0
I-PRESS TCP-PRESS ReTCP-PRESS VIA-PRESS
PRESS Version
Get documents about "