TRUSS: A RELIABLE, SCALABLE SERVER ARCHITECTURE
TRADITIONAL TECHNIQUES THAT MAINFRAMES USE TO INCREASE RELIABILITY—
SPECIAL HARDWARE OR CUSTOM SOFTWARE—ARE INCOMPATIBLE WITH
COMMODITY SERVER REQUIREMENTS. THE TRUSS ARCHITECTURE PROVIDES
RELIABLE, SCALABLE COMPUTATION FOR UNMODIFIED APPLICATION SOFTWARE
IN A DISTRIBUTED SHARED-MEMORY MULTIPROCESSOR.
Brian T. Gold
Jangwoo Kim
Jared C. Smolens
Eric S. Chung
Vasileios Liaskovitis
Eriko Nurvitadhi
Babak Falsafi
James C. Hoe
Carnegie Mellon University

Andreas G. Nowatzyk
Cedars-Sinai Medical Center

The day-to-day digital services that users take for granted—from accounting and commercial transactions to residential utilities—often rely on available, reliable information processing and storage. Server reliability is already a critical requirement for e-commerce, where downtime can undercut revenue by as much as $6 million per hour for availability-critical services.1 Small wonder that reliability has become a key design metric for server platforms.

Unfortunately, although availability and reliability are becoming increasingly crucial, the obstacles to designing, manufacturing, and marketing reliable server platforms are also escalating.1,2 The gigascale integration trend in semiconductor technology is producing circuits with significant vulnerability to transient error (such as that caused by cosmic radiation) and permanent failure (such as that from device wearout).3 Reliable mainframe platforms have traditionally used custom components with enhanced reliability, but the cost can be prohibitive.4 Moreover, these platforms have strong disadvantages: Either they provide a message-passing programming interface, which requires custom software,5 or they use a small broadcast-based interconnect to share a single physical memory, which compromises their scalability.6

In contrast, most modern servers are shared-memory multiprocessors that give programmers the convenience of a global address space. Scalable commodity servers are increasingly based on a cache-coherent distributed shared-memory (DSM) paradigm, which provides excellent scalability while transparently extending the global address space. Unfortunately, DSM servers tend to use potentially unreliable components as building blocks to exploit economies of scale.

The Total Reliability Using Scalable Servers (TRUSS) architecture, developed at Carnegie Mellon, aims to bring reliability to commodity servers. TRUSS features a DSM multiprocessor that incorporates computation and memory storage redundancy to detect and recover from any single point of transient or permanent failure. Because its underlying DSM architecture presents the familiar shared-memory programming model, TRUSS requires no changes to existing applications and only minor modifications to the operating system to support error recovery.

Designing for fault tolerance
Central to TRUSS' practical fault-tolerant design is Membrane, a conceptual fault-isolation boundary that confines the effects of a component failure to the processor, memory, or I/O subsystem in which the failure occurred.
With Membrane, each subsystem must individually detect errors and stop them from propagating to the rest of the system. Because the subsystems detect an error locally—before it can spread—the system needs only local recovery to continue correct operation. In essence, the problem of designing a fault-tolerant system becomes a collection of subproblems that are easy to separate and thus more manageable to analyze and solve.

We can group processing nodes, for example, in a distributed dual- (DMR) or triple-modular redundant (TMR) scheme to protect against processing errors. The memory subsystem can then rely on both local error correction codes (ECC) and distributed parity for the detection and recovery of errors in memory storage. When we compose the two subsystems through Membrane, the processors on one side simply see the appearance of a reliable memory system on the other. Similarly, traditional redundancy and parity techniques protect storage and other I/O devices without affecting processor or memory design5,7 (see Figure 1).

Figure 1. Logical decomposition of error detection and isolation within the TRUSS distributed shared-memory multiprocessor. Each subsystem must detect and recover from an error without involvement from other subsystems. In this manner, each subsystem uses an error detection and recovery scheme optimized for that particular component. The Membrane abstraction composes the various components into a complete fault-tolerant system.
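As a toy illustration of this decomposition, the Python sketch below models each subsystem as an object that detects and recovers from its own faults and a Membrane that simply composes them. The classes, method names, and fault model are our own inventions for exposition, not TRUSS structures.

class Subsystem:
    """A subsystem that detects its own errors and recovers locally."""
    def __init__(self, name):
        self.name = name
        self.faulty = False

    def inject_fault(self):
        self.faulty = True

    def detect_and_recover(self):
        # Local detection and recovery: no other subsystem is involved.
        if self.faulty:
            self.faulty = False   # e.g., rollback, ECC correction, or parity rebuild
            return f"{self.name}: error detected and recovered locally"
        return f"{self.name}: error-free"

class Membrane:
    """Composes independently protected subsystems into one fault-tolerant system."""
    def __init__(self, subsystems):
        self.subsystems = subsystems

    def step(self):
        # An error never crosses the Membrane: each subsystem resolves it by itself.
        return [s.detect_and_recover() for s in self.subsystems]

system = Membrane([Subsystem("processor pair"), Subsystem("memory"), Subsystem("I/O")])
system.subsystems[1].inject_fault()       # fault confined to the memory subsystem
print(system.step())

The point of the sketch is only the interface: each component exposes detect-and-recover behavior, and composition requires nothing more than placing the components side by side behind the boundary.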
To maintain the overall Membrane abstraction, the operation of the interconnection network itself must be error-free and uninterrupted. TRUSS builds on a wealth of prior work in reliable, high-performance interconnects that guarantee packet delivery and guard against data corruption8,9—long-standing requirements for high-performance parallel systems.

Two elements are key in enabling TRUSS processors and memory subsystems to satisfy Membrane's requirements for error-free, uninterrupted operation:

• a master/slave computational redundancy scheme that protects against processor error or node failure, and
• a distributed-parity memory redundancy scheme that protects against multibit errors or the complete loss of a node.

In this article, we describe both these elements and the results of a performance evaluation.

Evaluation framework
To evaluate TRUSS performance, we used Flexus, a framework for cycle-accurate full-system simulation of a DSM multiprocessor,10,11 to simulate a 16-node DSM running Solaris 8. Each node contains a speculative, eight-way out-of-order superscalar processor, a detailed DRAM subsystem model, and an interconnect based on the HP GS1280.12 We use a wait-free implementation of the total store order (TSO) memory consistency model that enforces memory order at runtime only in the presence of races.11 Table 1 lists the relevant system parameters.
Table 1. Parameters in the 16-node DSM multiprocessor simulation.

Processing nodes: UltraSPARC III instruction set architecture; 4-GHz, eight-stage pipeline; out-of-order execution; eight-wide dispatch and retirement; 256-entry reorder buffer
L1 caches: split instruction and data caches; 64-Kbyte, two-way, two-cycle load-to-use latency; four ports; 32 miss-status holding registers (MSHRs)
L2 caches: unified, 8-Mbyte, eight-way, 25-cycle hit latency; one port; 32 MSHRs
Main memory: 60-ns access latency; 32 banks per node; two channels; 64-byte coherence unit
Protocol controller: 1-GHz microcoded controller; 64 transaction contexts
Interconnect: 4×4 2D torus; 25-ns latency per hop; 128-Gbyte/s peak bisection bandwidth
We evaluated four commercial workloads and three scientific applications:

• OLTP-DB2 is IBM DB2 version 7.2 enterprise-extended edition running an online transaction processing (OLTP) workload modeled after a 100-warehouse TPC-C installation.
• OLTP-Oracle is Oracle Database 10g running the OLTP workload.
• Web-Apache is Apache HTTP Server version 2.0 running the SpecWeb99 benchmark.
• Web-Zeus is Zeus Web Server version 4.3 running the SpecWeb99 benchmark.
• Ocean, Fast Fourier Transform (FFT), and em3d are scientific applications that exhibit a range of sharing and network traffic patterns, which we scaled to exceed the system's aggregate cache footprint. In this way, we could evaluate TRUSS under realistic memory-system contention.

Computational redundancy
In TRUSS, processor pairs reside on separate nodes so that the system can tolerate the loss of an entire node. This requirement gives rise to two formidable challenges: coordinating a distributed DMR pair for lockstep operation and having that pair corroborate data as part of error detection and isolation.

Coordination
Given that a variable-latency switching fabric separates processors in a DMR pair, TRUSS requires a coordination mechanism to enforce synchronous lockstep execution in an inherently asynchronous system. Rather than attempting to enforce true simultaneity, we opted for an asymmetric scheme in which the execution of the slave processor in a DMR master-slave pair actually lags behind the master. (The "processor" is the fully deterministic core and caches into which asynchronous inputs—interrupts, cache-line fills, and external coherence requests—feed.) The coordination mechanism enforces the perception of lockstep by replicating at the slave processor the exact sequence and timing of the external, asynchronous inputs that the master processor first observes. Thus, the two processors execute the same instruction sequence despite their physical distance.

Figure 2 illustrates the high-level operation of the master-slave coordination protocol and associated hardware. The coordination mechanism directs all inputs only to the master processor. To ensure that all coordination messages arrive at the slave in time for delivery, the slave runs behind the master at a fixed lag longer than the worst-case transit time for a coordination protocol message. To bound this latency, the master sends coordination messages on the highest-priority channel, and master-slave pairs are neighboring nodes in the network topology.

Figure 2. Replicating incoming messages between master and slave. As part of the coordination protocol, the master processor replicates the input and tags it with a delivery timestamp. Both the input and timestamp are forwarded to the slave as a special coordination message. On the slave node, a gated delivery queue presents the forwarded input to the slave processor's interface at precisely the cycle that the timestamp designates (according to a delayed, local time reference).
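To make the timing concrete, here is a minimal Python sketch of the replicate-and-delay mechanism in Figure 2. It is our own illustration; the class names, the list standing in for the network channel, and the example cycle numbers are assumptions rather than TRUSS hardware structures.

import heapq

class MasterCoordinator:
    """Forwards every asynchronous input to the slave, tagged with its delivery cycle."""
    def __init__(self, channel):
        self.channel = channel        # stands in for the highest-priority network channel
    def observe_input(self, cycle, payload):
        # The master consumes the input at `cycle` and replicates (cycle, payload) to the slave.
        self.channel.append((cycle, payload))
        return payload

class SlaveDeliveryQueue:
    """Gated queue: releases each forwarded input at exactly the tagged cycle, measured
    on the slave's delayed local time reference. The fixed lag (longer than the worst-case
    coordination-message transit time) guarantees the message has already arrived."""
    def __init__(self, channel):
        self.channel = channel
        self.pending = []
    def tick(self, slave_cycle):
        while self.channel:                       # drain arrivals into a priority queue
            heapq.heappush(self.pending, self.channel.pop(0))
        released = []
        while self.pending and self.pending[0][0] <= slave_cycle:
            released.append(heapq.heappop(self.pending)[1])
        return released

channel = []
master, slave = MasterCoordinator(channel), SlaveDeliveryQueue(channel)
master.observe_input(42, "cache-line fill A")
master.observe_input(42, "external coherence request B")
print(slave.tick(41))   # [] -- nothing is due yet
print(slave.tick(42))   # ['cache-line fill A', 'external coherence request B']

Because the slave sees the same inputs at the same (logically delayed) cycles, its deterministic core necessarily retraces the master's execution.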
Because the master and slave nodes have no synchronized clocks, we must be able to modulate their locally synthesized clocks, for example, using down spread-spectrum clock synthesizers.13 Over time, if the master clock phase drifts too far behind the slave—that is, if coordination protocol messages arrive too close to the required delivery time—the coordination mechanism must actively retard the slave clock until the master clock phase catches up. When the opposite occurs—the slave clock phase drifts too far behind that of the master—the mechanism retards the master clock. This scheme precludes large, unilateral changes to the clock frequency because of thermal throttling or similar mechanisms; rather, such changes must be coordinated between master and slave.
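The drift-management policy itself is simple enough to state as code. The following Python sketch is our own illustration; the nominal lag, tolerance, and function names are assumed values, not TRUSS parameters.

NOMINAL_LAG_NS = 50.0   # assumed target lag of the slave behind the master
TOLERANCE_NS = 5.0      # assumed allowed drift before intervening

def adjust_clocks(master_phase_ns, slave_phase_ns):
    """Return which locally synthesized clock to retard (via its spread-spectrum
    synthesizer), if any, to keep the slave's lag behind the master near nominal."""
    lag = master_phase_ns - slave_phase_ns   # how far the slave trails the master
    if lag < NOMINAL_LAG_NS - TOLERANCE_NS:
        # Master has drifted back toward the slave: coordination messages would
        # arrive too close to their delivery deadline, so slow the slave down.
        return "retard slave clock"
    if lag > NOMINAL_LAG_NS + TOLERANCE_NS:
        # Slave has drifted too far behind the master: slow the master down.
        return "retard master clock"
    return "no adjustment"

print(adjust_clocks(1000.0, 953.0))   # lag 47 ns, within tolerance -> no adjustment
print(adjust_clocks(1000.0, 958.0))   # lag 42 ns -> retard slave clock
print(adjust_clocks(1000.0, 940.0))   # lag 60 ns -> retard master clock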
Error detection and isolation
Coordination establishes lockstep execution, but TRUSS must also detect and recover from computational errors to satisfy Membrane's requirements. To ensure that no errors leave the logical processor pair, the master and slave must corroborate results and, if error detection reveals an error, recover to a known-good state.

An effective method for processor-side error detection is fingerprinting,14 which tightly bounds error-detection latency and greatly reduces the required interprocessor communication bandwidth, relative to other detection techniques. Fingerprints compress the execution history of internal processor state into a compact signature, which, along with small on-chip checkpointing, provides error detection and recovery.

TRUSS compares fingerprints from the two processors in a lockstep DMR pair. When the master processor generates output data, it communicates first to the slave in a coordination protocol message, which holds a timestamp and a fingerprint summarizing the master computation thus far. At the slave node, the coordination mechanism waits for the slave processor to reach the same execution point. The slave node releases the output data to the rest of the system only if the master and slave fingerprints agree. In this way, the pair corroborates and validates execution correctness up to and including the comparison point.

When the slave detects mismatched fingerprints, indicating an execution error, it restores a checkpoint of architectural state changes to itself and the master, and both resume execution. If the slave node does not detect an error, it discards the previous checkpoint and continues execution. TRUSS recovers from the permanent failure of a master or slave node by either bringing a new master-slave pair online or running the remaining functional node (master or slave) in a nonredundant mode.
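A minimal Python sketch of this compare-and-release step follows. It is our own illustration: a generic hash stands in for the fingerprint function (see Smolens et al.14 for the actual fingerprint design), and the data structures and names are assumptions.

import hashlib

def fingerprint(retired_updates):
    """Compress an execution history (a sequence of architectural updates) into a compact signature."""
    h = hashlib.sha1()
    for update in retired_updates:
        h.update(repr(update).encode())
    return h.hexdigest()[:16]

def slave_compare(master_fp, slave_updates, output, checkpoint):
    """Slave-side check: release the output if fingerprints agree, otherwise roll back."""
    if fingerprint(slave_updates) == master_fp:
        return ("release", output)       # correctness validated up to this comparison point
    return ("rollback", checkpoint)      # restore the checkpoint on both master and slave

history = [("r1", 7), ("mem[0x40]", 3), ("r2", 9)]
checkpoint = {"pc": 0x1000, "regs": {}}
master_fp = fingerprint(history)         # sent with a timestamp in a coordination message
print(slave_compare(master_fp, history, "dirty line 0x40", checkpoint))                      # release
print(slave_compare(master_fp, history[:-1] + [("r2", 8)], "dirty line 0x40", checkpoint))   # rollback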
TRUSS integrates the error detection and isolation protocol into a three-hop, invalidation-based coherence protocol. In the base protocol, remote nodes forward responses for dirty cache lines directly to the requesting node. TRUSS extends this three-hop forwarding chain to include an additional hop (master to slave) to validate the outbound data. This extra step introduces overhead in any request-reply transaction to a logical processor. For dirty cache reads, for example, the extra step fully manifests in the read transaction's critical path. Outbound data that is not part of a request-reply (writebacks, for example) requires a similar comparison step, but the latency is hidden if no other node is waiting on the writeback result. For operations without irreversible side effects, the master issues the message before it checks the result with the slave.

Because the master accepts request messages while the slave releases replies, a complication arises with network flow control. Guarding against deadlock requires that a node not accept a lower-priority request if back-pressure is blocking it from sending a response on a higher-priority channel. Because the master cannot directly sense back-pressure at the slave's send port, the coordination protocol must keep track of credit and debit counters for the slave's send buffers at the master node. The coordination protocol does not accept an inbound message at the master node unless the counters guarantee that the slave can also absorb the inbound message.
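The flow-control rule can be captured in a few lines. The sketch below is our own Python illustration; the class name, the buffer count, and the drain notification are assumptions, not the actual coordination-protocol interface.

class MasterFlowControl:
    """Mirrors the slave's send (reply) buffer occupancy at the master with a credit counter."""
    def __init__(self, slave_send_buffers=8):   # assumed buffer count
        self.credits = slave_send_buffers
    def try_accept_request(self):
        # Accept an inbound request only if the slave is guaranteed room for the reply it implies.
        if self.credits == 0:
            return False        # back-pressure: leave the request in the network
        self.credits -= 1       # debit one slave send buffer
        return True
    def reply_drained(self):
        # The slave notifies the master when a reply has left its send buffer.
        self.credits += 1

fc = MasterFlowControl(slave_send_buffers=2)
print([fc.try_accept_request() for _ in range(3)])   # [True, True, False]
fc.reply_drained()
print(fc.try_accept_request())                        # True

The conservative accounting at the master substitutes for the back-pressure signal it cannot observe directly at the slave's send port.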
Performance issues
The key performance issues in the coordination protocol are the impact on the round-trip latency of request-reply transactions and network contention from the extra traffic in the master-to-slave channels. Figure 3 shows TRUSS performance relative to a 16-node nonredundant system. For TRUSS, we extended the 4×4 2D torus from the baseline system to a 4×4×2 3D torus, where master and slave nodes are on separate planes of the interconnect topology. This arrangement preserves the master-to-master latency for side-effect-free communications.

Figure 3. Performance with computational redundancy, reported as normalized user IPC. We normalize results to a 16-node nonredundant system and show 90 percent confidence intervals on commercial workloads.
In the base system, OLTP-Oracle spends less than 40 percent of execution time in off-chip memory accesses, spending 21 percent of the total time waiting for dirty data. The overhead of waiting for this dirty data, along with queuing effects when reading shared data, accounts for the 15 percent performance penalty in the TRUSS system as compared to the baseline. OLTP-DB2, however, spends 73 percent of execution time on off-chip memory accesses, most of which goes to dirty coherence reads. The additional latency associated with these accesses, coupled with related increases in queuing between master and slave, accounts for the 35 percent performance penalty in the TRUSS system.

Although both Web-Apache and Web-Zeus spend over 75 percent of execution time in off-chip memory accesses, few of these read modified data from another processor's cache. Moreover, because bandwidth does not generally bound these applications, they can support the additional traffic in the master-to-slave channels. Consequently, the Web servers incur a marginal performance penalty in the TRUSS system, relative to the baseline system.

Because their working sets exceed the aggregate cache size, the scientific applications we studied do not spend time on dirty coherence misses and therefore do not incur additional latency from error detection. In ocean and em3d, contention in the master-to-slave channels creates back pressure at the master node and leads to delays on critical-path memory accesses, which accounts for the small performance loss in both applications.

Memory redundancy
TRUSS protects the memory system using Distributed Redundant Memory (Drum), an N+1 distributed-parity scheme akin to redundant arrays of inexpensive disks (RAID) for memory.7 In this scheme, N data words and one parity word, which the system stores on N+1 different computational nodes, form a parity group, and parity maintenance becomes part of the cache-coherence mechanism.15 The parity word provides sufficient information to reconstruct any one of the data words within the parity group. Drum complements existing within-node error-protection schemes, such as word- or block-level ECC, and chip-16,17 or module-level18 redundancy. It also guards against multibit transient errors—soon reaching prohibitive frequencies in memory16—or a single memory component or node failure in a distributed system.

Drum's main goal is to protect memory with little or no performance overhead. As in other distributed parity schemes,15 Drum relies on ECC to detect multibit errors and does not require a parity lookup upon memory read operations. In the common case of error-free execution, Drum incurs the overhead of updating parity on a memory write operation, such as a cache block writeback.
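The underlying arithmetic is ordinary RAID-style XOR parity. The following Python sketch, our own illustration with made-up word values, forms a parity word over a group of data words stored on different nodes and reconstructs the word lost when one node fails.

from functools import reduce

def parity_of(words):
    """Parity word for an N-word parity group: the bitwise XOR of the data words."""
    return reduce(lambda a, b: a ^ b, words, 0)

def reconstruct(surviving_words, parity_word):
    """Rebuild the one missing data word from the N-1 survivors and the parity word."""
    return reduce(lambda a, b: a ^ b, surviving_words, parity_word)

# Four data words stored on four nodes, parity on a fifth (N + 1 = 5).
group = [0xDEADBEEF, 0x0BADF00D, 0x12345678, 0xCAFEBABE]
parity = parity_of(group)

lost_node = 2                                         # node holding group[2] fails
survivors = group[:lost_node] + group[lost_node + 1:]
assert reconstruct(survivors, parity) == group[lost_node]
print(hex(reconstruct(survivors, parity)))            # 0x12345678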
Contention in distributed parity schemes
A distributed parity scheme can introduce several sources of contention and performance degradation, as Figure 4 shows. Other approaches to distributed parity in a DSM lock the directory entry while the directory controller waits for a parity-update acknowledgment.15 Generally, acknowledgments help simplify recovery by accounting for all pending parity updates. In workloads such as OLTP, however, which have frequent, simultaneous sharing patterns, concurrent reads stall while the directory entry waits for the parity-update acknowledgment.

A second bottleneck exists at the memory channel and DRAM banks, where program-initiated memory requests and parity updates contend for shared resources. Other proposed techniques15 uniformly distribute parity information across a node's memory banks. This approach can benefit memory bank load balancing, but for workloads with bursty writeback traffic,19 parity updates contend with processor requests at the memory channels and increase memory access time.

Finally, parity updates in distributed parity schemes increase the number of network messages. Therefore, in networks with low bisection bandwidth, parity updates can increase network traversal time. However, modern DSMs, such as the HP GS1280,12 typically use interconnect fabrics designed for worst-case demand, so parity updates in such systems are unlikely to affect message latencies significantly.15
Figure 4. Contention in distributed parity schemes. Directory contention (a) occurs when incoming requests must wait while the directory is locked, stalling critical-path accesses. Memory contention (b) occurs when the addition of parity updates to the memory subsystem stalls critical-path data accesses.
Optimizations
Distributed parity in Drum incurs minimal performance overhead through three optimizations: eliminating parity-update acknowledgments, opting for a lazy scheduling of parity updates, and dedicating memory banks to parity words.

Eliminating parity-update acknowledgments alleviates contention at the directory and reduces the number of network messages, but it can lead to overlapping updates for the same data. Fortunately, parity-update operations are commutative, so performing the updates out of order does not affect parity integrity. For error recovery, however, the system must stop and collect all in-flight messages to guarantee that the memory controllers have completed all updates. The added overhead of quiescing in-flight messages has negligible impact on overall execution time because the system infrequently recovers from multibit errors or hardware failures.
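The safety of out-of-order updates follows from the algebra of XOR: each parity update folds the difference between a word's old and new values into the parity word, and XOR is commutative and associative. The short Python check below, our own illustration, makes the point by applying a set of updates in every possible order.

import itertools, random

def apply_update(parity, old_value, new_value):
    """A parity update XORs the difference (old ^ new) of one data word into the parity word."""
    return parity ^ old_value ^ new_value

random.seed(1)
parity = 0x5A5A5A5A
updates = [(random.getrandbits(32), random.getrandbits(32)) for _ in range(4)]

results = set()
for order in itertools.permutations(updates):
    p = parity
    for old, new in order:
        p = apply_update(p, old, new)
    results.add(p)

print(len(results) == 1)   # True: every arrival order yields the same parity word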
Lazy parity-update scheduling prioritizes data over parity requests at the memory controller, which yields lower memory response time for data requests. Because parity requests are not on the execution's critical path, the memory controller delays them arbitrarily during error-free execution. Drum stores delayed parity requests in a separate parity buffer queue in the memory controller, which identifies and uses idle memory channel and bank cycles for parity requests after servicing bursty data requests. The Drum memory controller also supports the coalescing of parity requests within the parity buffer queue; recomputing the parity effectively coalesces the two requests for the same address, thereby reducing the number of accesses to the parity bank.
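A sketch of such a parity buffer queue appears below, in Python. It is our own illustration: the method names, the coalescing-by-XOR representation, and the idle-cycle test are assumptions about one plausible realization, not Drum's actual controller logic.

class ParityBufferQueue:
    """Holds delayed parity updates; coalesces updates to the same parity address by XOR
    and issues them only when the memory channel and bank are otherwise idle."""
    def __init__(self):
        self.pending = {}                   # parity address -> accumulated XOR delta
    def enqueue(self, parity_addr, old_value, new_value):
        delta = old_value ^ new_value
        self.pending[parity_addr] = self.pending.get(parity_addr, 0) ^ delta
    def drain_one(self, channel_idle):
        # Called each memory-controller cycle after data requests have been serviced.
        if not channel_idle or not self.pending:
            return None
        addr, delta = next(iter(self.pending.items()))
        del self.pending[addr]
        return (addr, delta)                # one read-modify-write to the parity bank

q = ParityBufferQueue()
q.enqueue(0x80, old_value=0x11, new_value=0x22)   # writeback of block A
q.enqueue(0x80, old_value=0x22, new_value=0x44)   # later writeback of block A again
print(q.drain_one(channel_idle=False))            # None: a data request has priority
print(q.drain_one(channel_idle=True))             # (128, 85): the two updates coalesced into one parity access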
To attain higher data throughput, Drum segregates parity values to a few dedicated memory banks, which reduces memory bank contention between data and parity requests, and improves row buffer locality.

Performance
Figure 5 shows the performance of various parity-update organizations, all normalized to a configuration that is not fault-tolerant.

Without prioritized scheduling and dedicated memory banks for parity requests, memory contention causes ocean and FFT to suffer 22 and 18 percent performance losses, respectively. Memory bandwidth bounds performance in both applications, and the applications' footprints exceed the aggregate cache size, which creates significant contention for memory channels and banks. With lazy scheduling and dedicated parity banks, the performance losses for ocean and FFT drop significantly, to 5 and 4 percent. Because writebacks in these applications are bursty, delayed parity requests have ample time to finish.

With parity acknowledgments, directory contention causes OLTP-DB2 and OLTP-Oracle to show 8 and 7 percent performance losses.
Figure 5. Performance of four parity-update approaches, reported as normalized user IPC for OLTP-DB2, OLTP-Oracle, Web-Apache, Web-Zeus, ocean, FFT, and em3d. In the nominal distributed-parity scheme, the directory waits for parity-update acknowledgments and treats all parity updates and data requests equally in the DRAM subsystem. Drum combines the acknowledgment-free and lazy-scheduling-with-dedicated-parity-banks schemes.
OLTP exhibits migratory sharing of modified data, where many processors read and write a set of addresses over time. As data blocks pass from one processor's cache to another, outstanding parity updates and acknowledgments delay the release of directory locks. Subsequent reads for these addresses must wait until the parity acknowledgment completes. With the enabling of lazy scheduling and dedicated parity banks, the performance loss grows to 14 percent for OLTP-DB2 and 12 percent for OLTP-Oracle because lazy scheduling further delays parity acknowledgments. However, with no parity acknowledgments, these applications recoup all their performance losses, with or without lazy scheduling and bank dedication. With its combination of acknowledgment-free parity updates, lazy scheduling, and dedicated parity banks, Drum is the only solution that regains the performance losses for all the applications studied. Its performance loss relative to the baseline (non-fault-tolerant) design is only 2 percent on average (at worst 4 percent in ocean).

Reliability and availability will continue to be key design metrics for all future server platforms. The TRUSS server architecture bridges the gap between costly, reliable mainframes and scalable, distributed, shared-memory hardware running commodity application software. Using the Membrane abstraction, TRUSS can confine the effects of a component failure, enabling error detection and recovery schemes optimized for a particular subsystem. MICRO

Acknowledgments
This research was supported in part by NSF awards ACI-0325802 and CCF-0347560, Intel Corp., the Center for Circuit and System Solutions (C2S2), and the Carnegie Mellon CyLab. We thank the SimFlex team at Carnegie Mellon for the simulation infrastructure and valuable feedback on early drafts of this article.

References
1. D. Patterson, keynote address, "Recovery Oriented Computing: A New Research Agenda for a New Century," 2002; http://roc.cs.berkeley.edu/talks/pdf/HPCAkeynote.pdf.
2. J. Hennessy, "The Future of Systems Research," Computer, vol. 32, no. 8, Aug. 1999, pp. 27-33.
3. S. Borkar, "Challenges in Reliable System Design in the Presence of Transistor Variability and Degradation," IEEE Micro, vol. 25, no. 6, Nov.-Dec. 2005, pp. 10-16.
4. W. Bartlett and L. Spainhower, "Commercial Fault Tolerance: A Tale of Two Systems," IEEE Trans. Dependable and Secure Computing, vol. 1, no. 1, Jan. 2004, pp. 87-96.
5. W. Bartlett and B. Ball, "Tandem's Approach to Fault Tolerance," Tandem Systems Rev., vol. 4, no. 1, Feb. 1998, pp. 84-95.
6. T.J. Slegel et al., "IBM's S/390 G5 Microprocessor Design," IEEE Micro, vol. 19, no. 2, Mar./Apr. 1999, pp. 12-23.
7. D. Patterson, G. Gibson, and R. Katz, "A Case for Redundant Arrays of Inexpensive Disks (RAID)," Proc. Int'l Conf. Management of Data (SIGMOD-88), ACM Press, 1988.
8. J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks: An Engineering Approach, Morgan Kaufmann, 2003.
9. D.J. Sorin et al., "SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery," Proc. 29th Ann. Int'l Symp. Computer Architecture (ISCA-02), IEEE CS Press, 2002, pp. 123-134.
10. N. Hardavellas et al., "SimFlex: A Fast, Accurate, Flexible Full-System Simulation Framework for Performance Evaluation of Server Architecture," SIGMETRICS Performance Evaluation Rev., vol. 31, no. 4, Apr. 2004.
11. T.F. Wenisch et al., "Temporal Streaming of Shared Memory," Proc. 32nd Ann. Int'l Symp. Computer Architecture (ISCA-05), IEEE CS Press, 2005, pp. 222-233.
12. Z. Cvetanovic, "Performance Analysis of the Alpha 21364-Based HP GS1280 Multiprocessor," Proc. 30th Ann. Int'l Symp. Computer Architecture (ISCA-03), IEEE CS Press, 2003, pp. 218-229.
13. K. Hardin et al., "Design Considerations of Phase-Locked Loop Systems for Spread Spectrum Clock Generation Compatibility," Proc. Int'l Symp. Electromagnetic Compatibility (EMC-97), IEEE Press, 1997, pp. 302-307.
14. J.C. Smolens et al., "Fingerprinting: Bounding Soft-Error Detection Latency and Bandwidth," Proc. 11th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-XI), ACM Press, 2004, pp. 224-234.
15. M. Prvulovic, Z. Zhang, and J. Torrellas, "ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared Memory Multiprocessors," Proc. 29th Ann. Int'l Symp. Computer Architecture (ISCA-02), IEEE CS Press, 2002, pp. 111-122.
16. T.J. Dell, "A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory," IBM Corp., 1997.
17. "HP Advanced Memory Protection Technologies," Hewlett Packard, 2003; http://h200001.www2.hp.com/bc/docs/support/SupportManual/c00256943/c00256943.pdf.
18. "Hot Plug RAID Memory Technology for Fault Tolerance and Scalability," white paper, Hewlett Packard, 2003; http://h200001.Manual/c00257001/c00257001.pdf.
19. S.C. Woo et al., "The SPLASH-2 Programs: Characterization and Methodological Considerations," Proc. 22nd Ann. Int'l Symp. Computer Architecture (ISCA-95), IEEE CS Press, 1995, pp. 24-36.

Brian T. Gold is a PhD student in electrical and computer engineering at Carnegie Mellon University. His research interests include reliable computer systems and parallel computer architectures. Gold has an MS in computer engineering from Virginia Tech. He is a student member of the IEEE and ACM.

Jangwoo Kim is a PhD student in electrical and computer engineering at Carnegie Mellon University. His research interests include reliable server architecture and full system simulation. Kim has an MEng in computer science from Cornell University. He is a student member of the IEEE.

Jared C. Smolens is a PhD student in electrical and computer engineering at Carnegie Mellon University. His research interests include microarchitecture, multiprocessor architecture, and performance modeling. Smolens has an MS in electrical and computer engineering from Carnegie Mellon University. He is a student member of the IEEE.

Eric S. Chung is a PhD student in electrical and computer engineering at Carnegie Mellon University. His research interests include designing and prototyping scalable, reliable server architectures and transactional memory. Chung has a BS in electrical and computer engineering from the University of California at Berkeley. He is a student member of the IEEE and ACM.
Vasileios Liaskovitis is an MS student in electrical and computer engineering at Carnegie Mellon University. His research interests include computer architecture and algorithms for pattern recognition. Liaskovitis has a BS in electrical and computer engineering from the National Technical University of Athens, Greece. He is a student member of the IEEE and ACM.

Eriko Nurvitadhi is a PhD student in electrical and computer engineering at Carnegie Mellon University. He received an MS in computer engineering from Oregon State University. His research interests are in computer architecture, including prototyping and transactional memory. He is a student member of the IEEE and ACM.

Babak Falsafi is an associate professor of electrical and computer engineering and a Sloan Research Fellow at Carnegie Mellon University. His research interests include computer architecture with emphasis on high-performance memory systems, nanoscale CMOS architecture, and tools to evaluate computer system performance. Falsafi has a PhD in computer science from the University of Wisconsin and is a member of the IEEE and ACM.

James C. Hoe is an associate professor of electrical and computer engineering at Carnegie Mellon University. His research interests include computer architecture and high-level hardware description and synthesis. Hoe has a PhD in electrical engineering and computer science from MIT. He is a member of the IEEE and ACM.

Andreas G. Nowatzyk is the associate director of the Minimally Invasive Surgery Technology Institute at the Cedars-Sinai Medical Center, where he works on highly reliable, high-performance computer systems that process real-time image data in operating rooms. Nowatzyk has a PhD in computer science from Carnegie Mellon University. He is a member of the IEEE and ACM.

Direct questions and comments about this article to Babak Falsafi, Electrical and Computer Engineering Dept., Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213; email@example.com.

For further information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib.