PDL Newsletter 2003

THE PDL PACKET
THE NEWSLETTER ON PDL ACTIVITIES AND EVENTS • FALL 2003
PARALLEL DATA LABORATORY • CARNEGIE MELLON UNIVERSITY

AN INFORMAL PUBLICATION FROM ACADEMIA’S PREMIERE STORAGE […] DEVOTED TO ADVANCING THE STATE OF THE ART IN STORAGE […] INFRASTRUCTURES.

CONTENTS
Automating Storage Management ...... 1
Director’s Letter .................. 2
New PDL Faculty .................... 3
Year in Review ..................... 4
Recent Publications ................ 5
Database I/O Optimization .......... 8
PDL News ........................... 10
Proposals & Defenses ............... 13
Toward Better File Searching ....... 14

PDL CONSORTIUM MEMBERS
EMC Corporation
Hewlett-Packard Labs
Hitachi, Ltd.
IBM Corporation
Intel Corporation
Microsoft Corporation
Network Appliance
Panasas, Inc.
Oracle Corporation
Seagate Technology
Sun Microsystems
Veritas Software Corporation

Self-* Storage
Greg Ganger & Joan Digney

Human administration of storage systems is a large and growing issue in modern IT infrastructures. We are exploring new storage architectures that integrate automated management functions and simplify the human administrative task. Self-* storage systems (pronounced “self-star,” a play on the Unix shell wild-card character) are self-configuring, self-organizing, self-tuning, self-healing, self-managing systems of storage bricks. Borrowing organizational ideas from corporate structure and technologies from AI and control systems, self-* storage should simplify storage administration, reduce system cost, increase system robustness, and simplify system construction.

The Self-* Storage Architecture

Dramatic simplification of storage administration requires that associated functionalities be designed into the storage system from the start and integrated throughout the design. System components must continually collect information about ongoing tasks, and configuration and workload partitioning must be re-evaluated regularly, all within the context of high-level administrator guidance. The self-* storage project is designing an architecture from a clean slate to explore such integration. The high-level system architecture is shown in Figure 1.

The Brick-based Storage Infrastructure. We envision self-* storage systems composed of networked “intelligent” storage bricks, each consisting of CPU(s), RAM, and a number of disks; example brick designs provide 0.5-5 TB of storage with moderate levels of reliability, availability, and performance. Each storage brick (a “worker”) self-tunes and adapts its operation to its observed workload and assigned goals. Data redundancy across and within storage bricks provides fault tolerance and creates opportunities for automated reconfiguration to handle many problems. Out-of-band supervisory processes assign datasets and goals to workers, track

Continued on page 12

[Figure 1 diagram labels: Administrator; goal specification & complaints; performance information & delegation; management hierarchy (Supervisors; Workers (storage bricks)); I/O request routing (Routers); I/O requests & replies; Clients]

Figure 1: Architecture of self-* storage. At the top of the diagram is the management hierarchy, concerned with the distribution of goals and the delegation of storage responsibilities from the system administrator down to the individual worker devices. The path of I/O requests from clients, through routing nodes, to the workers for service is shown at the bottom. Note that the management infrastructure is logically independent of the I/O request path, and that the routing is logically independent of the clients and the workers.
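
As a rough illustration of the separation described in the Figure 1 caption, the Python sketch below models a management path (a supervisor assigning datasets and goals to workers) that is independent of the I/O path (a router forwarding client requests to workers). It is a toy model under stated assumptions, not the project's design: the class names `Worker`, `Router`, and `Supervisor`, the least-loaded placement policy, and the default of two replicas are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """A storage brick: stores blocks and holds the goals assigned to it."""
    name: str
    goals: dict = field(default_factory=dict)   # dataset -> goal text
    blocks: dict = field(default_factory=dict)  # (dataset, block no.) -> bytes

    def write(self, dataset, block, data):
        self.blocks[(dataset, block)] = data

    def read(self, dataset, block):
        return self.blocks[(dataset, block)]


class Router:
    """I/O path: forwards client requests to the workers holding a dataset."""

    def __init__(self):
        self.table = {}  # dataset -> list of workers holding a replica

    def write(self, dataset, block, data):
        for worker in self.table[dataset]:  # update every replica
            worker.write(dataset, block, data)

    def read(self, dataset, block):
        return self.table[dataset][0].read(dataset, block)


class Supervisor:
    """Management path: assigns datasets and goals; never handles client I/O."""

    def __init__(self, workers, router):
        self.workers = workers
        self.router = router

    def assign(self, dataset, goal, replicas=2):
        # Hypothetical placement policy: choose the workers with the
        # fewest assigned datasets (ties keep the original worker order).
        chosen = sorted(self.workers, key=lambda w: len(w.goals))[:replicas]
        for worker in chosen:
            worker.goals[dataset] = goal
        self.router.table[dataset] = chosen
        return chosen


if __name__ == "__main__":
    bricks = [Worker("w0"), Worker("w1"), Worker("w2")]
    router = Router()
    supervisor = Supervisor(bricks, router)
    supervisor.assign("astronomy", "survive one brick failure")
    router.write("astronomy", 7, b"sky survey block")
    print(router.read("astronomy", 7))
```

In this sketch clients would talk only to the router, and the supervisor only updates the routing table and the workers' goals, which mirrors the logical independence of management and request routing noted in the caption.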
THE PDL PACKET

The Parallel Data Laboratory
School of Computer Science
Department of ECE
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213-3891
VOICE 412•268•6716
FAX 412•268•3010

PUBLISHER
Greg Ganger

EDITOR
Joan Digney

The PDL Packet is published once per year and provided to members of the PDL Consortium. Copies are given to other researchers in industry and academia as well. A pdf version resides in the Publications section of the PDL Web pages and may be freely distributed. Contributions are welcome.

COVER ILLUSTRATION
Skibo Castle and the lands that comprise its estate are located in the Kyle of Sutherland in the northeastern part of Scotland. Both ‘Skibo’ and ‘Sutherland’ are names whose roots are from Old Norse, the language spoken by the Vikings who began washing ashore regularly in the late ninth century. The word ‘Skibo’ fascinates etymologists, who are unable to agree on its original meaning. All agree that ‘bo’ is the Old Norse for ‘land’ or ‘place.’ But they argue whether ‘ski’ means ‘ships’ or ‘peace’ or ‘fairy hill.’

Although the earliest version of Skibo seems to be lost in the mists of time, it was most likely some kind of fortified building erected by the Norsemen. The present-day castle was built by a bishop of the Roman Catholic Church. Andrew Carnegie, after making his fortune, bought it in 1898 to serve as his summer home. In 1980, his daughter, Margaret, donated Skibo to a trust that later sold the estate. It is presently being run as a luxury hotel.

FROM THE DIRECTOR’S CHAIR
Greg Ganger

Hello from fabulous Pittsburgh!

2003 has been an exciting year in the Parallel Data Lab, with a number of projects maturing and contributing to a broad, new research initiative in automated storage administration. Along the way, two faculty and several new students joined the Lab, several students spent summers with PDL Consortium companies, and several students won new Fellowships (one each from Intel, NSF, and DoD).

The PDL continues to pursue a broad array of storage systems research, from the underlying devices to the applications that rely on storage. The past year brought excellent progress in existing projects and the initiation of exciting new ones. Let me highlight a few things.

The biggest development has been the initiation of a major new project called Self-* Storage, which explores the design and implementation of self-organizing, self-configuring, self-tuning, self-healing, self-managing systems of storage bricks. For years, PDL Retreat attendees have been pushing us to attack “storage management of large installations,” and this project is our response. With generous equipment donations from the PDL Consortium companies, we hope to put together and maintain 100s of terabytes of storage to be used by ourselves and other CMU researchers (e.g., in data mining, astronomy, and scientific visualization). Of course, the research challenge is to make the system almost entirely self-*, so as to avoid the traditional costs of storage administration; deploying a real system will allow us to test our ideas in practice.

Towards this end, we are designing the Self-* Storage architecture with a clean slate, integrating management functions throughout. In realizing the design, we are combining a number of recent and ongoing PDL projects (e.g., freeblock scheduling, PASIS, and self-securing storage) and ramping up efforts on new challenges such as black-box device and workload modeling, automated decision making, and automated diagnosis.

PDL’s push into database storage management has also produced an exciting new project: Fates. A Fates-based database storage manager transparently exploits select knowledge of the underlying storage infrastructure to automatically achieve robust, tuned performance. As in Greek mythology, there are three Fates: Atropos, Clotho, and Lachesis. The Atropos volume manager stripes data across disks based on track boundaries and exposes aspects of the resulting parallelism. The Lachesis storage manager utilizes track boundary and parallelism information to match database structures and access patterns to the underlying storage. The Clotho dynamic page layout allows retrieval of just the desired table attributes, eliminating unnecessary I/O and wasted main memory. Altogether, the three Fates components simplify database administration, increase performance, and avoid performance fluctuations due to query interference. Other database systems projects are exploring benchmarking techniques, automatic physical database design, and new software architectures for robust load management.

The self-securing devices project has made great strides. Highlighted in 2002 by several news organizations, this project adapts medieval warfare notions to the defense of networked computing infrastructures. In a nutshell, devices are augmented with relevant security functionality and made intrusion-independent from client OSes and other devices. This architecture makes systems more intrusion-tolerant and more manageable when under attack. The self-securing devices vision has brought with it many interesting challenges and a healthy
source of funding. We have continued our exploration of how to build and exploit network interface software for containing compromised client systems, and explored in depth a new way of detecting intruders: storage-based intrusion detection. Storage devices are uniquely positioned to spot some common intruder actions (e.g., scrubbing audit logs and inserting backdoors), making this an exciting new concept.

Other ongoing PDL projects are also producing cool results. For example, the PASIS project has produced a family of efficient and scalable consistency protocols that support a wide range of fault models and require no change to the server infrastructure. A new project has begun exploring the use of context information for assigning attributes to files to enable effective attribute-based searches. We have built a working freeblock scheduling system and API in FreeBSD, and are using it for research activities such as continuous disk reorganization; it will also be a core component of each self-* storage brick. Decentralized caching is being explored in several contexts, including wide-area systems that opportunistically utilize CAS overlays and NFS clusters that dynamically shed load by replicating read-only files. This newsletter and the PDL website offer more details and additional research highlights.

On the education front: This Spring, for the third time, we offered our new storage systems course to undergraduates and master’s students at Carnegie Mellon. Topics span the design, implementation, and use of storage systems, from the characteristics and operation of individual storage devices to the OS, database, and networking techniques involved in tying them together and making them useful. The base lectures were complemented by real-world expertise generously shared by 8 guest speakers from industry, including several members of the SNIA Technical Council. We continue to work on the book, and several other schools have already picked up and started teaching similar storage systems courses. We view providing storage systems education as critical to the field’s future; stay tuned.

I’m always overwhelmed by the accomplishments of the PDL students and staff, and it’s a pleasure to work with them. As always, their accomplishments point at great things to come.

PARALLEL DATA LABORATORY

CONTACT US

WEB PAGES
PDL Home: […]
Please see our web pages at […] for further contact information.

FACULTY
Greg Ganger (director), 412•268•1297
Anastassia Ailamaki
Anthony Brockwell
Christos Faloutsos
Garth Gibson
Seth Goldstein
Mor Harchol-Balter
Chris Long
Todd Mowry
Adrian Perrig
Mike Reiter
Mahadev Satyanarayanan
Srinivasan Seshan
Dawn Song
Chenxi Wang
Hui Zhang

STAFF MEMBERS
Karen Lindenfelser (pdl business administrator), 412•268•6716
Stan Bielski
Mike Bigrigg
John Bucy
Joan Digney
Gregg Economou
Ken Tew
Linda Whipkey

Michael Abd-El-Malek
Mukesh Agrawal
Kinman Au
Shimin Chen
Garth Goodson
John Linwood Griffin
Stavros Harizopoulos
James Hendricks
Andrew Klosterman
Chris Lumb
Amit Manjhi
Michael Mesnier
Jim Newsome
Spiros Papadimitriou
Stratos Papadomanolakis
Adam Pennington
Ginger Perng
David Petrou
Brandon Salmon
Raja Sambasivan
Jiri Schindler
Steve Schlosser
Minglong Shao
Vlad Shkapenyuk
Shafeeq Sinnamohideen
Craig Soules
John Strunk
Eno Thereska
Niraj Tolia
Mengzhi Wang
Ted Wong
Jay Wylie
Shuheng Zhou

NEW PDL FACULTY

Anthony Brockwell

Dr. Anthony Brockwell, Assistant Professor in the Dept. of Statistics at Carnegie Mellon, received his Ph.D. in Statistics and Electrical Engineering from the University of Melbourne in 1998. He joined the PDL in June to collaborate on our Self-* Storage project. Dr. Brockwell’s research interests are primarily in the areas of dynamical systems and Bayesian computational methods. He is particularly interested in the analysis and control of nonlinear and non-Gaussian systems, and in associated computational methods such as particle filtering and Markov chain Monte Carlo. Dr. Brockwell has published articles in such journals as SIAM Journal on Control and Optimiza-

Continued on page 4

YEAR IN REVIEW

October 2003
  11th Annual PDL Retreat and Workshop.

September 2003
  Jiri Schindler presented two papers at VLDB in Berlin: “Lachesis: Robust Database Storage Management Based on Device-specific Performance Characteristics,” and “Matching Database Access Patterns to Storage Characteristics,” which won the Ph.D. Workshop’s Best Paper Award.
  Christos Faloutsos gave the keynote talk “Next Generation Data Mining Tools: Power laws and self-similarity for graphs, streams and traditional data” at ECML, Dubrovnik, Croatia.
  Spiros Papadimitriou presented “Adaptive, Hands-Off Stream Mining” at VLDB.

August 2003
  Greg visited Intel in Portland, OR, to present Self-* Storage.
  Adam Pennington presented “Storage-based Intrusion Detection” at USENIX Security ’03 in Washington, DC.
  Chris Long organized a Birds of a Feather session on “Security and Usability” at the USENIX Security Symposium in Washington, DC.
  Christos Faloutsos gave a tutorial on “Mining Time Series Data” at ICML 2003, Washington, DC.
  Jiri Schindler successfully defended his PhD dissertation titled “Matching Application Access Patterns to Storage Device Characteristics.”

July 2003
  Greg visited HP, Microsoft and Veritas in CA and presented Self-* Storage.

June 2003
  Brandon Salmon presented “A Two-Tiered Software Architecture for Automated Tuning of Disk Layouts” at the AASMS Workshop in San Diego, CA. Greg also attended.
  Christos Faloutsos presented a tutorial on “Internet Research meets Data Mining: Current Knowledge and New Tools” at SIGMETRICS 2003, San Diego.
  Niraj Tolia attended the USENIX Annual Technical Conference in San Antonio, TX and presented the paper “Opportunistic Use of Content Addressable Storage for Distributed File Systems.”

May 2003
  Craig Soules spoke on “Why Can’t I Find My Files? New Methods for Automating Attribute Assignment” at HotOS in Lihue, HI. Greg also attended.
  Greg attended the AFOSR program review in Colorado and presented Self-Securing Devices.
  Adam Pennington spent the summer interning at Seagate in Pittsburgh; Brandon Salmon interned at Microsoft Research, and Craig Soules interned with HP Labs.

April 2003
  Craig Soules presented “Metadata Efficiency in Versioning File Systems” at FAST 03; Greg and many other PDL students also attended.
  Fifth annual PDL Industry Visit Day.
  Chris Long co-organized the Workshop on Human-Computer

Continued on page 19

Continued from page 3

tion, Journal of Time Series Analysis and Journal of Computational and Graphical Statistics.

Mahadev Satyanarayanan

At long last, Satya has joined the PDL! Long renowned as a leading researcher in distributed file systems, Satya’s research has given us many of the principles on which today’s and tomorrow’s file systems are based. Key ideas from the Coda file system, which supports disconnected and bandwidth-adaptive operation, have been incorporated by Microsoft into the IntelliMirror component of Windows. Another outcome of his research is Odyssey, a set of open-source operating system extensions for enabling mobile applications to adapt to variation in critical resources such as bandwidth and energy. Coda and Odyssey are building blocks in Project Aura, a research initiative at Carnegie Mellon to build a distraction-free ubiquitous computing environment. Earlier, Satyanarayanan was a principal architect and implementor of the Andrew File System (AFS), which has been commercialized by IBM.

Dr. Satyanarayanan is the Carnegie Group Professor of Computer Science at Carnegie Mellon University. He is currently serving as the founding director of Intel Research Pittsburgh, which focuses on software systems for distributed data storage. He is the founding Editor-in-Chief of IEEE Pervasive Computing. He received his Ph.D. in Computer Science from Carnegie Mellon, after completing Bachelor’s and Master’s degrees from the Indian Institute of Technology, Madras. He is also a Fellow of the ACM and the IEEE.

RECENT PUBLICATIONS

Lachesis: Robust Database Storage Management Based on Device-specific Performance Characteristics

Schindler, Ailamaki & Ganger

VLDB 03, Berlin, Germany, Sept 9-12, 2003.

Database systems work hard to tune I/O performance, but do not always achieve the full performance potential of modern disk systems. Their abstracted view of storage components hides useful device-specific characteristics, such as disk track boundaries and advanced built-in firmware algorithms. This paper presents a new storage manager architecture, called Lachesis, that exploits and adapts to observable device-specific characteristics in order to achieve and sustain high performance. For DSS queries, Lachesis achieves I/O efficiency nearly equivalent to sequential streaming even in the presence of competing random I/O traffic. In addition, Lachesis simplifies manual configuration and restores the optimizer’s assumptions about the relative costs of different access patterns expressed in query plans. Experiments using IBM DB2 I/O traces as well as a prototype implementation show that Lachesis improves standalone DSS performance by 10% on average. More importantly, when running concurrently with an on-line transaction processing (OLTP) workload, Lachesis improves DSS performance by up to 3X, while OLTP also exhibits a 7% speedup.

A Two-Tiered Software Architecture for Automated Tuning of Disk Layouts

Salmon, Thereska, Soules & Ganger

First Workshop on Algorithms and Architectures for Self-Managing Systems. In conjunction with Federated Computing Research Conference (FCRC). San Diego, CA. June 11, 2003.

Many heuristics have been developed for adapting on-disk data layouts to expected and observed workload characteristics. This paper describes a two-tiered software architecture for cleanly and extensibly combining such heuristics. In this architecture, each heuristic is implemented independently and an adaptive combiner merges their suggestions based on how well they work in the given environment. The result is a simpler and more robust system for automated tuning of disk layouts, and a useful blueprint for other complex tuning problems such as cache management, scheduling, data migration, and so forth.

Efficient Consistency for Erasure-coded Data via Versioning Servers

Goodson, Wylie, Ganger & Reiter

Carnegie Mellon University Technical Report CMU-CS-03-127, April 2003.

[…] of fault and timing assumptions, up to asynchrony and Byzantine faults of both storage-nodes and clients, with no changes to server implementation or client-server interface. Measurements of a prototype storage system using these protocols show that the protocol performs well under various system model assumptions, numbers of failures tolerated, and degrees of reader-writer concurrency.

A Human Organization Analogy for Self-* Systems

Strunk & Ganger

First Workshop on Algorithms and Architectures for Self-Managing Systems. In conjunction with Federated Computing Research Conference (FCRC). San Diego, CA. June 11, 2003.

The structure and operation of human organizations, such as corporations, offer useful insights to designers of self-* systems (a.k.a. self-managing or autonomic). Examples include worker/supervisor hierarchies, avoidance of micro-management, and complaint-based tuning. This paper explores the analogy, and describes the design of a self-* storage system that borrows from it.

Exposing and Exploiting Internal Parallelism in MEMS-based Storage

Schlosser, Schindler, Ailamaki & Ganger

Carnegie Mellon University Techni-
       OPTIMIZER                                    This paper describes the design, imple-     cal Report CMU-CS-03-125, March
                                                    mentation and performance of a fam-         2003.
                 query plan
                                    Configuration   ily of protocols for survivable, decen-
       EXECUTION                     parameters     tralized data storage. These protocols      MEMS-based storage has interesting
                                                    exploit storage-node versioning to ef-      access parallelism features. Specifi-
     STORAGE Manager                                ficiently achieve strong consistency        cally, subsets of a MEMStore’s thou-
                                                    semantics. These protocols allow era-       sands of tips can be used in parallel,
                  I/O                                                                           and the particular subset can be dy-
               requests                             sure-codes to be used that achieve
    buffers                                         network and storage efficiency (and         namically chosen. This paper de-
                                                    optionally data confidentiality in the      scribes how such access parallelism
                                                    face of server compromise). The pro-        can be exposed to system software,
Query optimization and execution in a typical       tocol family is general in that its pa-     with minimal changes to system in-
DBMS.                                               rameters accommodate a wide range                                Continued on page 6

FALL 2003                                                                                                                             5
Continued from page 5
terfaces, and utilized cleanly for two     selection queries and updates on                 (A)             ...          ...
classes of applications. First, back-      memory-resident relations execute 17-
ground tasks can utilize unused paral-     25% faster, and (c) TPC-H queries
lelism to access media locations with      involving I/O execute 11-48% faster.
no impact on foreground activity. Sec-     Finally, we show that PAX performs                servers,
                                                                                                            Local                           Wide
                                                                                                  etc.       Area                            Area
ond, two-dimensional data structures,      well across different memory system                                Network            Firewall     Network
such as dense matrices and relational      designs.                                                                                and
database tables, can be accessed in
both row order and column order with       Self-* Storage: Brick-based                                     ...           ...
maximum efficiency. With proper            Storage with Automated
table layout, unwanted portions of a       Administration                                   (B)
                                                                                                           ...           ...
table can be skipped while scanning                                                                      SSNI

at full speed. Using simulation, we        Ganger, Strunk & Klosterman                                                    SSNI

explore performance features of us-                                                         Desktops,
                                           Carnegie Mellon University Techni-                servers,
                                                                                                            Local                           Wide
ing this device parallelism for an ex-                                                            etc.       Area                            Area
                                           cal Report, CMU-CS-03-178, August                                  Network                         Network
ample application from each class.                                                                                             Firewall
                                           2003.                                                                                 and
Data Page Layouts for Relational           This white paper describes a new

Databases on Deep Memory                                                                                   ...           ...
                                           project exploring the design and imple-
Hierarchies                                mentation of “self-* storage systems:”
                                           self-organizing, self-configuring, self-        Self-securing network interfaces. (A)
                                                                                           shows the conventional network security
Ailamaki, DeWitt & Hill                    tuning, self-healing, self-managing             configuration, wherein a firewall and a NIDS
                                           systems of storage bricks. Borrow-              protect LAN systems from some WAN attacks.
The VLDB Journal 11(3), 2002.              ing organizational ideas from corpo-            (B) shows the addition of self-securing NIs,
                                           rate structure and automation tech-             one for each LAN system.
Relational database systems have tra-
ditionally optimized for I/O perfor-       nologies from AI and control systems,
                                           we hope to dramatically reduce the              self-securing NIs can better identify
mance and organized records sequen-
                                           administrative burden currently faced           suspicious traffic originating from that
tially on disk pages using the N-ary
                                           by data center administrators. Further,         host, including many explicitly de-
Storage Model (NSM) (a.k.a., slotted
                                           compositions of lower cost compo-               signed to defeat network intrusion
pages). Recent research, however, in-
                                           nents can be utilized, with available           detection systems. With normalization
dicates that cache utilization and per-
                                           resources collectively used to achieve          and detection-triggered throttling, self-
formance is becoming increasingly
                                           high levels of reliability, availability, and   securing NIs can reduce the ability of
important on modern platforms. In this
                                           performance.                                    compromised hosts to launch attacks
paper, we first demonstrate that in-
                                                                                           on other systems inside (or outside)
page data placement is the key to high
                                           Finding and Containing Enemies                  the intranet. We describe a prototype
cache performance and that NSM
                                           Within the Walls with Self-                     self-securing NI and example scan-
exhibits low cache utilization on mod-
                                           securing Network Interfaces                     ners for detecting such things as TTL
ern platforms. Next, we propose a new
                                                                                           abuse, fragmentation abuse, “SYN
data organization model called PAX
                                           Ganger, Economou & Bielski                      bomb” attacks, and random-propaga-
(Partition Attributes Across), that sig-
                                                                                           tion worms like Code-Red.
nificantly improves cache perfor-          Carnegie Mellon University Techni-
mance by grouping together all val-        cal Report CMU-CS-03-109, January
ues of each attribute within each page.                                                    Object-Based Storage
Because PAX only affects layout in-                                                        Mesnier, Ganger & Riedel
side the pages, it incurs no storage       Self-securing network interfaces
penalty and does not affect I/O be-        (NIs) examine the packets that they             IEEE Communications Magazine, v.41
havior. According to our experimen-        move between network links and host             n.8 pp 84-90, August 2003.
tal results (which were obtained with-     software, looking for and potentially           Storage technology has enjoyed con-
out using any indices on the partici-      blocking malicious network activity.            siderable growth since the first disk
pating relations), when compared to        This paper describes how self-secur-            drive was introduced nearly 50 years
NSM (a) PAX exhibits superior cache        ing network interfaces can help ad-             ago, in part facilitated by the slow and
and memory bandwidth utilization, sav-     ministrators to identify and contain            steady evolution of storage interfaces
ing at least 75% of NSM’s stall time       compromised machines within their
due to data cache accesses, (b) range      intranet. By shadowing host state,                                                        Continued on page 7

6                                                                                                                        T H E P D L PA C K E T
(SCSI and ATA/IDE). The stability of these interfaces has allowed continual advances in both storage devices and applications, without frequent changes to the standards. However, the interface ultimately determines the functionality supported by the devices, and current interfaces are holding system designers back. Storage technology has progressed to the point that a change in the device interface is needed. Object-based storage is an emerging standard designed to address this problem. In this article we describe object-based storage, stressing how it improves data sharing, security, and device intelligence. We also discuss some industry applications of object-based storage and academic research using objects as a foundation for building even more intelligent storage systems.

Why Can’t I Find My Files? New Methods for Automating Attribute Assignment
Soules & Ganger
Proceedings of the Ninth Workshop on Hot Topics in Operating Systems, USENIX Association, May 2003.

This paper analyzes various algorithms for scheduling low priority disk drive tasks. The derived closed form solution is applicable to a class of greedy algorithms that includes a variety of background disk scanning applications. By paying close attention to many characteristics of modern disk drives, the analytical solutions achieve very high accuracy: the difference between the predicted response times and the measurements on two different disks is only 3% for all but one examined workload. This paper also proves a theorem which shows that background tasks implemented by greedy algorithms can be accomplished with very little seek penalty. Using a greedy algorithm gives a 10% shorter response time for the foreground application requests and up to a 20% decrease in total background task run time compared to results from previously published techniques.

Storage-based Intrusion Detection: Watching Storage Activity For Suspicious Behavior
Pennington, Strunk, Griffin, Soules, Goodson & Ganger
12th USENIX Security Symposium, Washington, D.C., Aug 4-8, 2003. An early version is available as Carnegie Mellon University Technical Report CMU-CS-02-179, September 2002.

Storage-based intrusion detection allows storage systems to watch for data modifications characteristic of system intrusions. This enables storage systems to spot several common intruder actions, such as adding backdoors, inserting Trojan horses, and tampering with audit logs. Further, an intrusion detection system (IDS) embedded in a storage device continues to operate even after client systems are compromised. This paper describes a number of specific warning signs visible at the storage interface. Examination of 18 real intrusion tools reveals that most (15) can be detected based on their changes to stored files. We describe and evaluate a prototype storage IDS, embedded in an NFS server, to demonstrate both feasibility and efficiency of storage-based intrusion detection. In particular, both the performance overhead and memory required (152 KB for 4730 rules) are minimal.

A Case for Staged Database Systems
Harizopoulos & Ailamaki
In proceedings of the First International Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, January 2003.

Traditional database system architectures face a rapidly evolving operating environment, where millions of users store and access terabytes of data. In order to cope with increasing demands for performance, high-end DBMS employ parallel processing techniques coupled with a plethora of sophisticated features. However, the widely adopted, work-centric, thread-parallel execution model entails several shortcomings that limit server performance when executing workloads with changing requirements. Moreover, the monolithic approach in DBMS software has led to complex and difficult-to-extend designs. This paper introduces a staged design for high-performance, evolvable DBMS that are easy to tune and maintain. We propose to break the database system into modules and to encapsulate them into self-contained stages connected to each other through queues. The staged, data-centric design remedies the weaknesses of modern DBMS by providing solutions at both a hardware and a software engineering level.

[Figure: A production-line model for staged servers. Requests flow from IN through module 1 ... module N to OUT; l_i is the time to load module i, and m_i is the mean service time when module i is already loaded.]

Adaptive, Hands-Off Stream Mining
Papadimitriou, Brockwell & Faloutsos
Carnegie Mellon University SCS Technical Report CMU-CS-02-205. Also published in Proceedings VLDB 03, Berlin, Germany, Sept 9-12, 2003.

Sensor devices and embedded processors are becoming ubiquitous, especially in measurement and monitoring applications. Automatic discovery of patterns and trends in the large volumes of such data is of paramount importance. The combination of rela-
Continued on page 16
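
The staged design described in the abstract of "A Case for Staged Database Systems" (modules encapsulated in self-contained stages that are connected through queues) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the Stage class, the three toy stage functions (parse, plan, execute), and the choice of one worker thread per stage are hypothetical and are not taken from the paper.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a stage to shut down


class Stage:
    """A self-contained stage: an input queue plus one worker thread (hypothetical design)."""

    def __init__(self, name, work, downstream=None):
        self.name = name
        self.work = work              # function applied to every request
        self.inbox = queue.Queue()    # requests wait here until the stage serves them
        self.downstream = downstream  # next Stage, or None for the last stage
        self.results = []             # outputs collected only by the last stage
        self._thread = threading.Thread(target=self._run)

    def start(self):
        self._thread.start()

    def join(self):
        self._thread.join()

    def _run(self):
        while True:
            item = self.inbox.get()
            if item is _STOP:
                # Propagate shutdown so later stages drain their queues and exit in order.
                if self.downstream is not None:
                    self.downstream.inbox.put(_STOP)
                return
            out = self.work(item)
            if self.downstream is not None:
                self.downstream.inbox.put(out)
            else:
                self.results.append(out)


def run_pipeline(requests):
    """Push requests through three toy stages and return the final outputs."""
    exec_stage = Stage("execute", lambda p: f"result({p})")
    plan_stage = Stage("plan", lambda p: f"plan({p})", downstream=exec_stage)
    parse_stage = Stage("parse", lambda sql: f"parsed({sql})", downstream=plan_stage)
    stages = [parse_stage, plan_stage, exec_stage]
    for stage in stages:
        stage.start()
    for request in requests:
        parse_stage.inbox.put(request)
    parse_stage.inbox.put(_STOP)
    for stage in stages:
        stage.join()
    return exec_stage.results


if __name__ == "__main__":
    print(run_pipeline(["q1", "q2"]))
```

Because each stage here has a single worker and a FIFO queue, requests leave the pipeline in arrival order. The sketch does not model module load times or service times; it only shows the queue-connected structure.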
Database Query Execution with Fates
Jiri Schindler, Minglong Shao, Steve Schlosser, Anastassia Ailamaki & Greg Ganger

Current database systems use data layouts that can exploit unique features of only one level of the memory hierarchy (cache/main memory or on-line storage). Such layouts optimize for the predominant access pattern of one workload (e.g., DSS), while trading off performance of another workload type (e.g., OLTP). Achieving efficient execution of different workloads without this trade-off or the need to manually re-tune the system for each workload type is still an unsolved problem. The “Fates” database system project answers this challenge.

The primary goal of the project is to achieve efficient execution of compound database workloads at all levels of a database system memory hierarchy. By leveraging unique characteristics of devices at each level, the Fates database system can automatically match query access patterns to the respective device characteristics, which eliminates the difficult and error-prone task of manual performance tuning.

Borrowing from the Greek myth of the Three Fates (Clotho, Lachesis, and Atropos), who spin, measure, and cut the thread of life, the three components of our database system (bearing the Fates’ respective names) establish proper abstractions in the database query execution engine. These abstractions cleanly separate the functionality of each component (described below) while allowing efficient query execution along the entire path through the database system.

The main feature of the Fates database system is the decoupling of the in-memory data layout from the on-disk storage layout. Different data layouts at each memory hierarchy level can be tailored to leverage specific device characteristics at that level. An in-memory page layout can leverage L1/L2 cache characteristics, while another layout can leverage characteristics of storage devices. Additionally, the storage-device-specific layout yields efficient I/O accesses to records for different query access patterns. Finally, this decoupling also provides flexibility in determining what data to request and keep around in main memory.

Traditional database systems are forced to fetch and store unnecessary data as an artifact of a chosen data layout. The Fates database system, on the other hand, can request, retrieve, and store just the needed data, catering to the needs of a specific query. This conserves storage device bandwidth and memory capacity and avoids cache pollution, all of which improve query execution time.

Scatter/gather I/O facilitates efficient transformation from one layout into another and creates, on the fly, an organization amenable to the needs of an individual query. In addition to eliminating expensive data copies, these I/Os match explicit storage device characteristics, ensuring efficient execution at the storage device. Finally, thanks to proper abstractions established by each Fate, other database system components (e.g., the query optimizer or the locking manager) remain unchanged.

Figure 1: The Fates database system architecture. Efficient transformation of on-disk layout to in-memory page layout is achieved by DMA and scatter/gather I/O. (Diagram layers, top to bottom: operators such as tblscan and idxscan; Buffer Pool Manager with buffer pool and page headers; Storage Manager; Logical Volume Mgr over a disk array of disk 0 and disk 1. Annotations: operators issue requests for pages with a subset of attributes (payload) and get access to the payload; payload data is placed directly via scatter/gather I/O; the storage interface exposes efficient access to non-contiguous blocks.)

Clotho ensures efficient query execution at the cache/main-memory level and comes into play at the inception of a request for particular data. When the data desired by a query is not found in the buffer pool, Clotho creates a skeleton of a page in a memory frame describing what data is needed and how to lay it out.

The page layout builds upon a cache-friendly layout, called PAX [1], which groups data into minipages, where each minipage contains the data of only a single attribute (table column). Aligning records on cache line boundaries within each minipage and taking advantage of cache prefetching logic improves the performance of scan operations without sacrificing full-record access.

Clotho adjusts the position and size of each minipage within the frame. It matches the minipage size to (a multiple of) the block size of the storage device, provided by Atropos, and decides which minipages to store within a single page based on the needs of a given query. Hence, minipages within a single memory frame contain only the attributes needed by that query. The remaining attributes constituting the full record, but not needed by the query, are never requested.

Figure 2: Operations performed by each component of the database system, showing the content of a single memory frame at each stage of a data request during query execution. (Diagram steps: SCAN operator: get next page (PageID); Buffer Pool Manager: allocate memory frame, fill page header; Lachesis Storage Manager: generate volume requests; Atropos LV Manager: generate disk requests, issue disk I/Os.)

The skeleton page inside the memory frame includes a header that lists the attributes and the range of records to be retrieved from the storage device and
stored in that frame. Clotho marks the set of attributes needed by the query in the page header (i.e., the payload) and identifies the range of records to be put into the page. Hence, the page header serves as a request from Clotho to Lachesis to retrieve the desired data.

The Lachesis database storage manager handles the mapping and access to minipages located within the LBNs of on-line storage devices. Utilizing storage device-provided performance characteristics and matching query access patterns (e.g., sequential scan) to these hints, Lachesis constructs efficient I/Os. It also sets scatter/gather I/O vectors that allow direct placement of individual minipages to proper memory frames without unnecessary memory copies.

Figure 3: Mapping of a database table with 12 attributes onto an Atropos logical volume with quadrangles. Each quadrangle holds 4 attributes with 100 records each. The dashed arrow line shows efficient semi-sequential access for retrieving complete records. (The diagram shows disks 0 through 3 holding quadrangles 0 through 11, including parity quadrangles, with attribute groups A1-A4, A5-A8, and A9-A12 and record ranges r0:r99, r100:r199, and r200:r299.)

LBN on a different track. Issuing all requests together to these LBNs allows the disk’s internal scheduler to service the request with the smallest positioning cost first, given the current disk head position. Servicing the remaining requests does not incur any other positioning overhead thanks to the diagonal layout. Hence, this semi-sequential access pattern is much more efficient than reading some randomly chosen LBNs spread across the set of adjacent tracks of a single quadrangle.

Semi-sequential quadrangle access is used for retrieving minipages with all attributes comprising a full record. If the number of minipages/attributes does not fit into a single quadrangle, the remaining minipages are mapped to a quadrangle on a different disk. Using this mapping
                                               table attributes. By utilizing features built                       method, several disks can be accessed
Explicit relationships between indi-                                                                               in parallel to retrieve full records effi-
vidual logical blocks (LBNs), estab-           into disk firmware and a new data lay-
                                               out, Atropos delivers the aggregate                                 ciently.
lished by Atropos, allow Lachesis to
devise a layout that groups together a         bandwidth of all disks for accesses in                              SUMMARY
set of related minipages. This grouping        both majors, without penalizing small
                                               random I/O accesses.                                                Fates is the first database system that
ensures that all attributes belonging to                                                                           leverages the unique characteristics of
the same set of records can be accessed        The basic allocation unit in Atropos is                             each level in the memory hierarchy. The
in parallel, while a particular attribute      the quadrangle, which is a collection of                            decoupling of data layouts at the cache/
can be accessed with efficient sequen-         logical volume LBNs. A quadrangle                                   main-memory and on-line storage lev-
tial I/Os. The relationships between           spans the entire track of a single disk                             els is possible thanks to carefully or-
LBNs serve as hints that let Lachesis          along one dimension and a small num-                                chestrated interactions between each
construct I/Os that Atropos can execute        ber of adjacent tracks along the other                              Fate. Properly designed abstractions
efficiently.                                   dimension. Each successive quadrangle                               that hide specifics, yet allow the other
The layout and content of each                 is mapped to a different disk, much like                            components to take advantage of their
minipage is transparent to both Lachesis       a stripe unit of an ordinary RAID group.                            unique strengths achieve efficient query
and Atropos. Lachesis merely decides           Hence, the RAID 1 or RAID 5 data pro-                               execution at all levels of the memory
how to map each minipage to LBNs to            tection schemes fit the quadrangle lay-                             hierarchy.
be able to construct a batch of efficient      out naturally.
I/Os. Atropos in turn, cuts these batches
into individual disk I/Os comprising the       Atropos stripes contiguous LBNs                                     [1] A. Ailamaki, D. J. DeWitt, M. D.
exported logical volume. It does not           across quadrangles mapped to all disks.                             Hill, M. Skounakis: Weaving Relations
care where the data will be placed in          This provides aggregate streaming                                   for Cache Performance. Proc. of VLDB,
memory; this is decided by the scatter/        bandwidth of all disks for table accesses                           169-180. Morgan Kaufmann, 2001.
gather I/O vectors set up by Lachesis.         in column-major order (e.g., for single
                                                                                                                   [2] J. Schindler, J. L. Griffin, C. R.
                                               attribute scans). With quadrangle
                                                                                                                   Lumb, G.R. Ganger. Track-aligned Ex-
ATROPOS                                        “width” matching disk track size, se-
                                                                                                                   tents: Matching Access Patterns to Disk
                                               quential accesses exploits the high effi-
Atropos is a disk array logical volume                                                                             Drive Characteristics. Conf. on File and
                                               ciency of track-based access [2].
manager that offers efficient access in                                                                            Storage Technologies, 259-274. Usenix
both row- and column-major orders. For         Accesses in the other major order (i.e.                             Association, 2002.
database systems, this translates into ef-     row-major order), called semi-sequen-
ficient access to complete records as          tial, proceed to LBNs mapped diago-
well as scans of an arbitrary number of        nally across a quadrangle, with each

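The placement that Figure 3 describes for the Atropos volume can be approximated with a simple rotation rule. The sketch below is an illustration inferred from the quadrangle labels in the figure, not Atropos code: it assumes four disks, one parity quadrangle per row of quadrangles, and hypothetical function names (`parity_disk`, `data_disk`, `quadrangle`). Shifting each attribute group by one disk places the three attribute groups of any record range on three different disks, which is what allows a full record to be fetched in parallel.

```python
NUM_DISKS = 4        # disks in the logical volume (as drawn in Figure 3)
ATTR_GROUPS = 3      # A1-A4, A5-A8, A9-A12
RECORD_RANGES = 3    # r0:r99, r100:r199, r200:r299


def parity_disk(group):
    """Disk holding the parity quadrangle for one row of quadrangles.

    Assumption: parity rotates by one disk per row, starting on the last disk.
    """
    return (NUM_DISKS - 1 + group) % NUM_DISKS


def data_disk(record_range, group):
    """Disk holding attribute group `group` of record range `record_range`."""
    return (record_range + group) % NUM_DISKS


def quadrangle(record_range, group):
    """Quadrangle number, counting row by row across the disks."""
    return group * NUM_DISKS + data_disk(record_range, group)


if __name__ == "__main__":
    for r in range(RECORD_RANGES):
        disks = [data_disk(r, g) for g in range(ATTR_GROUPS)]
        quads = [quadrangle(r, g) for g in range(ATTR_GROUPS)]
        print(f"record range {r}: disks {disks}, quadrangles {quads}")
```

Under these assumptions, a scan of one attribute group touches every data disk in a row of quadrangles, while a full-record fetch for one record range touches one quadrangle on each of three disks.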

AWARDS & OTHER PDL NEWS

September 2003
Ganger & the PDL Awarded Equipment Grants from IBM and Intel

IBM Corporation and Intel Corporation have each generously donated over $80K in equipment to provide an early testbed for PDL’s new Self-* Storage project. The article on page 1 describes Self-* Storage.

September 2003
NSF Grant to Fund Self-* Storage Research

PDL researchers have received a $1.5 million NSF grant to pursue the Self-* Storage project, which seeks to create large-scale self-managing, self-organizing, self-tuning storage systems from generic servers. The project PI is Greg Ganger (ECE and CS; Director of PDL), and the co-PIs are Natassa Ailamaki (CS), Anthony Brockwell (Statistics), Garth Gibson (CS), and Mike Reiter (ECE and CS).

September 2003
Congratulations Natassa and

Natassa Ailamaki and Babak Falsafi are thrilled to announce the arrival of their daughter Niki Falsafi, who was born at Magee Women’s Hospital at 7:44 a.m. on September 27.

September 2003
5 CMU Professors Receive NSF Grant to Study Drinking Water Quality and Security*

Faculty members Jeanne VanBriesen, Civil and Environmental Engineering (CEE), Christos Faloutsos, Computer Science, Anastassia Ailamaki, Computer Science, Mitch Small, CEE and Engineering and Public Policy, and Paul Fischbeck, Social and Decision Sciences, have received a National Science Foundation grant of $1.5 million for a new project called “SENSORS: Placement and Operation of an Environmental Sensor Network to Facilitate Decision Making Regarding Drinking Water Quality and Security.”
*From CMU newspaper The Tartan, Sept. 8, 2003.

August 2003
Ted and Addie Marry!

Ted Wong and Addie Tyler were married on Aug 9, 2003 in the Sage Chapel at Cornell University in Ithaca, New York. Ted will be defending his dissertation on October 24, and then he and Addie are moving to the Palo Alto area in California where Ted will be working at IBM.

June 2003
Congratulations to Jiri and Katrina!

Jiri Schindler and Katrina Van Dellen were married at The Wiley Inn in Peru, Vermont on June 7, 2003. Jiri successfully defended his PhD dissertation on August 22 and will be moving to the Boston area soon to work at EMC.

July 2003
Jiri Schindler Receives Best Paper Award at VLDB Workshop

Jiri Schindler’s paper “Matching Database Access Patterns to Storage Characteristics,” co-authored with Anastassia Ailamaki and Greg Ganger, has received the award for Best Paper in the VLDB 2003 PhD Workshop from among 34 submissions. Jiri will present his paper at the VLDB PhD Workshop, co-located with VLDB 2003 (29th Conference on Very Large Databases) in Berlin in September. The VLDB 2003 PhD Workshop brings together PhD students working on topics related to the VLDB Conference series, to present and discuss their research in a constructive and international atmosphere. This paper is available on our publications page.

May 2003
John Linwood Griffin Receives Intel Fellowship

Our congratulations to John Linwood Griffin, who has been selected as a recipient of a 2003-04 Intel Foundation PhD Fellowship Award. The fellowship will cover John’s full tuition, fees, and stipend for the year. Additionally, the fellowship provides John with an Intel-based laptop and a mentor who will act as a link between the student and those people pursuing relevant research at Intel. The fellowship does not involve an internship; rather, it is targeted at Ph.D. candidates within 18 months of degree completion. Approximately 35 candidates are selected annually for the award from a very competitive field.

May 2003
Brandon Salmon Awarded NSF Graduate Research Fellowship*

The winners of this year’s National Science Foundation (NSF) Graduate Research Fellowships include Electrical and Computer Engineering (ECE) students Jennifer Morris and Brandon Salmon. The NSF’s Graduate Research Fellowship funds three years of graduate study, including a $27,500 stipend for the first 12 months and an annual tuition allowance of $10,500, paid to the university. This year’s contest was the most competitive in recent history: 7,788 applicants vied for 900 fellowships.
*CMU 8 1/2 x 11 News, May 1, 2003.

May 2003
James Hendricks Awarded National Defense Fellowship

Congratulations to James Hendricks, who has been awarded a National Defense Science and Engineering Graduate (NDSEG) Fellowship. The prevailing goal of this highly competitive program is “to provide the United States with talented, doctorally trained American men and women who will lead state-of-the-art research projects in disciplines having the greatest payoff to national security requirements.” Since the program’s inception 14 years ago, approximately 1,800 fellowships have been awarded from about 28,500 applications received. James’ fellowship is supported by the Air Force Office of Scientific Research (AFOSR) and covers his full tuition and required fees during that term. Fellows have no military or other service obligations, and must be working towards a PhD.

February 2003
Welcome Michelle!

Michelle Liu was born to Mengzhi Wang and Honliang Liu on February 2, 2003 at 7 lbs. 9 oz. and 17.2 inches. It looks like she is already following in her Mom’s footsteps into computer-related research.

January 2003
Chris Long and Greg Ganger Receive Funding from C3S

Chris Long and Greg Ganger have been awarded seed funding from the Center for Computer Security (C3S) at Carnegie Mellon for their project “Access Control for the Masses.” The project will fall within a new PDL research area dealing with Better User Interfaces.

February 2003
Congrats to Chris & Alexis Long!

Robert Nicholas Long joined his parents Chris and Alexis on Feb. 16, 2003! What a bright and happy looking little fellow.

October 2002
Welcome to the Newest Seshan!

Srini Seshan and his wife Asha welcomed their first baby on October 1, 2002. Sanjay Seshan was 7 lbs. 2 oz. and 20 3/4 inches long at birth. In the photo at 11 months of age, it looks like he has grown quite a bit since then!

Photo: Bruce Worthington of Microsoft discussing research with Steve Schlosser, Jiri Schindler, Brandon Salmon & Craig Soules.

Continued from page 1

their performance and reliability status, and exchange information with human administrators. Dataset assignments and redundancy schemes are dynamically adjusted based on observed and projected performance and reliability. We refer to self-* collections of storage bricks as storage constellations.

Administration and Organization. At the top level, a self-* storage system will still require human administrators to provide guidance, approve procurement requests, physically install and repair equipment, and provide high-level goals for the system. A self-* storage system will need an administrative interface to provide information and offer solutions to the human administrator when problems arise or trade-offs (e.g., between performance and reliability) are faced. A self-* administrative interface should also help administrators decide when to acquire new components, which would then be automatically integrated into the self-* constellation.

The supervisors, processes playing an organizational role in the infrastructure, form a management hierarchy. They dynamically tune dataset-to-worker assignments, redundancy schemes for given datasets, and router policies. The hierarchy of supervisor nodes controls data partitioning and request distribution among workers, with the objective of partitioning data and goals among its subordinates (workers or lower-level supervisors) such that, if its children meet their assigned goals, the goals for the entire subtree will be met. By communicating goals down the tree, a supervisor gives its subordinates the ability to assess their own performance relative to goals as they internally tune; the supervisors need not concern themselves with details of how subordinates meet their goals. The top of the hierarchy interacts with the system administrator, receiving high-level goals for datasets and providing status and procurement requests. Additional internal services, referred to as administrative assistants, are also needed for a self-* constellation to function. Examples include event logging for problem diagnosis, directory services to help translate component/service names to their locations in the network, and security services.

Data Access and Storage. Workers store data and routers ensure that I/O requests are delivered to the appropriate workers for service. Thus, self-* clients interact with a self-* router to access data. We envision two types of self-* clients. Trusted clients are considered a part of the system, and may be physically co-located with other components (e.g., router instances); examples are file servers or databases that use the self-* constellation as back-end storage. Untrusted clients are modified to support the self-* constellation's internal protocols; they interact directly with self-* workers, via self-* routers, with access privileges verified on each request.

An important part of a self-* router’s job is correctly handling accesses to data stored redundantly across storage nodes. Doing so requires a protocol to maintain data consistency and liveness in the presence of failures and concurrency. Since they will have flexibility in deciding which servers should handle certain requests, self-* routers will also have a role in dynamic load balancing as they deliver client requests to the appropriate workers, particularly when new data are created and when READ requests access redundant data. Doing so requires metadata for tracking current storage assignments, consistency protocols for accessing redundant data, and choices for routing requests.

Self-* workers will service requests for and store assigned data. We expect them to have the computation and memory resources needed to internally adapt to their observed workloads by, for example, reorganizing on-disk placements and specializing cache policies. Workers will also handle storage allocation internally, both to decouple external naming from internal placements and to allow support for internal versioning, since they will keep historical versions (e.g., snapshots) of all data to assist with recovery from dataset corruption. Although the self-* storage architecture would work with workers as block stores (like SCSI or IDE/ATA disks), we believe they will work better with a higher-level abstraction (e.g., objects or files), which will provide more information for adaptive specializations. Self-* workers must also provide support for a variety of maintenance functions, including crash recovery, integrity checking, and data migration.

Figure 2: “Head-end” servers bridge external clients into the self-* storage constellation.

Continued on page 20

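The supervisor hierarchy described in the Self-* Storage article hands goals down a tree so that, if every child meets its goal, the subtree goal is met. The sketch below is a minimal illustration of that idea, not the project's design: the `Node` class, the single throughput goal, the per-worker `capacity` figure, and the proportional split are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from fractions import Fraction


@dataclass
class Node:
    """A worker (no children) or a supervisor (has children)."""
    name: str
    capacity: int = 0                      # hypothetical: requests/s a worker can sustain
    children: list = field(default_factory=list)
    goal: Fraction = Fraction(0)           # throughput goal assigned by the parent

    def total_capacity(self):
        """Capacity of a worker, or the summed capacity of a supervisor's subtree."""
        if not self.children:
            return self.capacity
        return sum(child.total_capacity() for child in self.children)

    def assign(self, goal):
        """Record this node's goal and split it among subordinates by capacity."""
        self.goal = Fraction(goal)
        if not self.children:
            return
        total = self.total_capacity()
        if total == 0:
            raise ValueError(f"{self.name}: subordinates have no capacity")
        for child in self.children:
            child.assign(self.goal * child.total_capacity() / total)

    def workers(self):
        """All leaf workers below (or equal to) this node."""
        if not self.children:
            return [self]
        return [w for child in self.children for w in child.workers()]


if __name__ == "__main__":
    lower = Node("supervisor-a", children=[Node("w1", 100), Node("w2", 300)])
    top = Node("top", children=[lower, Node("w3", 600)])
    top.assign(500)
    for w in top.workers():
        print(w.name, w.goal)
```

Each node only sees its own goal, so a worker can judge its performance locally, and the sum of the children's goals always equals the parent's goal. Exact fractions are used to keep that sum free of rounding error.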
                                                                      PROPOSALS & DEFENSES

PH.D. DISSERTATION
Matching Application Access Patterns to Storage Device Characteristics

Jiri Schindler, ECE
August 22, 2003

Thesis statement: “With sufficient information, a storage manager can exploit unique storage device characteristics to achieve better, more robust I/O performance. This information can be abstract from device specifics, device-independent, and yet expressive enough to allow a storage manager to tune its access patterns to a given device.”

This dissertation contends that storage device resources are not utilized to their full potential because too much is hidden behind their high-level storage interfaces. Current storage interfaces do not convey sufficient information to the storage manager to enable it to make informed decisions leading to the most efficient use of the storage device. To bridge the information gap between hosts and storage devices, the storage device should explicitly state its performance characteristics. Using this static information, a storage manager can take advantage of the device’s unique strengths and avoid inefficient access patterns.

MS THESIS
A Framework for Implementing Background Storage Applications using Freeblock Scheduling

Eno Thereska, ECE
August 2003

There are many disk maintenance tasks that are required for robust system operation but have loose time constraints. Such “background” tasks need to complete within a reasonable amount of time, but are generally intended to occur during otherwise idle time. Examples include cache write-back, defragmentation, backup, integrity checking, virus scanning, report generation, tamper detection, and index generation. Developers of such applications have had no clean way of designing these applications. The main reason for that is the traditional lack of “trust” applications have had in storage devices to do what is best for the application, with consequences reflected in the narrow interfaces between the two.

We introduce a framework for implementing background storage applications by adding a new asynchronous interface to the storage device. Applications register background tasks through the interface and the storage device notifies them of their completion. The storage device uses freeblock scheduling together with idle time detectors to guarantee that the background applications will make good progress independent of the load of the system and without impacting the foreground workload. This framework is described and evaluated in the context of two real applications: a snapshot-based backup and a cache cleaner.

MS THESIS
Storage-based Intrusion Detection: Watching Storage Activity For Suspicious Behavior

Adam Pennington, ECE
August 2003

Please see the abstract of the paper of the same name on pg. 7 for an outline of this thesis.

MS THESIS
Opportunistic Use of Content Addressable Storage for Distributed File Systems

Niraj Tolia, ECE
May 2003

Please see the abstract of the paper of the same name on pg. 17 for an outline of this thesis.

THESIS PROPOSAL
Efficient, Flexible Consistency for Highly Fault Tolerant Storage

Garth Goodson, ECE
August 18, 2003

Fault-tolerant storage systems spread data redundantly across a set of storage-nodes in an effort to preserve and provide access to data despite failures. One difficulty created by this architecture is the need for a consistent view, across storage-nodes, of the most recent update. Such consistency is made difficult by concurrent updates, partial updates made by clients that fail, and failures of storage-nodes. This thesis will demonstrate how to achieve scalable, highly fault-tolerant storage systems by leveraging an efficient and flexible family of strong consistency protocols enabled by server versioning. In particular, the design of block-based storage systems and file systems will be evaluated. The storage protocol is made space-efficient through the use of erasure codes and made scalable by offloading work from the storage-nodes to the clients. The protocol family is flexible in that it covers a broad range of system model assumptions with no changes to the client-server interface, server implementations, or system structure. Each protocol scales with its requirements—it only does work necessitated by the system and fault models.

THESIS PROPOSAL
Staged Database Systems

Stavros Harizopoulos, SCS
April 24, 2003

Thesis Statement: “By organizing and assigning system components into self-contained stages, database systems can exploit instruction and data commonality across concurrent requests thereby increasing throughput. Furthermore, staged database systems are more scalable, easier to extend,

Continued on page 18

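The asynchronous interface summarized in the freeblock scheduling thesis abstract (applications register background tasks, and the device notifies them on completion) can be pictured with a toy model. Everything below is hypothetical and only illustrates the register/notify pattern: the class `BackgroundDevice`, its method names, and the idea of passing in the set of blocks the device can currently read at no cost are assumptions, not the interface from the thesis.

```python
class BackgroundDevice:
    """Toy stand-in for a storage device with an asynchronous background-task interface."""

    def __init__(self):
        self._tasks = []      # each entry: (task_id, set of blocks still wanted, callback)
        self._next_id = 0

    def register(self, blocks, on_complete):
        """Register a background task that wants every block in `blocks` read once."""
        task_id = self._next_id
        self._next_id += 1
        self._tasks.append((task_id, set(blocks), on_complete))
        return task_id

    def service_free_blocks(self, available_blocks):
        """Model one opportunity (idle time or a freeblock window) to read blocks for free.

        Tasks whose wanted set becomes empty are removed and their callback is invoked.
        """
        available = set(available_blocks)
        still_pending = []
        for task_id, wanted, on_complete in self._tasks:
            wanted.difference_update(available)   # these blocks are now satisfied
            if wanted:
                still_pending.append((task_id, wanted, on_complete))
            else:
                on_complete(task_id)              # notify the application
        self._tasks = still_pending


if __name__ == "__main__":
    device = BackgroundDevice()
    backup = device.register([1, 2, 3], lambda tid: print(f"task {tid} complete"))
    device.service_free_blocks([1, 2])   # partial progress, no notification yet
    device.service_free_blocks([3, 7])   # last wanted block arrives, callback fires
```

The application never issues reads itself; it only states what it needs and waits for the callback, which leaves the device free to pick whichever blocks are cheapest to read at any moment.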
Craig Soules & Greg Ganger
As storage capacity continues to increase, users find it increasingly difficult to manage their files using traditional directory hierarchies. Attribute-based naming enables powerful search and organization tools for ever-increasing user data sets. However, such tools are only useful in combination with accurate attribute assignment. Existing systems rely on user input and content analysis, but they have enjoyed minimal success. We propose several new approaches to automatically assigning attributes to files through context analysis, a technique that has been successful in the Google web search engine. With extensions like application hints (e.g., web links for downloaded files) and inter-file relationships, it should be possible to infer useful attributes for many files, making attribute-based search tools more effective.

Existing Organizational Tools

As storage capacity increases, the amount of data belonging to an individual user increases accordingly. Soon, storage capacity will reach a point where there will be no reason for a user to ever delete old content; in fact, the time required to do so would be wasted. The challenge has shifted from deciding what to keep to finding particular information when it is desired. To meet this challenge, we need to improve our approach to personal data organization.

Today, most systems provide a tree-like directory hierarchy to organize files. Although this is easy for most users to understand, it does not provide the flexibility required to scale to large numbers of files. In particular, the strict hierarchy provides only a single categorization with no cross-referenced information.

Alternatives to the standard directory hierarchy generally assign attributes to files, providing the ability to cluster and search for files by their attributes. An attribute can be any metadata that describes the file, although most systems use keywords or <category, value> pairs. The key challenge is assigning useful, meaningful attributes to files.

Unfortunately, the two most prevalent methods of attribute assignment, user input and content analysis, have been largely unsuccessful. Although users often have a good understanding of the files they create, it can be time-consuming and unpleasant to distill that information into the right set of keywords. As a result, users are understandably reluctant to do so. On the other hand, content analysis takes none of the user’s time, and can be performed entirely in the background to eliminate any potential performance penalty. However, the complexity of language parsing, combined with the large number of proprietary file formats and non-textual data types, restricts the effectiveness of content analysis.

Context-based Attributes

Early web search engines (e.g., Lycos) relied upon user input (user-submitted web pages) and content analysis (word counts, word proximity, etc.). Although valuable, the success of these systems has been eclipsed by the success of Google.

To provide better search results, Google utilizes two forms of context analysis. First, it uses the text associated with a link to determine attributes for the linked site. This text gives the context of both the creator of the linking site and the user who clicks on the link at that site. The more times that a particular word links to a site, the higher that word is ranked for that site. Second, Google uses the actions of a user after a search to decide what the user wanted from that search. For example, if a user clicks on the first four links of a given search, and then does not return, it is likely that the fourth link was the best match, providing the user’s context for those search terms.

Unfortunately, Google’s approach to indexing does not translate directly into the realm of file systems. Much of the information that Google relies on does not exist within a file system. Also, Google’s query feedback mechanism relies on two properties: users are normally looking for the most popular sites when they perform a query, and Google has a large user base that will repeat the same query many times. Conversely, in file systems, users usually search for files that have not been accessed in a long time, because they usually remember where recently accessed files reside, and there is generally only a single user for each set of files, making it unlikely that frequent queries will be generated for any given file.

Context-based Attributes in File Systems

We are investigating four approaches to automatically gathering context information for use in file systems. The first two focus on gathering attributes when a file is created or accessed. The second two focus on propagating attributes among related files to increase the coverage of attribute assignment. Together, these techniques should categorize a much broader set of files than creation-based attribute assignment alone.

Application assistance: Although computers provide a vast array of functionality, most people use their computer for a limited set of tasks using a small set of applications that, in turn, access and create most of the user’s files. Modifying these applications to provide hints about the user’s context could provide invaluable attribute information.

Existing user input: Although most users are not willing to input additional information, they are willing to choose a directory and name for their files. Each of the sub-directories along the path and the file name itself probably contain context information that can be used to assign attributes. For example, if the user stores a file in “/home/papers/FS/Attribute-based/,” then it
                                                                                          Continued on page 15

14                                                                                                    THE PDL PACKET
                                       INFERRING ATTRIBUTES FROM CONTEXT
Continued from page 14
is likely that they believe the file is a “paper” having to do with “FS,” “attribute-based,” and “semantic.”

User access patterns: As users access their files, the pattern of their accesses provides a set of temporal relationships between files. A possible use of this information is to help propagate information between related files. For example, accessing “” and “” followed by updating “related.tex” may indicate a relationship between the three files. Subsequently, accessing “related.tex” and creating “” may indicate a transitive relationship.

Inter-file content analysis: Content analysis will continue to be an important part of automatically assigning attributes. In addition to existing per-file analysis techniques, our focus on creating context-based connections between files suggests another source of attributes: content-based relationships. For example, some current file systems use hashing to eliminate duplicate blocks within a file system, or even locate similarities on non-block-aligned boundaries. Such content overlap could also be used to identify related files, by treating files with large matching data sets as related.

Similarly, users (or the system) will often keep several slightly different versions of a file. Although these files generally contain differences, often the inherent information contained within does not change (e.g., a user may keep three instances of their resume, each focused on a different type of job application). This gives the system two opportunities for content analysis. First, content comparison can identify related files. Second, by performing content analysis solely on the differences between versions, it may be possible to determine version-specific attributes, making it easier for users to locate individual version instances.

Prototype Evaluation System

Figure 1 shows an overview of a prototype system for evaluating context-based attribute assignment schemes. The system is composed of four main parts: the tracer, the application interface, the analyzer, and the database.

[Figure 1: A prototype system for evaluating context-based attribute assignment schemes. Labeled components: Applications, Application Interface, Database, Operating System, Tracer, File System.]

The tracer keeps a trace of all file system activity in the system. Any file system calls made by applications are tracked and stored in a file for later offline analysis. This allows a single system to employ a variety of different analysis techniques. The application interface allows applications to pass context information into the system, such as email header information or link information from a web browser. The analyzer combines this application information with offline trace analysis to generate attributes for files. All updated attribute information is passed to the database, which provides the search interface to the application. It allows applications to locate files using the file attributes assigned by the analyzer. Feedback from the search results is pushed to the analyzer for further attribute refinement.

This design could include multiple databases. In order to compare the results of different trace analysis algorithms, the analyzer could maintain a database for each, and users could compare the results of the different approaches. For more information on this project see Soules [1] or the PDL project page at AttributeNaming/.

References

[1] Craig A.N. Soules, Greg Ganger. Why Can’t I Find My Files? New methods for automating attribute assignment. Proceedings of the Ninth HotOS Workshop, USENIX Association, May 2003.

Young Professor Tim Ganger teaching the ECE 18-746 storage systems class.
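
As a rough illustration of two of the approaches described in this article (existing user input and user access patterns), the following Python sketch derives keyword attributes from a file's path and then propagates attributes between files that were accessed close together in time. It is not the prototype's code: the function names, the ignore list, the time window, and the example paths are all assumptions made for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def attributes_from_path(path, ignore=frozenset()):
    """Turn directory names and the file name (minus extension) into keywords.

    `ignore` is a hypothetical stop list of lowercase words (e.g. "home")
    that carry no useful context.
    """
    parts = [p for p in PurePosixPath(path).parts if p != "/"]
    if not parts:
        return set()
    *dirs, name = parts
    words = dirs + [PurePosixPath(name).stem]  # strip extension from file name only
    return {w.lower() for w in words if w.lower() not in ignore}


def related_by_window(trace, window):
    """Relate files accessed within `window` seconds of each other.

    `trace` is a list of (timestamp_seconds, path) tuples sorted by time,
    standing in for the output of a file system tracer.
    """
    related = defaultdict(set)
    for i, (t_i, p_i) in enumerate(trace):
        for t_j, p_j in trace[i + 1:]:
            if t_j - t_i > window:
                break  # trace is sorted, so later events are even further away
            if p_i != p_j:
                related[p_i].add(p_j)
                related[p_j].add(p_i)
    return related


def propagate(attrs, related):
    """One round of propagation: each file also gains its neighbors' attributes."""
    out = {}
    for path in set(attrs) | set(related):
        merged = set(attrs.get(path, set()))
        for other in related.get(path, ()):
            merged |= attrs.get(other, set())
        out[path] = merged
    return out


if __name__ == "__main__":
    ignore = {"home", "alice"}  # assumed stop list
    trace = [  # hypothetical access trace
        (0, "/home/alice/papers/storage/a.ps"),
        (5, "/home/alice/papers/storage/b.ps"),
        (8, "/home/alice/drafts/related.tex"),
        (500, "/home/alice/music/song.mp3"),
    ]
    attrs = {p: attributes_from_path(p, ignore) for _, p in trace}
    for path, keywords in sorted(propagate(attrs, related_by_window(trace, 30)).items()):
        print(path, sorted(keywords))
```

A single propagation round is used here to keep the sketch short; repeating it would spread attributes along transitive relationships, at the cost of diluting them, so a real analyzer would likely weight or limit propagation.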

FALL 2003                                                                                                                                 15
Continued from page 7
tively limited resources (CPU, memory and/or communication bandwidth and power) poses some interesting challenges. We need both powerful and concise "languages" to represent the important features of the data, which can (a) adapt and handle arbitrary periodic components, including bursts, and (b) require little memory and a single pass over the data.

This allows sensors to automatically (a) discover interesting patterns and trends in the data, and (b) perform outlier detection to alert users. We need a way for a sensor to discover something like "the hourly phone call volume so far follows a daily and a weekly periodicity, with bursts roughly every year," which a human might recognize as, e.g., the Mother's Day surge. When possible and if desired, the user can then issue explicit queries to further investigate the reported patterns.

In this work we propose AWSOM (Arbitrary Window Stream mOdeling Method), which allows sensors operating in remote or hostile environments to discover patterns efficiently and effectively, with practically no user intervention. Our algorithms require limited resources and thus can be incorporated in individual sensors, possibly alongside a distributed query processing engine. Updates are performed in constant time, using sub-linear (in fact, logarithmic) space. Existing state-of-the-art forecasting methods (AR, SARIMA, GARCH, etc.) fall short on one or more of these requirements. To the best of our knowledge, AWSOM is the first method that has all the above characteristics.

Experiments on real and synthetic datasets demonstrate that AWSOM discovers meaningful patterns over long time periods. Thus, the patterns can also be used to make long-range forecasts, which are notoriously difficult to perform automatically and efficiently. In fact, AWSOM outperforms manually set up auto-regressive models, both in long-term pattern detection and modeling and, by at least 10x, in resource consumption.

Location-based Node IDs: Enabling Explicit Locality in DHTs

Zhou, Ganger & Steenkiste

Carnegie Mellon University School of Computer Science Technical Report CMU-CS-03-171, August 2003.

Current peer-to-peer systems based on DHTs struggle with routing locality and content locality because of random node ID assignment. To address these issues, we promote the use of location-based node IDs to encode physical topology and improve routing. This gives applications explicit knowledge about and control over data locality at a coarse grain. Applications can place content in particular regions or route towards a close replica. Schemes to address the difficulties that ensue, particularly load imbalance, are discussed.

GEM: Graph EMbedding for Routing and Data-Centric Storage in Sensor Networks Without Geographic Information

Newsome & Song

Proceedings of the First ACM Conference on Embedded Networked Sensor Systems (SenSys 2003). November 5-7, 2003, Redwood, CA.

[Figure: Virtual polar coordinate routing for VPCS. A packet is routed from S to D using the three routing algorithms, each shown on a tree whose nodes are labeled with ranges from [0:100] down to [0:25], [26:50], [51:75], and [76:100]: Naive Tree (4 hops), Smart Tree (3 hops), and VPCR (2 hops). Smart Tree and VPCR use a 1-hop neighborhood.]

The widespread deployment of sensor networks is on the horizon. One of the main challenges in sensor networks is to process and aggregate data in the network rather than wasting energy by sending large amounts of raw data to reply to a query. Some efficient data dissemination methods, particularly data-centric storage and information aggregation, rely on efficient routing from one node to another. In this paper we introduce GEM (Graph EMbedding for sensor networks), an infrastructure for node-to-node routing and data-centric storage and information processing in sensor networks. Unlike previous approaches, it does not depend on geographic information, and it works well even in the face of physical obstacles. In GEM, we construct a labeled graph that can be embedded in the original network topology in an efficient and distributed fashion. In that graph, each node is given a label that encodes its position in the original network topology. This allows messages to be efficiently routed through the network, while each node only needs to know the labels of its neighbors. To demonstrate how GEM can be applied, we have developed a concrete graph em-
                                                                                          Continued on page 17

16                                                                                                         THE PDL PACKET
                                                                                                                                                                     RECENT PUBLICATIONS
Continued from page 16
bedding method, VPCS (Virtual Polar Coordinate Space). In VPCS, we embed a ringed tree into the network topology, and label the nodes in such a manner as to create a virtual polar coordinate space. We have also developed VPCR, an efficient routing algorithm that uses VPCS. VPCR is the first algorithm for node-to-node routing that guarantees reachability, requires each node to keep state only about its immediate neighbors, and requires no geographic information. Our simulation results show that VPCR is robust on dynamic networks, works well in the face of voids and obstacles, and scales well with network size and density.

Opportunistic Use of Content Addressable Storage for Distributed File Systems

Tolia, Kozuch, Satyanarayanan, Karp, Bressoud & Perrig

Proceedings USENIX Annual Technical Conference, General Track 2003: 127-140, June 9-14, San Antonio, TX.

Motivated by the prospect of readily available Content Addressable Storage (CAS), we introduce the concept of file recipes. A file's recipe is a first-class file system object listing content hashes that describe the data blocks composing the file. File recipes provide applications with instructions for reconstructing the original file from available CAS data blocks. We describe one such application of recipes, the CASPER distributed file system. A CASPER client opportunistically fetches blocks from nearby CAS providers to improve its performance when the connection to a file server traverses a low-bandwidth path. We use measurements of our prototype to evaluate its performance under varying network conditions. Our results demonstrate significant improvements in execution times of applications that use a network file system. We conclude by describing fuzzy block matching, a promising technique for using approximately matching blocks on CAS providers to reconstitute the exact desired contents of a file at a client.

[Figure: System diagram. Labeled components: Client, Coda Client, Jukebox, CAS Storage, Server, Coda File Server. Labeled links: LAN Connection, WAN Connection. Labeled steps: 1. File Read; 2. Recipe Request; 3. Recipe Response; 4. CAS Request; 5. CAS Reply; 6. Missed Block Request; 7. Missed Block Response; File Writes.]

Metadata Efficiency in a Comprehensive Versioning File System

Soules, Goodson, Strunk & Ganger

2nd USENIX Conference on File and Storage Technologies, San Francisco, CA, Mar 31 - Apr 2, 2003.

A comprehensive versioning file system creates and retains a new file version for every WRITE or other modification request. The resulting history of file modifications provides a detailed view to tools and administrators seeking to investigate a suspect system state. Conventional versioning systems do not efficiently record the many prior versions that result. In particular, the versioned metadata they keep consumes almost as much space as the versioned data. This paper examines two space-efficient metadata structures for versioning file systems and describes their integration into the Comprehensive Versioning File System (CVFS). Journal-based metadata encodes each metadata version into a single journal entry; CVFS uses this structure for inodes and indirect blocks, reducing the associated space requirements by 80%. Multiversion b-trees extend the per-entry key with a timestamp and keep current and historical entries in a single tree; CVFS uses this structure for directories, reducing the associated space requirements by 99%. Experiments with CVFS verify that its current-version performance is similar to that of non-versioning file systems. Although access to historical versions is slower than in conventional versioning systems, checkpointing is shown to mitigate this effect.

Byzantine-tolerant Erasure-coded Storage

Goodson, Wylie, Ganger & Reiter

Carnegie Mellon University SCS Technical Report CMU-CS-03-187, September, 2003.

This paper describes a decentralized consistency protocol for survivable storage that exploits data versioning within storage-nodes. Versioning enables the protocol to efficiently provide linearizability and wait-freedom of read and write operations to erasure-coded data in asynchronous environments with Byzantine failures of clients and servers. Exploiting versioning storage-nodes, the protocol shifts most work to clients. Reads occur in a single round-trip unless clients observe concurrency or write failures. Measurements of a storage system using this protocol show that
                                                                                          Continued on page 19

FALL 2003                                                                                                                                                                                                                 17
Continued from page 13

and more readily fine-tuned than traditional database systems.”

Database system architectures face a rapidly evolving operating environment where millions of users store and access terabytes of data. To cope with increasing demands for performance, high-end DBMS employ parallel processing techniques coupled with a plethora of sophisticated features. However, the widely adopted work-centric thread-parallel execution model entails several shortcomings that limit server performance, the most important being failure to exploit instruction and data commonality across concurrent requests. Moreover, the monolithic approach in DBMS software has led to complex designs which are difficult to extend.

This thesis introduces a staged design for high-performance, evolvable DBMS that are easy to fine-tune and maintain. I propose to break the database system into modules and encapsulate them into self-contained stages connected to each other through queues. The staged, data-centric design remedies the weaknesses of modern DBMS by providing solutions at (a) the hardware level: it optimally exploits the underlying memory hierarchy, and (b) the software engineering level: it is more scalable, easier to extend, and more readily fine-tuned than traditional database systems.

THESIS PROPOSAL

Using MEMS-based Storage Devices in Computer Systems
Steve Schlosser, ECE
June 5, 2003

MEMS-based storage is an interesting new technology that promises to bring fast, non-volatile, mass data storage to computer systems. MEMS-based storage devices (MEMStores) themselves consist of several thousand read/write tips, analogous to the read/write heads of a disk drive, which read and write data in a recording medium. This medium is coated on a moving rectangular surface that is positioned by a set of MEMS actuators. Access times are expected to be less than a millisecond with power consumption 10–100X less than a low-power disk drive, while streaming bandwidth and volumetric density are expected to be around those of disk drives.

We are starting to explore how MEMStores would best be used in computer systems and how those systems should adapt to their differences as compared to disks. For example, existing operating system policies are tuned for disks, including request scheduling, data layout, and power conservation. Also, given the performance, capacity, and non-volatility of MEMStores, they represent a new, intermediate member of the memory hierarchy. My thesis is that most of these aspects can conform, with little penalty, to disk-like policies and usages.

Because MEMStores perform basically like fast disks, with only a few exceptions, they can be treated by systems as such. My dissertation will show that for most workloads, the same linear logical block abstraction that is used for disk drives is appropriate for MEMStores. The benefit of using the same abstraction is that MEMStores can be easily integrated into computer systems with little or no change.

THESIS PROPOSAL

D-SPTF: Decentralized Scheduling for Storage Bricks
Christopher Lumb, ECE
August 19, 2003

Distributed Shortest-Positioning Time First (D-SPTF) is a request distribution protocol for decentralized systems of storage servers. D-SPTF exploits high-speed interconnects to dynamically select which server, among those with a replica, should service each read request. In doing so, it simultaneously balances load, exploits the aggregate cache capacity, and reduces positioning times for cache misses. For network latencies of up to 0.5 ms, D-SPTF performs as well as would a hypothetical centralized system with the same collection of CPU, cache, and disk resources. Compared to existing decentralized approaches, such as hash-based request distribution, D-SPTF achieves up to 50% higher throughput and adapts more cleanly to heterogeneous server capabilities.

THESIS PROPOSAL

Autonomous Spatio-Temporal Data Mining
Spiros Papadimitriou, SCS
May 5, 2003

The goal of data mining is to facilitate the extraction of useful information from large collections of data. Thus, eliminating the requirement of user intervention is essential. We propose to develop fast tools for spatio-temporal data mining towards that goal. We have completed work in the area of spatio-temporal data mining and, in particular, outlier detection and time series modeling. These provide sufficient evidence that we can improve upon previous techniques.

THESIS PROPOSAL

Prefetching and Locality Optimizations for Database Memory Hierarchy Performance
Shimin Chen, SCS
May 2, 2003

Database performance studies have been traditionally focused on I/O performance. Recently, researchers have shown that, on traditional disk-oriented databases, roughly 50% or more of the execution time in memory is wasted due to cache misses. Therefore, to exploit the full power of modern computer systems requires optimizing both cache and disk performance in the memory hierarchy, which together

Continued on page 20

RECENT PUBLICATIONS
Continued from page 17

the protocol scales well with the number of failures tolerated, and that it outperforms a highly-tuned instance of Byzantine-tolerant state machine replication.

Robustness Hinting for Improving End-to-End Dependability

Bigrigg

Second Workshop on Evaluating and Architecting System Dependability (EASY). In conjunction with ASPLOS-X. Sunday, 6 October 2002, San Jose, California, U.S.A.

File systems make unreasonable attempts to provide data to the point that they will block an application instead of passing the error on to the application to handle. Transient problems such as network congestion or outages and heavily loaded systems or denial of service attacks can lead to failure-like situations. Alternative mechanisms have been developed for the file system to trade performance for robustness in an attempt to always guarantee full availability of data. These mechanisms may not be necessary, as the application programmer may have already accounted for such situations. By hinting to the file system the application’s ability to handle errors, it is possible for the file system to make better resource allocation decisions and improve end-to-end dependability.

Data Staging on Untrusted Surrogates

Flinn, Sinnamohideen, Tolia & Satyanarayanan

Proceedings 2nd USENIX Conference on File and Storage Technologies (FAST03), Mar 31-Apr 2, 2003, San Francisco, CA.

We show how untrusted computers can be used to facilitate secure mobile data access. We discuss a novel architecture, data staging, that improves the performance of distributed file systems running on small, storage-limited pervasive computing devices. Data staging opportunistically prefetches files and caches them on nearby surrogate machines. Surrogates are untrusted and unmanaged: we use end-to-end encryption and secure hashes to provide privacy and authenticity of data and have designed our system so that surrogates are as reliable and easy to manage as possible. Our results show that data staging reduces average file operation latency for interactive applications running on the Compaq iPAQ hand-held by up to 54%.

Give PDL storage systems researchers snow and a plastic chair and see what happens!
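
The Data Staging abstract above says that secure hashes protect the authenticity of files cached on untrusted surrogates. The sketch below illustrates only that hash check, under stated assumptions: the names `UntrustedSurrogate`, `read_file`, `trusted_hashes`, and `fetch_from_server` are hypothetical, SHA-256 stands in for whatever hash the real system uses, and the end-to-end encryption step is omitted. It is not the authors' implementation.

```python
import hashlib
import hmac


def secure_hash(data):
    """Hex SHA-256 digest; an assumed stand-in for the system's secure hash."""
    return hashlib.sha256(data).hexdigest()


class UntrustedSurrogate:
    """Hypothetical nearby cache that may return stale or tampered bytes."""

    def __init__(self):
        self._cache = {}

    def stage(self, name, data):
        self._cache[name] = data

    def fetch(self, name):
        return self._cache.get(name)


def read_file(name, surrogate, trusted_hashes, fetch_from_server):
    """Use the surrogate's copy only if it matches the trusted hash.

    trusted_hashes is assumed to come from the trusted file server, not
    from the surrogate. On a miss or a mismatch, fall back to the server.
    """
    staged = surrogate.fetch(name)
    if staged is not None and hmac.compare_digest(
        secure_hash(staged), trusted_hashes[name]
    ):
        return staged
    return fetch_from_server(name)
```

In this sketch a tampered or missing surrogate copy costs one extra trip to the server but never reaches the application, which is the property the hash check is meant to provide.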

YEAR IN REVIEW
Continued from page 4

  Interaction and Security Systems in Fort Lauderdale, FL.

March 2003
  Christos Faloutsos presented a tutorial on “Data Mining the Internet” at INFOCOM, San Francisco, CA.

January 2003
  Over the past term, several visitors have contributed to our Storage Systems course, including: Dave Anderson, Seagate; Steve Kleiman, NetApp; Ric Wheeler, EMC; Harald Skardal, NetApp; Jim Hughes, StorageTek; Richie Lary, Independent Consultant; and Mark Carlson, Sun.
  Stavros Harizopoulos presented “A Case for Staged Database Systems” at the First International Conference on Innovative Data Systems Research (CIDR), in Asilomar, CA.

December 2002
  Greg Ganger chaired the session on Decentralized Storage Systems at OSDI in Boston, MA.
  Timmy Ganger gave a guest lecture in 15-712 (Advanced OS and Distributed Systems).
  Ted Wong presented “Verifiable Secret Redistribution for Archive Systems” at the First International IEEE Security in Storage Workshop.

November 2002
  SDI Speaker: Richard Golding, then of Panasas, Inc., spoke on “The Palladio Project: A Survivable, Scalable Distributed Storage System.”
  DB Seminar Speaker: C. Mohan of IBM on “Future directions in Data Mining: Streams, Networks, Self-similarity and Power Laws.”
  Christos Faloutsos gave the keynote talk at CIKM in McLean, VA and was also an invited speaker at the NSF NGDM Workshop in Baltimore, MD and the N.A.S. Workshop in Washington, DC.

October 2002
  SDI Speaker: John Wilkes of HP Labs on “Travelling to Rome: QoS Specifications for Automated Storage System Management.”

Continued from page 18

have not been well studied for database systems before. My thesis is that prefetching and locality optimizations can effectively improve both cache and disk performance of database systems and that differences between the cache-to-memory and the memory-to-disk gaps play a significant role in the design and choice of specific prefetching and locality optimization techniques.

To validate my thesis, I revisit two important classes of database algorithms, B+tree index algorithms and join algorithms. In my preliminary work, I have exploited cache prefetching to improve search and range scan operations of main memory B+trees. I have studied fractal prefetching B+trees, which are a new type of B+tree that optimizes both cache and disk performance. In fractal prefetching B+trees, smaller cache-optimized trees are embedded in disk pages to improve data locality for index search. Both cache prefetching and I/O prefetching are used to improve performance. I have worked on improving hash join cache performance by exploiting inter-tuple parallelism through prefetching. For my proposed future work, I will be focusing on utilizing history information to improve join performance. Since updates are relatively infrequent compared to joins in DSS and OLAP environments, it is possible to use history information about matching tuples to guide and improve future join performance. In contrast to previous studies with join indices, I want to analyze the history information to identify data locality in the joining relations and then improve join performance by exploiting the data locality and using prefetching.

THESIS PROPOSAL

Prototyping Without Prototyping: Evaluating hypothetical storage components in real systems
John Linwood Griffin, ECE
August 1, 2003

For my dissertation research I am continuing our investigation into the utility and capabilities of timing-accurate storage emulation (TASE). Our previous work demonstrates that TASE is feasible for evaluating both evolutionary changes (faster platter speeds; modified firmware algorithms) and revolutionary changes (MEMS-based technology) to storage devices.

Several interesting questions remain before this new storage evaluation technique can be fully utilized by researchers and developers. For example, an emulator may have to measure and deal with variable externally-induced timing errors during an experiment. How can an evaluator rest assured that the emulator is correctly compensating for these errors? As another example, an emulator may need to keep data in RAM in order to provide per-request data before each request completes. How can an emulator meet such deadlines when the experimental working set is larger than the emulator's RAM?
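
The timing-accurate storage emulation proposal above asks how an emulator can compensate for externally-induced timing errors. One simple way to picture the problem is sketched below, assuming a device model that returns a service time per request: the emulator completes each request at its arrival time plus the modeled service time, subtracts any overhead that has already elapsed, and records how late it was when it could not compensate. The class name, the `service_time_model` callable, and the injectable `clock` and `sleep` parameters are all hypothetical; this is not the TASE implementation.

```python
import time


class TimingAccurateEmulator:
    """Toy emulator: complete each request at arrival + modeled service time."""

    def __init__(self, service_time_model, clock=time.monotonic, sleep=time.sleep):
        # service_time_model: hypothetical device model, request -> seconds
        self.service_time_model = service_time_model
        self.clock = clock
        self.sleep = sleep
        self.timing_errors = []  # seconds late per request (0.0 when on time)

    def handle(self, request, arrival_time):
        """Delay until the modeled completion time and log any lateness."""
        deadline = arrival_time + self.service_time_model(request)
        remaining = deadline - self.clock()
        if remaining > 0:
            # Overhead already spent is absorbed by waiting less.
            self.sleep(remaining)
        lateness = max(0.0, self.clock() - deadline)
        self.timing_errors.append(lateness)
        return deadline
```

The recorded `timing_errors` give an evaluator something concrete to inspect after an experiment: if they stay near zero, the compensation held; if not, the affected requests can be identified.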

Continued from page 12

Ursa Minor and Ursa Major

In order to effectively explore how our ideas will simplify storage administration, it is essential that operational systems be built and deployed. The Parallel Data Lab has been developing technologies relevant to the self-* storage architecture for several years, allowing us to build a prototype relatively quickly. Our initial focus will be on implementing the data protection aspects, embedding instrumentation, and enabling experimentation with performance and diagnosis.

Our first prototype, named Ursa Minor, will be approximately 15 TB spread over 45 small-scale storage bricks. The main goal of this first prototype is rapid (internal) deployment to learn lessons for the design of a second, larger-scale instantiation of self-* storage. The second instantiation, Ursa Major, will be a large-scale (~1 PB) storage constellation. Its data storage capacity will be available to research groups around Carnegie Mellon (e.g. groups involved in data mining and scientific visualization) who rely on large quantities of storage for their work. We are convinced that such deployment and maintenance is necessary to evaluate self-* storage’s ability to simplify administration for system scales and workload mixes that traditionally present difficulties. As well, our use of low-cost hardware and immature software will push the boundaries of fault-tolerance and automated recovery mechanisms, which are critical for storage infrastructures. From the perspective of their users, our early self-* constellations will look like really big, really fast, really reliable file servers (initially NFS version 3). The decision to hide behind a standard file server interface was made to reduce software version complexities and user-visible changes: user machines can be unmodified, and all bug fixes and experiments can happen transparently behind a standard file server interface. The resulting architecture is illustrated in Figure 2. Direct, untrusted client access will be added over time.

For more information, please see the Self-* Storage project page at

