

									        PhD Thesis Proposal:
System Design for Chip Multiprocessor

                 Bin Ren

          Computer Laboratory
         University of Cambridge

                July 2004

1   Introduction
    1.1 Terminology
    1.2 Thesis Summary

2   Background
    2.1 Shift to CMP/MT
    2.2 Concurrent Programming Model
    2.3 Characteristics of Server Applications
    2.4 Operating System Scheduler

3   Flow Network Architecture
    3.1 Related Work
        3.1.1 Click Packet Router
        3.1.2 StagedServer
        3.1.3 SEDA
        3.1.4 Summary
    3.2 Flow Network Architecture
    3.3 FNA Scheduling Policy
        3.3.1 Overview
        3.3.2 Flow Network Problems
        3.3.3 Dynamic Profiling
        3.3.4 Explicit Flow Scheduling
    3.4 Implementation and Evaluation
        3.4.1 FNA and VCP
        3.4.2 FNA Database
        3.4.3 FNA Web Server
    3.5 Other FNA Topics

4   Road Map
    4.1 Work Plan
    4.2 Conference Calendar

List of Figures
  1   Single-core CPU vs. CMP/MT CPU: On the left-side is a single-core
      CPU; on the right-side is a dual-core dual-thread CMP/MT CPU in
      which a core is represented by an EE and a hardware thread is
      represented by an AS. (EE – execution engine; AS – architectural
      state; L1I – level 1 instruction cache; L1D – level 1 data cache.)
  2   Multi-threaded Server Design: Each incoming request is dispatched
      to a thread, which performs the entire processing and generates the
      response.
  3   Event-driven Server Design: A main thread acts as the event
      scheduler. It fetches events generated from the network and uses
      them to drive the execution of many finite state machines. Each
      FSM represents a request and may generate new events. The design
      of the event scheduler is the most complex task.
  4   FNA Web Server Design: This is a simple web server using FNA
      architecture. The application is composed of stages connected by
      queues. Edges represent the path of event flows.
  5   A FNA Stage: A stage consists of an incoming event queue, a flow
      scheduler, an event handler and several threads either from a local
      thread pool or a global thread pool. Each event belongs to a flow
      and references the corresponding flow context.

1     Introduction

1.1   Terminology

This document makes frequent use of a number of technical terms which
I define and discuss here for clarity and simplicity of reference.
    • Chip Multi-Processing/Chip Multi-Processor (CMP)
      A CMP processor has multiple cores on one silicon chip. A core is
      a full-featured processor, possibly with multiple hardware threads,
      simpler design and lower power consumption. Cores usually share
      level 2/3 caches.
    • Multi-Threading (MT)
      The technology of integrating multiple hardware threads into one
      processor or core. These threads appear as logical processors that
      each has a copy of architectural state (e.g. general purpose registers)
      but share one execution engine and the L1 and L2 caches. The exe-
      cution engine can either switch between hardware threads or run all
      of them simultaneously.
    • Simultaneous Multi-Threading (SMT)
      SMT is a particular style of MT technology, in which the execution
      engine runs all the hardware threads simultaneously.
    • Hyper-threading (HT)
      HT is Intel’s implementation of SMT: two hardware threads in the
      latest Pentium 4 and Xeon processors.
    • Chip Multi-Threading (CMT)
      CMT is Sun Microsystems’ word for the combination of CMP and
      MT technologies for so-called Throughput Computing. Sun plans to
      ship Niagara in 2006, a 90nm chip with 8 cores, each of which has 4
      hardware threads.
    • Instruction Level Parallelism (ILP)
      ILP is a measure of how many of the instructions in a computer pro-
      gram can be executed at once. Processors that can execute multiple
      instructions in parallel are referred to as super-scalar processors. A
      super-scalar CPU architecture implements a form of parallelism on

      a single chip, thereby allowing the system as a whole to run much
      faster than it would otherwise be able to at a given clock speed.
   • Thread Level Parallelism (TLP)
      TLP is a measure of how many threads of a multi-threaded computer
      program can be executed in parallel. A CMP implements this form
      of parallelism on a single silicon chip.
   • Virtual Channel Processor (VCP)
      VCPs are virtual machines that encapsulate the OS I/O subsystem,
      perform I/O operations on behalf of an OS, and export message-
      passing based virtual I/O interfaces to the rest of the kernel [9]. A
      VCP may contain one or more device drivers talking directly to hard-
      ware or via interfaces provided by a Virtual Machine Monitor (VMM).

1.2   Thesis Summary

My thesis is that existing concurrent programming models and OS sched-
ulers cannot make full use of the potential computing power of the new
generation of CMP/MT processors. I propose Flow Network Architecture
(FNA), a new programming model and system architecture for highly con-
current server applications to achieve optimal performance on CMP/MT
processors.
FNA server applications are constructed as a network of event-driven stages
connected by explicit queues. Events flow through several stages. Each
flow represents a client request.
The structural information of applications is provided to the FNA OS
scheduler. Together with information obtained from dynamic runtime
profiling, the FNA OS scheduler uses flow network algorithms to make
optimal decisions. FNA is intended to provide high parallelism with re-
duced resource contention on CMP/MT processors.

[Figure 1 diagram]

Figure 1: Single-core CPU vs. CMP/MT CPU: On the left-side is a single-
core CPU; on the right-side is a dual-core dual-thread CMP/MT CPU in which a
core is represented by an EE and a hardware thread is represented by an AS. (EE
– execution engine; AS – architectural state; L1I – level 1 instruction cache; L1D
– level 1 data cache.)

2     Background

2.1   Shift to CMP/MT

For CPU design, a major movement in the 1980s was towards RISC in-
struction sets, which made it simpler and more efficient to design CPU
cores. This is an example of the KISS principle ("Keep It Simple, Stupid"),
though in fact not all RISC designs were that simple.
During the 1990s the main focus was on boosting instruction-level paral-
lelism (ILP) and clock rates of single-core processors. As a result, proces-
sors have become far more complex. However, there are various reasons
why performance doesn’t scale well with these techniques.
Recent years have witnessed a shift of focus to exploiting thread-level par-
allelism (TLP) with techniques like CMP and MT. CMP/MT scales the per-
formance of multi-threaded applications (or multiple running processes/programs)
through the integration of multiple cores onto a single silicon die and
the execution of instruction streams from multiple hardware threads on
each core. Implementations of these technologies exist in the market to-
day in the form of Intel Pentium 4 (Hyperthreading/SMT) as well as IBM
POWER4 and Sun MAJC-5200 (dual-core/CMP), to name a few.
Figure 1 shows the architectural differences of a single-core processor and
a dual-core dual-thread CMP/MT processor.

The following are the major difficulties encountered in efforts to further
increase the performance of single-core processors, and how multi-core
processors overcome them.
   • Memory Stall Latency
     The speed of CPUs has been increasing at a much faster pace than
     main memory. When data or instructions are being fetched from
     memory, the CPU is stalled, wasting time that could have been used to
     execute hundreds of instructions. Memory stall latency becomes a
     major part of total execution time. Up to now, two methods are cho-
     sen to reduce the latency: larger on-chip caches and exploitation of
     ILP. However, larger caches are not only much more expensive but
     also inevitably slower. ILP comes with substantial design and imple-
     mentation complexity but brings limited performance improvement.
     By comparison, CMP/MT exploits TLP to cover latency. Each core
     has multiple hardware threads sharing one execution engine. When-
     ever a thread is stalled by memory fetch, the core switches to the
     execution of another thread. As a result, for the entire core to idle,
     all threads must have overlapping stall times. The probability of this
     happening is very low given sufficient parallelism across multiple
     threads.
   • Branch Prediction
     Single-core CPUs exploit ILP and have long pipelines. There is a
     limit to how accurate branch predictions can be. When a branch
     mis-prediction occurs, the entire pipeline has to be flushed and new
     instructions are fetched. Thus, the longer the pipeline, the more severe
     the penalty for mis-prediction. Moreover, the overhead of a long
     pipeline itself is becoming more significant than the processing in
     the stages themselves.
     By contrast, CMP/MT has one pipeline in each core shared by mul-
     tiple hardware threads. When a branch mis-prediction occurs in one
     thread, instructions of another thread are executed while those of the
     previous thread are fetched. Latency is covered by TLP.
   • Power Consumption
     As more and more transistors are put into one huge core, power
     consumption and heat dissipation increase exponentially. Power puts
     a practical limit on how reliable and how fast CPUs can get, as well
     as the number of CPUs that can be packed in a unit space.

       CMP/MT proves to be much more efficient and effective at using
       the transistor and power budgets. CMP/MT can deliver better per-
       formance while consuming the same power, or consume less power
       while delivering the same performance.
   • Die Size
      The speed of light imposes a physical limit on the speed of a sin-
      gle CPU due to propagation delays. The clock signal can’t cross a
      20mm die at current GHz rates. Signal latency is harder to improve
      than transistor speed, which is why CPUs are becoming increasingly
      “wire limited”. Reducing the size to reduce signal latency meets with
      further technological problems as transistor size approaches molec-
      ular size limitations. Therefore, we either choose multi-core or Non-
      Uniform Cache Access.
       CMP/MT cores can be made simple and small. Performance can be
       greatly increased by adding additional cores and hardware threads
       to the chip with a marginal increase in die size. For example, in In-
       tel’s Hyperthreading technology, adding another hardware thread to
       the Pentium 4/Xeon only entails duplicating the architectural state,
       a 5 percent increase in die size, but yields a 30 percent increase in
       performance on average.
   • Complexity
      What’s easily forgotten is the formidable increase in difficulty of de-
      sign, debugging and verification of chips as more transistors are in-
      tegrated into single-core CPUs. With increased complexity, it’s hard
      for designers to reason about causes and effects, for testers to find
      out deeply hidden subtle bugs, and for verifiers to prove correctness
      and reliability. Even a bug fix can bring unexpected side-effects.
      By comparison, CMP is almost a linear integration of easy to de-
      sign, verify and debug cores. Complexity doesn’t scale exponentially
      when more cores and more threads are added.

2.2   Concurrent Programming Model

SMP workstations and servers have been quite common. For demand-
ing processing, one single processor just isn’t fast enough, and by spread-
ing processing over multiple processors, performance can be improved.

Figure 2: Multi-threaded Server Design: Each incoming request is dis-
patched to a thread, which performs the entire processing and generates the
response.

Multi-threading is the most popular concurrent programming model. It
both alleviates the complexity of development for programmers and
provides high performance on SMPs. Figure 2 shows the design of a typi-
cal multi-threaded server application.
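The thread-per-request design of Figure 2 can be sketched as follows (a minimal illustrative sketch: the requests are in-memory strings rather than network connections, and all names are hypothetical):

```python
# Minimal sketch of the thread-per-request design: each incoming
# request is handed to its own worker thread, which performs the
# entire processing and produces the response.
import threading
import queue

responses = queue.Queue()

def handle(request):
    # Stand-in for the full request-processing path.
    responses.put(f"response to {request}")

requests = ["req-%d" % i for i in range(4)]
workers = [threading.Thread(target=handle, args=(r,)) for r in requests]
for w in workers:
    w.start()
for w in workers:
    w.join()

results = sorted(responses.get() for _ in requests)
print(results)
```

Each request gets a full thread of its own, which is simple to program but, as discussed below, can cause cache thrashing when many threads with conflicting worksets share a CMP/MT core.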
In fact, there are very few demanding applications that cannot be made
multi-threaded. Some haven’t been because of the extra development
costs, and some because of the hardware available on the market. For exam-
ple, consumer 2D and 3D graphics (e.g. games) are mostly single threaded,
or have just one thread for the main processing whilst high-end graphics
systems from SGI have hundreds of processors. 3D processing software
is often highly parallelizable, and could be applied to consumer graphics
too, except most consumer systems today are uniprocessors so developers
have little incentive.
By bringing multi-threading/multi-processing capabilities to individual
processors, CMP/MT promotes the popularity of multi-threaded applica-
tions.
But nothing comes free. There are many reports in the Linux and FreeBSD
communities that enabling Hyperthreading in Pentium 4/Xeon actually
hurts performance. Among other reasons is that the multi-threading pro-
gramming model is not well suited to the cache-sharing architecture of
CMP/MT processors. CMP/MT processors improve overall performance
by covering the stall of a particular thread with execution of other threads,
but at the cost of increased resource contention. In multi-threaded appli-
cations, different threads may have conflicting worksets resulting in cache
thrashing. This problem is particularly serious in server applications such

Figure 3: Event-driven Server Design: A main thread acts as the event sched-
uler. It fetches events generated from the network and uses them to drive the
execution of many finite state machines. Each FSM represents a request and may
generate new events. The design of the event scheduler is the most complex task.

as web servers and databases.
The event-driven model is an alternative concurrent programming model.
It has better scalability but does suffer from greater development complex-
ity. In particular, it requires an application-specific event scheduler which
is hard to design properly without knowledge of the actual run-time work-
loads. Figure 3 shows the design of a typical event-driven server applica-
tion.
Section 3.2 proposes a new programming model and software architec-
ture, called flow network architecture (FNA). It combines the multi-threading
and event-driven models, and aims to maximize parallelism and minimize
resource contention on CMP/MT processors without sacrificing ease of
programming.
2.3   Characteristics of Server Applications

There has been a phenomenal increase in server applications, such as
web servers (e.g. Apache), runtime environments (e.g. J2EE, .NET) and
databases (e.g. MySQL). With the arrival of CMP/MT that exploits TLP, it
becomes necessary to well understand the runtime characteristics of server
applications so that they can be designed to use new hardware features
more effectively.
In the following, characteristics of server applications are compared with

those of single-threaded applications represented by the SPECint bench-
mark [8].
   • Operating System Component
     In SPECint, usually fewer than 1 percent of cycles are spent in the
     OS. On server benchmarks, the OS almost always consumes more than
     10 percent of cycles, with 30 percent being typical and over 50 percent
     on network processing intensive benchmarks. There are several rea-
     sons. Server applications have a lot of file I/O and disk accesses, and
     when they provide services over networks, sending and receiving
     data involve expensive processing in the OS. They are also highly multi-
     threaded. With a large number of threads and processes, scheduling
     and synchronization constitute major tasks of the OS.
   • ILP Profile
     Statistics show that server applications have more difficulty in ex-
     ploiting ILP than SPECint. This difficulty also lowers their instructions
     per cycle (IPC).
   • Cache and TLB
     Instruction worksets of SPECint applications are small enough to fit
     in the L1 instruction cache. By contrast, server applications have
     much more complicated and larger code, resulting in much higher
     rates of L1 instruction cache and TLB misses. There are several reasons.
     Server applications make heavy use of libraries. The dynamic load-
     ing and invocation of libraries seriously impact the instruction foot-
     print and instruction access patterns of server applications. Since each
     library is loaded to a different physical memory page, calling a func-
     tion in another library causes the execution flow to transfer to an-
     other memory page, resulting in poor instruction TLB performance.
     Some server applications use dynamically generated code, such as
     JIT compilers for Java bytecode. Dynamically generated code for
     consecutively invoked methods may not be located in contiguous
     address space.
     Server applications frequently invoke system calls, resulting in a large
     number of switches between user mode and kernel mode.
     Server applications make extensive use of threads and processes.

  The frequent context switches between hundreds of threads and pro-
  cesses can easily trigger cache thrashing.
  L1 data cache miss rates are not very different for SPECint and server
  applications. As for the L2 cache, server applications have very large
  data footprints that are difficult to capture even in large L2 caches.
• Branch Behavior
  Different from SPECint single-threaded applications, server appli-
  cations make heavy use of virtual function calls, dynamic libraries,
  complex data structures and locking primitives. With larger and
  more complicated code, server applications exhibit higher BTB (Branch
  Target Buffer) miss rates. If a BTB miss occurs, the static branch pre-
  dictor is used, which is not as accurate as the multi-level dynamic
  branch predictor. Even if the static branch predictor makes correct
  predictions, the CPU still suffers latency because the static predictor
  sits much deeper in the pipeline than the dynamic one.
• Instruction Per Cycle
  IPC is a good indicator of CPU usage efficiency. There are two types
  of stalls that are responsible for low IPC: resource stalls and I-stream
  stalls.
  Resource stalls involve the conditions where register renaming buffer
  entries, reorder buffer entries, memory buffer entries or execution
  units are full. In addition, serializing instructions (e.g. CPUID), in-
  terrupts and privilege level changes spend a considerable number of
  cycles in execution, forcing the decoder to wait. Stalls due to data
  cache misses are not explicitly included in resource stalls. But other
  resources can be oversubscribed due to a long data cache miss. In
  fact, L2 cache misses are responsible for most resource stalls, fol-
  lowed by L1 data cache misses. Likewise, I-stream stalls are mostly
  caused by L1 instruction cache misses.
  As already indicated above, larger data and instruction footprints in
  server applications increase cache misses. As a result, server appli-
  cations have a higher percentage of cycles in which no instructions
  are decoded, dispatched or retired, thus lower IPC.
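The link between miss rates and IPC can be made concrete with the textbook CPI decomposition (an illustrative model with made-up numbers, not measurements from the cited studies): effective CPI equals the base CPI plus misses-per-instruction times the miss penalty.

```python
# Textbook CPI model: cache misses add stall cycles per instruction,
# lowering IPC. All numbers below are illustrative, not measurements.

def effective_ipc(base_cpi, misses_per_instr, miss_penalty_cycles):
    cpi = base_cpi + misses_per_instr * miss_penalty_cycles
    return 1.0 / cpi

# SPECint-like workload: small footprint, few misses.
spec = effective_ipc(1.0, 0.002, 200)
# Server-like workload: larger footprint, ten times the miss rate.
server = effective_ipc(1.0, 0.02, 200)
print(round(spec, 2), round(server, 2))  # 0.71 0.2
```

Even with identical base CPI, the larger instruction and data footprints of server applications translate directly into much lower IPC.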

2.4   Operating System Scheduler

A lot of research has been done on SMP operating systems, with two major
focuses: locking and thread scheduling.
For locking on SMPs, there are several locking primitives such as semaphores
and spin-locks. Recent research on lock-free data structures reveals that
conventional locks generate a lot of memory bus transactions due to cache
coherency protocols, resulting in very poor scalability [3].
With CMP/MT processors, the situation changes dramatically. As two
cores or two hardware threads share caches, two processes running on
them can compete for locks efficiently without generating memory bus
transactions. Lock compatibility is defined as the extent of lock contention
between a pair of processes. An OS scheduler that is fully aware of CMP/MT
hardware architecture and process lock compatibility can optimize schedul-
ing decisions [6].
For thread scheduling, much research has been done on high performance
user-level thread packages [11] and kernel support for thread scheduling
[1]. With CMP/MT processors, the OS scheduler should consider resource
contention between threads. Threads that share worksets should be sched-
uled on the same core; threads that have conflicting worksets should be
scheduled on different cores.
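The workset placement principle just stated can be sketched as a greedy heuristic (an illustrative sketch, not the scheduler proposed in this work; thread names and worksets are hypothetical):

```python
# Workset-aware placement sketch: threads whose worksets overlap are
# placed on the same core to share cache; threads with disjoint
# worksets go to other cores to avoid cache thrashing.

def overlap(a, b):
    return len(a & b)

def place(threads, num_cores):
    """threads: dict name -> set of workset pages. Greedily assign
    each thread to the core whose accumulated workset it overlaps
    most, breaking ties towards the least-loaded core."""
    cores = [{"names": [], "pages": set()} for _ in range(num_cores)]
    for name, pages in threads.items():
        best = max(cores, key=lambda c: (overlap(c["pages"], pages),
                                         -len(c["names"])))
        best["names"].append(name)
        best["pages"] |= pages
    return [c["names"] for c in cores]

threads = {
    "net-rx": {1, 2, 3},
    "net-tx": {2, 3, 4},   # shares pages with net-rx
    "db-scan": {9, 10},    # disjoint workset
}
print(place(threads, 2))  # [['net-rx', 'net-tx'], ['db-scan']]
```

The two network threads end up co-scheduled on one core while the database scan, whose workset conflicts with theirs, is isolated on the other.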
Section 3.3 describes the design of an OS scheduler based on the FNA
model. Using the static structural information of FNA applications and
the dynamically profiled lock/resource compatibility of their stages, the
OS scheduler applies flow network algorithms to make optimal scheduling
decisions.

3     Flow Network Architecture

3.1   Related Work

There are three past research projects that are closely related to FNA: Click
modular packet router, SEDA and StagedServer.

3.1.1    Click Packet Router

Click is a modular, highly-extensible packet router [5]. It has two main
research facets: system architecture and configuration language.
   • System Architecture
        Click is made up of components called elements. Each element is a
        software component performing simple routing computations. There
     are several element classes, one of which is the queue. Each element can
        have any number of input and output ports. A connection passes from
        an output port on one element to an input port on another. Connec-
        tions are the main mechanism for linking elements together, and are
        represented as pointers to elements. Passing a packet along a con-
        nection is implemented by a single virtual function call.
     Element, as the single component abstraction of Click, significantly
     influences how Click users think about router design. They tend to
     create modular designs because the element abstraction encourages it.
   • Configuration Language
        Click router design consists of two phases. In the first phase, users
        write element classes which are configuration independent. In the
        second phase, users design a particular router configuration by choos-
        ing a set of elements and the connections between them. The router
        configurations are written in a wholly declarative language called
        Click. It enforces hard separation between the roles of elements and
        configurations, leading to better modular design.
   • CPU Scheduling
        Click has a relatively simple CPU scheduling policy. It uses a single
        thread to call directly through multiple elements to reduce latency.
        Click assumes that elements have bounded processing time, leading
     to a static determination of resource management policy. An SMP ex-
     tension to Click uses a thread for each processor and performs load-
        balancing across threads [2].
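Click's element/connection model can be sketched as follows (an illustrative sketch with hypothetical element classes, not Click's actual C++ API): a connection is just a reference to the downstream element, and passing a packet along it is a single method call.

```python
# Sketch of Click-style elements: small components linked by
# connections; pushing a packet along a connection is one call.

class Element:
    def __init__(self):
        self.output = None          # connection: pointer to next element

    def connect(self, downstream):
        self.output = downstream

    def push(self, packet):
        raise NotImplementedError

class Counter(Element):
    def __init__(self):
        super().__init__()
        self.count = 0

    def push(self, packet):
        self.count += 1
        if self.output:
            self.output.push(packet)   # the call along the connection

class Discard(Element):
    def __init__(self):
        super().__init__()
        self.dropped = []

    def push(self, packet):
        self.dropped.append(packet)

# A two-element "router" configuration: Counter -> Discard.
counter, discard = Counter(), Discard()
counter.connect(discard)
for pkt in ("p1", "p2"):
    counter.push(pkt)
print(counter.count, discard.dropped)  # 2 ['p1', 'p2']
```

The configuration phase (choosing elements and wiring connections) is cleanly separated from the element classes themselves, mirroring Click's two-phase design.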

3.1.2    StagedServer

StagedServer introduces a new scheduling policy Cohort Scheduling and a
new programming model Staged Computation [7].
   • Cohort Scheduling
         Cohort Scheduling defers processing a request until a cohort of com-
         putations arrives at a similar point in processing, then executes the
         cohort consecutively on a processor. Cohort Scheduling increases
         opportunities for code and data reuse by reducing the interleaving
         of unrelated computations that causes cache conflicts and evicts live
         cache lines.
   • Staged Computation
        Staged Computation is a programming model that replaces threads
        with stages as the construct underlying concurrent and parallel pro-
        grams. In this model, a program is constructed from a collection of
        stages, each of which consists of a group of exported asynchronous
         operations and private data. Moreover, a stage has scheduling au-
        tonomy, which enables it to control the order and concurrency of the
        execution of its operations.
        Staged Computation supports a variety of programming styles, in-
        cluding software pipelining, event-driven state machines, bi-directional
        pipelines and fork-join parallelism.
   • CPU Scheduling
        StagedServer currently uses a simple wavefront algorithm to sup-
        ply processors to stages. A programmer specifies an ordering of the
        stages in an application. In wavefront scheduling, processors in-
        dependently alternate forward and backward traversals of this list
        of stages. At each stage, a processor executes pending operations.
        Upon finishing, the processor proceeds to the next stage. If the pro-
        cessor repeatedly finds no work, it sleeps for exponentially increas-
        ing periods of time.
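The wavefront traversal described above can be sketched for a single processor (an illustrative sketch of the idea, not StagedServer's implementation; sleeping on empty passes is omitted):

```python
# Sketch of wavefront scheduling: a processor alternates forward and
# backward traversals of the ordered stage list, draining each
# stage's pending operations as it passes.

def wavefront(stages, passes):
    """stages: ordered list of per-stage work queues (lists of
    operations). Returns the order in which operations ran."""
    executed = []
    order = list(range(len(stages)))
    for p in range(passes):
        sweep = order if p % 2 == 0 else list(reversed(order))
        for i in sweep:
            while stages[i]:                 # execute pending operations
                executed.append(stages[i].pop(0))
    return executed

stages = [["a1", "a2"], ["b1"], ["c1"]]
print(wavefront(stages, 1))  # ['a1', 'a2', 'b1', 'c1']
```

Draining one stage completely before moving on is what gives cohorts their cache locality: all of a stage's operations run back to back on the same processor.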

3.1.3    SEDA

SEDA stands for Staged Event-Driven Architecture. It’s a new software ar-
chitecture for handling massive concurrency and load conditioning de-

mands of busy Internet services. The SEDA design yields higher performance
than traditional service designs, while exhibiting robustness under huge
variations in load [12].
   • System Architecture
     A stage is a self-contained application component consisting of an
     event handler, an incoming event queue and a thread pool.
     Each event is a data structure representing a single client request to
     the Internet service, or other information relevant to the operation of
     the service. Events are processed in batches to improve throughput.
     A service designer implements an event handler for each stage, which
     represents the core service-processing logic for the stage. An event
     handler is simply a function that accepts a batch of events as in-
      put, processes those events, and enqueues outgoing events onto the
      queues of the next stages.
      Threads are the basic concurrency mechanism in SEDA. A small num-
     ber of threads are allocated for each stage, instead of one thread per
     request. The use of threads relaxes the constraint that all event pro-
     cessing code be non-blocking, allowing an event handler to block
     or be preempted when necessary. Threads act as implicit continu-
     ations, automatically capturing the execution state across blocking
     operations. SEDA applications create explicit continuations when
     one stage dispatches events to other stages.
   • Resource Control
     SEDA makes use of dynamic resource control. Abstractly, a con-
     troller observes runtime characteristics of the stage and adjusts allo-
     cation and scheduling parameters to meet performance targets. Ex-
     amples include automatic tuning of the number of threads executing
     within each stage, and the number of events contained within each
     batch passed to a stage’s event handler.
     SEDA applies fine-grained admission control at each stage to limit
      the rate at which events are accepted by stages. Stages must be pre-
      pared to deal with enqueue rejection.
   • CPU Scheduling
     SEDA currently relies upon the OS to schedule stages transparently
     on processors. The use of a thread pool per stage allows stages to

        run in sequence or in parallel, depending on the OS scheduler and
        thread system.

3.1.4    Summary

In summary, Click focuses on structure: how to divide packet processing
into functional components and how to configure connections between
components to form a router. StagedServer’s main contribution is to intro-
duce the concepts of Cohort Scheduling and Staged Computation. SEDA
proposes a specific form of Staged Computation and focuses on automatic
resource control and load shedding.
By contrast, FNA has several important differences from previous work:
   • All previous work focused on uniprocessors. Support for SMP was
     either ignored or treated as a simple derivation. By contrast, FNA
     is designed from the very beginning with a clear focus on CMPs,
     and introduces techniques that are most applicable in a CMP/MT
     setting.
   • FNA adds flow as an explicit entity in both processing logic and
     scheduling policy. By maintaining per-flow contexts, FNA can guar-
     antee per-flow quality of service.
   • FNA incorporates a stage-aware scheduling policy both at the user
     level and the OS level.
   • FNA schedulers use both static structural information of applica-
     tions and dynamic profiling statistics at runtime.
   • Dynamic profiling statistics not only cover time measurement such
     as run/sleep time periods, but also include extensive system events
     measured by CPU performance counters, such as cache misses.
   • FNA scheduling policies are based on flow network algorithms in-
     stead of traditional multi-level priority algorithms.
   • FNA is designed for novel multi-core multi-thread processors, but
     most FNA techniques can be applied to traditional SMPs as well.
All the details are provided in the next section.

Figure 4: FNA Web Server Design: This is a simple web server using FNA
architecture. The application is composed of stages connected by queues. Edges
represent the path of event flows.

Figure 5: A FNA Stage: A stage consists of an incoming event queue, a flow
scheduler, an event handler and several threads either from a local thread pool or a
global thread pool. Each event belongs to a flow and references the corresponding
flow context.

3.2   Flow Network Architecture

FNA is a specific form of Staged Computation. Currently, FNA uses the
same architecture as SEDA. FNA decomposes an application into a net-
work of stages representing different functional units of processing, as
depicted in Figure 4. Each stage has an input event queue, a flow scheduler
and an event handler, as depicted in Figure 5. The event handler is driven
by threads, either from a per-stage local thread pool or from a global
thread pool.
The arrival of a request generates an event. The event travels along a path
of stages, forming a flow. The path is either statically or dynamically de-
termined. As the event moves along this path, the corresponding request
is processed. Upon completion, a response is delivered and the event
ceases to exist.

A flow context contains the client request, the event(s), the path of stages
and the QoS requirement. Two flows are compatible if their contexts have
the same path and QoS requirement. Compatible flows can be aggregated
and treated alike.
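The entities just described can be sketched as plain data types. This is an illustrative sketch under stated assumptions, not FNA's actual implementation; all names (FlowContext, Stage, etc.) are assumptions.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class FlowContext:
    """Per-flow state: the client request, its path of stages and its QoS."""
    request: str
    path: tuple   # ordered names of the stages the flow's events traverse
    qos: int      # e.g. a latency class

    def compatible(self, other):
        # Two flows are compatible if their contexts share path and QoS.
        return self.path == other.path and self.qos == other.qos

@dataclass
class Event:
    flow: FlowContext   # every event references its flow context
    payload: object = None

class Stage:
    """A stage: an incoming event queue plus an event handler."""
    def __init__(self, name, handler):
        self.name = name
        self.queue = deque()   # input event queue
        self.handler = handler

    def enqueue(self, event):
        self.queue.append(event)

    def run_once(self):
        # A worker thread (local or global pool) would call this in a loop;
        # here we simply process one queued event.
        if self.queue:
            return self.handler(self.queue.popleft())
```

Because compatibility is just path-and-QoS equality, compatible flows can be batched and scheduled as one aggregate.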

3.3      FNA Scheduling Policy

3.3.1     Overview

The major task of FNA applications is to move as many events as possible
through the network of stages (i.e. high throughput) and as fast as pos-
sible (i.e. low latency). As a result, the scheduling policy can be cast as
a classic network flow problem. Many algorithms exist for solving dif-
ferent kinds of flow network problems, such as the shortest path problem,
the maximum flow problem and the minimum cost flow problem.
There are two major challenges in applying flow network algorithms to
FNA scheduling policy:
   1. To define sensible meanings and to assign suitable values for edge
      capacity, edge flow cost etc.
   2. To find the best flow network algorithm for the FNA scheduling
      scenario.

3.3.2     Flow Network Problems

   • Basics
         A directed graph can be interpreted as a flow network to answer ques-
         tions about material flows. Imagine material passing through a sys-
         tem from a source, where the material is produced, to a sink, where it
         is consumed. The source produces material at some steady rate, and
         the sink consumes the material at the same rate. Flow networks can
         be used to model liquids flowing through pipes, parts through as-
         sembly lines, current through electrical networks, information through
         communication networks, and so forth.
         Each directed edge in a flow network can be thought of as a con-
         duit for the material. Each conduit has a stated capacity, given as a
         maximum rate at which the material can flow through the conduit.

 Vertices are conduit junctions, and other than the source and sink,
 material flows through the vertices without collecting in them. In
 other words, the rate at which material enters a vertex must equal
 the rate at which it leaves the vertex.
• FNA Scenario
 In FNA applications, each stage handles events from the input queue
 and then enqueues events onto the input queue of another stage. In
 this sense, stages are connected by queues. Thus, queues in FNA
 applications are equivalent to edges in flow networks. Edge capacity
 can be defined as the threshold of input event queues, i.e. maximum
 number of events a queue can hold. This threshold is mainly used
 for load shedding. It’s very difficult to statically decide thresholds for
 all the stages, because the approach to dealing with overloading is
 situation-specific. It’s more flexible to observe the states of input
 queues of all the stages under particular overloading conditions, and
 by adjusting the thresholds of certain stages, some client requests can
 be discarded at different stages of processing.
 In flow networks, edges, not vertices, incur costs. In FNA applica-
 tions, costs are incurred by processing in stages. Therefore, the flow
 cost of an edge is defined as the cost of the stage where this edge
 originates. How to define the stage cost is one of the top questions in
 this research. It should reflect both processing cost and scheduling
 bonus:
   1. Processing Cost
      Processing cost is further divided into two types: intra-stage pro-
      cessing cost and inter-stage processing cost. Intra-stage processing
      cost is the inherent cost for processing an event in a stage. It’s
      directly related to the processing logic complexity of the stage,
      such as instruction stream, memory access pattern, data struc-
      tures and algorithms. It can be expressed in cycles per event.
      Inter-stage processing cost is the overhead of switching the ex-
      ecution from one stage to another. It involves the cache misses,
      TLB misses and pipeline flushes. The more mutually friendly
      two stages are, the lower the inter-stage processing cost. The
      more mutually hostile two stages are, the higher the inter-stage
      processing cost. It’s important to notice that inter-stage process-
      ing cost is one-way only: the cost of switching from stage A to
      stage B may not equal the cost of switching from B to A.

           For example, suppose there are N stages in a FNA application.
           There is one intra-stage processing cost and (N-1) inter-stage
           processing costs for each stage.
        2. Scheduling Bonus
           Scheduling bonus is affected by two performance objectives:
           throughput and latency. For example, a stage with a longer
           input event queue can batch more events for processing at one
           time to increase throughput, and should be scheduled sooner
           to avoid becoming a bottleneck and to decrease latency. Such
           a stage is therefore entitled to a higher scheduling bonus and
           a lower total cost.
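To make the mapping concrete, the sketch below models a stage graph as a flow network, with queue thresholds as edge capacities, and computes its maximum flow with the standard Edmonds-Karp algorithm. The stage names and capacities are hypothetical, chosen only to illustrate the modelling.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp: repeatedly augment along shortest residual paths.
    `capacity` is a dict-of-dicts giving directed edge capacities."""
    # Build a residual graph, adding zero-capacity reverse edges.
    residual = {u: dict(vs) for u, vs in capacity.items()}
    for u, vs in capacity.items():
        for v in vs:
            residual.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # BFS for an augmenting path in the residual graph.
        parent = {source: None}
        q = deque([source])
        while q and sink not in parent:
            u = q.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if sink not in parent:
            return flow
        # Find the bottleneck on the path, then push that much flow.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck

# Hypothetical web-server stage graph; capacities are queue thresholds
# (maximum events a queue can hold per scheduling round).
stages = {
    "accept": {"parse": 10},
    "parse":  {"cache": 6, "disk": 5},
    "cache":  {"reply": 8},
    "disk":   {"reply": 4},
    "reply":  {},
}
```

Here `max_flow(stages, "accept", "reply")` bounds how many events per round the network can carry end-to-end, and the saturated edges identify the bottleneck stages.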

3.3.3   Dynamic Profiling

The running of FNA applications consists of two phases: start-up phase
and stable phase, in which dynamic profiling has different objectives.
A FNA application first enters the start-up phase upon execution. In this
phase, the main objective of dynamic profiling is to determine the inter-
stage costs. The scheduler switches execution between each possible pair
of stages. Costs are calculated from statistics such as cache misses and
CPU idle time, obtained from performance counters. At this time, perfor-
mance is sub-optimal, as necessary information is being collected so that
application-specific scheduling policy can be automatically deduced and
used later.
The research focus here is to decide how to calculate inter-stage costs, and
to find the best path for traversing all pairs of stages with the best accuracy
of measurement, the shortest completion time and the least penalty on
performance.
Once the start-up phase ends, FNA applications enter the stable phase. In this
phase, the main objective of dynamic profiling is to update the intra-stage
costs and scheduling bonuses.
The research focus here is to find the “secret” formula to calculate them
out of various factors, which will be used in flow network algorithms to
produce scheduling priorities. In the face of the large number of available
statistics, machine learning and data mining techniques, instead of sub-
jective reasoning, can provide valuable insight.
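One plausible way to turn counter readings into inter-stage costs is sketched below, assuming a simple linear miss-penalty model; the penalty constants, counter names and interface are illustrative assumptions, not part of FNA itself.

```python
# Hypothetical linear cost model: the one-way cost of switching between
# two stages is estimated in cycles from performance-counter deltas
# sampled around the switch.
L2_MISS_PENALTY = 200    # assumed cycles per L2 cache miss
TLB_MISS_PENALTY = 100   # assumed cycles per TLB miss

def switch_cost(before, after):
    """Estimate the cycles lost to cache/TLB misses across one switch."""
    d_l2 = after["l2_misses"] - before["l2_misses"]
    d_tlb = after["tlb_misses"] - before["tlb_misses"]
    return d_l2 * L2_MISS_PENALTY + d_tlb * TLB_MISS_PENALTY

def cost_matrix(samples):
    """`samples` maps (from_stage, to_stage) -> (before, after) counter
    snapshots; returns a directional cost table. The costs need not be
    symmetric: cost[A][B] may differ from cost[B][A]."""
    costs = {}
    for (src, dst), (before, after) in samples.items():
        costs.setdefault(src, {})[dst] = switch_cost(before, after)
    return costs
```

During the start-up phase such a matrix would be filled in by actually traversing stage pairs; during the stable phase only the intra-stage entries and bonuses would be refreshed.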

3.3.4   Explicit Flow Scheduling

One major difference between FNA and previous staged architectures is
the treatment of flows as explicit structural and scheduling entities.
Scheduling policy is divided into two parts: OS-level stage-scheduling
policy and user-level flow-scheduling policy.
Each flow is given a scheduling priority based on its QoS requirement,
mostly in terms of latency. Flows with the same QoS requirement can be
aggregated and scheduled on a first-come-first-served basis.
A flow traverses several stages but its priority remains the same. As a
result, instead of having one different scheduler per stage, one global user-
level scheduler can make the decision for all the stages.
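The policy above (priority by QoS class, first-come-first-served within a class) can be sketched with a single heap; the class and method names here are hypothetical.

```python
import heapq
import itertools

class FlowScheduler:
    """Global user-level flow scheduler: a lower QoS number means a
    higher priority; flows with equal priority are served FCFS via a
    monotonically increasing arrival sequence number."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # arrival-order tie-breaker

    def submit(self, qos_priority, flow):
        heapq.heappush(self._heap, (qos_priority, next(self._seq), flow))

    def next_flow(self):
        # A flow's priority stays fixed as it traverses stages, so this
        # one global scheduler can decide for all stages.
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

The `(priority, seq, flow)` tuple ordering gives strict priority between QoS classes and FIFO order inside each class without a separate queue per class.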

3.4     Implementation and Evaluation

Xen VMM will be modified to use FNA scheduler to schedule privileged
driver domains (i.e. VCPs). I/O performance of the resulting system will
be compared to when the original Xen domain scheduler is used.
The latest Linux 2.6.x kernel will be modified to incorporate FNA sched-
uler. A FNA database and a FNA web server will be implemented. Per-
formance will be evaluated on the macro-scale and micro-scale, and at
the OS-level and application-level.

3.4.1   FNA and VCP

Let’s take a look at Virtual Channel Processors (VCPs) in light of the FNA
architecture.
VCPs are specialized stages that perform I/O operations, whilst the OS
kernel constitutes a generic stage performing all the other operations. VCPs
and the kernel communicate by messages which represent I/O requests.
As messages move from one VCP to another, they form I/O flows. Mes-
sages contain references to data, i.e. I/O flow contexts, which reside in
shared memory regions. An I/O flow inherits priority from its corre-
sponding process or interrupt.

FNA scheduling policy is divided into two parts: stage scheduling in
VMM and flow scheduling in VCPs. A special case is to dedicate an entire
hardware thread or core to VCPs.
I/O operations have demanding requirements for both throughput and
latency, and I/O performance is becoming a major bottleneck in overall
system performance. It’s highly desirable to harness CMP capabilities in
the most efficient and effective way, which makes FNA and VCPs a good
match for each other.

3.4.2   FNA Database

Databases are ideal candidates for FNA architecture. Database workloads
exhibit large instruction footprints and tight data dependencies that re-
duce ILP and incur data and instruction transfer delays. A multi-threading
programming model yields poor cache performance that seriously hurts
database performance. For example, preemption is oblivious to the thread’s
current execution state. Context switches that occur in the middle of a log-
ical operation evict a possibly very large working set from the cache. When
the suspended thread resumes execution, it wastes a lot of time restoring
the evicted working set [4].
Under the workloads in standard database benchmarks such as TPC-C,
performance of the FNA database is compared with MySQL database. On
the macro-scale, the major metric is the maximum number of transactions
per minute each database can sustain. On the micro-scale, important statistics
include CPI, CPU idle time and numbers of L1/L2/TLB cache misses. On
the OS-level, the focus is on the number of context switches, scheduling
overhead, and especially how well flow network algorithms can adapt
to drastically changing workloads.

3.4.3   FNA Web Server

Under the workloads in standard web server benchmarks such as SPECweb99
and upcoming SPECweb2004, performance of the FNA web server is com-
pared with those of Apache web server (representing multi-threading pro-
gramming model), Flash web server (representing event-driven program-
ming model), and Haboob web server (representing the SEDA program-
ming model).

On the macro-scale, major metrics are server throughput and response
time as the total number of client connections increases. Statistics gathered
on the micro-scale and OS-level are similar to those in databases. On the
application-level, it’s particularly interesting to see how well the flow
scheduler maintains QoS differentiation.

3.5   Other FNA Topics

FNA architecture facilitates research in several other topics:
   • Online Reconfiguration
      Online reconfiguration provides a way to extend and replace ac-
      tive application components. This enables systems to update code,
      adapt to changing workloads, pinpoint performance problems and
      perform a variety of other tasks while systems are running. On-
      line reconfiguration requires generic support for interposition and hot-
      swapping [10].
      FNA architecture provides ideal support for online reconfiguration.
      Stages can be wrapped with additional functionality or replaced with
      different implementations. Event queues decouple stages by simple
      uniform enqueue/dequeue interfaces, which provides high trans-
      parency for reconfiguration. Per-request states are stored in corre-
      sponding flow contexts, leaving stages with private processing states
      only. As a result, state transition is as simple as event queue trans-
      fer.
      Examples include profilers and debuggers that are themselves stages
      positioned around target stages.
      In short, support for online reconfiguration in FNA applications boils
      down to the support for event redirection in stages.
   • Fault Tolerance
      As stated above, flow contexts hold states for requests, and stages
      only contain private processing states. With support for stage-level
      exception, stages that encounter runtime errors can be reset to pre-
      defined states or restarted to clean states.
      Besides, the functionality of a stage can be implemented in differ-
      ent ways, e.g. different programming languages, data structures or

  algorithms. When a stage is found faulty at runtime, alternative im-
  plementations can be swapped in, increasing reliability.
• Multiple FNA Applications
  Server applications are highly specialized and are not intended to
  share the machine with other applications. It’s generally undesir-
  able for, say, a web server to run on the same machine as a database.
  But when multiple FNA applications run simultaneously, schedul-
  ing policy becomes much more complicated in the face of multiple
  networks of stages.
  One simple approach may be to treat different FNA applications like
  processes. They are chosen for scheduling based on traditional time-
  sharing algorithms. During the time slice, application-specific FNA
  scheduling policy is enforced.
• Hidden Message-Passing
  Message-passing, with simple enqueue/dequeue interfaces, seems
  to have more advantages than procedural calls in terms of devel-
  opment complexity. However, it turns out that the complexity is just
  transferred to the typing of messages. Mainstream programming lan-
  guages offer little direct support for message passing, and few devel-
  opers are familiar with the message-passing programming model. As
  a result, in systems such as MPI, message-passing mechanisms are
  implemented in libraries which expose procedural-call interfaces in-
  stead. This relieves developers of the burden of correctly assembling
  and disassembling messages. FNA applications can be implemented
  in the same way.
  It’ll be useful to design a declarative language describing message-
  passing mechanisms to be used between pairs of procedural calls.
  Configuration expressed in this language can facilitate automatic code
  generation. This method can be used to decouple traditional proce-
  dural programs to add support for online reconfiguration.
• FNA and Software Engineering
  Almost all modern software is designed in modules. There has been
  extensive software engineering research on how to divide software
  into modules, appropriate module sizes, the choice of module inter-
  faces etc. However, most modular software is still compiled into one
  big chunk of executable program in the end.

      A few forms of runtime modules exist, such as loadable OS kernel
      modules, application-specific plugins, and objects in object-oriented
      programming languages.
      Stages are generic runtime modules. FNA fits the modular design
      of software very well. It doesn’t require radical changes to existing
      software development methodology for FNA to be adopted, espe-
      cially when message-passing mechanisms can be automatically gen-
      erated.
4     Road Map

My road map consists of two parts: a plan for research work and a plan
for publication. Although it’s quite difficult to accurately predict the time
frame for system research, a clear and detailed road map can play a vital
role in self-motivation.

4.1    Work Plan

    1. Aug. 2004 - Apr. 2005
      The FNA scheduling policy is implemented in next-generation I/O,
      i.e. VCPs on Xen VMM (described in section 3.4). This is the first
      design and implementation effort. I’ll work on both VCP-specific
      and VCP-independent aspects of FNA, and carry out detailed I/O per-
      formance evaluation. This work will lay the foundation for later re-
      search efforts.
    2. May. 2005 - Aug. 2005
      A FNA database is implemented (described in section 3.4), whose
      performance is going to be compared with those of MySQL and Post-
      greSQL.
   3. Sep. 2005 - Dec. 2005
      A FNA web server is implemented (described in section 3.4), whose
      performance is going to be compared with those of Apache and Flash
      web servers.
    4. Jan. 2006 - Aug. 2006

      Based on previously written abstracts and full papers, I’ll write up
      the final dissertation. If time permits, I’m going to investigate other
      FNA-related topics, especially online reconfiguration and hidden message-
      passing language (described in section 3.5).

4.2   Conference Calendar

  1. ASPLOS-XI, Oct. 2004
      Eleventh International Conference on Architectural Support for Pro-
      gramming Languages and Operating Systems, Oct. 9-13, 2004, Park
      Plaza, Boston, Massachusetts, USA.
        • Submission
          I’ll submit an abstract to the Wild and Crazy Ideas Session
          IV. Based on my thesis proposal, I’ll describe how to make schedul-
          ing decisions for FNA-based server applications with dynamic
          profiling (by performance counters) and flow network algorithms.
        • Deadline
          Wednesday, Aug. 4, 2004, 6pm CDT
  2. OSDI’04, Dec. 2004
      Sixth Symposium on Operating Systems Design and Implementa-
      tion, Dec. 6-8, 2004, Renaissance Parc 55 Hotel San Francisco, Cali-
      fornia, USA.
        • Submission
          I’ll submit a work-in-progress report describing my work on
          applying FNA scheduling policy for next-generation I/O, i.e.
          VCPs on Xen VMM.
        • Deadline
          To be announced in Sep. 2004
  3. ISCA, Jun. 2005
      The 32nd Annual International Symposium on Computer Architec-
      ture, Jun. 4-8, 2005, Madison, Wisconsin, USA.
        • Submission

         I plan to submit a full paper describing how to use machine
         learning and data mining techniques to synthesize the right met-
         rics from various dynamic profiling statistics to define intra-
         stage and inter-stage processing costs.
       • Deadline
         Abstracts: Nov. 11, 2004 at 5:59PM PST
         Full papers: Nov. 18, 2004 at 5:59PM PST
 4. SOSP, Oct. 2005
    20th ACM Symposium on Operating Systems Principles, Oct. 23-26,
    2005, The Grand Hotel, Brighton, United Kingdom.
       • Submission
         By this time, I should be able to write a full paper on applying
         FNA scheduling policy for VCPs on Xen. It should systemati-
         cally cover all the important features and techniques in FNA.
       • Deadline
         Mar. 25, 2005

References


[1] Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and
    Henry M. Levy. Scheduler activations: Effective kernel support for
    the user-level management of parallelism. ACM Transactions on Com-
    puter Systems, 10(1):53–79, February 1992.
[2] Benjie Chen and Robert Morris. Flexible control of parallelism in a
    multiprocessor PC router. In Proceedings of the USENIX 2001 Annual
    Technical Conference, 2001.
[3] Keir Anthony Fraser. Practical Lock Freedom. PhD thesis, University
    of Cambridge, 2003.
[4] Stavros Harizopoulos and Anastassia Ailamaki. A case for staged
    database systems. In Proceedings of the First Biennial Conference on In-
    novative Data Systems Research, 2003.
[5] Eddie Kohler. The Click Modular Router. PhD thesis, Massachusetts
    Institute of Technology, 2001.

 [6] Leonidas I. Kontothanassis, Robert W. Wisniewski, and Michael L.
     Scott. Scheduler-conscious synchronization. ACM Transactions on
     Computer Systems, 15(1):3–40, 1997.
 [7] James R. Larus and Michael Parkes. Using cohort scheduling to en-
     hance server performance. In Proceedings of LCTES/OM, pages 182–
     187, 2001.
 [8] Yue Luo, Pattabi Seshadri, Juan Rubio, Lizy John, and Alex Mericas.
     A case study of 3 internet benchmarks on 3 superscalar machines.
     Technical report, Laboratory for Computer Architecture, The Univer-
     sity of Texas at Austin, 2002.
 [9] Derek McAuley and Rolf Neugebauer. A case for virtual channel pro-
     cessors. In Proceedings of the ACM SIGCOMM Workshop on Network-
     I/O Convergence, pages 237–242. ACM Press, 2003.
[10] C. Soules, J. Appavoo, K. Hui, D. Silva, G. Ganger, O. Krieger,
     M. Stumm, R. Wisniewski, M. Auslander, M. Ostrowski, B. Rosen-
     burg, and J. Xenidis. System support for online reconfiguration. In
     Proceedings of the USENIX 2003 Annual Technical Conference, 2003.
[11] Rob von Behren, Jeremy Condit, Feng Zhou, George C. Necula, and
     Eric Brewer. Capriccio: Scalable threads for internet services. In Pro-
     ceedings of the nineteenth ACM Symposium on Operating Systems Princi-
     ples, pages 268–281. ACM Press, 2003.
[12] Matthew David Welsh. An Architecture for Highly Concurrent, Well-
     Conditioned Internet Services. PhD thesis, University of California,
     Berkeley, 2002.

