Why Threads Are a Bad Idea
Cooperative Task Management
Capriccio
Threads are very hard to program; they should only be used when true CPU
concurrency is needed (i.e. on multiprocessors). Otherwise use
events (Ousterhout). Threads are hard to program b/c:
-- must coord access to shared data; forget lock? screwed
-- circular dependencies... --> deadlock
-- hard to debug (errors may be timing-dependent)
-- can't design modules independently anymore -- break abstraction
-- achieving good perf is hard;
--> simple locking ==> low concurrency (e.g. COARSE locking)
--> fine-grained locking --> increases complexity, OH, reduces perf
-- OSs limit perf: context switches, scheduling
-- threads not well supported
--> hard to port threaded code
--> standard libes not thread-safe
--> kernel calls, .. not multi-threaded
--> few debugging tools
Threads: general purpose solution for managing concurrency
- multiple independent execution streams
- shared state
- preemptive scheduling
- synchronization via locks, CVs, ...
Used for: OSs (1 kernel thread per user process); Sci apps (one
thread per CPU); distrib sys (process requests concurrently --
overlap I/Os); GUIs (threads correspond to user actions)
EVENT-DRIVEN PROGRAMMING:
--1 execution stream; NO CPU CONCURRENCY
--register interest in events via callbacks
--event loop waits for events, invokes event handlers
--handlers short lived
--no preemption of event handlers
USED FOR: GUIs (one handler per event); distrib systems (one handler for
each source of input; e.g. sockets); event-driven I/O for I/O overlap
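A minimal sketch of the event-loop model described above, in Python. The class and method names (EventLoop, register, post) are made up for illustration, and the loop drains an in-memory queue instead of waiting on real input sources.

```python
from collections import deque


class EventLoop:
    """Single execution stream: handlers run one at a time, never preempted."""

    def __init__(self):
        self.handlers = {}      # event name -> list of callbacks
        self.pending = deque()  # queued (event name, payload) pairs

    def register(self, name, callback):
        # Register interest in an event via a callback.
        self.handlers.setdefault(name, []).append(callback)

    def post(self, name, payload=None):
        self.pending.append((name, payload))

    def run(self):
        # A real loop would block waiting for events; this one stops when idle.
        while self.pending:
            name, payload = self.pending.popleft()
            for callback in self.handlers.get(name, []):
                callback(payload)   # each handler runs to completion


log = []
loop = EventLoop()
loop.register("click", lambda p: log.append(("click", p)))
# A handler can post a follow-up event instead of doing long work inline.
loop.register("click", lambda p: loop.post("redraw", p))
loop.register("redraw", lambda p: log.append(("redraw", p)))
loop.post("click", "button-1")
loop.run()
```

Because only one handler runs at a time, none of the handlers needs a lock around `log`.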
Problems with events:
--long running handlers make app non-responsive
soln: fork off subprocesses for long-running things; use events
to find out when done
--can't maintain local state across events
--event-driven IO not always well supported (e.g. poor write buffering)
Easier to debug w/events; easier to restrict complexity;
EVENTS FASTER THAN THREADS ON A SINGLE CPU; NO LOCKING OH; NO CONTEXT
SWITCHING
COOPERATIVE TASK MGMT
Task Management (cooperative & preemptive)
Stack Management (manual & automatic)
Event driven programming: simplify concurrency issues by reducing
opportunities for race conditions and deadlocks
POINT: can choose reasoning benefits of cooperative task management
without sacrificing automatic stack management
TASK MANAGEMENT:
Divide work prog does into tasks; each task encapsulates a control flow;
all tasks access some common shared state;
PREEMPTIVE task management: usually demanded by high perf progs
COOPERATIVE task management:
-- invariants on global state only need to be restored when a task
explicitly yields; can be assumed to be valid when task resumes
COOP harder than SERIAL in that if task has local state that depends
on the global state before yielding, when the task resumes, that
global state may have changed (in the meantime).
If do COOP TASK MGMT, every IO library function must be wrapped so that
instead of BLOCKING, the function (1) initiates IO, (2) yields control
to another task; The wrapper must make sure that the task becomes
schedulable when the IO completes.
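A rough sketch of the wrapping idea, using Python generators as cooperative tasks. The names (read_wrapped, task, run) and the dictionary standing in for a disk are assumptions for illustration; a real wrapper would start an actual non-blocking operation.

```python
from collections import deque


def read_wrapped(name):
    """Wrapper for a blocking read: (1) initiate the IO, (2) yield control."""
    result = yield ("read", name)   # the scheduler resumes us with the result
    return result


def task(label, name, log):
    log.append(f"{label}: start")
    data = yield from read_wrapped(name)   # looks like a blocking call
    log.append(f"{label}: got {data}")


def run(tasks, disk):
    ready = deque((t, None) for t in tasks)   # schedulable tasks
    in_flight = deque()                       # tasks waiting on IO
    while ready or in_flight:
        if ready:
            t, value = ready.popleft()
            try:
                _op, name = t.send(value)     # run until the task yields
            except StopIteration:
                continue                      # task finished
            in_flight.append((t, name))
        else:
            # Simulated IO completion: the task becomes schedulable again.
            t, name = in_flight.popleft()
            ready.append((t, disk[name]))


log = []
disk = {"a": "alpha", "b": "beta"}
run([task("A", "a", log), task("B", "b", log)], disk)
```

Task B starts while task A is still waiting on its read, which is the overlap the wrapper is meant to allow.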
STACK MANAGEMENT:
(1) Manual stack management:
-- divide a program into a collection of event handlers
-- create separate events for each subtask;
-- other tasks can make progress while one task is waiting on something
So control flow for a SINGLE task is broken across several procedures
or subtasks; No shared state w/in the individual subtasks --> tough.
If the language has a "built-in facility for constructing closures" (e.g.
call-with-current-continuation in Scheme), then manual stack management
isn't needed, as the state can be captured and restored at any juncture
SYNCHRONOUS IO: calling task blocks at call site til IO completes
then resumes execution
ASYNCHRONOUS IO: returns control to caller immediately
Asynch IO provides concurrency that is different than task management
IO ops considered independently of the ops they overlap b/c IO does
not access the shared state of the computation.
Any task management can call either type of IO i/f
STACK MANAGEMENT
AUTOMATIC: programmer expresses each task as a single procedure
-- this procedure MAY call functions that block on IO ops
-- when a task is waiting on a blocking op, that task's state
is stored on the procedure's program stack
--> PROCEDURE-ORIENTED
==> SOUNDS LIKE SYNCHRONOUS IO
MANUAL: programmer must rip the code for any given task into event
handlers that run to completion without blocking;
-- event handlers: procedures that can be invoked by an EVENT HANDLING
SCHEDULER in response to events (such as response from previously
requested IO);
-- when register an event handler with the Event handling scheduler,
provide an object that CONTAINS THE STATE relevant to the event;
also provides pointer to a fxn that should be called when the event
completes.
-- event handling scheduler (EVS) runs the game; so when the event
representing IO completion occurs, EVS calls the second handler (E2),
passing the state bundled by the first handler (E1) as an arg.
==> SOUNDS LIKE ASYNCHRONOUS IO
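A sketch of manual stack management, assuming a toy scheduler. The handler names echo the GetCAInfoHandler1/GetCAInfoHandler2 example in these notes; EventScheduler, request_io and the fake disk are hypothetical.

```python
from collections import deque


class EventScheduler:
    def __init__(self, disk):
        self.disk = disk          # simulated storage: key -> value
        self.pending = deque()    # (key, handler, state) for requested IO

    def request_io(self, key, handler, state):
        # Register the continuation (handler) plus the state it will need.
        self.pending.append((key, handler, state))

    def run(self):
        while self.pending:
            key, handler, state = self.pending.popleft()
            # IO "completes": call the continuation with the bundled state.
            handler(state, self.disk[key])


results = []


def get_ca_info_handler1(sched, ca_id):
    # Locals that must survive the IO are moved into a state object.
    state = {"ca_id": ca_id}
    sched.request_io(ca_id, get_ca_info_handler2, state)
    # Returns without blocking; this stack frame is gone when IO completes.


def get_ca_info_handler2(state, io_result):
    results.append((state["ca_id"], io_result))


sched = EventScheduler({"ca-7": "info-7", "ca-9": "info-9"})
get_ca_info_handler1(sched, "ca-7")
get_ca_info_handler1(sched, "ca-9")
sched.run()
```

One conceptual function has become two, and the only link between them is the state object and the function reference handed to the scheduler.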
DEBUGGING is DIFFICULT with MANUAL STACK MANAGEMENT; if the debugger
stops in GetCAInfoHandler2, we have NO visibility into the sequence
of events that preceded our existence in this function (i.e. the
stuff from GetCAInfoHandler1 -- no state from it, etc.);
For each routine that is ripped WITH MANUAL STACK MANAGEMENT, the
programmer must manually manage:
(1) function scoping: we have two functions to do what is a single
conceptual function
(2) auto variables: vars once alloc'd on the stack must be moved
into a state structure (on the heap) to survive across yield points;
(3) control structures: the entry point to each block containing a
function that might block must be reachable by a continuation; so
every block that contains a function that might block must be
a separate function --> conceptual functions w/loops must be ripped
into more than two pieces;
(4) debugging stack: must be manually reconstructed
SOFTWARE EVOLUTION magnifies the problem of STACK RIPPING under manual
stack management (MSM). It ALSO magnifies a problem with automatic
stack management (ASM): if a function evolves from non-yielding to
yielding, THERE IS NO INDICATION of this at the level of the
function, so callers are not warned. With MSM, when a function evolves
to yield for IO, its signature changes to reflect the new structure,
and the compiler notifies every caller of the function's
yieldability. HOWEVER THIS COMES AT A COST: when a function
evolves from being compute-only to potentially yielding, with MSM all
functions along the path from the changed function to the root may
have to be ripped (we keep ripping til a function is encountered
that already passes a continuation to one of the predecessor fxns.)
We can fix the problem with ASM & yielding though by adding static
checks; declare functions that yield YIELDING; everything that
runs w/o yielding is ATOMIC; compiler ensures that fxns that call
YIELDING fxns are themselves YIELDING; no calls to yielding fxns
within atomic blocks;
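A sketch of the static rule, assuming the call graph and the YIELDING declarations are already available as plain data. The function and the example names are hypothetical; a real check would live in the compiler.

```python
def check_yield_annotations(calls, declared_yielding):
    """Return (caller, callee) pairs where a function not declared YIELDING
    calls a function that is declared YIELDING."""
    violations = []
    for caller, callees in calls.items():
        if caller in declared_yielding:
            continue                      # YIELDING callers may call anything
        for callee in callees:
            if callee in declared_yielding:
                violations.append((caller, callee))
    return sorted(violations)


calls = {
    "read_block": [],
    "lookup": ["read_block"],
    "handle_request": ["lookup"],   # forgot to declare this one YIELDING
    "hash_key": [],
}
declared_yielding = {"read_block", "lookup"}
violations = check_yield_annotations(calls, declared_yielding)
```

Here the check flags handle_request, which calls a YIELDING function while still being treated as atomic.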
Could use a DYNAMIC check too: each block that DEPENDS ON ATOMICITY
is enclosed by: startAtomic() and endAtomic(); startAtomic() increments
a counter and endAtomic() decrements it; WHENEVER a function tries to
block on IO, yield() asserts that the counter is ZERO & dumps core o/w
WITH ASM, if the local state of a function does NOT DEPEND on the
yielding behavior of the called function, then the calling function
requires no change; if the calling function's local state IS affected,
then the calling function must be modified to revalidate state after
yield returns. LOCAL SURGERY; doesn't require drastic changes that
MSM requires.
CAPRICCIO -- scalable thread pkg for use with "high concurrency" servers
-- Save Threads! Fix them! instead of moving to Event-based and all
of the problems inherent in that model!
implement capriccio as a USER-LEVEL thread package
-- thereby DECOUPLING thread package implem from underlying OS
==> therefore can take advantage of
(a) cooperative threading
(b) new asynch IO mechs
(c) compiler support
PROVIDE:
(1) scalability up to 100k threads
(2) efficient stack mgmt
(3) resource-aware scheduling
LINKED STACK MGMT: minimizes amt of wasted stack space; stacks can
grow & shrink at run time;
RESOURCE AWARE SCHEDULING: allows thread scheduling and admission control
to adapt to a system's current resource usage
==> uses a blocking graph (auto-derived from app)
-- describes flow of control b/n blocking points (in a cooperative
thread pkg)
Applied techniques to the Apache 2.0.44 web server
Thread pkgs: natural abstraction for high concurrency programming;
event-based systems use pipeline of stages; also event-based systems
allow precise control over batch processing, state mgmt, & admissions
control; also provide atomicity w/in each event handler
--> hard to follow control flow & understand cause/effect r/ps
--> manual stack mgmt via stack ripping -- BURDEN
USER-LEVEL THREADS vs. KERNEL LEVEL THREADS
KERNEL THREADS: useful for providing true CPU concurrency
USER THREADS: logical threads; provide clean programming model with
useful invariants & semantics
TO PROVIDE CLEAN SEMANTICS, must decouple logical threads from
kernel threads; DECOUPLE PROGRAMMING MODEL FROM KERNEL;
(1) there's a lot of variation in kernel threads
--> USER LEVEL THREADS CAN HIDE OS VARIATION
(2) kernel threads & AIO are hot research areas; so to take
advantage of developments, use USER LEVEL THREADS
(1) improved scalability via using USER-LEVEL THREADS w/cooperative
scheduling by taking advantage of new AIO i/f; all thread ops O(1)
(2) LINKED STACKS: dynamic stack growth; solves problem of stack alloc
for a large # of threads;
(3) RESOURCE AWARE SCHEDULING: extracts info a/b flow of control w/in a
program to make sched decisions based on predicted resource usage;
--> programmer doesn't have to modify program to get these benefits
THREAD DESIGN AND SCALABILITY--supports POSIX API for thread mgmt & synch
USER-LEVEL THREADS: advantages over kernel threads in perf & flexibility
--also complicate preemption, though; can interact badly w/kernel
scheduler;
FLEXIBILITY: user-level threads create level of indirection b/n apps &
kernel; Linux kernel's new AIO mechs can be taken advantage of w/o
changing app code; USER-LEVEL THREADS increase flexibility of thread
scheduler, too; kernel-level thread sched must be GENERAL; user-level
thread scheduler can be tailored to the specific app; user-level threads
are also LIGHTWEIGHT;
PERFORMANCE: USER-LEVEL THREADS GREATLY REDUCE OH OF THREAD SYNCH;
Simplest case: cooperative scheduling on a single CPU; synch is nearly
free since neither user threads nor the thread sched can be interrupted
while in the critical section;
EVEN IN CASE OF PREEMPTIVE SCHEDULING: user-level threads don't
require KERNEL CROSSINGS for mutex acquisition and release;
KERNEL LEVEL MUTUAL EXCLUSION requires a kernel crossing for
every synchronization operation;
MEMORY MANAGEMENT is more EFFICIENT WITH USER-LEVEL THREADS; kernel
threads require data structures that eat up valuable kernel address space;
DISADVANTAGES OF USING USER-LEVEL INSTEAD OF KERNEL THREADS:
(1) To regain control of the CPU when a user level thread executes
a blocking call, user-level thread pkg overrides blocking calls
and replaces them with non-blocking equivalents;
--> these non-blocking calls require more KERNEL CROSSINGS;
Non-blocking network IO primitive: epoll (very efficient)
-- poll sockets for IO readiness
-- perform actual IO calls
For blocking calls, it's just the 2nd part; so the first part is just OH;
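The two-step pattern can be sketched with Python's standard selectors module, which uses epoll on Linux where available. The socket pair and the payload are just scaffolding for the example.

```python
import selectors
import socket

sel = selectors.DefaultSelector()   # picks epoll on Linux when available
a, b = socket.socketpair()
a.setblocking(False)
b.setblocking(False)
sel.register(b, selectors.EVENT_READ)

ready_before = sel.select(timeout=0)   # nothing written yet, so nothing ready
a.send(b"ping")

events = sel.select(timeout=1)         # step 1: poll sockets for IO readiness
data = b""
for key, _mask in events:
    data = key.fileobj.recv(4096)      # step 2: perform the actual IO call

sel.unregister(b)
a.close()
b.close()
sel.close()
```

A blocking recv() would need only the second call; the readiness poll is the extra kernel crossing counted as overhead above.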
(2) USER LEVEL THREAD pkgs must introduce wrapper layer that translates
blocking IO mechs to non-blocking IO mechs; more OH; expensive if for
quick ops (in-cache reads, easily satisfied by kernel)
(3) USER LEVEL THREADS CAN MAKE IT MORE DIFFICULT TO TAKE ADVANTAGE OF
MULTIPLE PROCESSORS; if have multiple processors, the perf advantage
of light-weight synch is less b/c synch no longer free
IMPLEM:
(1) cap built on top of Toernig's coroutine libe; this libe provides VERY
fast CONTEXT SWITCHES for the common case (when threads voluntarily
yield); CODE THAT ALLOWS FOR PREEMPTION OF LONG-RUNNING USER
THREADS IS BEING BUILT BUT IS NOT CURRENTLY PROVIDED
(2) IO: cap intercepts blocking IO calls at libe level by overriding
system call stub functions (in GNU libc); for some dynamically linked
apps, the libe bypasses the syscall stubs
CAP uses epoll (latest AIO Linux mech) for pollable file descriptors
(sockets, pipes, fifos) and Linux AIO for disk; if these are not
available, cap falls back to the standard UNIX poll() call for
pollable descriptors and to a pool of kernel threads for disk IO;
SCHEDULING:
cap's main sched loop looks like an event-driven app;
alternately runs app threads & checks for IO completions;
user can select b/n schedulers at run time;
SYNCHRONIZATION: cap supports cooperative scheduling;
if 1 CPU, inter-thread synch prims check locked/unlocked flag
if multiple kernel threads, cap uses spin locks or optimistic
concurrency;
EFFICIENCY: all thread mgmt functions but one have a bounded worst-case
running time (indep of the # of threads); the sleep queue is the
exception, since it uses a linked list
THREADING MICROBENCHMARKS:
CAP == CHEAPER CONTEXT SWITCHES b/c of "reduced kernel crossings" and
"simpler scheduling policy"
--> synch prims cheaper in CAP too b/c no kernel crossings
IO PERF:
when concurrency is LOW, CAP is slower b/c performs more syscalls
(e.g. epoll_wait() -- if # of runnable threads is low, then cap
issues even more epoll_wait() calls)
CAP uses AIO primitives; THEREFORE can benefit from kernel's disk
head sched algo to the same degree that kernel threads can; USING
KERNEL'S HEAD SCHED ALGO IN *****EVENT-BASED SYSTEMS***** THAT
USE BLOCKING DISK IO IS LIMITED BY # OF KERNEL THREADS USED;
CAPRICCIO HAS TWICE THE OH of NPTL (for disk I/o perf with
buffer cache) b/c AIO i/f used by CAP INCURS SAME AMOUNT OF
OH FOR CACHE-HITTING OPS AS FOR ONES THAT REACH DISK;
--> same OH
SOLUTION: return result immediately for requests that do not need to go
to disk;
LINKED STACK MGMT -- reduce size of VM alloc'd to stacks while preserving
unbounded stack abstraction;
WEIGHTED CALL GRAPH: use compiler analysis to limit amount of stack space
that must be preallocated; each function in prog represented by a node
in the graph weighted by the max amt of stack space that a single stack
frame for that function will consume; an edge b/n nodes indicates that
function A calls function B; if no recursive calls, no cycles in graph;
IF MAKE USE OF RECURSION, can't compute max stack size of prog @ compile
time;
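A sketch of the weighted call graph bound, assuming frame sizes and call edges are given as dictionaries. The function names and byte counts are invented for the example.

```python
def max_stack_bytes(frame_size, calls, root):
    """Worst-case stack use starting at root in an acyclic call graph.
    Raises ValueError on recursion, where no static bound exists."""
    memo = {}
    on_path = set()

    def visit(fn):
        if fn in on_path:
            raise ValueError(f"recursion through {fn}: no compile-time bound")
        if fn in memo:
            return memo[fn]
        on_path.add(fn)
        deepest_callee = max((visit(c) for c in calls.get(fn, [])), default=0)
        on_path.discard(fn)
        # Node weight (one frame of fn) plus the deepest chain below it.
        memo[fn] = frame_size[fn] + deepest_callee
        return memo[fn]

    return visit(root)


frame_size = {"main": 64, "parse": 128, "send": 256, "log": 32}
calls = {"main": ["parse", "send"], "parse": ["log"], "send": ["log"]}
bound = max_stack_bytes(frame_size, calls, "main")   # main -> send -> log
```

With a cycle in `calls` the function raises instead of returning, which matches the point above that recursion defeats a compile-time bound.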
INSERT CHECKPOINTS: checkpoint == piece of code, determines whether
there is enough stack space left to reach next checkpoint w/o oflow;
if not, new stack chunk alloc'd, $sp adjusted; when function call
returns, new stack chunk is unlinked and returned to free list;
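A toy model of the checkpoint behavior, tracking only byte counts. LinkedStack and its methods are hypothetical names; the real mechanism is compiler-inserted code that adjusts the stack pointer, which this sketch does not attempt.

```python
class LinkedStack:
    def __init__(self, min_chunk):
        self.min_chunk = min_chunk
        self.free_list = []                 # sizes of unlinked chunks
        self.allocated = 1                  # chunks ever allocated
        self.chunks = [[min_chunk, 0]]      # [size, bytes used] per chunk

    def _take_chunk(self, needed):
        # Prefer a chunk from the free list; allocate only if none fits.
        for i, size in enumerate(self.free_list):
            if size >= needed:
                return self.free_list.pop(i)
        self.allocated += 1
        return max(self.min_chunk, needed)

    def checkpoint(self, needed):
        """Reserve `needed` bytes up to the next checkpoint.
        Returns True if a new chunk had to be linked."""
        size, used = self.chunks[-1]
        if size - used >= needed:
            self.chunks[-1][1] += needed
            return False
        self.chunks.append([self._take_chunk(needed), needed])
        return True

    def on_return(self, needed, linked):
        if linked:
            size, _used = self.chunks.pop()     # unlink the chunk
            self.free_list.append(size)         # return it to the free list
        else:
            self.chunks[-1][1] -= needed


stack = LinkedStack(min_chunk=1024)
first = stack.checkpoint(600)     # fits in the initial chunk
second = stack.checkpoint(600)    # does not fit: link a new chunk
stack.on_return(600, second)      # call returns: chunk goes to the free list
third = stack.checkpoint(600)     # links again, reusing the freed chunk
```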
NON-CONTIGUOUS STACKS: stack chunks are switched before args for
a function call are pushed, so code for callee doesn't need to be
changed; have MUTEX for accessing free stack chunk list (through
cooperative threading approach)
PLACING CHECKPOINTS: ensure that at each checkpoint, we have a
bound on the stack space that may be consumed before we reach
the next checkpoint; SO MUST ENSURE AT LEAST ONE CHECKPOINT IN
EVERY CYCLE IN THE GRAPH; to find points for checkpoints, perform
DFS on call graph which identifies back edges (where recursion
takes place); add checkpoints at all call sites IDd as backedges;
add additional checkpoints to ensure that all paths b/n check
points are w/in a desired bound (given as a compile-time param);
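The back-edge step can be sketched with a standard three-color DFS. This covers only the first half of the placement (one checkpoint per cycle); the extra checkpoints that enforce the path-length bound are not modeled, and the example call graph is made up.

```python
def back_edge_call_sites(calls, root):
    """DFS over the call graph; return (caller, callee) pairs that are back
    edges, i.e. call sites that close a cycle and so need a checkpoint."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    back_edges = []

    def dfs(fn):
        color[fn] = GRAY                       # fn is on the current DFS path
        for callee in calls.get(fn, []):
            state = color.get(callee, WHITE)
            if state == GRAY:
                back_edges.append((fn, callee))   # edge back into the path
            elif state == WHITE:
                dfs(callee)
        color[fn] = BLACK                      # fully explored

    dfs(root)
    return back_edges


calls = {
    "main": ["walk", "print_node"],
    "walk": ["visit"],
    "visit": ["walk", "print_node"],   # visit -> walk closes a cycle
}
checkpoints = back_edge_call_sites(calls, "main")
```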
FUNCTION POINTERS: don't know which functions they might call;
categorize FPs by # and type of args; calls to external functions
also a prob; allow programmer to annotate external libe functions;
allow larger stack chunks to be linked for external functions;
INTERNAL WASTED SPACE: some stack space wasted when a new stack
chunk is linked
EXTERNAL WASTED SPACE: stack space at bottom of current chunk;
MAXPATH: max desired path length; increase max path, have fewer
checkpoints (smaller execution time), but more stack linking so
more internal wasted space
MINCHUNK: minimum stack chunk size; larger chunks result in
MORE external wasted space but less frequent stack linking
(which results in less internal wasted space; smaller
exec time OH);
USING LINKED STACKS CAN IMPROVE PAGING BEHAVIOR; linked stacks
are used in LIFO order; stack chunks can therefore be
shared b/n threads; which reduces size of app's working set;
(e.g. all threads can share one big 1 MB stack chunk:
T1 runs, allocs the big chunk, completes, and releases it;
then T2 runs and allocs the same big chunk, ...)
RESOURCE AWARE SCHEDULING:
Event-based systems give the scheduler two kinds of info:
(1) current task handler provides info a/b task's location
in processing chain; can be used to give prio to tasks about
to complete;
(2) lengths of handlers' task queues indicate which stages
are bottlenecks --> break those up!
CAP uses coop sched; so view an app as a sequence of stages;
stages separated by blocking points; deduce stages automatically
and have direct knowledge of resources used at each stage;
BLOCKING GRAPH: contains info a/b places in prog that threads block
Each NODE is a place where blocking occurred; an EDGE exists b/n
two nodes if they were CONSECUTIVE BLOCKING POINTS;
"location" == call chain used to reach blocking point (allows us
to differentiate blocking points in a more useful way than just
using the PC);
CAP generates this graph @ run time; observes transitions b/n
blocking points; *****Targeting long-running apps*****
--annotate edges & nodes
--average run time for each edge
--average for each node (of its outgoing edges)
--changes in resource usage: mem, stack space, sockets
(kept for edges & nodes);
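A sketch of building such a graph at run time, assuming the thread package calls a hook each time a thread blocks. The class, the hook name and the plain (unweighted-in-time) averaging are assumptions; resource annotations are left out.

```python
from collections import defaultdict


class BlockingGraph:
    def __init__(self):
        self.edge_count = defaultdict(int)     # (src, dst) -> times observed
        self.edge_time = defaultdict(float)    # (src, dst) -> total run time
        self.last_node = {}                    # thread id -> last blocking node

    def on_block(self, thread_id, call_chain, ran_for):
        """Record that a thread blocked at `call_chain` (the "location")
        after running for `ran_for` since its previous blocking point."""
        node = tuple(call_chain)
        prev = self.last_node.get(thread_id)
        if prev is not None:
            edge = (prev, node)                # consecutive blocking points
            self.edge_count[edge] += 1
            self.edge_time[edge] += ran_for
        self.last_node[thread_id] = node

    def avg_edge_time(self, src, dst):
        edge = (tuple(src), tuple(dst))
        return self.edge_time[edge] / self.edge_count[edge]

    def avg_node_time(self, src):
        # Average over the node's outgoing edges, weighted by how often
        # each edge was taken (an assumption of this sketch).
        src = tuple(src)
        edges = [e for e in self.edge_count if e[0] == src]
        total = sum(self.edge_time[e] for e in edges)
        count = sum(self.edge_count[e] for e in edges)
        return total / count


graph = BlockingGraph()
accept, read, write = ["main", "accept"], ["main", "read"], ["main", "write"]
graph.on_block(1, accept, 0.0)
graph.on_block(1, read, 2.0)     # thread 1: accept -> read took 2.0
graph.on_block(2, accept, 0.0)
graph.on_block(2, read, 4.0)     # thread 2: accept -> read took 4.0
graph.on_block(1, write, 1.0)    # thread 1: read -> write took 1.0
```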
Once we know that a resource is scarce, we promote nodes (& thus
threads) that release that resource and demote nodes that acquire
that resource.
For each resource, increase its util til it hits max cap;
then throttle back by scheduling nodes that release that resource;
when resource usage is low, we want to schedule nodes that consume
that resource -- in order to increase throughput;
Keep system at full throttle without thrashing!
Provides admission control too (tasks near end tend to release
resources; new tasks tend to acquire them)!
Maintain separate run queues for each node in the blocking graph;
periodically determine relative priorities of nodes; once prios
known, we select nodes by stride scheduling; select threads
by dequeuing from the nodes' run queues, O(1) ops.
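A sketch of stride scheduling over per-node run queues. Ticket counts stand in for the periodically computed node priorities; the class name, the ticket values and the heap-based node selection are assumptions of this example (the heap makes node selection logarithmic in the number of nodes, while dequeuing a thread stays O(1)).

```python
import heapq
from collections import deque

STRIDE1 = 1 << 20   # large constant; stride = STRIDE1 // tickets


class NodeScheduler:
    def __init__(self):
        self.heap = []      # (pass value, node name)
        self.stride = {}
        self.queues = {}    # node name -> run queue of threads

    def add_node(self, name, tickets):
        self.stride[name] = STRIDE1 // tickets
        self.queues[name] = deque()
        heapq.heappush(self.heap, (self.stride[name], name))

    def enqueue(self, name, thread):
        self.queues[name].append(thread)

    def next_thread(self):
        skipped = []
        chosen = None
        while self.heap:
            pass_value, name = heapq.heappop(self.heap)   # lowest pass first
            if self.queues[name]:
                chosen = self.queues[name].popleft()      # O(1) dequeue
                heapq.heappush(self.heap, (pass_value + self.stride[name], name))
                break
            skipped.append((pass_value, name))            # empty queue: skip
        for item in skipped:
            heapq.heappush(self.heap, item)
        return chosen


sched = NodeScheduler()
sched.add_node("release", tickets=3)   # promoted: releases a scarce resource
sched.add_node("acquire", tickets=1)   # demoted: acquires that resource
for i in range(10):
    sched.enqueue("release", ("release", i))
    sched.enqueue("acquire", ("acquire", i))
picks = [sched.next_thread()[0] for _ in range(8)]
```

Over the eight picks the promoted node runs three times as often as the demoted one, in proportion to the tickets.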
RESOURCES == CPU, memory & file descriptors
track mem usage by providing own version of malloc()
- detect resource limit for mem by watching page fault activity
file descriptors: track open() and close() syscalls;
-- estimate # of open conns at which response time jumps up
Don't track VM at current time; nor # of threads;
Determining MAX CAP difficult b/c can be workload-specific; the
disk subsystem can handle LOTS more requests if they're sequential
than if they're random!!! resources can interact, too; i.e.
VM system trades spare disk bw to free phys mem;
--> watch for early signs of thrashing;
But even detecting thrashing is difficult; thrashing == decrease
in productive work + increase in OH; can measure OH; but
productivity is app-specific notion;
-- guess at throughput: # of files opened/closed, # of
threads created/destroyed;
App-level mem mgmt hides resource alloc & dealloc from run-time system;
PROBLEM WITH COOPERATIVE SCHEDULING: THREADS MAY NOT YIELD THE
PROCESSOR!!! Easy to find edges that failed to yield via graph;
OH from gathering/maintaining stats only a/b 2% in Apache;
OH from stack traces is higher -- must be enabled for RAS
(compiler integ could help to maintain location info...)
Could also use a global var that holds a fingerprint of the
current stack; update fingerprint at each function call by
XORing unique function ID at function's entry & exit points;
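The fingerprint idea can be sketched with a decorator that XORs an ID in on entry and out again on exit. The decorator name, the IDs and the example functions are hypothetical. One caveat of plain XOR: it ignores call order, and a function that appears twice on the stack cancels itself out.

```python
import functools

fingerprint = 0   # global fingerprint of the current call stack


def tracked(func_id):
    """Decorator: XOR func_id into the fingerprint at entry and at exit."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            global fingerprint
            fingerprint ^= func_id          # entry: add this frame
            try:
                return fn(*args, **kwargs)
            finally:
                fingerprint ^= func_id      # exit: remove this frame
        return wrapper
    return decorate


seen = []


@tracked(0x1F3A)
def read_request():
    seen.append(fingerprint)   # stands in for a blocking point


@tracked(0x2B71)
def handle_connection():
    read_request()


@tracked(0x4C05)
def serve_static():
    read_request()


handle_connection()
serve_static()
```

The same blocking call gets a different fingerprint depending on its caller, without walking the stack.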