Checkpoint-Based Recovery from
Power Failures
Christopher Sutardja
Emil Stefanov
Goals
• Consistent checkpoint
– A consistent snapshot of memory as of a specific time in the
past.
• Safe even under power failure
– The checkpoint is never “in transition”
• Small storage overhead
– Not much more than double the memory.
• Low performance overhead
– Should not stall the processor for too long.
• Scalable
– Scales well in large core networks such as meshes.
Related Work
• On the feasibility of incremental checkpointing
for scientific computing by J. Sancho et al.
– Speculates about the future role of checkpointing in
parallel machines.
– As the number of processing nodes grows
exponentially, the failure of some node becomes much
more likely.
– Error correction codes and other redundancies would
introduce too much overhead when used alone.
– As a result, research on checkpoint recovery is
growing in importance.
Related Work
• Modular Checkpointing for Atomicity by L.
Ziarek et al.
– Introduces an abstraction called stabilizers to
make checkpointing easier.
– Targets message-passing machines
• Makes consistent checkpointing more challenging.
Related Work
• SafetyNet: improving the availability of shared
memory multiprocessors with global
checkpoint/recovery by D. Sorin et al.
– Explores the concept of checkpointing in logical
time.
– Multiple checkpoints.
– Each dirty cache line has a tag indicating when it
was modified relative to a checkpoint.
– Low execution overhead.
– Not safe from power failures.
Related Work
• ReVive: cost-effective architectural support for
rollback recovery in shared-memory
multiprocessors by M. Prvulovic et al.
– Explores different ways of rollback recovery in shared-
memory multiprocessor systems. Considers:
• the scope of the checkpoint
• memory
• checkpointing mechanism.
– Achieves about 6% checkpointing overhead.
– Not safe from power failures.
– Not geared towards non-volatile memory: requires
fast writes.
Related Work
• Efficient Initialization and Crash Recovery for Log-
based File Systems over Flash Memory by Chin
Wu et al.
– As Flash Memory becomes cheaper and denser, the
uses for Flash increase.
– Uses flash for recovering file systems.
– Yet another use of flash for recovery.
– Uses a log-based method to accelerate remounting
after a system crash by minimizing the amount of
information that has to be changed upon reboot.
[Figure: system overview. A core with L1 and L2 caches, surrounded by memory controllers, each attached to DRAM.]
[Figure: the same memory controllers and DRAM, with a checkpointer added alongside each DRAM.]
[Figure: component detail. The core and the L1 and L2 cache checkpoint controllers each have two checkpoint stores, Checkpoint A and Checkpoint B. A checkpoint coordinator and an address decoder sit alongside four parallel lanes, each with a buffer, a log, and a checkpoint.]
Checkpointing Techniques
• For Caches and Cores:
– Each cache/core has two flash storage areas adjacent to it.
• One for the previous checkpoint.
• One for the current checkpoint.
– During a checkpoint, the cache/core internal state is
copied to flash storage.
• For DRAM:
– The checkpointing system snoops on DRAM.
– DRAM changes are continuously logged to flash
memory.
– A chain of parallel buffers ensures that DRAM
checkpointing almost never causes a stall.
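The DRAM logging path above can be sketched in software. This is a minimal model, not the hardware design from the slides: the class name `DramChangeLogger`, the modulo address decoder, the lane count, and the buffer capacity are all assumptions made for illustration. It shows why parallel buffers help: a write only stalls when its own lane's buffer is full.

```python
from collections import deque


class DramChangeLogger:
    """Hypothetical model of a DRAM checkpointer that snoops on writes."""

    def __init__(self, num_lanes=4, buffer_capacity=8):
        self.num_lanes = num_lanes
        self.buffer_capacity = buffer_capacity
        # Fast staging buffers, one per lane.
        self.buffers = [deque() for _ in range(num_lanes)]
        # Stand-ins for the slower flash logs, one per lane.
        self.logs = [[] for _ in range(num_lanes)]
        self.stalls = 0

    def lane_for(self, address):
        # Address decoder: assumed here to be a simple modulo.
        return address % self.num_lanes

    def snoop_write(self, address, value):
        lane = self.lane_for(address)
        buf = self.buffers[lane]
        if len(buf) >= self.buffer_capacity:
            # Buffer full: the write waits for one entry to drain.
            self.stalls += 1
            self.drain(lane, max_entries=1)
        buf.append((address, value))

    def drain(self, lane, max_entries=1):
        # Models a flash write: move entries to the log in arrival order.
        buf = self.buffers[lane]
        for _ in range(min(max_entries, len(buf))):
            self.logs[lane].append(buf.popleft())

    def flush_all(self):
        # Used at a checkpoint: empty every buffer into its log.
        for lane in range(self.num_lanes):
            self.drain(lane, max_entries=len(self.buffers[lane]))


logger = DramChangeLogger(num_lanes=4, buffer_capacity=2)
for addr in range(8):
    logger.snoop_write(addr, f"v{addr}")
logger.flush_all()
```

With four lanes, eight writes to consecutive addresses fill each buffer exactly to capacity without stalling; a single lane of the same capacity would have stalled on the third write.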
Responsibilities of the Main
Components
• Checkpoint Coordinator
– Notifies the nodes and DRAM checkpointers that a
checkpoint is beginning.
• DRAM Checkpointer
– Continuously logs DRAM changes.
– Checkpoints when instructed by the coordinator.
• Cache Checkpoint Controller
– Checkpoints the adjacent cache when instructed
by the coordinator.
Steps for Checkpointing (1 of 2)
1. The coordinator sets the checkpoint signal to 1.
2. In parallel, each:
a. Core:
i. Pauses processing instructions.
ii. Copies internal state to flash memory.
b. Cache Checkpoint Controller:
i. Copies cache internal state to flash memory (data is copied
one line at a time).
c. DRAM Checkpointer:
i. Flushes buffer to flash log.
ii. Notifies checkpoint coordinator that the buffer has been
flushed.
Steps for Checkpointing (2 of 2)
3. The coordinator sets the checkpoint signal to 0.
4. In parallel, each:
a. Core:
i. Flips flash memory bit to indicate the new checkpoint
buffer.
b. Cache Checkpoint Controller:
i. Flips flash memory bit to indicate the new checkpoint
buffer.
c. DRAM Checkpointer:
i. Marks checkpoint boundary in flash log.
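The four steps above can be sketched as a small simulation. This is an illustrative model under stated assumptions, not the hardware protocol itself: `FlashCheckpointStore`, `Coordinator`, `BOUNDARY`, and the `crash_before_commit` flag are hypothetical names, and the parallel hardware actions are run sequentially here. The sketch only models a power loss between step 2 and step 3; it does not model a failure in the middle of the individual bit flips.

```python
BOUNDARY = "CHECKPOINT_BOUNDARY"  # hypothetical marker written to the flash log


class FlashCheckpointStore:
    """Two flash buffers (A and B) plus a one-bit selector for a core or cache."""

    def __init__(self, initial_state):
        self.buffers = [dict(initial_state), {}]  # index 0 = A, index 1 = B
        self.valid_bit = 0  # selects the buffer holding the valid checkpoint

    def write_inactive(self, state):
        # Step 2: copy state into the buffer NOT selected by the bit.
        self.buffers[1 - self.valid_bit] = dict(state)

    def commit(self):
        # Step 4: flipping one bit makes the new buffer the valid one.
        self.valid_bit = 1 - self.valid_bit

    def valid_checkpoint(self):
        return self.buffers[self.valid_bit]


class Coordinator:
    def __init__(self, stores, dram_log):
        self.stores = stores
        self.dram_log = dram_log
        self.signal = 0

    def checkpoint(self, live_states, dram_buffer, crash_before_commit=False):
        self.signal = 1                          # step 1
        for store, state in zip(self.stores, live_states):
            store.write_inactive(state)          # steps 2a and 2b
        self.dram_log.extend(dram_buffer)        # step 2c: flush buffer to log
        dram_buffer.clear()
        if crash_before_commit:
            return False                         # power lost, no bit flipped
        self.signal = 0                          # step 3
        for store in self.stores:
            store.commit()                       # steps 4a and 4b
        self.dram_log.append(BOUNDARY)           # step 4c
        return True


core = FlashCheckpointStore({"pc": 0})
cache = FlashCheckpointStore({"line0": 0})
log = []
coord = Coordinator([core, cache], log)

# A checkpoint interrupted before the flip leaves the old checkpoint valid.
interrupted = coord.checkpoint([{"pc": 100}, {"line0": 7}], [(0x10, 1)],
                               crash_before_commit=True)
state_after_crash = dict(core.valid_checkpoint())

# A completed checkpoint switches both stores and marks the log.
completed = coord.checkpoint([{"pc": 200}, {"line0": 9}], [(0x20, 2)])
```

Because the new state is always written to the inactive buffer, the valid checkpoint is never overwritten in place, which is the sense in which the checkpoint is never "in transition".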
[Figure: checkpoint storage layout. The core and the L1 and L2 cache checkpoint controllers each have Checkpoint A and Checkpoint B stores. An address decoder routes buffered changes into four parallel buffers, each followed by a log and a checkpoint. The log runs from start to end and holds the changes between the previous checkpoint and the next checkpoint; the previous checkpoint is stored for random access.]
Recovering
1. Determine which checkpoint to use
a. The system checks which checkpoint is the most recent.
b. If the most recent checkpoint was in progress during the crash, the older
checkpoint is used.
2. Restore the previous state
a. Each architectural register is rewritten.
b. Each cache is rewritten from its adjacent flash buffer (one cache line
at a time).
c. Main memory is recovered.
d. Take advantage of pipelined writes if available.
3. Resume Execution
a. Resume program counter
b. Notify the CPUs that the system is restoring from a checkpoint
(single bit)
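The recovery steps above can be sketched as follows. The slides do not spell out how main memory is rebuilt, so this sketch assumes that the flash log holds `(address, value)` changes separated by boundary markers, and that changes up to the last marker are applied on top of a base image. The names `BOUNDARY`, `select_checkpoint`, and `recover_memory` are hypothetical.

```python
BOUNDARY = "CHECKPOINT_BOUNDARY"  # hypothetical marker written at each checkpoint


def select_checkpoint(buffers, valid_bit):
    # Step 1: the bit only flips after a checkpoint completes, so it always
    # points at the most recent complete checkpoint.
    return buffers[valid_bit]


def recover_memory(base_image, flash_log):
    """Step 2c: rebuild main memory as of the last completed checkpoint."""
    # Find the last boundary marker. Entries after it belong to an
    # unfinished checkpoint interval and are ignored.
    last_boundary = -1
    for i, entry in enumerate(flash_log):
        if entry == BOUNDARY:
            last_boundary = i

    memory = dict(base_image)
    for entry in flash_log[:last_boundary + 1]:
        if entry == BOUNDARY:
            continue
        address, value = entry
        memory[address] = value  # later log entries overwrite earlier ones
    return memory


base = {0: "a", 1: "b"}
log = [(0, "x"), BOUNDARY, (1, "y"), (2, "z"), BOUNDARY, (0, "q")]
restored = recover_memory(base, log)

core_buffers = [{"pc": 1}, {"pc": 2}]
core_state = select_checkpoint(core_buffers, valid_bit=1)
```

In this example the trailing write `(0, "q")` was logged after the last boundary, so it is discarded and address 0 keeps the value `"x"` from the completed interval.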