Checkpoint Based Recovery from Power Failures

Christopher Sutardja

Emil Stefanov

Goals

• Consistent checkpoint
  – A consistent snapshot of memory for a specific time in the past.
• Safe even under power failure
  – The checkpoint is never “in transition”.
• Small storage overhead
  – Not much more than double the memory.
• Low performance overhead
  – Should not stall the processor for too long.
• Scalable
  – Scales well in large core networks such as meshes.

Related Work

• On the feasibility of incremental checkpointing for scientific computing by J. Sancho et al.
  – Speculates about the future role of checkpointing in parallel machines.
  – As the number of processing nodes grows exponentially, failure of any one node becomes much more likely.
  – Error correction codes and other redundancies would introduce too much overhead when used alone.
  – As a result, research on checkpoint recovery is growing in importance.

Related Work

• Modular Checkpointing for Atomicity by L. Ziarek et al.
  – Introduces an abstraction called stabilizers to make checkpointing easier.
  – Targets message-passing machines.
    • Makes consistent checkpointing more challenging.

Related Work

• SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery by D. Sorin et al.
  – Explores the concept of checkpointing in logical time.
  – Multiple checkpoints.
  – Each dirty cache line has a tag indicating when it was modified relative to a checkpoint.
  – Low execution overhead.
  – Not safe from power failures.

Related Work

• ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors by M. Prvulovic et al.
  – Explores different ways of rollback recovery in shared-memory multiprocessor systems. Considers:
    • the scope of the checkpoint
    • memory
    • checkpointing mechanism.
  – Achieves about 6% checkpointing overhead.
  – Not safe from power failures.
  – Not geared towards non-volatile memory: requires fast writes.

Related Work

• Efficient Initialization and Crash Recovery for Log-based File Systems over Flash Memory by Chin Wu et al.
  – As flash memory becomes cheaper and denser, the uses for flash increase.
  – Uses flash for recovering file systems.
  – Yet another use of flash for recovery.
  – Uses a log-based method to accelerate remounting after a system crash by minimizing the amount of information that has to be changed upon reboot.

[Figure: system layout. Labels: Core, L1, L2, Memory Controller, DRAM. Later versions of the figure add a Checkpointer next to each DRAM.]

[Figure: checkpoint hardware. Labels: Checkpoint Coordinator; Core with Checkpoint A and Checkpoint B; L1 Cache Checkpoint Controller with Checkpoint A and Checkpoint B; L2 Cache Checkpoint Controller with Checkpoint A and Checkpoint B; Address Decoder; four columns of Buffer, Log, Checkpoint.]

Checkpointing Techniques

• For Caches and Cores:
  – Each cache/core has two flash storages adjacent to it.
    • One is for the previous checkpoint.
    • One is for the current checkpoint.
  – During a checkpoint, the cache/core internal state is copied to flash storage.
• For DRAM:
  – The checkpointing system snoops on DRAM.
  – DRAM changes are continuously logged to flash memory.
  – A chain of parallel buffers ensures that DRAM checkpointing almost never causes a stall.
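
The two adjacent flash storages can be illustrated with a minimal sketch. It assumes a one-bit selector that names the valid storage, as the later checkpointing steps suggest. The class name `PingPongCheckpoint` and the `fail_after` knob (which simulates a power loss mid-copy) are hypothetical, and the sketch flips the bit right after the copy instead of waiting for a coordinator.

```python
class PingPongCheckpoint:
    """Two flash slots (A and B) plus a one-bit selector naming the valid slot."""

    def __init__(self, initial_state):
        # Both slots start with the same valid state; the bit selects slot 0 (A).
        self.slots = [dict(initial_state), dict(initial_state)]
        self.valid_bit = 0

    def write_new_checkpoint(self, state, fail_after=None):
        """Copy `state` into the inactive slot one entry at a time.

        `fail_after` is a hypothetical knob: power is "lost" once that many
        entries have been copied, and the selector bit is left alone.
        """
        target = 1 - self.valid_bit
        self.slots[target] = {}
        for copied, (key, value) in enumerate(state.items()):
            if fail_after is not None and copied >= fail_after:
                return False  # power failed mid-copy; old checkpoint untouched
            self.slots[target][key] = value
        # Commit point: a single bit flip makes the new slot the valid one.
        self.valid_bit = target
        return True

    def recover(self):
        """Return a copy of the slot that the selector bit points at."""
        return dict(self.slots[self.valid_bit])


cp = PingPongCheckpoint({"pc": 0, "r1": 0})
cp.write_new_checkpoint({"pc": 8, "r1": 5})                 # completes
cp.write_new_checkpoint({"pc": 16, "r1": 9}, fail_after=1)  # interrupted
print(cp.recover())
```

Because the interrupted copy only ever touches the inactive slot, recovery still returns the last completed checkpoint, which is the sense in which the checkpoint is never “in transition”.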

Responsibilities of the Main Components

• Checkpoint Coordinator
  – Notifies the nodes and DRAM checkpointers that a checkpoint is beginning.
• DRAM Checkpointer
  – Continuously logs DRAM changes.
  – Checkpoints when instructed by the coordinator.
• Cache Checkpoint Controller
  – Checkpoints the adjacent cache when instructed by the coordinator.

Steps for Checkpointing (1 of 2)

1. The coordinator sets the checkpoint signal to 1.
2. In parallel, each:
   a. Core:
      i. Pauses processing instructions.
      ii. Copies internal state to flash memory.
   b. Cache Checkpoint Controller:
      i. Copies cache internal state to flash memory (data is copied one line at a time).
   c. DRAM Checkpointer:
      i. Flushes buffer to flash log.
      ii. Notifies checkpoint coordinator that the buffer has been flushed.

Steps for Checkpointing (2 of 2)

3. The coordinator sets the checkpoint signal to 0.
4. In parallel, each:
   a. Core:
      i. Flips flash memory bit to indicate the new checkpoint buffer.
   b. Cache Checkpoint Controller:
      i. Flips flash memory bit to indicate the new checkpoint buffer.
   c. DRAM Checkpointer:
      i. Marks checkpoint boundary in flash log.
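
The four steps can be sketched as a small software simulation. This is only an illustration under stated assumptions: the class and method names (`SlotCheckpointer`, `DramCheckpointer`, `Coordinator`, `on_signal_high`, `on_signal_low`) and the `BOUNDARY` marker are hypothetical, components are visited sequentially rather than in parallel, and resuming the core when the signal drops is an assumption the slides do not state.

```python
BOUNDARY = "CHECKPOINT_BOUNDARY"  # hypothetical marker written to the flash log


class SlotCheckpointer:
    """Stands in for a core or a cache checkpoint controller."""

    def __init__(self, state):
        self.state = state                   # live registers or cache lines
        self.flash = [dict(state), {}]       # checkpoint A, checkpoint B
        self.valid_bit = 0                   # flash bit naming the valid slot
        self.paused = False                  # only meaningful for a core

    def on_signal_high(self):
        self.paused = True                   # step 2.a.i: stop changing state
        spare = 1 - self.valid_bit
        self.flash[spare] = {}
        for key, value in self.state.items():  # copied one entry at a time
            self.flash[spare][key] = value

    def on_signal_low(self):
        self.valid_bit = 1 - self.valid_bit  # step 4: flip the flash bit
        self.paused = False                  # assumption: execution resumes here

    def saved(self):
        return dict(self.flash[self.valid_bit])


class DramCheckpointer:
    def __init__(self):
        self.buffer = []                     # volatile buffer of snooped writes
        self.flash_log = []                  # append-only log in flash

    def observe_write(self, address, value):
        self.buffer.append((address, value))

    def on_signal_high(self):
        self.flash_log.extend(self.buffer)   # step 2.c.i: flush buffer to log
        self.buffer.clear()
        return True                          # step 2.c.ii: notify coordinator

    def on_signal_low(self):
        self.flash_log.append(BOUNDARY)      # step 4.c.i: mark the boundary


class Coordinator:
    def __init__(self, components):
        self.components = components
        self.signal = 0

    def checkpoint(self):
        self.signal = 1                      # step 1
        for component in self.components:    # step 2 (sequential in this sketch)
            component.on_signal_high()
        self.signal = 0                      # step 3
        for component in self.components:    # step 4
            component.on_signal_low()
```

A typical use would create one `SlotCheckpointer` per core and per cache, one `DramCheckpointer` per DRAM, hand them all to a `Coordinator`, and call `checkpoint()` periodically. Splitting the work into a copy phase and a separate bit-flip phase means no component declares the new checkpoint valid until every component has finished copying.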

[Figure: checkpoint storage in detail. Labels: Core with Checkpoint A and Checkpoint B; L1 Cache Checkpoint Controller with Checkpoint A and Checkpoint B; L2 Cache Checkpoint Controller with Checkpoint A and Checkpoint B; Address Decoder; Buffered Changes; four columns of Buffer, Log, Checkpoint; a log of Changes with start and end markers between the Previous Checkpoint and the Next Checkpoint; Previous Checkpoint (random access).]

Recovering

1. Determining which checkpoint to use
   a. The system checks which checkpoint is the most recent.
   b. If the most recent checkpoint was in progress during the crash, the older checkpoint is used.
2. Restoring previous state
   a. Each architectural register is rewritten.
   b. Each cache is rewritten from its adjacent flash buffer (one cache line at a time).
   c. Main memory is recovered.
   d. Take advantage of pipelined writes if available.
3. Resume execution
   a. Resume at the restored program counter.
   b. Notify the CPUs that the system is restoring from a checkpoint (single bit).
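
The slides do not spell out how main memory is recovered. One plausible reading, sketched below, is to start from the previous checkpoint image and replay the logged changes up to the last boundary marker, discarding anything logged after it. The function name `recover_dram`, the `BOUNDARY` marker, and the log layout of `(address, value)` pairs are all assumptions made for illustration.

```python
BOUNDARY = "CHECKPOINT_BOUNDARY"  # hypothetical marker ending a checkpoint interval


def recover_dram(base_image, flash_log):
    """Rebuild DRAM contents as of the last completed checkpoint.

    base_image: dict of address -> value for the previous checkpoint.
    flash_log:  list of (address, value) entries and BOUNDARY markers.
    """
    # Locate the last boundary; entries after it belong to a checkpoint
    # interval that was still in progress when power was lost.
    last_boundary = -1
    for index, entry in enumerate(flash_log):
        if entry == BOUNDARY:
            last_boundary = index

    memory = dict(base_image)
    for entry in flash_log[: last_boundary + 1]:
        if entry == BOUNDARY:
            continue
        address, value = entry
        memory[address] = value  # later writes to an address overwrite earlier ones
    return memory


base = {0: 1, 4: 2}
log = [(0, 10), BOUNDARY, (4, 20), (8, 30), BOUNDARY, (0, 99)]
print(recover_dram(base, log))
```

In this example the trailing write `(0, 99)` is ignored because no boundary follows it, so the recovered image reflects only completed checkpoints.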


