UNIVERSITY of WISCONSIN-MADISON

Computer Sciences Department



CS 537: Introduction to Operating Systems

Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Haryadi S. Gunawi









Journaling File Systems

Questions answered in this lecture:

VFS and FS operations

Why is it hard to maintain on-disk consistency?

How does the FSCK tool help with consistency?

What information is written to a journal?

What 3 journaling modes does Linux ext3 support?


Virtual File System (VFS)

Operations:

• File/Dir: open, close, chdir, link, unlink (delete), truncate, rename

• Data: read, write, lseek

• Access and info: stat, chmod, chown

Ext2/3 (or any other file system)

• Knows its on-disk format

• Has its own block allocation policies

VFS layer:

• Structure-independent code

• Manage buffer cache, directory cache, generic inode descriptor, file descriptor

• Defines a set of functions every file system has to implement
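The split between the VFS layer and a concrete file system can be sketched in a few lines. This is only an illustration: the names `FileSystem`, `ToyFS`, and `VFS` are hypothetical, only three of the operations listed above are shown, and the mount lookup is a naive prefix match rather than what a real kernel does.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Hypothetical interface: the functions every file system must implement."""

    @abstractmethod
    def read(self, path):
        """Return the contents of the file at path."""

    @abstractmethod
    def write(self, path, data):
        """Store data as the contents of the file at path."""

    @abstractmethod
    def unlink(self, path):
        """Delete the file at path."""


class ToyFS(FileSystem):
    """In-memory stand-in; a real file system would know its on-disk format."""

    def __init__(self):
        self.files = {}

    def read(self, path):
        return self.files[path]

    def write(self, path, data):
        self.files[path] = data

    def unlink(self, path):
        del self.files[path]


class VFS:
    """Structure-independent code: picks the file system, then delegates."""

    def __init__(self):
        self.mounts = {}  # mount point -> FileSystem

    def mount(self, mount_point, fs):
        self.mounts[mount_point] = fs

    def _resolve(self, path):
        # Naive lookup: choose the longest mount point that prefixes the path.
        matches = [m for m in self.mounts if path.startswith(m)]
        if not matches:
            raise FileNotFoundError(path)
        mount_point = max(matches, key=len)
        return self.mounts[mount_point], path[len(mount_point):]

    def read(self, path):
        fs, rel = self._resolve(path)
        return fs.read(rel)

    def write(self, path, data):
        fs, rel = self._resolve(path)
        fs.write(rel, data)

    def unlink(self, path):
        fs, rel = self._resolve(path)
        fs.unlink(rel)
```

The application only ever talks to `VFS`; swapping `ToyFS` for another `FileSystem` subclass requires no change on the application side.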





[Figure: an application calls into the VFS layer, which dispatches to a concrete file system: Linux ext2/3, SGI XFS, ReiserFS, Sun ZFS, or IBM JFS]

Multiple updates / ops

Write

• Write to the next byte (to a data block)

• Update block bitmap

• Update meta-data

Delete (e.g. rm /dir/file)

• Release data blocks of file → update block bitmap (to free space)

• Update the inode for “file”

• Update inode bitmap

• Update “dir” data block (remove directory entry)

And many more …



What happens if the system crashes in the middle of these updates?

Review: The I/O Path (Reads)

Read() from file

• Check if block is in cache

– (file cache is sometimes called the “buffer cache”)

• If so, return block to user [1 in figure]

• If not, read from disk, insert into cache (leave a copy in cache), return to user [2]

[Figure: main memory (cache) above the disk; path 1 returns the block from the cache, path 2 reads a block that is not in the cache from disk]

Review: The I/O Path (Writes)

Write() to file

• Write is buffered in memory (“write behind”) [1]

• Sometime later, OS decides to write to disk [2]

Why delay writes?

• Implications for performance

• Implications for reliability

[Figure: main memory (cache) above the disk; step 1 buffers the write in memory, step 2 writes it to disk later]
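The read and write paths can be sketched with a toy cache. Everything here is an assumption made for illustration: `BufferCache` is a hypothetical name, the "disk" is a plain dictionary from block number to bytes, and `crash()` simply throws away memory.

```python
class BufferCache:
    """Toy buffer cache in front of a dict that stands in for the disk."""

    def __init__(self, disk):
        self.disk = disk    # block number -> bytes
        self.cache = {}     # blocks currently held in memory
        self.dirty = set()  # cached blocks not yet written to disk

    def read(self, blk):
        if blk in self.cache:      # [1] hit: return the cached copy
            return self.cache[blk]
        data = self.disk[blk]      # [2] miss: read from disk
        self.cache[blk] = data     # leave a copy in the cache
        return data

    def write(self, blk, data):
        # Write-behind: only memory is updated now.
        self.cache[blk] = data
        self.dirty.add(blk)

    def flush(self):
        # "Sometime later": push every dirty block to disk.
        for blk in sorted(self.dirty):
            self.disk[blk] = self.cache[blk]
        self.dirty.clear()

    def crash(self):
        # Memory is lost; whatever was still dirty never reaches the disk.
        self.cache.clear()
        self.dirty.clear()
```

The performance benefit is that repeated writes to the same block cost one disk write at `flush()` time; the reliability cost is that a `crash()` before `flush()` silently loses the update.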


Many “dirty” blocks in memory:

What order to write to disk?

Example: Appending a new block to existing file

• Initially we have a data bitmap B, the file’s inode I, and an unused data block D

• The append requires three writes:

– Write data bitmap B’ (marks the new data block as used)

– Write inode I’ of the file (adds the new pointer, updates the time)

– Write new data block D’

[Figure: B’, I’, and D’ are dirty in memory; the disk still holds B, I, and D. In what order should they be written?]

The Problems

Writes: Have to update disk with N writes

• Writes are buffered in the first place, and then are issued together later

Disk scheduler

• But the disk performs only a single write atomically

• Disk scheduler (e.g. C-LOOK) reorders the write sequence

Crashes: System may crash at arbitrary point

• Bad case: In the middle of an update sequence

Desire: To update on-disk structures atomically

• Either all should happen or none




Example: Bitmap first

Write Ordering: Bitmap (B’), Inode (I’), Data (D’)

• But CRASH after B’ has reached disk, before I’ or D’

Result?

• Inode is still the old inode (I), it doesn’t point to D’

• Data is still the old data (D)

• Block D is leaked: the bitmap says D is used, but no inode points to it, so it can never be allocated again





[Figure: B’, I’, D’ in memory; only B’ has reached the disk, next to the old I and D]

Example: Inode first

Write Ordering: Inode (I’), Bitmap (B’), Data (D’)

• But CRASH after I’ has reached disk, before B’ or D’

Result?

• I’ points to D, which contains garbage (not D’)

• B is the old bitmap, which says block D is unused (although an inode already points to it)

– Another file (inode I2) requests a block, and the FS gives it D

– D is now pointed to by both I’ and I2 (security leak!)



[Figure: B’, I’, D’ in memory; only I’ has reached the disk, next to the old B and D]

Example: Inode first

Write Ordering: Inode (I’), Bitmap (B’), Data (D’)

• CRASH after I’ AND B’ have reached disk, before D’

Result?

• Better than previous example (no security leak)

• But D still contains garbage, so I’ is pointing to garbage data



[Figure: B’, I’, D’ in memory; I’ and B’ have reached the disk, D is still the old block]

Example: Data first

Write Ordering: Data (D’) , Bitmap (B’), Inode (I’)

• CRASH after D’ has reached disk, before I’ or B’

Result?

• Nothing bad happens; everything is “consistent”

– Bitmap says the block that holds D’ is free

– No inode points to the block that holds D’

• Inode is still the old inode (I), which does not point to the new data block, so the append is simply lost



[Figure: B’, I’, D’ in memory; only D’ has reached the disk, next to the old B and I]
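The four examples can be checked mechanically. The sketch below takes a write ordering, cuts it off after each possible crash point, and labels the resulting on-disk state using the same reasoning as the slides. The function names and the label strings are made up for this illustration.

```python
def classify(on_disk):
    """Describe the on-disk state given which new blocks reached the disk."""
    b, i, d = "B'" in on_disk, "I'" in on_disk, "D'" in on_disk
    if i and not b:
        # The block can be handed to a second file (the security leak case).
        return "inode points to a block the bitmap calls free"
    if b and not i:
        # Space leak: the block can never be allocated again.
        return "block marked used but no inode points to it"
    if i and b and not d:
        return "inode points to garbage data"
    # Nothing written, only D' written, or all three written.
    return "consistent"


def outcomes(order):
    """Crash after 0, 1, 2, or 3 writes of the given order have reached disk."""
    return [classify(set(order[:k])) for k in range(len(order) + 1)]
```

For instance, `outcomes(("D'", "B'", "I'"))` shows that writing the data first is safe after one write, but a crash after the second write (D’ and B’ on disk, I’ not) still leaks the block. No ordering of three separate writes is safe at every crash point, which is the motivation for making the update atomic.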


Traditional Solution: FSCK

FSCK: “file system checker”

When system boots:

• Make multiple passes over file system, looking for inconsistencies

– e.g., inode pointers and bitmaps, directory entries, inode reference counts

– Ex1: bitmap says D is used, but no inode points to D → fix the bitmap (mark D as unused)

– Ex2: two inodes point to the same data block → create a clone of the data block and make one of the inodes point to the clone (hence, no sharing anymore)

• Either fix automatically or punt to admin

• Does fsck have to run upon every reboot?

– Yes, if FS does not know whether a crash interrupted an operation

– No, if FS can tell that no operation was left unfinished

• E.g. keep a dirty bit in the superblock: set it to 1 before starting the operations, clear it once the operations have finished

• If upon reboot the dirty bit in the superblock is 1, must run fsck

• Problem: adds runtime overhead (must write to superblock for each update sequence)

Main problem with fsck: Performance

• Sometimes takes hours to run on large disk volumes

• Inconsistency can only be detected if the whole content of the file system is checked → must scan the whole file system (more precisely: must scan all metadata in the file system)
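A toy version of the two repairs (Ex1 and Ex2) shows why the scan has to touch all metadata: the checker cannot trust the bitmap until it has seen every inode pointer. The data layout below is an assumption for illustration only (dictionaries instead of on-disk structures), and the function is a sketch, not the real fsck.

```python
def fsck(bitmap, inodes, blocks):
    """Repair a toy file system in place.

    bitmap: block number -> True if marked used
    inodes: inode number -> list of block numbers it points to
    blocks: block number -> contents
    """
    # Pass 1: scan ALL inodes to learn which blocks are really in use.
    referenced = {blk for ptrs in inodes.values() for blk in ptrs}

    # Pass 2 (Ex1): make the bitmap agree with the pointers.
    for blk in bitmap:
        bitmap[blk] = blk in referenced

    # Pass 3 (Ex2): a block shared by two pointers gets cloned.
    owner = {}
    for ino in sorted(inodes):
        for idx, blk in enumerate(inodes[ino]):
            if blk not in owner:
                owner[blk] = ino
                continue
            # Lowest free block; raises ValueError if no block is free.
            clone = min(b for b in bitmap if not bitmap[b])
            blocks[clone] = blocks[blk]
            bitmap[clone] = True
            inodes[ino][idx] = clone
            owner[clone] = ino
```

Pass 1 is the expensive part: its cost grows with the amount of metadata in the file system, not with the amount of work that was in flight at the time of the crash.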

How To Avoid The Long Scan?

Idea:

• Do not update the on-disk structures in place right away

• Write something to another area on the disk before updating its

data structures

– Called the “write ahead log” or “journal”

• If all updates have been successfully reflected to the journal, then

all the updates can be reflected to the final place (this is called the

checkpointing process)

When crash occurs, look through log and see

what was going on

• Use contents of log to fix file system structures

• The process is called “recovery”




Case Study: Linux ext3

Journal location

• EITHER on a separate device partition

• OR just a “special” file within ext2



Three separate modes of operation:

• Data: All data and metadata is journaled

• Ordered, Writeback: Just metadata is journaled



First focus: Data journaling mode






Transactions in ext3 Data Journaling Mode

Same example: Update Inode (I), Bitmap (B), Data (D)

First, write to journal:

Each write is formed into a transaction

A transaction consists of:

• Journal descriptor block (Dr)

– Implies the beginning of a transaction (Tx begin)

– Contains the actual locations of the blocks saved in the journal data blocks

– Contains the transaction number

• Journal data blocks

– All blocks that must be updated atomically to the disk, e.g. in this example: I’,

B’, and D’

• Journal commit block (C)

– Implies the end of transaction (Tx end)

– Also contains the transaction number

[Figure: the journal holds Dr, B’, I’, D’, C; the fixed locations still hold B, I, D at block numbers 1000, 2000, 3000. The descriptor Dr records Tx# = 2, 3 blocks, [1] = 1000, [2] = 2000, [3] = 3000]

Write to the journal (sequence)

I want to write B’, I’, and D’

Please give me a transaction # (e.g. got tx #2)

Write tx#2 to the journal superblock (so that we

know later tx#2 is pending)

Prepare journal descriptor block

• Set tx# = 2, set the final locations of the journal data

blocks, set the #blks

Write journal descriptor block and journal data

blocks

Write the journal commit block
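The sequence above can be sketched with a toy journal. This is an illustration under stated assumptions: the `Journal` class, its tuple-based record format, and the `pending_tx` / `next_tx` fields are invented here and do not reflect the real ext3 on-disk layout.

```python
class Journal:
    """Toy journal: a superblock plus an append-only list of journal blocks."""

    def __init__(self, next_tx=1):
        self.superblock = {"pending_tx": None, "next_tx": next_tx}
        self.log = []

    def begin(self):
        # Hand out a transaction number and record it as pending.
        tx = self.superblock["next_tx"]
        self.superblock["next_tx"] += 1
        self.superblock["pending_tx"] = tx
        return tx

    def write_tx(self, tx, updates):
        """updates maps each final block number to its new contents."""
        locations = sorted(updates)
        # Descriptor block: tx number and final locations (their count is #blks).
        self.log.append(("descriptor", tx, locations))
        # Journal data blocks, in the same order as the descriptor lists them.
        for blk in locations:
            self.log.append(("data", tx, updates[blk]))
        # Commit block: marks the end of the transaction.
        self.log.append(("commit", tx))
```

With the slide's numbers, `Journal(next_tx=2)` followed by `write_tx(tx, {1000: b"B'", 2000: b"I'", 3000: b"D'"})` leaves five records in the log: one descriptor, three data blocks, one commit.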




Transactions in ext3 Data Journaling Mode

Second, “checkpoint” data to fixed ext3 structures

• Copy B’, I’, and D’ to their fixed file system locations



[Figure: the journal holds Dr, B’, I’, D’, C; the fixed locations at block numbers 1000, 2000, 3000 now hold B’, I’, D’]





Finally, free Tx in journal

• Journal is a fixed-size circular buffer; entries must be periodically freed

[Figure: the transaction’s journal space is freed; the fixed locations at block numbers 1000, 2000, 3000 hold B’, I’, D’]

Upon reboot

Check the journal superblock: “is there any pending

transaction?”

If yes (e.g. tx#2), scan the journal area to find a journal

descriptor for tx#2

After finding the journal descriptor block, ask “Is there a

commit block?”

If not, release the transaction

If yes, need to checkpoint the transaction

If checkpoint is successful, clear the transaction by updating

the journal superblock (so that we can know there is no

pending transaction)
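The reboot procedure can be sketched as a single function. The record format is an assumption made for this example: the journal is a list of tuples `("descriptor", tx, locations)`, `("data", tx, contents)`, and `("commit", tx)`, and the journal superblock is a dictionary with a `pending_tx` field. None of this mirrors the real ext3 layout.

```python
def recover(journal_sb, log, disk):
    """Replay or discard the pending transaction; return what was done."""
    tx = journal_sb["pending_tx"]
    if tx is None:
        return "clean"  # no pending transaction, nothing to do

    # Scan the journal area for the descriptor block of the pending tx.
    start = next(
        (n for n, rec in enumerate(log)
         if rec[0] == "descriptor" and rec[1] == tx),
        None,
    )
    committed = False
    if start is not None:
        locations = log[start][2]
        commit_at = start + 1 + len(locations)
        data = log[start + 1:commit_at]
        # Is there a commit block right after the journal data blocks?
        committed = (
            commit_at < len(log)
            and log[commit_at] == ("commit", tx)
            and all(r[0] == "data" and r[1] == tx for r in data)
        )
        if committed:
            # Checkpoint: copy each journal data block to its final location.
            for blk, rec in zip(locations, data):
                disk[blk] = rec[2]

    # Either way the transaction is no longer pending.
    journal_sb["pending_tx"] = None
    return "replayed" if committed else "discarded"
```

Replaying is idempotent: if the machine crashes again in the middle of the checkpoint, the superblock still names the transaction as pending and the same blocks are simply copied once more.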


What if there’s a Crash?

Recovery: Go through log and “redo” operations

that have been successfully committed to log

What if …

• Tx begin but not Tx end in log?

– Discard the transaction

• Tx begin through Tx end are in log,

but I’, B’, and D’ have not yet been checkpointed?

– Keep that transaction on the disk until the journal data blocks have been

checkpointed successfully

• What if Tx is in log, I’, B’, D’ have been checkpointed,

but Tx has not been freed from log?

– In terms of correctness, there is no problem

– But the journal size is usually fixed (e.g. X MB), so eventually the log will be full

and some transactions must be freed

– Again: Tx can only be freed if all its journal data blocks have been

checkpointed successfully!!

Performance? (As compared to fsck?)

• Much faster

• Only read the transactions that haven’t been checkpointed

• Journal size is fixed, so no need to scan the entire file system, simply

scan the journal area only (X MB)

Complication: Disk Scheduling

Problem:

• Low levels of the I/O subsystem in the OS, and even the disk/RAID itself, may reorder requests

How does this affect Tx management?

• Do we write the journal blocks (e.g. Dr, B’, I’, D’, C) in parallel?

– No! Because of this reordering, when all these journal blocks are sent to the disk together, Dr and C could be written before the journal data blocks! A crash at that point leaves a transaction that looks committed but whose data blocks are garbage

• Where is it OK to issue writes in parallel?

– Tx begin

– I, B, D

– Tx end

– Checkpoint: I, B, D copied to final destinations

– Tx freed in journal

• Synchronization points:

– Write Tx begin, B’, I’, D’ in parallel (then wait until they finish)

– Write Tx end (wait)

– Checkpoint B’, I’, D’ in parallel (wait)

– Tx freed in journal
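The effect of the synchronization points can be sketched with a toy disk that completes queued writes in arbitrary order until the caller waits. `ReorderingDisk`, `journaled_update`, and the string labels are hypothetical names for this illustration; a real system would issue block I/O and wait for completion instead.

```python
import random


class ReorderingDisk:
    """Toy disk: queued writes may complete in any order until wait()."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.queue = []      # writes issued but not yet durable
        self.completed = []  # order in which writes became durable

    def submit(self, name):
        self.queue.append(name)

    def wait(self):
        # Synchronization point: everything queued so far finishes,
        # in an order chosen by the "scheduler", before we continue.
        self.rng.shuffle(self.queue)
        self.completed.extend(self.queue)
        self.queue.clear()


def journaled_update(disk):
    # Tx begin and the journal data blocks may go out in parallel.
    for blk in ("TxBegin", "B'", "I'", "D'"):
        disk.submit("journal:" + blk)
    disk.wait()
    # The commit block is issued only after all of the above are durable.
    disk.submit("journal:TxEnd")
    disk.wait()
    # Checkpoint: the copies to the final locations may go out in parallel.
    for blk in ("B'", "I'", "D'"):
        disk.submit("final:" + blk)
    disk.wait()
    # Only now may the transaction be freed in the journal.
    disk.submit("journal:free")
    disk.wait()
```

Whatever the seed, `journal:TxEnd` always lands after the four journal blocks and before any `final:` write. Removing the first `wait()` would let the shuffle place the commit block ahead of a data block, which is exactly the failure described above.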


Problem with Data Journaling

Data journaling: Lots of extra writes

• All data committed to disk twice

(once in journal, once to final location)

Overkill if only goal is to keep metadata consistent

Instead, use ext3 writeback mode

• Just journals metadata

• Writes data to final location directly, at any time

• Problem: B’ and I’ are written to the journal, then a crash occurs (D’ has not been written to the disk) → I’ points to a valid block, but the content is garbage

Solution: Ordered mode

• Write all data blocks to their final location (e.g. write D’ to its final

location), then wait until finish

• Write metadata to the journal
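To put a number on the "lots of extra writes", the sketch below counts block writes per update under a simplified model taken from the earlier slides: each transaction has one descriptor block and one commit block, and every journaled block is later checkpointed once. The function name and the model are assumptions for illustration; they ignore batching of several updates into one transaction.

```python
def block_writes(mode, n_data, n_meta):
    """Count block writes for one update of n_data data and n_meta metadata blocks."""
    if mode == "data":
        journaled = n_data + n_meta  # everything goes through the journal
        direct = 0
    elif mode in ("ordered", "writeback"):
        journaled = n_meta           # only metadata is journaled
        direct = n_data              # data goes straight to its final location
    else:
        raise ValueError("unknown mode: " + mode)

    journal_writes = 1 + journaled + 1  # descriptor + journaled blocks + commit
    checkpoint_writes = journaled       # each journaled block is copied once more
    return journal_writes + checkpoint_writes + direct
```

For the running example (one data block, two metadata blocks) this gives 8 writes in data mode and 7 in ordered or writeback mode. The gap widens with the amount of data: for 100 data blocks it is 206 versus 106. Ordered and writeback modes have the same count; they differ only in when the data write is allowed to happen relative to the journal write.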


Conclusions

Journaling

• Many modern file systems use journaling to reduce recovery time during startup (e.g., Linux ext3, ReiserFS, SGI XFS, IBM JFS, NTFS)

• Simple idea: Use write-ahead log to record some

info about what you are going to do before doing it

• Turns multi-write update sequence into a single

atomic update (“all or nothing”)

• Some performance overhead: Extra writes to journal

– Worth the cost?








