UNIVERSITY of WISCONSIN-MADISON
Computer Sciences Department
CS 537: Introduction to Operating Systems
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Haryadi S. Gunawi
Journaling File Systems
Questions answered in this lecture:
VFS and FS operations
Why is it hard to maintain on-disk consistency?
How does the FSCK tool help with consistency?
What information is written to a journal?
What 3 journaling modes does Linux ext3 support?
Virtual File System (VFS)
Operations:
• File/Dir: open, close, chdir, link, unlink (delete), truncate, rename
• Data: read, write, lseek
• Access and info: stat, chmod, chown
Ext2/3 (or any other file system)
• Knows its on-disk format
• Has its own block allocation policies
VFS layer:
• Structure-independent code
• Manage buffer cache, directory cache, generic inode descriptor, file descriptor
• Defines a set of functions every file system has to implement
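To make the split between the VFS layer and a concrete file system tangible, here is a minimal sketch. All names (`FileSystem`, `ToyFS`, `VFS`, `mount`) are hypothetical and chosen for illustration; this is not the Linux VFS API, only the idea of structure-independent code dispatching to functions that every file system must implement.

```python
from __future__ import annotations
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Hypothetical set of functions every file system has to implement."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class ToyFS(FileSystem):
    """Stand-in for ext2/3, XFS, ...; a real one knows its on-disk format."""

    def __init__(self) -> None:
        self.files: dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        return self.files[path]

    def write(self, path: str, data: bytes) -> None:
        self.files[path] = data


class VFS:
    """Structure-independent layer: routes each call to the mounted file system."""

    def __init__(self) -> None:
        self.mounts: dict[str, FileSystem] = {}

    def mount(self, prefix: str, fs: FileSystem) -> None:
        self.mounts[prefix] = fs

    def _resolve(self, path: str) -> tuple[FileSystem, str]:
        # The longest matching mount prefix wins.
        prefix = max((p for p in self.mounts if path.startswith(p)), key=len)
        return self.mounts[prefix], path[len(prefix):]

    def read(self, path: str) -> bytes:
        fs, rel = self._resolve(path)
        return fs.read(rel)

    def write(self, path: str, data: bytes) -> None:
        fs, rel = self._resolve(path)
        fs.write(rel, data)
```

The application only ever talks to `VFS`; swapping `ToyFS` for another implementation does not change the caller.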
[Figure: Application sits on top of the VFS layer, which sits on top of the individual file systems: Linux ext2/3, SGI XFS, ReiserFS, Sun ZFS, IBM JFS]
Multiple updates / ops
Write
• Write to the next byte (to a data block)
• Update block bitmap
• Update meta-data
Delete (e.g. rm /dir/file)
• Release data blocks of file: update block bitmap (to free space)
• Update the inode for “file”
• Update inode bitmap
• Update “dir” data block (remove directory entry)
And many more …
What happens if a crash occurs in the middle?
Review: The I/O Path (Reads)
Read() from file
• Check if block is in cache
– (file cache is sometimes called “buffer cache”)
• If so, return block to user [1 in figure]
• If not, read from disk, insert into cache, return to user [2]
[Figure: main memory (cache) above disk. Path 1: block found in cache. Path 2: block not in cache, read from disk, copy left in cache.]
Review: The I/O Path (Writes)
Write() to file
• Write is buffered in memory (“write behind”) [1]
• Sometime later, OS decides to write to disk [2]
Why delay writes?
• Implications for performance
• Implications for reliability
[Figure: main memory (cache) above disk. Step 1: buffer in memory. Step 2: later, write to disk.]
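The read and write paths reviewed above can be sketched as a tiny write-behind cache. `Disk`, `BufferCache`, and `flush` are hypothetical names for illustration; a real buffer cache also handles eviction and concurrency, which this sketch leaves out.

```python
class Disk:
    """Toy disk that counts how many block reads and writes it serves."""

    def __init__(self):
        self.blocks = {}
        self.reads = 0
        self.writes = 0

    def read(self, n):
        self.reads += 1
        return self.blocks.get(n, b"")

    def write(self, n, data):
        self.writes += 1
        self.blocks[n] = data


class BufferCache:
    def __init__(self, disk):
        self.disk = disk
        self.cache = {}
        self.dirty = set()

    def read(self, n):
        if n not in self.cache:                # path [2]: miss, go to disk
            self.cache[n] = self.disk.read(n)  # leave a copy in the cache
        return self.cache[n]                   # path [1]: served from cache

    def write(self, n, data):
        self.cache[n] = data                   # step [1]: buffer in memory only
        self.dirty.add(n)

    def flush(self):
        for n in sorted(self.dirty):           # step [2]: later, write to disk
            self.disk.write(n, self.cache[n])
        self.dirty.clear()
```

Three writes to the same block followed by one `flush` cost a single disk write (the performance implication), but until `flush` runs the disk does not hold the new contents at all (the reliability implication).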
Many “dirty” blocks in memory:
What order to write to disk?
Example: Appending a new block to existing file
• Initially: a data bitmap B, the file’s inode I, and an unused data
block D
• After append:
– Write data bitmap B’ (for new data block),
– write inode I’ of file (to add new pointer, update time),
– write new data block D’
[Figure: B’, I’, D’ are in memory; B, I, D are on disk. In what order should the new blocks be written?]
The Problems
Writes: Have to update disk with N writes
• Writes are buffered in the first place, and are then issued together
later
Disk scheduler
• But the disk performs only a single write atomically
• The disk scheduler (e.g. C-LOOK) reorders the write sequence
Crashes: System may crash at arbitrary point
• Bad case: In the middle of an update sequence
Desire: To update on-disk structures atomically
• Either all should happen or none
Example: Bitmap first
Write Ordering: Bitmap (B’), Inode (I’), Data (D’)
• But CRASH after B’ has reached disk, before I’ or D’
Result?
• Inode is still the old inode (I), it doesn’t point to D’
• Data is still the old data (D)
• D can never be used again (space leak): the bitmap says D is in use, but
no inode points to D
Example: Inode first
Write Ordering: Inode (I’), Bitmap (B’), Data (D’)
• But CRASH after I’ has reached disk, before B’ or D’
Result?
• I’ points to D which contains garbage (not D’)
• B is the old bitmap which says block D is unused (although there is
an inode that already points to the data block)
– Another inode (I2) requests a block, and the FS gives D to I2
– D is now pointed to by both I’ and I2 (security leak!)
Example: Inode first
Write Ordering: Inode (I’), Bitmap (B’), Data (D’)
• CRASH after I’ AND B’ have reached disk, before D’
Result?
• Better than previous example (no security leak)
• But D still contains garbage, so I’ points to garbage data
Example: Data first
Write Ordering: Data (D’) , Bitmap (B’), Inode (I’)
• CRASH after D’ has reached disk, before I’ or B’
Result?
• No bad thing happens, everything is “consistent”
– Bitmap says the block that holds D’ is free
– No inode points to that block D’
• Inode is still the old inode (I), which does not point to the new data block
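The four crash examples above can be checked mechanically. The sketch below is a simplified model, not a real file system: it assumes each of the three writes either fully reaches disk or does not, and the function names and problem labels are made up for illustration.

```python
from itertools import permutations


def crash_outcome(order, writes_done):
    """Problems visible if only the first `writes_done` writes reached disk.

    `order` is a sequence over "B" (bitmap B'), "I" (inode I'), "D" (data D').
    """
    on_disk = set(order[:writes_done])
    bitmap_new = "B" in on_disk   # bitmap marks the new block as used
    inode_new = "I" in on_disk    # inode points to the new block
    data_new = "D" in on_disk     # block holds D' rather than garbage
    problems = []
    if bitmap_new and not inode_new:
        problems.append("space leak")
    if inode_new and not bitmap_new:
        problems.append("block may be reallocated")
    if inode_new and not data_new:
        problems.append("inode points to garbage")
    return problems


def safe_orders():
    """Orderings that show no problem at any crash point (0 to 3 writes done)."""
    return ["".join(order) for order in permutations("BID")
            if all(not crash_outcome(order, k) for k in range(4))]
```

Under this model, `crash_outcome("DBI", 1)` is empty (data first, as in the last example), yet `safe_orders()` comes back empty as well: every ordering has some crash point that leaves the disk inconsistent, which is why ordering alone does not give atomicity.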
Traditional Solution: FSCK
FSCK: “file system checker”
When system boots:
• Make multiple passes over file system, looking for inconsistencies
– e.g., inode pointers and bitmaps, directory entries, inode reference counts
– Ex1: bitmap says D is used, but no inode points to D: the bitmap is corrected
(D is marked unused)
– Ex2: two inodes point to the same data block: a clone of the data block is
created, and one of the inodes is pointed at the clone (hence, no more
sharing)
• Either fix automatically or punt to admin
• Does fsck have to run upon every reboot?
– Yes, if the FS cannot tell whether a crash happened in the middle of an operation
– No, if the FS records whether an operation was still in progress
• E.g. keep a dirty bit in the superblock: set it to 1 before starting the operations, and
clear it once the operations have finished
• If the dirty bit in the superblock is 1 upon reboot, fsck must run
• Problem: adds runtime overhead (must write to the superblock for each update sequence)
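One fsck-style pass over bitmaps and inode pointers might look like the sketch below. It is a toy model under stated assumptions: the bitmap is a set of block numbers marked used, each inode is a list of block numbers, and the function name and result keys are hypothetical. It only detects the inconsistencies; the repairs (Ex1, Ex2) are left out.

```python
from collections import Counter


def check_blocks(bitmap, inodes):
    """Cross-check the block bitmap against the pointers held by all inodes.

    bitmap: set of block numbers marked "used"
    inodes: dict mapping inode number -> list of block numbers it points to
    """
    refs = Counter(b for blocks in inodes.values() for b in blocks)
    referenced = set(refs)
    return {
        # Ex1: marked used, but no inode points to the block
        "leaked": sorted(bitmap - referenced),
        # An inode points to a block that the bitmap says is free
        "unmarked": sorted(referenced - bitmap),
        # Ex2: more than one pointer to the same block
        "shared": sorted(b for b, n in refs.items() if n > 1),
    }
```

Note that the check has to visit every inode to build `refs`, which is the source of the performance problem discussed next.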
Main problem with fsck: Performance
• Sometimes takes hours to run on large disk volumes
• Inconsistency can only be detected by checking the whole content of the file
system, so fsck must scan the whole file system (more precisely: all
metadata in the file system)
How To Avoid The Long Scan?
Idea:
• Do not perform in-place update
• Before updating the on-disk data structures, write a description of the
update to another area on the disk
– Called the “write ahead log” or “journal”
• Once all updates have been successfully written to the journal, they
can be written to their final place (this is called the
checkpointing process)
When crash occurs, look through log and see
what was going on
• Use contents of log to fix file system structures
• The process is called “recovery”
Case Study: Linux ext3
Journal location
• EITHER on a separate device partition
• OR just a “special” file within ext2
Three separate modes of operation:
• Data: All data and metadata is journaled
• Ordered, Writeback: Just metadata is journaled
First focus: Data journaling mode
Transactions in ext3 Data Journaling Mode
Same example: Update Inode (I), Bitmap (B), Data (D)
First, write to journal:
Each write is formed into a transaction
A transaction comprises:
• Journal descriptor block (Dr)
– Implies the beginning of a transaction (Tx begin)
– Contains the actual locations of the blocks saved in the journal data blocks
– Contains the transaction number
• Journal data blocks
– All blocks that must be updated atomically to the disk, e.g. in this example: I’,
B’, and D’
• Journal commit block (C)
– Implies the end of transaction (Tx end)
– Also contains the transaction number
[Figure: the journal holds Dr, B’, I’, D’, C; the fixed locations at blk # 1000, 2000, 3000 still hold B, I, D. Contents of Dr: Tx#: 2; 3 blocks; [1] = 1000, [2] = 2000, [3] = 3000]
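The transaction layout described above can be written down as plain data types. This is only an illustrative sketch: the class and field names are hypothetical and do not reflect the actual ext3 on-disk journal format. The sample values mirror the figure (tx #2, final locations 1000, 2000, 3000).

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional


@dataclass
class Descriptor:
    """Journal descriptor block (Dr): marks Tx begin."""
    tx: int
    final_locations: list[int]  # where each journaled block finally belongs


@dataclass
class Commit:
    """Journal commit block (C): marks Tx end."""
    tx: int


@dataclass
class Transaction:
    descriptor: Descriptor
    data_blocks: list[str]          # journal data blocks, in descriptor order
    commit: Optional[Commit] = None  # None models a crash before Tx end

    def is_committed(self) -> bool:
        return (self.commit is not None
                and self.commit.tx == self.descriptor.tx
                and len(self.data_blocks) == len(self.descriptor.final_locations))


# The example transaction from the figure.
tx2 = Transaction(Descriptor(2, [1000, 2000, 3000]), ["B'", "I'", "D'"], Commit(2))
```

Storing the transaction number in both `Descriptor` and `Commit` lets a reader of the journal match a commit block to the descriptor it closes.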
Write to the journal (sequence)
I want to write B’, I’, and D’
Please give me a transaction # (e.g. got tx #2)
Write tx#2 to the journal superblock (so that we
know later tx#2 is pending)
Prepare journal descriptor block
• Set tx# = 2, set the final locations of the journal data
blocks, set the #blks
Write journal descriptor block and journal data
blocks
Write the journal commit block
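The sequence above can be sketched in a few lines. The `Journal` class, its in-memory `log` list, and the record tuples are assumptions made for illustration; a real journal writes blocks to disk and waits between steps. The transaction counter starts at 2 only to mirror the example.

```python
class Journal:
    def __init__(self):
        # Journal superblock: remembers which transaction is pending.
        self.superblock = {"pending_tx": None, "next_tx": 2}
        self.log = []  # journal area, modeled as an append-only list

    def begin(self):
        """Hand out a transaction number and record it as pending."""
        tx = self.superblock["next_tx"]
        self.superblock["next_tx"] += 1
        self.superblock["pending_tx"] = tx
        return tx

    def write_transaction(self, updates):
        """updates: dict mapping final block number -> new block contents."""
        tx = self.begin()
        locations = list(updates)
        # Descriptor block: tx#, final locations (their count gives #blks).
        self.log.append(("descriptor", tx, locations))
        # Journal data blocks, in the same order as the descriptor lists them.
        for loc in locations:
            self.log.append(("data", tx, updates[loc]))
        # Commit block goes last.
        self.log.append(("commit", tx))
        return tx
```

For example, `Journal().write_transaction({1000: "B'", 2000: "I'", 3000: "D'"})` appends one descriptor record, three data records, and one commit record.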
Transactions in ext3 Data Journaling Mode
Second, “checkpoint” data to fixed ext3 structures
• Copy B’, I’, and D’ to their fixed file system locations
[Figure: the journal holds Dr, B’, I’, D’, C; the fixed locations at blk # 1000, 2000, 3000 now hold B’, I’, D’]
Finally, free Tx in journal
• Journal is fixed-sized circular buffer, entries
must be periodically freed
Upon reboot
Check the journal superblock: “is there any pending
transaction?”
If yes (e.g. tx#2), scan the journal area to find a journal
descriptor for tx#2
After finding the journal descriptor block, ask “Is there a
commit block?”
If not, release the transaction
If yes, need to checkpoint the transaction
If checkpoint is successful, clear the transaction by updating
the journal superblock (so that we can know there is no
pending transaction)
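The reboot procedure above can be sketched as follows. The record format (tuples tagged "descriptor", "data", "commit") and the dictionaries standing in for the journal superblock and the disk are assumptions for illustration, not the ext3 format.

```python
def recover(superblock, log, disk):
    """Replay or discard the pending transaction, then clear the pending mark.

    superblock: dict with key "pending_tx" (transaction number or None)
    log: list of ("descriptor", tx, locations), ("data", tx, contents),
         and ("commit", tx) records
    disk: dict mapping final block number -> contents
    """
    tx = superblock["pending_tx"]
    if tx is None:
        return "clean"                      # no pending transaction
    records = [r for r in log if r[1] == tx]
    descriptor = next((r for r in records if r[0] == "descriptor"), None)
    committed = any(r[0] == "commit" for r in records)
    if descriptor is None or not committed:
        outcome = "discarded"               # no commit block: release the Tx
    else:
        data = [r[2] for r in records if r[0] == "data"]
        for location, contents in zip(descriptor[2], data):
            disk[location] = contents       # checkpoint to the final location
        outcome = "replayed"
    superblock["pending_tx"] = None         # nothing pending any more
    return outcome
```

Replaying is safe to repeat: if a crash interrupts recovery itself, running `recover` again writes the same contents to the same locations.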
What if there’s a Crash?
Recovery: Go through log and “redo” operations
that have been successfully committed to log
What if …
• Tx begin but not Tx end in log?
– Discard the transaction
• Tx begin through Tx end are in log,
but I’, B’, and D’ have not yet been checkpointed?
– Keep that transaction on the disk until the journal data blocks have been
checkpointed successfully
• What if Tx is in log, I’, B’, D’ have been checkpointed,
but Tx has not been freed from log?
– In terms of correctness, there is no problem
– But the journal size is usually fixed (e.g. X MB), so eventually the log will be full
and some transactions must be freed
– Again: Tx can only be freed if all its journal data blocks have been
checkpointed successfully!!
Performance? (As compared to fsck?)
• Much faster
• Only read the transactions that haven’t been checkpointed
• Journal size is fixed, so there is no need to scan the entire file system;
only the journal area (X MB) is scanned
Complication: Disk Scheduling
Problem:
• Low levels of the I/O subsystem in the OS,
and even the disk/RAID itself, may reorder requests
How does this affect Tx management?
• Do we write the journal blocks (e.g. Dr, B’, I’, D’, C) in parallel?
– No! Because of this reordering, when all these journal blocks are sent to
the disk together, Dr and C could be written before the journal
data blocks!
• Where is it OK to issue writes in parallel?
– Tx begin
– I, B, D
– Tx end
– Checkpoint: I, B, D copied to final destinations
– Tx freed in journal
• Synchronization points:
– Write Tx begin, B’, I’, D’ in parallel (then wait until they finish)
– Write Tx end (wait)
– Checkpoint B’, I’, D’ in parallel (wait)
– Tx freed in journal
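The synchronization points above can be sketched with a thread pool: writes within one phase are issued in parallel, and `wait` acts as the barrier between phases. The function name and block labels are hypothetical, and `write` stands for any callable that performs one block write; on real hardware the barrier would also have to make sure the disk itself has persisted the writes.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def run_transaction(write):
    """Issue one transaction's writes, waiting at each synchronization point."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Phase 1: Tx begin and journal data blocks, in parallel, then wait.
        wait([pool.submit(write, b)
              for b in ["Dr", "journal B'", "journal I'", "journal D'"]])
        # Phase 2: Tx end (commit block), then wait.
        wait([pool.submit(write, "C")])
        # Phase 3: checkpoint to the final locations, in parallel, then wait.
        wait([pool.submit(write, b) for b in ["B'", "I'", "D'"]])
    # Phase 4: only now may the transaction be freed in the journal.
    write("free Tx")
```

Within a phase the completion order is arbitrary, but the commit block can never complete before the journal data blocks, and the transaction is never freed before the checkpoint has finished.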
Problem with Data Journaling
Data journaling: Lots of extra writes
• All data committed to disk twice
(once in journal, once to final location)
Overkill if only goal is to keep metadata consistent
Instead, use ext3 writeback mode
• Just journals metadata
• Writes data to final location directly, at any time
• Problem: if B’ and I’ are written to the journal and then a crash occurs
before D’ has been written to the disk, I’ points to a valid block
whose content is garbage
Solution: Ordered mode
• Write all data blocks to their final location (e.g. write D’ to its final
location), then wait until the writes finish
• Write metadata to the journal
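The difference between writeback and ordered mode can be shown with a small crash model. This is a sketch under assumptions: the step names are made up, and for writeback mode it picks the worst case in which the data write happens to land last (the mode allows it at any time).

```python
def steps(mode):
    """Order in which the writes of one append reach disk, in this toy model."""
    journal_metadata = ["journal B', I'", "commit"]
    if mode == "ordered":
        # Data goes to its final location first, then metadata is journaled.
        return ["write D' in place"] + journal_metadata
    if mode == "writeback":
        # Assumed worst case: data reaches disk after the metadata commit.
        return journal_metadata + ["write D' in place"]
    raise ValueError(mode)


def garbage_visible(mode, steps_done):
    """True if a crash after `steps_done` steps leaves I' pointing to garbage."""
    done = steps(mode)[:steps_done]
    committed = "commit" in done            # recovery will replay B' and I'
    data_on_disk = "write D' in place" in done
    return committed and not data_on_disk
```

In this model `garbage_visible("writeback", 2)` is true, while no crash point makes `garbage_visible("ordered", k)` true, because the metadata can only commit after D’ is already in place.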
Conclusions
Journaling
• Many modern file systems use journaling to
reduce recovery time during startup
(e.g., Linux ext3, ReiserFS, SGI XFS, IBM JFS, NTFS)
• Simple idea: Use write-ahead log to record some
info about what you are going to do before doing it
• Turns multi-write update sequence into a single
atomic update (“all or nothing”)
• Some performance overhead: Extra writes to journal
– Worth the cost?