Session 9257 Linux Filesystems

Document Sample
scope of work template
							Session 9257

Linux Filesystems
Jens Osterkamp
Linux Architecture & Performance
IBM Lab Boeblingen


SHARE, February 22-27, 2004 | Longbeach, CA
  Trademarks
The following are trademarks of the International Business Machines Corporation in the United States and/or other countries.
Enterprise Storage Server
ESCON*
FICON
FICON Express
HiperSockets
IBM*
IBM logo*
IBM eServer
Netfinity*
S/390*
VM/ESA*
WebSphere*
z/VM
zSeries
* Registered trademarks of IBM Corporation
The following are trademarks or registered trademarks of other companies.
Intel is a trademark of the Intel Corporation in the United States and other countries.
Java and all Java-related trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc., in the
United States and other countries.
Lotus, Notes, and Domino are trademarks or registered trademarks of Lotus Development Corporation.
Linux is a registered trademark of Linus Torvalds.
Microsoft, Windows and Windows NT are registered trademarks of Microsoft Corporation.
Penguin (Tux) compliments of Larry Ewing.
SET and Secure Electronic Transaction are trademarks owned by SET Secure Electronic Transaction LLC.
UNIX is a registered trademark of The Open Group in the United States and other countries.
* All other products may be trademarks or registered trademarks of their respective companies.
                          Agenda

Journaling file systems
Measurement setup
Measurement results
   LPAR – VM
   31 / 64 bit
   single disk and LVM
   DASD statistics
   CPU load and CP overhead
   journaling options
Outlook
    Problems of non-journaling file systems

  data and meta-data is written directly and in arbitrary order
  no algorithm to ensure data integrity
  after crash, complete structure of file system has to be
  checked to ensure integrity
  file system check times depend on size of file system


⇨risk of data loss
⇨long and costly system outages
               advantages of journaling

  data integrity ensured
  in case of system crash only journal has to be replayed to
  recover consistent file system structure
  file system check time depends on size of journal


⇨much higher data integrity
⇨much shorter system outages


  but there is a cost...
    Journaling file systems in SuSE SLES8

  ext3 v0.9.18
  jfs 1.0.24
  reiserfs 3.6.2


For reference :
  ext2 v0.5 (non-journaling)
                           ext3

developed by Andrew Morton and others
based on ext2
extended by journaling features
supports full data journaling
resizing (only with unmount) possible
http://www.zipworld.com.au/~akpm/linux/ext3/
                             jfs

developed by IBM Austin Lab
ported from OS/2 Warp Server
only metadata journaling
max. file system size 4 PB
http://www.ibm.com/developerworks/oss/jfs/index.html
                        reiserfs

developed by a group around Hans Reiser
SUSE's default choice
only metadata journaling
disk space optimization algorithm
online enlargement of file system
http://www.namesys.com/
                     Measurement setup

Hardware                           Software
  2064-216 (z900)                    SUSE SLES8
     1.09ns (917MHz)                 Dbench
     2 * 16 MB L2 Cache (shared)
     64 GB
     6 FICON channels
  2105-F20 (Shark)
     384 MB NVS
     16 GB Cache
     128 * 36 GB disks
     10.000 RPM
     FICON (1 Gbps)
                Measurement setup

dbench
128MB main memory
1, 2 and 4 CPUs
LPAR and z/VM 4.3
31-bit and 64-bit
Single 3390 model 3 disk
6 pack of 3390-3 using striped LVM. Attached via 6 FICON
channels
Running 8 and 16 processes
                     Dbench File I/O

Emulation of Netbench benchmark, rates windows
fileservers
Large set of mixed file operations workload for each
process: create, write, read, append, delete
   Scaling for Linux with 1, 2, 4 PUs
   Scaling for 8 and 16 clients (processes) simultaneously
forced to do I/O while memory is filling up with data
Measurement results
                                   LPAR and VM
                         single disk, LPAR and VM, 31bit, 4 CPUs
                    70
                    65
                    60
                    55
Throughput [MB/s]




                    50
                    45                                            8 proc., LPAR
                    40                                            16 proc., LPAR
                    35                                            8 proc., VM
                                                                  16 proc., VM
                    30
                    25
                    20
                    15
                    10
                     5
                     0
                           Ext2    Ext3          Jfs   Reiserfs
                                    filesystem
                                31-bit and 64-bit
                    single disk, VM, 4 CPUs
                    65
                    60
                    55
                    50
Throughput [MB/s]




                    45
                                                                8 proc., 31bit
                    40
                                                                16 proc., 31bit
                    35                                          8 proc., 64bit
                    30                                          16 proc., 64bit
                    25
                    20
                    15
                    10
                     5
                     0
                         Ext2    Ext3          Jfs   Reiserfs
                                  filesystem
  /proc/dasd/statistics – Example

root@g73vm1:~# cat /proc/dasd/statistics
56881 dasd I/O requests
with 5270816 sectors(512B each)
   __<4    ___8    __16    __32    __64     _128   _256    _512   __1k   __2k     __4k    __8k    _16k    _32k   _64k   128k
   _256    _512    __1M    __2M    __4M     __8M   _16M    _32M   _64M   128M     256M    512M    __1G    __2G   __4G   _>4G
Histogram of sizes (512B secs)
      0       0    1039    4799    8102    36557   4475     292    195   1422        0       0       0       0      0      0
      0       0       0        0       0       0      0       0      0      0        0       0       0       0      0      0
Histogram of I/O times (microseconds)
      0       0       0        0       0       0      0       0      2       8     109    3244   25570   17480   7666   1248
   1390     153      11        0       0       0      0       0      0       0       0       0       0       0      0      0
Histogram of I/O times per sector
      0       0       0        0    176     4141  24084   15639   9506   2513      601     173      41       7      0      0
      0       0       0        0       0       0      0       0      0      0        0       0       0       0      0      0
Histogram of I/O time till ssch
      5       1       2        0       0       0      0       0      2       4     301   11527   25339   12278   5156   1759
    383     118       6        0       0       0      0       0      0       0       0       0       0       0      0      0
Histogram of I/O time between ssch and irq
      0       0       0        0       0       0      0       0   2584   23896   18720    5307    2325    2725   1217     62
     23      21       1        0       0       0      0       0      0       0       0       0       0       0      0      0
Histogram of I/O time between ssch and irq per sector
      0       0       0   21722   26243     3939   2184    1798    774    159       47      12       3       0      0      0
      0       0       0        0       0       0      0       0      0      0        0       0       0       0      0      0
Histogram of I/O time between irq and end
      7       0   43393   11341     457      179   1494       3      3       1       2       1       0       0      0      0
      0       0       0        0       0       0      0       0      0       0       0       0       0       0      0      0
# of req in chanq at enqueuing (1..32)
      8       3       4        5  56861        0      0       0      0       0       0       0       0       0      0      0
      0       0       0        0       0       0      0       0      0       0       0       0       0       0      0      0
                      ext2, 8 Processes
             Histogram of I/O times (microseconds)
1400
1300
1200
1100
1000
 900
 800
 700
 600
 500
 400
 300
 200
 100
  0
       __< ___ __1 __3 __6 _12 _25 _51 __1 __2 __4 __8 _16 _32 _64 128 _25 _51 __1 __2 __4 __8 _16 _32 _64 128 256 512 __1 __2 __4 _>4
       4   8   6   2   4   8   6   2   k   k   k   k   k   k   k   k   6   2   M M M M M M M M M M G                       G   G G
                    ext2, 16 Proceses
           Histogram of I/O times (microseconds)
7500
7000
6500
6000
5500
5000
4500
4000
3500
3000
2500
2000
1500
1000
500
  0
       __< ___ __1 __3 __6 _12 _25 _51 __1 __2 __4 __8 _16 _32 _64 128 _25 _51 __1 __2 __4 __8 _16 _32 _64 128 256 512 __1 __2 __4
       4   8   6   2   4   8   6   2   k   k   k   k   k   k   k   k   6   2   M M M M M M M M M M G                       G   G
                       ext3, 8 Processes
              Histogram of I/O times (microseconds)
3000
2750
2500
2250
2000
1750
1500
1250
1000
 750
 500
 250
   0
       __< ___ __1 __3 __6 _12 _25 _51 __1 __2 __4 __8 _16 _32 _64 128 _25 _51 __1 __2 __4 __8 _16 _32 _64 128 256 512 __1 __2 __4
       4   8   6   2   4   8   6   2   k   k   k   k   k   k   k   k   6   2   M M M M M M M M M M G                       G   G
                     ext3, 16 Processes
            Histogram of I/O times (microseconds)
15000
14000
13000
12000
11000
10000
 9000
 8000
 7000
 6000
5000
4000
3000
2000
1000
    0
        __< ___ __1 __3 __6 _12 _25 _51 __1 __2 __4 __8 _16 _32 _64 128 _25 _51 __1 __2 __4 __8 _16 _32 _64 128 256 512 __1 __2 __4
        4   8   6   2   4   8   6   2   k   k   k   k   k   k   k   k   6   2   M   M   M   M   M   M   M   M   M   M   G   G   G
                   ext3, 16 Processes
        Histogram of I/O time before SSCH (IOSQ)
13000
12000
11000
10000
9000
8000
7000
6000
5000
4000
3000
2000
1000
   0
        __< ___ __1 __3 __6 _12 _25 _51 __1 __2 __4 __8 _16 _32 _64 128 _25 _51 __1 __2 __4 __8 _16 _32 _64 128 256 512 __1 __2 __4
        4   8   6   2   4   8   6   2   k   k   k   k   k   k   k   k   6   2   M   M   M   M   M   M   M   M   M   M   G   G   G
                      Ext3, 16 Processes
         Histogram of I/O time between SSCH and IRQ
18000

16000

14000

12000

10000

8000

6000

4000

2000

    0
        __ __ __ __ __ _1 _2 _5 __ __ __ __ _1 _3 _6 12 _2 _5 __ __ __ __ _1 _3 _6 12 25 51 __ __ __
        <4 _8 16 32 64 28 56 12 1k 2k 4k 8k 6k 2k 4k 8k 56 12 1M 2M 4M 8M 6M 2M 4M 8M 6M 2M 1G 2G 4G
                 Ext3, 16 Processes
number of requests in subchannel-queue at enqueuing

60000

55000

50000

45000

40000

35000

30000

25000

20000

15000

10000

 5000

    0
        1         2        3         4         5
         Logical Volume Manager (LVM)

Linux software raid with raid levels 0,1, 4 and 5
excellent performance
excellent flexibility (resizing, adding/removing disks)
available in SLES7, SLES8, and RedHat RHEL 3
on zSeries, support multipath and PAV (under z/VM)
http://www.sistina.com/products_lvm.htm
                LVM system structure
(journaled) file system             Raw Logical Volume


    logical volume                       logical volume

                                                          Logical
                         volume group                     Volume
                                                          Manager

   physical   physical                         physical
   volume     volume                           volume




   block device driver                  RAID adapter

 physical       physical                   RAID
 disk           disk                       array
       Improving disk performance with LVM


      striped datastream




             physical       physical       physical
             volume         volume         volume




With LVM and striping parallelism is achieved
                                   LVM results
                single disk - LVM comparison, z/VM, 31bit, 2 CPUs
                    130
                    120
                    110
                    100
Throughput [MB/s]




                     90
                     80                                         8 proc., LVM
                     70                                         16 proc., LVM
                     60                                         8 proc., single disk
                                                                16 proc., single disk
                     50
                     40
                     30
                     20
                     10
                      0
                          Ext2   Ext3          Jfs   Reiserfs
                                  filesystem
                                                                                                filesystem options
                                        single disk, ext3 optimizations, z/VM, 31bit, 4 CPUs                                single disk, reiserfs optimizations, z/VM, 31bit, 4 CPUs
                       45                                                                                                                         50

                       40                                                                                                                         45

                                                                                                                                                  40
                       35




                                                                                                                              Throughput [MB/s]
Throughput [MB/s]




                                                                                                                                                  35
                       30
                                                                                                                                                  30                                                    8 processes
                       25                                                                                                                                                                               16 processes
                                                                                                                                                  25
                       20                                                                                                                         20
                       15                                                                                                                         15

                       10                                                                                                                         10

                                                                                                                                                  5
                               5
                                                                                                                                                  0
                               0                                                                                                                       default   no hash      no border    no
                                          ordered      writeback   full data   external    100M         200M      400M                                           relocation   allocation   unhashed
                                                                               journal     journal      journal   journal                                                                  relocation

                                        single disk, jfs optimizations, z/VM, 31bit, 4 CPUs
                                         45

                                         40

                                         35
                    Throughput [MB/s]




                                         30

                                         25

                                         20

                                         15

                                         10

                                          5

                                          0
                                                    default        external      100M journal        200 MB
                                                                   journal                           journal
                                             CPU load
                       LPAR, 1 CPU, 8 processes, single disk
               100
                90
                80
CPU load [%]




                70
                60                                                                                        idle
                50                                                                                        system
                                                                                                          user
                40
                30
                20
                10
                 0
                     ext2,   ext2,   ext3,   ext3,      reiserfs,   reiserfs,   jfs, 31bit   jfs, 64bit
                     31bit   64bit   31bit   64bit      31bit       64bit
                                                     filesystem
                                            VM overhead
                                   LVM CPU consumption
CPU consumption




                                                                                                   CP
                                                                                                   Gues
                                                                                                   t




                  ext2,   ext2,     ext3,   ext3,    reiserfs,   reiserfs,   jfs, 31bit   jfs,
                  31bit   31bit,    31bit   31bit,   31bit       31bit,                   31bit,
                          LVM               LVM                  LVM                      LVM
recovery times
                                  Outlook on kernel 2.6

                    journaling filesystem in kernel version 2.4 vs 2.6
                    60
                    55
                    50
                    45
                    40                                                                  ext3
Throughput [MB/s]




                    35                                                                  jf
                                                                                        s
                                                                                        reiserf
                    30                                                                  s
                    25
                    20
                    15
                    10
                     5
                     0
                            Kernel 2.4.19                          Kernel 2.6.0-test5
                                       preliminary results
                                       jfs was not compiled into 2.6 kernel
                       Summary

journaling file systems increase data integrity significantly
journaling file systems dramatically reduce system outage
times
performance cost is at least 30%
reiserfs is slightly faster than ext3, but needs much more
CPU
journaling file systems profit from LVM
jfs has fastest recovery times
2.6 will bring more improvements (increased throughput,
reduced CPU load, iostat for ECKD)
Questions ?

						
Related docs