					  Flash Talk

Flash Memory Database Systems
      and In-Page Logging

                         Bongki Moon
                    Department of Computer Science
                         University of Arizona
                      Tucson, AZ 85721, U.S.A.
                        bkmoon@cs.arizona.edu

      In collaboration with Sang-Won Lee (SKKU), Chanik Park (Samsung)



COMPUTER SCIENCE DEPARTMENT                        KOCSEA’09, Las Vegas, December 2009 -1-
  Magnetic Disk vs Flash SSD

[Photos: Seagate ST340016A (40 GB, 7200 rpm), the champion for 50 years,
 versus the new challengers: Intel X25-M flash SSD (80 GB, 2.5 inch) and
 Samsung flash SSD (128 GB, 2.5/1.8 inch).]

 Past Trend of Disk
• From 1983 to 2003 [Patterson, CACM 47(10) 2004]
   – Capacity increased about 2500 times (0.03 GB → 73.4 GB)
   – Bandwidth improved 143.3 times (0.6 MB/s → 86 MB/s)
   – Latency improved 8.5 times (48.3 ms → 5.7 ms)

      Year               1983       1990      1994      1998      2003
      Product            CDC        Seagate   Seagate   Seagate   Seagate
                         94145-36   ST41600   ST15150   ST39102   ST373453
      Capacity           0.03 GB    1.4 GB    4.3 GB    9.1 GB    73.4 GB
      RPM                3600       5400      7200      10000     15000
      Bandwidth (MB/s)   0.6        4         9         24        86
      Media diameter     5.25"      5.25"     3.5"      3.0"      2.5"
      Latency (ms)       48.3       17.1      12.7      8.8       5.7
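The three improvement factors in the table imply very different annual growth rates; a quick back-of-the-envelope check in plain Python (numbers taken from the table above):

```python
# Annualized growth factors implied by the 1983-2003 disk figures above.
def cagr(start, end, years):
    """Compound annual growth factor from `start` to `end` over `years` years."""
    return (end / start) ** (1 / years)

years = 2003 - 1983  # 20 years
print(f"Capacity:  {cagr(0.03, 73.4, years):.2f}x per year")       # ~1.48x (~48%/yr)
print(f"Bandwidth: {cagr(0.6, 86.0, years):.2f}x per year")        # ~1.28x (~28%/yr)
print(f"Latency:   {cagr(1, 48.3 / 5.7, years):.2f}x per year")    # ~1.11x (~11%/yr)
```

Latency improves far more slowly than capacity, which is exactly why small random I/O became the bottleneck described on the next slide.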



  I/O Crisis in OLTP Systems
• I/O becomes the bottleneck in OLTP systems
   – They process a large number of small random I/O operations
• Common practice to close the gap
   – Use a large disk farm to exploit I/O parallelism
      • Tens or hundreds of disk drives per processor core
      • (E.g.) IBM Power 596 Server: 172 15k-RPM disks per processor core
   – Adopt short-stroking to reduce disk latency
      • Use only the outer tracks of the disk platters
   – Other concerns are raised too
      • Wasted capacity of disk drives
      • Increased energy consumption

• Then, what happens 18 months later?
   – To keep up with Moore's law and keep CPU and I/O balanced (Amdahl's
     law), must the number of spindles be doubled yet again?
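The disk-farm sizing above is simple division; a minimal sketch, assuming roughly 450 random IOPS per 15k-RPM drive (the target IOPS figure here is purely illustrative):

```python
import math

def disks_needed(target_iops, iops_per_disk=450):
    """Spindles required to sustain a random-I/O target.
    450 IOPS per 15k-RPM drive is an assumed, illustrative figure."""
    return math.ceil(target_iops / iops_per_disk)

# If CPU throughput doubles every ~18 months, a balanced system (Amdahl's
# law) must double its I/O target too -- and hence its spindle count.
today = disks_needed(100_000)         # 223 spindles
in_18_months = disks_needed(200_000)  # 445 spindles
```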

 Flash News in the Market
• Sun Oracle Exadata Storage Server [Sep 2009]
   – Each Exadata cell comes with 384 GB of flash cache
• MySpace dumped disk drives [Oct 2009]
   – Went all-flash, cutting power by 99%
• Google Chrome OS ditched disk drives [Nov 2009]
   – SSD is the key to its 7-second boot time
• Gordon at UCSD/SDSC [Nov 2009]
   – 64 TB RAM, 256 TB flash, 4 PB disk
• IBM hooked up with Fusion-io [Dec 2009]
   – SSD storage appliance for the System x server line


 Flash for Database, Really?
• Immediate benefit for some DB operations
   – Reduced commit-time delay thanks to fast logging
   – Reduced read time for multi-versioned data
   – Reduced query processing time (sort, hash)

• What about the big fat tables?
   – Random scattered I/O is very common in OLTP
      • Can flash SSDs cope, given their slow random writes?



 Transactional Log

[Diagram: SQL queries flow through the system buffer cache to four storage
 areas — the database table space, the transaction (redo) log, the temporary
 table space, and the rollback segments. This slide highlights the
 transaction (redo) log.]
    Commit-time Delay by Logging
• Write-Ahead Logging (WAL)
   – A committing transaction force-writes its log records
   – Makes the latency hard to hide
   – With a separate disk for logging
      • No seek delay, but …
      • Half a revolution of the spindle on average
      • 4.2 msec (7200 RPM), 2.0 msec (15k RPM)
   – With a flash SSD: about 0.4 msec

[Diagram: transactions T1, T2, …, Tn issue SQL; a dirty page pi sits in the
 buffer on its way to the DB, while log records pass through the log buffer
 to the LOG device.]

• Commit-time delay remains a significant overhead
   – Group commit helps, but the delay does not go away altogether
• How much commit-time delay?
   – On average, 8.2 msec (HDD) vs 1.3 msec (SSD): a 6-fold reduction
      • TPC-B benchmark with 20 concurrent users
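The half-revolution figures above follow directly from the spindle speed; a minimal check:

```python
def rotational_delay_ms(rpm):
    """Average rotational delay: half a revolution, in milliseconds."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2

print(rotational_delay_ms(7200))   # ~4.17 ms (the slide rounds to 4.2)
print(rotational_delay_ms(15000))  # 2.0 ms
```

Against these, the ~0.4 msec of a flash SSD is an order of magnitude better, which is what drives the 6-fold end-to-end reduction measured with TPC-B.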

 Rollback Segments

[Same architecture diagram as before, now highlighting the rollback
 segments.]
 MVCC Rollback Segments
• Multi-Version Concurrency Control (MVCC)
   – Alternative to traditional lock-based CC
   – Supports read consistency and snapshot isolation
   – Oracle, PostgreSQL, Sybase, SQL Server 2005, MySQL
• Rollback Segments
   – Each transaction is assigned to a rollback segment
   – When an object is updated, its pre-update value is recorded in the
     rollback segment sequentially (in append-only fashion)
   – To fetch the correct version of an object, check whether it has been
     updated by other transactions
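The version-selection rule can be sketched in a few lines; this is a minimal, generic illustration with hypothetical names, not the scheme of any particular DBMS:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    value: int
    writer_ts: int                    # commit timestamp of the writing transaction
    prev: Optional["Version"] = None  # older version, kept in a rollback segment

def read_consistent(latest: Version, snapshot_ts: int) -> int:
    """Return the newest version committed at or before the reader's snapshot.
    A frequently updated object forces a walk down the chain of old versions."""
    v = latest
    while v is not None and v.writer_ts > snapshot_ts:
        v = v.prev  # each hop may be a random read into a rollback segment
    if v is None:
        raise LookupError("no version visible to this snapshot")
    return v.value

# Object A updated at ts=100 and ts=200; a reader with snapshot_ts=150
# must skip the ts=200 version and read the older one.
v100 = Version(value=50, writer_ts=100)
v200 = Version(value=80, writer_ts=200, prev=v100)
assert read_consistent(v200, snapshot_ts=150) == 50
```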


  MVCC Write Pattern
• Write requests from the TPC-C workload
   – Concurrent transactions generate multiple streams of append-only
     traffic in parallel (roughly 1 MB apart)
   – An HDD moves its disk arm very frequently between the streams
   – An SSD suffers no penalty: append-only writes sidestep its
     no-in-place-update limitation




     MVCC Read Performance

[Figure: transaction T2 reads object A; the versions written at timestamps
 100 and 200 must be traced back across two rollback segments to find the
 one visible to T2's snapshot.]

• To support MV read consistency, I/O activity will increase
   – A long chain of old versions may have to be traversed for each access
     to a frequently updated object
• Read requests are scattered randomly
   – Old versions of an object may be stored in several rollback segments
   – With an SSD, a 10-fold read-time reduction was not surprising
 Database Table Space

[Same architecture diagram as before, now highlighting the database table
 space.]
  Workload in Table Space
• TPC-C workload
   – Exhibits little locality and sequentiality
      • Mix of small/medium/large read-write and read-only (join) requests
   – Highly skewed
      • 84% (75%) of accesses go to 20% of tuples (pages)
• Write caching is not as effective as read caching
   – The physical read/write ratio is much lower than the logical
     read/write ratio
• All bad news for flash memory SSDs
   – Due to the no-in-place-update constraint and asymmetric read/write
     speeds
   – In-Page Logging (IPL) approach [SIGMOD'07]

  In-Page Logging (IPL)
• Key ideas of the IPL approach
   – Changes are written as log records instead of being applied in place
      • Avoids frequent write and erase operations
   – Log records are co-located with their data pages
      • No need to write them sequentially to a separate log region
      • Current data can be read more efficiently than with sequential logging
   – The DBMS buffer and storage managers work together




      Design of the IPL
• Logging on a per-page basis, in both memory and flash
   – An in-memory log sector (512 B) can be associated with a buffer frame
     holding an 8 KB data page; the page itself is updated in place
      • Allocated on demand when a page becomes dirty
   – An in-flash log segment is allocated in each erase unit
      • A 128 KB erase unit holds 15 data pages (8 KB each) plus an 8 KB
        log area of 16 sectors
   – The log area is shared by all the data pages in its erase unit
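The erase-unit geometry above is fully determined by the stated sizes; a quick consistency check:

```python
# Geometry of one IPL erase unit, using the sizes given above.
ERASE_UNIT = 128 * 1024   # 128 KB flash erase unit
PAGE = 8 * 1024           # 8 KB data page
SECTOR = 512              # log sector size
DATA_PAGES_PER_UNIT = 15

LOG_AREA = ERASE_UNIT - DATA_PAGES_PER_UNIT * PAGE
assert LOG_AREA == 8 * 1024        # 8 KB log area left per erase unit
assert LOG_AREA // SECTOR == 16    # 16 log sectors, shared by 15 pages
```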


      IPL Write
• Data pages in memory are
   – updated in place, and
   – physiological log records are written to the page's in-memory log sector
• The in-memory log sector is written to the in-flash log segment when
   – the data page is evicted from the buffer pool, or
   – the log sector becomes full
• When a dirty page is evicted, its content is not written to flash memory
   – The previous version in flash remains intact
• Data pages and their log records are physically co-located in the same
  erase unit

[Diagram: updates/inserts/deletes are applied in place in the buffer pool
 and logged physiologically; 512 B log sectors flow down to the log areas
 of the 128 KB data blocks in flash. Sector: 512 B, page: 8 KB,
 block: 128 KB.]
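The write path above can be sketched as follows; this is a toy stand-in with hypothetical names (`BufferFrame`, `flash_append_log`), not the SIGMOD'07 implementation:

```python
SECTOR = 512  # in-memory log sector capacity in bytes

flash_log = {}  # toy stand-in for the in-flash log areas, keyed by page id

def flash_append_log(page_id, records):
    """Append log records to the log area of the page's erase unit."""
    flash_log.setdefault(page_id, []).extend(records)

class BufferFrame:
    def __init__(self, page_id, data):
        self.page_id = page_id
        self.data = data          # 8 KB page image, updated in place
        self.log = []             # pending physiological log records
        self.log_bytes = 0

    def update(self, record: bytes, apply):
        apply(self.data)          # 1. update the in-memory page in place
        self.log.append(record)   # 2. append a physiological log record
        self.log_bytes += len(record)
        if self.log_bytes >= SECTOR:
            self.flush_log()      # log sector full -> write it to flash

    def flush_log(self):
        flash_append_log(self.page_id, self.log)
        self.log, self.log_bytes = [], 0

    def evict(self):
        # On eviction only the log sector is flushed; the old page image
        # in flash stays intact, co-located with its log records.
        if self.log:
            self.flush_log()
```

Note that `evict` never writes the 8 KB page image itself: that is the point of IPL, since only small log sectors reach flash during normal operation.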

    IPL Read
• When a page is read from flash, its current version is computed on the fly
   – Read from flash: the original copy of Pi and all log records belonging
     to Pi (I/O overhead)
   – Apply the physiological log records to the copy read from flash to
     re-construct the current in-memory copy (CPU overhead)

[Diagram: an erase unit holds a 120 KB data area (15 pages) and an 8 KB log
 area (16 sectors); both are read to rebuild page Pi in the buffer pool.]
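The on-the-fly reconstruction can be sketched like this; the dict-based pages and `(slot, value)` log records are hypothetical simplifications of physiological logging:

```python
def read_page(page_id, flash_pages, flash_log):
    """Current version = original flash copy + replay of its log records."""
    page = dict(flash_pages[page_id])               # I/O: original page copy
    for slot, value in flash_log.get(page_id, []):  # I/O: its log records
        page[slot] = value                          # CPU: replay each action
    return page

flash_pages = {7: {"a": 1, "b": 2}}     # page image as stored in flash
flash_log = {7: [("a", 10), ("c", 3)]}  # updates logged since that image
assert read_page(7, flash_pages, flash_log) == {"a": 10, "b": 2, "c": 3}
```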


  IPL Merge
• When all free log sectors in an erase unit are consumed
   – Log records are applied to the corresponding data pages
   – The current data pages are copied into a new erase unit
      • Consumes, erases, and releases only one erase unit

[Diagram: the old block Bold (15 data pages + 16 full log sectors) is merged
 into a new block Bnew holding 15 up-to-date data pages and a clean log
 area; Bold can then be erased.]
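The merge step can be sketched with the same toy data structures as before (dict pages and `(slot, value)` log records are illustrative, not real flash operations):

```python
def merge(old_unit):
    """Return a new erase unit with up-to-date pages and an empty log area."""
    new_pages = {}
    for page_id, page in old_unit["pages"].items():
        current = dict(page)
        for slot, value in old_unit["log"].get(page_id, []):
            current[slot] = value          # apply log records to the page
        new_pages[page_id] = current
    # The old unit can now be erased and reused; only one erase unit is
    # consumed, erased, and released per merge.
    return {"pages": new_pages, "log": {}}

b_old = {"pages": {1: {"x": 0}}, "log": {1: [("x", 9)]}}
b_new = merge(b_old)
assert b_new == {"pages": {1: {"x": 9}}, "log": {}}
```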




  Industry Response
• Common in enterprise-class SSDs
   – Multi-channel, inter-command parallelism
      • Throughput rather than peak bandwidth; write-followed-by-read patterns
   – Command queuing (SATA-II NCQ)
   – Large RAM buffer (with super-capacitor backup)
      • Even up to 1 MB per GB
      • Write-back caching, controller data (mapping, wear leveling)
   – Fat provisioning (up to ~20% of capacity)

• Impressive improvement

      Prototype/Product    EC SSD    X25-M    15k-RPM Disk
      Read (IOPS)          10500     20000    450
      Write (IOPS)         2500      1200     450


  EC-SSD Architecture
• Parallel/interleaved operations
   – 8 channels, 2 packages/channel, 4 chips/package
   – Two-plane page write, block erase, and copy-back operations

[Diagram: host interface (SATA-II) → main controller (ARM9, 128 MB DRAM) →
 flash controller with ECC, driving 8 channels of NAND flash packages.]
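The channel/package/chip counts above multiply out to the device's degree of hardware parallelism:

```python
# Hardware parallelism of the EC-SSD described above.
channels, packages_per_channel, chips_per_package = 8, 2, 4
chips = channels * packages_per_channel * chips_per_package
assert chips == 64  # up to 64 flash chips can be kept busy concurrently
```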




  Concluding Remarks
• Recent advances cope with random I/O much better
   – Write IOPS 100x higher than early SSD prototypes
   – TPS 1.3~2x higher than an 8-HDD RAID-0 array for the read-write
     TPC-C workload, with much less energy consumed
• Write still lags behind
   – IOPS(Disk) < IOPS(SSD, write) << IOPS(SSD, read)
   – IOPS(SSD, read) / IOPS(SSD, write) = 4 ~ 17
• A lot more issues to investigate
   – Flash-aware buffer replacement, I/O scheduling, energy
   – Fluctuation in performance, tiered storage architectures
   – Virtualization, and much more …

 Questions?
• For more of our flash memory work
   – In-Page Logging [SIGMOD'07]
   – Logging, Sort/Hash, Rollback [SIGMOD'08]
   – SSD Architecture and TPC-C [SIGMOD'09]
   – In-Page Logging for Indexes [CIKM'09]


• Even More?
    www.cs.arizona.edu/~bkmoon

