Floating-Point Data Compression at 75 Gb_s - Department of

Document Sample
Floating-Point Data Compression at 75 Gb_s - Department of Powered By Docstoc
					Floating-Point Data Compression
       at 75 Gb/s on a GPU

Molly A. O’Neil and Martin Burtscher
     Department of Computer Science
    Introduction
 Scientific simulations on HPC clusters
       Run on interconnected compute nodes
       Produce and transfer lots of floating-point data
       Data storage and transfer are expensive and slow
       Compute nodes have multiple cores but only one link

 Interconnects are getting faster
       Lonestar: 40 Gb/s InfiniBand
       Speeds of up to 100 Gb/s soon
                                                      Texas Advanced Computing Center




Floating-Point Data Compression at 75 Gb/s on a GPU                                     March 2011
    Introduction (cont.)
 Compression
       Reduced storage, faster transfer
       Only useful when done in real time
             Saturate network with compressed data    Charles Trevelyan for http://plus.maths.org/




             Requires compressor tailored to hardware capabilities


 GFC algorithm for IEEE 754 double-precision data
       Designed specifically for GPU hardware (CUDA)
       Provides reasonable compression ratio and operates
         above throughput of emerging networks

Floating-Point Data Compression at 75 Gb/s on a GPU                                             March 2011
    Lossless Data Compression
 Dictionary-based (Lempel-Ziv family) [gzip, lzop]
 Variable-length entropy coders (Huffman, AC)
 Run-length encoding [fax]
 Transforms (Burrows-Wheeler) [bzip2]
 Special-purpose FP compressors [FPC, FSD, PLMI]
       Prediction and leading-zero suppression

 None of these offer real-time speeds for
    state-of-the-art networks
Floating-Point Data Compression at 75 Gb/s on a GPU   March 2011
    GFC Algorithm
 GPUs require 1000s of parallel activities, but…
    compression is a generally serial operation
                                                       Divide data into n chunks,
                                                        processed in parallel
                                                          Best perf: choose n to match
                                                           max number of resident warps
                                                       Each chunk composed of
                                                        32-word subchunks
                                                          One double per warp thread
                                                          Use previous subchunk to
                                                           provide prediction values

Floating-Point Data Compression at 75 Gb/s on a GPU                            March 2011
    Dimensionality
 Many scientific data sets display dimensionality
       Interleaved coordinates from multiple dimensions




 Optional dimensionality parameter to GFC
       Determines index of previous subchunk to use as the
         prediction

Floating-Point Data Compression at 75 Gb/s on a GPU   March 2011
    GFC Algorithm (cont.)




Floating-Point Data Compression at 75 Gb/s on a GPU   March 2011
    GPU Optimizations
 Low thread divergence (few if statements)
       Some short enough to be predicated

 Coalesce memory accesses by packing/unpacking
  data in shared memory (for CC < 2.0)
 Very little inter-thread
    communication and synchronization
       Prefix sum only
       Warp-based implementation                           gamedsforum.ca




Floating-Point Data Compression at 75 Gb/s on a GPU   March 2011
    Evaluation Method
 Systems
       Two quad-core 2.53 GHz Xeons
       NVIDIA FX 5800 GPU (CC 1.3)

 13 datasets: real-world data (19 – 277 MB)
       Observational data, simulation results, MPI messages

 Comparisons
       Compression ratio vs. 5 compressors in common use
       Throughput vs. pFPC (fastest known CPU compressor)

Floating-Point Data Compression at 75 Gb/s on a GPU   March 2011
                                  Compression Ratio
                                                                                   1.188 (range: 1.01 – 3.53)
                                  1.25
                                                                                       Low (FP data), but in
                                                                                        line with other algos
Harmonic mean compression ratio




                                  1.20
                                                                                       Largely independent of
                                  1.15                                                  number of chunks

                                  1.10                                             When done in real-
                                                                                     time, compression at
                                  1.05
                                                                                     this ratio can greatly
                                  1.00                                               speed up MPI apps
                                         bzip2   gzip   lzop   FPC   pFPC   GFC
                                                                                       3% – 98% speed-up
                                                                                        [Ke et al., SC’04]

Floating-Point Data Compression at 75 Gb/s on a GPU                                                    March 2011
                                   Throughput
                                                                                                 C: 75 – 87 Gb/s
                                  100
                                  90
                                                                                                    Mean: 77.9 Gb/s
                                                      GFC
Harmonic mean throughput (Gb/s)




                                  80                                    compression              D: 90 – 121 Gb/s
                                  70                                    decompression               Mean: 96.6 Gb/s
                                  60
                                  50
                                  40
                                                                                                 4x faster than pFPC
                                                                     pFPC
                                  30                                                              on 8 cores (2 CPUs)
                                  20
                                  10                                                             Improvement over
                                   0
                                        1.15   1.20         1.25    1.30          1.35   1.40     pFPC’s compression
                                                Harmonic mean compression ratio                   ratio vs. performance
                                                                                                  trend
     Floating-Point Data Compression at 75 Gb/s on a GPU                                                          March 2011
        NEW: Fermi Throughput
 Fermi improvements:                                  C: 119 – 219 (HM: 167.5 Gb/s)
               Faster, simpler memory accesses        D: 169 – 219 (HM: 180.3 Gb/s)
               Hardware support for count-
                        leading-zeros op               Compresses over 9.5x faster
 Compression ratio: 1.187                              than pFPC on 8 x86 cores
                       200

                       150
   Throughput (Gb/s)




                       100

                        50                                    compression   decompression

                         0




Floating-Point Data Compression at 75 Gb/s on a GPU                                     March 2011
    Summary
 GFC algorithm
       Chunks up data, each warp processes a chunk
        iteratively by 32-word subchunks
       No communication required between warps

 Minimum 75 Gb/s – 90 Gb/s (encode-decode)
    throughput on GTX-285, and 119 Gb/s – 169
    Gb/s on Fermi, with a compression ratio of 1.19
 CUDA source code is freely available at
    http://www.cs.txstate.edu/~burtscher/research/GFC/
Floating-Point Data Compression at 75 Gb/s on a GPU   March 2011
    Conclusions
 GPU can compress much faster than PCIe bus can
  transfer the data
 But…
       PCIe bus will become faster                   AMD




       CPU-GPU increasingly on single die
       GPU-to-GPU, GPU-to-NIC transfers coming?
                                                                    NVIDIA




 GFC is the first compressor with the potential to
    deliver real-time FP data compression for current
    and emerging network speeds

Floating-Point Data Compression at 75 Gb/s on a GPU         March 2011

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:2
posted:2/26/2013
language:English
pages:14