									Data Compression

By Keerthi Gundapaneni
• Data compression is a very effective
  means to save storage space and network
  bandwidth.
• A large number of compression schemes
  currently on the market are based
  on character encoding or on detection of
  repetitive strings.
• Many of these schemes reduce data
  to 2.3-2.5 bits per
  character for English text.
• Database performance depends strongly
  on the amount of available memory.
• It is therefore important to use the available
  memory as efficiently as possible.
          Current Schemes
• Text compression schemes based on
  letter frequency. (pioneered by Huffman)
• Schemes based on string matching.
• Schemes based on fast implementation of
  algorithms, parallel algorithms and VLSI
• Many database systems use prefix and postfix
  truncation to save space and increase the
  fan-out of nodes, e.g. Starburst.
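The prefix and postfix truncation mentioned above can be illustrated with a minimal Python sketch. This is a generic illustration under stated assumptions, not the method used by Starburst or any particular system; the function names `prefix_truncate` and `shortest_separator` are hypothetical.

```python
import os

def prefix_truncate(keys):
    """Prefix truncation: store the common prefix once, keep only the tails."""
    prefix = os.path.commonprefix(keys)
    return prefix, [k[len(prefix):] for k in keys]

def shortest_separator(left, right):
    """Postfix truncation: shortest string s with left < s <= right.

    A separator key in an index node only has to direct the search,
    so the trailing characters of the full key can be dropped.
    """
    assert left < right
    for i in range(1, len(right) + 1):
        candidate = right[:i]
        if candidate > left:
            return candidate
    return right

print(prefix_truncate(["compress", "compression", "compute"]))
print(shortest_separator("Smith, Jack", "Smythe, Anne"))
```

Shorter keys mean more entries fit into one node, which is where the increased fan-out comes from.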
      Using various schemes
• The compression rate of a dataset depends on
  the attribute types and value distributions.
• It is difficult to compress binary floating
  point numbers but relatively easy to
  compress English text by a factor of 2 or 3.
• Optimal performance can only be obtained
  by judicious decisions about which attributes to
  compress and which compression method
  to use.
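The dependence on attribute type can be checked with a rough experiment. The sketch below assumes Python's standard `zlib` module as a stand-in general-purpose compressor (it is not one of the schemes named in these slides) and compares a short English passage with randomly generated binary doubles. The exact factors depend on the input and on the compressor, so none are quoted here.

```python
import random
import struct
import zlib

def compression_factor(data: bytes) -> float:
    """Original size divided by compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data, 9))

# English text attribute (sample sentences, chosen only for illustration).
text = (
    "Data compression is a very effective means to save storage space and "
    "network bandwidth. A large number of compression schemes have been based "
    "on character encoding or on detection of repetitive strings. Database "
    "performance depends strongly on the amount of available memory, so it is "
    "important to use the available memory as efficiently as possible. "
    "Compressed data can be transferred faster to and from disk, and more "
    "data fits into each disk page, track and cylinder."
).encode("ascii")

# Binary floating point attribute: 1000 random 8-byte doubles.
rng = random.Random(0)  # fixed seed so the run is repeatable
floats = struct.pack("<1000d", *(rng.random() for _ in range(1000)))

text_factor = compression_factor(text)
float_factor = compression_factor(floats)
print(f"text: {text_factor:.2f}x  floats: {float_factor:.2f}x")
```

The random mantissa bits of the doubles leave little redundancy for the compressor to exploit, while the text repeats letters and words.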
   Advantages of Compression
• Reduces the disk space required.
• Seek distances and seek times are
  reduced.
• More data fits into each disk page, track
  and cylinder allowing more intelligent
  clustering of related objects into physically
  near locations.
• Unused disk space can be used for
  shadowing to increase reliability
• Compressed data can be transferred
  faster to and from disk.
• Data compression increases effective disk
  bandwidth.
• Because of the higher information density, the
  load decreases and therefore less I/O is needed.
• Faster transfer rates across the network.
• Retaining data in compressed form
  in the I/O buffer allows more records to
  remain in the buffer, thus increasing the
  buffer hit rate and reducing the number of
  I/O operations.
• The log records can become shorter.
       Types of compression
• For a given table of “parts”, the attribute
  “color” is replaced by a small integer, the
  encoding is saved in a separate relation, and
  the larger table is joined with the relatively
  small encoding table for queries that
  require string-valued output of the color
  attribute. Since such encoding tables are
  typically small, e.g. a few kilobytes, efficient
  hash-based algorithms can be used for the
  join.
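The “parts” and “color” encoding described above can be sketched in a few lines of Python. The sample rows, the column name `part_no`, and the function names `encode_column` and `hash_join_decode` are assumptions made for illustration; a real system would do this inside the storage and query layers.

```python
parts = [
    {"part_no": 1, "name": "bolt", "color": "red"},
    {"part_no": 2, "name": "nut", "color": "blue"},
    {"part_no": 3, "name": "screw", "color": "red"},
    {"part_no": 4, "name": "washer", "color": "green"},
]

def encode_column(rows, column):
    """Replace the string values in `column` with small integer codes."""
    code_of = {}  # string value -> integer code
    encoded = []
    for row in rows:
        code = code_of.setdefault(row[column], len(code_of))
        encoded.append({**row, column: code})
    # The encoding is kept as a separate relation of (code, value) pairs.
    encoding_table = [(code, value) for value, code in code_of.items()]
    return encoded, encoding_table

def hash_join_decode(encoded, encoding_table, column):
    """Hash join: build on the small encoding table, probe with the large one."""
    lookup = dict(encoding_table)  # build phase
    return [{**row, column: lookup[row[column]]} for row in encoded]  # probe phase

encoded, table = encode_column(parts, "color")
print(encoded)
print(table)
print(hash_join_decode(encoded, table, "color") == parts)
```

Queries that do not need the color as a string can work on the integer codes directly and skip the join.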
      Huffman code example

Symbol:       A    B    C    D    E
Frequency:   24   12   10    8    8

Total: 186 bits (62 symbols at 3 bits per code word)

Symbol   Frequency   Code   Code Length (bits)   Total Length (bits)
A        24          0      1                    24
B        12          100    3                    36
C        10          101    3                    30
D         8          110    3                    24
E         8          111    3                    24

Initial: 186 bits (fixed 3-bit code)    Final: 138 bits (Huffman code)
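The totals in the example can be reproduced with a short Python sketch of Huffman's construction: repeatedly merge the two least frequent subtrees. It computes code lengths only. The exact bit patterns depend on how ties and left/right branches are assigned, so they may differ from the table while giving the same total. The function name `huffman_code_lengths` is hypothetical.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over `freqs`."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(f, next(tie), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)  # least frequent subtree
        f2, _, b = heapq.heappop(heap)  # second least frequent subtree
        # Merging pushes every symbol in both subtrees one level deeper.
        merged = {s: depth + 1 for s, depth in {**a, **b}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

freqs = {"A": 24, "B": 12, "C": 10, "D": 8, "E": 8}
lengths = huffman_code_lengths(freqs)
fixed_bits = 3 * sum(freqs.values())                      # fixed 3-bit code
huffman_bits = sum(freqs[s] * lengths[s] for s in freqs)  # variable-length code
print(lengths, fixed_bits, huffman_bits)
```

For these frequencies the merges are 8+8, then 10+12, then 16+22, then 24+38, which gives A a 1-bit code and every other symbol a 3-bit code.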
