Data Compression
By Keerthi Gundapaneni

Introduction
• Data compression is a very effective means of saving storage space and network bandwidth.
• A large number of the compression schemes currently on the market are based on character encoding or on the detection of repetitive strings.
• Many of these schemes reduce English text to 2.3-2.5 bits per character.

Introduction
• Database performance depends strongly on the amount of available memory.
• It is important to use the available memory as efficiently as possible.

Current Schemes
• Text compression schemes based on letter frequency (pioneered by Huffman).
• Schemes based on string matching.
• Schemes based on fast implementations of algorithms, parallel algorithms and VLSI implementations.
• Many database systems use prefix and postfix truncation to save space and increase the fan-out of nodes, e.g. Starburst.

Using various schemes
• The compression rate of a dataset depends on the attribute types and value distributions.
• It is difficult to compress binary floating point numbers but relatively easy to compress English text by a factor of 2 or 3.
• Optimal performance can only be obtained by judicious decisions about which attributes to compress and which compression method to use.

Advantages of Compression
• The disk space required is reduced.
• Seek distances and seek times are reduced.
• More data fits into each disk page, track and cylinder, allowing more intelligent clustering of related objects into physically near locations.
• Unused disk space can be used for shadowing to increase reliability.

Advantages of Compression
• Compressed data can be transferred faster to and from disk.
• Data compression increases the effective disk bandwidth.
• Because the information density is higher, the I/O load decreases, which eases the I/O bottleneck.
• Transfer rates across the network are faster.
Advantages of Compression
• Retaining data in compressed form in the I/O buffer allows more records to remain in the buffer, thus increasing the buffer hit rate and reducing the number of I/Os.
• Log records can become shorter.

Types of compression
• For a given table of "parts", the attribute "color" can be replaced by a small integer: save the encoding in a separate relation, and join the larger table with the relatively small encoding table for queries that require string-valued output of the color attribute. Since such encoding tables are typically small (e.g. a few kilobytes), efficient hash-based algorithms can be used for the join.

Huffman code example
Symbol:     A   B   C   D   E
Frequency:  24  12  10  8   8
Total: 186 bits (62 symbols at 3 bits per code word)

Huffman code example: Results
Symbol  Frequency  Code  Code Length  Total Length (bits)
A       24         0     1            24
B       12         100   3            36
C       10         101   3            30
D       8          110   3            24
E       8          111   3            24
Initial: 186 bits (fixed 3-bit code)
Final: 138 bits (Huffman code)

References:
• Seeck, Roger (2008). Binary Essence. Retrieved April 17, 2008, from About BinaryEssence Web site: http://www.binaryessence.com/dct/en000081.htm
• Graefe, G., & Shapiro, L. (1991). Data Compression and Database Performance. ACM/IEEE-CS Symp., 1, 1-10.
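
The numbers in the Huffman code example slides can be reproduced with a short script. This is a minimal sketch rather than the exact procedure behind the slides: the function name `huffman_code_lengths` is made up for illustration, and it computes only the code lengths, because the specific bit patterns (0, 100, 101, ...) depend on how the 0/1 labels are assigned to branches.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(weight, next(tiebreak), [sym]) for sym, weight in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol inside them
        # moves one level deeper, so its code grows by one bit.
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), syms1 + syms2))
    return lengths


# Frequencies taken from the slide.
freqs = {"A": 24, "B": 12, "C": 10, "D": 8, "E": 8}
lengths = huffman_code_lengths(freqs)

fixed_bits = 3 * sum(freqs.values())                      # 3 bits per symbol
huffman_bits = sum(freqs[s] * lengths[s] for s in freqs)  # variable-length code

print(lengths)
print(fixed_bits, huffman_bits)
```

With the slide's frequencies, the script gives A a 1-bit code and B to E 3-bit codes, for 138 bits against 186 bits with the fixed 3-bit code.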
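
The "color" encoding described under Types of compression can be sketched in the same way. The rows, the column names and the helpers `dictionary_encode` and `decode_join` are all hypothetical, chosen only to illustrate the idea. A Python dict stands in for the small encoding relation, so the lookup in `decode_join` plays the role of the probe phase of a hash join.

```python
def dictionary_encode(rows, attr):
    """Replace rows[i][attr] by a small integer; return (encoded rows, encoding table)."""
    code_of = {}  # value -> integer code, assigned in order of first appearance
    encoded = []
    for row in rows:
        code = code_of.setdefault(row[attr], len(code_of))
        encoded.append({**row, attr: code})
    # The encoding table is the separate, small relation: code -> original value.
    encoding_table = {code: value for value, code in code_of.items()}
    return encoded, encoding_table


def decode_join(encoded, encoding_table, attr):
    """Join the large table with the encoding table to recover string values."""
    return [{**row, attr: encoding_table[row[attr]]} for row in encoded]


# Hypothetical "parts" table, for illustration only.
parts = [
    {"part_id": 1, "color": "red"},
    {"part_id": 2, "color": "blue"},
    {"part_id": 3, "color": "red"},
    {"part_id": 4, "color": "green"},
]

encoded_parts, color_table = dictionary_encode(parts, "color")
decoded_parts = decode_join(encoded_parts, color_table, "color")

print(encoded_parts)
print(color_table)
```

Queries that only compare colors for equality can work directly on the integer codes. The join is needed only when the string value itself has to appear in the output.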