									                                                          (IJCSIS) International Journal of Computer Science and Information Security,
                                                          Vol. 9, No. 10, October 2011

             Vertical Vs Horizontal Partition: In Depth
         Tejaswini Apte
         Sinhgad Institute of Business Administration and Research
         Kondhwa (BK), Pune-411048

         Dr. Maya Ingle
         Devi Ahilya VishwaVidyalay, Indore

         Dr. A. K. Goyal
         Devi Ahilya VishwaVidyalay, Indore

Abstract- For write-intensive operations and predictable query behavior, traditional database systems have been optimized for write performance. With growing data volumes and the unpredictable nature of queries, write-optimized systems have proven to be poorly suited. Recently, interest has been renewed in architectures that optimize read performance by using a vertically partitioned data representation. In this paper, we identify and analyze the components affecting the performance of Horizontal and Vertical Partition. Our study focuses on tables with different data characteristics and on complex queries. We show that a carefully designed Vertical Partition may outperform a carefully designed Horizontal Partition, sometimes by an order of magnitude.

General Terms: Algorithms, Performance, Design

Keywords: Vertical Partition, Selectivity, Compression, Horizontal Partition

                 I.    INTRODUCTION
Storing relational tables vertically on disk has been of keen interest in the data warehouse research community. The main reason lies in minimizing the time required for disk reads over tremendously growing data warehouses. A Vertical Partition (VP) offers better cache management with less storage overhead. For queries retrieving many columns, however, VP demands stitching the columns back together, which offsets the I/O benefits and can cause a longer response time than the same query on a Horizontal Partition (HP). HP stores tuples on physical blocks with a slot array that specifies the offset of each tuple on the page [15]. The HP approach is superior for queries that retrieve many columns and for transactional databases. For queries retrieving few columns (DSS systems), the HP approach may result in more I/O bandwidth, poor cache behavior and a poor compression ratio [6].

Recent upgrades in database technology have improved the HP compression ratio by storing tuples densely in the block, at the cost of a poorer update ratio but with better I/O bandwidth than VP. Achieving a degree of HP compression close to the entropy of the table, skewed datasets and advanced compression techniques have opened a research path into query response time and HP performance for DSS systems [16].

Previous research has shown results relevant to this paper:
- HP is superior to VP at low selectivity, when the query retrieves many columns with no chaining and the system is CPU constrained.
- The selectivity factor and the number of retrieved columns are the measure of the processing time of VP relative to HP.
- VP may be sensitive to the amount of processing needed to decompress a column.

The compression ratio may be improved for non-uniform distributions [13]. The research community has mainly focused on a single predicate with low selectivity, applied to the first column of the table, with the same column retrieved by the query [12]. We believe that the relative performance of VP and HP is affected by (a) the number of predicates, (b) the columns the predicates are applied to and their selectivity, and (c) the resultant columns. Our approach mainly focuses on the factors affecting the response time of HP and VP, i.e. (a) additional predicates, (b) data distribution and (c) join operations.

For various applications, it has been observed that VP has several advantages over HP. We discuss related, existing and recent compression techniques for HP and VP in Section 2. Many factors affect the performance of HP and VP; Section 3 provides a comparative study of performance measures against query characteristics. Our approach's implementation details and the analysis of the results are presented in Section 4. Finally, we conclude with a short discussion of our work in Section 5.

                 II.    RELATED WORK
In this section, some existing compression techniques used in VP and HP are discussed briefly, along with the latest methodologies.

    A. Vertical Storage
The VP and HP comparison is presented with C-Store and the Star Schema Benchmark [12]. VP is implemented using commercial relational database systems by making each column its own table. The idea presented pays an extra performance penalty, since every column must carry its own row-id. To argue the superiority of HP over VP, an analysis was done by implementing HP in C-Store (a VP database). Compression, late materialization and block iteration were the basis for measuring the performance of VP over HP.

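The per-column layout just described — each column stored as its own table of (row-id, value) pairs, with full tuples stitched back together on the shared row-id — can be sketched as follows. This is a minimal in-memory illustration, not the storage layer of the commercial systems the paper discusses; the column names are hypothetical.

```python
# Minimal sketch of emulating a Vertical Partition: each column becomes
# its own table of (row_id, value) pairs, and full tuples must be
# "stitched" back together on the shared row_id.

def vertical_partition(rows, column_names):
    """Decompose row-oriented tuples into one (row_id, value) table per column.

    Note the overhead mentioned above: every column repeats the row_id."""
    return {
        name: [(rid, row[i]) for rid, row in enumerate(rows)]
        for i, name in enumerate(column_names)
    }

def stitch(columns, column_names):
    """Reconstruct full tuples by joining the per-column tables on row_id."""
    by_rid = {}
    for name in column_names:
        for rid, value in columns[name]:
            by_rid.setdefault(rid, []).append(value)
    return [tuple(values) for _, values in sorted(by_rid.items())]

rows = [(10, 'Pune', 450.0), (20, 'Indore', 120.5)]
cols = vertical_partition(rows, ['cust_id', 'cust_city', 'amount_sold'])
assert stitch(cols, ['cust_id', 'cust_city', 'amount_sold']) == rows
```

A query touching only one column reads one (row_id, value) table, which is the I/O saving; a query touching all columns pays the stitching cost, which is the penalty discussed above.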
                                                                                                      ISSN 1947-5500

With the given workload, compression and late materialization improve performance by factors of two and three respectively [12]. We believe these results are largely orthogonal to ours, since we heavily compress both the HP and the VP, and our workload does not lend itself to late materialization of tuples. "A Comparison of C-Store and Row-Store in a Common Framework" mainly focused on super-tuples and column abstraction. The slotted page format in HP results in a lower compression ratio than VP [10]. Super-tuples may improve the compression ratio by storing rows under one header with no slot array. Column abstraction avoids storing repeated attributes multiple times by adding information to the header. The comparison is made over a varying number of columns with uniformly distributed data for VP and HP, while retrieving all columns from the table.

The VP concept was implemented in the Decomposition Storage Model (DSM), with a storage design of (tuple id, attribute value) for each column (MonetDB) [9]. The C-Store data model contains overlapping projections of tables. L2 cache behaviour may be improved by the PAX architecture, which stores tuples column-wise within each slot [7], at a penalty in I/O bandwidth. Data Morphing improves on PAX to give even better cache performance by dynamically adapting attribute groupings on the page [11].

    B. Database Compression Techniques
Compression techniques in databases are mostly based on slotted-page HP. The compression ratio may be improved by a factor of 8-12 by using processing-intensive techniques [13]. The VP compression ratio is examined in "Super-Scalar RAM-CPU Cache Compression" and "Integrating Compression and Execution in Column-Oriented Database Systems" [21, 3]. Zukowski presented a compression algorithm that optimizes the utilization of modern processors with less I/O bandwidth. From the effect of run lengths on the degree of compression, dictionary encoding proved to be the best compression scheme for VP [3].

                 III.    COMPARATIVE STUDY
Our contribution over the existing approaches is based on the major factors affecting the performance of HP and VP: (a) data distribution, (b) cardinality, (c) number of columns, (d) compression technique and (e) query nature.

    A. Data Characteristics
The search time and performance of two relational tables vary with the number of attributes and the data type of each attribute, along with the compression ratio, column cardinality and selectivity.

    B. Compression Techniques
Dictionary based coding
Repeated occurrences are replaced by a codeword that points to the index of the dictionary entry containing the pattern. Both codewords and uncompressed instructions are part of the compressed program. A performance penalty occurs when (a) the dictionary cache line is bigger than the processor's L1 data cache, (b) the index size is larger than the value, or (c) the un-encoded column size is smaller than the size of the encoded column plus the size of the dictionary [3].

Delta coding
The data is stored as the difference between successive samples (or characters). The first value in the delta-encoded file is the same as the first value in the original data. All the following values in the encoded file are equal to the difference (delta) between the corresponding value in the input file and the previous value in the input file. For uniform values in the database, delta encoding is beneficial for data compression. Delta coding may be performed at both the column level and the tuple level. For unsorted sequences, and when size-of(encoded) is larger than size-of(un-encoded), delta encoding is less beneficial [3].

Run Length Encoding (RLE)
Sequences of the same data value within a file are replaced by a count and a single value. RLE compression works best for sorted sequences with long runs. RLE is more beneficial for VP [3].

    C. Query Parameters and Table Generation
To study the effect of queries against table characteristics, queries were tested with a varying number of predicates and selectivity factors. The factors affecting the execution plan and cost are (a) schema definition, (b) selectivity factor, (c) number of columns referenced and (d) number of predicates. The execution time of a query changes with column characteristics and I/O bandwidth. For each column characteristic, the query generator randomly selects the columns used, to produce a set of "equivalent" queries with the cost analysis [12]. The performance measure with compression is implemented by:
- Generating an uncompressed HP version of each table, with the primary key on the leftmost column.
- Sorting on the columns frequently used in queries.
- Generating a replica on VP.

                 IV.    IMPLEMENTATION DETAIL
To study the effect of VP and HP, the experiments are run against the TPC-H standard Star-Schema on MonetDB. We mainly concentrate on the fact table, Sales, which contains approximately 10 lakh (1,000,000) records. We focused on five columns for selectivity, i.e. prod_id, cust_id, time_id, channel_id and promo_id, with selectivity varying from 0.1% to 50%.

        SELECT p.product_name, ch.channel_class,
               c.cust_city, t.calendar_quarter_desc,
               SUM(s.amount_sold) sales_amount
        FROM sales s, times t, customers c, channels ch,
             products p, promotions pr
        WHERE s.time_id = t.time_id
          AND s.prod_id = p.prod_id
          AND s.cust_id = c.cust_id
          AND s.channel_id = ch.channel_id
          AND s.promo_id = pr.promo_id
          AND c.cust_state_province = 'CA'
          AND ch.channel_desc IN ('Internet', 'Catalog')
          AND t.calendar_quarter_desc IN ('1999-Q1', '1999-Q2')
        GROUP BY ch.channel_class, p.product_name,
                 c.cust_city, t.calendar_quarter_desc;

Table 1: Generalized Star-Schema Query

Predicate      Selectivity (%)    No. of Rows    HP (time in sec)    VP (time in sec)
Prod_id        Compressed (50)    1000000        3                   14
Cust_id        25                 1000000        45                  10
Time_id        10                 1000000        40                  20
Promo_id       1                  1000000        35                  20
Channel_id     0.1                1000000        30                  30
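The dictionary, delta and run-length schemes described earlier can be illustrated with minimal sketches. These are illustrative only, operating on in-memory Python lists rather than the page format the engines actually use.

```python
# Illustrative sketches of the three compression schemes discussed above,
# each applied to a single column of values.

def dict_encode(column):
    """Replace each value by a codeword indexing a dictionary of distinct values."""
    dictionary = sorted(set(column))
    index = {v: i for i, v in enumerate(dictionary)}
    return [index[v] for v in column], dictionary

def delta_encode(column):
    """Store the first value, then the difference between successive values."""
    return [column[0]] + [b - a for a, b in zip(column, column[1:])]

def rle_encode(column):
    """Replace runs of equal values by (count, value) pairs; best on sorted data."""
    runs = []
    for v in column:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, v])       # start a new run
    return [(count, value) for count, value in runs]

codes, dictionary = dict_encode(['CA', 'NY', 'CA', 'CA'])
assert (codes, dictionary) == ([0, 1, 0, 0], ['CA', 'NY'])
assert delta_encode([100, 101, 103, 106]) == [100, 1, 2, 3]
assert rle_encode([7, 7, 7, 2, 2]) == [(3, 7), (2, 2)]
```

The trade-offs noted above are visible even at this scale: dictionary coding pays off when few distinct values repeat often, delta coding when successive values are close, and RLE when the column is sorted into long runs.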
    A. Read-Optimized Blocks (Pages)
Both HP and VP dense-pack the table onto blocks to reduce I/O bandwidth. With a varying page size, HP keeps tuples together, while VP stores each column in a different file. The entries on a page are not aligned to byte or word boundaries, in order to achieve better compression. Each page begins with the page header, which contains the number of entries on the page, followed by the data and the compression dictionary. The size of the compression dictionary is stored at the very end of the page, with the dictionary growing backwards from the end of the page towards the front. For HP, the dictionaries for the dictionary-compressed columns are stored sequentially at the end of the page.

    B. Query Engine, Scanners and I/O
The query scanner scans the files differently for HP and VP. Results are materialized after reading the data and applying the predicates to it, with fewer passes in HP than in VP, which requires reading a separate file for each column referenced by the query. Predicates are applied on a per-column basis; columns are processed in order of their selectivity, from most selective (fewest qualifying tuples) to least selective (most qualifying tuples). Placing the most selective predicate first allows the scanner to read more of the current file before having to switch to another file, since the output buffer fills up more slowly.

    C. Experimental Setup
All results were obtained on a machine running RHEL 5 with a 2.4 GHz Intel processor and 1 GB of RAM. HP and VP are affected by the amount of I/O and processing bandwidth available in the system, for each combination of output selectivity and number of columns accessed.

Effect of selectivity
Selecting fewer tuples with a very selective filter and an index has no effect on I/O performance; system time remains the same. HP remains the same, since it has to examine each tuple in the relation to evaluate the predicate. For VP, evaluating the predicate requires more time. With decreasing selectivity, the VP to HP performance ratio shrinks. However, as selectivity increases towards 100%, each column scan contributes to the CPU cost. VP is faster than HP when more columns are returned with a selectivity factor from 0.1% to 25%. Further, with the same configuration, compressed HP speeds up by a factor of 4 over VP (Figure 1).

Figure 1: Time measurement for HP and VP with varying selectivity.

Effect of compression
For skewed data distributions and large cardinality in HP, run-length and dictionary compression techniques are more beneficial. The size of a VP tuple is approximately the same as the size of an HP tuple. HP compression is a critical component in determining its performance relative to that of VP. Compression is more beneficial for columns with high cardinality. Regarding compression, some VP proponents have argued that, since VP compresses better than HP, storing the data with multiple projections and sort orders is feasible and can provide even better speedups [18].

Effect of Joins
We examined join operations for the query presented in Table 1, with varying predicates over HP and VP, to analyze the interaction of the resultant tuples with the join (e.g. more instruction cache misses due to switching between scanning, reconstructing tuples and performing the join). Compression improves performance by decreasing I/O bandwidth and increasing scan time as the column selection ratio grows. Unlike compression, the cost of the join operation increases with a longer list of selected columns. HP outperforms VP as the number of accessed columns grows. The join component of the time is always roughly equivalent between HP and VP (Figure 2). Thus, the paradigm with the smaller scan time will also have the smaller join time, and


the join time is greatly affected by the number of joined tuples materialized, the number of passes required and the type of join operation.

Join Operation      VP (time in sec)    HP (time in sec)
Hash Join           27                  30
Merge Join          38                  35
Nested Loop Join    30                  34

Figure 2: Performance of Join Operations in HP and VP

Analysis
Our analysis focuses on the tuple-at-a-time paradigm. The cost of evaluating each tuple is the minimum of the CPU processing and disk bandwidth costs. The performance of the database depends on the size of input (SOI). For any query,

    Total Disk Rate (TDR) = SOI_1/TOS + … + SOI_n/TOS

For more columns, HP outperforms VP. The CPU cost is measured by the search and operation times of the query. Thus,

    Cost(CPU) = Cost(Operations) || Cost(Scan)

and the rate of an operator is

    OP = time / no. of CPU instructions

                 V.    CONCLUSION
We summarize the following points:
   A. The selectivity of a predicate can substantially change the relative performance of HP and VP.
   B. HP performs better than VP when most of the columns are required by the query.
   C. Adding predicates increases VP run times.
   D. Joins do not change the relative performance of HP and VP.

VP outperforms HP when I/O is the dominating factor in the query plan and when fewer columns are selected. For HP with compression, I/O becomes less of a factor and CPU time more of one; in VP this holds for queries with more predicates, lower selectivity and more columns referenced. HP on slotted pages will most likely never beat VP for read-optimized workloads.

                      REFERENCES
[1] Accessed November 8, 2007.
[2] Accessed November 8, 2007.
[3] Abadi, D. J., Madden, S. R., Ferreira, M. C. "Integrating Compression and Execution in Column-Oriented Database Systems." In SIGMOD, 2006.
[4] Abadi, D. J., Madden, S. R., Hachem, N. "Column-Stores vs. Row-Stores: How Different Are They Really?" In SIGMOD, 2008.
[5] Abadi, D. J., Myers, D. S., DeWitt, D. J., Madden, S. R. "Materialization Strategies in a Column-Oriented DBMS." In ICDE, 2007.
[6] Ailamaki, A. Architecture-Conscious Database Systems. Ph.D. Thesis, University of Wisconsin, Madison, WI, 2000.
[7] Ailamaki, A., DeWitt, D. J., Hill, M. D., and Skounakis, M. "Weaving Relations for Cache Performance." In VLDB, 2001.
[8] Boncz, P., Zukowski, M., and Nes, N. "MonetDB/X100: Hyper-Pipelining Query Execution." In CIDR, 2005.
[9] Copeland, G. and Khoshafian, S. "A Decomposition Storage Model." In SIGMOD, 1985.
[10] Halverson, A. J., Beckmann, J. L., Naughton, J. F., DeWitt, D. J. "A Comparison of C-Store and Row-Store in a Common Framework." Technical Report, University of Wisconsin-Madison, Department of Computer Sciences, 2006.
[11] Hankins, R. A., Patel, J. M. "Data Morphing: An Adaptive, Cache-Conscious Storage Technique." In VLDB, 2003.
[12] Harizopoulos, S., Liang, V., Abadi, D., and Madden, S. "Performance Tradeoffs in Read-Optimized Databases." In VLDB, 2006.
[13] Holloway, A. L., Raman, V., Swart, G. and DeWitt, D. J. "How to Barter Bits for Chronons: Compression and Bandwidth Trade Offs for Database Scans." In SIGMOD, 2007.
[14] Huffman, D. "A Method for the Construction of Minimum-Redundancy Codes." In Proceedings of the I.R.E., pages 1098-1102, 1952.
[15] Ramakrishnan, R. and Gehrke, J. Database Management Systems. McGraw-Hill, 3rd edition, 2003.
[16] Raman, V., Swart, G. "Entropy Compression of Relations and Querying of Compressed Relations." In VLDB, 2006.
[17] Shapiro, L. D. "Join processing in database systems with large main memories." ACM Trans. Database Syst. 11(3): 239-264, 1986.
[18] Stonebraker, M., et al. "C-Store: A Column-Oriented DBMS." In VLDB, 2005.
[19] T. P. P. Council. "TPC Benchmark H (Decision Support)," August 2003.
[20] "The Vertica Database Technical Overview White Paper." Vertica, 2007.
[21] Zukowski, M., Heman, S., Nes, N., and Boncz, P. "Super-Scalar RAM-CPU Cache Compression." In ICDE, 2006.
