Vertical Vs Horizontal Partition: In Depth
The Journal of Computer Science and Information Security (IJCSIS) offers a track of quality R&D updates from key experts and provides an opportunity in bringing in the new techniques and horizons that will contribute to advancements in Computer Science in the next few years. IJCSIS scholarly journal promotes and publishes original high quality research dealing with theoretical and scientific aspects in all disciplines of Computing and Information Security. Papers that can provide both theoretical analysis, along with carefully designed computational experiments, are particularly welcome. IJCSIS is published with online version and print versions (on-demand). IJCSIS editorial board consists of several internationally recognized experts and guest editors. Wide circulation is assured because libraries and individuals, worldwide, subscribe and reference to IJCSIS. The Journal has grown rapidly to its currently level of over thousands articles published and indexed; with distribution to librarians, universities, research centers, researchers in computing, and computer scientists. After a very careful reviewing process, the editorial committee accepts outstanding papers, among many highly qualified submissions. All submitted papers are peer reviewed and accepted papers are published in the IJCSIS proceeding (ISSN 1947-5500). Both academia and industries are invited to present their papers dealing with state-of-art research and future developments. IJCSIS promotes fundamental and applied research continuing advanced academic education and transfers knowledge between involved both sides of and the application of Information Technology and Computer Science. The journal covers the frontier issues in the engineering and the computer science and their applications in business, industry and other subjects. (See monthly Call for Papers)
- views:
- 60
- posted:
- 11/24/2011
- language:
- English
- pages:
- 4

(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 10, October 2011
Vertical Vs Horizontal Partition: In Depth
Tejaswini Apte Dr. Maya Ingle Dr. A.K.Goyal
Sinhgad Institute of Business Devi Ahilya VishwaVidyalay Devi Ahilya VishwaVidyalay
Administration and Research Indore Indore
Kondhwa(BK), Pune-411048 maya_ingle@rediffmail.com goyalkcg@yahoo.com
trapte@yahoo.com
Abstract-For the write-intensive operations and predictable HP is superior than VP, at less selectivity when query
behavior of queries, the traditional database system have retrieves more columns with no chaining and the
optimize performance considerations. With the growing data in system is CPU constrained.
database and unpredictable nature of queries, write optimize
system are proven to be poorly designed. Recently, the interest in
Selectivity factor and number of retrieved columns is
architectures that optimize read performance by using Vertically the measure of processing time of VP than HP.
Partitioned data representation has been renewed. In this paper, VP may be sensitive to the amount of processing
we identify the components affecting the performance of needed to decompress a column.
Horizontal and Vertical Partition, with the analysis. Our study Compression ratio may be improved for non-uniform
focusing on tables with different data characteristics and distribution [13]. Research community mainly focused on
complex queries. We show that carefully designed Vertical single predicate with less selectivity, applied to the first
Partition may outperform carefully designed Horizontal column of the table, and the same is retrieved by the query
Partition, sometimes by an order of magnitude.
[12]. We believe that the relative performance of VP and HP
General Terms: Algorithms, Performance, Design is affected by (a) Number of Predicates (b) Predicates
application on columns and Selectivity (c) Resultant Columns.
Keywords: Vertical Partition, Selectivity, Compression, Horizontal Our approach mainly focusing on factors, affecting response
Partition time of HP and VP i.e. (a) Additional Predicate (b) Data
Distribution (c) Join Operation.
I. INTRODUCTION For various applications, it has been observed that VP has
Storing relational tables vertically on disk has been of keen several advantages over HP. We discuss related, existing and
interest as observed in data warehouse research community. recent compression techniques of HP and VP in Section 2.
The main reason lies in minimizing time required for disk Many factors affects the performance of HP and VP. Section 3
reads for tremendously growing data warehouse. Vertical provides the comparative study of performance measure with
Partition (VP) possesses better cache management with less query characteristics. Our approach's implementation detail
storage overhead. For queries retrieving more columns, VP and analysis of the result is presented in Section 4. Finally, we
demands stitching of the columns back together, offset the I/O conclude with a short discussion of our work in Section 5.
benefits, potentially causing a longer response time than the
same query on the Horizontal Partition (HP). HP stores tuples
on physical blocks with slot array, specifies the offset of the II. RELATED WORK
tuple on the page [15]. HP approach is superior for queries, In this section, some existing compression techniques used in
retrieve more columns and on transactional databases. For VP and HP have been discussed briefly along with the latest
queries, retrieves less columns (DSS systems) HP approach methodologies.
may result in more I/O bandwidth, poor cache behavior and
poor compressible ratio [6]. A. Vertical Storage
The VP and HP comparison is presented with C-Store and Star
Current up-gradation of database technology has improved HP Schema Benchmark [12]. VP is implemented using
compression ratio by storing the tuples densely in the block, commercial relational database systems by making each
with poor updatable ratio and improved I/O bandwidth than column its own table. The idea presented had to pay more
VP. To achieve degree of HP compression close to entropy of performance penalty, since every column must have its own
table, skewed dataset and advance compression techniques row-id. To prove the superiority of HP over VP, analysis has
opened the research path for response time of queries and HP done by implementing HP in C-store (VP database).
performance for DSS systems [16]. Compression, late materialization and block iteration were the
base of measure for the performance of VP over HP.
Previous research shown results relevant to this paper are:
98 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 10, October 2011
With the given workload, compression and late Dictionary cache line is bigger than processors L1 data cache
materialization improves performance by a factor of two and (b) Index size is larger than value and (c) Un-encoded column
three respectively [12]. We believe these results are largely size is smaller than the size of the encoded column plus the
orthogonal to ours, since we heavily compress both the HP size of the dictionary [3].
and VP and our workload does not lend itself to late
materialization of tuples. “Comparison of Row Stores and Delta coding
Column Stores in a Common Framework” mainly focused on The data is stored, as the difference between successive
super-tuple and column abstraction. Slotted page format in HP samples (or characters). The first value in the delta encoded
results in less compression ratio than VP [10]. Super-tuples file is the same as the first value in the original data. All the
may improve the compression ratio by storing rows with one following values in the encoded file are equal to the difference
header with no slot-array. Column abstraction avoids storing (delta) between the corresponding value in the input file, and
repeated attributes multiple times by adding information to the the previous value in the input file. For uniform values in the
header. Comparison is made over varying number of columns database, delta encoding for data compression is beneficial.
with uniformly distributed data for VP and HP, while Delta coding may be performed on both column level and
retrieving all columns from table. tuple level. For unsorted sequence and size-of(encoded) is
larger than size-of(un-encoded), delta encoding is less
The VP concept has implemented in Decomposition storage beneficial [3].
model (DSM), with storage design of (tuple id, attribute
values) for each column (MonetDB) [9]. C-Store data model Run Length Encoding (RLE)
contains overlapping projections of tables. L2 cache behaviour The sequences of the same data values within a file is replaced
may improved by PAX architecture, focused on storing tuples by a count number and a single value. RLE compression
column-wise on each slot [7], with penalty of I/O bandwidth. works best for sorted sequence, long runs. RLE is more
Data Morphing improves on PAX to give even better cache beneficial for VP [3].
performance by dynamically adapting attribute groupings on
the page [11]. C. Query Parameters and Table Generation
To study the effect of queries with table characteristics,
B. Database Compression Techniques queries were tested with varying number of predicates and
Compression techniques in database is mostly based on slotted selectivity factor. Factors affecting the execution plan and cost
page HP. Compression ratio may be improved up to 8-12 by are (a)Schema definition (b) Selectivity factor (c) Number of
using processing intensive techniques [13]. VP compression columns referenced (d) Number of predicates. The execution
ratio is examined by “Superscalar RAM-CPU Cache time of a query change with column characteristics and I/O
Compression” and “Integrating Compression and Execution bandwidth. For each characteristic of column, the query
in Column-Oriented Database Systems” [21, 3]. Zukowski generator randomly selects the columns used to produce a set
presented an algorithm for compression optimization the of “equivalent” queries with the cost analysis [12].
usability of modern processor with less I/O bandwidth. Effect Performance measure with compression is implemented by:
of run lengths on degree of compression and dictionary Generation of uncompressed HP version of each
encoding proven to be best compression scheme for VP [3]. table with primary key on left most column.
Sorted on columns frequently used in query.
Replica is generated on VP.
III. PERFORMANCE MEASURING FACTORS
Our contribution to existing approach is based on the major
factors affecting the performance of HP and VP (a)Data IV. IMPLEMENTATION DETAIL
Distribution (b)Cardinality (c)Number of columns To study the effect of VP and HP, the experiments are done
(d)Compression Technique and (e) Query nature. against TPC-H standard Star-Schema on MonetDB.
We mainly concentrated on the fact table i.e. Sales, contains
A. Data Characteristics approximately 10L records. We focused on five columns for
The search time, and performance of two relational tables selectivity i.e. prod_id, cust_id, time_id, channel_id, promo_id
varies with number of attributes, data type of each attribute with selectivity varying from 0.1 to 50%.
along with the compression ratio, column cardinality and SELECT p.product_name,ch.channel_class,
selectivity. c.cust_city, t.calendar_quarter_desc,
SUM(s.amount_sold) sales_amount
B. Compression Techniques FROM sales s, times t, customers c, channels ch,
Dictionary based coding products p, promotions pr
The repeated occurrences are replaced by a codeword that WHERE s.time_id = t.time_id
points to the index of the dictionary that contains the pattern. AND s.prod_id=p.prod_id
Both code words and uncompressed instructions are part of AND s.cust_id = c.cust_id
compressed program. Performance penalty occurs for (a) AND s.channel_id = ch.channel_id
99 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 10, October 2011
AND s.promo_id=pr.promo_id Predicate Selectivity( No Of HP(time VP(time
AND c.cust_state_province = 'CA' %) Rows in sec) in sec)
Prod_id Compresse 1000000
AND ch.channel_desc in ('Internet','Catalog') d(50) 3 14
AND t.calendar_quarter_desc IN ('1999-Q1','1999- Cust_id 25 1000000
Q2') 45 10
GROUP BY ch.channel_class,p.product_name Time_id 10 10,0000
c.cust_city, t.calendar_quarter_desc; 0 40 20
Promo_i 1 1000000
Table 1: Generalized Star-Schema Query d 35 20
Channel 0.1 1000000
_id 30 30
A. Read-Optimized Blocks (Pages)
The HP and VP, dense pack the table on the blocks to achieve
less I/O bandwidth. With varying page size HP keeps tuples
together, while the VP stores each column in a different file.
The different entries on the page are not aligned to byte or
word boundaries in order to achieve better compression. Each
page begins with the page header, contains number of entries
on the page, followed by data and compression dictionary.
The size of the compression dictionary is stored at the very
end of the page, with the dictionary growing backwards from
the end of the page towards the front. For the HP, the
dictionaries for the dictionary-compressed columns are stored
sequentially at the end of the page.
B. Query Engine, Scanners and I/O
The query scanner scans the files differently for HP and VP.
Materialization of results are done after reading the data and
applying predicates to it, with minimum passes in HP than
VP, which requires reading multiple files for each column Figure 1: Time measurement for HP and VP with varying selectivity and
Compression
referenced by query. Predicates are applied on a per-column
basis, columns are processed by order of their selectivity, most Effect of compression
selective (with the fewest qualifying tuples) to least selective For skew data distribution and large cardinality in HP, run-
(the most qualifying tuples). Placing the most selective length and dictionary compression techniques are more
predicate first allows the scanner to read more of the current beneficial. The size of VP tuple is approximately same as size
file before having to switch to another file, since the output of HP tuple. HP compression is a critical component in
buffer fills up more slowly. determining its performance relative to that of the VP.
Compression is more beneficial for columns having high
C. Experimental Setup cardinality. For compression, some VP proponents have
All results were run on a machine running RHEL 5 on a 2.4 argued that, since VP compress better than HP, storing the
GHz Intel processor and 1GB of RAM. HP and VP are data with multiple projections and sort orders are feasible and
affected by the amount of I/O and processing bandwidth can provide even better speedups [18].
available in the system; for each combination of output
selectivity and number of columns accessed. Effect of Joins
We examined join operations for query presented in table 1,
Effect of selectivity with varying predicates over HP and VP, to analyze the
Selecting fewer tuples with very selective filter and index has interaction of resultant tuple with join (e.g. more instruction
no effect on I/O performance, system time remains the same. cache misses due to switching between scanning and
The HP remains the same, since it has to examine each tuple reconstructing tuples and performing the join).
in the relation to evaluate the predicate. For the VP evaluating Compression improves the performance by decreasing I/O
the predicate requires more time. With decrease in selectivity bandwidth and increasing scan time, as the columns selection
VP and HP performance ratio is less. However as selectivity ratio grows. Unlike compression, cost of join operation has
increases towards 100%, each column scan contribute in CPU increased with increased list of selected columns. The HP
cost. The VP is faster than HP when more columns are outperforms the VP as number of accessed columns is more.
returned with the selectivity factor from 0.1% to 25%. Further The join component of the time is always roughly equivalent
with same configuration compressed HP will speed up by 4 in between the HP and VP (Figure 2). Thus, the paradigm with
VP (Figure 1). the smaller scan time will also have the smaller join time, and
100 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 10, October 2011
the join time is greatly affected by the number of joined tuples more of a factor in VP for queries with more predicates,
materialized, number of passes are required, the type of join lower selectivity and more columns referenced. HP on slotted
operation. pages will most likely never beat VP for read-optimized
workloads.
Join VP (time HP (time
Operations in sec) in sec)
Hash Join 27 30
REFERENCES
Merge 38 35 [1]https://www.cs.hut.fi/Opinnot/T106.290/K2005/Ohjeet/Zipf.html.
Join Accessed November 8,2007.
Nested 30 34 [2] http://db.lcs.mit.edu/projects/cstore/. Accessed November 8,
Loop Join 2007.
[3] Abadi, D. J., Madden, S. R., Ferreira, M. C. “Integrating
Compression and Execution in Column-Oriented Database
Systems.” In SIGMOD, 2006.
[4] Abadi, D.J., Madeen, S. R., Hachem, N. “Column-Stores vs.
Row-Stores: How Different Are They Really?” In SIGMOD,
2008.
[5] Abadi, D. J., Myers, D.S., DeWitt, D.J., Madden, S.R.
“Materialization Strategies in a Column-Oriented DBMS.”
In ICDE, 2007.
[6] Ailamaki, A. Architecture-Conscious Database Systems.
Ph.D. Thesis, University of Wisconsin, Madison, WI, 2000.
[7] Ailamaki, A., DeWitt, D. J., Hill, M. D., and Skounakis, M.
“Weaving Relations for Cache Performance.” In VLDB,
Figure 2: Performance of Join Operation in HP and VP 2001.
[8] Boncz, P., Zukowski, M., and Nes, N. “MonetDB/X100:
Hyper-Pipelining Query Execution.” In CIDR, 2005.
Analysis [9] Copeland, A. and Khoshafian, S. “A Decomposition Storage
Our analysis focuses tuple-at-a-time paradigm. The cost for Model.” In SIGMOD, 1985.
each tuple evaluation is the minimum of CPU processing and [10] Halverson, A. J., Beckmann, J. L., Naughton, J. F., DeWitt,
D. J. “A Comparison of C-Store and Row-Store in a
Disk bandwidth. Performance of the database depends on size Common Framework.” Technical Report, University of
of input (SOI). For any query, Wisconsin-Madison, Department of Computer Sciences,
T1666, 2006.
Total Disk Rate (TDR) = SOI1/TOS+ … +SOI n/TOS [11] Hankins, R. A., Patel, J. M. “Data Morphing: An Adaptive,
Cache-Conscious Storage Technique.” In VLDB, 2003.
[12] Harizopoulos, S., Liang, V., Abadi, D., and Madden, S.
For more columns, HP outperforms the VP. CPU cost “Performance Tradeoffs in Read-Optimized Databases.” In
measured by search and operations time on the query. VLDB, 2006.
Thus it is, [13] Holloway, A. L., Raman, V., Swart, G. and DeWitt, D. J.
“How to Barter Bits for Chronons: Compression and
Cost (CPU) = Cost (Operations)||Cost(Scan) Bandwidth Trade Offs for Database Scans.” In SIGMOD,
Rate of an operator 2007.
OP=time/no. of CPU instructions [14] Huffman, D. “A Method for the Construction of Minimum-
Redundancy Codes.” In Proceedings of the I. R. E., pages
V. CONCLUSION 1098-1102, 1952.
[15] Ramakrishnan, R. and Gehrke, J. Database Management
We summaries the following points: Systems. McGraw-Hill, 3rd edition, 2003.
A. The selectivity of predicate can substantially change [16] Raman, V., Swart, G. “Entropy Compression of Relations
the relative performance of HP and VP. and Querying of Compressed Relations.” In VLDB, 2006.
[17] Shapiro, L. D. “Join processing in database systems with
B. HP performs better compared to VP, when most of large main memories.” ACM Trans. Database Syst. 11(3):
the columns are required by the query. 239-264 (1986).
C. Adding predicates increases VP run times. [18] Stonebraker, M., et al. “C-Store: A Column-Oriented
D. Joins do not change the relative performance of HP DBMS.” In VLDB, 2005.
[19] T. P. P. Council. “TPC Benchmark H (Decision Support),”
and VP. http://www.tpc.org/tpch/default.asp, August 2003.
[20] “The Vertica Database Technical Overview White Paper.”
VP outperforms a HP when I/O is a dominating factor in Vertica, 2007.
query plan and for less columns selection. For HP with [21] Zukowski, M., Heman, S., Nes, N., and Boncz, P. “Super-
Scalar RAM-CPU Cache Compression.” In ICDE, 2006.
compression, I/O becomes less of a factor and CPU time is
101 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
Get documents about "