rolap definition

Reviews
Shared by: Randy Couture
Stats
views:
97
rating:
not rated
reviews:
0
posted:
1/16/2009
language:
pages:
0
Answering Data Cube Queries Using Families of Statistics Trees Lixin Fu and Joachim Hammer Dept. of Computer & Information Science & Engineering University of Florida Gainesville, FL 32605 (lfu,jhammer}@cise.ufl.edu Abstract In data warehouse and OLAP, the modeling and implementation of Data Cube operation have been a major concern. A Cube query is to compute the aggregates of measures over an arbitrary combination of dimensions in a relational warehouse. It often needs multiple table joins and multiple scans. How to speed up Cube computation will have great influence on DSS (Decision Support System). CUBIST (CUBIng with Statistics Trees) algorithm evaluates an ad-hoc Cube query after the statistics trees are initialized by scanning the original table. In this paper, further Cube query optimizations over CUBIST will be given. Especially, the Data Cube operator is generalized so that an arbitrary navigation over the abstraction hierarchies from different dimensions is much more efficient. Experiments demonstrated the effectiveness of these optimizations. 1. Introduction Data warehouses and related OLAP (On-line Analytical Processing) technologies [5, 6] continue to receive strong interest from the research community as well as from industry. A data warehouse contains data from a number of independent sources, integrated and cleansed to support clients who wish to analyze the data for trends and anomalies. The decision support is provided by OLAP tools which present their users with a multi-dimensional perspective of the data in the warehouse and facilitate the writing of reports involving aggregations over the various attributes of the data set [7]. Since OLAP queries are complex and the warehouse database is very large, processing the queries quickly is an important prerequisite for building efficient decision support systems (DSS). The database size often grows to hundreds of GBs or TBs with millions or even billions of records with high dimensionality and large domain sizes. For example, a data warehouse containing sales information for automobiles across the US may be used to answer queries such as ―How many red Toyota cars have been sold in Florida and Georgia between 1990 and 2000?‖ Answering this query requires the selection of tuples satisfying the specified conditions as well as the aggregation of the sales data along the location, time, and product attributes. On any but the smallest databases, these kinds of queries will tax even the most sophisticated query processors. Current warehouses make use of specialized index structures (e.g., bitmap [ref]), pre-computation of partial results (e.g., view materialization [ref]), and special data structures for storing the warehouse (e.g., MOLAP [ref]) in order to speed up the computation of the result. There are many algorithms for evaluating OLAP and cube queries, but no existing indexing and query optimization performs sufficiently well for high dimensional data [3]. In addition, current systems suffer from the fact that the types of supported OLAP queries must be known beforehand in order to set up the appropriate index structures and assure that relevant partial results have been pre-computed. This limits the analysts‘ ability to navigate freely through the entire data set, in order to get a high-level overview of possible trends and patterns, for example. In this paper we introduce a novel approach to answering OLAP queries efficiently based on a new implementation of data cubes. Specifically, our query answering algorithm allows almost free (ad-hoc) navigation through the cube and can be used to answer an important subclass of OLAP queries which we term cube queries. Cube queries return unary results as opposed to relations and are useful getting a highlevel overview of the data set. We have already reported on the basic data structure to answer simple cube queries (Cubing with Statistics Trees) in a recent workshop paper [ref]. The main contribution of this paper is a comprehensive, new methodology to answer cube queries with arbitrary aggregations over any subset of the underlying dimensions of the data cube. The remainder of the paper is organized as follows. Section 2 reviews current and past research activities related to the work presented here, focusing chiefly on OLAP query processing, indexing and view materialization in data warehouses. In section 3, we introduce fundamental concepts including statistics trees, hierarchies, and navigation operators. Section 4 describes our approach to supporting adhoc navigation through the data cube using a family of statistics trees. In Section 5 we describe our cube query answering algorithm and its capabilities when integrated into a data warehouse. A description of our experimental system and the results of our evaluation are presented in section 6. A summary and concluding remarks are presented in section 7. 2. State-of-the-Art Research related to this work falls into three broad categories: OLAP servers including ROLAP and MOLAP, indexing, and view materialization in data warehousing. 2.1. ROLAP Servers Two approaches to implementing data cubes have emerged: the multidimensional approach (MOLAP) uses proprietary data structures to store the data cube. The relational approach (ROLAP) on the other hand, stores data in relational tables using a star or snowflake schema design [6]. In the star schema, there is a fact table plus one or more dimension tables. The snowflake schema is a generalization of the star schema where the core dimensions have aggregation levels of different granularities. In the ROLAP approach, cube queries are translated into relational queries against the underlying star or snowflake schema using the standard relational operators such as selection, projection, relational join, group-by, etc. However, directly executing translated SQL can be very inefficient and as a result, many commercial ROLAP servers extend SQL to support important OLAP operations directly (e.g., RISQL from Redbrick Warehouse [28], cube operator in Microsoft SQL Server [21]). A simple algorithm 2 N-algorithm for evaluating the cube operator is proposed in [11]. In this algorithm, where N is the number of dimensions, a handle is allocated for each for each cell of the data cube. For each new record (x1,x2,…,xN,v) the handle function is called 2 N times – once for each handle of each cell of the cube matching this value. Here, xi are the dimension values and v is the measure. When all input values have been processed the final aggregate for each of the nodes in the cube is computed. Due to the large number of handles, this algorithm does not scale well for large N. To speed up the group-by‘s, indices and materialized views are widely used. As far as we know, there is no internal ROLAP algorithm in the literature for evaluating cube queries efficiently. MicroStrategy [22], Redbrick [27], Informix's Metacube [16] and Information Advantage [15] are examples of ROLAP servers. 2.2. MOLAP Servers MOLAP servers use multidimensional arrays as the underlying data structure. MOLAP is often several orders faster than the ROLAP alternative when the dimensionality and domain size are relatively small compared to the available memory. However, when the number of dimensions and their domain sizes increase, the data becomes very sparse resulting in many empty cells in the array structure. Storing sparse data in an array in this fashion is inefficient. A popular technique to deal with the sparse data is chunking [31]. The full cube (array) is chunked into small pieces called cuboids. For a non-empty cell, a(OffsetInChunk,data) pair is stored. Zhao et. al. describe a single pass, multi-way algorithm that overlaps the different group-by computations to minimize the memory requirement. The authors also give a lower-bound for the memory which is required by the minimal memory spanning tree (MMST) of the optimal dimension order (which increases Fu & Hammer 2 with the domain sizes of these dimensions). Their performance evaluations show that a MOLAP server using an appropriate chunk-offset compression algorithm is much faster than most ROLAP servers. However, if there is not enough memory to hold the MMST, several passes over the input data are needed. In the first read-write pass, data is partitioned. In the second read-write pass, the partitions are clustered further into chunks. Additional passes may be needed to compute all aggregates in the MMST execution plan. In this case, the initialization time may be prohibitively large. In addition, since the materialized views reside on disk, answering OLAP queries may require multiple disk I/Os. To address the scalability problem of MOLAP, Goil and Choudhary proposed a parallel MOLAP infrastructure called PARSIMONY [9, 10]. Their algorithm incorporates chunking, data compression, view optimization using a lattice framework, as well as data partitioning and parallelism. The chunks can be stored as multi-dimensional arrays or (OffsetInChunk,data) pairs depending on whether they are dense or sparse. The OffsetInChunk is bit-encoded (BESS). However, like other MOLAP implementations, the algorithm still suffers from high I/O costs during aggregation because of frequent paging operations that are necessary to access the underlying data. In general, ROALP is more scalable in terms of the data size, while MOLAP has better performance when the number of dimensions is small. However, the decision of whether to use ROLAP or MOLAP does not only depend on the original data size but also the data volume which is defined as the product of the cardinalities. To illustrate our point, consider a database with 50 billion records and three dimensions each having 100 values. Suppose that each row needs 4*4 bytes (assuming an integer uses 4 bytes and the table has one measure), then the data size is 16*50*10 9 = 800GB but the data volume is 100*100*100 = 1M. In this case, MOLAP would a better choice. Arbor software's Essbase [2], Oracle Express [25] and Pilot LightShip [26] are based on MOLAP technology. The latest trend is to combine ROLAP and MOLAP in order to take advantage of the best of both worlds. For example, in PARSIMONY, some of the operations within sparse chunks are relational while operations between chunks are multidimensional. 2.3. Work on Indexing Specialized index structures are another way to improve the performance of OLAP queries. The use of complex index structures is made possible by the fact that the data warehouse is a ―read-mostly‖ environment in which updates are performed in large batch processes. This allows time for reorganizing the data and indexes to a new optimal clustered form. When the domain sizes are small, a bitmap index structure [24] can be used to help speed up OLAP queries. A bitmap index for a dimension with m values generates m bitmaps (bit vectors) of length N, where N is the number of records in the underlying table. To initialize a bitmap index on a particular attribute (dimension) of a table, we set the bits in each bitmap as follows: for each record we indicate the occurrence of a particular value with a 1 in the same row of the bitmap that represents the value; the bits in all other bitmaps for this row will be set to 0. Bitmap indexes use bit-wise logical AND, OR, NOT operations to speed up the computation of the where-clause predicates in queries. However, simple bitmap indexes are not efficient for large-cardinality domains and large range queries. In order to overcome this deficiency, an encoded bitmap scheme has been proposed [4]. Suppose a dimension has 1,024 values. Instead of using 1,024 bit vectors most rows of which are zero, log 1024 = 10 bit vectors are used plus a mapping table, and a Boolean retrieve function. A well-defined encoding can reduce the complexity of the retrieve function thus optimizing the computation. However, designing well-defined encoding algorithms remains an open problem. Bitmap schemes are a powerful means to evaluate complex OLAP queries when the number of records is small enough so that the entire bitmap fits into main memory. If not all of the bitmaps fit into memory (e.g., when the number of records is large), query processing will require many I/O operations. Even when all the bitmaps fit into memory, the runtime of an algorithm using bitmap indexes is proportional to the number of records in the table. Later in the paper, we show that the runtime of a Fu & Hammer 3 bitmap-based algorithm is much larger than the runtime of CUBIST which has a worst-case runtime proportional to the number of dimensions of the data cube. A good alternative to encoded bitmaps for large domain sizes is the B-Tree index structure [8]. O‘Neil and Quass [23] provide an excellent overview of and detailed analyses for index structures which can be used to speed up OLAP queries. 2.4. View Materialization View materialization in decision support systems refers to the pre-computing of partial query results which may be used to derive the answer for frequently asked queries. Since it is impractical to materialize all possible views, view selection is an important research problem. For example, [14] introduced a greedy algorithm for choosing a near-optimal subset of views from a view materialization lattice based on user-specified criteria such as available space, number of views, etc. In their approach, the next view to be materialized is chosen such that its benefit is maximal among all the non-materialized views. The computation of the materialized views, some of which depend on previously materialized views in the lattice, can be expensive when the views are stored on disk. More recently [13, 17, 18], for example, developed various algorithms for view selection in data warehouse environments. Another optimization to processing OLAP queries using view materialization is to pipeline and overlap the computation of group-by operations to amortize the disk reads, as proposed by [1]. To illustrate, assume a table has four dimensions A, B, C, and D. The computation of the following group-by operations can be pipelined: ABCD  ABC  AB A (i.e., from view ABCD we can compute view ABC form which view AB can be computed and so on). When a data partition is computed, the memory can be reused for the next partition. Interleaved with the pipelined aggregations are sorting or hashing operations on some nodes on the minimal cost processing tree. When the data set is large, external sorting is required which often needs multiple passes and is very expensive. Other related research in this area has focused on indexing pre-computed aggregates [29] and incrementally maintaining them [20]. Also relevant is the work on maintenance of materialized views (see [19] for a summary of excellent papers) and processing of aggregation queries [12, 30]. However, in order to be able to support true ad-hoc OLAP queries, indexing and pre-computation of results alone will not produce good results. For example, building an index for each attribute of the warehouse or pre-computing every sub cube requires a lot of space results in high maintenance cost. On the other hand, if we index only some of the dimensions or pre-compute few views, queries for which no indexes or views are available will be slow. In the remainder of this paper we introduce a new approach which may bring us closer to the goal of being able to answer ad-hoc OLAP queries efficiently. Sampling techniques are also sometimes used with large data sets. However, the accuracy of the result may not be acceptable to many applications. Instead, many systems resort to parallel computing to achieve fast query response times, relying on additional hardware and software support. 3. Basic Concepts 3.1. Multi-dimensional Hierarchies Users of data warehouses frequently like to ―visualize‖ the data as a multidimensional “data cube” to facilitate OLAP. This so-called dimensional modeling allows the data to be structured around natural business concepts, namely measures and dimensions. Measures are numerical data being tracked (e.g. sales). Dimensions are the natural business parameters that define the individual transactions (e.g. time, location, product). Some dimensions may have hierarchies. Using our car sales example, a hierarchy on the time dimension could be: year—quarter—month: For each year (e.g., 1995, 1996, …), there are four quarters Q1, Q2, Q3, Q4. For each quarter, there are three months (January – March, April – June, etc.). In order to represent a hierarchy more efficiently, we map its members to integer values as shown below: Fu & Hammer 4 month 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 quarter 1 2 3 4 1 2 3 4 1 year 1 2 ... Figure 1: Insert caption here. We can also represent hierarchies as a table in which the columns represent the hierarchies and the rows contain the hierarchy values. For example, in the location hierarchy for our car sales example can be represented as a table with three columns ―City‖, State‖, and ―Region‖ as shown in Table ?? below. It is important to note that while some dimensions have static domain values and hierarchical inclusion relationships (e.g., time), many other do not (e.g., location). In that case, in order to discover the hierarchy information for our query answer algorithm, we must establish the partitioning and mapping during the load of the dimension tables [[this comes out of the blue; we need to tell readers why we are loading and what]]. Let us take the location dimension as an example. Table 1: Location dimension loc_key City 1 Gainesville 2 Atlanta 3 Los Angeles 4 Chicago 5 Miami 6 San Jose 7 Seattle 8 Twin City State Region FL SE GA SE CA WC IL MW FL SE CA WC WA WC MN MW Suppose Table 1 is part of the original location dimension table that contains location information and a location key (loc_key). Table 2 is the result of sorting along column Region, State and City and then mapping the resulting values to integers (shown in parenthesis). This process is equivalent to executing the following SQL statement on the location dimension table: SELECT DISTINCT Region, State, City FROM Location GROUP BY Region, State, City ORDER BY Region, State, City; Fu & Hammer 5 Table 2: Partitioning and Mapping of Location Region State MW (1) IL (1) MN (2) SE (2) FL (3) GA (4) City Chicago (1) Twin City (2) Gainesville (3) Miami (4) Atlanta (5) WC (3) CA (5) Los Angeles (6) San Jose (7) WA (6) Seattle (8) Table 3 results from the mapping of table 1 column values. The mapping itself can be represented and stored as an encoding index structure e.g. trie or dictionary (string  integer) and a decoding data structure such as arrays (integer  string). The encoding mapping is very important through which strings or other types in the user input queries will be transformed into internal integer representation to facilitate CubiST application. Table 2 can also be directly implemented as a multi-way tree. Table 3: Location after transformation loc_key 1 2 3 4 5 6 7 8 City 3 5 6 1 4 7 8 2 State 3 4 5 1 3 5 6 1 Region 2 2 3 1 2 3 3 1 Notice that we only need to store Table 2 and Table 3 because they are reversible transformations of Table 1. Furthermore, we only store the finest level column; other level columns are implied through the mappings resulting in a potentially very large compression of dimensional data. In the example, column 3 and 4 in Table 3 will not be stored. In addition, instead of storing strings like ―Gainesville‖ (consuming 11 bytes), we only store integers, consuming at most two bytes if there are no more than 65,536 cities in the data warehouse. Though we still store the mapping, only the transformed tables (e.g. Table 3) are read during the initialization of STs. The data compression for the transformed tables significantly reduced the runtime of joining operations in the initialization phase. Some attributes have complex hierarchies, where multiple paths from high level concepts to low level concepts exist. We can decompose complex hierarchies into multiple simple hierarchies. For example, the following complex time hierarchy, represented in a lattice structure, can be decomposed into two simple hierarchies: ―year—quarter—month—day‖ and ―year—week – day.‖ Fu & Hammer 6 year quarter week month day Figure 2: Insert caption here. Based on the example presented above, we define the following: P is a partition for set V, iff  V1, V2, …, Vk, where ViV, such that: (i) ViVj =  (ii) Vi = V Higher level values are just the labels of the subsets of lower level domain values that form a partition. The higher level values as a set can further be partitioned. In this way, the hierarchical partitioning structure of the domain values is the concept of attribute/dimension hierarchy. [[Lixin, this section is not well motivated and this last definition is not used anywhere in this section. Please integrate this better]] 3.2. Cube Queries OLAP queries select data that is represented in various 2-D, 3-D, or even higher-dimensional regions of the data cube, called subcubes. Slicing, dicing, rolling-up, drilling-down, and pivoting are typical operators found in OLAP queries. The data cube operator, which was proposed in [11], contains these operators and generalizes aggregates, subtotals, cross tabulations, and group-bys. Here, we further generalize the cube operator so that each selected dimension set in the query can be a value, a range, or an arbitrary subset of domains. We term this new operation cube query (a.k.a. cube operation or cubing). A Cube query q of degree r is an aggregate operation that is performed on the cells of a k-dimensional data cube. The query q is a tuple of the form: q  ( si1 , si2 ,..., sir ) , where {i1 , i2 ,..., ir }  {1,2,..., k} and r is the number of dimensions specified in the query. The degree of the query is the number of dimensions specified in the query. Each selection si j can be one of the following (in decreasing order of generality). Let w  i j : 1. A partial selection, {t1,t2,…,tr}, 2  r  dw, ti  {1,2,…,dw}, specifying any subset of all domain values for dimension ij. 2. A range selection [a,b], specifying a contiguous range in the domains of some of the attributes, a, b  [1..d w ] , [a, b]  [1, d w ] . 3. A singleton value a, a  [1..d w ] . Note, when r  k, some of the dimensions may be collapsed. If r = k, we say the query is in normal form. A query can be transformed into its normal form by adding collapsed dimensions in the query using ―ALL‖ values. A query is partial iff j such that si j is a partial selection. By convention, the integer values in the partial set are not contiguous. Consequently, a query is a range query iff j such that si j is a range selection. Fu & Hammer 7 Finally, a query is a singleton query iff j si j is a singleton value including the ―ALL‖ value. It is worth noting that a singleton query represents a subcube. Continuing our car sales example from the beginning, assume that the domain of manufacturer is {Ford, GM, Honda, Toyota}, of color is {blue, red, white, black}, of style is {sports car, sedan, SUV}, of time is {Jan, Feb, …, Dec} and of location is {Florida, Alabama, Georgia, South Carolina, Tennessee}. Then a formal representation of the sample cube query ―How many Toyota cars have been sold from January to March this year in Florida and Georgia?‖ is q  ( s1 , s4 , s5 ) = (4, [1,3], {1,3}). In this example, q is a partial query of degree r = 3 and its normal form is (4, ALL, ALL, [1 3], {1,3}). 3.3. Statistics Trees A Statistics Tree (ST) is a multi-way tree structure which holds aggregate information (e.g., SUM, AVG, MIN, MAX) for one or more attributes over a set of records (e.g., tuples in a relational table). The structure of a Statistics Tree is similar to that of the B-Tree where information is only stored in the leaf nodes of the tree; internal nodes are used as branch nodes containing pointers to subtrees. We have developed statistics tress as a superior data structure for storing data cube information that allows efficient answering of cube queries. Let us assume R is a relational table with attributes A1,A2,…,Ak with cardinalities d1,d2,…,dk respectively. In keeping with standard OLAP terminology, we refer to the attributes of R as its dimensions. The structure of a k-dimensional statistics tree for R is determined as follows:  The height of the tree is k+1, its root is at level 1.  Each level in the tree (except the leaf level) corresponds to an attribute in R.  The fan-out (degree) of a branch node at level j is dj+1, where j = 1,2,...,k. The first dj pointers point to the subtrees which store information for the jth column value of the input data. The (dj+1)th pointer is called star pointer which leads to a region in the tree where this domain has been ―collapsed,‖ meaning it contains all of the domain values for this dimension. This ―collapsed‖ domain is related to the definition of super-aggregate (a.k.a ―ALL‖) presented in Gray et al. [11].  The leaf nodes at level k+1 contain the aggregation information and form a linked list. Figure 3 depicts a sample statistics tree for a relational table with three dimensions A1, A2, A3 with cardinalities d1=2, d2=3, and d3=4 respectively. The letter ‗V‘ in the leaf nods indicates the presence of an aggregate value (as opposed to the interior nodes which contain only pointer information). Root Level 1 • • • * Level 2 Star Pointer • • • • • • • • • • • • * Interior Nodes * Level 3 ... • • • • • ... * ... • • • • • * Level 4 V V V ... Leaf Nodes ... V V V Figure 3: Example of a statistics tree for a data set with 3 dimensions. Fu & Hammer 8 Although this data structure is similar to multidimensional arrays and B-trees, there are significant differences. For example, a multi-dimensional array does not have a star-pointer although the extension ―ALL‖ to the domain of a dimension attribute has been used in the query model and summary table described in [11]. Hence one can regard the Statistics Tree as a generalization of the multidimensional array data structure. Looking at the B +-Tree, for example, it too, stores data only at the leaf nodes (or pointers to pages containing data) which linked together to form a list. However, a B-tree is an index structure for one attribute (dimension) only, as opposed to a Statistics Tree which can contain aggregates for multiple dimensions as shown in Figure 3. In addition, in the B-Tree, the degree of the internal nodes is restricted. The Statistics Tree on the other hand is naturally balanced (it is always a full tree) and its height is based on the number of dimensions but independent of the number of records in the input data set. Next, we describe an algorithm for updating the aggregate information that is stored in the leaf nodes of a statistics tree. Note, in case the elements of the database records are real numbers or integers not in [1..di], i=1,...,k, we can devise a mapping F: t  t '  (t '1 , t ' 2 ,..., t ' k ) , such that t 'i  [1..d i ] . To find out more details about maintaining statistics trees incl. inserting and deleting records, please refer to [ref]. 3.4. Navigation To navigate freely on the aggregations, queries can be constrained on any combination of the dimensions. For any constrained dimension with a concept hierarchy, the constraints can be in any abstract level. For a particular level, the selected values can form any subset of the domain values in that level. In addition, the aggregate functions can be different. A1 Level 1 A2 Level 1 ... Ak Level 1 Level 2 Level 2 Level 2 Level L1 Level L2 Level Lk Figure 4: Insert Caption here. Theorem: (number of level combinations) If in a group, there are k attributes A1, A2, …, Ak with L1, L2,…, Lk levels in their hierarchies respectively, then the number of level combinations is For ith attribute (i=1,2,…,k), there are Li+1 choices of the levels. The additional choice is to choose the attribute or not so that the combinations of the attributes are also counted in. According to composition theory, the total number of combinations is the product of the number of choices in the attributes. Theorem: (number of combinations for selected sets at lowest level domain) If in a group, there are k attributes A1, A2, …, Ak with cardinalities d1, d2,…, dk respectively in their finest levels, then the number of potential queries is 2d1 2d2…2dk. Definition: when one or/and two of the following cases happen, a rolling-up operation is performed: one or more dimensions go up one or more levels along their abstract hierarchy ladders and no dimension goes down at the same time. leave out one or more constrained dimensions. Fu & Hammer 9 Similarly, we can define a drilling-down operation. Rolling-up and drilling-down are two common operators in OLAP. In strict sense, they are not new operators but a cube operation relative to another cube operation. However, the requirement of free navigation on the hierarchies adds additional complexity for implementing OLAP systems. Our objective is to design new algorithms that can improve the efficiency of free aggregate navigating. OLAP queries select data that is represented in various 2-D, 3-D, or even higher-dimensional regions of the data cube, called subcubes. Slicing, dicing, rolling-up, drilling-down, and pivoting are typical operators found in OLAP queries. The data cube operator, which was proposed in [11], contains these operators and generalizes aggregates, subtotals, cross tabulations, and group-bys. We have further generalized the cube operator so that each selected dimension set in the query can be a value, a range, or an arbitrary subset of domains. We term this new operation cube query (a.k.a. cube operation or cubing). and have provide a formalism for this class of queries in [ref]. Continuing our example form the introduction, our sample warehouse containing information on car sales may have a measure called ―sales‖ and five dimensions called ―manufacturer,‖ ―color,‖ ―style,‖ ―time,‖ and ―location.‖ It is worth pointing out that there is a subtle difference between cube queries and OLAP queries. Cube queries return only aggregate information while the latter may also return the detailed records that satisfy the query conditions. Using the sample car sales example from above, an OLAP query may also return the individual sales records that contributed to the result rather than only the aggregated total sales value. 4. A Family of Statistics Trees to Support Free Navigation in the Cube In the statistics tree framework, CUBIST can freely navigate on the hierarchies internally, provided that the core statistics trees include the lowest levels which the queried levels are at least the same as or above. To implement free navigation using CUBIST directly, first transform the conditions expressed in different abstract level to the conditions expressed in the lowest levels. We can map the higher level values to lower level values according to the mapping functions (e.g. Table 2 or corresponding multi-way tree). The hierarchy navigating queries are then just simply regarded as range queries or partial queries that can be answered directly by CUBIST. Even though these queries can be answered in memory, in the worst case, answering them may need to walk through most part of the statistics trees. In the car sales example, suppose that the time dimension has a year-month-day hierarchy and the location hierarchy is region-state-city. The user Cube query is "How many cars were sold in southeast in 1998?" Assume that southeast region have 150 cities in the database and the records are in finest granularities, answering the query is equivalent to answering 150*365 = 54,750 singleton queries. In the real world applications, multiple STs may be used thus requiring the page-in and page-out of the leaves in the STs. If the STs are large, the I/O overhead can be also large. CubiST should be further optimized to cut down the overhead. The idea is to materialize some statistics trees with higher levels. For instance, if we materialize the statistics tree in the car sales example in region and year level, then the sample cube query becomes a singleton query and the I/O cost is negligible. To implement the free navigation, we select, compute and materialize a set of Statistics Trees and then answer an arbitrary query based upon these STs. First, let us give some definitions. (Definitions: base trees, derived trees, base set) Definition: The base abstraction levels correspond to the lowest levels in the concept hierarchies such that the statistics trees are in memory. If each dimension uses the lowest level, the statistics trees Fu & Hammer 10 may not fit in memory. In this case, use the higher levels for the large dimensions or take out some dimensions from the group. The resulting statistics trees are called base trees. Definition: The derived trees are the STs in the same group but with higher levels. They are computed from base trees. Definition: The base tree and its derived trees form a base set. The base set is used for answering any queries in different abstraction levels for different dimensions. Potentially, any combination of the levels in the attribute group can be selected as a member of base set. However, we need to consider the space limitation. Again it is a space-time trade-off issue. The configuration of the base set is determined by the memory size and specific application. Here, we introduce a simple selection approach. From the base tree, each time, the largest dimension climbs up one level in its hierarchy while other dimensions remain the same. The newly formed ST in this way is included in the base set. The size of base set is not more than two times the size of base tree. Usually, the total size of the derived trees is just a very small fraction of the base tree. Of course, more sophisticated approaches like greedy-algorithms or cost-based algorithms are also possible. (relationship to generalized lattice view, materialized views) (update of dimensions) (incremental maintenance of the ST‘s and base set) 4.1. Base Set Generation To generate the base set from base tree, the simple approach is used, namely, rolling-up the large dimensions one after another. For numerical attributes in our examples of this subsection, they are grouped into equal-size segments except the last one that may be smaller. Example 4.1.1: Suppose the base tree T0 has three dimensions whose sizes are 100,100, and 100. Each dimension can be grouped ten by ten to form two-level hierarchies. The configuration of base tree T0 is 101*101*101. T1 is the result of rolling up the third dimension in T0, therefore, has configuration 101*101*11. T2 is the result of rolling up the second dimension in T1, its configuration is 101*11*11. T3 is the result of rolling up the first dimension in T2, its configuration is 11*11*11. T1, T2, T3 are derived trees. Together with base tree T0, they form a base set. The base tree is setup during the initialization of ST described in Section 2. Except the first one which is computed from the base tree, a new derived tree is computed from an existing one by merging some sub-trees in the old tree. Before introducing this new roll-up algorithm, we give a definition for a new operator ―‖ on STs as follows. Definition: Two STs are isomorphic if they are identical except the values of their leaves i.e. they have the same structure. All pairs of corresponding nodes in two isomorphic STs have the same degree. Definition: The summation S of two isomorphic STs S1, S2 is a new ST that has the same structure of S1 or S2 (i.e. isomorphic to them) but its leaf values are the summation of the corresponding leaf values (i.e. they are of the same position) of S1 and S2. Their relationship is denoted by S=S 1 S2. The implementation of the new operator is straightforward. Keeping one of the STs (say S1) and updating its leaves by repeatedly adding up the count values of the leaves directed by two sliding pointers. Notice that the new operator can adapt to other aggregation functions such as MIN/MAX. In that case, instead of summing the leaves up, compute the minimum or maximum of the corresponding leaves as the leaves of the resulting ST. The summation of several isomorphic STs is a ST which has exactly the same structure as any of the STs except the leaves. The count fields are the summation of the count fields of corresponding leaves. Fu & Hammer 11 Next, we will give an algorithm of roll-up operation case (1) in definition ??. Case (2) is covered by CubiST algorithm directly. To derive a new tree by rolling up a dimension2, proceed from root to the level corresponding to the rollup dimension. According to the mapping relationship described in section 3.1, reorganizing and merging the sub-trees for all the nodes in that level to form new children. Adjust the degree and update these nodes. Lastly, link the newly formed (summation of subtrees) children to the new nodes. The best way to illustrate the roll-up process is through a simple example. Example: suppose during the derivation, we proceed to a node in the rollup dimension level. The dimension with nine values has a two-level hierarchy which group the lower level values three by tree. The sample node is shown in the left of Figure? with its children S1,.., S9, S*. We superscript #‘s to indicate they are higher level values. We group and sum up its sub-trees (also STs) three by three to form three new children S1#, S2# and S3# of the new node with degree of four, as shown in the right side. The star sub-tree remains the same. Notice that S1#=S1 S2 S3 S2#=S4 S5 S6 S3#=S7 S8 S9 Figure ? Rolling-up one dimension. The following is the roll-up algorithm. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 } Fu & Hammer } rollUP(int whichDim, int compTimes) { change(root,1,whichDim,compTimes); 12 change(Node nd, int level, int whichDim, int compTimes) { IF level == whichDim THEN group the children of nd such that each group has compTimes subtrees except the last group and star node; compute the summation of the subtrees in each group; make a new node nd’ whose children are the summations and star node subtree; adjust the domain of the nd’; link nd' to nd's original parent IF level is not equal to 1; return; level++; FOR each child Ci of nd, change(Ci,level,whichDim,compTimes); Figure ?: Algorithm for Rolling-up One Dimension to Form a New Derived Tree Line 1 is the signature of a recursive function change, which proceed from a node nd at level level to the rollup dimension whichDim. Suppose that the hierarchy is formed by grouping the lower level values compTimes at a time. Lines 2 to 9 corresponds to the modifying process in the example described in Figure ?. Line 14 calls change. Before computing the new derived tree, we first store the old ST or base tree on disk (swap out). The metadata about the ST such as the dimensions in the group, hierarchy levels, etc is also stored in Metadata Repository. The new ST is usually much smaller than the old one and the remaining memory space can be reclaimed for other use. 4.2. Tree Selection in the Lattice Once the base set is established and materialized, it is ready to answer the queries. Potentially, multiple views can answer them but the smaller one is preferred. The proper tree in base set to select is the tree with highest levels that match the query. This can be done by the following matching scheme. We will introduce some new concepts and data structures first. Definition: A view matrix (VM) is a matrix, where the rows are dimensions, the columns are the hierarchy levels and the entries are the statistics tree labels in the base set. It describes what views are materialized in the hierarchies. Example 4.2.1: The VM F for example 4.1.1. Suppose the level with cardinality 101 has level value 0 and grouped level with cardinality 11 has level value 1. For example, T1 has cardinality structure 101*101*11. Its first two dimensions are at level 0 and the last dimension is at level 1, therefore, T1 is put in the first column at first two rows and in row 2 and column 2. Here we assume that dimension and level lables start from 0 (to match with C/C++ array convention). dim\lev 0 1 0 T0,T1,T2 T3 1 T0,T1 T2,T3 2 T0 T1,T2,T3 VM is easily implemented as a stack of level vectors which are composed of level values for the dimensions. The stack corresponding to the above VM is We can peel a MV one layer off from right, that is, each dimension pops off the rightmost ST. The peel operator corresponds to a pop operation of the stack. MV creation corresponds the push of the level vectors to an empty stack. Example 4.2.2 After peeling off T3, the MV in example 4.2.1 is dim\lev 0 1 0 T0,T1,T2 1 T0,T1 T2 2 T0 T1,T2 Figure ? Definition: A query hierarchy vector (QHV) is a vector (l 1,l2,...,lk), where li is ith dimension level value in its hierarchy in the query. Definition: A QHV (l1,l2,...,lk) matches view T in the VM iff li>= column index of the i th row entry of T in VM (the column indices of T compose T's column vector), i, i=1,2,...,k. In other words, when T's column vector <= QHV, we say " the query or its QHV matches view T". Definition: An optimal matching view with respect to a VM is the view in VM that has the highest index and matches the query's QHV. Fu & Hammer 13 Example 4.2.3: query Cube([3#,5#], 1#, [4#,6#]) its QHV is (1,1,1). It matches T0,T1,T2,T3. But T3 is the optimal matching view. As before, a number with a sharp superscript indicates that it belongs to higher level values. [3#,5#] means [30,59] in the example 4.1.1. If a query's QHV is (0,1,0), T0 is the only matched view, so it is also optimal relative to F. Theorem: Any query with no finer granularities than the base tree has an optimal view that matches its QHV. This is quite obvious, because T0 always matches and the indices of the derived trees have a total order (increasing). The MV is generated in the generation of base set. Now the problem is to find the optimal matching view for the query. Algorithm 2: Find optimal matching view for the query 1 2 3 4 Compute the QHV for the query; peel off the entries of VM layer by layer from right to left until the view matches QHV; return the view; Example 4.2.4: Suppose the query is Cube( 3 #, 2, 5#). Its QHV is (1,0,1). Because the column vector for T3 is (1,1,1) which does not match (1,0,1), T3 must be peeled off. Figure ? is the resulting MV. In the same way, T2 must also be peeled off: dim\lev 0 1 0 T0,T1 1 T0,T1 2 T0 T1 This time column vector of T1 is (0,0,1) <= (1,0,1) i.e. T1 matches QHV. So T1 is the optimal matching view. 4.3. Answering Cube Queries with a Statistics Tree  Review formal representation of Cube queries  Cube query algorithm In this phase, the query is evaluated based upon its optimal matching view. The answering algorithms are the same as CUBIST. Before evaluation, the query must first be transferred if its QHV is not equal to the optimal view's column vector. In the Example 4.2.4, query Cube (3#, 2, 5#) will be translated into Cube([30,39],2, 5 #) so that its QHV (0,0,1) exactly indicates the optimal view levels. Then compute the query using T1. The query is rewritten according to the mapping functions. Higher level values are translated into lower level values by the inclusion relationship. Fu & Hammer 14 5. Answering Cube Queries with Statistics Trees 6. Experimental Results 6.1. Datasets 6.2. Performance of Cube Queries with and without Lattice 6.3. Accuracy comparison and Time distribution 7. Concluding Remarks References [1] S. Agarwal, R. Agrawal, P. Deshpande, J. Naughton, S. Sarawagi, and R. Ramakrishnan, ―On The Computation of Multidimensional Aggregates,‖ in Proceedings of the International Conference on Very Large Databases, Mumbai (Bomabi), India, 1996. [2] Arbor Systems, ―Large-Scale Data Warehousing Using Hyperion Essbase OLAP Technology,‖ Arbor Systems, White Paper, http://www.hyperion.com/whitepapers.cfm. [3] S. Berchtold and D. A. Keim, ―High-dimensional index structures database support for next decade's applications (tutorial),‖ in Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, WA, pp. 501, 1998. [4] C. Y. Chan and Y. E. Ioannidis, ―Bitmap Index Design and Evaluation,‖ in Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, WA, pp. 355-366, 1998. [5] S. Chaudhuri and U. Dayal, ―Data Warehousing and OLAP for Decision Support,‖ SIGMOD Record (ACM Special Interest Group on Management of Data), 26:2, pp. 507--508, 1997. [6] S. Chaudhuri and U. Dayal, ―An Overview of Data Warehousing and OLAP Technology,‖ SIGMOD Record, 26:1, pp. 65-74, 1997. [7] E. F. Codd, S. B. Codd, and C. T. Salley, ―Providing OLAP (on-line analytical processing) to user-analysts: An IT mandate,‖ , Technical Report, www.arborsoft.com/OLAP.html. [8] D. Comer, ―The Ubiquitous Btree,‖ ACM Computing Surveys, 11:2, pp. 121--137, 1979. [9] S. Goil and A. Choudhary, ―High Performance OLAP and Data Mining on Parallel Computers,,‖ Journal of Data Mining and Knowledge Discovery, 1:4, pp. 391-417, 1997. [10] S. Goil and A. Choudhary, ―PARSIMONY: An Infrastructure for Parallel Multidimensi onal Analysis and Data Mining,,‖ Journal of Parallel and Distributed Computing, to appear. [11] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh, ―Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals,‖ Data Mining and Knowledge Discovery, 1:1, pp. 29-53, 1997. [12] A. Gupta, V. Harinarayan, and D. Quass, ―Aggregate-query Processing in Data Warehousing Environments,‖ in Proceedings of the Eighth International Conference on Very Large Databases, Zurich, Switzerland, pp. 358-369, 1995. Fu & Hammer 15 [13] H. Gupta and I. Mumick, ―Selection of Views to Materialize Under a Maintenance Cost Constraint,‖ Stanford University, Technical Report. [14] V. Harinarayan, A. Rajaraman, and J. D. Ullman, ―Implementing data cubes efficiently,‖ SIGMOD Record (ACM Special Interest Group on Management of Data), 25:2, pp. 205--216, 1996. [15] Information Advantage, ―Business Intelligence,‖ , White Paper, 1998, http://www.sterling.com/eureka/. [16] Informix Inc., ―Informix MetaCube 4.2, Delivering the Most Flexible Business-Critical Decision Support Environments,‖ Informix, Menlo Park, CA, White Paper, http://www.informix.com/informix/products/tools/metacube/datasheet.htm. [17] W. Labio, D. Quass, and B. Adelberg, ―Physical Database Design for Data Warehouses,‖ in Proceedings of the International Conference on Database Engineering, Birmingham, England, pp. 277-288, 1997. [18] M. Lee and J. Hammer, ―Speeding Up Warehouse Physical Design Using A Randomized Algorithm,‖ in Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW '99), Heidelberg, Germany, 1999, ftp://ftp.dbcenter.cise.ufl.edu/Pub/publications/VS-geneticfull.doc/pdf. [19] D. Lomet, Bulletin of the Technical Committee on Data Engineering, vol. 18, IEEEE Computer Society, 1995. [20] Z. Michalewicz, Statistical and Scientific Databases, Ellis Horwood, 1992. [21] Microsoft Corp., ―Microsoft SQL Server,‖ Microsoft, Seattle, WA, White Paper, http://www.microsoft.com/federal/sql7/white.htm. [22] MicroStrategy Inc., ―The Case For Relational OLAP,‖ MicroStrategy, White Paper, http://www.microstrategy.com/publications/whitepapers/Case4Rolap/execs umm.HTM. [23] P. O'Neil and D. Quass, ―Improved Query Performance with Variant Indexes,‖ SIGMOD Record (ACM Special Interest Group on Management of Data), 26:2, pp. 38--49, 1997. [24] P. E. O'Neil, ―Model 204 Architecture and Performance,‖ in Proceedings of the 2nd International Workshop on High Performance Transaction Systems, Asilomar, CA, pp. 40-59, 1987. [25] Oracle Corp., ―Oracle Express OLAP Technology‖, Web site, http://www.oracle.com/olap/index.html. [26] Pilot Software Inc., ―An Introduction to OLAP Multidimensional Terminology and Technology,‖ Pilot Software, Cambridge, MA, White Paper, http://www.pilotsw.com/olap/olap.htm. [27] Redbrick Systems, ―Aggregate Computation and Management,‖ Redbrick, Los Gatos, CA, White Paper, http://www.informix.com/informix/solutions/dw/redbrick/wpapers/redbric kvistawhitepaper.html. [28] Redbrick Systems, ―Decision-Makers, Business Data and RISQL,‖ Informix, Los Gatos, CA, White Paper, 1997, http://www.informix.com/informix/solutions/dw/redbrick/wpapers/risql.h tml. [29] J. Srivastava, J. S. E. Tan, and V. Y. Lum, ―TBSAM: An Access Method for Efficient Processing of Statistical Queries,‖ IEEE Transactions on Knowledge and Data Engineering, 1:4, pp. 414--423, 1989. Fu & Hammer 16 [30] W. P. Yan and P. Larson, ―Eager Aggregation and Lazy Aggregation,‖ in Proceedings of the Eighth International Conference on Very Large Databases, Zurich, Switzerland, pp. 345-357, 1995. [31] Y. Zhao, P. M. Deshpande, and J. F. Naughton, ―An Array-Based Algorithm for Simultaneous Multidimensional Aggregates,‖ SIGMOD Record (ACM Special Interest Group on Management of Data), 26:2, pp. 159--170, 1997. Fu & Hammer 17

Related docs
A guide to plan, manage, execute a successful BI
Views: 89  |  Downloads: 19
Main-Memory Near-Main- Memory OLAP Databases
Views: 22  |  Downloads: 0
Exercise Sheet 3
Views: 0  |  Downloads: 0
000-705 Practice Tests & Exam
Views: 23  |  Downloads: 0
Data Warehousing Concepts
Views: 855  |  Downloads: 109
senior data base administrator resume example
Views: 45  |  Downloads: 0
Senior DBA Resume
Views: 36  |  Downloads: 0
Certinside 70-446 v3.36
Views: 2  |  Downloads: 0
Examsoon 000-705
Views: 0  |  Downloads: 0
premium docs
Other docs by Randy Couture
what is a value
Views: 112  |  Downloads: 2
terms of service agreement
Views: 1179  |  Downloads: 15
privacy policy sample
Views: 1612  |  Downloads: 22
privacy policy example
Views: 2747  |  Downloads: 51
trademark my name
Views: 1024  |  Downloads: 1
licensee licensor
Views: 1060  |  Downloads: 0
vivisimo clustering
Views: 1006  |  Downloads: 0
toplevel domain names
Views: 1007  |  Downloads: 0
reserving domain name
Views: 1046  |  Downloads: 0
toplevel domain
Views: 796  |  Downloads: 4
business.com domain name
Views: 832  |  Downloads: 0
trade mark names
Views: 976  |  Downloads: 2
winternals defrag manager
Views: 320  |  Downloads: 3
reserving domain names
Views: 785  |  Downloads: 1
interleaf inc.
Views: 138  |  Downloads: 0