VIEWS: 15 PAGES: 13 POSTED ON: 10/11/2011 Public Domain
Eager Aggregation and Lazy Aggregation Weipeng P. Yan Per-Bike Larson Department of Computer Science, University of Waterloo, Waterloo, Ontario, Canada N2L 3Gl {pwyan,palarson}@bluebox.uwaterloo.ca We proposed a new query optimization technique, group-by push down and group-by pull up, which inter- Abstract changes the order of group-by and joinspL94, YL95]. Groupby push down is to push groupby past a join. Efficient processing of aggregation queries is Its major benefit is that the group-by may reduce the essential for decision support applications. number of input rows of the join. Group-by pull up is This paper describes a class of query trans- to delay the processing of groupby until after a join. formations, called eager aggregation and laty Its major benefit is that the join may reduce the num- aggregation, that allows a query optimizer to ber of input rows to the group-by, if the join is selec- move group-by operations up and down the tive. Figure 1 shows the idea of commuting group-by query tree. Eager aggregation partially pushes and join. In Figure l(a), we join Table Tl(Gl,Jl,Sl) a groupby past a join. After a group-by is and T2(G2,J2) on join columns Jl and 52 then group partially pushed down, we still need to per- the result on grouping columns Gl and G2, followed form the original groupby in the upper query by aggregation on Sl. Figure 1(b) shows an alter- block. Eager aggregation reduces the number native way where group-by is performed before join. of input rows to the join and thus may result Note that group-by and join commutation cannot al- in a better overall plan. The reverse trans- ways be done. The necessary and sufficient condition formation, lazy aggregation, pulls a group-by is provided in [YL94, YL95]. above a join and combines two group-by op- SUM(S1) + erations into one. This transformation is typ- ically of interest when an aggregation query @$jg Jl=J& references a grouped view (a view containing a groupby). Experimental results show that t ss/ AS sum(s1, z the technique is very beneficial for queries in Jl=JZa T T2 the TPC-D benchmark. WJ2) /‘\ 1 Introduction Tl T2 Tl (Gl,Jl,Sl) (Gl,Jl,Sl) (G2,52) Aggregation is widely used in decision support sys- tems. All queries in the TPC-D[Raa95] benchmark (a) Group-by Pull up (b) Group-by Push down contain aggregation. Efficient processing of aggrega- tion ,queries is essential for performance in decision Figure 1: Group-by and Join Commutation support applications and large scale applications. The technique to only partially push down a group- Permission to copy without fee all or part of this material is by past a join can be extended. For some queries granted provided that the copies arc not made or distributed for containing joins and groupby, we can perform group direct commercial advantage, the VLDB copyright notice and by on some of the tables, then the join, and finally the title of the publication and its date appear, and notice is another group-by. The first groupby, which we call given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, 01 to republish, requires a fee eager group-by, reduces the number of input rows to and/or special pcmzission from the Endowment. the join and thus may result in a better plan. We Proceedings of the 21st VLDB Conference call the groups generated by the early groupby par- Zurich, Swizerland, 1995 tial groups because they will be merged by the second 345 group-by. When the amount of data reduction does not justify the cost of eager group-by, we should prob- R SUMISUM’ ably delay group-by until after the join, which we term ‘@& lazy group-by. Both directions of the transformation SUM& i should be considered in query optimization. We call the technique of performing aggregation before join t J1=J2G9 eager aggregation, and delaying aggregation until after SUM(S1) AS SW/ \ join lazy aggregation. Figure 2(a) and (b) show the basic idea of ea- “‘=a m (ZJ2) ger/lazy group-by. Eager group-by performs eager ag- Tl /‘\ T2 aI ’ gregation on all tables containing aggregation columns. (Gl,Jl.Sl) Tl (62952) Lazy group-by is its reverse transformation. (Gl,Jl,Sl) The following examples illustrate the basic idea of (a) Lazy grwW (b) Eager group-by eager group-by and lazy group-by. The examples are based on a subset of the TPC-D database[Raa95]. The tables are defined in Appendix A. Example 1 : Find the total loss of revenue on OT- ders handled by each clerk due to parts being returned by customers. Output clerk and loss of revenue. SELECT O-CLERK, * SDM(L-EXTENDEDPRICE (I-L-DISCOUNT)) FROM LINEITEM, ORDERS Tl T2 WHERE O-ORDhKEY = L-ORDERKEY T2 AND LJETURNFLAG= 'R' (Gl,Jl,Sl) (WJ2) mJJ2) BY GROUP O-CLERK (c) Lazy Count (d) Eager Count A Each order is handled by one clerk so we can first find the loss of revenue for each order. We then join the aggregated view with table ORDERS find the total loss for each clerk. to suM~s’& 51352 a SELECT O-CLERK, SUM(REVENDE) FROM (SELECT L-ORDERKEY. SUM(L-EXTENDEDPRICE t f’\ ‘XT *(l-L-DISCOUNT)) AS REVENUE J,=>&k-& sAs FROM LINEITEM WHERE L-RETURNFLAG 'R'= t t BY GROUP L-ORDERKEY) LOSS, ORDERS AS Tl l2 Tl T2 = WHERE O-ORDBRKEY L-ORDERKEY (Gl,Jl,Sl) WC2) (Gl,Jl,Sl) K-352) BY GROUP O-CLERK (e) Double Lazy (t) Double Eager The eager (inner) group-by reduces the number of input rows to the join. If the LINEITEM table is clus- tered on L-ORDERKEY,the eager group-by can be done at almost no additional cost. Experiment on DB2 JlJ2 & V2 Beta3 confirms that eager group-by reduces the elapsed time by 16%. The following example shows I \ that lazy group-by can be beneficial. Example 2 : Find the total loss of revenue on OT- ders from May 1995 handled by each clerk due to parts being returned by customers. Output clerk and loss of Tl T.2 Tl T2 revenue. (Gl,Jl,Sl) (G2J23) (Gl,Jl,Sl) W&W SELECT O-CLERK, SUM(REVENUE) FROM ORDERS,LOSS-BY-ORDER (9) Lazy Split (h) Eager Split WHERE = O-ORDERKEY L,ORDERKEY Lazy Aggregation Eager Aggregation O-ORDERDATE "1996-05-01" BETWEEN _______-______-- ----------------- AND AND "1995-05-31" GROUP O-CLERK BY Figure 2: Eager and Lazy Aggregation 346 is where LOSS-BY-ORDER an aggregated view defined by six queries. For example, it reduces the elapsed time of Query 5 by a factor of ten. CREATEVIEW LOSS-BY-ORDER(L-ORDERKEY,REVENUE) (SELECT L-ORDERKEY, * ~~~~(L-E~TENDEDPRI~E (~-L-DISCOUNT)) 1.1 Organization of This Paper FROM LINEITEM The rest of the paper is organized as follows. Section 2 WHERE LJETURNFLAG= 'R' reviews aggregation functions in SQL2 and introduces BY GROUP L-ORDERKEY ); the concepts of decomposable aggregation functions, We can merge the view with the query and rewrite the and class C and class D aggregation functions. Sec- query as tion 3 defines the class of queries that we consider and introduces notations. Section 4 presents the formalism SELECT O-CLERK, that our results are based on. Section 5 introduces and SUM(L-EXTENDEDPRICE*(l-L-DISCOUNT)) proves our main theorem. Sections 6, 7, 8 and 9 in- FROM LINEITEM. ORDERS troduce corollaries for eager/lazy group-by, eager/lazy = WHERB O-ORDERKEY L,ORDERKEY count, double eager/lazy and eager/lazy split trans- AND LJETURNFLAG= 'R' AND O-ORDERDATB BETWEEN"1995-05-01" formations. Section 10 proposes algorithms for find- AND "1995-05-31" ing all possible eager/lazy transformations for a query GROUP O-CLERK BY and discusses the way to integrate eager/lazy aggre- gation and group-by push down/pull up into existing is The predicate on 0-ORDERDATE highly selective. optimizers. In order to simplify the proofs we have not In this case, we should delay the group-by until after considered HAVING in the theorem and corollaries. Sec- the join. A nested loop join with LINEITEM as the inner tion 11 considers the case when the HAVING clause is and ORDERSas the outer looks like a very promising present. Section 12 shows that eager/lazy aggregation evaluation strategy. Experiment on DB2 V2 Beta3 and group-by push down/pull up is very beneficial for confirms that lazy group-by reduces the elapsed time TPC-D official queries. Section 13 discusses related by 60%. work. Section 14 concludes the paper. These examples show that both directions (eager group-by and lazy group-by) should be considered in 2 Aggregation Functions query optimization. There may be several ways of per- In SQLP, a value expression may include aggregation forming eager group-by when there are more than two functions. There are five aggregation functions: SUM, tables in the FROMclause[Yan95]. AVGs MIN, MAXand COUNT. Consider the query Figure 2 and 3 show the eager/lazy transformations introduced in this paper. Eager count transformation SELECT2*SUN(Tl.Cl)/COUNT(DISTINCT T2.C2)* performs eager aggregation on tables not containing MIN(Tl.C3*T2.C3) aggregation columns, as shown in Figure 2 (d). It first FROMTl,T2 counts the number of rows in each group in the early We can rewrite this query as aggregation, then performs the join, and finally ag- SELECT2*NCl/NC2*NC3 gregates the original aggregation columns. Lasy count FROM(SELECTmM(cl) AS NCI, transformation is its reverse transformation. COUNT(DISTINCT T2.C2) AS NC2, Double eager performs eager count on tables not #IN(Tl.C3*T2.C3) AS NC3 containing aggregation columns and eager group-by on FROMTl,T2 ) TMP-VIEW; the remaining tables which may or may not contain Any query block that has an arbitrary value ex- aggregation columns, as shown in Figure 2 (f). The reverse transformation is double lazy. pression containing more than one aggregation func- tions, can always be rewritten so that the new query Eager groupby-count, as shown in Figure 3, per- block is a SELECT on top of a view that contains value forms eager aggregation on a subset of tables contain- ing the aggregation columns. Its reverse transforma- expression having at most one aggregation function. Therefore, without loss of generality, we assume that tion is called lazy groupby-count. our query contains no value expression that has more Eager split, as shown in Figure 2 (h), performs eager than one aggregation functions. groupby-count on both input streams before the join, when both input streams are involved in aggregation. 2.1 Decomposable Aggregation Functions Its reverse transformation is called lazy split. Our experiments show that we can apply group- All sets in this paper are multi sets. Let U, denote set by push down/pull up and eager/lazy aggregation to union preserving duplicates, and ud denote set union twelve of the seventeen queries in the TPC-D bench- eliminating duplicates. These operations exist in SQL2 mark. This significantly reduces the elapsed time for as UNION ALL and UNION, respectively. 347 Definition 1 : (Decomposable Aggregation FROMLINBITEH Function) An aggregation function F is decompos- BY GROUP L-ORDERKEY) COUNT-BY-ORDER. AS able if there exist aggregation functions F1 and F,?2 ORDERS such that F(&Ua&) = F2(Fl(Sr),Fl(S2)), where WHERE 0sORDERKEY L-ORDERKEY = S1 and & are two sets of values. We call S1 and Sz BY GROUP O-CLERK partial groups. COUNTEY-ORDER counts the number of lineitems for SUM(C) is decomposable since SUM(S~U,S~) = to each orders, then joins with table ORDERS find the SUM(SUM(Sl), SUM(S2)); count required. We call this transformation eager COUNT(C) is decomposable since COUNT(SlU,S2) = count, and its corresponding reverse transformation lazy count. Note that, this time, we are performing SUM(COUNT(Sl), COUNT(S2)); and MIN(C) is decomposable since MIN(Slu,S2) = eager aggregation on a table which contains no aggre- gation columns. Eager count performs eager aggrega- MIN(MIN(Sl), MIN(S2)); and AVG(C) can be han- dled as SUM(C) and COUNT(NOTNULL C) and thus is tion on tables not containing any aggregation columns. decomposable’. Experiment on DB2 V2 Beta3 shows that eager count For aggregation functions like COUMT(DISTINCT for this query reduces the elapsed time by 40%. Cl), it is not trivial to determine whether it is de- When performing eager count and the original ag- composable. There may be two rows with the same gregation function is either SUM or COUNT, we need Cl value in Sl and S2. These two rows would then to count the number of rows in each group produced contribute 2 instead of 1 in the final count. However, by the inner group-by and multiply the count with if we know in advance that column Ci cannot contain the result from the later group by. We call aggrega- duplicate values, then COUMT(DISTINCT Cl) is decom- tion functions satisfying this property class C aggre- posable. Note that, even though Cl has duplicate val- gation functions(C stands for COUNT), and the count ues, there may be other conditions which ensure that obtained from the inner group-by duplication factor. If rows with the same Cl value belong to the same par- the original aggregation function is SUM(DISTINCT), tial groups(e.g., Cl is a grouping column). Therefore, COUNT(DISTINCT), MIN, MAX, or AVG, we can discard an aggregation function may or may not be decompos- the count in the subquery block. In other words, we able. Aggregation functions MIN and MAX are always can use a DISTINCT in the subquery block. We call decomposable; SUMand COUNT decomposable when are aggregation functions satisfying this property class D aggregation functions(r) stands for DISTINCT). And we they contain no DISTINCT. The issue of determining whether an aggregation function is decomposable will call this transformation eager distinct, and its corre- not be discussed further. From now on we will assume sponding reverse transformation lazy distinct. There- fore, combining this with whether the function is de- that we have the knowledge about whether an aggre- composable or not, we can have four types of aggre- gation function is decomposable. gation functions. Class D aggregation functions are 2.2 Class C and Class D Aggregation hnc- insensitive to duplication factors. tions Find the total number of urgent 3 Class of Queries Considered Example 3 : OT high priority lineitems handled by each clerk. Any column occurring as an operand of an aggregation function (COUNT, MIN, MAX, SUM, AVG) in the SELECT SELECT O-CLERK, SUM(CASE WHEN O-ORDERFRIORITY='l-URGENT' clause is called an aggregation column. Any column OR O-ORDERF'RIORITY='2-HIGH' occurring in the SELECTclause which is not an aggrega- THEN 1 ELSE 0 END) tion column is called a selection column. Aggregation FROM LINEITEM, ORDERS columns may belong to more than one tables. We par- WHERE O-ORDERKEY L-ORDERKEY = tition the tables in the FROMclause into two groups: BY GROUP O-CLERK those tables that contain aggregation columns and those that may or may not contain any such columns. It is equivalent to the following query. Technically, each group can be treated as a single ta- SELECT O-CLERK, ble consisting of the Cartesian product of the member SUM(CASE WHEN O-ORDERPRIORITY='l-URGENT' tables. Therefore, without loss of generality, we can OR O-ORDERPRIORITY='2-HIGH' assume that the FROMclause contains only two tables, THEN 1 ELSE 0 END) * CNT & and &. Let Rd denote the table containing aggre- FROM (SELECTL-ORDERKEY, COUNT(*) AS CNT gation columns and R, the table that may or may not lspL2 does not support COUMT(IOT IULL Cl) operation, but contain any such columns. it is fairly easy to implement in any existing systems. The search conditions in the NMEREclause can be 348 expressed as Cd A Co A C,,, where Cd, Co, and C,, are GA: : E Gdd U a(&) - R,, i.e., the columns of & in COnjUnCtiVe normal form, Cd Only involves columns participating in the join and grouping; in &, C,, only involves columns in R,, and each dis- GA:: E Gdd U a(Co) - &, i.e., the columns of R, junctive component in Co involves columns from both participating in the join and grouping & and R,. Note that subqueries are allowed. FAA: resulting columns of the application of function The grouping columns mentioned in the GROUP BY array F on AA in the fhst group-by when eager clause may contain columns from & and R,, de- group-by is performed on the above query. noted by C& and GA,,, respectively. According to SQL2[ISO92], the selection columns in the SELECT 4 Formalization clause must be a subset of the grouping columns. We denote the selection columns as SC& and SGA, , sub- In this section we define the formal “machinery” we sets of GAd and GA,,, respectively. For the time being, need for the theorems and proofs to follow. we assume that the query does not contain a HAVING SQL2[ISO92] represents missing information by a clause(relaxed in Section 11). The columns of & par- special value NULL. It adopts a three-valued logic in ticipating in the join and grouping is denoted by GA:, evaluating a conditional expression. We define func- and the columns of R, participating in the join and tional dependencies using strict SQL2 semantics tak- grouping is denoted by GA,+. ing into account the effect of NULLS in SQLP. When In summary, we consider queries of the following NULLS do not occur in the the columns involved in form: a functional dependency, our definition of functional dependency is the same as the traditional functional SELECT [ALL/DISTINCT] SGdd, SGA,, F(AA) dependency. The detailed definitions are included in FROl4 &is Ru WHERE cd A c,, A c, pL94]. Due to space limitation, they are not included GROUP BY Gdci,GA, here. Let A and B be two sets of columns, A fimction- ally determines B is denoted by A-B. where Gdd: grouping columns of table Rd; 4.1 An Algebra for Representing SQL Queries GA,: grouping columns of table R,; GAd and GA, Specifying operations using standard SQL is tedious. cannot both be empty. As a shorthand notation, we define an algebra whose S&id: selection columns, must be a subset of grouping basic operations are defined by simple SQL statements. cohmns G.&i; Because all operations are defined in terms of SQL, SGA,,: selection columns, must be a subset of grouping there is no need to prove the semantic equivalence be- columns GA,; tween the algebra and the SQL statements. Note that AA: aggregation columns of table & and possible transformation rules for “standard” relational algebra table R,. When considering eager/lazy group- do not necessarily apply to this new algebra. The op by, eager/lazy count and double eager/lazy, AA erations are defined as follows. belong to Rd. When considering eager/lazy groupby-count and eager/lazy split, AA belong l B[GA] R: Group table R on grouping columns to & and R, and is denoted by the union of GA = {GAI, GAz, .... GA,}. This operation is aggregation columns AA,, and A&, where AA, defined by the query 2 SELECT * FROM R ORDER and AAd belong to & and R, respectively. BY GA. The result of this operation is a grouped cd: conjunctive predicates on columns of table &; table. c,: conjunctive predicates on columns of table R,; l RI x R2: The Cartesian product of table RI and co: conjunctive predicates involving columns of both R2. tables Rd and R,, e.g., join predicates; a(C0): columns involved in CO; l a[C]R: Select all rows of table R that satisfy con- F: array of aggregation functions and/or arithmetic dition C. Duplicate rows are not eliminated. This aggregation expressions applied on AA (may be operation is defined by the query SELECT * FROM empty). When considering eager/lazy groupby- R WHERE C. count and eager/lazy split, F is denoted by the union of aggregation functions Fa and F,,, where l xd[B]R, where d = A or D: Project table R on Fd and Fd are applied on AAd and AA, respec- columns B, without eliminating duplicates when tively. 2Certainly, this query does more than GROUP BY by ordering F(AA): application of aggregation functions and/or the resulting groups. However, this appears to be the only valid ,arithmetic aggregation expressions F on aggre- SQL query that can represent this operation. It is appropriate gation columns AA; for our purpose as long as we keep the difference in mind. 349 d = A and with duplicate elimination when d = 2. F contains both class C and class D aggregation D. This operation is defined by the query SELECT functions. In this case, we need to use a COUNT CALL /DISTINCT] B FROMR. aggregation function in the SELECTlist of the sub- query block. The aggregation value of a class C l $Jr;l~;A”[“=“l yAUd$4, fz@A),.... fn(AA)), aggregation function f is the count multiplied by 1, 2, ....A.,}, and F = the value resulting from applying f. Therefore, {fl, f2, . . ..f”}. AA are aggregation columns of we need to change F into F,, in which every grouped table R and F are arithmetic aggrega- class C aggregation function f of F is replaced tion expressions operating on AA. We must em- by f * count. For example, if F(C1, C2, C3) is phasis the requirement that table R is grouped (SUM(Cl>,COUNT(C2),MIN(C3)), then by some grouping columns C. All rows of table F,(Cl, C2, C3, count) R must agree on the values of all columns except is (SUM(C1) ,MAX(C2) ,MIN(C3))o(count, 1, I) = AA columns. Each fi, where i = 1,2, . . . . n, is an (SUM(Cl)*count,MAX(C2) ,MIN(C3)). The oper- arithmetic expression(which can simply be an ag- ator o is vector product. We call F, the duplicated gregation function) applied to some columns in aggregation functions of F. As a shorthand nota AA of each group of R and yields one value. An tion, we use F(C1, C2, . .. . C,,) * count to repre- + example of fi(AA) is COUNT(A1) sUM(A2+ As). sent F,, while keeping in, mind that we only need Duplicates in the overall result are not eliminated. to multiply class C aggregation function by the This operation is defined by the query SELECT count. Note that we need an additional argument R GA,A, F(AA) FROM GROUP GA,whereGAisBY to F,e the grouping columns of R, and A is a set of none grouping columns that are functionally de- Note that, it is not necessary that the functions in termined by GA and may be empty. Note that F be decomposable. this is not a syntactically valid SQL2 statement since the columns A in the SELECT clause are not SUM(SSl), SUM(S2)‘CNT t mentioned in the GROUPBY clause. However, since WWP SY GA+ A, from a query processing point of view, SUM(Sl), t SUM(S2) 9 this is semantically sound. Therefore, the class of query we consider can be expressed as nd[SGA,j, SGA,, FAA] F[AA]rA[GAd, GA,, , AA] G[GAd, G&]+‘d A co A Cu](& x &a) where d = A or d = D, and FAA are the aggre- Tl T2 gation values after applying F[AA] on each group. (Gl,Jl,Sl) (G2,J2,S2) (Gl ;,Sl) The last projection simply projects the rows on the Lazy groupbycount Eager groupby-count columns wanted, and may eliminate duplicates. If all the columns wanted are the same as all existing Lazy aggregation Eager aggregation columns, and the projection does not eliminate dupli- cates, then we usually omit the last projection in the Figure 3: The Main Theorem expression. Consider the query to the left of Figure 3. It aggre- All sets in this paper are multisets which may con- gates columns from both input streams. In the query tain duplicates. Td, T” denote instances of table & and on the right, we can first perform aggregation on one R,; T[SJ is used as a shorthand for nA[qT, where S of the input stream. We need to not only find the sum is a set of columns and T is a grouped or ungrouped of partial groups, but also keep track of the number table, or a row. of rows in each partial group for the aggregation on the table(T2) that are aggregated only after the join. 5 Main Theorem This is the basic idea of eager groupby-count. In the following theorem, let (1) NGAd denote a set When performing eager count, we need to consider two of columns in Rd; (2) CNT the column produced by cases: COUNT(*) after grouping a[Cd]& on NG&; (3) FA& 1. F contains only class D aggregation functions. We the rest of the columns produced by Fd in the first can simply add a DISTINCT to the SELECT of list group-by of table U[Cd]Td on NG&; and (4) F,, the the subquery block and no modification to the duplicated aggregation function of F,,. Also assume original aggregation functions is needed. that (1) AA = AAd ud AA, where AAd contains only 350 columns in &, and AA,, contains only columns in R,; S,. Note that the above statements hold for all joins, (2) F = Fd ud F, where Fd applies to AAd and F,, not just equijoins. applies to AA,,. Since S,, depends on Gd, we denote the set of rows joining with Gd as S, (Gd). The set resulting from Theorem 1 (Eager/Lazy Groupby-Count(Main the join of Gd and S, is Gd x S,(Gd), i.e., a Carte- Theorem)): The expcpressions sian product. (Fdl[A&], COUNT~)Gd denotes the El : F[A&, A&]n[G&, GA,, AAd, AA,] row resulting from applying Fdr and COUNT on AAd of the group Gd. G[G&, G&]@d A co A cu](& x R,) Let Gdi, Gds be two (partial groups) produced by and B[NGA,+[Cd]rd. We have two cases to consider. Case 1: Gdi[G&] = Gdz[G&] and S, (Gdl)[GA,] = E2 : ?Td[G&, GA,, FAA] S,,(Gds)[GA,]. In Es, after the join, all rows in (Fua[AAu, CNT], FddFA&]) ((Fdl [A&], COUNTIJ) flA [G& GA,,, AA,,, PAAd, CNll] (“[NGAd, GA,+, A&]‘&) x St, (Gdl) G[GAd, GAMCo, ‘%](((Fdl[-‘h], C0UNTl-j) and c#=‘b, GA,+, A&]~[NGAd]+‘d]Rd) x R,) ((Fdl[A&], COUNTU) are equivalent if (1) aggregation functions Fd contain only decomposable aggregation functions and can be de- TINGAd, GA,+, AA&h) x su (Gd2) composed into Fdl and Fd2; (2) F,, contain class C OT are merged into the same group by the second group- D aggregation functions and (3) NGAd+ GA: hold by(after the join). in U[cd]&. In El, each row in Gdi and Gd2 joins with each row in &(Gdl) and &(Gd2), respectively. Therefore, all The main theorem is illustrated in Figure 3. The ag- rows in Gdr x S, (Gdl) and Gds x S,(Gds) are merged gregation columns are split into two sets, which belong into the same group by the group-by. Since every ag- to & and R,, tables respectively. For the transforma- gregation function in Fd can be decomposed as Fdi tion from El to E2 (eager aggregation), we push down and Fds, the aggregation values in the row produced the & tables and perform eager aggregation on AAd and obtain the count before the join. After the join, we by then perform aggregation on FA& and AA,. There- Fd[A-‘ti]~A[G&, GA,, A&] fore, we basically split the aggregation into two parts, ((Gdl X &(Gdl))Ua(Gd2 X &(&a))) one is pre-evaluated before the join and one is evalu- ated after the join. We call the transformation from in El are equal to the aggregation values produced by El to E2 eager groupby-count and its reverse transfor- Fd2 [FA&]m [G-b, G-L, FA&] mation lazy gmupby-count. The requirement NG&+ GA: is not a neces- (((Fdl[AA&bdNG&, GA,+, A&]‘%) x s&h)) sary conditions. If NG& ti G& in some instance Ua((Fdl[A&]m[NG&, GA,+, A&]Gda) x su(Gd2))) of a[Cd]&, then the first group-by of E2 may group in E2* rows together when they do not belong to the same Since every aggregation function in F,, is either class group in El. However, incorrectly assigned rows may C or D, the aggregation values in the row produced by be eliminated by the join and we may still get the cor- rect result. If NG& does not functionally determine E&%&r&&, GA,, AAul((Gdl x & (Gdl)) the join columns of table &, the join in E2 is un- UaGdz( Xsu (Gdz))) defined since a group may contain different values on in El are equal to the aggregation values produced by the join columns. To obtain necessary and sufficient condition, we need to extend the meaning of F[AA], J’u,[AA,, CNT]ci[G&, GA,, AA,, CNT] which is beyond the scope of this paper. (((COUNTbA[N’=d, GAd+]Gdl) x & (‘%l))Uo Proof: Consider a group Gd in g[NG&]u[cd]Td for some ((COUNTbA[NG&, GA,+]Gdz) x St,(h))) instance Td of Rd. Since NG&+ GA:, all rows in in E2. G,j have the same G& value and have the same value Case 2: Gdl [G&l # G-&G&] 01 & (Gdl)[G&] # for the join columns of Rd. Therefore, if one row of S,, (Gd2) [GA,]. In Ez? the rows in G,j qualifies in the join of b[Cd A Co A C&l (Td X T,), all rows of Gd qualify. If one row of Gd joins with a ((Fdl[AAd], COUWI) set of rows S, from ~[C,,]T,,, all rows of Gd join with n[NG&, GA,+, A&]&l) x St, (Gdl) 351 and and ((Fdl[AAd], CouNTO) E2 : F2 [FAA&A [G&, GA,, FAA&(G& GA,] 4NG&, GA:, AA&da) x &(&a) xA[G&, GA,, FAAd]@o A Ct,] are not merged into the same group by the second ( (S [A&A [NG& GA,f 7 AAl group-by(after the join). In Er, each row in Gdr and ~[NG&]~[Cd]Rd) X Ru) Gds joins with each row in & (Gdi) and S, (Gdz), re- spectively. However, the rows in Gdr x S,, (Gdr) and are equivalent if NG&-+ GA: holds in b[cd]Rd and Gdz x S,,(Gdz) are not merged into the same group by all aggregation functions in F[AA] are decomposable the group-by. Since F is decomposable, the aggrega- and can be decomposed into FI and F2. tion values in Eager group-by transformation introduces a new group-by, and lazy group-by transformation eliminates Fd [A&]nA [G&, GA,, A&] a group-by. (Gdl X su (‘&I) The proof of the corollary is straightforward. Since in Er are equal to the aggregation values in AA,, is empty, Fua[AAU, CNTJ is empty. Deleting all terms relating to AA,, in E2 of the Main Theorem gives Fd@‘&]u[GAd, GA”, F&z] Ez of the corollary. ((h[A&]m[NG&, GA,+, AA&al) X sty (‘&l)) 7 Eager/Lazy Count and Eager/Lazy in Ez. Also, e aggregation values in the row produced by Distinct In the Main Theorem, if we let GA, contain all the GA,, u,CNT]%t[Gfti, AAl +L.l Fu[A aggregation columns, that is, all aggregation columns (GdlX$Gdl)) belong to R, tables, then we obtain the following corol- in El are equal ‘to the aggregation values produced by lary. In the following corollary, NGAd denotes a set of grouping columns belonging to &, and CNT the column produced by COUNT(*) after grouping o[c,j]& on NGAa. in E2. Cl “1 Corollary 2 (Eager Count/Lazy Count): The The Main Theorem assumes that the final se- expressions lection columns are the same as the grouping EI : F[A~]~&‘&, GA,, AA]B[GAa, GA,] columns(GAd, GA,) and the final projection must be an ALL projection. We&an actually relaxes +d A co A cu] (Rd X R,) \ these two restrictions, i.e., the fin 1 selection columns and may be a subset(SGAd, SGA,)\of the grouping columns(GAd, GA,), and the final projection may be E2 : F,,[AA, CNT]nA[GAd, GA,, AA, CNT] a DISTINCT projection. This is also true for all other g[GAd, GA&A [GAd, GA,, AA, CNT] corollaries in this paper. For a formal description of fl[Co, CU]((COUNT@A[NG&, G&+] the transformation and proof, please refer to [yan95]. s[NGAd]a[Cd]Rd) X &) 6 Eager Group-by and Lazy Group-by are equivalent if F are class C or class D aggregation In the Main Theorem, if we let GAd contain all the fin&ions and NGAd+ GA: hold in a[Cd]Rd. aggregation columns, that is, all aggregation columns In E2 above, COUNT(1 the inner group-by in E2 after belong to & tables, then we obtain the following corol- means that we add a COUNT(*) the select list of the to lary. subquery block. In the following corollary, let NGAd denote a set of The proof of the corollary is straightforward. Since columns in table Rd. and FAAd the columns produced AAd is empty, Fd, Fdr and Fdz are all empty. Re- by applying F[AA] after grouping table Rd on NG&. moving all terms relating to AAd in Ez of the Main Corollary 1 (Eager Group-by and Lazy Group- Theorem gives E2 of the corollary. by) : The expcpressions We call the transformation from El to E2 eager count and from E2 to El lazy count. El : F[AA]nA[G&, GA,, AA]B[GAd, GA,] Clearly, when F in the theorem contains only class fl[Cd A co A cti](Rd X &) D aggregation functions, we can simply use a DISTINCT 352 in the subquery block. We then call the transformation Corollary 4 from El to Ez eager distinct and from Ea to El lazy (Double Group-by Push-Down/Double Group- distinct. Note that in this case, F, is the same as F. by Pull-Up): A ssume that the conditions in Co~ol- lary 3 holds. If, in addition, (1) GA;---, NGAd holds in u[Cd]Rd, (2) GA,++ NGA, holds in u[C,,] R,, 8 Double Eager and Double Lazy and (3) (GA,, Gdd) functionally determines the join Now we are ready to tackle the double eager and dou- columns in r[Cd A Co A CU] (Rd x R,), then the es ble lazy transformations. Consider the query in Fig- pressions ure 2(e). It aggregates the columns belonging to one El : F[A+A[G&, GA,, AA]B[GAd, GA,] input stream (Tl). In the query in Figure 2(f), we per- form eager group-by on the stream (Tl) containing c[cd A co A Cu](Rd X Ru) aggregation columns and eager count on the stream and (T2) not containing any aggregation columns. We call the transformation double eager. Double eager can be E2 : ?rA[GAd, GA,, FAA * CNT@[GAd, GA,,] understood as an eager group-by followed by an eager u[COI((COUNTO~~A[NGA,, GA,+]B[NGA,] count transformation. The reverse transformation is 4WL) x (F[A+A[NG&, G&+, AA] called double lazy. i?[NG&]u[Cd]&)) In the following corollary, NGA, denotes a set of columns in IS,,, NGAd a set of grouping columns be- are equivalent. longing to & tables, FAA the columns produced by This Corollary eliminates the groupby at the top Fl in the first group-by of table u[Cd]& on NG&, query block. This can be viewed as a more general and CNT the column produced by COUNT(*) after case of groupby push down, which pushs down group- grouping g[CU]R, on NGA,. Also assume that AA by into two lower query blocks. We call the transfor- belongs to Rd. mation double group-by push down. Its reverse trans- formation, which pull up group-by’s from two lower Corollary 3 (Double Eager/Double Lazy): the query blocks, is called double group-by pull up. Please expressions refer to pan951 for the proof. A simple way to ensure that the conditions of the El : F[AAITA[GA~, GA,, AA]B[GAd, GA,] corollary hold is to use GA: and GA: as NGAd and a[Cd A co A cu](Rd X Ru) NGA, . Then, if (GA,, G&) functionally determines the join columns, we can apply the Corollary. and I Similarly, it is also possible to eliminate the group by at the top query block after eager count, eager Ez : F,[Fz[FAA], CNT]rA[GAd, GA,,, FAA, CNT] groupby-count and eager split to obtain the push down C?[GAd, GA&[Co]((COUNTO versions for these transformations, and analogly, the flA[NGAu, GAu+]G[NGAu]&L]Ru) pull up versions for the lazy aggregations. Due to space limitation, we cannot provide the conditions x (FI[AA]TA[NG& G&+, AA]G[NGAd] here. Please refer to pan951 for detailed conditions @dRd)) and proofs. Note that, the push down/pull up version of eager/lazy groupby is group-by push down/pull up. are equivalent if (1) NGA,+ GA,, holds in +Z’,,]R,, (2) NGAd+ GAd holds in a[&]&, (3) all aggregation functions in F are decomposable and 9 Eager Split and Lazy Split can be decomposed as Fl and F2, (4) all aggregation If we apply eager groupby-count twice to Rd and R, functions in F are class C OT D and its duplicated ag- respectively, we can perform eager aggregation on both gregation function is F,. tables before the join. We call this transformation ea- ger split since the aggregation is computed separately The proof of this corollary is straightforward. It before the join. We call the reverse transformation can be done by first performing an eager/lazy group- lazy split. Both transformations are illustrated in Fig- by and then an eager/lazy count. ure 2(g) and (h). Again, when F in the corollary contains only class D In the following corollary, (1) NG& and NGA, de- aggregation functions, we can simply use a DISTINCT notes a set of columns in Rd and R,, respectively; (2) in the subquery block of R,. Note that in this case, F, CNT, the column produced by COUNT(*) after group- is the same as F. The following corollary shows when ing u[Cd]& on NGAd; (3) CNTz the column pro- the group-by at the top query block may be eliminated. duced by COUNT(*) after grouping u[C,]R, on NGA,; 353 (4) FAAd the columns produced by Fd in the first ag- columns(NC&). According to Corollary 1, we can gregation of table 6[Cd]Rd on NGAd; (5) FAA, the add more Rd columns to NGAd without changing the columns produced by F,, in the first aggregation of result of the query. Normally we want to choose a table o[C,]R, on NGA,; and (6) Fd,, and F,,, the new set of grouping columns only if the new set has duplicated aggregation function of Fd and F,, respec- some ordering properties that save sorting time. For tively. Also assume that (1) AA = AAd ud AA,, where example, if the ordering property on a new column AAd contains only columns in &, and AA,, contains is supported by a clustering index, then after the new only columns in R,; (2) F = Fd ud F, where Fd ap- column is added into NC&, it can be used as the ma- plies to AAd and F,, applies to AA,,. jor of the sorting columns(assuming sorting is used for GROUPBY). The subsequent sort may be faster since Corollary 5 (Eager Split and Lazy Split:) The the minor columns are sorted in a smaller range, plus expressions the advantage of sequential fetching of data rows. In this case, even if one of the GA$ columns has an index, El : F[A&, ~A,]~A[G& GA,, AAd, AA,] since the index is not clustered, it may be more expen- g[GAd, G&]&‘d A co A Cu](Rd x Ru) sive to perform the grouping using GA: as the group- and ing columns than using the clustering index column and GA: as the grouping columns. Therefore, we want E2 : %[G&, GA,, FAA] to consider possible beneficial addition of columns to (L#‘uz[FA&], CNrr,], Fda[Fdz[FA&], CNTz])GA,+as eager grouping columns. We call such columns promising columns. Since adding new columns is of- nA[GA,j, GA,, FAA,, FAAd, CNTI, CNTz] ten not beneficial, a good heuristic might be not to 8[GAd, G&l+‘o, cu]((((Fdl[AAd], COUNT[) add grouping columns beyond GA$. r.dN’=d, GA:, AA&[NG&]u[‘%]Rd) When performing eager aggregation, our objective x ((Ful[AAwl, COUNTU) is to achieve data reduction before the join, so we want each partial group to contain as many rows as possible. AA[NG&, GA:, AAMNGA&[G]~u)) Therefore, if NC& contains a unique key of a[Cd]&, are eqUiVaknt if (I) aggregation function3 Fd contain we should immediately abandon using this set for eager only decomposable aggregation functions that can be group-by. decomposed into Fdl and Fd2; (2) aggregation fwac- tions F,, contain only decomposable aggregation junc- 10.1.2 Table Partitioning tions that can be decomposed into Ful and F,,z; (3) When the query contains more than two tables, there F, and Fd contain class C OT D aggregation func- may be several ways of performing eager aggregation. tions; (4) NG&+ GA,+ holds in 6[Cd]Rd; (5) The question is how to partition the tables in the FROM NGA,+ GA: holds in u[C&] R, . clause into Rd tables and R, tables. Section 10.3 dis- The proof of this corollary is also straightforward. It cusses the way to partition tables to obtain all pos- can be done by first performing an eager/lazy groupby- sible transformations. We assume that table parti- tioning has been done before calling the algorithm in count on Rd, and then an eager/lazy groupby-count on Section 10.1.3. &a. 10.1.3 The Algorithm 10 Algorithms and Implementation Assuming table partitioning is done, we have the fol- 10.1 Algorithm for Eager Aggregation lowing algorithm for finding valid eager aggregation. In this section, we present a practical algorithm for In this algorithm, we choose not to add new columns recognizing all valid eager transformations for a given to either NC& or NGA,. In the following algorithm, query. We assume that & tables contain aggregation & tables must contain aggregation COhnUS. columns and R, tables do not. That is, all queries belong to the class of queries specified in Section 3. Algorithm 1 Eager Aggregation Inputs: input query, &, R,, AA 10.1.1 Finding the Eager Grouping Columns Output: all possible rewritten queries for Eager Aggregation 1 NG& := GA,+ and NGA, := GA: Given two sets of tables, Rd and R,, with Rd ta 2 eagerd = false, eageru = false bles containing aggregation columns and R, tables 3 if NC& is not a unique key of U[Cd]Rd not, let’s first consider eager group-by. We can 4 eUgeTd = true start with NC& using GA: as the eager grouping 5 end if 354 6 if NGA, is not a unique key of (r[C,,]R, an optimizer, we can first perform group-by pull up 7 eager, = true and lazy aggregation to obtain a canonical form in 8 end if which all group-bys are delayed as late as possible. 9 if eager, and eageTd Then, during dynamic programming process, when- 10 if no aggregation cohmns in R, ever a table access plan or join plan is constructed, we 11 Apply double eager on & end R, can consider adding a group-by on top of the plan. All 12 Output the reuritten query tables in the query are then partitioned into two sets, 13 else the set containing all tables in the current join plan, 14 Apply eager split on both .& and R, 15 Output the rearitten query and the set containing the remaining tables. We can 16 Apply eager groupby-count on Rd then apply Algorithm Eager Aggregation to find all 17 Output the reuritten query possible eager aggregations. There can be several pos- 18 end if sible ways for adding an aggregation on top of a plan. 19 else if eageTd and not eager, The optimizer may want to choose the cheapest way 20 if no aggregation columns in R, for each plan to reduce optimization cost. Then, for 21 Apply eager group-by on & each original join plan, there is at most one additional 22 else plan that performs a group-by at the top. On the 23 Apply eager groupby-count on Rd other hand, when considering join plan for two input 24 end if streams, the optimizer can consider the alternatives of 25 Output the reuritten query taking the streams with or without aggregation. If the 26 else if not eagerd and eager, optimizer employs an exhaust search and considers all 27 if no aggregation cohmns in R, possible join plans in the dynamic programming pro- 28 Apply eager count on &, cess(e.g., Starburst), all possible transformations can 29 else be found in this process. This approach is also suitable 30 Apply eager groupby-count on & 31 end if for dynamic programming process that generates only 32 Output the rearitten query left deep trees or right deep trees. However, it might 33 else overlook some possible rewrites. 34 Output “No transformation” 35 end if 11 Queries Including HAVING END Algorithm 1 A query with a HAVING clause can always be trans- 10.2 Algorithm for Lazy Aggregation formed into one without. This technique is well known and is used in existing database systems. For exam- Now consider lazy aggregation. Whenever a query ple, the Starburst optimizer always transforms a query matches the form in any one of our theorems and sat- with a HAVING into one without at the beginning of isfies their conditions, we can perform a lazy aggre- the query rewrite phase[PHH92]. After the HAVING is gation to eliminate one GROUPBY(or DISTINCT), and eliminated, we can perform eager aggregation trans- delay grouping until after the join. Lazy aggregation is formation on the view created. especially useful when the join is highly selective. The Now consider lazy aggregation. When a HAVING is algorithm to find all valid lazy aggregation transforma- eliminated in a subquery block with an aggregation tion for a given query is to iterate through each avail- (either groupby or distinct), and the HAVING clause able transformation and output the rewritten forms. contains no aggregations, then the predicate in the Please refer to [Yan95] for a detailed description for HAVING clause can be moved to the WHERE clause and the algorithm. we can then try to apply one of our lazy aggregation theorems. If the HAVING clause contains aggregations, 10.3 Implementation we usually give up performing lazy aggregations be We need to find a way to efficiently integrate ea- cause the HAVING predicates have to be evaluated be ger/lazy aggregation and group-by push down/pull fore the join. However, it is possible to perform lazy up into existing optimizers. The standard technique aggregation when the HAVING clause of a query con- for determinating join order in a cost-based opti- tains aggregation. mizer is dynamic programming in a bottom up (e.g., We formally proved our theorem for the conditions Starburst[LohM]) fashion. During the dynamic pro- of groupby push down transformation for queries con- gramming process, plans for table accesses, two-table taining a HAVING clause in [yL95]. The process to joins, three-table joins and joins involving more tables prove conditions of eager aggregation for queries with are constructed and kept until the final query plan is a HAVING clause is completely analog to our previous obtained. To integrate the transformations into such effort. Due to space limitation we shall not present the 355 conditions and proof here. Table 2: Ratio Between Best And Worst Elapsed Time 12 TPC-D Queries For All TPC-D Queries That Can Be Transformed We can apply group-by push down/pull up and ea- 1 Q ) # of I Worst 1 Best 1 W/B t ger/lazy aggregation to twelve of the seventeen queries rewrites Formulation Formulation Ratio in the TPC-D benchmark and significantly reduce the 3 3 PD on L/O Oligbl 3.93 elapsed time of six queries on DB2 V2 Beta 3, as shown 5 7 EG on EG on 43.71 in Table 1. For example, it improves the elapsed time w/w L/O/C 7 7 Origilld EGonL 2.58 of Query 5 by a factor of ten. Table 2 shows the ratio 8 14 EG on L/S EG on 501.02 between best and worst elapsed time for all TPC-D official queries that can be transformed3. The perfor- mance difference between a badly formed query and a better formed query can be very significant. Partic- ularly, in applications where queries are generated by tools or inexperienced users, automatic transformation I I 1 for both 1 I I of queries is indeed very important. aggregations In both Table 1 and 2, each table is represented by 12 2 EConL Otigilld 1.02 the first letter of its name, except that table PART- 13 2 EGonL Olighl 16.59 SUPP is represented by PS. Also, we use PD, PU, EG, 14 2 Origin&l EGonL 1.07 EC and DC to represent group-by push down, group 15 2 PU Ori& 2.51 by pull up, eager group-by, eager count and query de- correlation transformations respectively. original query. The access plan must maintain a count Table 1: TPC-D Queries With Reduced Elapsed Time of the number of duplicates being removed. Then, af- (Compared With Original Formulation) ter or during the join, the access plan must restore the duplicates. Chaudhuri and Shim[CS95] also general- ized group-by pull up to handle the case when the join is a many to many join. 14 Conclusion Group-by push down and group-by pull up interchange the order of join and group-by. The number of group by’s is unchanged. Eager aggregation introduces an additional group-by before a join, and lazy aggregation eliminates a group-by before a join. Groupby push 13 Related Work down and eager aggregation reduces the number of We proposed the idea of eager aggregation and lazy rows participating in a join, groupby pull up and lazy aggregation in [Yan94]. Chaudhuri and Shim[CS94] aggregation reduces the number of input rows to the also independently discovered eager group-by and ea- groupby. Both directions of transformation should be ger count. Their simple coalescing grouping and gen- considered during query optimization. eralized coalescing grouping correspond to our eager We classify eager aggregation into five different group-by and eager count transformation, respectively. types: eager group-by, eager count, double eager, ea- They also proposed an algorithm to integrate group- ger groupby-count and eager split. Eager groupby by push down, eager groupby and eager count into a partially pushs down a groupby on the tables that greedy join enumeration algorithm which produces left contain all aggregation columns; eager count partially deep trees in a cost based optimizer. However, they pushs down a groupby on the tables that do not con- did not discuss lazy aggregation transformation in the tain any aggregation columns; double eager partially paper. pushs down a groupby on both types of tables; eager Gupta, Harinarayan and Quass[GHQ95] general- groupby-count partially pushs a groupby into a sub- ized group-by push down in another fashion. They set of tables containing the aggregation columns; eager showed that it is possible to perform early duplicate re- split splits a group-by into two group-bys and partially moval before a join when there is no aggregation in the pushs the groupbys down the two input streams of the 3The ratio marked as ‘infinity’ means that the query with the join. As a special case of double eager, we can com- worst formulation ran out of system space and did not finish. pletely push down group-by into two input streams, 356 CUSTOMERS(C_) PARTSUPP(PSJ SUPPLIERS ORDERS(O-) LINEITEM (L_) 15K 1K 1K 160K 6ooK tTAX Legend: l Scale factor 1 l The highlighted column names in each table form its primary key l The number below a table name shows the number of rows of the table. 1 COMMENT Figure 4: Subset of the TPC-D Database which is call double groupby push down. Similarly, References we classify lazy aggregation into lazy groupby, lazy [CS94] S. Chaudhuri and K. Shim. Including group-by count, double lazy, lazy groupby-count and lazy split, in query optimization. In Proc. VLDB Conf., which perform the reverse transformations of their ea- pages 354-366, Santiago, Chile, Sep. 1994. ger counterparts. We also provide practical algorithms [CS95] S. Chaudhuri and K. Shim. Optimizing com- for identifying all possible transformations. The algo- plex queries: A unifying approach. Tech. Re- rithms do not restrict the join order of the query. port HPL-DTD-95-20, HP Lab, Mar. 1995. Future work includes: (1) finding the conditions of [GHQ95] A. Gupta, V. Harinarayan, and D. Quass. lazy transformation for subqueries containing an ag- Aggregate-query processing in data warehous- gregating HAVING clause; (2) finding necessary and suf- ing environments. In Proc. VLDB Conf., 1995. ficient conditions for all the transformations in the pa [IS0921 1.30. Information Technology - Database lan- per and (3) finding eager/lazy aggregation transforma- guages - SQL. Reference number ISO/IEC tions involving other binary relational operations (e.g., 9075:1992(E), Nov. 1992. UNION, INTERSECT, EXCEPT and OUTER JOIN). [Loh88] G. M. Lohman. Grammar-like functional rules for representing query optimization altern* tives. In Proc. ACM SIGMOD Conf., pages 18- Acknowledgements 27, Chicago, Illinois, June 1988. [PHH92] H. Pirahesh, J. M. Heiierstein, and W. Hasan. We thank the referees for their many useful comments. Extensible/rule based query rewrite optimiza- We also would like to thank Guy M. Lohman for sug- tion in STARBURST. In Proc. SIGMOD Conf., gesting the word eager and K. Bernhard Schiefer for pages 39-48, San Diego, California, June 1992. his help with the TPC-D experiments. We also want [Raa95] F. Raab, editor. TPC Benchmark D (Deci- to extend our appreciation to Surajit Chaudhuri and sion Support), Working Draft 9.1. Transaction Kyuseok Shim for their valuable comments. Processing Performance Council, San Jose CA, 95112-6311, USA, February 1995. [yan94] w. P. Y an. Query optimization techniques for aggregation queries. Research Proposal, Univer- A The TPC-D Database sity of Waterloo, April 1994. TPC-D is a decision support benchmark proposed by pan951 W. P. Yan. Rewrite optimization of SQL queries the the Transaction Processing Performance Coun- containing GROUP-BY. PhD thesis, Depart- ment of Comp. Sci., University of Waterloo, cil(TPC). It is a suite of business oriented queries to Sep. 1995. be executed against a database that allows continuous access as well as concurrent updates[Raa95]. The size w 941 W. P. Yan and Per-Ake Larson. Performing group-by before join. In Proc. IEEE ICDE, of the database is scalable adjusted by a scale factor. pages 89-100, Houston, Texas, Feb. 1994. The scale factor for a 1OOMB database is 0.1. The size of the database we used through out this paper IF951 W. P. Yan and Per-&ce Larson. Interchanging the order of grouping and join. Technical Report is 100MB. Figure 4 shows the subset of the TPC-D CS 95-09, University of Waterloo, Feb. 1995. database we used in this paper. 357