On Query Optimization in Relational Databases
by
John Ngubiri PGDCS(Mak), BSc/Ed(Mak) ngubiri@ics.mak.ac.ug, 071921969
A Dissertation Submitted in Partial Fulfillment of the Requirements for the Award of a Degree of Master of Science in Computer Science of Makerere University
May, 2004
Declaration
I, John Ngubiri do hereby declare that this dissertation is my original work and has never been submitted for any award of a degree in any institution of higher learning. Where quotations have been used, they are acknowledged through references.
Signed .....................................Date...................................... John Ngubiri (2002/HD18/423U) Candidate.
i
Approval
I certify that this is the original work of John Ngubiri and has been done under my supervision. The work has never been submitted for any award of a degree in any Institution of higher learning.
Signed......................................Date....................................... Dr. Venansius Baryamureeba, Ph.D Supervisor.
ii
Dedication
This Dissertation is dedicated to:
• The two ladies in my life: my mother and my fianc˙ Eve. e • The two gentlemen in my life: my father and my son Matthew.
iii
Acknowledgments
My sincere appreciation goes to Dr. Venansius Baryamureeba Ph.D, Director Institute of Computer Science and my supervisor for all his advise, guidance and encouragement. Without you, this dissertation would not be the way it is now. Most likely, it would not be there! I would also wish to thank the staff at Institute of Computer Science - Makerere University with whom I associate and hence grow academically. Notable on the list are Mr Habibu Atib - for the very initial discussions, Dr Vincent Ssembatya (Department of Mathematics) for the Algorithm design background and Ms Josephine Nabukenya for technical writing assistance. To you all, I say Thank you. John Ngubiri -June 2004.
iv
Abstract
Query Optimization is an important process in Relational databases. With the current load on databases increasing, the need to optimize queries in batches is a promising way out. Studies have shown that sharing among common sub-expressions can as well be beyond the optimal plans of the constituent queries. But challenges of excessively large sample space, buffer management, establishment of optimal order of optimization, and identification of disjoint queries remain in place. In this dissertation, We propose how We can efficiently establish the extent of inter-query shareability and exploit it so as to compute common sub-expressions once and share the output among the queries. We also propose the optimal order of optimization so that the sharing is done in a more cost saving and time conserving manner.
v
List of Figures
1.1 1.2 1.3 1.4
The stages of query processing . . . . . . . . . . . . . . . . . . . . . .
7
Query Tree Representation . . . . . . . . . . . . . . . . . . . . . . . . 16 The optimal plan of the query . . . . . . . . . . . . . . . . . . . . . . 17 Conceptual Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1
The different tree configurations: bushy, complex deep, left deep and right deep trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 2.3 2.4
A pipeline Schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 Random nature of the II algorithm . . . . . . . . . . . . . . . . . . . 41 Query representation: Tree, DAG and extended DAG . . . . . . . . . 46
vi
Symbols
Symbol Equivalent Relational Algebra Operation σ Π × Select Project Cartesian Product Join θ A comparison operator
vii
Contents
1 Introduction 1.1 1.2 1.3 1.4 Background to Databases . . . . . . . . . . . . . . . . . . . . . . . .
1 1 3 6 8 9
The Relational Model . . . . . . . . . . . . . . . . . . . . . . . . . . . Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4.1 1.4.2 1.4.3 Systematic Query Optimization . . . . . . . . . . . . . . . . .
Heuristic Query Optimization . . . . . . . . . . . . . . . . . . 10 Semantic Query Optimization . . . . . . . . . . . . . . . . . . 11
1.5
Single Query Heuristic Optimization . . . . . . . . . . . . . . . . . . 13 1.5.1 The Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
viii
1.5.2 1.6
Challenges to Single Query Optimizers . . . . . . . . . . . . . 18
Multi-Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . 18 1.6.1 1.6.2 The Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Challenges of Multi-Query Optimizers . . . . . . . . . . . . . 19
1.7 1.8 1.9
Statement of the Problem . . . . . . . . . . . . . . . . . . . . . . . . 21 Justification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 Objectives: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.10 Scope
1.11 Conceptual Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 24 1.12 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.13 Dissertation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2 Literature Review 2.1 2.2
27
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Approaches to Query Optimization . . . . . . . . . . . . . . . . . . . 30 2.2.1 Single Query Optimization . . . . . . . . . . . . . . . . . . . . 30
ix
2.2.2 2.3 2.4
Multi-Query Optimization (MQO) . . . . . . . . . . . . . . . 32
The Effect of Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . 35 Single Query Optimization Algorithms . . . . . . . . . . . . . . . . . 38 2.4.1 2.4.2 Dynamic Programming Algorithms . . . . . . . . . . . . . . . 38 Randomized Algorithms . . . . . . . . . . . . . . . . . . . . . 39
2.5
Multi-Query Optimization Algorithms . . . . . . . . . . . . . . . . . 43 2.5.1 2.5.2 The MQO Problem . . . . . . . . . . . . . . . . . . . . . . . . 44 Reuse Based Optimization . . . . . . . . . . . . . . . . . . . . 46
2.6
Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3
Query Shareability Establishment 3.1 3.2 3.3 3.4 3.5
55
Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 The Shareability Problem . . . . . . . . . . . . . . . . . . . . . . . . 58 Previous Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 The Greedy Search Algorithm . . . . . . . . . . . . . . . . . . . . . . 61 Improvements to the Greedy Algorithm . . . . . . . . . . . . . . . . . 63
x
3.6
The Improved Greedy Searching Algorithm . . . . . . . . . . . . . . . 65
4 Optimizing the Traversed Plans 4.1 4.2
67
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.2.1 4.2.2 4.2.3 The Basic Volcano Algorithm . . . . . . . . . . . . . . . . . . 70 The Volcano-SH . . . . . . . . . . . . . . . . . . . . . . . . . . 71 The Volcano-RU . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3
The Proposed Optimizing Algorithm . . . . . . . . . . . . . . . . . . 75 4.3.1 4.3.2 4.3.3 The Background . . . . . . . . . . . . . . . . . . . . . . . . . 75 The Optimizing Algorithm . . . . . . . . . . . . . . . . . . . . 77 Benefits of the new Algorithm . . . . . . . . . . . . . . . . . . 79
5 Discussion and Future Work 5.1 5.2
81
Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
xi
Chapter 1
Introduction
1.1
Background to Databases
Right from the early times of programming, programmers relied on well-organized data structures to simplify programming tasks [11]. Though it is normal to include variables in program codes, it is nearly impossible to include large data sets. Including data in codes implies that the code owns the data. The code therefore has to provide facilities for other programs to access that data when required. This brings in programming hardships since data normally has many processes like insertion, updates and deletion which have to take place not necessarily simultaneously. Hardships are more prominent in cases where a program has to access data from another program. For a process to allow access to its data, it must be running as well since data cannot be got from a dormant program. If the data owner is not running, 1
then it has to initiate some dummy operations hence wasting processor cycles and memory. Programs owning data come with many programming bottlenecks and the most outstanding ones, according to Johnson [11] are:-
• Unhealthy dependence between data and programs; • Repetition of data elements; • Opportunities for inconsistencies; • Unorganized scattering of related data across many files; • Distributed ownership of data; • Decentralized security of data; • Unregulated interactions between programs using the same data; • Inadequate multiuser access; • Ad hoc approach to error recovery; and • Over dependence on physical considerations such as disc track and sector addresses.
During the 1950s and 1960s, the above bottlenecks led to the development of database systems so that data can be independent and the application programs just access it. With the development of databases, data and application software 2
became independent but interacting components. A Database is a collection of logically related data and a Database Management System (DBMS) is a software product that helps in defining, creating, maintaining and controlling access to a database. DBMSs are grouped according to the model of their development. A database model is an organizing principle that specifies particular mechanisms for data storage and retrieval. There are five database models namely the Hierarchical, Network, Relational, Object and Deductive models; the Relational model being the most popular of all. The Hierarchical and Network models, which were developed before the Relational model was developed are referred to as the pre-Relational models while the Object and Deductive models, which were developed after the Relational model was developed are referred to as the post-Relational models.
1.2
The Relational Model
The Relational model is the most popular database model currently in use. It uses tables (relations) to organize the data elements stored. A relation represents an application entity while a row in the relation represents an instance of the entity. It replaced the Hierarchical model (whose organizational principle was the tree structure organization of data) and the Network model (whose organizational principle was the graphical representation of data)[11]. It was because of the popularity of
3
the Relational model that the Hierarchical and Network models were phased out. Johnson [11] attributes the popularity of the Relational model to:-
i The existence of a standard, easy and flexible querying language called Structured Query Language (SQL) which is universal to all Relational Database Management Systems (RDBMS); ii The existence of a simple data structure (relations) which makes it easily understandable even by non-technical users; iii The existence of a strong mathematical base in the model (Relational Algebra and Relational Calculus) for its operation.
Rammakshriman and Gherke [18] however observe that the Relational model has two strong setbacks which are:-
i Performance: A single SQL query can be written in many different ways each of which can have a different cost. Most of the queries are so expensive that executing them degrades the computer system and greatly slows the rate of output generation. ii Flatness: The relational model relies on inbuilt primitive data types yet data in real life situations is increasingly becoming complex. Representing complex fields as a set of primitive attributes leads to too many fields with often null entries. 4
Motivated by the above weaknesses, two post-Relational models emerged. These are the Object Oriented model and the Deductive model. The Object Oriented model represents an entity as a class and an entity occurrence as an object. It bases on the three principles of Object Oriented Programming which are encapsulation, inheritance and polymorphism. It gives the programmer freedom to create classes as dictated by the real life situation and the way he envisages the entities being modeled. This not only makes the database more problem focused, but it also eases reuse. It is suitable for creating complex systems. The Deductive model (also called the Inferential model) aims at storing minimal data in combinations called axioms. It allows for un stored combinations to be deduced from the existing ones. For example, if an axiom for a student being a member of a class is Member(studentN ame, class), the individual students memberships can be represented as Member(”Mark”, ”F ormT wo”) and Member(”Matthew”, ”F ormT wo”). An axiom like ClassMates(class, studentList) is not stored but is deduced from the Member axiom. Intensive research took place to address the limited data type and the query performance problems. The limited data type problem was addressed by incorporating object oriented characteristics in the relational model. This led to the introduction of a hybrid model, the Object-Relational model. DBMSs like Oracle and postgreSQL have object oriented characteristics. The inefficient queries problem was addressed by passing relational (and object relational) queries in a series of steps between the query source and the physical data storage level so as to have them 5
execute in a cost effective manner. This process is called Query Processing.
1.3
Query Processing
This is the process of transforming a high level query into a plan that executes and retrieves data from the database (Figure 1.2). It involves four phases which are Query decomposition, Query optimization, Code generation and Run time query execution [2].
(a) Query decomposition In this phase, a query is checked as to whether or not it conforms to the syntax of the language used (mostly SQL) and in case it does not, the error message is generated. If the query conforms to the syntax, it is broken into small pieces and represented in an equivalent relational algebra expression (parsing). A system catalog is used to cross check the consistency of the query with the schema. (b) Query optimization In this phase, the best execution plan is generated. This is done by putting into account the resources required to execute the query as well as the resources required to get the plan. Database statistics are used to make appropriate decisions.
6
(c) Code generation After the optimizer has got the best execution plan, the code generator creates the code equivalent to the plan. This is sent to the internal ANSI spark architecture level of the database for execution. (d) Query execution In this phase, the code interacts with the database and retrieves the data for consumption of the process or individual who sent the query.
Figure 1.1: The stages of query processing
The query processing activity therefore acts as an interface between the querying individual/process and the database. It relieves the querying individual/ process of 7
the burden of deciding the best execution strategy. So while the querying individual/ process specifies what, the query processor determines how [4]. In query processing, the least straight forward stage is Query Optimization. It is the efficiency of the query optimizer that determines how much resources are to be used and a measure of how suitable the DBMS is for critical and real time applications.
1.4
Query Optimization
Query Optimization is the process of choosing the efficient execution strategy for executing a query [2] and it is one of the most important tasks of any RDBMS. Ramakrishman and Gehrke [18] observe that SQL which is a de facto standard for data definition and data manipulation in RDBMSs has a variety of ways in which a user can express, and therefore a system can evaluate a query. The query optimizer therefore is responsible for finding the best execution strategy so that less resources are used to retrieve data. There are three main approaches to query optimization. These are Systematic, Heuristic and Semantic query optimization.
8
1.4.1
Systematic Query Optimization
In systematic query optimization, the system estimates the cost of every plan and then chooses the best one. The best cost plan is not always universal since it depends on the constraints put on data. For example, joining on a primary key may be done more easily than joining on a foreign key since primary keys are always unique and therefore after getting a joining partner, there is no other key expected. The system therefore breaks out of the loop and hence does not scan the whole table. Though in many cases efficient, it is a time wasting practice and therefore sometimes it can be done away with [4]. The costs considered in systematic query optimization include access cost to secondary storage, storage cost, computation cost for intermediate relations and communication costs. The importance put on these costs depend on the type of database. For example, for large databases, emphasis is put on minimizing access cost to storage and memory usage. For small databases however, where outputs can be stored in memory, emphasis is put on minimizing computational cost. On the other hand, in distributed databases, where many sites are involved, communication cost is of paramount importance and it has to be minimized since it normally involve costs of channel coding, security coding as well as other network related limitations like bandwidth and noise.
9
1.4.2
Heuristic Query Optimization
In the heuristic approach, the operator ordering is used in a manner that economizes the resource usage but conserving the form and content of the query output. The principle aim is to:
(i) Set the size of the intermediate relations to the minimum and increase the rate at which the intermediate relation size tend towards the final relation so as to optimize memory. (ii) Minimize on the amount of processing that has to be done on the data without affecting the output.
Connolly and Begg [2] state five main rules which are used in heuristic query optimization:-
(a) Perform selection operations as early as possible. This reduces the cardinality of the intermediate relation and hence reducing the resources used to process a column as well as the memory occupied per column. (b) Combine the Cartesian product with a subsequent selection operation whose predicate represents a join condition into a join operation i.e σR·aθS·b (R × S) = R
R·aθS·b
S. Elmasri and Navathe [4] observe that this reduces the complexity
of the joining algorithm (which is one of the most expensive operations in data 10
retrieval) for example in cases where individual relations are first sorted on the joining fields. (c) Use association of binary operations to rearrange the query so that the most restrictive selection operation is done first. This increases the rate at which the intermediate relation size tends to the final relation size hence minimizing on the memory occupied and the resources required to process a column. (d) Perform projection operations as early as possible. This reduces the order of the intermediate relation. It therefore reduces the memory occupied by the relation together with the amount of resources required to process a row. (e) Compute common expressions once. If a certain expression appears more than once, and its not too large, it is kept in memory so that when it is required again, it is reused. In case the expression is too big to fit in memory, it can be stored on a disk and later retrieved when wanted so long as the cost of retrieval is not greater than the cost of recomputing it.
1.4.3
Semantic Query Optimization
This is a combination of Heuristic and Systematic optimization. The constraints specified in the database schema can be used to modify the procedures of the heuristic rules making the optimal plan selection highly creative. This leads to heuristic rules that are locally valid though cannot be taken as rules of the thumb. For 11
example, if there is a query such as
12
SELECT Employee.lname, Supervisor.lname FROM Employee, Supervisor WHERE Employee·supervisorNo = Supervisor·No and Employee·salary≥Supervisor·salary
This is a very unlikely invent and its likely to be directly or indirectly in the database constraints. The database restriction may be like check Employee · salary between(S1, S2). Check Supervisor · salary between(S3, S4 ) where S3 > S2 . This shows that the Supervisor can never earn less than the Employee therefore the query yields no results. A Heuristic optimizer would go ahead and parse, optimize and execute the query resulting in no output which is a worst case scenario [9]. A semantic optimizer would recognize it by use of the constraints and respond ”Empty set” and saves the resources.
1.5
1.5.1
Single Query Heuristic Optimization
The Process
For a single syntactically correct query, when a query is broken down, and expressed into a relational algebra expression, a query tree is created [2] [4] with the interme-
13
diate operations as nodes; the source relations as leaves and the output as root. For example, given the table components below:Client Field
clientNo name prefType maxRent
Data type
four fixed characters twenty five variable characters ten variable characters float rounded to 2 decimal places
Description
Client identity number, primary key client Name property type of preference maximum rent affordable by the client
Property Field
propertyNo street rent ownerNo
Data type
five variable characters twelve variable characters float rounded to 2 decimal places four fixed characters
Description
Property identification number, primary key Street where the property is located monthly rent of property identification number of owner, Foreign key
Viewing Field
propertyNo clientNo
Data type
five fixed characters four fixed characters
Description
Property identification number partial primary key Identification number of viewing client partial primary key
14
and with a query
SELECT p.propertyNo, p.street FROM client c, viewing v, propertyForRent p WHERE c.prefType = ’Flat’ AND c.clientNo = v.clientNo AND v.propertyNo = p.proprtyNo AND c.maxRent ≥ p.rent AND c.prefType = p.type AND p.ownerNo = ’C093’;
In the process of getting the suitable plan, it is decomposed into a relational algebra expression; πp·propertyN o,p·street(σc·pref T ype= F lat ∧c·clientN o=v·clientN o∧·propertyN o=p·proprtyN o∧c.maxRent≥p·rent∧
c.pref T ype=p·type∧p·ownerN o= C093
(C×V)×P))
which is expressed as a query tree in Figure 1.2. The formed plan like one above, is in most cases not optimal. The optimizer therefore looks for a most cost effective plan. This may be either by using the heuristic rules and adjust the plan to a cost effective one or by looking for the most effective plan among the many available. In this case, the optimal tree is shown in Figure 1.3.
15
.
Figure 1.2: Query Tree Representation
16
.
Figure 1.3: The optimal plan of the query
It is this strategy which is sent to the code generator for code generation and data fetching.
17
1.5.2
Challenges to Single Query Optimizers
Since the single query optimizers handle one query at ago, they are unsuitable for handling a high traffic of queries. Since they work on a large sample space, they cannot be exhaustive in nature since that will be expensive and time wasting. The algorithm therefore has to look for a search strategy so that a thorough search is done, and the optimal plan is got without traversing all the options in an acceptably small interval of time.
1.6
1.6.1
Multi-Query Optimization
The Process
In multi query optimization, queries are optimized and executed in batches. Individual queries are transformed into relational algebra expressions and are represented as graphs [13, 16, 20]. According to Roy et al [20], the graphs are created in such a way that:
(i) Common sub-expressions can be detected and unified; (ii) Related sub-expressions are identified so that the more encompassing subexpression is executed and the other sub-expressions are derived from it. For example if We have σA≤5 (E) and σA≤10 (E), σA≤10(E) is executed and σA≤5 (E) 18
derived from it.
The optimizer therefore concentrates on:-
(i) Identifying the common expressions so that the database accesses and computational costs are minimized; (ii) Identifying groups of sub-expressions where internal derivations are possible so that derivations can be made and database access is minimized; (iii) Get a search strategy so that the search is not too long.
1.6.2
Challenges of Multi-Query Optimizers
Generation of optimal global queries is not necessarily done on individual optimal plans. It is done on the group of them. This leads to a large sample space from which the composite optimal plan has to be got. For example, if for four relations A, B, C, and D there are two queries whose optimal states are Q1 = (A B) C and Q2 = (B C) D with execution costs
1
and
2
respectively, the total cost is
1
+
2.
Though these queries are individually optimal, the sum is not necessarily optimal. The query Q1 for example can be rearranged to an individually non optimal state Q’ = A (B C) whose cost, say ε is greater than
1.
The combination of Q’ and Q2
may make a more optimal plan globally at run time in case they cooperate. Since there is a common expression (B C), it can be executed once and the result shared 19
by the two queries. This leads to a cost of ε + (B C) which can be less than
1
2
- ξ where ξ is the cost of evaluating
+
2.
If sharing was tried on individual optimal
plans, sharing would be impossible hence the saving opportunity would be lost. To achieve cost savings using sharing, both optimal and non-optimal plans are needed so that the sharing possibilities are fully explored. This however increase the sample space for the search hence a more search cost. The search strategy therefore needs to be efficient enough to be cost effective. Though this approach can lead to a lot of improvement on the efficiency of a query, it may still have some bottlenecks that have to be overcome if a global state is at all times to be achieved. The bottlenecks include:-
(a) The cost of Q’ may be too high that the sum of the independent optimal states is still the global optimal state; (b) There may be no possibility at all to have sharable components and therefore a search for sharable components is a wastage of resources. (c) The new query plans may have a lower resource requirement than the previous one but when the resources taken to identify the plan (search cost) take on more resources than the trade off hence no net saving on resources. This is very likely given the large sample space.
20
1.7
Statement of the Problem
Parallel real time demands are putting a big load on relational query optimizers especially in Decision Support and Business Intelligence tools, remote access architectures like databases on networks as well as fragmented architectures like distributed databases. This leads to overworking of hardware increasing wear and tear rate hence increasing the maintenance costs. Multi-query optimizers identify common sub-expressions, execute them once and share the outputs among the queries. This leads to significant performance gains [22]. Current optimization approaches make no effort to find out whether queries in a batch really have common sub-expressions and neither do they have an intelligent way of predicting the absence or exhaustion of common sub-expressions. Their cost-saving abilities are therefore not assured.
1.8
Justification
Though currently relational database optimizers are available, with a lot of efforts spent on them (average relational database optimizers are 50 - 60 man years of development effort [18]), they are highly inefficient in an algorithmic sense. For example, commonly used bottom up tree optimizers are O(2n ) in worst case [5]. There is therefore a need for further studies since:
(a) The saving made depends on how many common sub-expressions are present 21
and how easy it is to find them. We therefore need to establish to what extent queries are similar. We as well need to detect cases of disjoint queries. (b) Queries come from different sources and at different times therefore similarities between them is un-predictable. There is need to study how they can be scheduled so that they are executed in an optimal order. Though some of the current algorithms [20] acknowledge the importance of order, they do not attempt to systematically get the optimal order.
1.9
Objectives:
The objectives of the study were:-
(a) To study ways of establishing the extent of sharing among the queries as well as establishing the order of optimal execution. (b) To study existing algorithms, identify weaknesses and strengths with a view of improving them to minimize the weaknesses and integrate strengths of the different algorithms into more efficient hybrid algorithms. (c) Analyze improved and hybrid algorithms for efficiency and effectiveness.
22
1.10
Scope
(a) Optimization approach The research greatly concentrated on the combination of the Heuristic and Semantic approach to query optimization. (b) Database Model The research highly concentrated on the Relational model. It can however be applied in an Object-Relational environment. (c) Optimization Philosophy The research was based on the multi-query approach to query optimization
23
.
1.11
Conceptual Framework
Figure 1.4: Conceptual Framework
24
1.12
Methodology
In the process of coming up with the results of this study, the following methods were used.
1. Literature Evaluation This involved reading literature related to the concepts of query optimization. It also involved studying the criticisms as well as the improvements on them by the successive researches. 2. Comparison This involved comparing ways in which similar optimization styles (like single query or multi query optimization ) were handled by different approaches (algorithms). This was used in identifying similarities hence improving on the algorithms, or developing algorithms that put together the individual algorithm strengths.
1.13
Dissertation Overview
In Chapter-2, We look into the previous research which has taken place in the field of query optimization. We examine the rationale for query optimization, the approaches to query optimization as well as the use of pipelining so that materialization is minimized. We then examine the different algorithms that have been 25
proposed by the different researchers. We identify the merits and demerits of the algorithms. We then formulate research questions to guide us in the process of proposing improvements. In Chapter-3, We lay a foundation for exploiting the similarities between the different queries in a batch for a multi-query environment. We propose an algorithm that traverses the whole query plans batch so as to establish the extent of similarities. We improve the greedy algorithm to be intelligent enough so as to increase its efficiency. We summarize the inter query shareability in a query sharing matrix M. In Chapter-4, we use the information in the query sharing matrix and traverse the batch while merging common sub-expressions. The matrix is used while choosing the order in which query plans are to be considered for optimization. It also provides guidelines on the extent sharing between any two plans so that the optimizer makes appropriate decisions. For example we do not search for sharable sub-expressions between disjoint query plans. We then present our conclusion, recomendations and future work in Chapter-5.
26
Chapter 2
Literature Review
2.1
Overview
When the relational model was first launched in the late 1970s, one of the major criticisms often cited was inadequate performance of queries [2]. This was because queries used a lot of resources such as processor cycles and memory compared to other models for equal amounts of data. SQL, which is the de facto standard language for data definition and data manipulation in RDBMS [2] offers a variety of ways in which a query can be structured to achieve the same output. The more the complexity of the query the higher the number of ways a query can be represented. On average, the best measure of a relational query complexity is the number of relations a single query joins. In fact, even at the design level, developers introduce 27
redundancy or merge some relations such that joins in frequently invoked queries are minimized. Surajit and Kyuseok [1] state that the complexity of a query is exponential to the number of joins involved. Complex queries like those in data warehouses that normally join tens of tables are too expensive to process in reasonable time. Since the structural differences in queries depend highly on the way the joins are ordered, the more the tables joined the more the options of writing a query that can bring a single output. These queries, if executed have varying costs and in most cases (if not all) only one is optimal. The probability of writing the most optimal query tends to zero as the query complexity increases and the computer is most likely to waste a lot of resources. It is therefore the complex queries that must be optimized if a computer system is to work efficiently. An extensive query optimization phase must select the most efficient plan among the many available to process the query in an acceptably short time. Without query optimization, RDBMSs would be inefficient and hence un practical [18]. Query optimization is an expensive process because it mostly relies on evaluating the different plans (access paths) and choosing an optimal one among them. The number of alternative access paths grows at least exponentially with the number of relations participating in the query [15]. The optimizer therefore, which is nearly 100% sure that the plan sent by a user is not optimal, has to search for an optimal plan and forward it for execution. This has to be done with in the time constraint and in a resource-conserving manner. It would
28
not be worthwhile if the difference between the optimized and a pre-optimized query is less than the cost of finding the optimal plan. The user likewise is supposed to get what he requested for, in the same logical presentation as well as in an acceptable time interval. Therefore, as the user specifies what, the optimizer determines how [4] but still conserving the what. The process of optimization should have no effect whatsoever on the final query output. The search strategy therefore, on top of conserving the form and content of the query request must be efficient. Optimization is not a matter of transferring the resources that would execute the query to looking for the execution plan. The ability of the optimizer reaching the optimal plan, at the earliest opportunity, with substantial resource savings is therefore of paramount importance. Kroger et al [13] summarizes the goal of an optimizer as follows ”A plan as cost-effective as possible is looked for as soon as possible”. Kroger et al [13] further observe that the job of a query optimizer is not necessarily to get the cheapest plan (though the cheapest plan would of course be the best). In fact, if a stage is reached where the cost of further optimizing is higher than the resource savings, it is worthwhile to terminate the search. The optimizer is supposed to economize the resources spent on looking for a plan as well as putting into consideration the time of processing (time of execution + time of optimization). Depending on the nature of the problem therefore, a sub-optimal plan may be preferred especially in a real time scenario.
29
Given the large number of possible plans, traversing them, one by one, establishing the cost of each may be the ideal strategy but it is time wasting since the options are too many and it is likely to produce a low cost query but having spent a lot of resources to get it.
2.2
Approaches to Query Optimization
Broadly, computers optimize queries either individually (Single-query optimization) or as batches (Multi-query optimization).
2.2.1
Single Query Optimization
In Single Query Optimization, a query, which is syntactically correct, is broken down, expressed into a relational algebra expression, and a query plan, represented as a tree is created [4, 18]. This is a traditional approach to query optimization and is used in most commercially available optimizers. It is suitable where a database receives a low traffic of simple queries. Depending on the algorithm used, either the different representation of the original query are generated and the best one searched for or the query supplied is adjusted to the optimal one. If the option of choosing the best tree from the different trees available is used, there is a high possibility of logical duplicates (two physically dif-
30
ferent trees, doing the same thing, the same way). Since the number of options is likely to be high, such options normally overload the memory and require an exhaustive algorithm. There is a likelihood of groups of plans, with the same cost implying that more searching is made but with no practical advantage. Using exhaustive algorithms is a reason for inefficiency of many query optimization research works carried out [20]. Single processor optimizers therefore limit the search space by considering only some tree configurations [12]. This may be left deep, right deep, complex deep or bushy tree configuration.
Figure 2.1: The different tree configurations: bushy, complex deep, left deep and right deep trees
The IBM System R optimizer for example, considers only left deep trees. Heuristics may as well be used to eliminate some obviously expensive plans before the optimization so as to reduce the search space. The rules may be static or dynamic. Dynamic rules are applied basing on available data like database statistics or system catalog. Failure to reduce the search space may cause effects that lead to the decline in the performance and cost effectiveness of the optimizer. These are:31
(i) Too many options may overload the memory. In cases where the query is complex, the computer may not have enough memory space to perform other routine activities. (ii) In the process of conserving the memory, the computer may have to store some of the plans on disk and retrieve them when required. This then brings in costs of writing and reading from disk which increase optimization costs. (iii) The many options put a bigger load on the processor during cost estimation and comparison.
Restricting the tree types used in optimization therefore eliminating duplicates reduces the number of possible options hence reducing on the search space saving the traversal of many options as well as memory usage.
2.2.2
Multi-Query Optimization (MQO)
In MQO, queries are executed in batches. Some of the MQO techniques act in such a way that in case some queries have a common sub-expression, such a subexpression is executed once and the output shared. In some cases, the sharing does not necessarily take place on individual optimal plans [20] but instead sub-optimal plans are used. Decision may as well have to be taken whether the common subexpressions should be pipelined or materialized [16]. Some multi-query optimization techniques, like those described by Kyuseok et al [14] basically aim at having parallel 32
optimization of many queries. The queries pass through the different optimization steps together and as an output,which is a set of optimal plans for each query is generated. Roy et al [20] criticize this approach on a basis that further cooperation can be made between the queries that make up the batch. If a certain sub-expression is common, then the computer should execute it once and share out the results. This is a guiding principle to the Basic Volcano algorithm proposed by Goetz and McKenna [7], and the Volcano-SH and Volcano RU optimizer algorithms proposed by Roy et al [20]. Roy et al [20] further put the sharing of the sub-expressions to a great importance that even if the sharing takes place on a non-optimal plan of the query, so long as the total resource requirement is optimal, it is acceptable. If for example the queries Q1 and Q2 have locally optimal plans Q1 = (R S) P and Q2 = (R T) S, there exists no common sub-expressions in them and therefore need to be executed independently. However, if Q2 takes a locally sub-optimal plan Q2 = (R S) T, then (R S) is a common sub-expression. If it is executed once and Q1 and Q2 share it, there is a likely reduction in the amount of resources that are required to execute the two queries to a value less than the sum of the two optimal plans. This is however done on the sub-optimal plan of Q2 . A multi-query optimizer therefore, according to Roy et al [20] is responsible for recognizing the possibilities of shared computations and modifying the optimizer search strategy to explicitly account for shared computations so as to find a globally optimal plan. This sharing however may not be necessarily optimal since:33
(i) The cost of Q may be too high that the sum of the independent optimal plans is still the global optimal; (ii) The shared plan may have a lower resource requirement than the non shared ones but when the resources taken to achieve the plan take more resources than the trade off hence an efficient query in an inefficient system; (iii) The sharable components may be too few and the ratio of sharable to total components is too low. The search for sharable components may produce very little sharable components that the saving is far less than the searching cost.
The inclusion of sub-optimal plans increases the sample space hence a more efficient search technique is required. Sharing in multi-query optimization highly hinges on the availability of the sharable sub-expressions. The elimination of some plan types, like in the way it is done in System R optimizer may not be useful since the eliminated types may have sharable components. The elimination may instead depend on the cost of the sub-expressions in the plan itself since excessively expensive expressions, even if shared, are very unlikely to create global optimality. Elimination of some expression is done so as to reduce the search sample space. In reuse based cases, cost not configuration should be the basis.
34
2.3
The Effect of Pipelining
Dalvi et al [16] observed that the most multi-query optimization techniques assume that all sub-expressions have to be materialized so that they are read from disk when they need to be used. This is actually assumed in the contributions of Roy et al [20] as well as Graefe and McKenna [7]. Dalvi et al [16] suggest that the materialization of a sub-expression needs not to be made a rule of thumb since in case the subexpression is pipelined instead, the cost of writing to disk while materializing and the cost of reading from disk while reusing are saved. The technique of pipelining however brings in new challenges like limited buffer space. The difference between the rate at which a source operator produces the output and the rate at which the destination operator needs the output may make the pipeline schedule unrealizable.
Figure 2.2: A pipeline Schedule
35
If for example We take the schedule shown in Figure 2.2, the output of A is shared between B and C. Let the rate at which A is producing tuples be rA and the rate at which B and C need the tuples be rB and rC respectively. The schedule is not realizable for all instances of the rates. If for example rB = rC and rA is less than rB (hence also less than rC ). This means that the schedule is unrealizable since A cannot produce tuples at a rate satisfying the desires of B and C. If the reverse is true, then A produces tuples at a rate greater than what B and C can consume. This implies that tuples will pile in the output buffer of A (if the pull model is used) or pile in the input buffers of B and C (if the push model is used). The piling of tuples may lead to filling of the buffers hence a deadlock. It should be noted that rB is not necessarily equal to rC hence the situation may be more complicated. Dalvi et al [16] came up with the neccesary and sufficient conditions for a schedule to be pipelinable otherwise it is materialized. Rao and Ross [19] propose a less conservative pipeline schedule compared to that of Dalvi et al [16]. It allows simultaneous input of multiple tuples unlike that of Dalvi et al [16]. Much as some sub-expressions may satisfy the neccesary and sufficient conditions for pipelining, it may still be impossible to pipeline them. This is because pipelining of a sub-expression implies that it has to be kept in cache. Given the limited cache size, it may be impossible to keep all the sub-expressions in cache.
36
Hulgeri et al [10] aim at optimizing the use of memory by appropriately dividing it among the pipelinable sub-expressions in the plan. The amount of memory given to a sub-expression determines the cost estimate associated with it. Given the finite value of memory therefore, not all the sub-expressions may be finished. Hulgeri et al [10] however use the memory division to accommodate as many pipelinable schedules as possible. Faced with a scenario where all the pipelinable schedules may not be pipelined due to limitations in memory, there is need to decide on what to pipeline so as to have a higher benefit. Urhan and Franklin [24] propose a dynamic approach to pipelining so that the choice to pipeline is guided by the perceived importance (cost wise) of the output. This way, much as the total number of pipelined sub-expressions for two similar schemes may be the same, the savings may be different. Gupta et al [8] uses a rather different approach to optimal use of cache. They use query (and hence sub-expression) scheduling to achieve optimal use of memory. A sub-expression may be removed when it still has other processes to serve so long as it is replaced with one whose benefit is higher than the one removed. With pipeline, the algorithm has to make a decision to pipeline and materialize where the buffer space is not enough. More to that, not all reused sub-expressions are pipelinable. It is therefore limited to the pipelinable schedules and in case the pipelinable schedules cannot all fit in memory, then the pipelining has to be done in a buffer-conserving manner.
37
2.4
Single Query Optimization Algorithms
Single query optimization algorithms explore one query at ago, find an optimal plan and execute it. They mostly employ dynamic programming techniques and are commonly used in commercial optimizers such as the System R optimizer.
2.4.1
Dynamic Programming Algorithms
In dynamic programming algorithms, queries are turned into and optimized as query trees. In a tree, a node represents a join operation, the leaf represents a relation while the edge indicates data flow [15]. The execution space, which is a set of all join trees for a query is normally restricted deep trees, either right deep or left deep but sometimes bushy and complex deep trees. This is done to lower the sample space and the applied heuristics eliminate the duplicates further. The Left deep Algorithm The Left Deep algorithm works well for a space that allows only nested loop and hash joins as join methods. However the algorithm is poor for sort merge joins since it has to keep multiple plans for sub-queries with different interesting orders [15]. This algorithm, being selective has to be applied when:-
(i) The types of join algorithm to be used are known in advance. (ii) There is a facility to find out whether the join algorithms that are compatible 38
are the ones used and if they are not, a standby alternative algorithm is used.
2.4.2
Randomized Algorithms
These come in to handle rather complex queries which cannot be well handled by the dynamic programming algorithms [1]. For complex queries, there happens to be too many possible alternatives and therefore, a dynamic programming approach is likely to cause a combinatorial explosion [14]. The algorithms find different states and for each state, a cost function is used to find the costs of using it to execute the plan. The algorithm performs random walks from one state to another directly accessible state (neighbor). In case the move is from a cheaper state to an expensive state, it is referred to as an uphill move otherwise it is a downhill move. The algorithm basically computes the costs of alternatives, and makes downhill moves so as to achieve the minimum cost plan. There are three popular randomized algorithms: the Iterative Improvement(II), Simulated Annealing(SA) and Two Phase Optimization(2PO) algorithms. Iterative Improvement (II) This algorithm chooses a region at random, finds the local minimum and compares the local minima to come up with the global minimum. It randomly picks a state and looks for a minimum in the neighborhood.
39
The Iterative Improvement Algorithm
ProcedureII minS = infinity while not(Stopping condition) do{ S = random state # start of inner loop while(not local minima)do{ S’ = random state neighbour(S) if(cost(S’). For each i, either Piji ∈ Pi
44
in which case S suggests that Pji should be used to process Qi or Piji = ”NU LL” if Piji = 0, then S suggests that Piji should be used to process Qi or else Piji = ”NULL” in which case S does not suggest any plan for processing Qi. The cost of S, denoted by scost(S) is the total cost for processing all the queries in S. 2. The initial state is S0 =< NU LL, NU LL, · · · , NU LL > while the final state is SF =< P1j1 , P2j2 , · · · , Pnjn > with Piji = NU LL ∀i in the final states. 3. Given that S =< P1j1 , P2j2 , · · · , Pnjn > let
next(s) =
min{i|Pij = NU LL} if {i|Pij = NU LL} = φ n+1 otherwise
4. Let the state S have at least one NULL entry and m = next(S) . Then the immediate successors of s include s = P1j1 , P2j2 , · · · , Pnjn satisfying the
following properties Piki = Piji for 1 ≤ i ≤ m Pmkm ∈ Pm Piki = NU LL for m + 1 ≤ i ≤ n The cost of a transition from S to s is the additional cost required to process the new plan Pmkm given the results of processing the plans in S.
45
This algorithm is not sharing oriented be it during optimization or during execution. The global cost therefore is a sum of the individual optimal plans. Therefore for queries Qi , with optimal plans Pi i = 1, 2, · · · n, the cost of execution is the sum of costs Pi =
alli
Pi
2.5.2
Reuse Based Optimization
All the multi query algorithms so far explored do not take into consideration possibilities of sub-expressions that may be common and save more in overall costs. In reuse-based optimization sharing possibilities are explored. The Volcano [7], Volcano-SH [20] and Volcano-RU [20] use the same principal though using different philosophies. Most reuse based algorithms use Directed Acyclic Graphs (DAG) to represent the search space. In some cases however, search space is represented as an AND-OR DAG.
Figure 2.4: Query representation: Tree, DAG and extended DAG
46
An AND-OR DAG is a DAG whose nodes are divided into two: the AND nodes and the OR nodes. AND nodes have only OR nodes as their children and OR nodes have only AND nodes as their children. The AND node in an AND-OR DAG has algebraic operation like select (σ), project (π), etc. They are therefore referred to as operational nodes. The OR node of an AND-OR DAG represents a logical expressions that generates the same result set as when a child operational node is applied on its children/ child. OR nodes are therefore referred to as equivalence nodes. The expanded DAG is used as a representation for modern optimizers because they are easily extensible. Graefe and McKenna [7] proposed an algorithm for generating an AND-OR DAG and Roy et al [21] improved it into an efficient one. AND - OR DAG Adjustments for Multi query Optimization Roy et al [20] proposed three main adjustments that need to be made on an AND-OR DAG in order to get an optimal plan:
(i) Merging Since each query has a single DAG and the queries are to be optimized simultaneously, they need to be merged into a single DAG. A pseudo root equivalence node is created with all individual DAG roots as children. 47
(ii) Identification of common sub-expressions If two common sub-expressions exist in the pseudo rooted DAG then there will be two equivalence nodes, which are exactly the same or the same after applying join associativity. For example, in cases of X (iii) Handling of nodes derivable from others In case expressions that can be derived from others exist, the optimizer adds some nodes appropriately as children or as parents. For example, if two equivalence nodes e1 = σA≤5 (E) and e2 = σA≤10 (E) exist, e2 is made a parent of e1 in the part where e1 is and the output shared where e2 is while e1 is derived from e2 . In case there were nodes e3 and e4 where e3 = σA=5 (E) and e4 = σA=10 (E), then a node e5 = σA=5∨A=10 (E) is created and the two are derived from it. Y and Y X.
The Basic Volcano Algorithm This determines the cost of the nodes by using a depth first traversal of the DAG. The cost of operational and equivalence nodes are given by cost(o) = costofexecuting(o) +
ei ∈children(o)
cost(ei )
(2.1)
and the cost of an equivalence node is given by cost(e) = min(cost(oi )|o∈children(e) (2.2)
If the equivalence node has no children, then cost(e) = 0. In case a certain node has to be materialized, then the equation for cost(o) is adjusted to incorporate 48
materialization. For a materialized equivalence node, the minimum between the cost of reusing the node and the cost of recomputing the node is used. The equation therefore becomes cost(o) = costofexecuting(o) +
ei ∈children(o)
C(ei)
(2.3)
where C(ei) =
cost(ei)
if the node is not materialized
min(cost(ei), reusecost(ei)) if the node is materialized
The Volcano SH Algorithm
In Volcano-SH the plan is first optimized using the Basic Volcano algorithm and then creating a pseudo root merges the Basic Volcano best plans. The optimal query plans may have common sub-expressions which need to be materialized and reused. If for example a certain equivalence node has computational cost cost(e), materialization cost matcost(e), reuse cost reusecost(e), and is to be used numuses(e) times then the optimizer decides whether it should be materialized or not. For it to be materialized, there must be a saving on costs as a result. Therefore cost(e) + matcost(e) + numuses(e) × reusecost(e) < numuses(e) × cost(e) (2.4) which simplifies to reusecost(e) + matcost(e) < cost(e) numuses(e) − 1 (2.5)
This equation has one practical limitation. The volcano SH algorithm starts from the bottom upwards and the number of times a node is used depends on whether or 49
not the parents have been materialized. The cost depends on the children therefore its known but the number of times to be used needs first traversing the DAG which is an expensive process [20]. The Volcano-SH Algorithm Procedure VOLCANO-SH(P) Input: consolidated volcano best plan for virtual root of DAG Output: sets of nodes to materialize and the corresponding best plan P Global variable: M set of nodes chosen to be materialized M = { } Perform prepass on P to introduce subsumptions derivations let Costroot = COMPUTEMATSET(root) set Costroot = Costroot + Σd∈M (cost(d) + matcost(d)) undo all subsumptions on P where the subsumption node is not chosen to be materialized return(M,P) Procedure COMPUTEMATSET(e) if e is arleady memoised, return cost(e) let operator O be a child of e in P For each input equivalence node e of O Let C = COMPUTEMATSET(e) // return computational cost of ei if e is materialised, let C = reusecost(e) compute cost(e) = costofoperation(o) + ΣC if reusecost(e) +
matcost(e) (numuses(e)−1)
< cost(e)
if e is not introduced by a subsumption derivation add e to M //decide to materialize e matcost(e) else if reusecost(e) + (numuses(e)−1) is less than the savings to the parents of e due to the introduction of a materialised e add e to M memoise and return cost(e) 50
In such a scenario, numuses(p) if p is not materialised numuses(e) =
p∈parents(e)
U (p)where U (p) = 1 if p is materialised.
Instead of using numuses(e), its under estimate numuses−(e) is used. And the condition for materialization becomes reusecost(e) + matcost(e) < cost(e). (numuses−(e) − 1) (2.6)
Since the under estimate is used, once it holds, the actual value also must hold. Cases where the under estimate does not hold but the actual value holds have low savings. The Volcano-RU Algorithm The Volcano-RU exploits sharing well beyond the optimal plans of the individual queries. Though volcano SH algorithm considers sharing, it does it on only individually optimal plans therefore some sharable components which are in sub-optimal plans are left out. Including sub-optimal states however implies that the sample space of the nodes has to increase. The search algorithm must be able to put it into consideration so that the searching cost is still below the extra savings made.
51
The Volcano-RU Algorithm Procedure VOLCANO-RU Input: Expanded DAG on queries Q1, ..., Qn (including subsumptions derivations) Output: Set of nodes to materialize M and corresponding best plan P N = {} //set of potentially materializable nodes for each equivalence node e, set count[e] = 0 for i=0 to k compute Pi, the best plan for Qi assuming the nodes in N are materialized for every equivalence node in Pi set count[e] = count[e] + 1
matcost(e) if reusecost(e) + (count[e]−1) < cost(e) // worth materialising add e to set N Combine P1, P2, ... Pn to get a single DAG structured plan P (M, P) = VOLCANO-SH(P) //volcano-SH makes final materialization decision
The volcano RU algorithm aims at reusing and sharing sub-expressions which are not necessarily in the individual query optimal plans. Volcano-RU is sequential, considering possibilities of reusing expressions of previously optimized queries in subsequent queries. For a set of queries in the same pseudo root, after optimizing Qi, the nodes in the plans of Qi are identified. Since at that moment the algorithm has no idea of the structure of the subsequent queries, it checks whether it would be optimal if a certain node was materialized for reuse one extra time. While optimizing the next query, costs saving expressions are considered to be present. The VolcanoSH is then applied to further detect and exploit more sharing opportunities. In such a case, a query is able to share sub-expressions within itself and materializable plans are all identified. 52
Volcano-RU depends on the order in which queries are optimized. It can be done in a certain sequence, then in reverse order and the cheaper alternative is chosen. Considering more orders have a probability of getting a cheaper order but it increases the optimization time [20]. Reuse based algorithms are based on the number of economically reusable components for a batch of queries. The efficiency therefore is derived from the proportion of searches that result in materializable and sharable nodes. In some cases however, the order of optimization is of paramount importance. The materializability of a node depends on the nature of a node hence it is out of the control of the optimizer. The shareability of a node however depends on how common the queries in a batch are. In this research, We investigate the indicators of shareability as well as efficiently establishing the extent of sharing among them so as to make a more guided search.
2.6
Research Questions
1. Since queries are sent from different sources, they are different and random but since they are addressed to the same schema, there are possible similarities. How can We efficiently detect sharing opportunities (and lack of them) so that the search for common sub-expressions is done with a high probability of resource savings?
53
2. Materialization is a good sharing option but not the best. Pipelining is limited to specific schedules and a lot of pipelining is impossible given limited buffer space. Can’t We have a more cost saving schedule with minimal materialization for an AND-OR DAG plan?
54
Chapter 3
Query Shareability Establishment
3.1
Motivation
With the load on database applications increasing, there is a need to improve the efficiency of the query optimizers so as to accommodate the load together with the non-functional constraints on them like speed. The costs incurred mostly come in during physical storage access, storage of intermediate relations and computations on intermediate data. To minimize the frequency at which the database is accessed, Weave to ensure that common outputs are reused other than accessing the storage disk multiple times for the same data. Previous research in multi-query optimization [7, 8, 17, 20, 24] acknowledges the
55
need to exploit the sharable sub-expressions but highly depends on the possibility of the common sub-expressions existence. Since the queries come from different sources, there is a possibility that they have nothing in common hence leading to a worthless search. There is no effort to establish whether they actually exist or not. When queries are checked for syntax correctness and parsed into plans, they are sent to the query optimizer to generate the most cost effective plan. Multi-query optimizers traverse the different query plans (DAGs or AND-OR DAGs) and identify sub-expressions among which sharing can be done. Sharing of node outcomes saves memory, disk access and computation costs hence lowering the global resource requirements for the execution of the queries. Though the search for sub-expressions to be shared takes some resources (time and processor cycles), if a good search strategy is used, the savings exceed the searching cost hence a global advantage. Our aim therefore is to ensure that the cost of searching is as low as possible while the number of common sub-expressions are as many as possible so as to yield maximum benefits. Generally, there are three cases in which sharing is possible. Considering an AND-OR DAG. If W is any equivalent node of a plan, then sharing is possible when:-
(i) Nodes produce exactly similar results. e.g s1 = σx=a (W ) and s2 = σx=a (W ). In this case, only one is executed and the other just uses the output of the 56
executed node. (ii) The output of one node is an exact superset of the others. For example, if We have nodes like s3 = σx≤6 (W ) and s4 = σx≤10 (W ), s4 is executed and s3 is derived from it i.e. s3 = σx≤6(s4 ). (iii) Nodes retrieve data from the same (intermediate) relation or view but on different instances of the outermost constraint. For example, if We have s5 = σx=10(W ) and s6 = σx=15(W ), then a different node s7 = σx=10∨x=15(W ) is created so that the two are derived from it i.e. s5 = σx=10 (s7), and s6 = σx=15(s7 ).
The efficiency of the multi-query optimizer does not depend on how aggressively common sub-expressions are looked for but rather on the search strategy [20]. The search strategy may need to exploit issues like:-
i The ability to tell that there are no more sub-expressions common between a pair of queries without having to exhaustively search all the available subexpressions. ii The ability to chose the order in which the queries in a batch [8, 20] should be processed so as to have a cost effective process. iii To identify which nodes to materialize or pipeline [16], and the order in which those to pipeline should be assembled [8]. 57
iv Mixing materialization and pipelining for specific sub-expressions so as to optimize the use of cache.
Given that multi-query optimization takes place on many complex queries with many relations, comparing sub-expressions exhaustively leads to too many comparisons hence high comparison cost and time. Our aim is therefore to look for the extent of sharing without necessarily traversing all the nodes. The output is to be used while optimizing. If We know that the queries have no node in common, We do not attempt to exploit the similarity because it is not there.
3.2
The Shareability Problem
While searching for the sharable sub-expressions between queries, We are guided by the three cases in which the sub-expressions can be reused. If the sub-expressions are exactly similar, then We can be able to use any of the sub-expressions in all the instances where it exists. If however the case is not so, and one of the subexpressions is a super set of the rest, We then need to make a decision as to what sub-expression should be actually executed and what sub- expression is to be derived from the other. More to that, if We have cases where the sub-expressions are not subsets of equivalent nodes, We then have to look for another sub-expression to be created so that the rest are derived from it. For the interest of identification, once a pair of queries have any such sub-expressions, We say they are sharable. 58
Since the information as to whether or not the queries are sharable is to be used during the actual optimization, We need to establish a way of summarizing the outcome. We therefore introduce some terminologies to be used in this research.
(a) Query-Sharing Matrix For n - queries in a batch, a query sharing matrix M is an n×n matrix with integral entries in which the entry at M[i,j] shows the number of sub-expressions sharable between the ith and the j th query. (b) Query Popularity This is the number of instances in which the query plan sub-expressions are found to are sharable with other queries in the batch. It should be noted that popularity may be partially out of intra-query sharing opportunities. For a given query sharing matrix M, the popularity of the ith query is Σall k∈[1,n]M[k, i] = Σall (c) Order of a node
k∈[1,n]
M[i, k].
This is the number of (not necessarily distinct) relations, with at least a relational operation that make up a node. It should be noted that if there is no relational operation (a whole relation) then the order is zero.
Our aim therefore is to establish the extent to which sub-expressions of any two queries are sharable. This helps in coming up with a decision as to whether or not an attempt to look for sharable opportunities between any two queries should go on. 59
3.3
Previous Research
The need to know the number of times a sub-expression is used in a query batch is evidently of paramount importance. The research on multi-query optimization [1, 7, 16, 20, 21] employ it. However, none of them makes a deliberate attempt to establish it with some accuracy. The approach of the Volcano-SH optimizer [20] is to count the number of parents a node has and it is taken as an under estimate for the number of times a subexpression is used. The parents exist in the parent DAG of the query plan therefore inter-query sharing is not attempted at all. More to that, the under estimate is not accurate. In some cases, it is actually greater than the actual number of times a node is used. For example, if We consider a DAG in which node a is used twice to produce node b which is used thrice to produce node c (c has two parents and c together with the parents are all used once). The number of times a is used is 6. Using the under estimate of Roy et al [20], We come up with 4. Since the under estimate is less than the actual value, it does not pause any decision contradiction. However, assuming all nodes were used once, then the number of times a is used is 1 yet the under estimate remains 4. This is dangerous since the node under estimate may permit materialization yet the actual value refuses it. In this case for example, since a appears only once, the decision to materialize it should not come up at all. More to that, this approximation uses a single DAG . It does not consider cases of inter-query sharing. 60
In the Volcano-RU [20], a node is chosen for materialization if it would cause savings if reused once. This implies that all nodes that can cause a saving are materialized. But it is also true that not all nodes that would cause savings when reused once actually exist more than once. Therefore, some nodes are chosen for materialization yet they do not exist multiple times. Further more, a node may exist say four times and it causes savings on materialization yet it would not cause savings if it existed twice. This implies that on top of the Volcano-RU choosing nodes that appear once for materialization, it may leave out nodes that are able to cause savings only because a more accurate approximation of its frequency is not known. Based on these weaknesses and the importance a more accurate approximation of the frequency a node is to be used, We establish the inter-query sharing extent before We proceed to optimize.
3.4
The Greedy Search Algorithm
In the greedy search algorithm, We compare two query plans at ago and for each pair of nodes, We establish whether or not they are sharable. If they are sharable, the appropriate entry in the query sharing matrix is incremented. Let us consider a situation where We have a batch of n queries Q1, Q2 , · · · Qn . For any query Qi , a set of nodes in the Volcano optimal plan can be got by the method proposed by Graefe and McKenna [7]. The input, like in the Volcano-SH [20] is a set of volcano
61
best plans. Let Si be the set of equivalence nodes in the plan of Qi . The greedy algorithm checks nodes pairwise and establishes whether sharing can be possible. If it is possible for any queries Qi and Qj , the query popularities and M[i,j] are incremented. This is done until all sets are exhausted. The Proposed Gready Search Algorithm: for (i=1; i<=n; i++) S = set of nodes in the ith plan for(j=1; j<=n; j++) P= set of nodes in the jth plan while(S still has nodes) a = next node in S while(P still has nodes) b = next node in P if(Sharable(a,b)) increment M[i,j] endif endwhile endwhile endfor endfor
Though this algorithm is simple, it is too greedy to be optimal/feasible. This is because it compares indiscriminately and therefore does too many comparisons. Assuming a query Qi has ki nodes in its volcano best plan and We consider the first node for the Q1 plan, it will make k1 − 1 comparisons while making intraquery comparisons, k2 comparisons while comparing with Q2, and so on up to kn comparisons for query Qn . This implies that this node alone undergoes Σall i ki − 1 comparisons while the whole Q1 plan will undergo k1 (Σalli ki − 1) comparisons. The whole batch will need Σall i ki × (Σall i ki − 1) comparisons which is too much. 62
We therefore propose improvements to the greedy algorithm so as to have a lessexpensive search both in processor resources and time.
3.5
Improvements to the Greedy Algorithm
Though the Greedy algorithm will be able to get all the sharable nodes, We need to make some improvements on it so as to minimize the search cost. In fact, the search cost is reduced without interfering with the output. This is done by adding some intelligence into the search algorithm so that some decisions can be deduced. The following observations can be noted and improvements to address them can be made without substantially affecting the cost-effectiveness of the scheme.
a. Elimination of duplicate searches The output of the search between Qi and Qj is equivalent to that between Qj and Qi since the query shareability is independent of the order in which the queries are checked. We therefore need to make sure that when the comparisons are made, such a pair should never be compared again. To ensure that, We compare Qi and Qj if and only if i ≤ j. b. Search by node order From the conditions that have to be satisfied before query sharing takes place, sharing can take place between nodes of different queries but strictly of the 63
same order. We therefore need to group nodes by order and for a node, We search for shareability within the same order. This reduces the sample space for each node when the search for the sharable sub-expressions is taking place. c. Null sharing prediction Moving up the order makes the nodes more specific. If for a pair of queries there is no sharing for nodes of order m, then there is no sharing for nodes of order n where n > m. We therefore terminate higher nodes’ search for the query pair immediately We finish nodes of a certain order and We get no sharing opportunity. d. Zero order tip Relations (zero order node) are too broad to be considered for sharing. We only use them to find if any two queries are disjoint or not. If We get a sharing opportunity at this order, We do not update M and go to the next order (order one). If however there are no zero order nodes sharable, then the queries are disjoint. We need not continue searching for shareability in higher orders.
64
3.6
The Improved Greedy Searching Algorithm
We now present the enhanced greedy algorithm such that the observations above are put into consideration in order to have a more optimal search. It outputs the sharing matrix. for i =1; i<=n; i++ for j =i; j<=n; j++ ORDER = 0 repeat nextOrderpermision = false; Si = set of nodes of order ORDER in Query i node1 = starting node in Si Sj = set of nodes of order ORDER in query j node2 = starting node in Sj while(Si still has nodes) while(Sj still has nodes) if(node1 and node2 are sharable) nextOrderPermision = true if(ORDER = 0) break out of the two while loops else increment M[i,j] and M[j,i] and mark node1 and node2 mark node1 and node2 endif endif node2 = next(node2) endwhile node1 = next(node1) endwhile ORDER = next(ORDER) until(nextOrderPermision = false or orders are exhausted) endfor endfor
65
The algorithm traverses the pseudo-rooted DAG and gives a summary of the extent of sharing between any two queries. Using this information, We can deduce the popularity. The matrix acts as a guide to establish which queries have sub-expressions in common and which queries are disjoint. With such information, searching for sharable nodes in disjoint queries can be done away with hence targeting the search efforts on cases where there is a high chance of making cost savings. However, the search does not put into consideration the cases where multiple use of a parent results into multiple use of children. So much as the algorithm can tell which query shares nodes with which query, it cannot tell how many times a node will be used. The entries in M therefore represent the number of times a node or its sharing partners appear not the number of times it is used.
66
Chapter 4
Optimizing the Traversed Plans
4.1
Background
From the very principle of multi-query optimization, the optimizer has to look for the common sub-expressions or any such sub-expressions where cooperation at execution time can lead to cost savings. Tests are made to establish whether costs will really be saved. If the tests are promising, such a sub-expression or a group of sub-expressions is explored. From Chapter 3, We were able to traverse the query plans using a greedy but intelligent algorithm and the output was put in a query sharing matrix M. If We consider a general query sharing matrix M below:
67
M =
m11 m12 m13 m14 · · · m1n m21 m22 m23 m24 · · · m2n ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· mn1 mn2 mn3 mn4 · · · mnn
the entry M[i, j] = mij is an integer that shows how many sub-expressions (equivalence nodes in AND-OR DAGs) that are sharable between queries Qi and Qj . This is the same as the value of M[j, i]. If We sum up the entries in a column (or a row), We get the total number of instances in which nodes in the plan have sharable partners. This is called query popularity. All nodes that have partners are marked so that the optimizer can identify which nodes necessitate checking other plans while searching for common sub-expressions and which nodes do not. This helps in eliminating null searches hence a more efficient strategy. After identifying the extent of sharing among the queries, We can be able to tell which pair or pairs of queries have nothing in common. For such queries, We do not need to search for common subexpressions since they do not exist. Likewise, We can be able to tell for each query, which other queries in the batch share nodes with it and to what extent. When We start searching for common sub-expressions, We start with one query and search for the sharable sub-expressions with other queries in the DAG. The Query plan, which
68
is at the center of the searching process is called the focal node. Since the plans are already traversed at the searching stage, We have nodes which are marked (those with sharable sub-expressions else where) and those which are not. It is only on marked nodes that We search for other plans for sharable nodes. It should be noted that a node being marked does not mean that its partners will be searched for in all plans. Since the query sharing matrix has the summary of what query shares with what, only those with non-zero entries in the sharing matrix are searched for a column/ row representing the extent of sharing between the focal plan and such a plan are checked. The algorithm therefore searches only in optimistic cases. In this chapter, We use the information in the query sharing matrix to guide us so as to exploit the sharing opportunities among the queries that make up a batch.
4.2
Related Work
Looking for common sub-expressions in a query batch has been done in most multiquery optimization algorithms but following different approaches and principles. The algorithms [7, 20] however seem not to make enough preparations to exploit them to the full.
69
4.2.1
The Basic Volcano Algorithm
This was proposed by Graefe and McKenna [7] as a reaction to the previously proposed Exodus Optimizer [6]. It uses DAG as a representation of the query plans. It has a problem of extensibility since AND-OR DAGs are easier to extend than the DAGs [21]. The Basic Volcano algorithm materializes all nodes that appear more than once. This brings in a problem that not all nodes that appear more than once cause savings when materialized. As observed in [20], for some nodes, it is cheaper to recompute than to materialize and reuse them. This is because materialization involves writing and reading to disk which is costly. The Basic Volcano lays a foundation for costeffective reuse but:-
i It does not establish the cost effectiveness of the node - candidate to materialization before choosing it for materialization. ii Its search is exhaustive therefore the optimizer incurs a high search cost which brings a negative impact on the overall cost effectiveness of the query processor.
The Basic Volcano algorithm therefore incurs a lot of cots in searching for the sharable sub-expressions which may render it inefficient especially for large (and therefore) complex queries.
70
4.2.2
The Volcano-SH
The Volcano-SH [20] is an extension of the Basic Volcano algorithm in that it uses the Basic Volcano optimal plans as an input. The volcano SH computes the cost of each node and decides whether or not it is cost effective to materialize it. This is done by considering a scenario of materialization and reuse against re computation. If for example We have an equivalence node e with the following characteristics:
Number of times it is to be used = numuses(e) Cost of computing the node = cost(e) Cost of materializing the node = matcost(e) and Cost of reusing the node =reusecost(e).
A decision has to be made whether to materialize and reuse the node or to recompute the node whenever it is needed. If all the nodes are computed from the database, the cost incurred would be cost(e) × numuses(e) and if the node was computed once, then materialized so that for subsequent times it is just reused, the cost incurred would be cost(e) + matcost(e) + reusecost(e) × (numuses(e) − 1)
71
. Materialization is cost effective if cost(e)+matcost(e)+reusecost(e)×(numuses(e)−1) < cost(e)×numuses(e) (4.1)
or more simply reusecost(e) + matcost(e) < cost(e)[20] (numuses(e) − 1) (4.2)
. Volcano SH therefore selects whether or not to materialize depending on the cost effectiveness of the scheme. The volcano-SH traverses the DAG from the leaves towards the root. Since the cost of a node is computed from the children (leaves), the cost of a node can be accurately established as the algorithm traverses the DAG. The number of times a node is used however depends on the materialization status of the parents. Since a node is reached before the parents are reached, it can not be easily established. Roy et al [20] uses an under estimate numuses−(e) which is got by counting the number of parents of a node. The condition for materialization is therefore modified to reusecost(e) + . It is advantageous that it eliminates blind materialization hence saving more resources. It however has some shortcomings that need to be addressed. The shortcomings are:matcost(e) < cost(e) (numuses−(e) − 1) (4.3)
72
i Inter-query Sharing elimination Its estimate of the frequency a node appears puts no consideration of other plans in the pseudo rooted DAG yet the inter-query sharing makes a lot of savings on resources. ii Accuracy of Estimates Its estimate is inaccurate. In fact in some cases the under estimate is higher than the actual value which may lead to wrong decisions. iii Order of Processing It does not attempt to exploit the inter-query extents of similarities. This makes it unable to decide on the optimal order in which the queries should be processed. iv DAG Trimming It does not trim already catered for nodes hence the search works on a fixed sample space leading to non worthwhile search efforts.
4.2.3
The Volcano-RU
Unlike the volcano-SH [20], the volcano-RU [20] does not take in the Basic Volcano outputs and neither does it attempt to establish the number of times a node is used. It optimizes one query at ago and any node, (whether it is on the optimal plan or not) that would cause savings if reused once is chosen for materialization. 73
The subsequent queries are optimized putting into consideration the fact that some nodes are already materialized. Its strength lies in exploiting shareability beyond the Basic Volcano optimal plans. Given its approach, the order of optimization is of paramount importance since the node to be materialized depends on which query has been optimized so far. It however has the following weaknesses:-
i Order of Processing It does not go into details of establishing the exact optimal order of optimization. Roy et al [20] propose that after optimizing in a specific order, We optimize in the reverse order and the cheaper option is chosen. However, since the first order was arbitrary, a more optimal order is very likely to exist. Attempting to randomly choosing other orders while searching for the optimal order leads to time wasting [20]. ii Over Materialization It also has a problem of excessive materialization since not all nodes that would cause saving if reused twice actually exist more than once. The excessive materialization leads to further costs incurred at materialization hence a more costly query processor. iii Inaccurate Materialization Criteria In a DAG, some nodes appear more than twice and cause savings yet they 74
would not make the savings if they appeared twice. Such nodes are left out since the criteria only consider those which would cause savings when reused once. The materialization therefore may be insufficient.
4.3
4.3.1
The Proposed Optimizing Algorithm
The Background
In designing this algorithm, We attempt to address the weaknesses identified in the Basic Volcano, Volcano SH and Volcano RU. It is against these addressed weaknesses, coupled with the strengths already exhibited by the algorithms that We lay a foundation for the new algorithm.
a. Optimization order The Volcano-RU [20] acknowledges the use of the order of optimization since by its philosophy, the content of the materialized set is of high importance. Though this algorithm unlike the Volcano-RU is to use the Basic Volcano optimal plans as the inputs, it uses order in another way. We identify the plan with the highest popularity and We establish its details (node costs and frequency). The common sub-expressions are searched for in the rest of the queries following the stable marriage preference list of the focal plan. If according to the sharing matrix M, a certain plan is not having any node to share with the 75
focal plan, We do not search it. b. Sample space management If We are to reduce the sample space without affecting the output and optimal plan, We minimize the optimization cost in the algorithm. If We identify more than one node to be sharable, a decision is made to decide which of the nodes will remain in the plan and which (the rest) is/ are to be removed from the plan. Action is done such that then plans are supplied with the output of the retained node so that the output of the removed nodes is compensated. In an AND-OR DAG, the equivalence node will have multiple parents. This reduces the sample space for the subsequent searches without affecting the output of the query batch. It also saves the cost of recomputation. c. Estimate of numuses(e) When a focal plan is identified, the costs and number of times a node is used in the plan, count are established. If a node is marked, then it means that it has sharable partners in other plans. The sharable nodes are searched for in the plans that share at least a node with the focal plan. Each time a sharable node is found, the count is incremented by one and its assured to be of the same cost as the node in the focal plans. The test for materialization used in the Volcano SH is then applied using the new estimated cost and the node count in the pseudo rooted plan.
76
d. Direction of Optimization In this algorithm, We choose nodes to materialize from the high order nodes downwards. This is used in such a way that if say two nodes are sharable and two have to be removed, this reduces the sample space of subsequent searches and saves the cost of computing already catered for nodes. e. Elimination of Repetitive search We already established that it is not worthwhile to search between Qi and Qj and then search Qj and Qi as this is a repetition. Since We search plans in the decreasing order of their popularity, any plan is searched if the focal plan has a higher popularity, this eliminates the duplicates of making searches which are already catered for in previous searches. f. Disjoint plans identification If a plan has zero popularity then it will not participate in inter-query search since it is unique. This helps saving resources that would be spent searching for non-existent partner nodes.
4.3.2
The Optimizing Algorithm
The new algorithm inputs the DAG made up of the Basic Volcano plans for each query. The inter-query shareability information and individual query popularities are got from the query sharing matrix M. In this algorithm plans are assigned focal
77
roles in the order of their decreasing popularity. Searching for sharable partners for any marked node in a focal plan is done in the stable marriage preference order of less popular plans. Searching starts from higher order nodes and plans of Zero popularity are not assigned focal roles.
78
S = set of plans that make up the virtual DAG in a decreasing order of their popularity focalPlan = first plan in S repeat establish node cost and numuses for each node for the focal plan S* = subset of S who share at least a node with focalPlan query candidateOrder = highest order of focalPlan repeat e = node in candidateOrder if( e is marked) repeat traverse S* searching for marked equivalent nodes of the same order with and sharable with e increment numuses(e) whenever a sharable node is met until(S* is exhausted) if(sharable condition(e)) chose which node to be materialized and add it to the materializable node remove the rest update the DAG for the materialized node to cater for the removed node’s parents unmark the chosen node endif else if(sharablecondition(e)) add e to materialization node endif endif until(nodes in candidateOrder are exhausted) focalPlan = next(focalPlan) until(plans on non zero popularity are exhausted)
4.3.3
Benefits of the new Algorithm
1. Better estimation of numuses−(e) The Algorithm uses the exact number of times the node exists in the focal plan
79
and increments it by one whenever a sharable node is found outside the focal plan. The nodes may actually be greater if the non focal plan nodes are used multiple times in their parent plans. Therefore numuses−(e) ≤ numuses(e) ∀e . 2. Elimination of null searches: The sharing matrix has details of the extent of sharing for any pair of plans. If the entry for a pair is zero, then We need not to search for shareability between them. If the popularity of a certain query is zero, then its plan is not involved in the search. 3. DAG trimming If We have say three nodes of order five and they are sharable, and it is decided that one has to be materialized and the rest of the plans use it, then it is not worthwhile to process the other nodes since the ultimate goal is to get the root of the tree. This algorithm removes the sub DAGs whose output can be got from common/ sharable nodes so that such children do not enter the optimization process since their output is catered for. 4. Optimal order of optimization Since the strategy eliminates catered for sub-DAGs, it is better if it does it as early as possible so that the subsequent search space is reduced without affecting the outcome. Starting with the most popular query does this. This saves time, memory and processor cycles.
80
Chapter 5
Discussion and Future Work
5.1
Discussion
In this dissertation, we studied the strategies and approaches to query optimization. The duty of the query optimizer is to establish the most cost effective execution plan of a query. This has to be done with in several limitations and notable among the limitations are:-
(a) Time. The process should be timely. Exhaustive searches that take a lot of time have to either be eliminated or improved to make intelligent deductions so that they have a low runtime.
81
(b) Form and Content. The transformation should have no effect whatsoever on the Form and content of the request made by the user. (c) Net Savings. The optimization process should not be a mere transfer of resources from executing a complex form of a query to searching for a cheap form of the query and executing it. It has to make substantial savings.
We Therefore examined existing approaches, studied how they order queries for optimization, how they optimize and how they exploit the geometry of query plans representation (Trees, DAGs and AND-OR DAGs) to make the scheme more cost effective. We Then Proposed a greedy search algorithm that traverses the composite plan in search for common sub-expressions. We used the geometry of the plan representation (AND-OR DAG) to propose improvements on the greedy algorithm so that the searching is more intelligent and therefore searches equally exhaustively but with fewer operations and in less time. We sumerised the sharability in a query sharing matrix. We also proposed an optimization algorithm that exploits the sharability extents in an optimal order to minimize null searches for common sub-expressions. This way we reduce run time and increase efficiency. we also propose trimming of catered for 82
sub DAGs so as to eliminate catered for nodes and reduce the sample space of subsequent searches. Lastly, We make comparisons of the proposed optimization schemes with the existing ones.
5.2
Future Work
This Research has implied some future research work in the field of Query Optimization
(a) Maintenance of the Sharing matrix. The Sharing matrix is used to get the order of assigning focal roles among the plans during optimization and for each focal plan, we get the order in which the plan should exploit sharability. We however trim plans whenever we get nodes catered for due to sharability. This trimming takes away all the children of the catered for node. Some of these children may be sharable else where hence removing them leads to an exaggerated sharing matrix. There is need for Research on how the matrix can be cost-effectively updated whenever we trim the composite DAG. (b) Incorporating Pipelining. Multi-Query Optimization is materialization intensive and the current sched83
ules of materialization use DAGs not AND-OR DAGs. Besides, they are too strict on the qualifications for pipelining. AND-OR DAG can use the equivalence nodes as staging areas for intermediate results and hence can work on less strict schedule. This however may lead to filling of the memory. There is therefore a need for incorporating pipelining and memory management in AND-OR DAG structured plans so that materialization costs are reduced. (c) Numerical Evaluation. The evaluation done on the proposed algorithms in this dissertation is not out of coding but rather our of comparison at the algorithm level. There is need to code the proposed algorithms and run them with the existing ones (such as Basic Volcano, Volcano-RU, Volcano-SH) on similar data and similar hardware. This will give the runtime comparison with the existing schemes.
84
References
[1] Chaudhuri, S. and Kyuseok, S. (1999). Optimization of queries with user defined predicates. ACM Transactions on Database Systems, vol.24 No.2, pages 177-228. [2] Connoly, T. and Begg, C. (2001). Database Systems: A practical Approach to design, Implimentation and Management, Third Edition. Edison and Wesley. [3] Cosar, A. Lim, E. and Jaideep, S. (2001). Multiple query Optimization with depth-first branch and bond and dynamic query ordering. International Conference on Information and Knowledge Management. [4] Elmasri, R. and Navathe, B.S. (1994). Fundamentals of database systems , Second Edition. Benjamin Cummings. [5] Fegaras, L. (1998) A new Heuristic for Optimising Large Queries Re-
search Paper, Department of Computer Science and Engineering . The University of Texas at Arlington. 85
[6] Graefe, G. and DeWitt, D.J. (1987). The EXODUS Optimizer Generator ACM SIGMOD Records, Volume 16, Issue 3 pages 160-172 [7] Graefe, G. and McKenna,W.J. (1991).Extensibility and Search efficiency in the Volcano Optimiser generator. Technical report CU-CS-91-563. University of Colorado [8] Gupta, A. Sudarshan, S. Viswanathan, S.(2001) Query Scheduling in mutiquery optimization Research paper, Indian Institute of Technology - Bombay. [9] Horowitz, E. Sahni, S. and Rajasekaran, S. (1996). Computer Algolithms C++ , Computer Science Press. [10] Hurigeri, S. Seshadri, S. and Sudarshan, S. (2001). Memory Cognizant Query Optimization. Research paper, Indian Institute of Technology - Bombay. [11] Johnson, J.L.(1997). Database: Models, Languages, Design, Oxford University Press. [12] Kremer, M. and Gryz, J. (1999). A Survey of Query Optimization in Parallel Database. Technical Report CS-1999-04, Department of Computer Science, York University. [13] Kroger, J. Stefan, P. and Heuer, A. (2001). Query optimisation. On the ordering of Rules. Research paper, cost - and rule based optimisation of object 86
- oriented queries (CROQUE) Project. University of Rostock and University of Hamburg, Germany. [14] Kyuseok, S. Sellis, T.K and Nau, D. (1994). Improvements on a Heuristic algorithm for Multiple-query Optimization. Technical report, University of Maryland, Department of Computer Science. [15] Kyuseok, S. (1993). Advanced query optimization techniques for relational database systems. PhD dissertation, University of Maryland. [16] Nilesh, N.V. Sumit, K.S. Roy,P. and Sudarshan, S. (2001). Pipelining in multi-query optimisation. Research paper, Indian Institute of Technology. Bombay. [17] Park, P. and Segar, A. (1988). Using common sub-expressions to optimise multiple queries. Proceedings of the IEEE International Conference on Data Engineering. [18] Ramakrishnan, R. and Gehrke, J. (2000). Database Management Systems Third Edition. McGraw Hill. [19] Rao, J. and Ross, R.K. (2000) Power Pipelining for Enhanced Query Performance . Technical Report CUCS-007-00, Columbia University. [20] Roy, P. Seshadri, S. Sudarshan, S. and Bhobe, S. (2001). Efficient and extensible algorithms for Multi query optimisation. Researh Paper, SIGMOD International Conference on management of data. 87
[21] Roy, P. Seshadri,S. Sudarshan,S. and Bhobe,S (1998) Practical Algolithms for multi query Optimisation. Technical report, Indian institute of Technology, Bombay. [22] Sallis, L. (1998). A new Heuristic for optimising large queries (1998) Research Paper, Department of Computer Science and Engineering . The University of Texas at Arlington. [23] Sellis,T.K. and Gosh, S. (1990). On Multi-query Optimization Problem. IEEE Transactions on Knowledge and Data Engineering pp. 262-266. [24] Urhan, T. and Franklin, M.J. (2001). Dynamic Pipelining Scheduling for improving interactive query performance. In proceedings of the 27th Very Large Databases Conference. Roma - Italy.
88
Apendix A: Paper
89