					                                                       (IJCSIS) International Journal of Computer Science and Information Security,
                                                       Vol. 9, No. 10, October 2011




Framework for Query Optimization

Pawan Meena
Department of Computer Science and Engineering
Patel college of science & Technology
Bhopal, M.P, INDIA
pawanmeena75@yahoo.com

Arun Jhapate
Department of Computer Science and Engineering
Patel college of science & Technology
Bhopal, M.P, INDIA
Arun_jhapate@yahoo.com

Parmalik kumar
Department of Computer Science and Engineering
Patel college of science & Technology
Bhopal, M.P, INDIA
Parmalik83@gmail.com


ABSTRACT

Modern database systems use a query optimizer to identify the most efficient strategy, called a “plan”, to execute declarative SQL queries. Optimization is much more than transformations and query equivalence. The infrastructure for optimization is significant, and designing effective and correct SQL transformations is hard. Optimization is a mandatory exercise, since the difference between the cost of the best plan and that of a random choice can be orders of magnitude. The role of the query optimizer is especially critical for the decision-support queries featured in data warehousing and data mining applications. This paper presents an abstraction of the architecture of a query optimizer and focuses on the techniques currently used by most commercial systems for its various modules. In addition, it provides the technical constraints of advanced issues in query optimization.

Keywords

Query optimizer, Operator tree, Query analyzer, Query optimization

1. Introduction

The growing success of relational database technology in the treatment of data is due in part to the availability of non-procedural languages, which significantly improve application development and user productivity. By hiding the low-level details about the physical organization of the data, relational database languages allow the expression of complex queries in a concise and simple fashion. In particular, the user does not specify the exact procedure for building the answer to a query. This procedure is in fact designed by a DBMS module known as the query processor. This relieves the user of query optimization, a tedious task that is best left to the query processor. Modern databases can provide tools for the effective treatment of large amounts of complex scientific data involving the application of specific analyses [1, 2]. Scientific analyses can be specified as high-level queries with user-defined functions (UDFs) in an extensible DBMS. Query optimization provides scalability and high performance without the need for researchers to spend time on low-level programming. Moreover, because the queries are easily specified and changed, new theories, implemented for example as filters, can be tested quickly.

Queries about events are complex because the cuts are complex, with many predicates applied to the properties of each event. The conditions of such a query involve selections, arithmetic operators, aggregates, UDFs, and joins. The aggregates compute complex derived event properties. For example, a complex query is to look for events producing Higgs bosons [1, 3] by applying scientific theories expressed as cuts. These complex queries need to be optimized for efficient and scalable execution. However, the optimization of complex queries is a challenge because:

• The queries contain many joins.
• The size of the queries makes optimization slow.
• The cut definitions contain many more or less complex aggregates.
• The filters defining the cuts use many numerical UDFs.
• There are dependencies between event properties that are difficult to find or model.
• The UDFs cause dependencies between query variables.

Figure 1: Query Optimizer




                                                                   102                                http://sites.google.com/site/ijcsis/
                                                                                                      ISSN 1947-5500




Relational query languages provide a high-level "declarative" interface to access data stored in relational databases. Over time, SQL [1,4] has emerged as the standard for relational query languages. Two key components of the query evaluation component of an SQL database system are the query optimizer and the query execution engine. The query execution engine implements a set of physical operators. An operator takes as input one or more data streams and produces an output data stream. Examples of physical operators are (external) sort, sequential scan, index scan, nested-loop join and sort-merge join. We refer to such operators as physical operators since they are not necessarily tied one-to-one to the relational operators. The easiest way to think of physical operators is as pieces of code that are used as building blocks to enable the execution of SQL queries. An abstract representation of such an execution is a physical operator tree, as shown in Figure 2. The edges in an operator tree represent the flow of data between the physical operators.

                      Index Nested Loop
                        (P.z = R.z)
                       /           \
              Merge_Join           Index Scan R
              (P.z = Q.z)
             /           \
     Merge_Join         Merge_Join
     (P.z = Q.z)        (P.z = Q.z)
          |                  |
     Table Scan P       Table Scan Q

        Figure 2: Physical Operator Tree

We use the terms physical operator tree and execution plan (or simply plan) interchangeably. The execution engine is responsible for executing the plan to generate answers to the query. Therefore, the capabilities of the query execution engine determine the structure of the operator trees that are feasible. We refer the reader to [5] for an overview of query evaluation techniques. The query optimizer is responsible for producing the input for the execution engine. It takes a parsed representation of an SQL query as input and is responsible for producing an efficient execution plan for the given SQL query from the space of possible execution plans. The task of an optimizer is nontrivial since, for a given SQL query, there may be many possible operator trees:

• The algebraic representation of the given query can be transformed into many other logically equivalent algebraic representations, for example:

     Join (Join (P, Q), R) = Join (Join (Q, R), P)

• For a given algebraic representation, there can be many operator trees that implement the algebraic expression; for example, a database system generally supports several join algorithms.

In addition, the throughput or the response time for the execution of these plans may differ widely. Therefore, the optimizer's choice of execution plan is crucial. Query optimization can thus be regarded as a difficult search problem. To solve this problem, we need:

• A space of plans (search space).
• A cost estimation technique so that a cost may be assigned to each plan in the search space. Intuitively, this is an estimation of the resources needed for the execution of the plan.
• An enumeration algorithm that can search through the execution space.

A desirable optimizer is one where the search space includes plans that have low cost, the costing technique is accurate, and the enumeration algorithm is efficient. Each of these tasks is nontrivial, and that is why building a good optimizer is a huge undertaking.

                    Query Analyzer
                          |
                    Query Optimizer
                          |
              Code Generator / Interpreter
                          |
                    Query Processor

        Figure 3: Query traverses through DBMS
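The three ingredients of the search problem (a space of plans, a cost estimate, and an enumeration algorithm) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the row counts and selectivities are made-up numbers, the search space is restricted to left-deep join orders over the relations P, Q and R, the cost is simply the sum of the estimated result sizes of the joins, and the enumeration is exhaustive. It is not the method of any particular system.

```python
from itertools import permutations

# Hypothetical catalog statistics: row counts per relation (made-up numbers).
ROWS = {"P": 1000, "Q": 100, "R": 10}

# Hypothetical selectivity of each join predicate, keyed by the pair of relations.
SELECTIVITY = {
    frozenset("PQ"): 0.01,
    frozenset("QR"): 0.1,
    frozenset("PR"): 0.05,
}


def join_selectivity(joined, new):
    """Combined selectivity of joining `new` to the relations already joined."""
    sel = 1.0
    for rel in joined:
        sel *= SELECTIVITY[frozenset((rel, new))]
    return sel


def estimate(order):
    """Cost estimation: return (cost, output_rows) for one left-deep join order.

    The cost is the sum of the estimated result sizes of all joins,
    a deliberately crude stand-in for a real cost model.
    """
    joined = [order[0]]
    rows = ROWS[order[0]]
    cost = 0.0
    for rel in order[1:]:
        rows = rows * ROWS[rel] * join_selectivity(joined, rel)
        cost += rows
        joined.append(rel)
    return cost, rows


def best_plan(relations):
    """Enumeration: try every left-deep order and keep the cheapest one."""
    return min(permutations(relations), key=lambda order: estimate(order)[0])


if __name__ == "__main__":
    order = best_plan(ROWS)
    print(order, estimate(order))
```

With these made-up statistics every order yields the same estimated final result size, as the equivalence Join (Join (P, Q), R) = Join (Join (Q, R), P) requires, but joining Q and R first produces the smallest intermediate result and is therefore the cheapest. Exhaustive enumeration grows factorially with the number of relations, which is why the enumeration algorithm must be efficient.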


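The view of physical operators as building blocks that consume data streams and produce a data stream can be sketched with Python generators. This is a minimal illustration under stated assumptions, not the implementation of any DBMS: the function names, the sample rows, and the dictionary standing in for an index on R are all hypothetical. The sketch also inserts sort steps below the merge join, since a merge join needs inputs sorted on the join column; that choice is an assumption of the sketch rather than a reading of Figure 2.

```python
def table_scan(rows):
    """Leaf operator: stream the rows of a stored table one at a time."""
    yield from rows


def sort_by(stream, key):
    """Blocking operator: consume the whole input, emit it ordered by `key`."""
    yield from sorted(stream, key=lambda row: row[key])


def merge_join(left, right, key):
    """Join two streams that are already sorted on `key`."""
    left, right = list(left), list(right)  # materialised for brevity
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows that share this key value.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row with that key against the whole run.
            while i < len(left) and left[i][key] == lk:
                for match in right[j:j_end]:
                    yield {**left[i], **match}
                i += 1
            j = j_end


def index_nested_loop_join(outer, index, key):
    """For each outer row, probe an index on the inner relation."""
    for row in outer:
        for match in index.get(row[key], []):
            yield {**row, **match}


# Hypothetical sample data; R is represented only by an index on column z.
P = [{"z": 2, "p": "p2"}, {"z": 1, "p": "p1"}, {"z": 3, "p": "p3"}]
Q = [{"z": 3, "q": "q3"}, {"z": 1, "q": "q1"}]
R_INDEX = {
    1: [{"z": 1, "r": "r1"}],
    3: [{"z": 3, "r": "r3"}, {"z": 3, "r": "r3b"}],
}

# The plan is a tree of operators; data flows from the leaves to the root.
plan = index_nested_loop_join(
    merge_join(sort_by(table_scan(P), "z"), sort_by(table_scan(Q), "z"), "z"),
    R_INDEX,
    "z",
)
result = list(plan)
```

Each function call in the nested expression corresponds to one node of an operator tree, and each argument that is itself an operator corresponds to an edge along which rows flow.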






The path that a query traverses through a DBMS until its answer is generated is shown in Figure 3. The system modules through which it moves have the following functions. The Query Analyzer checks the validity of the query and then translates it into an internal form, usually a relational calculus expression or something similar. The Query Optimizer considers all algebraic expressions that are equivalent to the given query and chooses the one that is estimated to be the least expensive. The Code Generator or the Interpreter transforms the plan generated by the optimizer into calls to the query processor.

2. Query Optimization Architecture

In this section, we provide an abstraction of the query optimization process in a DBMS. Given a database and a query on it, several execution plans exist that can be employed to answer the query. In principle, all the alternatives need to be considered so that the one with the best estimated performance is chosen. An abstraction of the process of generating and testing these alternatives is shown in Figure 4, which is essentially a modular architecture of a query optimizer. Although one could build an optimizer based on this architecture, in real systems the modules shown do not always have such clear-cut boundaries as in Figure 4. Based on Figure 4, the entire query optimization process can be seen as having two stages: rewriting and planning [6]. There is only one module in the first stage, the Rewriter, whereas all other modules are in the second stage. The functionality of each of the modules in Figure 4 is analyzed below.

Figure 4: Query optimizer architecture

Rewriter: This module applies transformations to a given query and produces equivalent queries that are hopefully more efficient, for example, replacement of views with their definition, flattening out of nested queries, etc. The transformations performed by the Rewriter depend only on the declarative, that is, static, characteristics of queries and do not take into account the actual query costs for the specific DBMS and database concerned. If the rewriting is known or assumed to always be beneficial, the original query is discarded; otherwise, it is sent to the next stage as well. By the nature of the rewriting transformations, this stage operates at the declarative level [6].

Planner: This is the main module of the planning stage. It examines all possible execution plans for each query produced in the previous stage and selects the overall cheapest one to be used to generate the answer to the original query. It employs a search strategy that examines the space of execution plans in a particular fashion. This space is determined by two other modules of the optimizer, the Statistical Space and the Structural Space. For the most part, these two modules and the search strategy determine the cost, i.e., the running time, of the optimizer itself, which should be as low as possible. The execution plans examined by the Planner are compared in terms of their cost estimates so that the cheapest may be chosen. These costs are calculated by the last two modules of the optimizer, the Cost Model and the Size-Distribution Estimator.

Statistical Space: This module determines the action execution orders that are to be considered by the Planner for each query sent to it. All such series of actions produce the same query answer, but usually differ in performance. They are usually represented in relational algebra as formulas or in tree form. Because of the algorithmic nature of the objects generated by this module and sent to the Planner, the overall planning stage is characterized as operating at the procedural level.

Structural Space: This module determines the implementation choices that exist for the execution of each ordered series of actions specified by the Statistical Space. This choice is related to the join methods available for each join (e.g., nested loops, merge scan, and hash join), to whether supporting data structures are built on the fly, to if and when duplicates are eliminated, and to other implementation characteristics of this kind, which are predetermined by the DBMS implementation. This choice is also related to the indices available for accessing each relation, which are determined by the physical schema of each database as stored in its catalog entries. Given a formula or tree from the Statistical Space, this module produces all corresponding complete execution plans, which specify the implementation of each algebraic operator and the use of any indices [6].

Cost Model: This module specifies the mathematical formulas that are used to approximate the cost of execution plans. For every different join method, for every different index type access, and in general for every different kind of step that can be found in an execution plan, there is a formula that gives its cost. Given the complexity of many of these steps, most of these formulas are simple approximations of what the system actually does and are based on certain assumptions regarding issues like buffer




management, disk-CPU overlap, sequential vs. random I/O, etc. The most important input parameters to a formula are the size of the buffer pool used by the corresponding step, the sizes of relations or indices accessed, and possibly various distributions of values in these relations. While the first one is determined by the DBMS for each query, the other two are estimated by the Size-Allocation Estimator.

Size-Allocation Estimator: This module specifies how the sizes (and possibly frequency distributions of attribute values) of database relations and indices as well as (sub)query results are estimated. As mentioned above, these estimates are needed by the Cost Model. The specific estimation approach adopted in this module also determines the form of statistics that need to be maintained in the catalogs of each database, if any [6].

3. Advanced Types of Optimization
In this section, we attempt to provide a concise overview of advanced types of optimization that researchers have proposed over the past few years. The descriptions are based on examples only; further details may be found in the references provided. Furthermore, there are several issues that are not discussed at all due to lack of space, although much interesting work has been done on them, e.g., nested query optimization, rule-based query optimization, query optimizer generators, object-oriented query optimization, optimization with materialized views, heterogeneous query optimization, recursive query optimization, aggregate query optimization, optimization with expensive selection predicates, and query optimizer validation. Before presenting specific techniques, consider the following simple relations: EMP(empid, salary, job, department, dno) and DEPT(dno, budget).

Semantic Query Optimization

Semantic query optimization is a form of optimization mostly related to the Rewriter module. The basic idea lies in using integrity constraints defined in the database to rewrite a given query into semantically equivalent ones [7]. These can then be optimized by the Planner as regular queries and the most efficient plan among all can be used to answer the original query. As a simple example, using a hypothetical SQL-like syntax, consider the following integrity constraint:

assert sal-constraint on emp:
salary>200K where job = "Assistant professor"

In addition, consider the following query:

select empid, subject
from emp, dept
where emp.dno = dept.dno and job = "Assistant professor".

Using the above integrity constraint, the query can be rewritten into a semantically equivalent one to include a selection on salary:

select empid, subject
from emp, dept
where emp.dno = dept.dno and job = "Assistant professor"
and salary>200K.

Having the extra selection could help greatly in discovering a fast plan to answer the query if the only index in the database is a B+-tree on emp.salary. On the other hand, it would certainly be a waste if no such index exists. For such reasons, all proposals for semantic query optimization present various heuristics or rules on which rewritings have the potential of being beneficial and should be applied and which should not.

Global Query Optimization

So far, we have focused our attention on optimizing individual queries. Quite often, however, multiple queries become available for optimization at the same time, e.g., queries with unions, queries from multiple concurrent users, queries embedded in a single program, or queries in a deductive system. Instead of optimizing each query separately, one may be able to obtain a global plan that, although possibly suboptimal for each individual query, is optimal for the execution of all of them as a group. Several techniques have been proposed for global query optimization [8].

As a simple example of the problem of global optimization, consider the following two queries:

select empid, subject
from emp, dept
where emp.dno = dept.dno and job = "Assistant professor",

select empid
from emp, dept
where emp.dno = dept.dno and budget > 1M

Depending on the sizes of the emp and dept relations and the selectivities of the selections, it may well be that computing the entire join once and then applying separately the two selections to obtain the results of the two queries is more efficient than doing the join twice, each time taking into account the corresponding selection. Developing Planner modules that would examine all the available global plans and identify the optimal one is the goal of global/multiple query optimizers.

Parametric Query Optimization
As mentioned earlier, embedded queries are typically optimized once at compile time and are executed multiple times at run time. Because of this temporal separation between optimization and execution, the values of various parameters that are used during optimization may be very different during execution. This may make the chosen plan invalid (e.g., if indices used in the plan are no longer
available) or simply not optimal (e.g., if the number of available buffer pages or operator selectivities have changed, or if new indices have become available). To address this issue, several techniques [9,10,11] have been proposed that use various search strategies (e.g., randomized algorithms [10] or the strategy of Volcano [11]) to optimize queries as much as possible at compile time, taking into account all possible values that interesting parameters may have at run time. These techniques use the actual parameter values at run time, and simply pick the plan that was found optimal for them with little or no overhead. Of a drastically different flavor is the technique of Rdb/VMS [12], where, by dynamically monitoring how the probability distribution of plan costs changes, plan switching may actually occur during query execution.

Conclusion
To a large extent, the success of a DBMS lies in the quality, functionality, and sophistication of its query optimizer, since that determines much of the system's performance. In this paper, we have given a bird's eye view of query optimization. We have presented an abstraction of the architecture of a query optimizer and focused on the techniques currently used by most commercial systems for its various modules. In addition, we have provided a glimpse of advanced issues in query optimization, whose solutions have not yet found their way into practical systems, but could certainly do so in the future.

References
[1] J. Gray, D.T. Liu, M.A. Nieto-Santisteban, A. Szalay, D.J. DeWitt, and G. Heber, "Scientific data management in the coming decade", SIGMOD Record 34(4), pp. 34-41, 2005.
[2] Ruslan Fomkin and Tore Risch 1997 "Cost-based Optimization of Complex Scientific Queries", Department of Information Technology, Uppsala University
[3] C. Hansen, N. Gollub, K. Assamagan, and T. Ekelöf, "Discovery potential for a charged Higgs boson decaying in the chargino-neutralino channel of the ATLAS detector at the LHC", Eur.Phys.J. C44S2, pp. 1-9, 2005.
[4] Melton, J., Simon A. Understanding The New SQL: A Complete
[5] Graefe G. Query Evaluation Techniques for Large Databases. In ACM Computing Surveys: Vol 25, No 2., June 1993.
[6] Yannis E. Ioannidis, "Query optimization", Computer Sciences Department, University of Wisconsin, Madison, WI 53706
[7] J. J. King. QUIST: A system for semantic query optimization in relational databases. In Proc. of the 7th Int. VLDB Conference, pages 510-517, Cannes, France, August 1981.
[8] T. K. Sellis. Multiple query optimization. ACM-TODS, 13(1):23-52, March 1988.
[9] G. Graefe and K. Ward. Dynamic query evaluation plans. In Proc. ACM-SIGMOD Conference on the Management of Data, pages 358-366, Portland, OR, May 1989.
[10] Y. Ioannidis, R. Ng, K. Shim, and T. K. Sellis. Parametric query optimization. In Proc. 18th Int. VLDB Conference, pages 103-114, Vancouver, BC, August 1992.
[11] R. Cole and G. Graefe. Optimization of dynamic query evaluation plans. In Proc. ACM-SIGMOD Conference on the Management of Data, pages 150-160, Minneapolis, MN, June 1994.
[12] G. Antoshenkov. Dynamic query optimization in Rdb/VMS. In Proc. IEEE Int. Conference on Data Engineering, pages 538-547, Vienna, Austria, March 1993.
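
As a rough illustration of how the Cost Model and the Size-Allocation Estimator described earlier interact, the following Python sketch estimates a join result size and the page I/O cost of a block nested-loop join. The formulas are common textbook approximations (uniformly distributed values, page-based I/O cost), not the formulas of any particular DBMS, and every name in it (RelationStats, join_size, block_nested_loop_cost, the sample statistics) is a hypothetical choice made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class RelationStats:
    """Catalog statistics assumed to be kept per relation (hypothetical)."""
    tuples: int      # number of tuples in the relation
    pages: int       # number of disk pages it occupies
    distinct: dict   # attribute name -> number of distinct values


def equality_selectivity(stats: RelationStats, attr: str) -> float:
    # Assumption: values are uniformly distributed over the distinct values.
    return 1.0 / stats.distinct[attr]


def join_size(r: RelationStats, s: RelationStats, attr: str) -> float:
    # Assumption: equi-join size is |R| * |S| / max(V(R, a), V(S, a)).
    return r.tuples * s.tuples / max(r.distinct[attr], s.distinct[attr])


def block_nested_loop_cost(outer: RelationStats, inner: RelationStats,
                           buffer_pages: int) -> int:
    # Page I/Os: read the outer once, scan the inner once per outer block.
    # Assumption: two buffer pages are reserved for the inner input and output.
    blocks = math.ceil(outer.pages / (buffer_pages - 2))
    return outer.pages + blocks * inner.pages


# Made-up statistics for the emp and dept relations.
emp = RelationStats(tuples=10_000, pages=500, distinct={"dno": 50, "job": 20})
dept = RelationStats(tuples=50, pages=5, distinct={"dno": 50})

est_join_tuples = join_size(emp, dept, "dno")
cost_emp_outer = block_nested_loop_cost(emp, dept, buffer_pages=12)
cost_dept_outer = block_nested_loop_cost(dept, emp, buffer_pages=12)
```

Under these assumed statistics, using the smaller relation as the outer input is cheaper, which is the kind of comparison a cost formula is meant to support. The buffer size is supplied directly, while the relation sizes and distinct-value counts play the role of the estimator's output.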




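
The rewriting step discussed under Semantic Query Optimization can be sketched in a few lines of Python. This is a minimal sketch under strong simplifying assumptions: predicates are plain strings compared for equality, an integrity constraint is a hypothetical (condition, implied predicate, column) record, and the rule "add the implied selection only if an index exists on its column" stands in for the heuristics the text mentions. None of these names come from the paper or from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegrityConstraint:
    """If `condition` holds for a tuple, `implied` must hold as well."""
    condition: str        # e.g. job = 'Assistant professor'
    implied: str          # e.g. salary > 200K
    implied_column: str   # column restricted by the implied predicate


def semantic_rewrite(predicates, constraints, indexed_columns):
    """Return the predicate list extended with useful implied selections.

    Heuristic (an assumption of this sketch): an implied selection is
    added only when an index exists on the column it restricts.
    """
    rewritten = list(predicates)
    for c in constraints:
        if (c.condition in rewritten
                and c.implied not in rewritten
                and c.implied_column in indexed_columns):
            rewritten.append(c.implied)
    return rewritten


sal_constraint = IntegrityConstraint(
    condition="job = 'Assistant professor'",
    implied="salary > 200K",
    implied_column="emp.salary",
)
query = ["emp.dno = dept.dno", "job = 'Assistant professor'"]

# With a B+-tree on emp.salary the extra selection is added ...
with_index = semantic_rewrite(query, [sal_constraint], {"emp.salary"})
# ... and without any index the query is left unchanged.
without_index = semantic_rewrite(query, [sal_constraint], set())
```

Both outputs are semantically equivalent to the original query; the heuristic only decides whether the longer form is worth handing to the Planner.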

				
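
The separation between compile-time optimization and run-time plan choice described under Parametric Query Optimization can be illustrated as follows. The sketch assumes a single run-time parameter (the number of available buffer pages), two candidate plans, and deliberately simplified, made-up cost functions; the techniques cited in the text [9,10,11] deal with many parameters and far larger plan spaces. All function and variable names are hypothetical.

```python
import math


def nested_loop_cost(buffer_pages, outer_pages=500, inner_pages=400):
    # Block nested-loop join: the inner is scanned once per outer block.
    blocks = math.ceil(outer_pages / (buffer_pages - 2))
    return outer_pages + blocks * inner_pages


def sort_merge_cost(buffer_pages, outer_pages=500, inner_pages=400):
    # Simplifying assumption: a fixed three passes over both inputs,
    # independent of the buffer size.
    return 3 * (outer_pages + inner_pages)


PLANS = {"nested_loop": nested_loop_cost, "sort_merge": sort_merge_cost}


def compile_time_optimize(possible_buffer_sizes):
    """Record the cheapest plan for every parameter value that may occur."""
    return {
        b: min(PLANS, key=lambda name: PLANS[name](b))
        for b in possible_buffer_sizes
    }


def run_time_pick(plan_table, actual_buffer_pages):
    """At run time a table lookup replaces a new optimization run."""
    return plan_table[actual_buffer_pages]


# Compile time: consider every buffer size from 3 to 512 pages.
plan_table = compile_time_optimize(range(3, 513))

# Run time: the actual parameter value selects the precomputed plan.
plan_small_buffer = run_time_pick(plan_table, 10)
plan_large_buffer = run_time_pick(plan_table, 502)
```

With these assumed cost functions the preferred plan changes as the buffer grows, so a plan fixed for one buffer size at compile time would be suboptimal at the other.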