Online Aggregation

 Joseph M. Hellerstein
     Peter J. Haas
    Helen J. Wang
Motivation for Online
Aggregation
• Traditional aggregation takes a long time to
  return a very small final result from a large
  amount of data
• The result does not have to be very precise!
• Online aggregation allows users to observe
  the progress of their queries and control
  execution on the fly
New interface for aggregation
• Users can observe the progress of their queries
• Users can control execution on the fly
Statistical estimation
• Users do not need to specify a stopping
  condition a priori
• The interface is easier for users with no
  statistical background
• It requires more powerful statistical
  estimation techniques (e.g., Hoeffding’s
  inequality rather than Chebyshev’s inequality;
  see the standard forms below)
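For reference, the standard forms of the two inequalities for the mean \bar{Y}_n of n i.i.d. samples (Hoeffding additionally assumes each value lies in known bounds [a, b]); Hoeffding's exponential tail bound typically gives much tighter intervals at high confidence levels:

    \text{Chebyshev:}\quad P\{\,|\bar{Y}_n - \mu| \ge \epsilon\,\} \le \frac{\sigma^2}{n\,\epsilon^2}
    \qquad
    \text{Hoeffding:}\quad P\{\,|\bar{Y}_n - \mu| \ge \epsilon\,\} \le 2\,e^{-2n\epsilon^2/(b-a)^2}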
Related work
• Online Analytical Processing (OLAP): batch
  mode and precise computation
• “Fast-first” query processing: similar to online
  aggregation but with simpler performance goals
• APPROXIMATE: defines an approximate
  relational algebra used to process standard
  relational queries in an iteratively refined manner
Usability goals
• Continuous observation
• Control of time/precision
• Control of fairness/partiality
Performance goals
• Minimum time to accuracy: produce a
  useful estimate of the final answer ASAP
• Minimum time to completion: secondary
  goal, assume user will terminate processing
  long before the final answer is produced
• Pacing: guarantee a smoothly and continuously
  improving display
A naïve approach
• Use user-defined output functions supported by
  POSTGRES to provide simple running aggregates
• Cannot support complex aggregates such as
  aggregation with GROUP BY
• Current systems optimize only for accurate
  solutions
• Skip factor k – the DBMS only ships an update to
  the user interface after k input tuples have been
  processed
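A minimal sketch of the skip-factor mechanism (illustrative only, not the POSTGRES implementation; scan_in_random_order and ship_update are hypothetical helpers):

    # Sketch: a running AVG that ships an update to the user interface only
    # after every k input tuples have been processed.
    def running_avg(tuples, k, ship_update):
        total, n = 0.0, 0
        for value in tuples:
            total += value
            n += 1
            if n % k == 0:                 # skip factor k
                ship_update(n, total / n)  # running estimate
        if n:
            ship_update(n, total / n)      # final answer

    # e.g. running_avg(scan_in_random_order(), k=1000,
    #                  ship_update=lambda n, avg: print(n, avg))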
Random access to data
  We need to retrieve data in random order to
  produce meaningful statistical estimates.
  Three ways to get records in random order:
• Heap scans
• Index scans
• Sampling from indices (less efficient)
Non-blocking GROUP BY and
DISTINCT
• Sorting is a blocking algorithm and only
  one group is computed at a time after
  sorting
• Hashing is non-blocking, but the hash table
  needs to fit in memory to achieve good
  performance
• Hybrid Cache (an extension of hybrid
  hashing) might be good
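A minimal sketch of hash-based, non-blocking grouping (illustrative only; the paper's Hybrid Cache additionally handles hash tables that do not fit in memory):

    from collections import defaultdict

    # Sketch: running per-group AVG; an estimate for every group seen so far
    # can be reported at any time, so grouping never blocks.
    def online_group_avg(tuples, k=1000):
        sums, counts = defaultdict(float), defaultdict(int)
        for i, (group, value) in enumerate(tuples, start=1):
            sums[group] += value
            counts[group] += 1
            if i % k == 0:                              # skip factor, as before
                yield {g: sums[g] / counts[g] for g in counts}
        yield {g: sums[g] / counts[g] for g in counts}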
Index striding
(figure: index with group keys k1, k2, k3)
• Hash-based grouping can be unfair
• Solution: probe the index to find all the
  groups and then process tuples from each
  group in a “round robin” fashion
• Can control per-group speed by weighting the
  schedule (see the sketch below)
• Fair even for groups with very different
  cardinalities
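A sketch of the weighted round-robin schedule (illustrative; assumes the index lets us open one scan per group key, and the helper names are hypothetical):

    from itertools import cycle

    # Sketch: weighted round-robin "index striding" over per-group scans.
    # group_scans: {group_key: iterator over that group's tuples}
    # weights:     {group_key: tuples drawn per turn (controls per-group speed)}
    def index_stride(group_scans, weights):
        order = cycle(list(group_scans))
        active = dict(group_scans)
        while active:
            g = next(order)
            if g not in active:
                continue                     # this group is already exhausted
            for _ in range(weights.get(g, 1)):
                try:
                    yield g, next(active[g])
                except StopIteration:
                    del active[g]
                    break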
Index striding - Continued
• Efficiency: will be as good as scanning a
  relation via a clustered secondary index if
  – Index is the primary access method or
  – Relation is clustered by the grouping columns
    or
  – Index keys contain both the grouping and
    aggregation columns, with the grouping
    columns as a prefix
Non-blocking join algorithms (1)
• Sort-merge join is not acceptable for online
  aggregation because sorting is blocking
• Hybrid hash join blocks for the time required to
  hash the inner relation
• Pipelined hash join techniques may be appropriate
  for online aggregation when both relations are
  large
• Merge join (without sort) and hash join produce
  output in a particular order – not good for
  statistical estimation
Non-blocking join algorithms (2)
• The “safest” traditional join algorithm is
  nested loop, particularly if there is an index
  on the inner relation
• More on ripple join coming next…
Issues in optimization (1)
• Avoid sorting completely
• It is undesirable to produce results ordered on
  aggregation or grouping columns
• Divide cost model into two parts:
   – Time t_d spent in blocking (“dead”) operations
   – Time t_o spent producing output
   Use the cost function f(t_o) + g(t_d) (where f is linear and g is
     super-linear) to “tax” operations with too much dead
     time
Issues in optimization (2)
• Preference to plans that maximize user
  control (such as index-striding)
• Trade off between output rate of a query
  and its time to completion
  – Create natural controls in this regard for naïve
    users
  – Running multiple versions of a query is a solution
    but wastes computing resources
Running confidence intervals (1)
• A confidence parameter p ∈ (0, 1) is
  prespecified
• Display a precision parameter εn such that the
  running aggregate Yn is within ±εn of the
  final answer μ with probability approximately
  equal to p, i.e., the interval [Yn − εn, Yn + εn]
  contains μ with probability approximately
  equal to p
Running confidence intervals (2)
• Three types of intervals can be constructed from the n retrieved records:
   – Conservative confidence intervals based on Hoeffding’s
     inequality, or a recent extension of it, valid for all
     n ≥ 1
   – Large-sample confidence intervals based on central
     limit theorems (CLTs), valid when n is large enough for
     the CLT to apply but still small relative to the table size
   – Deterministic confidence intervals, which contain μ with
     probability 1 but are useful only for very large n
 Running confidence intervals (3)
• SELECT AVG(exp) FROM R;
  v(i) (1 ≤ i ≤ m): the value of exp applied to tuple i
  Li: the random index of the i-th tuple retrieved from R
  a and b: a priori bounds with a ≤ v(i) ≤ b for 1 ≤ i ≤ m
• Conservative confidence interval equations:
    \bar{Y}_n = \frac{1}{n}\sum_{i=1}^{n} v(L_i), \qquad
    \mu = \frac{1}{m}\sum_{i=1}^{m} v(i)

    P\{\,|\bar{Y}_n - \mu| \le \epsilon_n\,\} \ge 1 - 2e^{-2n\epsilon_n^2/(b-a)^2}

    \epsilon_n = (b-a)\left(\frac{1}{2n}\ln\frac{2}{1-p}\right)^{1/2}
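A small sketch putting the conservative interval to work (assumes the values arrive in random order, as required above):

    import math

    # Sketch: running AVG with the Hoeffding-based half-width epsilon_n;
    # a and b are the a priori bounds on exp.
    def hoeffding_running_avg(values, a, b, p=0.95, k=1000):
        total, n = 0.0, 0
        for v in values:
            total += v
            n += 1
            if n % k == 0:
                eps = (b - a) * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * n))
                yield total / n, eps   # [avg - eps, avg + eps] contains mu w.p. >= p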
Running confidence intervals (4)
• Large-sample confidence interval equations
• By central limit theorems (CLTs), Yn approaches a normal
  distribution with mean μ and variance σ²/n as n, the
  sample size, increases. σ² can be replaced by the estimator
  Tn,2(v)
    T_{n,2}(v) = \frac{1}{n-1}\sum_{i=1}^{n}\big(v(L_i) - \bar{Y}_n\big)^2

    P\{\,|\bar{Y}_n - \mu| \le \epsilon_n\,\}
      = P\left\{\frac{\sqrt{n}\,|\bar{Y}_n - \mu|}{T_{n,2}^{1/2}(v)}
          \le \frac{\epsilon_n\sqrt{n}}{T_{n,2}^{1/2}(v)}\right\}
      \approx 2\,\Phi\!\left(\frac{\epsilon_n\sqrt{n}}{T_{n,2}^{1/2}(v)}\right) - 1

    \Phi(z_p) = (p+1)/2, \qquad
    \epsilon_n = z_p\left(\frac{T_{n,2}(v)}{n}\right)^{1/2}
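The large-sample half-width in code (a sketch; the normal quantile comes from Python's standard library):

    import math
    from statistics import NormalDist

    # Sketch: CLT-based half-width epsilon_n, with the running sample variance
    # T_{n,2}(v) standing in for the unknown variance.
    def clt_interval(sample, p=0.95):
        n = len(sample)
        mean = sum(sample) / n
        t_n2 = sum((v - mean) ** 2 for v in sample) / (n - 1)   # T_{n,2}(v)
        z_p = NormalDist().inv_cdf((p + 1) / 2)                  # Phi(z_p) = (p+1)/2
        return mean, z_p * math.sqrt(t_n2 / n)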
Performance issues – skip factor
Performance issues – index
striding (a large group)
Performance issues – index
striding (a small group)
Conclusion
• Aggregates have running output and a running
  confidence interval
• Hash-based grouping and duplicate elimination
  are non-blocking
• Index striding gives more control over
  GROUP BY
Future work
• Graphical user interface
• Nested queries
• Checkpointing and continuation
Ripple Joins for Online
     Aggregation
       Peter J. Haas
   Joseph M. Hellerstein
Join algorithms for online
aggregation
• Sort-merge and hash join algorithms are
  blocking algorithms – not acceptable in
  online aggregation
• Merge join (without sorting) provides ordered
  output – bad for the statistical estimators used
  in online aggregation
• Nested loop join is the best, but...
An artificial bad example for
nested loop join
• SELECT AVG(S.a + R.b/10000000)
   FROM R, S
If R is the inner relation, then for each tuple from
   S we must scan the whole relation R, but that
   scan contributes almost no new information to
   the output estimate.
Overview of ripple join (1)
Overview of ripple join (2)
• Online nested-loops join is a special case of
  ripple join
 Ripple join algorithms (1)
• It can be viewed as a generalization of nested-
  loops join in which the traditional roles of “inner”
  and “outer” relation are continually interchanged
  during processing
(figure: three snapshots of R × S; the sampled (n−1) × (n−1)
 rectangle grows to n × n as one new tuple from each relation
 is retrieved per step)
Ripple join algorithms (2)
• The iterator interface must be modified – the
  algorithm has to keep track of more state
• Aspect ratios do not have to be 1 (a minimal
  sketch follows below)
• It can be extended to multiple ripple joins
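A minimal in-memory sketch of the simplest (“square”, aspect ratio 1) ripple join; real variants work block-at-a-time, use indexes or hash tables, and feed the estimators shown later:

    # Sketch: at step n, retrieve one new tuple from each input and join it
    # against every tuple of the other input seen so far, so each pair
    # (r_i, s_j) with i, j <= n is produced exactly once.
    def square_ripple_join(r_iter, s_iter, predicate):
        r_seen, s_seen = [], []
        for r, s in zip(r_iter, s_iter):     # stops when either input is exhausted
            s_seen.append(s)
            for s_old in s_seen:             # new R-tuple vs. all S-tuples so far
                if predicate(r, s_old):
                    yield r, s_old
            for r_old in r_seen:             # new S-tuple vs. previous R-tuples
                if predicate(r_old, s):
                    yield r_old, s
            r_seen.append(r)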
Ripple join variants
• Block ripple join improves performance on
  I/O
• Indexed ripple join = index-enhanced
  nested-loops join; the roles of inner and outer
  relation no longer alternate
• Hash ripple join can be used for equijoin
  queries – keeps two in-memory hash tables, one
  for R and one for S (sketch below)
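A sketch of the hash variant for an equijoin (assumes both hash tables fit in memory; the key extractors are supplied by the caller):

    from collections import defaultdict

    # Sketch: hash ripple join for R.key = S.key.  Each step inserts one new
    # tuple per relation into its hash table and probes the other table, so
    # every joining pair is found exactly once.
    def hash_ripple_join(r_iter, s_iter, r_key, s_key):
        r_table, s_table = defaultdict(list), defaultdict(list)
        for r, s in zip(r_iter, s_iter):
            s_table[s_key(s)].append(s)
            for s_match in s_table.get(r_key(r), []):   # probe with new R-tuple
                yield r, s_match
            for r_match in r_table.get(s_key(s), []):   # probe with new S-tuple
                yield r_match, s
            r_table[r_key(r)].append(r)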
Estimators for SUM, COUNT
and AVG
    \widehat{\mathrm{sum}} = \frac{|R|\,|S|}{|R_n|\,|S_n|}
        \sum_{(r,s)\,\in\,R_n \times S_n} \mathit{exp}_p(r,s)

    \widehat{\mathrm{count}} = \frac{|R|\,|S|}{|R_n|\,|S_n|}
        \sum_{(r,s)\,\in\,R_n \times S_n} \mathit{one}_p(r,s)

    \widehat{\mathrm{avg}} = \widehat{\mathrm{sum}} \,/\, \widehat{\mathrm{count}}

  where R_n and S_n are the tuples retrieved so far, exp_p(r,s) equals
  exp(r,s) when the query predicate holds (0 otherwise), and one_p(r,s)
  is the corresponding indicator.
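In code, the scale-up is just |R||S| / (|Rn||Sn|) applied to the sums over the sampled cross product (a sketch; exp_fn and pred stand for the query's expression and predicate):

    # Sketch: running SUM/COUNT/AVG estimates from the tuples retrieved so far.
    def ripple_estimates(r_sample, s_sample, r_size, s_size, exp_fn, pred):
        scale = (r_size * s_size) / (len(r_sample) * len(s_sample))
        raw_sum, raw_count = 0.0, 0
        for r in r_sample:
            for s in s_sample:
                if pred(r, s):
                    raw_sum += exp_fn(r, s)
                    raw_count += 1
        est_sum = scale * raw_sum
        est_count = scale * raw_count
        est_avg = est_sum / est_count if raw_count else None
        return est_sum, est_count, est_avg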
Confidence intervals
• Use central limit theorems (CLT’s) to
  compute “large-sample” confidence
  intervals
• The classic CLT does not apply directly; a suitably
  defined variance σ² is derived for each type of
  aggregate query
    \sigma^2 = \sum_{k=1}^{K} \frac{d(k)}{\beta_k}
Ripple optimization: choosing
aspect ratios (1)
• Blocking factor β is prespecified; we want to
  optimize the βk’s – the aspect-ratio parameters
• Minimize

    \sigma^2 = \sum_{k=1}^{K} \frac{d(k)}{\beta_k}

  such that

    \beta_1 \beta_2 \beta_3 \cdots \beta_K \, \beta^{K-1} \le c \quad (c \text{ determined by the animation speed})

    1 \le \beta_k \le m_k/\beta \quad \text{for } 1 \le k \le K

    \beta_1, \beta_2, \beta_3, \ldots, \beta_K \ \text{integer}
Choosing aspect ratios (2)
• Solve a relaxed version of the optimization
  problem by dropping all constraints other
  than the first one (sketch below)
• Adjust the ratios accordingly during
  execution to satisfy the other constraints
• Start the ripple join with all βk = 1 and begin
  updating the aspect ratios after a certain
  amount of time
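For the relaxed problem exactly as written on the previous slide (only the product constraint kept), Lagrange multipliers give βk proportional to d(k). The sketch below shows that closed form; it illustrates the relaxed solution only and is not necessarily the formula or adjustment procedure used in the paper:

    import math

    # Sketch: relaxed aspect ratios.  Minimizing sum_k d(k)/beta_k subject to
    # prod_k beta_k * beta**(K-1) = c yields beta_k = d(k) * scale, where the
    # scale makes the product constraint hold with equality.
    def relaxed_aspect_ratios(d, beta, c):
        K = len(d)
        target_product = c / beta ** (K - 1)     # required value of prod_k beta_k
        scale = (target_product / math.prod(d)) ** (1.0 / K)
        return [dk * scale for dk in d]          # real-valued; round/clip afterwards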
Performance (1)
• SELECT ONLINE_AVG(enroll.grade)
  FROM enroll, student
  WHERE enroll.sid = student.sid
     AND student.honors_code IS NULL;
• Hash ripple join gives tight confidence interval
  within seconds
• Nested loops join takes over 10 seconds to begin
  producing output
• Best batch join algorithm (hybrid hash join) takes
  208 seconds to complete
Performance (2)
Performance (3) – choosing
aspect ratios
• A lower animation speed is selected to allow a
  wider range of aspect ratios
• Query:
  SELECT ONLINE_AVG(d.grade/a.grade)
  FROM enroll d, enroll a
  WHERE d.college = 'Education'
   AND a.college = 'Agriculture'
   AND a.year = d.year;
Performance (4) – choosing
aspect ratios
Conclusion
• The bigger the database, the more attractive
  online join algorithms appear
• Users can control the animation speed to trade
  off precision against update rate
• The system will adjust aspect ratios
  accordingly to achieve the best updating
  performance
Future work
• Optimization for ripple joins – choice
  between many variants of ripple joins
• Parallelization of ripple joins
Online Dynamic Reordering for
 Interactive Data Processing
       Vijayshankar Raman
         Bhaskaran Raman
       Joseph M. Hellerstein
Motivating applications
• Unclustered index striding in online
  aggregation might be very inefficient
  because of random I/Os
• Scalable spreadsheets need to provide sorting,
  scrolling and jumping over large data sets
  essentially instantaneously
• The sort operator used in some query plans is
  not the best solution
The Prefetch & Spool (P&S) approach
• Uses the time while the consumer is busy
  processing to prefetch interesting tuples into a
  buffer
• The ratio of the different groups in the buffer
  depends on the user’s interest
• Spools less interesting tuples (in chunks) to an
  auxiliary side-disk
• Phase 1 scans from the input; Phase 2 scans
  from the side-disk in the desired order (a
  simplified sketch follows)
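A much-simplified sketch of the two phases (illustrative helper names; the real P&S keeps an in-memory reorder buffer, chooses groups by a feedback policy, and interleaves delivery with spooling):

    import pickle, tempfile

    # Sketch: Phase 1 delivers tuples of interesting groups immediately and
    # spools the rest to a side file in chunks; Phase 2 replays the side file.
    def prefetch_and_spool(tuples, group_of, is_interesting, chunk_size=1000):
        side = tempfile.TemporaryFile()
        chunk = []
        for t in tuples:                          # Phase 1
            if is_interesting(group_of(t)):
                yield t
            else:
                chunk.append(t)
                if len(chunk) >= chunk_size:
                    pickle.dump(chunk, side)
                    chunk = []
        if chunk:
            pickle.dump(chunk, side)
        side.seek(0)
        while True:                               # Phase 2
            try:
                for t in pickle.load(side):
                    yield t
            except EOFError:
                return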
Policies for online reordering
• User preference and the chosen metric
  (confidence, rate, or strict) determine the delivery
  priority of the different groups
• We want to maximize the rate at which the
  feedback function rises as the number of
  tuples we retrieve increases
Performance (1)
• Compared to Index Striding
  – Much better than unclustered index striding
    because of less random I/Os
  – For extremely small groups, IS might be better
• Works better for slower processing rate
  because P&S has more time to build up the
  buffer between “gets”
Performance (2)
• Works well for scalable spreadsheets
• After P&S finishes Phase 1, tuples at the point of
  interest can be retrieved in milliseconds, while a
  blocking sort takes 15 minutes
• Random jump is not as good as short jump
  because nearby data has higher priority
  (using strict metric)
Performance (3)
• Can be used to replace sort in some
  traditional query plans
• Exact ordering is not needed in some query
  plans; reordering provides a non-blocking
  service, which benefits pipelined parallelism
  and interactive estimation techniques such as
  online aggregation
Future work
• Other feedback functions for different
  scenarios
• Providing data to multiple users
• Usages in graphical data visualization

				