Dissertation Pres

Document Sample
Dissertation Pres Powered By Docstoc
					Optimizing
Multiple Continuous Queries
  Dissertation Defense

  Chun Jin

  Thesis Committee
  Jaime Carbonell (Chair)
  Christopher Olston, on leave at Yahoo! Research
  Jamie Callan
  Phil Hayes, Vivisimo, Inc.

        October 31, 2006, Carnegie Mellon
Emerging Stream Applications




                          •Intelligence monitoring
                          •Fraud detection
•Transactions             •Onset epidemic patterns
•Senor network readings   •Network intrusion detection
•Network traffic data     •GeoSpatial change detection
                                  Chun Jin Carnegie Mellon   2
ARGUS: Toward Collaborative
Intelligence Analysis
 Data Streams

    Novelty Detection   Stream Matching                Ad hoc
                                    Continuous
                                                    Query Matching
                                     Queries




                                       New
                                     Continuous
                                      Queries




 Analyst A                           Analyst B
                                           Chun Jin Carnegie Mellon   3
Challenges
   Large-Scale (~103) continuous queries
   On FAST (104-105tuples/day) continuous
  streams
 With LARGE (~10 tuples) historical DBs.
                    6

… but computation-sharable and highly-
  selective queries
 Support stream processing for a broad range
  of queries on existing DB applications.
… but DBMS technologies.

                              Chun Jin Carnegie Mellon   4
Problems
   Efficiency and scalability
       Continuous query evaluation
       Multiple/Large-scale queries
   Practicality
       Utilize DBMS legacy systems to support
        stream processing on a broad range of
        queries.



                                 Chun Jin Carnegie Mellon   5
Approaches
   Efficiency and scalability
       Incremental query evaluation
       Incremental multiple query optimization
        (IMQO)     Selection/join queries
       Query optimization
   Practicality
       Built atop DBMSs
       Use SQL as the query language
   Shows up-to hundreds-fold improvement
         (Details coming up)

                                    Chun Jin Carnegie Mellon   6
Challenges to Multiple Query
Optimization (MQO)
                Q1
                Q2
                          MQO is NP-hard!
                …             [Sellis90]
                QK



      Q1 Q2     …    QK


                                                     time
  0   t1   t2    …   tK



      Incremental MQO (IMQO)
                              Chun Jin Carnegie Mellon      7
Performing IMQO
                                                 Q1
   QN
                                                 Q2
                                                 …
   SELECT …                                      QK
   FROM …
   WHERE …            Query Network R

 1. Index R
 2. Identify common computations between
    QN and R
 3. Select optimal sharing paths
 4. Expand R with new computations
                          Chun Jin Carnegie Mellon    8
             Related Work
   Efficiency and Scalability:             ARGUS[Jin et al,05][Jin et al,06]
       Incremental evaluation:
        Stream operators                        Practicality
            Join(Rete) [Forgy82]               Comprehensive IMQO
             [Urhan et al,00]
             [Viglas et al,03]                   framework
            Aggregate [Haas et al,99]               Richer query syntax and
       IMQO: Stream Processing                       semantics
        Projects                                     Canonicalization
            NiagaraCQ, TelegraphCQ                  More flexible plan structures
             [Chen et al,00]
             [Chandrasekaran et al,03]               More general sharing
            STREAM, Aurora, Gigascope                strategies
             [Motwani et al,03]
             [Abadi et al,03]
             [Cranor et al,03]
                                                       Chun Jin Carnegie Mellon   9
Thesis Statement
   The thesis demonstrates constructively that
    incremental multiple query optimization,
    incremental evaluation, and other query
    optimization techniques provide very
    significant performance improvements for
    large-scale continuous queries.
   The methods can function atop existing
    DBMS systems for maximal modularity and
    direct practical utility.
   The methods work well across diverse
    applications.
                               Chun Jin Carnegie Mellon   10
       ARGUS Stream Processing
 ARGUS Query Network Generator                    ARGUS Execution Engine
                                                        Data Tables
  IMQO
  Module                                                             Input Streams
                      Plan
                  Instantiator
SingleQuery                         System
                                                                Query Network
 Optimizer                          Catalog


  Code
Assembler

       Register & initialize query network



              Register queries
        Analyst
                                 Result streams        Chun Jin Carnegie Mellon      11
        Query Network Generator
                   Parser         ARGUS Query Network Generator


                Canonicalizer

                Incremental                       Index & Search
SQL   ARGUS                                                                 System
                Multi-Query                          Interface
Query Manager                                                               Catalog
                 Optimizer            Plan
                                  Instantiator
                                                        Query
                Single-Query
                                                       Rewriter
                 Optimizer

                   Code
                 Assembler

                       Initiation and execution code
                                                 Chun Jin Carnegie Mellon       12
Query Example
   Suppose for every big transaction of type
    code 1000 or 2000, the analyst wants to
    check if the money stayed in the bank or left
    within twenty days. An additional sign of
    possible fraud is that the transactions involve
    at least one intermediate bank. The query
    generates an alarm whenever the receiver of
    a large transaction (over $1,000,000)
    transfers at least half of the money further
    within twenty days of this transaction using
    an intermediate bank.

                                 Chun Jin Carnegie Mellon   13
The Query in CNF
     SELECT *
     FROM Fed r1, Fed r2, Fed r3
     WHERE
         (r1.type_code = 1000 OR r1.type_code = 2000)
S2   AND r1.amount > 1000000
     AND (r2.type_code = 1000 OR r2.type_code = 2000)
S1   AND r2.amount > 500000
     AND (r3.type_code = 1000 OR r3.type_code = 2000)
S1   AND r3.amount > 500000
     AND r1.rbank_aba = r2.sbank_aba
     AND r1.benef_account = r2.orig_account

J1   AND r2.amount > r1.amount / 2
     AND r1.tran_date <= r2.tran_date
     AND r2.tran_date <= r1.tran_date + 20
     AND r2.rbank_aba = r3.sbank_aba
     AND r2.benef_account = r3.orig_account

J2   AND r2.amount = r3.amount
     AND r2.tran_date <= r3.tran_date
     AND r3.tran_date <= r2.tran_date + 20;


F             S1             S2               J1                   J2


                                                   Chun Jin Carnegie Mellon   14
           Identify Sharable Computations
SELECT *
FROM Fed r1, Fed r2, Fed r3
                                                          1. Literal predicates
WHERE
    (r1.type_code = 1000 OR r1.type_code = 2000) ORp3
                                                             1. Equivalency
AND r1.amount > 1000000                            ORp4
AND (r2.type_code = 1000 OR r2.type_code = 2000) ORp1
                                                             2. Subsumption
AND r2.amount > 500000                             ORp2
AND (r3.type_code = 1000 OR r3.type_code = 2000) ORp1
                                                          2. OR predicates
AND r3.amount > 500000
AND r1.rbank_aba = r2.sbank_aba
                                                   ORp2   3. Predicate sets
AND r1.benef_account = r2.orig_account
AND r2.amount * 2 > r1.amount  r2.amount > r1.amount/2
                                                          4. Topology
AND r1.tran_date <= r2.tran_date
AND r2.tran_date - 10 <= r1.tran_date
                                                             Sharing strategies
AND r2.rbank_aba = r3.sbank_aba
AND r2.benef_account = r3.orig_account
                                                             Self-join
AND r2.amount = r3.amount
AND r2.tran_date <= r3.tran_date
                                                          P   J1
                                                                         J4

AND r3.tran_date - 10 <= r2.tran_date;  r3.tran_date <= r2.tran_date + 20
                                                                              J3

                        F             S1            S2            J1               J2    J4
                                                              Chun Jin Carnegie Mellon   15
        Computation Hierarchy
                                  sharable
                     S1                                 S2
                                subsumption
                     PS1                               PS2

        ORp1                ORp2                                   ORp4
                                         subsumption


      p11      p12             p2                                       p4
                                         subsumption
Fed.type_code = 1000 OR    Fed.amount > 500000          Fed.amount > 1000000
Fed.type_code = 2000
                                                 Chun Jin Carnegie Mellon    16
ER Model for Hierarchy
name        Node                      IsAChild
type     Associates
psetid    PredSet

         BelongsTo

ORpid     OR Pred

         BelongsTo

 text    Literal Pred                       pid
                        Chun Jin Carnegie Mellon   17
Problems in Index/Search
   Rich syntax  Canonicalization
   Subsumption
       Literal predicate:
         subsumption + canonicalization
          triple-string canonical form
       ORPred/PredSet  algorithms
   Self-join + canonicalization  Standard Table
    Alias (STA) assignment
   Topology  multiple topology indexing

    (Details coming up)

                                   Chun Jin Carnegie Mellon   18
Canonicalization
   Equivalency:
    r2.amount > r1.amount / 2
    r2.amount *2 > r1.amount
     r2.amount * 2 – r1.amount > 0
   Subsumption:
    r2.tran_date <= r1.tran_date + 20
    r2.tran_date – r1.tran_date <= 20
    r2.tran_date – 10 <= r1.tran_date
    r2.tran_date – r1.tran_date <= 10
   Triple-string canonical form:
        attribute-expression op constant Mellon
                                 Chun Jin Carnegie   19
Self-Join
   Canonical forms refer to true table
    names.
   Not good for self-join predicates:
    r1.benef_account = r2.orig_accout
     Fed. benef_account = Fed.orig_accout

   Use Standard Table Alias (STA)
        T1. benef_account = T2.orig_accout
       Enumerate STA assignments to find matches




                                    Chun Jin Carnegie Mellon   20
Self-Join in ORPred/PredSet Layers
    OR Predicate:
        (r1.c=1000 OR r1.a=r2.b)
         (Fed.c=1000 OR T1.a=T2.b)             ?
         (T1.c=1000 OR T1.a=T2.b)              ?
    Add STA when indexing OR Predicates
    Similar on Predicate Sets



                                       Chun Jin Carnegie Mellon   21
Subsumption at ORPred Layer
Input: ORPred p P
Output: All ORPreds r R,
             s.t. pr.
Algorithm:
   For each ρ p,
      Find γ r, such that ργ
   For each r found,
      Count # of γ that subsumes ρ, |I(r)|
      If |I(r)|=|p|
             pr            Chun Jin Carnegie Mellon   22
Topological Connections


     S1   S3   S5


B1                  J1           J4            J7


     S2   S4   S6




                         Chun Jin Carnegie Mellon   23
       System Catalog
        JoinTopologyIndex
Node   JVOA1   JVOA2   JVOAPSetID   DParent1    DParent2     DPSetID      Distinct



SelectionTopologyIndex
Node   JVOA1   JVOA2   JVOAPSetID   DParent    DPSetID     SVOA SVOAPSetID           Distinct




                               PSetID     PredID     STA
       PredicateSetIndex

                                PredicateIndex
        ORPredID   LPredID LExpr Op RExpr        Node1 Node2 STA            UseSTA



                                                         Chun Jin Carnegie Mellon         24
                    Indexing & Searching
                                     Inference & Classification                                       Canonicalization
         r2.type_code = 1000
                                                 r1.type_code = 1000
                                                r1.amount > 1000000                             T2.amount * 2 – T1.amount > 0
         r3.type_code = 1000                     r2.type_code = 1000
         r1.type_code = 1000                     r2.amount > 500000
        r1.amount > 1000000                      r3.type_code = 1000
   r1.rbank_aba = r2.sbank_aba                   r3.amount > 500000
r1.benef_account = r2.orig_account
     r2.amount * 2 > r1.amount
                                           r1.rbank_aba = r2.sbank_aba
                                        r1.benef_account = r2.orig_account
                                                                                                T2.tran_date – T1.tran_date <= 10
    r1.tran_date <= r2.tran_date             r2.amount * 2 > r1.amount
 r2.tran_date – 10 <= r1.tran_date          r1.tran_date <= r2.tran_date
   r2.rbank_aba = r3.sbank_aba           r2.tran_date – 10 <= r1.tran_date
r2.benef_account = r3.orig_account         r2.rbank_aba = r3.sbank_aba
       r2.amount = r3.amount            r2.benef_account = r3.orig_account
    r2.tran_date <= r3.tran_date               r2.amount = r3.amount
 r3.tran_date – 10 <= r2.tran_date          r2.tran_date <= r3.tran_date             Common
                                         r3.tran_date – 10 <= r2.tran_date
                                                                                    Computation                Computation
                                                                                     Searching                  Indexing


                                     Node        PredSetID           …       PredSetID PredID     …      PredID CanonicalForm   …




                                                                                System Catalog
                                                                                                   Chun Jin Carnegie Mellon         25
Sharing Strategies
    B1
1             ?                     B1
    B2                                       J1
                                    B2                     J2
    B2
2             ?                     B3
                                         (c-1) Sharing-selection
    B3
         (b-1) Joins in Q

    B1
             J1
    B2
         (a) Query network R        B1
    B1                                       J1
                                    B2
                                                           J3
    B2                                       J2
                        J3          B3
             J2                          (c-2) Match-plan
    B3                                        Chun Jin Carnegie Mellon   26
         (b-2) Optimal plan for Q
Evaluation
   Databases:
       Synthesized FedWire money transfers (Fed
        500000 records)
       Anonymized Medical patient admission records
        (Med 835890 records)
   Queries:
       Seed queries
       Generate sharable queries from seeds
       A wide range of queries
   Simulation:
       Historical data (300000 on Fed, 600000 on Med)
       Chunks of new data (4000 per chunk, etc.)
                                    Chun Jin Carnegie Mellon   27
       Improvement Factors


       Incremental   Join Order         Canonicalization           IMQO
       Evaluation    Optimization       1-10x                      1-50x
       1-100x        1-10x

DBMS                                                                            ARGUS
1x          Conditional       Transitivity                                      1-500x
            Materialization   Inference
            1.2-1.8x          1-20x




                                                     Chun Jin Carnegie Mellon        28
                                     Fed IMQO & Canonicalization
Execution Time(s)......




                          250                                                        25000

                          200                                                        20000




                                                                            WQNS..
                          150                                                        15000

                          100                                                        10000

                          50                                                          5000

                           0                                                             0
                                                                                             0   100     200    300   400   500   600   700   800
                                0    100 200 300 400 500 600 700 800
                                                  # of queries                                                 # of queries
                                     AllSharing      NonCanon    NonJoinS        WQNS: weighted query network size


                                    HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0
                                                                                                       Chun Jin Carnegie Mellon                 29
                                       Fed Sharing Strategies
Execution Time (s)......




                           100                                                                               6000


                           80                                                                                4800




                                                                                                    WQNS..
                           60                                                                                3600

                           40                                                                                2400


                           20                                                                                1200


                            0                                                                                   0
                                 0   100     200   300    400       500       600    700      800                   0   100     200   300   400    500   600   700   800
                                                     # of queries                                                                       # of queries
                                     SharingSel     M atchPlan            M atchPlan+NCanon




                                     HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0
                                                                                                                              Chun Jin Carnegie Mellon               30
Summary of Contributions
   Efficiency and scalability
       Continuous queries  Incremental query evaluation
       Multiple/large-scale queries  Incremental multiple query
        optimization (IMQO)
       Query optimization
   Practicality
       Existing DB applications  Built atop DBMSs
       A broad range of query syntax and semantics  Support
   Evaluation
       Shows up-to hundreds-fold improvement
       Works across various domains




                                          Chun Jin Carnegie Mellon   31
Future Work
   Generalization of current work
       Support multi-way joins
       More sophisticated sharing strategies
            Rerouting
            Restructuring
   Adaptive query processing
       Adaptive re-optimization: rerouting and restructuring
       Adaptive rescheduling
   New infrastructure
       Parallel/distributive processing
       Automatic tuning: index selection
   Support new data types
       Text
       Multimedia

                                            Chun Jin Carnegie Mellon   32
Acknowledgement
   Advisor: Jaime Carbonell.
   Committee: Chris Olston, Jamie Callan, and
    Phil Hayes
   CMU and Dynamix ARGUS team: Jaime
    Carbonell, Phil Hayes, Santosh Ananthraman,
    Cenk Gazen, Bob Frederking, Eugene Fink,
    Dwight Dietrich, Ganesh Mani, Johny Mathew,
    and Aaron Goldstein.
   CMU faculty and friends: many …

                              Chun Jin Carnegie Mellon   33
      Thank you!

Questions and comments?




                   Chun Jin Carnegie Mellon   34
Outline
   Motivation
   System and methods:
       System architecture
            Execution engine
            Query network structures
       IMQO framework
            Query network generator
            Query examples
            Hierarchy/ER Model
            Problems and solutions
            System catalog
            Sharing strategies
   Evaluation
   Conclusion and future work

                                        Chun Jin Carnegie Mellon   35
Adapted Rete Algorithm (Join)
   Join  on N and M
   (N+ΔN) (M+ΔM)
    = N M + ΔN M + N ΔM + ΔN ΔM

    Old Results
                   New Incremental Results

   When ΔN and ΔM are very small
    compared to N and M, time complexity
    of incremental join is O(N+M)
                                Chun Jin Carnegie Mellon   36
      Incremental Evaluation
                           N.rbank_aba = M.sbank_aba
     N                     N.benef_account = M.orig_account
                           M.amount > N.amount*0.5
                    J      N.tran_date <= M.tran_date
     M                     M.tran_date >= N.tran_date+20


     N


       hist
ΔN    new     J
                    hist
                                    Compute ΔJ
                   new
     M                              by ΔN   M
              ΔJ
      hist
                                      N   ΔM
                                     ΔN   ΔM
ΔM   new



                                       Chun Jin Carnegie Mellon   37
        Incremental Evaluation
                                  r1.rbank_aba = r2.sbank_aba
                                  r1.benef_account = r2.orig_account
                                  r2.amount > r1.amount*0.5
               type_code=1000     r1.tran_date <= r2.tran_date
               amount>500000      r2.tran_date >= r1.tran_date+20
   F
               F         S1       S2               J1                    J2
        hist
       temp                       S2

                         S1        hist
                                  temp           J1
                                                     hist
Compute S1_temp by         hist                     temp

selecting from F_temp     temp
                                         Compute J1_temp by
                                    joining S1_temp and S2_hist,
                                    joining S1_hist and S2_temp,
                                  and joining S1_temp and S2_temp
                                              Chun Jin Carnegie Mellon        38
Code Generation
   Code template for each operator
   Code block for each node
   Sort the code blocks
   Wrap up code blocks in Oracle stored
    procedures
   Register and periodical execution



                           Chun Jin Carnegie Mellon   39
Projection Management


     B1       S1




                                 J1




      B2      S2




                   Chun Jin Carnegie Mellon   40
Transitivity Inference Example
   Given
       r1.amount > 1000000 and
       r2.amount > r1.amount * 0.5 and
       r3.amount = r2.amount

   We can infer highly-selective predicates:
       r2.amount > 500000
       r3.amount > 500000




                                    Chun Jin Carnegie Mellon   41
         Query Optimizer
   Similar to traditional enumeration-based
    query optimizer
   Optimize
       Join order                           History-based
                                                                                           DB
       Conditional materialization          Cost Estimator


                                 SQL Query                         Active List
                                                 Join
                                              Enumerator            Join Graph
                                      Plan




                                 Update System Catalog
                                             History-based              StructureBuilder
                                             Query Optimizer


                                                           Chun Jin Carnegie Mellon         42
 Conditional Materialization
                 Unconditional Materialization
r1



r2




r1               Conditional Materialization:
                                    Choose materialization
                                    or not based on cost
                                    estimates
r2


                            Chun Jin Carnegie Mellon     43
            Selection/Join
            Incremental Evaluation (Fed)
                     50
 Execution Time(s)



                     40

                     30

                     20

                     10

                      0
                          Q1   Q2    Q3     Q4   Q5         Q6         Q7

          Rete Data1           DBMS Data1   Rete Data2        DBMS Data2
HP PC, Single core Pentium(R) 4 CPU, 1.7GHz, 512M RAM, Windows XP, Oracle 10.1.0
                                                      Chun Jin Carnegie Mellon     44
               Fed Comparing All
Execution Time(s)......

                           250

                          200

                           150

                           100

                            50

                             0
                                 0     100   200    300       400   500      600       700       800

                                                      # of queries
                          AllSharing               NonCanon                   NonJoinS
                          M atchPlan               M atchPlan+NCanon
    HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0
                                                                          Chun Jin Carnegie Mellon     45
Execution Time (s)......   Med Comparing All
                           120

                            90

                            60

                            30

                            0
                                 0   100   200      300     400          500           600
                                                 # of queries
                             AllSharing                     NonCanon
                             NonJoinS                       MatchPlan
                             MatchPlan+NCanon
         HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0
                                                            Chun Jin Carnegie Mellon         46
                                      Med IMQO & Canonicalization
Execution Time (s)......




                           120                                                                   15000
                                                                                                 12000




                                                                                       WQNS ..
                           90
                                                                                                  9000
                           60
                                                                                                  6000
                           30
                                                                                                  3000
                            0                                                                        0
                                 0     100     200       300        400   500    600
                                                                                                         0   100   200 300 400           500   600
                                                     # of queries                                                     # of queries
                                      AllSharing        NonCanon          NonJoinS




                                     HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0
                                                                                                              Chun Jin Carnegie Mellon          47
                                     Med Sharing Strategies
Execution Time (s)......




                           80                                                                  8000

                           60                                                                  6000




                                                                                     WQNS ..
                           40                                                                  4000

                           20                                                                  2000

                           0                                                                     0
                                0      100   200       300        400   500    600                    0   100   200       300        400   500   600
                                                                                                                      # of queries
                                                   # of queries
                            SharingSel       MatchPlan            MatchPlan+NCanon




                                    HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0
                                                                                                           Chun Jin Carnegie Mellon              48

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:4
posted:9/17/2012
language:English
pages:48