High-Performance Extensible Indexing


Access Methods for Next-Generation Database Systems

          Marcel Kornacker
             UC Berkeley
Overview and Motivation
• this talk’s topic: access method (AM)
   – how to support novel AMs in extensible ORDBMSs
   – not: yet another point, spatial, metric, ... AM
• outline:
   – AM extensibility architecture for ORDBMSs
   – concurrency & recovery
   – AM performance analysis
Overview and Motivation
• why bother with AMs:
   – we have B-trees
   – our customers don’t care
• well...
   – new apps need ORDBMS support: GIS, multimedia, genomic
     sequence databases, etc.
   – customers do care about fast access to data
   – B-trees won’t help
Overview and Motivation
• AM extensibility: what has been done
  – slew of papers about novel AMs
  – slew of papers about extensible DBMSs (in 80s!)
• what needs to be done:
  – storage-level techniques for ORDBMSs:
    reconciling functionality with performance and reliability
  – AMs crucial to performance, deserve special attention
• Overview and Motivation
• High-Performance Extensible Indexing with
  Generalized Search Trees
  – AM Support in Commercial DBMSs
  – GiST Overview
  – IUS Implementation Overview
• Concurrency and Recovery
• Access Method Performance Analysis
AM Support in Commercial DBMSs
• OR modeling and extensibility very successful
  in commercial DBMS, but:
   – AM extensibility has not received same degree of attention
   – DBMS vendors now struggling to add/improve novel (spatial) AMs
• State of the art: IUS virtual index interface/Oracle
  extensible indexing interface:
   – iterator interface: open(), getnext(), close(), insert(), delete()
   – AM handles internally: locking, recovery, page management,
     … (same as built-in AMs)
 AM Support in Commercial DBMSs (2)
• What’s wrong with this interface:
  – concurrency and recovery need to be (re-)implemented for
    each new AM
  – difficult to implement: AM developer = domain expert, rarely
    also DBMS internals expert
  – would prefer to deal purely with AM specifics: Generalized
    Search Tree
[Diagram: new AMs sit on top of the storage, buffer, and log layers next to query processing, and still require custom CC&R code]
Generalized Search Tree Overview
• Generalized Search Tree (GiST) = template
  index structure
  –   extensible set of data types and queries
  –   customize tree behavior through extension methods
  –   examples: B-trees, R-trees, …
  –   details: Hellerstein, Naughton, Pfeffer, VLDB ‘95
• GiST provides
  –   basic structure: height-balanced tree
  –   template algorithms: search, insert and delete
  –   no assumptions about keys and how they’re arranged
  –   AM developer provides key-specific functions and particular
      operational properties
Generalized Search Tree Overview
GiST AM Parameters:
• Subtree predicate (SP) = user-defined key
• Consistent( ) returns true/false
• Penalty( ) returns insertion cost
• PickSplit( ) splits page items into two groups
• Union( ) updates SPs
[Diagram: an internal node carries entries SP1 SP2 ... SPn, each pointing to a subtree; leaf nodes hold the data items]
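As a toy illustration of these extension methods (Consistent, Penalty, PickSplit, Union), here is a sketch for a one-dimensional interval AM; the function names and interval representation are hypothetical, not the GiST API:

```python
# Toy GiST extension methods for 1-D interval keys (illustrative sketch).
# An SP (subtree predicate) is a (lo, hi) interval covering its subtree.

def consistent(sp, query):
    """True if the query interval may overlap data under this SP."""
    return sp[0] <= query[1] and query[0] <= sp[1]

def penalty(sp, key):
    """Insertion cost: how much the SP must grow to admit the new key."""
    lo, hi = min(sp[0], key[0]), max(sp[1], key[1])
    return (hi - lo) - (sp[1] - sp[0])

def union(sp, key):
    """Updated SP covering both the old SP and the new key."""
    return (min(sp[0], key[0]), max(sp[1], key[1]))

def pick_split(entries):
    """Split an overfull page's items into two groups (here: by midpoint)."""
    entries = sorted(entries)
    mid = len(entries) // 2
    return entries[:mid], entries[mid:]
```

A B-tree or R-tree instance of GiST differs only in how these four functions are defined; the tree machinery stays the same.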
Generalized Search Tree Overview
• More suitable basis for AM extensibility than
  iterator interface, because:
  – raises level of abstraction, allows developer to focus on AM
    specifics/performance-relevant AM properties (clustering,
    internal predicates)
  – built-in features (e.g., page handling, tree traversal) need
    not be re-implemented for specific AMs
[Diagram: R-tree, R*-tree, and SR-tree extensions plug in between query processing and the storage/buffer/log layers]
Generalized Search Trees Overview
• Goal of thesis work: making idea industrial
  – efficiency: Informix Universal Server impl (VLDB ’99)
  – concurrency & recovery (SIGMOD ’97)
  – analysis framework (demo SIGMOD ’98)
• Datatype extensibility:
  – IUS allows UDTs to be indexed by built-in AMs
  – require same degree of datatype extensibility for GiST
• Good performance:
  – operations on UDTs implemented via UDFs, which are orders
    of magnitude slower than regular function calls
  – GiST needs to avoid large number of UDF calls
• Intra-page storage format:
  – original GiST design assumed R-tree-like page layout
  – IUS GiST needs to allow customized formats (e.g., B-tree,
    hB-tree) to support compression or simplified access
IUS GiST - Summary
• Extensibility architecture:
   – GiST core in server
   – AM extension in DataBlade, implements GiST API
• Improved GiST API
   – page-based, not entry-based: few UDF calls
   – leave page format to AM extension: very flexible
IUS GiST - Performance
• Comparison of GiST-based and built-in R-trees
  in IUS:
  – built-in R-tree: datatype extensible
  – software engineering: 1,400 lines of C for GiST R-tree vs.
    10,000 lines of C for built-in R-tree (GiST core: about 10,000
    lines of C)
  – performance: identical # of I/Os, GiST uses 14 to 40% less
    CPU time
• Reasons for GiST performance advantage:
  – far fewer UDF calls for GiST R-tree: on average 1 per page
    (built-in: 1 per page entry)
  – but: higher set-up cost for GiST: needs to set up descriptors
    for 11 interface UDFs (built-in: only needs 7 UDFs)
• Overview and Motivation
• High-Performance Extensible Indexing with
  Generalized Search Trees
• Concurrency and Recovery
  – Physical CC - concurrent index operations
  – Logical CC - transaction isolation
  – Recovery and how it affects concurrency
• Access Method Performance Analysis
Concurrent Index Ops
• Problem: concurrent structure modifications
  (example: B-tree)
[Diagram: a B-tree whose leaf splits during insert I:8 while a concurrent delete D:7 is underway]
• Lock-coupling strategy with repositioning
  (ARIES/IM): requires key ordering
• B-link tree strategy: compensation during traversal
Concurrent Index Ops
• Navigating linked nodes:
  – how to detect node split?
  – when to stop going right?
• Structural GiST extensions
  – global sequence counter
  – sequence number (NSN) for each node
Concurrency Extensions
[Diagram: an insert splits a node; the global counter is incremented from 5 to 6 and the value 6 is recorded as the NSN of the nodes affected by the split]
• Remember global counter when reading a pointer; compare it
  with the node’s NSN
• NSNs allow split compensation independently of
  key properties
• Can be implemented very efficiently in typical
  WAL environments
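A simplified sketch of the detection idea (node layout and stopping rule are simplified; the real protocol also interacts with latching and logging):

```python
# Simplified sketch of split detection via NSNs (hypothetical node layout).
# A traversal remembers the global counter when it reads a parent pointer;
# if the child's NSN is higher, the child split afterwards, so entries may
# have moved right and the traversal follows rightlinks to find them.

class LinkedNode:
    def __init__(self, name, nsn, rightlink=None):
        self.name = name
        self.nsn = nsn
        self.rightlink = rightlink

def nodes_to_visit(child, remembered_counter):
    """Child plus any right siblings split off after the pointer was read."""
    visit = [child]
    if child.nsn <= remembered_counter:
        return visit                    # no split since pointer was read
    cur = child.rightlink
    while cur is not None and cur.nsn > remembered_counter:
        visit.append(cur)               # created by a later split: inspect it
        cur = cur.rightlink
    return visit
```

Because the test uses only the NSN, it works without any key ordering, which is what makes it applicable to GiSTs.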
A Word on Latch-Coupling
• Popular technique for B-trees (ARIES/IM):
  – hold parent latched while going to child
  – avoid invalid pointers after node deletion
• Won’t work for GiST:
  – GiST search might traverse multiple subtrees
  – either: keep parent latched while traversing all children (low concurrency)
  – or: reposition in parent after traversing each child
  – but: repositioning requires key ordering
Transaction Isolation
• SQL Isolation Levels:
  – most of it: locks on base table items (unrelated to index)
  – hard part: “serializable” isolation level (i.e., preventing phantoms)
• Phantom Problem:
  – SELECT specifies logical range
  – need to prevent insertions into that range
  – can’t lock non-existing items
Transaction Isolation: B-Trees
• B-Trees: Example of Key-Range Locking
  –   partition data space into intervals
  –   each leaf item corresponds to interval
  –   scan from 5 to 9: lock data in range, lock next key
  –   insert 9: check interval 8-10 (check item 10)

 3      4      5     6      7     8      9     10    11
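The key-range locking example above can be sketched as follows (toy lock table and API; a real lock manager adds lock modes and deadlock detection):

```python
# Sketch of B-tree key-range (next-key) locking over the leaf keys above.
# A range scan locks every key in the range plus the next key; an insert
# into an interval must check the lock on the interval's upper key.

leaf_keys = [3, 4, 5, 6, 7, 8, 9, 10, 11]
locks = {}  # key -> set of transaction ids holding a shared lock

def scan(xact, lo, hi):
    """Lock data in [lo, hi] plus the next key, closing the range."""
    in_range = [k for k in leaf_keys if lo <= k <= hi]
    next_key = min((k for k in leaf_keys if k > hi), default=None)
    for k in in_range + ([next_key] if next_key is not None else []):
        locks.setdefault(k, set()).add(xact)
    return in_range

def can_insert(xact, key):
    """Insert into interval (prev, next): conflicts with locks on next key."""
    next_key = min((k for k in leaf_keys if k >= key), default=None)
    holders = locks.get(next_key, set()) - {xact}
    return not holders
```

A scan of 5 to 9 thus also locks item 10, so an insert into the interval 8-10 is blocked until the scanner commits.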
Transaction Isolation: GiSTs
• But: this doesn’t work for GiSTs!
• Example:
   – 2-dim point keys
     and search rectangle
   – what are next keys?
     where do we find
     them in tree?
Predicate Locking
• Idea:
  – readers register shared predicate (search qual.), check
    exclusive predicates
  – updaters register exclusive predicate (updated item), check
    shared predicates
• Compared to key-range locking
  – perfectly accurate, maximal concurrency
  – expensive: evaluate lots of predicates
  – no gradual expansion:
    (1) lock entire range, even when cursor stops early
    (2) nearest-neighbor search locks entire tree
Hybrid Locking
• Instead: novel hybrid mechanism
  – 2-phase locking of retrieved, inserted and deleted data
  – restricted pred locking for phantom avoidance
• Restricted predicate locking
  – search predicate attached to every visited node during traversal
  – structure mods replicate predicate attachments
  – no insert/delete predicates
  – insert only checks target leaf’s predicates
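A minimal sketch of the restricted scheme (hypothetical API; real attached predicates are search qualifiers, modeled here as Python callables):

```python
# Sketch of restricted predicate locking (hybrid scheme, illustrative only).
# Searches attach their predicate to every node they visit; an insert only
# checks the predicates attached to its target leaf.

from collections import defaultdict

attached = defaultdict(list)  # node id -> [(xact, predicate)]

def search_visit(xact, node_id, predicate):
    """Record the search predicate on a visited node."""
    attached[node_id].append((xact, predicate))

def insert_allowed(xact, leaf_id, key):
    """Insert conflicts if another xact's attached predicate covers the key."""
    return all(owner == xact or not pred(key)
               for owner, pred in attached[leaf_id])
```

Only the target leaf's predicates are evaluated, which is what keeps this cheaper than full predicate locking.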
Hybrid Locking
• Example
[Diagram: a B-tree where a search S:5-9 has attached its predicate “5-9” to the nodes it visited; a later insert I:9 checks the predicates attached to its target leaf]
Hybrid Locking
• In comparison to predicate locking:
  –   retains perfect accuracy
  –   fewer predicates to check (only those on leaf)
  –   search and delete don’t check predicates
  –   almost gradual expansion
• In comparison to key-range locking:
  – structure mods more expensive (replicating predicate attachments)
  – high implementation complexity (replicating ...):
    difficult/expensive to implement
Node Locking
• Simplified hybrid locking:
   – replace pred. attachments with node locks
   – block structure mods that would need to replicate node locks
• Comparison to hybrid locking:
   – diminished accuracy/concurrency
   – structure mods cheap
   – lower implementation complexity
AM Recovery
• Purpose:
  – bring AM structure into working condition
  – restore AM contents to reflect committed xacts
• Reminder: WAL recovery
  –   every update writes log record with redo and undo info
  –   rollback: apply undo portion in reverse chronological order
  –   restart, phase 1: apply redo portion in chronological order
  –   restart, phase 2: undo all uncommitted xacts
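The redo/undo protocol above in miniature (toy log records and database; real WAL recovery uses LSNs, page state comparisons, and CLRs):

```python
# Sketch of WAL-style rollback and restart over a toy key-value database.
# Each log record carries redo and undo information for one update.

def rollback(db, log, xact):
    """Apply undo portions in reverse chronological order."""
    for rec in reversed(log):
        if rec["xact"] == xact:
            key, old = rec["undo"]
            db[key] = old

def restart(log, committed):
    """Phase 1: redo everything in order; phase 2: undo losers."""
    db = {}
    for rec in log:                      # redo pass, chronological
        key, new = rec["redo"]
        db[key] = new
    for x in {r["xact"] for r in log} - committed:
        rollback(db, log, x)             # undo pass for uncommitted xacts
    return db
```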
GiST Recovery
• Keys to high concurrency:
  – separate updates into contents change and structure
    modification (operation, SMO)
  – “commit” SMOs separately and immediately
  – “logical” undo of contents change:
    re-locate leaf, compensate update, only lock data, not structure
• Logical undo:
  – cannot re-traverse tree structure, but can follow rightlinks
  – in Aries/IM, need tree-global latch to avoid this situation
Summary & Conclusion
• GiST concurrency and recovery techniques:
  –   Concurrency: adaptation of link technique
  –   Isolation: hybrid predicate/2-phase locking
  –   Recovery: WAL-based a la ARIES
  –   invisible to AM developer:
      can write industrial-strength AMs without knowledge of
      server internals
• Overview and Motivation
• High-Performance Extensible Indexing with
  Generalized Search Trees
• Concurrency and Recovery
• Access Method Performance Analysis
  –   Motivation
  –   Goals of Framework
  –   Analysis Framework with Examples
  –   amdb
  –   Conclusion
Motivation
• Access method (AM) design and tuning is a
  black art
   –   Which AM do I use to index my non-traditional data type?
   –   What are contributions of individual design ideas?
   –   How to explain performance differences between AMs?
   –   How to assess AMs on their own?
• Current practice of performance analysis is of little help:
   – aggregate runtime or I/O numbers provide no explanations
   – measuring semantic properties (e.g., spatial overlap) is
     domain-specific, not useful for general analysis framework
Goals of Analysis Framework (1)
• Measure performance in context of specific
  workload (data and queries)
  – recognize workload as part of analysis input
  – compare workloads by running against same AM
  – allows tuning of AM for specific workload
• Performance metrics characterize observed
  performance (I/Os)
  – independent of data or query semantics
  – reflects purpose of performance tuning
Goals of Analysis Framework (2)
• Metrics express performance loss
  – loss = difference between observed and optimal performance
  – shows potential for performance improvement
  – optimal performance obtained by executing workload in
    optimal tree
  – fixed point of reference allows AM to be assessed on its own
• Loss metrics for each query, node of input tree
  and structure-shaping aspects of the implementation
  – breakdown allows performance flaws to be traced back,
    facilitates assessment of individual design aspects
Analysis Framework: Overview (1)
• Performance-relevant aspects of tree:
  – Clustering: determines amount of data that query accesses
    beyond result set
  – Page Utilization: determines number of pages that the data occupies
  – SPs: excess coverage (covers more data space than is
    present in subtree) leads to extra traversals
• subdivide loss along those factors
  – break-down provides more detailed clues about cause of
    performance loss
Analysis Framework: Overview (2)
• Query Metrics:
  – run each query in actual and optimal tree to obtain
    performance loss
  – workload metrics: sum over all queries
• Node metrics:
  – node’s contribution to aggregate loss
  – obtained by computing per-node metrics for each query and
    aggregating over workload
• Implementation metrics:
  – measure how much pickSplit() and penalty() deteriorate tree
  – obtained by running sample splits and insertions
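One plausible way to compute such a per-query breakdown (illustrative formulas and counts, not amdb's exact definitions):

```python
# Sketch: subdividing a query's I/O loss (loss = observed - optimal).
# excess_ios: accesses to nodes whose SP matched the query but whose
# subtree held no results; util_factor: target / observed page utilization.
# These formulas are illustrative assumptions, not amdb's exact metrics.

def query_losses(actual_ios, optimal_ios, excess_ios, util_factor):
    total_loss = actual_ios - optimal_ios
    excess_loss = excess_ios
    util_loss = (actual_ios - excess_ios) * (1 - util_factor)
    clustering_loss = total_loss - excess_loss - util_loss
    return {"total": total_loss, "excess": excess_loss,
            "utilization": util_loss, "clustering": clustering_loss}
```

Summing these per-query dictionaries over the workload gives the workload metrics; attributing each access to the node that caused it gives the node metrics.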
amdb
• Analysis, visualization and debugging tool that
  implements analysis framework
• Accepts AMs written for libgist
• Available at http://gist.cs.berkeley.edu
amdb Features
• Tracing of insertions, deletions, and searches
• Debugging operations: breakpoints on node
  splits, updates, traversals, and other events
• Global and structural views of tree allow
  navigation and provide visual summary of structure
• Graphical and textual view of node contents
• Analysis of workloads and tree structure
Node Visualization

Node View
  Displays bounding
  predicates (SPs) and items
  within nodes.

  Highlights BPs on current
  traversal path.

Split Visualization
  Shows how SPs or data items
  are divided with PickSplit( )

Node Contents
  Provides textual
  description of node
        Leaf-Level Statistics
 Global View
    Provides summary
    of node statistics
    for entire tree

 Tree View
    Also displays node stats

  Total or per query

  I/O counts and
  overheads under
  various scenarios

Breakdown of losses
against optimal clustering
Construction of Optimal Tree
• Optimal tree: optimal clustering, optimal page
  utilization, no excess coverage
• Optimal leaf level:
   – partition items to minimize number of page accesses
   – partition size = target page utilization
   – workload can be modeled as hypergraph, approximate
     clustering with hypergraph partitioning algorithm
• Optimal internal levels:
   – cannot be constructed analogously
   – but: we assume target page utilization and no excess coverage
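With "partition size = target page utilization", the optimal leaf level gives a simple per-query lower bound on leaf accesses (illustrative sketch; the actual construction uses hypergraph partitioning):

```python
# Sketch: per-query lower bound on leaf accesses under optimal clustering.
# A query touching r result items, on pages of capacity c filled to target
# utilization u, needs at least ceil(r / (c * u)) leaf accesses.
import math

def optimal_leaf_accesses(result_size, page_capacity, target_util=1.0):
    slots = page_capacity * target_util
    return math.ceil(result_size / slots) if result_size else 0
```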
Performance Metrics - Sample Query
[Diagram: the sample query’s matching (X) and non-matching (-) items shown per page, once under optimal clustering and once in the actual tree; the extra page accesses are labeled clustering loss, excess coverage loss, and utilization loss]
Example 1: Unindexability Test
• Unindexable workload: aggregate leaf
  accesses in optimal tree take longer than
  sequential scan for each query
  – typical ratio of sequential/random I/Os: 14:1
• Conclusion: uniformly-distributed multi-
  dimensional point sets mostly unindexable
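The test above can be stated in a few lines (the 14:1 sequential-to-random ratio is from the slide; all other numbers are hypothetical inputs):

```python
# Sketch of the unindexability test. A workload is unindexable if, for
# every query, even the optimal tree's random leaf accesses cost more
# than one sequential scan of the whole table.
SEQ_TO_RANDOM = 14  # typical sequential:random I/O cost ratio (from slide)

def unindexable(optimal_leaf_ios_per_query, total_pages):
    seq_cost_in_random_ios = total_pages / SEQ_TO_RANDOM
    return all(ios > seq_cost_in_random_ios
               for ios in optimal_leaf_ios_per_query)
```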
Example 2: Comparison of SP designs
• Goal: assess effect of SP on nearest-neighbor AMs
  – R*-trees: bounding rectangles, SS-trees: bounding spheres,
    SR-trees: combination of BRs and BSs
• Experiment:
  – bulk-load 3 trees with identical data (only internal levels differ)
  – data: 8-dim points, arranged into sphere-shaped clusters
  – measure excess coverage loss
• Results:
  – R*- and SR-tree identical, SS-tree order of magnitude worse at
    leaf level
  – SR-tree paper came to contrary conclusion, because it compared
    insertion-loaded trees
Summary (1)
• Analysis framework
  – workload (= data and queries) is part of input
  – comparison of observed with optimal performance
  – metrics express performance loss, subdivided into clustering,
    utilization, excess coverage loss
• Advantages over current practice:
  – fixed point of reference allows AM to be assessed on its own
  – metrics more meaningful than aggregate I/O numbers,
    facilitate evaluation of individual design ideas
  – metrics are independent of data or query semantics
Summary (2)
• amdb
  – implements framework, along with visualization and
    debugging features
  – utilizes hypergraph partitioning to approximate optimal clustering
  – user experience verifies usefulness of framework and tool
• GiST-based AM extensibility is effective:
   – reduces implementation complexity w/o
     performance/robustness penalty
• Focus on template data structure leads to
  general solutions:
   – generally applicable concurrency & recovery protocols,
     performance analysis framework
Backup slides
Generalized Search Tree Overview
• Search:
  – traverse all subtrees for which Consistent(SP, qual) is true;
    return all leaf items for which Consistent(item, qual) is true
• Insert:
  – find leaf by following single path from root, guided by Penalty()
  – if leaf is full, call PickSplit() to determine split info, then
    perform split (recursively)
  – if SP needs to be updated, call Union(old SP, new item) to
    determine new SP
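These template algorithms can be sketched in a few lines (in-memory toy with a hypothetical node representation; real GiST operates on pages and invokes the extension methods through the interface):

```python
# Sketch of the GiST template algorithms over toy in-memory nodes.
# Internal nodes hold (sp, child) pairs; leaves hold data keys.

class Node:
    def __init__(self, entries, is_leaf):
        self.entries = entries
        self.is_leaf = is_leaf

def search(node, qual, consistent):
    """Traverse all subtrees whose SP is Consistent with the qualifier."""
    if node.is_leaf:
        return [k for k in node.entries if consistent(k, qual)]
    hits = []
    for sp, child in node.entries:
        if consistent(sp, qual):
            hits.extend(search(child, qual, consistent))
    return hits

def choose_leaf(node, key, penalty):
    """Follow a single root-to-leaf path, guided by Penalty()."""
    while not node.is_leaf:
        _, node = min(node.entries, key=lambda e: penalty(e[0], key))
    return node
```

Note that search may descend into several subtrees (unlike a B-tree), which is exactly why latch-coupling breaks down for GiSTs.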
Generalized Search Tree Overview
• GiST = simple abstraction of search tree
  – simple algorithms for basic functionality (search, insert,
    delete); easy to comprehend and extend
  – full control over performance-relevant properties of tree:
    clustering and page util (PickSplit() and Penalty()), SPs
  – captures essence of indexing:
    organize data into clusters, build directory structure to locate clusters
Extensibility Architecture: Overview
[Diagram: three layers - a datatype adapter (e.g., Rect), an AM extension (e.g., unordered page layout), and the GiST core (page management on top of storage, buffer, log, ...); the adapter and AM extension are invoked via UDF calls, while within the core only C function calls are used]
Extensibility Architecture: Overview (2)
• GiST core:
  –   built into server, written by ORDBMS vendor
  –   defines GiST interface, exports page mgmt interface
  –   implements iterator interface defined by server
  –   calls AM extension during insert(), delete(), getnext()
  –   fully encapsulates locking and logging (to be shown)
• AM extension:
  –   written by AM developer
  –   defines extension interface
  –   implements GiST interface defined by GiST core
  –   calls GiST’s page mgmt interface functions
Extensibility Architecture: Overview (3)
• Datatype adapter:
  – written by (datatype) domain expert
  – implements extension interface defined by AM extension
GiST Interface Overview
• Page-based, not entry-based
  – reduce # of UDF calls from 1 per entry to 1 per page
  – find_min_pen(): find page entry with smallest insertion penalty
  – search(): find matching entries on page
• Leave page format to AM extension:
  –   call AM extension fcts to update or extract data on pages
  –   insert(): add entry to page
  –   update_pred(): update predicate part of entry
  –   get_key(): extract key from page entry
  –   remove(): remove entries from page
• GiST works!
  – reduces AM implementation complexity considerably by
    factoring out common operation functionality (no locking
    and logging)
  – separation of functionality achieves tight integration of
    external AM with server: same degree of concurrency and
    recoverability as built-in AMs
  – improved GiST interface reduces # of costly UDF calls: more
    efficient than (naïve) built-in extensible AM
  – improved GiST interface increases flexibility: customizable
    page layout
Implementation Details
• Two mechanisms for synchronization in DBMS
• Latches:
  –   like mutexes, addressed physically
  –   used for physical mem-resident resources (buffers)
  –   no “latch manager”, no “deadlatch” detection
  –   cheap
• Locks
  –   addressed logically (lock name = integer)
  –   used for non-mem-resident resources (pages, tuples)
  –   lock manager (hash table) provides deadlock detection
  –   not as cheap
Implementation Details (cont.)
• use latches to synchronize access to index
  (physical concurrency control)
• use locks to synchronize access to data
  (logical concurrency control)
• want deadlock-free protocol for AMs: can use latches
• want to avoid node latches during I/Os for
  GiST: high concurrency
Node Deletion
• No latch-coupling possible, can’t avoid
  incorrect pointers
• must delay deletion until there are no more
  traversals with pointer to node
• how to:
  – traversal sets S-lock on node when reading pointer
  – deletion sets X-lock on node
  – not pretty: replicate S-locks on node split (or prevent them)
Implementing NSNs
• Global counter can become bottleneck:
  – read at each visited node
  – avoid by storing maximum child NSN in parent
• Global counter needs to be recoverable
  – end-of-log LSN is a good candidate
  – if not available, need to write log records
Transaction Isolation
• SQL Isolation Levels and how to achieve them
  –   read uncommitted: no locks
  –   read committed: instant-duration locks
  –   repeatable read (read set won’t change): xact-dur. locks
  –   serializable (same op, same result, no phantoms): ??
• Phantom Problem:
  – SELECT specifies logical range
  – need to prevent insertions into that range
  – can’t lock non-existing items
AM Recovery
• Recovery affects concurrency:
  need to lock what you log (until commit)
• Keys to high concurrency:
  – separate updates into contents change and structure
    modification (operation, SMO)
  – “commit” SMOs separately and immediately
  – “logical” undo of contents change:
    re-locate leaf, compensate update, only lock data, not structure
• Short-duration, recoverable, atomic, but
  outside of transaction
• “Committed” through nested top action

[Log record sequence: ... update left, update right, CLR ...]
Logical Undo
• Example for compensating action: re-traverse
  tree and delete leaf item to undo insert op
• Recovery affects concurrency, part II:
  at restart, no latches to protect inconsistent structures
• in GiST, cannot re-traverse tree, but can follow rightlinks
• in B-tree (ARIES/IM), need tree-global latch to
  avoid this situation
     Debugging Operations
Stepping Controls

Tree View
 Shows structural
 organization of index.

 Highlights current
 traversal path during
 debugging steps.

Console window
  Displays search results,
  debugging output, and
  other status info.

Breakpoint Table
 Defines and enables
 breakpoints on events
     Subtree Predicate Statistics

  Views highlight
  nodes traversed
  by query

Query breakdown in
terms of “empty” and
required I/O.

Excess Coverage
  Overheads due to
  loose/wide BPs
• GiST-based AM extensibility is effective:
   – reduces implementation complexity w/o
     performance/robustness penalty
• Focus on template data structure leads to
  general solutions:
   – generally applicable concurrency & recovery protocols,
     performance analysis framework
• Open problems:
   – automatic reorganization
   – automatic SP refinement
   – automatic AM design?
Motivation^2 (1)
• Before we launch into the details...
   – why worry about databases?
   – what are ORDBMSs again?
   – access methods...?
Motivation^2 (2)
• Why worry about databases
  –   declarative access
  –   transactions - simplicity for applications
  –   concurrency - 10,000s of simultaneous users
  –   recovery - restore data after failure
  –   flexibility - separate data from application
  –   scalability - automatic parallelization
Motivation^2 (3)
• What are ORDBMSs again?
  – “object-relational DBMS”
  – combines object-oriented concepts with relational DBMS
  – in practice: extensibility of DBMS with new data types
  – now supported by most big vendors (IBM, Informix, Oracle),
    standardized in SQL3
Motivation^2 (4)
• Example: Geospatial data in Informix
  – implementation of user-defined type “GeoObject” provided
    by dynamic library
  – register type and functions with system:
    create opaque type GeoObject (...);
    create function Contains (GeoObject, GeoObject)
    returns boolean
    external name “<library path>”;
  – store inside DBMS:
    create table customers (..., loc GeoObject, ...);
    insert into customers (...) values (..., “<x, y>”, ...);
  – query data:
    select * from customers where Contains(“<area>”, loc);
Motivation^2 (5)
• Access methods...?
  – aka indices, e.g., B-trees
  – more generally: persistent data structure for associative access
  – B-tree reminder:
      •   height-balanced tree structure
      •   data at leaves in sorted order
      •   directory structure in internal nodes
      •   search proceeds from root to leaf
      •   predictable performance
