Document Sample
Select Powered By Docstoc
					     Select Operation- disk access
             and Indexing

*Some info on slides from Dr. S. Son, U. Va
Disk access
• DBs traditionally stored on disk
• Cheaper to store on disk than in memory
• Costs for:
  – Seek time, latency, data transfer time

• Disk access is page (block) oriented
• 2 - 4 KB page size
Access time
• Access time is the time to randomly access a
• System initially determines if page in
  memory buffer (page tables, etc.)
• Large disparity between disk access and
  memory access
 Select operation using table scan
• If read the entire table for a select – table
• Improvements to table scan of disk:
  – Parallel access
  – Sequential prefetch
Parallel access
• Linear search - all data rows read in from
  – I/O parallelism can be used (Raid)
     • multiple I/O read requests satisfied at the same
     • stripe the data across different disks
  – Problems with parallelism?
     • must balance disk arm load to gain maximum
     • requires the same total number of random I/O's,
       but using devices for a shorter time
Sequential prefetch I/O
• Retrieve one disk page after another (on
  same track) - typically 32
• Seek time no longer a problem
• Must know in advance to read 32
  successive pages
• Speed up of I/O by a factor of ≈10 (500
  I/O's per second vs. 70)
Access time
• Seek time – average 8-10 ms, as low as 4
  ms server
• Latency time – 2-4 ms, as low as 1 or less
• Data transfer time – .4-2 ms
Access time

  RIO        Seq. Prefetch
  .010        .010           Seek - disk arm to cylinder
  .002        .002           Latency - platter to sector
  .0015       .048           Data transfer - Page
  .0135       .060           1 page vs. 32 pages
  .43* seconds vs .060 seconds 32 pages for both

* .00135X32=.43
       Access time for fast I/O

  RIO        Seq. Prefetch
  .004        .004          Seek - disk arm to cylinder
  .001        .001          Latency - platter to sector
  .0005       .016          Data transfer - Page
  .0055       .021          1 page vs. 32 pages
  .176* seconds .021 seconds 32 pages for both

* .0055X32=.176
      Organizing disk space
• How to store data so minimize access time
  if read the entire table?
Disk allocation
• Disk Resource Allocation for Databases (DBA
  has control)
• Goal – contiguous sectors on disk - want data
  as close together as possible to minimize seek
• No standard SQL approach, but general way to
  deal with allocation
• Some OS allow specification of size of file and
  disk device
                 Types of Files
• Heap files (unordered – sequential)
• Sorted files (ordered – sort key)
• Hash files (hash key, hash function)
   – Internal, external, file expansion
   – B+-trees

• Raid technology (parallelizing)
• Storage area networks – ERP (enterprise resource
  planning) and DW (data warehouses)
   – Storage devices configured as nodes in network –
     can attach/detach
Tablespace is:
• Allocation medium for tables and indexes for
  ORACLE, DB2, etc.
• Can put >1 table in a table space if accessed
• Tablespace corresponds to 1 or more OS files
  and can span disk devices
• Usually relations cannot span disk devices
           DB storage structures

DB                      CAP Database
Table-              tspace 1                   system
OS files     fname1      fname2                fname3
Tables       Cust agents prods orders orindx
Segments     data data data data index
• ORACLE DB's contain several tablespaces,
  including one called system -
     data description + indexes + user-defined tables

• default tablespace given to each user
• if multiple tablespaces - better control over load
• can take some disk space off-line
• Relation composed of 1 or more extents
• Extent - contiguous storage on disk
• when data segment or index segment first
  created, given an initial extent from
  tablespace 10KB (5 pages)
• if need more space given next contiguous
• Can increase the size by a positive %
  (cannot decrease)
  – initial n - size of initial extent
  – next n - size of next
  – max extents - maximum number of extents
  – min extents - number of extents initially
  – pct increase n - % by which next extent
   grows over previous one
       Oracle create tablespace
Create table
• Create table statement - can specify tablespace,
  no. of extents
  – When initial extent full, new extent allocated
  – pctfree - determine how much space in a page can be
    used for inserts of new rows
     • if pctfree =10%, inserts stop when page is 90% full
             » Uses another page
  – pctused – determines when new inserts start again
     • if fall below certain percentage of total, default pctused =
                   pctfree + pctused < 100
     For more info: http://download-

• Row layout on each disk page

             1 2 3… N                    Row N Row N-1 … Row 1
Header info Row directory   free space      data rows

• Header -
• Row directory – row number and page byte offset
    – Row number is row number in page – book calls it slot#
        • Page byte offset – with varchar, row size not constant
• To identify a particular row use RID (RowID) –
              page #, slot # [file#]
   slot# is number in row directory (logical #)
 Differences in DBMSs re: rows
• ROWID can be retrieved in ORACLE but
  not DB2 (violates relational model rule)
     • rows can be split between pages (row record
     • Can have rows from multiple tables on same page,
       more info
• DB2, no splitting, entire row moved to new
  page, need forwarding pointer
 Select operation using Indexes
• Alternative to table scan
                Binary Search
• “Find all students with gpa > 3.0”
  –   If data is in sorted file, do binary search to find
      first such student, then scan to find others.
  –   Cost of binary search can be quite high.
Binary Search
• Binary search on disk
  – optimal for comparisons - not optimal for
   disk-based look-up
  – must keep data in order
  – may be reading values from same page at
   different times
•   Instead: Create an `index’ file
•   Keyed access retrieval method
•   index is a sorted file - sorted by index key
•   index entries:

          index key      pointer (RID)

• pointer is RID
• index resides on disk, partially memory resident when
                     Index File

         k1 k2               kN            Index File

Page 1     Page 2   Page 3        Page N   Data File
          Tree-based index
• B-tree – balanced tree
• Nodes point to data (RIDs) and also point
  to other nodes in tree

• Most commonly used index structure type in DBs today
• Based on B-tree
• Good for equality and range searches
• B+ tree: dynamic, adjusts gracefully under inserts and
• Used to minimize disk I/O
• available in DB2, ORACLE also has hash cluster, Ingres
  has heap structure, B-tree, isam (chain together new
       Structure of B+ Trees
• leaf level pointers to data (RIDs)
• the remaining are directory (index) nodes
  that point to other index nodes Fig.

                             Index Entries
                             (Direct search)

                             Data Entries
                             ("Sequence set")
    Characteristics of B+ Tree
• Insert/delete at log F N cost; keep tree height-
  balanced. (F = fanout, N = # leaf pages)
• Minimum 50% occupancy (except for root).
  Each node contains d <= m <= 2d entries. The
  parameter d is called the order of the tree.
• Supports equality and range-searches efficiently
Cost of I/O for B+-tree
•    Assume number of entries in each index node
    fits on one page - one node is one page
•    If tree with depth of 3, 3 I/Os to get pointer to
    data B+-tree structured to get most out of
    every disk page read
•    Read in index node, can make multiple probes
    to same page if remains in memory
    –   likely since frequent access to upper -level nodes of
        actively used B+-trees
            B+ Trees in Practice
• Typical order: 100.
• Typical fill-factor: 2/3 full (66.6%)
   – average fanout = 133
• Typical capacities:
   –   Height 4: 1334 = 312,900,700 records
   –   Height 3: 1333 = 2,352,637 records
• Can often hold top levels in buffer pool:
   –   Level 1 =      1 page = 8 Kbytes
   –   Level 2 =    133 pages = 1 Mbyte
   –   Level 3 = 17,689 pages = 133 MBytes
•   B+ tree has a directory structure that allows
    retrieval of a range of values efficiently
    – search for leftmost index entry Si such that
      X <= Si
•   Index entries always placed in sequence by
    value - can use sequential prefetch on index
•   Index entries shorter than data rows and
    require proportionately less I/O
• Balancing of B+-trees - insert, delete
•  Nodes usually not full
•  utilities to reorganize to lower disk I/O
•  Most systems allow nodes to become
  depopulated- no automatic algorithm to
• Average node below root level 71% full in
  active growing B+-trees
• Insert/delete
           Inserting into B+ Tree
• Find correct leaf L.
• Put data entry onto L.
   –   If L has enough space, done!
   –   Else, must split L (into L and a new node L2)
        • Redistribute entries evenly, copy up middle key.
        • Insert index entry pointing to L2 into parent of L.
• This can happen recursively
   –   To split index node, redistribute entries evenly, but
       push up middle key. (Contrast with leaf splits.)
• Splits “grow” tree; root split increases height.
   –   Tree growth: gets wider or one level taller at top.
   –   Algorithm
            Deleting from B+ tree
• Start at root, find leaf L where entry belongs.
• Remove the entry.
   –   If L is at least half-full, done!
   –   If L has only d-1 entries,
         • Try to re-distribute, borrowing from sibling (adjacent node
            with same parent as L).
         • If re-distribution fails, merge L and sibling.
• If merge occurred, must delete entry (pointing to L or
  sibling) from parent of L.
• Merge could propagate to root, decreasing height.
• Algorithm
Duplicate key values
• Duplicate key values in index
• leaf nodes have sibling pointers
• but a delete of a row that has a heavily
  duplicated key entails a long search through the
  leaf-level of the B+-tree
• Index compression - with multiple duplicates

| header info | PrX keyval RID RID ... RID | PrX keyval RID…RID|
      where PrX is count of RID values
Create Index
       multiple columns
       storage - initial extents, etc.
       percent free default = 10
            % of each page left unfilled
       free page (1 free page for every n
            index pages)
 Can control % of B+-tree node pages left
 unfilled when index created, refers to initial
   Why use an index?
• If use a select (or join) on the same attribute
• want a way to improve performance - use
   – For example:
          Select from Employee
          where ssn = 333445555

• Instead of reading the entire file until ssn is
  found, it would be nice if we had a pointer to that
Types of indexes (textbook)
• Primary index - key field is a candidate key
  (must be unique) – data file ordered by key field
• Clustering index - key field is not unique, data
  file is ordered – all records with same values on
  same pages
• Secondary index - non-clustering index – data
  file not ordered
  – First record in the data page (or block) is called the
    anchor record
     • Non-dense index - pointer in index entry points to anchor
     • Dense index - pointer to every record in the file

Non-clustered indexes
• Non-clustered index (secondary index)
  – key field is a non ordering field - it is not used
    to physically order the data file
  – the index itself is still ordered
  – How many non-clustering indexes can a table

Clustered Indexes
• Placing rows on disk in order by some common index
  key value
       (remember the index itself is always sorted)
   – Clustered index - (primary and clustering)
   – key field is an ordering field - all the data with the
     same values for the key field physically placed on the
     same pages on the disk.
   – If primary key, data ordered on a page by key field
   – Usually assume disk pages themselves also clustered
     on the disk
   – How many clustering indexes can a table have?
• Efficiency advantage
       read in a page, get all of the rows with
             the same value
• clustering is useful for range queries
        e.g. between keyval1 and keyval2
• http://www.dba-
• Can only cluster table by 1 clustering index at a time
• In SQL server
    – creates clustered index on PK automatically if no other clustered
      index on table and PK nonclustered index not specified
•   In DB2 –
    – if the table is empty, rows sorted as placed on disk
    – subsequent insertions not clustered, must use REORG
• In Oracle-
    – Cluster index – now available for PK in 10g
    – Define a cluster to create cluster index for 2 tables
Indexes vs. table scan
• To illustrate the difference between table
  scan, secondary index (non clustered) and
  clustered index
  Assume 10 M customers, 200 cities
  2KB/page, row = 100 bytes, 20 rows/page
           Select *
           From Customers
           Where city = Birmingham
  1/200 * 10M if assume selectivity = 1/200
     50,000 customers in a city
      Rules of Thumb for I/O
• Random I/O – 160 pages/second, .00625
• Sequential prefetch I/O – 1600
  pages/second, .000625

Will discuss later:
• List prefetch I/O – 400 pages/second,
Table Scan
Table Scan - read entire table

If used an random I/O
      10,000,000/20 = 500,000 pages
            500,000*R = 3125

Instead, it makes more sense to use:
sequential prefetch read 32 pages at a time
             500,000*S = 312
Clustering Index
Clustering Index –
All entries for B'ham clustered on same pages
50,000/20 = 2500 pages (with 20 rows per page)
Assume: 3 upper nodes of the tree
Assume: 1000 index entries per leaf node, read
  50000/1000 index pages

(3 + 50 + 2500) * ? = 2,553 * ?

If assume ?=R, then 2,553*R=16
Makes more sense to assume (3+50+2500) * S = 1.6
Secondary Index
Secondary Index–
In the worst case 1 entry for B'ham per page
50,000 pages (10M/200)
3 upper nodes of the tree
Assume 1000 index entries per leaf node, read
  50000/1000 index pages

  (3 + 50 + 50,000)*? = 50, 053 * ?

If assume ?=R then 50,053*R=312.8
Better to assume (3+50)*S + 50,000*R=312.53
List Prefetch
Create list of data pages to access
Pages not necessarily in contiguous sequential
system orders pages to minimize disk I/O
   E.g. elevator algorithm for disk request

  50, 053 * L = 125.1
  Best to assume (3+50)*S+50,000*L=125.03
% Free
• Redo the previous calculations assuming
  relations created with 50% free option
Creating Indexes
When determining what indexes to create consider:
 workload - mix of queries and frequencies of requests
        20% of requests are updates, etc.
      can create lots of indexes but:
           cost to create
           initial load time high if a large table
           index entries can become longer and
            longer as multiple columns included
Multiple Indexes
• More than one index on a relation

  – e.g. class - one index, gender - one index
Composite Index
• One index based on more than one attribute
    Create Index index_name on Table (col1, col2,... coln)

•   Composite index entry - values for each
          class, gender
          entry in index is: C1, C2, RID
• What would B+ tree look like?
• Thread results from a fork of a computer program, usually contained
  inside a process
   – Multiple threads inside same process, share resources, address
      space and memory
   – Processes do not share these resources
   – Thread have their own stack, copy of registers, PC and local
      thread storage
• Some languages support multiple threads, but do not execute at the
  same time
   – Kernel threads can run concurrently
              Parallel computing
• Form of computation in which many calculations carried out
   – Divide large problem into smaller ones
   – data, instruction level and task parallelism
• Dominant paradigm in the form of multicore processors
• Parallel computer – shared or distributed memory
• Parallel program difficult to write due to
   – Software bugs, race conditions
   – Communication and synchronization
• Multiple processing elements working concurrently
   – Single computer with multiple processors, networked computers,
     special hardware, etc.
• Multithreading
   – Model to allow multiple threads within single process
   – Can execute in parallel on multiprocessor system
• Process, kernel thread, user thread, fiber (cooperatively
  scheduled, can run in any thread in the same process)
• Subtasks in a parallel program are called threads
   – Lightweight version of threads – fibers
   – Bigger versions – processes
• Parallel computing – model of computation
  – Can utilize processes, multithreading to

Shared By: