Docstoc

Minimum Spanning Trees

Document Sample
Minimum Spanning Trees Powered By Docstoc
					                                             

             CS 728
    Advanced Database Systems
                Chapter 16
         Database File Organization:
    Unordered, Ordered, and Hashed Files of
                   Records




                       1                     
          Disk I/O Model of Computation                       

       Disk I/O is equivalent to one read or write operation
        of a single block

       It is very expensive compared with what is likely to
        be done once the block gets in main memory
          one random disk I/O ~ about 1,000,000 machine
            instructions in terms of time

       Cost for computation that requires secondary storage
        is computed only by disk I/Os.



                                2                             
                Disk Storage Devices                       

       Preferred secondary storage device for high storage
        capacity and low cost.

       Data stored as magnetized areas on magnetic disk
        surfaces.

       A disk pack contains several magnetic disks
        connected to a rotating spindle.

       Disks are divided into concentric circular tracks
        on each disk surface.
          Track capacities vary typically from 4 to 50
           Kbytes or more
                              3                            
            Disk Storage Devices (cont.)                                

       A track is divided into smaller blocks or sectors
          because it usually contains a large amount of
            information

       The division of a track into sectors is hard-coded on the
        disk surface and cannot be changed.
          One type of sector organization calls a portion of a track
            that subtends a fixed angle at the center as a sector.

       A track is divided into blocks.
          The block size B is fixed for each system.
              Typical block sizes range from B=512 bytes to B=4096 bytes.
           Whole blocks are transferred between disk and main
            memory for processing.

                                     4                                  
   Disk Storage Devices (cont.)   




                5                 
           Disk Storage Devices (cont.)                       

   A read-write head moves to the track that contains the block
    to be transferred.
      Disk rotation moves the block under the read-write head
        for reading or writing.
   A physical disk block (hardware) address consists of:
      a cylinder number (imaginary collection of tracks of same
        radius from all recorded surfaces)
      the track number or surface number (within the cylinder)
      and block number (within track).


   Reading or writing a disk block is time consuming because of
    the seek time s and rotational delay (latency) rd.
   Double buffering can be used to speed up the transfer of
    contiguous disk blocks.


                                6                             
   Disk Storage Devices (cont.)   




                    7             
   Typical Disk Parameters   




                   8         
                   Pages and Blocks                        

       Data files decomposed into pages (blocks)
         fixed size piece of contiguous information in
           the file
         sizes range from 512 bytes to several kilobytes


       block is the smallest unit for transferring data
        between the main memory and the disk.

       Address of a page (block):
         (cylinder#, track# (within cylinder), sector#
          (within track)


                               9                           
                          Pages and Blocks                            




                Track

                                                           Gap
                  Sector


    One track     1          2              3          4         ...

                            1 page/block = 4 Sectors

                                      10                              
                           Page I/O                             

       Page I/O --- one page I/O is the cost (or time needed) to
        transfer one page of data between the memory and the disk.

       The cost of a (random) page I/O =
          seek time + rotational delay + block transfer time


       Seek time
          time needed to position read/write head on correct
           track.
       Rotational delay (latency)
          time needed to rotate the beginning of page under
           read/write head.
       Block transfer time
          time needed to transfer data in the page/block.


                                 11                             
                          Page I/O                               

       Average rotational delay (rd)
         rd = ½ * (1/p) min = (60*1000)/(2*p) msec
         OR
         rd = ½ * cost of 1 revolution
            = ½ * (60*1000/p) msec
         where
             p is speed of disk rotation (how many revolutions
              per minute - rpm)
       Example
          Speed of disk rotatioon is p = 3600 rpm
          60 revolutions/sec
          1 rev. = 16.66 msec. (1 second = 1000 msec)
          rd = 8.33 ms

                                12                               
                        Page I/O                          

       Transfer rate (tr)
         tr = track size / cost of one revolution
            = track size / (60*1000/p) in msec

       Bulk transfer rate (btr)
         btr = (B/(B+G)) * tr   bytes/msec
         Where B is the block size in bytes
         G is interblock gap size in bytes


       Block transfer time (btt)
         btt = B / tr       not    taking into acount G
         btt = B / btr             taking into acount G
                             13                           
                        Page I/O                     

       Example:
          Track size = 50 KB and p = 3600 rpm
          Block size B = 3KB = 3000 bytes


       tr = (50*1000)/(60*1000/3600) = 3000 bytes/msec

       btt = B / tr = 3000/3000 = 1 msec




                             14                      
                          Page I/O                          

       Average time for reading/writing n consecutive
        pages that are in the same track or cylinder = s +
        rd + n * btt

       Average time for reading/writing consecutively n
        noncontigues pages/blocks that are in the same
        cylinder = s + n * (rd + btt)




                               15                           
                      An Example                     

       A hard disk specifications:
         4 platters, 8 Surfaces, 3.5 Inch diameter
         213 = 8192 tracks/surface
         28 = 256 sectors/track
         29 = 512 bytes/sector
         Average seek time s = 25 ms
         Rotation rate rd = 3600 rpm = 60 rps
         1 rev. = 16.66 msec
         Transfer rate
         tr = 1 KB in 0.117 ms
         tr = 1 KB in 0.130 ms with gap


                              16                     
                       An Example                     

       What is the total capacity of this disk
         8 GB (8*213*28*29=233)


       How many bytes does one track hold?
         256 sectors/track*512 bytes/sector = 128KB


       How many blocks per track?
         one block = 4096 bytes = 8 sectors (4096/512)
         256/8 = 32 blocks/track




                               17                     
                       An Example                           

       How long does it take to access one block?

       One block = 4096 bytes
          8 sectors = 4096/512
       Rotation rate r
          1 rev. = 16.66 msec.
       Time to access 1 sector (s + r/2 + tr/(secters/KB)
          25 + (16.66/2) + .117/2 = 33.3885 ms.


       time to access 1 block
          time to access the first sector of the block +
           time to access the subsequent 7 sectors.
                               18                           
                       An Example                           

       T = 25 + (16.66/2) + (0.117/2) * 1 + (0.13/2) *7 =
           33.3885 + 0.455 ms = 33.8435ms

       Compare to one sector access time: 33.3885 ms


           1        2         3          ...      8


                   1 block = 8 Sectors



                              19                            
                        Buffering                           

       A buffer
         is a contiguous reserved area in main memory
           available for storage of copies of disk blocks.
         to speed up the processes.


       For a read command
          the block from disk is copied into the buffer.


       For a write command
          the contents of the buffer are copied into the
           disk.


                              20                            
      Accessing Data Through RAM Buffer 

                             RAM
                                           Block transfer


    Application               Buffer




                                                       block
                  Record
                  transfer             Page frames



                              21                              
                   File Organization                       

       The database is stored as a collection of files.

       Each file is a sequence of records.

       A record is a sequence of fields.

       Records are stored on disk blocks.

       A file can have fixed-length records or variable-
        length records.



                               22                          
                   File Organization                          

       Fixed length records
          Each record is of fixed length. Pad with spaces.




       Variable length records
         different records in the file have different sizes.
         Arise in database systems in several ways:
             different record types in a file.
             same record type with (variable-length fields,
              repeating field, or optional fields)



                                23                            
   File Organization   




          24           
                         Fixed-Length Records              

       Insertion:
          Store record i starting from byte
           n  (i – 1), where n is the size of
           each record.

       Deletion of record i:
          Packed format:
              move records i + 1, . . ., n to i, . . . ,
               n–1
              OR
              move record n to i
            Unpacked format (do not move
             records, but)
              link all free records on a free list
              OR
              Use bitmap vector

                                                25         
                         Free Lists                           

       Store the address of the first deleted record in the
        file header.
       Use this first record to store the address of the
        second deleted record, and so on.




                               26                             
       Page Formats: Fixed Length Records 

       Record id = <page id, slot #>.


    Slot 1                              Slot 1
    Slot 2                              Slot 2
                              Free
                ...           Space
                                                    ...
    Slot N                              Slot N

                                        Slot M
                       N                         1 . . . 0 1 1M
                           number                M ... 3 2 1      number
              PACKED       of records        UNPACKED, BITMAP     of slots




                                     27                               
       Variable-Length Records Representation 

       Byte-String representation
         Attach an end-of-record () control character to
          the end of each record
         Difficulty with deletion and growth


       Slotted-page header contains:
          number of record entries
          location and size of each record
          end of free space in the block




                               28                           
               Slotted Page Structure                      

       Records can be moved around within a page to
        keep them contiguous with no empty space
        between them
          entry in the header must be updated.
          Pointers should not point directly to record -
           instead they should point to the entry for the
           record in header.




                              29                           
          Fixed-Length Representation                    

       Reserved Space
         can use fixed-length records of a known
           maximum length
         unused space in shorter records filled with a
           null or end-of-record symbol.




                             30                          
          Fixed-Length Representation                    

       List Representation by Pointers
         A variable-length record is represented by a list
           of fixed-length records, chained together via
           pointers.
         Can be used even if the maximum record length
           is not known




                             31                          
          Fixed-Length Representation                            

       Disadvantage: space is wasted in all records except
        the first in a a chain.
          Solution is to allow two kinds of block in file:
             Anchor block: contains the first records of chain
             Overflow block: contains records other than those
              that are the first records of chairs.




                                32                               
                    Blocking Factor                         

       Blocking Factor (bfr) - the number of records that
        can fit into a single block.
          bfr = ⌊B/R⌋
             B : Block size in bytes
             R: Record size in bytes
       Example:
          Record size R = 100 bytes
          Block Size B = 2,000 bytes
          Thus the blocking factor bfr = 2000/100 = 20


       The number of blocks b needed to store a file of r
        records:
          b = r/bfr blocks

                                33                          
         Spanned & Unspanned Records                         

       A block is the unit of data transfer between disk
        and memory.
       Unspanned records:
          A record is found in one and only one block.
             records do not span across block boundaries.
          Used with fixed-length records having B  R
       Spanned records:
          Records are allowed to span across block
           boundaries.
          Used with variable-length records having R  B


       In variable-length records, either organization can
        be used.
                                34                           
            Placing File Records on Disk                         

       A file header or file descriptor contains information about
        a file (e.g., the disk address, record format descriptions,
        etc.)




                                 35                              
            Allocating File Blocks on Disk                

       The physical disk blocks that are allocated to
        hold the records of a file can be contiguous,
        linked, or indexed.

       In contiguous allocation, the file blocks are
        allocated to consecutive disk blocks.

       In linked allocation, each file block contains a
        pointer to the next file block.

       In indexed allocation, one or more index blocks
        contain pointers to the actual file blocks.

                               36                         
          Organization of Records in Files                           

       Heap/Unordered/Pile File Organization
         a record can be placed anywhere in the file where there is
          space, or at the end
         for full file scans or frequent updates
         Data unordered (unsorted)


       Sorted/Ordered File Organization
          store records sorted in order, based on the value of the
           search key of each record
          Need external sort or an index to keep sorted


       Hashing File Organization
         a hash function computed on some attribute of each record
         the result specifies in which block of the file the record
          should be placed
                                    37                               
               Heap File Organization                     

       Records are placed in the file in the order in which
        they are inserted. Such an organization is called a
        heap file.
          Insertion is at the end
             takes constant time O(1) (very efficient)
           Searching
             requires a linear search (expensive)
           Deleting
             requires a search, then delete

           Select, Update and Delete
             take b/2 time (linear time) in average
             b is the number of blocks

                                 38                       
               Heap File Organization                     

       For a file of unordered fixed-length records using
        unspanned blocks and contiguous allocation, it is
        straightforward to access any record by its position
        in the file.
          If the records are numbered 0,1,2, …, r-1 and
          The records in each block are numbered 0,1,2,
            …, f-1, where f is the blocking factor
          The the i-th record of the file is located in
             Block i/f and in the
             (i mod f)-th record in that block




                                 39                       
              Heap File Organization                      

       A Heap file allows us to retrieve records:
          by specifying the rid, or
          by scanning all records sequentially
       Accessing a record by its position does not help
        locate a record based on a search condition.




                              40                          
   File Stored as a Heap File                

    666666   MGT123    F1994   4.0
    123456   CS305     S1996   4.0   page 0
    987654   CS305     F1995   2.0


    717171   CS315    S1997 4.0
    666666   EE101    S1998 3.0      page 1
    765432   MAT123   S1996 2.0
    515151   EE101    F1995 3.0

    234567   CS305     S1999   4.0
                                     page 2

    878787   MGT123    S1996   3.0
                     41                      
           Sequential File Organization                  

       Suitable for applications that require sequential
        processing of the entire file
       The records in the file are ordered by a search-key




                              42                         
             Files of Ordered Records   

       Some blocks of an
        ordered (sequential)
        file of EMPLOYEE
        records with NAME
        as the ordering key
        field.




                              43        
   File Stored as a Sorted File              

    111111   MGT123   F1994 4.0
    111111   CS305    S1996 4.0      page 0
    123456   CS305    F1995 2.0


    123456   CS315    S1997 4.0
    123456   EE101    S1998 3.0      page 1
    232323   MAT123   S1996 2.0
    234567   EE101    F1995 3.0

    234567   CS305     S1999   4.0
                                     page 2

    313131   MGT123    S1996   3.0
                     44                      
            Sequential File Organization                                  

       Insertion is expensive
          records must be inserted in the correct order
                locate the position where the record is to be inserted
                if there is free space insert there
                if no free space insert the record in an overflow block
                In either case, pointer chain must be updated
           Insert takes lg(b) plus the time to re-organize records.
           b is the number of blocks

       Deletion
          use pointer chains


       Searching
          very efficient (Binary search)
          This requires lg(b) on the average


                                       45                                 
   Sequential File Organization   




                46                
                Average Access Times                      

       The following table shows the average access time
        to access a specific record for a given type of file




                              47                          
                            Hashed Files                                       

       Hashing for disk files is called External Hashing
       The file blocks are divided into M equal-sized buckets, numbered
        bucket0, bucket1, ..., bucketM-1.
            Typically, a bucket corresponds to one (or a fixed number
             of) disk block.
       One of the file fields is designated to be the hash key of the file.
       The record with hash key value K is stored in bucket i, where i=h(K),
        and h is the hashing function.
       Search is very efficient on the hash key.
       Collisions occur when a new record hashes to a bucket that is already
        full.
            An overflow file is kept for storing such records.
            Overflow records that hash to each bucket can be linked
             together.



                                       48                                      
                     Hashed Files (cont.)                                        
       There are numerous methods for collision resolution, including the
        following:
            Open addressing: Proceeding from the occupied position
             specified by the hash address, the program checks the
             subsequent positions in order until an unused (empty)
             position is found.
            Chaining: For this method, various overflow locations are
             kept, usually by extending the array with a number of
             overflow positions. In addition, a pointer field is added to
             each record location. A collision is resolved by placing the
             new record in an unused overflow location and setting the
             pointer of the occupied hash address location to the address
             of that overflow location.
            Multiple hashing: The program applies a second hash
             function if the first results in a collision. If another collision
             results, the program uses open addressing or applies a third
             hash function and then uses open addressing if necessary.


                                       49                                        
   Hashed Files (cont.)   




            50            
                 Hashed Files (cont.)                     

       To reduce overflow records, a hash file is typically
        kept 70-80% full.
       The hash function h should distribute the records
        uniformly among the buckets
          Otherwise, search time will be increased
           because many overflow records will exist.
       Main disadvantages of static external hashing:
          Fixed number of buckets M is a problem if the
           number of records in the file grows or shrinks.
          Ordered access on the hash key is quite
           inefficient (requires sorting the records).

                              51                          
 Hashed Files - Overflow Handling   




                    52              

				
DOCUMENT INFO