Two-Phase Multi-way Merge Sort Examples


a.   How many sublists will be required for sorting the relation if 50 MB (1 MB = 2^20 bytes) is available for
     buffering?

         1 block = 4096 bytes = 2^12 bytes, and each tuple uses 100 bytes, so we can store

                   ⌊4096 / 100⌋ = ⌊40.96⌋ = 40 tuples in one block

         The buffer space is 50 MB = 50 × 2^20 bytes, which is enough to store

                   50 × 2^20 ÷ 2^12 = 50 × 2^8 = 12,800 blocks

         The file requires 250,000 blocks and each time we load a buffer we create a sublist; this means we
         must create ⌈250,000 ÷ 12,800⌉ = ⌈19.53⌉ = 20 sublists

         (Note that the last sublist contains only 250,000 - 19 × 12,800 = 6800 blocks)
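
The arithmetic in part (a) can be checked with a short Python sketch. The variable names are illustrative only; the block size, tuple size, buffer size and file size are the figures used above.

```python
import math

BLOCK_BYTES = 4096          # 2^12 bytes per block
TUPLE_BYTES = 100           # bytes per tuple
BUFFER_BYTES = 50 * 2**20   # 50 MB of buffer space
FILE_BLOCKS = 250_000       # size of the relation in blocks, as stated above

tuples_per_block = BLOCK_BYTES // TUPLE_BYTES        # floor: tuples do not span blocks
buffer_blocks = BUFFER_BYTES // BLOCK_BYTES          # blocks that fit in the buffer
sublists = math.ceil(FILE_BLOCKS / buffer_blocks)    # one sublist per buffer load
last_sublist_blocks = FILE_BLOCKS - (sublists - 1) * buffer_blocks

print(tuples_per_block, buffer_blocks, sublists, last_sublist_blocks)
```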


b.   How long will it take to sort the relation if all blocks are stored randomly?

     We shall break this problem down into the following parts
         • The time to load the buffers prior to sorting them, and to write the sorted buffer back to disk

                   This is the time to read 250,000 blocks into memory and then write them back to disk. In an
                   earlier example we noted that the average access time for a block is 10.9 msecs, so to access
                   250,000 blocks requires

                        250,000 × 10.9 msecs = 2,725,000 msecs = 2725 secs = 45.42 mins

                   Thus to read and then write 250,000 blocks will take 2 × 45.42 = 90.83, or about 91 mins
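
As a quick check of the random-access estimate, the sketch below repeats the calculation in Python. The names are illustrative, and the 10.9 msec average access time is the figure quoted from the earlier example.

```python
AVG_ACCESS_MS = 10.9     # average random access time per block (from the earlier example)
FILE_BLOCKS = 250_000

one_pass_ms = FILE_BLOCKS * AVG_ACCESS_MS    # touch every block once, at random
one_pass_min = one_pass_ms / 1000 / 60       # msecs -> secs -> mins
phase1_io_min = 2 * one_pass_min             # one read pass plus one write pass

print(f"{one_pass_min:.2f} mins per pass, {phase1_io_min:.2f} mins for read + write")
```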

         •    The total time to sort the loaded buffers

                   Let us assume we use an n log n algorithm (e.g. quicksort). Since the buffer can store 12,800 (=
                   50 × 2^8) blocks and since each block stores 40 tuples, then sorting each buffer requires sorting

                        50 × 2^8 × 40 = 2000 × 2^8 tuples ≈ 2^11 × 2^8 = 2^19 tuples

                   With an n log n algorithm we then estimate the number of operations (actually comparisons) at

                        2^19 × log2(2^19) = 19 × 2^19 ≈ 2^5 × 2^19 = 2^24 operations

                   Sorting 20 (≈ 2^5) sublists then requires approximately 2^5 × 2^24 = 2^29 operations

                   If we assume 60 nsec memory and a 1 GHz processor, then we estimate 61 nsecs per operation, so
                   sorting all of the sublists requires on the order of (approximating 1 nsec = 10^-9 secs by 2^-30 secs)

                        2^29 × 61 × 2^-30 ≈ 2^29 × 2^6 × 2^-30 = 2^5 secs (about 32 secs)

                   so that sorting the sublists effectively only adds seconds to the overall sorting time, and hence can
                   be ignored.
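
The estimate above rounds every quantity up to a power of two. A sketch of the same calculation without that rounding is shown below; it assumes the same 61 nsecs per comparison and the same n log2 n comparison count, and the variable names are illustrative. It gives a smaller figure, still only a matter of seconds, so the conclusion is unchanged.

```python
import math

tuples_per_buffer = 12_800 * 40     # 512,000 tuples per sublist (about 2^19)
sublists = 20
ns_per_op = 61                      # assumed: 60 nsec memory access + 1 nsec CPU cycle

# n log2 n comparisons per sublist, as in the estimate above
comparisons_per_sublist = tuples_per_buffer * math.log2(tuples_per_buffer)
total_seconds = sublists * comparisons_per_sublist * ns_per_op * 1e-9

print(f"about {total_seconds:.1f} secs of CPU time for all sublists")
```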

         •    The time for Phase 2.

                   The time for phase 2 is the same as that for reading 250,000 random blocks and then writing them
                   (again randomly), since comparisons among the sublist buffer values will be negligible. In the
                   first part of this problem we saw that this took 91 mins.


     Total: We estimate the total sorting time as approximately 91 mins + 91 mins = 182 mins
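
For readers who want to see the two phases these timings refer to, here is a minimal in-memory sketch in Python. It is only an illustration, not a disk-based implementation: the function name and the `buffer_size` parameter are hypothetical, a Python list stands in for the relation, and `heapq.merge` stands in for the phase 2 merge that keeps one block of each sublist in memory.

```python
import heapq

def two_phase_merge_sort(records, buffer_size):
    """Sort `records` in runs of at most `buffer_size` items, then merge the runs."""
    # Phase 1: fill the "buffer", sort it, and keep it as one sorted sublist.
    sublists = [
        sorted(records[i:i + buffer_size])
        for i in range(0, len(records), buffer_size)
    ]
    # Phase 2: merge all sublists in a single pass, always taking the
    # smallest remaining head element among the sublists.
    return list(heapq.merge(*sublists))

data = [42, 7, 19, 3, 88, 1, 56, 23, 5]
print(two_phase_merge_sort(data, buffer_size=4))
```

With `buffer_size=4` the nine values form three sublists (two full and one partial), mirroring the 19 full and 1 partial sublists in the example.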
c.   How long will it take to sort the relation if all blocks are stored in consecutive cylinders (in the same region)?

     Let us assume the blocks will be stored in consecutive cylinders in region 2.

     Since there are 256 sectors per track and each block requires 8 sectors we can store 256 ÷ 8 = 32 blocks per track.
     Since the Megatron 747 has 8 disk surfaces, this means each cylinder can hold 32 × 8 = 256 blocks.

     Thus, in order to store the file we shall need ⌈250,000 ÷ 256⌉ = ⌈976.5625⌉ = 977 cylinders
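
The cylinder count can be checked with a few lines of Python. The constant names are illustrative; the disk geometry is the one quoted above.

```python
import math

SECTORS_PER_TRACK = 256
SECTORS_PER_BLOCK = 8
SURFACES = 8              # one track per surface in each cylinder
FILE_BLOCKS = 250_000

blocks_per_track = SECTORS_PER_TRACK // SECTORS_PER_BLOCK
blocks_per_cylinder = blocks_per_track * SURFACES
cylinders = math.ceil(FILE_BLOCKS / blocks_per_cylinder)   # the last cylinder is partly filled

print(blocks_per_track, blocks_per_cylinder, cylinders)
```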

     Phase 1: We shall once again ignore the time to sort the sublists, so phase 1 effectively consists of loading the
     50MB memory buffer to create 20 sorted sublists and then to store each of these sublists.

     We note that the 50MB buffer can store 12,800 ÷ 256 = 50 cylinders. We also note that we can ignore
     rotational delay when we initially load the buffer to create the sublists since the order in which the tuples are
     read into the buffer does not matter

     We see the events of the first part of phase 1 as

         load 19 full buffers with 50 cylinders each time + load 1 partial buffer with 26 full and one partial cylinder

     Here:
         loading a full buffer requires: 1 random seek, 49 adjacent seeks, 12800 block transfers
         loading a partial buffer requires: 1 random seek, 26 adjacent seeks, 1 rotational delay, 6800 block transfers

     Thus, the first part of phase 1 will require

         20 random seeks, 957 seeks to an adjacent cylinder, 1 rotational delay, and 250000 block transfers

     The time for this is

              20(6.46 msecs) + 957(1.002 msecs) + 4.17 msecs + 250,000(0.26 msecs)
         =    129.2 + 958.914 + 4.17 + 65,000 msecs
         =    66,092.284 msecs = 66.092 secs
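
The event counts and the read time for phase 1 can be reproduced with the sketch below. The constant names are illustrative, and the timing figures are the ones used in the calculation above.

```python
RANDOM_SEEK_MS = 6.46
ADJACENT_SEEK_MS = 1.002
ROTATIONAL_DELAY_MS = 4.17
TRANSFER_MS = 0.26            # transfer time per block

FULL_LOADS = 19               # buffer loads of 50 full cylinders each
CYLINDERS_PER_LOAD = 50
PARTIAL_FULL_CYLINDERS = 26   # the last load: 26 full cylinders plus one partial
BLOCKS_PER_LOAD = 12_800
PARTIAL_BLOCKS = 6_800

random_seeks = FULL_LOADS + 1
adjacent_seeks = FULL_LOADS * (CYLINDERS_PER_LOAD - 1) + PARTIAL_FULL_CYLINDERS
transfers = FULL_LOADS * BLOCKS_PER_LOAD + PARTIAL_BLOCKS

read_ms = (random_seeks * RANDOM_SEEK_MS
           + adjacent_seeks * ADJACENT_SEEK_MS
           + ROTATIONAL_DELAY_MS          # one delay, for the partial cylinder
           + transfers * TRANSFER_MS)

print(random_seeks, adjacent_seeks, transfers, f"{read_ms / 1000:.3f} secs")
```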

     Storing the sublists will require essentially the same amount of time, except each time we write to a cylinder we must
     allow for a rotational delay since in this case the order in which we write the sublists back to disk will matter. This
     means we will have 976 additional rotational delays, which will add 976(4.17 msecs) = 4069.92 msecs = 4.070 secs
     to the above time.

     Thus, the total time for phase 1 is 66.092 secs + 66.092 secs + 4.070 secs = 136.254 secs = 2.27 mins.
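
A short sketch of the phase 1 total follows, taking the 66,092.284 msec read time and the 4.17 msec rotational delay from the calculation above. Multiplying 976 by 4.17 directly gives about 4,070 msecs of extra delay, and the total still rounds to 2.27 mins. The names are illustrative.

```python
READ_MS = 66_092.284          # phase 1 read time, from the calculation above
ROTATIONAL_DELAY_MS = 4.17
CYLINDERS = 977

extra_delays = CYLINDERS - 1  # one rotational delay is already included in READ_MS
extra_ms = extra_delays * ROTATIONAL_DELAY_MS
phase1_ms = 2 * READ_MS + extra_ms      # read pass + write pass + extra delays
phase1_min = phase1_ms / 1000 / 60

print(f"extra: {extra_ms:.2f} msecs, phase 1: {phase1_ms / 1000:.3f} secs = {phase1_min:.2f} mins")
```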

     Phase 2: The time for phase 2 will be the same as that for the case with randomly stored blocks since we cannot
     predict when we will write the output buffer or fill a sublist buffer. We saw that this took 91 mins.

     Total: We estimate the total sorting time to be approximately 91 mins + 2.27 mins = 93 mins


d.   What is the maximum number of these tuples that can be sorted using the two-phase multi-way sort with the
     amount of buffer space available and the given block size? How much storage would they require?

     We begin by noting that the maximum number of sublists depends directly on the number of buffers we
     have available for merging. Since each sublist must have a buffer of at least one block we will maximize the
     number of sublists if each buffer is the size of a block, in this case 4K = 2^12 bytes. The number of 1 block
     buffers that can be carved out of 50MB of total buffer space is

         50 × 2^20 ÷ 2^12 = 50 × 2^8 = 12,800 buffers


CSCI 430 -- Spring, 2001                                Assignment 2                                              Page - 2
    Since 1 of these must serve as the output buffer for merging the sublists, however, the total number of buffers
    available for sublists is 12,800 – 1 = 12,799, which is also the maximal number of sublists we can have.

    In creating each of these sublists, however, we use all of the available buffer memory to hold tuples for sorting.
    Since this memory can hold 12,800 blocks and since each block holds 40 tuples, this means that each of the sublists
    can have a maximum of 12,800 × 40 = 512,000 tuples.

    Thus, the maximum number of tuples that we can sort is 12,799 × 512,000 = 6,553,088,000 tuples.

    At 100 bytes per tuple, this means we must have 655,308,800,000 bytes of disk space available just to hold the
    tuples.
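
The capacity calculation in part (d) can be verified with the sketch below. The names are illustrative; the sizes are those given in the problem.

```python
BUFFER_BYTES = 50 * 2**20   # 50 MB of buffer space
BLOCK_BYTES = 2**12         # 4K blocks
TUPLES_PER_BLOCK = 40
TUPLE_BYTES = 100

buffers = BUFFER_BYTES // BLOCK_BYTES              # one-block buffers in memory
max_sublists = buffers - 1                         # one buffer is reserved for output
tuples_per_sublist = buffers * TUPLES_PER_BLOCK    # phase 1 uses the whole buffer
max_tuples = max_sublists * tuples_per_sublist
disk_bytes = max_tuples * TUPLE_BYTES

print(f"{max_tuples:,} tuples, {disk_bytes:,} bytes")
```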



