final ROLL NO NAME CS 543 – Data Warehousing by panniuniu


									ROLL NO.:                                 NAME:

                             CS 543 – Data Warehousing
                                        Final Exam
                                       May 28, 2006
                          Duration: 90 minutes (9.00 to 10.30 AM)

(1) Please write legibly. Unreadable answers will NOT be graded. Use a BALL-POINT
    and write in a readily readable font size.
(2) Write in the space provided for each question only. If you need to continue a question
    in a different space, have it signed by the instructor/invigilator. Otherwise, it will not
    be graded.
(3) This is a closed notes/books exam. You may use a calculator.
(4) Distribute your time judiciously as some questions might be more involved than
(5) Make all usual assumptions if something is not stated in the question explicitly. If you
    feel your answer depends on an unstated assumption, then state it explicitly.
(6) There are 20 questions in this exam. All questions are worth 5 points.

1. In Teradata, the maximum possible number of indexes that can be created on a table
       a. 1
       b. 2
       c. 8
       d. 9
       e. 32
       f. 33

2. The following is NOT a data warehouse performance enhancement technique:
      a. Keeping the block percent used parameter as high as possible
      b. Removing referential integrity checks
      c. Indexing significant non-primary key attributes
      d. Partitioning the database tables
      e. Creating repeating groups in tables instead of multiple table rows
      f. None of the above

3. List the advantages of RAID technology in data warehousing. Which RAID type is
   the most popular?

CS 543 (Sp 05-06) – Dr. Asim Karim                                                  Page 1 of 6
4. Suppose a table has 4 attributes (A, B, C, and D) with cardinalities (number of
   distinct values) 10, 100, 256, and 1000, respectively. (i) On which attribute would B-
   tree index be most effective, and (ii) on which attribute would bit-mapped index be
   most effective? Assume approximately uniform distribution of the values among the
   rows of the table. Explain your reason briefly.

5. Refer to question 4 above. Suppose there are 10,000 rows in the table, each being
   identified by an 8-byte row ID. Estimate the size in bytes of the bit-mapped index
   structure created on attribute C.

Questions 6 to 9 are based on the following description:
Consider a customer table containing 10 million rows and having the attributes: Name,
Gender, Age, Occupation, Education, ResidentState, Status. Suppose indexes are created
on attributes: Occupation, ResidentState, and Status. Assume each row takes 128 bytes
space and data is accessed in blocks of size 32 Kbytes.

6. In the worst case (but using the indexes), how many blocks of data need to be
   accessed to answer the query: What is the average age of customers who live in
   Punjab and have a ‘Preferred’ status? Suppose ResidentState = Punjab has a
   selectivity of 20% and Status = Preferred has a selectivity of 2%.

CS 543 (Sp 05-06) – Dr. Asim Karim                                              Page 2 of 6
7. In the worst case (but using the indexes), how many blocks of data need to be
   accessed to answer the query: How many customers living in Punjab and having the
   preferred status have education ‘Graduate’? Assume the data given in question 6
   above and Education = Graduate has selectivity of 5%.

8. Consider the query: How many customers live in Punjab? Compute the maximum
   selectivity of value Punjab to ensure benefit from using the index (in the worst case)
   over a complete scan of the table.

9. Compute the storage requirement for the customer table (i) now and (ii) 1 year from
   now if the customer base is expected to grow by 10%. Assume indexes require 5% of
   the base table’s size.

CS 543 (Sp 05-06) – Dr. Asim Karim                                               Page 3 of 6
10. What is an iceberg CUBE? Explain briefly.

11. Given the following base relation, create the CUBE and represent it as another

Attribute A Attribute B Value
A1            B2             5
A1            B1             2
A2            B1             4
A2            B3             2
A3            B2             5
(Do this on the back-side of the previous page)

12. Refer to question 11 above. If a multidimensional array is used to represent the
    CUBE, how many cells would be needed, and what would be sparsity of the arrays
    (percent of filled cells).

13. The following statement is not true regarding BUC algorithm:
       a. It can compute the entire CUBE
       b. It can compute the iceberg CUBE
       c. It is more efficient for sparse CUBEs
       d. The partitioning function is the most time consuming
       e. None of the above

CS 543 (Sp 05-06) – Dr. Asim Karim                                              Page 4 of 6
14. List at least two different techniques for performing the portioning operation in BUC

15. What is the difference between ETL and ELT? Be clear and concise.

16. Draw a load time Vs. percent rows changed graph showing the performance (load
    time) of full refresh and incremental load.
(Do this on the back-side of the previous page)

17. List at least 3 recommended tips for improving the load performance during an initial
    full load of the data warehouse.

CS 543 (Sp 05-06) – Dr. Asim Karim                                              Page 5 of 6
18. Indicate true (T) or false (F) in front of the following statements:
        a. Aggregates should not replace detailed data in a data warehouse
        b. The slice-and-dice operation does not involve numerical computations (like
        c. As the sparsity of a CUBE increases the size of the iceberg-CUBE decreases
        d. Column replication may be used instead of pre-joins to improve query
            performance with lesser space overhead
        e. New rows are inserted into a degenerate dimension table almost as frequently
            as in the fact table

19. Suppose a mobile phone company’s data warehouse has a star schema with customer,
    account, time, and location dimensions. The base fact table records the duration and
    cost of every call made. If there are 20,000 accounts (unique phone numbers) and on
    average 40% of them make 5 or more calls per day and the remaining 60% make 1 or
    more call per day, what is the maximum number of days by which the size of the fact
    table will increase by 1,000,000 rows?

20. Refer to question 19 above. Suppose the company creates two snapshot fact tables to
    provide quick access to total cost incurred by each customer and account, respectively,
    at the end of each month. Draw the new data warehouse schema showing all fact
    tables and their attributes and keys, the dimension tables with keys (no need to show
    attributes), and the relationship between the tables (with cardinality).
(Do this on the back-side of the previous page)

CS 543 (Sp 05-06) – Dr. Asim Karim                                              Page 6 of 6

To top