; Partition and Aggregation
Documents
Resources
Learning Center
Upload
Plans & pricing Sign in
Sign Out
Your Federal Quarterly Tax Payments are due April 15th Get Help Now >>

Partition and Aggregation

VIEWS: 1 PAGES: 42

  • pg 1
									Partitioning and Aggregation
               Group 15:
     He Dajun,Lin Li, and Yu Xinguo

       CS6203 -- Advanced Topics
    in Database Management Systems

   Data Mining, Data Warehousing and OLAP
Contents of the Presentation

     Introduction & Hardware
      Partitioning (He Dajun)
     Database Partitioning (Lin Li)
     Aggregation (Yu Xinguo)
Introduction

   Data Warehouse?
          Data Warehouse is more than just data, it
           is also the process involved in getting that
           data from source to table and in getting
           the data from table to analysis
        Architecture of Data Warehouse
       Data          Information             Decision


Operational                                      Data
  Data        L                          Q      dippers

                             Summary
                               Info              OLAP
 External         Detailed
  Data        M    Info                  M       tools
                             Meta data


                      W        M
Architecture of Data Warehouse
       Load Manager
             Extract and load the data
       Warehouse Manager
             Clean and Transform data
             Backup and archive data

             Generate any new aggregate that may require
       Query Manager
             Manage Queries, and direct them to appropriate data
              source
Problems with Data Warehouse
  Huge Data
     Management
     Backup
     Daily process
  Approach
      Partitioning
    Partitioning
Partitioning ?
     A single logical entity is partitioned into a
     multiple sub-entity
Why Partitioning?
    Assist the management of the data.
    Assist backup and recovery.
    Improve query performance ---by completely
     removing large parts of fact table from possible
     set of data that needs to be scanned.
                Partitioning

   Types of Partitioning
    Software partitioning v.s. Hardware partitioning
     Horizontal Partitioning v.s.Vertical Partitioning
   Other issues
   Hardware Architecture

SMP: (Symmetric Multi-processing)
      Called shared everything
MMP: (Massively Parallel Processing)
      Called shared nothing
   CPUs

System Bus

Disk
controllers




              SMP
             High-speed connect




   nodes


Local disk



                     MPP
Hardware partitioning
        Maximizing processing power and
            avoiding bottlenecks
           Shared-everything architecture maximize
            processing power at all time because all disk and
            memory are fully shareable
           Shared-nothing architecture will maximize
            processing power available only if the sub-
            processes are capable of running on different
            node.
           Hardware partitioning is used in Shared-nothing
            architecture.
         Hardware partition ( Striping data across MPP Nodes)


Query
                SQ     Sub-queries run on all nodes
controlle
subquery
r               SQ          SQ              SQ         SQ
node


                            Interconnect
I/O
controller
  Active data
                disk       disk           disk        disk
             Hardware partition ( Horizontal Hardware Partition)


 Query
                 SQ        Sub-queries run on all nodes
 controller
 Sub-query       SQ             SQ              SQ          SQ
 node


                                Interconnect
I/O
controller

disk           Month 1        Month 2         Month 3     Month 4
 Fact
 table
Software Partitioning
Advantages
     Improve Query Performance
          Speed up queries
     Simplify the Management of Data
      Warehouse
          Backup/restore, move, creating &
           maintaining indexes
Cost
     Increases number of physical
      tables, hence higher operational cost
Partitioning Strategies
 Horizontal Partitioning

 Vertical Partitioning

Note: Usually apply partitioning on fact
tables, rarely on dimension tables.
                  Horizontal Partitioning
            1. Partition by time into equal segments
                                      Year 1      Year 2

                                      Jan Sales   Jan Sales

Disadvantage:            Sales        Feb Sales   Feb Sales
Some partitions         Records          .           .
                     (500 millions)      .           .
might store                              .           .

more records                          Dec Sales   Dec Sales
than others
                   Horizontal Partitioning
    2. Partition by time into different sized segments
                                        Year 1      Year 2
Advantage:                             Dec Sales   Dec Sales
Detailed
                                       Nov Sales   Nov Sales
information
remains online            Sales        Oct Sales   Oct Sales
                         Records
                      (500 millions)   Quarter     Quarter
Disadvantage:
Partitioning
profile change
                                       Half Year   Half Year
regularly, hence
increases
operational cost
                     Horizontal Partitioning
    3. Partition on a different dimension, e.g. region

                                             Ladies’ Wear


                              Sales          Men’s Wear
Note:
                             Records
Do not partition          (500 millions)          .
on a dimensional                                  .

grouping that is                                  .

likely to change
within the life of                         Sports Equipment
the data
warehouse
                   Horizontal Partitioning
                   4. Partition by size of table
                         Create a new partition when the table
                          is about to exceed a predetermined
Disadvantage:             size, not based on any dimension
Complex to
manage, requires
metadata to              Use this strategy when there is no
identify what is          clear basis for partitioning the fact
stored in each            table
partition
Horizontal Partitioning
Partitioning Dimension Tables

    The basis is unlikely to be time, but
    a grouping of dimension

    Rare case
                     Horizontal Partitioning
Advantage:              Round Robin Partitions
Simpler to
automate table       Archive the oldest partition and reuse the old
management           partition for the latest data
facilities with in
the data
                            Sales 1999                      Sales 2000
warehouse                  1st Quarter                     1st Quarter
Disadvantage:
Metadata is                Sales 1999      Archive          Sales 1999
required to               2nd Quarter      Sales 1999      2nd Quarter
manage table                               1st
                           Sales 1999                       Sales 1999
partition as the                           Quarter
                          3rd Quarter                      3rd Quarter
information
period they cover           Sales 1999                      Sales 1999
changes                    4th Quarter                     4th Quarter
frequently
                      Vertical Partitioning
                            1. Normalization
                     Common fields are collapsed into
                     single rows to reduce space usage

                                           CustID Name      Amt Tran_ID

                                            001    May      500      #123
CustID Name   Amt Tran_ID   Tran_Desc
                                            002    Ann      1000     #159
      May     500    #123   Deposit         003    Don      6000     #159
001
002   Ann     1000   #159   Withdraw        004    Mel      8000     #123
003   Don     6000   #159   Withdraw        005    Zoe      300      #123
004   Mel     8000   #123   Deposit
005   Zoe     300    #123    Deposit        Tran_ID      Tran_Desc

                                            #123         Deposit
                                            #159         Withdraw
                   Vertical Partitioning
                       2. Row Splitting
                           CustID Name Amt Tran_ID Tran_Desc

                           001    May    500    #123     Deposit
                           002    Ann    1000   #159     Withdraw
                           003    Don    6000   #159     Withdraw
                           004    Mel    8000   #123     Deposit
Splits less used           005    Zoe    300    #123     Deposit
column of
information          CustID Name Amt     CustID Tran_ID Tran_Desc
out from a
                     001    May   500     001     #123     Deposit
frequently                  Ann   1000            #159     Withdraw
                     002                  002
accessed fact        003    Don   6000    003     #159     Withdraw
table                004    Mel   8000    004     #123     Deposit
                     005    Zoe   300     005     #123      Deposit
       Aggregation
   What is an aggregation?
   Why aggregate?
   How to aggregate?
   Which aggregations to create?
What Is An Aggregation(1)
Pid1     24/09   9:30    1kg     S$2
Pid2     24/09   9:50    1.5kg   S$15
Pid1     24/09   11:10   2kg     S$3.8
Pid1     24/09   12:45   3kg     S$5.1
Pid2     24/09   14:25   1kg     S$11
  Day 24/09 Sales
  Pid1    6kg      S$10.90
  Pid2    2.5kg    S$26
What Is An Aggregation(2)

 An aggregation is a
 summary of a data table. It
 is made for providing cost-
 effective query
 performance.
       What Is An Aggregation(3)
      Facts Dimensions
                                     Income

                          Consumer   Spend
Location      Basket
            Transaction
                          Product    Location


                 Time
  What Is An Aggregation(4)
   Facts Dimensions Aggregation
                                             Income

                                  Consumer   Spend
Location       Basket
             Transaction
                                 Meat-       Location
                               Purchasing
                               Consumer
            Time     Product
    Why Aggregate

   Objective is to provide cost-
    effective query performance
   Spot the trends within data
Design Summary Tables
   1. Dimension to be aggregated
   2. Aggregation of multiple values
   3. Aggregating multiple facts
   4. The level of aggregation
   5. The extent of denormalisation
   6. Time
   7. Index the summary table
Step 1. Dimension To
        Be Aggregated(1)
        Facts Dimensions




    Location      Basket      Consumer
                Transaction


                   Time
Step 1. Dimension To
        Be Aggregated(2)
        Facts Dimensions




    Location      Week32   Consumer
                   Sales
Step 2. Aggregation of
       Multiple Values
   A segmentation of week32 sale
   aggregation with multiple values

PID  sum      max      min     mean
Pid1 245     48       28       35
Pid2 350     70       38       50
      Step 3. Aggregating
              Multiple Facts
               About A Product
   number of days where the product ran
    out of stock,
   shrinkage over the whole week,
   forecast number of sales for that week,
   returns throughout the week.
Sep 4. Level of Aggregation
     Aggregate at a level below the level
      identified
       Ifobtaining aggregation at level
        identified on the fly is not expensive
       Can drill down to aggregated level

     Multiple Level Aggregation
Step 5. Embedding Dimension
        Data in the Summary
  A segmentation of week-sale aggregation
  with dimension data---dept

 PID    sum max min mean dept
 Pid1 350     70     38    50       toy
Step 6. Design Time into the
        Summary table
       Storing a physical date
       Storing an offset
         E.g. In a month segmentation of a
          daily summary. Day offset 0-30.
       Storing date range
Step 7. Index the Summary
        Tables
   Consider using high level of
    indexation on the summary tables

     Direct as many queries as possible
      to the summary tables
Which Summaries to Create
      For a new data warehouse
        Identify summary tables by
           understanding user interesting
      For using data warehouse
          Identify summary tables by statistics
      Size of summary tables
Q&A
Bye! Bye !

								
To top