Partitioning and Aggregation Group 15: He Dajun,Lin Li, and Yu Xinguo CS6203 -- Advanced Topics in Database Management Systems Data Mining, Data Warehousing and OLAP Contents of the Presentation Introduction & Hardware Partitioning (He Dajun) Database Partitioning (Lin Li) Aggregation (Yu Xinguo) Introduction Data Warehouse? Data Warehouse is more than just data, it is also the process involved in getting that data from source to table and in getting the data from table to analysis Architecture of Data Warehouse Data Information Decision Operational Data Data L Q dippers Summary Info OLAP External Detailed Data M Info M tools Meta data W M Architecture of Data Warehouse Load Manager Extract and load the data Warehouse Manager Clean and Transform data Backup and archive data Generate any new aggregate that may require Query Manager Manage Queries, and direct them to appropriate data source Problems with Data Warehouse Huge Data Management Backup Daily process Approach Partitioning Partitioning Partitioning ? A single logical entity is partitioned into a multiple sub-entity Why Partitioning? Assist the management of the data. Assist backup and recovery. Improve query performance ---by completely removing large parts of fact table from possible set of data that needs to be scanned. Partitioning Types of Partitioning Software partitioning v.s. Hardware partitioning Horizontal Partitioning v.s.Vertical Partitioning Other issues Hardware Architecture SMP: (Symmetric Multi-processing) Called shared everything MMP: (Massively Parallel Processing) Called shared nothing CPUs System Bus Disk controllers SMP High-speed connect nodes Local disk MPP Hardware partitioning Maximizing processing power and avoiding bottlenecks Shared-everything architecture maximize processing power at all time because all disk and memory are fully shareable Shared-nothing architecture will maximize processing power available only if the sub- processes are capable of running on different node. Hardware partitioning is used in Shared-nothing architecture. Hardware partition ( Striping data across MPP Nodes) Query SQ Sub-queries run on all nodes controlle subquery r SQ SQ SQ SQ node Interconnect I/O controller Active data disk disk disk disk Hardware partition ( Horizontal Hardware Partition) Query SQ Sub-queries run on all nodes controller Sub-query SQ SQ SQ SQ node Interconnect I/O controller disk Month 1 Month 2 Month 3 Month 4 Fact table Software Partitioning Advantages Improve Query Performance Speed up queries Simplify the Management of Data Warehouse Backup/restore, move, creating & maintaining indexes Cost Increases number of physical tables, hence higher operational cost Partitioning Strategies Horizontal Partitioning Vertical Partitioning Note: Usually apply partitioning on fact tables, rarely on dimension tables. Horizontal Partitioning 1. Partition by time into equal segments Year 1 Year 2 Jan Sales Jan Sales Disadvantage: Sales Feb Sales Feb Sales Some partitions Records . . (500 millions) . . might store . . more records Dec Sales Dec Sales than others Horizontal Partitioning 2. Partition by time into different sized segments Year 1 Year 2 Advantage: Dec Sales Dec Sales Detailed Nov Sales Nov Sales information remains online Sales Oct Sales Oct Sales Records (500 millions) Quarter Quarter Disadvantage: Partitioning profile change Half Year Half Year regularly, hence increases operational cost Horizontal Partitioning 3. Partition on a different dimension, e.g. region Ladies’ Wear Sales Men’s Wear Note: Records Do not partition (500 millions) . on a dimensional . grouping that is . likely to change within the life of Sports Equipment the data warehouse Horizontal Partitioning 4. Partition by size of table Create a new partition when the table is about to exceed a predetermined Disadvantage: size, not based on any dimension Complex to manage, requires metadata to Use this strategy when there is no identify what is clear basis for partitioning the fact stored in each table partition Horizontal Partitioning Partitioning Dimension Tables The basis is unlikely to be time, but a grouping of dimension Rare case Horizontal Partitioning Advantage: Round Robin Partitions Simpler to automate table Archive the oldest partition and reuse the old management partition for the latest data facilities with in the data Sales 1999 Sales 2000 warehouse 1st Quarter 1st Quarter Disadvantage: Metadata is Sales 1999 Archive Sales 1999 required to 2nd Quarter Sales 1999 2nd Quarter manage table 1st Sales 1999 Sales 1999 partition as the Quarter 3rd Quarter 3rd Quarter information period they cover Sales 1999 Sales 1999 changes 4th Quarter 4th Quarter frequently Vertical Partitioning 1. Normalization Common fields are collapsed into single rows to reduce space usage CustID Name Amt Tran_ID 001 May 500 #123 CustID Name Amt Tran_ID Tran_Desc 002 Ann 1000 #159 May 500 #123 Deposit 003 Don 6000 #159 001 002 Ann 1000 #159 Withdraw 004 Mel 8000 #123 003 Don 6000 #159 Withdraw 005 Zoe 300 #123 004 Mel 8000 #123 Deposit 005 Zoe 300 #123 Deposit Tran_ID Tran_Desc #123 Deposit #159 Withdraw Vertical Partitioning 2. Row Splitting CustID Name Amt Tran_ID Tran_Desc 001 May 500 #123 Deposit 002 Ann 1000 #159 Withdraw 003 Don 6000 #159 Withdraw 004 Mel 8000 #123 Deposit Splits less used 005 Zoe 300 #123 Deposit column of information CustID Name Amt CustID Tran_ID Tran_Desc out from a 001 May 500 001 #123 Deposit frequently Ann 1000 #159 Withdraw 002 002 accessed fact 003 Don 6000 003 #159 Withdraw table 004 Mel 8000 004 #123 Deposit 005 Zoe 300 005 #123 Deposit Aggregation What is an aggregation? Why aggregate? How to aggregate? Which aggregations to create? What Is An Aggregation(1) Pid1 24/09 9:30 1kg S$2 Pid2 24/09 9:50 1.5kg S$15 Pid1 24/09 11:10 2kg S$3.8 Pid1 24/09 12:45 3kg S$5.1 Pid2 24/09 14:25 1kg S$11 Day 24/09 Sales Pid1 6kg S$10.90 Pid2 2.5kg S$26 What Is An Aggregation(2) An aggregation is a summary of a data table. It is made for providing cost- effective query performance. What Is An Aggregation(3) Facts Dimensions Income Consumer Spend Location Basket Transaction Product Location Time What Is An Aggregation(4) Facts Dimensions Aggregation Income Consumer Spend Location Basket Transaction Meat- Location Purchasing Consumer Time Product Why Aggregate Objective is to provide cost- effective query performance Spot the trends within data Design Summary Tables 1. Dimension to be aggregated 2. Aggregation of multiple values 3. Aggregating multiple facts 4. The level of aggregation 5. The extent of denormalisation 6. Time 7. Index the summary table Step 1. Dimension To Be Aggregated(1) Facts Dimensions Location Basket Consumer Transaction Time Step 1. Dimension To Be Aggregated(2) Facts Dimensions Location Week32 Consumer Sales Step 2. Aggregation of Multiple Values A segmentation of week32 sale aggregation with multiple values PID sum max min mean Pid1 245 48 28 35 Pid2 350 70 38 50 Step 3. Aggregating Multiple Facts About A Product number of days where the product ran out of stock, shrinkage over the whole week, forecast number of sales for that week, returns throughout the week. Sep 4. Level of Aggregation Aggregate at a level below the level identified Ifobtaining aggregation at level identified on the fly is not expensive Can drill down to aggregated level Multiple Level Aggregation Step 5. Embedding Dimension Data in the Summary A segmentation of week-sale aggregation with dimension data---dept PID sum max min mean dept Pid1 350 70 38 50 toy Step 6. Design Time into the Summary table Storing a physical date Storing an offset E.g. In a month segmentation of a daily summary. Day offset 0-30. Storing date range Step 7. Index the Summary Tables Consider using high level of indexation on the summary tables Direct as many queries as possible to the summary tables Which Summaries to Create For a new data warehouse Identify summary tables by understanding user interesting For using data warehouse Identify summary tables by statistics Size of summary tables Q&A Bye! Bye !
Pages to are hidden for
"Partition and Aggregation"Please download to view full document