Document Sample

UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY TIES443 Lecture 3 Data Warehousing Mykola Pechenizkiy Course webpage: http://www.cs.jyu.fi/~mpechen/TIES443 November 3, 2006 Department of Mathematical Information Technology University of Jyväskylä TIES443: Introduction to DM Lecture 3: Data Warehousing 1 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Topics for today • What is a data warehouse? • Data warehouse architectures – Conceptual DW Modelling – Physical DW Modelling • A multi-dimensional data model – Data Cubes • OLAP – 12 Codd’s rules for OLAP – Main OLAP operations • New buzzwords • Data warehouse implementation and maintenance TIES443: Introduction to DM Lecture 3: Data Warehousing 2 1 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Data Warehouse A decision support DB that is maintained separately from the organization’s operational databases. Why Separate Data Warehouse? • High performance for both systems – DBMS— tuned for OLTP • access methods, indexing, concurrency control, recovery – Warehouse—tuned for OLAP • complex OLAP queries, multidimensional view, consolidation. • Different functions and different data – Missing data: Decision support requires historical data which operational DBs do not typically maintain – Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources – Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled TIES443: Introduction to DM Lecture 3: Data Warehousing 3 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Three-Tier Architecture Monitor OLAP Analysis other Metadata & Server sources Integrator Query/Reporting Operational Extract DBs Transform Data Serve Load Refresh Warehouse Data Mining ROLAP Server Data Marts Data Sources Data Storage OLAP Engine Front-End Tools TIES443: Introduction to DM Lecture 3: Data Warehousing 4 2 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Three-Tier Architecture • Warehouse database server – Almost always a relational DBMS, rarely flat files – Schema design – Specialized scan, indexing and join techniques – Handling of aggregate views (querying and materialization) – Supporting query language extensions beyond SQL – Complex query processing and optimization – Data partitioning and parallelism • OLAP servers – Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operators – Multidimensional OLAP (MOLAP): special-purpose server that directly implements multidimensional data and operations – Hybrid OLAP (HOLAP): user flexibility, e.g., low level: relational, high-level: array – Specialized SQL servers: specialized support for SQL queries over star/snowflake schemas • Clients – Query and reporting tools – Analysis tools – Data mining tools TIES443: Introduction to DM Lecture 3: Data Warehousing 5 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Warehouse Physical Architectures Client Client Client Central Data Warehouse Source Source Logical Physical Data Data Centralized architecture Warehouse Warehouse Source Source Source Source Federated architecture Tiered architecture TIES443: Introduction to DM Lecture 3: Data Warehousing 6 3 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Three Data Warehouse Models • Enterprise warehouse: collects all information about subjects (customers,products,sales,assets, personnel) that span the entire organization – Requires extensive business modeling (may take years to design and build) • Data Marts: Departmental subsets that focus on selected subjects – Marketing data mart: customer, product, sales – Faster roll out, but complex integration in the long run • Virtual warehouse: views over operational DBs – Materialize selective summary views for efficient query processing – Easy to build but require excess capability on operat. db servers TIES443: Introduction to DM Lecture 3: Data Warehousing 7 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY TIES443: Introduction to DM Lecture 3: Data Warehousing 8 4 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Data Warehouse • A data warehouse is a – subject-oriented, – integrated, – time-varying, – non-volatile collection of data that is used primarily in organizational decision making TIES443: Introduction to DM Lecture 3: Data Warehousing 9 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Data Warehouse—Subject-Oriented • Organized around major subjects, such as customer, product, sales • Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing • Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process TIES443: Introduction to DM Lecture 3: Data Warehousing 10 5 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Data Warehouse—Integrated • Constructed by integrating multiple, heterogeneous data sources – relational databases, flat files, on-line transaction records • Data cleaning and data integration techniques are applied – Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources • E.g., Hotel price: currency, tax, breakfast covered, etc. – When data is moved to the warehouse, it is converted TIES443: Introduction to DM Lecture 3: Data Warehousing 11 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Data Warehouse—Time Variant • The time horizon for the data warehouse is significantly longer than that of operational systems – Operational database: current value data – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) • Every key structure in the data warehouse – Contains an element of time, explicitly or implicitly – But the key of operational data may or may not contain “time element” TIES443: Introduction to DM Lecture 3: Data Warehousing 12 6 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Data Warehouse—Non-Volatile • A physically separate store of data transformed from the operational environment • Operational update of data does not occur in the data warehouse environment – Does not require transaction processing, recovery, and concurrency control mechanisms – Requires only two operations in data accessing: initial loading of data and access of data TIES443: Introduction to DM Lecture 3: Data Warehousing 13 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Data Warehouse vs. Heterogeneous DBMS • Traditional heterogeneous DB integration – Build wrappers/mediators on top of heterogeneous databases – Query driven approach • When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set • Complex information filtering, compete for resources • Data warehouse – update-driven, high performance – Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis TIES443: Introduction to DM Lecture 3: Data Warehousing 14 7 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Data Warehouse vs. Operational DBMS • OLTP (On-Line Transaction Processing) – Major task of traditional relational DBMS – Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. • OLAP (On-Line Analytical Processing) – Major task of data warehouse system – Data analysis and decision making • Distinct features (OLTP vs. OLAP): – User and system orientation: customer vs. market – Data contents: current, detailed vs. historical, consolidated – Database design: ER + application vs. star + subject – View: current, local vs. evolutionary, integrated – Access patterns: update vs. read-only but complex queries TIES443: Introduction to DM Lecture 3: Data Warehousing 15 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Conceptual Modeling of Data Warehouses • ER design techniques not appropriate - design should reflect multidimensional view – Star Schema • A fact table in the middle connected to a set of dimension tables – Snowflake Schema • A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake – Fact Constellation Schema • Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation TIES443: Introduction to DM Lecture 3: Data Warehousing 16 8 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Example of a Star Schema Order Product Order No ProductNO Order Date ProdName Fact Table ProdDescr Customer OrderNO Category Customer No CategoryDescription SalespersonID Customer Name UnitPrice CustomerNO Customer Address ProdNo Date City DateKey DateKey CityName Date Salesperson Quantity City SalespersonID SalespersonName Total Price CityName City State Quota Country TIES443: Introduction to DM Lecture 3: Data Warehousing 17 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Star Schema • A single fact table and a single table for each dimension • Every fact points to one tuple in each of the dimensions and has additional attributes • Does not capture hierarchies directly • Generated keys are used for performance and maintenance reasons • Fact constellation: Multiple Fact tables that share many dimension tables – Example: Projected expense and the actual expense may share dimensional tables TIES443: Introduction to DM Lecture 3: Data Warehousing 18 9 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Some Terms • Relation, which relates the dimensions to the measure of interest, is called the fact table (e.g. sale) • Information about dimensions can be represented as a collection of relations – called the dimension tables (product, customer, store) • Each dimension can have a set of associated attributes • For each dimension, the set of associated attributes can be structured as a hierarchy TIES443: Introduction to DM Lecture 3: Data Warehousing 19 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY A Concept Hierarchy: Dimension (location) all all region North_America ... Europe country Canada ... Mexico Ireland ... France city Toronto ... Dublin ... Belfast office Belfield ... Blackrock TIES443: Introduction to DM Lecture 3: Data Warehousing 20 10 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Example of a Snowflake Schema Order Product Order No Category ProductNO Order Date ProdName CategoryName Fact Table ProdDescr CategoryDescr Customer Category OrderNO Customer No Category Customer Name SalespersonID UnitPrice Customer CustomerNO Address Date ProdNo Month City DateKey DateKey Month Salesperson Date Year CityName Year Month SalespersonID Quantity City Year SalespersonName State Total Price CityName City StateName State Quota Country Country TIES443: Introduction to DM Lecture 3: Data Warehousing 21 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Example of Fact Constellation Multiple fact tables share dimension tables Shipping Fact Table Time Time_key time_key Item Item_key Sales Fact Table day item_key shipper_key day_of_the_week Time_key item_name month brand from_location quarter Item_key type to_location year Branch_key supplier_key Euros_sold Location_key unit_shipped Branch Location branch_key Unit_sold location_key branch_name Euros_sold street shipper branch_type Avg_sales city shipper_key Province/street shipper_name country location_key Measures shipper_type TIES443: Introduction to DM Lecture 3: Data Warehousing 22 11 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Multidimensional Data Model Fact relation Two-dimensional cube sale Product Client Amt p1 c1 12 c1 c2 c3 p2 c1 11 p1 12 50 p1 c3 50 p2 11 8 p2 c2 8 Fact relation 3-dimensional cube sale Product Client Date Amt p1 c1 1 12 p2 c1 1 11 c1 c2 c3 day 2 p1 c3 1 50 p1 44 4 p2 c2 1 8 p2 c1 c2 c3 day 1 p1 c1 2 44 p1 12 50 p1 c2 2 4 p2 11 8 TIES443: Introduction to DM Lecture 3: Data Warehousing 23 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Multidimensional Data • Sales volume as a function of product, month, and region Dimensions: Product, Location, Time Hierarchical summarization paths on gi Re Industry Region Year Category Country Quarter Product Product City Month Week Office Day Month TIES443: Introduction to DM Lecture 3: Data Warehousing 24 12 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY From Tables to Data Cubes • A data warehouse is based on – multidimensional data model which views data in the form of a data cube • A data cube allows data to be modeled and viewed in multiple dimensions (such as sales) – Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) – Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables • Definitions – an n-Dimensional base cube is called a base cuboid – The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid – The lattice of cuboids forms a data cube TIES443: Introduction to DM Lecture 3: Data Warehousing 25 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Cube: A Lattice of Cuboids all 0-D(apex) cuboid time item location supplier 1-D cuboids time,item time,location item,location location,supplier time,supplier 2-D cuboids item,supplier time,location,supplier time,item,location 3-D cuboids time,item,supplier item,location,supplier 4-D(base) cuboid time, item, location, supplier TIES443: Introduction to DM Lecture 3: Data Warehousing 26 13 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY A Sample Data Cube Total annual sales Date of TV in Ireland ct 1Qtr 2Qtr 3Qtr 4Qtr sum TV u od PC Ireland Pr VCR Country sum France Germany sum TIES443: Introduction to DM Lecture 3: Data Warehousing 27 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Typical OLAP Operations • Roll up (drill-up): summarize data – by climbing up hierarchy or by dimension reduction • Drill down (roll down): reverse of roll-up – from higher level summary to lower level summary or detailed data, or introducing new dimensions • Slice and dice – project and select • Pivot (rotate) – reorient the cube, visualization, 3D to series of 2D planes. • Other operations – drill across: involving (across) more than one fact table – drill through: through the bottom level of the cube to its back-end relational tables (using SQL) – rankings – time functions: e.g. time avg. TIES443: Introduction to DM Lecture 3: Data Warehousing 28 14 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Typical OLAP Operations: Drill & Roll Drill Total Sales Drill Down Total Sales per city Up Total Sales per city per store Total Sales per city per store per month Drill Total Sales Drill Down Total Sales per city Up Total Sales per city by category Drill Across TIES443: Introduction to DM Lecture 3: Data Warehousing 29 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY OLAP Queries: Rollup & Drill-Down c1 c2 c3 day 2 p1 44 4 p2 c1 c2 c3 p1 12 50 day 1 p2 11 8 c1 c2 c3 sum 67 12 50 c1 c2 c3 p1 56 4 50 p2 11 8 129 sum rollup p1 110 p2 19 drill-down TIES443: Introduction to DM Lecture 3: Data Warehousing 30 15 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Typical OLAP Operations: Pivoting • Pivoting can be combined with aggregation sale prodId clientid date amt p1 c1 1 12 p2 c1 1 11 c1 c2 c3 day 2 p1 c3 1 50 p1 44 4 p2 c2 1 8 p2 c1 c2 c3 p1 c1 2 44 day 1 p1 12 50 p1 c2 2 4 p2 11 8 c1 c2 c3 Sum c1 c2 c3 Sum 1 23 8 50 81 p1 56 4 50 110 2 44 4 48 p2 11 8 19 Sum 67 12 50 129 Sum 67 12 50 129 The result of pivoting is called a cross-tabulation TIES443: Introduction to DM Lecture 3: Data Warehousing 31 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Browsing a Data Cube • Visualization • OLAP capabilities • Interactive manipulation TIES443: Introduction to DM Lecture 3: Data Warehousing 32 16 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY 12 Codd’s Rules for OLAP 1. Multi-Dimensional Concept View – The user should be able to see the data as being multidimensional insofar as it should be easy to 'pivot' or 'slice and dice’. (See later.) 2. Transparency – The OLAP functionality should be provided behind the user's existing software without adversely affecting the functionality of the 'host‘, i.e. OLAP server should shield the user for the complexity of the data and application 3. Accessibility – OLAP should allow the user to access diverse data stores (relational, nonrelational and legacy systems) but see the data within a common 'schema‘ provided by the OLAP tool, i.e. Users shouldn’t have to know the location, type or layout of the data to access it. – OLAP server should automate the mapping of the logical schema to the physical data 4. Consistent Reporting Performance – There should not be significant degradation in performance with large numbers of dimensions or large quantities of data. TIES443: Introduction to DM Lecture 3: Data Warehousing 33 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY 12 Codd’s Rules for OLAP 5. Client-Server Architecture – Since much of the data is on mainframes, and the users work on PCs, the OLAP tool must be able to bring the two together – Different clients can be used – Data sources must be transparently supported by the OLAP server 6. Generic Dimensionality – Data dimensions must all be treated equally. Functions available for one dimension must be available for others. 7. Dynamic Sparse Matrix Handling – The OLAP tool should be able to work out for itself the most efficient way to store sparse matrix data. 8. Multi User Support – access, integrity, security TIES443: Introduction to DM Lecture 3: Data Warehousing 34 17 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY 12 Codd’s Rules for OLAP 9. Unrestricted Cross-Dimensional Operations – e.g., individual office overheads are allocated according to total corporate overheads divided in proportion to individual office sales. – Non-additive formulas cause the problems • Contribution = Revenue - Total Costs • Margin Percentage = Margin / Revenue 10. Intuitive Data Manipulation – Navigation should be done by operations on individual cells rather than menus. – Dimensions defined should allow automatic reorientation, drill-down, zoom-out, etc – Interface must be intuitive 11. Flexible Reporting – Row and column headings must be capable of more than one dimension each, and of displaying subsets of any dimension. 12. Unlimited Dimensions and Aggregation Levels – 15 - 20 dimensions are required in modelling, and within each there may be many hierarchical levels, i.e. unlimited aggregation – Rare in reporting to go beyond 12 dimensions, 6-7 is usual TIES443: Introduction to DM Lecture 3: Data Warehousing 35 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY DW Back-End Tools and Utilities • Data extraction: – get data from multiple, heterogeneous, and external sources • Data cleaning: – detect errors in the data and rectify them when possible • Data transformation: – convert data from legacy or host format to warehouse format: different data formats, languages, etc. • Load: – sort, summarize, consolidate, compute views, check integrity, and build indicies and partitions • Refresh – propagate the updates from the data sources to the warehouse TIES443: Introduction to DM Lecture 3: Data Warehousing 36 18 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY DW Information Flows • INFLOW - Processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse. • UPFLOW - Processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data. • DOWNFLOW - Processes associated with archiving and backing-up/recovery of data in the warehouse. • OUTFLOW - Processes associated with making the data available to the end-users. • METAFLOW - Processes associated with the management of the metadata. TIES443: Introduction to DM Lecture 3: Data Warehousing 37 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Data Cleaning • why? – Data warehouse contains data that is analyzed for business decisions – More data and multiple sources could mean more errors in the data and harder to trace such errors – Results in incorrect analysis • finding and resolving inconsistency in the source data • detecting data anomalies and rectifying them early has huge payoffs • Important to identify tools that work together well • Long Term Solution – Change business practices and data entry tools – Repository for meta-data TIES443: Introduction to DM Lecture 3: Data Warehousing 38 19 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Data Cleaning Techniques • Transformation Rules – Example: translate “gender” to “sex” • Uses domain-specific knowledge to do scrubbing • Parsing and fuzzy matching – Multiple data sources (can designate a preferred source) • Discover facts that flag unusual patterns (auditing) – Some dealer has never received a single complaint TIES443: Introduction to DM Lecture 3: Data Warehousing 39 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Load • Issues: – huge volumes of data to be loaded – small time window (usually at night) when the warehouse can be taken off-line – When to build indexes and summary tables – allow system administrator to monitor status, cancel suspend, resume load, or change load rate – restart after failure with no loss of data integrity • Techniques: – batch load utility: sort input records on clustering key and use sequential I/O; build indexes and derived tables – sequential loads still too long (~100 days for TB) – use parallelism and incremental techniques TIES443: Introduction to DM Lecture 3: Data Warehousing 40 20 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Refresh • when to refresh – on every update: too expensive, only necessary if OLAP queries need current data (e.g., up-the-minute stock quotes) – periodically (e.g., every 24 hours, every week) or after “significant” events – refresh policy set by administrator based on user needs and traffic – possibly different policies for different sources • how to refresh – Full extract from base tables • read entire source table or database: expensive – Incremental techniques • detect & propagate changes on base tables: replication servers • logical correctness • transactional correctness: incremental load TIES443: Introduction to DM Lecture 3: Data Warehousing 41 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Metadata Repository • Administrative metadata – source databases and their contents – warehouse schema, view & derived data definitions – dimensions, hierarchies – pre-defined queries and reports – data mart locations and contents – data partitions – data extraction, cleansing, transformation rules, defaults – data refresh and purging rules – user profiles, user groups – security: user authorization, access control • Business data – business terms and definitions – ownership of data – charging policies • Operational metadata – data lineage: history of migrated data and sequence of transf-s applied – currency of data: active, archived, purged – monitoring information: warehouse usage statistics, error reports, audit trails. TIES443: Introduction to DM Lecture 3: Data Warehousing 42 21 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Design of a DW: A Business Analysis Framework • Four views regarding the design of a data warehouse – Top-down view: allows selection of the relevant information necessary for the data warehouse (mature) – Bottom-up: Starts with experiments and prototypes (rapid) – Data source view • exposes the information being captured, stored, and managed by operational systems – Data warehouse view • consists of fact tables and dimension tables – Business query view • sees the perspectives of data in the warehouse from the view of end-user TIES443: Introduction to DM Lecture 3: Data Warehousing 43 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Data Warehouse Design Process • From software engineering point of view – Waterfall: structured and systematic analysis at each step before proceeding to the next – Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around • Typical data warehouse design process – Choose a business process to model, e.g., orders, invoices, etc. – Choose the grain (atomic level of data) of the business process – Choose the dimensions that will apply to each fact table record – Choose the measure that will populate each fact table record TIES443: Introduction to DM Lecture 3: Data Warehousing 44 22 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY DW Design: Issues to Consider • What data is needed? • Where does it come from? • How to clean data? • How to represent in warehouse (schema)? • What to summarize? • What to materialize? • What to index? TIES443: Introduction to DM Lecture 3: Data Warehousing 45 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY DW Data Management: Issues to Consider • Meta-data • Data sourcing • Data quality • Data security • Granularity • History- how long and how much? • Performance TIES443: Introduction to DM Lecture 3: Data Warehousing 46 23 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Common DW Problems • Underestimation of resources for data loading • Hidden problems with source systems • Required data not captured • Increased end-user demands • Data homogenization • High demand for resources • Data ownership • High maintenance • Long duration projects • Complexity of integration TIES443: Introduction to DM Lecture 3: Data Warehousing 47 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Research Issues • Data cleaning – focus on data inconsistencies, not schema differences – data mining techniques • Physical Design – design of summary tables, partitions, indexes – tradeoffs in use of different indexes • Query processing – selecting appropriate summary tables – dynamic optimization with feedback – query optimization: cost estimation, use of transformations, search strategies – partitioning query processing between OLAP server and backend server. • Warehouse Management – incremental refresh techniques – computing summary tables during load – failure recovery during load and refresh – process management: scheduling queries, load and refresh – use of workflow technology for process management TIES443: Introduction to DM Lecture 3: Data Warehousing 48 24 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY OLAP Mining: An Integration of DM and DW • Data mining systems, DBMS, Data warehouse systems coupling – No coupling, loose-coupling, semi-tight-coupling, tight-coupling • On-line analytical mining data – integration of mining and OLAP technologies • Interactive mining multi-level knowledge – Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc. • Integration of multiple mining functions – Characterized classification, first clustering and then association TIES443: Introduction to DM Lecture 3: Data Warehousing 49 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Summary • What is a data warehouse • Data warehouse architectures – Conceptual DW Modelling – Physical DW Modelling • A multi-dimensional data model – Data Cubes – Main OLAP operations • Data warehouse implementation and maintenance What else did you get from this lecture? TIES443: Introduction to DM Lecture 3: Data Warehousing 50 25 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Additional Slides TIES443: Introduction to DM Lecture 3: Data Warehousing 51 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Some Critics for Data Cubes • Jargon for IT professionals • Metaphor for end-users – Not useful beyond introduction – Users expect to see one – Rubic’s cube • Easy to manipulate but difficult to solve • Alternatives are better – Cognos’ ways, Thomsen’s multi-dimensional domain diagrams, Bulos’s Adapt diagrams or simply well designed interactive reports TIES443: Introduction to DM Lecture 3: Data Warehousing 52 26 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY DW Development: A Recommended Approach Multi-Tier Data Warehouse Distributed Data Marts Enterprise Data Data Data Mart Mart Warehouse Model refinement Model refinement Define a high-level corporate data model TIES443: Introduction to DM Lecture 3: Data Warehousing 53 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY DMQL: Language Primitives Nice presentation on Data Mining Query Languages can be found here: http://www.cs.wisc.edu/EDAM/slides/Data%20Mining%20Query%20Languages.ppt • Cube Definition (Fact Table) – define cube <cube_name> [<dimension_list>]: <measure_list> • Dimension Definition (Dimension Table) – define dimension <dimension_name> as (<attribute_or_subdimension_list>) • Special Case (Shared Dimension Tables) – First time as “cube definition” – define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time> TIES443: Introduction to DM Lecture 3: Data Warehousing 54 27 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Defining a Star Schema in DMQL define cube sales_star [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country) TIES443: Introduction to DM Lecture 3: Data Warehousing 55 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Defining a Snowflake Schema in DMQL define cube sales_snowflake [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type)) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city(city_key, province_or_state, country)) TIES443: Introduction to DM Lecture 3: Data Warehousing 56 28 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Defining a Fact Constellation in DMQL define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country) define cube shipping [time, item, shipper, from_location, to_location]: dollar_cost = sum(cost_in_dollars), unit_shipped = count(*) define dimension time as time in cube sales define dimension item as item in cube sales define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type) define dimension from_location as location in cube sales define dimension to_location as location in cube sales TIES443: Introduction to DM Lecture 3: Data Warehousing 57 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY A Star-Net Query Model Customer Orders Shipping Method Customer CONTRACTS AIR-EXPRESS ORDER TRUCK PRODUCT LINE Time Product ANNUALY QTRLY DAILY PRODUCT ITEM PRODUCT GROUP CITY SALES PERSON COUNTRY DISTRICT REGION DIVISION Location Each circle is called a Promotion Organization footprint TIES443: Introduction to DM Lecture 3: Data Warehousing 58 29 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Index Structures This and the following slides on Indexing for DW are adopted with minor modifications from: http://infolab.stanford.edu/~hector/cs245/Notes12.ppt • Indexing principle: • mapping key values to records for associative direct access • Most popular indexing techniques in relational database: B+-trees • For multi-dimensional data, a large number of indexing techniques have been developed: R-trees • Index structures applied in warehouses – inverted lists – bit map indexes – join indexes – text indexes TIES443: Introduction to DM Lecture 3: Data Warehousing 59 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Inverted Lists 18 19 r4 rId name age r18 r4 joe 20 20 r18 fred 20 20 r34 23 r19 sally 21 21 r35 22 r34 nancy 20 r35 tom 20 r5 r36 pat 25 r19 23 r5 dave 21 r37 25 r41 jeff 26 r40 26 ... inverted data age records index lists • Query: List for age = 20: r4, r18, r34, r35 – Get people with age = 20 and List for name = “fred”: r18, r52 name = “fred” Answer is intersection: r18 TIES443: Introduction to DM Lecture 3: Data Warehousing 60 30 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Bitmap Indexes • Bitmap index: An indexing technique that has attracted attention in multi-dimensional database implementation table Customer City Car c1 Detroit Ford c2 Chicago Honda c3 Detroit Honda c4 Poznan Ford c5 Paris BMW c6 Paris Nissan TIES443: Introduction to DM Lecture 3: Data Warehousing 61 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Bitmap Indexes • The index consists of bitmaps: Index on City: Index on Car: ec1 Chicago Detroit Paris Poznan ec1 BMW Ford Honda Nissan 1 0 1 0 0 1 0 1 0 0 2 1 0 0 0 2 1 0 1 0 3 0 1 0 0 3 0 0 1 0 4 0 0 0 1 4 0 1 0 0 5 0 0 1 0 5 1 0 0 0 6 0 0 1 0 6 0 0 0 1 bitmaps bitmaps •Index on a particular column •Index consists of a number of bit vectors - bitmaps •Each value in the indexed column has a bit vector (bitmaps) •The length of the bit vector is the number of records in the base table •The i-th bit is set if the i-th row of the base table has the value for the indexed column TIES443: Introduction to DM Lecture 3: Data Warehousing 62 31 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Bitmap Indexes • Index on a particular column • Index consists of a number of bit vectors - bitmaps • Each value in the indexed column has a bit vector (bitmaps) • The length of the bit vector is the number of records in the base table • The i-th bit is set if the i-th row of the base table has the value for the indexed column TIES443: Introduction to DM Lecture 3: Data Warehousing 63 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Bitmap Index Query: 18 1 Get people with age = 20 id name age 19 1 and name = “fred” 1 joe 20 0 2 fred 20 1 3 sally 21 List for age = 20: 20 1 20 4 nancy 20 1101100000 23 0 0 21 5 tom 20 List for name = “fred”: 0 0 22 6 pat 25 0100000001 0 1 7 dave 21 0 0 Answer is intersection: 8 jeff 26 0 0100000000 23 ... 0 25 1 Suited well for domains 26 0 with small cardinality 1 data 1 records age bit index maps TIES443: Introduction to DM Lecture 3: Data Warehousing 64 32 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Bitmap Index – Summary • With efficient hardware support for bitmap operations (AND, OR, XOR, NOT), bitmap index offers better access methods for certain queries • e.g., selection on two attributes • Some commercial products have implemented bitmap index • Works poorly for high cardinality domains since the number of bitmaps increases • Difficult to maintain - need reorganization when relation sizes change (new bitmaps) TIES443: Introduction to DM Lecture 3: Data Warehousing 65 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Join • “Combine” SALE, PRODUCT relations • In SQL: SELECT * FROM SALE, PRODUCT sale prodId storeId date amt product id name price p1 c1 1 12 p1 bolt 10 p2 c1 1 11 p2 nut 5 p1 c3 1 50 p2 c2 1 8 p1 c1 2 44 p1 c2 2 4 joinTb prodId name price storeId date amt p1 bolt 10 c1 1 12 p2 nut 5 c1 1 11 p1 bolt 10 c3 1 50 p2 nut 5 c2 1 8 p1 bolt 10 c1 2 44 p1 bolt 10 c2 2 4 TIES443: Introduction to DM Lecture 3: Data Warehousing 66 33 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Join Indexes join index product id name price jIndex p1 bolt 10 r1,r3,r5,r6 p2 nut 5 r2,r4 sale rId prodId storeId date amt r1 p1 c1 1 12 r2 p2 c1 1 11 r3 p1 c3 1 50 r4 p2 c2 1 8 r5 p1 c1 2 44 r6 p1 c2 2 4 TIES443: Introduction to DM Lecture 3: Data Warehousing 67 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Join Indexes • Traditional indexes map the value to a list of record ids. Join indexes map the tuples in the join result of two relations to the source tables. • In data warehouse cases, join indexes relate the values of the dimensions of a star schema to rows in the fact table. • For a warehouse with a Sales fact table and dimension city, a join index on city maintains for each distinct city a list of RIDs of the tuples recording the sales in the city • Join indexes can span multiple dimensions TIES443: Introduction to DM Lecture 3: Data Warehousing 68 34

DOCUMENT INFO

Shared By:

Categories:

Tags:

Stats:

views: | 2 |

posted: | 10/18/2012 |

language: | Unknown |

pages: | 34 |

OTHER DOCS BY zhouwenjuan

How are you planning on using Docstoc?
BUSINESS
PERSONAL

By registering with docstoc.com you agree to our
privacy policy and
terms of service, and to receive content and offer notifications.

Docstoc is the premier online destination to start and grow small businesses. It hosts the best quality and widest selection of professional documents (over 20 million) and resources including expert videos, articles and productivity tools to make every small business better.

Search or Browse for any specific document or resource you need for your business. Or explore our curated resources for Starting a Business, Growing a Business or for Professional Development.

Feel free to Contact Us with any questions you might have.