Docstoc

Data Warehousing

Document Sample
Data Warehousing Powered By Docstoc
					          UNIVERSITY OF JYVÄSKYLÄ                                             DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY




                                                                  TIES443


                                                                  Lecture 3

                                        Data Warehousing

                                                                                   Mykola Pechenizkiy

                                                 Course webpage: http://www.cs.jyu.fi/~mpechen/TIES443

                                                           November 3, 2006

                                        Department of Mathematical Information Technology
                                                      University of Jyväskylä
TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                               1




          UNIVERSITY OF JYVÄSKYLÄ                                             DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                                 Topics for today
         • What is a data warehouse?
         • Data warehouse architectures
                 – Conceptual DW Modelling
                 – Physical DW Modelling
         • A multi-dimensional data model
                 – Data Cubes
         • OLAP
                 – 12 Codd’s rules for OLAP
                 – Main OLAP operations
                        • New buzzwords
         • Data warehouse implementation and maintenance


TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                               2




                                                                                                                                  1
          UNIVERSITY OF JYVÄSKYLÄ                                              DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                                 Data Warehouse
         A decision support DB that is maintained separately
           from the organization’s operational databases.
         Why Separate Data Warehouse?
         • High performance for both systems
                 – DBMS— tuned for OLTP
                        • access methods, indexing, concurrency control, recovery
                 – Warehouse—tuned for OLAP
                        • complex OLAP queries, multidimensional view, consolidation.
         • Different functions and different data
                 – Missing data: Decision support requires historical data which
                   operational DBs do not typically maintain
                 – Data consolidation: DS requires consolidation (aggregation,
                   summarization) of data from heterogeneous sources
                 – Data quality: different sources typically use inconsistent data
                   representations, codes and formats which have to be reconciled

TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                                3




          UNIVERSITY OF JYVÄSKYLÄ                                              DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                       Three-Tier Architecture

                                                                   Monitor         OLAP                     Analysis
              other                    Metadata                       &            Server
              sources                                             Integrator

                                                                                                      Query/Reporting
             Operational            Extract
             DBs                    Transform                 Data                Serve
                                    Load
                                    Refresh
                                                            Warehouse
                                                                                                         Data Mining


                                                                                      ROLAP
                                                                                       Server
                                                              Data Marts


         Data Sources                       Data Storage                       OLAP Engine Front-End Tools

TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                                4




                                                                                                                                   2
          UNIVERSITY OF JYVÄSKYLÄ                                                     DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                          Three-Tier Architecture
          • Warehouse database server
                   –   Almost always a relational DBMS, rarely flat files
                   –   Schema design
                   –   Specialized scan, indexing and join techniques
                   –   Handling of aggregate views (querying and materialization)
                   –   Supporting query language extensions beyond SQL
                   –   Complex query processing and optimization
                   –   Data partitioning and parallelism
          • OLAP servers
                   – Relational OLAP (ROLAP): extended relational DBMS that maps operations on
                     multidimensional data to standard relational operators
                   – Multidimensional OLAP (MOLAP): special-purpose server that directly
                     implements multidimensional data and operations
                   – Hybrid OLAP (HOLAP): user flexibility, e.g., low level: relational, high-level: array
                   – Specialized SQL servers: specialized support for SQL queries over star/snowflake
                     schemas
          • Clients
                   – Query and reporting tools
                   – Analysis tools
                   – Data mining tools

TIES443: Introduction to DM            Lecture 3: Data Warehousing                                                                    5




          UNIVERSITY OF JYVÄSKYLÄ                                                     DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                              Warehouse Physical Architectures
          Client              Client           Client




                               Central
                                Data
                              Warehouse




            Source                         Source
                                                                  Logical                                    Physical
                                                                   Data                                       Data
              Centralized architecture                           Warehouse                                  Warehouse



                                                     Source                  Source
                                                                                                  Source                 Source

                                                         Federated architecture                   Tiered architecture

TIES443: Introduction to DM            Lecture 3: Data Warehousing                                                                    6




                                                                                                                                          3
          UNIVERSITY OF JYVÄSKYLÄ                                 DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                              Three Data Warehouse Models
         • Enterprise warehouse: collects all information about subjects
           (customers,products,sales,assets, personnel) that span
           the entire organization
                 – Requires extensive business modeling (may take years to design
                   and build)
         • Data Marts: Departmental subsets that focus on selected
           subjects
                 – Marketing data mart: customer, product, sales
                 – Faster roll out, but complex integration in the long run
         • Virtual warehouse: views over operational DBs
                 – Materialize selective summary views for efficient query processing
                 – Easy to build but require excess capability on operat. db servers



TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                   7




          UNIVERSITY OF JYVÄSKYLÄ                                 DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY




TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                   8




                                                                                                                      4
          UNIVERSITY OF JYVÄSKYLÄ                                 DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                                 Data Warehouse


         • A data warehouse is a
                 – subject-oriented,
                 – integrated,
                 – time-varying,
                 – non-volatile
               collection of data that is used primarily in organizational
               decision making




TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                   9




          UNIVERSITY OF JYVÄSKYLÄ                                 DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY




                        Data Warehouse—Subject-Oriented

           • Organized around major subjects, such as customer,
                 product, sales

           • Focusing on the modeling and analysis of data for
                 decision makers, not on daily operations or transaction
                 processing

           • Provide a simple and concise view around particular
                 subject issues by excluding data that are not useful in the
                 decision support process


TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                  10




                                                                                                                      5
          UNIVERSITY OF JYVÄSKYLÄ                                     DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY




                                    Data Warehouse—Integrated
              • Constructed by integrating multiple, heterogeneous
                data sources
                     – relational databases, flat files, on-line transaction records
              • Data cleaning and data integration techniques are
                applied
                     – Ensure consistency in naming conventions, encoding
                       structures, attribute measures, etc. among different data
                       sources
                              • E.g., Hotel price: currency, tax, breakfast covered, etc.
                     – When data is moved to the warehouse, it is converted




TIES443: Introduction to DM          Lecture 3: Data Warehousing                                                     11




          UNIVERSITY OF JYVÄSKYLÄ                                     DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY




                              Data Warehouse—Time Variant

         • The time horizon for the data warehouse is significantly
           longer than that of operational systems
                – Operational database: current value data
                – Data warehouse data: provide information from a historical
                  perspective (e.g., past 5-10 years)
         • Every key structure in the data warehouse
                – Contains an element of time, explicitly or implicitly
                – But the key of operational data may or may not contain “time
                  element”




TIES443: Introduction to DM          Lecture 3: Data Warehousing                                                     12




                                                                                                                          6
          UNIVERSITY OF JYVÄSKYLÄ                                    DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                              Data Warehouse—Non-Volatile

          • A physically separate store of data transformed from
                the operational environment
          • Operational update of data does not occur in the data
                warehouse environment
                  – Does not require transaction processing, recovery, and
                       concurrency control mechanisms
                  – Requires only two operations in data accessing:
                                      initial loading of data and access of data




TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                     13




          UNIVERSITY OF JYVÄSKYLÄ                                    DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



              Data Warehouse vs. Heterogeneous DBMS

          • Traditional heterogeneous DB integration
                  – Build wrappers/mediators on top of heterogeneous databases
                  – Query driven approach
                          • When a query is posed to a client site, a meta-dictionary is used to
                            translate the query into queries appropriate for individual
                            heterogeneous sites involved, and the results are integrated into a
                            global answer set
                          • Complex information filtering, compete for resources
          • Data warehouse
                  – update-driven, high performance
                  – Information from heterogeneous sources is integrated in advance and
                    stored in warehouses for direct query and analysis




TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                     14




                                                                                                                         7
          UNIVERSITY OF JYVÄSKYLÄ                                   DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                   Data Warehouse vs. Operational DBMS
          • OLTP (On-Line Transaction Processing)
                  – Major task of traditional relational DBMS
                  – Day-to-day operations: purchasing, inventory, banking, manufacturing,
                    payroll, registration, accounting, etc.
          • OLAP (On-Line Analytical Processing)
                  – Major task of data warehouse system
                  – Data analysis and decision making
          • Distinct features (OLTP vs. OLAP):
                  – User and system orientation: customer vs. market
                  – Data contents: current, detailed vs. historical, consolidated
                  – Database design: ER + application vs. star + subject
                  – View: current, local vs. evolutionary, integrated
                  – Access patterns: update vs. read-only but complex queries

TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                    15




          UNIVERSITY OF JYVÄSKYLÄ                                   DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



              Conceptual Modeling of Data Warehouses

         •          ER design techniques not appropriate - design should
                    reflect multidimensional view
                –        Star Schema
                        •      A fact table in the middle connected to a set of dimension tables
                –        Snowflake Schema
                        •      A refinement of star schema where some dimensional hierarchy is
                               normalized into a set of smaller dimension tables, forming a shape
                               similar to snowflake
                –        Fact Constellation Schema
                        •      Multiple fact tables share dimension tables, viewed as a collection
                               of stars, therefore called galaxy schema or fact constellation




TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                    16




                                                                                                                        8
          UNIVERSITY OF JYVÄSKYLÄ                                          DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                      Example of a Star Schema
                       Order
                                                                                     Product
                       Order No                                                      ProductNO
                       Order Date                                                    ProdName
                                                           Fact Table                ProdDescr
                      Customer
                                                           OrderNO                   Category
                      Customer No                                                    CategoryDescription
                                                           SalespersonID
                      Customer Name                                                  UnitPrice
                                                           CustomerNO
                      Customer
                      Address                              ProdNo                    Date
                      City
                                                           DateKey                   DateKey

                                                           CityName                  Date
                      Salesperson
                                                           Quantity                   City
                      SalespersonID
                      SalespersonName                      Total Price
                                                                                     CityName
                      City
                                                                                     State
                      Quota
                                                                                     Country



TIES443: Introduction to DM           Lecture 3: Data Warehousing                                                         17




          UNIVERSITY OF JYVÄSKYLÄ                                          DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                                         Star Schema
         • A single fact table and a single table for each dimension
         • Every fact points to one tuple in each of the dimensions
           and has additional attributes
         • Does not capture hierarchies directly
         • Generated keys are used for performance and
           maintenance reasons
         • Fact constellation: Multiple Fact tables that share many
           dimension tables
                 – Example: Projected expense and the actual expense may share
                   dimensional tables




TIES443: Introduction to DM           Lecture 3: Data Warehousing                                                         18




                                                                                                                               9
          UNIVERSITY OF JYVÄSKYLÄ                                                  DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                                       Some Terms
           • Relation, which relates the dimensions to the
             measure of interest, is called the fact table (e.g. sale)
           • Information about dimensions can be represented as
             a collection of relations – called the dimension tables
             (product, customer, store)
           • Each dimension can have a set of associated
             attributes
           • For each dimension, the set of associated attributes
             can be structured as a hierarchy




TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                                   19




          UNIVERSITY OF JYVÄSKYLÄ                                                  DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY




              A Concept Hierarchy: Dimension (location)

           all                                                         all


          region                          North_America                      ...                       Europe


          country                   Canada                 ...    Mexico                Ireland              ...       France


           city               Toronto                ...               Dublin                   ...      Belfast


           office                                                 Belfield     ...       Blackrock

TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                                   20




                                                                                                                                       10
          UNIVERSITY OF JYVÄSKYLÄ                                       DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                              Example of a Snowflake Schema
                        Order
                                                                        Product
                Order No                                                                         Category
                                                                    ProductNO
                Order Date                                          ProdName                     CategoryName
                                                Fact Table
                                                                    ProdDescr                    CategoryDescr
                       Customer
                                                                    Category
                                                OrderNO
                Customer No                                         Category
                Customer Name
                                                SalespersonID
                                                                    UnitPrice
                Customer                        CustomerNO
                Address                                                Date
                                                ProdNo                                   Month
                City                                                  DateKey
                                                DateKey                                  Month
               Salesperson                                            Date                               Year
                                                CityName                                 Year
                                                                      Month
                SalespersonID                   Quantity              City                               Year
                SalespersonName                                                              State
                                                Total Price           CityName
                City                                                                         StateName
                                                                      State
                Quota                                                                        Country
                                                                      Country




TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                        21




          UNIVERSITY OF JYVÄSKYLÄ                                       DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                  Example of Fact Constellation
          Multiple fact tables share dimension tables
                                                                                            Shipping Fact Table

                   Time                                                                           Time_key
          time_key                                                       Item                     Item_key
                                             Sales Fact Table
          day                                                     item_key                      shipper_key
          day_of_the_week                        Time_key         item_name
          month                                                   brand                         from_location
          quarter                                 Item_key        type                           to_location
          year                                 Branch_key         supplier_key
                                                                                                 Euros_sold
                                              Location_key                                      unit_shipped
            Branch                                                      Location
          branch_key                             Unit_sold
                                                                    location_key
          branch_name                           Euros_sold
                                                                    street                           shipper
          branch_type
                                                 Avg_sales          city
                                                                                                 shipper_key
                                                                    Province/street
                                                                                                 shipper_name
                                                                    country                      location_key
                       Measures
                                                                                                 shipper_type

TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                        22




                                                                                                                            11
          UNIVERSITY OF JYVÄSKYLÄ                                             DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                Multidimensional Data Model

                              Fact relation                             Two-dimensional cube

                      sale     Product Client            Amt
                                 p1     c1                12                           c1       c2        c3
                                 p2     c1                11                  p1       12                 50
                                 p1     c3                50                  p2       11        8
                                 p2     c2                 8



                              Fact relation                                   3-dimensional cube

                sale     Product Client           Date       Amt
                           p1     c1               1          12
                           p2     c1               1          11                            c1    c2    c3
                                                                        day 2
                           p1     c3               1          50                      p1    44    4
                           p2     c2               1          8                       p2 c1    c2    c3
                                                                      day 1
                           p1     c1               2          44                   p1    12          50
                           p1     c2               2          4                    p2    11    8
TIES443: Introduction to DM          Lecture 3: Data Warehousing                                                             23




          UNIVERSITY OF JYVÄSKYLÄ                                             DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                        Multidimensional Data

           • Sales volume as a function of product, month, and
             region

                                                                   Dimensions: Product, Location, Time
                                                                   Hierarchical summarization paths
                 on
              gi
            Re




                                                                       Industry Region                   Year

                                                                       Category Country Quarter
            Product




                                                                       Product          City         Month       Week

                                                                                         Office            Day


                                    Month
TIES443: Introduction to DM          Lecture 3: Data Warehousing                                                             24




                                                                                                                                  12
          UNIVERSITY OF JYVÄSKYLÄ                                                  DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                         From Tables to Data Cubes
            • A data warehouse is based on
                    – multidimensional data model which views data in the form of a data
                      cube
            • A data cube allows data to be modeled and viewed in multiple
              dimensions (such as sales)
                    – Dimension tables, such as item (item_name, brand, type), or time(day,
                      week, month, quarter, year)
                    – Fact table contains measures (such as dollars_sold) and keys to each of
                      the related dimension tables
            • Definitions
                    – an n-Dimensional base cube is called a base cuboid
                    – The top most 0-D cuboid, which holds the highest-level of
                      summarization, is called the apex cuboid
                    – The lattice of cuboids forms a data cube


TIES443: Introduction to DM             Lecture 3: Data Warehousing                                                               25




          UNIVERSITY OF JYVÄSKYLÄ                                                  DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                       Cube: A Lattice of Cuboids

                                                      all
                                                                                                      0-D(apex) cuboid

                              time            item               location     supplier
                                                                                                     1-D cuboids

       time,item              time,location                   item,location           location,supplier

                                             time,supplier
                                                                                                     2-D cuboids
                                                                         item,supplier

                                                    time,location,supplier
         time,item,location
                                                                                                     3-D cuboids
                                     time,item,supplier               item,location,supplier

                                                                                                      4-D(base) cuboid
                                        time, item, location, supplier

TIES443: Introduction to DM             Lecture 3: Data Warehousing                                                               26




                                                                                                                                       13
          UNIVERSITY OF JYVÄSKYLÄ                                               DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                          A Sample Data Cube

                                                                                             Total annual sales
                                                             Date                            of TV in Ireland
                  ct                1Qtr           2Qtr           3Qtr   4Qtr      sum
                    TV
              u
           od

                  PC                                                                         Ireland
         Pr



                VCR




                                                                                                           Country
             sum
                                                                                             France

                                                                                            Germany


                                                                                                sum




TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                                27




          UNIVERSITY OF JYVÄSKYLÄ                                               DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                    Typical OLAP Operations
           • Roll up (drill-up): summarize data
                   – by climbing up hierarchy or by dimension reduction
           • Drill down (roll down): reverse of roll-up
                   – from higher level summary to lower level summary or detailed data, or
                     introducing new dimensions
           • Slice and dice
                   – project and select
           • Pivot (rotate)
                   – reorient the cube, visualization, 3D to series of 2D planes.
           • Other operations
                   – drill across: involving (across) more than one fact table
                   – drill through: through the bottom level of the cube to its back-end relational
                     tables (using SQL)
                   – rankings
                   – time functions: e.g. time avg.
TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                                28




                                                                                                                                    14
          UNIVERSITY OF JYVÄSKYLÄ                                                DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                    Typical OLAP Operations: Drill & Roll

                              Drill                 Total Sales                                            Drill
                              Down                  Total Sales per city                                   Up
                                                    Total Sales per city per store
                                                    Total Sales per city per store per month



                              Drill                 Total Sales                                            Drill
                              Down                  Total Sales per city                                   Up
                                                    Total Sales per city by category
                                                                            Drill Across




TIES443: Introduction to DM                Lecture 3: Data Warehousing                                                          29




          UNIVERSITY OF JYVÄSKYLÄ                                                DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                      OLAP Queries: Rollup & Drill-Down


                                       c1    c2    c3
               day 2
                                 p1    44    4
                                 p2 c1    c2    c3
                              p1    12          50
          day 1               p2    11    8


                                                                                    c1       c2       c3
                                                                           sum      67       12       50
                                      c1         c2         c3
                         p1           56         4          50
                         p2           11         8                                                                     129
                                                                                  sum
                                 rollup                                     p1    110
                                                                            p2     19
                              drill-down
TIES443: Introduction to DM                Lecture 3: Data Warehousing                                                          30




                                                                                                                                     15
          UNIVERSITY OF JYVÄSKYLÄ                                           DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                        Typical OLAP Operations: Pivoting
                   • Pivoting can be combined with aggregation


               sale       prodId clientid          date       amt
                            p1      c1              1          12
                            p2      c1              1          11                      c1    c2    c3
                                                                     day 2
                            p1      c3              1          50                p1    44    4
                            p2      c2              1           8                p2 c1    c2    c3
                            p1      c1              2          44
                                                                    day 1
                                                                              p1    12          50
                            p1      c2              2           4             p2    11    8




                              c1    c2        c3      Sum                        c1        c2       c3      Sum
                1             23     8        50       81               p1       56        4        50      110
                2             44     4                 48               p2       11        8                 19
               Sum            67    12        50      129              Sum       67        12       50      129


                                                   The result of pivoting is called a cross-tabulation
TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                            31




          UNIVERSITY OF JYVÄSKYLÄ                                           DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY




                                          Browsing a Data Cube

                                                                                        • Visualization
                                                                                        • OLAP capabilities
                                                                                        • Interactive
                                                                                              manipulation




TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                            32




                                                                                                                                16
          UNIVERSITY OF JYVÄSKYLÄ                                  DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                    12 Codd’s Rules for OLAP
         1. Multi-Dimensional Concept View
                 – The user should be able to see the data as being multidimensional insofar as it
                   should be easy to 'pivot' or 'slice and dice’. (See later.)
         2. Transparency
                 – The OLAP functionality should be provided behind the user's existing
                   software without adversely affecting the functionality of the 'host‘, i.e.
                   OLAP server should shield the user for the complexity of the data and
                   application
         3. Accessibility
                 – OLAP should allow the user to access diverse data stores (relational,
                   nonrelational and legacy systems) but see the data within a common
                   'schema‘ provided by the OLAP tool, i.e. Users shouldn’t have to know
                   the location, type or layout of the data to access it.
                 – OLAP server should automate the mapping of the logical schema to the
                   physical data
         4. Consistent Reporting Performance
                 – There should not be significant degradation in performance with large
                   numbers of dimensions or large quantities of data.


TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                   33




          UNIVERSITY OF JYVÄSKYLÄ                                  DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                    12 Codd’s Rules for OLAP
         5. Client-Server Architecture
                 – Since much of the data is on mainframes, and the users work on
                   PCs, the OLAP tool must be able to bring the two together
                 – Different clients can be used
                 – Data sources must be transparently supported by the OLAP server
         6. Generic Dimensionality
                 – Data dimensions must all be treated equally. Functions available
                   for one dimension must be available for others.
         7. Dynamic Sparse Matrix Handling
                 – The OLAP tool should be able to work out for itself the most
                   efficient way to store sparse matrix data.
         8. Multi User Support
                 – access, integrity, security


TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                   34




                                                                                                                       17
          UNIVERSITY OF JYVÄSKYLÄ                                 DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                    12 Codd’s Rules for OLAP
         9. Unrestricted Cross-Dimensional Operations
                 – e.g., individual office overheads are allocated according to total corporate
                   overheads divided in proportion to individual office sales.
                 – Non-additive formulas cause the problems
                        • Contribution = Revenue - Total Costs
                        • Margin Percentage = Margin / Revenue
         10. Intuitive Data Manipulation
                 – Navigation should be done by operations on individual cells rather than
                   menus.
                 – Dimensions defined should allow automatic reorientation, drill-down,
                   zoom-out, etc
                 – Interface must be intuitive
         11. Flexible Reporting
                 – Row and column headings must be capable of more than one dimension
                   each, and of displaying subsets of any dimension.
         12. Unlimited Dimensions and Aggregation Levels
                 – 15 - 20 dimensions are required in modelling, and within each there may
                   be many hierarchical levels, i.e. unlimited aggregation
                 – Rare in reporting to go beyond 12 dimensions, 6-7 is usual


TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                  35




          UNIVERSITY OF JYVÄSKYLÄ                                 DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                              DW Back-End Tools and Utilities

          • Data extraction:
                  – get data from multiple, heterogeneous, and external sources
          • Data cleaning:
                  – detect errors in the data and rectify them when possible
          • Data transformation:
                  – convert data from legacy or host format to warehouse format:
                    different data formats, languages, etc.
          • Load:
                  – sort, summarize, consolidate, compute views, check integrity, and
                    build indicies and partitions
          • Refresh
                  – propagate the updates from the data sources to the warehouse


TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                  36




                                                                                                                      18
          UNIVERSITY OF JYVÄSKYLÄ                                 DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                       DW Information Flows
         • INFLOW - Processes associated with the extraction,
           cleansing, and loading of the data from the source systems
           into the data warehouse.
         • UPFLOW - Processes associated with adding value to the
           data in the warehouse through summarizing, packaging,
           and distribution of the data.
         • DOWNFLOW - Processes associated with archiving and
           backing-up/recovery of data in the warehouse.
         • OUTFLOW - Processes associated with making the data
           available to the end-users.
         • METAFLOW - Processes associated with the management
           of the metadata.
TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                  37




          UNIVERSITY OF JYVÄSKYLÄ                                 DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                                    Data Cleaning
         • why?
                 – Data warehouse contains data that is analyzed for business
                   decisions
                 – More data and multiple sources could mean more errors in the
                   data and harder to trace such errors
                 – Results in incorrect analysis
         • finding and resolving inconsistency in the source data
         • detecting data anomalies and rectifying them early has
           huge payoffs
         • Important to identify tools that work together well
         • Long Term Solution
                 – Change business practices and data entry tools
                 – Repository for meta-data

TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                  38




                                                                                                                      19
          UNIVERSITY OF JYVÄSKYLÄ                                        DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                    Data Cleaning Techniques
         • Transformation Rules
                 – Example: translate “gender” to “sex”
         • Uses domain-specific knowledge to do scrubbing
         • Parsing and fuzzy matching
                 – Multiple data sources (can designate a preferred
                   source)
         • Discover facts that flag unusual patterns
           (auditing)
                 – Some dealer has never received a single complaint




TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                         39




          UNIVERSITY OF JYVÄSKYLÄ                                        DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                                                  Load
         • Issues:
                 – huge volumes of data to be loaded
                 – small time window (usually at night) when the warehouse can be
                   taken off-line
                 – When to build indexes and summary tables
                 – allow system administrator to monitor status, cancel suspend,
                   resume load, or change load rate
                 – restart after failure with no loss of data integrity
         • Techniques:
                 – batch load utility: sort input records on clustering key and use
                   sequential I/O; build indexes and derived tables
                 – sequential loads still too long (~100 days for TB)
                 – use parallelism and incremental techniques



TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                         40




                                                                                                                             20
          UNIVERSITY OF JYVÄSKYLÄ                                       DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                                              Refresh
         • when to refresh
                 – on every update: too expensive, only necessary if OLAP queries
                   need current data (e.g., up-the-minute stock quotes)
                 – periodically (e.g., every 24 hours, every week) or after
                   “significant” events
                 – refresh policy set by administrator based on user needs and traffic
                 – possibly different policies for different sources
         • how to refresh
                 – Full extract from base tables
                        • read entire source table or database: expensive
                 – Incremental techniques
                        • detect & propagate changes on base tables: replication servers
                        • logical correctness
                        • transactional correctness: incremental load


TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                        41




          UNIVERSITY OF JYVÄSKYLÄ                                       DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                           Metadata Repository
         • Administrative metadata
                 –   source databases and their contents
                 –   warehouse schema, view & derived data definitions
                 –   dimensions, hierarchies
                 –   pre-defined queries and reports
                 –   data mart locations and contents
                 –   data partitions
                 –   data extraction, cleansing, transformation rules, defaults
                 –   data refresh and purging rules
                 –   user profiles, user groups
                 –   security: user authorization, access control
         • Business data
                 – business terms and definitions
                 – ownership of data
                 – charging policies
         • Operational metadata
                 – data lineage: history of migrated data and sequence of transf-s applied
                 – currency of data: active, archived, purged
                 – monitoring information: warehouse usage statistics, error reports, audit
                   trails.
TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                        42




                                                                                                                            21
          UNIVERSITY OF JYVÄSKYLÄ                                   DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



             Design of a DW: A Business Analysis Framework

              • Four views regarding the design of a data warehouse
                       – Top-down view: allows selection of the relevant
                         information necessary for the data warehouse (mature)
                       – Bottom-up: Starts with experiments and prototypes (rapid)
                       – Data source view
                              • exposes the information being captured, stored, and managed
                                by operational systems
                       – Data warehouse view
                              • consists of fact tables and dimension tables
                       – Business query view
                              • sees the perspectives of data in the warehouse from the view
                                of end-user




TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                    43




          UNIVERSITY OF JYVÄSKYLÄ                                   DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                              Data Warehouse Design Process
           • From software engineering point of view
                   – Waterfall: structured and systematic analysis at each step before
                     proceeding to the next
                   – Spiral: rapid generation of increasingly functional systems,
                     short turn around time, quick turn around
           • Typical data warehouse design process
                   –    Choose a business process to model, e.g., orders, invoices, etc.
                   –    Choose the grain (atomic level of data) of the business process
                   –    Choose the dimensions that will apply to each fact table record
                   –    Choose the measure that will populate each fact table record




TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                    44




                                                                                                                        22
          UNIVERSITY OF JYVÄSKYLÄ                                 DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                              DW Design: Issues to Consider
         •     What data is needed?
         •     Where does it come from?
         •     How to clean data?
         •     How to represent in warehouse (schema)?
         •     What to summarize?
         •     What to materialize?
         •     What to index?




TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                  45




          UNIVERSITY OF JYVÄSKYLÄ                                 DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



              DW Data Management: Issues to Consider
         •     Meta-data
         •     Data sourcing
         •     Data quality
         •     Data security
         •     Granularity
         •     History- how long and how much?
         •     Performance




TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                  46




                                                                                                                      23
          UNIVERSITY OF JYVÄSKYLÄ                                 DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                      Common DW Problems
         •     Underestimation of resources for data loading
         •     Hidden problems with source systems
         •     Required data not captured
         •     Increased end-user demands
         •     Data homogenization
         •     High demand for resources
         •     Data ownership
         •     High maintenance
         •     Long duration projects
         •     Complexity of integration



TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                  47




          UNIVERSITY OF JYVÄSKYLÄ                                 DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                                  Research Issues
         • Data cleaning
                 – focus on data inconsistencies, not schema differences
                 – data mining techniques
         • Physical Design
                 – design of summary tables, partitions, indexes
                 – tradeoffs in use of different indexes
         • Query processing
                 – selecting appropriate summary tables
                 – dynamic optimization with feedback
                 – query optimization: cost estimation, use of transformations, search
                   strategies
                 – partitioning query processing between OLAP server and backend server.
         • Warehouse Management
                 –   incremental refresh techniques
                 –   computing summary tables during load
                 –   failure recovery during load and refresh
                 –   process management: scheduling queries, load and refresh
                 –   use of workflow technology for process management


TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                  48




                                                                                                                      24
          UNIVERSITY OF JYVÄSKYLÄ                                    DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                 OLAP Mining: An Integration of DM and DW

          • Data mining systems, DBMS, Data warehouse systems
                coupling
                  – No coupling, loose-coupling, semi-tight-coupling, tight-coupling

          • On-line analytical mining data
                  – integration of mining and OLAP technologies

          • Interactive mining multi-level knowledge
                  – Necessity of mining knowledge and patterns at different levels of
                       abstraction by drilling/rolling, pivoting, slicing/dicing, etc.

          • Integration of multiple mining functions
                  – Characterized classification, first clustering and then association

TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                     49




          UNIVERSITY OF JYVÄSKYLÄ                                    DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                                           Summary
          • What is a data warehouse
          • Data warehouse architectures
                  – Conceptual DW Modelling
                  – Physical DW Modelling
          • A multi-dimensional data model
                  – Data Cubes
                  – Main OLAP operations
          • Data warehouse implementation and maintenance




          What else did you get from this lecture?
TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                     50




                                                                                                                         25
          UNIVERSITY OF JYVÄSKYLÄ                                  DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY




                                                  Additional Slides




TIES443: Introduction to DM          Lecture 3: Data Warehousing                                                  51




          UNIVERSITY OF JYVÄSKYLÄ                                  DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                    Some Critics for Data Cubes
         • Jargon for IT professionals
         • Metaphor for end-users
                 – Not useful beyond introduction
                 – Users expect to see one
                 – Rubic’s cube
                        • Easy to manipulate but difficult to solve

         • Alternatives are better
                 – Cognos’ ways, Thomsen’s multi-dimensional domain diagrams,
                   Bulos’s Adapt diagrams or simply well designed interactive
                   reports




TIES443: Introduction to DM          Lecture 3: Data Warehousing                                                  52




                                                                                                                       26
          UNIVERSITY OF JYVÄSKYLÄ                                              DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                DW Development: A Recommended Approach


                                                                           Multi-Tier Data
                                                                           Warehouse
                                    Distributed
                                    Data Marts



                                                                                Enterprise
                    Data                          Data
                                                                                Data
                    Mart                          Mart
                                                                                Warehouse


                              Model refinement              Model refinement



                       Define a high-level corporate data model
TIES443: Introduction to DM           Lecture 3: Data Warehousing                                                             53




          UNIVERSITY OF JYVÄSKYLÄ                                              DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                    DMQL: Language Primitives
          Nice presentation on Data Mining Query Languages can be found here:
          http://www.cs.wisc.edu/EDAM/slides/Data%20Mining%20Query%20Languages.ppt


          • Cube Definition (Fact Table)
                  – define cube <cube_name> [<dimension_list>]:
                    <measure_list>
          • Dimension Definition (Dimension Table)
                  – define dimension <dimension_name> as
                    (<attribute_or_subdimension_list>)
          • Special Case (Shared Dimension Tables)
                  – First time as “cube definition”
                  – define dimension <dimension_name> as
                    <dimension_name_first_time> in cube
                    <cube_name_first_time>

TIES443: Introduction to DM           Lecture 3: Data Warehousing                                                             54




                                                                                                                                   27
          UNIVERSITY OF JYVÄSKYLÄ                                  DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                              Defining a Star Schema in DMQL
           define cube sales_star [time, item, branch, location]:
                          dollars_sold = sum(sales_in_dollars),
                          avg_sales = avg(sales_in_dollars),
                          units_sold = count(*)
           define dimension time as
                          (time_key, day, day_of_week, month, quarter, year)
           define dimension item as
                          (item_key, item_name, brand, type, supplier_type)
           define dimension branch as
                          (branch_key, branch_name, branch_type)
           define dimension location as
                          (location_key, street, city, province_or_state, country)




TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                   55




          UNIVERSITY OF JYVÄSKYLÄ                                  DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                         Defining a Snowflake Schema in DMQL
         define cube sales_snowflake [time, item, branch, location]:
                         dollars_sold = sum(sales_in_dollars),
                         avg_sales = avg(sales_in_dollars),
                         units_sold = count(*)
         define dimension time as
                         (time_key, day, day_of_week, month, quarter, year)
         define dimension item as
                         (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
         define dimension branch as
                         (branch_key, branch_name, branch_type)
         define dimension location as
                         (location_key, street, city(city_key, province_or_state, country))



TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                   56




                                                                                                                       28
          UNIVERSITY OF JYVÄSKYLÄ                                                DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY




                   Defining a Fact Constellation in DMQL
           define cube sales [time, item, branch, location]:
                              dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
                              units_sold = count(*)
           define dimension time as (time_key, day, day_of_week, month, quarter, year)
           define dimension item as (item_key, item_name, brand, type, supplier_type)
           define dimension branch as (branch_key, branch_name, branch_type)
           define dimension location as (location_key, street, city, province_or_state,
                            country)
           define cube shipping [time, item, shipper, from_location, to_location]:
                               dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
           define dimension time as time in cube sales
           define dimension item as item in cube sales
           define dimension shipper as (shipper_key, shipper_name, location as location
              in cube sales, shipper_type)
           define dimension from_location as location in cube sales
           define dimension to_location as location in cube sales

TIES443: Introduction to DM           Lecture 3: Data Warehousing                                                               57




          UNIVERSITY OF JYVÄSKYLÄ                                                DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY




                                     A Star-Net Query Model
                                                            Customer Orders
                         Shipping Method
                                                                                                           Customer
                                                                         CONTRACTS
                              AIR-EXPRESS

                                                                         ORDER
                                                 TRUCK
                                                                                  PRODUCT LINE
              Time                                                                                Product
                       ANNUALY QTRLY DAILY                                  PRODUCT ITEM PRODUCT GROUP
                                               CITY
                                                                                 SALES PERSON
                                     COUNTRY
                                                                                         DISTRICT
                   REGION
                                                                                                       DIVISION
              Location              Each circle is
                                    called a                        Promotion                             Organization
                                    footprint
TIES443: Introduction to DM           Lecture 3: Data Warehousing                                                               58




                                                                                                                                     29
          UNIVERSITY OF JYVÄSKYLÄ                                               DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                              Index Structures
          This and the following slides on Indexing for DW are adopted with minor
          modifications from: http://infolab.stanford.edu/~hector/cs245/Notes12.ppt
           • Indexing principle:
                   • mapping key values to records for associative direct access
           • Most popular indexing techniques in relational
             database: B+-trees
           • For multi-dimensional data, a large number of
             indexing techniques have been developed: R-trees

           • Index structures applied in warehouses
                   –    inverted lists
                   –    bit map indexes
                   –    join indexes
                   –    text indexes
TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                                59




          UNIVERSITY OF JYVÄSKYLÄ                                               DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY




                                                  Inverted Lists
                                         18
                                         19

                                                                  r4                             rId    name       age
                                                                  r18                            r4       joe      20
                       20                                                                       r18      fred       20
                                         20                       r34
                       23                                                                       r19      sally     21
                                         21                       r35
                                         22                                                     r34     nancy      20
                                                                                                r35      tom        20
                                                                  r5
                                                                                                r36       pat      25
                                                                  r19
                                         23                                                      r5     dave        21
                                                                  r37
                                         25                                                     r41       jeff     26
                                                                  r40
                                         26
                                                                                                         ...




                                                                  inverted                               data
                         age                                                                           records
                        index                                       lists
      • Query:                                                     List for age = 20: r4, r18, r34, r35
              – Get people with age = 20 and                       List for name = “fred”: r18, r52
                name = “fred”                                      Answer is intersection: r18
TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                                60




                                                                                                                                    30
          UNIVERSITY OF JYVÄSKYLÄ                                       DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                               Bitmap Indexes

           • Bitmap index: An indexing technique that has
             attracted attention in multi-dimensional database
             implementation
                 table

                         Customer                        City             Car
                            c1                          Detroit           Ford
                            c2                          Chicago          Honda
                            c3                          Detroit          Honda
                            c4                          Poznan            Ford
                            c5                           Paris           BMW
                            c6                           Paris           Nissan


TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                        61




          UNIVERSITY OF JYVÄSKYLÄ                                       DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                                   Bitmap Indexes
         • The index consists of bitmaps:
                        Index on City:                              Index on Car:
             ec1      Chicago Detroit         Paris      Poznan   ec1      BMW         Ford       Honda      Nissan
              1          0      1               0           0      1        0           1           0           0
              2          1      0               0           0      2        1           0           1           0
              3          0      1               0           0      3        0           0           1           0
              4          0      0               0           1      4        0           1           0           0
              5          0      0               1           0      5        1           0           0           0
              6          0      0               1           0      6        0           0           0           1



                                bitmaps                                            bitmaps


          •Index on a particular column
          •Index consists of a number of bit vectors - bitmaps
          •Each value in the indexed column has a bit vector (bitmaps)
          •The length of the bit vector is the number of records in the base table
          •The i-th bit is set if the i-th row of the base table has the value for the indexed
          column

TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                        62




                                                                                                                            31
          UNIVERSITY OF JYVÄSKYLÄ                                                DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                                   Bitmap Indexes
           • Index on a particular column
           • Index consists of a number of bit vectors - bitmaps
           • Each value in the indexed column has a bit vector
             (bitmaps)
           • The length of the bit vector is the number of records
             in the base table
           • The i-th bit is set if the i-th row of the base table has
             the value for the indexed column




TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                                 63




          UNIVERSITY OF JYVÄSKYLÄ                                                DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY




                                                  Bitmap Index
                                                                                           Query:
                              18                 1                                         Get people with age = 20
                                                                  id   name age
                              19                 1                                         and name = “fred”
                                                                  1      joe  20
                                                 0
                                                                  2     fred  20
                                                 1
                                                                  3     sally 21           List for age = 20:
             20                                  1
                              20                                  4    nancy 20            1101100000
             23                                  0     0
                              21                                  5     tom   20           List for name = “fred”:
                                                 0     0
                              22                                  6      pat  25           0100000001
                                                 0     1          7    dave 21
                                                 0     0
                                                                                           Answer is intersection:
                                                                  8      jeff 26
                                                       0                                   0100000000
                              23
                                                                         ...




                                                       0
                              25
                                                       1                                   Suited well for domains
                              26
                                                       0                                   with small cardinality
                                                       1                 data
                                                       1               records

               age                                 bit
              index                               maps

TIES443: Introduction to DM         Lecture 3: Data Warehousing                                                                 64




                                                                                                                                     32
          UNIVERSITY OF JYVÄSKYLÄ                                                   DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                        Bitmap Index – Summary

           • With efficient hardware support for bitmap
             operations (AND, OR, XOR, NOT), bitmap index
             offers better access methods for certain queries
                              • e.g., selection on two attributes
           • Some commercial products have implemented
             bitmap index
           • Works poorly for high cardinality domains since the
             number of bitmaps increases
           • Difficult to maintain - need reorganization when
             relation sizes change (new bitmaps)



TIES443: Introduction to DM             Lecture 3: Data Warehousing                                                                65




          UNIVERSITY OF JYVÄSKYLÄ                                                   DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY




                                                                      Join
          • “Combine” SALE, PRODUCT relations
          • In SQL: SELECT * FROM SALE, PRODUCT
            sale       prodId storeId               date        amt                product         id     name price
                         p1     c1                   1           12                                p1      bolt 10
                         p2     c1                   1           11                                p2      nut   5
                         p1     c3                   1           50
                         p2     c2                   1            8
                         p1     c1                   2           44
                         p1     c2                   2            4

                               joinTb     prodId         name         price   storeId   date       amt
                                            p1            bolt         10       c1       1          12
                                            p2            nut           5       c1       1          11
                                            p1            bolt         10       c3       1          50
                                            p2            nut           5       c2       1           8
                                            p1            bolt         10       c1       2          44
                                            p1            bolt         10       c2       2           4



TIES443: Introduction to DM             Lecture 3: Data Warehousing                                                                66




                                                                                                                                        33
          UNIVERSITY OF JYVÄSKYLÄ                                                   DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY




                                                     Join Indexes


                                                                              join index
                              product      id      name price             jIndex
                                           p1       bolt 10             r1,r3,r5,r6
                                           p2       nut   5                 r2,r4




                              sale       rId        prodId         storeId   date       amt
                                          r1          p1             c1       1          12
                                          r2          p2             c1       1          11
                                          r3          p1             c3       1          50
                                          r4          p2             c2       1          8
                                          r5          p1             c1       2          44
                                          r6          p1             c2       2          4




TIES443: Introduction to DM          Lecture 3: Data Warehousing                                                                   67




          UNIVERSITY OF JYVÄSKYLÄ                                                   DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY



                                                     Join Indexes
           • Traditional indexes map the value to a list of record
             ids. Join indexes map the tuples in the join result of
             two relations to the source tables.
           • In data warehouse cases, join indexes relate the
             values of the dimensions of a star schema to rows in
             the fact table.
                   • For a warehouse with a Sales fact table and dimension city, a
                     join index on city maintains for each distinct city a list of
                        RIDs of the tuples recording the sales in the city
           • Join indexes can span multiple dimensions



TIES443: Introduction to DM          Lecture 3: Data Warehousing                                                                   68




                                                                                                                                        34

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:2
posted:10/18/2012
language:Unknown
pages:34