Data Warehouse Architecture Best Practices by yyc68236

VIEWS: 31 PAGES: 61

									cohesion institute




     Data Warehouse
Architecture Best Practices

               December 5, 2005




          Speaker: R. Michael Pickering
   President, Cohesion Systems Consulting Inc.
cohesion institute




Agenda
        Introductions
        Business Intelligence Background
        Architecture Best Practices
        Questions & Answers




     September 2, 2010   DW Architecture Best Practices   2
cohesion institute




     Data Warehouse
Architecture Best Practices

                     Introductions
cohesion institute




Presenter Biography
   R. Michael Pickering
     President and Chief Architect,
     Cohesion Systems Consulting Inc.
              previously, Managing Consultant, BI&W, Oracle
               Consulting (Canada)
              before that, Red Brick Systems, Inc.
        over 8 years DW experience
              Manulife Reinsurance, Bell Canada, USDA, Kraft
               Foods, LCBO, Telecom Argentina, Nortel Networks,
               Procter & Gamble, Bayer, Syncrude, OMoHLTC…
        Mr. Pickering has had DW articles published
         in The Handbook of Data Management

     September 2, 2010       DW Architecture Best Practices    4
cohesion institute




Cohesion Systems Consulting
   Provides DW and BI services,
    specializing in:
        Architecture & Implementation Consulting
        Project Management
        Databases, Appliances & Emerging
         Technology
        Training & Mentoring
   Since inception in 2000, clients have
    included Enbridge, CIBC, The Bank of
    New York, Loyalty Management Group,
    Canada Post Borderfree, Katz Group
     September 2, 2010   DW Architecture Best Practices   5
cohesion institute




Audience Survey
   By a show of hands, please indicate your
    experience with:
        normalization
        dimensional modeling
        operational data store
        data consolidation
        Extract Transform Load (ETL)
        metadata architecture
        DW appliances

     September 2, 2010   DW Architecture Best Practices   6
cohesion institute




     Data Warehouse
Architecture Best Practices

          Business Intelligence
              Background
cohesion institute




What is Business Intelligence?
    A Data Warehouse is usually one
     component of an overall business
     intelligence solution
    IT people may be tempted to think in
     terms of products and technologies

     BUT...



     September 2, 2010   DW Architecture Best Practices   8
cohesion institute




Overarching Goal
                            The overarching goal of
                             business intelligence is to
                             provide the information
                             necessary to MANAGE a
                             business
                            This means providing
                             information in support of
                             management decision making,
                             which is why BI is also called
                             “Decision Support”
     September 2, 2010          DW Architecture Best Practices   9
cohesion institute




BI is about “Data Abstraction”
                         wisdom
              Stages     knowledge
                (4)      information
                         data

   audience for a data warehouse typically
    considers higher slices of data abstraction
    pyramid
   lowest level of pyramid is too detailed &
    unwieldy
     September 2, 2010   DW Architecture Best Practices   10
cohesion institute




It’s Not Technology
   Business Intelligence is about delivering
    business value
        provide tangible benefit by answering important
         questions that can help the business to achieve its
         strategic focus
              Improving profitability
                    Who are our five most profitable clients?
                    What are our least profitable products?
              Reducing cost
                    Who are our lowest cost suppliers?
                    Which materials incur highest spoilage costs?
              Improving customer satisfaction
                    What factors may lead to lost customers?



     September 2, 2010             DW Architecture Best Practices    11
cohesion institute




Business of BI
     In some cases, legislation such as Sarbanes-
      Oxley or Basel II makes some kind of BI
      fundamental to doing business
     Many leading companies use BI to achieve
      competitive advantage
          E.g. Walmart, Dell, Amazon.com, Kraft, American
           Express, etc…




     September 2, 2010     DW Architecture Best Practices    12
cohesion institute




Data Warehouse Architecture
   architecture is about delivering
    an elegant solution that meets
    the solution requirements
        this means really understanding the
         problem
   DW architecture is part art, part
    science

     September 2, 2010   DW Architecture Best Practices   13
cohesion institute




Good Architecture
   „It‟s not easy to describe a good
    design, but I‟ll know it when I
    see it‟




     September 2, 2010   DW Architecture Best Practices   14
cohesion institute




BI Architecture Requirements
   must recognize change as a
    constant
   take incremental development
    approach
   existing applications must
    continue to work
   need to allow more data and
    new types of data to be added

     September 2, 2010   DW Architecture Best Practices   15
cohesion institute




End User Acceptance
                understandability
                        understandability is in the eyes
                         of the beholder
                        want to hide the complexity
                        try to make it:
                            intuitive, obvious
                            visible, memorable



     September 2, 2010             DW Architecture Best Practices   16
cohesion institute




End User Acceptance
                            performance
                                 don‟t want to interrupt
                                  the thinking process
                                 provide one click,
                                  instantaneous access
                                 warehouse must be
                                  available, “production”
                                  system

     September 2, 2010           DW Architecture Best Practices   17
cohesion institute




     Data Warehouse
Architecture Best Practices

                     Architecture
                     Best Practices
cohesion institute




High Level Architecture
    remember the different “worlds”
          on-line transaction processing (OLTP)
          business intelligence systems (BIS)
    users are different
    data content is different
    data structures are different
    architecture & methodology must be
     different
     September 2, 2010   DW Architecture Best Practices   19
cohesion institute




Two Different Worlds
   On-Line Transaction Processing
        Entity Relational Data Model
              created in 1960‟s to address performance
               issues with relational database
               implementations
              normalized to most efficiently get data in
              divides the data into many discrete entities
              many relationships between these entities
              this approach was documented by C.J. Date in
               An Introduction to Database Systems


     September 2, 2010       DW Architecture Best Practices   20
cohesion institute




Two Different Worlds
      Business Intelligence Systems
           Dimensional Data Model
                 also called star schema
                 designed to easily get information out
                 fewer relationships than ERD, the only
                  table with multiple joins connecting to
                  other tables is the central table
                 developed in 1960‟s by data service
                  providers, formalized by Ralph Kimball in
                  The Data Warehouse Toolkit

      September 2, 2010      DW Architecture Best Practices   21
cohesion institute




Entity Relation Disadvantages
   all tables look the same
   people can‟t visualize/remember
    diagrams
   software can‟t navigate as schema
    becomes too complex
   business processes mixed together
   many artificial keys created
     September 2, 2010   DW Architecture Best Practices   22
cohesion institute


Dimensional Model
Advantages
      simplicity
      humans can navigate and remember
      software can navigate
       deterministically
      business process explicitly separated
       (Data Mart)
      not so many keys (keys = # of
       attendant tables)
      September 2, 2010   DW Architecture Best Practices   23
cohesion institute




Best Practice #1
   Use a data model that is optimized
    for information retrieval
        dimensional model
        denormalized
        hybrid approach




     September 2, 2010   DW Architecture Best Practices   24
cohesion institute




Data Acquisition Processes
   Extract Transform Load (ETL)
        the process of unloading or copying data
         from the source systems, transforming it
         into the format and data model required in
         the BI environment, and loading it to the DW
        also, a software development tool for
         building ETL processes (an ETL tool)
        many production DWs use COBOL or other
         general-purpose programming languages to
         implement ETL

     September 2, 2010   DW Architecture Best Practices   25
cohesion institute




Data Quality Assurance
   data cleansing
        the process of validating and enriching
         the data as it is published to the DW
        also, a software development tool for
         building data cleansing processes (a data
         cleansing tool)
        many production DWs have only very
         rudimentary data quality assurance
         processes

     September 2, 2010   DW Architecture Best Practices   26
cohesion institute




Data Acquisition & Cleansing
   getting data loaded efficiently and
    correctly is critical to the success of
    your DW
        implementation of data acquisition &
         cleansing processes represents from 50
         to 80% of effort on typical DW projects
        inaccurate data content can be „the kiss
         of death‟ for user acceptance

     September 2, 2010   DW Architecture Best Practices   27
cohesion institute




Best Practice #2
   Carefully design the data
    acquisition and cleansing processes
    for your DW
        Ensure the data is processed
         efficiently and accurately
        Consider acquiring ETL and Data
         Cleansing tools
        Use them well!

     September 2, 2010   DW Architecture Best Practices   28
cohesion institute




Data Model
   Already discussed the benefits of a
    dimensional model
   No matter whether dimensional
    modeling or any other design
    approach is used, the data model
    must be documented


     September 2, 2010   DW Architecture Best Practices   29
cohesion institute




Documenting the Data Model
     The best practice is to use some kind of data
      modeling tool
          CA ERwin
          Sybase PowerDesigner
          Oracle Designer
          IBM Rational Rose
          Etc.
     Different tools support different modeling notations,
      but they are more or less equivalent anyway
     Most tools allow sharing of their metadata with an
      ETL tool

     September 2, 2010     DW Architecture Best Practices   30
cohesion institute




Data Model Standards
   data model standards appropriate for the
    environment and tools chosen in your data
    warehouse should be adopted
   considerations should be given to data access
    tool(s) and integration with overall enterprise
    standards
   standards must be documented and enforced
    within the DW team
        someone must „own‟ the data model
   to ensure a quality data model, all changes
    should be reviewed thru some formal process

     September 2, 2010   DW Architecture Best Practices   31
cohesion institute




Data Model Metadata
    Business definitions should be recorded
     for every field (unless they are technical
     fields only)
    Domain of data should be recorded
    Sample values should be included
    As more metadata is populated into the
     modeling tool it becomes increasingly
     important to be able to share this data
     across ETL and Data Access tools

     September 2, 2010   DW Architecture Best Practices   32
cohesion institute




Metadata Architecture
   The strategy for sharing data model
    and other metadata should be
    formalized and documented
   Metadata management tools should
    be considered & the overall
    metadata architecture should be
    carefully planned

     September 2, 2010   DW Architecture Best Practices   33
cohesion institute




Best Practice #3
   Design a metadata architecture
    that allows sharing of metadata
    between components of your DW
        consider metadata standards such as
         OMG‟s Common Warehouse
         Metamodel (CWM)



     September 2, 2010   DW Architecture Best Practices   34
cohesion institute


Alternative Architecture
Approaches
   Bill Inmon: “Corporate Information
    Factory”
   Hub and Spoke philosophy
   “JBOC” – just a bunch of cubes
   Let it evolve naturally



     September 2, 2010   DW Architecture Best Practices   35
cohesion institute


What We Want
(Architectural Principal)
   In most cases, business and IT
    agree that the data warehouse
    should provide a „single version of
    the truth‟
   Any approach that can result in
    disparate data marts or cubes is
    undesireable
   This is known as data silos or…

     September 2, 2010   DW Architecture Best Practices   36
cohesion institute




Enterprise DW Architecture
        how to design an enterprise data
         warehouse and ensure a „single
         version of the truth‟?
        according to Kimball:
             start with an overall data architecture
              phase
             use “Data Warehouse Bus” design to
              integrate multiple data marts
             use incremental approach by building one
              data mart at a time
     September 2, 2010    DW Architecture Best Practices   37
cohesion institute


Data Warehouse Bus
Architecture
           named for the bus in a computer
                standard interface that allows you to plug
                 in cdrom, disk drive, etc.
                these peripherals work together smoothly
           provides framework for data marts to
            fit together
           allows separate data marts to be
            implemented by different groups, even
            at different times
     September 2, 2010      DW Architecture Best Practices   38
cohesion institute




Data Mart Definition
   data mart is a complete subset of the
    overall data warehouse
         a single business process OR
         a group of related business processes
   think of a data mart as a collection of
    related fact tables sharing conformed
    dimensions, aka a „fact constellation‟


     September 2, 2010   DW Architecture Best Practices   39
cohesion institute




Designing The DW Bus
   determine which dimensions will be
    shared across multiple data marts
   conform the shared dimensions
              produce a master suite of shared
               dimensions
   determine which facts will be shared
    across data marts
   conform the facts
              standardize the definitions of facts
     September 2, 2010      DW Architecture Best Practices   40
cohesion institute




Dimension Granularity
   conformed dimensions will usually
    be granular
        makes it easy to integrate with
         various base level fact tables
        easy to extend fact table by adding
         new facts
        no need to drop or reload fact tables,
         and no keys have to be changed

     September 2, 2010   DW Architecture Best Practices   41
cohesion institute




Conforming Dimensions
    by adhering to standards, the separate
     data marts can be plugged together
       e.g. customer, product, time

    they can even share data usefully, for
     example in a drill across report
    ensures reports or queries from different
     data marts share the same context


     September 2, 2010   DW Architecture Best Practices   42
cohesion institute




Conforming Dimensions (cont’d)

    accomplish this by adding any dimension
     attribute(s) needed in any data mart(s)
     to the standard dimension definition
         attributes not needed everywhere can always be
          ignored
    typically harder to determine how to load
     conformed dimensions than to design
     them initially
         need a single integrated ETL process
         what is the SOR for each attribute?
         how do we deal with attributes for which there is
          more than one possible SOR?
     September 2, 2010    DW Architecture Best Practices   43
cohesion institute




Conforming Facts

   in an enterprise, some metrics may
    not have the same generally
    accepted definition across all
    business units
   conforming facts is generally a
    bigger design challenge than
    conforming dimensions
        why?

     September 2, 2010   DW Architecture Best Practices   44
cohesion institute




Conforming Facts - Benefits
   ensures the constituent data marts can as
    clearly as possible represent fact data
    expressed on the same basis using consistent
    definitions
   ensures reports or queries from different data
    marts share consistent content
   success of an Enterprise DW hinges on
    successfully conformed facts
        any perceived inconsistencies in fact definitions across
         data marts will generally be considered to be a DW
         bug or data problem by users
        if users don‟t have full confidence in data quality they
         may stop using the DW

     September 2, 2010     DW Architecture Best Practices      45
cohesion institute




Data Consolidation
   a current trend in BI/DW is „data
    consolidation‟
   from a software vendor
    perspective, it is tempting to
    simplify this:
        „we can keep all the tables for all your
         disparate applications in one physical
         database‟
     September 2, 2010   DW Architecture Best Practices   46
cohesion institute




Data Integration
    To truly achieve „a single version of
     the truth‟, must do more than simply
     consolidating application databases
    Must integrate data models and
     establish common terms of reference




     September 2, 2010   DW Architecture Best Practices   47
cohesion institute




Best Practice #4
   Take an approach that consolidates
    data into „a single version of the
    truth‟
        Data Warehouse Bus
              conformed dimensions & facts
        OR?



     September 2, 2010    DW Architecture Best Practices   48
cohesion institute




Operational Data Store (ODS)
   a single point of integration for disparate
    operational systems
   contains integrated data at the most
    detailed level (transactional)
   may be loaded in „near real time‟ or
    periodically
   can be used for centralized operational
    reporting

     September 2, 2010   DW Architecture Best Practices   49
cohesion institute


Role of an ODS in DW
Architecture
   In the case where an ODS is a
    necessary component of the overall
    DW, it should be carefully integrated
    into the overall architecture
   Can also be used for:
         Staging area
         Master/reference data management
         Etc…

     September 2, 2010   DW Architecture Best Practices   50
cohesion institute




ODS Data Model
   Not clear if any design approach for
    an ODS data model has emerged
    as a best practice
        normalized
        dimensional
        denormalized/hybrid
        any suggestions?


     September 2, 2010   DW Architecture Best Practices   51
cohesion institute




Best Practice #5
   Consider implementing an ODS only
    when information retrieval requirements
    are near the bottom of the data
    abstraction pyramid and/or when there
    are multiple operational sources that
    need to be accessed
        Must ensure that the data model is
         integrated, not just consolidated
        May consider 3NF data model
        Avoid at all costs a „data dumping ground‟

     September 2, 2010   DW Architecture Best Practices   52
cohesion institute




Capacity Planning
   DW workloads are typically very
    demanding, especially for I/O capacity
   Successful implementations tend to grow
    very quickly, both in number of users
    and data volume
   Rules of thumb do exist for sizing the
    hardware platform to provide adequate
    initial performance
        typically based on estimated „raw‟ data size
         of proposed database e.g. 100-150 Gb per
         modern CPU

     September 2, 2010   DW Architecture Best Practices   53
cohesion institute




SMP Server Scale Up
   Scaling performance within a single SMP
    server is referred to as „scale up‟
   Database benchmarks suggest Windows
    scalability is near that of Linux
   IBM claims near-linear scalability for Linux
    (on commodity hardware) up to about 4
    processors
        Probably not cost effective to scale up Linux
         much beyond 4 processors
   IBM claims near-linear scalability for AIX
    on POWER5 up to about 8 processors
     September 2, 2010   DW Architecture Best Practices   54
cohesion institute




Scale Out
   There is an increasing trend in IT to „scale out‟
    processing capacity by deploying many small,
    commodity servers rather than a single large
    SMP system
   This strategy tends to work well for relatively
    simple applications such as network or web
    servers
   For very complex workloads such as a data
    warehouse, this strategy is much more difficult
    to effectively implement
        Especially so for the database server itself

     September 2, 2010      DW Architecture Best Practices   55
cohesion institute




Scale Up vs. Scale Out
   To obtain the total number of processors
    required for the estimated DW workload,
    must plan either to scale up or scale out
   Both options are viable but, all other
    things being equal, scaling up is less
    disruptive to end users and requires less
    work to implement
        scaling up can offer lower hardware
         investment, if practical
        however, network bandwidth or latency
         issues can limit effectiveness of parallelism

     September 2, 2010   DW Architecture Best Practices   56
cohesion institute




Best Practice #6
   Create a capacity plan for your BI
    application & monitor it carefully
   Consider future additional performance
    demands
        Establish standard performance benchmark
         queries and regularly run them
        Implement capacity monitoring tools
        Build scalability into your architecture
        May need to allow for scaling both up and
         out!

     September 2, 2010   DW Architecture Best Practices   57
cohesion institute




Open Source Affordability
   Another emerging trend in IT generally is to
    utilize Open Source software running on
    commodity hardware
        this is expected to offer lower total cost of ownership
        certainly, GNU/Linux and other Open Source
         initiatives do provide very good functionality and
         quality for minimal cost
   This trend also applies to BI & DW:
        most traditional rdbms‟s are now supported on Linux
        however, open source rdbms‟s lag behind on
         providing good performance for DW queries


     September 2, 2010      DW Architecture Best Practices     58
cohesion institute




DW Appliances
   DW appliances, consisting of
    packaged solutions providing all
    required software and hardware, are
    beginning to offer very promising
    price/performance
   production experience is limited so
    far, so this is not yet a „best practice‟


     September 2, 2010   DW Architecture Best Practices   59
cohesion institute




     Data Warehouse
Architecture Best Practices


                     Q&A
 cohesion institute




cohesion systems consulting inc


               the modern art
             of data abstraction

								
To top