Data Warehousing - PowerPoint - PowerPoint

Document Sample
Data Warehousing - PowerPoint - PowerPoint Powered By Docstoc
					What is a Data Warehouse?
 And Why Are So Many
Schools Setting Them Up?

      Richard Goerwitz
    What Is a Data Warehouse?
   Nobody can agree
   So I’m not actually going to define a DW
   Don’t feel cheated, though
   By the end of this talk, you’ll
    • Understand key concepts that underlie all
      warehouse implementations (“talk the talk”)
    • Understand the various components out of
      which DW architects construct real-world data
    • Understand what a data warehouse project
      looks like
        Why Are Schools Setting Up
           Data Warehouses?
   A data warehouse makes it easier to:
    •   Optimize classroom, computer lab usage
    •   Refine admissions ratings systems
    •   Forecast future demand for courses, majors
    •   Tie private spreadsheet data into central repositories
    •   Correlate admissions and IR data with outcomes such as:
            GPAs
            Placement rates
            Happiness, as measured by alumni surveys
    • Notify advisors when extra help may be needed based on
            Admissions data (student vitals; SAT, etc.)
            Special events: A-student suddenly gets a C in his/her major
            Slower trends: Student’s GPA falls for > 2 semesters/terms
    • (Many other examples could be given!)
   Better information = better decisions
    • Better admission decisions
    • Better retention rates
    • More effective fund raising, etc.
                   Talking The Talk
   To think and communicate usefully about data warehouses
    you’ll need to understand a set of common terms and
    •   OLTP
    •   ODS
    •   ETL
    •   Star schema
    •   Conformed dimension
    •   Data mart
    •   Cube
    •   Metadata
   Even if you’re not an IT person, pay heed:
    • You’ll have to communicate with IT people
    • More importantly:
         Evidence shows that IT will only build a successful warehouse if you
           are intimately involved!
   OLTP = online transaction processing
   The process of moving data around to
    handle day-to-day affairs
    •   Scheduling classes
    •   Registering students
    •   Tracking benefits
    •   Recording payments, etc.
   Systems supporting this kind of activity
    are called transactional systems
             Transactional Systems
   Transactional systems are optimized primarily for
    the here and now
    •   Can support many simultaneous users
    •   Can support heavy read/write access
    •   Allow for constant change
    •   Are big, ugly, and often don’t give people the data they
            As a result a lot of data ends up in shadow databases
            Some ends up locked away in private spreadsheets
   Transactional systems don’t record all previous
    data states
   Lots of data gets thrown away or archived, e.g.:
    • Admissions data
    • Enrollment data
    • Asset tracking data (“How many computers did we
      support each year, from 1996 to 2006, and where do we
      expect to be in 2010?”)
Simple Transactional Database
                  Map of Microsoft
                   Windows Update
                   Service (WUS)
                   back-end database
                   • Diagrammed using
                        Each green box is a
                         database “table”
                        Arrows are “joins” or
                         foreign keys
                        This is simple for an
                         OLTP back end
         More Complex Example
   Recruitment Plus back-end
   Used by many admissions
   Note again:
    • Green boxes are tables
    • Lines are foreign key
    • Purple boxes are views
   Considerable expertise is
    required to report off this
   Imagine what it’s like for
    even more complex
    • Colleague
    • SCT Banner (over 4,000
       The “Reporting Problem”
   Often we require OLTP data as a snapshot, in a
    spreadsheet or report
   Reports require querying back-end OLTP support
   But OLTP databases are often very complex, and
    • Contain many, often obscure, tables
    • Utilize cryptic, unintuitive field/column names
    • Don’t store all necessary historical data
   As a result, reporting becomes a problem –
    • Requires special expertise
    • May require modifications to production OLTP systems
    • Becomes harder and harder for staff to keep up!
   Ways of working around the reporting
    problem include:
    1. Have OLTP system vendors do the work
      •   Provide canned reports
      •   Write reporting GUIs for their products
    2. Hire more specialists
      •   To create simplified views of OLTP data
      •   To write reports, create snapshots
    3. Periodically copy data from OLTP systems to
       a place where
      •   The data is easier to understand
      •   The data is optimized for reporting
      •   Easily pluggable into reporting tools
   ODS = operational data store
   ODSs were an early workaround to the “reporting
   To create an ODS you
    • Build a separate/simplified version of an OLTP system
    • Periodically copy data into it from the live OLTP system
    • Hook it to operational reporting tools
   An ODS can be an integration point or real-time
    “reporting database” for an operational system
   It’s not enough for full enterprise-level, cross-
    database analytical processing
   OLAP = online analytical processing
   OLAP is the process of creating and
    summarizing historical, multidimensional
    • To help users understand the data better
    • Provide a basis for informed decisions
    • Allow users to manipulate and explore data
      themselves, easily and intuitively
   More than just “reporting”
   Reporting is just one (static) product of
      OLAP Support Databases
   OLAP systems require support databases
   These databases typically
    • Support fewer simultaneous users than OLTP
      back ends
    • Are structured simply; i.e., denormalized
    • Can grow large
          Hold snapshots of data in OLTP systems
          Provide history/time depth to our analyses
    • Are optimized for read (not write) access
    • Updated via periodic batch (e.g., nightly) ETL
                ETL Processes
   ETL = extract, transform, load
    • Extract data from various sources
    • Transform and clean the data from those sources
    • Load the data into databases used for analysis and
   ETL processes are coded in various ways
    • By hand in SQL, UniBASIC, etc.
    • Using more general programming languages
    • In semi-automated fashion using specialized ETL tools
      like Cognos Decision Stream
   Most institutions do hand ETL; but note well:
    • Hand ETL is slow
    • Requires specialized knowledge
    • Becomes extremely difficult to maintain as code
      accumulates and databases/personnel change!
     Where Does the Data Go?

   What sort of a database do the ETL
    processes dump data into?
   Typically, into very simple table
   These table structures are:
    • Denormalized
    • Minimally branched/hierarchized
    • Structured into star schemas
    So What Are Star Schemas?
   Star schemas are collections of data arranged
    into star-like patterns
    • They have fact tables in the middle, which contain
      amounts, measures (like counts, dollar amounts, GPAs)
    • Dimension tables around the outside, which contain
      labels and classifications (like names, geocodes, majors)
    • For faster processing, aggregate fact tables are
      sometimes also used (e.g., counts pre-averaged for an
      entire term)
   Star schemas should
    • Have descriptive column/field labels
    • Be easy for users to understand
    • Perform well on queries
   A Very Simple Star Schema
Data Center UPS
 Power Output

  A More Complex Star Schema
                                                   Freshman survey
                                                    data (HERI/CIRP)
                                                   Dimensions:
                                                    • Questions
                                                    • Survey years
                                                    • Data about test
                                                   Facts:
                                                    • Answer (text)
                                                    • Answer (raw)
                                                    • Count (1)
                                                   Oops
                                                    • Not a star
                                                    • Snowflaked!
Oops, answers should have been placed in their
own dimension (creating a “factless fact table”).
I’ll demo a better version of this star later!
                    Data Marts
   One definition:
    • One or more star schemas that present data on a single
      or related set of business processes
   Data marts should not be built in isolation
   They need to be connected via dimensional tables
    that are
    • The same or subsets of each other
    • Hierarchized the same way internally
   So, e.g., if I construct data marts for…
    • GPA trends, student major trends, enrollments
    • Freshman survey data, senior survey data, etc.
   …I connect these marts via a conformed student
    • Makes correlation of data across star schemas intuitive
    • Makes it easier for OLAP tools to use the data
    • Allows nonspecialists to do much of the work
     Simple Data Mart Example
Battery star
   By battery
        % charged
Input star
   By phase
Output star
   By phase
Sensor star
   By sensor
Note conformed
date, time dimensions!
CIRP Star/Data Mart
                   CIRP
                    survey data
                   Corrected
                    from a
                   Note the
                   Note student
                    (ties in with
                    other marts)
CIRP Mart in Cognos BI 8
                 ROLAP, MOLAP
   ROLAP = OLAP via direct relational query
    • E.g., against a (materialized) view
    • Against star schemas in a warehouse
   MOLAP = OLAP via multidimensional
    database (MDB)
    • MDB is a special kind of database
    • Treats data kind of like a big, fast spreadsheet
    • MDBs typically draw data in from a data
          Built to work best with star schemas
                      Data Cubes
   The term data cube
    means different things to
    different people
   Various definitions:
    1.   A star schema
    2.   Any DB view used for
    3.   A three-dimensional
         array in a MDB
    4.   Any multidimensional
         MDB array (really a
   Which definition do you
    suppose is technically
   Metadata = data about data
   In a data warehousing context it can mean many
    •   Information on data in source OLTP systems
    •   Information on ETL jobs and what they do to the data
    •   Information on data in marts/star schemas
    •   Documentation in OLAP tools on the data they
   Many institutions make metadata available via
    data malls or warehouse portals, e.g.:
    •   University of New Mexico
    •   UC Davis
    •   Rensselear Polytechnic Institute
    •   University of Illinois
   Good ETL tools automate the setup of
            The Data Warehouse
   OK now we’re experts in terms like
    OLTP, OLAP, star schema, metadata, etc.
   Let’s use some of these terms to describe how a
    DW works:
    •   Provides ample metadata – data about the data
    •   Utilizes easy-to-understand column/field names
    •   Feeds multidimensional databases (MDBs)
    •   Is updated via periodic (mainly nightly) ETL jobs
    •   Presents data in a simplified, denormalized form
    •   Utilizes star-like fact/dimension table schemas
    •   Encompasses multiple, smaller data “marts”
    •   Supports OLAP tools (Access/Excel, Safari, Cognos BI)
    •   Derives data from (multiple) back-end OLTP systems
    •   Houses historical data, and can grow very big
     A Data Warehouse is Not…
   Vendor and consultant proclamations
    aside, a data warehouse is not:
    • A project
          With a specific end date
    • A product you buy from a vendor
          Like an ODS (such as SCT’s)
          A canned “warehouse” supplied by iStrategy
          Cognos ReportNet
    • A database schema or instance
          Like Oracle
          SQL Server
    • A cut-down version of your live transactional
    Kimball & Caserta’s Definition
   According to Ralph Kimball and Joe
    Caserta, a data warehouse is:
         A system that extracts, cleans, conforms, and
         delivers source data into a dimensional data
         store and then supports and implements
         querying and analysis for the purpose of
         decision making.

   Another def.: The union of all the enterprise’s data marts
   Aside: The Kimball model is not without some critics:
     •   E.g., Bill Inmon
    Example Data Warehouse (1)

   This one is
   5 parts:
    •   Sources
    •   ETL stuff
    •   DW proper
    •   Cubes etc.
    •   OLAP apps
Example Data Warehouse (2)
                     Caltech’s DW
                     Five Parts:
                      • Source systems
                      • ETL processes
                      • Data marts
                      • FM/metadata
                      • Reporting and
                        analysis tools
                      • Note: They’re
                        also customers
                        of Cognos!
    So Where is Colorado College?
   Phil Goldstein (Educause Center for Applied
    Research fellow) identifies the major deployment
    •   Level 1: Transactional systems only
    •   Level 2a: ODS or single data mart; no ETL
    •   Level 2: ODS or single data mart with ETL tools
    •   Level 3a: Warehouse or multiple marts; no ETL; OLAP
    •   Level 3b: Warehouse or multiple marts; ETL; OLAP
    •   Level 3: Enterprise-wide warehouse or multiple marts;
        ETL tools; OLAP tools
   Goldstein’s study was just released in late 2005
   It’s very good; based on real survey data
   Which level is Colorado College at?
    Implementing a Data Warehouse
   In many organizations IT people want to huddle and work
    out a warehousing plan, but in fact
    • The purpose of a DW is decision support
    • The primary audience of a DW is therefore College decision
    • It is College decision makers therefore who must determine
           Scope
           Priority
           Resources
   Decision makers can’t make these determinations without
    an understanding of data warehouses
   It is therefore imperative that key decision makers first be
    educated about data warehouses
    • Once this occurs, it is possible to
           Elicit requirements (a critical step that’s often skipped)
           Determine priorities/scope
           Formulate a budget
           Create a plan and timeline, with real milestones and deliverables!
    Is This Really a Good Plan?
   Sure, according to Phil Goldstein (Educause Center for
    Applied Research)
   He’s conducted extensive surveys on “academic analytics”
    (= business intelligence for higher ed)
   His four recommendations for improving analytics:
    1. Key decisionmakers must lead the way
    2. Technologists must collaborate
       •   Must collect requirements
       •   Must form strong partnerships with functional sponsors
    3. IT must build the needed infrastructure
       •   Carleton violated this rule with Cognos BI
       •   As we discovered, without an ETL/warehouse
           infrastructure, success with OLAP is elusive
    4. Staff must train and develop deep analysis skills
   Goldstein’s findings mirror closely the advice of industry
    heavyweights – Ralph Kimball, Laura Reeves, Margie
    Ross, Warren Thornthwaite, etc.
Isn’t a DW a Huge Undertaking?
   Sure, it can be huge
   Don’t hold on too tightly to the big-
    sounding word, “warehouse”
   Luminaries like Ralph Kimball have shown
    that a data warehouse can be built
    • Can start with just a few data marts
    • Targeted consulting help will ensure
      proper, extensible architecture and tool
What Takes Up the Most Time?
   You may be surprised          90

    to learn what DW step         80
    takes the most time           60

   Try guessing which:           50

    •   Hardware                  30                                       North
                                                                         OLAP tools

    •   Physical database setup   20
    •   Database design            0
                                       1st Qtr 2nd Qtr 3rd Qtr 4th Qtr
    •   ETL
    •   OLAP setup

Acc. to Kimball & Caserta, ETL will eat up 70% of the time.
Other analysts give estimates ranging from 50% to 80%.
The most often underestimated part of the warehouse
   Eight Month Initial Deployment
              Step                Duration                Step              Duration
Begin educating decision makers   21 days    Secure, configure network     1 day
Collect requirements              14 days    Deploy physical “target” DB   4 days
Decide general DW design          7 days     Learn/deploy ETL tool         28 days
Determine budget                  3 days     Choose/set up modeling tool   21 days
Identify project roles            1 day      Design initial data mart      7 days
Eval/choose ETL tool              21 days    Design ETL processes          28 days
Eval/choose physical DB           14 days    Hook up OLAP tools            7 days
Spec/order, configure server      20 days    Publicize, train, train       21 days
   Information is held in transactional systems
    • But transactional systems are complex
    • They don’t talk to each other well; each is a silo
    • They require specially trained people to report off of
   For normal people to explore institutional data, data in
    transactional systems needs to be
    • Renormalized as star schemas
    • Moved to a system optimized for analysis
    • Merged into a unified whole in a data warehouse
   Note: This process must be led by “customers”
    • Yes, IT people must build the infrastructure
    • But IT people aren’t the main customers
   So who are the customers?
    •   Admissions officers trying to make good admission decisions
    •   Student counselors trying to find/help students at risk
    •   Development offers raising funds that support the College
    •   Alumni affairs people trying to manage volunteers
    •   Faculty deans trying to right-size departments
    •   IT people managing software/hardware assets, etc….