BI _ DM for CRM by liuhongmei


									  BI & DM for CRM

        Lecture 2
Data Warehouse and OLAP
Steps in Knowledge Discovery
                Steps in KD cont...
   Data Cleaning: To remove noise and inconsistent data / check data
    validity & correctness – PLAY WİTH DATA (good step to know your
    data)                         Data
    Data İntegration: To combine multiple sources of data / integrate
    various data sources on various platforms
    (DBMS/flatfiles/spreedsheets etc..)
   Data Selection: Retrieve relevent data from data repository for
   Data Transformation: To transform or consolidate data into different
    forms to deploy in DM operations. (If Data Warehouses were used,
    the process would be before the data selection.)
   Data Mining: To apply intelligent models for extracting patterns from
   Pattern Evaluation: To identify interesting patterns, result of DM
    process, on some measures.
   Knowledge presentation: To visualize the representation of
    knowledge, driven by DM process and pattern evaluationz
DM System Architecture
  So, what is Data Warehouse?
  Loosely speaking;
  • A decision support database that is
    maintained separately from the
    organization‟s operational database
  • Support information processing by
    providing a solid platform of consolidated,
    historical data for analysis.
 W.H. Inmon- „ A data warehouse is a
  subject-oriented, integrated, time-varient
  and nonvolatile collection of data in support
  of management‟s decision making proces.‟
   Organized around major subjects, such as customer,
    product, sales.
   Focusing on the modeling and analysis of data for
    decision makers, not on daily operations or transaction
   Provide a simple and concise view around particular
    subject issues by excluding data that are not useful in
    the decision support process.


 Constructed by integrating multiple, heterogeneous
  data sources
  relational databases, flat files, on-line transaction
 Data cleaning and data integration techniques are
  Ensure consistency in naming conventions,
    encoding structures, attribute measures, etc. among
    different data sources
           E.g., Hotel price: currency, tax, breakfast covered, etc.
       When data is moved to the warehouse, it is
               DW—Time Variant

   The time horizon for the data warehouse is significantly
    longer than that of operational systems.
       Operational database: current value data.
       Data warehouse data: provide information from a
        historical perspective (e.g., past 5-10 years)
   Every key structure in the data warehouse
       Contains an element of time, explicitly or implicitly
       But the key of operational data may or may not contain
        “time element”.

   A physically separate store of data transformed from
    the operational environment.
   Operational update of data does not occur in the data
    warehouse environment.
       Does not require transaction processing, recovery,
        and concurrency control mechanisms
       Requires only two operations in data accessing:
        • initial loading of data and access of data.

   On-line transaction processing vs On-line
              Analytical Processing
                     OLTP                        OLAP
users                clerk, IT professional      knowledge worker
function             day to day operations       decision support
DB design            application-oriented        subject-oriented
data                 current, up-to-date         historical,
                     detailed, flat relational   summarized, multidimensional
                     isolated                    integrated, consolidated
usage                repetitive                  ad-hoc
access               read/write                  lots of scans
                     index/hash on prim. key
unit of work         short, simple transaction   complex query
# records accessed   tens                        millions
#users               thousands                   hundreds
DB size              100MB-GB                    100GB-TB
metric               transaction throughput      query throughput, response

       Why seperate Database?
   Main reason is high performance for both
    systems – processing OLAP queries in
    transactional DBs would substantially
    degrade the performance of operational
   Unlike OLTP, OLAP usually needs read only
    access from the database. So, for OLAP
    operations DBs does not require concurrent
    control and recovery. If applied for OLAP this
    may jeopardize the execution of OLTP
   Systems are mainly used for different
   Data Model- Multidimentional

DW & OLAP are based on multidementional
 data models...

Actually... It is all about multidimentional
 data models...
                   2D view of data
Facebook Data...
     Time         Video       Photo   Write Up
     W1           200         500     1000
     W2           250         520     1023
     W3           310         782     1051

 Please do example for TeknoSA!!!
             3D view of Data cont

            MALE               FEMALE                 N/A
        POSTING                POSTING              POSTING
Time   Video Photo Write Video Photo Write     Video Photo Write
                   Up                Up                    Up
W1     50     120   400   80    150      500   70    230      100
                LATTICE OF CUBOIDS
                                               ALL              0 D – Apex cuboid

1-D Cuboids                         Posting               Gender          Education

     Time, Posting      Time, Gender Time, Education
2-D Cuboids                                            Posting, Time,
                                                       Gender Education
       Time, Posting,          Time, Posting, Time, Gender,       Posting, Gender,
       Gender                  Education      Education           Education
3-D Cuboids

                             Time, Posting, Gender, Education       4-D Cuboids
                Latice & Cuboids
  Any n-D data as a series of (n-1)-D “cubes”
 In data warehousing literature,
 A data cube is referred to as a cuboid
 The lattice of cuboids forms a data cube.
 The cuboid holding the lowest level of summarization is called a
   the 4-D cuboid is the base cuboid for the given four dimensions
• base cuboid.
 The top most 0-D cuboid, which holds the highest-level of
   summarization, is called the apex cuboid.
   Here this is the total posting
   typically denoted by all
     Conceptual Modelling of DW
   Modeling data warehouses: dimensions & measures
    – Star schema: A fact table in the middle connected to a set
      of dimension tables
    – Snowflake schema: A refinement of star schema where
      some dimensional hierarchy is normalized into a set of
      smaller dimension tables, forming a shape similar to
    – Fact constellations: Multiple fact tables share dimension
      tables, viewed as a collection of stars, therefore called
      galaxy schema or fact constellation
                 STAR SCHEMA
                          Post            Post Type
                          Time_key        Post_key
                          Post_key        Post _name
                          Total Post
Year                                      Gender
           Snow Flake Schema
                          FACT TABLE
                          Post                           Post Type
                          Time_key                       Post_key
                          Post_key                       Post _name
                          Total Post
Year                                                      Gender
                                          Edu Type
                       Fact Constellation
Time                    Post                     Post Type
Time_key                Time_key                 Post_key
Day                     Post_key                 Post _name
Day_of_week             Education_key                         Chat
Month                   Gender_key                            Time_key
Quarter                 Total Post                            Education_key
Year                                                          Gender_key
        Education                                             Total_online_time
    Categorization & Computation
   How are measures computed?
       Data Cube measure is a numarical function that
        can be evaluated at each point in the data cube
   Distributive- sum(), min(), max()-
       How many posts are from USA?
           Sum (Washington)+Sum(Nevada)....
   Algebraic: sum(),count(),
   Holisatic: median(), mode()
         Concept Hierarchies

   What is it?
       Flow of a concept (could be dimension)
    from lower levels to higher levels.
       More specific          More General
          Example for concept hierarchy
all                                  all

region                  Europe             ...        North_America

country       Germany      ...   Spain             Canada     ...   Mexico

city        Frankfurt   ...           Vancouver ...         Toronto

office                           L. Chan     ...   M. Wind

Hierarchical vs lattice Structures
       (Concept Hierarchy)

                or state

                                         month                                week

 We also have user defined concept hierarcies like Fiscal year or Academic Year
Please find one example for hierarchical and lattice structures
        OLAP Operations in Multi
           Dimensional Data
   How can we use Concept Hierarchies
    useful in OLAP?
         In multi dimensional model, data are organized in multi
          dimensions and each dimension contains multi level of
          abstraction defined by concept hierarchies
         This type of organization provides user to view the data
          from various perspectives
         Basically, OLAP provides user friendly environment for
          interactive data analysis.
                  OLAP ACTIONS
   Roll-up (drill-up)
   Drill-down
   Slice and Dice: The slice operation performs a
    selection on one dimension of the given cube, resulting in a
    subcube. The operation defines a subcube by performing a
    selection on two or more dimensions.
   Pivot: visualization operation that rotates the data axes in
    view in order to provide an alternative presentation of the data.
                Roll up action exp...
   Roll-up (drill-up):Performs aggregation on a data cube, either
    by climbing up a concept hierarchy for a dimension or
    dimension reduction.
                       location by city
                       Istanbul Ankara Berlin    Münih
            PC                 20       30    50       40
            Printer            15        5    10       20
          Hierarchy Roll Up                Dimension Roll Up

         Location by country
                      TR       GR             PC            140
         PC           50       90             Printer       50
         Printer      20       30
          Drill down action exp...
   Drill Down:Drill down is reverse of roll-up. It navigates
    from general hierarchy to more specific hierarchy. Adding new
    dimension to data

            Year 2009                        Quarter
PC          100                              Q1   Q2    Q3   Q4
Printer     150
                                   PC        20   30    40   10

                                   Printer   30   60    20   40
          Slice & Dice Action exp...
     The slice operation performs a selection on one dimension of
      the given cube, resulting in a subcube. The operation defines a
      subcube by performing a selection on two or more dimensions.

                                          Measure    Sales
                                                     Time 2010
           Quarter              Slice
           Q1   Q2    Q3   Q4
                                          PC         20
PC         20   30    40   10
                                          Printer    30
Printer    30   60    20   40                         Time 2010

                                          PC          20
            Other OLAP Actions
   drill across: involving (across) more than one fact table
   drill through: through the bottom level of the cube to its back-
    end relational tables (using SQL)
   ranking the top N or bottom N items in lists
   moving averages
   growth rates
   interests
        Star-Net Query Model
   Radial lines from a central point
   each line represents a concept
    hierarchy for a dimension
   each abstraction level is called a
    granularities available for use by OLAP
    four radial lines,for concept hierarchies
      location,customer,item,time
    time line has 4 footprints:
      day,month,quarter,year
                  A Star-Net Query Model
                                Customer Orders
       Shipping Method

                                                      PRODUCT LINE
Time                                                                            Product
                                                     SALES PERSON

Location     Each circle is
             called a footprint     Promotion                         Organization
Design of a Data Warehouse: A
 Business Analysis Framework
   Four views regarding the design of a data warehouse
       Top-down view
            allows selection of the relevant information necessary for the
             data warehouse
       Data source view
            exposes the information being captured, stored, and
             managed by operational systems
       Data warehouse view
            consists of fact tables and dimension tables
       Business query view
            sees the perspectives of data in the warehouse from the view
             of end-user
    Data Warehouse Design
    Top-down, bottom-up approaches or a combination of both
        Top-down: Starts with overall design and planning (mature)
        Bottom-up: Starts with experiments and prototypes (rapid)
•   From software engineering point of view
        Waterfall: structured and systematic analysis at each step before
         proceeding to the next
        Spiral: rapid generation of increasingly functional systems, short turn
         around time, quick turn around
•   Typical data warehouse design process
        Choose a business process to model, e.g., orders, invoices, etc.
        Choose the grain (atomic level of data) of the business process
        Choose the dimensions that will apply to each fact table record
        Choose the measure that will populate each fact table record
Three Data Warehouse Models
   Enterprise Warehouse
       Collects all information about subject spanning of the
        entire organization
   Data Mart
       A subset of corporate-wide data that is valuable to
        specific groups of users. Such as marketing
   Virtual Warehouse
       A set of views over operational databases
       Only some of the possible summary views may be
            Storage of a cube
   Cuboids are referred as aggregations
   One factor affecting storage requirements
   Sparsity: the amount of empty cells in a cube
   The base cuboid is likely to contain many empty
     it is a spares cube or array
           the 0 or lower dimensional cuboids are
            less spares than the higher dimensional
     it is not likely that they contain empty cells
     Moving along higher levels for the dimension
     the cuboids becomes less spares or more
                  PC    Prt CD DV
                                         Two dimensional
 01.10.2003 10                           sparse cuboid
 02.10.2003             1
 03.10.2003 4                   2
 04.10.2003             2


One dimensional dense
                            01.10.2003       10
cuboid                      02.10.2003       1
                            03.10.2003       6
                            04.10.2003       2
    OLAP Server Architectures
   ROLAP – Relational OLAP
   MOLAP – Multidimensional OLAP
   HOLAP – Hybrid OLAP
   Use relational or extended-relational DBMS to store and manage
    warehouse data and OLAP middle ware to support missing pieces
   query response is generally slower
   low storage requirement
   Include optimization of DBMS backend, implementation of
    aggregation navigation logic, and additional tools and services
   greater scalability
   appropriate for large data sets that are infrequently queried
     historical data from less recent previous years
   Array-based multidimensional storage engine (sparse matrix
   fast indexing to pre-computed summarized data
   a two-level storage representation
     dense subcubes are stored as array structures

     spars subcubes are stored by compression techniques

   appropriate for cubes with frequent use and rapid query response
   combines ROLAP and MOLAP benefiting from
     greater scalability of ROLAP

     faster computation of MOLAP

   Large volumes of data base cuboid is stored in a relational
   aggregations are stored as arrays
   appropriate for for cubes that requre
     rapid query response for summaries based on a large amount

      of base data
             Efficient Data Cube
   Data cubes can be viewed as lattice of
           The bottom cuboid is the base cuboid
           The top is the apex cuboid
       What is the number of cuboids for N
        dimensional data cube? 2N
       OLAP computes at least some of the cuboids
          For fast response
          For avoiding redundant calculation
                                                               0-D(apex) cuboid
       product        date           country
                                                               1-D cuboids

product,date       product,country             date, country
                                                               2-D cuboids

                                                               3-D(base) cuboid
                 product, date, country
    Efficient computation cont...
   If one dimension has concept
        T = ∏ Li i 1

    T 's total numer of cuboids
    Li is the number of levels associated with dimension I
         •Excluding the top level all
         •As generalizing to all is equivalent to the removal
         of a dimension
    Materialization of data cube
   Three Types
           Materialize every cuboids
             Huge amounts of memory space
           Non materialization – Zero cube calculation
             Show processing of queries
           Some (partial) materialization
             Trade off between storage space and response time
             Selection of which cuboids to materialize
             Based on size, sharing, access frequency and etc
         Processing the cubes
   Complete load of the cube
         all dimension and fact table data is read and
         all specified aggregations cuboids are calculated
         process a cube when
         its structure is new or
         its dimensions or measures have been edited
   Incrementally updating a cube
         new data is added but existing data not changed
          and cube structure si the same
   Refreshing
         data cleared and reloaded
         its aggregations recalculated
         faster then processing:no design of aggregation
        Calculated Memebers
   Dimension member or measure whose
    value is computed at run time using an
   Only the definitions are stored but
    values exists only in memory upon a
   do not increase in cube size
   Ex: if sales and cost are included in the
    base fact table
        a profit measure can be a calculated member
        profit = sales – cost
        Average_sales = sales/#_items_sold
               Virtual Cubes
   Combination of multiple cubes in one
    logical cube
      can be based on a single cube to expose only
       selected subsets of measures and dimensions
   Require no physical space
      store only the dimensions information not actual
   provide a valuable security function
      limiting the access of some users
             Member Cubes
   Attribute of a dimension member
   provides additional information about the
   a column in the same dimension table as the
    associated members
   used in queries
          provide users more options when
           analysing cube data
    Exp Member Property – Time
   A typical time table:
         (time_id,day,month,quarter,year,business
          day,leap,day of the week)
     dimension levels day<month<quarter<year
     member properties for day:
         weekend or business day:0 or 1
         day of the week:1,2,3,...,7
     a member property for year is
         whether it is leap year or not:0 or 1
              Virtual Dimension
   Logical dimension based on a member property of a level in a
    physical dimension
   enables users to analyse cube data based on the member
    properties of dimension levels
   add a virtual dimension to a cube only if
           the dimension that supplies its member property is also
            included in the cube
   adding a virtual dimension does not increase cube size
           not affect cube processing time
              calculated in memory when needed
           query processing time is slower
       Exp Virtual Dimension
   The business day column was a
    member property for day level of the
    time dimension
   the user may want to investigate sales
    by type of day (business or weekend)
   makes business day member property
    as a virtual dimension of the sale cube
        Parent Child Dimension
   Based on two dimension table columns
    that together define the lieage
    relationships among the members of
    the dimension
        Member key column:identifies each member
        Parent key column: identifies the parent of each
 Exp Parent Child Dimension
         Example: A HR Department

                  Poal West

   James Smith     Amy Joens   Jill Kelly

John Grande      Jo Brown
Exp Parent Child Dim cont...
Emp Name         Emp_id            Manager_id
James Smith      1                 3
Amy Jones        2                 3
Paul West        3                 3
Jim Kelly        4                 3
John Grande      5                 1
Jo Brown         6               1
Emp_id represnts each employee
Manager_id represents parent dimention

Q? How can you represnt the same hierarchy in the traditional
concept hierarchy?
         Data Warehouse Usage
Three kinds of data warehouse applications
     Information processing
        supports querying, basic statistical analysis, and reporting
         using cross-tabs, tables, charts and graphs
     Analytical processing
        multidimensional analysis of data warehouse data
        supports basic OLAP operations, slice-dice, drilling,
     Data mining
        knowledge discovery from hidden patterns
        supports associations, constructing analytical models,
         performing classification and prediction, and presenting
         the mining results using visualization tools.
 Differences among the three tasks
    From OLAP to OLAM (online
        analytical mining)
   Why online analytical mining?
         High quality of data in data warehouses – DW contains
          integrated, consistent, cleaned data
         Available information processing structure surrounding
          data warehouses – ODBC, OLEDB, Web accessing,
          service facilities, reporting and OLAP tools
         OLAP-based exploratory data analysis (mining with
          drilling, dicing, pivoting, etc.)
         On-line selection of data mining functions – integration
          and swapping of multiple mining functions, algorithms,
          and tasks.

To top