Docstoc

warehousing

Document Sample
warehousing Powered By Docstoc
					Data Warehouse &
Online Analytical
Processing (OLAP)


   Aditya Ramani
     Arpit Jain
   Kashif Manzoor
   Omid Fatemieh
The story of the Walton’s - the richest family ever !




                                                        2
What are the items I am running out of ?
     SELECT * FROM PRODUCTS WHERE Quantity < 200
     Okay now color code them so that anything below 100 is dark read and all others
     are white.
Show me the sum of all the individual transactions
     SELECT Transaction.TID, SUM(Product.Price) FROM Transaction, Product,
     LineItem WHERE LineItem.TID = Transaction.TID AND LineItem.Product =
     Product.ID GROUP BY (Transaction.TID)




Show me all the items that are usually purchased together (I can then place them close
to each other on the shelves)


I had put Printer cartridge on sale in the hope that customer will buy a printer. Show me
if this strategy worked ?


Now the Walmart manager is getting too fancy with the requests !!!                    3
                    Need to the day
1. Present data that can help him in making wise decision.
2. Keep historical data so that he can perform trend analysis on
   the data.
3. Keep the analysis data up to date.
4. Don’t overload him with lot of irrelevant data.
5. Use the “Keep It Simple Stupid” rule – just present what is of
   interest to him.
6. Let him visualize the data in different ways as per his choice.




                                                               4
                      Agenda
• Introduction:
• Data Warehouse
• OLAP
• Future Trends
• Thankyou to the class for their feedback on
  the newsgroup…
• Theme:
    – Less slides, more interaction, more diagrams, less
      convoluted equations, more talk, less sleepy faces
      – hopefully this will lead to less pain.



                                                       5
                   What is Data Warehouse ?
   • Data warehousing provides architectures and tools for
      business executives to systematically organize,
      understand and use their data to make strategic
      decisions. – Jiawei Han

   • “A data warehouse is a subject-oriented, integrated,
      time-variant, and nonvolatile collection of data in
      support of management's decision making process“ -
      William H. Inmon
                                                All Relevant Info
                                                 under one roof
Changes as new                   To the point
 data trickle in   Retains the
                     history
                                                                    6
What is Online Analytical Processing (OLAP) ?




                                                7
DATA WAREHOUSE




                 8
Why can’t the Existing DBMS do the trick ?
 • Operational in nature - Online Transaction
        Mundane day-to-day entries do not help in decision support.
   Processing (OLTP) oriented.
             We need a much higher and aggregated view of data

          Information that I could have obtained from a single table
 • Normalized.       shouldn’t be split into several tables.
 • Multipurpose (support insert, delete, to view it
    I do not need to insert, delete, update the data. I only need
                                                                  update,
   select)
 • Support Transaction Management,
                 Why waste your energy on these features.
   Recovery, Locking etc.
 • Restrict I the size the historical data for my analysis.
                would need
                              of data by dumping
   historical data into archives.
              Now what ever is left is something that I needed ?
 So why are these things bad/not can use !

                                                                        9
RDBMS and Data Warehouse – no love lost

• They are not at odds with each other.

• RDBMS can be a placeholder for DataWarehouse to
  store huge amount of read only data.

• Traditional RDBMS are made to handle operational
  data efficiently.

   – In their raw form, they are not very efficient in handling huge
     de-normalized read-only data

   – Many RDBMSs provide a DBMS version fully targeted towards a
     Data Warehouse.

                                                                  10
                 Data warehouse architecture

                                   Monitor
  Other            Metadata           &           OLAP Server
  sources                         Integrator

                                 Data
 Operational   Extract         Warehouse                         Analysis
 DBs           Transform                                          Query
               Load
                                                Serve           Reports
               Refresh                                          Data mining




                               Data Marts

Data Sources               Data Storage        OLAP Engine Front-End Tools
                                                                          11
       Data Warehouse Terminologies
• Dimension (city):
     • The ‘angle’ from which the data is to be analyzed.


• Measures (e.g. sales, profit):
     • The consolidated value of a particular dimension measured
       against one or more dimensions.
     • Can be a sum, an average, a median, etc.


• Multidimensional Analysis (city and month):
     • Analyzing measures that have been aggregated against
       more than one dimensions.




                                                            12
   Data Warehouse Terminologies (cont’d)

• Data Cube & Cuboid
        • Each of the possible group-by combination of dimensions
          is called a Cuboid, the collection of all possible cuboids is
          called a Data Cube.
• Example: What is the total amount sale per month in the Walmarts in
   each city ?


 Table T Product          City          Date          Price

• SELECT City, SUM(PRICE)
  FROM T GROUP BY City



                                                                        13
               Cuboids from a SQL point of view

        T     Product        City          Date         Price
                                                                       Cuboids
            SELECT SUM(PRICE) FROM T

                                    SELECT Product, SUM(PRICE)
                                     FROM T GROUP BY Product

                                       SELECT City, SUM(PRICE)
                                        FROM T GROUP BY City
1.   {}
2.   {Product}                                    …
3.   {City}
4.   {Date}                           SELECT City, Date, SUM(PRICE)
5.   {Product, City}                   FROM T GROUP BY City, Date
6.   {Product, Date }               SELECT Product, City, Date, SUM(PRICE)
7.   {City, Date}                    FROM T GROUP BY Product, City, Date
8.   {Product, City, Date}
                                                                             14
• For n- dimensions, total cuboids would be 2n
  (exponential growth !).




  Data Cube (lattice of cuboids). Notice that the word Cube does not
  necessarily mean 3-Dimensions !




                                                                       15
                          DW Data Model                 The Star Schema
time
time_key                                                  item
day                                                    item_key
day_of_the_week                 Sales Fact Table       item_name
month                                                  brand
quarter                                 time_key       type
year                                                   supplier_type
                                         item_key
                                       branch_key
       branch                                        location
                                      location_key
       branch_key                                    location_key
       branch_name                      units_sold   street
       branch_type                                   city
                                      dollars_sold   state_or_province
                                                     country
                                         avg_sales



                     Measures                                     Dimensions
                                                                         16
                                       The Snowflake Schema


time
time_key
                                       item             supplier
day                                    item_key         supplier_key
day_of_the_week   Sales Fact Table     item_name        supplier_type
month                                  brand
quarter                   time_key     type
year                       item_key    supplier_key

                         branch_key
   branch                              location
                        location_key
                                       location_key
   branch_key
                          units_sold   street
   branch_name
                                       city_key
   branch_type
                        dollars_sold                  city
                                                      city_key
                          avg_sales                   city
                                                      state_or_province
                                                      country

                                                                  17
                 Data warehouse architecture

                                   Monitor
  Other            Metadata           &           OLAP Server
  sources                         Integrator

                                 Data
 Operational   Extract         Warehouse                         Analysis
 DBs           Transform                                          Query
               Load
                                                Serve           Reports
               Refresh                                          Data mining




                               Data Marts

Data Sources               Data Storage        OLAP Engine Front-End Tools
                                                                         18
             Issues in DW Construction
• Extract, Transform, Load (ETL), Refresh:

   • Data Cleaning:
      • Missing data, optional fields, etc.

   • Data Loading:
      • Sorting, aggregation, building indexes.

   • Refresh
      • Propagate update on source data to the data warehouse.
      • Recreate indexes, aggregate tables.
      • Frequency.
      • Refreshing Methods:
          • Data shipping (has more overhead)
          • Transaction shipping (not portable)



                                                         19
 Issues in DW Construction (cont’d)
            Issues & Challenges
• Curse of Dimensionality:
                            n
  – n dimensions result into 2 aggregated cells !


• Queries are complex
  – Query optimization a real challenge


• Index may not be used.
  – Due to the complexity of the queries index
    may not be used.



                                                    20
              Materializing Cuboids
• The user wants to see the results immediately.

• Materializing all cuboids is really expensive or even
  impossible

• Alternatives:
  1. Only materialize the most specific cuboid
   – Use that to derive low dimensions cuboids on the fly
   – E.g. Multi-Way aggregation, Star Cubing

  2. Materialization smaller cuboids with low dimensions
   – stitch them together to make higher dimension cuboids
     on the fly
   – E.g. Shell-Fragments




                                                             21
           OLAP
(Online Analytical Processing)



                            25
26
How do the regional and seasonal differences affect revenues ?

SELECT city, month,                              Multiple-Dimensions
         sum(price)
                                                   Single Measure
FROM Product, Transaction
WHERE <…>
GROUP BY (city, month)



                                                  Raw query result.
                                                  Contains thousands may
                                                  be million of such rows.




                                                                       27
let the OLAP magic begin ..



                                    Sections and
                                     Cross Tabs




                              But that’s still too
                              much. How about
                              only showing me
                              the top 3
                              revenues ?           28
          Rank




That’s much better. Now
instead of cities - show me
the regions.


                      29
                                              Roll-Up




                                            Countries




                                             Regions




                                              States




                                              Cities


Okay now show me data as a 3-D cuboid.   Concept Hierarchy
                                                         30
                                         for Location
                                        Okay, that’s better. But I want
                                        to see 2002 in more detail.



                                                Drill Down




Good. But wait a minute what
happened to Accessories. It shows
some strange behavior. Show me
all the accessories so that I can see
what went wrong.
                                                                      31
                                                         Okay. Just show me the data
                                                         for Q1 and Q3.




                                                                    Dice



                                                         Now just show me the data
                                                         without any Quarter and
                                                         year information.




                                                                      Slice
How are Slice/dice different than Roll-up/Drill down ?



                                                                                 32
                  Data Mining Vs. OLAP


• OLAP:
     • Helps you perform analysis by letting you navigate
       around the data as you desire.
• Data Mining:
     • Helps you make decision by highlighting suspicious data,
       outliers, identifying patterns, deciding what should be the
       suitable level of abstraction at which you should be
       looking at.




                                                                 33
                                                Data Collection & Database Creation
The Retrospective                               1960s and earlier

               & the Future                     - Primitive File Processing




                                       Database Management Systems (1970- early 1980s)
                                       Hierarchical & Network DBMS
                                       Relational DBMS
                                       Entity-Relational modeling
                                       Indexing and accessing methods: B-Trees, hashing
                                       SQL
                                       Query processing & optimization,
                                       Transactions, concurrent control, recovery.




     Advanced Database Systems                                                                    Web-Based Databases
                                        Advanced Data Analysis: Data Warehousing &
     (1980s – present)                                                                            (1990s – present)
                                        Data Mining (late 1980s – present)
     Advanced data models: extended                                                               XML-based databases systems.
                                        Data Warehouse and OLAP
     relational, Objects models                                                                   Information Retrieval systems.
                                        Data mining: classification, association, clustering, …
     Advanced Applications: Spatial,    Advanced data mining applications:
     Temporal, Multimedia, Sensor.      Stream data mining, web mining, …




                                       New Generation of Integrated Data & Information Systems
                                       (present – future)                                                                34
Thank you




            35
          36
Go back

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:19
posted:8/21/2011
language:English
pages:33