warehousing
Document Sample


Data Warehouse &
Online Analytical
Processing (OLAP)
Aditya Ramani
Arpit Jain
Kashif Manzoor
Omid Fatemieh
The story of the Walton’s - the richest family ever !
2
What are the items I am running out of ?
SELECT * FROM PRODUCTS WHERE Quantity < 200
Okay now color code them so that anything below 100 is dark read and all others
are white.
Show me the sum of all the individual transactions
SELECT Transaction.TID, SUM(Product.Price) FROM Transaction, Product,
LineItem WHERE LineItem.TID = Transaction.TID AND LineItem.Product =
Product.ID GROUP BY (Transaction.TID)
Show me all the items that are usually purchased together (I can then place them close
to each other on the shelves)
I had put Printer cartridge on sale in the hope that customer will buy a printer. Show me
if this strategy worked ?
Now the Walmart manager is getting too fancy with the requests !!! 3
Need to the day
1. Present data that can help him in making wise decision.
2. Keep historical data so that he can perform trend analysis on
the data.
3. Keep the analysis data up to date.
4. Don’t overload him with lot of irrelevant data.
5. Use the “Keep It Simple Stupid” rule – just present what is of
interest to him.
6. Let him visualize the data in different ways as per his choice.
4
Agenda
• Introduction:
• Data Warehouse
• OLAP
• Future Trends
• Thankyou to the class for their feedback on
the newsgroup…
• Theme:
– Less slides, more interaction, more diagrams, less
convoluted equations, more talk, less sleepy faces
– hopefully this will lead to less pain.
5
What is Data Warehouse ?
• Data warehousing provides architectures and tools for
business executives to systematically organize,
understand and use their data to make strategic
decisions. – Jiawei Han
• “A data warehouse is a subject-oriented, integrated,
time-variant, and nonvolatile collection of data in
support of management's decision making process“ -
William H. Inmon
All Relevant Info
under one roof
Changes as new To the point
data trickle in Retains the
history
6
What is Online Analytical Processing (OLAP) ?
7
DATA WAREHOUSE
8
Why can’t the Existing DBMS do the trick ?
• Operational in nature - Online Transaction
Mundane day-to-day entries do not help in decision support.
Processing (OLTP) oriented.
We need a much higher and aggregated view of data
Information that I could have obtained from a single table
• Normalized. shouldn’t be split into several tables.
• Multipurpose (support insert, delete, to view it
I do not need to insert, delete, update the data. I only need
update,
select)
• Support Transaction Management,
Why waste your energy on these features.
Recovery, Locking etc.
• Restrict I the size the historical data for my analysis.
would need
of data by dumping
historical data into archives.
Now what ever is left is something that I needed ?
So why are these things bad/not can use !
9
RDBMS and Data Warehouse – no love lost
• They are not at odds with each other.
• RDBMS can be a placeholder for DataWarehouse to
store huge amount of read only data.
• Traditional RDBMS are made to handle operational
data efficiently.
– In their raw form, they are not very efficient in handling huge
de-normalized read-only data
– Many RDBMSs provide a DBMS version fully targeted towards a
Data Warehouse.
10
Data warehouse architecture
Monitor
Other Metadata & OLAP Server
sources Integrator
Data
Operational Extract Warehouse Analysis
DBs Transform Query
Load
Serve Reports
Refresh Data mining
Data Marts
Data Sources Data Storage OLAP Engine Front-End Tools
11
Data Warehouse Terminologies
• Dimension (city):
• The ‘angle’ from which the data is to be analyzed.
• Measures (e.g. sales, profit):
• The consolidated value of a particular dimension measured
against one or more dimensions.
• Can be a sum, an average, a median, etc.
• Multidimensional Analysis (city and month):
• Analyzing measures that have been aggregated against
more than one dimensions.
12
Data Warehouse Terminologies (cont’d)
• Data Cube & Cuboid
• Each of the possible group-by combination of dimensions
is called a Cuboid, the collection of all possible cuboids is
called a Data Cube.
• Example: What is the total amount sale per month in the Walmarts in
each city ?
Table T Product City Date Price
• SELECT City, SUM(PRICE)
FROM T GROUP BY City
13
Cuboids from a SQL point of view
T Product City Date Price
Cuboids
SELECT SUM(PRICE) FROM T
SELECT Product, SUM(PRICE)
FROM T GROUP BY Product
SELECT City, SUM(PRICE)
FROM T GROUP BY City
1. {}
2. {Product} …
3. {City}
4. {Date} SELECT City, Date, SUM(PRICE)
5. {Product, City} FROM T GROUP BY City, Date
6. {Product, Date } SELECT Product, City, Date, SUM(PRICE)
7. {City, Date} FROM T GROUP BY Product, City, Date
8. {Product, City, Date}
14
• For n- dimensions, total cuboids would be 2n
(exponential growth !).
Data Cube (lattice of cuboids). Notice that the word Cube does not
necessarily mean 3-Dimensions !
15
DW Data Model The Star Schema
time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch location
location_key
branch_key location_key
branch_name units_sold street
branch_type city
dollars_sold state_or_province
country
avg_sales
Measures Dimensions
16
The Snowflake Schema
time
time_key
item supplier
day item_key supplier_key
day_of_the_week Sales Fact Table item_name supplier_type
month brand
quarter time_key type
year item_key supplier_key
branch_key
branch location
location_key
location_key
branch_key
units_sold street
branch_name
city_key
branch_type
dollars_sold city
city_key
avg_sales city
state_or_province
country
17
Data warehouse architecture
Monitor
Other Metadata & OLAP Server
sources Integrator
Data
Operational Extract Warehouse Analysis
DBs Transform Query
Load
Serve Reports
Refresh Data mining
Data Marts
Data Sources Data Storage OLAP Engine Front-End Tools
18
Issues in DW Construction
• Extract, Transform, Load (ETL), Refresh:
• Data Cleaning:
• Missing data, optional fields, etc.
• Data Loading:
• Sorting, aggregation, building indexes.
• Refresh
• Propagate update on source data to the data warehouse.
• Recreate indexes, aggregate tables.
• Frequency.
• Refreshing Methods:
• Data shipping (has more overhead)
• Transaction shipping (not portable)
19
Issues in DW Construction (cont’d)
Issues & Challenges
• Curse of Dimensionality:
n
– n dimensions result into 2 aggregated cells !
• Queries are complex
– Query optimization a real challenge
• Index may not be used.
– Due to the complexity of the queries index
may not be used.
20
Materializing Cuboids
• The user wants to see the results immediately.
• Materializing all cuboids is really expensive or even
impossible
• Alternatives:
1. Only materialize the most specific cuboid
– Use that to derive low dimensions cuboids on the fly
– E.g. Multi-Way aggregation, Star Cubing
2. Materialization smaller cuboids with low dimensions
– stitch them together to make higher dimension cuboids
on the fly
– E.g. Shell-Fragments
21
OLAP
(Online Analytical Processing)
25
26
How do the regional and seasonal differences affect revenues ?
SELECT city, month, Multiple-Dimensions
sum(price)
Single Measure
FROM Product, Transaction
WHERE <…>
GROUP BY (city, month)
Raw query result.
Contains thousands may
be million of such rows.
27
let the OLAP magic begin ..
Sections and
Cross Tabs
But that’s still too
much. How about
only showing me
the top 3
revenues ? 28
Rank
That’s much better. Now
instead of cities - show me
the regions.
29
Roll-Up
Countries
Regions
States
Cities
Okay now show me data as a 3-D cuboid. Concept Hierarchy
30
for Location
Okay, that’s better. But I want
to see 2002 in more detail.
Drill Down
Good. But wait a minute what
happened to Accessories. It shows
some strange behavior. Show me
all the accessories so that I can see
what went wrong.
31
Okay. Just show me the data
for Q1 and Q3.
Dice
Now just show me the data
without any Quarter and
year information.
Slice
How are Slice/dice different than Roll-up/Drill down ?
32
Data Mining Vs. OLAP
• OLAP:
• Helps you perform analysis by letting you navigate
around the data as you desire.
• Data Mining:
• Helps you make decision by highlighting suspicious data,
outliers, identifying patterns, deciding what should be the
suitable level of abstraction at which you should be
looking at.
33
Data Collection & Database Creation
The Retrospective 1960s and earlier
& the Future - Primitive File Processing
Database Management Systems (1970- early 1980s)
Hierarchical & Network DBMS
Relational DBMS
Entity-Relational modeling
Indexing and accessing methods: B-Trees, hashing
SQL
Query processing & optimization,
Transactions, concurrent control, recovery.
Advanced Database Systems Web-Based Databases
Advanced Data Analysis: Data Warehousing &
(1980s – present) (1990s – present)
Data Mining (late 1980s – present)
Advanced data models: extended XML-based databases systems.
Data Warehouse and OLAP
relational, Objects models Information Retrieval systems.
Data mining: classification, association, clustering, …
Advanced Applications: Spatial, Advanced data mining applications:
Temporal, Multimedia, Sensor. Stream data mining, web mining, …
New Generation of Integrated Data & Information Systems
(present – future) 34
Thank you
35
36
Go back
Get documents about "