VIEWS: 22 PAGES: 59 POSTED ON: 8/11/2011
BI & DM for CRM Lecture 2 Data Warehouse and OLAP Steps in Knowledge Discovery Steps in KD cont... Data Cleaning: To remove noise and inconsistent data / check data validity & correctness – PLAY WİTH DATA (good step to know your data) Data Warehouse Data İntegration: To combine multiple sources of data / integrate various data sources on various platforms (DBMS/flatfiles/spreedsheets etc..) Data Selection: Retrieve relevent data from data repository for analysis Data Transformation: To transform or consolidate data into different forms to deploy in DM operations. (If Data Warehouses were used, the process would be before the data selection.) Data Mining: To apply intelligent models for extracting patterns from data. Pattern Evaluation: To identify interesting patterns, result of DM process, on some measures. Knowledge presentation: To visualize the representation of knowledge, driven by DM process and pattern evaluationz DM System Architecture So, what is Data Warehouse? Loosely speaking; • A decision support database that is maintained separately from the organization‟s operational database • Support information processing by providing a solid platform of consolidated, historical data for analysis. W.H. Inmon- „ A data warehouse is a subject-oriented, integrated, time-varient and nonvolatile collection of data in support of management‟s decision making proces.‟ DW—Subject-Oriented Organized around major subjects, such as customer, product, sales. Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process. 6 DW—Integrated Constructed by integrating multiple, heterogeneous data sources relational databases, flat files, on-line transaction records Data cleaning and data integration techniques are applied. Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources E.g., Hotel price: currency, tax, breakfast covered, etc. When data is moved to the warehouse, it is converted. 7 DW—Time Variant The time horizon for the data warehouse is significantly longer than that of operational systems. Operational database: current value data. Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) Every key structure in the data warehouse Contains an element of time, explicitly or implicitly But the key of operational data may or may not contain “time element”. 8 DW—Non-Volatile A physically separate store of data transformed from the operational environment. Operational update of data does not occur in the data warehouse environment. Does not require transaction processing, recovery, and concurrency control mechanisms Requires only two operations in data accessing: • initial loading of data and access of data. 9 On-line transaction processing vs On-line Analytical Processing OLTP OLAP users clerk, IT professional knowledge worker function day to day operations decision support DB design application-oriented subject-oriented data current, up-to-date historical, detailed, flat relational summarized, multidimensional isolated integrated, consolidated usage repetitive ad-hoc access read/write lots of scans index/hash on prim. key unit of work short, simple transaction complex query # records accessed tens millions #users thousands hundreds DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response 10 Why seperate Database? Main reason is high performance for both systems – processing OLAP queries in transactional DBs would substantially degrade the performance of operational tasks Unlike OLTP, OLAP usually needs read only access from the database. So, for OLAP operations DBs does not require concurrent control and recovery. If applied for OLAP this may jeopardize the execution of OLTP operations Systems are mainly used for different operations... Data Model- Multidimentional DW & OLAP are based on multidementional data models... Actually... It is all about multidimentional data models... 2D view of data Facebook Data... POSTING Time Video Photo Write Up W1 200 500 1000 W2 250 520 1023 W3 310 782 1051 Please do example for TeknoSA!!! 3D view of Data cont MALE FEMALE N/A POSTING POSTING POSTING Time Video Photo Write Video Photo Write Video Photo Write Up Up Up W1 50 120 400 80 150 500 70 230 100 W2 W3 LATTICE OF CUBOIDS ALL 0 D – Apex cuboid 1-D Cuboids Posting Gender Education Time Time, Posting Time, Gender Time, Education 2-D Cuboids Posting, Time, Gender, Gender Education Education Time, Posting, Time, Posting, Time, Gender, Posting, Gender, Gender Education Education Education 3-D Cuboids Time, Posting, Gender, Education 4-D Cuboids Latice & Cuboids Any n-D data as a series of (n-1)-D “cubes” In data warehousing literature, A data cube is referred to as a cuboid The lattice of cuboids forms a data cube. The cuboid holding the lowest level of summarization is called a the 4-D cuboid is the base cuboid for the given four dimensions • base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. Here this is the total posting typically denoted by all Conceptual Modelling of DW Modeling data warehouses: dimensions & measures – Star schema: A fact table in the middle connected to a set of dimension tables – Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake – Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation STAR SCHEMA Time Post Post Type Time_key Time_key Post_key Day Post_key Post _name Day_of_week Education_key Month Gender_key Quarter Total Post Year Gender Gender_key Education Gender_name Education_key Education_name Edu_type_key Snow Flake Schema FACT TABLE Time Post Post Type Time_key Time_key Post_key Day Post_key Post _name Day_of_week Education_key Month Gender_key Quarter Total Post Year Gender Gender_key Education Gender_name Education_key Education_name Edu Type Edu_type_key Edu_type_key Edu_type_name Fact Constellation Time Post Post Type Time_key Time_key Post_key Day Post_key Post _name Day_of_week Education_key Chat Month Gender_key Time_key Quarter Total Post Education_key Year Gender_key Total_chat_time Education Total_online_time Gender Education_key Gender_key Education_name Gender_name Edu_type_key Categorization & Computation How are measures computed? Data Cube measure is a numarical function that can be evaluated at each point in the data cube space. Distributive- sum(), min(), max()- How many posts are from USA? Sum (Washington)+Sum(Nevada).... Algebraic: sum(),count(), standard_deviation()... Holisatic: median(), mode() Concept Hierarchies What is it? Flow of a concept (could be dimension) from lower levels to higher levels. More specific More General Example for concept hierarchy Location all all region Europe ... North_America country Germany ... Spain Canada ... Mexico city Frankfurt ... Vancouver ... Toronto office L. Chan ... M. Wind 24 Hierarchical vs lattice Structures (Concept Hierarchy) year country quarter Province or state month week city Street day We also have user defined concept hierarcies like Fiscal year or Academic Year Please find one example for hierarchical and lattice structures OLAP Operations in Multi Dimensional Data How can we use Concept Hierarchies useful in OLAP? In multi dimensional model, data are organized in multi dimensions and each dimension contains multi level of abstraction defined by concept hierarchies This type of organization provides user to view the data from various perspectives Basically, OLAP provides user friendly environment for interactive data analysis. OLAP ACTIONS Roll-up (drill-up) Drill-down Slice and Dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. The operation defines a subcube by performing a selection on two or more dimensions. Pivot: visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data. Roll up action exp... Roll-up (drill-up):Performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or dimension reduction. location by city Istanbul Ankara Berlin Münih PC 20 30 50 40 Printer 15 5 10 20 Hierarchy Roll Up Dimension Roll Up Location by country Location TR GR PC 140 PC 50 90 Printer 50 Printer 20 30 Drill down action exp... Drill Down:Drill down is reverse of roll-up. It navigates from general hierarchy to more specific hierarchy. Adding new dimension to data Year 2009 Quarter Drill PC 100 Q1 Q2 Q3 Q4 Down Printer 150 PC 20 30 40 10 Printer 30 60 20 40 Slice & Dice Action exp... The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. The operation defines a subcube by performing a selection on two or more dimensions. Measure Sales Time 2010 Quarter Slice Q1 Q1 Q2 Q3 Q4 PC 20 PC 20 30 40 10 Printer 30 Dice Printer 30 60 20 40 Time 2010 Q1 PC 20 Other OLAP Actions drill across: involving (across) more than one fact table drill through: through the bottom level of the cube to its back- end relational tables (using SQL) ranking the top N or bottom N items in lists moving averages growth rates interests Star-Net Query Model Radial lines from a central point each line represents a concept hierarchy for a dimension each abstraction level is called a footprint granularities available for use by OLAP four radial lines,for concept hierarchies location,customer,item,time time line has 4 footprints: day,month,quarter,year A Star-Net Query Model Customer Orders Shipping Method Customer CONTRACTS AIR-EXPRESS ORDER TRUCK PRODUCT LINE Time Product ANNUALY QTRLY DAILY PRODUCT ITEM PRODUCT GROUP CITY SALES PERSON COUNTRY DISTRICT REGION DIVISION Location Each circle is called a footprint Promotion Organization Design of a Data Warehouse: A Business Analysis Framework Four views regarding the design of a data warehouse Top-down view allows selection of the relevant information necessary for the data warehouse Data source view exposes the information being captured, stored, and managed by operational systems Data warehouse view consists of fact tables and dimension tables Business query view sees the perspectives of data in the warehouse from the view of end-user Data Warehouse Design • Process Top-down, bottom-up approaches or a combination of both Top-down: Starts with overall design and planning (mature) Bottom-up: Starts with experiments and prototypes (rapid) • From software engineering point of view Waterfall: structured and systematic analysis at each step before proceeding to the next Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around • Typical data warehouse design process Choose a business process to model, e.g., orders, invoices, etc. Choose the grain (atomic level of data) of the business process Choose the dimensions that will apply to each fact table record Choose the measure that will populate each fact table record Three Data Warehouse Models Enterprise Warehouse Collects all information about subject spanning of the entire organization Data Mart A subset of corporate-wide data that is valuable to specific groups of users. Such as marketing Virtual Warehouse A set of views over operational databases Only some of the possible summary views may be materialized Storage of a cube Cuboids are referred as aggregations One factor affecting storage requirements Sparsity: the amount of empty cells in a cube The base cuboid is likely to contain many empty cells it is a spares cube or array the 0 or lower dimensional cuboids are less spares than the higher dimensional ones it is not likely that they contain empty cells Moving along higher levels for the dimension hierarchy the cuboids becomes less spares or more dense PC Prt CD DV Two dimensional 01.10.2003 10 sparse cuboid 02.10.2003 1 03.10.2003 4 2 04.10.2003 2 items One dimensional dense 01.10.2003 10 cuboid 02.10.2003 1 03.10.2003 6 04.10.2003 2 OLAP Server Architectures ROLAP – Relational OLAP MOLAP – Multidimensional OLAP HOLAP – Hybrid OLAP ROLAP Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware to support missing pieces query response is generally slower low storage requirement Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services greater scalability appropriate for large data sets that are infrequently queried historical data from less recent previous years MOLAP Array-based multidimensional storage engine (sparse matrix techniques) fast indexing to pre-computed summarized data a two-level storage representation dense subcubes are stored as array structures spars subcubes are stored by compression techniques appropriate for cubes with frequent use and rapid query response HOLAP combines ROLAP and MOLAP benefiting from greater scalability of ROLAP faster computation of MOLAP Large volumes of data base cuboid is stored in a relational database aggregations are stored as arrays appropriate for for cubes that requre rapid query response for summaries based on a large amount of base data Efficient Data Cube Computation Data cubes can be viewed as lattice of cuboids The bottom cuboid is the base cuboid The top is the apex cuboid What is the number of cuboids for N dimensional data cube? 2N OLAP computes at least some of the cuboids For fast response For avoiding redundant calculation Example all 0-D(apex) cuboid product date country 1-D cuboids product,date product,country date, country 2-D cuboids 3-D(base) cuboid product, date, country Efficient computation cont... If one dimension has concept hierarchies? N T = ∏ Li i 1 i=1 T 's total numer of cuboids Li is the number of levels associated with dimension I •Excluding the top level all •As generalizing to all is equivalent to the removal of a dimension Materialization of data cube Three Types Materialize every cuboids Huge amounts of memory space Non materialization – Zero cube calculation Show processing of queries Some (partial) materialization Trade off between storage space and response time Selection of which cuboids to materialize Based on size, sharing, access frequency and etc Processing the cubes Complete load of the cube all dimension and fact table data is read and all specified aggregations cuboids are calculated process a cube when its structure is new or its dimensions or measures have been edited Incrementally updating a cube new data is added but existing data not changed and cube structure si the same Refreshing data cleared and reloaded its aggregations recalculated faster then processing:no design of aggregation tables Calculated Memebers Dimension member or measure whose value is computed at run time using an expression Only the definitions are stored but values exists only in memory upon a query do not increase in cube size Ex: if sales and cost are included in the base fact table a profit measure can be a calculated member profit = sales – cost Average_sales = sales/#_items_sold Virtual Cubes Combination of multiple cubes in one logical cube can be based on a single cube to expose only selected subsets of measures and dimensions Require no physical space store only the dimensions information not actual data provide a valuable security function limiting the access of some users Member Cubes Attribute of a dimension member provides additional information about the member a column in the same dimension table as the associated members used in queries provide users more options when analysing cube data Exp Member Property – Time Table A typical time table: (time_id,day,month,quarter,year,business day,leap,day of the week) dimension levels day<month<quarter<year member properties for day: weekend or business day:0 or 1 day of the week:1,2,3,...,7 a member property for year is whether it is leap year or not:0 or 1 Virtual Dimension Logical dimension based on a member property of a level in a physical dimension enables users to analyse cube data based on the member properties of dimension levels add a virtual dimension to a cube only if the dimension that supplies its member property is also included in the cube adding a virtual dimension does not increase cube size not affect cube processing time calculated in memory when needed query processing time is slower Exp Virtual Dimension The business day column was a member property for day level of the time dimension the user may want to investigate sales by type of day (business or weekend) makes business day member property as a virtual dimension of the sale cube Parent Child Dimension Based on two dimension table columns that together define the lieage relationships among the members of the dimension Member key column:identifies each member Parent key column: identifies the parent of each member Exp Parent Child Dimension Example: A HR Department Poal West James Smith Amy Joens Jill Kelly John Grande Jo Brown Exp Parent Child Dim cont... Emp Name Emp_id Manager_id James Smith 1 3 Amy Jones 2 3 Paul West 3 3 Jim Kelly 4 3 John Grande 5 1 Jo Brown 6 1 Emp_id represnts each employee Manager_id represents parent dimention Q? How can you represnt the same hierarchy in the traditional concept hierarchy? Data Warehouse Usage Three kinds of data warehouse applications Information processing supports querying, basic statistical analysis, and reporting using cross-tabs, tables, charts and graphs Analytical processing multidimensional analysis of data warehouse data supports basic OLAP operations, slice-dice, drilling, pivoting Data mining knowledge discovery from hidden patterns supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools. Differences among the three tasks From OLAP to OLAM (online analytical mining) Why online analytical mining? High quality of data in data warehouses – DW contains integrated, consistent, cleaned data Available information processing structure surrounding data warehouses – ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools OLAP-based exploratory data analysis (mining with drilling, dicing, pivoting, etc.) On-line selection of data mining functions – integration and swapping of multiple mining functions, algorithms, and tasks. Q&A ???
Pages to are hidden for
"BI _ DM for CRM"Please download to view full document