Data Warehouse Faq by pcherukumalla

VIEWS: 60 PAGES: 12

									1. Define Data Warehouse ?

“A subject-oriented , integrated , time-variant and non-volatile collection of data in support of
management's decision making process”

Subject-Oriented : Focus on Particular Topic to analyze the data.
Integrated :        Data from separate sources must be put into a consistent format thru Data
                    Transformations and Data Cleansing.
Non-Volatile :       Data Once entered into the warehouse can’t be changed.
Time-Variant :      Creates a time dimension to the Data ware house. Can view patterns and
                     trends over time.

2. Explain the Datawarehouse LifeCycle?
Extraction - As a first step, heterogeneous data from different online transaction processing
systems is extracted. This data becomes the data source for the data warehouse.
Cleansing/transformation - The source data is sent into the populating systems where the data
is cleansed, integrated, consolidated, secured and stored in the corporate or central data
warehouse.
Distribution - From the central data warehouse, data is distributed to independent data marts
specifically designed for the end user.
Analysis - From these data marts, data is sent to the end users who access the data stored in the
data mart depending upon their requirement.
3. What is the aim/objective of having a data warehouse? And who needs a data
warehouse? Or what is the use of Data Warehousing?
       Data warehousing technology comprises a set of new concepts and tools which support
the executives, managers and analysts with information material for decision making.
The fundamental reason for building a data warehouse is to improve the quality of information in
the organization. The main goal of data warehouse is to report and present the information in a
very user friendly form.
4. Diff. between OLTP and OLAP?

Online Transactional Processing                Online Analytical Processing

Functional : Day to Day Operations             Decision Support

Db Design : Application Oriented               Subject Oriented.

Data : Current Up to date                      Historical Data

Detailed, Flat,Relational                      Summarized, Isolated

Higly Normalized                               Highly De-Normailized

Unit of work : Short, Simple, Transaction      Complex Query



5. Diff between Data warehouse and Data Mart?

A data warehouse is for very large             A data mart is for smaller databases.
databases (VLDBs)

A data warehouse is a database designed to     A data mart is a selected part of the data
support a broad range of decision tasks in a   warehouse which supports specific
specific organization.                         decision support application requirements
                                               of a company’s department

                                               Is a Smaller Subset of Data ware House.

Has More Subject-Oriented Areas                Has Less Subject – Oriented Areas

Size is > 100 GB                               Size is < 100 GB

Running Ad hoc(unplanned) Queries from         Running ad hoc queries in data marts allow
a huge data warehouse takes long time.         the efficient execution of predicted queries.
It is Enterprise Level One                     It is Smaller Scale One.




6. What is Staging Area?
 Provides a place and an area with a set of functions to clean, change, combine, convert,
deduplicate and prepare source data for storage and use in the data warehouse.

7. What is junk dimension? What is the difference between junk dimension and
degenerated dimension?
A "junk" dimension is a collection of random transactional codes, flags and/or text attributes that
are unrelated to any particular dimension. The junk dimension is simply a structure that provides
a convenient place to store the junk attributes.where as A degenerate dimension is data that is
dimensional in nature but stored in a fact table.
Junk dimension: the column which we are using rarely or not used, these columns are formed a
dimension is called junk dimension
Degenerative dimension: the column which we use in dimension are degenerative dimension
Ex.Emp table has empno, ename, sal, job, deptno
But We are talking only the column empno, ename from the EMP table and forming a dimension
this is called degenerative dimension

8.Differnce between Normalization and Denormalization?

Normalization is the process of removing redundancies.
OLTP uses the Normalization process

Denormalization is the process of allowing redundancies.
OLAP/DW uses the denormalized process to capture greater level of detailed data (each and
every transaction)

9. Why fact table is in normal form?

A fact table consists of measurements of business requirements and foreign keys of dimensions
tables as per business rules.
A fact table consists of measurements of business requirements and foreign keys of dimensions
tables as per business rules.
There can just be SKs within a Star schema, which itself is de-Normalized. Now, if there were
then FKs on the dimensions as well, I would agree. Being in normal form, more granularity is
achieved with less coding i.e. less number of joins while retrieving the fact.

10. What is Difference between E-R Modeling and Dimensional Modeling?
Basic difference is E-R modeling will have logical and physical model. Dimensional model will
have only physical model. E-R modeling is used for normalizing the OLTP database design.
Dimensional modeling is used for de-normalizing the ROLAP/MOLAP design. Adding to the
point:
E-R modeling revolves around the Entities and their relationships to capture the overall process
of the system.
Dimensional model / Multidimensional Modeling revolves around Dimensions (point of
analysis) for decision-making and not to capture the process.
In ER modeling the data is in normalized form. So more number of Joins, which may adversely
affect the system performance. Whereas in Dimensional Modeling the data is denormalized, so
less number of joins, by which system performance will improve.

11. What is conformed fact?
Conformed dimensions are the dimensions, which can be used across multiple Data Marts in
combination with multiple facts tables accordingly
Conformed facts are allowed to have the same name in separate tables and can be combined and
compared mathematically. Conformed dimensions are those tables that have a fixed structure.
There will b no need to change the metadata of these tables and they can go along with any
number of facts in that application without any changes
Dimension table, which is used, by more than one fact table is known as a conformed dimension.

12. What are the methodologies of Data Warehousing?

They are mainly 2 methods.

1. Ralph Kimbell Model (Top - Down approach :: Data Warehouse --> Data Mart)
Kimball model always structured as Denormalized structure.

2. Inmon Model. (Bottom - Up approach :: Data Mart --> Data Warehouse)
Inmon model structured as Normalized structure.

13. What are data validation strategies for data mart validation after loading process?
Data validation is to make sure that the loaded data is accurate and meets the business
requirements. Strategies are different methods followed to meet the validation requirements.
14. What is surrogate key?

Surrogate key is the primary key for the Dimensional table. Surrogate key is a substitution for
the natural primary key.

Data warehouses typically use a surrogate, (also known as artificial or identity key), key for the
dimension tables primary keys. They can use Infa sequence generator, or Oracle sequence, or
SQL Server Identity values for the surrogate key.

It is useful because the natural primary key (i.e. Customer Number in Customer table) can
change and this makes updates more difficult and also used in SCDs to preserve historical data.

15. What is meant by metadata in context of a Data warehouse and how it is important?
Metadata or Meta data is data about data. Examples of metadata include data element
descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions,
and process/method descriptions. The repository environment encompasses all corporate
metadata resources: database catalogs, data dictionaries, and navigation services. Metadata
includes things like the name, length, valid values, and description of a data element. Metadata is
stored in a data dictionary and repository. It insulates the data warehouse from changes in the
schema of operational systems. Metadata Synchronization The process of consolidating, relating
and synchronizing data elements with the same or similar meaning from different systems.
Metadata synchronization joins these differing elements together in the data warehouse to allow
for easier access.

In context of a Data warehouse metadata is meant the information about the data. This
information is stored in the designer repository. Meta data is the data about data; Business
Analyst or data modeler usually capture information about data - the source (where and how the
data is originated), nature of data (char, varchar, nullable, existance, valid values etc) and
behavior of data (how it is modified / derived and the life cycle) in data dictionary a.k.a
metadata. Metadata is also presented at the Data mart level, subsets, fact and dimensions, ODS
etc. For a DW user, metadata provides vital information for analysis / DSS.

16. What are the possible data marts in Retail sales?
Product information, sales information

17. What is the main difference between schema in RDBMS and schemas in Data
Warehouse?

RDBMS Schema                                DWH Schema
* Used for OLTP systems                     * Used for OLAP systems
* Traditional and old schema                * New generation schema
* Normalized                                * De Normalized
* Difficult to understand and navigate      * Easy to understand and navigate
* Cannot solve extract and complex problems * Extract and complex problems can be easily
* Poorly modelled                           solved
                                               * Very good model

18.What is Dimensional Modeling?
In Dimensional Modeling, Data is stored in two kinds of tables: Fact Tables and Dimension
tables.
Fact Table contains fact data e.g. sales, revenue, profit etc.....
Dimension table contains dimensional data such as Product Id, product name, product
description etc.....
Dimensional Modeling is a design concept used by many data warehouse designers to build their
data warehouse. In this design model all the data is stored in two types of tables - Facts table and
Dimension table. Fact table contains the facts/measurements of the business and the dimension
table contains the context of measurements i.e., the dimensions on which the facts are calculated.

19. Why is Data Modeling Important?

The data model is also detailed enough to be used by the database developers to use as a
"blueprint" for building the physical database. The information contained in the data model will
be used to define the relational tables, primary and foreign keys, stored procedures, and triggers.
A poorly designed database will require more time in the long-term. Without careful planning
you may create a database that omits data required to create critical reports, produces results that
are incorrect or inconsistent, and is unable to accommodate changes in the user's requirements.

20. What does level of Granularity of a fact table signify?
It describes the amount of space required for a database. Level of Granularity indicates the extent
of aggregation that will be permitted to take place on the fact data. More Granularity implies
more aggregation potential and vice-versa. In simple terms, level of granularity defines the
extent of detail. As an example, let us look at geographical level of granularity. We may analyze
data at the levels of COUNTRY, REGION, TERRITORY, CITY and STREET. In this case, we
say the highest level of granularity is STREET. Level of granularity means the upper/lower level
of hierarchy, up to which we can see/drill the data in the fact table. Level of granularity means
the upper/lower level of hierarchy, up to which we can see/drill the data in the fact table.

21. What is degenerate dimension table?
The values of dimension, which is stored, in fact table is called degenerate dimensions. These
dimensions don't have it's own dimensions.

22. How do you load the time dimension?
In Data warehouse we manually load the time dimension, Every Data warehouse maintains a
time dimension. It would be at the most granular level at which the business runs at (ex: week
day, day of the month and so on). Depending on the data loads, these time dimensions are
updated. Weekly process gets updated every week and monthly process, every month.

23. Difference between Snowflake and Star Schema. What are situations where Snow flake
Schema is better than Star Schema to use and when the opposite is true?
Star schema and snowflake both serve the purpose of dimensional modeling when it comes to
data warehouses.
Star schema is a dimensional model with a fact table (large) and a set of dimension tables
(small). The whole set-up is totally denormalized.
However in cases where the dimension tables are split to many tables that are where the schema
is slightly inclined towards normalization (reduce redundancy and dependency) there comes the
snowflake schema.
The nature/purpose of the data that is to be feed to the model is the key to your question as to
which is better.

Star schema

      contains the dimension tables mapped around one or more fact tables.

      It is a denormalized model.

      No need to use complicated joins.

      Queries results fastly.

Snowflake schema

      It is the normalized form of Star schema.

      Contains in depth joins, because the tables are splited in to many pieces.
       We can easily do modification directly in the tables.

      We have to use complicated joins, since we have more tables.

      There will be some delay in processing the Query.

24. Why do you need Star schema?
1) Less joiners contains
2) Simply database
3) Support drilling up options

25. Why do you need Snowflake schema?
Some times we used to provide separate dimensions from existing dimensions that time we will
go to snowflake
Disadvantage Of snowflake: Query performance is very low because more joiners is there

26. What is conformed fact?
Conformed dimensions are the dimensions, which can be used across multiple Data Marts in
combination with multiple facts tables accordingly
Conformed facts are allowed to have the same name in separate tables and can be combined and
compared mathematically. Conformed dimensions are those tables that have a fixed structure.
There will b no need to change the metadata of these tables and they can go along with any
number of facts in that application without any changes
Dimension table, which is used, by more than one fact table is known as a conformed dimension.

27. What are conformed dimensions
They are dimension tables in a star schema data mart that adhere to a common structure, and
therefore allow queries to be executed across star schemas. For example, the Calendar dimension
is commonly needed in most data marts. By making this Calendar dimension adhere to a single
structure, regardless of what data mart it is used in your organization, you can query by date/time
from one data mart to another to another.
Conformed dimentions are dimensions which are common to the cubes.(cubes are the schemas
contains facts and dimension tables)

Consider Cube-1 contains F1,D1,D2,D3 and Cube-2 contains F2,D1,D2,D4 are the Facts and
Dimensions
here D1,D2 are the Conformed Dimensions


28. What is Fact table
A table in a data warehouse whose entries describe data in a fact table. Dimension tables contain
the data from which dimensions are created. A fact table in data ware house is it describes the
transaction data. It contains characteristics and key figures.

29. What are Semi-additive and faceless facts and in which scenario will you use such kinds
of fact tables
Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions
in the fact table, but not the others. For example:
Current Balance and Profit Margin are the facts. Current Balance is a semi-additive fact, as it
makes sense to add them up for all accounts (what's the total current balance for all accounts in
the bank?), but it does not make sense to add them up through time (adding up all current
balances for a given account for each day of the month does not give us any useful information
A factless fact table captures the many-to-many relationships between
dimensions, but contains no numeric or textual facts. They are often used to record events or
coverage information. Common examples of factless fact tables include:
- Identifying product promotion events (to determine promoted products that didn't sell)
- Tracking student attendance or registration events
- Tracking insurance-related accident events
- Identifying building, facility, and equipment schedules for a hospital or university
30. What are the Different methods of loading Dimension tables
Conventional Load: Before loading the data, all the Table constraints will be checked against
the data.

Direct load:(Faster Loading) All the Constraints will be disabled. Data will be loaded
directly.Later the data will be checked against the table constraints and the bad data won't be
indexed. Conventional and Direct load method are applicable for only oracle. The naming
convension is not general one applicable to other RDBMS like DB2 or SQL server..

31.What are Aggregate tables
Aggregate tables contain redundant data that is summarized from other data in the warehouse.
These are the tables which contain aggregated / summarized data. E.g Yearly, monthly sales
information. These tables will be used to reduce the query execution time.
Aggregate table contains the summary of existing warehouse data which is grouped to certain
levels of dimensions.Retrieving the required data from the actual table, which have millions of
records will take more time and also affects the server performance.To avoid this we can
aggregate the table to certain required level and can use it.This tables reduces the load in the
database server and increases the performance of the query and can retrieve the result very fastly.

32. What is a dimension table
A dimensional table is a collection of hierarchies and categories along which the user can drill
down and drill up. it contains only the textual attributes.

33. Why are OLTP database designs not generally a good idea for a Data Warehouse
OLTP cannot store historical information about the organization. It is used for storing the details
of daily transactions while a datawarehouse is a huge storage of historical information obtained
from different datamarts for making intelligent decisions about the organization.

34. What is the need of surrogate key; why primary key not used as surrogate key
Surrogate Key is an artificial identifier for an entity.In surrogate key values are generated by the
system sequentially(Like Identity property in SQL Server and Sequence in Oracle). They do not
describe anything.
Primary Key is a natural identifier for an entity. In Primary keys all the values are entered
manually by the user which are uniquely identified. There will be no repeatition of data.

Need for surrogate key not Primary Key
If a column is made a primary key and later there needs a change in the datatype or the length for
that column then all the foreign keys that are dependent on that primary key should be changed
making the database Unstable
Surrogate Keys make the database more stable because it insulates the Primary and foreign key
relationships from changes in the data types and length.
For Example : You are extracting Customer Information from OLTP Source and after ETL
process, loading customer information in a dimension table (DW). If you take SCD Type 1, Yes
you can use Primary Key of Source CustomerID as Primary Key in Dimension Table. But if you
would like to preserve history of customer in Dimension table i.e. Type 2. Then you need
another unique no apart from CustomerID. There you have to use Surrogate Key.
Another reason : If you have AlphaNumeric as a CustomerID. Then you have to use surrogate
key in Dimension Table. It is advisable to have system generated small integer number as a
surrogate key in the dimension table. so that indexing and retrieval is much faster.

35. What is data cleaning? how is it done?
Data Cleansing: the act of detecting and removing and/or correcting a database's dirty data (i.e.,
data that is incorrect, out-of-date, redundant, incomplete, or formatted incorrectly)
It can be done by using the exisitng ETL tools or using third party tools like Trivillium etc.,

36. What are slowly changing dimensions
Dimensions that change over time are called Slowly Changing Dimensions. For instance, a
product price changes over time; People change their names for some reason; Country and State
names may change over time. These are a few examples of Slowly Changing Dimensions since
some changes are happening to them over a period of time

37. What are Data Marts
Data Mart is a segment of a data warehouse that can provide data for reporting and analysis on a
section, unit, department or operation in the company, e.g. sales, payroll, production. Data marts
are sometimes complete individual data warehouses which are usually smaller than the corporate
data warehouse.
Data Mart: a data mart is a small data warehouse. In general, a data warehouse is divided into
small units according the busness requirements. for example, if we take a Data Warehouse of an
organization, then it may be divided into the following individual Data Marts. Data Marts are
used to improve the performance during the retrieval of data.
eg: Data Mart of Sales, Data Mart of Finance, Data Mart of Maketing, Data Mart of HR etc.

38. Can a dimension table contains numeric values?
No. Only Fact Table having Numeric Fields.

39. Explain degenerated dimension in detail.
Degenerated dimension is a dimension, which is not having any source in oltp
It is generated at the time of transaction
Like invoice no this is generated when the invoice is raised
It is not used in linking and it is also not a fkey
But we can refer these degenerated dimensions as a primary key of the fact table
A Degenerate dimension is a Dimension which has only a single attribute.
This dimension is typically represented as a single field in a fact table.
The data items thar are not facts and data items that do not fit into the existing dimensions are
termed as Degenerate Dimensions.
Degenerate Dimensions are the fastest way to group similar transactions.
Degenerate Dimensions are used when fact tables represent transactional data.

40. Give examples of degenerated dimensions
Degenerated Dimension is a dimension key without corresponding dimension. Example:
In the PointOfSale Transaction Fact table, we have:
Date Key (FK), Product Key (FK), Store Key (FK), Promotion Key (FP), and POS Transaction
Number
Date Dimension corresponds to Date Key, Production Dimension corresponds to Production
Key. In a traditional parent-child database, POS Transactional Number would be the key to the
transaction header record that contains all the info valid for the transaction as a whole, such as
the transaction date and store identifier. But in this dimensional model, we have already
extracted this info into other dimension. Therefore, POS Transaction Number looks like a
dimension key in the fact table but does not have the corresponding dimension table.Therefore,
POS Transaction Number is a degenerated dimension.

41. What are the steps to build the data warehouse
1.Gathering bussiness requiremnts
• Identifying Sources
• Identifying Facts
• Defining Dimensions
• Define Attribues
• Redefine Dimensions & Attributes
• Organise Attribute Hierarchy & Define Relationship
• Assign Unique Identifiers
• Additional convetions:Cardinality/Adding ratios
• Understand the bussiness requirements.
2.Once the business requirements are clear then Identify the Grains(Levels).
3.Grains are defined; design the Dimensional tables with the Lower level Grains.
4.Once the Dimensions are designed, design the Fact table With the Key Performance Indicators
(Facts).
5.Once the dimensions and Fact tables are designed define the relation ship between the tables by
using primary key and Foreign Key. In logical phase data base design looks like Star Schema
design so it is named as Star Schema Design

42. What is the different architecture of data warehouse
1. Top down - (bill Inmon)
2. Bottom up - (Ralph kimbol)
There are three types of architectures.
• Date warehouse Basic Architecture:
In this architecture end users access data that is derived from several sources through the data
warehouse.
Architecture: Source --> Warehouse --> End Users
• Data warehouse with staging area Architecture:
Whenever the data that is derived from sources need to be cleaned and processed before putting
it into warehouse then staging area is used.
Architecture: Source --> Staging Area -->Warehouse --> End Users
• Data warehouse with staging area and data marts Architecture:
Customization of warehouse architecture for different groups in the organization then data marts
are added and used.
Architecture: Source --> Staging Area --> Warehouse --> Data Marts --> End Users

								
To top