Data Warehousing Concepts

Document Sample
Data Warehousing Concepts Powered By Docstoc
					1) What are the advantages data mining over tradition...

A) Data Mining is used for the estimation of future. For example, if we take
a company/business organization, by using the concept of Data Mining, we
can predict the future of business in terms of Revenue (or) Employees (or)
Customers (or) Orders etc.

Traditional approaches use simple algorithms for estimating the future. But,
it does not give accurate results when compared to Data Mining.

2) What is the difference between views and materialized Views?

2A) View - stores the SQL statement in the database and let you use it as a
table. Every time you access the view, the SQL statement executes.

Materialized view - stores the results of the SQL in table form in the
database. SQL statement only executes once and after that every time you
run the query, the stored result set is used. Pros include quick query results.


2B) VIEW: This is a PSEUDO table that is not stored in the database and it
is just a query.

MATERIALIZED VIEWS: These are similar to a view but these are
permantely stored in the database and often refreshed. This is used in
optimization for the faster data retrieval and is useful in aggregation and
summarization of data.
 3) What is the main difference between Inmon and Kimball...?

3A) basically speaking, Inmon professes the Snowflake Schema while
Kimball relies on the Star Schema


3B) both differed in the concept of building the data warehouse...

According to Kimball...

Kimball views data warehousing as a constituency of data marts. Data marts
are focused on delivering business objectives for departments in the
organization. And the data warehouse is a conformed dimension of the data
marts. Hence a unified view of the enterprise can be obtained from the
dimension modeling on a local departmental level.

Inmon beliefs in creating a data warehouse on a subject-by-subject area
basis. Hence the development of the data warehouse can start with data from
the online store. Other subject areas can be added to the data warehouse as
their needs arise. Point-of-sale (POS) data can be added later if management
decides it is necessary.



i.e.,

Kimball--First Data Marts--Combined way ---Data warehouse

Inmon---First Data warehouse--Later----Data marts



3C) the main difference b/w the Kimball and inmon technologies is...

Kimball --- creating data marts first then combining them up to form a data
warehouse

Inmon----Creating data warehouse --- then data marts



3D) actually, the main difference is
Kimball: fallows Dimensional Modeling
Inmon: fallows ER Modeling bye Mayee

3E) Ralf Kimball: he follows bottom-up approach i.e., first creates
individual Data Marts from the existing sources and then create Data
Warehouse.

BillImmon: he follows top-down approach i.e., first creates Data Warehouse
from the existing sources and then create individual Data Marts.
4) What is junk dimension? What is the difference between junk dimension
and degenerate dimension?

4A) a junk dimension is a collection of random transactional codes, flags
and text attributes that are unrelated to any particular dimension. The junk
dimension is simply a structure that provides the convenient place to store
the junk dimension.


4B) a "junk" dimension is a collection of random transactional codes, flags
and/or text attributes that are unrelated to any particular dimension. The junk
dimension is simply a structure that provides a convenient place to store the
junk attributes. Where as A degenerate dimension is data that is dimensional
in nature but stored in a fact table.


4C) junk dimension:

The column which we are using rarely or not used, these columns are
formed a dimension is called junk dimension

Degenerative dimension

The column which we use in dimension is degenerative dimension

Example:

EMP table has empno, ename, sal, job, deptno

But

We are talking only the column empno, ename from the EMP table and
forming a dimension this is called degenerative dimension

5) What is the definition of normalized and denormalization?

5A) Normalization is the process of removing redundancies.

Denormalization is the process of allowing redundancies.
OLTP uses the Normalization process and

The OLAP/DW uses the denormalised process to capture greater level of
detailed data (each and every transaction)


6) Why fact table is in normal form?

6A) a fact table consists of measurements of business requirements and
foreign keys of dimensions tables as per business rules.


6B) basically the fact table consists of the Index keys of the dimension/look
up tables and the measures.

So when ever we have the keys in a table .that itself implies that the table is
in the normal form.


7) What is Difference between E-R Modeling and Dimensional modeling?

7A) Basic diff is E-R modeling will have logical and physical model.
Dimensional model will have only physical model.

E-R modeling is used for normalizing the OLTP database design.

Dimensional modeling is used for de-normalizing the ROLAP/MOLAP
design.

 7B) E-R modeling revolves around the Entities and their relationships to
capture the overall process of the system.

Dimensional model/Muti-Dimensinal Modeling revolves around
Dimensions (point of analysis) for decision making and not to capture the
process.



7C) In ER modeling the data is in normalized form. So more number of
Joins, which may adversely affect the system performnace.Whereas in
Dimensional Modeling the data, is denormalised, so less number of joins, by
which system performance will improve.

 8) What is conformed fact?
8A) conformed dimensions are the dimensions which can be used across
multiple Data Marts in combination with multiple facts tables accordingly

8B) Conformed facts are allowed to have the same name in separate tables
and can be combined and compared mathematically.


8C) the relationship between the facts and dimensions are with 3NF, and can
works in any type of joins are called as conformed schema, the members of
that schema are called so...

8D) Conformed dimensions are those tables that have a fixed structure.
There will be no need to change the metadata of these tables and they can go
along with any number of facts in that application without any changes


8E) A dimension table which is used by more than one fact table is known as
a conformed dimension.

9) What are the methodologies of Data Warehousing?
9A) every company has methodology of their own. But to name a few SDLC
Methodology, AIM methodology are stardadly used. Other methodologies
are AMM, World class methodology and many more.



9B) Most of the time, we use Mr. Ralph Kimball methodologies for data
warehousing design. Two kind of schema: star and snow flake.


9C) most probably every one fallows

Either star schema or snowflake schema

9D) there r 2 methodologies 1) Kimball-first data marts then EDWH 2)
inmon-first EDWH then data marts from edwh
9C) regarding the methodologies in the Data warehousing. They are mainly
2 methods.

 1. Ralph Kimball Model

 2. Inmon Model.

 Kimball model always structured as Denormalised structure.

 Inmon model structed as Normalized structure.

  Depends on the requirements of the company anyone can follow the
company's DWH will choose the one of the above models.



9D) in Data warehousing contains the Two Methods

1>> Top Down Method

2>>Bottom up method

In Top Down method first loads the Datamarts and then loads the data ware
house.

In Bottom Up method first loads the Data warehouse and then loads the Data
marts.

9E) Top Down approach is first Data warehouse then Data marts.

Bottom Down approach is first Data marts then Data warehouse.

9F) there are 2 methodologies 1. Kimball 2. Inmon likewise
1. Star Flake 2. Snow Flake schemas



9G) There are two approaches in Data ware housing named as
Top down Approach and Bottom-up Approach

Top down approach in the sense preparing individual departments data (Data
Marts) from the Enterprise Data warehouse

Bottom up Approach is nothing but first gathering all the departments’ data
and then cleanse the data and Transforms the data and then load all the
individual departments data into the enterprise data ware house

10) what is BUS Schema?

10A) BUS Schema is composed of a master suite of confirmed dimension
and standardized definition if facts.

10B) A BUS Schema or a BUS Matrix? A BUS Matrix (in Kimball
approach) is to identify common Dimensions across Business Processes; i.e.:
a way of identifying Conforming Dimensions.

11) What is Data warehousing Hierarchy?

11A) Hierarchies are logical structures that use ordered levels as a means of
organizing data. A hierarchy can be used to define data aggregation. For
example, in a time dimension, a hierarchy might aggregate data from the
month level to the quarter level to the year level. A hierarchy can also be
used to define a navigational drill path and to establish a family structure.

Within a hierarchy, each level is logically connected to the levels above and
below it. Data values at lower levels aggregate into the data values at higher
levels. A dimension can be composed of more than one hierarchy. For
example, in the product dimension, there might be two hierarchies--one for
product categories and one for product suppliers.

Dimension hierarchies also group levels from general to granular. Query
tools use hierarchies to enable you to drill down into your data to view
different levels of granularity. This is one of the key benefits of a data
warehouse.

When designing hierarchies, you must consider the relationships in business
structures. For example, a divisional multilevel sales organization.
Hierarchies impose a family structure on dimension values. For a particular
level value, a value at the next higher level is its parent, and values at the
next lower level are its children. These familial relationships enable analysts
to access data quickly.

Levels
A level represents a position in a hierarchy. For example, a time dimension
might have a hierarchy that represents data at the month, quarter, and year
levels. Levels range from general to specific, with the root level as the
highest or most general level. The levels in a dimension are organized into
one or more hierarchies.

Level Relationships
Level relationships specify top-to-bottom ordering of levels from most
general (the root) to most specific information. They define the parent-child
relationship between the levels in a hierarchy.

Hierarchies are also essential components in enabling more complex
rewrites. For example, the database can aggregate existing sales revenue on
a quarterly base to a yearly aggregation when the dimensional dependencies
between quarter and year are known.

12) What are data validation strategies for data mart v...?

12A) Data validation is to make sure that the loaded data is accurate and
meets the business requirements.

Strategies are different methods followed to meet the validation
requirements

13) What r the data types present in Bo? N what happens I...

13A) three different data types: Dimensions, Measure and Detail.

View is nothing but an alias and it can be used to resolve the loops in the
universe.
13B) in my knowledge, these are called as object types in the Business
Objects.
And alias is different from view in the universe. View is at database level,
but alias is a different name given for the same table to resolve the loops in
universe.

13C) the different data types in business objects are: 1. Character.2. Date.3.
Long text.4. Number


13D) dimension, measure, detail are objects type.

Data types are character, date and numeric

14) What is surrogate key? Where we use it explain WI...

14A) Surrogate key is the primary key for the Dimensional table.


14B) surrogate key is a substitution for the natural primary key.

It is just a unique identifier or number for each row that can be used for the
primary key to the table. The only requirement for a surrogate primary key is
that it is unique for each row in the table.

Data warehouses typically use a surrogate, (also known as artificial or
identity key), key for the dimension tables primary keys. They can use in
sequence generator, or Oracle sequence, or SQL Server Identity values for
the surrogate key.

It is useful because the natural primary key (i.e. Customer Number in
Customer table) can change and this makes updates more difficult.

Some tables have columns such as AIRPORT_NAME or CITY_NAME
which are stated as the primary keys (according to the business users) but
,not only can these change, indexing on a numerical value is probably better
and you could consider creating a surrogate key called, say, AIRPORT_ID.
This would be internal to the system and as far as the client is concerned you
may display only the AIRPORT_NAME.

2. Adapted from response by Vincent on Thursday, March 13, 2003
Another benefit you can get from surrogate keys (SID) is:

Tracking the SCD - Slowly Changing Dimension.

Let me give you a simple, classical example:

On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1'
(that's what would be in your Employee Dimension). This employee has a
turnover allocated to him on the Business Unit 'BU1' but on the 2nd of June
the Employee 'E1' is muted from Business Unit 'BU1' to Business Unit
'BU2.' All the new turnover has to belong to the new Business Unit 'BU2'
but the old one should Belong to the Business Unit 'BU1.'

If you used the natural business key 'E1' for your employee within your data
warehouse everything would be allocated to Business Unit 'BU2' even what
actually belongs to 'BU1.'

If you use surrogate keys, you could create on the 2nd of June a new record
for the Employee 'E1' in your Employee Dimension with a new surrogate
key.

This way, in your fact table, you have your old data (before 2nd of June)
with the SID of the Employee 'E1' + 'BU1.' All new data (after 2nd of June)
would take the SID of the employee 'E1' + 'BU2.'

You could consider Slowly Changing Dimension as an enlargement of your
natural key: natural key of the Employee was Employee Code 'E1' but for
you it becomes
Employee Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2.' But the
difference with the natural key enlargement process is that you might not
have all part of your new key within your fact table, so you might not be
able to do the join on the new enlarge key -> so you need another id.



14C) When creating a dimension table in a data warehouse, we generally
create the tables with a system generated key to uniquely identify a row in
the dimension. This key is also known as a surrogate key. The surrogate key
is used as the primary key in the dimension table. The surrogate key will
also be placed in the fact table and a foreign key will be defined between the
two tables. When you ultimately join the data it will join just as any other
join within the database.

14D) A surrogate key to a data warehouse is what a primary key is for an
OLTP source. It is used to uniquely identify a record in dimension tables. It
provides the solution for the critical col., problem. mar


14E) surrogate key is system generated artificial primary key values

e.g.: any candidate key can be considered as surrogate key.



14F) Surrogate key is a unique identification key, it is like an artificial or
alternative key to production key, because the production key may be
alphanumeric or composite key but the surrogate key is always single
numeric key.

Assume the production key is an alphanumeric field. If u creates an index
for this fields it will occupy more space, so it is not advisable to join/index,
because generally all the data warehousing fact table are having historical
data. These fact tables are linked with so many dimension tables. If it's a
numerical field the performance is high
14G) surrogate key in a data warehouse is more than just a substitute for a
natural key. In a data warehouse, a surrogate key is a necessary
generalization of the natural production key and is one of the basic elements
of data warehouse design

14F) Surrogate key is a system generated key, It is mainly used for
criticalum in dwh, Here criticalum means nothing but it is a column which
when we updated in the them most dwh in to oltp systems

14G) surrogate keys r that which join dimension tables and fact table


14H) Surrogate Key is the solution for critical column problems.

          For example the customer purchases different items in different
locations, for this situation we have to maintain historical data.
        By using surrogate key we can introduce the row in the data
warehouse to maintain historical data.

15) What is a linked cube?

15A) a cube can be partitioned in 3 ways.Replicate, Transparent and Linked.

In the linked cube the data cells can be linked in to another analytical
database. If an end-user clicks on a data cell, you are actually linking
through another analytic database.

15B) linked cube in which a sub-set of the data can be analyzed into great
detail. The linking ensures that the data in the cubes remain consistent.



16) Partitioning a cube?

16A) Partitioning a cube mainly used for optimization.(ex) U may have data
for 5gb to create a report u can specify a size for a cube as 2gb so if the cube
exceeds 2gb it automatically creates the second cube to store the data.

17) What is meant by metadata in context of a Data ware house?
17A) in context of a Data warehouse metadata is meant the information
about the data .This information is stored in the designer repository.


17B) Meta data is the data about data; Business Analyst or data modeler
usually capture information about data - the source (where and how the data
is originated), nature of data (char, varchar, nullable, existence, valid values
etc) and behavior of data (how it is modified / derived and the life cycle ) in
data dictionary a.k.a metadata. Metadata is also presented at the Data mart
level, subsets, fact and dimensions, ODS etc. For a DW user, metadata
provides vital information for analysis / DSS.



17C) metadata is data about data, it including things name, location, and
length including things.
We can u store data in metadata in data warehouse

18) What is incremental loading? 2. What i...

18A) Incremental loading means loading the ongoing changes in the OLTP.

Aggregate table contains the [measure] values, aggregated /grouped/summed
up to some level of hierarchy.



18B) Please learn to spell incremental and cross reference first! Or at least
use a spell check!
 19) What are the possible data marts in Retail sales....?

19A) product information, sales information

20) What is the main difference between schema in RDBMS and schemas in
Data Warehouse....?

20A) RDBMS Schema
* Used for OLTP systems
* Traditional and old schema
* Normalized
* Difficult to understand and navigate
* Cannot solve extract and complex problems
* Poorly modeled

DWH Schema
* Used for OLAP systems
* New generation schema
* De Normalized
* Easy to understand and navigate
* Extract and complex problems can be easily solved
* Very good model


20B) Schema is nothing but the systematic arrangement of tables
 In OLTP it will be normalized
In Data warehouse it will be denormalized

20C) the difference depends on the context.

Technically, if Oracle is used, a schema is a "user". In that context there is no
difference between the schemas in OLTP or ROLAP.Although,
denormalized/normalized tables are given as examples above, it is not the
difference.

20D) RDBMS-normalized

Data warehouse -Denormalized
20E) Diff b.w OLTP and OLAP:
------------------------

OLTP Schema:
* Normalized
* More no. of Trans
* Less time for queries execution
* More no. of users
* Have Insert, delete and update Trans.

OLAP (DWH) Schema:

* De Normalized
* Less no. of Trans
* Less no. of users
* More time for query exec
* Will not have more insert, delete and updates.

21) What are the various ETL tools in the Market?

21A) Informatica, Ascential Data stage, AbInitio

21B) Various ETL tools used in market are:

Informatica
Data Stage
Oracle Warehouse Builder
Ab Initio
Data Junction

21C) 1. Informatica Power Center
2. Ascential Data Stage
3. ESS Base Hyperion
4. Ab Intio
5. BO Data Integrator
6. SAS ETL
7. MS DTS
8. Oracle OWB
9. Pervasive Data Junction
10. Cognos Decision Stream

21D) ETL TOOLS BY different vendors
Informatica
Ascential Data stage
AbInitio

21E)
Informatica
Data Stage
MS-SQL DTS (Integrated Services 2005)
Abinitio
SQL Loader
Sunopsis
Oracle Warehouse Builder
Data Junction
Data Integrator (Business Objects)


21F) Have any come acress ETL tool "sunopsis"..? If not please check this
URL....It is amazing...

http://www.sunopsis.com


22) What is Dimensional Modeling?
22A) In Dimensional Modeling, Data is stored in two kinds of tables: Fact
Tables and Dimension tables.
Fact Table contains fact data e.g. sales, revenue, profit etc.....
Dimension table contains dimensional data such as Product Id, product
name, product description etc.....



22B) Dimensional Modeling is a design concept used by many data
warehouse designers to build their data warehouse. In this design model all
the data is stored in two types of tables - Facts table and Dimension table.
Fact table contains the facts/measurements of the business and the
dimension table contains the context of measurements i.e., the dimensions
on which the facts are calculated.


23) Why is Data Modeling Important?

Data modeling is probably the most labor intensive and time consuming part
of the development process. Why bother especially if you are pressed for
time? A common response by practitioners who write on the subject is that
you should no more build a database without a model than you should build
a house without blueprints.

The goal of the data model is to make sure that the all data objects required
by the database are completely and accurately represented. Because the data
model uses easily understood notations and natural language, it can be
reviewed and verified as correct by the end-users.

The data model is also detailed enough to be used by the database
developers to use as a "blueprint" for building the physical database. The
information contained in the data model will be used to define the relational
tables, primary and foreign keys, stored procedures, and triggers. A poorly
designed database will require more time in the long-term. Without careful
planning you may create a database that omits data required to create critical
reports, produces results that are incorrect or inconsistent, and is unable to
accommodate changes in the user's requirements.



22C) Steps In Building the Data Model
While ER model lists and defines the constructs required to build a data
model, there is no standard process for doing so. Some methodologies, such
as IDEFIX, specify a bottom-up development process were the model is
built in stages. Typically, the entities and relationships are modeled first,
followed by key attributes, and then the model is finished by adding non-key
attributes. Other experts argue that in practice, using a phased approach is
impractical because it requires too many meetings with the end-users. The
sequence used for this document is:

Identification of data objects and relationships

Drafting the initial ER diagram with entities and relationships
Refining the ER diagram

Add key attributes to the diagram

Adding non-key attributes

Diagramming Generalization Hierarchies

Validating the model through normalization

Adding business and integrity rules to the Model


22D) Dimensional Modeling is a logical design technique that seeks to
present the data in a standard, intuitive framework that allows for high-
performance access. It is inherently dimensional, and it adheres to a
discipline that uses the relational model with some important restrictions.
Every dimensional model is composed of one table with a multipart key,
called the fact table, and a set of smaller tables called dimension tables. Each
dimension table has a single-part primary key that corresponds exactly to
one of the components of the multipart key in the fact table. Dimensional
Modeling



22E) it’s a process or technique of designing a database model.
22F) a centralized table is called as fact table which is connected to multiple
dimension table is called as dimensional modeling or star schema



22G) Systematic arrangement of Fact/Dimension tables are called as
Schema, the designing a schema in data warehouse / data mart is known as
Dimension modeling


22H) Dimensional Modeling, It is a modeling technic used in OLAP system,

Here one fact table and surrounded by different dimensions.

23) What is VLDB?

23A) very large database

 23B) the perception of what constitutes a VLDB continues to grow. A one
terabyte database would normally be considered to be a VLDB.

23C) Data base is too large to back up in a time frame
Then it's a VLDB

23D) VLDB stands for Very Large Data Base, any database too large
(normally more than 1TB) considered as VLDB.


23E) Very Large Database (VLDB)

It is sometimes used to describe databases occupying magnetic storage in the
terabyte range and containing billions of table rows. Typically, these are
decision support systems or transaction processing applications serving large
numbers of users.

24) What is real time data-warehousing?

24A) Real-time data warehousing is a combination of two things: 1) real-
time activity and 2) data warehousing. Real-time activity is activity that is
happening right now. The activity could be anything such as the sale of
widgets. Once the activity is complete, there is data about it.

Data warehousing captures business activity data. Real-time data
warehousing captures business activity data as it occurs. As soon as the
business activity is complete and there is data about it, the completed
activity data flows into the data warehouse and becomes available instantly.
In other words, real-time data warehousing is a framework for deriving
information from data as the data becomes available.



24B) A real time data warehouse provide live data for DSS (may not be
100% up to that moment, some latency will be there). Data warehouse have
access to the OLTP sources, data is loaded from the source to the target not
daily or weekly, but may be every 10 minutes through replication or log
shipping or something like that. SAP BW is providing real time DW, with
the help of extended star schema, source data is shared.


24C) in real-time data warehousing, your warehouse contains completely
up-to-date data and is synchronized with the source systems that provide the
source data. In near-real-time data warehousing, there is a minimal delay
between source data being generated and being available in the data
warehouse. Therefore, if you want to achieve real-time or near-real-time
updates to your data warehouse, you’ll need to do three things:

Reduce or eliminate the time taken to get new and changed data out of your
source systems.
Eliminate, or reduce as much as possible, the time required to cleanse,
transform and load your data.
Reduce as much as possible the time required to update your aggregates.
Starting with version 9i, and continuing with the latest 10g release, Oracle
has gradually introduced features into the database to support real-time and
near-real-time, data warehousing. These features include:

Change Data Capture
External tables, table functions, pipelining, and the MERGE command, and
Fast refresh materialized views
24D) Real time Data warehousing means combination of heterogeneous
databases and query and analysis purpose and Decision-making and
reporting purpose.

25) What is a lookup table?

25A) when a table is used to check for some data for its presence prior to
loading of some other data or the same data to another table, the table is
called a LOOKUP Table.


25B) when a value for the column in the target table is looked up from
another table apart from the source tables, that table is called the lookup
table.


25C) when we want to get related value from some other table based on
particular value... suppose in one table A we have two columns emp_id,
name and in other table B we have emp_id address in target table we want to
have emp_id, name, address we will take source as table A and look up table
as B by matching emp_id we will get the result as three columns...emp_id,
name, address


25D) A lookup table is nothing but a 'lookup' it gives values to referenced
table (it is a reference), it is used at the run time, it saves joins and space in
terms of transformations. Example, a lookup table called states, provide
actual state name ('Texas') in place of TX to the output.


25E) based on responsibility how to protect/secure/hide even lookup values
such as meaning?


25F) when a table is used to check for some data for its presence prior to
loading of some other data or the same data to another table, the table is
called a LOOKUP Table.

25G) reference table can be otherwise called as lookup table
25H) in DW Terminology the Dimension table is also called as Look up
Table (Specific IB Business Objects)

Since the index key in the fact table is from (referencing) the particular
Dimension table so it’s also called as look up table.



25I) The Look Up table provides the detailed information about the
attributes. For example, the lookup table for the quarter attribute would
include a list of all the quarters available in the data warehouse.i.e. First
quarter of 2001 may be represented as "Q1 2001" or "2001 Q1".BYE.

26) What is a general purpose scheduling tool?

26A) General purpose of scheduling tool may be cleansing and loading data
at specific given time


26B) the basic purpose of the scheduling tool in a DW Application is to
stream line the flow of data from Source to Target at specific time or based
on some condition.

27) What type of Indexing mechanism do we need to use for a typical data
warehouse

27A) bitmap index

27B) Function Index, B-tree Index, Partition Index, Hash index etc...


27C) on the fact table it is best to use bitmap indexes. Dimension tables can
use bitmap and/or the other types of clustered/non-clustered, unique/non-
unique indexes.

To my knowledge, SQLServer does not support bitmap indexes. Only Oracle
supports bitmaps.
27D)It generally depends upon the data which u have in table if u have less
distinct values in particular column its always that u built up bit map index...
rather that other one on dimension tables generally we have indexes...

27E) that is based on requirement and size of your data mart/data
warehouse; Most of the data warehouse is in Bitmap index


28) Explain the advantages of RAID 1, 1/0, and 5. What type of RAID setup
would you put your TX logs

28A) Raid 0 - Make several physical hard drives look like one hard drive.
No redundancy but very fast. May use for temporary spaces where loss of
the files will not result in loss of committed data.

Raid 1- Mirroring. Each hard drive in the drive array has a twin. Each twin
has an exact copy of the other twin’s data so if one hard drive fails; the other
is used to pull the data. Raid 1 is half the speed of Raid 0 and the read and
writes performance is good.

Raid 1/0 - Striped Raid 0, then mirrored Raid 1. Similar to Raid 1.
Sometimes faster than Raid 1. Depends on vendor implementation.

Raid 5 - Great for read-only systems. Write performance is 1/3rd that of
Raid 1 but Read is same as Raid 1. Raid 5 is great for DW but not good for
OLTP.

Hard drives are cheap now so I always recommend Raid 1.

29) What is a Data Warehousing?

29A) Datawarehosing is a process of creating, queriring and populating data
warehouse. It includes a number of discrete technologies like
Identifying sources
Process of ECCD, ETL which includes data cleansing, data transforming and
data loading to targets.
29B) A Data warehouse is a subject oriented, integrated, time-variant,
nonvolatile collection of data to enable decision making across disparate
group of users.


29C) a data warehouse is a repository containing subject-oriented,
integrated, time-variant and non-volatile collection of data, used for
companies’ decision support systems requirement


29D) Data warehousing is a subject oriented, authoritative, integrated
historical database reflective of changes over meaningful time periods in
order to facilitate query and analysis for useful management decision
making.



29E) Data warehousing is a subject oriented, authoritative, integrated
historical database reflective of changes over meaningful time periods in
order to facilitate query and analysis for useful management decision
making.



29F) Data warehouse contains a collection of historic (history of data),
integrated, non-volatile data, which is used for analyzing and developing
forecasting reports.


30) What does level of Granularity of a fact table signify?

30A) it describes the amount of space required for a database.


30B) Level of Granularity indicates the extent of aggregation that will be
permitted to take place on the fact data. More Granularities implies more
aggregation potential and vice-versa.
30C) in simple terms, level of granularity defines the extent of detail. As an
example, let us look at geographical level of granularity. We may analyze
data at the levels of COUNTRY, REGION, TERRITORY, CITY and
STREET. In this case, we say the highest level of granularity is STREET.


30D) level of granularity means the upper/lower level of hierarchy, up to
which we can see/drill the data in the fact table.



30E) Granularity means nothing but it is a level of representation of
measures and metrics.

The lowest level is called detailed data

And highest level is called summary data

It depends of project we extract fact table significance

31) What is data mining?

31A) Data mining is a process of extracting hidden trends within a data
warehouse. For example an insurance data ware house can be used to mine
data for the most high risk people to insure in a certain geographical area.


31B) in its simple definition you can say data mining is a way to discover
new meaning in data.


31C) Data mining is a concept of deriving/discovering the hidden,
unexpected information from the existing data


31D) Data Mining is a non-trivial process of identified valid, potantially
useful and ultimately understands of data
31E) A data warehouse typically supplies answer to a question like 'who is
buying our products/". A data mining approach would seek answer to
questions like "Who is NOT buying our products?”

32) What is degenerate dimension table?

32A) the values of dimension which is stored in fact table is called
degenerate dimensions. These dimensions don’t have its own dimensions.


32B) A attribute in fact table it’s not a fact and it’s not a key value


32C) in simple terms, the column in a fact table that does not map to any
dimensions, neither it s a measure column.

For e.g. Invoice_no, Invoice_line_no in fact table will be a degenerate
dimension (columns), provided if you don’t have a dimension called invoice.



32D) Degenerate Dimensions: If a table contains the values, which r neither
dimension nor measures is called degenerate dimensions.Ex: invoice id,
empno

33) How do you load the time dimension

33A)In Data ware house we manually load the time dimension



33B) Every Data warehouse maintains a time dimension. It would be at the
most granular level at which the business runs at (ex: week day, day of the
month and so on). Depending on the data loads, these time dimensions are
updated. Weekly process gets updated every week and monthly process,
every month.


33C) Time dimension in DWH must be load manually. We load data into
Time dimension using pl/sql scripts.
33D) Generally we load the Time dimension by using Source Stage as a Seq
File and we use one passive stage in that transformer stage we will manually
write functions as Month and Year Functions to load the time dimensions but
for the lower level i.e., Day also we have one function to implement loading
of Time Dimension.



34) What is ER DIAGRAM?

34A) ER - Stands for entity relationship diagrams. It is the first step in the
design of data model which will later lead to a physical database design of
possible an OLTP or OLAP database


34B) The Entity-Relationship (ER) model was originally proposed by Peter
in 1976 [Chen76] as a way to unify the network and relational database
views.

Simply stated the ER model is a conceptual data model that views the real
world as entities and relationships. A basic component of the model is the
Entity-Relationship diagram which is used to visually represent data objects.

Since Chen wrote his paper the model has been extended and today it is
commonly used for database design For the database designer, the utility of
the ER model is:

It maps well to the relational model. The constructs used in the ER model
can easily be transformed into relational tables.
it is simple and easy to understand with a minimum of training. Therefore,
the model can be used by the database designer to communicate the design
to the end user.

In addition, the model can be used as a design plan by the database
developer to implement a data model in a specific database management
software.
34C) ER diagram is a entity relationship diagram that provides the entities
along with attributes.

34D) E.R Diagram (Entity Relationship diagram) means how the different
database table related to each other and what r the primary key and foreign
key and their relation.

It is the first step of any database project to build E-R Diagram

34E) Physical and logical arrangement of the database table and relationship
is explained by a diagram, that diagram is known as ER diagram

34F) ER diagram means it is a suitable modeling technic in OLTP systems,

Here contain one-one, menty-menty relationship

35) Difference between Snow flake and Star Schema. What are situations
where Snow flake Schema is better?

35A) star schema and snowflake both serve the purpose of dimensional
modeling when it comes to data warehouses.
Star schema is a dimensional model with a fact table (large) and a set of
dimension tables (small). The whole set-up is totally denormalized.
However in cases where the dimension table are split to many table that is
where the schema is slightly inclined towards normalization ( reduce
redundancy and dependency) there comes the snow flake schema.

The nature/purpose of the data that is to be feed to the model is the key to
your question as to which is better.


35B) Star schema contains the dimension tables mapped around one or more
fact tables.

It is a denormalised model.

No need to use complicated joins.

Queries results fastly.
Snowflake schema

It is the normalized form of Star schema.

Contains in-depth joins, because the tables r splitted in to many pieces. We
can easily do modification directly in the tables.

We have to use complicated joins, since we have more tables.

There will be some delay in processing the Query.



35C) Star Schema means

A centralized fact table and surrounded by different dimensions

Snowflake means

In the same star schema dimensions split into another dimensions

Star Schema contains Highly Denormalized Data

Snow flake contains partially normalized

Star can not have parent table

But snow flake contain parent tables

Why need to go there Star:

Here 1) less joiners contain

2) Simply database

3) Support drilling up options

Why need to go Snowflake schema:
Here some times we used to provide seperate dimensions from existing
dimensions that time we will go to snowflake

Dis Advantage Of snowflake:

Query performance is very low because more joiners is there

35D) star schema and snowflake both serve the purpose of dimensional
modeling when it comes to data warehouses.
Star schema is a dimensional model with a fact table (large) and a set of
dimension tables (small). The whole set-up is totally denormalized.
However in cases where the dimension table are split to many table that is
where the schema is slightly inclined towards normalization ( reduce
redundancy and dependency) there comes the snow flake schema.

The nature/purpose of the data that is to be feed to the model is the key to
your question as to which is better

36) What is a CUBE in data warehousing concept?

36A) Cubes are logical representation of multidimensional data. The edge of
the cube contains dimension members and the body of the cube contains
data values.


36B) Cube is a logical schema which contains facts and dimensions


36C) cubes r multi-dimensional view of DW or data marts. it is designed in a
logical way to drill, slice-n-dice. Every part of the cube is a logical
representation of the combination of facts-dimension attribs.

37) What is ODS

37A) ODS stands for Online Data Storage.

It is used to maintain, store the current and up to date information and the
transactions regarding the source databases taken from the OLTP system.
It is directly connected to the source database systems instead of to the
staging area.

It is further connected to data warehouse and moreover can be treated as a
part of the data warehouse database.

Edit by Admin: ODS Stands for Operational Data Store not Online Data
Storage




37B) ODS stands for Operational Data Store.
It is the final integration point in the ETL process before loading the data
into the Data Warehouse.



37C) ODS stands for Operational Data Store. It contains near real time data.
In typical data warehouse architecture, sometimes ODS is used for analytical
reporting as well as source for Data Warehouse.


37D) Operational Data Services is Hybrid structure that has some aspects of
a data warehouse and other
Aspects of an Operational system.
Contains integrated data.
It can support DSS processing.
It can also support High transaction processing.
Placed in between Warehouse and Web to support web users.



37E) the form that data warehouse takes in the operational environment.
Operational data stores can be updated, do provide rapid constant time, and
contain only limited amount of historical data
37F) An Operational Data Store presents a consistent picture of the current
data stored and managed by transaction processing system. As data is
modified in the source system, a copy of the changed data is moved into the
ODS. Existing data in the ODS is updated to reflect the current status of the
source system


37G) ODS means Operational Data Store

It is used to store current data through transactional webpplications, sap, and
MQ series

Current data means particular data from one date into one date

ODS contains 30-90 data



37 H) an Operational Data Store is a collection of data in support of an
organizations need for up to operational, integrated, collective information.
ODS is purely operational construct to address the operational needs of a
corporation. While loading data from Staging to ODS we do the process of
data scrubbing, data validation.

38) What are conformed dimensions?

38A) they are dimension tables in a star schema data mart that adhere to a
common structure, and therefore allow queries to be executed across star
schemas. For example, the Calendar dimension is commonly needed in most
data marts. By making this Calendar dimension adhere to a single structure,
regardless of what data mart it is used in your organization, you can query
by date/time from one data mart to another to another.


38B) Conformed dimensions are dimensions which are common to the
cubes. (Cubes are the schemas contains facts and dimension tables)

Consider Cube-1 contains F1, D1, D2, D3 and Cube-2 contains F2,D1, D2,
D4 are the Facts and Dimensions
Here D1, D2 are the Conformed Dimensions



38C) if a table is used as a dimension table for more than one fact tables.
Then the dimension table is called conformed dimensions.


38D) confirmed dimensions are the dimensions which can be used in
multiple star schemas correct me if i am wrong.


38E) Conformed Dimensions are the one if they share one or more attributes
whose values are drawn from the same domains.



38F) the dimensions which is used more than one fact table is called
conformed dimensions

38G) A conformed dimension is a single, coherent view of the same piece of
data throughout the organization. The same dimension is used in all
subsequent star schemas defined. This enables reporting across the complete
data warehouse in a simple format


38H) Conformed Dimensions are the Dimensions which are common to two
cubes .say CUBE-1 contains F1,D1,D2,D3 and CUBE-2 contains
F2,D1,D2,D4 are the Facts and Dimensions ,here D1,D2 are the Conformed
Dimensions



38I) if the dimension is 100% sharable across the star schema then this
dimension is called as confirmed dimension.
RE: Which columns go to the fact table and which columns go the
dimension table

39) What are SCD1, SCD2, and SCD3?
39A) SCD 1: Complete overwrite
SCD 2: Preserve all history. Add row
SCD 3: Preserve some history. Add additional column for ol/new.


39B) SCD Type 1, the attribute value is overwritten with the new value,
obliterating the historical attribute values. For example, when the product
roll-up
Changes for a given product, the roll-up attribute are merely updated with
the current value.
SCD Type 2, a new record with the new attributes is added to the dimension
table. Historical fact table rows continue to reference the old dimension key
with the old roll-up attribute; going forward, the fact table rows will
reference the new surrogate key with the new roll-up thereby perfectly
partitioning history.
SCDType 3, attributes are added to the dimension table to support two
simultaneous roll-ups - perhaps the current product roll-up as well as
“current version minus one”, or current version and original.



39C) SCD: -------- The value of dimensions is used change very rarely that is
called Slowly Changing dimensions

Here mainly 3

1) SCD1: Replace the old values overwrite by new values

2) SCD2: Just Creating Additional records

3) SCD3: It's maintain just previous and recent

In the SCD2 again 3

1) Versioning

2) Flag value

3) Effective Date range
Versioning: Here the updated dimensions inserted in to the target along with
version number

The new dimensions will be inserted into the target along with Primary key

Flagvalue: The updated dimensions insert into the target along with 0

And new dimensions inset into the target along with 1

40) What is Normalization, First Normal Form, Second Normal Form, and
Third Normal Form?

40A) Normalization: The process of decomposing tables to eliminate data
redundancy is called Normalization.

1N.F:- The table should contain scalar or atomic values.
2 N.F:- Table should be in 1N.F + No partial functional dependencies
3 N.F:-Table should be in 2 N.F + No transitive dependencies


40B) 2NF - table should be in 1NF + non-key should not dependent on
subset of the key ({part, supplier}, sup address)

3NF - table should be in 2NF + non key should not dependent on another
non-key ({part}, warehouse name, warehouse addr)

{Primary key}

More...
4, 5 NF - for multi-valued dependencies (essentially to describe many-to-
many relations)

40C) Normalization can be defined as segregating of table into two different
tables, so as to avoid duplication of values.
The normalization is a step by step process of removing redundancies and
dependencies of attributes in data structure

The condition of data at completion of each step is described as a “normal
form”.
Needs for normalization: improves data base design.
Ensures minimum redundancy of data.
Reduces need to reorganize data when design is modified or enhanced.
Removes anomalies for database activities.

First normal form:
· A table is in first normal form when it contains no repeating groups.
· The repeating column or fields in a UN normalized table are removed from
the table and put in to tables of their own.
· Such a table becomes dependent on the parent table from which it is
derived.
· The key to this table is called concatenated key, with the key of the parent
table forming a part it.

Second normal form:
· A table is in second normal form if all its non_key fields fully dependent on
the whole key.
· This means that each field in a table must depend on the entire key.
· Those that do not depend upon the combination key, are moved to another
table on whose key they depend on.
· Structures which do not contain combination keys are automatically in
second normal form.
Third normal form:
· A table is said to be in third normal form, if all the non key fields of the
table are independent of all other non key fields of the same table.


40D) Normalization is a process of remove the redundancy and
inconsistency

Mainly 3 normal forms

1normal form: Here contain only atomic values

2normal form: The nonkey values must be depend upon the primary key

3normal form: not depended on transitivity
40E) Normalization: It is the process of efficiently organizing data in a
database.There is 2-goals of the normalization process: 1. Eliminate
redundant data 2. Ensure data dependencies make sense (only storing related
data in a table)First Normal Form: It sets the very basic rules for an
organized database. 1. Eliminate duplicate columns from the same table 2.
Create separate tables for each group of related data and identify each row
with a unique column or set of columns. Second Normal Form: Further
addresses the concept of removing duplicative data. 1. Remove subsets of
data that apply to multiple rows of a table and place them in separate tables.
2. Create relationships between these new tables and their predecessors
through the use of foreign keys. Third Normal Form: 1.Remove columns
that are not dependent upon the primary key. Fourth Normal Form: 1.A
relation is in 4NF if it has no multi valued dependencies. These
normalization guidelines are cumulative. For a database to be in 2NF, it must
first fulfill all the criteria of a 1NF database.

41) What is ETL?

41A) ETL is extraction, trasformation and loading, ETL technology is used
for extraction the information from the source database and loading it to the
target database with the necessary transformations done in between.

41B) ETL is a short for Extract, Transform and Load. It is a data integration
function that involves extracting the data from outside sources, transforming
it into business needs and ultimately loading it into a data warehouse


41C) ETL is an abbreviation for "Extract, Transform and Load”. This is the
process of extracting data from their operational data sources or external
data sources, transforming the data which includes cleansing, aggregation,
summarization, integration, as well as basic transformation and loading the
data into some form of the data warehouse.


41 D) Extraction Transformations Loading


41E) E: Extraction of data from the homogeneous/heterogeneous sources.
     T: Transforming/modifying the source data by applying some
transformations like Filter, Expression, Router, Joiner, Union (or) Lookup.

     L: Loading the Transformed data into corresponding Target tables.

42) What are non-additive facts?

42A) Non-additive facts are facts that cannot be summed up for any of
The dimensions present in the fact table. Example: temparature, bill
number...etc

42B) fact table typically has two types of columns: those that contain
numeric facts (often called measurements), and those that are foreign keys to
dimension tables.

A fact table contains either detail-level facts or facts that have been
aggregated. Fact tables that contain aggregated facts are often called
summary tables. A fact table usually contains facts with the same level of
aggregation.

Though most facts are additive, they can also be semi-additive or non-
additive. Additive facts can be aggregated by simple arithmetical addition. A
common example of this is sales. Non-additive facts cannot be added at all.

An example of this is averages. Semi-additive facts can be aggregated along
some of the dimensions and not along others. An example of this is
inventory levels, where you cannot tell what a level means simply by
looking at it.


42C) If the columns of a fact table is not able in the position to aggregate
then it is called non-additive facts.


42D) Non-Additive: Non-additive facts are facts that cannot be summed up
for any of the dimensions present in the fact table.


43) How are the Dimension tables designed?
43A) most dimension tables are designed using Normalization principles up
to 2NF. In some instances they are further normalized to 3NF.


43B) Find where data for this dimension are located.

Figure out how to extract this data.

Determine how to maintain changes to this dimension (see more on this in
the next section).

Change fact table and DW population routines.

44) Why should you put your data warehouse on a different system than
your OLTP system?

44A)OLTP system stands for on-line transaction processing.

These are used to store only daily transactions as the changes have to be
made in as few places as possible. OLTP do not have historical data of the
organization

Data warehouse will contain the historical information about the
organization


44B) Data Warehouse is a part of OLAP (On-Line Analytical Processing). It
is the source from which any BI tools fetch data for Analytical, reporting or
data mining purposes. It generally contains the data through the whole life
cycle of the company/product. DWH contains historical, integrated,
denormalized, subject oriented data.
However, on the other hand the OLTP system contains data that is generally
limited to last couple of months or a year at most. The nature of data in
OLTP is: current, volatile and highly normalized. Since, both systems are
different in nature and functionality we should always keep them in different
systems.


44C) An DW is typically used most often for intensive querying . Since the
primary responsibility of an OLTP system is to faithfully record on going
transactions (inserts/updates/deletes), these operations will be considerably
slowed down by the heavy querying that the DW is subjected to.

45) What is Fact Table?
45A) A table in a data warehouse whose entries describe data in a fact table.
Dimension tables contain the data from which dimensions are created.


45B) a fact table in data ware house is it describes the transaction data. It
contains characteristics and key figures.


45C) A Fact table is a collection of facts and foreign key relations to the
dimensions.


45D) Fact Table contains the measurements or metrics or facts of business
process. If your business process is "Sales”, then a measurement of this
business process such as "monthly sales number" is captured in the Fact
table. Fact table also contains the foreign keys for the dimension tables.


45E) Fact table contains the transactions data ,which have more columns and
less no of rows.

Among the data it also includes the foreign key of the dimension tables
which r attached to it.



45F) Fact Table contains the keys(primary key,foreign key) of the related
dimension tables and measures which is based on the keys.



45G) Fact table will have numeric columns.

Or Values of the columns in Dimension table
45H) fact table which represents the information of measurements and as
well as the foreign key of dimension table.... if am wrong please inform me


46) What are Semi-additive and factless facts and in which scenario will you
use such kinds of fact table?

46A) Semi-Additive: Semi-additive facts are facts that can be summed up
for some of the dimensions in the fact table, but not the others. For example:
Current_Balance and Profit_Margin are the facts. Current_Balance is a
semi-additive fact, as it makes sense to add them up for all accounts (what's
the total current balance for all accounts in the bank?), but it does not make
sense to add them up through time (adding up all current balances for a
given account for each day of the month does not give us any useful
information


46B) a factless fact table captures the many-to-many relationships between
Dimensions, but contains no numeric or textual facts. They are often used to
record events or
Coverage information. Common examples of factless fact tables include:
- Identifying product promotion events (to determine promoted products that
didn’t sell)
- Tracking student attendance or registration events
- Tracking insurance-related accident events
- Identifying building, facility, and equipment schedules for a hospital or
university
47) What is a level of Granularity of a fact table?

47A) Level of granularity means level of detail that you put into the fact
table in a data warehouse. For example: Based on design you can decide to
put the sales data in each transaction. Now, level of granularity would mean
what detail you are willing to put for each transactional fact. Product sales
with respect to each minute or you want to aggregate it up to minute and put
that data.


47B) It also means that we can have (for example) data aggregated for a year
for a given product as well as the data can be drilled down to Monthly,
weekly and daily basis...the lowest level is known as the grain. Going down
to details is Granularity

The Aggregation or calculated value columns will go to Fact Table and
details information will go to dimensional table.


48) Which columns go to the fact table and which columns go the dimension
table?

48A) To add on, Foreign key elements along with Business Measures, such
as Sales in $ amt, Date may be a business measure in some case, units (qty
sold) may be a business measure, are stored in the fact table. It also depends
on the granularity at which the data is stored


48B) Before broken into columns is going to the fact

After broken going to dimensions



49) What are the Different methods of loading Dimension Tables?

49A) they are of two types insert--> if it is not there in the dimension and
update--> if it exists.


49B) Conventional Load:
Before loading the data, all the Table constraints will be checked against the
data.

Direct load :( Faster Loading)
All the Constraints will be disabled. Data will be loaded directly. Later the
data will be checked against the table constraints and the bad data won't be
indexed.
49C) Conventional and Direct load method are applicable for only oracle.
The naming convention is not general one applicable to other RDBMS like
DB2 or SQL server..


50) What are Aggregate Tables?

50A) Aggregate tables contain redundant data that is summarized from other
data in the warehouse.


50B) these are the tables which contain aggregated / summarized data. E.g.
Yearly, monthly sales information. These tables will be used to reduce the
query execution time.

50C) Aggregate table contains the summary of existing warehouse data
which is grouped to certain levels of dimensions. Retrieving the required
data from the actual table, which have millions of records will take more
time and also affects the server performance. To avoid this we can aggregate
the table to certain required level and can use it. This table reduces the load
in the database server and increases the performance of the query and can
retrieve the result very fastly.


51) What is a dimension table?

51A) a dimension table in data warehouse is one which contains primary key
and attributes. we called primary key as DIMID's(dimension id's).


51B) a dimensional table is a collection of hierarchies and categories along
which the user can drill down and drill up. it contains only the textual
attributes.


51C) Dimension tables r nothing but a master tables ,thru which u can
extract the actual transactions .Dimension table contains less columns and
more rows.
51D) Dimensional table is a table which contains business dimensions thru
which v analyze the business matrices

52) What are the various Reporting tools in the Market?

52A) Cognos
Business Objects
Micro Strategies
Actuate


52B)
1. MS-Excel
2. Business Objects (Crystal Reports)
3. Cognos (Impromptu, Power Play)
4. Micro strategy
5. MS reporting services
6. Informatica Power Analyzer
7. Actuate
8. Hyperion (BRIO)
9. Oracle Express OLAP
10. Proclarity


52C)
INEA
MS-Excel
Business Objects (Crystal Reports)
Cognos (Impromptu, Power Play)
Micro strategy
MS reporting services
Informatica Power Analyzer
Actuate
Hyperion (BRIO)
Oracle Express OLAP
Proclarity
SAS


52D) Reporting tools are entirely different from OLAP tools
OLAP tool are
1. Cognos
2. Business Objects
3. SAS
4. Microsoft Source analyzer
5. MSTR
6. Hyperion (BRIO)



53) What is the Difference between OLTP and OLAP?


53A) OLTP

Current data
Short database transactions
Online update/insert/delete
Normalization is promoted
High volume transactions
Transaction recovery is necessary


OLAP
Current and historical data
Long database transactions
Batch update/insert/delete
Denormalization is promoted
Low volume transactions
Transaction recovery is not necessary




53B) OLTP is nothing but Online Transaction Processing, which contains a
normalized tables and online data, which have frequent insert/updates/delete.
But OLAP (Online Analytical Programming) contains the history of OLTP
data, which is, non-volatile ,acts as a Decisions Support System and is used
for creating forecasting reports.


53C)
OLTP: FEW

OLAP: MANY

JOINS

OLTP: MANY

OLAP: FEW

54) What is a Star Schema?

54A) A relational database schema organized around a central table (fact
table) joined to a few smaller tables (dimension tables) using foreign key
references. The fact table contains raw numeric items that represent relevant
business facts (price, discount values, number of units sold, dollar value,
etc.)

54B) Star schema is a type of organizing the tables such that we can retrieve
the result from the database easily and fastly in the warehouse environment.
Usually a star schema consists of one or more dimension tables around a fact
table which looks like a star, so that it got its name.


54C) it’s a type of organizing the entities in a way, such that u can retrieve
the result from the database easily and very fastly.Usually a star schema will
have one or more dimension tables linking around a fact table and looks like
a star. Hence got this name.


54D) Single fact table with ‘N’ number of dimension tables

55) Why are OLTP database designs not generally a good idea for a Data
Warehouse?
55A) OLTP cannot store historical information about the organization. It is
used for storing the details of daily transactions while a data warehouse is a
huge storage of historical information obtained from different datamarts for
making intelligent decisions about the organization.


56) Differences between star and snowflake schemas?

56A) the star schema is created when all the dimension tables directly link to
the fact table. Since the graphical representation resembles a star it is called
a star schema. It must be noted that the foreign keys in the fact table link to
the primary key of the dimension table. This sample provides the star
schema for a sales_ fact for the year 1998. The dimensions created are Store,
Customer, Product_ class and time_by_day. The Product table links to the
product_class table through the primary key and indirectly to the fact table.
The fact table contains foreign keys that link to the dimension tables.


56B) the snowflake schema is a schema in which the fact table is indirectly
linked to a number of dimension tables. The dimension tables are
normalized to remove redundant data and partitioned into a number of
dimension tables for ease of maintenance. An example of the snowflake
schema is the splitting of the Product dimension into the product_category
dimension and product_manufacturer dimension..


Read more on this here....
http://www.exforsys.com/content/view/1301/332/This tutorial covers
Designing the Dimensional Model, Dimensional Model schemas like Star
Schema, Snowflake Schema, Optimizing star schema and Design of the
Relational Database, OLAP Cubes and Data mining tools, Security
considerations, metadata and backup and recovery plans


56C) star schema uses denormalized dimension tables, but in case of
snowflake schema it uses normalized dimensions to avoid redundancy...


56D) Star schema
A single fact table with N number of Dimension

Snowflake schema

Any dimensions with extended dimensions are know as snowflake schema



56E)
Star Schema                         snowflake schema
-----------                          ----------------
Star schema is normalized            Denormalised.
Easy to use and understand           End users will get confused.
Want little efforts for maintenance Easy to maintain
Fast execution of queries             more time for exec bcas of more joins



56F) star schema uses denormalized dimension tables,but in case of
snowflake schema it uses normalized dimensions to avoid redundancy...


57) What Snow Flake Schema

57A) Snowflake schemas normalize dimensions to eliminate redundancy.
That is, the dimension data has been grouped into multiple tables instead of
one large table. For example, a product dimension table in a star schema
might be normalized into a products table, a product_category table, and a
product_manufacturer table in a snowflake schema. While this saves space,
it increases the number of dimension tables and requires more foreign key
joins. The result is more complex queries and reduced query performance


57B) a normalized form of star schema is called snow flake schema.

57C) the snowflake schema is an extension of the star schema, where each
point of the star explodes into more points. The main advantage of the
snowflake schema is the improvement in query performance due to
minimized disk storage requirements and joining smaller lookup tables. The
main disadvantage of the snowflake schema is the additional maintenance
efforts needed due to the increase number of lookup tables.



57D) some people are considering as Normalized star schema, but it is
partially normalized star schema. By partially normalizing it we may save
some disk space.



57E) Star schema

A single fact table with N number of Dimension

Snowflake schema

Any dimensions with extended dimensions are know as snowflake schema

Multiple Star (galaxy)

If the schema has more than one fact table then the schema is said to be
Multiple star

58) What are modeling tools available in the Market?

58A) Modeling Tool Vendor
    =========== ==========
Erwin Computer Associates
ER/Studio Embarcadero
Power Designer Sybase
Oracle Designer Oracle


58B) these tools are used for Data/dimension modeling

Oracle Designer
Erwin (Entity Relationship for windows)
Informatica (Cubes/Dimensions)
Embarcadero
Power Designer Sybase

59) What are slowly changing dimensions?

59A) Dimensions that change over time are called Slowly Changing
Dimensions. For instance, a product price changes over time; People change
their names for some reason; Country and State names may change over
time. These are a few examples of Slowly Changing Dimensions since some
changes are happening to them over a period of time


59B) if the data in the dimension table happen to change very rarely, then it
is called as slowly changing dimension.

60) What are Data Marts?

60A) Data mart is small subset of the data warehouse. It contains business
division and department level.

60B) a data mart is a focused subset of a data warehouse that deals with a
single area (like different department) of data and is organized for quick
analysis
60C) Data Marts: A subset of data warehouse data used for a specific
business function whose format may be a star schema, hypercube or
statistical sample

60D) Data mart is the sub set of data ware housing and it is analysis the data
one particular department and particular point of view.

60E) Data Mart: a data mart is a small data warehouse. In general, a data
warehouse is divided into small units according the business requirements.
For example, if we take a Data Warehouse of an organization, then it may be
divided into the following individual Data Marts. Data Marts are used to
improve the performance during the retrieval of data.

   e.g.: Data Mart of Sales, Data Mart of Finance, Data Mart of
Marketing, Data Mart of HR etc.
61) What is the difference between Data warehousing and Business
Intelligence?

61A)Data warehousing deals with all aspects of managing the development,
implementation and operation of a data warehouse or data mart including
meta data management, data acquisition, data cleansing, data transformation,
storage management, data distribution, data archiving, operational reporting,
analytical reporting, security management, backup/recovery planning, etc.
Business intelligence, on the other hand, is a set of software tools that enable
an organization to analyze measurable aspects of their business such as sales
performance, profitability, operational efficiency, effectiveness of marketing
campaigns, market penetration among certain customer groups, cost trends,
anomalies and exceptions, etc. Typically, the term “business intelligence” is
used to encompass OLAP, data visualization, data mining and
query/reporting tools. Think of the data warehouse as the back office and
business intelligence as the entire business including the back office. The
business needs the back office on which to function, but the back office
without a business to support, makes no sense.


61B) Data ware house is a relational database and it design analysis and
transformation processing. A Data warehousing is a subject oriented,
integrated, timevarient and nonvolatile collection of the data, A the support
and management of the decision making process.

Business Intelligence is collection of data warehousing, data mart and
knowledge.


62) What is snapshot?

62A) you can disconnect the report from the catalog to which it is attached
by saving the report with a snapshot of the data. However, you must
reconnect to the catalog if you want to refresh the data.

63) Is OLAP databases are called decision support systems?

63A) True
64) What is active data warehousing?

64A) an active data warehouse provides information that enables decision-
makers within an organization to manage customer relationships nimbly,
efficiently and proactively. Active data warehousing is all about integrating
advanced decision support with day-to-day-even minute-to-minute-decision
making in a way that increases quality of those customer touches which
encourages customer loyalty and thus secure an organization's bottom line.
The marketplace is coming of age as we progress from first-generation
"passive" decision-support systems to current- and next-generation "active"
data warehouse implementations


64B) Active Data ware house means

Every user can access the database any time 24/7

That is called Active dwh



64B) Active Transformation means data can change and pass.


65) Why Renormalization is promoted in Universe Design...

65A) in a relational data model, for normalization purposes, some lookup
tables are not merged as a single table. In a dimensional data modeling (star
schema), these tables would be merged as a single table called DIMENSION
table for performance and slicing data. Due to this merging of tables into one
large Dimension table, it comes out of complex intermediate joins.
Dimension tables are directly joined to Fact tables. Though, redundancy of
data occurs in DIMENSION table, size of DIMENSION table is 15% only
when compared to FACT table. So only Denormalization is promoted in
Universe Designing.


65B) in a relational data model, for normalization purposes, some lookup
tables are not merged as a single table. In a dimensional data modeling (star
schema), these tables would be merged as a single table called DIMENSION
table for performance and slicing data. Due to this merging of tables into one
large Dimension table, it comes out of more intermediate joins. Dimension
tables are directly joined to Fact tables. Though, redundancy of data occurs
in DIMENSION table, size of DIMENSION table is 15% only when
compared to FACT table. So only Denormalization is promoted in Universe
Designing.



66) Explain in detail about type 1, type...

66A) Type-1

Most Recent Value

Type-2(full History)

i) Version Number

ii) Flag

iii) Date

Type-3

Current and one previous value

66B) SCD

Type 1: overwrite data is to be there.

Type 2: current, recent and history data should be there.

Type 3: current and recent data should be there.

66C) SCD

Type 1: overwrite data is to be there.
Type 2: current, recent and history data should be there.
Type 3: current and recent data should be there.

66D) SCD means if the data in the dimension is happen to change very
rarely,

Mainly SCD 3 types

1) SCD-1: HERE THE PREVIOUS DATA OVERWRITE BY CURRENT
DATA

MEANS HERE ONLY MAINTAIN CURRENT DATA.

2) SCD-2: HERE JUST ADD THE ADDITIONAL RECORDS

IN SCD2 3 TYPES1) VERSIONING2) FLAGVALUE3) EFFECT DATE
RANGE

VERSIONING: MEANS HERE JUST SEND THE UPDATED RECORDS
TO THE TARGET ALONG WITH VERSION NUMBER.

NEW RECORDS WILL BE SENDING TO THE TARGET ALONG WITH
PRIMARY KEY

FLAGVALUE: HERE UPDATED RECORDS SEND TO THE TARGET
ALONG WITH 0

AND RECENT RECORDS SEND TO THE TARGET ALONG WITH 1

EFFECTIVE DATE RANGE: MEANS HERE TRACKS THE BOTH
PREVIOUS AND CURRENT DATA

3) SCD-3: HERE MAINTAINS JUST PREVIOUS AND CURRENT DATA.

67) What are non-additive facts in detail?

67A) a fact may be measure, metric or a dollar value. Measure and metric
are non additive facts.
Dollar value is additive fact. If we want to find out the amount for a
particular place for a particular period of time, we can add the dollar
amounts and come up with the total amount.

A non additive fact, for e.g. measure height(s) for 'citizens by geographical
location' , when we rollup 'city' data to 'state' level data we should not add
heights of the citizens rather we may want to use it to derive 'count'

67B) Types of Facts
There are three types of facts:

Additive: Additive facts are facts that can be summed up through all of the
dimensions in the fact table.

Semi-Additive: Semi-additive facts are facts that can be summed up for
some of the dimensions in the fact table, but not the others.
Non-Additive: Non-additive facts are facts that cannot be summed up for
any of the dimensions present in the fact table.

67C) Factless Fact - same as non additive facts ... it can be counted but
cannot be measured directly...

67D) Non-Additive: Non-additive facts are facts that cannot be summed up
for any of the dimensions present in the fact table.

				
DOCUMENT INFO
Shared By:
Categories:
Stats:
views:246
posted:4/13/2011
language:English
pages:54