

Data Warehouse Technology:

One of the most important assets of any organization is its information. This asset is almost
always kept by an organization in two forms: the operational systems of record and the data
warehouse. Crudely speaking, the operational systems are where the data is put in, and the
data warehouse is where we get the data out.


A data warehouse is a repository (or archive) of information gathered from multiple sources,
stored under a unified schema, at a single site. Once gathered, the data are stored for a
long time, permitting access to historical data. Thus data warehouses provide the user with a
single consolidated interface to data, making decision support queries easier to write.
Moreover, by accessing information for decision support from a data warehouse, the
decision maker ensures that online transaction-processing systems are not affected by the
decision-support workload.

Goals of a Data Warehouse:

Nowadays the data warehouse has become an essential part of the information system of a
business organization. A data warehouse has some particular goals to meet the business
requirements of a corporate body. They are described below:

   1. The data warehouse must make an organization's information easily accessible. The
       contents of the data warehouse must be understandable. The data must be intuitive
       and obvious to the business user, not merely the developer.
   2. The data warehouse must present the organization's information consistently. The
       data in the warehouse must be credible. Data must be carefully assembled from a
       variety of sources around the organization, cleansed, quality assured, and released
       only when it is fit for user consumption.
   3. The data warehouse must be adaptive and resilient to change. User needs, business
       conditions, data, and technology are all subject to the shifting sands of time. The data
       warehouse must be designed to handle this inevitable change. Changes to the data
       warehouse should be graceful, meaning that they do not invalidate existing data or
       applications.
   4. The data warehouse must be a secure bastion that protects information assets.
   5. The data warehouse must serve as the foundation for improved decision making. The
       data warehouse must have the right data in it to support decision making.
   6. The data warehouse must provide adequate facilities for reporting as well as analysis.

Components of a Data Warehouse:

A data warehouse environment needs some specific components. Each warehouse component
serves a specific function. There are four separate and distinct components to be considered.
They are

    1. Operational Source System.
    2. Data Staging Area.
    3. Data Presentation Area.
    4. Data Access Tools.


1.Operational Source System:
Operational source systems are developed to capture and process original business
transactions. These systems are designed for data entry, not for reporting, but it is from here
that the data in the data warehouse is populated.




                       Figure: Basic components of the data warehouse


The main priorities of the source system are processing performance and availability. In a
source-driven architecture for gathering data, the data sources transmit new information,
either continually or periodically. In a destination-driven architecture, the data warehouse
periodically sends requests for new data to the sources.
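
For concreteness, the following minimal Python sketch contrasts the two architectures with
an in-memory source and warehouse. All class and method names here are illustrative
assumptions, not part of any particular product.

    class Source:
        """In-memory stand-in for an operational source system."""
        def __init__(self, rows):
            self.rows = rows                       # list of (timestamp, record)

        def fetch_since(self, ts):
            return [r for r in self.rows if r[0] > ts]

        def push_new(self, warehouse, since_ts):   # source-driven: source pushes
            warehouse.receive(self.fetch_since(since_ts))

    class Warehouse:
        def __init__(self):
            self.staged, self.last_ts = [], 0

        def receive(self, rows):
            self.staged.extend(rows)
            if rows:
                self.last_ts = max(t for t, _ in rows)

        def poll(self, source):                    # destination-driven: warehouse pulls
            self.receive(source.fetch_since(self.last_ts))

    source = Source([(1, "call A"), (2, "call B")])
    wh = Warehouse()
    wh.poll(source)            # destination-driven request for new data
    print(wh.staged)           # [(1, 'call A'), (2, 'call B')]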


2.Data Staging Area:


The data staging area of the data warehouse is both a storage area and a set of processes
commonly referred to as extract-transformation-load (ETL) [7]. The data staging area is
everything between the operational source systems and the data presentation area. It is
typically not accessible to users; it is accessible only to skilled professionals. In this area raw
operational data is transformed into a warehouse deliverable fit for user query and
consumption.


Data staging is a major process that includes the following subprocesses:


#Extraction:
Extraction is the first step in the process of getting data into the data warehouse environment.
Extracting means reading and understanding the source data and copying the data needed for
the data warehouse into the staging area for further manipulation.
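
As a concrete illustration, the following small Python sketch extracts only the columns the
warehouse needs from an operational export into an in-memory staging list. The file layout
and column names are invented for the example.

    import csv
    import io

    # Hypothetical operational export; in practice this would be a file or
    # a query against the source system.
    SOURCE_CSV = io.StringIO(
        "call_id,origin,destination,duration_min,billing_note\n"
        "1,Dhaka,Chittagong,12.5,ok\n"
        "2,Dhaka,Sylhet,3.0,ok\n"
    )
    WANTED = ["call_id", "origin", "destination", "duration_min"]

    # Copy only the needed columns into the staging area for manipulation.
    staging_area = [{col: row[col] for col in WANTED}
                    for row in csv.DictReader(SOURCE_CSV)]
    print(staging_area[0])   # {'call_id': '1', 'origin': 'Dhaka', ...}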


#Transformation:
Once the data is extracted into the data staging area, there are many transformation steps,
including the following (a brief sketch of two of these steps appears after the list):
   1. Cleansing the data by correcting misspellings, resolving domain conflicts, dealing
       with missing elements, or parsing into standard formats.
   2. Purging selected fields from the legacy data that are not useful for the data warehouse.
   3. Combining data sources by matching exactly on key values or by performing fuzzy
       matches on non-key attributes.
   4. Creating surrogate keys for each dimension record in order to avoid dependence on
       legacy-defined keys; the surrogate key generation process enforces referential
       integrity between the dimension tables and the fact tables.
   5. Building aggregates to boost the performance of common queries.
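
The sketch below illustrates steps 1 and 4 on invented data: a misspelling is corrected and a
surrogate key is generated independently of the legacy key. It is a toy example, not a
production ETL routine.

    # Invented correction table and rows for the illustration.
    SPELLING_FIXES = {"Dahka": "Dhaka", "Chittagon": "Chittagong"}

    raw_rows = [
        {"legacy_id": "C-9001", "origin": "Dahka", "duration_min": "12.5"},
        {"legacy_id": "C-9002", "origin": "Chittagong", "duration_min": "3.0"},
    ]

    surrogate_keys = {}   # legacy key -> warehouse-generated integer key

    def transform(row):
        # Step 1: cleansing - correct misspellings, parse standard formats.
        origin = SPELLING_FIXES.get(row["origin"], row["origin"])
        duration = float(row["duration_min"])
        # Step 4: surrogate key - a simple sequence, independent of the
        # legacy-defined key.
        key = surrogate_keys.setdefault(row["legacy_id"], len(surrogate_keys) + 1)
        return {"call_key": key, "origin": origin, "duration_min": duration}

    cleaned = [transform(r) for r in raw_rows]
    print(cleaned[0])   # {'call_key': 1, 'origin': 'Dhaka', 'duration_min': 12.5}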
#Loading & indexing:
At the end of the transformation process, the data is in the form of load record images.
Loading in the data warehouse environment usually takes the form of replicating the
dimensional tables and fact tables and presenting these tables to the bulk-loading facilities of
each recipient data mart. Bulk loading is a very important capability, to be contrasted with
record-at-a-time loading, which is far slower. The target data mart must then index the newly
arrived data for query performance.
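
The following sketch shows the shape of the loading step with SQLite standing in for the
target data mart; a real mart would use the database's native bulk loader, but the pattern is
the same. Table and index names are assumptions.

    import sqlite3

    rows = [(1, "Dhaka", 12.5), (2, "Sylhet", 3.0)]   # load record images

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE fact_call "
                "(call_key INTEGER, origin TEXT, duration_min REAL)")

    # Bulk load: one call covering the whole batch, in contrast with a
    # loop of single-row inserts.
    con.executemany("INSERT INTO fact_call VALUES (?, ?, ?)", rows)

    # Index the newly arrived data for query performance.
    con.execute("CREATE INDEX idx_fact_call_origin ON fact_call (origin)")
    con.commit()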


3.Data Presentation Area:
The data presentation area is where data is organized, stored, and made available for direct
querying by users, report writers, and other analytical applications. It is all the business
community sees and touches via data access tools. The data presentation area is mainly made
up of a series of integrated data marts. A data mart presents the data from a single business
process [7]. These business processes cross the boundaries of organizational functions. Data
marts contain detailed, atomic data. Atomic data is required to withstand assaults from
unpredictable ad hoc user queries. The data marts may also contain performance-enhancing
summary data, or aggregates. All the data marts must be built using common dimensions and
facts. Dimensional data marts are organized by subject area, such as finance, sales and
marketing, and coordinated by data category, such as customer, product and location. These
flexible information stores allow the data structures to respond to business changes such as
product-line additions, new staff responsibilities, mergers, consolidations and acquisitions.


4.Data Access Tools:
The final major component of the data warehouse environment is the set of data access tools.
All data access tools query the data in the data warehouse's presentation area. A data access
tool can be as simple as an ad hoc query tool or as complex as a sophisticated data mining or
modeling application [6]. Ad hoc query tools, as powerful as they are, can be understood and
used effectively by only a small percentage of the potential data warehouse business user
population. The majority of the business user base will likely access the data via prebuilt,
parameter-driven analytic applications.


Warehouse Schemas:
Data warehouses typically have schemas that are designed for data analysis, using tools such
as OLAP. Thus, the data are usually multidimensional, with dimension attributes and
measure attributes. For this reason, two types of tables are used: fact tables and dimension
tables.
Fact Table:
Tables containing multidimensional data are called fact tables and are usually very large.
These are used to record the actual facts and measures of the business [7]. Facts are numeric
data items that are of interest to the business. For example, in telecommunications: length of
a call in minutes, average number of calls.


Dimension Table:
To minimize storage requirements, dimension attributes are usually short identifiers that are
foreign keys into other tables called dimension tables. Dimension tables establish the context
of the facts [7]. Dimension tables store fields that describe the facts. For example, in
telecommunications: call origin, call destination.


Using both fact tables and dimension tables, a data warehouse schema is built. There are two
common types of data warehouse schema: the star schema and the snowflake schema.
Star Schema:
The star schema (sometimes referred to as a star join schema) is the simplest style of data
warehouse schema. The star schema consists of a few "fact tables" (possibly only one,
justifying the name) referencing any number of "dimension tables". The star schema is
considered an important special case of the snowflake schema. The main characteristics of a
star schema are given below (a minimal DDL sketch follows the figure):
      One fact table.
      De-normalized dimension tables.
      One column per level/attribute.
      Simple and easy overview => ease of use.
      Relatively flexible.
      Fact table is normalized.
      Dimension tables are often relatively small.
      Recognized by many RDBMSs.
      Good performance.
      Hierarchies are hidden in the columns.
                                  Figure: The Star Schema
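
A minimal star schema for the telecom example, expressed as SQL DDL run from Python,
might look as follows. All table and column names are illustrative assumptions; the point is
one normalized fact table referencing denormalized dimension tables with one column per
hierarchy level.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE dim_location (           -- denormalized: all levels of the
        location_key INTEGER PRIMARY KEY, -- hierarchy in one table
        city    TEXT,
        region  TEXT,
        country TEXT
    );
    CREATE TABLE dim_time (
        time_key INTEGER PRIMARY KEY,
        day   TEXT,
        month TEXT,
        year  INTEGER
    );
    CREATE TABLE fact_call (              -- one fact table holding measures
        origin_key      INTEGER REFERENCES dim_location(location_key),
        destination_key INTEGER REFERENCES dim_location(location_key),
        time_key        INTEGER REFERENCES dim_time(time_key),
        duration_min    REAL              -- numeric measure
    );
    """)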


Snowflake schema:
A snowflake schema is a logical arrangement of tables in a relational database such that the
entity-relationship diagram resembles a snowflake in shape. Closely related to the star
schema, the snowflake schema is represented by centralized fact tables which are connected
to multiple dimensions. In the snowflake schema, however, dimensions are normalized into
multiple related tables, whereas the star schema's dimensions are de-normalized, with each
dimension represented by a single table. When the dimensions of a snowflake schema are
elaborate, having multiple levels of relationships, and child tables have multiple parent
tables, a complex snowflake shape starts to emerge. The "snowflaking" effect affects only
the dimension tables, not the fact tables.
The main characteristics of a snowflake schema are given below (a corresponding DDL
sketch follows the figure):
      Dimensions are normalized.
      One dimension table per level.
      Each dimension table has an integer key, a level name, and one column per attribute.
      Hierarchies are made explicit/visible.
      Very flexible.
      Dimension tables use less space.
      Harder to use due to many joins.
      Worse query performance.




                              Figure: The Snowflake Schema
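
For comparison, here is the same location dimension snowflaked, again as a hedged sketch
with invented names: one table per level, each child holding a foreign key to its parent, while
the fact table is unchanged.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE dim_country (
        country_key INTEGER PRIMARY KEY,
        country TEXT
    );
    CREATE TABLE dim_region (
        region_key  INTEGER PRIMARY KEY,
        region      TEXT,
        country_key INTEGER REFERENCES dim_country(country_key)
    );
    CREATE TABLE dim_city (
        city_key   INTEGER PRIMARY KEY,
        city       TEXT,
        region_key INTEGER REFERENCES dim_region(region_key)
    );
    -- The fact table is unaffected; snowflaking touches only dimensions.
    CREATE TABLE fact_call (
        origin_city_key INTEGER REFERENCES dim_city(city_key),
        duration_min    REAL
    );
    """)

Answering a question at the country level now requires joining through dim_region and
dim_country, which is where the extra joins and worse query performance noted above come
from.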


Both types of schema have advantages and disadvantages. A data warehouse designer should
choose the schema wisely because it has a big effect on the later stages of design.


OLAP (Online Analytical Processing):
Online analytical processing or OLAP is an approach to quickly answer multidimensional
analytical queries. OLAP is part of the broader category of business intelligence, which also
encompasses relational reporting and data mining. OLAP tools support interactive analysis of
summary information. Databases configured for OLAP use a multidimensional data model,
allowing for complex analytical and ad-hoc queries with a rapid execution time.
At the core of any OLAP system is the concept of an OLAP cube (also called a
multidimensional cube or a hypercube). It consists of numeric facts called measures which
are categorized by dimensions. The cube metadata is typically created from a star schema or
snowflake schema of tables in a relational database. Measures are derived from the records in
the fact table and dimensions are derived from the dimension tables.
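
A toy illustration of the cube concept, with invented data: one measure (call duration)
categorized by two dimensions (origin and month), plus the roll-ups an OLAP engine would
precompute.

    from collections import defaultdict

    facts = [                      # rows as they would come from a fact table
        ("Dhaka",  "Jan", 12.5),
        ("Dhaka",  "Feb",  3.0),
        ("Sylhet", "Jan",  7.0),
    ]

    cube = defaultdict(float)
    for origin, month, duration in facts:
        cube[(origin, month)] += duration    # finest-grain cell
        cube[(origin, "ALL")] += duration    # roll up over month
        cube[("ALL", month)] += duration     # roll up over origin
        cube[("ALL", "ALL")] += duration     # grand total

    print(cube[("Dhaka", "ALL")])   # 15.5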


OLAP systems have been traditionally categorized using the following taxonomy.


MOLAP (Multidimensional)
MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP. MOLAP
stores data in an optimized multi-dimensional array storage, rather than in a relational
database. It therefore requires the pre-computation and storage of information in the cube,
an operation known as processing.


ROLAP (Relational)
ROLAP works directly with relational databases. The base data and the dimension tables are
stored as relational tables, and new tables are created to hold the aggregated information, as
in the sketch below. ROLAP depends on a specialized schema design.
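
As a sketch of this idea, the snippet below keeps the base data in an ordinary relational table
and materializes a new table holding the aggregated information, roughly as a ROLAP
engine might. SQLite and all names are stand-ins.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE fact_call (origin TEXT, month TEXT, duration_min REAL)")
    con.executemany("INSERT INTO fact_call VALUES (?, ?, ?)",
                    [("Dhaka", "Jan", 12.5), ("Dhaka", "Jan", 3.0)])

    # New table created to hold the aggregated information.
    con.execute("""
    CREATE TABLE agg_call_by_origin_month AS
    SELECT origin, month, SUM(duration_min) AS total_min, COUNT(*) AS calls
    FROM fact_call GROUP BY origin, month
    """)
    print(con.execute("SELECT * FROM agg_call_by_origin_month").fetchall())
    # [('Dhaka', 'Jan', 15.5, 2)]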


HOLAP (Hybrid)
There is no clear agreement across the industry as to what constitutes "Hybrid OLAP", except
that a database will divide data between relational and specialized storage. For example, for
some vendors, a HOLAP database will use relational tables to hold the larger quantities of
detailed data, and use specialized storage for at least some aspects of the smaller quantities of
more-aggregate or less-detailed data.


Data Warehouse Development Methodologies:
Building a data warehouse is very challenging because, compared to software engineering, it
is quite a young discipline and does not yet offer well-established strategies and techniques
for the development process. Current data warehouse development methods fall into three
basic groups: data-driven, goal-driven and user-driven. They are described below.
Data-Driven Methodologies: Bill Inmon, the founder of data warehousing, argues that data
warehouse environments are data driven, in comparison to classical systems, which have a
requirements-driven development life cycle (see [6]). He states that requirements are the last
thing to be considered in the decision support development life cycle; they are understood
only after the data warehouse has been populated with data and the results of queries have
been analyzed by users. The data warehouse development strategy is based on the analysis of
the corporate data model and relevant transactions. The approach ignores the needs of data
warehouse users a priori; company goals and user requirements are not reflected at all. User
needs are integrated in the second cycle.
Golfarelli, Maio and Rizzi propose a semi-automated methodology to build a dimensional
data warehouse model from the pre-existing E/R schemas that represent operational
databases (see [5]).


The data-driven development methodology is recommended for data mining and data
exploration purposes. The bottom-up approach exploits the database and is suited for
Tayloristic measurement. The data-driven development methodology is particularly suited to
production workflows. These workflows generate high business value, have a high degree of
repetition, are customer focused, and are often time critical, and therefore require tight and
close monitoring.


Goal-Driven Methodologies: Böhnlein and Ulbrich-vom Ende present an approach that is
based on the SOM (Semantic Object Model) process modeling technique in order to derive
the initial data warehouse structure (see [1]). The first stage of the derivation process
determines the goals and services the company provides to its customers. Then the business
process is analyzed by applying the SOM interaction schema, which highlights the customers
and their transactions with the process under study. In a third step, sequences of transactions
are transformed into sequences of existing dependencies that refer to information systems.
The last step identifies measures and dimensions: one has to find enforcing (information
request) transactions for measures and derive dimensions from existing dependencies. In our
opinion this highly complex approach works well only when business processes are designed
throughout the company and are combined with business goals.
Kimball proposes a four-step approach in which he starts by choosing a business process,
defines the grain of the process, and then chooses dimensions and facts (see [7]). He defines
a business process as a major operational process in the organization that is supported by
some kind of legacy system (or systems).


The goal-driven development methodology supports modern management methods and is a
foundation for decision support at all organizational levels. The level of granularity is much
higher compared to that of the data-driven approach. While the Process Warehouse based on
the goal-driven development methodology measures only the cycle time for business
processes and has only one cube, the Process Warehouse based on the data-driven
development methodology measures the duration of all process and activity states as well as
the workload of departments and roles. End-users are rarely involved. They are only required
when operational detail matters. As the model is aligned with the corporate strategy, it is very
stable. Measures and dimensions are balanced: financial, non-financial, qualitative and
quantitative aspects are considered. A lot of data sources are integrated, because a holistic
approach is based on all aspects of an organization.


User-Driven Methodologies: Westerman describes an approach that was developed at Wal-
Mart and has its main focus on implementing business strategy (see [13]). The methodology
assumes that the company goal is the same for everyone and that the entire company will
therefore be pursuing the same direction. It is proposed to set up a first prototype based on
the needs of the business. Business people define goals and gather, prioritize, and define the
business questions supporting these goals. Afterwards the business questions are prioritized
and the most important ones are defined in terms of data elements, including the definition of
hierarchies. Although the Wal-Mart approach focuses on business needs, business goals that
are defined by the organization are not taken into consideration at all.
Poe proposes a catalogue for conducting user interviews in order to collect end-user
requirements (see [11]). She recommends interviewing different user groups in order to get a
complete understanding of the business. The questions cover a very broad field and also
include topics like job responsibilities.


Basically, a monopolization of this user-driven development methodology is risky and must
be avoided, as it generates performance information that reflects the organizational level of
the people involved. Therefore, the selected measures, dimensions, level of granularity and
targeted level of the organizational hierarchy are very unstable. The methodology has a
bottom-up tendency, because most employees do not see the organization from a broad
angle; theirs is a narrow, egocentric point of view. The project duration may be long-winded
and very costly, as project participants request long discussions on a lot of unnecessary
measures and dimensions. This development methodology may well raise acceptance of a
system, but it must be combined with the data-driven or goal-driven development
methodology in order to improve the longevity of the system. The more a system suffers
rejection, the more user involvement is required besides a focus on organizational strategies
or the corporate data model.


Related Work in Telecom Industry:

Beginning in the late 1990s, data integration moved from the lab into the commercial arena.
Today, this industry is known as Enterprise Information Integration (EII). The vision
underlying this industry is to provide tools for integrating data from multiple sources without
having to first load all the data into a central warehouse, as required by previous solutions. A
collection of short articles by some of the players in this industry appears in [9]. A data
integration scenario starts with identifying the data sources that will participate in the
application, then building a mediated schema (often called a virtual schema), which is
queried by users or applications, and building semantic mappings from the data sources to
the mediated schema.

Data integration products are offered by most major DBMS vendors and are also playing a
significant role in business analytics products (e.g., Actuate and Hyperoll). Personal
information management [10, 11, 12] is also an application where data integration is playing
a significant role [13].

SCORE is an information integration approach that focuses on integrating structured and
unstructured data [14]. Information integration solutions typically require the application to
formulate the SQL logic to retrieve the needed structured data on one hand, and to identify a
set of keywords to retrieve the related unstructured data on the other. SCORE proposes a
novel approach wherein the application specifies its information needs using only a SQL
query on the structured data, and this query is automatically "translated" into a set of
keywords that can be used to retrieve relevant unstructured data.
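
The following toy sketch conveys the flavor of such a translation; it is not SCORE's actual
algorithm, merely an invented heuristic that harvests string literals and table names from the
SQL as candidate keywords.

    import re

    sql = "SELECT * FROM complaints WHERE city = 'Dhaka' AND product = 'prepaid'"

    # Harvest string literals and queried table names as candidate keywords
    # for retrieving related unstructured documents.
    keywords = set(re.findall(r"'([^']+)'", sql))
    keywords |= set(re.findall(r"FROM\s+(\w+)", sql, re.IGNORECASE))
    print(sorted(keywords))   # ['Dhaka', 'complaints', 'prepaid']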
Korea Telecom's Call Data Analysis Team built SIMS (Strategic Information Management
System), a prototype data warehouse system for telecommunications pricing strategy [15].

Telecom Italia has its own data warehouse project, IBDA, which now consists of 52
databases [16].

LGR Telecommunications, a specialized solutions provider to the global telecommunications
industry, offers a business solution that taps directly into the source of each customer
interaction with the network by accessing the call data record (CDR) it creates. LGR's
approach intelligently captures CDR data in real time, appends additional business
information to the record, stores the data within a comprehensive Oracle data warehouse
solution, and provides real-time analysis to the telecom service provider [17]. Other telecom
companies using Oracle's data warehouse include Telefonica Germany, Turkcell
Telecommunications, Mobiltel (Bulgaria), Anhui Telecom Company Ltd (China) and others
[18].

In Bangladesh, Grameenphone engaged Oracle Certified Advantage Partner IBCS-PRIMAX
Software to develop a data warehouse based on Oracle Database 10g with a built-in business
intelligence solution, the first such solution developed in Bangladesh [19].

So although data integration has come a long way, and individual data warehouses for
individual telecommunication companies have existed for a long time, there is a real dearth
of industry-wide integrated data warehouses suited to all the telecommunication companies
in a country. Our project is unique in that sense and a prototype in Bangladesh.
[9] Alon Y. Halevy, Naveen Ashish, Dina Bitton, Michael J. Carey, Denise Draper, Jeff
Pollock, Arnon Rosenthal, and Vishal Sikka. Enterprise information integration: successes,
challenges and controversies. In SIGMOD Conference, pages 778–787, 2005.

[10] Xin Dong and Alon Halevy. A platform for personal information management and
integration. In Proc. of CIDR, 2005.

[11] Jim Gemmell, Gordon Bell, Roger Lueder, Steven Drucker, and Curtis Wong.
MyLifeBits: fulfilling the Memex vision. In ACM Multimedia, 2002.

[12] Dennis Quan, David Huynh, and David R. Karger. Haystack: a platform for authoring
end user semantic web applications. In ISWC, 2003.

[13] Alon Halevy (Google Inc.), Anand Rajaraman (Kosmix Corp.), and Joann Ordille
(Avaya Labs). Data integration: the teenage years. In Proc. of VLDB, September 12-15,
2006, Seoul, Korea.

[14] P. Roy, M. Mohania, B. Bamba, and S. Raman. Towards automatic association of
relevant unstructured content with structured query results. In CIKM, 2005.

[15] Seungjae Shin, Gilju Park, Wonjun Lee, and Sunmi Lee (Korea Telecom R&D Group,
Telecom Economic Research Lab). Case study: how to make telecom pricing strategy using
a data warehouse approach.

[16] M. Trisolini et al. Data warehousing and integration in Telecom Italia. 2000.

[17] Oracle Customer Snapshot:
http://www.oracle.com/customers/snapshots/lgr-telecommunications-db-business-benefits-case-study.pdf

[18] Oracle Data Warehousing Customers list:
http://www.oracle.com/customers/products/data-warehouse.html

[19] Oracle Customer Snapshot:
http://www.oracle.com/customers/snapshots/grameenphone-tech-snapshot.pdf

								