Data Warehouse Technology: One of the most important assets of any organization is its information. This asset is almost always kept in two forms: the operational systems of record and the data warehouse. Crudely speaking, the operational systems are where the data is put in, and the data warehouse is where we get the data out. A data warehouse is a repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site [s1]. Once gathered, the data are stored for a long time, permitting access to historical data. Data warehouses thus provide the user a single consolidated interface to data, making decision-support queries easier to write. Moreover, by accessing information for decision support from a data warehouse, the decision maker ensures that online transaction-processing systems are not affected by the decision-support workload.

Goals of a Data Warehouse: Nowadays a data warehouse has become an essential part of the information system of a business organization. A data warehouse has some particular goals to meet the business requirements of a corporate body. They are described below:
1. The data warehouse must make an organization's information easily accessible. The contents of the data warehouse must be understandable. The data must be intuitive and obvious to the business user, not merely the developer.
2. The data warehouse must present the organization's information consistently. The data in the warehouse must be credible. Data must be carefully assembled from a variety of sources around the organization, cleansed, quality assured, and released only when it is fit for user consumption.
3. The data warehouse must be adaptive and resilient to change. User needs, business conditions, data, and technology are all subject to the shifting sands of time. The data warehouse must be designed to handle this inevitable change.
Changes to the data warehouse should be graceful, meaning that they do not invalidate existing data or applications.
4. The data warehouse must be a secure bastion that protects information assets.
5. The data warehouse must serve as the foundation for improved decision making. It must contain the right data to support decision making.
6. The data warehouse must provide adequate facilities for reporting as well as analysis.

Components of a Data Warehouse: A data warehouse environment needs some specific components, each of which serves a specific function. There are four separate and distinct components to be considered:
1. Operational Source Systems.
2. Data Staging Area.
3. Data Presentation Area.
4. Data Access Tools.

1. Operational Source Systems: Operational source systems are developed to capture and process original business transactions. These systems are designed for data entry, not for reporting, but it is from here that the data warehouse gets populated.

Figure: Basic components of the data warehouse

The main priorities of a source system are processing performance and availability. In a source-driven architecture for gathering data, the data sources transmit new information, either continually or periodically. In a destination-driven architecture, the data warehouse periodically sends requests for new data to the sources.

2. Data Staging Area: The data staging area of the data warehouse is both a storage area and a set of processes commonly referred to as extract-transformation-load (ETL). The data staging area is everything between the operational source systems and the data presentation area. It is typically not accessible to users, only to skilled professionals. In this area, raw operational data is transformed into a warehouse deliverable fit for user query and consumption.
Data staging is a major process that includes the following sub-procedures:

#Extraction: Extraction is the first step in the process of getting data into the data warehouse environment. Extracting means reading and understanding the source data and copying the data needed for the data warehouse into the staging area for further manipulation.

#Transformation: Once the data is extracted into the data staging area, there are many transformation steps, including:
1. Cleansing the data by correcting misspellings, resolving domain conflicts, dealing with missing elements, or parsing into standard formats.
2. Purging selected fields from the legacy data that are not useful for the data warehouse.
3. Combining data sources by matching exactly on key values or by performing fuzzy matches on non-key attributes.
4. Creating surrogate keys for each dimension record in order to avoid dependency on legacy-defined keys, where the surrogate key generation process enforces referential integrity between the dimension tables and fact tables.
5. Building aggregates for boosting the performance of common queries.

#Loading & indexing: At the end of the transformation process, the data is in the form of load record images. Loading in the data warehouse environment usually takes the form of replicating the dimension tables and fact tables and presenting these tables to the bulk-loading facilities of each recipient data mart. Bulk loading is a very important capability, to be contrasted with record-at-a-time loading, which is far slower. The target data mart must then index the newly arrived data for query performance.

3. Data Presentation Area: The data presentation area is where data is organized, stored, and made available for direct querying by users, report writers, and other analytical applications. It is all the business community sees and touches via data access tools. The data presentation area is mainly made up of a series of integrated data marts.
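As a rough illustration, the transformation steps above (cleansing, purging, surrogate keys, aggregates) can be sketched in Python. The record layout, field names and data below are hypothetical, not taken from any particular system:

```python
# A minimal ETL transformation sketch. All field names and data are
# hypothetical illustrations of the steps listed above.

raw_records = [
    {"cust_name": " alice ", "city": "Dhaka", "legacy_id": "A-7", "fax": "n/a", "amount": "120"},
    {"cust_name": "BOB",     "city": "dhaka", "legacy_id": "B-2", "fax": "n/a", "amount": "80"},
    {"cust_name": "alice",   "city": "Dhaka", "legacy_id": "A-7", "fax": "n/a", "amount": "50"},
]

def transform(records):
    surrogate_keys = {}  # legacy key -> warehouse surrogate key
    cleansed = []
    for rec in records:
        # 1. Cleansing: trim whitespace, standardize case.
        name = rec["cust_name"].strip().title()
        city = rec["city"].strip().title()
        # 2. Purging: the legacy "fax" field is simply not copied forward.
        # 4. Surrogate keys: avoid dependency on legacy-defined keys.
        key = surrogate_keys.setdefault(rec["legacy_id"], len(surrogate_keys) + 1)
        cleansed.append({"cust_key": key, "name": name, "city": city,
                         "amount": float(rec["amount"])})
    return cleansed

def aggregate(records):
    # 5. Build an aggregate (total amount per customer) to speed common queries.
    totals = {}
    for rec in records:
        totals[rec["cust_key"]] = totals.get(rec["cust_key"], 0.0) + rec["amount"]
    return totals

rows = transform(raw_records)
print(aggregate(rows))  # {1: 170.0, 2: 80.0}
```

In a real staging area these steps would run against extracted source files or tables rather than in-memory dictionaries, but the sequence of operations is the same.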
A data mart presents the data from a single business process. These business processes cross the boundaries of organizational functions. Data marts contain detailed, atomic data; atomic data is required to withstand assaults from unpredictable ad hoc user queries. The data marts may also contain performance-enhancing summary data, or aggregates. All the data marts must be built using common dimensions and facts. Dimensional data marts are organized by subject area, such as finance, sales and marketing, and coordinated by data category, such as customer, product and location. These flexible information stores allow data structures to respond to business changes such as product line additions, new staff responsibilities, mergers, consolidations and acquisitions.

4. Data Access Tools: The final major component of the data warehouse environment is the data access tools. All data access tools query the data in the data warehouse's presentation area. A data access tool can be as simple as an ad hoc query tool or as complex as a sophisticated data mining or modeling application. Ad hoc query tools, as powerful as they are, can be understood and used effectively by only a small percentage of the potential data warehouse business user population. The majority of the business user base will likely access the data via prebuilt, parameter-driven analytic applications.

Warehouse Schemas: Data warehouses typically have schemas that are designed for data analysis, using tools such as OLAP tools. Thus the data are usually multidimensional, with dimension attributes and measure attributes. For this reason, two types of tables are used: fact tables and dimension tables.

Fact Table: Tables containing multidimensional data are called fact tables and are usually very large. These are used to record the actual facts and measures of a business. Facts are numeric data items that are of interest to the business. Example (telecommunications): length of call in minutes, average number of calls.
Dimension Table: To minimize storage requirements, dimension attributes are usually short identifiers that are foreign keys into other tables called dimension tables. Dimension tables establish the context of the facts; they store fields that describe the facts. Example (telecommunications): call origin, call destination. Using both fact tables and dimension tables, a data warehouse schema is built. There are two types of data warehouse schema: the star schema and the snowflake schema.

Star Schema: The star schema (sometimes referred to as a star join schema) is the simplest style of data warehouse schema. It consists of a few fact tables (possibly only one, justifying the name) referencing any number of dimension tables. The star schema is considered an important special case of the snowflake schema. The main characteristics of a star schema are:
- One fact table.
- De-normalized dimension tables, with one column per level/attribute.
- Simple and easy overview, hence ease of use.
- Relatively flexible.
- The fact table is normalized.
- Dimension tables are often relatively small.
- Recognized by many RDBMSs; good performance.
- Hierarchies are hidden in the columns.

Figure: The Star Schema

Snowflake Schema: A snowflake schema is a logical arrangement of tables in a relational database such that the entity-relationship diagram resembles a snowflake in shape. Closely related to the star schema, the snowflake schema is represented by centralized fact tables which are connected to multiple dimensions. In the snowflake schema, however, dimensions are normalized into multiple related tables, whereas the star schema's dimensions are de-normalized, with each dimension represented by a single table. When the dimensions of a snowflake schema are elaborate, having multiple levels of relationships, and where child tables have multiple parent tables, a complex snowflake shape starts to emerge.
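A minimal star schema of the kind described above can be sketched in SQLite. The table and column names (fact_calls, dim_customer, dim_time) are our own hypothetical choices, reusing the telecom measures and dimensions from the examples:

```python
import sqlite3

# Hypothetical star schema: one fact table referencing denormalized dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    name TEXT, city TEXT, region TEXT     -- denormalized: hierarchy kept in columns
);
CREATE TABLE dim_time (
    time_key INTEGER PRIMARY KEY,
    day TEXT, month TEXT, year INTEGER
);
CREATE TABLE fact_calls (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    time_key     INTEGER REFERENCES dim_time(time_key),
    call_minutes REAL                     -- numeric measure
);
""")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Alice', 'Dhaka', 'Central')")
conn.execute("INSERT INTO dim_time VALUES (1, '2024-01-01', 'Jan', 2024)")
conn.execute("INSERT INTO fact_calls VALUES (1, 1, 12.5)")

# A typical star-join query: aggregate a measure by a dimension attribute.
row = conn.execute("""
    SELECT c.city, SUM(f.call_minutes)
    FROM fact_calls f JOIN dim_customer c ON f.customer_key = c.customer_key
    GROUP BY c.city
""").fetchone()
print(row)  # ('Dhaka', 12.5)
```

In a snowflake variant, the region column would instead move into a separate table referenced from dim_customer, adding one more join to the query above.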
The "snowflaking" effect affects only the dimension tables, not the fact tables. The main characteristics of a snowflake schema are:
- Dimensions are normalized, with one dimension table per level.
- Each dimension table has an integer key, a level name and one column per attribute.
- Hierarchies are made explicit/visible.
- Very flexible.
- Dimension tables use less space.
- Harder to use, due to the many joins.
- Worse performance.

Figure: The Snowflake Schema

Both types of schema have advantages and disadvantages. A data warehouse designer should choose the schema wisely, because it has a big effect on the later stages of the design.

OLAP (Online Analytical Processing): Online analytical processing, or OLAP, is an approach to quickly answering multidimensional analytical queries. OLAP is part of the broader category of business intelligence, which also encompasses relational reporting and data mining. OLAP tools support interactive analysis of summary information. Databases configured for OLAP use a multidimensional data model, allowing complex analytical and ad hoc queries with rapid execution time. At the core of any OLAP system is the concept of an OLAP cube (also called a multidimensional cube or a hypercube). It consists of numeric facts, called measures, which are categorized by dimensions. The cube metadata is typically created from a star schema or snowflake schema of tables in a relational database: measures are derived from the records in the fact table, and dimensions are derived from the dimension tables. OLAP systems have traditionally been categorized using the following taxonomy.

MOLAP (Multidimensional): MOLAP is the "classic" form of OLAP and is sometimes referred to as just OLAP. MOLAP stores the data in an optimized multidimensional array storage, rather than in a relational database. It therefore requires the pre-computation and storage of information in the cube, an operation known as processing.

ROLAP (Relational): ROLAP works directly with relational databases.
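The cube idea above (numeric measures categorized by dimensions and aggregated at different granularities) can be sketched in plain Python. The fact rows and dimension names are hypothetical; a real MOLAP engine would pre-compute such aggregates during processing rather than scanning rows per query:

```python
from collections import defaultdict

# Hypothetical fact rows: (product, region, month) dimensions, sales measure.
facts = [
    ("phone",  "north", "Jan", 100),
    ("phone",  "south", "Jan",  80),
    ("tablet", "north", "Feb",  60),
]

def roll_up(facts, dims):
    """Aggregate the sales measure along any chosen subset of dimensions."""
    totals = defaultdict(int)
    for product, region, month, sales in facts:
        # Keep only the dimension values the caller asked to group by.
        key = tuple(v for v, d in zip((product, region, month),
                                      ("product", "region", "month")) if d in dims)
        totals[key] += sales
    return dict(totals)

print(roll_up(facts, {"product"}))            # {('phone',): 180, ('tablet',): 60}
print(roll_up(facts, {"product", "region"}))  # finer granularity
print(roll_up(facts, set()))                  # grand total: {(): 240}
```

Each call corresponds to viewing the cube at a different level of aggregation; dropping a dimension from `dims` is the classic roll-up operation.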
The base data and the dimension tables are stored as relational tables, and new tables are created to hold the aggregated information. ROLAP depends on a specialized schema design.

HOLAP (Hybrid): There is no clear agreement across the industry as to what constitutes "hybrid OLAP", except that such a database divides data between relational and specialized storage. For example, for some vendors a HOLAP database will use relational tables to hold the larger quantities of detailed data, and specialized storage for at least some aspects of the smaller quantities of more aggregated or less detailed data.

Data Warehouse Development Methodologies: Building a data warehouse is a very challenging task because, compared to software engineering, it is quite a young discipline and does not yet offer well-established strategies and techniques for the development process. Current data warehouse development methods fall into three basic groups: data-driven, goal-driven and user-driven. They are described below.

Data-Driven Methodologies: Bill Inmon, the founder of data warehousing, argues that data warehouse environments are data-driven, in comparison to classical systems, which have a requirement-driven development life cycle. He states that requirements are the last thing to be considered in the decision-support development life cycle; they are understood only after the data warehouse has been populated with data and the results of queries have been analyzed by users. The data warehouse development strategy is based on the analysis of the corporate data model and relevant transactions. The approach ignores the needs of data warehouse users a priori; company goals and user requirements are not reflected at all. User needs are integrated in the second cycle. Golfarelli, Maio and Rizzi propose a semi-automated methodology to build a dimensional data warehouse model from the pre-existing E/R schemas that represent operational databases.
The data-driven development methodology is recommended for data mining and data exploration purposes. This bottom-up approach exploits the database and is suited to Tayloristic measurement. The data-driven development methodology is particularly suited to production workflows. These workflows generate high business value, have a high degree of repetition, are customer-focused and often time-critical, and therefore require tight and close monitoring.

Goal-Driven Methodologies: Böhnlein and Ulbrich-vom Ende present an approach based on the SOM (Semantic Object Model) process modeling technique for deriving the initial data warehouse structure. The first stage of the derivation process determines the goals and services the company provides to its customers. The business process is then analyzed by applying the SOM interaction schema, which highlights the customers and their transactions with the process under study. In a third step, sequences of transactions are transformed into sequences of existing dependencies that refer to information systems. The last step identifies measures and dimensions: one has to find enforcing (information request) transactions for measures and derive dimensions from existing dependencies. In our opinion, this highly complex approach works well only when business processes are designed throughout the company and are combined with business goals. Kimball proposes a four-step approach in which he starts by choosing a business process, declares the grain of the process, and then chooses the dimensions and facts. He defines a business process as a major operational process in the organization that is supported by some kind of legacy system (or systems). The goal-driven development methodology supports modern management methods and is a foundation for decision support at all organizational levels. The level of granularity is much higher compared to that of the data-driven approach.
While a Process Warehouse based on the goal-driven development methodology measures only the cycle time of business processes and has only one cube, a Process Warehouse based on the data-driven development methodology measures the duration of all process and activity states as well as the workload of departments and roles. End users are rarely involved; they are only required when operational detail matters. As the model is aligned with the corporate strategy, it is very stable. Measures and dimensions are balanced: financial, non-financial, qualitative and quantitative aspects are all considered. Many data sources are integrated, because a holistic approach is based on all aspects of an organization.

User-Driven Methodologies: Westerman describes an approach that was developed at Wal-Mart and has its main focus on implementing business strategy. The methodology assumes that the company goal is the same for everyone, and that the entire company will therefore pursue the same direction. It is proposed to set up a first prototype based on the needs of the business. Business people define goals and gather, prioritize and define the business questions supporting these goals. Afterwards the business questions are prioritized, and the most important of them are defined in terms of data elements, including the definition of hierarchies. Although the Wal-Mart approach focuses on business needs, business goals that are defined by the organization are not taken into consideration at all. Poe proposes a catalogue for conducting user interviews in order to collect end-user requirements. She recommends interviewing different user groups in order to get a complete understanding of the business. The questions cover a very broad field and also include topics such as job responsibilities.
Basically, a monopolization of this user-driven development methodology is risky and must be avoided, as it generates performance information that reflects the organizational level of the people involved. As a consequence, the selected measures, dimensions, the level of granularity and the targeted level of the organizational hierarchy are very unstable. The methodology has a bottom-up tendency, because most employees do not see the organization from a broad angle; theirs is a narrow, egocentric point of view. The project may be long-winded and very costly, as project participants request long discussions of many unnecessary measures and dimensions. This development methodology may well raise the acceptance of a system, but it must be combined with the data-driven or goal-driven development methodology in order to improve the longevity of the system. The more a system suffers rejection, the more user involvement is required, besides a focus on organizational strategies or the corporate data model.

Related Work in the Telecom Industry: Beginning in the late 1990s, data integration moved from the lab into the commercial arena. Today this industry is known as Enterprise Information Integration (EII). The vision underlying this industry is to provide tools for integrating data from multiple sources without having to first load all the data into a central warehouse, as required by previous solutions. A collection of short articles by some of the players in this industry appears in [Halevy et al., 2005]. A data integration scenario starts with identifying the data sources that will participate in the application, then building a mediated schema (often called a virtual schema) which is queried by users or applications, and building semantic mappings from the data sources to the mediated schema. Data integration products are offered by most major DBMS vendors and also play a significant role in business analytics products (e.g., Actuate and Hyperoll).
Personal Information Management [10, 11, 12] is also an application where data integration is playing a significant role. SCORE is an information integration approach that focuses on integrating structured and unstructured data. Information integration solutions typically require the application to formulate the SQL logic to retrieve the needed structured data on one hand, and to identify a set of keywords to retrieve the related unstructured data on the other. SCORE proposes a novel approach wherein the application specifies its information needs using only a SQL query on the structured data, and this query is automatically "translated" into a set of keywords that can be used to retrieve the relevant unstructured data. Korea Telecom's Call Data Analysis Team built SIMS (Strategic Information Management System), a prototype data warehouse system for telecommunications pricing strategy. Telecom Italia has its own data warehouse project, IBDA, which now consists of 52 databases. LGR Telecommunications, a specialized solutions provider to the global telecommunications industry, offers a business solution that taps directly into the source of each customer interaction with the network by accessing the call data record (CDR) it creates. LGR's approach captures CDR data in real time, appends additional business information to the record, stores the data within a comprehensive Oracle data warehouse solution, and provides real-time analysis to the telecom service provider. Other telecom companies using Oracle's data warehouse include Telefonica Germany, Turkcell Telecommunications, Mobiltel (Bulgaria), Anhui Telecom Company Ltd (China) and others. In Bangladesh, Grameenphone engaged Oracle Certified Advantage Partner IBCS-PRIMAX Software to develop a data warehouse based on Oracle Database 10g with a built-in business intelligence solution, the first such solution developed in Bangladesh.
So although data integration has come a long way, and individual data warehouses for individual telecommunication companies have existed for a long time, there is a real dearth of industry-wide integrated data warehouses suited to all the telecommunication companies in a country. Our project is unique in that sense, and a prototype in Bangladesh.

References:
- Alon Y. Halevy, Naveen Ashish, Dina Bitton, Michael J. Carey, Denise Draper, Jeff Pollock, Arnon Rosenthal, and Vishal Sikka. Enterprise information integration: successes, challenges and controversies. In SIGMOD Conference, pages 778-787, 2005.
- Xin Dong and Alon Halevy. A Platform for Personal Information Management and Integration. In Proc. of CIDR, 2005.
- Jim Gemmell, Gordon Bell, Roger Lueder, Steven Drucker, and Curtis Wong. MyLifeBits: Fulfilling the Memex vision. In ACM Multimedia, 2002.
- Dennis Quan, David Huynh, and David R. Karger. Haystack: A platform for authoring end user semantic web applications. In ISWC, 2003.
- Alon Halevy (Google Inc.), Anand Rajaraman (Kosmix Corp.), and Joann Ordille (Avaya Labs). Data Integration: The Teenage Years. In VLDB '06, September 12-15, 2006, Seoul, Korea. ACM.
- P. Roy, M. Mohania, B. Bamba, and S. Raman. Towards Automatic Association of Relevant Unstructured Content with Structured Query Results. In CIKM, 2005.
- Seungjae Shin, Gilju Park, Wonjun Lee, and Sunmi Lee (Korea Telecom R&D Group, Telecom Economic Research Lab). Case Study: How to Make Telecom Pricing Strategy Using a Data Warehouse Approach.
- M. Trisolini et al. Data Warehousing and Integration in Telecom Italia, 2000.
- Oracle Customer Snapshot: http://www.oracle.com/customers/snapshots/lgr-telecommunications-db-business-benefits-case-study.pdf
- Oracle Data Warehousing Customers list: http://www.oracle.com/customers/products/data-warehouse.html
- Oracle Customer Snapshot: http://www.oracle.com/customers/snapshots/grameenphone-tech-snapshot.pdf