Tour of Data Warehousing

Document Sample
Tour of Data Warehousing Powered By Docstoc
					Tour of Data Warehousing
                   CIS 609
  Introduction
• The ultimate goal of advanced data analysis
  is to make decisions that lead as directly as
  possible to benefits.
• Data warehouses aim at physically framing
  multiple sources of data (e.g., databases
  and file collections) in an architecture that
  requires the mapping of data from one or
  more operational data sources to a target
  database management system that supports
  the many decision making processes and
  business intelligence (BI) systems of an
  enterprise.
The Value Chain
Data to Decisions
• Historically, BI systems have evolved through
  three main phases.
• Phase 1. Enterprises started to put in place
  structured data stores filing data relevant to
  their needs. Typically a snapshot of data from
  a single operational source (i.e., information
  center) or sometimes directly against the
  operational source. In the 1980s, decision
  support tasks were performed centrally, with
  highly skilled individuals analyzing mainframe-
  resident data. The results were delivered to
  management as hard-copy reports and graphs.
Data to Decisions
• Phase 2. Competitive factors pushed
  enterprises toward a better leverage of
  the data and they adopted augmented
  Decision Support Systems (DSS).
• With data analysis techniques such as
  statistics, creating a focused store of
  data with subject matter from two or
  three operational sources.
• Along with the storage, tools taking
  advantage of the structure have been
  setup for enterprises to question these
  data stores.
Data to Decisions
• Phase 3. Finally, to cope with the
  multiplicity and diversity of the data
  stores, enterprises are starting to unify
  and rationalize these through data
  warehouse frameworks.
• In addition to this effort, and due to an
  increased competitive pressure,
  advanced data analysis techniques are
  also implemented to leverage further -
  and gain business advantages - the
  resulting and ever growing amount of
  data.
Three Phases of Data
Leverage
Data Warehouse Definition
• A data warehouse is a process and architecture
  that requires robust planning to implement a
  platform, which consists of the selection,
  conversion, transformation, consolidation,
  integration, cleansing and mapping of data (i.e.,
  recent and historical) from multiple operational
  data sources to a target DBMS that supports an
  enterprise’s decision-making processes and BI
  systems.
• As an architecture, a data warehouse is more
  than a single product and requires significant
  planning of five essential components:
  operational data sources, data conversion and
  extraction, data warehouse DBMS, data
  warehouse administration, and BI tools.
Data Mining Definition
• Data mining is the process of
  discovering meaningful correlations,
  patterns, and trends by sifting through
  large amounts of data stored in
  repositories.
• Whether or not this discovery or
  exploration is performed by human
  analysts, software agents or machine
  learning techniques, it is important that
  the results provide enterprises with
  insights not available through traditional
  techniques or predefined relationships
  (e.g., relational tables).
Business Intelligence
Definition
• Business Intelligence (BI) is the user-centered
  process of exploring data, data relationships
  and trends - thereby helping to improve overall
  decision making.
• This involves an iterative process of accessing
  data (ideally stored in the data warehouse) and
  analyzing it - thereby deriving insights, drawing
  conclusions and communicating findings - to
  effect change positively within the enterprise.
• BI is an application of a data warehouse, but
  does not necessitate a data warehouse. BI is
  comprised of four major product segments:
  interactive query tools, reporting tools,
  advanced DSSs, and EISs.
Decision Support Systems
(DSS)
• DSSs, as a subset of BI, utilizes
  organized collections of data,
  systems, applications, tools and
  techniques by which enterprises
  gather and interpret relevant
  information about the business and
  turn it into highly quantifiable plans,
  policies, procedures and metrics
  (see Figure 1).
Data Warehousing
•   Data warehousing technology and architectures are
    becoming a mainstream activity

•   However, many enterprises are balking at the high cost of
    implementation and experience difficulty in quantifying
    benefits and the return on investment (ROI).

•   In addition, the vast majority of efforts to build and manage a
    data warehouse still require in-house customization of
    products, thus consuming significant resources.

•   Many enterprises will need to turn to outside help to obtain
    the expertise necessary to perform the architecture
    integration.

•   There are several key trends that indicate that data
    warehousing is a market that is growing and will produce
    maturing technology
Data Warehousing cont’d
•   Increased investment by portable RDBMS vendors is improving support
    for data warehousing and complex DSS applications.

•   Niche tool vendors are moving to provide comprehensive solutions;
    those that do not could be acquired.

•   Increasing consulting business, with systems integrators bringing
    vertical expertise and data models to the table.

•   Specific industries (e.g., retail) will increase granularity and the amount
    of historical data, creating 100 terabyte databases easily.

•   Inclusion of external data (e.g., demographic) that occurs in a more
    flexible and integrated way

•   Query monitors that give administrators some relief in managing the
    data warehouse and anticipate enterprise’s data needs, but many
    problems for administrators in managing the data warehouse
    environment (e.g., metadata redundancy and summary table
    maintenance) continue with tools slow to emerge and mature.
Supporting Multiple BI Applications

  • A data warehouse is more than a single
    product and requires significant planning
    of five essential components:

    –   Operational data source
    –   Data conversion and extraction
    –   Data warehouse DBMS
    –   Data warehouse administration
    –   BI tools
Operational Data Source
• Data administration must take an
  active role to help plan extracts and
  to work with the business
  administrator to define the data
  needs.

• The data administrator can help
  gather the information about the
  operational data and assist in
  designing the data model that will be
  used for the data warehouse DBMS.
Data Conversion and Extraction
• A data warehouse architecture should include
  extracts of operational data that are “frozen views of
  information” trapped in time capsules, which in
  some cases have some level of summarization and
  history associated with the view of information.

• The extracts are created either by handcrafted
  programs that take time and expense to maintain, or
  through tools that help automate the generation of
  the extract applications or processes.

• These applications should provide the capabilities to
  perform the complex task of integrating data from
  multiple sources to create a consolidated view of the
  data, as well as the transformation of data for use
  by BI applications.
Data Warehouse DBMS
• The RDBMS vendors (e.g., Oracle, IBM, and
  Microsoft) have significantly increased the amount of
  research and development to improve support for data
  warehousing and complex DSS applications.

• This investment is geared toward providing strong
  support of complex database schemas with databases
  approaching several hundred terabytes.

• Some of the returns (e.g., improved parallel
  techniques, bitmap indexing and improved query
  algorithms) from this investment are beginning to
  emerge out of the labs and into the products.
Data Warehouse Administration
• Data warehousing brings many complex administration
  issues that are much different from handling
  transactional applications

• With the increasing number of subject areas and the
  increasing amount of historical data (i.e., more than
  five years) typically maintained, a data warehouse
  requires significant amounts of disk storage, memory
  and processing power.

• These new administration requirements need
  extensive planning to provide data usage auditing,
  business data model, directory management,
  chargeback, summary tables, security, request control,
  query catalog, subscription services, and managing of
  operational data extracts.
Business Intelligence (BI)
• BI empowers enterprises with systems that facilitate the
  access and analysis of data contained in the data
  warehouse.

• A data warehouse in combination with the right BI tools
  can be an important part of supporting a business
  mission.

• The selection of BI tools needs to be done after an
  analysis of the business and enterprise needs have been
  performed.

• It is likely that a single tool will not support the needs of
  all users. Therefore, many times several tools will be
  selected to perform each of the following functions: ad
  hoc queries and report writing, OLAP, advanced DSS
  and EIS.
The Data Warehouse Market: Clarifying the
Hype and Confusion
 • With most market size estimates reaching as
   high as $6 billion, it is clear that data
   warehousing has captured the imagination of
   the masses.

 • How these estimates have been drawn, or that
   they cover a laundry list of sundry items, is less
   clear. This is not surprising since the data
   warehouse market is a challenging one to
   quantify accurately.

 • It is difficult to ascertain definitively whether a
   particular product is being used in a structured
   data warehouse environment or merely in a
   conventional DSS context
Virtual Data Warehouse
• The virtual data warehouse (VDW) is a concept
  being touted by some as a more timely and less
  risky approach for implementing DSSs than
  traditional data warehousing.

• VDW is an architecture for accessing data directly
  from data stores, regardless of the data source,
  thus eliminating the need to construct a separate
  database of redundant data.

• This is not a new idea; Information Builders
  incorporated the concept with the Enterprise Data
  Access/SQL (EDA/SQL) product in the early 1990s,
  and it was previously known as universal data
  access.
Virtual Data Warehouse cont’d
• Proponents of VDW tout the speed of deployment
  over the traditional processes required for building a
  data warehouse, by focusing on user data access
  without the heavy analysis and infrastructure
  requirements.

• However, this is misleading because many of the
  planning activities (e.g., identifying source data and
  transformation of the data) are still required with the
  VDW architecture to ensure that the correct data is
  available for user access.

• The identification of source data is a complicated
  and time-consuming process.
VDW Challenges
These challenges are plentiful:

   – Data integration, consolidation and transformation
   – Performance issues of SQL access
   – Impact on performance of operational systems
   – The cost of repeated extracts to obtain the same data
     vs. a single extract to create a database
   – Unification of different data management systems
   – Schema integration and reconciliation of the different
     data models implemented with each database or
     application
   – Distributed query processing (i.e., decompose,
     distribute and determine where the query should be
     processed based on the cost of the query)
Data Marts
• A data mart is a subset of data from the data warehouse,
  designed to support the unique business unit
  requirements of a specific application.
• If the right planning is done to incorporate the data
  mart into the overall data warehouse architecture, data
  marts can be beneficial.
• This may save valuable server resources and give the
  business analysts more freedom to perform heavy
  analysis activities.
• A user’s data does not have to reside physically in
  their departments for them to be empowered, but
  instead could be a set of aggregate tables within the
  data warehouse designed to support a specific set of
  users or application.
Data Mart Challenges

• The need to create multiple data extraction
  and copy management applications could
  consume a great deal of IS resources and
  prove problematic to manage.

• Improved data quality, consolidation and
  sharing are primary goals for creating a
  data warehouse, a strategy that could be
  jeopardized without centralized control for
  the distribution of data to the data marts.
Data Warehouse/Mart Architecture
Data Mart Considerations
• The data mart and data warehouse are not mutually
  exclusive. A data mart should be a component of a data
  warehouse architecture.

• The data mart should not be used as a replacement for
  accessing the data warehouse.

• The cost and benefits of implementing each data mart,
  which makes the data architecture a great deal more
  complex, needs to be understood and compared with
  directly accessing the data warehouse. This is analogous
  to the higher cost of items found at a small convenience
  store, over those at a large grocery store.