IBM Finance Forum

An Effective Data Integration Strategy to Drive Innovation on the InfoSphere Platform
Simon Tang
InfoSphere Technical Manager
IBM GCG
Pain Point: Understanding Core Information Assets
  • “I don’t have the information I need” – Business Analyst
  • “This data does not look right” – Business User
  • “We are not leveraging our information” – Architect
  • “How can I see how this is used” – Governance Steward
  • “I’m not sure what the business wants” – Developer
  • “What systems will be impacted from this change” – DBA
Impact of NOT Managing Core Information Assets
  • 83% of data integration projects either overrun or fail
  • Inaccurate or incomplete data is a leading cause of failure in business-intelligence and CRM projects
  • 25% of time is spent clarifying bad data
  • Low data quality costs companies $611 billion annually
  • Undetected defects will cost 10 to 100 times as much to fix upstream
Consequences:
  • Scrap and rework
  • Increased $$$
  • Lack of consumer confidence
  • Lost opportunities
Who Is Looking for Trusted Information?
Target Audience
  • Data/Business Analysts
  • Subject Matter Experts
  • Architects
  • Governance Stewards
Trusted Information
  1. Accurate and Complete
  2. Insightful
  3. Real Time
What are they working on?
  • Information-centric projects:
    • BI & Data Warehousing
    • Master Data Management
    • Application Implementation, Consolidation or Migration
    • Information Architecture
    • Governance Initiatives
What do these roles do today?
  • Manage information manually in disconnected tools, documents, and spreadsheets
What is wrong with what they do today?
  • Time consuming – churn between business & IT
  • Imprecise & error prone – manual processes not thorough enough
  • No collaboration – different roles work in silos
  • Lacks audit trail – no ongoing record
  • Redundancy – duplication of effort & storage
A Flexible Platform for Managing, Integrating, Analyzing and Governing Information

[Diagram: platform overview]
  • Inputs: Transactional & Collaborative Applications, External Information Sources, Streaming Information
  • Capabilities: Manage, Integrate, Analyze, Govern (Quality, Lifecycle, Security & Privacy)
  • Components: Master Data, Data, Content, Data Warehouses, Big Data, Cubes, Streams
  • Output: Business Analytics Applications
Challenges in Data Management

•   Inconsistent islands of information underlying applications
•   Complex, manual & costly copy synchronization
•   Inconsistent and poor quality data
•   Inability to exploit enterprise meta data across tools
•   Touching data multiple times at its source – storing multiple times and updating multiple times
•   Inability to share common business rules across projects, processes and applications
•   Lack of a single, repeatable methodology for consistency across all projects
[Diagram: application silos: CRM, Order Proc, Supply Chain, Procurement]
Convert information into a trusted strategic asset
Only IBM has invested to provide the breadth of capabilities to define and govern your information…
  • Business Vocabulary
  • Data Relationships
  • Data Quality Compliance
  • Data Models and Mapping
  • Business Specification Rules
  • Provenance of information

  • Discover and understand the data across heterogeneous systems
  • Design trusted information structures for business optimization
  • Govern that information over time
Remedy: 10 Proven Strategies

No single path is THE panacea to all corporate data problems - multiple approaches must be employed

Consider where your organization’s most SIGNIFICANT data pain exists – take that approach first
Strategy #1 – Understand Source Systems
  1. Data Analysis: discover the actual characteristics of the data
  2. Business Analysis: verify whether the characteristics of the data conform to established / known business rules
  3. Report on the assessment and variances / exceptions
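
The three steps on this slide (discover the actual characteristics of the data, verify them against known business rules, report the exceptions) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the column names, and the two rules are hypothetical and are not taken from any InfoSphere tool.

```python
from collections import Counter

def profile_column(values):
    """Step 1: discover the actual characteristics of one column."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "most_common": Counter(non_null).most_common(1),
    }

def find_exceptions(rows, rules):
    """Steps 2 and 3: check each value against its rule and collect variances."""
    exceptions = []
    for index, row in enumerate(rows):
        for column, rule in rules.items():
            value = row.get(column)
            if not rule(value):
                exceptions.append((index, column, value))
    return exceptions

# Hypothetical source rows and business rules, for illustration only.
rows = [
    {"customer_id": "0112345", "state": "NH"},
    {"customer_id": "0298765", "state": "MA"},
    {"customer_id": "0312345", "state": ""},
    {"customer_id": None, "state": "NH"},
]
rules = {
    "customer_id": lambda v: isinstance(v, str) and v[:2] in ("01", "02"),
    "state": lambda v: v in ("NH", "MA"),
}

profile = profile_column([row["state"] for row in rows])
exceptions = find_exceptions(rows, rules)
print(profile)
for row_index, column, value in exceptions:
    print(f"row {row_index}: {column}={value!r} violates its rule")
```

The profile describes what the data actually looks like, while the exception list is the report a business analyst would review.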
        Strategy #1 – Understand Source Systems
 • Poor data quality costs U.S. businesses over $600 billion each year
 • Data deteriorates up to 3% every month
 • What is the key to integrating corporate data? – Having the right data before you start

[Bar chart, scale 0 to 100; the bar values are not recoverable. Categories, top to bottom:]
   • Ensuring adequate data quality
   • Understanding source data
   • Creating complex transformations
   • Creating complex mappings
   • Ensuring adequate performance
   • Collecting and maintaining meta data
   • Finding skilled programmers
   • Providing access to meta data
   • Ensuring adequate scalability
   • Integrating 3rd party tools
   • Ensuring adequate reliability
Recommended Best Practices: Automated Data Profiling
Advice: You won’t have the time, $ or energy to profile 100% quickly, so go automated. No coding.
[Diagram: Source 1 and Source 2 are profiled through Column, Table & Primary Key Analysis, and Foreign Key & Duplicate Analysis]
Strategy #2 – Build In Data Quality
NAME                       ADDRESS
IBM                        187 N. Pk. Str. Salem NH 01456
I.B.M. Inc.                187 N. Pk. St. Sarem NH 01456
International Bus. M.      187 No. Park St Salem NH 04156
Int. Bus. Machines         187 Park Ave Salem NH 01456
Inter-Nation Consult.      15 Main St. Andover MA 02341
Int. Bus. Consultants      PO Box 9 Boston MA 02210
I.B. Manufacturing         Park Blvd. Boston MA 04106

  • Same company / person?
  • Same address?
  • Same parts?
  • Same instructions?

PART DESCRIPTION
WING ASSY DRILL 4 HOLE USE 5J868A HEXBOLT ¼ INCH
WING ASSEMBLY, USE 5J868-A HEX BOLT .25” – DRILL FOUR HOLES
USE 4 5J868A BOLTS (HEX .25) – DRILL HOLES FOR EA ON WING ASSEM
RUDER, TAP 6 HOLES, SECURE W/KL 2301 RIVETS (10 CM)

Problems in the free text: lack of standards in synonyms, acronyms, abbreviations; spelling errors; error codes?
Fields buried in the free text: Part, Size, Assembly Instruction
Recommended Best Practices: Data Cleansing
Data Re-Engineering: Standardize, Match, Survive

Original
  Blk 1, 1 St, 05-00
  05-00 Frist St, Block 1
  1 First Str, #05-00
  Block 1, First Str, #05-00
  1, St, #05-00

Standardize
  Building | Street   | Unit
  Blk 1    | First St | 05-00
  Blk 1    | First St | 05-00
  1        | First St | #05-00
  Blk 1    | First St | #05-00
  1        | St       | #05-00

Match and Survive: Final Result
  #05-00, Blk 1, First St
  #05-00, 1, St
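
The standardize, match, and survive stages on this slide can be sketched as follows. Everything specific here is an assumption made to reproduce the slide's five sample addresses: the token lookup table, the unit-number pattern, the match key (building number, street, unit digits), and the "keep the longest value" survivorship rule are hypothetical and do not describe any product's algorithm.

```python
import re

# Hypothetical lookup table; a real cleansing tool ships far larger rule sets.
TOKEN_MAP = {"str": "St", "st": "St", "frist": "First", "first": "First", "1": "First"}
UNIT_RE = re.compile(r"^#?\d{2}-\d{2}$")

def standardize(raw):
    """Parse a free-form address into (building, street, unit)."""
    tokens = [t for t in re.split(r"[,\s]+", raw.strip()) if t]
    unit = next((t for t in tokens if UNIT_RE.match(t)), "")
    rest = [t for t in tokens if not UNIT_RE.match(t)]
    building = ""
    # Building: "Blk N" or "Block N" if present, otherwise a leading bare number.
    for i, token in enumerate(rest):
        if token.lower() in ("blk", "block") and i + 1 < len(rest):
            building = "Blk " + rest[i + 1]
            del rest[i:i + 2]
            break
    else:
        if rest and rest[0].isdigit():
            building = rest.pop(0)
    street = " ".join(TOKEN_MAP.get(t.lower(), t) for t in rest)
    return building, street, unit

def match_key(record):
    """Records agreeing on building number, street, and unit digits are matched."""
    building, street, unit = record
    return re.sub(r"\D", "", building), street, unit.lstrip("#")

def survive(group):
    """Keep the longest (most complete) value seen for each field."""
    return tuple(max(values, key=len) for values in zip(*group))

def cleanse(raw_addresses):
    groups = {}
    for raw in raw_addresses:
        record = standardize(raw)
        groups.setdefault(match_key(record), []).append(record)
    return [survive(group) for group in groups.values()]

ORIGINALS = [
    "Blk 1, 1 St, 05-00",
    "05-00 Frist St, Block 1",
    "1 First Str, #05-00",
    "Block 1, First Str, #05-00",
    "1, St, #05-00",
]
for building, street, unit in cleanse(ORIGINALS):
    print(f"{unit}, {building}, {street}")
```

With these assumptions the five originals collapse into two surviving records, the same two shown as the slide's final result. The record without a street name stays separate because nothing in it identifies the street.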
Strategy #3 – Share Common Meta Data

From Data Model
  Customer: CustomerNumber, Name, Address, Comments
  Definition: The Identifier of customers that are tracked for ordering purposes. Corporate customer identifiers are assigned by the Sales Data Controller according to the corporate data description and naming policy for reference identifiers.

From ETL Tool
  CustomerTbl: CustomerID, Name, Address, Address1, Comments
  Definition: Unique identifier of customers that are tracked for ordering purposes. Values start with 02 for non-Corporate customers and 01 for Corporate customers.

From BI Tool
  CustomerDetails: CustomerNumber, Name, Address, Remarks
  Definition: Customer’s identifier numbers. Values start with 01 for Corporate customers, 02 for non-Corporate customers, 03 for overseas-based Customers.

From Database
  Customer: ID, Name, Address1, Address2, Descr
  Definition: <NULL>

    Which meta data is right?
    Which one is current?
    Which one should be used?
Recommended Best Practices: Create a common repository
[Diagram: an Integrated Meta Data Repository fed by the Modeling tool, the ETL Tool + Processes, the BI tool and BI Repository, COBOL definition files, and other sources’ definition files]
Integrate by gathering in from diverse applications and sources
Create a Common Vocabulary

Technical view
  Database = DB2
  Schema = NAACCT
  Table = DLYTRANS
  Column = TAXVL
  data type = Decimal (14,2)
  Derivation: SUM(TRNTXAMT)

Business view
  Category: Costs
  Term: Tax Expense
  Full Name: Tax to be paid on Gross Income
  “The expense due to taxes …..”
  (John Walsh is responsible for updates. 90% reliable source)
  Status: CURRENT

Achieve a common vocabulary between business & technical users!

[Diagram: InfoSphere DataStage and InfoSphere Business Glossary connected through a Shared Metadata Server & Repository]
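
One way to picture the link between a business term and the technical asset behind it is a small data structure. The sketch below is hypothetical: the class names `TechnicalAsset`, `BusinessTerm`, and `Glossary` are invented for illustration and are not the Business Glossary data model or API. Only the field values come from the slide.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class TechnicalAsset:
    database: str
    schema: str
    table: str
    column: str
    data_type: str
    derivation: str

@dataclass
class BusinessTerm:
    category: str
    term: str
    full_name: str
    steward: str
    status: str
    assets: list = field(default_factory=list)

class Glossary:
    """Hypothetical shared repository: look up in either direction."""

    def __init__(self):
        self._terms = {}

    def add(self, term):
        self._terms[term.term.lower()] = term

    def assets_for(self, term_name):
        """Business user asks: where does this term live physically?"""
        return self._terms[term_name.lower()].assets

    def term_for(self, column):
        """Technical user asks: what does this column mean to the business?"""
        for term in self._terms.values():
            if any(asset.column == column for asset in term.assets):
                return term.term
        return None

tax_column = TechnicalAsset("DB2", "NAACCT", "DLYTRANS", "TAXVL",
                            "Decimal(14,2)", "SUM(TRNTXAMT)")
tax_expense = BusinessTerm("Costs", "Tax Expense", "Tax to be paid on Gross Income",
                           steward="John Walsh", status="CURRENT",
                           assets=[tax_column])
glossary = Glossary()
glossary.add(tax_expense)
print(glossary.term_for("TAXVL"))
print(glossary.assets_for("Tax Expense")[0].table)
```

Keeping both views in one structure is what lets a business analyst and a developer answer each other's questions from the same record.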
Collaborate and Share Feedback

Author Standard Definitions
  GL Organizational Unit
  STEWARD: Controllers Office
  FORMAT: X(7)
  DEFINITION: A seven digit number designating the organizational unit to which this account belongs.

Annotate and Share Feedback
  I’ve noticed that the last two digits of the GL Organizational Unit, which indicate the sub-department, are often blank.
               Extend Business Information
• Categorize Information Assets according to Business Logic
• Map Business Terms to Information Assets
• Find and view relevant details of Information Assets
• View the Stewardship of Information Assets
Where does a Field of Data in this Report Come From?
     •   Import & Browse Full BI Report Metadata
     •   Navigate through report attributes
     •   Visually navigate through data lineage across tools
     •   Combines operational & design viewpoint
Metadata Lineage available from Studio & Viewers




                   IBM Confidential
Access Business Glossary from Cognos Studios




   Strategy #4 – Connect to Any System, Anywhere
  • DB2, Informix, Netezza, ODBC, Oracle, Red Brick, SAS, Sybase, Teradata, etc
  • WebSphere MQ, SeeBeyond, JMS, XML, EJB, Web Services, EXML, XMLS, EDI, SWIFT, etc
  • Adabas, Allbase/SQL, Datacom/DB, DB2/400, DB2/OS390, Essbase, FOCUS, IDMS/SQL, IMS, NonStopSQL, RDB, VSAM, etc
  • Oracle Applications, PeopleSoft, SAP R/3, SAP BW, Siebel
 Recommended Best Practices: Native Connectivity Software

Advice: Go for pre-built connectors with little/no coding

Do you want to worry about what your next application or database to connect to will be?
          Strategy #5 – Abandon Hand-coding

Visual Basic, Java, C++ and UNIX code can be developed cheaply, and it works ……

… but what happens when there is a new source or requirement?
Cheap? Works? Maybe not.
Recommended Best Practices: Graphical ETL Tools




                               Benefits:
                               •   Jobs are easy to
                                   develop, understand,
                                   debug and maintain
                               •   Robust, fully-tested,
                                   best practices approach
                                   to data migration or
                                   extraction
Recommended Best Practices: Graphical ETL Tools




                            Benefits:
                            •   Complex transformations
                                can be made very simple
                                with mere point-and-click
               Workflow Process - Sequences




• Workflow is as important as dataflow.
• Dynamic workflow processes can be defined during
  the workflow itself.
• DataStage can run external processes and perform
  complex evaluations inline.
• Advanced concepts such as looping are supported.
             Physical Machine Utilization
[Charts: Average Process Distribution; Disk Throughput; Free Memory Whisker Box; Percent CPU Utilization]
Strategy #6 – Implement a Highly Scalable Foundation

44x as much Data and Content Over Coming Decade
  • 2009: 800,000 petabytes
  • 2020: 35 zettabytes

Prediction: Your data volume is not going to get smaller
      Strategy #6 – Implement a Highly Scalable Foundation
2 considerations in handling growth:
[Two charts, captioned “You want these” and “Not these”. Left: Processing Time (Hours), 1 to 32, against Number of Processors (1, 8, 16, 24, 32, ...). Right: Processing Throughput (Hundreds of Gigabytes), 1X to 32X, against Number of Processors (1, 8, 16, 24, 32, ...). The plotted curves are not recoverable.]
   Strategy #6 – Implement a Highly Scalable Foundation
Three Elements of a Scalable Infrastructure
  • Scalable Hardware Platform: Hardware vendors have offered scalable parallel computers for more than 5 years.
  • Scalable Database Platform: Database vendors have offered a scalable parallel relational database for more than 5 years.
  • Scalable Data Integration Platform: Data integration vendors are starting to offer “scalable” “parallel” platforms
 Recommended Best Practices: Parallelism
[Diagram: “Make sure you get this” versus “Not this”, contrasting configurations built from SMP systems, each with four CPUs, Shared Memory, and a Shared Disk]
             Recommended Best Practices: Parallelism
Pipeline: Source Data -> TRANSFORM -> ENRICH -> LOAD -> Data Warehouse

Application Execution: Sequential or Parallel
  • Sequential: Uniprocessor
  • 4-Way Parallel: SMP System
  • 64-Way Parallel: MPP, GRID, and Clustered Systems

One application assembly
Auto parallel-enabled and parallel-aware run-time execution
[Chart: Time to Process for Sort, Join, and Scan, compared across Serial, Parallel, and Parallel execution]
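
The idea of one application assembly that runs sequentially or in parallel can be illustrated with a toy pipeline whose degree of parallelism is a run-time setting rather than part of the logic. This is a sketch only: the `transform`, `enrich`, and `run_pipeline` names and the sample rows are hypothetical, and Python threads stand in for the partitioned execution a parallel engine would provide.

```python
from concurrent.futures import ThreadPoolExecutor

def transform(row):
    """Hypothetical transform step: normalize the customer name."""
    return {**row, "name": row["name"].strip().upper()}

def enrich(row):
    """Hypothetical enrich step: derive a region from the state."""
    regions = {"NH": "Northeast", "MA": "Northeast"}
    return {**row, "region": regions.get(row["state"], "Unknown")}

def run_pipeline(rows, degree=1):
    """Run the same transform and enrich logic on `degree` partitions."""
    # Round-robin partitioning; the application logic never sees the degree.
    partitions = [rows[i::degree] for i in range(degree)]

    def process(partition):
        return [enrich(transform(row)) for row in partition]

    with ThreadPoolExecutor(max_workers=degree) as pool:
        results = pool.map(process, partitions)
    return [row for partition in results for row in partition]

rows = [{"id": i, "name": f" customer {i} ", "state": "NH" if i % 2 else "TX"}
        for i in range(8)]

sequential = run_pipeline(rows, degree=1)
four_way = run_pipeline(rows, degree=4)
by_id = lambda row: row["id"]
print(sorted(sequential, key=by_id) == sorted(four_way, key=by_id))
```

Both runs produce the same rows; only the partitioning changes, which is the property the slide calls parallel-aware run-time execution.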
Strategy #7 – Architect for “Right-Time”
                   • In an InformationWeek 2003 survey
                     of 467 business professionals about
                     how often their IT systems provide
                     business managers with timely
                     updates of primary products or
                     services:
                      – 3% no such process
                      – 1% annually
                      – 17% monthly
                      – 13% weekly
                      – 36% daily
                      – 5% hourly
                      – 8% every minute
                   • In that same report:
                      – “Whereas 57% of sites surveyed a
                         year ago said that real-time
                         business information was a key
                         company focus, 70% see it that
                         way today.”
   Recommended Best Practices: Right-Time

Business Event Occurs -> Latency -> Recognition -> Latency -> Response

Latency is defined as the elapsed time between when an event occurs and when an appropriate response or action is made

   Event Occurs            Awareness               Appropriate Response
   campaign initiated      (acceptable latency)    tuning
   customer churns         (acceptable latency)    win-back
   fraud committed         (acceptable latency)    prevention
   website click           (acceptable latency)    offer made
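
The latency definition on this slide is simple arithmetic on three timestamps: when the event occurred, when it was recognized, and when the response was made. The sketch below works one example. The timestamps and the two-second acceptable latency for a website click are made-up values, not figures from the presentation.

```python
from datetime import datetime, timedelta

def latencies(occurred, recognized, responded):
    """Split total latency into its recognition and response parts."""
    return {
        "recognition": recognized - occurred,
        "response": responded - recognized,
        "total": responded - occurred,
    }

def is_right_time(total, acceptable):
    """Right-time means the total latency fits the business's tolerance."""
    return total <= acceptable

# Hypothetical website click followed by an offer.
occurred = datetime(2011, 1, 1, 12, 0, 0)
recognized = occurred + timedelta(milliseconds=400)
responded = recognized + timedelta(milliseconds=900)

click = latencies(occurred, recognized, responded)
print(click["total"].total_seconds())
print(is_right_time(click["total"], acceptable=timedelta(seconds=2)))
```

The acceptable value differs per event type: a campaign being tuned can tolerate far more latency than a fraud attempt being prevented, which is why the slide speaks of right-time rather than real-time.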
   Recommended Best Practices: Right-Time

1. Improving the ability to recognize business events
   (Business Event Occurs -> Latency -> Recognition)
2. Improving the ability to respond to those events
   (Recognition -> Latency -> Response)
                  Log-Based Change Data Capture
[Diagram: source databases (DB2, Oracle, SQL Server, etc) -> Database Logs -> Source Engine -> TCP/IP -> Target Engine -> targets (Database, Web Services, Message Queue, InfoSphere Information Server, Flat files); Monitoring and Configuration spans both engines]

   Key Benefits:
         – Low impact
         – Flexible implementation
         – Heterogeneous platform support
         – Easy to use
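
The general idea behind log-based change data capture is that the target is kept in step by replaying the source's change log instead of re-reading the source tables. The sketch below shows that idea only. The log entry layout (`seq`, `op`, `key`, `row`) and the `apply_changes` function are hypothetical and are not the InfoSphere CDC log format or API.

```python
def apply_changes(target, log_entries, last_applied=0):
    """Replay log entries newer than `last_applied` onto the target, in order.

    Returns the sequence number of the last entry applied, so the next
    run can resume from there without touching the source tables.
    """
    for entry in sorted(log_entries, key=lambda e: e["seq"]):
        if entry["seq"] <= last_applied:
            continue  # already replicated in an earlier run
        if entry["op"] in ("insert", "update"):
            target[entry["key"]] = entry["row"]
        elif entry["op"] == "delete":
            target.pop(entry["key"], None)
        else:
            raise ValueError(f"unknown operation: {entry['op']}")
        last_applied = entry["seq"]
    return last_applied

# Hypothetical change log captured from a source database.
log = [
    {"seq": 1, "op": "insert", "key": 101, "row": {"name": "IBM", "city": "Salem"}},
    {"seq": 2, "op": "insert", "key": 102, "row": {"name": "Acme", "city": "Boston"}},
    {"seq": 3, "op": "update", "key": 101, "row": {"name": "IBM", "city": "Andover"}},
    {"seq": 4, "op": "delete", "key": 102, "row": None},
]

warehouse = {}
checkpoint = apply_changes(warehouse, log[:2])
checkpoint = apply_changes(warehouse, log, last_applied=checkpoint)
print(warehouse, checkpoint)
```

Because only the changes travel, the source system does little extra work, which matches the low-impact benefit listed on the slide.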
                      InfoSphere CDC & InfoSphere DataStage (ETL)
Information Server Change Data Capture
[Diagram: Point Of Sale (Retail) changes are read from the native Oracle DB log by continuous “CDC”, passed to IBM Information Server, and loaded by ETL, including BalOp (ELT), into the EDW (Teradata, DB2, Oracle, SQL Server, Sybase…)]

DataStage Consumption
  • Direct Connect: TCP via DataStage operator
  • Staging Table: Out of the box
  • Message Queue: Out of the box
  • Flat File: DataStage DSX file format
Strategy #8 – Extend Quality and Transformation Capabilities throughout the Enterprise
[Diagram: consumers (Portals; EAI, BPM, EII; Web applications; Dashboards) drawing on sources (Packaged Apps, Data Warehouses, Business Partner Data, Legacy Apps, Master Data Stores)]

1. Hand-coded rules in each project/tool are not re-usable in other projects/tools
2. High costs associated with building & maintaining data access, data quality and transformation rules in each project
Recommended Best Practices: Data Integration
                Services
•   Service-Oriented Architecture (SOA) approach packages data integration logic of SOA-friendly applications as services
•   Services can be invoked as Web Services, EJB, or JMS by any third-party application

[Diagram: SOA approach. A "get customer" service is reached through Web Services; Message Queues, EAI; Java, Application Servers, and sits above Packaged Apps; Data Warehouses; Business Partner Data; Legacy Apps; Master Data Stores]
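
The idea of packaging integration logic once and invoking it as a service can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Information Services Director API: the in-memory `CUSTOMERS` source, the `standardize` rule, and the `get_customer` and `app` names are all hypothetical, and a plain WSGI callable stands in for the Web Service binding.

```python
import json
from urllib.parse import parse_qs

# Hypothetical in-memory source standing in for packaged apps, legacy apps, etc.
CUSTOMERS = {
    "1001": {"name": "  jane DOE ", "country": "us"},
}


def standardize(record):
    """Shared data quality rule, defined once and reused by every caller."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "country": record["country"].strip().upper(),
    }


def get_customer(customer_id):
    """Reusable integration logic: look up a record and apply the shared rule."""
    record = CUSTOMERS.get(customer_id)
    return None if record is None else standardize(record)


def app(environ, start_response):
    """Minimal WSGI wrapper that exposes get_customer over HTTP."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    customer_id = query.get("id", [""])[0]
    customer = get_customer(customer_id)
    status = "200 OK" if customer else "404 Not Found"
    body = json.dumps(customer or {"error": "customer not found"}).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]
```

The callable can be served locally with the standard library's `wsgiref.simple_server.make_server("localhost", 8000, app)`; any portal, dashboard or application that can issue an HTTP request then reuses the same rule instead of hand-coding its own.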
    Strategy #9 – Choose a Proven Deployment
     Methodology designed for Quick Success
•    Many methodologies are available
•    How many, and which, are workable – who knows?
•    Be aware that there are as many risks in the deployment methodology as there are in
     tool usage
      Recommended Best Practices: Iterative
              Deployment Plan
[Diagram: a 12 - 24 week cycle running from Start to End through Establish Business Drivers, Deploy Solution, Evaluate Results and Derive Business Value. Alongside it, an iteration loop with "manage" at its center: plan, investigate, design, prototype, develop, unit test, system test, UAT, regression test, audit, deploy, production, operate, monitor, maintenance, etc.]
       A Blueprint Director
The GPS for your information project
       Palette free form
       “sketching” elements


                                                       Diagram for a blueprint




                                              •Method browser (displaying method content)
                                              •Asset browser (browsing metadata repository)
                                              •Glossary explorer (showing glossary tree view)




•Outline (zoom in/out view)
•Blueprint explorer (shows tree view of the elements in the blueprint)

                                                             Context specific property view
                     Business and IT: Working Together

Business Analyst
§ Collects business terms and business requirements; converts them into business rules in a spec

Business Requirements
§ Business terms

Mapping specification created – critical to collaboration between IT and business

Developer
§ Takes those business rules and mapping spec and turns them into code, such as a DataStage job.

Create DataStage jobs and data flows that reflect business needs.

Successful Data Integration Project
•extract
•transform
•load
         Track business requirements to application
                        deployment
•   Single, centrally managed infrastructure to track requirements to deployment
•   Import Excel mapping spreadsheets
•   Define and link business terms to physical structures
•   Generate DataStage jobs with annotated to-do tasks for developer
•   Generate historical documentation for tracking

[Diagram: Define mapping specification with business rules and terms; Auto-generate DataStage jobs; Flexible reporting and tracking]
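
As a rough illustration of turning a mapping spreadsheet into a job outline with to-do notes for the developer, here is a minimal Python sketch. It does not reproduce FastTrack or DataStage behavior: the CSV column names, the sample rows, and the `load_mapping` and `generate_job_outline` helpers are assumptions made for the example.

```python
import csv
import io

# Hypothetical mapping spreadsheet exported as CSV; all column names are assumptions.
MAPPING_CSV = """business_term,source_column,target_column,rule
Customer Name,CRM.CUST_NM,DW.CUSTOMER.NAME,trim and title-case
Customer Country,CRM.CTRY_CD,DW.CUSTOMER.COUNTRY,
"""


def load_mapping(text):
    """Parse the mapping specification into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def generate_job_outline(rows, job_name):
    """Render a plain-text job outline, adding a TODO wherever a rule is missing."""
    lines = [f"job: {job_name}"]
    for row in rows:
        rule = row["rule"].strip()
        note = f"rule: {rule}" if rule else "TODO: developer to define transformation rule"
        lines.append(
            f"  {row['source_column']} -> {row['target_column']} "
            f"[{row['business_term']}] {note}"
        )
    return "\n".join(lines)


rows = load_mapping(MAPPING_CSV)
print(generate_job_outline(rows, "load_customer"))
```

Each output line keeps the business term next to the physical columns it maps, so the same specification can serve as both the developer's starting point and the tracking record.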
Strategy #10 – Ensure Interoperability of Integration
                  Infrastructures

 The Goal

                         Connected, integrated, seamless




The Reality

                                                Cobbled, piece-meal,
                                                 manual-intensive
     Data Integration Projects require a Collaborative
                          Effort
[Diagram: roles (Business user, Business Analyst, Data Modeler, Data Analyst, Developer) linked by the artifacts they exchange: business terms, business requirements, transformation rules, data model, data flow (extract, transform, load), application]
1   Establish Platform, Import & Enhance Industry Model (Data Architect)
2   Define Business Requirement & Glossary (Business Glossary)
3   Understand Data Relationships (Discovery)
4   Assess, Monitor, Manage Data Quality Rules (Information Analyzer)
5   Map Sources to Target Model (FastTrack)
6   Generate Logic to Load Warehouse (DataStage & QualityStage)
7   Deliver Reports (Cognos)

[Diagram: the tools above share a common Metadata Server; arrows labeled "Populates" and "Links" connect the first steps]

Simplification & Content: reduces project time, risk and cost!

Recommended Best Practices: Integrated Tool Suites

•   Information Services Director: Publish SOA services for information integration and access
•   Business Glossary: Enterprise Data Dictionary
•   QualityStage: Data Quality: Standardize, Correct & Match Data
•   DataStage: Extract, Transform, and Load in Batch or Real-time
•   Federation Server: Virtualize access to disparate information
•   Information Analyzer: Data Source Profiling & Problem Diagnosis
•   Global Name Recognition: Recognize & Classify Multi-cultural names
•   CDC & Replication: Deliver and replicate changed data
•   Metadata Server / Metadata Workbench / FastTrack: Manage and track consistent metadata across information integration tasks and automate generation of data flow logic
•   Parallel Processing
•   Rich Connectivity to Applications, Data, and Content
                      Summary
§ A number of large enterprises have successfully integrated
  their enterprise systems resulting in business results that
  drove revenue and lowered costs
§ These enterprises accomplished this through a set of
  technologies collectively known as Enterprise Data
  Integration
§ There are 10 proven strategies for success in an enterprise
  data integration initiative, but no single path is THE
  panacea for all corporate data problems; multiple approaches
  must be employed
    Convert Data into Trusted Information
                         Test Data Generation

                         Application Consolidation

                         Data De-identification

                         Data Quality

                         Data Integration

                         Data Archival

                         Master Data Management
                         Data Warehousing

InfoSphere Information Server
Your Choice…
                                   Point Products

  ?      +               +           +         +               +        +     ?

Models       Cleansing       ETL         MDM       Warehouse       BI       Mashups




         +               +           +         +               +        +
                             Integrated Platform





				