
not for distribution IBM


									Robert Mckeown
11 June 2012

TASP Workshop - Day 1
Introduction / Architecture

• Intro / Background / Why TASP
• Solution / Architecture
• Select examples


• Identify and predict faults and performance degradations in the physical and logical infrastructure
• Moving from reactive to proactive management
• Reduce cost of threshold maintenance
• Algorithms developed in partnership with IBM Research under ‘Joint Program’
• Re-use of Streams & DataStage infrastructure
• GA in ‘Summer 2012’ – First release.

What is Predictive Analytics?
Predictive Analytics enable IT organizations to move from reactive to proactive management of services, reducing outages and improving business performance.

“Analytics leverage data in a particular functional process (or application) to enable context-specific insight that is actionable.” - Gartner

Move the sensing and alerting, and eventually actions to earlier and earlier

• Advance warning of service impact, deterioration or outage
• Realistic service baselines
• Avoidance of expensive and time-consuming false alerts
• Detection of service impacts that are not identified by fixed thresholds alone
• Swifter diagnosis of certain events and patterns
• Identification of the underlying root cause to implement fixes

IDC study: Predictive analytics initiatives show an average ROI of 145%, in comparison to 89% for non-predictive analytics*

* Source: “Predictive Analytics and ROI: Lessons from IDC’s Financial Impact Study” paper, Henry D. Morris

[Bar chart: ROI of 145% for Predictive vs. 89% for Non-Predictive]

Operations challenge: Balancing the need to manage more with improved service levels & lower costs

“...multi-dimensional relationships between dynamic infrastructure and changing business services are too complex for IT staff to continue reacting to event storms and constantly tweaking static monitoring thresholds.” *
* Source: PNA, 2008

“40% of unplanned downtime due to operator error” **
** Source: Gartner, March 2009

Challenge: Can you create and maintain increasing numbers of thresholds and situations in a constantly changing IT environment? How can you minimize the number of alerts that operators must handle?

Service Assurance Analytics approach: Intelligent or ‘Predictive Events’ that result from speeding the ability to detect abnormal trends before end users and mission critical applications are impacted

        Maturation of Monitoring and Analytics

• Receive trouble tickets. Fault! React to user report
• Gather alerts. React to events, before user report
• Gather & correlate alerts. React to events, w/ improved RCA
• Add performance monitor. React to performance changes

• Static Thresholds: Set reasonable thresholds, create alerts when violated
• Linear Forecasting: Linear prediction, one metric
• Dynamic Thresholds: Earlier detection based on history
• Nonlinear Forecasting: More accurate prediction of single metric
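
The difference between a static threshold and a dynamic, history-based threshold can be sketched in a few lines. This is only an illustration, not how TASP computes its baselines: the function names, the 12-sample history, the 3-sigma band and the CPU figures are all assumptions made for the example.

```python
from statistics import mean, stdev

def static_alerts(values, limit):
    """Indices where the metric exceeds a fixed, hand-set limit."""
    return [i for i, v in enumerate(values) if v > limit]

def dynamic_alerts(values, history=12, k=3.0):
    """Indices where the metric leaves a band of mean +/- k * stdev
    computed from the preceding `history` samples (assumed parameters)."""
    alerts = []
    for i in range(history, len(values)):
        window = values[i - history:i]
        mu, sigma = mean(window), stdev(window)
        # Guard against a perfectly flat window (sigma == 0).
        if abs(values[i] - mu) > k * max(sigma, 1e-9):
            alerts.append(i)
    return alerts

# Hypothetical CPU utilisation samples: steady around 40%, then a jump to 60%.
cpu = [40, 41, 39, 40, 42, 40, 39, 41, 40, 40, 41, 39, 60, 61, 62]

static_hits = static_alerts(cpu, limit=80)   # the fixed 80% limit is never crossed
dynamic_hits = dynamic_alerts(cpu)           # the jump away from the learned band is flagged
```

The jump to 60% stays well under the fixed limit, so the static check stays silent, while the history-based band flags the first sample of the jump.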

Monitoring and Analytics Approaches

Where we’re going
• Low latency analysis of performance and wellness data in motion across physical and virtual infrastructures, from across the service delivery stack (from servers to hypervisors to networks to applications)
• Advanced univariate and multivariate predictive analysis
    • With behavioral learning algorithms that can learn normal behavior during operation, and react to changes

Tivoli Analytics for Service Performance
• Multi-domain, agnostic analytics, which leverage existing monitoring systems, analyzing metrics from physical and virtual environments
• Uses analytics to learn normal operational behaviour across the
• Learns mathematical relationships between metrics across the physical and virtual elements of the service delivery stack, from network elements, to hypervisors and applications, to end user experience
• Detects problems before they become business impacting.
• Sends anomaly events to management consoles.

[Diagram: inputs from Server Monitoring, VM Performance, Network Performance, Middleware Monitoring, Application Monitoring, Customer Experience; outputs to Console and BSM]

Tivoli Analytics Life Cycle

[Diagram: closed loop analysis]
• Import Patterns (e.g. Topologies, or Rules)
• Score the Metrics
• Alert & Resolve
• Discover Relevant
• Collect Metrics and Multi Domain/Vendor

At the centre of the loop:
• Improve customer experience
• Prevent business impacting events
• Increased operational efficiency

Service Assurance Analytics
• Leverages IBM Information assets to field a state of the art
• Highly scalable and resilient streaming analytics engine
• Powerful analytics algorithms, combining multiple approaches, designed to leverage the analytics engine for extensive scalability
• Highly flexible and scalable data mediation layer providing turn key integrations and easily extendable
• SNMP and Netcool/OMNIbus native predictive alerts

[Diagram labels: 3rd Party Event Consoles; SNMP; TIP Visualization; Analytic Algorithm Application; Streaming Analytic Engine; Mediation]

So far, trials have included data from: Compuware Vantage, HP Mercury BAC, HP Sitescope, Quest Foglight, Wily
Under the Covers : Market Leading Mediation

Streaming analytics engine
Continuous Ingestion / Continuous Complex Analysis in

• Processes millions of events per second
• Used in finance, manufacturing, law enforcement
TASP Architecture – more detail

[Architecture diagram. Labels: Analytic Mgmt; Analytic UI; System Mgmt; Event Destination (e.g. OMNIbus); Configuration.Instance and Metric Data; Notification; SysMgmt; Config Data; Specific Analytics; Install; Analytic / Resource Config; Learned Models; Analytics; Resource Inventory; Mediation Data Source; Analytic Registration; Common Logic (stages); Mediation Jobs; Normalized Interface; Data Source-Specific]
Granger Analytics
• Problem addressed
    • Technical Problem: Given a large number of time series data, identify significant causal relationships among them and jointly build a predictive/causal model for those time series, and make use of this model to perform outlier detection, prediction and anomaly diagnosis
• Approach based on “Granger causal modelling methodology”
    • “Granger causality” is an operational notion of causality proposed by the Nobel prize winning economist Clive Granger, allowing one to reduce the notion of causation to a statistical test involving “delayed correlations”
    • The methodology analyzes a large number of time series data as a whole, and uncovers the significant causal relationships that exist between the time series variables in the data
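
The "statistical test involving delayed correlations" can be illustrated with a pairwise test. The sketch below is a textbook simplification, not the algorithm shipped in TASP: it checks whether adding lagged values of one series reduces the prediction error for another, and summarises that as an F statistic. All names (`ols_rss`, `granger_f`, `requests`, `latency`) and the synthetic data are assumptions made for the example.

```python
import random

def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit, solved via the
    normal equations with Gauss-Jordan elimination (small models only)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    aug = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    beta = [aug[i][k] / aug[i][i] for i in range(k)]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))

def granger_f(cause, effect, lag=1):
    """F statistic for 'cause helps predict effect' using `lag` past samples."""
    idx = range(lag, len(effect))
    targets = [effect[t] for t in idx]
    # Restricted model: effect predicted from its own past only.
    restricted = [[1.0] + [effect[t - i] for i in range(1, lag + 1)] for t in idx]
    # Full model: additionally uses the past of the candidate cause.
    full = [row + [cause[t - i] for i in range(1, lag + 1)]
            for row, t in zip(restricted, idx)]
    rss_r = ols_rss(restricted, targets)
    rss_f = ols_rss(full, targets)
    dof = len(targets) - (2 * lag + 1)
    return ((rss_r - rss_f) / lag) / (rss_f / dof)

# Synthetic data: latency follows the request rate with a one-step delay.
random.seed(0)
n = 200
requests = [random.gauss(0, 1) for _ in range(n)]
latency = [0.0] + [0.8 * requests[t - 1] + random.gauss(0, 0.1) for t in range(1, n)]

f_forward = granger_f(requests, latency)   # large: past requests explain latency
f_reverse = granger_f(latency, requests)   # small: past latency does not explain requests
```

A large F statistic in one direction and a small one in the other is what suggests a directed relationship between two KPIs. Applying this to thousands of metrics at once requires the joint modelling the slide describes, which this pairwise sketch does not attempt.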

[Diagram: Time Series Data → Granger Causality → Causal/Statistical Model. Example series: Server#1 availability; Application#1 availability; Server 3 No of Trade Processors; Out Packets; Server#2 Memory Free]
    Granger Anomaly Detection
    (Unsupervised) API with Wrapping in Streams
[Flow diagram: Time Series Stream Set1 ... Stream SetN and Configuration → Stream pre-processing steps (data pivoting and formatting operations) → Training Data → Granger Detection Training API → XML Model (with model refreshing) → Window (width = max lag) → Detection API (JavaUDOP) → Anomaly detection Output Stream]
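
The detection side of this flow, a window whose width equals the maximum lag feeding a detection step, can be sketched as follows. This is a minimal stand-in under stated assumptions: `WindowedDetector`, its coefficients, intercept and tolerance are hypothetical and take the place of the trained XML model and the real Detection API, whose interfaces are not shown in the deck.

```python
from collections import deque

class WindowedDetector:
    """Hypothetical detection step: keeps the last `max lag` samples and
    scores each new sample against a prediction made from that window."""

    def __init__(self, coefficients, intercept, tolerance):
        self.coefficients = coefficients      # one weight per lag, most recent lag first
        self.intercept = intercept
        self.tolerance = tolerance
        self.window = deque(maxlen=len(coefficients))   # window width = max lag

    def score(self, value):
        """Return (is_anomaly, predicted) for a new sample, then slide the window."""
        if len(self.window) < self.window.maxlen:
            self.window.append(value)
            return False, None                # not enough history yet
        recent_first = reversed(self.window)
        predicted = self.intercept + sum(
            c * v for c, v in zip(self.coefficients, recent_first))
        is_anomaly = abs(value - predicted) > self.tolerance
        self.window.append(value)
        return is_anomaly, predicted

# Assumed "learned" model: next value is the average of the previous two.
detector = WindowedDetector(coefficients=[0.5, 0.5], intercept=0.0, tolerance=12.0)
stream = [10, 10, 10, 10, 30, 10]
flags = [detector.score(v)[0] for v in stream]
```

In this toy stream only the spike to 30 is flagged. Model refreshing would correspond to replacing the coefficients periodically with a newly trained set.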

     Automated modelling of mathematical relationships in the
     environment, providing insight and reducing time for root
     cause analysis.
     Understanding which KPIs are
     mathematically related allows an
     operator to gain insight into the scope
     of problems. Seeing the performance
     of related KPIs in context allows more
     rapid isolation of problems.

                 Financial Benefit

     Reduced RCA costs + Reduced time to repair

•Number of faults
•Average time to repair today
•Reduced time to repair with TASP
•Labor costs
•Average revenue impact of fault

     Enhancement of manual threshold management with
     automated analytics.
     Increased reliability – no more missed
     problems because a threshold was not
     set or was set at a suboptimal value

                  Financial Benefit
           Reduced labor/management cost
          Reduced number of missed problems

•Number of resources being monitored
•Labor costs
•Number of problems due to management gaps today
•Average revenue loss due to fault

Prediction and early detection of problems allow mitigating action before they affect service.
     Advance warning of problems allows mitigating actions to be taken, avoiding service disruption
     Early action prevents additional related problems from occurring

                  Financial Benefit
               Reduced revenue impact
              Reduced problem cascades
•Number of faults
• Average time to detect today
• Average improvement in detection time with TASP
• Average revenue impact of fault

Example: Memory leak
• Relationship automatically identified between KPIs for memory use in MB, inbound requests and server I/O
• Memory rises while other metrics are stable. Memory leak!
• Alert at red triangle; operator action to recycle machine shortly
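
The idea behind this example, memory drifting away from its learned relationship with request load, can be sketched with a simple linear fit. This is an illustration only: the two-KPI linear model, the function names, the tolerance and all sample values are assumptions, and the product's learned models are not described in this deck.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y as a function of x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def first_sustained_deviation(xs, ys, slope, intercept, tolerance, run=3):
    """Index at which y has exceeded its predicted value by more than
    `tolerance` for `run` consecutive samples, or None if that never happens."""
    streak = 0
    for i, (x, y) in enumerate(zip(xs, ys)):
        streak = streak + 1 if y - (slope * x + intercept) > tolerance else 0
        if streak == run:
            return i
    return None

# Hypothetical training period: memory (MB) tracks inbound requests.
train_requests = [100, 120, 80, 110, 90]
train_memory = [700, 740, 660, 720, 680]
slope, intercept = fit_line(train_requests, train_memory)

# Hypothetical live data: requests stay flat while memory keeps climbing.
live_requests = [100, 105, 95, 100, 102, 98]
live_memory = [700, 712, 705, 730, 752, 760]
alert_index = first_sustained_deviation(
    live_requests, live_memory, slope, intercept, tolerance=10)
```

Requiring several consecutive deviations before alerting is a design choice made here to avoid flagging a single noisy sample.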

Responding to an alert
• Analytics generate alert
• Operator launches detail
• Key Performance indicator has dropped
• Historically anomalous
• Add in other related metrics (based on automatically generated model)
• Requests rose at the same time performance dropped

Example for Building Monitoring

   Visualize Predictive Analytics in a Business Context
BSM helping further drive IT Operations by aligning Correlated Actual and Predictive States into Business Reprioritized actions

IT Operations now managing from the Predictive and Actual states of

[Diagram layers, top to bottom: Visualize Results (TBSM); Add Service and Operations Context (Impact); Insert anomaly events and their associated KPI; Learn normal behaviour and alert on anomalies (TASP). Inputs: Historical Data; Real-time Data; Topology, Configuration, Data. Screenshots: Integrated Analytics Dashboard; Analytics User Interface]

Tivoli's solutions allow you to see anomalous conditions prioritized for business impact and associated with other environmental data, such as faults, configuration changes, maintenance activities, etc.

Service Assurance Analytics
• Identify anomalous KPI behavior without any thresholds.
    • Leverage existing managed data.
• Leverage near real time streaming analytics to identify complex, multi-domain interactions and subtle emerging problems across
• Warn users in advance of service impact, deterioration or outage.
    • Learning algorithms which learn normal behavior, with ability to adapt to changes
• Focus on usefulness of results, not on individual algorithms.
    • Add new algorithms over time, without requiring users to become analytics experts.

Example result from current trials
Industrial Sector customer
• Approximately 15K metrics monitored
• Abnormal behavior identified in mail access
• Did not occur on other servers, but only on the two that were identified as related.

Example result from current trials
Teachers example; one of many around mid September

Financial sector customer
• Approximately 18K metrics
• Multiple warning signs of abnormal behavior on server
• Did not occur on other servers

 Typical POC hardware requirements
For Tivoli Analytics for Service Performance (TASP) to support 30k metrics at 5-15 minute intervals, it requires two xLinux servers, one for each of the two main components, the analytic engine and mediation.

• 1 x (RHEL 5.5, 64-bit, 4 cores, 24 GB RAM, 150 GB disk) for the Mediation
• 1 x (RHEL 5.5, 64-bit, 4 cores, 24 GB RAM, 150 GB disk) for the Analytics
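
For a sense of scale, the stated load of 30k metrics at 5-15 minute intervals can be turned into an average ingest rate. The arithmetic below assumes every metric reports exactly once per interval. The helper names are illustrative and the results are not a sizing guideline.

```python
def ingest_rate(metric_count, interval_minutes):
    """Average samples per second when every metric reports once per interval."""
    return metric_count / (interval_minutes * 60)

def samples_per_day(metric_count, interval_minutes):
    """Total samples collected in 24 hours."""
    return metric_count * (24 * 60 // interval_minutes)

fastest = ingest_rate(30_000, 5)          # 5-minute polling
slowest = ingest_rate(30_000, 15)         # 15-minute polling
daily_max = samples_per_day(30_000, 5)    # worst case over a day
```

At 5-minute polling this works out to 100 samples per second on average, or 8,640,000 samples per day.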

TASP also requires a Windows machine to run the mediation customization client for the duration of the integration phase.

• 1 x (Windows Server 2008 Enterprise, 64-bit, 2.6 GHz, 8 GB RAM)


The Tivoli trial team will conduct the installation of TASP.

Analytics in IBM Tivoli Service Management Today


[Diagram labels: State to; Analytics; Event correlation to Services; KPI calculation & correlation to service definition; Event correlation & enrichment analytics; Service to]
