

Presentation at: AsiaSTAR 2004, Canberra, Australia, 7 Sep 2004

A Brave New Frontier:

Testing Live Production Applications

Dr Kelvin Ross, Steve Woodyatt, Dr Steven Butler
SMART Testing Technologies Pty Ltd

• Avoiding Production Problems
•   Testing for Service Level Management
•   Case Study
•   Considerations Unique to Production Testing
•   Information for SLM
•   Implementation Choices
•   Wrap-Up
Why Test On Production
• Despite best efforts to test an application prior
  to deployment, problems still frequently occur
  post-deployment
   –   Server offline
   –   No response
   –   Functions not available
   –   Incorrect response
   –   Slow response
   –   Security breach
   –   Data out-of-date
The user experience

• What is it that the user will experience in
  dealing with our application?

• E.g. Airline Reservation business process:
   –   Search for flights

   –   Make a reservation
   –   Pay with credit card
   –   Obtain electronic ticket reservation code
   –   Confirmation by email with matching details
   –   Reservation details reported in frequent flyer account


[Diagram: information flow from the Web Application across ERP, Mainframe, and Email Gateway]


• Avoiding Production Problems
• Testing for Service Level Management
•   Case Study
•   Considerations Unique to Production Testing
•   Information for SLM
•   Implementation Choices
•   Wrap-Up
Service Level Management
• Service Level Management (SLM)
   – “set of people and systems that allows the
     organisation to ensure that SLAs are being met and
     that the necessary resources are being provided”
• Service Level Agreement (SLA)
   – “contracts between service providers and customers
     that define the services provided, the metrics
     associated with these services, acceptable and
     unacceptable service levels, liabilities on the part
     of the service provider and the customer, and
     actions to be taken in specific circumstances”

         Definitions from IEC, “Service Level Management” tutorial, www.iec.org
SLM in the context of testing

Availability
• End-to-end, not just components
   – No. and duration of outages
   – Total uptime/downtime

Security
• Exposure
   – No. of breaches
   – Vulnerabilities detected
   – Viruses

Accuracy
• Correct results
• Processes followed

Performance
• Responsiveness
   – Response time for web pages
   – Data transfer / throughput
   – MTTR
   – No. of incidents
   – Service degradation
                Passive                        Active
                Listen into transactions      Transactions are
                and analyse logs              synthesised

End User        End-to-End                    SMART Cat
Observes user   Topaz                         Topaz
experience      NetIQ                         Keynote
                …                             Netmechanic

Component       Web Trends                    HP OpenView
Focuses on      …                             IBM Tivoli
servers and                                   CA Unicentre
backend                                       BMC Patrol

The active, end-user quadrant is Business Process
Auditing (BPA)
Business Process Auditing
• Checks functionality and accuracy
• Automated and real-time
• Feeds reporting, alerting, and diagnosis & remedies
  into Service Level Management
Testing and SLM
• Testing can be used to synthesise business transactions
   – Interact with system through various interfaces
   – Collect and report metrics

• Transfer of technology predominantly used in
  pre-deployment testing
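As a sketch of such a synthesised transaction: a minimal probe, assuming a hypothetical search URL, that issues one request and collects the basic pass/fail and response-time metrics described above.

```python
import time
import urllib.request
import urllib.error

def probe(url, timeout=20.0):
    """Issue one synthetic transaction and collect basic SLM metrics."""
    start = time.monotonic()
    ok = False
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
            ok = resp.status == 200 and len(body) > 0
    except (urllib.error.URLError, OSError):
        pass  # unreachable, refused, or timed out: counts as a failure
    return {"url": url,
            "ok": ok,
            "response_secs": round(time.monotonic() - start, 3)}

# Hypothetical endpoint, for illustration only:
# probe("https://example.com/search.jsp?depart=SYD&arrive=MEL")
```

A real probe would also verify content, not just status code; the oracle mechanisms later in the deck cover that.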
Problems Detected
• Problems detected
   – End-To-End processes not available
   – Responses slow
   – Incorrect data
• Problems not detected
   – Issues localised to individual clients
   – Actual response times to all clients
Who Owns Production Testing
•   The testing group?
•   The support group?
•   The operations group?
•   The application owners?
•   Marketing?

• Marriage of skills and technology required for
  production testing
      “We don’t call that testing” syndrome
Which applications
benefit most
• Those with real time dependence for
  completion of vital business processes
   – High risk & dependence
       • Financial
       • Market reputation
   – Probity, Accountability and Liability
   – Potentially unreliable or difficult to manage
     technology dependencies
       • increasingly complex linkages
       • distributed application architectures
       • history of failure, problems

• Risk assessment drives the choice of what to monitor

• Avoiding Production Problems
• Testing for Service Level Management
• Case Study
•   Considerations Unique to Production Testing
•   Information for SLM
•   Implementation Choices
•   Wrap-Up

BPA Planning checklist:
 • What are the critical business processes?
 • Who are the users?
 • What is the user experience?
 • How can success be determined?
 • How can the test be automated?
Airline Reservation
Case Study
• Critical business processes
   –   Search available flights
   –   Make online booking
   –   Change booking
   –   Cancel booking
   –   Etc.
• Users
   – Consumers
   – Travel agents
   – Call centre
Airline Reservation
Case Study
• What is the user experience
   – Search for flights
      • Available
          – Function accessible
          – Response returned
      • Correct
          – Correct flights: source and destination, time, etc.
      • Complete
          – No missing flights with available seats
      • Responsive
          – With tolerable response times
• How can success be determined
   – What is the source of truth
Airline Reservation
Case Study
• Choose what to monitor based on risk
   – Previous operational reliability problems, complex
     dynamic behaviour

• What was previously tested and will continue
  to function
   – Are there problems with distributed components
     continuing to run appropriately, e.g. tuxedo services,
     LDAP authentication, payment gateway not responding
   – Are there problems with timely propagation/retrieval
     of data, e.g. flight data not retrieved consistently,
     bookings not updated in timely manner
Test Frameworks

• Outcomes have to be reported at business
  level, not application object level
   – Object level – Too Low Level for Audience
      getURL search.jsp
      saveForm, submitflight
      setParam, submitflight, startime, 200412011100
      submitForm, submitflight
   – Business level – Appropriate for Audience
      searchFlight, return, 20041201110000, SYD, …

• “Action Word” approaches recommended
   – See Carl Nagle or Hans Buwalda’s work
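A minimal sketch of the action-word idea, with illustrative names: reports show the business-level line, while the object-level steps stay hidden inside the action implementation.

```python
def search_flight(trip_type, depart_time, depart, arrive):
    """Business-level action word; object-level steps stay out of reports."""
    object_level_steps = [
        ("getURL", "search.jsp"),
        ("setParam", "submitflight", "starttime", depart_time),
        ("setParam", "submitflight", "depart", depart),
        ("setParam", "submitflight", "arrive", arrive),
        ("submitForm", "submitflight"),
    ]
    for _step in object_level_steps:
        pass  # drive the GUI/API here; omitted in this sketch
    # The line that appears in the business-level report:
    return f"searchFlight, {trip_type}, {depart_time}, {depart}, {arrive}"

ACTIONS = {"searchFlight": search_flight}

def run(action, *args):
    """Dispatch an action word from a test table to its implementation."""
    return ACTIONS[action](*args)

print(run("searchFlight", "return", "200412011100", "SYD", "MEL"))
# → searchFlight, return, 200412011100, SYD, MEL
```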
Dynamic behaviour
• searchFlight, return, 200412011100, SYD, …
   – Won’t remain useful for long as production data
     changes
• Dynamic input data
   Type = return
   DepartTime = today()@10am + 1 month
   ReturnTime = today()@10am + 1 month + 5 days
   Depart = Sydney
   Arrive = Melbourne

• May even want to randomise data
   – Vary depart and arrive on successive runs
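The dynamic input data above can be sketched like this (the city list, and 30 days as an approximation of the slide's "+ 1 month", are illustrative):

```python
import random
from datetime import datetime, timedelta

CITIES = ["Sydney", "Melbourne", "Brisbane"]  # illustrative route pool

def build_search_input(now=None):
    """Compute search inputs relative to 'today' so the script stays valid."""
    now = now or datetime.now()
    # "today @ 10am + 1 month" from the slide, approximated as 30 days
    depart_time = (now + timedelta(days=30)).replace(
        hour=10, minute=0, second=0, microsecond=0)
    # Randomise the route on successive runs to widen coverage
    depart = random.choice(CITIES)
    arrive = random.choice([c for c in CITIES if c != depart])
    return {
        "Type": "return",
        "DepartTime": depart_time,
        "ReturnTime": depart_time + timedelta(days=5),
        "Depart": depart,
        "Arrive": arrive,
    }
```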
The Test Oracle

• Mechanisms for determining correct response
   – Get any response
   – Get a response containing predefined expected values
   – Expected values are checked using an oracle
      • E.g. formula determining whether valid date
   – Results are compared to reference data
      • 3rd party data feed
      • Trusted internal source, e.g. Mainframe
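A sketch of two of the mechanisms listed above, from a formula-style oracle to reference-data comparison; the reference table here is an illustrative stand-in for a trusted source such as a mainframe API.

```python
from datetime import datetime

def is_valid_date(value):
    """Formula-style oracle: is the response field a plausible YYYYMMDDHHMM date?"""
    try:
        datetime.strptime(value, "%Y%m%d%H%M")
        return True
    except ValueError:
        return False

# Stand-in for a trusted internal source or 3rd-party data feed
REFERENCE = {("SYD", "MEL", "200412011100"): {"flight": "XX123", "price": 189.00}}

def matches_reference(depart, arrive, when, response):
    """Reference-data oracle: compare the response against the trusted source."""
    expected = REFERENCE.get((depart, arrive, when))
    return expected is not None and response == expected
```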
3rd Party Reference Data

• Trending against price data

[Chart: observed prices (×) plotted against the price trend]
Airline Reservation
Case Study
• Verification failures for searchFlight response
Condition                 Code   Notify                     Test Oracle Required
No response received      FAIL   Ops support immediately    -
Response time
>= 8 secs, < 20 secs      WARN   App support if sustained   -
                                 more than 15 minutes
>= 20 secs                FAIL   Ops and App support        -
Gateway connection        FAIL   Ops support immediately    -
error page
Unexpected content        FAIL   App support immediately    -
Flight data isn’t for     FAIL   App support immediately    Confirm flights correct
intended routes and                                         – flight code lookup
dates                                                       table, dates consistent
No flights found          FAIL   App support immediately    -
Flight availability and   FAIL   App support immediately    Confirm against flight
pricing incorrect                                           availability and pricing
                                                            in Reservations
                                                            Mainframe using API
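The verification table above can be sketched as a simple rule chain; field names on the response are illustrative.

```python
def classify(response):
    """Classify one searchFlight response into (code, notify).

    response: dict with keys 'received', 'secs', 'content_ok', 'flights'.
    """
    if not response.get("received"):
        return ("FAIL", "Ops support immediately")
    if response["secs"] >= 20:
        return ("FAIL", "Ops and App support")
    if not response.get("content_ok"):
        return ("FAIL", "App support immediately")  # unexpected content
    if not response.get("flights"):
        return ("FAIL", "App support immediately")  # no flights found
    if response["secs"] >= 8:
        return ("WARN", "App support if sustained more than 15 minutes")
    return ("PASS", None)
```

The oracle-backed rows of the table (route/date consistency, availability and pricing) would hook in as further checks before the final PASS.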

• Avoiding Production Problems
• Testing for Service Level Management
• Case Study
• Considerations Unique to
  Production Testing
• Information for SLM
• Implementation Choices
• Wrap-Up
Scheduling the test

• How often
   – 1 minute, 5 minutes, hourly, daily, weekly
   – Depends on how quickly support can respond
• What business hours
   – 24x7, 9 to 5, higher frequency at certain events
• What about scheduled outages
   – Planned outages, public holidays
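The frequency, business-hours, and planned-outage decisions above can be sketched as one gate function; all values here are illustrative.

```python
from datetime import datetime, time, date

OUTAGES = {date(2004, 12, 25)}  # planned outages / public holidays

def should_run(now, last_run, every_secs=300,
               start=time(9, 0), end=time(17, 0)):
    """Decide whether the probe should fire at 'now'."""
    if now.date() in OUTAGES:
        return False                      # skip scheduled outages
    if not (start <= now.time() <= end):
        return False                      # outside business hours
    # frequency should match how quickly support can respond
    return (now - last_run).total_seconds() >= every_secs
```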
• Coordinating tests
   – Locking to prevent simultaneous tests
   – E.g. don’t check prices or submit orders unless
     logged in
   – Semaphores
Sensitive Data
• Frequently there may be sensitive information stored in
  scripts and test logs
    – Logins and passwords
    – Credit card ids
    – Personal details, e.g. phone numbers, ABNs, etc

• Where possible, avoid storing it
    – Use dummy accounts
    – Don’t log sensitive information
       • Can be difficult to control, e.g. failure may save a screen
         shot that then displays credentials
•   Use encryption
    – Sensitive data is stored encrypted, but the test engine
      still requires the key to send it
    – At least it is obfuscated
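A sketch of that compromise: credentials kept obfuscated rather than in clear text. This XOR-plus-base64 scheme is NOT real encryption; the test engine still holds the key, and genuine protection needs a vetted crypto library and key management. Key and functions are illustrative.

```python
import base64

KEY = b"illustrative-key"  # in practice: injected at runtime, never in the script

def obfuscate(secret):
    """Store a credential in obfuscated form (not cryptographically secure)."""
    xored = bytes(b ^ KEY[i % len(KEY)] for i, b in enumerate(secret.encode()))
    return base64.b64encode(xored).decode()

def reveal(blob):
    """Recover the credential when the test engine needs to send it."""
    raw = base64.b64decode(blob)
    return bytes(b ^ KEY[i % len(KEY)] for i, b in enumerate(raw)).decode()
```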
Where tests should be
run from
• Many tools allow tests to be run from multiple locations
    – Simulate users of different geographies
    – Different connection speeds to report on a variety of user
      experiences
• Inside/outside firewall
    – Probably the largest concern
    – Consumer users outside, Corporate users inside
    – To provide end-to-end scenarios, may need combination
       • Scenario initiated internally, and end results are
          propagated to external, or vice-versa
       • External view of web may be verified using Test Oracle
          data that is internal
    – Agents may be deployed internal and external to run tests
Problems to Avoid

Need to be aware of impact of testing:
• Performance hits
• Volatile features
• Intrusive tests
• Biased results
• Compliance restrictions
• Impact on Business KPIs

• Taking measurements may distort the system
  being measured
Minimising the Effect
of Transactions
• Cost of Transaction
   – Financial – purchase flight may incur credit card
     merchant fee
   – Resource – seats unavailable until refund provided,
     searching places additional load on resource pool
• Reversing the transaction
   – Providing a refund, merchant fee may still apply
• What if the transaction is incomplete
   – What happens if the refund process doesn’t complete
• Compliance issues
   – Corporate
   – Legislative
Managing the Test
• Modifications to the application under test to
  cleanup data or control test effects
   – Manual fallback may be convenient option
• Test Objects
   – Dummy frequent flyer accounts
   – Dummy cost centres
• Testing the tests
   – Access to test environment pre-deployment
   – Endurance test that can be part of application test
       • Transfer of load, stress and endurance test

•   Avoiding Production Problems
•   Testing for Service Level Management
•   Case Study
•   Considerations Unique to Production Testing
• Information for SLM
• Implementation Choices
• Wrap-Up
Effective Reporting

• Who are the users of the reports, different
  expectations on presentation/content
   –   Business/Application Manager
   –   Operations
   –   Development
   –   Support
   –   SLM
• How do they access reports?
   – Web, email, Thick client
   – Which reports are real-time or batched
   – Is data summarised, or is original data accessible
 Historic Reporting
 •    Service level reports
 •    Trends
 •    Progress
 •    Post Mortem Analysis

Count = 525
Pass = 513 (97.71%)
Fail = 12 (2.29%)

Min = 4.339 sec
Avg = 8.253 sec
Max = 87.708 sec
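Summary figures like those above can be computed from the stored probe outcomes; the tuple layout is illustrative.

```python
def summarise(results):
    """Summarise stored outcomes: list of (passed: bool, secs: float)."""
    count = len(results)
    passed = sum(1 for ok, _ in results if ok)
    times = [secs for _, secs in results]
    return {
        "Count": count,
        "Pass": passed,
        "PassPct": round(100.0 * passed / count, 2),
        "Min": min(times),
        "Avg": round(sum(times) / count, 3),
        "Max": max(times),
    }
```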
Realtime Reporting
• Alerts
• Current status
• Diagnosis
Diagnosing root cause
and remedies
• Accessing fault and failure data for multiple components
   – Pinpoint failures
• Correlation is a skill
   – manual, expert analysis required
   – Variety of support:
      • Saved actual results
          – Unattended collection for debugging
      • Correlation with component performance analysis
• Automated correlation with component failure
   – Sophisticated “expert system”
   – Rules that correlate tested events to arrive at
     diagnosis of root cause(s)
Fault Analysis (fault tree):

Can’t connect to OT agents
├─ Can’t connect to OT
│    – Can’t connect to internet
│    – Can’t resolve IP of OT correctly
│    – Can’t connect to OT gateway
├─ Can’t connect to OT Test Agent
│    – Test agent server failed
│    – Requests can’t pass via OT firewall to Abbot
│    – Test Agent not processing connections
│    – SSH port forward to Test agents has failed
└─ Can’t connect to OT RefData
     – RefData agent servers failed
     – Requests can’t pass via OT firewall to RefData
     – RefData Agent not accepting connections
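The "expert system" idea can be sketched as rules that correlate observed events into a candidate root cause; event names and rules here are illustrative, loosely following the fault tree above.

```python
RULES = [
    # (required events, excluded events, diagnosis)
    ({"cant_connect_test_agent", "cant_connect_refdata"}, set(),
     "Firewall or gateway between OT and agents is down"),
    ({"cant_connect_test_agent"}, {"cant_connect_refdata"},
     "Test agent server failed or not accepting connections"),
]

def diagnose(events):
    """Return the first diagnosis whose rule matches the observed events."""
    for required, excluded, diagnosis in RULES:
        if required <= events and not (excluded & events):
            return diagnosis
    return "No rule matched: manual, expert analysis required"
```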

•   Avoiding Production Problems
•   Testing for Service Level Management
•   Case Study
•   Considerations Unique to Production Testing
•   Information for SLM
• Implementation Choices
• Wrap-Up
Tool Requirements
• Evaluation Checklist
    –   Test script can interact with a variety of systems
          • GUI, Terminal, APIs, HTTP, SOAP, POP/SMTP, etc.
    –   Test script can respond to dynamic behaviour
    –   Agents can be deployed internal/external of the WAN
    –   Ability to control frequency
    –   Time based functions can be used to control execution
    –   Functions available for data manipulation for dynamic responses (time,
        extraction, etc.)
    –   Inter-process coordination between tests using locking/semaphores
    –   Test steps can be reported as business process steps; object actions can
        be hidden in reports
    –   Test outcomes saved to repository for later analysis
    –   Ability to export data for other purposes, e.g. trending, visualisation, etc.
    –   Reporting capability on stored data
    –   Online ability to drill into test data for problem diagnosis
    –   Alerting mechanisms to email, SMS, online dashboards
    –   Alerting can be controlled, i.e. escalation, filtering
• Apply weighting to each criterion according to need
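The alerting-control items in the checklist (escalation, filtering) can be sketched as a small filter: duplicate FAILs are suppressed, and a WARN sustained over repeated runs escalates to a FAIL, echoing the "sustained more than 15 minutes" rule earlier. Threshold and shapes are illustrative.

```python
def filter_alerts(alerts, sustain_limit=3):
    """alerts: chronological list of (code, condition).

    Emits deduplicated notifications, escalating a WARN repeated
    sustain_limit times in a row to a FAIL.
    """
    out, warn_streak, last = [], 0, None
    for code, condition in alerts:
        warn_streak = warn_streak + 1 if code == "WARN" else 0
        if warn_streak == sustain_limit:
            out.append(("FAIL", condition + " (sustained)"))
        elif code == "FAIL" and (code, condition) != last:
            out.append((code, condition))  # new failure: notify once
        last = (code, condition)
    return out
```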
• Available Commercial Tools/Services
   –   SmartTestTech - SMARTCat
   –   Mercury – Topaz
   –   Compuware – Vantage
   –   Keynote
   –   To a lesser extent, enterprise monitoring tools:
        • BMC Patrol, Tivoli, HP Openview
• Home Brew Tools
   – Extensive support for testing protocols in open source
       • E.g. Java/Junit, .Net/Nunit, Perl/Ruby/Python
• Extend Existing In-house Regression Test Suites
   – Automated scripts may be adapted
      • Robot, QARun, WinRunner, Silk
   – Post results to Database
   – Provide reporting capability
      • e.g. Crystal Reports, Cognos, etc
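A sketch of that home-brew path: post each outcome to a small SQLite store that a reporting tool can query later. Schema and helper names are illustrative.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) the results store."""
    con = sqlite3.connect(path)
    con.execute("""CREATE TABLE IF NOT EXISTS results
                   (ts TEXT, action TEXT, code TEXT, secs REAL)""")
    return con

def post_result(con, ts, action, code, secs):
    """Record one test outcome for later analysis and reporting."""
    con.execute("INSERT INTO results VALUES (?, ?, ?, ?)",
                (ts, action, code, secs))
    con.commit()

def pass_rate(con, action):
    """Example report query: percentage of PASS outcomes for an action."""
    total, passed = con.execute(
        "SELECT COUNT(*), SUM(code = 'PASS') FROM results WHERE action = ?",
        (action,)).fetchone()
    return 100.0 * passed / total if total else None
```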

•   Avoiding Production Problems
•   Testing for Service Level Management
•   Case Study
•   Considerations Unique to Production Testing
•   Information for SLM
•   Implementation Choices
• Wrap-Up

• Strong business case
   – Benefit in bringing testing to the production world
   – A small percentage increase in availability translates to large $ savings
   – Manages reputational risk with user base
   – Large investment in SLM
   – SLAs very ad-hoc and not measured
   – Uses tests to provide SLM reports to Business /
     Application Managers
   – Leveraging the investment in test resources
   – Protects overall investment
Questions & Answers

Contact details:
   Dr Kelvin Ross
   SMART Testing Technologies Pty Ltd
   PO Box 131, West Burleigh, Q4219
   Ph: +61 7 5522 5131
   Email: kelvinr@smarttesttech.com
