Managing Availability in an Enterprise

Agenda
• Welcome and Introductions
• Review of HA Requirements and IT Infrastructure
• Availability Options
• Open Discussion

The Issue of "Downtime"
• Defined:
  – Applications and/or data are not accessible by users for any reason
• Unplanned – 20% of all Downtime
  – Environmental Factors – 20%
  – Operator Error – 40%
  – Application failure – 40%
• Planned – 80% of all Downtime
  – Physical Environment / Back-up / Recovery – 10%
  – HW, Network, OS, Systems Software – 10%
  – Batch Processing – 10%
  – Application and database – 50%

The Analyst's Comment
Gartner research shows that an average of 80 percent of mission-critical application service downtime is directly caused by people or process failures. Organizations must recognize that people, process and infrastructure are all interdependent facets of an HA solution. In fact, the people, process issues comprise at least 80% of the solution.

Did You Know …
• Average company has
  – 4 different Server operating systems
  – 6 operational databases
  – 10GB of data per person
• 65% of IT Managers have begun to integrate with suppliers
• 55% of IT Managers have begun to integrate with Customers
• 45% of all Enterprise Integration Plans have Failed

OS/400 Availability Options
[Diagram: three options shown side by side: Switched Disk, SAN (Shark/EMC²), and Replication, with LAN and HSL connections]

OS/400 Switched Disk (IASP)
[Diagram: Production Server and Secondary Server serving Clients over the LAN, both attached by HSL to a switched disk tower]

OS/400 Switched Disk (IASP)
• System level solution with limited redundancy
  – hardware upgrades
  – Systems maintenance
  – System failure
• Does not resolve:
  – backup window issue
  – software maintenance
  – failure of disk (same as with single system)
  – disaster recovery
  – IASP component failure (power supply, HSL breakage, disk failure)
• iSeries Hosts must be physically close to I/O tower
  – maximum HSL cable length is 50 feet (15 meters)
• Provides Redundancy at system unit level only
• Limited Availability solution for avoiding downtime
  – No automatic switch facility
    or heartbeat to detect system unit failure unless in a clustered environment

OS/400 Switched Disk (IASP)
Object types not supported in an IASP
  *AUTHLR   Authorization Holder
  *AUTL     Authorization List
  *CFGL     Configuration List
  *CNNL     Connection List
  *COSD     Class of Service Description
  *CRG      Cluster Resource Group
  *CSPMAP
  *CSPTBL
  *CTLD     Controller Description
  *DDIR     Distributed File Directory
  *DEVD     Device Description
  *DOC      Document
  *EDTD     Edit Description
  *EXITRG   Exit Registration
  *FLR      Folder
  *IGCSRT   Double-byte character set (DBCS) sort table
  *IGCTBL   Double-byte character set (DBCS) font table
  *IPXD     Internetwork packet exchange description
  *JOBQ     Job Queue
  *JOBSCD   Job Scheduled Entry
  *LIND     Line Description
  *MODD     Mode Description
  *M36      AS/400 Advanced 36 machine
  *M36CFG   AS/400 Advanced 36 machine configuration
  *NTBD     NetBIOS description
  *NWID     Network Description
  *NWSD     Network server description
  *OUTQ     Output Queue
  *PRDAVL   Product availability
  *USRPRF   User profile
  *SOCKET   Socket
  *SSND     Session description
  *S36      System 36 Description

OS/400 SAN Solution (Shark/EMC²)
[Diagram: Production Server and Secondary Server serving Clients over the LAN, each attached to a SAN; the two SANs are linked by PPRC or SRDF]

OS/400 SAN Solution (Shark/EMC²)
• Disk image based solution similar to Switched Disk
  – System unit solution only
  – Disk-level approach for Disaster Recovery
  – Does not address many availability challenges
• Data resiliency solution, not part of OS/400 topology, architecture, or clustering
  – Works with volumes (NT, Unix) while iSeries file system is object level
  – Takes image rather than "net change" objects
• Disaster recovery setup requires 2nd SAN
  – 2nd SAN can't be used for real time processing such as backups, etc.
  – In restricted state and unusable
  – Limitation on synchronous distance to 64 miles (103 km)
• Recovery process identical to single system outage recovery process … IPL and manual intervention
  – Primary copy must be brought to restricted state and powered down to ensure object integrity
  – Requires full Volume retrieval rather than individual objects
  – Individual objects cannot be retrieved from 2nd SAN image
  – Risk of damaged objects – data is okay but application does not start

OS/400 Replicated Systems
[Diagram: Production Server and Secondary Server serving Clients, with LAN and HSL connections]

OS/400 Solution versus Downtime

Outage source         %     Raid-5   SwDisk   SAN    2x SAN   H Avail
Planned
  Backup window       68%   No       No       No     No       Yes
  Software changes    10%   No       No       No     No       Yes
  PTF installation    10%   No       No       No     No       Yes
  Maintenance          7%   No       No       No     No       Yes
  Hardware upgrade     5%   No       Yes      Yes    Yes      Yes
Unplanned
  Disk unit failure   25%   No*      No       No     Yes      Yes
  Software            22%   No       No       No     Yes      Yes
  Power outage        17%   No       No       No     Yes      Yes
  Telecom             18%   No       No       No     No       No
  Human error         12%   No       No       No     No       Yes
  Processors           4%   No       Yes      Yes    Yes      Yes
  Disaster             2%   No       No       No     Yes      Yes

* A single disk failure within a RAID set is protected. If the whole disk unit fails, if more than one disk fails, or if the load source disk is outside RAID-5, loss of data and downtime will occur.
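
One way to read the table above is to add up, for each option, the outage-source percentages marked "Yes". The Python sketch below does that. It is an illustration only: it assumes the percentages are non-overlapping shares of planned (or unplanned) outages, which the slide does not state, and it treats the RAID-5 "No*" entry as a plain "No". The names OPTIONS, PLANNED, UNPLANNED, and coverage are hypothetical helpers, not part of any product.

```python
# Columns of the "OS/400 Solution versus Downtime" table, left to right.
OPTIONS = ("Raid-5", "SwDisk", "SAN", "2x SAN", "H Avail")

# (outage source, share in %, one Y/N flag per option in OPTIONS order)
# Values are transcribed from the table; "No*" is simplified to "N".
PLANNED = [
    ("Backup window",    68, "NNNNY"),
    ("Software changes", 10, "NNNNY"),
    ("PTF installation", 10, "NNNNY"),
    ("Maintenance",       7, "NNNNY"),
    ("Hardware upgrade",  5, "NYYYY"),
]
UNPLANNED = [
    ("Disk unit failure", 25, "NNNYY"),
    ("Software",          22, "NNNYY"),
    ("Power outage",      17, "NNNYY"),
    ("Telecom",           18, "NNNNN"),
    ("Human error",       12, "NNNNY"),
    ("Processors",         4, "NYYYY"),
    ("Disaster",           2, "NNNYY"),
]


def coverage(rows):
    """Sum the share of outage sources each option is marked 'Yes' for."""
    totals = dict.fromkeys(OPTIONS, 0)
    for _source, share, flags in rows:
        for option, flag in zip(OPTIONS, flags):
            if flag == "Y":
                totals[option] += share
    return totals


if __name__ == "__main__":
    for label, rows in (("Planned", PLANNED), ("Unplanned", UNPLANNED)):
        print(label, coverage(rows))
```

Under those assumptions, the H Avail column adds up to 100% of planned and 82% of unplanned outage sources, the 2x SAN column to 5% and 70%, and the SwDisk and SAN columns to 5% and 4%.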

The Critical Criteria
Solution Fundamentals
• Professional Services
  – Certified Expertise
  – Proven Methodology
  – Delivered runbook of procedures
  – Education
• Vendor Support
  – Certified Technical Expertise
  – A Live Person on the Phone
• Field Service and Support
  – Understands the business and requirements
  – Supports the environment
• Solution Options for Your Requirements
  – Provider delivers a range of options including applications / services and support
• References
  – In your industry and applications

OS/400 Replication
[Diagram: Production System and Backup System, each holding PGMS, USRPRF, DB2, DA, SPOOL, LIND, IFS, DQ, and Other objects. Labels between them: DB2 Journal, Audit Journal, CHANGE 1, 2 … 8, 9, ODS/400 Event Staging, Polled queue, SAVING PARAMETERS, ODS/400 SAVF PROCESSING, ARCHIVE, APPLY, and SNA, IP, Opti Communication Links]

Remote journaling has no provision to ensure integrity of the target data.

OS/400 Remote Journaling
[Diagram: Production and Backup systems. Local journals J1, J3, J5, J7, J9, J11, J13, J15 are paired with remote journals J2, J4, J6, J8, J10, J12, J14, J16, each carrying entries labeled A1 to A4]

The Critical Criteria
Technology Fundamentals
• System Integrity
  – If the copy isn't perfect ..
    it's useless
  – Includes ALL the data and objects
  – System Integrity is different than data integrity
  – Role swap is about integrity not about time
• Ease of Use
  – Powerful and Intuitive / common interface
  – Ease of management and configuration
  – Easily trained and fully documented including a "runbook"
• Performance
  – Slow isn't an option – replication and switching
  – Best throughput least CPU
  – Best throughput in Catch up mode
• Scalability
  – The solution must work on one to many machines,
  – across applications,
  – across databases,
  – across departments,
  – servers and service level agreements

The Critical Criteria
Solution Fundamentals
• Professional Services
  – Certified Expertise
  – Proven Methodology
  – Delivered runbook of procedures
• Support
  – Certified Technical Expertise
  – A Live Person on the Phone
• Local Service and Support
  – Understands the business and requirements
  – Supports the environment
• Solution Options for Your Requirements
  – Provider delivers a range of options including applications / services and support
• References
  – In your industry and applications

How OS/400 Role Swap Works
[Diagram: Source System (USER, PGM, PRT, JRN, DATA BASE, DTAA, DTAQ, IFS) and Target System (USER, PGM, INQ, DATA BASE, DTAA, DTAQ, IFS). Entries labeled A1, A2, An, A9 pass through JOURNAL, ROUTER, and STAGING QUEUE over SNA, IP, Opti Communication Links]
Vision Solutions 09, 2001

How OS/400 Role Swap Works
[Diagram: the same components with the Target System and Source System positions exchanged; each side shows a ROUTER and STAGING QUEUE, with JRN and JOURNAL, connected over SNA, IP, Opti Communication Links]

Thank You!