This document presents a summary of backup requirements for ERP services, and a
translation of those requirements into a design for Tivoli Storage Manager (TSM)
infrastructure adequate to meet those requirements. The discussion includes some
possible scenarios in which backup infrastructure would be used to restore services,
and several operational requirements.
 Table of Contents:

I.Needs
   A.Back up all ERP servers.
     1)Retention
   B.Back up all ERP databases.
     1)Retention
   C.Special cases
     1)Router configuration
     2)NetApp
     3)Oracle Databases
   D.Example scenarios:
     1)Loss of a file
     2)Move a database
     3)Refresh a database
     4)Loss of a database
     5)Loss of a machine
     6)Loss of a data center

 II.Operational requirements
     A.Point of contact
     B.Access
     C.Staff services
 III.Architecture
     A.Server design
     B.Network considerations
     C.Landing-pad Disk
     D.Database disk
     E.Primary storage
     F.Copy storage
     G.Offsites
I. Needs
    A.ERP maintains servers enumerated in Appendix A.
       1)These machines are estimated to require retention of seven versions of
          existing files for 90 days, and retention of the last copy of a file for
          365 days.
    B.ERP maintains databases enumerated in Appendix B.
       1)Retention policies for these databases fall into three classes:
          1.Make full copies on an irregular schedule, and retain them for long
             periods. (DMO, SYS, TRN, RPT, Cognos, Stat)
          2.Make full copies on a weekly basis, and retain them for a month (rounded
             up to 35 days). Retain transaction logs for a month (rounded up to 35
             days). (DEV, CFG, CNV, TST, MST, QAT)
          3.Make full copies on a daily basis, retain them for 14 days, and retain
             activity logs for 14 days. (PRD)
C.Special cases
  1)Router configuration
     1.Network Services maintains router configuration history: at least one year
        of daily snapshots of router configurations.
  2)NetApp Filer
     1.After examining current options, we judge it most effective to back up
        NetApp filesystems from one of the machines mounting and serving the
        filesystem. This evaluation will be reviewed as appropriate.
  3)Oracle databases
     1.So far, all Oracle databases have been of a scale and activity level such that
        periodic full backups to local DASD, themselves then backed up as normal
        files, have met the operational requirements of the services involved. This
        evaluation will be reviewed as appropriate.


D. Example scenarios:
  1) Loss of a file
     1.An application admin requests the file be restored through locally-installed
        clients. Process usually completes without intervention from TSM support
        personnel. Request is usually filled within 10 minutes. Impact on backup
        system is negligible to very low.
  2) Move a database
     1.An application admin backs up a database from its source location, and
        restores it to its destination. Process usually completes without intervention
        from TSM support personnel. Request duration is significantly affected by
        size of database, but small databases can be moved in a few minutes. Impact
        on backup system is negligible to very low.
  3) Refresh a database
     1.An application admin restores the database to the desired refresh state. Process
        may complete without intervention from TSM support personnel. If desired,
        provisions for the acceleration of restores can be taken, such as moving files
        from tape to DASD, etc. These can be managed directly by an application
        admin, or may be handed off to TSM support. Request duration is probably
        dominated by database effort involved in replaying logs, and might be
        substantial. Impact on backup system is low to moderate.
  4) Loss of a database
     1.An application admin restores the database to its last full state, and then rolls forward
        the recovery logs. Process may complete without intervention from TSM
        support personnel. If desired, provisions for the acceleration of restores can
        be taken, such as moving files from tape to DASD, etc. These can be
        managed directly by an application admin, or may be handed off to TSM
            support. Request duration is probably dominated by database effort involved
            in replaying logs, and might be substantial. Impact on backup system is low
            to moderate.
      5)Loss of a machine
         1.An application admin restores the machine, either through base install or bare
            metal restore methods. Process may complete without intervention from
            TSM support personnel. If desired, provisions for the acceleration of
            restores can be taken, such as moving files from tape to DASD, etc. These
            can be managed directly by an application admin, or may be handed off to
            TSM support. For machines which require especially rapid recovery from
            catastrophic failure, additional bare metal measures might be taken, such as
            Ghost or mksysb images, and these in turn may be backed up.
      6) Loss of a data center (no two-site operations extant)
         1.All units begin acquisition of replacement hardware. TSM support recreates
            backup service, and begins rebuilding structure for most critical machines
            first. Restoration can begin immediately as backup service is recreated, but
            gets more efficient as offsite copies are re-organized into primary versions.

            This process can be accelerated and streamlined by pre-positioning additional
            hardware at a remote site, and maintaining configuration data such that
            backup services may be erected quickly, perhaps even automatically.

II. Operational requirements:
    A.A point of contact and clearinghouse for ERP backup operations should be
      designated. This liaison should be technically familiar with at least some of
      the platforms under management, and should have access to e.g. machine counts
      and manifests.
    B.TSM support staff will authorize a set of technical personnel to administer the TSM
      artifacts of ERP machines.
    C.TSM support staff are available for guidance and technical planning, but production
      control staff must monitor and schedule backup activity.




III. Architecture (infrastructure proposal):

Load requirements (Excerpted from ERP Sizing spreadsheet):
                         Daily Load (MB)          Total Occupation (MB)
    Servers:                     102,640                      9,986,411
    Databases:                   511,415                     21,029,600

    Total:                       614,055                     31,016,011



Offsite needs have not yet been defined in detail.

In order to sustain this traffic, we will need

a)Network bandwidth adequate to receive the load
b)Disk capacity adequate to serve as a landing pad
c)Tape capacity adequate to store the data
d)Tape drives adequate to process the data
e)CPU, Memory, and I/O adequate to manage the data
f)A server architecture which defends operations from failure
g)Adequate client licensure.

[Diagram: Remote ERP servers, ERP databases, and ERP servers send backups over the
network to the TSM server in the CNS machine room, which stages data on 'landing pad'
DASD (alongside the TSM database DASD) and writes via SAN to the existing 3494
(primary data) and a new 3584 (copy data).]

h)Network bandwidth
Predicted daily traffic equates to under 8 hours at one-quarter utilization of a single
gigabit Ethernet link. Even using this pessimistic backup-window estimate, we can easily
accommodate ERP traffic on a single gig-E link. In fact, database backups can be
scattered through the day, and database log activity will certainly be so scattered. This
decreases the intensity of the load during the usual backup window.
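
As a sanity check, the window arithmetic can be sketched as follows (a minimal
back-of-envelope calculation in Python; the 125 MB/s figure is simply the theoretical
rate of a gigabit link, and the load number comes from the sizing table above):

    # Back-of-envelope backup-window estimate.
    daily_load_mb = 614_055        # total daily traffic, MB (from the sizing table)
    link_mb_s = 125.0              # theoretical gigabit Ethernet rate, MB/s
    utilization = 0.25             # pessimistic: one-quarter of the link

    window_hours = daily_load_mb / (link_mb_s * utilization) / 3600
    print(round(window_hours, 1))  # ~5.5 hours, comfortably under 8
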
i)Disk capacity

We will need a "landing pad" of capacity equal to the daily traffic, architected such
that we can make sustained writes to it at more than 25MB/s.

Disk for the TSM database is dramatically smaller, but we'd probably need two 36GB
drives to serve medium-term database needs.
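
A quick check of the drain-rate requirement, under the same back-of-envelope
assumptions as above (25 MB/s is the sustained-write floor named in this section):

    # Time to cycle one day's traffic through the landing pad at the minimum
    # sustained write rate.
    daily_load_mb = 614_055        # MB/day, from the sizing table
    min_write_mb_s = 25.0          # sustained-write floor for the landing pad

    drain_hours = daily_load_mb / min_write_mb_s / 3600
    print(round(drain_hours, 1))   # ~6.8 hours: a day's load clears well within a day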

j)Tape capacity

Primary data:

TSM's tape management methods create a long-term tape utilization rate which tends
towards 75% at best. Tape compression in our current architecture gets us an
approximately 5:1 compression ratio, averaged over the data currently stored with us.

30TB, at 5:1 compression, with a utilization efficiency of 75%, yields approximately 410
3590-J tapes for primary storage.

“Collocation” can affect this count dramatically. Collocation is a storage management
practice that keeps each client machine's data on its own tapes. This practice
dramatically accelerates restores, since a minimum of tape volumes must be mounted, and
is therefore desirable. It is unlikely that collocation would expand the tape count by
more than 50%.

This leaves us with a conservative high-side of 615 tapes, or approximately 1.5 3494 S
cabinets, for primary data.
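
The primary-count arithmetic can be reproduced as follows (a sketch; the per-cartridge
capacity is my assumption, chosen to be consistent with the counts quoted above rather
than taken from a specification):

    # Primary tape count estimate.
    stored_mb = 31_016_011         # total occupation, MB (from the sizing table)
    compression = 5.0              # observed average compression ratio
    utilization = 0.75             # long-term tape utilization, at best
    cart_gb = 20.0                 # assumed raw GB per 3590-J cartridge

    tapes = stored_mb / 1000 / compression / utilization / cart_gb
    print(round(tapes))            # ~414: "approximately 410 tapes"
    print(round(tapes * 1.5))      # ~620 with the 50% collocation allowance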



Secondary data:

Our current implementation of TSM makes three copies of most data backed up to it, to
guard against media failure and to provide limited offsite coverage.

Currently we maintain one copy in the 3494 library, on higher-density media, non-
collocated. That would yield another 205 tapes of media-failure copy data. A limited
offsite copy is located at the IT center, on virtual volumes in their TSM server. I
anticipate that the load would amount to at least 140 LTO tapes in their library.

The proposed design includes “local” offsites, placing a SAN-attached LTO tape library
at a remote site, in place of both the media-failure copy and the IT center offsite.
This would save substantial tape capacity, both in slots and in tape-hours.
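
For what it's worth, the 205-tape figure is consistent with a single non-collocated
copy on media of roughly twice the per-cartridge capacity assumed above (the 40 GB
figure is my assumption, not stated in the text):

    # Media-failure copy estimate: one non-collocated copy on higher-density media.
    copy_gb = 31_016_011 / 1000 / 5.0     # compressed occupation, GB
    copy_tapes = copy_gb / 0.75 / 40.0    # 75% utilization, assumed 40 GB cartridges
    print(round(copy_tapes))              # ~207: "another 205 tapes"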

Aggregate:
                                 Extend current design           SAN-attached offsites
3494 slots                    820                            615
3584 slots                    140 (IT center library)        140 (new library)

This discussion omits long-distance offsites, pending detailed needs specification.

k)Tape drives

Tape drive needs are divided into two categories: nightly backup processing and
maintenance processing. We propose to use a SAN-attached library to fulfill these
functions.

Each night's backups must be copied to tape two or three times (depending on
implementation). If incrementals average 600 GB/night, that translates to 1200 - 1800
GB/day of tape traffic, minimum (all disk-to-tape) and up to 3600 GB/day if all copying
is tape-to-tape.

3590s and LTOs both have rated speeds over 30MB/s sustained. We regularly observe
write speeds in this range. If we adopt the SAN-attached remote library, our tape
capacity per day will far outstrip our ability to receive backups. If we do not, then
copies must navigate another network/disk/tape chain; this chain is already showing
some weakness.
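
The drive-time implied by the worst (all tape-to-tape) case above works out as follows
(a rough estimate; the four-drive split matches the drive count proposed for the new
library):

    # Daily drive-hours for the worst-case 3600 GB/day of tape traffic.
    daily_copy_gb = 3600.0            # GB/day of tape traffic, from above
    drive_mb_s = 30.0                 # sustained rated speed per drive

    drive_hours = daily_copy_gb * 1000 / drive_mb_s / 3600
    print(round(drive_hours, 1))      # ~33 drive-hours per day in aggregate
    print(round(drive_hours / 4, 1))  # ~8.3 hours/day spread across 4 drives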

Further, extending the current design would still require substantial augmentation of both
the 3494 library and the IT Center's 3584. TSM Support has already suggested adding
drives to the remote library as a response to a problem report. Additionally, we'd need at
least two more 3590 drives in the 3494, possibly four. These additions would displace
more tape slots, accelerating expansion.

To summarize:

                              Extend current design       SAN-attached offsites
3590 Drives                   2-4                         0
New library                   0                           1
LTO Drives                    2 (at IT center)            4
Fibre Channel Adapters        0                           2
Fibre Channel switch          0                           1

l)CPU, I/O, Memory.

We've run out of PCI busses on our current TSM server; we'll need a machine with more
I/O capacity in order to get the full performance we need. A 630 would offer
substantially greater capacity than we have now, in both CPU and I/O. We should acquire
at least 4GB of memory; the multiple-server architecture will consume more.

m) Server architecture.
It is possible to deploy multiple TSM servers on the same hardware. If we separate the
operations into servers by administrative domain, this would defend functions from
impact due to e.g. maintenance. A prominent example of such impact is the periodic
need to reorganize the TSM database. With separate servers, each administrative domain
would experience that service interruption at a moment convenient to that service.
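
As an illustration only (instance directories, file names, and port numbers below are
hypothetical; the layout assumes TSM's usual convention of one options file per server
instance), two instances might be kept apart like so:

    /tsm/inst_erp/dsmserv.opt
        * hypothetical ERP instance
        TCPPORT        1500
        VOLUMEHISTORY  volhist.erp
        DEVCONFIG      devconfig.erp

    /tsm/inst_general/dsmserv.opt
        * hypothetical general-service instance
        TCPPORT        1501
        VOLUMEHISTORY  volhist.gen
        DEVCONFIG      devconfig.gen

Each instance is started from its own directory, so a database reorganization on one
instance interrupts only that instance's clients.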


n) Client licensure.

New client licenses will need to be purchased as new systems require service.


Summary: initial
o)Current network seems adequate.
p)Current allocation of 200 GB landing-pad disk seems adequate.
q)Database space of ~36 GB. This will need to be acquired, as during the server
    architecture modification we will effectively have ERP data recorded twice.
r) Initial allocation of 200 tapes (currently 90) in existing 3494 library for primary ERP
    data. New storage frame required.
s)Initial allocation of 50 tapes in a SAN attached 3584 LTO library (to be acquired),
    located at a remote site.
t)A logically separate TSM server located on the same hardware as other centrally
    maintained TSM servers. A new 630 machine is indicated to deal with increased
    memory, CPU, and I/O needs.
u)SAN switch to replace the currently-installed switch, which has inadequate capacity
    both in ports (only 8) and in speed (1 Gb/s instead of 2 Gb/s).


Summary: 18 months
v)Possible second Gigabit Ethernet interface. More likely if long-distance offsites are
   initiated.
w)500GB more landing-pad disk indicated. Suggest budgeting now, with acquisition in
   smaller aliquots as need is demonstrated.
x)36GB of database should be adequate in the 18-month timeframe
y)Primary tape occupation expected to increase to 600.
z)Copy tape occupation expected to increase to 200.
aa)630 should be adequate in 18-month timeframe


5-year outlook:

The major feature of the 5-year outlook which appears to offer an opportunity or
challenge is the choice of tape architecture.

Network technology is going to remain Ethernet, and it will be clear when the time has
arrived to move to multiple GigE links, 10G, or whatever follows. Disk technology is
going to remain fairly predictable: TSM is a good destination for enterprise-class disk
which is 'cast off' from other applications. This will result in continuing
augmentation, with relatively clear decision points. Underlying hardware will also most
likely continue to be a routine choice; it is likely that we will upgrade hardware twice
in the next 60 months.

Tape architecture is not so simple. As of September 1, our major choices are our current
3590 architecture and a relatively new format called “LTO”. LTO tape architecture, a
competitively produced open standard, offers several times greater capacity on a per-
cartridge basis. (60 GB is the current peak 3590 raw capacity; current LTO offers up to
200 GB.) To balance the greater capacity, LTO technology has dramatically slower seek
performance. This makes it eminently practical for secondary (copy) storage, but a
much less compelling choice for primary storage.

Further clouding this field is the not-yet-announced '3592' tape architecture. Currently,
rumors are all I've been able to track down, but they seem to indicate terabyte-capacity
cartridges in a form factor physically similar to the 3590 carts.

iSCSI presents an opportunity to replace dedicated SAN fabric with high-speed IP
networking over the same or similar paths. This standard is not yet widely deployed.

Given what we know now, a 3590-class primary storage system, with LTO in use for
secondary storage, seems the best choice. This both exploits our current investment and
maximizes the performance and capacity return for dollars spent. LTO drives cost
approximately half what 3590 drives do, and by situating a new library at a remote point,
we will be able to cut our total tape usage by approximately one third, since one of the
three copies we currently make is eliminated.


Rough shopping list:


            630 plus 'RIO' drawer, peripherals                      $55,595.00
            3584 library, 4 drives, 240 tapes                      $139,496.00
            'S' frame for 3494                                      $21,000.00
            16-port FC-AL switch                                    $55,163.00


                                       Sum:                        $271,254.00




 The initial installation of the 3584 will probably need more room for tapes 18-24 months
after it goes into service. The initial allocation of drives should be adequate through that
timeframe.

The additional frame of 3494 slots should be adequate for about two years, once we move
copies to the 3584. The most likely upgrade for that library will be a drive upgrade to
3592 drives, and a wholesale replacement of cartridges. This should increase the
capacity of the deployed library by nearly a factor of 10.




Possible enhancements:

1) Short-distance offsites.


[Diagram: as above, except that the new 3584 (copy data) is SAN-attached at a remote
location; the TSM server, 'landing pad' DASD, and TSM database DASD remain in the CNS
machine room, with the existing 3494 (primary data) attached locally.]




It is possible to locate the SAN-attached 3584 at a remote location, at the end of dark
fiber. This arrangement would provide both the short-distance offsite benefits we
currently seek from the interaction with the ITC, and the media-failure protection for
which we copy data within the 3494.

If we place a TSM server of capacity similar to the primary's at the location of the
offsite 3584 library, we can maintain configuration data in both locations, permitting
both load distribution and rapid recovery from crisis.


2) Eventual ideal:
[Diagram: the short-distance offsite arrangement above (TSM server, 'landing pad' DASD,
and TSM database DASD in the CNS machine room; existing 3494 for primary data; new 3584
for copy data at a remote location), plus a duplicate TSM server with its own 'landing
pad' DASD and TSM database DASD at a VERY remote location, reached over the network.]
In addition to the short-distance offsite within dark-fiber SAN range, we could deploy
duplicate backup operations at a second site. This would permit recovery operations to
begin almost immediately after even a major catastrophe.
Appendix A: ERP machine statement
      Count of machines by model, OS, and U; disk GB totals per model:

      Model   OS       U=1   U=2   U=4   U=5   U=7   U=42   Total   Disk GB
      615     AIX                   16                         16       576
      630     AIX                   12                         12       432
      670     AIX                                        2      2       144
      1650    linux      1                                      1        36
      1650    w2k       29                                     29      2088
      2650    w2k              5                                5       360
      4650    w2k                                2              2       144
      oth     oth        2                 3             1      6      5120
      6650    w2k                    9                          9       648
      Grand Total       32     5    37     3     2       3     82      9548




Appendix B: ERP Database statement
                DMO      SYS      DEV      CFG      CNV      TST      MST      QAT      TRN      RPT      PRD       GB   Growth
PORTAL           O        O        W        W        W        W        W        W        O                 D         5        8
STUDENT          O        O        W        W        W        W        W        W        O        O        D         8       24
FINANCE          O        O        W        W        W        W        W        W        O        O        D        52       65
HRMS             O        O        W        W        W        W        W        W        O        O        D        18       25
EPM              O        O        W        W        W        W        W        W        O        O        D        27       35
CRM              O        O        W        W        W        W        W        W        O        O        D         8       12
COGNOS-DW                          O                                            O                          O        10       30
STAT                               O                                                                                12       14
Responsibility  Bill     Bill     Bill     Bill     Bill     Bill     Bill     Bill     Bill     Bill     Bill
Priority        Low      Low      Med      High     Med      Med      High     Low      Low      Med      High
Capacity
Recurrence       1        1                                                              1
Retention      eternity eternity 1 month  1 month  1 month  1 month  1 month  1 month  eternity

                                                             O = On Demand
                                                             D = Daily
                                                             W = Weekly (online)