                                Data Center
                    Best Practices and Architecture
                 for the California State University




                      Author(s):   DCBPA Task Force
                      Date:        October 12, 2009
                      Status:      DRAFT
                      Version:     0.34.11
The content of this document is the result of the collaborative work of the Data Center Best Practice and
Architecture (DCBPA) Task Force established under the Systems Technology Alliance committee within
the California State University.



Team members who directly contributed to the content of this document are listed below.

       Samuel G. Scalise, Sonoma, Chair of the STA and the DCBPA Task Force

       Don Lopez, Sonoma

       Jim Michael, Fresno

       Wayne Veres, San Marcos

       Mike Marcinkevicz, Fullerton

       Richard Walls, San Luis Obispo

       David Drivdahl, Pomona

       Ramiro Diaz-Granados, San Bernardino

       Don Baker, San Jose

       Victor Vanleer, San Jose

       Dustin Mollo, Sonoma

       David Stein, PlanNet Consulting

       Mark Berg, PlanNet Consulting

       Michel Davidoff, Chancellor’s Office




Table of Contents
1.      Introduction ............................................................................................................................. 4
     1.1.   Purpose ................................................................................................................................ 4
     1.2.   Context ................................................................................................................................. 4
     1.3.   Audience .............................................................................................................................. 5
     1.4.   Development Process .......................................................................................................... 5
     1.5.   Principles and Properties..................................................................................................... 5
2.      Framework/Reference Model ................................................................................................... 7
3.      Best Practice Components ...................................................................................................... 43
     3.1.     Standards .......................................................................................................................... 43
     3.2.     Hardware Platforms ......................................................................................................... 45
     3.3.     Software ............................................................................................................................ 47
     3.4.     Delivery Systems ............................................................................................................... 47
     3.5.     Disaster Recovery ............................................................................................................. 53
     3.6.     Total Enterprise Virtualization ......................................................................................... 59
     3.7.     Management Disciplines .................................................................................................. 62




    1. Introduction

        1.1. Purpose

As society and institutions of higher education increasingly benefit from technology and collaboration,
identifying mutually beneficial best practices and architecture becomes vital to the behind-the-scenes
infrastructure of the university. Key drivers behind the gathering and assimilation of
this collection are:

       Many campuses want to know what the others are doing so they can draw from a knowledge
        base of successful initiatives and lessons learned. Having a head start in thinking through
        operational practices and effective architectures--as well as narrowing vendor selection for
        hardware, software and services--creates efficiencies in time and cost.

       Campuses are impacted financially, and data center capital and operating expenses need to be
        curbed. For many, current growth trends are unsustainable: there is limited square footage to
        address the demand for more servers and storage without implementing new technologies to
        virtualize and consolidate.

       Efficiencies in power and cooling need to be achieved in order to address green initiatives and
        reduction in carbon footprint. They are also expected to translate into real cost savings in an
        energy-conscious economy. Environmentally sound practices are increasingly the mandate and
        could result in measurable controls on higher energy consumers.

       Creating uniformity across the federation of campuses allows for consolidation of certain
        systems, reciprocal agreements between campuses to serve as tertiary backup locations, and
        opt-in subscription to services hosted at campuses with capacity to support other campuses,
         such as the C-cubed initiative.


        1.2. Context

This document is a collection of Best Practices and Architecture for California State University Data
Centers. It identifies practices and architecture associated with the provision and operation of mission-
critical production-quality servers in a multi-campus university environment. The scope focuses on the
physical hardware of servers, their operating systems, essential related applications (such as
virtualization, backup systems and log monitoring tools), the physical environment required to maintain
these systems, and the operational practices required to meet the needs of the faculty, students, and
staff. Data centers that adopt these practices and architecture should be able to house any end-user
service – from Learning Management Systems, to calendaring tools, to file-sharing.

This work represents the collective experience and knowledge of data center experts from the 23
campuses and the chancellor’s office of the California State University system. It is coordinated by the
Systems Technology Alliance, whose charge is to advise the Information Technology Advisory Committee



(made up of campus Chief Information Officers and key Chancellor’s Office personnel) on matters
relating to servers (i.e., computers which provide a service for other computers connected via a
network) and server applications.

This is a dynamic, living document that can be used to guide planning to enable collaborative systems,
funding, procurement, and interoperability among the campuses and with vendors.

This document does not prescribe services used by end-users, such as Learning Management Systems
or Document Management Systems. As those services and applications are identified by end-users
such as faculty and administrators, this document will describe the data center best practices and
architecture needed to support such applications.

Campuses are not required to adopt the practices and architecture elucidated in this document. There
may be extenuating circumstances that require alternative architectures and practices. However, it is
hoped that these alternatives are documented in this process.

It is not the goal to describe a single solution, but rather the range of best solutions that meet the
diverse needs of diverse campuses.


        1.3. Audience

This information is intended to be reviewed by key stakeholders who have material knowledge of data
center facilities and service offerings from business, technical, operational, and financial perspectives.


        1.4. Development Process

The process for creating and updating these Best Practices and Architecture (P&A) is to identify the most
relevant P&A, inventory existing CSU P&A for key aspects of data center operations, identify current
industry trends, and document those P&A which best meet the needs of the CSU. This will include
information about related training and costs, so that campuses can adopt these P&A with a full
understanding of the costs and required expertise.

The work of creating this document will be conducted by members of the Systems Technology Alliance
appointed by the campus Chief Information Officers, by members of the Chancellor’s Office Technology
Infrastructure Services group, and by contracted vendors.


        1.5. Principles and Properties

In deciding which Practices and Architecture should be adopted, it is important to have a set of criteria
that reflect the unique needs, values, and goals of the organization. These Principles and Properties
include:

       Cost-effectiveness


      Long-term viability
      Flexibility to support a range of services
      Security of the systems and data
      Reliable and dependable uptime
      Environmental compatibility
      Redundancy
      High availability
      Performance
      Training
      Communication

Additionally, the architecture should emphasize criteria that are standards-based. The CSU will
implement standards-based solutions in preference to proprietary solutions where this does not
compromise the functional implementation.

The CSU seeks to adhere to standard ITIL practices and workflows where practical. Systems and
solutions described herein should relate to corresponding ITIL and service management principles.




   2. Framework/Reference Model

The framework is used to describe the components and management processes that lead to a holistic
data center design. Data centers are as much about the services offered as they are the equipment and
space contained in them. Taken together, these elements should constitute a reference model for a
specific CSU campus implementation.

       2.1. Standards
           2.1.1.     ITIL


               The Information Technology Infrastructure Library is a set of concepts around managing
               services and operations. The model was developed by the UK Office of Government
               Commerce and has been refined and adopted internationally. The ITIL version 2
               framework for Service Support breaks out several management disciplines that are
               incorporated in this CSU reference architecture (see Section 2.7).

               ITIL version 3 has reworked the framework into a collection of five volumes that
                describe:

                   Service Strategy
                   Service Design
                   Service Transition
                   Service Operation
                   Continual Service Improvement

           2.1.2.      ASHRAE

               The American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE)
               releases updated standards and guidelines for industry consideration in building design.
               They include recommended and allowable environment envelopes, such as
               temperature, relative humidity, and altitude for spaces housing datacomm equipment.
               The purpose of the recommended envelope is to give guidance to data center operators
               on maintaining high reliability and also operating their data centers in the most energy
               efficient manner.

           2.1.3.      Uptime Institute

               The Uptime Institute addresses architectural, security, electrical, mechanical, and
               telecommunications design considerations. See Section 2.4.1.1 for specific information
               on tiering standards as applied to data centers.

           2.1.4.      ISO/IEC 20000

                ISO 20000-1 and ISO 20000-2 are effective resources to draw upon among the ISO
                IT management standards. ISO 20000-1 promotes the adoption of


       an integrated process approach to effectively deliver managed services to meet the
       business and customer requirements. It comprises ten sections: Scope; Terms &
       Definitions; Planning and Implementing Service Management; Requirements for a
       Management System; Planning & Implementing New or Changed Services; Service
       Delivery Process; Relationship Processes; Control Processes; Resolution Processes; and
       Release Process. ISO 20000-2 is a 'code of practice', and describes the best practices for
       service management within the scope of ISO20000-1. It comprises nine sections: Scope;
       Terms & Definitions; The Management System; Planning & Implementing Service
       Management; Service Delivery Processes; Relationship Processes; Resolution Processes;
       Control Processes; Release Management Processes.

       Together, this set of ISO standards is the first global standard for IT service
       management, and is fully compatible and supportive of the ITIL framework.

2.2. Hardware Platforms
    2.2.1.    Servers

       Types

               Rack-mounted Servers – provide the foundation for any data center’s compute
                infrastructure. The most common are 1U and 2U; these form factors compose
                what is known as the volume market. The high-end market, geared towards
                high-performance computing (HPC) or applications that need more
                input/output (I/O) and/or storage, is composed of 4U to 6U rack-mounted
                servers. The primary distinction between volume-market and high-end servers is
                the I/O and storage capabilities.

               Blade Servers – are defined by the removal of many components (power supply
                units, network interface cards [NICs] and storage adapters) from the server itself.
                These components are grouped together as part of the blade chassis and shared
                by all the blades. The chassis is the piece of equipment that all of the blade
                servers “plug” into. The blade servers themselves contain processors, memory
                and a hard drive or two. One of the primary caveats to selecting the blade
                server option is the potential for future blade/chassis compatibility. Most
                independent hardware vendors (IHVs) do not guarantee blade/chassis
                compatibility beyond two generations or five years. Another potential caveat is
                the high initial investment in blade technology because of additional costs
                associated with the chassis.

               Towers – There are two primary reasons for using tower servers: price and
                remote locations. Towers offer the least expensive entrance into the server
                platform market, and they can be placed outside the confines of a data
                center. This feature can be useful for locating an additional Domain Name
                System (DNS) server or backup server in a remote office for redundancy purposes.



Principles

    1. Application requirements – Applications such as databases, backup servers and
       others with high I/O requirements are better suited to HPC rack-mounted servers.
       Applications such as web servers and MTAs work well in a volume-market rack-
       mounted environment or even in a virtual server environment. These
       applications allow servers to be easily added and removed to meet spikes in
       capacity demand. The need to have servers that are physically located at
       different sites for redundancy or ease of administration can be met by tower
       servers, especially if they are low-demand applications. Applications with high
       I/O requirements perform better with 1U or 2U rack-mounted servers rather
       than blade servers because stand-alone servers have a dedicated I/O interface
       rather than a shared one found on the chassis of a blade server.

    2. Software support – can determine the platform an application lives on. Some
       vendors refuse to support virtual servers, making VMs unsuitable if vendor
       support is a key requirement. Some software does not support multiple
       instances of an application, requiring the application to run on a single large
       server rather than multiple smaller servers.

    3. Storage – requirements can vary from a few gigabytes, to accommodate the
       operating system, application and state data for application servers, to terabytes
       to support large database servers. Applications requiring large amounts of
       storage should be SAN-attached using Fibre Channel or iSCSI. Fibre Channel offers
       greater reliability and performance but requires a higher skill level from SAN
       administrators. Support for faster speeds and improved reliability is making
       iSCSI more attractive. Direct Attached Storage (DAS) is still prevalent because it is
       less costly and easier to manage than SAN storage. Rack-mounted 4U to 6U servers
       have the space to house a large number of disk drives and make suitable DAS servers.

    4. Consolidation – projects can result in several applications being combined onto
       a single server or virtualized. Care must be taken when combining applications
       to ensure they are compatible with each other and that vendor support can be
       maintained. Virtualization accomplishes consolidation by allowing each
       application to behave as if it were running on its own server. The benefits of
       consolidation include reduced power and space requirements and fewer servers to manage.

    5. Energy efficiency – starts with proper cooling design, server utilization
       management and power management. Replacing old servers with newer, energy-
       efficient ones reduces energy use and cooling requirements and may be eligible
       for rebates that allow the new servers to pay for themselves (an illustrative
       estimate follows this list).

    6. Improved management – Many data centers contain “best of breed”
       technology. They contain server platforms and other devices from many


            different vendors. Servers may be from vendor A, storage from vendor B and
            network from vendor C. This complicates troubleshooting and leads to finger
            pointing. Reducing the number of vendors produces standardization and is more
            likely to allow a single management interface for all platforms.

    7. Business growth/New services – As student enrollment grows and the number
       of services to support them increases, the demand on the data center to run
       applications and store data also grows. This is the most common reason for
       buying new server platforms. IT administrators must use a variety of gauges to
       anticipate this need and respond in time.
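
    As an illustration of the energy-efficiency principle above, the sketch below estimates the
    annual energy savings and simple payback of a server refresh. The wattages, electricity rate,
    PUE, and hardware costs are hypothetical placeholders, not CSU figures.

# Rough server-refresh savings estimate (illustrative assumptions only).
OLD_WATTS = 450          # average draw of an older server (assumed)
NEW_WATTS = 250          # average draw of a replacement server (assumed)
SERVER_COUNT = 20        # servers being replaced one-for-one (assumed)
PUE = 1.8                # power usage effectiveness: cooling/distribution overhead (assumed)
RATE_PER_KWH = 0.12      # electricity cost in dollars per kWh (assumed)
NEW_SERVER_COST = 4000   # purchase price per replacement server, before rebates (assumed)
HOURS_PER_YEAR = 8760

saved_kw = (OLD_WATTS - NEW_WATTS) * SERVER_COUNT / 1000.0
annual_kwh = saved_kw * PUE * HOURS_PER_YEAR        # include cooling overhead via PUE
annual_savings = annual_kwh * RATE_PER_KWH
payback_years = (NEW_SERVER_COST * SERVER_COUNT) / annual_savings

print(f"Annual energy saved: {annual_kwh:,.0f} kWh")
print(f"Annual cost savings: ${annual_savings:,.0f}")
print(f"Simple payback: {payback_years:.1f} years (rebates and consolidation shorten this)")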

    2.2.1.1. Server Virtualization
           Principles

                 1. Reliability and availability—An implementation of server virtualization
                    should provide increased reliability of servers and services by providing
                    for server failover in the event of a hardware loss of service, as well as
                    high availability by ensuring that access to shared services such as
                    network and disk is fault-tolerant and load-balanced.

                2. Reuse—Server virtualization should allow better utilization of hardware
                   and resources by provisioning multiple services and operating
                   environments on the same hardware. Care must be taken to ensure
                    that hardware is operating within the limits of its capacity. Effective
                    capacity planning becomes especially important (a simple sizing sketch
                    follows this list).

                3. Consumability—Server virtualization should allow us to provide quickly
                   available server instances, using technologies such as cloning and
                   templating when appropriate.

                 4. Agility—Server virtualization should allow us to improve organizational
                    efficiency by provisioning servers and services faster, through rapid
                    deployment of instances from clones and templates.

                5. Administration—Server virtualization will improve administration by
                   having a single, secure, easy-to-access interface to all virtual servers.
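
                The following back-of-the-envelope sizing calculation illustrates the capacity-planning
                concern raised in principle 2 above. All of the figures (per-VM demand, host capacity,
                oversubscription, headroom) are hypothetical placeholders.

import math

# Hypothetical virtualization capacity-planning sketch (all inputs assumed).
VM_COUNT = 120            # virtual machines to host
AVG_VCPU_PER_VM = 2       # average vCPUs per VM
AVG_GB_RAM_PER_VM = 4     # average memory per VM in GB

HOST_CORES = 32           # physical cores per host
VCPU_PER_CORE = 4         # acceptable vCPU-to-core oversubscription
HOST_RAM_GB = 256         # memory per host in GB
HEADROOM = 0.75           # run hosts at no more than 75% of capacity
N_PLUS = 1                # spare host to absorb a host failure (N+1)

hosts_for_cpu = VM_COUNT * AVG_VCPU_PER_VM / (HOST_CORES * VCPU_PER_CORE * HEADROOM)
hosts_for_ram = VM_COUNT * AVG_GB_RAM_PER_VM / (HOST_RAM_GB * HEADROOM)

hosts_needed = math.ceil(max(hosts_for_cpu, hosts_for_ram)) + N_PLUS
print(f"CPU-bound estimate: {hosts_for_cpu:.1f} hosts, RAM-bound estimate: {hosts_for_ram:.1f} hosts")
print(f"Hosts to deploy (including N+1 spare): {hosts_needed}")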


2.2.2.       Storage
    2.2.2.1. SAN – Storage Area Network
        2.2.2.1.1. Fibre Channel
        2.2.2.1.2. iSCSI
                     1. Benefits
                        1.1. Reduced costs: By leveraging existing network components
                              (network interface cards [NICs], switches, etc.) as a storage
                              fabric, iSCSI increases the return on investment (ROI) made for
                              data center network communications and potentially saves


         capital investments required to create a separate storage
         network. For example, iSCSI host bus adapters (HBAs) are 30-
         40% less expensive than Fibre Channel HBAs. Also, in some
         cases, 1 Gigabit Ethernet (GbE) switches are 50% less than
         comparable Fibre Channel switches.

            Organizations employ qualified network administrator(s) or
            trained personnel to manage network operations. Being a
            network protocol, iSCSI leverages existing network
            administration knowledge bases, obviating the need for
            additional staff and educational training to manage a
            different storage network.

    1.2. Improved options for DR: One of iSCSI's greatest strengths is its
          ability to travel long distances using IP wide area networks
          (WANs). Offsite data replication plays a key part in disaster
          recovery plans by preserving company data at a co-location
          that is protected by distance from a disaster affecting the
          original data center. Using a SAN router (iSCSI to Fibre Channel
          gateway device) and a target array that supports standard
          storage protocols (like Fibre Channel), iSCSI can replicate data
          from a local target array to a remote iSCSI target array,
          eliminating the need for costly Fibre Channel SAN
          infrastructure at the remote site.

            iSCSI-based tiered storage solutions such as backup-to-disk
            (B2D) and near-line storage have become popular disaster
            recovery options. Using iSCSI in conjunction with Serial
            Advanced Technology Attachment (SATA) disk farms, B2D
            applications inexpensively back up, restore, and search data
            at rapid speeds.

    1.3. Boot from SAN: As operating system (OS) images migrate to
          network storage, boot from SAN (BfS) becomes a reality,
          allowing chameleon-like servers to change application
          personalities based on business needs, while removing ties to
          Fibre Channel HBAs previously required for SAN connectivity
          (would still require hardware initiator).

2. Components
   2.1. Initiators



     2.1.1.Software Initiators: While software initiators offer cost-
           effective SAN connectivity, there are some issues to
           consider. The first is host resource consumption versus
           performance. An iSCSI initiator runs within the
           input/output (I/O) stack of the operating system, utilizing
           the host memory space and CPU for iSCSI protocol
           processing. By leveraging the host, an iSCSI initiator can
           outperform almost any hardware-based initiator.
           However, as more iSCSI packets are sent or received by
           the initiator, more memory and CPU bandwidth is
           consumed, leaving less for applications. Obviously, the
           amount of resource consumption is highly dependent on
           the host CPU, NIC, and initiator implementation, but
           resource consumption could be problematic in certain
            scenarios. Software iSCSI initiators can also consume resource bandwidth
            that could otherwise be allocated to additional virtual machines.
     2.1.2.Hardware Initiators: iSCSI HBAs simplify boot-from-SAN
           (BfS). Because an iSCSI HBA is a combination NIC and
           initiator, it does not require assistance to boot from the
           SAN, unlike software initiator counterparts. By discovering
           a bootable target LUN during system power-on self test
           (POST), an iSCSI HBA can enable an OS to boot an iSCSI
           target like any DAS or Fibre Channel SAN-connected
           system. In terms of resource utilization, an iSCSI HBA
           offloads both TCP and iSCSI protocol processing, saving
           host CPU cycles and memory. In certain scenarios, like
           server virtualization, an iSCSI HBA may be the only choice
           where CPU processing power is consequential.
2.2. Targets
     2.2.1.Software Targets: Any standard server can be used as a
           software target storage array but should be deployed as a
                  stand-alone application. A software target can consume most of the
                  platform’s resources, leaving little room for additional
                  applications.
     2.2.2.Hardware Targets: Many of the iSCSI disk array platforms
           are built using the same storage platform as their Fibre
           Channel cousin. Thus, many iSCSI storage arrays are
           similar, if not identical, to Fibre Channel arrays in terms of
           reliability, scalability, performance, and management.
           Other than the controller interface, the remaining product
           features are almost identical.


2.3. Tape Libraries
      2.3.1.Tape libraries should be capable of being iSCSI target
            devices; however, broad adoption and support in this
            category has not yet materialized, and it remains a territory
            served by native Fibre Channel connectivity.
2.4. Gateways and Routers
     2.4.1.iSCSI to Fibre Channel gateways and routers play a vital
           role in two ways. First, these devices increase return on
           invested capital made in Fibre Channel SANs by extending
           connectivity to “Ethernet islands” where devices that
           were previously unable to reach the Fibre Channel SAN
           can tunnel through using a router or gateway. Secondly,
           iSCSI routers and gateways enable Fibre Channel to iSCSI
            migration. SAN migration is a gradual process; replacing a
            large investment in Fibre Channel SANs all at once is rarely
            financially realistic. As IT administrators carefully migrate from
           one interconnect to another, iSCSI gateways and routers
           afford IT administrators the luxury of time and money.
           One note of caution: It's important to know the port
           speeds and amount of traffic passing through a gateway
           or router. These devices can become potential bottlenecks
           if too much traffic from one network is aggregated into
           another. For example, some router products offer eight 1
           GbE ports and only two 4 Gb Fibre Channel ports. While
            total throughput is the same, careful attention must be
            paid to ensure traffic is evenly distributed across ports.

                 Any x86 server can act as an iSCSI to Fibre Channel
                 gateway. Using a Fibre Channel HBA and iSCSI
                 target software, any x86 server can present LUNs
                 from a Fibre Channel SAN as an iSCSI target. Once
                 again, this is not a turnkey solution—especially for
                 large SANs—and caution should be exercised to
                 prevent performance bottlenecks. However, this
                 configuration can be cost-effective for small
                 environments and connectivity to a single Fibre
                 Channel target or small SAN.
2.5. Internet Storage Name Service (iSNS)
     2.5.1.Voracious storage consumption, combined with lower-
           cost SAN devices, has stimulated SAN growth beyond what
           administrators can manage without help. iSCSI
           exacerbates this problem by proliferating iSCSI initiators


                                        and low-cost target devices throughout a boundless IP
                                         network. Thus, a discovery and configuration service like
                                         iSNS is a must for large SAN configurations. Although
                                        other discovery services exist for iSCSI SANs, such as
                                        Service Location Protocol (SLP), iSNS is emerging as the
                                        most widely accepted solution.

                    3. Security
                    4. Multi-path support


        2.2.2.2. NAS – Network Attached Storage
        2.2.2.3. DAS – Direct Attached Storage
        2.2.2.4. Storage Virtualization
2.3. Software
    2.3.1.       Operating Systems
         An Operating System (commonly abbreviated to either OS or O/S) is an interface between
         hardware and user; an OS is responsible for the management and coordination of activities and
         the sharing of the resources of the computer. The operating system acts as a host for computing
         applications that are run on the machine. As a host, one of the purposes of an operating system is
         to handle the details of the operation of the hardware. This relieves application programs from
         having to manage these details and makes it easier to write applications.

   2.3.2.       Middleware
         Middleware is computer software that connects software components or applications. The
         software consists of a set of services that allows multiple processes running on one or more
         machines to interact across a network. This technology evolved to provide for interoperability
         in support of the move to coherent distributed architectures, which are used most often to
         support and simplify complex, distributed applications. It includes web servers, application
         servers, and similar tools that support application development and delivery. Middleware is
         especially integral to modern information technology based on XML, SOAP, Web services, and
         service-oriented architecture.

       2.3.2.1. Identity Management
               Identity management or ID management is a broad administrative area that deals with
               identifying individuals in a system (such as a country, a network or an organization) and
               controlling the access to the resources in that system by placing restrictions on the
               established identities.

   2.3.3.       Databases
         A database is an integrated collection of logically related records or files which consolidates
         records previously stored in separate files into a common pool of data records that provides
         data for many applications. A database is a collection of information that is organized so that it
         can easily be accessed, managed, and updated. In one view, databases can be classified
         according to types of content: bibliographic, full-text, numeric, and images. The structure is
         achieved by organizing the data according to a database model. The model that is most
         commonly used today is the relational model. Other models such as the hierarchical model and
         the network model use a more explicit representation of relationships.




2.3.4.       Core/Enabling Applications
    2.3.4.1. Email
           Electronic mail, often abbreviated as email or e-mail, is a method of exchanging digital
           messages, designed primarily for human use. E-mail systems are based on a store-and-
           forward model in which e-mail computer server systems accept, forward, deliver and
           store messages on behalf of users, who only need to connect to the e-mail infrastructure,
           typically an e-mail server, with a network-enabled device (e.g., a personal computer) for
           the duration of message submission or retrieval.
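
            As a small illustration of the store-and-forward model described above, the sketch below
            submits a message to a campus mail server, which then stores and forwards it on the
            sender’s behalf. The host name, port, addresses, and credentials are placeholders.

import smtplib
from email.message import EmailMessage

# Build a simple message and hand it to the campus submission server (placeholder host),
# which stores and forwards it toward the recipient's mailbox.
msg = EmailMessage()
msg["From"] = "operations@campus.example.edu"        # placeholder sender
msg["To"] = "staff@campus.example.edu"               # placeholder recipient
msg["Subject"] = "Data center maintenance window"
msg.set_content("The UPS maintenance window is scheduled for Saturday 06:00-08:00.")

with smtplib.SMTP("smtp.campus.example.edu", 587) as server:  # placeholder host and port
    server.starttls()                                # encrypt the submission session
    server.login("operations", "app-password")       # placeholder credentials
    server.send_message(msg)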
       2.3.4.1.1.    Spam Filtering
                  E-mail spam, also known as junk e-mail, is a subset of spam that involves nearly
                  identical messages sent to numerous recipients by e-mail. Spam filtering comes
                  with a large set of rules which are applied to determine whether an email is spam
                  or not. Most rules are based on regular expressions that are matched against the
                   body or header fields of the message, but anti-spam vendors also employ a number of
                  other spam-fighting techniques including header and text analysis, Bayesian
                  filtering, DNS blocklists, and collaborative filtering databases.
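
                   The simplified sketch below illustrates the rule-and-score approach described above:
                   regular expressions are matched against the header fields and body, each matching rule
                   adds to a score, and the message is treated as spam when the total crosses a threshold.
                   The rules and threshold are illustrative only, not those of any particular product.

import re

# (pattern, field, score) - illustrative scoring rules only
RULES = [
    (re.compile(r"free money", re.I), "body", 2.5),
    (re.compile(r"^Subject:.*urgent", re.I | re.M), "headers", 1.5),
    (re.compile(r"click here", re.I), "body", 1.0),
]
THRESHOLD = 3.0  # messages scoring at or above this are treated as spam

def score_message(headers: str, body: str) -> float:
    """Sum the scores of all rules whose pattern matches the selected field."""
    total = 0.0
    for pattern, field, score in RULES:
        text = headers if field == "headers" else body
        if pattern.search(text):
            total += score
    return total

headers = "From: promo@example.net\nSubject: URGENT offer inside"
body = "Click here for free money today!"
score = score_message(headers, body)
print(f"score={score:.1f} spam={score >= THRESHOLD}")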
   2.3.4.2. Web Services
           A Web Service (also Webservice) is defined by the W3C as "a software system designed
           to support interoperable machine-to-machine interaction over a network. It has an
           interface described in a machine-processable format (specifically WSDL). Other systems
           interact with the Web service in a manner prescribed by its description using SOAP-
           messages, typically conveyed using HTTP with an XML serialization in conjunction with
           other Web-related standards."
   2.3.4.3. Calendaring
           iCalendar is a computer file format which allows internet users to send meeting requests
           and tasks to other internet users, via email, or sharing files with an .ics extension.
           Recipients of the iCalendar data file (with supporting software, such as an email client or
           calendar application) can respond to the sender easily or counter propose another
           meeting date/time. iCalendar is used and supported by a large number of products.
           iCalendar data is usually sent with traditional email.
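
            To make the format concrete, the sketch below writes a minimal iCalendar (.ics) event of
            the kind described above; a recipient’s calendar client could import it and respond. The
            dates, organizer, and summary are placeholder values.

# Write a minimal iCalendar event to meeting.ics (placeholder values throughout).
ICS_EVENT = "\r\n".join([
    "BEGIN:VCALENDAR",
    "VERSION:2.0",
    "PRODID:-//Example Campus//Data Center Ops//EN",
    "BEGIN:VEVENT",
    "UID:20091012-maintenance@campus.example.edu",
    "DTSTAMP:20091012T170000Z",
    "DTSTART:20091017T130000Z",
    "DTEND:20091017T150000Z",
    "SUMMARY:Data center UPS maintenance",
    "ORGANIZER:mailto:operations@campus.example.edu",
    "END:VEVENT",
    "END:VCALENDAR",
    "",
])

with open("meeting.ics", "w", newline="") as f:
    f.write(ICS_EVENT)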
   2.3.4.4. DNS
            Domain Name Services enable the use of canonical names (rather than IP addresses) in
            addressing network resources. To provide a highly available network, DNS servers should
            be placed in an Enabling Services Network Infrastructure Model. DNS services must also
            be highly available.
   2.3.4.5. DHCP
           Dynamic Host Configuration Protocol is used to manage the allocation of IP addresses. To
            provide a highly available network, DHCP servers should be placed in an Enabling
            Services Network Infrastructure Model. DHCP services must also be highly available.
   2.3.4.6. Syslog
           syslog is a standard for forwarding log messages in an IP network. The term "syslog" is
           often used for both the actual syslog protocol, as well as the application or library sending
           syslog messages. Syslog is essential to capturing system messages generated from
           network devices. Devices provide a wide range of messages, including changes to device
           configurations, device errors, and hardware component failures.
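
            As a small example of the forwarding described above, the sketch below uses Python’s
            standard logging module to send events to a central syslog collector over UDP port 514.
            The collector host name is a placeholder.

import logging
import logging.handlers

# Forward application log messages to a central syslog collector (placeholder host).
handler = logging.handlers.SysLogHandler(address=("syslog.campus.example.edu", 514))
handler.setFormatter(logging.Formatter("datacenter-app: %(levelname)s %(message)s"))

logger = logging.getLogger("datacenter-app")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.warning("Configuration change detected on core switch")
logger.error("Hardware component failure reported by blade chassis 3")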
   2.3.4.7. Desktop Virtualization
           Desktop virtualization (or Virtual Desktop Infrastructure) is a server-centric computing
           model that borrows from the traditional thin-client model but is designed to give system
           administrators and end-users the best of both worlds: the ability to host and centrally
           manage desktop virtual machines in the data center while giving end users a full PC




               desktop experience. The user experience is intended to be identical to that of a standard
               PC, but from a thin client device or similar, from the same office or remotely.
        2.3.4.8. Application Virtualization
               Application virtualization is an umbrella term that describes software technologies that
               improve portability, manageability and compatibility of applications by encapsulating
               them from the underlying operating system on which they are executed. A fully
               virtualized application is not installed in the traditional sense, although it is still executed
               as if it is. The application is fooled at runtime into believing that it is directly interfacing
               with the original operating system and all the resources managed by it, when in reality it
               is not. Application virtualization differs from operating system virtualization in that in the
               latter case, the whole operating system is virtualized rather than only specific
               applications.
    2.3.5.       Third Party Applications
        2.3.5.1. LMS
               A learning management system (LMS) is software for delivering, tracking and managing
               training/education. LMSs range from systems for managing training and educational
               records to software for distributing courses over the Internet and offering features for
                online collaboration.
         2.3.5.2. CMS
                The mission of the Common Management Systems (CMS) is to provide efficient,
                effective and high quality service to the students, faculty and staff of the 23-
                campus California State University System (CSU) and the Office of the Chancellor.

                Utilizing a best practices approach, CMS supports human resources, financials
                and student services administration functions with a common suite of Oracle
                Enterprise applications in a shared data center, with a supported data
                warehouse infrastructure.

         2.3.5.3. Help Desk/Ticketing
                Help desks are now fundamental and key aspects of good business service and operation.
                Through the help desk, problems are reported, managed and then appropriately resolved in
               a timely manner. Help desks can provide users the ability to ask questions and receive
               effective answers. Moreover, help desks can help the organization run smoothly and
               improve the quality of the support it offers to the users.
                      Traditional - Help desks have been traditionally used as call centers. Telephone
                         support was the main medium used until the advent of the Internet.
                      Internet - The advent of the Internet has provided the opportunity for potential
                         and existing customers to communicate with suppliers directly and to review and
                         buy their services online. Customers can email their problems without being put
                         on hold over the phone. One of the largest advantages Internet help desks have
                         over call centers is that they are available 24/7.
2.4. Delivery Systems
    2.4.1.       Facilities
        2.4.1.1. Tiering Standards


        The industry standard for measuring data center availability is the tiering metric
        developed by The Uptime Institute, which addresses architectural, security, electrical,
        mechanical, and telecommunications design considerations. The higher the tier, the
        higher the availability. Tier descriptions include information like raised floor heights,
        watts per square foot, and points of failure. “Need,” or “N,” indicates the level of



redundant components for each tier with N representing only the necessary system
need. Construction cost per square foot is also provided and varies greatly from tier to
tier with Tier 3 costs double that of Tier 1.

Tier 1 – Basic: 99.671% Availability

       Susceptible to disruptions from both planned and unplanned activity
       Single path for power and cooling distribution, no redundant components (N)
       May or may not have a raised floor, UPS, or generator
       Typically takes 3 months to implement
       Annual downtime of 28.8 hours
       Must be shut down completely to perform preventative maintenance

Tier 2 – Redundant Components: 99.741% Availability

       Less susceptible to disruption from both planned and unplanned activity
       Single path for power and cooling distribution, includes redundant components
        (N+1)
       Includes raised floor, UPS, and generator
       Typically takes 3 to 6 months to implement
       Annual downtime of 22.0 hours
       Maintenance of power path and other parts of the infrastructure require a
        processing shutdown

Tier 3 – Concurrently Maintainable: 99.982% Availability

       Enables planned activity without disrupting computer hardware operation, but
        unplanned events will still cause disruption
       Multiple power and cooling distribution paths but with only one path active,
        includes redundant components (N+1)
       Includes raised floor and sufficient capacity and distribution to carry load on one
        path while performing maintenance on the other
       Typically takes 15 to 20 months to implement
       Annual downtime of 1.6 hours

Tier 4 – Fault Tolerant: 99.995% Availability

       Planned activity does not disrupt critical load and data center can sustain at
        least one worst-case unplanned event with no critical load impact
       Multiple active power and cooling distribution paths, includes redundant
        components (2 (N+1), i.e. 2 UPS each with N+1 redundancy)
       Typically takes 15 to 20 months to implement
       Annual downtime of 0.4 hours



Trying to achieve availability above Tier 4 introduces a level of complexity that some
believe yields diminishing returns. EYP, which manages HP’s data center design
practice, reports that its empirical data shows no additional uptime gained from the
considerable cost of trying to reduce downtime below 0.4 hours, because of the human
element introduced in managing the complexity of so many redundant systems.
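
The annual downtime figures quoted for each tier above follow from the availability
percentages over an 8,760-hour year, as the short calculation below illustrates. Small
differences from the published figures reflect rounding in the source tier definitions.

# Convert tier availability percentages into approximate annual downtime hours.
HOURS_PER_YEAR = 8760

TIERS = {
    "Tier 1": 99.671,
    "Tier 2": 99.741,
    "Tier 3": 99.982,
    "Tier 4": 99.995,
}

for tier, availability in TIERS.items():
    downtime_hours = (1 - availability / 100) * HOURS_PER_YEAR
    print(f"{tier}: {availability}% available -> {downtime_hours:.1f} hours of downtime per year")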


2.4.1.2. Spatial Guidelines and Capacities

1. Locale: A primary consideration in data center design is understanding the
   importance of location. In addition to the obvious criteria of adjacency to business
   operations and technical support resources, cost factors such as utilities, networking
   and real estate are prime considerations. Exposure to natural disaster is also a
   key component. Power is generally the largest cost factor over time, which has
   prompted organizations to increasingly consider remote data centers in areas with
   low utility costs. Remote operations and network latency then become essential
   considerations.
2. Zoned space: Data centers should be block designed with specific tiering levels in
   mind so that sections of the space can be operated at high density with supporting
   infrastructure while other sections can be supported with minimal infrastructure.
   Each zone should have capacity for future growth within that tier.
3. Raised floor: A typical design approach for data centers is to use raised floor for air
   flow management and cable conveyance. Consideration must be given for air flow
   volume, which dictates the height of the floor, as well as weight loading. Raised
   floor structures must also be grounded.
4. Rack rows and density: Equipment racks and cabinets should be arranged in rows
   that provide for logical grouping of equipment types, ease of distribution for power
   and network, and provide for air flow management, either through perforated floor
   tiles or direct ducting.


2.4.1.3. Electrical Systems

Generators: Can be natural gas or petroleum/diesel fuel type. For higher tier designs,
generators are deployed in an N+1 configuration to account for load.

UPS: Can be rack-based or large room-based systems. Must be configured for load and
runtime considerations. Asset management systems should track the lifecycle of
batteries for proactive service and replacement.

PDUs: Power distribution units provide receptacles from circuits on the data center
power system, usually from the UPS. Intelligent PDUs are able to provide management
systems information about power consumption at the rack or even device level. Some
PDUs are able to be remotely managed to allow for power cycling of equipment at the




                                   18
receptacle level, which aids in remote operation of servers where a power cycle is
required to reboot a hung system.

Dual A-B cording: In-rack PDUs should make multiple circuits available so that
redundant power supplies (designated A and B) for devices can be corded to separate
circuits. Some A-B cording strategies call for both circuits to be on UPS while others call
for one power supply to be on house power while the other is on UPS. Each is a
function of resilience and availability.


2.4.1.4. HVAC Systems

CRAC units: Computer Room Air Conditioners are specifically designed to provide
cooling with humidification for data centers. They are typically tied to power systems
that can maintain cooling independent of the power distribution to the rest of the
building.

Hot/Cold Aisle Containment: Arranging equipment racks in rows that allow for the
supply of cold air to the front of racks and exhaust of hot air at the rear. Adjacent rows
would have opposite airflow to provide only one set of supply or exhaust ducts. Some
very dense rack configurations may require the use of chimney exhaust above the racks
to channel hot air away from the cold air supply. The key design component is to not
allow hot air exhaust to mix with cold air supply and diminish its overall effectiveness.
Containment is achieved through enclosed cabinet panels, end of row wall or panel
structures, or plastic sheet curtains.

Economizers: Direct ambient outside air in cooler climates into the data center to
supplement cooling.


2.4.1.5. Fire Protection & Life Safety

Fire suppression systems are essential for providing life safety protection for occupants
of a data center and for protecting the equipment. Design of systems should give
priority to human life over equipment, which factors into the decision to use certain gas
suppression systems.

   Pre-action: Describes a water sprinkler design that allows for the water pipes
    serving sprinkler heads within a data center to be free from water until such point
    that a triggering mechanism allows water to enter the pipes. This is meant to
    mitigate damage from incidental leakage or spraying water from ruptured water
    lines normally under pressure.

   VESDA: Very Early Smoke Detection Apparatus allows for pre-action or gas
    suppression systems to have a human interrupt and intervention at initial



    thresholds before ultimately triggering on higher thresholds. The system operates
    by using lasers to evaluate continuous air samples for very low levels of smoke.

   Halon: Oxygen displacing gas suppression system that is generally no longer used in
    current data center design due to risk to personnel in the occupied space.

   FM-200: Gas suppression system that quickly rushes the FM-200 gas to the
    confined data center space that must be kept air tight for effectiveness. It is a
    popular replacement for halon gas since it can be implemented without having to
    replace deployment infrastructure. A purge system is usually required to exhaust
    and contain the gas after deployment so it does not enter the atmosphere.

   Novec1230: Gas suppression system that is stored as a liquid at room temperature
    and allows for more efficient use of space over inert gas systems. Also a popular
    halon gas alternative.

   Inergen: A gas suppression system that does not require a purge system or air tight
    facility since it is non-toxic and can enter the atmosphere without environmental
    concerns. Requires larger footprint for more tanks and is a more expensive gas to
    use and replace.


2.4.1.6. Access Control

        A good physical security plan includes access controls, which allow you to
determine who has access to your Data Center and when. Metal keys can provide a
high level of security, but they do not provide an audit trail, and don't allow you to limit
access based on times and/or days. Intrusion systems (aka, alarm systems) can
sometimes allow for this kind of control in a facility where it is not possible to migrate to
an electronic lock system.
        Most new Data Centers constructed today include some sort of electronic locking
system. These range from simple, offline keypad locks to highly complex
systems that include access portals (aka man traps) and anti-tailgating systems.
Electronic lock systems allow the flexibility to issue and revoke access instantaneously,
or nearly so, depending on the product. Online systems (sometimes referred to as
hardwired systems) consist of an access control panel that connects to a set of doors
and readers of various types using wiring run through the building. Offline systems
consist of locks that have a reader integrated into the lock, a battery and all of the
electronics to make access determinations. Updates to these sorts of locks are usually
done through some sort of hand-held device that is plugged into the lock.
        There are two fairly common reader technologies in use today. One is magnetic
stripe based. These systems usually read data encoded on tracks two or three. While
the technology is mature and stable, it has a few weaknesses. The data on the cards can
be easily duplicated with equipment easily purchased on the Internet. The magnetic



stripe can wear out or become erased if it gets close to a magnetic field. One option for
improving the security of magnetic swipe installations is the use of a dual-validation
reader, where, after swiping the card, the user must enter a PIN code before the lock
    will open.
             The other common access token in use today is the proximity card, also called a
    RFID card. These cards have an integrated circuit (IC), capacitor and wire coil inside of
    them. When the coil is placed near a reader, the energy field emitted by the reader
    produces a charge in the capacitor, which powers the IC. Once powered, the IC
transmits its information to the reader, and the reader or the control panel it
communicates with determines whether access should be granted.
             Beyond access control, the other big advantage to electronic locking systems is
    their ability to provide an audit trail. The system will keep track of all credentials
    presented to the reader, and the resulting outcome of that presentation - access was
    either granted or denied. Complex access control systems will even allow you to do
    things such as implement a two-man rule, where two people must present authorized
    credentials before a lock will open, or anti-passback.
         Anti-passback systems require a user to present credentials to both enter and exit
    a given space. Obviously, locking someone into a room would be a life safety issue, so
    usually, some sort of alarm is sounded on egress if proper credentials were not
    presented. Anti-passback also allows you to track where individuals are at any given
    time, because the system knows that they presented credentials to exit a space.


    2.4.1.7. Commissioning

    Commissioning is essential to have validation of the design, verification of load
    capacities, and testing of failover mechanisms. A commissioning agent can identify
     design flaws, single points of failure, and inconsistencies in the build-out from the
    original design. Normally a commissioning agent would be independent from the design
    or build team.

    A commissioning agent will inspect for such things as proper wiring, pipe sizes, weight
    loads, chiller and pump capacities, electrical distribution panels and switch gear. They
    will test battery run times, UPS and generator step loads, and air conditioning. They will
    simulate load with resistive coils to generate heat and UPS draw and go through a play-
    book of what-if scenarios to test all aspects of redundant systems.


2.4.2.       Load Balancing/High Availability
2.4.3.       Connectivity
    2.4.3.1. Network

    Network components in the data center—such as Layer 3 backbone switches, WAN
    edge routers, perimeter firewalls, and wireless access points—are described in the
    ITRP2 Network Baseline Standard Architecture and Design document, developed by the


Network Technology Alliance, sister committee to the Systems Technology Alliance.
Latest versions of the standard can be located at http://nta.calstate.edu/ITRP2.shtml.

     Increasingly, boundaries are blurring between systems and networks. Virtualization is
     causing an abstraction of traditional networking components and moving them into
     software and the hypervisor layer. Virtual switches within the hypervisor now provide
     Layer 2 connectivity between virtual hosts and the physical network interfaces of the
     server (see the Ethernet L2 Virtual Switch discussion below).

Considerations beyond “common services”

The following components have elements of network enabling services but are also
     systems-oriented and may be managed by the systems or applications groups.

        1. DNS

            For privacy and security reasons, many large enterprises choose to make
            only a limited subset of their systems “visible” to external parties on the
            public Internet. This can be accomplished by creating a separate Domain
            Name System (DNS) server with entries for these systems, and locating it
            where it can be readily accessible by any external user on the Internet (e.g.,
            locating it in a DMZ LAN behind external firewalls to the public Internet).
            Other DNS servers containing records for internally accessible enterprise
            resources may be provided as “infrastructure servers” hidden behind
            additional firewalls in “trusted” zones in the data center. This division of
            responsibility permits the DNS server with records for externally visible
            enterprise systems to be exposed to the public Internet, while reducing the
            security exposure of DNS servers containing the records of internal
             enterprise systems. (A minimal split-view resolution sketch appears after this list.)

        2. E-Mail (MTA only)

            For security reasons, large enterprises may choose to distribute e-mail
            functionality across different types of e-mail servers. A message transfer
            agent (MTA) server that only forwards Simple Mail Transfer Protocol (SMTP)
            traffic (i.e., no mailboxes are contained within it) can be located where it is
            readily accessible to other enterprise e-mail servers on the Internet. For
            example, it can be located in a DMZ LAN behind external firewalls to the
            public Internet). Other e-mail servers containing user agent (UA) mailboxes
            for enterprise users may be provided as “infrastructure servers” located
            behind additional firewalls in “trusted” zones in the data center. This
            division of responsibility permits the “external” MTA server to communicate
            with any other e-mail server on the public Internet, but reduces the security
            exposure of “internal” UA e-mail servers.

        3. Voice Media Gateway



        The data center site media gateway will include analog or digital voice ports
        for access to the local PSTN, possibly including integrated services digital
        network (ISDN) ports.

        With Ethernet IP phones, the VoIP gateway is used for data center site
        phone users to gain local dial access to the PSTN. The VoIP media gateway
        converts voice calls between packetized IP voice traffic on a data center site
        network and local circuit-switched telephone service. With this
        configuration, the VoIP media gateway operates under the control of a call
        control server located at the data center site, or out in the ISP public
        network as part of an “IP Centrex” or “virtual PBX” service. However,
        network operators/carriers increasingly are providing a SIP trunking
        interface between their IP networks and the PSTN; this will permit
        enterprises to send VoIP calls across IP WANs to communicate with PSTN
        devices without the need for a voice media gateway or direct PSTN
        interface. Instead, data center site voice calls can be routed through the
        site’s WAN edge IP routers and data network access links.

    4. Ethernet L2 Virtual Switch

        In a virtual server environment, the hypervisor manages L2 connections
        from virtual hosts to the NIC(s) of the physical server.

        A hypervisor plug-in module may be available to allow the switching
        characteristics to emulate a specific type of L2 switch so that it can be
        managed apart from the hypervisor and incorporated into the enterprise
         NMS.
    5. Top-of-Rack Fabric Switches

        As a method of consolidating and aggregating connections from dense rack
        configurations in the data center, top-of-rack switching has emerged as a
            way to provide both Ethernet and Fibre Channel connectivity in one
        platform. Generally, these devices connect to end-of-row switches that,
        optimally, can manage all downstream devices as one switching fabric. The
        benefits are a modularized approach to server and storage networks,
        reduced cross connects and better cable management.
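
         To make the split-DNS arrangement in item 1 concrete, the following is a minimal
         sketch in Python of serving different views to internal and external clients. All
         host names and address ranges are illustrative assumptions; in production this
         separation is implemented with separate DNS servers or server "views," not
         application code.

             import ipaddress

             # Hypothetical campus networks treated as "internal" (assumption).
             INTERNAL_NETS = [ipaddress.ip_network("10.0.0.0/8"),
                              ipaddress.ip_network("172.16.0.0/12")]

             # External view: only the records exposed to the public Internet.
             EXTERNAL_VIEW = {"www.campus.example": "203.0.113.10"}

             # Internal view: adds records for internally accessible infrastructure servers.
             INTERNAL_VIEW = {"www.campus.example": "10.1.2.10",
                              "erp.campus.example": "10.1.5.20"}

             def resolve(name, client_ip):
                 """Return the address for a name, using the view appropriate to the client."""
                 client = ipaddress.ip_address(client_ip)
                 internal = any(client in net for net in INTERNAL_NETS)
                 view = INTERNAL_VIEW if internal else EXTERNAL_VIEW
                 return view.get(name)   # None: the name is not published in this view

             # resolve("erp.campus.example", "198.51.100.7") -> None (hidden from the Internet)
             # resolve("erp.campus.example", "10.8.0.5")     -> "10.1.5.20"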

2.4.3.1.1.   Network Virtualization



2.4.3.1.2.   Structured Cabling




           The CSU has developed a set of standards for infrastructure planning that
           should serve as a starting place for designing cabling systems and other utilities
           serving the data center. These Telecommunications Infrastructure Planning
           (TIP) standards can be referenced at the following link:
           http://www.calstate.edu/cpdc/ae/gsf/TIP_Guidelines/

            There is also an NTA working group that addresses cabling
            infrastructure, known as the Infrastructure Physical Plant Working Group
           (IPPWG). Information about the working group can be found at the following
           link: http://nta.calstate.edu/NTA_working_groups/IPP/



           The approach to structured cabling in a data center differs from other aspects of
           building wiring due to the following issues:

              Managing higher densities, particularly fiber optics
              Cable management, especially with regard to moves, adds and changes
              Heat control, for which cable management plays a role

           The following are components of structured cabling design in the data center:

           1. Cable types: Cabling may be copper (shielded or unshielded) or fiber optic
              (single mode or multi mode).
           2. Cabling pathways: usually a combination of raised floor access and
              overhead cable tray. Cables under raised floor should be in channels that
              protect them from adjacent systems, such as power and fire suppression.
            3. Fiber ducts: fiber optic cabling has specific stress and bend radius
               requirements to protect the transmission of light; duct systems designed
               for fiber take into account the proper routing and storage of strands,
               pigtails and patch cords among the distribution frames and splice cabinets.
           4. Fiber connector types: usually MT-RJ, LC, SC or ST. The use of modular fiber
              “cassettes” and trunk cables allows for higher densities and the benefit of
              factory terminations rather than terminations in the field, which can be
              time-consuming and subject to higher dB loss.
            5. Cable management: provisions for handling moves, adds and changes and for
               maintaining airflow, as noted above, including horizontal and vertical cable
               managers sized for growth.




2.4.4   Operations

        Information Technology (IT) operations refers to the day-to-day management of an
        IT infrastructure. An IT operation incorporates all the work required to keep a
        system running smoothly. This process typically includes the introduction and
        control of small changes to the system, such as mailbox moves and hardware
        upgrades, but it does not affect the overall system design. Operational support
        includes systems monitoring, network monitoring, problem determination, problem
        reporting, problem escalation, operating system upgrades, change control, version
  management, backup and recovery, capacity planning, performance tuning and
  system programming.



   The mission of data center operations is to provide the highest possible quality of
   central computing support for the campus community and to maximize the availability
   of central computing systems.


  Data center operations services include:

     Help Desk Support
     Network Management
     Data Center Management
     Server Management
     Application Management
     Database Administration
     Web Infrastructure Management
     Systems Integration
     Business Continuity Planning
     Disaster Recovery Planning
     Email Administration


2.4.4.1 Staffing

         Staffing is the process of acquiring, deploying, and retaining a workforce of
         sufficient quantity and quality to maximize the organizational effectiveness of
         the data center.

2.4.4.2 Training

        Training is not simply a support function, but a strategic element in achieving
        an organization’s objectives.



        IT Training Management Processes and Sample Practices



        Management Processes                      Sample Practices

        Align IT training with business goals.    Enlist executive-level champions.

                                                  Involve critical stakeholders.




       Identify and assess IT training needs.   Document competencies/skills
                                                required for each job description.

                                                Perform a gap analysis to determine
                                                needed training.

       Allocate IT training resources.          Use an investment process to select
                                                and manage training projects.

                                                Provide resources for management
                                                training, e.g., leadership and project
                                                management.

       Design and deliver IT training.          Give trainees choice among different
                                                training delivery methods.

                                                Build courses using reusable
                                                components.

       Evaluate/demonstrate the value of IT     Collect information on how job
       training.                                performance is affected by training.

                                                Assess evaluation results in terms of
                                                 business impact.




2.4.4.3 Monitoring

        Monitoring is a critical element of data center asset management and covers a
        wide spectrum of issues such as system availability, system performance
        levels, component serviceability, and timely detection of system operational or
        security problems, such as disk capacity exceeding defined thresholds or
        system binary files being modified.
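
        As a concrete illustration of threshold-based monitoring, the following minimal
        Python sketch checks file system utilization against a defined threshold. The
        mount points and the 85% threshold are assumptions for illustration; in practice
        such checks run inside an enterprise monitoring platform on a schedule, with
        alerts feeding defined escalation procedures.

            import shutil

            MOUNT_POINTS = ["/", "/var", "/data"]   # hypothetical mount points
            THRESHOLD_PCT = 85                      # illustrative utilization threshold

            def check_disk_capacity(mounts=MOUNT_POINTS, threshold=THRESHOLD_PCT):
                """Return (mount, percent_used) for file systems above the threshold."""
                alerts = []
                for mount in mounts:
                    usage = shutil.disk_usage(mount)        # total, used, free in bytes
                    pct_used = usage.used / usage.total * 100
                    if pct_used >= threshold:
                        alerts.append((mount, round(pct_used, 1)))
                return alerts

            # A monitoring agent would run this periodically and open an alert or
            # incident for every entry returned.
            for mount, pct in check_disk_capacity():
                print(f"ALERT: {mount} is {pct}% full (threshold {THRESHOLD_PCT}%)")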

 2.4.4.4 Automation

        Automation of routine data center tasks reduces staffing headcount by using
        tools such as automated tape backup systems that load magnetic media from
        tape libraries and send backup status and exception reports to data center
        staff. The potential for automating routine tasks is limitless. Automation
        increases reliability and frees staff from routine tasks so that continuous
        improvement of operations can occur.
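
        The following minimal sketch shows the kind of routine task that lends itself to
        automation: summarizing overnight backup job results into an exception report
        for data center staff. The job names and statuses are hypothetical; a real
        implementation would read them from the backup platform's logs or reporting
        interface and deliver the report by e-mail or to the monitoring system.

            # Hypothetical job results, as might be collected from backup logs.
            job_results = [
                {"job": "erp-db-full",   "status": "OK"},
                {"job": "web-farm-incr", "status": "OK"},
                {"job": "mail-store",    "status": "FAILED"},
            ]

            def exception_report(results):
                """Return a short report listing only the jobs that need attention."""
                failed = [r for r in results if r["status"] != "OK"]
                lines = [f"{len(failed)} of {len(results)} backup jobs require attention."]
                lines += [f"  {r['job']}: {r['status']}" for r in failed]
                return "\n".join(lines)

            # A scheduler (e.g., cron) would run this nightly and notify operations
            # staff only when the report lists failures.
            print(exception_report(job_results))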

2.4.4.5 Console Management


                     To the extent possible, console management should integrate the
                     management of heterogeneous systems using orchestration or a common
                     management console.
             2.4.4.6 Remote Operations

                     Lights-out operations are facilitated by effective remote operations tools. This
                     leverages the economies of scale enjoyed by managing multiple remote
                     production data centers from a single location that may be dynamically
                     assigned in a manner such as “follow the sun.”

    2.4.5.       Accounting
        2.4.5.1. Auditing

                  The CSU publishes findings and campus responses to information security
                  audits. Reports can be found at the following site:
                  http://www.calstate.edu/audit/audit_reports/information_security/index.shtml


2.5. Disaster Recovery
    2.5.1.      Relationship to overall campus strategy for Business Continuity
        Campuses should already have a business continuity plan, which typically includes a
        business impact analysis (BIA) to monetize the effects of interrupted processes and
        system outages. Deducing a maximum allowable downtime through this exercise will
        inform service and operational level agreements, as well as establish recovery time and
        point objectives, discussed in section 2.7.5.1 (Backup and Recovery).
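
        Once the BIA yields maximum allowable downtime and data loss figures, they can
        be compared directly with what current operations deliver. The following is a
        minimal sketch of that comparison using the recovery time and point objective
        terms defined in section 2.7.5.1; all figures are illustrative assumptions.

            # Illustrative objectives for one business process, in hours.
            rto_hours = 4          # maximum allowable downtime from the BIA
            rpo_hours = 1          # maximum tolerable data loss from the BIA

            # Illustrative figures for current operations.
            backup_interval_hours = 24    # nightly backups
            estimated_restore_hours = 6   # time to rebuild and restore from backup

            gaps = []
            if estimated_restore_hours > rto_hours:
                gaps.append("restore time exceeds the RTO; faster recovery such as a "
                            "standby system or replication is needed")
            if backup_interval_hours > rpo_hours:
                gaps.append("backup interval exceeds the RPO; more frequent backups or "
                            "replication are needed")

            print("Meets objectives" if not gaps else "; ".join(gaps))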

    2.5.2.        Relationship to CSU Remote Backup – DR initiative

        ITAC has sponsored an initiative to explore business continuity and disaster recovery
        partnerships between CSU campuses. [Charter document?] Several campuses have
        teamed to develop documents and procedures; their work product is posted at
        http://drp.sharepointsite.net/itacdrp/default.aspx.

        Examples of operational considerations, memorandums of understanding, and network
        diagrams are given in Section 3.5.4.2.

    2.5.3.       Infrastructure considerations
        2.5.3.1. Site availability

        Disaster recovery planning should account for short-, medium-, and long-term disaster
        and disruption scenarios, including impact and accessibility to the data center.
        Consideration should be given to location, size, capacity, and utilities necessary to
        recover the level of service required by the critical business functions. Attention should
        be given to structural, mechanical, electrical, plumbing and control systems and should
        also include planning for workspace, telephones, workstations, network connectivity,
        etc.


Alternate sites could be geographically diverse locations on the same campus, locations
on other campuses (perhaps as part of a reciprocal agreement between campuses to
recover each other’s basic operations), or commercially available co-location facilities
described in Section 2.5.3.2.

When determining an alternate site, management should consider scalability, in the
event a long-term disaster becomes a reality. The plan should include logistical
procedures for accessing backup data as well as moving personnel to the recovery
location.


2.5.3.2. Co-location

One method of accomplishing business continuity objectives through redundancy with
geographic diversity is to use a co-location scenario, either through a reciprocal
agreement with another campus or a commercial provider. The following are typical
types of co-location arrangements:
     Real estate investment trusts (REITs): REITs offer leased shared data center
        facilities in a business model that leverages tax laws to offer savings to
        customers.
     Network-neutral co-location: Network-neutral co-location providers offer
        leased rack space, power, and cooling with the added service of peer-to-peer
        network cross-connection.
     Co-location within hosting center: Hosting centers may offer co-location as a
        basic service with the ability to upgrade to various levels of managed hosting.
     Unmanaged hosted services: Hosting centers may offer a form of semi-co-
        location wherein the hosting provider owns and maintains the server
        hardware for the customer, but doesn't manage the operating system or
        applications/services that run on that hardware.

Principles for co-location selection criteria
    1. Business process includes or provides an e-commerce solution
    2. Business process does not contain applications and services that were
         developed and are maintained in-house
    3. Business process does not predominantly include internal infrastructure or
         support services that are not web-based
    4. Business process contain predominantly commodity and horizontal applications
         and services (such as email and database systems)
    5. Business process requires geographically distant locations for disaster recovery
         or business continuity
    6. Co-location facility meets level of reliability objective (Tier I, II, III, or IV) at less
         cost than retrofitting or building new campus data centers
    7. Access to particular IT staff skills and bandwidth of the current IT staffers
    8. Level of SLA matches the campus requirements, including those for disaster
         recovery
    9. Co-location provider can accommodate regulatory auditing and reporting for
         the business process
           10. Current data center facilities have run out of space, power, or cooling

       [concepts from Burton Group article, “Host, Co-Lo, or Do-It-Yourself?”]

    2.5.4.       Operational considerations
    2.5.4.1. Recovery Time Objectives and Recovery Point Objectives are discussed in
               section 2.7.5.1 (Backup and Recovery).
2.6. Total Enterprise Virtualization

2.7. Management Disciplines
    2.7.1.       Service Management

            IT service management is the integrated set of activities required to ensure the cost
           and quality of IT services valued by the customer. It is the management of
           customer-valued IT capabilities through effective processes, organization,
           information and technology, including:
                 Aligning IT with business objectives
                 Managing IT services and solutions throughout their lifecycles
                 Service management processes like those described in ITIL, ISO/IEC 20000,
                    or IBM’s Process Reference Model for IT.

       2.7.1.1. Service Catalog

           An IT Service Catalog defines the services that an IT organization is delivering to the
           business users and serves to align the business requirements with IT capabilities,
           communicate IT services to the business community, plan demand for these
           services, and orchestrate the delivery of these services across the functionally
           distributed (and, oftentimes, multi-sourced) IT organization. An effective Service
           Catalog also segments the customers who may access the catalog - whether end
           users or business unit executives - and provides different content based on function,
           roles, needs, locations, and entitlements.

           The most important requirement for any Service Catalog is that it should be
           business-oriented, with services articulated in business terms. In following this
           principle, the Service Catalog can provide a vehicle for communicating and
           marketing IT services to both business decision-makers and end users.

           The ITIL framework distinguishes between these groups as "customers" (the
           business executives who fund the IT budget) and "users" (the consumers of day-to-
           day IT service deliverables). The satisfaction of both customers and users is equally
           important, yet it's important to recognize that these are two very distinct and
           different audiences.

           To be successful, the IT Service Catalog must be focused on addressing the unique
           requirements for each of these business segments. Depending on the audience,
           they will require a very different view into the Service Catalog. IT organizations
           should consider a two-pronged approach to creating an actionable Service Catalog:




          The executive-level, service portfolio view of the Service Catalog used by
           business unit executives to understand how IT's portfolio of service
            offerings map to business unit needs. This is referred to in this document as the
           "service portfolio."
          The employee-centric, request-oriented view of the Service Catalog that is
           used by end users (and even other IT staff members) to browse for the
           services required and submit requests for IT services. For the purposes of
            this document, this view is referred to as a "service request catalog."

   As described above, a Service Request Catalog should look like consumer catalogs,
   with easy-to-understand descriptions and an intuitive store-front interface for
    browsing available service offerings. This customer-focused approach helps ensure
    that the Service Request Catalog is adopted by end users. The Service Portfolio
   provides the basis for a balanced, business-level discussion on service quality and
   cost trade-offs with business decision-makers.

   To that end, service catalogs should extend beyond a mere list of services offered
   and can be used to facilitate:
        IT best practices, captured as Service Catalog templates
        Operational Level Agreements, Service Level Agreements (aligning internal
           & external customer expectations)
        Hierarchical and modular service models
        Catalogs of supporting and underlying infrastructures and dependencies
           (including direct links into the CMDB)
        Demand management and capacity planning
        Service request, configuration, validation, and approval processes
        Workflow-driven provisioning of services
        Key performance indicator (KPI)-based reporting and compliance auditing

2.7.1.2. Service Level Agreements

    A quality service level agreement is of fundamental importance for any significant
    service or product delivery. It defines the formal relationship between the supplier
    and the recipient and is not an area for short-cutting; too often it is not given
    sufficient attention, which can lead to serious problems with the relationship, with
    the service itself, and potentially with the business.

    A service level agreement should address all key issues, and typically will define
    and/or cover:
         The services to be delivered
         Performance, Tracking and Reporting Mechanisms
         Problem Management Procedures
         Dispute Resolution Procedures
         The Recipient's Duties and Responsibilities
         Security
         Legislative Compliance
         Intellectual Property and Confidential Information Issues
                Agreement Termination


2.7.2.       Project Management

    An organization’s ability to effectively manage projects allows it to adapt to changes and
    succeed in activities such as system conversions, infrastructure upgrades and system
    maintenance. A project management system should employ well-defined and proven
    techniques for managing projects at all stages, including:

            Initiation
            Planning
            Execution
            Control
            Close-out

    Project monitoring will include:

            Target completion dates – realistically set for each task or phase to improve
             project control.
            Project status updates – measured against original targets to assess time and
             cost overruns.

    Stakeholders and IT staff should collaborate on defining project requirements, budget,
    resources, critical success factors, and risk assessment, as well as a transition plan from
    the implementation team to the operational team.


2.7.3.       Change Management

    Change Management addresses routine maintenance and periodic modification of
    hardware, software and related documentation. It is a core component of a functional
    ITIL process as well. Functions associated with change management are:

             1. Major modifications: significant functional changes to an existing system, or
                converting to or implementing a new system; usually involves detailed file
                mapping, rigorous testing, and training.
             2. Routine modifications: changes to applications or operating systems to
                improve performance, correct problems or enhance security; usually not of
                the magnitude of major modifications and can be performed in the normal
                course of business.
             3. Emergency modifications: periodically needed to correct software problems
                or restore operations quickly. Change procedures should be similar to
                routine modifications but include abbreviated change request, evaluation
                and approval procedures to allow for expedited action. Controls should be
                designed so that management completes detailed evaluation and
                documentation as soon as possible after implementation.
             4. Patch management: similar to routine modifications, but relating to
                externally developed software.
             5. Library controls: provide ways to manage the movement of programs and
                files between collections of information, typically segregated by the type of
                stored information, such as for development, test and production.
             6. Utility controls: restrict the use of programs used for file maintenance,
                debugging, and management of storage and operating systems.
             7. Documentation maintenance: identifies document authoring, approving and
                formatting requirements and establishes primary document custodians.
                Effective documentation allows administrators to maintain and update
                systems efficiently and to identify and correct programming defects, and
                also provides end users access to operations manuals.
             8. Communication plan: change standards should include communication
                procedures that ensure management notifies affected parties of changes.
                An oversight or change control committee can help clarify requirements and
                make departments or divisions aware of pending changes.

    [concepts from FFIEC Development and Acquisition handbook]
2.7.4.       Configuration Management

    Configuration Management is the process of creating and maintaining an up-to-date
    record of all components of the infrastructure.

    1. Functions associated with Configuration Management are:

            Planning
            Identification
            Control
            Status Accounting
            Verification and Audit

    2. Configuration Management Database (CMDB) - A database that contains details
       about the attributes and history of each Configuration Item and details of the
       important relationships between CIs. The information held may be in a variety of
       formats (textual, diagrammatic, photographic, etc.); effectively it is a data map of
       the physical reality of the IT infrastructure. A minimal sketch follows this list.

    3. Configuration Item - Any component of an IT Infrastructure which is (or is to be)
       under the control of Configuration Management.

    4. The lowest level CI is normally the smallest unit that will be changed independently
       of other components. CIs may vary widely in complexity, size and type, from an
       entire service (including all its hardware, software, documentation, etc.) to a single
       program module or a minor hardware component.
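
    The CI and CMDB concepts above can be illustrated with a minimal sketch; the
    attribute and relationship names are illustrative only, and a production CMDB
    models far richer data and change history.

        from dataclasses import dataclass, field

        @dataclass
        class ConfigurationItem:
            """A Configuration Item: any component under Configuration Management control."""
            ci_id: str
            ci_type: str                    # e.g., "server", "application", "service"
            attributes: dict = field(default_factory=dict)
            relationships: list = field(default_factory=list)  # (relation, other ci_id)

        # A tiny CMDB: a dictionary of CIs keyed by identifier.
        cmdb = {}
        def add_ci(ci):
            cmdb[ci.ci_id] = ci

        add_ci(ConfigurationItem("SRV001", "server",
                                 {"os": "Linux", "location": "Rack A3"}))
        add_ci(ConfigurationItem("APP010", "application", {"name": "Student Portal"},
                                 relationships=[("runs_on", "SRV001")]))

        # Verification and audit step: every relationship must point at a known CI.
        for ci in cmdb.values():
            for relation, target in ci.relationships:
                assert target in cmdb, f"{ci.ci_id} {relation} unknown CI {target}"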


2.7.5.       Data Management


2.7.5.1. Backup and Recovery


Concepts

1. Recovery Time Objective, or RTO, is the duration of time within which a set of data, a
    server, or a business process must be restored. For example, a highly visible
    server such as a campus' main web server may need to be up and running again in a
    matter of seconds, because the business impact of that service being down is high.
    Conversely, a server with low visibility, such as a server used in software QA, may
    have an RTO of a few hours.

2. Recovery Point Objective, or RPO, is the acceptable amount of data loss a business
   can tolerate, measured in time. In other words, this is the point in time before a
   data loss event occurred, at which data may be successfully recovered. For less
   critical systems, it may be acceptable to recover to the most recent backup taken at
    the end of the business day, whereas highly critical systems may have an RPO of an
   hour or only a few minutes. RPO and RTO go hand-in-hand in developing your data
   protection plan.

3. Deduplication:

            a. Source deduplication - Source deduplication means that the
               deduplication work is done up-front by the client being backed up.

            b. Target deduplication - Target deduplication is where the deduplication
               processing is done by the backup appliance and/or server. There tend
               to be two forms of target deduplication: in-line and post-process.

                i.   In-line deduplication devices decide whether or not they have seen
                      the data before writing it out to disk (a minimal sketch follows this list).

                ii. Post-process deduplication devices write all of the data to disk, and
                    then at some later point, analyze that data to find duplicate blocks.

        4. Backup types

            a. Full backups - Full backups are a backup of a device that includes all
               data required to restore that device to the point in time at which the
               backup was performed.

             b. Incremental backups - Incremental backups back up the data that has
                changed since a previous backup. There are no consistent industry
                standards when comparing one vendor's style of incremental to another;
                in fact, some vendors offer multiple styles of incrementals that a backup
                administrator may choose from.

               i.   A cumulative incremental backup is a style of incremental backup
                    where the data set contains all data changed since the last full
                    backup.

               ii. A differential incremental backup is a style of incremental backup
                   where the data set contains all data changed since the previous
                   backup, whether it be a full or another differential incremental.

       5. Tape Media - There are many tape formats to choose from when looking at
          tape backup purchases. They range from open-standards (many vendors
          sell compatible drives) to single-vendor or legacy technologies.

           a. DLT - Digital Linear Tape, or DLT, was originally developed by Digital
              Equipment Corporation in 1984. The technology was later purchased by
              Quantum in 1994. Quantum licenses the technology to other
              manufacturers, as well as manufacturing their own drives.

            b. LTO - Linear Tape Open, or LTO, is a tape technology developed by a
               consortium of companies in order to compete with proprietary tape
               formats in use at the time.

            c. DAT/DDS - Digital Data Storage, or DDS, is a tape technology that evolved
               from Digital Audio Tape, or DAT, technology.

            d. AIT - Advanced Intelligent Tape, or AIT, is a tape technology developed
               by Sony in the late 1990s.

           e. STK/IBM - StorageTek and IBM have created several proprietary tape
              formats that are usually found in large, mainframe environments.
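
The deduplication concept in item 3 can be illustrated with a minimal in-line
deduplication sketch: data is split into fixed-size blocks, each block is fingerprinted
with a cryptographic hash, and a block is written to the store only if its fingerprint
has not been seen before. The block size and the use of SHA-256 are illustrative
assumptions; commercial products use variable-length chunking and far more
sophisticated indexes.

    import hashlib

    BLOCK_SIZE = 4096        # illustrative fixed block size
    block_store = {}         # fingerprint -> block data (unique blocks only)

    def dedup_write(data):
        """Store data as blocks, writing each unique block only once.
        Returns the list of fingerprints needed to reconstruct the data."""
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).hexdigest()
            if fingerprint not in block_store:    # the in-line "seen before?" decision
                block_store[fingerprint] = block
            recipe.append(fingerprint)
        return recipe

    def dedup_read(recipe):
        """Reassemble the original data from its block recipe."""
        return b"".join(block_store[fp] for fp in recipe)

    # Two backups with mostly identical content share their common blocks.
    first = dedup_write(b"A" * 8192 + b"unique tail 1")
    second = dedup_write(b"A" * 8192 + b"unique tail 2")
    assert dedup_read(first) == b"A" * 8192 + b"unique tail 1"
    assert len(block_store) == 3     # the repeated "A" block is stored only once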

Methods

1. Disk-to-Tape (D2T)- Disk-to-tape is what most system administrators think of when
   they think of backups, as it has been the most common backup method in the past.
   The data typically moves from the client machine through some backup server to an
   attached tape drive. Writing data to tape is typically faster than reading the data
   from the tape.

2. Disk-to-Disk (D2D) - With the dramatic drop in hard drive prices over the recent
   years, disk-to-disk methods and technologies have become more popular. The big
   advantage they have over the traditional tape method, is speed in both the writing
   and reading of data. Some options available in the disk-to-disk technology space:

           a. VTL - Virtual Tape Libraries, or VTLs, are a class of disk-to-disk backup
              devices where a disk array and software appear as a tape library to your
              backup software.

           b. Standard disk array - Many enterprise backup software packages
              available today support writing data to attached disk devices instead of
                 a tape drive. One advantage to this method is that you don't have to
                 purchase a special device in order to gain the speed benefits of disk-to-
                 disk technology.

3. Disk-to-Disk-to-Tape (D2D2T) - Disk-to-disk-to-tape is a combination of the previous
   two methods. This practice combines the best of both worlds - speed benefits from
   using disk as your backup target, and tape's value in long-term and off-site storage
   practices. Many specialized D2D appliances have some support for pushing their
   images off to tape. Backup applications that support disk targets, also tend to
   support migrating their images to tape at a later date.

4. Snapshots - A snapshot is a copy of a set of files and directories as they were at a
   particular moment in time. On a server operating system, the snapshot is usually
   taken by either the logical volume manager (LVM) or the file system driver. File
   system snapshots tend to be more space-efficient than their LVM counterpart.
   Most storage arrays come with some sort of snapshot capabilities either as base
   features, or as licenseable add-ons.

5. VM images – In a virtualized environment, backup agents may be installed on the
   virtual host and file level backups invoked in a conventional method. Backing up
   each virtual instance as a file at the hypervisor level is another consideration. A
   prime consideration in architecting backup strategies in a virtual environment is the
   use of a proxy server or intermediate staging server to handle snapshots of active
   systems. Such proxies allow for the virtual host instance to be staged for backup
   without having to quiesce or reboot the VM. Depending on the platform and the
   OS, it may also be possible to achieve file-level restores within the VM while backing
   up the entire VM as a file.

6. Replication

            a. On-site - On-site replication is useful if you are trying to protect against
               device failure. You would typically purchase identical storage arrays and
               then configure them to mirror the data between them. This does not,
               however, protect against some sort of disaster that takes out your
               entire data center.

            b.    Off-site - Off-site implies that you are replicating your data to a similar
                 device located away from your campus. Technically, off-site could mean
                 something as simple as a different building on your campus, but
                 generally this term implies some geo-diversity to the configuration.

            c.    Synchronous vs. Asynchronous - Synchronous replication guarantees
                  zero data loss by performing atomic writes. In other words, the data is
                  written to all of the arrays that are part of the replication configuration, or
                  to none of them. A write request is not considered complete until
                 acknowledged by all storage arrays. Depending on your application and
                 the distance between your local and remote arrays, synchronous
                 replication can cause performance impacts, since the application may
                wait until it has been informed by the OS that the write is complete.
                Asynchronous replication gets around this by acknowledging the write
                as soon as the local storage array has written the data. Asynchronous
                replication may increase performance, but it can contribute to data loss
                if the local array fails before the remote array has received all data
                updates.

            d. In-band vs. Out-of-band - In-band replication refers to replication
               capabilities built into the storage device. Out-of-band can be
               accomplished with an appliance, software installed on a server or "in
               the network", usually in the form of a module or licensed feature
               installed into a storage router or switch.

7. Tape Rotation and Aging Strategies

            a. Grandfather, father, son - From Wikipedia: "Grandfather-father-son
               backup refers to the most common rotation scheme for rotating backup
               media. Originally designed for tape backup, it works well for any
               hierarchical backup strategy. The basic method is to define three sets of
               backups, such as daily, weekly and monthly. The daily, or son, backups
               are rotated on a daily basis with one graduating to father status each
               week. The weekly or father backups are rotated on a weekly basis with
                one graduating to grandfather status each month." A minimal rotation-scheduling
                sketch appears after this list.

            b. Offsite vaults - Vaulting, or moving media from on-site to an off-site
               storage facility, is usually done with some sort of full backup. The media
               sent off-site can either be the original copy or a duplicate, but it is
               common to have at least one copy of the media being sent rather than
               sending your only copy. The amount of time it takes to retrieve a given
               piece of media should be taken into consideration when calculating and
               planning for your RTO.

            c. Retention policies: The CSU maintains a website with links and
               resources to help campuses comply with requirements contained in
               Executive Order 1031, the CSU Records/Information Retention and
               Disposition Schedules. The objective of the executive order is to ensure
               compliance with legal and regulatory requirements while implementing
               appropriate operational best practices. The site is located at
               http://www.calstate.edu/recordsretention.

8. Tape library: A tape library is a device which usually holds multiple tapes, multiple
   tape drives and has a robot to move tapes between the various slots and drives. A
   library can help automate the process of switching tapes so that an administrator
   doesn't have to spend several hours every week changing out tapes in the backup
   system. A large tape library can also allow you to consolidate various media formats
    in use in an environment into a single device (i.e., mixing DLT and LTO tapes and
   drives).




   9. Disk Backup appliances/arrays: some vendor backup solutions may implement the
      use of a dedicated storage appliance or array that is optimized for their particular
      backup scheme. In the case of incorporating deduplication into the backup
      platform, a dedicated appliance may be involved for handling the indexing of the
      bit-level data.
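
As a worked example of the grandfather-father-son scheme in item 7a, the following
sketch classifies each calendar date into a rotation set. The choice of Sunday full
backups and first-Sunday monthly fulls is an assumption for illustration; sites should
match the schedule to their own backup windows and retention requirements.

    import datetime

    def gfs_set(day):
        """Classify a backup date into the son/father/grandfather rotation sets."""
        if day.weekday() == 6:                    # Sunday: a full backup
            if day.day <= 7:                      # the first Sunday of the month
                return "grandfather (monthly full)"
            return "father (weekly full)"
        return "son (daily incremental)"

    # Example: classify one week of backups.
    start = datetime.date(2009, 10, 5)            # an arbitrary Monday
    for offset in range(7):
        d = start + datetime.timedelta(days=offset)
        print(d, gfs_set(d))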



   2.7.5.2. Archiving


   2.7.5.3. Media Lifecycle

           Destruction of expired data


    2.7.5.4. Hierarchical Storage Management
    2.7.5.5. Document Management
2.7.6.       Asset Management

   Effective data center asset management is necessary for both regulatory and
   contractual compliance. It can improve life cycle management, and facilitate inventory
   reductions by identifying under-utilized hardware and software, potentially resulting in
   significant cost savings. An effective management process requires combining current
   Information Technology Infrastructure Library (ITIL) and Information Technology Asset
   Management (ITAM) best practices with accurate asset information, ongoing
    governance and asset management tools. The best systems/tools are capable of asset
    discovery and manage all aspects of the assets (physical, financial, and contractual)
    across the life cycle, with Web interfaces for real-time access to the data.
    Recognizing that sophisticated systems may be prohibitively expensive, asset
    management for smaller environments may be handled with spreadsheets or a
    simple database. Optimally, a system that could be shared among campuses while
    maintaining restricted permission levels would allow for more comprehensive and
    uniform participation, such as the Network Infrastructure Asset Management System
    (NIAMS), http://www.calstate.edu/tis/cass/niams.shtml.
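
    Where a commercial asset management system cannot be justified, even a simple
    structured inventory improves on ad hoc lists. The following minimal sketch shows
    the kind of record that supports the categories and groupings listed below; the
    field names and values are illustrative assumptions.

        import csv
        from dataclasses import dataclass, asdict

        @dataclass
        class Asset:
            asset_tag: str
            category: str        # e.g., "Server", "Network", "Storage"
            description: str
            location: str        # building / room / rack / rack unit
            funding_source: str  # e.g., "Departmental", "Project", "Division"
            criticality: str     # e.g., "Critical", "Business Hours", "Noncritical"
            security_level: str  # e.g., "FERPA", "HIPAA", "PCI", "None"

        inventory = [
            Asset("CSU-000123", "Server", "Blade enclosure 1", "DC1/Rack A3/U20",
                  "Division", "Critical", "FERPA"),
        ]

        # Export to CSV so the inventory can be shared or loaded into another tool.
        with open("assets.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(asdict(inventory[0]).keys()))
            writer.writeheader()
            writer.writerows(asdict(a) for a in inventory)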

   The following are asset categories to be considered in a management system:

          Physical Assets – to include the grid, floor space, tile space, racks and cables.
           The layout of space and the utilization of the attributes above are literally an
           asset that needs to be tracked both logically and physically.

          Network Assets – to include routers, switches, firewalls, load balancers, and
           other network related appliances.




      Storage Assets – to include Storage Area Networks (SAN), Network Attached
       Storage (NAS), tape libraries and virtual tape libraries.

      Server Assets – to include individual servers, blade servers and enclosures.

       Electrical Assets – to include Uninterruptible Power Supplies (UPS), Power Distribution
       Units (PDU), breakers, outlets (NEMA noted), circuit number and grid location of
       same. Power consumption is another example of logical asset that needs to be
       monitored by the data center manager in order to maximize server utilization
       and understand, if not reduce, associated costs.

      Air Conditioning Assets – to include air conditioning units, air handlers, chiller
       plants and other airflow related equipment. Airflow in this instance may be
       considered a logical asset as well but the usage plays an important role in a data
       center environment. Rising energy costs and concerns about global warming
       require data center managers to track usage carefully. Computational fluid
       dynamics (CFD) modeling can serve as a tool for maximizing airflow within the
       data center.

      Data Center Security and Safety Assets – Media access controllers, cameras,
       fire alarms, environmental surveillance, access control systems and access
       cards/devices, fire and life safety components, such as fire suppression systems.

      Logical Assets – T1’s, PRI’s and other communication lines, air conditioning,
       electrical power usage. Most importantly in this logical realm is the
       management of the virtual environment. Following is a list of logical assets or
       associated attributes that would need to be tracked:

           o   A list of Virtual Machines
           o   Software licenses in use in data center
           o   Virtual access to assets
                     VPN access accounts to data center
                     Server/asset accounts local to the asset

      Information Assets – to include text, images, audio, video and other media.
       Information is probably the most important asset a data center manager is
       responsible for. The definition is: An information asset is a definable piece of
       information, stored in any manner, recognized as valuable to the organization.
        To realize its value, users must have accurate, timely, secure and
        personalized access to this information.

The following are asset groupings to be considered in a management system:

      By Security Level
           o Confidentiality
                    o FERPA
                     o HIPAA
                    o PCI
              By Support Organization
                    o Departmental
                    o Computer Center Supported
                    o Project Team
              Criticality
                    o Critical (ex. 24x7 availability)
                    o Business Hours only (ex. 8AM - 7 PM)
                    o Noncritical
              By Funding Source (useful for recurring costs)
                    o Departmental funded
                    o Project funded
                    o Division funded

    2.7.6.1. Tagging/Tracking
    2.7.6.2. Licensing
    2.7.6.3. Software Distribution
2.7.7.       Problem Management

    Problem Management investigates the underlying cause of incidents, and aims to
    prevent incidents of a similar nature from recurring. By removing errors, which often
    requires a structural change to the IT infrastructure in an organization, the number of
    incidents can be reduced over time. Problem Management should not be confused with
    Incident Management. Problem Management seeks to remove the causes of incidents
    permanently from the IT infrastructure whereas Incident Management deals with
    fighting symptoms to incidents. Problem Management is proactive while Incident
    Management is reactive.


2.7.7.1.       Fault Detection - A condition often identified as a result of multiple incidents
               that exhibit common symptoms. Problems can also be identified from a single
               significant incident, indicative of a single error, for which the cause is unknown,
               but for which the impact is significant.
2.7.7.2.       Correction - An iterative process to diagnose known errors until they are
               eliminated by the successful implementation of a change under the control of
               the Change Management process.
2.7.7.3.       Reporting - Summarizes Problem Management activities. Includes number of
               repeat incidents, problems, open problems, repeat problems, etc.

2.7.8.       Security
    2.7.8.1. Data Security
           Data security is the protection of data from accidental or malicious modification,
           destruction, or disclosure. Although the subject of data security is broad and
           multi-faceted, it should be an overriding concern in the design and operation of a
           data center. There are multiple laws, regulations and standards that are likely to
        be applicable such as the Payment Card Industry Data Security Standard, ISO
        17799 Information Security Standard, California SB 1386, California AB 211, the
        California State University Information Security Policy and Standards to name a
         few. Compliance with these standards and laws must be demonstrated periodically.
2.7.8.2. Encryption
        Encryption is the use of an algorithm to encode data in order to render a
         message or other file readable only by the intended recipient. Its primary
         function is to ensure confidentiality in both data transmission and data storage,
         with related cryptographic mechanisms providing integrity and non-repudiation.
         The use of encryption is especially
        important for Protected Data (data classified as Level 1 or 2). Common
         transmission encryption protocols and utilities include SSL/TLS, Secure Shell (SSH), and
        IPSec. Encrypted Data Storage programs include PGP's encryption products
        (other security vendors such as McAfee have products in this space as well),
        encrypted USB keys, and TrueCrypt's free encryption software. Key management
        (exchange of keys, protection of keys, and key recovery) should be carefully
        considered.
2.7.8.3. Authentication
Authentication is the verification of the identity of a user. From a security perspective it
is important that user identification be unique so that each person can be positively
identified. Also the process of issuing identifiers must be secure and documented.
There are three types of authentication available:
      What a person knows (e.g., password or passphrase)
      What a person has (e.g., smart card or token)
      What a person is or does (e.g., biometrics or keystroke dynamics)

Single-factor authentication is the use of one of the above authentication types, two-factor
authentication uses two of them, and three-factor authentication uses all of them.

       Single-factor password authentication remains the most common means of
       authentication ("What a person knows"). However due to the computing power
       of modern computers in the hands of attackers and technologies such as
       "rainbow tables", passwords used for single factor authentication may soon
       outlive their usefulness. Strong passwords should be used and a password
       should never be transmitted or stored without being encrypted. A reasonably
       strong password would be a minimum of eight characters and should contain
       three of the following four character types: lower case alpha, upper case alpha,
       number, and special character.
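
        The password guidance above maps directly onto a simple rule check. The
        following sketch implements only the stated minimum (at least eight characters
        containing at least three of the four character types); it is a baseline filter
        and does not replace dictionary or breached-password checks.

            import string

            def meets_password_standard(password):
                """Check the minimum rule above: at least 8 characters and at least
                three of: lower case, upper case, number, special character."""
                if len(password) < 8:
                    return False
                classes = [
                    any(c.islower() for c in password),
                    any(c.isupper() for c in password),
                    any(c.isdigit() for c in password),
                    any(c in string.punctuation for c in password),
                ]
                return sum(classes) >= 3

            # meets_password_standard("Summer09")  -> True  (upper, lower, digits)
            # meets_password_standard("password")  -> False (one character type only)
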
2.7.8.4. Vulnerability Management
    2.7.8.4.1. Anti-malware Protection
             Malware (malicious code, such as viruses, worms, and spyware, written to
             circumvent the security policy of a computer) represents a threat to data
              center operations. Anti-malware solutions must be deployed on all
             operating system platforms to detect and reduce the risk to an acceptable
             level. Solutions for malware infection attacks include firewalls (host and
             network), antivirus/anti-spyware, host/network intrusion protection
              systems, and OS/application hardening and patching. Relying only on antivirus
             solutions will not fully protect a computer from malware. Determining the
             correct mix and configuration of the anti-malware solutions depends on
              the value and type of services provided by a server. Antivirus, firewalls,
             and intrusion protection systems need to be regularly updated in order to
             respond to current threats.
    2.7.8.4.2. Patching
              The ongoing patching of operating systems and applications is an important
              activity in vulnerability management. Patching includes file updates and
             configuration alterations. Data Center Operations groups should
             implement a patching program designed to monitor available patches,
             categorize, test, implement, and monitor the deployment of OS and
             application patches. In order to detect and address emerging
             vulnerabilities in a timely manner, campus staff members should frequently
             monitor announcements from sources such as BugTraq, REN-ISAC, US-Cert,
             and Microsoft and then take appropriate action. Both timely patch
             deployment and patch testing are important and should be thoughtfully
             balanced. Patches should be applied via a change control process. The
             ability to undo patches is highly desirable in case unexpected consequences
             are encountered. Also the capability to verify that patches were
             successfully applied is important.
    2.7.8.4.3. Vulnerability Scanning
              The data center should implement a vulnerability scanning program based on
              regular use of toolsets such as network and port scanners, authenticated
              configuration and patch-level checks, and password auditing tools.
     2.7.8.4.4. Compliance Reporting
              Compliance Reporting informs all parties with responsibility for the
              data and applications how well risks have been reduced to the acceptable
              level defined by policy, standards, and procedures. Compliance reporting
              is also valuable in demonstrating compliance with applicable laws and
              contracts (HIPAA, PCI DSS, etc.). Compliance reporting should include
              measures of:
                   How many systems are out of compliance.
                   The percentage of compliant versus non-compliant systems.
                   How quickly a system is brought back into compliance once it is
                    detected out of compliance.
                   Compliance trends over time.
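
              As a simple illustration of these measures (a sketch only; the scan-result
              format below is a hypothetical CSV export, not a specific product's), the
              following Python snippet computes the counts, percentage, and average
              time to return to compliance:

                  import csv
                  from datetime import date

                  def compliance_summary(scan_csv):
                      # Expected columns (assumed): host, compliant (yes/no),
                      # detected (YYYY-MM-DD), remediated (YYYY-MM-DD or blank).
                      with open(scan_csv, newline="") as fh:
                          rows = list(csv.DictReader(fh))
                      non_compliant = [r for r in rows if r["compliant"].lower() == "no"]
                      pct = 100.0 * (len(rows) - len(non_compliant)) / len(rows) if rows else 0.0
                      days = [(date.fromisoformat(r["remediated"]) -
                               date.fromisoformat(r["detected"])).days
                              for r in rows if r["remediated"]]
                      print("Systems out of compliance:", len(non_compliant))
                      print("Percent compliant: %.1f%%" % pct)
                      if days:
                          print("Average days to return to compliance: %.1f"
                                % (sum(days) / len(days)))

                  compliance_summary("compliance_scan.csv")   # placeholder file name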


2.7.8.5. Physical Security
        When planning security for the Data Center and the equipment contained
        therein, physical security must be part of the equation as one layer of a
        "Defense-in-Depth" security model. If the physical security of critical IT
        equipment is not addressed, password length and network encryption matter
        little: once an attacker has gained physical access to a system, not much
        else can protect it.

        See section 2.4.1.6 for a description of access control.




<insert diagram of reference model with key components as building blocks>




    3. Best Practice Components

        3.1. Standards
            3.1.1.     ITIL

The Information Technology Infrastructure Library (ITIL) Version 3 is a collection of good practices for
the management of Information Technology organizations. It consists of five components whose central
theme is the management of IT services: Service Strategy (SS), Service Design (SD), Service Transition
(ST), Service Operation (SO), and Continual Service Improvement (CSI). Together these five components
define the ITIL service life cycle, with the first four components (SS, SD, ST and SO) at the core and CSI
overarching them. CSI wraps the first four components and reflects the obligation of each core
component to continuously look for ways to improve its respective ITIL processes.




ITIL defines the five components in terms of functions/activities, concepts, and processes, as illustrated
below:

Service Strategy
Main Activities                     Key Concepts                         Processes
Define the Market                   Utility & Warranty                   Service Portfolio Management
Develop Offerings                   Value Creation                       Demand Management
Develop Strategic Assets            Service Provider                     Financial Management
Prepare Execution                   Service Model
                                    Service Portfolio

Service Design
Five Aspects of SD                  Key Concepts                         Processes
Service Solution                    Four “P’s”: People, Processes,       Service Catalog Management
                                    Products, & Partners
Service Management Systems          Service Design Package               Service Level Management
  and Solutions
Technology and Management           Delivery Model Options               Availability Management
  Architectures & Tools
Processes                           Service Level Agreement              Capacity Management
Measurement Systems, Methods        Operational Level Agreement          IT Service Continuity
  & Metrics                                                                 Management


                                    Underpinning Contract                Information Security
                                                                           Management
                                                                         Supplier Management

Service Transition
Processes                                            Key Concepts
Change Management                                    Service Changes
Service Asset & Configuration Management             Request for Change
Release & Deployment Management                      Seven “R’s” of Change Management
Knowledge Management                                 Change Types
Transition Planning & Support                        Release Unit
Service Validation & Testing                         Configuration Management Database (CMDB)
Evaluation                                           Configuration Management System
                                                     Definitive Media Library (DML)

Service Operation
Achieving the Right Balance        Processes                           Function
Internal IT View versus            Event Management                    Service Desk
  External Business View
Stability versus Responsiveness    Incident Management                 Technical Management
Reactive versus Proactive          Problem Management                  IT Operations Management
Quality versus Cost                Access Management                   Application Management
                                   Request Fulfillment

Continual Service Improvement
The 7-Step Improvement Process, used to identify the vision and strategy and the tactical and operational goals:
    1. Define what you should measure.
    2. Define what you can measure.
    3. Gather the data: who, how, when, and with what data integrity?
    4. Process the data: frequency, format, system, accuracy.
    5. Analyze the data: relationships, trends, conformance to plan, targets met, corrective actions needed?
    6. Present and use the information: assessment summaries, action plans, etc.
    7. Implement corrective action.


            3.1.2.      ASHRAE

ASHRAE modified its recommended operating envelope for data centers with the goal of reducing energy
consumption. IT manufacturers recommend that, for extended periods of time, data center operators
maintain their environment within the recommended envelope. Exceeding the recommended limits for
short periods of time should not be a problem, but running near the allowable limits for months could
result in increased reliability issues. In reviewing the available data from a number of IT manufacturers,
the 2008 expanded recommended operating envelope is the envelope agreed upon by all of the IT
manufacturers, and operation within it will not compromise the overall reliability of the IT equipment.



Following are the previous and 2008 recommended envelope data:

                                    2004 Version        2008 Version

Low End Temperature                 20°C (68 °F)        18°C (64.4 °F)

High End Temperature                25°C (77 °F)        27°C (80.6 °F)

Low End Moisture                    40% RH              5.5°C DP (41.9 °F)

High End Moisture                   55% RH              60% RH & 15°C DP (59 °F DP)
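
As a worked illustration of these limits (a sketch only: the thresholds come from the 2008 recommended
envelope above, while the Magnus dew-point approximation and the sample readings are assumptions of
this example), the following Python snippet checks whether a temperature/humidity reading falls within
the recommended envelope:

    import math

    def dew_point_c(temp_c, rh_percent):
        # Magnus approximation for dew point in degrees C.
        a, b = 17.62, 243.12
        gamma = (a * temp_c) / (b + temp_c) + math.log(rh_percent / 100.0)
        return (b * gamma) / (a - gamma)

    def within_2008_recommended(temp_c, rh_percent):
        dp = dew_point_c(temp_c, rh_percent)
        return (18.0 <= temp_c <= 27.0      # dry-bulb temperature limits
                and dp >= 5.5               # low-end moisture: 5.5 C dew point
                and rh_percent <= 60.0      # high-end moisture: 60% RH ...
                and dp <= 15.0)             # ... and 15 C dew point

    print(within_2008_recommended(24.0, 45.0))   # True: inside the envelope
    print(within_2008_recommended(29.0, 45.0))   # False: dry-bulb too high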



<Additional comments on the relationship of electro static discharge (ESD) and relative humidity and the
impact to printed circuit board (PCB) electronics and component lubricants in drive motors for disk and
tape.>


            3.1.3.       Uptime Institute
        3.2. Hardware Platforms
            3.2.1.       Servers
                3.2.1.1. Server Virtualization
                         1. Practices

                            a. Production hardware should run the latest stable release of the selected
                               hypervisor, with patching and upgrade paths defined and pursued on a
                               scheduled basis. Each hardware element (e.g., blade) should be dual-
                               attached to the data network and the storage environment to provide
                               load balancing and fault tolerance.

                            b. Virtual machine templates should be developed, tested and maintained
                               to allow for consistent OS, maintenance and middleware levels across
                               production instances. These templates should be used to support
                               cloning of new instances as required and systematic maintenance of
                               production instances as needed.

                            c. Virtual machines should be provisioned using a defined work order
                               process that allows for an effective understanding of server
                               requirements and billing/accounting expectations.

                                 This process should allow for interaction between requestor and
                                  provider to ensure appropriate configuration and acceptance of any
                                  fee-for-service arrangements.

                            d. Virtual machines should be monitored for CPU, memory, network and
                               disk usage. Configurations should be modified, with service owning unit
                               participation, to ensure an optimum balance between required and
                               committed capacity.


                     Post-provisioning capacity analysis should be performed via a
                      formal, documented process on a frequent basis. For example, a
                      4 VCPU virtual machine with 8 gigabytes of RAM that is using
                      less than 10% of 1 VCPU and 500 megabytes of RAM should be
                      resized so that resources are not wasted. (A simple right-sizing
                      sketch follows this list of practices.)

            e. Virtual machine boot/system disks should be provisioned into a LUN
               maintained in the storage environment to ensure portability of server
               instances across hardware elements.

            f.     To reduce I/O contention, virtual machines with high performance or
                   high capacity requirements should have their non-boot/system disks
                   provisioned using dedicated LUNs mapped to logical disks in the storage
                   environment.

             g. Virtual machines should be administered using a central
                console/resource such as VMware vCenter (formerly VirtualCenter).
                However, remote KVM functionality should also be implemented to
                support remote hypervisor installation and patching and remote
                hardware maintenance.

            h. A virtual development environment should be implemented, allowing
               for development and testing of new server instances/templates,
               changes to production instances/templates, hypervisor upgrades and
               testing of advanced high-availability features.
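
             The following minimal right-sizing sketch illustrates the capacity
             analysis described in practice (d) above. It is an assumption-laden
             example: the CSV file and its field names represent a hypothetical
             metrics export from the virtualization console, not a specific
             vendor's API.

                 import csv

                 CPU_THRESHOLD = 0.10   # flag VMs using under 10% of one vCPU
                 MEM_THRESHOLD = 0.25   # flag VMs using under 25% of configured RAM

                 def flag_overprovisioned(metrics_csv):
                     with open(metrics_csv, newline="") as fh:
                         for row in csv.DictReader(fh):
                             cpu_used = float(row["avg_vcpu_used"])   # in vCPU-equivalents
                             mem_gb = float(row["mem_gb"])
                             mem_used = float(row["avg_mem_gb_used"])
                             if (cpu_used < CPU_THRESHOLD
                                     and mem_used / mem_gb < MEM_THRESHOLD):
                                 print(row["name"], "appears oversized at",
                                       row["vcpus"], "vCPU /", mem_gb, "GB;",
                                       "review with the service owner")

                 flag_overprovisioned("vm_metrics.csv")   # placeholder file name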

3.2.2.   Storage

         1. Practices

             a. Develop a meaningful naming convention when defining components of
                the SAN. Good names readily identify components, quickly convey
                information about them, and reduce the likelihood of
                misunderstandings.

             b. Use different types of storage to produce tiers. High-speed Fibre
                Channel drives are not necessary for all applications; a mix of Fibre
                Channel and SATA drives of various capacities and speeds produces a
                SAN that balances performance with cost.

             c. Utilize NDMP for NFS when able. Whenever possible, use NDMP to back
                up NFS files on the SAN. Backups are faster and network traffic is
                reduced. However, if you are doing disk-to-disk backups, make sure
                the backup target is not the same storage device as the SAN.

            d. Isolate iSCSI from regular network traffic. SAN iSCSI traffic should be on
               its own network using its own switches for performance and security
               reasons.



                    e. Ensure partition alignment between servers and SAN disks.
                       Misalignment of partitions causes one or two additional I/Os for
                       every read or write, which can have a large performance impact
                       on big files or databases. (A simple alignment-check sketch
                       follows this list.)

                   f.   Use thin provisioning where possible. Some SAN vendor tools allow you
                        to dynamically grow and shrink storage allocations. A large amount of
                        storage can be assigned to a server but doesn’t get allocated until it is
                        needed.

                    g. Use deduplication where possible. Deduplication on VTLs has
                       been in use for quite a while, but deduplication of primary
                       storage is fairly new. Watch the technology and begin to apply
                       it conservatively.
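
                    The following Linux-only sketch illustrates the alignment check
                    mentioned in practice (e) above. It is an example, not a vendor
                    tool; it assumes sysfs is available and that a 1 MiB boundary is
                    the desired alignment target.

                        import glob
                        import os

                        ALIGN_BYTES = 1024 * 1024    # 1 MiB alignment target
                        SECTOR_BYTES = 512           # sysfs 'start' is in 512-byte sectors

                        def check_alignment(device="sda"):
                            for part_dir in sorted(glob.glob(f"/sys/block/{device}/{device}*")):
                                start_path = os.path.join(part_dir, "start")
                                if not os.path.exists(start_path):
                                    continue
                                with open(start_path) as fh:
                                    offset = int(fh.read().strip()) * SECTOR_BYTES
                                status = "aligned" if offset % ALIGN_BYTES == 0 else "MISALIGNED"
                                print(os.path.basename(part_dir), "starts at byte", offset, status)

                        check_alignment("sda")   # adjust the device name for your environment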


       3.2.2.1. Storage Virtualization
                1. To abstract storage hardware for device/array independence

               2. To provide replication/mirroring for higher availability, DR/BC, etc.

3.3. Software
    3.3.1.       Operating Systems
        3.3.1.1. Decision matrix for OS platforms (Windows, Linux, HP-UX, AIX)
        3.3.1.2. Licensing considerations
        3.3.1.3. Support for OS platforms (sys admin skill set maintenance,
                outsourcing)
    3.3.2.       Middleware
        3.3.2.1. Identity Management
    3.3.3.       Databases
        3.3.3.1. Decision matrix for DB platforms (Oracle, MSSQL, Informix, MySQL)
        3.3.3.2. Preferred internal development platforms
        3.3.3.3. Licensing considerations
        3.3.3.4. Support for DB platforms (training, sourcing)
    3.3.4.       Core/Enabling Applications
        3.3.4.1. Email
            3.3.4.1.1. Spam Filtering
        3.3.4.2. Web Services
        3.3.4.3. Calendaring
        3.3.4.4. DNS
        3.3.4.5. DHCP
        3.3.4.6. Syslog
        3.3.4.7. Desktop Virtualization
        3.3.4.8. Application Virtualization
    3.3.5.       Third Party Applications
        3.3.5.1. LMS
        3.3.5.2. CMS
        3.3.5.3. Help Desk/Ticketing
3.4. Delivery Systems



3.4.1.       Facilities
    3.4.1.1. Tiering Standards




    3.4.1.2. Spatial Guidelines and Capacities
             1. The Data Center should be located in an area with no flooding sources
                directly above or adjacent to the room. Locate building water and sprinkler
                main pipes, and toilets and sinks away from areas alongside or above the
                Data Center location.

             2. The Data Center is to be constructed of slab-to-slab walls with a
                minimum one-hour fire rating. All penetrations through walls, floors,
                and ceilings are to be sealed with an approved sealant with fire
                ratings equal to the penetrated construction.

            3. FM-200 gaseous fire suppression will be used, possibly supplemented with
               VESDA smoke detection with 24/7 monitoring.

             4. Provide a 20 mil (0.5 mm) vapor barrier for the Data Center envelope
                to allow maintaining proper humidity. Provide R-21 minimum insulation.
                If any walls are common to the outside, increase the R factor to
                achieve the same effect based on the exterior wall construction.

            5. All walls of the Data Center shall be full height from slab to under floor
               above.

            6. All paint within the Data Center shall be moisture control/anti-flake type.

             7. The redundant UPS systems and related critical electrical equipment
                should be located in separate rooms external to the data center.
                Redundant UPS systems that are not paralleled (not connected together)
                shall be located in separate rooms from each other where not cost
                prohibitive.

            8. Provide at least one set of double doors or one 3’-6” wide door into the
               Data Center to facilitate the movement of large equipment.

            9. There will be a Network Operation Center (NOC) outside the data center;
               KVMs will be located inside the Data Center as well as in the NOC.

            10. A portion of future data center expansion space may be used for staging,
                storage and test/development.

            11. Interior windows, if any, on fire-rated walls between the Data Center and
                adjacent office spaces shall be one-hour rated. No windows are permitted
                in Data Center exterior walls.

            12. Column spacing shall be 30’x 30’ or greater.


13. The concrete floor beneath the raised access floor in the Data Center is to
    be sealed with an appropriate water-based sealer prior to the installation of
    the raised access floor. IT equipment cabinets can weigh as much as 2,000 lbs
    each. Subfloor load capacity shall not be less than 150 lbs/SF for the Data
    Center and 250 lbs/SF for electrical and mechanical support areas.

14. Ceiling load capacity for suspension of data and power cable tray, ladder
    rack and HVAC ductwork shall not be less than 50 lbs/SF.

15. A steel roof shall be provided. Consider installing a redundant roof or fire-
    proof gypsum board ceiling (especially if the original roof is wooden). Roof
    load capacity shall not be less than 40 lbs/SF with capability to add
    structural steel for support of HVAC heat rejection equipment.

16. HVAC heat rejection equipment may be installed on grade if space allows,
    otherwise on the roof. Utility power transformers, standby generators and
    switchgear will be installed on grade.

17. Provide two drains with anti-backflow in the floor unless cost prohibitive.

18. Provide a minimum ceiling height of 12.5' clear from the floor slab to the
    bottom of beams in the Data Center.

19. A 24” raised access floor is required (refer to the details in this
    document). The access floor will be used as the primary cool air supply
    plenum and should be kept free of cables, piping or other equipment that
    can cause obstruction of airflow.

20. Power cabling shall be routed in raceway overhead, mounted on equipment
    racks or suspended overhead from the ceiling.

21. Data cabling and fiber shall be routed overhead in cable tray or on ladder
    rack.

22. All doors into the Data Center and Lab areas are to be secured with locks.
    All doors into the Data Center are to have self-closing devices. All doors
    should have the appropriate fire rating per the NFPA and local codes. The
    doors shall have a full 180-degree swing.

23. All construction in the Data Center is to be complete a minimum three
    weeks prior to occupancy. This includes walls, painting, backboards, doors,
    frames, hardware, windows, VCT floor, ceiling grid and tile (if specified),
    lights, sprinklers, fire suppression systems, electrical, HVAC and UPS.
    Temporary power and HVAC are not acceptable. Rooms must be wet-mop
    cleaned to remove all dust prior to installing equipment.

24. Orient light fixtures to run between and parallel with equipment rows in the
    Data Center.



          25. Motion sensors shall be used for data center lighting control for energy
              efficiency.

3.4.1.3. Electrical Systems

<This section from Lawrence Berkeley National Laboratories. The referenced Design
Guidelines Sourcebook is by PG&E. http://hightech.lbl.gov/DCTraining/best-practices-
technical.html >

1.   Electrical Infrastructure


         Maximize UPS Unit Loading
              o When using battery based UPSs, design the system to maximize the load factor on operating
                   UPSs. Use of multiple smaller units can provide the same level of redundancy while still
                   maintaining higher load factors, where UPS systems operate most efficiently. For more
                   information, see Chapter 10 of the Design Guidelines Sourcebook.
         Specify Minimum UPS Unit Efficiency at Expected Load Points
               o There are a wide variety of UPSs offered by a number of manufacturers at a wide range of
                    efficiencies. Include minimum efficiencies at a number of typical load points when specifying
                    UPSs. Compare offerings from a number of vendors to determine the best efficiency option for a
                     given UPS topology and feature set. For more information, see Chapter 10 of the Design
                    Guidelines Sourcebook.
          Evaluate Alternative UPS Technologies for Efficiency
               o New UPS technologies that offer the potential for higher efficiencies and lower maintenance costs
                    are in the process of being commercialized. Consider the use of systems such as flywheel or fuel
                    cell UPSs when searching for efficient UPS options. For more information, see Chapter 10 of the
                    Design Guidelines Sourcebook.


2.   Lighting


         Use Occupancy Sensors
              o Occupancy sensors can be a good option for datacenters that are infrequently occupied.
                    Thorough area coverage with occupancy sensors or an override should be used to ensure the
                   lights stay on during installation procedures when a worker may be 'hidden' behind a rack for an
                   extended period.
         Provide Bi-Level Lighting
               o Provide two levels of clearly marked, easily actuated switching so the lighting level can be easily
                     changed between normal, circulation space lighting and a higher power detail work lighting level.
                     The higher power lighting can be normally left off but still be available for installation and other
                     detail tasks.
         Provide Task Lighting
               o Provide dedicated task lighting specifically for installation detail work to allow for the use of lower,
                    circulation space and halls level lighting through the datacenter area.




3.4.1.4. HVAC Systems

<This section from Lawrence Berkeley National Laboratories. The referenced Design
Guidelines Sourcebook is by PG&E. http://hightech.lbl.gov/DCTraining/best-practices-
technical.html >

1.   Mechanical Air Flow Management


         Hot Aisle/Cold Aisle Layout
         Blank Unused Rack Positions




              o     Standard IT equipment racks exhaust hot air out the back and draw cooling air in the front.
                    Openings that form holes through the rack should be blocked in some manner to prevent hot air
                    from being pulled forward and recirculated back into the IT equipment. For more information, see
                    Chapter 1 of the Design Guidelines Sourcebook.
        Use Appropriate Air Diffusers
        Position supply and returns to minimize mixing
               o Diffusers should be located to deliver air directly to the IT equipment. At a minimum, diffusers
                    should not be placed such that they direct air at rack or equipment heat exhausts, but rather
                    direct air only towards where IT equipment draws in cooling air. Supplies and floor tiles should be
                     located only where there is load to prevent short-circuiting of cooling air directly to the returns; in
                     particular, do not place perforated floor supply tiles near computer room air conditioning units, where
                     supply air can short-circuit directly into the unit's return. For more information, see Chapters 1 and 2 of the Design
                    Guidelines Sourcebook.
        Minimize Air Leaks in Raised Floor


2.   Mechanical Air Handler Systems


        Use Redundant Air Handler Capacity in Normal Operations
              o With the use of Variable Speed Drives and chilled water based air handlers, it is most efficient to
                   maximize the number of air handlers operating in parallel at any given time. Fan power drops
                   roughly with the cube of airflow (the fan affinity laws), so two units each running at 50% speed
                   draw about 2 x (0.5)^3 = 25% of the power of a single unit at full speed while moving the same
                   total airflow. For more information, see Chapter 3 of the Design Guidelines Sourcebook.
        Configure Redundancy to Reduce Fan Power Use in Normal Operation
               o When multiple small distributed units are used, redundancy must be equally distributed.
                    Achieving N+1 redundancy can require the addition of a large number of extra units, or the
                    oversizing of all units. A central air handler system can achieve N+1 redundancy with the addition
                    of a single unit. The redundant capacity can be operated at all times to provide a lower air handler
                    velocity and an overall fan power reduction, since fan power drops steeply with velocity; the
                    savings are greatest at light loading. For more information, see Chapter 3 of the Design Guidelines Sourcebook.
        Control Volume by Variable Speed Drive on Fans Based on Space Temperature
                          The central air handlers should use variable fan speed control to minimize the volume
                              of air supplied to the space. The fan speed should be varied in concert with the supply
                              air temperature, reducing fan speed to the minimum possible before increasing supply
                              air temperature above a reasonable set point. Typically, supply air of 60°F is appropriate
                              to provide the sensible cooling required by datacenters. For more information, see
                              Chapters 1 and 3 of the Design Guidelines Sourcebook.


3.   Mechanical Humidification


        Use Widest Suitable Humidity Control Band
        Centralize Humidity Control
        Use Lower Power Humidification Technology
              o There are several options for lower power, non-isothermal humidification, including air or water
                  pressure based 'fog' systems, air washers, and ultrasonic systems. For more information, see
                  Chapter 7 of the Design Guidelines Sourcebook.


4.   Mechanical Plant Operation


        Use Free Cooling / Waterside Economization
               o Free cooling provides cooling using only the cooling tower and a heat exchanger. It is very
                    attractive in dry climates and for facilities where concerns about outside air quality rule out the
                    use of standard airside economizers.
                    For more information, see Chapters 4 and 6 of the Design Guidelines Sourcebook.
        Monitor System Efficiency
               o Install reliable, accurate monitoring of key plant metrics such as kW/ton. The first cost of
                    monitoring can be quickly recovered by identifying common efficiency problems, such as low
                    refrigerant charge, non-optimal compressor mapping, incorrect sensors, or incorrect pumping
                    control. Efficiency monitoring provides the information needed for facilities personnel to
                    optimize the system's energy performance during buildout, avoid efficiency decay, and
                    troubleshoot developing equipment problems over the life of the system.
                    For more information, see Chapter 4 of the Design Guidelines Sourcebook.
        Rightsize the Cooling Plant
               o Due to the critical nature of the load and the unpredictability of future IT equipment loads,
                     datacenter cooling plants are typically oversized. The design should recognize that the standard
                     operating condition will be at part load and optimize for efficiency accordingly. Consistent part-load
                     operation dictates using well-known design approaches to part-load efficiency, such as utilizing
                     redundant towers to improve approach, using multiple chillers with variable speed drives, variable
                     speed pumping throughout, chiller staging optimized for part-load operation, etc. For more
                     information, see Chapter 4 of the Design Guidelines Sourcebook.




3.4.1.5. Fire Protection & Life Safety
3.4.1.6. Access Control
3.4.1.7. Commissioning

<This section from Lawrence Berkeley National Laboratories. The referenced Design
Guidelines Sourcebook is by PG&E. http://hightech.lbl.gov/DCTraining/best-practices-
technical.html >

1.   Commissioning and Retrocommissioning


        Perform a Peer Review
              o A peer review offers the benefit of having the design evaluated by a professional without the
                   preconceived assumptions that the main designer will inevitably develop over the course of the
                   project. Often, efficiency, reliability and cost benefits can be achieved through the simple process
                   of having a fresh set of eyes, unencumbered by the myriad small details of the project, review the
                   design and offer suggestions for improvement.
        Engage a Commissioning Agent
             o Commissioning is a major task that requires considerable management and coordination
                  throughout the design and construction process. A dedicated commissioning agent can ensure
                  that commissioning is done in a thorough manner, with a minimum of disruption and cost.
        Document Testing of All Equipment and Control Sequences
              o Develop a detailed testing plan for all components. The plan should encompass all expected
                   sequence-of-operation conditions and states. Perform testing with the support of all relevant
                  trades — it is most efficient if small errors in the sequence or programming can be corrected on-
                  the-spot rather than relegated to the back and forth of a traditional punchlist. Functional testing
                  performed for commissioning does not take the place of equipment startup testing, control point-
                  to-point testing or other standard installation tests.
        Measure Equipment Energy Onsite
             o Measure and verify that major pieces of equipment meet the specified efficiency requirements.
                  Chillers in particular can have seriously degraded cooling efficiency due to minor installation
                  damage or errors with no outward symptoms, such as loss of capacity or unusual noise.
        Provide Appropriate Budget and Scheduling for Commissioning
              o Commissioning is a separate, non-standard, procedure that is necessary to ensure the facility is
                   constructed to and operating at peak efficiency. Additional time commitment beyond a standard
                   construction project will be required from the contractors. Coordination meetings dedicated to
                   commissioning are often required at several points during construction to ensure a smooth and
                   effective commissioning.
        Perform Full Operational Testing of All Equipment
              o Commissioning testing of all equipment should be performed after the full installation of the
                    systems is complete, immediately prior to occupancy. Normal operation and all failure modes
                   should be tested. In many critical facility cases, the use of load banks to produce a realistic load
                   on the system is justified to ensure system reliability under design conditions.
        Perform a Full Retrocommissioning
               o Many older datacenters may never have been commissioned, and even if they were, performance
                    degrades over time. Perform a full commissioning and correct any problems found. Where control
                    loops have been overridden due to immediate operational concerns, such as locking out
                         condenser water reset due to chiller instability, diagnose and correct the underlying problem to
                         maximize system efficiency, effectiveness, and reliability.
              Recalibrate All Control Sensors
              Where Appropriate, Install Efficiency Monitoring Equipment
                   o As a rule, a thorough retrocommissioning will locate a number of low-cost or no-cost areas where
                        efficiency can be improved. However, without a simple means of continuous monitoring, the
                        persistence of the savings is likely to be low. A number of simple metrics (cooling plant kW/ton,
                        economizer hours of operation, humidification/dehumidification operation, etc.) should be
                        identified and continuously monitored and displayed to allow facilities personnel to recognize
                        when system efficiency has been compromised.




   3.4.2.       Load Balancing/High Availability
   3.4.3.       Connectivity
       3.4.3.1. Network
           3.4.3.1.1. Network Virtualization
                1. Use of VLANs for LAN security and network management

                    a. Controlling network access through VLAN assignment

                    b. Using VLANs to present multiple virtual subnets within a given physical
                       subnet

                    c. Using VLANs to present one virtual subnet across portions of many
                       physical subnets

               2. Use of MPLS for WAN security and network management

                    a. Multiprotocol Label Switching (MPLS) used to create Virtual Private
                       Networks (VPNs) to provide traffic isolation and differentiation.

            3.4.3.1.2. Structured Cabling
    3.4.4.       Operations
        3.4.4.1. Staffing
        3.4.4.2. Training
        3.4.4.3. Monitoring
        3.4.4.4. Console Management
        3.4.4.5. Remote Operations
    3.4.5.       Accounting
3.5. Disaster Recovery
    3.5.1.       Relationship to overall campus strategy for Business Continuity
    3.5.2.       Relationship to CSU Remote Backup – DR initiative
    3.5.3.       Infrastructure considerations
    3.5.4.       Operational considerations
        3.5.4.1. Recovery Time Objectives and Recovery Point Objectives discussed in 2.7.3.1
               (Backup and Recovery)
        3.5.4.2. Resource-sharing between campuses: Cal State Fullerton/San Francisco State
               example




<sourced from ITAC Disaster Recovery Plan Project,
http://drp.sharepointsite.net/itacdrp/default.aspx >

CSU, Fullerton has established a Business Continuity computing site at San Francisco
State University. The site allows a number of critical computing functions to remain in
service in the event of severe infrastructure disruption, such as complete failure of both
CENIC links to the Internet, failure of the network core on the Fullerton campus, or
complete shutdown of the Fullerton Data Center.

CSU, Fullerton is the only CSU campus to establish such extensive off-site capabilities.

The goal of the site is to provide continuity for the most critical computing services that
can be supported in a cost-effective manner. Complete duplication of every central
computing resource on the Fullerton campus is prohibitively expensive, and San
Francisco State’s Data Center could not provide the space for that much equipment.
 (The same is true in reverse: Fullerton does not have the space, cooling, or electrical
capacity to duplicate all the equipment at SFSU.)

Major capabilities include continuity of access to:

       The main campus website: www.fullerton.edu
       The faculty/student Portal: my.fullerton.edu
       Faculty/staff email;
       Student email (provided by Google, but accessed via the Fullerton Portal)
       Blackboard Learning Solutions (hosted by Blackboard ASP, but accessed via the
        Fullerton Portal)
       CMS HR, Finance, and Student applications (hosted by Unisys in Salt Lake City,
        but accessed via the Fullerton Portal)
       Brass Ring H.R. recruitment system (hosted by Brass Ring)
       OfficeMax ordering system
       GE Capital state procurement card system


With these capabilities, educational and business activities can continue while the
infrastructure interruption is resolved.

In addition to complete operation, the remote site can substitute for specific resources,
such as the campus website. This “granular” approach provides significant flexibility for
responding to specific computing issues in the Data Center.

A number of significant resources were found to be too costly to duplicate off site,
including:

       CMS Data warehouse
       Filenet document repository
        Voice mail
        IBM mainframe (which is to be decommissioned in December 2008)


The project does not provide continuity for resources provided outside of Fullerton IT.

The project began with initial discussions between the CIO’s of Fullerton and San
Francisco in 2006, where they agreed in concept to provide limited “hosting” for
equipment from the other campus. An arrangement was worked out with CENIC, the
statewide data network provider, to use features of the “Border Gateway Protocol” to
allow a campus to switch a portion of the campus Internet Protocol (IP) space from their
main campus to the remote campus in a matter of seconds. This capability was
perfected and tested in the summer of 2007.

A major innovation was the use of the “backup” CENIC link at SFSU to provide Internet
access for Fullerton’s remote site. This completely avoids Fullerton having any impact
on the SFSU network. We use no IP addresses at SF. They make no changes to their
firewall or routers. We need no network ports. And, all Internet traffic to our site goes
through the backup CENIC link, not through SFSU’s primary link.

Fullerton purchased an entire cabinet of equipment, including firewall, network switch,
servers, and remote management of keyboards and power plugs. The equipment was
tested locally and transported to San Francisco in January, 2008.

The site became operational in February, 2008.

All maintenance and monitoring of the remote equipment is done through the Internet
from the Fullerton campus. Staff at San Francisco provide no routine assistance. This
has several important benefits: (1) it places little burden on the remote “host”; (2) it
avoids the need to train staff at the remote campus and retrain them when personnel turn
over; and (3) it allows the entire remote site to be relocated to a different remote host
with almost no change.

Because the remote site is connected to Fullerton through a VPN tunnel, the same Op
Manager software that monitors equipment on the Fullerton campus also monitors the
servers at SFSU. Any unexpected server conditions at SFSU automatically trigger alerts
to Fullerton staff.

To avoid unexpected disruption, the Continuity site is activated manually by accessing
the Firewall through the Internet and changing a few rules. Experience with the
unexpected consequences of setting up “automated” systems prompted this design.

The SFSU site contains a substantial capability, including:

       Domain controllers for Fullerton’s two Microsoft Active Directory domains (AD
         and ACAD)
        This provides secure authentication to the portal and email servers, and would
       allow Active Directory to be rebuilt if the entire Fullerton Data Center were
       destroyed.
      Web Server
      Portal Server
      Application Servers for running batch jobs and synchronizing the Portal
       databases
      Microsoft Exchange email servers
      Blackberry Enterprise Server (BES) to provide continuity for Blackberry users
      SQL database servers
       CMS Portal servers (this capability is not fully implemented yet because CMS
        Student is just coming online during 2008)
      Email gateway server


Because the SFSU site constantly replicates domain updates and refreshes the Portal
database daily, the SFSU site is a better source of much information than the tape
backups kept at Iron Mountain.




               Network Diagrams




Figure 1 – Intercampus network diagram         Figure 2 – Remote LAN at DR site


               3.5.4.3. Memorandum of Understanding

               Campuses striking partnerships in order to share resources and create geographic
               diversity for data stores will want to document the terms of their agreement,
               representing elements such as physical space, services, access rights and effective dates.
               Following is a sample template (sourced from ITAC DR site):


                                                                                        MOU SSU-SJSU
                                                                                Sonoma State University
                                                                                    Rider A, Page 1 of 1


                                  MEMORANDUM OF UNDERSTANDING

               This MEMORANDUM OF UNDERSTANDING is entered into this 1st day of September,
               2008, by and between Sonoma State University, Computer Operations, Information
               Technology Department and San Jose State University, University Computing and
               Telecomm.


Each campus will provide, for use by the other campus, two 19” racks (mounting equipment
per specification EIA-310-C) and the requisite power and cooling for same.

Power provided by Sonoma State University will include uninterruptible power supply
(UPS) and diesel generator backup power.

PHYSICAL LOCATION OF HOSTED EQUIPMENT

Sonoma State University will station the two 19” racks for San Jose State University in
the Sonoma State University Information Technology data center. Collocated network
devices will reside in a physically segregated LAN.

San Jose State University will station the two racks for Sonoma State University …

LOGICAL LOCATION OF HOSTED EQUIPMENT

Equipment stationed by SJSU in the two racks in the SSU data center will be provisioned
in a segregated security zone behind a firewall interface whose effective security policy
is specified by the hosted campus. Security policy change requests for the firewall by
SJSU will be accomplished within 7 days of receipt by SSU.

PHYSICAL ACCESS BY SISTER CAMPUS

Physical access to the two racks in the SSU data center by SJSU support staff will be
granted according to the escorted visitor procedures in place at SSU. Physical access is
available during normal business hours (8am – 5pm) by appointment or emergency
access on a best effort basis by contacting the SSU emergency contact listed in this
document.

Physical access to the two racks in the SJSU data center by SSU support staff will be
granted according to …

SERVICES PROVIDED BY HOST CAMPUS

SSU Computer Operations staff will provide limited services to SJSU as required during
normal business hours (8am – 5pm) or after hours on a best effort basis by contacting
the SSU emergency contact listed in this document. Limited services are defined to be
such things as tasks not to exceed 1 hour of labor such as visiting a system console for
diagnostic purposes, power cycling a system, or other simple task that cannot be
performed remotely by SJSU.



        SJSU Computer Operations staff will provide limited services to SSU as required during
        normal business hours (8am – 5pm) or after hours on a best effort basis by contacting
        the SJSU emergency contact listed in this document. Limited services are defined to be
        such things as tasks not to exceed 1 hour of labor such as visiting a system console for
        diagnostic purposes, power cycling a system, or other simple task that cannot be
        performed remotely by SSU.

        SECURITY REQUIREMENTS

        Level-one data in transit or at rest must be encrypted. Each campus will conform to CSU
        information security standards as they may apply to equipment stationed at the sister
        campus and cooperate with the sister campus’ Information Security Officer pertaining to
        audit findings on their collocated servers.

        The point of contact person for Sonoma State University will be Mr. Samuel Scalise (707
        664-3065, scalise@sonoma.edu). The point of contact person for San Jose State
        University will be Mr. Don Baker (408 924-7820 don.baker@sjsu.edu).

        The emergency point of contact person for Sonoma State University will be Mr. Don
        Lopez (707 291-4970, don.lopez@sonoma.edu). The emergency point of contact for
        San Jose State University will be …

        No charges will apply to either party, since equipment, services and support are mutual.
        The term of this MOU shall be September 1, 2008 through June 30, 2009.


3.6. Total Enterprise Virtualization

Virtualization is key to allowing IT organizations to remain nimble in the face of the increased
complexity of application provisioning and delivery, and to reclaiming the unused capacity of compute
and storage resources. While servers and storage are obvious targets for virtualization, total enterprise
virtualization encompasses additional layers, extending to desktop virtualization and application
virtualization. Data centers that are able to deliver services dynamically to meet demand will have other
virtualization layers as well, such as virtualized connectivity to storage systems and the network.

The following are characteristics of a dynamic data center:

       Enables workload mobility
       Automatically managed through orchestration
       Seamlessly leverages external services
       Service-oriented
       Highly available
       Energy and space efficient
       Utilizes a unified fabric
       Secure and regulatory compliant

Achieving these characteristics requires investments in some of the following key enabling
technologies:

    1. Server Virtualization

        Server virtualization's ability to abstract the system hardware away from the workload
        (i.e., guest OS +application) enables the workload to move from one system to another
        without hardware compatibility worries. In turn, this opens up a whole new world of IT
        agility possibilities that enable the administrator to dynamically shift workloads to
        different IT resources for any number of reasons, including better resource utilization,
        greater performance, high availability, disaster recovery, server maintenance, and even
        energy efficiency. Imagine a data center that can automatically optimize workloads
        based on spare CPU cycles from highly energy-efficient servers.

        Issues to be aware of in server virtualization include licensing, hardware capabilities,
        application support, and administrators' trust in running critical applications on virtual
        platforms.

    2. Storage Virtualization

        Storage virtualization is an increasingly important technology for the dynamic data
        center because it brings many of the same benefits to the IT table as server
        virtualization. Storage virtualization is an abstraction layer that decouples the storage
        interface from the physical storage, obfuscating where and how data is stored. This
        virtualization layer not only creates workload agility (not tied to a single storage
        infrastructure), but it also improves storage capacity utilization, decreases space and
        power consumption, and increases data availability. In fact, storage and server
        virtualization fit hand and glove, facilitating a more dynamic data center together by
        enabling workloads to migrate to any physical machine connected to the storage
        virtualization layer that houses the workload's data.

    3. Automation and orchestration


   The best automation and orchestration software can reduce the management
   complexity created by workload mobility. Workflow automation and orchestration tools
   use models and policies stored in configuration management databases (CMDBs) that
   describe the desired data center state and the actions that automation must take to
   keep the data center operating within administrator-defined parameters. These tools
   put the IT administrator in the role of the conductor, automating systems and workload
   management using policy-based administration.
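
        As a simplified illustration of this policy-driven model (a generic sketch; the policy
        structure, thresholds, and metrics feed below are assumptions, not any particular
        product's API), an orchestration loop compares observed state against administrator-
        defined policy and emits corrective actions:

            from dataclasses import dataclass

            @dataclass
            class Policy:
                name: str
                max_cpu_percent: float   # administrator-defined threshold
                action: str              # e.g. "rebalance" or "alert"

            def reconcile(observed, policies):
                # Compare observed host metrics against policy and list needed actions.
                actions = []
                for host, cpu in observed.items():
                    for p in policies:
                        if cpu > p.max_cpu_percent:
                            actions.append((host, p.action,
                                            f"cpu {cpu}% > {p.max_cpu_percent}%"))
                return actions

            policies = [Policy("balance-cpu", max_cpu_percent=80.0, action="rebalance")]
            observed = {"esx-host-01": 92.5, "esx-host-02": 41.0}   # hypothetical metrics feed
            for host, action, reason in reconcile(observed, policies):
                print(host, "->", action, "(", reason, ")")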

4. Unified Fabric with 10GbE

   10 Gigabit Ethernet (10GbE) is an important dynamic data center-enabling technology
   because it raises Ethernet performance to a level that can compete with specialized I/O
   fabrics, thereby potentially unifying multiple I/O buses into a single fabric. For example,
   SANs today operate on FC-based fabrics running at 2, 4, and now 8 Gb speeds. Ethernet,
   at 10 Gb, has the performance potential (i.e., bandwidth and latency) to carry both SAN
   and network communication I/O on the same medium.

   Just like 1GbE and Fast Ethernet before it, 10GbE will become the standard network
   interface shipped with every x86/64 server, increasing host connectivity to shared
   resources and increasing workload mobility. Using 10GbE as a universal medium,
   administrators can move workloads between physical servers without worrying about
   network or SAN connectivity.

   The development of Converged Enhanced Ethernet, or Cisco’s version called Data
   Center Ethernet, allows for lossless networks that can do without the overhead of
   TCP/IP and therefore rival Fibre Channel (FC) for transactional throughput. iSCSI is
   already a suitable alternative for many FC applications, but Fibre Channel over Ethernet
   (FCoE) should close the gap for those systems that still cannot tolerate the relative
   inefficiencies of iSCSI.

   One of the key benefits of a unified fabric within the data center is putting the network
   team in the role of managing all connectivity, including the storage networks, which are
   often managed by the storage administrators whose time could be better spent on
   managing the data and archival and recovery processes rather than the connectivity.


5. Desktop and Application Virtualization

6. Cloud computing




       Not for everything, dependent on service offerings, APIs

       Abstraction layer

       Internal and external



[Key concepts extracted from “The Dynamic Data Center” by The Burton Group]



3.7. Management Disciplines
    3.7.1.       Service Management
        3.7.1.1. Service Level Agreements
    3.7.2.       Project Management
    3.7.3.       Configuration Management
    3.7.4.       Data Management
        3.7.4.1. Backup and Recovery


        One of the essential elements of an effective business continuity plan is the backup
        and disaster recovery infrastructure. While the technical issues pertaining to hardware
        and software are critical to the implementation of an effective backup and disaster
        recovery plan, well-written standards and policies are the lynchpin of a successful
        backup and recovery program deployment. In this regard, development of backup and
        disaster recovery standards and policies is the responsibility of enterprise governance.
        In doing this, the Chancellor's Office must make decisions about the classification and
        retention of business information. This is not a trivial task: the complex legal
        compliance issues posed by the Family Educational Rights and Privacy Act (FERPA), the
        Health Insurance Portability and Accountability Act (HIPAA), and Sarbanes-Oxley (SOX)
        present a moving target with severe penalties for non-compliance.

               1. Assumptions:

                   a. Budget constraints drive the technological solution for backup/recovery.

                   b. RPO/RTO requirements are driven by business and legal constraints
                      (FERPA, HIPAA, SOX, etc.) and should be defined by enterprise
                      governance.

                   c. The requirements defined by the governance process will be a critical
                      factor in establishing recovery point objectives (RPO) and recovery time
                      objectives (RTO) and must be congruent with budget.




   d. RPO and RTO have a major influence on the backup/recovery
      technological solution.

   e. The backup window, i.e., the time slot within which backups must take
      place (so as to not interfere with production) is a significant constraint
      on the backup/recovery design.

   f.   A tiered approach to the backup/recovery architecture mitigates budget
        and backup window constraints.

   g. Sizing of the tiers is driven largely by the retention and recovery time
      objectives and corresponding budget constraints.

   h. Lower tiers generally set expectations for faster recovery time and
      lower cost of implementation.

   i.   Sizing of tiers of backup/recovery storage should be as large as is
        practicable. Subject to budget constraints, as much storage as possible
        at each tier means lower RPO/RTO.

2. Best Practices

   a. Work with CSU and SSU governance bodies to establish retention
       requirements for electronic media. Review and ensure that established
      retention requirements comply with state and federal legal
      requirements.

    b. Work with CSU and SSU governance bodies to establish business
       continuity and disaster recovery requirements.

   c. RPO and RTO objectives will have a significant impact on the design of
      the backup and disaster recovery plan: higher expectations for RPO and
      RTO will drive the costs of the requisite technological design.

   d. The backup/recovery architecture should be a tiered design consisting
      of:

           Tier 1

            The first tier of storage is volume snapshots for immediate recovery
            under user control. This typically is referred to as “snap reserve”
            and is set aside when the volume is configured.

            User control at tier one facilitates restoration of backup files
            without support from operations staff.

            Tier 2

                     The second tier of online storage is a disk target for backups, such
                     as a virtual tape library (VTL). This would be the target of system
                     backups, which more effectively utilize the available backup window.
                     The sizing of tier 2 drives the availability of the backup datasets
                     on the VTL: the higher the capacity of the disk target, the longer
                     these backup datasets can be retained. Recovery from disk is much
                     faster than from other archive media such as magnetic tape and hence
                     can support more aggressive recovery time objectives. De-duplication
                     should be employed to reduce the footprint of the backup dataset. (A
                     simple sizing sketch follows this list of practices.)

                   Tier 3

                     The third tier is deployed for longer-term archival of backup
                     datasets. Tier 3 strategies can be disk-based, but budget constraints
                     often rule out a disk-based solution. Magnetic tape is usually more
                     economical and is the chosen medium for archival systems.

           e. Magnetic media stored off-site should be encrypted using LTO-4 tape
              drives and a secure key management system. Key management is vital
              to ensure timely decryption. This is particularly true during a disaster
              recovery process when magnetic media must be utilized to recover on a
              remote site. Without effective and timely key management, the
              encrypted backup tapes are useless.

            f.   A copy of encrypted full backup datasets should be stored at another
                 CSU campus serviced by a different power grid and, if possible, in a
                 different seismic zone, or at a minimum in a different earthquake fault
                 zone as established by the California State Geologist. The schedule of
                 remote backup datasets will be determined by the recovery point
                 objective (RPO) established by the CSU governance body. The movement
                 and tracking of the encrypted backup datasets will be defined in a
                 memorandum of understanding between the sister campuses who engage in
                 the reciprocal agreement.

            g.   For more aggressive recovery time objective requirements, a WAN-
                 based remote backup system should be employed. While the cost of a
                 WAN-based solution is significantly higher than a magnetic media
                 solution, disaster recovery would be much faster. In addition, a WAN-
                 based system could also dovetail into a high-availability architecture
                 in which the backup datasets feed redundant systems on the sister
                 campus and provide for failover in the event of an outage in the
                 primary datacenter.
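
            To make the tier 2 sizing trade-off concrete, the following back-of-the-
            envelope sketch is illustrative only: the change rate, retention periods,
            and deduplication ratio are assumed inputs to be replaced with campus-
            specific figures from the governance process.

                def vtl_capacity_tb(full_backup_tb, daily_change_rate, retention_days,
                                    weekly_fulls_retained, dedupe_ratio):
                    # Usable disk needed to keep the stated retention on the VTL.
                    incrementals = full_backup_tb * daily_change_rate * retention_days
                    fulls = full_backup_tb * weekly_fulls_retained
                    return (incrementals + fulls) / dedupe_ratio

                estimate = vtl_capacity_tb(full_backup_tb=40.0,      # one full backup, TB
                                           daily_change_rate=0.05,   # 5% daily change
                                           retention_days=30,
                                           weekly_fulls_retained=4,
                                           dedupe_ratio=4.0)         # conservative assumption
                print("Approximate tier 2 capacity required: %.1f TB" % estimate)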


3.7.4.2. Archiving
3.7.4.3. Hierarchical Storage Management



    3.7.4.4. Document Management
3.7.5.       Asset Management
    3.7.5.1. Tagging/Tracking
    3.7.5.2. Licensing
    3.7.5.3. Software Distribution
3.7.6.       Problem Management
    3.7.6.1. Fault Detection
    3.7.6.2. Correction
    3.7.6.3. Reporting
3.7.7.       Security
    3.7.7.1. Data Security
    3.7.7.2. Encryption
    3.7.7.3. Authentication
    3.7.7.4. Antivirus protection
    3.7.7.5. OS update/patching
    3.7.7.6. Physical Security



