Data Center Best Practice and Architecture

Data Center Best Practices and Architecture for the California State University

Author(s): DCBPA Task Force
Date: June 9, 2009
Status: DRAFT
Version: 0.3.7

The content of this document is the result of the collaborative work of the Data Center Best Practice and Architecture (DCBPA) Task Force established under the Systems Technology Alliance committee within the California State University. Team members who directly contributed to the content of this document are listed below.

• Samuel G. Scalise, Sonoma, Chair of the STA and the DCBPA Task Force
• Don Lopez, Sonoma
• Jim Michael, Fresno
• Wayne Veres, San Marcos
• Mike Marcinkevicz, Fullerton
• Richard Walls, San Luis Obispo
• David Drivdahl, Pomona
• Ramiro Diaz-Granados, San Bernardino
• Don Baker, San Jose
• Victor Vanleer, San Jose
• David Stein, PlanNet Consulting
• Mark Berg, PlanNet Consulting
• Michel Davidoff, Chancellor’s Office

Table of Contents

1. Introduction
1.1. Purpose
1.2. Context
1.3. Audience
1.4. Development Process
1.5. Principles and Properties
2. Framework/Reference Model
3. Best Practice Components
3.1. Standards
3.2. Hardware Platforms
3.3. Software
3.4. Delivery Systems
3.5. Disaster Recovery
3.6. Total Enterprise Virtualization
3.7. Management Disciplines

1. Introduction

1.1. Purpose

As society and institutions of higher education increasingly benefit from technology and collaboration, the importance of identifying mutual best practices and architecture makes this document vital to the behind-the-scenes infrastructure of the university. Key drivers behind the gathering and assimilation of this collection are:

• Many campuses want to know what the others are doing so they can draw from a knowledge base of successful initiatives and lessons learned. Having a head start in thinking through operational practices and effective architectures, as well as narrowing vendor selection for hardware, software and services, creates efficiencies in time and cost.
• Campuses are impacted financially, and data center capital and operating expenses need to be curbed.
For many, current growth trends are unsustainable: limited square footage cannot accommodate the demand for more servers and storage unless new technologies are implemented to virtualize and consolidate.
• Efficiencies in power and cooling need to be achieved in order to address green initiatives and reduce carbon footprint. They are also expected to translate into real cost savings in an energy-conscious economy. Environmentally sound practices are increasingly the mandate and could result in measurable controls on higher energy consumers.
• Creating uniformity across the federation of campuses allows for consolidation of certain systems, reciprocal agreements between campuses to serve as tertiary backup locations, and opt-in subscription to services hosted at campuses with capacity to support other campuses, such as the C-cubed initiative.

1.2. Context

This document is a collection of Best Practices and Architecture for California State University Data Centers. It identifies practices and architecture associated with the provision and operation of mission-critical, production-quality servers in a multi-campus university environment. The scope focuses on the physical hardware of servers, their operating systems, essential related applications (such as virtualization, backup systems and log monitoring tools), the physical environment required to maintain these systems, and the operational practices required to meet the needs of the faculty, students, and staff. Data centers that adopt these practices and architecture should be able to house any end-user service, from Learning Management Systems to calendaring tools to file-sharing.

This work represents the collective experience and knowledge of data center experts from the 23 campuses and the chancellor’s office of the California State University system.
It is coordinated by the Systems Technology Alliance, whose charge is to advise the Information Technology Advisory Committee (made up of campus Chief Information Officers and key Chancellor’s Office personnel) on matters relating to servers (i.e., computers which provide a service for other computers connected via a network) and server applications.

This is a dynamic, living document that can be used to guide planning to enable collaborative systems, funding, procurement, and interoperability among the campuses and with vendors.

This document does not prescribe services used by end-users, such as Learning Management Systems or Document Management Systems. As those services and applications are identified by end-users such as faculty and administrators, this document will describe the data center best practices and architecture needed to support such applications.

Campuses are not required to adopt the practices and architecture elucidated in this document. There may be extenuating circumstances that require alternative architectures and practices. However, it is hoped that these alternatives are documented in this process. It is not the goal to describe a single solution, but rather the range of best solutions that meet the diverse needs of diverse campuses.

1.3. Audience

This information is intended to be reviewed by key stakeholders who have material knowledge of data center facilities and service offerings from business, technical, operational, and financial perspectives.

1.4. Development Process

The process for creating and updating these best Practices and Architecture (P&A) is to identify the most relevant P&A, inventory existing CSU P&A for key aspects of data center operations, identify current industry trends, and document those P&A which best meet the needs of the CSU. This will include information about related training and costs, so that campuses can adopt these P&A with a full understanding of the costs and required expertise.
The work of creating this document will be conducted by members of the Systems Technology Alliance appointed by the campus Chief Information Officers, by members of the Chancellor’s Office Technology Infrastructure Services group, and by contracted vendors.

1.5. Principles and Properties

In deciding which Practices and Architecture should be adopted, it is important to have a set of criteria that reflect the unique needs, values, and goals of the organization. These Principles and Properties include:

• Cost-effectiveness
• Long-term viability
• Flexibility to support a range of services
• Security of the systems and data
• Reliable and dependable uptime
• Environmental compatibility
• Redundancy
• High availability
• Performance
• Training
• Communication

Additionally, the architecture should emphasize criteria that are standards-based. The CSU will implement standards-based solutions in preference to proprietary solutions where this does not compromise the functional implementation.

The CSU seeks to adhere to standard ITIL practices and workflows where practical. Systems and solutions described herein should relate to corresponding ITIL and service management principles.

2. Framework/Reference Model

The framework is used to describe the components and management processes that lead to a holistic data center design. Data centers are as much about the services offered as they are the equipment and space contained in them. Taken together, these elements should constitute a reference model for a specific CSU campus implementation.

2.1. Standards

2.1.1. ITIL

The Information Technology Infrastructure Library is a set of concepts around managing services and operations. The model was developed by the UK Office of Government Commerce and has been refined and adopted internationally. The ITIL version 2 framework for Service Support breaks out several management disciplines that are incorporated in this CSU reference architecture (see Section 2.7).
ITIL version 3 has reworked the framework into a collection of five volumes that describe:

• Service Strategy
• Service Design
• Service Transition
• Service Operation
• Continual Service Improvement

2.1.2. ASHRAE

The American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) releases updated standards and guidelines for industry consideration in building design. They include recommended and allowable environment envelopes, such as temperature, relative humidity, and altitude, for spaces housing datacomm equipment. The purpose of the recommended envelope is to give guidance to data center operators on maintaining high reliability while operating their data centers in the most energy-efficient manner.

2.1.3. Uptime Institute

The Uptime Institute addresses architectural, security, electrical, mechanical, and telecommunications design considerations. See Section 2.4.1.1 for specific information on tiering standards as applied to data centers.

2.2. Hardware Platforms

2.2.1. Servers

Types

• Rack-mounted Servers – provide the foundation for any data center’s compute infrastructure. The most common are 1U and 2U; these form factors compose what is known as the volume market. The high-end market, geared towards high-performance computing (HPC) or applications that need more input/output (I/O) and/or storage, is composed of 4U to 6U rack-mounted servers. The primary distinction between volume-market and high-end servers is their I/O and storage capabilities.
• Blade Servers – are defined by the removal of many components (power supply units (PSUs), network interface cards (NICs) and storage adapters) from the server itself. These components are grouped together as part of the blade chassis and shared by all the blades. The chassis is the piece of equipment that all of the blade servers “plug” into. The blade servers themselves contain processors, memory and a hard drive or two.
One of the primary caveats to selecting the blade server option is the uncertainty of future blade/chassis compatibility. Most independent hardware vendors (IHVs) do not guarantee blade/chassis compatibility beyond two generations or five years. Another potential caveat is the high initial investment in blade technology because of additional costs associated with the chassis.
• Towers – There are two primary reasons for using tower servers: price and remote locations. Towers offer the least expensive entrance into the server platform market. Towers can also be placed outside the confines of a data center. This feature can be useful for locating an additional Domain Name System (DNS) server or backup server in a remote office for redundancy purposes.

Principles

1. Application requirements – Applications with high I/O requirements, such as databases and backup servers, are better suited to HPC rack-mounted servers. Applications such as web servers and message transfer agents (MTAs) work well in a volume-market rack-mounted environment or even in a virtual server environment. These applications allow servers to be easily added and removed to meet spikes in capacity demand. The need to have servers that are physically located at different sites for redundancy or ease of administration can be met by tower servers, especially for low-demand applications. Applications with high I/O requirements perform better on 1U or 2U rack-mounted servers than on blade servers because stand-alone servers have a dedicated I/O interface rather than the common one found on the chassis of a blade server.
2. Software support – can determine the platform an application lives on. Some vendors refuse to support virtual servers, making VMs unsuitable if support is a key requirement. Some software does not support multiple instances of an application, requiring the application to run on a single large server rather than multiple smaller servers.
3. Storage – requirements can vary from a few gigabytes, to accommodate the operating system, application and state data for application servers, to terabytes, to support large database servers. Applications requiring large amounts of storage should be SAN-attached using Fibre Channel or iSCSI. Fibre Channel offers greater reliability and performance but requires a higher skill level from SAN administrators. Support for faster speeds and improved reliability is making iSCSI more attractive. Direct Attached Storage (DAS) is still prevalent because it is less costly and easier to manage than SAN storage. Rack-mounted 4U to 6U servers have the space to house a large number of disk drives and make suitable DAS servers.
4. Consolidation – projects can result in several applications being combined onto a single server or virtualized. Care must be taken when combining applications to ensure they are compatible with each other and that vendor support can be maintained. Virtualization accomplishes consolidation by allowing each application to behave as if it were running on its own server. The benefits of consolidation include reduced power and space requirements and fewer servers to manage.
5. Energy efficiency – starts with proper cooling design, server utilization management and power management. Replacing old servers with newer energy-efficient ones reduces energy use and cooling requirements, and the replacements may be eligible for rebates that allow them to pay for themselves.
6. Improved management – Many data centers contain “best of breed” technology. They contain server platforms and other devices from many different vendors. Servers may be from vendor A, storage from vendor B and network from vendor C. This complicates troubleshooting and leads to finger pointing. Reducing the number of vendors produces standardization and is more likely to allow a single management interface for all platforms.
7. Business growth/New services – As student enrollment grows and the number of services to support them increases, the capacity the data center needs to run its applications and store its data also increases. This is the most common reason for buying new server platforms. IT administrators must use a variety of gauges to anticipate this need and respond in time.

2.2.1.1. Server Virtualization

Principles

1. Reliability and availability: An implementation of server virtualization should provide increased reliability of servers and services by providing for server failover in the event of a hardware loss of service, as well as high availability by ensuring that access to shared services like network and disk is fault-tolerant and balanced by load.
2. Reuse: Server virtualization should allow better utilization of hardware and resources by provisioning multiple services and operating environments on the same hardware. Care must be taken to ensure that hardware is operating within the limits of its capacity. Effective capacity planning becomes especially important.
3. Consumability: Server virtualization should allow us to provide quickly available server instances, using technologies such as cloning and templating when appropriate.
4. Agility: Server virtualization should allow us to improve organizational efficiency by provisioning servers and services faster through rapid deployment of instances using cloning and templates.
5. Administration: Server virtualization will improve administration by providing a single, secure, easy-to-access interface to all virtual servers.

2.2.2. Storage

2.2.2.1. SAN – Storage Area Network

2.2.2.1.1. Fibre Channel

2.2.2.1.2. iSCSI

1. Benefits
1.1. Reduced costs: By leveraging existing network components (network interface cards [NICs], switches, etc.) as a storage fabric, iSCSI increases the return on investment (ROI) made for data center network communications and potentially saves capital investments required to create a separate storage network.
For example, iSCSI host bus adapters (HBAs) are 30-40% less expensive than Fibre Channel HBAs. Also, in some cases, 1 Gigabit Ethernet (GbE) switches are 50% less expensive than comparable Fibre Channel switches. Organizations employ qualified network administrator(s) or trained personnel to manage network operations. Being a network protocol, iSCSI leverages existing network administration knowledge bases, obviating the need for additional staff and educational training to manage a different storage network.
1.2. Improved options for DR: One of iSCSI's greatest strengths is its ability to travel long distances using IP wide area networks (WANs). Offsite data replication plays a key part in disaster recovery plans by preserving company data at a co-location that is protected by distance from a disaster affecting the original data center. Using a SAN router (iSCSI to Fibre Channel gateway device) and a target array that supports standard storage protocols (like Fibre Channel), iSCSI can replicate data from a local target array to a remote iSCSI target array, eliminating the need for costly Fibre Channel SAN infrastructure at the remote site. iSCSI-based tiered storage solutions such as backup-to-disk (B2D) and near-line storage have become popular disaster recovery options. Using iSCSI in conjunction with Serial Advanced Technology Attachment (SATA) disk farms, B2D applications inexpensively back up, restore, and search data at rapid speeds.
1.3. Boot from SAN: As operating system (OS) images migrate to network storage, boot from SAN (BfS) becomes a reality, allowing chameleon-like servers to change application personalities based on business needs, while removing ties to the Fibre Channel HBAs previously required for SAN connectivity (a hardware initiator would still be required).

2. Components
2.1. Initiators
2.1.1. Software Initiators: While software initiators offer cost-effective SAN connectivity, there are some issues to consider.
The first is host resource consumption versus performance. An iSCSI initiator runs within the input/output (I/O) stack of the operating system, utilizing the host memory space and CPU for iSCSI protocol processing. By leveraging the host, an iSCSI initiator can outperform almost any hardware-based initiator. However, as more iSCSI packets are sent or received by the initiator, more memory and CPU bandwidth is consumed, leaving less for applications. The amount of resource consumption is highly dependent on the host CPU, NIC, and initiator implementation, but resource consumption could be problematic in certain scenarios. Software iSCSI initiators can consume additional resource bandwidth that could otherwise be partitioned for supplemental virtual machines.
2.1.2. Hardware Initiators: iSCSI HBAs simplify boot-from-SAN (BfS). Because an iSCSI HBA is a combination NIC and initiator, it does not require assistance to boot from the SAN, unlike its software initiator counterparts. By discovering a bootable target LUN during system power-on self test (POST), an iSCSI HBA can enable an OS to boot from an iSCSI target like any DAS or Fibre Channel SAN-connected system. In terms of resource utilization, an iSCSI HBA offloads both TCP and iSCSI protocol processing, saving host CPU cycles and memory. In certain scenarios, like server virtualization, an iSCSI HBA may be the only choice where CPU processing power is consequential.
2.2. Targets
2.2.1. Software Targets: Any standard server can be used as a software target storage array, but it should be deployed as a stand-alone application. A software target can consume most of a platform's resources, leaving little room for additional applications.
2.2.2. Hardware Targets: Many of the iSCSI disk array platforms are built using the same storage platform as their Fibre Channel cousins. Thus, many iSCSI storage arrays are similar, if not identical, to Fibre Channel arrays in terms of reliability, scalability, performance, and management.
Other than the controller interface, the remaining product features are almost identical.
2.3. Tape Libraries
2.3.1. Tape libraries should be capable of being iSCSI target devices; however, broad adoption and support in this category has not been seen, and it remains a territory served by native Fibre Channel connectivity.
2.4. Gateways and Routers
2.4.1. iSCSI to Fibre Channel gateways and routers play a vital role in two ways. First, these devices increase the return on capital invested in Fibre Channel SANs by extending connectivity to “Ethernet islands,” where devices that were previously unable to reach the Fibre Channel SAN can tunnel through using a router or gateway. Second, iSCSI routers and gateways enable Fibre Channel to iSCSI migration. SAN migration is a gradual process. Replacing a large investment in Fibre Channel SANs at one time is not financially realistic. As IT administrators carefully migrate from one interconnect to another, iSCSI gateways and routers afford them the luxury of time and money.
One note of caution: It's important to know the port speeds and amount of traffic passing through a gateway or router. These devices can become bottlenecks if too much traffic from one network is aggregated into another. For example, some router products offer eight 1 GbE ports and only two 4 Gb Fibre Channel ports. While the aggregate bandwidth on each side is the same (8 Gb), careful attention must be paid to ensure traffic is evenly distributed across ports.
Any x86 server can act as an iSCSI to Fibre Channel gateway. Using a Fibre Channel HBA and iSCSI target software, any x86 server can present LUNs from a Fibre Channel SAN as an iSCSI target. Once again, this is not a turnkey solution, especially for large SANs, and caution should be exercised to prevent performance bottlenecks. However, this configuration can be cost-effective for small environments and for connectivity to a single Fibre Channel target or small SAN.
2.5. Internet Storage Name Service (iSNS)
2.5.1. Voracious storage consumption, combined with lower-cost SAN devices, has stimulated SAN growth beyond what administrators can manage without help. iSCSI exacerbates this problem by proliferating iSCSI initiators and low-cost target devices throughout a boundless IP network. Thus, a discovery and configuration service like iSNS is a must for large SAN configurations. Although other discovery services exist for iSCSI SANs, such as Service Location Protocol (SLP), iSNS is emerging as the most widely accepted solution.

3. Security

4. Multi-path support

2.2.2.2. NAS – Network Attached Storage
2.2.2.3. DAS – Direct Attached Storage
2.2.2.4. Storage Virtualization

2.3. Software
2.3.1. Operating Systems
2.3.2. Middleware
2.3.2.1. Identity Management
2.3.3. Databases
2.3.4. Core/Enabling Applications
2.3.4.1. Email
2.3.4.1.1. Spam Filtering
2.3.4.2. Web Services
2.3.4.3. Calendaring
2.3.4.4. DNS
2.3.4.5. DHCP
2.3.4.6. Syslog
2.3.4.7. Desktop Virtualization
2.3.4.8. Application Virtualization
2.3.5. Third Party Applications
2.3.5.1. LMS
2.3.5.2. CMS
2.3.5.3. Help Desk/Ticketing

2.4. Delivery Systems

2.4.1. Facilities

2.4.1.1. Tiering Standards

The industry standard for measuring data center availability is the tiering metric developed by The Uptime Institute, which addresses architectural, security, electrical, mechanical, and telecommunications design considerations. The higher the tier, the higher the availability. Tier descriptions include information like raised floor heights, watts per square foot, and points of failure. “Need,” or “N,” indicates the level of redundant components for each tier, with N representing only the necessary system need. Construction cost per square foot is also provided and varies greatly from tier to tier, with Tier 3 costing double that of Tier 1.
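Each tier in this section is quoted with both an availability percentage and an annual downtime figure, which are two views of the same quantity. The following is a minimal sketch of the conversion, assuming a 365-day year (8,760 hours) and taking the tier percentages from this section as inputs. The function and dictionary names are illustrative only and are not drawn from any Uptime Institute material.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8,760 hours

# Availability percentages as quoted for each tier in this section.
TIER_AVAILABILITY_PERCENT = {
    "Tier 1": 99.671,
    "Tier 2": 99.741,
    "Tier 3": 99.982,
    "Tier 4": 99.995,
}


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the hours per year a site is unavailable at this availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * HOURS_PER_YEAR


if __name__ == "__main__":
    for tier, percent in TIER_AVAILABILITY_PERCENT.items():
        hours = annual_downtime_hours(percent)
        print(f"{tier}: {percent}% available -> {hours:.1f} hours down per year")
```

Under these assumptions the arithmetic gives roughly 28.8, 22.7, 1.6 and 0.4 hours for Tiers 1 through 4. The Tier 2 result differs slightly from the 22.0 hours quoted in this section, which suggests the published figure was rounded or derived on a different basis.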
Tier 1 – Basic: 99.671% Availability
• Susceptible to disruptions from both planned and unplanned activity
• Single path for power and cooling distribution, no redundant components (N)
• May or may not have a raised floor, UPS, or generator
• Typically takes 3 months to implement
• Annual downtime of 28.8 hours
• Must be shut down completely to perform preventative maintenance

Tier 2 – Redundant Components: 99.741% Availability
• Less susceptible to disruption from both planned and unplanned activity
• Single path for power and cooling distribution, includes redundant components (N+1)
• Includes raised floor, UPS, and generator
• Typically takes 3 to 6 months to implement
• Annual downtime of 22.0 hours
• Maintenance of power path and other parts of the infrastructure requires a processing shutdown

Tier 3 – Concurrently Maintainable: 99.982% Availability
• Enables planned activity without disrupting computer hardware operation, but unplanned events will still cause disruption
• Multiple power and cooling distribution paths but with only one path active, includes redundant components (N+1)
• Includes raised floor and sufficient capacity and distribution to carry load on one path while performing maintenance on the other
• Typically takes 15 to 20 months to implement
• Annual downtime of 1.6 hours

Tier 4 – Fault Tolerant: 99.995% Availability
• Planned activity does not disrupt critical load, and the data center can sustain at least one worst-case unplanned event with no critical load impact
• Multiple active power and cooling distribution paths, includes redundant components (2 (N+1), i.e. 2 UPS each with N+1 redundancy)
• Typically takes 15 to 20 months to implement
• Annual downtime of 0.4 hours

Trying to achieve availability above Tier 4 introduces a level of complexity that some believe yields diminishing returns. EYP, which manages HP’s data center design practice, says its empirical data shows no additional uptime in return for the considerable cost of trying to reduce downtime below 0.4 hours, because of the human element introduced in managing the complexities of the many redundant systems.

2.4.1.2. Spatial Guidelines and Capacities
2.4.1.3. Electrical Systems
2.4.1.4. HVAC Systems
2.4.1.5. Fire Protection & Life Safety
2.4.1.6. Access Control
2.4.1.7. Commissioning

2.4.2. Load Balancing/High Availability

2.4.3. Connectivity

2.4.3.1. Network

Network components in the data center (such as Layer 3 backbone switches, WAN edge routers, perimeter firewalls, and wireless access points) are described in the ITRP2 Network Baseline Standard Architecture and Design document, developed by the Network Technology Alliance, sister committee to the Systems Technology Alliance. Latest versions of the standard can be located at http://nta.calstate.edu/ITRP2.shtml.

Increasingly, boundaries are blurring between systems and networks. Virtualization is causing an abstraction of traditional networking components and moving them into software and the hypervisor layer.

Virtual switches

Considerations beyond “common services”

The following components have elements of network enabling services but are also systems-oriented and may be managed by the systems or applications groups.

1. DNS
For privacy and security reasons, many large enterprises choose to make only a limited subset of their systems “visible” to external parties on the public Internet. This can be accomplished by creating a separate Domain Name System (DNS) server with entries for these systems, and locating it where it can be readily accessible by any external user on the Internet (e.g., locating it in a DMZ LAN behind external firewalls to the public Internet).
Other DNS servers containing records for internally accessible enterprise resources may be provided as “infrastructure servers” hidden behind additional firewalls in “trusted” zones in the data center. This division of responsibility permits the DNS server with records for externally visible enterprise systems to be exposed to the public Internet, while reducing the security exposure of DNS servers containing the records of internal enterprise systems.

2. E-Mail (MTA only)
For security reasons, large enterprises may choose to distribute e-mail functionality across different types of e-mail servers. A message transfer agent (MTA) server that only forwards Simple Mail Transfer Protocol (SMTP) traffic (i.e., no mailboxes are contained within it) can be located where it is readily accessible to other enterprise e-mail servers on the Internet. For example, it can be located in a DMZ LAN behind external firewalls to the public Internet. Other e-mail servers containing user agent (UA) mailboxes for enterprise users may be provided as “infrastructure servers” located behind additional firewalls in “trusted” zones in the data center. This division of responsibility permits the “external” MTA server to communicate with any other e-mail server on the public Internet, but reduces the security exposure of “internal” UA e-mail servers.

3. Voice Media Gateway
The data center site media gateway will include analog or digital voice ports for access to the local PSTN, possibly including integrated services digital network (ISDN) ports. With Ethernet IP phones, the VoIP gateway is used for data center site phone users to gain local dial access to the PSTN. The VoIP media gateway converts voice calls between packetized IP voice traffic on a data center site network and local circuit-switched telephone service.
With this configuration, the VoIP media gateway operates under the control of a call control server located at the data center site, or out in the ISP public network as part of an “IP Centrex” or “virtual PBX” service. However, network operators/carriers increasingly are providing a SIP trunking interface between their IP networks and the PSTN; this will permit enterprises to send VoIP calls across IP WANs to communicate with PSTN devices without the need for a voice media gateway or direct PSTN interface. Instead, data center site voice calls can be routed through the site’s WAN edge IP routers and data network access links. 4. Ethernet L2 Virtual Switch In a virtual server environment, the hypervisor manages L2 connections from virtual hosts to the NIC(s) of the physical server. A hypervisor plug-in module may be available to allow the switching characteristics to emulate a specific type of L2 switch so that it can be managed apart from the hypervisor and incorporated into the enterprise NMS. 5. Top-of-Rack Fabric Switches As a method of consolidating and aggregating connections from dense rack configurations in the data center, top-of-rack switching has emerged as a way to provide both Ethernet and Fiber Channel connectivity in one platform. Generally, these devices connect to end-of-row switches that, optimally, can manage all downstream devices as one switching fabric. The benefits are a modularized approach to server and storage networks, reduced cross connects and better cable management. 2.4.3.1.1. Network Virtualization 17 2.4.3.1.2. Structured Cabling The approach to structured cabling in a data center differs from other aspects of building wiring due to the following issues:    Managing higher densities, particularly fiber optics Cable management, especially with regard to moves, adds and changes Heat control, for which cable management plays a role The following are components of structured cabling design in the data center: 1. 
Cable types: Cabling may be copper (shielded or unshielded) or fiber optic (single mode or multi mode).
2. Cabling pathways: usually a combination of raised floor access and overhead cable tray. Cables under a raised floor should be in channels that protect them from adjacent systems, such as power and fire suppression.
3. Fiber ducts: fiber optic cabling has specific stress and bend radius requirements to protect the transmission of light. Duct systems designed for fiber take into account the proper routing and storage of strands, pigtails and patch cords among the distribution frames and splice cabinets.
4. Fiber connector types: usually MT-RJ, LC, SC or ST. The use of modular fiber “cassettes” and trunk cables allows for higher densities and the benefit of factory terminations rather than terminations in the field, which can be time-consuming and subject to higher dB loss.
5. Cable management:

2.4.4. Operations
2.4.4.1. Staffing
2.4.4.2. Training
2.4.4.3. Monitoring
2.4.4.4. Console Management
2.4.4.5. Remote Operations

2.4.5. Accounting
2.4.5.1. Auditing
The CSU publishes findings and campus responses to information security audits. Reports can be found at the following site: http://www.calstate.edu/audit/audit_reports/information_security/index.shtml

2.5. Disaster Recovery
2.5.1. Relationship to overall campus strategy for Business Continuity
2.5.2. Relationship to CSU Remote Backup – DR initiative
ITAC has sponsored an initiative to explore business continuity and disaster recovery partnerships between CSU campuses. [Charter document?] Several campuses have teamed to develop documents and procedures, and their work product is posted at http://drp.sharepointsite.net/itacdrp/default.aspx. Examples of operational considerations, memorandums of understanding, and network diagrams are given in Section 3.5.4.2.

2.5.3. Infrastructure considerations
2.5.3.1.
Colocation
One method of accomplishing business continuity objectives through redundancy with geographic diversity is to use a colocation scenario, either through a reciprocal agreement with another campus or through a commercial provider. The following are typical types of colocation arrangements:
• Real estate investment trusts (REITs): REITs offer leased shared data center facilities in a business model that leverages tax laws to offer savings to customers.
• Network-neutral colocation: Network-neutral colocation providers offer leased rack space, power, and cooling with the added service of peer-to-peer network cross-connection.
• Colocation within a hosting center: Hosting centers may offer colocation as a basic service with the ability to upgrade to various levels of managed hosting.
• Unmanaged hosted services: Hosting centers may offer a form of semi-colocation wherein the hosting provider owns and maintains the server hardware for the customer, but does not manage the operating system or the applications/services that run on that hardware.

Principles for colocation selection criteria
1. Business process includes or provides an e-commerce solution
2. Business process does not contain applications and services that were developed and are maintained in-house
3. Business process does not predominantly include internal infrastructure or support services that are not web-based
4. Business process contains predominantly commodity and horizontal applications and services (such as email and database systems)
5. Business process requires geographically distant locations for disaster recovery or business continuity
6. Colocation facility meets the level of reliability objective (Tier I, II, III, or IV) at less cost than retrofitting or building new campus data centers
7. Access to particular IT staff skills and the available bandwidth of current IT staff
8. Level of SLA matches the campus requirements, including those for disaster recovery
9.
Colocation provider can accommodate regulatory auditing and reporting for the business process
10. Current data center facilities have run out of space, power, or cooling
[concepts from Burton Group article, “Host, Co-Lo, or Do-It-Yourself?”]

2.5.4. Operational considerations
2.5.4.1. Recovery Time Objectives and Recovery Point Objectives discussed in 2.7.4.1 (Backup and Recovery)

2.6. Total Enterprise Virtualization

2.7. Management Disciplines
2.7.1. Service Management
2.7.1.1. Service Catalog
2.7.1.2. Service Level Agreements
2.7.2. Project Management
2.7.3. Configuration Management
Configuration Management is the process of creating and maintaining an up-to-date record of all components of the infrastructure.
1. Functions associated with Configuration Management are:
• Planning
• Identification
• Control
• Status Accounting
• Verification and Audit
2. Configuration Management Database (CMDB) - A database that contains details about the attributes and history of each Configuration Item (CI) and details of the important relationships between CIs. The information held may be in a variety of formats (textual, diagrammatic, photographic, etc.); it is effectively a data map of the physical reality of the IT infrastructure.
3. Configuration Item - Any component of an IT infrastructure which is (or is to be) under the control of Configuration Management.
4. The lowest level CI is normally the smallest unit that will be changed independently of other components. CIs may vary widely in complexity, size and type, from an entire service (including all its hardware, software, documentation, etc.) to a single program module or a minor hardware component.

2.7.4. Data Management
2.7.4.1. Backup and Recovery
2.7.4.2. Archiving
2.7.4.3. Hierarchical Storage Management
2.7.4.4. Document Management

2.7.5. Asset Management
Effective data center asset management is necessary for both regulatory and contractual compliance.
It can improve life cycle management and facilitate inventory reductions by identifying under-utilized hardware and software, potentially resulting in significant cost savings. An effective management process requires combining current Information Technology Infrastructure Library (ITIL) and Information Technology Asset Management (ITAM) best practices with accurate asset information, ongoing governance and asset management tools. The best systems/tools should be capable of asset discovery; should manage all aspects of the assets, including physical, financial and contractual; and should support life cycle management, with Web interfaces for real-time access to the data. Recognizing that sophisticated systems may be prohibitively expensive, asset management for smaller environments may be handled with spreadsheets or a simple database. Optimally, a system that could be shared among campuses while maintaining restricted permission levels, such as the Network Infrastructure Asset Management System (NIAMS), http://www.calstate.edu/tis/cass/niams.shtml, would allow for more comprehensive and uniform participation.

The following are asset categories to be considered in a management system:
• Physical Assets – to include the grid, floor space, tile space, racks and cables. The layout of space and the utilization of the attributes above are themselves an asset that needs to be tracked both logically and physically.
• Network Assets – to include routers, switches, firewalls, load balancers, and other network related appliances.
• Storage Assets – to include Storage Area Networks (SAN), Network Attached Storage (NAS), tape libraries and virtual tape libraries.
• Server Assets – to include individual servers, blade servers and enclosures.
• Electrical Assets – to include Uninterruptible Power Supplies (UPS), Power Distribution Units (PDU), breakers, outlets (NEMA noted), circuit number and grid location of same.
Power consumption is another example of a logical asset that needs to be monitored by the data center manager in order to maximize server utilization and understand, if not reduce, associated costs.
• Air Conditioning Assets – to include air conditioning units, air handlers, chiller plants and other airflow related equipment. Airflow in this instance may be considered a logical asset as well, but the usage plays an important role in a data center environment. Rising energy costs and concerns about global warming require data center managers to track usage carefully. Computational fluid dynamics (CFD) modeling can serve as a tool for maximizing airflow within the data center.
• Data Center Security and Safety Assets – Media access controllers, cameras, fire alarms, environmental surveillance, access control systems and access cards/devices, and fire and life safety components, such as fire suppression systems.
• Logical Assets – T1s, PRIs and other communication lines, air conditioning, electrical power usage. Most important in this logical realm is the management of the virtual environment. Following is a list of logical assets or associated attributes that would need to be tracked:
  o A list of Virtual Machines
  o Software licenses in use in data center
  o Virtual access to assets
    ▪ VPN access accounts to data center
    ▪ Server/asset accounts local to the asset
• Information Assets – to include text, images, audio, video and other media. Information is probably the most important asset a data center manager is responsible for. The definition is: An information asset is a definable piece of information, stored in any manner, recognized as valuable to the organization. Users must have accurate, timely, secure and personalized access to this information.
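The asset categories above lend themselves to a simple inventory record of the kind a spreadsheet or simple database might hold in a smaller environment. The following Python sketch is illustrative only: the field names, the identifier scheme and the sample entries are assumptions made for the example and are not drawn from NIAMS or any CSU system.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class AssetCategory(Enum):
    """The nine asset categories listed in this section."""
    PHYSICAL = "Physical"
    NETWORK = "Network"
    STORAGE = "Storage"
    SERVER = "Server"
    ELECTRICAL = "Electrical"
    AIR_CONDITIONING = "Air Conditioning"
    SECURITY_SAFETY = "Data Center Security and Safety"
    LOGICAL = "Logical"
    INFORMATION = "Information"


@dataclass(frozen=True)
class Asset:
    asset_id: str            # hypothetical identifier scheme
    name: str
    category: AssetCategory
    grid_location: str = ""  # assumed floor-grid coordinate; empty for logical assets


def count_by_category(assets):
    """Return a dict mapping each category present to its number of assets."""
    counts = defaultdict(int)
    for asset in assets:
        counts[asset.category] += 1
    return dict(counts)


# Made-up sample inventory, for illustration only.
inventory = [
    Asset("A-001", "Core router", AssetCategory.NETWORK, "C4"),
    Asset("A-002", "Blade enclosure", AssetCategory.SERVER, "D7"),
    Asset("A-003", "Virtual machine list", AssetCategory.LOGICAL),
    Asset("A-004", "PDU, row D", AssetCategory.ELECTRICAL, "D1"),
    Asset("A-005", "Edge firewall", AssetCategory.NETWORK, "C5"),
]
summary = count_by_category(inventory)
```

Keeping the category as a fixed enumeration rather than free text is one way to keep entries uniform if a record like this were shared among campuses; further attributes could be added as extra fields in the same way.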
The following are asset groupings to be considered in a management system:
• By Security Level
  o Confidentiality
  o FERPA
  o HIPAA
  o PCI
• By Support Organization
  o Departmental
  o Computer Center Supported
  o Project Team
• By Criticality
  o Critical (e.g. 24x7 availability)
  o Business Hours only (e.g. 8 AM - 7 PM)
  o Noncritical
• By Funding Source (useful for recurring costs)
  o Departmental funded
  o Project funded
  o Division funded

2.7.5.1. Tagging/Tracking
2.7.5.2. Licensing
2.7.5.3. Software Distribution

2.7.6. Problem Management
Problem Management investigates the underlying cause of incidents, and aims to prevent incidents of a similar nature from recurring. By removing errors, which often requires a structural change to the IT infrastructure in an organization, the number of incidents can be reduced over time. Problem Management should not be confused with Incident Management. Problem Management seeks to remove the causes of incidents permanently from the IT infrastructure, whereas Incident Management deals with the symptoms of incidents. Problem Management is proactive while Incident Management is reactive.
2.7.6.1. Fault Detection - A problem is a condition often identified as a result of multiple incidents that exhibit common symptoms. Problems can also be identified from a single significant incident, indicative of a single error, for which the cause is unknown, but for which the impact is significant.
2.7.6.2. Correction - An iterative process to diagnose known errors until they are eliminated by the successful implementation of a change under the control of the Change Management process.
2.7.6.3. Reporting - Summarizes Problem Management activities. Includes number of repeat incidents, problems, open problems, repeat problems, etc.

2.7.7. Security
2.7.7.1. Data Security
2.7.7.2. Encryption
2.7.7.3. Authentication
2.7.7.4. Antivirus protection
2.7.7.5. OS update/patching
2.7.7.6. Physical Security

3. Best Practice Components
3.1. Standards
3.1.1. ITIL
3.1.2.
ASHRAE
ASHRAE modified their operational envelope for data centers with the goal of reducing energy consumption. For extended periods of time, the IT manufacturers recommend that data center operators maintain their environment within the recommended envelope. Exceeding the recommended limits for short periods of time should not be a problem, but running near the allowable limits for months could result in increased reliability issues. Based on a review of the available data from a number of IT manufacturers, the 2008 expanded recommended operating envelope is the agreed-upon envelope that is acceptable to all the IT manufacturers, and operation within this envelope will not compromise overall reliability of the IT equipment. Following are the previous and 2008 recommended envelope data:

                        2004 Version      2008 Version
Low End Temperature     20°C (68 °F)      18°C (64.4 °F)
High End Temperature    25°C (77 °F)      27°C (80.6 °F)
Low End Moisture        40% RH            5.5°C DP (41.9 °F)
High End Moisture       55% RH            60% RH & 15°C DP (59 °F DP)

3.1.3. Uptime Institute

3.2. Hardware Platforms
3.2.1. Servers
3.2.1.1. Server Virtualization
1. Practices
a. Production hardware should run the latest stable release of the selected hypervisor, with patching and upgrade paths defined and pursued on a scheduled basis. Each hardware element (e.g. blade) should be dual-attached to the data network and storage environment to provide for load balancing and fault tolerance.
b. Virtual machine templates should be developed, tested and maintained to allow for consistent OS, maintenance and middleware levels across production instances. These templates should be used to support cloning of new instances as required and systematic maintenance of production instances as needed.
c. Virtual machines should be provisioned using a defined work order process that allows for an effective understanding of server requirements and billing/accounting expectations.
• This process should allow for interaction between requestor and provider to ensure appropriate configuration and acceptance of any fee-for-service arrangements.
d. Virtual machines should be monitored for CPU, memory, network and disk usage. Configurations should be modified, with the participation of the service-owning unit, to ensure an optimum balance between required and committed capacity.
• Post-provisioning capacity analysis should be performed frequently via a formal, documented process. For example, a 4 vCPU virtual machine with 8 gigabytes of RAM that is using less than 10% of 1 vCPU and 500 megabytes of RAM should be adjusted to ensure that resources are not wasted.
e. Virtual machine boot/system disks should be provisioned into a LUN maintained in the storage environment to ensure portability of server instances across hardware elements.
f. To reduce I/O contention, virtual machines with high performance or high capacity requirements should have their non-boot/system disks provisioned using dedicated LUNs mapped to logical disks in the storage environment.
g. Virtual machines should be administered using a central console/resource such as VMware VirtualCenter. However, remote KVM functionality should also be implemented to support remote hypervisor installation and patching and remote hardware maintenance.
h. A virtual development environment should be implemented, allowing for development and testing of new server instances/templates, changes to production instances/templates, hypervisor upgrades and testing of advanced high-availability features.

3.2.2. Storage
1. Practices
a. Develop a meaningful naming convention when defining components of the SAN. This readily identifies components, quickly presents information about them and reduces the likelihood of misunderstandings.
b. Use different types of storage to produce tiers. High-speed Fibre Channel drives are not necessary for all applications.
A mix of Fibre Channel drives with SATA drives of various capacities and speeds produces a SAN that balances performance with cost.
c. Utilize NDMP when able for NFS. Whenever possible, use NDMP to back up NFS files on the SAN. Backups are faster and network traffic is reduced. However, make sure the target for the backup is not the same storage device as the SAN if you are doing disk-to-disk backups.
d. Isolate iSCSI from regular network traffic. SAN iSCSI traffic should be on its own network using its own switches for performance and security reasons.
e. Ensure partition alignment between servers and SAN disks. Misalignment of partitions will cause one or two additional I/Os for every read or write. This can have a huge performance impact on large files or databases.
f. Use thin provisioning where possible. Some SAN vendor tools allow you to dynamically grow and shrink storage allocations. A large amount of storage can be assigned to a server but is not allocated until it is needed.
g. Use deduplication where possible. Deduplication on VTLs has been in use for quite a while, but deduplication of primary storage is fairly new. Keep an eye on the technology and begin to apply it conservatively.

3.2.2.1. Storage Virtualization
1. To abstract storage hardware for device/array independence
2. To provide replication/mirroring for higher availability, DR/BC, etc.

3.3. Software
3.3.1. Operating Systems
3.3.2. Middleware
3.3.2.1. Identity Management
3.3.3. Databases
3.3.4. Core/Enabling Applications
3.3.4.1. Email
3.3.4.1.1. Spam Filtering
3.3.4.2. Web Services
3.3.4.3. Calendaring
3.3.4.4. DNS
3.3.4.5. DHCP
3.3.4.6. Syslog
3.3.4.7. Desktop Virtualization
3.3.4.8. Application Virtualization
3.3.5. Third Party Applications
3.3.5.1. LMS
3.3.5.2. CMS
3.3.5.3. Help Desk/Ticketing

3.4. Delivery Systems
3.4.1. Facilities
3.4.1.1. Tiering Standards
3.4.1.2. Spatial Guidelines and Capacities
1.
The Data Center should be located in an area with no flooding sources directly above or adjacent to the room. Locate building water and sprinkler main pipes, and toilets and sinks, away from areas alongside or above the Data Center location.
2. The Data Center is to be constructed with slab-to-slab walls and a minimum one-hour fire rating. All penetrations through walls, floors, and ceilings are to be sealed with an approved sealant with fire ratings equal to the penetrated construction.
3. FM-200 gaseous fire suppression will be used, possibly supplemented with VESDA smoke detection with 24/7 monitoring.
4. Provide a 20 millimeter vapor barrier for the Data Center envelope to allow maintaining proper humidity. Provide R-21 minimum insulation. If any walls are common to the outside, increase the R factor to achieve the same effect based on the exterior wall construction.
5. All walls of the Data Center shall be full height, from slab to the underside of the floor above.
6. All paint within the Data Center shall be moisture control/anti-flake type.
7. The redundant UPS systems and related critical electrical equipment should be located in separate rooms external to the data center. Redundant systems that are not connected together, such as non-parallel UPS systems, shall be located in separate rooms where not cost prohibitive.
8. Provide at least one set of double doors or one 3’-6” wide door into the Data Center to facilitate the movement of large equipment.
9. There will be a Network Operation Center (NOC) outside the data center; KVMs will be located inside the Data Center as well as in the NOC.
10. A portion of future data center expansion space may be used for staging, storage and test/development.
11. Interior windows, if any, on fire-rated walls between the Data Center and adjacent office spaces shall be one-hour rated. No windows are permitted in Data Center exterior walls.
12. Column spacing shall be 30’ x 30’ or greater.
13.
The concrete floor beneath the raised access floor in the Data Center is to be sealed with an appropriate water-based sealer prior to the installation of the raised access floor. IT equipment cabinets can weigh up to 2,000 lbs each. Subfloor load capacity shall not be less than 150 lbs/SF for the Data Center and 250 lbs/SF for electrical and mechanical support areas.
14. Ceiling load capacity for suspension of data and power cable tray, ladder rack and HVAC ductwork shall not be less than 50 lbs/SF.
15. A steel roof shall be provided. Consider installing a redundant roof or fireproof gypsum board ceiling (especially if the original roof is wooden). Roof load capacity shall not be less than 40 lbs/SF, with the capability to add structural steel for support of HVAC heat rejection equipment.
16. HVAC heat rejection equipment may be installed on grade if space allows, otherwise on the roof. Utility power transformers, standby generators and switchgear will be installed on grade.
17. Provide two drains with anti-backflow in the floor unless cost prohibitive.
18. Provide a minimum ceiling height of 12.5' clear from the floor slab to the bottom of beams in the Data Center.
19. A 24” raised access floor is required (please refer to the details in this document). The access floor will be used as the primary cool air supply plenum and should be kept free of cables, piping or other equipment that can obstruct airflow.
20. Power cabling shall be routed in raceway overhead, mounted on equipment racks or suspended overhead from the ceiling.
21. Data cabling and fiber shall be routed overhead in cable tray or on ladder rack.
22. All doors into the Data Center and Lab areas are to be secured with locks. All doors into the Data Center are to have self-closing devices. All doors should have the appropriate fire rating per the NFPA and local codes. The doors shall have a full 180-degree swing.
23.
All construction in the Data Center is to be complete a minimum of three weeks prior to occupancy. This includes walls, painting, backboards, doors, frames, hardware, windows, VCT floor, ceiling grid and tile (if specified), lights, sprinklers, fire suppression systems, electrical, HVAC and UPS. Temporary power and HVAC are not acceptable. Rooms must be wet-mop cleaned to remove all dust prior to installing equipment.
24. Orient light fixtures to run between and parallel with equipment rows in the Data Center.
25. Motion sensors shall be used for data center lighting control for energy efficiency.

3.4.1.3. Electrical Systems
1. Electrical Infrastructure
• Maximize UPS Unit Loading
  o When using battery based UPSs, design the system to maximize the load factor on operating UPSs. Use of multiple smaller units can provide the same level of redundancy while still maintaining higher load factors, where UPS systems operate most efficiently. For more information, see Chapter 10 of the Design Guidelines Sourcebook.
• Specify Minimum UPS Unit Efficiency at Expected Load Points
  o There are a wide variety of UPSs offered by a number of manufacturers at a wide range of efficiencies. Include minimum efficiencies at a number of typical load points when specifying UPSs. Compare offerings from a number of vendors to determine the best efficiency option for a given UPS topology and feature set. For more information, see Chapter 10 of the Design Guidelines Sourcebook.
• Evaluate UPS Technologies for the Most Efficient
  o New UPS technologies that offer the potential for higher efficiencies and lower maintenance costs are in the process of being commercialized. Consider the use of systems such as flywheel or fuel cell UPSs when searching for efficient UPS options. For more information, see Chapter 10 of the Design Guidelines Sourcebook.
2. Lighting
• Use Occupancy Sensors
  o Occupancy sensors can be a good option for datacenters that are infrequently occupied.
Thorough area coverage with occupancy sensors or an override should be used to ensure the lights stay on during installation procedures when a worker may be 'hidden' behind a rack for an extended period.
• Provide Bi-Level Lighting
  o Provide two levels of clearly marked, easily actuated switching so the lighting level can be easily changed between normal, circulation space lighting and a higher power detail work lighting level. The higher power lighting can normally be left off but still be available for installation and other detail tasks.
• Provide Task Lighting
  o Provide dedicated task lighting specifically for installation detail work to allow for the use of lower, circulation space and halls level lighting through the datacenter area.

3.4.1.4. HVAC Systems
1. Mechanical Air Flow Management
• Hot Aisle/Cold Aisle Layout
• Blank Unused Rack Positions
  o Standard IT equipment racks exhaust hot air out the back and draw cooling air in the front. Openings that form holes through the rack should be blocked in some manner to prevent hot air from being pulled forward and recirculated back into the IT equipment. For more information, see Chapter 1 of the Design Guidelines Sourcebook.
• Use Appropriate Air Diffusers
• Position supply and returns to minimize mixing
  o Diffusers should be located to deliver air directly to the IT equipment. At a minimum, diffusers should not be placed such that they direct air at rack or equipment heat exhausts, but rather direct air only towards where IT equipment draws in cooling air. Supplies and floor tiles should be located only where there is load to prevent short circuiting of cooling air directly to the returns; in particular, do not place perforated floor supply tiles near computer room air conditioning units using the as a return air path. For more information, see Chapters 1 and 2 of the Design Guidelines Sourcebook.
• Minimize Air Leaks in Raised Floor
2.
Mechanical Air Handler Systems
• Use Redundant Air Handler Capacity in Normal Operations
  o With the use of Variable Speed Drives and chilled water based air handlers, it is most efficient to maximize the number of air handlers operating in parallel at any given time. Power usage drops approximately with the square of the velocity, so operating two units at 50% capacity uses less total energy than a single unit at full capacity. For more information, see Chapter 3 of the Design Guidelines Sourcebook.
• Configure Redundancy to Reduce Fan Power Use in Normal Operation
  o When multiple small distributed units are used, redundancy must be equally distributed. Achieving N+1 redundancy can require the addition of a large number of extra units, or the oversizing of all units. A central air handler system can achieve N+1 redundancy with the addition of a single unit. The redundant capacity can be operated at all times to provide a lower air handler velocity and an overall fan power reduction, since fan power drops with the square of the velocity. For more information, see Chapter 3 of the Design Guidelines Sourcebook.
• Control Volume by Variable Speed Drive on Fans Based on Space Temperature
  o The central air handlers should use variable fan speed control to minimize the volume of air supplied to the space. The fan speed should be varied in series with the supply air temperature in a manner that reduces fan speed to the minimum speed possible before increasing supply air temperature above a reasonable set point. Typically, supply air of 60 °F is appropriate to provide the sensible cooling required by datacenters. For more information, see Chapters 1 and 3 of the Design Guidelines Sourcebook.
3.
Mechanical Humidification
• Use Widest Suitable Humidity Control Band
• Centralize Humidity Control
• Use Lower Power Humidification Technology
  o There are several options for lower power, non-isothermal humidification, including air or water pressure based 'fog' systems, air washers, and ultrasonic systems. For more information, see Chapter 7 of the Design Guidelines Sourcebook.
4. Mechanical Plant Operation
• Use Free Cooling / Waterside Economization
  o Free cooling provides cooling using only the cooling tower and a heat exchanger. It is very attractive in dry climates and for facilities where local concerns about outside air quality discourage the use of standard airside economizers. For more information, see Chapters 4 and 6 of the Design Guidelines Sourcebook.
• Monitor System Efficiency
  o Install reliable, accurate monitoring of key plant metrics such as kW/ton. The first cost of monitoring can be quickly recovered by identifying common efficiency problems, such as low refrigerant charge, non-optimal compressor mapping, incorrect sensors, incorrect pumping control, etc. Efficiency monitoring provides the information needed for facilities personnel to optimize the system's energy performance during buildout, avoid efficiency decay and troubleshoot developing equipment problems over the life of the system. For more information, see Chapter 4 of the Design Guidelines Sourcebook.
• Rightsize the Cooling Plant
  o Due to the critical nature of the load and the unpredictability of future IT equipment loads, datacenter cooling plants are oversized. The design should recognize that the standard operating condition will be at part load and optimize for efficiency accordingly.
Consistent part-load operation dictates using well-known design approaches to part-load efficiency, such as utilizing redundant towers to improve approach, using multiple chillers with variable speed drive, variable speed pumping throughout, chiller staging optimized for part-load operation, etc. For more information, see Chapter 4 of the Design Guidelines Sourcebook.

3.4.1.5. Fire Protection & Life Safety
3.4.1.6. Access Control
3.4.1.7. Commissioning
1. Commissioning and Retrocommissioning
• Perform a Peer Review
  o A peer review offers the benefit of having the design evaluated by a professional without the preconceived assumptions that the main designer will inevitably develop over the course of the project. Often, efficiency, reliability and cost benefits can be achieved through the simple process of having a fresh set of eyes, unencumbered by the myriad small details of the project, review the design and offer suggestions for improvement.
• Engage a Commissioning Agent
  o Commissioning is a major task that requires considerable management and coordination throughout the design and construction process. A dedicated commissioning agent can ensure that commissioning is done in a thorough manner, with a minimum of disruption and cost.
• Document Testing of All Equipment and Control Sequences
  o Develop a detailed testing plan for all components. The plan should encompass all expected sequence of operation conditions and states. Perform testing with the support of all relevant trades; it is most efficient if small errors in the sequence or programming can be corrected on the spot rather than relegated to the back and forth of a traditional punchlist. Functional testing performed for commissioning does not take the place of equipment startup testing, control point-to-point testing or other standard installation tests.
• Measure Equipment Energy Onsite
  o Measure and verify that major pieces of equipment meet the specified efficiency requirements.
Chillers in particular can have seriously degraded cooling efficiency due to minor installation damage or errors with no outward symptoms, such as loss of capacity or unusual noise.
• Provide Appropriate Budget and Scheduling for Commissioning
  o Commissioning is a separate, non-standard procedure that is necessary to ensure the facility is constructed to and operating at peak efficiency. Additional time commitment beyond a standard construction project will be required from the contractors. Coordination meetings dedicated to commissioning are often required at several points during construction to ensure a smooth and effective commissioning.
• Perform Full Operational Testing of All Equipment
  o Commissioning testing of all equipment should be performed after the full installation of the systems is complete, immediately prior to occupancy. Normal operation and all failure modes should be tested. In many critical facility cases, the use of load banks to produce a realistic load on the system is justified to ensure system reliability under design conditions.
• Perform a Full Retrocommissioning
  o Many older datacenters may never have been commissioned, and even if they have been, performance degrades over time. Perform a full commissioning and correct any problems found. Where control loops have been overridden due to immediate operational concerns, such as locking out condenser water reset due to chiller instability, diagnose and correct the underlying problem to maximize system efficiency, effectiveness, and reliability.
• Recalibrate All Control Sensors
• Where Appropriate, Install Efficiency Monitoring Equipment
  o As a rule, a thorough retrocommissioning will locate a number of low-cost or no-cost areas where efficiency can be improved. However, without a simple means of continuous monitoring, the persistence of the savings is likely to be low.
    A number of simple metrics (cooling plant kW/ton, economizer hours of operation, humidification/dehumidification operation, etc.) should be identified, continuously monitored, and displayed so that facilities personnel can recognize when system efficiency has been compromised.

3.4.2. Load Balancing/High Availability
3.4.3. Connectivity
3.4.3.1. Network
3.4.3.1.1. Network Virtualization

1. Use of VLANs for LAN security and network management
   a. Controlling network access through VLAN assignment
   b. Using VLANs to present multiple virtual subnets within a given physical subnet
   c. Using VLANs to present one virtual subnet across portions of many physical subnets
2. Use of MPLS for WAN security and network management
   a. Multiprotocol Label Switching (MPLS) is used to create Virtual Private Networks (VPNs) that provide traffic isolation and differentiation.

3.4.3.1.2. Structured Cabling
3.4.4. Operations
3.4.4.1. Staffing
3.4.4.2. Training
3.4.4.3. Monitoring
3.4.4.4. Console Management
3.4.4.5. Remote Operations
3.4.5. Accounting
3.5. Disaster Recovery
3.5.1. Relationship to overall campus strategy for Business Continuity
3.5.2. Relationship to CSU Remote Backup – DR initiative
3.5.3. Infrastructure considerations
3.5.4. Operational considerations
3.5.4.1. Recovery Time Objectives and Recovery Point Objectives discussed in 2.7.3.1 (Backup and Recovery)
3.5.4.2. Resource-sharing between campuses: Cal State Fullerton/San Francisco State example

CSU, Fullerton has established a Business Continuity computing site at San Francisco State University. The site allows a number of critical computing functions to remain in service in the event of severe infrastructure disruption, such as complete failure of both CENIC links to the Internet, failure of the network core on the Fullerton campus, or complete shutdown of the Fullerton Data Center. CSU, Fullerton is the only CSU campus to establish such extensive off-site capabilities.
The goal of the site is to provide continuity for the most critical computing services that can be supported in a cost-effective manner. Complete duplication of every central computing resource on the Fullerton campus is prohibitively expensive, and San Francisco State’s Data Center could not provide the space for that much equipment. (The same is true for SFSU: Fullerton does not have the space, cooling, or electrical capacity to duplicate all the equipment at SFSU.)

Major capabilities include continuity of access to:

• The main campus website: www.fullerton.edu
• The faculty/student Portal: my.fullerton.edu
• Faculty/staff email
• Student email (provided by Google, but accessed via the Fullerton Portal)
• Blackboard Learning Solutions (hosted by Blackboard ASP, but accessed via the Fullerton Portal)
• CMS HR, Finance, and Student applications (hosted by Unisys in Salt Lake City, but accessed via the Fullerton Portal)
• Brass Ring H.R. recruitment system (hosted by Brass Ring)
• OfficeMax ordering system
• GE Capital state procurement card system

With these capabilities, educational and business activities can continue while the infrastructure interruption is resolved.

In addition to complete operation, the remote site can substitute for specific resources, such as the campus website. This “granular” approach provides significant flexibility for responding to specific computing issues in the Data Center.

A number of significant resources were found to be too costly to duplicate off site, including:

• CMS Data warehouse
• Filenet document repository
• Voice mail
• IBM mainframe (which is to be decommissioned in December 2008)

The project does not provide continuity for resources provided outside of Fullerton IT.

The project began with initial discussions between the CIOs of Fullerton and San Francisco in 2006, where they agreed in concept to provide limited “hosting” for equipment from the other campus.
An arrangement was worked out with CENIC, the statewide data network provider, to use features of the “Border Gateway Protocol” to allow a campus to switch a portion of the campus Internet Protocol (IP) space from its main campus to the remote campus in a matter of seconds. This capability was perfected and tested in the summer of 2007.

A major innovation was the use of the “backup” CENIC link at SFSU to provide Internet access for Fullerton’s remote site. This completely avoids Fullerton having any impact on the SFSU network. We use no IP addresses at SF. They make no changes to their firewall or routers. We need no network ports. And all Internet traffic to our site goes through the backup CENIC link, not through SFSU’s primary link.

Fullerton purchased an entire cabinet of equipment, including firewall, network switch, servers, and remote management of keyboards and power plugs. The equipment was tested locally and transported to San Francisco in January 2008. The site became operational in February 2008.

All maintenance and monitoring of the remote equipment is done through the Internet from the Fullerton campus. Staff at San Francisco provide no routine assistance. This has several important benefits: (1) it places little burden on the remote “host”; (2) it avoids the need to train staff at the remote campus and retrain them when personnel turn over; and (3) it allows the entire remote site to be relocated to a different remote host with almost no change.

Because the remote site is connected to Fullerton through a VPN tunnel, the same Op Manager software that monitors equipment on the Fullerton campus also monitors the servers at SFSU. Any unexpected server conditions at SFSU automatically trigger alerts to Fullerton staff.

To avoid unexpected disruption, the Continuity site is activated manually by accessing the Firewall through the Internet and changing a few rules.
Experience with the unexpected consequences of setting up “automated” systems prompted this design.

The SFSU site contains a substantial capability, including:

• Domain controllers for Fullerton’s two Microsoft Active Directory domains (AD and ACAD). This provides secure authentication to the portal and email servers, and would allow Active Directory to be rebuilt if the entire Fullerton Data Center were destroyed.
• Web Server
• Portal Server
• Application Servers for running batch jobs and synchronizing the Portal databases
• Microsoft Exchange email servers
• Blackberry Enterprise Server (BES) to provide continuity for Blackberry users
• SQL database servers
• CMS Portal servers (this capability is not fully implemented yet because CMS Student is just coming “on line” during 2008)
• Email gateway server

Because the SFSU site constantly replicates domain updates and refreshes the Portal database daily, the SFSU site is a better source of much information than the tape backups kept at Iron Mountain.

Network Diagrams
Figure 1 – Intercampus network diagram
Figure 2 – Remote LAN at DR site

3.5.4.3. Memorandum of Understanding

Campuses striking partnerships in order to share resources and create geographic diversity for data stores will want to document the terms of their agreement, representing elements such as physical space, services, access rights and effective dates. Following is a sample template (sourced from ITAC DR site):

MOU SSU-SJSU
Sonoma State University
Rider A, Page 1 of 1

MEMORANDUM OF UNDERSTANDING

This MEMORANDUM OF UNDERSTANDING is entered into this 1st day of September, 2008, by and between Sonoma State University, Computer Operations, Information Technology Department and San Jose State University, University Computing and Telecomm.

Each campus will provide, for use by the other campus: 2 each - 19” racks, mounting equipment specification (EIA-310-C) and requisite power and cooling for same.
Power provided by Sonoma State University will include uninterruptible power supply (UPS) and diesel generator backup power.

PHYSICAL LOCATION OF HOSTED EQUIPMENT
Sonoma State University will station the two 19” racks for San Jose State University in the Sonoma State University Information Technology data center. Collocated network devices will reside in a physically segregated LAN. San Jose State University will station the two racks for Sonoma State University …

LOGICAL LOCATION OF HOSTED EQUIPMENT
Equipment stationed by SJSU in the two racks in the SSU data center will be provisioned in a segregated security zone behind a firewall interface whose effective security policy is specified by the hosted campus. Security policy change requests for the firewall by SJSU will be accomplished within 7 days of receipt by SSU.

PHYSICAL ACCESS BY SISTER CAMPUS
Physical access to the two racks in the SSU data center by SJSU support staff will be granted according to the escorted visitor procedures in place at SSU. Physical access is available during normal business hours (8am – 5pm) by appointment, or emergency access is available on a best effort basis by contacting the SSU emergency contact listed in this document. Physical access to the two racks in the SJSU data center by SSU support staff will be granted according to …

SERVICES PROVIDED BY HOST CAMPUS
SSU Computer Operations staff will provide limited services to SJSU as required during normal business hours (8am – 5pm) or after hours on a best effort basis by contacting the SSU emergency contact listed in this document. Limited services are defined as tasks not to exceed 1 hour of labor, such as visiting a system console for diagnostic purposes, power cycling a system, or another simple task that cannot be performed remotely by SJSU.
SJSU Computer Operations staff will provide limited services to SSU as required during normal business hours (8am – 5pm) or after hours on a best effort basis by contacting the SJSU emergency contact listed in this document. Limited services are defined as tasks not to exceed 1 hour of labor, such as visiting a system console for diagnostic purposes, power cycling a system, or another simple task that cannot be performed remotely by SSU.

SECURITY REQUIREMENTS
Level-one data in transit or at rest must be encrypted. Each campus will conform to CSU information security standards as they may apply to equipment stationed at the sister campus and cooperate with the sister campus’ Information Security Officer pertaining to audit findings on their collocated servers.

The point of contact person for Sonoma State University will be Mr. Samuel Scalise (707 664-3065, scalise@sonoma.edu). The point of contact person for San Jose State University will be Mr. Don Baker (408 924-7820, don.baker@sjsu.edu). The emergency point of contact person for Sonoma State University will be Mr. Don Lopez (707 291-4970, don.lopez@sonoma.edu). The emergency point of contact for San Jose State University will be …

No charges to either party, since equipment, services and support are mutual.

The term of this MOU shall be September 1, 2008 through June 30, 2009.

3.6. Total Enterprise Virtualization

Virtualization is key to allowing IT organizations to remain nimble in the face of the increased complexity of application provisioning and delivery, and to making the most of the unused capacity of compute and storage resources. And while virtualizing servers and storage are obvious targets for optimization, total enterprise virtualization would encompass additional layers, extending to desktop virtualization and application virtualization.
Data centers that are able to deliver services dynamically to meet demand will have other virtualization layers as well, such as virtualizing connectivity to storage systems and the network.

The following are characteristics of a dynamic data center:

• Enables workload mobility
• Automatically managed through orchestration
• Seamlessly leverages external services
• Service-oriented
• Highly available
• Energy and space efficient
• Utilizes a unified fabric
• Secure and regulatory compliant

Achieving these characteristics requires investments in some of the following key enabling technologies:

1. Server Virtualization
Server virtualization's ability to abstract the system hardware away from the workload (i.e., guest OS + application) enables the workload to move from one system to another without hardware compatibility worries. In turn, this opens up a whole new world of IT agility possibilities that enable the administrator to dynamically shift workloads to different IT resources for any number of reasons, including better resource utilization, greater performance, high availability, disaster recovery, server maintenance, and even energy efficiency. Imagine a data center that can automatically optimize workloads based on spare CPU cycles from highly energy-efficient servers.

Issues to be aware of in server virtualization: licensing, hardware capabilities, application support, and administrators' willingness to entrust critical applications to virtual platforms.

2. Storage Virtualization
Storage virtualization is an increasingly important technology for the dynamic data center because it brings many of the same benefits to the IT table as server virtualization. Storage virtualization is an abstraction layer that decouples the storage interface from the physical storage, obfuscating where and how data is stored.
This virtualization layer not only creates workload agility (not tied to a single storage infrastructure), but it also improves storage capacity utilization, decreases space and power consumption, and increases data availability. In fact, storage and server virtualization fit hand in glove, facilitating a more dynamic data center together by enabling workloads to migrate to any physical machine connected to the storage virtualization layer that houses the workload's data.

3. Automation and orchestration
The best automation and orchestration software can reduce the management complexity created by workload mobility. Workflow automation and orchestration tools use models and policies stored in configuration management databases (CMDBs) that describe the desired data center state and the actions that automation must take to keep the data center operating within administrator-defined parameters. These tools put the IT administrator in the role of the conductor, automating systems and workload management using policy-based administration.

4.
Unified Fabric with 10GbE
10 Gigabit Ethernet (10GbE) is an important dynamic data center-enabling technology because it raises Ethernet performance to a level that can compete with specialized I/O fabrics, thereby potentially unifying multiple I/O buses into a single fabric. For example, SANs today operate on FC-based fabrics running at 2, 4, and now 8 Gb/s. Ethernet, at 10 Gb/s, has the performance potential (i.e., bandwidth and latency) to carry both SAN and network communication I/O on the same medium. Just like 1GbE and Fast Ethernet before it, 10GbE will become the standard network interface shipped with every x86/64 server, increasing host connectivity to shared resources and increasing workload mobility. Using 10GbE as a universal medium, administrators can move workloads between physical servers without worrying about network or SAN connectivity.

The development of Converged Enhanced Ethernet, or Cisco’s version called Data Center Ethernet, allows for lossless networks that can do without the overhead of TCP/IP and therefore rival Fibre Channel (FC) for transactional throughput. iSCSI is already a suitable alternative for many FC applications, but Fibre Channel over Ethernet (FCoE) should close the gap for those systems that still cannot tolerate the relative inefficiencies of iSCSI.

One of the key benefits of a unified fabric within the data center is that it puts the network team in the role of managing all connectivity, including the storage networks. Storage networks are often managed by storage administrators, whose time could be better spent on managing the data and the archival and recovery processes rather than the connectivity.

5. Desktop and Application Virtualization

6. Cloud computing
• Not for everything; dependent on service offerings, APIs
• Abstraction layer
• Internal and external

[Key concepts extracted from “The Dynamic Data Center” by The Burton Group]

3.7. Management Disciplines
3.7.1. Service Management
3.7.1.1. Service Level Agreements
3.7.2.
Project Management
3.7.3. Configuration Management
3.7.4. Data Management
3.7.4.1. Backup and Recovery

One of the essential elements of an effective business continuity plan resides within the backup and disaster recovery infrastructure component. While the technical issues pertaining to hardware and software are critical to the implementation of an effective backup and disaster recovery plan, well-written standards and policies are the linchpin of a successful backup and recovery program deployment. In this regard, development of backup and disaster recovery standards and policies is the responsibility of enterprise governance. In doing this, the Chancellor’s Office must make decisions about the classification and retention of business information. This is not a trivial task, as evidenced by the complex legal compliance issues posed by the Family Educational Rights and Privacy Act (FERPA), the Health Insurance Portability and Accountability Act (HIPAA), and the Sarbanes-Oxley (SOX) statutes, which present a moving target with severe penalties for non-compliance.

1. Assumptions:
   a. Budget constraints drive the technological solution for backup/recovery.
   b. RPO/RTO requirements are driven by business and legal constraints (FERPA, HIPAA, SOX, etc.) and should be defined by enterprise governance.
   c. The requirements defined by the governance process will be a critical factor in establishing recovery point objectives (RPO) and recovery time objectives (RTO) and must be congruent with budget.
   d. RPO and RTO have a major influence on the backup/recovery technological solution.
   e. The backup window, i.e., the time slot within which backups must take place (so as not to interfere with production), is a significant constraint on the backup/recovery design.
   f. A tiered approach to the backup/recovery architecture mitigates budget and backup window constraints.
   g. Sizing of the tiers is driven largely by the retention and recovery time objectives and corresponding budget constraints.
   h.
Lower tiers generally set expectations for faster recovery time and lower cost of implementation.
   i. Sizing of tiers of backup/recovery storage should be as large as is practicable. Subject to budget constraints, as much storage as possible at each tier means lower RPO/RTO.

2. Best Practices
   a. Work with CSU and SSU governance bodies to establish retention requirements for electronic media. Review and ensure that established retention requirements comply with state and federal legal requirements.
   b. Work with CSU and SSU governance bodies to establish business continuity and disaster recovery requirements.
   c. RPO and RTO objectives will have a significant impact on the design of the backup and disaster recovery plan: higher expectations for RPO and RTO will drive up the costs of the requisite technological design.
   d. The backup/recovery architecture should be a tiered design consisting of:

      • Tier 1
        The first tier of storage is volume snapshots for immediate recovery under user control. This is typically referred to as “snap reserve” and is set aside when the volume is configured. User control at tier one facilitates restoration of backup files without support from operations staff.

      • Tier 2
        The second tier of online storage is a disk target for backups, such as a virtual tape library (VTL). This would be the target of system backups, which more effectively utilize the available backup window. The sizing of tier 2 will drive the availability of the backup datasets on the VTL: the higher the capacity of the disk target, the longer these backup datasets can be retained. Recovery time from disk is much faster than from other archive media such as magnetic tape and hence can support more aggressive recovery time objectives. Deduplication should be employed to reduce the footprint of the backup dataset.

      • Tier 3
        The third tier is deployed for longer-term archival of backup datasets. Tier 3 strategies can be disk-based, but budget constraints often rule out disk-based solutions.
        Magnetic tape solutions are usually more economical and are the chosen media for archival systems.

   e. Magnetic media stored off-site should be encrypted using LTO-4 tape drives and a secure key management system. Key management is vital to ensure timely decryption. This is particularly true during a disaster recovery process, when magnetic media must be utilized to recover on a remote site. Without effective and timely key management, the encrypted backup tapes are useless.
   f. A copy of encrypted full backup datasets should be stored at another CSU campus serviced by a different power grid and, if possible, in a different seismic zone, or at a minimum in a different earthquake fault zone as established by the California State Geologist. The schedule of remote backup datasets will be determined by the recovery point objective (RPO) established by the CSU governance body. The movement and tracking of the encrypted backup datasets will be defined in a memorandum of understanding between the sister campuses that engage in the reciprocal agreement. For more aggressive recovery time objective requirements, a WAN-based remote backup system should be employed. While the cost of a WAN-based solution is significantly higher than that of a magnetic media solution, disaster recovery would be much faster. In addition, a WAN-based system could also dovetail into a high availability architecture in which the backup datasets feed redundant systems on the sister campus and provide for failover in the event of an outage in the primary datacenter.
   g.

3.7.4.2. Archiving
3.7.4.3. Hierarchical Storage Management
3.7.4.4. Document Management
3.7.5. Asset Management
3.7.5.1. Tagging/Tracking
3.7.5.2. Licensing
3.7.5.3. Software Distribution
3.7.6. Problem Management
3.7.6.1. Fault Detection
3.7.6.2. Correction
3.7.6.3. Reporting
3.7.7. Security
3.7.7.1. Data Security
3.7.7.2. Encryption
3.7.7.3. Authentication
3.7.7.4. Antivirus protection
3.7.7.5. OS update/patching
3.7.7.6. Physical Security
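
Illustrative note on 3.7.4.1 (Backup and Recovery): the relationship between Tier 2 sizing, retention, and recovery time objectives described in that section can be sketched with some simple arithmetic. The following Python sketch is only an illustration under stated assumptions, not a method defined by the task force. The names DiskTarget, retention_days, restore_hours and meets_rto, and every capacity, ratio and throughput figure, are hypothetical. It also assumes a constant deduplication ratio and a sustained restore throughput, which real systems will not deliver exactly.

```python
from dataclasses import dataclass


@dataclass
class DiskTarget:
    """Hypothetical description of a tier 2 disk target (e.g. a VTL)."""
    usable_tb: float        # usable capacity in TB (assumed figure)
    daily_backup_tb: float  # logical size of one day's backups in TB (assumed)
    dedup_ratio: float      # logical:stored ratio, e.g. 10.0 for 10:1 (assumed)


def retention_days(target: DiskTarget) -> int:
    """Whole days of backups the target can hold before it fills."""
    return int(target.usable_tb * target.dedup_ratio / target.daily_backup_tb)


def restore_hours(dataset_tb: float, throughput_mb_s: float) -> float:
    """Hours to restore a dataset at a sustained throughput (decimal units)."""
    seconds = dataset_tb * 1_000_000 / throughput_mb_s  # 1 TB = 1,000,000 MB
    return seconds / 3600


def meets_rto(dataset_tb: float, throughput_mb_s: float, rto_hours: float) -> bool:
    """True if the estimated restore time fits within the stated RTO."""
    return restore_hours(dataset_tb, throughput_mb_s) <= rto_hours


if __name__ == "__main__":
    # All figures below are made-up inputs for illustration only.
    vtl = DiskTarget(usable_tb=40.0, daily_backup_tb=2.0, dedup_ratio=10.0)
    print("Retention on disk (days):", retention_days(vtl))
    print("Restore time (hours):", restore_hours(3.6, 200.0))
    print("Meets an 8-hour RTO:", meets_rto(3.6, 200.0, 8.0))
```

A calculation of this kind makes the trade-off in assumptions 1.g and 1.i concrete: increasing the capacity of the disk target lengthens the period for which backup datasets stay on fast media, while the restore throughput of each tier bounds the recovery time objective it can support.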
