Server Consolidation: Phase 1 Evaluation Report
Virtual Server Evaluation Results
8/15/06 Version 0.3 Author: Dave Klein
THINKING AT THE EDGE
Document Control

Change Record
Date      Author       Version   Change Reference
7/31/06   Dave Klein   0.1
8/6/06    Dave Klein   0.2
8/8/06    Dave Klein   0.3

Reviewers
Sign Off Date   Reviewer   Position   Sign Off

Contributors
Version   Names
0.2       Shawn Duncan
UC Santa Cruz
Table of Contents

DOCUMENT CONTROL
    Change Record
    Reviewers
    Contributors
TABLE OF CONTENTS
1 INTRODUCTION
    1.1 Executive Summary
    1.2 Project Description (excerpts from scorecard)
    1.3 Scope: Server Consolidation and Storage Consolidation
    1.4 Phase 1 summary (VM environment)
    1.5 Phase 1 timeline
    1.6 Guidance from Gartner reports
        1.6.1 Consolidation Opportunities (from Gartner summary)
        1.6.2 Server Consolidation Action Plan
2 EVALUATION GOALS, METHODS AND RESULTS
    2.1 Proof of Concept system architecture
        2.1.1 System Specifications
    2.2 Gather virtual server requirements
    2.3 Validate virtual platform (VMware) functionality
        2.3.1 Operating systems tested (supported by ESX 2.5.x)
    2.4 Validate physical server hardware functionality
    2.5 Verify VM performance compared to standalone server
        2.5.1 Application layer study - Business Objects
        2.5.2 Other POC test results
    2.6 Verify V-Motion server failover
    2.7 Evaluate Virtual Center management platform
    2.8 Smooth migration of existing servers to VM
    2.9 Assessment
        2.9.1 Assess current server utilization
        2.9.2 Excerpt from ACCESSFLOW report: Total Cost of Ownership (TCO) and ROI Analysis
        2.9.3 Our take on the assessment
    2.10 Tech Support
    2.11 Stress testing
3 RECOMMENDATIONS
    3.1 Hardware recommendations
        3.1.1 Servers
        3.1.2 SAN and Fibre-channel network
        3.1.3 Network
        3.1.4 Virtual Center Server (optional)
    3.2 Software recommendations
        3.2.1 VMware ESX 2.5.x and 3.0
    3.3 Technology strategy
4 PHASE 2 MILESTONES – WHAT WE CAN DO, WHEN
    4.1 Incremental approach to deployment
    4.2 Next steps
        4.2.1 Proposed Phase 2 Timeline, to end of 2006
        4.2.2 Migrate Core Tech systems
        4.2.3 Server migration process
        4.2.4 Business Processes
        4.2.5 Prepare for new service
        4.2.6 Integration into backup strategy
        4.2.7 Risk reduction
        4.2.8 Planning and Budget
5 PHASE 2 BUDGET
    5.1 Proposed phase 2 budget for VM servers
    5.2 Budgetary notes
        5.2.1 Project scope
        5.2.2 VI3
    5.3 Don’t forget storage
6 SERVER COST MODEL – VM VS. PHYSICAL BOX
    6.1 Server baseline summary
        6.1.1 Actual costs per year
        6.1.2 Prices to customers per year*
    6.2 Average* System administration costs (based on FTE cost)
    6.3 Storage costs for phase 2 VM environment
        6.3.1 Actual current cost of tape backup
        6.3.2 Estimated cost of SAN storage
        6.3.3 Prices to customers, per GB, per year
7 REFERENCE DOCUMENTS
8 APPENDIX A - ARCHITECTURE SUMMARY
    8.1 Overall architecture concept, including storage
    8.2 Initial phase 2 architecture, VM farm
    8.3 Final project architecture
1 Introduction
1.1 Executive Summary
The evaluation phase of the Server Consolidation project has concluded successfully with clear validation of the VMware technology. A 30-day utilization report shows significant benefit can be attained by migrating a vast majority of our existing servers to this new environment. In addition, we verified the functionality of VM’s failover mechanism that allows live transfer of a server from one physical box to another without a disruption in service, thus reducing potential downtime and allowing straightforward load balancing.

The recommended deployment phase of the project (“phase 2”) consists of 3 steps, taking an aggressive yet measured approach to moving our existing systems and providing the basis for a new service to configure new servers very quickly. Step 1 converts Windows and Linux servers in the data center to VMs starting as soon as the first round of equipment purchases can be made. This step includes buying the servers themselves and the infrastructure needed to support the VM server “farm”. Steps 2 and 3 add server capacity as demand grows and high risk divisional servers are added.

The storage consolidation portion of the project will be continued in parallel with the preparation for the step 1 infrastructure.
1.2 Project Description (excerpts from scorecard)
The purpose of this project is to develop a state-of-the-art consolidated server and data storage infrastructure that can be used to provide a set of server hosting services to the campus. The goal of these services is to consolidate the support load of managing the many lightly loaded servers deployed around campus. The infrastructure must provide for application level flexibility, quick deployment of applications, quick and easy redeployment of compute resources and storage capacity, high availability and redundancy, reduced administrative overhead and reduced overall cost.

The infrastructure will also be leveraged to deal with the currently unfulfilled demand for on-demand services for faculty and staff. These services may include database and web servers with varying degrees of dynamic content.

The benefits of this project will be cheaper, more responsive, and more robust server and storage management. Cost savings come from statistical sharing of system resources and improved effectiveness of systems support staff. Improved responsiveness comes from more sophisticated tools for system deployment, configuration, and resource management. Improved robustness comes from platform standardization, which facilitates disaster preparedness, and security monitoring and enforcement.
1.3 Scope: Server Consolidation and Storage Consolidation
Originally seen as two separate projects, server consolidation and storage consolidation were merged due to a perceived server dependency on centralized storage and almost identical project team membership. While phase 1 activities have involved analysis of both virtual server and storage solutions, the project timeline was revised to accelerate VM deployment. This document describes the evaluation of the virtual server components and includes recommendations for phase 2 of that part of the project. Storage recommendations will be moved to later in the project timeline to allow the team to focus on the more urgently needed VM environment and give the team more time to unravel the unforeseen complexities of the storage options. The revised timeline ends at approximately the same time as the original merged schedule and has the same estimated budget.
1.4 Phase 1 summary (VM environment)
The goals of the phase 1 evaluation were:
• Gather virtual server requirements
• Validate virtual platform (VMware) functionality
• Verify VM performance is adequate for migrated servers
• Verify V-Motion server failover
• Evaluate the Virtual Center management platform
• Assess current server utilization
• Develop a cost model for virtual server hosting
• Provide initial training for key resources involved in the evaluation
• Make recommendations for broader deployment, if successful
• Integration with SAN

The next section (2.0) of this document describes the details of achieving each of these goals and the results.

With the immediate need for test systems for critical projects, the team supports the use of the Test Bed systems to provide temporary test and development servers until the Test/Production servers are ready. For this we additionally prepared an Operational Level Agreement (OLA) to outline the interim service level for the temporary systems. This OLA will be revised into a full Service Level Agreement for the final Test/Production systems.
1.5 Phase 1 timeline
Figure 1.1 Phase 1 Timeline
1.6 Guidance from Gartner reports
1.6.1 Consolidation Opportunities (from Gartner summary)
“The new model offers the opportunity to consolidate some functions in order to take advantage of economies of scale and best practices in pursuit of an optimized service delivery capability. For workstation support and Helpdesk, there is the opportunity for reduced FTE requirements and more efficient support operations. For servers there is the opportunity for improved uptime, manageability, and serviceability. The greatest cost reduction opportunity ($2.03MM) will be available from the Workstation Support and Helpdesk environment. This will come mostly from reduced FTE requirements and more efficient support operations. The magnitude of cost reductions from server consolidation is smaller ($0.27MM) due to the current under-staffing situation, inadequate service levels, lower university hardware costs, and relatively small quantity of servers (approximately 550). The biggest opportunities from server consolidation will be from improvements in uptime, manageability, and serviceability. However improved security and recoverability will require additional investment by UCSC.”
1.6.2 Server Consolidation Action Plan
The Gartner report on the IT Consolidation at UCSC suggested the following actions to be taken to consolidate servers. Below is the list of those actions and the Server Consolidation project response.
Suggested action / SSC Project response

1. Build and Deploy Server Operations Organization
   Response: Existing Core Tech Server and Operations groups (Windows and UNIX) are in a position to completely support this consolidation effort, as they support the current server environment. No separate organization is required.

2. Close Data Gaps
   Response: Data for existing server inventory and server requirements were collected from DL interviews. Current server usage statistics were collected by Accessflow assessment.

3. Identify Required Data Center and Network Infrastructure Upgrades
   Response: The existing Communications Bldg data center is adequate to house the initial deployment of virtual servers. The use of VLANs for network connections may require some additional hardware. There will also be a need for a Fibre-Channel network for the connections to the SAN. Backup will require FC ports be added to tape systems. DR upgrades will require additional equipment, both at the current data center and at the remote DR site.

4. Quantify Costs/Benefits and Return on Investment
   Response: See Cost Model and Assessment report excerpts.

5. Prioritize Consolidation Candidates
   Response: Prioritized list of candidates comes from a risk assessment of existing servers coupled with the list of additional servers for new projects. The data is already collected and the analysis of the list is one of the first steps in phase 2.

6. Identify Standardization Opportunities
   Response: Part of the deployment of this new environment involves the use of one server platform model for hardware and “golden images” for OS and basic applications. All migrated servers will be deployed on the standard h/w, though some may not “fit” into a golden image. Longer term goals include having a single process to specify and deploy server “capacity”, be it virtual or physical.

7. Identify Storage Consolidation Opportunities
   Response: Part of the overall project goals and includes all consolidated systems and most other systems in the data center. See Storage Consolidation section.

8. Identify Automation Opportunities
   Response: Not directly related to this project, except for use of software tools to support physical-to-virtual server migration (P2V).

9. Assess Impact on Disaster Recovery and Security
   Response: Firewall/security issues are part of the networking architecture associated with the project. Some security processes have been modified for VMs. DR is better-achieved with a virtual environment when coupled with data replication and at least 2 data centers capable of hosting virtual servers.

10. Identify Need for Swing Equipment
   Response: Project team found no relevance of playground equipment to the project.

11. Evaluate and Pilot Consolidation Technologies
   Response: Many virtualization platforms were researched. VM ESX was the only “Enterprise-quality” product appropriate for our environment. Proof of Concept system (POC) was used as a pilot for the VM technology. The POC included 2 Dell 2850 servers w/ VM ESX software connected to an EMC Clariion CX500 SAN. The POC is the basis for testing and the first wave of interim deployment.

12. Develop Detailed Project Budget
   Response: High level budget was developed as part of the initial project plan. Detailed phase 2 budget is included in this report.

13. Select and Deploy Program Execution Team
   Response: Project team was identified early in phase 1. See project charter for specific members and roles.

14. Execute Project
   Response: Phase 1 is complete, phase 2 execution pending approval of this report and recommendations.

Gartner Summary Report Suggested Actions
2 Evaluation goals, methods and results
2.1 Proof of Concept system architecture
Figure 2.1 Proof of Concept system
2.1.1 System Specifications
• Dell 2850, 2 x Xeon dual core, 2.8 GHz, 8 GB RAM (upgraded to 12 GB for phase 2 test bed)
• EMC Clariion CX500 SAN controller, 4 GB Cache, 17 x 146 GB 15K RPM drives (2.48 TB raw) configured into 3 RAID groups with hot spare, redundant controllers and power supplies, battery backed cache
• 2 x EMC (McData) Fibre-Channel switches (not kept for test bed)
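The raw capacity quoted for the SAN can be checked directly from the drive count and drive size in the list above. The short sketch below does only that arithmetic, using decimal terabytes. It does not attempt a usable-capacity figure, because the RAID levels of the 3 RAID groups are not stated here.

```python
# Raw SAN capacity from the drive count and size in the spec list.
DRIVE_COUNT = 17
DRIVE_SIZE_GB = 146

raw_gb = DRIVE_COUNT * DRIVE_SIZE_GB   # total across all drives, hot spare included
raw_tb = raw_gb / 1000                 # decimal terabytes (1 TB = 1000 GB)

print(f"{raw_gb} GB raw = {raw_tb:.2f} TB raw")
```

Usable capacity will be lower than this figure once RAID overhead and the hot spare are accounted for.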
2.2 Gather virtual server requirements
Requirements for both servers and storage were gathered through a series of team meetings, customer interviews, discussions with other campuses and presentations from potential suppliers. The major considerations derived from the customer requirements are:
• Supports standard versions of Windows and Linux currently supported by Core technologies
• Complete isolation of each VM (one corrupted VM does not interfere with others)
• Equivalent performance to a standalone box, less acceptable overhead for ESX, including CPU, Disk access and I/O
• No special requirements or modifications for the OS or applications
• Hardware redundancy, both server and associated storage
• Failover and fast recovery from unscheduled downtime, High availability
• Excellent support, from both suppliers and ITS support groups
• Smooth migration of existing servers into the VM environment
• Ability to manage VM (hardware layer) regardless of location, ease of use, no new complexity
• Security – firewalls, patching, isolating compromised systems, forensics
2.3 Validate virtual platform (VMware) functionality
ESX Server installs on the “bare metal” and allows multiple unmodified operating systems and their applications to run in virtual machines that share physical resources. Each virtual machine represents a complete system, with processors, memory, networking, storage and BIOS. Besides verifying the basic installation and functionality of ESX, the team and various Applications users tested all of the supported operating systems in use on campus.

2.3.1 Operating systems tested (supported by ESX 2.5.x)

2.3.1.1 Microsoft Windows

We verified complete virtual hardware functionality, on the following VM “supported” guest operating systems:
• Server 2000
• Server 2003
• Server 2003 SR2
• NT4, SP6a
• XP Pro, SP2

Event logs had no issues regarding OS virtualization. Device management interface validation was completed without problems.

2.3.1.2 Linux, various flavors

Verified complete virtual hardware functionality, on the following VM “supported” guest operating systems:
• CentOS
• Red Hat Enterprise
• Fedora Core 4
2.4 Validate physical server hardware functionality
The ESX platform is approved to run on many makes and models of physical servers. Our POC system utilized 2 Dell 2850 servers. These were chosen as Dell is the current server vendor of choice in the data center for Windows and Linux, and they are on VMware’s “recommended” list. These systems were provided by Dell based on the very good existing relationship between UCSC and Dell. The fact that this relationship works so well factored heavily in our recommendations.

Key parameters for the validation were:
• Compatibility with ESX and the test OSs
• Ease of use and configurability
• Overall execution performance
• Reliability
• Support

As mentioned above, all supported operating systems ran without issue or modification. As expected, ESX installs without modification – hardware driver support is embedded in ESX. A key add-on component, the Fibre-channel HBA (high-speed SAN connection), was installed without problems. During the 7 months the POC has been tested, we have had no server failures. The overall track record of this model in our experience is exceptional.

Our initial discussions with other UC campuses using VMware with a variety of server platforms showed performance results are essentially equivalent for similarly configured systems. UCB, for example, tested servers from IBM, Dell and HP. Performance was measured to be about equal, with the key differentiator being customer service. UCB selected Dell as their platform. Other campuses also reviewed multiple platforms and the commodity nature of servers led them to buy from vendors with whom they already had a positive relationship. UCSB has 2 data centers using VMware; one chose IBM and the other chose HP.
2.5 Verify VM performance compared to standalone server
2.5.1 Application layer study - Business Objects
A series of tests were performed by ITS Applications Solution group comparing virtual server performance with equivalent standalone servers. Results indicate virtual machine performance to be very close to physical machines, in some cases slightly better and others slightly worse. These tests focused on the Application layer and provided the most critical and realistic test scenarios. The test results dictate the potential service level that can be delivered in a typical VM environment. Excerpt from the test report:
“We see no significant difference between our physical and our virtual environments. The accompanying chart shows median performance well within tolerable limits. We present response times as four metrics in four environments.”

The metrics are:
• Total Average: Average time for all HTTP responses to process. Many web pages result in more than one HTTP request. This metric represents the simple mean of all such requests for our complete run.
• Total Median: Median time for all HTTP responses to process. The median being lower than our average indicates that our users would see adequate response most of the time. There will be a few responses that take a long time and they raise the average.
• Open Doc Average: Average time to open or refresh a document. This time is small compared to many of our reports because we have chosen reports that use little database runtime.
• Open Doc Median: Median time to open or refresh a document.
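The relationship between the average and median metrics can be illustrated with a small sketch. The response times below are hypothetical values chosen for illustration and are not taken from the test report. They show how a few slow responses raise the average while the median stays close to what most users experience.

```python
from statistics import mean, median

# Hypothetical response times in seconds (not measured data):
# mostly fast responses, plus two slow outliers.
response_times = [0.4, 0.5, 0.5, 0.6, 0.6, 0.7, 0.8, 6.0, 9.0]

avg = mean(response_times)    # pulled upward by the two slow responses
med = median(response_times)  # middle value, unaffected by how slow the outliers are

print(f"average = {avg:.2f} s, median = {med:.2f} s")
```

A median well below the average, as in this sketch, is the pattern the report describes: adequate response most of the time, with occasional long responses.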
Figure 2.2 VM vs Physical performance comparison
2.5.2 Other POC test results
In addition to the validation of VM technology, the POC system also yielded valuable information as to how to provide such a service to campus. In the POC phase we allowed some other ITS entities into the environment to do specific testing at the OS / Application layer. In doing so we became aware of elements and processes which the operational level agreement would need to incorporate.
Examples of other ITS testing:
• SOE LAMP environment
• Active Directory Servers
• Terminal Services
• Asset Management Technologies
• Change Management Solution Running on CentOS
2.6 Verify V-Motion server failover
V-Motion is a feature that allows the live migration of a virtual server from one physical box to another within the VM server “farm”. The transfer is managed by the Virtual Center software and is performed via a dedicated Ethernet port on each physical server. Essentially, the vital data lives on a multi-pathed redundant SAN and the configuration (pointer) file lives on the ESX servers. V-Motion simply moves the configuration file via a dedicated network path, thus moving the processing and hosting of a given VM to another ESX system.

V-Motion was tested repeatedly under many differing load scenarios. The technology is so effective it will even function over a 10 Mb connection, albeit slowly. Using a 1 Gb connection, the transfer takes approximately 15 seconds. The testing of V-Motion culminated in a live demonstration at ITSMG, where ITS managers were shown the migration of the Virtual Center server from one physical box to the other, without interruption.

Note that V-Motion for ESX 2.5.x is a manual process; in version 3.0 it will be automated based on rules established by the administrator. A limitation of V-Motion is that the live transfer can only be done to another box with the same CPU type and “generation”. It is, therefore, important to have several boxes of the same type in the server farm. This limitation leads to the technology strategy explained in the Recommendations section.
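The CPU limitation can be pictured as a simple filter over the hosts in the farm. The sketch below is an illustration of the constraint as described here, not VMware's actual compatibility check. The Host class, the host names and the generation labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    cpu_type: str        # e.g. "Xeon"
    cpu_generation: str  # hypothetical label for the CPU "generation"


def vmotion_targets(source: Host, farm: list[Host]) -> list[Host]:
    """Return the other hosts whose CPU type and generation match the source."""
    return [
        h for h in farm
        if h.name != source.name
        and h.cpu_type == source.cpu_type
        and h.cpu_generation == source.cpu_generation
    ]


# Hypothetical farm: two matching hosts and one from a different generation.
farm = [
    Host("esx1", "Xeon", "gen-A"),
    Host("esx2", "Xeon", "gen-A"),
    Host("esx3", "Xeon", "gen-B"),
]

targets = vmotion_targets(farm[0], farm)
print([h.name for h in targets])
```

In this sketch the third host has no valid live-migration partner, which is why the text recommends keeping several boxes of the same type in the farm.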
2.7 Evaluate Virtual Center management platform
The Virtual Center management platform can be explained via an analogy to the data center itself. In the data center, machines are hosted and access is granted via SSH (Linux) or RDP (Windows), and all OS and software configuration can be done via these mechanisms. Of course, at times a CD needs to be mounted, hardware needs to be added or modified, a new system needs to be provisioned, or the power simply needs to be forced off and back on due to a system halt. From this perspective, Virtual Center administrators are the equivalent of the on-site data center operators.

The Virtual Server management platform consists of two main interfaces: the MUI, which is accessed via HTTPS on the ESX Servers directly, and Virtual Center, which manages the collection of servers. In this report we focus on V-Center, as that is the interface where the operation is managed, whereas the MUI handles the basic configuration of the servers themselves. To put it another way, once a server is set up and integrated into the VM farm, the MUI is no longer accessed, while V-Center will be visited often, as that is where we monitor performance, build new machines and migrate VMs due to resource constraints or hardware failures.

V-Center is a powerful management tool that allows a comprehensive view of the VM Host server farm(s), virtual machines and real time resource monitoring. The technology has been integrated into the campus Active Directory, which allows for a straightforward access management solution without the need for yet another account structure. Virtual Center has performed exactly as advertised and was demonstrated to ITSMG as the interface for the V-Motion demo.
Figure 2.3 – Virtual Center screen shot
2.8 Smooth migration of existing servers to VM
VMware provided a demo license for the P2V (Physical-to-Virtual) application. For many of our current servers, this application encapsulates the information on a physical server and converts it to function properly as a virtual server. This conversion process is not as straightforward as we had hoped, but it does work well. Our team converted several systems with the limited license and determined that the typical conversion effort is 6-8 hours. We will recharge accordingly to cover these costs. A full unlimited P2V license is included with the “Jumpstart” training program, which we will initiate as soon as we have phase 2 approvals.
2.9 Assessment
2.9.1 Assess current server utilization
Accessflow was contracted by Dell to perform a 30-day server utilization analysis of the Windows servers in the data center. This assessment was deliberately scheduled to include the end-of-quarter period where server utilization was expected to be highest. Results from the Accessflow report indicate the average server utilization in the data center is 1-2%, with average peaks around 4%. Worst case peak was approximately 11%. According to Accessflow, these numbers are on the low end of typical data center values, but not unexpected.
2.9.2 Excerpt from ACCESSFLOW report:
2.9.2.1 Total Cost of Ownership (TCO) and ROI Analysis
Summary of Estimated Costs/Savings for Virtualization
Cost savings over 3 years

Cost/Savings                    Without VMware   With VMware   Savings
Direct Costs                    $407,500         $284,637      $122,863
Indirect Costs                  $256,800         $24,400       $232,399
Total Cost of Ownership (TCO)   $664,300         $309,038      $355,262

TCO Analysis from ACCESSFLOW report
TCO calculations cover a 3-year period. Costs are derived from running a combined total of 33 virtual servers on 2 ESX Server hosts:
• Server Consolidation. 33 total virtualization candidates will be consolidated onto 2 ESX Server hosts, with no hardware reuse. We assume all virtualization candidates can be migrated in the first year. The alternative of not virtualizing assumes that 33 candidates would be replaced with like-for-like hardware at the rate of 20%, 30%, and 40% in Years 1, 2, and 3.
• Server Containment. 50 total expected virtual machines will be provisioned on 3 ESX Server hosts, with no overlap with ESX Server hosts reserved for server consolidation. The alternative of not virtualizing assumes that all 50 servers that would have been provisioned as VMs would each require individual physical hardware.
Only tangible costs are considered in this estimate. You may expect sizeable savings in the form of intangible cost savings resulting from higher availability and less downtime. Virtual machines offer the possibility for higher availability through services such as VMotion or faster backup and recovery, due to the encapsulation properties of virtual machines and virtual disks and the ability to capture virtual machine states.
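The arithmetic behind the TCO summary can be checked with a short sketch. The figures are copied from the summary table; the helper names are illustrative only, and the one-dollar gap on the indirect-cost row is presumably rounding in the Accessflow report.

```python
# Figures copied from the TCO summary table (3-year totals, in dollars).
TCO = {
    "Direct Costs": {"without": 407_500, "with": 284_637, "savings": 122_863},
    "Indirect Costs": {"without": 256_800, "with": 24_400, "savings": 232_399},
    "Total Cost of Ownership (TCO)": {
        "without": 664_300, "with": 309_038, "savings": 355_262,
    },
}


def savings_gap(row):
    """Difference between (without - with) and the savings printed in the table."""
    return (row["without"] - row["with"]) - row["savings"]


def percent_saved(row):
    """Savings as a percentage of the cost without VMware."""
    return 100.0 * row["savings"] / row["without"]


for name, row in TCO.items():
    print(f"{name}: gap ${savings_gap(row)}, saved {percent_saved(row):.1f}%")
```

Expressed this way, the estimated three-year saving is a little over half of the cost of not virtualizing.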
2.9.2.2 Return on Investment (ROI) Analysis
Estimated ROI for Virtualization

Cost/Savings                   Year 1     Year 2     Year 3
Virtualization Investment      $187,137   $48,750    $48,750
Savings from Virtualization    $32,589    $153,087   $169,587
Return on Investment (ROI)     17%        79%        125%

ROI Analysis from ACCESSFLOW report
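The ROI percentages in the table are consistent with cumulative savings divided by cumulative investment, although the Accessflow excerpt does not state its formula. The sketch below works under that assumption and also estimates a payback point by assuming, purely for illustration, that costs and savings accrue evenly within each year.

```python
INVESTMENT = [187_137, 48_750, 48_750]  # Years 1-3, from the ROI table
SAVINGS = [32_589, 153_087, 169_587]    # Years 1-3, from the ROI table


def cumulative_roi(investment, savings):
    """ROI per year as cumulative savings / cumulative investment, in percent."""
    results = []
    total_inv = total_sav = 0
    for inv, sav in zip(investment, savings):
        total_inv += inv
        total_sav += sav
        results.append(round(100 * total_sav / total_inv))
    return results


def payback_months(investment, savings):
    """Months until cumulative savings cover cumulative investment.

    Assumes (for illustration) that each year's costs and savings are
    spread evenly over its twelve months. Returns None if never reached.
    """
    balance = 0.0
    for year, (inv, sav) in enumerate(zip(investment, savings)):
        monthly_net = (sav - inv) / 12
        if monthly_net > 0 and balance + 12 * monthly_net >= 0:
            return 12 * year + (-balance) / monthly_net
        balance += 12 * monthly_net
    return None


print(cumulative_roi(INVESTMENT, SAVINGS))
print(payback_months(INVESTMENT, SAVINGS))
```

Under these assumptions the break-even point falls in the third year, in the same range as the report's estimate of approximately 28 months.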
Our ROI calculation uses the previous corresponding TCO Analysis calculation for the savings amounts. The payback period is estimated to take approximately 28 months.
2.9.3 Our take on the assessment
The Accessflow assessment shows a high rate of virtualization leading to only 3 ESX servers needed for all 50 physical servers intended for migration within the data center. While it is very feasible to run 16 or more VMs on a single box, we feel this number is aggressive for all servers and allows no spare capacity for failover and expansion of services. Both Accessflow and our analyses show not all servers are best suited for 16:1 or higher ratios. We, therefore, recommend an average of 10:1 leading to a requirement for 5 physical boxes plus one more for failover just to support the data center server migration. We will need 2 more than this for the service to be provided to other early adopters, bringing the total to 8. See budget and timeline for recommendations and options. Also note the assessment report incorrectly identifies the intended server platform as HP. The Dell 2950 has equivalent or better specifications.
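The sizing in this paragraph reduces to a small calculation, sketched below. The function and parameter names are illustrative; the inputs (50 servers, a 10:1 ratio, one failover box and two boxes for other early adopters) come from the text above.

```python
import math


def hosts_required(servers, vms_per_host, failover_hosts=1, extra_hosts=0):
    """Physical boxes needed for `servers` VMs at a given consolidation ratio."""
    return math.ceil(servers / vms_per_host) + failover_hosts + extra_hosts


# Data center migration only: 5 boxes for the VMs plus 1 for failover.
print(hosts_required(50, 10))
# Including 2 more boxes for the service offered to other early adopters.
print(hosts_required(50, 10, failover_hosts=1, extra_hosts=2))
```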
2.10 Tech Support
VMware tech support was tested in a real world situation. During stress testing, we managed to crash one of the ESX platforms. After some initial analysis internally, we chose to contact VMware tech support for their help. Our support level agreement says we will be contacted no later than 12 hours after initiating the trouble report. We were contacted within 2 hours by a local, competent engineer. He directed us to collect data from the crashed box by providing detailed instructions. The information was sent to VMware. Their response took about one hour. They found 2 problems, one the result of a known bug that had already been fixed in a subsequent release, and the other apparently due to our inadvertent interruption of SAN capacity while testing. The new version of ESX was downloaded and installed. Tests were repeated with no issues and none have been experienced since. We give VMware high marks for both the timeliness and quality of support.
2.11 Stress testing
Both Windows and Linux VMs were tested under extreme conditions using the Bonnie++ test scripts. The scripts involve a variety of tests reading and writing extremely large files or sets of characters, ranging from 1 GB to 8 GB. The worst-case test was 16 VMs each trying to read/write 4 GB of data simultaneously. Performance dropped off significantly as memory went to paging, as expected and as would be seen on a standalone server. We also saw a bottleneck at the SAN once its 4 GB of cache was completely used. VM-to-VM variation was as expected given the asynchronous nature of the tests. While this level of data access is far beyond typical usage, these limitations will help guide the matching of virtual servers to appropriate available hardware, and will inform a starting point for storage stress testing later in phase 2.
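As a rough illustration of why the SAN cache becomes a bottleneck, the sketch below compares the aggregate data volume of each test combination with the 4 GB cache. The VM counts and chunk sizes are the ones shown in the result charts; comparing total volume to cache size is a simplification used only for illustration, not a model of how the SAN actually behaves.

```python
SAN_CACHE_GB = 4               # SAN cache size stated in the report
VM_COUNTS = [1, 2, 4, 8, 16]   # VM counts shown in the result charts
CHUNK_SIZES_GB = [1, 2, 4, 8]  # chunk sizes shown in the result charts


def aggregate_load_gb(vm_count, chunk_gb):
    """Total data in flight when every VM works on its own chunk."""
    return vm_count * chunk_gb


def runs_exceeding_cache(cache_gb=SAN_CACHE_GB):
    """List (vm_count, chunk_gb) combinations whose total exceeds the cache."""
    return [
        (vms, chunk)
        for vms in VM_COUNTS
        for chunk in CHUNK_SIZES_GB
        if aggregate_load_gb(vms, chunk) > cache_gb
    ]


# The worst case in the report: 16 VMs x 4 GB each.
print(aggregate_load_gb(16, 4), "GB against", SAN_CACHE_GB, "GB of cache")
```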
Figure 2.4 – Examples of stress test results (Linux). Panels, each plotting throughput (K/sec) against chunk size (1G, 2G, 4G, 8G): Sequential Write per Char - Average; Sequential Write per Block - Average; Sequential Read per Char - Average; Sequential Read per Block - Average (each for 1VM, 2VM, 4VM, 8VM and 16VM); Sequential Output per Character, by VM - 8VM; Sequential Output per Block, by VM - 8VM (VMs 155, 156, 157, 158, 162, 186, 187, 190).
3 Recommendations
3.1 Hardware recommendations
3.1.1 Servers
The team recommends purchasing the new Dell 2950 server based on Intel Xeon processors (2 x dual core). This is a new model similar to the Test Bed systems, with additional RAM capacity and higher CPU speed. Each server will be configured with:
• 2 x Intel dual-core Xeon processors, Prescott or Woodcrest stepping
• 16 GB RAM (8 x 2GB)
• 2 x 73GB 10K RPM disk drives, mirrored
• 2 x Quad 1Gb Ethernet network cards (10 ports total)
• 1 x Dual Fibre-Channel Host Bus Adapters (HBA) for SAN connections
The initial deployment provides 2x1Gb active/active connections to the LAN through redundant Cisco 3750 switches. Should additional network capacity be needed, additional parallel connections can be made without adding server hardware.
Besides exceptional performance during testing, selecting Dell allows us to take advantage of existing support relationships and purchase agreements, as well as our experience with this equipment, lowering both initial costs and ongoing support costs. The Wincore and Operations groups are already familiar with this platform family.
3.1.2 SAN and Fibre-channel network
In the short term, the team recommends we expand the test bed EMC SAN for the initial deployment by adding 7 TB (raw) of SATA disks to the CX500. This low-cost addition allows us to take full advantage of the existing SAN controllers by simply adding one disk shelf. Since most of the VM servers have relatively low disk IO requirements, the addition of lower-cost SATA drives (7200 RPM vs. 15K RPM for existing FC drives) provides the option for cheaper tier 2 storage to be offered to customers. See price list. Total SAN capacity will then be 9.5 TB raw, or about 6 TB usable, which is expected to be enough for the first year of VM service.
The EMC SAN with the extended disks will be part of the official test bed after the final storage solution is deployed. Core Tech assures me that disk capacity is always needed and will not be wasted after the final storage is put in place. After the storage consolidation has been achieved, the VM environment will be moved. The use of the new storage system after December 2006 coincides with the VM upgrade to ESX 3.0.1. This upgrade requires a new file system and therefore forces us to briefly shut down each upgraded VM and move it to different disk volumes. This is a perfect opportunity to switch to the new storage system.
3.1.3 Network
Networking considerations forced us to take a slightly different approach for the VM farm. Traditional standalone servers are directly connected to the specific subnet needed for the individual application assigned to the server. Since the VM environment can host many different servers and applications on a single box, there must be a way to connect the physical server to all subnets at once. Directly connecting to each subnet would require over 10 Ethernet connections to each box. There is an option to group servers by subnet on a single box to reduce the number of required connections, but that would make it impossible to “V-Motion” servers from one box to another.
The appropriate solution is to use Virtual LANs, or VLANs. Using VLANs, each physical box needs only 1 Ethernet connection, though we use 2 at a minimum for redundancy. By “tagging” messages from different subnets, multiple connections can be made through a single port. VLANs require the use of a Layer 3 switch to configure the various connections. The NTS group requests that Cisco 3750 switches be used for this purpose.
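To make the idea of “tagging” concrete, the sketch below builds and parses the 4-byte tag defined by IEEE 802.1Q, the standard commonly used for VLAN trunking. The report does not name the tagging standard, so 802.1Q is an assumption here, and the VLAN IDs are made-up examples.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q-tagged frame


def build_vlan_tag(vlan_id, priority=0, dei=0):
    """Return the 4-byte 802.1Q tag for the given VLAN ID."""
    if not 1 <= vlan_id <= 4094:  # IDs 0 and 4095 are reserved
        raise ValueError("VLAN ID must be in 1..4094")
    if not 0 <= priority <= 7:
        raise ValueError("priority must be in 0..7")
    if dei not in (0, 1):
        raise ValueError("dei must be 0 or 1")
    # Tag control information: 3-bit priority, 1-bit DEI, 12-bit VLAN ID.
    tci = (priority << 13) | (dei << 12) | vlan_id
    return struct.pack("!HH", TPID_8021Q, tci)


def parse_vlan_tag(tag):
    """Decode a 4-byte 802.1Q tag back into its fields."""
    tpid, tci = struct.unpack("!HH", tag)
    if tpid != TPID_8021Q:
        raise ValueError("not an 802.1Q tag")
    return {"priority": tci >> 13, "dei": (tci >> 12) & 1, "vlan_id": tci & 0x0FFF}


# Two hypothetical subnets sharing one physical port.
for vlan in (100, 200):
    print(vlan, build_vlan_tag(vlan).hex())
```

Because the VLAN ID travels inside each frame, a single physical port can carry traffic for many subnets, which is what keeps V-Motion possible across the farm.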
3.1.4 Virtual Center Server (optional)
The Virtual Center management application runs on a separate server. It has relatively low-end system requirements. We plan to reuse an existing PC for this purpose and may eventually move to a larger system that is free after migration to VM.
3.2 Software recommendations
3.2.1 VMware ESX 2.5.x and 3.0
We intend to release to Production on version 2.5.3 of the ESX platform. This is the version we used during testing, and it has a well-established track record in the industry. VMware has just released a new major upgrade, version 3.0. Besides the existing components described above, the new “VI3” infrastructure includes new components that automate many of the existing manual operations, including VMotion for load balancing and failover. Our plan is to use the Test Bed to qualify 3.0 as soon as the early adopters are moved to the new servers. This will also be the platform to test storage options, since any SAN we bring in must be compatible with VMware version 3.
VMware ESX 3.0 has a different file system than 2.x. We must therefore briefly shut each VM down to upgrade it to 3.0. We intend to make that change at the same time the data is moved to the new storage system, since data must be copied into the new file system anyway, thus minimizing downtime.
3.3 Technology strategy
For V-Motion to work optimally, the server “farm” must be made up of CPUs with similar configurations and speeds. This is simple to achieve when we make the first rounds of purchases, but over time product technology will advance to the point of being incompatible with previous generations. At that point, we will need to purchase several new servers to allow V-Motion to work between them. In essence, the goal is to replace half of the servers with each significant technology change. This means maintaining 2 sets of related servers in the farm, replacing the older set every 4 years.
Figure 3.1 – Alternating server generation strategy
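The alternating strategy can be sketched as a simple schedule. The text can be read either as a four-year life for each set (a purchase every two years) or as a purchase every four years, so the interval between purchases is left as a parameter. The set names "A" and "B" and the start year are illustrative assumptions.

```python
def refresh_schedule(start_year, horizon_years, refresh_interval_years):
    """Return (year, set_name) purchase events for two alternating server sets.

    Each purchase replaces the older of the two sets, so every set stays in
    service for twice the refresh interval.
    """
    sets = ("A", "B")
    years = range(start_year, start_year + horizon_years, refresh_interval_years)
    return [(year, sets[n % 2]) for n, year in enumerate(years)]


# One reading of the text: each set lives 4 years, so half the farm is
# refreshed every 2 years.
for year, name in refresh_schedule(2006, 9, refresh_interval_years=2):
    print(year, "purchase set", name)
```

Whatever the interval, V-Motion stays available within each set because its members are always bought together.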
4 Phase 2 milestones – What we can do, when
4.1 Incremental approach to deployment
We expect to deploy the first year’s servers in three increments:
• “Dog Food” – Core Tech servers as well as some “early adopters” (Test Bed users). We want to migrate early Test Bed customers to the new platform to free the Test Bed for ESX 3.0 testing.
• High Risk – These are divisional servers identified by the Server Risk Matrix as high risk. This timeframe also includes additional Applications Solutions servers needed for urgent projects not included in Early Adopters. Note: the risk analysis is not complete and the numbers of requested servers may change.
• New Service – At this point, we are in a position to offer virtual servers to other customers on campus, such as faculty. Lower risk divisional servers can be migrated at this time also.
Note: The scope and funding of the project is actually limited to deploying only the first step and preparing for the other two steps. However, with the full benefit to the campus realized from the broader service, the team requests that the resources and funding be approved for all 3 steps, with the understanding that the equipment and people will be phased in.
4.2 Next steps
4.2.1 Proposed Phase 2 Timeline, to end of 2006
Figure 4.1 – Phase 2 Timeline
4.2.2 Migrate Core Tech systems
The primary goal for phase 2 and the “Dog Food” step is to actually move the Core Tech servers into the virtual environment. Even amongst the CT servers, there is a prioritized list of servers. It is likely that some high risk servers from the divisions will be migrated before all of the CT servers are completed, essentially overlapping the first 2 steps. Proposed priorities are:
• High Risk core, including early adopters
• Core infra systems (i.e. MOM, to get us ready for the influx)
• High Risk Divisional
• Other Core
• New services
• Other Existing Services
The VM server farm will be located in row H in the data center.
4.2.3 Server migration process
This describes the technical details of how a physical server is converted into a virtual server and moved into the data center. Both manual and automated (P2V) conversions will be covered.
4.2.4 Business Processes
Business processes include Capacity planning, Provisioning and Governance. These need to be developed and agreed upon before we can provide the server service. Though we hope to have enough capacity for all requests, the more likely scenario is that some rules must be in place to prioritize requests and allocate resources. This decision-making process will be documented and will become an ongoing part of the service. Also included in this category is the process to decide which servers are subject to mandatory migration and which are voluntary. Some mission critical systems may not have the option to convert. Requests for new servers must be managed in a different way than our physical servers are now. The provisioning process will be modified to include requests for both virtual and physical servers.
4.2.5 Prepare for new service
4.2.5.1 Service Level Agreement (SLA)
The levels of service that ITS will guarantee to its customers must be described and agreed upon. Our existing Operation Level Agreement will serve as the basis for the new SLA, which will also be written by a Core Tech / IT Services team.
4.2.5.2 Training
New FTEs and existing support staff must be trained in the VM technology before we can go live with the new service. Core Tech staff will be in a position to support the initial migrations.
4.2.6 Integration into backup strategy
The data contained on the SAN (both interim and final) needs to be backed up to tape at regular intervals. The VM farm must be incorporated into our existing backup strategy and must be included in any pending redesign of that system.
4.2.7 Risk reduction
The risk for a given migrated server is managed by the fact that the VM version can be tested in parallel with the physical server while the physical server is still on line. We’ll “flip the switch” after the converted server is proven to be stable.
4.2.8 Planning and Budget
Planning and Budget needs to review the cost model and help us with the funding details.
5 Phase 2 budget
5.1 Proposed phase 2 budget for VM servers
The budget for the VM portion of the project implementation includes the following items:
• Physical servers and Ethernet/FC network cards
• VM licenses
• 7TB expansion to the test bed SAN
• Management servers (VC)
• Fibre-Channel (FC) switches to connect the storage network
• Data center expenses
• Training and professional services
• Limited travel budget (not included in original budget)
• Associated annual maintenance
Item                                       “Dog Food”                       High Risk                                     New Service
# of VM’s                                  50 Core Tech + early adopters†   30-50 from divisions + critical Apps projects 50 Lower risk + service customers
Purchase dates                             Aug 2006                         Nov 2006                                      Feb 2007
Qty of physical boxes                      8                                4                                             6
Dell 2950 servers*                         48,000                           24,000                                        36,000
FC HBA for servers                         12,000                           6,000                                         9,000
EMC SAN expansion (7TB raw, SATA)*         25,000
Fibre-channel switches and cables          20,000
Rack space x n RU                          725                              260                                           390
Network switches to support VLANs          NTS will buy?
VM licenses                                21,840                           10,920                                        16,380
VM Maintenance*                            7,064                            3,532                                         5,300
VI3 add-ons/upgrades                       6,000                            3,000                                         4,500
Training                                   20,000                           5,000                                         10,000
Professional services                      10,000
Travel to other campuses                   1,000
Network connections (2 x 1Gb per system)   7,000                            3,500                                         5,300
TOTALS                                     178,629                          38,760                                        86,870

* H/W and S/W maintenance may be paid in advance with up-front expenses or is included
** Virtual Infrastructure 3 has some additional new components not covered by Maintenance
† Early Adopters are Test Bed customers who will be moved to the Prod/Test servers
5.2 Budgetary notes
5.2.1 Project scope
The original proposed budget for the Server and Storage Consolidation project included the infrastructure needed to support the migration of Core Tech servers only. The money associated with the “New Service” step was not included in the original “commitment” and should be considered “out of scope” of the project budget. Of course, if there is money available after CT servers are migrated and the storage system is deployed, ITS has the discretion to spend it for the new service.
5.2.2 VI3
Virtual Infrastructure 3 is the new version of VMware’s virtual environment. It includes upgrades to our existing licenses for ESX, Virtual Center and V-Motion. These upgrades are included in our annual maintenance. In addition, there are several new components that are not included in maintenance but are highly desirable. Our recommendation is to add these new components to the new servers as well as to our 2 existing Dells.
5.3 Don’t forget storage
This evaluation and recommendation covers the VM portion of the project, and was separated from the storage portion to make virtual servers available sooner. The consolidated storage solution is also funded by the project budget.
6 Server Cost model – VM vs. Physical box
6.1 Server baseline summary
6.1.1 Actual costs per year
Item               Xsmall   Small   Med    Large   Xlarge   Standalone
Servers/box        32       16      8      4       2        1
UPS/Gen deprec.    274      274     274    274     274      274
Data Ctr           63       63      63     63      63       63
Physical box       1750     1750    1750   1750    1750     1250
Box Maint          inc      inc     inc    inc     inc
Network jacks      880      880     880    880     880      440
VM license         910      910     910    910     910
VM Maint.          883      883     883    883     883
CT support         1410     1410    1410   1410    1410     1410
Mgmt overhead      228      228     228    228     228      228
Total / box        6398     6398    6398   6398    6398     3665
Total / server     200      400     800    1600    3199     3665
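The per-server figures in the table follow from dividing the annual cost of one box by the number of servers it hosts. The sketch below reproduces that arithmetic from the table's line items; the variable and function names are illustrative.

```python
# Annual cost components for one VM host box, from the table above.
VM_HOST_COSTS = {
    "UPS/Gen deprec.": 274, "Data Ctr": 63, "Physical box": 1750,
    "Network jacks": 880, "VM license": 910, "VM Maint.": 883,
    "CT support": 1410, "Mgmt overhead": 228,
}

# A standalone server carries no VM license or VM maintenance cost.
STANDALONE_COSTS = {
    "UPS/Gen deprec.": 274, "Data Ctr": 63, "Physical box": 1250,
    "Network jacks": 440, "CT support": 1410, "Mgmt overhead": 228,
}

SERVERS_PER_BOX = {"Xsmall": 32, "Small": 16, "Med": 8, "Large": 4, "Xlarge": 2}


def cost_per_server(box_costs, servers_per_box):
    """Annual cost of one server when a box is shared by `servers_per_box`."""
    return sum(box_costs.values()) / servers_per_box


for size, count in SERVERS_PER_BOX.items():
    print(f"{size}: {cost_per_server(VM_HOST_COSTS, count):.0f}")
print(f"Standalone: {cost_per_server(STANDALONE_COSTS, 1):.0f}")
```

The comparison shows that a virtual server only costs more than a standalone box when very few VMs share a host.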
6.1.2 Prices to customers per year*
Item                      Xsmall   Small   Med    Large   Xlarge
Annual baseline           300      600     1000   1800    3500
Setup charge              200      200     200    200     200
OS license                50       50      50     50      50
Phys-to-Virt conversion   500      500     500    500     500
*These prices reflect a recharge rate that recovers the entire cost for server hosting. They may be offset by state funding (19900) to reduce the recharge rate. Planning and Budget still need to review these numbers.
6.2 Average* System administration costs (based on FTE cost)
Sys Admin charge    Small   Med    Large   Xlarge
Virtual server      3116    7126   11322   14777
Standalone server   4380    8391   13080   17523
VM savings          1264    1265   1758    2746

Costs are based on our current process. Patch management and Asset management will reduce the required support hours and, in turn, the Sys Admin costs.
* Costs are adjusted based on system criticality, complexity and other factors, for example, clustering.
6.3 Storage costs for phase 2 VM environment
We plan to follow the lead of all of the other service providers we investigated and charge for storage by the GB. This is the only fair method and allows a customer to tailor the server capacity to their specific needs. In addition, we will offer two tiers of disk performance, allowing us to charge less for lower performance disk space that is less expensive for us to purchase.
6.3.1 Actual current cost of tape backup
Item              Total cost / year   $ / GB
Backup hardware   100000              10
Backup software   10000               1
Tape              5000                0.5
Offsite storage   8000                0.8
TOTAL             123000              12.3

These values reflect costs for 10TB of backed-up storage.
6.3.2 Estimated cost of SAN storage
Item                        One time cost   Life   Annual Costs   Annual Maint   Divide by (GBs)   Cost / GB
SAN Controller h/w          9000            4      2250           3000           6500              $0.81
H/W purchase Tier 1 disk    13000           4      3250           4800           1500              $5.37
H/W purchase Tier 2 disk    20000           4      5000                          5000              $1.00
S/W Purchase                7200            4      1800           1700           6500              $0.54
Prof. services              9800            4      2450                          6500              $0.38
Training                                           2500                          6500              $0.38
Rack unit charge, tier 1                           95                            1500              $0.06
Rack unit charge, tier 2                           95                            5000              $0.02
FC switch fabric            20000           4      5000           3000           6500              $1.23
Tier 1 cost / GB                                                                                   $8.77
Tier 2 cost / GB                                                                                   $4.36
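The per-GB figures can be reproduced by spreading each one-time cost over its life, adding any annual maintenance, and dividing by the capacity the item serves. The sketch below does this under one reading of the figures above: shared items are divided by 6500 GB, and tier-specific items by 1500 GB (tier 1) or 5000 GB (tier 2). The assignment of values to rows is inferred from the arithmetic, and the names are illustrative.

```python
SHARED_GB, TIER_GB = 6500, {1: 1500, 2: 5000}


def annual(one_time=0, life=4, recurring=0, maint=0):
    """Yearly cost: amortized purchase plus recurring charges and maintenance."""
    return one_time / life + recurring + maint


# Items whose cost is shared across all storage.
SHARED = {
    "SAN controller h/w": annual(one_time=9000, maint=3000),
    "S/W purchase": annual(one_time=7200, maint=1700),
    "Prof. services": annual(one_time=9800),
    "Training": annual(recurring=2500),
    "FC switch fabric": annual(one_time=20000, maint=3000),
}

# Items that belong to a single tier.
TIER_SPECIFIC = {
    1: {"disk": annual(one_time=13000, maint=4800), "rack": annual(recurring=95)},
    2: {"disk": annual(one_time=20000), "rack": annual(recurring=95)},
}


def cost_per_gb(tier):
    """Annual cost per GB for a tier: shared share plus tier-specific share."""
    shared = sum(SHARED.values()) / SHARED_GB
    specific = sum(TIER_SPECIFIC[tier].values()) / TIER_GB[tier]
    return shared + specific


for tier in (1, 2):
    print(f"Tier {tier}: ${cost_per_gb(tier):.2f} per GB per year")
```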
Note: these costs will be modified next year to reflect the final storage solution.
6.3.3 Prices to customers, per GB, per year
Item                   Tier 1   Tier 2
Storage (disk space)   9        5
Backup                 12.5     12.5
Total                  21.5     17.5
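As a worked example of charging by the GB, the sketch below computes a customer's annual storage charge from the prices above. The 100 GB allocation is a hypothetical figure chosen only for illustration.

```python
# Prices per GB per year, from the table above.
PRICE_PER_GB = {
    1: {"storage": 9.0, "backup": 12.5},
    2: {"storage": 5.0, "backup": 12.5},
}


def annual_storage_charge(gb, tier):
    """Annual charge for `gb` of disk space on a tier, including backup."""
    prices = PRICE_PER_GB[tier]
    return gb * (prices["storage"] + prices["backup"])


# Hypothetical customer requesting 100 GB.
for tier in (1, 2):
    print(f"Tier {tier}: ${annual_storage_charge(100, tier):,.2f} per year")
```

At these prices backup accounts for more than half of the charge on either tier, so the choice of tier changes the total less than the disk prices alone suggest.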
7 Reference documents
Reference documents are posted on the project web site: http://www2.ucsc.edu/sscproject
• Project Charter
• Accessflow assessment report
• Test result spreadsheets
• Applications test report
• VMware whitepapers – ESX intro, VI3
• Gartner reports
• Info from other UC campuses
• Service price lists from other campuses
• OLA for test environment
• Previous presentations
• Change request document
8 Appendix A- Architecture Summary
8.1 Overall architecture concept, including storage
Figure 8.1 – Recommended overall concept
8.2 Initial phase 2 architecture, VM farm
Figure 8.2 – Recommended initial architecture - details
8.3 Final project architecture
Figure 8.3 – Recommended final architecture - details
Portfolio Management Group
Information Technology Services
University of California, Santa Cruz
1156 High Street
Santa Cruz, CA 95064