Using Grid Technologies on the Cloud for High Scalability
A Practitioner Report for Cloud User Group
Victoria Livschitz, CEO, Grid Dynamics
vlivschitz@griddynamics.com
September 17th, 2008
A word about Grid Dynamics
Who we are: global leader in scalability engineering
Mission: enable adoption of scalable applications and networks through design patterns, best practices and engineering excellence
Value proposition: fusion of innovation with best practices
Focused on “physics”, “economics” and “engineering” of extreme scale
Founded in 2006, 30 people and growing, HQ in Silicon Valley
Services
Technology consulting
Application & systems architecture, design, development
Customers
Users of scalable applications: eBay, Bank of America, web start-ups
Makers of scalable middleware: GigaSpaces, Sun, Microsoft
Partners: GridGain, GigaSpaces, Terracotta, DataSynapse, Sun, MS
Grid Dynamics 1
Why am I speaking here tonight?
We do scalability engineering for a living
Cloud computing is new, very exciting and terribly over-hyped
Not a lot of solid data on performance, scalability, usability, stability…
Many of our customers are early adopters or enablers
Their pains, discoveries and lessons are worth sharing
The practitioner perspective
Recently completed 3 benchmark projects that we can make public
Results are presented here tonight
Exploring Scalability through Benchmarking
Benchmark 1: Test scalability of EC2 on the simplest map-reduce problem
  Cloud: Public commercial cloud, EC2
  Vendor: Amazon
  Middleware: GridGain
  Application: Monte-Carlo
Benchmark 2: Test scalability of data-driven HPC applications, similar to those used in practice
  Cloud: Public commercial cloud, EC2
  Vendor: Amazon
  Middleware: GigaSpaces
  Application: Risk Management
Benchmark 3: Explore performance implications of data “in the cloud” vs. “outside the cloud”
  Cloud: Incubator compute cloud for academic use, CompFin
  Vendor: Microsoft
  Middleware: Windows HPC Server, Velocity
  Application: Data-intensive Analytics
Benchmark #1: Scalability of Simple Map/Reduce Application on EC2
Basic Scalability of Simple Map/Reduce
Goal: Establish upper limit on scalability of Monte-Carlo simulations performed on EC2 using GridGain
Why Monte-Carlo: simple, widely-used, perfectly scalable problem
Why EC2: most popular public cloud
Why GridGain: simple, open-source map-reduce middleware
Intended Claims:
EC2 scales linearly as grid execution platform
GridGain scales linearly as map-reduce middleware
Businesses can run their existing Monte-Carlo simulations on EC2 today using open-source technologies
Other Goals
Understand “process bottlenecks” of EC2 platform
  Changes to the programming, deployment, management model
  Ease of use
  Security
  Metering and payment
Identify scalability bottlenecks at any level in the stack
  EC2
  GridGain
  Glueware
Robustness
  Stability
  Predictability
Architecture
[Diagram: grid deployment spanning the Amazon EC2 cloud and the corporate intranet]
Inside the Amazon EC2 cloud: Head Node, Worker Nodes, OpenMQ Server (JMS), spare EC2 instances (spare capacity)
On the corporate intranet: Grid Console, HTTP Server
Diagram annotations: job execution; JMS message processing; manages worker nodes and tasks; discovery & task assignment; controls grid operation; configuration & task repository
Technology Stack: EC2, GridGain, Typica, OpenMQ
Performance Methodology & Results
Same algorithm exercised on a wide range of node counts
  2, 4, 8, 16, …, 256, 512. Limited by Amazon permission of 550 nodes
  Simultaneously double the amount of computations and nodes
  Measure completion time
  Repeat several times to get statistical averages
Conclusions
  Total degradation from 13 to 16 seconds, or about 20%
  Discarding first 8 nodes, near-perfect scale up to 128
  Slight degradation from 128 to 256 (3%), from 256 to 512 (7%)
=> Proves the point of near-linear scalability end-to-end
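The degradation figure follows from simple arithmetic on completion times. A minimal sketch, assuming weak scaling (work grows with the node count, so the ideal completion time stays flat); the function names are illustrative, and only the 13-second and 16-second endpoints come from the results above.

```javascript
// Weak scaling: work grows with node count, so the ideal completion time is constant.
// Function names are illustrative, not part of the benchmark framework.

// Relative slowdown of the scaled-up run versus the baseline run.
function degradation(baseSeconds, scaledSeconds) {
  return (scaledSeconds - baseSeconds) / baseSeconds;
}

// Weak-scaling efficiency: 1.0 means a perfectly flat completion time.
function weakScalingEfficiency(baseSeconds, scaledSeconds) {
  return baseSeconds / scaledSeconds;
}

// Endpoints quoted on the slide: 13 s at the small end, 16 s at the large end.
const slowdown = degradation(13, 16);
const efficiency = weakScalingEfficiency(13, 16);
console.log((slowdown * 100).toFixed(1) + "% slower, efficiency " + efficiency.toFixed(2));
```

Measured against the 13-second baseline the slowdown is about 23%; measured against the 16-second result it is about 19%. Either way it is in line with the rounded 20% quoted on the slide.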
Simple scaling script
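var itersPerNode = 5000;
var cnode = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512];

// For each grid size: grow the EC2 grid, wait for the instances, then run the task.
// The work scales with the node count (itersPerNode * n).
for (var i = 0; i < cnode.length; i++) {
    var n = cnode[i];
    grid.growEC2Grid(n, true);
    grid.waitForGridInstances(n);
    runTask(itersPerNode * n, n, 3);
}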
Observations
Deployment considerations
Start-up for whole grid in different configurations is 0.5 - 3 min
2-step deployment process
  First, bring up one EC2 node as controller
  Next, use the controller on-the-inside to coordinate bootstrapping
Some of EC2 nodes don’t finish bootstrapping successfully
  On average, 0.5% of nodes come up in an incomplete state
  The nature of the problem is not clear
  If the exact processing power is essential, start the nodes, then kill off the sick ones and bring up a few new ones before starting computation
IP address deadlock issue
  IP addresses of the nodes are needed to start & configure the grid
  IP addresses are not available until the grid is up & configured
  Need to carefully choreograph bootstrapping and pass IPs as parameters into controlling scripts
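The "start the nodes, kill off the sick ones, bring up a few new ones" advice can be sketched as a small retry loop. This is a minimal illustration, not the framework used in the benchmark: `launchNode`, `isHealthy` and `terminateNode` are hypothetical callbacks standing in for whatever EC2 tooling is in use, and the failure pattern in the usage example is simulated.

```javascript
// Sketch of "start the nodes, kill off the sick ones, bring up a few new ones".
// launchNode, isHealthy and terminateNode are hypothetical callbacks standing in
// for the EC2 tooling in use; they are not a real library API.
function bringUpHealthyGrid(target, launchNode, isHealthy, terminateNode, maxRounds) {
  const healthy = [];
  for (let round = 0; round < maxRounds && healthy.length < target; round++) {
    const missing = target - healthy.length;
    for (let i = 0; i < missing; i++) {
      const node = launchNode();
      if (isHealthy(node)) {
        healthy.push(node);
      } else {
        terminateNode(node); // do not keep paying for a node that cannot do work
      }
    }
  }
  if (healthy.length < target) {
    throw new Error("only " + healthy.length + " of " + target + " nodes are healthy");
  }
  return healthy;
}

// Simulated usage: every 4th launched node fails to bootstrap.
let launched = 0;
const terminated = [];
const nodes = bringUpHealthyGrid(
  8,
  () => ({ id: ++launched }),
  (node) => node.id % 4 !== 0,
  (node) => terminated.push(node.id),
  5
);
console.log(nodes.map((n) => n.id), terminated);
```

Capping the number of rounds keeps a systematic bootstrapping failure from launching (and billing) nodes indefinitely.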
Observations
Monitoring considerations
Connection to each node from outside is possible, but not efficient
Check heartbeat from the internal management nodes
Local scripts must be stored on S3 or passed back before termination
Programming model considerations
EC2 does not support IP multicast
  Switched to JMS instead
  Luckily, GridGain supported multiple protocols
Typica: 3rd-party connectivity library that uses the EC2 query interface
  Undocumented limit on URL length is hit with 100s of nodes
  Amazon just disconnects on improper URLs without specifying the error, so debugging was hard
  Workaround: rewrote some parts of our framework to query individual running nodes. Works, but less efficient
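One middle ground between a single oversized request and one request per node is to split the instance list into batches that each fit under a length limit. The sketch below is not the workaround described above (which queried nodes individually). The limit value and the base URL are assumptions for illustration, since the real limit was undocumented; the `InstanceId.N` parameter naming follows the EC2 query interface, and request signing is omitted.

```javascript
// Split instance IDs into batches so that each query URL stays under maxUrlLength.
// maxUrlLength and baseUrl are assumptions for illustration; signing is omitted.
function buildBatchedQueryUrls(baseUrl, instanceIds, maxUrlLength) {
  const urls = [];
  let current = baseUrl;
  let indexInBatch = 0;
  for (const id of instanceIds) {
    let param = "&InstanceId." + (indexInBatch + 1) + "=" + encodeURIComponent(id);
    if (indexInBatch > 0 && current.length + param.length > maxUrlLength) {
      urls.push(current);   // close the full batch
      current = baseUrl;    // start a new one
      indexInBatch = 0;
      param = "&InstanceId.1=" + encodeURIComponent(id);
    }
    current += param;
    indexInBatch++;
  }
  if (indexInBatch > 0) urls.push(current);
  return urls;
}

// Placeholder endpoint and made-up instance IDs.
const base = "https://ec2.example.invalid/?Action=DescribeInstances";
const ids = Array.from({ length: 300 }, (_, i) => "i-" + String(i).padStart(8, "0"));
const urls = buildBatchedQueryUrls(base, ids, 2000);
console.log(urls.length, Math.max(...urls.map((u) => u.length)));
```

Batching trades a handful of requests for a bounded URL size, which may be a reasonable compromise when per-node queries are too slow.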
Observations
Metering and payment
Amazon sets a limit on concurrent VMs
  Eventually received approval for 550 VMs after some due diligence from Amazon
Amazon charges by full or partial VM-hours
Sometimes, short usage of VMs is not metered
  Not clear why
  One hypothesis: metering “sweeps” happen every so often
Be careful with usage bills for testing
  A test may need to be run multiple times
  Beware of rogue scripts
  Test everything on smaller configurations first
  Scale gradually, or you will miss the bottlenecks
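A quick estimate before launching a large test helps avoid billing surprises. The sketch below assumes that every started VM-hour is charged in full, which is one reading of "full or partial VM-hours"; the price is a placeholder input rather than an actual Amazon rate, and the function names are illustrative.

```javascript
// Estimate the bill for a test run, assuming every started VM-hour is charged
// in full (partial hours round up). pricePerVmHour is a placeholder input,
// not an actual Amazon price.
function billedVmHours(nodes, runtimeMinutes) {
  return nodes * Math.ceil(runtimeMinutes / 60);
}

function estimateRunCost(nodes, runtimeMinutes, pricePerVmHour, repetitions) {
  return billedVmHours(nodes, runtimeMinutes) * pricePerVmHour * repetitions;
}

// Under this assumption a 10-minute test on 512 nodes is billed as 512 VM-hours,
// not the roughly 85 VM-hours actually used.
console.log(billedVmHours(512, 10)); // 512
console.log(billedVmHours(512, 61)); // 1024
```

Because short tests are rounded up and usually repeated, the cost of a scaling sweep is dominated by the node count and the number of repetitions rather than by the runtime of each run.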
Achieving scalability
Software breaks at scale, including the glueware
Barrier #1 was hit at 100 nodes because of ActiveMQ scalability
  Correction: Replaced ActiveMQ with OpenMQ
  Comment: some users report better ActiveMQ scalability with 5.x
Barrier #2 was hit at 300 nodes because of Typica URL length limit
Correction: Changed our use of the API
Security considerations
  EC2 credentials are passed to Head Node
  3rd-party GridGain tasks can access them
  Sounds like a potential vulnerability
What have we learned?
EC2 is ready for production usage on large-scale stateless computations
  Price/performance
  Strong linear scale curve
GridGain showed itself very well
  Scale, stability, ease-of-use, pluggability
  Solid open source choice of map-reduce middleware
Some level of effort is required to “port” grid system to EC2
  Deployment, monitoring, programming model, metering, security
What’s next?
  Can we go higher than 512?
  What is the behavior of more complex applications?
Benchmark #2: Scalability of Data-Driven Risk Management Application on EC2
Data-driven Risk Management on EC2
Goal: Investigate scalability of a prototypical Risk Management application that uses a significant amount of cached data to support large-scale Monte-Carlo simulations executed on EC2 using GigaSpaces
Why risk management: class of problems widely used in financial services
Why GigaSpaces: leading middleware platform for compute & data grids
Intended Claims:
EC2 scales linearly for data-driven HPC applications
GigaSpaces scales well as both compute and data grid middleware
Businesses can run their existing risk management (and similar) applications on EC2 today using off-the-shelf technologies
Architecture
[Diagram: Amazon EC2 grid with Service Grid Manager, Compute Grid, Data Grid and Master; Grid Console]
User uses ec2-gdc-tools to manage the grid
Master writes tasks into the data grid and waits for results
Workers take tasks, perform calculations, write results back
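The master-worker flow through a shared data grid can be sketched with an in-memory stand-in. This is an illustration of the pattern only: the `DataGrid` class below is hypothetical and is not the GigaSpaces API, the workers run in a single process, and the per-task calculation is a placeholder for the real simulation.

```javascript
// In-memory stand-in for a data grid; NOT the GigaSpaces API.
class DataGrid {
  constructor() { this.entries = []; }
  write(entry) { this.entries.push(entry); }
  // Remove and return the first entry of the given type, or null if none exists.
  take(type) {
    const index = this.entries.findIndex((e) => e.type === type);
    return index === -1 ? null : this.entries.splice(index, 1)[0];
  }
}

// Master: split the work into tasks and write them into the grid.
function submitTasks(grid, taskCount, itemsPerTask) {
  for (let id = 0; id < taskCount; id++) {
    grid.write({ type: "task", id, start: id * itemsPerTask, count: itemsPerTask });
  }
}

// Worker: take one task, compute, write the result back. Returns false when idle.
function workOnce(grid, workerName) {
  const task = grid.take("task");
  if (task === null) return false;
  let sum = 0;
  for (let i = task.start; i < task.start + task.count; i++) sum += i; // placeholder calculation
  grid.write({ type: "result", id: task.id, value: sum, worker: workerName });
  return true;
}

// Master: collect all results once the workers are done.
function collectResults(grid, taskCount) {
  let total = 0;
  for (let i = 0; i < taskCount; i++) total += grid.take("result").value;
  return total;
}

const grid = new DataGrid();
submitTasks(grid, 4, 25); // covers the integers 0..99
const workers = ["w1", "w2"];
let busy = true;
while (busy) busy = workers.map((w) => workOnce(grid, w)).some(Boolean);
console.log(collectResults(grid, 4)); // 4950
```

Because master and workers only talk to the grid, adding workers requires no change on the master side; the grid itself becomes the component whose capacity limits scaling.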
Performance methodology & results
Same algorithm exercised on wide range of nodes
  16, 32, 128, 256, 512. Still limited by Amazon permission of 550
  Constant size of data grid (4 large EC2 nodes)
  Double the nodes with constant amount of work
  Measure completion time (strive for linear time reduction)
Conclusions
  Near perfect scale from 16 to 256 nodes
  28% degradation from 256 to 512 since data cache becomes a bottleneck
[Chart: total time (secs) vs. number of nodes]
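With a constant amount of work, doubling the nodes should ideally halve the completion time, and degradation can be expressed relative to that ideal. The sketch below is one possible definition; the slides do not state how the 28% figure was computed, and all timings in the example are made-up inputs rather than measured benchmark data.

```javascript
// Strong scaling: total work is constant, so doubling the nodes should halve the time.
// All timings below are made-up illustrative inputs, not measured benchmark data.
function idealTime(baseNodes, baseSeconds, nodes) {
  return baseSeconds * baseNodes / nodes;
}

// Relative slowdown of the measured time versus the ideal (linear) time.
function degradationVsIdeal(baseNodes, baseSeconds, nodes, measuredSeconds) {
  const ideal = idealTime(baseNodes, baseSeconds, nodes);
  return (measuredSeconds - ideal) / ideal;
}

// Hypothetical run: 200 s on 256 nodes, 128 s on 512 nodes.
console.log(idealTime(256, 200, 512));               // 100
console.log(degradationVsIdeal(256, 200, 512, 128)); // 0.28
```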
What have we learned?
EC2 is ready for production usage for classes of large-scale data-driven HPC applications, common to Risk Management
GigaSpaces showed itself very well
Compute - data grid scales well in master-worker pattern
Some level of effort is required to “port” grid system to EC2
  Deployment, monitoring, programming model, metering, security
  Bootstrapping this system is far more complex than GridGain’s. For more details, contact me offline
What’s next?
  How does data grid scale?
  What about more complex applications?
  What’s the scalability of co-located compute-data grid configuration?
Benchmark #3: Performance implications of data “in the cloud” vs. “outside the cloud” for data-intensive analytics applications
Data-intensive Analytics on MS cloud
Goal: Investigate performance improvements from data “in the cloud” vs. “outside the cloud” for complex data-intensive analytical applications in the context of the HPC CompFin++ Labs environment using Velocity
What is CompFin++ Labs: MS-funded “incubator” compute cloud for exploration of modern compute & data challenges on massive scale
What is Velocity: Microsoft’s new in-memory data grid middleware, still at CTP1
The Model: Computes correlation between stock prices over time. Algorithms use a significant amount of data which could be cached. Maximum cache hit ratio for the model is around 90%.
Intended Claims:
Measure impact of data “closeness” to the computation on the cloud
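For readers unfamiliar with the kind of computation the model performs, here is a minimal sketch of a correlation between two price series. It assumes the Pearson coefficient; the slides do not say which correlation measure the model actually uses, and the prices are made up.

```javascript
// Pearson correlation of two equally long price series.
// Assumption: the slides do not say which correlation measure the model uses.
function pearsonCorrelation(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("series must have the same length (at least 2)");
  }
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let covariance = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    covariance += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  // A constant series has zero variance, in which case the result is NaN.
  return covariance / Math.sqrt(varX * varY);
}

// Made-up prices for two tickers over five ticks.
const pricesA = [10, 11, 12, 13, 14];
const pricesB = [20, 22, 24, 26, 28];
console.log(pearsonCorrelation(pricesA, pricesB)); // 1
```

Every pairwise correlation reads two full price series, which is why keeping the tick data close to the computation matters for this workload.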
Architecture: CompFin
Architecture: Anticipated Bottlenecks
Architecture: CompFin + Velocity
Benchmarked configurations
Same analytical model with complex queries
  Perfect linear scale curve (baseline)
  Original CompFin
  Distributed cache (original CompFin + Velocity distributed cache for financial data)
  Local cache (original CompFin + Velocity distributed cache for financial data + near cache with data-aware routing)
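Data-aware routing means sending each task to the node that already holds the data it needs, so that node's near cache can serve the reads. A minimal sketch of the idea, assuming partition ownership is decided by a simple string hash; the routing mechanism actually used with Velocity is not shown in the slides, and all names and keys below are illustrative.

```javascript
// Data-aware routing: send each task to the node that owns the partition its data
// lives in. Partitioning by a simple string hash is an assumption for illustration.
function hashKey(key) {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

function ownerNode(key, nodeCount) {
  return hashKey(key) % nodeCount;
}

// Group tasks by the node that owns their data key.
function routeTasks(tasks, nodeCount) {
  const queues = Array.from({ length: nodeCount }, () => []);
  for (const task of tasks) {
    queues[ownerNode(task.dataKey, nodeCount)].push(task);
  }
  return queues;
}

// Made-up tasks keyed by ticker symbol.
const tasks = [
  { id: 1, dataKey: "MSFT" },
  { id: 2, dataKey: "AMZN" },
  { id: 3, dataKey: "MSFT" },
];
const queues = routeTasks(tasks, 4);
console.log(queues.map((q) => q.map((t) => t.id)));
```

Tasks that share a data key always land on the same node, so the second and later tasks for that key can read from the local copy.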
Test methodology
3 ways of measuring scalability were used
  Fixed amount of computations, increasing amount of data
  Fixed amount of data, increasing amount of computations
  Proportional increase of computations and nodes
“Node” = 1 core
“Data unit” = 32 million records or 512 megabytes of tick data
Test 1 Test 2 Test 3
Test # Nodes Data Units
1 8 1
2 32 1
3 32 1
4 32 6
5 32 12
6 32 12
7 64 24
8 128 48
9 200 69
Performance results
Performance results
Conclusions
Data “in the cloud” definitely matters!
Performance improvements up to 31 times over “outside the cloud”
Velocity distributed cache has some scalability challenges:
  Failure on a 50-node cluster with 200 concurrent clients
  Good news: it’s a very young product and MS is actively improving it
Compute-data affinity matters too!
  Significant performance gain of local cache over distributed cache
  Local cache resolved distributed cache scalability issue by reducing its load
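How a local cache reduces load on the distributed cache can be shown with a small counter. The sketch below is a generic near-cache wrapper, not the Velocity API: `fetchRemote` is a hypothetical callback standing in for a distributed-cache read, and eviction and invalidation are left out.

```javascript
// Near (local) cache in front of a distributed cache. fetchRemote is a hypothetical
// callback standing in for a distributed-cache read; this is not the Velocity API.
class NearCache {
  constructor(fetchRemote) {
    this.fetchRemote = fetchRemote;
    this.local = new Map();
    this.remoteReads = 0;
  }
  get(key) {
    if (this.local.has(key)) return this.local.get(key); // served locally
    this.remoteReads++; // this read adds load on the distributed cache
    const value = this.fetchRemote(key);
    this.local.set(key, value);
    return value;
  }
}

// Ten reads over two distinct keys reach the distributed cache only twice.
const cache = new NearCache((key) => "ticks-for-" + key);
for (let i = 0; i < 10; i++) cache.get(i % 2 === 0 ? "MSFT" : "AMZN");
console.log(cache.remoteReads); // 2
```

The benefit depends on tasks for the same data landing on the same node, which is why the local cache configuration pairs the near cache with data-aware routing.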
Final Remarks
Clouds are proving themselves out
  Early adopters are there already
  The rest of the real world will join soon
There are still significant adoption challenges
  Technology immaturity
  Lack of real data, best practices, robust design patterns
  “Fitting” of application middleware to cloud platforms is just starting
Amazon is the leading commercial cloud provider, but is not the only game in town
  Companies are building public, private, dedicated and special-purpose clouds
Thank You!
Victoria Livschitz
vlivschitz@griddynamics.com