Business Plan Overview Grid Dynamics

Using Grid Technologies on the Cloud for High Scalability
A Practitioner Report for Cloud User Group
Victoria Livschitz, CEO, Grid Dynamics
vlivschitz@griddynamics.com
September 17th, 2008

A word about Grid Dynamics
- Who we are: global leader in scalability engineering
- Mission: enable adoption of scalable applications and networks through design patterns, best practices and engineering excellence
- Value proposition: fusion of innovation with best practices
- Focused on “physics”, “economics” and “engineering” of extreme scale
- Founded in 2006, 30 people and growing, HQ in Silicon Valley
- Services
  - Technology consulting
  - Application & systems architecture, design, development
- Customers
  - Users of scalable applications: eBay, Bank of America, web start-ups
  - Makers of scalable middleware: GigaSpaces, Sun, Microsoft
- Partners: GridGain, GigaSpaces, Terracotta, DataSynapse, Sun, MS

Why am I speaking here tonight?
- We do scalability engineering for a living
- Cloud computing is new, very exciting and terribly over-hyped
  - Not a lot of solid data on performance, scalability, usability, stability…
- Many of our customers are early adopters or enablers
  - Their pains, discoveries and lessons are worth sharing
  - The practitioner perspective
- Recently completed 3 benchmark projects that we can make public
  - Results are presented here tonight

Exploring Scalability through Benchmarking
- Benchmark 1: Test scalability of EC2 on the simplest map-reduce problem
  - Cloud: Public commercial cloud, EC2
  - Vendor: Amazon
  - Middleware: GridGain
  - Application: Monte-Carlo
- Benchmark 2: Test scalability of data-driven HPC applications, similar to those used in practice
  - Cloud: Public commercial cloud, EC2
  - Vendor: Amazon
  - Middleware: GigaSpaces
  - Application: Risk Management
- Benchmark 3: Explore performance implications of data “in the cloud” vs. “outside the cloud”
  - Cloud: Incubator compute cloud for academic use, CompFin
  - Vendor: Microsoft
  - Middleware: Windows HPC Server, Velocity
  - Application: Data-intensive Analytics

Benchmark #1: Scalability of Simple Map/Reduce Application on EC2

Basic Scalability of Simple Map/Reduce
- Goal: Establish upper limit on scalability of Monte-Carlo simulations performed on EC2 using GridGain
  - Why Monte-Carlo: simple, widely-used, perfectly scalable problem
  - Why EC2: most popular public cloud
  - Why GridGain: simple, open-source map-reduce middleware
- Intended Claims:
  - EC2 scales linearly as grid execution platform
  - GridGain scales linearly as map-reduce middleware
  - Businesses can run their existing Monte-Carlo simulations on EC2 today using open-source technologies

Other Goals
- Understand “process bottlenecks” of EC2 platform
  - Changes to the programming, deployment, management model
  - Ease of use
  - Security
  - Metering and payment
- Identify scalability bottlenecks at any level in the stack
  - EC2
  - GridGain
  - Glueware
- Robustness
- Stability
- Predictability

Architecture
[Diagram labels: Job Execution; JMS Message Processing; Manages worker nodes and tasks; Discovery & Task Assignment; Spare EC2 Instances; OpenMQ Server; JMS; Amazon EC2 Cloud; Spare Capacity; Head Node; Worker Nodes; Controls Grid Operation; Configuration & Task Repository; Corporate Intranet; Grid Console; HTTP Server]
- Technology Stack:
  - EC2
  - GridGain
  - Typica
  - OpenMQ

Performance Methodology & Results
- Same algorithm exercised on wide range of nodes
  - 2, 4, 8, 16, …, 256, 512. Limited by Amazon permission of 550 nodes
  - Simultaneously double the amount of computations and nodes
  - Measure completion time
  - Repeat several times to get statistical averages
- Conclusions
  - Total degradation from 13 to 16 seconds, or roughly 20%
  - Discarding first 8 nodes, near perfect scale up to 128
  - Slight degradation from 128 to 256 (3%), from 256 to 512 (7%)
  => Proves the point of near-linear scalability end-to-end

Simple scaling script

    // Monte-Carlo iterations assigned to each node
    var itersPerNode = 5000;
    // Grid sizes to test, doubling at each step
    var cnode = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512];
    for (var i = 0; i < cnode.length; i++) {
      var n = cnode[i];
      grid.growEC2Grid(n, true);        // grow the grid to n EC2 instances
      grid.waitForGridInstances(n);     // wait until n instances are up
      runTask(itersPerNode * n, n, 3);  // total work grows with n; third argument as on the slide
    }

Observations
- Deployment considerations
  - Start-up for whole grid in different configurations is 0.5 - 3 min
  - 2-step deployment process
    - First, bring up one EC2 node as controller
    - Next, use the controller on-the-inside to coordinate bootstrapping
  - Some EC2 nodes don’t finish bootstrapping successfully
    - An average of 0.5% of nodes come up in an incomplete state
    - The nature of the problem is not clear
    - If the exact processing power is essential, start the nodes, then kill off the sick ones and bring up a few new ones before starting computation
  - IP address deadlock issue
    - IP addresses of the nodes are needed to start & configure the grid
    - IP addresses are not available until the grid is up & configured
    - Need to carefully choreograph bootstrapping and pass IPs as parameters into controlling scripts

Observations
- Monitoring considerations
  - Connection to each node from outside is possible, but not efficient
  - Check heartbeat from the internal management nodes
  - Local scripts must be stored on S3 or passed back before termination
- Programming model considerations
  - EC2 does not support IP multicast
    - Switched to JMS instead
    - Luckily, GridGain supported multiple protocols
  - Typica: 3rd party connectivity library that uses EC2 query interface
    - Undocumented limit on URL length is hit with 100s of nodes
    - Amazon just disconnects on improper URLs without specifying the error, so debugging was hard
    - Workaround: rewrote some parts of our framework to enquire about individual running nodes. Works, but less efficient

Observations
- Metering and payment
  - Amazon sets a limit on concurrent VMs
    - We eventually received approval for 550 VMs after some due diligence from Amazon
  - Amazon charges per full or partial VM-hour
  - Sometimes, short usage of VMs is not metered
    - Not clear why
    - One hypothesis: metering “sweeps” happen every so often
  - Be careful with usage bills for testing
    - A test may need to be run multiple times
    - Beware of rogue scripts
    - Test everything on smaller configurations first
    - Scale gradually, or you will miss the bottlenecks

Achieving scalability
- Software breaks at scale, including the glueware
  - Barrier #1 was hit at 100 nodes because of ActiveMQ scalability
    - Correction: Replaced ActiveMQ with OpenMQ
    - Comment: some users report better ActiveMQ scalability with 5.x
  - Barrier #2 was hit at 300 nodes because of Typica URL length limit
    - Correction: Changed our use of the API
- Security considerations
  - EC2 credentials are passed to Head Node
  - 3rd party GridGain tasks can access them
  - Sounds like a potential vulnerability

What have we learned?
- EC2 is ready for production usage on large-scale stateless computations
  - Price/performance
  - Strong linear scale curve
- GridGain performed very well
  - Scale, stability, ease-of-use, pluggability
  - Solid open source choice of map-reduce middleware
- Some level of effort is required to “port” grid system to EC2
  - Deployment, monitoring, programming model, metering, security
- What’s next?
  - Can we go higher than 512?
  - What is the behavior of more complex applications?
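
The Benchmark #1 methodology doubles the work and the node count together, so ideal behavior is a flat completion time. The following is a minimal sketch of how the degradation could be quantified from measured times. The function names are illustrative and not part of the original framework; only the 13-second and 16-second endpoints come from the slides, which do not say which definition produced the 20% figure (it falls between the two computed here).

```javascript
// Weak-scaling metrics: work grows in proportion to node count,
// so a perfectly scalable system keeps completion time constant.
// Function names are illustrative assumptions, not the original framework's API.

// Fraction of ideal performance retained: 1.0 means no slowdown.
function weakScalingEfficiency(baselineSecs, measuredSecs) {
  return baselineSecs / measuredSecs;
}

// Relative slowdown of the measured run versus the baseline run.
function relativeSlowdown(baselineSecs, measuredSecs) {
  return (measuredSecs - baselineSecs) / baselineSecs;
}

// Endpoints reported on the slides: completion time rose from 13 s to 16 s.
var efficiency = weakScalingEfficiency(13, 16);
var slowdown = relativeSlowdown(13, 16);

console.log("efficiency lost: " + ((1 - efficiency) * 100).toFixed(1) + "%");
console.log("relative slowdown: " + (slowdown * 100).toFixed(1) + "%");
```

Reporting both numbers side by side avoids ambiguity when comparing runs, since the same pair of timings yields a different percentage depending on which run is used as the denominator.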

Benchmark #2: Scalability of Data-Driven Risk Management Application on EC2

Data-driven Risk Management on EC2
- Goal: Investigate scalability of a prototypical Risk Management application that uses a significant amount of cached data to support large-scale Monte-Carlo simulations executed on EC2 using GigaSpaces
  - Why risk management: class of problems widely used in financial services
  - Why GigaSpaces: leading middleware platform for compute & data grids
- Intended Claims:
  - EC2 scales linearly for data-driven HPC applications
  - GigaSpaces scales well as both compute and data grid middleware
  - Businesses can run their existing risk management (and similar) applications on EC2 today using off-the-shelf technologies

Architecture
[Diagram labels: Service Grid Manager; Compute Grid; Amazon EC2 Grid; Grid Console; Master; Data Grid]
- User uses ec2-gdc-tools to manage grid
- Master writes tasks into data grid and waits for results
- Workers take tasks, perform calculations, write results back

Performance methodology & results
- Same algorithm exercised on wide range of nodes
  - 16, 32, 128, 256, 512. Still limited by Amazon permission of 550
  - Constant size of data grid (4 large EC2 nodes)
  - Double the nodes with constant amount of work
  - Measure completion time (strive for linear time reduction)
- Conclusions
  - Near perfect scale from 16 to 256 nodes
  - 28% degradation from 256 to 512 since data cache becomes a bottleneck
[Chart: total time (secs) vs. number of nodes]

What have we learned?
- EC2 is ready for production usage for classes of large-scale data-driven HPC applications, common to Risk Management
- GigaSpaces performed very well
  - Compute-data grid scales well in master-worker pattern
- Some level of effort is required to “port” grid system to EC2
  - Deployment, monitoring, programming model, metering, security
  - Bootstrapping this system is far more complex than GridGain’s. For more details, contact me offline
- What’s next?
  - How does data grid scale?
  - What about more complex applications?
  - What’s the scalability of co-located compute-data grid configuration?

Benchmark #3: Performance implications of data “in the cloud” vs. “outside the cloud” for data-intensive analytics applications

Data-intensive Analytics on MS cloud
- Goal: Investigate performance improvements from data “in the cloud” vs. “outside the cloud” for complex data-intensive analytical applications in the context of HPC CompFin++ Labs environment using Velocity
- What is CompFin++ Labs: MS-funded “incubator” compute cloud for exploration of modern compute & data challenges on massive scale
- What is Velocity: Microsoft’s new in-memory data grid middleware, still at CTP1
- The Model: Computes correlation between stock prices over time. Algorithms use a significant amount of data which could be cached. Maximum cache hit ratio for the model is around 90%.
- Intended Claims:
  - Measure impact of data “closeness” to the computation on the cloud

Architecture: CompFin
[Diagram not preserved]

Architecture: Anticipated Bottlenecks
[Diagram not preserved]

Architecture: CompFin + Velocity
[Diagram not preserved]

Benchmarked configurations
- Same analytical model with complex queries
  - Perfect linear scale curve (baseline)
  - Original CompFin
  - Distributed cache (original CompFin + Velocity distributed cache for financial data)
  - Local cache (original CompFin + Velocity distributed cache for financial data + near cache with data-aware routing)

Test methodology
- 3 ways of measuring scalability were used
  - Fixed amount of computations, increasing amount of data
  - Fixed amount of data, increasing amount of computations
  - Proportional increase of computations and nodes
- “Node” = 1 core
- “Data unit” = 32 million records or 512 megabytes of tick data

    Test #   Nodes   Data Units
    1        8       1
    2        32      1
    3        32      1
    4        32      6
    5        32      12
    6        32      12
    7        64      24
    8        128     48
    9        200     69

(The original slide groups these runs under the labels Test 1, Test 2 and Test 3.)

Performance results
[Two slides of charts, not preserved]

Conclusions
- Data “on the cloud” definitely matters!
  - Performance improvements up to 31 times over “outside the cloud”
- Velocity distributed cache has some scalability challenges:
  - Failure on a 50-node cluster with 200 concurrent clients
  - Good news: it’s a very young product and MS is actively improving it
- Compute-data affinity matters too!
  - Significant performance gain of local cache over distributed cache
  - Local cache resolved distributed cache scalability issue by reducing its load

Final Remarks
- Clouds are proving themselves out
  - Early adopters are there already
  - The rest of the real world will join soon
- There are still significant adoption challenges
  - Technology immaturity
  - Lack of real data, best practices, robust design patterns
  - “Fitting” of application middleware to cloud platforms is just starting
- Amazon is the leading commercial cloud provider, but is not the only game in town
  - Companies are building public, private, dedicated and special-purpose clouds

Thank You!
Victoria Livschitz
vlivschitz@griddynamics.com

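A note on the Benchmark #2 methodology, where the node count doubles while the amount of work stays constant: ideal behavior there is a completion time that halves at each step. The sketch below shows one plausible way to express degradation against that ideal. The function names and the timings are hypothetical, chosen only to illustrate the arithmetic; the slides do not publish the raw times or state how the 28% figure was defined.

```javascript
// Strong-scaling check: total work is fixed, so doubling the nodes
// should ideally halve the completion time.
// Names and timings below are hypothetical, for illustration only.

// Ideal completion time at `nodes`, extrapolated from a reference run.
function idealTime(refNodes, refSecs, nodes) {
  return refSecs * refNodes / nodes;
}

// How much slower the measured run is than the ideal, as a fraction of the ideal.
function degradation(refNodes, refSecs, nodes, measuredSecs) {
  var ideal = idealTime(refNodes, refSecs, nodes);
  return (measuredSecs - ideal) / ideal;
}

// HYPOTHETICAL timings: 100 s on 256 nodes, 64 s on 512 nodes.
var runs = [
  { nodes: 256, secs: 100 },
  { nodes: 512, secs: 64 }
];
var d = degradation(runs[0].nodes, runs[0].secs, runs[1].nodes, runs[1].secs);
console.log("degradation from 256 to 512 nodes: " + Math.round(d * 100) + "%");
```

With these made-up numbers the ideal time at 512 nodes is 50 s, so a measured 64 s is 28% above the ideal. A run that still improves in absolute terms can therefore show substantial degradation relative to linear scaling.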