
Scaling The Software Development Process:
Lessons Learned from The Sims Online

Greg Kearney, Larry Mellon, Darrin West
Spring 2003, GDC

                                1
               Talk Overview
• Covers: Software Engineering techniques to
  help when projects get big
  – Code structure
  – Work processes (for programmers)
  – Testing
• Does Not Cover:
  – Game Design / Content Pipeline
  – Operations / Project Management



                                               2
            How to Apply it.
• We didn't do all of this right away
• Improve what you can
• Don't change too much at once
• Prove that it works, and others will take up
  the cause
• Iterate




                                                 3
               Match Process to Scale
[Figure: team efficiency vs. team size. A process tuned for 5 to 15 programmers
loses efficiency as the team grows ("Everything's Broken Hell"); a process tuned
for 30 to 50 programmers carries more overhead for a small team ("Meeting Hell")
but holds up at scale. Change to the new process where the curves cross.]


                                                                    4
  What You Should Leave With
• TSO “Lessons Learned”
  – Where we were with our software process
  – What we did about it
  – How it helped
• Some Rules of Thumb
  – General practices that tend to smooth software
    development @ scale
  – Not a blueprint for MMP development
  – Useful “frame of reference”




                                                     5
           Classes of
    Lessons Learned & Rules
• Architecture / Design: Keep it Simple
  – Minimizing dependencies, fatal couplings
  – Minimizing complexity, brittleness
• Workspace Management: Keep it Clean
  – Code and directory structure
  – Check in and integration strategies
• Dev. Support Structure: Make it Easy, Prove it
  – Testing
  – Automation
         -All of these had to change as we scaled up.
         -They eventually exceeded the team's ability to cope with them
         (using existing tools & processes).

                                                                6
              Non-Geek Analogy




–Sharpen your tools.
–Clean up your mess.
–Measure twice, cut once.
–Stay with your buddy.

Bad flashbacks found at:
http://www.easthamptonhigh.org/cernak/
http://www.hancock.k12.mi.us/high/art/wood/index.html


                                                                                    7
  Key Factors Affecting Efficiency
• High “Churn Rate”: many coders working in
  tightly coupled code meant frequent breaks
  – Our code had a deep root system
  – And we had a forest of changes to make

                   “Big root ball” found at:
                    http://www.on.ec.gc.ca/canwarn/norwich/norsummary-e.html

                                                                               8
Make It Smaller




     Evolve




                  9
 Key Factors Affecting Efficiency
• “Key Logs”: some issues were
  preventing other issues from even
  being worked on




                                      10
  Key Factors Affecting Efficiency
• A chain of single points of failure…
• Can take out the entire team

  Login → Create an avatar → Enter a city → Buy a house →
  Enter a house → Buy the chair → Sit on a chair



                                          11
         So, What Did We Do
             That Worked
• Switched to a logical architecture with less
  coupling
• Switched to a code structure with fewer
  dependencies
• Put in scaffolding to keep everyone working
• Developed sophisticated configuration
  management
• Instituted automated testing
• Metrics, Metrics, Metrics


                                                 12
            So, What Did We Do
                That Didn't?
•   Long range milestone planning
•   Network emulator(s)
•   Over-engineered a few things (too general)
•   Some tasks failed due to:
    – Not replanning, reviewing long tasks
    – Not breaking up long tasks
• Coding standard changed part way through
• …


                                                 13
       What we were faced with:
•   600K lines of legacy Windows code (maybe)
•   Port it to Linux
•   Change from “multiplayer” to Client/Server
•   18 months
•   Developers must remain alive after shipping
•   Continuous releases starting at Beta




                                                  14
Go To Final
Architecture
   ASAP




               15
    Go to final architecture ASAP
[Diagram: in the multiplayer architecture every client runs its own Sim, and all
clients must stay in lockstep ("Here be Sync Hell"). Evolve to Client/Server:
one authoritative Sim on the server ("Nice, undemocratic"), with clients sending
requests and receiving commands.]
                                                                     16
      Final Architecture ASAP:
                   “Refactoring”
• Decomposed into multiple DLLs
  – Found the Simulator
• Interfaces
• Reference Counting
• Client/Server subclassing

            How it helped:
            –Reduced coupling. Even reduced compile times!
            –Developers in different modules broke each other less often.
            –We went everywhere and learned the code base.
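
Below is a minimal sketch of the interface / reference-counting / client-server
subclassing split; all names (ISimulator, ServerSimulator, …) are illustrative
assumptions, not TSO's actual classes.

```cpp
// Illustrative only -- names and structure are assumptions, not TSO's code.
#include <cstdio>

// Module boundary: callers see only the interface, never the concrete Sim.
class ISimulator {
public:
    virtual void AddRef() = 0;
    virtual void Release() = 0;          // delete self when the count hits 0
    virtual void Tick(float dt) = 0;
    virtual ~ISimulator() {}
};

// Shared behavior lives once; client and server differ only where they must.
class SimulatorBase : public ISimulator {
    int m_refs = 1;
public:
    void AddRef() override { ++m_refs; }
    void Release() override { if (--m_refs == 0) delete this; }
};

class ServerSimulator : public SimulatorBase {
public:
    void Tick(float dt) override { std::printf("authoritative step %.3f\n", dt); }
};

class ClientSimulator : public SimulatorBase {
public:
    void Tick(float dt) override { std::printf("predicted step %.3f\n", dt); }
};

// Factory exported from the simulator DLL: the rest of the game links only
// against ISimulator, so churn inside the DLL stops rippling outward.
ISimulator* CreateSimulator(bool serverSide) {
    if (serverSide) return new ServerSimulator;
    return new ClientSimulator;
}
```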


                                                                            17
       Final Architecture ASAP:
            It Had to Always Run
• Initially clients wouldn't behave predictably
• We could not even play test
• Game design was demoralized

• We needed a bridge, now!


                                                  18
       Final Architecture ASAP:
               Incremental Sync
• A quick temporary solution…
  – Couldn't wait for final system to be finished
  – High overhead, couldn't ship it
• We took partial state snapshots on the
  server and restored to them on the client

                         How it helped:
                         –Could finally see the game as it would be.
                         –Allowed parallel game design and coding
                         –Bought time to lay in the “right” stuff.
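
A hedged sketch of the snapshot-and-restore idea (types and function names are
invented for illustration): the server flattens a slice of authoritative state,
and the client overwrites its local copy with it. Whole-state snapshots like
this are easy to bolt on but too heavy to ship, which is the caveat above.

```cpp
// Illustrative sketch only -- not TSO's actual sync code.
#include <cstdint>
#include <cstring>
#include <vector>

struct AvatarState {                 // one slice of simulation state
    std::uint32_t id;
    float x, y;
    std::uint16_t animation;
};

// Server: flatten a subset of objects into a buffer and ship it down.
std::vector<std::uint8_t> SnapshotAvatars(const std::vector<AvatarState>& avatars) {
    std::vector<std::uint8_t> buf(avatars.size() * sizeof(AvatarState));
    if (!buf.empty()) std::memcpy(buf.data(), avatars.data(), buf.size());
    return buf;
}

// Client: discard local guesses and restore to the server's snapshot.
void RestoreAvatars(const std::vector<std::uint8_t>& buf,
                    std::vector<AvatarState>* avatars) {
    avatars->resize(buf.size() / sizeof(AvatarState));
    if (!buf.empty()) std::memcpy(avatars->data(), buf.data(), buf.size());
}
```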


                                                                       19
       Final Architecture ASAP:
                      Null View
• Created Null View HouseSim on Windows
  – Same interface
  – Null (text output) implementation



                How it helped
                –No #ifdefs!
                –Done under Windows, we could test this first step.
                –We knew it was working during the port.
                –Allowed us to port to Linux only the “needed” parts.
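
A minimal sketch of the no-#ifdef null-view idea (interface and class names are
assumptions, not the real HouseSim interfaces): the sim talks only to a view
interface, and the server build simply links the text-output implementation
instead of the renderer.

```cpp
// Illustrative only; not the actual HouseSim interfaces.
#include <cstdio>
#include <string>

class IHouseView {                       // same interface on every platform
public:
    virtual void ShowAvatar(int id, float x, float y) = 0;
    virtual void ShowMessage(const std::string& text) = 0;
    virtual ~IHouseView() {}
};

// Null (text output) implementation -- all the server build needs.
class NullHouseView : public IHouseView {
public:
    void ShowAvatar(int id, float x, float y) override {
        std::printf("avatar %d at (%.1f, %.1f)\n", id, x, y);
    }
    void ShowMessage(const std::string& text) override {
        std::printf("msg: %s\n", text.c_str());
    }
};

// The rendered Windows view implements the same interface; which one a build
// gets is decided at link/construction time, so no #ifdefs in game logic.
```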


                                                                        20
       Final Architecture ASAP:
                      More “Bridges”
• HSBs: proxy on Linux, pass-through to a
  Windows Sim (see the sketch below).
      How it helped
      –Could exercise Linux components before finishing HouseSim port.
      –Allowed us to debug server scale, performance and stability issues early.
      –Make best use of Windows developers.
      –Allowed single platform development. Faster compiles.
• Disabled authentication, etc.
      How it helped
      –Could keep working even when some of the system wasn’t available.
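
Roughly what a pass-through bridge looks like, with stand-in Connection/Message
types (the real HSB protocol is not shown): the Linux process sits inside the
server cluster and simply forwards simulation traffic to and from a Windows Sim.

```cpp
// Illustrative pass-through proxy; Connection and Message are stand-in types.
#include <deque>

struct Message { int payload = 0; };     // opaque stand-in for real traffic

class Connection {                       // stand-in transport: an in-memory queue
public:
    bool Receive(Message* out) {
        if (m_in.empty()) return false;  // a real transport would block or poll
        *out = m_in.front();
        m_in.pop_front();
        return true;
    }
    void Send(const Message& msg) { m_out.push_back(msg); }
    std::deque<Message> m_in, m_out;
};

// The bridge: everything received from the Linux city server is passed straight
// through to the Windows HouseSim, and vice versa.
void PumpBridge(Connection& cluster, Connection& windowsSim) {
    Message msg;
    while (cluster.Receive(&msg))    windowsSim.Send(msg);   // city -> sim
    while (windowsSim.Receive(&msg)) cluster.Send(msg);      // sim -> city
}
```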

                                                                              21
Mainline *Must* Work!




                        22
       If Mainline Doesn't Work,
           Nobody Works
• The Mainline source control branch *must*
  run
• Never go dark: Demo/Play Test every day
• If you hit a bug, do you sync to mainline,
  hoping someone else fixed it? Or did you
  just add it?

       –If mainline breaks for “only” an hour, the project loses a man-week.
       –If each developer breaks the mainline “only” once a month, it is
       broken every day.


                                                                               23
             Mainline must work:
                         Sniff Test
• Mainline was breaking for “simple” things.
   – Features you "didn't touch" (and didn't test).
• Created an auto-test to exercise all core functions.
• Quick to run. Fun to watch. Checked results.
• Mandated that it pass before submitting code
  changes.
• Break the build: “feed the pig”.
               How it helped
               –Very simple test. Amazing difference.
               –Sometimes we got lazy and trusted it too much.
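
A toy version of such a pre-check-in test; the checked functions here are
invented stubs standing in for TSO's real core path (login through sitting on a
chair). Making the check-in gate depend on the test's exit code is what turned
it from a suggestion into a rule.

```cpp
// Toy sniff test -- the real one exercised TSO's actual core functions.
#include <cstdio>
#include <cstdlib>

static int g_failures = 0;

#define SNIFF_CHECK(expr)                                                   \
    do {                                                                    \
        if (expr) { std::printf("ok:   %s\n", #expr); }                     \
        else      { ++g_failures; std::printf("FAIL: %s\n", #expr); }       \
    } while (0)

// Stubs standing in for real game calls (they always "pass" here).
bool Login()      { return true; }
bool EnterCity()  { return true; }
bool BuyHouse()   { return true; }
bool SitOnChair() { return true; }

int main() {
    SNIFF_CHECK(Login());
    SNIFF_CHECK(EnterCity());
    SNIFF_CHECK(BuyHouse());
    SNIFF_CHECK(SitOnChair());
    // A non-zero exit code blocks the check-in (and feeds the pig).
    return g_failures == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```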

                                                                 24
          Mainline must work:
           Stages to “Sandboxing”
1. Got it to build reliably.
2. Instituted Auto-Builds: email all on failure.
3. Used a “Pumpkin” to avoid duplicate merge-
   test cycles, pulling partial submissions,...
4. Used a Pumpkin Queue when we really got
   rolling
            How it helped
            –Far fewer thumbs twiddled.
            –The extra process got on some people’s nerves.

                                                              25
           Mainline must work:
                    Sandboxing
5. Finally, went to per-developer branching.
  –   Develop on your own branch.
  –   Submit changes to an integration engineer.
  –   Full Smoke test run per submission/feature.
  –   If it worked, it was integrated to mainline in
      priority order; otherwise it was bounced.
                  How it helped
                  –Mainline *always* runs. Pull any time.
                  –Releases are not delayed by partial features.
                  –No more code freezes going to release.

                                                                   26
Support Structure




                    27
    Background: Support Structure

• Team size placed design constraints on
  supporting tools
  – Automation: big win in big teams
  – Churn rate: tool accuracy / support cost
• Types of tools / processes
  – Data management: collection / correlation
  – Testing: controlled, sync'ed, repeatable inputs
  – Baselines: my bug, your bug, or our bug?




                                                      28
     Support Structure Increased
       Developer Effectiveness
• Faster triage
  – Repeatable inputs (single/multi avatar)
  – View any module status while testing
• Monitoring & control tools became a
  focal point of development




                                              29
   Overview: Support Structure
• Automated testing: designs to minimize
  impact of churn rate
• Automated data collection / correlation
  – Distributed system == distributed data
  – Dashboard / Esper / MonkeyWatcher
• Use case: load testing
  – Controlled (tunable) inputs, observable results
  – “Scale&Break”



                                                      30
        Churn Rate:
Abstract Your Troubles Away




                              31
     Problem: Testing Accuracy
• Load & Regression: inputs must be
  – Accurate
  – Repeatable
• Churn rate: logic/data in constant motion
• Solution: game client becomes test client
  – Exact mimicry
  – Lower maintenance costs




                                              32
Test Client == Game Client
[Diagram: Test Client and Game Client are the same program. Test Control (in the
test client) and the Game GUI (in the game client) both sit on top of the shared
Presentation Layer and client-side game logic, issuing the same commands and
reading the same state.]
                                              33
Game Client: How Much To Keep?
[Diagram: the game client stack — View on top of the Presentation Layer on top
of Logic. The question is how much of it the test client should keep.]




                                 34
         What Level To Test At?
[Diagram: driving tests at the View level by feeding mouse clicks into the full
client stack. Regression: too brittle (pixel shift). Load: too bulky.]

                                                 35
           What Level To Test At?
[Diagram: driving tests at the Logic level by injecting internal events beneath
the Presentation Layer. Regression: still too brittle (churn rate of logic and
data).]

                                             36
  Gameplay: Semantic Abstractions

 Basic gameplay changes less frequently
 than UI or protocol implementations.


[Diagram: the Presentation Layer sits between View and Logic and exposes
semantic gameplay operations: Chat, Enter Lot, Route Avatar, Use Object, …]
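
One way to express that layer in code; the interface and operation names are
assumptions, not TSO's actual Presentation Layer API. The GUI translates mouse
clicks into these same calls, so tests and players exercise identical code
below this line.

```cpp
// Illustrative Presentation Layer interface -- not TSO's actual API.
#include <string>

class IPresentationLayer {
public:
    virtual void Chat(const std::string& text) = 0;
    virtual void EnterLot(int lotId) = 0;
    virtual void RouteAvatar(float x, float y) = 0;
    virtual void UseObject(int objectId, const std::string& interaction) = 0;
    virtual ~IPresentationLayer() {}
};

// Both callers sit on top of the same layer:
//   Game GUI:    turns mouse clicks into these calls.
//   Test client: replays the same calls from a script, no pixels involved.
void SmokeBuyChair(IPresentationLayer& p) {
    p.EnterLot(42);
    p.RouteAvatar(10.0f, 3.5f);
    p.UseObject(7, "buy");
    p.Chat("bought a chair");
}
```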



                                           37
   Scriptable User Play Sessions
• SimScript
  – Set of Presentation Layer “primitives”
  – Synchronization: wait_until, remote_command
  – State probes: arbitrary game state
• Test Scripts: Specific / ordered inputs
  – Single user play session
  – Multiple user play session
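
A hedged C++ rendering of what a multi-user session looks like in terms of those
primitives (wait_until, remote_command, state probes); this is not real
SimScript syntax, and the Session class is a stand-in harness.

```cpp
// Hypothetical harness mirroring the SimScript primitives described above.
#include <cstdio>
#include <string>

class Session {                          // stand-in for one controlled client
public:
    void Command(const std::string& p)       { std::printf("cmd:  %s\n", p.c_str()); }
    void RemoteCommand(const std::string& p) { std::printf("rcmd: %s\n", p.c_str()); }
    bool WaitUntil(const std::string& probe, int timeoutSeconds) {
        std::printf("wait: %s (<= %ds)\n", probe.c_str(), timeoutSeconds);
        return true;                     // a real harness would poll game state
    }
};

// Multi-user play session: avatar A buys a chair, avatar B must see it appear.
bool ChairVisibilityTest(Session& a, Session& b) {
    a.Command("enter_lot 42");
    b.Command("enter_lot 42");
    a.RemoteCommand("b route_avatar 5 5");   // drive the other client from A's script
    a.Command("buy_object chair");
    // Synchronization + state probe: B's view must converge on A's purchase.
    return b.WaitUntil("lot.contains chair", /*timeoutSeconds=*/30);
}
```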



                                                  38
   Scriptable User Play Sessions

• Scriptable play sessions: big win
  – Load: tunable based on actual play
  – Regression: constantly repeat hundreds
    of play sessions, validating correctness
  – Development: repeatable 'live' input
• Presentation Layer events logged as SimScript
  – Recorder (GUI) / Monitor (Remote)


                                               39
Automated Test: Team Baselines

• Hourly “critical path” baseline
  – Sync / clean / build / test
  – Validate Mainline / Servers
• Snifftest weather report
  – Hourly testing
  – Constant reporting
• Daily feature regression


                                    40
 How Automated Testing Helped

• Current, accurate baseline for coders
• Scale&break found many bugs
• Greatly increased stability
  – Code base was “safe”
  – Server health was known (and better)
• QA: lower regression cost



                                           41
Monitoring Tools




                   42
     Monitoring / Diagnostics
“When you can measure what you are speaking about and can express it in
numbers, you know something about it. But when you cannot measure it, when
you cannot express it in numbers, your knowledge is of a meager and
unsatisfactory kind.” – Lord Kelvin
• DeMarco: You cannot control what you cannot
  measure.
• Maxwell: To measure is to know.
• Pasteur: A science is as mature as its measurement
  tools.


                                                           43
               Dashboard

• System resource & health tool
  – CPU / Memory / Disk / …
• Central point to access
  – Status
  – Test Results
  – Errors
  – Logs
  – Cores
  –…


                                  44
                      Wiki

• Central document repository
  – Designs
  – “How to” guides
• Status tracking




                                45
     Test UI / Monkey Watcher
• Test Central UI
  – Desktop tool for developers & testers
• Monkey Watcher
  – Collects & stores (distributed) test results
  – Produces summarized reports across tests
  – Filters known defects
  – Provides set of baselines
  – Web frontend, unique IDs per test



                                                   46
               Esper

• In-game profiler
• Internal probes & multiple views
 – Process / machine / cluster
 – Time view or summary view
• Automated data management
 – Coders: add a one-line probe (see the sketch below)
 – Esper: data shows up on web site
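
The "one-line probe" pattern, sketched with invented names (ESPER_PROBE and
EsperRecord are hypothetical): the coder drops a single macro at the call site,
and the collection, storage, and graphing happen elsewhere.

```cpp
// Hypothetical probe macro in the spirit of the slide; the real Esper shipped
// samples to a central collector and graphed them on a web site.
#include <cstdio>

#define ESPER_PROBE(name, value) EsperRecord((name), static_cast<double>(value))

void EsperRecord(const char* name, double value) {
    // Stand-in backend: print locally. Esper forwarded this off-process.
    std::printf("probe %s = %f\n", name, value);
}

void SimulateTick() {
    int objectsSimulated = 128;                        // example measurement
    ESPER_PROBE("house.objects_per_tick", objectsSimulated);   // one line at the call site
}
```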


                                      47
 Use Case:
Load Testing




               48
        Outline: Wrapup

• Wins / Losses
• Rules: Analysis & Discussion
• Recommended reading
• Questions




                                 51
        Biggest Wins

Code Isolation

Scaffolding

Tools

Pre-Checkin Regression


                         52
      Biggest Losses

  Massively peer to peer

  Early lack of tools

  #ifdef across platform / function

“Critical Path” dependencies

                                      53
          Rules of Thumb (1)
• KISS: architecture, code / dir structure
• (Mostly) Incremental changes
  – “Baby-Steps”
  – (Some) Executive Mandates
• Continual workflow improvement
  – Automate recurring / “no fail” tasks
  – Speed data acquisition / analysis




                                             54
       Rules of Thumb (2)

• Mainline has got to work
 – “Could <this> break others?”
 – Scaffolding
 – “How do we prove this works?”
• Get something on the ground
 – Early, valuable design feedback
 – Visible progress
 – Early use by team

                                     55
         Rules of Thumb (3)

• Key Logs: break up early
• Fix important, not urgent
• If you can't measure it, you don't
  understand it




                                       56
   Tools Rule: Keep Developer
         Efficiency High
• Team efficiency impacted by
  – Component coupling / team size
  – Individual efficiency
• Pick a Metric, any Metric (e.g.)
  – Compile / load / test / analyze cycle
• Project / measure tool impact


                                            57
        Tools Rule: Does
  Construction Now Accelerate
       Overall Schedule?
      5% gain across 30 programmers
Over one calendar month == ~ one man-month
 Over one calendar year == ~ one man-year

         “FredB”: 31st programmer…




                                            58
     Recommended Reading

• Useful tips / strong similarities
  – XP / agile programming
  – Myers: programming in the large, …
  – Gamma, Helm: Patterns
• Caveat Emptor!
  – Slavish following not encouraged
  – Evaluate against “ground conditions”


                                           59
Questions



            60

								