					Defense: My Story ----- USE AS A GUIDE. DO NOT SCRIPT IN MY HEAD
1. Cover Slide
- Good afternoon, and thank you very much for coming to my talk today.
- My talk will cover my Ph.D. work, which involves interfacing domain-specific
languages with biological Problem Solving Environments
- Want to take this opportunity to thank my advisor Dr. Izaguirre for his outstanding
guidance and support
- Also thank my committee, Dr. Greg Madey, Dr. Mark Alber, and Dr. Pat Flynn

2. Talk Outline
- My talk will consist first of some definitions, including what a biological Problem
Solving Environment is
- Also a description of how interfacing a DSL can contribute to their purpose and address
some of their current voids
- I will evaluate two proposed interface strategies, using case studies in morphogenesis
which involves embryonic cell pattern formation and also molecular dynamics
- Using these case studies, I will develop some recommended software design strategies
for the interfacing of a DSL with a biological PSE

3. Problem Solving Environment
- A PSE is a multi-tiered and modular software framework, with each tier designated by
its abstraction level and flexibility, which vary inversely
- The lowest level tier, the computational layer, contains complex algorithms
implemented in a general purpose language with design patterns for extensibility,
performance, etc.
- The upper level tier, the user layer, is where the user interacts, and includes tools for
data analysis such as visualization and plotting libraries
- A PSE can thus be viewed as a domain-specific virtual machine, with the user
submitting commands through tier 2 which invoke complex functionality from tier 1
behind the scenes, and tier 1 sends results back which are displayed in tier 2.

4. Using a PSE to Develop a Biological Model
- Developing a biological model through a PSE with the design I showed involves two
players, an experimentalist E and a programmer P
- E provides a prototype model for a phenomenon they observe in the lab and gives it to P
to implement in tier 1, and E runs the model through tier 2
- If the simulation and the lab observations match, we are either done or we refine the
model by shrinking the set of input dependencies
- If they do not match, the model is invalid so E changes it, resubmits to P, and the
process repeats.
5. Problem Description
- Developing a model with this strategy requires a lot of communication overhead
between E and P, because they are constrained to interact with the PSE at exclusive tiers
due to their domain of expertise.
- It would be nice to be able to bring the PSE to the lab bench to have E prototype and
test models directly instead of submitting to P

6. New Proposed PSE Design
- Thus I propose a modified PSE design with a middle-level tier consisting of a domain-
specific language, along with a set of language tools and high level APIs which can
invoke functionality from tier 1
- Now a user can interact through the user layer as before which is now tier 3, or take
advantage of greater expressive power provided by the new tier 2, the domain-specific
language, for prototyping new models.
- The domain-specific language can interact with tier 1 using several different strategies,
two of which I have explored. I leave any interaction with tier 3 (i.e. generation of DSL
modules, etc.) for future work.

7. My Contributions
- My contributions are thus to evaluate techniques for interfacing a DSL to a
computational back end, with respect to computational biology and the goals of an
experimentalist and their simulations.
- Use case studies in morphogenesis and molecular dynamics
- Provide useful DSLs for these domains
- Develop a set of guidelines for performing this interface, and for codeveloping a
computational back end and a DSL at the same time

8. Technique 1: Plugin Generation
- The first technique that I will explore is called plugin generation, which involves
splitting back end functionality into core functionality, which is executed independent of
the simulation, and optional functionality, which may or may not be executed depending
on the simulation.
- Core functionality is compiled and linked into a standalone executable application
- Optional functionality is implemented as plugins, which are separately compiled into
shared object machine code and dynamically loaded upon a user reference in a
configuration file or a GUI. For example here a user could supply a configuration file
which says to use plugins A and C, which would result in machine code for these plugins
dynamically loaded for use in a simulation.
- Using this technique, a DSL simply serves as a generator of one of these plugins,
effectively extending the computational back end with new features.
- I will be using this technique for an XML-based DSL BioLogo which I will interface to
the morphogenesis simulation engine CompuCell3D. CompuCell3D actually consists of
four tiers, the upper level tier is still the user layer and consists of a CompuCellPlayer
composed of Qt and VTK which outputs screenshots of cells and chemicals, and the
lower level tier is still the computational back end which contains several core modules
and makes use of the plugin design pattern. Tier 2 provides a scripting interface to back
end functionality through Python using wrappers, and finally Tier 3 will be where
BioLogo will fit.
- A BioLogo program first passes through a compiler front end which parses and
performs error and type checking, generating a simpler intermediate program which is
more imperative than structural, corresponding more closely to operational semantics.
- The compiler back end then translates intermediate syntax to a plugin for tier 1.
- I have also given the user the option of embedding Python into the XML, and this can
generate a Python plugin which resides in Tier 2.

10. Morphogenesis Models
- The central morphogenesis model run by CompuCell3D is the Cellular Potts Model,
which is lattice based and associates with every lattice location an integer spin which
uniquely defines the cell at that spot. Here we have three cells, and any pixel with a spin
of zero corresponds to the extracellular medium (no cell is there)
- Chemical fields can be coupled to the CPM lattice and evolved using Turing systems,
which represents the temporal change in chemical concentration at a lattice location as
the sum of a reaction term, which if positive models secretion and if negative models
resorption, and also a diffusion term which is the product of a diffusion constant (controls
the rate) and the Laplacian of the concentration.
- We also can categorize cells as specific types designated by behavior, and model
transitions between cell types as type automata with rules for changing type. So for
example I could have two cells of type ‘orange’ and one cell of type ‘blue’, each with
unique properties
- At any point, the state of the CPM lattice is modeled by an effective energy E, which
drives the lattice towards experimentally observed behaviors. Each lattice change
involves a proposed index flip between neighboring locations; if the change results in a
negative ΔE (a decrease in energy) we accept it, otherwise we use Monte Carlo
probability. So if we had a term to
model cellular adhesion and we knew ‘orange’ and ‘blue’ cells tended to adhere while
two orange cells tended to repel, we would make this term more negative for index flips
which bring orange and blue cells closer. We can use this same idea to model volume
constraints, surface area constraints, and following of these external chemical gradients.
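The acceptance rule just described can be sketched as follows. This is an illustrative Metropolis-style sketch with an assumed lattice "temperature" parameter T, not CompuCell3D's actual implementation:

```python
import math
import random

def accept_flip(delta_E, T=10.0):
    """Metropolis acceptance for a proposed index flip on the CPM lattice.

    A flip that lowers the effective energy E is always accepted; otherwise
    it is accepted with Boltzmann probability exp(-delta_E / T).
    """
    if delta_E < 0:
        return True
    return random.random() < math.exp(-delta_E / T)
```

Making an energy term (adhesion, volume, chemical gradient) more negative for desirable flips thus makes those flips more likely to be accepted.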
11. Example: Chicken Limb Cell Haptotaxis
- As an example, haptotaxis is a process which occurs in chicken limb formation, which I
model by superimposing two chemical fields, fibronectin and an activator.
- The algorithm dictates that cells exposed to a high enough activator secrete fibronectin,
then cells of a specific type (Condensing) stick to fibronectin. I can model this by
including this energy term for location associated with Condensing cells, and if I have a
mu that is negative then cells will desire to be at locations of a high concentration C of
fibronectin (much more negative term)

12. Haptotaxis
- I used BioLogo to model haptotaxis by specifying a new plugin name that I will
generate, LimbChemical as a Hamiltonian or energy term. Once generated, a user can
dynamically load this plugin through a CompuCell3D configuration file reference, and
instantiate it with particular input values (like mu, or a threshold for fibronectin
secretion). I designate these as Inputs in my Hamiltonian, and can use these later in the
model implementation.
- I superimpose two chemical fields for fibronectin and activator, and populate the
activator from a file.
- The equation itself is implemented as a pixelsum over all lattice locations, the quantity
Mu*fibronectin, and only including the term if the cell at the lattice location is a
Condensing cell
- Also at every simulation step, for each lattice location I secrete into fibronectin at some
rate (a user input) if the corresponding activator concentration is greater than or equal to a
user-specified threshold (secretion is implicitly done only for cells)

13. Avian Limb Cell Differentiation
- I model cell differentiation in the avian (another word for ‘bird’) limb by defining two
cell types, noncondensing and condensing, where condensing are more adhesive.
- The type automaton is relatively simple, with noncondensing cells becoming
condensing upon exposure to an activator concentration above the same threshold I had
in the Hamiltonian, and vice versa.
- I model differentiation as a cell model, with two cell types. I become type
noncondensing if my current type is condensing and I am exposed to an activator
concentration below the threshold, and note I 'borrow' functionality from the
LimbChemical plugin using the useplugin tag. Thus each is coupled to the other.
(If they ask, you can only do this for BioLogo-generated plugins, due to information
hiding.)

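A minimal sketch of this two-type automaton (the type names match the model, but the function name and threshold value are illustrative, not BioLogo's generated code):

```python
THRESHOLD = 0.5  # illustrative activator threshold, same role as in the Hamiltonian

def next_type(cell_type, activator):
    """Type automaton: NonCondensing <-> Condensing, driven by activator exposure."""
    if cell_type == "NonCondensing" and activator >= THRESHOLD:
        return "Condensing"
    if cell_type == "Condensing" and activator < THRESHOLD:
        return "NonCondensing"
    return cell_type  # no rule fires: type is unchanged
```
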
14. Gamba-Serini (spinoff) PDEs
- I used this PDE model which builds upon the Gamba-Serini equations to establish an
external activator chemical concentration in a simulation of human capillary formation.
- I superimpose a field C which is evolved by this Turing approach, inputs and
superimposed fields defined as before.
- Each differential equation is modeled as a set of Terms, each of which can be excluded
or included if an extra Boolean condition attribute is supplied (here I left it out).
Specified inputs and fields can be used in expressions. The generated plugin temporally
updates the field using the finite difference method, and parameters such as the timestep
and spacestep are implicitly defined inputs which the user can instantiate.
- Any superimposed fields can be used as arguments to a different plugin to model other
types of reactions
- I could set normalize to true to set all values between zero and one.
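The finite difference update a generated PDE plugin performs can be sketched in one dimension. This is an illustrative forward-Euler sketch with an optional reaction term; the real plugin operates on the full lattice in generated C++:

```python
import numpy as np

def diffuse_step(C, D, dt, dx, reaction=None):
    """One forward-Euler step of dC/dt = R(C) + D * Laplacian(C) in 1D.

    The Laplacian is approximated by the central difference
    (C[i-1] - 2*C[i] + C[i+1]) / dx**2, with zero-flux boundaries.
    A positive reaction term R models secretion, a negative one resorption.
    """
    lap = np.zeros_like(C)
    lap[1:-1] = (C[:-2] - 2 * C[1:-1] + C[2:]) / dx**2
    R = reaction(C) if reaction is not None else 0.0
    return C + dt * (R + D * lap)
```

The timestep dt and spacestep dx are exactly the implicitly defined inputs mentioned above that the user can instantiate.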

15. Embedded Python support
- The mathematical constructs for PDEs in pure XML are limited, containing helper
functions like Kronecker and Laplacian but not allowing auxiliary helper functions,
which some PDE systems require (if Alber asks, an example is …)
- So for PDE solvers, I provide the option of more expressive power through embedded
Python, which the user can substitute for DiffEq blocks. Inputs and fields become
defined and usable within the Python tags, as do quantities for the Finite Difference
Method (dt, dx, etc.). This also enables the user to take advantage of useful external PDE
solving libraries such as FiPy.
- A Python class is generated for Tier 2 of CompuCell3D, with some glue code for pre-
and post-processing (for example, passing the fields to the computational back end).
Inputs and fields become Python bindings and the embedded Python is simply translated
into the body of the generated class.

16. Contributions
- The declarative semantics allow a biologist to think in terms of what is being modeled
and not how instructions are executed on the machine
- I measured the number of XML tags versus generated C++ statements and show
about a 90% savings using BioLogo, so abstraction and writability are strong for a small
set of necessary concepts, which is by definition something a DSL should provide. This
also illustrates the amount of glue code that can be required for extension of a
computational back end, particularly if design patterns are used.
- Structure provided by XML helped in organization of individual models
- There was little performance penalty for extensions generated by BioLogo versus those
written by hand, since the 'standard' methods for extending CompuCell3D were used and
performance strategies taken by Tier 1 could be capitalized upon.
17. Observations: Design
- Language is extensible, and this was helped by XML itself and also Xerces-C++ which
was highly conducive to applying recursive descent with the grammar.
- Information hiding enforced multiple function invocations in generated code for low-
level data, making this more difficult.
- Measures had to be taken to resolve naming conflicts, which could occur at multiple
levels in the generated code (between plugin names, and also inherited methods and data).
- Reliability was increased through multiple levels of error handling – I could produce
understandable syntax errors through the compiler front end, check for tag balancing and
attributes with Xerces and also invoke calls to runtime exception handlers in the
generated plugins.
- A final and very important observation is that codependent evolution of the PSE and
DSL would require a high degree of developer communication between the two tiers.
For example if a naming scheme or method prototype is changed in Tier 1 but there is no
communication with the BioLogo developers, a buggy plugin could easily be generated
without warning. Best case this error would be caught by the C++ compiler, worst case
unpredictable runtime behavior. Neither are desirable.

18. Observations: 10 Users (1-4)
- I presented BioLogo at an August 2007 workshop on CompuCell3D
- At the end, we asked ten randomly sampled experimentalists to rate each tier of
CompuCell3D, including BioLogo, in terms of ease of interacting with the software
- Not surprisingly, Tier 4 was rated the easiest almost unanimously, and Tier 1 (C++) the
most difficult.
- Good news: BioLogo decisively beat out Tier 1 in terms of user preference
- Bad news: BioLogo was edged by Tier 2, the Python layer

19. Observations: User
- In research we try to learn from both the good and the bad
- Supplemental comments indicated that the main praises for BioLogo were its level of
abstraction and the lack of general imperative language constructs, but the main
complaints were the lack of control flow and also the high translational overhead
associated with discrete error-checking, translation, and generated-plugin compilation stages
- Experimentalist preference is of course subjective, but it is clear that some had a positive
reaction to directly interacting with the computational back end at runtime, and saving
this translational overhead at every model change. And this motivates the second
technique, which is….
20. Technique 2: Selective Precompilation
- Involves dividing the computational back end into a set of functionality that should be
prototypable in the DSL and a set which should be precompiled into machine code and
invokable from the DSL.
- Users can submit commands through a domain-specific API, and these can either be
executed directly and/or invoke this precompiled functionality to run a complete simulation
- There are several tools out there that facilitate this process, such as SWIG which can
interface various scripting languages to C, or Jython/Java.

21. MDLab (MDL) and ProtoMol
- In this case, I create a Python-based domain-specific language, MDLab, for molecular
dynamics simulations, and interface the language to the problem solving environment
ProtoMol, written in C++. Either can be developed using an IDE such as Eclipse if the
appropriate plugins are downloaded.
- I select a subset of the functionality within the ProtoMol computational back end for
precompilation, and use SWIG to create the precompiled machine code and a set of
Python wrappers which can invoke this functionality.
- MDLab commands can be submitted directly into a Python interpreter or can invoke
these wrapper routines, also run through the interpreter.
- Results can then be displayed through external tools

22. MD Models: One Step
- Here I show one step of a molecular dynamics simulation
- Initial input includes a set of atomic positions and momenta (momentum is mass times
velocity) – these define phase space – and also the structure of the system, which
defines what is bonded to what.
- The first step is to compute the potential U, which consists of multiple terms – one
example is the electrostatic potential which is equal to kq_1q_2/r and requires a sum over
all atom pairs. The force on each atom can then be computed as the gradient of this
potential with respect to its position.
- The next step is to propagate the system by solving Newton’s equations of motion.
Here I show a time-discretized Taylor series approximation to Newton’s equations called
the Leapfrog method, which is truncated at the delta-t squared term. I supply a delta-t
value at which to propagate, and this produces a new set of atomic positions and
velocities and thus a new phase space.
- The current limitations to molecular dynamics deal with:
   - The small timestep required for stability, since many fluctuations in a molecule occur
at a high frequency (i.e. bond fluctuations happen on the order of femtoseconds), to
accurately model them requires a small timestep.
   - The quadratic algorithm for calculating forces: as we saw, computing the electrostatic
potential requires a sum over all atom pairs – for large molecules that is slow
   - These two contributing factors make it very difficult to run a system of reasonable
size for a biologically relevant period of time in a reasonable amount of wall clock time,
even on the massively parallelized supercomputers that we have today!
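The step just described can be sketched as follows. This is an illustrative toy in reduced units (Coulomb constant k = 1), using a velocity form of the leapfrog scheme; it is not ProtoMol's actual code:

```python
import numpy as np

K = 1.0  # Coulomb constant, set to 1 in these illustrative units

def coulomb_forces(pos, q):
    """O(N^2) pairwise electrostatics: U = k*q_i*q_j / r, F = -grad(U)."""
    n = len(q)
    F = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):
            r = pos[j] - pos[i]
            d = np.linalg.norm(r)
            f = K * q[i] * q[j] / d**3 * r  # force on j from i; repulsive for like charges
            F[j] += f
            F[i] -= f
    return F

def leapfrog_step(pos, vel, q, mass, dt):
    """One step of Newton's equations: half-kick, drift, half-kick."""
    F = coulomb_forces(pos, q)
    vel = vel + 0.5 * dt * F / mass[:, None]   # half-kick
    pos = pos + dt * vel                       # drift
    F = coulomb_forces(pos, q)
    vel = vel + 0.5 * dt * F / mass[:, None]   # half-kick
    return pos, vel
```

The double loop in coulomb_forces is exactly the quadratic cost mentioned above, and dt must stay small to resolve the fastest fluctuations.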
23. Basic MDL Simulation Protocol
- I start out by defining a Physical system, and also an object to handle I/O.
- I then provide a set of atomic coordinates through a Protein Data Bank file, and also the
structure through a Protein Structure File and extra parameters needed for potential
calculation through a parameter file. I set my spatial boundary conditions to periodic
which implements a wraparound on the edges, and my Kelvin temperature to 300.
- I construct a force field object, and specify which types of forces I want to compute.
Some forces occur between covalently bonded atoms – these are bonded forces, for
example, the equilibrium length of a bond. Others are nonbonded forces, such as van der
Waals and electrostatic or Coulombic interactions.
- I then propagate my system for a certain number of steps, and specify my value for dt in
femtoseconds and the method I use, in this case it’s the Leapfrog method from the
previous slide. This thus propagates my system for a total time of 1 ps

24. Constructing Propagators
- I illustrated on my previous slide how to propagate the system using the Leapfrog
method, which has been predefined in the system, using a uniquely identifying Python
string.
- Here, I show how to add a new Propagator as a Python class, taking advantage of:
   - Domain-specific constructs provided through the API
   - Fast matrix-vector operations provided through Numerical Python, which I import
   - Transparent registration of the new propagator using MDLab factories, which can
     map Python strings to methods for object instantiation. For this I provide the
uniquely identifying string and a set of parameters which it can accept, in this case I can
provide a ‘target temperature’ T0 and give it a default value of 300, so that in my call to
propagate() I can provide values for these parameters if they should differ from the defaults.
- In this case I show just the last few steps of a propagator which models constant-
temperature dynamics – runs the same Leapfrog method I showed before (I only show the
last update of velocities, called a half-kick) – using an artificial scaling of the velocities
by a square root of my target temperature and the current temperature of the system,
effectively keeping the average kinetic energy of my system constant.
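The velocity-scaling idea can be sketched as follows. This is an illustrative sketch in reduced units with kB = 1, not MDLab's actual propagator code:

```python
import numpy as np

def rescale_velocities(vel, mass, T0):
    """Scale all velocities by sqrt(T0 / T_current), pinning the kinetic energy.

    Temperature is taken from the kinetic energy via T = 2*KE / (3*N)
    (reduced units with kB = 1, N atoms, 3 degrees of freedom each).
    """
    ke = 0.5 * np.sum(mass[:, None] * vel**2)
    T = 2.0 * ke / (3.0 * len(mass))
    return vel * np.sqrt(T0 / T)
```

Applying this after the final half-kick of each leapfrog step keeps the average kinetic energy, and thus the temperature, at the target T0.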

25. Constructing Force Calculators
- Here I illustrate how to calculate a new type of force. So for each force I'll have a
function for the corresponding potential, and the force itself, which is the gradient of the
potential.
- The type of force I am constructing here is called a harmonic dihedral force
  - A dihedral is a collection of four covalently bonded atoms, and the dihedral angle is
between the two planes
  - This potential will constrain a particular dihedral to a value phi-naught in radians
  - Note that the energy contribution is lower as the value of the dihedral phi_i
approaches phi-naught. The strength of the constraint can be controlled by an extra
parameter k.
  - So I compute this energy as k times this difference squared, and store it as a
contribution to the system dihedral energy, which also consists of potential terms which
keep dihedral angles at equilibrium values. The force is then the gradient of the potential
with respect to phi, I show parts of the calculation here – the force on each of the four
atoms must be computed.
- I can then construct an instance of the HDForce class by passing appropriate parameter
values to a constructor, and invoking the addPythonForce() member function of the force
field object
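The constraint energy and its derivative with respect to phi can be sketched as follows (the full atomic force calculation via the chain rule is omitted):

```python
def harmonic_dihedral_energy(phi, phi0, k):
    """U(phi) = k * (phi - phi0)**2, minimized when phi equals phi0 (radians)."""
    return k * (phi - phi0) ** 2

def harmonic_dihedral_dU_dphi(phi, phi0, k):
    """dU/dphi = 2 * k * (phi - phi0); forces on the four atoms follow by chain rule."""
    return 2.0 * k * (phi - phi0)
```
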

26. Testing Methods
- One of the benefits of using Python was that there were a great deal of external libraries
available for data analysis such as Matplotlib, which could interactively plot data, zoom,
save plots, etc.
- I could also incorporate parallelism using PyMPI, which is a Python interpreter which
distributes loads across processors by invoking MPICH libraries.

27. Contributions
- As I showed here and also in my thesis, MDL can prototype many different types of
numerical methods, thanks in particular to the mathematical expressive power provided
by Python and Numerical Python
- I provided the user with the flexibility of prototyping methods as Python objects or
Python functions (I only showed objects here for the sake of time)
- The simulation protocol provides a suitable testing environment, and the language will
be used in Computational Biophysics this semester at Notre Dame
- As I will show, selective precompilation does save performance versus prototyping in
pure Python; the question will be how much we can save.

28. Observations: Design (1)
- First off the control granularity is dependent upon the back end modularity and
segregation of functionality. Large macro functions result in either large amounts of
precompiled and resulting unprototypable functionality for the user, or large amounts of
functionality requiring pure DSL implementation
- Inheritance from precompiled interfaces must be selectively and carefully done, to
avoid data member shadowing in the scripts
- Central structures required some type conversions (for example, from Numpy vectors
and matrices to C double-precision arrays) – I had to write a routine which performed this
conversion and handled garbage collection so that precompiled libraries and Python
would operate on the same data
- I also had multiple levels of exception handling here, as I could invoke Python
exception handlers from the domain-specific API including pre and post condition checks
for all methods, and also had runtime exception handling invocations in tier 1
- Codependent evolution is facilitated better in this case, because new modules in tier 1
could become instantly compatible with MDL assuming compatibility with SWIG. Also
tier 1 implementation changes would mostly be detected as errors, which is less
dangerous than generating buggy extensions. To illustrate this…
29. Observations: Design (2)
- Our current setup involves source code for both ProtoMol and MDL residing in remote
repositories, with concurrent access controlled by Subversion. Changes to either PSE
layer can be committed.
- ProtoMol can be compiled into a standalone executable using Automake and friends to
generate makefiles, or compiled with a Makefile.swig which generates the wrapper
Python files and the shared object machine code, storing them in the lib/ directory of
MDL. These are in turn invoked by the domain-specific API which I provide.

30. Observations: User
- Parallelism implemented in precompiled binaries is not invokable using PyMPI,
because the interpreter itself is supplied as an input to mpirun. Thus we could not take
advantage of atom and spatial decomposition implementations of pairwise force
calculations implemented in the computational back end.
- It’s a bit inconvenient having to prototype potentials and forces when one is just the
gradient of the other – haven’t found a good solution
   - There are some libraries which have independent variables interpreted as symbols,
     cause problems when calculations involve cross and dot products
   - Others make approximations, but error propagation in molecular dynamics renders
     that inapplicable
- We’d also like to piggyback on Matlab, because that is probably the most popular
prototyping tool in the molecular dynamics community. The trouble is it's very difficult
because the interface between Matlab and other languages is not well automated (you
have to write a gateway function yourself, rather than use a tool like SWIG). Matlab is so
widely used, however, that it is something which should be explored
- Performance versus the computational layer will depend on the fraction of a typical
execution selected for precompilation and that selected for prototyping, as I now show…

31. Performance
- Here I’m showing two propagators, BBK and Impulse/Leapfrog running a 1 ps
simulation on 1101-atom BPTI, using standard nonbonded and bonded forces. From
running ProtoMol standalone, there is not much of a penalty involved with invoking
wrappers, running them as pure Python functions, or pure Python objects.
- It is widely known however that force calculations occupy about 98% of a dynamics
simulation. Here I'm showing two types of forces, the harmonic dihedral which I
prototyped earlier and the quadratic electrostatic algorithm. Once again, there is very
little penalty if these forces are wrapped, however if they are prototyped – we can
potentially deal with huge penalties for expensive algorithms. Thus computationally
intensive tasks really should be translated to machine code at some point, a selective JIT
compiler like we see with Java could come in handy here.
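The gap between prototyped and precompiled force code can be illustrated with a toy comparison, with numpy's compiled loops standing in for wrapped C++ (the speedup figure is machine-dependent):

```python
import time
import numpy as np

def pairwise_energy_python(pos, q):
    """Pure-Python O(N^2) Coulomb energy: stands in for a prototyped force."""
    e = 0.0
    n = len(q)
    for i in range(n):
        for j in range(i + 1, n):
            e += q[i] * q[j] / np.linalg.norm(pos[i] - pos[j])
    return e

def pairwise_energy_numpy(pos, q):
    """Vectorized equivalent: compiled inner loops stand in for wrapped C++."""
    diff = pos[:, None, :] - pos[None, :, :]
    d = np.linalg.norm(diff, axis=-1)
    iu = np.triu_indices(len(q), k=1)   # unique i < j pairs
    return np.sum(np.outer(q, q)[iu] / d[iu])

rng = np.random.default_rng(0)
pos = rng.standard_normal((100, 3))
q = rng.standard_normal(100)

t0 = time.perf_counter()
e_slow = pairwise_energy_python(pos, q)
t1 = time.perf_counter()
e_fast = pairwise_energy_numpy(pos, q)
t2 = time.perf_counter()
print("speedup:", (t1 - t0) / (t2 - t1))  # machine-dependent; both energies agree
```
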
32. Plugin Generation: When Useful?
- This is useful if, depending on the biological system scale, algorithmic complexity, and
the amount of simulation time required to verify a method, performance is a primary
consideration.
- Building on the third point, if translational overhead will be negligible compared to
your typical simulation time – for example you won’t care too much about a very
expensive twenty-minute translation if you’re running verification simulations which take
multiple days.
- Although software maintenance always must be done, as a whole your computational
back end remains mostly static – because the hit is high in terms of reliability and also in
terms of language tool changes.

33. Selective Precompilation: When Useful?
- On the flip side, this is more useful if shorter simulations are necessary for verification,
or you are performing large scale parameter sweeps, lots of stops to compare to
experiments, lots of refinements.
- In these cases, up-front translational overhead will not be negligible
- A static back end is preferable, but the hit is smaller

34. Observations: Back End Codevelopment
- It would be nice if a DSL and a computational back end could be developed together,
this will help to ensure compatibility through the process.
- But we’ve just established that computational back ends which are highly dynamic
could result in huge costs in terms of DSL reliability and tools!
- Although these seem to contradict one another, we can observe that design costs are
always less expensive than implementation costs.
- So I recommend: design together, but implement sequentially (back end first then DSL)

35. Codevelopment: Modified Waterfall
- As I show here, if you take a ‘waterfall’ approach to software development where
requirements, design, implementation, testing, and maintenance are discrete stages, this is
facilitated well.
- You can establish requirements of the DSL and back end together, and then develop a
software design for both, ensuring that each are compatible with each other.
- Then implement and test the computational back end (this also can produce a
‘prerelease’ of your software – a side benefit!), then once you are sure that works,
implement and test the DSL. This will also localize bugs in the development process to
one PSE layer instead of two.
- Maintenance can then be done together.
36. Codevelopment: Modified Incremental
- In reality these days, a more incremental approach to software development is typically
taken, where each change to a framework is designed as an increment on the current
design, with its own design and implementation process.
- We can apply the same rule here: design together, implement sequentially. Thus any
change localizations and impact analysis can be done on the back end and DSL at the
same time, but anything that involves implementation – prefactoring, postfactoring, etc.
should be done sequentially, and DSL change implementations should be done once
those in the computational back end are tested and debugged, for the same reasons that I
mentioned before.
- Maintenance is essentially composed of ‘increments’, another advantage – more of a
guideline as changes are demanded by the community.

37. Publications
- I have published both languages: MDLab at an IEEE Simulation Symposium in March
of last year, and BioLogo in the IEEE Computing in Science and Engineering journal.
- I also was involved in publications on CompuCell3D and ProtoMol

38. Acknowledgements
- I’d like to acknowledge once again my advisor, Professor Jesus Izaguirre, and my
committee
- Also the Laboratory for Computational Life Sciences at the University of Notre Dame
and the Biocomplexity Institute at Indiana University, and also Dr. Roeland Merks, a
post-doc at Utrecht University in the Netherlands, whose capillary simulation results I used
- The ProtoMol and CompuCell3D teams, which helped in the integration of my languages
- Friends who have supported me and also my band, Nice Save!

39. Questions
- With that I’ll take questions.