							    Automated Model Discovery: A New Mode of Scientific Inquiry for the Coming Century

                                          Executive Summary

A crucial component of scientific inquiry is determining the laws that govern the behavior of the
phenomenon under study. For example, an ecologist might seek to determine the equations that govern
how the amount of CO2 that is released by a lake fluctuates according to variables such as the
concentration of aquatic nutrients, an important question in addressing climate change. A social scientist
might wish to understand how attitudes toward women in rural economies vary with factors such as the
rate of intermarriage between villages, a question that may impact how we attack world poverty.

For centuries, this process, that of discovering the model underlying the data, has been accomplished
purely by human observation—of an apple falling from a tree, for example. Enabled by technological
advances over the last few decades, however, a fundamental change in the practice of the physical,
biological, and social sciences is emerging. Computers can now be used not only to test a model
formulated by a scientist, but also to discover models on their own, by leveraging massive
computational power to sift through prodigious amounts of data. In other words, in a process known as
automated model discovery, computers are able to try millions of different equations, searching for the
ones that best fit the data.

In this proposed initiative, we will create an interdisciplinary research team (involving both faculty and
students) at BU to research and develop a generalized computational framework for automated model
discovery and refinement. This adaptive infrastructure will serve as an investment in BU’s scholarly
excellence, by enhancing and multiplying the capabilities of BU’s scholars, and putting them at the
forefront of changes in the practice of science. The types of problems that this will help address are the
ones that are of crucial concern for the coming century, and will thus reach out and engage the public,
raising awareness of the important role of research universities such as BU.

The framework will leverage large computational resources available at BU. We will work with our BU
collaborators in the natural and social sciences to identify important problems that are amenable to this
technique, in the fields of ecology, marketing and social sciences, community research and behavioral
economics. Chiu has received an interdisciplinary NSF "Cyber-Enabled Discovery and Innovation" award
to develop advanced simulation cyberinfrastructure to answer scientific questions about complex lake
ecosystems. Funds from this Academic Program and Faculty Development proposal would be used to
leverage the work in the NSF award to also support automated model discovery. The PIs have a good
track record in external funding, with a total of ten federal grants from the NSF and DOE.

1       Purpose
A crucial component of scientific inquiry is determining the laws that govern the behavior of the
phenomenon under study. For example, an ecologist might seek to determine the equations that govern
how the amount of CO2 that is released by a lake fluctuates according to variables such as the
concentration of aquatic nutrients, an important question in addressing climate change. A social scientist
might wish to understand how attitudes toward women in rural economies vary with factors such as the
rate of intermarriage between villages, a question that may impact how we attack world poverty.
Together, the laws that govern the variables are known as the model.

For centuries, this process, that of discovering the model underlying the data, has been accomplished
purely by human observation—of an apple falling from a tree, for example. Enabled by technological
advances over the last few decades, however, a fundamental change in the practice of the physical,
biological, and social sciences is emerging. Computers have the potential not only to test models
formulated by scientists, but also to discover models on their own, by leveraging massive computational
power to sift through prodigious amounts of data. In other words, in a process known as automated
model discovery (AMD), computers are able to try millions of different equations, searching for the
ones that best fit the data.

In this proposed initiative, we will create an interdisciplinary research team (involving both faculty and
students) at BU to research and develop a generalized computational framework for automated model
discovery and refinement. The approach uses symbolic regression to search for equations that best fit the
data. This adaptive infrastructure will serve as an investment in BU’s scholarly excellence, by enhancing
and multiplying the capabilities of BU’s scholars, and putting them at the forefront of transformational
changes in the practice of science. The system will facilitate BU scientists in addressing the types of
problems that are of crucial concern for the coming century, and will thus reach out and engage the
public. We will work with our BU collaborators in the natural and social sciences to identify important
problems that are amenable to this technique, in the fields of ecology, marketing and social sciences,
community research and behavioral economics.

In the short term, the proposed work will benefit our direct collaborators at BU and elsewhere in their
respective scientific and engineering domains, improve their research productivity, and result in
publications by the PIs in complex systems and computer science. In the long term, the proposed
work can have significant broader impacts, leading to external funding and resulting in the
recognition of this project as a hallmark for BU.

2       Intellectual Contribution and Significance
In a typical scenario, a researcher is studying a particular physical, biological, or social system. The
system is observed by noting the values of a group of variables over time, i.e., a time series. The
researcher wishes to determine how the system “works”. What equations govern its behavior? What
quantities might be conserved? How does variable X vary with respect to variable Y? To answer these
questions, the researcher would traditionally use plots and graphs of various types. Using accrued
expertise and experience, the researcher would then hypothesize different possible relationships between
the variables, and then confirm or falsify these hypotheses through statistical regression techniques and
other tests. If the available data is insufficient, the researcher may seek to obtain additional data.

A researcher using AMD would also use these traditional methods, but in addition would run an
algorithm-driven search for mathematical relationships in the data. This automated search would test
millions of different equations, discarding those that fit the data poorly, and refining those equations that
were promising. After some stopping condition is reached, the resulting equations would be presented to
the researcher. This automation does not replace the researcher, of course, but rather amplifies the
researcher’s productivity and ability to analyze massive amounts of data. The actual scientific meaning of
the results of AMD will always require human interpretation.

The proposed AMD system is an example of what the NSF terms cyberinfrastructure. Cyberinfrastructure
is the application of leading-edge computer science technology to transform the practice of science and
engineering, and has been recognized by the NSF as a top priority for meeting the challenges of the
coming century [4].

2.1     Approach
Building on previous work in [1][2][6], our approach is based on symbolic regression to search in a space
of possible models. Traditional regression is a process by which the coefficients of a particular
mathematical model are estimated given a set of input data. The limitation is that the form of the model,
that is, its equations, must be specified in advance. Symbolic regression removes this requirement: it
searches for the form and the parameters simultaneously, seeking to produce an explicit model that
describes the behavior of a system. The models produced by symbolic regression are therefore
meaningful in both their form and their parameters.
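To make the limitation of traditional regression concrete, the following minimal sketch fits the fixed form y = a*x + b by ordinary least squares. The function names and the sample data are illustrative assumptions and are not part of the proposed system.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the fixed form y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def mean_squared_error(xs, ys, a, b):
    """Average squared residual of the fitted line."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
linear_ys = [2.0 * x + 1.0 for x in xs]  # data that really is linear
quadratic_ys = [x * x for x in xs]       # data the fixed form cannot express

a, b = fit_line(xs, linear_ys)
err_linear = mean_squared_error(xs, linear_ys, a, b)

a2, b2 = fit_line(xs, quadratic_ys)
err_quadratic = mean_squared_error(xs, quadratic_ys, a2, b2)
```

The coefficients are recovered when the assumed form matches the data, but no choice of a and b removes the error on the quadratic data. Symbolic regression is meant to close this gap by searching over the form as well.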

The search proceeds through the use of genetic programming (GP), which is a biologically inspired
method for searching a space for optima. The technique starts with an initial random population of
candidate models. Members are then evaluated against a fitness metric, and the best members are
selected for creating the next generation. The process iterates until a stopping criterion is met. A
typical stopping criterion might be that fitness improvements have leveled off, and thus further searching
is unlikely to lead to a better population of solutions.
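The loop described above can be sketched in a few dozen lines. This is a toy illustration under stated assumptions, not the proposed system: expressions are nested tuples over a single variable x, the operator set, population size, depth cap, and fixed generation count are arbitrary choices, and all function names are hypothetical.

```python
import random

OPERATORS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}
MAX_DEPTH = 6  # assumed cap that keeps candidate expressions small


def random_tree(rng, depth):
    """Random expression over x, constants, and OPERATORS."""
    if depth == 0 or rng.random() < 0.3:
        return ("x",) if rng.random() < 0.5 else ("const", rng.uniform(-2.0, 2.0))
    op = rng.choice(sorted(OPERATORS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def evaluate(tree, x):
    if tree[0] == "x":
        return x
    if tree[0] == "const":
        return tree[1]
    return OPERATORS[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))


def depth(tree):
    return 0 if len(tree) < 3 else 1 + max(depth(tree[1]), depth(tree[2]))


def error(tree, data):
    """Mean squared error against the data; lower is fitter."""
    total = 0.0
    for x, y in data:
        d = evaluate(tree, x) - y
        total += d * d
    return total / len(data)


def mutate(rng, tree):
    """Replace one randomly chosen subtree with a fresh random subtree."""
    if len(tree) < 3 or rng.random() < 0.3:
        return random_tree(rng, 2)
    if rng.random() < 0.5:
        return (tree[0], mutate(rng, tree[1]), tree[2])
    return (tree[0], tree[1], mutate(rng, tree[2]))


def random_subtree(rng, tree):
    while len(tree) == 3 and rng.random() < 0.6:
        tree = tree[rng.choice((1, 2))]
    return tree


def crossover(rng, a, b):
    """Graft a random subtree of b into a random position in a."""
    if len(a) < 3 or rng.random() < 0.3:
        return random_subtree(rng, b)
    if rng.random() < 0.5:
        return (a[0], crossover(rng, a[1], b), a[2])
    return (a[0], a[1], crossover(rng, a[2], b))


def evolve(data, pop_size=60, generations=40, seed=0):
    rng = random.Random(seed)
    population = [random_tree(rng, 3) for _ in range(pop_size)]
    history = []  # best error seen at the start of each generation
    for _ in range(generations):
        population.sort(key=lambda t: error(t, data))
        history.append(error(population[0], data))
        survivors = population[: pop_size // 4]  # selection keeps the best quarter
        children = []
        while len(survivors) + len(children) < pop_size:
            child = mutate(rng, crossover(rng, rng.choice(survivors), rng.choice(survivors)))
            if depth(child) > MAX_DEPTH:  # crude stand-in for trimming
                child = random_tree(rng, 3)
            children.append(child)
        population = survivors + children
    population.sort(key=lambda t: error(t, data))
    history.append(error(population[0], data))
    return population[0], history


# Synthetic observations of y = x^2 + 1, used here only as test data.
data = [(i / 2.0, (i / 2.0) ** 2 + 1.0) for i in range(-6, 7)]
best, history = evolve(data)
```

Because the survivors are carried over unchanged, the best error never increases from one generation to the next. A real system would replace the fixed generation count with a stopping criterion such as the leveling-off test described above.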

The fitness evaluation indicates how well a member of a population performs and thus determines whether
it survives into the next generation. Fitness can be determined in many ways, ranging from variations of
squared error to measures of how complicated or simple a member is. By supplying different fitness
evaluations, we can obtain different results from the same input set.
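As a small illustration of how the choice of fitness evaluation changes the outcome, the sketch below scores two hand-written candidates with squared error plus a penalty on expression size. The linear penalty, the weights, and the node counts are assumptions made for this example only.

```python
def mean_squared_error(model, data):
    total = 0.0
    for x, y in data:
        d = model(x) - y
        total += d * d
    return total / len(data)


def penalized_fitness(model, size, data, complexity_weight):
    """Lower is better: squared error plus an assumed linear penalty on size."""
    return mean_squared_error(model, data) + complexity_weight * size


def best_candidate(candidates, data, complexity_weight):
    winner = min(
        candidates,
        key=lambda c: penalized_fitness(c[1], c[2], data, complexity_weight),
    )
    return winner[0]


# Each candidate: (description, callable, number of nodes in its expression tree)
candidates = [
    ("2*x", lambda x: 2.0 * x, 3),
    ("2*x + 0.1*x*x*x", lambda x: 2.0 * x + 0.1 * x * x * x, 11),
]

# Synthetic data generated from the more complex expression.
data = [(x, 2.0 * x + 0.1 * x ** 3) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]

accuracy_first = best_candidate(candidates, data, complexity_weight=0.0)
simplicity_first = best_candidate(candidates, data, complexity_weight=0.1)
```

With no penalty the exact but larger expression wins. With a penalty of 0.1 per node the simpler approximation wins, even though the input data are identical.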

The above search framework can be used in two different ways to discover models. In the first mode, the
search is for the governing differential equations. In the second mode, the search is for invariants and
conservation laws.

2.1.1 Governing Equations
The governing equations for a system define how the variables change in relationship to each other.
Typically, these would be ordinary or partial differential equations over time. If known, these equations
can be used to predict the response of a system to external inputs, and to simulate the system.

To search for governing equations, the scientist provides the AMD system with a time-series dataset of all
relevant variables and a set of initial building blocks for equations in the form of operators, such as
addition, multiplication, negation, reciprocal, exponentiation, logarithm, and constants. Initially a
population of formulas is created for each differential equation, with each member a randomly generated
combination of the building blocks. Each differential equation is then evaluated on a set of input data.
The fitness evaluator then compares each member within a population and uses numerical integration to
determine how well the differential equations conform to the observed data. Next, each survivor is used
to produce offspring through crossover, mutation, and trimming. The survivors and new members
are then put through an iterative process of evaluation, selection, and procreation. When any one of the
stopping criteria is reached, the system returns several candidate solutions. An expert in the field is then
able to verify the validity of the resultant equations and determine their meaning.
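The integration-based fitness test can be sketched as follows for the simplest case, a single autonomous equation dx/dt = f(x). The integrator (classical Runge-Kutta), the synthetic data, and the candidate right-hand sides are assumptions chosen for illustration, not a description of the proposed system.

```python
import math


def rk4_trajectory(rhs, x0, times):
    """Integrate dx/dt = rhs(x) with classical fourth-order Runge-Kutta."""
    xs = [x0]
    for t_prev, t_next in zip(times, times[1:]):
        h = t_next - t_prev
        x = xs[-1]
        k1 = rhs(x)
        k2 = rhs(x + 0.5 * h * k1)
        k3 = rhs(x + 0.5 * h * k2)
        k4 = rhs(x + h * k3)
        xs.append(x + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0)
    return xs


def integration_error(rhs, times, observed):
    """Mean squared gap between the simulated and observed time series."""
    simulated = rk4_trajectory(rhs, observed[0], times)
    return sum((s - o) ** 2 for s, o in zip(simulated, observed)) / len(observed)


# Synthetic "observations" of exponential decay, x(t) = 3 * exp(-0.5 * t).
times = [0.1 * i for i in range(51)]
observed = [3.0 * math.exp(-0.5 * t) for t in times]

# Hand-written candidate right-hand sides standing in for evolved formulas.
candidates = {
    "dx/dt = -0.5*x": lambda x: -0.5 * x,
    "dx/dt = -1.0*x": lambda x: -1.0 * x,
    "dx/dt = -0.5": lambda x: -0.5,
}

errors = {name: integration_error(rhs, times, observed) for name, rhs in candidates.items()}
best_name = min(errors, key=errors.get)
```

Each candidate is simulated from the first observed value and ranked by how closely its trajectory tracks the data. In a full search, these errors would drive selection.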

2.1.2 Invariants and Conservation Laws
Systems often incorporate invariants, or conserved quantities. The invariant represents an underlying
property of the system, such as conservation of energy or momentum. These unchanging properties
often provide deep insight and understanding of a phenomenon that can then be used in many different
contexts.

To find invariants, a dataset is also provided to the AMD system. The AMD system then also performs a
symbolic regression search using genetic programming techniques. However, to test for fitness, the
system then performs a symbolic differentiation of the invariant to measure how well it fits against the
dataset. This is in contrast to finding the governing differential equations, where the system performs a
numerical integration to measure the fit to observed data.
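One way such a differentiation-based test could work is sketched below. A candidate expression is differentiated symbolically with respect to each variable, and the resulting total time derivative is evaluated along the data, using finite differences for the observed rates of change. A conserved quantity should give a value near zero. The expression encoding, the scoring rule, and the oscillator data are assumptions for this example, not the method of the cited work.

```python
import math


def diff(expr, var):
    """Symbolic partial derivative of a tuple-encoded expression."""
    kind = expr[0]
    if kind == "const":
        return ("const", 0.0)
    if kind == "var":
        return ("const", 1.0 if expr[1] == var else 0.0)
    a, b = expr[1], expr[2]
    if kind == "add":
        return ("add", diff(a, var), diff(b, var))
    if kind == "mul":  # product rule
        return ("add", ("mul", diff(a, var), b), ("mul", a, diff(b, var)))
    raise ValueError("unknown node: " + kind)


def evaluate(expr, env):
    kind = expr[0]
    if kind == "const":
        return expr[1]
    if kind == "var":
        return env[expr[1]]
    left, right = evaluate(expr[1], env), evaluate(expr[2], env)
    return left + right if kind == "add" else left * right


def invariance_error(expr, series, dt):
    """Mean squared total time derivative of expr along the observed data."""
    names = sorted(series)
    partials = {name: diff(expr, name) for name in names}
    n_points = len(series[names[0]])
    total = 0.0
    for i in range(1, n_points - 1):
        env = {name: series[name][i] for name in names}
        rate = 0.0
        for name in names:
            # Central difference estimate of the observed rate of change.
            slope = (series[name][i + 1] - series[name][i - 1]) / (2.0 * dt)
            rate += evaluate(partials[name], env) * slope
        total += rate * rate
    return total / (n_points - 2)


# Synthetic harmonic oscillator data: x = cos(t), v = -sin(t).
dt = 0.01
ts = [i * dt for i in range(1001)]
series = {"x": [math.cos(t) for t in ts], "v": [-math.sin(t) for t in ts]}

X, V = ("var", "x"), ("var", "v")
energy_like = ("add", ("mul", X, X), ("mul", V, V))  # x^2 + v^2, conserved
product = ("mul", X, V)                              # x*v, not conserved

energy_error = invariance_error(energy_like, series, dt)
product_error = invariance_error(product, series, dt)
```

Note that a constant expression would also score zero under this rule, so a practical search would need some way to exclude such trivial candidates.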

2.2     Challenges
AMD-like techniques have been used successfully in the hydrology domain for a number of years [2], and
the emerging use of AMD in other domains is showing great promise. Significant improvements can be
made to AMD, however. As part of this proposal, we will research and develop ideas including:
       • Unit measurements: Incorporate the units (grams, meters, seconds) that accompany the data.
       • Dimension-awareness: Use dimensional-analysis techniques to speed the search and improve the
         quality of the results by ensuring that variables are combined only in physically valid equations.
         For example, an equation that equates square meters to cubic meters can be discarded as being
         dimensionally invalid.
       • Expert knowledge: Provide the fundamental concepts from a domain rather than starting from a
         blank slate.
       • Improved evaluation techniques: Determining whether or not an expression is a good fit currently
         relies on crude metrics. For example, an expression may be a very good fit but merely
         phase-shifted, and current techniques would not recognize this.
       • Scaling to massive computational clusters: Current AMD techniques can scale to tens of
         CPUs, but tackling challenging problems at the forefront of modern science will
         require harnessing massive computational clusters such as those available through the NSF
         TeraGrid [5].
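The dimension-awareness idea in the list above can be illustrated with a short sketch. Dimensions are represented as exponents of (mass, length, time), and a candidate expression is rejected when it adds or subtracts quantities whose dimensions differ. The encoding, the names, and the choice to treat all constants as dimensionless are simplifying assumptions for this example.

```python
# Dimensions as exponents of (mass, length, time); names are illustrative.
DIMENSIONLESS = (0, 0, 0)
METER = (0, 1, 0)
SECOND = (0, 0, 1)


class DimensionError(ValueError):
    """Raised when an expression combines incompatible dimensions."""


def dimension_of(expr, variable_dims):
    """Return the dimension of a tuple-encoded expression, or raise DimensionError."""
    kind = expr[0]
    if kind == "const":
        return DIMENSIONLESS  # simplifying assumption: constants carry no units
    if kind == "var":
        return variable_dims[expr[1]]
    left = dimension_of(expr[1], variable_dims)
    right = dimension_of(expr[2], variable_dims)
    if kind in ("add", "sub"):
        if left != right:
            raise DimensionError("cannot combine %s with %s" % (left, right))
        return left
    if kind == "mul":
        return tuple(a + b for a, b in zip(left, right))
    if kind == "div":
        return tuple(a - b for a, b in zip(left, right))
    raise ValueError("unknown node: " + kind)


def is_dimensionally_valid(expr, variable_dims):
    try:
        dimension_of(expr, variable_dims)
        return True
    except DimensionError:
        return False


variable_dims = {"side": METER, "elapsed": SECOND}
side = ("var", "side")
area = ("mul", side, side)                   # square meters
volume = ("mul", area, side)                 # cubic meters
speed = ("div", side, ("var", "elapsed"))    # meters per second

area_plus_volume_ok = is_dimensionally_valid(("add", area, volume), variable_dims)
area_plus_area_ok = is_dimensionally_valid(("add", area, area), variable_dims)
speed_dimension = dimension_of(speed, variable_dims)
```

A check of this kind could be applied before fitness evaluation, so that dimensionally invalid candidates are discarded without spending computation on them.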

2.3     Applications
To focus our work and provide a specific grounding, we will work with a number of key collaborators that
we have identified. These are given below.

2.3.1 Climate Change and Ecosystems
Many of the problems facing the world today intimately involve the ecosystems that control our
environment. Climate change, water resources, desertification, sustainable agriculture, deforestation and
species loss are all examples of such problems. Meeting these challenges depends critically on our ability
to understand and model the complex, interacting ecosystem processes that regulate our environment.
Climate change, for example, will have profound effects on our ecosystems, but is also in turn controlled
by our ecosystem, in complex, nonlinear feedback loops.

A specific ecosystem problem that we will start with is understanding lake metabolism, which is how food
is produced and consumed in a lake. An accurate model of lake metabolism is crucial for understanding
how lakes will both affect and respond to climate change. A primary indicator of lake metabolism
is the dissolved oxygen content (DO) in the water. Typically, the DO rises during the day as plant life in
the lake produces food through photosynthesis. DO then falls during the night as organisms in the lake
consume food and oxygen. In some lakes, however, there is a “bump” in the DO levels in the middle of
the night, where the DO rises again and then falls off before dawn. This bump has so far been
unexplained, and we will work with collaborator Hanson to investigate this (see letter of collaboration in
Appendix). We will use data collected within the Global Lake Ecological Observatory Network
(GLEON) for this work. Chiu is an active participant in GLEON [3], and Hanson is a steering committee
member.

2.3.2 Marketing and Social Sciences
We will work with Manoj Agarwal (School of Management) in applying the proposed technology to the
discovery of a partial differential equation- or network-based model that describes diffusion and adoption
of new technology over society in spatially and temporally extended domains. The modeling process will
be based on and driven by empirical data. We will use a database available from CENTRIS, the only
national database that continuously collects, on a daily basis, individual household information on the
choice and use of various products and services covering over 75 technology areas. From this data, we
will select a subset of tightly coupled products and services and represent the use of each product or
service by a local dynamical variable. Those variables will influence one another and thus form a
dynamical network even within a local geographical point. We will also obtain physical locations of
households, aggregate them at census block or county levels, and construct a mathematical
representation of their distribution over geographical space. We will conduct symbolic regression to find a
model whose behavior best matches (both spatially and temporally) the dynamics of diffusion and
adoption of new products observed in the empirical data.

2.3.3 Community Research
We plan to offer the proposed AMD system to other projects at BU for the development of dynamical
models of spatio-temporal correlations and variations of various community indicators in the City of
Binghamton. Sayama has collaborative relationships with David Sloan Wilson (Biological Sciences and
Anthropology / Director of EvoS) who runs the Binghamton Neighborhood Project, and with Pamela
Mischen (Public Administration / Director of Center for Applied Community Research & Development) who
runs the Virtual Binghamton Project. In these projects, a variety of block-, street-, or household-level
community indicator data are collected (e.g., crime incidents, property vacancies, street beautification
levels, students’ perception of prosociality, and holiday illuminations). These indicators
will be considered as local dynamic variables, and a partial differential equation- or network-based model
will be constructed by using the same technique as the one applied to marketing studies described
above. We will also apply the proposed method to prediction of spatio-temporal changes in the New York
State County Health Indicator Profiles database that is available from 2002 to 2006.

2.3.4 Behavioral Microeconomics
Finally, we will apply the proposed AMD system to automatic model development in economics,
especially in behavioral microeconomics. Formal (i.e., equation-based) models of economic decision
making are often developed solely based on the modeler’s speculations and assumptions, with little
support from experimental or empirical data. The technology we propose to develop will open up new areas
of research on automatic acquisition of formal models of economic decision making directly from
empirical data. To demonstrate the feasibility of this research, we will first apply the proposed method to
individual human decision making in an economic game. Actual data obtained from human subject
experiments will be fed into our algorithm and the generated formal models will be compared to several
established individual decision models. Then we will expand the scope of application to market-level
dynamics. Actual transaction data taken from a specific mid-sized market will be used to
automatically develop formal models of not only individual traders’ decisions but also their interactions.
This work will be conducted with Andreas Pape (Economics).

3       Process
The project will be led by two PIs, Kenneth Chiu in Computer Science and Hiroki Sayama in
Bioengineering. Kenneth Chiu is an assistant professor in computer science, with expertise in high
performance computing and distributed systems. He has worked on a number of projects related to grid
computing and cyberinfrastructure/e-Science, and is particularly interested in understanding how cutting-
edge computer science research can be used to radically transform the way that scientists and engineers
work. He has received seven NSF or DOE grants as PI or co-PI that are relevant to this project. His
primary responsibility will be overseeing the software development and scaling the software to large
computational clusters.

Hiroki Sayama is a complex systems scientist with a background in computer and information sciences. He
has been working on a wide variety of projects related to complex systems. His research topics include
evolutionary and ecological systems, self-replication and evolution of artifacts, human decision making,
and collective behavior of swarming agents. He has strong expertise in mathematical modeling, analysis
and simulation, nonlinear dynamical systems theory, complex network science, mathematical biology,
evolutionary computation, multi-agent systems, and fundamentals of computer and information sciences.
His primary responsibility in this project will be mathematical modeling and analysis and coordination of
application research.

Chiu and Sayama will hold weekly or biweekly meetings, along with Tony Worm, a CS undergraduate
student with whom we have already been working (and possibly some other student assistants). These
meetings will also include our collaborators on an as-needed basis. For remote collaboration, we will use
Skype teleconferencing, along with Etherpad as a collaborative, real-time “whiteboard” that works with
any web-capable computer.

3.1     Proposed Work and Feasibility
We seek both to (1) build a practical tool that can have immediate, short-term impact on researchers in
other domains, and (2) research new techniques in AMD. Research always carries a degree of risk,
however, and research in AMD is no different. To prevent challenges in our AMD research from impeding
our first goal and negatively impacting feasibility, we structure the work in two overlapping phases.

3.1.1 Phase I
Although there are previous research results in AMD, the available software is only a research prototype,
and thus not generally usable by scientists. Thus, in the initial phase of our work, our focus will be on
building a flexible, easy-to-use system for domain scientists and engineers to assist in their research,
relying as much as possible on existing AMD techniques to mitigate risk. This system will provide a web-
based interface which will allow researchers to use their expertise to help guide the search, and abstract
away details of the technique, allowing researchers to focus on the problem at hand. The system will
support searching both for the governing equations (Section 2.1.1) and for invariants (Section
2.1.2). To build this system, we will work closely with our collaborators, focusing on the areas described
in Section 2.3. The Phase I system will initially run on hardware resources currently available to the PIs,
and use distributed and parallel computing techniques to allow scientists to harness multiple computers
at the same time.

Work on Phase I is already under way, and we plan to submit a paper to a computer science conference
in December describing a prototype system.

3.1.2 Phase II
To ensure feasibility, in Phase I we will rely primarily on existing research results in the computer science
and complex systems literature. In Phase II, we will seek to push the state-of-the-art in AMD by
investigating the ideas described in Section 2.2. Phase II will overlap with Phase I. We will begin the
design of Phase II as we work on the deployment of Phase I.

In Phase II, we will also seek to scale the computation to massive computer clusters consisting of
thousands of processing nodes. For this, we will collaborate with Jim Wolf and Michael Reale at the
Computer Center to use the NSF TeraGrid [5] resources that we have available through the Computer
Center’s participation in the NSF TeraGrid Campus Champions program.

4       Import
The project to be funded by this proposal will impact the university in several significant ways.

We will produce a research tool, in the form of an operational AMD system that will be used by our key
collaborators and others to enhance their research productivity. With an easy-to-use interface, this
system will help scientists make sense out of large quantities of data by rapidly testing millions of
mathematical models, searching for the ones that best fit the data. By taking much of the “grunt work”
out of data analysis, scientists will be able to quickly test many hypotheses, allowing researchers to focus
on the actual science as opposed to manually operating mathematical and statistical packages in a time-
consuming fashion.

In the short term, this AMD system will be a valuable tool to BU researchers, raising the competitive
advantage of BU faculty for producing papers and in acquiring external funding. In the long term, as we
obtain additional external funding for hardware and operations, this system can be used by researchers
nationally and internationally. This scholarly and engineering resource will contribute substantially to the
intellectual life of the campus.

The results of this proposal will also have significant importance to BU’s educational mission. The NSF, as
well as the scientific funding agencies in other countries, has recognized the importance of
cyberinfrastructure education to the furtherance of the sciences and engineering, and created the CI
Training, Education, Advancement, and Mentoring (CI-TEAM) funding program. Chiu has already
obtained one CI-TEAM award that led to the development of an Environmental Cyberinfrastructure course
in the CS department. He will be submitting another proposal this year also to the CI-TEAM program,
which will have significant synergy with this proposal to the Academic Program and Faculty Development
Fund.

The CS research in this proposal is sufficiently deep that it will lead to at least one Ph.D. in CS. The
system itself can be used to support a number of MS projects. Furthermore, the work will provide an
ideal setting for programming assignments in the Introduction to Distributed Systems Course (CS 457 and
CS 557). These programming assignments will expose CS students to real-world systems, and thus be
less limited by the pedagogical nature of some CS assignments.

This project is well-suited as a vehicle to leverage external funding. External funding is of course
important to any research university: not only are the monies themselves of benefit, but so is the visibility
associated with the funding. The NSF has placed high priority on the types of interdisciplinary research
this project will promote. The NSF has also recognized the importance of cyberinfrastructure to the
furtherance of the sciences and engineering [4]. The project will therefore benefit BU not only through
the external funding it obtains directly, but also by enhancing the ability of researchers across the
university to obtain external funding.

The PIs themselves also expect additional external funding for this work. Chiu and Sayama have a good
track record in external funding, with a total of nine federal grants from the NSF and DOE. With the
results of the proposed project, we will target programs including, but not limited to, the following:
       • NSF Cyber-enabled Discovery and Innovation (CDI; intended deadline: February 2011/2012)
       • NSF Dynamical Systems (DS; intended deadline: February 2011/2012)
       • NSF Advances in Biological Informatics (ABI; intended deadline: August 2011)
       • James S. McDonnell Foundation (intended deadline: March 2012)
We anticipate total additional funding of approximately $750,000. The new funding will be used to
develop additional collaborations, acquire additional data to engage new domains, develop new interfaces
to better allow scientists to guide the search, and new algorithms for extending the scalability to
thousands of computational processors. Furthermore, as usage of the system expands to beyond BU’s
boundaries, we will seek operational and hardware funding to support users in other institutions and
countries.


As can be seen from above, the reach of the proposed work will extend well beyond the immediate
participants. Chiu’s collaborators in GLEON [3] and those associated with his NSF CDI
award will be immediately impacted. The extensive BU collaborations of Sayama and Chiu will lead to
impact on numerous local faculty members. Sayama’s work in developing a Complex Systems program
will lead to significant educational impacts, as well. Overall, we estimate that up to 15 faculty members
and 25 students per year in CS will benefit, as well as 50 to 100 students per year in other
departments/schools.

The PIs have also contributed to BU’s outreach to the public. Chiu’s work has been highlighted in
Binghamton University Magazine in Fall 2008 and Summer 2007, and will be highlighted in the
Binghamton Research Foundation’s Research Magazine in Fall 2009. Sayama’s work has been highlighted
in the Research Foundation Magazine in Winter 2009 and in the Binghamton University Magazine in
Summer 2009. Together, the synergy from all of the above will greatly enhance the long-term
viability of this project as a hallmark for BU.

5       Plan of Evaluation and Feedback
The success of this project requires two different coupled activities. The first activity is the system
development. This will require both software development, and actual deployment on real hardware. The
second activity is to actually apply the system to the application domains, and attempt to find significant
results. Though the activities are distinct, they are by no means independent. Rather, they will feed back
to each other in an iterative fashion. An initial system must be developed to give the scientists something
to try. But deficiencies will then feed back into where the system needs to be further developed.

Initial evaluation of the AMD system will test on known systems. In other words, are we able to discover
what the scientists already know? This first stage of evaluation will serve to validate the basic approach.
In the second stage of evaluation, we will seek to discover new relationships that are not yet known.
Success will be determined by feedback from scientists. If we are able to help them gain insights that
would otherwise have taken them much more time to obtain using traditional data analysis methods,
then we will deem that a success.

Once the software tool for automatic model development reaches its beta stage, we will hold a small
intramural workshop to call for collaborators who could try the software for their own research projects
and provide us with feedback. Their feedback will be used for further improvement of the architecture,
algorithms, interfaces, and documentation of the software tool.

Quantitatively, we will record how many CPU-hours of computational time we execute on behalf of
scientists. Increasing demand will signify the value of our system.

6       References
[1]     Bongard J., Lipson H., “Automated reverse engineering of nonlinear dynamical systems,”
        Proceedings of the National Academy of Sciences, vol. 104, no. 24, 2007, pp. 9943–9948.
[2]     Babovic V., Keijzer M., “Genetic programming as a model induction engine,” Journal of
        Hydroinformatics, vol. 2, no. 1, 2000.
[3]     GLEON, Global Lake Ecological Observatory Network Home Page, http://www.gleon.org/.
[4]     NSF, Report of the Blue-Ribbon Advisory Panel on Cyberinfrastructure,
        http://nsf.gov/publications/pub_summ.jsp?ods_key=cise051203.
[5]     NSF, TeraGrid Home Page, http://www.teragrid.org/.
[6]     Schmidt M., Lipson H., “Distilling free-form natural laws from experimental data,” Science,
        vol. 324, no. 5923, 2009, pp. 81–85.

						