					                                       Research Statement
                                              Wei Wang
                                    Department of Computer Science
                                     University of North Carolina
                                        Chapel Hill, NC 27599

My research bridges the areas of data mining, bioinformatics, and databases. The overarching goal of my
research is to achieve efficient and effective knowledge discovery on large and complex databases,
especially in the presence of noise and missing data. I am particularly interested in studying problems
with practical impact, and I enjoy the process of exploring applications in different problem domains. I
have found that the ability to connect real-world applications with theoretical models is crucial to
successful research. This includes abstracting a problem to its essence, and then devising effective
techniques for its solution. In my experience, many apparently different problems in fact share some
common characteristics. Understanding these inherent characteristics has enabled us to tackle problems at
a deeper level and to develop better solutions. My joint position in the Computer Science
Department and the Carolina Center for Genome Sciences (CCGS) has provided me with excellent
opportunities to explore the fundamental challenges faced by scientists in biomedical disciplines. This
constantly inspires my data mining research. I have focused on developing novel technologies for mining
sequence data, graph and network data, and high-dimensional data. I have established several fruitful
cross-disciplinary collaborations with CCGS in the past eight years, among which the Collaborative Cross
and MotifSpace are the two largest ongoing projects. Most of my research has been in collaboration with
colleagues and students, and it has greatly benefited from their support and insights.

Computational Analysis for Systems Genetics
Advances have been made over the last decade in our understanding of how genes influence phenotypes
and contribute to disease susceptibility. It has become increasingly clear that the underlying mechanisms
have a complex basis in which observed clinical outcomes result from a diverse range of causes
interconnected through networks of genetic, biological and environmental interactions. A clear picture of
biological complexity is available only through the development and efficient deployment of innovative
computational and statistical tools. Because of the tremendous technological advances enabled by the
genome projects, we have entered an era where progress in understanding complex biological systems is
limited only by our creativity and the development of new approaches to integrate, analyze and ultimately
interpret high dimensional data. This inspired me to develop new data mining methods that can meet this

Mining Sequence Data
I was a research staff member at IBM T. J. Watson Research Center for three years before I joined UNC.
My research there focused on mining sequence data. Analyzing and discovering patterns exhibited in
these sequential data sets is important to understanding the underlying nature of the data, and it aids in
various decision-making processes. I worked on a number of research projects that placed emphasis on
different aspects within this general domain: mining long patterns in a noisy environment [CN67, JN15],
mining rare patterns (InfoMiner [CN76], InfoMiner+ [CN63, JN10], STAMP [CN59]), mining
asynchronous periodic patterns (APP [CN78, JN14], MetaPattern [CN75, JN11]), sequence clustering
(CLUSEQ [CN74, CN65, CN62], ApproxMAP [CN64, CN60, JN6, JN4]) and subsequence similarity
searches (BASS [CN46]). Most of the work was summarized in a monograph entitled “Mining
Sequential Patterns from Large Data Sets” [BK1].

Advances in high-throughput genotyping and next-generation sequencing have generated massive amounts
of data that allow genome-wide analysis to be performed at much finer resolution than before, but at the
same time pose great computational challenges. Since joining UNC, I have been developing new and
efficient methods for enabling genome-wide studies that would otherwise be intractable. Genome-wide
association studies aim to identify statistically significant associations between genetic variations
and observed traits. I have designed a series of advanced methods for genome-wide studies, ranging from
imputing missing genotypes (NPUTE [CN20]) and inferring haplotype structures [CN10, CN5, JN3] to
incorporating such haplotype structures into association studies (TreeQA [CN7, CN4]) and analyzing how
interactions between multiple loci may affect complex traits (fastANOVA [CN12, JN2], fastChi
[CN6], COE [CN3, JN1]). Benefiting from the new data mining techniques we designed, these methods
not only achieve comparable or better accuracy than previous approaches but are also orders of
magnitude faster. Some of these methods (e.g., TreeQA) are capable of real-time processing, enabling
interactive exploration of the data and pattern spaces. This ground-breaking work has been widely recognized,
including a Best Research Paper Award at ACM SIGKDD 2008. Many of these algorithms are available
online at
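
To make the multi-locus interaction problem concrete, the following is a minimal sketch of the naive baseline: an exhaustive Pearson chi-square scan over all SNP pairs. It is not fastChi, COE, or any other method cited above. The function names, the binary genotype encoding, and the toy data are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def chi_square(groups, phenotype):
    """Pearson chi-square statistic for a group label vs. a phenotype label."""
    n = len(phenotype)
    observed = Counter(zip(groups, phenotype))
    row_totals = Counter(groups)
    col_totals = Counter(phenotype)
    stat = 0.0
    for g in row_totals:
        for p in col_totals:
            expected = row_totals[g] * col_totals[p] / n
            stat += (observed[(g, p)] - expected) ** 2 / expected
    return stat


def best_snp_pair(genotypes, phenotype):
    """Score every SNP pair exhaustively (quadratic in the number of SNPs)."""
    scores = {}
    for i, j in combinations(range(len(genotypes)), 2):
        joint = list(zip(genotypes[i], genotypes[j]))  # joint genotype per individual
        scores[(i, j)] = chi_square(joint, phenotype)
    return max(scores, key=scores.get), scores


# Hypothetical data: 3 binary SNPs over 8 individuals.
# The phenotype is the XOR of SNP 0 and SNP 1, so neither SNP is informative alone.
snps = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
trait = [0, 0, 1, 1, 1, 1, 0, 0]

best, scores = best_snp_pair(snps, trait)
single = chi_square(snps[0], trait)
```

In this toy data set the single-locus statistic for SNP 0 is zero while the pair (0, 1) scores highest, which is the kind of interaction effect a one-locus-at-a-time scan would miss. The quadratic number of pairs is what makes the exhaustive scan impractical at genome scale.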

Mining High Dimensional Data
In addition to studying sequence data, my research interests expanded into the area of mining
high-dimensional data [BC4]. Typically, a large number of diverse attributes are monitored and collected as
inputs to data-mining procedures. This growth in dimensionality is a major barrier to developing efficient
data-mining methods. Well-known examples of high-dimensional data include gene expression data, customer
profile data, and spatial-temporal data. My Ph.D. dissertation research in spatial indexing methods (PK-
tree [CN83, BC5]) and spatial data mining methods (STING [CN85], STING+ [CN82, JN16]) provided
me with a foundation for pursuing this research direction. My work includes multi-variant association
rules (TAR [CN77], WAR [CN79, JN12]), which are rules that can capture patterns of numerical
attributes and their temporal variations; coherent clustering (δ-cluster [CN73, JN9], p-cluster [CN68,
CN61], OP-cluster [CN53, CN45, CN15], ONOP-cluster [CN44, CN42], reg-cluster [CN34]), which can
model clusters showing coherent patterns in some subspaces of high-dimensional data; and, most recently,
high-order correlations (CARE [CN14], REDUS [CN9], NIFS [CN11]). The latter areas have been the focus of
my research since I moved to UNC in 2002.

Subspace Clustering and its Application to Gene Expression Analysis
My more recent work in the area of clustering analysis has focused on discovering coherent patterns
embedded in the subspaces of high dimensional data. Rather than taking distance as the similarity
measure (as is common in most clustering models), my work defines similarity based on patterns. This
approach is motivated by the observation that strongly correlated patterns might be found for data items
that are separated by large distances. For example, in DNA microarray analysis, the expression levels of
two genes may rise and fall synchronously in response to a set of environmental stimuli. Even though the
magnitude of their expression levels may not be close, the patterns they exhibit can still be very similar.
Discovery of such coherent clusters of genes is essential in revealing significant connections in gene
regulatory networks. Three clustering models have been suggested to study coherent patterns, each with a
different point of view. δ-cluster [CN73], p-cluster [CN68], and reg-cluster [CN34] aim to capture
shifting patterns and scaling patterns above a minimum size threshold. The order-preserving cluster
(OP-cluster [CN53, CN15]) relaxes p-cluster by allowing varying distance metrics in each dimension between
any two objects while still satisfying the relative ordering. Both the p-cluster and OP-cluster methods have
been effectively applied to gene expression data. To further improve the capability of OP-clustering, I
developed a new model, namely ONOP-clustering [CN42], which directly incorporates domain
knowledge into the clustering process. This yields clusters with strong ontology implications. The
application of ONOP-clustering to gene expression data and gene ontology has demonstrated its
effectiveness and efficiency for discovering biologically significant clusters [CN44]. In principle, any
existing model of (subspace) clusters can be viewed as a special case of the models I proposed. Even
though existing clustering algorithms may be modified to find clusters under the generalized models, they
often incur exponential growth in execution time. I also spent significant time on the design and
implementation of both deterministic algorithms (that return exact results) and probabilistic
alternatives that produce approximate results in a small fraction of the time consumed by the deterministic
approaches. I have found that both deterministic solutions and probabilistic methods have their own
places among application domains with differing requirements and tradeoffs.
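
The difference between distance-based and pattern-based similarity can be illustrated with a small sketch. It follows the general idea of shift coherence (every pair of columns shows nearly the same difference for both objects) and of order preservation, but it is only an illustration under assumed definitions and thresholds, not the published algorithms. All function names and the toy expression values are hypothetical.

```python
from itertools import combinations


def p_score(x, y, a, b):
    """Deviation from a pure shift on the 2x2 submatrix (objects x, y; columns a, b)."""
    return abs((x[a] - x[b]) - (y[a] - y[b]))


def is_shift_coherent(x, y, columns, delta):
    """True if x and y differ by a near-constant offset on the given columns."""
    return all(p_score(x, y, a, b) <= delta for a, b in combinations(columns, 2))


def is_order_preserving(x, y, columns):
    """True if ranking the columns by value gives the same order for x and y."""
    rank_x = sorted(columns, key=lambda c: x[c])
    rank_y = sorted(columns, key=lambda c: y[c])
    return rank_x == rank_y


# Hypothetical expression levels of three genes under five conditions.
gene_a = [10.0, 50.0, 30.0, 80.0, 20.0]
gene_b = [110.0, 150.0, 130.0, 180.0, 120.0]  # gene_a shifted by +100
gene_c = [1.0, 9.0, 4.0, 30.0, 2.0]           # same ordering, different gaps

cols = range(5)
euclid_ab = sum((p - q) ** 2 for p, q in zip(gene_a, gene_b)) ** 0.5
shift_ab = is_shift_coherent(gene_a, gene_b, cols, delta=1.0)
shift_ac = is_shift_coherent(gene_a, gene_c, cols, delta=1.0)
order_ac = is_order_preserving(gene_a, gene_c, cols)
```

Here gene_a and gene_b are far apart in Euclidean distance yet perfectly shift coherent, while gene_a and gene_c fail the shift test but still share the same relative ordering across conditions. A scaling pattern can be checked with the same shift test after taking logarithms of the values.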
    My most recent achievement on this topic is on discovering high order correlations in data subspaces.
The problem was formalized as finding strongly correlated feature subsets which are supported by a large
portion of the data points. Due to the combinatorial nature of the problem and lack of monotonicity of the
correlation measurement, it is prohibitively expensive to exhaustively explore the whole search space.
Together with my students, I designed two algorithms (CARE [CN14] and REDUS [CN9]) that utilize
spectrum properties and effective heuristics to prune the search space. As pioneering work in discovering
high-order correlations, CARE was selected as the best student paper by the 2008 ICDE award
committee, among several hundred research papers with a student first author.
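
A toy example may help show what a high-order correlation looks like. The sketch below constructs three features in which no pair is perfectly correlated, yet one linear combination of all three has zero variance. It is not CARE or REDUS: the coefficients of the linear combination are hard-coded, whereas finding such feature subsets without exhaustive enumeration is the difficult part described above. The data and function names are assumptions for illustration.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def variance(values):
    """Population variance of a sequence."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


# Hypothetical toy features: f1 and f2 are uncorrelated, and f3 = f1 + f2.
f1 = [1, -1, 1, -1, 2, -2, 2, -2]
f2 = [1, 1, -1, -1, 2, 2, -2, -2]
f3 = [a + b for a, b in zip(f1, f2)]

pairwise = {
    ("f1", "f2"): pearson(f1, f2),
    ("f1", "f3"): pearson(f1, f3),
    ("f2", "f3"): pearson(f2, f3),
}

# Variance of the combination f1 + f2 - f3 (coefficients assumed known here).
residual_var = variance([a + b - c for a, b, c in zip(f1, f2, f3)])
```

The largest pairwise correlation in this example is about 0.71, so a pairwise screen would not flag an exact relationship, yet the three features together are linearly dependent.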

Mining Spatial Motifs from Protein Structure
Rapid growth in the number of proteins for which the 3D structure is known has enabled a new
computational approach to the study of protein structure and function based upon recurring amino-acid
packing patterns or spatial motifs in a collection of known protein structures. These spatial motifs may
correlate with experimental determination of protein function or with specific function associated with
protein families. Preliminary work on serine proteases supports the premise that spatial motifs may be a
more suitable starting point for assessing protein function than sequence-level motifs [CN52]. In this
project, I have undertaken a comprehensive analysis of protein structures. I plan to mine all of the protein
structures available in the PDB (Protein Data Bank) for spatial motifs, and construct each protein’s
signature as combinations of such motifs. Similarity measures between the signatures can serve as the
basis for predicting a protein’s structure and functional class. I first look for family-specific motifs
(measured by enrichment significance) and significant associations between subsets of proteins and
subsets of spatial motifs using novel subspace-clustering techniques. In essence, each subspace cluster
defines part of the signature that characterizes a collection of proteins. Subspace clusters, which are
highly correlated to specific biological functions and/or structural families, form the basis of functional
and structural classifiers.

Subgraph Mining
In order to mine spatial motifs, I model a protein’s structure as a labeled multigraph, and detect spatial
motifs by searching for common subgraphs within groups of protein structure graphs. In a structure
multigraph, a node represents an amino acid residue in the protein structure (with the amino acid type as
the node’s label) and edges connect residues and are labeled by (1) the discretized Euclidean distance
between the two amino acid residues, and (2) the potential interaction between the two amino acid
residues. A spatial motif corresponds to a subgraph where edges are labeled by distance intervals that
encompass observed variations. I developed depth-first search algorithms (FFSM [CN54], SPIN [CN43])
with an incremental subgraph isomorphism check to identify all frequent subgraphs from a graph
database. My algorithms take advantage of the bounded edge density in protein structure graphs and
accommodate additional geometric constraints on matching subgraphs. MotifSpace’s frequent subgraph
mining algorithms are able to locate spatial motifs with known biological functions, such as the catalytic
triad in serine protease, the catalytic diad and the hydrophobic binding pocket in papain-like cysteine
protease, the ligand binding sites in nuclear binding domains, and the co-factor binding sites in NADP
binding proteins [CN52, CN51, CN8, JN7, JN5]. Since their release, these algorithms have been used by
over 100 research groups worldwide. Building on this earlier success, I have recently developed an
algorithm (COM [CN1]) that finds discriminative spatial motifs and their co-occurrences to enhance
classification accuracy; it has been shown to deliver higher accuracy and greater robustness to noise than
previous approaches.
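
The graph representation described above can be sketched as follows. This is a simplified illustration, not the MotifSpace implementation: it keeps only the distance label on each edge (the interaction label is omitted), and the bin width, the distance cutoff for creating an edge, the function name, and the residue coordinates are all assumptions.

```python
from itertools import combinations
from math import dist


def build_structure_graph(residues, bin_width=2.0, cutoff=8.0):
    """Build a labeled graph from a list of (amino_acid, (x, y, z)) residues.

    Nodes are residue indices labeled by amino acid type. An edge joins two
    residues closer than `cutoff` and is labeled by its discretized distance.
    """
    node_labels = {i: aa for i, (aa, _) in enumerate(residues)}
    edge_labels = {}
    for (i, (_, p)), (j, (_, q)) in combinations(enumerate(residues), 2):
        d = dist(p, q)
        if d <= cutoff:
            edge_labels[(i, j)] = int(d // bin_width)  # distance bin index
    return node_labels, edge_labels


# Hypothetical coordinates (in angstroms) for four residues.
toy_residues = [
    ("HIS", (0.0, 0.0, 0.0)),
    ("ASP", (3.0, 0.0, 0.0)),
    ("SER", (0.0, 5.0, 0.0)),
    ("GLY", (20.0, 20.0, 20.0)),
]

nodes, edges = build_structure_graph(toy_residues)
```

With these values the first three residues form a triangle whose edges carry distance-bin labels, and the distant fourth residue stays isolated. Discretizing distances is what allows the same motif to match across structures despite small geometric variations.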

UNC has been a stimulating environment for me to pursue my personal research interests. I have been
leading several research projects on mining large and complex databases in which we are able to advance
not only the current state of the art in the fields of data mining and databases but also the research frontier in
many other disciplines, including but not limited to bioinformatics and computational biology. I have
received three NSF Awards “CAREER: Mining Salient Localized Patterns in Complex Data” and
“Identifying Spatial Motifs for Classification of Protein Structure and Function”, a Microsoft Research
New Faculty Fellowship “MotifSpace: An Integrated Paradigm for Analyzing the Relationships between
Protein Structure and Function”, a Microsoft eScience Applications Award “A Comprehensive Protein
Database Indexed by Spatial Motifs”, a UNC Junior Faculty Development Award “Analyzing Gene
Expression Profiles”, and a Phillip and Ruth Hettleman Prize for Artistic and Scholarly Achievement. My
current interests also extend to the problems of feature selection, representation, and visualization of
multi-dimensional data sets and their applications in biomedical fields. I am a co-PI on 14 projects funded
by DARPA, NSF, and NIH, where my expertise in data mining and databases has benefited research in
3D terrain modeling, night vision enhancement, and phenotype-genotype association. In addition to being
a good team player in large collaborations, I am also taking a leadership role in mentoring junior faculty
and fostering new collaborations. Last September, I served as the PI of a P01 program project proposal
“Integrative Computational Analysis for Systems Genetics” (currently under review by NIH), leading a
team consisting of 16 scientists from computer science, biomedical engineering, genetics, nutrition,
pharmacology, and statistics (
    During the past eight years, I have also devoted significant effort to serving the broader research community. I
have served as associate editor or editorial board member of IEEE Transactions on Data Engineering,
ACM Transactions on Knowledge Discovery in Data, Knowledge and Information Systems, International
Journal of Knowledge Discovery in Bioinformatics, International Journal of Data Mining and
Bioinformatics, and Proceedings of the VLDB Endowment. In addition to serving on over 60 program
committees of SIGKDD, SIGMOD, VLDB, ICDE, EDBT, ICDM, SIAM, and CIKM, I have held
organizing roles, most recently as program committee chair of the 2009 IEEE International Conference on
Data Mining.
