Research Statement

Wei Wang
Department of Computer Science
University of North Carolina
Chapel Hill, NC 27599
firstname.lastname@example.org

My research bridges the areas of data mining, bioinformatics, and databases. The overarching goal of my research is to achieve efficient and effective knowledge discovery on large and complex databases, especially in the presence of noise and missing data. I am particularly interested in studying problems with practical impact, and I enjoy the process of exploring applications in different problem domains. I have found that the ability to connect real-world applications with theoretical models is crucial to successful research. This includes abstracting a problem to its essence, and then devising effective techniques for its solution. In my experience, many apparently different problems in fact share some common characteristics. Understanding these inherent characteristics has enabled us to tackle problems at deeper levels and to develop better solutions.

My joint position in the Computer Science Department and the Carolina Center for Genome Sciences (CCGS) has provided me with excellent opportunities to explore the fundamental challenges faced by scientists in biomedical disciplines. This constantly inspires my data mining research. I have focused on developing novel technologies for mining sequence data, graph and network data, and high-dimensional data. I have established several fruitful cross-disciplinary collaborations with CCGS in the past eight years, among which the Collaborative Cross and MotifSpace are the two largest ongoing projects. Most of my research has been in collaboration with colleagues and students, and it has greatly benefited from their support and insights.

Computational Analysis for Systems Genetics

Advances have been made over the last decade in our understanding of how genes influence phenotypes and contribute to disease susceptibility.
It has become increasingly clear that the underlying mechanisms have a complex basis in which observed clinical outcomes result from a diverse range of causes interconnected through networks of genetic, biological and environmental interactions. A clear picture of biological complexity is available only through the development and efficient deployment of innovative computational and statistical tools. Because of the tremendous technological advances enabled by the genome projects, we have entered an era where progress in understanding complex biological systems is limited only by our creativity and the development of new approaches to integrate, analyze and ultimately interpret high-dimensional data. This inspired me to develop new data mining methods that can meet this need.

Mining Sequence Data

I was a research staff member at IBM T. J. Watson Research Center for three years before I joined UNC. My research there focused on mining sequence data. Analyzing and discovering patterns exhibited in sequential data sets is important to understanding the underlying nature of the data, and it aids in various decision-making processes. I worked on a number of research projects that placed emphasis on different aspects within this general domain: mining long patterns in a noisy environment [CN67, JN15], mining rare patterns (InfoMiner [CN76], InfoMiner+ [CN63, JN10], STAMP [CN59]), mining asynchronous periodic patterns (APP [CN78, JN14], MetaPattern [CN75, JN11]), sequence clustering (CLUSEQ [CN74, CN65, CN62], ApproxMAP [CN64, CN60, JN6, JN4]) and subsequence similarity searches (BASS [CN46]). Most of the work was summarized in a monograph entitled “Mining Sequential Patterns from Large Data Sets” [BK1].

The advances in high-throughput genotyping and next-gen sequencing have generated massive amounts of data that allow genome-wide analysis to be performed at much finer resolution than before, but at the same time they have posed great computational challenges.
Since joining UNC, I have been developing new and efficient methods for enabling such genome-wide studies that would otherwise be intractable. Genome-wide association studies aim to identify statistically significant associations between genetic variations and observed traits. I have designed a series of advanced methods for genome-wide studies, ranging from imputing missing genotypes (NPUTE [CN20]) and inferring haplotype structures [CN10, CN5, JN3] to incorporating such haplotype structures into association studies (TreeQA [CN7, CN4]) and analyzing how interactions between multiple loci may impact complex traits (fastANOVA [CN12, JN2], fastChi [CN6], COE [CN3, JN1]). Benefiting from the new data mining techniques we designed, these methods not only achieve accuracies competitive with or better than previous approaches but are also orders of magnitude faster. Some of these methods (e.g., TreeQA) are capable of real-time processing, enabling interactive exploration of the data and pattern spaces. The ground-breaking work was recognized widely, including a Best Research Paper Award at ACM SIGKDD 2008. Many of these algorithms are available online at http://www.cs.unc.edu/~weiwang/software.html.

Mining High Dimensional Data

In addition to studying sequence data, my research interests expanded into the area of mining high-dimensional data [BC4]. Commonly, many varied attributes are monitored and collected as inputs to data-mining procedures. The growth in dimensionality is a major barrier in developing efficient data-mining methods. Well-known examples of high-dimensional data include gene expression data, customer profile data, and spatial-temporal data. My Ph.D. dissertation research in spatial indexing methods (PK-tree [CN83, BC5]) and spatial data mining methods (STING [CN85], STING+ [CN82, JN16]) provided me with a foundation for pursuing this research direction.
My work includes multi-variant association rules (TAR [CN77], WAR [CN79, JN12]), which are rules that can capture patterns of numerical attributes and their temporal variations; coherent clustering (δ-cluster [CN73, JN9], p-cluster [CN68, CN61], OP-cluster [CN53, CN45, CN15], ONOP-cluster [CN44, CN42], reg-cluster [CN34]), which can model clusters showing coherent patterns in some subspaces of high-dimensional data; and, most recently, high-order correlations (CARE [CN14], REDUS [CN9], NIFS [CN11]). The latter areas were the focus of my research interests after I moved to UNC in 2002.

Subspace Clustering and its Application to Gene Expression Analysis

My more recent work in the area of clustering analysis has focused on discovering coherent patterns embedded in the subspaces of high-dimensional data. Rather than taking distance as the similarity measure (as is common in most clustering models), my work defines similarity based on patterns. This approach is motivated by the observation that strongly correlated patterns might be found for data items that are separated by large distances. For example, in DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli. Even though the magnitudes of their expression levels may not be close, the patterns they exhibit can still be very similar. Discovery of such coherent clusters of genes is essential in revealing significant connections in gene regulatory networks. Three clustering models have been suggested to study coherent patterns, each with a different point of view. δ-cluster [CN73], p-cluster [CN68] and reg-cluster [CN34] aim to capture shifting patterns and scaling patterns above a minimum size threshold. Order Preserving cluster (OP-cluster [CN53, CN15]) relaxes p-cluster by allowing varying distance metrics in each dimension between any two objects while still satisfying the relative ordering.
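
The contrast between shifting patterns and order-preserving patterns can be illustrated with a small sketch. It is not the published p-cluster or OP-cluster algorithm: the pairwise score below follows the commonly cited pScore form and is an assumption here, the function names are hypothetical, and the expression values are invented toy data.

```python
from itertools import combinations


def p_score(x, y, a, b):
    """Score of the 2x2 submatrix formed by rows x, y at columns a, b.

    Assumed form: |(x[a] - x[b]) - (y[a] - y[b])|. It is zero when the two
    rows differ by a constant shift on these two columns.
    """
    return abs((x[a] - x[b]) - (y[a] - y[b]))


def is_shift_coherent(rows, cols, delta):
    """True if every pair of rows is within delta of a pure shift on cols."""
    return all(
        p_score(x, y, a, b) <= delta
        for x, y in combinations(rows, 2)
        for a, b in combinations(cols, 2)
    )


def same_ordering(x, y, cols):
    """True if x and y rank the chosen columns in the same order."""
    def order(row):
        return sorted(cols, key=lambda c: row[c])
    return order(x) == order(y)


# Toy expression profiles over four conditions (hypothetical values).
gene_a = [10.0, 40.0, 25.0, 60.0]
gene_b = [110.0, 140.0, 125.0, 160.0]  # gene_a shifted by 100
gene_c = [12.0, 90.0, 30.0, 95.0]      # same ordering, different gaps

conditions = range(4)
print(is_shift_coherent([gene_a, gene_b], conditions, 0.0))  # True
print(is_shift_coherent([gene_a, gene_c], conditions, 1.0))  # False
print(same_ordering(gene_a, gene_c, conditions))             # True
```

Here gene_a and gene_b are far apart in magnitude yet form a perfect shifting pattern, while gene_c fails the shift test but still preserves the relative ordering of the conditions. A scaling pattern could be checked the same way after a logarithmic transform of the values.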
Both p-cluster and OP-cluster methods have been effectively applied to gene expression data. To further improve the capability of OP-clustering, I developed a new model, namely ONOP-clustering [CN42], which directly incorporates domain knowledge into the clustering processes. This yields clusters with strong ontology implications. The application of ONOP-clustering to gene expression data and gene ontology has demonstrated its effectiveness and efficiency for discovering biologically significant clusters [CN44].

In principle, any existing model of (subspace) clusters can be viewed as a special case of the models I proposed. Even though existing clustering algorithms may be modified to find clusters under the generalized models, they often incur an exponential growth in execution time. I also spent significant time on the design and implementation of both deterministic algorithms (which return accurate results) and probabilistic alternatives that produce approximate results in a small fraction of the time consumed by the deterministic approaches. I have found that deterministic and probabilistic methods each have their place among application domains with differing requirements and tradeoffs.

My most recent achievement on this topic is in discovering high-order correlations in data subspaces. The problem was formalized as finding strongly correlated feature subsets that are supported by a large portion of the data points. Due to the combinatorial nature of the problem and the lack of monotonicity of the correlation measurement, it is prohibitively expensive to exhaustively explore the whole search space. Together with my students, I designed two algorithms (CARE [CN14] and REDUS [CN9]) that utilize spectrum properties and effective heuristics to prune the search space.
As pioneering work in discovering high-order correlations, CARE was selected as the best student paper by the 2008 ICDE award committee, among several hundred research papers with a student first author.

Mining Spatial Motifs from Protein Structure

Rapid growth in the number of proteins for which the 3D structure is known has enabled a new computational approach to the study of protein structure and function, based upon recurring amino-acid packing patterns, or spatial motifs, in a collection of known protein structures. These spatial motifs may correlate with experimental determination of protein function or with specific functions associated with protein families. Preliminary work on serine proteases supports the premise that spatial motifs may be a more suitable starting point for assessing protein function than sequence-level motifs [CN52].

In this project, I have undertaken a comprehensive analysis of protein structures. I plan to mine all of the protein structures available in the PDB (Protein Data Bank) for spatial motifs, and construct each protein’s signature as a combination of such motifs. Similarity measures between the signatures can serve as the basis for predicting a protein’s structure and functional class. I first look for family-specific motifs (measured by enrichment significance) and significant associations between subsets of proteins and subsets of spatial motifs using novel subspace-clustering techniques. In essence, each subspace cluster defines part of the signature that characterizes a collection of proteins. Subspace clusters that are highly correlated with specific biological functions and/or structural families form the basis of functional and structural classifiers.

Subgraph Mining

In order to mine spatial motifs, I model a protein’s structure as a labeled multigraph, and detect spatial motifs by searching for common subgraphs within groups of protein structure graphs.
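
A minimal sketch of such a structure graph is given below. It covers only the distance-labeled edges and is not the MotifSpace implementation: the function name, the 2.0 bin width, the 8.0 distance cutoff, and the residue coordinates are all hypothetical values chosen for illustration.

```python
import math
from itertools import combinations


def build_structure_graph(residues, bin_width=2.0, max_dist=8.0):
    """Build a labeled graph from (amino_acid_type, (x, y, z)) tuples.

    Nodes are residue indices labeled by amino acid type. An edge joins two
    residues within max_dist and is labeled by the index of the distance bin
    (a discretized Euclidean distance). Both parameters are assumptions.
    """
    nodes = {i: aa for i, (aa, _) in enumerate(residues)}
    edges = {}
    for (i, (_, p)), (j, (_, q)) in combinations(enumerate(residues), 2):
        d = math.dist(p, q)
        if d <= max_dist:
            edges[(i, j)] = int(d // bin_width)
    return nodes, edges


# Invented toy coordinates, not taken from any PDB entry.
toy_residues = [
    ("SER", (0.0, 0.0, 0.0)),
    ("HIS", (3.0, 0.0, 0.0)),
    ("ASP", (3.0, 4.0, 0.0)),
    ("GLY", (20.0, 0.0, 0.0)),
]

nodes, edges = build_structure_graph(toy_residues)
print(nodes)
print(edges)  # {(0, 1): 1, (0, 2): 2, (1, 2): 2}
```

The three nearby residues form a fully connected labeled triangle, the kind of small subgraph a frequent subgraph miner would count across many structures, while the distant residue stays isolated. A second edge label for the interaction type would turn this into the multigraph described in the text.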
In a structure multigraph, a node represents an amino acid residue in the protein structure (with the amino acid type as the node’s label), and edges connect residues and are labeled by (1) the discretized Euclidean distance between the two amino acid residues, and (2) the potential interaction between the two amino acid residues. A spatial motif corresponds to a subgraph where edges are labeled by distance intervals that encompass observed variations. I developed depth-first search algorithms (FFSM [CN54], SPIN [CN43]) with an incremental subgraph isomorphism check to identify all frequent subgraphs from a graph database. My algorithms take advantage of the bounded edge density in protein structure graphs and accommodate additional geometric constraints on matching subgraphs. MotifSpace’s frequent subgraph mining algorithms are able to locate spatial motifs with known biological functions, such as the catalytic triad in serine protease, the catalytic dyad and the hydrophobic binding pocket in papain-like cysteine protease, the ligand binding sites in nuclear binding domains, and the co-factor binding sites in NADP binding proteins [CN52, CN51, CN8, JN7, JN5]. Since their release, these algorithms have been used by over 100 research groups worldwide. Building on this earlier success, I have recently developed an algorithm (COM [CN1]) for finding discriminative spatial motifs and their co-occurrences to enhance classification accuracy, which has been shown to deliver higher accuracy and to be more robust to noise than previous approaches.

Summary

UNC has been a stimulating environment for me to pursue my personal research interests. I have been leading several research projects on mining large and complex databases in which we are able to advance not only the current state of the art in the fields of data mining and databases but also the research frontier in many other disciplines, including but not limited to bioinformatics and computational biology.
I have received three NSF Awards “CAREER: Mining Salient Localized Patterns in Complex Data” and “Identifying Spatial Motifs for Classification of Protein Structure and Function”, a Microsoft Research New Faculty Fellowship “MotifSpace: An Integrated Paradigm for Analyzing the Relationships between Protein Structure and Function”, a Microsoft eScience Applications Award “A Comprehensive Protein Database Indexed by Spatial Motifs”, a UNC Junior Faculty Development Award “Analyzing Gene Expression Profiles”, and a Phillip and Ruth Hettleman Prize for Artistic and Scholarly Achievement. My current interests also extend to the problems of feature selection, representation, and visualization of multi-dimensional data sets and their applications in biomedical fields. I am a co-PI on 14 projects funded by DARPA, NSF, and NIH, where my expertise in data mining and databases has benefited research in 3D terrain modeling, night vision enhancement, and phenotype-genotype association.

In addition to being a good team player in large collaborations, I am also taking a leadership role in mentoring junior faculty and fostering new collaborations. Last September, I served as the PI of a P01 program project proposal “Integrative Computational Analysis for Systems Genetics” (currently under review by NIH), leading a team of 16 scientists from computer science, biomedical engineering, genetics, nutrition, pharmacology, and statistics (http://compgen.unc.edu/ICASG/index.html).

During the past eight years, I have also devoted significant effort to serving the larger research community. I have served as associate editor or editorial board member of IEEE Transactions on Data Engineering, ACM Transactions on Knowledge Discovery in Data, Knowledge and Information Systems, International Journal of Knowledge Discovery in Bioinformatics, International Journal of Data Mining and Bioinformatics, and Proceedings of the VLDB Endowment.
In addition to serving on over 60 program committees of SIGKDD, SIGMOD, VLDB, ICDE, EDBT, ICDM, SIAM, and CIKM, I have played organizing roles, most recently as program committee chair of the 2009 IEEE International Conference on Data Mining.