					       Category                              Author           Synopsis               Title                       Source               Year
Economics / Financial        x     Ali Moazami & Shaolin Li              Mortgage Business           MortgageIT Holdings Inc. &          2005
Marketing                    x     Arnab Dey and Ritesh                  Application of CART in      Inductis                            2005
Environment /                x     Belle Hudischewskyj                   Use of CART for Air         ICF Consulting/Systems              2004
Clinical Healthcare          x     Brydon Grant 1                        Model Drift in a Predictive University of Buffalo Medical       2005
Clinical Healthcare          x     Brydon Grant 2                        Use of MARS to Predict      University of Buffalo, School       2004
Clinical Healthcare          x     Carrie Salafia                        Salford Methods and the Columbia University School of           2005
Environment /                x     Cecil Dharmasri                       The Importance of CART Syngenta Crop.                           2004
Economics / Financial        x     Charles Pollack                       A Comparison of PRIM        Suncorp Metway                      2005
Economics / Financial        x     Charles Pollack                       Insurance Premium           Suncorp Metway                      2004
Agriculture / Soils /        x     Christopher Martius                   Assessing the               University of Bonn                  2004
Marketing                    x     Cristina Bellido                      Modelling Differential Response Rates   Director of Data Mining   2005
Marketing                    x     David McCloskey                       Modelling Over Space        Pathfinder Solutions Ptd.           2004
Telecommunications           x     David Poole                           An Out-of-Memory            AT&T Labs - Research                2005
Manufacturing /              x     Deborah Burr                          Models for Recurrence of Ohio State University School           2005
Economics / Financial        x     Dominique Haughton                    Determinants of Market      Bentley College                     2005
Economics / Financial        x     Beng-Hai Chea                         Committee of Decision       Citibank                            2006
Marketing                    x     Don Cozine                            CART for Prim-like                  2005
Life Science / Drug          x     Donovan Chin, Aldrin                  Improved Predictions in     Biogen                              2004
Marketing                    x     Edward Malthouse                      Data Mining and             Northwestern University             2004
Environment /                x     Falk Huettmann 1                      Predictive Modeling: The University of Alaska                   2004
Environment /                x     Falk Huettmann 2                      Predictive Modeling: The Institute of Arctic Biology.           2005
Life Science / Drug          x     George Towfic                         Generating a Connected Clarke College                           2005
Marketing                    x     Glenn Hofmann                         Marketing Strategies for    Household International             2005
Marketing                    x     Hans Aigner                           How to Exploit the          DATALAB USA                         2005
Economics / Financial        x     Inna Kolyshkina 1                     Using Multivariate          PricewaterhouseCoopers              2005
Economics / Financial        x     Inna Kolyshkina 2                     Text Mining Challenges      PricewaterhouseCoopers              2005
Clinical Healthcare          x     Jason Haukoos                         Classification and          Denver Health Medical Center        2004
CRM                          x     Jeff Weiner                           Developing a Deeper         Carlson Marketing Group             2005
Environment /                x     Jesus Munoz                           Ecological modelling: Integration of Geographic Information Systems and multivariate s   Real Jardin Botanico   2005
Life Science / Drug          x     Jing Wang                             Data Mining in the          Cengent Therapeutics                2004
Data Mining Preparation and Implementation   x     John Elder            Avoiding the Top 10 Mistakes in Data Mining   Elder Research, Inc.   2005
Economics / Financial        x     John Trimble 1                        CART/ MARS Risk             Wells Fargo                         2004
Economics / Financial        x     John Trimble 2                        Improving Customer          Wells Fargo Bank                    2005
Data Mining                  x     John Trimble 3                        CART/MARS Risk Assessment in Automobile Loans and Leases   Wells Fargo Bank   2005
Life Science / Drug          x     John Warner                           Mining a Pharmaceutical Novartis                                2004
Economics / Financial        x     Jon Farar                             CART Solutions for          Union Bank of California            2004
Life Science / Drug Discovery / Biopharmaceutical   x   Jorge Martin Arevalillo   Maximum Predictive-Minimum Redundancy Gene Subset Selection from Microarray D   Universidad Nacional de Educacion a   2005
Clinical Healthcare           x       Joseph C. Cappelleri                 Data Mining Implementation: Using CART to Develop a Diagnostic Tool for Erectile Dys   Pfizer   2005
Economics / Financial Services / Fraud Detection / Insurance   x   Kasindra Maharaj & Robert Ceuvorst   Comparison of TreeNet, RandomForests, and CART for Applied Prediction Problems in   Synovate, Inc.   2005
Life Science / Drug           x       Kenna Mawk                           Using CART to Mine        Ciphergen Biosystems, Inc.         2004
Economics / Financial         x       Kevin Walsh                          Modeling for Fraud        Inductis                           2005
Marketing                     x       Larry Lai                            Variable Derivation and   Directv, Inc.                      2004
Economics / Financial         x       Lijia Guo                            Data Mining Applications University of Central Florida       2005
Manufacturing /               x       Linus Nilsson                        Fault Detection in        Capgemini                          2005
Economics / Financial         x       Louise Francis 1                     A Comparison of Treenet Francis Analytics and                2005
Economics / Financial         x       Louise Francis 2                     Insurance Fraud           Actuarial Analytics                2004
Economics / Financial         x       Lynn Garbee                          Using CART to Infer       IntelliClaim, Inc.                 2005
Clinical Healthcare           x       Marsha Wilcox 2                      Using CART to Discern     Boston University Medical          2004
Life Science / Drug           x       Marsha Wilcox 1                      Using CART and TreeNet Boston University Medical             2005
Data Mining                   x       Mikhail Golovnya 1                   Churn Modeling for        Salford Systems                    2005
Data Mining Preparation and Implementation   x   Mikhail Golovnya 2     Salford Systems: KDD Cup 2000 Winner   Salford Systems   2005
Economics / Financial         x       Mitchell Prevett & Richard           Statistical Case          PricewaterhouseCoopers &           2005
Economics / Financial         x       Paolo Manasse                        "Rules of Thumb" for      International Monetary Fund        2005
Life Science / Drug           x       Ramana V. Davuluri                   Identifying Estrogen      Human Cancer Genetics              2005
Economics / Financial         x       Sergey Bakin                         CART Plus Linear Models Suncorp Metway                       2005
Epidemiology                  x       Shenghan Lai                         Data Mining Examples in Johns Hopkins University             2005
Clinical Healthcare           x       Sigurd W. Hermansen                  Linking Heterogeneous Data Sources: Using TreeNet to Develop a Model for Classifyin   Westat   2005
Clinical Healthcare           x       Stuart Gansky                        MARS and Related          UCSF School of Dentistry           2004
Epidemiology                  x       U. Suryanarayana Murty               Application of CART for   Indian Institute of Chemical       2005
Life Science / Drug           x       Wayne Danter 2                       A Comparison of Hybrid    Critical Outcome                   2005
Life Science / Drug           x       Wayne Danter 1                       21st Century Drug         Critical Outcome                   2004
Economics / Financial         x       Whitney Olsen                        Does Using CART Need Olsen Capital Management                2005
Fueled by an overall mortgage market of almost $10 Trillion, the $5.3 Trillion in US mortgage-backed securities are an important sector of the global capital markets. Since these se
Many financial institutions, airlines, hotels and retail companies have instituted customer loyalty programs to increase customer satisfaction and loyalty. These programs issue point
This presentation provides an overview of a recent study carried out for the Mid-Atlantic Regional Air Management Association (MARAMA). The primary objective of the study was
In 1996, we collected data and developed a questionnaire to assist physicians to predict the severity of obstructive sleep apnea in patients in whom they suspected this condition. W
Obstructive sleep apnea affects 2-4% of the adult population in the USA. Various indices have been developed to assess the presence or absence of sleep apnea from changes o
The fetal origins hypothesis has renewed interest in the perinatal birth cohorts recruited in the US in the late 1950s-1960s. Many of these offspring are currently being followed for

In developing easy to understand decision trees, PRIM (Patient Rule Induction Method) is a useful alternative to CART (Classification And Regression Trees). Whilst CART is often

For most applications of the PRIM procedure it is assumed that the entire dataset under consideration can be stored in main memory (RAM). Friedman and Fisher (1998) noted tha
Two hundred low back injured workers returning to full duty work were drawn from over 40 manufacturing facilities in the Midwest USA. A mechanical device, the "lumbar motion m
We investigate determinants of market induced vulnerability for Vietnamese households according to two definitions of vulnerability, one equal to the difference in consumption at tw
Abstract: Characterization of individual behaviours by the process of data mining has been a key tool for commercial banks to manage credit losses for a long time. Many techniques
In "Bump Hunting in High Dimensional Data", Jerry Friedman introduced a Patient Rule Induction Methodology (PRIM) whereby relatively homogenous subsegments of a database

"Data mining" has been successful in optimizing existing processes, tasks, tactics, etc. in a wide variety of fields, including marketing. Optimizing such tasks can result in very large
Predictive spatial modeling is a rapidly growing multidisciplinary science. Applications are manifold and range from plain distribution questions through investigations of abundance and
The topic of Global Biodiversity represents a major challenge. It is basically impossible to deal with all the species, and to inventory and map them in a Geographic Information Syste
CART 5 has been used to generate a data mining model to study the viral load, CD4+, and CD8+ dynamics in HIV-infected persons. The model is based on the construction of a
Ubiquitous application of data mining in direct marketing has the opportunity and responsibility to increase profitability while easing the burden of unwanted advertising for consume
Data mining products, especially TreeNet, have made it possible to quickly and efficiently develop predictive scores for direct marketing programs. Our experience has shown that p
Generalised linear models (GLM) have become very popular and have been shown to be effective in actuarial work over the past decade. This paper discusses ho
This presentation discusses the findings of an industry-based study in the utility of text mining. The purpose of the study was to evaluate the impact of textual information in claims c
Dr. Haukoos has significant experience using Classification and Regression Tree (CART) analysis and will discuss its use in biomedical research. Dr. Haukoos will discuss method
Historically, key drivers of customer satisfaction have always been calculated utilizing an ordinary least squares (OLS) approach. That is to say, an underlying assumption is made i

Application of data mining methods in drug discovery is a highly promising yet almost unexplored field. As one of the first to apply CART in the drug discovery field, we have used C
Data Mining is still as much an art as a science, and fancy new tools make it easy to do wrong things with one's data even faster. We'll examine the major "cracks in the crystal ball

Thirty percent of auto loans prepay--pay off prior to maturity--in the first 18 months after origination. Understanding why borrowers pay off is the first step in mitigating the substantia
John Trimble provides an example of basic data preparation steps used at Wells Fargo Auto Finance Group. Trimble describes some of the steps and basic techniques used at We
This presentation discusses the use of the random forest methodology, a tree based procedure that makes use of bootstrapping and random feature generation (Breiman, 2001), in
Objectives: To create an abridged 5-item version of the 15-item International Index of Erectile Function (IIEF-5) as a diagnostic tool to discriminate between men with and without e

Based on patented Surface Enhanced Laser Desorption/Ionization (SELDI) technology, Ciphergen's ProteinChip Systems offer a single, unified platform for a multitude of proteomi
Fraud is a large, widespread, and growing problem, with institutional fraud losses estimated at over $200 B annually. These losses take place within the ordinary course of business
In a CRM environment, it is not uncommon to have a huge data warehouse containing all customer touches through different contact channels- mail, phone call, e-mail and website
The insurance industry has experienced many changes in information technology over the years. Advances in hardware, software, and networks have offered benefits, such as red
We present a methodology for model-based fault detection in industrial process plants and discuss the problem of choosing the statistical methods to use for the construction of ref
A recently developed data mining technique, TreeNet, has been hailed by some as a viable competitor to neural networks and other techniques such as CART and MARS. Like neu
Introduce a relatively new data mining method which can be used as an alternative to neural networks. Compare the method to neural networks. Apply the methods to fraud data/
IntelliClaim operates in the Healthcare Insurance domain reviewing millions of claims each month to improve accuracy, turn-around-time and administrative efficiencies of claims pa
 1. Genetic Interaction in Alzheimer Disease 2. Novel Genetic Markers and Modes of Inheritance in Alcoholism 3. Aging well, Cardiovascular Advantages - Missing data problem
Recursive Partitioning (RP) as implemented in CART and TreeNet has been useful in discerning genetic models in Alzheimer Disease (AD), alcoholism and cocaine addiction. In A
Winning data mining strategies employed by Salford Systems consulting team in the Duke/NCR Teradata Churn Modeling Tournament 2003 and KDD Cup 2000 are presented. We

The NSW WorkCover Scheme pays benefit to more than 100,000 injured people each year. Many of the injured are seriously injured and on long term benefit support. The total fin
Following the debt-crises of the 1980s, sovereign debt defaults have made a remarkable come-back in recent years. While there has been a significant amount of research regardi
The key aspect in deciphering the complex puzzle of transcriptional regulatory networks is the identification of target genes of transcription factors (TFs). TFs bind to specific seque
Although the idea of performing a stratified regression via crossing of linear models with CART is not new, the literature on the subject is sparse and the software is not readily avai
Dr. Lai will discuss his use of data mining techniques to identify relationships followed by his use of standard statistical techniques to prove that such relationships are indeed true. D

Abstract: A recent study of mechanical properties of the dentinoenamel junction (DEJ) of the human tooth (Marshall et al, 2001) sought to estimate the width of the interface of two
For the first time, data mining tasks are performed on a real-life database pertaining to Bancroftian Filariasis disease in east and west Godavari districts of Andhra Pradesh, India. T

The traditional drug development process is lengthy, expensive, inefficient and unnecessarily risky. It costs on average ~750,000,000.00 USD, takes at least a decade and 5000-10
The presence of Complexity in Financial Markets has been a perennial topic of interest between Scientists and Financiers since the birth of what was once known as Chaos Theory
  capital markets. Since these securities have traditionally included an imbedded prepayment option, their accurate valuation requires projections of both prepayment and default risk
 lty. These programs issue points for every dollar spent by customers, which can be redeemed for various redemption items. The outstanding points represent a potential liability on th
mary objective of the study was to develop and deliver documented and tested methods for forecasting particulate matter concentration of less than 2.5 microns in diameter (known a
they suspected this condition. We trained an artificial neural network (ANN) on 189 patients. The ANN validated its accuracy on the subsequent 80 patients who underwent full night p
of sleep apnea from changes of arterial oxygen saturation recorded overnight. This presentation will show how we developed a mathematical model that would combine the best pre
are currently being followed for a variety of adult diseases from diabetes to schizophrenia. Placental growth is essential for normal fetal growth, the most common surrogate for the in

on Trees). Whilst CART is often criticized for being a 'greedy' technique, PRIM has the advantage of not being greedy whilst still being easy to present and understand. This paper loo

man and Fisher (1998) noted that various subsampling strategies can be used when memory capacity is insufficient for all the data. An alternative approach is an out-of-memory, disk
al device, the "lumbar motion monitor" (LMM) was used to measure trunk position, velocity and acceleration in three planes. These measures were collected both clinically (in the pl
e difference in consumption at two different points in time and one contrasting poor, non-poor but vulnerable and secure households. We find that, in agreement with studies on other
or a long time. Many techniques are available and used by either in-house research teams or Commercial Score Developers. However all these techniques are not very effective whe
 us subsegments of a database containing a target variable are identified. Utilizing Classification and Regression Trees (CART), Salford Systems formulated a "Hot Spot Detector" m

 ch tasks can result in very large cost savings and/or incremental profit. But data mining has not been used in making high-level (marketing) strategy or policy decisions within an org
 nvestigations of abundance and populations up to Population Viability Analysis (PVA), global change modeling and others. This presentation tries to give a brief overview about appli
 a Geographic Information System (GIS). Efficient tools such as predictive GIS modeling can be of help to overcome some of these problems. Here I will show which data sets exist,
 s based on the construction of a dynamic tree structure based on a suggested Independent Component Analysis (ICA) system. The suggested dynamic tree eliminates noise in labor
wanted advertising for consumers. Big retail chains typically have large customer databases, mostly consisting of store credit card holders. Differential treatment of customers is key t
Our experience has shown that predictive scores developed with TreeNet are either equal to or outperform scores developed with logistic regression or other data mining products. Th
 ecade. This paper discusses how the advantages and strengths of GLM can be effectively combined with the computational power of data mining methods, presenting an example o
 of textual information in claims cost prediction. The industrial research setting was a large Australian Insurance company. The data mining methodologies used in this research includ
Dr. Haukoos will discuss methodological and statistical details of three completed studies, emphasizing specific CART-related modeling techniques. The first study used CART to der
underlying assumption is made in the analysis that the relationship between overall satisfaction and a potential driver of satisfaction is linear. Oftentimes, the assumption of a linear re

 discovery field, we have used CART in developing PTP1B inhibitors as potential diabetes drugs. The first usage was to isolate the major physicochemical properties of compounds
 major "cracks in the crystal ball" through case studies of (often personal) errors -- both simple and complex -- drawn from real-world consulting engagements. Best Practices for Data

step in mitigating the substantial administrative and reinvestment costs that prepays pose to lenders. This presentation shows how the modeling efficiency and data mining capabilitie
nd basic techniques used at Wells Fargo Auto Finance Group in preparing raw data for modeling. Included are a description of data sources, basic methodological choices and a sho
e generation (Breiman, 2001), in the mining of a pooled phase II and III Novartis clinical trials database. The goal of this data mining exercise is to construct a predictive model that e
 etween men with and without erectile dysfunction (ED), and to develop a clinically meaningful gradient of severity for ED. Methods: 1152 men (1036 with ED, 116 without ED) who re

  orm for a multitude of proteomics research applications. In particular, the use of SELDI technology is expanding rapidly in clinical proteomics to identify proteins that may be useful a
  the ordinary course of business and may be recognized as operating expenses rather than fraud. Once fraud is recognized, a predictive score can effectively address the problem. T
 , phone call, e-mail and website, inbound or outbound. Data content could be either origination at the time of registration such as credit risk, dealer channels, geographic, lifestyle, ps
ve offered benefits, such as reduced costs, reduced time of data processing and increased potential for profit, as well as new challenges, particularly in the area of increased competi
 o use for the construction of reference models. Factors complicating the problem include the presence of non-linear, dynamic relationships, high dimensionality, large numbers of ob
h as CART and MARS. Like neural networks, it is effective when analyzing complex structures which are commonly found in data, such as nonlinearities and interactions. However, b
pply the methods to fraud data/
 strative efficiencies of claims payment. These improvements take aim at reducing overall cost and providing significant savings to all players in Patient-Provider-Payor triangle. Claim
ntages - Missing data problem
 ism and cocaine addiction. In AD, the discovery of genes associated with the disease brings the hope of understanding the disease process and, ultimately, new therapies for preven
DD Cup 2000 are presented. We describe major data preparation steps involved in these projects as well as a number of various data mining algorithms and approaches. We empha

 rm benefit support. The total financial liabilities of the Scheme are approximately A$9B. NSW WorkCover Statistical Case Estimate Model is a statistical model that uses characteristi
 ant amount of research regarding debt crises in general and about the policy response to these defaults, the macroeconomic and structural weaknesses leading to them however, ar
TFs). TFs bind to specific sequence motifs in the promoter regions of target genes, and participate in combinatorial interaction with TFs (modules) of other signaling networks. ChIP-o
  the software is not readily available. In this presentation we discuss our implementation of the stratified regression based on trees called TreeLM. We also discuss applications of Tr
h relationships are indeed true. Dr. Lai will use as examples his research related to the Association between Vitamin E and the development of myocardial infarction and the associatio

he width of the interface of two dissimilar materials based on atomic force microscope-derived nanohardness and elastic modulus measurements. Here we study the statistical techn
icts of Andhra Pradesh, India. The database consists of socio-economic characteristics of 8938 persons who are screened for the prevalence or otherwise of Filariasis disease cause

s at least a decade and 5000-10000 compounds must be screened in order to take one successful drug through to regulatory approval. The pre-clinical phase of this process relies he
 s once known as Chaos Theory. We present CART as a statistical tool to explore Complex Systems: specifically, Financial Time Series Data. We investigate both the use of delibera
 th prepayment and default risk under various interest rate and economic environments. Learn how CART and TreeNet were used at the core of a complete loan-level mortgage valua
epresent a potential liability on the balance sheets of the institutions offering the program and must be provisioned for. Presently, no GAAP guidelines exist on reserving; therefore eac
 5 microns in diameter (known as PM2.5) for several cities in the MARAMA region. These methods were then applied in the development of a real-time forecasting tool for PM2.5. E
tients who underwent full night polysomnography in 1996. In a different sleep laboratory, we collected data on a new series of 529 patients and used the subsequent 200 patients in 2
 hat would combine the best predictive properties of these indices. We obtained overnight oximetry during overnight polysomnography in 224 patients and validated it prospectively in
ost common surrogate for the intrauterine environment in such studies. Better understanding of how the placenta grows may help us explain how BW may mark long term health risks

 and understand. This paper looks at three different approaches to implementing PRIM and compares it to CART for a particular insurance application (quote sales success profiling)

oach is an out-of-memory, disk-based implementation of PRIM where the dataset is never stored in the memory of the computer. We describe a disk-based PRIM procedure develop
ollected both clinically (in the plant medical area) and while the worker was performing the job. Additional data was collected on anthropometry, psychosocial aspects of the workplac
greement with studies on other countries, education and household composition are strongly related to vulnerability, and that savings intervene in predicting which of the three catego
iques are not very effective when it comes to predicting personal bankruptcies. In the current unstable environment, when technology is progressing at a phenomenal speed, structural un
 ulated a "Hot Spot Detector" methodology whereby a single terminal node of a Tree identifies a small extremely homogenous subset of rare events (a needle in a haystack). Building

or policy decisions within an organization. This talk discusses whether data mining can and should play a role in such decisions. Data mining fundamentally studies how to make acc
 ive a brief overview about applications, benefits and unresolved questions when using advanced modeling algorithms such as CART and MARS in the context of wildlife research an
will show which data sets exist, how to model them, and how MARS (Multivariate Adaptive Regression Splines) and CART (Classification and Regression Trees) can contribute to the
 ic tree eliminates noise in laboratory data, and the resulting refined dataset is then identified as a set of weighted connected groups that are associated through a directed graph. It is
  treatment of customers is key to modern marketing. Retailers want to detect people with the potential to be loyal shoppers, and build a mutually beneficial relationship with them thro
r other data mining products. The question remains, however, on whether such scores are the most predictive scores that can be obtained. High volume direct marketing programs p
 hods, presenting an example of the combining of multivariate adaptive regression splines (MARS®) and GLM approaches by running a MARS® model and then building a GLM with
gies used in this research included text mining, and the application of the results from the text mining in subsequent predictive data mining models. The researchers used software of
 he first study used CART to derive a clinical decision instrument to help triage HIV-infected patients upon presentation to the emergency department. The second study used CART
 s, the assumption of a linear relationship is not valid and worse, incorrect inference is derived suggesting that the greater the investment in some aspect of a product or service implie

mical properties of compounds that play determinant roles for the kinetic behavior of compounds in the enzyme inhibition. This helped to design competitive inhibitors. The second u
ements. Best Practices for Data Mining will be (accidentally) illuminated by their (rarely described) opposites. These common errors range from allowing anachronistic variables into t

ency and data mining capabilities of MARS were instrumental in a substantial step forward in unraveling the underlying reasons for prepayment. As a group, prepays are very nearly r
ethodological choices and a short case study of how one measure, prepayment, was more usefully classified into two groups, those arising from road hazards and those not.
nstruct a predictive model that explains the variation in clinical response to drug associated with patient demographics, medical histories, concomitant medications, laboratory measur
with ED, 116 without ED) who reported attempting sexual activity were evaluated using baseline data from four clinical trials of VIAGRA™ (sildenafil citrate) and two control samples. T

 fy proteins that may be useful as biomarkers and to correlate these potential biomarkers with disease states. However, protein profiling presents a significant challenge for statistical
 ectively address the problem. The model built in this engagement was intended for use as a screening tool for credit applications. More than 850 variables were generated to charact
 annels, geographic, lifestyle, psychographic and demographic or longitudinal over life such as customer consumption, payment and contact. Prior to a CART modeling analysis, it can
 n the area of increased competition. Technological innovations, such as data mining and data warehousing, have greatly reduced the cost of storing, accessing, and processing data.
ensionality, large numbers of observations, limited a priori knowledge, and high requirements regarding interpretability. CART and MARS can naturally handle many of these problems
es and interactions. However, by averaging the results of many simple models, it attempts to improve over the performance of a neural network model. This presentation will introduc

 t-Provider-Payor triangle. Claims payment is a complex dynamic process affected by a great number of variables. Exacerbating the system complexity payors continuously tweak cla

mately, new therapies for prevention and cure. It is important to understand how these genes function – either alone or interacting with other genes. RP helped show that a new gene i
ms and approaches. We emphasize the importance of data segmentation and the use of prior probabilities and costs in order to get the top performance out of CART models and com

 al model that uses characteristics of the injured person, the injury and past benefit payments to predict the benefits to be paid over the remaining lifetime of the injured person. This m
ses leading to them however, are still not properly understood. Our aim in this paper is to provide an answer to the following basic questions: how can you tell a potential sovereign de
 ther signaling networks. ChIP-on-chip technology, or chromatin immunoprecipitation followed by DNA microarray analysis, has proven to be an efficient means of mapping TF-promo
e also discuss applications of TreeLM in actuarial work such as identification of time trends in claim amounts data and refinement of rating models.
 dial infarction and the association between regional heart function and coronary calcification.

 re we study the statistical techniques used for that estimation, including restricted cubic splines, local polynomial regression (loess), adaptive linear basis functions, and parametric ch
 rwise of Filariasis disease caused by mosquitoes. It comprises four socio-economic predictor variables viz., AGE, SEX, HABITAT and HOUSING TYPE and a dichotomous target var

al phase of this process relies heavily on in vitro (i.e. cellular systems) and in vivo (i.e. animal) testing. Powerful computer architectures and sophisticated modeling tools like CART®,
estigate both the use of deliberately sparse CART trees as well as larger, more optimized trees to predict small movements in the financial markets. We also will compare CART pred
mplete loan-level mortgage valuation system leveraging detailed customer, loan and property information. This estimation of monthly joint risks of prepayment and default combined w
exist on reserving; therefore each company follows its own approach. Under-provisioning or over-provisioning can significantly influence current and future year P&Ls, which makes a
 e forecasting tool for PM2.5. Exposure to PM2.5 has been linked with adverse health effects, and increasing numbers of areas are initiating forecasting programs that provide warnin
he subsequent 200 patients in 2002 for the validation. There is a marked deterioration in predictive accuracy when the 1999 ANN was used for the 2002 data compared with 1996 dat
 and validated it prospectively in another 292 patients. We developed a single MARS model using the indices of pulse oximetry that has the best predictive properties and an aggrega
may mark long term health risks. In daily practice, umbilical cord length cut-offs for “long” and “short” are used clinically to identify children who may have had abnormal mid-trimester

n (quote sales success profiling). Particular focus is given to the process of building PRIM and CART trees and the comparative predictive performance of each. This is then consider

 based PRIM procedure developed primarily for the analysis of large telecommunications datasets. The calculations are performed by making a series of passes over the data residin
osocial aspects of the workplace, and overall health and pain by questionnaires. Three variables of increasing objectivity were considered as measures of recurrence: pain occurrenc
dicting which of the three categories mentioned above that a household falls into. We also propose improved analytical techniques which help avoid spurious correlations, and help id
henomenal speed, structural unemployment is becoming a major issue in many countries. Coupled with this other economic turmoil, globalisation is pushing many individuals towards
 needle in a haystack). Building upon both of these works, "A CART Based PRIM" is utilized to segment a "High Dimensional Database" into a large number of small, highly concentr

 entally studies how to make accurate predictions. I give two examples where understanding predictive accuracy is crucial to making high-level marketing decisions: deciding whethe
e context of wildlife research and conservation management. Drawing from modeling applications and examples world-wide it is shown how these progressive algorithms can be linke
sion Trees) can contribute to the overall research and management of global biodiversity. From an international, interdisciplinary and collaborative research project that models over 5
ed through a directed graph. It is shown that the ICA method is able to generate more refined independent datasets when associated with a dynamic search tree. We also show that w
 ficial relationship with them through programs the customers consider valuable. These may differ by segment. Likewise, on the other side of the customer spectrum, they want to avo
me direct marketing programs provide a wealth of past experience, with hundreds of predictor variables. However, the wealth of past experience or the hundreds of predictor variables
el and then building a GLM with MARS® output functions used as predictors. The results of this combined model are compared to the results achievable by hand-fitted GLM. Compar
  e researchers used software of the leading commercial vendors. The research found commercially interesting utility in textual information for claim cost prediction, and also identified
   The second study used CART to identify optimal cutoff points for predictors of survival in patients diagnosed with colon and rectal carcinoma in order to categorize patients into low-
ect of a product or service implies greater overall customer satisfaction with that product or service. It is of particular interest to marketing personnel to understand where this relations

petitive inhibitors. The second usage was to discover the structural features of compounds responsible for their inhibitory activities to PTP1B. The derived features are similar to the c
 ng anachronistic variables into the pool of candidate inputs, to subtly inflating results through early up-sampling. You'll hear cautionary tales of endangered projects and embarrassed

group, prepays are very nearly random because the vast majority are the result of natural hazards. The balance, however, break out into clear, targetable segments worthy of further
medications, laboratory measurements, and/or adverse events. Predictive models of this sort can be used to inform clinical development decision making, address regulatory inquirie
 rate) and two control samples. The statistical program Classification and Regression Trees (CART) was applied to determine optimal cutoff scores on the IIEF-5 (range, 5-25 if patien

 nificant challenge for statistical analysis, as typical differential protein expression studies via SELDI generate 100's to 1000's of spectra that must be sifted through to find patterns co
ables were generated to characterize each application, most of which were related to the types of credit applied for in the past, when these applications occurred, and whether the inp
  CART modeling analysis, it can be more productive to conduct variable derivation and selection within a subcategory of attributes with similar context first, e.g. within billing or custom
accessing, and processing data. Business questions that were previously impossible, impractical, or unprofitable to address due to the lack of data or the lack of processing capabilitie
 handle many of these problems. Since MARS models have continuity properties which are more in line with the underlying physics, we expect more accurate prediction and higher in
 . This presentation will introduce Treenet by showing its similarity to an already well-understood statistical technique. It will compare Treenet to neural networks by applying it to insur

 y payors continuously tweak claims processing software. As a result, it is often the case that the initially well-documented policies grow out of sync with the actual payment logic appli

  helped show that a new gene is responsible for some cases of AD and not interacting with one of the well-established genes for the disorder. Alcoholism is a complex disorder with c
 ce out of CART models and comply with the model evaluation criteria. In addition, we illustrate the added benefit of using TreeNet to ultimately improve the gains and ROC performan

 me of the injured person. This model has been developed using a combination of techniques including CART, MARS and GLMs. The model uses a subset of 200 potential predictors
 you tell a potential sovereign defaulter when you see one? What set of economic and political conditions are empirically associated with a (likely) sovereign debt crisis? By employing a
 nt means of mapping TF-promoter interactions. We will describe an integrative computational genomics approach, which involves the application of CART data-mining method, to an

 sis functions, and parametric changepoint models. Piecewise linear models are used to simulate data with a plain, rising slope, and plateau with a known horizontal distance (width) b
 E and a dichotomous target variable describing the presence or absence of the filariasis disease in a person as indicated in a blood investigation. In this study, popular data mining to

 ed modeling tools like CART®, TreeNET™, Generalized Additive Models (GAM) and Support Vector Machines (SVM) combined with a growing understanding of the complex and hig
We also will compare CART predictions with other approaches using sophisticated Artificial Neural Networks, demonstrating the superiority of CART methodology. Finally, we will show
ayment and default combined with detailed cash flow financial modeling provided the foundation for a wide-ranging business transformation program. Find out how a large mortgage
uture year P&Ls, which makes accurate estimation of reserve levels extremely important. A few dimensions need to be considered while estimating reserve levels – timing of expense
ng programs that provide warnings to the public of days for which high levels of these particulates are expected. Many current air quality forecasting algorithms rely on simple regres
02 data compared with 1996 data. We developed a new ANN and found no improvement. We then used multivariate adaptive regression splines (MARS) that improved the predictive per
 ictive properties and an aggregation of 20 MARS models. The aggregated model had greater diagnostic utility than the single MARS model and is currently in clinical use at our med
ave had abnormal mid-trimester fetal movement (abnormally short cords) and potentially mid-trimester fetal CNS impairment, and children whose abnormally long cords may have pre

e of each. This is then considered with regard to the original intention - that of exploratory analysis - and the resulting applicability of each to that purpose.

  of passes over the data residing on disk. We discuss specific challenges and issues we encountered with this approach, and we adapt an existing algorithm for determining the quan
 s of recurrence: pain occurrence, pain with medical visit, and lost time due to low back pain. The goal was to form the “best” prediction model for each of these outcome variables, b
purious correlations, and help identify non-linearities and interactions. More specifically, we find that the use of Tetrad (which fits “causal” diagrams to data), followed by the use of MA
ushing many individuals towards bankruptcy who were considered highly credit worthy only a few years ago. Another problem with bankruptcy is the suddenness of its occurrence ma
 umber of small, highly concentrated (homogenous) nodes contained in branches that span across the space of the database. These branches and nodes are then used as inputs into

ting decisions: deciding whether a firm should (1) customize product offerings and their promotion and (2) reward alleged "best customers" with perks/superior benefits. I give exam
gressive algorithms can be linked with a Geographic Information System (GIS), and how model inference and model predictions can be derived. Conceptual comparisons with other a
earch project that models over 500 species worldwide, I will present selected case studies for several continents and across species and taxa that highlight experiences, advantages a
earch tree. We also show that when treatment is considered, a dynamic search tree generates a reliable connection graph that can be used to allow more detailed study of the effect
 mer spectrum, they want to avoid sending unprofitable “junk mail” to people who will never respond. Using CART, we developed several models to predict customer activity and loyal
  hundreds of predictor variables are most often not homogenous. We have found that by exploiting the homogeneity of data or the homogeneity of experience, more discriminating s
ble by hand-fitted GLM. Comparisons are made in terms of time taken, predictive power, selection predictors and their interactions, interpretability of the model, precision and model f
st prediction, and also identified new risk management factors.
 to categorize patients into low- and high-risk groups. Finally, the third study used CART to derive a clinical decision instrument using patient characteristics available to paramedics t
  understand where this relationship is not linear and especially where the relationship between a potential driver and overall satisfaction changes. We will highlight this through a serie

ived features are similar to the concept of “pharmacophore models” and can be used in the prediction and design of better compounds.
ered projects and embarrassed teams -- but also the keys to avoiding such a fate yourself.

 ble segments worthy of further investigation.

 king, address regulatory inquiries, or to generate publications in support of marketing efforts. The main challenge presented by this type of data is the need to screen a large number
 the IIEF-5 (range, 5-25 if patient engaged in sexual activity; 1-25 if patient had an opportunity but no desire for sexual activity) to distinguish between men with and without ED, and to

 ifted through to find patterns correlating to phenotype or to find individual biomarker candidates. In practice, biomarkers may be expressed as single markers or multiple markers who
s occurred, and whether the input application information matches these previous applications. Implementing most modeling techniques would have been problematic due to high dat
  first, e.g. within billing or customer contact category as opposed to the entire database. This presentation describes the benefits of some techniques for conducting "partial" variable
 he lack of processing capabilities can now be answered using data mining solutions. Modern data mining technologies also offer more accurate and better performing models that ar
ccurate prediction and higher interpretability for MARS than for CART. CART is, on the other hand, the more scalable method, especially when high order interactions are present. Th
 networks by applying it to insurance fraud data and will compare its performance to that of neural networks.

h the actual payment logic applied by software. IntelliClaim employs CART to infer the actual claims payment policy by analyzing the variance between the officially stated policies and

ism is a complex disorder with complex etiologies. It is clear that, for some people, the liability toward alcohol dependence is inherited. We examined a genome screen (~400 genetic
e the gains and ROC performance measures.

ubset of 200 potential predictors to model more than 10 different types of benefit payment. One significant problem is how to use a dataset over a limited time period to project benefi
reign debt crisis? By employing a Binary Recursive Tree approach, we derive a set of “rules of thumb” that help identify the typical characteristics of defaulters. In the process, we iden
ART data-mining method, to analyze ChIP-on-chip experimental data. The estrogen receptor α (ERα) regulates gene expression by either direct binding to estrogen response elemen

own horizontal distance (width) between the plain and plateau and varying amounts of error. Five hundred replicates per combination of simulation conditions are used. Statistical tec
his study, popular data mining tools such as logistic regression of SPSS 10.0 (1999), artificial neural network trained by back-propagation algorithm and classification and regression

standing of the complex and highly nonlinear relationship between molecular structure and biological activity have sustained a paradigm shift away from traditional methods towards
ethodology. Finally, we will show how basic Risk Management strategies and Information Theory can be used to exploit even a minor predictive advantage for profit.
Find out how a large mortgage lender leveraged this unprecedented level of resolution to optimize not only the hedging of risk and the execution of secondary market activities, but a
serve levels – timing of expense recognition, expected proportion of the total points earned to be redeemed over the portfolio lifetime and cost per point. Companies should develop s
 lgorithms rely on simple regression techniques. While such techniques may provide an analytic description for some of the data dependence, they are not likely to give much physica
that improved the predictive performance. We tested the MARS model on the 1996 dataset to determine if the MARS model found similar results as with the 2002 data. We conclude
rrently in clinical use at our medical center.
ormally long cords may have predisposed them to cord compromise during the delivery process. 8464 cases were analyzed. Umbilical cord length was associated with both larger PW

gorithm for determining the quantiles of an attribute without sorting. We discuss potential advantages and drawbacks of the method, and we illustrate the procedure on some large da
 h of these outcome variables, based on over 200 potential predictor variables. Both logistic regression and CART were used for model-building. Results for the two were similar but n
data), followed by the use of MARS (Multivariate Adaptive Regression Splines) yields improved models which are more predictive and rely on fewer variables. The relevance of our results to
uddenness of its occurrence making the prediction extremely difficult. Hence long term credit history, which has been the backbone of successful data mining, is often useless in this
 des are then used as inputs into various predictive models or can be grouped to identify targetable sub-segments. "A CART Based PRIM" has been successfully used to augment pr

 /superior benefits. I give examples using Salford Systems’ software of the two problems with real data from a credit card, software, retail catalog, and educational service companies
 eptual comparisons with other and more traditional algorithms are done as well. The importance of model evaluation is discussed, as well as the use of ‘presence only’ data for such
hlight experiences, advantages and challenges when using these powerful modeling algorithms, e.g. when compared with other and traditional approaches. Following a general and ra
more detailed study of the effect of treatment on HIV dynamics.
edict customer activity and loyalty levels. Starting with hundreds of demographic and account variables in addition to all transactions for the past two years, we first created a large nu
 perience, more discriminating scores can be developed. The presentation shows examples where developing predictive scores on either homogenous subsets of the past experience
 e model, precision and model fit.

 ristics available to paramedics to predict meaningful survival following out-of-hospital cardiac arrest.
 will highlight this through a series of case studies, demonstrating where utilizing MARS to fit piecewise linear regression models to customer satisfaction data results in a deepening u

 need to screen a large number of variables (1000-5000) for main effects and interactions – a task for which random forests is well suited. I will provide a brief introduction to the indu
men with and without ED, and to determine levels of ED severity on the IIEF-5 using the IIEF item on penetration frequency. Results: The optimal cutoff score was 21 with men scorin

markers or multiple markers whose patterns of up- and down-regulation may signal disease susceptibility, onset, progression, therapeutic response, drug efficacy or toxicity. We have
 een problematic due to high data sparseness. CART’s ability to handle missing values and select appropriate surrogates made it the best modeling choice; surrogates were involved
 for conducting "partial" variable derivation and selection before engaging in a CART analysis and illustrates with a real case of customer churn prediction model.
better performing models that are generated in less time than that with previous technologies. This study applies data mining techniques to health insurance practice. Case studies in
 rder interactions are present. These differences are investigated by applying both methods to real data from a thermal plant. Very precise prediction of many of the controlled variable

 the officially stated policies and the actual paid claims. The expected values are determined by modeling the payor’s stated policy using IntelliClaim’s proprietary rule-based claims p

a genome screen (~400 genetic markers) for linkage and association with the disorder. We found several interesting loci across the genome. RP helped us identify a possible mode o

 ed time period to project benefit payments over as long as 30 years. In this presentation we describe the approach taken, including some evaluations of the model’s predictive power
 aulters. In the process, we identify empirically different typologies of debt crises. Not all crises are equal: they differ depending on whether the governments face solvency, liquidity,
ng to estrogen response elements (EREs) or indirect tethering to other TFs on promoter targets. We developed a CART model to classify ERα direct targets from non-target promote

 ditions are used. Statistical techniques to estimate this width are compared to assess bias and efficiency.
 d classification and regression trees (evaluation copy of CART 4.0) from Salford Systems Inc. are employed to perform a classification task and build a predictive model. The databa

om traditional methods towards in silico pre-clinical drug profiling and development. Hybrid non-linear machine learning systems that combine CART®, TreeNET™, GAMs, and SVMs
 condary market activities, but also to significantly strengthen customer relationships through retention and cross-sell programs imple
nt. Companies should develop segmented point provisioning analytical models based on their observed client redemption and earn
 e not likely to give much physical insight. For example, a result of “3 times temperature plus 0.5 times the relative humidity equals P
with the 2002 data. We conclude that (1) the MARS approach may be less susceptible than the ANN to model drift but further confirmation is needed. (

s associated with both larger PW and BW, but with a decrease in FPR; grouping BW by centiles suggested that the effect was confine

 he procedure on some large datasets, including data relating customer profitability to demographic variables.
lts for the two were similar but not identical, in variables selected and in estimated error rate. The use of CART as a complement t
 . The relevance of our results to policy recommendations is discussed.
 mining, is often useless in this case. All of these have created a strong need for advanced/non-standard models for predicting personal bankrup
successfully used to augment predefined market zones for several well-known retailers and travel agencies, boost response rates for direct mail campaigns and

 educational service companies, and a not-for-profit organization. This talk ultimately argues that there are many add
of ‘presence only’ data for such modeling projects, e.g. based on Natural History Collections (NHC), field survey and telemetry data. An outlook
ches. Following a general and rapid assessment approach, MARS and CART and their command languages allow for a convenient and fast modeli

ears, we first created a large number of derived predictor variables. This time-consuming but critical step aims at const
s subsets of the past experience or on homogenous subsets of the data greatly improve the discriminatory power of the predictive score. This improv

 on data results in a deepening understanding of the drivers of customer satisfaction and ultimately, lo

e a brief introduction to the industrial context and rationale for using random forests including data set properties, re
 ff score was 21 with men scoring less than or equal to 21 classified as having ED and those scoring above 21 as not havi

 ug efficacy or toxicity. We have incorporated CART analysis to simplify analysis of these complex data sets, identify p
hoice; surrogates were involved in 60% of score calculations. The resulting score was able to capture over 50% of frauds by reviewing 5%

rance practice. Case studies in long-term care and HMO pricing and underwriting using CART and MARS are provided.
f many of the controlled variables in the complex industrial plant turns out to be possible, using limited physical knowledge. The result

 proprietary rule-based claims processing technology. CART provides decision trees to uncover and quantify the root causes of

ed us identify a possible mode of inheritance for these new loci. Substance dependence is not a unitary phenomenon. We identified syndro

of the model’s predictive power. The model has not yet been rolled out into Scheme operations so we do not discuss its uses in great detai
ments face solvency, liquidity, or macroeconomic risks. This classification is crucial for discussing appropriate policy
argets from non-target promoters using previously known experimental data. We found that the model classified 67% of the gene loci ident

 a predictive model. The database is characterized by a highly disproportionate class distribution, in that, the positive cases for

 TreeNET™, GAMs, and SVMs models of the molecular structure-biological activity relationship can accurately simulate specific in vitro and in
                                              JOURNAL ARTICLES    PRESENTATIONS   TOTALS
Agriculture/Soils/Geology                              7                 1            8
Animal Science/Veterinary                             12                 0           12
Automotive/Transportation                              3                 0            3
Clinical Healthcare                                   23                 9           32
Computer Science                                      41                15           56
Economics/Finance/Fraud Detection/Insurance            9                18           27
Education/Journalism                                   4                 0            4
Environment/Conservation                              37                 5           42
Epidemiology                                          18                 3           21
Food Science                                           3                 0            3
Life Science/Drug Discovery                           74                12           86
Manufacturing/Industrial                               7                 3           10
Marketing/CRM                                          1                 8            9

Totals                                               239                74          313
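
The Totals row can be used to cross-check any single category row in the table above. The following is a minimal Python sketch, not part of the original catalog: it subtracts the twelve other rows from the printed totals to obtain the Economics/Finance/Fraud Detection/Insurance counts. The names `known_rows`, `printed_totals` and `recover_row` are hypothetical and exist only for this illustration.

```python
# Counts as printed in the summary table:
# (journal articles, presentations, totals) per category.
# The Economics/Finance/Fraud Detection/Insurance row is deliberately left out
# and recovered from the printed Totals row.
known_rows = {
    "Agriculture/Soils/Geology": (7, 1, 8),
    "Animal Science/Veterinary": (12, 0, 12),
    "Automotive/Transportation": (3, 0, 3),
    "Clinical Healthcare": (23, 9, 32),
    "Computer Science": (41, 15, 56),
    "Education/Journalism": (4, 0, 4),
    "Environment/Conservation": (37, 5, 42),
    "Epidemiology": (18, 3, 21),
    "Food Science": (3, 0, 3),
    "Life Science/Drug Discovery": (74, 12, 86),
    "Manufacturing/Industrial": (7, 3, 10),
    "Marketing/CRM": (1, 8, 9),
}
printed_totals = (239, 74, 313)


def recover_row(rows, totals):
    """Return the single missing row as totals minus the column sums of the known rows."""
    column_sums = [sum(column) for column in zip(*rows.values())]
    return tuple(total - partial for total, partial in zip(totals, column_sums))


economics = recover_row(known_rows, printed_totals)
print(economics)
```

The recovered row is internally consistent only if its first two entries add up to the third, which is a quick sanity check on the printed totals.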