Data Mining a Diabetic Data Warehouse: Younger Age is Key Predictor for Bad HgbA1c Values
Joseph L. Breault, MD, ScD, MS, MPHTM, FAAFP
Associate Program Director and Research Director, Ochsner Clinic Foundation Family Practice Residency
5701 Deer Park Blvd., New Orleans, LA, 70127; firstname.lastname@example.org
Additional authors: Colin Goodall, PhD; Peter Fos, DDS, PhD
From the moment of birth to the signing of the death certificate, data are collected at almost every contact of
each individual with providers of healthcare in the United States (and many other countries). These data
include administrative, demographic, health status, clinical, pharmaceutical use, and financial details.
Increasingly, data are abstracted from written records, or entered directly at a workstation, into an extensive
health information system (Goodall 1999).
Diabetes has a history of registries and databases in which patient information is systematically collected.
Most of these are simply collections of clinic, hospital, and insurance databases built for other purposes, especially billing.
In this paper we examine one such diabetic data warehouse, showing a method of applying data mining techniques,
some of the data issues, analysis problems, and results.
2. Understanding the medical problem domain
The population of diabetic patients is important in healthcare for a number of reasons. It is large: in 1998, 15.7
million people, or 5.9% of the US population, had diabetes, including 8.2% of those 20 and older and 18.4% of
those 65 or older. The number is increasing rapidly, as seen in figure 1. The economic impact is impressive: the
1997 estimate was direct medical costs of $44 billion and indirect costs of $54 billion, for an annual economic
cost of about $100 billion (CDC 1998). One in every seven health care dollars, and 25% of the Medicare budget,
are spent on diabetic patients (Blonde 2001).
Death certificates indicate that diabetes contributes
to 193,000 deaths annually, but this is vastly under-reported.
Diabetes is the leading cause of new cases of blindness in
adults aged 20-74 (as many as 24,000 become blind
annually from diabetes), of end-stage kidney disease
(33,000 diabetics start dialysis annually), and of leg
amputations not related to injury (86,000 annually).
Diabetic patients are 2-4 times more likely to have a heart
attack or stroke than a non-diabetic patient. Close to two-
thirds of diabetic patients also have hypertension. The social
effects are massive with disability among diabetics 2-3 times
higher than non-diabetics (Songer 1995), congenital
malformation rates up to 10% in the 18,000 deliveries
annually to women with preexisting diabetes if
preconception care is not provided, and deaths of newborns
at rates 2-3 times higher than average for pregnancies among women with diabetes (CDC 2000).
Figure 1 The rapid growth of diabetes in the US
Early detection and proper treatment of diabetes can prevent up to 90% of blindness, and at least 50% of
dialysis and amputations (CDC 2000). Because of the economic and clinical impact of the disease, there has been a
great deal of energy put into guidelines, best practices, optimization of care, and other management methods to
improve outcomes and save money. In short, all the low-hanging fruit has already been picked. If data mining can be
successful in diabetes in finding novel associations that can be modeled to improve outcomes and save money, it can
easily do so in other medical areas that have not been so intensely studied.
The South has the highest incidence of diabetes, 35.3 per 1000, compared with other regions, which range from 24.9 to 26.5
(Adams, Hendershot et al. 1999, Table 61). At least 365,000 or 8.4% of Louisiana residents 20 years and older have
diabetes. In Health Care State Rankings for 2000, Louisiana was 49th in health indicators, and 1st in diabetes death
rate at 38.7 per 100,000 (Hood 2001). The diabetes death rate is 68.9 for the City of New Orleans (LADHH 2000,
STFM 2002 Submission/Presenter: J. Breault/Title: Data Mining a Diabetic Data Warehouse / Page 1 of 4
Table 26.1). With this high prevalence and mortality, New Orleans may be the ideal location in the country in which
to choose a diabetic data warehouse to investigate.
3. Understanding the data
The diabetic data warehouse belongs to a large integrated health care system in the New Orleans area that
combines 19,000 admissions annually to a 442-bed tertiary care hospital, a 450-physician multi-specialty clinic, a
health plan with 160,000 covered lives, a graduate medical education division with 200+ residents/fellows in 25
residency/fellowship programs, and an active research division that includes an outcomes research group. The diabetic
data warehouse as of August 2001 had data from 1/1/1998 to 6/30/2001 on 30,383 diabetic patients.
The following diagram outlines the data warehouse structure of the Oracle database along with the number of
rows per table:
Group        Table                  Rows
Clinic       Clinic                 2.6M
Hospital     Hospital               52K
             Hospital Dx            198K
             Hospital Procedures    62K
Laboratory   Laboratory             2.0M
             Laboratory Results     6.9M
Medications  Medications            1.1M
             Medication Categories
             Medication Classes
The database is set up so that the administrative group links the other 4 groups (clinic, hospital, laboratory,
and medication groups). This paper will be limited to aspects of the administrative and clinic databases, and one
4. Preparation of the data
This is the most important aspect of the process, since the value of later data mining or analysis depends on
the intelligence with which the data is prepared. In the final flat file used for input into data mining software, each row
will correspond to one patient. We exclude those who do not have continuity throughout the year, and define this as
having at least two clinic visits in the time period. Any one given patient may have many office visits, but these must
all be placed on the same row. Therefore, the variables must be transformed into summary data that contains the most
useful information. This avoids having 100 columns for clinic visit 1 through 100 with most rows just being populated
in the first few columns. Since each visit has more than a dozen variables associated with it, this actually would mean
having >1000 sparsely populated columns for office visits unless summary transformations are used.
Some of the transformed variables used are the number of ER visits and office visits, and the ratio of these
(ER/OV). The specific dates within the time period of interest when the ER, office visits, or other types of visits
occurred are not very useful in themselves. There is sequencing and time-series information that may be valuable here, but
data mining software will not be able to utilize this without summary transformations.
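The summary transformation described above can be sketched as follows. This is only an illustration under stated assumptions, not the SQL actually used against the warehouse: the record layout (patient_id, visit_type), the visit type labels "ER" and "OV", and the function name are hypothetical, and continuity is taken to mean at least two office visits.

```python
from collections import defaultdict

# Hypothetical visit records: (patient_id, visit_type), where visit_type is
# "ER" or "OV" (office visit). Field names and values are illustrative only.
visits = [
    ("p1", "OV"), ("p1", "OV"), ("p1", "ER"),
    ("p2", "OV"),
    ("p3", "OV"), ("p3", "OV"), ("p3", "OV"), ("p3", "ER"), ("p3", "ER"),
]

def summarize_visits(visits, min_office_visits=2):
    """Collapse many visit rows into one summary row per patient."""
    counts = defaultdict(lambda: {"er": 0, "ov": 0})
    for patient_id, visit_type in visits:
        key = "er" if visit_type == "ER" else "ov"
        counts[patient_id][key] += 1

    rows = {}
    for patient_id, c in counts.items():
        # Continuity filter (assumption): at least two office visits.
        if c["ov"] < min_office_visits:
            continue
        rows[patient_id] = {
            "er_visits": c["er"],
            "office_visits": c["ov"],
            "er_per_ov": c["er"] / c["ov"],  # safe: office_visits >= 2 here
        }
    return rows

flat = summarize_visits(visits)
```

Each patient ends up with three columns regardless of how many visits were recorded, and the patient with a single office visit is dropped by the continuity filter.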
There are up to 4 diagnosis codes for each visit—a rich source of information, but these must be transformed
to be useful in data mining. Each patient may have 1 to >100 clinic visits, and each of these can have up to 4
diagnoses. To capture some of this information, we constructed a comorbidity index by counting the number of
major body systems that have been listed as a diagnosis code. We divide all the codes into the 17 categories of the
ICD9 and assign each patient a value from 1 through 17, the number of categories into which their diagnoses
fall, as a rough comorbidity index. For example, if someone had codes across their visits that fell into 5 of these
groups, then their comorbidity index would be 5.
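A minimal sketch of such an index is given below. It assumes that the 17 categories correspond to the 17 numeric chapters of ICD-9-CM (001-139 through 800-999) and that supplementary V and E codes are ignored; the paper does not state how those were handled, and the function names are hypothetical.

```python
# The 17 numeric chapters of ICD-9-CM as inclusive code ranges
# (assumed to be the "17 categories" referred to in the text).
ICD9_CHAPTERS = [
    (1, 139), (140, 239), (240, 279), (280, 289), (290, 319), (320, 389),
    (390, 459), (460, 519), (520, 579), (580, 629), (630, 679), (680, 709),
    (710, 739), (740, 759), (760, 779), (780, 799), (800, 999),
]

def icd9_chapter(code):
    """Return the 0-based chapter index for an ICD-9 code, or None."""
    head = code.strip().split(".")[0]
    if not head.isdigit():
        return None  # V and E codes are skipped (assumption)
    number = int(head)
    for index, (low, high) in enumerate(ICD9_CHAPTERS):
        if low <= number <= high:
            return index
    return None

def comorbidity_index(codes):
    """Count the distinct ICD-9 chapters touched by a patient's codes."""
    chapters = {icd9_chapter(code) for code in codes}
    chapters.discard(None)
    return len(chapters)

# All diagnosis codes from all of one patient's visits, pooled together.
example_codes = ["250.00", "401.9", "272.4", "250.60", "V58.67"]
example_index = comorbidity_index(example_codes)
```

In this example the diabetes and lipid codes fall in the same chapter (240-279) and the hypertension code falls in the circulatory chapter, so the index is 2 even though five codes were recorded.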
In addition, categorical variables were constructed for whether a patient had ever had a diagnosis of a lipid
disorder, hypertension, CAD/PVD, retinopathy, or ESRD/CRF/CRI. The fraction of diagnostic codes for all visits that
pertain to diabetes was also extracted to give a rough estimate of what percent of care is being devoted to diabetes.
The summary measure of the average Hemoglobin A1c was calculated as an outcome variable. In addition to the actual
average, two categorical variables are developed. One is a 4-level ordinal variable (0, 1, 2, 3) using categories of <=7,
>7 & <8, >=8 & <=9.5, >9.5. The other is a 2-level dichotomous variable (0, 1) using a cut-point of 9.5.
Handling time series medical data is challenging for data mining software. One example in our study is the HgbA1c
value, the key measure of glycemic control that should be measured every 3-6 months in all diabetics. This is closely related to
clinical outcomes and complication rates in diabetes. There is a marked increase in health care costs with each 1%
increase in baseline HgbA1c; patients with an HgbA1c of 10% vs. 6% had a 36% increase in 3-year medical costs
(Blonde 2001). How should this time series variable be transformed from the relational database to a vector
(column) in the flat file presented to data mining software? There may be many of these results for a given diabetic
patient. We could pick the last one, the first one, or one from the middle. We could take an average. All of these
methods lose some information. Since the trend over time for this variable is important, we could choose the slope
of its regression line over time. However, a straight line may be a good representation for some patients but a very
bad one for others who may, for example, be better represented by an upside-down U curve. We could try to include it
all with many columns, one for each HgbA1c value, with associated columns indicating the time. However, since each
patient has a varying number of such events (from 0 to many dozens), this would leave columns with many missing
values, giving an inadequate representation of the data and making it unsuitable for data mining programs. This
difficulty arises for most repeated laboratory tests. In this study we choose to use the average HgbA1c of all the
results for a given patient, and exclude patients who do not have at least 2 HgbA1c results. We also repartition this
average HgbA1c into a 4-level and a 2-level categorical variable based on meaningful clinical cut points.
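The choices discussed above can be illustrated with a short sketch. The average and the two categorical variables follow the cut points given in the text. The least-squares slope is included only to show the alternative that was considered and not used. The input layout (pairs of months since first test and HgbA1c value) and all function names are assumptions made for this example.

```python
from statistics import mean

def a1c_levels(avg):
    """Map an average HgbA1c (%) to the 4-level and 2-level variables."""
    if avg <= 7:
        four = 0
    elif avg < 8:
        four = 1
    elif avg <= 9.5:
        four = 2
    else:
        four = 3
    two = 1 if avg > 9.5 else 0
    return four, two

def slope(times, values):
    """Least-squares slope of values against times (0.0 if times are equal)."""
    t_bar, v_bar = mean(times), mean(values)
    sxx = sum((t - t_bar) ** 2 for t in times)
    if sxx == 0:
        return 0.0
    sxy = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    return sxy / sxx

def summarize_a1c(results):
    """results: list of (months_since_first_test, HgbA1c) pairs."""
    if len(results) < 2:
        return None  # excluded: fewer than 2 HgbA1c results
    times, values = zip(*results)
    avg = mean(values)
    four, two = a1c_levels(avg)
    return {"avg": avg, "level4": four, "bad": two,
            "slope": slope(times, values)}

# Two hypothetical patients: one steadily rising, one upside-down U.
rising = summarize_a1c([(0, 7.0), (6, 8.0), (12, 9.0)])
inverted_u = summarize_a1c([(0, 7.0), (6, 9.0), (12, 7.0)])
```

The upside-down U patient has a fitted slope of zero, which hides a mid-period peak of 9.0; this is the kind of information loss that makes a single slope a poor summary for some patients.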
Our final flat file for export to the data mining software had 15,902 patients (rows) and 19 variables per
patient (columns). All of these patients had at least 2 HgbA1c tests and at least 2 office visits, the criteria we used for
5. Data mining
There are many data mining methods. Here we use the classification tree approach as standardized in the
CART 4.0 software from Salford Systems. CART gives the following tree when the target variable is HgbA1c > 9.5
(0, 1), and the defaults are used except for testing via a random variable to separate learn and test samples:
Unexpectedly, the most important variable associated with a bad HgbA1c score is younger age, not the
comorbidity index or whether patients have related diseases. If we want to target diabetics with bad HgbA1c values,
the odds of finding them are 3.2 times as high in those younger than 65.6 years as in those older. The importance
scores for classifying a bad HgbA1c were age (100), number of office visits (51), comorbidity index (44), PVD/CAD (16),
dyslipidemia (15), and number of ER visits (7). The classification errors in the learning and test samples are listed in the next table:
              Learning Sample                     Test Sample
Class   N cases   N Misclassed   % Error   N cases   N Misclassed   % Error
0          1052            262     24.90      1060            303
1          6901           2873     41.63      6890           2919     42.37
From this CART analysis, there is a clinically surprising result. The most important variable associated with a
bad HgbA1c score is younger age, not the comorbidity index or whether patients have related diseases. If we want to
target diabetics with bad HgbA1c values, the odds of finding them are 3.2 times as high in those younger than 65.6
years as in those older. If we look at the tree to find terminal nodes where the percentage of patients with a bad
HgbA1c is even greater than the 19.4% found among those < 65.6 years of age, we identify the following in the learning
sample:
Node #   % w/bad   N=     Rules
1        23.8      2185   age<=55.231
4        30.8      26     (age<=57.781) and (comorbidity>6.5) and (ERc>3.5) and (Lipid=1)
6        33.3      39     (age>55.231) and (comorbidity>6.5) and (ERc>3.5) and (lipid=0)
8        22.5      102    (65.581<age<=72.788) and (OV<=47.5) and (ERc<=4.5) and (lipid=0) and
Total    29.6      2352
We can use this information to identify the highest risk groups for having a bad HgbA1c, though the vast
majority are simply an even younger group of diabetic patients who are less than 55.2 years of age.
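As a rough illustration of how these rules could be applied to flag patients, the sketch below encodes the three terminal nodes whose rules are fully listed above (nodes 1, 4, and 6). Several points are assumptions: that ERc denotes the count of ER visits, that lipid is the 0/1 lipid disorder indicator, and that node 1 is checked first because its printed rule overlaps the printed rule for node 4. The function name is hypothetical and this is not output from the CART software.

```python
def high_risk_node(age, comorbidity, er_visits, lipid):
    """Return the terminal node (1, 4, or 6) a patient falls into, else None.

    Thresholds are copied from the node table; node 8 is omitted because
    its rule is incomplete in the table.
    """
    if age <= 55.231:
        return 1
    # From here on age > 55.231, as required by the rule for node 6.
    if comorbidity > 6.5 and er_visits > 3.5:
        if age <= 57.781 and lipid == 1:
            return 4
        if lipid == 0:
            return 6
    return None

# Hypothetical patients: (age, comorbidity index, ER visits, lipid flag).
patients = {
    "a": (50.0, 2, 0, 0),
    "b": (56.0, 7, 4, 1),
    "c": (60.0, 7, 4, 0),
    "d": (70.0, 3, 0, 0),
}
flags = {pid: high_risk_node(*values) for pid, values in patients.items()}
```

Because node 1 depends on age alone, almost all flagged patients in a real extract would be caught by the first test, which matches the observation that the highest risk groups are mostly the youngest patients.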
Data mining can discover novel associations that prove useful to clinicians and administrators, as noted above.
However, a tight integration of domain expertise and data mining skills is important, and issues that need further
work to fully utilize data mining in healthcare include time series data issues, sequencing information, and
data squashing technologies.
Acknowledgements: Leonard Medal provided SQL statements to extract some of the complicated variables. The
CART software was funded by a grant from GlaxoSmithKline Pharmaceuticals. The Institutional Review Board at the
institution that owns the diabetic data warehouse approved this study.
Adams, P. F., G. E. Hendershot, et al. (1999). Current estimates from the National Health Interview Survey, 1996.
Hyattsville, Md., U.S. Dept. of Health and Human Services Centers for Disease Control and Prevention
National Center for Health Statistics.
Blonde, L. (2001). “Epidemiology, costs, consequences, and pathophysiology of type 2 diabetes: An American
epidemic.” The Ochsner Journal 3(3): 126-131.
CDC (1998). National Diabetes Fact Sheet, http://www.cdc.gov/diabetes/pubs/facts98.htm.
CDC (2000). Diabetes: A serious public health problem, http://www.cdc.gov/diabetes/pubs/glance.htm.
Goodall, C. R. (1999). “Data Mining of Massive Datasets in Healthcare.” Journal of Computational and Graphical
Statistics 8(3): 620-634.
Hood, D. (2001). 2001 Louisiana Health Report Card. Baton Rouge, LA, State of Louisiana, Department of Health
LADHH (2000). 1999 Data Tables, Louisiana State Center for Health Statistics, Department of Health and Hospitals,
Office of Public Health: http://www.dhh.state.la.us/OPH/statctr/1Tables/1999/Parish/t26_99i.xls.
Songer, T. J. (1995). Disability in Diabetes. Diabetes in America, 2nd edition. National Diabetes Data Group (U.S.),
National Institute of Diabetes and Digestive and Kidney Diseases (U.S.) and National Institutes of Health
(U.S.). Bethesda, Md., National Institutes of Health National Institute of Diabetes and Digestive and Kidney