Extracting Coded Information from Large Databases by vps11289

VIEWS: 0 PAGES: 113

									   Extracting Coded
Information from Large
      Databases
     Patricia B. Cerrito
      John C. Cerrito
   pcerrito@louisville.edu
        Course Organization
 Section 1. Introduction
 Section 2. Inclusion/Exclusion Criteria
 Section 3. Control Sample
 Section 4. Patient Severity and Co-
  Morbidities
 Section 5. Auditing the Data
 Section 6. Use in Regression Models
 Section 7. Summary
 Section 1
Introduction
                    Objective
 To show how patient diagnoses and
 procedure codes can be extracted from
 large databases so that the information
 can be used in subsequent analyses
     Use of inclusion/exclusion criteria
     Definition of patient severity
               Discussion
 Suppose   we want to look at patient
  outcomes for those patients undergoing
  bypass surgery.
 What kind of inclusion/exclusion criteria
  might we want to use?
 What kind of outcomes do we want to
  investigate?


                                        continued...
              Discussion
 Do  we want to restrict our attention to
  specific DRGs, or look at ICD9 codes?
 Do we want to look at primary and
  secondary diagnoses, or just primary?
 What co-morbidities might we want to look
  at for these patients?




                                      continued...
              Discussion
 How  do we know that the information we
  extract is correct?
 What kind of auditing of the data might we
  want to do?




                                       continued...
               Discussion
 What   is the best way to deal with the issue
  of confounding in extraction of
  information?
 How can we deal with the lack of
  uniformity in the data and the potential for
  bias in the results?
      Use of Statistical Software
 There are many filter-and-query
 requirements to extract coded information
     We will provide SAS code to use to extract
      information (SAS Institute, Inc; Cary, NC)
     We will demonstrate some techniques of data
      extraction to investigate patient outcomes
             Example Databases
   All healthcare databases have three general
    types
       Patient diagnosis and information codes are listed in
        multiple columns with one observation per patient and
        no longitudinal information (National Inpatient
        Sample)
       Patient diagnosis information codes are listed in
        multiple rows with longitudinal information for each
        patient (Medical Expenditure Panel Survey)
       There are combinations of CPT, ICD9, and HCPCS
        codes identifying patient diagnoses and procedures
        (claims data)

                                                      continued...
    National Inpatient Sample
 The  National Inpatient Sample contains a
  stratified sample from 1000 hospitals of all
  inpatient events across 37 states
 Approximately 8 million inpatient events
  per year
 Information is available at www.hcup-
  us.ahrq.gov/home.jsp.
 For students, one year of data is available
  for $20, $200 for everyone else
Medical Expenditure Panel Survey
   Contains information from a cohort of
    approximately 30,000 individuals (11,000
    households)
   Contains information on any encounter with a
    healthcare provider
   Contains complete reimbursement information
   Contains incomplete diagnosis and procedure
    information
   Contains information for each year starting in
    1996
   Available at www.meps.ahrq.gov/mepsweb for
    download
                Claims Data
 Usually   has two column pairs
     Column 1 defines the type of code
     Column 2 gives the code
 Canstill have multiple codes per
 observation
     Multiple Columns of Codes
 Wewill start with the National Inpatient
 Sample with 15 columns of diagnosis
 codes and 15 columns of procedure codes
    All in ICD9 format
 Wewant to extract all asthma patients and
 a matched sample of non-asthma patients
               Example 1
 Modrall JG. Rosero EB. Smith ST. Arko
  FR 3rd. Valentine RJ. Clagett GP. Timaran
  CH.
 Operative mortality for renal artery bypass
  in the United States: Results from the
  National Inpatient Sample.
 Journal of Vascular Surgery. 48(2):317-
  322, 2008

                                       continued...
              Example 1
“Renal artery bypasses were identified using
a combination of ICD-9-CM procedure and
diagnosis codes. The ICD-9-CM procedure
codes for aortorenal bypass (39.24) and
“other abdominal bypass” (39.26) were
merged with the diagnosis codes for renal
atherosclerosis (440.1) and fibromuscular
dysplasia (447.3) to identify patients
undergoing RABG for these diagnoses.”

                                       continued...
              Example 1
“The database lists 29 comorbidities defined
by the HCUP based on the presence of
secondary diagnoses after excluding
postoperative complications and diagnoses
directly related to the primary diagnosis.”
              Example 2
 Phillips B. Clark DE. Nathens AB. Shiloach
  M. Freel AC. Comparison of injury patient
  information from hospitals with records in
  both the national trauma data bank and
  the nationwide inpatient sample.
 Journal of Trauma-Injury Infection &
  Critical Care.
 64(3):768-79; discussion 779-80, 2008


                                      continued...
                    Example 2
“The  30 AHRQ comorbidity measures were used in this
analysis to compare the number of comorbidities between
the two databases and are described in greater detail
elsewhere.
The NTDB diagnosis fields capture an unlimited number
of diagnoses in either ICD-9-CM or AIS coding but do not
designate a primary diagnosis.
Depending on a hospital's submission file format,
complication and comorbidity information can be
embedded as ICD-9-CM codes in the diagnosis field;
submitted as separate fields specific to complication and
comorbidities using the NTDB predefined set of 25
complications and 46 comorbidities; or not submitted at all.”

                                                      continued...
               Example 2
“We then identified all records in both
databases with an injury diagnosis as
defined as ICD-9 codes 800.00 to 959.9
excluding 905 to 909 (late effects of injury),
910 to 924 (blisters, contusions, abrasion,
and insect bites), and 930 to 939 (foreign
bodies). In the NIS only patients with a
primary diagnosis of trauma as determined
by the ICD-9-CM code in variable DX1 were
included.”
              Example 3
 PatilCG. Lad SP. Santarelli J. Boakye M.
 National inpatient complications and
  outcomes after surgery for spinal
  metastasis from 1993-2002.
 Cancer. 110(3):625-30, 2007




                                      continued...
               Example 3
“To identify cases of acute pancreatitis we
queried the NIS database in order to recover
hospital data on all admissions with a
primary ICD-9-CM diagnosis code of 577.0
(acute pancreatitis). All hospitalizations in
which acute pancreatitis was the primary
diagnosis were included in our analysis.”
  Translate Diagnosis to Codes
 http://en.wikipedia.org/wiki/List_of_ICD-
 9_codes_460-
 519:_Diseases_of_the_respiratory_system




                                         continued...
Translate Diagnosis to Code
            Alternate Site
 http://icd9cm.chrisendres.com




                                  continued...
Alternate Site
                ICD9 Codes
 Codes are very complex, and great care must be
  taken to ensure that all of the codes needed are
  used, and no extraneous codes are used
 If the codes are not extracted correctly, then the
  results and relationships discovered can be
  incorrect
 There is also the problem of assuming that the
  ICD9 codes are entered uniformly
 If the ICD9 codes are entered incorrectly, it can
  also bias the results
         Section 2
Inclusion/Exclusion Criteria
                       Question
   Should the extraction be for a primary or
    secondary diagnosis?
       If primary, only have to examine 1 column of codes
       If secondary, need to examine 15 columns of codes
 Always need to be sure that you have extracted
  all codes related to asthma and NO extraneous
  codes.
 What if providers list a primary code in a
  secondary column?
  Primary Diagnosis of Asthma
 All codes 493.xx include some diagnosis
  of asthma.
 We can list all possible codes for asthma,
  or we can reduce the code in the first
  diagnosis column to just 3 digits.
        Possible Asthma Codes
 493
   493.0
   493.01

   493.02

   493.1

   493.11

   493.12

  and so on
     SAS Code for 1 Column
Data nis.extract_asthma_part1;
Set nis.nisyear2005;
If (dx1=„493‟ or dx1=„4930‟ or dx1=„49301‟
   or dx1=„49302‟ or dx1=„4931‟ or
   dx1=„49311‟ or dx1=„49312‟….) then
   asthma=0;
Else asthma=1;
Run;

                                       continued...
     SAS Code for 1 Column
Data nis.extract_asthma;
Set nis.extract_asthma_part1;
Where asthma=1;
Run;
         Alternate SAS Code
Data nis.extract_asthma;
Set nis.extract_asthma_part1;
Dx_3digits=Substr(dx1,1,3);
If (dx_3digits=„493‟) then asthma=1;
Else asthma=0;
Run;
Data nis.extract_asthma;
Set nis.extract_asthma_part1;
Where asthma=1;
Run;
     Generalize to 15 Columns
Data nis.extract_asthma;
Set nis.extract_asthma_part1;
Dx_3digits1=Substr(dx1,1,3);
Dx_3digits2=Substr(dx2,1,3);
…
Dx_3digits15=Substr(dx15,1,3);
Diagnoses_3=catx(„ „,
    dx_3digits1,dx_3digits2,…,dx_3digits15);
If (rxmatch(„493‟,diagnoses_3)>0) then asthma=1;
Else asthma=0;
Run;
       Multiple Rows of Data
 Ifthere are multiple rows of data
  containing patient condition codes (as in
  claims data or in the MEPS), it is first
  necessary to convert a one-to-many
  relationship to a one-to-one relationship.
      MEPS Datasets for 2005
Dataset Name   Description
HC-097         Full year consolidated file
HC-096         Medical conditions file
HC-095         Person round plan public use file
HC-094I        Appendix to MEPS 2005 event files
HC-094H        Home health file
HC-094G        Office-based medical provider visits file
HC-094F        Outpatient visits file
HC-094E        Emergency room visits file
HC-094D        Hospital inpatient stays file
HC-094C        Other medical expenses
HC-094B        Dental visits
                                                 continued...
      MEPS Datasets for 2005
HC-094A     Prescribed medicines file
HC-091      Jobs file
HC-090      Full year population characteristics
HC-084      Population characteristics
HC-036BRR   Replicates for calculating variances
HC-036      MEPS 1996-2005 pooled estimation linkage file
         Medicare, Part D
 The2006 MEPS datasets, available
 October, 2008, contain information on
 payments by Medicare, Part D that can be
 compared to earlier years to see if the
 benefit is increasing or decreasing the use
 of medications
Transposing Patient Conditions
libname meps 'c:\Meps';
proc Transpose data=meps.h78codes
           out=work.tran (drop=_name_
   _label_)
             prefix=med_;
    var icd9codx ;
    by dupersid;
run;

                                        continued...
Transposing Patient Conditions
data work.concat( keep= dupersid icd9codx ) ;
  length icd9codx $32767 ;
  set work.tran ;
  array chconcat {*} med_: ;
  icd9codx = left( trim( med_1 )) ;
  do i = 2 to dim( chconcat ) ;
      icd9codx = left(trim(icd9codx)) || ' ' || left(trim(
  chconcat[i] )) ;
  end ;
run ;


                                                   continued...
Transposing Patient Conditions
proc sql ;
  select max( length( icd9codx )) into
  :icd9codx_LEN from work.concat ;
quit ;
%put icd9codx_LEN=&icd9codx_LEN ;
data meps.icd9codes ;
  length icd9codx $ &icd9codx_LEN ;
  set work.concat ;
run ;
                 MEPS
 Once   the patient conditions have been
  concatenated into one text string, the
  RXMATCH statement can be used to
  extract patients with inclusion/exclusion
  criteria
 Because the MEPS only contains 3 digits
  of the 5-digit ICD9 code, only
  approximately conditions can be used
  Section 3
Control Sample
              Discussion
 Why  is a matched cohort important in
  observational studies?
 What is the purpose of defining a matched
  cohort; ie, how will it be used in the
  analysis?
      Creating a Matched Cohort
       we need to have some information
 First,
  concerning the asthma patients:
     Age, gender, and race distribution
     Sample size
 Then,   we get a corresponding sample
  from the non-asthma patients using
  stratification
 We verify by comparing asthma to non-
  asthma groups
 Problem with Matched Cohorts
 By  using variables to define a match, we
  may be using a confounding factor that will
  bias the outcomes
 There are many examples of such
  confounders in observational studies that
  were initially ignored but then found to be
  of importance
Asthma Versus Non-Asthma




                      continued...
Asthma Versus Non-Asthma




                      continued...
Asthma Versus Non-Asthma
   Creating a Matched Cohort
Data nis.extract_non_asthma;
Set nis.extract_asthma_part1;
Where asthma=0;
Run;




                                continued...
Code to Match on One Variable
Data nis.matchcontrol;
Set nis.extract_asthma_part1;
If asthma=1 then do;
Setid+1;
Caco=1;
Age2=age;
Output;
Caco=0;
Index=_n_;
Do while (abs(age2-age)<=1 and index>1);
Index=index-1;
Set studydatapoint=index;
If _error_=1 then stop;
              Complete Code
 Available at
 www2.sas.com/proceedings/sugi30/152-
 30.pdf
Written by Robert Matthews and Ilene Brill
University of Alabama at Birmingham
         How Many Variables?
 With  a large dataset, we can use many
  different variables to define a matched
  cohort.
 We can also examine some of the other
  co-morbidities to ensure a matched
  sample
  Purpose of Matched Cohorts
 To use in a linear regression model
 Assume that the variables are
  independent
     Need similar demographic ranges in a
      matched sample
     Do not need a one-to-one relationship
      because of the assumption of independence
           Regression

Factor 2    A    B      C    D
Factor 1
1           1A   1B     1C   1D
2           2A   2B     2C   2D
3           3A   3B     3C   3D
4           4A   4B     4C   4D
                Example 4
 Morten   Helms, Pernille Vastrup, Kare
  Molbak
 Short and long term mortality associated
  with foodborne bacterial gastrointestinal
  infections: registry based study
 BMJ, 326(7385);357




                                           continued...
               Example 4
 Used   age, sex, and county of residence to
  match cohorts
 Used 10:1 ratio of controls to group with
  infections
 3636 Controls
 Used Cox Regression to model
 Doesn‟t explain the reason for 10:1
              Example 5
 William D Leslie, Shelley Derkson, Colleen
  Metge, Lisa M Lix, Elizabeth A Salamon,
  Pauline Wood Steiman, Leslie L Roos
 Fracture risk among First Nations people:
  a retrospective matched cohort study
 CMAJ, 2004.171(8):869-873




                                       continued...
               Example 5
 Matched  cohorts on sex and year of birth
 Used 3:1 ratio of controls to non-controls
 98,000 Controls
 Descriptive statistics only
 Doesn‟t explain reason for 3:1 ratio
             Example 6
 Charles N Bernstein, Andre Wajda, James
  F Blanchard
 The clustering of other chronic
  inflammatory diseases in inflammatory
  bowel disease: a population-based study
 Gastroenterology, 2005. 129:827-836




                                    continued...
              Example 6
 Matched    cohort on age, sex, and
  geography
 10:1 ratio (38,000 controls)
 Logistic regression used
 No separate examination of sensitivity and
  specificity
 Doesn‟t explain reason for 10:1 ratio
         Alternate Process
Suppose    we don‟t have enough
information to know what to match?
While we can match on basic patient
demographics, that may be problematic in
and of itself.




                                     continued...
           Alternate Process
 Suppose, for example, we wanted to find the
  added cost of asthma in the inpatient
  population?
 First, we want to find out how the asthma and
  non-asthma patients differ
 Given the size of the sample, we can restrict the
  non-cases to specific parameters and then
  choose a random sub-sample of non-cases to
  compare to cases while taking group differences
  into consideration.
         First, Demographics
 Wewant to determine if the asthma and
 non-asthma inpatients differ in some
 demographic variable.
    If so, the differences are important and a
     matched cohort equalizing the demographic
     information could introduce a confounder
 We look at age, race, gender, income
 quartile, and primary payer between the
 two groups.
        Continuous Demographics
         asthma   N Obs Variable Label                        Mean     Std Dev
              0 7528387 AGE    Age in years at admission 47.1806652 28.5711819
                        LOS    Length of stay (cleaned)   4.6411475 6.9349812
                        TOTCHG Total charges (cleaned)     22408.78   39568.34
              1   466661 AGE    Age in years at admission 47.2470072 25.1847866
                         LOS    Length of stay (cleaned)   4.5247457 5.5646128
                         TOTCHG Total charges (cleaned)     21577.68   32398.50




asthma Variable Minimum Maximum Lower Quartile               Median Upper Quartile
    0 AGE             0 115.0000000           25.0000000 51.0000000         72.0000000
      LOS             0 365.0000000            2.0000000 3.0000000           5.0000000
      TOTCHG 25.0000000   999926.00              5650.00   11591.00           24318.00
    1 AGE             0 109.0000000           29.0000000 49.0000000         67.0000000
      LOS             0 283.0000000            2.0000000 3.0000000           5.0000000
      TOTCHG 26.0000000   996796.00              6774.00   12426.00           24252.00
     Discrete Demographics
Gender     Asthma                     No Asthma
Male       157431 (33.81%)            3118494 (41.45%)
Female     308138 (66.19%)            4384506 (58.44%)

                   RACE        No Asthma            Asthma
               Caucasian 3787586 (68.84%) 219075 (64.69%)
         African American 635671 (11.55%)     62219 (18.37%)

                 Hispanic   721879 (13.12%)   37077 (10.95%)

                   Other     357059 (6.49%)    20269 (5.99%)
Matched Samples-With Asthma
proc surveyselect data=nis.withasthma
out=work.sampleasthma
method=srs
n=50000
;
ID AGE AGEDAY …codeasthma;
where age>0 or ageday>30;
RUN;
         Without Asthma
PROC SURVEYSELECT
 DATA=WORK.nonasthma2005
 OUT=work.samplenonasthma
 METHOD=SRS
 N=50000
 ;
 ID AGE AGEDAY AMONTH …codeasthma;
where age>0 or ageday>30;
RUN;
      Append Subsamples
PROC SQL;
CREATE TABLE nis.asthma2005sample AS
SELECT * FROM work.sampleasthma
 OUTER UNION CORR
SELECT * FROM WORK.samplenonasthma
;
Quit;
 Demographics of Sample Versus
         Full Dataset
    asthma   N Obs Variable Label                        Mean      Std Dev
         0 7528387 AGE    Age in years at admission 47.1806652 28.5711819
                   LOS    Length of stay (cleaned)   4.6411475 6.9349812
                   TOTCHG Total charges (cleaned)     22408.78   39568.34
         1   466661 AGE    Age in years at admission 47.2470072 25.1847866
                    LOS    Length of stay (cleaned)   4.5247457 5.5646128
                    TOTCHG Total charges (cleaned)     21577.68   32398.50

codeasthma N Obs Variable Label                           Mean      Std Dev
         0   50000 AGE    Age in years at admission 53.8357600 23.9154022
                   LOS    Length of stay (cleaned)   4.8426937 6.7280422
                   TOTCHG Total charges (cleaned)     24520.32   40033.44
         1   50000 AGE    Age in years at admission 47.5816200 25.0360451
                   LOS    Length of stay (cleaned)   4.5605312 5.5873088
                   TOTCHG Total charges (cleaned)     21860.93   33212.23


                                                                 continued...
 Demographics of Sample Versus
         Full Dataset
asthma Variable Minimum Maximum Lower Quartile        Median Upper Quartile
    0 AGE             0 115.0000000    25.0000000 51.0000000      72.0000000
      LOS             0 365.0000000     2.0000000 3.0000000        5.0000000
      TOTCHG 25.0000000   999926.00       5650.00   11591.00        24318.00
    1 AGE             0 109.0000000    29.0000000 49.0000000      67.0000000
      LOS             0 283.0000000     2.0000000 3.0000000        5.0000000
      TOTCHG 26.0000000   996796.00       6774.00   12426.00        24252.00

                                            Lower                   Upper
codeasthma Variable Minimum     Maximum    Quartile      Median    Quartile
        0 AGE             0 105.0000000 34.0000000 56.0000000 75.0000000
          LOS             0 240.0000000 2.0000000 3.0000000 5.0000000
          TOTCHG 35.0000000   992244.00    7156.00   13254.00   26442.00
        1 AGE             0 107.0000000 30.0000000 50.0000000 67.0000000
          LOS             0 271.0000000 2.0000000 3.0000000 5.0000000
          TOTCHG 35.0000000   886032.00    6792.00   12553.00   24584.00



                                                                   continued...
 Demographics of Sample Versus
         Full Dataset
Gender   Asthma            No Asthma
Male     157431 (33.81%)   3118494 (41.45%)
Female   308138 (66.19%)   4384506 (58.44%)

Gender   Asthma            No Asthma
Male     16923 (33.90%)    20103 (40.28%)
Female   32990 (66.10%)    29804 (59.72%)


                                     continued...
Demographics of Sample Versus
        Full Dataset
             RACE        No Asthma            Asthma
          Caucasian 3787586 (68.84%) 219075 (64.69%)
   African American 635671 (11.55%)     62219 (18.37%)

           Hispanic 721879 (13.12%)     37077 (10.95%)

              Other    357059 (6.49%)    20269 (5.99%)


              RACE      No Asthma            Asthma
          Caucasian 25868 (70.99%) 23465 (64.71%)
    African American 4284 (11.76%) 6789 (18.72%)
            Hispanic   4155 (11.40%)    3855 (10.65%)

               Other    2130 (5.85%)     2154 (5.95%)
            Section 4
Patient Severity and Co-Morbidities
       Next, Patient Conditions
 We   want to find the patient conditions
  listed in the NIS, to see if there are
  differences between patients with and
  without asthma in terms of co-morbidities.
 Since there are 15 different columns, we
  must first combine the fifteen columns and
  then perform a frequency count.
             Append Data
 Append   the columns of data to create one
  dataset.
 Then compute a frequency count in
  descending order to find the most
  frequently occurring patient conditions for
  patients with and without asthma.
        Combining Columns
Data nis.columndx1 (keep key dx asthma);
Set nis.extract_asthma_part1;
Dx=dx1;
Run;
Data nis.columndx2 (keep key dx asthma);
Set nis.extract_asthma_part1;
Dx=dx2;
Run;
Repeat for all 15 columns
Repeat for procedure codes if desired

                                           continued...
       Combining Columns
PROC SQL;
CREATE TABLE SASUSER.APPEND_TABLE AS
SELECT * FROM WORK.COLUMN12
 OUTER UNION CORR
SELECT * FROM WORK.COLUMN122
 OUTER UNION CORR
SELECT * FROM WORK.COLUMN132
 OUTER UNION CORR
…
SELECT * FROM WORK.COLUMN102
 OUTER UNION CORR
SELECT * FROM WORK.COLUMN112
;
Quit;
    Descending Frequencies
Proc sort data=sasuser.append_table;
By asthmacode;
PROC FREQ DATA=
  SASUSER.APPEND_TABLE
  ORDER=FREQ
;
  TABLES DX / SCORES=TABLE;
By asthmacode;
Most Frequent Codes With Asthma
 Code     Translation
 486      Pneumonia, organism unspecified
 428.0    Congestive heart failure
 786.59   Chest pain
 414.01   Coronary atherosclerosis
 518.81   Acute respiratory failure
 715.36   Osteoarthritis
 404.93   Hypertensive heart & chronic kidney disease
 V57.89   Care involving rehab procedures
 278.01   Morbid obesity
 410.71   Subendocardial infarction
 427.31   Atrial fibrillation
 682.6    Other cellulitis & abscess of leg except foot
 276.4    Volume depletion
Most Frequent Codes Without
          Asthma
 Code     Translation
 486      Pneumonia, organism unspecified
 428.0    Congestive heart failure
 414.01   Coronary atherosclerosis
 786.59   Chest pain
 V57.89   Care involving rehab procedures
 491.21   Obstructive chronic bronchitis
 599.0    Urinary tract infection
 410.71   Subendocardial infarction
 427.31   Atrial fibrillation
 434.91   Cerebral artery occlusion
 03.89    Unspecified septicemia
Chi-Square Comparisons
  Table of bronchitis by codeasthma
  bronchitis   codeasthma
  Frequency
  Row Pct
  Col Pct           0       1     Total
           0   48059 49682       97741
               49.17 50.83
               96.12 99.36
           1     1941      318     2259
                85.92    14.08
                  3.88    0.64
  Total        50000 50000       100000

                                          continued...
Chi-Square Comparisons
   Table of obesity by codeasthma
    obesity       codeasthma
  Frequency
  Row Pct
  Col Pct             0       1    Total
              0 47610 44681        92291
                  51.59    48.41
                  95.22    89.36
              1    2390     5319    7709
                  31.00    69.00
                    4.78   10.64
  Total           50000 50000 100000
   Logistic Regression to Predict
         Asthma Condition
 We  use the most frequent diagnosis codes
  to see if regression can be predicted given
  the defined 0-1 indicator functions.
 We use stepwise logistic regression
 We include all possible 2-way interactions
    Define Indicator Functions
Data nis.extract_asthma_part1;
Set nis.indicatorsforcodes;
If (dx_3digits=„486‟) then pneumonia=1;
Else pneumonia=0;
If (dx_3digits=„428‟) then chf=1;
Else chf=0;
Run;
               Logistic Results
 Significant    variables:
     Age, race, mortality, income level
     Pneumonia, congestive heart failure, coronary
      blockage
     Acute respiratory failure, arthritis,
      hypertensive heart condition, rehabilitation
     Obestity, infarction, atrial fibrillation, cellulitis
     Bronchitis, urinary tract infection, cerebral
      artery occlusion, septicemia
     gender
   Section 5.
Auditing the Data
             Assumption
 How  do we know that the data concerning
  patient primary and secondary diagnoses
  have been entered correctly?
 How can we find out?
 Examination of Co-Morbidities
 We   just use the number of codes
  independent of the actual code value to
  define a patient severity index
 If there is a relationship between the
  number of codes and the definition of
  patient severity, it clearly demonstrates
  that it is possible to “game” the severity
  index by shifting patients into higher levels
  of severity
              Number of Codes
data nis.numberofcodes;
set nis.nis_core_2005;
if (dx10 eq ' ') then class=10;
if (dx9 eq ' ') then class=9;
if (dx8 eq ' ') then class=8;
if (dx7 eq ' ') then class=7;
if (dx6 eq ' ') then class=6;
if (dx5 eq ' ') then class=5;
if (dx4 eq ' ') then class=4;
if (dx3 eq ' ') then class=3;
if (dx2 eq ' ') then class=2;
if (dx1 eq ' ') then class=1;
run;
        Frequency of Index
class            Frequency   Percent
   1                  930       0.01
   2                649478      9.32
   3                950622     13.64
   4                867899     12.45
   5                801427     11.50
   6                744191     10.68
   7                651332      9.35
   8                578229      8.30
   9                539992      7.75
  10               1185273     17.01
Number of Codes Versus Mortality
                   6

                   5
  Mortality Rate




                   4

                   3

                   2

                   1

                   0
                       1   2   3     4   5    6   7      8   9   10
                                   Number of Diagnoses
Average Length of Stay by Number
           of Codes
                         45
Average Length of Stay




                         40
                         35
                         30
                         25
                         20
                         15
                         10
                          5
                          0
                              1   2   3   4   5    6   7    8   9   10
                                          Number of Codes
Average Total Charges by Number
            of Codes
                        40000
Average Total Charges



                        35000
                        30000

                        25000
                        20000

                        15000
                        10000

                        5000
                                1   2   3     4   5    6   7      8   9   10
                                            Number of Diagnoses
           Under-Reporting
 Itis clear by the high cost shown for just
  one diagnosis code that many of the
  providers are under-reporting patient co-
  morbidities
 This under-reporting indicates that there is
  a lack of uniformity in listing
  secondarydiagnosis codes.
               Discussion
 What  should be done with patients where
  the co-morbidities are obviously under-
  reported?
 The inclusion/exclusion criteria used for
  co-morbidities will exclude all of the under-
  reported patients.
        Section 6
Use in Regression Models
              Discussion
 Whatis an optimal split in the data
 between treatment and control groups?
     50/50
     2:1?
     3:1?
     10:1?
 Why?
       Consequences of Split
 We   will look at the results of a logistic
  regression given the different splits in the
  data.
 We want to predict mortality
 Note that the accuracy increases as the
  split becomes larger but the practical
  importance decreases.
             Code to Split
PROC SURVEYSELECT DATA=WORK.mortality
   OUT=work.samplenondisease3
METHOD=SRS N=10000 ;
RUN;
PROC SURVEYSELECT DATA=WORK.nonmortality
   OUT=work.sampledisease3
METHOD=SRS N=100000 ;
RUN;
PROCSQL;
CREATETABLE nis.randomdied3 AS SELECT
• FROM work.sampledisease3
• OUTER UNION CORR SELECT
* FROM WORK.samplenondisease3 ;
Quit;
             50/50 Split

  Actual Mortality       Predicted Mortality
Frequency
Row Pct
Col Pct                          0         1    Total
                     0      30410        5128   35538
                            85.57       14.43
                            89.32       13.30
                     1        3637     33439    37076
                               9.81    90.19
                             10.68     86.70
Total                       34047      38567    72614
              75/25 Split
  Actual Mortality       Predicted Mortality
Frequency
Row Pct
Col Pct                          0          1    Total
                     0      53214        2191    55405
                             96.05        3.95
                             92.90      13.77
                     1        4066     13726     17792
                             22.85     77.15
                               7.10    86.23
Total                       57280      15917     73197
              10/90 Split
  Actual Mortality       Predicted Mortality
Frequency
Row Pct
Col Pct                          0         1    Total
                     0      73055         825   73880
                             98.88       1.12
                             96.60      15.36
                     1        2570       4545    7115
                             36.12      63.88
                               3.40     84.64
Total                       75625       5370    80995
        Complete Sample
 Actual Mortality       Predicted Mortality
Frequency
Row Pct
Col Pct                        0          1     Total
                    0    5546977     11465    5558442
                            99.79      0.21
                            98.02    49.72
                    1     112067     11595    123662
                            90.62      9.38
                             1.98    50.28
Total                    5659044     23060    5682104
         Logistic Regression
 Once we decide on a split (here, 50/50),
 we can compute the logistic regression
 using the defined indicator functions.
 Large Samples in Regression
 Large samples will guarantee significant p-
  values just because the sample size is
  large
 However, the effect size of the model
  becomes so small that it is practically
  unimportant
 Should use measures of misclassification
  and gain to examine model adequacy
Examine Central Limit Theorem
PROC SURVEYSELECT DATA=nis.nis_205
  OUT=work.samples METHOD=SRS N=5
  rep=100 noprint;
RUN;
proc means data=work.samples noprint;
 by replicate;
 var los;
 output out=out mean=mean;
run;
Sample Size=1000
Confidence Limit With Larger n
          Healthcare Data
 Healthcare   data are generally not normally
  distributed
 Large samples can skew the mean so that
  the mean becomes non-representative of
  the sample
 Better to assume a gamma distribution
  and use a generalized linear model
Section 7
Summary
               Extraction
 Extraction  of patient information is
  complex and should not be taken for
  granted
 Care must be taken to examine the
  potential for confounding factors
 Lack of uniformity should be examined
  because of the potential to add bias into
  the model

								
To top