					The Investigation of Data Using Tables

Primary objective: to show how data is investigated in terms of tables
using STATA

Secondary objective: to demonstrate some of the features of STATA for
general data management and analysis

The definition and layout of a table
The purpose of a table
STATA commands for the examination of tabulated data
Designing a table
Preparing data for tabulation: coding, labeling, and screening
Some statistical tests relating to tabulated data
Some demonstrations of the application of tables
    Simple tables relating to purely categorical variables
    More complex tables of categorical variables and predicting
     probabilities
    Tabulation as a tool to summarize continuous data
 The definition and layout of a table
A table is a matrix-like structure in which the elements characterize aspects
of the categories of the table-defining variables

Tables can have one, two, three, or more dimensions, though tables with
more than three dimensions are hard to follow

A one-way, or one-dimensional, table

. table age

----------+-----------
   agecat |      Freq.
----------+-----------
    25-29 |        292
    30-34 |        444
    35-39 |        393
    40-44 |        442
    45-49 |        405
----------+-----------

A two-way, or two-dimensional, table

. table age smoke

----------+-----------------------------------------
          |                 smokelev
   agecat |   0 per d.     1-24 per d.   >=25 per d.
----------+-----------------------------------------
    25-29 |          132           105            55
    30-34 |          188           158            98
    35-39 |          164           142            87
    40-44 |          180           155           107
    45-49 |          180           139            86
----------+-----------------------------------------

A three-way table with marginal totals (row and column totals)
. table age smoke, by(oc) row col

------------+-------------------------------------------------------
ocuse and   |                        smokelev
agecat      |   0 per d.     1-24 per d.   >=25 per d.         Total
------------+-------------------------------------------------------
Non oc-user |
      25-29 |          107            79            40           226
      30-34 |          175           147            80           402
      35-39 |          156           130            77           363
      40-44 |          175           151           101           427
      45-49 |          175           138            81           394
            |
      Total |          788           645           379         1,812
------------+-------------------------------------------------------
  Oc-user   |
      25-29 |           25            26            15            66
      30-34 |           13            11            18            42
      35-39 |            8            12            10            30
      40-44 |            5             4             6            15
      45-49 |            5             1             5            11
            |
      Total |            56           54            54           164
------------+-------------------------------------------------------
  Attributes of a simple table

STATA table command:

               . table age smoke

               ----------+-----------------------------------------
                         |                 smokelev
                  agecat |   0 per d.     1-24 per d.   >=25 per d.
               ----------+-----------------------------------------
                   25-29 |          132           105            55
                   30-34 |          188           158            98
                   35-39 |          164           142            87
                   40-44 |          180           155           107
                   45-49 |          180           139            86
               ----------+-----------------------------------------

Rows (agecat): age categories, a categorical variable
Columns (smokelev): smoking level categories, a categorical variable
Cells: number of study subjects in the specific categories from the study
 The purpose of a table
In regard to categorical variables the purpose of a table is:

1. to highlight trends in the cell counts by categories of the categorical
   variables

2. to highlight disproportionate presence of cell counts when considered in
   regard to the total number of individuals who could occupy the cells

3. to summarize categorized information, perhaps confirming balance, in
   a study, in regard to participant classes, or categories

4. to summarize outcomes, especially when outcomes are dichotomous, and
   gain a ‘quick’ appreciation of potentially significant study factors when
   preparing data for analysis

and with respect to continuous data

5. to summarize continuous information possibly pointing to patterns of
   variability amongst classes of individuals
 Trends in cell counts ...

Note that for perceived trends to be significant the denominator must
also be considered: it is actually a trend in proportions that we are
looking for. There is a test for the significance of trends in a table;
see nptrend.

. table oc age, row col

------------+------------------------------------------
            |                  agecat
      ocuse |  25-29  30-34  35-39  40-44  45-49  Total
------------+------------------------------------------
Non oc-user |    226    402    363    427    394  1,812
  Oc-user   |     66     42     30     15     11    164
            |
      Total |    292    444    393    442    405  1,976
------------+------------------------------------------

Trend direction: across the age categories (oc-user counts fall with age)

. table mi age, row col

----------+------------------------------------------
          |                  agecat
       mi |  25-29  30-34  35-39  40-44  45-49  Total
----------+------------------------------------------
    No MI |    286    423    356    371    306  1,742
       MI |      6     21     37     71     99    234
          |
    Total |    292    444    393    442    405  1,976
----------+------------------------------------------

Trend direction: across the age categories (MI counts rise with age)
  Disproportionate presence of cell counts ...

Again, it is critical to ask: disproportionate in regard to what? We again
need a reference, or denominator.

. table oc mi, row col

------------+--------------------
            |         mi
      ocuse | No MI     MI  Total
------------+--------------------
Non oc-user | 1,607    205  1,812
  Oc-user   |   135     29    164
            |
      Total | 1,742    234  1,976
------------+--------------------

Simply changing the command a little helps us here:

. tabulate oc mi, col

            |          mi
      ocuse |     No MI         MI |     Total
------------+----------------------+----------
Non oc-user |      1607        205 |      1812
            |     92.25      87.61 |     91.70
------------+----------------------+----------
  Oc-user   |       135         29 |       164
            |      7.75      12.39 |      8.30
------------+----------------------+----------
      Total |      1742        234 |      1976
            |    100.00     100.00 |    100.00

Note the greater prevalence of mi amongst the oc users than the non-users:
oc users make up 12.39% of the MI column but only 7.75% of the No MI column.
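
The column percentages above condition on MI status. To read the prevalence of MI within each oc-use group directly, row percentages can be requested instead. The following is a minimal sketch that rebuilds the 2 x 2 counts from the table above; the variable name count and the label names oclab and milab are assumptions made for illustration and are not part of the original dataset.

```stata
* Rebuild the four cells of the oc-by-mi table as frequency records
* (count, oclab and milab are hypothetical names chosen for this sketch)
clear
input byte oc byte mi int count
0 0 1607
0 1  205
1 0  135
1 1   29
end

label define oclab 0 "Non oc-user" 1 "Oc-user"
label define milab 0 "No MI" 1 "MI"
label values oc oclab
label values mi milab

* Row percentages: the share of each oc-use group that had an MI
tabulate oc mi [fweight=count], row
```

With these counts the MI row percentages should agree with the mean(mi) values reported under "To summarize outcomes" (.1131347 for non-users and .1768293 for oc users), that is, roughly 11.3% against 17.7%.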
Confirming balance in a study in regard to participant classes, or categories ...

Here we see the relative representation of every category by row variable
and by column variable. Each cell shows the frequency, the row percentage,
and the column percentage. It would not hurt to circle large percentages
and small percentages and explore their bases.

. tabulate age smoke, row col

           |             smokelev
    agecat |   0 per d   1-24 per   >=25 per |     Total
-----------+---------------------------------+----------
     25-29 |       132        105         55 |       292
           |     45.21      35.96      18.84 |    100.00
           |     15.64      15.02      12.70 |     14.78
-----------+---------------------------------+----------
     30-34 |       188        158         98 |       444
           |     42.34      35.59      22.07 |    100.00
           |     22.27      22.60      22.63 |     22.47
-----------+---------------------------------+----------
     35-39 |       164        142         87 |       393
           |     41.73      36.13      22.14 |    100.00
           |     19.43      20.31      20.09 |     19.89
-----------+---------------------------------+----------
     40-44 |       180        155        107 |       442
           |     40.72      35.07      24.21 |    100.00
           |     21.33      22.17      24.71 |     22.37
-----------+---------------------------------+----------
     45-49 |       180        139         86 |       405
           |     44.44      34.32      21.23 |    100.00
           |     21.33      19.89      19.86 |     20.50
-----------+---------------------------------+----------
     Total |       844        699        433 |      1976
           |     42.71      35.37      21.91 |    100.00
           |    100.00     100.00     100.00 |    100.00
  To summarize outcomes ...

Here we see tabulations of the proportion of mi subjects in each of
the age, smoking, and oc-use classes (the mean of a 0/1 variable is the
proportion of subjects coded 1)

               . table oc, content(mean mi)

               ------------+-----------
                     ocuse |   mean(mi)
               ------------+-----------
               Non oc-user |   .1131347
                 Oc-user   |   .1768293
               ------------+-----------

               . table age smoke, contents(mean mi) by(oc) format(%7.4f)

               ------------+-----------------------------------------
               ocuse and   |                 smokelev
               agecat      |   0 per d.     1-24 per d.   >=25 per d.
               ------------+-----------------------------------------
               Non oc-user |
                     25-29 |       0.0093        0.0000        0.0250
                     30-34 |       0.0000        0.0340        0.0875
                     35-39 |       0.0192        0.0846        0.2468
                     40-44 |       0.0571        0.1391        0.3366
                     45-49 |       0.1143        0.3043        0.3827
               ------------+-----------------------------------------
                 Oc-user   |
                     25-29 |       0.0000        0.0385        0.2000
                     30-34 |       0.0000        0.0909        0.4444
                     35-39 |       0.0000        0.0833        0.3000
                     40-44 |       0.2000        0.0000        0.8333
                     45-49 |       0.6000        0.0000        0.6000
               ------------+-----------------------------------------
  To summarize continuous information ...

In a study to explore the efficacy of a new drug, nifedipine, to alleviate chest
pain in heart patients, subjects were treated with either propranolol or nifedipine
and the pattern of their heart rate and blood pressure monitored at each
stage of adjustment of either drug dosage. There were three stages of the
study in addition to baseline (time zero).

Note the heart rate and systolic blood pressure patterns in time with the
two drugs. Is this as useful as a graphical representation of the results?

  . table grp time, c(mean hr) format(%7.2f)

  ----------+-------------------------------
            |              time
        grp |      0       1       2       3
  ----------+-------------------------------
          N |  71.72   77.83  124.28  127.78
          P |  76.81   79.69   97.81  112.75
  ----------+-------------------------------

  . table grp time, c(m sbp) format(%7.2f)

  ----------+-------------------------------
            |              time
        grp |      0       1       2       3
  ----------+-------------------------------
          N | 138.78  134.00  129.43  135.67
          P | 144.91  136.55  136.57  140.86
  ----------+-------------------------------
 In this study, because not all subjects proceeded to every stage,
 we would be advised to explore the level of participation at each stage...

Number of missing ‘sbp’ observations by group and study stage, amongst
participants:

                              . table grp time if sbp==.

                              ----------+-----------------------
                                        |          time
                                    grp |    0     1     2     3
                              ----------+-----------------------
                                      N |    9     9    11    12
                                      P |    5     5     9     9
                              ----------+-----------------------

Number of non-missing ‘sbp’ observations by group and study stage, amongst
participants:

                              . table grp time if sbp~=.

                              ----------+-----------------------
                                        |          time
                                    grp |    0     1     2     3
                              ----------+-----------------------
                                      N |    9     9     7     6
                                      P |   11    11     7     7
                              ----------+-----------------------
An investigation into the missing values could also be advanced as follows:

                                 . generate miss=sbp==.

                                 . table time grp, by(miss)

                                 ----------+-----------
                                 miss and  |     grp
                                 time      |    N     P
                                 ----------+-----------
                                 0         |
                                         0 |    9    11
                                         1 |    9    11
                                         2 |    7     7
                                         3 |    6     7
                                 ----------+-----------
                                 1         |
                                         0 |    9     5
                                         1 |    9     5
                                         2 |   11     9
                                         3 |   12     9
                                 ----------+-----------

Here miss = 0 marks the non-missing values and miss = 1 marks the missing values.
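
The indicator above tests sbp==., which is true only for the system missing value; Stata's extended missing codes (.a to .z) would be counted as non-missing. A hedged sketch of a slightly more defensive variant follows, using the missing() function and a value label so that the output annotates itself. The six records are toy data invented purely to make the sketch runnable; they are not the study data, and the label name misslab is likewise an assumption.

```stata
* Toy records, invented for illustration only (NOT the nifedipine study data)
clear
input str1 grp byte time double sbp
"N" 0 140
"N" 1 .
"N" 2 .
"P" 0 150
"P" 1 135
"P" 2 .
end

* missing() is true for . and for the extended codes .a through .z
generate byte miss = missing(sbp)

* Label the indicator so tables show words rather than 0 and 1
* (misslab is a hypothetical label name)
label define misslab 0 "non-missing" 1 "missing"
label values miss misslab

* One time-by-group table for each value of the indicator
bysort miss: tabulate time grp
```

The by-group tabulate is used here instead of table ..., by() only because it behaves the same way across Stata releases; on the study data the counts should match the two panels shown above.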
  STATA commands for the examination of tabulated data
Tables are to categorical variables as summarization is to continuous
variables: we use the STATA table commands to investigate categorical
variables in every sense.

There are four STATA commands for the specific manipulation of tables:
tabulate, table, tabdisp, and tabstat

 [by varlist:] tabulate varname1 [varname2] [weight] [if exp] [in range]
           , summarize(varname3) [ [no]means [no]standard [no]freq [no]obs
                                    wrap nolabel missing ]

table rowvar [colvar [supercolvar]] [weight] [if exp] [in range]
                     [, contents(clist) by(superrow_varlist) cw row col scol
                        format(%fmt) center left concise missing replace
                        name(string) cellwidth(#) csepwidth(#) scsepwidth(#)
                        stubwidth(#) ]

tabdisp rowvar [colvar [supercolvar]] [if exp] [in range], cellvar(varnames)
                [ by(superrowvar(s)) format(%fmt) center left concise missing
                  totals cellwidth(#) csepwidth(#) scsepwidth(#) stubwidth(#) ]

tabstat varlist [if exp] [in range] [weight] [, statistics(statname [...])
                        by(varname) nototal missing nosep
                        columns(variables|statistics) longstub labelwidth(#)
                        format[(%fmt)] casewise save ]
USE tabulate (tab) to display
                simple summaries,
                relative cell counts,
                associative measures between row and column variables

          . tabulate case age [fweight=count], chi row col

                     |          age
                case |     <= 29      >= 30 |     Total
          -----------+----------------------+----------
             Control |      8747       1498 |     10245
                     |     85.38      14.62 |    100.00
                     |     77.52      68.68 |     76.09
          -----------+----------------------+----------
                Case |      2537        683 |      3220
                     |     78.79      21.21 |    100.00
                     |     22.48      31.32 |     23.91
          -----------+----------------------+----------
               Total |     11284       2181 |     13465
                     |     83.80      16.20 |    100.00
                     |    100.00     100.00 |    100.00

                     Pearson chi2(1) =   78.3698   Pr = 0.000
USE table to display statistics concerning variables in a table and to gain
control of the table layout, for publication purposes


       . table time, c(mean hr mean sbp freq) format(%8.3f) by(grp)

       ----------+-----------------------------------
       grp and   |
       time      |   mean(hr)   mean(sbp)       Freq.
       ----------+-----------------------------------
       N         |
               0 |     71.722     138.778          18
               1 |     77.833     134.000          18
               2 |    124.278     129.429          18
               3 |    127.778     135.667          18
       ----------+-----------------------------------
       P         |
               0 |     76.812     144.909          16
               1 |     79.688     136.545          16
               2 |     97.812     136.571          16
               3 |    112.750     140.857          16
       ----------+-----------------------------------
 USE tabdisp to display your own calculations, and observations, in a table
 in which you do not want any special treatment of the numbers

                  . tabdisp case age, cellvar(Chi) format(%7.2f)

                  ----------+-------------
                            |     age
                       case | <= 29  >= 30
                  ----------+-------------
                    Control |  3.04  15.71
                       Case |  9.66  49.97
                  ----------+-------------

 The categorical variables case and age ‘structure’ the table; the cells hold
 the values of the variable ‘Chi’, displayed using the indicated format
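
The slides do not show how the cell variable Chi was built. One plausible construction, offered as a sketch rather than as the author's method, is each cell's contribution to the Pearson chi-square, (observed - expected)^2 / expected, computed from the case-by-age counts used elsewhere in these notes. The helper variables rowtot, coltot, n and expected are assumptions introduced for illustration.

```stata
* Cell-level data: one record per cell of the case-by-age table
clear
input byte age byte case long count
0 0 8747
0 1 2537
1 0 1498
1 1  683
end

* Marginal and grand totals needed for the expected counts
* (rowtot, coltot, n and expected are hypothetical helper names)
egen double rowtot = total(count), by(case)
egen double coltot = total(count), by(age)
egen double n      = total(count)

* Expected count under independence, then the per-cell contribution
generate double expected = rowtot * coltot / n
generate double Chi      = (count - expected)^2 / expected

* tabdisp prints the stored values without recomputing anything
tabdisp case age, cellvar(Chi) format(%7.2f)
```

Because tabdisp needs exactly one observation per displayed cell, the data are kept as four frequency records rather than expanded to one record per subject.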
USE tabstat to display the (same comprehensive array of) statistics of
(several) variables in a table and to gain control of the table layout, for
publication purposes, when only one classifying (categorical) variable is
considered

  . tabstat hr* sbp*, by(grp) stat(mean med max sd) format(%6.2f) long

  grp    stats |       hr1       hr2       hr3      sbp1      sbp2      sbp3
  -------------+------------------------------------------------------------
  N       mean |     77.83    124.28    127.78    134.00    129.43    135.67
           p50 |     71.00    115.00    120.00    134.00    128.00    129.00
           max |    125.00    230.00    230.00    160.00    150.00    180.00
            sd |     21.00     49.55     44.41     15.84     17.08     28.38
  -------------+------------------------------------------------------------
  P       mean |     79.69     97.81    112.75    136.55    136.57    140.86
           p50 |     76.50     82.00    110.00    134.00    130.00    150.00
           max |    120.00    190.00    180.00    180.00    206.00    188.00
            sd |     19.95     45.45     47.34     22.40     36.29     27.85
  -------------+------------------------------------------------------------
  Tot     mean |     78.71    111.82    120.71    135.40    133.00    138.46
           p50 |     74.50     97.00    120.00    134.00    130.00    140.00
           max |    125.00    230.00    230.00    180.00    206.00    188.00
            sd |     20.22     48.82     45.75     19.27     27.50     27.03
  -------------+------------------------------------------------------------

Note the use of the ‘*’ wildcard to ensure that all variables whose names
start with ‘hr’ or ‘sbp’ are included in the table
    Designing a table

Use the following principles in designing a table

• Tables should be wider than they are long
• The row and column headings should clearly describe the specific classes
• Arrange entries in the table so that trends and differences are clear
• Use super-rows and super-columns sparingly, and check the target journal's
  requirements
• Ensure that the summary attribute of a cell is presented in proportion to its
  importance (e.g. means should be more evident than errors)
• Don’t over-crowd tables; specify numbers to no greater precision than can
  be justified



When creating a table in an exploratory fashion carefully articulate the
categories to be ‘cross tabulated’ and the nature of the summarization you
require

                                             (See also Woodward, pp 39 – 44)
Preparing data for tabulation: coding, labeling, and screening

A study (Rosner chap. 10) elucidated the following relationship between
breast cancer (cases) in women and age at birth of first child
                         |            age
                    case |     <= 29      >= 30 |     Total
              -----------+----------------------+----------
                 Control |      8747       1498 |     10245
                    Case |      2537        683 |      3220
              -----------+----------------------+----------
                   Total |     11284       2181 |     13465


A number of questions arise if we want to investigate this data:

how do we select a set of variables to represent this information?
how do we decide on a coding scheme for the variables?
how do we ensure that we can recall the coding pattern (labeling)?
how do we reproduce the above table?
how can we establish whether age at first birth and breast cancer are
 related?
how do we select a set of variables to represent this information

There are two categorical variables, case/control, and age (<=29/>=30)
that can be used to specify each cell in the table.
Additionally, we’ll need a variable to quantify the number of entries in
each cell.
The variables are then, for example
        case
        age
        count

how do we decide on a coding scheme for the variables
The general approach to coding dichotomous categorical variables is
to set the state of interest to the value ‘1’, and the other state to the
value ‘0’.
Here since we are concerned with breast cancer as the ‘case’ state, we
will code the variable case as ‘1’ for case, ‘0’ for control.
Note that mnemonic matching (names and codes that recall their meaning) is
also very important.
Similarly, our exposure is delaying first birth, so our state of interest is
age>=30, so we will code age as ‘1’ for age >= 30, and ‘0’ otherwise.
 We will code the variable count as given by cell occupancy in the table

 Summarizing:
                   case          1: breast cancer case
                                 0: non-breast cancer (i.e. control)

                   age           1: age>=30 at first birth
                                 0: age<=29 at first birth

                   count         cell frequency or occupancy

Is ‘case’ actually a good name for the variable here?
Now we have the following table:

                . list

                             age        case       count
                  1.           0           0        8747
                  2.           0           1        2537
                  3.           1           0        1498
                  4.           1           1         683



 Unfortunately, the coded values here do not convey their meaning
how do we reproduce the above table

We can label the table values as follows:
             label   define   alabel 1 ">= 30" 0 "<= 29"
             label   values   age alabel
             label   define   clabel 1 Case 0 Control
             label   values   case clabel

Now the table values appear as follows (the values prior to tabulation):

            . list case age count
                     case       age         count
              1. Control      <= 29          8747
              2.    Case      <= 29          2537
              3. Control      >= 30          1498
              4.    Case      >= 30           683

Note the use of frequency weighting in the tabulation command:

            . tab case age [fwe=count]
                       |          age
                  case |     <= 29      >= 30 |     Total
            -----------+----------------------+----------
               Control |      8747       1498 |     10245
                  Case |      2537        683 |      3220
            -----------+----------------------+----------
                 Total |     11284       2181 |     13465
how do we ensure that we can recall the coding pattern (labeling)

The STATA command codebook helps us with the management of
categorical variables, especially with screening them
. codebook age case


age --------------------------------------------------------------- (unlabeled)
                  type: numeric (byte)
                 label: alabel

                 range:   [0,1]                        units:   1
         unique values:   2                    coded missing:   0 / 4

            tabulation:   Freq.   Numeric   Label
                              2         0   <= 29
                              2         1   >= 30


case -------------------------------------------------------------- (unlabeled)
                  type: numeric (byte)
                 label: clabel

                 range:   [0,1]                        units:   1
         unique values:   2                    coded missing:   0 / 4

            tabulation:   Freq.   Numeric   Label
                              2         0   Control
                              2         1   Case
These numbers were entered into STATA using the STATA Data Editor viz:




                          age          case          count
              1             0            0           8747
              2             0            1           2537
              3             1            0           1498
              4             1            1            683
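
Typing values into the Data Editor leaves no record of what was entered. As a hedged alternative, the same four records can be entered from a do-file with input, which makes the step repeatable. The sketch below also shows one optional follow-on, expanding the frequency records to one observation per woman so that no weight is needed; whether that is worthwhile depends on the later analysis.

```stata
* Enter the four cell records reproducibly instead of via the Data Editor
clear
input byte age byte case long count
0 0 8747
0 1 2537
1 0 1498
1 1  683
end

* Optional: replace each record by count copies, one per study subject
expand count
drop count

* Each observation is now a single subject, so no fweight is required
tabulate case age
```

No value labels are attached in this sketch, so the table shows the raw 0/1 codes; applying the label define and label values commands given earlier restores the labelled output.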
how can we establish whether age at first birth and breast cancer are
 related
                                      request a chi-square (1 df) test of
                                      association between breast cancer and age

  . tabulate case age [fwe=count], chi

             |          age
        case |     <= 29      >= 30 |     Total
  -----------+----------------------+----------
     Control |      8747       1498 |     10245
        Case |      2537        683 |      3220
  -----------+----------------------+----------
       Total |     11284       2181 |     13465

               Pearson chi2(1) =         78.3698       Pr = 0.000

                 H0: No association between age and breast cancer
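
As a check on the reported statistic, the Pearson chi-square can be worked by hand from the table. The notation below (O for observed counts, E for expected counts, R and C for row and column totals, N for the grand total) is introduced here for illustration; the four terms in the sum are the per-cell values shown in the tabdisp example.

```latex
\[
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N}
\]
\[
E_{\text{Control},\,\le 29} = \frac{10245 \times 11284}{13465} \approx 8585.56,
\qquad
\frac{(8747 - 8585.56)^2}{8585.56} \approx 3.04
\]
\[
\chi^2 \approx 3.036 + 15.706 + 9.658 + 49.970 \approx 78.37
\quad (1~\text{df})
\]
```

Since 78.37 is far above 3.84, the 5% critical value of chi-square on 1 df, H0 is rejected, which is what the reported Pr = 0.000 indicates.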
   Some statistical tests relating to tabulated data
  Statistical tests relating to tables address the questions

  1: is there an association between the row and column category variables
  2: if the row or column variable is ‘ordered’ is there a trend relationship
     between the outcome category, and the ordered category
  3: if the row and column variables are both ordered is there a trend
     relationship between them evident in the cell counts


  1. The association test:

Here we are exploring whether there is an association between smoking and
use of oral contraceptive. The null hypothesis of no such association is
rejected.

. tabulate oc sm, chi ex

            |            smokelev
      ocuse |   0 per d   1-24 per   >=25 per |     Total
------------+---------------------------------+----------
Non oc-user |       788        645        379 |      1812
  Oc-user   |        56         54         54 |       164
------------+---------------------------------+----------
      Total |       844        699        433 |      1976

          Pearson chi2(2) =  13.2758   Pr = 0.001
           Fisher's exact =                 0.002
 2. The trend test

 We can explore the existence of a linear trend relationship between
 oral contraceptive use and smoking
           . nptrend oc, by(smoke)

             smokelev       score         obs       sum of ranks
                    0           0         844         820414
                    1           1         699         686995
                    2           2         433         445866

                z =     3.37
             P>|z| =    0.00


We see that the null hypothesis of no such trend is rejected, at the 5%
level.

Further, using the association test we can explore the existence of a
higher than linear relationship, viz. chi_square nonlinear trend (1 df) =
chi_square association (2 df) - chi_square linear trend (1 df) =
13.28 - 3.37^2 = 1.92 (1 df). This is below 3.84, the 5% critical value of
chi-square on 1 df, so there is no such residual trend.
  3. A two-way test of trends can be performed with the use of Kendall’s TAU

For the OCMI data we could explore the potential relationship between age
and smoking level viz:

. tabulate age smoke, tau row col

           |             smokelev
    agecat |   0 per d   1-24 per   >=25 per |     Total
-----------+---------------------------------+----------
     25-29 |     11862       6867       1675 |     20404
           |     58.14      33.66       8.21 |    100.00
           |     10.06       9.82       8.31 |      9.81
-----------+---------------------------------+----------
     30-34 |     30794      20290       5542 |     56626
           |     54.38      35.83       9.79 |    100.00
           |     26.11      29.03      27.51 |     27.23
-----------+---------------------------------+----------
     35-39 |     23482      14404       3783 |     41669
           |     56.35      34.57       9.08 |    100.00
           |     19.91      20.61      18.78 |     20.04
-----------+---------------------------------+----------
     40-44 |     27342      17357       5671 |     50370
           |     54.28      34.46      11.26 |    100.00
           |     23.19      24.83      28.15 |     24.22
-----------+---------------------------------+----------
     45-49 |     24438      10981       3474 |     38893
           |     62.83      28.23       8.93 |    100.00
           |     20.72      15.71      17.24 |     18.70
-----------+---------------------------------+----------
     Total |    117918      69899      20145 |    207962
           |     56.70      33.61       9.69 |    100.00
           |    100.00     100.00     100.00 |    100.00

          Kendall's tau-b =   0.0103   ASE = 0.019

It seems as though there is no evidence of an association between age and
smoking level, since |tau/ASE| ~= 0.53 << 2.0
Exercise
The dataset ‘fev.dta’ is from Rosner (Fundamentals of Biostatistics, 2000, p. 40) and presents
aspects of a longitudinal study of respiratory function of children in the East Boston area of
Massachusetts. The meaning of the data is as shown below
                          FEV.DOC

                          Variable #      Variable
                          ----------|---------------------------
                           1         |  ID number
                           2         |  Age (yrs)
                           3         |  FEV (liters)
                           4         |  Height (inches)
                           5         |  Sex
                                     |      0=female, 1=male
                           6         |  Smoking Status
                                     |      0=non-current smoker,
                                     |      1=current smoker
                          ----------|---------------------------


Perform the following steps to facilitate an exploration of this data:
1. Label the variables and variable values of the dataset appropriately.
2. Create a table to expose the level of smoking by age and sex. Does this make sense?
3. How does the distribution of age vary with sex, as evidenced by appropriate tabulation?
4. How does the distribution of height vary with sex and smoking status, again as evidenced via
   an appropriate table?
5. Create a demographic table as would be appropriate in a journal article introducing this data and
   the investigation.
6. Create a table of the measurement variable, fev, making clear important statistical properties
   (e.g. mean, standard deviation, median, and range) and specifically revealing how these traits
   vary with sex, age, and smoking status.
7. Using appropriate classification, create categorical representations of height and age and, using
   tabular methods, see if you can obtain an understanding of how fev may vary with age and
   height.
8. Incorporate sex and smoking status into your investigation of important determinants of fev as
   evident in the fev dataset.

				