# Tables by cuiliqing

```
The Investigation of Data Using Tables

Primary objective: to show how data is investigated in terms of tables
using STATA

Secondary objective: to demonstrate some of the features of STATA for
general data management and analysis

The definition and layout of a table
The purpose of a table
STATA commands for the examination of tabulated data
Designing a table
Preparing data for tabulation: coding, labeling, and screening
Some statistical tests relating to tabulated data
Some demonstrations of the application of tables
Simple tables relating to purely categorical variables
More complex tables of categorical variables and predicting probabilities
Tabulation as a tool to summarize continuous data

The definition and layout of a table
A table is a matrix-like structure in which the elements characterize aspects
of the categories of the table-defining variables

Tables can have one, two, three, or more dimensions, though tables with
more than three dimensions are hard to follow

A one-way, or one-dimensional, table

. table age

----------+-----------
   agecat |      Freq.
----------+-----------
    25-29 |        292
    30-34 |        444
    35-39 |        393
    40-44 |        442
    45-49 |        405
----------+-----------

A two-way, or two-dimensional, table

. table age smoke

----------+-----------------------------------------
          |                 smokelev
   agecat |   0 per d.     1-24 per d.   >=25 per d.
----------+-----------------------------------------
    25-29 |          132           105            55
    30-34 |          188           158            98
    35-39 |          164           142            87
    40-44 |          180           155           107
    45-49 |          180           139            86
----------+-----------------------------------------

A three-way table with marginal totals (row and column totals)
. table age smoke, by(oc) row col

------------+-------------------------------------------------------
ocuse and   |                        smokelev
agecat      |   0 per d.     1-24 per d.   >=25 per d.         Total
------------+-------------------------------------------------------
Non oc-user |
25-29 |          107            79            40           226
30-34 |          175           147            80           402
35-39 |          156           130            77           363
40-44 |          175           151           101           427
45-49 |          175           138            81           394
|
Total |          788           645           379         1,812
------------+-------------------------------------------------------
Oc-user   |
25-29 |            25           26            15            66
30-34 |            13           11            18            42
35-39 |             8           12            10            30
40-44 |             5            4             6            15
45-49 |             5            1             5            11
|
Total |            56           54            54           164
------------+-------------------------------------------------------
Attributes of a simple table

. table age smoke

----------+-----------------------------------------
          |                 smokelev
   agecat |   0 per d.     1-24 per d.   >=25 per d.
----------+-----------------------------------------
    25-29 |          132           105            55
    30-34 |          188           158            98
    35-39 |          164           142            87
    40-44 |          180           155           107
    45-49 |          180           139            86
----------+-----------------------------------------

Key to the table:
  . table age smoke    the STATA table command, followed by the categorical
                       variables
  column headings      smoking level categories
  row headings         age categories
  cell entries         number of study subjects in the specific categories
                       from the study

The purpose of a table
In regard to categorical variables the purpose of a table is:

1. to highlight trends in the cell counts by categories of the categorical
variables

2. to highlight disproportionate presence of cell counts when considered in
regard to the total number of individuals who could occupy the cells

3. to summarize categorized information, perhaps confirming balance, in
a study, in regard to participant classes, or categories

4. to summarize outcomes, especially when outcomes are dichotomous, and
gain a ‘quick’ appreciation of potentially significant study factors when
preparing data for analysis

and with respect to continuous data

5. to summarize continuous information possibly pointing to patterns of
variability amongst classes of individuals
Trends in cell counts ...

. table oc age, row col

------------+-----------------------------------------
            |                  agecat
      ocuse | 25-29 30-34 35-39 40-44 45-49 Total
------------+-----------------------------------------
Non oc-user |   226   402   363   427   394 1,812
    Oc-user |    66    42    30    15    11   164
            |
      Total |   292   444   393   442   405 1,976
------------+-----------------------------------------
Trend direction: across the age categories

. table mi age, row col

----------+-----------------------------------------
          |                  agecat
       mi | 25-29 30-34 35-39 40-44 45-49 Total
----------+-----------------------------------------
    No MI |   286   423   356   371   306 1,742
       MI |     6    21    37    71    99   234
          |
    Total |   292   444   393   442   405 1,976
----------+-----------------------------------------
Trend direction: across the age categories

Note that for perceived trends to be significant the denominator must also be
considered. It is actually a trend in proportions that we are looking for.

There is a test for the significance of trends in a table; see nptrend.

Disproportionate presence of cell counts ...

. table oc mi, row col

------------+--------------------
            |         mi
      ocuse | No MI    MI   Total
------------+--------------------
Non oc-user | 1,607   205   1,812
    Oc-user |   135    29     164
            |
      Total | 1,742   234   1,976
------------+--------------------

Again, it is critical to ask: disproportionate in regard to what? We again need
a reference, or denominator. Simply changing the command a little helps us here.

. tabulate oc mi, col

            |          mi
      ocuse |     No MI         MI |     Total
------------+----------------------+----------
Non oc-user |      1607        205 |      1812
            |     92.25      87.61 |     91.70
------------+----------------------+----------
    Oc-user |       135         29 |       164
            |      7.75      12.39 |      8.30
------------+----------------------+----------
      Total |      1742        234 |      1976
            |    100.00     100.00 |    100.00

Note the greater prevalence of mi amongst the oc users than the non-users:
oc users make up 12.39% of the MI column but only 7.75% of the No MI column.
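
* A minimal sketch, not part of the original slides: row percentages give the
* prevalence of MI within each oc-use group directly. The immediate command
* tabi is used so that the cell counts shown above can be typed in without the
* dataset; rows are Non oc-user and Oc-user, columns are No MI and MI.
* With the dataset in memory, "tabulate oc mi, row" would be the equivalent.
tabi 1607 205 \ 135 29, row

* The same two prevalences by hand (MI count divided by the row total).
* They agree with the mean(mi) values .1131347 and .1768293 tabulated under
* "To summarize outcomes".
display 205/1812
display 29/164
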
Confirming balance in a study in regard to participant classes, or categories ...

. tabulate age smoke, row col

           |             smokelev
    agecat |   0 per d   1-24 per   >=25 per |     Total
-----------+---------------------------------+----------
     25-29 |       132        105         55 |       292
           |     45.21      35.96      18.84 |    100.00
           |     15.64      15.02      12.70 |     14.78
-----------+---------------------------------+----------
     30-34 |       188        158         98 |       444
           |     42.34      35.59      22.07 |    100.00
           |     22.27      22.60      22.63 |     22.47
-----------+---------------------------------+----------
     35-39 |       164        142         87 |       393
           |     41.73      36.13      22.14 |    100.00
           |     19.43      20.31      20.09 |     19.89
-----------+---------------------------------+----------
     40-44 |       180        155        107 |       442
           |     40.72      35.07      24.21 |    100.00
           |     21.33      22.17      24.71 |     22.37
-----------+---------------------------------+----------
     45-49 |       180        139         86 |       405
           |     44.44      34.32      21.23 |    100.00
           |     21.33      19.89      19.86 |     20.50
-----------+---------------------------------+----------
     Total |       844        699        433 |      1976
           |     42.71      35.37      21.91 |    100.00
           |    100.00     100.00     100.00 |    100.00

Here we see the relative representation of every category by row variable and
by column variable. It would not hurt to circle large percentages, and small
percentages, and explore their bases.

To summarize outcomes ...

Here we see tabulations of the proportion of mi subjects in each of the age,
smoking, and oc-use classes

. table oc, content(mean mi)

------------+-----------
      ocuse |   mean(mi)
------------+-----------
Non oc-user |   .1131347
    Oc-user |   .1768293
------------+-----------

. table age smoke, contents(mean mi) by(oc) format(%7.4f)

------------+-----------------------------------------
ocuse and   |                 smokelev
agecat      |   0 per d.     1-24 per d.   >=25 per d.
------------+-----------------------------------------
Non oc-user |
25-29 |       0.0093        0.0000        0.0250
30-34 |       0.0000        0.0340        0.0875
35-39 |       0.0192        0.0846        0.2468
40-44 |       0.0571        0.1391        0.3366
45-49 |       0.1143        0.3043        0.3827
------------+-----------------------------------------
Oc-user   |
25-29 |       0.0000        0.0385        0.2000
30-34 |       0.0000        0.0909        0.4444
35-39 |       0.0000        0.0833        0.3000
40-44 |       0.2000        0.0000        0.8333
45-49 |       0.6000        0.0000        0.6000
------------+-----------------------------------------
To summarize continuous information ...

In a study to explore the efficacy of a new drug, nifedipine, to alleviate chest
pain in heart patients, subjects were treated with either propranolol or
nifedipine, and the pattern of their heart rate and blood pressure was monitored
at each stage of adjustment of either drug dosage. There were three stages of
the study in addition to baseline (time zero).

. table grp time, c(mean hr) format(%7.2f)

----------+-------------------------------
          |              time
      grp |      0       1       2       3
----------+-------------------------------
        N |  71.72   77.83  124.28  127.78
        P |  76.81   79.69   97.81  112.75
----------+-------------------------------

. table grp time, c(m sbp) format(%7.2f)

----------+-------------------------------
          |              time
      grp |      0       1       2       3
----------+-------------------------------
        N | 138.78  134.00  129.43  135.67
        P | 144.91  136.55  136.57  140.86
----------+-------------------------------

Note the heart rate and systolic blood pressure patterns in time with the two
drugs. Is this as useful as a graphical representation of the results?

In this study, because not all subjects proceeded to every stage, we would be
advised to explore the level of participation at each stage ...

Number of missing ‘sbp’ observations by group and study stage, amongst
participants:

. table grp time if sbp==.

----------+-----------------------
          |          time
      grp |    0     1     2     3
----------+-----------------------
        N |    9     9    11    12
        P |    5     5     9     9
----------+-----------------------

Number of non-missing ‘sbp’ observations by group and study stage, amongst
participants:

. table grp time if sbp~=.

----------+-----------------------
          |          time
      grp |    0     1     2     3
----------+-----------------------
        N |    9     9     7     6
        P |   11    11     7     7
----------+-----------------------

An investigation into the missing values could also be advanced as follows:

. generate miss=sbp==.

. table time grp, by(miss)

----------+-----------
miss and  |     grp
time      |    N     P
----------+-----------
0         |                 (non-missing values)
        0 |    9    11
        1 |    9    11
        2 |    7     7
        3 |    6     7
----------+-----------
1         |                 (missing values)
        0 |    9     5
        1 |    9     5
        2 |   11     9
        3 |   12     9
----------+-----------

STATA commands for the examination of tabulated data
Tables are to categorical variables as summarization is to continuous
variables … we use the STATA commands to investigate categorical
variables in every sense.

There are four STATA commands for the specific manipulation of tables:
tabulate, table, tabdisp, and tabstat

[by varlist:] tabulate varname1 [varname2] [weight] [if exp] [in range]
        , summarize(varname3) [ [no]means [no]standard [no]freq [no]obs
        wrap nolabel missing ]

table rowvar [colvar [supercolvar]] [weight] [if exp] [in range]
        [, contents(clist) by(superrow_varlist) cw row col scol
        format(%fmt) center left concise missing replace
        name(string) cellwidth(#) csepwidth(#) scsepwidth(#)
        stubwidth(#) ]

tabdisp rowvar [colvar [supercolvar]] [if exp] [in range], cellvar(varnames)
        [ by(superrowvar(s)) format(%fmt) center left concise missing
        totals cellwidth(#) csepwidth(#) scsepwidth(#) stubwidth(#) ]

tabstat varlist [if exp] [in range] [weight] [, statistics(statname [...])
        by(varname) nototal missing nosep
        columns(variables|statistics) longstub labelwidth(#)
        format[(%fmt)] casewise save ]

USE tabulate (tab) to display
• simple summaries,
• relative cell counts,
• associative measures between row and column variables

. tabulate case age [fweight=count], chi row col

|          age
case |     <= 29      >= 30 |     Total
-----------+----------------------+----------
Control |      8747       1498 |     10245
|     85.38      14.62 |    100.00
|     77.52      68.68 |     76.09
-----------+----------------------+----------
Case |      2537        683 |      3220
|     78.79      21.21 |    100.00
|     22.48      31.32 |     23.91
-----------+----------------------+----------
Total |     11284       2181 |     13465
|     83.80      16.20 |    100.00
|    100.00     100.00 |    100.00

Pearson chi2(1) =   78.3698   Pr = 0.000
USE table to display statistics concerning variables in a table and to gain
control of the table layout, for publication purposes

. table time, c(mean hr mean sbp freq) format(%8.3f) by(grp)

----------+-----------------------------------
grp and   |
time      |   mean(hr)   mean(sbp)       Freq.
----------+-----------------------------------
N         |
0 |     71.722     138.778          18
1 |     77.833     134.000          18
2 |    124.278     129.429          18
3 |    127.778     135.667          18
----------+-----------------------------------
P         |
0 |     76.812     144.909          16
1 |     79.688     136.545          16
2 |     97.812     136.571          16
3 |    112.750     140.857          16
----------+-----------------------------------
USE tabdisp to display your own calculations, and observations, in a table
in which you do not want any special treatment of the numbers

. tabdisp case age, cellvar(Chi) format(%7.2f)

----------+-------------
          |     age
     case | <= 29 >= 30
----------+-------------
  Control |  3.04 15.71
     Case |  9.66 49.97
----------+-------------

The categorical variables case and age ‘structure’ the table; the cells hold
the values of the variable ‘Chi’, displayed using the indicated format.
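
* A sketch under assumptions, not the author's code: the slides do not show how
* the variable Chi was built. Its four values sum to about 78.37, the Pearson
* chi2 reported for this table, so it is ASSUMED here to hold each cell's
* contribution (observed - expected)^2 / expected. The data are the four cell
* counts of the breast cancer table, coded 0/1 as in the later coding slide.
* The names rowtot, coltot, n, and expected are hypothetical helpers.
clear
input byte age byte case long count
0 0 8747
0 1 2537
1 0 1498
1 1  683
end

* Row totals, column totals, and grand total of the cell counts
egen rowtot = total(count), by(case)
egen coltot = total(count), by(age)
egen n      = total(count)

* Expected count under independence, then the per-cell contribution
generate double expected = rowtot*coltot/n
generate double Chi      = (count - expected)^2/expected

* Without value labels the stubs show the 0/1 codes instead of the labels
tabdisp case age, cellvar(Chi) format(%7.2f)
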
USE tabstat to display the (same comprehensive array of) statistics of
(several) variables in a table and to gain control of the table layout, for
publication purposes, when only one classifying (categorical) variable is
considered

. tabstat hr* sbp*, by(grp) stat(mean med max sd) format(%6.2f) long

grp    stats |       hr1       hr2       hr3      sbp1      sbp2      sbp3
-------------+------------------------------------------------------------
N       mean |     77.83    124.28    127.78    134.00    129.43    135.67
p50 |     71.00    115.00    120.00    134.00    128.00    129.00
max |    125.00    230.00    230.00    160.00    150.00    180.00
sd |     21.00     49.55     44.41     15.84     17.08     28.38
-------------+------------------------------------------------------------
P       mean |     79.69     97.81    112.75    136.55    136.57    140.86
p50 |     76.50     82.00    110.00    134.00    130.00    150.00
max |    120.00    190.00    180.00    180.00    206.00    188.00
sd |     19.95     45.45     47.34     22.40     36.29     27.85
-------------+------------------------------------------------------------
Tot     mean |     78.71    111.82    120.71    135.40    133.00    138.46
p50 |     74.50     97.00    120.00    134.00    130.00    140.00
max |    125.00    230.00    230.00    180.00    206.00    188.00
sd |     20.22     48.82     45.75     19.27     27.50     27.03
-------------+------------------------------------------------------------

Note the use of the ‘*’ wildcard to ensure that all variables whose names
start with ‘hr’ and ‘sbp’ are included in the table

Designing a table

Use the following principles in designing a table

• Tables should be wider than they are long
• The row and column headings should clearly describe the specific classes
• Arrange entries in the table so that trends and differences are clear
• Only use super-rows and super-columns sparingly, and check with journals
• Ensure that the summary attribute of a cell is presented in proportion to its
importance (e.g. means should be more evident than errors)
• Don’t over-crowd tables; numbers should be specified to no greater
precision than can be justified

When creating a table in an exploratory fashion carefully articulate the
categories to be ‘cross tabulated’ and the nature of the summarization you
require

Preparing data for tabulation: coding, labeling, and screening

A study (Rosner chap. 10) elucidated the following relationship between
breast cancer (cases) in women and age at birth of first child
|            age
case |     <= 29      >= 30 |     Total
-----------+----------------------+----------
Control |      8747       1498 |     10245
Case |      2537        683 |      3220
-----------+----------------------+----------
Total |     11284       2181 |     13465

A number of questions arise if we want to investigate this data:

how do we select a set of variables to represent this information?
how do we decide on a coding scheme for the variables?
how do we ensure that we can recall the coding pattern (labeling)?
how do we reproduce the above table?
how can we establish whether age at first birth and breast cancer are related?

how do we select a set of variables to represent this information

There are two categorical variables, case/control, and age (<=29/>=30)
that can be used to specify each cell in the table.
Additionally, we’ll need a variable to quantify the number of entries in
each cell.
The variables are then, for example
case
age
count

how do we decide on a coding scheme for the variables
The general approach to coding dichotomous categorical variables is
to set the state of interest to the value ‘1’, and the other state to the
value ‘0’.
Here since we are concerned with breast cancer as the ‘case’ state, we
will code the variable case as ‘1’ for case, ‘0’ for control.
Note that mnemonic matching (names and codes that recall their meaning) is
also very important.
Similarly, our exposure is delaying first birth, so our state of interest is
age>=30, so we will code age as ‘1’ for age >= 30, and ‘0’ otherwise.
We will code the variable count as given by cell occupancy in the table

Summarizing:
  case     1: breast cancer case
           0: non-breast cancer (i.e. control)

  age      1: age>=30 at first birth
           0: age<=29 at first birth

  count    cell frequency or occupancy

Is ‘case’ actually a good name for the variable here?
Now we have the following table:

. list

age        case       count
1.           0           0        8747
2.           0           1        2537
3.           1           0        1498
4.           1           1         683

Unfortunately, the values here are not very symbolic of their significance
how do we reproduce the above table

We can label the table values as follows:
label   define   alabel 1 ">= 30" 0 "<= 29"
label   values   age alabel
label   define   clabel 1 Case 0 Control
label   values   case clabel

Now the table values appear as follows (the values prior to tabulation):

. list case age count

        case       age      count
  1. Control     <= 29       8747
  2.    Case     <= 29       2537
  3. Control     >= 30       1498
  4.    Case     >= 30        683

Note the use of frequency weighting in the tabulation command:

. tab case age [fwe=count]

           |          age
      case |     <= 29      >= 30 |     Total
-----------+----------------------+----------
   Control |      8747       1498 |     10245
      Case |      2537        683 |      3220
-----------+----------------------+----------
     Total |     11284       2181 |     13465

how do we ensure that we can recall the coding pattern (labeling)

The STATA command codebook helps us with the management of categorical
variables, especially with screening them
. codebook age case

age --------------------------------------------------------------- (unlabeled)
type: numeric (byte)
label: alabel

range:   [0,1]                        units:   1
unique values:   2                    coded missing:   0 / 4

tabulation:   Freq.   Numeric   Label
2         0   <= 29
2         1   >= 30

case -------------------------------------------------------------- (unlabeled)
type: numeric (byte)
label: clabel

range:   [0,1]                        units:   1
unique values:   2                    coded missing:   0 / 4

tabulation:   Freq.   Numeric   Label
2         0   Control
2         1   Case
These numbers were entered into STATA using the STATA Data Editor viz:

age          case          count
1             0            0           8747
2             0            1           2537
3             1            0           1498
4             1            1            683
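
* A minimal sketch, offered as an alternative to the Data Editor (an addition,
* not from the original slides): the same four rows can be typed with the
* input command, which keeps the data entry reproducible in a do-file.
* The storage types byte and long are a choice made here, not the author's.
clear
input byte age byte case long count
0 0 8747
0 1 2537
1 0 1498
1 1  683
end

* Attach the value labels defined earlier, then check the result
label define alabel 1 ">= 30" 0 "<= 29"
label values age alabel
label define clabel 1 "Case" 0 "Control"
label values case clabel
list case age count
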
how can we establish whether age at first birth and breast cancer are related

Request a chi-squared test (1 df) of an association between breast cancer and
age:

. tabulate case age [fwe=count], chi

|          age
case |     <= 29      >= 30 |     Total
-----------+----------------------+----------
Control |      8747       1498 |     10245
Case |      2537        683 |      3220
-----------+----------------------+----------
Total |     11284       2181 |     13465

Pearson chi2(1) =         78.3698       Pr = 0.000

H0: No association between age and breast cancer
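
* A sketch, not from the original slides: the same test can be requested from
* the cell counts alone with the immediate command tabi (rows Control, Case;
* columns <= 29, >= 30).
tabi 8747 1498 \ 2537 683, chi2

* For a 2x2 table the Pearson statistic can also be written as
*   chi2 = N*(a*d - b*c)^2 / (r1*r2*c1*c2)
* where a, b, c, d are the cells, r1, r2 the row totals, c1, c2 the column
* totals, and N the grand total. This should reproduce roughly 78.37.
display 13465*(8747*683 - 1498*2537)^2/(10245*3220*11284*2181)

* Upper-tail probability of a chi-squared(1) variable beyond the statistic;
* the "Pr = 0.000" in the output is this value rounded to three decimals,
* so H0 is rejected.
display chi2tail(1, 78.3698)
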
Some statistical tests relating to tabulated data
Statistical tests relating to tables address the questions

1: is there an association between the row and column category variables
2: if the row or column variable is ‘ordered’ is there a trend relationship
between the outcome category, and the ordered category
3: if the row and column variables are both ordered is there a trend
relationship between them evident in the cell counts

1. The association test

Here we are exploring whether there is an association between smoking and use
of oral contraceptive.

. tabulate oc sm, chi ex

            |            smokelev
      ocuse |   0 per d   1-24 per   >=25 per |     Total
------------+---------------------------------+----------
Non oc-user |       788        645        379 |      1812
    Oc-user |        56         54         54 |       164
------------+---------------------------------+----------
      Total |       844        699        433 |      1976

          Pearson chi2(2) =  13.2758   Pr = 0.001
           Fisher's exact =                 0.002

The null hypothesis of no such association is rejected.

2. The trend test

We can explore the existence of a linear trend relationship between
oral contraceptive use and smoking
. nptrend oc, by(smoke)

smokelev       score         obs       sum of ranks
0           0         844         820414
1           1         699         686995
2           2         433         445866

z =     3.37
P>|z| =    0.00

We see that the null hypothesis of no such trend is rejected, at the 5%
level.

Further, using the association test, we can explore the existence of a higher
than linear relationship, viz.

  chi-square nonlinear trend (1 df)
      = chi-square association (2 df) - chi-square linear trend (1 df)
      = 13.28 - 3.37^2
      ≈ 1.92 (1 df)

This is below the 5% critical value of 3.84 for 1 df, so there is no such
residual trend.
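
* A minimal sketch of the arithmetic above (an addition, not from the original
* slides), using the Pearson chi2 from the association test and the z value
* from nptrend as printed in the output.
* Residual (nonlinear) chi-square on 1 df:
display 13.2758 - 3.37^2
* Its upper-tail probability under a chi-squared(1) distribution:
display chi2tail(1, 13.2758 - 3.37^2)
* The 5% critical value for 1 df, for comparison:
display invchi2tail(1, 0.05)
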
3. A two-way test of trends can be performed with the use of Kendall’s tau

For the OCMI data we could explore the potential relationship between age and
smoking level, viz:

. tabulate age smoke, tau row col

           |             smokelev
    agecat |   0 per d   1-24 per   >=25 per |     Total
-----------+---------------------------------+----------
     25-29 |     11862       6867       1675 |     20404
           |     58.14      33.66       8.21 |    100.00
           |     10.06       9.82       8.31 |      9.81
-----------+---------------------------------+----------
     30-34 |     30794      20290       5542 |     56626
           |     54.38      35.83       9.79 |    100.00
           |     26.11      29.03      27.51 |     27.23
-----------+---------------------------------+----------
     35-39 |     23482      14404       3783 |     41669
           |     56.35      34.57       9.08 |    100.00
           |     19.91      20.61      18.78 |     20.04
-----------+---------------------------------+----------
     40-44 |     27342      17357       5671 |     50370
           |     54.28      34.46      11.26 |    100.00
           |     23.19      24.83      28.15 |     24.22
-----------+---------------------------------+----------
     45-49 |     24438      10981       3474 |     38893
           |     62.83      28.23       8.93 |    100.00
           |     20.72      15.71      17.24 |     18.70
-----------+---------------------------------+----------
     Total |    117918      69899      20145 |    207962
           |     56.70      33.61       9.69 |    100.00
           |    100.00     100.00     100.00 |    100.00

          Kendall's tau-b =   0.0103  ASE = 0.019

It seems as though there isn’t an association between age and smoking level,
since |tau/ASE| ~= 0.53 << 2.0
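
* A minimal sketch of the ratio used above (an addition, not from the original
* slides). It ASSUMES tau-b divided by its ASE can be read as an approximate
* standard normal z statistic, and that normal() is available (the cumulative
* normal function in recent versions of Stata).
* From the rounded values printed in the output the ratio is about 0.54; the
* 0.53 quoted above presumably comes from unrounded values.
display 0.0103/0.019
* Approximate two-sided p-value for that ratio:
display 2*(1 - normal(abs(0.0103/0.019)))
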
Exercise
The dataset ‘fev.dta’ is from Rosner (Fundamentals of Biostatistics, 2000, p. 40) and presents
aspects of a longitudinal study of respiratory function of children in the East Boston area of
Massachusetts. The meaning of the data is as shown below
FEV.DOC

Variable #      Variable
----------|---------------------------
1         |  ID number
2         |  Age (yrs)
3         |  FEV (liters)
4         |  Height (inches)
5         |  Sex
|      0=female, 1=male
6         |  Smoking Status
|      0=non-current smoker,
|      1=current smoker
----------|---------------------------

Perform the following steps to facilitate an exploration of this data:
• Label the variables and variable values of the dataset appropriately
• Create a table to expose the level of smoking by age and sex … does this make
  sense?
• How does the distribution of age vary with sex, as evidenced by appropriate
  tabulation?
• How does the distribution of height vary with sex and smoking status, again as
  evidenced via an appropriate table?
• Create a demographic table as would be appropriate in a journal article
  introducing this data and the investigation
• Create a table of the measurement variable, fev, making clear important
  statistical properties (e.g. mean, standard deviation, median, and range) and
  specifically revealing how these traits vary with sex, age, and smoking status
• Using appropriate classification, create categorical representations of height
  and age and, using tabular methods, see if you can obtain an understanding as
  to how fev may vary with age and height
• Incorporate sex and smoking status into your investigation of important
  determinants of fev as evident in the fev dataset

```