organization 1863 1278958177 by oRabHQor

VIEWS: 0 PAGES: 17

									Chapter 1-11. Compilation of Frequently Used Variable Generation and
Modifying Commands
This chapter contains a short synopsis of the most frequently used variable creation and
modification commands.

Deleting variables

         drop variable(s)

Examples:

         drop v1
         drop v2 v3
         drop v1-v5 v8

       clear
       input v1 v2 v3 v4 v5 v6 v7
       1 2 3 4 5 6 7
       8 9 10 11 12 13 14
       end
       list
       drop v1
       drop v2 v3
       drop v4-v6
       list

         . list

                +----------------------------------+
                | v1   v2   v3   v4   v5   v6   v7 |
                |----------------------------------|
             1. | 1     2    3    4    5    6    7 |
             2. | 8     9   10   11   12   13   14 |
                +----------------------------------+

         .   drop v1
         .   drop v2 v3
         .   drop v4-v6
         .   list

                +----+
                | v7 |
                |----|
             1. | 7 |
             2. | 14 |
                +----+




_________________

Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah
School of Medicine, 2010.


Chapter 1-11 (revision 16 May 2010)                                                                              p. 1
Deleting observations

To delete a subject (row in the data editor), use

         drop in range

Examples:

         drop in 3                 , which deletes 3rd row (3rd observation) in data editor
         drop in 1/5               , which deletes rows 1 to 5
         drop in -5/l              , which deletes 5th from last to last rows

       clear
       input id v1 v2
       1 1 2
       2 3 4
       3 5 6
       4 7 8
       5 9 10
       end
       list
       drop in -2/l
       list
       drop in 2
       list

. list

       +--------------+
       | id   v1   v2 |
       |--------------|
  1.   | 1     1    2 |
  2.   | 2     3    4 |
  3.   | 3     5    6 |
  4.   | 4     7    8 |
  5.   | 5     9   10 |
       +--------------+

. drop in -2/l
. list

     +--------------+
     | id   v1   v2 |
     |--------------|
  1. | 1     1    2 |
  2. | 2     3    4 |
  3. | 3     5    6 |
     +--------------+

. drop in 2
. list

     +--------------+
     | id   v1   v2 |
     |--------------|
  1. | 1     1    2 |
  2. | 3     5    6 |
     +--------------+
Chapter 1-11 (revision 16 May 2010)                                                      p. 2
Generating variables

BMI example

The equation for BMI is:      body mass index (BMI) = weight/height2 (units: kg/m2)

If height is in inches, then first convert to meters using
     1 inch = 0.0254 meter

      generate heightm = heightin*0.0254 // convert from inches to meters
         <or, using the command abbrevation>
      gen heightm = heightin*0.0254

If weight is in pounds, then first convert to kilograms using
    1 pound = 0.45359237 kilogram

      gen weightkg = weightlbs*0.4536            // convert from pounds to kilograms

Computing BMI from weight (kg) and height (meters),

      gen bmi = weightkg/height^2

Example

      clear
      input heightin weightlbs
      60 180
      63 .
      65 215
      end
      gen heightm = heightin*0.0254 // convert from inches to meters
      gen weightkg = weightlbs*0.4536 // convert from pounds to kilograms
      *gen bmi = weightkg/heightm^2 // use if do not want rounding
      gen bmi = round(weightkg/heightm^2,.1) // round to 1 decimal
      list in 1/5, noobs abbrev(20) // list first five to check

          +--------------------------------------------------+
          | heightin   weightlbs   heightm   weightkg    bmi |
          |--------------------------------------------------|
          |       60         180     1.524     81.648   35.2 |
          |       63           .    1.6002          .      . |
          |       65         215     1.651     97.524   35.8 |
          +--------------------------------------------------+




Chapter 1-11 (revision 16 May 2010)                                                   p. 3
Note: A good practice is to always use the capture drop command in a do-file anytime you
      generate a new variable. Frequently you will want to re-run that block of commands
      again, because you did not get it right the first time. Then, you have to drop the new
      variable before you can execute the gen command again. But, if you put drop in the do-
      file, when you go to run the entire block of commands in the “set up variables” section of
      your do-file, it will crash with an error because you cannot drop a variable that does not
      yet exist. Preceding any command with capture informs Stata to “capture the error
      message”. The effect is that the command executes if it can, but the command is
      ignored if it would otherwise crash with an error message.

Using the “capture drop” idea, it would look like,

     capture drop heightm
     gen heightm = heightin*0.0254

This is just an option. You do not have to do this.




Chapter 1-11 (revision 16 May 2010)                                                  p. 4
Recoding (continuous variable example)

The equation for BMI is:     body mass index (BMI) = weight/height2 in kg/m2

We will recode the BMI variable into the four BMI categories recommended by the National
Heart, Lung, and Blood Institute (1998)(Onyike et al., 2003):

       underweight (BMI <18.5)
       normal weight (BMI 18.5–24.9)
       overweight    (BMI 25.0–29.9)
       obese         (BMI 30)

This can be tricky if you compute BMI from weight and height variables, because of the large
number of decimal places in the data. The best way is to use the recode command and list the
ranges in reverse order. The recode command follows the rule that the value in the range first
encountered does not get replaced when later encountered. For example, we can use the value 25
in two ranges. Using the following command, if 25 occurs in the data, it gets set to 3, but does
not get replaced to 2 in the later listed range. Therefore, the 25 in the range “18.5/25=2” is
actually “18.5/24.9999999=2”.

     recode bmi 30/max=4         25/30=3   18.5/25=2    min/18.5=1 ,gen(bmicat)

Example:

     clear
     input bmi
     18.0
     18.49999
     18.5
     20.0
     24.99999
     25.0
     27.0
     .
     30.0
     32.0
     end
     recode bmi 30/max=4 25/30=3 18.5/25=2 min/18.5=1 ,gen(bmicat)
     tabstat bmi ,stats(min max) nototal missing by(bmicat)

       Summary for variables: bmi
            by categories of: bmicat (RECODE of bmi)

         bmicat |       min       max
       ---------+--------------------
              1 |        18 18.49999               correctly (BMI <18.5)
              2 |      18.5 24.99999               correctly (BMI 18.5–24.9)
              3 |        25        27              correctly (BMI 25.0–29.9)
              4 |        30        32              correctly (BMI 30)
              . |         .         .              correctly missing
       ------------------------------

Chapter 1-11 (revision 16 May 2010)                                                  p. 5
Replacing values of existing variables (numeric variables)

We will use the recoding of BMI to illustrate, although this is much more work than the recode
approach used above.

To create a categorical variable from the continuous BMI variable, using the following
categories,

       underweight (BMI <18.5)
       normal weight (BMI 18.5–24.9)
       overweight    (BMI 25.0–29.9)
       obese         (BMI 30)

we could use

       gen bmicat =1 if bmi < 18.5
       replace bmicat = 2 if 18.5 <= bmi & bmi< 25
       replace bmicat = 3 if 25.0 <= bmi & bmi < 30
       replace bmicat = 4 if 30 <= bmi & bmi ~= .

Note: In Stata, one equal sign “=” denotes assignment, two equal signs, “ ==” denote logical
equals.

We used several “operators” in this example (see box)


Relational Operators
       >    (greater than)
       <    (less than)
       >=   (greater than or equal)
       <=   (less than or equal)
       ==   (equal)
       !=   or ~= (not equal)

Logical Operators
       &     (and)
       |       (or)
       !    or ~    (not)

Arithmetic Operators
       +    (addition)
       -     (subtraction)
       *    (multiplication)
       /     (division)
       ^     (raised to a power, or exponentiation)
       -     (negation)




Chapter 1-11 (revision 16 May 2010)                                                  p. 6
Example:

     clear
     input bmi
     18.0
     18.49999
     18.5
     20.0
     24.99999
     25.0
     27.0
     .
     30.0
     32.0
     end
     gen bmicat =1 if bmi < 18.5
     replace bmicat = 2 if 18.5 <= bmi & bmi< 25
     replace bmicat = 3 if 25.0 <= bmi & bmi < 30
     replace bmicat = 4 if 30 <= bmi & bmi ~= .
     list , noobs sep(0)

           +-------------------+
           |      bmi   bmicat |
           |-------------------|
           |       18        1 |
           | 18.49999        1 |
           |     18.5        2 |
           |       20        2 |
           | 24.99999        2 |
           |       25        3 |
           |       27        3 |
           |        .        . |
           |       30        4 |
           |       32        4 |
           +-------------------+


       underweight (BMI <18.5)
       normal weight (BMI 18.5–24.9)
       overweight    (BMI 25.0–29.9)
       obese         (BMI 30)
We see that this approached created the categories correctly and that the missing value remained
missing.




Chapter 1-11 (revision 16 May 2010)                                                  p. 7
Caution: In Stata, a numeric missing value is stored as “a very large number”, or something
which can be thought of as +∞, for all practical purposes.

That is why we had to use the following in the above example:

       replace bmicat = 4 if 30 <= bmi & bmi ~= .

rather than using:

       replace bmicat = 4 if 30 <= bmi

since a “very large number”, which is what “missing” is stored as, satisfies this condition and
erroneous gets set to 4.

Illustrating this WRONG WAY to do it,

     clear
     input bmi
     18.0
     18.49999
     18.5
     20.0
     24.99999
     25.0
     27.0
     .
     30.0
     32.0
     end
     gen bmicat =1 if bmi < 18.5
     replace bmicat = 2 if 18.5 <= bmi & bmi< 25
     replace bmicat = 3 if 25.0 <= bmi & bmi < 30
     replace bmicat = 4 if 30 <= bmi    // WRONG WAY TO DO IT
     list , noobs sep(0)


          +-------------------+
          |      bmi   bmicat |
          |-------------------|
          |       18        1 |
          | 18.49999        1 |
          |     18.5        2 |
          |       20        2 |
          | 24.99999        2 |
          |       25        3 |
          |       27        3 |
          |        .        4 |       <- not what we want to happen
          |       30        4 |
          |       32        4 |
          +-------------------+




Chapter 1-11 (revision 16 May 2010)                                                     p. 8
Replacing values of existing variables (string variables)

When working with string variables, a string constant is something inside double quotes.

Example,

      clear
      input str3 hf
      Y
      n
      N
      y
      na
      end
      tab hf , missing


                hf |      Freq.     Percent        Cum.
       ------------+-----------------------------------
                 N |          1       20.00       20.00
                 Y |          1       20.00       40.00
                 n |          1       20.00       60.00
                na |          1       20.00       80.00
                 y |          1       20.00      100.00
       ------------+-----------------------------------
             Total |          5      100.00

For this example, we want to create a heart failure indictor variable from the string variable hf.
We could use,

      gen hfi = 1 if hf=="Y" | hf=="y"
      replace hfi = 0 if hf=="N" | hf=="n"
      replace hfi = . if hf=="na"
      list , noobs sep(0)
      tab hfi

           +----------+
           | hf   hfi |
           |----------|
           | Y      1 |
           | n      0 |
           | N      0 |
           | y      1 |
           | na     . |
           +----------+

               hfi |      Freq.     Percent        Cum.
       ------------+-----------------------------------
                 0 |          2       50.00       50.00
                 1 |          2       50.00      100.00
       ------------+-----------------------------------
             Total |          4      100.00




Chapter 1-11 (revision 16 May 2010)                                                      p. 9
Using this same example, but simply creating a new cleaned up string variable,

     clear
     input str3 hf
     Y
     n
     N
     y
     na
     end
     tab hf , missing


                hf |      Freq.     Percent        Cum.
       ------------+-----------------------------------
                 N |          1       20.00       20.00
                 Y |          1       20.00       40.00
                 n |          1       20.00       60.00
                na |          1       20.00       80.00
                 y |          1       20.00      100.00
       ------------+-----------------------------------
             Total |          5      100.00


We could use,

     gen hf2 = "Y" if hf=="Y" | hf=="y"
     replace hf2 = "N" if hf=="N" | hf=="n"
     replace hf2 = "" if hf=="na"
     list , noobs sep(0)
     tab hf2


          +----------------+
          | hf   hfi   hf2 |
          |----------------|
          | Y      1     Y |
          | n      0     N |
          | N      0     N |
          | y      1     Y |
          | na     .       |
          +----------------+

               hf2 |      Freq.     Percent        Cum.
       ------------+-----------------------------------
                 N |          2       50.00       50.00
                 Y |          2       50.00      100.00
       ------------+-----------------------------------
             Total |          4      100.00


In this example, we used "" (two quotes with no embedded space, called the “null string”) to
denote a missing value for a string variable.




Chapter 1-11 (revision 16 May 2010)                                                 p. 10
Converting variables that are letters or words from string to numeric (encode)

Starting with a string variable contains letters,

      clear
      input str3 current_smoker
      Y
      ""
      N
      end
      list , abbrev(15)

             +----------------+
             | current_smoker |
             |----------------|
          1. |              Y |
          2. |                |
          3. |              N |
             +----------------+


The encode command will convert this to numeric, assigning the values 1, 2, … according to the
alphebetical order of the string variable.

      encode current_smoker , gen(smoker)
      list , abbrev(15)

             +-------------------------+
             | current_smoker   smoker |
             |-------------------------|
          1. |              Y        Y |
          2. |                       . |
          3. |              N        N |
             +-------------------------+


It looks like nothing happened, but the data are really 1 and 2, with value labels automatically
assigned. To see this, we can list with the “nolabel” option.

      list , abbrev(15) nolabel

             +-------------------------+
             | current_smoker   smoker |
             |-------------------------|
          1. |              Y        2 |
          2. |                       . |
          3. |              N        1 |
             +-------------------------+



… continued …




Chapter 1-11 (revision 16 May 2010)                                                    p. 11
To convert this to a 0-1 variable, use

      recode smoker 2=1 1=0
      list , abbrev(15) nolabel

             +-------------------------+
             | current_smoker   smoker |
             |-------------------------|
          1. |              Y        1 |
          2. |                       . |
          3. |              N        0 |
             +-------------------------+

The automatically assigned value labels are still present, so we might as well drop them.

      list , abbrev(15)
      label drop smoker
      list , abbrev(15)

             +-------------------------+
             | current_smoker   smoker |
             |-------------------------|
          1. |              Y        N |
          2. |                       . |
          3. |              N        0 |
             +-------------------------+

             +-------------------------+
             | current_smoker   smoker |
             |-------------------------|
          1. |              Y        1 |
          2. |                       . |
          3. |              N        0 |
             +-------------------------+

The required sequence of steps using this approach is

      capture drop smoker                // useful if smoker exists already
      encode current_smoker , gen(smoker) // convert from string to numeric
      recode smoker 2=1 1=0
      label drop smoker                   // remove value labels
      tab current_smoker smoker , nolabel // check our work

       current_sm |        smoker
             oker |         0          1 |     Total
       -----------+----------------------+----------
                N |         1          0 |         1
                Y |         0          1 |         1
       -----------+----------------------+----------
            Total |         1          1 |         2




Chapter 1-11 (revision 16 May 2010)                                                   p. 12
Converting variables that look like numbers from string to numeric (destring)

If you read in a variable that contains any non-numeric character, Stata will convert it to string in
order to keep that non-numeric character intack.

For example,

      clear
      input x
      1
      na
      2
      end
      list
      describe

             +----+
             | x |
             |----|
          1. | 1 |
          2. | na |
          3. | 2 |
             +----+

       . describe

       Contains data
          obs:            3
         vars:            1
         size:           30 (99.9% of memory free)
       -------------------------------------------------------------------------------
                     storage display      value
       variable name   type   format      label      variable label
       -------------------------------------------------------------------------------
       x               str2   %9s
       -------------------------------------------------------------------------------


To clean this up, we first must change the “na” to “”,

      gen x2 = x
      replace x2="" if x2=="na"
      list
      describe




Chapter 1-11 (revision 16 May 2010)                                                     p. 13
             +---------+
             | x    x2 |
             |---------|
          1. | 1     1 |
          2. | na      |
          3. | 2     2 |
             +---------+

       -------------------------------------------------------------------------------
                     storage display      value
       variable name   type   format      label      variable label
       -------------------------------------------------------------------------------
       x               str2   %9s
       x2              str2   %9s
       -------------------------------------------------------------------------------

Athough the variable contains only numbers, it is still a string variable at this point.

We can now convert it to numeric using,

      destring , replace                      // destrings all variables if it can
        <or>
      destring x2 , replace
      describe

       x2 has all characters numeric; replaced as byte

       Contains data
          obs:            3
         vars:            2
         size:           33 (99.9% of memory free)
       -------------------------------------------------------------------------------
                     storage display      value
       variable name   type   format      label      variable label
       -------------------------------------------------------------------------------
       x               str2   %9s
       x2              byte   %10.0g
       -------------------------------------------------------------------------------


Notice it replaced the string variable with storage type “byte”. It used that type because it
chooses the storage type that uses up the least amount of disk space (see box). All you have to
remember is that all storage types are numeric except “string”.

Storage types, or datatypes

       Stata has six datatypes for variables:
       float = real numbers with 8.5 digits of precision
       double = real numbers with 16.5 digits of precision
       byte = integers between -127 and 100
       int = integers between -32,767 and 32,740
       long = integers between -2,147,483,467 and 2,147,483,620
       string = strings from 1 to 80 characters for Intercooled Stata,
                and from 1 to 244 for Stata/SE



Chapter 1-11 (revision 16 May 2010)                                                        p. 14
Creating dummy variables

A fast way to create dummy variables is to use the tabulate command.

Example,

     clear
     input group
     1
     1
     2
     2
     3
     3
     end
     tabulate group , gen(group_)
     list , noobs sepby(group)

           +-------------------------------------+
           | group   group_1   group_2   group_3 |
           |-------------------------------------|
           |     1         1         0         0 |
           |     1         1         0         0 |
           |-------------------------------------|
           |     2         0         1         0 |
           |     2         0         1         0 |
           |-------------------------------------|
           |     3         0         0         1 |
           |     3         0         0         1 |
           +-------------------------------------+

We see the “,gen( )” option instructs the tabulate command to create new variables that are
named the prefix we specify, with a suffix set to whatever the value of the original variable is.
The variable then contains a 1 to indicate the original variable had that value and zero otherwise.

It is usually helpful to rename these variables with meaningful names.

     rename    group_1 NoDose
     rename    group_2 LowDose
     rename    group_3 HiDose
     list ,    noobs sepby(group)

           +-----------------------------------+
           | group   NoDose   LowDose   HiDose |
           |-----------------------------------|
           |     1        1         0        0 |
           |     1        1         0        0 |
           |-----------------------------------|
           |     2        0         1        0 |
           |     2        0         1        0 |
           |-----------------------------------|
           |     3        0         0        1 |
           |     3        0         0        1 |
           +-----------------------------------+


Chapter 1-11 (revision 16 May 2010)                                                    p. 15
Creating quantiles

A fast way to create quantiles of a variable is to use the xtile command. Use “nq(3)” for tertiles,
“nq(4)” for quartiles, “nq(5)” for quintiles, etc.

Example, to create tertiles,

      clear
      input x
      15
      16
      5
      30
      40
      10
      18
      25
      35
      end
      tab x
      xtile x3 = x ,nq(3)
      tab x3
      tabstat x , by(x3) stat(min max)

                 x |      Freq.     Percent        Cum.
       ------------+-----------------------------------
                 5 |          1       11.11       11.11
                10 |          1       11.11       22.22
                15 |          1       11.11       33.33
                16 |          1       11.11       44.44
                18 |          1       11.11       55.56
                25 |          1       11.11       66.67
                30 |          1       11.11       77.78
                35 |          1       11.11       88.89
                40 |          1       11.11      100.00
       ------------+-----------------------------------
             Total |          9      100.00

       3 quantiles |
             of x |       Freq.     Percent        Cum.
       ------------+-----------------------------------
                 1 |          3       33.33       33.33
                 2 |          3       33.33       66.67
                 3 |          3       33.33      100.00
       ------------+-----------------------------------
             Total |          9      100.00

       Summary for variables: x
            by categories of: x3 (3 quantiles of x )

             x3 |       min       max
       ---------+--------------------
              1 |         5        15
              2 |        16        25
              3 |        30        40
       ---------+--------------------
          Total |         5        40
       ------------------------------


Chapter 1-11 (revision 16 May 2010)                                                    p. 16
References

Onyike CU, Crum RM, Lee HB, Lyketsos CG, Eaton WW. (2003). Is obesity associated with
      major depression? Results from the third national health and nutrition examination
      survey. Am J Epidemiol 158(12):1139-1153.




Chapter 1-11 (revision 16 May 2010)                                             p. 17

								
To top