

									 For demonstration only

            Module 5

Basic Measures and
Statistics Used in the
  Data Warehouse

  The Teaching Modules on Aging
             Developed by

    at The University of Michigan,
       William H. Frey, Director


The National Center for Health Statistics,

           with support from

    The National Institute on Aging

     Before beginning this module, please open the Data Warehouse
     on Trends in Health and Aging CD-ROM or go to

Introduction. In examining the tables and data available through the NCHS Data
Warehouse on Trends in Health and Aging, it is important to have a fundamental
understanding of basic measures and statistics. This module is designed to familiarize
users of the Data Warehouse on Trends in Health and Aging with key measures and
statistical terms. It is intended both as a guide to the data and terms in the modules and
as a starting point for more in-depth analysis of the data. The background on statistical
methods provided here allows users to understand the data at the level necessary to
follow the modules, and goes a little beyond that to form a fundamental knowledge base
for statistics. For more systematic and comprehensive learning, please refer to a
statistics textbook such as “Fundamentals of Biostatistics” by B. Rosner.

        Before working on this module, it is advisable to review the “Teaching module on
the data sources” and “The Teaching Module on the Access to the Data”, which contains
the tutorial on using Beyond 20/20. Please refer to these sources if you are uncertain how
to perform a given task.

     When working with the Beyond 20/20 tables, always review the
                explanatory messages (Summaries).

        To find out more information about the data source, or the numbers and
calculation methods for a particular table, select Summary from the File menu. A window
like the one below for the table on the death rates will pop up with valuable information
on the table itself and the relevant survey.
I. Aggregated Measures
       The data presented in the Data Warehouse on Trends in Health and Aging are
aggregated by sex, age, and race, and sometimes by other dimensions such as health status
or income level. In the table below regarding total tooth loss, for example, it is not
possible to obtain the prevalence of total tooth loss for the age group 65-69, since the
data have been pre-tabulated into the age group 65-74.

It is important to note that the data in the tables cannot be viewed at a lower
aggregation level. For instance, if a user wanted to examine the data in the table below
for the health status of people in “Excellent” condition, these data would be unavailable
because the Data Warehouse only makes available pre-calculated data for those in
“Excellent” or “Very Good” condition as a single “Excellent/Very Good” variable. This is
an aggregated measure. If the level of aggregation of the data presented in the Data
Warehouse is not sufficient for a user, it may be possible to find some data in the NCHS
publications, or to download data sets from the NCHS web site and perform additional
research and calculations to obtain the needed level of aggregation.

               Question. Examine a few tables under the category “Risk Factors”. In
               your opinion, what were the reasons for presenting the aggregated data?
               Discuss the strengths and limitations of this method of presenting the data.

        One of the purposes of survey data aggregation is to obtain valid statistical
estimates based on a sufficient number of observations (respondents, events) in the
survey. This is especially true when the data are aggregated over a few years. For example,
in the table on health status the average estimates were obtained for 3-year periods:
1993-1995, 1994-1996, 1995-1997, etc. Note that in this table a moving average is used:
the estimates were obtained for each available combination of 3 consecutive years.
The moving average is used to reduce the fluctuation in the estimates.
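
        The smoothing effect of a 3-year moving average can be sketched in a few lines of
Python. This is only an illustration: the annual values are hypothetical and are not taken
from the Data Warehouse, the function name moving_average is our own, and the sketch
takes a simple unweighted average of annual estimates, whereas a survey table may
instead pool the underlying records for the 3 years.

```python
def moving_average(values, window=3):
    """Return the average of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical annual estimates (percent); NOT real Data Warehouse values.
years = [1993, 1994, 1995, 1996, 1997]
annual = [30.0, 36.0, 33.0, 39.0, 36.0]

smoothed = moving_average(annual, window=3)

# Label each average with the first and last year it covers.
labels = [f"{years[i]}-{years[i + 2]}" for i in range(len(smoothed))]

for label, value in zip(labels, smoothed):
    print(label, round(value, 1))
```

The annual values swing by up to 6 points from year to year, while the three smoothed
values differ by at most 3 points.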

II. Counts, Rates, and Percents
         Counts, rates, and percents are the basic measures used in the Data Warehouse on
Trends in Health and Aging, and a clear definition of each is critical to understanding
the data.
         A count is the number representing the population or events of interest. For
example, in the table on mortality, the count is the number of deaths that occurred in the
population. In the table on smoking, the count is the weighted number of persons who
responded positively to a survey question about smoking. Examples of a count that is
made up of “events” include the tables on hospital discharges, where the count is
the weighted number of sampled hospital discharges. There are a few issues involved in
the presentation of the count in the Data Warehouse on Trends in Health and Aging
tables that might be helpful to point out:
• When the data source for the table is a population survey, such as the National Health
    Interview Survey, the Behavior Risk Factor Surveillance System, or the Medicare
    Current Beneficiary Survey, only a subset of a given population is sampled. Each
    person in the survey represents a portion of the population of interest, and survey
    numbers must be adjusted (weighted) to reflect the national or, in the case of the
    Behavior Risk Factor Surveillance System, State figures more accurately. The
    population sampled is defined by the survey scope and purpose. For example, the
    population represented by the Medicare Current Beneficiary Survey is Medicare
    beneficiaries in the United States and its territories, and the population represented
    by the Behavior Risk Factor Surveillance System is all persons living in the particular
    State in households with a telephone.
• The same is true for surveys of events, such as the National Hospital Discharge Survey
    (NHDS). In this case, every discharge in the survey represents a portion of all hospital
    discharges, and the survey numbers have to be weighted to represent national
    estimates. Please note that NHDS samples discharges, not patients, and a given
    individual could be hospitalized multiple times during the year. Therefore, the count
    of discharges represents utilization of hospitals by the population, not the
    number of people being hospitalized.
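
        The difference between a raw sample count and a weighted count can be sketched
as follows. The records and weights are hypothetical; real surveys supply a weight
variable with each record, and its name and construction vary by survey.

```python
# Each record: (answered_yes, survey_weight).
# The weight is the number of people in the population that the
# respondent represents. All values here are hypothetical.
respondents = [
    (True, 1500.0),
    (False, 2200.0),
    (True, 1800.0),
    (False, 2500.0),
]

# Unweighted count: how many sampled persons answered "yes".
unweighted_count = sum(1 for yes, _ in respondents if yes)

# Weighted count: the estimated number of persons in the population
# who would answer "yes".
weighted_count = sum(weight for yes, weight in respondents if yes)

# Weighted total: the estimated size of the population represented.
weighted_total = sum(weight for _, weight in respondents)

print(unweighted_count, weighted_count, weighted_total)
```

The same pattern applies to surveys of events: each sampled discharge carries a weight,
and the weighted count estimates the number of discharges, not the number of patients.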

        A rate is a measure of some event, disease, or condition in relation to a unit of
population, along with some specification of time. For example, an annual death rate is
calculated by dividing the number of deaths in a given year by the midyear resident
population, as of July 1, and is expressed as the number of deaths per 100,000 population.
In the case of mortality rates, the numerator for the rate calculation is the number of
deaths (the count) that occurred in a year, and the denominator is the midyear population.
To obtain the rate per 100,000 population we have to multiply this ratio by 100,000:

       Death RATE per 100,000 =
              (Numerator = COUNT = Number of deaths in a year) /
              (Denominator = POPULATION = Midyear U.S. resident population)
              * 100,000

In the Data Warehouse on Trends in Health and Aging, each table showing rates also
shows the numerator (the count) and the denominator (the population) used for the rate
calculation.
• Open the table “Visits to Office-Based Physicians” under the category “Health Care
    Utilization”. The rates of visits per 100 represent the average number of visits to the
    doctor’s office made in a year by 100 persons. For example, in 1999 and 2000, every
    100 white women 65-74 years old made on average 569.1 visits to the doctor’s office;
    that is, each white woman between 65 and 74 visited the doctor’s office on average
    about 5.7 times.
        How to “combine” rates? One might be interested in obtaining the rates for a
broader category than the categories presented in the table. For example, if the age
groups 65-74 years old and 75-84 years old are shown in the table, could we obtain the
rate for 65-84 year olds? Let’s examine national mortality rates from diabetes
mellitus for black males by age group (open the mortality table by race):

The rate for black males 65-74 years old was calculated using the number of deaths
among black males 65-74 years old as the numerator, and the midyear U.S. resident black
male population 65-74 years old as the denominator.
RATE black,male,65-74 = COUNT black,male,65-74 / POPULATION black,male,65-74 * 100,000
                      = 1,266 / 699,329 * 100,000 = 181.0
How was the rate for black males 75-84 years old calculated?
RATE black,male,75-84 = COUNT black,male,75-84 / POPULATION black,male,75-84 * 100,000
                      = 945 / 328,656 * 100,000 = 287.5

Notice that
COUNT black,male,65-84 = COUNT black,male,65-74 + COUNT black,male,75-84

POPULATION black,male,65-84 = POPULATION black,male,65-74 + POPULATION black,male,75-84

The formula for calculating the death rate from diabetes among black males
65-84 years old is:
RATE black,male,65-84  = COUNT black,male,65-84 / POPULATION black,male,65-84 * 100,000
                       = (1,266 + 945) / (699,329 + 328,656) * 100,000
                       = 2,211 / 1,027,985 * 100,000 = 215.1
Hence, the mortality rate from diabetes mellitus for black males 65-84 years
old is 215.1 per 100,000 population.
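
The calculation above can be checked with a short Python sketch that uses the counts and
populations quoted in the text. The function and variable names are our own. The sketch
also computes the simple average of the two age-specific rates, to show that averaging
rates is not the same as combining the counts and populations.

```python
def rate_per_100k(count, population):
    """Rate per 100,000 population: count / population * 100,000."""
    return count / population * 100_000


# Deaths from diabetes mellitus among black males and the midyear
# populations, as quoted in the text above.
groups = {
    "65-74": {"deaths": 1_266, "population": 699_329},
    "75-84": {"deaths": 945, "population": 328_656},
}

# Age-specific rates.
rates = {
    age: rate_per_100k(g["deaths"], g["population"])
    for age, g in groups.items()
}

# Correct way to combine: add the counts, add the populations, then divide.
total_deaths = sum(g["deaths"] for g in groups.values())
total_population = sum(g["population"] for g in groups.values())
combined_rate = rate_per_100k(total_deaths, total_population)

# Incorrect shortcut: averaging the two rates ignores the fact that
# the two age groups have very different population sizes.
naive_average = sum(rates.values()) / len(rates)

for age, value in rates.items():
    print(age, round(value, 1))
print("65-84 combined:", round(combined_rate, 1))
print("average of the two rates:", round(naive_average, 1))
```

The average of the two rates is noticeably higher than the combined rate because the
75-84 group, which has the higher rate, is less than half the size of the 65-74 group.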

               Question. In some tables the count and/or the population are rounded to
               the thousands. Could we combine rates using the data from these tables?
               What would be the result of the calculations above if the population in the
               mortality table were given in thousands?

       A percent is a measure similar to a rate, with both the numerator and the
denominator drawn from the same group.
• Open, for example, the table “Current Cigarette Smoking by Age, Sex, and Race:
   United States, Selected Years 1965-1998” (under the category “Risk Factors”) to
   show the percent of current smokers among persons of all races 65-74 years old. The
   percent of males 65-74 years old who smoked dropped from 31.8 in 1965 to 14.7 in
   1998. In other words, in 1965 on average about 32 out of each 100 men of this age
   were smokers, and in 1998 about 15 out of each 100 men of the same age were smokers.

For population-based surveys, the percent is usually calculated as the weighted number of
persons in the survey responding to the question in a certain way (e.g., “Yes” to the
question about current smoking) divided by the weighted number of respondents who
answered the question. For event-based surveys, the percent is calculated as the weighted
number of the selected events (e.g., visits to the dermatologist) divided by the total
weighted number of events (e.g., visits to all doctors in the survey).

•   Open the table “Nursing Home Residents Receiving Assistance in Activities of Daily
    Living” under the topic “Health Care Utilization”, “Nursing Home”.

Using this table, create the view with the count and percent of residents by age and type
of ADL.

In 1999, 93.9% of nursing home residents 75-84 years old needed help with bathing
and/or showering. The numerator for the calculation of this value is the count of
nursing home residents needing help with Bathing/Showering (485,900), and the
denominator is the total number of residents aged 75-84 (517,600).
        How to “combine” percents? Usually, the tables presenting percents show
the numerator (the count) and the denominator used for the percent calculation. Some
tables show the category corresponding to the 100% (total) count used as the denominator,
and a count used as the numerator for the percent calculation. If both of these values are
presented in the table, the “combined” percent can be calculated using the method
described above for the rates. If the numerator or denominator is not presented in the
table, one should, if possible, perform his or her own calculations using the downloaded
source data system files. For the nursing home table, these files can be found at the
National Nursing Home Survey web-site
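
        As a sketch of the percent calculation and of “combining” percents, the Python
snippet below reproduces the 93.9% figure for residents 75-84 years old from the counts
quoted above. To illustrate the combination step it then adds a second age group whose
counts are purely hypothetical (they are not taken from the nursing home table); the
function and variable names are also our own.

```python
def percent(numerator, denominator):
    """Percent = numerator / denominator * 100."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator * 100


# Residents 75-84 years old, 1999 (counts quoted in the text).
need_help_75_84 = 485_900
residents_75_84 = 517_600
pct_75_84 = percent(need_help_75_84, residents_75_84)

# HYPOTHETICAL second group, used only to show the combination step.
need_help_other = 400_000
residents_other = 450_000

# Combine by adding numerators and denominators, exactly as for rates.
pct_combined = percent(
    need_help_75_84 + need_help_other,
    residents_75_84 + residents_other,
)

print(f"75-84: {pct_75_84:.1f}%")
print(f"combined (with hypothetical group): {pct_combined:.1f}%")
```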

•   Use the table “Visits to Hospital Emergency Departments by the Type of Visit” under
    the topic “Health Care Utilization”, “Emergency Room” and create the view shown

               Question. This table presents the annual data on emergency room visits by
               persons 65-74 years old for the years 1998 and 1999. What is the count in
               this table? The percent of what is shown? How were the rates calculated?
               Please describe each number shown in this view of the table. Use the
               explanatory messages (Summaries) for the entire table (from the File drop-
               down menu), and for the items Rates, Percent, # of Visits, and Population.

III. Age Adjustment

       Some of the tables in the Data Warehouse on Trends in Health and Aging cover the
last 50 years of the 20th century. During this time the population structure
changed dramatically. From the chart below you can see that in 1950, among persons 65
years old and over, about 40% were 65-69 years old and about 5% were 85 years old and
over. In 1999, the percent of 65-69 year olds decreased to 27.4%. At the same time the
percent of persons 85 years old and over more than doubled, to about 12% in 1999.
Note that the chart below was obtained using the data from the resident population tables
by race for 1950-1980 (for 18 age groups) and 1981-1999 (for 20 age groups). The
calculations of the total number of persons 65 years old and over, and of the percent
distribution, were performed in Excel.

[Figure: Population distribution for persons 65 years old and over, by age group
(including 80-84 and 85 and over), for 1950, 1980, 1990, and 1999]

                     Question. Using the population tables mentioned above, try to obtain
                     similar distribution for persons of 50 years old and over, one for 1950 and
                     one for the latest year available. Using the data from these tables calculate
                     the number of people in age groups 50 and over, 50-64, 65-74, 75 and
                     over, and calculate the percent distribution for these groups among persons
                     of 50 years old and over. How did it change in 1999 compared to 1950?

Because the population structure changed so dramatically in the last few decades, the
crude estimates of percents and rates for earlier years represent the experience of a
younger population, while the estimates for the latest years reflect the experience
of an older one. Therefore, the rates and percents for the age groups “65 years old and
over” and “50 years old and over” usually have to be age-adjusted to be compared across
the years.

        Age adjustment is the application of age-specific rates in a population of interest
to a standardized age distribution in order to eliminate differences in rates that result from
age differences in population composition. This adjustment is usually done when
comparing two or more population groups at one point in time or one population group
at two or more points in time. The standardized age distribution used by the National
Center for Health Statistics is the 2000 standard population.

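Age-adjusted rates are calculated by the direct method as follows:

$$ R_{adj} = \sum_{i} \frac{p_i}{P} \, r_i , $$

where r_i is the age-specific rate for age group i in the population of interest, p_i is the
standard population in age group i, and P is the total standard population over the age
groups being combined. In other words, the age-adjusted rate is the weighted average of
age-specific rates, where the values of p_i/P are used as adjustment weights. Age-adjusted
percents are calculated using a similar formula.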

              Question. Using the formula above, show that the values of the crude and
              age-adjusted rates are almost equal when: (a) the rates for the different age
              groups are not much different from each other; (b) the population used as
              the denominator for the calculation of the age-specific rates is close to the
              standard population used to obtain the adjustment weights.

        Table 1 shows the relevant numbers for age adjustment of the estimates for the
age group 65 years old and over. The population subgroups and the corresponding
adjustment weights are shown based on the 2000 Standard Population. Age adjustment
requires use of a standard age distribution. The year 2000 population replaced the 1940
U.S. population for age adjusting mortality statistics.
                            Age      Standard 2000 Population    Adjustment
                                     (in thousands)              Weight

                             65+          34,710                   1.0000
                            65-74         18,136                   0.5225
                            75-84         12,315                   0.3548
                             85+           4,259                   0.1227
Table 1 Age distributions and age-adjustment weights for the population age 65 and over based on
the 2000 standard population
        The adjustment weights are used in conjunction with the estimates of percents or
rates for similar age groups to create the age-adjusted percent or rate for the age group 65
years old and over.

•   Open the table on nursing home residents by age, sex and race under the topic
    “Health Care Utilization”. Arrange the view of the table by age groups and Units:

       The crude rate of 47.18 per 1,000 population for the age group “65+ (crude)” was
obtained by simple division of the number of nursing home residents aged 65 and over
(1,126,008) by the corresponding population (23,864,420). Age-adjusting the rate for
residents aged 65 and over requires the adjustment weights from Table 1. As
shown in Table 2, each adjustment weight (in decimal format) is multiplied by the rate for
the corresponding age-specific group (65-74, 75-84, and 85 and over) to give the result
for each population subgroup. Those results are then added to get the number for the
entire population 65 years old and over. The table shows that the age-adjusted resident
rate for persons 65 years old and over was 58.3 per 1,000.

           Age     Adjustment     Resident Rate        Calculation          Result
                   Weight         (per 1,000)

          65-74      0.5225           14.41            14.41 * 0.5225        7.529
          75-84      0.3548           64.32            64.32 * 0.3548       22.82
           85+       0.1227          227.77           227.77 * 0.1227       27.948
           65+                                                              58.298
Table 2 Calculation of age-adjusted numbers
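
        The direct method can be sketched in Python using the standard population from
Table 1 and the resident rates from Table 2. The function name direct_age_adjust is our
own, and the sketch computes the weights from the standard population counts without
rounding them, so individual products may differ from hand calculations with 4-digit
weights in the last decimal place.

```python
# 2000 standard population (thousands), from Table 1.
standard_pop = {"65-74": 18_136, "75-84": 12_315, "85+": 4_259}

# Age-specific nursing home resident rates per 1,000, from Table 2.
resident_rates = {"65-74": 14.41, "75-84": 64.32, "85+": 227.77}


def direct_age_adjust(rates, standard):
    """Weighted average of age-specific rates, with weights p_i / P."""
    total = sum(standard[age] for age in rates)
    return sum(rates[age] * standard[age] / total for age in rates)


adjusted_rate = direct_age_adjust(resident_rates, standard_pop)

# Crude rate: residents 65+ divided by the population 65+, per 1,000.
crude_rate = 1_126_008 / 23_864_420 * 1_000

print("age-adjusted:", round(adjusted_rate, 1))
print("crude:", round(crude_rate, 2))
```

The adjusted rate is higher than the crude rate here because the 2000 standard
population gives more weight to the oldest age groups, which have the highest rates.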
        A similar procedure is applied to calculate age-adjusted rates and percents for
the age group 50 years old and over. The corresponding adjustment weights are shown in
Table 3.

                            Age      Standard 2000 Population    Adjustment
                                     (in thousands)              Weight

                             50+          75,895                   1.0000
                            50-64         41,185                   0.542657
                            65-74         18,136                   0.238962
                            75-84         12,315                   0.162264
                             85+           4,259                   0.056117
Table 3 Age distributions and age-adjustment weights for the population age 50 and over based on
the 2000 standard population
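
        The adjustment weights in Tables 1 and 3 are simply each age group's share of the
standard population for the age range being adjusted. A minimal Python sketch, with a
function name of our own choosing, shows how both sets of weights follow from the same
standard population counts.

```python
# 2000 standard population (thousands), from Table 3.
standard_pop = {
    "50-64": 41_185,
    "65-74": 18_136,
    "75-84": 12_315,
    "85+": 4_259,
}


def adjustment_weights(standard, age_groups):
    """Share of each listed age group in their combined standard population."""
    total = sum(standard[age] for age in age_groups)
    return {age: standard[age] / total for age in age_groups}


# Weights for the population 50 years old and over (Table 3).
weights_50_plus = adjustment_weights(
    standard_pop, ["50-64", "65-74", "75-84", "85+"]
)

# Weights for the population 65 years old and over (Table 1).
weights_65_plus = adjustment_weights(standard_pop, ["65-74", "75-84", "85+"])

for age, weight in weights_50_plus.items():
    print("50+", age, round(weight, 6))
for age, weight in weights_65_plus.items():
    print("65+", age, round(weight, 4))
```

Because the weights are shares of a total, each set adds up to 1, and the same age group
receives a different weight depending on the age range being adjusted.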

            Question. Examine the rates of nursing home residents by age for the available
            years. Which age group is most likely to reside in a nursing home? In 1977 the
            difference between the crude rate and the rate age-adjusted to the 2000 standard
            population was 11.12, while in 1999 it was only 0.41. Why has it decreased?

Let’s look closer at the difference between crude and age-adjusted death rates.

•   Using the latest mortality table by race, create graphs to compare over the years the
    death rates due to malignant neoplasms for the age group 65 years old and
    over, crude and age-adjusted to the 2000 standard population.

           Question. Why are 1990 crude death rates significantly higher than the
           adjusted data? In which years should the crude and age-adjusted data be the
           most similar?

IV. Race and Hispanic Origin Measures
        Changes in the racial and ethnic composition of the population have important
consequences for the nation’s health since many measures of disease and disability differ
significantly by race and ethnicity. Diversity has long been a characteristic of the U.S.
population, but the racial and ethnic composition of the nation has changed drastically
over time. In 1977 the Office of Management and Budget (OMB) issued Race and Ethnic
Standards for Federal Statistics and Administrative Reporting in order to promote
comparability of data among Federal data systems. The 1977 Standards called for the
Federal Government’s data systems to classify individuals in the following four racial
groups: American Indian or Alaska Native, Asian or Pacific Islander, black, and white.
Depending on the data source, the classification by race was based on self-classification
or on observation by an interviewer or other person filling out the questionnaire, death
certificate, or hospital discharge records.

         The changes in the U.S. population over time, in addition to shifting legal or
political considerations, are what led to the changes in reporting regulations. The
Hispanic population and the Asian and Pacific Islander population have grown more
rapidly than other racial and ethnic groups in recent decades. In 2000, more than 12
percent of the U.S. population identified themselves as Hispanic and almost 4 percent as
Asian or Pacific Islander. Also in 2000, over a quarter of adults and more than a third of
children identified themselves as Hispanic, as black, as Asian or Pacific Islander, or as
American Indian or Alaska Native.

        In 1997, new standards were announced for classification of individuals by race
within the Federal Government’s data systems (Federal Register, 62FR58781–58790). The 1997
Standards have five racial groups: American Indian or Alaska Native, Asian, Black or
African American, Native Hawaiian or Other Pacific Islander, and White. These five
categories are the minimum set for data on race in Federal statistics. The 1997 Standards
also offer an opportunity for respondents to select more than one of the five groups,
leading to many possible multiple race categories. The 1997 Standards allow for observer
or proxy identification of race but clearly state a preference for self-classification. The
Federal government considers race and Hispanic origin to be two separate and distinct
concepts. Thus Hispanics may be of any race. For instance, people can classify
themselves as white Hispanic, black Hispanic and so on. It is important to note that
Hispanic mortality data started in 1984 and only a limited number of states reported
Hispanic mortality at the beginning. Federal data systems are required to comply with the
1997 Standards by 2003.

        In the 1980 and 1990 decennial censuses, Americans could choose only one racial
category to describe their race. In 2000, the question on race was modified to allow the
choice of more than one racial category. Although overall a small percent of persons of
non-Hispanic origin selected two or more races in 2000, a higher percent of children than
adults were described as being of more than one race. The number of American adults
identifying themselves or their children as multiracial is expected to increase in the
future. In 2000 the percent of persons reporting two or more races also varied
considerably among racial groups. For example, the percent of all persons reporting a
specified race who mentioned that race in combination with one or more other racial
groups was 3 percent for white persons and 40 percent for American Indians and Alaska
Natives. For a more detailed discussion of race measures please consult the OMB website
and the technical notes for NCHS publications.

        For some data systems, such as mortality statistics, the numerator and the
denominator in the rate calculation may be based on different race classifications. The
race in the denominator is based on the census forms. When an individual fills out the
census forms, he must make a personal choice as to which race(s) he identifies himself
with. When a person dies, however, it is the person who fills out the death certificate who
determines the race of the deceased. This can create a discrepancy in mortality rates
between the numerator (race at death as determined by another) and the denominator
(self-determined race reported while alive).
       For more information about how race was determined in a particular Beyond
20/20 table, please see the explanatory information (Summary) for the dimensions Race,
or Race/Ethnicity.

               Question. From the Data Warehouse, review the race definition in the Life
               Expectancy table, and in any table from the Behavior Risk Factor
               Surveillance System, from the National Nursing Home Survey, and from
               the National Health Interview Survey. For each of these data systems
               answer the following questions: Was the race self-reported? If not, how
               was it recorded? Has the definition of race changed over the years? How
               might these changes in the definition affect the estimates by race?

V. Errors, Bias, and Quality Assurance
        Each estimate in the Data Warehouse on Trends in Health and Aging may be
subject to a variety of errors: errors due to survey design, to random variation, to
non-response, and to misclassification. Below you will find a brief description of the
major types of errors and quality assurance standards related to the data from the Data
Warehouse on Trends in Health and Aging.

        Most of the data sources used by the Data Warehouse on Trends in Health and
Aging are surveys that are based on multistage stratified sample designs. For example,
the National Nursing Home Survey (NNHS) is a two-stage stratified probability survey.
The first stage of NNHS is the sampling of the nursing home facilities, and the second
stage is the selection of the residents and discharges in these facilities. The statistics
derived from the survey are subject to sampling variability. The standard error and
confidence intervals are common measures of sampling variability and are used to assess
the precision of an estimate derived from sample data. For most of the tables that are
based on survey data, the standard error due to the survey design, or sampling error, was
estimated using special SUDAAN software and is presented in the table along with the
corresponding 95% confidence interval and relative standard error.
       Although the mortality data are not derived from samples (except for 1972, when
50% of death certificates were recorded by The National Vital Statistics System), they
may be affected by error due to the random variation in the number of deaths. The
standard error due to the random variation may be estimated based on the assumption of a
Poisson distribution of deaths using the following formula for the number of deaths:

                                      $$ SE(D) = \sqrt{D} , $$

and the formula below for the standard error of the death rates

                                      $$ SE(R) = \frac{R}{\sqrt{D}} , $$
where D is the number of deaths, R is the death rate, and SE stands for the standard
error. In the future, the mortality tables may include the error measure due to random
variation.
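
        As an illustration, the Python sketch below applies the Poisson-based standard
errors to the number of deaths from diabetes mellitus among black males 65-74 years old
quoted in Section II. The function names are our own. The 95% confidence interval is
computed as the estimate plus or minus 1.96 standard errors, which is a normal
approximation that we assume here; it is reasonable only when the number of deaths is
fairly large and should not be relied on for small counts.

```python
import math


def se_deaths(deaths):
    """Standard error of a death count under the Poisson assumption."""
    return math.sqrt(deaths)


def se_rate(rate, deaths):
    """Standard error of a death rate: R / sqrt(D)."""
    return rate / math.sqrt(deaths)


def ci95(estimate, se):
    """Approximate 95% interval: estimate +/- 1.96 * SE (normal approximation)."""
    half_width = 1.96 * se
    return estimate - half_width, estimate + half_width


# Counts quoted in Section II for black males 65-74 years old.
deaths = 1_266
population = 699_329
rate = deaths / population * 100_000  # deaths per 100,000 population

se_d = se_deaths(deaths)
se_r = se_rate(rate, deaths)
deaths_low, deaths_high = ci95(deaths, se_d)
rate_low, rate_high = ci95(rate, se_r)

print(f"deaths: {deaths}, SE {se_d:.1f}, 95% CI {deaths_low:.0f} to {deaths_high:.0f}")
print(f"rate: {rate:.1f}, SE {se_r:.2f}, 95% CI {rate_low:.1f} to {rate_high:.1f}")
```

If the intervals for two States do not overlap, the rates are likely to differ by more than
random variation alone; overlapping intervals call for a formal test before drawing a
conclusion.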

               Question. Open a mortality table by race for the latest available year, and
               using the formulas above calculate the standard error and 95% confidence
               interval for the number of deaths and the death rates from septicemia for the
               age group 65-74 in your State and a neighboring State. For the purpose of
               this exercise, calculate the rates even if the number of deaths is less than
               20. Compare the rates and their 95% confidence intervals for the two States.
               Do you think they are significantly different?

         Bias. Each survey presented in the Data Warehouse on Trends in Health and
Aging employs multiple procedures and policies to minimize bias in the sample so the
statistics will give trustworthy results. Unfortunately, in most cases it is nearly impossible
to completely eliminate bias in the sample. It is therefore important for users of the
collected data to recognize that many different types of bias can occur at every step of
the survey. Below we discuss a few types of bias.

   1. Selection bias occurs if the method for selecting the participants produces a
      sample that does not represent the population of interest. For instance, the
      Behavior Risk Factor Surveillance System surveys households with phone
      service. Although these households represent the majority of the population, the
      households that don’t have a phone are outside the scope of the survey and differ
      from households with phones on factors related to health, such as income and
      education.

    2. Response bias occurs when participants respond differently from how they truly
       feel. The way questions are constructed, the way the interviewer behaves, and
       many other factors might lead an individual to provide inaccurate
       information. For instance, surveys about socially unacceptable behavior such as
       heavy smoking or drinking must be worded and conducted carefully to minimize
       the possibility of response bias. As another example, when reporting body weight
       persons tend to underestimate it, while self-reported height is likely to be
       overestimated. This leads to the underestimation of obesity and overweight, the
       determination of which is based on weight and height.

•   Open two tables on Obesity from the “Risk Factors and Disease Prevention” topic:
    one from the National Health and Nutrition Examination Survey (NHANES), which is
    based on actual measurements, and one from the National Health Interview Survey
    (NHIS), which uses self-reported weight and height. Arrange them to view the
    prevalence of obesity for the age group 65-74 for the years 1988-1994 by sex.

              Question. You can see that the prevalence of obesity for this age group
              estimated by NHANES for the years 1988-1994 was 24.1% for males and
              36.9% for females, though the annual prevalence obtained by NHIS for
              1988-1994 never exceeded 15.7% for males and 19.2% for females. Why do
              the estimates differ?

              Examine the prevalence of obesity and overweight for your State from the
              Behavior Risk Factor Surveillance System. How are the estimates
              different from the national estimates?

   3. Non-response bias. Responding to a survey is voluntary. Those who respond are
      likely to have stronger opinions than those who do not respond. In statistical
      language this is referred to as a non-sampling bias, which can lead to
      systematically over- or underestimating the truth about a population. Suppose a
      survey is sent out to 100 persons regarding insurance coverage. Assume that 70 of
      those people respond and 14 of them say they have no insurance coverage. If the
      percent of non-covered persons is calculated as the number of those not insured
      divided by the total number of persons surveyed, the result would be
      14/100 = 14%. By saying that 14% are uninsured we are most likely overestimating
      insurance coverage, because in our calculation we assumed that all 100-14=86
      persons who did not say “NO” are covered. If the percent was calculated by
      dividing the number of those who said “NO” by the number who responded to the
      question (said “YES” or “NO”), the result would be 14/70 = 20%. In this case, we
      assumed that uninsured persons are distributed equally among responders and
      non-responders, which may or may not be true.
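
      The two ways of calculating the percent in this example can be sketched in a few
      lines of Python. The variable names are illustrative only; the counts are the
      ones used in the example.

```python
# Counts from the insurance-coverage example.
surveyed = 100   # persons who were sent the survey
responded = 70   # persons who answered the question
said_no = 14     # respondents who reported having no insurance coverage

# Denominator = everyone surveyed: silently treats all non-responders as covered.
pct_of_surveyed = said_no / surveyed * 100

# Denominator = responders only: assumes non-responders resemble responders.
pct_of_responders = said_no / responded * 100

print(f"Uninsured, as a percent of all persons surveyed: {pct_of_surveyed:.0f}%")
print(f"Uninsured, as a percent of those who responded:  {pct_of_responders:.0f}%")
```

      Neither figure is "right"; each rests on a different assumption about the 30
      persons who did not respond.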

              For each table, the way the percent was calculated is described in the
              explanatory notes.

           Question. How is non-response bias different from selection bias?

   4. Misclassification bias. One of the examples of the misclassification may be the
      determination of race, Hispanic Origin, or age of the deceased by the person who
      filled out the death certificate in the absence. This type of bias can occur when the
      questions were answered by a proxy. See the table explanatory messages and the
      data systems description for information on how the data were obtained.

              Question. In the National Nursing Home Survey, the questions about
              needing help with activities of daily living (ADLs), such as bathing,
              eating, going to the toilet, and walking, were answered by the staff
              member most familiar with the nursing home resident. What kind of bias
              do you think it could introduce? How might it change the results?

Quality Assessment. There are a number of widely accepted methods and procedures
used by the NCHS and other government agencies to ensure the quality of the survey data
at each stage of data management, from the planning of the survey design and
questionnaire to data dissemination. NCHS conducts independent research and consults
with experts in areas such as data collection, data analysis, and a variety of substantive
topics and issues. NCHS reviews the quality (including the objectivity, utility, and
integrity) of information before it is disseminated and treats information quality as
integral to every step in the development of information, including its creation,
collection, maintenance and dissemination.

         In order to assure accurate estimations in the Data Warehouse, the data are
obtained through standardized statistical procedures based on the accepted theory and
practice. The Data Warehouse also follows generally recognized guidelines in terms of
defining acceptable standards for the data presentation, such as maximum standard errors,
cell size suppression, adherence to confidentiality, and other processing operations. All
statistical and analytic information in the Data Warehouse products undergoes a formal
clearance process before dissemination. The methodology of data calculation, and
warning notes and source references about the data are an integral part of the Beyond
20/20 tables.

VI. Error Measures

        All error measures presented in the Data Warehouse are related to the errors due
to the survey design.

        The standard errors due to the survey design presented in the Data Warehouse
on Trends in Health and Aging were calculated using SUDAAN software, which takes
into account the complex survey design.

        A 95 percent confidence interval means that if all possible samples were
surveyed under the same conditions, approximately 95 percent of the intervals would
include the “true” value. A particular confidence interval may or may not contain
the “true” value, however. The lower bound of a 95 percent confidence interval is
calculated by using the following formula:

       LOWER BOUND 95% Confidence Interval =
                ESTIMATE - 1.96* STANDARD ERROR

        If the lower bound of the confidence interval was determined to be less than zero,
the value of zero was used. The upper bound of a 95 percent confidence interval is
calculated by using the formula:

       UPPER BOUND 95% Confidence Interval =
                 ESTIMATE + 1.96* STANDARD ERROR

       If the upper bound of the confidence interval was calculated for the percent and
was determined to be more than 100%, the value of 100% was used. Relative standard
error (RSE) is defined as the standard error divided by the estimate and is expressed as a
percent of the estimated value:

       RSE     = STANDARD ERROR / ESTIMATE * 100%

       In most surveys, estimates for which the relative standard error (RSE) is
greater than 30% are considered unreliable.
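
       The formulas for the confidence bounds and the RSE can be sketched in Python as
follows. The function names and the example values are hypothetical and are not taken
from any Data Warehouse table; the truncation at zero and at 100% follows the rules
described in this section.

```python
def confidence_interval_95(estimate, standard_error, is_percent=False):
    """Return (lower, upper) bounds of a 95 percent confidence interval."""
    # Lower bound: a value below zero is replaced by zero.
    lower = max(estimate - 1.96 * standard_error, 0.0)
    upper = estimate + 1.96 * standard_error
    # Upper bound: for percents, a value above 100 is replaced by 100.
    if is_percent:
        upper = min(upper, 100.0)
    return lower, upper


def relative_standard_error(estimate, standard_error):
    """Return the RSE, expressed as a percent of the estimate."""
    return standard_error / estimate * 100.0


# Hypothetical percent estimate and standard error, for illustration only.
estimate, se = 12.0, 4.0
lower, upper = confidence_interval_95(estimate, se, is_percent=True)
rse = relative_standard_error(estimate, se)

print(f"95% CI: ({lower:.2f}, {upper:.2f})")
print(f"RSE: {rse:.1f}%  (unreliable by the 30% rule: {rse > 30})")
```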

       In the Data Warehouse on Trends in Health and Aging, in some tables the error
measures are presented as items of the dimension UNITS, in others they are shown as
items of the dimension MEASURE.

•   Open the table on fruit and vegetable consumption from the “Risk Factors and
    Disease Prevention” topic and arrange it by UNITS dimension and race and sex for
    the State of Virginia, age group 50-64, the years 1998-2000, 5 or more servings. You
    see that based on the Behavioral Risk Factor Surveillance System, in Virginia 23.3% of
    white males, 17.5% of black males, 35.5% of white females, and 26.4% of black
    females eat the recommended 5 servings of fruit and vegetables a day.

A 95% confidence interval is presented for each of these values. For example, for white
males the lower bound is 18.3% and the upper bound is 28.3%; for black males the
interval has a lower bound of 7.6% and an upper bound of 27.3%.

               Question. The confidence intervals for white and black males are
               overlapping. Could we assume in this case that percents of white and
               black males consuming 5 or more servings of fruits and vegetables a day
               are significantly different?

•   Open the table on visits to office-based physicians which has the MEASURE
    dimension available. Arrange the table by the year and measure for both sexes, all
    races, age groups 65 and over (age-adjusted), and all specialties. You can see that for
    1999-2000 the estimated rate of office visits to all specialists is 604.2 per 100
    persons, with a standard error of 7.4. The lower and upper bounds of the confidence
    interval are also given, as well as the relative standard error (RSE).

               Question. The rate for 1995 and 1996 is 604.9 (590.8, 619.0), and for
               1999 and 2000 is 604.2 (589.7, 618.8). Could we say that the rate of visits
               to doctors' offices decreased in 1999-2000 compared to 1997-1998?
               See section VIII for a brief overview of the basic statistical tests that
               can answer this question.

VII. Missing Values
         When browsing through the tables in the Data Warehouse on Trends in Health
and Aging, one occasionally finds cells where a symbol is shown in place of data. There
are many different possible reasons for missing values in the tables, and the following is a
list of the common types of missing values. Each missing value is represented in the
table by a symbol (e.g., hyphen, asterisk, tilde), and placing the mouse over a
cell with a missing value will show a pop-up text that explains why the value is not there.

    1. Unreliable Estimates:
       In some cases the data are judged to be unreliable estimates, and these estimates
       may not be shown in the table.

           a. In survey data, standards of reliability are usually the following:

                   i. The number of observations in the survey, or the sample size,
                      on which the statistic is based has to be greater than or
                      equal to a pre-determined value.
                  ii. The value of the relative standard error (RSE) must not exceed a
                      pre-determined value.

              For example, the National Hospital Discharge Survey estimate is not
              considered reliable if it is based on fewer than 30 discharges in the sample.
              In the Behavioral Risk Factor Surveillance System, the number of observations
              has to be 50 or more. In addition, for both of these surveys the estimate
              is not considered reliable if the RSE is more than 30%.

           b. The death rates are not considered reliable when the number of deaths in a
              cell is less than 20. Death rates based on a small number of deaths are not
              shown, though the number of deaths is presented in the tables. It is
              inadvisable to calculate rates in this case.
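
       The reliability standards listed above amount to simple threshold checks. A minimal
sketch in Python follows; the function names and the example cells are hypothetical, and
the thresholds are the ones quoted in this section (each table's explanatory messages
state the standard actually applied).

```python
def is_reliable_survey_estimate(sample_size, rse_percent,
                                min_sample_size, max_rse_percent=30.0):
    """Both standards must hold: enough observations and a small enough RSE."""
    return sample_size >= min_sample_size and rse_percent <= max_rse_percent


def is_reliable_death_rate(number_of_deaths, min_deaths=20):
    """Death rates based on fewer than 20 deaths are not considered reliable."""
    return number_of_deaths >= min_deaths


# Hypothetical cells, for illustration only.
# Hospital discharge survey standard: at least 30 discharges in the sample.
print(is_reliable_survey_estimate(sample_size=28, rse_percent=22.0,
                                  min_sample_size=30))
# Risk factor survey standard: at least 50 observations.
print(is_reliable_survey_estimate(sample_size=64, rse_percent=18.5,
                                  min_sample_size=50))
# Enough observations, but the RSE is above 30%.
print(is_reliable_survey_estimate(sample_size=64, rse_percent=34.0,
                                  min_sample_size=50))
# A death rate based on 12 deaths.
print(is_reliable_death_rate(12))
```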

•   Open the Injury Death Rate table and look at the Fall subsection of Homicide as
    shown below. There are not many instances of deaths from falls which were a result
    of a homicide. Moving the mouse pointer over the cells in this row will show text
    stating that the data for this cell is an unreliable estimate.

        To find out which type of standard is used to determine the reliability of the data,
see the explanatory messages in the table.

   2. Not Available:
      There are a number of reasons why the data may not be available.

           a. A survey may not ask a specific question in a particular year. In the table
              on participation in physical activities you will not find the percent of
              persons 75 years old and over who were gardening or participating in
              aerobics for the years 1985 and 1990: they were considered too old for
              these activities and the questions were not asked.

           b. In addition, data may be unavailable for a particular surgical procedure
              that had not yet been developed, such as coronary artery bypass, or a
              particular service that had not been provided, such as hospice care in the
              1970s. In the table below, removal of coronary artery obstruction was a
              procedure that did not exist (indicated here with “/”) before 1979. Then
              from 1979 to 1982, the data are unreliable for this procedure, likely
              because of its relative novelty. Not until 1983 are data available for this
              procedure.

         Question. What other reasons do you think might explain the unavailability
         of data?

3. Not applicable:
   If a particular estimate is not relevant, it is categorized as not applicable. Here are
   some examples of this condition:

       a. Some procedures, diagnoses or causes of death are relevant only to
          females (such as hysterectomy), and some are relevant only for males
          (such as malignant neoplasm of the prostate).

     Question. In 2000, 62,000 hysterectomies were performed on persons 65
     years old and over. The corresponding midyear population in the year 2000
     was: 20,340,000 females and 14,477,000 males. Using this information how
     would you calculate the crude rate of hysterectomies for persons of 65 years
     old and over in the nation?
   • Open the table on hospital discharges by all-listed procedures and arrange it
     by sex and UNITS to verify your answer.

       b. In the table on injury mortality, some combinations of the intent/manner of
          the injury (homicide) and cause/mechanism (motor-vehicle traffic) are
          meaningless and are considered not applicable.

    4. Confidential:
        To protect the confidentiality of the persons whose characteristics are presented
        in the public domain, in some tables the estimates are suppressed due to
        confidentiality regulations.
•   Open the Medicare Expenditure by Type of Service, Age, Sex, and Race table under the
    “Health Care Expenditures” topic. Counts of people enrolled in Medicare that
    did not satisfy the confidentiality criteria of CMS (formerly HCFA) were suppressed to
    maintain confidentiality. Restructuring the table so that only blacks are present
    reveals that Medicare expenditure data in Alaska 1974-1977 are not shown for this

    5. Complementary to Confidential cell:
       Some cells had to be suppressed because they are adjacent to the confidential
       cells. For example, if the estimate for females is suppressed, and the estimates for
       males and both sexes are given, then the estimate for females could be calculated
       by subtracting the number for males from the number for both sexes. To prevent
       this, a complementary cell is suppressed as well.
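
       The reason for complementary suppression can be illustrated with a short Python
sketch. The counts are hypothetical and are not taken from any Data Warehouse table.

```python
# Hypothetical enrollment counts, for illustration only.
both_sexes = 120          # published
male = 113                # published
female = None             # suppressed as confidential

# If only the female cell were hidden, simple subtraction would reveal it.
recovered_female = both_sexes - male
print(f"Value recovered for the suppressed cell: {recovered_female}")

# Suppressing a complementary cell as well (here, male) closes that gap:
male = None
can_recover = both_sexes is not None and male is not None
print(f"Can the female value still be recovered? {can_recover}")
```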

•   Open the Health Care Expenditures folder and select the Medicare Expenditure by
    Type of Service, Age, Sex, and Race table (ME10S98A). Rearrange the data so that
    race is set to black and age to 85 and over. Go to the state of Connecticut in the year
    1998 and nest the Sex dimension inside state. The cell for female should have a “~”
    mark which means that this cell is complementary to a confidential cell (the male
    cell). It is important to note that this is an extremely rare case.

    6. Values based on the missing estimates:

       If the estimate in the table is shown as missing, all values based on this estimate
       usually will be shown as missing also. For example, for the missing rates all the
       error measures, such as upper and lower bounds of 95% confidence interval, will
       be shown as missing. In the Medicare Expenditure table, if the number of persons
       enrolled in Medicare for the particular demographic group is categorized as
       missing, all types of statistical data for this demographic group are not shown.

VIII. Statistical Testing

         One reason for calculating sampling error is that certain statistical tests of
hypotheses require it. Tests of hypotheses consist of decision rules, which define how
the statistics obtained from a sample of the population are to be inspected so that one may
increase the odds for arriving at correct answers to questions about the underlying
population. For example, the statement “The value A is not equal to the value B” related
to the survey estimates is a statistical hypothesis. A typical approach is as follows: The
hypothesis H0 is formulated; then the sample data are examined. If the sample outcome
differs “significantly” from what would be expected if H0 were true, then H0 is rejected.

        The Beyond 20/20 statistical testing tools presented below are based on the t-tests
and z-tests described in Sirken M, Shimizu I, French D, and Brock D (1983), “Manual on
Standards and Procedures for Reviewing Statistical Reports. Revised,” National Center
for Health Statistics, Washington, D.C. To use these tools you have to
download the special utility that will automatically modify the Beyond 20/20 Browser so
that a Tools drop-down menu is added. To perform a specific test, one just has to
click on its name in the drop-down menu, which also contains the instructions (“help”).
As of September 2003, this utility is in the final stage of development. However, if you
are interested, this utility could be sent to you for testing. One could also perform similar
calculations using the formulas for the z-test and t-test described in any college-level
textbook on statistics.

1. Single comparison (test of the difference between two values).

•   Open the table on the visits to physician offices by physician specialty in the “Health
    Care Utilization” topic and arrange it to view rates of visits to Internal Medicine
    specialists for the age group 65-74 by the dimensions year and MEASURE.

        The values of rates in 1997-1998 and 1999-2000 were 128.3 and 139.8 per 100
persons, respectively. Are these rates significantly different in a statistical sense? The
figures are different, and if we would see only the value of the “Estimate” we would
probably say, “Yes, sure, they are different”. But look at the standard error and the 95%
confidence interval and you will see that the 95% confidence intervals are overlapping
for these values, and our answer should be “Well, we are not so sure if they are different
– maybe it is a result of the variations due to the survey design”.

        To answer this question we could use the statistical testing procedure for the
comparison of two values (single comparison). It uses the z-statistic with a 5% level of
significance. Both tested values must be accompanied by non-missing standard errors
with corresponding relative standard errors less than 25%. This test is performed under
the assumption that both values are normally distributed with the variance equal to SE²,
where SE is the calculated standard error (10.1 and 12.0, respectively). Following the
instructions of the “Test of the difference between two values” in the drop-down menu,
highlight the values and their standard errors, and perform the test. You will receive the
following message:

This confirms that we can NOT say that the number of visits per 100 persons increased
in 1999-2000 compared to 1997-1998.
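
        For readers without the testing utility, the comparison can be approximated by hand.
The Python sketch below uses the textbook z-test for the difference between two
estimates, assuming the two estimates are independent; it is not necessarily the exact
procedure implemented in the Beyond 20/20 tool. The function name is hypothetical; the
rates and standard errors are the ones quoted above.

```python
import math


def z_test_two_values(est1, se1, est2, se2):
    """Textbook z-test for the difference of two independent estimates."""
    # Variance of the difference = sum of the two variances (SE squared).
    z = (est2 - est1) / math.sqrt(se1 ** 2 + se2 ** 2)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Visits to Internal Medicine specialists per 100 persons, ages 65-74.
z, p_value = z_test_two_values(128.3, 10.1, 139.8, 12.0)

# At the 5% level of significance the critical value is 1.96.
significant = abs(z) > 1.96
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}, significant: {significant}")
```

        Because the absolute value of z is below 1.96, the difference is not statistically
significant at the 5% level, in agreement with the conclusion above.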

        The single comparison test can be used for the comparison of two numbers only.
For example, we cannot use this test to compare 65-74 year olds with those 75-84 and 85
years old and over. For comparison of multiple values, the multiple comparison
(Bonferroni) test should be used.

2. Test of trends.

        Another type of test that the users of the Data Warehouse on Trends in Health and
Aging might be interested in is the test of trends which helps to answer the question “Is
this sequence of the estimates generally decreasing (increasing)?”.

       The “Test for Trend” employed by the Beyond 20/20 tool is based on
hypotheses of the form: “The value of X increases (decreases) as the value of Y
increases.” The hypothesis actually tested is just the opposite: “the variable X is
independent of the variable Y” in the sense that there is no linear relationship between the
two. For the test, a linear regression model represented by the equation below is fit to the
data:
                                X = A + BY

The test uses the weighted least squares technique to determine the values of A and B. For
acceptance or rejection of the hypothesis about the linear relationship, the two-tailed
t-distribution is used. The number of degrees of freedom is determined as n-2, where n is
the number of the values (years, age groups) being analyzed for the trend.
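
A regression of this kind can be sketched in Python as follows. This is an illustration
under stated assumptions, not the Beyond 20/20 implementation: the weights are assumed
to be 1/SE², the standard error of the slope is estimated from the weighted residuals
(one common choice), and the yearly percents and standard errors are hypothetical.

```python
import math


def weighted_trend(y, x, se):
    """Fit x = A + B*y by weighted least squares; return (A, B, t, df)."""
    n = len(y)
    w = [1.0 / s ** 2 for s in se]            # assumed weights: 1 / SE^2
    sw = sum(w)
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    syy = sum(wi * (yi - y_bar) ** 2 for wi, yi in zip(w, y))
    sxy = sum(wi * (yi - y_bar) * (xi - x_bar) for wi, yi, xi in zip(w, y, x))
    b = sxy / syy                              # slope B
    a = x_bar - b * y_bar                      # intercept A
    df = n - 2                                 # degrees of freedom
    # Weighted residual sum of squares around the fitted line.
    rss = sum(wi * (xi - (a + b * yi)) ** 2 for wi, yi, xi in zip(w, y, x))
    se_b = math.sqrt(rss / df / syy)           # standard error of the slope
    return a, b, b / se_b, df


# Hypothetical yearly percents with equal standard errors (illustration only).
years = [1997, 1998, 1999, 2000, 2001]
percents = [28.0, 27.9, 27.1, 27.0, 26.4]
std_errors = [0.5, 0.5, 0.5, 0.5, 0.5]

a, b, t, df = weighted_trend(years, percents, std_errors)
print(f"slope = {b:.2f} per year, t = {t:.2f}, df = {df}")
# Compare |t| with the two-tailed 5% critical value from a t table
# (3.182 for df = 3); a larger |t| rejects "no linear relationship".
```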

•   Open the table on the health status (National) and make a chart for white persons 65
    years old and over (age-adjusted) who assessed their health as “Fair” or “Poor”. We
    can see that the percent seems to be decreasing. Will the statistical test confirm it?

Arrange the view of the table by Units and Year and perform the test of trends. The
message you receive is:

Not only did the test confirm that the trend is decreasing, it also supplies the slope,
that is, the average “pace” of the decrease per unit of time: 0.36% per year.

Question. Apply statistical tests to other tables in the Data Warehouse
that contain the value of the standard error and interpret the results.

