Introduction to Statistics - DOC by hcj


									                                         Introduction to Statistics
What is Statistics?
           Statistics is the science of collecting, organizing, and interpreting numerical facts, which we call data. We
are bombarded by data in our everyday lives. Most of us associate "statistics" with the bits of data that appear in
news reports: baseball batting averages, imported car sales, the latest poll of the president's popularity, and the
average high temperature for today's date. Advertisements often claim that data show the superiority of the
advertiser's product. All sides in public debates about economics, education, and social policy argue from data. Yet
the usefulness of statistics goes far beyond these everyday examples.
           The study and collection of data are important in the work of many professions, so that training in the
science of statistics is valuable preparation for a variety of careers. Each month, for example, government statistical
offices release the latest numerical information on unemployment and inflation. Economists and financial advisors,
as well as policy makers in government and business, study these data in order to make informed decisions. Doctors
must understand the origin and trustworthiness of the data that appear in medical journals if they are to offer their
patients the most effective treatments. Politicians rely on data from polls of public opinion. Business decisions are
based on market research data that reveal consumer tastes. Farmers study data from field trials of new crop varieties.
Engineers gather data on the quality and reliability of manufactured products. Most areas of academic study make
use of numbers, and therefore also make use of the methods of statistics.
           We can no more escape data than we can avoid the use of words. Just as words on a page are meaningless to
the illiterate or confusing to the partially educated, so data do not interpret themselves but must be read with
understanding. Just as a writer can arrange words into convincing arguments or incoherent nonsense, so data can be
compelling, misleading, or simply irrelevant. Numerical literacy, the ability to follow and understand numerical
arguments, is important for everyone. The ability to express yourself numerically, to be an author rather than just a
reader, is a vital skill in many professions and areas of study. The study of statistics is therefore essential to a sound
education. We must learn how to read data, critically and with comprehension. We must learn how to produce data
that provide clear answers to important questions. And we must learn sound methods for drawing trustworthy
conclusions based on data.

The Rise of Statistics
           Historically, the ideas and methods of statistics developed gradually as society grew interested in collecting
and using data for a variety of applications. The earliest origins of statistics lie in the desire of rulers to count the
number of inhabitants or measure the value of taxable land in their domains. As the physical sciences developed in
the seventeenth and eighteenth centuries, the importance of careful measurements of weights, distances, and other
physical quantities grew. Astronomers and surveyors striving for exactness had to deal with variation in their
measurements. Many measurements should be better than a single measurement, even though they vary among
themselves. How can we best combine many varying observations? Statistical methods that are still important were
invented in order to analyze scientific measurements.
           By the nineteenth century, the agricultural, life, and behavioral sciences also began to rely on data to answer
fundamental questions. How are the heights of parents and children related? Does a new variety of wheat produce
higher yields than the old, and under what conditions of rainfall and fertilizer? Can a person's mental ability and
behavior be measured just as we measure height and reaction time? Effective methods for dealing with such
questions developed slowly with much debate.
           As methods for producing and understanding data grew in number and sophistication, the new discipline of
statistics took shape in the twentieth century. Ideas and techniques that originated in the collection of government
data, in the study of astronomical or biological measurements, and in the attempt to understand heredity or
intelligence came together to form a unified "science of data." That science of data, statistics, is the topic of this
course.

Understanding from Data
          The practice of statistics involves the use of many recipes for numerical calculation, some quite simple and
some very complex. As you are learning how to use these recipes, remember that the goal of statistics is not
calculation for its own sake, but gaining understanding from numbers. A calculator or computer can automate many
of the calculations, but you must supply the understanding. Computers using specialized software always carry out
the more complex procedures. A thorough grasp of the principles of statistics will enable you to quickly learn more
advanced methods as needed. On the other hand, a fancy computer analysis carried out without attention to basic
principles will often produce elaborate nonsense. As you study, seek to understand the principles as well as the
necessary details of methods and recipes.

Statistics is an area of mathematics that is concerned with the extraction of information from numerical data obtained
during an experiment. It involves the design of the experiment, the collection and analysis of the data, and making
inferences (statements) about the population based upon information in a sample.

The word statistics has two meanings.
1) Statistics refers to numerical facts such as the income of a family, the age of a student, the percentage of passes
completed by the quarterback of a football team, and the starting salary of a typical college graduate.
2) Statistics refers to the field of study which uses methods to collect, analyze, present, and interpret data.

Statistics can further be divided into two areas.
1) Descriptive statistics consists of methods for organizing, displaying, and describing data by using tables, graphs,
and summary measures. The first part of the course will use descriptive statistics.
2) Inferential statistics consists of methods that use sample results to help make decisions or predictions about a
population.

Individuals are the objects described by a set of data. Individuals may be people, but they may also be animals or
things.

A population consists of the entire group of individuals about which the researcher wants information.
       Ex. 1) All U.S. citizens
            2) All male high school students

The parameter is some characteristic of the population that the researcher wants to measure.
        Ex. 1) Proportion of U.S. citizens who voted in the last presidential election
            2) Average height of all male high school students.

A sample is the portion of the population that is selected to study.
       Ex. 1) A sample of the U.S. citizens would be the citizens of Massachusetts
             2) A sample of the male high school students would be male students at Hudson High School

A statistic is some characteristic of the sample.
         Ex. 1) The proportion of people in Massachusetts that voted in the last presidential election.
               2) The average height of the male students at Hudson High School.

Inference is a statement about a population based on the data collected in a sample. One type of inference is using a
sample statistic to estimate a population parameter.
         Ex. The average height of the male students in this class can be used to estimate the average
              height of all male VCU students
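
The definitions above can be tied together in a short Python sketch. The population here is simulated, and the numbers (5,000 students, a typical height of 70 inches, a sample of 40) are assumptions made up for illustration, not real data.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch gives the same result each run

# Hypothetical population: heights (in inches) of 5,000 male students.
# These values are simulated for illustration only.
population = [random.gauss(70, 3) for _ in range(5000)]

# Parameter: a characteristic of the entire population.
parameter = statistics.mean(population)

# Sample: a randomly selected portion of the population.
sample = random.sample(population, 40)

# Statistic: the same characteristic, computed from the sample alone.
statistic = statistics.mean(sample)

# Inference: the statistic serves as an estimate of the parameter.
print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
```

In practice the parameter is usually unknown, which is why the sample statistic is needed; the sketch computes both only so they can be compared.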

A distribution is a listing of all the possible values that a characteristic can take and the number of times
         that each value occurs
         Ex. Gender of student:        male female
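
As a rough illustration, a distribution can be tallied in Python by counting how often each value occurs. The eight observations below are hypothetical and are not taken from any class roster.

```python
from collections import Counter

# Hypothetical observations of one qualitative variable (not real data).
observations = ["male", "female", "female", "male",
                "female", "female", "male", "female"]

# The distribution: each value paired with the number of times it occurs.
distribution = Counter(observations)

for value, count in sorted(distribution.items()):
    print(f"{value:<8}{count}")
```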
Example 1
Television station QUE wants to know the proportion of TV owners in Virginia who watch the station's new program
at least once a week. The station asked a random group of 1000 TV owners in Virginia if they watch the program at
least once a week. Identify the 1) population, 2) parameter, 3) sample, 4) statistic, and 5) inference.

         1) The population consists of TV owners in Virginia.
         2) The parameter is the proportion of TV owners in Virginia who watch the station's new program at least
         once a week
         3) The sample is the 1000 TV owners that were asked if they watch the program at least once a week.
         4) The statistic is the proportion of the 1000 randomly selected TV owners that watch the program at least
         once a week.
         5) The inference is that the proportion of the 1000 randomly selected TV owners that watch the program at
         least once a week can be used to estimate the proportion of TV owners in Virginia that watch the program
         at least once a week. The two proportions are unlikely to be exactly equal.
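
A small simulation can show how a sample proportion relates to the population proportion in an example like this one. Everything numeric below is an assumption for illustration: the population size of 200,000 and the 30% viewing rate are invented, not facts about station QUE.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population of TV owners; True means the owner watches
# the program at least once a week. The 30% rate is invented.
population = [random.random() < 0.30 for _ in range(200_000)]
parameter = sum(population) / len(population)

# Draw five separate random samples of 1000 owners each and record the
# statistic (the sample proportion) from each one.
sample_proportions = []
for _ in range(5):
    sample = random.sample(population, 1000)
    sample_proportions.append(sum(sample) / len(sample))

print(f"parameter: {parameter:.3f}")
for p in sample_proportions:
    print(f"statistic: {p:.3f}")
```

Each sample gives a slightly different statistic, and each one lands near the parameter without necessarily matching it. That is why the statistic is described as an estimate.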

A goal of statistics is to measure some characteristic about a subject or set of subjects. To assure more accurate
results, we usually either measure the characteristic on several subjects (the sample), or if only one subject is
available, we repeat the measurement several times. This is called repetition, and it is in repeated experiments that
statistics becomes important.

When the measurements of some characteristic of the individuals do not change in repeated trials, the characteristic
is called a constant.
          Ex. 1) Weight of a certain rock.
              2) Number of minutes in an hour.

When the measurements of the characteristic vary from trial to trial, then the characteristic is called a variable.
Variables are classified into two categories:
   1. A qualitative (or categorical) variable is a variable whose measurements vary in kind or name but not in
        degree, meaning that they cannot be arranged in order of magnitude. Hence one level of a qualitative
        variable cannot be considered greater or better than another level.
        Ex. 1) Gender - male or female
             2) Occupation - teacher, lawyer, doctor, janitor, etc.
   2. A quantitative variable is a variable whose measurements vary in magnitude from trial to trial, meaning
        some order or ranking can be applied. Quantitative variables are further classified as being discrete or
        continuous.
        1. A discrete quantitative variable is a variable whose measurements can assume only a countable
            number of possible values.
            Ex. Number of cars in a parking deck
        2. A continuous quantitative variable is a variable whose measurements can assume any one of
            a countless number of values in a line interval.
            Ex. Weight of a typical student

In statistics, we are primarily concerned only with the observation of variables - if we know beforehand what the
measurement is going to be, as is the case with constants, then there is no reason to make the measurement.

Example 2
Identify each of the following characteristics as being a constant, a qualitative variable, a discrete quantitative
variable, or a continuous quantitative variable.
          a) College major – qualitative or categorical variable
          b) Number of dependents – discrete quantitative variable
          c) Number of people serving as President of the United States at any one time - constant
          d) Age – continuous quantitative variable (not because it is always changing, but because there is
             no limit to how precisely a person’s age can be given, e.g., in minutes, seconds, tenths of a second, etc.)
          e) Eye color – qualitative or categorical variable
          f) Suicide rate – continuous quantitative variable (rates are always continuous quantitative)
Example 3
The following is a small part of a spreadsheet containing data from a 1997 Road Atlas.

Each row records data on one individual. Each row of data is called a case. Each column contains the values of one
variable for the individuals. In addition to the state, there are 5 variables. Nickname and Capital are qualitative or
categorical variables. Population is a discrete quantitative variable. Land Area and Highest Point are continuous
quantitative variables.
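
The row-and-column layout described above can be sketched in Python as a list of rows. The state names and all values below are placeholders invented for illustration; they are not the Road Atlas figures.

```python
# Each dictionary is one row (one case); each key is one column (one variable).
# All names and values are placeholders, not data from the Road Atlas.
rows = [
    {"State": "State A", "Nickname": "Nickname A", "Capital": "City A",
     "Population": 1_200_000, "Land Area": 35_000.5, "Highest Point": 4_100.0},
    {"State": "State B", "Nickname": "Nickname B", "Capital": "City B",
     "Population": 6_500_000, "Land Area": 8_200.3, "Highest Point": 3_490.0},
]

# The classification of each variable, as given in the text.
variable_kinds = {
    "Nickname": "qualitative",
    "Capital": "qualitative",
    "Population": "discrete quantitative",
    "Land Area": "continuous quantitative",
    "Highest Point": "continuous quantitative",
}

# Reading down one column gives the values of one variable for all individuals.
populations = [row["Population"] for row in rows]
print(populations)
```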

Most statistical software uses this format to enter data – each row is an individual, and each column is a variable.
This data set appears in a spreadsheet program that has rows and columns ready for your use. Spreadsheets are
commonly used to enter and transmit data. Most statistical software can read data from the major spreadsheet
programs.

Statistical tools and ideas can help you examine data in order to describe their main features. This examination is
called exploratory data analysis. Two basic strategies that help us organize our exploration of a set of data are:
          1) Begin by examining each variable by itself. Then move on to study relationships among the variables.
          2) Begin with a graph or graphs. Then add numerical summaries of specific aspects of the data.
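
The second strategy can be sketched for a single quantitative variable as follows. The nine values are hypothetical, and the row of asterisks is only a crude stand-in for a proper graph.

```python
import statistics
from collections import Counter

# Hypothetical values of one quantitative variable (not real data).
values = [2, 3, 3, 4, 4, 4, 5, 5, 7]

# Step 1: begin with a graph. Here, one asterisk per observation.
counts = Counter(values)
for v in range(min(values), max(values) + 1):
    print(f"{v:>2} | {'*' * counts[v]}")

# Step 2: add numerical summaries of specific aspects of the data.
center = statistics.median(values)
average = statistics.mean(values)
spread = max(values) - min(values)
print(f"median={center}, mean={average:.2f}, range={spread}")
```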

Additional Reading and Examples
1.   A fundamental component of statistics is the ability to understand and recognize the difference and relationship
     between a population and a sample. In most problems the data that we have is a sample of the population, and
     our goal is to use this data to make statements about the population.

2. Radon is a radioactive gas that is generally present in harmless amounts in nature. However, in certain dwellings
   radon gas is known to be present in quantities that may be harmful to humans, particularly in basements of
   buildings where air is stagnant. The Environmental Protection Agency (EPA) sets standards for environmental
   emissions from hazardous substances. According to the July 1995 issue of Consumer Reports, the EPA has
   suggested that radon levels exceeding 4.0 pc/l (picocurie per liter) are associated with an increased risk of lung
   cancer. Of interest is to estimate the percentage of all dwellings in which the radon concentration poses an
   increased risk of lung cancer. Data is available for a sample of 51 residential buildings owned by a local real
   estate developer.
         The population consists of all dwellings in which humans may enter and which could therefore pose a
   health risk. The specific parameter of interest is the percentage of all these dwellings that have a radon
   concentration above 4.0 pc/l. To estimate this percentage a sample of 51 residential buildings is selected, and the
   percentage of these 51 dwellings that have radon concentrations above 4.0 pc/l can be computed. This sample
   percentage is a statistic and can be used to estimate the percentage of all dwellings with radon concentrations
   above 4.0 pc/l (the inference).
         The owner of the dwelling is a qualitative variable, because we can only name the owner. The actual radon
   concentration is measured and hence is a continuous quantitative variable. The number of dwellings with radon
   concentrations exceeding 4.0 pc/l is countable and hence is a discrete quantitative variable. However, the
   percentage of the dwellings with radon concentrations exceeding 4.0 pc/l can take an uncountable number of
   possible values and hence is a continuous quantitative variable. The number of dwellings is countable and
   hence is a discrete quantitative variable.
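
The count and the percentage discussed in the radon example can be computed as in the sketch below. The eight readings are hypothetical; they are not the data from the 51 buildings, which are not listed here.

```python
THRESHOLD = 4.0  # level cited in the text, in the same units as the readings

# Hypothetical radon readings for eight dwellings (not the actual sample).
readings = [1.2, 4.7, 0.8, 3.9, 6.1, 2.5, 4.0, 5.3]

# Number of dwellings exceeding the threshold: a count, hence discrete.
exceeding = sum(1 for r in readings if r > THRESHOLD)

# Sample percentage exceeding the threshold: the statistic used to
# estimate the percentage for all dwellings (the parameter).
percentage = 100 * exceeding / len(readings)

print(f"{exceeding} of {len(readings)} dwellings exceed {THRESHOLD}")
print(f"sample percentage: {percentage:.1f}%")
```

Note that a reading of exactly 4.0 does not count as exceeding the threshold.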
Practice Problems
How much have you learned so far? The following is a set of problems you should complete on paper and bring to
class the first day of school.

1) The branch of statistics concerned with making statements about a population based on information obtained
   from a sample is called _________.

2) The __________ is the entire group of individuals (subjects) about which the researcher wants information.

3) __________ is the name given to a characteristic when the measurements of the characteristic do not change in
   repeated trials over time.

4) A local city council is interested in determining the percentage of people who live in the city that would be in
   favor of spending the money necessary to finance the removal of tolls from the downtown expressway. They
   randomly sampled 250 city residents and asked each of them whether they would favor investing their tax
   dollars for such a purpose. Identify the population, parameter, sample, and statistic in this experiment, and
   briefly explain the inference that is taking place.

5) A telemarketing firm in Los Angeles uses a device that dials residential telephone numbers in that city at
   random. Of the first 100 numbers dialed, 48% are unlisted. This is not surprising because 52% of all Los
   Angeles residential phones are unlisted. Identify each of 100, 48%, and 52% as being parameters or statistics.

6) Identify each of the following characteristics as being a constant, a qualitative variable, a discrete quantitative
   variable, or a continuous quantitative variable. Support your choice.
   a) Type of illness                                               e) Marital status
   b) Birth rate                                                    f) Temperature of classroom (F)
   c) Number of pets owned
   d) Daily rainfall

7) On Saturday, September 4, 1999 a tornado ripped through Hampton, Virginia, damaging over 800 cars and
   forcing 1600 people from their homes. Most of the 1600 people forced from their homes lived in one of five
   apartment complexes, a retirement community, or a nursing home. The average age of these 1600 people was
   62.3 years, adding to the trauma of the incident. A sample of 45 of these individuals was interviewed. The
   average age of these 45 individuals was 57.9 years, and the average amount of property damage done to their
   possessions was $4,389. Nine of the 45 individuals, or 20%, lived in the nursing home and were temporarily
   housed in other local nursing homes and hospitals until cleanup was complete and full services restored.
   Identify each of the following as being a parameter or a statistic.
   a) 20%                                     c) 1600                                   e) 62.3
   b) 45                                      d) 57.9                                   f) $4,389

8) Identify each of the following characteristics as being a constant, a qualitative variable, a discrete quantitative
   variable, or a continuous quantitative variable. Support your choice.
   a) Housing community in which a person lived
   b) Number of cars damaged at each affected community
   c) Percentage of displaced residents living in the nursing home
Reading Assignment:
                                                Space Shuttle Challenger
           On January 28, 1986, the National Aeronautics and Space Administration (NASA) experienced its greatest
tragedy, as the space shuttle Challenger exploded less than two minutes after takeoff, killing all on board. Could this
tragedy have been avoided? To answer this question, the Rogers Commission, headed by former Secretary of State
William Rogers, studied the accident and the events that led to the fatal launching. Their investigation determined the
cause of the accident, and their findings were published in the two volume Report of the Presidential Commission on the
Space Shuttle Challenger Accident (1986).
           Through the use of statistics, the report indicates that the flight should never have taken place and hence the
explosion could have been avoided. To illustrate this, we must first understand some information on how the space
shuttle operates. A space shuttle uses two booster rockets consisting of several pieces whose connections are sealed with
rubber O-rings, with the booster rockets lifting the shuttle into orbit. Each booster has three primary O-rings, for a total
of six on the entire shuttle.
           Using data collected from previous flights, NASA had determined that the performance of the O-rings was
quite sensitive to the temperature. Due to their rubber makeup, the O-rings will change shape when compression is
placed on them. The previous data has revealed that when this compression is removed, a warm O-ring will recover its
shape, while a cold O-ring will not. When an O-ring does not recover its shape the joints will not be sealed, and hence a
gas leak is quite possible.
           Prior to the Challenger launch, the coldest launch temperature had been 53 degrees Fahrenheit. The forecasted
temperature for January 28, 1986 was only 31 degrees. Prior to the flight, the NASA engineers discussed the conditions
for the flight and decided to proceed with the launch. Unfortunately, a statistician was not involved in the discussion;
with only the data available at the time of the launch and some very simple statistical analysis, the failure of the flight
could very well have been predicted. The statistical analysis follows.
           Of the previous 23 flights, 16 of them were completed with no O-rings being damaged. The minimum
temperature of these 16 flights was 66 degrees, with an average temperature of 72.5 degrees. On five of the flights, one
O-ring was damaged, and on the other two flights two O-rings were damaged. The temperatures for these seven flights
ranged from 53 to 75 degrees, with an average temperature of 63.7 degrees. The data clearly indicate that there is a
strong relationship between launch temperature and O-ring damage, with colder temperatures associated with a higher
chance of O-ring damage. Using a more advanced statistical technique referred to as logistic regression, a function
could be estimated that would predict the probability of O-ring damage given the temperature at the time of the launch.
Using the data available from the previous 23 launches, the predicted probability of O-ring damage for a launch
temperature of 31 degrees is .96. Hence given the data available at the time of the launch, the engineers could have
predicted the near-certain O-ring damage that allowed the gas leak whose combustion resulted in the explosion of the
shuttle.
           A more exhaustive discussion of this material can be found in the 1989 paper "Risk Analysis of the Space
Shuttle: Pre-Challenger Prediction of Failure," by S. Dalal, E. Fowlkes, and B. Hoadley, which appeared in Journal of
the American Statistical Association.
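
For readers curious about the form of a logistic regression prediction, the sketch below shows the general shape of the function. The intercept and slope are placeholders chosen only to illustrate the curve; they are NOT the values estimated from the 23 flights and will not reproduce the .96 figure quoted above.

```python
import math

def damage_probability(temp_f, intercept, slope):
    """Logistic model: P(damage) = 1 / (1 + exp(-(intercept + slope * temp_f)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * temp_f)))

# Placeholder coefficients for illustration only (not fitted to real data).
INTERCEPT = 10.0
SLOPE = -0.15  # negative slope: colder launches give a higher probability

for temp in (31, 53, 66, 75):
    p = damage_probability(temp, INTERCEPT, SLOPE)
    print(f"{temp} degrees F: predicted probability {p:.2f}")
```

With a negative slope, the predicted probability always lies between 0 and 1 and rises as the temperature falls, which is the pattern the article describes.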

Writing Assignment:
Write a brief (1/2 page) summary of the above article. Include the type of data and statistics included in the article.

Article Assignment:
Find a statistics article from either a newspaper or magazine (not online). The article may be about anything that is
appropriate to discuss in school and must include some form of inferential or descriptive statistics.
