Introduction to Statistics

What Is Statistics?

Statistics is the science of collecting, organizing, and interpreting numerical facts, which we call data. We are bombarded by data in our everyday lives. Most of us associate "statistics" with the bits of data that appear in news reports: baseball batting averages, imported car sales, the latest poll of the president's popularity, and the average high temperature for today's date. Advertisements often claim that data show the superiority of the advertiser's product. All sides in public debates about economics, education, and social policy argue from data. Yet the usefulness of statistics goes far beyond these everyday examples. The study and collection of data are important in the work of many professions, so that training in the science of statistics is valuable preparation for a variety of careers.

Each month, for example, government statistical offices release the latest numerical information on unemployment and inflation. Economists and financial advisors, as well as policy makers in government and business, study these data in order to make informed decisions. Doctors must understand the origin and trustworthiness of the data that appear in medical journals if they are to offer their patients the most effective treatments. Politicians rely on data from polls of public opinion. Business decisions are based on market research data that reveal consumer tastes. Farmers study data from field trials of new crop varieties. Engineers gather data on the quality and reliability of manufactured products. Most areas of academic study make use of numbers, and therefore also make use of the methods of statistics.

We can no more escape data than we can avoid the use of words. Just as words on a page are meaningless to the illiterate or confusing to the partially educated, so data do not interpret themselves but must be read with understanding.
Just as a writer can arrange words into convincing arguments or incoherent nonsense, so data can be compelling, misleading, or simply irrelevant. Numerical literacy, the ability to follow and understand numerical arguments, is important for everyone. The ability to express yourself numerically, to be an author rather than just a reader, is a vital skill in many professions and areas of study.

The study of statistics is therefore essential to a sound education. We must learn how to read data, critically and with comprehension. We must learn how to produce data that provide clear answers to important questions. And we must learn sound methods for drawing trustworthy conclusions based on data.

The Rise of Statistics

Historically, the ideas and methods of statistics developed gradually as society grew interested in collecting and using data for a variety of applications. The earliest origins of statistics lie in the desire of rulers to count the number of inhabitants or measure the value of taxable land in their domains. As the physical sciences developed in the seventeenth and eighteenth centuries, the importance of careful measurements of weights, distances, and other physical quantities grew. Astronomers and surveyors striving for exactness had to deal with variation in their measurements. Many measurements should be better than a single measurement, even though they vary among themselves. How can we best combine many varying observations? Statistical methods that are still important were invented in order to analyze scientific measurements.

By the nineteenth century, the agricultural, life, and behavioral sciences also began to rely on data to answer fundamental questions. How are the heights of parents and children related? Does a new variety of wheat produce higher yields than the old, and under what conditions of rainfall and fertilizer? Can a person's mental ability and behavior be measured just as we measure height and reaction time?
Effective methods for dealing with such questions developed slowly with much debate.

As methods for producing and understanding data grew in number and sophistication, the new discipline of statistics took shape in the twentieth century. Ideas and techniques that originated in the collection of government data, in the study of astronomical or biological measurements, and in the attempt to understand heredity or intelligence came together to form a unified "science of data." That science of data, statistics, is the topic of this text.

Understanding from Data

The practice of statistics involves the use of many recipes for numerical calculation, some quite simple and some very complex. As you are learning how to use these recipes, remember that the goal of statistics is not calculation for its own sake, but gaining understanding from numbers. A calculator or computer can automate many of the calculations, but you must supply the understanding. Computers using specialized software always carry out the more complex procedures. A thorough grasp of the principles of statistics will enable you to quickly learn more advanced methods as needed. On the other hand, a fancy computer analysis carried out without attention to basic principles will often produce elaborate nonsense. As you study, seek to understand the principles as well as the necessary details of methods and recipes.

Definitions

Statistics is an area of mathematics that is concerned with the extraction of information from numerical data obtained during an experiment. It involves the design of the experiment, the collection and analysis of the data, and making inferences (statements) about the population based upon information in a sample.

The word statistics has two meanings.
1) Statistics refers to numerical facts such as the income of a family, the age of a student, the percentage of passes completed by the quarterback of a football team, and the starting salary of a typical college graduate.
2) Statistics refers to the field of study which uses methods to collect, analyze, present, and interpret data.

Statistics can further be divided into two areas.
1) Descriptive statistics consists of methods for organizing, displaying, and describing data by using tables, graphs, and summary measures. The first part of the course will use descriptive statistics.
2) Inferential statistics consists of methods that use sample results to help make decisions or predictions about a population.

Individuals are the objects described by a set of data. Individuals may be people, but they may also be animals or things.

A population consists of the entire group of individuals about which the researcher wants information.
Ex. 1) All U.S. citizens
    2) All male high school students

The parameter is some characteristic of the population that the researcher wants to measure.
Ex. 1) Proportion of U.S. citizens who voted in the last presidential election
    2) Average height of all male high school students

A sample is the portion of the population that is selected to study.
Ex. 1) A sample of the U.S. citizens would be the citizens of Massachusetts
    2) A sample of the male high school students would be male students at Hudson High School

A statistic is some characteristic of the sample.
Ex. 1) The proportion of people in Massachusetts who voted in the last presidential election
    2) The average height of the male students at Hudson High School

Inference is a statement about a population based on the data collected in a sample. One type of inference is using a sample statistic to estimate a population parameter.
Ex. The average height of the male students in this class can be used to estimate the average height of all male VCU students.

A distribution is a listing of all the possible values that a characteristic can take and the number of times that each value occurs.
Ex.
Gender of student: male, female

Example 1

Television station QUE wants to know the proportion of TV owners in Virginia who watch the station's new program at least once a week. The station asked a random group of 1000 TV owners in Virginia if they watch the program at least once a week. Identify the 1) population, 2) parameter, 3) sample, 4) statistic, and 5) inference.
1) The population consists of TV owners in Virginia.
2) The parameter is the proportion of TV owners in Virginia who watch the station's new program at least once a week.
3) The sample is the 1000 TV owners who were asked if they watch the program at least once a week.
4) The statistic is the proportion of the 1000 randomly selected TV owners who watch the program at least once a week.
5) The inference is that the proportion of the 1000 randomly selected TV owners who watch the program at least once a week is a good estimate of the proportion of all TV owners in Virginia who watch the program at least once a week.

A goal of statistics is to measure some characteristic about a subject or set of subjects. To assure more accurate results, we usually either measure the characteristic on several subjects (the sample), or if only one subject is available, we repeat the measurement several times. This is called repetition, and it is in repeated experiments that statistics becomes important.

When the measurements of some characteristic of the individuals do not change in repeated trials, the characteristic is called a constant.
Ex. 1) Weight of a certain rock
    2) Number of minutes in an hour

When the measurements of the characteristic vary from trial to trial, the characteristic is called a variable. Variables are classified into two categories:

1. A qualitative (or categorical) variable is a variable whose measurements vary in kind or name but not in degree, meaning that they cannot be arranged in order of magnitude.
Hence one level of a qualitative variable cannot be considered greater or better than another level.
Ex. 1) Gender: male or female
    2) Occupation: teacher, lawyer, doctor, janitor, etc.

2. A quantitative variable is a variable whose measurements vary in magnitude from trial to trial, meaning some order or ranking can be applied. Quantitative variables are further classified as being discrete or continuous.
   1. A discrete quantitative variable is a variable whose measurements can assume only a countable number of possible values.
      Ex. Number of cars in a parking deck
   2. A continuous quantitative variable is a variable whose measurements can assume any one of a countless number of values in a line interval.
      Ex. Weight of a typical student

In statistics, we are primarily concerned with the observation of variables: if we know beforehand what the measurement is going to be, as is the case with constants, then there is no reason to make the measurement.

Example 2

Identify each of the following characteristics as being a constant, a qualitative variable, a discrete quantitative variable, or a continuous quantitative variable.
a) College major: qualitative or categorical variable
b) Number of dependents: discrete quantitative variable
c) Number of people serving as President of the United States at any one time: constant
d) Age: continuous quantitative variable (not because it is always changing, but because there is no limit to how precisely a person's age can be given, e.g., in minutes, seconds, tenths of a second, etc.)
e) Eye color: qualitative or categorical variable
f) Suicide rate: continuous quantitative variable (rates are always continuous quantitative)

Example 3

The following is a small part of a spreadsheet containing data from a 1997 Road Atlas. Each row records data on one individual. Each row of data is called a case. Each column contains the values of one variable for the individuals. In addition to the state, there are 5 variables.
Nickname and Capital are qualitative or categorical variables. Population is a discrete quantitative variable. Land Area and Highest Point are continuous quantitative variables.

Most statistical software uses this format to enter data: each row is an individual, and each column is a variable. This data set appears in a spreadsheet program that has rows and columns ready for your use. Spreadsheets are commonly used to enter and transmit data. Most statistical software can read data from the major spreadsheet programs.

Statistical tools and ideas can help you examine data in order to describe their main features. This examination is called exploratory data analysis. Two basic strategies that help us organize our exploration of a set of data are:
1) Begin by examining each variable by itself. Then move on to study relationships among the variables.
2) Begin with a graph or graphs. Then add numerical summaries of specific aspects of the data.

Additional Reading and Examples

1. A fundamental component of statistics is the ability to understand and recognize the difference and relationship between a population and a sample. In most problems the data that we have are a sample from the population, and our goal is to use these data to make statements about the population.

2. Radon is a radioactive gas that is generally present in harmless amounts in nature. However, in certain dwellings radon gas is known to be present in quantities that may be harmful to humans, particularly in basements of buildings where air is stagnant. The Environmental Protection Agency (EPA) sets standards for environmental emissions from hazardous substances. According to the July 1995 issue of Consumer Reports, the EPA has suggested that radon levels exceeding 4.0 pc/l (picocurie per liter) are associated with an increased risk of lung cancer. The goal is to estimate the percentage of all dwellings in which the radon concentration poses an increased risk of lung cancer.
Data are available for a sample of 51 residential buildings owned by a local real estate developer. The population consists of all dwellings which humans may enter and which could therefore pose a health risk. The specific parameter of interest is the percentage of all these dwellings that have a radon concentration above 4.0 pc/l. To estimate this percentage, a sample of 51 residential buildings is selected, and the percentage of these 51 dwellings that have radon concentrations above 4.0 pc/l can be computed. This sample percentage is a statistic and can be used to estimate the percentage of all dwellings with radon concentrations above 4.0 pc/l (the inference).

The owner of the dwelling is a qualitative variable, because we can only name the owner. The actual radon concentration is measured and hence is a continuous quantitative variable. The number of dwellings with radon concentrations exceeding 4.0 pc/l is countable and hence is a discrete quantitative variable. However, the percentage of the dwellings with radon concentrations exceeding 4.0 pc/l can take an uncountable number of possible values and hence is a continuous quantitative variable. The total number of dwellings is likewise countable and hence is a discrete quantitative variable.

Practice Problems

How much have you learned so far? The following is a set of problems you should complete on paper and bring to class the first day of school.

1) The branch of statistics concerned with making statements about a population based on information obtained from a sample is called _________.

2) The __________ is the entire group of individuals (subjects) about which the researcher wants information.

3) __________ is the name given to a characteristic when the measurements of the characteristic do not change in repeated trials over time.

4) A local city council is interested in determining the percentage of people who live in the city that would be in favor of spending the money necessary to finance the removal of tolls from the downtown expressway. They randomly sampled 250 city residents and asked each of them whether they would favor investing their tax dollars for such a purpose. Identify the population, parameter, sample, and statistic in this experiment, and briefly explain the inference that is taking place.

5) A telemarketing firm in Los Angeles uses a device that dials residential telephone numbers in that city at random. Of the first 100 numbers dialed, 48% are unlisted. This is not surprising because 52% of all Los Angeles residential phones are unlisted. Identify each of 100, 48%, and 52% as being parameters or statistics.

6) Identify each of the following characteristics as being a constant, a qualitative variable, a discrete quantitative variable, or a continuous quantitative variable. Support your choice.
a) Type of illness
b) Birth rate
c) Number of pets owned
d) Daily rainfall
e) Marital status
f) Temperature of classroom (°F)

7) On Saturday, September 4, 1999 a tornado ripped through Hampton, Virginia, damaging over 800 cars and forcing 1600 people from their homes. Most of the 1600 people forced from their homes lived in one of five apartment complexes, a retirement community, or a nursing home. The average age of these 1600 people was 62.3 years, adding to the trauma of the incident. A sample of 45 of these individuals was interviewed. The average age of these 45 individuals was 57.9 years, and the average amount of property damage done to their possessions was $4,389. Nine of the 45 individuals, or 20%, lived in the nursing home and were temporarily housed in other local nursing homes and hospitals until cleanup was complete and full services restored. Identify each of the following as being a parameter or a statistic.
a) 20%
b) 45
c) 1600
d) 57.9
e) 62.3
f) $4,389

8) Identify each of the following characteristics as being a constant, a qualitative variable, a discrete quantitative variable, or a continuous quantitative variable. Support your choice.
a) Housing community in which a person lived
b) Number of cars damaged at each affected community
c) Percentage of displaced residents living in the nursing home

Reading Assignment: Space Shuttle Challenger

On January 28, 1986 the National Aeronautics and Space Administration (NASA) experienced its greatest tragedy, as the space shuttle Challenger exploded less than two minutes after takeoff, killing all on board. Could this tragedy have been avoided? To answer this question, the Rogers Commission, headed by former Secretary of State William Rogers, studied the accident and the events that led to the fatal launch. Their investigation determined the cause of the accident, and their findings were published in the two-volume Report of the Presidential Commission on the Space Shuttle Challenger Accident (1986). Through the use of statistics, the report indicates that the flight should never have taken place and hence the explosion could have been avoided.

To illustrate this, we must first understand some information on how the space shuttle operates. A space shuttle uses two booster rockets to lift it into orbit. Each booster consists of several pieces whose connections are sealed with rubber O-rings. Each booster has three primary O-rings, for a total of six on the entire shuttle. Using data collected from previous flights, NASA had determined that the performance of the O-rings was quite sensitive to temperature. Because they are made of rubber, the O-rings change shape when they are compressed. The previous data had revealed that when this compression is removed, a warm O-ring will recover its shape, while a cold O-ring will not.
When an O-ring does not recover its shape the joints will not be sealed, and hence a gas leak is quite possible. Prior to the Challenger launch, the coldest launch temperature had been 53 degrees Fahrenheit. The forecasted temperature for January 28, 1986 was only 31 degrees. Prior to the flight, the NASA engineers discussed the conditions for the flight and decided to proceed with the launch. Unfortunately a statistician was not involved in the discussion, because with only the data available at the time of the launch and some very simple statistical analysis, the failure of the flight could very well have been predicted.

The statistical analysis follows. Of the previous 23 flights, 16 were completed with no O-rings being damaged. The minimum temperature of these 16 flights was 66 degrees, with an average temperature of 72.5 degrees. On five of the flights one O-ring was damaged, and on the other two flights two O-rings were damaged. The temperatures for these seven flights ranged from 53 to 75 degrees, with an average temperature of 63.7 degrees. The data clearly indicate that there is a strong relationship between launch temperature and O-ring damage, with colder temperatures associated with a higher chance of O-ring damage.

Using a more advanced statistical technique referred to as logistic regression, a function could be estimated that would predict the probability of O-ring damage given the temperature at the time of the launch. Using the data available from the previous 23 launches, the predicted probability of O-ring damage for a launch temperature of 31 degrees is 0.96. Hence, given the data available at the time of the launch, the engineers could have predicted the near-certain O-ring damage that allowed the gas leak whose combustion resulted in the explosion of the Challenger.

A more exhaustive discussion of this material can be found in the 1989 paper "Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure," by S. Dalal, E.
Fowlkes, and B. Hoadley, which appeared in the Journal of the American Statistical Association.

Writing Assignment: Write a brief (1/2 page) summary of the above article. Include the type of data and statistics included in the article.

Article Assignment: Find a statistics article from either a newspaper or magazine (not online). The article may be about anything that is appropriate to discuss in school and must include some form of inferential or descriptive statistics.
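
Worked sketch: the descriptive statistics quoted in the Challenger reading can be recombined with a few lines of arithmetic. The short Python sketch below is only an illustration and is not part of the original analysis. It uses only the summary figures given in the reading (23 flights; 16, 5, and 2 flights with zero, one, and two damaged O-rings; group average temperatures of 72.5 and 63.7 degrees). The variable names are made up for this example, and the individual flight temperatures are not reproduced here, so no logistic regression is attempted.

```python
# Summary figures quoted in the reading (23 shuttle flights before Challenger).
# Key: number of damaged O-rings on a flight; value: number of such flights.
flights_by_damage = {0: 16, 1: 5, 2: 2}

# Average launch temperatures (degrees Fahrenheit) reported for each group.
mean_temp_no_damage = 72.5  # the 16 flights with no O-ring damage
mean_temp_damage = 63.7     # the 7 flights with at least one damaged O-ring

total_flights = sum(flights_by_damage.values())
damaged_flights = sum(n for rings, n in flights_by_damage.items() if rings > 0)
damaged_rings = sum(rings * n for rings, n in flights_by_damage.items())

# Sample proportion of flights with any O-ring damage (a statistic).
prop_damaged = damaged_flights / total_flights

# Overall mean temperature, recovered as a weighted average of the group means.
undamaged_flights = total_flights - damaged_flights
overall_mean_temp = (
    undamaged_flights * mean_temp_no_damage
    + damaged_flights * mean_temp_damage
) / total_flights

print(f"flights: {total_flights}, flights with damage: {damaged_flights}")
print(f"damaged O-rings in total: {damaged_rings}")
print(f"proportion of flights with damage: {prop_damaged:.3f}")
print(f"overall mean launch temperature: {overall_mean_temp:.1f} F")
print(f"gap between group means: {mean_temp_no_damage - mean_temp_damage:.1f} F")
```

The counts and proportion are discrete summaries of the sample, while the temperatures are continuous measurements. All of these are statistics describing the 23 earlier flights, not parameters of all possible launches.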