Thanks to Joe Gheopreal (class of 2008) for outlining Chapter 5

Chapter 5: Producing Data

Introduction: In statistics, we often need answers about immense groups of individuals. To get proper answers, we must produce data in a way that will answer our questions. Since it is usually nearly impossible to ask every subject in the population, we must generate a sample that accurately represents the entire population. Two ways to gather data that give the least disturbed image of the population are an observational study and an experiment.

Observational Study- the observation of individuals and the measurement of variables of interest. NOTE: An observational study does NOT attempt to influence the responses.

Experiment- the deliberate imposing of some treatment on individuals in order to observe their responses.

Observational studies are great when a statistician explores topics such as opinions and behaviors. But to gauge the effect of an intervention, a statistician needs to impose a change, since the goal is to understand cause and effect. Observational studies of the effect of one variable on another often fail because the explanatory variable is confounded with lurking variables. Only well-designed experiments take steps to defeat confounding. Sometimes we can neither observe individuals directly nor perform an experiment, so simulations provide an alternative method for producing data in such circumstances.

Statistical techniques for producing data open the door to statistical inference: answering specific questions with a known degree of confidence. In the end, the most important prerequisite for a trustworthy inference is careful design of data production.

Part 1: Designing Samples

In most cases, we are gathering information about a large group of individuals.
In the real world, we do not have the time and money to contact every individual in the entire population. Instead, we gather information from only a part of the group so that we can draw conclusions about the entire group.

Population- The ENTIRE group of individuals that we want information about.

Sample- A part of the population that we actually examine in order to gather information.

Our population is defined in terms of our desire for knowledge. For example, if we wish to draw conclusions about all the snowmen made in the U.S. during a winter storm, that group is our population, even if the snowmen in your neighborhood are the only snowmen you see. The sample is the part from which we draw conclusions about the whole. To collect data we can either sample or conduct a census.

Sampling- studying a part in order to gain information about the whole.

Census- an attempt to contact every individual in the entire population.

A carefully conducted sample can often be more accurate than a census. For example, a farmer can sample his corn inventory to check how much corn he has grown. Attempting to count every last piece of corn would not just make him sick of the crop, but would also be expensive and inaccurate, since bored people do not count carefully.

For conclusions based on a sample to be valid for the entire population, a proper design for selecting the sample is required.

Design- the method used to choose the sample from the population.

Poor sample designs can produce MISLEADING conclusions…

Example 1: American Idol Example

It is down to the final two singers on American Idol, Sally Singsgreat and Bobby Badvoice. As usual on the show, the host asks who should be the American Idol. Let’s say 290,000 American callers responded and 86% said they want Bobby Badvoice to win. What is wrong with this sampling?

Solution: People who actually spend time and money to respond to call-in polls are not representative of the entire population.
In fact, they tend to be the same people who call radio shows. People who feel strongly are more likely to call. It would not be surprising if a properly designed sample showed that 79% want Sally Singsgreat to win. Call-in opinion polls are an example of voluntary response sampling.

Voluntary response sample- a sample that consists of people who choose themselves by responding to a general appeal. Voluntary response samples are biased because people with strong opinions, especially negative opinions, are more likely to respond.

Voluntary response is one common type of bad sample design. Another is shown in the following example…

Example 2: Should the Mall be remodeled Example

The King of Prussia Mall wants to determine whether it should remodel, so it decides to ask mall shoppers whether the mall should be remodeled. What is the problem with this sampling?

Solution: This is a form of bad sample design called convenience sampling.

Convenience sampling- a sample that chooses the individuals easiest to reach.

This sample does not represent the entire population. For example, people who go to malls often tend to be richer, teenagers, or retired. Plus, mall officials might tend to select neat, safe-looking individuals from the stream of customers.

Both forms of sampling are almost guaranteed not to represent the entire population. These sampling methods display systematic error, or bias.

Bias- systematically favoring some parts of the population over others.

To eliminate bias, the statistician allows impersonal chance to choose the sample. A sample chosen by chance allows neither favoritism by the sampler nor self-selection by respondents. Choosing a sample by chance eliminates bias by giving all individuals an equal chance to be chosen. The simplest way to use chance to select a sample is to place names in a hat (the population) and draw out a handful (the sample).
This is called simple random sampling.

Simple Random Sample (SRS)- consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected.

An SRS not only gives each individual an equal chance of being chosen, but also gives every possible sample an equal chance of being chosen. The idea of an SRS is to choose our sample by drawing names from a hat. In practice, computer software can choose an SRS from a list of the individuals in the population. If software is unavailable, one can randomize by using a table of random digits.

Table of Random Digits- a long string of the digits 0 to 9 with the following properties: each entry in the table is equally likely to be any of the 10 digits 0 through 9, and the entries are independent of each other (this means knowledge of one part of the table gives no information about any other part).

For random digits, refer to table B at the back of the book. The digits are put in groups of five only to make them easier to read; the groups have no real meaning. Use table B for the following example…

Example 3: Pizza Delivery Example

You are hosting a party and everyone wants pizza. You decide to choose randomly which pizza places you will order from. You plan to select two of the pizza places (an SRS of 2).

Solution: Begin by labeling the places. For this example we will use 15 places, with two-digit labels from 00 to 14. Here are the places with the labels attached.

00 – Pizza Hut
01 – Dominos
02 – Papa Johns
03 – Franzone’s
04 – Angelo’s
05 – Creaser’s
06 – Pizza Planet
07 – Peace of Pizza
08 – Famous Georges
09 – Costco
10 – Uno’s
11 – Burchuchi’s
12 – Leaning Tower of Pizza
13 – Pizza Castle
14 – Sabarro’s

Second, enter table B at any line and read two-digit groups. For this example, here is line 115.
61041 77684 94322 24709 73698 14526 31893 31592

The two-digit numbers made from this line are…

61 04 17 76 84 94 32 22 47 09 73 69 81 45 26 31 89 33 15 92

As you can see, most of the two-digit groups are not labels, so we simply ignore them. The first two groups that are labels (00 to 14) form our sample: in this case 04 (Angelo’s) and 09 (Costco).

Choosing an SRS takes two steps. The first step is to assign a numerical label to every individual in the population. The second step is to use table B or any random number generator to select labels at random. Be sure that all labels have the same number of digits so they all have the same chance of being chosen. Use the shortest possible labels: one digit for a population of up to 10 members, two digits for 11 to 100 members, three digits for 101 to 1000 members, etc.

The general framework for designs that use chance to choose a sample is a probability sample.

Probability Sample – A sample chosen by chance. We must know what samples are possible and what chance, or probability, each possible sample has.

Some probability sampling designs, like an SRS, give each member of the population an equal chance to be selected. This may not be true in more elaborate sampling designs. In every case, however, the use of chance to select the sample is the essential principle of statistical sampling.

Designs for sampling from large populations spread out over a wide area are usually more complex than an SRS. It is common to sample important groups within the population separately and then combine these samples. This is called a stratified sample.

Stratified random sample – first divide the population into groups of similar individuals, called strata, then choose a separate SRS in each stratum and combine these SRSs to form the full sample.

Strata – the groups of similar individuals into which the population is divided. One chooses the strata based on facts known before the sample is taken.
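
The stratified design just defined can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `stratified_sample`, the two made-up strata, and the sample sizes are hypothetical and do not come from the chapter. The standard library's `random.sample` plays the role of drawing names from a hat within each stratum.

```python
import random


def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw a separate SRS from each stratum and combine the results.

    strata        -- dict mapping stratum name to a list of individuals
    n_per_stratum -- dict mapping stratum name to the SRS size for that stratum
    """
    rng = random.Random(seed)
    combined = []
    for name, members in strata.items():
        k = n_per_stratum[name]
        # rng.sample picks k distinct members; every set of k is equally likely,
        # which is exactly the SRS property, applied inside one stratum.
        combined.extend(rng.sample(members, k))
    return combined


# Hypothetical population, for illustration only.
strata = {
    "seniors": [f"senior_{i:03d}" for i in range(200)],
    "juniors": [f"junior_{i:03d}" for i in range(300)],
}
sample = stratified_sample(strata, {"seniors": 4, "juniors": 6}, seed=1)
print(sample)
```

Passing a seed only makes the draw repeatable for checking; leaving it out gives a fresh random sample each time.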
A population of bees, for example, can be divided into worker, drone, larva, and queen strata. A stratified design can produce more exact information than an SRS of the same size by taking advantage of the fact that individuals in the same stratum are similar to one another. If all individuals in each stratum are identical, just one individual from each stratum is enough to completely describe the population.

Another common way to restrict random selection is to choose the sample in stages. This is a multistage sampling design.

Multistage samples – selecting successively smaller groups within the population in stages, resulting in a sample consisting of clusters of individuals.

Analysis of data from sampling designs more complex than an SRS goes beyond basic statistics. The SRS is the building block of more elaborate designs, and analysis of other designs differs more in complexity of detail than in fundamental concepts.

Random sampling eliminates bias in the choice of the sample from a population, yet accurate information from a sample requires more than a good sampling design. We also need an accurate and complete list of the population; otherwise the sample suffers from undercoverage.

Undercoverage – when some groups in the population are left out of the process of choosing the sample. For example, a sample survey of households will miss the homeless, prisoners, and students in dormitories.

Undercoverage is present to some degree in most surveys and is nearly unavoidable, but a more serious source of bias is nonresponse.

Nonresponse – when an individual chosen for the sample cannot be contacted or does not cooperate.

Example 4: Lance’s Date Example

Lance is looking for a date for the next prom, so he sends e-mails to every girl in the senior class, asking if she wants to go to prom with him. Of all the girls that Lance e-mailed, only 14% responded, and all of them said no.
There is a way for Lance to get a date. How can he adjust his survey so that he is able to get a date for the prom?

Solution: There are several problems with Lance’s survey. First, as discussed earlier, it is a voluntary response sample, which may be why all 14% of the girls who answered said no. Second, the sample suffers from undercoverage: he asked only the senior girls, not any juniors, sophomores, or freshmen. Last, since only 14% responded, Lance’s survey suffers from nonresponse; some of the girls who did not reply might say yes to him in person, or more likely no in person.

Yet some girls might see that Lance does not have a date and decide to say yes to him just to make him happy. This is an example of a cause of response bias.

Response bias – bias in sample results caused by the behavior of the respondent or of the interviewer.

Respondents may lie if asked about behavior that is unpopular or illegal. The sample then underestimates the presence of such behavior in the population. The interviewer’s attitude, race, or sex and the respondent’s faulty memory can easily influence responses. Good interviewing technique is therefore another aspect of a well-done sample survey. The most important influence on the answers given to a sample survey is the wording of the question.

Wording of question – Confusing or leading questions can introduce strong bias.

But how accurate are the results of a survey? If we take another sample, we can get different results. But since we purposely use chance, the results obey the laws of probability. In short, larger random samples give more accurate results than smaller samples.

Another important part of designing a survey is the sample frame.

Sample frame – the list of individuals from which a sample is actually selected. Ideally, the frame should list every individual in the population, but in practice this is often difficult.

Another type of sample used is a systematic random sample.
Systematic random sample – Similar to an SRS, but after a random starting point the individuals are chosen systematically at a fixed interval (e.g., 10, 30, 50, 70, 90).

Part 2: Designing Experiments

A study is an experiment when we actually do something to people, animals, or objects in order to observe the response.

Experimental Units – The individuals on which the experiment is done.

Subject – What the units are called when they are human beings.

Treatment – A specific experimental condition applied to the units.

Because the purpose of an experiment is to reveal the response of one variable to changes in other variables, the distinction between explanatory and response variables is important.

Factors – The explanatory variables in an experiment.

Many experiments study the joint effects of several factors. Each treatment is formed by combining a specific value of each of the factors.

Level – a specific value of a factor.

Example 5: The drug dealer experiment

A drug dealer is wondering if taking two drugs at the same time makes you more addicted to each. The drug dealer decides to ask the friends of his customers whether they are more addicted to drugs. How should this experiment be conducted?

Solution: For this experiment the dealer should divide the subjects into four groups.

Group 1: Drug 1 and Drug 2
Group 2: Drug 1 and placebo
Group 3: placebo and Drug 2
Group 4: placebo and placebo

Placebo – a dummy treatment that looks and tastes like the treatment being tested but has none of the active ingredients.

A study must be aware of lurking variables, for example whether the guys are getting drugs from somewhere else, or what they eat, etc.

What makes an experiment more advantageous than an observational study is that experiments give good evidence of causation. Experiments also let us study only the factors that we are actually interested in, as well as the combined effects of several factors.
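
The four groups in Example 5 are simply every combination of one level from each of the two factors. A minimal sketch of that idea follows; the dictionary keys `drug_1` and `drug_2` are names assumed here for illustration, and `itertools.product` from the Python standard library does the combining.

```python
from itertools import product

# Each factor (explanatory variable) has two levels: the drug or a placebo.
# The factor names are hypothetical labels chosen for this sketch.
factors = {
    "drug_1": ["Drug 1", "placebo"],
    "drug_2": ["Drug 2", "placebo"],
}

# A treatment is one level of every factor, so the treatments are the
# Cartesian product of the level lists: 2 x 2 = 4 treatments.
treatments = list(product(*factors.values()))

for number, combo in enumerate(treatments, start=1):
    print(f"Group {number}: " + " and ".join(combo))
```

Adding a third level to either factor, or a third factor, grows the list of treatments automatically, which is why experiments with several factors need many groups.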
The design of an experiment is the following:

Units -> Treatment -> Observe Response

Yet in experiments there is a chance that a response is due to a lurking variable rather than the treatment…

Example 6: Luca Running Experiment

Luca is determined to run a faster time than Marco, so when he sees new track shoes, the Speedy Gonzales, that claim they will make you faster, he quickly buys them. Luca then decides to alternate shoes every other day, and he finds that on the days he wears the Speedy Gonzales, he beats Marco. What is wrong with Luca’s experiment?

Solution: The kid may be a great runner, but he is not as good a statistician. His experiment is poorly designed because it suffers from the placebo effect.

Placebo Effect – When a subject responds favorably to any treatment, even a placebo.

Since Luca believed the Speedy Gonzales would make him run faster than Marco, he probably ran a little harder on those days than on the other days. The results were confounded with the placebo effect.

Confounded – Mixed up with.

The days Luca wore his regular shoes are the days in the control group.

Control Group – the group receiving the placebo (or no treatment), used as a baseline for comparison.

The control group enables us to control the effects of outside variables on the outcome. Control is the first basic principle of statistical design of experiments. The simplest form of control is comparison.

Many experimenters try to match groups by elaborate balancing acts. Matching is helpful, yet not adequate, because there are too many lurking variables that might affect the outcome. The statistician’s remedy for this problem is to rely on chance to make an assignment that does not depend on any characteristic of the experimental units and does not rely on the judgment of the experimenter in any way. The use of chance can be combined with matching, but the simplest design creates groups by chance alone, as the following example will show.
Example 7: Dog crap problem

Veterinarians want to help dogs with constipation get healthy again, so they decide to test a new constipation drug for dogs. The response variable is a dog’s crapage over a 30-day period. The control group eats a placebo. There are 40 dogs. How will we conduct this experiment?

Solution: Divide the 40 dogs into two groups of 20. To do this without bias, number the dogs 00 to 39 and assign them to the two groups at random. Here is a diagram of the experiment…

                  Random assignment
                  /               \
      Group 1 (20 dogs)      Group 2 (20 dogs)
              |                      |
        Treatment 1:           Treatment 2:
         crap pill               placebo
                  \               /
                   Compare crapage

Randomization, the use of chance to divide experimental units into groups, is an essential ingredient of good experimental design. The logic behind the randomized comparative design shown above is as follows:

Randomization produces groups of dogs that should be similar in all respects before the treatment is applied.

Comparative design ensures that influences other than the pill operate equally on both groups.

Therefore, differences in average crapage must be due either to the pill or to the play of chance in the random assignment of dogs to the pills.

The reason we assign many dogs to each group is that the effects of chance will then average out, and there will be little difference in the average crapage of the two groups unless the pills themselves cause a difference. The use of enough experimental units to reduce chance variation is the third big idea of statistical design of experiments.

The basic principles of statistical design of experiments are

1. Control the effects of lurking variables on the response, most simply by comparing two or more treatments.
2. Randomize – Use impersonal chance to assign experimental units to treatments.
3. Replicate each treatment on many units to reduce chance variation in the results.
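
The random assignment in Example 7 can be sketched with software instead of table B. The helper name `randomize` and the seed value are assumptions made for this illustration; the sketch also assumes the number of units divides evenly among the groups, as 40 dogs do into two groups of 20.

```python
import random


def randomize(units, group_names, seed=None):
    """Use chance alone to split units into equal-sized treatment groups."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)  # every ordering of the units is equally likely
    size = len(shuffled) // len(group_names)  # assumes an even split
    return {
        name: sorted(shuffled[i * size:(i + 1) * size])
        for i, name in enumerate(group_names)
    }


# Label the 40 dogs 00 to 39, as in the example.
dogs = [f"{i:02d}" for i in range(40)]
groups = randomize(dogs, ["crap pill", "placebo"], seed=5)

for name, members in groups.items():
    print(name, len(members), members)
```

Because the shuffle ignores everything about the dogs, neither the dogs’ characteristics nor the experimenter’s judgment can influence which group a dog lands in, which is the point of principle 2.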
We hope to see a difference in the responses so large that it is unlikely to happen just because of chance variation. We try to learn whether the treatment effects are larger than we would expect to see if only chance were operating. If they are, then they are statistically significant.

Statistically significant – An observed effect so large that it would rarely occur by chance.

To compare an array of treatments, the simplest choice is a completely randomized design.

Completely randomized – When all experimental units are allocated at random among all the treatments.

The logic of a randomized comparative experiment depends on our ability to treat all the experimental units identically in every way except for the actual treatments being compared. Therefore, careful attention to detail is a must for good experiments. Some experiments can be plagued by unconscious bias, so whenever possible an experiment should be double blind.

Double blind – neither the subjects nor the people who have contact with them know which treatment a subject received.

An experiment can also be plagued by a serious lack of realism.

Lack of realism – The subjects, treatments, or setting of an experiment may not realistically duplicate the conditions we really want to study.

Example 8: Pulse Introduction Example

Pulse wants to restart the introduction to its daily broadcast. So they made two introductions and are deciding which one to use for the first broadcast. They brought in groups of students and told them that they were viewing the introductions for an experiment. What is wrong with this experiment?

Solution: First, this experiment is not blind. Second, we cannot be sure the results apply to everyday students, since the students know this is an experiment; hence the setting is not realistic. Lack of realism can limit our ability to apply the conclusions of an experiment to the settings of greatest interest.
Example 9: Rabbit trap Example

Rabbits can be a problem for farmers, and a new trap has come out to stop them. This trap uses the scent of a specific vegetable to lure the rabbits. It is believed that the carrot trap is better than the lettuce trap. So a farmer sets up an equal number of traps of each kind and counts the rabbits caught by each kind of trap. How is this experiment organized?

Solution: This experiment is organized as a matched pairs design.

Matched pairs – Compares two treatments. For matched pairs we choose blocks of two units that are as closely matched as possible.

Block – a group of experimental units or subjects that are known before the experiment to be similar in some way that is expected to affect the response to the treatments.

Block design – the random assignment of units to treatments is carried out separately within each block.

Block designs can have blocks of any size. A block design combines the idea of creating equivalent treatment groups by matching with the principle of forming treatment groups at random. Blocks are another form of control. For example…

Example 10: STDs and Gender example

The progress of S.T.D. (Sexually Terrifying Disease), a type of STD, differs between women and men. How can this experiment be properly done?

Solution: Use gender as the block. Two separate randomizations are done, one assigning the men to the treatment groups and one assigning the women. Note that there is no randomization in making the blocks themselves. Then conduct the experiment as normal.

             Men ----- Group (x3) – Therapy (x3) – Compare results
           /
Subjects
           \
             Women --- Group (x3) – Therapy (x3) – Compare results

Part 3: Simulating Experiments

There are three methods we use to answer questions involving chance…

1. Try to estimate the likelihood of a result of interest by actually carrying out the experiment many times and calculating the result’s relative frequency. That is slow, sometimes costly, and often impractical or logistically difficult.
2. Develop a probability model and use it to calculate a theoretical answer. This requires that we know something about the rules of probability and therefore may not be feasible… yet.
3. Start with a model that, in some way, reflects the truth about the experiment, and then develop a procedure for imitating, or simulating, a number of repetitions of the experiment. This is quicker than repeating the real experiment, especially if we use a calculator, and it allows us to do problems that are hard when done formally.

Example 11: Gambling Example

Suppose that for some gambling game, you win if the majority of 10 drawn cards are spades or clubs. But five of the cards in the deck are damaged and unusable: two are black suits and three are red suits. How do we conduct this simulation?

Simulation – The imitation of chance behavior, based on a model that accurately reflects the experiment under consideration.

Solution: There are a few steps to follow when doing a simulation…

1. State the problem or describe the experiment – Draw 10 cards. What is the likelihood of a draw where more than 5 of the cards are spades or clubs (black suits)?
2. State the assumptions – The cards are drawn without replacement from the 47 usable cards, so the picks are not independent of each other.
3. Assign digits to represent outcomes – Assign the two-digit labels 00 to 46 to the cards (since five of the cards are destroyed): 00 – 23 are black suits and 24 – 46 are red suits. Use the random digit table in the back (table B) to select labels until 10 different cards are selected, ignoring labels above 46 and repeated labels.
4. Simulate many repetitions – Reading two-digit groups from table B until 10 cards are selected simulates one repetition. Repeat from new places in the table to simulate many repetitions. Make sure to keep track of whether or not the event we are interested in has occurred.
5. State your conclusion – Let’s say that after 1000 simulations, 634 of those simulations had a majority of clubs and spades selected.
We estimate the probability of a win by the proportion of simulated wins… 634/1000 = 63.4%

In the real world, if we did 1000 simulations by hand, we would go crazy. But a small number of simulations, like 20, would not be enough. That is why, with enough understanding of simulations, we can use a computer to do a large number of simulations. Under the model above (10 cards drawn without replacement from 24 black and 23 red cards), a long simulation (or mathematical analysis) finds that the true probability of more than 5 black cards is about 39.1%, so the count of 634 used here is only a made-up figure to show the arithmetic. The hardest part of this process is establishing a correspondence between random numbers and outcomes in the experiment, so it must be done carefully.

While not true of the above problem, some problems consist of independent trials.

Independent – The result of one part of the experiment does not affect the result of the next part.

The reason the draws in the above problem are not independent is that when one card is removed from the deck, the chance increases that the next card is of the opposite color.

Test Questions:

1. YouTube wants to know if it should add advertisements before every video. So for one month YouTube asks the first 1000 viewers of each day to state their opinion on the question. This type of sample is most likely
A. Voluntary Response Sample.
B. Convenience Sample.
C. Simple Random Sample.
D. Probability Sample.
E. Stratified Random Sample.

2. The students of A.V. Agadro School have been asked, in front of the science department, which is conducting the survey, whether they drink alcohol. Since some of the students did drink, they decided to ditch school so they would not get caught. This survey is suffering from…
A. Undercoverage.
B. Nonresponse.
C. Response bias.
D. None of the above.
E. Two of the above.

3. A monkey, Chimpy, is one of 3 trillion chimps used as part of an experiment on a pill that is planned to be sold as an intelligence-raising pill.
Chimpy took the one pill a day that he was given for a month and was able to write Shakespeare, defuse a nuclear bomb, and compute pi to the 1,000,000,000,000,000,000,000,000th place. There is evidence that the monkeys know what the pill is capable of doing. During that month Chimpy could have taken…
A. A placebo.
B. The intelligence-raising pill.
C. Both pills.
D. Either pill.
E. A totally different pill from the above.

4. An experiment investigates whether playing another video game before the tested game helps the player score better on the tested game. Before treating the subjects, the experimenters divide them into those who have proven video game experience and those who have never played a game in their lives. This is an example of…
A. Lack of realism.
B. Matched pairs design.
C. Double blind.
D. Block design.
E. None of the above.

5. You are an intern for the FDA. Your boss tells you to conduct a simulation to check how many restaurants have disobeyed FDA regulations. Based on last year’s data, 34 of 100 restaurants disobeyed FDA regulations. You need to conduct a simulation to estimate the probability that a restaurant disobeys FDA regulations.
A. State the assumptions made in this simulation.
B. How are you going to assign digits to accurately represent the outcomes of this experiment? Remember to make it clear, as this system is needed for the following questions.
C. Using line 120 of table B, simulate this experiment with the system you made in B.
D. Now do the same thing as C, with line 122 of table B.
E. Do the same thing as C, with any other line of table B. Remember to state the line you selected or you get no credit.
F. Calculate the relative frequency of restaurants that fail FDA regulations and state your conclusions.
G. Now repeat this simulation 20 times on the calculator. Calculate the relative frequency and state your conclusions.
H. Now which relative frequency would you report to your boss, and why?