Stat 305 - Lab 1

Goal: Learn how to compute basic summary statistics using STATA and to interpret the results. Learn how to plot basic graphs and interpret the results.

Qm* This is the way the problems you must answer are designated: with 'Q*' and bolded, m is a number.

Information about the dataset used for this lab:
158 fishes of 7 species are caught and measured. There are 3 variables: species, weight and length. All the fish are caught from the same lake (Laengelmavesi) near Tampere in Finland.

Species Code:
1 Bream
2 Whitefish
3 Roach
4 Parkki
5 Smelt
6 Pike
7 Perch

Weight is measured in grams (g). Length is measured in centimeters (cm).

*You may want to open a new Word document to record your results and graphs.

0. The STATA environment:
Open STATA. Familiarize yourself with the STATA workspace.
Different windows that open:
Stata Results: displays results
Commands: to type in commands; we won't use this. If you want to use the command window, 'PageUp' and 'PageDown' scroll through the previous commands.
Review: gives a history of your commands, actions
Variables: displays the variables of your data set
For this class, the only windows that you will need to use are the 'Stata Results' and the 'Variables' windows because we will focus on using the pull-down menus. When you perform a task using these menus, the command that you could also use will appear in the 'Stata Results' window. You may use these commands in the 'Commands' window if you like.

1. Entering Data:
Download the data from http://www.owlnet.rice.edu/~stat305/projects.htm.
Open Excel spreadsheet: fishcatch1.xls
Save spreadsheet as 'text (Tab delimited)', e.g. as fishcatch1.txt
Import data into STATA:
In the pull down menus, click on 'File / Import / ASCII data created by spreadsheet'
Enter filename, fishcatch1.txt
Choose the tab delimited option, then click 'OK'
The variables in your spreadsheet will appear in the "Variables" window.
The left column gives the variable names in STATA and the right column gives the original variable names.

2. Finding Summary Statistics: (mean, std.dev., min, max, percentiles)
From the pull-down menus, go to 'Statistics / Summaries, tables, & tests / Summary statistics / Summary statistics'
You will only need to use the 'Main' tab for now: enter the variables of interest, choose the options in which you are interested, and click 'OK'.
The results from the analysis will appear in the 'Stata Results' window.

Q1* Find the mean, standard deviation, and the Five Number Summary (minimum, maximum, and quartiles) for the variables weight and length.

Q2* Which is a better measure of center for weight? For length? Explain. Does it make sense to find these same statistics from Q1 for the Species variable? Explain. What other statistic could you use as a measure of center?

3. Graphing:
Go to 'Graphics'

a) Histogram:
Select 'Histogram'
You will mostly need to use the main tab:
In the options:
Type in the variable of interest (e.g. weight)
Under 'Y - axis', select 'Frequency' or 'Density'
You can designate the number of bins or the width of the bins, but not both (because one is determined by the other). If you do not designate either, STATA chooses a default method (not Sturges' rule) for deciding the number of bins.
The title tab allows you to give the graph a title; there are also tabs to label the x and y axes.

The histogram is a way of summarizing quantitative data (either discrete or continuous). It is often used in exploratory data analysis to illustrate the major features of the distribution of the data.

A histogram is constructed by dividing up the range of possible values in a data set into nonoverlapping intervals or classes called bins and then counting the number of observations that fall into each bin. The length of the interval is called the bin width and the number of observations that fall into a bin is called the bin count.
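The construction just described, together with Sturges' rule and the bin-width formula stated in the next paragraph, can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not what STATA does internally. The data values and the function names sturges_bins and histogram_counts are made up for this sketch, and the sample size is deliberately not the one used in the lab.

```python
import math

def sturges_bins(n):
    """Number of bins from Sturges' rule, rounded up to a whole number."""
    return math.ceil(1 + 3.322 * math.log10(n))

def histogram_counts(data, k):
    """Count observations in k equal-width bins spanning min(data)..max(data).

    Assumes the data values are not all identical (otherwise the width is 0).
    """
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("all values are equal; bin width would be zero")
    width = (hi - lo) / k  # bin width h = (maximum - minimum) / K
    counts = [0] * k
    for x in data:
        # Index of the bin containing x; the maximum is placed in the last bin.
        i = min(int((x - lo) / width), k - 1)
        counts[i] += 1
    return width, counts

# Hypothetical lengths in cm (made up for illustration, not the lab data).
lengths = [10.0, 12.5, 13.0, 15.0, 18.0, 19.5, 20.0, 22.0, 25.0, 30.0]
k = sturges_bins(len(lengths))
width, counts = histogram_counts(lengths, k)
print(k, width, counts)  # 5 4.0 [3, 1, 3, 2, 1]
```

The bin counts always add up to the number of observations, which is a quick check that no data point was left out of the histogram.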
A good rule of thumb for choosing the number of bins (and thereby determining the bin width) is Sturges' rule. This rule gives the number of bins, K, as

$$K = 1 + 3.322 \log_{10}(n),$$

where n is the number of observations. Making a histogram is a bit more art than science, so while there is a theoretical foundation for histograms, rules like Sturges' rule are meant as guides. There are many other rules for bin width. In general, it is best to round up the number of bins (and/or the bin width) to be sure to include all the data points.

Once the number of bins is chosen, we can determine the bin width, h, where

$$h = \frac{x_{(n)} - x_{(1)}}{K}.$$

The maximum data point is denoted by $x_{(n)}$, and the minimum data point is denoted by $x_{(1)}$.

A rectangle is drawn over each bin, with a base equal to the bin width; unlike in a bar chart, there is no space between the rectangles. The height of the rectangle is proportional to the number of observations falling into that bin. You can use either the bin count (also called the frequency) or the percentage of observations that fall in the bin (the relative frequency) as the height of the rectangle. In either case, the histogram has the same shape; the difference is in the scale of the Y-axis and the information it conveys. For data from a continuous random variable, a histogram using the relative frequency gives a graphical estimate of the probability distribution.

A histogram can also help detect any unusual observations (outliers) or any gaps in the data set; outliers may result in some empty bins. An outlier is an observation in a data set which is far removed in value from the others in the data set. It is an unusually large or an unusually small value compared to the others. An outlier might be the result of an error in measurement, in which case it will distort the interpretation of the data, having undue influence on many summary statistics, for example, the mean.
If an outlier is a genuine result, it is important because it might indicate an extreme of behavior of the process under study. For this reason, all outliers must be examined carefully before beginning any formal analysis. Outliers should not routinely be removed without further justification.

Q3* Using a calculator, calculate the number of bins for a histogram of the 'length' and 'weight' variables using Sturges' rule. Calculate the bin width.

Symmetry and skewness of data sets:
Symmetry is implied when data values are distributed in the same way above and below the middle of the sample. You can identify a symmetric probability distribution when the part of the distribution on one side of the midpoint looks like the mirror image of the other.

[Figure: Example of a symmetric probability distribution (f(x) against x)]

Skewness is defined as asymmetry in the distribution of the sample data values. Values on one side of the distribution tend to be further from the 'middle' than values on the other side. For skewed data, the usual measures of location will give different values; for example, median < mean would indicate positive (or right) skewness. In other words, a distribution is skewed to the right if the right tail (larger values) is much longer than the left one (smaller values). Similarly, a distribution is skewed to the left if the left tail is much longer than the right one. Positive (or right) skewness is more common than negative (or left) skewness.

In the case of skewed data, one would most likely report the median in addition to the mean as measures of center. In the case of symmetric data, the mean and median are nearly or exactly equal.
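The comparison of mean and median described above can be tried out on small made-up samples. The following Python sketch is only a rule-of-thumb check, not a formal measure of skewness. The function name skew_hint, the tolerance tol, and all three samples are assumptions introduced for this illustration.

```python
from statistics import mean, median

def skew_hint(data, tol=0.05):
    """Rough skew direction from comparing the mean with the median.

    tol is a hypothetical tolerance: a difference smaller than tol times the
    range of the data is treated as 'roughly symmetric'.
    """
    spread = max(data) - min(data)
    diff = mean(data) - median(data)
    if spread == 0 or abs(diff) <= tol * spread:
        return "roughly symmetric"
    # mean > median suggests a long right tail; mean < median a long left tail
    return "right-skewed" if diff > 0 else "left-skewed"

# Made-up samples for illustration (not the lab data).
symmetric = [1, 2, 3, 4, 5, 6, 7]
right_tail = [1, 1, 2, 2, 3, 4, 20]
left_tail = [1, 17, 18, 19, 19, 20, 20]

for name, sample in [("symmetric", symmetric),
                     ("right_tail", right_tail),
                     ("left_tail", left_tail)]:
    print(name, mean(sample), median(sample), skew_hint(sample))
```

In the second sample the single large value pulls the mean well above the median, while the median stays with the bulk of the data.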
[Figure: Example of a (right) skewed probability distribution (f(x) against x), with the right tail of the distribution labeled]

[Figure: Example of a (left) skewed probability distribution (f(x) against x), with the left tail of the distribution labeled]

Q4* Plot a histogram for weight and for length using the relative frequency (choose 'Density' in the options); use Sturges' rule to enter the number of bins in the options. (Copy and paste each graph before going on to the next because Stata copies over them.) Give titles to your histograms, e.g. "Histogram of Weight, using Sturges' rule".

Q5* Plot a histogram for weight and for length using the relative frequency (choose 'Density' in the options); use the Stata default (don't check or enter the number of bins in the options). (Copy and paste each graph before going on to the next because Stata copies over them.) How many bins does the default give?

In the future, when plotting histograms, you may use the STATA default or Sturges' rule, whichever you prefer.

Q6* Answer the following about each histogram: Describe the distributions of the data. How are they different/similar? What is the overall shape of each one? Where are the mean and the median with respect to the bulk of the data? Are there any outliers in the data? What do the outliers signify (in terms of the variable; remember there are seven different species of fish)?

Q7* In light of the discussion of symmetry and skewness, which is a better measure of center for weight? For length? Explain.

b) Box plot
Select 'Box Plot' under the Graphics menu.
You will mostly need to use the main tab:
In the options:
Type in the variable of interest (e.g. weight)
You don't need to specify any other options (keep the default: the 'Median Type' choice is 'line').
The title tab allows you to give the graph a title.
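As background for the box plot, here is a rough Python sketch of the five-number summary from Q1, together with the 1.5*(q3-q1) rule for modified box plots that is described further on in this section. Quartile conventions differ between software packages, so the values may not match STATA's output exactly. The function names, the choice of the 'inclusive' quartile method, and the data are all assumptions made for this illustration.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (minimum, q1, median, q3, maximum).

    method='inclusive' is an assumed quartile convention; other software
    (including STATA) may use a slightly different definition.
    """
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, med, q3, max(data)

def outside_values(data):
    """Values more than 1.5*(q3-q1) below q1 or above q3."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lower or x > upper]

# Made-up values for illustration (not the lab data).
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(five_number_summary(values))   # (1, 3.0, 5.0, 7.0, 9)
print(outside_values(values + [30])) # [30]
```

The box of a box plot would span q1 to q3, with a line at the median; in a modified box plot the value 30 in the second example would be drawn as an individual point.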
A box plot is a graphical representation of the five-number summary, the set of statistics that consists of the maximum, the minimum, and the quartiles of a data set. A central box spans the 1st and 3rd quartiles (denoted by q1 and q3), a line in the box marks the median, and lines extend from the box out to the smallest and largest observations.

Q8* Plot box plots for the variables weight and length. Do not include outliers: on the 'Outsides' tab, select 'do not plot outside values'. Judging from the box plots, which variable's distribution is skewed and why?

Sometimes a box plot is modified to plot suspected outliers individually. In a modified box plot, the lines extend out from the central box only to the smallest and largest observations that are not suspected outliers. Observations that lie more than 1.5*(q3-q1) below q1 or above q3 are plotted as individual points.

Q9* Plot modified box plots for the variables weight and length that include outliers: on the 'Outsides' tab, unselect 'do not plot outside values'; leave all the 'markers' at default. According to Stata, how many outliers are detected for each variable?

c) Scatter plot
Under the Graphics menu, select 'Easy Graphs' then 'Scatter Plot'.
You will mostly need to use the main tab:
In the options:
Type in the variable for each of the x and y axes.
You don't need to specify any other options.
The title tab allows you to give the graph a title.

A scatter plot shows the relationship between 2 quantitative variables measured on the same subject; one appears on each axis. Each subject is represented by a point at its pair of values. The x-axis is used for the independent variable and the y-axis for the dependent variable. In this plot, points are plotted but not joined. The resulting pattern indicates the type and strength of the relationship between the two variables.

Patterns that may arise:
The more the points tend to cluster around a straight line, the stronger the linear relationship between the two variables.
A scatter plot will also reveal a nonlinear relationship between the two variables and whether or not there are any outliers in the data.

[Figure: Nonlinear association (y against x)]

If the line around which the points tend to cluster runs from lower left to upper right, the relationship between the two variables is a positive association (direct).

[Figure: Positive association (y against x)]

If the line around which the points tend to cluster runs from upper left to lower right, the relationship between the two variables is a negative association (inverse).

[Figure: Negative association (y against x)]

If the points form a random scatter, there is no apparent relationship between the two variables.

[Figure: No association (y against x)]

Q10* Plot a scatter plot of the length and weight variables; use length as the independent variable (on the x-axis). Describe the plot. What is the overall pattern? Are there any deviations from the pattern? If so, how might the deviations be explained? What kind of relationship is there between the two variables (i.e. a positive or a negative association)? Does this relationship make sense? Are there any outliers?

* Some of the text of this lab is from http://www.stats.gla.ac.uk/steps/glossary/index.html