Workshop 2: Basic statistics
Instructor: Dr Patrick Johanns
Office: 239
Phone: 9349-8162
Fax: 9349-8133
E-mail: p.johanns@mbs.edu

Today
Types of data
Displaying data
Summarising data

Intent of workshop
Give an overview of basic measures of
– Central tendency
– Variability or dispersion
Explore and review different ways to display data via graphs and tables

Dangers with Quantifying
False sense of “accuracy”.
A quantified factor may not be comparable with other factors.
Estimates/data used to quantify a factor may be wrong or insufficient.
Given these dangers, why quantify anything?

Types of Data
Discrete vs. Continuous
Categorical vs. Quantitative
Cross-Sectional vs. Time-Ordered
Univariate vs. Bivariate and Multivariate

Discrete and Continuous Variables
Discrete Variable: Must take one of a fixed set of values. Values are counted. (Usually whole numbers.)
– Number of children in a family: can’t have 3.4 children
Continuous Variable: Can take any fractional value at all. Values are measured.
– Height of a tree. You will almost never find a tree that is exactly 12 feet high.

Categorical (Qualitative) and Quantitative Variables
Element     CATEGORICAL   QUANTITATIVE
Person      Religion      Height
State       Flower        Area
House       Style         Size
Firm        Industry      Sales
Worker      Gender        Age
Computer    Brand         Size of Memory

Cross-Sectional and Time-Ordered Data
Cross-Sectional Data: data for one time period taken on different persons or places
– Absenteeism Data for Four Departments in January
– Yearly Sales for Three Divisions
Time-Ordered Data: data from multiple time periods
– Monthly Absenteeism Data over a One-Year Period
– Daily IBM Stock Prices over a One-Month Period

Univariate and Bivariate Data
ELEMENT    Univariate      Bivariate
Worker     Gender          Gender, Promotion Status (yes, no)
Product    Market Share    Market Share, Price
Group      Productivity    Productivity, Job Switching (yes, no)
Observation: a single object or time period. An observation can carry more than one piece of information.
Spot Check on types of data
Average weights and lengths of newborn babies on March 3 at Royal Women’s Hospital
Mean marks by departments for academic year 1999 at University of Melbourne
The Age Poll on gambling views
Hourly defect numbers for a product on a production line

Critical Questions that Business Professionals Ask
On average, how are we doing?
Is there much variation?
Are there outliers?

Typical Values
What do we mean when we use the word “average”?
If a country had 3 people who earned a million dollars each and 100 people who earned $20,000, what would the average person in that country earn?
“Average” or “typical” depends on context: there’s no single definition

Suburban Melbourne
In a suburb of Melbourne, five houses were sold last month for $1,208,000, $759,000, $795,000, $990,000 and $579,000.
What kind of community do you think it is?
What would the average house in this community be like?
Is this community average?
What kind of statistics would you use to describe the data?

The Sample
Draw 5 observations from a population. They are: 8, 14, 3, 7, 23
Want to describe this sample: its typical values and its spread.

Sample Mean
\[ \bar{x} = \frac{\sum x_i}{n} \]
where \( x_i \) are the data values and \( n \) is the sample size
\[ \bar{x} = \frac{8 + 14 + 3 + 7 + 23}{5} = 11 \]

Sample Median
Take all observations
Order them
Count observations until you are exactly half way: this is the median value
(If there are an even number of observations then the median is half-way between the two middle values.)
Median = 8

Sample Mode
What is the mode?
What are its advantages and disadvantages?

Trimmed Mean
Rank order data.
Eliminate the lowest x% and the highest x% of data.
Compute sample mean for remaining data.
Reduces effect of outliers.

Measures of Spread
Measures of spread help answer the questions:
How closely grouped are the data?
Is there a lot of variation?
How far apart are the lowest and highest observations?

The Range
Rank order the data.
Subtract smallest from the largest data value.
\[ \text{Range} = 23 - 3 = 20 \]
Uses only two values to calculate the range.
Doesn’t measure spread around the mean.

Quartiles and Inter-Quartile Range
Quartiles (Upper and Lower) are like the Median except that they are the quarter and three-quarter points in the data
Inter-quartile range is the distance between the quartiles

Sample Standard Deviation
\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]
where \( \bar{x} \) is the sample mean and \( n \) is the sample size
\[ s = \sqrt{\frac{(8-11)^2 + (14-11)^2 + (3-11)^2 + (7-11)^2 + (23-11)^2}{5 - 1}} = 7.78 \]

Variance and Coefficient of Variation
Variance is just the square of the standard deviation:
\[ V = s^2 \]
Coefficient of Variation is standard deviation divided by the mean:
\[ CV = \frac{s}{\bar{x}} \times 100\% \]
CV is useful because it is unit-less (doesn’t depend on the magnitude of the data).

Quick Problems
What is the mean and standard deviation of the following set of numbers: {1, 1, 1}?
Mean = 1
Standard deviation = 0

Quick Problems
The hourly wages of three employees are $4.00, $4.50, and $5.00. What would happen to the mean and standard deviation of these wages if the following occurred: each got a raise of $0.50 per hour?
Mean goes up $0.50
Standard deviation stays the same

Quick Problems
The hourly wages of three employees are $4.00, $4.50, and $5.00. What would happen to the mean and standard deviation of these wages if the following occurred: the hourly wage of each doubled?
Mean doubles
Standard deviation doubles

Quick Problems
[Figure: two histograms, School A and School B; x-axis: Pocket Money ($), marked at 50, 100, 150 and 200]
Which school has the greater mean? the greater standard deviation?

Excel - Descriptive Statistics
Go to “Tools” menu
Go to “Data Analysis…”
Choose “Descriptive Statistics”
Enter the cell range of the data in the “Input Range” box
Tick “Summary statistics” box

Describing Data
                   CROSS-SECTIONAL DATA    TIME-ORDERED DATA
Visualise Data     STEM and LEAF           LINE GRAPH
                   FREQUENCY HIST.         MOVING AVERAGE
                   OGIVE                   RESIDUAL GRAPH
                   BOX PLOT
Summarise Data     MEAN and STD. DEV.      MOVING AVERAGE VALUES
                   MEDIAN and IQR
                   TRIMMED MEAN
                   RANGE
Detect Outliers    EMPIRICAL RULE          RESIDUALS AND THE EMPIRICAL RULE
                   TUKEY’S RULE
                   CHEBYSHEV’S RULE
Compare Data       SCATTER PLOT
                   CROSS-TABS TABLE
                   PERCENTAGE TABLE

Example: Golf Scores
82 77 88 87 78 84 85
82 72 91 82 85 82 79
83 75 81 87 86 82 82
80 84 89 83 78 84 87

Stem-and-Leaf Display
Stems (tens digit)   Leaves
7                    2
7                    57889            (this stem displays data values between 75 and 79)
8                    0122222233444
8                    55677789         (this stem displays data values between 85 and 89)
9                    1

Frequency Table
Golf Scores   Frequency
72 to 75       2
76 to 79       4
80 to 83      10
84 to 87       9
88 to 91       3
Total         28

Histogram for Golf Data
[Figure: frequency histogram of the golf scores with bars of height 2, 4, 10, 9 and 3 over the classes 72-75, 76-79, 80-83, 84-87 and 88-91; annotation: mean golf score is between 80 and 83]
Mean of Frequency Histogram is its Balance Point.

Relative Frequency Table
Golf Scores   Frequency   Relative Frequency
72 to 75       2          2/28 = 7.1%
76 to 79       4          14.3%
80 to 83      10          35.7%
84 to 87       9          32.1%
88 to 91       3          10.7%
Total         28          100.0%

Cumulative Frequency Table for Golf Data
Golf Scores   Frequency   Golf Scores   Cumulative Frequency
72 to 75       2          < 76           2
76 to 79       4          < 80           6
80 to 83      10          < 84          16
84 to 87       9          < 88          25
88 to 91       3          < 92          28

Cumulative Relative Frequency Table for Golf Data
Golf Scores   Cumulative Frequency   Cumul. Relative Frequency
< 76           2                     7.1%
< 80           6                     21.4%
< 84          16                     57.1%
< 88          25                     89.3%
< 92          28                     100%

Ogive
A cumulative frequency histogram is called an ogive.
The procedure for constructing one is exactly the same as for the frequency histogram seen previously, using the cumulative frequencies.

Box and Whisker Plots
[Figure: box and whisker plot on a scale from 10 to 70, labelling the lower fence, 25th percentile, median, 75th percentile and upper fence]
In this QA class the lower and upper fences will correspond to the lowest and highest values.

Line Graph
For time-ordered data, construct a line graph by using time as the x-axis, plotting values on the y-axis and joining them up with lines.
Use to detect trends, cycles, seasonality and unusual observations.

Scatter Plots
[Figure: scatter plot of Golf Score (y-axis, 70 to 95) against Age (x-axis, 0 to 80)]
Used with bivariate (or multivariate) data
Each point is an observation.
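The golf-score frequency tables and summary measures can be reproduced in a few lines of code. The following is a minimal sketch in Python, using only the standard library rather than the Excel tool named in the workshop; the helper name `frequency_table` and the class width of 4 are assumptions chosen to match the classes 72 to 75, 76 to 79, and so on.

```python
from statistics import mean, median, stdev

# Golf scores copied from the "Example: Golf Scores" slide.
scores = [82, 77, 88, 87, 78, 84, 85, 82, 72, 91, 82, 85, 82, 79,
          83, 75, 81, 87, 86, 82, 82, 80, 84, 89, 83, 78, 84, 87]


def frequency_table(data, start, width):
    """Count whole-number observations in consecutive classes.

    Hypothetical helper: each class covers `width` scores, beginning at `start`.
    Returns a list of (class_low, class_high, frequency) tuples.
    """
    n_classes = (max(data) - start) // width + 1
    counts = [0] * n_classes
    for x in data:
        counts[(x - start) // width] += 1
    return [(start + i * width, start + (i + 1) * width - 1, c)
            for i, c in enumerate(counts)]


table = frequency_table(scores, start=72, width=4)

# Frequency, relative frequency, cumulative and cumulative relative frequency.
n = len(scores)
cumulative = 0
for low, high, count in table:
    cumulative += count
    print(f"{low} to {high}: {count:2d}  {count / n:6.1%}  "
          f"cumulative {cumulative:2d}  {cumulative / n:6.1%}")

# Summary measures of centre and spread.
x_bar = mean(scores)
s = stdev(scores)  # sample standard deviation (divides by n - 1)
print(f"mean = {x_bar:.1f}, median = {median(scores)}, s = {s:.2f}")
print(f"range = {max(scores) - min(scores)}, CV = {s / x_bar:.1%}")
```

`stdev` uses the n - 1 divisor, which matches the sample standard deviation formula in this workshop; `pstdev` would give the population version instead.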
What about categorical variables?
Are there any ways to represent these?

One-Way Table
Displays the impact of one categorical explanatory variable on a quantitative dependent variable.
An explanatory variable affects the variation in another variable. It is also called an independent variable.
A dependent variable is a variable whose variation depends upon another variable.

Bivariate Data for Productivity and Job Switching
Group   Productivity (%)   Switch
1       106                Yes
2       95                 No
3       103                Yes
…       …                  ...
34      98                 Yes
35      97                 Yes
36      94                 No

One Way Table of Impact of Job Switching on Productivity
                    No Job Switch   Job Switch
Number of groups    14              22
Mean                95.5%           103.5%
Std. Dev.           2.8%            5.8%
Median              95.0%           102.5%
IQR                 2.0%            6.0%
Job switch groups have higher mean and median productivity.
Job switch groups are more variable producers.
Try Job Switching Throughout Firm to Increase Productivity.

Multiple Box and Whisker Plots
For multivariate or bivariate data where one of the variables is categorical.
Get a visual comparison between different categories by drawing a box plot of the values of the non-categorical variable.
Use one box plot for each category and use the same scale.

Constructing a Cross-Tabs Table for Productivity and Job Switching Data
1. Select Sample.
2. Cross-Classify Each Element by Two Categorical Variables.
3. Subdivide Each Categorical Variable into Mutually Exclusive and Exhaustive Categories.
4. Arrange Sample Data in Cross-Tabs Table.

                 NO JOB SWITCH   JOB SWITCH   Row Totals
LOW PROD         13              4            17
HIGH PROD        1               18           19
Column Totals    14              22           36 (Entire Population)

Percentage Tables
A joint percentage table shows the percentage of the entire population found in each box.

                 NO JOB SWITCH   JOB SWITCH   Row Totals
LOW PROD         36.1%           11.1%        47.2%
HIGH PROD        2.8%            50.0%        52.8%
Column Totals    38.9%           61.1%        100% (Entire Population)

The Shape of Histograms
[Figure: two histograms, one symmetric and one skewed to the right, each with its approximate mean marked]
Symmetric Histogram: Classes to the Left and Right of the Mean are Mirror Images.
Skewed Histogram: the Distribution Falls Off More Slowly on One Side of the Mean.

Question
If our data set was highly skewed, would we prefer to use the mean and standard deviation, or the median and the quartiles?
Median and the quartiles

Summarising Skewed Cross-Sectional Data
Median
Trimmed Mean
Upper Quartile and Lower Quartile and the Interquartile Range

Outliers
If the histogram for a data set is:
Approximately bell-shaped – a data value more than three standard deviations away from the mean should be considered an outlier. (Empirical Rule)
Skewed – use Chebyshev’s rule. (Not discussed in this workshop or required for MBA degree)

The Empirical Rule Applied to Golf Data
NUMBER OF STD. DEV. FROM MEAN   EMPIRICAL RULE
±1                              ABOUT 68% OF DATA
±2                              ABOUT 95% OF DATA
±3                              ALMOST ALL DATA
\( \bar{x} = 82.7 \) and \( s = 4.29 \)
The Mean ±1 Standard Deviation: Score 78.41 to 86.99
The Mean ±2 Standard Deviations: Score 74.12 to 91.28
The Mean ±3 Standard Deviations: Score 69.83 to 95.57

Problem Diagnosis
Would like to determine the root causes of an outlier.
What is unique about the outlier group or time period?
What has changed that might account for the outliers?
Is it a mistake?

What did we do?
Explored the use of statistics, graphs and tables to summarise and describe data.
Learned how to identify outliers.
Used tables to explore relationships between variables.
Talked about how graphs and charts can lie.
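As a follow-up to the Empirical Rule applied to the golf data, the sketch below counts how many of the 28 scores actually fall within 1, 2 and 3 sample standard deviations of the mean and flags anything beyond 3 as an outlier. It is a minimal Python illustration, not part of the workshop's Excel workflow; the helper name `within_k_sd` is an assumption.

```python
from statistics import mean, stdev

# Golf scores copied from the "Example: Golf Scores" slide.
scores = [82, 77, 88, 87, 78, 84, 85, 82, 72, 91, 82, 85, 82, 79,
          83, 75, 81, 87, 86, 82, 82, 80, 84, 89, 83, 78, 84, 87]


def within_k_sd(data, k):
    """Return (low, high, count) for values within k sample SDs of the mean.

    Hypothetical helper for illustration only.
    """
    x_bar, s = mean(data), stdev(data)
    low, high = x_bar - k * s, x_bar + k * s
    count = sum(low <= x <= high for x in data)
    return low, high, count


# Compare the observed share with what the Empirical Rule suggests.
for k, rule in [(1, "about 68%"), (2, "about 95%"), (3, "almost all")]:
    low, high, count = within_k_sd(scores, k)
    print(f"mean +/- {k} SD: {low:.2f} to {high:.2f}  "
          f"{count}/{len(scores)} = {count / len(scores):.1%}  (rule: {rule})")

# A value more than three standard deviations from the mean is treated as an outlier.
x_bar, s = mean(scores), stdev(scores)
outliers = [x for x in scores if abs(x - x_bar) > 3 * s]
print("outliers:", outliers)
```

The printed interval limits can differ from the slide in the second decimal place because the code uses the unrounded mean and standard deviation, whereas the slide works from the rounded values 82.7 and 4.29.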