Posted on: 9/10/2012
Data Analysis: Review and Practical Application using SPSS

Data of Interest
National Insurance Company
– 1000 questionnaires sent
– 285 respondents
Questionnaire Presentation
– Copy given in class

Coding
Coding broadly refers to the set of all tasks associated with transforming edited responses into a form that is ready for analysis
Steps
– Transforming responses to each question into a set of meaningful categories
– Assigning numerical codes to the categories
– Creating a data set suitable for computer analysis

Transforming Responses into Meaningful Categories
– A structured question is pre-categorized
– Responses to a nonstructured or open-ended question need to be grouped into a meaningful and manageable set of categories
Q1: In this questionnaire, how many questions are not pre-categorized?

Missing-Value Category
A missing value can stem from
– A respondent's refusal to answer a question
– An interviewer's failure to ask a question or record an answer
– A "don't know" that does not seem legitimate
Best way to treat missing value responses
– Sound questionnaire design
– Tight control over fieldwork

Assigning Numerical Codes
– Assign appropriate numerical codes to responses that are not already in quantified form
– When assigning numerical codes, the researcher should choose codes that facilitate computer manipulation and analysis of the responses

Multiple Response Question – Rank Order Question
Please rank the following insurance companies by placing a 1 beside the company you think is best overall, a 2 beside the company you think is second best, and so on.
__________ Progressive
__________ All State
__________ National
Q2: How would you code the previous question if it were added to the questionnaire?
This question requires as many variables (and columns) as there are objects to be ranked: 3 separate variables are needed

Creating a Data Set
– A data set is an organized collection of data records
– Each sample unit within the data set is called a case or observation
Structure of a Data Set
– The number of observations is n
– If the total number of variables embedded in the questionnaire is m, then the data set is an n x m matrix of numbers
Importance of Coding Sheet: Anybody can enter or check the data set. (Copy of coding sheet)

SPSS Data Set
– 2 views: Variable and Data
– Raw variables (labels and values)
– Transformed variables (compute and recode)

Preliminary Data Analysis: Basic Descriptive Statistics
– Preliminary data analysis examines the central tendency and the dispersion of the data on each variable in the data set
– Measurement level dictates what to do
– Getting a feel for the data
– What can we do? (Limitations on next slide.)
– Run descriptives. (outputs 1)

Measures of Central Tendency and Dispersion for Different Types of Variables

Why Averages May be Misleading
Researchers tested a new sauce product and found
– Mean rating of the taste test was close to the middle of the scale, which had "very mild" and "very hot" as its bipolar adjectives
Researcher's conclusion
– Consumers want neither a really hot nor a really mild sauce

Why Averages May be Misleading (Cont'd)
Deeper examination revealed
– The existence of a large proportion of consumers who wanted the sauce to be mild and an equally large proportion who wanted it to be hot
Moral of the story:
– A clear understanding of the distribution of responses can help a researcher avoid erroneous inferences.
– Talk about skewness and kurtosis.

Crosstabs
– Occurrences under specific conditions
– Most of the time with categorical variables
– Examples to run

Cross-Tabulations: Comparing Frequencies
Chi-square Contingency Test
– Technique used for determining whether there is a statistically significant relationship between two categorical (nominal or ordinal) variables

Cross-Tabulation Using SPSS for National Insurance Company
– One crucial issue in the customer survey of National Insurance Company was how a customer's education was associated with whether or not she or he would recommend National to a friend.

Need to Conduct Chi-square Test to Reach a Conclusion
The hypotheses are:
– H0: There is no association between educational level and willingness to recommend National to a friend (the two variables are independent of each other).
– Ha: There is some association between educational level and willingness to recommend National to a friend (the two variables are not independent of each other).
– Let's do it...

Conducting the Test
– The test involves comparing the actual, or observed, cell frequencies (O_ij) in the cross-tabulation with a corresponding set of expected cell frequencies (E_ij)

Expected Values
\[ E_{ij} = \frac{n_i n_j}{n} \]
where n_i and n_j are the marginal frequencies, that is, the total number of sample units in category i of the row variable and category j of the column variable, respectively, and n is the total sample size

Chi-square Test Statistic
\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
where r and c are the number of rows and columns, respectively, in the contingency table. The number of degrees of freedom associated with this chi-square statistic is given by the product (r - 1)(c - 1).

National Insurance Company Study
[SPSS output not reproduced; callouts: computed chi-square value, p-value]

National Insurance Company Study: P-Value Significance
The actual significance level (p-value) = 0.019: the chances of getting a chi-square value as high as 10.007 when there is no relationship between education and recommendation are about 19 in 1000.
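The expected-value and chi-square formulas above can be sketched in a few lines of Python. This is only an illustration: the counts in `observed` are hypothetical (they are not the National Insurance Company data), and the function names are made up for this example. The p-value is not computed here; in practice it is read from the SPSS output or a chi-square table.

```python
def expected_frequencies(observed):
    """E_ij = (row total i * column total j) / n for every cell."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(observed):
    """Return the chi-square statistic and its degrees of freedom."""
    expected = expected_frequencies(observed)
    stat = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df


# Hypothetical counts: rows = education level, columns = (would recommend, would not)
observed = [
    [30, 20],
    [40, 10],
    [20, 30],
]

stat, df = chi_square(observed)
print(f"chi-square = {stat:.3f}, df = {df}")
```

Every expected frequency depends only on the row and column totals, which is why the test compares what was observed with what independence alone would predict.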
The apparent relationship between education and recommendation revealed by the sample data is unlikely to have occurred because of chance. We can reject the null hypothesis at the .05 level of significance.

Precautions in Interpreting Cross Tabulation Results
– Two-way tables cannot show conclusive evidence of a causal relationship
– Watch out for small cell sizes
– Two-way tables increase the risk of drawing erroneous inferences when more than two variables are involved

Overview of Techniques for Examining Associations
Spearman Correlation Coefficient Technique
The technique is appropriate when
– The degree of association between two sets of ranks (pertaining to two variables) is to be examined
Illustrative research question(s) this technique can answer:
– Is there a significant relationship between motivation levels of salespeople and the quality of their performance? Assume that the data on motivation and quality of performance are in the form of ranks, say, 1 through 20, for 20 salespeople who were evaluated subjectively by their supervisor on each variable

Overview of Techniques for Examining Associations (Cont'd)
Pearson Correlation Coefficient Technique
This technique is appropriate when
– The degree of association between two metric-scaled (interval or ratio) variables is to be examined
Illustrative research question(s) this technique can answer:
– Is there a significant relationship between customers' age (measured in actual years) and their perceptions of our company's image (measured on a scale of 1 to 7)?

Spearman Correlation Coefficient
A Spearman correlation coefficient is a measure of association between two sets of ranks
\[ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \]
where
– d_i = the difference between the ith sample unit's ranks on the two variables
– n = the total sample size

Pearson Correlation Coefficient
The Pearson correlation coefficient is the degree of association between variables that are interval- or ratio-scaled.
The Pearson correlation coefficient (r_xy) between two such variables X and Y is given by
\[ r_{xy} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{(n - 1)\, s_x s_y} \]
where
– n = sample size (total number of data points)
– \bar{X} and \bar{Y} = means of X and Y
– X_i and Y_i = values for any sample unit i
– s_x and s_y = standard deviations of X and Y

National Insurance Company – Computing Pearson Correlation Among Service Quality Constructs
– National Insurance Company was interested in the correlations between respondents' overall service-quality perceptions (on the 10-point scale) and their average ratings along each of the five dimensions of service quality

National Insurance Company – Computing Pearson Correlation Among Service Quality Constructs Using SPSS

Interpreting Pearson Correlation Coefficients
– Each of the five service-quality measures (reliability, empathy, tangibles, responsiveness, and assurance) is significantly related to overall quality (OQ) at the .001 level of significance
– Responsiveness has the strongest correlation (.8625)
– Tangibles has the weakest correlation (.5038)
– All the correlations are strong enough to be meaningful

Comparing Means
– Mainly t-tests and ANOVAs
– T-test on OQ and gender

Independent T-tests
– Independent variable with 2 categories max.
– Equality of variance (cf. output)
– p = .88: if there were no real difference between the two groups, a difference at least as large as the observed .04 would arise by chance about 88% of the time
– Cannot reject the null hypothesis
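The "equality of variance" note matters because SPSS reports the independent-samples t statistic in two forms: one that assumes equal variances (pooled) and one that does not. The sketch below computes both with the standard textbook formulas. The ratings are hypothetical, not the survey's OQ data, and the function names are illustrative. Turning t and df into a p-value needs the t distribution, which is left to SPSS or a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """t statistic and df when the two group variances are assumed equal."""
    n1, n2 = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def welch_t(a, b):
    """t statistic and approximate df when equal variances are not assumed."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a) / n1, variance(b) / n2
    t = (mean(a) - mean(b)) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical overall-quality ratings for two groups of respondents
group_a = [7, 8, 6, 9, 7, 8]
group_b = [8, 7, 7, 8, 6, 8]

print("pooled:", pooled_t(group_a, group_b))
print("welch: ", welch_t(group_a, group_b))
```

With equal group sizes the two t statistics coincide and only the degrees of freedom differ, so the choice between the two rows of output matters most when group sizes and variances are both unequal.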
Analysis of Variance
– ANOVA is appropriate in situations where the independent variable is set at certain specific levels (called treatments in an ANOVA context) and metric measurements of the dependent variable are obtained at each of those levels

Example
– 24 stores chosen randomly for the study
– 8 stores randomly chosen for each treatment
   Treatment 1: store brand sold at the regular price
   Treatment 2: store brand sold at 50¢ off the regular price
   Treatment 3: store brand sold at 75¢ off the regular price
– Monitor sales of the store brand for a week in each store

Table 15.2 Unit Sales Data Under Three Pricing Treatments

Treatment                   Regular price   50¢ off   75¢ off
Unit sales in each store    37              46        46
                            38              43        49
                            40              43        48
                            40              45        48
                            38              45        47
                            38              43        48
                            40              44        49
                            39              44        49
Number of stores            8               8         8
Mean sales                  38.75           44.13     48.00

ANOVA – Grocery Store Hypothesis
Grocery store example
– H0: \( \mu_1 = \mu_2 = \mu_3 \)
– Ha: At least one \( \mu \) is different from one or more of the others
Hypotheses for k treatment groups or samples
– H0: \( \mu_1 = \mu_2 = \dots = \mu_k \)
– Ha: At least one \( \mu \) is different from one or more of the others

Exhibit 15.1 SPSS Computer Output for ANOVA Analysis

Between-Subjects Factors
                     Value Label      N
Treatment group  1   Regular price    8
                 2   50 cents off     8
                 3   75 cents off     8

Exhibit 15.1 SPSS Computer Output for ANOVA Analysis (Cont'd)

Tests of Between-Subjects Effects
Dependent Variable: SALES

Source            Type III Sum of Squares   df   Mean Square   F           Sig.
Corrected Model   345.250 (a)               2    172.625       137.445     .000
Intercept         45675.375                 1    45675.375     36367.123   .000
TREAT             345.250                   2    172.625       137.445     .000
Error             26.375                    21   1.256
Total             46047.000                 24
Corrected Total   371.625                   23
a. R Squared = .929 (Adjusted R Squared = .922)

– There is less than a .001 probability of obtaining an F-value as high as 137.445 if the treatment means are equal

ANOVA
– OQ and recommendation; OQ and individual variable
– OQ and EDUC (graph), and post hoc

Overview of Techniques for Examining Associations (Cont'd)
Simple Regression Analysis Technique
This technique is appropriate when
– A mathematical function or equation linking two metric-scaled (interval or ratio) variables is to be constructed, under the assumption that values of one of the two variables are dependent on the values of the other

Overview of Techniques for Examining Associations – Simple Regression Analysis (Cont'd)
Illustrative research question(s) this technique can answer:
– Are sales (measured in dollars) significantly affected by advertising expenditures (measured in dollars)?
– What proportion of the variation in sales is accounted for by variation in advertising expenditures?
– How sensitive are sales to changes in advertising expenditures?

Overview of Techniques for Examining Associations (Cont'd)
Multiple Regression Analysis Technique
This technique is appropriate when
– The conditions are the same as for simple regression analysis, except that more than two variables are involved, with one variable assumed to be dependent on the others

Overview of Techniques for Examining Associations (Cont'd)
Illustrative research question(s) this technique can answer:
– Are sales significantly affected by advertising expenditures and price (where all three variables are measured in dollars)?
– What proportion of the variation in sales is accounted for by advertising and price?
– How sensitive are sales to changes in advertising and price?

Simple Regression Analysis
– Generates a mathematical relationship (called the regression equation) between one variable designated as the dependent variable (Y) and another designated as the independent variable (X)

Independent Variable vs.
Dependent Variable
Independent variable
– Explanatory or predictor variable
– Often presumed to be a cause of the other
Dependent variable
– Criterion variable
– Influenced by the independent variable

Practical Applications of Regression Equations
– The regression coefficient, or slope, can indicate how sensitive the dependent variable is to changes in the independent variable
– The regression equation is a forecasting tool for predicting the value of the dependent variable for a given value of the independent variable

Precautions in Using Regression Analysis
– Only capable of capturing linear associations between dependent and independent variables
– A significant R²-value does not necessarily imply a cause-and-effect association between the independent and dependent variables
– A regression equation may not yield a trustworthy prediction of the dependent variable when the value of the independent variable at which the prediction is desired is outside the range of values used in constructing the equation

Precautions in Using Regression Analysis (Cont'd)
– A regression equation based on relatively few data points cannot be trusted
– The ranges of data on the dependent and independent variables can affect the meaningfulness of a regression equation

Multiple Regression Analysis
\[ Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \dots + b_k X_{ki} \]
where
– Y_i is the predicted value of the dependent variable for some unit i
– X_{1i}, X_{2i}, ..., X_{ki} are values on the independent variables for unit i
– b_1, b_2, ..., b_k are the regression coefficients
– a is the Y-intercept, representing the prediction for Y when all independent variables are set to zero

National Insurance Company – Multiple Regression Using SPSS
– Jill and Tom were interested in conducting a multiple regression analysis wherein overall service-quality perception is the dependent variable and the average ratings along the five dimensions are the independent variables

Factor Analysis
– A data and variable reduction technique that attempts to partition a given set of variables into groups of maximally correlated variables

Factor Analysis Output and Its Interpretation
– Primary output of factor analysis is a factor-loading matrix

Table 15.4 Factor-Loading Matrix Based on Data from Study of Star Customers

                                                                       Factor loadings   Achieved
Variable                                                               F1      F2        communalities
X4: My friends are very impressed with the Star VCR                    0.96    0.06      .926
X6: No other brand of VCR even comes close to matching the Star        0.92    0.17      .875
X1: I did not mind paying the high price for my Star VCR               0.89    0.15      .815
X3: I hardly ever worry about anything going wrong with my Star VCR    0.18    0.94      .916
X5: The Star VCR has the latest technology built into it               0.09    0.88      .782
X2: I am pleased with the variety of things that a Star VCR can do     0.16    0.86      .766
Eigenvalues: standardized variance explained by each factor            2.626   2.454
Proportion of the total variance explained by each factor              0.438   0.409

– 3 variables load high on factor 1 (X4, X6, X1)
– 3 variables load high on factor 2 (X3, X5, X2)

Reducing Star Data
– X1, X4, and X6 can be combined into one factor
– X2, X3, and X5 can be combined into a second factor
– 6 variables can be reduced to two factors

Potential Applications of Factor Analysis
Used to
– Develop concise but comprehensive, multiple-item scales for measuring various marketing constructs
– Illuminate the nature of distinct dimensions underlying an existing data set
– Convert a large volume of data into a set of factor scores on a limited number of uncorrelated factors

Cluster Analysis
– Segments objects into groups so that members within each group are similar to one another in a variety of ways
– Useful for segmenting customers, market areas, and products

Use of Cluster Analysis
– A firm offering recreational services wanted to enter a new region of the country
– They gathered data on more than 100 characteristics, including
   Demographics
   Expenditures on recreation
   Leisure time activities
   Interests of household members
– The firm identified one or several household segments that are likely to be most responsive to its advertising and to its services

How Does Cluster Analysis Work?
– Cluster analysis measures the similarity between objects on the basis of their values on the various characteristics

Exhibit 15.8 Clusters Formed by Using Data on Two Characteristics
[Scatter plot not reproduced; both axes run from Low to High; one axis is extent of participation in outdoor sporting events, the other axis label was not recovered]
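The idea of grouping objects by similarity on two characteristics can be sketched with k-means, one common clustering method. The slides do not say which algorithm was used, so treat this choice as an assumption. The household scores are hypothetical. The first coordinate stands in for participation in outdoor sporting events, and the second for recreation spending, which is a guess at the exhibit's unrecovered axis. Similarity is measured as Euclidean distance.

```python
from math import dist


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Similarity is measured as Euclidean distance on the characteristics
            nearest = min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster's members (kept unchanged if the cluster is empty)
        centroids = [
            tuple(sum(vals) / len(members) for vals in zip(*members)) if members else centroids[k]
            for k, members in enumerate(clusters)
        ]
    return clusters, centroids


# Hypothetical households: (outdoor sports participation, recreation spending), both on a 1-10 scale
households = [(1, 2), (2, 1), (1, 1), (8, 9), (9, 8), (9, 9)]

# Starting centroids are picked by hand here; real software uses more careful initialisation
clusters, centroids = k_means(households, [households[0], households[3]])
for members, centre in zip(clusters, centroids):
    print(centre, members)
```

With more than two characteristics the same distance calculation applies unchanged, although variables measured on different scales are usually standardized first so that no single characteristic dominates.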