# Four Types of Statistical Analysis: ANOVA Testing, Linear Regression, Correlation Analysis, and Pricing Indexes


```
Final Assignment

By

Benjamin W. Kratz

Professor Bruce Busbee

BUSN 5760 Applied Statistics

Webster University at Fort Jackson

October 13, 2008

Abstract

When running a business, it is imperative to perform four types of statistical analysis: ANOVA testing, linear
regression, correlation analysis, and price indexes. By looking at each of these analyses, business owners and
employees can determine how prices and wages will change and which variables have the greatest influence on the
change. With that information, they can build models that predict future outcomes at confidence levels of 90 to
99 percent.

Introduction:

Businesses use statistical data to answer the question “so what?” Their goal is to be able to predict how the
economy will change, which variables will have the greatest influence, and to what extent those variables affect
the end cost. ANOVA testing, linear regression, correlation analysis, and price indexes are four key areas of
statistical analysis they can use to determine how the economy will change.

ANOVA Testing:

Businesses use ANOVA testing to see whether the means of several populations are the same (the null hypothesis)
or whether they differ between populations (the research hypothesis) by looking at the variances. An ANOVA tells
you whether there is a statistically significant difference between group means (averages) based on group
variances and sample sizes. When conducting the ANOVA, they find the total variation by summing the squared
differences between each observation and the overall mean. The total variation is then broken down into two
components. The first component is the treatment variation (TV), computed by summing the squared differences
between each treatment mean and the overall mean, with each treatment weighted by its number of observations.
The second component is the random variation (RV), computed by summing the squared differences between each
observation and its treatment mean. The RV is also called the error component. The ANOVA test procedure produces
an F-statistic, which is used to calculate the p-value. The F-statistic is computed with the following equation,
where k is the number of treatments and n is the total number of observations:

$F = \frac{TV / \text{treatment df}}{RV / \text{error df}} = \frac{TV / (k - 1)}{RV / (n - k)}$

If the null hypothesis is correct, we expect F to be about one, whereas "large" F indicates a location effect.

How big should F be before we reject the null hypothesis? In statistical hypothesis testing, we use a p-value

(probability value) to decide whether we have enough evidence to reject the null hypothesis and say our research

hypothesis is supported by the data. To find the critical value at the 1 percent level of significance, they can
use the table in Appendix B.4 of the textbook “Statistical Techniques in Business and Economics” (Lind, Marchal,
and Wathen, 2008).

Chapter 12 of the same textbook provides a great example of the ANOVA by surveying passengers from four

different airlines. The intent is to find if there is a difference in the mean satisfaction level among the four airlines.

The survey included questions on ticketing, boarding, in-flight service, baggage handling, pilot

communication, and so forth. Twenty-five questions offered a range of possible answers: excellent,

good, fair, or poor: A response of excellent was given a score of 4, good a 3, fair a 2, and poor a 1.

These responses were then totaled, so the total score was an indication of the satisfaction with the

flight. The greater the score, the higher the level of satisfaction with the service. The highest possible

score was 100. (Lind, Marchal, and Wathen, 2008).

Table 1: Results from Surveys (Lind, Marchal, and Wathen, 2008)

Eastern    TWA    Allegheny    Ozark
   94       75        70         68
   90       68        73         70
   85       77        76         72
   80       83        78         65
            88        80         74
                      68         65
                      65

The null hypothesis and the alternate hypothesis are as follows:

H0: µ1 = µ2 = µ3 = µ4                     H1: The mean scores are not all equal.

Failing to reject the null hypothesis means that there is no evidence of a difference in the mean scores of the
four airlines. Rejecting the null hypothesis means that there is a difference in at least one pair of mean
scores. However, the initial computation does not show which groups differ or how many differ. The F distribution
is used to test the data at a significance level of .01. To formulate the decision rule, we look up the critical
value (cv) in Appendix B.4 (Lind, Marchal, and Wathen, 2008). Finding the cv requires the degrees of freedom (df)
for the numerator, which is the number of treatments (k) minus 1, and for the denominator, which is the total
number of observations (n) minus the number of treatments. Therefore, df for the numerator = k – 1 = 4 – 1 = 3
and df for the denominator = n – k = 22 – 4 = 18. With 3 and 18 degrees of freedom the cv is 5.09, so if the
computed value of F exceeds 5.09 they will reject H0. The final step is to select the sample, perform the
calculations, and make a decision, using the layout shown in Table 2.

Table 2: ANOVA Computation Layout (Lind, Marchal, and Wathen, 2008)

Source of Variation    Sum of Squares    df     Mean Square        F
Treatments             SST               k-1    SST/(k-1) = MST    MST/MSE
Error                  SSE               n-k    SSE/(n-k) = MSE
Total                  SS total          n-1

Excel commands for the one-way ANOVA were used to run the data and the results are shown in Table 3.
Table 3: Excel One-way ANOVA Computation Results

SUMMARY
Groups        Count     Sum    Average    Variance
Eastern           4     349      87.25       36.92
TWA               5     391      78.20       58.70
Allegheny         7     510      72.86       30.14
Ozark             6     414      69.00       13.60
Total            22    1664      75.64

ANOVA
Source of Variation          SS    df        MS       F    P-value    F crit
Between Groups           890.68     3    296.89    8.99     0.0007      3.16
Within Groups            594.41    18     33.02
Total                  1,485.09    21

The results tell us that the total variation is 1,485.09 and the variation within the groups is 594.41. The
decision to reject or not reject the null hypothesis starts with the variation between groups, which is 890.68.
Dividing the variation between groups by its df (3) gives the mean square for treatments (MST) of 296.89.
Dividing the variation within groups by its df (18) gives the mean square error (MSE) of 33.02. The large gap
between the two mean squares is an early indication that the null hypothesis may be rejected, since the
between-group mean square is much larger than the within-group mean square. To confirm this, the formula MST/MSE
is used to find the F value: 296.89/33.02 = 8.99. The calculated F value is then compared to the critical value
obtained from Appendix B.4, pg. 789 (Lind, Marchal, and Wathen, 2008). With 3 df for the numerator and 18 df for
the denominator, the critical value at the .01 level is 5.09. (The F crit of 3.16 reported in Table 3 corresponds
to the .05 level.) Since the F value of 8.99 is greater than the critical value, H0 is rejected, which indicates
that the mean scores of the four airlines are not all equal. So what? The p-value of .0007 is the probability of
finding an F value this large or larger when the null hypothesis is true, and it is very small. The likelihood of
committing a Type I error by rejecting H0 is therefore very small.

With this data, a customer wanting to travel would know that not all airlines provide the same level of service
satisfaction. Knowing this, the customer would then look more closely at the survey to see which services were
covered. Were the services that they prefer to use included in the survey? If so, what were the results for
those services? By answering these questions, the customer is giving additional weight to certain variables,
which would result in a new analysis of the data.

To narrow down the influencing variable, more data would be needed to perform a correlation analysis to see
which service or combination of services has the greatest influence on flight satisfaction. The drawback of
performing statistical analysis on such a streamlined data set is that one person’s satisfaction is not the same
as another person’s. This creates a degree of bias in the data set that can obscure the true nature of the
statistical findings. To reduce the bias in the survey, there should be a control factor that accounts for the
biased data within the statistical analysis.

Linear Regression and Correlation Analysis:

While the ANOVA allows businesses to test whether several populations share the same mean, it still leaves
questions to be answered, as noted in the airline example. So how do businesses answer the question of how the
variables relate to each other? To answer this type of question, a correlation analysis is conducted to measure
the strength of the relationship, and a linear regression is used to build a model that predicts the resulting
value from a known value. To know the range of certainty, the business would also compute a confidence interval.

Douglas Lind and associates provide a great example of this in Chapter 13 with Copier Sales of America (2008).
The example looks at the number of sales calls and copiers sold for 10 salespeople (Table 4) to see whether there
is a direct or inverse relationship between the number of calls made and the number of copiers sold. They use
correlation analysis to measure the association between the two variables.

Table 4: Number of Sales Calls and Copiers Sold for 10 Salespeople
Sales Representative    Number of Sales Calls    Number of Copiers Sold
Tom Keller                       20                       30
Jeff Hall                        40                       60
Brian Virost                     20                       40
Greg Fish                        30                       60
Susan Walch                      10                       30
Carlos Ramirez                   10                       40
Rich Niles                       20                       40
Mike Kiel                        20                       50
Mark Reynolds                    20                       30
Soni Jones                       30                       70
Total                           220                      450
(Table 13-1: Lind, Marchal, and Wathen, 2008)

To do the correlation analysis, it is important to identify the dependent variable and the independent variable.
The dependent variable is the variable being predicted (copiers sold), and the independent variable is the
variable that provides the basis for estimation (sales calls). In a scatter diagram (Chart 13-1 in the textbook),
the dependent variable is placed on the y-axis and the independent variable on the x-axis. Once plotted, the
graph clearly shows that there is some type of correlation between the number of calls made by a salesperson and
the number of sales.

[Figure: scatter diagram "Sales Calls and Copiers Sold"; x-axis: Sales Calls (0 to 50); y-axis: Copiers Sold (0 to 80)]

(Chart 13-1: Lind, Marchal, and Wathen, 2008)

So how does this help the business manager? In graphic form, they can quickly show it to the sales reps as a
means to motivate them to increase their calls in an effort to increase sales. For most managers, this is not
enough to go on when they want to know what the amount of sales will be if calls are increased. To answer this
properly, they need a better understanding of how the two variables relate to each other, which comes from
computing the deviations from the mean and their products and then the coefficient of correlation (r).

Table 5: Deviations from the mean and Their Products:

(Table 13-3: Lind, Marchal, and Wathen, 2008)

The coefficient of correlation (r) is computed from the sum of the products of the deviations and the sample
standard deviations of the sales calls (sx) and the copiers sold (sy) for the 10 salespeople, using the following
formula:

$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1) s_x s_y}$

The resulting value can range from -1.00 to 1.00. The closer the value is to -1.00 or 1.00 the stronger the correlation

and the closer to 0.00 the weaker the correlation. A negative value indicates an inverse relationship and a positive

value indicates a direct relationship. Using Excel's descriptive statistics function we can obtain the standard
deviations (s) as seen in Table 6.

Table 6: Descriptive statistics of Sales Calls and Copiers Sold for 10 Salespeople:

Number of Sales Calls                               Number of Copiers Sold

Mean                                    22.000     Mean                                  45.000
Standard Error                           2.906     Standard Error                         4.534
Median                                  20.000     Median                                40.000
Mode                                    20.000     Mode                                  30.000
Standard Deviation (sx)                  9.189     Standard Deviation (sy)               14.337
Sample Variance                         84.444     Sample Variance                      205.556
Kurtosis                                 0.396     Kurtosis                              -1.001
Skewness                                 0.601     Skewness                               0.566
Range                                   30.000     Range                                 40.000
Minimum                                 10.000     Minimum                               30.000
Maximum                                 40.000     Maximum                               70.000
Sum                                    220.000     Sum                                  450.000
Count                                   10.000     Count                                 10.000
Confidence Level(95.0%)                  6.574     Confidence Level(95.0%)               10.256

The computation of $r = \frac{900}{(10 - 1)(9.189)(14.337)} = 0.759$ indicates a strong positive correlation.
This result does not prove that making more calls causes more sales, only that the two variables are related.

The correlation does not yet tell the manager how many additional sales one additional call will create. To
determine this, a relationship needs to be established using the linear regression equation: Ŷ = a + bX

    “Ŷ” is the estimated value of the Y variable for a selected X value

    “a” is the Y-intercept (the estimated value of Y when X = 0)

    “b” is the slope of the line (the mean change in Ŷ for each change of one unit in the X variable)

    “X” is the selected value of the independent variable

The slope of the regression line is obtained by multiplying the correlation coefficient by the ratio of the
standard deviations. The first step is to find the slope:
$b = r \frac{s_y}{s_x} = 0.759 \left( \frac{14.337}{9.189} \right) = 1.1842$. The second step is to determine
the Y-intercept: $a = \bar{Y} - b\bar{X} = 45 - 1.1842(22) = 18.9476$. With these two values, the manager can
calculate how many copiers a representative who makes 20 calls (X) is expected to sell:
Ŷ = 18.9476 + 1.1842(20) = 42.6316 copiers. Simply put, for every additional call the sales representative can
expect an increase of about 1.2 copiers sold. However, the equation is reliable only within the range of the
data: the sales calls ranged from 10 to 40, which limits the use of the equation to this range. For fewer than
10 or more than 40 calls, the accuracy lessens. The equation is only a prediction statement and is not perfect.

To gauge the accuracy of the prediction equation, they calculate the standard error of estimate, which measures
the dispersion of the observed values around the line of regression. This is done using the following equation:
$s_{y \cdot x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}$. To ease the process, the data have been run through
the regression tool in Excel's Data Analysis program. The results are shown in Table 7.

Table 7: Regression calculation of Sales Calls and Copiers Sold for 10 Salespeople:

Regression Statistics
Multiple R                              0.759
R Square                                0.576
Standard Error                          9.901
Observations                           10.000

Coefficients
Intercept                               18.95
Calls                                    1.18

The standard error computes to 9.901, which indicates how far the data points typically deviate from the
regression line. With this value, the manager can build an interval around each prediction at a chosen confidence
level. At 90% confidence, for example, there is a 10% chance that the interval misses the actual value.

Index Numbers:

Drawing correlations between variables is not the only thing that is important to businesses and managers.
Profits are important, and to see true profits they use Consumer Price Index (CPI) numbers. The CPI expresses the
relative change in prices compared to an established base period. Two basic types of data are needed to construct
the CPI: price data and weighting data. The percent change in the CPI is a measure of inflation. The CPI can be
used to adjust wages, salaries, pensions, and regulated or contracted prices for the effects of inflation.

One weighted index is the Laspeyres Price Index (LPI), which uses base-period quantities as weights. It is
computed as follows, where pt is the current price, p0 is the base-period price, and q0 is the base-period
quantity:

$P = \frac{\sum p_t q_0}{\sum p_0 q_0} \times 100$

Douglas Lind and associates provide a great example of this in Chapter 15, as they talk about the prices for the six

food items shown in Table 8 (2008).

Table 8: Price and Quantity of Food Items in 1995 and 2005:

Item            Price-95    Qty-95    Price-95*Qty-95    Price-05    Price-05*Qty-95
Bread              \$0.77        50             \$38.50       \$0.89             \$44.50
Eggs               \$1.85        26             \$48.10       \$1.84             \$47.84
Milk               \$0.88       102             \$89.76       \$1.01            \$103.02
Apples             \$1.46        30             \$43.80       \$1.56             \$46.80
Orange Juice       \$1.58        40             \$63.20       \$1.70             \$68.00
Coffee             \$4.40        12             \$52.80       \$4.62             \$55.44
Total                                         \$336.16                         \$365.60

(Data from Table 15-3: Lind, Marchal, and Wathen, 2008)
To calculate the LPI, they first determine the total amount spent on the six items in the base period, which
equals \$336.16. Then they multiply the 2005 prices by the 1995 quantities to obtain a weighted total of \$365.60.
With the two totals calculated, the weighted price index can be computed. The final value is 108.8, indicating
an 8.8 percent increase in cost over the ten-year period.

$P = \frac{\sum p_t q_0}{\sum p_0 q_0} \times 100 = \frac{\$365.60}{\$336.16} \times 100 = 108.8$

The LPI does not reflect changes in buying patterns that may have occurred over time. To compensate for this,
they can use the Paasche Price Index, which uses current-year quantities to reflect current buying habits. The
problem with this index is that it tends to overweight goods whose prices have gone down. Fisher's Ideal Index
(FII) tries to balance the effects of the two price indexes by taking the geometric mean of the two. However,
the FII has the same drawback as the Paasche Price Index in that it requires current quantity data for each
period.

Another use of the CPI is for employees to determine what their current income is really worth after inflation.
They can calculate real income (RI) from money income (MI) using the equation:
$RI = \frac{\text{Money Income}}{CPI} \times 100$. To see how it works, they take an annual income of \$20,000
from 1982-84 and set that as the base period (CPI equal to 100). Then they take the present-year income of
\$40,000 and divide it by the CPI for that year, which is 200. Placing these values into the equation gives
$RI = \frac{MI}{CPI} \times 100 = \frac{40,000}{200} \times 100 = 20,000$. The employee will see that their
income has the same purchasing power as it did in 1982-84 and that the employer has properly adjusted their
income to reflect the current CPI. This is not always the case. For example, if the CPI were 250, their RI would
be \$16,000, indicating that inflation has weakened their purchasing power by \$4,000. This concept is also
called deflated income and comes up when labor unions negotiate new contracts for employees.

Table 9: OCOLA for Army Major Living in Kaiserslautern, Germany:

(Retrieved from: http://perdiem.hqda.pentagon.mil/cgi-bin/cola-oha/o_cola.pl)

Businesses also use the CPI to determine cost-of-living allowance (COLA) increases within management-union
contracts. For instance, the military pays Soldiers stationed overseas a supplement (OCOLA) to offset the cost
difference between the local economy and the US economy. Table 9 shows that a Major in the Army living off post
in Kaiserslautern, Germany, with three dependents receives an additional \$39.378 daily to offset the local
economy's 0.32 inflationary index, which reflects the difference between the US CPI and the Euro CPI. The OCOLA
ensures that the purchasing power of service members' income is not penalized because they are serving in
another country.

The Producer Price Index (PPI), another price index, is a vital tool for business owners when they need to
provide daily budget analysis. The PPI reflects the prices charged to the manufacturer for the materials
purchased to produce the end product and is used to determine whether and where the budget needs to be adjusted
in the future. If the company were a bakery, it would want to know the PPI for crude goods to determine whether
the current budget will be enough for the next month's production requirement. This is also a good indicator of
whether the price of its product needs to increase. The business owners will also be able to determine the ratio
of growth between raw material cost and sales as a means to decide at what point they will need to increase or
decrease the product price and by how much.

Conclusion:

Studying statistics is important for any business owner who wants to establish a baseline for success.
Statistical analysis of the company's cost of goods and services, and of how those costs relate to periodic
fluctuations in the economy, helps to establish new goals and benchmarks for the company. Without constant
statistical review, an owner may never realize that they need to change the price of their goods or services in
order to stay in business, or that they are losing employees because wages do not support the current cost of
living.

References:

Lind, D., Marchal, W., & Wathen, S. (2008). Statistical Techniques in Business & Economics. (3rd ed.). New

Delhi: Tata McGraw-Hill Publishing Company Limited. Pgs. 409-593.

“Overseas Cost of Living”. (2008). Department of Defense Per Diem, Travel and Transportation Allowance
Committee. Retrieved October 6, 2008, from: http://perdiem.hqda.pentagon.mil/cgi-bin/cola-oha/o_cola.pl

```
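
The worked examples in the paper (the airline ANOVA, the copier-sales correlation and regression, the Laspeyres index, and the real-income calculation) can be re-checked with a short script. The following is a minimal sketch in Python using only the standard library; it is not part of the original paper, which used Excel, and the function names and data layout are assumptions chosen for illustration. The data are copied from Tables 1, 4, and 8.

```python
# Illustrative re-computation of the paper's worked examples (standard library only).
# Function names and data layout are assumptions made for this sketch.
from math import sqrt
from statistics import mean, stdev


def one_way_anova(groups):
    """Return (treatment SS, error SS, F) for a list of samples."""
    observations = [x for g in groups for x in g]
    grand_mean = mean(observations)
    k, n = len(groups), len(observations)
    # Treatment variation: squared gap between each group mean and the
    # grand mean, weighted by the group's sample size.
    sst = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Random (error) variation: squared gap between each observation and
    # its own group mean.
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    f_stat = (sst / (k - 1)) / (sse / (n - k))
    return sst, sse, f_stat


def simple_regression(x, y):
    """Return (r, intercept a, slope b, standard error of estimate)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (n - 1)
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r = cross / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return r, a, b, sqrt(sse / (n - 2))


def laspeyres_index(items):
    """items: (base price, base quantity, current price) tuples."""
    current = sum(p_t * q_0 for _, q_0, p_t in items)
    base = sum(p_0 * q_0 for p_0, q_0, _ in items)
    return current / base * 100


def real_income(money_income, cpi):
    """Deflate money income by the CPI (base period = 100)."""
    return money_income / cpi * 100


# Table 1: satisfaction scores for Eastern, TWA, Allegheny, Ozark
airlines = [
    [94, 90, 85, 80],
    [75, 68, 77, 83, 88],
    [70, 73, 76, 78, 80, 68, 65],
    [68, 70, 72, 65, 74, 65],
]
# Table 4: sales calls and copiers sold
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
copiers = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]
# Table 8: (1995 price, 1995 quantity, 2005 price)
food = [
    (0.77, 50, 0.89), (1.85, 26, 1.84), (0.88, 102, 1.01),
    (1.46, 30, 1.56), (1.58, 40, 1.70), (4.40, 12, 4.62),
]

sst, sse, f_stat = one_way_anova(airlines)
r, a, b, s_yx = simple_regression(calls, copiers)
print(f"ANOVA: SST={sst:.2f} SSE={sse:.2f} F={f_stat:.2f}")
print(f"Regression: r={r:.3f} a={a:.2f} b={b:.2f} s_yx={s_yx:.3f}")
print(f"Laspeyres index: {laspeyres_index(food):.1f}")
print(f"Real income: {real_income(40_000, 200):.0f}")
```

Run as a script, the printed values should agree, within rounding, with the figures reported in Tables 3 and 7 and with the Laspeyres and real-income results in the text. The sketch stops at the F-statistic: the p-value and the critical value still have to come from an F table or from statistical software.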