# Clinical data management

WHAT IS STATISTICS
• Statistics in the plural form means numerical facts or data.
• Statistic in the singular form means the science of the collection, organization, analysis and interpretation of numerical facts.
Characteristics of Statistics
• Aggregate of facts – a collection of facts; facts can be analyzed statistically only when they are more than one.
• Affected to a marked extent by a multiplicity of causes.
• Numerically expressed – only numerical facts can be statistically analyzed.
• Enumerated or estimated according to a reasonable standard of accuracy.
• Collected in a systematic manner.
• Collected for a pre-determined purpose.
• Placed in relation to each other.

Branches and Scope
Branches of Statistics
1. Statistical Methods
2. Applied Statistics – Biometry, Demography, Econometrics, Statistical Quality Control, Psychometry
Scope and Application of Statistics
• Biology • Agriculture • Medicine • Business • Economics • Commerce

Limitations of Statistics
• Does not deal with qualitative data
• Does not deal with individual facts
• Statistical inferences are not exact; they are probabilistic statements
• Statistics can be misused
• Lay people cannot handle statistics properly

Statistics In Clinical Research
• To find the action of a drug
• To compare the actions of two different drugs
• To find the relative potency of a new drug with reference to a standard drug
• To compare the efficacy of a particular drug or treatment
• To find the association between two attributes, such as cancer and smoking

Utility of Computers
• For collection, compilation and tabulation of data
• Finding out various statistical measures
• Hypothesis testing
• Packages like SPSS are available

Some Basic Definitions
• Units / Individuals / Elements – the objects whose characteristics we study.
• Population / Universe – the collection of all units.
• Finite Population – contains a finite number of units.
• Infinite Population – contains an infinite number of units.
• Quantitative Characteristic – numerically measurable.
• Qualitative Characteristic – not numerically measurable.
• Variable – a quantitative characteristic which varies from unit to unit.
• Attribute – a qualitative characteristic which varies from unit to unit.
• Discrete Variable – assumes only specified values in a range.
• Continuous Variable – assumes all values in a given range.

Classification and Tabulation
Units having common characteristics are grouped together.
Functions of Classification
• Reduces the bulk of data
• Simplifies the data
• Facilitates comparison of characteristics
• Renders data ready for statistical analysis
Types of classification
• Quantitative (with regard to a variable)
• Qualitative (with regard to an attribute)
• Spatial (geographical)
• Temporal (chronological)
Classification of units on the basis of a characteristic into two classes is called dichotomy (e.g. men / women).

Summarization of Data
Frequency Distribution

• Frequency is the number of units associated with each value of a variable.
• A frequency distribution is a systematic presentation of the values taken by a variable and the corresponding frequencies.
• Values may be discrete or continuous.
• If the number of values is large, the range of the variable is divided into mutually exclusive sub-ranges called class intervals.
• Each class interval has a lower class limit and an upper class limit.
• Width of class – the difference between the class limits.

Frequency Distribution
• Class mark or class mid-value – the central value of a class interval.
• A frequency distribution may be continuous or discrete.
• Inclusive class interval – both the lower and upper limits of the class interval are included in the same class interval.
• Exclusive class interval – the lower class limit is included in the same class interval; the upper class limit is included in the succeeding class interval.
• Before analysis, inclusive class intervals should be converted into exclusive class intervals: 0–9, 10–19, 20–29 become −0.5–9.5, 9.5–19.5, 19.5–29.5.
• Values such as −0.5, 9.5, etc. are called class boundaries.

Graphic Representation of Frequency Distribution
Histogram
• Class limits / class marks are marked on the X-axis.
• Class frequencies are marked on the Y-axis.
• A rectangular bar is drawn for each class interval and its frequency.
• For unequal class intervals the Y-axis measures frequency density, not class frequency.
• So if one class interval is three times as wide as the others, its height is reduced to 1/3.

Frequency Distribution
• Open-end class – a class interval at an extremity that lacks one limit.
• Univariate frequency distribution – a single variable.
• Bivariate frequency distribution – two variables.
• Multivariate frequency distribution – more than two variables.
• Frequency density of a class = frequency of the class / width of the class.

Graphic Representation of Frequency Distribution
Frequency Polygon
• Mark a dot at the mid-point of the top of each rectangle of the histogram.
• Join these points by straight lines.
• The polygon is closed by joining the end points to the mid-points, on the X-axis, of the adjacent outlying intervals with zero frequency.
• It can be drawn without drawing the histogram, by marking only the points.

[Bar chart: No. of patients (0–100) by quarter, 1st Q to 4th Q]

Graphic Representation of Frequency Distribution
Cumulative Frequency Curve or Ogive
• Less-than type or more-than type.
• The Y-axis represents cumulative frequency.
• The X-axis is labeled with upper class limits for a less-than ogive and with lower class limits for a more-than ogive.
• A cumulative curve lends itself to quick interpretation.
• The point of intersection of the two curves is the median.
• Two sets of ogives can be compared on a percentage basis.

Generally in a frequency distribution, values cluster around a central value. This is called central tendency. The central value around which there is a concentration is called a measure of central tendency, or average. Averaging is done to arrive at a single value representing the entire data.
Objectives of averaging:
• To find one value that represents the whole data
• To enable comparison
• To establish relationships
• To derive inferences about a universe from a sample


MEASURE OF CENTRAL TENDENCY
The measures of central tendency are:
• Mathematical averages
  – Arithmetic Mean
  – Geometric Mean
  – Harmonic Mean
• Positional averages
  – Median
  – Mode
The Arithmetic Mean, Median and Mode are the most widely used.

Arithmetic Mean
When the mean is calculated for an entire population, it is the population arithmetic mean (μ), and N is the number of observations in the population:

μ = Σx / N

Calculation of the mean from grouped data (a frequency distribution):
• Required when the number of observations is large
• Gives an estimate of the value of the mean
• Not as accurate as the mean obtained from all individual observations

Arithmetic Mean
Steps in calculating the mean from grouped data:
Mid-point (class mark) x = (Lower Limit + Upper Limit) / 2

X̄ = Σ(f × x) / n,  where f = the number of observations in each class

Example – Calculate the mean weight of the population:

| Wt in kg | Frequency (f) | Class mark (X) | f × X |
|----------|---------------|----------------|-------|
| 60–61 | 10 | 60.5 | 605 |
| 61–62 | 20 | 61.5 | 1230 |
| 62–63 | 45 | 62.5 | 2812.5 |
| 63–64 | 50 | 63.5 | 3175 |
| 64–65 | 60 | 64.5 | 3870 |
| 65–66 | 40 | 65.5 | 2620 |
| 66–67 | 15 | 66.5 | 997.5 |
| Total | 240 | | 15310 |

X̄ = Σ(f × X) / Σf = 15310 / 240 = 63.79
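The grouped-mean calculation above can be checked with a short script (an illustrative Python sketch; the frequencies and class marks are taken from the table):

```python
# Grouped arithmetic mean: mean = sum(f * x) / sum(f),
# where x is the class mark (mid-value) of each interval.
freqs = [10, 20, 45, 50, 60, 40, 15]                 # f for 60-61 ... 66-67
marks = [60.5, 61.5, 62.5, 63.5, 64.5, 65.5, 66.5]   # class marks x

total_fx = sum(f * x for f, x in zip(freqs, marks))
n = sum(freqs)
mean = total_fx / n
print(round(mean, 2))  # 63.79
```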

Weighted Arithmetic Mean
Considers the relative importance of each value, e.g. a labour rate for a product using three classes of labour.

X̄w = Σ(w × X) / Σw,  where w = the weight allocated and Σw = the sum of all weights

Example
Calculate the fatality rate in smallpox from the age-wise fatality rates given below:

| Age group in years | No. of smallpox cases | Fatality rate (per cent) |
|--------------------|-----------------------|--------------------------|
| 0–1 | 150 | 35.33 |
| 2–4 | 304 | 21.38 |
| 5–9 | 421 | 16.86 |
| Above 9 | 170 | 14.17 |

This is given by the weighted arithmetic mean:
WAM = (150 × 35.33 + 304 × 21.38 + 421 × 16.86 + 170 × 14.17) / (150 + 304 + 421 + 170) = 20.39

GEOMETRIC MEAN
Geometric Mean = the nth root of the product of all n values.

It is more applicable in calculating growth rates over the years:
Growth Factor = 1 + Growth Rate / 100
Average Growth Factor = the geometric mean of the yearly growth factors
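The growth-rate use of the geometric mean can be sketched as follows (the yearly rates are hypothetical, chosen only for illustration):

```python
import math

def geometric_mean(values):
    # nth root of the product of all n values
    return math.prod(values) ** (1 / len(values))

# Convert each year's growth rate (%) to a growth factor, take the GM
# of the factors, then convert back to an average growth rate.
rates = [10.0, 20.0, 30.0]              # hypothetical yearly growth rates
factors = [1 + r / 100 for r in rates]  # growth factors
avg_rate = (geometric_mean(factors) - 1) * 100
```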

MEDIAN
The median is the middle value of a series arranged in order of magnitude. It establishes a dividing line between the 50% of higher values and the 50% of lower values.

If the number of terms n is odd, the median is the value of the (n + 1)/2 th term.
If the number of terms is even, i.e. 2n, the median is the average of the nth and (n + 1)th terms.
This also applies to a simple frequency distribution of a discrete random variable x.

Median for Grouped Data
Locate the class in which the median lies.

Median = Lm + [(N/2 − F) / fm] × w

Where
Lm = lower limit of the median class
w = width of the class interval
F = cumulative frequency up to the lower limit of the median class
fm = frequency of the median class
N = total frequency
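The interpolation can be sketched in Python (using the standard grouped-median formula Lm + ((N/2 − F)/fm)·w, applied to the weight distribution from the earlier example):

```python
def grouped_median(class_lows, freqs, width):
    # Median = Lm + ((N/2 - F) / fm) * w
    # Lm: lower limit of the median class, F: cumulative frequency before it,
    # fm: frequency of the median class, w: class width.
    n = sum(freqs)
    cum = 0
    for low, f in zip(class_lows, freqs):
        if cum + f >= n / 2:
            return low + ((n / 2 - cum) / f) * width
        cum += f

# Weight distribution (60-61 kg ... 66-67 kg) from the mean example:
lows = [60, 61, 62, 63, 64, 65, 66]
freqs = [10, 20, 45, 50, 60, 40, 15]
median = grouped_median(lows, freqs, 1)  # 63.9
```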

MODE
The mode is the value of the variable which occurs most frequently. For ungrouped data, find the value that occurs most often. For grouped data the mode is located in the class with the maximum frequency:

Mode = Mo = LMo + [d1 / (d1 + d2)] × w

Where
LMo = lower limit of the modal class
d1 = frequency of the modal class − frequency of the class preceding the modal class
d2 = frequency of the modal class − frequency of the class succeeding the modal class
w = width of the modal class
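A minimal sketch of the grouped-mode formula, again using the weight distribution from the mean example:

```python
def grouped_mode(class_lows, freqs, width):
    # Mode = LMo + (d1 / (d1 + d2)) * w
    i = freqs.index(max(freqs))                       # modal class index
    d1 = freqs[i] - (freqs[i - 1] if i > 0 else 0)    # gap to preceding class
    d2 = freqs[i] - (freqs[i + 1] if i < len(freqs) - 1 else 0)  # to next
    return class_lows[i] + (d1 / (d1 + d2)) * width

lows = [60, 61, 62, 63, 64, 65, 66]
freqs = [10, 20, 45, 50, 60, 40, 15]
mode = grouped_mode(lows, freqs, 1)  # modal class 64-65
```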

Percentile
• Percentiles are values of the variable which divide the ordered observations into two parts in a stated percentage ratio, e.g. 10% below and 90% above.
• A given percentile of two samples or populations can be compared.

MEASURE OF DISPERSION
One more characteristic of a dataset is how it is distributed: how far each element is from the measure of central tendency. The measures of this dispersion are:
• RANGE
• INTER-QUARTILE RANGE
• QUARTILE DEVIATION
• MEAN DEVIATION
• VARIANCE
• STANDARD DEVIATION


RANGE
The range is the difference between the smallest and the largest observation in the distribution:
RANGE = L − S,  where L = largest value and S = smallest value
For grouped data:
RANGE = upper limit of the highest class − lower limit of the lowest class
Co-efficient of Range – ranges of weight in kg and height in cm are not comparable. To allow comparison, a relative measure called the coefficient of range is defined as:
Co-efficient of Range = (L − S) / (L + S)

INTER-QUARTILE RANGE
• The inter-quartile range is the range of the middle 50% of the observations.
• INTER-QUARTILE RANGE = Q3 − Q1
• Q1, Q2, Q3 are the quartiles: the values below which 25%, 50% and 75% of the observations fall.
• QUARTILE DEVIATION = (Q3 − Q1) / 2

QUARTILE DEVIATION
Co-efficient of Quartile Deviation = (Q3 − Q1) / (Q3 + Q1)

For ungrouped data, Q1 is the value of the (N + 1)/4 th observation and Q3 is the value of the 3(N + 1)/4 th observation.

For grouped data:
Q1 = L1 + [(N/4 − c) / f] × h
Q3 = L3 + [(3N/4 − c) / f] × h
Where
L1 = lower boundary of the first quartile class
L3 = lower boundary of the third quartile class
N = total frequency
f = frequency of the quartile class
h = class interval (width)
c = cumulative frequency of the class just above the quartile class
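Both grouped quartiles follow the same interpolation pattern, so they can be sketched with one helper (illustrative Python, applied to the weight distribution used earlier):

```python
def grouped_quartile(class_lows, freqs, width, k):
    # Qk = L + ((k*N/4 - c) / f) * h   for k = 1 or 3
    n = sum(freqs)
    target = k * n / 4
    cum = 0
    for low, f in zip(class_lows, freqs):
        if cum + f >= target:
            return low + ((target - cum) / f) * width
        cum += f

lows = [60, 61, 62, 63, 64, 65, 66]
freqs = [10, 20, 45, 50, 60, 40, 15]
q1 = grouped_quartile(lows, freqs, 1, 1)
q3 = grouped_quartile(lows, freqs, 1, 3)
quartile_deviation = (q3 - q1) / 2
```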

MEAN DEVIATION
This is the mean of the absolute deviations of each observation from the mean.

For a population: Absolute Mean Deviation = Σ|x − μ| / N
For a sample: Absolute Mean Deviation = Σ|X − x̄| / n
Where
x = value of the observation
μ = the population mean
N = number of observations in the population
x̄ = the sample mean
n = number of observations in the sample

Mean Deviation
For grouped data, M.D. (about the mean x̄) = (1/N) Σ f|x − x̄| = (1/N) Σ f|d|
Where
x = mid-value of the class interval
f = corresponding frequency
d = deviation from the mean

Merits and demerits of the absolute mean deviation:
• Simple and easy
• More comprehensive, as it depends on all observations
• A true measure, as it averages all deviations
But
• Less reliable, as it ignores signs
• Not conducive to algebraic operations
• Not useful for open-end classes

VARIANCE
Here the deviations are squared to make them positive.

Variance = σ² = Σ(x − x̄)² / N = Σx² / N − x̄²

For grouped data:
σ² = Σ fi(Xi − x̄)² / N = Σ fX² / N − x̄²
where fi = frequency of the class and Xi = value of the class mark.

This is for a population. For a sample:
variance = s² = Σ(X − X̄)² / (n − 1) = (ΣX² − nX̄²) / (n − 1)

STANDARD DEVIATION
S.D. = σ = √Variance
Properties of the standard deviation:
• S.D. is independent of a change of origin, i.e. if all observation values are increased or decreased by a constant quantity, the S.D. does not change.
• S.D. is dependent on a change of scale, i.e. if each observation value is multiplied or divided by a constant quantity, the S.D. is affected in the same way.

STANDARD DEVIATION
Combined S.D. of two or more groups (σ12):

σ12² = (n1σ1² + n2σ2² + n1d1² + n2d2²) / (n1 + n2)

where d1 = x̄1 − x̄, d2 = x̄2 − x̄ and x̄ = (n1x̄1 + n2x̄2) / (n1 + n2)

Co-efficient of variation = S.D. / Mean, generally expressed as a percentage.

Example
Calculate the mean and standard deviation of the IQ of 50 boys from the data given:

| IQ | Frequency (f) | Class mark (X) | f × X | f × X² |
|----|---------------|----------------|-------|--------|
| 0–20 | 3 | 10 | 30 | 300 |
| 20–40 | 4 | 30 | 120 | 3600 |
| 40–60 | 3 | 50 | 150 | 7500 |
| 60–80 | 4 | 70 | 280 | 19600 |
| 80–100 | 13 | 90 | 1170 | 105300 |
| 100–120 | 12 | 110 | 1320 | 145200 |
| 120–140 | 8 | 130 | 1040 | 135200 |
| 140–160 | 3 | 150 | 450 | 67500 |
| Total | 50 | | 4560 | 484200 |

X̄ = Σ(f × X) / Σf = 4560 / 50 = 91.2

σ = √[(Σ fX² − nX̄²) / n] = √[(484200 − 50 × 91.2²) / 50] = √(68328 / 50) = 36.97

SKEWNESS
• The measure of central tendency and the measure of dispersion are characteristics of a frequency distribution.
• A third important characteristic of a frequency distribution is its shape.
• A frequency distribution is said to be symmetrical when values of the variable equidistant from the mean have equal frequencies.
• When a frequency distribution is not symmetrical, it is said to be asymmetrical or skewed.
• Any deviation from symmetry is called skewness.
• Skewness may be positive or negative.

SKEWNESS
• Positively skewed – the frequency curve has a longer tail towards the higher values of X. Of the mean, median and mode, the mode is the smallest and the mean is the largest.

• Negatively skewed – the frequency curve has a longer tail towards the lower values of X. Here the mean is the smallest and the mode is the largest.

Bienaymé–Chebyshev Rule
Whatever the shape of the distribution, at least 75% of the values in the population will fall within ±2 standard deviations of the mean, and at least 89% will fall within ±3 standard deviations of the mean.
• In general, the percentage of observations lying within ±k standard deviations of the mean is at least (1 − 1/k²) × 100.
For a symmetrical bell-shaped distribution we can say that:
• Approximately 68% of the observations in the population fall within ±1 s.d. of the mean
• Approximately 95% fall within ±2 s.d. of the mean
• Approximately 99.7% fall within ±3 s.d. of the mean


NORMAL DISTRIBUTION
• This is the probability distribution of a continuous random variable.
• It reflects the values taken by many real-life variables, such as height and weight.
• A large number of observations cluster around the mean value, and their frequency drops as we move away from the mean.

NORMAL DISTRIBUTION
If samples of size n (n > 30) are drawn from any population, the sample means will be normally distributed with a mean equal to μ (the population mean).

Characteristics of the Normal Distribution Curve
• The curve has a single peak (it is uni-modal).
• The mean lies at the centre.
• Because of symmetry, the median and mode are also at the centre.
• The two tails extend indefinitely and never touch the X-axis.

NORMAL DISTRIBUTION
To define a particular normal distribution we need only two parameters, the mean and the standard deviation. If σ is the standard deviation, then:
• 68% of observations lie in the range μ ± 1σ
• 95.5% of observations lie in the range μ ± 2σ
• 99.7% of observations lie in the range μ ± 3σ

The Standard Normal Distribution
This is a normal distribution with mean μ = 0 and standard deviation σ = 1. The observation values in the standard normal distribution are denoted by Z.

[Standard normal table: area under the curve between 0 and z, for z from 0.00 upwards in steps of 0.01]

Standardizing Normal Variable
Suppose we have a normal population with variable X. We can convert X to the standard normal variable Z using the formula:

Z = (X − μ) / σ

Example
Q. The average weight of a baby at birth is 3.05 kg, with an SD of 0.39 kg. If birth weights are normally distributed, would you regard a weight of 4 kg as abnormal and a weight of 2.5 kg as normal?
A. The normal limits of weight, with 95% probability, are 3.05 − 1.96 × 0.39 = 2.29 kg and 3.05 + 1.96 × 0.39 = 3.81 kg. So 4 kg is abnormal and 2.5 kg is normal.
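The birth-weight limits can be computed directly (a sketch; the 1.96 multiplier is the usual 95% z-value):

```python
# 95% "normal" limits: mean +/- 1.96 * SD
mean, sd = 3.05, 0.39
low, high = mean - 1.96 * sd, mean + 1.96 * sd

def is_abnormal(weight):
    # Outside the 95% limits counts as abnormal
    return weight < low or weight > high

print(round(low, 2), round(high, 2))        # 2.29 3.81
print(is_abnormal(4.0), is_abnormal(2.5))   # True False
```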


SAMPLING AND SAMPLE SIZE CALCULATION
In statistical analysis we infer about the universe or population either from data about the entire universe or from statistics of samples.

Census Enumeration and Sample survey
Census enumeration – collecting data from each and every unit of the population.
• Costly, time-consuming and labour-intensive.
• But gives accurate results, free of sampling errors.
• Does not require deep statistical analysis.
• Gives direction for further studies.
• Cannot be used for destructive testing.
Sample survey – cheaper, and consumes less time and labour.
• The selected part of the population which is used to ascertain the characteristics of the population is called a sample.
• Sampling is done to obtain information about the whole.

Types of Sampling
• Random or probability sampling
• Non-random or judgment sampling
Random or probability sampling:
• The procedure is rigorous and time-consuming.
• But it is possible to quantify the magnitude of the likely error in the inferences made.


SAMPLING TECHNIQUES
Simple Random Sampling
• Assigns an equal probability for each unit to be included in the sample.
• Can be with or without replacement.
• Requires a list of all members of the population.
• Works well for relatively small populations.
• Selection by chits, random numbers or computer simulation.

SAMPLING TECHNIQUES
Stratified Random Sampling
• When considerable heterogeneity is present in the population, it is divided into segments, or strata, and the sample is selected from the strata.
• Strata are defined so that they are mutually exclusive and collectively exhaustive.
• Stratification should be such that each stratum is homogeneous within itself but the strata are heterogeneous between themselves.
• The sample size of each stratum is allocated on three points:
  – stratum size (the total number of units in the stratum)
  – variability within the stratum
  – cost of observation per unit in the stratum
• Proportionate allocation: ni = n × Ni / N

Cluster Sampling – any method of sampling where a group is taken as the sampling unit.
• The population is divided into well-defined groups, or clusters, and a few of them are selected.
• All the units of the selected clusters are studied, e.g. households in rural areas.
• Advantages: cost saving, better supervision, a list of units is not necessary, better co-operation of units.
• But it is statistically less efficient.

SAMPLING TECHNIQUES

Systematic Sampling
• The first unit is selected randomly from the first k units; then every kth unit is included in the sample.
• Advantages: simplicity of selection, operational convenience, even spread; a list of units is not necessary.
• But hidden periodicity will lead to inefficient sampling.
• The most commonly used method.


SAMPLING TECHNIQUES
Multi-stage Sampling
• Used for large-scale enquiries covering a large geographical area.

Non-Random or Non-probabilistic Sampling – here the sampling error is not measurable. Types:
• Convenience sampling – may result in biased opinion; useful to generate hypotheses.
• Judgment sampling – the sample is drawn from a population which the researcher thinks is representative; it should be carried out by an expert in the field.
• Quota sampling – each field worker is assigned a quota. Advantages: lower cost, and the field worker has a free hand. But the accuracy is doubtful.
• Snowball (sequential) sampling – the size of the sample is decided as sampling takes place, depending on the results of the earlier samples.

SAMPLING TECHNIQUES

The sample size depends upon:
• The type of problem investigated
• The extent of variability in the population
• The confidence level required (the larger the sample size, the higher the confidence level)
• The precision required
• Budget and cost
• Resources available

SAMPLE SIZE AND POWER

Determining Sample size for problems involving Mean
Influenced by consideration of:
• The standard deviation
• The acceptable level of sampling error
• The expected confidence level
In sampling, precision increases with the sample size and decreases with the variability S.

Formula for Sample Size
It is given as:

n = (Z × S / E)²,  where E = Z × Sx̄  and  Sx̄ = S / √n is the standard error of the mean

Z = the standardized value corresponding to the chosen confidence level
S = the sample standard deviation, or an estimate of the population standard deviation
E = the acceptable level of error, plus or minus (the range is one half of the total confidence interval)

Example
The mean pulse rate of a population is believed to be 70 per minute, with a standard deviation of 8 beats. Calculate the minimum size of the sample needed to verify this with an allowable error of ±1 beat at 5% risk.

n = (Z × S / E)²
Z = 1.96, S = 8, E = 1
n = (1.96 × 8)² / 1² = 245.86 ≈ 246
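The same calculation, as an illustrative Python sketch (rounding up, since a sample size must be a whole number at least as large as the computed value):

```python
import math

def sample_size_mean(z, s, e):
    # n = (Z * S / E)^2, rounded up to the next whole unit
    return math.ceil((z * s / e) ** 2)

n = sample_size_mean(1.96, 8, 1)
print(n)  # 246
```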



Sample size
When the sample size is more than 5 per cent of a finite (small) population, the sample size may be overestimated. Therefore a finite population correction factor is applied, given by:

√[(N − n) / (N − 1)]

Determining Sample Size for Problems Involving Proportions
Influenced by consideration of:
• An estimate of the population proportion
• The acceptable level of sampling error
• The expected confidence level

The formula for determining sample size is:

n = Z²c.i. × p × q / E²

where
n = sample size
Z²c.i. = the square of the confidence level in standard-error units
p = the estimated proportion of successes
q = the estimated proportion of failures
E² = the square of the maximum allowance for error between the true proportion and the sample proportion

Example
Q. The incidence rate in the last influenza epidemic was found to be 50 per thousand (5%) of the population exposed. What should be the size of the sample to find the incidence rate in the current epidemic, if the allowable error is 10%?
A. p = 5%; q = 100 − 5 = 95%; E = 10% of p = 0.1 × 5% = 0.5%

n = Z² × p × q / E² = 1.96² × 5 × 95 / 0.5² ≈ 7299

Randomization
• This is a method to Control Extraneous Variables • It means assigning units to experimental treatment and experimental treatments to units randomly • This minimizes the effect of extraneous variable on observations

Blinding
• To rule out bias in the subjects under study, BLINDED trials are used
• In a single-blind trial, patients in the control group as well as the experimental group are given drugs which look similar but whose contents are different, so no patient knows what is given to him
• In a double-blind trial, not only the patients but also the nurses or medical observers do not know which group of patients is given which drug
• Blinded trials are very useful where subjective information is required

Coding
• Coding – assigning a number or symbol to results in order to group them into a limited number of categories
• Coding sacrifices some detail but is necessary for efficient data analysis
Categorization rules:
• a) Appropriate – categorization should help validate the hypothesis. If the hypothesis aims to establish a relationship between key variables, the categories should facilitate comparison between them

Categorization rules
• b) Exhaustive – when multiple-choice questions are used, alternatives covering the full range of information should be provided
• c) Mutually exclusive
• d) Single dimension – every class is defined in terms of one concept; if more than one dimension is used, the classes may not be mutually exclusive

CORRELATION ANALYSIS
• Correlation denotes interdependence among variables. There should be a cause-and-effect relationship between the two phenomena; if there is no cause-and-effect relationship, there is no correlation.
• The degree of relationship is expressed by a coefficient which ranges from +1 to −1.
• If both variables change in the same direction, the correlation is said to be positive.
• If they change in opposite directions, it is negative.

METHODS OF STUDYING CORRELATION
• Scatter diagram
• Karl Pearson's coefficient of correlation
• Spearman's rank correlation coefficient
• Method of least squares
Scatter diagram:
– Applicable for a bivariate distribution.
– Data are plotted on graph paper in the form of dots, generally with the independent variable on the X-axis and the dependent variable on the Y-axis.
– The more scattered the points, the lower the degree of relationship.
– The nearer the points are to a line, the higher the degree of relationship.

Scatter Diagram
[Scatter plots: perfectly positive correlation; perfectly negative correlation]

If the points lie on a straight line parallel to the X-axis, or fall in a haphazard manner, this shows an absence of relationship between the two variables.

Scatter Diagram
Advantages of the scatter diagram
• Simple and non-mathematical
• Easy to understand
• Shows an approximate picture of the correlation quickly, hence used as a first step in finding correlation
• Not influenced by extreme values, unlike the mathematical methods
Disadvantage
• The exact degree of correlation cannot be established

KARL PEARSON COEFFICIENT OF CORRELATION
Also called the product-moment formula; the most widely used in practice. With x = X − X̄ and y = Y − Ȳ:

r = rxy = COV(X, Y) / (σx σy)
  = Σxy / (N σx σy)
  = Σxy / √(Σx² × Σy²)
  = [N ΣXY − ΣX ΣY] / √{[N ΣX² − (ΣX)²] × [N ΣY² − (ΣY)²]}

KARL PEARSON COEFFICIENT OF CORRELATION
r always lies between −1 and +1.
• When r = +1, the correlation is perfect and positive
• When r = −1, the correlation is perfect and negative
• When r = 0, correlation does not exist
Another formula for r is:

r² = (a ΣY + b ΣXY − n Ȳ²) / (ΣY² − n Ȳ²)
SPEARMAN'S RANK CORRELATION COEFFICIENT
• A rank is assigned to each value of the variable.
• Useful when quantitative measures cannot be fixed but the individuals in the group can be arranged in order by assigning a number indicating a rank, e.g. leadership skills, beauty.
• It is given by:

R = 1 − 6 ΣD² / [N(N² − 1)]

where D = the difference of ranks between paired items in the two series and N = the total number of paired observations.

SPEARMAN'S RANK CORRELATION COEFFICIENT
• When ranks are not given, assign ranks on the basis of value.
• When 2 or more entries are equal, each is assigned the average rank: if 3 entries tie at rank 5, each gets the average rank (5 + 6 + 7) / 3 = 6. An adjustment is then required in the formula:

R = 1 − 6[ΣD² + (1/12)(m³ − m)] / [N(N² − 1)]

where m = the number of entries having a common rank.
• If there are more such groups, an adjustment is made for each group:

R = 1 − 6[ΣD² + (1/12)(m1³ − m1) + (1/12)(m2³ − m2) + …] / [N(N² − 1)]
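The basic (no-ties) formula can be sketched in a few lines (illustrative Python; the two rankings below are hypothetical, e.g. five candidates ranked by two judges):

```python
def spearman_r(rank_x, rank_y):
    # R = 1 - 6 * sum(D^2) / (N * (N^2 - 1)), D = rank difference (no ties)
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# Hypothetical ranks of 5 candidates by two judges:
r = spearman_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])  # 0.8
```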

METHOD OF LEAST SQUARES

Correlation coefficient r = √(bxy × byx), where bxy and byx are the regression coefficients.

REGRESSION ANALYSIS
REGRESSION LINE – a line drawn through the points in such a manner as to represent the average relationship between the two variables:
Y = a + bX  (regression line of Y on X)
X = c + dY  (regression line of X on Y)

The regression equation of Y on X is obtained by minimizing the sum of squares of the errors parallel to the Y-axis. The two regression equations come from different minimizations, so they are not reversible or interchangeable.

Normal equations for the regression of Y on X:
ΣY = na + b ΣX
ΣXY = a ΣX + b ΣX²
where n = the total number of observed pairs of values. Solving these gives:

b = (ΣXY − n X̄ Ȳ) / (ΣX² − n X̄²)
a = Ȳ − b X̄

The coefficient b is called the slope coefficient or regression coefficient; it represents the incremental change in the dependent variable for a unit change in the independent variable.

Regression Coefficient

The regression coefficient of Y on X is byx = r σy / σx
Similarly, the regression coefficient of X on Y is bxy = r σx / σy
where r = the coefficient of correlation between X and Y, σx = the S.D. of X and σy = the S.D. of Y. It follows that r = √(byx × bxy).

Example
On entry to a school, a new intelligence test was given to a small group of children. The results obtained in that test and in a subsequent examination are tabulated below. Calculate the coefficient of correlation and the regression equation.

| Child Number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| IQ Score (X) | 6 | 4 | 6 | 8 | 8 | 10 | 8 | 6 |
| Exam. Score (Y) | 4 | 4 | 7 | 10 | 4 | 7 | 7 | 1 |

With x = X − X̄ and y = Y − Ȳ, where X̄ = 56/8 = 7 and Ȳ = 44/8 = 5.5:

| Child No. | X | X² | x | x² | Y | Y² | y | y² | XY | xy |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 6 | 36 | −1 | 1 | 4 | 16 | −1.5 | 2.25 | 24 | 1.5 |
| 2 | 4 | 16 | −3 | 9 | 4 | 16 | −1.5 | 2.25 | 16 | 4.5 |
| 3 | 6 | 36 | −1 | 1 | 7 | 49 | 1.5 | 2.25 | 42 | −1.5 |
| 4 | 8 | 64 | 1 | 1 | 10 | 100 | 4.5 | 20.25 | 80 | 4.5 |
| 5 | 8 | 64 | 1 | 1 | 4 | 16 | −1.5 | 2.25 | 32 | −1.5 |
| 6 | 10 | 100 | 3 | 9 | 7 | 49 | 1.5 | 2.25 | 70 | 4.5 |
| 7 | 8 | 64 | 1 | 1 | 7 | 49 | 1.5 | 2.25 | 56 | 1.5 |
| 8 | 6 | 36 | −1 | 1 | 1 | 1 | −4.5 | 20.25 | 6 | 4.5 |
| Total | 56 | 416 | 0 | 24 | 44 | 296 | 0 | 54 | 326 | 18 |

Calculations:

r = Σxy / √(Σx² × Σy²) = 18 / √(24 × 54) = 18 / 36 = 0.5

Estimation
Population parameters are unknown and have to be estimated. Types of estimates:
• Point estimate – a single number. A point estimate is useful if it is accompanied by an estimate of its error.
• Interval estimate – a range of values used to estimate a population parameter. It gives a range for the parameter and the probability of the population parameter lying within that range.

Standard Error
The standard error, also known as the standard error of the mean, is an estimate of the standard deviation of the sampling distribution of means, based on data from one or more random samples. It is designated σM:

Standard Error of the Mean = σM = σ / √n

where σ is the standard deviation of the original distribution and n is the sample size. It is used in the computation of confidence intervals and significance tests for the mean.

Interval Estimate
The systolic blood pressure of 566 males was taken. The mean BP was found to be 128 mm with an SD of 13.05 mm. Find the 95% confidence limits of BP within which the population mean would lie.

SE = s / √n = 13.05 / √566 = 0.55

The confidence limits for the population mean are (Mean − 1.96 SE) and (Mean + 1.96 SE):
128 − 1.96 × 0.55 = 126.922 and 128 + 1.96 × 0.55 = 129.078
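The interval can be verified with a short script (illustrative Python; 1.96 is the usual 95% z-value):

```python
import math

# 95% confidence interval for the mean: mean +/- 1.96 * s / sqrt(n)
mean, s, n = 128, 13.05, 566
se = s / math.sqrt(n)
low, high = mean - 1.96 * se, mean + 1.96 * se
print(round(se, 2), round(low, 2), round(high, 2))  # 0.55 126.92 129.08
```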

HYPOTHESIS TESTING – BASIC CONCEPTS
• Hypothesis testing enables us to determine the validity of a hypothesis
• Hypothesis testing is used to analyze the difference between the sample statistic and the hypothesized population parameter
• Steps in hypothesis testing:
a) Formulation of the hypothesis
b) Selection of the statistical test to be used
c) Selection of the significance level
d) Calculation of the standard error of the sample statistic, and standardization of the sample statistic
e) Determination of the critical value
f) Comparison of the value of the sample statistic with the critical value
g) Deducing the research conclusion

Formulation of Hypothesis
The Null Hypothesis assumes that the difference, if any, in the observed data is attributable to random error. The null hypothesis is denoted as Ho.

The null hypothesis consists of values relating to population parameters such as μ, σ, p. The Alternate Hypothesis is a statement opposite to that made in the null hypothesis, and it requires evidence to accept it. The alternate hypothesis is denoted as Ha or H1. A hypothesis should be developed before the sample is drawn, should be specific, and should be fit for testing.

Selection of Statistical Test to be used
Three factors influence the selection:
• Type of research question formulated – research questions based on means and proportions use the z-test or t-test; those based on frequency distributions use the Chi-square test
• Number of samples – if there are more than two samples, then tests like ANOVA and Chi-square are used
• Measurement scale used – research problems involving interval scales use the z and t tests; problems involving ordinal and nominal scales use the Chi-square test
• Other factors – sample size, and whether the value of σ is known or unknown

Level of Significance
• This is a measure of the degree of risk that a researcher might reject the null hypothesis when it is true
• 5 % is the most commonly used level
• The level of significance is the percentage of sample means that fall outside specific cut-off points
• The level of significance is set by the researcher based on factors like the cost involved for each type of error

Type I and Type II Errors

Decision      | Ho True        | Ho False
Accept Ho     | No error       | Type II error
Reject Ho     | Type I error   | No error


Calculating the Test Statistic

Test statistic = (Sample statistic − Hypothesized parameter) / Standard error of the statistic

For the mean:

z = (x̄ − μ) / σx̄   where   σx̄ = σ / √n

Use the z-distribution for sample size > 30 and the t-distribution for sample size ≤ 30. Locate the critical region, i.e. the region of the standard normal curve corresponding to the level of significance α, then describe the result and the statistical conclusion.

Determining the Critical value
Two-tailed test – used when

Ho : μ = μ0   and   H1 : μ ≠ μ0

It means the null hypothesis is rejected if the value of the sample statistic is either above or below the hypothesized population parameter. The acceptance region falls between two rejection regions.

Determining the Critical value
One-tailed test – used when

Ho : μ ≤ μ0 and H1 : μ > μ0,   or   Ho : μ ≥ μ0 and H1 : μ < μ0

It means the null hypothesis is rejected when the value of the sample statistic is higher than or lower than the hypothesized population parameter. The test can be left-tailed or right-tailed. A left-tailed test rejects the null hypothesis if the sample mean is less than the hypothesized population parameter; a right-tailed test rejects it if the sample mean is higher. The critical value depends on the significance level, the type of hypothesis test and the statistical test used.

Conclusion
• Compare the value of the sample statistic with the critical value
• Deduce the research conclusion – if the value of the standardized sample statistic falls in the acceptance region, the null hypothesis is accepted; if it falls in the rejection region, the null hypothesis is rejected

Critical Values

Critical Value (Z)      Level of Significance
                        1 %       2 %       4 %       5 %       10 %
Two-tailed test         ±2.576    ±2.326    ±2.054    ±1.960    ±1.645
Right-tailed test       +2.326    +2.054    +1.751    +1.645    +1.282
Left-tailed test        −2.326    −2.054    −1.751    −1.645    −1.282

Test of significance of mean in Large Samples (sample size > 30)

z = (x̄ − μ) / (σ / √n)

If σ is not known, then use σ² ≈ s² (the sample variance).
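The large-sample z test can be sketched as follows. The sample figures here (n = 100, mean 52, SD 10, hypothesized mean 50) are hypothetical, chosen only to illustrate the formula; `z_statistic` is an assumed helper name.

```python
import math

def z_statistic(xbar, mu0, sigma, n):
    """z = (xbar - mu0) / (sigma / sqrt(n)); substitute s when sigma is unknown."""
    return (xbar - mu0) / (sigma / math.sqrt(n))

# Hypothetical sample: n = 100, mean 52, SD 10, H0: mu = 50
z = z_statistic(52, 50, 10, 100)    # 2.0
reject_at_5pct = abs(z) > 1.96      # exceeds the two-tailed 5% critical value
```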

Tests of significance of mean for small sample, sample size < 30
• The t-distribution is used for sample sizes of less than or equal to 30

t = (x̄ − μ) / (s / √n)

Tests of significance of difference between 2 means
• For large samples (sample size > 30)
• A Z test detects whether or not the means of two samples drawn from two different sources differ significantly
• It also checks whether the difference is due to chance and whether the samples belong to the same population

Steps:
1) State the null hypothesis, H0 : μ1 = μ2
2) State the alternative hypothesis, H1 : μ1 ≠ μ2
3) Compute the estimated standard error:

σx̄1−x̄2 = √(σ1²/n1 + σ2²/n2)

4) Compute the test statistic:

Z = [(X̄1 − X̄2) − (μ1 − μ2)H0] / σx̄1−x̄2
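The steps above can be sketched in Python. The group figures are hypothetical illustrations, and `two_sample_z` is an assumed helper name; under H0 the term (μ1 − μ2) is zero.

```python
import math

def two_sample_z(x1bar, x2bar, s1, s2, n1, n2):
    """Z for the difference of two large-sample means under H0: mu1 == mu2."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)   # estimated standard error
    return (x1bar - x2bar) / se

# Hypothetical groups: (mean 75, SD 3, n 9) vs (mean 72, SD 4, n 16)
z = two_sample_z(75, 72, 3, 4, 9, 16)   # 3 / sqrt(2) ~ 2.12
```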

Note:
• It is generally assumed that the standard deviations of the populations are known.
• If they are unknown, then the values of the corresponding sample standard deviations S1 and S2 are used.

For small size samples, n < 30
• In this case the t-distribution is used instead of the normal distribution
• Steps:
– State the null hypothesis
– State the alternative hypothesis
– Compute the pooled estimate of σ²:

Sp² = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2)

(S1 and S2 are sample standard deviations)
– Compute the standard error:

σX̄1−X̄2 = Sp √(1/n1 + 1/n2)

– Compute the test statistic:

t = [(X̄1 − X̄2) − (μ1 − μ2)H0] / σX̄1−X̄2

-compare computed value of t with critical value of t
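The pooled small-sample procedure can be sketched as follows; the two groups here are hypothetical, and `pooled_t` is an assumed helper name.

```python
import math

def pooled_t(x1bar, x2bar, s1, s2, n1, n2):
    """Two-sample t with pooled variance:
    Sp^2 = ((n1-1)*s1^2 + (n2-1)*s2^2) / (n1 + n2 - 2)."""
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))   # standard error of the difference
    return (x1bar - x2bar) / se

# Hypothetical: two groups of 10, means 12 and 10, both SDs 2
t = pooled_t(12, 10, 2, 2, 10, 10)   # ~2.236 with n1 + n2 - 2 = 18 df
```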

Test of significance for proportions
• The difference between the sample proportion and the hypothesized proportion is standardized, and the normal distribution is used for the test.

Standard error of proportion: σp = √(pH0 · qH0 / n)

Z = (p̂ − pH0) / σp
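A minimal sketch of the one-proportion test, assuming hypothetical figures (60 successes in 100 trials against H0: p = 0.5); `proportion_z` is an assumed helper name.

```python
import math

def proportion_z(p_hat, p0, n):
    """Z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)."""
    se = math.sqrt(p0 * (1 - p0) / n)   # standard error under H0
    return (p_hat - p0) / se

# Hypothetical: 60 successes in 100 trials, H0: p = 0.5
z = proportion_z(0.60, 0.50, 100)   # 2.0
```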

Comparison of two sample Proportions
Test of significance for the comparison of two sample proportions – the proportion of the population from which the two samples are drawn is estimated by

p̄ = (n1 p1 + n2 p2) / (n1 + n2)

and the standard error of the difference between the proportions is given by

Sp1−p2 = √(p̄q̄/n1 + p̄q̄/n2)

Z = Difference in proportions / Standard Error
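The two-proportion comparison can be sketched as follows; the sample figures are hypothetical, and `two_proportion_z` is an assumed helper name.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Z for two sample proportions using the pooled proportion."""
    p = (n1 * p1 + n2 * p2) / (n1 + n2)        # pooled proportion
    q = 1 - p
    se = math.sqrt(p * q / n1 + p * q / n2)    # SE of the difference
    return (p1 - p2) / se

# Hypothetical: 60/100 in sample 1 vs 50/100 in sample 2
z = two_proportion_z(0.60, 100, 0.50, 100)   # ~1.42
```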

Paired Samples (Dependent Samples)
For example, the sugar level of the same sample of patients before and after a drug is administered. The paired-sample t-test is widely used in such situations:

t = (D̄ − d) / (SD / √n)

where D̄ is the mean difference of the samples, d is the hypothesized value of the difference, SD is the standard deviation of the differences and n is the sample size.

SD² = [Σᵢ₌₁ⁿ Dᵢ² − n D̄²] / (n − 1)
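A sketch of the paired t-test on the per-subject differences; the before/after readings are hypothetical, and `paired_t` is an assumed helper name.

```python
import math

def paired_t(before, after, d0=0.0):
    """t = (Dbar - d0) / (S_D / sqrt(n)) on per-subject differences."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    dbar = sum(diffs) / n
    var = sum((d - dbar) ** 2 for d in diffs) / (n - 1)   # S_D^2
    return (dbar - d0) / math.sqrt(var / n)

# Hypothetical sugar levels before and after a drug, same 4 patients
t = paired_t([10, 12, 13, 12], [9, 10, 10, 10])   # ~4.90 with n - 1 = 3 df
```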

ANALYSIS OF VARIANCE (ANOVA)
• If the problem requires comparison of the means of more than two populations, we can use ANOVA
• It tests whether there is any significant difference between the means of the various samples
• It measures the variability of the data points within the samples and also the variance between the sample means. These two variations are compared using the F-test
• If the value of the F statistic is large, we can conclude that there is a significant difference between the means of the samples
• Two types of ANOVA – a) one-factor b) two-factor
• One-factor ANOVA is used for problems that involve evaluating differences in the mean of the dependent variable across the categories of a single independent variable

Steps in ANOVA
1. Formulate the hypothesis
2. Obtain the mean of each sample
3. Find the mean of all samples, i.e. the grand mean
4. Calculate the variation between samples, denoted SSbetween
5. Obtain the mean square of the variation between the samples, MSbetween, from SSbetween
6. Calculate the variation within the samples, denoted SSwithin
7. Obtain the mean square of the variation within the samples, MSwithin, from SSwithin
8. Calculate the total variance: SSy = SSbetween + SSwithin
9. Calculate the F ratio = MSbetween / MSwithin
10. Compare the F ratio with the critical value
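The steps above can be sketched as a short one-factor ANOVA function. The three samples are hypothetical illustrations, and `one_way_anova_f` is an assumed helper name.

```python
def one_way_anova_f(groups):
    """F = MS_between / MS_within, following ANOVA steps 1-10 above."""
    k = len(groups)                                # number of samples
    n = sum(len(g) for g in groups)                # total observations
    grand = sum(sum(g) for g in groups) / n        # grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical: three samples of three observations each
f = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])   # 3.0
```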

Chi-square test
• To evaluate the statistical significance of an association among variables involved in a cross-tabulation, the Chi-square test is used
• The Chi-square test is used in two ways –
• Test of independence – used to evaluate the association between two variables, and
• Test of goodness of fit – used to identify whether there is any significant difference between the observed frequencies and the expected frequencies

Chi-Square Test – General aspects
• The Chi-square test can be performed on actual numbers but not on percentages or proportions. Percentages or proportions should be converted to actual numbers
• The expected frequencies of all cells should be more than five. If any are less than five, some of the rows or columns should be combined to make the new frequencies greater than five
• The Chi-square test works only when the sample size is large enough, usually more than 50
• The observations drawn should be random and independent

Chi-Square test – Goodness of Fit
If the value of Chi-square is less than the tabulated value corresponding to the level of significance, then we deduce that the observed data correspond to the expected data.

χ² = Σᵢ₌₁ⁿ (Oᵢ − Eᵢ)² / Eᵢ

Chi-Square Test – Test of Independence
Goodness of fit involves only one variable. To evaluate the relationship between two or more variables, the chi-square test of independence is used. This is useful while analyzing cross-tabulations. In this case

χ² = Σᵢ₌₁ⁿ Σⱼ₌₁ᵏ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ

Example: Channel Viewership Distribution according to Age Group

Age Group            Channel A   Channel B   Channel C   Total
15 – 25 age group    20 (32)     30 (28)     30 (20)      80
25 – 45 age group    80 (80)     70 (70)     50 (50)     200
45 years and above   60 (48)     40 (42)     20 (30)     120
Total               160         140         100          400

Figures in brackets are expected frequencies.

• Formulation of the Null Hypothesis
• Ho : There is no association between age group and channel viewership
• H1 : There is significant association between age group and channel viewership
• Calculate the expected values: Eᵢⱼ = (nᵢ × nⱼ) / n
• where nᵢ = row total, nⱼ = column total, n = total sample size

Calculate the Chi-square value –

χ² = Σᵢ₌₁ⁿ Σⱼ₌₁ᵏ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ ≈ 16.07

Decide the level of significance and degrees of freedom –
Degrees of freedom v = (r − 1)(c − 1) = (3 − 1) × (3 − 1) = 2 × 2 = 4
Let us set the level of significance at 1 %
Determining the critical value –
From the Chi-square table, the critical value for the 1 % significance level and 4 degrees of freedom is 13.28
Deduce the research conclusion –
As the calculated value is greater than the critical value, the null hypothesis is rejected
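The chi-square value for the channel-viewership table can be reproduced with a short function; `chi_square_independence` is an assumed helper name, and the observed counts are taken from the example table above.

```python
def chi_square_independence(observed):
    """Chi-square test of independence with E_ij = (row total * col total) / n."""
    row_tot = [sum(r) for r in observed]
    col_tot = [sum(c) for c in zip(*observed)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / n   # expected frequency
            chi2 += (o - e) ** 2 / e
    return chi2

# Channel-viewership table from the example (rows = age groups)
table = [[20, 30, 30], [80, 70, 50], [60, 40, 20]]
chi2 = chi_square_independence(table)   # ~16.07, which exceeds 13.28 at 1%
```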

Strength of Association
Strength of association – the test of independence does not describe the strength or magnitude of the association. It is evaluated using the phi coefficient and the coefficient of contingency.

Phi coefficient (φ) – suitable only for a 2 × 2 table:

φ = √(χ² / n)

Coefficient of contingency (C) – can be used for a table of any size:

C = √(χ² / (χ² + n))

The coefficient varies from 0 to 1; 0 indicates no association and 1 indicates maximum strength.
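Both strength-of-association measures are one-line computations. This sketch applies the contingency coefficient to the channel example (χ² ≈ 16.07, n = 400); `phi_coefficient` and `contingency_c` are assumed helper names.

```python
import math

def phi_coefficient(chi2, n):
    """phi = sqrt(chi2 / n); suitable for 2 x 2 tables only."""
    return math.sqrt(chi2 / n)

def contingency_c(chi2, n):
    """C = sqrt(chi2 / (chi2 + n)); usable for a table of any size."""
    return math.sqrt(chi2 / (chi2 + n))

# Channel example: chi-square ~16.07 with n = 400
c = contingency_c(16.07, 400)   # ~0.20, a fairly weak association
```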

```