VIEWS: 646 PAGES: 122 CATEGORY: Research POSTED ON: 10/31/2009
Clinical data management
------------------------

WHAT IS STATISTICS
Statistics in the plural sense means numerical facts about objects.
Statistics in the singular sense means the science of collection, organization, analysis and interpretation of numerical facts.

Characteristics of Statistics
• Aggregate of facts – a collection of facts. Facts can be analyzed statistically only when there is more than one.
• Affected to a marked extent by a multiplicity of causes.
• Numerically expressed – only numerical facts can be statistically analyzed.
• Enumerated or estimated according to a reasonable standard of accuracy.
• Collected in a systematic manner.
• Collected for a pre-determined purpose.
• Placed in relation to each other.

Branches and Scope
Branches of Statistics
1. Statistical Methods
2. Applied Statistics – Biometry, Demography, Econometrics, Statistical Quality Control, Psychometry

Scope and Application of Statistics
• Biology
• Agriculture
• Medicine
• Business
• Economics
• Commerce

Limitations of Statistics
• Does not deal with qualitative data.
• Does not deal with an individual fact.
• Statistical inferences are not exact; they are probabilistic statements.
• Statistics can be misused.
• People without training cannot handle statistics properly.

Statistics In Clinical Research
• To find the action of a drug
• To compare the action of two different drugs
• To find the relative potency of a new drug with reference to a standard drug
• To compare the efficacy of a particular drug or treatment
• To find an association between two attributes, such as cancer and smoking

Utility of Computers
• For collection, compilation and tabulation of data
• Finding out various statistical measures
• Hypothesis testing
• Packages like SPSS are available

Some Basic Definitions
• Units / Individuals / Elements – the objects whose characteristics we study.
• Population / Universe – the collection of all units.
• Finite Population – contains a finite number of units.
• Infinite Population – contains an infinite number of units.
• Quantitative Characteristic – numerically measurable.
• Qualitative Characteristic – not numerically measurable.
• Variable – a quantitative characteristic which varies from unit to unit.
• Attribute – a qualitative characteristic which varies from unit to unit.
• Discrete Variable – assumes only certain specified values in a range.
• Continuous Variable – assumes all the values in the given range.

Classification and Tabulation
Units having common characteristics are grouped together.

Functions of Classification
• Reduces the bulk of data
• Simplifies the data
• Facilitates comparison of characteristics
• Renders data ready for statistical analysis

Types of classification –
• Quantitative (with regard to a variable)
• Qualitative (with regard to an attribute)
• Spatial (Geographical)
• Temporal (Chronological)
• Classification of units on the basis of a characteristic into two classes is called Dichotomy (Men / Women).

Summarization of Data
Frequency Distribution
• Frequency is the number of units associated with each value of the variable.
• A Frequency Distribution is a systematic presentation of the values taken by the variable and the corresponding frequencies.
• Values may be discrete or continuous.
• If the number of distinct values is large, the range of the variable is divided into mutually exclusive sub-ranges called class intervals.
• Each class interval has a Lower Class Limit and an Upper Class Limit.
• Width of class – the difference between the class limits.
• Class Mark or Class Mid-value – the central value of the class interval.
• Continuous Frequency Distribution
• Discrete Frequency Distribution
• Inclusive Class Interval – both the lower and upper limits are included in the same class interval.
• Exclusive Class Interval – the lower class limit is included in the class interval, and the upper class limit is included in the succeeding class interval.
• While analyzing a Frequency Distribution, inclusive class intervals should be converted into exclusive class intervals.
(0–9, 10–19, 20–29 will become -0.5 to 9.5, 9.5 to 19.5, 19.5 to 29.5)
• The values -0.5, 9.5, etc. are called Class Boundaries.

Graphic Representation of Frequency Distribution
Histogram
• On the X-axis, class limits / class marks are marked.
• On the Y-axis, class frequencies are marked.
• A rectangular bar is drawn for each class interval and its frequency.
• For unequal class intervals, the Y-axis measures Frequency Density and not Class Frequency.
• So if one class interval is three times as wide as the others, the height of its bar is its frequency divided by 3.

Frequency Distribution
• Open End Class – when a class interval at an extremity lacks one limit.
• Univariate Frequency Distribution – a single variable
• Bivariate Frequency Distribution – two variables
• Multivariate Frequency Distribution – more than one variable
• Frequency density of the class = Frequency of the class / Width of the class

Frequency Polygon
• Mark dots at the mid-point of the top of each rectangle of the histogram.
• Join these points by straight lines.
• The polygon thus formed is closed by joining it, at each end, to the mid-point on the X-axis of the next outlying interval with zero frequency.
• It can be drawn without drawing the histogram, by only marking the points.

[Figure: No. of patients (Y-axis, 0 to 100) by quarter (1st Qtr to 4th Qtr)]

Cumulative Frequency Curve or Ogive
• Less than type or More than type.
• The Y-axis represents cumulative frequency.
• The X-axis is labeled with the upper class limits for a Less than Ogive and with the lower class limits for a More than Ogive.
• The cumulative curve is quick to interpret.
• The point of intersection of the two curves gives the Median.
• Two sets of Ogives can be compared on a percentage basis.

Generally, in a Frequency Distribution, values cluster around a central value. This is called Central Tendency.
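The frequency-distribution bookkeeping described in the slides above (class boundaries, frequency density, less-than cumulative frequencies) can be sketched in a few lines of Python. This is only an illustration: the function names are assumptions, and the class frequencies in the usage note are hypothetical, not data from the slides.

```python
def class_boundaries(inclusive_limits, gap=1):
    """Convert inclusive class limits, e.g. (0, 9), (10, 19), into
    exclusive class boundaries, e.g. (-0.5, 9.5), (9.5, 19.5).
    `gap` is the distance between one upper limit and the next lower limit."""
    half = gap / 2
    return [(lower - half, upper + half) for lower, upper in inclusive_limits]


def frequency_density(boundaries, freqs):
    """Frequency density = class frequency / class width (histogram bar height)."""
    return [f / (upper - lower) for (lower, upper), f in zip(boundaries, freqs)]


def less_than_cumulative(freqs):
    """Running totals plotted against upper boundaries in a less-than ogive."""
    total, out = 0, []
    for f in freqs:
        total += f
        out.append(total)
    return out
```

For example, with the limits 0–9, 10–19, 20–29 and hypothetical frequencies 10, 20, 30, `class_boundaries` returns the boundaries quoted in the slide, and `less_than_cumulative` gives the heights of the less-than ogive.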
The central value around which the observations concentrate is called a Measure of Central Tendency, or average. Averaging is done to arrive at a single value representing the entire data.

Objectives of Averaging –
• To find one value that represents the whole data
• To enable comparison
• To establish relationships
• To derive inferences about a universe from a sample

MEASURE OF CENTRAL TENDENCY
The measures of central tendency are –
• Mathematical Averages – Arithmetic Mean, Geometric Mean, Harmonic Mean
• Positional Averages – Median, Mode
• Arithmetic Mean, Median and Mode are the most widely used.

Arithmetic Mean
When the mean is calculated for the entire population, it is the population Arithmetic Mean ($\mu$), and N is the number of observations in the population:
$\mu = \frac{\sum x}{N}$

Calculation of Mean from Grouped Data (Frequency Distribution)
• Required when the number of observations is large
• This is an estimate of the value of the Mean
• Not as accurate as the value obtained from all observations

Steps in calculation of Mean from Grouped Data
Mid-point (Class-mark) = x = (Lower Limit + Upper Limit) / 2
$\bar{X} = \frac{\sum (f \cdot x)}{n}$
where f = number of observations in each class and $n = \sum f$.

Example
Calculate the mean weight of the population.

Wt in kg   Frequency (f)   Class-mark (X)   f*X
60-61      10              60.5             605
61-62      20              61.5             1230
62-63      45              62.5             2812.5
63-64      50              63.5             3175
64-65      60              64.5             3870
65-66      40              65.5             2620
66-67      15              66.5             997.5
Total      240                              15310

$\bar{X} = \frac{\sum fX}{\sum f} = \frac{15310}{240} = 63.79$ kg

Weighted Arithmetic Mean
Considers the relative importance of each value. Ex. labour rate for a product using three classes of labour.
$\bar{X}_w = \frac{\sum w X}{\sum w}$
where w = weight allocated and $\sum w$ = sum of all weights.

Example
• Calculate the fatality rate in smallpox from the age-wise fatality rates given below.

Age group in years   No. of smallpox cases   Fatality rate per cent
0–1                  150                     35.33
2–4                  304                     21.38
5–9                  421                     16.86
Above 9              170                     14.17

This is given by Weighted Arithmetic mean.
WAM $= \frac{150 \times 35.33 + 304 \times 21.38 + 421 \times 16.86 + 170 \times 14.17}{150 + 304 + 421 + 170} = 20.39$

GEOMETRIC MEAN
Geometric Mean $= \sqrt[n]{\text{product of all } n \text{ values}}$
More applicable in calculating growth rate over years.
Growth Factor = 1 + Growth Rate/100, and Geometric Mean = Average Growth Factor.

MEDIAN
This is the middle value of the series when arranged in order of magnitude. The Median establishes a dividing line between the 50% of higher values and the 50% of lower values. In case of an even number of terms, the Median is the average of the two middle terms.
If the number of terms, n, is odd, then the Median is the value of the $\frac{n+1}{2}$th term.
If the number of terms is even, i.e. 2n, then the Median is the average of the nth and (n+1)th terms.
This is applicable also for a simple frequency distribution of a discrete random variable x.

Median for Grouped Data
Locate the class in which the Median lies.
Median $= L_m + \left[\frac{N+1}{2} - (F + 1)\right] \times \frac{w}{F_m}$
where
$L_m$ = lower limit of the Median class
w = width of the class interval
F = cumulative frequency up to the lower limit of the Median class
$F_m$ = frequency of the Median class
N = total frequency

MODE
The Mode is the value of the variable which occurs most frequently. For ungrouped data, find the value that occurs most frequently. For grouped data, the Mode is located in the class with maximum frequency:
Mode $= M_o = L_{Mo} + \frac{d_1}{d_1 + d_2} \times w$
where
$L_{Mo}$ = lower limit of the modal class
$d_1$ = frequency of the modal class – frequency of the class preceding the modal class
$d_2$ = frequency of the modal class – frequency of the class succeeding the modal class
w = width of the modal class

Percentile
• Percentiles are values of the variable which divide the ordered observations into two parts expressed as percentages, such as 10% and 90%.
• They can be used for comparing one percentile value of two samples / populations.

MEASURE OF DISPERSION
One more characteristic of a dataset is how it is distributed:
how far each element is from the Measure of Central Tendency.
The measures of this Dispersion are
• RANGE
• INTER-QUARTILE RANGE
• QUARTILE DEVIATION
• MEAN DEVIATION
• VARIANCE
• STANDARD DEVIATION

RANGE
Range is the difference between the values of the smallest and the largest observations present in the distribution.
RANGE = L – S, where L = Largest Value and S = Smallest Value.
For grouped data, RANGE = Upper Limit of Highest Class – Lower Limit of Lowest Class.
Co-efficient of Range – The range of weight in kg and the range of height in cm are not comparable. To allow comparison, a relative measure of range called the Co-efficient of Range is defined as
Co-efficient of Range $= \frac{L - S}{L + S}$

INTER-QUARTILE RANGE
• Inter-quartile Range is the range calculated on the middle 50% of the observations.
• INTER-QUARTILE RANGE $= Q_3 - Q_1$
• $Q_1$, $Q_2$, $Q_3$ are the highest values in each of the first three quarters of the ordered data.

QUARTILE DEVIATION
• QUARTILE DEVIATION $= (Q_3 - Q_1)/2$
• Co-efficient of Quartile Deviation $= \frac{Q_3 - Q_1}{Q_3 + Q_1}$
• Lower Quartile $Q_1$ = value of the $\frac{N+1}{4}$th observation.
• Upper Quartile $Q_3$ = value of the $\frac{3(N+1)}{4}$th observation.
For grouped data,
$Q_1 = L_1 + \frac{\frac{N}{4} - c}{f} \times h$
$Q_3 = L_3 + \frac{\frac{3N}{4} - c}{f} \times h$
where
$L_1$ = lower boundary of the first quartile class
$L_3$ = lower boundary of the third quartile class
N = total cumulative frequency
f = frequency of the quartile class
h = class interval (width)
c = cumulative frequency of the class preceding the quartile class (the row just above it in the table)

MEAN DEVIATION
This is the mean of the absolute deviations of each observation from the Mean.
Absolute Mean Deviation $= \frac{\sum |x - \mu|}{N}$ for a population, and
Absolute Mean Deviation $= \frac{\sum |x - \bar{x}|}{n}$ for a sample,
where
x = value of observation
$\mu$ = the mean of the population
N = number of observations in the population
$\bar{x}$ = sample mean
n = number of observations in the sample

For grouped data, Mean Deviation M.D.
(about the mean $\bar{x}$) $= \frac{1}{N} \sum f\,|X - \bar{x}| = \frac{1}{N} \sum f\,|d|$
where X = mid-value of the class interval, f = corresponding frequency, d = deviation of X from the mean.

Merits and De-merits of absolute Mean Deviation –
• Simple and easy
• More comprehensive, as it depends on all observations
• True measure, as it averages all deviations
But
• Less reliable, as it ignores sign
• Not conducive to algebraic operations
• Not useful for open end classes

VARIANCE
Here deviations are squared to make them positive.
Variance $= \sigma^2 = \frac{\sum (x - \bar{x})^2}{N} = \frac{\sum x^2}{N} - \bar{x}^2$
For grouped data,
$\sigma^2 = \frac{\sum f_i (X_i - \bar{x})^2}{N} = \frac{\sum f X^2}{N} - \bar{x}^2$
where $f_i$ = frequency of the class and $X_i$ = value of the class mark.
These formulas are for a population. For a sample,
variance $= s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{\sum X^2}{n - 1} - \frac{n \bar{X}^2}{n - 1}$

STANDARD DEVIATION
S.D. $= \sigma = \sqrt{\text{Variance}}$
Properties of standard deviation –
• S.D. is independent of change of origin, i.e. if all the observation values are increased / decreased by a constant quantity, S.D. does not change.
• S.D. is dependent on change of scale, i.e. if each observation value is multiplied / divided by a constant quantity, S.D. is similarly affected.

Combined S.D. of two or more groups
$\sigma_{12} = \sqrt{\frac{n_1 \sigma_1^2 + n_2 \sigma_2^2 + n_1 d_1^2 + n_2 d_2^2}{n_1 + n_2}}$
where $d_1 = \bar{x}_1 - \bar{x}$, $d_2 = \bar{x}_2 - \bar{x}$ and $\bar{x} = (n_1 \bar{x}_1 + n_2 \bar{x}_2) / (n_1 + n_2)$.

Co-efficient of variation = S.D. / Mean. This is generally expressed as a percentage.

Example
• Calculate the mean and standard deviation of the IQ of 50 boys from the data given.

IQ        Freq. (f)   Class-mark (X)   f*X    f*X^2
0-20      3           10               30     300
20-40     4           30               120    3600
40-60     3           50               150    7500
60-80     4           70               280    19600
80-100    13          90               1170   105300
100-120   12          110              1320   145200
120-140   8           130              1040   135200
140-160   3           150              450    67500
Total     50                           4560   484200

$\bar{X} = \frac{\sum fX}{\sum f} = \frac{4560}{50} = 91.2$
$\sigma = \sqrt{\frac{\sum fX^2 - n\bar{X}^2}{n}} = \sqrt{\frac{484200 - 50 \times (91.2)^2}{50}} = \sqrt{\frac{68328}{50}} = 36.97$

SKEWNESS
The Measure of Central Tendency and the Measure of Dispersion are characteristics of a Frequency Distribution. The third important characteristic of a Frequency Distribution is its shape.
A Frequency Distribution is said to be Symmetrical when the values of the variable equidistant from the mean have equal frequencies.
When an F.D.
is not symmetrical, it is said to be Asymmetrical or Skewed. Any deviation from symmetry is called Skewness. Skewness may be positive or negative.
• Positively skewed – the frequency curve has a longer tail towards the higher values of X.
• In a positively skewed distribution, the Mode is the minimum and the Mean is the maximum out of Mean, Median and Mode.
• Negatively skewed – the frequency curve has a longer tail towards the lower values of X.
• In a negatively skewed distribution, the Mean is the minimum and the Mode is the maximum.

Bienayme Chebyshev's Rule
It states that whatever the shape of the distribution, at least 75% of the values in the population will fall within ±2 standard deviations of the mean, and at least 89% will fall within ±3 standard deviations of the mean.
The rule states that the percentage of observations lying within ±k standard deviations of the mean is at least $(1 - 1/k^2) \times 100$.
In the case of a symmetrical bell-shaped distribution, we can say that
• approximately 68% of the observations in the population fall within ±1 s.d. of the mean
• approximately 95% of the observations in the population fall within ±2 s.d. of the mean
• approximately 99.7% of the observations in the population fall within ±3 s.d. of the mean

NORMAL DISTRIBUTION
This is a probability distribution of a continuous random variable. It reflects the values taken by many real-life variables like height and weight. A large number of observations are clustered around the mean value, and their frequency drops as we move away from the mean.
If samples of size n (n > 30) are drawn from any population, then the sample means will be approximately normally distributed with a mean equal to $\mu$ (the population mean).

Characteristics of Normal Distribution Curve
• The curve has a single peak (uni-modal).
• The Mean lies at the centre.
• Because of symmetry, the Median and Mode are also at the centre.
• The two tails extend indefinitely and never touch the X-axis.
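The two coverage statements above, Chebyshev's lower bound and the bell-shaped percentages, can be checked numerically. The following is a minimal sketch using Python's standard library; the function names are illustrative assumptions, not part of the original slides.

```python
from statistics import NormalDist


def chebyshev_lower_bound(k):
    """Minimum percentage of values within +/- k standard deviations of the
    mean for any distribution: (1 - 1/k^2) * 100."""
    if k <= 1:
        raise ValueError("the bound is informative only for k > 1")
    return (1 - 1 / k**2) * 100


def normal_coverage(k):
    """Percentage of a normal distribution within +/- k standard deviations."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return (std_normal.cdf(k) - std_normal.cdf(-k)) * 100


for k in (2, 3):
    print(k, round(chebyshev_lower_bound(k), 1), round(normal_coverage(k), 1))
```

The Chebyshev figure is a guarantee for any shape of distribution, so it is always lower than the normal-curve figure for the same k.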
NORMAL DISTRIBUTION
To define a particular normal distribution we need only two parameters, the Mean ($\mu$) and the Standard Deviation ($\sigma$).
If $\sigma$ is the standard deviation, then
• 68% of observations lie in the range $\mu \pm 1\sigma$
• 95.5% of observations lie in the range $\mu \pm 2\sigma$
• 99.7% of observations lie in the range $\mu \pm 3\sigma$

The Standard Normal Distribution
This is a normal distribution with mean 0 and standard deviation 1. The observation values in the Standard Normal Distribution are denoted by Z.

Area under the standard normal curve between 0 and z

z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
0.1   0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
0.2   0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1026  0.1064  0.1103  0.1141
0.3   0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
0.4   0.1554  0.1591  0.1628  0.1664  0.1700  0.1736  0.1772  0.1808  0.1844  0.1879
0.5   0.1915  0.1950  0.1985  0.2019  0.2054  0.2088  0.2123  0.2157  0.2190  0.2224
0.6   0.2257  0.2291  0.2324  0.2357  0.2389  0.2422  0.2454  0.2486  0.2517  0.2549
0.7   0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852
0.8   0.2881  0.2910  0.2939  0.2967  0.2995  0.3023  0.3051  0.3078  0.3106  0.3133
0.9   0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
1.0   0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
1.1   0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
1.2   0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015
1.3   0.4032  0.4049  0.4066  0.4082  0.4099  0.4115  0.4131  0.4147  0.4162  0.4177
1.4   0.4192  0.4207  0.4222  0.4236  0.4251  0.4265  0.4279  0.4292  0.4306  0.4319
1.5   0.4332  0.4345  0.4357  0.4370  0.4382  0.4394  0.4406  0.4418  0.4429  0.4441
1.6   0.4452  0.4463  0.4474  0.4484  0.4495  0.4505  0.4515  0.4525  0.4535  0.4545
1.7   0.4554  0.4564  0.4573  0.4582  0.4591  0.4599  0.4608  0.4616  0.4625  0.4633
1.8   0.4641  0.4649  0.4656  0.4664  0.4671  0.4678  0.4686  0.4693  0.4699  0.4706
1.9   0.4713  0.4719  0.4726  0.4732  0.4738  0.4744  0.4750  0.4756  0.4761  0.4767
2.0   0.4772  0.4778  0.4783  0.4788  0.4793  0.4798  0.4803  0.4808  0.4812  0.4817
2.1   0.4821  0.4826  0.4830  0.4834  0.4838  0.4842  0.4846  0.4850  0.4854  0.4857

Standardizing Normal Variable
Suppose we have a Normal population with variable X. We can convert X to the Standard Normal Variable Z by using the formula
$Z = \frac{X - \mu}{\sigma}$

Example
Q. – The average weight of a baby at birth is 3.05 kg, with an SD of 0.39 kg. If birth weights are normally distributed, would you regard a weight of 4 kg as abnormal and a weight of 2.5 kg as normal?
A. – The normal limits of weight with 95% probability would be 3.05 – 1.96 × 0.39 and 3.05 + 1.96 × 0.39, i.e. about 2.29 kg and 3.81 kg. So 4 kg is abnormal and 2.5 kg is normal.

SAMPLING AND SAMPLE SIZE CALCULATION
In statistical analysis we infer about the universe or population
• either from data about the entire universe
• or from statistics of samples

Census Enumeration and Sample Survey
Census Enumeration – collecting data from each and every unit of the population.
• Costly, time-consuming, labour-intensive
• But gives accurate results, free of sampling errors
• Does not require deep statistical analysis
• Gives direction for further studies
• Cannot be used for destructive testing
Sample survey – cheaper, consumes less time and labour.
The selected part of the population which is used to ascertain the characteristics of the population is called a Sample. Sampling is done to obtain information about the whole.

Types of Sampling –
• Random or Probability Sampling
• Non-Random or Judgment Sampling

Random or Probability Sampling
The procedure is rigorous and time-consuming, but it is possible to quantify the magnitude of the likely error in the inferences made.

SAMPLING TECHNIQUES
Simple Random Sampling –
• Assigns an equal probability to each unit of being included in the sample
• Can be with or without replacement
• Requires a list of all members of the population
• Works well for a relatively small population
• Selection by chits / random numbers / computer simulation

Stratified Random Sampling –
• When considerable heterogeneity is present in the population, it is divided into segments or strata, and a sample is selected from each stratum.
• Strata are defined so that they are mutually exclusive and collectively exhaustive.
• Stratification should be such that each stratum is homogeneous within itself but the strata are heterogeneous between them.
• The sample size of each stratum is allocated on three points –
  – Stratum size: total number of units in the stratum
  – Variability within the stratum
  – Cost of observation per unit in the stratum
• Proportionate Allocation: $n_i = n N_i / N$

Cluster Sampling –
• Any method of sampling where a group is taken as the sampling unit
• The population is divided into well-defined groups or clusters and a few of them are selected. All the units of the selected clusters are studied, e.g. households in rural areas.
• Advantages – cost saving, better supervision, list of units is not necessary, better co-operation of units
• But statistically less efficient

Systematic Sampling –
• The first unit is selected randomly from the first k units. Then every kth unit is included in the sample.
• Advantages – simplicity in selection, operational convenience, even spread, list of units not necessary
• But hidden periodicity will lead to inefficient sampling.
• Most commonly used method

Multi-stage Sampling –
• Used for large-scale enquiries covering a large geographical area

Non-Random or Non-probabilistic Sampling –
In this case sampling error is not measurable. Three types –
• Convenience Sampling – may result in biased opinion. Useful to generate hypotheses.
• Judgment Sampling – the investigator draws the sample from units judged to be representative of the population. Should be carried out by an expert in the field.
• Quota Sampling – each field worker is assigned a quota. Advantages – lower cost, field worker has a free hand. But accuracy is doubtful.
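Two of the probability schemes described earlier in these slides, systematic sampling and proportionate allocation across strata ($n_i = n N_i / N$), are mechanical enough to sketch in code. The sketch below is illustrative only: the function names are assumptions, and the population and stratum sizes in the usage note are hypothetical.

```python
import random


def systematic_sample(units, k, rng=None):
    """Pick a random start among the first k units, then take every kth unit."""
    rng = rng or random.Random()
    start = rng.randrange(k)  # index 0 .. k-1
    return units[start::k]


def proportionate_allocation(stratum_sizes, n):
    """Allocate a total sample of size n across strata: n_i = n * N_i / N."""
    total = sum(stratum_sizes)
    return [n * size / total for size in stratum_sizes]
```

For instance, with hypothetical strata of 500, 300 and 200 units and a total sample of 100, `proportionate_allocation([500, 300, 200], 100)` allocates 50, 30 and 20 units. In practice the allocations would need rounding to whole units.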
Snowball (Sequential) Sampling – the size of the sample is decided as sampling takes place, depending on the results of the earlier samples.

Sample size depends upon
• Type of problem investigated
• Extent of variability in the population
• Confidence level required (the larger the sample size, the higher the confidence level)
• Precision required
• Budget and cost
• Resources available

SAMPLE SIZE AND POWER
Determining Sample size for problems involving Mean
Influenced by consideration of
- Standard deviation
- Acceptable level of sampling error
- Expected confidence level
In sampling, Precision $= \sqrt{n}/s$.

Formula for Sample size
It is given as
$n = \left(\frac{ZS}{E}\right)^2$, where $E = Z S_{\bar{X}}$ and the Standard Error of the Mean is $S_{\bar{X}} = \frac{S}{\sqrt{n}}$
Z = standardized value corresponding to a confidence level
S = sample standard deviation or an estimate of the population standard deviation
E = acceptable level of error, plus or minus (E is one half of the total width of the confidence interval)

Example
• The mean pulse rate of a population is believed to be 70 per minute with a standard deviation of 8 beats. Calculate the minimum size of the sample needed to verify this with an allowable error of ±1 beat at 5% risk.
• Z = 1.96, E = 1, S = 8
• $n = \left(\frac{ZS}{E}\right)^2 = \frac{(1.96 \times 8)^2}{1 \times 1} = 245.86 \approx 246$

Sample size
Where the sample size is more than 5 per cent of a finite (small) population, the sample size may be overestimated. Therefore a finite population correction factor, represented by $\sqrt{\frac{N - n}{N - 1}}$, is applied.

Determining Sample size for problems involving Proportions
Influenced by consideration of
- Estimate of the population proportion
- Acceptable level of sampling error
- Expected confidence level

The formula for determining sample size
• It is given by $n = \frac{Z_{c.i.}^2 \, p q}{E^2}$, where
n = sample size
$Z_{c.i.}^2$ = square of the confidence level in standard error units
p = estimated proportion of successes
q = estimated proportion of failures
$E^2$ = square of the maximum allowance for error between the true proportion and the sample proportion

Example
Q.
- The incidence rate in the last influenza epidemic was found to be 50 per thousand (5%) of the population exposed. What should be the size of the sample to find the incidence rate in the current epidemic, if the allowable error is 10% of that rate?
A. – p = 5%; q = (100 – 5 =) 95%; E = 10% of p = 0.1 × 5% = 0.5%
$n = \frac{1.96^2 \, p q}{E^2} = \frac{1.96^2 \times 5 \times 95}{0.5^2} \approx 7299$

Randomization
• This is a method to control extraneous variables.
• It means assigning units to experimental treatments, and experimental treatments to units, randomly.
• This minimizes the effect of extraneous variables on the observations.

Blinding
• To rule out bias in the subjects under study, blinded trials are used.
• In a single-blind trial, patients in the control group as well as the experimental group are given drugs which look similar but whose contents are different, so no patient knows what has been given to him.
• In a double-blind trial, not only the patients but also the nurses or medical observers do not know which group of patients is given the drug.
• Blinded trials are very useful where subjective information is required.

Coding
• Coding – assigning a number or symbol to results in order to group them into a limited number of categories.
• Coding sacrifices some detail but is necessary for efficient data analysis.

Categorization rules –
• a) Appropriate – categorization should help validate the hypothesis. If the hypothesis aims to establish a relationship between key variables, then appropriate categories should facilitate comparison between them.
• b) Exhaustive – when multiple choice questions are used, alternatives covering the full range of information should be provided.
• c) Mutually exclusive
• d) Single dimension – every class is defined in terms of one concept. If more than one dimension is used, the classes may not be mutually exclusive.

CORRELATION ANALYSIS
• Correlation denotes interdependence among the variables. A correlation is meaningful to interpret only when a cause-and-effect relationship between the two phenomena is plausible; without one, an observed correlation may be spurious.
• The degree of relationship is expressed by a coefficient which ranges from -1 to +1.
• If both variables change in the same direction, the correlation is said to be positive.
• If they change in opposite directions, it is negative.

METHODS OF STUDYING CORRELATION
• Scatter diagram
• KARL PEARSON'S coefficient of correlation
• SPEARMAN'S RANK correlation coefficient
• Method of least squares

Scatter Diagram
– Applicable for a bi-variate distribution
– Data is plotted on graph paper in the form of dots
– Generally the independent variable is on the X-axis and the dependent variable on the Y-axis
– If the points are more scattered, the degree of relationship is less.
– The nearer the points are to the line, the higher the degree of relationship.

[Figure: scatter diagrams showing perfectly positive and perfectly negative correlation]

If the points lie on a straight line parallel to the X-axis or in a haphazard manner, it shows the absence of a relation between the given 2 variables.

Advantages of scatter diagram
• Simple and non-mathematical
• Easy to understand
• Shows an approximate picture of correlation quickly, hence used as the first step in finding correlation
• Not influenced by extreme values, unlike the mathematical methods
Disadvantages
• The exact degree of correlation cannot be established

KARL PEARSON COEFFICIENT OF CORRELATION
Also called the product moment formula. Most widely used in practice.
r = Pearsonian correlation coefficient
$r = r_{xy} = \frac{\mathrm{COV}(X, Y)}{\sigma_X \sigma_Y} = \frac{\sum xy / N}{\sigma_X \sigma_Y} = \frac{\sum xy}{N \sigma_X \sigma_Y} = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$
$r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{N \sum X^2 - (\sum X)^2} \, \sqrt{N \sum Y^2 - (\sum Y)^2}}$
where $x = X - \bar{X}$, $y = Y - \bar{Y}$ and $xy = x \times y$.
r is always between -1 and +1. When
r = +1, the correlation is perfect and positive
r = -1, the correlation is perfect and negative
r = 0, no linear correlation exists
Another formula for r is
$r^2 = \frac{a \sum Y + b \sum XY - n \bar{Y}^2}{\sum Y^2 - n \bar{Y}^2}$

SPEARMAN'S RANK CORRELATION COEFFICIENT
• A rank is assigned to each value of the variable.
• Useful when quantitative measures cannot be fixed.
• The individuals belonging to a group can be arranged in order by assigning a number indicating a rank.
E.g.
leadership skills, beauty.
Given by:
$R = 1 - \frac{6 \sum D^2}{N (N^2 - 1)}$
where D = difference of ranks between paired items in the 2 series and N = total number of paired observations.

When ranks are not given, assign ranks on the basis of value.
When 2 or more entries have equal ranks, the average rank is to be assigned. If 3 entries are at rank 5, then the average rank is (5 + 6 + 7) / 3 = 6.
So an adjustment is required in the formula:
$R = 1 - \frac{6 \left[\sum D^2 + \frac{1}{12}(m^3 - m)\right]}{N (N^2 - 1)}$
m = number of entries having a common rank
If there are more such groups, an adjustment is required for each group. Then
$R = 1 - \frac{6 \left[\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \dots\right]}{N (N^2 - 1)}$

METHOD OF LEAST SQUARES
Correlation coefficient $r = \sqrt{b_{xy} \times b_{yx}}$
where $b_{xy}$ and $b_{yx}$ are the regression coefficients.

REGRESSION ANALYSIS
REGRESSION LINE
A line through the points, drawn in such a manner as to represent the average relationship between the two variables.
Y = a + bX   regression line of Y on X
X = c + dY   regression line of X on Y
The regression equation of Y on X is obtained by minimizing the sum of squares of the errors parallel to the Y-axis.
The two regression equations are obtained by minimizing different sums of squares, so they are not reversible or interchangeable.

Normal equations for the regression equation of Y on X
$\sum Y = na + b \sum X$
$\sum XY = a \sum X + b \sum X^2$
n = total number of observed pairs of values
$b = \frac{\sum XY - n \bar{X} \bar{Y}}{\sum X^2 - n \bar{X}^2}$
$a = \bar{Y} - b \bar{X}$
The coefficient b is called the slope coefficient or regression coefficient. It represents the incremental value of the dependent variable for a unit change in the value of the independent variable.

Regression Coefficient
The regression coefficient of Y on X is also given by
$b_{yx} = r \frac{\sigma_y}{\sigma_x}$
Similarly, the regression coefficient of X on Y is
$b_{xy} = r \frac{\sigma_x}{\sigma_y}$
and $r = \sqrt{b_{yx} \times b_{xy}}$, where
r = coefficient of correlation between X and Y
$\sigma_x$ = population S.D. of X
$\sigma_y$ = population S.D. of Y

Example
• On entry to a school, a new intelligence test was given to a small group of children. The results obtained in that test and in a subsequent examination are tabulated below.
Calculate the Coefficient of Correlation and the Regression Equation.

Child Number      1   2   3   4   5   6   7   8
IQ Score (X)      6   4   6   8   8   10  8   6
Exam. Score (Y)   4   4   7   10  4   7   7   1

Here X = Intelligence Test Score, Y = Examination Score, $x = X - \bar{X}$ and $y = Y - \bar{Y}$, with $\bar{X} = 56/8 = 7$ and $\bar{Y} = 44/8 = 5.5$.

Child No.   X    X^2   x    x^2   Y    Y^2   y      y^2     XY    xy
1           6    36    -1   1     4    16    -1.5   2.25    24    1.5
2           4    16    -3   9     4    16    -1.5   2.25    16    4.5
3           6    36    -1   1     7    49    1.5    2.25    42    -1.5
4           8    64    1    1     10   100   4.5    20.25   80    4.5
5           8    64    1    1     4    16    -1.5   2.25    32    -1.5
6           10   100   3    9     7    49    1.5    2.25    70    4.5
7           8    64    1    1     7    49    1.5    2.25    56    1.5
8           6    36    -1   1     1    1     -4.5   20.25   6     4.5
Total       56   416   0    24    44   296   0      54      326   18

Calculations
$r = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}} = \frac{18}{\sqrt{24 \times 54}} = \frac{18}{36} = 0.5$
$b = \frac{\sum XY - n \bar{X} \bar{Y}}{\sum X^2 - n \bar{X}^2} = \frac{326 - 308}{416 - 392} = 0.75$ and $a = \bar{Y} - b \bar{X} = 5.5 - 0.75 \times 7 = 0.25$, so the regression equation of Y on X is Y = 0.25 + 0.75 X.

Estimation
• Population parameters are unknown and have to be estimated.
Types of Estimates –
• Point Estimate – a single number. A point estimate is useful if it is accompanied by an estimate of error.
• Interval Estimate – a range of values used to estimate a population parameter. It gives a range for the parameter and the probability of the population parameter lying within that range.

Standard Error
Standard Error – also known as the Standard Error of the Mean. This is an estimate of the standard deviation of the sampling distribution of means, based on data from one or more random samples. It is designated as $\sigma_M$.
Standard Error of the Mean $= \sigma_M = \frac{\sigma}{\sqrt{n}}$
where $\sigma$ is the standard deviation of the original distribution and n is the sample size.
This is used in the computation of the Confidence Interval and the Significance Test for the Mean.

Interval Estimate
• The systolic blood pressure of 566 males was taken. The mean BP was found to be 128 mm and the SD 13.05 mm. Find the 95% confidence limits of BP within which the population mean would lie.
$S_{\bar{X}} = \frac{s}{\sqrt{n}} = \frac{13.05}{\sqrt{566}} = 0.55$
• Confidence limits for the population mean are (Mean + 1.96 SE) and (Mean – 1.96 SE)
• 128 + 1.96 × 0.55 = 129.078 and 128 – 1.96 × 0.55 = 126.922

• Hypothesis Testing makes it possible to determine the validity of a hypothesis.
• Hypothesis Testing is used to analyze the difference between the sample statistic and the hypothesized population parameter.
• Steps in Hypothesis Testing –
a) Formulation of the Hypothesis
b) Selection of the Statistical Test to be used
c) Selection of the Significance Level
d) Calculation of the Standard Error of the Sample Statistic and standardization of the Sample Statistic
e) Determination of the Critical Value
f) Comparison of the value of the Sample Statistic with the Critical Value
g) Deducing the Business Research solution

HYPOTHESIS TESTING – BASIC CONCEPTS
Formulation of Hypothesis
• The Null Hypothesis assumes that the difference, if any, in the observed data is attributable to random error.
• The Null Hypothesis is denoted as Ho.
• The Null Hypothesis consists of values relating to population parameters like $\mu$, $\sigma$, p.
• The Alternate Hypothesis is a statement opposite to that made in the Null Hypothesis, and it requires evidence to accept it.
• The Alternate Hypothesis is denoted as Ha or H1.
• The hypothesis should be developed before the sample is drawn.
• The hypothesis should be specific.
• The hypothesis should be fit for testing.

Selection of Statistical Test to be used
Three factors influence the selection
• Type of research question formulated – research questions based on means and proportions use the z-test or t-test, and those based on a frequency distribution use the Chi-square test.
• Number of samples – if there are more than two samples, then tests like ANOVA and Chi-square are used.
• Measurement scale used – Research problems containing an Interval Scale use 'z' and 't' tests.
Problems containing ordinal and nominal scales use the Chi-square test
• Other factors are – sample size, and whether the value of ‘σ’ is known or unknown

Level of Significance
• This is a measure of the degree of risk that a researcher might reject the null hypothesis when it is true
• 5% is the commonly used level
• Level of significance is the percentage of sample means that fall outside specific cut-off points
• Level of significance is set by the researcher based on factors like the cost involved for each type of error

Type I and Type II Error
Scenario     Decision: Accept Ho    Decision: Reject Ho
Ho True      No error               Type I error
Ho False     Type II error          No error

Calculating Test Statistic
Test statistic = (Sample Statistic – Hypothesized Parameter) / Standard Error of Statistic
$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$, where $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
• Use the z-distribution for sample size > 30 and the t-distribution for sample size < 30
1) Locate the critical region, that is, the region of the standard normal curve corresponding to the level of significance $\alpha$.
2) Describe the result and the statistical conclusion

Determining the Critical Value
Two-tailed Test – Used when Ho: μ = μo and H1: μ ≠ μo
• The null hypothesis is rejected if the value of the sample statistic is significantly above or below the hypothesized population parameter
• The acceptance region falls between two rejection regions
One-tailed Test – Used when Ho: μ ≤ μo and H1: μ > μo, or Ho: μ ≥ μo and H1: μ < μo
• The null hypothesis is rejected when the value of the sample statistic is significantly higher than, or lower than, the hypothesized population parameter
• The test can be left-tailed or right-tailed
• A left-tailed test rejects the null hypothesis if the sample mean is significantly lower than the hypothesized population parameter
• A right-tailed test rejects the null hypothesis if the sample mean is significantly higher than the hypothesized population parameter
• The critical value depends on the significance level, the type of hypothesis test and the statistical test used

Conclusion
• Comparing the value of the sample statistic with the critical value
• Deducing the business research conclusion – If the value of the standardized sample statistic falls in the acceptance region, the null hypothesis is accepted; if it falls in the rejection region, the null hypothesis is rejected

Critical Values
Critical Value (Z)    Level of Significance
                      1%        2%        4%        5%        10%
Two-tailed test       ±2.576    ±2.326    ±2.054    ±1.960    ±1.645
Right-tailed test     +2.326    +2.054    +1.751    +1.645    +1.282
Left-tailed test      -2.326    -2.054    -1.751    -1.645    -1.282

Test of significance of mean in Large Sample
• Sample size > 30
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
• If $\sigma$ is not known, then use $\sigma^2 \approx s^2$ (the sample variance)

Tests of significance of mean for small sample, sample size < 30
• The t-distribution is used for a sample size of less than 30
$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$

Tests of significance of difference between 2 means
• For large samples, sample size > 30
• The Z test is used to detect whether or not the means of two samples drawn from two different sources differ significantly.
• It is also done to check whether the difference is due to chance and whether the samples belong to the same population.
Steps:
1) State the null hypothesis, which may be Ho: μ1 = μ2
2) State the alternative hypothesis H1: μ1 ≠ μ2
3) Compute the estimated standard error
$\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
4) Compute the test statistic
$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_{H_0}}{\sigma_{\bar{X}_1 - \bar{X}_2}}$
Note:
• It is generally assumed that σ1 and σ2, the S.D.s of the populations, are known.
• If they are unknown, then the corresponding sample standard deviations S1 and S2 are used.

For small samples, n < 30
• In this case the t-distribution is used instead of the normal distribution
• Steps:
- State the null hypothesis
- State the alternative hypothesis
- Compute the pooled estimate of variance
$S_P^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$
(S1 and S2 are sample standard deviations)
- Compute the standard error
$\sigma_{\bar{X}_1 - \bar{X}_2} = S_P \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
- Compute the test statistic
$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_{H_0}}{\sigma_{\bar{X}_1 - \bar{X}_2}}$
- Compare the computed value of t with the critical value of t

Test of significance for proportions
• The difference between the sample proportion and the hypothesized proportion is standardized, and the normal distribution is used for the test.
Standard error of proportion: $\sigma_{\bar{p}} = \sqrt{\frac{p_{H_0} \, q_{H_0}}{n}}$, where $q_{H_0} = 1 - p_{H_0}$
$Z = \frac{\bar{p} - p_{H_0}}{\sigma_{\bar{p}}}$

Comparison of two sample Proportions
• Test of significance for comparison of two sample proportions – The proportion of the population from which the two samples are drawn is given by
$p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$
and the standard error of the difference between the proportions is given by
$S_{p_1 - p_2} = \sqrt{\frac{pq}{n_1} + \frac{pq}{n_2}}$, where $q = 1 - p$
Z = Difference in proportions / Standard Error (S)

Paired Samples (Dependent Samples)
• For example, the sugar level of the same sample of patients before and after the drug is administered
• The paired-sample t-test is the t-test widely used in such situations.
$t = \frac{\bar{D} - \mu_d}{S_D / \sqrt{n}}$
where $\bar{D}$ is the mean difference of the samples, $\mu_d$ is the hypothesized value of the difference, $S_D$ is the standard deviation of the differences and n is the sample size
$S_D^2 = \frac{1}{n - 1} \left( \sum_{i=1}^{n} D_i^2 - n \bar{D}^2 \right)$

ANALYSIS OF VARIANCE (ANOVA)
• If the problem requires comparison of the means of more than two populations, we can use ANOVA
• It tests whether there is any significant difference between the means of various samples
• It measures the variability of data points within the samples and also the variance between the sample means. These two variations are compared using the F-test
• If the value of F is large, we can conclude that there is a significant difference between the means of the samples
• Two types of ANOVA – a) One factor b) Two factor
• One factor ANOVA is used for problems that involve evaluating the differences in the mean of the dependent variable across the categories of a single independent variable

Steps in ANOVA
1. Formulate the hypothesis
2. Obtain the mean of each sample
3. Find the mean of all samples, i.e. the grand mean
4. Calculate the variation between samples, denoted as SSbetween
5. Obtain the mean square of the variation between the samples, denoted by MSbetween, using SSbetween
6. Calculate the variation within the samples, denoted by SSwithin
7. Obtain the mean square of the variation within the samples, denoted by MSwithin, using SSwithin
8. Calculate the total variation SSy, where SSy = SSbetween + SSwithin
9. Calculate F ratio = MSbetween divided by MSwithin
10. Compare the F ratio with the critical value

Chi-square test
• To evaluate the statistical significance of association among variables involved in cross-tabulation, the Chi-square test is used
• Chi-square test is used in two ways –
• Test of Independence – used to evaluate association between two variables
• Test of goodness of fit – used to identify whether there is any significant difference between the observed frequencies and the expected frequencies

Chi-Square Test – General aspects
• Chi-square test can be performed on actual numbers but not on percentages or proportions. Percentages or proportions should be converted to actual numbers
• Expected frequencies of all cells should be more than five. If any is less than five, some of the rows or columns should be combined to make the new frequencies greater than five
• Chi-square test works only when the sample size is large enough; usually more than 50
• Observations drawn should be random and independent

Chi-Square test – Goodness of Fit
$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$
• If the value of Chi-square is less than the tabular value corresponding to the level of significance, we deduce that the observed data correspond to the expected data

Chi-Square Test – Test of Independence
• Goodness of fit involves only one variable. To evaluate the relationship between two or more variables, the Chi-square test of independence is used. This is useful while analyzing cross-tabulation.
In this case
$\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$

Example: Channel Viewership Distribution according to Age Group
Age Group / Channel    Channel A    Channel B    Channel C    Total
15 – 25 age group      20 (32)      30 (28)      30 (20)      80
25 – 45 age group      80 (80)      70 (70)      50 (50)      200
45 years and above     60 (48)      40 (42)      20 (30)      120
Total                  160          140          100          400
Figures in brackets are expected frequencies

• Formulation of hypotheses
• Ho: There is no association between age group and channel viewership
• H1: There is significant association between age group and channel viewership
• Calculate the expected value –
$E_{ij} = \frac{n_i \, n_j}{n}$
• where n_i = row total; n_j = column total; n = total sample size
• Calculate the Chi-square value –
$\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \approx 16.07$
• Decide the level of significance and degrees of freedom – Degrees of freedom ‘v’ = (number of rows – 1) x (number of columns – 1) = (3 – 1) x (3 – 1) = 2 x 2 = 4. Let us set the level of significance as 1%
• Determining the critical value – From the Chi-square table, the critical value for the 1% significance level and 4 degrees of freedom is 13.28
• Deduce the business research conclusion – As the calculated value (16.07) is greater than the critical value (13.28), the null hypothesis is rejected

Strength of Association
• The test of independence does not describe the strength or magnitude of association. It is evaluated by using the Phi coefficient and the Coefficient of Contingency
• Phi coefficient – Suitable only for a 2 x 2 table
$\phi = \sqrt{\frac{\chi^2}{n}}$
• Coefficient of Contingency (C) – Can be used for a table of any size
$C = \sqrt{\frac{\chi^2}{\chi^2 + n}}$
• Here n is the total sample size. The coefficient lies between 0 and 1: 0 indicates no association, and values closer to 1 indicate stronger association (the upper limit depends on the table size and is always below 1)
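
The channel viewership example above can be recomputed with a short Python sketch. It assumes only the observed counts from the table and the critical value 13.28 quoted from the Chi-square table; it is an illustration, not the method used to prepare the slides. The function name chi_square_independence and the variable names are made up for this example.

```python
from math import sqrt

# Observed viewership counts from the example.
# Rows: age groups 15-25, 25-45, 45 and above. Columns: Channels A, B, C.
observed = [
    [20, 30, 30],
    [80, 70, 50],
    [60, 40, 20],
]


def chi_square_independence(table):
    """Return (chi2, degrees of freedom, expected frequencies, sample size)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected frequency of a cell = row total * column total / sample size.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected, n


chi2, dof, expected, n = chi_square_independence(observed)

# Coefficient of contingency, usable for a table of any size.
contingency_c = sqrt(chi2 / (chi2 + n))

# Critical value for 1% significance and 4 degrees of freedom,
# copied from the text rather than computed here.
CRITICAL_1PCT_4DF = 13.28
reject_null = chi2 > CRITICAL_1PCT_4DF

print(f"chi-square = {chi2:.2f}, degrees of freedom = {dof}")
print(f"coefficient of contingency = {contingency_c:.3f}")
print("reject Ho" if reject_null else "do not reject Ho")
```

The Phi coefficient is left out because the table is 3 x 3, and the text restricts Phi to 2 x 2 tables. Percentages would have to be converted to counts before being passed to the function, in line with the general aspects listed for the Chi-square test.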