Statistics & Research Methodology

DR SAIFUL’S NOTES ON MEDICAL & ALLIED HEALTH PROFESSION EDUCATION:
STATISTICS & RESEARCH METHODOLOGY

Dr. Muhamad Saiful Bahri Yusoff
            MD, MScMEd
Contents


RESEARCH IN MEDICAL EDUCATION
STEPS IN RESEARCH
SAMPLING METHOD
SAMPLE SIZE
OVERVIEW ON MEDICAL STATISTICS
STUDY DESIGN
DESCRIPTIVE STATISTICS
HYPOTHESIS FORMULATION AND TESTING
CONFIDENCE INTERVAL
EXPLORATORY DATA ANALYSIS (NUMERICAL DATA)
UNIVARIATE ANALYSIS OF NUMERICAL DATA
UNIVARIATE ANALYSIS OF CATEGORICAL DATA
CORRELATION & REGRESSION
   CORRELATION
   SIMPLE LINEAR REGRESSION (SLR)
CORRELATION
NONPARAMETRIC STATISTICS
NON-PARAMETRIC TESTS
STATISTICAL ANALYSIS: WHICH TO CHOOSE?
WRITING A RESEARCH PROPOSAL
VARIABLES
DATA PRESENTATION
Z-Score & ITS USES
t-test
SENSITIVITY & SPECIFICITY




                     RESEARCH IN MEDICAL EDUCATION


1. Research and types of research:
   •   How do we develop knowledge?
       o Intuitive knowledge (based on “I feel or I think”)
       o Authoritative knowledge (based on the view of a recognized authority)
       o Logical knowledge (based on a reasonable and logical explanation of
           experience)
       o Empirical knowledge (based on judgement backed up by facts and
           usually 90% correct)
   •   What is research?
       o Research is a systematized effort to gain new knowledge – (Redman
           & Mory)
       o Literally, research means to search again and again.
       o Research is an organized and systematic way of finding answers to
           questions.
   •   Research comprises:
       o Defining and redefining problems
       o Formulating hypotheses or suggested solutions
       o Collecting, organizing and evaluating data
       o Making deductions and reaching conclusions
       o   Testing the conclusions to determine whether they fit the
           formulated hypotheses                          (Clifford Woody)

   •   Types of research
       o Basic research and applied research
       o Quantitative research and qualitative research
   •   Qualitative research:
       o Ethnography, cognitive anthropology, etc
       o Synthetic rather than analytic
       o Generally hypothesis generating
       o Investigative methods are non-intrusive
       o Data are more impressionistic.




    o Research in such a situation is a function of researcher’s insights
          and impressions.


                               Educational Research

     •   History
     •   Descriptive
         o Ethnographic
         o Survey
     •   Correlational
     •   Group comparison
         o Experimental
         o Quasi-experimental
         o Ex Post Facto or Causal-comparative

     (Thomas K. Crowl)


•   Descriptive research
    o Includes both quantitative and qualitative research.
    o Methodologies include observations, surveys, self-reports and tests.
    o May operate on the basis of hypotheses.
    o Deals with naturally occurring phenomena.
•   Ethnographic research
    o Descriptive and qualitative research.
    o Report is detailed verbal description.
    o Carried out in natural setting.
    o Researcher as participant and observer.
•   Survey
    o Descriptive
    o Quantitative study
•   Correlational research
    o Investigate the relationship between two or more variables.
    o Searching the relationship of variables in natural setting.


   •   Group comparison research
       o Comparing the values of two or more groups of population.
   •   Experimental research
       o Random selection of the individuals forming the groups
             Experimental group
             Control group
   •   Quasi-experimental research
       o A type of group comparison research.
       o Intact groups, rather than individuals, are randomly selected.
   •   Ex Post Facto or Causal-comparative study
       o Ex Post Facto is Latin for “after the fact”.
       o Values of the independent variable in the two groups are preset
          (already present).


2. Quantitative/Qualitative Research:
   •   Deductive
       o Begin with a theory and collect data to test it.
   •   Inductive:
       o Begin with observations and attempt to explain by generalizing.
   •   Deductive reasoning
       o A type of logic in which one goes from a general statement to a
          specific instance.
   •   Inductive reasoning:
       o Involves going from a series of specific cases to a general
          statement.
       o The conclusion in an inductive argument is never guaranteed.
   •   Confirmatory
       o Experimental
       o Quasi-experimental
       o Correlational (non-experimental)
   •   Exploratory
       o Qualitative



3. Qualitative research methods for data collection
   •   Interviews
   •   Focus groups
   •   Survey: open ended questions
   •   Observations: recorded in field notes
   •   Document analysis
   •   What is qualitative data?
       o Data in the form of words, rather than numbers, based on:
             Asking open ended questions in:
              •     Interviews
              •     Focus groups
              •     Surveys
             Examination of documents
             Observation of situations and actions, recorded in field
             notes
   •   Uses of qualitative data
       o Some social sciences e.g
             Anthropology
             History
             Psychology
             Sociology
             Public health
             Policy analysis
             Health care evaluation


4. Types of quantitative research design:
   •   The commonly used research designs can be divided into the
       following groups:
       o Non experimental design
             Post-test design                X O1     no control
             Pretest-post test design       O1 X O2   no control
             Static group comparison          X O1 O2 no control
      o True experimental design
            Pretest-post test control group design
            •   R.A (Random Allocation) – Exp. group    O1 X O2
            •   R.A (Random Allocation) – Control       O3   O4
            Post test control group design
            •   R.A – Exp. group    X O1
            •   R.A – Control         O2
      o Quasi-experimental design
            Time series O1 X O2 X O3 X O4
            Non-equivalent control group
            •   Exp group O1 X O2
            •   Non-equivalent control group O3 O4
            Separate sample pretest post test design
            •   R.A – Pretest group O1 X
            •   R.A – Post test group X O2
5. Purpose of Medical Education Research:
  •   To improve the functioning of educational programmes by providing
      information for:
      o Decision making
      o Evaluating outcomes
      o Supporting advocacy for change
      o Contributing to the body of knowledge related to concepts and
         methods.


   Research is like a plant that grows and grows and grows and grows…


 When it is grown, it throws off seeds of all types (basic, applied and
  practical) which in turn sprout and create more research projects…


The process continues with all of the new research ‘plants’ throwing off
seeds, creating additional, related research projects of various types…


 Soon there is a body of basic, applied and practical research projects
                      related to similar topics…
                   And the process goes on and on…




                             STEPS IN RESEARCH




1. Preliminary steps:
   •   Clarifying the purpose
   •   Formulating the topic
       o State your topic idea as a question
       o Identify the main concepts or keywords


2. Finding background information
   •   Critically analyzing information sources
       o Initial appraisal
             Author
             Date of publication
             Edition or revision
             Publisher
             Title of the journal
       o Content analysis
             Intended audience
             Objective reasoning
             Coverage
             Writing style
             Evaluative review


3. Five steps to formulate a topic for better research
   •   Think about your topic
   •   Define your main concepts
   •   Think synonyms
   •   Think of broader terms
   •   Think of narrower terms
4. Steps in research
   •   Planning
       o Formulation of the study objectives


             General objective – what is the purpose of the study
             Specific objectives – what are the things you want to find in the
             study
       o Planning of methods
             Study population
             •     Selection and definition
             •     Sampling
             •     Sample size
             Variables
             •     Selection
             •     Definition
             •     Scales of measurement
             Method of data collection
             Method of recording and processing
   •   Preparing for data collection
       o Construction of research instrument
       o Pretesting the instrument
   •   Collecting the data
   •   Processing the data
   •   Interpreting the data
   •   Writing a report


5. To prioritize a problem and selection of a topic for research, it is helpful to
ask yourself a series of questions and then try to answer each of them
   •   Is the problem a current one? Does the problem exist now?
   •   How widespread is the problem? Are many areas and many people
       affected by the problem?
   •   Does the problem affect social groups, such as students, teachers and
       patients?
   •   Does the problem relate to broad social, economic, and health issues,
       such as unemployment, income maldistribution, the status of women,
       education and maternal and child health?



   •    Who else is concerned about the problem? Are top government
        officials concerned? Are medical doctors or other professionals
        concerned?
   •    Are the resources available?
   •    Are measures available to solve the problem?
Review your answers to these questions, then rank the problems and
arrange them according to the ranking.


 Problem identification → Information gathering & knowledge building →
 Research question/hypothesis formulation → Planning research →
 Data collection → Data processing → Data analysis →
 Confirmation or rejection of hypothesis → Drawing inference →
 Report writing → Dissemination of findings




                                           SAMPLING METHOD




                                               Sampling Method

     •   Non-probability sampling
         o Purposive sampling
               Judgment sampling
               Quota sampling
         o Convenience sampling
     •   Probability sampling
         o Unrestricted sampling (simple random sampling)
         o Restricted sampling
               Systematic sampling
               Stratified random sampling
               Cluster sampling
               Area sampling
               Double sampling




    1. Sample and Subject:
           •   A sample is a subset of the population; it comprises some members
               selected from it.
           •   A subject is a single member of the sample (just like an element is a
               single member of the population).


    2. Population, Element and Population Frame:
           •   Population refers to the entire group of people, events, or things of
               interest that the researcher wishes to investigate.
           •   An element is a single member of the population.
           •   Population frame is a listing of all the elements in the population from
               which the sample is drawn.




3. Sampling:
   •   Sampling is the process of selecting a sufficient number of elements
       from the population, so that a study of the properties or
       characteristics of the sample makes it possible to generalize such
       properties or characteristics to the population.


4. Two major types of sampling design:
   •   Probability sampling (sample picked at random).
       o The elements in the population have a known (in the simplest case,
          equal) chance or probability of being selected as sample subjects.
       o Probability sampling designs are used when the representativeness
          of the samples is of importance in the interests of wider
          generalisability.
   •   Non-probability sampling (sample not randomly picked).
       o The elements do not have a predetermined chance of being selected
          as subjects.
       o When time or other factors, rather than generalisability, become
          critical, non-probability sampling designs are chosen.


5. Probability Sampling:
   •   Unrestricted sampling:
       o More commonly known as simple random sampling.
       o Every element in the population has a known and equal chance of
          being selected as a subject.
       o Advantage:
             This kind of sampling method has the least bias.
       o Disadvantages:
             Cumbersome (difficult) and expensive.
             An entirely updated listing (population frame) of the population
             may not always be available.
   •   Restricted (complex) random sampling:
       o Offer a viable, and sometimes more efficient alternative to the
          unrestricted design.


o Five most common complex probability sampling methods
     Systematic sampling
     •   Drawing every nth element in the population, starting with a
         randomly chosen element between 1 and n.
     •   For example, if we want a sample of 60 students from a total
         population of 300 students, the sampling interval is 300/60 = 5,
         so we could sample every 5th student from a random start (e.g.
         3, 8, 13, …) until 60 students are selected.
     •   The starting number must be selected randomly; for example, we can
         take out a one-ringgit note and use the last digit of its serial
         number.
     Stratified random sampling
     •   When sub-populations vary considerably, it is advantageous
         to sample each sub-population (stratum) independently.
     •   Stratification is the process of grouping members of the
         population into relatively homogenous subgroups before
         sampling.
     •   The strata should be mutually exclusive: every element in the
         population must be assigned to only one stratum.
     •   The strata should also be collectively exhaustive: no population
         element can be excluded.
     •   The random sampling is applied within each stratum.
     Cluster sampling
     •   Cluster sampling is used when natural groupings are evident
         in the population.
     •   The total population is divided into groups or clusters.
     •   Elements within a cluster should be as heterogeneous as
         possible.
     •   But there should be homogeneity between clusters.
     •   Each cluster must be mutually exclusive and collectively
         exhaustive.
     •   A random sampling technique is then used on any relevant
         clusters to choose which clusters to include in the study.



             Area sampling
             •   One version of cluster sampling is area sampling or
                 geographical cluster sampling.
             •   Clusters consist of geographical areas.
             •   A geographically dispersed population can be expensive to
                 survey.
             •   Greater economy than simple random sampling can be
                 achieved by treating several respondents within a local area
                 as a cluster.
             Double sampling
             •   A sampling design where initially a sample is used in a study
                 to collect some preliminary information of interest, and later
                 a sub-sample of this primary sample is used to examine the
                 matter in more detail
             •   It is like a reverse pilot study, because double sampling
                 first studies the whole initial sample and then proceeds to
                 sample the sub-sample of interest.
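
The systematic, stratified and cluster designs described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population frame of 300 numbered students, the three "year" strata, the ten "class" clusters and all function names are hypothetical, and the sampling interval is taken as population size divided by sample size.

```python
import random

def systematic_sample(frame, n, rng):
    """Take every k-th element after a random start, where k = len(frame) // n."""
    k = len(frame) // n              # sampling interval
    start = rng.randrange(k)         # random 0-based start inside the first interval
    return [frame[start + i * k] for i in range(n)]

def stratified_sample(strata, fraction, rng):
    """Draw a simple random sample of the same fraction from every stratum."""
    sample = []
    for members in strata.values():
        size = round(len(members) * fraction)
        sample.extend(rng.sample(members, size))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every element inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [element for name in chosen for element in clusters[name]]

rng = random.Random(0)               # fixed seed so the run is repeatable
students = list(range(1, 301))       # hypothetical population frame of 300 student IDs

sys_sample = systematic_sample(students, 60, rng)

# Hypothetical strata: three year groups of 100 students each.
strata = {"year1": students[:100], "year2": students[100:200], "year3": students[200:]}
strat_sample = stratified_sample(strata, 0.2, rng)

# Hypothetical clusters: ten classes of 30 students each.
clusters = {f"class{c}": students[c * 30:(c + 1) * 30] for c in range(10)}
clus_sample = cluster_sample(clusters, 2, rng)

print(len(sys_sample), len(strat_sample), len(clus_sample))
```

All three calls return 60 students, but the composition differs: the systematic sample is evenly spaced through the frame, the stratified sample takes 20 students from every year, and the cluster sample consists of two complete classes.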




6. Non-probability Sampling:
  •   The elements in the population do not have any probabilities attached
      to their being chosen as sample subjects.
  •   The findings from the study of the sample cannot be confidently
      generalized to the population.
  •   This method is chosen when generalisability is not critical; focus may
      be on obtaining preliminary information in a quick and inexpensive
      way.
  •   2 broad categories:
      o Convenience sampling
             Collection of information from members of the population who
             are conveniently available to provide it.
      o Purposive sampling



             The sampling is confined to specific types of people who can
             provide the desired information, either because they are the
             only ones who have it, or because they conform to some criteria
             set by the researcher.
             Two types of purposive sampling:
             •   Judgment sampling
                 o Involves the choice of subjects who are most
                    advantageously placed or in the best position to provide
                    the information required.
                 o Judgment sampling may curtail the generalisability of the
                    findings because we are using a sample of experts who
                    are conveniently available to us.
                 o Judgment sampling calls for special efforts to locate and
                    gain access to the individuals who do have the
                    requisite information.
             •   Quota sampling
                 o This method ensures that certain groups are adequately
                    represented in the study through the assignment of a
                    quota.
                 o The quota fixed for each subgroup is based on the total
                    numbers of each group in the population.
                 o Considered a form of proportionate stratified
                    sampling, in which a predetermined proportion of people
                    are sampled from different groups, but on a convenience
                    basis.




                                SAMPLE SIZE


1. Introduction:
   •   Questions:
       o How large should my sample be?
   •   Answer:
       o It depends…
             …large enough to be an accurate representation of the
             population.
             …large enough to achieve statistically significant results.
2. Determining sample size:
   •   What sample size would be required to make reasonably
       precise generalizations with confidence?
   •   A reliable and valid sample should enable us to generalize the findings
       from the sample to the population under investigation.
   •   The sample statistic (statistical finding) should be a reliable
       estimate and reflect the population parameter (actual finding) as
       closely as possible within a narrow margin of error.
   •   Precision:
       o Precision refers to how close our estimate is to the true population
          characteristic.
       o Normally, the greater the precision required, the larger is the
          sample size needed.
   •   Confidence:
       o Confidence denotes how certain we are that our estimate will really
          hold true for the population.
       o Confidence reflects the level of certainty with which we can state
          that our estimates of the population parameters, based on our
          sample statistics, hold true.
       o Level of confidence can range from 0 to 100%.
       o A level of confidence of 95% is conventionally acceptable.
   •   Sample size is a function of…
       o Variability (heterogeneity) in the population


                The more variance we find, the bigger the sample should be
       o Precision or accuracy needed
                The more precise or accurate we want, the bigger the sample
                size should be
       o Confidence level desired
                The higher the confidence level we want, the bigger the sample
                size should be
       o Type of sampling plan used
                Different sampling approaches will require different sample size
   •   Trade-off between confidence and precision
       o If there is little variability in the population, a small sample size
          will be sufficient to obtain a high confidence and precision level.
       o For a fixed sample size, the higher the precision, the lower our
          confidence level will be.
       o Likewise, the higher the confidence level, the lower our precision
          will be. That is why we need a bigger sample size to increase both
          precision and confidence.
   •   Roscoe proposes the following rules of thumb for determining sample
       size
       o Sample size larger than 30 and less than 500 are appropriate for
          most research
       o Where samples are to be broken into sub samples, a minimum
          sample size of 30 for each category is necessary.
       o In multivariate research (including regression analyses) the sample
          size should be several times (preferably 10 times or more) as large
          as the number of variables in the study.
       o For simple experimental research with tight experimental controls,
          successful research is possible with samples as small as 10 to 20
          in size.
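
The trade-off between confidence and precision can be made concrete with the margin of error for a mean, Z × σ / √n. The sketch below is only illustrative: the population standard deviation of 15 and the sample sizes of 50 and 200 are assumed values, and `margin_of_error` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def margin_of_error(sigma, n, confidence):
    """Half-width of the interval for a mean: Z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # two-sided Z-score
    return z * sigma / math.sqrt(n)

sigma = 15.0   # assumed population standard deviation
for n in (50, 200):
    for confidence in (0.90, 0.95, 0.99):
        moe = margin_of_error(sigma, n, confidence)
        print(f"n={n:3d}  confidence={confidence:.0%}  margin of error={moe:.2f}")
```

For a fixed n the margin of error widens as the confidence level rises (lower precision); moving from n = 50 to n = 200 halves the margin at every confidence level.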
3. The term statistically significant (p<.05) indicates that, if there were
truly no effect in the population, findings at least as extreme as those
obtained from the sample would be expected in fewer than 5 out of 100 such
studies. It is loosely read as the findings being likely to resemble what
would be found if one could carry out the study with the entire population.


4. Sample size for single mean

$$ n = \left( \frac{Z \sigma}{\Delta} \right)^2 $$

   n = sample size
   σ = population standard deviation
   ∆ = precision
   Z = Z-score at significance level

If only 80% of the sampled population is expected to respond, the
sample size = n/0.8

5. Sample size for two means

$$ n = \frac{2 \sigma^2 (A + B)^2}{\Delta^2} $$

   n = sample size (in each group)
   σ = population standard deviation
   ∆ = expected difference of mean
   A = significance level (usually 5%, equals 1.96)
   B = power (usually 80%, equals 0.84)

Table of values for A and B:

   Significance level     A
   5%                     1.96
   1%                     2.58

   Power                  B
   80%                    0.84
   90%                    1.28
   95%                    1.64

6. Sample size for single proportion:

$$ n = \left( \frac{Z}{\Delta} \right)^2 p (1 - p) $$

   n = sample size
   ∆ = precision
   Z = Z-score at significance level
   p = population proportion

7. Sample size for two proportions:

$$ n = \frac{(A + B)^2 \left[ p_1 (1 - p_1) + p_2 (1 - p_2) \right]}{(p_1 - p_2)^2} $$

   n = sample size (in each group)
   A = significance level (usually 5%, equals 1.96)
   B = power (usually 80%, equals 0.84)
   p1 = first proportion
   p2 = second proportion

Power is the probability that the null hypothesis will be correctly rejected,
i.e. rejected when there is indeed a real difference or association. It can
also be thought of as “100 minus the percentage chance of missing a real
effect”; therefore the higher the power, the lower the chance of missing a
real effect.
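
The four sample-size formulas in sections 4 to 7 can be tried out with a small Python sketch. The function names and the input values in the example calls are hypothetical, and rounding up to the next whole subject with `math.ceil` is an assumption of this sketch rather than something stated in the notes.

```python
import math

def n_single_mean(z, sigma, delta):
    """n = (Z * sigma / delta)^2"""
    return math.ceil((z * sigma / delta) ** 2)

def n_two_means(a, b, sigma, delta):
    """n per group = 2 * sigma^2 * (A + B)^2 / delta^2"""
    return math.ceil(2 * sigma ** 2 * (a + b) ** 2 / delta ** 2)

def n_single_proportion(z, p, delta):
    """n = (Z / delta)^2 * p * (1 - p)"""
    return math.ceil((z / delta) ** 2 * p * (1 - p))

def n_two_proportions(a, b, p1, p2):
    """n per group = (A + B)^2 * [p1(1 - p1) + p2(1 - p2)] / (p1 - p2)^2"""
    return math.ceil((a + b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2)

def adjust_for_response(n, response_rate):
    """Inflate n when only a fraction of the sample is expected to respond."""
    return math.ceil(n / response_rate)

A, B = 1.96, 0.84   # 5% significance level and 80% power, from the table of values

# Hypothetical inputs, chosen only to exercise each formula.
print(n_single_mean(A, sigma=10, delta=2))
print(n_two_means(A, B, sigma=10, delta=5))
print(n_single_proportion(A, p=0.5, delta=0.05))
print(n_two_proportions(A, B, p1=0.5, p2=0.3))
print(adjust_for_response(n_single_mean(A, sigma=10, delta=2), 0.8))
```

Halving the precision ∆ in the single-mean or single-proportion formula roughly quadruples the required sample size, because ∆ enters the formula squared.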
Some definitions:


Sampling error is the difference between a sample statistic and the actual
parameter of the population.


Standard error is the standard deviation of a sample statistic (such as the
mean) across repeated samples from the same population.


Standard deviation measures how far the individual units of a sample or
population typically deviate from their mean.
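
A small numerical sketch may help keep these three terms apart. The eight exam marks and the population mean of 72 are made-up values used only for illustration, and the standard error is estimated in the usual way as the sample standard deviation divided by √n.

```python
import math
import statistics

scores = [62, 70, 74, 68, 81, 77, 65, 73]   # hypothetical sample of exam marks
population_mean = 72.0                      # assumed true value, normally unknown

sample_mean = statistics.mean(scores)
sd = statistics.stdev(scores)               # spread of individual marks around the mean
se = sd / math.sqrt(len(scores))            # estimated spread of the sample mean itself
sampling_error = sample_mean - population_mean

print(f"mean={sample_mean:.2f}  SD={sd:.2f}  SE={se:.2f}  sampling error={sampling_error:.2f}")
```

The standard error is always smaller than the standard deviation for samples of more than one unit, and it shrinks as the sample grows, whereas the standard deviation does not.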




                   OVERVIEW ON MEDICAL STATISTICS


1. Some introduction:
   •   I’m interested in research…
   •   I’m forced to do research…
   •   Whatever the reason may be…


2. What should I do?
   •   How should I start???


3. Let’s make it easily understandable
   •   Research methods/ approaches – leading the way/ direction
   •   Statistical applications – tools/ vehicles


4. What do I know? Be honest!!
   •   Do I know about research methods?
       o If not, back to basics… go back and read research methods/
          approaches
   •   Do I know about statistical and software application?
   •   Do I know how to interpret?
       o OK… I understand methods and approaches
   •   So… how to proceed?
   •   Please try to learn medical statistics
   •   OK… I agree to learn medical statistics
   •   Tell me how should I go for it (the easiest way)
   •   Don’t make it complicated (statisticians make statistics more difficult)
   •   Tell me only statistics for non-statisticians


5. Application of statistics in medical research
   •   Why use statistics?
       o Are differences observed in a medical context due to real effects,
          random variation, or both?



   •   Modern viewpoint of statistics
       o Aid for making scientific decision in the face of uncertainty
       o A valuable tool in decision making whenever one is uncertain about
          the state of nature
   •   Statistics in medicine
       o Increasingly prevalent in medical practice
             Hospital utility statistics, auditing, vaccination uptake,
             incidence/prevalence of AIDS and so on…
   •   Statistics is about common sense and good design
   •   Statistics is only the guide to make decisions
   •   Judgment should be made based on both biological and statistical
       plausibility
   •   Concept and applications of statistics in medical sciences
       o Let us discuss briefly
       o People say “stat is boring”
       o Let us make it interesting


6. Classification of statistics
   •   It consists of two parts
       o Descriptive statistics
             Concerned with collection, organization, enumeration of the
             frequency of characteristics, summarization and presentation of
             data.
       o Inferential statistics
             Statistical inference
             Analytical in nature
             Consists of a collection of principles or theorems
             Allows researcher to generalize characteristics of a “population”
             from the observed characteristics of a “sample”
   •   Statistical jargon
       o Population parameter
             A fixed numerical value which describes a particular
             characteristic of a population


             E.g. 1 – the mean value in the population of a particular
             characteristic of interest (mean systolic blood pressure of
             Australian adults)
             E.g. 2 – the proportion of individuals in the population with a
             particular characteristic of interest (the proportion of low birth
             weight babies born in Indonesia)
       o Sample statistics
             Varies in value from sample to sample
             Other terms – statistics, summary statistics, point estimate,
             effect size, point estimate of the effect size
       o The relationship between sample statistics and population
          parameters is the basis of statistical inference.


7. Statistical inference
   •   2 broad categories
       o Hypothesis formulation and testing
       o Estimation
             Point estimation
             Interval estimation
            (Confidence interval)


8. Concepts of populations, samples and statistical inference
   •   Statistical analysis of medical studies is based on the key idea that we
       make observations on a sample of subjects and then draw inferences
       about the population of all such subjects from which the sample is
       drawn.
   •   If the study sample is not representative of the population, we may well
       be misled, and statistical procedures cannot help.
   •   But even a well-designed study can give only an idea of the answer
       sought, because of random variation in the sample.
   •   Thus the result from a single sample is subject to statistical uncertainty,
       which is strongly related to the size of the sample (Gardner MJ and
       Altman DG, 1988).


                                     STUDY DESIGN




1. How do we begin to answer the question?
   •   Start with the building blocks of any design
       o Participants of investigation
       o Outcomes of investigation
       o Direction of inquiry (prospective or retrospective)
       o Other considerations (e.g. possibility and resources)


2. Think about our research question
   •   Identify the participants of interest
   •   Identify the outcomes of interest


3. About the investigation
   •   Presence of a comparison group
       o Dependent on the objective of the study
       o Generally increases the validity of an observed association
   •   Exposure (risk) (or intervention) and outcome
       o Must be measured with as little error as possible


4. Overview of epidemiologic studies: design strategies
   •   Descriptive
       o Population (prevalence, correlational studies)
       o Individual (case report, case series, comparative studies with
          historical controls)
   •   Analytical
       o Observational studies
             Cross-sectional studies
             Cohort studies
             Case-control studies
       o Intervention studies (experimental; RCT)




5. Case report
   •   Strength
       o Hypothesis (question) generation
       o Clinical observation
   •   Weaknesses
       o May be a one-off
       o Nothing to compare with


6. Case series
   •   Strengths
       o Strengthens the hypothesis
       o Able to establish temporal relationship
   •   Weakness
       o Nothing to compare with


7. Comparative studies with historical controls
   •   Strength
       o Like two case series
       o Have something to compare to
   •   Weaknesses
       o May be other differences between groups
       o Relies on recorded information being accurate


8. Randomized Controlled Trial
   •   ‘Gold standard’ test of treatment
   •   Allocation to groups is entirely random
   •   Control group identical to treatment group at start except for
       intervention
   •   Participants/investigators commonly ‘blind’ to group allocation to
       reduce bias
   •   May evaluate good and bad outcomes




   •   There are several types of blinding in RCTs
       o Single blind
       o Double blind
       o Triple blind
       o Multiple blind
       o End point blinding
             E.g. the pathologist is not given any information about the
             study sample slide, so he/she does not know whose slide it is and
             decides based on an independent interpretation of the slide.
   •   There are 2 designs of RCT
       o Parallel RCT
       o Cross-over RCT


   Parallel RCT:
       Population → Eligible subjects → Randomization → Pre-treatment
       assessment (each group) → Test or Control → Post-treatment
       assessment (each group)

   Cross-over RCT:
       Population → Eligible subjects → Randomization → Test or Control →
       Outcome assessment → groups cross over (Control or Test) →
       Outcome assessment

9. Prospective cohort:
   •       A group of people (cohort) is assembled, none of whom has
           experienced the outcome of interest




   •       On entry, people are classified according to characteristics that might
           be related to outcome
   •       Other names: longitudinal, prospective, incidence studies.


   Direction of inquiry (from exposure forward to outcome):
       Exposed   → Disease or No Disease
       Unexposed → Disease or No Disease



   •       Advantages of cohort studies
           o The only way of establishing incidence directly
           o Follow the same logic as clinical question: if person exposed, do
              they get the disease?
           o Exposure can be elicited without the bias that may arise when the
              outcome is already known
           o Can assess the relationship between an exposure and many
              diseases
           o Calculate risk directly: relative risk (RR)
   •       Strengths
           o Powerful design for defining incidence and investigating potential
              causes (aetiology questions)
           o Establishes temporal sequences
           o Appropriate for interventions where randomization is not possible
           o Investigator has the opportunity to measure important variables
              completely (not relying on recorded information)
   •       Weaknesses
           o Expensive and inefficient for rare outcomes – needs more patients
           o May be other differences between groups
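
The relative risk mentioned above can be sketched in a few lines of Python. This is only an illustration: the counts are made-up numbers (not taken from these notes) and the function name `relative_risk` is an assumption.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Relative risk = incidence in the exposed / incidence in the unexposed."""
    incidence_exposed = exposed_cases / exposed_total
    incidence_unexposed = unexposed_cases / unexposed_total
    return incidence_exposed / incidence_unexposed

# Hypothetical cohort: 30 of 100 exposed and 10 of 100 unexposed
# people develop the disease during follow-up.
rr = relative_risk(30, 100, 10, 100)
print(round(rr, 2))  # 3.0
```

An RR of 3 would mean that, in this invented cohort, the exposed group has three times the incidence of the unexposed group.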


10. Case control studies
       •    Analytic study design


       o Retrospective (looking back) in nature
       o We were not there to measure risk directly
       o Associate outcome (disease) with prior exposure
   •   Calculate indirect estimate of risk: odds ratio (OR)
   •   Compare the frequency of a risk factor in a group of cases and a
       group of controls
   •   There must be a comparison group that does not have the disease
   •   There must be enough people in the study so that chance does not
       play a large part in the observed results
   •   Groups must be comparable except for the factor of interest


   Direction of inquiry (looking back from outcome to prior exposure):
       Cases    → previously Exposed or Unexposed
       Controls → previously Exposed or Unexposed


   •   Advantages of case-control studies
       o No need to wait for a long time for disease to occur (causal or
            prognostic factors)
       o Most important method used to study rare diseases
       o Best design for diseases with a long latent period
       o Can evaluate multiple potential exposures
   •   Strengths
       o Very efficient design for rare outcomes
   •   Weaknesses
       o Does not allow for the examination of incidence or risk
               Cannot directly calculate incidence; the OR is only an indirect
               estimate of risk.
       o Increased susceptibility to bias in measurement of exposure
               Exposure & disease occurred “prior to” the study


              •      More potential for biases
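
As a rough sketch of the odds ratio used in case-control studies, the snippet below compares the odds of exposure among cases with the odds of exposure among controls. The counts are invented for illustration and the function name `odds_ratio` is an assumption.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    odds_cases = exposed_cases / unexposed_cases
    odds_controls = exposed_controls / unexposed_controls
    return odds_cases / odds_controls

# Hypothetical study: 40 of 100 cases and 20 of 100 controls were exposed.
or_estimate = odds_ratio(40, 60, 20, 80)
print(round(or_estimate, 2))  # 2.67
```

Because cases and controls are sampled separately, this ratio estimates risk only indirectly; incidence cannot be read from the table.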


11. Cross-sectional study
    •   Distinguishing features
        o Observation at one particular time or over a period
        o Exposure and outcomes measured at the same time
        o Information obtained from the subjects only once

    Observation: Population → Samples → at a single observation, each subject
    is classified as Exposed or Unexposed and as Disease or No Disease



    •   Categories of cross-sectional study
        o Descriptive
              Prevalence studies
              •      A point prevalence
                     o Disease occurrence at a particular point in time
                     o E.g. the point prevalence of upper respiratory tract
                        infection on 1st of July 2005
              •      A period prevalence
                     o Disease occurrence over a particular period of time
                     o E.g. the ten-year period prevalence (1996-2005) of
                        breast cancer in Malaysia.
        o Analytical
              An analytical cross-sectional study is valid only when the current
              values of the exposure are extremely stable over time
              Two types


      •   Classical cross-sectional
      •   Comparative cross-sectional study
          o A comparative way of conducting a cross-sectional study
          o Samples are drawn from two or more defined different
             populations
          o Measure exposure and outcome factors
          o Investigate the association between exposure and
             outcome
o Strengths of cross sectional studies
      Very quick and inexpensive to implement
      Useful for determining prevalence
      Appropriate for diagnostic test validity
o Weaknesses
      Difficulty in establishing causal links (the temporal relationship
      is unclear)
      Impractical for rare outcomes




                         DESCRIPTIVE STATISTICS


1. Definition:
   •   Statistics
       o A field of study concerned with the collection, organization and
          summarization of data.
   •   Statistical Methods
       o A scientific technique employed for collection, presentation,
          analysis and interpretation of data.
   •   Biostatistics
       o Statistics applied to the biological and medical fields
2. Uses of statistical methods
   •   To collect data in the best possible way
       o Designing form
       o Organizing
       o Conducting survey
   •   To describe the characteristics of a group or a situation
       o Data summary
       o Data presentation
   •   To analyse data and to draw conclusions
       o Scientific, logic
       o Decision making
3. Classification of statistics
   •   Descriptive statistics
       o Concerned with collection, organization, enumeration of frequency
          of characteristics, summarization and presentation of data.
       o Describe characteristics of the observed data
             Type of variable
             Summary statistics
             Distribution
             Graphical presentation
   •   Inferential statistics
       o Analytical in nature


       o Involve hypothesis testing and confidence interval
       o Allows researcher to infer/ generalize the characteristics of the
          sample (statistic) to the population (parameter)
4. Terms:
   •   Population:
       o Full set of individuals
       o Collection of items: objects, things, people
       o Parameter – descriptive measure from population data
   •   Sample
       o Subset of population
       o Selected to represent the population by sampling technique
       o Statistic – descriptive measure from sample data
   •   Variable
       o Any characteristic of an event/object/person
       o The characteristic being observed/measured
       o E.g. age, sex, race, height, weight, etc.
   •   Data
       o The raw or original measurements
       o Values taken by the characteristics
       o E.g. Malay, female, 155cm
5. Classification of variables




•   Discrete
    o Characterized by gaps or interruptions in the values
    o Values can assume only whole numbers
    o Mainly count
    o E.g. no of students, no of teeth extracted
•   Continuous
    o No gap or interruption
    o Any value within specified interval
    o Mainly measurement
    o E.g. height, weight, BP, age, etc
•   Nominal
    o Unordered categories
    o No implied order among the categories
    o E.g. race, sex, medical diagnosis, etc.
•   Ordinal
    o Ordered categories
    o Ranked according to some criteria
    o E.g. BP – high, normal, low.




6. Categorical Variables:
   •   Data presentation
       o Statistics
            Frequency
            Percentage (%)
       o Graphical
            Pie chart
            Bar chart




7. Numerical variables
   •   Measures of central tendency
       o A measure of centrality




            Mean
            •   Arithmetic average
            •   Adding all the values in a population/sample and divided by
                the number of values that are added
            •   Affected by the extreme value




          Median
          •   The middle value of the data ordered from the lowest to the
              highest → arrange all values in order
          •   If n is an odd number → the median is the middle observation
          •   If n is an even number → the median is the mean of the 2 middle
              observations
          •   50th percentile of a set of observations
          •   Useful for data with a non-normal distribution or skewed data
          •   Less sensitive to extreme values than the mean
          •   Reported as median (IQR)
          Mode
          •   The most frequent observation
          •   Point of maximum concentration
•   Measures of dispersion/variability
    o Range = largest value – smallest value
          Difference between the largest and smallest values in a set of
          observations
          Give idea about the variability of data
          Simplest to compute
          Sensitive to outliers
          Least useful
          R = Xmax - Xmin
    o Variance = s²
          Sum of squared deviations of the observations from the
          mean, divided by the number of degrees of freedom
          Average squared deviation of the observations from the
          sample mean
          Measures the amount of variability or spread about/from the
          mean of a sample
          s² = Σ(xᵢ – x̄)² / (n – 1)



       o Standard deviation (SD)
             The square root of the variance
             The root mean square of the distances (or differences) from the
             sample mean
             A better measure of the variability of a set of data
             A smaller SD indicates that the observations lie closer to the mean
             Reported as mean (SD)
             s = √[Σ(xᵢ – x̄)² / (n – 1)]




       o Interquartile range (IQR)
             Q3 – Q1
             Range between 25th and 75th percentile
             Used along with the median
             It is not affected by outliers
       o Percentile = 25th, 50th, 75th, 90th, 95th
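
The measures of central tendency and dispersion listed above can be tried out with Python's standard `statistics` module. The readings below are hypothetical values chosen for illustration, with one deliberately extreme value (170) to show how the mean and the median react differently.

```python
import statistics

# Hypothetical systolic blood pressure readings (mmHg), for illustration only.
values = [110, 118, 120, 120, 124, 130, 136, 170]

mean = statistics.mean(values)            # arithmetic average
median = statistics.median(values)        # middle value (mean of the 2 middle ones here)
mode = statistics.mode(values)            # most frequent observation
value_range = max(values) - min(values)   # R = Xmax - Xmin
variance = statistics.variance(values)    # sample variance, divides by n - 1
sd = statistics.stdev(values)             # square root of the sample variance
q1, q2, q3 = statistics.quantiles(values, n=4)  # quartiles (default method)
iqr = q3 - q1                             # interquartile range, Q3 - Q1

print(mean, median, mode, value_range)    # 128.5 122.0 120 60
print(round(variance, 1), round(sd, 1), iqr)
```

The extreme value pulls the mean (128.5) above the median (122), which is why median (IQR) is preferred for skewed data. Note that different software packages compute quartiles with slightly different rules, so the IQR may differ a little between tools.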
8. Normal distribution
   •   Characteristic
       o Bell shaped appearance
       o Symmetrical about the mean
       o Mean = median = mode
       o Total area under the curve = 1
       o The curve never touches the x-axis
       o SD usually less than 30% of mean value
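   •   Approximately
       o 68% of observations lie within ± 1 SD of the mean
       o 95% of observations lie within ± 2 SD of the mean
       o 99.7% of observations lie within ± 3 SD of the mean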
   •   Mean (SD)
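
The approximate 68% / 95% / 99.7% figures can be checked with the standard library's `NormalDist` class. This is a small sketch; the helper name `proportion_within` is an assumption made for the example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def proportion_within(k):
    """Proportion of a normal distribution lying within k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(proportion_within(k) * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```

The exact multiplier for 95% is about 1.96 SD rather than 2, which is why 1.96 appears in 95% confidence interval formulas.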




9. Data presentation (numerical data)
   •   Statistics
       o Mean (SD)
       o Median (IQR)
   •   Graphical
       o Histogram
             Frequency distribution of quantitative/continuous data
             Bars represent the frequency for each class interval
             No spaces between bars
             May have equal/unequal class interval
       o Box plot




10. Summary
  •   Categorical data
      o Statistics
            Frequency (%)
      o Graphs
            Bar chart
            Pie chart
  •   Numerical data
      o Statistics
            Mean (SD)
            Median (IQR)
      o Graphs
            Histogram
            Box Plot




                 HYPOTHESIS FORMULATION AND TESTING


1. Hypothesis:
   •   A statement about one or more populations
   •   Research question
       o Statement
       o Research hypothesis
   •   Postulating the existence of:
       o A difference between groups
       o An association among factors
   •   Usually derived from a hunch, an educated guess based on published
       results or preliminary observations.
   •   There are 2 types:
       o Null hypothesis (H0)
             Hypothesis of no difference
             The hypothesis to be tested
       o Alternative hypothesis (HA)
             The hypothesis that postulates that there is a treatment effect,
             an association among factors or a difference between groups.
   •   Inferential statistics – estimating the probability that a given outcome
       is due to chance
   •   If the sample data provide sufficient evidence to discredit H0 → reject
       H0 in favor of HA.


2. Hypothesis Testing:
   •   To aid the researcher in reaching a decision concerning a population by
       examining a sample.
   •   Observed differences or associations may have occurred by chance.
   •   H0 : the proportion of patients with disease who die after treatment
       with the new drug is not different from the proportion of similar
       patients who die after treatment with placebo.




   •   HA : the proportion of patients with disease who die after treatment
       with the new drug is lower than the proportion of similar patients who
       die after treatment with placebo.


Statistical decision     In the population
based on a sample        H0 true                         H0 false
-----------------------  ------------------------------  ------------------------------
Do not reject H0         Correct                         Type II error
                         (confidence level) 1 – α        Beta (β)
                         (level of certainty that our    (probability of wrongly not
                         statistical data hold true)     rejecting H0 when H0 is
                                                         false)

Reject H0                Type I error                    Correct
                         (level of significance)         (power of study) 1 – β
                         Alpha (α)                       (probability that H0 is
                         (probability of wrongly         correctly rejected)
                         rejecting H0 when H0 is
                         true)



   •   Type I error (α)
       o The probability of wrongly rejecting the null hypothesis when the
          null hypothesis is true.
   •   Type II error (β)
       o The probability of wrongly not rejecting the null hypothesis when
          the null hypothesis is false.
   •   The test statistic
       o A value with a known distribution when the null hypothesis is true.
   •   Normal distribution (refer to descriptive statistics note)
   •   Level of significance (α)
       o The null hypothesis is rejected if the probability of obtaining a
          value as extreme or more extreme than that observed in the sample
          is small when the null hypothesis is true.
       o “Small” is usually taken to be less than or equal to 5%
       o If a 2-tail test is used, α is divided by 2 so that each tail has an
          area of α/2
             E.g. α = 0.05, 2-tail test: the significance level in each tail
             = 0.05/2 = 0.025. A p value of 0.05 in one tail is greater than
             0.025, so the conclusion is: do not reject H0.
             (Equivalently, a two-tailed p value is compared with α itself.)
   •   The p value
       o The probability of obtaining a value as extreme or more extreme
          than that observed in the sample, given that the null hypothesis is
          true, is called the p value
       o The smallest value of α for which the null hypothesis can be
          rejected
       o The p value is compared to the predetermined significance level α
          (usually 0.05) to decide whether the null hypothesis should be
          rejected
       o If the p value is less than α, reject H0.
       o If the p value is greater than α, do not reject H0.
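
The decision rule above can be sketched in Python for the simple case of a test statistic that follows the standard normal distribution (a z statistic). The z value is hypothetical and the function names are assumptions made for this illustration.

```python
from statistics import NormalDist

def two_tailed_p_value(z):
    """Two-tailed p value for a z statistic under the standard normal."""
    one_tail_area = 1 - NormalDist().cdf(abs(z))
    return 2 * one_tail_area

def reject_null(p_value, alpha=0.05):
    """Decision rule: reject H0 when the p value is less than alpha."""
    return p_value < alpha

z = 2.10  # hypothetical test statistic
p = two_tailed_p_value(z)
print(round(p, 4), reject_null(p))  # 0.0357 True
```

Comparing the two-tailed p value with α gives the same decision as comparing the area in one tail (p/2) with α/2, so the two ways of stating the rule agree.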


3. Steps in hypothesis testing
   •   Step 1
       o Generate the null hypothesis and alternative hypothesis
             H0 : ??
             HA : ??
             What are the characteristics of interest?
             •   E.g. mean, proportion
             •   1-tail (one-sided) or 2-tail (two-sided)
                 o E.g. 1-tail research hypothesis
                       The proportion of patients with disease who die after
                       treatment with the new drug is lower than the proportion
                       of similar patients who die after treatment with placebo
                 o E.g. 2-tail research hypothesis
                       The mean blood pressure of patients in the new
                       treatment group is different from the mean blood
                       pressure of patients in the old treatment group
             E.g. research questions:
             •   Effectiveness of a new antihypertensive drug




              •   H0: the mean blood pressure of patients in the new
                  treatment group is not different from the mean blood
                  pressure of patients in the old treatment group (µ1 = µ2)
              •   HA: the mean blood pressure of patients in the new treatment
                  group is different from the mean blood pressure of patients
                  in the old treatment group (µ1 ≠ µ2)

                   Notes:
                   1. In the population: µ = mean, σ = standard deviation
                   2. In the sample: x̄ = mean, SD = standard deviation

•   Step 2:
    o Set the significance level
          Usually set at 0.05; 0.01 and 0.1 are also used
•   Step 3:
    o Decide which statistical test to use and check the assumption of the
       test
          Population is approximately normally distributed
          Data values are obtained by independent random sampling
          Adequate sample size
    o To decide which statistical test should be used
          E.g. mean, proportion
    o Assumption must be adequately met
    o If not met, alternative procedures can be used
          E.g. a non-parametric test would be used when the data are seriously
          non-normal
•   Step 4:
    o Compute the test statistic and associated p value
          Calculate appropriate test statistics
•   Step 5:
    o Interpretation
          Compare p value with the level of significance
          Decide whether or not to reject the null hypothesis
          p value < α – reject the null hypothesis
          p value > α – do not reject the null hypothesis
•   Step 6:
    o Draw conclusions
          Conclude accordingly based on rejecting/not rejecting null
          hypothesis
    o Decision rule
          Rejection region
          •   Reject the null hypothesis if the value of the test statistic
              computed from the sample is one of the values in the rejection
              region
          Acceptance region
          •   Do not reject the null hypothesis if the computed value is in the
              acceptance region
          E.g. conclusion:
          •   The mean blood pressure of patients in the new treatment group is
              different from the mean blood pressure of patients in the old
              treatment group




                              CONFIDENCE INTERVAL


1. Relationship between confidence interval and hypothesis test



   [Figure: distribution curve with the central area of 0.95 marked as the
   confidence interval]




2. Confidence Interval
   •   Standard Deviation (SD) vs. Standard Error (SE)



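       [Figure: repeated samples are drawn from a population with mean (µ) and
       SD (σ); each sample has its own mean (x̄) and SD (s), and the variation
       among the sample means is the standard error]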


       o SD – a measure of the variability of individual observations
       o SE – a measure of the variability of a summary statistic
                E.g. variability of the sample mean or sample proportion
       o SE (SEM) – a special type of standard deviation (the standard
          deviation of a sample statistic); it depends on
                Standard deviation
                Sample size


       o The sample mean varies from sample to sample (as measured by SE)
       o How close is the sample statistic to the population parameter,
          mean (µ) and SD (σ)?
       o Sample to sample variation of the statistic (sample statistic)
       o The population parameter is likely to fall within the confidence
          interval, between its lower limit and upper limit




3. General Comments on Confidence Interval
  •   A measure of the precision of a sample statistic as an estimate of a
      population parameter
  •   Confidence interval = estimate ± k × (standard error)
  •   90% CI, 95% CI, 99% CI
      o 95% CI interpretation – we are 95% confident that the population
         parameter lies within its limits.

  •   The width of the CI depends on the SE
      o Sample size: larger sample → narrower CI → more precise estimate
      o Variation: more variation → wider CI → less precise estimate




  •   A confidence interval can be calculated for:
      o Mean
      o Relative risk
      o Odds ratio
      o Hazard ratio
      o Correlation coefficient
      o Regression coefficient
      o Etc…
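
The formula "estimate ± k × (standard error)" can be sketched for a sample mean as follows. This is an illustration under stated assumptions: the standard error of a mean is taken as SD/√n, k is taken from the standard normal distribution (about 1.96 for 95%), which is a large-sample approximation, and the summary values are invented.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """CI for a mean: estimate +/- k x SE, with SE = SD / sqrt(n).

    k comes from the standard normal distribution; small samples would
    normally use the t distribution instead.
    """
    standard_error = sample_sd / math.sqrt(n)
    k = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return sample_mean - k * standard_error, sample_mean + k * standard_error

# Hypothetical summary data: mean 120, SD 15.
print(mean_confidence_interval(120, 15, 100))  # about (117.06, 122.94)
print(mean_confidence_interval(120, 15, 400))  # narrower: about (118.53, 121.47)
```

Quadrupling the sample size halves the standard error, so the interval becomes half as wide and the estimate more precise.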




           EXPLORATORY DATA ANALYSIS (NUMERICAL DATA)




1. Hypothesis test for:
   •   Single mean
   •   Difference between two means for independent samples
   •   Difference between two means for dependent (or paired) samples.


2. Single Mean:
   •   Step 1: Null and Alternative Hypothesis
       o H0: The mean serum amylase level in the population from which
          the sample was drawn is 120 units/100ml.
       o HA: The mean serum amylase level in the population from which
          the sample was drawn is different from 120 units/ 100ml. (two-
          sided test)
   •   Step 2: Level of significance
       o Alpha = 0.05 (alpha/2 = 0.025)
   •   Step 3: Check the assumption
       o Population is approximately normally distributed
       o Random sampling
       o Independent observations
   •   Step 4: Statistical test (one sample t test)
       o t = (x̄ – µ0) / (s/√n)
       o where
             x̄ = sample mean
             s = sample standard deviation
             n = sample size
             µ0 = the hypothesized mean
             the t statistic has n – 1 degrees of freedom
   •   Step 5: Interpretation
       o p-value < 0.025
       o Reject H0
   •   Step 6: Conclusion


      o The population mean serum amylase is statistically significantly
         different from 120 units/ 100ml.
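
The one sample t statistic can be computed by hand as a rough check. The summary values below (n = 15, mean = 96, SD = 35) are hypothetical and are not given in these notes; the critical value 2.145 is the two-sided 5% value for 14 degrees of freedom from a standard t table.

```python
import math

def one_sample_t(sample_mean, sample_sd, n, hypothesized_mean):
    """Return (t statistic, degrees of freedom) for a one sample t test."""
    standard_error = sample_sd / math.sqrt(n)
    t = (sample_mean - hypothesized_mean) / standard_error
    return t, n - 1

# Hypothetical summary data; H0: population mean = 120 units/100ml.
t, df = one_sample_t(96, 35, 15, 120)

# Two-sided critical value for alpha = 0.05 with 14 df (from a t table).
T_CRITICAL_DF14 = 2.145

print(round(t, 2), df, abs(t) > T_CRITICAL_DF14)  # -2.66 14 True
```

Because |t| exceeds the critical value, H0 would be rejected for these invented numbers.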


3. Difference between two means for independent samples
  •   Step 1: Null and Alternative Hypothesis
      o H0: The mean serum amylase levels in hospitalized and healthy
         subjects are the same (µ1 = µ2)
      o HA: The mean serum amylase levels in hospitalized and healthy
         subjects are different (µ1 ≠ µ2)   (two-tailed)
  •   Step 2: Level of significance
      o Alpha = 0.05 (alpha/2 = 0.025)
  •   Step 3: Check the assumption
      o The two populations are normally distributed
      o The two populations have equal variances (Levene’s test)
      o Both are independent samples
      o Random samples
  •   Step 4: Statistical test
      o Name of the test = independent t-test
      o t statistic = (x̄1 – x̄2) / √[sp² (1/n1 + 1/n2)]
            where: sp² = [(n1 – 1)s1² + (n2 – 1)s2²] / (n1 + n2 – 2)
            •   x̄1, x̄2 = sample means
            •   s1, s2 = sample standard deviations
            •   n1, n2 = sample sizes
            •   the t statistic has n1 + n2 – 2 degrees of freedom (df)
      o Degrees of freedom
      o p-value
      o 95% confidence interval (lower limit & upper limit)
  •   Step 5: Interpretation
      o P-value < 0.025
      o Reject null hypothesis
  •   Step 6: Conclusion




     o At the 5% level of significance, the mean serum amylase levels
        are different in healthy and hospitalized subjects.
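
As a minimal sketch of the independent t-test formula in Step 4, the function below computes the pooled variance and the t statistic. The function name pooled_t and both data sets are hypothetical, chosen only to show the arithmetic; equal variances are assumed, as in the notes.

```python
import math
import statistics

def pooled_t(sample1, sample2):
    """Return (t statistic, df) for two independent samples, equal variances assumed."""
    n1, n2 = len(sample1), len(sample2)
    s1_sq = statistics.variance(sample1)      # sample variance of group 1
    s2_sq = statistics.variance(sample2)      # sample variance of group 2
    # Pooled variance: weighted average of the two sample variances.
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    diff = statistics.mean(sample1) - statistics.mean(sample2)
    t = diff / math.sqrt(sp_sq * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical serum amylase values, invented for illustration.
hospitalized = [120, 115, 130, 125, 110, 135]
healthy = [96, 102, 99, 90, 105, 93]
t_stat, df = pooled_t(hospitalized, healthy)
print(f"t = {t_stat:.2f}, df = {df}")
```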


4. Difference between two means for dependent (or paired) samples
  o Research question: the investigators wanted to determine if treatment
     with aminophylline altered the average number of apneic episodes per
     hour
  o Step 1: Null and Alternative Hypothesis
     o H0: there is no difference in the average number of apneic episodes
        before and after aminophylline (mean difference = zero)
     o HA: the average number of apneic episodes before and after
        aminophylline is different (mean difference ≠ zero) (two-tailed)
  o Step 2: Level of significance
     o Alpha = 0.05 (alpha/2 = 0.025)
  o Step 3: Check the assumption
     o The differences are normally distributed in the population
     o The two samples are dependent (paired)
     o Random sampling
  o Step 4: Statistical test (paired t test)
     o t = d̄ / (sd / √n)
     o where
            d̄ = mean of the differences
            sd = standard deviation of the differences
            n = number of pairs
            the t statistic has n – 1 degrees of freedom
  o Step 5: Interpretation
     o Two-tailed p-value < 0.05 (area in each tail < 0.025)
     o Reject the null hypothesis
  o Step 6: Conclusion
     o At the 5% level of significance, the average number of apneic
        episodes before and after aminophylline is different.
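
The paired t statistic works on the within-pair differences, which the sketch below makes explicit. The function name paired_t and the before/after counts are hypothetical values invented for this example, not data from the study described above.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t statistic, df) for paired samples using the differences."""
    diffs = [b - a for b, a in zip(before, after)]   # one difference per pair
    n = len(diffs)
    d_bar = statistics.mean(diffs)                   # mean of the differences
    s_d = statistics.stdev(diffs)                    # SD of the differences
    return d_bar / (s_d / math.sqrt(n)), n - 1

# Hypothetical apneic episodes per hour, before and after treatment.
before = [1.7, 2.1, 1.5, 2.4, 1.9, 2.2]
after = [1.2, 1.8, 1.1, 1.6, 1.7, 1.5]
t_stat, df = paired_t(before, after)
print(f"t = {t_stat:.2f}, df = {df}")
```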




                 UNIVARIATE ANALYSIS OF NUMERICAL DATA


1. Univariate analysis explores each variable in a data set separately:
   • It looks at the range of values
   • The central tendency of the values
   • It describes the pattern of responses to the variable
   • It describes each variable on its own
2. Univariate analysis
   • Categorical variable (e.g. housing)
   • Numerical variable (e.g. age)
3. Univariate analysis - Numeric
   • Statistical analysis
      o Point estimation
           Count, min, max, average, median, mode.
      o Dispersion
           Range, standard deviation, variance, coefficient of variation
           Skewness, kurtosis
      o Missing value
      o Outliers
      o Binning
   • Visualization
      o Histogram, box plot, etc.


                         Univariate Analysis – Numeric


                                       Age
Count        900            Average     35.25       St Dev                11.20
Min          19             Median      33          Variance              125.37
Max          75             Mode        27          Coeff. of variation   32%
Range        55             Skewness    1.09
Missing      0              Kurtosis    0.88
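
A summary like the one above can be produced with Python's standard statistics module. The sketch below is illustrative only: the function name summarize and the list of ages are hypothetical. It also checks that the 32% in the table equals the standard deviation divided by the mean (11.20 / 35.25), which is the coefficient of variation.

```python
import statistics

def summarize(values):
    """Return basic univariate statistics for a numeric variable."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "sd": sd,
        "variance": statistics.variance(values),
        "cv_percent": 100 * sd / mean,      # coefficient of variation
    }

ages = [19, 23, 27, 27, 31, 33, 38, 45, 52, 75]   # hypothetical ages
summary = summarize(ages)

# Coefficient of variation from the table's own SD and mean.
table_cv = 100 * 11.20 / 35.25
print(summary["median"], summary["mode"], round(table_cv))
```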




                          Univariate Analysis – Challenges


                                     Variable
               Categorical                               Numeric
Missing values                               Missing values
Invalid values                               Outliers
Numerization                                 Binning


4. Missing data
   • Data entry error
   • Data processing error
   • Certain data may not be available at the time of entry
   • How to handle missing data
    o Fill in the missing values manually
    o Ignore the records with missing data
    o Fill it in automatically
              A global constant (e.g. “?”)
              The variable mean
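
Two of the options above, ignoring records with missing data and filling in the variable mean, can be sketched as follows. Missing values are represented by None here, which is an assumption of this example; the function names are hypothetical.

```python
import statistics

def drop_missing(values):
    """Ignore the records with missing data."""
    return [v for v in values if v is not None]

def fill_with_mean(values):
    """Replace each missing value with the mean of the observed values."""
    mean = statistics.mean(drop_missing(values))
    return [mean if v is None else v for v in values]

ages = [10, None, 20]                 # hypothetical data with one missing value
print(drop_missing(ages))             # [10, 20]
print(fill_with_mean(ages))           # [10, 15, 20]
```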
5. Outliers
   • Data points inconsistent with the majority of data
   • Types of outliers
    o Valid: CEO’s salary
    o Noisy: one’s age = 200, widely deviated points
   • Detection and removal methods
    o Box plot
    o Clustering
    o Curve-fitting
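
The notes do not say how a box plot flags outliers. A common convention, assumed here, is to flag points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule; the function name and the multiplier k are choices made for this example.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed box plot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

ages = [18, 19, 19, 19, 20, 20, 21, 22, 85]
print(boxplot_outliers(ages))         # [85]
```

Whether a flagged point is a valid extreme (the CEO's salary) or noise (an age of 200) still has to be judged by the researcher.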
6. Binning
   • Binning is the process of transforming continuous variables into
     categorical counterparts
   • Binning methods
    o Equal-width



    o Equal-frequency
    o Entropy-based methods
   • Variable values (e.g. age)
    o 0, 4, 12, 16, 16, 18, 24, 26, 28
   • Equal-width binning
    o Bin 1: 0, 4                 [-, 10] bin
    o Bin 2: 12, 16, 16, 18       [10, 20] bin
    o Bin 3: 24, 26, 28           [20, +] bin
   • Equal-frequency
    o Bin 1: 0, 4, 12             [-, 14] bin
    o Bin 2: 16, 16, 18           [14, 21] bin
    o Bin 3: 24, 26, 28           [21, +] bin
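
The two binning methods above can be reproduced with a short sketch. The function names are hypothetical, values are assumed to be non-negative, and the bin width of 10 is taken from the equal-width example in the notes.

```python
def equal_width_bins(values, width, n_bins):
    """Assign non-negative values to bins of equal width; the last bin is open-ended."""
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        index = min(int(v // width), n_bins - 1)
        bins[index].append(v)
    return bins

def equal_frequency_bins(values, n_bins):
    """Split the sorted values into bins holding (about) the same number of values."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    bins = [ordered[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    bins.append(ordered[(n_bins - 1) * size:])   # any remainder goes in the last bin
    return bins

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
print(equal_width_bins(ages, width=10, n_bins=3))
print(equal_frequency_bins(ages, n_bins=3))
```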
7. Numerization
   • Numerization is the process of transforming categorical variables into
    numerical counterparts.
   • Numerization methods
    o Binary method
    o Ordinal method
   • Variable values (e.g. housing)
    o For free, own, rent
   • Binary method
    o For free: 1, 0, 0
    o Own: 0, 1, 0
    o Rent: 0, 0, 1
   • Ordinal method
    o Own: 5
    o For free: 3
    o Rent: 1
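
Both numerization methods for the housing variable can be sketched as below. The codes are the ones listed above; the function and constant names are hypothetical.

```python
CATEGORIES = ["for free", "own", "rent"]
ORDINAL_CODES = {"own": 5, "for free": 3, "rent": 1}

def binary_encode(value):
    """Binary method: one 0/1 indicator per category."""
    return [1 if value == category else 0 for category in CATEGORIES]

def ordinal_encode(value):
    """Ordinal method: a single code that preserves the ordering."""
    return ORDINAL_CODES[value]

for housing in CATEGORIES:
    print(housing, binary_encode(housing), ordinal_encode(housing))
```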
8. Quantification
   • Introduction
    o To conduct quantitative analysis, responses to open-ended questions
       in survey research and the raw data collected using qualitative
       methods must be coded numerically.


    o Most responses to survey research questions already are recorded in
       numerical format
          In mailed and face-to-face surveys, responses are keypunched
          into a data file.
          In telephone and internet surveys, responses are automatically
          recorded in numerical format.
   • Developing code categories
    o Coding qualitative data can use an existing scheme or one developed
       by examining the data.
    o Coding qualitative data into numerical categories sometimes can be
       a straightforward process
          Coding occupation, for example, can rely upon numerical
          categories defined by the Bureau of the Census.
    o Coding most forms of qualitative data, however, requires much effort
    o This coding typically requires using an iterative procedure of trial
       and error
    o Consider, for example, coding responses to the question, “What is
       the biggest problem in attending college today?”
    o The researcher must develop a set of codes that are:
          Exhaustive of the full range of responses
          Mutually exclusive (mostly) of one another.
    o In coding responses to the question, “What is the biggest problem in
       attending college today?” the researcher might begin, for example,
       with a list of 5 categories, then realize that 8 would be better, then
       realize that it would be better to combine and use a total of 7
       categories
    o Each time the researcher makes a change in the coding scheme, it is
       necessary to restart the coding process to code all responses using
       the same scheme
9. Distribution
   • Data analysis begins by examining distributions




  • One might begin, for example, by examining the distribution of
    responses to a question about formal education, where responses are
    recorded within six categories
  • A frequency distribution will show the number and percent of
    responses in each category of a variable
10. Central tendency
  • A common measure of central tendency is the average or mean of the
    responses
  • The median is the value of the middle case when all responses are
    rank-ordered
  • The mode is the most common response
  • When data are highly skewed, meaning heavily weighted toward one
    end of the distribution, the median or mode might better represent
    the most common or central response.
  • Consider this distribution of respondent ages:
    o 18, 19, 19, 19, 20, 20, 21, 22, 85
  • The mean equals 27. But this number does not adequately represent
    the common respondent because the one person who is 85 skews the
    distribution toward the high end.
  • The median equals 20
  • This measure of central tendency gives a more accurate portrayal of the
    middle of the distribution
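
The three measures for the ages listed above can be checked with Python's standard statistics module, as a quick illustration:

```python
import statistics

ages = [18, 19, 19, 19, 20, 20, 21, 22, 85]

mean_age = statistics.mean(ages)       # pulled upward by the single value 85
median_age = statistics.median(ages)   # middle value of the ordered list
mode_age = statistics.mode(ages)       # most frequent value

print(mean_age, median_age, mode_age)  # 27 20 19
```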
11. Dispersion
  • Dispersion refers to the way the values are distributed around some
    central value, typically the mean.
  • The range is the distance separating the lowest and highest values (e.g.
    the ages listed previously run from 18 to 85, a range of 67)
  • The standard deviation is an index of the amount of variability in a set
    of data
  • The standard deviation represents dispersion with respect to the normal
    (bell-shaped) curve




  • Assuming a set of numbers is normally distributed, then each standard
    deviation equals a certain distance from the mean.
  • Each standard deviation (+1, +2, etc) is the same distance from each
    other on the bell-shaped curve, but represents a declining percentage of
    responses because of the shape of the curve.
  • For example, the first standard deviation accounts for 34.1% of the
    values on each side of the mean
    o The figure 34.1% is derived from probability theory and the shape of
       the curve.
  • Thus approximately 68% of all responses fall within one standard
    deviation of the mean
  • The second standard deviation accounts for the next 13.6% of the
    responses from the mean (27.2% of all responses) and so on.
  • Dispersion measures
    o Spread around the mean
          Variance – too abstract, a step towards standard deviation
          Standard deviation (from mean) – more intuitive
    o Standard deviation
          Average distance between mean and each value in data set
          Translates variance into same scale as mean and all the values
          High values indicate widely scattered responses
  • If the responses are approximately normally distributed and the range of
    responses is low – meaning that most responses fall close to the mean
    – then the standard deviation will be small
    o The standard deviation of professional golfers’ scores on a golf course
       will be low
    o The standard deviation of amateur golfers’ scores on a golf course
       will be high
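
The percentages quoted above for the normal curve (34.1%, 13.6%, about 68%) can be verified with the standard normal distribution in Python's statistics module. This is only a numerical check of the figures, not part of the original notes.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1

first_sd_one_side = z.cdf(1) - z.cdf(0)     # mean to +1 SD, about 0.341
second_sd_one_side = z.cdf(2) - z.cdf(1)    # +1 SD to +2 SD, about 0.136
within_one_sd = z.cdf(1) - z.cdf(-1)        # about 0.683
within_two_sd = z.cdf(2) - z.cdf(-2)        # about 0.954

print(f"{first_sd_one_side:.3f} {second_sd_one_side:.3f}")
print(f"{within_one_sd:.3f} {within_two_sd:.3f}")
```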
13. Continuous and Discrete Variables
  • Continuous variables have responses that form a steady progression
    (e.g. age, income)
  • Discrete (i.e. categorical) variables have responses that are considered
    to be separate from one another (e.g. sex, religion)


   • Sometimes, it is a matter of debate within the community of scholars
    whether a measured variable is continuous or discrete
   • This issue is important because the statistical procedures appropriate
    for continuous-level data differ from those for discrete data, especially
    as related to the measurement of the dependent variable.
   • Example: suppose one measures amount of formal education within
    five categories (less than hs, hs, 2 years vocational/college, college, post
    college)
   • Is this measure continuous or discrete?
   • In practice, five categories seems to be the cut-off point for considering a
    variable as continuous
   • Using a seven-point response scale will give the researcher greater
    chance of deeming a variable to be continuous.
14. Subgroup comparison
   • Collapsing response categories
    o Sometimes the researcher might want to analyse a variable by using
        fewer response categories than were used to measure it
    o In these instances, the researcher might want to collapse one or
        more categories into a single category
    o The researcher might want to collapse categories to simplify the
        presentation of the results or because few observations exist within
        some categories
   • Collapsing response example


Response                                Frequency
Strongly disagree                            2
Disagree                                     22
Neither agree nor disagree                   45
Agree                                        31
Strongly agree                               1




One might want to collapse the extreme responses and work with just three
categories


Response                                Frequency
Disagree                                     24
Neither agree nor disagree                   45
Agree                                        32
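
Collapsing the five response categories into three, as in the two tables above, amounts to mapping each original category to a new label and adding up the frequencies. A small sketch, with hypothetical variable names:

```python
from collections import Counter

frequencies = {
    "Strongly disagree": 2,
    "Disagree": 22,
    "Neither agree nor disagree": 45,
    "Agree": 31,
    "Strongly agree": 1,
}

# Each original category is mapped to its collapsed category.
COLLAPSE_MAP = {
    "Strongly disagree": "Disagree",
    "Disagree": "Disagree",
    "Neither agree nor disagree": "Neither agree nor disagree",
    "Agree": "Agree",
    "Strongly agree": "Agree",
}

collapsed = Counter()
for response, count in frequencies.items():
    collapsed[COLLAPSE_MAP[response]] += count

print(dict(collapsed))
```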


   • Handling “Don’t Know”
    o When asking about knowledge of factual information (“Does your
        teenager drink alcohol?”) or opinion on a topic the subject might not
        know much about (“Do school officials do enough to discourage
        teenagers from drinking alcohol?”), it is wise to include a “don’t
        know” category as a possible response.
    o Analyzing “don’t know” responses, however, can be a difficult task
    o The research-on-research literature regarding this issue is complex
        and without clear-cut guidelines for decision making
    o Decisions about whether to use “don’t know” response categories
        and how to code and analyse them tend to be idiosyncratic to the
        research and the researcher.




              UNIVARIATE ANALYSIS OF CATEGORICAL DATA


1. Categorical data analysis
   •   One proportion
       o Chi-square goodness of fit
   •   Two proportions (independent samples)
       o Pearson chi-square / Fisher’s exact test
   •   Dependent samples (matched or paired)
       o McNemar’s test
   •   Stratified sampling to control for confounder effects
       o Mantel-Haenszel test
2. Two proportions (independent samples) – Pearson chi-square & Fisher’s
exact test.
   •   To test the association between two categorical variables
   •   IHD vs. Gender
       o Is gender associated with IHD status?
   •   Result of test
       o Not significant → no association
       o Significant → an association
   •   Step 1: State the hypothesis
       o H0: There is no association between gender and IHD
       o HA: There is an association between gender and IHD
   •   Step 2: Set the significance level
       o How much error are we willing to accept in estimating the
          proportion in the population?
       o Usually: α = 0.05
   •   Step 3: Check the assumptions
       o The two samples are independent
       o The two variables are categorical
       o If more than 20% of cells have an expected count of less than 5,
          use Fisher’s exact test; otherwise use the Pearson chi-square test.
              Expected count = [row total x column total]/grand total



  •   Step 4: Statistical test
      o Chi-square test or
      o Fisher’s exact test
      o χ² = Σ (O – E)² / E
      o Chi-square value:
             As the difference between observed and expected counts increases,
             the chi-square value increases → the p-value decreases → the
             result becomes more significant
  •   Step 5: Interpretation
      o p value = 0.123
             do not reject H0
      o There is no significant association between gender and IHD status.
  •   Step 6: conclusion
      o There is no significant association between gender and IHD status
         using the Pearson chi-square test (p-value = 0.123)
  •   Data presentation


Table 1: Association between IHD and gender
                             IHD
  Variable           Yes             No          χ² stat (df)      p-value
                    n (%)          n (%)
Gender
  Male             15 (60)         10 (40)       2.381 (1)         *0.123
  Female           20 (80)         5 (20)
* Pearson chi-square test
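
The figures in Table 1 can be reproduced from the four observed counts. The sketch below computes the expected counts and the Pearson chi-square statistic; for a 2 x 2 table (1 degree of freedom) the p-value can be obtained from the complementary error function, so no external library is needed. The function name is hypothetical.

```python
import math

def chi_square_2x2(table):
    """Return (chi-square, p-value, expected counts) for a 2 x 2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    expected = []
    for i, row in enumerate(table):
        expected_row = []
        for j, observed in enumerate(row):
            # Expected count = row total x column total / grand total
            e = row_totals[i] * col_totals[j] / grand_total
            expected_row.append(e)
            chi2 += (observed - e) ** 2 / e
        expected.append(expected_row)
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, expected

observed = [[15, 10],   # male:   IHD yes, IHD no
            [20, 5]]    # female: IHD yes, IHD no
chi2, p_value, expected = chi_square_2x2(observed)
print(f"chi-square = {chi2:.3f}, p = {p_value:.3f}")
```

All expected counts here are 7.5 or more, so the Pearson chi-square test is the appropriate choice under the rule in Step 3.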
3. Two proportions (dependent samples) – McNemar’s test
  •   Dependent sample (matched or paired sample)
  •   χ² = (|b – c| – 1)² / (b + c)     (with continuity correction)
  •   Discordant pair
      o Is a pair with different outcomes
      o Used to test the difference in the outcome

                                  SM
                            Live        Die
            RM     Live     a**         *b
                   Die      *c          d**

            * Discordant
            ** Concordant

  •   Sample of 25 pairs of patients with breast cancer
    o Matched for age
    o Undergone
          Simple Mastectomy (SM)
          Radical Mastectomy (RM)
    o Difference of 5-year survival proportion between the two groups
•   Step 1: state the null and alternative hypothesis
    o H0: there is no association between type of mastectomy and 5-year
       survival proportion in patients with breast cancer
    o HA: there is an association between type of mastectomy and 5-year
       survival proportion in patients with breast cancer.
•   Step 2: set the significance level
    o α = 0.05
•   Step 3: check the assumption
    o Categorical data
    o Dependent or matched sample
•   Step 4: statistical test
    o McNemar’s test
•   Step 5: interpretation
    o p-value = 0.021
          reject H0
    o There is a significant association between type of mastectomy and
       5-year survival proportion in patients with breast cancer.
•   Step 6: conclusion
    o There is a significant association between type of mastectomy and
       5-year survival proportion in patients with breast cancer using
       McNemar’s test (p-value = 0.021)
•   Data presentation




Table 2: Association between type of mastectomy and 5-year survival
proportion in patients with breast cancer
                          Simple mastectomy
  Radical mastectomy      Live          Die        p-value
                          n (%)         n (%)
  Live                    13 (%)        1 (%)        *0.021
  Die                     9 (%)         2 (%)
* McNemar’s test
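
McNemar's test uses only the discordant pairs, here b = 1 and c = 9 from Table 2. The notes do not state which form of the test produced p = 0.021; the sketch below assumes the exact (binomial) form, which reproduces that value, and also shows the continuity-corrected chi-square statistic for comparison. Function names are hypothetical.

```python
import math

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    n = b + c
    k = min(b, c)
    # Under H0 each discordant pair falls in either cell with probability 0.5.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def mcnemar_chi2_corrected(b, c):
    """Continuity-corrected McNemar chi-square statistic (1 degree of freedom)."""
    return (abs(b - c) - 1) ** 2 / (b + c)

p_exact = mcnemar_exact(b=1, c=9)
chi2_cc = mcnemar_chi2_corrected(b=1, c=9)
print(f"exact p = {p_exact:.3f}, corrected chi-square = {chi2_cc:.1f}")
```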




                       CORRELATION & REGRESSION


1. Relationship between two variables
   •   Are two variables associated with each other?
   •   To what degree (strength) are they associated?
   •   In which direction is the relationship?
       o Positive or negative
   •   Change in dependent variable that corresponds to change in
       independent variable.
       o Prediction



              Correlation                            Regression




   -   presence of association             -   prediction
   -   strength (degree) of
       association
   -   direction of association




CORRELATION


1. Is a measure of relationship between two numerical variables
   -   E.g. the relationship between height and weight, the relationship
       between cholesterol and blood pressure.
2. Pattern:
   -   Elliptical pattern – degree of elongation of the ellipses – proportional to
       the correlation coefficient.
   -   Elliptical pattern – indicative of normally distributed variables




3. Correlation coefficient (r)
   -   X increases → Y increases       r = 1 (perfect positive)
   -   X increases → Y decreases       r = -1 (perfect negative)
   -   No linear relationship          r = 0
   -   r
       o < 0.25            poor
       o 0.26 – 0.50      fair
       o 0.51 – 0.75      good
       o 0.76 – 1.00      excellent
   -   r does not imply a cause and effect relationship
   -   Correlation should be assessed mathematically, not visually.
   -   r for statistical sample, ρ (rho) for parameter of population.
   -   Correlation coefficient:


Pearson’s correlation coefficient         Spearman’s rank correlation
                                                        coefficient


   -   A measure of the degree of          -     Correlation coefficient
       straight-line relationship                calculated on the ranks of
       between two numerical                     the observations of two
       variables                                 variables
   -   At least one variable has a         -     Pearson’s correlation and
       normal distribution                       Spearman’s correlation –
                                                 similar
                                           -     Different when the scatter
                                                 plot deviates from an
                                                 elliptical shape


4. Example: Relationship between height and weight
   -   Step 1: state the null and alternative hypothesis
       • H0: There is no correlation between height and weight
       • HA: There is correlation between height and weight           (2-tailed)
   -   Step 2: set significance level


    • α = 0.05
-   Step 3: check the assumption
    • Both variables are numerical
    • At least one of the variables has a normal distribution
          Histogram
          Box and Whisker plot
-   Step 4: statistical test
    • Pearson correlation (if assumption is met)
    • Spearman’s correlation (if assumption is not met)
-   Step 5: Interpretation
                              Correlations

                                            Height       Weight
      Height   Pearson Correlation          1            .878(**)
               Sig. (2-tailed)              .            .000
               N                            100          100
      Weight   Pearson Correlation          .878(**)     1
               Sig. (2-tailed)              .000         .
               N                            100          100
     ** Correlation is significant at the 0.01 level (2-tailed).


    • p-value < 0.001
          reject H0
-   Step 6: Conclusion
    • There is a significant, positive and excellent correlation between
      height and weight (r = 0.88, p < 0.001)
-   Checklist for reporting correlation (Figure 1)
    • Correlation coefficient – Pearson’s correlation coefficient/
      Spearman’s Ranked correlation coefficient
    • Actual p-value of correlation coefficient
    • Sample size
    • Scatter plot




         Figure 1: A scatter plot showing high positive correlation between
         height and weight (n = 100, Pearson’s r = 0.88, p < 0.001)

         [Scatter plot: x-axis Height (150 to 180), y-axis Weight (55 to 80)]
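
As an illustration of the two coefficients discussed in this section, the sketch below computes Pearson's r from the raw values and Spearman's coefficient as Pearson's r on the ranks. The height and weight values are invented (the notes give only the summary r = 0.88 for n = 100), and all function names are hypothetical. The strength labels follow the cut-offs listed in item 3.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

def strength_label(r):
    """Strength of correlation using the cut-offs given in the notes."""
    size = abs(r)
    if size <= 0.25:
        return "poor"
    if size <= 0.50:
        return "fair"
    if size <= 0.75:
        return "good"
    return "excellent"

# Hypothetical heights (cm) and weights (kg), for illustration only.
height = [152, 158, 160, 165, 170, 172, 178]
weight = [56, 60, 59, 66, 70, 69, 77]
r = pearson_r(height, weight)
print(f"r = {r:.2f} ({strength_label(r)}), rho = {spearman_rho(height, weight):.2f}")
```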




SIMPLE LINEAR REGRESSION (SLR)


1. Regression Analysis
  • Regression analysis is a statistical tool that utilizes the relation
     between variables so that one variable can be predicted from the other
     or others
  • Linear regression
    o Simple (one independent variable (factor) and one outcome)
    o Multiple (more than one factor and one outcome)
  • Logistic Regression (dichotomous dependent variables)


2. Simple Linear Regression
   • Example of research questions
    o Does a relationship exist between oral contraceptive and the
        incidence of thromboembolism?
    o What is the relationship of a mother’s weight to her baby’s birth
        weight?
    o What is the relationship between an animal’s pulse rate and the
        amount of a particular drug administered?
   • Simple because there is only one independent variable
   • Linear means the relationship between y (dependent/outcome) and x
     (independent/factor) variables can be represented by a straight line
   • Analyses the linear relationship between two quantitative (numerical)
     variables
   • Involves estimating the equation of a straight line that defines the
     relationship between a dependent variable and an independent variable
     using a given data set
   • The method involved is called the method of least squares
   • We choose a line such that the sum of squares of the vertical distances
     of all points from the line is minimized (Q = Σ ei²)
   • These vertical distances between the y values and their corresponding
     estimated values on the line are called residuals (ei = yi – ŷi)
   • The line thus obtained is called the regression line or the least-squares
     line of best fit
3. Regression line (least squares line of best fit)
   • Yi = β0 + β1Xi + εi
    o Yi is the value of the dependent variable when the value of the
        independent variable is Xi
    o β0 is the Y-intercept and is constant
    o β1 is the slope of the regression line. It is the change in the mean
        of Yi when Xi is increased by one unit
    o β0 and β1 are called regression coefficients
    o εi is the random error term: normally distributed, independent, with
        zero mean and constant variance σ²



4. Linear Regression Model
   • Relationship: Yi = β0 + β1Xi + εi
     o Yi = dependent (outcome/response) variable
     o β0 = population Y-intercept
     o β1 = population slope
     o Xi = independent (factor/explanatory) variable
     o εi = random error


5. True Regression line
   • The random error term εi in the regression equation accounts for the
      scattering of the data points about the regression line
   • As the mean of the εi is zero, the mean of Yi (at Xi) is:
     o E (Yi) = β0 + β1Xi
     o The notation E (Yi) means ‘expected value’ of Yi and represents the
          mean of Yi
   • Note that the mean of Y depends on X, and the relationship is
      represented by a straight line
   • This equation represents the true regression line
6. Least square estimate
   • The true regression line is unknown
   • Estimated regression line:
     o Ŷ = β̂0 + β̂1X                     least squares estimate
              Ŷ is the estimated mean of Y
              β̂0 is the y-intercept and is constant
              •    if X = 0, β̂0 is the estimated mean value of Y
              β̂1 is the slope of the regression line. It is the change in Ŷ when X
              is increased by one unit.
7. Least Squares (LS)
   • “Best fit” means the differences between the actual Y values and the
      predicted Y values are at a minimum.
   • LS minimizes the sum of the squared differences (SSE):

          SSE = Σ (Yi – Ŷi)² = Σ ε̂i²     (sums over i = 1, ..., n)
8. Interpretation of Coefficient
   • Slope (β̂1)
    o The change in the estimated mean value of Y when X is increased by
        1 unit
           If β̂1 = 0.05, then the estimated mean cholesterol level (Y)
           changes by 0.05 mmol/dl when age (X) is increased by 1 year.
   • Y-intercept (β̂0)
    o Estimated mean value of Y when X = 0
           If β̂0 = 3.3, then the mean cholesterol level (Y) is expected to be
           3.3 when age (X) is 0 (an extrapolation that may not be
           meaningful in practice)
8. Measures of variation in Regression
   • Total variation (Total Sum of Squares (SSTOT))
    o Measures variation of observed Yi around the mean Ȳ
   • Explained variation (Regression Sum of Squares (SSR))
    o Variation due to the relationship between X & Y
   • Unexplained variation (Error Sum of Squares (SSE))
    o Variation due to other factors
9. Sum of squares
   • Total sum of squares (SSTOT)
    o Measures the total variation in the dependent variable Y
    o SSTOT = Σ (Yi – Ȳ)² = SSR + SSE
   • Regression sum of squares (SSR)
    o Measures the variation ‘explained’ by the regression line
    o SSR = Σ (Ŷi – Ȳ)²
   • Error sum of squares (SSE)
    o Measures the ‘unexplained’ variation in Y or the scatter around
        the regression line
    o SSE = Σ (Yi – Ŷi)²
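
The least squares estimates and the three sums of squares can be computed directly. The sketch below uses the standard closed-form solution for simple linear regression (slope = Sxy / Sxx, intercept = mean of Y minus slope times mean of X), which the notes do not spell out, and a small invented data set. It then confirms that SSTOT = SSR + SSE and that R² = SSR / SSTOT. Function names are hypothetical.

```python
def fit_line(x, y):
    """Least squares estimates (intercept b0, slope b1) for y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def sums_of_squares(x, y, b0, b1):
    """Return (SSTOT, SSR, SSE) for the fitted line."""
    mean_y = sum(y) / len(y)
    fitted = [b0 + b1 * xi for xi in x]
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)               # total variation
    ss_r = sum((fi - mean_y) ** 2 for fi in fitted)            # explained
    ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))    # unexplained
    return ss_tot, ss_r, ss_e

# Hypothetical data, chosen so the arithmetic is easy to follow by hand.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
ss_tot, ss_r, ss_e = sums_of_squares(x, y, b0, b1)
r_squared = ss_r / ss_tot
print(f"Y-hat = {b0:.1f} + {b1:.1f} X, R2 = {r_squared:.2f}")
```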




                            Measure of variation

   [Figure: the regression line Ŷ = β̂0 + β̂1Xi plotted on X-Y axes. For a
   single observed point (Xi, Yi), with fitted value Ŷi and mean Ȳ, the
   figure marks its contribution to each sum of squares:]
       (total sum of squares)          SSTOT = (Yi – Ȳ)²
       (explained sum of squares)      SSR = (Ŷi – Ȳ)²
       (unexplained sum of squares)    SSE = (Yi – Ŷi)²

   Notes: X, Y and slope
       • Positive slope, Y increases with increase in X
       • Negative slope, Y decreases with increase in X


10. Hypothesis Testing:
   • For Simple Linear Regression
    o H0: β1 = 0 (no linear relationship)
    o HA: β1 ≠ 0 (there is linear relationship)
    o Test statistic: t-distribution
    o Rejection rule:
            Reject H0 if the p-value is less than 0.05 (assumed α)
   • For Multiple Linear Regression
    o H0: β1 = β2 = ... = βk = 0 (no linear relationship)
    o HA: at least one βj ≠ 0 (there is a linear relationship)
    o Test statistic: F-test from the ANOVA table:
            F = MSR/MSE
            MSR = SSR/dfReg
            MSE = SSE/dfError
    o Rejection rule:
            Reject H0 if the p-value for the F-test is less than 0.05 (assumed α)
   • Assumptions
    o The errors are normally distributed
    o They are independent
    o The mean of the random error term is equal to zero
    o The variance of the random error, σ² (sigma squared), is constant.
11. How to analyse
   • Exploration of the data
    o Descriptive
    o Scatter plot between two variables
            Check for distribution, relationship and outliers
   • Fit the least squares line (regression line)
    o Using the least squares method
    o It is the best-fitting straight line through the data points in a scatter
         plot
    o It represents the least squares equation and estimates the constant
         (a) and slope (b) for β0 and β1:     Ŷ = a + bx
    o It is constructed by using the method of least squares, which minimizes
         the sum of squared deviations of each point from the regression
         line
   • Evaluation of model by R2 (R square)


                                  Model Summary
 Model     R         R2        Adjusted R2      Std. Error of the Estimate
   1       .592(a)   .350      .338             .9043
a. Predictors: (Constant), Time
    o R2 = 0.35, meaning that 35% of the total variation in GPA is
         explained by the study time
    o R2 measures the closeness of fit of the sample regression equation to
         the observed values of Y
    o It ranges from 0 to 1
    o It is called the coefficient of determination


   • Evaluation of b
     o Evaluation of β using the t-statistic
                                              Coefficients(a)
   Model          Unstandardized Coefficients                     95% CI for B
                  B          Std. Error      t         Sig.       Lower      Upper
 1 (Constant)     1.461      .315            4.639     .000       .829       2.093
   Time           .389       .073            5.342     .000       .243       .534

a. Dependent variable: GPA


     o H0: β1 = 0 (no linear relationship)
     o HA: β1 ≠ 0 (there is a linear relationship)
     o As the p-value is < 0.001, we reject H0 at the 5% significance level and
         have sufficient evidence to conclude that there is a linear relationship
         between study time and GPA.
     o A positive β means a direct relationship
     o Estimated least squares (LS) equation
               GPA = constant + b (study time)
               GPA = 1.461 + 0.389 (study time)
   • Diagnostic checking for assumption
     o The assumptions:
               The errors are normally distributed
               They are independent
               The mean of the random error term is equal to zero (linearity)
               The variance of the random error, σ² (sigma squared), is constant
               (equal)
     o Model adequacy checks
               Carried out after obtaining the least squares line (fit)
               Is a linear model appropriate? Check R²
               Investigate the model assumptions
               Diagnostic procedures are carried out through examination of the
               residuals (the difference between the observed value Y and the
               fitted or predicted value Ŷ at a given value of X)
               Normality



       •   Histogram of unstandardised residuals
       Linearity
       •   Plot of unstandardised residuals against unstandardised
           predicted values
       •   Creating residuals: go to analyse → regression → bivariate → save →
             unstandardised residuals and predicted values
       Let us say the assumptions are met.
• Interpretation and conclusion
 o 35% of the variation in GPA is explained by study time
 o There is a significant linear association between GPA and study time
 o For each 1-hour increase in study time, the GPA of a student increases by
    0.39
 o We are 95% confident that for each 1-hour increase in study time,
    the GPA increase will lie between 0.24 and 0.53
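
The least squares steps above can be sketched in a few lines of Python. This is only an illustration: the (hours, gpa) pairs are hypothetical and are not the data behind the SPSS output, and the function names are made up for this example. The last function simply evaluates the fitted equation reported in the notes (GPA = 1.461 + 0.389 × study time).

```python
# Minimal least squares sketch (hypothetical data, hypothetical function names).

def least_squares(x, y):
    """Return intercept a, slope b and R-squared for the line y-hat = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations and sums of squared deviations
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    b = sp / ss_x                # slope
    a = mean_y - b * mean_x      # constant (intercept)
    r_squared = sp ** 2 / (ss_x * ss_y)
    return a, b, r_squared


def predict_gpa(study_time, constant=1.461, slope=0.389):
    """Predicted GPA from the fitted equation reported in the notes."""
    return constant + slope * study_time


# Hypothetical illustration data: study time in hours and GPA
hours = [1, 2, 3, 4, 5]
gpa = [2.0, 2.1, 2.9, 3.0, 3.5]
a, b, r2 = least_squares(hours, gpa)
print(f"a = {a:.2f}, b = {b:.2f}, R2 = {r2:.2f}")
print(f"Predicted GPA for 2 hours (notes' equation): {predict_gpa(2):.3f}")
```

In practice a statistical package also reports the standard error, t statistic and confidence interval for b, which this sketch leaves out.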




                                CORRELATION


1. Introduction
   • Correlation is used to measure and describe a relationship between two
     variables
   • Correlation measures three characteristics of a relationship
    o The direction of the relationship
           Positive
           •    When the value of one variable increases, the
                corresponding value of the related variable also increases.
           Negative
           •    When the value of one variable increases, the
                corresponding value of the related variable decreases.
           Zero correlation (no correlation)
           •    The values of one variable increase or decrease
                independently of the values of the other variable
    o The form of the relationship
           When the value of one variable increases, the corresponding value of
           the related variable increases or decreases up to a certain value, but
           beyond that value the trend may change or there may be no
           further change at all (the relationship need not be linear).
    o The degree of the relationship
           It measures how strong the relationship between the values of the
           two variables is
2. Application of correlation
   • Prediction
    o If two variables are positively or negatively related to each other,
        then by knowing the value of one of these variables it is possible to
        predict the corresponding unknown value of the other variable
   • Validity
    o Validity is the extent to which a test truly measures what it
        claims to measure.
   • Reliability


      o Reliability is the extent to which a test instrument produces
         stable, consistent measurements when it is used again and again in the
         same group of students or people.
     • Theory verification
      o Theory is a statement that makes a specific prediction about the
         relationship between two variables
      o This predicted relation can be verified by correlation test
3. Measures of correlation
     • Pearson correlation (Pearson product-moment correlation)
      o The Pearson correlation measures the degree and the direction of the
         linear relationship and is denoted by the letter r (correlation
         coefficient)
      o r = (degree to which x and y vary together)/(degree to which x and y
         vary separately)
      o r = (covariability of x and y)/(variability of x and y separately)
      o r = SP/√(SSx · SSy)
      o SP = sum of products of deviations = Σxy – (Σx)(Σy)/n
      o SSx = sum of squared deviations of x = Σx² – (Σx)²/n
      o SSy = sum of squared deviations of y = Σy² – (Σy)²/n
     • Spearman correlation (Spearman rank-order correlation)
      o It is used when the data are ordinal.
      o If they are not, the data must first be ranked
      o Rank the scores separately for each variable, with 1 for the
         smallest score


Case        Score for         Score for         Rank for          Rank for
no.        variable 1         variable 2       variable 1         variable 2
 1              3                13                 1                 2
 2              5                14                 3                 3
 3              4                12                 2                 1
 4              6                15                 4                 4
 5              7                16                 5                 5


         o If more than one respondent has the same score, the final rank
            for those respondents is the average of their ranks


Respondent no             Score              Rank                Final rank
1                         3                  2                   2.5
2                         5                  5                   5
3                         2                  1                   1
4                         3                  3                   2.5
5                         4                  4                   4


         o The equation for the Spearman calculation
         o rs = 1 – (6ΣD²)/[N(N² – 1)]
            • N is the number of pairs (x, y)
            • D is the difference between the ranks in each pair (rank of x –
              rank of y)
        • After calculating the value of r or rs, compare it with the
         critical value in the correlation table to decide whether there is a
         significant correlation between the variables.
         o Calculated value ≥ tabulated value (significant correlation)
         o For the Pearson correlation df = n – 2, for both one-tailed and
            two-tailed tests
            • df = degrees of freedom
         o Coefficient of determination
            • This is the squared correlation coefficient
            • It measures the percentage of variation shared between the two
              variables
            • r = 0.40
            • r² = 0.16, i.e. 16%
        • Points to remember
         o Correlation is not causation
         o Correlation is affected by the range of data



    o Correlation is affected by the outliers
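
The two coefficients in section 3 can be checked with a short Python sketch using the five cases and the tied-score example from the tables above. The function names are made up for this illustration, and the Spearman shortcut formula is exact only when there are no tied ranks.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson r = SP / sqrt(SSx * SSy), using the computational formulas."""
    n = len(x)
    sp = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_x = sum(a * a for a in x) - sum(x) ** 2 / n
    ss_y = sum(b * b for b in y) - sum(y) ** 2 / n
    return sp / sqrt(ss_x * ss_y)


def average_ranks(scores):
    """Rank scores from 1 (smallest); tied scores share the average rank."""
    ordered = sorted(scores)
    ranks = []
    for s in scores:
        first = ordered.index(s) + 1            # first position of this score
        count = ordered.count(s)                # number of tied scores
        ranks.append(first + (count - 1) / 2)   # average of the tied positions
    return ranks


def spearman_rs(x, y):
    """rs = 1 - 6*sum(D^2) / (N*(N^2 - 1)), with D the difference in ranks."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Scores for the five cases in the ranking table
variable_1 = [3, 5, 4, 6, 7]
variable_2 = [13, 14, 12, 15, 16]
r = pearson_r(variable_1, variable_2)
rs = spearman_rs(variable_1, variable_2)
print(f"r = {r:.2f}, r squared = {r * r:.2f}, rs = {rs:.2f}")

# Tied scores from the respondent table: both scores of 3 get rank 2.5
final_ranks = average_ranks([3, 5, 2, 3, 4])
print(final_ranks)
```

For these five cases SP = 9 and SSx = SSy = 10, and the rank differences are –1, 0, 1, 0, 0, so ΣD² = 2.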
4. Hypothesis tests with the Pearson correlation
   • Two-tailed
    o H0: ρ = 0 (no correlation)
    o HA: ρ ≠ 0 (there is a correlation)
   • One-tailed
    o H0: ρ ≤ 0 (there is no positive correlation)
    o HA: ρ > 0 (there is a positive correlation)
   • Reporting correlation
    o State r, n, the p-value and whether the test was one-tailed or
        two-tailed, e.g. r = 0.65, n = 30, p-value < 0.01, two-tailed
    o r² = coefficient of determination
5. Summary
   • Correlation is a statistical method for assessing the relationship between
    two variables
   • The relationship can be positive or negative
   • The two methods are the Pearson and Spearman correlations
   • It is used in prediction, in testing validity and reliability,
    and in verifying theories
   • It can be calculated manually using the formulas or with a computer
    statistical package such as SPSS
   • Correlation says nothing about cause and effect
   • The correlation coefficient is influenced by outliers and by the range of
    data under analysis




                      NONPARAMETRIC STATISTICS


1. Nonparametric Statistics (NPS)
   • The name nonparametric indicates that no assumptions are made about
     parameters (means, variances)
   • Requires very few assumptions; it is distribution-free
   • Uses the median as a measure of central tendency
    o Applied when
           The data being analysed are ordinal or nominal
           The data are on an interval or ratio scale but no assumption can be
           made about the population probability distribution
    o Appropriate for small samples that are not normally distributed
    o Computationally easier
    o Less efficient than parametric counterparts
    o Lose information by substituting ranks in place of measured values
          Parametric test                     Nonparametric test
One-sample t-test                     Sign test for one sample
Paired t-test                         1. Sign test for paired samples
                                      2. Wilcoxon signed-rank test for
                                         paired samples
Two independent samples t-test        Mann-Whitney test (Wilcoxon rank
                                      sum test)
One-way ANOVA                         Kruskal-Wallis test
ANOVA (randomized block design)       Friedman’s test
Pearson’s correlation coefficient     Spearman’s rank correlation
                                      coefficient




2. Sign Test for Matched Pairs
   • When the observations are matched pairs but the assumptions underlying
     the paired t-test are not met, or the measurement scale is weak, the
     sign test can be applied


  • Hypothesis
    o H0: ∆d = 0 (the median of differences is zero)
    o HA: ∆d ≠ 0
  • Test statistic (T.S.): the smaller of n+ and n-
  • Rejection region (RR): reject H0 if the p-value is less than α (assumed alpha)
  • Procedure
    o Exclude the observations for which the difference (di) is zero
    o For di > 0 assign a (+) sign and for di < 0 assign a (-) sign
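
A rough Python sketch of the sign test procedure follows. The paired scores are hypothetical, and the exact binomial p-value is an assumption added for illustration (the notes only say to compare the p-value with α): under H0 each non-zero difference is equally likely to be positive or negative.

```python
from math import comb


def sign_test(before, after):
    """Two-sided sign test for paired data; returns (n_plus, n_minus, p_value)."""
    diffs = [a - b for b, a in zip(before, after)]
    diffs = [d for d in diffs if d != 0]        # exclude zero differences
    n_plus = sum(1 for d in diffs if d > 0)
    n_minus = len(diffs) - n_plus
    n = len(diffs)
    k = min(n_plus, n_minus)                    # test statistic
    # Under H0 the number of (+) signs follows Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return n_plus, n_minus, p_value


# Hypothetical paired scores for eight subjects
before = [10, 12, 9, 14, 11, 13, 10, 12]
after = [12, 15, 9, 13, 14, 16, 13, 15]
n_plus, n_minus, p_value = sign_test(before, after)
print(n_plus, n_minus, p_value)
```

Here one pair has a zero difference and is dropped, leaving six (+) signs and one (-) sign.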


3. Wilcoxon Signed-Rank Test for paired samples
  • It is more sophisticated than the sign test
  • The sign test only tells whether the sign of a difference is positive or
    negative
  • This test makes use of both the signs and the magnitudes of the differences
  • Thus for a strong measurement scale the sign test may be undesirable,
    since it would not make full use of the information contained in the
    data.
  • Assumption
    o The distribution of difference is continuous
    o The distribution of differences is symmetric
  • Hypothesis
    o H0: ∆d = 0 (the median of differences is zero)
    o HA: ∆d ≠ 0
  • Test statistic (T.S.): T = min (T+, |T-|)
  • Rejection region
    o Reject H0 if T ≤ critical value or
    o Reject H0 if the p-value is less than α (assumed alpha)
  • Procedure
    o Calculate the differences of each pair of observations (di)
    o Ignore the signs of these differences
    o Rank the absolute values from smallest to largest
    o Assign the signs of the corresponding differences to these ranks




    o A difference of zero is not ranked; it is eliminated from the analysis
        and the sample size is reduced by one
    o Tied observations are assigned an average rank (suppose the two smallest
        differences are 4 and 4; each one gets the average rank (1+2)/2 = 1.5)
    o Assign each rank either a (+) or (-) sign corresponding to the sign of
        the difference
    o Compute the sum of the +ve ranks (T+) and the sum of the –ve ranks (T-)
    o Choose the test statistic (the smaller of T+ and |T-|)
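
The ranking procedure above can be sketched as follows. The paired measurements are hypothetical and the function name is made up for this example; the sketch stops at the test statistic, which would then be compared with the tabulated critical value.

```python
def signed_rank_statistic(before, after):
    """Return (T_plus, T_minus, T), where T is the smaller rank sum."""
    # Differences for each pair; zero differences are eliminated
    diffs = [a - b for b, a in zip(before, after) if a != b]
    abs_sorted = sorted(abs(d) for d in diffs)
    t_plus = 0.0
    t_minus = 0.0
    for d in diffs:
        first = abs_sorted.index(abs(d)) + 1
        count = abs_sorted.count(abs(d))
        rank = first + (count - 1) / 2      # average rank for tied values
        if d > 0:
            t_plus += rank                  # rank carries a (+) sign
        else:
            t_minus += rank                 # rank carries a (-) sign
    return t_plus, t_minus, min(t_plus, t_minus)


# Hypothetical paired measurements for six subjects
before = [125, 130, 118, 140, 135, 128]
after = [120, 132, 118, 131, 129, 121]
t_plus, t_minus, t_stat = signed_rank_statistic(before, after)
print(t_plus, t_minus, t_stat)
```

A quick check is that T+ and T- together add up to n(n + 1)/2, where n is the number of non-zero differences (here 5, so 15).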


4. Wilcoxon Rank Sum Test (Mann-Whitney-U test)
  • Counterpart of the t-test for two independent samples
  • Assumptions
    o The two samples have been drawn independently and randomly from
        their respective populations.
    o The measurement scale is at least ordinal.
    o   The distributions of the two populations have the same general
        shape. They differ only with respect to their medians.
  • Hypothesis
    o H0: the two populations are identical (∆1 = ∆2)
    o HA: population 1 and 2 have different medians (∆1 ≠ ∆2)
  • Rejection rule:
    o Reject H0 if the p-value is less than 0.05 (assumed α)
  • Procedure
    o Select independent random samples from each population
    o Combine the two samples
    o Jointly rank the combined samples. If observations are tied, assign the
        average rank to all with the same value
           For example: if two observations are tied for ranks 3 and 4,
           each is given 3.5.
           The next higher value receives a rank of 5 and so on
    o Label the sample with the smaller size as sample 1. The test statistic
        is the sum of ranks for sample 1
    o Determine the rejection region from the table
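
A small Python sketch of the joint ranking step is given below, with hypothetical data chosen so that two observations tie for ranks 3 and 4, as in the example above. The function name is made up for this illustration, and the rejection region would still be read from the table.

```python
def rank_sum(sample_a, sample_b):
    """Sum of the joint ranks for the smaller sample (sample 1)."""
    # Label the smaller sample as sample 1
    sample_1, sample_2 = sorted((sample_a, sample_b), key=len)
    combined = sorted(sample_1 + sample_2)

    def rank(value):
        first = combined.index(value) + 1
        count = combined.count(value)
        return first + (count - 1) / 2      # average rank for ties

    return sum(rank(v) for v in sample_1)


# Hypothetical independent samples; the two 7s tie for ranks 3 and 4
group_a = [7, 10, 14]
group_b = [3, 5, 7, 12, 16]
statistic = rank_sum(group_a, group_b)
print(statistic)
```

The joint ranks are 1, 2, 3.5, 3.5, 5, 6, 7, 8, so the smaller sample (7, 10, 14) has a rank sum of 3.5 + 5 + 7.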


5. Kruskal-Wallis Test
   • Counterpart of one-way analysis of variance (ANOVA: comparing the
    means of more than two groups) if:
    o The normality assumption of ANOVA is not justified
    o Or the data available are ordinal (consist of ranks)
   • Assumptions:
    o The samples are independent and random
    o The measurement scale is at least ordinal
    o The distributions of values in the sampled populations are identical
       except for the possibility that one or more of the populations are
       composed of values that tend to be larger than those of the other
       populations.
   • Hypothesis
    o H0: the t populations are all identical
    o HA: at least one of the populations tends to exhibit larger values than
       the others
   • Procedure
    o Jointly rank all N observations; Ri is the sum of ranks and ni the
       size of sample i
    o If there are no ties or a moderate number of ties the formula
       simplifies to:
       T = [12/(N(N + 1))] Σ (Ri²/ni) – 3(N + 1)
   • Rejection region
    o When the sample sizes are large (ni ≥ 5) the test statistic T is
       distributed approximately as χ² with (t – 1) degrees of freedom,
       where t is the number of groups
    o Reject H0 if T > the critical value of χ² with (t – 1) degrees of freedom
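
The test statistic can be sketched in Python as below, assuming the standard no-ties form T = [12/(N(N + 1))] Σ (Ri²/ni) – 3(N + 1), where Ri is the rank sum and ni the size of group i. The three groups are hypothetical and deliberately simple so the arithmetic can be followed by hand.

```python
def kruskal_wallis_t(groups):
    """Kruskal-Wallis statistic using the no-ties formula on joint ranks."""
    combined = sorted(v for g in groups for v in g)
    n_total = len(combined)

    def rank(value):
        first = combined.index(value) + 1
        count = combined.count(value)
        return first + (count - 1) / 2      # average rank if values tie

    # Sum over groups of (rank sum squared) / (group size)
    total = sum(sum(rank(v) for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)


# Hypothetical scores for three independent groups of five
groups = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
]
t_stat = kruskal_wallis_t(groups)
print(round(t_stat, 2))
```

The rank sums here are 15, 40 and 65 with N = 15. With three groups the statistic is compared with the χ² critical value for 2 degrees of freedom (5.99 at α = 0.05).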


6. Spearman’s Rank correlation coefficient
   • Nonparametric alternative of Pearson’s coefficient of correlation
   • Relevant when the measurement scale is at least ordinal or the
    relationship between two variables is not linear
   • It is denoted by rs
   • rs = 1 implies strictly increasing monotonicity
   • rs = -1 implies strictly decreasing monotonicity




                            NON-PARAMETRIC TESTS


1. Introduction
 • Inferential statistics in which population parameters are not required
   to calculate the test statistic
 • A process carried out in order to find out whether or not a
   particular statistical hypothesis is likely to be true
 • A statistical test in which no assumptions are made about any statistical
   parameter. This is similar to a test in which we do not assume that the
   data have any particular distribution
2. The χ² test for goodness of fit
 • Tests hypotheses about the proportions of a population distribution
 • Tests how well the sample proportions fit the population proportions
   specified by the null hypothesis
 • Example:
   o H0: there is no difference in the proportion of people in the different
      categories
   o Observed data (fo)


          Category 1          Category 2           Category 3
               7                     26                27
                                          n = 60


   o Expected population based on hypothesis (fe)


          Category 1          Category 2           Category 3
              20                     20                20
                   1/3 of 60 in each category = 1/3 * 60 = 20
   o Difference between observed data and expected data (fo – fe)


          Category 1          Category 2           Category 3
         7 – 20 = -13         26 – 20 = 6          27 – 20 = 7



  o Square the differences (fo – fe)²


          Category 1             Category 2       Category 3
         (-13)² = 169             6² = 36              7² = 49


  o χ² = Σ [ (squared difference)/(expected frequency) ]
  o χ² = Σ [ (fo – fe)²/fe ]    (chi-square formula)
  o χ² = 169/20 + 36/20 + 49/20
  o χ² = 8.45 + 1.8 + 2.45
  o χ² = 12.7
  o Degrees of freedom = number of categories – 1
  o Degrees of freedom = 3 – 1 = 2
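
The goodness-of-fit arithmetic above can be reproduced with a few lines of Python, using the observed counts from the example (7, 26, 27) and equal expected proportions. The variable names are made up for this sketch.

```python
# Observed frequencies (fo) for the three categories in the example
observed = [7, 26, 27]
n = sum(observed)

# H0: equal proportions, so each expected frequency (fe) is n / 3 = 20
expected = [n / len(observed)] * len(observed)

# Chi-square = sum of (fo - fe)^2 / fe over the categories
chi_square = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
df = len(observed) - 1

print(f"chi-square = {chi_square:.1f}, df = {df}")
```

The result would then be compared with the tabulated critical value for 2 degrees of freedom (5.99 at α = 0.05).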
3. Chi-square test for independence
 • Test a relationship between two variables
 • Each individual in the sample is measured or classified on two separate
  variables
 • Example:
  o H0: there is no relationship between preference of teaching method
      and gender of the students
      • Variable 1: teaching method
      • Variable 2: gender of students
  o Observed data (fo)


      Gender                      Teaching Methods               Total
                            Lecture            Tutorial
       Male                      25               25              50
      Female                     35               65             100
       Total                     60               90             150


      • 40% of students prefer lectures
      • 60% of students prefer tutorials




o Expected data (fe)


   Gender                     Teaching Methods                     Total
                        Lecture              Tutorial
     Male                  *20                  30                  **50
   Female                  40                   60                  100
     Total                **60                  90                 ***150


   • fe = (**Row total x **Column total)/(***whole total)
     •   *Example of calculation
         o **50 x **60/***150 = *20
o Difference between expected and observed value (fo – fe)


   Gender                     Teaching Methods                     Total
                        Lecture              Tutorial
     Male              25 -20 = 5          25 – 30 = -5              50
   Female             35 – 40 = -5         65 – 60 = 5              100
     Total                 60                   90                  150


o Square of difference (fo – fe)²
   • 5² = 25, (-5)² = 25, (-5)² = 25, 5² = 25
o Formula: χ² = Σ [ (fo – fe)²/fe ]
   • So χ² = 25/20 + 25/30 + 25/40 + 25/60
   • χ² = 1.25 + 0.83 + 0.63 + 0.42 = 3.13
   • Degrees of freedom = (no. of columns – 1) x (no. of rows – 1)
   • Degrees of freedom = (2 – 1) x (2 – 1) = 1
   • Critical value = 3.84 when α = 0.05 (refer to the χ² statistic table)
o Interpretation
   • The calculated value is less than the critical value
   • Therefore the null hypothesis is not rejected; there is no significant
     relationship
o Conclusion



      • There is no relationship between preference of teaching method and
        gender of the students
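
The expected frequencies and the χ² value in this example can be checked with the following Python sketch. It uses the observed table from the notes; the variable names are made up, and the critical value 3.84 is the one quoted in the notes for 1 degree of freedom at α = 0.05.

```python
# Observed frequencies: rows = male, female; columns = lecture, tutorial
observed = [
    [25, 25],
    [35, 65],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# fe = (row total x column total) / whole total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum(
    (fo - fe) ** 2 / fe
    for obs_row, exp_row in zip(observed, expected)
    for fo, fe in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)
reject_h0 = chi_square > 3.84   # critical value for df = 1, alpha = 0.05

print(expected, round(chi_square, 3), df, reject_h0)
```

Without intermediate rounding the statistic is 3.125; the 3.13 in the worked example comes from rounding each term to two decimals first.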
4. Chi-squared test for variance
 • This is a test of the null hypothesis that the population variance is σ².
 • We have a sample of size n and we compute an unbiased estimate of the
   population variance, s², using divisor n – 1.
 • The test statistic is χ² = (n – 1)s²/σ², which follows a χ² distribution
   with n – 1 degrees of freedom.
 • We assume that the population is normally distributed
 • For a 95% level of confidence (two-tailed), the critical values are taken
   at probabilities of 97.5% and 2.5%
 • Find the critical values in MS Excel using CHIINV(p, df)
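
As a minimal sketch of the test statistic χ² = (n – 1)s²/σ², the Python below uses a hypothetical sample and a hypothetical null value for the population variance. It computes the statistic only; the critical values would still come from a χ² table or a spreadsheet function.

```python
from statistics import variance

# Hypothetical sample and hypothesised population variance (sigma squared)
sample = [4.0, 6.0, 5.0, 7.0, 8.0]
sigma_squared_h0 = 2.0

n = len(sample)
s_squared = variance(sample)            # unbiased estimate, divisor n - 1
chi_square = (n - 1) * s_squared / sigma_squared_h0
df = n - 1

print(f"s2 = {s_squared}, chi-square = {chi_square}, df = {df}")
```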




              STATISTICAL ANALYSIS: WHICH TO CHOOSE?


1. Process of data management (follow the steps below)
   • Research question(s)
   • Research design
   • Data collection
   • Data entry
   • Data exploration & cleaning
   • Data analysis
   • Interpretation
   • Writing up
2. Role of statistics in a study
   • Statistical knowledge and judgment is required at every step of a study
   • What statistical analysis is appropriate to answer the research
     question? Points to consider to select the right statistical test:
    o Research question/ hypothesis
           Are you clear what you want to find out and what design you
           have used in your study?
    o Number of variables
    o Type of data
    o Number of groups
    o Sample distribution
    o Sample type
3. Research question
   • The essential question that the study is designed to answer
   • Most studies are concerned with answering one of the following four
     types of question
    o What is the magnitude of a health problem or health factor?
    o What is the efficacy of an intervention?
    o What is the causal relation between one factor (or factors) and the
        disease or outcome of interest?
    o What is the natural history of a disease?



   • What is/are the research question (s)?
    o Common in medical research:
              Difference between/ among means
              Difference between/ among proportions
              Associations between/ among factors
              Difference between/ among treatment effects
   • Hypothesis
    o This is a testable statement that describes the nature of the
        proposed relationship between two or more variables of interest
    o E.g. there is an association between smoking and coronary heart
        disease
4. What is the research design applied and expected result?
   • Randomized control trial (RCT)
   • Observational studies
    o Cross-sectional
    o Case-control
    o Prospective cohort
    o Retrospective cohort
   • Case report/ series
   • Diagnostic test
   • E.g. 1
    o   Research question: effectiveness of new anti-hypertensive drug
    o Research design: randomized controlled trial




• E.g. 2
 o Research question: Risk factor for enteric fever
 o Research design: Case control

 [Figure: case-control design. Time direction; population divided into
 people with disease and people without disease]
• E.g. 3
 o Research question: maternal & fetal outcome in mother with PIH
 o Research design: prospective cohort




 [Figure: prospective cohort design. Time direction; population (no disease)]




5. Study factor (s)
   • Variable (s) of interest that is hypothesized to be related to health
     problem, disease or outcome of interest.
   • Also known as the independent variables/ exposure variables/
     determinants
6. Outcome factor (s)
   • The event or occurrence that is supposed to result from the
     study factor
   • E.g. the outcome factor is blood pressure, as it is influenced by the
     study factor, salt intake.
   • Also known as dependent variable.
7. Number of variables
   • One independent variable only – univariate analysis
   • More than one independent variable – multivariate analysis
   • In health sciences a study is unlikely to be conducted and concluded
     using univariate analysis alone
   • If there is a multi-factorial effect on the outcome, univariate analysis
     gives misleading results
   • Example: risk factors for coronary heart disease
   • Multivariate analysis can control for confounding effects
8. Type of data
   • Numerical
    o Continuous (e.g. weight)



    o Discrete (e.g. number of patients admitted)
  • Categorical
    o Nominal (e.g. occupation, gender)
    o Ordinal (e.g. disease severity, socioeconomic status)
  • Statistical tests applied are different based on the type of variables
    (must consider both independent and dependent variables)
9. Number of groups
  • Two groups (two levels) (e.g. diabetic and non-diabetic groups)
  • More than two groups (more than two levels) (e.g. race – Malay, Chinese,
    Indian, Others)
10. Sample distribution
  • Normal distribution → parametric test
  • Non-normal distribution → non-parametric test
  • Suggested procedure for assessing normality
    o Compare the mean & median (for a normal distribution mean =
       median)
    o Construct a histogram overlaid with a normal curve
    o Construct a box-and-whisker plot
    o Statistical tests
          Kolmogorov-Smirnov test
          Shapiro-Wilk test
  • Non-parametric tests are appropriate when:
    o The data are ordinal
    o The data are not normally distributed and cannot be easily transformed
    o The data may contain outliers
  • Non-parametric methods have two general limitations
    o Not as powerful as their parametric counterparts
    o Tests for complex designs are not readily available in standard
       computer packages
11. Sample type
  • Independent sample (e.g. disease and non-disease groups, male and
    female)



   • Dependent/ paired/ matched sample (e.g. difference of blood pressure
    measurements before and after treatment, age and sex matched
    samples)
12. What to ask before choosing a statistical test?
   • What is the research question/ hypothesis?
   • What is the outcome factor and what are the study factors?
   • How many variables?
   • How many groups?
   • What is the distribution like?
   • Are the samples independent?
   • Is the data numerical/ categorical?
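
For a numerical outcome, the questions above can be turned into a small lookup. The Python sketch below is only a simplified illustration built from the parametric/nonparametric pairs listed in the NONPARAMETRIC STATISTICS chapter; the function name and its three arguments are made up, and it ignores categorical outcomes and multivariate designs.

```python
def choose_test(n_groups, paired, normal):
    """Suggest a test for a numerical outcome.

    n_groups: number of groups being compared (1, 2 or more)
    paired:   True for dependent/paired/matched samples
    normal:   True if the normality assumption is reasonable
    """
    if n_groups == 1:
        return "One-sample t-test" if normal else "Sign test for one sample"
    if n_groups == 2:
        if paired:
            return "Paired t-test" if normal else "Wilcoxon signed-rank test"
        if normal:
            return "Two independent samples t-test"
        return "Mann-Whitney test"
    # More than two groups
    if paired:
        return "ANOVA (randomized block design)" if normal else "Friedman's test"
    return "One-way ANOVA" if normal else "Kruskal-Wallis test"


# Blood pressure before and after treatment, not normally distributed
print(choose_test(n_groups=2, paired=True, normal=False))
# Four independent groups, normally distributed outcome
print(choose_test(n_groups=4, paired=False, normal=True))
```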
13. Data exploration and cleaning
   • Compulsory to do
   • Do not rush to analyze the data
   • Clean and explore first
   • Get acquainted with the data
   • Check for duplicates
   • Out-of-range values and location of errors
   • Distribution of variables
   • Missing data and consistency errors
   • Exploring the relationship between variables
   • Transformations
   • To get acquainted with the data set before the major analysis is carried
    out
    o Read the protocol again
    o Recall the objectives
    o Identify the major outcome, exposure and potential confounders/ effect
       modifiers
    o Check for records with duplicate ID numbers (to prevent repeated
       data entry)
   • Error checking
    o Respondent’s mis-marking answers



  o Coder’s miscoding response
  o Marking errors by data personnel
 • Out-of-range values and location errors
  o Measurement error
  o Recording error
  o Genuine observation
 • What to do?
  o Check the original measurements again where possible
  o If the original measurements are suspicious → repeat the measurement
  o If it is not possible to check → use common sense
  o If the value is impossible/implausible → justifiable to set it as
      “missing”
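
The duplicate-ID and out-of-range checks described in this section might look like the following in Python. The records, field names and plausibility limits are all hypothetical; real limits depend on the variable and the study population.

```python
from collections import Counter

# Hypothetical data set: one record per respondent
records = [
    {"id": 1, "age": 34, "sbp": 120},
    {"id": 2, "age": 29, "sbp": 410},   # implausible systolic BP
    {"id": 2, "age": 29, "sbp": 410},   # repeated data entry
    {"id": 3, "age": 51, "sbp": 135},
]

# Hypothetical plausible ranges (low, high) for each variable
plausible = {"age": (0, 120), "sbp": (50, 300)}

# Records with duplicated ID numbers
id_counts = Counter(r["id"] for r in records)
duplicate_ids = sorted(i for i, c in id_counts.items() if c > 1)

# Out-of-range values, reported as (id, variable, value)
out_of_range = [
    (r["id"], field, r[field])
    for r in records
    for field, (low, high) in plausible.items()
    if not low <= r[field] <= high
]

print("Duplicate IDs:", duplicate_ids)
print("Out of range:", out_of_range)
```

Flagged values would then be checked against the original forms before deciding whether to correct them or set them as missing.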
14. Distribution of the variables
  o Examine each variable
         Continuous
         •   Normal distribution
         •   If not
             o ? transformation
             o ? categorization
         Categorical
         •   Frequency distribution
15. Missing data
  o Occurs when a respondent would not or could not answer
  o Too much missing data
         Threatens the study
         May indicate a problem with a question
  o Should not be entered as a blank, as some statistical packages
      interpret blanks as zeroes
  o Common practice – coded as 9, 99 or 999
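
Missing-value codes need to be declared as missing before analysis, otherwise they are treated as real values. The Python sketch below shows the idea with hypothetical variables and codes (999 for age, 9 for a pain score).

```python
from statistics import mean

# Hypothetical missing-value codes for each variable
MISSING_CODES = {"age": 999, "pain_score": 9}

rows = [
    {"age": 45, "pain_score": 3},
    {"age": 999, "pain_score": 5},   # age not answered
    {"age": 38, "pain_score": 9},    # pain score not answered
]


def clean(row):
    """Replace each variable's missing-value code with None."""
    return {
        key: (None if value == MISSING_CODES.get(key) else value)
        for key, value in row.items()
    }


cleaned = [clean(r) for r in rows]

# Analyse only the non-missing values
ages = [r["age"] for r in cleaned if r["age"] is not None]
mean_age = mean(ages)
naive_mean_age = mean(r["age"] for r in rows)   # wrongly includes 999

print(mean_age, round(naive_mean_age, 1))
```

The naive mean shows how badly a single undeclared code of 999 can distort a summary statistic.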
16. Consistency errors
  o Situations where respondents answered a question for which they
      were ineligible or when codes were entered incorrectly
  o Countercheck with questionnaire/data collection form


       o Can be prevented by proper programming in some statistical
          software
  17. Exploring the relationship between variables
       • Cross tabulation useful for categorical variables (sometimes better to
        categorize)
       • Should consider confounding & interaction
       • Graphs – mostly for continuous variables
       • Relationship between the outcome variable and other variables
        o E.g. scatter plot
18. Transformation
       • Severely skewed data – two approaches
        o Use nonparametric methods
        o Apply transformation
       • Many distributions in medicine – skewed to the right
       • Involve performing a mathematical operation on every value of the
        variable
       • Improves the symmetry of the distribution


 Transformation               Name                         Effect
X³                            Cube         Reduce extreme skewness to left
X²                          Square         Reduce skewness to left
√X                       Square root       Reduce mild skewness to right
log10 (X)                     Log          Reduce skewness to right
-1/√X                 -ve reciprocal root  Reduce extreme skewness to right
-1/X                    -ve reciprocal     Reduce even more extreme
                                           skewness to right


       • Check the symmetry of the distribution after transformation
       • If sufficiently improved → use the transformed data
       • If resistant to transformation → use nonparametric methods
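
The effect of a log transformation on right-skewed data can be illustrated with a short Python sketch. The values are hypothetical, and the moment coefficient of skewness is used here only as one convenient way to check symmetry; a histogram or a mean/median comparison would serve the same purpose.

```python
from math import log10
from statistics import mean, median


def skewness(data):
    """Moment coefficient of skewness: positive values mean skew to the right."""
    m = mean(data)
    n = len(data)
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed measurements (all positive, so log10 is defined)
values = [1, 2, 2, 3, 4, 5, 8, 20, 100]
logged = [log10(v) for v in values]

print(f"raw:    mean = {mean(values):.2f}, median = {median(values)}, "
      f"skewness = {skewness(values):.2f}")
print(f"logged: mean = {mean(logged):.2f}, median = {median(logged):.2f}, "
      f"skewness = {skewness(logged):.2f}")
```

For these values the skewness drops after the transformation but does not reach zero, which is the kind of check the last three bullets above call for.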
19. Interpretation
       • The most confusing part for researchers



    • May be the most difficult part for those who are not familiar with
      statistical applications
    • Interpret only the results of the final analysis stage
      o E.g. in multivariate analysis, the final model should be interpreted
         for writing, regardless of earlier results that were more favorable
         towards the hypothesis
    • Recall statistical theory and concepts whenever applicable
    • May need help from a medical statistician
20. Univariate analysis
    • Test hypothesis between one independent and one dependent variable
21. Multivariate analysis
    • Why we need multivariate analysis?
    • Purpose of using multivariate analysis
    • Common multivariate analysis methods in health sciences research.


               Variables                           Variables


              Independent                         Dependent
               Predictor                           Outcome
              Explanatory                          Response




              Covariates
           Confounders                Not the primary interest
               Controls               Must be recognized
          Effect modifiers




•   Confounding


     [Figure: risk factor → (?) → disease, with a confounder associated with
     both]




o Distortion of a risk factor-disease relationship brought about by
    the association of other factors with both risk factor and disease
o Example of confounding:



     [Figure: physical activity level → (?) → systolic BP, with age as the
     confounder]




          • Interaction


           [Figure: risk factor → (?) → disease, with an interaction factor
           (effect modifier) acting on the relationship]



           o Exists when the primary relationship of interest between a risk
              factor and a disease differs across levels of the interaction
              factor
           o Example of interaction




                Employment                    ?
               in an industry                                 Lung cancer




                                          Cigarette
                                          smoking
             [Figure: risk of lung cancer (vertical axis) plotted against years of
             employment in the industry (horizontal axis), with separate lines
             for smokers and non-smokers]


    e.g. multivariate analysis




             Surgery
                                                    Compare
                                                    outcome


             Radiation



    • Are these two groups comparable?
    • What is the role of covariates?
22. Purpose of multivariate analysis
    • To statistically adjust the effect on variable Y of a change in a
      particular variable X when the others are controlled
      o X1 → Y (X2, X3, X4… statistically adjusted for), e.g. diet → CHD
         (smoking and age adjusted for)
    • To discover the variable X which has the most influence on outcome
      variable Y

         Diet
       Smoking
                                 CHD
         Age

    • To predict the outcome Y

             Clinical
          Pathology                     Cancer
         Demographic                    prognosis

        Socio-economic



    • The whole idea of multivariate analysis is “how to separate the
      independent effect of each X on Y”
    • Common multivariate analysis methods in health related sciences
      research
    • Multivariate models


      • Modeling strategies


Multivariate (MTV) method          Independent variables     Dependent variables
Multiple linear regression                  >1                         1
Multiple logistic regression                >1                         1
Log-linear regression                       >1                         1
Survival analysis                           >1                         1


Independent variables              Dependent variable       Method
Continuous                         Continuous               Multiple linear reg.
Categorical                        Categorical              Multiple logistic reg.
Continuous                         Categorical              Multiple logistic reg.
Continuous/ categorical            Continuous (survival     Survival analysis
                                   time)
Continuous/ categorical            Count                    Log-linear analysis


23. Multivariate analysis: General Linear Model (GLM)
      • The GLM is a flexible statistical model for analyses involving
       normally distributed dependent variables and combinations of
       categorical and continuous predictor variables.
      • The GLM Univariate procedure provides regression analysis
       and analysis of variance for one dependent variable by one or more
       factors or covariates.
      • The GLM Multivariate procedure provides regression analysis
       and analysis of variance for multiple dependent variables by one or
       more factors or covariates.
      • The GLM Repeated Measures procedure provides analysis of variance
       when the same measurement is made several times on each subject
       or case.




GLM                            Independent Variable      Dependent variable


Univariate GLM                           ≥1                        1
Multivariate GLM                         ≥1                       >1


24. Repeated measures analysis
    • When the dependent variable is a numerical variable


Independent Variable    Dependent variable             Statistical test
Categorical                 Numerical           Repeated measures ANOVA
                                                        (parametric)
Categorical                 Numerical          Friedman test (non-parametric)




    • When the dependent variable is a categorical variable


Independent variable            Dependent variable         Statistical test
Repeated measure                2 outcome categories       McNemar’s test
(2 measures)
Repeated measure                3+ outcome categories      Test of marginal homogeneity
(2 measures)
Repeated measure                2 outcome categories       Cochran’s Q test
(3+ measures)
Repeated measure with           Binary, ordinal or         Cross-sectional time series (xt):
independent variables           multiple categories        logistic regression (xtlogit)
                                Count                      Cross-sectional time series (xt):
                                                           log-linear regression (xtpoisson)
                                                           Generalized estimating equation
                                                           (GEE) model (xtgee)




                     WRITING A RESEARCH PROPOSAL


1. Introduction:
   • Clear statement of the problems or issue to be analysed and the overall
     objective of the proposed research.
   • Brief summary of relevant studies and literature describing what has
     previously been done and what is currently known about the pattern.
   • Concise statement of the rationale behind the proposed approach to the
     problem


2. Statement of specific research goals:
   • List specific objectives
   • List specific hypotheses (if any) to be tested
   • List the key variables and how they will be operationally defined


3. Study methodology
   • Selection of study population
    o Size of study population or sample
    o Sampling procedure, if any
    o Specification of control population, if any
   • Description of the experiment or data collection procedure
    o Description of research design
    o Description of method and intended research tools
    o Description of “interfering” (confounding) variables and how they will
       be controlled, or how their effects will be evaluated
    o If appropriate, a discussion of pitfalls that might be encountered and
       of limitations of the procedure proposed
   • Diagram of research design (optional): a diagram is useful for clarifying
     points of research strategy
   • Analysis plan
    o Specify the kinds of data expected to be obtained




     o Specify the means by which the data will be analysed and
        interpreted
   • Data processing plan
     o Hand tabulation or computer
     o Analysis technique: statistical measures
     o Use of dummy tables
     o Test hypotheses or derive hypotheses to meet the objectives of the
        study


4. Significance of the research for both practice and theory


5. Time table (Gantt chart)
   • Planning phase
   • Construction and development of research instruments
   • Pre-testing of research tools and techniques
   • Selection of population
   • Data collection
   • Data preparation (coding, editing, cleaning, etc)
   • Data analysis
   • Report writing


6. Personnel
   • Principal investigator
   • Assistants
   • Supporting persons


7. Facilities available
   • Office space
   • Resources in field area
   • Data analysis equipment
   • Other assistance




8. Collaboration arrangement
   • Describe the collaboration


9. Detailed budget
   • Personnel
   • Consultant fees
   • Supplies
   • Travel expenses
   • Data processing
   • Other expenses




                                 VARIABLES


1. Types of variables
   • Continuous or quantitative variables
   • Discrete or qualitative variables


2. Continuous or quantitative variables
   • Interval-scale variables
    o Interval scale data has order and equal intervals.
    o Interval scale variables are measured on a linear scale, and can take
       on positive or negative values.
    o It is assumed that the intervals keep the same importance
       throughout the scale.
    o They allow us not only to rank order the items that are measured
       but also to quantify and compare magnitudes of differences between
       them.
    o With interval data, one can perform logical operations, add, and
       subtract, but one cannot multiply or divide.
    o For instance, if a liquid is at 40 degrees Celsius and we add 10 degrees, it
       will be 50 degrees. However, a liquid at 40 degrees does not have
       twice the temperature of a liquid at 20 degrees because 0 degrees
       does not represent ‘no temperature’.
   • Ratio-scale variables
    o In ratio measurement there is always an absolute zero that
       is meaningful.
    o This means that you can construct a meaningful fraction (or ratio)
       with a ratio variable.
    o Weight is a ratio variable.
    o In applied social research most ‘count’ variables are ratio, for
       example, the number of clients in past six months.
    o Why? Because you can have zero clients and because it is
       meaningful to say that “…we had twice as many clients in the past
       six months as we did in the previous six months.”


3. Qualitative or Discrete Variables
   • Discrete variables are also called categorical variables
    o Nominal variables
    o Ordinal variables
   • Nominal variables
    o Nominal variables allow for only qualitative classification.
    o That is, they can be measured only in terms of whether the
       individual items belong to certain distinct categories, but we cannot
       quantify or even rank order the categories
    o Nominal data has no order, and the assignment of numbers to
       categories is purely arbitrary.
    o Because of the lack of order or equal intervals, one cannot perform
       arithmetic (+, -, / or *) or ordering comparisons (<, >) on nominal
       data; categories can only be checked for equality (=, ≠).
    o E.g. male and female; unmarried, married, divorced or widowed.
   • Ordinal variables
    o A discrete ordinal variable is like a nominal variable, except that its
       different states are ordered in a meaningful sequence
    o Ordinal data has order, but the intervals between scale points may
       be uneven.
    o Because of lack of equal distances, arithmetic operations are
       impossible, but logical operations can be performed on the ordinal
       data.
    o A typical example of an ordinal variable is socio-economic status of
       families.
    o We know upper middle is higher than middle but we cannot say how
       much higher.
    o Ordinal variables are quite useful for subjective assessment of
       quality, importance or relevance.
    o Ordinal scale data are very frequently used in social and behavioral
       research.




    o Almost all opinion surveys today request answers on a three-, five- or
       seven-point scale.
    o Such data are not appropriate for analysis by classical techniques,
       because the numbers are comparable only in terms of relative
       magnitude, not actual magnitude.
    o Consider, for example, a questionnaire item in which respondents rate
       their time involvement by selecting one of the following codes:
           1 = very low or nil
           2 = low
           3 = medium
           4 = great
           5 = very great


4. Response variables/target variables
   • Often called a dependent variable or predicted variable.
   • This is the variable that is being watched and/or measured


5. Explanatory variables/predictor variables
   • Any variable that explains the response variable or target variable.
   • Its values will be used to predict the value of the target variable.
   • This is the variable manipulated by the experimenter.


                                      Ratio                absolute zero

                           Interval               distance is meaningful


                 Ordinal                        attributes can be ordered

       Nominal                         attributes are only named; weakest




 6. Confounding variable
    • A confounding variable (also confounding factor, lurking variable, a
      confound, or confounder) is an extraneous variable in a statistical
      model that correlates (positively or negatively) with both the
      dependent variable and the independent variable.
    • Extraneous variables are undesirable variables that influence the
      relationship between the variables that an experimenter is examining.
    • In other words, a confounder is a variable that is associated with the
      predictor variable and is a cause of the outcome variable.




                            DATA PRESENTATION




1. Two ways of presenting data
   • Tables
   • Charts


2. Tables
   • One-way table (Univariate)
    o Table 1: Number of respondents by gender


                 Gender       No. of respondents
              Male                         51
              Female                       49
              Total                        100


   • Two-way table (Bivariate)
    o Table 2: Number of respondents by gender and their educational
         qualification


 Gender         Primary       Secondary           Higher       Total
Male                  15           20               16           51
Female                14           20               15           49
Total                 29           40               31          100




 Gender       Primary (%)   Secondary (%)        Higher (%)   Total (%)
Male              15 ( )          20 ( )           16 ( )       51 ( )
Female            14 ( )          20 ( )           15 ( )       49 ( )
Total             29 ( )          40 ( )           31 ( )      100 ( )




 Gender         Primary          Secondary          Higher          Total
                    (%)                (%)             (%)              (%)
Male                15                 20              16               51
                    ()                 ()              ()               ()
Female              14                 20              15               49
                    ()                 ()              ()               ()
Total               29                 40              31               100
                    ()                 ()              ()               ()
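
The empty brackets in the layouts above are where percentages go. As a minimal sketch of how they could be filled, the snippet below computes row percentages (each cell as a share of its row total). It assumes the Female/Higher cell is 15, the value that makes the cells agree with the stated row totals of 51 and 49 and the grand total of 100; the function name `row_percentages` is hypothetical.

```python
# Cell counts from Table 2. Assumption: Female/Higher = 15, the value that
# makes the cells sum to the stated margins (51, 49 and 100).
counts = {
    "Male": {"Primary": 15, "Secondary": 20, "Higher": 16},
    "Female": {"Primary": 14, "Secondary": 20, "Higher": 15},
}


def row_percentages(table):
    """Return each cell as a percentage of its row total (1 decimal place)."""
    result = {}
    for row_name, cells in table.items():
        row_total = sum(cells.values())
        result[row_name] = {
            column: round(100 * n / row_total, 1) for column, n in cells.items()
        }
    return result


percentages = row_percentages(counts)
for gender, cells in percentages.items():
    print(gender, cells)
```

Column percentages or percentages of the grand total can be computed the same way by changing the denominator; which one to report depends on the comparison the table is meant to support.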




3. Charts
   • A chart is a graphical way to organize data
   • Types
    o Pie chart
             A pie chart is a graphical way to organize data
             All pie charts compare parts of a whole
             A pie chart uses percentages or fractions to compare data
             A type of graph in which percentage values are represented as
             proportionally sized slices of a pie
             Pie charts are especially useful in representing proportions,
             percentages and fractions.
    o Bar chart and Histogram
             A histogram is a bar graph that shows frequency data
             The first step is to collect data and sort it into categories
             Label the data as the independent set or the dependent set
             The data group would be the independent variable and the frequency
             of that group would be the dependent variable
             The horizontal axis should be labeled with the independent variable
             The vertical axis should be labeled with the dependent variable
             Each mark on either axis should be at equal increments, such as 2,
             4, 6, 8, etc.
             Think of a histogram as a set of “sorting bins”
             You have one variable, and you sort data by this variable by
             placing them into “bins”
             Then you count how many pieces of data are in each bin
             The height of the rectangle you draw on top of each bin is
             proportional to the number of pieces in that bin
             On the other hand, in a bar graph you have several measurements
             of different items, and you compare them
             The main question a histogram answers is “how many measurements are
             there in each of the classes of measurement?”
             The main question a bar graph answers is “what is the measurement
             for each item?”

               Situation                      Bar graph or Histogram?
We want to compare total revenues of    Bar graph.
five different companies                Key question: what is the revenue for
                                        each company?
We have measured revenues of several    Histogram.
companies. We want to compare           Key question: how many companies
numbers of companies that make from     are there in each class of revenue?
0 to 10,000; from 10,000 to 20,000;
from 20,000 to 30,000 and so on
We want to compare heights of ten oak   Bar graph.
trees in a city park                    Key question: what is the height of
                                        each tree?
We have measured several trees in a     Histogram.
city park. We want to compare           Key question: how many trees are
numbers of trees that are from 0 to 5   there in each class of height?
meters high; from 5 to 10; from 10 to
15 and so on
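
The “sorting bin” idea can be sketched in a few lines. The tree heights below are hypothetical values invented for illustration, and `bin_counts` is an assumed helper name; the sketch sorts each measurement into a 5-metre class and counts how many fall in each class, which is the information a histogram displays.

```python
from collections import Counter

# Hypothetical tree heights in metres (invented for illustration only).
heights = [2.5, 4.0, 6.1, 7.3, 8.8, 9.9, 11.2, 12.0, 14.5, 3.3]
BIN_WIDTH = 5  # class width in metres: 0-5, 5-10, 10-15, ...


def bin_counts(values, width):
    """Count how many values fall in each class [k*width, (k+1)*width)."""
    counts = Counter(int(v // width) * width for v in values)
    return dict(sorted(counts.items()))


histogram = bin_counts(heights, BIN_WIDTH)

# Text version of the histogram: one row per bin, one '#' per tree.
for lower, n in histogram.items():
    print(f"{lower:>2}-{lower + BIN_WIDTH:<2} m | {'#' * n} ({n})")
```

A bar graph of the same data would instead draw one bar per tree, with the bar height equal to that tree's height.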


     o Line graph
            Line graphs are very widely used because their
            visual characteristics reveal data trends clearly and they
            are easy to create
            A line graph is a visual comparison of how two variables, shown
            on the x- and y-axes, are related or vary with each other.
            It shows related information by drawing a continuous line
            between all the points on a grid.
            Line graphs compare two variables: one is plotted along the x-axis
            (horizontal) and the other along the y-axis (vertical)
            The y-axis in a line graph usually indicates quantity (e.g. dollars,
            liters) or percentage, while the horizontal x-axis often measures
            units of time.
     o Scatter plot
            The pattern of the data points on the scatter plot reveals the
            relationship between the variables.
            Scatter plots can illustrate various patterns and relationships,
            such as:
            •   Data correlation
            •   Positive or direct relationships between variables
            •   Negative or inverse relationships between variables
            •   Scattered data points
            •   Non-linear patterns
            •   Spread of data
            •   Outliers
o Pictograph




                                        Z-SCORE & ITS USES



This is the formula for converting a given value of x into its corresponding
z score for raw data:

    z = (x − μ) / σ

x = the value that is being standardized
μ = the mean of the distribution
σ = standard deviation of the distribution

In every normal distribution, 0.3413 of its total area lies between the mean
and z = 1.


Z-score for Means

    z = (x̄ − μ) / σx̄

Standard Error formula:

    σx̄ = σ / √n

x̄ = sample mean
σx̄ = standard error
μ = population mean
σ = standard deviation
n = sample size




    1. Z-score serves 2 purposes
        • Each z-score will tell the exact location of the original x value within the
          distribution
        • The z-score will form a standardized distribution that can be directly
          compared to other distributions that also have been transformed into z-
          scores.


    2. Value of z-score
        • The sign tells whether the score is located above (+) or below (-) the
          mean
        • The number tells the distance between the score and the mean in terms
          of the number of standard deviations.
  • The z-score for an item indicates how far, and in what direction, that
    item deviates from its distribution’s mean, expressed in units of its
    distribution’s standard deviation.
  • The mathematics of the z score transformation are such that if every
    item in a distribution is converted to its z score, the transformed scores
    will necessarily have a mean of zero and a standard deviation of one.
  • Z scores are sometimes called “standard scores”.
  • The z score transformation is especially useful when seeking to
    compare the relative standings of items from distributions with
    different standard deviations.
  • Z scores are especially informative when the distribution to which they
    refer, is normal.
  • In every normal distribution, the distance between the mean and a
    given z score cuts off a fixed proportion of the total area under the
    curve.
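
The two properties above can be checked numerically. The sketch below uses a small set of hypothetical scores (invented for illustration) and Python's standard `statistics` module: it standardizes the scores, confirms that the z scores have mean 0 and standard deviation 1, and computes the area under the standard normal curve between the mean and z = 1.

```python
from statistics import NormalDist, mean, pstdev

# Hypothetical raw scores (invented for illustration only).
scores = [40, 50, 60, 70, 80]

mu = mean(scores)        # mean of the distribution
sigma = pstdev(scores)   # population standard deviation


def z_score(x, mu, sigma):
    """Standardize x: z = (x - mu) / sigma."""
    return (x - mu) / sigma


z_values = [z_score(x, mu, sigma) for x in scores]

# After the transformation the mean is 0 and the standard deviation is 1
# (up to floating-point rounding).
print(round(mean(z_values), 6), round(pstdev(z_values), 6))

# Proportion of the area under a normal curve between the mean and z = 1.
area = NormalDist().cdf(1.0) - NormalDist().cdf(0.0)
print(round(area, 4))
```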


3. Z-score for making comparison
  • For example: Bob received a score of x = 60 on a math exam and a score of
    x = 56 on a biology test. In which course did he do better?
    o Suppose the biology scores had µ = 48 and σ = 4 and the math scores
       had µ = 50 and σ = 10.
    o Suppose you use a test for your students with µ = 65 and σ = 10
       and your friend uses a test with µ = 100 and σ = 15.
    o Three of your students got 75, 45 and 67 respectively in your test.
       What should their scores be in your friend’s test if you
       want to say the students’ performance in both tests is the same?
    o Formula for standardized score is x = µ + zσ
  • Second example: H0 = there is no effect of PBL on the average score
    obtained by the students
    o Average score (µ) of USM 3rd year students is 60 with standard
       deviation (σ) of 5
    o A sample of 20 students attended PBL and the average score of this
       group of students is 65
    o Is this increase of 5 marks in the average due to chance or the effect of
       PBL?
    o The answer can be obtained by a z test:

          z = (x̄ − µ) / σx̄          Standard Error formula: σx̄ = σ / √n

          x̄ = sample mean
          σx̄ = standard error            σ = standard deviation
          µ = population mean            n = sample size

      H0: µ of the students who attended PBL = 60
      H1: µ of the students who attended PBL > 60 or ≠ 60
      Level of significance or α level = 0.05 (usually used)
      Critical z = 1.96 (two-tailed test at α = 0.05)

      z = [sample mean – hypothetical mean]/[standard error between x̄ and µ]
      z = [obtained difference]/[difference due to chance]
      Here z = (65 − 60)/(5/√20) ≈ 4.47, which is greater than 1.96
      Consult the normal distribution table to see whether the calculated value
      is in the critical region, and so whether to reject or fail to reject the
      null hypothesis
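
Both examples in this section can be worked through numerically. The sketch below uses only the figures given in the text; the function names `z_score` and `from_z` are assumptions for illustration, and the two-tailed p-value is an extra step not shown in the notes.

```python
from math import sqrt
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Standardize x: z = (x - mu) / sigma."""
    return (x - mu) / sigma


def from_z(z, mu, sigma):
    """Convert a z-score back to a raw score: x = mu + z * sigma."""
    return mu + z * sigma


# First example: Bob's two exams.
z_math = z_score(60, mu=50, sigma=10)
z_biology = z_score(56, mu=48, sigma=4)
print("math z =", z_math, "| biology z =", z_biology)

# Equivalent scores on the friend's test (mu = 100, sigma = 15) for
# students who scored 75, 45 and 67 on your test (mu = 65, sigma = 10).
equivalent = [from_z(z_score(x, 65, 10), 100, 15) for x in (75, 45, 67)]
print("equivalent scores:", [round(x, 1) for x in equivalent])

# Second example: one-sample z test for the PBL group.
mu0, sigma, n, sample_mean = 60, 5, 20, 65
standard_error = sigma / sqrt(n)
z = (sample_mean - mu0) / standard_error
critical = NormalDist().inv_cdf(1 - 0.05 / 2)     # two-tailed, alpha = 0.05
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, critical z = {critical:.2f}, p = {p_two_tailed:.6f}")
print("Reject H0" if abs(z) > critical else "Fail to reject H0")
```

Bob's biology score lies two standard deviations above its mean while his math score lies one above, so relative to his classmates he did better in biology even though the raw math score is higher.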




                                       t-test


1. In calculating z-score we need
   • µ = population mean
   • σ = population standard deviation
   • When the standard deviation (σ) is not known, the t-test is the alternative.
2. In the simple t-test, the sample variance S² is used instead of σ².
   • Sample variance (S²) = SS/(n − 1) = SS/df
    o SS = Σx² – (Σx)²/n
           SS = sum of squared deviations
3. Instead of the standard error σx̄, the estimated standard error Sx̄ is used.
   • Estimated standard error Sx̄ = S/√n = √(S²/n)
   • t = (X̄ − µ)/Sx̄
    o X̄ = sample mean
    o µ = population mean (hypothesized mean)
    o Sx̄ = estimated standard error from the sample
   • The higher the degrees of freedom (df) (sample size), the closer the
     S² (sample variance) is to σ² (population variance)
   • Example:
    o “I prefer PBL to lectures”
           Response        1 = SA, 2 = A, 3 = UD, 4 = DA, 5 = SDA
           From this example the hypothesized mean (µ) = 3
           But µ can also be obtained from a study that has been done by
           someone previously.
4. Independent measures t-test
   • t = [ (X̄1 – X̄2) – (µ1 – µ2) ]/ (standard error)
   • Pooled variance, S²p = [ SS1 + SS2 ]/[ df1 + df2 ]
   • Two-sample standard error, S(X̄1 – X̄2) = √[ (S²p/n1) + (S²p/n2) ]
   • H0 = there is no difference in the clinical performance of students
     who attended the traditional curriculum and the PBL curriculum.
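
The formulas in this section can be sketched directly in code. The data below are hypothetical (invented Likert responses and invented clinical scores, for illustration only), and the function names are assumptions. The sketch returns the t statistic and degrees of freedom; the result would still be compared against a t table, which is not reproduced here.

```python
from math import sqrt


def sum_of_squares(xs):
    """SS = sum(x^2) - (sum(x))^2 / n."""
    return sum(x * x for x in xs) - sum(xs) ** 2 / len(xs)


def one_sample_t(xs, mu):
    """t = (sample mean - mu) / estimated standard error, with df = n - 1."""
    n = len(xs)
    df = n - 1
    variance = sum_of_squares(xs) / df      # S^2 = SS / df
    standard_error = sqrt(variance / n)     # S / sqrt(n)
    return (sum(xs) / n - mu) / standard_error, df


def independent_t(xs1, xs2):
    """Independent-measures t with pooled variance, testing mu1 - mu2 = 0."""
    n1, n2 = len(xs1), len(xs2)
    df1, df2 = n1 - 1, n2 - 1
    pooled = (sum_of_squares(xs1) + sum_of_squares(xs2)) / (df1 + df2)
    standard_error = sqrt(pooled / n1 + pooled / n2)
    t = (sum(xs1) / n1 - sum(xs2) / n2) / standard_error
    return t, df1 + df2


# Hypothetical responses to "I prefer PBL to lectures" (1 = SA ... 5 = SDA).
responses = [1, 2, 2, 3, 2, 1, 2, 3, 2, 2]
t_one, df_one = one_sample_t(responses, mu=3)
print(f"one-sample t = {t_one:.3f}, df = {df_one}")

# Hypothetical clinical performance scores for two curricula.
traditional = [60, 62, 64, 66, 68]
pbl = [66, 68, 70, 72, 74]
t_two, df_two = independent_t(traditional, pbl)
print(f"independent t = {t_two:.3f}, df = {df_two}")
```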




                         SENSITIVITY & SPECIFICITY


1. Definition
   • Sensitivity
    o Proportion of subject with a target condition who are identified by a
         positive test finding.
    o Test’s ability to correctly identify individuals with the condition
    o Test’s capacity to detect the condition when it is truly present
    o Probability of a test being positive given that the condition is present
    o Also called true positive rate or hit rate
    o The test will actually classify a person (with the condition) as likely
         to have the condition
   • Specificity
    o Proportion of subjects free of the condition who are correctly
         identified by a negative test result
    o Test’s ability to correctly identify individuals without the condition
    o Test’s capacity to exclude condition when it is truly absent
    o Also called true negative rate or correct rejection rate
    o The test will actually classify a person (without the condition) as
         unlikely to have the condition


                                   With the        Without the
        Test result                                                       Total
                                  condition         condition
Positive                      a 36                b 96               132
Negative                      c 4                 d 864              868
Total                             40                960              1000


Sensitivity = a/(a+c) = 36/40 = .90       (true positive rate)
Specificity = d/(b+d) = 864/960 = .90       (true negative rate)
Positive Predictive Power (PPP) = a/(a+b) = 36/132 = .27
Negative Predictive Power (NPP) = d/(c+d) = 864/868 = .995




2. Validity of the Test


                               True status (population)
                                positive       negative
        Result    positive          a              b
        of test   negative          c              d


Sensitivity: the probability of testing positive if the condition is truly
present = a/(a+c)


Specificity: the probability of screening negative if the condition is truly
absent = d/(b+d)


Example: Screening for breast cancer by Physical Exam & Mammography
                                 With the          Without the
        Test result                                                   Total
                                condition              condition
Positive                     a 36               b 96                132
Negative                     c 4                d 864               868
Total                          40                   960             1000


Sensitivity: a/(a+c)
= 36/(36+4)
= 0.90 = 90%
Interpretation: screening by physical exam and mammography will
identify 90% of all true breast cancer cases.
Specificity: d/(b+d)
= 864/(96+864)
= 0.90 = 90%
Interpretation: screening by physical exam and mammography will
correctly classify 90% of all patients without breast cancer as being free of
the disease.


PPP = a/(a+b)
= 36/(36+96)
= 0.27 = 27%


NPP = d/(c+d)
= 864/(864 + 4)
= 0.995 = 99.5%
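
The four calculations above can be wrapped in one small function. This is a minimal sketch using the cell counts from the screening example (a = 36, b = 96, c = 4, d = 864); the function name `diagnostic_metrics` is an assumption for illustration.

```python
def diagnostic_metrics(a, b, c, d):
    """Validity measures from a 2x2 table.

    a = test positive, condition present (true positives)
    b = test positive, condition absent  (false positives)
    c = test negative, condition present (false negatives)
    d = test negative, condition absent  (true negatives)
    """
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "ppp": a / (a + b),   # positive predictive power
        "npp": d / (c + d),   # negative predictive power
    }


metrics = diagnostic_metrics(a=36, b=96, c=4, d=864)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Although sensitivity and specificity are both 90%, the positive predictive power is only about 27% because the condition is uncommon in this sample (40 of 1000), so most positive results come from the much larger group without the condition.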


Validity: the extent to which the test distinguishes between persons with
and without the condition


High validity requires
-   High sensitivity
-   High specificity



