Stats EBM

NHMRC levels of evidence '99
EBM: Conscious, explicit and judicious use of current best evidence in making
patient care decisions. Clinical experience is integrated with external evidence from
systematic research.
I        systematic review/meta-analysis of all relevant RCT

II       ≥ 1 good RCT

III-1    ≥ 1 pseudorandomised CT (e.g. alternate allocation)
III-2    Comparative studies (incl. systematic reviews of such studies) with
         concurrent controls and allocation not randomised, cohort studies,
         case-control studies, or interrupted time series with a control group.
III-3    Comparative studies with historical control, ≥ 2 single arm studies,
         or interrupted time series without a parallel control group.

IV       Case series*

V        Expert consensus†
                    Statistics: study design*
Case series

Prevalence = cross-sectional = survey

Case-control: compares a group of individuals with the disease (cases) with a
group of individuals without the disease (controls) in terms of their status on a
variable or a number of variables which are thought to be risk factors for the
disease in question.

Cohort: prospectively follows two cohorts, one with the risk factor and one
without.

Experimental: clinical trials

Reviews:
  Systematic review: uses rigorous methods and evaluates quality
  Meta-analysis: evaluates quality of reviewed articles and, using statistical
  methods, integrates data into a quantitative summarised form
                 Study types
Economic:
  Cost-benefit: benefits are converted into the same numerical units as the costs
  Cost-effectiveness: cost of a certain health-care "effect" is calculated,
  e.g. $1000 per additional MI prevented through education
  Cost-utility: cost of a certain personal quality (utility) gain, e.g. cost per
  additional QALY

Decision analysis: quantitative methods are used to analyse primary studies and
develop probability trees based on which decisions can be made

Quick/Slow? Cheap/Expensive? Prospective/Retrospective? Randomisation?

      Design Pros and Cons: Big 4 & more*

RCT
  Pros: allows meta-analysis; blinding
  Cons: funding bias; may be unethical/impractical*; volunteer bias;
        surrogate endpoints/mis-randomisation/mis-blinding common

Cohort
  Pros: cheaper than RCT; subjects/controls can be matched
  Cons: hard to get matched controls; volunteer bias; hidden confounders (no
        randomisation); blinding is difficult; supports causality but does not
        prove it

Case-control
  Pros: good for rare conditions; fewer subjects needed than cross-sectional
        studies
  Cons: recall reliance; confounders aplenty; deciding who is/when they are a
        "case"; gives OR but not AR; finds association, not causality

Cross-sectional
  Pros: ethically safe
  Cons: Neyman bias†

Cross-over
  Pros: subjects act as own controls; all receive some putative benefit
  Cons: bad for treatments with permanent effects; washout period needed & may
        be unknown
 Approach to literature searching

Clarify your question
Search & select articles
Valid results?
What results?     Is there any reason why the results may
                  not be applicable to my patients?
So what?          Do the results help in caring for my patients?
      Assessing research articles
Sample*
  Inclusion/exclusion: real-life patients? Real-life context?
  Sample big enough?
  BIAS avoided?
Exposure
  Clear statement of exposure and comparison
  Treated the same apart from exposure in question
  Blind subjects
  Ethical design/committee approval
Outcome
  Clearly defined outcome
  Measured in the same way in control & exposed groups
  Clinically relevant vs surrogate end points
  Blind assessors
  Follow-up complete / intention-to-treat analysis
  Follow-up long enough?
  Clinically significant benefit? Outweighing harm?
  Mean/SD: how large an effect / how precise an effect
       Notes on research assessment
•    Clinically significant benefit/harm: use RRR and NNT as well as CI
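RRR, ARR and NNT follow directly from the event rates in the two arms. A minimal sketch; the event rates below are made-up numbers for illustration, not from the text:

```python
# Hypothetical event rates (made up) to illustrate ARR, RRR and NNT.
cer = 0.20                  # control event rate
eer = 0.15                  # experimental event rate

arr = cer - eer             # absolute risk reduction
rrr = arr / cer             # relative risk reduction
nnt = 1 / arr               # number needed to treat to prevent one event

print(f"ARR={arr:.2f}, RRR={rrr:.2f}, NNT={nnt:.0f}")
```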

•    Randomization ensures that both known and unknown sample variables are evenly distributed. If the
     study is non-random look at both groups. If they're too different to begin with the study is invalid

Attrition: If there are patients lost to follow up, recalculate the results assuming the ones lost to the
    control group did well and those lost to the treatment group did badly. If it's still a decent result then
    it's worthwhile. Effects on result interpretation: A. If subjects with more side effects drop out, the
     medication may appear to be better tolerated than it is. B. If subjects who are not responding drop out,
     efficacy will be inflated. C. If there is a high overall drop out rate, the validity and generalizability of the
     results might be reduced.

      Reasons for loss to follow-up:
      –    Motivation
      –    Death
      –    Incorrect entry in the first place
      –    Side effects
      –    Clinical reason eg. pregnancy
      –    Can't find them!

Can't use "intention to treat" when doing an "efficacy analysis", as you need to measure the effect in
    those who actually took the intervention
        Assessing review articles*
Focused question?

Valid and explicit article identification, selection, evaluation and
   combination ?

Article handling was reproduced independently by different authors?

Were the results similar from study to study?
Homogeneity: the differences between
            different studies' means are due
            to chance alone
            Subgroup analyses
                  are only any good if

• Only a few subgroups were analysed
• The effect difference between subgroups is large
• The effect difference is very unlikely to occur by chance
• It was part of the hypothesis prior to the study being conducted
• It is replicated in other studies
Hill's (1965) 5 tests of causation:
Qualities of the association that
make causation more likely are
1 Consistency   Replicated in different studies

2 Strength      Size of RR is large
                Good measurability of the degree to which one
3 Specificity   particular exposure confers a particular risk

4 Temporality Exposure precedes outcome

5 Coherence     Biologically plausible
       Presentation of Results
• We don't just give a list of numbers
• Results need to be presented in an easily-
  understandable format
• Which can be easily analysed
• Two main ways:
  – Continuous
  – Discrete
• Requires
  – central tendency
     • Eg mean
  – scatter
     • Eg standard deviation
  – shape
     • Eg skew
              Central Tendency

• Mean ('average' to the layperson)
   – Total of all values divided by the number of values
   – Most useful in a normally-distributed sample
• Median
   – The middle value
   – Eg. if 99 values, the 50th largest
   – Very useful for skewed distributions, as lessens effects
     of outliers
• Mode
   – The value with the most results
• How different are the results to each other?
• Standard deviation
   –   Uses differences from mean of each value
   –   Most useful for normally-distributed sample
   –   sd = √[ Σ(value − mean)² / (n − 1) ]
   –   68% within ±1 SD, 95% within ±2 SD, 99.7% within ±3 SD
• Interquartile range
   – Difference between 25th and 75th centile
   – Very useful for skewed distributions, as lessens effects
     of outliers
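These measures of central tendency and scatter can be sketched with Python's standard library; the data values below are made up for illustration:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]          # made-up sample

mean = statistics.mean(data)              # total / n -> 5
median = statistics.median(data)          # middle value(s) -> 4.5
mode = statistics.mode(data)              # most frequent value -> 4

sd = statistics.stdev(data)               # sample SD: divides by n - 1
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                             # interquartile range

print(mean, median, mode, round(sd, 2), iqr)
```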
Standard Error (of the mean)
• Population standard deviation / √n
• Smaller than SD!
• Indicates how well the sample mean
  approximates the population mean
• So if n very large, more precise and
  confident estimate
• Used for t-test, confidence intervals
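The SD/√n relationship means the SEM shrinks as the sample grows; a sketch on the same made-up data:

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]                 # made-up sample
sd = statistics.stdev(data)
sem = sd / math.sqrt(len(data))                 # SEM = SD / sqrt(n)

# Same spread but four times the subjects -> half the SEM (more precise mean)
sem_4n = sd / math.sqrt(4 * len(data))
print(round(sem, 3), round(sem_4n, 3))
```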
SD vs SE

• Beware: some studies present SE, some sd,
  especially important in graphs
• So make sure of what you're reading
• SD is visual
  – tells you what the results look like on a graph
• SE is about precision
  – tells you what the true value is likely to be
  – (Hint: think of confidence interval)
• If results plotted on a graph, what does it look like?
• Eg.

Bell-shaped                Skewed
(Normal, Gaussian)
• For bell-shaped:
  – mean=median=mode
  – parametric analytic statistics are possible
     • use mean, s.d., number
     • which are more powerful
• For skewed
   – can't use parametric statistics
      (Unless data transformed to make it bell shaped)

          Hypothesis Testing

• We start by assuming null hypothesis
  – There is no difference between populations
  – Both samples from same population
• (Opposite: populations really are different)
• 2 populations
  – MDD + photo for 10 weeks
  – MDD no photo for 10 weeks
• Any difference may be due to pure chance
• ie null hypothesis is true
• ie no real difference
• We reject the null hypothesis if it is very
  unlikely a result this big is due to chance
• ie p less than a pre-set value (eg 0.05)
• Note: refers to populations
• We are trying to work out what is happening in the
  population from our samples
• If we have bigger samples
• it is more likely that they correspond to the true
  population values
• …so p value will be smaller as n increases
• So you need number of subjects
• As well as
   – difference between means
   – degree of scatter (sd)
                     Example 2

• Null hypothesis: there is no difference in outcome between
  populations of depressed patients given fluoxetine or placebo
• Study with good methodology gives:
   – Difference in mean Hamilton scores of 15 after 10 weeks
   – Population standard deviation is 20
   – Effect size is 0.75
   – t-test statistic gives P of 0.005
• Therefore, if there were no true difference, a result this
  large would arise by chance only 0.5% of the time
• We can conclude that fluoxetine is better than placebo
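The effect size quoted here is just the difference in means divided by the population standard deviation (Cohen's d), using the numbers from the example:

```python
diff_in_means = 15       # difference in mean Hamilton scores (from the example)
population_sd = 20       # population standard deviation (from the example)

effect_size = diff_in_means / population_sd   # Cohen's d
print(effect_size)       # 0.75 -- close to the 0.8 'very effective' threshold
```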
          Confidence Intervals
• Let's say the difference between the mean BDI
  scores of the two groups is 5
• This is the real difference between our samples
• We are interested in what happens to the whole
  population of depressed teenagers
• And whether there's a difference between two groups
   – Those given or not given the signed photo
• We can estimate what the true value is in the
  whole population from our sample
• We may calculate that there is:
• 95% chance that true population difference is
  between +17 and -7
• So a reasonable chance that getting the Becks
  photo actually makes depression worse
• Can‟t conclude it‟s an effective treatment
• If 95% CI +4 to +6
• More than 95% chance that the Becks photo does
  improve depression
Confidence Intervals

• Give a range, eg. 95% confidence interval for the
  difference in BDI between photo and no photo
  groups is 4 to 6
• Mathematical definition:
   – 95% of confidence intervals of samples randomly
     sampled from the population will contain the true
     population value
• Practical definition
   – There is a 95% chance that the C.I. contains the true
     population value
• Quote both definitions in exam!
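A 95% CI for a difference in means is roughly the difference ± 1.96 standard errors. With the slide's difference of 5 and an assumed standard error of about 6.12 (not given on the slide), this reproduces the −7 to +17 interval:

```python
diff = 5                 # difference in mean BDI scores (from the slide)
se_diff = 6.12           # assumed SE of the difference (NOT stated on the slide)

lower = diff - 1.96 * se_diff
upper = diff + 1.96 * se_diff
print(round(lower), round(upper))   # roughly -7 and +17, spanning 0
```

Because the interval spans 0, no treatment effect can be concluded, matching the slide's reasoning.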
   Uses of Confidence Intervals
• In describing most differences:
  – Difference between means of groups
  – NNT
     • If upper limit very large, suggests treatment may not
       be that effective
  – Relative Risk/Odds Ratio
     • If CI spans 1, result not significant
Advantages of Confidence Intervals

• Mathematicians love them
• Most journals demand them
• Give a range of likely values
  – Much easier to interpret at a glance than p value
  – So show how big difference really is
• Clearly show an underpowered study
  – Very large CI
Mathematical Assumptions of CI

• Random sampling from population
• Independent observations
• If simple CI of difference between means
   – needs normal distributions
      • as a parametric test
   – Equal population standard deviation between groups
   – (Can do fancy stuff with log for skewed data)
• In exam, end definition by saying:
• '...providing all assumptions for the underlying
  statistical tests have been met'.

• If standard deviations are larger, effect size will be larger
• F (if wider spread of results, less real difference between
  central tendencies)
• For two studies with identical effect sizes, the p
  value will be smaller for the study with much
  larger numbers
• T (if larger samples, more confidence that difference is a
  real difference)
• If p < 0.05, there is a real difference between
  populations under study
• F (it just means that there is a high probability that they are
  different; the difference can still be due to chance)
• If the confidence interval for the difference in HRSD
  scores between treatment groups includes 0, we can
  conclude that it is likely that there is a real difference
  between treatments that is not due to chance
• F (if CI includes 0, shows that confidence interval of likely difference
  includes 0, ie no difference)
• The standard error of the mean is used in calculating
  confidence intervals for differences in means of continuous
  variables
• T (uses multiples of: difference in means / SEM)
Problems with Statistical Tests…
                      • Type I errors
•   A result is found to be statistically significant
    when there is no real difference between the populations
•   A statistically significant result only suggests the result
    is likely to be real
•   P = 0.05 means a 5% chance of a difference this large
    arising when there is no real difference between groups
•   The difference can still be due to chance
• Don't give too much attention to the actual P value
• Is P of 0.051 that different to P of 0.049?
               Multiple Testing

• If a lot of tests are done, it is more likely you will
  get a 'significant' result
• 20 statistical tests done on samples which
  are not truly different will on average give one result
  where P<0.05!
• So watch for multiple comparisons!
• Bonferroni correction:
  – multiplies P result by total number of tests to
    correct for this
  – So if 10 tests, need actual result of P<0.005 to
    give P<0.05
• Probably too conservative
• Assumes different results are independent
  – which they won‟t be
• But do comment on multiple comparisons if there are only a
  few mildly-significant results among a lot of tests
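The Bonferroni correction itself is one line; the raw p values below are made up for illustration:

```python
raw_p = [0.004, 0.020, 0.300]                 # made-up raw p values
m = len(raw_p)                                # number of tests performed

# Bonferroni: multiply each raw p by the number of tests, capping at 1.
adjusted = [min(1.0, p * m) for p in raw_p]
print(adjusted)
```

Only the first result survives the correction at the 0.05 level, illustrating why the method is conservative.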
           Beware of 'Data Dredging'

• It is possible to do lots and lots of analyses with a
  data set
   – Eg Sub-groups
      • Eg Divorced women aged 30-40 with a co-morbid anxiety
        disorder
   – Eg Different measurement instruments
   – Eg Cross comparisons, with various covariates
• If you do enough analyses, you will find one with
  P<0.05
• That certainly does not make it a real finding,
  unless p is very small

• Make a small number of a priori hypotheses
• Adjust for multiple comparisons within these
• Anything significant post hoc can be reported
• ….but only for hypothesis generation
• ….not as an explanatory finding unless p is
  very small
                  Clinical Significance

• Statistical significance doesn't equal clinical significance
• Need to use clinical judgement
• If numbers very large, a very small difference will give a
  'statistically significant' result
• Effect size is a very useful measure
   – 0.8 seen as threshold of 'very effective'
• Look at actual numbers
   – Difference in Hamilton scale of 1 is not very interesting
• Again, confidence intervals give a better picture of clinical
  significance
              One-Tailed Cheating
• In most studies, we must assume that a difference
  can be in either direction
   – eg new treatment can be better or worse than placebo
• So for P<5%, thresholds are top and bottom 2.5%
  of a statistic
   – eg significant if t < -1.98 or > +1.98
       • 5% chance t < -1.98 or > +1.98 if no real difference
   – t = difference in means / SEM
• If result outside this range, difference is significant
• This is called „two-tailed‟
• If 'one-tailed' test used, a result in the top
  5% (a larger range) would be significant,
  rather than the top 2.5%
• So the difference in means wouldn't have to be so
  large for the result to be significant
• And is sometimes used to make a 'non-significant'
  result 'significant'
• So criticise it!

• Sometimes reasonable: i.e. if certain that
  difference will lie in one direction you can then
  use a one-tailed test which allows more power
• Type II error
• A result is found to be not statistically significant
  when there is a real difference between populations
• Not as serious as a Type I error
• It is important to make sure a study has enough
  power to find a hypothesised difference
• Work out required sample size beforehand
               Variables Needed

• Hypothesised difference (↓n if larger)… or effect size
  being looked for
• Hypothesised variation (↑n if larger)
   – (During a large study, may have to review sample size
     as data available)
• Significance level (↑n if lower p value)
   – Remember to adjust for multiple comparisons
• Proposed statistical test (↓n if parametric)
• Power level (↑n if higher)
   – Often 80%
• Can use a nomogram (eg Altman) to work
  out required sample size
• You can work backwards once all data in, to
  find out the power the study had to find a
  significant difference
• This can be very useful in meta-analysis
• Paper will say eg
   – To detect an improvement from 40% in the control
     group to 60% in the intervention group, 250 patients
     would be needed to achieve 80% power at 5% significance
• If this is not stated, assume it's not been calculated
• Study may be underpowered, making it hard to
  assume anything
• An underpowered study is a waste of time
• and it is unethical to put patients through the trial
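A common normal-approximation formula for two proportions sketches how such sample sizes arise. Note this simple version gives roughly 95 per group for the 40% vs 60% example; quoted figures like 250 total may come from different formulas or allow for dropouts:

```python
import math

p1, p2 = 0.40, 0.60        # control and intervention success rates (from the example)
z_alpha = 1.96             # two-sided 5% significance
z_beta = 0.8416            # 80% power

# n per group = (z_alpha + z_beta)^2 * (p1*q1 + p2*q2) / (p1 - p2)^2
n_per_group = ((z_alpha + z_beta) ** 2 *
               (p1 * (1 - p1) + p2 * (1 - p2)) /
               (p1 - p2) ** 2)
print(math.ceil(n_per_group))   # subjects needed per group
```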

• A type I error means that a result is falsely found to be
  statistically significant
• T
• If multiple tests are performed without adjusting
  significance thresholds, then a type II error is more likely
• F (no effect on type II, but increases risk of type I)
   (Adjusting significance thresholds may ↑ type II)
• A very low p value indicates that a result is clinically
  significant
• F (statistical significance ≠ clinical significance)
• If a hypothesised difference between populations
  is small, fewer subjects are needed to make a study
  adequately powered
• F (if small difference, a lot of subjects needed to give
  statistical significance, as standard deviations likely to
  overlap more)
              5. Analyse data
• You have results, now what do you do?
• If methodology assessed as adequate, with
  little chance of bias, look to see if different
  groups have different outcomes
• And if so, can this be explained by chance?
   – statistical significance
• If statistically significant, is it clinically
  significant?
• Statistical tests depend on the distribution of
  the results
• Continuous vs non-continuous
  – Results on scale vs yes/no
• Continuous
• Parametric vs non-parametric
   – Normally-distributed or not
   – Must be a high enough number for parametric tests
• If results skewed, can be transformed to make
  them parametric
   – eg. log (eg pH), square root
• If parametric:
   – More power
   – More statistical tests possible
          Parametric Tests
• (a) The t-test
• Compares two normally-distributed samples
• Tests null hypothesis: are they from the
  same population?
• (Opposite: are they really different?)
• Need:
  – Difference in means
  – Estimate of population standard deviation
  – Number of subjects
• Gives eg t(108)=2.02, P<0.05
• The 108 is degrees of freedom (total n -2)
• Degrees of freedom: sample size minus
  number of estimated parameters
• If t above a certain threshold (from a table)
  for that level of degrees of freedom, then
  statistically significant
• Special assumptions
• Samples mustn't be markedly skewed
  – There is a formula
• Variances must be similar (use F-ratio)
  – An alternative version can be used if not
• Papers rarely say if these assumptions met
• However, when giving a definition, give
  these assumptions
• Two types of t-test
• For independent samples
  – Eg different treatment groups
• For repeated samples
   – Eg for same subjects, before and after an intervention
   – More power
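A minimal sketch of the independent-samples t-test (pooled-variance version, which assumes similar variances); the scores are made up:

```python
import math
import statistics

a = [5, 6, 7, 8, 9]        # made-up scores, group A
b = [1, 2, 3, 4, 5]        # made-up scores, group B

n1, n2 = len(a), len(b)
diff = statistics.mean(a) - statistics.mean(b)          # difference in means
pooled_var = ((n1 - 1) * statistics.variance(a) +
              (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))          # SE of the difference
t = diff / se
df = n1 + n2 - 2                                        # degrees of freedom: total n - 2

print(f"t({df}) = {t:.2f}")   # compare against a t-table threshold for df
```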
• (b) Analysis of Variance (ANOVA)
• A bit like t-test, but compares more than
  two samples
• Gives result: F(3,96)=4.1, P<0.05
• Similar threshold principle to t-test
• But two sets of degrees of freedom
   – Between groups
      • 3, so four groups
   – Within groups
      • 96, so 100 subjects
• F is the value of the F or "variance ratio" test
• Lets you know only whether all samples likely to
  be from same population
• If overall P < threshold
• Must then compare groups with each other, and
  combinations, using t-tests (post hoc analysis
  using Scheffé's, Fisher's or Tukey's tests), to find
  out where significant differences lie
• Eg may find out only one experimental group
  different from control
• Again, can use ANOVA on unrelated or related samples
• Power increased if related - reduces error effects
  of individual differences
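The F statistic is the between-groups variance over the within-groups variance; a sketch on made-up data for three groups:

```python
import statistics

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]    # made-up scores, three groups

all_scores = [x for g in groups for x in g]
grand_mean = statistics.mean(all_scores)

# Between-groups sum of squares (df = number of groups - 1)
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
df_between = len(groups) - 1

# Within-groups sum of squares (df = total n - number of groups)
ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
df_within = len(all_scores) - len(groups)

F = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between},{df_within}) = {F:.1f}")
```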
• (c) Multivariate Analysis of Variance
• Sometimes called multivariate modelling
• t-test gives you differences between groups at end
  of study
• Generally valid, but….
• Just compares two groups at the end
• Doesn‟t take into account baseline differences
  between groups
• Less power than actually looking at change within
  subjects
• Groups may differ at baseline



   (Graph: group scores at baseline and at 16 weeks)
• MANOVA gives you:
• Time effect
   – Comparison of all subjects at baseline against end-point
• Group effect
   – Comparison of all observations (baseline and end-
     point) of one group against all observations from
     another group
• Group * time interaction
   – Looks to see if there is a different time effect in the
     different groups
      • ie which treatment is more effective
   – Looks at change within each subject
      • increased power
• MANOVA gives you an F value, with significance
  (p value)
• MANOVA can give you effect size
• MANOVA can also take account of confounders
  which may not be balanced across treatment groups
• MANOVA can be used for more than 2 treatment groups
• MANOVA can look at treatment effects over
  several time points



   (Graph: treatment response curves over time, 0–16 weeks)
• F value will give you the overall difference
  in effect between the two curves
• t-tests sometimes used post hoc to look to
  see at which time points there was a
  significant difference
• MANOVA can demonstrate differences in
  treatment effects for different subgroups
• eg group * time * gender interaction
• Group * time: shows if there is a different time
  effect for the different groups
• Group * time * gender: shows if this group effect
  on outcome is different for each gender
• Needs large numbers
   – eg 100 per group gives about 70% power of detecting
     effect size of 0.5
• (d) Analysis of Covariance (ANCOVA)
• “an extension of ANOVA which adjusts means for the
  influence of a correlated variable or a covariate. Used
  when research groups are known to differ on a
  background-correlated variable, in addition to differences
  attributed to the experimental treatment.”*
•   A regression method, not an ANOVA!
•   Used to allow for baseline differences
•   Used rather like MANOVA
•   Special assumptions
•   1. Needs pre- and post-treatment scores to be
    linearly related
• 2. When post-treatment scores are plotted
  against pre-treatment scores, the lines must
  be parallel
Score              Placebo

                    d        Treatment

                  Pre-treatment score
• For an equivalent pre-treatment score, the
  active treatment gives a lower post-
  treatment score
• The difference is d
• A confidence interval for d can be given
• Or a hypothesis test used
          Non-Parametric Tests
• Depend on ranking, as can't rely on distribution
• Eg for final HDRS scores,
• Placebo group may have
   – highest, 2nd, 4, 5, 6th highest scores
• Intervention group may have
   – 3, 7, 8, 9, 10th highest scores
• Could these samples be from the same population?
• (a) Mann-Whitney U-test
• For two independent groups of scores
• Looks at ranks in the two groups, correcting for
  difference in size of sample
• Using a table, see if a significant difference
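The U statistic can be computed directly from the rank sums; a sketch using the ranks from the example above (1 = highest score):

```python
# Ranks from the example above (1 = highest HDRS score).
placebo_ranks = [1, 2, 4, 5, 6]
intervention_ranks = [3, 7, 8, 9, 10]

n1, n2 = len(placebo_ranks), len(intervention_ranks)
r1 = sum(placebo_ranks)                     # rank sum for group 1
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1       # U for group 1
u2 = n1 * n2 - u1                           # U for group 2
u = min(u1, u2)                             # compare to tabled critical value
print(u)
```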

• (b) Kruskal-Wallis Test
• Like Mann-Whitney, but for more than two groups
• (c) Sign Test
• For paired results
• Compares the number of subjects whose scores
  go up with the number whose scores go down
• (d) Wilcoxon Matched Pairs Test
  (Wilcoxon Signed Rank Test)
• For paired results, like sign test
• Takes account of magnitude of change
• (e) Friedman Test
• Like Wilcoxon, but more than two groups
• Log transformation is sometimes essential before
  parametric statistical tests are used
• T (if results skewed)
• Parametric tests use the difference in means and
  the standard deviation
• T (also needs number of subjects)
• Non-parametric tests use the difference in medians
  and the inter-quartile range
• F (use ranks)
• The Kruskal-Wallis Test is a non-
  parametric test that can be used to compare
  more than 2 unrelated groups
• T
• Eg Recovered or not recovered
• Eg Recovered from depression 6 months
  after start of treatment
               Still depressed   Recovered
   Placebo           60              30

   Treatment         25              58
• Uses counts
• Doesn‟t use percentages (although they may be
  used for convenience)
• Uses Chi-squared 2
• This is a non-parametric test
   – Uses counts, not a continuous distribution
• Looks at the difference between observed and
  expected values
• If numbers in boxes small, need to correct data
   – Yates' continuity correction
   – Fisher's exact test if very small
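The observed-vs-expected calculation for the 2×2 table above (without Yates' correction) can be sketched as:

```python
# Counts from the table: rows = placebo/treatment, cols = still depressed/recovered.
observed = [[60, 30], [25, 58]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i in range(2):
    for j in range(2):
        # Expected count under the null: row total * column total / grand total
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (observed[i][j] - expected) ** 2 / expected

print(round(chi2, 1))   # a large chi-squared -> treatment and outcome associated
```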
• The need for informed consent can limit the
  external validity of treatment studies
• T (can only recruit consenting subjects; if a lot of
  population under study would not be able to
  consent, sample not representative of population)
    Stratified/Block Randomisation

• You may be asked what this is
• It is important to make groups as identical as possible
• With very large numbers, randomisation should ensure this
• But there can be an imbalance, especially if numbers small
• And, if small sub-groups, there may be imbalanced
  numbers in a group
• Can use stratification/block randomisation/minimisation to
  ensure balance

• Does the scale measure what it is supposed to?

• Face validity
   – Scale appears to be correct
• Criterion validity
   – Can be compared with a known quantity, eg height
   – Relation between rating scale and diagnosis
• Content validity
   – Appropriate coverage of the subject matter
   – Eg scale for depression should measure emotions,
     cognitions, physical symptoms, behaviour
• Convergent validity
   – Whether measures which are expected to be correlated
     are indeed associated
   – eg correlation of emotional, cognitive, physical
     symptoms on depression scale
• Divergent validity
   – Whether a scale discriminates between groups which
     are expected to be different, eg depressed people and
     non-depressed people
• Predictive validity
  – The agreement between a present measurement
    and one in the future
   – Eg Suicidality scale and suicide attempt in the future
• Concurrent validity
  – Agreement with another (valid) scale
  – eg new depression scale correlates with HRSD
• Construct validity
  – A composite of other types of validity
Classification systems validity
Descriptive validity – how well the classification system
describes clinical syndromes re tightness
of criteria, overlap of symptoms and comorbidity
(describing what we see without reference to
aetiology or the theoretical basis of the conditions).
• If a scale is not valid, it is of no use and you can't
  interpret meanings
• A scale must be used with proven validity
• Validity is proven in previous studies
• Validity must be proven for the population under study
   – eg children
• So if a new scale is used for a trial, with no
  validity data, it's not very good, really!
• Does the test give the same scores when repeated?
• Test-retest reliability
   – Does the same test give the same score on the
     same subject when repeated?
• Inter-rater reliability
   – Would different raters give the same score with
     the test on the same patient?
• As with validity, a test should have proven
  reliability
• In addition, if more than one person rates
  patients in a study,
• Inter-rater reliability must be assessed
  between raters and stated
• The kappa () statistic usually used
• Should be above 0.7
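Kappa compares observed agreement with the agreement expected by chance; a sketch on a made-up 2×2 rater-agreement table:

```python
# Made-up counts: rows = rater A (yes/no), cols = rater B (yes/no).
table = [[20, 5], [10, 15]]

total = sum(sum(row) for row in table)
p_observed = (table[0][0] + table[1][1]) / total   # proportion where raters agree

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
# Chance agreement: product of the marginal proportions, summed over categories
p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2

kappa = (p_observed - p_expected) / (1 - p_expected)
print(round(kappa, 2))   # 0.4 here -> below the 0.7 usually required
```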
                 Some Scales

• The scale used can influence results!
• You may be asked what a scale measures
• May be:
  – structured/semi-structured/unstructured interview
  – self-rated questionnaire
• May measure level of symptoms or give diagnosis
• See pp 52-54 of Oxford textbook (3rd edition,
  green) for a good list
• Hamilton Rating Scale for Depression
  – Unstructured interview
  – Biological weighting
• Beck Depression Inventory
  – Self-rated questionnaire
  – Psychological weighting
• Therefore on medication vs CBT trial:
  – HRSD may favour antidepressant
  – BDI may favour CBT
• Therefore a good trial should use multiple measures
• An effective treatment should give
  significant results on all measures
                  Global Measures

• Clinical Global Impression Improvement (CGI)
   – Very much improved (1) → No change (4) → Very much worse (7)
   – Often used in trials
• Health of the Nation Outcome Scale (HoNOS)
   – Semi-structured interview
   – Measures multiple domains, eg. function, depressed
     mood, delusions
   – Versions for children and people with LD
   – Government like it
• Another warning
• What does a decrease of 5 on eg the HRSD mean?
• Different for different patients
• A drop from 40 to 35 is different from a drop from 6 to 1
• So diagnosis may be a better outcome measure

• These scales just show level of symptoms,
  don‟t give a diagnosis
• Need a set, (semi-) structured interview to
  give actual diagnosis
• eg Schedule for Affective Disorders and
  Schizophrenia (SADS), SCID, PSE
• Pre-set diagnostic criteria (eg DSM-IV)
• 'Clinical interview' not adequate!
• A good treatment will cause more patients
  to go into remission
• Define remission/recovery a priori
• Diagnosis still a vague term!!
  – Depends on criteria used
• Watch out: a lot of psychology literature
  uses people with BDI above a threshold and
  calls them 'depressed'

• A scale must have good reliability for it to
  have good validity
• T (if not reliable, measurements won't be valid)
• Validity not essential for reliability
• It is essential to assess the criterion validity
  of new scales that measure depressive symptoms
• F (Can be useful to compare against a DSM
  diagnosis, but not essential. Scale content more important)
• Convergent validity is often measured by
  correlating overall scores on a new outcome
  scale with overall scores on an established scale
• F (That is concurrent validity)
• The HRSD is a good scale to use to
  diagnose depression
• F (Gives level of symptoms, not diagnosis)
            Blinding Participants
• Knowing they are having new treatment
  may cause favourable expectations or behaviour
• May affect compliance
• If they know they're having control
  treatment, may be more likely to seek
  additional treatment
• Less likely to leave trial early
Hawthorne effect: patients getting better simply because they know
they're being observed as part of a study
         Blinding Investigators
• Health-care providers, enrollers, trial designers, etc.
• Their attitudes about intervention may be passed
  to patients
• May affect likelihood of giving adjunct treatment
• May affect adjustment of dose
• May affect whether they withdraw patient from trial
• May affect encouraging/discouraging patients to
  withdraw from trial

• Knowing the outcome status in a cohort
  study may lead to information bias
• F (would do if a case-control study)
• Stratification can be used to control for
  confounding in a case-control study
• T (or a cohort study)

• In a case-control study of the association between
  diagnosis of alcohol dependence and diagnosis of
  depression, the relative risk is the statistic which would
  give the most useful information on strength of association
• F (cannot calculate RR from case-control; OR
  approximates RR if v rare outcome)
• The t-test is commonly used to look at strength of
  association in case-control studies
• F (use correlation or CI of OR)
Odd & even numbers used to allocate patients to treatment/control arms:

• What is this type of allocation called? (2 marks)
• Quasi-randomisation
• What disadvantages could this type of allocation have in
  this case? (4 marks)
• It potentially introduces selection bias (1)
• The person discussing study entry with the subject may
  know what treatment they'll get (1)
• This may influence whether or how they discuss the study
• So subjects in each group may differ in ways other than
  treatment                             (1)
• It means there is not allocation concealment (1)
Parametric     Data are normally distributed e.g. height
Non-parametric Data are NOT normal e.g. HAM-D scores
Binary*        E.g. gender i.e. not continuous like above

Tests          All non-parametric tests can be used on
               parametric data but not the other way around

          Two groups                   >2 groups
 P        t-test                       ANOVA
 NP       Mann-Whitney U test          Kruskal-Wallis test
 B        χ²                           χ²
          Fisher's exact probability test

     These are all for independent variable/outcome data
        Stats: Definitions
Relative risk   =RR=risk ratio=rate ratio: in cohort
                study: risk of getting illness if you have risk
                factor vs. if you don‟t e.g. 1 in 3 (MI if
                smoke) divided by 1 in 10 (MI non-smoker)
                Variables associated both with outcome and the risk
Confounders     factor being studied

                Multivariate analysis which attempts to calculate
Logistic        influence of variables on (usu. dichotomous)
regression      outcomes (taking into account confounders)

                Only one you can calculate in case-control
Odds ratio      studies. Should be > 3 to be worthwhile. Slight
                over-estimate of the RR & less intuitive than it

                The range of values within which there is a 95%
Confidence      probability the true population mean lies. If it
interval        includes 1 (for ratios) then the result is not significant

Power           Probability that a test will detect
                a true difference (= 1 − β)
             Stats: Definitions
                      The probability that a diseased patient will have a
Likelihood            certain test result vs. a non-diseased control having that
ratio of a +/-        result. LR+ of 5 means the +ve result is 5 times more
                      likely to happen in a diseased person. LR- of 0.2 means
test [in a diseased   the -ve result is 0.2 x more likely (i.e. 5 times less
person]               likely) to happen in a person with the disease.

SpPin     When Specificity high Positive result rules in diagnosis
                      Results are derived by comparing groups. In the
Parallel vs           latter each person in treatment arm has a person
matched grp           matched for confounding variables in controls
                      Allows investigation of the effect of more than
Factorial             one independent variable on an outcome.
design                Variables may be combined or separate in cases.

Single v double blind          Single: patient's blind. Double:
                               investigator's blind too.
                 Reliability measure: Level of agreement between two
Kappa            raters beyond that which would have occurred by
score            chance alone. K=1 means perfect agreement. ≥0.7
                 usually taken to be good enough
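The kappa calculation itself can be sketched in a few lines; the 2×2 agreement table below (two raters classifying 100 patients as case/non-case) is hypothetical:

```python
def cohens_kappa(table):
    """Cohen's kappa. table[i][j] = count where rater A gave i, rater B gave j."""
    n = sum(sum(row) for row in table)
    k = len(table)
    observed = sum(table[i][i] for i in range(k)) / n           # raw agreement
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2  # chance agreement
    return (observed - expected) / (1 - expected)

table = [[40, 10],
         [5, 45]]          # raters agree on 85/100 patients
print(round(cohens_kappa(table), 2))   # 0.7 — just reaches the usual threshold
```

Note how 85% raw agreement shrinks to κ = 0.7 once chance agreement is subtracted.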
          Stats: Definitions
              The amount of spread or variation in the data. Usually
Variance      recorded as a range or a standard deviation.
              = SD2
Z score       Deviation of a given value from the mean
              divided by the SD
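As a one-liner (the scale mean/SD values are illustrative only):

```python
def z_score(x, mean, sd):
    """How many SDs the value x lies from the mean."""
    return (x - mean) / sd

# e.g. a score of 130 on a scale with mean 100 and SD 15
print(z_score(130, 100, 15))   # 2.0
```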
           Multivariate analysis
Multi-     Multivariate analysis: Methods to deal with more
           than one related 'outcome/dependent variable' (like
variate    two outcome measures from the same individual)
analysis   simultaneously with adjustment for multiple
           confounding variables (covariates). When there is
           more than one dependent variable, it is inappropriate
           to do a series of univariate tests. Hotelling's T2 test is
           used when there are two groups (like cases and
           controls) with multiple dependent measures (may be
           more than two), and multivariate analysis of variance
           (MANOVA) is used for more than two groups. FA &
           PCA are types of multivariate analysis
                   Stats definitions
 Factor         Seeks to summarize multi-variate data into a few “factors”
                that bunch together dependent variables that correlate
 analysis       highly e.g. insomnia & appetite change usually go together
                and can then be called the “melancholic” factor.
                     PCA aims at reducing a large set of variables to a small
 Principal*          set that still contains most of the information in the
 component           large set. While PCA summarises or approximates the
                     data using fewer dimensions (to visualise it, for
 analysis            example), FA provides an explanatory model for the
                     correlations among the data.
Eigenvalue: measure the amount of the variation explained by each
principal component (PC) (or factor) and will be largest for the first PC
and smaller for the subsequent PCs. An eigenvalue greater than 1
indicates that PCs account for more variance than accounted by one
of the original variables in standardized data (i.e. one of the variables
that have been collapsed into the PCs or factors). This is commonly
used as a cut-off point for which PCs are retained.
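For a 2×2 correlation matrix the eigenvalues have a closed form, which makes the retention rule easy to illustrate; the r = 0.6 below is a made-up correlation between two standardized variables:

```python
import math

def sym2x2_eigenvalues(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], largest first."""
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2

# correlation matrix of two standardized variables with r = 0.6
l1, l2 = sym2x2_eigenvalues(1.0, 0.6, 1.0)
print(l1, l2)           # ≈ 1.6 and 0.4: only the first PC has eigenvalue > 1
print(l1 / (l1 + l2))   # first PC explains ≈ 80% of the variance
```

By the eigenvalue-greater-than-1 rule, only the first component would be retained here.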
         Multivariate Analysis
• Powerful tool for investigating the effects of
  variables of interest
• Adjusts for the effects of confounding variables
• Very useful when you can't randomise samples
• So very important in observational studies
• Parametric
• So needs normal distributions
                   Many Types
• At a very simple level, t-test
• Multiple linear regression
   – Continuous outcomes
   – eg HRSD scores
• Multiple logistic regression
   – Binary outcomes
   – eg diagnosis of schizophrenia
• Cox regression
   – Time-to-event outcomes
       Multiple Linear Regression
• Shows relative effects of different covariates
  (exposures) on continuous outcome
• You get a model
• eg HRSD = 5.54 + 0.345*age + 4.23*female + …
• This can be useful for prediction…
• What we're interested in is:
       Are the covariates significant predictors?
  Are Covariates Really Significant?

• Some covariates correlate strongly (eg alcohol use
  positively related to maleness)
• We want to know if a covariate predicts outcome
• If all other covariates held constant
• Some covariates may be redundant as they are
  strongly correlated with each other
• So a model can drop them
               What This Means

• Can conclude that alcohol not a predictor per se, it
  is the maleness that is the predictor, and it also
  predicts alcohol use
   – so alcohol should be dropped from model
• However, may conclude that both alcohol and sex
  are significant predictors independently
   – so put them both in model
• We may find that alcohol has different predictive
  effects for each sex
   – alcohol * sex interaction
          Presentation of Results

• As a table Coefficient Standar    t      P
                 (b)     d error
      Sex       4.23       0.9      4.7 < 0.001
   Alcohol       9.6       4.2     2.29 < 0.05
      Age       0.345      0.9     0.38    ns
     Sex *       3.6       1.1     3.27 < 0.01
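Each t statistic in the table is just the coefficient divided by its standard error, which is easy to check:

```python
# coefficient (b) and standard error pairs taken from the table above
rows = {"sex": (4.23, 0.9), "alcohol": (9.6, 4.2),
        "age": (0.345, 0.9), "sex*alcohol": (3.6, 1.1)}
for name, (b, se) in rows.items():
    print(name, round(b / se, 2))   # reproduces the t column: 4.7, 2.29, 0.38, 3.27
```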
        Interpretation of Results

• If P < threshold, that covariate is a
  significant independent predictor for the outcome
• Remember rules on interpreting P values
   – P < 0.05 means less than 5% chance of seeing a
     result this extreme if there were truly no effect
• So for our table...
                  Coefficient (b)   Standard error   t       P
       Sex             4.23              0.9         4.7    < 0.001
    Alcohol            9.6               4.2         2.29   < 0.05
       Age             0.345             0.9         0.38     ns
  Sex * alcohol        3.6               1.1         3.27   < 0.01

• Sex and alcohol use are significantly associated with
  HRSD scores
• ..taking into account other variables in the model
• Age ns, should be dropped from model
• The effects of alcohol are different for males and females
   – Or different effects of sex for different alcohol intake!
            Deciding on a Model
• Many models can come from one data set!
• Have to decide on which model to use
• Hierarchical selection
   – uses theoretical rationale of what is expected
• Forward stepwise selection
   – starts with no covariates
   – start with most significant predictor
   – keep on adding less significant predictors until they're
     no longer significant
• Backward stepwise selection
   – Start with all possible covariates
   – Keep removing least significant variable
• All subsets selection
   – Computer looks at all possible models
   – Find model which explains the highest proportion of variation
   – Highest R² value
      •   R² = variation of response explained by model /
          total variation of the response variable
      •   In fact, this can falsely increase if you have lots of variables
      •   Use adjusted R² to take account of this
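The adjustment penalises the number of predictors p; a minimal sketch with invented R², n and p values:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# the same raw R² of 0.40 looks far worse once 20 predictors are penalised
print(round(adjusted_r2(0.40, 50, 3), 3))    # 0.361
print(round(adjusted_r2(0.40, 50, 20), 3))   # -0.014 — can even go negative
```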
    Multiple Logistic Regression

• If binary outcome
• eg diagnosis of schizophrenia, death
• Works out how a variable affects the chance
  of the outcome occurring
• Gives adjusted odds ratios for covariates
• OR can be hard to interpret
• Approximates relative risk for rare events
                     The Model
•   Like for linear regression
•   But uses log-odds
•   Means scale can be negative as well as positive
•   If plain odds were used, scale would run from 0 to infinity
    – with 1 corresponding to no effect
• eg, if p is the predicted probability of MDD
• loge (p/(1−p)) = 5.54 + 0.345*age + 4.23*female + …

• To get corrected odds ratio for a variable
• ORcorrected = e^coefficient
• eg if coefficient in model is 0.7, odds ratio approximately 2
• Otherwise, interpretation of significance of predictors as
  with linear regression
• You'll get a table saying if predictors are significant
• Deciding on which model - same principles
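The back-transformation from a logistic coefficient to an adjusted odds ratio can be sketched directly (the coefficient and SE below are hypothetical):

```python
import math

def odds_ratio(coef):
    """Adjusted OR = e^coefficient."""
    return math.exp(coef)

def or_95ci(coef, se):
    """95% CI: exponentiate on the log-odds scale, not the OR scale."""
    return math.exp(coef - 1.96 * se), math.exp(coef + 1.96 * se)

print(round(odds_ratio(0.7), 2))   # ≈ 2.01 — coefficient 0.7 gives OR about 2
lo, hi = or_95ci(0.7, 0.2)
print(round(lo, 2), round(hi, 2))  # CI excludes 1, so the predictor is significant
```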
      Some Potential Problems
• 1. Only gives results of significance of predictors
  controlling for other covariates measured and
  included in the model
• Does not control for covariates not thought of
• So not as good as randomisation
• 2. Must be normal distribution
• 3. Must be linear, not curved, relationship
  between continuous variables
• 4. Multicollinearity
• A problem if two covariates correlate very highly
  (eg 0.9)
• The one with minutely better correlation with the
  outcome measure is selected first
• The other variable will then appear to be a much
  worse predictor
   – or even non-significant
   – as it correlates more strongly with the first variable
     than with the outcome measure
• So study design important: don‟t include variables
  with very high correlation
• Multiple logistic regression is used to investigate
  the effects of variables on a continuous outcome
• F (binary outcome)
• Multiple linear regression can be used to prove
  that an exposure of interest causes an outcome of interest
• F (only shows association)
• In multiple linear regression, if two covariates are
  highly correlated, there may be a type II error
• T (multicollinearity)
• Multiple linear regression is an appropriate statistical
  technique to use if there is a curvilinear relationship
  between variables
• F (must be linear relationship)
• Multiple linear regression cannot be used if there is a
  heavily skewed distribution of a proposed variable
• F (can transform the distribution to make it normal)
         Stats: Definitions
Relative Risk   An RRR of 25% means the risk of the event in
Reduction       the exposed/treated group is reduced by 25%!

P < 5%          There is less than 5% chance that the difference
(P < 0.05)      was found by chance (i.e. 95% CI excludes 0)

Number          Number of people who would have
Needed to Treat to be exposed/treated to prevent one event
Risk v          Risk factor predisposes to disease
                development but may or may not act as a
prognostic      prognostic factor once disease occurs
                Shows time to event and can be used to comment
Survival*       on whether the chance of the event occurring goes
curve           up or down with time
         Stats: Definitions
               Difference in outcome magnitude between
Effect size*   control and treatment group DIVIDED by SD
               ES of 1 => treated patient is better than 84% of
               untreated patients
Odds           The ratio of people with the
               event/outcome to those without, e.g. the
               odds of dying in a year are 1/99 (1 person
               will die and 99 will remain alive)
               The proportion of cases in a population
Population     attributable to a given risk factor (e.g. PAR of
attributable   genetic causation is low in scz i.e. many sporadic
risk           cases occur with no family history)
Fisher's       A non-parametric test of statistical
               significance of the difference between two
Exact          small independent samples when the scores
Probability    belong to one or the other of two mutually
               exclusive classes
                   Using the 95% CI
If it includes 1(rates) or 0 (reductions) then it is not significant.

The bigger the sample the narrower it is.

        95% CI of RRR = RRR ± 1.96 × SE*
        95% CI of NNT = 1/CI of ARR!
     Positive studies:    If the minimum putative effect in the CI is
                          not clinically significant then the numbers
                          are too small.
     Negative studies:    If the maximum putative effect is still
                          clinically significant then it doesn't prove
                          lack of benefit.
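A sketch of why sample size drives CI width, using the normal-approximation SE of a risk difference (the trial counts are invented):

```python
import math

def arr_ci(ce, nc, ee, ne):
    """ARR and its 95% CI from event counts (ce/nc control arm, ee/ne treated arm)."""
    cer, eer = ce / nc, ee / ne
    arr = cer - eer
    se = math.sqrt(cer * (1 - cer) / nc + eer * (1 - eer) / ne)
    return arr, arr - 1.96 * se, arr + 1.96 * se

# hypothetical trial: 20/100 events on control vs 10/100 on treatment
arr, lo, hi = arr_ci(20, 100, 10, 100)
print(arr, lo, hi)       # ARR 0.1, CI roughly 0.002 to 0.198
print(1 / hi, 1 / lo)    # NNT 10, but its CI runs from about 5 to 500 (very wide!)
```

Inverting the ARR limits gives the NNT limits, which is why NNT intervals from small trials are so asymmetric and wide.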
Using Number Needed to Treat
E.g. although the relative risk might sound impressive, if the absolute risk is
quite low then treatment may still not be worthwhile, as NNT will be very high.

•NNT will be much higher in a low risk group meaning that although
there might be a similar relative lowering of adverse outcome it may not
be reasonable to treat
      – that's why it might not be smart to give statins to hypercholesterolemic,
        but otherwise low risk people: even though it might reduce their risk of
        MI by the same 25% as giving it to a high risk group, you might only
        prevent a few MIs although you're putting huge numbers on statins

•NNT can also be used to take side effects into account:
       – If 30% of patients get fat on olanzapine and NNT to prevent 1-year
          relapse following a drug-induced psychosis is say 30, you'll make about 10
          people fat to prevent one relapse; but in a schizophreniform disorder
          group where NNT is 3, you'll only make about 1 person fat to prevent one relapse
You can come up with a hypothetical factor F that says how close your patients are
to the study population e.g. 0.5 (i.e. half as likely to respond to the particular
treatment because they're hard cases!). The adjusted NNT will then be NNT/F
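The statin point above is just arithmetic: with the same 25% RRR, NNT scales inversely with baseline risk (the event rates below are invented):

```python
def nnt(cer, rrr):
    """NNT = 1/ARR, where ARR = control event rate x relative risk reduction."""
    return 1 / (cer * rrr)

print(round(nnt(0.20, 0.25)))   # high-risk group: treat 20 to prevent one event
print(round(nnt(0.02, 0.25)))   # low-risk group: treat 200 for the same 25% RRR
```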
             Sensitivity, Specificity &
             Predictive Value of Tests
    PVs BUT NOT Se/Sp are prevalence-dependent

                               Disease      No Disease
            + Test             a            b
            - Test             c            d

     Sensitivity = a/(a+c)             PPV = a/(a+b)
     Specificity = d/(b+d)             NPV = d/(c+d)

Likelihood ratio +ve = (a/(a+c))/(b/(b+d)) = sensitivity/(1−specificity)
Likelihood ratio −ve = (c/(a+c))/(d/(b+d)) = (1−sensitivity)/specificity
       Using likelihood ratios
Probability = Odds/(1+odds)
Odds = P/(1-P)
Pre-test probability = prevalence = (a+c)/total
Pre-test odds = prevalence/(1 − prevalence)
Post-test odds = LR+ x Pre-test odds
Post-test probability = post-test odds/(1 + post-test odds) = a/(a+b)
= PPV for the population*
Which is the percentage of people scoring positive on
the test who will actually have the disease (just replace
with LR− in the above to obtain post-test probability for a −ve
result: percentage of people who score negative on the
test who will have the disease)
   Risk & exposure data*
            Exposed    Control
Event       EE         CE
No event    EN         CN

 EER  =  EE/E   (E = EE + EN)
 CER  =  CE/C   (C = CE + CN)

 ARR  =  CER − EER              ABI = EER − CER
 RR   =  EER/CER
 RRR  =  1 − RR  =  ARR/CER     RBI = ABI/CER
 NNT  =  1/ARR
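The whole table can be computed from the four cells; the cohort counts below are invented:

```python
def risk_stats(ee, en, ce, cn):
    """ee/en: events and non-events in the exposed arm; ce/cn: control arm."""
    eer = ee / (ee + en)
    cer = ce / (ce + cn)
    arr = cer - eer
    rr  = eer / cer
    rrr = 1 - rr              # same as arr / cer
    nnt = 1 / arr
    return eer, cer, arr, rr, rrr, nnt

# hypothetical cohort: 10/100 events if exposed vs 20/100 in controls
print(risk_stats(10, 90, 20, 80))   # EER 0.1, CER 0.2, ARR 0.1, RR 0.5, NNT 10
```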
   Risk & exposure data

            Exposed       Control
Event       EE            CE
No event    EN            CN

 Odds of event (exposed)  =  EE/EN
 Odds of event (control)  =  CE/CN

     OR  =  (EE/EN)/(CE/CN)
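The odds ratio is just the cross-product of the same four cells; the case-control counts below are invented:

```python
def odds_ratio(ee, en, ce, cn):
    """(odds of event among exposed) / (odds of event among controls)."""
    return (ee / en) / (ce / cn)    # algebraically equal to (ee * cn) / (en * ce)

print(odds_ratio(30, 70, 10, 90))   # ≈ 3.86 — above the "> 3 to be worthwhile" rule
```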
         Stats: errors & power
Type I  (α)   Making up the difference
              False positive
              Null hypothesis in fact correct

Type II (β)   Missing the difference
              False negative
              Null hypothesis is wrong

Power = 1 − β               usually ≥0.8 is good enough
         All negative studies must include a power analysis
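A normal-approximation sketch of power for comparing two group means (all numbers hypothetical):

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_two_means(delta, sd, n_per_group, z_crit=1.96):
    """Approximate power to detect a true mean difference delta
    (two-sided alpha = 0.05) with n subjects per group."""
    se = sd * math.sqrt(2 / n_per_group)
    return phi(abs(delta) / se - z_crit)

print(round(power_two_means(5, 10, 64), 2))   # ≈ 0.81 — meets the 0.8 rule of thumb
print(round(power_two_means(5, 10, 16), 2))   # a quarter of the sample: badly underpowered
```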
                      Data types

           Positions in a race      Ranked data A, B, C, D where difference
Ordinal                             doesn't matter. If 2 people come equal second
                                    they'll be counted as 2.5 in rank
           Temperature in Celsius   Quantitative with arbitrary zero and (equal)
Interval                            steps. 100 ≠ 2x50 in amount.

           Temperature in Kelvin    Starts at zero with equal steps
Ratio                               100=2x50 in amount

           Blood type               Mutually exclusive, unordered categories,
Nominal                             must be exhaustive
                Stats: Bias*
Any systematic error that results in an incorrect
estimate of the association between exposure &
outcome: introduced by the researcher and is
usu. a product of study design
S            Selection
P            Performance
A            Attrition: drop-outs/Migration*
R            Recall
D            Detection
I            Insensitive measure
C            Compliance
                Worked Example
                       Gold        Standard

                         +              -
   Screening +           90             5
                          a             b
       test       -      30            60
                          c             d
What are sensitivity and specificity of the test?
• Sensitivity: 90/(90+30) = 0.75
• Specificity: 60/(60+5) = 0.92
• What is the PPV of the test for this sample?
• PPV = 90/(90+5) = 0.95
                      Example 2
                       Gold      Standard

                        +             -
    Screening +         90            5
                         a            b
        test      -     30           60
                         c            d
What is likelihood ratio of a positive test?
• 0.75/(1-0.92) = 0.75/0.08 = 9.375
• What is likelihood ratio of a negative test?
• (1−sensitivity)/specificity
• (1−0.75)/0.92 = 0.25/0.92 = 0.27
                           Example 3
What are the pre-test odds of the condition?
• Pre-test odds = (90+30)/(5+60) = 1.85
• Prevalence = (90+30) / (90+30+5+60) = 0.65
    – So pre-test odds = 0.65 / (1-0.65) = 1.85
• This data is from an in-patient sample

                             Gold          Standard
                              +                -
    Screening +               90               5
                               a               b
         test          -      30              60
                               c               d
               Example 4

• In a community sample, prevalence is 10%
• What is the PPV in the community of the test?
• Was 0.95 in IP sample, where prevalence was 0.65
                   Example 5

• In a community sample, prevalence is 10%
• What is the PPV in the community of the test?
•   Post-test odds = pre-test odds * LR
•   Pre-test odds = ratio with disease to without
•                = 0.1 / 0.9 = 0.111
•   Post-test odds = 0.111 * 9.375 = 1.042
•   Post-test probability = post-TO / (postTO+1)
•   = 1.042 / 2.042 = 0.51
•   PPV = 0.51
•   If low prevalence, PPV of sensitive test is low
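The whole chain in Examples 1–5 can be reproduced in a few lines (using the rounded sensitivity/specificity from the slides):

```python
sens, spec = 0.75, 0.92          # rounded values from the worked example
lr_pos = sens / (1 - spec)       # ≈ 9.375

def ppv_from_prevalence(prev, lr):
    """Convert prevalence -> pre-test odds -> post-test odds -> probability."""
    pre_odds = prev / (1 - prev)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

print(round(ppv_from_prevalence(0.65, lr_pos), 2))   # in-patient sample: 0.95
print(round(ppv_from_prevalence(0.10, lr_pos), 2))   # community sample: 0.51
```

The same positive result is far less informative at 10% prevalence, which is the whole point of Example 5.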
                         Question 6
                            Gold        Standard
                             +              -
      Screening +            90            10
                              a             b
          test       -      110           190
                              c             d
•   What are sensitivity and specificity of the test?
•   Sensitivity: 90/(90+110) = 0.45      (2 marks)
•   Specificity: 190/(190+10) = 0.95 (2 marks)
•   What is the PPV of the test for this sample?
•   PPV = 90/(90+10) = 0.90               (2 marks)
                            Question 6b
                              Gold       Standard

                                +            -
      Screening +              90           10
                                a            b
           test         -      110          190
                                c            d
What is likelihood ratio of a positive test?
• 0.45/(1-0.95) = 0.45/0.05 = 9         (3 marks)
• What is likelihood ratio of a negative test?
• (1−sensitivity)/specificity
• (1−0.45)/0.95 = 0.55/0.95 = 0.58       (3 marks)
                        Question 6c
                          Gold              Standard

                            +                   -
   Screening +             90                  10
                            a                  b
       test         -      110                190
                            c                  d
• What are the pre-test odds of the condition?
• Pre-test odds = (90+110)/(10+190) = 1 (2 marks)
• Prevalence = (90+110) / (90+110+10+190) = 0.5
   – So pre-test odds = 0.5 / (1-0.5) = 1
• This data is from an in-patient sample
              Question 6d

• In a MFE OP sample, prevalence is 25%
• What is the PPV in MFE OPs of the test?
                     Question 6e

• In an MFE OP sample, prevalence is 25%
• What is the PPV in MFE OPs of the test?
•   Post-test odds = pre-test odds * LR        (2)
•   Pre-test odds = ratio with disease to without
•                 = 0.25 / 0.75 = 1/3          (2)
•   Post-test odds = 1/3 * 9 = 3                (1)
•   Post-test probability = post-TO / (postTO+1) (2)
•   = 3 / 4 = 0.75                             (1)
•   PPV = 0.75                                  (2)
         Selection Bias: types*
• Prevalence: cancer as outcome kills some of the
  cases (i.e. ≈ Neyman bias)
• Admission rate: hospitalised patients may over-
  represent risk and outcome
• Volunteer/non-responder
• Membership: characteristics that lead to group
  membership influence outcome
• Procedure selection: patient characteristics
  influence selection of treatment
         Procedure bias: types
• Performance: cases and controls treated differently
• Recall
• Insensitive-measure: instrument not
  accurate enough to measure change
• Compliance: one treatment may be more
  palatable than the other
• Detection
           Stats: odds and ends
Reliability measures:
    – Kappa statistic for NOMINAL
    – Correlation coefficient for NUMERICAL
Variability source:
    – Data e.g. BP taken different hours
    – Examiner e.g. different take on schizophrenia
    – Instrument e.g. patient rated pain scale
    – multiple significance tests
    – multiple analyses of same data
“Fishing expedition” is where data is produced first,
then mined for significant bits which are then reported
as though hypothesised in advance
          Stats: odds and ends
Improving credibility of the control group:
   – Make them as similar in selection & treatment as poss.
   – ASK them afterwards if they thought they were in the
     tx or control group
  Multiple comparisons between
  groups: multivariate methods*
Definition To compare ≥ 2 dependent outcomes or
           variables simultaneously with adjustment
           for multiple confounding variables
           (covariates). When there is more than
           one dependent variable, it is
           inappropriate to do a series of univariate tests
Types      MANOVA
           Hotelling's T2-test
           Factor / cluster analysis
           Principal component analysis
           Canonical correlation
    Increasing statistical power
• Increase the sample size.
• Decrease the variance

  – scales with more accuracy or sensitivity.
  – increasing the homogeneity within groups.
  – stricter criteria for inclusion in each group (e.g.
    higher requirement for caseness).
  – controlling for confounding variables, e.g. by using
    analysis of covariance.
                                       Correlation and regression

• Bivariate (two variable) data should always be
   first examined by a scatterplot
• A scatterplot can give an idea of whether two
       variables are related
• Correlation is used to assess the strength of the
   relationship between two variables
• Regression is used to describe the relationship
   between two variables
    Correlation coefficient: the r
Correlation                     Pearson's r: for linear correlation
 1 = perfect +ve cor            Spearman's: for any (monotonic) correlation
−1 = perfect −ve cor
 0 = no cor

P value        Tests whether r is truly different from 0, BUT
               doesn't indicate a strong or important correlation

Spurious       Subgroups
correlations   Outliers
occur          Change of direction (u or n curve)
because of

• Simple Linear Regression describes the
   relationship between two variables in more detail
   than correlation

• It fits a straight line through a set of data:

  (figure: scatterplot with line of best fit)

                                           Model approach to regression

• The line of best fit is represented by the following

                  Y = a + bX

• The intercept a and the slope b for the population of
   interest are unknown, and therefore must be
   estimated from sample data

• The equation is the regression model, and a and b
   are known as model parameters
                                                       ANOVA approach

Analysis of Variance (ANOVA) approach:

• We partition the variability of a response variable into:

    (1) Variation explained by the linear model

    (2) Variation unexplained by the model (residual)

• If the proportion of variation explained by the model is
    larger than the proportion unexplained, then the linear
    model is appropriate

• F test: Variance Ratio Test: between/within group

If the scatterplot reveals that a straight line is not appropriate:

• Non-linear regression
    - Transform the response (e.g. logarithm, square root)
• Polynomial regression
    - Fit a quadratic (or cubic etc.) line through the data
• Multiple regression
    - Adjust for effects of > 1 independent variable

• Usually either correlation or regression is a
   suitable approach for assessing association in
   (numerical) bivariate data

• Correlation coefficients assess strength of
   (linear) association

• Regression allows description and prediction
   of a response variable

• Both methods have possible misuses
Reviews and Meta-analyses
         Qualitative research
• Examines subjective experience, behaviour,
  meaning, intersubjective interaction without
  using statistical methods or quantification
• Interpretive methods: observe & understand
• Critical methods: examine origins with a
  view to effecting practical transformation in society
            The Funnel Plot
• A scatter graph
• Treatment effect on horizontal axis
• Measure of study size on vertical axis
• Precision of study increases with sample size
• …so middle of plot likely to have the larger studies
  (funnel plot: 1/standard error on the vertical axis,
   effect size from −1 to +1 on the horizontal axis,
   studies scattered in an inverted funnel)
  (symmetrical funnel plot)

Should be seen if all studies were published
Would give effect size about 0.4
  (funnel plot with the bottom-left corner empty:
   small studies with small effects missing)

Often seen
  (funnel plot with the missing bottom-left studies shown)

Often seen
Shows smaller studies with less effect not used
Would give effect size about 0.6
• This publication bias may be obvious
  on eyeballing the funnel plot
• Can be evaluated by statistical methods
  – Regression methods
  – Rank correlation approach
  – Sensitivity low if less than 20 trials
• If funnel plot indicates severe publication
  bias, maybe you need to ignore the review!
• Can use 'trim and fill'
  – add proposed studies to make funnel plot symmetrical
Combining Data
• Need to compare comparative data
• Difficult, as different studies use different outcome measures
• So a good large RCT is better!
• Need an outcome measurement
• Eg. Effect size for primary outcome
• Eg Odds ratio of getting outcome of interest
  with/without exposure of interest
• Need an error estimate
  – confidence interval
           Draw a Nice Picture

• A forest plot/blobbogram
• x-axis: outcome measurement, eg OR
   – May have vertical line at no effect mark
      • OR 1/effect size 0
• Each study one after the other, on top of each other
• Central measurement: square or diamond
• Size of mark proportional to either:
   – number of subjects in study
   – precision of estimate (1/SE)

[Forest plot: each study on its own row with its estimate marked, and a
"Pooled effects (fixed)" diamond at the bottom; x-axis effect size, -1 to +1]
• Horizontal line each side shows the CI
• Can easily see how many studies have a
  'significant' result, as the CI won't cross the vertical no-effect line
• There will be diamonds at the bottom, showing the pooled estimates
   – fattest bit at the estimated value
   – horizontal limits are the CI (will be narrow, as many subjects are pooled)

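The pooled diamond at the bottom of a forest plot is commonly computed by inverse-variance weighting (the fixed-effect method); this sketch uses made-up study results.

```python
# Fixed-effect pooling by inverse-variance weighting (hypothetical data).
effects = [0.30, 0.45, 0.60, 0.50]   # effect size per study
ses     = [0.20, 0.15, 0.25, 0.10]   # standard error per study

weights = [1 / s**2 for s in ses]    # weight = 1/SE^2 (precision squared)
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5
ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
print(f"Pooled effect {pooled:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```

Note how the pooled SE is smaller than any single study's SE, which is why the diamond's CI is narrow.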
                Heterogeneity
• Studies will give different values
• Is this due to chance, or to different effects of
  treatment in the different studies?
• Do a hypothesis test
• Null hypothesis: the studies varied so much due to
  chance alone
• If P is low, suggests significant heterogeneity
• This is given as chi-squared, with a P value
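The usual chi-squared heterogeneity statistic is Cochran's Q, compared against a chi-squared distribution with k−1 degrees of freedom to get P (e.g. via scipy.stats.chi2.sf). This sketch uses hypothetical data and also reports I², a common summary of the proportion of variation beyond chance.

```python
# Cochran's Q heterogeneity test sketch (hypothetical study data).
effects = [0.30, 0.45, 0.60, 0.50]
ses     = [0.20, 0.15, 0.25, 0.10]

weights = [1 / s**2 for s in ses]
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
# Q: weighted squared deviations of each study from the pooled estimate
q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
df = len(effects) - 1                    # k - 1 degrees of freedom
i_squared = max(0.0, (q - df) / q)       # variation beyond chance
print(f"Q = {q:.2f} on {df} df, I^2 = {i_squared:.0%}")
```

Here Q is below its degrees of freedom, so there is no evidence of heterogeneity beyond chance (I² = 0).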
      Fixed vs Random Effects
• These may both be given for pooled effects!
• Fixed effects assumes no heterogeneity
• Random effects takes into account heterogeneity
   – may give more weight to outlying studies
   – may give weight to small studies, which may be more
     open to systematic bias
• Fixed effects has narrower confidence intervals
• Will be no difference if no heterogeneity
• If different, both should be published
• If both significant, gives you more
  confidence in the result
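The standard random-effects method (DerSimonian-Laird) estimates a between-study variance tau² and adds it to each study's own variance. That flattens the weights (small studies count relatively more) and widens the pooled CI, as described above. The numbers are invented, chosen so the studies genuinely disagree.

```python
# DerSimonian-Laird random-effects sketch (hypothetical, heterogeneous data).
effects = [0.10, 0.45, 0.80]
ses     = [0.10, 0.15, 0.12]

w = [1 / s**2 for s in ses]
fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
df = len(effects) - 1
c = sum(w) - sum(wi**2 for wi in w) / sum(w)
tau2 = max(0.0, (q - df) / c)              # between-study variance
w_re = [1 / (s**2 + tau2) for s in ses]    # random-effects weights
random_eff = sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)
se_fixed = (1 / sum(w)) ** 0.5
se_random = (1 / sum(w_re)) ** 0.5
print(f"fixed {fixed:.2f} (SE {se_fixed:.2f}), "
      f"random {random_eff:.2f} (SE {se_random:.2f})")
```

The random-effects SE comes out several times larger than the fixed-effect SE, which is exactly the "wider confidence intervals" point made above.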
• Look at actual pooled treatment effect
• Is it statistically significant?
• Is it clinically significant?
• Very large numbers could make a tiny
  difference statistically significant
• Look at effect size/pooled OR for clinical significance
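One way to judge clinical significance is to translate a pooled OR into an absolute risk reduction and number needed to treat, for an assumed baseline (control) event rate. Both inputs below are illustrative assumptions, not values from any real review.

```python
# Translating a pooled OR into clinically meaningful numbers (illustrative).
cer = 0.20          # assumed control event rate
or_pooled = 0.60    # assumed pooled OR (treatment reduces events)

# Convert the OR back to a treated event rate, then to ARR and NNT.
odds_treated = or_pooled * cer / (1 - cer)
eer = odds_treated / (1 + odds_treated)   # experimental event rate
arr = cer - eer                           # absolute risk reduction
nnt = 1 / arr                             # number needed to treat
print(f"ARR = {arr:.1%}, NNT = {nnt:.1f}")
```

An NNT of about 14 is far easier to weigh clinically than the abstract OR of 0.6.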
 Quantitative vs qualitative research

             Empirical          Interpretive        Critical
Paradigms    Positivism         Individual/         Myths/hidden
& origins    Natural sciences   collective          truths that limit
                                meaning             people
                                Hermeneutics        Marxist/feminist
Types        Observational      Ethnography         Participatory
             Experimental       Phenomenology       action
Privileged   Researcher         Participant         Stakeholder

Ethics       External                          Intrinsic
Diagnosis Studies
• What is the diagnostic significance of a particular
  symptom or sign?
• eg FRS in schizophrenia
• Do a cross-sectional study
• Like a screening study
• Diagnosis by 'gold standard'
• Symptom as 'screening test'
• Work out sensitivity, specificity, PPV, likelihood
  ratios as for screening studies
• Of course, PPV depends on the sample!
• So the sample must be chosen appropriately
  – You may be asked to comment on it
• Eg a random sample of patients suspected
  of having schizophrenia
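The screening-style calculations above can be sketched from a 2×2 table of symptom vs gold-standard diagnosis; all counts here are hypothetical.

```python
# Hypothetical 2x2: symptom ("screening test") vs gold-standard diagnosis.
tp, fp = 40, 20     # symptom present: true / false positives
fn, tn = 10, 130    # symptom absent:  false / true negatives

sensitivity = tp / (tp + fn)              # of those with the diagnosis
specificity = tn / (tn + fp)              # of those without it
ppv = tp / (tp + fp)                      # depends on the sample's prevalence!
lr_pos = sensitivity / (1 - specificity)  # positive likelihood ratio
lr_neg = (1 - sensitivity) / specificity  # negative likelihood ratio
print(f"sens {sensitivity:.2f}, spec {specificity:.2f}, "
      f"PPV {ppv:.2f}, LR+ {lr_pos:.1f}, LR- {lr_neg:.2f}")
```

Unlike sensitivity and specificity, the PPV changes if the same symptom is assessed in a sample with a different prevalence of the disorder, which is why the sampling frame matters.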
    Prognosis Studies

How to anticipate the likely course of a
        patient's illness
•   Inception cohort
•   Cohort of representative patients
•   Follow-up
•   See what long-term outcome is
•   eg what is chance of
    – death
    – independent living
    – future episode of depression…..
• Can be compared against control group
    – eg 1st episode MDD vs no psychiatric illness
• Sub-groups can be compared
    – eg melancholic vs non-melancholic MDD
        Is The Evidence Valid?

             1. Appropriate Sample?
– Representative of patients we're interested in
   – eg 1st diagnosis DSM-IV schizophrenia
• Clearly-defined
   – so we know who it applies to
• Early in the course of the disease
   – otherwise, some patients will have completed the course of the
     disease (eg died) before they're included in the study
          2. Sufficient Follow-Up?

• Long enough
  – if too short, not enough time to develop the outcomes of interest
• Complete enough
  – We don't know what happens to the subjects lost to F/U
  – Could be bias, as subjects may be more likely to drop out if very
    unwell (or very well)
  – <5% OK, <20% reasonable
  – Use sensitivity analysis
     • what if none or all drop-outs had the outcome?
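The none-or-all sensitivity analysis is just a best-case/worst-case recalculation of the outcome rate; the numbers below are made up.

```python
# Sensitivity analysis for loss to follow-up (hypothetical numbers).
n_enrolled = 200
n_followed = 170           # 30 lost to follow-up (15%)
events_observed = 51       # outcomes among those actually followed

observed_rate = events_observed / n_followed
# Best case: no drop-out had the outcome; worst case: all did.
best_case  = events_observed / n_enrolled
worst_case = (events_observed + (n_enrolled - n_followed)) / n_enrolled
print(f"observed {observed_rate:.1%}, "
      f"range {best_case:.1%} to {worst_case:.1%}")
```

If the study's conclusion survives across the whole best-to-worst range, the loss to follow-up probably doesn't matter; if not, the result is fragile.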
3. Appropriate Outcome Measurement?

• Must be objective
• Otherwise there could be bias
• Especially important if groups are compared
• Pre-defined objective criteria should be used
• Blind assessors if possible to prevent……
• Information bias
       4. Post-Hoc Comparisons?

• Beware of data dredging
• Lots of sub-group comparisons possible
• Are sub-groups pre-defined?
    Are The Results Important?
             Data Presentation
• Can be as rates with outcome at set-time
  – eg 60% recurrence at 5 years
• Can be as median time to outcome
  – eg 50% recur by 4 years
• Can be presented as survival curve
• Can present all 3
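The survival-curve presentation is built with the Kaplan-Meier estimator. This minimal sketch assumes distinct event/censoring times (no ties) and uses invented follow-up data, where event=1 means recurrence and event=0 means the subject was censored.

```python
# Minimal Kaplan-Meier sketch (hypothetical follow-up times in years).
# Assumes all times are distinct; each tuple is (time, event).
data = [(1, 1), (2, 0), (3, 1), (4, 1), (5, 0), (6, 1)]

at_risk = len(data)
survival = 1.0
curve = []                         # (time, survival probability) steps
for time, event in sorted(data):
    if event:                      # survival only drops at event times
        survival *= (at_risk - 1) / at_risk
        curve.append((time, survival))
    at_risk -= 1                   # censored subjects leave the risk set
print(curve)
```

Censored subjects shrink the risk set without dropping the curve, which is exactly what distinguishes a survival curve from a simple recurrence rate at a set time.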
           Precision of Estimates

• Not an exact answer, just an estimate based
  on the sample
• Use confidence intervals to tell us likely
  range of values in population
• Use hypothesis tests as described earlier to
  compare groups
• eg logrank test for survival curves
• eg chi-squared for recurrence rate
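A confidence interval for a recurrence rate can be obtained with the normal approximation to the binomial; the counts below are hypothetical.

```python
import math

# 95% CI for a recurrence rate (normal approximation; hypothetical data).
recurrences, n = 60, 100
p = recurrences / n
se = math.sqrt(p * (1 - p) / n)        # SE of a proportion
ci = (p - 1.96 * se, p + 1.96 * se)
print(f"Recurrence {p:.0%}, 95% CI {ci[0]:.1%} to {ci[1]:.1%}")
```

A larger sample shrinks the SE by the square root of n, narrowing the interval.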
            Clinical Significance

• Is the sample similar to our patients?
• What would the numbers actually mean to a
  patient?
• Easier than for abstract statistics like effect
  size or odds ratio!
           Prevalence Studies
• How common is an illness?
• Sampling crucial
• Sometimes two stages:
  – screen
  – clear diagnosis
• You are looking for how common an illness
  is in a population
• So ask the whole population
• ….or take a truly random sample
• ….who must be representative
• Sometimes a whole birth cohort is followed-up
           How Common?
• Incidence
• Number of new cases of illness per head of
  population in a set time period
• eg 100 cases per 100,000 population per year
• Prevalence
• Number of people who have a condition at a
  certain time
• Can be point prevalence
  – how many have diagnosis at the sampling point
• Can be period prevalence
  – how many have the diagnosis at any time over a
    set time period
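Incidence and prevalence differ only in numerator and framing; here is a toy example with made-up figures for a notional town.

```python
# Incidence vs point prevalence (hypothetical town of 100,000 people).
population = 100_000
new_cases_this_year = 100     # diagnoses made during the year -> incidence
cases_today = 800             # anyone with the diagnosis now -> prevalence

incidence = new_cases_this_year / population        # per person per year
point_prevalence = cases_today / population
print(f"Incidence: {incidence * 100_000:.0f} per 100,000 per year")
print(f"Point prevalence: {point_prevalence:.1%}")
```

For a chronic illness, prevalence can greatly exceed annual incidence because cases accumulate over time (roughly, prevalence ≈ incidence × average duration).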
              Must check for….

•   Are all potential subjects questioned?
•   If not, how many?
•   What efforts were made to find them?
•   How would those questioned differ from those not questioned?
    – If cases more likely to reply, overestimate prevalence
    – If cases less likely to reply, underestimate prevalence
• Can see how respondents and non-respondents at
  stage 2 differ by looking at their stage 1 scores
   Qualitative methodology Ax
Congruence       Did the method "fit" the question?
                 Was the study done as it was said it would be done?
Responsiveness   Was the design emergent: responsive to social context?

Appropriateness  Sampling & data collection "fit" the research Q

Adequacy         Details given re. method
                 Sufficient sources in sample/weighted correctly
                 Was it iterative?
Transparency     Provision of detail
                 Clear way of dealing with disagreements
                 Privileging participants
Qualitative data interpretation Ax
Permeability Researcher's conceptual & personal journey:
               preconceptions to further understanding
Reciprocity    Participants involved in method & write up?
Authenticity   Verbatim accounts
               Range of voices incl. dissenters
               Recognizable to people having that experience
Coherence      Findings fit data?
               How much of data used?
               Did different researchers agree?
Typicality     What claims about generalizability are made
 Qualitative research: Delphi tech

What for?   Getting a consensus statement from a panel
            e.g. for clinical treatment guidelines
Process     Several rounds of a mailed survey
            encourage deeper analysis through
            examining disagreements and reach
            closer to consensus with each round:
            •Ask a question (or several) of a panel
            •Collate and summarize the replies
            •Send summary out to start the next cycle.
            •Members may change original statement
            towards the emerging consensus or leave it
            the same & provide justifications.
Qualitative studies: grounded theory
Who dunnit?      Glaser and Strauss
How to do it?    1. Read (and re-read) a textual database (such
                    as a corpus of field notes)
                 2. "Discover" or label variables (called
                    categories, concepts and properties) and
                    their interrelationships.
What may affect  •one's reading of the literature
"theoretical     •one's use of techniques designed to
sensitivity*"?    enhance sensitivity
