# Introduction to resampling in MATLAB

Feb 5, 2009
Austin Hilliard
## So you've done an experiment...

Two independent datasets:

- control
- experimental

Each dataset contains n numbers, representing independent measurements.

Question: did the experimental manipulation have an effect, i.e. did it make a difference?
## The question restated

- Is there a significant (i.e. "real") numerical difference between the control and experimental datasets?
- What is the probability that the two datasets are subsets of two different/independent distributions?
- What is the probability that the two datasets are NOT subsets of the same underlying distribution?
## What do we need?

- One or more numbers to represent/summarize each dataset → data descriptors
- A test for comparing these numbers → test statistic
- A way to assess the significance of the test statistic → a distribution of test statistics generated from datasets that ARE from the same underlying distribution
## Descriptors and tests

- Visualize the data.
- Does it have a central tendency? (i.e. is the histogram vaguely mound-shaped?)
  - If yes, use the mean as the descriptor.
  - If no, the median may be better.
- Typical test: compute the difference (subtraction) between the descriptors.

We'll discuss how to assess the significance of your test statistic soon; that's the resampling part. But first, let's visualize some data.
## Simulate and inspect data

```matlab
ctrl = 10.*(5.75 + randn(25,1));    % 25 control measurements, mean ~57.5
exp  = 10.*(5 + randn(25,1));       % 25 experimental measurements, mean ~50
                                    % (note: 'exp' shadows the built-in exp())

boxplot([ctrl exp])
groups = {'control' 'experimental'};
boxplot([ctrl exp], 'labels', groups)
```
## Central tendency?

```matlab
figure; hist(ctrl); title('Control group')
figure; hist(exp);  title('Experimental group')   % new figure so the second
                                                  % plot doesn't overwrite
                                                  % the first
```
## Compute descriptors and test stat

Our data look pretty mound-shaped, so let's use the mean as the data descriptor, and take the absolute value of the difference as our test stat:

```matlab
ctrl_d = mean(ctrl);
exp_d  = mean(exp);

test_stat = abs(ctrl_d - exp_d);
```
## Assessing the test statistic

What we would really like to test: the probability that the two datasets are subsets of two different/independent distributions.

Problem: we don't know what those hypothetical independent distributions are; if we did, we wouldn't have to go through all of this!
Solution: compute the probability that the two datasets are NOT subsets of the same underlying distribution. How?

- Start by assuming the datasets are subsets of the same distribution → the null hypothesis
- See what the test statistic looks like when the null hypothesis is true
We need a distribution of the test statistic generated when the datasets really ARE from the same underlying distribution → the distribution of the test stat under the null hypothesis. Then we can quantify how (a)typical the value of our test stat is under the null hypothesis:

- if typical, our datasets are likely from the same distribution → no effect of the experiment
- if atypical, there is a good chance the datasets are from different distributions → the experiment had an effect
## Generate the null distribution using bootstrap

Basic procedure:

1. Combine the datasets.
2. Randomly sample (with replacement) from the combined dataset to create two pseudo datasets with the same n as the real datasets.
3. Compute descriptors and the test statistic for the pseudo datasets.
4. Repeat 10,000 times, keeping track of the pseudo test stat.
## Combine datasets and resample

```matlab
box = [ctrl; exp];                      % combine datasets into 'box'
                                        % (could use concat())
pseudo_stat_dist = zeros(1,10000);      % vector 'pseudo_stat_dist' to
                                        % store the results of resampling

% --- start resampling ---
for trials = 1:10000                    % resample 10,000 times
    pseudo_ctrl = ...                   % create pseudo groups the same
        sample(length(ctrl), box);      % size as the actual groups;
    pseudo_exp = ...                    % sample(n, v) draws n values
        sample(length(exp), box);       % from v with replacement

    pseudo_ctrl_d = ...                 % compute pseudo group
        mean(pseudo_ctrl);              % descriptors
    pseudo_exp_d = ...
        mean(pseudo_exp);

    pseudo_stat = ...                   % compute the pseudo test stat
        abs(pseudo_ctrl_d - pseudo_exp_d);
    pseudo_stat_dist(trials) = ...      % store the pseudo test stat to
        pseudo_stat;                    % build the null distribution
end
```
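For readers without MATLAB, or without the course's `sample()` helper, here is a minimal pure-Python sketch of the same two-sample bootstrap. The simulated data mirror the earlier `randn()` example, and all names and values are illustrative, not from the slides.

```python
import random
import statistics

random.seed(0)  # make the run reproducible

# Simulated data mirroring the earlier MATLAB example (illustrative only)
ctrl = [10 * (5.75 + random.gauss(0, 1)) for _ in range(25)]
expt = [10 * (5.00 + random.gauss(0, 1)) for _ in range(25)]

# actual test stat: absolute difference of group means
test_stat = abs(statistics.mean(ctrl) - statistics.mean(expt))

box = ctrl + expt                       # combine datasets into one 'box'
pseudo_stat_dist = []
for _ in range(10000):                  # resample 10,000 times
    # random.choices draws WITH replacement, like sample() in the slides
    pseudo_ctrl = random.choices(box, k=len(ctrl))
    pseudo_expt = random.choices(box, k=len(expt))
    pseudo_stat_dist.append(
        abs(statistics.mean(pseudo_ctrl) - statistics.mean(pseudo_expt)))

bigger = sum(s > test_stat for s in pseudo_stat_dist)
pval = bigger / len(pseudo_stat_dist)
print(pval)
```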
## Now what?

Count how many values of the pseudo test stat are greater than our actual test stat, then divide by the total number of data points in the null distribution.

This approximates the likelihood of getting our actual test stat (i.e. the likelihood of seeing a difference as big as we see) if our two datasets were truly from the same underlying distribution → the p-value.
## Compute p-val and visualize null distribution

```matlab
bigger = count(pseudo_stat_dist > test_stat);   % count pseudo test stats
                                                % bigger than the actual stat
pval = bigger / length(pseudo_stat_dist)        % divide by the total number
                                                % of pseudo test stats to
                                                % compute the p-value
                                                % (could use proportion())
hist(pseudo_stat_dist)                          % histogram of the pseudo
title('Pseudo test stat distribution')          % test stats
xlabel('Pseudo test stat')
ylabel('Frequency')
```
## How to interpret the p-value?

Assuming that the null hypothesis is true, the p-value is the likelihood that we would see a value of the test statistic greater than the value of our actual test statistic.

Restated: assuming both groups really are from the same underlying distribution, the p-value is the likelihood that we would see a difference between them as big as we do.
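To make the arithmetic concrete with made-up numbers: suppose 230 of the 10,000 pseudo test stats exceed the actual test stat.

```python
bigger = 230            # hypothetical count of pseudo stats > actual stat
total = 10000           # size of the null distribution
pval = bigger / total   # under the null hypothesis, a difference this big
print(pval)             # would appear about 2.3% of the time
```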
## “Paired” data

- What if your two datasets are not actually independent?
- Maybe you have a single group of subjects/birds/cells/etc. with measurements taken under different conditions.
- Important: the measurements within each condition must still be independent.
## Bootstrap to test paired data

- The basic strategy is the same, but...
- Instead of one descriptor per group, there are as many descriptors as data pairs:
  - compute the difference between the measurement under condition 1 and the measurement under condition 2
- The test statistic is the mean/median/etc. of the differences.
- The resampling procedure is slightly different.
## Visualize data

Investigate the distributions of the raw data, and also plot the distribution of the differences between paired data points.

Why? → The test statistic is computed from these differences, not from the raw datasets.
## Resampling procedure

- Randomly sample 1 or -1 and store in a vector.
- Multiply by the vector of differences → this randomizes the direction of change.
- Compute the test stat (mean, median, etc.) of the randomized differences.
- Repeat.
- Now we have the distribution of the test statistic when the direction of change is random, i.e. NOT dependent on the experimental conditions.
## Compute descriptors and test stat

```matlab
diffs = ctrl - exp;                  % change from the ctrl condition
                                     % to the exp condition

test_stat = abs(mean(diffs));        % compute the test statistic

pseudo_stat_dist = zeros(1,10000);   % vector to store the pseudo test
                                     % stats generated from resampling
```
## Resampling

```matlab
for trials = 1:10000                        % resample 10,000 times
    signs = sample(length(diffs), [-1 1])'; % random vector of 1's and -1's
                                            % the same size as diffs
    pseudo_diffs = signs .* diffs;          % multiply by the actual diffs to
                                            % randomize the direction of
                                            % each diff
    pseudo_stat = abs(mean(pseudo_diffs));  % compute the pseudo test stat
                                            % from the pseudo diffs
    pseudo_stat_dist(trials) = pseudo_stat; % store it to grow the null
                                            % distribution
end
```
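The same sign-flip loop can be sketched in pure Python; the paired differences below are made up purely for illustration.

```python
import random
import statistics

random.seed(0)  # make the run reproducible

# Hypothetical paired differences (ctrl - exp for each subject)
diffs = [6.1, 8.3, -1.2, 4.7, 9.0, 3.5, 7.8, -0.4, 5.6, 2.9]

test_stat = abs(statistics.mean(diffs))     # actual test stat

pseudo_stat_dist = []
for _ in range(10000):                      # resample 10,000 times
    # randomly flip the sign of each difference, as sample(n, [-1 1]) does
    pseudo_diffs = [random.choice((-1, 1)) * d for d in diffs]
    pseudo_stat_dist.append(abs(statistics.mean(pseudo_diffs)))

pval = sum(s > test_stat for s in pseudo_stat_dist) / len(pseudo_stat_dist)
print(pval)
```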
## Compute p-val and visualize null distribution

```matlab
bigger = count(pseudo_stat_dist > test_stat);   % count pseudo test stats
                                                % bigger than the actual stat
pval = bigger / length(pseudo_stat_dist)        % divide by the total number
                                                % of pseudo test stats to
                                                % compute the p-value
                                                % (could use proportion())
hist(pseudo_stat_dist)                          % histogram of the pseudo
title('Pseudo test stat distribution')          % test stats
xlabel('Pseudo test stat')
ylabel('Frequency')
```
## Full listing: bootstrap2groupsMEAN_PAIRED.m

```matlab
function [] = bootstrap2groupsMEAN_PAIRED (ctrl, exp)

% --- prepare for resampling ---

diffs = ctrl - exp;                    % change from the ctrl condition
                                       % to the exp condition

test_stat = abs(mean(diffs))           % compute the test statistic

pseudo_stat_dist = zeros(1,10000);     % vector to store the pseudo test
                                       % stats generated from resampling

% --- start resampling ---

for trials = 1:10000                        % resample 10,000 times

    signs = sample(length(diffs), [-1 1])'; % random vector of 1's and -1's
                                            % the same size as diffs

    pseudo_diffs = signs .* diffs;          % multiply by the actual diffs to
                                            % randomize the direction of diff

    pseudo_stat = abs(mean(pseudo_diffs));  % compute the pseudo test stat
                                            % from the pseudo diffs

    pseudo_stat_dist(trials) = pseudo_stat; % store it to grow the null
                                            % distribution

end

% --- end resampling ---

bigger = count(pseudo_stat_dist > test_stat);   % count pseudo test stats
                                                % bigger than the actual stat

pval = bigger / length(pseudo_stat_dist)        % divide by the total number
                                                % of pseudo test stats to
                                                % compute the p-value
                                                % (could use proportion())

hist(pseudo_stat_dist)                          % histogram of the pseudo
title('Pseudo test stat distribution')          % test stats
xlabel('Pseudo test stat')
ylabel('Frequency')
```
## Full listing: bootstrap2groupsMEANS.m

```matlab
function [] = bootstrap2groupsMEANS (ctrl, exp)
%
% function [] = bootstrap2groupsMEANS (ctrl, exp)
%
% 'ctrl' and 'exp' are assumed to be column vectors; this is
% necessary for the creation of 'box'
%

% --- prepare for resampling ---

ctrl_d = mean(ctrl);                   % compute descriptors
exp_d = mean(exp);                     % 'ctrl_d' and 'exp_d'

test_stat = abs(ctrl_d - exp_d)        % compute test statistic 'test_stat'

box = [ctrl; exp];                     % combine datasets into 'box'
                                       % (could use concat())

pseudo_stat_dist = zeros(1,10000);     % vector 'pseudo_stat_dist' to
                                       % store the results of resampling

% --- start resampling ---

for trials = 1:10000                   % resample 10,000 times

    pseudo_ctrl = ...                  % create pseudo groups the
        sample(length(ctrl), box);     % same size as the actual groups
    pseudo_exp = ...
        sample(length(exp), box);

    pseudo_ctrl_d = ...                % compute pseudo group
        mean(pseudo_ctrl);             % descriptors
    pseudo_exp_d = ...
        mean(pseudo_exp);

    pseudo_stat = ...                  % compute the pseudo test stat
        abs(pseudo_ctrl_d - pseudo_exp_d);
    pseudo_stat_dist(trials) = ...     % store it to build the null
        pseudo_stat;                   % distribution

end

% --- end resampling ---

bigger = count(pseudo_stat_dist > test_stat);   % count pseudo test stats
                                                % bigger than the actual stat

pval = bigger / length(pseudo_stat_dist)        % divide by the total number
                                                % of pseudo test stats to
                                                % compute the p-value
                                                % (could use proportion())

hist(pseudo_stat_dist)                          % histogram of the pseudo
title('Pseudo test stat distribution')          % test stats
xlabel('Pseudo test stat')
ylabel('Frequency')
```
Don't forget to look online for a more generalized version of 'bootstrap2groupsMEANS.m' called 'bootstrap2groups.m'.

It's more flexible (the descriptor method and the number of iterations can be passed as input), stores its output in a struct, plots the raw data, offers one-tailed tests if needed, and can be used for a demo (try running 'out = bootstrap2groups(@mean)').
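I can't reproduce 'bootstrap2groups.m' here, but its key design idea — passing the descriptor as a function handle — translates directly to other languages. Here is a hypothetical Python sketch with the descriptor and iteration count as arguments and the results returned in a dict (standing in for the output struct); none of these names come from the actual file.

```python
import random
import statistics

def bootstrap2groups(ctrl, expt, descriptor=statistics.mean, iterations=10000):
    """Two-sample bootstrap test; 'descriptor' plays the role of the
    @mean function handle in the MATLAB version (hypothetical sketch)."""
    test_stat = abs(descriptor(ctrl) - descriptor(expt))
    box = ctrl + expt                        # pooled data under the null
    null = []
    for _ in range(iterations):
        a = random.choices(box, k=len(ctrl)) # resample with replacement
        b = random.choices(box, k=len(expt))
        null.append(abs(descriptor(a) - descriptor(b)))
    pval = sum(s > test_stat for s in null) / iterations
    return {"test_stat": test_stat, "pval": pval, "null_dist": null}

random.seed(0)
out = bootstrap2groups([52.1, 58.3, 61.0, 55.4, 49.8],
                       [47.2, 44.9, 51.3, 43.8, 48.0],
                       descriptor=statistics.median)
```

Because the descriptor is a plain callable, swapping the mean for the median (or a trimmed mean) requires no change to the resampling machinery, which is exactly the flexibility the generalized MATLAB function advertises.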
## Possible topics for next week

- confidence intervals
- normality testing
- ANOVA (1- or 2-way)
- power
