# Central Limit Theorem Example Answers


FCS Example Problem Sheet with Answers: Central Limit Theorem

This is an example problem sheet with model answers at the end. It has an associated working m-file (a
matlab program, copied as an appendix at the end) that you can experiment with, observing the effects
of changing parameters. This is an example of the problem sheets and is presented as:

1. Theory
2. Steps to lead you through building the model and any background maths
3. Questions to investigate

Problem Sheet: Central Limit Theorem

Copy CentralLimitTheoryEg.m into your matlab directory and open it. This file demonstrates the central
limit theorem. Mathematical details of the theorem can be found at:
http://mathworld.wolfram.com/CentralLimitTheorem.html. Random variables, and probability in general, will
be covered in lectures. You do not need to know all the mathematical details to run the example, however.
The idea is to experiment with the model to see whether it matches theoretical predictions and how its
parameters interact and affect the results. You therefore only need the theoretical ideas which follow.

If you roll a die many times, you would expect each number to come up roughly the same number of
times (with equal frequency). That is, if you plotted the frequency with which each number occurred,
you would expect the graph to be roughly flat. Moreover, the more times you rolled the die, the flatter
you would expect the graph to look.

However, if you had 2 dice and added their scores, you would expect to get many more middling numbers
(6, 7, 8) than extreme numbers like 2 or 12. This is simply because there are many more ways to get a
score of 7 (6 ways: 1&6, 2&5, 3&4, 4&3, 5&2 or 6&1) than there are to get 12 (1 way: 6&6). This means that
if you rolled both dice many times and plotted the frequencies of each score (ie how many times each score
occurred), you would get a graph with higher frequencies in the middle, and lower ones at the edges. If you
were to use 3 or more dice, you would get even fewer extreme values and more middling ones, since it is
rarer to get all 1’s on 3 dice than it is to get all 1’s on 2. The question is, however, if one were to plot the
frequencies of the scores for 2, 10 or 20 dice, what would the graph look like? Is it more steeply humped in
the middle for 10 dice than for 2 dice? What happens at the extreme values?
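
The two-dice experiment above is easy to check by simulation. The following is an illustrative Python sketch (the problem sheet itself uses matlab; the sample size of 10000 is an arbitrary choice):

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so runs are repeatable

# Roll two dice many times and count how often each total occurs
rolls = 10000
freqs = Counter(random.randint(1, 6) + random.randint(1, 6)
                for _ in range(rolls))

# Middling totals should be far more frequent than the extremes
for score in range(2, 13):
    print(score, freqs[score])
```

A total of 7 should come up roughly six times as often as a total of 2 or 12, matching the counting argument above.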

The Central Limit Theorem begins to answer these questions. It says that if you have enough random
variables (dice, in this case), their mean (average) score will approach being normally distributed (that is,
the graph will start to look bell-shaped), with a mean equal to the mean of the random variables (the graph
will be centred on 3.5 in the dice case) and a variance (how widely spread the bell is) which depends on
the number of random variables used. The questions are, therefore: how many random variables are
enough, and how closely does the distribution of mean scores resemble a normal distribution?

We will investigate this by examining the frequencies of the means of different numbers of random
variables. We shall therefore perform the following steps (referenced with the relevant line numbers from
CentralLimitTheoryEg.m):

1. Generate NumVars random variables, ‘throwing’ them each NumEgs times. This is performed in
line 13 using rand. Each row of the matrix RandomVariables holds NumEgs instances (ie the throws)
of a different random variable. However, note that unlike in the dice example, where each score is an
integer between 1 and 6 (ie the scores are discrete), here each instance is a number between 0 and 1
(they are continuous).
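
For readers without matlab to hand, step 1 might be sketched in Python like this (NumVars=5 and NumEgs=1000 are just illustrative values, not fixed by the sheet):

```python
import random

NumVars, NumEgs = 5, 1000  # illustrative parameter values
random.seed(1)

# Each row holds NumEgs instances ('throws') of one random variable;
# each instance is a continuous value in [0, 1), cf. rand in the m-file
RandomVariables = [[random.random() for _ in range(NumEgs)]
                   for _ in range(NumVars)]
```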

2. Work out the mean score of each throw. The i’th column of RandomVariables holds the i’th
instance of all the random variables, so the i’th mean score is the mean of the i’th column. We can
therefore get the means using the function mean as in line 34. mean(x) works in a similar way to many
matlab functions (eg sum, std, var, diff etc. See help for details): if x is a vector (row or
column), it will return a single number, whereas if x is a matrix it will return the mean of each column as
a vector. As RandomVariables may be a matrix or a vector (if NumVars=1), we override the default
behaviour via mean(x,1), which forces column means (whereas mean(x,2) forces row means).
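
A Python sketch of the column-means step (self-contained, so it regenerates the data from step 1; the parameter values are again illustrative):

```python
import random

NumVars, NumEgs = 5, 1000  # illustrative values
random.seed(1)
RandomVariables = [[random.random() for _ in range(NumEgs)]
                   for _ in range(NumVars)]

# zip(*matrix) iterates over columns: the i'th column holds the i'th
# instance of every variable, so each column mean is one mean score
means = [sum(col) / NumVars for col in zip(*RandomVariables)]
```

Each of the NumEgs mean scores lies between 0 and 1, and their overall average should be close to 0.5.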
3. Plot a bar chart of frequencies of the average scores. Now that we have NumEgs average scores, we
need the frequency of each score (ie how many times it appears). However, as the random variables
are continuous (as discussed in 1), no two mean scores will have exactly the same value (so each has
frequency 1). We therefore have to divide the possible range of mean scores (from 0 to 1) into a
number (specified by NumBins) of equal-sized ‘bins’, eg 4 bins means: [0 - 0.25], [0.25 - 0.5], [0.5 -
0.75] and [0.75 - 1]. Each mean score is then assigned to a bin, and the number in each (its frequency)
recorded. This process is performed by the function hist. It can be called in many ways, but the most
convenient here is to pass in the mean scores and a vector specifying the centre of each bin (bin
centres set in line 18). This returns the frequencies (line 37), which can be used together with bar (or
stairs or other functions) to plot a bar chart (line 44).
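
The binning step can be imitated in Python as follows; like matlab's hist, each score is assigned to the nearest bin centre (NumBins=10 and the other parameter values are illustrative):

```python
import random

NumVars, NumEgs, NumBins = 5, 1000, 10  # illustrative values
random.seed(2)
means = [sum(random.random() for _ in range(NumVars)) / NumVars
         for _ in range(NumEgs)]

# Bin centres 0, 0.1, ..., 1, cf. BinCents in the m-file
BinCents = [i / NumBins for i in range(NumBins + 1)]

# Assign each mean score to the nearest centre and count the frequencies
Freqs = [0] * len(BinCents)
for m in means:
    nearest = min(range(len(BinCents)), key=lambda i: abs(BinCents[i] - m))
    Freqs[nearest] += 1
```

The central bins should collect far more scores than the edge bins, since the means cluster around 0.5.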

4. Compare a plot of the frequencies of the mean scores with a plot of the theoretical normal
distribution they should resemble. As we want to compare the frequencies with a (normal) probability
distribution, we need to turn the frequencies into a probability distribution. This is done by rescaling the
frequencies so that the sum of the areas of the bars is 1 (line 42) and plotting them (line 44). The plot is
then ‘held’ (line 50) and the normal distribution (bell-shaped curve) they should look like is drawn using
the subfunction PlotNormal (lines 69-80). Note that as the variance (spread of the curve) depends on
the number of random variables (NumVars), it has to be passed to the subfunction. Also, note how
point-by-point operations are used to plot the normal curve.
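
Putting steps 1-4 together in Python: rescale the frequencies into a density, and compute the theoretical normal (mean 0.5, variance 1/(12*NumVars), since a uniform variable on [0, 1] has mean 0.5 and variance 1/12). The parameter values are illustrative:

```python
import math
import random

NumVars, NumEgs, NumBins = 5, 5000, 10  # illustrative values
random.seed(3)
means = [sum(random.random() for _ in range(NumVars)) / NumVars
         for _ in range(NumEgs)]

BinCents = [i / NumBins for i in range(NumBins + 1)]
Freqs = [0] * len(BinCents)
for m in means:
    nearest = min(range(len(BinCents)), key=lambda i: abs(BinCents[i] - m))
    Freqs[nearest] += 1

# Rescale so the bars' areas sum to 1: divide by NumEgs and the bin width
Probs = [f / (NumEgs * (1 / NumBins)) for f in Freqs]

# The normal density theory predicts for the means
mu, sigSquared = 0.5, 1 / (12 * NumVars)
normal = [math.exp(-(c - mu) ** 2 / (2 * sigSquared))
          / math.sqrt(2 * math.pi * sigSquared) for c in BinCents]
```

With this many examples, the bar height at the centre bin should land close to the theoretical density there.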

Questions:
3 parameters affect the behaviour of the system: NumVars, the number of random variables; NumEgs, the
number of instances of each variable; and NumBins, the number of bins the mean scores are assigned to.
The following questions investigate their effect.

1. Start by examining the effect of NumVars. The program has been written so that a vector NumVars can
be entered, with a graph produced for each member of NumVars (remember close all closes all open
figure windows), eg:

>> CentralLimitTheoryEg([1:5 10],100,10)

What happens as the number of variables increases? Try NumVars = 20. What’s the lowest NumVars
that gives a good approximation to the normal distribution? [3 marks]

2. How is the behaviour affected by changing the number of instances, NumEgs? If you have 500
instances, do you need a smaller NumVars to get a good approximation to a normal distribution? What
happens if you only have 10? Can you explain this behaviour? [3 marks]

3. Now examine the effect of NumBins. Is its effect linked to the value of NumEgs? What if we have
more bins than there are instances (try NumBins = 200, NumEgs = 50)? What happens as you
decrease the number of bins? Is there an ‘optimal’ number of bins for NumEgs = 50? How does this
change as you change NumEgs? [4 marks]

Postscript: While the effect of increasing NumVars could be deduced from theory, the effect of the other
parameters is more easily seen by practical experimentation. Indeed, the interplay between these 2 factors
is an important issue in probability density estimation, where data distributions are inferred from a sample of
points (see Bishop, 95 pp. 49-59). The goal is to get as detailed a picture of the distribution as possible (ie
try to maximize NumBins) without introducing artifacts due to the scarcity of data. Thus NumBins can be
seen as a smoothing parameter (which you may come across in the Neural Networks course).

This is the style of practical experimentation I am aiming for on this course. Theoretical mathematical issues
will be introduced and then tested by building and experimenting with a model. In the course of this process,
design decisions will be made (normally parameters to be set) and experimented with, which can highlight
practical difficulties with certain techniques. It is important, however, to try to relate observations back to the
theory so one can explain why such-and-such ‘quirky’ behaviour occurred.

General Idea: The matlab problem sheets are meant to be investigative. I therefore expect you to answer
the questions and comment on any observations you make. In particular, I would like you to try to explain
any of the behaviours you observe, with reference to theory where possible. You can even back up your
hypotheses with further experimentation, with the questions hopefully guiding you towards this sort of
answer. As such, there are no ‘correct’ answers: some of you will observe certain phenomena and others
different ones, and you may have competing explanations for them.

Below, then, is a sample of the type of answer I would expect (though a bit long and over-complete), the
general theme being that I need enough evidence to show that you have both investigated the problem and
thought about the results produced. This evidence will be answers to the questions together with any
graphs/program output you deem necessary (pictures often make explanations simple and matlab figures
can be pasted into word docs easily, though they are not essential). It could also include a summary of any
discussion you had with peers and/or me/demonstrators when discussing/assessing the work. If you feel it
would be more productive to work in groups, please do so.

1. Start by examining the effect of NumVars. What happens as the number of variables increases?
Try NumVars =20. What’s the lowest NumVars that gives a good approximation to the normal
distribution?

[Figure: a 3-by-2 grid of bar charts showing the approximate pdf of the mean scores for NumVars = 1, 2,
3, 4, 5 and 10, each panel titled ‘Mean of N Variables’ with ‘Mean scores’ on the x-axis and ‘Approx pdf of
means’ on the y-axis.]

The graph shows a typical run of the program: as NumVars increases you get a better approximation to
the normal. Using 5 variables seems to give a reasonable approximation and it doesn’t improve much as
you go higher. However, it is a noisier result, and the graph for 5 variables changes more than that for 10
if the program is run a few times. One would expect this, as the smaller the number of variables, the more
randomness there is.
Another reason for this is seen by looking at the graph for 20 variables: it is quite a good approximation to
the normal, but it is difficult to see it exactly, as the vast majority of the data is concentrated in the central 3
bars – as you’d expect from theory – again reducing the visible randomness. I would expect this to
change as NumBins changes.

The combination of these effects can be seen most easily by comparing the graph for 20 variables
(relatively unchanged on different runs, and looking as we/theory would expect, ie normal) with the graph
for 1 variable (constantly changing, and not looking particularly flat or uniform, which it should, as it is the
uniform distribution).

2. How is the behaviour affected by changing the number of instances, NumEgs? If you have 500
instances, do you need a smaller NumVars to get a good approximation to a normal distribution? What
happens if you only have 10? Can you explain this behaviour?

As NumEgs increases, several things happen: the difference between runs of the program decreases (as
you’d expect from using more data), the noise which makes eg NumVars=5 look un-normal also
decreases, and it looks as if a lower NumVars is needed for a good approximation to a normal.
[Figure: a 3-by-2 grid of bar charts as before, this time with NumEgs=2000, for NumVars = 1, 2, 3, 4, 5
and 10.]

This is slightly misleading, though (and it was a bit of a trick question). While it is certainly true that
NumVars=5 looks more normal as it is less noisy, consider lower values: they still don’t look normal,
especially NumVars = 1 and 2. The graph here shows NumEgs=2000, so even going to very high
numbers will not make the bars normal. What is actually happening is that the graphs are becoming more
like their true underlying distributions. Thus NumVars=1 looks more uniform and NumVars=2 looks
triangular.

[NB I would not expect you all to get this point, as it’s very mathematical and a bit of a trick question.
You’d get full marks for noting it looks less noisy and more normal for higher NumEgs and saying
something about what happens when NumEgs = 10.]

Another point of interest from NumVars=1 is that the bins at the edges seem to be about 50% of what they
should be. This is an artefact of the process we are using. The bins are specified by their centres and thus
the bins at the edges are centred at 0 and 1 respectively (see line 18 in the program) and thus cover the
ranges [-infinity, 0.05] and [0.95, infinity]. As the mean values are between 0 and 1 only, they each cover
50% of the range of the other bins (ie [0, 0.05] and [0.95, 1]) which explains why they are 50% of the size.
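
The half-height edge bins are easy to reproduce; this Python sketch uses NumVars = 1 (so the ‘means’ are just uniform draws) with illustrative sample sizes:

```python
import random

NumEgs, NumBins = 20000, 10  # illustrative values
random.seed(4)
means = [random.random() for _ in range(NumEgs)]  # NumVars = 1

# Bins specified by their centres: the edge bins, centred at 0 and 1,
# only collect data from half the range of an interior bin
BinCents = [i / NumBins for i in range(NumBins + 1)]
Freqs = [0] * len(BinCents)
for m in means:
    nearest = min(range(len(BinCents)), key=lambda i: abs(BinCents[i] - m))
    Freqs[nearest] += 1

interior_avg = sum(Freqs[1:-1]) / (NumBins - 1)
# The edge bins should be roughly half the height of an interior bin
print(Freqs[0], Freqs[-1], interior_avg)
</imports>```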

For NumEgs = 10, you get very noisy behaviour. This is clearly because there is too little data. Some bins
are therefore missed out altogether and others have all the data in them. The situation appears worse, in
the sense of being more variable, for lower NumVars, but even higher NumVars look distinctly un-normal.

3. Now examine the effect of NumBins. Is its effect linked to the value of NumEgs? What if we have
more bins than there are instances (try NumBins = 200, NumEgs = 50)? What happens as you decrease
the number of bins? Is there an ‘optimal’ number of bins for NumEgs = 50? How does this change as
you change NumEgs?

The effect of NumBins is clearly linked to NumEgs: if you have too many bins for the number of egs, you
get behaviour similar to what was seen when NumEgs = 10: some bins have no data, others get too
much, the approximation to the normal is noisy, and the graphs change a lot between runs of the program.
This situation becomes more extreme if the number of bins is greater than NumEgs, as you are then
guaranteed to have some bins with no data in them. Note that this behaviour doesn’t seem as bad for high
NumVars (NumVars = 20, say), but this is simply because the data is concentrated in the centre and so we
are effectively using fewer bins. It does not address the underlying problem that we need much more data
than the number of bins covering the active range of the data (ie where the majority of the data lies).

Note that this behaviour gets exponentially worse the higher the dimensionality of the input data. Here we
have 1D data and need, say, NumEgs = 10 NumBins. If the data were 2D, we would need NumEgs =
10 NumBins^2 to get the same coverage. Similarly, for n-D data we would need NumEgs = 10 NumBins^n.
This exponential increase in the data needed is a facet of the Curse of Dimensionality, which affects many
machine learning methods such as neural network training. [Again, this is an extra point not needed for
full marks]
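
The rough arithmetic can be sketched directly (the factor of 10 examples per bin is the illustrative rule of thumb from the text, not a precise requirement):

```python
NumBins = 10  # bins per dimension

# The number of cells grows as NumBins**n for n-dimensional data, so the
# data needed (at, say, 10 examples per cell) grows exponentially with n
for n in (1, 2, 3, 5):
    needed = 10 * NumBins ** n
    print(f"{n}-D data: roughly {needed} examples needed")
```
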
From the above it would seem, therefore, that the best thing to do is to reduce NumBins as low as
possible. Setting NumBins = 1 or 2, though, means that we lose all sense of discrimination in the space,
as everything is in 1 or 2 bins. There is therefore a tension between NumBins and NumEgs. In practical
applications, what happens is that NumEgs is limited by circumstance and we make NumBins as large as
possible given this number. For NumEgs = 50, the best NumBins for NumVars = 5 seems to be around 8
(but it’s very subjective, since NumEgs = 50 is too few). If NumEgs increases, so does the optimal value
for NumBins. Clearly, because of the way the bin centres are specified, the optimal value will change as
NumVars increases, as fewer bins are in the active range of the data.

Appendix: CentralLimitTheoryEg.m
%   function CentralLimitTheoryEg(NumVars,NumEgs,NumBins)
%   NumVars holds the number of random variables that are to be averaged
%   It can be a vector so that we can see what happens for several values
%   NumEGs holds the number of instances (cf 'throws') of each random
%   variable. NumBins holds the number of bins in which we can place the
%   averaged random variables

function CentralLimitTheoryEg(NumVars,NumEgs,NumBins)

% Generate some random variables
% the dimensions of the data are determined by the maximum number of
% variables to be averaged and the number of instances of each variable
RandomVariables=rand(max(NumVars),NumEgs);

% Generate the centre of the bins that will hold the frequencies of the data
% As we know the data is between 0 and 1 we know the bins must be between
% 0 and 1. The increment is 1/NumBins, which generates NumBins+1 bin centres
BinCents=0:1/NumBins:1;

% Step through NumVars
for i=1:length(NumVars)
% As we may have several NumVars to examine, we can either have a
% separate figure for each one or we can use subplot. Initially try using
% figure then comment this out and use subplot as explained below
figure

%   subplot splits the axis into m by n mini-graphs and lets you plot in
%   each. Uncomment this line if you want to plot several graphs at once
%   Make sure that you have enough mini-axes to hold all the averages ie
%   m*n >= length(NumVars)
%   subplot(3,2,i)

% get the average of the 1 to NumVars(i) variables
means=mean(RandomVariables(1:NumVars(i),:),1);

% Get the frequencies of the data in the bins specified by BinCents
Freqs=hist(means,BinCents);

% Transform the frequencies into probabilities by making sure the sum
% of the area of the bins is 1 by dividing each frequency by the number
% of examples and the bin width
Probs=Freqs/(NumEgs*1/NumBins);

% Plot the probabilities as a bar chart in the current figure window
bar(BinCents,Probs);
% Or try the stairs function
%       stairs(BinCents,Probs);

% 'hold' the graph so that the next curve can be seen
hold on;

% Plot the normal curve that theory predicts we should get
PlotNormal(NumVars(i))

% release the graph so that when you next call the function the graph
% clears
hold off

% Label the axes
xlabel('Mean scores')
%      ylabel('Probability distribution of the means')
ylabel('Approx pdf of means')

% Put a title on the graph
title(['Mean of ' int2str(NumVars(i)) ' Variables'])
end

% Subfunction which plots the probability distribution which theory
% predicts as the distribution of a mean of N random variables
function PlotNormal(N)

% Specify the mean of the distribution. As each of our random variables has a
% mean of 0.5, the average mean is also 0.5
mu=0.5;

% Calculate the variance of the distribution. Each random variable has a
% variance of 1/12, and the variance of the mean of N variables is var(X)/N
% (don't worry if you don't understand this)
sigSquared=1/(12*N);

% generate a vector of points between 0 and 1
x=0:0.001:1;

% Calculate the corresponding points with the equation for a normal
NormVar=(1/sqrt(sigSquared*2*pi))*exp(-((x-mu).^2)/(2*sigSquared));

% plot the distribution in red
plot(x,NormVar,'r')
