# Review of Probability and Statistics in Simulation (2)


## In This Review

- Use of probability and statistics in simulation
- Random variables and probability distributions
- Discrete, continuous, and "mixed" (partly discrete, partly continuous) distributions
- Expectation and moments
- Covariance
- Sample mean and variance
- Data collection and analysis
- Parameter estimation
- Properties of a "good" estimator
- Simulation data and output stochastic processes
- Two types of statistics in simulation output
- Distribution estimation
- Confidence intervals (CI)
- Run length and number of replications
## Four Properties of a "Good" Estimator (1)

- Unbiasedness
  - An unbiased estimator has an expected value equal to the true value of the parameter being estimated, i.e., E[estimator] = population parameter.
  - For the mean: E[X̄_I] = μ; for the variance: E[S_x²] = σ².
  - But E[S_x] ≠ σ: the square root of a sum of numbers is not, in general, equal to the sum of the square roots of those numbers.
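As a quick numerical check (a sketch, not from the slides: the normal population, sample size, and replication count below are arbitrary choices), simulation confirms that the sample mean and the (I − 1)-divisor sample variance are unbiased, while E[S_x] falls short of σ:

```python
import random
import statistics

random.seed(1)
mu, sigma, n, reps = 10.0, 2.0, 5, 20000

means, variances, stdevs = [], [], []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.mean(sample))          # unbiased for mu
    variances.append(statistics.variance(sample))  # (n - 1) divisor: unbiased for sigma^2
    stdevs.append(statistics.stdev(sample))        # sqrt of the unbiased variance

print(round(sum(means) / reps, 2))      # close to mu = 10
print(round(sum(variances) / reps, 2))  # close to sigma^2 = 4
print(round(sum(stdevs) / reps, 2))     # noticeably below sigma = 2
```

The averaged S_x lands below σ because the square root is a concave function, which is exactly the point the slide makes.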
## Four Properties of a "Good" Estimator (2a)

- Efficiency
  - The most efficient estimator among a group of unbiased estimators is the one with the smallest variance.
  - Example: three estimators, 1, 2, and 3, based on samples of the same size. [Figure: the three estimators' sampling distributions plotted against the population parameter.]
    - 1 and 2: expected value = population parameter (unbiased)
    - 3: positively biased
    - Variance decreases from 1 to 2 to 3 (3 has the smallest variance)
    - Conclusion: 2 is the most efficient (smallest variance among the unbiased estimators)
## Four Properties of a "Good" Estimator (2b)

- Efficiency (continued)
  - Relative efficiency: since it is difficult to prove that an estimator is the best among all unbiased ones, use

        Relative efficiency = Variance of first estimator / Variance of second estimator

  - Example: sample mean vs. sample median (normal population):
    - Variance of sample mean = σ²/n
    - Variance of sample median ≈ πσ²/2n
    - Var[median] / Var[mean] = (πσ²/2n) / (σ²/n) = π/2 ≈ 1.57
  - Therefore, the sample median is about 1.57 times less efficient than the sample mean.
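The π/2 ratio can be verified by Monte Carlo (a sketch; the standard-normal population, sample size, and replication count are arbitrary):

```python
import math
import random
import statistics

random.seed(7)
n, reps = 25, 20000

means, medians = [], []
for _ in range(reps):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    means.append(statistics.mean(sample))
    medians.append(statistics.median(sample))

# Variance of each estimator across many samples of the same size
ratio = statistics.pvariance(medians) / statistics.pvariance(means)
print(round(ratio, 2), "vs", round(math.pi / 2, 2))  # ratio near pi/2
```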
## Four Properties of a "Good" Estimator (3)

- Sufficiency
  - A necessary condition for efficiency.
  - A sufficient estimator uses all the information about the population parameter that the sample can provide, taking each sample observation into account.
  - Example: the sample median is not a sufficient estimator, because it uses only the ranking of the observations and ignores the distances between adjacent values.
## Four Properties of a "Good" Estimator (4)

- Consistency
  - A consistent estimator yields estimates that converge in probability to the population parameter as the sample size n grows.
  - That is, as n → ∞, the estimator becomes unbiased and its variance approaches 0.
  - Example: X/n is an unbiased estimator of the population proportion p, and it is also consistent:
    - Variance: Var[X/n] = (1/n²) Var[X] = (1/n²)(npq) = pq/n (since X is binomially distributed, with q = 1 − p)
    - As n → ∞, pq/n → 0
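The shrinking variance pq/n can be observed directly (a sketch; p, the sample sizes, and the replication count are arbitrary):

```python
import random
import statistics

random.seed(3)
p, reps = 0.3, 5000
q = 1 - p

emp = []
for n in (10, 100, 1000):
    # X ~ Binomial(n, p); estimate p by X/n in each of `reps` replications
    props = [sum(random.random() < p for _ in range(n)) / n for _ in range(reps)]
    emp.append(statistics.pvariance(props))
    print(n, round(emp[-1], 5), round(p * q / n, 5))  # empirical vs. pq/n
```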
## Two Types of Statistics

- Statistics based on observations (observational data)
  - Concerned with the value of each observation, not the time at which the observations are made.
  - Collected over a given number of observations.
  - Observation: often an "entity" (any object of interest); value observed: e.g., the duration of certain activities.
  - Example: customer (entity; one observation per entity), waiting time (value observed).
- Statistics on time-persistent variables (time-dependent statistics)
  - Variables whose values are defined over time, not for any single observation.
  - Collected over a given period of time.
  - Example: number of customers waiting in line.
## Formulas for Sample Mean and Sample Variance

Statistics based on observations (I observations x_1, ..., x_I):

    Sample mean:      X̄_I = (1/I) Σ_{i=1}^{I} x_i
    Sample variance:  S_x² = Σ_{i=1}^{I} (x_i − X̄_I)² / (I − 1)

Statistics for time-persistent variables (x(t) observed over [0, T]):

    Sample mean:      X̄_T = (1/T) ∫₀ᵀ x(t) dt
    Sample variance:  S_x² = (1/T) ∫₀ᵀ x(t)² dt − X̄_T²

- Another useful statistic: the coefficient of variation, S_x / X̄_I.
- Formally, estimates that specify a single value (parameter) of the population are called point estimates, while estimates that specify a range of values are called interval estimates.
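The two sets of formulas can be written out directly (a sketch with hypothetical helper names; the waiting times and the piecewise-constant queue-length record are invented for illustration):

```python
def obs_stats(xs):
    """Sample mean and variance for observational data (I observations)."""
    I = len(xs)
    mean = sum(xs) / I
    var = sum((x - mean) ** 2 for x in xs) / (I - 1)
    return mean, var

def time_persistent_stats(times, values, T):
    """Sample mean and variance of a piecewise-constant variable:
    values[k] holds on [times[k], times[k+1]), the last value until T."""
    mean = mean_sq = 0.0
    for k, v in enumerate(values):
        end = times[k + 1] if k + 1 < len(times) else T
        dt = end - times[k]
        mean += v * dt / T         # (1/T) * integral of x(t) dt
        mean_sq += v * v * dt / T  # (1/T) * integral of x(t)^2 dt
    return mean, mean_sq - mean ** 2

# Waiting times of five customers (observational)
print(obs_stats([2.0, 4.0, 6.0, 8.0, 10.0]))  # (6.0, 10.0)

# Queue length over [0, 10): 0 on [0,2), 2 on [2,5), 1 on [5,10)
m, v = time_persistent_stats([0.0, 2.0, 5.0], [0, 2, 1], 10.0)
print(round(m, 2), round(v, 2))  # 1.1 0.49
```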
## Distribution Estimation

- Use collected data to identify ("fit") the underlying distribution of the population.
- Approach:
  - Assume the data follow a particular statistical distribution (the hypothesis), and estimate its parameters.
  - Apply one or more goodness-of-fit tests to the sample data (the inference step).
  - Commonly used tests: the chi-square test and the Kolmogorov-Smirnov test.
  - Judge the outcome of the tests: accept or reject the fit under a specified level of statistical significance.
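A minimal sketch of the chi-square test idea (the data, the Uniform(0, 1) hypothesis, the four bins, and the table value 7.815 for the 0.05-level chi-square with 3 degrees of freedom are all illustrative assumptions):

```python
import random

random.seed(5)
data = [random.random() for _ in range(400)]  # hypothesized: Uniform(0, 1)

k = 4                     # equal-probability bins
expected = len(data) / k  # 100 per bin under H0
observed = [0] * k
for x in data:
    observed[min(int(x * k), k - 1)] += 1

chi2 = sum((o - expected) ** 2 / expected for o in observed)
critical = 7.815  # chi-square table, alpha = 0.05, k - 1 = 3 degrees of freedom
print(round(chi2, 2))
print("reject the fit" if chi2 > critical else "do not reject the fit")
```

With parameters estimated from the data, the degrees of freedom would be reduced further (one per estimated parameter).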
## Statistical Inference

- The variability of simulation outputs should be taken into account.
- Confidence interval (CI):
  - Point estimates: single parameters.
  - Interval estimates: a probability statement specifying the likelihood that the parameter being estimated falls within prescribed bounds.
  - Simulation (to estimate the population mean μ): by the Central Limit Theorem, the sample mean X̄_I is approximately normally distributed for sufficiently large I (independence is not a necessary condition for the CLT).
## Confidence Interval (CI)

- Assume X̄_I is normally distributed. Then the statistic

      Z = (X̄_I − μ) / σ_X̄

  is a random variable that is normally distributed with mean zero and standard deviation one:
  - X ~ N(μ, σ²) → Z ~ N(0, 1), the standard normal distribution
  - P[−Z_{α/2} < Z < Z_{α/2}] = 1 − α, where Z_{α/2} is the value of Z such that the area to its right under the standard normal curve equals α/2
  - α is the "level of significance" [Figure: standard normal curve with central area 1 − α and a tail area of α/2 on each side]
## Confidence Interval (CI)

- So we can assert with probability 1 − α that

      X̄_I − Z_{α/2} σ_X̄ < μ < X̄_I + Z_{α/2} σ_X̄

  that is, a proportion 1 − α of confidence intervals based on I samples of X should contain (cover) the mean μ.
- Note:
  - A larger I permits higher confidence 1 − α (smaller α), but the simulation runs longer.
  - A larger α (smaller 1 − α) reduces the required I: less confidence, fewer required simulation runs.
## Confidence Interval (CI)

- The formula above assumes knowledge of the standard deviation of the mean, σ_X̄, which is usually unknown (and μ itself is never known!).
- If we use the sample standard deviation of the mean, S_X̄, to estimate σ_X̄, a similar relationship can be developed using the statistic

      t = (X̄_I − μ) / S_X̄

  where t is a random variable having a Student t-distribution with I − 1 degrees of freedom.
- Hence a 1 − α confidence interval for μ is

      X̄_I − t_{α/2} S_X̄ < μ < X̄_I + t_{α/2} S_X̄

- If the samples X_i are IID, then

      σ_X̄² = σ_X² / I    so  σ_X̄ = σ_X / √I
      S_X̄² = S_X² / I    so  S_X̄ = S_X / √I
## Hypothesis Testing

- Establish the null hypothesis H0:
  - The basis for comparison (statistical inference).
  - States that no significant change is present.
  - Simulation: the base model (baseline), "as is".
- Alternative hypothesis H1 (or Ha):
  - Changes to the base model (deviation from the base model; can be one-sided or two-sided).
- Experiment:
  - A systematic approach that uses test statistics to decide statistically whether H1 should be accepted or rejected.
  - H0 is the status quo, so the burden of proof is on H1: "innocent until proven guilty."
## Hypothesis Testing

- Example:
  - H0: the average waiting times under rule A and rule B are the same.
  - H1: the average waiting time under rule A is less than that under rule B, a one-sided test ("greater" is also one-sided; "not the same" is two-sided).
- Two alternatives (a two-scenario case): pairwise comparison.
- More than two alternatives (A vs. B, B vs. C, C vs. A): analysis of variance (ANOVA).
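The one-sided A-vs-B comparison can be sketched as a paired-t test (everything here is illustrative: the paired waiting-time averages are invented, and the one-sided critical value t_{0.05, 7} = 1.895 is from a t-table):

```python
import statistics

# Average waiting times from 8 paired replications: (rule A, rule B)
a = [4.2, 5.1, 3.8, 4.9, 5.4, 4.0, 4.6, 5.0]
b = [4.9, 5.6, 4.1, 5.5, 5.9, 4.7, 4.8, 5.6]

d = [ai - bi for ai, bi in zip(a, b)]  # paired differences A - B
n = len(d)
dbar = statistics.mean(d)
s_dbar = statistics.stdev(d) / n ** 0.5

t_stat = dbar / s_dbar
t_crit = -1.895  # one-sided t_{0.05, 7}; H1: mean waiting time of A < B
print(round(t_stat, 2))
print("reject H0" if t_stat < t_crit else "do not reject H0")
```

Pairing the runs (e.g., via common random numbers) reduces the variance of the differences, which is why the comparison is made pairwise rather than on two independent samples.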
## Two Types of Errors

The true situation may be:

|                       | H0 is True                        | H0 is False                        |
| --------------------- | --------------------------------- | ---------------------------------- |
| Accept H0 (Reject H1) | Correct decision                  | Incorrect decision (Type II error) |
| Reject H0 (Accept H1) | Incorrect decision (Type I error) | Correct decision                   |

- The probability α (β) of a Type I error (Type II error); α is the level of significance of the test.
- Example: a 1 − α confidence interval for μ is X̄_I − t_{α/2} S_X̄ < μ < X̄_I + t_{α/2} S_X̄.
## Some Statistical Problems in Simulation

- Initial conditions (IC) and data truncation
  - The system typically starts "empty and idle", so it needs a "warm-up" period to reach steady state.
  - Statistics on system performance are collected only after the warm-up period.
  - How to determine the warm-up period: mostly empirical, or use a "long" period before truncating the statistics.
## Run Length and Number of Replications

- A few long runs:
  - Better estimate of the steady-state mean, because there is less initial bias.
  - But the variance may increase due to the reduced sample size.
- Many short runs:
  - May be biased due to starting conditions.
  - But the variance may decrease.
## Run Length and Number of Replications

- How long to run?
  - A given time period:
    - Convenient, but sample sizes may vary.
    - Suits statistics on observations.
  - A given number of entities entering the system:
    - The system ends "empty and idle".
    - Suits statistics on time-persistent variables.
  - A given number of entities departing the system:
    - The system does not end "empty and idle".
    - Useful especially when routing is complex, e.g., with rework.
  - Automatic stopping rules:
    - Simulation results (collected statistics) are monitored closely (periodically).
    - Stop the simulation once a prescribed criterion (often accuracy) is satisfied.
    - One implementation: the batch-means method.
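The batch-means idea can be sketched as follows (assumptions: the autocorrelated AR(1)-style output stream stands in for real simulation data, the batch size of 500 is arbitrary, and t_{0.025, 19} = 2.093 is from a t-table):

```python
import random
import statistics

random.seed(11)

# One long run of autocorrelated output (mimics steady-state simulation data)
x, prev = [], 0.0
for _ in range(10000):
    prev = 0.8 * prev + random.gauss(0.0, 1.0)
    x.append(5.0 + prev)  # long-run mean is 5.0 by construction

b = 500                   # batch size
batch_means = [statistics.mean(x[i:i + b]) for i in range(0, len(x), b)]
k = len(batch_means)      # 20 batch means, approximately independent

grand = statistics.mean(batch_means)
half = 2.093 * statistics.stdev(batch_means) / k ** 0.5  # t_{0.025, 19}
print(f"{grand:.2f} +/- {half:.2f}")
```

Batching works because batch means are much less correlated than the raw observations, so the usual IID confidence-interval formula becomes approximately valid.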
## Number of Replications

- When estimating the variance of an output variable X by the replication method, with X ~ N(μ, σ²):
  - The number of independent replications I required to attain a specified confidence interval for X is given by

        I ≥ ( t_{α/2, I−1} S_X / g )²

    where g is the half-width of the desired CI (g sets the accuracy, α the confidence, and I is the variable to solve for).
  - Implementation of the formula is iterative, because I must first be assumed (a few runs, say 5 or 8) to obtain initial values of t and S_X.
  - Then test the sufficiency of the initial assumption and determine the number of additional replications.
## Number of Replications

- Practical use:
  1. Select (arbitrarily) a few runs: the initial I.
  2. Compute S_X.
  3. If I < ( t_{α/2, I−1} S_X / g )², make additional runs and go to step 2 with the updated I; otherwise stop.
- Two key concepts: the confidence (set by α) and the accuracy (set by the CI half-width g).
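The three steps above can be turned into a loop (a sketch: `run_once` is a hypothetical stand-in for a real replication, g is an arbitrary accuracy target, and the partial t-table is for illustration only):

```python
import random
import statistics

random.seed(2)

def run_once():
    """One simulation replication (hypothetical stand-in: a noisy output)."""
    return 20.0 + random.gauss(0.0, 3.0)

# t_{0.025, df} for selected df, from a t-table (illustration, not complete)
T_TABLE = {4: 2.776, 5: 2.571, 6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262,
           14: 2.145, 19: 2.093, 29: 2.045}

def t_crit(df):
    # use the nearest listed df at or below (good enough for a sketch)
    keys = [key for key in sorted(T_TABLE) if key <= df]
    return T_TABLE[keys[-1]] if keys else 2.776

g = 1.0                             # desired CI half-width (accuracy)
x = [run_once() for _ in range(5)]  # step 1: a few initial runs (I = 5)

while True:
    I = len(x)
    s = statistics.stdev(x)                # step 2: compute S_X
    needed = (t_crit(I - 1) * s / g) ** 2  # step 3: required I
    if I >= needed:
        break
    x.append(run_once())                   # one more run, back to step 2

print(len(x))
```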
