Review of Probability and Statistics in Simulation

Review of Probability and Statistics in Simulation (2)

                         In this review
• Use of Probability and Statistics in Simulation
• Random Variables and Probability Distributions
• Discrete, Continuous, and Discrete and Continuous Random Variables
    - “Mixed” Distribution
• Expectation and Moments
• Covariance
• Sample Mean and Variance
• Data Collection and Analysis
• Parameter Estimation
• Properties of a “Good” Estimator
• Simulation data and output stochastic processes
• Two Types of Statistics in simulation output
• Distribution Estimation
• Confidence Intervals (CI)
• Run Length and Number of Replications
 Four Properties of a “Good” Estimator (1)
• Unbiasedness
  – An unbiased estimator has an expected value that
    is equal to the true value of the parameter being
    estimated, i.e.,
    E[estimator] = population parameter
  – For the mean:      $E[\bar{X}_I] = \mu$
  – For the variance:  $E[S_X^2] = \sigma^2$
  – But $E[S_X] \neq \sigma$: the square root of a sum of numbers is
    not usually equal to the sum of the square roots of
    those same numbers
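The difference between the unbiased $S_X^2$ and the biased $S_X$ can be checked empirically. The following is a minimal sketch, assuming normally distributed data; the function name `average_estimates` and all parameter values are illustrative choices, not taken from the slides.

```python
import random
import statistics


def average_estimates(sigma=2.0, sample_size=5, replications=20000, seed=1):
    """Average S^2 and S over many independent normal samples (illustrative)."""
    rng = random.Random(seed)
    var_sum = 0.0
    sd_sum = 0.0
    for _ in range(replications):
        sample = [rng.gauss(0.0, sigma) for _ in range(sample_size)]
        s2 = statistics.variance(sample)  # divides by (sample_size - 1)
        var_sum += s2
        sd_sum += s2 ** 0.5
    return var_sum / replications, sd_sum / replications


mean_s2, mean_s = average_estimates()
print(f"average S^2 = {mean_s2:.3f}   (sigma^2 = 4.0)")
print(f"average S   = {mean_s:.3f}   (sigma   = 2.0)")
```

With small samples the average of S should fall noticeably below sigma, while the average of S^2 should stay close to sigma^2.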
  Four Properties of a “Good” Estimator (2a)
• Efficiency
  – The most efficient estimator among a group of unbiased
    estimators is the one with the smallest variance
  – Ex: Distributions of three different estimators (1, 2, 3) based on
    samples of the same size
    [Figure: sampling distributions of estimators 1, 2, and 3; horizontal axis: Value of Estimator, with the Population Parameter marked]
  –   1 and 2: expected value = population parameter (unbiased)
  –   3: positively biased
  –   Variance decreases from 1, to 2, to 3 (3 is the smallest)
  –   Conclusion: 2 is the most efficient, because it has the smaller
      variance of the two unbiased estimators
   Four Properties of a “Good” Estimator (2b)
• Efficiency (continued)
  – Relative efficiency: since it is difficult to prove that an
    estimator is the best among all unbiased ones, compare two
    estimators directly:

      Relative Efficiency = (Variance of first estimator) / (Variance of second estimator)

  – Ex: Sample mean vs. sample median (normal population, large n)
    Variance of sample mean       = $\sigma^2/n$
    Variance of sample median     = $\pi\sigma^2/(2n)$
    Var[median] / Var[mean]       = $(\pi\sigma^2/(2n)) / (\sigma^2/n) = \pi/2 \approx 1.57$
  – Therefore, the variance of the sample median is about 1.57 times that
    of the sample mean, i.e., the median is the less efficient estimator
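The variance ratio of the median to the mean can be estimated by simulation. This is a rough sketch, assuming standard normal data; `relative_efficiency` and its default arguments are hypothetical names and values chosen for illustration.

```python
import random
import statistics


def relative_efficiency(sample_size=101, replications=4000, seed=7):
    """Estimate Var[sample median] / Var[sample mean] for normal samples."""
    rng = random.Random(seed)
    means = []
    medians = []
    for _ in range(replications):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.variance(medians) / statistics.variance(means)


ratio = relative_efficiency()
print(f"Var[median] / Var[mean] is roughly {ratio:.2f}")
```

For finite samples the ratio is expected to land near, but not exactly at, the large-sample value.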
Four Properties of a “Good” Estimator (3)

• Sufficiency
  – A necessary condition for efficiency
  – Should use all the information about the
    population parameter that the sample can
    provide, i.e., take into account each of the sample observations
  – Ex: Sample median is not a sufficient estimator
    because only the ranking of the observations is
    used and the distances between adjacent values are ignored
  Four Properties of a “Good” Estimator (4)

• Consistency
  – Should yield estimates that converge in probability to the
    population parameter being estimated when n (sample size)
    becomes larger
  – That is, as $n \to \infty$, the estimator becomes unbiased and the
    variance of the estimator approaches 0
  – Ex: X/n is an unbiased estimator of the population proportion p;
    because its variance vanishes as n grows, X/n is also a consistent estimator of p
    Variance:         $\mathrm{Var}[X/n] = (1/n^2)\,\mathrm{Var}[X] = (1/n^2)(npq) = pq/n$
                      (since X is binomially distributed)
                      As $n \to \infty$, $pq/n \to 0$
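The shrinking variance pq/n of the sample proportion can be observed numerically. A minimal sketch, assuming independent Bernoulli trials with p = 0.3 (an arbitrary illustrative value); the function name `proportion_variance` is hypothetical.

```python
import random
import statistics


def proportion_variance(p, n, replications=2000, seed=3):
    """Return (empirical variance of X/n, theoretical pq/n)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replications):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        estimates.append(successes / n)
    return statistics.variance(estimates), p * (1 - p) / n


for n in (10, 100, 1000):
    empirical, theoretical = proportion_variance(0.3, n)
    print(f"n = {n:5d}: empirical = {empirical:.6f}, pq/n = {theoretical:.6f}")
```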
              Two Types of Statistics
• Statistics based on observations (observational data)
   – Concerned with the value of each observation but not the
     time at which these observations are made
   – Collected on a given number of observations
   – Observation: Often an “entity” - any object of interest
   – Value to be observed: Duration of certain activities
     e.g., Customer (entity, one observation for each entity)
           Waiting time (value observed)
• Statistics on time-persistent variables (time-dependent)
   – Variables that have values defined over time (not at any single
     point in time)
   – Collected over a given period of time
     e.g., Number of customers waiting in line
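Formulas for Sample Mean and Sample Variance
                     Statistics based                                                  Statistics for time-
                     on observations                                                   persistent variables

  Sample mean        $\bar{X}_I = \frac{\sum_{i=1}^{I} x_i}{I}$                         $\bar{X}_T = \frac{\int_0^T x(t)\,dt}{T}$

  Sample variance    $S_X^2 = \frac{\sum_{i=1}^{I} x_i^2 - I\,\bar{X}_I^2}{I - 1}$       $S_X^2 = \frac{\int_0^T x^2(t)\,dt}{T} - \bar{X}_T^2$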

  • Another useful statistic: the coefficient of variation, $S_X/\bar{X}_I$
  • Formally, estimates that specify a single value (parameter)
    of the population are called point estimates, while
    estimates that specify a range of values are called interval
    estimates
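The two kinds of statistics on this slide can be computed as follows. This is a sketch under stated assumptions: the time-persistent variable x(t) is taken to be piecewise constant (as a queue length is), observation starts at time 0, and the names `observation_stats` and `time_persistent_stats` are illustrative.

```python
def observation_stats(values):
    """Sample mean and sample variance of observation-based data."""
    count = len(values)
    mean = sum(values) / count
    variance = (sum(v * v for v in values) - count * mean * mean) / (count - 1)
    return mean, variance


def time_persistent_stats(change_times, levels, end_time):
    """Time-weighted mean and variance of a piecewise-constant x(t).

    x(t) equals levels[k] from change_times[k] until the next change time
    (or end_time for the last level). Assumes change_times[0] == 0.
    """
    boundaries = list(change_times) + [end_time]
    area = 0.0      # integral of x(t) dt
    area_sq = 0.0   # integral of x(t)^2 dt
    for k, level in enumerate(levels):
        width = boundaries[k + 1] - boundaries[k]
        area += level * width
        area_sq += level * level * width
    mean = area / end_time
    variance = area_sq / end_time - mean * mean
    return mean, variance


# Hypothetical data: three customer waiting times, and a queue length that is
# 0 on [0, 2), 1 on [2, 5) and 2 on [5, 10).
print(observation_stats([2.0, 4.0, 6.0]))
print(time_persistent_stats([0.0, 2.0, 5.0], [0, 1, 2], 10.0))
```

Note that the time-persistent mean weights each level by how long it persisted, not by how many times it occurred.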
           Distribution Estimation
• Use collected data to identify (“fit”) the
  underlying distribution of the population
• Approach
  – Assume the data follow a particular statistical
    distribution - Hypothesis
  – Apply one or more goodness-of-fit tests to the
    sample data - Inference (see how parameters are estimated)
     • Commonly used tests: Chi-Square test and
       Kolmogorov-Smirnov test
  – Judge the outcome of the tests: does the distribution fit (under a
    specified level of statistical significance)?
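As an illustration of the hypothesize, test, judge approach, here is a small sketch of the Kolmogorov-Smirnov statistic. It assumes the hypothesized distribution is fully specified in advance (parameters not fitted from the same data) and uses the large-sample approximation 1.36/sqrt(n) for the 5% critical value; the exponential example and all names are illustrative.

```python
import math
import random


def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF of data and a hypothesized CDF."""
    ordered = sorted(data)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def exponential_cdf(rate):
    """Return the CDF of an exponential distribution with the given rate."""
    def cdf(x):
        return 1.0 - math.exp(-rate * x) if x > 0 else 0.0
    return cdf


rng = random.Random(42)
sample = [rng.expovariate(0.5) for _ in range(200)]  # hypothetical data
critical = 1.36 / math.sqrt(len(sample))             # approximate 5% level

d_true = ks_statistic(sample, exponential_cdf(0.5))
d_wrong = ks_statistic(sample, exponential_cdf(1.0))
print(f"critical value = {critical:.3f}")
print(f"D against rate 0.5 = {d_true:.3f}, reject: {d_true > critical}")
print(f"D against rate 1.0 = {d_wrong:.3f}, reject: {d_wrong > critical}")
```

The hypothesis is rejected when the statistic exceeds the critical value; failing to reject does not prove that the fit is correct.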
              Statistical Inference
• Variability of simulation outputs should be considered
• Confidence Interval (CI)
   – Point estimates: Single parameters
   – Interval estimates: A probability statement to
     specify the likelihood that the parameter being
     estimated falls within prescribed bounds
   – Simulation (to estimate the population mean $\mu$):
     By the Central Limit Theorem, the sample mean $\bar{X}_I$ is
     approximately normally distributed for sufficiently
     large I (independence is not a necessary condition
     for the CLT)
              Confidence Interval (CI)
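• Assume $\bar{X}_I$ is normally distributed; then the statistic
       $Z = (\bar{X}_I - \mu)/\sigma_{\bar{X}}$
  is a random variable that is normally distributed with a
  mean of zero and a standard deviation of one
   – $X \sim N(\mu, \sigma^2) \Rightarrow Z \sim N(0, 1)$, the standard normal distribution
   – $P[-Z_{\alpha/2} < Z < Z_{\alpha/2}] = 1 - \alpha$
     where $Z_{\alpha/2}$ is the value of Z such that the area to its right under
              the standard normal curve equals $\alpha/2$
   – $1 - \alpha$ is the “level of confidence”
     [Figure: standard normal curve centered at 0, with central area $1 - \alpha$ and an area of $\alpha/2$ in each tail]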
             Confidence Interval (CI)
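• So, we can assert with probability $1 - \alpha$ that:
      $\bar{X}_I - Z_{\alpha/2}\,\sigma_{\bar{X}} < \mu < \bar{X}_I + Z_{\alpha/2}\,\sigma_{\bar{X}}$
  that is, a proportion $1 - \alpha$ of confidence intervals based
  on I samples of X should contain (cover) the mean $\mu$
     [Figure: the interval from $\bar{X}_I - Z_{\alpha/2}\,\sigma_{\bar{X}}$ to $\bar{X}_I + Z_{\alpha/2}\,\sigma_{\bar{X}}$, centered at $\bar{X}_I$]
• Note:
   – $I \uparrow$, $1 - \alpha \uparrow$ ($\alpha \downarrow$)
     The bigger the sample size, the more confident, but the simulation runs longer
   – $\alpha \uparrow$ ($1 - \alpha \downarrow$), $I \downarrow$
     The less confident, the fewer simulation runs are required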
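The coverage interpretation of a confidence interval can be checked by repeated sampling. The sketch below assumes IID normal observations with a known standard deviation, so that the standard deviation of the mean is sigma divided by the square root of I; `z_interval`, `coverage` and the parameter values are illustrative.

```python
import random
from statistics import NormalDist, fmean


def z_interval(sample, sigma, alpha=0.05):
    """Confidence interval for the mean when sigma is known (IID sample)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / len(sample) ** 0.5
    centre = fmean(sample)
    return centre - half_width, centre + half_width


def coverage(mu=10.0, sigma=3.0, sample_size=20, alpha=0.05,
             replications=5000, seed=11):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replications):
        sample = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        low, high = z_interval(sample, sigma, alpha)
        hits += low < mu < high
    return hits / replications


estimated_coverage = coverage()
print(f"fraction of intervals covering mu: {estimated_coverage:.3f}")
```

The printed fraction should be close to 1 - alpha, here 0.95.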
                 Confidence Interval (CI)
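• The above formula assumes knowledge of the standard
  deviation of the mean, $\sigma_{\bar{X}}$, which is usually unknown
• If we use the sample standard deviation of the mean, $S_{\bar{X}}$, to
  estimate $\sigma_{\bar{X}}$, we can develop a similar relationship using the
  statistic $t = (\bar{X}_I - \mu)/S_{\bar{X}}$, where t is a random variable
  having a Student t-distribution with I - 1 degrees of freedom
• Hence a $1 - \alpha$ confidence interval for $\mu$ is:
        $\bar{X}_I - t_{\alpha/2}\,S_{\bar{X}} < \mu < \bar{X}_I + t_{\alpha/2}\,S_{\bar{X}}$
     [Figure: the interval from $\bar{X}_I - t_{\alpha/2}\,S_{\bar{X}}$ to $\bar{X}_I + t_{\alpha/2}\,S_{\bar{X}}$, centered at $\bar{X}_I$; the true $\mu$ is never known]
• If the samples $X_i$ are IID:
     $\sigma_{\bar{X}}^2 = \sigma_X^2 / I \;\Rightarrow\; \sigma_{\bar{X}} = \sigma_X / \sqrt{I}$
     and
     $S_{\bar{X}}^2 = S_X^2 / I \;\Rightarrow\; S_{\bar{X}} = S_X / \sqrt{I}$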
             Hypothesis Testing
• Establish Null Hypothesis H0
  – Basis for comparison (statistical inference)
  – No significant change is present
  – Simulation: base model (baseline) - “as is”
• Alternate Hypothesis H1 (or Ha)
  – Changes to the base model (deviation from the
    base model - can be one-sided or two-sided)
• Experiment
  – A systematic approach that uses test statistics to
    determine statistically whether H1 should be accepted
    or rejected
  – H0 is the status quo, so the burden of proof is on H1 -
    “Innocent until proven guilty”
                 Hypothesis Testing
• Ex:
  H0: the average waiting times using rule A and
       rule B are the same
  H1: the average waiting time using rule A is less
       than that using rule B - one-sided test
      (greater - one-sided; not the same - two-sided)
  – A two-scenario case
        • Two alternatives - Pairwise Comparison
  – More than two alternatives
        • A vs. B, B vs. C, C vs. A - Analysis of Variance
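For the two-scenario case, one common way to carry out a pairwise comparison is a paired t test on the differences between matched replications. The slides do not specify the test, so this is an assumed choice; the data, the function name, and the one-sided 5% critical value for 9 degrees of freedom (taken from a standard t table) are illustrative.

```python
import statistics


def paired_t_statistic(waits_a, waits_b):
    """t statistic for the mean of the paired differences (A minus B)."""
    differences = [a - b for a, b in zip(waits_a, waits_b)]
    count = len(differences)
    mean_diff = statistics.fmean(differences)
    std_error = statistics.stdev(differences) / count ** 0.5
    return mean_diff / std_error


# Hypothetical average waiting times from 10 paired replications.
rule_a = [4.1, 3.8, 4.4, 4.0, 3.9, 4.3, 4.2, 3.7, 4.0, 4.1]
rule_b = [4.6, 4.1, 4.5, 4.4, 4.3, 4.4, 4.8, 4.0, 4.2, 4.6]

T_CRITICAL = 1.833  # one-sided t_{0.05, 9} from a standard t table

t_stat = paired_t_statistic(rule_a, rule_b)
# H1: rule A waits less than rule B, so reject H0 for large negative t.
reject_h0 = t_stat < -T_CRITICAL
print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```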
                Two Types of Errors
                         The true situation may be:
                         H0 is True                          H0 is False
Accept H0 (Reject H1)    Correct Decision                    Incorrect Decision (Type II Error)
Reject H0 (Accept H1)    Incorrect Decision (Type I Error)   Correct Decision

• The probability of a Type I error is $\alpha$; the probability of a Type II error is $\beta$
   – $\alpha$ is the level of significance of the test
• Ex: A $1 - \alpha$ confidence interval for $\mu$ is
      $\bar{X}_I - t_{\alpha/2}\,S_{\bar{X}} < \mu < \bar{X}_I + t_{\alpha/2}\,S_{\bar{X}}$
 Some Statistical Problems in Simulation

• Initial Conditions (IC) & Data Truncation
  – Most simulations start with the system “empty
    and idle”
  – Need to “warm up” the system so that it reaches a
    steady state
  – Statistics of system performance are collected only
    after the warm-up period
  – How to determine the warm-up period: mostly empirically, or run a
    “long” period before truncating the statistics
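Data truncation amounts to discarding the observations collected during warm-up. The sketch below uses a single-server FIFO queue with exponential interarrival and service times as an assumed example model, generated with the Lindley recursion from an empty and idle start; the warm-up length of 2000 customers is an arbitrary illustrative choice.

```python
import random


def lindley_waits(num_customers, arrival_rate, service_rate, seed=5):
    """Queue waiting times for a single-server FIFO queue starting empty."""
    rng = random.Random(seed)
    waits = [0.0]  # the first customer finds the system empty and idle
    for _ in range(num_customers - 1):
        service = rng.expovariate(service_rate)
        interarrival = rng.expovariate(arrival_rate)
        # Next wait = previous wait + previous service - time until next arrival
        waits.append(max(0.0, waits[-1] + service - interarrival))
    return waits


def truncated_mean(observations, warmup):
    """Mean of the observations after dropping the first `warmup` values."""
    kept = observations[warmup:]
    return sum(kept) / len(kept)


waits = lindley_waits(20000, arrival_rate=0.8, service_rate=1.0)
print(f"mean wait, all data:          {truncated_mean(waits, 0):.3f}")
print(f"mean wait, first 2000 dropped: {truncated_mean(waits, 2000):.3f}")
```

In practice the warm-up length would be chosen by inspecting the output, for example a plot of a moving average of the waiting times.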
 Run Length and Number of Replications

• Deciding on the trade-off
• A few long runs
  – Better estimate of the steady-state mean
    because there is less initial bias
  – But variance may increase due to a reduced
    sample size
• Many short runs
  – May have bias due to starting conditions
  – But variance may decrease
   Run Length and Number of Replications
• How long to run
  – A given time period
     • Convenient, but sample sizes may vary
     • Statistics on observations
  – A given number of entities that enter the system
     • System ends “empty and idle”
     • Statistics on time-persistent variables
  – A given number of entities that depart the system
     • System not ending “empty and idle”
     • Useful especially when routing is complex, e.g., rework
  – Automatic stopping rules
     • Simulation results (statistics collected) monitored closely
     • Stop the simulation once a prescribed criterion (often accuracy)
       is satisfied
     • An implementation - the batch means method
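The batch means idea is to cut one long run into consecutive batches and treat the batch averages as approximately independent observations. A minimal sketch, assuming equal-sized batches with any leftover observations at the end discarded; the function names are illustrative.

```python
import statistics


def batch_means(observations, num_batches):
    """Split one long run into equal batches and return each batch's mean."""
    batch_size = len(observations) // num_batches
    if batch_size == 0:
        raise ValueError("not enough observations for the requested batches")
    return [
        statistics.fmean(observations[k * batch_size:(k + 1) * batch_size])
        for k in range(num_batches)
    ]


def batch_means_estimate(observations, num_batches):
    """Grand mean and its standard error, computed from the batch means."""
    means = batch_means(observations, num_batches)
    grand_mean = statistics.fmean(means)
    std_error = statistics.stdev(means) / num_batches ** 0.5
    return grand_mean, std_error


print(batch_means([1, 2, 3, 4, 5, 6, 7], 3))
print(batch_means_estimate([1, 2, 3, 4, 5, 6, 7], 3))
```

An automatic stopping rule could extend the run until the standard error times the appropriate t value falls below a target half-width.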
                Number of Replications
• When estimating the variance of an output variable X
  by the replication method
   – $X \sim N(\mu, \sigma^2)$
   – The number of independent replications required to attain a
     specified confidence interval for X is given by
              $I \geq \left( \frac{t_{\alpha/2,\,I-1}\, S_X}{g} \right)^2$
     where g is the half-width of the desired CI
               g - how accurate; $\alpha$ - how confident; I - variable
   – Implementation of the formula is iterative, because I must
     first be assumed (a few runs, say 5 or 8) to obtain initial
     values of t and $S_X$
   – Then test the sufficiency of the initial assumption and
     determine the additional number of replications
          Number of Replications

• Practical use
  1. Select (arbitrarily) a few runs - initial I
  2. Compute $S_X$
  3. If $I < \left( \frac{t_{\alpha/2,\,I-1}\, S_X}{g} \right)^2$,
     then make additional runs and go to step 2 with an
     updated I; otherwise stop
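The three steps above can be sketched as a loop. This is an illustration under stated assumptions: a 95% confidence level, t values from a standard table for up to 30 degrees of freedom with the normal value 1.96 used as an approximation beyond that, and one additional replication per pass. `run_replication` stands for any hypothetical function that returns one replication's output.

```python
import math
import random
import statistics

# t_{0.025, df} for df = 1..30 from a standard t table.
T_975 = [12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262,
         2.228, 2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101,
         2.093, 2.086, 2.080, 2.074, 2.069, 2.064, 2.060, 2.056, 2.052,
         2.048, 2.045, 2.042]


def t_critical_95(df):
    """Table lookup; falls back to the normal value 1.96 for df > 30."""
    return T_975[df - 1] if df <= len(T_975) else 1.96


def required_replications(results, half_width):
    """Right-hand side of the test in step 3, rounded up."""
    s_x = statistics.stdev(results)
    t_crit = t_critical_95(len(results) - 1)
    return math.ceil((t_crit * s_x / half_width) ** 2)


def run_until_precise(run_replication, half_width, initial=5, max_runs=1000):
    """Step 1: initial runs. Steps 2-3: add runs until I is large enough."""
    results = [run_replication() for _ in range(initial)]
    while len(results) < max_runs:
        if len(results) >= required_replications(results, half_width):
            break
        results.append(run_replication())
    return results


rng = random.Random(2)
results = run_until_precise(lambda: rng.gauss(10.0, 2.0), half_width=1.0)
print(f"replications used: {len(results)}")
```

Adding several replications per pass instead of one would reduce the number of times the test is evaluated.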
• Two key concepts
  – Confidence interval - the range $\bar{X}_I \pm g$
