# Z - TU Delft CiTG

```					 Review of statistics and frequency analysis
Associate Professor: Dr. P.H.A.J.M. van Gelder
TU Delft, Faculty of Civil Engineering and Geosciences
UNESCO-IHE, Guest Lecturer

Year 2009 Lectures on:
October 27th 8.45 – 10.30h
October 28th 15.45 – 17.30h
November 5th 15.45 – 17.30h
November 6th 8.45 – 10.30h

A selection of these slides will be
presented during the course

contact for students
Dr.ir. P.H.A.J.M. van Gelder room: 3.87 ext: 86544
Overall outline of the course
Review of statistics and frequency
analysis

Data analysis, random variables,
classification, stat. moments, frequency
distributions; samples, populations and
probability models; parameter
estimation and confidence intervals.
Introduction to this Course
1 + 1 lecture periods (basic statistics), 1 + 1
lecture periods on frequency analysis and a
written exam
Complete PowerPoint presentations can be
found on Van Gelder’s website
Brief self-introduction:
– Background (name, country, university education etc.)
– Interests, experiences with statistics?
– What do you hope to get out of this part of the course?
– Who has experienced hydrological extremes?
Outline for today

• Introduction on natural hazards and
probabilistic design
• Data sets for river and coastal engineers
• Theory on:
– Probability and events
– Random (or stochastic) variables
– Transformations
– Multivariate distributions
Number of Floods worldwide

Source: Dartmouth Flood Observatory, 2003
Regional Distribution of Large Floods

1999-2002

1985-1988

Source: Dartmouth Flood
Observatory, 2003
Reasons for Concern

Source: Smith et al 2001, TAR IPCC WG II
Does Global Warming Lead to an Intensification of the
Hydrological Cycle and More Extreme Events?
More flood-producing rain (but regional differences)
Longer rainless periods and higher evaporation demand
of the atmosphere
There are indications for:
– Changing regimes, flood timing
– More frequent floods, landslides etc.
– No reduction of droughts, but decreased summer low flows
e.g. in western Europe
Effects of changing Mean and Variance of
Precipitation on Stream Flow
(i.e. non-linear processes, thresholds, etc.)

(from Middelkoop 2005, after Arnell 1996)
Frequency analyses can be done for …
High flows
– Flood peak discharge
– Flood volume
– Occurrence of high flows in certain periods/months
Low flows
– Minimum low flow discharge
– Runoff deficits
Groundwater levels, groundwater fluctuations etc.
Estimated cost of damage caused by natural hazards
Simulated variables (model outputs)
Daily rainfall, rainfall intensities etc.
….

How do we measure floods?

(From: Hornberger et al., 1998)
Rating curve

(From: Hornberger et al., 1998)
Application of rating curve to measured
water levels at a gauge

(From: Hornberger et al., 1998)
…. that is not always that easy!

(The big flood in the HJ Andrews 1996)
Continuous measurement of discharges
(incl. floods and low flows)

(From: Hornberger et al., 1998)
Floods in 2002; river Elbe,
city of Dresden, Germany
Flood defence structures

Storm Surge Barrier Oosterschelde (NL)
Maeslantkering - storm surge barrier (NL)
Variability in annual precipitation (NL)
[Figure: Precipitation in the Netherlands; precipitation [mm] (500 to 1100) versus year (1940 to 2000)]
Variability in daily river
discharges (Meuse river, NL)
[Figure: Daily Discharges of the River Maas at Lith (+ Mean, Minimum and Maximum per year); river discharge [m3/s] (0 to 2500) versus day number (0 to 5000)]
Example: Evaluating hydrological extremes
(flood runoff); Case in Austria 2005
[Figure: discharge [m3/s] (0 to 600) over the year (1 Jan to 1 Dec), with the minimum, maximum and mean for 1971-2002 and the year 2005; 2005: max. approx. 515 m3/s; 1999: max. approx. 230 m3/s]
Example: Evaluating floods in a regional
context (Case study: Austria 2005)
Ongoing EU projects on Flood Risks
EFFS: A European Flood Forecasting System
– to develop a prototype of a European flood
forecasting system for 4-10 days in advance,
which could provide daily information on potential
floods for large rivers, and flash floods in small
basins.

SPHERE: Systematic, Palaeoflood and Historical data
for the improvement of flood Risk Estimation
– to develop a new approach which complements
hydrologic modelling and the application of
historical and paleoflood hydrology to increase the
temporal framework of the largest floods over time
spans from decades to millennia, in order to
improve the estimation of extreme flood occurrences.
THARMIT: Torrent Hazard Control in the European
Alps
– to develop practical tools and methodologies for
hazard assessment, prevention and mitigation,
and to devise methods for saving and monitoring
potentially dangerous areas.

CARPE DIEM: Critical assessment of Available Radar
Precipitation Estimation techniques and Development
of Innovative approaches for Environmental
Management.
– to improve real-time estimation of radar rainfall
fields for flood forecasting, by coupling multi-
parameter polarisation radar data and NWP, and
exploiting NWP results in order to improve the
IMPACT: Investigation of Extreme Flood Processes and
Uncertainty
– to investigate extreme flood and defense failure
processes, their risk and uncertainty. Will consider
dam breach formation, sediment movement, flood
propagation and predictive models, within an
overall framework of flood risk management.

GLACIORISK: Survey and Prevention of Extreme
Glaciological Hazards in European Mountainous
Regions
– to develop scientific studies for detection, survey
and prevention of glacial disasters in order to
save lives and reduce damages.
SAFERELNET: Risk assessment of natural hazards in
Europe

MITCH: Mitigation of Climate Induced Hazards
– dealing with the mitigation of natural hazards with
a meteorological cause, in order to assist planning
and management. The main focus will be on flood
forecasting and warning, but it will also include
other flood related hazards, such as landslips and
debris flow, and longer term climate hazards, such
as drought, and the possible impact of climate
change on the frequency and magnitude of hazards

Modelling for Flood Risk Mitigation - June 2002
FLOODMAN: Near real-time flood forecasting, warning
and management system based on satellite radar
images, hydrological and hydraulic models and in-situ
data
– near real-time monitoring of flood extent using
spaceborne SAR, optical data & in-situ
measurements, hydrological and hydraulic model
data. The result will be an expert decision system
for monitoring, management and forecast of floods
in selected areas in Europe. The monitoring will
also be used to update the hydrological/hydraulic
models and thereby improve the quality of flood
forecasts.
FLOODSITE: The FLOODsite project covers the
physical, environmental, ecological and socio-
economic aspects of floods from rivers, estuaries
and the sea. The project is arranged into seven
themes covering:

Risk analysis – hazard sources, pathways and
vulnerability of receptors.
Risk management – pre-flood measures and flood
emergency management.
Technological integration – decision support and
uncertainty.
Pilot applications – for river, estuary and coastal sites.
Training and knowledge uptake – guidance for
professionals, public information and educational
material.
Networking, review and assessment.
Co-ordination and management.
Extreme Events;
Two very realistic simulations
1. River dike failure in the Netherlands
2. Asteroid impact in the Atlantic Ocean
The role of Statistics in Water Engineering
Properties of the hydrological system
Data analysis and statistics (later also modeling etc.)
Statistics of hydrological variables for decision support

Statistical Analysis !!
Water balance: P = R + ET + dS/dt
Variability and heterogeneity of hydrological variables (groundwater levels, precip. patterns etc.)
Hydrological extremes: droughts, floods (x-year flood)
Scenarios for:
– Land use change
– Climate change
– Different water management strategies
etc.
4 types of error occur when measuring a
hydrological variable
1. Operation and function errors: malfunction of measuring
instrument, personal (human) error.
2. Random error: caused by numerous minor impacts partly
independent from each other. If repeated frequently, the
values fluctuate around the true value.
3. Constant systematic error: inherent in any kind of
equipment (e.g., wrong installation of instruments, wrongly
indicated zero point, incorrect rating curve etc.); constant
in respect of time, but may vary according to the
measuring range.
4. Variable systematic error: usually caused by insufficient
control during the measuring period; mostly originating in
the instrument (e.g., "drift" of the device, growth of plants
at the location of measurement etc.). Can be avoided
through continuous comparison of the measurements and
repeated calibration of the instruments.
Note: systematic errors cannot be reduced by increasing the
number of measurements if equipment and measuring
conditions remain the same!
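% Sketch (an illustration, not part of the original slides): random versus
% constant systematic error. All numbers below are hypothetical.
TrueLevel = 2.50;     % true water level [m] (assumed)
Bias      = 0.05;     % constant systematic error, e.g. wrong zero point [m] (assumed)
Sigma     = 0.02;     % standard deviation of the random error [m] (assumed)
for N = [10 1000 100000]
    Readings = TrueLevel + Bias + Sigma*randn(N,1);   % simulated measurements
    fprintf('N = %6d: error of the mean = %.4f m\n', N, mean(Readings) - TrueLevel);
end
% The random part averages out as N grows; the error of the mean tends
% to Bias, not to zero.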
Structural design principles
Old methods:
– determine a worst case load
– determine a worst case strength
– determine the geometry of the structure
Unknown how safe the structure is
No insight in contribution of different individual
failure mechanisms
No insight in importance of different input
parameters
Uncertainties in variables cannot be taken into
account
Uncertainties in the physical models cannot be
taken into account
Failure mechanisms of a dike
Design of a structure

Random boundary conditions
Fault tree with AND and OR
flood defence fails
  OR
  – flood defence collapses
      AND
      – piping develops
      – inspection fails
  – overtopping
Mathematics of AND and OR
For an AND-gate, multiply the probabilities:
P(A and B) = P(A) * P(B)
For an OR-gate, add the probabilities and subtract
the product of the two probabilities:
P(A or B) = P(A) + P(B) - P(A) * P(B)
Important condition:
This is only true when both mechanisms
are fully (statistically) independent
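% Sketch: the AND/OR rules applied to the flood defence fault tree, assuming
% all events are independent. The three probabilities are hypothetical.
P_Piping      = 0.01;     % piping develops (assumed)
P_InspFails   = 0.10;     % inspection fails (assumed)
P_Overtopping = 0.002;    % overtopping (assumed)
P_Collapse = P_Piping * P_InspFails;                               % AND-gate: multiply
P_Fail     = P_Collapse + P_Overtopping - P_Collapse*P_Overtopping; % OR-gate: add, subtract product
fprintf('P(flood defence collapses) = %.6f\n', P_Collapse);
fprintf('P(flood defence fails)     = %.6f\n', P_Fail);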
Example of dependence
Modern wave run-up formula is:

$R = 1.6\, H\, \xi$
$\xi = \tan\alpha / \sqrt{H/L}$
$L = g T^2 / (2\pi)$

(for shallow water, last equation is somewhat different and implicit)
Example of dependence (2)
So the answer depends on H and T
But in a single wave field, T = f(H), for
example: T = 3.9 * H^0.376
This can be modelled as:
T = A * H^B, in which both A and B are
stochastic variables with a mean and
standard deviation
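% Sketch: run-up with T = A * H^B, where A and B are drawn as normal variables.
% The means 3.9 and 0.376 come from the slide; the standard deviations, the
% wave height H and the slope tan(alpha) are hypothetical.
g = 9.81;  N = 10000;
H = 2.0;  TanAlpha = 0.25;              % wave height [m] and slope (assumed)
A = 3.9   + 0.20*randn(N,1);            % stochastic A (standard deviation assumed)
B = 0.376 + 0.02*randn(N,1);            % stochastic B (standard deviation assumed)
T  = A .* H.^B;                         % wave period [s]
L  = g*T.^2/(2*pi);                     % wave length [m]
Xi = TanAlpha ./ sqrt(H./L);            % xi = tan(alpha)/sqrt(H/L)
R  = 1.6*H*Xi;                          % run-up [m]
fprintf('run-up: mean = %.2f m, standard deviation = %.2f m\n', mean(R), std(R));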
Example of dependence (3)
But this is only true in case of a single
wave field (wind waves OR swell
waves)
When there are more wave fields H and T
are NOT statistically independent
There is no good model for run-up due to
double peaked spectra, but there is an
approximation by Van der Meer
Example of dependence (4)

Run-up larger than allowed
  OR
  – Run-up due to wind waves: computation with statistics of wind waves
  – Run-up due to swell waves: computation with statistics of swell waves
  – Run-up due to double peaked spectrum: computation with the approximation of Van der Meer
Two approaches
First approach:
start at bottom and calculate the
probability of failure according to normal
design practice
Second approach:
start at top and assign probability to
failure mechanisms
Two approaches (2)
Starting at the bottom, you usually do not
reach the required overall failure
probability
Starting at the top, you usually cannot
construct some elements
So in practice, you have to use a
mixture of both
Sensitivity analysis
What is the effect of a 10% change in input
on the output?
This determines how important an input
parameter is
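% Sketch (an illustration, not from the slides): effect of a 10% change in each
% input of the run-up formula R = 1.6*H*tan(alpha)/sqrt(H/L), L = g*T^2/(2*pi).
% The base values are hypothetical, and H and T are varied independently here.
RunUp  = @(H,T,TanAlpha) 1.6*H.*TanAlpha./sqrt(H./(9.81*T.^2/(2*pi)));
Base   = [2.0 5.0 0.25];                 % H [m], T [s], tan(alpha) (assumed)
Names  = {'H','T','tan(alpha)'};
R0     = RunUp(Base(1), Base(2), Base(3));
Change = zeros(1,3);
for k = 1:3
    X = Base;  X(k) = 1.1*X(k);          % raise one input by 10%, keep the others
    Change(k) = 100*(RunUp(X(1), X(2), X(3)) - R0)/R0;
    fprintf('%-10s +10%% -> run-up %+.1f%%\n', Names{k}, Change(k));
end
% T and tan(alpha) act linearly (+10%); H acts through sqrt(H) (about +4.9%).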
End of introduction
Data availability
The internet offers a huge source of past and
real-time data
The theory will be explained with
examples and data sets taken from river
engineering:
The Global Runoff Data Centre
http://grdc.bafg.de/servlet/is/910/
Mediterranean Hydrological Cycle Observing System
(Med-HYCOS project)
http://medhycos.mpl.ird.fr
UNESCO International Hydrological Programme
http://webworld.unesco.org/water/ihp/db/
The Global River Discharge Database (RIVDIS)
http://www.rivdis.sr.unh.edu
Local Websites
Ministry of Water Resources of China
http://www.mwr.gov.ch/english/index.asp
Ministry of Water Resources of India
http://mowr.gov.in
Water Commission of India
http://cwc.nic.in
These sites were useful for obtaining basic
information about river basins, not so useful
Apart from websites, data is also published in
National Water Resources Books
Datasets for river engineers
The longest period of observation is recorded for the
river Nemunas at Smalinninkai: 1812-2003 (LT). The
majority of European rivers have observation records
dating from the period 1910-1920 and continuing to
1999-2004.
On most Asian rivers water discharges have been
observed since the period 1930-1940, although the
river Bia at Biisk (Russia) has a record 108 year
period of observation (1895 to 2003). The shortest
period of observation is found on the Indian rivers
(1939 -1979), the Chinese rivers (1930-1985) and the
Iranian rivers (1963-1985).
Your data for the exercise is available at:

http://www.citg.tudelft.nl/live/pagina.jsp?id=418a276e-b63e-4cec-a6fe-763feb04f984&lang=en
Some snap shots
Data for coastal engineers

www.oceanor.no/
www.knmi.nl/onderzk/oceano/waves/era40/license.cgi
www.globalwavestatisticsonline.com/
http://www.golfklimaat.nl
http://www.actuelewaterdata.nl
http://www.hydraulicengineering.tudelft.nl/public/gelder/paper56-data3.zip
Hm0, H1/3,HTE3, Tm02, TH1/3, Th0,
wind direction, wind speed, water
level , surge
Important data source for Dutch data
Some snap shots
real time wave data; significant wave heights
wave periods
Water levels gauges in Mid West Netherlands
Water levels and astronomical tide
The wave data on the wave climate site is available in data files per
year (now : 1979 - 2002). The files contain wave data in the
following format :
19880221 0100 90 4 84 12 52 66 351 10 -6 3433402
19880221 0400 76 3 73 15 53 65 339 -95 -2 3433402
19880221 0700 67 3 66 10 42 53 346 -31 -1 3433402
19880221 1000 67 3 64 12 40 48 344 51 -8 3433402
19880221 1300 66 3 65 11 36 44 308 -3 0 3433402
19880221 1600 75 3 73 12 40 47 298 -93 6 3433402
19880221 1900 91 4 79 13 41 48 325 0 3 3433402
19880221 2200 86 4 80 11 41 47 334 95 0 3433402
19880222 0100 84 5 81 13 42 50 323 38 4 0400402
…….. etc.
The files are arranged as follows :
Column nr.  Name                                                        Unit
1           date                                                        [yyyymmdd]
2           time                                                        [hhmm] MET !
3           wave height Hm0                                             [cm]
4           accuracy wave height Hm0 (standard deviation)               [cm]
5           wave height H1/3                                            [cm]
6           wave height HTE3                                            [cm]
7           wave period Tm02                                            [0.1 s]
8           wave period TH1/3                                           [0.1 s]
9           wave direction Th0                                          [deg], nautical [1]
10          water level                                                 [cm] NAP/MSL
11          surge                                                       [cm]
12          code number which indicates the origin of the given value   [-]

[1] Nautical degrees: (from) North = 0 degrees, (from) East = 90, South = 180, West = 270 and
North again = 360.
Probability

P(A) = probability of event A

Mathematical definition
Frequentistic definition
Mathematical definition

Axioms:
1. P(A) ≥ 0
2. P(Ω) = 1
3. P(A or B) = P(A) + P(B)
(if A and B are mutually exclusive)
Frequentistic definition

P(A) = N(A) / N

in which:
N(A)        number of experiments leading to A
N           total number of experiments

example: probability that a consumer product fails
within 1 year after production
example interpretation
P(A) = n(A) / N
P      probability
n(A)   number of outcomes in experiment A
N      total number of outcomes

x      x      x   x   x       x
x      x      x   x A x       x
x      x      x   x   x       x
x      x      x   x   x       x

P(A) = 4 / 24 = 1 / 6
example dice

1    2      3   4   5   6
x    x      x   x   x   x

P(x = 4)  = 1/6
P(x ≥ 5)  = 2/6
P(x even) = 3/6
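% Sketch: frequentistic estimate of the three dice probabilities, P(A) = N(A)/N.
% The number of throws N is an arbitrary choice.
N = 60000;
X = floor(6*rand(N,1)) + 1;          % N throws of a fair die
P4    = sum(X == 4)/N;               % should approach 1/6
P5up  = sum(X >= 5)/N;               % should approach 2/6
PEven = sum(mod(X,2) == 0)/N;        % should approach 3/6
fprintf('P(x=4) = %.3f, P(x>=5) = %.3f, P(x even) = %.3f\n', P4, P5up, PEven);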
Some history

1650         Pascal / Fermat
1750         Bernoulli / Bayes
1850         Venn / Boole

1920         Von Mises
1960         Savage / Lindley   decision making
1970         Benjamin / Cornell decision making
1960’s and onwards

‘Years ago a statistician might have
claimed that statistics deals with the
processing of data;
today statisticians will be more likely to
say that statistics is concerned with
decision making in the face of
uncertainty.’
probability calculation

calculation of a probability from other
probabilities
Joint events
Union:          A or B     (A ∪ B)
Cross section:  A and B    (A ∩ B)
Implication:    A in B     (A ⊂ B)
Denial:         not A
Union
P(A or B)

x   x     x    x   x       x
x   x     x    x   x       x
x   x     x    x   x       x B
A
x   x     x    x   x       x

P(A or B) = P(A) + P(B) - P(A and B)
13/24      = 6/24 + 9/24 - 2/24
Cross section
P(A and B)

x   x       x     x     x    x
x   x       x     x     x    x
x   x       x     x     x    x B
A
x   x       x     x     x    x

P(A and B) = nAB / n
= (nA / n) * (nAB / nA)
= P(A) * P(B | A)
= 6/24 * 2/6 = 2/24
Conditional probability
P(A | B) =
probability of A given the fact that event
B has occurred

P(A and B) = P(B) P(A | B)

P(A | B) = P(A and B) / P(B)
Conditional probability

P(rain in Delft on sept. 18, 2024)?

P(rain in Delft on 9/18-2024| rain in
Amsterdam on 9/18-2024)?

P(rain in Delft on 9/18-2024|rain in Cape
Town on 9/18-2024)?
example: dice
1            2        3        4         5         6
x            x        x        x         x         x

P(x = 2 | x even) = P(x = 2 and x even) / P(x even) = (1/6) / (1/2) = 1/3

P(x = 2 | x odd) = P(x = 2 and x odd) / P(x odd) = 0 / (1/2) = 0
Independence
A and B are independent if
P(A | B) = P(A)

In that case:
P(A and B) = P(A) P(B)

P(A or B) = P(A) + P(B) - P(A) P(B)
Important rules
Theorem of total probability
P(A) = P(A | B) P(B) + P(A | not B) P(not B)

P(A) = Σ_i P(A | B_i) P(B_i), if the B_i are mutually exclusive and together cover all outcomes

Generalisation to continuous integral “in
which the uncertainty is integrated out”

Theorem of Bayes
P(A | B) = P(B | A) P(A) / P(B)
example: quiz dilemma
Car in A, B or C

A            B          C

U: chooses A
QM: Good that you didn’t choose B,
because it is empty. Would you still like
to switch to C?
Quiz-dilemma
Theorem of Bayes (info = the QM reveals that B is empty, after U chose A):

P(A | info) = P(info | A) P(A) / P(info) = (1/2 * 1/3) / (1/2) = 1/3

P(C | info) = P(info | C) P(C) / P(info) = (1 * 1/3) / (1/2) = 2/3

Yes, switch to C!
Notes: P(info | A) = 1/2, because with the car in A the QM can open B or C.
P(info | C) = 1, because with the car in C and U's choice A, the QM can only open B.
P(info | B) = 0, so by the theorem of total probability
P(info) = 1/2 * 1/3 + 0 * 1/3 + 1 * 1/3 = 1/2.
A clever student is invited to write a simulation programme to find out if this is
indeed true
Solution of Jeroen van den Bos

Development of 2 Matlab scripts
Quiz_noinfo.m (choose a box and check if the car is
there with no information from the QM)
Quiz_info.m (choose a box and update your choice
when the QM gives his information on another box)
Quiz_noinfo.m
clear;
N = 2000; NoOfBoxes = 3;
NoSuccess = 0;
Fr_Success = zeros(1,N);             % preallocate

for i = 1:N
    Box_Car   = randi(NoOfBoxes);    % car is put randomly in a box
    Box_Guess = randi(NoOfBoxes);    % random choice of a box

    % if the guess is right, increase the number of successful attempts
    NoSuccess = NoSuccess + (Box_Car == Box_Guess);

    Fr_Success(i) = NoSuccess/i;     % frequency of success after i attempts
end

P_Success = Fr_Success(N)            % final result

% output
% ------

plot(1:N,Fr_Success,'.',[0 N],[1 1]/NoOfBoxes)
axis([0 N 0 1])
legend('Simulation result',['P = 1/' num2str(NoOfBoxes)])
Quiz_info.m
clear;
N = 2000; NoOfBoxes = 3;
NoSuccess_Stay = 0;
NoSuccess_Switch = 0;
Fr_Success_Stay = zeros(1,N);        % preallocate
Fr_Success_Switch = zeros(1,N);

for i = 1:N
    Box_Car      = randi(NoOfBoxes); % car is put randomly in a box
    Box_1stGuess = randi(NoOfBoxes); % random first choice of a box

    % empty boxes the QM may open: neither the car box nor the first guess
    Empty_Boxes = setdiff(1:NoOfBoxes, [Box_Car Box_1stGuess]);
    Box_Empty = Empty_Boxes(randi(numel(Empty_Boxes)));   % choose randomly from empty boxes

    % alternatives for switching: neither the opened box nor the first guess
    Alternative_Boxes = setdiff(1:NoOfBoxes, [Box_Empty Box_1stGuess]);
    Box_Alternative = Alternative_Boxes(randi(numel(Alternative_Boxes)));

    % count the successful attempts of the 'stay' and the 'switch' strategy
    NoSuccess_Stay   = NoSuccess_Stay   + (Box_Car == Box_1stGuess);
    NoSuccess_Switch = NoSuccess_Switch + (Box_Car == Box_Alternative);

    Fr_Success_Stay(i)   = NoSuccess_Stay/i;     % frequency of success after i attempts
    Fr_Success_Switch(i) = NoSuccess_Switch/i;   % frequency of success after i attempts
end

P_Success_Stay = Fr_Success_Stay(N)      % final result
P_Success_Switch = Fr_Success_Switch(N)  % final result

% output
% ------

plot(1:N,Fr_Success_Stay,'r.',1:N,Fr_Success_Switch,'b.', ...
     [0 N],[1 1]*P_Success_Stay,'r--',[0 N],[1 1]*P_Success_Switch,'b--')
axis([0 N 0 1])
legend(['"Stay" strategy (P_{success} = ' num2str(P_Success_Stay) ')'], ...
       ['"Switch" strategy (P_{success} = ' num2str(P_Success_Switch) ')']);
No info
[Figure: frequency of success versus number of attempts (0 to 2000); simulation result compared with P = 1/3]
With info from QM
[Figure: frequency of success versus number of attempts (0 to 2000); "Stay" strategy (P success = 0.3295), "Switch" strategy (P success = 0.6705)]
Updating process

If new information becomes available,
failure probability of a system
stochastic variables
stochastic variables
•   What is a stochastic variable?
•   probability distributions
•   Fast characteristics
•   Distribution types
•   Two stochastic variables
stochastic variable
Quantity with uncertainty:
– Natural variation
– Lack of statistical data
– Schematizations

example:
– strength of concrete
– outcome of a die
– temperature in Delft on September 16th, 2014
Relation with events
uncertainty can be expressed with probabilities

• probability that stochastic variable X
–   is smaller than x
–   larger than x
–   equal to x
–   is in the interval [x, x + Δx]
–   etc.
probability distribution
probability distribution function = probability
$P(X \le \xi)$:

$F_X(\xi) = P(X \le \xi)$

X: stochastic variable, $\xi$: dummy variable
[Figure: $F_X(\xi)$ increasing from 0 to 1]
probability density
[Figure: curve with values between 0 and 0.5 for arguments from -3 to 5]
This is a probability density function
probability density
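Differentiation of F with respect to $\xi$:

$f_X(\xi) = \frac{dF_X(\xi)}{d\xi}$

f = probability density function

$f_X(\xi)\, d\xi = P(\xi < X \le \xi + d\xi)$

[Figure: $F_X(\xi)$ (0 to 1) above $f_X(\xi)$ (0 to 0.5); $P(\xi < X \le \xi + d\xi)$ is the area under $f_X$ between $\xi$ and $\xi + d\xi$]
[Figure: $F_X(\xi)$ above $f_X(\xi)$; $P(X \le \xi)$ is the value of $F_X$ at $\xi$ and the area under $f_X$ to the left of $\xi$]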
Discrete and continuous
discrete variable:
[Figure: probability mass function p_X(x) (left) and cumulative distribution function F_X(x) (right), for x = 0 to 6]

continuous variable:
[Figure: probability density f_X(x) (left) and (cumulative) probability distribution F_X(x) (right), for x = -4 to 6]
Complementary probability
distribution
Complementary cumulative distribution function (ccdf) of variable S
P(S > x) = 1 - P(S ≤ x) = 1 - F_S(x)

[Figure: tail of the distribution; the value S_d is chosen such that P{S > S_d} = P_target]
Fast characteristics
[Figure: density f_X(x) for x = -4 to 6, with the mean m_X marked on the x-axis and the spread s_X indicated]

m_X     mean
s_X     standard deviation, indication for spreading

[Figure: skewed density f_X(x) for x = 0 to 5, with m_X and s_X indicated]

Mean ≠ maximum (mode)
Median is the value m for which P(X<m)=50%


• Mean: $m_X = \int_{-\infty}^{\infty} x\, f_X(x)\, dx$

• Variance: $s_X^2 = \int_{-\infty}^{\infty} (x - m_X)^2 f_X(x)\, dx$

• Standard deviation: $s_X$

• Variation coefficient: $V_X = s_X / m_X$
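% Sketch: the integrals for mean and variance evaluated numerically with trapz.
% The density used here is a normal density with hypothetical parameters
% m = 2 and s = 1; any other density sampled on a grid works the same way.
m = 2;  s = 1;
x = linspace(m - 10*s, m + 10*s, 20001);           % integration grid
f = exp(-0.5*((x - m)/s).^2) / (s*sqrt(2*pi));     % f_X(x)
mX = trapz(x, x.*f);                               % mean
sX = sqrt(trapz(x, (x - mX).^2 .* f));             % standard deviation
VX = sX/mX;                                        % variation coefficient
fprintf('m_X = %.4f, s_X = %.4f, V_X = %.4f\n', mX, sX, VX);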
distribution types

•   Uniform distribution
•   Normal distribution
•   Lognormal distribution
•   Gumbel distribution
•   Weibull distribution
•   Gamma distribution
•   ….
Uniform distribution

[Figure: f_X(ξ) = 1/(b-a) between a and b; area = total probability = 1]

mean        m = (a+b)/2

Standard deviation $s = (b-a)/\sqrt{12}$
Matlab demonstration
Generate numbers from a Uniform distr.
Make a histogram (observe the variability
around its mean)
Calculate the mean value and standard
deviation
They should converge to 0.5 and 0.2887
(1/sqrt(12))
This is indeed confirmed by the simulation
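% Sketch of the demonstration (the sample size and the number of bins are
% arbitrary choices, not taken from the slides).
N = 100000;
U = rand(N,1);                 % numbers from a Uniform distribution on (0,1)
Counts = hist(U, 20);          % bin counts; hist(U,20) without output draws the histogram
fprintf('mean = %.4f (theory 0.5)\n', mean(U));
fprintf('std  = %.4f (theory %.4f)\n', std(U), 1/sqrt(12));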
Normal distribution
[Figure: strength of timber; probability density function of a normal distribution with mean m_X = 37 N/mm2 and standard deviation s_X = 8.6 N/mm2; x-axis: bending strength (N/mm2), -10 to 90; y-axis: probability density, 0 to 0.06]
Normal distribution in CDF domain
[Figure: strength of timber species; probability distribution function of a normal distribution with mu = 37 N/mm2 and sigma = 8.6 N/mm2; x-axis: x (N/mm2), 0 to 80; y-axis: probability that bending strength <= x, 0 to 1]
Normal distribution in linearised CDF
domain
[Figure: strength of timber species; Normal Probability Plot; x-axis: x (N/mm2), 15 to 60; y-axis: probability that bending strength <= x, 0.003 to 0.997 on a normal probability scale]
normal distribution
probability density:
$f_X(\xi) = \frac{1}{s\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{\xi - m}{s}\right)^2\right)$

probability distribution:
$F_X(y) = \int_{-\infty}^{y} f_X(\xi)\, d\xi$

in which:
m    mean
s    standard deviation (s > 0)
$\xi$    dummy variable ($-\infty < \xi < \infty$)
standard normal distribution
normal distributed variable X:
$X = m_X + s_X u$   or   $u = \frac{X - m_X}{s_X}$
standard normal distributed variable u:
$m_u = 0$
$s_u = 1$

probability density:
$f_u(\xi) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\xi^2\right)$

probability distribution:
$F_u(y) = \int_{-\infty}^{y} f_u(\xi)\, d\xi$
table
This table can be used in both directions:
1. Given an x value, what is the
corresponding exceedance probability?
2. Given a probability, what is the
corresponding x value?

Note that the table only describes the right
hand tail of the standard normal
distribution. The left hand tail can be
obtained by symmetry around the point (0,
0.5).

For ordinary normal distributions, always
scale back to a standard normal distribution
(by subtracting the mean value, and
dividing by the standard deviation)
normal distribution
Why so popular?

Central limit theorem
Sum of many variables (i.i.d.) is (almost)
normally distributed.

Y = X1 + X2 + X3 + X4 + ….

i.i.d. = independent identically distributed
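% Sketch: the sum of n i.i.d. uniform variables compared with the normal
% distribution. n and the sample size N are arbitrary choices.
n = 12;  N = 50000;
Y = sum(rand(N, n), 2);                 % Y = X1 + X2 + ... + Xn
Z = (Y - n/2) / sqrt(n/12);             % standardise: mean n/2, variance n/12
P1 = mean(abs(Z) <= 1);                 % normal distribution: about 0.68
P2 = mean(abs(Z) <= 2);                 % normal distribution: about 0.95
fprintf('P(|Z|<=1) = %.3f, P(|Z|<=2) = %.3f\n', P1, P2);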
Normal Distributions

A continuous rv X is said to have a
normal distribution with parameters
m and s, where $-\infty < m < \infty$ and
$0 < s$, if the pdf of X is

$f(x) = \frac{1}{s\sqrt{2\pi}}\, e^{-(x - m)^2 / (2 s^2)}, \quad -\infty < x < \infty$
Standard Normal Distributions
The normal distribution with parameter
values m = 0 and s = 1 is called a
standard normal distribution. The
random variable is denoted by Z. The
pdf is
$f(z; 0, 1) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \quad -\infty < z < \infty$
The cdf is
$\Phi(z) = P(Z \le z) = \int_{-\infty}^{z} f(y; 0, 1)\, dy$
Standard Normal Cumulative Areas

[Figure: standard normal curve with the area to the left of z]
Standard Normal Distribution
Let Z be the standard normal variable.
Find (from table)
a. P(Z ≤ 0.85)
Area to the left of 0.85 = 0.8023

b. P(Z > 1.32)
1 - P(Z ≤ 1.32) = 0.0934
c. P(-2.1 ≤ Z ≤ 1.78)
Find the area to the left of 1.78 then
subtract the area to the left of -2.1.
= P(Z ≤ 1.78) - P(Z ≤ -2.1)
= 0.9625 - 0.0179
= 0.9446
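% Sketch: the table values of examples a, b and c computed with the
% complementary error function. Phi and PhiInv are illustrative helper names.
Phi    = @(z) 0.5*erfc(-z/sqrt(2));       % standard normal cdf
PhiInv = @(p) -sqrt(2)*erfcinv(2*p);      % inverse: probability -> z
Pa = Phi(0.85);                           % a. area to the left of 0.85
Pb = 1 - Phi(1.32);                       % b. P(Z > 1.32)
Pc = Phi(1.78) - Phi(-2.1);               % c. P(-2.1 <= Z <= 1.78)
zBack = PhiInv(Pa);                       % table used in the other direction
fprintf('a = %.4f, b = %.4f, c = %.4f, z = %.2f\n', Pa, Pb, Pc, zBack);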
z Notation

z_α will denote the value on the
measurement axis for which the area α
under the z curve lies to the right of z_α.
[Figure: standard normal curve; shaded area to the right of z_α = P(Z ≥ z_α) = α]
Ex. Let Z be the standard normal variable. Find z if
a. P(Z < z) = 0.9278.
Look at the table and find an entry
= 0.9278 then read back to find

z = 1.46.

b. P(-z < Z < z) = 0.8132
P(-z < Z < z) = 2P(0 < Z < z)
= 2[P(Z < z) - ½]
= 2P(Z < z) - 1 = 0.8132
P(Z < z) = 0.9066
z = 1.32
Nonstandard Normal Distributions

If X has a normal distribution with
mean m and standard deviation s, then
$Z = \frac{X - m}{s}$
has a standard normal distribution.
Normal Curve
Approximate percentage of area within
given standard deviations (empirical
rule):
68% within 1 standard deviation of the mean
95% within 2 standard deviations
99.7% within 3 standard deviations
Ex. Let X be a normal random variable
with m = 80 and s = 20.
Find P(X ≤ 65).

$P(X \le 65) = P\left(Z \le \frac{65 - 80}{20}\right) = P(Z \le -0.75)$

= 0.2266
Ex. A particular rash has shown up at an
elementary school. It has been
determined that the length of time that the
rash will last is normally distributed with
m = 6 days and s = 1.5 days.
Find the probability that for a student
selected at random, the rash will last for
between 3.75 and 9 days.

$P(3.75 \le X \le 9) = P\left(\frac{3.75 - 6}{1.5} \le Z \le \frac{9 - 6}{1.5}\right) = P(-1.5 \le Z \le 2)$

= 0.9772 – 0.0668

= 0.9104
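% Sketch: the two worked examples checked by scaling back to the standard
% normal distribution. Phi and NormCdf are illustrative helper names.
Phi     = @(z) 0.5*erfc(-z/sqrt(2));             % standard normal cdf
NormCdf = @(x, m, s) Phi((x - m)/s);             % subtract the mean, divide by s
P1 = NormCdf(65, 80, 20);                        % P(X <= 65) for m = 80, s = 20
P2 = NormCdf(9, 6, 1.5) - NormCdf(3.75, 6, 1.5); % P(3.75 <= X <= 9) for m = 6, s = 1.5
fprintf('P1 = %.4f, P2 = %.4f\n', P1, P2);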
Percentiles of an Arbitrary Normal
Distribution

(100p)th percentile for normal (m, s) = m + [(100p)th percentile for standard normal] · s
Lognormal distribution

[Figure: lognormal density f_X(ξ), values 0 to 0.7 for ξ = 0 to 5, with the mean m_X and the spread s_X indicated]
Lognormal distribution
f_X(ξ) : lognormal
f_Y(y) : normal

y = ln ξ or
ξ = exp(y)

If X is lognormal distributed, then
Y = ln(X) is normal distributed
Lognormal distribution
X lognormally distributed ⇔ Y = ln(X) normally distributed

probability density function for X:

$$ f_X(\xi) = \frac{1}{\xi\,\sigma_Y\sqrt{2\pi}} \exp\left(-\frac{(\ln\xi-\mu_Y)^2}{2\sigma_Y^2}\right), \qquad \xi > 0 $$

in which μY and σY are the parameters of the lognormal distribution:
μY     mean value of Y (not of X !!)
σY     standard deviation of Y (not of X !!)
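Lognormal distribution
X lognormally distributed ⇔ Y = ln(X) normally distributed

$$ \mu_X = \exp\left(\mu_Y + \tfrac{1}{2}\sigma_Y^2\right) $$
$$ \sigma_X = \mu_X\sqrt{\exp(\sigma_Y^2) - 1} $$
$$ \mu_Y = \ln\mu_X - \tfrac{1}{2}\sigma_Y^2 $$
$$ \sigma_Y = \sqrt{\ln\left(1 + \frac{\sigma_X^2}{\mu_X^2}\right)} = \sqrt{\ln(1 + V_X^2)} $$

with V_X = σX / μX the coefficient of variation of X.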
Lognormal distribution
As a consequence of the central limit
theorem:
Product of many variables is (almost)
lognormally distributed

y = x1 · x2 · ... · xn
log y = log x1 + log x2 + ... + log xn
so log y is (almost) normally distributed.

Definition: log y normal ⇔ y lognormal
Example: salaries in a country are lognormally distributed.
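For a lognormal variable, the relations between the moments of X and the parameters of Y = ln(X) given earlier in this section can be sketched as two conversion functions. The function names and the numerical values are hypothetical, chosen only for illustration:

```python
import math

def lognormal_params_from_moments(mean_x, std_x):
    """Return (mu_Y, sigma_Y) of Y = ln(X) from the mean and standard deviation of X."""
    v_x = std_x / mean_x                          # coefficient of variation of X
    sigma_y = math.sqrt(math.log(1.0 + v_x ** 2))
    mu_y = math.log(mean_x) - 0.5 * sigma_y ** 2
    return mu_y, sigma_y

def lognormal_moments_from_params(mu_y, sigma_y):
    """Return (mean, standard deviation) of X from the parameters of Y = ln(X)."""
    mean_x = math.exp(mu_y + 0.5 * sigma_y ** 2)
    std_x = mean_x * math.sqrt(math.exp(sigma_y ** 2) - 1.0)
    return mean_x, std_x

# Hypothetical example: mean 2.0 and standard deviation 1.0 (illustrative values only)
mu_y, sigma_y = lognormal_params_from_moments(2.0, 1.0)
mean_x, std_x = lognormal_moments_from_params(mu_y, sigma_y)
print(round(mu_y, 4), round(sigma_y, 4), round(mean_x, 4), round(std_x, 4))
```

Converting to (μY, σY) and back recovers the original mean and standard deviation, which is a quick consistency check on the four formulas.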
Asymptotic distributions

Normal                 X = Y1 + Y2 + Y3 + ...

Lognormal              X = Y1 * Y2 * Y3 * ...

Weibull                X = min(Y1, Y2, Y3, ...)

Gumbel                 X = max(Y1, Y2, Y3, ...)
Asymptotic distributions return in the 5th-year course CT5310
by Vrijling and Van Gelder
Discrete distributions
Exponential Distribution

A continuous rv X has an exponential
distribution with parameter λ if the pdf is

$$ f(x;\lambda) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & \text{otherwise} \end{cases} $$
Mean and Variance

The mean and variance of a random
variable X having the exponential
distribution are

$$ \mu = \frac{1}{\lambda}, \qquad \sigma^2 = \frac{1}{\lambda^2} $$
Applications of the Exponential
Distribution
Suppose that the number of events
occurring in any time interval of length t
has a Poisson distribution with parameter αt
and that the numbers of occurrences in
nonoverlapping intervals are independent
of one another. Then the distribution of
elapsed time between the occurrences of
two successive events is exponential with
parameter λ = α.
Two stochastic variables
Joint probability density
[Figure: surface plot of the joint probability density (vertical axis 0 to 0.2) over x and y, each from -2 to 2]
Contour lines of the joint density
[Figure: contour lines of the joint density fXY(ξ,η), with the marginal densities fX(ξ) and fY(η) along the axes]
Two stochastic variables
Relation with events
fXY(ξ,η) = P(ξ < X ≤ ξ + Δξ and η < Y ≤ η + Δη) / (Δξ Δη)

FXY(ξ,η) = P(X ≤ ξ and Y ≤ η)

Also here
f     density
F     (cumulative) distribution
Example
• Length
[Figure: probability density (1/m) of length (m), for lengths from 1.2 to 2.6 m]
• Weight
[Figure: probability density (1/kg) of weight (kg), for weights from 50 to 110 kg]
Corresponding contour plot?
[Figure: axes of weight (kg, 50 to 110) against length (m, 1.4 to 2.2)]
Scatter plot: results of a large health survey
[Figure: scatter plot of weight (kg) against length (m), 1000 observations]
Dependence
[Figure: plot of weight (kg) against length (m), with the density of weight (1/kg) and the density of length (1/m) drawn along the axes]
Characteristics

μX, μY

σX, σY

Dependence
covXY                     covariance or
ρXY = covXY / (σX σY)     correlation, between -1 and 1
Covariance
Cov(X,Y)=E((X-EX)(Y-EY))

calculation example on black board
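A calculation of this kind can be sketched in code by replacing the expectations with sample averages. The data below are hypothetical lengths and weights invented for illustration, and the function names are assumptions:

```python
import math

def covariance(x, y):
    """Cov(X,Y) = E((X - EX)(Y - EY)), with expectations replaced by sample averages."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

def correlation(x, y):
    """rho = Cov(X,Y) / (sigma_X * sigma_Y); always between -1 and 1."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical lengths (m) and weights (kg), for illustration only
length = [1.60, 1.70, 1.75, 1.80, 1.90]
weight = [55.0, 68.0, 70.0, 82.0, 90.0]
print(round(covariance(length, weight), 3), round(correlation(length, weight), 3))
```

Note that Cov(X,X) is the variance of X, which is why the same function supplies the denominators of the correlation.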
Correlation
[Figure: six scatter plots, for rho = 0, 0.3, 0.7, 0.9, 1 and -0.9]
Correlations between wave height
and wave period (data from
golfklimaat.nl website)
Engineering: structural reliability
Reliability:
Expressed through the probability that the structure falls apart
The smaller the probability, the larger the
reliability
Risk = probability x consequences
Structure:
Strength
Falls apart if strength < load
Introduction
Design value - principle
Probability density function (pdf) of the load S: fS(x)

P{S > Sd} = Ptarget
[Figure: pdf of the load S with the mean μ, the standard deviation σ and the design value Sd marked]
Cumulative distribution function
[Figure: cumulative distribution function (cdf) FX(ξ), with vertical axis from 0 to 1, above the probability density function (pdf) fX(ξ)]
Design value - principle
Cumulative probability distribution function (cdf) of the load

P{S ≤ Sd} = 1 - Ptarget
[Figure: cdf of the load with the design value Sd marked]
Design value - principle
Complementary cumulative distribution function (ccdf) of the load
= 1 - FS(x)

P{S > Sd} = Ptarget
[Figure: ccdf of the load with the design value Sd marked]
Design value load Sd (a quantile) is defined as:
P{S > Sd} = Ptarget during reference period Tref

Target probability Ptarget:
Depends on consequences of structural failure
Is specified in building codes
Typical:
Ptarget = 10^-4 to 10^-1 (structural collapse)
Tref = 15 to 100 years
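The definition P{S > Sd} = Ptarget means that Sd is the (1 - Ptarget) quantile of the load distribution. A minimal sketch, assuming purely for illustration that the load is normally distributed with hypothetical parameters (the course itself goes on to use extreme value distributions for this purpose):

```python
from statistics import NormalDist

def design_value(load, p_target):
    """Return Sd such that P{S > Sd} = p_target, i.e. F_S(Sd) = 1 - p_target."""
    return load.inv_cdf(1.0 - p_target)

# Hypothetical load: normal with mean 100 and standard deviation 20 (illustrative units)
load = NormalDist(mu=100.0, sigma=20.0)
s_d = design_value(load, p_target=1e-2)
print(round(s_d, 1))
```

Any distribution object with an inverse cdf could be substituted for `NormalDist` here; only the quantile step is essential.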
Peaks over Threshold analysis
for quantile estimation
Let X1, X2, ..., Xn be a series of independent random observations
of a random variable X with the distribution function F(x). To
model the upper tail of F(x), consider the k exceedances of X over
a threshold u and let Y1, Y2, ..., Yk denote the excesses (or
peaks), i.e. Yi = Xi - u.
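Extracting the excesses from a record is a one-line filter. The sketch below uses hypothetical observations and a hypothetical threshold; the function name is an assumption:

```python
def excesses_over_threshold(observations, u):
    """Return the excesses Yi = Xi - u for all observations Xi that exceed the threshold u."""
    return [x - u for x in observations if x > u]

# Hypothetical observations and threshold, for illustration only
x = [1.2, 3.4, 0.8, 5.1, 2.9, 4.0, 3.0]
u = 3.0
y = excesses_over_threshold(x, u)
k = len(y)   # number of exceedances
print(k, [round(v, 2) for v in y])
```

An observation exactly equal to the threshold is not counted as an exceedance in this sketch.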
Extreme value statistics

If we know the distribution of a random variable
(for instance, monthly water level, daily wave
height, etc), how does the distribution of the
maximum of n random variables behave?
Beware of inhomogeneities
Stationary Time Series

Exhibits stationarity in that it fluctuates around a constant long
run mean

Has a finite variance that is time invariant

Has a theoretical covariance between values of yt that depends
only on the difference apart in time

E ( yt )  m
              
Var ( y t )  E ( y t  m)  (0)
2

Cov ( y t , y t  )  E( y t  m)( y t    m)  ()
Stationary time series

WHITE NOISE PROCESS

Xt = ut         ut ~ IID(0, σ²)

[Figure: white noise series, vertical axis from -0.6 to 0.6]
Examples of non-stationary series
[Figure: three time series plots: share prices, an exchange rate, and income (1960 to 1975)]
Unit Root Tests

How do you find out if a series is
stationary or not?
Order of Integration of a Series

A series which is stationary after being differenced once
is said to be integrated of order 1 and is denoted by I(1).
In general a series which is stationary after being
differenced d times is said to be integrated of order d,
denoted I(d). A series, which is stationary without
differencing, is said to be I(0)

$$ Y_t = b_0 + Y_{t-1} + e_t \sim I(1) $$
$$ \Delta Y_t = Y_t - Y_{t-1} = b_0 + e_t \sim I(0) $$
Informal procedures to identify non-stationary processes
(1) Eyeball the data
(a) Constant mean?
[Figure: series RW2 over 500 time steps, vertical axis from 0 to 12]

(b) Constant variance?
[Figure: series "var" over 500 time steps, vertical axis from -200 to 200]
Statistical Tests for stationarity: Simple t-test

Set up AR(1) process with drift (b0)
Yt = b0 + b1Yt-1 + et        et ~ iid(0,σ2)            (1)

Simple approach is to estimate eqn (1) using OLS and examine
estimated b1

Use a t-test with null Ho: b1 = 1 (non-stationary)
against alternative Ha: b1 < 1 (stationary).

Test Statistic: TS = (b1 – 1) / (Std. Err.(b1))
reject null hypothesis when test statistic is large negative
- 5% critical value is -1.65
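The simple test above can be sketched with an ordinary least squares fit written out by hand. The function name `ar1_unit_root_t` and the simulated white-noise series are assumptions for illustration; the decision rule uses the critical value quoted above:

```python
import math
import random

def ar1_unit_root_t(y):
    """OLS fit of y[t] = b0 + b1*y[t-1] + e[t]; returns (b1, test statistic for H0: b1 = 1)."""
    x = y[:-1]          # lagged values y[t-1]
    z = y[1:]           # current values y[t]
    n = len(x)
    x_mean = sum(x) / n
    z_mean = sum(z) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxz = sum((xi - x_mean) * (zi - z_mean) for xi, zi in zip(x, z))
    b1 = sxz / sxx
    b0 = z_mean - b1 * x_mean
    ssr = sum((zi - b0 - b1 * xi) ** 2 for xi, zi in zip(x, z))
    se_b1 = math.sqrt(ssr / (n - 2) / sxx)      # standard error of b1
    return b1, (b1 - 1.0) / se_b1

random.seed(1)
white_noise = [random.gauss(0.0, 1.0) for _ in range(500)]   # stationary by construction
b1, ts = ar1_unit_root_t(white_noise)
print(round(b1, 3), round(ts, 1), "reject H0" if ts < -1.65 else "do not reject H0")
```

For a stationary series such as white noise, the estimated b1 is close to 0 and the test statistic is strongly negative, so the null hypothesis of non-stationarity is rejected.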
Distribution of a maximum of random
variables
Therefore...
Extreme value distribution of a uniform
distribution
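The slide titles above refer to a standard result, restated here as a reminder under the assumption that X_1, ..., X_n are independent with a common cdf F. The second line specialises it to a uniform parent on [0, a]; the symbols a and F_max are notation introduced here for illustration.

```latex
F_{\max}(x) = P\bigl(\max(X_1,\dots,X_n) \le x\bigr)
            = \prod_{i=1}^{n} P(X_i \le x) = \bigl[F(x)\bigr]^{n}

F(x) = \frac{x}{a} \;\; (0 \le x \le a)
\quad\Longrightarrow\quad
F_{\max}(x) = \left(\frac{x}{a}\right)^{n}
```

With a = 5 and n = 12 this is the distribution that the simulated maxima in the Matlab example follow.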
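From an ‘operational point of view’
rather than a conceptual point of view: Matlab
code ct53100.m
% Simulate 100 years of 12 monthly water levels and keep each annual maximum
n = 12;                  % number of monthly observations per year
y = zeros(1, 100);       % annual maxima
for j = 1:100
    x = 5*rand(1, n);    % monthly water levels [m], uniform on [0, 5]
    y(j) = max(x);       % annual maximum water level [m]
end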
[Figure, left column: water level [m] against time period i (1 to 12); histogram of the monthly water level [m]; cumulative frequency of the sorted monthly water level [m]]
[Figure, right column: maximum water level [m] against year j (1 to 100); histogram of the annual maximum water level [m]; cumulative frequency of the sorted annual water level [m]]
Back to the operational viewpoint
- Change the number of observations
- Change the distribution type
- From minimum to maximum
- etc.
From a visual point of view

Statistics wind:
[Figure: wind speed V (m/s) against time (1/1/64 to 31/12/64), next to its probability density with a Weibull distribution. Source: KNMI (Schiphol, 1964)]

Mean                   : 5 m/s
Standard deviation     : 2.5 m/s
visual point of view
[Figure: distributions along the wind speed axis for the instantaneous wind speed (5 m/s), the 1-year maximum (20 m/s) and the 50-year maximum (28 m/s)]
visual point of view
Procedure for minima of r.v.’s
The extreme value distribution

Distribution type
Distribution parameters
Quantile values (design values)

Approach 1: extreme value distributions
The extreme value distribution
Limit theorem (Fisher-Tippett):

Maximum of many random variables has distribution:
Reverse Weibull (bounded maximum) or
Gumbel or
Frechet (bounded minimum)
regardless of parent distribution

Conditions:
Random variables are independent
Random variables have the same parent distribution
Extreme value distributions
[Figure: curves of the three extreme value distributions: Frechet (concave), Gumbel (straight), Reverse Weibull (convex)]
Generalized Extreme Value
distribution
All three extreme value distributions are
special cases of the Generalized Extreme
Value distribution (GEV), distinguished by the shape parameter ξ:

ξ = 0   Gumbel            (EV type I for maxima)
ξ > 0   Frechet           (EV type II for maxima)
ξ < 0   Reverse Weibull   (EV type III for maxima)
Domain of attraction
Parent distribution                               Asymptotic distribution type of maximum
                                                  (domain of attraction)
uniform, beta (short tail)                        Reverse Weibull
normal, exponential, gamma, lognormal, Weibull    Gumbel
Pareto, Cauchy, Student-t (fat tail)              Frechet
Example:
Maximum over 50 years
Quantile value at Ptarget = 0.15 (design
Steps:
– Determine extreme value distribution type
– Determine distribution parameters
– Calculate requested quantile (design load)
Distribution type
Statistics: parent distribution is approx. Weibull
EV-theory: domain of attraction is Gumbel
Plot: monthly maxima of hourly averaged wind speeds

Slightly convex?

Gumbel probability plot

CDF on Gumbel probability paper:

Reverse Weibull-like deviation
(poor convergence)

Wind speed annual maxima Schiphol

Wind pressure monthly maxima
Schiphol

pressure:

q = 0.5 ρ U²
with:
ρ air density
U wind speed
Transformations
Presenting large datasets
In a histogram
On probability paper

order the n observations
Number of classes: 1 + 1.33 ln(n)
All classes have preferably the same
width
Histogram
Wave height (cm):
25 45 35 25 30 70
20 45 65 30 40 40
35 45 55 35 32 37
28 45 49 39 40 60
29 34 47 35 45 49
35 45 34 28 34 54
48 38 32 39 45 58
Histogram
Number of classes: 1 + 1.33 ln(42) ≈ 6
highest - lowest = 70 - 20 = 50
class width about 50 / 6 ≈ 8 (we take 5 to
choose a round number)
Histogram
class            frequency   freq/width
                             (unit = 5)
17.5 - 27.5         3           3/2
27.5 - 32.5         7           7/1
32.5 - 37.5         9           9/1
37.5 - 42.5         6           6/1
42.5 - 47.5         8           8/1
47.5 - 57.5         5           5/2
57.5 - 77.5         4           4/4
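The class count and the table entries can be reproduced from the 42 wave heights listed earlier. A minimal sketch; treating each class as open at the lower and closed at the upper boundary is an assumption (no observation falls on a boundary here, so it does not affect the counts):

```python
import math

wave_heights_cm = [
    25, 45, 35, 25, 30, 70, 20, 45, 65, 30, 40, 40, 35, 45,
    55, 35, 32, 37, 28, 45, 49, 39, 40, 60, 29, 34, 47, 35,
    45, 49, 35, 45, 34, 28, 34, 54, 48, 38, 32, 39, 45, 58,
]

n = len(wave_heights_cm)
suggested_classes = round(1 + 1.33 * math.log(n))

# Class boundaries from the table; the outer classes are wider than the unit width of 5 cm
edges = [17.5, 27.5, 32.5, 37.5, 42.5, 47.5, 57.5, 77.5]
frequencies = []
heights = []
for low, high in zip(edges[:-1], edges[1:]):
    count = sum(1 for h in wave_heights_cm if low < h <= high)
    frequencies.append(count)
    heights.append(count / ((high - low) / 5.0))   # frequency per unit width of 5 cm

print(n, suggested_classes, frequencies, heights)
```

Dividing by the class width (in units of 5 cm) keeps the bar areas proportional to the frequencies when the classes have unequal widths.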
Table 2.1 Groundwater chemistry from Noord-Brabant, The Netherlands
Case study: Groundwater chemistry data set (from: Y. Zhou, 2006: Hydrogeostatistics. UNESCO-IHE lecture note.)
     X       Y      EC      Ca      Cl    HCO3       K      Mg     NO3      Na     SO4
   (m)     (m)  (mS/m)  (mg/l)  (mg/l)  (mg/l)  (mg/l)  (mg/l)  (mg/l)  (mg/l)  (mg/l)
133875  407525   11.60    9.63   19.70    3.00    1.72    0.88    0.28    9.90   21.20
158975  413725   83.20  159.00   46.50  290.00    1.58   14.23    0.28   23.26  189.00
156675  419725   72.10  135.00   37.40  260.00    4.00   13.29    1.81   18.02  149.00
145325  401775   39.70   51.20   19.50   11.00    8.28    7.50   13.20   13.97  112.00
143200  410625   33.10   59.30   14.60  207.00    1.16    4.66    0.28   10.88    2.10
148238  411875   67.90   74.20   85.00  252.00    3.73    3.87    0.28   78.83   44.50
165225  422650   40.10   67.70   22.10  125.00    1.13    3.45    0.28   10.41   83.10
171500  414450   24.00    3.84   15.40    3.00    1.67    2.08    9.35    8.63   61.20
163875  407250   14.30   17.70   11.70   44.00    0.77    2.06    0.28    7.84   19.60
168463  402625   49.80   48.20   52.30  147.00    4.54   13.15    0.28   40.53   81.10
177075  405600   19.50    8.34   11.10    3.00    9.21    6.25    2.17   12.75   61.00
184550  415063   40.30   52.70   31.40    9.10    2.78    5.63    0.28   21.81  150.00
182500  403638   65.40   63.40   46.50    3.00   25.81   15.67   31.60   20.15  137.00
194688  404850   62.80   84.10   31.40    9.10   16.37    9.82   45.50   22.87   94.60
137412  391325   70.10   83.80   36.20   13.00   10.99   22.27   57.20   17.48   83.50
146000  394525   26.40   49.20    6.67  173.00    1.24    4.22    0.28    7.46    1.80
157725  399378   11.60   15.70   10.50   61.00    0.51    1.83    0.28    8.20    2.50
158600  393350   88.40  167.00   88.30  371.00    1.47    6.58    0.28   31.94   76.90
143290  376185   36.50   23.30   54.50    3.00   10.40    2.68    7.29   38.94   57.70
150705  382290   14.20    6.69   10.50    3.00    4.09    6.06    0.28    5.49   41.40
167325  392060   27.70   45.30   18.30  127.00    0.72    5.57    0.28    6.93   21.40
179563  395513   68.20   51.30   58.90    3.00   43.23   15.59    0.28   32.86  242.00
173640  388415   98.50  172.00  109.00  450.00    1.77   19.10    0.28   47.29   60.40
162600  378238   15.70   11.20    6.82    3.00    2.57    0.81    0.34    7.52   45.90
171335  380745   12.10    2.12    4.72    3.00    1.55    2.27    0.28    4.67   27.10
182675  393300   25.20   11.90   12.60    3.00    2.93    7.02    9.31    7.56   62.70
192750  397975   53.00   55.50   18.80    3.00   24.71   10.38   40.50   19.79   81.20
182065  382505   23.20   35.50    7.45   77.00    1.15    5.70    0.28    6.13   49.10
146050  367045   93.00   82.10   43.70    3.00   41.72   21.87   87.30   23.15   90.60
155915  372595   63.90   69.00   38.60    3.00   10.99   21.53   44.40   21.83  105.00
163750  368850    9.90    7.63    4.93    3.00    1.70    2.15    1.28    6.75   31.10
168200  365138   70.80   64.80  112.00   56.00    2.42   14.02    0.28   59.13  166.00
177325  372138   13.50   18.30   18.50   57.00    1.51    1.64    0.28   10.69    2.10
169813  362163   33.20   10.40   15.30    3.00    1.83    2.46    0.28   15.75   72.10
 82245  404445   87.10   34.50  160.00  293.00    2.73    3.32    0.28  150.70    2.20
 85975  401085   62.30  109.00   56.30  349.00    2.16   12.57    0.28   19.78    1.70
 93840  406235   45.90   41.50   90.70  130.00    4.21    9.11    0.28   18.86    2.10
115480  420695  112.60  147.00  181.00  433.00    2.71   13.30    0.28   49.11    1.00
117915  407045   76.00   73.30   60.20  232.00   42.34   19.69    0.28   49.06  144.00
119620  402050   26.50   18.40   18.10    3.00    3.33    7.49    0.28   11.38   86.50
127360  419875  115.10  121.00  184.00  467.00    3.17    9.94    0.28  147.70   11.10
135725  416615   62.90  125.00   17.80  305.00    0.32   11.59    0.28   11.49  105.00
121645  408175   30.50   54.90   16.30  159.00    1.05    4.55    0.28    7.62   16.40
125395  401095   39.00   52.90   48.90   65.00    1.66    4.53    0.28   21.33   74.00
 76360  392710   16.00    3.68   37.20   29.00    2.18    2.13    0.28   22.38    2.20
 84060  393685   35.30   30.10   61.40    3.00    2.10    8.63    0.28   25.51   79.20
 94950  391050   23.60    4.31   20.00    3.00    2.32    6.13    2.30   11.43   76.60
 96000  383625   63.30   66.10   30.00    3.00   44.80   13.07   16.30   13.18  207.00
104570  391570   51.10   15.00   60.90    3.00    4.18    6.67    0.28   26.06   80.90
101830  398440   51.60   33.30   20.10    3.00   13.39   15.59    0.28   12.54  173.00
113238  399300   61.20   60.00   73.30   74.00   19.38   11.81    0.28   45.19  151.00
Frequency tables

Step 1: Range of data
R = xmax - xmin
R(Cl) = 184 - 4.72 = 179.28 mg/l
Frequency tables
Step 2: Number of class intervals, m
Class number
6 < m < 25
m = 1 + 1.33 ln(N)
Example Cl: m should be around 6-7 classes
Class width Δx:
Δx > R/m
Example Cl: 15 > 179.28/13 ≈ 14
Class limits xj- (lower limit) and xj+ (upper limit):
xj- = x0 + (j-1)*Δx < values in class j < xj- + Δx = xj+
Frequency tables

Step 3: Number of measurements per class

nj:      absolute frequency
fj=nj/n: relative frequency
Frequency tables
Step 4: Creating frequency table
Class interval   Absolute frequency   Relative frequency
0 < Cl < 15                    11              20.80%
15 < Cl < 30                    15              28.30%
30 < Cl < 45                     8              15.10%
45 < Cl < 60                     7              13.20%
60 < Cl < 75                     4               7.50%
75 < Cl < 90                     2               3.80%
90 < Cl < 105                     1               1.90%
105 < Cl < 120                     2               3.80%
120 < Cl < 135                     0                   0
135 < Cl < 150                     0                   0
150 < Cl < 165                     1               1.90%
165 < Cl < 180                     0                   0
180 < Cl < 195                     2               3.80%
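Steps 1 to 4 can be sketched in a few lines. The Cl values are the 53 values listed in the ranked Cl table of this document; counting a value that falls exactly on a class boundary in the lower class (as happens for 30 mg/l) is an assumption, chosen because it reproduces the counts in the table above:

```python
import math

# Cl concentrations (mg/l) as listed in the ranked Cl table of this document (53 values)
cl = [
    184, 181, 160, 112, 109, 90.7, 88.3, 85, 73.3, 61.4, 60.9, 60.2, 58.9, 56.3,
    54.5, 52.3, 48.9, 46.5, 46.5, 43.7, 39.2, 38.6, 37.4, 37.2, 36.2, 31.4, 31.4,
    30, 22.1, 20.9, 20.1, 20, 19.7, 19.5, 18.8, 18.5, 18.3, 18.1, 17.8, 16.3,
    15.4, 15.3, 14.6, 12.6, 11.7, 11.1, 10.5, 10.5, 7.45, 6.82, 6.67, 4.93, 4.72,
]

width = 15.0                                # class width (mg/l)
n_classes = 13                              # classes from 0 to 195 mg/l
absolute = [0] * n_classes
for value in cl:
    j = math.ceil(value / width) - 1        # class j covers (j*width, (j+1)*width]
    absolute[j] += 1
relative = [100.0 * nj / len(cl) for nj in absolute]
cumulative = [sum(absolute[: j + 1]) for j in range(n_classes)]

for j in range(n_classes):
    print(f"{j * width:5.0f} < Cl <= {(j + 1) * width:5.0f}  "
          f"{absolute[j]:3d}  {relative[j]:5.1f}%  {cumulative[j]:3d}")
```

The `cumulative` list gives the running totals that a cumulative frequency table is built from.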
Creating a Histogram
Step 5: Absolute and relative frequencies
[Figure: absolute frequency histogram of Cl (mg/l), classes of width 15 from 0 to 195]
[Figure: relative frequency (%) histogram of Cl (mg/l), classes of width 15 from 0 to 195]
Frequency tables
Step 6: Cumulative frequency table
Upper class limit   Cumulative absolute frequency   Cumulative relative frequency
Cl < 15                     11             20.80%
Cl < 30                     26             49.10%
Cl < 45                     34             64.20%
Cl < 60                     41             77.40%
Cl < 75                     45             84.90%
Cl < 90                     47             88.70%
Cl < 105                    48             90.60%
Cl < 120                    50             94.40%
Cl < 135                    50             94.40%
Cl < 150                    50             94.40%
Cl < 165                    51             96.30%
Cl < 180                    51             96.30%
Cl < 195                    53            100.00%
Frequency distribution
Step 7: Cumulative frequency distribution curve
[Figure: cumulative frequency distribution, cumulative frequency (%) from 0 to 100 against Cl (mg/l) from 0 to 200]
Some typical frequency distributions
[Figure: four sketches of relative frequency (%) against the X variable: symmetrical or bell-shaped distribution, positively skewed distribution, bimodal distribution, negatively skewed distribution]
Statistical descriptors
Descriptors of central tendency
Mode: the value with the largest frequency; the average
value of the measurements in the class with the
largest frequency

For Cl: the mode is 19.4 mg/l

Not applicable for distributions with several peaks
[Figure: relative frequency (%) histogram of Cl (mg/l)]
Statistical descriptors
Descriptors of central tendency
Median: the value corresponding to 50% cumulative
frequency; the middle measurement for an odd
number of samples, or the average of the two middle
measurements for an even number of samples
For Cl: median = 31.4 mg/l

Insensitive to the tails or outliers of the distribution;
preferable for data sets with exceptional values
[Figure: cumulative frequency (%) distribution of Cl (mg/l)]
Statistical descriptors
1. Descriptors of central tendency
Quartiles: split the data into quarters
– Lower quartile: 25% cumulative frequency
  For Cl: lower quartile = 16.3 mg/l
– Upper quartile: 75% cumulative frequency
  For Cl: upper quartile = 56.3 mg/l
In practice also the 1%, 5%, 10%, 90%, 95% and 99% values are used
(e.g. discharge data).

  Cl    Rank    Percent
 184      1    100.00%
 181      2     98.00%
 160      3     96.10%
 112      4     94.20%
 109      5     92.30%
 90.7     6     90.30%
 88.3     7     88.40%
 85       8     86.50%
 73.3     9     84.60%
 61.4    10     82.60%
 60.9    11     80.70%
 60.2    12     78.80%
 58.9    13     76.90%
 56.3    14     75.00%
 54.5    15     73.00%
 52.3    16     71.10%
 48.9    17     69.20%
 46.5    18     65.30%
 46.5    18     65.30%
 43.7    20     63.40%
 39.2    21     61.50%
 38.6    22     59.60%
 37.4    23     57.60%
 37.2    24     55.70%
 36.2    25     53.80%
 31.4    26     50.00%
 31.4    26     50.00%
 30      28     48.00%
 22.1    29     46.10%
 20.9    30     44.20%
 20.1    31     42.30%
 20      32     40.30%
 19.7    33     38.40%
 19.5    34     36.50%
 18.8    35     34.60%
 18.5    36     32.60%
 18.3    37     30.70%
 18.1    38     28.80%
 17.8    39     26.90%
 16.3    40     25.00%
 15.4    41     23.00%
 15.3    42     21.10%
 14.6    43     19.20%
 12.6    44     17.30%
 11.7    45     15.30%
 11.1    46     13.40%
 10.5    47      9.60%
 10.5    47      9.60%
 7.45    49      7.60%
 6.82    50      5.70%
 6.67    51      3.80%
 4.93    52      1.90%
 4.72    53      0.00%
Statistical descriptors
1. Descriptors of central tendency
Arithmetic mean: average value of measurements

$$ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i $$
For Cl: arithmetic mean = 43.72 mg/l

• More representative of the sample, sensitive to outliers in a
small sample.
• Most distributions are sufficiently characterised by the mean
and the variance.
Statistical descriptors
1. Descriptors of central tendency
Geometric mean

$$ x_g = \sqrt[n]{x_1 x_2 \cdots x_n} $$
$$ \log(x_g) = \frac{1}{n}\sum_{i=1}^{n}\log(x_i) $$
For Cl: geometric mean = 29.53 mg/l
Not applicable for negative values. Hydrogeological variables
are often not symmetrically distributed, while their log
transformations are. Then the geometric mean is applicable.
The radius of a grain with main axes a, b, and c is characterised
best by the cube root of a*b*c.
Statistical descriptors
1. Descriptors of central tendency
Harmonic mean

$$ x_h = \frac{1}{\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}} $$

For Cl: harmonic mean = 20.08 mg/l

Appropriate for phenomena where small values are more
important (e.g. hydraulic conductance; see lecture notes from
Zhou, 2006)
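The three means can be computed side by side. The sketch below uses the 53 Cl values from the ranked Cl table of this document; the results can be compared with the values quoted for Cl on the slides above:

```python
import math

# Cl concentrations (mg/l) from the ranked Cl table of this document (53 values)
cl = [
    184, 181, 160, 112, 109, 90.7, 88.3, 85, 73.3, 61.4, 60.9, 60.2, 58.9, 56.3,
    54.5, 52.3, 48.9, 46.5, 46.5, 43.7, 39.2, 38.6, 37.4, 37.2, 36.2, 31.4, 31.4,
    30, 22.1, 20.9, 20.1, 20, 19.7, 19.5, 18.8, 18.5, 18.3, 18.1, 17.8, 16.3,
    15.4, 15.3, 14.6, 12.6, 11.7, 11.1, 10.5, 10.5, 7.45, 6.82, 6.67, 4.93, 4.72,
]

n = len(cl)
arithmetic = sum(cl) / n
geometric = math.exp(sum(math.log(x) for x in cl) / n)   # exp of the mean of log(x)
harmonic = n / sum(1.0 / x for x in cl)

print(round(arithmetic, 2), round(geometric, 2), round(harmonic, 2))
```

Computing the geometric mean through logarithms avoids overflow from multiplying many values together.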
Statistical descriptors
Descriptors          Summary of properties
--------------------------------------------------------------------------------
Mode                 indication of abundant values, isolated property,
                     not applicable for distributions with several peaks.
--------------------------------------------------------------------------------
Median               insensitive to the tails of the distribution, preferable
                     for data sets with exceptional values.
--------------------------------------------------------------------------------
Arithmetic mean      more representative of the sample, sensitive to exceptional
                     values in a small sample. Most distributions are sufficiently
                     characterised by the mean and the variance.
--------------------------------------------------------------------------------
Geometric mean       not applicable for negative values. The radius of a grain with
                     main axes a, b, and c is characterised best by the cube root
                     of the product a b c.
--------------------------------------------------------------------------------
Harmonic mean        more appropriate for phenomena where small values are
                     more important.
Statistical descriptors
Relations between central tendency descriptors

x_h ≤ x_g ≤ x̄
The harmonic mean is smaller than the geometric mean, and the
geometric mean is, in turn, smaller than the arithmetic mean.
They are equal only if x1 = x2 = ... = xn.

x̄ = median
If the frequency distribution is symmetrical.
They are not equal when the distribution is not symmetrical.

log(x_g) = mean of log(x)
The mean of the log x distribution is equal to
the logarithm of the geometric mean of x.
Statistical descriptors
2. Descriptors of dispersion (variation)
Sample variance
$$ s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2 \qquad s^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2-\bar{x}^2 $$

For Cl: variance = 1768.72 (mg/l)²
Standard deviation: the square root of the
variance
For Cl: standard deviation = 42.06 mg/l
Statistical descriptors
2. Descriptors of dispersion (variation)
[Figure: two frequency (%) curves with mean = 0, one with standard deviation = 1 and one with standard deviation = 2, for x from -8 to 8]
Statistical descriptors
2. Descriptors of dispersion (variation)
Coefficient of variation
$$ C_v = \frac{s}{\bar{x}} $$
For Cl: coefficient of variation = 0.96

Useful to compare the variations of two or more
data sets.
Statistical descriptors
3. Descriptors of asymmetry
Coefficients of skewness

$$ P_1 = \frac{\bar{x} - \text{mode}}{s} \qquad P_2 = \frac{3(\bar{x} - \text{median})}{s} $$
Moments or product moments:
$$ a_3 = \frac{m_3}{m_2^{3/2}} \qquad \alpha_3 = \frac{n^2}{(n-1)(n-2)}\,a_3 $$

For Cl: coefficient of skewness α3 = 1.97
Statistical descriptors
Moments or product moments
Central moments
$$ m_k = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^k $$
Moments are statistical descriptors of a data set used for:
– 1st moment: mean or expected value, µx (“central tendency”)
– 2nd moment: variance, σ2x ; standard deviation σx is square root of
variance (“spread around the central value”)
– 3rd moment: skewness, γx
(“measure of symmetry”)
– 4th moment: kurtosis, κx
(“peakedness of central portion of distribution”)
Statistical descriptors
3. Descriptors of asymmetry
Symmetrical distribution: α3 = 0, Mode = Mean = Median
Skewed to the right: α3 > 0, Mode < Median < Mean
Skewed to the left: α3 < 0, Mean < Median < Mode
[Figure: three sketches of frequency against the X variable, one for each case]
Statistical descriptors
4. Descriptor of flatness/’peakedness’
Coefficient of kurtosis
$$ a_4 = \frac{m_4}{m_2^2} \qquad \alpha_4 = \frac{n^3}{(n-1)(n-2)(n-3)}\,a_4 $$

For Cl: coefficient of kurtosis = 7.04

α4=3: for Normal distribution
α4>3: steeper than Normal distribution
α4<3: flatter than Normal distribution
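The descriptors of dispersion, asymmetry and peakedness all follow from the central moments m_k. A minimal sketch with a small hypothetical sample (the function names and the data are illustrative, not from the course material):

```python
def central_moment(x, k):
    """m_k = (1/n) * sum((xi - mean)**k)."""
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** k for xi in x) / n

def skewness(x):
    """alpha_3 = n^2 / ((n-1)(n-2)) * m3 / m2^(3/2)."""
    n = len(x)
    a3 = central_moment(x, 3) / central_moment(x, 2) ** 1.5
    return n ** 2 / ((n - 1) * (n - 2)) * a3

def kurtosis(x):
    """alpha_4 = n^3 / ((n-1)(n-2)(n-3)) * m4 / m2^2."""
    n = len(x)
    a4 = central_moment(x, 4) / central_moment(x, 2) ** 2
    return n ** 3 / ((n - 1) * (n - 2) * (n - 3)) * a4

# Hypothetical sample, for illustration only
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
variance = central_moment(sample, 2)          # second central moment
std = variance ** 0.5
cv = std / (sum(sample) / len(sample))        # coefficient of variation
print(variance, std, cv, round(skewness(sample), 3), round(kurtosis(sample), 3))
```

The sample has a longer tail towards large values, so its skewness coefficient comes out positive; a symmetric sample gives zero.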
Normal probability paper
Sort the data from small to large
Assign each observation to i/(N+1), in which
i is the order number and N the total
number of data points
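The two steps can be sketched as follows. Converting each plotting position to a standard normal quantile gives the coordinate on the probability axis of normal probability paper; the function name and the observations are hypothetical:

```python
from statistics import NormalDist

def normal_probability_plot_points(data):
    """Return (value, plotting position i/(N+1), standard normal quantile) for sorted data."""
    ordered = sorted(data)
    n = len(ordered)
    points = []
    for i, value in enumerate(ordered, start=1):
        p = i / (n + 1)                     # plotting position of the i-th smallest value
        z = NormalDist().inv_cdf(p)         # position on the normal probability axis
        points.append((value, p, z))
    return points

# Hypothetical observations, for illustration only
for value, p, z in normal_probability_plot_points([3.1, 2.4, 4.0, 2.9, 3.5]):
    print(value, round(p, 3), round(z, 3))
```

If the data are approximately normal, the points (z, value) lie close to a straight line.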
Daily Mean Run-Off Anomalies at Achleiten, Danube River
[Figure: deseasonalised daily mean run-off against year, with the POT values marked]
Flood frequency analysis (peak flows)
Annual maximum series (more common)
– One can miss a large event if more than one occurs per year; but the series is
  continuous and easy to process
– Often used for estimating extremes in long records (>10 years)
Partial duration series (“Peaks-Over-Threshold, POT”)
– Definition of the threshold is tricky and requires experience
– Often used for short records (<10 years)
[Figure: daily discharge (m3/s) at Stah, 25-Mar-60 to 19-Apr-01, with the threshold marked]
Flood frequency analysis (peak flows):
Annual max. series vs. partial duration series
(Davie, 2002)
Annual max. series vs. partial
duration series (POT)
Langbein showed the following relationship
(Chow 1964):

1/T = 1 - e^(-1/Tp)
T : return period using annual max. series
Tp: return period using partial duration series

Differences get smaller for larger return periods
(less than 1% difference for a 10-year
recurrence interval)!
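The relationship can be tabulated to see how the two return periods converge. A short sketch; the function name is an assumption:

```python
import math

def annual_series_return_period(tp):
    """Return T (annual maximum series) from Tp (partial duration series): 1/T = 1 - exp(-1/Tp)."""
    return 1.0 / (1.0 - math.exp(-1.0 / tp))

for tp in (1, 2, 5, 10, 50, 100):
    t = annual_series_return_period(tp)
    print(tp, round(t, 2), round(t - tp, 2))
```

The absolute gap T - Tp stays close to half a year, so the relative difference shrinks as the return period grows.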
Assumptions of frequency analysis
All data points are correct and precisely measured
– Be aware of the uncertainty of peak flow data (uncertainty and errors come
later in this course!)
Independent events: peak flows are not part of the same event
– Carefully check the data set; plot the whole record, in particular all events of
the POT series
– Problems with events at the transition of the year (31 Dec – 1 Jan) in humid
temperate or some tropical climates
Random sample: every value in the population has an equal chance of being
included in the sample
The hydrological regime has remained static during the complete time period
of the record
– No land use change, no climate change, no changes in the river channels, no
change in the flood water management etc. in the catchment (often not the
case for long records!)
All floods originate from the same statistical population (homogeneity)
– Different flood generating mechanisms (e.g. rain storms, snow melt, snow-on-ice
etc.) might cause floods with different frequencies/recurrence intervals
Describing the frequency mathematically:
Probability Distribution Function
Typically defined in either of two forms:
• Probability density function (PDF)
• Cumulative distribution function (CDF)

Discrete
PDF: $P(X = x) = p(x) = p_x$
CDF: $F(x) = \sum_{i=0}^{x} f(i)$

Continuous
PDF: $P(a \le x \le b) = \int_a^b f(x)\,dx$
CDF: $F(x) = \int_{-\infty}^{x} f(m)\,dm$
Basics (examples using measured flow, Q)
Probability of exceedance, P(X): probability that the flow Q is greater than or equal to X; P(X) ∈ [0,1]
Relative frequency, F(X): probability of flow Q being less than a value X; F(X) ∈ [0,1]. Can be read from a cumulative probability curve, but be careful with the selected class intervals.
Average recurrence interval or return period, T(X): statistical term meaning that X is exceeded on average once every T years over a long record (time step is usually one year).
– Not exactly the number of years that lie between events of a certain size!
– Rather the average number of years between years in which the flow is greater than X!
– No regularity or periodicity in occurrences of exceedances (assumption)
P(X) = 1 – F(X)
T(X) = 1/P(X) = 1/(1 – F(X))
[Figure: PDF and CDF sketches marking the relative frequency F(X) and the probability of exceedance P(X)]
Recurrence intervals for design
purposes (flood protection) in Germany
Class 1
– Settlements, urban areas, important infrastructure: 50-100 years
Class 2
– Single buildings, areas that are not permanently inhabited: 25-50 years
Class 3
– Farm land, intensively used: 10-25 years
Class 4
– Farm land: 5-10 years

What about large dams, nuclear power plants etc.? PMF (probable maximum flood)

(according to DIN 19700, part 99)
Exceedence probability for a specified number of
time intervals (see Box C-4, page 561 in Dingman, 2002)
Examples: Exceedance probability and return period (based on Box C-4 in Dingman, 2002)
What is the probability that a flood greater than or equal to the 100-year flood will occur next year?
– P(X) = 1/T(X) = 0.01
What is the probability that we will not have a flood greater than or equal to the 50-year flood next year?
– F(X) = 1 – P(X) = 1 – 0.02 = 0.98
What is the probability that we will not have a flood greater than or equal to the 20-year flood in the next 5 years?
– F(X)^n = (1 – P(X))^n = 0.95^5 = 0.774
What is the probability that the next exceedance of the 100-year flood will occur in the 10th year from now?
– p = (1 – 0.01)^9 x 0.01 = 0.00914
What is the probability that the 100-year flood will be exceeded at least once in the next 40 years?
– p = 1 – (1 – 0.01)^40 = 0.331
What is the probability that the 50-year flood will be exceeded twice in a row (two independent events in one year), and how many 50-year floods can be expected on average in 1000 years?
– p = 0.02 x 0.02 = 0.0004; and on average 20 floods in 1000 years.
(Bedient and Huber, 2002)
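The exceedance examples can be checked with a few lines of code. This is a sketch assuming independent years, each with exceedance probability 1/T; the function names are illustrative.

```python
def prob_no_exceedance(T, n):
    """Probability that the T-year flood is not reached in n consecutive years."""
    return (1.0 - 1.0 / T) ** n

def prob_at_least_one(T, n):
    """Probability of at least one exceedance of the T-year flood in n years."""
    return 1.0 - prob_no_exceedance(T, n)

def prob_first_in_year(T, k):
    """Probability that the next exceedance occurs exactly in year k."""
    p = 1.0 / T
    return (1.0 - p) ** (k - 1) * p

def expected_count(T, years):
    """Average number of T-year floods in a period of the given length."""
    return years / T

print("no 20-year flood in 5 years:         ", round(prob_no_exceedance(20, 5), 3))
print("100-year flood first in year 10:     ", round(prob_first_in_year(100, 10), 5))
print("100-year flood at least once in 40 y:", round(prob_at_least_one(100, 40), 3))
print("50-year floods in 1000 years:        ", expected_count(50, 1000))
```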
But how do we estimate P(X) and F(X) from data?

Example: Flood frequency analysis

[Figure: daily discharge (m3/s) at Stah, 25-Mar-60 to 19-Apr-01]
Example: Annual max. series for the river Wye (1971-97)
(not Normal-distributed!)

(Davie, 2002)
Plotting position – Weibull formula
Rank the annual maximum series data from low to high
(independent data, the year of occurrence is irrelevant)
Calculate F(X) with the rank r and N total data points (i.e.
length of record: N years)

F(X) = r/(N+1)

– For example: the largest value of a 25-year record would plot at a recurrence interval of 26 years.
– F(X) can never reach 1
– If you rank from high to low, r/(N+1) gives P(X) = 1 – F(X) instead
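A short sketch of the Weibull plotting position applied to an annual maximum series. The flow values are made up for illustration only.

```python
def weibull_positions(flows):
    """Rank flows from low to high and return (flow, F, T) for each value."""
    ranked = sorted(flows)
    n = len(ranked)
    rows = []
    for r, q in enumerate(ranked, start=1):
        f = r / (n + 1)          # relative frequency F(X)
        t = 1.0 / (1.0 - f)      # recurrence interval T(X) = 1 / P(X)
        rows.append((q, f, t))
    return rows

# Hypothetical annual maxima in m3/s (not the Wye or Stah data)
annual_maxima = [12.4, 30.1, 18.7, 22.0, 15.3, 27.6, 19.9, 41.2, 16.8, 24.5]
for q, f, t in weibull_positions(annual_maxima):
    print(f"Q = {q:5.1f}  F = {f:.3f}  T = {t:5.2f} yr")
```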
Gringorten formula [F(X) = p in the Workshop course notes]
F(X) = (r-0.44) / (N+0.12)
The difference from the Weibull formula is often small
Which of the two is used often comes down to personal preference
The empirical constants (0.44 and 0.12) are valid for the Gumbel distribution

Comparison of Weibull and Gringorten
formulae (Davie, 2002)
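The two plotting-position formulae can be compared rank by rank. The sketch below assumes a record length of N = 27 (an assumption chosen to match the span 1971-97) and prints F(X) and the recurrence interval for a few ranks.

```python
def weibull(r, n):
    """Weibull plotting position for rank r (ranked low to high) of n values."""
    return r / (n + 1)

def gringorten(r, n):
    """Gringorten plotting position; constants valid for the Gumbel distribution."""
    return (r - 0.44) / (n + 0.12)

n = 27  # assumed record length
for r in (1, 2, 14, 26, 27):
    fw, fg = weibull(r, n), gringorten(r, n)
    print(f"r = {r:2d}  Weibull F = {fw:.3f} (T = {1 / (1 - fw):5.1f})"
          f"  Gringorten F = {fg:.3f} (T = {1 / (1 - fg):5.1f})")
```

The two formulae agree closely in the middle of the ranking and diverge at the extremes, where Gringorten assigns the largest value a longer recurrence interval.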
Example: Annual max. series for the river Wye (1971-97)
(not Normal-distributed!)
Reliability is good!
(Davie, 2002)
Extrapolation beyond the data set
The Weibull or Gringorten formulae are only good for flood frequency estimates for flows within the measured record, and even there they are unreliable near either limiting value
For extrapolation, a probability distribution must be fitted. Estimate the parameters through:
1. Method of moments (widely used)
2. Method of L-moments (less widely used, quite complex)
3. Method of maximum likelihood (not widely used)
An alternative is a graphical approach to fit the distribution (subjective approach)
The choice of the distribution function is often based on personal preference (but always take the distribution that fits your data best in a particular region); sometimes there are guidelines (depending on the region)
Extreme values are usually not normally distributed; mean annual flows in humid areas, however, often are
Method of moments – Example: Gumbel distribution
Product moments are statistical descriptors of a data set (characterize the probability distribution):
– 1st moment: mean or expected value, µx (“central tendency”)
– 2nd moment: variance, σx²; standard deviation σx is the square root of the variance (“spread around the central value”)
– 3rd moment: skewness, γx (“measure of symmetry”)
– 4th moment: kurtosis, κx (“peakedness of central portion of distribution”)
– Coefficient of variation: measure of spread, CV = σx/µx
[Figure: sketches of skewed distributions with γx > 0 and γx < 0]
(L-moments are used for small sample sizes; see Dingman (2002), Appendix C)
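A minimal sketch of the product moments for a sample. It assumes the simple estimators that divide by n (no small-sample bias correction), which is one of several conventions; the data are made up.

```python
import math

def product_moments(data):
    """Mean, variance, skewness, kurtosis and CV from central moments (divided by n)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # 2nd central moment
    m3 = sum((x - mean) ** 3 for x in data) / n   # 3rd central moment
    m4 = sum((x - mean) ** 4 for x in data) / n   # 4th central moment
    sd = math.sqrt(m2)
    return {
        "mean": mean,
        "variance": m2,
        "std": sd,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
        "cv": sd / mean,
    }

# Hypothetical annual maxima in m3/s
sample = [12.4, 30.1, 18.7, 22.0, 15.3, 27.6, 19.9, 41.2, 16.8, 24.5]
for name, value in product_moments(sample).items():
    print(f"{name:9s} {value:8.3f}")
```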
Method of moments – Gumbel distribution
F(X) = exp(-exp(-b(X-a)))
a = mean(Q) - 0.5772/b
b = π/(σQ · 6^0.5)

F(X) leads to P(X) and T(X) for a certain size of flow X
Re-arranging the formulae leads to the size of flow for a
given recurrence interval:

X = a – (1/b) ln(ln(T(X)/(T(X)-1)))

Example: for the 50-year flood, you need to compute the natural logarithm of
50/49 and then the natural logarithm of this result. The parameters a and b
are estimated from the sample.
Rule of thumb: Do not extrapolate recurrence intervals
beyond twice the length of your stream flow record
Example: Annual max. series for the river Wye
(1971-97); values required for the Gumbel formula

Mean (Q)      Standard deviation (σQ)         a value        b value
21.21            6.91                        18.11            0.19

Applying the method of moments and Gumbel formula to the
data gives some interesting results. The values used in the formula
are shown in the table above and can be easily computed. When
the formula is applied to find the flow values for an average
recurrence interval of 50 years it is calculated as 39.1 m3/s. This is
less than the largest flow during the record which under the Weibull
formula has an average recurrence interval of 29 years. This
discrepancy is due to the method of moments formula treating the
highest flow as an extreme outlier. If we invert the formula we can
calculate that a flood with a flow of 48.87 m3/s (the largest on
record) has an average recurrence interval of around three
hundred years.

(Davie, 2002)
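The Wye numbers can be reproduced from the mean and standard deviation in the table. This is a sketch of the method-of-moments Gumbel formulae; the function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772  # constant used in the formula for a

def gumbel_parameters(mean_q, sd_q):
    """Method-of-moments estimates of a and b in F(X) = exp(-exp(-b(X-a)))."""
    b = math.pi / (sd_q * math.sqrt(6.0))
    a = mean_q - EULER_GAMMA / b
    return a, b

def flow_for_return_period(a, b, T):
    """Flow X with an average recurrence interval of T years."""
    return a - (1.0 / b) * math.log(math.log(T / (T - 1.0)))

def return_period_for_flow(a, b, x):
    """Average recurrence interval T(X) = 1 / (1 - F(X)) of a flow x."""
    F = math.exp(-math.exp(-b * (x - a)))
    return 1.0 / (1.0 - F)

a, b = gumbel_parameters(21.21, 6.91)   # mean and standard deviation from the table
print(f"a = {a:.2f}, b = {b:.2f}")
print(f"50-year flow: {flow_for_return_period(a, b, 50):.1f} m3/s")
print(f"T of 48.87 m3/s: {return_period_for_flow(a, b, 48.87):.0f} years")
```

Small differences in the last digit against the table are expected because the tabulated mean and standard deviation are rounded.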
Distributions often used in hydrology (1/2)

(Dingman, 2002)
Distributions often used in hydrology (2/2)

(Dingman, 2002)
Use of common distributions in hydrology
Flood frequency analysis
Most commonly applied are the Exponential (EXP), Log-Normal (LN), Log-
Pearson 3 (LP3) and Generalised Extreme Value (GEV) distributions.

In practice
The choice of probability distribution may be dictated by mathematical
convenience or by familiarity with a certain distribution (“personal bias”).
Sample estimators must be adopted in order to obtain estimates of the statistics
for determining the distribution parameters. In some cases, more than one
distribution may fit the available data equally well.

General three-step procedure
1) A suitable form of standard frequency distribution is chosen to represent the
observations;
2) the chosen distribution is fitted to the data by determining values for its
parameters; and
3) the required quantiles are computed from the fitted cumulative distribution
function (CDF).
Distributions often used in hydrology
1. Normal distribution (ND)
Probably the most important distribution, but often not useful for hydrological extremes
PDF: $f(x) = \frac{1}{a\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x-c}{a}\right)^2\right]$
a: scale parameter = standard deviation, σ
c: location parameter = mean, μ
Standard Normal Distribution (S-ND): $f(y) = \frac{e^{-y^2/2}}{\sqrt{2\pi}}$
CDF: $G(y) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} \exp\left(-\frac{u^2}{2}\right) du$, with $y = \frac{x - m}{s}$
(compare lecture of Dr. Zhou)
Location parameters
A probability distribution is characterized by location and scale parameters.
Location parameter equal to zero and scale parameter equal to one (Standard-
ND) vs. ND with a location parameter of 10 and a scale parameter of 1.

Scale Parameter
The next plot has a scale parameter of 3 (location parameter is zero). The
effect is that the graph is ‘stretched out’ .
Probability Distribution Functions
2.       The Lognormal distribution
y = ln(x)
If y follows the Normal distribution, x follows the
Log-normal distribution.

Lognormal probability density function:

$f(x) = \frac{1}{x\, s_y \sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln(x) - m_y}{s_y}\right)^2\right]$ when x > 0

f(x) = 0 elsewhere
[Figure: lognormal distribution function; frequency against X variable (0 to 5) for standard deviation = 1 and standard deviation = 2]
Lognormal distribution is skewed to the right!
[Figure: effects of shape parameters for the lognormal distribution; PDF and CDF panels]
Processing Log-normal Distribution Function
Histogram indicates a positively skewed distribution
Plot of cumulative frequency on Lognormal probability paper shows a straight line
Take the logarithmic transformation of the data: y = ln(x)
Calculate the sample mean and standard deviation of the logarithmic values
Carry out the analysis on the logarithmic values
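The processing steps above can be sketched as follows. The flows are hypothetical, and the standard normal quantile comes from Python's `statistics.NormalDist`, used here instead of probability paper or tables.

```python
import math
from statistics import NormalDist, mean, stdev

def fit_lognormal(flows):
    """Sample mean and standard deviation of y = ln(x)."""
    logs = [math.log(q) for q in flows]
    return mean(logs), stdev(logs)

def lognormal_quantile(m_y, s_y, F):
    """Flow with non-exceedance probability F: back-transform of m_y + z * s_y."""
    z = NormalDist().inv_cdf(F)   # standard normal variate for F
    return math.exp(m_y + s_y * z)

# Hypothetical annual maxima in m3/s
flows = [12.4, 30.1, 18.7, 22.0, 15.3, 27.6, 19.9, 41.2, 16.8, 24.5]
m_y, s_y = fit_lognormal(flows)
print(f"m_y = {m_y:.3f}, s_y = {s_y:.3f}")
print(f"median flow: {lognormal_quantile(m_y, s_y, 0.5):.1f} m3/s")
print(f"50-year flow (F = 0.98): {lognormal_quantile(m_y, s_y, 0.98):.1f} m3/s")
```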
3. Pearson type III distribution
(Often used for flood frequencies)
PDF: $f(x) = \frac{1}{a^b\,\Gamma(b)} (x-c)^{b-1} \exp\left(-\frac{x-c}{a}\right)$

a - scale parameter, b - shape parameter, c - location parameter
Г( ) – Gamma function: $\Gamma(b) = \int_0^{\infty} t^{b-1} e^{-t}\,dt$

– commonly fitted to the logarithms of floods (so-called log-Pearson type III distribution)

4. Gamma Distribution (when c=0)
[Figure: effects of the shape parameter of the gamma distribution; PDF and CDF panels]
5. Exponential distribution
Particularly useful when applying partial duration series

PDF: $f(x) = \lambda \exp(-\lambda x)$
With a mean of 1/λ, a variance of 1/λ², and a skewness of 2.

Standard Exponential Distribution

CDF: $G(y) = 1 - \exp(-y)$
[Figure: plots of the exponential distribution; CDF and PDF panels]
Example: Exponential distribution applied to
storm interval times (from Bedient & Huber 2002)
6. General extreme value (GEV) distribution

CDF: $F(x_1) = \exp\left\{-\left[1 - \frac{k(x_1 - c)}{a}\right]^{1/k}\right\}$ for $k \neq 0$

c - location parameter, a - scale parameter, k - shape parameter

k=0 Extreme value type I (EV1) (Gumbel)
k<0 Extreme value type II (EV2)
k>0 Extreme value type III (EV3), closely related to the Weibull distribution
[Figure: comparison of the three types of GEV distributions (PDF): Type II (k<0), Type I (k=0) and Type III (k>0), with x = c marked]
If the sample for which frequency distribution is required exhibits
skewness, a three-parameter distribution is useful (e.g. GEV).
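A sketch of the GEV CDF and its inverse in the parameterisation used here (c location, a scale, k shape), with k = 0 handled as the Gumbel limit. The parameter values in the example are hypothetical.

```python
import math

def gev_cdf(x, c, a, k):
    """GEV non-exceedance probability; k = 0 gives the Gumbel (EV1) case."""
    if k == 0.0:
        return math.exp(-math.exp(-(x - c) / a))
    z = 1.0 - k * (x - c) / a
    if z <= 0.0:
        # Outside the support: below the lower bound (k < 0) or above the upper bound (k > 0)
        return 0.0 if k < 0.0 else 1.0
    return math.exp(-z ** (1.0 / k))

def gev_quantile(F, c, a, k):
    """Value x with non-exceedance probability F (0 < F < 1)."""
    if k == 0.0:
        return c - a * math.log(-math.log(F))
    return c + (a / k) * (1.0 - (-math.log(F)) ** k)

# Hypothetical parameters: same location and scale, three shapes
for k in (-0.2, 0.0, 0.2):
    print(f"k = {k:+.1f}: 100-year value = {gev_quantile(0.99, 60.0, 20.0, k):.1f}")
```

With equal location and scale, the type II (k<0) case gives the largest extreme quantiles and the type III (k>0) case the smallest, because the latter has an upper bound.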
General Extreme Value distribution
Type I (= Gumbel distribution)
-> widely used for annual maximum series!
(Note: the parameters are defined slightly differently than in the example of the river Wye from Davie (2002): c here corresponds to a there, and a here to 1/b)
CDF: $F(x_1) = \exp\left[-\exp\left(-\frac{x_1 - c}{a}\right)\right]$
PDF: $f(x_1) = \frac{1}{a} \exp\left[-\frac{x_1 - c}{a} - \exp\left(-\frac{x_1 - c}{a}\right)\right]$
CDF (standardized): $G(y_1) = \exp[-\exp(-y_1)]$, with $y_1 = \frac{x_1 - c}{a}$
[Figure: plots of the Gumbel distribution; CDF and PDF panels]
How good is the fit of the distribution function?
• Graphic check: visual check of the plotted graph (How well are the observations reproduced by the fitted PDF/CDF?)
• Mathematical check: statistical test to determine the goodness of fit
- chi-square (χ2) test, based on the PDF (needs much data, depends on classified intervals)
- Kolmogorov-Smirnov (K-S) test, based on the CDF:
– Not necessary to divide the data into intervals; thus error associated with the number and size of intervals is avoided.
– Good if n>35 and even better if n>50.
– Quick and easy, but only one value (the largest deviation) is considered
- Unfortunately, often several distributions provide acceptable fits to the available data (no identification of the “true” or “best” distribution); confidence limits are too large
Graphical method
(Example of the river Wye)

Is this fit suitable for the
whole data set?

Frequency of flows less than a value X. The F(X) values on the
x-axis have undergone a transformation to fit the Gumbel distribution;
called ‘reduced variate’ (cf. Workshop in Hydrology).
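A sketch of how the points for such a Gumbel plot can be prepared. The reduced variate is y = -ln(-ln F); the use of the Gringorten plotting position for F is an assumption made here, and the flows are made up.

```python
import math

def reduced_variate(F):
    """Gumbel reduced variate for a non-exceedance probability F."""
    return -math.log(-math.log(F))

def gringorten(r, n):
    """Gringorten plotting position for rank r (low to high) of n values."""
    return (r - 0.44) / (n + 0.12)

def gumbel_plot_points(flows):
    """(reduced variate, flow) pairs; a Gumbel sample plots close to a straight line."""
    ranked = sorted(flows)
    n = len(ranked)
    return [(reduced_variate(gringorten(r, n)), q) for r, q in enumerate(ranked, 1)]

# Hypothetical annual maxima in m3/s
for y, q in gumbel_plot_points([12.4, 30.1, 18.7, 22.0, 15.3, 27.6, 19.9, 41.2]):
    print(f"y = {y:6.3f}  Q = {q:5.1f}")
```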
Comparison of different PDFs
(Significant differences in particular for the extremes!)
Example: Kolmogorov-Smirnov (K-S) tests
(according to Schoenwiese 2000)

$D_n = \max_{1 \le i \le n} \left| F_X(x_i) - S_n(x_i) \right|$ (see sketch on black board)
$P(D_n \le D_n^{\alpha}) = 1 - \alpha$

FX(xi) - CDF of the assumed distribution
Sn(xi) - CDF of the observed ordered sample

If Dn ≤ the tabulated value Dnα (see below), the assumed distribution is acceptable at the significance level α (n: sample size).

α      0.20         0.10         0.05         0.01         0.001
Dnα    1.073/n^0.5  1.224/n^0.5  1.358/n^0.5  1.628/n^0.5  1.949/n^0.5
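A sketch of the K-S check for an assumed CDF. Two assumptions are made here: the deviation is evaluated on both sides of each step of the empirical CDF (a common refinement of the maximum over the sample points), and the critical value uses the large-sample approximation sqrt(-0.5 ln(α/2)) / n^0.5, which reproduces coefficients such as 1.358 for α = 0.05. The sample and the Gumbel parameters are hypothetical.

```python
import math

def ks_statistic(sample, cdf):
    """Largest absolute difference between the assumed CDF and the empirical CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        fx = cdf(x)
        # The empirical CDF jumps from (i-1)/n to i/n at x, so check both sides
        d = max(d, abs(fx - i / n), abs(fx - (i - 1) / n))
    return d

def ks_critical(n, alpha):
    """Large-sample critical value of D_n at significance level alpha."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) / math.sqrt(n)

# Hypothetical sample and assumed Gumbel CDF (location 18.0, scale 5.4)
sample = [12.4, 30.1, 18.7, 22.0, 15.3, 27.6, 19.9, 41.2, 16.8, 24.5]
gumbel = lambda x: math.exp(-math.exp(-(x - 18.0) / 5.4))

d_n = ks_statistic(sample, gumbel)
d_crit = ks_critical(len(sample), 0.05)
print(f"D_n = {d_n:.3f}, critical value = {d_crit:.3f}, acceptable: {d_n <= d_crit}")
```

A sample of ten values is used only to keep the example short; the slide recommends n > 35 for the test.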
Variability of quantile estimates –
confidence limits
Sources of errors
• assumption of a particular distribution (cannot be quantified)
• sampling errors in estimation of the parameters of the
distribution (quantifiable through standard errors)

Confidence limits (CL): 100(1-α)% confidence intervals
$CL(Q_T) = Q_T \pm t_{1-\alpha/2}\, SE_q$
Standard error: $SE_q = \frac{C\sigma}{\sqrt{n}}$
QT : quantile
C : constant (see Workshop page 13; course notes from Hall page 31f)
σ : standard deviation from a sample of size n
t1-α/2 : value of Student’s t-distribution for a 97.5% level of confidence (two-tailed test) and (n-1) degrees of freedom; tabulated
Example: Lognormal with 90%-confidence limits

(Bedient & Huber 2002)
Example: Estimation of confidence limits
(according to Schoenwiese 2000)
The mean annual temperature at the Hohenpeissenberg, Germany, for the period 1954-1970 (n=17; Normal-distributed, C=1) is 6.24 °C with a standard deviation of 0.73 °C.
$CL(Q_T) = Q_T \pm t_{1-\alpha/2}\, SE_q$
The confidence limits (α = 5%) can be calculated as:

CL (mean annual temperature in °C) = 6.24 ± t97.5% (0.73/17^0.5) °C
= 6.24 ± 0.38 °C

At a confidence level of 95%, the mean annual temperature lies in the interval 6.24 ± 0.38 °C, thus between 5.86 and 6.62 °C.

Please note: if the record length had been 120 years (= n) with the same standard deviation, the interval would be 6.24 ± 0.13 °C.
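The temperature example can be recomputed with a short sketch. The Student's t values are passed in by hand; 2.120 (16 degrees of freedom) and 1.980 (119 degrees of freedom) are values read from a standard t table, not given on the slide.

```python
import math

def confidence_limits(estimate, sd, n, t_value, C=1.0):
    """Lower and upper limit: estimate -/+ t * SE, with SE = C * sd / sqrt(n)."""
    se = C * sd / math.sqrt(n)
    half_width = t_value * se
    return estimate - half_width, estimate + half_width

# n = 17 years, t(97.5%, 16 d.o.f.) = 2.120 from a t table
low, high = confidence_limits(6.24, 0.73, 17, 2.120)
print(f"n = 17:  {low:.2f} to {high:.2f} deg C")

# n = 120 years, t(97.5%, 119 d.o.f.) = 1.980 from a t table
low, high = confidence_limits(6.24, 0.73, 120, 1.980)
print(f"n = 120: {low:.2f} to {high:.2f} deg C")
```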
A few remarks on:
Low flow frequency analysis
Data required: annual minimum series
Problem of independent events; do not split the year in the middle of a low flow period (low flow periods can be long)
Often zero values (e.g. in arid climates or cold climates)
There is a finite limit on how low a low flow can be (no negative flows!)
– Different statistical treatment of the data
– Fit an exponential distribution rather than, for instance, a log-normal distribution
– Other often used distributions are the Weibull, Gumbel, Pearson Type III, and log-normal distributions
A few remarks on:
Low flow frequency analysis

Figure 7.15 Two probability density functions. The usual log-normal distribution (solid line) is contrasted with the truncated log-normal distribution (dashed line) that is possible with low flows (where the minimum flow can equal zero).
Figure 7.16 Probability values (calculated from the Weibull sorting formula) plotted on a log scale against values of annual minimum flow (hypothetical values).
Application in the Rur river (7-1)
Station Stah (Germany)
area: 2245 km2
record: 1953 to 2001
[Figure: daily discharge (m3/s) at Stah, 25-Mar-60 to 19-Apr-01]

Prepared by Tu Min (2004)
Application in the Rur river (7-2)
Annual flood peaks (1954 - 2001), the water years (Nov - Oct)
[Figure: annual peak discharge (m3/s), Stah on the Roer, 1950 to 2000]
Homogeneity: statistical tests (change point)
[Figure: annual peak discharge (m3/s), Stah on the Roer, 1950 to 2000, with a change point at 1980 and the levels 66 and 88 marked]
Application in the Rur river (7-3)
Flood frequency analysis
Example – Normal distribution
water year  peak Q (m3/s)  rank (m)  Q in order  p=(m-0.375)/(N+0.25)  t       Xcal   UCL    LCL
1954        65             1         28          0.013                 -2.228  12.70  28.24  -2.83
1955        76             2         31          0.034                 -1.829  24.05  37.65  10.44
1956        66             3         34          0.054                 -1.604  30.47  43.04  17.90
1957        59             4         34          0.075                 -1.439  35.17  47.03  23.32
1958        71             5         36          0.096                 -1.306  38.96  50.26  27.66
1959        65             6         44          0.117                 -1.192  42.19  53.04  31.33

N = 48, μ = 76.1, σ = 28.5, t97.5% = 2.011
[Figure: Normal fit; peak (m3/s) against standard variate, showing observed values, fitted line and 95% CL]
Application in the Rur river (7-4)
Flood frequency analysis
Example – LN distribution

Y = ln(X)
N = 48, μy = 4.3, σy = 0.4, t97.5% = 2.011
[Figure: LN fit; Ln(Q) against standard variate, showing observed values, fitted line and 95% CL]
Application in the Rur river (7-5)
Flood frequency analysis
Example – LN3 distribution

$Y = \ln(X - \xi)$, with $\xi = \frac{x_{min}\,x_{max} - x_{med}^2}{x_{min} + x_{max} - 2\,x_{med}}$
N = 48, Xmin = 27.6, Xmax = 139.3, Xmed = 73.1
μy = 5.0, σy = 0.2, t97.5% = 2.011
[Figure: LN3 fit; Ln(Q+73) against standard variate, showing observed values, fitted line and 95% CL]
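The shift parameter of the LN3 transformation follows directly from the minimum, maximum and median on this slide. A sketch, with an illustrative function name:

```python
def ln3_shift(x_min, x_max, x_med):
    """Shift estimate (x_min * x_max - x_med^2) / (x_min + x_max - 2 * x_med)."""
    return (x_min * x_max - x_med ** 2) / (x_min + x_max - 2.0 * x_med)

shift = ln3_shift(27.6, 139.3, 73.1)
print(f"shift = {shift:.1f} m3/s, so the transformed variable is ln(Q + {-shift:.1f})")
```

The shift comes out negative, so the logarithm is taken of the flow plus a positive offset; the result is close to the offset of 73 shown in the axis label Ln (Q+73), with the small gap presumably due to rounding of the inputs.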
Application in the Rur river (7-6)
Flood frequency analysis
Example – Gumbel distribution
Distribution fitting: N = 48, μ = 76.1, σ² = 810.9, α = 22.2, ζ = 63.3

water year  peak Q (m3/s)  rank (m)  Q in order  Fi     Yi      Xest  biased var(Xest)  LCL   UCL
1954        65             1         28          0.012  -1.494  30.2  34.2              18.4  41.9
1955        76             2         31          0.032  -1.232  36.0  26.7              25.6  46.3
1956        66             3         34          0.053  -1.076  39.4  22.9              29.8  49.1
1957        59             4         34          0.074  -0.957  42.1  20.4              33.0  51.2
1958        71             5         36          0.095  -0.857  44.3  18.6              35.6  53.0

[Figure: Gumbel fits; peak discharge (m3/s) against reduced Gumbel variate for 1954-2001 (observed, fitted, 95% CL), 1980-2001 and 1974-1979]
Application in the Rur river (7-7)
Flood frequency analysis
Example – Gumbel distribution
Magnitude of T-year flood

$Q_T = a\,y + c$, with $y = -\ln\left[-\ln\left(1 - \frac{1}{T}\right)\right]$ (here a = α and c = ζ)

         1954-2001  1958-1979  1980-2001
α =      22.2       22.7       20.4
ζ =      63.3       52.9       76.2
N-1 = 47, t97.5% = 2.01

Return period (yr)  Reduced variate (yi)  1954-2001  1958-1979  1980-2001  Var (Est)  LCL    UCL
5                   1.50                  96.6       86.9       106.9      40.4       83.8   109.4
10                  2.25                  113.3      104.0      122.2      73.7       96.0   130.5
25                  3.20                  134.3      125.5      141.6      133.9      111.1  157.6
50                  3.90                  150.0      141.4      156.0      191.7      122.1  177.8
75                  4.31                  159.0      150.7      164.4      230.4      128.5  189.5
100                 4.60                  165.5      157.3      170.3      260.1      133.0  197.9
(Var (Est), LCL and UCL refer to 1954-2001)

[Figure: magnitude of T-year flood; discharge (m3/s) against reduced Gumbel variate for 58-79, 80-01 and 54-01, with LCL (54-01) and UCL (54-01)]
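The 1954-2001 column of the table can be approximately recomputed from α = 22.2 and ζ = 63.3. In this sketch the variances of the estimates are copied from the Var (Est) column, since the slide does not give the formula behind them; the function names are illustrative.

```python
import math

def reduced_variate(T):
    """Gumbel reduced variate y for a return period of T years."""
    return -math.log(-math.log(1.0 - 1.0 / T))

def t_year_flood(T, scale, location):
    """Magnitude of the T-year flood: location + scale * y."""
    return location + scale * reduced_variate(T)

def confidence_band(estimate, variance, t_value):
    """Lower and upper confidence limit from the variance of the estimate."""
    half_width = t_value * math.sqrt(variance)
    return estimate - half_width, estimate + half_width

ALPHA, ZETA = 22.2, 63.3   # 1954-2001 parameters from the table
T_975 = 2.01               # t97.5% for N-1 = 47
var_est = {5: 40.4, 10: 73.7, 25: 133.9, 50: 191.7, 75: 230.4, 100: 260.1}

for T, var in var_est.items():
    q = t_year_flood(T, ALPHA, ZETA)
    lcl, ucl = confidence_band(q, var, T_975)
    print(f"T = {T:3d} yr  y = {reduced_variate(T):.2f}  Q = {q:6.1f}"
          f"  LCL = {lcl:6.1f}  UCL = {ucl:6.1f}")
```

Results can differ from the table in the last digit because α and ζ are rounded.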
Take home messages
Frequency analysis, in particular of hydrological extremes, is a prerequisite for sustainable water resources management
The choice between annual max. series and partial duration series depends on the length of the record
Understanding of probability, recurrence interval and risk
Knowledge of often used statistical distributions
Calculation of confidence intervals
Test of the goodness of fit of a PDF or CDF
Special treatment of low flow values
Closure
•   probability and event
•   stochastic variables, cont. and discrete
•   Transformations
•   Joint distributions
•   Linear regression
•   Parameter estimation (with Bestfit)
Closure
What is the role of statistics in hydrology and water resources?
You now have knowledge of
– Frequency tables,
– Histograms,
– Distribution functions,
– Statistical descriptors, and
– Standard Normal Distribution and Log-Normal distribution.

Are you now well prepared for the two
assignments?
Assignment 1: Basic Statistics
Use the data set from Van Gelder’s website
1. Calculate the range and estimate a reasonable number of intervals as
well as class limits.
2. Calculate the relative and absolute frequencies.
3. Make a histogram and cumulative frequency distribution, hand drawn
on linear paper (figures). Interpret this briefly (one sentence!)
4. Calculate the median, mode, and arithmetic mean.
5. Calculate the variance, standard deviation and coefficient of variation.
6. Calculate the skewness and kurtosis. Interpret the results briefly (one
sentence!)
7. Plot the data as a graph on normal probability paper (figure). Is the
Normal-Distribution suitable for that data set? Compare the mean value
4+5.

Deliver your printed report to Pieter van Gelder on November 5th
at 15.45h at the start of the lectures
Assignment 2: Frequency Analysis
website
Determine the PDF with the lowest Chi-Square
value in Bestfit
Include in your report a plot of the observations
and optimal fit
Extrapolate the fitted Exponential distribution to
a 10^-3 /yr quantile
Calculate the 95% Confidence Bounds around
the 10^-3 /yr quantile for the Exponential fit
Deliver your printed report to Pieter van Gelder on
November 6th at 8.45h at the start of the lectures
Assignment 3: Transformations of
distributions
Generate 28 random numbers from the Uniform distribution with
lowerbound a and upperbound b
Plot your data in a histogram and draw the PDF of the uniform
distribution in the same plot
Generate 100 sets of 28 random numbers from the above Uniform
distribution and take from each set the maximum number
Plot these 100 maxima in a histogram, derive the theoretical
distribution function for these 100 maxima and draw the PDF in
the same plot
Assume that the above 100 numbers are monthly maximum wind
speeds. Transform your wind speeds to wind pressures and plot
your wind pressures in a cumulative distribution plot.

Deliver your printed report to Pieter van Gelder on
November 13th at 17.00h at the reception of UNESCO-IHE
Final mark for Review of statistics and
frequency analysis (module 1)

Weight factor of all computer exercises is