Verification Introduction
Holly C. Hartmann
Department of Hydrology and Water Resources
University of Arizona
hollyoregon@juno.com
RFC Verification Workshop, 08/14/2007
Goals
• Understand general concepts of verification
• Think about how to apply them to your operations
• Be able to respond to and influence the NWS verification program
• Be prepared as new tools become available
• Be able to do some of your own verification
• Be able to work with researchers on verification projects
• Contribute to development of verification tools (e.g., look at various options)
• Avoid some typical mistakes
Agenda
1. Introduction to Verification
- Applications, Rationale, Basic Concepts
- Data Visualization and Exploration
- Deterministic Scalar measures
2. Categorical measures – KEVIN WERNER
- Deterministic Forecasts
- Ensemble Forecasts
3. Diagnostic Verification
- Reliability
- Discrimination
- Conditioning/Structuring Analyses
4. Lab Session/Group Exercise
- Developing Verification Strategies
- Connecting to Forecast Operations and Users
Why Do Verification? It depends…
Administrative: logistics, selected quantitative criteria
Operations: inputs, model states, outputs, quick!
Research: sources of error, targeting research
Users: making decisions, exploit skill, avoid mistakes
Concerns about verification?
Need for Verification Measures
Verification statistics identify
- accuracy of forecasts
- sources of skill in forecasts
- sources of uncertainty in forecasts
- conditions where and when forecasts are skillful or not skillful, and why
Verification statistics then can inform
- improvements in terms of forecast skill and decision making with alternate forecast sources (e.g., climatology, persistence, new forecast systems)
Adapted from: Regonda, Demargne, and Seo, 2006
Skill versus Value
Assess quality of forecast system, i.e., determine skill and value of forecast
Skill: A forecast has skill if it predicts the observed conditions well according to some objective or subjective criteria.
Value: A forecast has value if it helps the user to make better decisions than without knowledge of the forecast.
• Forecasts with poor skill can be valuable (e.g., extreme event forecasted in wrong place)
• Forecasts with high skill can be of little value (e.g., blue sky desert)
Credit: Hagedorn (2006) and Julie Demargne
Stakeholder Use of HydroClimate Info & Forecasts
Common across all groups
Uninformed, mistaken about forecast interpretation
Use of forecasts limited by lack of demonstrated forecast skill
Have difficulty specifying required accuracy
Common across many, but not all, stakeholders
Have difficulty distinguishing between “good” & “bad” products
Have difficulty placing forecasts in historical context
Unique among stakeholders
Relevant forecast variables, regions (location & scale), seasons, lead times, performance characteristics
Technical sophistication: base probabilities, distributions, math
Role of forecasts in decision making
What is a Perfect Forecast?
Forecast evaluation concepts
All happy families are alike;
each unhappy family
is unhappy in its own way.
-- Leo Tolstoy (1876)
All perfect forecasts are alike;
each imperfect forecast
is imperfect in its own way.
-- Holly Hartmann (2002)
Different Forecasts, Information, Evaluation
“Today’s high will be 76 degrees, (Deterministic)
and it will be partly cloudy, (Categorical)
with a 30% chance of rain.” (Probabilistic)
[Figure: panels labeled Deterministic, Categorical, and Probabilistic, with annotations 76°, 30%, No rain, and Rain]
How would you evaluate each of these?
[Figure: Standard hydrograph, labeled Deterministic]
ESP Forecasts: User preferences influence verification
From: California-Nevada River Forecast Center
So Many Evaluation Criteria!
Deterministic
• Bias
• Correlation
• RMSE
- Standardized RMSE
- Nash-Sutcliffe
• Linear Error in Probability Space
Categorical
• Hit Rate
• Surprise rate
• Threat Score
• Gerrity Score
• Success Ratio
• Post-agreement
• Percent Correct
• Peirce Skill Score
• Gilbert Skill Score
• Heidke Skill Score
• Critical Success Index
• Percent N-class errors
• Modified Heidke Skill Score
• Hanssen and Kuipers Score
• Gandin and Murphy Skill Scores…
Probabilistic
• Brier Score
• Ranked Probability Score
• Distributions-oriented Measures
- Reliability
- Discrimination
- Sharpness
RFC Verification System: Metrics
Categories, each with deterministic and probabilistic forecast verification metrics:
1. Categorical (predefined threshold, range of values)
- Deterministic: Probability Of Detection (POD), False Alarm Ratio (FAR), Probability of False Detection (POFD), Lead Time of Detection (LTD), Critical Success Index (CSI), Peirce Skill Score (PSS), Gerrity Score (GS)
- Probabilistic: Brier Score (BS), Rank Probability Score (RPS)
2. Error (accuracy)
- Deterministic: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Error (ME), Bias (%), Linear Error in Probability Space (LEPS)
- Probabilistic: Continuous RPS
3. Correlation
- Deterministic: Pearson Correlation Coefficient, Ranked correlation coefficient, scatter plots
- Probabilistic: (none listed)
4. Distribution Properties
- Deterministic: Mean, variance, higher moments for observation and forecasts
- Probabilistic: Wilcoxon rank sum test, variance of forecasts, variance of observations, ensemble spread, Talagrand Diagram (or Rank Histogram)
Source: Verification Group, courtesy J. Demargne
RFC Verification System: Metrics
Categories, each with deterministic and probabilistic forecast verification metrics:
5. Skill Scores (relative accuracy over reference forecast)
- Deterministic: Root Mean Squared Error Skill Score (SS-RMSE) (with reference to persistence, climatology, lagged persistence), Wilson Score (WS), Linear Error in Probability Space Skill Score (SS-LEPS)
- Probabilistic: Rank Probability Skill Score, Brier Skill Score (with reference to persistence, climatology, lagged persistence)
6. Conditional Statistics (based on occurrence of specific events)
- Deterministic: Relative Operating Characteristic (ROC), reliability measures, discrimination diagram, other discrimination measures
- Probabilistic: ROC and ROC Area, other resolution measures, reliability diagram, discrimination diagram, other discrimination measures
7. Confidence (metric uncertainty)
- Deterministic: Sample size, Confidence Interval (CI)
- Probabilistic: Ensemble size, sample size, Confidence Interval (CI)
Source: Verification Group, courtesy J. Demargne
Possible Performance Criteria
Accuracy - overall correspondence between forecasts and observations
Bias - difference between average forecast and average observation
Consistency - forecasts don’t waffle around
[Figure: example of good consistency]
Sharpness/Refinement - ability to make bullish forecast statements
[Figure: examples labeled Not Sharp and Sharp]
What makes a forecast “good”?
Accuracy: Forecasts should agree with observations, with few large errors
Bias: Forecast mean should agree with observed mean
Association: Linear relationship between forecasts and observations
Skill: Forecast should be more accurate than low-skilled reference forecasts (e.g., random chance, persistence, or climatology)
Reliability: Binned forecast values should agree with binned observations (agreement between categories)
Resolution: Forecast can discriminate between events & non-events
Sharpness: Forecast can predict with strong probabilities (i.e., 100% for event, 0% for non-event)
Spread (Variability): Forecast represents the associated uncertainty
Adapted from: Ebert (2003)
Forecasting Tradeoffs
Forecast performance is multi-faceted
False Alarms: warning without event (“False Alarm Ratio”)
Surprises: event without warning (“Probability of Detection”)
[Image labeled “No fire”]
A forecaster’s fundamental challenge
is balancing these two.
Which is more important?
Depends on the specific decision context…
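To make the tradeoff concrete, here is a minimal sketch that counts hits, false alarms, and surprises (misses) from paired yes/no records and computes the two measures named on this slide. It assumes the commonly used definitions POD = hits / (hits + misses) and FAR = false alarms / (hits + false alarms); the function name and the ten-day record are hypothetical, made up for illustration.

```python
def contingency_scores(warned, occurred):
    """Count outcomes of yes/no warnings and return POD and FAR."""
    pairs = list(zip(warned, occurred))
    hits = sum(1 for w, o in pairs if w and o)
    false_alarms = sum(1 for w, o in pairs if w and not o)  # warning without event
    misses = sum(1 for w, o in pairs if not w and o)        # event without warning
    # Guard against empty denominators (no events, or no warnings issued).
    pod = hits / (hits + misses) if hits + misses else float("nan")
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else float("nan")
    return {"hits": hits, "false_alarms": false_alarms, "misses": misses,
            "pod": pod, "far": far}

# Hypothetical record: was a warning issued, and did the event occur?
warned   = [True, True,  True, False, False, True,  False, False, True, False]
occurred = [True, False, True, True,  False, False, False, False, True, False]

scores = contingency_scores(warned, occurred)
print(scores)
```

Issuing more warnings tends to raise POD (fewer surprises) but also raises FAR, which is the balance the forecaster has to strike.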
How Good? Compared to What?
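\[
\text{Skill Score} = \frac{S_{\text{Forecast}} - S_{\text{Baseline}}}{S_{\text{Perfect}} - S_{\text{Baseline}}} = 1 - \frac{S_{\text{Forecast}}}{S_{\text{Baseline}}}
\]
(The second form holds when S_Perfect = 0, as for error measures such as RMSE.)
Skill Score: (0.50 - 0.54)/(1.00 - 0.54) = -8.7%
~worse than guessing~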
What is the appropriate Baseline?
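The skill score arithmetic can be checked with a short sketch. The function name is hypothetical; the first call uses the numbers from this slide, and the second uses made-up RMSE values to show that the general form reduces to 1 - S_Forecast/S_Baseline when a perfect score is zero.

```python
def skill_score(s_forecast, s_baseline, s_perfect):
    """Generic skill score: improvement over the baseline, as a fraction
    of the improvement a perfect forecast would achieve."""
    return (s_forecast - s_baseline) / (s_perfect - s_baseline)

# Numbers from the slide: forecast scores 0.50, baseline 0.54, perfect 1.00.
ss_slide = skill_score(0.50, 0.54, 1.00)
print(f"{ss_slide:.1%}")  # negative: the forecast is worse than the baseline

# Hypothetical error-type scores (perfect = 0): forecast RMSE 80, baseline RMSE 100.
ss_error = skill_score(80.0, 100.0, 0.0)
print(ss_error, 1 - 80.0 / 100.0)  # both forms agree up to rounding
```

Changing the baseline (random chance, persistence, climatology) changes the score, which is why the choice of baseline matters.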
Graphical Forecast Evaluation
Basic Data Display
Historical seasonal water supply outlooks, Colorado River Basin
Morrill, Hartmann, and Bales, 2007
Scatter plots
Historical seasonal water supply outlooks, Colorado River Basin
Morrill, Hartmann, and Bales, 2007
Histograms
Historical seasonal water supply outlooks, Colorado River Basin
Morrill, Hartmann, and Bales, 2007
IVP Scatterplot Example
Source: H. Herr
Cumulative Distribution Function (CDF): IVP
Cat 1 = No Observed Precipitation
Cat 2 = Observed Precipitation (>0.001”)
Empirical distribution of forecast probabilities for different observation categories
Goal: Widely separated CDFs
Source: H. Herr, IVP Charting Examples, 2007
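As a rough illustration of what such a chart summarizes, the sketch below splits forecast probabilities by observed category, using the >0.001” threshold from this slide, and evaluates an empirical CDF for each group. This is not IVP code; the function name, data values, and evaluation points are all hypothetical.

```python
def empirical_cdf(values, points):
    """Fraction of values less than or equal to each evaluation point."""
    n = len(values)
    return [sum(1 for v in values if v <= p) / n for p in points]

# Hypothetical forecast probabilities of precipitation and observed amounts (inches).
forecast_prob   = [0.1, 0.2, 0.0, 0.7, 0.9, 0.3, 0.8, 0.1, 0.6, 0.4]
observed_precip = [0.0, 0.0, 0.0, 0.25, 1.10, 0.0, 0.40, 0.0, 0.05, 0.0]

# Cat 1 = no observed precipitation; Cat 2 = observed precipitation (> 0.001").
cat1 = [p for p, o in zip(forecast_prob, observed_precip) if o <= 0.001]
cat2 = [p for p, o in zip(forecast_prob, observed_precip) if o > 0.001]

points = [0.0, 0.25, 0.5, 0.75, 1.0]
cdf_cat1 = empirical_cdf(cat1, points)
cdf_cat2 = empirical_cdf(cat2, points)

# A large vertical gap between the two curves indicates good discrimination.
max_gap = max(a - b for a, b in zip(cdf_cat1, cdf_cat2))
print(cdf_cat1, cdf_cat2, max_gap)
```

In this made-up sample the low probabilities all fall on dry days and the high ones on wet days, so the two CDFs are widely separated.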
Probability Density Function (PDF): IVP
Cat 1 = No Observed Precipitation
Cat 2 = Observed Precipitation (>0.001”)
Empirical distribution for 10 bins for IVP GUI
Goal: Widely separated PDFs
Source: H. Herr, IVP Charting Examples, 2007
“Box-plots”: Quantiles and Extremes
Based on summarizing CDF computation and plot
Goal: Widely separated box-plots
Cat 1 = No Observed Precipitation
Cat 2 = Observed Precipitation (>0.001”)
Source: H. Herr, IVP Charting Examples, 2007
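The numbers behind a box-plot can be sketched as a five-number summary per observation category. This uses Python's standard `statistics.quantiles`; the data and the choice of the "inclusive" quantile method are assumptions for illustration, and IVP may compute its quantiles differently.

```python
import statistics

def five_number_summary(values):
    """Minimum, quartiles, and maximum: the values a box-plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}

# Hypothetical forecast probabilities, already split by observed category.
cat1_probs = [0.0, 0.1, 0.1, 0.2, 0.3, 0.4]  # Cat 1: no observed precipitation
cat2_probs = [0.6, 0.7, 0.8, 0.9]            # Cat 2: observed precipitation

summary_cat1 = five_number_summary(cat1_probs)
summary_cat2 = five_number_summary(cat2_probs)
print(summary_cat1)
print(summary_cat2)
```

When the boxes (q1 to q3) and even the extremes of the two categories do not overlap, as in this made-up sample, the forecasts separate the categories well.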
Scalar Forecast Evaluation
Standard Scalar Measures
Bias
- Mean forecast = Mean observed
Correlation Coefficient
- Variance shared between forecast and observed (r²)
- Says nothing about bias or whether forecast variance = observed variance
- Pearson correlation coefficient: assumes normal distribution, can be + or - (Rank r: also + or -, non-normal OK)
Root Mean Squared Error
- Distance between forecast/observation values
- Better than correlation, poor when error is heteroscedastic
- Emphasizes performance for high flows
- Alternative: Mean Absolute Error (MAE)
[Figures: plot with axes Forecast and Observed; plot with series labeled fcst and obs]
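A minimal sketch of these measures, written out with the standard library so each formula is visible. The function name and data are hypothetical, and bias is taken here as mean forecast minus mean observed. The example adds a constant 50 to every observation to show that correlation can be perfect while bias and RMSE are not.

```python
import math

def scalar_measures(forecast, observed):
    """Bias, Pearson correlation, RMSE, and MAE for paired values."""
    n = len(forecast)
    mean_f = sum(forecast) / n
    mean_o = sum(observed) / n
    errors = [f - o for f, o in zip(forecast, observed)]
    bias = mean_f - mean_o                            # zero is ideal
    rmse = math.sqrt(sum(e * e for e in errors) / n)  # squares weight large errors
    mae = sum(abs(e) for e in errors) / n
    cov = sum((f - mean_f) * (o - mean_o) for f, o in zip(forecast, observed))
    var_f = sum((f - mean_f) ** 2 for f in forecast)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    corr = cov / math.sqrt(var_f * var_o)             # Pearson r
    return {"bias": bias, "corr": corr, "rmse": rmse, "mae": mae}

# Hypothetical flows: every forecast is exactly 50 units too high.
observed = [100.0, 150.0, 200.0, 250.0, 300.0]
forecast = [o + 50.0 for o in observed]

measures = scalar_measures(forecast, observed)
print(measures)
```

The correlation is 1 even though every forecast is wrong by 50, which is why correlation alone is not enough.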
Standard Scalar Measures (with Scatterplot)
[Figure: two scatter plots of Forecast vs. Observed streamflow, both axes in 1000’s ac-ft]
1943-99 April 1 Forecasts for Apr-Sept Streamflow at Stehekin R at Stehekin, WA
- Bias = 22, Corr = 0.92, RMSE = 74.4
1954-97 January 1 Forecasts for Jan-May Streamflow at Verde R blw Tangle Crk, AZ
- Bias = -87.5, Corr = 0.58, RMSE = 228.3
IVP: Deterministic Scalar Measures
ME: smallest; + and - errors cancel
MAE vs. RMSE: RMSE influenced by large errors for large events
MAXERR: largest
Sample Size: small samples have large uncertainty
Source: H. Herr, IVP Charting Examples, 2007
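The ordering of these measures can be seen in a small sketch. It is not IVP code: the function name and data are hypothetical, and MAXERR is assumed here to mean the largest absolute error. The sample has small errors at low flows and large, offsetting errors at high flows.

```python
import math

def error_summary(forecast, observed):
    """ME, MAE, RMSE, and MAXERR for paired forecasts and observations."""
    errors = [f - o for f, o in zip(forecast, observed)]
    n = len(errors)
    return {
        "n": n,                                                # sample size
        "ME": sum(errors) / n,                                 # signed errors can cancel
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAXERR": max(abs(e) for e in errors),                 # assumed: max |error|
    }

# Hypothetical flows: errors of -10, +10, -5, +5, -100, +100.
observed = [50.0, 60.0, 55.0, 65.0, 400.0, 420.0]
forecast = [40.0, 70.0, 50.0, 70.0, 300.0, 520.0]

summary = error_summary(forecast, observed)
print(summary)
```

ME comes out as zero because the positive and negative errors cancel, while RMSE sits well above MAE because the two large-event errors dominate the squared sum.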
IVP: RMSE – Skill Scores
Skill compared to
Persistence Forecast
Source: H. Herr, IVP Charting Examples, 2007
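An RMSE skill score against persistence can be sketched as follows. It assumes a simple lag-1 persistence reference (each value is forecast to equal the previous observation) and the error-type form SS = 1 - RMSE_forecast / RMSE_persistence; the data are hypothetical, and IVP's own persistence definition may differ.

```python
import math

def rmse(predicted, observed):
    """Root mean squared error of paired values."""
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)

# Hypothetical observed series and forecasts issued for observed[1:].
observed = [10.0, 12.0, 15.0, 14.0, 18.0, 21.0]
forecast = [11.0, 14.0, 15.0, 17.0, 20.0]

verifying = observed[1:]     # values being forecast
persistence = observed[:-1]  # reference: repeat the previous observation

rmse_forecast = rmse(forecast, verifying)
rmse_persistence = rmse(persistence, verifying)

# Positive: better than persistence. Zero: no better. Negative: worse.
ss_rmse = 1 - rmse_forecast / rmse_persistence
print(rmse_forecast, rmse_persistence, ss_rmse)
```

Persistence is a hard reference to beat at short lead times and an easy one at long lead times, so the same forecast system can show very different skill depending on lead time.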