Calculation of a verification score from a sample of forecasts and verification
data should usually be only a first step. It should ideally be followed
by some form of statistical inference. Even if the quality of forecasts
remains constant, sampling variability means that a later sample of data
will give a different value for the score, so the value of a score cannot
be viewed in isolation, without some idea of its sampling variation. Most
scores have an underlying "population" value and the calculated score can
be viewed as a *(point) estimate* of this population parameter. It
is good practice, where possible, to find a *confidence interval*
for this parameter - an interval that has a pre-specified high probability
of including the true value of the parameter. To do so we need to know
the *sampling distribution* of the sample score. Sometimes this can
be approximated by a tractable distribution, such as a Gaussian distribution.
On other occasions a *non-parametric* or *resampling* approach,
such as the *bootstrap*, is needed.
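As a sketch of the resampling approach, the following Python snippet computes a percentile bootstrap confidence interval for a verification score (here the mean squared error). All data are synthetic, generated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic forecast/observation pairs -- illustrative only, not real data.
obs = rng.normal(15.0, 3.0, size=200)
fcst = obs + rng.normal(0.5, 1.5, size=200)   # biased, noisy forecasts

def mse(forecast, observed):
    """Mean squared error: the sample verification score."""
    return np.mean((forecast - observed) ** 2)

score = mse(fcst, obs)

# Percentile bootstrap: resample (forecast, observation) pairs with
# replacement and recompute the score for each resample.
n = len(obs)
boot_scores = np.empty(10_000)
for i in range(10_000):
    idx = rng.integers(0, n, size=n)
    boot_scores[i] = mse(fcst[idx], obs[idx])

lo, hi = np.percentile(boot_scores, [2.5, 97.5])  # 95% interval
print(f"MSE = {score:.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```

Because pairs are resampled together, any dependence between forecast and observation within a case is preserved; note, however, that this simple bootstrap assumes the cases themselves are independent.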

It is important not to confuse the idea of a confidence interval for
a population parameter with that of a *prediction interval*. The latter
makes statements about likely values of a sample quantity, given assumptions
about the underlying population; both can be useful in inference.

An alternative to *interval estimation* (constructing confidence
intervals) is to *test hypotheses*. The most usual *null hypotheses*
of interest are:

- The population value of a verification score for a forecasting system is that corresponding to some reference forecast and hence represents zero skill.
- The population values of a verification score are the same for two forecasting systems.
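The second null hypothesis can be tested without distributional assumptions by a paired permutation test: if the two systems have the same population score, the A/B labels within each case are exchangeable. A minimal sketch on synthetic data (all numbers invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical forecast systems verified on the same 150 cases.
obs = rng.normal(10.0, 2.0, size=150)
fcst_a = obs + rng.normal(0.0, 1.0, size=150)   # system A
fcst_b = obs + rng.normal(0.0, 1.3, size=150)   # system B, slightly worse

err_a = (fcst_a - obs) ** 2
err_b = (fcst_b - obs) ** 2
observed_diff = err_a.mean() - err_b.mean()

# Under the null, swapping the A/B labels within a case leaves the
# distribution of the score difference unchanged, so we randomly
# flip the sign of each paired difference.
n_perm = 10_000
count = 0
for _ in range(n_perm):
    swap = rng.random(len(obs)) < 0.5
    d = np.where(swap, err_b - err_a, err_a - err_b)
    if abs(d.mean()) >= abs(observed_diff):
        count += 1

p_value = count / n_perm
print(f"MSE(A) - MSE(B) = {observed_diff:.3f}, two-sided p = {p_value:.3f}")
```

A small p-value indicates that a score difference this large would rarely arise if the two systems were equally skilful.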

The idea of *power* is often forgotten in hypothesis testing. The
probability of Type I error (rejecting the null hypothesis when it is true)
is controlled to be a small number (for example 5%, 1%), but the power
(the probability of correctly rejecting the null hypothesis when it is
false) is frequently ignored. A test whose power is not much greater than
its probability of Type I error is of little use. Power can be used to
choose between competing tests of the same null hypothesis.
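To make the idea concrete, power can be computed exactly for simple tests. The sketch below evaluates the power of a two-sided one-sample z-test (known variance) at several sample sizes; the effect size and significance level are arbitrary choices for illustration:

```python
import numpy as np
from scipy.stats import norm

# Two-sided z-test of H0: mu = 0 with known sigma = 1.
# Hypothetical numbers, chosen only to illustrate the calculation.
alpha = 0.05
effect = 0.3                      # true mean under the alternative
z_crit = norm.ppf(1 - alpha / 2)  # critical value, about 1.96

for n in (10, 30, 100, 300):
    se = 1.0 / np.sqrt(n)
    # The test statistic is N(effect/se, 1) under the alternative;
    # power is its probability mass beyond +/- z_crit.
    power = norm.sf(z_crit - effect / se) + norm.cdf(-z_crit - effect / se)
    print(f"n = {n:3d}: power = {power:.3f}")
```

With a small sample the power is barely above the 5% Type I error rate, i.e. the test is almost useless, which is exactly the situation the paragraph above warns about.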

There is a close link between hypothesis testing, confidence intervals
and prediction intervals in many circumstances. A null hypothesis will
be rejected if and only if the null value of the population parameter lies
outside a corresponding confidence interval, which in turn happens if and
only if the sample score lies outside a corresponding prediction interval.
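This equivalence can be checked numerically; as a sketch, for the z-test the rejection decision and the "null value outside the confidence interval" decision always agree (synthetic data, hypothetical parameter values):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(0.4, 1.0, size=50)   # sample; true mean 0.4, known sigma = 1
se = 1.0 / np.sqrt(len(x))
z_crit = norm.ppf(0.975)            # 5% two-sided critical value

# Test H0: mu = 0 directly...
reject = abs(x.mean() / se) > z_crit

# ...and via the 95% confidence interval for mu.
ci = (x.mean() - z_crit * se, x.mean() + z_crit * se)
zero_outside_ci = not (ci[0] <= 0.0 <= ci[1])

print(reject, zero_outside_ci)  # the two decisions always agree
```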

Ian Jolliffe, February 2003

A few other links for **hypothesis testing**:

- Probability and Statistics for Biological Sciences: Introduction to Hypothesis Testing (David W. Sabo, British Columbia Institute of Technology)
- WMO Climate Information and Prediction Services (CLIPS) curriculum - Link to Statistical Inference (Ian Jolliffe, University of Aberdeen)