Confidence intervals for verification scores
Any verification score must be regarded as a sample estimate of the "true" value for an infinitely large verification dataset. There is therefore some uncertainty associated with the score's value, especially when the sample size is small, when the data are not independent, or both. It is a good idea to estimate confidence intervals (CIs) to set bounds on the expected value of the verification score. This also helps to assess whether differences between competing forecast systems are real.
Jolliffe (2007) gives a nice discussion of several methods for deriving CIs for verification measures. Mathematical formulae are available for computing CIs for distributions which are binomial or normal, assumptions that are reasonable for scores that represent proportions (PC, POD, FAR, TS). In general, most verification scores cannot be expected to satisfy these assumptions. Moreover, the verification samples are often non-independent in space and/or time. A non-parametric method such as the bootstrap method is ideally suited for handling these data because it does not require assumptions about distributions. The bootstrap is, however, sensitive to dependence of the events in the verification sample. A strategy such as block bootstrapping, where the data are resampled in blocks that can be considered independent of each other, is recommended for datasets with high spatial or temporal correlation. This point is discussed further below.
The non-parametric bootstrap is quite simple to do:
1. Generate a bootstrap sample by randomly drawing N forecast/observation pairs from the full set of N samples, with replacement (i.e., pick a sample, put it back, N times).
2. Compute the verification statistic for that bootstrap sample.
3. Repeat steps 1 and 2 a large number of times, say 1000, to generate 1000 estimates for that verification statistic.
4. Order the estimates from smallest to largest. The (1-α) confidence interval is obtained by finding the values below which a fraction α/2 of the estimates lie and above which a fraction α/2 lie, i.e., the α/2 and 1-α/2 sample quantiles.
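As a concrete illustration of these four steps, here is a minimal Python/NumPy sketch of the percentile bootstrap. The names (bootstrap_ci, fcst, obs, score) are placeholders rather than part of any standard library; score can be any function that computes a verification measure from matched forecast/observation arrays.

    import numpy as np

    def bootstrap_ci(fcst, obs, score, n_boot=1000, alpha=0.05, rng=None):
        """Percentile bootstrap confidence interval for a verification score.

        fcst, obs : 1-D arrays of matched forecast/observation pairs
        score     : function score(fcst, obs) returning a scalar
        """
        rng = np.random.default_rng(rng)
        n = len(fcst)
        stats = np.empty(n_boot)
        for i in range(n_boot):
            idx = rng.integers(0, n, size=n)        # step 1: resample N pairs with replacement
            stats[i] = score(fcst[idx], obs[idx])   # step 2: recompute the score
        # step 4: the alpha/2 and 1 - alpha/2 quantiles bound the (1 - alpha) interval
        return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

For example, bootstrap_ci(fcst, obs, lambda f, o: np.sqrt(np.mean((f - o) ** 2))) would give an approximate 95% interval for the RMSE.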
In step 1, it is sometimes appropriate to sample M < N replicates of the data; for example, if the distribution of the data is heavy-tailed, it is recommended to use M = sqrt(N) (see, e.g., Gilleland, 2010). When using the bootstrap approach for finding confidence intervals, the main assumption is that the data sample represents the population distribution. Therefore, N should be large enough that this assumption is likely to be valid, especially if using M < N.
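Continuing the hypothetical sketch above, an M = sqrt(N) ("m-out-of-n") variant only changes how many pairs are drawn in each replicate:

    import numpy as np

    def bootstrap_ci_sqrt_n(fcst, obs, score, n_boot=1000, alpha=0.05, rng=None):
        """As the sketch above, but each replicate draws only M = sqrt(N) of the N pairs."""
        rng = np.random.default_rng(rng)
        n = len(fcst)
        m = max(1, int(np.sqrt(n)))                 # M = sqrt(N), suggested for heavy-tailed data
        stats = np.empty(n_boot)
        for i in range(n_boot):
            idx = rng.integers(0, n, size=m)        # draw M of the N pairs, with replacement
            stats[i] = score(fcst[idx], obs[idx])
        return np.quantile(stats, [alpha / 2, 1 - alpha / 2])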
Step 2 may entail computing several summary statistics all at once,
which can save considerable computing time if intervals for more than
one statistic are desired.
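One possible way to exploit this, again as an illustrative sketch rather than a standard routine, is to evaluate every score of interest on each resample, so the data are resampled only once:

    import numpy as np

    def bootstrap_cis(fcst, obs, scores, n_boot=1000, alpha=0.05, rng=None):
        """Percentile intervals for several scores from one set of resamples.

        scores : dict mapping a score name to a function score(fcst, obs)
        """
        rng = np.random.default_rng(rng)
        n = len(fcst)
        stats = {name: np.empty(n_boot) for name in scores}
        for i in range(n_boot):
            idx = rng.integers(0, n, size=n)
            f, o = fcst[idx], obs[idx]
            for name, fn in scores.items():         # every score from the same resample
                stats[name][i] = fn(f, o)
        return {name: np.quantile(s, [alpha / 2, 1 - alpha / 2])
                for name, s in stats.items()}

For example, scores={"bias": lambda f, o: np.mean(f - o), "rmse": lambda f, o: np.sqrt(np.mean((f - o) ** 2))} would yield both intervals in a single pass.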
A general rule of thumb for selecting the number of replicate samples (or resamples) in step 3 is to choose a number small enough to keep the computation time manageable, but large enough that a representative sample of statistics is obtained. The usual technique is to start with a low number, say 500, and run the procedure twice. If the resulting intervals differ wildly, then increase the number of resamples, e.g. to 700. Repeat this procedure until the confidence intervals do not change drastically. Ideally, the procedure should be tried many times for each number of resamples, but it has been found in practice that twice is sufficient.
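A sketch of that rule of thumb, reusing the hypothetical bootstrap_ci function and the fcst/obs arrays from above; the 5% agreement tolerance and the factor of 1.5 are arbitrary illustrative choices:

    import numpy as np

    def rmse(f, o):
        return np.sqrt(np.mean((f - o) ** 2))       # any score function will do

    n_boot = 500                                     # start with a low number
    while True:
        ci1 = bootstrap_ci(fcst, obs, rmse, n_boot=n_boot)
        ci2 = bootstrap_ci(fcst, obs, rmse, n_boot=n_boot)
        if np.allclose(ci1, ci2, rtol=0.05):         # intervals agree closely: stop
            break
        n_boot = int(1.5 * n_boot)                   # otherwise increase, e.g. 500 -> 750
    print(n_boot, ci1)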
The procedure for determining the bootstrap confidence intervals from the sample of statistics in step 4 is known as the percentile method. It is generally a good method, but it relies on further assumptions about the bootstrap distribution of the statistic (for example, that it is not biased or strongly skewed). If these assumptions are not valid, then the intervals tend to be too narrow. There are several alternative methods, and each has its own advantages and disadvantages. A method known as the BCa (bias-corrected and accelerated) method adjusts the quantiles (i.e., α/2 and 1-α/2) for violations of these assumptions. The result is highly accurate estimates for the confidence limits, but the procedure involves an additional round of sampling that can be highly inefficient for large data sets. The ABC method is a quick asymptotic approximation to the BCa bounds, but it can only be applied to smooth statistics. See Gilleland (2010) for more about these approaches.
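For BCa intervals in practice, SciPy's scipy.stats.bootstrap (available in SciPy 1.7 and later) can be used instead of a hand-rolled implementation. The example below uses synthetic data and RMSE purely for illustration:

    import numpy as np
    from scipy.stats import bootstrap   # SciPy >= 1.7

    def rmse(f, o):
        return np.sqrt(np.mean((f - o) ** 2))

    rng = np.random.default_rng(0)
    obs = rng.normal(size=200)                      # synthetic observations
    fcst = obs + rng.normal(scale=0.5, size=200)    # synthetic forecasts

    res = bootstrap((fcst, obs), rmse, paired=True, n_resamples=1000,
                    confidence_level=0.95, method="BCa")
    print(res.confidence_interval)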
When comparing the scores for two or more forecasts, one can calculate confidence intervals for the mean difference between the scores. If the (1-α) confidence interval for the difference does not include 0, then the performance of the forecasts can be considered significantly different.
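A minimal sketch of that comparison, assuming the two forecast systems are verified against the same observations; drawing the same resampled cases for both systems in each replicate preserves the pairing. The names (diff_ci, fcst_a, fcst_b) are hypothetical.

    import numpy as np

    def diff_ci(fcst_a, fcst_b, obs, score, n_boot=1000, alpha=0.05, rng=None):
        """Percentile interval for score(A) - score(B) on matched cases."""
        rng = np.random.default_rng(rng)
        n = len(obs)
        diffs = np.empty(n_boot)
        for i in range(n_boot):
            idx = rng.integers(0, n, size=n)        # one resample, shared by both systems
            diffs[i] = score(fcst_a[idx], obs[idx]) - score(fcst_b[idx], obs[idx])
        return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

If the returned interval excludes 0, the two systems' scores can be considered significantly different at roughly the α level.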
As mentioned above, bootstrap sampling assumes that the events in the sample are independent. Thus it is often necessary to sample in "blocks" to obtain more reasonable estimates of the confidence limits. For example, if there is high spatial correlation in the dataset, which is often the case in gridded forecasts, then each full grid might be sampled as a single block. Or, if there is also temporal correlation in the forecasts, then it could be necessary to form blocks of two or three successive forecasts.
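A minimal block-bootstrap sketch, assuming each forecast/observation pair carries a label (block_id) identifying the block it belongs to, e.g. the forecast date or the grid index; whole blocks, rather than individual pairs, are drawn with replacement:

    import numpy as np

    def block_bootstrap_ci(fcst, obs, block_id, score, n_boot=1000, alpha=0.05, rng=None):
        """Percentile interval where whole blocks (days, grids, ...) are resampled."""
        rng = np.random.default_rng(rng)
        blocks = [np.flatnonzero(block_id == b) for b in np.unique(block_id)]
        stats = np.empty(n_boot)
        for i in range(n_boot):
            chosen = rng.integers(0, len(blocks), size=len(blocks))   # draw blocks with replacement
            idx = np.concatenate([blocks[j] for j in chosen])
            stats[i] = score(fcst[idx], obs[idx])
        return np.quantile(stats, [alpha / 2, 1 - alpha / 2])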
A crucial assumption for using the block bootstrap approach is that the length of correlation is much smaller than the sample size. This assumption is usually valid for time series, where the series is often very long and the temporal correlation is relatively short. However, it is typically violated for spatial correlation, because many variables tend to be highly correlated spatially over areas that are large relative to the domain studied. In such a case, a parametric bootstrap procedure may be preferred. Such a procedure, however, requires an assumed model that can vary greatly from one type of variable (or region) to another.
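Purely as an illustration of what a parametric bootstrap might look like, the sketch below fits a simple AR(1) model to the forecast errors and simulates new error series from it; a real application would need a model (for example, a spatial covariance model) appropriate to the variable and region, which is exactly the difficulty noted above.

    import numpy as np

    def parametric_ci(fcst, obs, score, n_boot=1000, alpha=0.05, rng=None):
        """Parametric bootstrap interval using an illustrative AR(1) error model."""
        rng = np.random.default_rng(rng)
        err = fcst - obs
        n = len(err)
        mu = err.mean()                                # fitted mean error
        s = err.std(ddof=1)                            # fitted marginal spread
        rho = np.corrcoef(err[:-1], err[1:])[0, 1]     # fitted lag-1 correlation
        stats = np.empty(n_boot)
        for i in range(n_boot):
            sim = np.empty(n)
            sim[0] = rng.normal(mu, s)
            for t in range(1, n):                      # simulate serially correlated errors
                innov = rng.normal(0.0, s * np.sqrt(1 - rho ** 2))
                sim[t] = mu + rho * (sim[t - 1] - mu) + innov
            stats[i] = score(obs + sim, obs)           # score the simulated forecasts
        return np.quantile(stats, [alpha / 2, 1 - alpha / 2])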
References:
Gilleland, E., 2010: Confidence intervals for forecast verification.
NCAR Technical Note NCAR/TN-479+STR, 71pp.
Jolliffe, I.T., 2007: Uncertainty and inference for verification
measures. Wea. Forecasting, 22, 637-650.