Confidence intervals for verification scores
Any verification score must be regarded as a sample estimate of the "true" value for an infinitely large verification dataset. There is therefore some uncertainty associated with the score's value, especially when the sample size is small, when the data are not independent, or both. It is a good idea to estimate confidence intervals (CIs) to set bounds on the expected value of the verification score. This also helps to assess whether differences between competing forecast systems are real.
Jolliffe (2007) gives a nice discussion of several methods for deriving CIs for verification measures. Mathematical formulae are available for computing CIs for distributions which are binomial or normal, assumptions that are reasonable for scores that represent proportions (PC, POD, FAR, TS). In general, most verification scores cannot be expected to satisfy these assumptions. Moreover, the verification samples are often non-independent in space and/or time. A non-parametric method such as the bootstrap method is ideally suited for handling these data because it does not require assumptions about distributions. The bootstrap is, however, sensitive to dependence of the events in the verification sample. A strategy such as block bootstrapping, where the data are resampled in blocks that can be considered independent of each other, is recommended for datasets with high spatial or temporal correlation. This point is discussed further below.
The non-parametric bootstrap is quite simple to do:
1. Generate a bootstrap sample by randomly drawing N forecast/observation pairs from the full set of N samples, with replacement (i.e., pick a sample, put it back, N times).
2. Compute the verification statistic for that bootstrap sample.
3. Repeat steps 1 and 2 a large number of times, say 1000, to generate 1000 estimates for that verification statistic.
4. Order the estimates from smallest to largest. The (1-α) confidence interval is obtained by finding the values below which a fraction α/2 of the estimates lie and above which a fraction α/2 lie, i.e., the α/2 and 1-α/2 sample quantiles.
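As a concrete illustration of these four steps, here is a minimal Python/NumPy sketch of the percentile bootstrap. The names (bootstrap_ci, fcst, obs, score) are placeholders rather than part of any standard library; score can be any function that computes a verification measure from matched forecast/observation arrays.

    import numpy as np

    def bootstrap_ci(fcst, obs, score, n_boot=1000, alpha=0.05, rng=None):
        """Percentile bootstrap confidence interval for a verification score.

        fcst, obs : 1-D arrays of matched forecast/observation pairs
        score     : function score(fcst, obs) returning a scalar
        """
        rng = np.random.default_rng(rng)
        n = len(fcst)
        stats = np.empty(n_boot)
        for i in range(n_boot):
            idx = rng.integers(0, n, size=n)        # step 1: resample N pairs with replacement
            stats[i] = score(fcst[idx], obs[idx])   # step 2: recompute the score
        # step 4: the alpha/2 and 1 - alpha/2 quantiles bound the (1 - alpha) interval
        return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

For example, bootstrap_ci(fcst, obs, lambda f, o: np.sqrt(np.mean((f - o) ** 2))) would give an approximate 95% interval for the RMSE.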
In step 1, it is sometimes appropriate to sample M < N replicates of the data; for example, if the distribution of the data is heavy-tailed, it is recommended to use M = sqrt(N) (see, e.g., Gilleland, 2010). When using the bootstrap approach for finding confidence intervals, the main assumption is that the data sample represents the population distribution. Therefore, N should be large enough that this assumption is likely to be valid, especially if using M < N.
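Continuing the hypothetical sketch above, an M = sqrt(N) ("m-out-of-n") variant only changes how many pairs are drawn in each replicate:

    import numpy as np

    def bootstrap_ci_sqrt_n(fcst, obs, score, n_boot=1000, alpha=0.05, rng=None):
        """As the sketch above, but each replicate draws only M = sqrt(N) of the N pairs."""
        rng = np.random.default_rng(rng)
        n = len(fcst)
        m = max(1, int(np.sqrt(n)))                 # M = sqrt(N), suggested for heavy-tailed data
        stats = np.empty(n_boot)
        for i in range(n_boot):
            idx = rng.integers(0, n, size=m)        # draw M of the N pairs, with replacement
            stats[i] = score(fcst[idx], obs[idx])
        return np.quantile(stats, [alpha / 2, 1 - alpha / 2])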
Step 2 may entail computing several summary statistics all at once,
which can save considerable computing time if intervals for more than
one statistic are desired.
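One possible way to exploit this, again as an illustrative sketch rather than a standard routine, is to evaluate every score of interest on each resample, so the data are resampled only once:

    import numpy as np

    def bootstrap_cis(fcst, obs, scores, n_boot=1000, alpha=0.05, rng=None):
        """Percentile intervals for several scores from one set of resamples.

        scores : dict mapping a score name to a function score(fcst, obs)
        """
        rng = np.random.default_rng(rng)
        n = len(fcst)
        stats = {name: np.empty(n_boot) for name in scores}
        for i in range(n_boot):
            idx = rng.integers(0, n, size=n)
            f, o = fcst[idx], obs[idx]
            for name, fn in scores.items():         # every score from the same resample
                stats[name][i] = fn(f, o)
        return {name: np.quantile(s, [alpha / 2, 1 - alpha / 2])
                for name, s in stats.items()}

For example, scores={"bias": lambda f, o: np.mean(f - o), "rmse": lambda f, o: np.sqrt(np.mean((f - o) ** 2))} would yield both intervals in a single pass.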
A general rule of thumb for selecting the number of replicate samples (or resamples) in step 3 is to choose a number small enough to keep the computation time manageable, but large enough that a representative sample of statistics is obtained. The usual technique is to start with a low number, say 500, and run the procedure twice. If the resulting intervals differ wildly, then increase the number of resamples, e.g. to 700. Repeat this procedure until the confidence intervals do not change drastically. Ideally, the procedure should be tried many times for each number of resamples, but it has been found in practice that twice is sufficient.
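A sketch of that rule of thumb, reusing the hypothetical bootstrap_ci function and the fcst/obs arrays from above; the 5% agreement tolerance and the factor of 1.5 are arbitrary illustrative choices:

    import numpy as np

    def rmse(f, o):
        return np.sqrt(np.mean((f - o) ** 2))       # any score function will do

    n_boot = 500                                     # start with a low number
    while True:
        ci1 = bootstrap_ci(fcst, obs, rmse, n_boot=n_boot)
        ci2 = bootstrap_ci(fcst, obs, rmse, n_boot=n_boot)
        if np.allclose(ci1, ci2, rtol=0.05):         # intervals agree closely: stop
            break
        n_boot = int(1.5 * n_boot)                   # otherwise increase, e.g. 500 -> 750
    print(n_boot, ci1)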
The procedure for determining the bootstrap confidence intervals from the sample of statistics in step 4 is known as the percentile method. It is generally a good method, but it relies on further assumptions about the bootstrap distribution of the statistic (for example, that it is not biased or strongly skewed). If these assumptions are not valid, then the intervals tend to be too narrow. There are several alternative methods, and each has its own advantages and disadvantages. A method known as the BCa (bias-corrected and accelerated) method adjusts the quantiles (i.e., α/2 and 1-α/2) for violations of these assumptions. The result is highly accurate estimates for the confidence limits, but the procedure involves an additional round of sampling that can be highly inefficient for large data sets. The ABC method is a quick asymptotic approximation to the BCa bounds, but it can only be applied to smooth statistics. See Gilleland (2010) for more about these approaches.
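For BCa intervals in practice, SciPy's scipy.stats.bootstrap (available in SciPy 1.7 and later) can be used instead of a hand-rolled implementation. The example below uses synthetic data and RMSE purely for illustration:

    import numpy as np
    from scipy.stats import bootstrap   # SciPy >= 1.7

    def rmse(f, o):
        return np.sqrt(np.mean((f - o) ** 2))

    rng = np.random.default_rng(0)
    obs = rng.normal(size=200)                      # synthetic observations
    fcst = obs + rng.normal(scale=0.5, size=200)    # synthetic forecasts

    res = bootstrap((fcst, obs), rmse, paired=True, n_resamples=1000,
                    confidence_level=0.95, method="BCa")
    print(res.confidence_interval)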
When comparing the scores for two or more forecasts, one can calculate confidence intervals for the mean difference between the scores. If the (1-α) confidence interval for the difference does not include 0, then the performance of the forecasts can be considered significantly different.
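A minimal sketch of that comparison, assuming the two forecast systems are verified against the same observations; drawing the same resampled cases for both systems in each replicate preserves the pairing. The names (diff_ci, fcst_a, fcst_b) are hypothetical.

    import numpy as np

    def diff_ci(fcst_a, fcst_b, obs, score, n_boot=1000, alpha=0.05, rng=None):
        """Percentile interval for score(A) - score(B) on matched cases."""
        rng = np.random.default_rng(rng)
        n = len(obs)
        diffs = np.empty(n_boot)
        for i in range(n_boot):
            idx = rng.integers(0, n, size=n)        # one resample, shared by both systems
            diffs[i] = score(fcst_a[idx], obs[idx]) - score(fcst_b[idx], obs[idx])
        return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

If the returned interval excludes 0, the two systems' scores can be considered significantly different at roughly the α level.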
As mentioned above, bootstrap sampling assumes that the events in the sample are independent. Thus it is often necessary to sample in "blocks" to obtain more reasonable estimates of the confidence limits. For example, if there is high spatial correlation in the dataset, which is often the case in gridded forecasts, then each full grid might be sampled as a single block. Or, if there is also temporal correlation in the forecasts, then it could be necessary to form blocks of two or three successive forecasts.
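A minimal block-bootstrap sketch, assuming each forecast/observation pair carries a label (block_id) identifying the block it belongs to, e.g. the forecast date or the grid index; whole blocks, rather than individual pairs, are drawn with replacement:

    import numpy as np

    def block_bootstrap_ci(fcst, obs, block_id, score, n_boot=1000, alpha=0.05, rng=None):
        """Percentile interval where whole blocks (days, grids, ...) are resampled."""
        rng = np.random.default_rng(rng)
        blocks = [np.flatnonzero(block_id == b) for b in np.unique(block_id)]
        stats = np.empty(n_boot)
        for i in range(n_boot):
            chosen = rng.integers(0, len(blocks), size=len(blocks))   # draw blocks with replacement
            idx = np.concatenate([blocks[j] for j in chosen])
            stats[i] = score(fcst[idx], obs[idx])
        return np.quantile(stats, [alpha / 2, 1 - alpha / 2])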
A crucial assumption for using the block bootstrap approach is that the length of correlation is much smaller than the sample size. This assumption is usually valid for time series, where the series is often very long and the temporal correlation is relatively short. However, it is typically violated for spatial correlation, because many variables tend to be highly correlated spatially over areas that are large relative to the domain studied. In such a case, a parametric bootstrap procedure may be preferred. Such a procedure, however, requires an assumed model that can vary greatly from one type of variable (or region) to another.
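Purely as an illustration of what a parametric bootstrap might look like, the sketch below fits a simple AR(1) model to the forecast errors and simulates new error series from it; a real application would need a model (for example, a spatial covariance model) appropriate to the variable and region, which is exactly the difficulty noted above.

    import numpy as np

    def parametric_ci(fcst, obs, score, n_boot=1000, alpha=0.05, rng=None):
        """Parametric bootstrap interval using an illustrative AR(1) error model."""
        rng = np.random.default_rng(rng)
        err = fcst - obs
        n = len(err)
        mu = err.mean()                                # fitted mean error
        s = err.std(ddof=1)                            # fitted marginal spread
        rho = np.corrcoef(err[:-1], err[1:])[0, 1]     # fitted lag-1 correlation
        stats = np.empty(n_boot)
        for i in range(n_boot):
            sim = np.empty(n)
            sim[0] = rng.normal(mu, s)
            for t in range(1, n):                      # simulate serially correlated errors
                innov = rng.normal(0.0, s * np.sqrt(1 - rho ** 2))
                sim[t] = mu + rho * (sim[t - 1] - mu) + innov
            stats[i] = score(obs + sim, obs)           # score the simulated forecasts
        return np.quantile(stats, [alpha / 2, 1 - alpha / 2])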
References:
Gilleland, E., 2010: Confidence intervals for forecast verification.
NCAR Technical Note NCAR/TN-479+STR, 71pp.
Jolliffe, I.T., 2007: Uncertainty and inference for verification
measures. Wea. Forecasting, 22, 637-650.