The `scoringutils` package provides a collection of metrics and proper scoring rules that make it simple to score forecasts against the true observed values. Predictions can either be automatically scored from a `data.frame` using the function `eval_forecasts`. Alternatively, evaluation metrics can be accessed directly using lower-level functions within a vector/matrix framework.

Predictions can be handled in various formats: `scoringutils` can handle probabilistic forecasts in either a sample-based or a quantile-based format. For more detail on the expected input formats, please see below. True values can be integer, continuous or binary.

In addition to automatic scoring, `scoringutils` offers a variety of plots and visualisations.

Most of the time, the `eval_forecasts` function will be able to do the entire evaluation for you. The idea is simple, yet flexible.

All you need to do is to pass in a `data.frame` that has a column called `prediction` and one called `true_value`. Depending on the exact input format, additional columns like `sample`, `quantile` or `range` and `boundary` are needed. Additional columns may be present to indicate a grouping of forecasts. For example, we could have forecasts made by different models in various locations at different time points, each for several weeks into the future. In this case, we would have additional columns called, for example, `model`, `date`, `forecast_date`, `forecast_horizon` and `location`.

Using the `by` argument you need to specify the *unit of a single forecast*. In this example we would set `by = c("model", "date", "forecast_date", "forecast_horizon", "location")` (note: strictly speaking there is a small duplication here, as the information in "date" is already contained in the combination of "forecast_date" and "forecast_horizon". But as long as there isn't some odd shift, this doesn't matter for the purpose of grouping our observations). If you don't specify `by` (i.e. `by = NULL`), `scoringutils` will automatically use all appropriate columns that are present. Note that you don't need to include columns such as `quantile` or `sample` in the `by` argument, as several quantiles / samples make up one forecast.
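To make the notion of a forecast unit concrete, here is a minimal base-R sketch (with invented toy columns, not the package's example data): each combination of `model` and `date` identifies one forecast, while the three `quantile` rows belong *within* a forecast.

```r
# toy quantile-format data: 2 models x 3 dates, each forecast given by 3 quantiles
toy <- expand.grid(model    = c("model1", "model2"),
                   date     = as.Date("2020-01-01") + 0:2,
                   quantile = c(0.05, 0.5, 0.95))
toy$true_value <- 10
toy$prediction <- 10 + qnorm(toy$quantile)

# the unit of a single forecast is model + date: 6 forecasts across 18 rows
n_forecasts <- nrow(unique(toy[, c("model", "date")]))
n_rows      <- nrow(toy)
c(n_forecasts, n_rows)
```

With data like this one would call something along the lines of `eval_forecasts(toy, by = c("model", "date"))`; `quantile` is deliberately left out of `by`.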

Using the `summarise_by` argument you can now choose categories to aggregate over. If you were only interested in scores for the different models, you would specify `summarise_by = c("model")`. If you wanted to have scores for every model in every location, you would need to specify `summarise_by = c("model", "location")`. If you wanted to have one score per quantile or one per prediction interval range, you could specify something like `summarise_by = c("model", "quantile")` or `summarise_by = c("model", "quantile", "range")` (note again that some information is duplicated between quantile and range, but this doesn't matter for grouping purposes). When aggregating, `eval_forecasts` takes the mean over the groups defined in `summarise_by` (i.e. in this example, if `summarise_by = c("model", "location")`, scores will be averaged over all forecast dates, forecast horizons and quantiles to yield one score per model and location). In addition to the mean, you can also obtain the standard deviation of the scores you average over, or any desired quantile (e.g. the median in addition to the mean), by specifying `sd = TRUE` and `quantiles = c(0.5)`.
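For example, such a call could look like this (a sketch using the package's built-in example data; the exact set of output columns depends on the package version):

```r
library(scoringutils)

scores <- eval_forecasts(
  quantile_example_data_plain,
  summarise_by = c("model"),
  sd = TRUE,           # additionally report the standard deviation of the scores
  quantiles = c(0.5)   # additionally report the median of the scores averaged over
)
scores
```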

Here is an example of an evaluation using toy data:

```
library(scoringutils)
library(data.table)
```

```
data <- scoringutils::quantile_example_data_plain
print(data, 3, 3)
#> true_value id model prediction horizon quantile
#> 1: 2.659261 1 model1 -0.6448536 1 0.05
#> 2: 2.659261 1 model1 0.3255102 1 0.25
#> 3: 2.659261 1 model1 1.0000000 1 0.50
#> ---
#> 598: 30.189608 30 model2 31.2242353 2 0.50
#> 599: 30.189608 30 model2 31.3873685 2 0.95
#> 600: 30.189608 30 model2 30.6399809 2 0.75
scoringutils::eval_forecasts(data,
                             summarise_by = c("model", "quantile", "range"))
#> model quantile range interval_score sharpness underprediction
#> 1: model1 0.50 0 0.8269027 0.0000000 0.304369095
#> 2: model1 0.25 50 0.7760589 0.3044214 0.179681560
#> 3: model1 0.75 50 0.7760589 0.3044214 0.179681560
#> 4: model1 0.05 90 0.2658170 0.1523911 0.024935181
#> 5: model1 0.95 90 0.2658170 0.1523911 0.024935181
#> 6: model2 0.50 0 0.9779030 0.0000000 0.350926228
#> 7: model2 0.25 50 0.6787509 0.3566315 0.072721303
#> 8: model2 0.75 50 0.6787509 0.3566315 0.072721303
#> 9: model2 0.05 90 0.2721723 0.1606143 0.008071852
#> 10: model2 0.95 90 0.2721723 0.1606143 0.008071852
#> overprediction coverage coverage_deviation bias aem
#> 1: 0.52253362 0.0000000 0.00000000 0.1566667 0.8269027
#> 2: 0.29195591 0.4000000 -0.10000000 0.1566667 0.8269027
#> 3: 0.29195591 0.4000000 -0.10000000 0.1566667 0.8269027
#> 4: 0.08849074 0.8166667 -0.08333333 0.1566667 0.8269027
#> 5: 0.08849074 0.8166667 -0.08333333 0.1566667 0.8269027
#> 6: 0.62697674 0.0000000 0.00000000 0.2233333 0.9779030
#> 7: 0.24939811 0.5333333 0.03333333 0.2233333 0.9779030
#> 8: 0.24939811 0.5333333 0.03333333 0.2233333 0.9779030
#> 9: 0.10348616 0.8500000 -0.05000000 0.2233333 0.9779030
#> 10: 0.10348616 0.8500000 -0.05000000 0.2233333 0.9779030
#> quantile_coverage
#> 1: 0.5500000
#> 2: 0.4000000
#> 3: 0.7833333
#> 4: 0.1333333
#> 5: 0.9500000
#> 6: 0.6166667
#> 7: 0.3333333
#> 8: 0.8666667
#> 9: 0.1500000
#> 10: 0.9833333
```

Using an appropriate level of summary, we can easily use the output for visualisation. The `scoringutils` package offers some built-in functions to help get a sense of the data.

```
scoringutils::plot_predictions(data, x = "id", range = c(0, 90),
                               facet_formula = ~ model)
```

(The data shown here are just randomly generated values. We plan to add real example data to make these illustrations more useful in the future.)

```
scores <- scoringutils::eval_forecasts(data,
                                       summarise_by = c("model"))
scoringutils::score_table(scores)
```

Given this level of aggregation, not all metrics may make sense. In this case, for example, averaging over different quantiles to compute quantile coverage does not make much sense. If you like, you can select specific metrics for the visualisation.

Let us look at calibration:

```
scores <- scoringutils::eval_forecasts(data,
                                       summarise_by = c("model", "range", "quantile"))
scoringutils::interval_coverage(scores) +
  ggplot2::ggtitle("Interval Coverage")

scoringutils::quantile_coverage(scores) +
  ggplot2::ggtitle("Quantile Coverage")
```

Let us look at the individual components of the weighted interval score:

```
scores <- scoringutils::eval_forecasts(data,
                                       summarise_by = c("model"))
scoringutils::wis_components(scores)
```

We can also look at contributions to different metrics by range:

```
scores <- scoringutils::eval_forecasts(data,
                                       summarise_by = c("model", "range"))
scoringutils::range_plot(scores, y = "interval_score")
```

We can also visualise metrics using a heatmap:

```
scores <- scoringutils::eval_forecasts(data,
                                       summarise_by = c("model", "horizon"))
scores[, horizon := as.factor(horizon)]
scoringutils::score_heatmap(scores,
                            x = "horizon", metric = "bias")
```

The `eval_forecasts` function is designed to work with various different input formats. The following formats are currently supported:

quantile forecasts in either a plain quantile format or in a format that specifies interval ranges and the boundary of a given interval range:

```
print(scoringutils::quantile_example_data_plain, 3, 3)
#> true_value id model prediction horizon quantile
#> 1: 2.659261 1 model1 -0.6448536 1 0.05
#> 2: 2.659261 1 model1 0.3255102 1 0.25
#> 3: 2.659261 1 model1 1.0000000 1 0.50
#> ---
#> 598: 30.189608 30 model2 31.2242353 2 0.50
#> 599: 30.189608 30 model2 31.3873685 2 0.95
#> 600: 30.189608 30 model2 30.6399809 2 0.75
print(scoringutils::quantile_example_data_long, 3, 3)
#> true_value id model prediction boundary range horizon
#> 1: 2.659261 1 model1 -0.6448536 lower 90 1
#> 2: 2.659261 1 model1 0.3255102 lower 50 1
#> 3: 2.659261 1 model1 1.0000000 lower 0 1
#> ---
#> 718: 30.189608 30 model2 31.3873685 upper 90 2
#> 719: 30.189608 30 model2 30.6399809 upper 50 2
#> 720: 30.189608 30 model2 31.2576984 upper 0 2
```

sample based format with either continuous or integer values:

```
print(scoringutils::integer_example_data, 3, 3)
#> # A tibble: 6,000 x 6
#> # Groups: id [30]
#> id model true_value sample prediction horizon
#> <int> <chr> <dbl> <int> <dbl> <dbl>
#> 1 1 model1 6 1 5 1
#> 2 1 model1 6 2 4 1
#> 3 1 model1 6 3 3 1
#> 4 1 model1 6 4 3 1
#> 5 1 model1 6 5 4 1
#> 6 1 model1 6 6 4 1
#> 7 1 model1 6 7 5 1
#> 8 1 model1 6 8 4 1
#> 9 1 model1 6 9 4 1
#> 10 1 model1 6 10 6 1
#> # … with 5,990 more rows
print(scoringutils::continuous_example_data, 3, 3)
#> id model true_value sample prediction horizon
#> 1: 1 model1 0.03007379 1 -0.203426069 1
#> 2: 1 model1 0.03007379 2 0.007621269 1
#> 3: 1 model1 0.03007379 3 -2.086657003 1
#> ---
#> 5998: 30 model2 -2.93749990 48 -0.079900522 2
#> 5999: 30 model2 -2.93749990 49 -1.178524017 2
#> 6000: 30 model2 -2.93749990 50 0.638750918 2
```

forecasts in a binary format:

```
print(scoringutils::binary_example_data, 3, 3)
#> # A tibble: 120 x 5
#> # Groups: id, model [60]
#> id model horizon prediction true_value
#> <int> <fct> <dbl> <dbl> <dbl>
#> 1 1 model1 1 0.746 0
#> 2 1 model1 2 0.522 0
#> 3 1 model2 1 0.00958 0
#> 4 1 model2 2 0.00671 0
#> 5 2 model1 1 0.730 0
#> 6 2 model1 2 0.511 0
#> 7 2 model2 1 0.0274 0
#> 8 2 model2 2 0.0192 0
#> 9 3 model1 1 0.543 0
#> 10 3 model1 2 0.380 0
#> # … with 110 more rows
```

It also offers functionality to convert between these formats. For more information have a look at the documentation of the following functions:

```
scoringutils::sample_to_quantile() # convert from sample based to quantile format
scoringutils::range_to_quantile()  # convert from range format to plain quantile
scoringutils::quantile_to_range()  # convert the other way round
scoringutils::quantile_to_long()   # convert range based format from wide to long
scoringutils::quantile_to_wide()   # convert the other way round
```

A variety of metrics and scoring rules can also be accessed directly through the `scoringutils` package.

The following gives an overview of (most of) the implemented metrics.

The function `bias` determines bias from predictive Monte Carlo samples, automatically recognising whether forecasts are continuous or integer valued.

For continuous forecasts, Bias is measured as \[B_t (P_t, x_t) = 1 - 2 \cdot (P_t (x_t))\]

where \(P_t\) is the empirical cumulative distribution function of the prediction for the true value \(x_t\). Computationally, \(P_t (x_t)\) is simply calculated as the fraction of predictive samples that are smaller than \(x_t\).

For integer valued forecasts, Bias is measured as

\[B_t (P_t, x_t) = 1 - (P_t (x_t) + P_t (x_t + 1))\]

to adjust for the integer nature of the forecasts. In both cases, Bias can assume values between -1 and 1 and is 0 ideally.
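The continuous formula above can be checked by hand: \(P_t(x_t)\) is just the proportion of predictive samples below the true value. A base-R sketch with made-up numbers:

```r
set.seed(1)
true_value <- 3.2
samples    <- rnorm(200, mean = 3)   # predictive Monte Carlo samples

p_t         <- mean(samples < true_value)  # empirical CDF evaluated at the true value
bias_manual <- 1 - 2 * p_t                 # 0 is ideal; bounded between -1 and 1
bias_manual
```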

```
## integer valued forecasts
true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
bias(true_values, predictions)
#> [1] -0.880 -0.575 0.455 -0.710 0.880 0.220 0.870 -0.495 -0.140 -0.345
#> [11] 0.185 0.875 -0.510 0.295 -0.880 0.530 0.925 -0.400 0.995 -0.465
#> [21] -0.165 0.000 -0.560 -0.340 0.610 0.330 0.030 0.285 -0.060 0.470
## continuous forecasts
true_values <- rnorm(30, mean = 1:30)
predictions <- replicate(200, rnorm(30, mean = 1:30))
bias(true_values, predictions)
#> [1] -0.70 -0.08 -0.58 -0.16 0.49 -0.02 0.52 0.41 -0.88 0.92 -0.96 -0.62
#> [13] -0.73 0.18 0.37 -0.30 -0.45 0.32 -0.13 0.17 0.44 -0.16 0.38 0.79
#> [25] 0.50 0.40 -0.02 0.74 -0.83 0.34
```

Calibration or reliability of forecasts is the ability of a model to correctly identify its own uncertainty in making predictions. In a model with perfect calibration, the observed data at each time point look as if they came from the predictive probability distribution at that time.

Equivalently, one can inspect the probability integral transform of the predictive distribution at time t,

\[u_t = F_t (x_t)\]

where \(x_t\) is the observed data point at time \(t \in t_1, …, t_n\), with \(n\) being the number of forecasts, and \(F_t\) is the (continuous) predictive cumulative probability distribution at time \(t\). If the true probability distribution of outcomes at time \(t\) is \(G_t\), then the forecasts \(F_t\) are said to be ideal if \(F_t = G_t\) at all times \(t\). In that case, the probabilities \(u_t\) are distributed uniformly.

In the case of discrete outcomes such as incidence counts, the PIT is no longer uniform even when forecasts are ideal. In that case a randomised PIT can be used instead:

\[u_t = P_t(k_t - 1) + v \cdot (P_t(k_t) - P_t(k_t - 1) )\]

where \(k_t\) is the observed count, \(P_t(x)\) is the predictive cumulative probability of observing incidence \(x\) at time \(t\), \(P_t (-1) = 0\) by definition, and \(v\) is standard uniform and independent of \(k\). If \(P_t\) is the true cumulative probability distribution, then \(u_t\) is standard uniform.

The function checks whether integer or continuous forecasts were provided. It then applies the (randomised) probability integral transform and tests the values \(u_t\) for uniformity using the Anderson-Darling test.

As a rule of thumb, there is no evidence to suggest a forecasting model is miscalibrated if the p-value is \(p \geq 0.1\), some evidence that it is miscalibrated if \(0.01 < p < 0.1\), and good evidence that it is miscalibrated if \(p \leq 0.01\). In this context it should be noted, though, that uniformity of the PIT is a necessary but not sufficient condition for calibration. It should also be noted that the test only works given sufficient samples; otherwise the null hypothesis will often be rejected outright.
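To illustrate the randomised PIT for count forecasts, here is a self-contained base-R sketch. Note that the package uses the Anderson-Darling test; a Kolmogorov-Smirnov test is substituted here only to keep the example dependency-free:

```r
set.seed(42)
true_values <- rpois(100, lambda = 10)
predictions <- replicate(500, rpois(100, lambda = 10))  # an "ideal" forecaster

# randomised PIT: u_t = P_t(k_t - 1) + v * (P_t(k_t) - P_t(k_t - 1))
pit <- sapply(seq_along(true_values), function(i) {
  p_k   <- mean(predictions[i, ] <= true_values[i])
  p_km1 <- mean(predictions[i, ] <= true_values[i] - 1)
  p_km1 + runif(1) * (p_k - p_km1)
})

# for an ideal forecaster the PIT values should look standard uniform
ks.test(pit, "punif")
```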

Wrapper around the `crps_sample` function from the `scoringRules` package. For more information, look at the manuals of the `scoringRules` package. The function can be used for continuous as well as integer valued forecasts. Smaller values are better.

```
true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
crps(true_values, predictions)
#> [1] 0.649850 1.318150 0.435500 1.149900 0.781650 0.704250 0.832950 0.592300
#> [9] 5.230925 2.450325 0.893775 1.151875 1.277800 2.244200 1.216625 2.363750
#> [17] 4.915525 7.883525 1.141050 8.796975 3.920725 2.506000 3.947375 1.558000
#> [25] 1.358750 1.802525 2.241825 1.675950 1.998750 4.889025
```

Wrapper around the `dss_sample` function from the `scoringRules` package. For more information, look at the manuals of the `scoringRules` package. The function can be used for continuous as well as integer valued forecasts. Smaller values are better.

```
true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
dss(true_values, predictions)
#> [1] 0.980098 2.661951 1.309684 1.382606 1.566530 3.276875 3.366579 2.636354
#> [9] 2.484455 3.945296 2.716004 3.432542 3.715580 5.833559 3.539483 2.809888
#> [17] 6.787577 3.735003 3.770679 5.390386 4.519517 3.250709 3.307365 3.983590
#> [25] 3.363838 2.942595 4.960309 3.356285 4.096127 5.817716
```

Wrapper around the `logs_sample` function from the `scoringRules` package. For more information, look at the manuals of the `scoringRules` package. The function should not be used for integer valued forecasts. While log scores are in principle possible for integer valued forecasts, they require a kernel density estimate, which is not well defined for discrete values. Smaller values are better.

```
true_values <- rnorm(30, mean = 1:30)
predictions <- replicate(200, rnorm(n = 30, mean = 1:30))
logs(true_values, predictions)
#> [1] 1.2866381 0.8593650 1.0029785 1.0579208 0.9926259 1.2484167 1.5229751
#> [8] 1.4169337 1.0986732 1.1312240 1.7191367 0.8686863 1.1764081 0.9017488
#> [15] 1.9838143 2.4636160 1.4733809 0.9651339 1.2275458 1.1712084 0.9606882
#> [22] 1.5391638 1.1088316 1.1369635 1.4620804 1.4473927 1.7501319 1.0833631
#> [29] 1.1494207 1.6111555
```

The Brier score is a proper scoring rule that assesses the accuracy of probabilistic binary predictions. The outcomes can be either 0 or 1; the prediction must be a probability that the true outcome will be 1.

The Brier score is then computed as the mean squared error between the probabilistic prediction and the true outcome:

\[\text{Brier_Score} = \frac{1}{N} \sum_{t = 1}^{N} (\text{prediction}_t - \text{outcome}_t)^2\]

```
true_values <- sample(c(0, 1), size = 30, replace = TRUE)
predictions <- runif(n = 30, min = 0, max = 1)

brier_score(true_values, predictions)
#> [1] 0.3243399
```

The Interval Score is a proper scoring rule to score quantile predictions, following Gneiting and Raftery (2007). Smaller values are better.

The score is computed as

\[ \text{score} = (\text{upper} - \text{lower}) + \\ \frac{2}{\alpha} \cdot (\text{lower} - \text{true_value}) \cdot 1(\text{true_value} < \text{lower}) + \\ \frac{2}{\alpha} \cdot (\text{true_value} - \text{upper}) \cdot 1(\text{true_value} > \text{upper})\]

where \(1()\) is the indicator function and \(\alpha\) is the decimal value that indicates how much probability mass lies outside the prediction interval. To improve usability, the user is asked to provide an interval range in percentage terms, i.e. `interval_range = 90` (percent) for a 90 percent prediction interval. Correspondingly, the user would have to provide the 5% and 95% quantiles (the corresponding \(\alpha\) would then be 0.1). No specific distribution is assumed, but the range has to be symmetric (i.e. you can't use the 0.1 quantile as the lower bound and the 0.7 quantile as the upper). Setting `weigh = TRUE` will weigh the score by \(\frac{\alpha}{2}\) such that the Interval Score converges to the CRPS for an increasing number of equally spaced quantiles.

```
true_values <- rnorm(30, mean = 1:30)
interval_range <- 90
alpha <- (100 - interval_range) / 100
lower <- qnorm(alpha / 2, rnorm(30, mean = 1:30))
upper <- qnorm((1 - alpha / 2), rnorm(30, mean = 1:30))

interval_score(true_values = true_values,
               lower = lower,
               upper = upper,
               interval_range = interval_range)
#> [1] 0.22887917 0.12915489 0.14745489 0.20393535 0.08203821 0.08403069
#> [7] 0.16669027 0.19011763 0.07734240 0.22332403 1.52448382 0.28581583
#> [13] 0.38330615 0.23936227 0.14885444 0.22053655 0.16671844 0.24506179
#> [19] 0.20387278 1.48964203 0.17658916 0.22336535 0.06635200 0.11939883
#> [25] 0.19029015 1.91258428 0.15955126 0.07556659 0.37008564 0.22527360
```