The scoringutils package provides a collection of metrics and proper scoring rules that make it simple to score forecasts against the true observed values. Predictions can be scored automatically from a data.frame using the function eval_forecasts. Alternatively, the evaluation metrics can be accessed directly through lower-level functions within a vector/matrix framework.
Predictions can be handled in various formats: scoringutils can handle probabilistic forecasts in either a sample-based or a quantile-based format. For more detail on the expected input formats, please see below. True values can be integer, continuous or binary.
In addition to automatic scoring, scoringutils offers a variety of plots and visualisations.
Most of the time, the eval_forecasts function will be able to do the entire evaluation for you. The idea is simple, yet flexible.
All you need to do is to pass in a data.frame that has a column called prediction and one called true_value. Depending on the exact input format, additional columns like sample, quantile or range and boundary are needed. Additional columns may be present to indicate a grouping of forecasts. For example, we could have forecasts made by different models in various locations at different time points, each for several weeks into the future. In this case, we would have additional columns called, for example, model, date, forecast_date, forecast_horizon and location.
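As a minimal, purely hypothetical sketch, an input in the quantile-based format could look like the following (three quantiles of a single forecast, with one additional column identifying the model):
## hypothetical minimal input in a quantile-based format
toy_input <- data.frame(
  true_value = 12,
  prediction = c(8, 12, 16),
  quantile = c(0.25, 0.5, 0.75),
  model = "model_a"
)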
Using the by argument you need to specify the unit of a single forecast. In this example we would set by = c("model", "date", "forecast_date", "forecast_horizon", "location") (note: strictly speaking there is a small duplication here, as the information in "date" is already contained in the combination of "forecast_date" and "forecast_horizon", but this does not matter for the purpose of grouping our observations). If you don't specify by (i.e. by = NULL), scoringutils will automatically use all appropriate columns that are present. Note that you don't need to include columns such as quantile or sample in the by argument, as several quantiles / samples make up one forecast.
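For the example just described, a call could look like the following sketch (the data.frame forecasts and its contents are hypothetical; the columns are the ones listed above):
## a sketch, assuming a hypothetical data.frame `forecasts` with the columns
## prediction, true_value, quantile, model, date, forecast_date,
## forecast_horizon and location
scores <- scoringutils::eval_forecasts(
  forecasts,
  by = c("model", "date", "forecast_date", "forecast_horizon", "location")
)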
Using the summarise_by argument you can now choose categories to aggregate over. If you were only interested in scores for the different models, you would specify summarise_by = c("model"). If you wanted scores for every model in every location, you would specify summarise_by = c("model", "location"). If you wanted one score per quantile or one per prediction interval range, you could specify something like summarise_by = c("model", "quantile") or summarise_by = c("model", "quantile", "range") (again, some information is duplicated between quantile and range, but this does not matter for grouping purposes). When aggregating, eval_forecasts takes the mean according to the group defined in summarise_by (i.e. in this example, if summarise_by = c("model", "location"), scores will be averaged over all forecast dates, forecast horizons and quantiles to yield one score per model and location). In addition to the mean, you can also obtain the standard deviation of the scores you average over, as well as any desired quantile (e.g. the median in addition to the mean), by specifying sd = TRUE and quantiles = c(0.5).
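Continuing the hypothetical sketch from above, averaging to one score per model and location while also recording the standard deviation and the median of the aggregated scores would look like this:
## hypothetical sketch: one score per model and location, plus the standard
## deviation and the median of the scores that are averaged over
scores <- scoringutils::eval_forecasts(
  forecasts,
  by = c("model", "date", "forecast_date", "forecast_horizon", "location"),
  summarise_by = c("model", "location"),
  sd = TRUE,
  quantiles = c(0.5)
)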
Here is an example of an evaluation using toy data:
library(scoringutils)
library(data.table)
data <- scoringutils::quantile_example_data_plain
print(data, 3, 3)
#> true_value id model prediction horizon quantile
#> 1: 2.659261 1 model1 -0.6448536 1 0.05
#> 2: 2.659261 1 model1 0.3255102 1 0.25
#> 3: 2.659261 1 model1 1.0000000 1 0.50
#> ---
#> 598: 30.189608 30 model2 31.2242353 2 0.50
#> 599: 30.189608 30 model2 31.3873685 2 0.95
#> 600: 30.189608 30 model2 30.6399809 2 0.75
scoringutils::eval_forecasts(data,
                             summarise_by = c("model", "quantile", "range"))
#> model quantile range interval_score sharpness underprediction
#> 1: model1 0.50 0 0.8269027 0.0000000 0.304369095
#> 2: model1 0.25 50 0.7760589 0.3044214 0.179681560
#> 3: model1 0.75 50 0.7760589 0.3044214 0.179681560
#> 4: model1 0.05 90 0.2658170 0.1523911 0.024935181
#> 5: model1 0.95 90 0.2658170 0.1523911 0.024935181
#> 6: model2 0.50 0 0.9779030 0.0000000 0.350926228
#> 7: model2 0.25 50 0.6787509 0.3566315 0.072721303
#> 8: model2 0.75 50 0.6787509 0.3566315 0.072721303
#> 9: model2 0.05 90 0.2721723 0.1606143 0.008071852
#> 10: model2 0.95 90 0.2721723 0.1606143 0.008071852
#> overprediction coverage coverage_deviation bias aem
#> 1: 0.52253362 0.0000000 0.00000000 0.1566667 0.8269027
#> 2: 0.29195591 0.4000000 -0.10000000 0.1566667 0.8269027
#> 3: 0.29195591 0.4000000 -0.10000000 0.1566667 0.8269027
#> 4: 0.08849074 0.8166667 -0.08333333 0.1566667 0.8269027
#> 5: 0.08849074 0.8166667 -0.08333333 0.1566667 0.8269027
#> 6: 0.62697674 0.0000000 0.00000000 0.2233333 0.9779030
#> 7: 0.24939811 0.5333333 0.03333333 0.2233333 0.9779030
#> 8: 0.24939811 0.5333333 0.03333333 0.2233333 0.9779030
#> 9: 0.10348616 0.8500000 -0.05000000 0.2233333 0.9779030
#> 10: 0.10348616 0.8500000 -0.05000000 0.2233333 0.9779030
#> quantile_coverage
#> 1: 0.5500000
#> 2: 0.4000000
#> 3: 0.7833333
#> 4: 0.1333333
#> 5: 0.9500000
#> 6: 0.6166667
#> 7: 0.3333333
#> 8: 0.8666667
#> 9: 0.1500000
#> 10: 0.9833333
Using an appropriate level of summary, we can easily use the output for visualisation. The scoringutils package offers some built-in functions to help get a sense of the data.
scoringutils::plot_predictions(data, x = "id", range = c(0, 90),
                               facet_formula = ~ model)
(The data shown here are just randomly generated values. We plan to add real example data to make these illustrations more useful in the future.)
scores <- scoringutils::eval_forecasts(data,
                                       summarise_by = c("model"))
scoringutils::score_table(scores)
Given this level of aggregation, not all metrics may be meaningful. In this case, for example, averaging the quantile coverage over different quantiles is not very informative. If you like, you can select specific metrics for the visualisation, as sketched below.
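One possible way to do this is to subset the scores to the metric columns of interest before plotting. The sketch below uses data.table column selection with metric names taken from the output above; whether score_table accepts an arbitrary subset of metrics directly is an assumption, so check its documentation.
## a sketch: keep only a few metric columns before plotting
## (passing a subset to score_table() is assumed to work -- see ?score_table)
selected_scores <- scores[, .(model, interval_score, sharpness, bias)]
scoringutils::score_table(selected_scores)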
Let us look at calibration:
scores <- scoringutils::eval_forecasts(data,
                                       summarise_by = c("model", "range", "quantile"))
scoringutils::interval_coverage(scores) +
  ggplot2::ggtitle("Interval Coverage")

scoringutils::quantile_coverage(scores) +
  ggplot2::ggtitle("Quantile Coverage")
Let us look at the individual components of the weighted interval score:
scores <- scoringutils::eval_forecasts(data,
                                       summarise_by = c("model"))
scoringutils::wis_components(scores)
We can also look at contributions to different metrics by range:
scores <- scoringutils::eval_forecasts(data,
                                       summarise_by = c("model", "range"))
scoringutils::range_plot(scores, y = "interval_score")
We can also visualise metrics using a heatmap:
scores <- scoringutils::eval_forecasts(data,
                                       summarise_by = c("model", "horizon"))
scores[, horizon := as.factor(horizon)]
scoringutils::score_heatmap(scores,
                            x = "horizon", metric = "bias")
The eval_forecasts function is designed to work with a variety of input formats. The following formats are currently supported:
quantile forecasts in either a plain quantile format or in a format that specifies interval ranges and the boundary of a given interval range.
print(scoringutils::quantile_example_data_plain, 3, 3)
#> true_value id model prediction horizon quantile
#> 1: 2.659261 1 model1 -0.6448536 1 0.05
#> 2: 2.659261 1 model1 0.3255102 1 0.25
#> 3: 2.659261 1 model1 1.0000000 1 0.50
#> ---
#> 598: 30.189608 30 model2 31.2242353 2 0.50
#> 599: 30.189608 30 model2 31.3873685 2 0.95
#> 600: 30.189608 30 model2 30.6399809 2 0.75
print(scoringutils::quantile_example_data_long, 3, 3)
#> true_value id model prediction boundary range horizon
#> 1: 2.659261 1 model1 -0.6448536 lower 90 1
#> 2: 2.659261 1 model1 0.3255102 lower 50 1
#> 3: 2.659261 1 model1 1.0000000 lower 0 1
#> ---
#> 718: 30.189608 30 model2 31.3873685 upper 90 2
#> 719: 30.189608 30 model2 30.6399809 upper 50 2
#> 720: 30.189608 30 model2 31.2576984 upper 0 2
a sample-based format with either continuous or integer values:
print(scoringutils::integer_example_data, 3, 3)
#> # A tibble: 6,000 x 6
#> # Groups: id [30]
#> id model true_value sample prediction horizon
#> <int> <chr> <dbl> <int> <dbl> <dbl>
#> 1 1 model1 6 1 5 1
#> 2 1 model1 6 2 4 1
#> 3 1 model1 6 3 3 1
#> 4 1 model1 6 4 3 1
#> 5 1 model1 6 5 4 1
#> 6 1 model1 6 6 4 1
#> 7 1 model1 6 7 5 1
#> 8 1 model1 6 8 4 1
#> 9 1 model1 6 9 4 1
#> 10 1 model1 6 10 6 1
#> # … with 5,990 more rows
print(scoringutils::continuous_example_data, 3, 3)
#> id model true_value sample prediction horizon
#> 1: 1 model1 0.03007379 1 -0.203426069 1
#> 2: 1 model1 0.03007379 2 0.007621269 1
#> 3: 1 model1 0.03007379 3 -2.086657003 1
#> ---
#> 5998: 30 model2 -2.93749990 48 -0.079900522 2
#> 5999: 30 model2 -2.93749990 49 -1.178524017 2
#> 6000: 30 model2 -2.93749990 50 0.638750918 2
forecasts in a binary format:
print(scoringutils::binary_example_data, 3, 3)
#> # A tibble: 120 x 5
#> # Groups: id, model [60]
#> id model horizon prediction true_value
#> <int> <fct> <dbl> <dbl> <dbl>
#> 1 1 model1 1 0.746 0
#> 2 1 model1 2 0.522 0
#> 3 1 model2 1 0.00958 0
#> 4 1 model2 2 0.00671 0
#> 5 2 model1 1 0.730 0
#> 6 2 model1 2 0.511 0
#> 7 2 model2 1 0.0274 0
#> 8 2 model2 2 0.0192 0
#> 9 3 model1 1 0.543 0
#> 10 3 model1 2 0.380 0
#> # … with 110 more rows
scoringutils also offers functionality to convert between these formats. For more information, have a look at the documentation of the following functions:
scoringutils::sample_to_quantile() # convert from sample based to quantile format
scoringutils::range_to_quantile()  # convert from range format to plain quantile
scoringutils::quantile_to_range()  # convert the other way round
scoringutils::quantile_to_long()   # convert range based format from wide to long
scoringutils::quantile_to_wide()   # convert the other way round
A variety of metrics and scoring rules can also be accessed directly through the scoringutils package.
The following gives an overview of (most of) the implemented metrics.
The function bias determines bias from predictive Monte-Carlo samples, automatically recognising whether forecasts are continuous or integer valued.
For continuous forecasts, Bias is measured as \[B_t (P_t, x_t) = 1 - 2 \cdot (P_t (x_t))\]
where \(P_t\) is the empirical cumulative distribution function of the prediction for the true value \(x_t\). Computationally, \(P_t (x_t)\) is just calculated as the fraction of predictive samples for \(x_t\) that are smaller than \(x_t\).
For integer valued forecasts, Bias is measured as
\[B_t (P_t, x_t) = 1 - (P_t (x_t) + P_t (x_t + 1))\]
to adjust for the integer nature of the forecasts. In both cases, Bias can assume values between -1 and 1 and is 0 ideally.
## integer valued forecasts
true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
bias(true_values, predictions)
#> [1] -0.880 -0.575 0.455 -0.710 0.880 0.220 0.870 -0.495 -0.140 -0.345
#> [11] 0.185 0.875 -0.510 0.295 -0.880 0.530 0.925 -0.400 0.995 -0.465
#> [21] -0.165 0.000 -0.560 -0.340 0.610 0.330 0.030 0.285 -0.060 0.470
## continuous forecasts
true_values <- rnorm(30, mean = 1:30)
predictions <- replicate(200, rnorm(30, mean = 1:30))
bias(true_values, predictions)
#> [1] -0.70 -0.08 -0.58 -0.16 0.49 -0.02 0.52 0.41 -0.88 0.92 -0.96 -0.62
#> [13] -0.73 0.18 0.37 -0.30 -0.45 0.32 -0.13 0.17 0.44 -0.16 0.38 0.79
#> [25] 0.50 0.40 -0.02 0.74 -0.83 0.34
Calibration or reliability of forecasts is the ability of a model to correctly identify its own uncertainty in making predictions. In a model with perfect calibration, the observed data at each time point look as if they came from the predictive probability distribution at that time.
Equivalently, one can inspect the probability integral transform (PIT) of the predictive distribution at time \(t\),
\[u_t = F_t (x_t)\]
where \(x_t\) is the observed data point at time \(t \in t_1, …, t_n\), with \(n\) being the number of forecasts, and \(F_t\) is the (continuous) predictive cumulative probability distribution at time \(t\). If the true probability distribution of outcomes at time \(t\) is \(G_t\), then the forecasts \(F_t\) are said to be ideal if \(F_t = G_t\) at all times \(t\). In that case, the probabilities \(u_t\) are distributed uniformly.
In the case of discrete outcomes such as incidence counts, the PIT is no longer uniform even when forecasts are ideal. In that case a randomised PIT can be used instead:
\[u_t = P_t(k_t) + v \cdot (P_t(k_t) - P_t(k_t - 1) )\]
where \(k_t\) is the observed count, \(P_t(x)\) is the predictive cumulative probability of observing incidence \(k\) at time \(t\), \(P_t (-1) = 0\) by definition and \(v\) is standard uniform and independent of \(k\). If \(P_t\) is the true cumulative probability distribution, then \(u_t\) is standard uniform.
The function checks whether integer or continuous forecasts were provided. It then applies the (randomised) probability integral transform and tests the resulting values \(u_t\) for uniformity using the Anderson-Darling test.
As a rule of thumb, there is no evidence to suggest a forecasting model is miscalibrated if the p-value is greater than 0.1, some evidence that it is miscalibrated if \(0.01 < p < 0.1\), and good evidence that it is miscalibrated if \(p \leq 0.01\). It should be noted, though, that uniformity of the PIT is a necessary but not sufficient condition of calibration. It should also be noted that the test only works given a sufficient number of samples; otherwise the null hypothesis will often be rejected outright.
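To make the idea concrete, here is a minimal sketch that computes the (continuous) PIT values by hand from a matrix of predictive samples and inspects them with a histogram. It does not use the package's own calibration function and is only meant to illustrate the definition above.
## a minimal sketch of the continuous PIT, computed by hand
true_values <- rnorm(30, mean = 1:30)
predictions <- replicate(200, rnorm(30, mean = 1:30)) # 30 forecasts x 200 samples

## empirical CDF of each predictive distribution, evaluated at the true value
pit_values <- rowMeans(predictions <= true_values)

## for an ideal forecaster these values should look standard uniform
hist(pit_values, breaks = seq(0, 1, by = 0.1))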
The crps function is a wrapper around crps_sample from the scoringRules package; for more information, see the scoringRules documentation. It can be used for continuous as well as integer valued forecasts. Smaller values are better.
true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
crps(true_values, predictions)
#> [1] 0.649850 1.318150 0.435500 1.149900 0.781650 0.704250 0.832950 0.592300
#> [9] 5.230925 2.450325 0.893775 1.151875 1.277800 2.244200 1.216625 2.363750
#> [17] 4.915525 7.883525 1.141050 8.796975 3.920725 2.506000 3.947375 1.558000
#> [25] 1.358750 1.802525 2.241825 1.675950 1.998750 4.889025
The dss function is a wrapper around dss_sample from the scoringRules package; for more information, see the scoringRules documentation. It can be used for continuous as well as integer valued forecasts. Smaller values are better.
true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
dss(true_values, predictions)
#> [1] 0.980098 2.661951 1.309684 1.382606 1.566530 3.276875 3.366579 2.636354
#> [9] 2.484455 3.945296 2.716004 3.432542 3.715580 5.833559 3.539483 2.809888
#> [17] 6.787577 3.735003 3.770679 5.390386 4.519517 3.250709 3.307365 3.983590
#> [25] 3.363838 2.942595 4.960309 3.356285 4.096127 5.817716
The logs function is a wrapper around logs_sample from the scoringRules package; for more information, see the scoringRules documentation. It should not be used for integer valued forecasts: while log scores are in principle possible for integer valued forecasts, they require a kernel density estimate, which is not well defined for discrete values. Smaller values are better.
true_values <- rnorm(30, mean = 1:30)
predictions <- replicate(200, rnorm(n = 30, mean = 1:30))
logs(true_values, predictions)
#> [1] 1.2866381 0.8593650 1.0029785 1.0579208 0.9926259 1.2484167 1.5229751
#> [8] 1.4169337 1.0986732 1.1312240 1.7191367 0.8686863 1.1764081 0.9017488
#> [15] 1.9838143 2.4636160 1.4733809 0.9651339 1.2275458 1.1712084 0.9606882
#> [22] 1.5391638 1.1088316 1.1369635 1.4620804 1.4473927 1.7501319 1.0833631
#> [29] 1.1494207 1.6111555
The Brier score is a proper scoring rule that assesses the accuracy of probabilistic binary predictions. The outcomes can be either 0 or 1; the predictions must be a probability that the true outcome will be 1.
The Brier score is then computed as the mean squared error between the probabilistic prediction and the true outcome.
\[\text{Brier Score} = \frac{1}{N} \sum_{t = 1}^{N} (\text{prediction}_t - \text{outcome}_t)^2\]
true_values <- sample(c(0, 1), size = 30, replace = TRUE)
predictions <- runif(n = 30, min = 0, max = 1)

brier_score(true_values, predictions)
#> [1] 0.3243399
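Since the Brier score is simply the mean squared error between the predicted probabilities and the binary outcomes, the value returned above can be checked directly:
## manual check: mean squared error between predictions and outcomes
mean((predictions - true_values)^2)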
The Interval Score is a Proper Scoring Rule to score quantile predictions, following Gneiting and Raftery (2007). Smaller values are better.
The score is computed as
\[ \text{score} = (\text{upper} - \text{lower}) + \frac{2}{\alpha} \cdot (\text{lower} - \text{true\_value}) \cdot 1(\text{true\_value} < \text{lower}) + \frac{2}{\alpha} \cdot (\text{true\_value} - \text{upper}) \cdot 1(\text{true\_value} > \text{upper})\]
where \(1()\) is the indicator function and \(\alpha\) is the decimal value that indicates how much lies outside the prediction interval. To improve usability, the user is asked to provide an interval range in percentage terms, i.e. interval_range = 90 (percent) for a 90 percent prediction interval. Correspondingly, the user would have to provide the 5% and 95% quantiles (the corresponding \(\alpha\) would then be 0.1). No specific distribution is assumed, but the range has to be symmetric (i.e. you can't use the 0.1 quantile as the lower bound and the 0.7 quantile as the upper). Setting weigh = TRUE will weigh the score by \(\frac{\alpha}{2}\) such that the interval score converges to the CRPS for an increasing number of quantiles.
true_values <- rnorm(30, mean = 1:30)
interval_range <- 90
alpha <- (100 - interval_range) / 100
lower <- qnorm(alpha / 2, rnorm(30, mean = 1:30))
upper <- qnorm((1 - alpha / 2), rnorm(30, mean = 1:30))

interval_score(true_values = true_values,
               lower = lower,
               upper = upper,
               interval_range = interval_range)
#> [1] 0.22887917 0.12915489 0.14745489 0.20393535 0.08203821 0.08403069
#> [7] 0.16669027 0.19011763 0.07734240 0.22332403 1.52448382 0.28581583
#> [13] 0.38330615 0.23936227 0.14885444 0.22053655 0.16671844 0.24506179
#> [19] 0.20387278 1.48964203 0.17658916 0.22336535 0.06635200 0.11939883
#> [25] 0.19029015 1.91258428 0.15955126 0.07556659 0.37008564 0.22527360
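To connect the function output with the formula above, here is a sketch that computes the score directly from the quantities defined in the code block; note that, depending on the weigh argument, the values returned by interval_score may additionally be scaled by \(\frac{\alpha}{2}\).
## the interval score computed directly from the formula above
manual_score <- (upper - lower) +
  2 / alpha * (lower - true_values) * (true_values < lower) +
  2 / alpha * (true_values - upper) * (true_values > upper)

## scaled by alpha / 2, this corresponds to the weighted version of the score
manual_score * alpha / 2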