Introduction to sdcLog

Overview

This vignette introduces the sdcLog package and its main functions, illustrated with various examples. sdcLog provides tools which simplify statistical disclosure control in the context of research data centers (RDC).

The package includes four main functions:

sdc_descriptives()

This function performs statistical disclosure control according to two main criteria: On the one hand, it checks for a sufficiently large number of distinct statistical entities behind each statistic. On the other hand, it checks for dominance, which means that the two largest entities must not account for more than 85 percent of the observed values. How to use sdc_descriptives() is shown below.
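For intuition, here is a minimal numeric sketch of the dominance check (the values are made up for illustration):

values <- c(50, 40, 5, 3, 2)  # hypothetical values of five entities
sum(sort(values, decreasing = TRUE)[1:2]) / sum(values)  # 0.9 > 0.85: dominant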

Data

To introduce sdc_descriptives(), a simple toy dataset is used. It contains 20 observations of 10 distinct entities from two different sectors, with values for the variables val_1 and val_2 in the years 2019 and 2020.

data("sdc_descriptives_DT")
sdc_descriptives_DT
#>     id sector year      val_1    val_2
#>  1:  A     S1 2019         NA 9.477642
#>  2:  A     S1 2020  94.174449 5.856641
#>  3:  B     S1 2019   4.349115 3.697140
#>  4:  B     S1 2020   2.589011 6.796527
#>  5:  C     S1 2019   6.155680 7.213390
#>  6:  C     S1 2020   7.183206 5.948330
#>  7:  D     S1 2019   9.173870 3.004441
#>  8:  D     S1 2020   4.456933 0.000000
#>  9:  E     S1 2019   2.815137 4.137765
#> 10:  E     S1 2020   7.928573 0.000000
#> 11:  F     S2 2019   9.085507 5.088913
#> 12:  F     S2 2020 180.816675 0.000000
#> 13:  G     S2 2019   9.502077 2.107123
#> 14:  G     S2 2020   7.458567 0.000000
#> 15:  H     S2 2019   6.947180 5.059104
#> 16:  H     S2 2020   9.927155 3.489741
#> 17:  I     S2 2019   6.662026 8.957527
#> 18:  I     S2 2020   4.420317 8.618987
#> 19:  J     S2 2019   1.556076 4.722792
#> 20:  J     S2 2020   7.997007 7.347734

Examples

Simple cases

Consider the case that the mean for val_1 has been calculated and is now to be output as a result:1

sdc_descriptives_DT[, .(mean = mean(val_1, na.rm = TRUE))]
#>        mean
#> 1: 20.16835

Before this result can be released, it must be checked whether all RDC rules were followed in its calculation. To this end, the underlying data is checked for compliance with the RDC rules.

This is the simplest case: the descriptive statistic (the mean) was calculated for the variable val_1 without further specifications. The required arguments of sdc_descriptives() are the data set (data), the ID variable (id_var) and the variable for which the statistic was calculated (val_var):

sdc_descriptives(data = sdc_descriptives_DT, id_var = "id", val_var = "val_1")
#> [ OPTIONS:  sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id | val_var: val_1 | zero_as_NA: FALSE ]
#> Output complies to RDC rules.

Since there are no problems at this point, the function runs without warnings and (invisibly) returns a list containing the options, the settings and the checked criteria distinct_ids and dominance.

Options and settings are always printed to show that all specifications are set according to RDC rules. The output above shows that at least 3 distinct entities are required (sdc.n_ids: 3) and that dominance is defined as the 2 largest entities (sdc.n_ids_dominance: 2) accounting for a value share of more than 85 percent (sdc.share_dominance: 0.85). These are the option values used throughout this vignette. For details on setting options see the separate vignette on options.
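The sdc.* options are ordinary R options, so they can be inspected with getOption() and changed with options(); how to set them properly is covered in the options vignette:

getOption("sdc.n_ids")
getOption("sdc.n_ids_dominance")
getOption("sdc.share_dominance")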

The settings repeat which arguments were specified in the function call; they vary depending on which sdc_* function is used. This is especially helpful if the result from sdc_descriptives() is not printed right away.
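For example, the result can be stored and inspected later (a sketch; the element names follow the description above):

res <- sdc_descriptives(data = sdc_descriptives_DT, id_var = "id", val_var = "val_1")
res$distinct_ids  # number of distinct entities behind the statistic
res$dominance     # value share of the largest entities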

Grouped descriptive statistics using by

This and the following section present more advanced cases to introduce further arguments and functionalities of sdc_descriptives().

In this case the descriptive statistics for the variable val_1 are grouped by sector:

sdc_descriptives_DT[, .(mean = mean(val_1, na.rm = TRUE)), by = "sector"]
#>    sector     mean
#> 1:     S1 15.42511
#> 2:     S2 24.43726

Since the mean is computed by sector, the grouping variable must be specified in the by argument. Checking the result leads to the following:

sdc_descriptives(data = sdc_descriptives_DT, id_var = "id", val_var = "val_1", by = "sector")
#> [ OPTIONS:  sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id | val_var: val_1 | by: sector | zero_as_NA: FALSE ]
#> Output complies to RDC rules.

The descriptive statistics grouped by sector do not generate a warning and therefore comply with the RDC rules, so the results could be released in this case.
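This can be made plausible by counting the distinct entities behind each group (a sketch using data.table::uniqueN(), not part of sdcLog itself):

sdc_descriptives_DT[!is.na(val_1), .(distinct_ids = uniqueN(id)), by = "sector"]
# 5 distinct entities in each sector, above the required minimum of 3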

To extend this case further, the mean of val_1 is now grouped not only by sector but also by year:

sdc_descriptives_DT[, .(mean = mean(val_1, na.rm = TRUE)), by = c("sector", "year")]
#>    sector year      mean
#> 1:     S1 2019  5.623451
#> 2:     S1 2020 23.266434
#> 3:     S2 2019  6.750574
#> 4:     S2 2020 42.123944

To check this result for compliance with RDC rules, use:

sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_1",
  by = c("sector", "year")
)
#> Warning: Potential disclosure problem: Dominant entities.
#> [ OPTIONS:  sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id | val_var: val_1 | by: sector, year | zero_as_NA: FALSE ]
#> Dominant entities:
#>    sector year value_share
#> 1:     S2 2020   0.9056314
#> 2:     S1 2020   0.8776852
#> 3:     S1 2019   0.6815011
#> 4:     S2 2019   0.5506965

Now a warning appears because the dominance criterion is violated: as the table shows, in 2020 the value shares of the two largest entities, approx. 88 percent for S1 and 91 percent for S2, exceed the 85 percent limit. Therefore, the descriptive statistics for val_1, grouped by sector and year, cannot be released.
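The reported value shares can be reproduced by hand (a sketch of the dominance computation, not how sdc_descriptives() is implemented internally):

# sum val_1 per entity within each group, then compute the share of the
# two largest entities in the group total
sdc_descriptives_DT[
  !is.na(val_1), .(total = sum(val_1)), by = c("id", "sector", "year")
][
  order(-total),
  .(value_share = sum(head(total, 2)) / sum(total)),
  by = c("sector", "year")
][
  order(-value_share)
]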

Handling zeros using zero_as_NA

Now, descriptive statistics are calculated for variable val_2 and grouped by sector and year:

sdc_descriptives_DT[, .(mean = mean(val_2, na.rm = TRUE)), by = c("sector", "year")]
#>    sector year     mean
#> 1:     S1 2019 5.506076
#> 2:     S1 2020 3.720300
#> 3:     S2 2019 5.187092
#> 4:     S2 2020 3.891292

Compliance with the rules can be checked just as in the previous case (replacing val_1 with val_2):

sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_2",
  by = c("sector", "year")
)
#> A share of 0.2 of 'val_var' are zero. These will be treated as 'NA'.
#> To prevent this behavior and / or avoid this message, set 'zero_as_NA' explicitly.
#> [ OPTIONS:  sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id | val_var: val_2 | by: sector, year | zero_as_NA: TRUE ]
#> Output complies to RDC rules.

The result shows that the output complies with the RDC rules. Note, however, the additional message about zero values and that zero_as_NA is set to TRUE in the settings.

The message points out that the value 0 occurs rather frequently in the data (20 percent of all cases) and that 0 is therefore assumed to represent missing values and treated as such. Please note that even if the 0s are actual zeros in the data, this assumption can be appropriate in the context of statistical disclosure control: if most of the cases are 0, it might be publicly known which entities do not have a value of 0 for this specific variable. Treating those 0s as NA is then the safer interpretation, which is why it is the default.
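The share of zeros that triggered the message can be checked directly (a quick sketch):

sdc_descriptives_DT[, mean(val_2 == 0)]  # 0.2, i.e. 20 percent of all cases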

However, it might be more accurate, given the underlying data, to treat values of 0 as actual zeros (instead of NA). Specifying the argument zero_as_NA = FALSE overrides the default behavior and treats 0 like any other numeric value:

sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_2",
  by = c("sector", "year"),
  zero_as_NA = FALSE
)
#> [ OPTIONS:  sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id | val_var: val_2 | by: sector, year | zero_as_NA: FALSE ]
#> Output complies to RDC rules.

Now 0 is no longer treated as NA. The output still complies with the RDC rules, so it could be released (assuming it is actually correct to treat 0s as regular numeric values).

sdc_model()

This function checks whether your model complies with the RDC rules. The criterion of distinct entities is checked here as well. In addition, when dummy variables are used, it is checked whether there are enough distinct entities for each level of the dummy variable. The function can be used to check a broad range of models such as lm, glm and various others. In fact, anything that can be handled by broom::augment() can also be handled by sdc_model(). For a list of supported models see ?generics::augment.

Data & models

To introduce sdc_model(), another dataset is used, which includes dummy variables.

We have 80 observations of 10 different entities for the variables y, x_1, x_2, x_3 and x_4, plus three dummy variables (dummy_1, dummy_2, dummy_3) with information on sector, year and country. A summary of the data set is given below.

data("sdc_model_DT")
print(skim(sdc_model_DT))
#> ── Data Summary ────────────────────────
#>                            Values      
#> Name                       sdc_model_DT
#> Number of rows             80          
#> Number of columns          9           
#> _______________________                
#> Column type frequency:                 
#>   factor                   4           
#>   numeric                  5           
#> ________________________               
#> Group variables            None        
#> 
#> ── Variable type: factor ──────────────────────────────────────────────────────────────────────
#>   skim_variable n_missing complete_rate ordered n_unique top_counts                    
#> 1 id                    0             1 FALSE         10 A: 8, B: 8, C: 8, D: 8        
#> 2 dummy_1               0             1 FALSE          2 S1: 40, S2: 40                
#> 3 dummy_2               0             1 FALSE          8 Y1: 10, Y2: 10, Y3: 10, Y4: 10
#> 4 dummy_3               0             1 FALSE          4 ES: 36, BE: 20, DE: 20, FR: 4 
#> 
#> ── Variable type: numeric ─────────────────────────────────────────────────────────────────────
#>   skim_variable n_missing complete_rate  mean      sd     p0   p25   p50   p75   p100 hist 
#> 1 y                     0           1    121.    7.21 102.   116.   121.  125.   139. ▁▅▇▆▁
#> 2 x_1                   0           1    122.   13.3   95.9  111.   122.  133.   152. ▅▇▇▇▂
#> 3 x_2                   0           1    116.   19.4   86.1   99.1  113.  128.   152. ▇▇▆▃▆
#> 4 x_3                  48           0.4  126.   59.0   36.0   81.3  120.  173.   230. ▆▆▅▇▃
#> 5 x_4                   0           1   2544. 5944.     5.98  91.4  164.  227. 25059. ▇▁▁▁▁

Various simple linear models are specified from this dataset for illustration purposes.

model_1 <- lm(y ~ x_1 + x_2, data = sdc_model_DT)
model_2 <- lm(y ~ x_1 + x_2 + x_3, data = sdc_model_DT)
model_3 <- lm(y ~ x_1 + x_2 + dummy_1 + dummy_2, data = sdc_model_DT)
model_4 <- lm(y ~ x_1 + x_2 + dummy_3, data = sdc_model_DT)

Examples

These models are now checked for compliance with the RDC rules. It is checked whether there are enough distinct entities in the whole model and whether every level of each dummy variable complies with the rules.

A selection of unproblematic and potentially problematic models has been made to better explain the differences. To check for compliance, the model object (model), the data used (data) and the ID variable (id_var) must be specified in sdc_model().

Simple cases

A check of model_1 and model_3 is shown below.

sdc_model(data = sdc_model_DT, model = model_1, id_var = "id")
#> [ OPTIONS:  sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id ]
#> Output complies to RDC rules.

sdc_model(data = sdc_model_DT, model = model_3, id_var = "id")
#> [ OPTIONS:  sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id ]
#> Output complies to RDC rules.

As we see, there are no problems and the models could be released as output.
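Other model classes supported by broom::augment() are checked in exactly the same way. As a hypothetical sketch (this model is not part of the prepared set above; output omitted, it is analogous to the checks shown here), a generalized linear model could be checked like this:

model_glm <- glm(y ~ x_1 + x_2, family = gaussian(), data = sdc_model_DT)
sdc_model(data = sdc_model_DT, model = model_glm, id_var = "id")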

Problematic cases

Now we turn to the cases that typically cause problems, checking model_2 and model_4:

sdc_model(data = sdc_model_DT, model = model_2, id_var = "id")
#> [ OPTIONS:  sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id ]
#> Output complies to RDC rules.

sdc_model(data = sdc_model_DT, model = model_4, id_var = "id")
#> [ OPTIONS:  sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id ]
#> Output complies to RDC rules.

Which difficulties can occur with these models? With the option values used here (at least 3 distinct entities), the checks above pass; under stricter settings, however, both models become problematic.

model_2 leads to problems with the number of distinct entities. This arises from the inclusion of the variable x_3, which contains a large number of NAs.

For model_4 the problem stems from a small number of distinct entities behind the level FR of dummy_3. Please note that this last case is probably the most common problem to occur when checking models.
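To see where such problems come from, the number of distinct entities behind each level of a dummy variable can be inspected directly (a sketch using data.table::uniqueN()):

sdc_model_DT[, .(distinct_ids = uniqueN(id)), by = "dummy_3"]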

sdc_extreme()

This function automatically calculates extreme values that comply with the rules of the RDC. It mainly checks the criteria of distinct entities and dominance. The values are calculated as averages of a sufficiently large number of observations. It is based on an iterative procedure that aggregates data until there are enough distinct entities to calculate the extreme values and no problems with dominance occur.

The function always starts the iteration process with the lowest possible number of observations for each extreme value (here 3, since at least three distinct statistical entities must be included in the calculation according to the RDC rules, see sdc.n_ids). Furthermore, the function checks that the subsets of data used for minimum and maximum do not overlap.

If there are no problems with the calculation, the function returns a list with the extreme values. Maximum and minimum are always output together; neither of the two can be calculated separately. If it is not possible to calculate extreme values under these criteria, a corresponding message is printed and the result is filled with NA.

Data

To introduce sdc_extreme(), another simple dataset is used. We have 20 observations of 10 different entities, each observed once per sector, with values for the variables val_1, val_2 and val_3 in the years 2019 and 2020.

data("sdc_extreme_DT")
sdc_extreme_DT
#>     id sector year val_1 val_2 val_3
#>  1:  A     S1 2019    20   200    NA
#>  2:  B     S1 2020    19   190    19
#>  3:  C     S1 2019    18    18    18
#>  4:  D     S1 2020    17    17    17
#>  5:  E     S1 2019    16    16    16
#>  6:  F     S1 2020    15    15    15
#>  7:  G     S1 2019    14    14    14
#>  8:  H     S1 2020    13    13    13
#>  9:  I     S1 2019    12    12    12
#> 10:  J     S1 2020    11    11    11
#> 11:  A     S2 2019    10    10    10
#> 12:  B     S2 2020     9     9     9
#> 13:  C     S2 2019     8     8     8
#> 14:  D     S2 2020     7     7     7
#> 15:  E     S2 2019     6     6     6
#> 16:  F     S2 2020     5     5     5
#> 17:  G     S2 2019     4     4     4
#> 18:  H     S2 2020     3     3     3
#> 19:  I     S2 2019     2     2     2
#> 20:  J     S2 2020     1     1     1

Examples

Simple cases

In this simple case, extreme values should be calculated for variable val_1. This can be done with sdc_extreme() by specifying the dataset (data), the id variable (id_var) and the variable for which extreme values are to be calculated (val_var).

sdc_extreme(data = sdc_extreme_DT, id_var = "id", val_var = "val_1")
#> [ OPTIONS:  sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id | val_var: val_1 ]
#>    val_var min n_obs_min max n_obs_max
#> 1:   val_1   2         3  19         3

Since no problems occur, the function (invisibly) returns a list with the options, settings and extreme values and prints the calculated extreme values. As shown in the output, the extreme values could be calculated using 3 observations for each value, so no additional observations had to be included in the calculation.
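These numbers can be reproduced by hand (a sketch, not how sdc_extreme() works internally): the minimum is the mean of the 3 smallest values of val_1 and the maximum is the mean of the 3 largest values.

sdc_extreme_DT[order(val_1)][1:3, mean(val_1)]   # 2
sdc_extreme_DT[order(-val_1)][1:3, mean(val_1)]  # 19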

Specifying the number of values included in minimum and maximum using n_min and n_max

In this case extreme values are to be calculated for variable val_2:

sdc_extreme(data = sdc_extreme_DT, id_var = "id", val_var = "val_2")
#> [ OPTIONS:  sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id | val_var: val_2 ]
#>    val_var min n_obs_min      max n_obs_max
#> 1:   val_2   2         3 67.14286         7

If we look at the output, we see that 3 observations were used to calculate the minimum but 7 observations to calculate the maximum. This is because the dominance criterion would be violated if fewer than 7 observations were considered for the maximum, so 7 observations are automatically taken into account.
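This can be illustrated with a simplified check of the dominance share among the k largest values of val_2 (a sketch; strictly speaking, dominance is computed per entity, but here each of these observations belongs to a different entity):

top <- sdc_extreme_DT[order(-val_2), val_2]
sum(top[1:2]) / sum(top[1:5])  # approx. 0.88 > 0.85: dominance violated
sum(top[1:2]) / sum(top[1:7])  # approx. 0.83: complies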

If you prefer to include an equal number of observations per extreme value, it is possible to specify this using the arguments n_min and n_max. To specify that 7 observations are used for the calculation of the minimum value as well, set n_min = 7.

sdc_extreme(data = sdc_extreme_DT, id_var = "id", val_var = "val_2", n_min = 7)
#> [ OPTIONS:  sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id | val_var: val_2 ]
#>    val_var min n_obs_min      max n_obs_max
#> 1:   val_2   4         7 67.14286         7

Now the same number of observations is used to calculate both extreme values. Of course, if you specified n_max = 5 instead, there would be no feasible solution, since the dominance criterion cannot be met with only 5 observations for the maximum.

Consider the case in which extreme values are to be calculated for variable val_3 and exactly 10 observations are to be included in each value. To do so, n_max = 10 and n_min = 10 would need to be specified:

sdc_extreme(data = sdc_extreme_DT, id_var = "id", val_var = "val_3", n_min = 10, n_max = 10)
#> [ OPTIONS:  sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id | val_var: val_3 ]
#> It is impossible to compute extreme values for variable 'val_3' that comply to RDC rules.

But now a message appears pointing out that it is impossible to compute extreme values in this case. This is because there is a missing value in val_3, so the subsets of 10 observations each for minimum and maximum would have to overlap. Therefore, fewer observations per extreme value have to be used, for example 8 each:

sdc_extreme(data = sdc_extreme_DT, id_var = "id", val_var = "val_3", n_min = 8, n_max = 8)
#> [ OPTIONS:  sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id | val_var: val_3 ]
#>    val_var min n_obs_min  max n_obs_max
#> 1:   val_3 4.5         8 15.5         8
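The missing value is easy to verify: val_3 contains only 19 non-missing observations, so two disjoint subsets of 10 observations each cannot be formed, whereas 8 + 8 = 16 observations fit.

sdc_extreme_DT[, sum(!is.na(val_3))]  # 19 non-missing observations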

Extreme values by groups

It is also possible to calculate extreme values by groups. In the following, these are calculated by year and sector, separately.

sdc_extreme(data = sdc_extreme_DT, id_var = "id", val_var = "val_1", by = "year")
#> [ OPTIONS:  sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id | val_var: val_1 | by: year ]
#>    val_var year min n_obs_min max n_obs_max
#> 1:   val_1 2019   5         4  18         3
#> 2:   val_1 2020   4         4  17         3

sdc_extreme(data = sdc_extreme_DT, id_var = "id", val_var = "val_1", by = "sector")
#> [ OPTIONS:  sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id | val_var: val_1 | by: sector ]
#>    val_var sector min n_obs_min max n_obs_max
#> 1:   val_1     S1  12         3  19         3
#> 2:   val_1     S2   2         3   9         3

No problems occur, so extreme values are calculated and output for each group.

This can also be done for several grouping variables at once. In the following, extreme values for the variable val_1 are calculated by sector and year simultaneously.

res <- sdc_extreme(
  data = sdc_extreme_DT,
  id_var = "id",
  val_var = "val_1",
  by = c("sector", "year")
)
#> [ OPTIONS:  sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id | val_var: val_1 | by: sector, year ]
#> It is impossible to compute extreme values for variable 'val_1' that comply to RDC rules.

Now a message appears, explaining that the RDC rules would be violated when calculating these values. For programming purposes, please note that the structure of the resulting data.table remains the same, but is filled with NA:

# extreme_vals
res
#>    val_var sector year min n_obs_min max n_obs_max
#> 1:   val_1     S1 2019  NA        NA  NA        NA
#> 2:   val_1     S1 2020  NA        NA  NA        NA
#> 3:   val_1     S2 2019  NA        NA  NA        NA
#> 4:   val_1     S2 2020  NA        NA  NA        NA

sdc_log()

This function creates Stata-like log files from R scripts. It wraps an R script containing your analysis and writes the corresponding code and console output into a log file. It can handle a single file or several files at once.

A character vector containing the path(s) of the R script(s) to be run must be specified, as well as a character vector containing the path(s) of the text file(s) where the log(s) should be stored. To replace existing log files, specify the argument replace = TRUE.

A simple call of this function could look as follows:

sdc_log(
  r_scripts = "/home/my_project/R/my_script.R",
  log_files = "/home/my_project/log/my_script.txt"
)
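Several scripts can be logged in one call by passing vectors of equal length; setting replace = TRUE overwrites existing log files (the paths below are hypothetical):

scripts <- c("/home/my_project/R/01_prepare.R", "/home/my_project/R/02_analysis.R")
logs    <- c("/home/my_project/log/01_prepare.txt", "/home/my_project/log/02_analysis.txt")
sdc_log(r_scripts = scripts, log_files = logs, replace = TRUE)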

Even though this seems trivial, creating logs for scripts is essential because a log file bundles all information needed by the RDC for output control.


  1. Since sdcLog heavily relies on data.table, all examples will use data.table syntax as well.↩︎