Introduction

This package was developed to help prepare samples to be sent to a facility. It assumes that you have already collected the samples and their information, but you still need to run the experiment in several batches due to technical or practical limitations. The question it tries to answer is:

Which samples go with each batch?

Of all the possible combinations of samples, it looks for the combination which minimizes the differences between each subgroup according to the following rules:

  • If the variable is categorical it tries to randomize the variable across the subgroups (see the toy sketch after this list).
  • If the variable is numeric it tries to distribute the values evenly, following the original distribution.
  • If there are NA values (not available) it tries to distribute them randomly across the subgroups.
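
As a rough illustration of the first rule, here is a toy sketch in plain base R (not the package’s internal code): a categorical variable is well spread when each subgroup keeps roughly the proportions of the whole dataset, which can be summarized with a normalized Shannon entropy (values close to 1 mean the categories are evenly represented).

# Toy sketch: normalized Shannon entropy of a categorical variable
shannon <- function(x) {
  p <- prop.table(table(x))
  p <- p[p > 0]
  -sum(p * log2(p)) / log2(length(p))
}
shannon(c("F", "F", "M", "M")) # 1: perfectly balanced
shannon(c("F", "M", "M", "M")) # ~0.81: less balanced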

Even with these measures you might still end up with some batch effect, due to:

  • Confounding variables not provided for randomization across the batches, sometimes because they are unknown or impossible to measure.
  • Lack of replicates (samples with the same conditions). If you can’t provide new biological replicates, aim to provide more technical replicates. A technical replicate means reanalyzing the same sample twice or more; the more samples with technical replicates, the more accurate your measures will be and the easier it is to avoid or detect batch effects.
  • Processing. If there is a change in the methodology, or you pause and later resume the sample collection, the outcome might change due to external factors.

Previous work

Before building this package I would like to give credit to those who also made efforts in this direction:

The OSAT package handles categorical variables but not numeric data.

The minDiff package, reported on Stats.SE, handles both numeric and categorical data, but it can only optimize for two nominal criteria.

This package handles both numeric and categorical variables, as many as you wish, but the accuracy depends on the number of iterations performed. If your data contains only one of these types of variables, the results will be more accurate.

Preparation

We can use the survey dataset for the examples:
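
The examples below assume the experDesign package itself is loaded, since it provides design(), evaluate_index() and the other functions used:

library("experDesign")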

data(survey, package = "MASS") 
head(survey)
##      Sex Wr.Hnd NW.Hnd W.Hnd    Fold Pulse    Clap Exer Smoke Height      M.I
## 1 Female   18.5   18.0 Right  R on L    92    Left Some Never 173.00   Metric
## 2   Male   19.5   20.5  Left  R on L   104    Left None Regul 177.80 Imperial
## 3   Male   18.0   13.3 Right  L on R    87 Neither None Occas     NA     <NA>
## 4   Male   18.8   18.9 Right  R on L    NA Neither None Never 160.00   Metric
## 5   Male   20.0   20.0 Right Neither    35   Right Some Never 165.00   Metric
## 6 Female   18.0   17.7 Right  L on R    64   Right Some Never 172.72 Imperial
##      Age
## 1 18.250
## 2 17.583
## 3 16.917
## 4 20.333
## 5 23.667
## 6 21.000

The dataset has numeric and categorical values, as well as some NAs.
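
To see which columns contain missing values we can use plain base R (output omitted here):

# Count of NA values per column
colSums(is.na(survey))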

Picking samples for each batch

Imagine that we can only work in groups of 70, and we want to randomize by Sex, Smoke, Age and writing hand.
There are about 1.6543999 × 10^61 possible combinations; some of them would place all the right-handed students in a single batch. Evaluating all of them is not feasible, but we can try to find a near-optimal one.
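
The figure above appears to correspond to the number of ways of picking a single batch of 70 samples out of the 237 in the dataset, which we can compute with choose():

# Number of ways to choose one batch of 70 out of nrow(survey) = 237 samples
choose(nrow(survey), 70)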

# To reduce the variables used:
omit <- c("Wr.Hnd", "NW.Hnd", "Fold", "Pulse", "Clap", "Exer", "Height", "M.I")
(keep <- colnames(survey)[!colnames(survey) %in% omit])
## [1] "Sex"   "W.Hnd" "Smoke" "Age"
head(survey[, keep])
##      Sex W.Hnd Smoke    Age
## 1 Female Right Never 18.250
## 2   Male  Left Regul 17.583
## 3   Male Right Occas 16.917
## 4   Male Right Never 20.333
## 5   Male Right Never 23.667
## 6 Female Right Never 21.000

# Looking for groups of at most 70 samples.
index <- design(pheno = survey, size_subset = 70, omit = omit)
index
## $SubSet1
##  [1]   4   9  19  20  21  22  26  28  42  44  45  48  49  50  53  63  64  69  71
## [20]  72  75  76  83  86  88  90  91  96  97 103 105 109 119 121 125 134 135 138
## [39] 139 142 150 154 155 159 161 173 175 176 179 192 195 198 199 201 205 206 207
## [58] 208 212 237
## 
## $SubSet2
##  [1]   5   7  11  13  14  23  29  32  37  38  41  52  57  58  60  61  65  74  81
## [20]  84  85  89  92  93  94  99 100 114 116 120 123 126 128 132 140 149 152 153
## [39] 158 163 164 168 171 180 183 184 193 200 203 213 216 220 223 225 227 230 232
## [58] 233 234
## 
## $SubSet3
##  [1]   1   2   6  12  15  18  24  27  31  33  39  40  46  55  62  66  68  70  73
## [20]  77  80  82  87  95  98 106 107 108 110 118 122 124 130 133 137 141 147 148
## [39] 156 162 165 167 169 170 181 182 185 187 189 194 202 204 215 217 221 222 226
## [58] 231 235
## 
## $SubSet4
##  [1]   3   8  10  16  17  25  30  34  35  36  43  47  51  54  56  59  67  78  79
## [20] 101 102 104 111 112 113 115 117 127 129 131 136 143 144 145 146 151 157 160
## [39] 166 172 174 177 178 186 188 190 191 196 197 209 210 211 214 218 219 224 228
## [58] 229 236
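
Since the split is found by iterating over candidate combinations, we can spend more iterations when the result is not balanced enough. This sketch assumes design() exposes an iterations argument (check ?design for the exact name and default); index_more is just a placeholder name:

# Hypothetical: search longer for a better split
index_more <- design(pheno = survey, size_subset = 70, omit = omit,
                     iterations = 1000)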

We can then transform the index into a vector, to append to a file or to pass to a lab mate, with:

head(batch_names(index))
## [1] "SubSet3" "SubSet3" "SubSet4" "SubSet1" "SubSet2" "SubSet3"

Checking results

We can check the summary statistics of the sampling with:

out <- evaluate_index(index, survey[, keep])
out[, "Age", ]
##          subgroups
## stat        SubSet1   SubSet2   SubSet3   SubSet4
##   mean    20.208367 20.711898 20.426559 20.154051
##   sd       7.663874  8.223262  5.097300  4.124343
##   mad      1.359544  1.605656  1.730194  1.853250
##   na       0.000000  0.000000  0.000000  0.000000
##   entropy  0.000000  0.000000  0.000000  0.000000

And we can compare with the original distribution using evaluate_orig:

orig <- evaluate_orig(survey)
orig[, "Age"]
##      mean        sd       mad        na   entropy 
## 20.374515  6.474335  1.605656  0.000000  0.000000
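
For instance, subtracting the overall mean from each subset’s mean shows how much each batch deviates for Age:

# Deviation of each subset's mean Age from the original mean
out["mean", "Age", ] - orig["mean", "Age"]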

Unbalanced setting

In the previous case the data was mostly balanced (check it out in the orig object), but let’s create an unbalanced dataset to see how the package behaves.

n <- 99
samples <- 100
unbalanced <- data.frame(Classroom = rep(c("A", "B"), each = samples/2),
                         Sex = c(rep("M", n), rep("F", samples-n)),
                         Age = rnorm(samples, mean = 25, sd = 3))
table(unbalanced)[, , 1:5]
## , , Age = 16.7606877438421
## 
##          Sex
## Classroom F M
##         A 0 1
##         B 0 0
## 
## , , Age = 18.3564356796084
## 
##          Sex
## Classroom F M
##         A 0 0
##         B 1 0
## 
## , , Age = 18.8105321872606
## 
##          Sex
## Classroom F M
##         A 0 0
##         B 0 1
## 
## , , Age = 19.0829046190892
## 
##          Sex
## Classroom F M
##         A 0 1
##         B 0 0
## 
## , , Age = 19.4863600125267
## 
##          Sex
## Classroom F M
##         A 0 0
##         B 0 1

In this dataset all the female samples (just one, given n = 99) fall in a single classroom (B).
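
A plain cross-tabulation shows the imbalance more directly than the sliced table above:

# Classroom by Sex: the only female sample is in classroom B
table(unbalanced$Classroom, unbalanced$Sex)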

i <- design(unbalanced, 15)

# Mean entropy in each subset
rowMeans(evaluate_index(i, unbalanced)["entropy", , ])
##  Classroom        Sex        Age 
## 0.98475090 0.05303319 0.00000000
# Original entropy on the dataset
evaluate_orig(unbalanced)["entropy", ]
##  Classroom        Sex        Age 
## 1.00000000 0.08079314 0.00000000
# Dispersion of the entropy
apply(evaluate_index(i, unbalanced)["entropy", , ], 1, sd)
##  Classroom        Sex        Age 
## 0.02239937 0.14031263 0.00000000

We can see that in this simple case, where a single classroom holds all the samples of one sex, the subsets reach approximately the same entropy levels as the original dataset.

Quality check

If you need a subset containing the most diverse samples, you can use the following function:

samples <- extreme_cases(survey, size = 10)
survey[samples, ]
##        Sex Wr.Hnd NW.Hnd W.Hnd    Fold Pulse    Clap Exer Smoke Height      M.I
## 16  Female   17.5   17.0 Right  R on L    NA   Right Freq Never 156.00   Metric
## 45  Female   13.0   13.0  <NA>  L on R    70    Left Freq Never 180.34 Imperial
## 57  Female   15.5   15.4 Right  R on L    70 Neither None Never 157.48 Imperial
## 87  Female   18.2   18.0 Right  L on R    70   Right Some Never 162.56 Imperial
## 91    Male   20.5   20.0 Right  R on L    75    Left Some Never 183.00   Metric
## 112   Male   19.2   19.6 Right  L on R    80   Right None Never 190.50 Imperial
## 118   Male   23.0   22.0  Left  L on R    83    Left Some Heavy 193.04 Imperial
## 134 Female   15.4   16.4  Left  L on R    80    Left Freq Occas 160.02 Imperial
## 190   Male   18.5   18.0 Right Neither    63 Neither Freq Never 196.00   Metric
## 197 Female   15.0   13.0 Right  R on L    80 Neither Freq Never 170.18 Imperial
##        Age
## 16  17.167
## 45  17.417
## 57  17.167
## 87  18.000
## 91  19.667
## 112 18.167
## 118 18.917
## 134 18.500
## 190 20.083
## 197 17.000

It is recommended to add replicates across the batches to estimate the differences between them. If you want to include replicated controls in every batch you can use the following:

# Looking for groups of at most 70 samples, with 10 controls repeated in every batch.
index_replicates <- replicates(pheno = survey, size_subset = 70, 
                               controls = 10, omit = omit)
index_replicates
## $SubSet1
##  [1]   3   5   9  11  14  17  26  27  33  43  44  45  46  56  61  72  73  74  76
## [20]  77  79  85  86  88  92  95  97  98 100 101 102 103 106 113 117 119 121 122
## [39] 125 128 131 134 135 137 139 143 144 147 150 151 152 159 162 163 166 170 171
## [58] 175 179 180 182 184 196 199 208 217 226 237
## 
## $SubSet2
##  [1]   1  14  15  16  18  29  30  31  33  34  38  39  47  48  52  54  58  60  65
## [20]  70  78  79  80  82  83  86  91 108 109 110 112 114 115 124 126 129 132 133
## [39] 149 151 153 158 160 162 164 169 171 173 174 186 188 191 196 198 199 200 202
## [58] 204 210 212 217 218 219 220 221
## 
## $SubSet3
##  [1]   4   6   8  13  14  21  22  23  24  28  32  33  36  37  42  50  53  55  57
## [20]  59  63  67  69  75  79  84  86  89  93 104 111 120 127 130 141 142 145 151
## [39] 155 162 167 171 177 178 181 185 187 192 194 195 196 197 199 203 205 206 207
## [58] 211 215 216 217 222 223 227 233 235
## 
## $SubSet4
##  [1]   2   7  10  12  14  19  20  25  33  35  40  41  49  51  62  64  66  68  71
## [20]  79  81  86  87  90  94  96  99 105 107 116 118 123 136 138 140 146 148 151
## [39] 154 156 157 161 162 165 168 171 172 176 183 189 190 193 196 199 201 209 213
## [58] 214 217 224 225 228 229 230 231 232 234 236

The controls are the samples present in all the batches:

survey[Reduce(intersect, index_replicates), ]
##        Sex Wr.Hnd NW.Hnd W.Hnd    Fold Pulse    Clap Exer Smoke Height      M.I
## 14  Female   19.5   20.2 Right  L on R    66 Neither Some Never  155.0   Metric
## 33  Female   17.1   17.5 Right  R on L    72   Right Freq Heavy  166.4 Imperial
## 79  Female   18.3   18.5 Right  R on L    68 Neither Some Never  165.1 Imperial
## 86  Female   17.7   17.0 Right  R on L    76   Right Some Never  167.0   Metric
## 151   Male   21.8   22.3 Right  R on L    76    Left Freq Never  195.0   Metric
## 162   Male   18.1   18.2  Left Neither    NA   Right Some Never  168.0   Metric
## 171 Female   16.5   17.0 Right  L on R    NA   Right Some Never  168.0   Metric
## 196 Female   17.0   17.6 Right  L on R    76   Right Some Never  165.0   Metric
## 199 Female   19.1   19.0 Right  R on L    80   Right Some Occas  170.0   Metric
## 217 Female   16.3   16.2 Right  L on R    NA   Right None Never     NA     <NA>
##        Age
## 14  17.500
## 33  39.750
## 79  17.083
## 86  17.250
## 151 25.500
## 162 21.167
## 171 73.000
## 196 26.500
## 199 19.167
## 217 19.250

If you have biological replicates, use the information you have about them; but if you have technical replicates, do not add them to the table passed to design or replicates.

Utilities

To convert the index into a value to append to the data.frame we can use:

batch <- batch_names(index)

Or, to do it automatically, we can use:

df <- inspect(index, survey, omit = omit)
head(df)
##      Sex W.Hnd Smoke    Age   batch
## 1 Female Right Never 18.250 SubSet3
## 2   Male  Left Regul 17.583 SubSet3
## 3   Male Right Occas 16.917 SubSet4
## 4   Male Right Never 20.333 SubSet1
## 5   Male Right Never 23.667 SubSet2
## 6 Female Right Never 21.000 SubSet3

Comparing groups

If a data.frame has a column named “batch” we can use the distribution function to check whether each variable is randomly distributed across the batches (see also the base-R sketch at the end of this section). We can also see the mean difference between each subgroup and the original values of the data.frame:

evaluate_entropy(index, survey)
##         Sex       W.Hnd        Fold        Clap        Exer       Smoke 
## 0.006505986 0.620020651 0.203322551 0.198039029 0.146556617 0.491375208 
##         M.I 
## 0.097973199
evaluate_independence(index, survey)
##         Sex       W.Hnd        Fold        Clap        Exer       Smoke 
## 0.064221034 0.009156125 0.762186083 0.609048036 0.840920110 0.158145859 
##         M.I 
## 0.314953299

evaluate_na(index, survey)
##    Sex Wr.Hnd NW.Hnd  W.Hnd   Fold  Pulse   Clap   Exer  Smoke Height    M.I 
##  0.375  0.375  0.375  0.375  0.000  1.250  0.375  0.000  0.375  1.500  1.500 
##    Age 
##  0.000

evaluate_mean(index, survey)
##    Wr.Hnd    NW.Hnd     Pulse    Height       Age 
## 0.2129286 0.1731027 0.7372826 1.3390543 0.1940100
evaluate_sd(index, survey)
##     Wr.Hnd     NW.Hnd      Pulse     Height        Age 
## 0.06746011 0.02890952 0.34931526 0.91503449 1.66637329
evaluate_mad(index, survey)
##    Wr.Hnd    NW.Hnd     Pulse    Height       Age 
##  25.76017  25.76017  96.36900 243.89511  25.91399

The numbers provided are just an indicator of how far the index is from a perfect split: the farther an index is, the worse it accounts for potential batch effects.

If we want the original numbers we can use evaluate_index and look for specific values:

ev <- evaluate_index(index, survey)
ev["entropy", "Sex",]
##   SubSet1   SubSet2   SubSet3   SubSet4 
## 0.9927745 0.9948132 0.9965664 0.9898221
ev[1:4, "Age",]
##       subgroups
## stat     SubSet1   SubSet2   SubSet3   SubSet4
##   mean 20.208367 20.711898 20.426559 20.154051
##   sd    7.663874  8.223262  5.097300  4.124343
##   mad   1.359544  1.605656  1.730194  1.853250
##   na    0.000000  0.000000  0.000000  0.000000
evaluate_independence(index, survey)
##         Sex       W.Hnd        Fold        Clap        Exer       Smoke 
## 0.064221034 0.009156125 0.762186083 0.609048036 0.840920110 0.158145859 
##         M.I 
## 0.314953299
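
As a complement to these numeric summaries, we can also eyeball the batches with plain base R (not part of the package), using the annotated data.frame produced by inspect() in the Utilities section:

# Visual and tabular checks per batch
boxplot(Age ~ batch, data = df)  # numeric variable: similar spread per batch?
table(df$batch, df$Smoke)        # categorical variable: similar counts per batch?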

SessionInfo

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] experDesign_0.0.4
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.3  magrittr_1.5    htmltools_0.5.0 tools_3.6.3    
##  [5] yaml_2.2.1      stringi_1.5.3   rmarkdown_2.3   knitr_1.30     
##  [9] stringr_1.4.0   digest_0.6.25   xfun_0.17       rlang_0.4.7    
## [13] evaluate_0.14