This package was developed to help prepare samples to be sent to a facility. It assumes that you have collected the samples and their information, but you still need to run the experiment in several batches due to technical or practical limitations. The question it tries to answer is:
Which samples go with each batch?
Of all the possible combinations of samples, it looks for the combination which minimizes the differences between each subgroup according to the following rules:
- For NA (not available) values, it looks to distribute them randomly.

Even with these measures you might end up with some batch effect due to:
- Confounding variables not provided for randomization on the batches, sometimes because they are unknown or impossible to measure.
- Lack of replicates (samples with the same conditions). If you can't provide new replicates, aim to provide more technical replicates. Technical replicates mean reanalyzing the same sample twice or more; the more samples with technical replicates, the more accurate your measures will be and the easier it will be to avoid or detect batch effects.
- Processing. If there is a change in the methodology, or you pause and later resume the sample collection, there might be changes in the outcome due to external factors.
Before building this package, I would like to give credit to those who also made efforts in this direction:
The OSAT package handles categorical variables but not numeric data.
The minDiff package, reported on Stats.SE, handles both numeric and categorical data, but it can only optimize for two nominal criteria.
This package handles both numerical and categorical variables, as many as you wish, but the accuracy depends on the number of iterations performed. If your data contains only one of these types of variables, the results will be more accurate.
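The number of iterations used in the search can be increased for a more thorough (but slower) exploration. Below is a minimal sketch, assuming design() accepts an iterations argument (check the documentation of your installed experDesign version); the built-in mtcars dataset is used purely as a placeholder:

library("experDesign")
# A quick search with few iterations versus a more exhaustive one.
# (Sketch: assumes `iterations` is an argument of design().)
quick <- design(pheno = mtcars, size_subset = 12, iterations = 100)
thorough <- design(pheno = mtcars, size_subset = 12, iterations = 1000)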
We can use the survey dataset for the examples:
data(survey, package = "MASS")
head(survey)
## Sex Wr.Hnd NW.Hnd W.Hnd Fold Pulse Clap Exer Smoke Height M.I
## 1 Female 18.5 18.0 Right R on L 92 Left Some Never 173.00 Metric
## 2 Male 19.5 20.5 Left R on L 104 Left None Regul 177.80 Imperial
## 3 Male 18.0 13.3 Right L on R 87 Neither None Occas NA <NA>
## 4 Male 18.8 18.9 Right R on L NA Neither None Never 160.00 Metric
## 5 Male 20.0 20.0 Right Neither 35 Right Some Never 165.00 Metric
## 6 Female 18.0 17.7 Right L on R 64 Right Some Never 172.72 Imperial
## Age
## 1 18.250
## 2 17.583
## 3 16.917
## 4 20.333
## 5 23.667
## 6 21.000
The dataset has numeric and categorical values, as well as some NA values.
Imagine that we can only work in groups of 70, and we want to randomize by Sex, Smoke, Age, and by writing hand.
There are 1.6543999 × 10^61 possible combinations; some of them would have all the right-handed students in a single batch. We cannot check all these combinations, so we try to find an optimal one.
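As a rough check of that figure, it seems to correspond to the number of ways of choosing a single batch of 70 out of the 237 samples (an assumption on my side; other groupings would give different counts), which base R can compute directly:

# Number of ways to pick one batch of 70 samples out of the 237 in survey.
# (Assumption: this is the quantity quoted above.)
choose(nrow(survey), 70)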
# To reduce the variables used:
omit <- c("Wr.Hnd", "NW.Hnd", "Fold", "Pulse", "Clap", "Exer", "Height", "M.I")
(keep <- colnames(survey)[!colnames(survey) %in% omit])
## [1] "Sex" "W.Hnd" "Smoke" "Age"
head(survey[, keep])
## Sex W.Hnd Smoke Age
## 1 Female Right Never 18.250
## 2 Male Left Regul 17.583
## 3 Male Right Occas 16.917
## 4 Male Right Never 20.333
## 5 Male Right Never 23.667
## 6 Female Right Never 21.000
# Looking for groups of at most 70 samples.
index <- design(pheno = survey, size_subset = 70, omit = omit)
index
## $SubSet1
## [1] 4 9 19 20 21 22 26 28 42 44 45 48 49 50 53 63 64 69 71
## [20] 72 75 76 83 86 88 90 91 96 97 103 105 109 119 121 125 134 135 138
## [39] 139 142 150 154 155 159 161 173 175 176 179 192 195 198 199 201 205 206 207
## [58] 208 212 237
##
## $SubSet2
## [1] 5 7 11 13 14 23 29 32 37 38 41 52 57 58 60 61 65 74 81
## [20] 84 85 89 92 93 94 99 100 114 116 120 123 126 128 132 140 149 152 153
## [39] 158 163 164 168 171 180 183 184 193 200 203 213 216 220 223 225 227 230 232
## [58] 233 234
##
## $SubSet3
## [1] 1 2 6 12 15 18 24 27 31 33 39 40 46 55 62 66 68 70 73
## [20] 77 80 82 87 95 98 106 107 108 110 118 122 124 130 133 137 141 147 148
## [39] 156 162 165 167 169 170 181 182 185 187 189 194 202 204 215 217 221 222 226
## [58] 231 235
##
## $SubSet4
## [1] 3 8 10 16 17 25 30 34 35 36 43 47 51 54 56 59 67 78 79
## [20] 101 102 104 111 112 113 115 117 127 129 131 136 143 144 145 146 151 157 160
## [39] 166 172 174 177 178 186 188 190 191 196 197 209 210 211 214 218 219 224 228
## [58] 229 236
We can transform them into a vector to append to the file or to pass to a lab mate with:
head(batch_names(index))
## [1] "SubSet3" "SubSet3" "SubSet4" "SubSet1" "SubSet2" "SubSet3"
We can check the values of the sampling with:
out <- evaluate_index(index, survey[, keep])
out[, "Age", ]
## subgroups
## stat SubSet1 SubSet2 SubSet3 SubSet4
## mean 20.208367 20.711898 20.426559 20.154051
## sd 7.663874 8.223262 5.097300 4.124343
## mad 1.359544 1.605656 1.730194 1.853250
## na 0.000000 0.000000 0.000000 0.000000
## entropy 0.000000 0.000000 0.000000 0.000000
And we can compare with the original distribution using evaluate_orig:
orig <- evaluate_orig(survey)
orig[, "Age"]
## mean sd mad na entropy
## 20.374515 6.474335 1.605656 0.000000 0.000000
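For example, a quick sketch (plain array subsetting, not a package function) to see how far each subset's mean Age deviates from the original mean:

# Absolute difference between each subset's mean Age and the original mean Age.
abs(out["mean", "Age", ] - orig["mean", "Age"])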
In the previous case the data was mostly balanced (check it out in the orig object), but let's create an unbalanced dataset to check it.
n <- 99
samples <- 100
unbalanced <- data.frame(Classroom = rep(c("A", "B"), each = samples/2),
Sex = c(rep("M", n), rep("F", samples-n)),
Age = rnorm(samples, mean = 25, sd = 3))
table(unbalanced)[, , 1:5]
## , , Age = 16.7606877438421
##
## Sex
## Classroom F M
## A 0 1
## B 0 0
##
## , , Age = 18.3564356796084
##
## Sex
## Classroom F M
## A 0 0
## B 1 0
##
## , , Age = 18.8105321872606
##
## Sex
## Classroom F M
## A 0 0
## B 0 1
##
## , , Age = 19.0829046190892
##
## Sex
## Classroom F M
## A 0 1
## B 0 0
##
## , , Age = 19.4863600125267
##
## Sex
## Classroom F M
## A 0 0
## B 0 1
In this dataset a single classroom (B) has all the female samples.
i <- design(unbalanced, 15)
# Mean entropy in each subset
rowMeans(evaluate_index(i, unbalanced)["entropy", , ])
## Classroom Sex Age
## 0.98475090 0.05303319 0.00000000
# Original entropy on the dataset
evaluate_orig(unbalanced)["entropy", ]
## Classroom Sex Age
## 1.00000000 0.08079314 0.00000000
# Dispersion of the entropy
apply(evaluate_index(i, unbalanced)["entropy", , ], 1, sd)
## Classroom Sex Age
## 0.02239937 0.14031263 0.00000000
We can see that in this simple case, where a single group holds all the cases of the unbalanced variable, we approximately reach the same entropy levels as in the original dataset.
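A quick way to quantify this, reusing the calls above (just a sketch), is the difference between the original entropy and the mean entropy per subset:

# Differences close to 0 indicate the subsets preserve the original balance.
evaluate_orig(unbalanced)["entropy", ] -
  rowMeans(evaluate_index(i, unbalanced)["entropy", , ])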
If you need a subset with the most diverse samples, you can use the following function:
samples <- extreme_cases(survey, size = 10)
survey[samples, ]
## Sex Wr.Hnd NW.Hnd W.Hnd Fold Pulse Clap Exer Smoke Height M.I
## 16 Female 17.5 17.0 Right R on L NA Right Freq Never 156.00 Metric
## 45 Female 13.0 13.0 <NA> L on R 70 Left Freq Never 180.34 Imperial
## 57 Female 15.5 15.4 Right R on L 70 Neither None Never 157.48 Imperial
## 87 Female 18.2 18.0 Right L on R 70 Right Some Never 162.56 Imperial
## 91 Male 20.5 20.0 Right R on L 75 Left Some Never 183.00 Metric
## 112 Male 19.2 19.6 Right L on R 80 Right None Never 190.50 Imperial
## 118 Male 23.0 22.0 Left L on R 83 Left Some Heavy 193.04 Imperial
## 134 Female 15.4 16.4 Left L on R 80 Left Freq Occas 160.02 Imperial
## 190 Male 18.5 18.0 Right Neither 63 Neither Freq Never 196.00 Metric
## 197 Female 15.0 13.0 Right R on L 80 Neither Freq Never 170.18 Imperial
## Age
## 16 17.167
## 45 17.417
## 57 17.167
## 87 18.000
## 91 19.667
## 112 18.167
## 118 18.917
## 134 18.500
## 190 20.083
## 197 17.000
It is recommended to add replicates across the batches in order to estimate the differences between them. If you want to add replicates to the batches, you can use the following:
# Looking for groups of at most 70 samples.
index_replicates <- replicates(pheno = survey, size_subset = 70,
controls = 10, omit = omit)
index_replicates
## $SubSet1
## [1] 3 5 9 11 14 17 26 27 33 43 44 45 46 56 61 72 73 74 76
## [20] 77 79 85 86 88 92 95 97 98 100 101 102 103 106 113 117 119 121 122
## [39] 125 128 131 134 135 137 139 143 144 147 150 151 152 159 162 163 166 170 171
## [58] 175 179 180 182 184 196 199 208 217 226 237
##
## $SubSet2
## [1] 1 14 15 16 18 29 30 31 33 34 38 39 47 48 52 54 58 60 65
## [20] 70 78 79 80 82 83 86 91 108 109 110 112 114 115 124 126 129 132 133
## [39] 149 151 153 158 160 162 164 169 171 173 174 186 188 191 196 198 199 200 202
## [58] 204 210 212 217 218 219 220 221
##
## $SubSet3
## [1] 4 6 8 13 14 21 22 23 24 28 32 33 36 37 42 50 53 55 57
## [20] 59 63 67 69 75 79 84 86 89 93 104 111 120 127 130 141 142 145 151
## [39] 155 162 167 171 177 178 181 185 187 192 194 195 196 197 199 203 205 206 207
## [58] 211 215 216 217 222 223 227 233 235
##
## $SubSet4
## [1] 2 7 10 12 14 19 20 25 33 35 40 41 49 51 62 64 66 68 71
## [20] 79 81 86 87 90 94 96 99 105 107 116 118 123 136 138 140 146 148 151
## [39] 154 156 157 161 162 165 168 171 172 176 183 189 190 193 196 199 201 209 213
## [58] 214 217 224 225 228 229 230 231 232 234 236
The controls will be those samples present in all the batches:
survey[Reduce(intersect, index_replicates), ]
## Sex Wr.Hnd NW.Hnd W.Hnd Fold Pulse Clap Exer Smoke Height M.I
## 14 Female 19.5 20.2 Right L on R 66 Neither Some Never 155.0 Metric
## 33 Female 17.1 17.5 Right R on L 72 Right Freq Heavy 166.4 Imperial
## 79 Female 18.3 18.5 Right R on L 68 Neither Some Never 165.1 Imperial
## 86 Female 17.7 17.0 Right R on L 76 Right Some Never 167.0 Metric
## 151 Male 21.8 22.3 Right R on L 76 Left Freq Never 195.0 Metric
## 162 Male 18.1 18.2 Left Neither NA Right Some Never 168.0 Metric
## 171 Female 16.5 17.0 Right L on R NA Right Some Never 168.0 Metric
## 196 Female 17.0 17.6 Right L on R 76 Right Some Never 165.0 Metric
## 199 Female 19.1 19.0 Right R on L 80 Right Some Occas 170.0 Metric
## 217 Female 16.3 16.2 Right L on R NA Right None Never NA <NA>
## Age
## 14 17.500
## 33 39.750
## 79 17.083
## 86 17.250
## 151 25.500
## 162 21.167
## 171 73.000
## 196 26.500
## 199 19.167
## 217 19.250
If you have biological replicates, use the information you have about them; but if you have technical replicates, do not add them to the table used by design or replicates.
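To double-check which samples act as controls, a small sketch with base R: count in how many batches each sample appears (the controls should appear in every batch):

# Count how many batches each sample index appears in;
# the controls are the ones appearing more than once.
counts <- table(unlist(index_replicates))
counts[counts > 1]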
To convert the index into a value to append to the data.frame we can use:
batch <- batch_names(index)
Or if we want to do so automatically we can do:
df <- inspect(index, survey, omit = omit)
head(df)
## Sex W.Hnd Smoke Age batch
## 1 Female Right Never 18.250 SubSet3
## 2 Male Left Regul 17.583 SubSet3
## 3 Male Right Occas 16.917 SubSet4
## 4 Male Right Never 20.333 SubSet1
## 5 Male Right Never 23.667 SubSet2
## 6 Female Right Never 21.000 SubSet3
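With the batch column in place, a quick sanity check (plain base R, not a package function) is to cross-tabulate the batches against a categorical variable:

# Cross-tabulate batches against Sex to eyeball the balance of the assignment.
table(df$batch, df$Sex)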
If a data.frame has a column named “batch”, we can use the function distribution to see if it is randomly distributed. Or we can see the mean difference between each subgroup and the original values of the data.frame:
evaluate_entropy(index, survey)
## Sex W.Hnd Fold Clap Exer Smoke
## 0.006505986 0.620020651 0.203322551 0.198039029 0.146556617 0.491375208
## M.I
## 0.097973199
evaluate_independence(index, survey)
## Sex W.Hnd Fold Clap Exer Smoke
## 0.064221034 0.009156125 0.762186083 0.609048036 0.840920110 0.158145859
## M.I
## 0.314953299
evaluate_na(index, survey)
## Sex Wr.Hnd NW.Hnd W.Hnd Fold Pulse Clap Exer Smoke Height M.I
## 0.375 0.375 0.375 0.375 0.000 1.250 0.375 0.000 0.375 1.500 1.500
## Age
## 0.000
evaluate_mean(index, survey)
## Wr.Hnd NW.Hnd Pulse Height Age
## 0.2129286 0.1731027 0.7372826 1.3390543 0.1940100
evaluate_sd(index, survey)
## Wr.Hnd NW.Hnd Pulse Height Age
## 0.06746011 0.02890952 0.34931526 0.91503449 1.66637329
evaluate_mad(index, survey)
## Wr.Hnd NW.Hnd Pulse Height Age
## 25.76017 25.76017 96.36900 243.89511 25.91399
The numbers provided are just an indicator of how far the subsets are from a perfect distribution. The farther an index is, the worse it is at accounting for the batch effect.
If we want the original numbers, we can use evaluate_index and look for specific values:
ev <- evaluate_index(index, survey)
ev["entropy", "Sex",]
## SubSet1 SubSet2 SubSet3 SubSet4
## 0.9927745 0.9948132 0.9965664 0.9898221
ev[1:4, "Age",]
## subgroups
## stat SubSet1 SubSet2 SubSet3 SubSet4
## mean 20.208367 20.711898 20.426559 20.154051
## sd 7.663874 8.223262 5.097300 4.124343
## mad 1.359544 1.605656 1.730194 1.853250
## na 0.000000 0.000000 0.000000 0.000000
evaluate_independence(index, survey)
## Sex W.Hnd Fold Clap Exer Smoke
## 0.064221034 0.009156125 0.762186083 0.609048036 0.840920110 0.158145859
## M.I
## 0.314953299
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] experDesign_0.0.4
##
## loaded via a namespace (and not attached):
## [1] compiler_3.6.3 magrittr_1.5 htmltools_0.5.0 tools_3.6.3
## [5] yaml_2.2.1 stringi_1.5.3 rmarkdown_2.3 knitr_1.30
## [9] stringr_1.4.0 digest_0.6.25 xfun_0.17 rlang_0.4.7
## [13] evaluate_0.14