experDesign

Lifecycle: stable. Project Status: Active - The project has reached a stable, usable state and is being actively developed.

The goal of experDesign is to help you decide which samples go in which batch, reducing potential batch bias in the analysis.

Installation

To install the latest version from CRAN use:

install.packages("experDesign")

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("llrs/experDesign")

Example

Imagine you have some samples already collected and you want to distribute them in batches:

library("experDesign")
metadata <- expand.grid(height = seq(60, 80, 5), 
                        weight = seq(100, 300, 50),
                        sex = c("Male","Female"))
head(metadata, 15)
#>    height weight  sex
#> 1      60    100 Male
#> 2      65    100 Male
#> 3      70    100 Male
#> 4      75    100 Male
#> 5      80    100 Male
#> 6      60    150 Male
#> 7      65    150 Male
#> 8      70    150 Male
#> 9      75    150 Male
#> 10     80    150 Male
#> 11     60    200 Male
#> 12     65    200 Male
#> 13     70    200 Male
#> 14     75    200 Male
#> 15     80    200 Male

If you block incorrectly and a whole group ends up in a single batch, you will end up with a batch effect. To avoid this, design() helps you assign each sample to a batch (in this case each batch holds at most 24 samples). First we can explore the number of samples and the number of batches:

size_data <- nrow(metadata)
size_batch <- 24
(batches <- optimum_batches(size_data, size_batch))
#> [1] 3
# So now the best number of samples for each batch is less than the available
(size <- optimum_subset(size_data, batches))
#> [1] 17
# The distribution of samples per batch
sizes_batches(size_data, size, batches)
#> [1] 17 17 16
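
The numbers above can be reproduced with plain arithmetic. The following is a minimal base R sketch of that reasoning, not the package's actual implementation; the variable names n_batches, per_batch, base, extra and sizes are made up for illustration:

```r
size_data <- 50   # number of samples, nrow(metadata) in the example
size_batch <- 24  # maximum number of samples a batch can hold

# Fewest batches that can hold every sample
n_batches <- ceiling(size_data / size_batch)

# Largest batch size needed if samples are spread evenly
per_batch <- ceiling(size_data / n_batches)

# Spread the samples: every batch gets the base amount,
# and the remainder is handed out one sample at a time
base <- size_data %/% n_batches
extra <- size_data %% n_batches
sizes <- rep(base, n_batches) + c(rep(1, extra), rep(0, n_batches - extra))
sizes
#> [1] 17 17 16
```

An even split like this keeps every batch well under the limit of 24 instead of leaving one batch almost empty.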

Note that instead of filling two whole batches and leaving just two samples for the third, this distributes the samples evenly across the three batches that are needed. We can also look directly for the distribution of the samples given the maximum number of samples per batch:

d <- design(metadata, size_batch)
# It is a list but we can convert it to a vector with:
batch_names(d)
#>  [1] "SubSet3" "SubSet2" "SubSet2" "SubSet1" "SubSet3" "SubSet2" "SubSet1"
#>  [8] "SubSet1" "SubSet2" "SubSet2" "SubSet1" "SubSet2" "SubSet1" "SubSet3"
#> [15] "SubSet1" "SubSet3" "SubSet2" "SubSet1" "SubSet3" "SubSet1" "SubSet2"
#> [22] "SubSet1" "SubSet3" "SubSet2" "SubSet1" "SubSet1" "SubSet1" "SubSet1"
#> [29] "SubSet3" "SubSet2" "SubSet3" "SubSet2" "SubSet3" "SubSet3" "SubSet2"
#> [36] "SubSet1" "SubSet2" "SubSet1" "SubSet3" "SubSet3" "SubSet2" "SubSet3"
#> [43] "SubSet2" "SubSet3" "SubSet3" "SubSet1" "SubSet1" "SubSet2" "SubSet2"
#> [50] "SubSet3"

Naively one would either fill some batches completely or split the samples in their original order (the first 17 samples together, the next 17, and so on). This solution ensures that the data is randomized. For a more random distribution you can increase the number of iterations performed to calculate it.
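
To see why the in-order split is a problem for this particular metadata, here is a small base R sketch (the object names naive and tab are only illustrative). It assigns the rows to batches in their original order and cross-tabulates batch against sex:

```r
metadata <- expand.grid(height = seq(60, 80, 5),
                        weight = seq(100, 300, 50),
                        sex = c("Male", "Female"))

# Naive assignment: first 17 rows, next 17 rows, last 16 rows
naive <- rep(c("SubSet1", "SubSet2", "SubSet3"), times = c(17, 17, 16))

# How many samples of each sex land in each batch?
tab <- table(batch = naive, sex = metadata$sex)
tab
```

Because the first 25 rows are all male and the last 25 all female, the first batch contains only males and the third only females, so sex would be confounded with batch. The same cross-tabulation can be applied to the vector returned by batch_names() to check a randomized design.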

If you need space for replicates to control for batch effect you can use:

r <- replicates(metadata, size_batch, 5)
lengths(r)
#> SubSet1 SubSet2 SubSet3 
#>      20      20      20
r
#> $SubSet1
#>  [1]  4  9 10 12 20 21 22 23 25 26 28 29 31 39 40 41 43 45 49 50
#> 
#> $SubSet2
#>  [1]  2  7 13 15 16 18 21 23 24 27 30 33 35 36 37 38 41 47 49 50
#> 
#> $SubSet3
#>  [1]  1  3  5  6  8 11 14 17 19 21 23 32 34 41 42 44 46 48 49 50

This selects the most diverse samples as controls and adds them to the distribution of samples. Note that if a sample is already present in a batch it is not added again, which is why the number of samples per batch differs from the design without replicates.
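
One way to check which samples act as replicates is to look for the indices shared by every batch. The sketch below uses base R only and copies the indices from the output shown above by hand; since the assignment is randomized, a fresh run of replicates() will generally give different indices. The name shared is illustrative:

```r
# Indices copied from the example output above
r <- list(
  SubSet1 = c(4, 9, 10, 12, 20, 21, 22, 23, 25, 26, 28, 29, 31, 39, 40, 41, 43, 45, 49, 50),
  SubSet2 = c(2, 7, 13, 15, 16, 18, 21, 23, 24, 27, 30, 33, 35, 36, 37, 38, 41, 47, 49, 50),
  SubSet3 = c(1, 3, 5, 6, 8, 11, 14, 17, 19, 21, 23, 32, 34, 41, 42, 44, 46, 48, 49, 50)
)

# Samples present in all batches are the replicates
shared <- Reduce(intersect, r)
shared
#> [1] 21 23 41 49 50

# Every original sample is still assigned to at least one batch
length(unique(unlist(r)))
#> [1] 50
```

In this run each batch holds 15 samples of its own plus the 5 shared replicates, which gives the 20 samples per batch reported by lengths(r).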

Previous work

The CRAN Task View of Experimental Design includes many packages relevant for designing an experiment before collecting data, but none of them helps with assigning samples to batches once the samples are already collected.

Two packages allow distributing the samples in batches:

If you are still designing the experiment and have not collected any data yet, DeclareDesign might be relevant for you.

See also the question I asked on Bioinformatics.SE before developing the package.

Other

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.