The main function for running the CHiCANE method is `chicane()`

. This takes a bam file and bed files with all restriction fragments and targeted restriction fragments as input, and produces a table with fragment interactions and associated p-values.

The package comes with a subset of reads from a capture Hi-C experiment on the Bre80 normal epithelial breast tissue cell line (Baxter et al. 2018). The experiment targeted several breast cancer risk loci in non-coding regions of the genome, and aimed to associate these risk variants with genes. Two of the risk SNPs are rs13387042 and rs16857609, and reads that map to them have been included in the file Bre80_2q35.bam.

```
library(chicane);
#> Loading required package: gamlss.tr
#> Loading required package: gamlss.dist
#> Loading required package: MASS
#> Loading required package: gamlss
#> Loading required package: splines
#> Loading required package: gamlss.data
#>
#> Attaching package: 'gamlss.data'
#> The following object is masked from 'package:datasets':
#>
#> sleep
#> Loading required package: nlme
#> Loading required package: parallel
#> ********** GAMLSS Version 5.1-6 **********
#> For more on GAMLSS look at http://www.gamlss.org/
#> Type gamlssNews() to see new features/changes/bug fixes.
#> Loading required package: data.table
# example BAM file, baits, and restriction fragments
bam <- system.file('extdata', 'Bre80_2q35.bam', package = 'chicane');
baits <- system.file('extdata', '2q35.bed', package = 'chicane');
fragments <- system.file('extdata', 'GRCh38_HindIII_chr2.bed.gz', package = 'chicane'); # HindIII fragments on chromosome 2
if( bedtools.installed() ) {
chicane.results <- chicane(
bam = bam,
baits = baits,
fragments = fragments
);
print( chicane.results[ 1:10 ] );
}
```

The `chicane`

function returns a `data.table`

object sorted by q-value. Each row contains information about the target and bait fragments, and the observed and expected number of reads linking the two fragments.

The `chicane`

function can operate directly on BAM files. To convert the BAM file into a text file, R calls the shell script `prepare_bam.sh`

. This script relies on bedtools to select reads overlapping the captured fragments and identify the restriction fragments corresponding to each read. The resulting data is read into a `data.table`

object, and the number of reads linking any two fragments is calculated.

For users that have a digested genome output from HiCUP, this needs to be converted to BED format to be compatible with bedtools. CHiCANE includes a helper function, `convert.hicup.digest.bed()`

, which can be used to convert a digested genome in HiCUP format to a BED file of the restiction digest fragments.

This processing can take a while (~45 minutes for a 19GB BAM file on a commodity computer with 13GB RAM), and only needs to be done once for each BAM. To speed up model fitting, it is possible to pre-process the data with the `prepare.data()`

function and run `chicane`

directly on the resulting data table.

```
if( bedtools.installed() ) {
interaction.data <- prepare.data(
bam = bam,
baits = baits,
fragments = fragments
);
}
```

The interaction data object contains details of the fragments, and the number of reads linking the two fragments.

By default, only combinations of fragments that are detected at least once are included in the data. To also include zero counts, use the `include.zeros`

argument. Setting `include.zeros = 'cis'`

includes all zero counts for bait/ target combinations on the same chromosome, while `include.zeros = 'all'`

includes all possible combinations.

```
if( bedtools.installed() ) {
cis.zero.results <- chicane(
bam = bam,
baits = baits,
fragments = fragments,
include.zeros = 'cis'
);
print( cis.zero.results[ 1:10 ] );
}
```

For most experiments including zeros will result in impractically large files. Another option for accounting for zero counts is to specify a truncated distribution through the `distribution`

argument. However, this will slow down model fitting.

The `interactions`

argument of the `chicane()`

function can take either a `data.table`

object from `prepare.data`

or the path to such a table.

The CHiCANE method models the expected number of reads linking two restriction fragments as a function of the distance between the loci and the “interactibility” of the bait fragment, that is, its inherent propensity to interact with other fragments.

Let \(Y_{ij}\) denote the number of reads linking bait \(i\) and target fragment \(j\), \(d_{ij}\) be the distance between the two fragments, and \(t_i\) denote the number of reads linking the bait \(i\) with another fragment in trans. The CHiCANE model assumes that \(Y_{ij}\) follows a distribution with mean parameter \(\mu_{ij}\), where

\[\begin{equation} \mu_{ij} = \beta_0 + \beta_1\log(d_{ij}) + \beta_2\log(t_i + 1) \end{equation}\]

If the other end fragment \(j\) is also a bait, terms are included to adjust for the trans counts of the fragment \(j\), as well as the product of the trans counts of the two fragments as shown below

\[\begin{equation} \mu_{ij} = \beta_0 + \beta_1\log(d_{ij}) + \beta_2\log(t_i + 1) + \beta_3\log(t_j + 1) + \beta_4\log(t_i + 1)\log(t_j + 1) \end{equation}\]

Once an expected value \(\mu_{ij}\) is obtained for each combination of fragments, a \(p\)-value for the observed counts \(y_{ij}\) exceeding what’s expected by random chance is calculated as

\[\begin{equation} p = P(Y_{ij} \geq y_{ij}) \end{equation}\]

The probability \(P(Y_{ij} \geq y_{ij})\) depends on how the distribution \(Y_{ij} \mid \mu_{ij}\) of the counts conditional on the expected value is modelled.

The CHiCANE method supports several different distributions of the counts \(Y_{ij}\) conditional on the mean \(\mu_{ij}\). By default the counts are assumed to follow a negative binomial distribution with \(E(Y_{ij}) = exp (\mu_{ij})\). The negative binomial is similar to the Poisson model for counts, but includes an overdispersion term that makes it more appropriate for datasets with high variance. After fitting \(\beta_0, \beta_1, \beta_2\) and the overdispersion term by a maximum likelihood over all pairs of fragments, we use the fitted model to estimate a \(p\)-value for each pair. The Poisson, truncated Poisson, or truncated negative binomial distribution can be specified through the `distribution`

argument.

The negative binomial distribution models the counts as \(Y \sim NB(\mu, \theta)\), where \(\mu\) is the mean and \(\theta\) is a dispersion parameter. Under this model, the variance of the counts conditional on the mean \(\mu\) is given by

\[\begin{equation} Var(Y) = \mu + \frac{\mu^2}{\theta} \end{equation}\]

The model is fit using the `glm.nb()`

function in the MASS package. The maximum likelihood estimates (MLEs) of the coefficients \(\hat{\beta_i}\) are used to provide an estimate of the expected number of counts for each combination of restriction fragments. The MLE of the dispersion parameter \(\theta\) is used as the dispersion parameter (`size`

) in the negative binomial model.

Occasionally, the model fitting procedure will result in numerical errors as the result of lack of convergence. This is most often due to lack of overdispersion in the data leading to an unstable estimate of \(\theta\). If the variance is approximately equal to the mean, the \(\theta\) term goes to infinity and the model fails to converge.

When the model throws a numerical warning or fails to converge, the `chicane()`

function will attempt to diagnose if the problem was due to lack of overdispersion. This is done by fitting a negative binomial regression model with 25 iterations, and using the resulting fitted values and dispersion parameter to examine the variance of the negative binomial. If the estimated variance at the point with the largest fitted value is less than 0.1% greater than the corresponding estimate of the Poisson variance (i.e. the fitted value at that point), a Poisson distribution is fit instead.

Unlike the negative binomial, the Poisson distribution does not allow for the variance of the counts to exceed their mean. This model assumes that \(Y \sim \text{Poisson}(\mu)\), giving \(Var(Y) = \mu\). In practice, capture Hi-C counts often show more variability than can be explained by the Poisson model, leading to false positive interaction calls.

When fitting the model without including unobserved combinations of fragments (see Pre-processing Data), there are no counts of zero in the data. To accommodate this in the statistical model, a truncated Poisson distribution can be fit using the GAMLSS package (Stasinopoulos and Rigby 2016). This model assigns \(P(Y = 0) = 0\), and assumes the probabilities are proportional to the negative binomial probabilities for values \(y > 0\).

Similarly to the truncated negative binomial, the truncated Poisson distribution is supported by setting `distribution = 'truncated-poisson'`

. This model assumes the probabilities are proportional to the Poisson probabilities for values \(y>0\).

Adjusting for distance is done in a piecewise linear fashion. Cis-data is ordered by distance, and split into equally sized bins. The count model is then fit separately within each bin, and results are merged at the end by concatenating the results from the individual model fits.

By default, CHiCANE tries to split the data based on distance quantiles, i.e. into 100 equally sized groups. In some cases, particularly when dealing with smaller datasets, the resulting datasets are too small for model fitting. When this occurs, the number of distance bins are halved until all resulting datasets are large enough for the model to be fit. To override the default behaviour, pass the desired number of distance bins to `distance.bins`

. For example, setting `distance.bins = 4`

results in the count model being fit separately in each quartile. Trans interactions are fit separately from cis interactions.

By default, all baits and targets are included when fitting the CHiCANE model. An alternative approach is to filter out baits and fragments with low (Dryden et al. 2014) or high (Cairns et al. 2016) degree of “interactibility”, as inferred through the trans counts. CHiCANE also supports this model through the `bait.filters`

and `target.filters`

arguments.

Each of these take a vector of length two, where each element corresponds to the proportion of fragments that should be considered to fall below the “lower limit” and “upper limit.” For example, passing in `bait.filters = c(0.3, 0.9)`

means that the baits with the lowest 30% or highest 10% of trans counts will be removed before fitting the model.

Filtering fragments may affect multiple testing correction by changing the number of tests performed.

By default, CHiCANE performs multiple testing correction separately for each bait. To change this to a global multiple testing correction, set `multiple.testing.correction = 'global'`

in the main `chicane()`

function.

CHiCANE can merge replicates at the data processing step. If more than one BAM file is available, these can be passed as vector to the `bam`

argument of the `chicane()`

and `prepare.data()`

functions. The `replicate.merging.method`

determines how replicates are merged. Available options are ‘sum’ and ‘weighted-sum’.

CHiCANE allows users to specify additional covariates that should be added to the model with the `adjustment.terms`

argument. For example, we can add a term to adjust for the chromosome of the target fragment.

```
data(bre80);
adjusted.results <- chicane(
interactions = bre80,
adjustment.terms = 'target.chr'
);
print( adjusted.results[ 1:5 ] );
#> target.id bait.id target.chr target.start
#> 1: chr19:1155128-1228414 chr2:217035649-217042355 chr19 1155128
#> 2: chr1:117125276-117135137 chr2:217035649-217042355 chr1 117125276
#> 3: chr1:14950583-14951791 chr2:217035649-217042355 chr1 14950583
#> 4: chr1:49076960-49080684 chr2:217035649-217042355 chr1 49076960
#> 5: chr3:145680459-145682483 chr2:217035649-217042355 chr3 145680459
#> target.end bait.chr bait.start bait.end bait.to.bait bait.trans.count
#> 1: 1228414 chr2 217035649 217042355 FALSE 9494
#> 2: 117135137 chr2 217035649 217042355 FALSE 9494
#> 3: 14951791 chr2 217035649 217042355 FALSE 9494
#> 4: 49080684 chr2 217035649 217042355 FALSE 9494
#> 5: 145682483 chr2 217035649 217042355 FALSE 9494
#> target.trans.count distance count expected p.value q.value
#> 1: 2 NA 2 1.003963 0.2656989 0.6417435
#> 2: 2 NA 2 1.003992 0.2657096 0.6417435
#> 3: 2 NA 2 1.003992 0.2657096 0.6417435
#> 4: 2 NA 2 1.003992 0.2657096 0.6417435
#> 5: 2 NA 2 1.004014 0.2657179 0.6417435
```

The `adjustment.terms`

argument also supports more complex expressions such as log-transformations. Multiple adjustment terms can be added by passing a vector.

Any adjustment term must correspond to a column already present in the data. If you would like to adjust for anything that is not already present in the CHiCANE output (e.g. GC content), there’s a three step approach:

- Run
`prepare.data()`

on your BAM file - Add extra columns to the resulting data frame
- Run
`chicane()`

with the`interactions`

argument

```
sessionInfo();
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Sierra 10.12.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] C/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#>
#> attached base packages:
#> [1] parallel splines stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] chicane_0.1.2 data.table_1.12.8 gamlss.tr_5.1-0 gamlss_5.1-6
#> [5] nlme_3.1-148 gamlss.data_5.1-4 gamlss.dist_5.1-6 MASS_7.3-51.6
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.4.6 compiler_3.6.1 formatR_1.7
#> [4] bedr_1.0.7 futile.logger_1.4.3 iterators_1.0.12
#> [7] R.methodsS3_1.8.0 futile.options_1.0.1 R.utils_2.9.2
#> [10] tools_3.6.1 testthat_2.3.2 digest_0.6.25
#> [13] evaluate_0.14 lattice_0.20-41 rlang_0.4.6
#> [16] Matrix_1.2-18 foreach_1.5.0 yaml_2.2.1
#> [19] VennDiagram_1.6.20 xfun_0.14 stringr_1.4.0
#> [22] knitr_1.28 grid_3.6.1 R6_2.4.1
#> [25] survival_3.1-12 rmarkdown_2.1 lambda.r_1.2.4
#> [28] magrittr_1.5 codetools_0.2-16 htmltools_0.4.0
#> [31] stringi_1.4.6 R.oo_1.23.0
```

Development of CHiCANE was supported by Breast Cancer Now.

Baxter, Joseph S, Olivia C Leavy, Nicola H Dryden, Sarah Maguire, Nichola Johnson, Vita Fedele, Nikiana Simigdala, et al. 2018. “Capture Hi-c Identifies Putative Target Genes at 33 Breast Cancer Risk Loci.” *Nature Communications* 9 (1): 1028.

Cairns, Jonathan, Paula Freire-Pritchett, Steven W Wingett, Csilla Várnai, Andrew Dimond, Vincent Plagnol, Daniel Zerbino, et al. 2016. “CHiCAGO: Robust Detection of Dna Looping Interactions in Capture Hi-c Data.” *Genome Biology* 17 (1): 127.

Dryden, Nicola H, Laura R Broome, Frank Dudbridge, Nichola Johnson, Nick Orr, Stefan Schoenfelder, Takashi Nagano, et al. 2014. “Unbiased Analysis of Potential Targets of Breast Cancer Susceptibility Loci by Capture Hi-c.” *Genome Research* 24 (11): 1854–68.

Stasinopoulos, Mikis, and Bob Rigby. 2016. *Gamlss.tr: Generating and Fitting Truncated “Gamlss.family” Distributions*. https://CRAN.R-project.org/package=gamlss.tr.