- Overview
- System Requirements
- Installation Guide
- Demo
- URLs and References

MTAR package has functions to 1) compute the summary statistics with different types of data input; 2) to adjust summary statistics for sample overlap if necessary; and 3) to perform multi-trait rare-variant association tests using the summary statistics.

The package development version is tested on the following systems:

Mac OSX: Mojave version 10.14.6 (R version 3.6.0)

Windows 10 (R version 3.4.0)

The CRAN package should be compatible with Windows and Mac operating systems.

Before setting up the `MTAR`

package, users should have `R`

version 3.4.0 or higher.

`MTAR`

package requires R with version 3.4.0 or higher, which can be downloaded and installed from CRAN.

`install.packages("MTAR")`

`MTAR`

package depends on several R packages (`MASS`

, `CompQuadForm`

, `Matrix`

, `stats`

, `utils`

), which will be downloaded when installing `MTAR`

. `MTAR`

also uses non-exported R code from R packages `ACAT`

, `SKAT`

and `SeqMeta`

. The `MTAR`

package functions with all packages in their latest versions as they appear on `CRAN`

on September 9, 2019. The versions of software are, specifically:

```
CompQuadForm_1.4.3
MASS_7.3-51.4
Matrix_1.2-17
seqMeta_1.6.7
SKAT_1.3.2.1
```

The score statistics \(\mathbf{U}\) and their covariance matrix \(\mathbf{V}\) are the summary statistics required by MTAR tests. In this section, we will demonstrate how to use various utility functions to prepare for these summary statistics. The output object of these utility functions can be directly taken by the MTAR function as input.

where \(\widehat{\mathbf{\gamma}_k}\) and \(\widehat{\phi}_k\) are the restricted maximum likelihood estimators of \(\mathbf{\gamma}_k\) and \(\phi_k\) under \(H_0: \mathbf{\beta}_k = 0\). For the linear regression model, we have \(a(\widehat{\phi}_k) = n^{-1}\sum_{i = 1}^n(Y_{ik} - \widehat{\mathbf{\gamma}}_{k}^T\mathbf{X}_{ik})^2\), \(b'(z) = z\) and \(b''(z) = 1\). For the logistic regression model, we have \(a(\phi) = 1, b'(z) = e^z/(1 + e^z), b''(z) = e^z/(1 + e^z)^2\) (Tang and Lin (2015); Hu et al. (2013)).

To calculate the summary statistics from individual-level data, *Get_UV_from_data* can be used with required inputs of *traits*, *covariates* and *genotype*. Specifically, *traits*, *covariates* and *genotype* are 3 lists and each sublist contains the information from \(L\) study. For \(\ell\)th study, traits, covariates and genotype information should be a \(n\times K\), \(n\times D\) and \(n\times m\) matrix, respectively. The number of traits in each study can be different, but the name of traits must be provided. Same as the name of SNPs in *genotype*. The missing value in *trait* or in *genotype* should be labeled as NA. The order of subject IDs in *traits*, *covariates* and *genotype* should be the same.

Here, we use a simulated dataset *rawdata* to show how to apply *Get_UV_from_data* function. In *rawdata*, there are 3 studies, 3 continuous traits and 10 rare variants. Fig. 2 shows the diagram for sample overlap among traits in three studies. Specifically, there are 1500 subjects in study1, but each subject only has one trait measurement. In study2 and study3, the sample size is 500 and each subject has two (LDL and HDL) or three traits (LDL, HDL and TG) measurements.

```
data("rawdata")
attach(rawdata)
head(traits.dat$Study1)
```

```
## LDL HDL TG
## SUBJ1 -2.6166166 NA NA
## SUBJ2 0.8519045 NA NA
## SUBJ3 1.2438145 NA NA
## SUBJ4 -0.5327296 NA NA
## SUBJ5 0.8006589 NA NA
## SUBJ6 0.1560920 NA NA
```

`head(cov.dat$Study1)`

```
## cov1 cov2
## SUBJ1 1 -0.29713414
## SUBJ2 0 -0.08074092
## SUBJ3 0 0.55975760
## SUBJ4 0 0.49851956
## SUBJ5 1 -0.07097778
## SUBJ6 1 -0.28825662
```

`head(geno.dat$Study1)`

```
## SNP2148 SNP2149 SNP2150 SNP2151 SNP2152 SNP2153 SNP2154 SNP2155
## SUBJ1 0 0 0 0 0 0 0 0
## SUBJ2 0 0 0 0 0 0 0 0
## SUBJ3 0 0 0 0 0 0 0 0
## SUBJ4 0 0 0 0 0 0 0 0
## SUBJ5 0 0 0 0 0 0 0 0
## SUBJ6 0 0 0 0 0 0 0 0
## SNP2156 SNP2157
## SUBJ1 0 0
## SUBJ2 0 0
## SUBJ3 0 0
## SUBJ4 0 0
## SUBJ5 0 0
## SUBJ6 0 0
```

```
obs.stat <- Get_UV_from_data(traits = traits.dat,
covariates = cov.dat,
genotype = geno.dat,
covariance = TRUE)
obs.stat$U
```

```
## $LDL
## SNP2148 SNP2149 SNP2150 SNP2151 SNP2152 SNP2153
## 0.3438038 1.5488022 1.3252516 0.0000000 2.0216781 -1.8258464
## SNP2154 SNP2155 SNP2156 SNP2157
## 0.6500477 0.0000000 0.1757972 1.5709795
##
## $HDL
## SNP2148 SNP2149 SNP2150 SNP2151 SNP2152
## 0.00000000 0.70525844 1.25507643 0.00000000 -0.03924912
## SNP2153 SNP2154 SNP2155 SNP2156 SNP2157
## -0.28306623 2.06966868 0.00000000 -12.48457057 0.00000000
##
## $TG
## SNP2148 SNP2149 SNP2150 SNP2151 SNP2152 SNP2153
## 0.0000000 1.3061250 2.2535334 0.0000000 -2.6934948 -0.2769778
## SNP2154 SNP2155 SNP2156 SNP2157
## 1.5616663 0.0000000 -0.3394679 0.0000000
```

`detach(rawdata)`

For each trait, if the number of unique values is no more than 2, then this trait will be viewes as a binary trait and a logistic regression will be conducted to calculate the summary statistics. If *covariance = TRUE*, then a \(m\times m\) covariance matrix \(\mathbf{V}_k\) is returned, otherwise only \(m\) diagonal elements \({\rm diag}(\mathbf{V}_k)\) is returned.

When the score statistics \(\mathbf{U}_k\) and their variance (\({\rm diag}(\mathbf{V}_k)\)) are available, the covariance matrix \(\mathbf{V}_{k}\) can be approximated as \[
\mathbf{V}_k \approx {\rm diag}(\mathbf{V}_k)^{1/2} \mathbf{R} {\rm diag}(\mathbf{V}_k)^{1/2},
\] where \(\mathbf{R}=\{R_{j\ell}\}_{j,\ell=1}^m\) is the SNP correlation matrix estimated either from the working genotypes or the external reference (Hu et al. (2013)). This transformation has been implemented in function *Get_UV_from_varU*.

```
data("varU.example")
attach(varU.example)
obs.stat <- Get_UV_from_varU(U = U, varU = varU, R= R)
obs.stat$U
```

```
## $LDL
## 1:150551327 1:150551576 1:150551649 1:150550965 1:150551447
## -268.5912258 31.2170161 -12.7574302 4.2199502 0.8814878
## 1:150550761
## 3.2751963
##
## $HDL
## 1:150551327 1:150551576 1:150551649 1:150550965 1:150551447 1:150550761
## 229.0655499 21.4970739 -18.7336436 -0.1519353 -3.7007466 -12.8477062
##
## $TG
## 1:150551327 1:150551576 1:150551649 1:150550965 1:150551447 1:150550761
## -363.297279 -11.531591 -1.618765 -2.521649 1.127816 6.706268
```

`detach(varU.example)`

When the genetic effect estimates \(\widehat{\mathbf{\beta}}_{k} = \{\widehat{\beta}_{kj} \}_{j=1}^m\) and their standard error \(\mathbf{se}_{k}= \{ se_{kj}\}_{j=1}^m\) are available, we can approximate \(\mathbf{U}_k = \{ U_{kj}\}_{j=1}^m\) and \(\mathbf{V}_{k} = \{V_{kj\ell}\}_{j,\ell=1}^m\) as \({U}_{kj} \approx \widehat{\beta}_{kj}/se^2_{kj}\) and \(V_{kj\ell} \approx R_{j\ell} /(se_{kj}se_{k\ell})\) via function *Get_UV_from_beta*.

```
data("beta.example")
attach(beta.example)
obs.stat <- Get_UV_from_beta(Beta = Beta, Beta.se = Beta.se, R = R)
detach(beta.example)
```

It took less than 0.05 second to calculate the summary statistics for a gene with 10 rare variants.

If all the traits are from one study or from multiple studies with overlapping subjects, the genetic effect estimates among traits are correlated. To calculate the covariances of these genetic effects, we need to first compute *zeta* – the among-trait covariances of Z-scores for the SNPs that are not associated with the traits (details in Luo et al. (2019)). In function *Get_zeta*, we ask users to input a vector of Z-scores from genome-wide association tests between common variants and each trait. The function will conduct LD pruning (by using the reference independent SNP list in 1000 Genome), remove SNPs with p-value < 0.05 (default p-value cutoff threshold), and compute *zeta*.

Here we showed a simplified example of estimating matrix *zeta*, we extracted a subset of 737 null and common SNPs from chromosome 1 in the Global Lipids Genetics Consortium (GLGC, Liu et al. (2017)). After filtering out those dependent SNPs, there are only 137 SNPs remaining. The computation time of estimating *zeta* with 737 null and common SNPs is less than 4 seconds.

```
data("zeta.example")
attach(zeta.example)
# Downloading independent common SNPs from 1000Genome data set.
githubURL <- "https://github.com/lan/MTAR/blob/master/indp_snps.1KG.rda?raw=true"
utils::download.file(githubURL,"1kgfile")
load("1kgfile")
zeta1 <- Get_zeta(Zscore = Zscore, Indp_common_snp = indp_snps.1KG)
detach(zeta.example)
```

To run MTAR function, the score statistics *U*, their covariance matrix *V* and minor allele frequency *MAF* are required. Other inputs are optional. Here an example dataset (*MTAR.example*) offers all the information needed in *MTAR* function.

```
data("MTAR.example")
names(MTAR.example)
```

```
## [1] "annotation" "U" "V"
## [4] "MAF" "genetic_cor.trait" "zeta"
```

Specifically, *genetic_cor.trait* represents the genetic correlation information (B. Bulik-Sullivan et al. (2015)). The genetic correlation can be estimated using *ldsc* software (B. K. Bulik-Sullivan et al. (2015) and URLs). Some websites also archive the genetic correlation among many complex traits (URLs). If no genetic correlation is available, *MTAR* will use an exchangeable correlation structure among traits. The *zeta* can be estimated in Step 2 and via *Get_zeta*. If no *zeta* is provided, *MTAR* assumes there is no sample overlap among traits. The p-values of MTAR-O, cMTAR, iMTAR and cctP and ancillary information will be returned. In this example, it took less 5 seconds to calculate the MTAR test p-values for gene PNPLA2.

```
attach(MTAR.example)
pval <- MTAR(U = U, V = V, MAF = MAF, genetic_cor.trait = genetic_cor.trait, zeta = zeta)
pval
```

```
## $MTARO
## [1] 5.226324e-08
##
## $cMTAR
## $cMTAR$p
## [1] 5.368409e-08
##
## $cMTAR$rho1.min
## [1] 0
##
## $cMTAR$rho2.min
## [1] 1
##
##
## $iMTAR
## $iMTAR$p
## [1] 2.599168e-08
##
## $iMTAR$rho1.min
## [1] 0
##
## $iMTAR$rho2.min
## [1] 1
##
##
## $cctP
## [1] 3.329077e-06
```

`detach(MTAR.example)`

ldsc website: https://github.com/bulik/ldsc/wiki/Heritability-and-Genetic-Correlation Genetic correlation: https://media.nature.com/original/nature-assets/ng/journal/v47/n11/extref/ng.3406-S2.csv

Bulik-Sullivan, Brendan K, Po-Ru Loh, Hilary K Finucane, Stephan Ripke, Jian Yang, Nick Patterson, Mark J Daly, et al. 2015. “LD Score Regression Distinguishes Confounding from Polygenicity in Genome-Wide Association Studies.” *Nature Genetics* 47 (3). Nature Publishing Group: 291.

Bulik-Sullivan, Brendan, Hilary K Finucane, Verneri Anttila, Alexander Gusev, Felix R Day, John RB Perry, Nick Patterson, et al. 2015. “An Atlas of Genetic Correlations Across Human Diseases and Traits.” *Nat. Genet.* 47 (11). Nature Publishing Group: 1236–41.

Hu, Yi-Juan, Sonja I Berndt, Stefan Gustafsson, Andrea Ganna, Reedik Mägi, Eleanor Wheeler, Mary F Feitosa, et al. 2013. “Meta-Analysis of Gene-Level Associations for Rare Variants Based on Single-Variant Statistics.” *The American Journal of Human Genetics* 93 (2). Elsevier: 236–48.

Liu, Dajiang J, Gina M Peloso, Haojie Yu, Adam S Butterworth, Xiao Wang, Anubha Mahajan, Danish Saleheen, et al. 2017. “Exome-Wide Association Study of Plasma Lipids in \(>\) 300,000 Individuals.” *Nat. Genet.* 49 (12). Nature Publishing Group: 1758.

Luo, Lan, Judong Shen, Hong Zhang, Aparna Chhibber, Devan V Mehrotra, and Zheng-Zheng Tang. 2019. “Multi-Trait Analysis of Rare-Variant Association Summary Statistics Using MTAR.” *Submitted*.

Tang, Zheng-Zheng, and Dan-Yu Lin. 2015. “Meta-Analysis for Discovering Rare-Variant Associations: Statistical Methods and Software Programs.” *The American Journal of Human Genetics* 97 (1). Elsevier: 35–53.