This platform presents multiple analyses conducted using CARTaGENE data to the scientific community as well as the public at large. The CARTaGENE cohort is specific to Quebec and is a regional subset of the Canada-wide "Canadian Partnership for Tomorrow's Health".
This platform displays phenome-wide association (PheWAS) results from the CARTaGENE cohort (i.e. statistical associations between individuals' trait information and genomic variations across the cohort).
We built this version of PheWeb using N=29,337 CARTaGENE participants genotyped on five different arrays.
Short name | Description | N Individuals | N Variants |
---|---|---|---|
Archi | GSA v3 + Multi disease panel (MHI_GSAMD-24v3-0-EA_20034606_C1) | 1,909 | 688,795 |
17K | GSA v2 + Multi disease panel + addon (CaG_addon_v1_20037253_A2) | 17,286 | 645,075 |
4K | GSA v2 + Multi disease panel (GSAMD-24v2-0_20024620_A) | 4,179 | 728,919 |
5K | GSA v1 + Multi disease panel (GSAMD-24v1-0_20011747_A1) | 5,237 | 658,296 |
760 | GSA v1 (GSA-24v1-0_A1) | 726 | 626,377 |
The genotyped CARTaGENE participants are from six Quebec regions: Gatineau, Montreal, Quebec, Saguenay, Sherbrooke, Trois-Rivieres. Only the 17K and 5K arrays included participants from all six regions.
Region | N Individuals in 17K (%) | N Individuals in 4K (%) | N Individuals in 5K (%) | N Individuals in 760 (%) | N Individuals in Archi (%) |
---|---|---|---|---|---|
Gatineau | 757 (4.4) | None | 326 (6.2) | 1 (0.1) | None |
Montreal | 12,354 (71.5) | 2,921 (69.9) | 3,423 (65.4) | 522 (71.9) | 1,302 (68.2) |
Quebec | 2,428 (14.0) | 877 (21.0) | 715 (13.7) | 61 (8.4) | 410 (21.5) |
Saguenay | 556 (3.2) | 151 (3.6) | 228 (4.4) | 94 (12.9) | 196 (10.3) |
Sherbrooke | 698 (4.0) | 229 (5.5) | 359 (6.9) | 48 (6.6) | None |
Trois-Rivieres | 493 (2.9) | None | 186 (3.6) | None | None |
Only the 17K and 5K arrays included substantial numbers of participants from both phase 1 and phase 2 of the study.
Phase | N Individuals in 17K (%) | N Individuals in 4K (%) | N Individuals in 5K (%) | N Individuals in 760 (%) | N Individuals in Archi (%) |
---|---|---|---|---|---|
1 | 9,828 (56.9) | 4,177 (100.0) | 2,503 (47.8) | 715 (98.5) | 1,908 (100.0) |
2 | 7,456 (43.1) | 1 (0.0) | 2,734 (52.2) | 11 (1.5) | None |
We started with the genotype data on genome build GRCh37/hg19 saved in five different files (i.e. one per array in PLINK format). We confirmed that each array's genotype data had undergone the initial QC at CARTaGENE.
Our steps for CARTaGENE genotype data imputation to the TOPMed reference panel included:
The following sections describe each of these steps in detail.
We converted each array's genotype data to genome build GRCh38 and aligned it to the "+" strand. Our conversion included the following steps:
The final number of variants after these steps was:
Array | N Variants in GRCh37 | N Variants in GRCh38 |
---|---|---|
Archi | 688,795 | 620,623 |
17K | 645,075 | 612,224 |
4K | 728,919 | 648,511 |
5K | 658,296 | 638,099 |
760 | 626,377 | 581,440 |
We merged the genotypes from the five arrays in the following steps:
G ~ m + PC1 + ... + PC4 + A2 + ... + A5
and
G ~ m + PC1 + ... + PC4,
where G is the individual genotype (0, 1, or 2), m is the intercept corresponding to the average alternate allele frequency, PCi is the i-th principal component, and Aj is an indicator variable for array j. Assuming random sampling and no batch effects, the array variables should have no impact on predicting the individual-specific allele frequency. The test is implemented in the compare_ancestry_adjusted_af.py script. For the non-pseudo-autosomal region of the X chromosome, we performed this test separately for each sex.
The final merged dataset included 29,330 individuals and 479,908 variants.
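The comparison of the two nested models above can be illustrated with a small, self-contained sketch. It is not the compare_ancestry_adjusted_af.py script: the choice of a likelihood-ratio statistic, the function names, and the simulated data are all assumptions made for illustration, and the actual script may use a different test.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def residual_sum_of_squares(design, y):
    """Fit ordinary least squares via the normal equations; return the RSS."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def chi2_sf_even_df(stat, df):
    """Chi-square survival function (closed form, valid for even df only)."""
    half = stat / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(df // 2))


def array_effect_lrt(genotypes, pcs, arrays, n_arrays=5):
    """Compare G ~ m + PCs + array indicators against G ~ m + PCs.

    Arrays are coded 0..n_arrays-1; array 0 is the baseline, so the full
    model adds n_arrays - 1 indicators (4 here, an even number of df).
    """
    reduced = [[1.0] + list(pc) for pc in pcs]
    full = [row + [1.0 if a == j else 0.0 for j in range(1, n_arrays)]
            for row, a in zip(reduced, arrays)]
    rss_reduced = residual_sum_of_squares(reduced, genotypes)
    rss_full = residual_sum_of_squares(full, genotypes)
    stat = max(len(genotypes) * math.log(rss_reduced / rss_full), 0.0)
    return stat, chi2_sf_even_df(stat, n_arrays - 1)


def simulate(n, batch_shift, seed=1):
    """Hypothetical data: array 4 has its allele frequency shifted."""
    rng = random.Random(seed)
    arrays = [i % 5 for i in range(n)]
    pcs = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(n)]
    genotypes = []
    for a in arrays:
        af = 0.3 + (batch_shift if a == 4 else 0.0)
        genotypes.append(float(sum(rng.random() < af for _ in range(2))))
    return genotypes, pcs, arrays


if __name__ == "__main__":
    for shift in (0.0, 0.3):
        stat, p_value = array_effect_lrt(*simulate(500, shift))
        print(f"shift={shift}: LRT statistic={stat:.2f}, p-value={p_value:.3g}")
```

A variant with a small p-value in such a test would be one whose genotype is better predicted when the array is known, which is the signature of a batch effect.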
We removed a further 159 variants which showed substantial differences in alternate allele frequencies when compared to the TOPMed reference panel.
We imputed CARTaGENE genotypes using the TOPMed reference panel through the TOPMed Imputation Server at the NHLBI BioData Catalyst platform. At the time of our analyses, the TOPMed Imputation Server limited imputation jobs to 25,000 individuals. To overcome this limitation, we split the CARTaGENE imputation into two batches. The following sections describe our approach.
For imputation purposes, we split CARTaGENE's merged array data into two overlapping batches (Figure 1):
Because the two batches overlap, the population allele frequencies of the variants were similar in both batches (Figure 2).
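One simple way to build two overlapping batches under a per-job size limit is sketched below. This is only an illustration: the function name, the random shuffle, and the resulting overlap size are assumptions, and the actual composition of the two CARTaGENE batches is not reproduced here.

```python
import random


def split_into_overlapping_batches(sample_ids, max_batch_size, seed=0):
    """Return two batches that together cover every sample.

    Hypothetical scheme: shuffle the samples, then take the first
    max_batch_size samples as batch 1 and the last max_batch_size samples
    as batch 2. Samples in the middle of the shuffled list land in both.
    """
    ids = list(sample_ids)
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    if len(ids) > 2 * max_batch_size:
        raise ValueError("two batches cannot cover this many samples")
    random.Random(seed).shuffle(ids)
    return ids[:max_batch_size], ids[-max_batch_size:]


if __name__ == "__main__":
    # 29,330 merged individuals and a 25,000-individual limit, as in the text.
    batch1, batch2 = split_into_overlapping_batches(range(29_330), 25_000)
    overlap = set(batch1) & set(batch2)
    print(len(batch1), len(batch2), len(overlap))
```

With this scheme, both batches are random samples of the same cohort, which is one reason allele frequencies would be expected to agree between them.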
We imputed 307,841,049 variants in each of the two batches. The alternate allele frequencies (AF) of imputed variants were similar in both batches (Figure 1).
The low-frequency variants with MAF<0.001 had the lowest imputation qualities (median Rsq of 0.02401 in batch #1 and 0.02409 in batch #2), while more frequent variants with MAF>0.01 had the highest imputation qualities (median Rsq >0.93842 in batch #1 and >0.93825 in batch #2) (Figure 2).
For each variant imputed in both batches, we looked at the absolute difference between its imputation qualities (Rsq) in batch #1 and batch #2 (Figure 4). We observed the largest Rsq differences (>0.5) only in variants with the lowest minor allele frequency (MAF<0.00034095, corresponding to 20 alternate alleles). For the vast majority (>99%) of variants with MAF>0.01, the Rsq difference did not exceed 0.01.
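The MAF threshold quoted above follows from the allele count and the sample size. A short check, assuming diploid genotypes and the 29,330 individuals of the merged dataset (the function name is ours):

```python
N_INDIVIDUALS = 29_330  # individuals in the final merged dataset


def allele_frequency_from_count(allele_count, n_individuals):
    """Allele frequency implied by an allele count among diploid individuals."""
    return allele_count / (2 * n_individuals)


threshold = allele_frequency_from_count(20, N_INDIVIDUALS)
print(f"{threshold:.8f}")  # prints 0.00034095
```

The factor of 2 reflects two allele copies per individual; it would not apply as written to the non-pseudo-autosomal region of the X chromosome in males.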
We merged the imputed genotypes from the two batches using the merge_minimac_dosages.py script. For dosages imputed in both batches, we selected the dosage from the batch with the higher imputation quality (Minimac Rsq) for the corresponding variant. Afterwards, we filtered out variants with a combined imputation quality Rsqcombined≤0.3 (Figure 1). The underlying distributions of Rsq in batch #1 and batch #2 were similar for each quantile of Rsqcombined (Figure 2).
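The per-variant selection rule described above can be sketched as follows. This is not the merge_minimac_dosages.py script: the record layout, the function names, and the tie-breaking in favour of batch #1 are assumptions. How Rsqcombined is computed is not specified here, so the sketch takes it as an input.

```python
RSQ_THRESHOLD = 0.3  # variants with combined Rsq <= 0.3 are filtered out


def merge_variant_dosages(record1, record2):
    """Merge one variant's dosages from two batches.

    Each record is a dict with keys "rsq" (float) and "dosages"
    (sample ID -> dosage). Samples present in only one batch keep that
    batch's dosage; samples present in both take the dosage from the batch
    with the higher Rsq for this variant (ties go to batch #1, an assumption).
    """
    if record1["rsq"] >= record2["rsq"]:
        preferred, other = record1, record2
    else:
        preferred, other = record2, record1
    merged = dict(other["dosages"])
    merged.update(preferred["dosages"])  # preferred batch wins on overlap
    return merged


def passes_quality_filter(rsq_combined, threshold=RSQ_THRESHOLD):
    """Keep a variant only if its combined Rsq is strictly above the threshold."""
    return rsq_combined > threshold


if __name__ == "__main__":
    batch1 = {"rsq": 0.80, "dosages": {"s1": 0.10, "s2": 1.00}}
    batch2 = {"rsq": 0.90, "dosages": {"s2": 1.20, "s3": 2.00}}
    print(merge_variant_dosages(batch1, batch2))
```

In the usage example, sample s2 is present in both batches and receives the batch #2 dosage because batch #2 has the higher Rsq for that variant.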
The final dataset included 103,574,557 imputed variants (Table I).
MAFcombined | [0,0.00034] | (0.00034,0.01] | (0.01,0.1] | (0.1,0.5] |
---|---|---|---|---|
Number of variants | 74,532,694 | 19,663,301 | 4,244,017 | 5,134,545 |
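As a small illustration of the binning used in the table above, the sketch below assigns a MAF value to one of the four right-closed bins and checks that the reported per-bin counts add up to the reported total. The function and constant names are ours.

```python
import bisect

MAF_BIN_EDGES = [0.00034, 0.01, 0.1, 0.5]  # right-closed upper bin edges
MAF_BIN_LABELS = ["[0,0.00034]", "(0.00034,0.01]", "(0.01,0.1]", "(0.1,0.5]"]

# Per-bin variant counts as reported in the table.
REPORTED_COUNTS = [74_532_694, 19_663_301, 4_244_017, 5_134_545]


def maf_bin(maf):
    """Return the label of the bin containing a minor allele frequency."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie in [0, 0.5]")
    # bisect_left places a value equal to an edge into the bin it closes.
    return MAF_BIN_LABELS[bisect.bisect_left(MAF_BIN_EDGES, maf)]


if __name__ == "__main__":
    print(maf_bin(0.0005))
    print(sum(REPORTED_COUNTS))
```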
In this version of the CARTaGENE PheWeb, we restricted our genome-wide association analyses (GWASs) to individuals of European genetic ancestry. We used Regenie to run GWASs. The following sections describe the selection of individuals, phenotypes, and the corresponding Regenie arguments.
We selected CARTaGENE participants of inferred European genetic ancestry using the following steps:
Genetic ancestry | Admixed American | African | East Asian | South Asian | European | Other |
---|---|---|---|---|---|---|
Number of individuals | 479 | 460 | 225 | 124 | 25,896 | 2,146 |
We used binary and continuous phenotypes from the baseline assessment. To select the phenotypes of potential interest, we used the meta-information for CARTaGENE variables in the COMBINED_CATALOG_v2_9_7 - OCT2021 catalogue. We based our phenotype selection on the sample size estimated using a subset of CARTaGENE individuals of European genetic ancestry.
We selected 326 binary phenotypes to perform GWASs in the following way:
We selected 402 continuous phenotypes to perform GWASs in the following way:
We performed GWASs on >700 phenotypes using Regenie (version 3.0.1), which handles unbalanced case-control ratios and accounts for population structure and relatedness. Our automated Nextflow pipeline for executing Regenie in a highly parallel way is available here.
In Regenie's Step 1 (whole genome model), we used the merged array genotype data with the following filters:
In Regenie's Step 2 (single-variant association testing), we used the imputed data and the following settings:
The CARTaGENE PheWeb uses the open-source PheWeb framework. The customized source code for this instance is available here.