Tutorial

This is a short tutorial on how to use the different modules of GWASpy.

1. Datasets

We will be using simulated test data (on GRCh37) from RICOPILI for most of the examples. Below is how it can be downloaded and copied to a Google bucket

wget https://personal.broadinstitute.org/sawasthi/share_links/UzoZK7Yfd7nTzIxHamCh1rSOiIOSdj_gwas-qcerrors.py/sim_sim1a_eur_sa_merge.miss.{bed,bim,fam} .
gsutil cp sim_sim1a_eur_sa_merge.miss.{bed,bim,fam} gs://my-gcs/bucket/test_data

For low-coverage genotype imputation using GLIMPSE, we will be using the 1X downsampled NA12878 file from the GLIMPSE tutorial. Below is how it can be downloaded and copied to a Google bucket

wget wget https://github.com/odelaneau/GLIMPSE/raw/refs/heads/master/tutorial/NA12878_1x_bam/NA12878.{bam,bam.bai} .
gsutil cp NA12878.{bam,bam.bai} gs://my-gcs/bucket/test_data

2. Start a dataproc cluster with GWASpy installed

The code below will start a cluster with GWASpy automatically installed. This fetches the GWASpy version on PyPI. You can also install the GitHub version by replacing gwaspy with git+https://github.com/atgu/GWASpy.git

hailctl dataproc start gwaspy-tut --region=us-central1 --packages gwaspy --max-age 4h

3. Pre-imputation QC

Next, we will QC the data using the default arguments in preimp_qc. Since the reference panel we will be using for phasing and imputation is on GRCh38, we also add a liftover argument to liftover our input data to GRCh38. We also set export_type to vcf because we will use the QC’ed file as input to phasing and imputation.

First create a python script on your local machine as below

import gwaspy.preimp_qc as qc
qc.preimp_qc.preimp_qc(dirname="gs://my-gcs/bucket/test_data/", basename="sim_sim1a_eur_sa_merge.miss",
                       input_type="plink", reference="GRCh37", liftover=True, export_type="vcf")

Then run the following command to submit the script to the Dataproc cluster named gwaspy-tut
hailctl dataproc submit gwaspy-tut qc_script.py

4. PCA

Using the QC’ed data from step 3 above, we will now run PCA.

First create a python script on your local machine as below

import gwaspy.pca as pca
pca.pca.pca(data_dirname="gs://my-gcs/bucket/test_data/GWASpy/Preimp_QC",
            data_basename="sim_sim1a_eur_sa_merge.miss_qced", out_dir="gs://my-gcs/bucket/test_data/",
            input_type="vcf", reference="GRCh38", pca_type="normal")

Then run the following command to submit the script to the Dataproc cluster named gwaspy-tut
hailctl dataproc submit gwaspy-tut pca_script.py

If you have real data, you can use (1) pca_type="project" which will train a Random Forest model on the HGDP+1kGP reference and use the trained model to classify samples in your data; or (2) pca_type="joint" which will first find an intersection (variants) between the HGDP+1kGP reference and your input, use the intersected HGDP+1kGP to train a RF model, then classify your input data. If your data has a lot of variants (+million), pca_type="project" usually gives plausible results. Otherwise you can try pca_type="joint"

Note

If you are a Broad user with Hail Batch access, you have to have python and GWASpy installed locally to be able to run phasing and imputation. For non-Broad users, we provide a nextflow implementation and the only thing you are required to do is have nextflow locally (nextflow executable file) and necessary permissions as mentioned

5. Phasing and Imputation

5.1 Hail Batch

5.1.1. Phasing (should be ~$2 and takes ~40 minutes)

The example below is for running phasing, without a reference panel. If you want to use the HGDP+1kGP reference panel or your own, simply add the --vcf-ref argument as explained here

phasing --input-vcf gs://my-gcs/bucket/test_data/GWASpy/Preimp_QC/sim_sim1a_eur_sa_merge.miss_qced.vcf.bgz \
--output-filename sim_sim1a_eur_sa_merge.miss_qced.phased --out-dir gs://my-gcs/bucket/test_data/GWASpy/phasing \
--fill-tags --genome-build GRCh38 --billing-project my-billing-project

5.1.2. Imputation using IMPUTE5 (should be ~$4 and takes <20 minutes)

The example below is for running phasing, without a reference panel. If you want to use the HGDP+1kGP reference panel or your own, simply add the --vcf-ref argument as explained here

imputation --input-file gs://my-gcs/bucket/test_data/GWASpy/phasing/shapeit5/phase_common/sim_sim1a_eur_sa_merge.miss_qced.phased_chrCNUMBER.array.shapeit5_common.bcf \
--vcf-ref hgdp1kgp --output-filename sim_sim1a_eur_sa_merge.miss_qced.phased.imputed --out-dir gs://my-gcs/bucket/test_data/GWASpy/imputation \
--n-samples 1989 --n-ref-samples 4091 --billing-project my-billing-project

Note

You may need to add HAIL_GENETICS_HAIL_IMAGE=hailgenetics/python-dill:3.9-slim in front of the phasing and imputation commands if you are using a Python version other than 3.9, 3.10, or 3.11

5.2. Nextflow

Before we run the nextflow pipeline, you have to first download the following files and copy them to your bucket: (1) common chunks and rare chunks files used to parallelize imputation across genomic regions; (2) genetic map files. SHAPEIT5 repo has chunks files and genetic map files.

Once you have the files on a Google bucket, you can update the params.json file. Specifically, the things you need to update are: input_vcf, output_filename, out_dir, data_type, common_chunks, rare_chunks, genetic_maps. If you have one input file per chromosome, set input_split_by_chrom to true

{
    "input_vcf": "gs://my-gcs/bucket/test_data/GWASpy/Preimp_QC/sim_sim1a_eur_sa_merge.miss_qced.vcf",
    "output_filename": "sim_sim1a_eur_sa_merge.miss_qced",
    "out_dir": "gs://my-gcs/bucket/test_data/GWASpy/nf_phase_impute",
    "impute": true,
    "fill_tags": true,
    "input_split_by_chrom": false,
    "vcf_ref": "gs://gcp-public-data--gnomad/resources/hgdp_1kg/phased_haplotypes_v2/hgdp1kgp_chrCNUMBER.filtered.SNV_INDEL.phased.shapeit5",
    "ref_format": "vcf",
    "data_type": "array", // or wgs
    "maf": 0.001,
    "common_chunks": "gs://my-gcs/bucket/chunks/b38/20cM/chunks_chrCNUMBER.txt",
    "rare_chunks": "gs://my-gcs/bucket/chunks/b38/4cM/chunks_chrCNUMBER.txt",
    "genetic_maps": "gs://my-gcs/bucket/maps/b38/chrCNUMBER.b38.gmap.gz"
}

Next thing to do is update the nextflow.config file. The only things you need to change are workDir and google.project, and sometimes google.location

workDir = 'gs://my-gcs/bucket/test_data/GWASpy/work'

process {
  executor = 'google-batch'
  errorStrategy = { task.exitStatus==null ? 'retry' : 'terminate' }
  maxRetries = 3
}

profiles {
    gbatch {
      google.project = 'my-batch-billing-project'
      google.location = 'us-central1'
      batch.spot = true
    }
}

Now you can easily run both phasing and imputation using the following command

./nextflow run main.nf -c nextflow.config -profile gbatch -params-file params.json

6. Low-coverage WGS imputation using GLIMPSE

6.1 Hail Batch (should be ~$0.5 and takes <10 minutes)

Unlike phasing using IMPUTE5, GLIMPSE takes BAM files as input, and since we usually have one BAM file per sample, the input to the imputation module when using GLIMPSE is a TSV file without a header and has two columns: first column with sample ID and second column with the actual path to the BAM file. Only one sample/BAM per row is allowed in the TSV. Below is an example of a file saved as gs://my-gcs/bucket/test_data/na12878_test.tsv

NA12878

gs://my-gcs/bucket/test_data/NA12878.bam

Once you have saved the TSV to a bucket, you can run GLIMPSE phasing and imputation using the following command

imputation --input-file gs://my-gcs/bucket/test_data/na12878_test.tsv --vcf-ref hgdp1kgp \
--output-filename sim_sim1a_eur_sa_merge.miss_qced.phased.imputed \
--out-dir gs://my-gcs/bucket/test_data/GWASpy/lowcov_imputation --n-samples 1 --n-ref-samples 4091 \
--billing-project my-billing-project --chromosomes 22 --software glimpse2

6.2. Nextflow COMING VERY SOON