Tutorial
This is a short tutorial on how to use the different modules of GWASpy.
1. Datasets
We will be using simulated test data (on GRCh37) from RICOPILI for most of the examples. Below is how it can be downloaded and copied to a Google bucket
wget https://personal.broadinstitute.org/sawasthi/share_links/UzoZK7Yfd7nTzIxHamCh1rSOiIOSdj_gwas-qcerrors.py/sim_sim1a_eur_sa_merge.miss.{bed,bim,fam} . gsutil cp sim_sim1a_eur_sa_merge.miss.{bed,bim,fam} gs://my-gcs/bucket/test_data
For low-coverage genotype imputation using GLIMPSE, we will be using the 1X downsampled NA12878 file from the GLIMPSE tutorial. Below is how it can be downloaded and copied to a Google bucket
wget wget https://github.com/odelaneau/GLIMPSE/raw/refs/heads/master/tutorial/NA12878_1x_bam/NA12878.{bam,bam.bai} . gsutil cp NA12878.{bam,bam.bai} gs://my-gcs/bucket/test_data
2. Start a dataproc cluster with GWASpy installed
The code below will start a cluster with GWASpy automatically installed. This fetches the GWASpy version on PyPI. You
can also install the GitHub version by replacing gwaspy with git+https://github.com/atgu/GWASpy.git
hailctl dataproc start gwaspy-tut --region=us-central1 --packages gwaspy --max-age 4h
3. Pre-imputation QC
Next, we will QC the data using the default arguments in preimp_qc. Since the reference panel we will be using
for phasing and imputation is on GRCh38, we also add a liftover argument to liftover our input data to GRCh38. We
also set export_type to vcf because we will use the QC’ed file as input to phasing and imputation.
First create a python script on your local machine as below
import gwaspy.preimp_qc as qc qc.preimp_qc.preimp_qc(dirname="gs://my-gcs/bucket/test_data/", basename="sim_sim1a_eur_sa_merge.miss", input_type="plink", reference="GRCh37", liftover=True, export_type="vcf")
Then run the following command to submit the script to the Dataproc cluster named gwaspy-tut
hailctl dataproc submit gwaspy-tut qc_script.py
4. PCA
Using the QC’ed data from step 3 above, we will now run PCA.
First create a python script on your local machine as below
import gwaspy.pca as pca pca.pca.pca(data_dirname="gs://my-gcs/bucket/test_data/GWASpy/Preimp_QC", data_basename="sim_sim1a_eur_sa_merge.miss_qced", out_dir="gs://my-gcs/bucket/test_data/", input_type="vcf", reference="GRCh38", pca_type="normal")
Then run the following command to submit the script to the Dataproc cluster named gwaspy-tut
hailctl dataproc submit gwaspy-tut pca_script.py
If you have real data, you can use (1) pca_type="project" which will train a Random Forest model on the HGDP+1kGP reference and
use the trained model to classify samples in your data; or (2) pca_type="joint" which will first find an intersection (variants) between
the HGDP+1kGP reference and your input, use the intersected HGDP+1kGP to train a RF model, then classify your input data. If
your data has a lot of variants (+million), pca_type="project" usually gives plausible results. Otherwise you can try pca_type="joint"
Note
If you are a Broad user with Hail Batch access, you have to have python and GWASpy installed locally to be able to run phasing and imputation. For non-Broad users, we provide a nextflow implementation and the only thing you are required to do is have nextflow locally (nextflow executable file) and necessary permissions as mentioned
5. Phasing and Imputation
5.1 Hail Batch
5.1.1. Phasing (should be ~$2 and takes ~40 minutes)
The example below is for running phasing, without a reference panel. If you want to use the HGDP+1kGP reference panel or
your own, simply add the --vcf-ref argument as explained here
phasing --input-vcf gs://my-gcs/bucket/test_data/GWASpy/Preimp_QC/sim_sim1a_eur_sa_merge.miss_qced.vcf.bgz \ --output-filename sim_sim1a_eur_sa_merge.miss_qced.phased --out-dir gs://my-gcs/bucket/test_data/GWASpy/phasing \ --fill-tags --genome-build GRCh38 --billing-project my-billing-project
5.1.2. Imputation using IMPUTE5 (should be ~$4 and takes <20 minutes)
The example below is for running phasing, without a reference panel. If you want to use the HGDP+1kGP reference panel or
your own, simply add the --vcf-ref argument as explained here
imputation --input-file gs://my-gcs/bucket/test_data/GWASpy/phasing/shapeit5/phase_common/sim_sim1a_eur_sa_merge.miss_qced.phased_chrCNUMBER.array.shapeit5_common.bcf \ --vcf-ref hgdp1kgp --output-filename sim_sim1a_eur_sa_merge.miss_qced.phased.imputed --out-dir gs://my-gcs/bucket/test_data/GWASpy/imputation \ --n-samples 1989 --n-ref-samples 4091 --billing-project my-billing-project
Note
You may need to add HAIL_GENETICS_HAIL_IMAGE=hailgenetics/python-dill:3.9-slim in front of the phasing
and imputation commands if you are using a Python version other than 3.9, 3.10, or 3.11
5.2. Nextflow
Before we run the nextflow pipeline, you have to first download the following files and copy them to your bucket: (1) common chunks and rare chunks files used to parallelize imputation across genomic regions; (2) genetic map files. SHAPEIT5 repo has chunks files and genetic map files.
Once you have the files on a Google bucket, you can update the params.json file. Specifically, the things you need
to update are: input_vcf, output_filename, out_dir, data_type, common_chunks,
rare_chunks, genetic_maps. If you have one input file per chromosome, set input_split_by_chrom to true
{ "input_vcf": "gs://my-gcs/bucket/test_data/GWASpy/Preimp_QC/sim_sim1a_eur_sa_merge.miss_qced.vcf", "output_filename": "sim_sim1a_eur_sa_merge.miss_qced", "out_dir": "gs://my-gcs/bucket/test_data/GWASpy/nf_phase_impute", "impute": true, "fill_tags": true, "input_split_by_chrom": false, "vcf_ref": "gs://gcp-public-data--gnomad/resources/hgdp_1kg/phased_haplotypes_v2/hgdp1kgp_chrCNUMBER.filtered.SNV_INDEL.phased.shapeit5", "ref_format": "vcf", "data_type": "array", // or wgs "maf": 0.001, "common_chunks": "gs://my-gcs/bucket/chunks/b38/20cM/chunks_chrCNUMBER.txt", "rare_chunks": "gs://my-gcs/bucket/chunks/b38/4cM/chunks_chrCNUMBER.txt", "genetic_maps": "gs://my-gcs/bucket/maps/b38/chrCNUMBER.b38.gmap.gz" }
Next thing to do is update the nextflow.config file. The only things you need to change are workDir and
google.project, and sometimes google.location
workDir = 'gs://my-gcs/bucket/test_data/GWASpy/work' process { executor = 'google-batch' errorStrategy = { task.exitStatus==null ? 'retry' : 'terminate' } maxRetries = 3 } profiles { gbatch { google.project = 'my-batch-billing-project' google.location = 'us-central1' batch.spot = true } }
Now you can easily run both phasing and imputation using the following command
./nextflow run main.nf -c nextflow.config -profile gbatch -params-file params.json
6. Low-coverage WGS imputation using GLIMPSE
6.1 Hail Batch (should be ~$0.5 and takes <10 minutes)
Unlike phasing using IMPUTE5, GLIMPSE takes BAM files as input, and since we usually have one BAM file per sample, the
input to the imputation module when using GLIMPSE is a TSV file without a header and has two columns: first column with
sample ID and second column with the actual path to the BAM file. Only one sample/BAM per row is allowed in the TSV.
Below is an example of a file saved as gs://my-gcs/bucket/test_data/na12878_test.tsv
NA12878 |
gs://my-gcs/bucket/test_data/NA12878.bam |
Once you have saved the TSV to a bucket, you can run GLIMPSE phasing and imputation using the following command
imputation --input-file gs://my-gcs/bucket/test_data/na12878_test.tsv --vcf-ref hgdp1kgp \ --output-filename sim_sim1a_eur_sa_merge.miss_qced.phased.imputed \ --out-dir gs://my-gcs/bucket/test_data/GWASpy/lowcov_imputation --n-samples 1 --n-ref-samples 4091 \ --billing-project my-billing-project --chromosomes 22 --software glimpse2
6.2. Nextflow COMING VERY SOON