Tutorial

This is a short tutorial on how to use the different modules of GWASpy.

1. Datasets

We will be using simulated test data (on GRCh37) from RICOPILI for most of the examples. Below is how it can be downloaded and copied to a Google bucket

wget https://personal.broadinstitute.org/sawasthi/share_links/UzoZK7Yfd7nTzIxHamCh1rSOiIOSdj_gwas-qcerrors.py/sim_sim1a_eur_sa_merge.miss.{bed,bim,fam} .
gsutil cp sim_sim1a_eur_sa_merge.miss.{bed,bim,fam} gs://my-gcs/bucket/test_data

For low-coverage genotype imputation using GLIMPSE, we will be using the 1X downsampled NA12878 file from the GLIMPSE tutorial. Below is how it can be downloaded and copied to a Google bucket

wget wget https://github.com/odelaneau/GLIMPSE/raw/refs/heads/master/tutorial/NA12878_1x_bam/NA12878.{bam,bam.bai} .
gsutil cp NA12878.{bam,bam.bai} gs://my-gcs/bucket/test_data

2. Start a dataproc cluster with GWASpy installed

The code below will start a cluster with GWASpy automatically installed. This fetches the GWASpy version on PyPI. You can also install the GitHub version by replacing gwaspy with git+https://github.com/atgu/GWASpy.git

hailctl dataproc start gwaspy-tut --region=us-central1 --packages gwaspy --max-age 4h

3. Pre-imputation QC

Next, we will QC the data using the default arguments in preimp_qc. Since the reference panel we will be using for phasing and imputation is on GRCh38, we also add a liftover argument to liftover our input data to GRCh38. We also set export_type to vcf because we will use the QC’ed file as input to phasing and imputation.

  • First create a python script on your local machine as below

    import gwaspy.preimp_qc as qc
    qc.preimp_qc.preimp_qc(dirname="gs://my-gcs/bucket/test_data/", basename="sim_sim1a_eur_sa_merge.miss",
                           input_type="plink", reference="GRCh37", liftover=True, export_type="vcf")
    
  • Then run the following command to submit the script to the Dataproc cluster named gwaspy-tut

    hailctl dataproc submit gwaspy-tut qc_script.py
    

4. PCA

Using the QC’ed data from step 3 above, we will now run PCA.

  • First create a python script on your local machine as below

    import gwaspy.pca as pca
    pca.pca.pca(data_dirname="gs://my-gcs/bucket/test_data/GWASpy/Preimp_QC",
                data_basename="sim_sim1a_eur_sa_merge.miss_qced", out_dir="gs://my-gcs/bucket/test_data/",
                input_type="vcf", reference="GRCh38", pca_type="normal")
    
  • Then run the following command to submit the script to the Dataproc cluster named gwaspy-tut

    hailctl dataproc submit gwaspy-tut pca_script.py
    

If you have real data, you can use (1) pca_type="project" which will train a Random Forest model on the HGDP+1kGP reference and use the trained model to classify samples in your data; or (2) pca_type="joint" which will first find an intersection (variants) between the HGDP+1kGP reference and your input, use the intersected HGDP+1kGP to train a RF model, then classify your input data. If your data has a lot of variants (+million), pca_type="project" usually gives plausible results. Otherwise you can try pca_type="joint"

Note

If you are a Broad user with Hail Batch access, you have to have python and GWASpy installed locally to be able to run phasing and imputation. For non-Broad users, we provide a nextflow implementation and the only thing you are required to do is have nextflow locally (nextflow executable file) and necessary permissions as mentioned

5. Phasing and Imputation

5.1 Hail Batch

5.1.1. Phasing (should be ~$2 and takes ~40 minutes)

The example below is for running phasing, without a reference panel. If you want to use the HGDP+1kGP reference panel or your own, simply add the --vcf-ref argument as explained here

phasing --input-vcf gs://my-gcs/bucket/test_data/GWASpy/Preimp_QC/sim_sim1a_eur_sa_merge.miss_qced.vcf.bgz \
--output-filename sim_sim1a_eur_sa_merge.miss_qced.phased --out-dir gs://my-gcs/bucket/test_data/GWASpy/phasing \
--fill-tags --genome-build GRCh38 --billing-project my-billing-project

5.1.2. Imputation using IMPUTE5 (should be ~$4 and takes <20 minutes)

The example below is for running phasing, without a reference panel. If you want to use the HGDP+1kGP reference panel or your own, simply add the --vcf-ref argument as explained here

imputation --input-file gs://my-gcs/bucket/test_data/GWASpy/phasing/shapeit5/phase_common/sim_sim1a_eur_sa_merge.miss_qced.phased_chrCNUMBER.array.shapeit5_common.bcf \
--vcf-ref hgdp1kgp --output-filename sim_sim1a_eur_sa_merge.miss_qced.phased.imputed --out-dir gs://my-gcs/bucket/test_data/GWASpy/imputation \
--n-samples 1989 --n-ref-samples 4091 --billing-project my-billing-project

Note

You may need to add HAIL_GENETICS_HAIL_IMAGE=hailgenetics/python-dill:3.9-slim in front of the phasing and imputation commands if you are using a Python version other than 3.9, 3.10, or 3.11

5.2. Nextflow

Before we run the nextflow pipeline, you have to first download the following files and copy them to your bucket: (1) common chunks and rare chunks files used to parallelize imputation across genomic regions; (2) genetic map files. SHAPEIT5 repo has chunks files and genetic map files.

Once you have the files on a Google bucket, you can update the params.json file. Specifically, the things you need to update are: input_vcf, output_filename, out_dir, data_type, common_chunks, rare_chunks, genetic_maps. If you have one input file per chromosome, set input_split_by_chrom to true

{
    "input_vcf": "gs://my-gcs/bucket/test_data/GWASpy/Preimp_QC/sim_sim1a_eur_sa_merge.miss_qced.vcf",
    "output_filename": "sim_sim1a_eur_sa_merge.miss_qced",
    "out_dir": "gs://my-gcs/bucket/test_data/GWASpy/nf_phase_impute",
    "impute": true,
    "fill_tags": true,
    "input_split_by_chrom": false,
    "vcf_ref": "gs://gcp-public-data--gnomad/resources/hgdp_1kg/phased_haplotypes_v2/hgdp1kgp_chrCNUMBER.filtered.SNV_INDEL.phased.shapeit5",
    "ref_format": "vcf",
    "data_type": "array", // or wgs
    "maf": 0.001,
    "common_chunks": "gs://my-gcs/bucket/chunks/b38/20cM/chunks_chrCNUMBER.txt",
    "rare_chunks": "gs://my-gcs/bucket/chunks/b38/4cM/chunks_chrCNUMBER.txt",
    "genetic_maps": "gs://my-gcs/bucket/maps/b38/chrCNUMBER.b38.gmap.gz"
}

Next thing to do is update the nextflow.config file. The only things you need to change are workDir and google.project, and sometimes google.location

workDir = 'gs://my-gcs/bucket/test_data/GWASpy/work'

process {
  executor = 'google-batch'
  errorStrategy = { task.exitStatus==null ? 'retry' : 'terminate' }
  maxRetries = 3
}

profiles {
    gbatch {
      google.project = 'my-batch-billing-project'
      google.location = 'us-central1'
      batch.spot = true
    }
}

Now you can easily run both phasing and imputation using the following command

./nextflow run main.nf -c nextflow.config -profile gbatch -params-file params.json

6. Low-coverage WGS imputation using GLIMPSE

6.1 Hail Batch (should be ~$0.5 and takes <10 minutes)

Unlike phasing using IMPUTE5, GLIMPSE takes BAM files as input, and since we usually have one BAM file per sample, the input to the imputation module when using GLIMPSE is a TSV file without a header and has two columns: first column with sample ID and second column with the actual path to the BAM file. Only one sample/BAM per row is allowed in the TSV. Below is an example of a file saved as gs://my-gcs/bucket/test_data/na12878_test.tsv

NA12878

gs://my-gcs/bucket/test_data/NA12878.bam

Once you have saved the TSV to a bucket, you can run GLIMPSE phasing and imputation using the following command

imputation --input-file gs://my-gcs/bucket/test_data/na12878_test.tsv --vcf-ref hgdp1kgp \
--output-filename sim_sim1a_eur_sa_merge.miss_qced.phased.imputed \
--out-dir gs://my-gcs/bucket/test_data/GWASpy/lowcov_imputation --n-samples 1 --n-ref-samples 4091 \
--billing-project my-billing-project --chromosomes 22 --software glimpse2

6.2. Nextflow COMING VERY SOON