.. _sec-tutorial: ======== Tutorial ======== This is a short tutorial on how to use the different modules of GWASpy. 1. Datasets ########### We will be using simulated test data (on GRCh37) from RICOPILI for most of the examples. Below is how it can be downloaded and copied to a Google bucket .. code-block:: sh wget https://personal.broadinstitute.org/sawasthi/share_links/UzoZK7Yfd7nTzIxHamCh1rSOiIOSdj_gwas-qcerrors.py/sim_sim1a_eur_sa_merge.miss.{bed,bim,fam} . gsutil cp sim_sim1a_eur_sa_merge.miss.{bed,bim,fam} gs://my-gcs/bucket/test_data For low-coverage genotype imputation using GLIMPSE, we will be using the 1X downsampled NA12878 file from the GLIMPSE tutorial. Below is how it can be downloaded and copied to a Google bucket .. code-block:: sh wget wget https://github.com/odelaneau/GLIMPSE/raw/refs/heads/master/tutorial/NA12878_1x_bam/NA12878.{bam,bam.bai} . gsutil cp NA12878.{bam,bam.bai} gs://my-gcs/bucket/test_data 2. Start a dataproc cluster with GWASpy installed ################################################# The code below will start a cluster with GWASpy automatically installed. This fetches the GWASpy version on PyPI. You can also install the GitHub version by replacing :code:`gwaspy` with :code:`git+https://github.com/atgu/GWASpy.git` .. code-block:: sh hailctl dataproc start gwaspy-tut --region=us-central1 --packages gwaspy --max-age 4h 3. Pre-imputation QC #################### Next, we will QC the data using the default arguments in :code:`preimp_qc`. Since the reference panel we will be using for phasing and imputation is on GRCh38, we also add a :code:`liftover` argument to liftover our input data to GRCh38. We also set :code:`export_type` to :code:`vcf` because we will use the QC'ed file as input to phasing and imputation. - First create a python script on your local machine as below .. code-block:: python import gwaspy.preimp_qc as qc qc.preimp_qc.preimp_qc(dirname="gs://my-gcs/bucket/test_data/", basename="sim_sim1a_eur_sa_merge.miss", input_type="plink", reference="GRCh37", liftover=True, export_type="vcf") - Then run the following command to submit the script to the Dataproc cluster named `gwaspy-tut` .. code-block:: sh hailctl dataproc submit gwaspy-tut qc_script.py 4. PCA ###### Using the QC'ed data from step 3 above, we will now run PCA. - First create a python script on your local machine as below .. code-block:: python import gwaspy.pca as pca pca.pca.pca(data_dirname="gs://my-gcs/bucket/test_data/GWASpy/Preimp_QC", data_basename="sim_sim1a_eur_sa_merge.miss_qced", out_dir="gs://my-gcs/bucket/test_data/", input_type="vcf", reference="GRCh38", pca_type="normal") - Then run the following command to submit the script to the Dataproc cluster named `gwaspy-tut` .. code-block:: sh hailctl dataproc submit gwaspy-tut pca_script.py If you have real data, you can use (1) :code:`pca_type="project"` which will train a Random Forest model on the HGDP+1kGP reference and use the trained model to classify samples in your data; or (2) :code:`pca_type="joint"` which will first find an intersection (variants) between the HGDP+1kGP reference and your input, use the intersected HGDP+1kGP to train a RF model, then classify your input data. If your data has a lot of variants (+million), :code:`pca_type="project"` usually gives plausible results. Otherwise you can try :code:`pca_type="joint"` .. note:: If you are a Broad user with Hail Batch access, you have to have python and GWASpy installed locally to be able to run phasing and imputation. For non-Broad users, we provide a nextflow implementation and the only thing you are required to do is have nextflow locally (nextflow executable file) and necessary permissions `as mentioned `_ 5. Phasing and Imputation ######################### **5.1 Hail Batch** 5.1.1. Phasing (should be ~$2 and takes ~40 minutes) The example below is for running phasing, without a reference panel. If you want to use the HGDP+1kGP reference panel or your own, simply add the :code:`--vcf-ref` argument `as explained here `_ .. code-block:: sh phasing --input-vcf gs://my-gcs/bucket/test_data/GWASpy/Preimp_QC/sim_sim1a_eur_sa_merge.miss_qced.vcf.bgz \ --output-filename sim_sim1a_eur_sa_merge.miss_qced.phased --out-dir gs://my-gcs/bucket/test_data/GWASpy/phasing \ --fill-tags --genome-build GRCh38 --billing-project my-billing-project 5.1.2. Imputation using IMPUTE5 (should be ~$4 and takes <20 minutes) The example below is for running phasing, without a reference panel. If you want to use the HGDP+1kGP reference panel or your own, simply add the :code:`--vcf-ref` argument `as explained here `_ .. code-block:: sh imputation --input-file gs://my-gcs/bucket/test_data/GWASpy/phasing/shapeit5/phase_common/sim_sim1a_eur_sa_merge.miss_qced.phased_chrCNUMBER.array.shapeit5_common.bcf \ --vcf-ref hgdp1kgp --output-filename sim_sim1a_eur_sa_merge.miss_qced.phased.imputed --out-dir gs://my-gcs/bucket/test_data/GWASpy/imputation \ --n-samples 1989 --n-ref-samples 4091 --billing-project my-billing-project .. note:: You may need to add :code:`HAIL_GENETICS_HAIL_IMAGE=hailgenetics/python-dill:3.9-slim` in front of the :code:`phasing` and :code:`imputation` commands if you are using a Python version other than 3.9, 3.10, or 3.11 **5.2. Nextflow** Before we run the nextflow pipeline, you have to first download the following files and copy them to your bucket: (1) common chunks and rare chunks files used to parallelize imputation across genomic regions; (2) genetic map files. SHAPEIT5 repo has `chunks files `_ and `genetic map files `_. Once you have the files on a Google bucket, you can update the :code:`params.json` file. Specifically, the things you need to update are: :code:`input_vcf`, :code:`output_filename`, :code:`out_dir`, :code:`data_type`, :code:`common_chunks`, :code:`rare_chunks`, :code:`genetic_maps`. If you have one input file per chromosome, set :code:`input_split_by_chrom` to :code:`true` .. code-block:: sh { "input_vcf": "gs://my-gcs/bucket/test_data/GWASpy/Preimp_QC/sim_sim1a_eur_sa_merge.miss_qced.vcf", "output_filename": "sim_sim1a_eur_sa_merge.miss_qced", "out_dir": "gs://my-gcs/bucket/test_data/GWASpy/nf_phase_impute", "impute": true, "fill_tags": true, "input_split_by_chrom": false, "vcf_ref": "gs://gcp-public-data--gnomad/resources/hgdp_1kg/phased_haplotypes_v2/hgdp1kgp_chrCNUMBER.filtered.SNV_INDEL.phased.shapeit5", "ref_format": "vcf", "data_type": "array", // or wgs "maf": 0.001, "common_chunks": "gs://my-gcs/bucket/chunks/b38/20cM/chunks_chrCNUMBER.txt", "rare_chunks": "gs://my-gcs/bucket/chunks/b38/4cM/chunks_chrCNUMBER.txt", "genetic_maps": "gs://my-gcs/bucket/maps/b38/chrCNUMBER.b38.gmap.gz" } Next thing to do is update the :code:`nextflow.config` file. The only things you need to change are :code:`workDir` and :code:`google.project`, and sometimes :code:`google.location` .. code-block:: sh workDir = 'gs://my-gcs/bucket/test_data/GWASpy/work' process { executor = 'google-batch' errorStrategy = { task.exitStatus==null ? 'retry' : 'terminate' } maxRetries = 3 } profiles { gbatch { google.project = 'my-batch-billing-project' google.location = 'us-central1' batch.spot = true } } Now you can easily run both phasing and imputation using the following command .. code-block:: sh ./nextflow run main.nf -c nextflow.config -profile gbatch -params-file params.json 6. Low-coverage WGS imputation using GLIMPSE ############################################ **6.1 Hail Batch** (should be ~$0.5 and takes <10 minutes) Unlike phasing using IMPUTE5, GLIMPSE takes BAM files as input, and since we usually have one BAM file per sample, the input to the imputation module when using GLIMPSE is a TSV file without a header and has two columns: first column with sample ID and second column with the actual path to the BAM file. Only one sample/BAM per row is allowed in the TSV. Below is an example of a file saved as :code:`gs://my-gcs/bucket/test_data/na12878_test.tsv` .. list-table:: :widths: 15 50 :header-rows: 0 * - NA12878 - gs://my-gcs/bucket/test_data/NA12878.bam Once you have saved the TSV to a bucket, you can run GLIMPSE phasing and imputation using the following command .. code-block:: sh imputation --input-file gs://my-gcs/bucket/test_data/na12878_test.tsv --vcf-ref hgdp1kgp \ --output-filename sim_sim1a_eur_sa_merge.miss_qced.phased.imputed \ --out-dir gs://my-gcs/bucket/test_data/GWASpy/lowcov_imputation --n-samples 1 --n-ref-samples 4091 \ --billing-project my-billing-project --chromosomes 22 --software glimpse2 **6.2. Nextflow** **COMING VERY SOON**