Hail Query and Batch

The four GWASpy modules use two different backends: preimp_qc and pca use Hail Query, while the phasing and imputation modules use Batch (Hail Batch for Broad users and nextflow for non-Broad users). Hail Query is well suited to manipulating large genomic datasets in a highly parallelised environment such as Dataproc. Batch, on the other hand, is suited to batch processing (scheduling, queueing, and executing workloads) on Google Cloud resources.

All the instructions below assume that you have a Google account and an active Google Cloud billing account.

Query

To run the preimp_qc and pca modules, you need to start a Dataproc cluster. Hail provides a command-line tool, hailctl, for this; it is installed automatically when you install Hail. We highly recommend setting a maximum age for the cluster (--max-age) so that the cluster is automatically deleted after the specified time.

Below is how you can start a cluster with GWASpy pre-installed:

hailctl dataproc start my-cluster-name --region=us-central1 --packages gwaspy --max-age 4h

To shut down the cluster, you can run:

hailctl dataproc stop my-cluster-name --region=us-central1
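
The start and stop commands can also be combined in a small wrapper script so that the cluster is stopped even if a step in between fails. The following is a minimal sketch, not part of GWASpy: the variable names (CLUSTER, REGION, MAX_AGE, DRY_RUN) and the run helper are illustrative assumptions, and by default the script only prints the hailctl commands instead of executing them.

```shell
#!/usr/bin/env bash
# Sketch: start a Dataproc cluster, do some work, and always stop the cluster.
# CLUSTER, REGION, MAX_AGE and DRY_RUN are illustrative names, not GWASpy settings.
set -eu

CLUSTER="${CLUSTER:-my-cluster-name}"
REGION="${REGION:-us-central1}"
MAX_AGE="${MAX_AGE:-4h}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = actually run them

# Print the command in dry-run mode, otherwise execute it
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

stop_cluster() {
  run hailctl dataproc stop "$CLUSTER" --region="$REGION"
}

# Stop the cluster when the script exits, even if a later step fails
trap stop_cluster EXIT

run hailctl dataproc start "$CLUSTER" --region="$REGION" --packages gwaspy --max-age "$MAX_AGE"

# ... run the preimp_qc or pca module here ...
```

Set DRY_RUN=0 to execute the commands for real. The --max-age flag remains a useful safety net in case the script itself is interrupted before the stop command runs.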

Batch

The phasing and imputation modules use Batch as the backend. For Broad users with a Hail Batch account, no setup is needed and you can proceed to running the modules. For non-Broad users, we provide a nextflow implementation of the modules, which requires setting up nextflow first. Follow the steps here to: (1) install nextflow; and (2) set up Google Cloud Batch for nextflow.
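
As a rough illustration of step (2), a nextflow configuration that selects Google Cloud Batch typically sets the executor, a Cloud Storage work directory, and the Google project and location. The sketch below writes such a file from the shell. The project ID, location, and bucket are placeholders, the variable names are illustrative, and the configuration files shipped with GWASpy may already set some of these values, so treat this as an assumption-laden example, not the required setup.

```shell
# Sketch: write a minimal nextflow.config that selects Google Cloud Batch.
# PROJECT_ID, LOCATION and WORK_BUCKET are placeholders; replace them with your own values.
# Note: this overwrites any existing file at CONFIG_FILE.
set -eu

PROJECT_ID="${PROJECT_ID:-my-project-id}"
LOCATION="${LOCATION:-us-central1}"
WORK_BUCKET="${WORK_BUCKET:-gs://my-bucket/nextflow-work}"
CONFIG_FILE="${CONFIG_FILE:-nextflow.config}"

cat > "$CONFIG_FILE" <<EOF
// Run every process on Google Cloud Batch
process.executor = 'google-batch'

// Intermediate files must live in a Cloud Storage bucket
workDir = '${WORK_BUCKET}'

google {
    project = '${PROJECT_ID}'
    location = '${LOCATION}'
}
EOF

echo "Wrote ${CONFIG_FILE}"
```

nextflow reads nextflow.config from the directory in which the pipeline is launched, so the file should be written there.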