Hail Query and Batch
The four GWASpy modules use two different backends: preimp_qc and pca use Hail Query, while
phasing and imputation modules use Batch (Hail Batch for Broad users and nextflow for non-Broad users).
Hail Query is well suited to manipulating large genomic datasets in a highly parallelised environment such as Dataproc.
Batch, on the other hand, is good for batch processing (scheduling,
queueing, and executing) workloads on Google Cloud resources.
All the instructions below assume the user has a Google account and an active (Google) Cloud billing account.
Query
For running the preimp_qc and pca modules, you need to start a Dataproc cluster. Hail has a command-line
tool, hailctl, for doing this and it is installed automatically when
you install Hail. We highly recommend setting a maximum age for the cluster (--max-age); this ensures the cluster is
automatically deleted after the specified time, even if you forget to stop it.
Below is how you can start a cluster with GWASpy pre-installed:
hailctl dataproc start my-cluster-name --region=us-central1 --packages gwaspy --max-age 4h
To shut down the cluster, you can run:
hailctl dataproc stop my-cluster-name --region=us-central1
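If you start and stop clusters from a script, the two commands above can be assembled in Python. The following is only a minimal sketch: the helper names (dataproc_start_cmd, dataproc_stop_cmd, run) are hypothetical and not part of GWASpy or Hail, and the default region and maximum age simply mirror the example values used above. By default it only prints the commands (a dry run) and does not call hailctl.

```python
import shlex
import subprocess


def dataproc_start_cmd(cluster, region="us-central1", max_age="4h"):
    """Build the hailctl command that starts a cluster with GWASpy installed.

    Hypothetical helper; the defaults mirror the example values in this guide.
    """
    return [
        "hailctl", "dataproc", "start", cluster,
        f"--region={region}",
        "--packages", "gwaspy",
        "--max-age", max_age,
    ]


def dataproc_stop_cmd(cluster, region="us-central1"):
    """Build the hailctl command that shuts the cluster down."""
    return ["hailctl", "dataproc", "stop", cluster, f"--region={region}"]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        # Requires hailctl on PATH and an authenticated Google Cloud setup.
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(dataproc_start_cmd("my-cluster-name"))
    run(dataproc_stop_cmd("my-cluster-name"))
```

Passing dry_run=False to run would actually execute the command, which assumes hailctl is installed and your Google Cloud credentials are configured.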
Batch
The phasing and imputation modules use Batch as the backend. For Broad users with a Hail Batch account,
no setup is needed; you can proceed straight to running the modules. For non-Broad users, we have a nextflow implementation
of the modules, which requires nextflow to be set up first. Follow the steps here to: (1) install nextflow; and
(2) set up Google Cloud Batch for nextflow.
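After completing the setup, non-Broad users may want a quick check that the required command-line tools are visible before launching the phasing or imputation modules. The snippet below is a sketch under assumptions: missing_tools is a hypothetical helper, and checking for gcloud assumes the Google Cloud CLI is part of your setup, which this guide does not state explicitly.

```python
import shutil


def missing_tools(tools=("nextflow", "gcloud")):
    """Return the names in `tools` that cannot be found on PATH.

    The default tool list is an assumption; adjust it to match your setup.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH")
```

This only confirms that the executables can be found; it does not verify that Google Cloud Batch has been configured for nextflow.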