.. _sec-qb: ==================== Hail Query and Batch ==================== The four GWASpy modules use two different backends: :code:`preimp_qc` and :code:`pca` use Hail Query, while :code:`phasing` and :code:`imputation` modules use Batch (Hail Batch for Broad users and nextflow for non-Broad users). Hail Query is well-suited for manipulating large genomics data in a highly parallelised environments such as Dataproc. `Batch `_, on the other hand, is good for batch processing (scheduling, queueing, and executing) workloads on Google Cloud resources. All the instructions below assume the user has a Google account and an active (Google) Cloud billing account Query ##### For running the :code:`preimp_qc` and :code:`pca` modules, you need to start a Dataproc cluster. Hail has a command-line tool, `hailctl `_, for doing this and it is installed automatically when you install Hail. We highly recommend setting a maximum age for the cluster (:code:`--max-age`), this will ensure the cluster is automatically deleted after the specified time. Below is how you can start a cluster with GWASpy pre-installed: .. code-block:: sh hailctl dataproc start my-cluster-name -region=us-central1 --packages gwaspy --max-age 4h To shut down the cluster, you can run: .. code-block:: sh hailctl dataproc stop my-cluster-name --region=us-central1 Batch ##### The :code:`phasing` and :code:`imputation` modules use Batch as the backend. For Broad users with a Hail Batch account, there is no setup needed, you can proceed to running the modules. For non-Broad users, we have a nextflow implementation of the modules that requires nextflow setup first. Follow the steps here to: `(1) install nextflow `_; and `(2) setup Google Cloud Batch for nextflow `_