Pre-Imputation Quality Control (QC)

Detecting and correcting issues such as genotyping errors, sample handling errors, and population stratification is important in GWAS. The preimp_qc module addresses these issues and cleans (QCs) your data. Below is a flow diagram of the filters applied when QC’ing input data:

Arguments and options

Argument	Description
`--dirname`	Path to the directory containing the input data
`--basename`	Data basename
`--input-type`	Input type. Options: [`hail`, `plink`, `vcf`]
`--export-type`	Export type. Options: [`hail`, `plink`, `vcf`]
`--out-dir`	Path to the directory where output files will be saved
`--annotations`	Annotations file used to annotate samples with information such as Sex and Phenotype
`--reference`	Reference genome build. Default is GRCh38. Options: [`GRCh37`, `GRCh38`]
`--report`	Whether to generate a QC PDF report. Default is True
`--liftover`	Whether to lift over input data to GRCh38. Default is False; running `preimp_qc` with `--liftover` activates liftover
`--pre-geno`	include only SNPs with missing-rate < NUM (before ID filter), important after merging multiple platforms
`--mind`	include only IDs with missing-rate < NUM
`--fhet-aut`	include only IDs within NUM < FHET < NUM
`--fstat-y`	include only female IDs with fhet < NUM
`--fstat-x`	include only male IDs with fhet > NUM
`--geno`	include only SNPs with missing-rate < NUM
`--midi`	include only SNPs with missing-rate-difference (case/control) < NUM
`--withpna`	include monomorphic (invariant) SNPs
`--maf`	include only SNPs with MAF >= NUM
`--hwe-th-con`	HWE_controls < NUM
`--hwe-th-cas`	HWE_cases < NUM
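
To make the threshold semantics in the table concrete, here is a minimal sketch of how the `--geno`, `--maf`, and `--mind` rules could be evaluated on toy data. This is not GWASpy's implementation (which runs on Hail); the function names, the genotype encoding (alternate-allele counts 0/1/2, with `None` for a missing call), and the threshold values are all assumptions made for illustration.

```python
def missing_rate(calls):
    """Fraction of missing (None) calls in a list of genotype calls."""
    return sum(c is None for c in calls) / len(calls)


def minor_allele_frequency(calls):
    """MAF from alternate-allele counts (0, 1, 2), ignoring missing calls."""
    called = [c for c in calls if c is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def passes_variant_qc(calls, geno=0.02, maf=0.01):
    """Mirror '--geno' (missing-rate < NUM) and '--maf' (MAF >= NUM).

    The thresholds here are hypothetical, not the module's defaults.
    """
    return missing_rate(calls) < geno and minor_allele_frequency(calls) >= maf


def passes_sample_qc(calls, mind=0.02):
    """Mirror '--mind': keep a sample (ID) only if its missing-rate < NUM."""
    return missing_rate(calls) < mind


# Toy variants: each list holds one call per sample.
variants = {
    "snp_ok": [0, 1, 0, 2, 1, 0, 0, 1, 0, 0],
    "snp_high_missing": [0, None, 1, None, 0, 0, 1, 0, 0, 0],
    "snp_monomorphic": [0] * 10,
}

for name, calls in variants.items():
    print(name, "kept" if passes_variant_qc(calls) else "removed")
```

In this sketch a monomorphic SNP has a MAF of 0 and is therefore removed by the MAF rule, which is the behavior `--withpna` is described as overriding.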

Output(s)

QC’ed file(s), i.e. file(s) with all the variants and/or samples that fail QC filters removed
A detailed PDF QC report including pre- and post-QC variant/sample counts and figures such as Manhattan and QQ plots

Examples

All the code below assumes the user already has a Dataproc cluster running, as described in the previous section.

You can run pre-imputation QC using the preimp_qc module (1) inside a Python script or (2) via the command line.

Python script - submitting a Python script to a cluster from a local machine (Highly recommended)

First, create a Python script (here named qc_script.py) on your local machine as below:

import gwaspy.preimp_qc as qc
qc.preimp_qc.preimp_qc(dirname="gs://my-gcs/bucket/test_data/", basename="my_data_basename",
                       input_type="my_input_type")

Then run the following command to submit the script to the Dataproc cluster named my-cluster-name:
hailctl dataproc submit my-cluster-name qc_script.py

Command line - requires the user to be SSH’ed into a cluster

Users may encounter this error when trying to run things from the command line

This requires the user to be logged in to the Dataproc cluster (via gcloud compute ssh), with GWASpy already installed:

gcloud compute ssh "my-cluster-name-m"
preimp_qc --dirname gs://my-gcs/bucket/test_data/ --basename my_data_basename --input-type my_input_type
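
When several optional thresholds are combined, it can help to assemble the command line programmatically. The sketch below only builds and prints a command string using flags listed in the arguments table; the helper name `build_preimp_qc_command`, the threshold values, and the output path are hypothetical, and nothing is executed.

```python
import shlex


def build_preimp_qc_command(dirname, basename, input_type, **options):
    """Build a preimp_qc argument list (hypothetical helper, not part of GWASpy).

    Keyword options are mapped to flags by replacing '_' with '-'
    (e.g. out_dir -> --out-dir). A value of True emits the bare flag,
    as described for --liftover; a value of False omits the flag.
    """
    cmd = [
        "preimp_qc",
        "--dirname", dirname,
        "--basename", basename,
        "--input-type", input_type,
    ]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            cmd.append(flag)
        elif value is False:
            continue
        else:
            cmd += [flag, str(value)]
    return cmd


example = build_preimp_qc_command(
    "gs://my-gcs/bucket/test_data/",
    "my_data_basename",
    "plink",
    geno=0.02,      # hypothetical threshold
    maf=0.01,       # hypothetical threshold
    liftover=True,  # bare flag activates liftover
    out_dir="gs://my-gcs/bucket/qc_output/",  # hypothetical path
)
print(shlex.join(example))
```

The printed string can then be pasted into the SSH session on the cluster.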