Pre-Imputation Quality Control (QC)
Detecting and correcting issues such as genotyping errors, sample handling errors, and population stratification
is important in GWAS. The preimp_qc module addresses these issues and cleans (QCs) your data. Below is a flow diagram
of the filters applied when QC’ing input data:
Arguments and options
| Argument | Description |
|---|---|
| --dirname | Path to the directory containing the input data |
| --basename | Data basename |
| --input-type | Input type. Options: [ |
|  | Export type. Options: [ |
|  | Path to the directory where output files will be saved |
|  | Annotations file used to annotate samples with information such as Sex and Phenotype |
|  | Reference genome build. Default is GRCh38. Options: [ |
|  | Whether to generate a QC PDF report. Default is True |
|  | Whether to liftover input data to GRCh38. Default is False. Running |
|  | Include only SNPs with missing-rate < NUM (before ID filter); important for post-merge of multiple platforms |
|  | Include only IDs with missing-rate < NUM |
|  | Include only IDs within NUM < FHET < NUM |
|  | Include only female IDs with fhet < NUM |
|  | Include only male IDs with fhet > NUM |
|  | Include only SNPs with missing-rate < NUM |
|  | Include only SNPs with missing-rate-difference (case/control) < NUM |
|  | Include monomorphic (invariant) SNPs |
|  | Include only SNPs with MAF >= NUM |
|  | HWE_controls < NUM |
|  | HWE_cases < NUM |
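To make the threshold semantics in the table concrete, here is a minimal pure-Python sketch of the missing-rate, MAF and FHET filters applied to toy data. It is not the module's implementation: the genotype coding, the function names, and the threshold values are all assumptions chosen for illustration, and the thresholds are not necessarily the module's defaults.

```python
from typing import Optional, Sequence

# Genotype coding assumed for this sketch: number of alternate alleles
# (0, 1 or 2), with None marking a missing call.
Genotype = Optional[int]


def missing_rate(calls: Sequence[Genotype]) -> float:
    """Fraction of calls that are missing."""
    return sum(call is None for call in calls) / len(calls)


def minor_allele_frequency(calls: Sequence[Genotype]) -> float:
    """Minor allele frequency computed over the non-missing calls."""
    observed = [call for call in calls if call is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1.0 - alt_freq)


def variant_passes(calls: Sequence[Genotype], max_missing: float, min_maf: float) -> bool:
    """Keep a SNP only if missing-rate < max_missing and MAF >= min_maf."""
    return missing_rate(calls) < max_missing and minor_allele_frequency(calls) >= min_maf


def sample_passes(sample_missing: float, fhet: float, max_missing: float,
                  fhet_low: float, fhet_high: float) -> bool:
    """Keep an ID only if missing-rate < max_missing and fhet_low < FHET < fhet_high."""
    return sample_missing < max_missing and fhet_low < fhet < fhet_high


# Toy data: one list of calls per SNP (illustrative values only).
snps = {
    "snp_ok": [0, 1, 2, 1, 0, 1, 0, 0, 1, 2],
    "snp_sparse": [0, None, 1, None, 0, 1, 0, 0, 1, 2],
    "snp_monomorphic": [0] * 10,
}

# Illustrative thresholds, not necessarily the module's defaults.
kept = [name for name, calls in snps.items()
        if variant_passes(calls, max_missing=0.02, min_maf=0.01)]
print(kept)
```

In this toy data, snp_sparse is dropped because 2 of its 10 calls are missing, and snp_monomorphic is dropped because its MAF is 0. Note that the comparisons are strict for missing-rate and FHET but inclusive for MAF, matching the wording of the table.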
Output(s)
QC’ed file(s), i.e. the input data with all variants and/or samples that fail the QC filters removed
A detailed PDF QC report, including pre- and post-QC variant/sample counts and figures such as Manhattan and QQ plots
Examples
All the code below assumes the user already has a Dataproc cluster running, as described in the previous section.
You can run pre-imputation QC using the preimp_qc module (1) inside a Python script or (2) via the command line.
Python script - submitting a Python script to a cluster from your local machine (highly recommended)
First, create a Python script on your local machine (saved here as qc_script.py) as below:
    import gwaspy.preimp_qc as qc

    qc.preimp_qc.preimp_qc(dirname="gs://my-gcs/bucket/test_data/", basename="my_data_basename",
                           input_type="my_input_type")
Then run the following command to submit the script to the Dataproc cluster named my-cluster-name:
    hailctl dataproc submit my-cluster-name qc_script.py
Command line - requires the user to be SSH’ed into a cluster
Users may encounter this error when trying to run things from the command line
This requires the user to be logged in to the Dataproc cluster (via gcloud compute ssh), with GWASpy already installed on it:
    gcloud compute ssh "my-cluster-name-m"
    preimp_qc --dirname gs://my-gcs/bucket/test_data/ --basename my_data_basename --input-type my_input_type
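The command-line flags above mirror the keyword arguments of the Python call (input_type becomes --input-type). As a small sketch of that mapping, the helper below assembles the same command string from the three values. The function name build_preimp_qc_command is hypothetical and not part of GWASpy; only the three flags that appear in the example above are used.

```python
import shlex


def build_preimp_qc_command(dirname: str, basename: str, input_type: str) -> str:
    """Return a preimp_qc command line using the three flags shown above."""
    args = [
        "preimp_qc",
        "--dirname", dirname,
        "--basename", basename,
        "--input-type", input_type,
    ]
    # shlex.join quotes any argument that contains spaces or shell metacharacters.
    return shlex.join(args)


command = build_preimp_qc_command(
    dirname="gs://my-gcs/bucket/test_data/",
    basename="my_data_basename",
    input_type="my_input_type",
)
print(command)
```

Building the string with shlex.join rather than manual concatenation keeps paths or basenames with unusual characters from being split by the shell when the command is pasted into the SSH session.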