Pre-Imputation Quality Control (QC)

Detecting and correcting issues such as genotyping errors, sample handling errors, and population stratification is important in GWAS. The preimp_qc module addresses these issues by applying a series of quality control (QC) filters to your data. The flow diagram below shows the filters applied to the input data:

[Figure: preimp_qc QC workflow diagram (_images/qc_workflow.png)]

Arguments and options

Argument         Description
--dirname        Path to the directory containing the input data
--basename       Basename of the input data files
--input-type     Input type. Options: [hail, plink, vcf]
--export-type    Export type. Options: [hail, plink, vcf]
--out-dir        Path to the directory where output files will be saved
--annotations    Annotations file used to annotate samples with information such as Sex and Phenotype
--reference      Reference genome build. Default is GRCh38. Options: [GRCh37, GRCh38]
--report         Whether to generate a QC PDF report. Default is True
--liftover       Whether to lift over the input data to GRCh38. Default is False; running preimp_qc with --liftover activates liftover
--pre-geno       Include only SNPs with missing-rate < NUM (before ID filter); important after a merge of multiple platforms
--mind           Include only IDs with missing-rate < NUM
--fhet-aut       Include only IDs within NUM < FHET < NUM
--fstat-y        Include only female IDs with fhet < NUM
--fstat-x        Include only male IDs with fhet > NUM
--geno           Include only SNPs with missing-rate < NUM
--midi           Include only SNPs with missing-rate-difference (case/control) < NUM
--withpna        Include monomorphic (invariant) SNPs
--maf            Include only SNPs with MAF >= NUM
--hwe-th-con     HWE_controls < NUM
--hwe-th-cas     HWE_cases < NUM
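
To make the thresholds above concrete, here is a minimal sketch in plain Python of what a missing-rate filter (as in --geno and --mind) and a MAF filter (as in --maf) compute on a toy genotype matrix. It is an illustration only and not GWASpy's implementation: the function names, the toy data, and the threshold values are all assumptions chosen for this example.

```python
# Toy genotype matrix: keys are SNPs, list positions are samples.
# Each entry is the alternate-allele count (0, 1, 2) or None if missing.
# All names, data, and thresholds here are hypothetical.
GENOTYPES = {
    "snp1": [0, 1, 2, 0],
    "snp2": [None, None, 1, 0],
    "snp3": [0, 0, 0, 0],
    "snp4": [2, 2, 1, None],
}


def missing_rate(calls):
    """Fraction of calls that are missing."""
    return sum(c is None for c in calls) / len(calls)


def minor_allele_frequency(calls):
    """MAF over non-missing diploid calls; 0.0 if every call is missing."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1 - alt_freq)


def filter_snps(genotypes, geno, maf):
    """Keep SNPs with missing rate < geno and MAF >= maf."""
    return [
        snp for snp, calls in genotypes.items()
        if missing_rate(calls) < geno and minor_allele_frequency(calls) >= maf
    ]


def filter_samples(genotypes, mind):
    """Keep sample indices whose missing rate across SNPs is < mind."""
    n_samples = len(next(iter(genotypes.values())))
    kept = []
    for i in range(n_samples):
        column = [calls[i] for calls in genotypes.values()]
        if missing_rate(column) < mind:
            kept.append(i)
    return kept


if __name__ == "__main__":
    print(filter_snps(GENOTYPES, geno=0.3, maf=0.01))
    print(filter_samples(GENOTYPES, mind=0.2))
```

In this toy data, snp2 is dropped for its high missing rate and snp3 is dropped because it is monomorphic (MAF of 0), which is the kind of variant that --withpna would retain.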

Output(s)

  • QC’ed file(s), i.e. the input data with all variants and/or samples that fail the QC filters removed

  • A detailed PDF QC report including pre- and post-QC variant/sample counts and figures such as Manhattan and QQ plots

Examples

All the code below assumes the user already has a Dataproc cluster running, as described in the previous section.

You can run pre-imputation QC using the preimp_qc module (1) inside a Python script or (2) via the command line.

  1. Python script: submit a Python script to the cluster from your local machine (highly recommended)

  • First, create a Python script (named qc_script.py here) on your local machine, as below

    import gwaspy.preimp_qc as qc
    qc.preimp_qc.preimp_qc(dirname="gs://my-gcs/bucket/test_data/", basename="my_data_basename",
                           input_type="my_input_type")
    
  • Then run the following command to submit the script to the Dataproc cluster named my-cluster-name

    hailctl dataproc submit my-cluster-name qc_script.py
    
  2. Command line: requires the user to SSH into the cluster

Users may encounter this error when trying to run things from the command line

  • This requires the user to be logged in to the Dataproc cluster (via gcloud compute ssh), with GWASpy already installed on it

    gcloud compute ssh "my-cluster-name-m"
    preimp_qc --dirname gs://my-gcs/bucket/test_data/ --basename my_data_basename --input-type my_input_type
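
When several of the optional thresholds are set at once, the command line gets long. The sketch below shows one way to assemble it from Python using only the flags listed in the table above. The helper name build_preimp_qc_command is hypothetical and not part of GWASpy, the threshold values are placeholders, and treating a True value as a bare flag is an assumption based on the description of --liftover.

```python
import shlex


def build_preimp_qc_command(dirname, basename, input_type, **options):
    """Assemble a preimp_qc command line (hypothetical helper, not part of GWASpy).

    Keyword names use underscores and are converted to the dashed flag
    spelling from the table (for example, pre_geno -> --pre-geno).
    A value of True emits the bare flag; False or None skips the flag.
    """
    argv = [
        "preimp_qc",
        "--dirname", dirname,
        "--basename", basename,
        "--input-type", input_type,
    ]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)          # switch-style flag such as --liftover
        elif value is False or value is None:
            continue                   # option not requested
        else:
            argv += [flag, str(value)]
    # shlex.join quotes any argument that would need it in a POSIX shell
    return shlex.join(argv)


if __name__ == "__main__":
    # Placeholder thresholds, chosen only for illustration
    print(build_preimp_qc_command(
        "gs://my-gcs/bucket/test_data/", "my_data_basename", "plink",
        maf=0.01, geno=0.02, liftover=True,
    ))
```

The printed string can then be pasted into the SSH session on the cluster. Building the argument list first and quoting it with shlex.join avoids quoting mistakes if a path contains spaces.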