Pre-Imputation Quality Control (QC)

Detecting and correcting issues such as genotyping errors, sample handling errors, and population stratification is important in GWAS. The preimp_qc module addresses these issues by applying a series of quality control (QC) filters to your data. Below is a flow diagram of the filters applied to the input data:

[Figure: QC workflow diagram (_images/qc_workflow.png)]

Examples

You can run pre-imputation QC using the preimp_qc module in two ways: (1) via the command line, or (2) inside a Python script.

  1. Command line

    preimp_qc --dirname data/ --basename sim_sim2a_eur_sa_merge.miss --input-type plink
    
  2. Inside a python script

    import gwaspy.preimp_qc as qc
    qc.preimp_qc.preimp_qc(input_type="plink", dirname="data/", basename="sim_sim2a_eur_sa_merge.miss")
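
When the command-line form needs to be launched from a larger Python workflow, the command can be assembled programmatically. The sketch below is only an illustration: `build_preimp_qc_cmd` is a hypothetical helper (not part of GWASpy), the output directory `qc_out/` is an assumed path, and only flags listed under "Arguments and options" are used. It builds and prints the command without running it.

```python
import shlex


def build_preimp_qc_cmd(dirname, basename, input_type, **options):
    """Assemble a preimp_qc command line as a list of arguments.

    Hypothetical helper, not part of GWASpy. A keyword such as
    `out_dir` is turned into the documented flag `--out-dir`.
    """
    cmd = [
        "preimp_qc",
        "--dirname", dirname,
        "--basename", basename,
        "--input-type", input_type,
    ]
    for key, value in options.items():
        cmd += ["--" + key.replace("_", "-"), str(value)]
    return cmd


cmd = build_preimp_qc_cmd(
    "data/",
    "sim_sim2a_eur_sa_merge.miss",
    "plink",
    export_type="vcf",  # documented options: hail, plink, vcf
    out_dir="qc_out/",  # assumed output location, for illustration only
)

# Show the command as it would be typed in a shell.
print(shlex.join(cmd))

# To actually run it (requires GWASpy to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list, rather than one formatted string, avoids shell quoting problems if a path contains spaces.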
    

Arguments and options

Argument        Description

--dirname       Path to the directory containing the input data
--basename      Basename of the input data
--input-type    Input type. Options: [hail, plink, vcf]
--export-type   Export type. Options: [hail, plink, vcf]
--out-dir       Path to the directory where output files will be saved
--annotations   Annotations file used to annotate samples with information such as Sex and Phenotype
--reference     Reference genome build. Default is GRCh38. Options: [GRCh37, GRCh38]
--report        Whether to generate a PDF QC report. Default is True
--pre-geno      Include only SNPs with missing-rate < NUM (before ID filter); important after merging multiple platforms
--mind          Include only IDs with missing-rate < NUM
--fhet-aut      Include only IDs within NUM < FHET < NUM
--fstat-y       Include only female IDs with fhet < NUM
--fstat-x       Include only male IDs with fhet > NUM
--geno          Include only SNPs with missing-rate < NUM
--midi          Include only SNPs with missing-rate-difference (case/control) < NUM
--withpna       Include monomorphic (invariant) SNPs
--maf           Include only SNPs with MAF >= NUM
--hwe-th-con    Hardy-Weinberg equilibrium (HWE) threshold for controls: HWE_controls < NUM
--hwe-th-cas    Hardy-Weinberg equilibrium (HWE) threshold for cases: HWE_cases < NUM
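
To make the missing-rate and MAF thresholds more concrete, here is a minimal pure-Python sketch of how such filters could be applied to a toy genotype matrix. It is not GWASpy's implementation: the function names, the toy data, and the threshold values are all made up for illustration, and the filter order (pre-geno, mind, geno, maf) simply follows the order of the arguments listed above.

```python
# Illustrative sketch only, NOT GWASpy's implementation.
# Genotypes are alt-allele counts (0, 1, 2); None marks a missing call.
# Rows are SNPs, columns are samples (IDs).


def missing_rate(calls):
    """Fraction of missing calls; an empty sequence counts as fully missing."""
    if not calls:
        return 1.0
    return sum(c is None for c in calls) / len(calls)


def minor_allele_freq(calls):
    """Minor allele frequency over the non-missing diploid calls."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1.0 - alt_freq)


def run_toy_qc(genotypes, pre_geno, mind, geno, maf):
    """Return (kept SNP indices, kept sample indices) after four filters."""
    samples = list(range(len(genotypes[0])))
    # 1. Like --pre-geno: SNP missing-rate < NUM, before the ID filter.
    snps = [i for i, row in enumerate(genotypes)
            if missing_rate(row) < pre_geno]
    # 2. Like --mind: sample missing-rate < NUM over the remaining SNPs.
    samples = [j for j in samples
               if missing_rate([genotypes[i][j] for i in snps]) < mind]
    # 3. Like --geno: SNP missing-rate < NUM over the remaining samples.
    snps = [i for i in snps
            if missing_rate([genotypes[i][j] for j in samples]) < geno]
    # 4. Like --maf: MAF >= NUM over the remaining samples.
    snps = [i for i in snps
            if minor_allele_freq([genotypes[i][j] for j in samples]) >= maf]
    return snps, samples


toy_genotypes = [
    [0, 1, 2, 0, 1],           # SNP 0: complete, polymorphic
    [None, None, None, 1, 0],  # SNP 1: 60% missing
    [0, 0, 0, 0, 0],           # SNP 2: monomorphic
    [1, 0, 1, None, 2],        # SNP 3: one missing call
    [0, 1, 0, None, 0],        # SNP 4: one missing call, rare allele
]

# Toy thresholds chosen for this tiny matrix; they are not GWASpy defaults.
kept_snps, kept_samples = run_toy_qc(
    toy_genotypes, pre_geno=0.5, mind=0.4, geno=0.3, maf=0.05
)
print("SNPs kept:", kept_snps)
print("Samples kept:", kept_samples)
```

In this toy run, SNP 1 is dropped by the first missing-rate filter, sample 3 is dropped by the sample missing-rate filter, and the monomorphic SNP 2 is dropped by the MAF filter.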

Output(s)

  • QC’ed file(s), i.e. the input data with all variants and/or samples that fail the QC filters removed

  • A detailed PDF QC report including pre- and post-QC variant/sample counts and figures such as Manhattan and QQ plots