Genotype Imputation

Genotype imputation is the process of estimating missing genotypes using a haplotype or genotype reference panel. It lets you evaluate the evidence for association at genetic markers that were not directly genotyped. GWASpy has a module, imputation, for running imputation using IMPUTE5 (the default) or Beagle 5 (see the --software option). Because imputation is computationally intensive, GWASpy runs it on multiple chunks in parallel and then merges the imputed chunks at the end. Below are examples of how to run imputation using either the HGDP+1kGP reference panel or your own.
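The chunk-then-merge idea can be illustrated with a small shell sketch. The chromosome name, chromosome length, and chunk size below are hypothetical values chosen for illustration and are not what GWASpy uses internally. The sketch only prints the regions that would be imputed independently; it does not run any imputation.

```shell
# Illustrative only: split one chromosome into fixed-size regions.
# chrom, chrom_length and chunk_size are hypothetical, not GWASpy defaults.
chrom="chr20"
chrom_length=64000000
chunk_size=20000000

make_chunks() {
  # $1 = chromosome name, $2 = chromosome length, $3 = chunk size
  start=1
  while [ "$start" -le "$2" ]; do
    end=$(( start + $3 - 1 ))
    # Clip the final chunk to the end of the chromosome.
    if [ "$end" -gt "$2" ]; then
      end="$2"
    fi
    echo "$1:${start}-${end}"
    start=$(( end + 1 ))
  done
}

chunks="$(make_chunks "$chrom" "$chrom_length" "$chunk_size")"
echo "$chunks"
```

Each printed region could be handled by a separate job, and the per-region results concatenated in order afterwards.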

Examples

1. HGDP+1kGP reference panel

imputation --input-file gs://path/to/file.vcf.bgz --vcf-ref hgdp1kgp --output-filename my_outfilename --out-dir gs://path/to/output/dir --n-samples 1989 --n-ref-samples 4091 --billing-project my-billing-project

2. Own reference panel

imputation --input-file gs://path/to/file.vcf.bgz --vcf-ref gs://path/to/ref_panel/ALL.chrCNUMBER.vcf --output-filename my_outfilename --out-dir gs://path/to/output/dir --n-samples 1989 --n-ref-samples 4091 --billing-project my-billing-project

Warning

When using your own reference panel, make sure that the filename passed to --vcf-ref contains the CNUMBER placeholder in place of the chromosome number, as in ALL.chrCNUMBER.vcf above.
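To see what the placeholder stands for, the sketch below expands a template path into one path per chromosome. It assumes that CNUMBER is replaced by the bare chromosome number and that chromosomes 1 to 22 are of interest. The path is the placeholder path from the example above, and expand_ref is a hypothetical helper, not part of GWASpy.

```shell
# Hypothetical helper: substitute a chromosome number for CNUMBER.
# $1 = template path, $2 = chromosome number
expand_ref() {
  printf '%s\n' "$1" | sed "s/CNUMBER/$2/g"
}

ref_template="gs://path/to/ref_panel/ALL.chrCNUMBER.vcf"

# Assumption: autosomes 1-22; adjust to the chromosomes in your panel.
for chrom in $(seq 1 22); do
  expand_ref "$ref_template" "$chrom"
done
```

Comparing the printed paths against the files actually present in your bucket is a quick way to catch naming mismatches before submitting a run.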

Arguments and options

Argument	Description
`--input-file`	Path to the target VCF, or to a TSV listing the target VCF/BAM files
`--vcf-ref`	Reference panel file to use for imputation
`--chromosomes`	Chromosome(s) to run imputation for. Default is `all`
`--local`	Run jobs locally instead of on the Service backend. By default, the Service backend is used, where jobs are executed on a multi-tenant compute cluster in Google Cloud
`--billing-project`	Billing project to be used for the jobs
`--n-samples`	Number of target samples to be imputed. We use this to estimate resources for some of the jobs
`--n-ref-samples`	Number of reference samples. We use this to estimate resources for some of the jobs
`--software`	Software to use for imputation. Options: [`beagle5`, `impute5`]. Default is `impute5`
`--output-filename`	Output filename without file extension
`--out-dir`	Path to where output files will be saved
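
As a sketch of how the optional --software flag combines with the required arguments, the snippet below assembles an imputation command and prints it for review without executing it. All paths, sample counts, and the billing project are the placeholder values from the examples above, and the variable names are hypothetical.

```shell
# Placeholder values taken from the examples above; replace with your own.
input_file="gs://path/to/file.vcf.bgz"
out_dir="gs://path/to/output/dir"
software="beagle5"

# Reject values outside the documented options for --software.
case "$software" in
  beagle5|impute5) ;;
  *) echo "unsupported --software value: $software" >&2; exit 1 ;;
esac

cmd="imputation --input-file ${input_file} --vcf-ref hgdp1kgp \
--output-filename my_outfilename --out-dir ${out_dir} \
--n-samples 1989 --n-ref-samples 4091 \
--billing-project my-billing-project --software ${software}"

# Print the command for review; it is not executed here.
echo "$cmd"
```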

Output

The resulting output is a VCF file per chromosome with imputed genotypes.