Genotype Imputation
Genotype imputation is the process of estimating missing genotypes using a haplotype or genotype reference panel. It
lets you evaluate the evidence for association at genetic markers that were not directly genotyped.
GWASpy has a module, imputation, for running imputation using IMPUTE5. Because imputation can be a computationally
intensive task, we run it on multiple chunks in parallel, then merge the imputed chunks together at the end. Below are
examples of how to run imputation using either the HGDP+1kGP or your own reference panel.
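The chunk-then-merge idea described above can be sketched in a few lines of Python. This is only an illustration and not GWASpy's implementation: the function names, the chunk size, and the padding (`buffer`) are all assumptions made for the example, and `impute_chunk` is a placeholder where a real pipeline would launch one imputation job per chunk.

```python
from concurrent.futures import ThreadPoolExecutor


def make_chunks(chrom_length, chunk_size, buffer):
    """Split positions 1..chrom_length into consecutive windows.

    Each window has a 'core' region (what is kept after merging) and a
    'padded' region extended by `buffer` bases on both sides, clipped to
    the chromosome. All sizes here are hypothetical.
    """
    chunks = []
    start = 1
    while start <= chrom_length:
        end = min(start + chunk_size - 1, chrom_length)
        padded = (max(1, start - buffer), min(chrom_length, end + buffer))
        chunks.append({"core": (start, end), "padded": padded})
        start = end + 1
    return chunks


def impute_chunk(chunk):
    # Placeholder for the real work (for example, one imputation job per chunk).
    # It returns only the core region so that the merge step can be shown.
    return chunk["core"]


def run_chromosome(chrom_length, chunk_size, buffer):
    chunks = make_chunks(chrom_length, chunk_size, buffer)
    with ThreadPoolExecutor() as pool:
        # map() preserves input order, so results come back sorted by position
        # and can be concatenated directly in the merge step.
        results = list(pool.map(impute_chunk, chunks))
    return results


regions = run_chromosome(chrom_length=1_000, chunk_size=400, buffer=50)
print(regions)
```

Because the core regions are contiguous and non-overlapping, merging amounts to concatenating the per-chunk results in order.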
Examples
1. HGDP+1kGP reference panel
imputation --input-file gs://path/to/file.vcf.bgz --vcf-ref hgdp1kgp --output-filename my_outfilename --out-dir gs://path/to/output/dir --n-samples 1989 --n-ref-samples 4091 --billing-project my-billing-project
2. Own reference panel
imputation --input-file gs://path/to/file.vcf.bgz --vcf-ref gs://path/to/ref_panel/ALL.chrCNUMBER.vcf --output-filename my_outfilename --out-dir gs://path/to/output/dir --n-samples 1989 --n-ref-samples 4091 --billing-project my-billing-project
Warning
When using your own reference panel, make sure that you use the CNUMBER placeholder in the filename passed to --vcf-ref (for example, ALL.chrCNUMBER.vcf).
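To make the role of the placeholder concrete, here is a minimal Python sketch. It assumes that CNUMBER stands in for the chromosome identifier, so that one template can point to a separate reference file per chromosome. The helper `expand_ref_paths` is hypothetical and written for this example; it is not part of GWASpy, and how GWASpy performs the substitution internally is not shown in this page.

```python
def expand_ref_paths(template, chromosomes, placeholder="CNUMBER"):
    """Return a {chromosome: path} mapping built from a path template.

    Raises ValueError when the template lacks the placeholder, which is
    the mistake the warning above is about.
    """
    if placeholder not in template:
        raise ValueError(
            f"Reference panel path must contain the {placeholder} placeholder"
        )
    return {c: template.replace(placeholder, str(c)) for c in chromosomes}


paths = expand_ref_paths(
    "gs://path/to/ref_panel/ALL.chrCNUMBER.vcf",
    chromosomes=[1, 2, "X"],
)
for chrom, path in paths.items():
    print(chrom, path)
```

Checking for the placeholder up front gives a clear error before any jobs are submitted, rather than a missing-file failure partway through a run.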
Arguments and options
| Argument | Description |
|---|---|
| --input-file | Path to the VCF or TSV with target VCF/BAM files |
| --vcf-ref | Reference panel file to use for imputation |
|  | Chromosome(s) to run imputation for. Default is |
|  | Type of service. Default is Service backend where jobs are executed on a multi-tenant compute cluster in Google Cloud |
| --billing-project | Billing project to be used for the jobs |
| --n-samples | Number of target samples to be imputed. We use this to estimate resources for some of the jobs |
| --n-ref-samples | Number of reference samples. We use this to estimate resources for some of the jobs |
|  | Software to use for phasing. Options: [ |
| --output-filename | Output filename without file extension |
| --out-dir | Path to where output files will be saved |
Output
The resulting output is a VCF file per chromosome with imputed genotypes.