Genotype Imputation

Genotype imputation is the process of estimating missing genotypes using a haplotype or genotype reference panel. It allows you to accurately evaluate the evidence for association at genetic markers that were not directly genotyped. GWASpy has a module, imputation, for running imputation using IMPUTE5. Because imputation is computationally intensive, GWASpy runs it on multiple chunks in parallel and then merges the imputed chunks at the end. This is why the module is divided into two parts: (1) impute; (2) concat. Below are examples of how to run imputation.

A. Run imputation+concat in a single command

Command line

imputation --input-vcf gs://path/to/file.vcf.bgz --samples-file gs://path/to/female_samples.txt --out-dir gs://path/to/output/dir --billing-project project-name --stages impute,concat --n-samples integer_number_of_samples

Python (inside a Python script)

import gwaspy.imputation as impute

# Replace integer_number_of_samples with the total number of samples in your dataset
impute.imputation.genotype_imputation(input_vcfs='gs://path/to/file.vcf.bgz',
          females_file='gs://path/to/female_samples.txt',
          n_samples=integer_number_of_samples,
          n_panel_samples=4099,
          buffer_region=250,
          local=False,
          billing_project='project-name',
          memory='highmem',
          cpu=8,
          stages='impute,concat',  # run both stages in one call
          output_type='bcf',
          out_dir='gs://path/to/output/dir')

B. Run imputation and concat in separate commands

If you want to run impute or concat as separate steps, set the --stages (command line) or stages (Python script) argument to impute or concat. Note that the impute step must always be run before concat, because GWASpy uses the results of the impute stage for concat.
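The two-step workflow can be sketched as follows. This is a minimal illustration, not part of GWASpy: `build_imputation_cmd` is a hypothetical helper that only assembles the flags listed on this page and runs nothing. It assumes the concat step accepts the same arguments as the impute step, and the sample count of 1000 is a made-up placeholder.

```python
import shlex

VALID_STAGES = ("impute", "concat", "impute,concat")


def build_imputation_cmd(stage, input_vcf, samples_file, out_dir,
                         billing_project, n_samples):
    """Assemble (but do not run) one `imputation` command line.

    Hypothetical helper for illustration; it uses only flags documented here.
    """
    if stage not in VALID_STAGES:
        raise ValueError(f"stage must be one of {VALID_STAGES}, got {stage!r}")
    return [
        "imputation",
        "--input-vcf", input_vcf,
        "--samples-file", samples_file,
        "--out-dir", out_dir,
        "--billing-project", billing_project,
        "--n-samples", str(n_samples),
        "--stages", stage,
    ]


# Arguments shared by both steps (placeholder paths from the examples above)
common = dict(
    input_vcf="gs://path/to/file.vcf.bgz",
    samples_file="gs://path/to/female_samples.txt",
    out_dir="gs://path/to/output/dir",
    billing_project="project-name",
    n_samples=1000,  # assumption: replace with your real sample count
)

# Order matters: impute first, then concat
impute_cmd = build_imputation_cmd("impute", **common)
concat_cmd = build_imputation_cmd("concat", **common)

for cmd in (impute_cmd, concat_cmd):
    print(shlex.join(cmd))
```

Run the first printed command, wait for it to finish, and only then run the second, since concat depends on the impute results.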

Arguments and options

Argument	Description
`--input-vcf`	Path to the VCF with the target genotypes
`--samples-file`	Text file listing the FEMALE samples in the dataset, one sample ID per line. This is crucial for chromosome X imputation, as the data is split by sex
`--local`	Type of service. Set this flag to run locally. Default is the Service backend, where jobs are executed on a multi-tenant compute cluster in Google Cloud
`--billing-project`	Billing project to be used for the job(s)
`--memory`	Memory to use for imputation. Options: [`lowmem`, `standard`, `highmem`]. Default is `highmem`
`--cpu-concat`	CPU to use for the concatenation step. Default is 8
`--n-samples`	Total number of samples in your dataset. We use this to estimate some of the job resources like storage.
`--buffer-region`	Buffer region, in kb, to be used during imputation. This helps prevent imputation quality from deteriorating near the edges of the region. Default is 250 kb
`--stages`	Process to run. Default is `impute,concat`
`--out-type`	Output type. Options: [`bcf`, `vcf`]. Default is `bcf`, which is highly recommended since BCFs are generally more efficient to work with and take up less space
`--out-dir`	Path to where output files will be saved
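To illustrate what the buffer region does, here is a minimal sketch of splitting a chromosome into chunks and padding each chunk with a buffer on both sides. It is not GWASpy's or IMPUTE5's actual chunking code: the function `chunk_regions`, the 5 Mb chunk size, and the 12 Mb chromosome length are hypothetical values chosen for the example. Only the 250 kb default comes from the table above.

```python
def chunk_regions(chrom_length, chunk_size, buffer_kb=250):
    """Split positions 1..chrom_length into chunks and pad each with a buffer.

    Returns a list of (start, end, buffered_start, buffered_end) tuples,
    all 1-based and inclusive. The buffer is clipped at the chromosome ends.
    """
    buffer_bp = buffer_kb * 1000
    regions = []
    for start in range(1, chrom_length + 1, chunk_size):
        end = min(start + chunk_size - 1, chrom_length)
        buffered_start = max(1, start - buffer_bp)
        buffered_end = min(chrom_length, end + buffer_bp)
        regions.append((start, end, buffered_start, buffered_end))
    return regions


# Hypothetical 12 Mb chromosome split into 5 Mb chunks
regions = chunk_regions(chrom_length=12_000_000, chunk_size=5_000_000)
for start, end, b_start, b_end in regions:
    print(f"impute {start}-{end} using data from {b_start}-{b_end}")
```

Each chunk is imputed using markers from the wider buffered window, so positions near a chunk boundary still have flanking data on both sides.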

Output

The resulting output is one file of imputed genotypes per chromosome, in BCF (the default) or VCF format depending on `--out-type`.

Note

Concatenating the imputed BCF chunks for each chromosome is slower when the output is a VCF than when it is a BCF. File sizes may also differ significantly between BCF and VCF.