Haplotype Phasing
Knowing the phase of a haplotype allows us to impute low-frequency variants, which makes haplotype phasing an
important step before genotype imputation. GWASpy has a module, phasing, for performing phasing. Phasing can
be run with or without a reference panel using SHAPEIT5.
GWASpy can handle both array and WGS data. For array data, the user can pass a VCF/BCF file with all the chromosomes, and GWASpy will use SHAPEIT5 to phase the chromosomes in parallel. Since WGS data has more variants, phasing is parallelized across multiple chunks within each chromosome. Note also that WGS data is phased in two stages: common variants first, then rare variants.
Another important aspect of phasing is the use of a reference panel. In many cases, particularly when the sample size is small, including a reference panel when phasing improves accuracy. By default, GWASpy runs phasing without a reference panel, but there is an option to use a reference panel as shown below.
Examples
1. Without a reference panel
phasing --input-vcf gs://path/to/file.vcf.bgz --output-filename outfilename.phased --out-dir gs://path/to/output/dir --genome-build GRCh38 --billing-project my-billing-project
2. HGDP+1KG reference panel
Set --vcf-ref to hgdp1kgp
phasing --input-vcf gs://path/to/file.vcf.bgz --output-filename my_outfilename --out-dir gs://path/to/output/dir --genome-build GRCh38 --billing-project my-billing-project --vcf-ref hgdp1kgp
3. Own reference panel
Note
If you’re using your own reference panel, make sure the files are bgzip compressed.
The chromosome X reference file must be named X, not 23.
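One quick way to check the compression of a local copy of a reference file is to look at its first bytes. The sketch below assumes a POSIX shell with head and od available. The function name is_bgzf is hypothetical and not part of GWASpy. Files written by bgzip (BGZF format) begin with the bytes 1f 8b 08 04, while plain gzip output usually differs in the fourth byte. This is a heuristic check, not a full validation of the file.

```shell
# Hypothetical helper, not part of GWASpy: succeed if the file starts with
# the BGZF header bytes 1f 8b 08 04 (gzip magic, deflate, FEXTRA flag set).
is_bgzf() {
  # Read the first four bytes and print them as one hex string.
  magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$magic" = "1f8b0804" ]
}

# Example use on a local file (the path is a placeholder):
# is_bgzf ALL.chrX.vcf.bgz && echo "bgzip compressed" || echo "not bgzip compressed"
```

Because the check reads the file directly, it has to be run on a local copy rather than on a gs:// path.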
Say you have a reference panel file for each chromosome stored as gs://ref_panel/ALL.chr{1..22,X}.vcf.
You would then pass the path to --vcf-ref as gs://ref_panel/ALL.chrCNUMBER.vcf.
GWASpy uses CNUMBER as a placeholder for the chromosome name. You can then run phasing as:
phasing --input-vcf gs://path/to/file.vcf.bgz --output-filename outfilename.phased --out-dir gs://path/to/output/dir --genome-build GRCh38 --billing-project my-billing-project --vcf-ref gs://ref_panel/ALL.chrCNUMBER.vcf
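To see which files a CNUMBER template refers to, the placeholder can be expanded by hand. The following is a minimal sketch that illustrates the naming convention only. It is not GWASpy's own implementation, and the function name expand_ref_paths is hypothetical. It assumes chromosomes 1 to 22 plus X, as in the example above.

```shell
# Hypothetical helper, not part of GWASpy: print one reference path per
# chromosome (1..22, then X) for a template containing CNUMBER.
expand_ref_paths() {
  template=$1
  i=1
  while [ "$i" -le 22 ]; do
    # Substitute the placeholder with the autosome number.
    printf '%s\n' "$template" | sed "s/CNUMBER/$i/g"
    i=$((i + 1))
  done
  # Chromosome X must be named X, not 23.
  printf '%s\n' "$template" | sed "s/CNUMBER/X/g"
}

expand_ref_paths "gs://ref_panel/ALL.chrCNUMBER.vcf"
```

Comparing this list against the files actually present in the bucket is a simple way to catch a misnamed chromosome file before submitting a job.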
Note
For Nextflow users, the idea is the same. The only difference is that you have to update the params.json file. Examples are provided in the tutorial section of the documentation.
Arguments and options
| Argument | Description |
|---|---|
|  | Path to the VCF file to be phased |
|  | VCF file for reference haplotypes if phasing with a reference panel |
|  | Pedigree (PLINK FAM) file |
|  | Type of service. Default is Service backend where jobs are executed on a multi-tenant compute cluster in Google Cloud |
|  | Billing project to be used for the job(s) |
|  | Genome reference build. Default is GRCh38. Options: [ |
|  | Array or WGS data. Default is array. Options: [ |
|  | Whether or not to add AC tag required by SHAPEIT5. Including |
|  | Software to use for phasing. Options: [ |
|  | Output filename without file extension |
|  | Path to where output files will be saved |
Output
The resulting output is a VCF file per chromosome with phased haplotypes.