Haplotype Phasing

Knowing the phase of a haplotype can allow us to impute low frequency variants, this makes haplotype phasing an important step before genotype imputation. GWASpy has a module, phasing, for performing phasing. Phasing can be run with or without a reference panel using SHAPEIT5

GWASpy can handle both array and WGS data. For array data, the user can pass a VCF/BCF file with all the chromosomes, then GWASpy will use SHAPEIT5 to phase the chromosomes in parallel. Since WGS has more variants, phasing will be parallelized across multiple chunks in each chromosome. It’s also important to note that phasing of WGS data includes phasing common variants first, followed by phasing rare variants.

Another important aspect of phasing is the use of a reference panel. In many cases (small sample size), including a reference panel when phasing improves accuracy. By default, GWASpy runs phasing without a reference panel, but there is an option to use a reference panel as shown below.

Examples

1. Without a reference panel

phasing --input-vcf gs://path/to/file.vcf.bgz --output-filename outfilename.phased --out-dir gs://path/to/output/dir --genome-build GRCh38 --billing-project my-billing-project

2. HGDP+1KG reference panel

Set --vcf-ref to hgdp1kgp

phasing --input-vcf gs://path/to/file.vcf.bgz --output-filename my_outfilename --out-dir gs://path/to/output/dir --genome-build GRCh38 --billing-project my-billing-project --vcf-ref hgdp1kgp

3. Own reference panel

Note

  1. If you’re using your own reference panel, make sure the files are bgzip compressed.

  2. Chromosome X reference file must be named X and not 23

Say you have your reference panel files for each chromosomes stored in gs://ref_panel/ALL.chr{1..22,X}.vcf, you would pass the path to --vcf-ref as gs://ref_panel/ALL.chrCNUMBER.vcf. GWASpy uses CNUMBER as a placeholder for the chromosomes. Then you can run phasing as:

phasing --input-vcf gs://path/to/file.vcf.bgz --output-filename outfilename.phased --out-dir gs://path/to/output/dir --genome-build GRCh38 --billing-project my-billing-project --vcf-ref gs://ref_panel/ALL.chrCNUMBER.vcf

Note

For nextflow users, the idea is the same. The only difference is you have to update the params.json file. Examples are provided in the tutorial section of the documentation

Arguments and options

Argument

Description

--input-vcf

Path to where VCF file to be phased is

--vcf-ref

VCF file for reference haplotypes if phasing with a reference panel

--pedigree

Pedigree (PLINK FAM) file

--local

Type of service. Default is Service backend where jobs are executed on a multi-tenant compute cluster in Google Cloud

--billing-project

Billing project to be used for the job(s)

--genome-build

Genome reference build. Default is GRCh38. Options: [GRCh37, GRCh38]

--data-type

Array or WGS data. Default is array. Options: [array, wgs].

--fill-tags

Whether or not to add AC tag required by SHAPEIT5. Including --fill-tags, in your command will enable this step

--software

Software to use for phasing. Options: [beagle, shapeit]. Default is shapeit

--output-filename

Output filename without file extension

--out-dir

Path to where output files will be saved

Output

The resulting output is a VCF file per chromosome with phased haplotypes.