Haplotype Phasing

Knowing the phase of a haplotype can allow us to impute low frequency variants, this makes haplotype phasing an important step before genotype imputation. GWASpy has a module, phasing, for performing phasing. Phasing can be run with or without a reference panel using either Eagle2 or SHAPEIT4

In GWASpy, the phasing is divided into 3 parts: (1) scatter; (2) phase; (3) concat. We first split the input file into multiple smaller chunks with overlapping windows between consecutive windows, run phasing on each chunk, then concatenate (join) the phased chunks. Running things this way rather than phasing entire chromosomes speeds up the time it takes to run phasing since we can parallelize the phasing of each chunk. All 3 steps can be run in a single command (i.e. as a pipeline, DEFAULT & RECOMMENDED) or in separate commands either directly from the command line or inside a Python script.

A. Run scatter, phase, and concat in a single command

You can run the scatter, phase, and concat steps sequentially in a single command. This is the default behaviour in GWASpy. Below are examples

Command line

phasing --input-vcf gs://path/to/file.vcf.bgz --out-dir gs://path/to/output/dir --reference GRCh38 --billing-project billing-project

Python (inside a Python script)

import gwaspy.phasing as phase
phase.phasing.haplotype_phasing(input_vcf = 'gs://path/to/file.vcf.bgz',
          vcf_ref = None,
          local: bool = False,
          billing_project = 'billing-project',
          software = 'shapeit',
          reference= 'GRCh38',
          max_win_size_cm: float = 10.0,
          overlap_size_cm: float = 2.0,
          scatter_memory: int = 26,
          cpu: int = 4,
          threads: int = 3,
          stages: str = 'scatter,phase,concat',
          output_type: str = 'bcf',
          out_dir = 'gs://path/to/output/dir')

B. Run scatter, phasing, and concat steps in separate commands.

If for whatever reasons you’d like to run the scatter, phase, and concat steps separately, you can make use of the --stages (command-line) and stages (Python script) arguments to specify which stage [scatter, phase, concat] you want to run. It’s important to note that even though you can run things this way, phase is dependent on results from scatter and concat on results from phase i.e. you cannot run phasing without having ran scatter prior.

C. Reference panels

In some cases, including a reference panel when phasing might improve accuracy. By default, GWASpy runs phasing without a reference panel. If the user wants to use a reference panel, there are two options

Note

If the data set you wish to phase contains more than twice as many samples as the largest reference panel available to you, then using a reference panel is unlikely to give much of a boost in phasing accuracy.

C1. HGDP+1KG dataset

phasing --input-vcf gs://path/to/file.vcf.bgz --out-dir gs://path/to/output/dir --reference GRCh38 --billing-project billing-project --vcf-ref hgdp_1kg

C2. Own reference panel

Say you have your reference panel files by chromosomes stored in gs://ref_panel/ALL.chr{1..22,X}.vcf, you would pass the path to --vcf-ref as gs://ref_panel/ALL.chrCNUMBER.vcf, GWASpy uses CNUMBER as a placeholder for the chromosomes. Then you can run phasing as:

phasing --input-vcf gs://path/to/file.vcf.bgz --out-dir gs://path/to/output/dir --reference GRCh38 --billing-project project-name --vcf-ref gs://ref_panel/ALL.chrCNUMBER.vcf

Note

If you’re using your own reference panel, make sure the files are bgzip compressed.
Chromosome X reference file must be name X and not 23

Arguments and options

Argument	Description
`--input-vcf`	Path to where VCF file to be phased is
`--vcf-ref`	VCF file for reference haplotypes if phasing with a reference panel
`--local`	Type of service. Default is Service backend where jobs are executed on a multi-tenant compute cluster in Google Cloud
`--billing-project`	Billing project to be used for the job(s)
`--software`	Software to use for phasing. Options: [`eagle`, `shapeit`]. Default is `eagle`
`--reference`	Genome reference build. Default is GRCh38. Options: [`GRCh37`, `GRCh38`]
`--max-win-size-cm`	Maximum window size to use when chunking the input file. Default is 10.0
`--overlap-size-cm`	Size of overlap between consecutive overlapping windows. Default is 2.0
`--cpu`	Number of CPUs to use in phasing. Default is 4. [TO BE CHANGED]
`--scatter-mem`	Memory to use for scattering input into chunks before phasing. [TO BE CHANGED]
`--threads`	Number of threads to use in phasing. Default is 3. [TO BE CHANGED]
`--stages`	Process(es) to run. Default is `scatter,phase,concat`
`--out-type`	Output type. Options: [`bcf`, `vcf`]. Default is `bcf` [HIGHLY RECOMMENDED SINCE BCFs ARE GENERALLY FASTER TO WORK WITH AND TAKE UP LESS SPACE]
`--out-dir`	Path to where output files will be saved

Output

For both Eagle and SHAPEIT, the resulting output is a VCF file per chromosome with phased haplotypes.

Note

By default, Eagle will output a VCF file with phased GT and other fields that were in the unphased VCF, whereas SHAPEIT will ONLY output the GT field. This will result in phased files generated using Eagle being bigger in size than those generate using SHAPEIT.