Haplotype Phasing
Knowing the phase of a haplotype can allow us to impute low frequency variants, this makes haplotype phasing an
important step before genotype imputation. GWASpy has a module, phasing, for performing phasing. Phasing can
be run with or without a reference panel using either Eagle2 or SHAPEIT4
In GWASpy, the phasing is divided into 3 parts: (1) scatter; (2) phase; (3) concat. We
first split the input file into multiple smaller chunks with overlapping windows between consecutive windows, run
phasing on each chunk, then concatenate (join) the phased chunks. Running things this way rather than phasing entire
chromosomes speeds up the time it takes to run phasing since we can parallelize the phasing of each chunk. All 3 steps
can be run in a single command (i.e. as a pipeline, DEFAULT & RECOMMENDED) or in separate commands either directly from
the command line or inside a Python script.
A. Run scatter, phase, and concat in a single command
You can run the scatter, phase, and concat steps sequentially in a single command. This is the default behaviour in GWASpy. Below are examples
Command line
phasing --input-vcf gs://path/to/file.vcf.bgz --out-dir gs://path/to/output/dir --reference GRCh38 --billing-project billing-project
Python (inside a Python script)
import gwaspy.phasing as phase phase.phasing.haplotype_phasing(input_vcf = 'gs://path/to/file.vcf.bgz', vcf_ref = None, local: bool = False, billing_project = 'billing-project', software = 'shapeit', reference= 'GRCh38', max_win_size_cm: float = 10.0, overlap_size_cm: float = 2.0, scatter_memory: int = 26, cpu: int = 4, threads: int = 3, stages: str = 'scatter,phase,concat', output_type: str = 'bcf', out_dir = 'gs://path/to/output/dir')
B. Run scatter, phasing, and concat steps in separate commands.
If for whatever reasons you’d like to run the scatter, phase, and concat steps separately, you can make use of the
--stages (command-line) and stages (Python script) arguments to specify which stage
[scatter, phase, concat] you want to run. It’s important to note that even though you can run things this way, phase is dependent on results from scatter and concat on results
from phase i.e. you cannot run phasing without having ran scatter prior.
C. Reference panels
In some cases, including a reference panel when phasing might improve accuracy. By default, GWASpy runs phasing without a reference panel. If the user wants to use a reference panel, there are two options
Note
If the data set you wish to phase contains more than twice as many samples as the largest reference panel available to you, then using a reference panel is unlikely to give much of a boost in phasing accuracy.
C1. HGDP+1KG dataset
phasing --input-vcf gs://path/to/file.vcf.bgz --out-dir gs://path/to/output/dir --reference GRCh38 --billing-project billing-project --vcf-ref hgdp_1kg
C2. Own reference panel
Say you have your reference panel files by chromosomes stored in gs://ref_panel/ALL.chr{1..22,X}.vcf,
you would pass the path to --vcf-ref as gs://ref_panel/ALL.chrCNUMBER.vcf,
GWASpy uses CNUMBER as a placeholder for the chromosomes. Then you can run phasing as:
phasing --input-vcf gs://path/to/file.vcf.bgz --out-dir gs://path/to/output/dir --reference GRCh38 --billing-project project-name --vcf-ref gs://ref_panel/ALL.chrCNUMBER.vcf
Note
If you’re using your own reference panel, make sure the files are bgzip compressed.
Chromosome X reference file must be name X and not 23
Arguments and options
Argument |
Description |
|---|---|
|
Path to where VCF file to be phased is |
|
VCF file for reference haplotypes if phasing with a reference panel |
|
Type of service. Default is Service backend where jobs are executed on a multi-tenant compute cluster in Google Cloud |
|
Billing project to be used for the job(s) |
|
Software to use for phasing. Options: [ |
|
Genome reference build. Default is GRCh38. Options: [ |
|
Maximum window size to use when chunking the input file. Default is 10.0 |
|
Size of overlap between consecutive overlapping windows. Default is 2.0 |
|
Number of CPUs to use in phasing. Default is 4. [TO BE CHANGED] |
|
Memory to use for scattering input into chunks before phasing. [TO BE CHANGED] |
|
Number of threads to use in phasing. Default is 3. [TO BE CHANGED] |
|
Process(es) to run. Default is |
|
Output type. Options: [ |
|
Path to where output files will be saved |
Output
For both Eagle and SHAPEIT, the resulting output is a VCF file per chromosome with phased haplotypes.
Note
By default, Eagle will output a VCF file with phased GT and other fields that were in the unphased VCF, whereas SHAPEIT will ONLY output the GT field. This will result in phased files generated using Eagle being bigger in size than those generate using SHAPEIT.