Haplotype Phasing

Knowing the phase of a haplotype can allow us to impute low frequency variants, this makes haplotype phasing an important step before genotype imputation. GWASpy has a module, phasing, for performing phasing. Phasing can be run with or without a reference panel using either Eagle2 or SHAPEIT4

In GWASpy, the phasing is divided into 3 parts: (1) scatter; (2) phase; (3) concat. We first split the input file into multiple smaller chunks with overlapping windows between consecutive windows, run phasing on each chunk, then concatenate (join) the phased chunks. Running things this way rather than phasing entire chromosomes speeds up the time it takes to run phasing since we can parallelize the phasing of each chunk. All 3 steps can be run in a single command (i.e. as a pipeline, DEFAULT & RECOMMENDED) or in separate commands either directly from the command line or inside a Python script.

A. Run scatter, phase, and concat in a single command

You can run the scatter, phase, and concat steps sequentially in a single command. This is the default behaviour in GWASpy. Below are examples

  1. Command line

    phasing --input-vcf gs://path/to/file.vcf.bgz --out-dir gs://path/to/output/dir --reference GRCh38 --billing-project billing-project
    
  2. Python (inside a Python script)

    import gwaspy.phasing as phase
    phase.phasing.haplotype_phasing(input_vcf = 'gs://path/to/file.vcf.bgz',
              vcf_ref = None,
              local: bool = False,
              billing_project = 'billing-project',
              software = 'shapeit',
              reference= 'GRCh38',
              max_win_size_cm: float = 10.0,
              overlap_size_cm: float = 2.0,
              scatter_memory: int = 26,
              cpu: int = 4,
              threads: int = 3,
              stages: str = 'scatter,phase,concat',
              output_type: str = 'bcf',
              out_dir = 'gs://path/to/output/dir')
    

B. Run scatter, phasing, and concat steps in separate commands.

If for whatever reasons you’d like to run the scatter, phase, and concat steps separately, you can make use of the --stages (command-line) and stages (Python script) arguments to specify which stage [scatter, phase, concat] you want to run. It’s important to note that even though you can run things this way, phase is dependent on results from scatter and concat on results from phase i.e. you cannot run phasing without having ran scatter prior.

C. Reference panels

In some cases, including a reference panel when phasing might improve accuracy. By default, GWASpy runs phasing without a reference panel. If the user wants to use a reference panel, there are two options

Note

If the data set you wish to phase contains more than twice as many samples as the largest reference panel available to you, then using a reference panel is unlikely to give much of a boost in phasing accuracy.

C1. HGDP+1KG dataset

phasing --input-vcf gs://path/to/file.vcf.bgz --out-dir gs://path/to/output/dir --reference GRCh38 --billing-project billing-project --vcf-ref hgdp_1kg

C2. Own reference panel

Say you have your reference panel files by chromosomes stored in gs://ref_panel/ALL.chr{1..22,X}.vcf, you would pass the path to --vcf-ref as gs://ref_panel/ALL.chrCNUMBER.vcf, GWASpy uses CNUMBER as a placeholder for the chromosomes. Then you can run phasing as:

phasing --input-vcf gs://path/to/file.vcf.bgz --out-dir gs://path/to/output/dir --reference GRCh38 --billing-project project-name --vcf-ref gs://ref_panel/ALL.chrCNUMBER.vcf

Note

  1. If you’re using your own reference panel, make sure the files are bgzip compressed.

  2. Chromosome X reference file must be name X and not 23

Arguments and options

Argument

Description

--input-vcf

Path to where VCF file to be phased is

--vcf-ref

VCF file for reference haplotypes if phasing with a reference panel

--local

Type of service. Default is Service backend where jobs are executed on a multi-tenant compute cluster in Google Cloud

--billing-project

Billing project to be used for the job(s)

--software

Software to use for phasing. Options: [eagle, shapeit]. Default is eagle

--reference

Genome reference build. Default is GRCh38. Options: [GRCh37, GRCh38]

--max-win-size-cm

Maximum window size to use when chunking the input file. Default is 10.0

--overlap-size-cm

Size of overlap between consecutive overlapping windows. Default is 2.0

--cpu

Number of CPUs to use in phasing. Default is 4. [TO BE CHANGED]

--scatter-mem

Memory to use for scattering input into chunks before phasing. [TO BE CHANGED]

--threads

Number of threads to use in phasing. Default is 3. [TO BE CHANGED]

--stages

Process(es) to run. Default is scatter,phase,concat

--out-type

Output type. Options: [bcf, vcf]. Default is bcf [HIGHLY RECOMMENDED SINCE BCFs ARE GENERALLY FASTER TO WORK WITH AND TAKE UP LESS SPACE]

--out-dir

Path to where output files will be saved

Output

For both Eagle and SHAPEIT, the resulting output is a VCF file per chromosome with phased haplotypes.

Note

By default, Eagle will output a VCF file with phased GT and other fields that were in the unphased VCF, whereas SHAPEIT will ONLY output the GT field. This will result in phased files generated using Eagle being bigger in size than those generate using SHAPEIT.