Principal Component Analysis

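Principal component analysis (PCA) can be used to detect and quantify the genetic structure of populations. In GWASpy, the pca module can be run in three ways: (1) normal PCA, without a reference panel; (2) joint PCA, in which study and reference samples are analyzed together; or (3) projection PCA, in which study samples are projected onto PCs computed from the reference panel.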
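To make the idea of projection PCA concrete, the following is a minimal NumPy sketch and not GWASpy's implementation. It assumes a samples-by-variants matrix of allele counts with no missing values. The function names (fit_reference_pca, project_samples) and the toy data are hypothetical. PCs are fitted on a reference panel only, and study samples are then projected onto those PCs after centering with the reference means.

```python
import numpy as np


def fit_reference_pca(ref_genotypes, n_pcs):
    """Fit PCA on a reference panel (samples x variants allele counts)."""
    mean = ref_genotypes.mean(axis=0)
    centered = ref_genotypes - mean
    # Rows of vt are the principal axes, ordered by variance explained.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = vt[:n_pcs].T            # variants x n_pcs
    ref_scores = centered @ loadings   # samples x n_pcs
    return mean, loadings, ref_scores


def project_samples(genotypes, mean, loadings):
    """Project samples onto reference PCs, centering with the reference means."""
    return (genotypes - mean) @ loadings


# Toy data (hypothetical): two reference populations with different
# allele frequencies, and study samples drawn from the first population.
rng = np.random.default_rng(0)
n_variants = 200
freq_a = rng.uniform(0.05, 0.95, size=n_variants)
freq_b = rng.uniform(0.05, 0.95, size=n_variants)
ref = np.vstack([
    rng.binomial(2, freq_a, size=(30, n_variants)),
    rng.binomial(2, freq_b, size=(30, n_variants)),
]).astype(float)
study = rng.binomial(2, freq_a, size=(5, n_variants)).astype(float)

mean, loadings, ref_scores = fit_reference_pca(ref, n_pcs=2)
study_scores = project_samples(study, mean, loadings)
```

In a joint PCA, by contrast, the study and reference matrices would be stacked and decomposed together, so the study samples would also influence the PCs.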

Arguments and options

Argument              Description
--ref-dirname         Path to the directory containing the reference data
--ref-basename        Basename of the reference data files
--ref-info            Path to the reference information file: a tab-delimited file with sample IDs and their SuperPop labels
--reference           Genome reference build. Default is GRCh38. Options: [GRCh37, GRCh38]
--pca-type            Type of PCA to run. Default is normal. Options: [normal, project, joint]
--data-dirname        Path to the directory containing the input data
--data-basename       Basename of the input data files
--input-type          Input data type. Options: [hail, plink, vcf]
--maf                 Include only SNPs with MAF >= NUM in PCA. Default is 0.05
--hwe                 Include only SNPs with HWE p-value >= NUM in PCA. Default is 1e-03
--geno                Include only SNPs with call rate > NUM. Default is 0.98
--ld-cor              Squared correlation threshold (exclusive upper bound) for LD pruning. Must be in the range [0.0, 1.0]. Default is 0.2
--ld-window           Window size in base pairs (inclusive upper bound) for LD pruning. Default is 250000
--npcs                Number of PCs to use. Default is 20
--relatedness-method  Method used to infer relatedness. Default is pc_relate. Options: [pc_relate, ibd, king]
--relatedness-thresh  Threshold value to use in relatedness checks. Default is 0.98
--prob                Minimum probability of belonging to a given population for the population to be set. Default is 0.8
--out-dir             Path to the directory where output files will be saved
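The --maf, --hwe, and --geno thresholds listed above can be read as a per-SNP filter. The sketch below illustrates that reading in plain Python and is not GWASpy's code. The function names are hypothetical, genotypes are assumed to be alternate-allele counts with None for a missing call, and the HWE p-value is assumed to come from a one-degree-of-freedom chi-square test (the actual test used by the module may differ).

```python
import math


def variant_stats(genotypes):
    """Return (call_rate, maf, hwe_p) for one biallelic SNP.

    genotypes: list of alt-allele counts (0, 1, 2); None marks a missing call.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    call_rate = n / len(genotypes)
    if n == 0:
        return call_rate, 0.0, 0.0
    n_het = called.count(1)
    n_hom_alt = called.count(2)
    n_hom_ref = n - n_het - n_hom_alt
    p_alt = (n_het + 2 * n_hom_alt) / (2 * n)
    maf = min(p_alt, 1 - p_alt)
    if maf == 0.0:
        return call_rate, maf, 1.0  # monomorphic: the HWE test is uninformative
    # Assumption: chi-square goodness-of-fit test against HWE proportions.
    expected = [n * (1 - p_alt) ** 2, 2 * n * p_alt * (1 - p_alt), n * p_alt ** 2]
    observed = [n_hom_ref, n_het, n_hom_alt]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    hwe_p = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    return call_rate, maf, hwe_p


def passes_filters(genotypes, maf=0.05, hwe=1e-3, geno=0.98):
    """Apply the thresholds with the defaults from the table above."""
    call_rate, snp_maf, hwe_p = variant_stats(genotypes)
    # MAF and HWE are inclusive lower bounds; call rate is a strict bound.
    return snp_maf >= maf and hwe_p >= hwe and call_rate > geno


common_snp = [0] * 25 + [1] * 50 + [2] * 25   # in HWE, fully called
rare_snp = [0] * 99 + [1]                     # MAF of 0.005
no_hets_snp = [0] * 50 + [2] * 50             # strong HWE deviation
kept = [passes_filters(s) for s in (common_snp, rare_snp, no_hets_snp)]
```

Note that, as documented, the call-rate comparison is strict: a SNP with a call rate of exactly 0.98 would be excluded under the default threshold.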

Output

The module generates a tab-delimited file containing the first 20 principal components (PCs), along with graphical visualizations of the PCs.
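A tab-delimited PC file of this kind can be loaded with the Python standard library, as in the sketch below. The exact file name and column headers are not specified here, so the sample ID column name ("s"), the PC column names, and the example rows are all hypothetical. The sketch also assumes that every column other than the sample ID is numeric.

```python
import csv
import io


def read_pc_scores(handle, id_column="s"):
    """Parse a tab-delimited PC file into {sample_id: [PC1, PC2, ...]}.

    Assumption: all columns other than id_column hold numeric PC scores.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    pc_columns = [c for c in reader.fieldnames if c != id_column]
    return {row[id_column]: [float(row[c]) for c in pc_columns] for row in reader}


# Hypothetical file contents, for illustration only.
example = io.StringIO(
    "s\tPC1\tPC2\n"
    "sample_1\t0.12\t-0.03\n"
    "sample_2\t-0.08\t0.05\n"
)
scores = read_pc_scores(example)
```

For a real output file, the same function could be called on an open file handle, with id_column adjusted to match the actual header.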