Principal Component Analysis

Principal component analysis (PCA) can be used to detect and quantify the genetic structure of populations. In GWASpy, the pca module can be run in three ways: (1) normal PCA without a reference panel; (2) joint PCA; or (3) projection PCA.
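The difference between the three modes can be illustrated with a small NumPy sketch. It is not GWASpy's implementation. It assumes the usual meaning of the terms: joint PCA computes PCs on the reference and study samples combined, and projection PCA computes PCs on the reference panel alone and then projects the study samples onto them. The helper `fit_pca` and the toy genotype matrices are hypothetical.

```python
import numpy as np

def fit_pca(genotypes, n_pcs):
    """PCA of a samples x variants matrix via SVD.

    Returns the per-variant means, the loadings (variants x n_pcs)
    and the sample scores (samples x n_pcs).
    """
    means = genotypes.mean(axis=0)
    centered = genotypes - means
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = vt[:n_pcs].T
    scores = centered @ loadings
    return means, loadings, scores

# Toy genotype matrices (0/1/2 allele counts); illustrative only.
rng = np.random.default_rng(0)
ref = rng.integers(0, 3, size=(50, 200)).astype(float)   # reference panel
data = rng.integers(0, 3, size=(10, 200)).astype(float)  # study samples
n_pcs = 5

# (1) normal: PCs computed from the study samples only.
_, _, normal_scores = fit_pca(data, n_pcs)

# (2) joint: PCs computed from reference and study samples together.
_, _, joint_scores = fit_pca(np.vstack([ref, data]), n_pcs)
joint_data_scores = joint_scores[len(ref):]  # rows belonging to the study samples

# (3) projection: PCs computed from the reference only,
# then study samples are projected onto the reference loadings.
ref_means, ref_loadings, ref_scores = fit_pca(ref, n_pcs)
projected_scores = (data - ref_means) @ ref_loadings
```

In the projection case the study samples do not influence the axes, so the reference PCs stay the same no matter which dataset is projected onto them.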

Arguments and options

Argument	Description
`--ref-dirname`	Path to the directory containing the reference data
`--ref-basename`	Reference basename
`--ref-info`	Path to reference information. Tab-delimited file with sample IDs and their SuperPop labels
`--reference`	Genome reference build. Default is GRCh38. Options: [`GRCh37`, `GRCh38`]
`--pca-type`	Type of PCA to run. Default is normal. Options: [`normal`, `project`, `joint`]
`--data-dirname`	Path to the directory containing the input data
`--data-basename`	Data basename
`--input-type`	Data input type. Options: [`hail`, `plink`, `vcf`]
`--maf`	Include only SNPs with MAF >= NUM in PCA. Default is 0.05
`--hwe`	Include only SNPs with HWE p-value >= NUM in PCA. Default is 1e-03
`--geno`	Include only SNPs with call rate > NUM in PCA. Default is 0.98
`--ld-cor`	Squared correlation threshold (exclusive upper bound). Must be in the range [0.0, 1.0]. Default is 0.2
`--ld-window`	Window size in base pairs (inclusive upper bound). Default is 250000
`--npcs`	Number of PCs to use. Default is 20
`--relatedness-method`	Method to use for the inference of relatedness. Default is pc_relate. Options: [`pc_relate`, `ibd`, `king`]
`--relatedness-thresh`	Threshold value to use in relatedness checks. Default is 0.98
`--prob`	Minimum probability of belonging to a given population for the population to be set. Default is 0.8
`--out-dir`	Path to where output files will be saved
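As an illustration of how a minimum-probability cutoff such as `--prob` is typically applied, the sketch below assigns a sample to its most likely population only when that probability reaches the threshold. The function `assign_population`, the `unassigned` label, and the example probabilities are hypothetical and are not taken from GWASpy. Whether GWASpy compares with `>=` or `>` is also an assumption here.

```python
def assign_population(probabilities, min_prob=0.8, unassigned="unassigned"):
    """Return the most likely population label, or `unassigned`
    if its probability is below `min_prob`.

    `probabilities` maps population labels to predicted probabilities.
    """
    best_pop = max(probabilities, key=probabilities.get)
    if probabilities[best_pop] >= min_prob:  # assumed inclusive comparison
        return best_pop
    return unassigned

# Hypothetical per-sample probabilities.
confident = assign_population({"EUR": 0.92, "AFR": 0.05, "EAS": 0.03})
ambiguous = assign_population({"EUR": 0.55, "AFR": 0.45})
```

With the default cutoff of 0.8, the first sample is labelled `EUR` and the second is left unassigned.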

Output

The module generates a tab-delimited file containing the computed principal components (PCs; the first 20 by default, see `--npcs`) and plots of the PCs.