PLINK cheatsheet

The following flags and corresponding tidypopgen functions are based on plink version 1.9.

File management and reading data:

PLINK 1.9	`tidypopgen`
–make-bed –out	`gt_as_plink(data, file = my_file, type = "bed")`
–recode	`gt_as_plink(data, file = my_file, type = "ped")`
–recode vcf	`gt_as_vcf(data, file = my_file)`
–allele1234 and –alleleACGT	See `gen_tibble()` parameter ‘valid_alleles’

PLINK flags –update-alleles, –allele1234, and –alleleACGT, all alter the coding of alleles. In tidypopgen, valid alleles are supplied when reading in a gen_tibble.

Quality control:

PLINK	`tidypopgen`
–maf	`data %>% loci_maf()`
–geno	`data %>% loci_missingness()`
–hwe	`data %>% loci_hwe()`
–freq –a2-allele	`data %>% loci_alt_freq()`
–mind	`data %>% indiv_missingness()`
–het	`data %>% indiv_het_obs()`

To filter out variants in tidypopgen, in a similar way to PLINK flags such as –extract or –autosome, it is necessary to use the gen_tibble with select_loci_if(). For example:

data %>% select_loci_if(loci_chromosomes(genotypes) %in% c(1:22)) 
#> # A gen_tibble: 15 loci
#> # A tibble:     5 × 3
#>   id    population  genotypes
#>   <chr> <chr>      <vctr_SNP>
#> 1 GRC24 pop_a               1
#> 2 GRC25 pop_a               2
#> 3 GRC26 pop_a               3
#> 4 GRC27 pop_a               4
#> 5 GRC28 pop_a               5

will select autosomal loci in the same way as –autosome. Or alternatively:

my_snps <- c("rs4477212","rs3094315","rs3131972","rs12124819","rs11240777")

data %>% select_loci_if(loci_names(genotypes) %in% my_snps) %>% show_loci()
#> # A tibble: 5 × 8
#>   big_index name  chromosome position genetic_dist allele_ref allele_alt chr_int
#>       <int> <chr>      <int>    <int>        <int> <chr>      <chr>        <int>
#> 1         1 rs44…          1    82154            0 A          <NA>             1
#> 2         2 rs30…          1   752566            0 A          G                1
#> 3         3 rs31…          1   752721            0 G          A                1
#> 4         4 rs12…          1   776546            0 A          <NA>             1
#> 5         5 rs11…          1   798959            0 G          A                1

will select loci from a previously defined set in the same way as –extract.

Similarly, to filter out individuals, as might be performed with –keep in PLINK, requires using filter:

my_individuals <- c("GRC14300079", "GRC14300142", "GRC14300159")

data %>% filter(id %in% my_individuals)
#> # A gen_tibble: 16 loci
#> # A tibble:     0 × 3
#> # ℹ 3 variables: id <chr>, population <chr>, genotypes <vctr_SNP>

Quality control for linkage

Linkage is managed through clumping in tidypopgen with loci_ld_clump().

This option is similar to the –indep-pairwise flag in PLINK, but results in a more even distribution of loci when compared to LD pruning.

To explore why clumping is preferable to pruning, see https://privefl.github.io/bigsnpr/articles/pruning-vs-clumping.html

Quality control for relatedness (KING)

KING	`tidypopgen`
–kinship	`pairwise_king()`
–distance	`pairwise_ibs()`
–unrelated	`filter_high_relatedness()`

pairwise_king() implements the KING-robust estimator of kinship equivalent to –kinship in KING. To remove related individuals, instead of using –unrelated –degree, the user can pass the resulting kinship matrix and a relatedness threshold to filter_high_relatedness, which will return the largest possible set of unrelated individuals.

Merging datasets:

PLINK	`tidypopgen`
–bmerge	`rbind()`
–flip-scan	`rbind_dry_run()`
–flip	use ‘flip_strand = TRUE’ in `rbind()`

In PLINK, data merging can fail due to strand inconsistencies that are not addressed prior to merging. PLINK documentation suggests to users to try a ‘trial flip’ of data to address this, and then to ‘unflip’ any errors that remain. In tidypopgen, when data are merged with rbind, strand inconsistencies are identified and automatically flipped, avoiding multiple rounds of flipping before merging.

PLINK does allow users to identify inconsistencies prior to merging with –flip-scan, and this functionality is included in the tidypopgen rbind_dry_run(). rbind_dry_run() reports the numeric overlap of datasets, alongside the number of SNPs to ‘flip’ in the new target dataset, as well as the number of ambiguous SNPs.

Data are only merged one set at a time, there is no equivalent to –merge-list.

Analysis:

PLINK	`tidypopgen`
–pca	See `gt_pca()` for pca options
–fst	`pairwise_pop_fst()` with `group_by()`
–homozyg	`gt_roh_window()`