Title: | Tidy Population Genetics |
---|---|
Description: | We provide a tidy grammar of population genetics, facilitating the manipulation and analysis of data on biallelic single nucleotide polymorphisms (SNPs). |
Authors: | Evie Carter [aut], Andrea Manica [aut, cre] |
Maintainer: | Andrea Manica <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.0.0.9019 |
Built: | 2024-11-21 22:25:11 UTC |
Source: | https://github.com/EvolEcolGroup/tidypopgen |
Augment for gt_pca
accepts a model object and a dataset and adds
scores to each
observation in the dataset. Scores for each component are stored in a
separate column, which is given name with the pattern ".fittedPC1",
".fittedPC2", etc. For consistency with broom::augment.prcomp, a column
".rownames" is also returned; it is a copy of 'id', but it ensures that
any scripts written for data augmented with broom::augment.prcomp will
work out of the box (this is especially helpful when adapting plotting scripts).
## S3 method for class 'gt_pca' augment(x, data = NULL, k = NULL, ...)
## S3 method for class 'gt_pca' augment(x, data = NULL, k = NULL, ...)
x |
A |
data |
the |
k |
the number of components to add |
... |
Not used. Needed to match generic signature only. |
A gen_tibble containing the original data along with additional columns containing each observation's projection into PCA space.
gt_pca_autoSVD()
gt_pca_tidiers
augment_loci
add columns to the loci table of a gen_tibble
related
to information from a given analysis.
augment_loci(x, data, ...)
augment_loci(x, data, ...)
x |
An object returned by one of the |
data |
the |
... |
Additional parameters passed to the individual methods. |
A gen_tibble with additional columns added to the loci tibble (accessible
with show_loci()
. If data
is missing, a tibble of the information, with a
column .rownames
giving the loci names.
Augment for gt_pca
accepts a model object and a gen_tibble
and adds
loadings for each locus to the loci table. Loadings for each component are stored in a
separate column, which is given name with the pattern ".loadingPC1",
".loadingPC2", etc. If data
is missing, then a tibble with the loadings is returned.
## S3 method for class 'gt_pca' augment_loci(x, data = NULL, k = NULL, ...)
## S3 method for class 'gt_pca' augment_loci(x, data = NULL, k = NULL, ...)
x |
A |
data |
the |
k |
the number of components to add |
... |
Not used. Needed to match generic signature only. |
A gen_tibble with a loadings added to the loci tibble (accessible
with show_loci()
. If data
is missing, a tibble of loadings.
gt_pca_autoSVD()
gt_pca_tidiers
Augment for q_matrix
accepts a model object and a dataset and adds
Q values to each observation in the dataset.
Q values are stored in separate columns, which is given name with the
pattern ".Q1",".Q2", etc. For consistency with broom::augment.prcomp, a column
".rownames" is also returned; it is a copy of 'id', but it ensures that
any scripts written for data augmented with broom::augment.prcomp will
work out of the box (this is especially helpful when adapting plotting scripts).
## S3 method for class 'q_matrix' augment(x, data = NULL, ...)
## S3 method for class 'q_matrix' augment(x, data = NULL, ...)
x |
A |
data |
the |
... |
Not used. Needed to match generic signature only. |
A gen_tibble containing the original data along with additional columns containing each observation's Q values.
Augment for gt_dapc
accepts a model object and a dataset and adds
scores to each
observation in the dataset. Scores for each component are stored in a
separate column, which is given name with the pattern ".fittedLD1",
".fittedLD2", etc. For consistency with broom::augment.prcomp, a column
".rownames" is also returned; it is a copy of 'id', but it ensures that
any scripts written for data augmented with broom::augment.prcomp will
work out of the box (this is especially helpful when adapting plotting scripts).
## S3 method for class 'gt_dapc' augment(x, data = NULL, k = NULL, ...)
## S3 method for class 'gt_dapc' augment(x, data = NULL, k = NULL, ...)
x |
A |
data |
the |
k |
the number of components to add |
... |
Not used. Needed to match generic signature only. |
A gen_tibble containing the original data along with additional columns containing each observation's projection into PCA space.
gt_pca
objectsFor gt_pca
, the following types of plots are available:
screeplot
: a plot of the eigenvalues of the principal components (currently
it plots the singular value)
scores
a scatterplot of the scores of each individual on two principal
components (defined by pc
)
loadings
a plot of loadings of all loci for a given component (chosen with pc
)
## S3 method for class 'gt_pca' autoplot(object, type = c("screeplot", "scores", "loadings"), k = NULL, ...)
## S3 method for class 'gt_pca' autoplot(object, type = c("screeplot", "scores", "loadings"), k = NULL, ...)
object |
an object of class |
type |
the type of plot (one of "screeplot", "scores" and "loadings") |
k |
the principal components to be plotted: for scores, a pair of values
e.g. c(1,2); for |
... |
not currently used. |
autoplot
produces simple plots to quickly inspect an object. They are
not customisable; we recommend that you use ggplot2
to produce publication
ready plots.
a ggplot2
object
gt_pcadapt
objectsFor gt_pcadapt
, the following types of plots are available:
qq
: a qunatile-quantile plot of the p-values from pcadapt
(wrapping bigsnpr::snp_qq()
)
manhattan
a manhattan plot of the p-values from pcadapt
(wrapping bigsnpr::snp_manhattan()
)
## S3 method for class 'gt_pcadapt' autoplot(object, type = c("qq", "manhattan"), ...)
## S3 method for class 'gt_pcadapt' autoplot(object, type = c("qq", "manhattan"), ...)
object |
an object of class |
type |
the type of plot (one of "qq", and "mnahattan") |
... |
further arguments to be passed to |
autoplot
produces simple plots to quickly inspect an object. They are
not customisable; we recommend that you use ggplot2
to produce publication
ready plots.
a ggplot2
object
gt_cluster_pca
objectsFor gt_cluster_pca
, autoplot
produces a plot of a metric of choice ('BIC',
'AIC' or 'WSS') against the number of clusters (k). This plot is can be
used to infer the best value of k, which corresponds to the smallest
value of the metric (the minimum in an 'elbow' shaped curve). In some cases,
there is not 'elbow' and the metric keeps decreasing with increasing k;
in such cases, it is customary to choose the value of k at which the
decrease in the metric reaches as plateau. For a programmatic way of choosing
k, use gt_cluster_pca_best_k()
.
## S3 method for class 'gt_cluster_pca' autoplot(object, metric = c("BIC", "AIC", "WSS"), ...)
## S3 method for class 'gt_cluster_pca' autoplot(object, metric = c("BIC", "AIC", "WSS"), ...)
object |
an object of class |
metric |
the metric to plot on the y axies, one of 'BIC', 'AIC', or 'WSS' (with sum of squares) |
... |
not currently used. |
autoplot
produces simple plots to quickly inspect an object. They are
not customisable; we recommend that you use ggplot2
to produce publication
ready plots.
a ggplot2
object
gt_dapc
objectsFor gt_dapc
, the following types of plots are available:
screeplot
: a plot of the eigenvalues of the discriminant axes
scores
a scatterplot of the scores of each individual on two discriminant
axes (defined by ld
)
loadings
a plot of loadings of all loci for a discriminant axis (chosen with ld
)
components
a bar plot showing the probability of assignment to each cluster
## S3 method for class 'gt_dapc' autoplot( object, type = c("screeplot", "scores", "loadings", "components"), ld = NULL, group = NULL, n_col = 1, ... )
## S3 method for class 'gt_dapc' autoplot( object, type = c("screeplot", "scores", "loadings", "components"), ld = NULL, group = NULL, n_col = 1, ... )
object |
an object of class |
type |
the type of plot (one of "screeplot", "scores" and "loadings") |
ld |
the principal components to be plotted: for scores, a pair of values
e.g. c(1,2); for |
group |
a vector of group memberships to order the individuals in "components" plot. If NULL, the clusters used for the DAPC will be used. |
n_col |
for |
... |
not currently used. |
autoplot
produces simple plots to quickly inspect an object. They are
not customisable; we recommend that you use ggplot2
to produce publication
ready plots.
a ggplot2
object
q_matrix
objectsAutoplots for q_matrix
objects
## S3 method for class 'q_matrix' autoplot(object, data = NULL, annotate_group = TRUE, ...)
## S3 method for class 'q_matrix' autoplot(object, data = NULL, annotate_group = TRUE, ...)
object |
A Q matrix object (as returned by |
data |
An associated tibble (e.g. a |
annotate_group |
Boolean determining whether to annotate the plot with the group information |
... |
not currently used. |
a barplot of individuals, coloured by ancestry proportion
qc_report_indiv
objectsFor qc_report_indiv
, the following types of plots are available:
scatter
: a plot of missingness and observed heterozygosity within
individuals.
relatedness
: a histogram of paired kinship coefficients
## S3 method for class 'qc_report_indiv' autoplot( object, type = c("scatter", "relatedness"), miss_threshold = NULL, kings_threshold = kings_threshold, ... )
## S3 method for class 'qc_report_indiv' autoplot( object, type = c("scatter", "relatedness"), miss_threshold = NULL, kings_threshold = kings_threshold, ... )
object |
an object of class |
type |
the type of plot ( |
miss_threshold |
a threshold for the accepted rate of missingness within individuals |
kings_threshold |
an optional numeric, a threshold of relatedness for the sample |
... |
not currently used. |
autoplot
produces simple plots to quickly inspect an object. They are
not customisable; we recommend that you use ggplot2
to produce publication
ready plots.
a ggplot2
object
qc_report_loci
objectsFor qc_report_loci
, the following types of plots are available:
overview
: an UpSet plot, giving counts of snps over the threshold for
missingness, minor allele frequency, and Hardy-Weinberg equilibrium P-value,
and visualising the interaction between these
all
: a four panel plot, containing missing high maf
, missing low maf
,
hwe
, and significant hwe
plots
missing
: a histogram of proportion of missing data
missing low maf
: a histogram of the proportion of missing data for
snps with low minor allele freqency
missing high maf
:a histogram of the proportion of missing data for
snps with high minor allele freqency
maf
: a histogram of minor allele frequency
hwe
: a histogram of HWE exact test p-values
significant hwe
: a histogram of significant HWE exact test p-values
## S3 method for class 'qc_report_loci' autoplot( object, type = c("overview", "all", "missing", "missing low maf", "missing high maf", "maf", "hwe", "significant hwe"), maf_threshold = NULL, miss_threshold = NULL, hwe_p = NULL, ... )
## S3 method for class 'qc_report_loci' autoplot( object, type = c("overview", "all", "missing", "missing low maf", "missing high maf", "maf", "hwe", "significant hwe"), maf_threshold = NULL, miss_threshold = NULL, hwe_p = NULL, ... )
object |
an object of class |
type |
the type of plot (one of |
maf_threshold |
a threshold for the accepted rate of minor allele frequency of loci |
miss_threshold |
a threshold for the accepted rate of missingness per loci |
hwe_p |
a threshold of significance for Hardy-Weinberg exact p-values |
... |
not currently used. |
autoplot
produces simple plots to quickly inspect an object. They are
not customisable; we recommend that you use ggplot2
to produce publication
ready plots.
a ggplot2
object
gen_tibble
Count the number of loci in gen_tibble
(or directly from its genotype
column).
count_loci(.x, ...) ## S3 method for class 'tbl_df' count_loci(.x, ...) ## S3 method for class 'vctrs_bigSNP' count_loci(.x, ...)
count_loci(.x, ...) ## S3 method for class 'tbl_df' count_loci(.x, ...) ## S3 method for class 'vctrs_bigSNP' count_loci(.x, ...)
.x |
a |
... |
currently unused. |
the number of loci
Colours in the palette used by distruct
distruct_colours
distruct_colours
A vector of 60 hex colours
gen_tibble
A gen_tibble
stores genotypes for individuals in a tidy format. DESCRIBE
here the format
gen_tibble( x, ..., valid_alleles = c("A", "T", "C", "G"), missing_alleles = c("0", "."), backingfile = NULL, quiet = FALSE ) ## S3 method for class 'character' gen_tibble( x, ..., parser = c("vcfR", "cpp"), n_cores = 1, chunk_size = NULL, valid_alleles = c("A", "T", "C", "G"), missing_alleles = c("0", "."), backingfile = NULL, quiet = FALSE ) ## S3 method for class 'matrix' gen_tibble( x, indiv_meta, loci, ..., ploidy = 2, valid_alleles = c("A", "T", "C", "G"), missing_alleles = c("0", "."), backingfile = NULL, quiet = FALSE )
gen_tibble( x, ..., valid_alleles = c("A", "T", "C", "G"), missing_alleles = c("0", "."), backingfile = NULL, quiet = FALSE ) ## S3 method for class 'character' gen_tibble( x, ..., parser = c("vcfR", "cpp"), n_cores = 1, chunk_size = NULL, valid_alleles = c("A", "T", "C", "G"), missing_alleles = c("0", "."), backingfile = NULL, quiet = FALSE ) ## S3 method for class 'matrix' gen_tibble( x, indiv_meta, loci, ..., ploidy = 2, valid_alleles = c("A", "T", "C", "G"), missing_alleles = c("0", "."), backingfile = NULL, quiet = FALSE )
x |
can be:
|
... |
if |
valid_alleles |
a vector of valid allele values; it defaults to 'A','T', 'C' and 'G'. |
missing_alleles |
a vector of values in the BIM file/loci dataframe that indicate a missing value for the allele value (e.g. when we have a monomorphic locus with only one allele). It defaults to '0' and '.' (the same as PLINK 1.9). |
backingfile |
the path, including the file name without extension,
for backing files used to store the data (they will be given a .bk
and .RDS automatically). This is not needed if |
quiet |
provide information on the files used to store the data |
parser |
the name of the parser used for VCF, either "cpp" to use
a fast C++ parser, or "vcfR" to use the R package |
n_cores |
the number of cores to use for parallel processing |
chunk_size |
the number of loci or individuals (depending on the format)
processed at a time (currently used
if |
indiv_meta |
a list, data.frame or tibble with compulsory columns 'id'
and 'population', plus any additional metadata of interest. This is only used
if |
loci |
a data.frame or tibble, with compulsory columns 'name', 'chromosome',
and 'position','genetic_dist', 'allele_ref' and 'allele_alt'. This is only used
if |
ploidy |
the ploidy of the samples (either a single value, or a vector of values for mixed ploidy). Only used if creating a gen_tibble from a matrix of data; otherwise, ploidy is determined automatically from the data as they are read. |
When loading packedancestry files, missing alleles will be converted from 'X' to NA
an object of the class gen_tbl
.
q_matrix_list
objectThis function retrieves a single P matrix from a q_matrix_list
object
based on the specified k value and run number.
get_p_matrix(x, ..., k, run)
get_p_matrix(x, ..., k, run)
x |
A |
... |
Not used |
k |
The k value of the desired P matrix |
run |
The run number of the desired P matrix |
A single P matrix from the q_matrix_list
object
q_matrix_list
objectThis function retrieves a single Q matrix from a q_matrix_list
object
based on the specified k value and run number.
get_q_matrix(x, ..., k, run)
get_q_matrix(x, ..., k, run)
x |
A |
... |
Not used |
k |
The k value of the desired Q matrix |
run |
The run number of the desired Q matrix |
A single Q matrix from the q_matrix_list
object
gen_tibble
to a genind
object from adegenet
This function converts a gen_tibble
to a genind
object from adegenet
gt_as_genind(x)
gt_as_genind(x)
x |
a |
a genind
object
gen_tibble
to a genlight
object from adegenet
This function converts a gen_tibble
to a genlight
object from adegenet
gt_as_genlight(x)
gt_as_genlight(x)
x |
a |
a genlight
object
gentibble
to a .geno file for sNMF from the LEA packageThis function writes a .geno file fom a gen_tibble
. Unless a file path is given,
a file with suffix .geno is written in the same location as the .rds and .bk
files that underpin the gen_tibble
.
gt_as_geno_lea(x, file = NULL)
gt_as_geno_lea(x, file = NULL)
x |
|
file |
the .geno filename with a path, or NULL (the default) to use the location of the backing files. |
the path of the .geno file
gen_tibble
to a data.frame compatible with hierfstat
This function converts a gen_tibble
to a data.frame formatted
to be used by hierfstat
functions.
gt_as_hierfstat(x)
gt_as_hierfstat(x)
x |
a |
a data.frame with a column 'pop' and further column representing the genotypes (with alleles recoded as 1 and 2)
gen_tibble
object to PLINK bed formatThis function exports all the information of a gen_tibble
object into
a PLINK bed, ped or raw file (and associated files, i.e. .bim and .fam for .bed;
.fam for .ped).
gt_as_plink(x, file = NULL, type = c("bed", "ped", "raw"), overwrite = TRUE)
gt_as_plink(x, file = NULL, type = c("bed", "ped", "raw"), overwrite = TRUE)
x |
a |
file |
a character string giving the path to output file. If left to NULL, the output file will have the same path and prefix of the backingfile. |
type |
one of "bed", "ped" or "raw" |
overwrite |
boolean whether to overwrite the file. |
the path of the saved file
gen_tibble
to a VCFThis function write a VCF from a gen_tibble
.
gt_as_vcf(x, file = NULL, chunk_size = NULL, overwrite = FALSE)
gt_as_vcf(x, file = NULL, chunk_size = NULL, overwrite = FALSE)
x |
a |
file |
the .vcf file name with a path, or NULL (the default) to use the location of the backing files. |
chunk_size |
the number of loci processed at a time. Automatically set if left to NULL |
overwrite |
logical, should the file be overwritten if it already exists? |
the path of the .vcf file
This function implements the clustering procedure used in Discriminant
Analysis of Principal Components (DAPC, Jombart et al. 2010).
This procedure consists in running successive K-means with an
increasing number of clusters (k), after transforming data using
a principal component analysis (PCA). For each model,
several statistical measures of goodness of fit
are computed, which allows to choose the optimal k using the function
gt_cluster_pca_best_k()
.
See details for a description of how to select the optimal k
and vignette("adegenet-dapc") for a tutorial.
gt_cluster_pca( x = NULL, n_pca = NULL, k_clusters = c(1, round(nrow(x$u)/10)), method = c("kmeans", "ward"), n_iter = 1e+05, n_start = 10, quiet = FALSE )
gt_cluster_pca( x = NULL, n_pca = NULL, k_clusters = c(1, round(nrow(x$u)/10)), method = c("kmeans", "ward"), n_iter = 1e+05, n_start = 10, quiet = FALSE )
x |
a |
n_pca |
number of principal components to be fed to the LDA. |
k_clusters |
number of clusters to explore, either a single value, or a vector of length 2 giving the minimum and maximum (e.g. 1:5). If left NULL, it will use 1 to the number of pca components divided by 10 (a reasonable guess). |
method |
either 'kmeans' or 'ward' |
n_iter |
number of iterations for kmeans (only used if |
n_start |
number of starting points for kmeans (only used if |
quiet |
boolean on whether to silence outputting information to the screen (defaults to FALSE) |
a gt_cluster_pca
object, which is a subclass of gt_pca
with
an additional element 'cluster', a list with elements:
'method' the clustering method (either kmeans or ward)
'n_pca' number of principal components used for clustering
'k' the k values explored by the function
'WSS' within sum of squares for each k
'AIC' the AIC for each k
'BIC' the BIC for each k
'groups' a list, with each element giving the group assignments for a given k
This function selects the best k value based on a chosen metric and criterion.
It is equivalent to plotting the metric against the k values, and selecting
the k that fulfils a given criterion (see details for an explanation of
each criterion). This function simply adds an element 'best_k' to the
gt_cluster_pca
returned by gt_cluster_pca()
. The choice can be
over-ridden simply by assigning a different value to that element (e.g.
for an object x and a desired k of 8, simply use x$best_k <- 8
)
gt_cluster_pca_best_k( x, stat = c("BIC", "AIC", "WSS"), criterion = c("diffNgroup", "min", "goesup", "smoothNgoesup", "goodfit"), quiet = FALSE )
gt_cluster_pca_best_k( x, stat = c("BIC", "AIC", "WSS"), criterion = c("diffNgroup", "min", "goesup", "smoothNgoesup", "goodfit"), quiet = FALSE )
x |
a |
stat |
a statistics, one of "BIC", "AIC" or "WSS" |
criterion |
one of "diffNgroup", "min", "goesup", "smoothNgoesup", "goodfit", see details for a discussion of each approach. |
quiet |
boolean on whether to silence outputting information to the screen (defaults to FALSE) |
The analysis of data simulated under various population genetics models (see reference) suggested an ad-hoc rule for the selection of the optimal number of clusters. First important result is that BIC seems more efficient than AIC and WSS to select the appropriate number of clusters (see example). The rule of thumb consists in increasing K until it no longer leads to an appreciable improvement of fit (i.e., to a decrease of BIC). In the most simple models (island models), BIC decreases until it reaches the optimal K, and then increases. In these cases, the best rule amounts to choosing the lowest K. In other models such as stepping stones, the decrease of BIC often continues after the optimal K, but is much less steep, so a change in slope can be taken as an indication of where the best k lies.
This function provides a programmatic way to select k. Note that it is highly recommended to look at the graph of BIC versus the numbers of clusters, to understand and validate the programmatic selection. The criteria available in this function are:
"diffNgroup": differences between successive values of the summary statistics (by default, BIC) are split into two groups using a Ward's clustering method (see ?hclust), to differentiate sharp decrease from mild decreases or increases. The retained K is the one before the first group switch. This criterion appears to work well for island/hierarchical models, and decently for isolation by distance models, albeit with some unstability. It can be confounded by an initial, very sharp decrease of the test statistics. IF UNSURE ABOUT THE CRITERION TO USE, USE THIS ONE.
"min": the model with the minimum summary statistics (as specified by stat argument, BIC by default) is retained. Is likely to work for simple island model, using BIC. It is likely to fail in models relating to stepping stones, where the BIC always decreases (albeit by a small amount) as K increases. In general, this approach tends to over-estimate the number of clusters.
"goesup": the selected model is the K after which increasing the number of clusters leads to increasing the summary statistics. Suffers from inaccuracy, since i) a steep decrease might follow a small 'bump' of increase of the statistics, and ii) increase might never happen, or happen after negligible decreases. Is likely to work only for clear-cut island models.
"smoothNgoesup": a variant of "goesup", in which the summary statistics is first smoothed using a lowess approach. Is meant to be more accurate than "goesup" as it is less prone to stopping to small 'bumps' in the decrease of the statistics.
"goodfit": another criterion seeking a good fit with a minimum number of clusters. This approach does not rely on differences between successive statistics, but on absolute fit. It selects the model with the smallest K so that the overall fit is above a given threshold.
a 'gt_cluster_pca' object with an added element 'best_k'
Jombart T, Devillard S and Balloux F (2010) Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genetics 11:94. doi:10.1186/1471-2156-11-94
This function implements the Discriminant Analysis of Principal Components
(DAPC, Jombart et al. 2010). This method describes the diversity between
pre-defined groups. When groups are unknown, use gt_cluster_pca()
to
infer genetic clusters. See 'details' section for a succinct
description of the method, and the vignette in the package adegenet
("adegenet-dapc") for a
tutorial. This function returns objects of class adegenet::dapc
which are compatible
with methods from adegenet
; graphical methods for DAPC are documented in
adegenet::scatter.dapc
(see ?scatter.dapc).
gt_dapc( x, pop = NULL, n_pca = NULL, n_da = NULL, loadings_by_locus = TRUE, pca_info = FALSE )
gt_dapc( x, pop = NULL, n_pca = NULL, n_da = NULL, loadings_by_locus = TRUE, pca_info = FALSE )
x |
an object of class |
pop |
either a factor indicating the group membership of individuals;
or an integer defining the desired k if x is a |
n_pca |
number of principal components to be used in the Discriminant Analysis. If NULL, k-1 will be used. |
n_da |
an integer indicating the number of axes retained in the Discriminant Analysis step. |
loadings_by_locus |
a logical indicating whether the loadings and contribution of each locus should be stored (TRUE, default) or not (FALSE). Such output can be useful, but can also create large matrices when there are a lot of loci and many dimensions. |
pca_info |
a logical indicating whether information about the prior PCA should be stored (TRUE, default) or not (FALSE). This information is required to predict group membership of new individuals using predict, but makes the object slightly bigger. |
The Discriminant Analysis of Principal Components (DAPC) is designed to investigate the genetic structure of biological populations. This multivariate method consists in a two-steps procedure. First, genetic data are transformed (centred, possibly scaled) and submitted to a Principal Component Analysis (PCA). Second, principal components of PCA are submitted to a Linear Discriminant Analysis (LDA). A trivial matrix operation allows to express discriminant functions as linear combination of alleles, therefore allowing one to compute allele contributions. More details about the computation of DAPC are to be found in the indicated reference.
an object of class adegenet::dapc
Jombart T, Devillard S and Balloux F (2010) Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genetics 11:94. doi:10.1186/1471-2156-11-94 Thia, J. A. (2023). Guidelines for standardizing the application of discriminant analysis of principal components to genotype data. Molecular Ecology Resources, 23, 523–538. https://doi.org/10.1111/1755-0998.13706
This function prepares data for various ADMIXTOOLS 2 functions fro the
package ADMIXTOOLS 2. It
takes a gen_tibble
,
computes allele frequencies and blocked f2-statistics,
and writes the results to outdir
. It is equivalent to
admixtools::extract_f2()
.
gt_extract_f2( .x, outdir = NULL, blgsize = 0.05, maxmem = 8000, maxmiss = 0, minmaf = 0, maxmaf = 0.5, minac2 = FALSE, outpop = NULL, outpop_scale = TRUE, transitions = TRUE, transversions = TRUE, overwrite = FALSE, adjust_pseudohaploid = TRUE, fst = TRUE, afprod = TRUE, poly_only = c("f2"), apply_corr = TRUE, n_cores = 1, quiet = FALSE )
gt_extract_f2( .x, outdir = NULL, blgsize = 0.05, maxmem = 8000, maxmiss = 0, minmaf = 0, maxmaf = 0.5, minac2 = FALSE, outpop = NULL, outpop_scale = TRUE, transitions = TRUE, transversions = TRUE, overwrite = FALSE, adjust_pseudohaploid = TRUE, fst = TRUE, afprod = TRUE, poly_only = c("f2"), apply_corr = TRUE, n_cores = 1, quiet = FALSE )
.x |
|
outdir |
Directory where data will be stored. |
blgsize |
SNP block size in Morgan. Default is 0.05 (5 cM). If |
maxmem |
Maximum amount of memory to be used. If the required amount
of memory exceeds |
maxmiss |
Discard SNPs which are missing in a fraction of populations
higher than |
minmaf |
Discard SNPs with minor allele frequency less than |
maxmaf |
Discard SNPs with minor allele frequency greater than
than |
minac2 |
Discard SNPs with allele count lower than 2 in any population
(default |
outpop |
Keep only SNPs which are heterozygous in this population |
outpop_scale |
Scale f2-statistics by the inverse |
transitions |
Set this to |
transversions |
Set this to |
overwrite |
Overwrite existing files in |
adjust_pseudohaploid |
Genotypes of pseudohaploid samples are
usually coded as |
fst |
Write files with pairwise FST for every population pair.
Setting this to FALSE can make |
afprod |
Write files with allele frequency products for every
population pair. Setting this to FALSE can make |
poly_only |
Specify whether SNPs with identical allele frequencies in
every population should be discarded ( |
apply_corr |
Apply small-sample-size correction when computing
f2-statistics (default |
n_cores |
Parallelize computation across |
quiet |
Suppress printing of progress updates |
SNP metadata (invisibly)
gen_tibble
A function to return the names of the files used to store data in a gen_tibble
.
Specifically, this returns the .rds file storing the big
gt_get_file_names(x)
gt_get_file_names(x)
x |
a character vector with the names and paths of the two files
gen_tibble
has been imputedThis function checks if a dataset has been imputed. Note that having imputation does not mean that the imputed values are used.
gt_has_imputed(x)
gt_has_imputed(x)
x |
a |
boolean TRUE or FALSE depending on whether the dataset has been imputed
This function provides a very simple imputation algorithm for gen_tibble
objects by using the mode, mean or sampling from the allele frequencies.
Each locus is imputed independently (and thus linkage information is ignored).
gt_impute_simple(x, method = c("mode", "mean0", "random"), n_cores = 1)
gt_impute_simple(x, method = c("mode", "mean0", "random"), n_cores = 1)
x |
a gen_tibble with missing data |
method |
one of
|
n_cores |
the number of cores to be used |
This function is a wrapper around bigsnpr::snp_fastImputeSimple()
.
a gen_tibble with imputed genotypes
Load a gen_tibble
previously saved with gt_save()
. If the .rds and
.bk files have not
been moved, they should be found automatically. If they were moved, use
reattach_to
to point to the .rds file (the .bk file needs to be in the same
directory as the .rds file).
gt_load(file = NULL, reattach_to = NULL)
gt_load(file = NULL, reattach_to = NULL)
file |
the file name, including the full path. If it does not end with .gt, the extension will be added. |
reattach_to |
the file name, including the full path, of the .rds file if it was moved. It assumes that the .bk file is found in the same path. You should be able to leave this to NULL unless you have moved the files. |
gen_tibble
objectsThere are a number of PCA methods available for gen_tibble
objects. They
are mostly designed to work on very large datasets, so they only compute
a limited number of components. For smaller datasets, gt_partialSVD
allows
the use of partial (truncated) SVD to fit the PCA; this method is suitable when
the number of individuals is much smaller than the number of loci. For larger
dataset, gt_randomSVD
is more appropriate. Finally, there is a method
specifically designed for dealing with LD in large datasets, gt_autoSVD
.
Whilst this is arguably the best option, it is somewhat data hungry, and so
only suitable for very large datasets (hundreds of individuals with several hundred
thousands markers, or larger).
NOTE: using gt_pca_autoSVD with a small dataset will likely cause an error, see man page for details.
gen_tibble
objectsThis function performs Principal Component Analysis on a gen_tibble
,
using a fast truncated SVD with initial pruning and then iterative removal
of long-range LD regions. This function is a wrapper for bigsnpr::snp_autoSVD()
gt_pca_autoSVD( x, k = 10, fun_scaling = bigsnpr::snp_scaleBinom(), thr_r2 = 0.2, use_positions = TRUE, size = 100/thr_r2, roll_size = 50, int_min_size = 20, alpha_tukey = 0.05, min_mac = 10, max_iter = 5, n_cores = 1, verbose = TRUE )
gt_pca_autoSVD( x, k = 10, fun_scaling = bigsnpr::snp_scaleBinom(), thr_r2 = 0.2, use_positions = TRUE, size = 100/thr_r2, roll_size = 50, int_min_size = 20, alpha_tukey = 0.05, min_mac = 10, max_iter = 5, n_cores = 1, verbose = TRUE )
x |
a |
k |
Number of singular vectors/values to compute. Default is |
fun_scaling |
Usually this can be left unset, as it defaults to
|
thr_r2 |
Threshold over the squared correlation between two SNPs.
Default is |
use_positions |
a boolean on whether the position is used to define |
size |
For one SNP, window size around this SNP to compute correlations. Default is 100 / thr_r2 for clumping (0.2 -> 500; 0.1 -> 1000; 0.5 -> 200). If not providing infos.pos (NULL, the default), this is a window in number of SNPs, otherwise it is a window in kb (genetic distance). I recommend that you provide the positions if available. |
roll_size |
Radius of rolling windows to smooth log-p-values.
Default is |
int_min_size |
Minimum number of consecutive outlier SNPs
in order to be reported as long-range LD region. Default is |
alpha_tukey |
Default is |
min_mac |
Minimum minor allele count (MAC) for variants to be included.
Default is |
max_iter |
Maximum number of iterations of outlier detection.
Default is |
n_cores |
Number of cores used. Default doesn't use parallelism.
You may use |
verbose |
Output some information on the iterations? Default is |
Using gt_pca_autoSVD requires a reasonably large dataset, as the function iteratively removes regions of long range LD.
a gt_pca
object, which is a subclass of bigSVD
; this is
an S3 list with elements:
A named list (an S3 class "big_SVD") of
d
, the eigenvalues (singular values, i.e. as variances),
u
, the scores for each sample on each component (the left singular vectors)
v
, the loadings (the right singular vectors)
center
, the centering vector,
scale
, the scaling vector,
method
, a string defining the method (in this case 'autoSVD'),
call
, the call that generated the object.
Note: rather than accessing these elements directly, it is better to use
tidy
and augment
. See gt_pca_tidiers
.
gen_tibble
objects by partial SVDThis function performs Principal Component Analysis on a gen_tibble
,
by partial SVD through the eigen decomposition of the covariance. It works well
if the number of individuals is much smaller than the number of loci; otherwise,
gt_pca_randomSVD()
is a better option. This function is a wrapper
for bigstatsr::big_SVD()
.
gt_pca_partialSVD(x, k = 10, fun_scaling = bigsnpr::snp_scaleBinom())
gt_pca_partialSVD(x, k = 10, fun_scaling = bigsnpr::snp_scaleBinom())
x |
a |
k |
Number of singular vectors/values to compute. Default is |
fun_scaling |
Usually this can be left unset, as it defaults to
|
a gt_pca
object, which is a subclass of bigSVD
; this is
an S3 list with elements:
A named list (an S3 class "big_SVD") of
d
, the eigenvalues (singular values, i.e. as variances),
u
, the scores for each sample on each component (the left singular vectors)
v
, the loadings (the right singular vectors)
center
, the centering vector,
scale
, the scaling vector,
method
, a string defining the method (in this case 'partialSVD'),
call
, the call that generated the object.
Note: rather than accessing these elements directly, it is better to use
tidy
and augment
. See gt_pca_tidiers
.
gen_tibble
objects by randomized partial SVDThis function performs Principal Component Analysis on a gen_tibble
,
by randomised partial SVD based on the
algorithm in RSpectra (by Yixuan Qiu and Jiali Mei).
This algorithm is linear in time in all dimensions and is very
memory-efficient. Thus, it can be used on very large big.matrices.
This function is a wrapper
for bigstatsr::big_randomSVD()
gt_pca_randomSVD( x, k = 10, fun_scaling = bigsnpr::snp_scaleBinom(), tol = 1e-04, verbose = FALSE, n_cores = 1, fun_prod = bigstatsr::big_prodVec, fun_cprod = bigstatsr::big_cprodVec )
gt_pca_randomSVD( x, k = 10, fun_scaling = bigsnpr::snp_scaleBinom(), tol = 1e-04, verbose = FALSE, n_cores = 1, fun_prod = bigstatsr::big_prodVec, fun_cprod = bigstatsr::big_cprodVec )
x |
a |
k |
Number of singular vectors/values to compute. Default is |
fun_scaling |
Usually this can be left unset, as it defaults to
|
tol |
Precision parameter of svds. Default is |
verbose |
Should some progress be printed? Default is |
n_cores |
Number of cores used. |
fun_prod |
Function that takes 6 arguments (in this order):
|
fun_cprod |
Same as |
a gt_pca
object, which is a subclass of bigSVD
; this is
an S3 list with elements:
A named list (an S3 class "big_SVD") of
d
, the eigenvalues (singular values, i.e. as variances),
u
, the scores for each sample on each component (the left singular vectors)
v
, the loadings (the right singular vectors)
center
, the centering vector,
scale
, the scaling vector,
method
, a string defining the method (in this case 'randomSVD'),
call
, the call that generated the object.
Note: rather than accessing these elements directly, it is better to use
tidy
and augment
. See gt_pca_tidiers
.
gen_tibble
objectpcadapt is an algorithm that detects genetic markers under selection. It is based on the
principal component analysis (PCA) of the genotypes of the individuals.
The method is described in Luu et al. (2017),
See the R package pcadapt
, which provides extensive
documentation and examples.
gt_pcadapt(x, pca, k, n_cores = 1)
gt_pcadapt(x, pca, k, n_cores = 1)
x |
A |
pca |
a |
k |
Number of principal components to use in the analysis. |
n_cores |
Number of cores to use. |
Internally, this function uses the snp_pcadapt
function from the bigsnpr
package.
An object of subclass gt_pcadapt
, a subclass of mhtest
.
This function uses a sliding-window approach to look for runs of homozygosity (or
heterozygosity) in a diploid genome. This function uses the package selectRUNS
,
which implements an approach equivalent to the one in PLINK.
gt_roh_window( x, window_size = 15, threshold = 0.05, min_snp = 3, heterozygosity = FALSE, max_opp_window = 1, max_miss_window = 1, max_gap = 10^6, min_length_bps = 1000, min_density = 1/1000, max_opp_run = NULL, max_miss_run = NULL )
gt_roh_window( x, window_size = 15, threshold = 0.05, min_snp = 3, heterozygosity = FALSE, max_opp_window = 1, max_miss_window = 1, max_gap = 10^6, min_length_bps = 1000, min_density = 1/1000, max_opp_run = NULL, max_miss_run = NULL )
x |
|
window_size |
the size of sliding window (number of SNP loci) (default = 15) |
threshold |
the threshold of overlapping windows of the same state (homozygous/heterozygous) to call a SNP in a RUN (default = 0.05) |
min_snp |
minimum n. of SNP in a RUN (default = 3) |
heterozygosity |
should we look for runs of heterozygosity (instead of homozygosity? (default = FALSE) |
max_opp_window |
max n. of SNPs of the opposite type (e.g. heterozygous snps for runs of homozygosity) in the sliding window (default = 1) |
max_miss_window |
max. n. of missing SNP in the sliding window (default = 1) |
max_gap |
max distance between consecutive SNP to be still considered a potential run (default = 10^6 bps) |
min_length_bps |
minimum length of run in bps (defaults to 1000 bps = 1 kbps) |
min_density |
minimum n. of SNP per kbps (defaults to 0.1 = 1 SNP every 10 kbps) |
max_opp_run |
max n. of opposite genotype SNPs in the run (optional) |
max_miss_run |
max n. of missing SNPs in the run (optional) |
This function returns a data frame with all runs detected in the dataset. This data frame can then be written out to a csv file. The data frame is, in turn, the input for other functions of the detectRUNS package that create plots and produce statistics from the results (see plots and statistics functions in this manual, and/or refer to the detectRUNS vignette).
If the gen_tibble
is grouped, then the grouping variable is used to fill in the
group table. Otherwise, the group 'column' is filled with the same values as the 'id'
column
A dataframe with RUNs of Homozygosity or Heterozygosity in the analysed dataset. The returned dataframe contains the following seven columns: "group", "id", "chrom", "nSNP", "from", "to", "lengthBps" (group: population, breed, case/control etc.; id: individual identifier; chrom: chromosome on which the run is located; nSNP: number of SNPs in the run; from: starting position of the run, in bps; to: end position of the run, in bps; lengthBps: size of the run)
# run the example only if we have the package installed if (requireNamespace("detectRUNS", quietly = TRUE)) { sheep_ped <- system.file("extdata", "Kijas2016_Sheep_subset.ped", package="detectRUNS") sheep_gt <- tidypopgen::gen_tibble(sheep_ped, backingfile = tempfile(), quiet=TRUE) sheep_gt <- sheep_gt %>% group_by(population) sheep_roh <- gt_roh_window(sheep_gt) detectRUNS::plot_Runs(runs = sheep_roh) }
# run the example only if we have the package installed if (requireNamespace("detectRUNS", quietly = TRUE)) { sheep_ped <- system.file("extdata", "Kijas2016_Sheep_subset.ped", package="detectRUNS") sheep_gt <- tidypopgen::gen_tibble(sheep_ped, backingfile = tempfile(), quiet=TRUE) sheep_gt <- sheep_gt %>% group_by(population) sheep_roh <- gt_roh_window(sheep_gt) detectRUNS::plot_Runs(runs = sheep_roh) }
Save the tibble (and update the backing files). The gen_tibble
object is
saved to a file with extension .gt, togethe with update its .rds and .bk
files. Note that multiple .gt files can be linked to the same .rds and .bk
files; generally, this occurs when we create multiple subsets of the data.
The .gt file then stores the information on what subset of the full dataset
we are interested in, whilst the .rds and .bk file store the full dataset.
To reload a gen_tibble
, you can pass the name of the .gt file with
gt_load()
.
gt_save(x, file_name = NULL, quiet = FALSE)
gt_save(x, file_name = NULL, quiet = FALSE)
x |
|
file_name |
the file name, including the full path. If it does not end with .gt, the extension will be added. |
quiet |
boolean to suppress information about hte files |
the file name and path of the .gt file, together with the .rds and .bk files
gen_tibble
to use imputed dataThis function sets or unsets the use of imputed data. For some analysis, such as PCA, that does not allow for missing data, we have to use imputation, but for other analysis it might be preferable to allow for missing data.
gt_set_imputed(x, set = NULL)
gt_set_imputed(x, set = NULL)
x |
a |
set |
a boolean defining whether imputed data should be used |
gen_tibble
uses imputed dataThis function checks if a dataset uses imputed data. Note that it is possible to have a dataset that has been imputed but it is currently not using imputation.
gt_uses_imputed(x)
gt_uses_imputed(x)
x |
a |
boolean TRUE or FALSE depending on whether the dataset is using the imputed values
Estimate observed heterozygosity (H_obs) for each individual (i.e. the frequency of loci that are heterozygous in an individual).
indiv_het_obs(.x, ...) ## S3 method for class 'tbl_df' indiv_het_obs(.x, ...) ## S3 method for class 'vctrs_bigSNP' indiv_het_obs(.x, ...)
indiv_het_obs(.x, ...) ## S3 method for class 'tbl_df' indiv_het_obs(.x, ...) ## S3 method for class 'vctrs_bigSNP' indiv_het_obs(.x, ...)
.x |
a vector of class |
... |
currently unused. |
a vector of heterozygosities, one per individuals in the gen_tibble
Estimate missingness for each individual (i.e. the frequency of missing genotypes in an individual).
indiv_missingness(.x, as_counts = FALSE, ...) ## S3 method for class 'tbl_df' indiv_missingness(.x, as_counts = FALSE, ...) ## S3 method for class 'vctrs_bigSNP' indiv_missingness(.x, as_counts = FALSE, ...)
indiv_missingness(.x, as_counts = FALSE, ...) ## S3 method for class 'tbl_df' indiv_missingness(.x, as_counts = FALSE, ...) ## S3 method for class 'vctrs_bigSNP' indiv_missingness(.x, as_counts = FALSE, ...)
.x |
a vector of class |
as_counts |
boolean defining whether the count of NAs (rather than the rate) should be returned. It defaults to FALSE (i.e. rates are returned by default). |
... |
currently unused. |
a vector of heterozygosities, one per individuals in the gen_tibble
Returns the ploidy for each individual.
indiv_ploidy(.x, ...) ## S3 method for class 'tbl_df' indiv_ploidy(.x, ...) ## S3 method for class 'vctrs_bigSNP' indiv_ploidy(.x, ...)
indiv_ploidy(.x, ...) ## S3 method for class 'tbl_df' indiv_ploidy(.x, ...) ## S3 method for class 'vctrs_bigSNP' indiv_ploidy(.x, ...)
.x |
a |
... |
currently unused. |
a vector of ploidy, one per individuals in the gen_tibble
Allele frequencies can be estimates as minimum allele frequencies (MAF) with
loci_maf()
or the frequency of the alternate allele (with loci_alt_freq()
).
The latter are in line with the genotypes matrix (e.g. as extracted by
show_loci()
). Most users will be in interested in the MAF, but the
raw frequencies might be useful when computing aggregated statistics.
loci_alt_freq(.x, ...) ## S3 method for class 'tbl_df' loci_alt_freq(.x, ...) ## S3 method for class 'vctrs_bigSNP' loci_alt_freq(.x, ...) ## S3 method for class 'grouped_df' loci_alt_freq(.x, n_cores = bigstatsr::nb_cores(), ...) loci_maf(.x, ...) ## S3 method for class 'tbl_df' loci_maf(.x, ...) ## S3 method for class 'vctrs_bigSNP' loci_maf(.x, ...) ## S3 method for class 'grouped_df' loci_maf(.x, n_cores = bigstatsr::nb_cores(), ...)
loci_alt_freq(.x, ...) ## S3 method for class 'tbl_df' loci_alt_freq(.x, ...) ## S3 method for class 'vctrs_bigSNP' loci_alt_freq(.x, ...) ## S3 method for class 'grouped_df' loci_alt_freq(.x, n_cores = bigstatsr::nb_cores(), ...) loci_maf(.x, ...) ## S3 method for class 'tbl_df' loci_maf(.x, ...) ## S3 method for class 'vctrs_bigSNP' loci_maf(.x, ...) ## S3 method for class 'grouped_df' loci_maf(.x, n_cores = bigstatsr::nb_cores(), ...)
.x |
a vector of class |
... |
other arguments passed to specific methods, currently unused. |
n_cores |
number of cores to be used, it defaults to |
a vector of frequencies, one per locus
gen_tibble
Extract the loci chromosomes from a gen_tibble
(or directly from its genotype
column).
loci_chromosomes(.x, ...) ## S3 method for class 'tbl_df' loci_chromosomes(.x, ...) ## S3 method for class 'vctrs_bigSNP' loci_chromosomes(.x, ...)
loci_chromosomes(.x, ...) ## S3 method for class 'tbl_df' loci_chromosomes(.x, ...) ## S3 method for class 'vctrs_bigSNP' loci_chromosomes(.x, ...)
.x |
a |
... |
currently unused. |
a character vector of chromosomes
Return the p-value from an exact test of HWE.
loci_hwe(.x, ...) ## S3 method for class 'tbl_df' loci_hwe(.x, mid_p = TRUE, ...) ## S3 method for class 'vctrs_bigSNP' loci_hwe(.x, mid_p = TRUE, ...) ## S3 method for class 'grouped_df' loci_hwe(.x, ...)
loci_hwe(.x, ...) ## S3 method for class 'tbl_df' loci_hwe(.x, mid_p = TRUE, ...) ## S3 method for class 'vctrs_bigSNP' loci_hwe(.x, mid_p = TRUE, ...) ## S3 method for class 'grouped_df' loci_hwe(.x, ...)
.x |
a vector of class |
... |
not used. |
mid_p |
boolean on whether the mid-p value should be computed. Default is TRUE, as in PLINK. |
This function uses the original C++ algorithm from PLINK 1.90.
NOTE There are no tests for this function yet! Unit tests are needed.
a vector of probabilities from HWE exact test, one per locus
the C++ algorithm was written by Christopher Chang for PLINK 1.90, based on original code by Jan Wigginton (the code was released under GPL3).
This function uses clumping to remove SNPs at high LD. When used with its default options, clumping based on MAF is similar to standard pruning (as done by PLINK with "–indep-pairwise (size+1) 1 thr.r2", but it results in a better spread of SNPs over the chromosome.
loci_ld_clump(.x, ...) ## S3 method for class 'tbl_df' loci_ld_clump(.x, ...) ## S3 method for class 'vctrs_bigSNP' loci_ld_clump( .x, S = NULL, thr_r2 = 0.2, size = 100/thr_r2, exclude = NULL, use_positions = TRUE, n_cores = 1, return_id = FALSE, ... ) ## S3 method for class 'grouped_df' loci_ld_clump(.x, ...)
loci_ld_clump(.x, ...) ## S3 method for class 'tbl_df' loci_ld_clump(.x, ...) ## S3 method for class 'vctrs_bigSNP' loci_ld_clump( .x, S = NULL, thr_r2 = 0.2, size = 100/thr_r2, exclude = NULL, use_positions = TRUE, n_cores = 1, return_id = FALSE, ... ) ## S3 method for class 'grouped_df' loci_ld_clump(.x, ...)
.x |
a |
... |
currently not used. |
S |
A vector of loci statistics which express the importance
of each SNP (the more important is the SNP, the greater should be
the corresponding statistic). |
thr_r2 |
Threshold over the squared correlation between two SNPs.
Default is |
size |
For one SNP, window size around this SNP to compute correlations.
Default is |
exclude |
Vector of SNP indices to exclude anyway. For example,
can be used to exclude long-range LD regions (see Price2008). Another use
can be for thresholding with respect to p-values associated with |
use_positions |
boolean, if TRUE (the default), |
n_cores |
number of cores to be used |
return_id |
boolean on whether the id of SNPs to keep should be returned. It defaults to FALSE, which returns a vector of booleans (TRUE or FALSE) |
Any missing values in the genotypes of a gen_tibble
passed to loci_ld_clump
will cause an error. To deal with missingness, see gt_impute_simple()
.
a boolean vector indicating whether the SNP should be kept (if 'return_id = FALSE', the default), else a vector of SNP indices to be kept (if 'return_id = TRUE')
Estimate the rate of missingness at each locus.
loci_missingness(.x, as_counts = FALSE, ...) ## S3 method for class 'tbl_df' loci_missingness(.x, as_counts = FALSE, ...) ## S3 method for class 'vctrs_bigSNP' loci_missingness(.x, as_counts = FALSE, ...) ## S3 method for class 'grouped_df' loci_missingness(.x, as_counts = FALSE, n_cores = bigstatsr::nb_cores(), ...)
loci_missingness(.x, as_counts = FALSE, ...) ## S3 method for class 'tbl_df' loci_missingness(.x, as_counts = FALSE, ...) ## S3 method for class 'vctrs_bigSNP' loci_missingness(.x, as_counts = FALSE, ...) ## S3 method for class 'grouped_df' loci_missingness(.x, as_counts = FALSE, n_cores = bigstatsr::nb_cores(), ...)
.x |
a vector of class |
as_counts |
boolean defining whether the count of NAs (rather than the rate) should be returned. It defaults to FALSE (i.e. rates are returned by default). |
... |
other arguments passed to specific methods. |
n_cores |
number of cores to be used, it defaults to |
a vector of frequencies, one per locus
gen_tibble
Extract the loci names from a gen_tibble
(or directly from its genotype
column).
loci_names(.x, ...) ## S3 method for class 'tbl_df' loci_names(.x, ...) ## S3 method for class 'vctrs_bigSNP' loci_names(.x, ...)
loci_names(.x, ...) ## S3 method for class 'tbl_df' loci_names(.x, ...) ## S3 method for class 'vctrs_bigSNP' loci_names(.x, ...)
.x |
a vector of class |
... |
currently unused. |
a character vector of names
Use the loci table to define which loci are transitions
loci_transitions(.x, ...) ## S3 method for class 'tbl_df' loci_transitions(.x, ...) ## S3 method for class 'vctrs_bigSNP' loci_transitions(.x, ...) ## S3 method for class 'grouped_df' loci_transitions(.x, ...)
loci_transitions(.x, ...) ## S3 method for class 'tbl_df' loci_transitions(.x, ...) ## S3 method for class 'vctrs_bigSNP' loci_transitions(.x, ...) ## S3 method for class 'grouped_df' loci_transitions(.x, ...)
.x |
a vector of class |
... |
other arguments passed to specific methods. |
a logical vector defining which loci are transitions
Use the loci table to define which loci are transversions
loci_transversions(.x, ...) ## S3 method for class 'tbl_df' loci_transversions(.x, ...) ## S3 method for class 'vctrs_bigSNP' loci_transversions(.x, ...) ## S3 method for class 'grouped_df' loci_transversions(.x, ...)
loci_transversions(.x, ...) ## S3 method for class 'tbl_df' loci_transversions(.x, ...) ## S3 method for class 'vctrs_bigSNP' loci_transversions(.x, ...) ## S3 method for class 'grouped_df' loci_transversions(.x, ...)
.x |
a vector of class |
... |
other arguments passed to specific methods. |
a logical vector defining which loci are transversions
gen_tibble
objectThis function computes the Allele Sharing matrix.
Estimates Allele Sharing (matching in hierfstat
)) between pairs of individuals
(for each locus, gives 1 if the two individuals are homozygous
for the same allele, 0 if they are homozygous for a different allele, and 1/2 if at least one individual
is heterozygous. Matching is the average of these 0, 1/2 and 1s)
pairwise_allele_sharing( x, as_matrix = FALSE, block_size = bigstatsr::block_size(count_loci(x)) )
pairwise_allele_sharing( x, as_matrix = FALSE, block_size = bigstatsr::block_size(count_loci(x)) )
x |
a |
as_matrix |
boolean, determining whether the results should be a square symmetrical matrix (TRUE), or a tidied tibble (FALSE, the default) |
block_size |
maximum number of loci read at once. More loci should improve speed, but will tax memory. |
a matrix of allele sharing between all pairs of individuals
gen_tibble
objectThis function computes the IBS matrix.
pairwise_ibs( x, as_matrix = FALSE, type = c("proportion", "adjusted_counts", "raw_counts"), block_size = bigstatsr::block_size(count_loci(x)) )
pairwise_ibs( x, as_matrix = FALSE, type = c("proportion", "adjusted_counts", "raw_counts"), block_size = bigstatsr::block_size(count_loci(x)) )
x |
a |
as_matrix |
boolean, determining whether the results should be a square symmetrical matrix (TRUE), or a tidied tibble (FALSE, the default) |
type |
one of "proportion" (equivalent to "ibs" in PLINK), "adjusted_counts" ("distance" in PLINK), and "raw_counts" (the counts of identical alleles and non-missing alleles, from which the two other quantities are computed) |
block_size |
maximum number of loci read at once. More loci should improve speed, but will tax memory. |
Note that monomorphic sites are currently counted. Should we filter them beforehand? What does plink do?
a bigstatsr::FBM of proportion or adjusted counts, or a list of two bigstatsr::FBM matrices, one of counts of IBS by alleles, and one of number of valid alleles (i.e. 2n_loci - 2missing_loci)
gen_tibble
objectThis function computes the KING-robust estimator of kinship.
pairwise_king( x, as_matrix = FALSE, block_size = bigstatsr::block_size(length(loci_names(x))) )
pairwise_king( x, as_matrix = FALSE, block_size = bigstatsr::block_size(length(loci_names(x))) )
x |
a |
as_matrix |
boolean, determining whether the results should be a square symmetrical matrix (TRUE), or a tidied tibble (FALSE, the default) |
block_size |
maximum number of loci read at once. More loci should improve speed, but will tax memory. |
Note that monomorphic sites are currently considered. What does PLINK do???
This function computes pairwise Fst. The following methods are implemented:
'Hudson': Hudson's formulation, as derived in Bhatia et al (2013) for diploids.
'Nei86' : Gst according to Nei (1986), as derived in Bhatia et al (2013) for diploids.
'Nei87' : Fst according to Nei (1987) - this is equivalent to hierfstat::pairwise.neifst()
,
and includes the correction for heterozygosity when computing Ht
'WC84' : Weir and Cockerham (1984), as derived in Bhatia et al (2013) for diploids.
pairwise_pop_fst( .x, by_locus = FALSE, method = c("Hudson", "Nei87", "Nei86", "WC84"), n_cores = bigstatsr::nb_cores() )
pairwise_pop_fst( .x, by_locus = FALSE, method = c("Hudson", "Nei87", "Nei86", "WC84"), n_cores = bigstatsr::nb_cores() )
.x |
a grouped |
by_locus |
boolean, determining whether Fst should be returned by locus(TRUE), or as a single genome wide value obtained by taking the ratio of the mean numerator and denominator (FALSE, the default). |
method |
one of 'Hudson', 'Nei86', 'Nei87', and 'WC84' |
n_cores |
number of cores to be used, it defaults to |
For all formulae, the genome wide estimate is obtained by taking the ratio of the mean numerators and denominators over all relevant SNPs.
a tibble of genome-wide pairwise Fst values with each pairwise combination as a row if "by_locus=FALSE", else a list including the tibble of genome-wide values as well as a matrix with pairwise Fst by locus with loci as rows and and pairwise combinations as columns.
Bhatia G, Patterson N, Sankararaman S, Price AL. Estimating and Interpreting FST: The Impact of Rare Variants. Genome Research. 2013;23(9):1514–1521.
Nei, M. (1987) Molecular Evolutionary Genetics. Columbia University Press
This function computes population specific FIS, using either the approach of Nei 1987 (as computed by hierfstat::basic.stats()
) or of Weir and Goudet 2017
(as computed by hierfstat::fis.dosage()
).
pop_fis( .x, method = c("Nei87", "WG17"), by_locus = FALSE, include_global = FALSE, allele_sharing_mat = NULL )
pop_fis( .x, method = c("Nei87", "WG17"), by_locus = FALSE, include_global = FALSE, allele_sharing_mat = NULL )
.x |
a grouped |
method |
one of "Nei87" (based on Nei 1987, eqn 7.41) or "WG17" (for Weir and Goudet 2017) to compute FIS |
by_locus |
boolean, determining whether FIS should be returned by locus(TRUE), or as a single genome wide value (FALSE, the default). Note that this is only relevant for "Nei87", as "WG17" always returns a single value. |
include_global |
boolean determining whether, besides the population specific estiamtes, a global
estimate should be appended. Note that this will return a vector of n populations plus 1 (the global value),
or a matrix with n+1 columns if |
allele_sharing_mat |
optional and only relevant for "WG17", the matrix of Allele Sharing returned by
|
a vector of population specific fis (plus the global value if include_global=TRUE
)
Nei M. (1987) Molecular Evolutionary Genetics. Columbia University Press Weir, BS and Goudet J (2017) A Unified Characterization of Population Structure and Relatedness. Genetics (2017) 206:2085
This function computes population specific Fst, using the approach in Weir and Goudet 2017
(as computed by hierfstat::fst.dosage()
).
pop_fst(.x, include_global = FALSE, allele_sharing_mat = NULL)
pop_fst(.x, include_global = FALSE, allele_sharing_mat = NULL)
.x |
a grouped |
include_global |
boolean determining whether, besides the population specific Fst, a global Fst should be appended. Note that this will return a vector of n populations plus 1 (the global value) |
allele_sharing_mat |
optional, the matrix of Allele Sharing returned by
|
a vector of population specific Fst (plus the global value if include_global=TRUE
)
Weir, BS and Goudet J (2017) A Unified Characterization of Population Structure and Relatedness. Genetics (2017) 206:2085
This function computes basic population global statistics, following the notation in Nei 1987 (which in turn is based on Nei and Chesser 1983):
observed heterozygosity ( , column header
Ho
)
expected heterozygosity, also known as gene diversity ( ,
Hs
)
total heterozygosity ( ,
Ht
)
genetic differentiation between subpopulations (,
Dst
)
corrected total population diversity (,
Htp
)
corrected genetic differentiation between subpopulations (,
Dstp
)
(column header,
Fst
)
corrected (column header
Fstp
)
(column header,
Fis
)
Jost's (column header,
Dest
)
pop_global_stats(.x, by_locus = FALSE, n_cores = bigstatsr::nb_cores())
pop_global_stats(.x, by_locus = FALSE, n_cores = bigstatsr::nb_cores())
.x |
a |
by_locus |
boolean, determining whether the statistics should be returned by locus(TRUE), or as a single genome wide value (FALSE, the default). |
n_cores |
number of cores to be used, it defaults to |
We use the notation of Nei 1987. That notation was for loci with alleles, but in our case we only have two alleles, so
m=2
.
Within population observed heterozygosity for a locus with
alleles is defined as:
where represents the proportion of homozygote
in the sample for the
th population and
the number of populations,
following equation 7.38 in Nei(1987) on pp.164.
Within population expected heterozygosity (gene diversity) for a locus with
alleles is defined as:
where (i.e the harmonic mean of
) and
following equation 7.39 in Nei(1987) on pp.164.
Total heterozygosity (total gene diversity) for a locus with
alleles is defined as:
where
following equation 7.40 in Nei(1987) on pp.164.
The amount of gene diversity among samples is defined as:
following the equation provided in the text at the top of page 165 in Nei(1987).
The corrected amount of gene diversity among samples is defined as:
following the equation provided in the text at the top of page 165 in Nei(1987).
Total corrected heterozygosity (total gene diversity) is defined as:
following the equation provided in the text at the top of page 165 in Nei(1987).
is defined as:
following equation 7.41 in Nei(1987) on pp.164.
is defined as:
following equation 7.43 in Nei(1987) on pp.165.
is defined as:
following the explanation provided in the text at the top of page 165 in Nei(1987).
Jost's is defined as:
as defined by Jost(2008)
All these statistics are first computed by locus, and then averaged across loci (including any
monorphic locus) to obtain genome-wide values. The function uses the same algorithm as
hierfstat::basic.stats()
but is optimized for speed and memory usage.
a tibble of population statistics, with populations as rows and statistics as columns
Nei M, Chesser R (1983) Estimation of fixation indexes and gene diversities. Annals of Human Genetics, 47, 253-259. Nei M. (1987) Molecular Evolutionary Genetics. Columbia University Press, pp. 164-165. Jost L (2008) GST and its relatives do not measure differentiation. Molecular Ecology, 17, 4015-4026.
This function computes expected population heterozygosity (also referred to as gene diversity, to avoid the potentially misleading use of the term "expected" in this context), using the formula of Nei (1987).
pop_het_exp( .x, by_locus = FALSE, include_global = FALSE, n_cores = bigstatsr::nb_cores() ) pop_gene_div( .x, by_locus = FALSE, include_global = FALSE, n_cores = bigstatsr::nb_cores() )
pop_het_exp( .x, by_locus = FALSE, include_global = FALSE, n_cores = bigstatsr::nb_cores() ) pop_gene_div( .x, by_locus = FALSE, include_global = FALSE, n_cores = bigstatsr::nb_cores() )
.x |
a |
by_locus |
boolean, determining whether Hs should be returned by locus(TRUE), or as a single genome wide value (FALSE, the default). |
include_global |
boolean determining whether, besides the population specific estiamtes, a global
estimate should be appended. Note that this will return a vector of n populations plus 1 (the global value),
or a matrix with n+1 columns if |
n_cores |
number of cores to be used, it defaults to |
Within population expected heterozygosity (gene diversity) for a locus with
alleles is defined as:
where (i.e the harmonic mean of
) and
following equation 7.39 in Nei(1987) on pp.164. In our specific case, there are only two alleles, so .
at
the genome level for each population is simply the mean of the locus estimates for each population.
a vector of mean population observed heterozygosities (if by_locus=FALSE
), or a matrix of
estimates by locus (rows are loci, columns are populations, by_locus=TRUE
)
Nei M. (1987) Molecular Evolutionary Genetics. Columbia University Press
This function computes population heterozygosity, using the formula of Nei (1987).
pop_het_obs( .x, by_locus = FALSE, include_global = FALSE, n_cores = bigstatsr::nb_cores() )
pop_het_obs( .x, by_locus = FALSE, include_global = FALSE, n_cores = bigstatsr::nb_cores() )
.x |
a |
by_locus |
boolean, determining whether Ho should be returned by locus(TRUE), or as a single genome wide value (FALSE, the default). |
include_global |
boolean determining whether, besides the population specific estiamtes, a global
estimate should be appended. Note that this will return a vector of n populations plus 1 (the global value),
or a matrix with n+1 columns if |
n_cores |
number of cores to be used, it defaults to |
Within population observed heterozygosity for a locus with
alleles is defined as:
where represents the proportion of homozygote
in the sample for the
th population and
the number of populations,
following equation 7.38 in Nei(1987) on pp.164. In our specific case, there are only two alleles, so . For
population specific estimates, the sum is done over a single value of
.
at
the genome level is simply the mean of the locus estimates.
a vector of mean population observed heterozygosities (if by_locus=FALSE
), or a matrix of
estimates by locus (rows are loci, columns are populations, by_locus=TRUE
)
Nei M. (1987) Molecular Evolutionary Genetics. Columbia University Press
Predict the PCA scores for a gt_pca
, either for the original data or
projecting new data.
## S3 method for class 'gt_pca' predict( object, new_data = NULL, project_method = c("none", "simple", "OADP", "least_squares"), lsq_pcs = c(1, 2), block_size = NULL, n_cores = 1, ... )
## S3 method for class 'gt_pca' predict( object, new_data = NULL, project_method = c("none", "simple", "OADP", "least_squares"), lsq_pcs = c(1, 2), block_size = NULL, n_cores = 1, ... )
object |
the |
new_data |
a gen_tibble if scores are requested for a new dataset |
project_method |
a string taking the value of either "simple", "OADP" (Online Augmentation, Decomposition, and Procrustes (OADP) projection), or "least_squares" (as done by SMARTPCA) |
lsq_pcs |
a vector of length two with the values of the two principal components
to use for the least square fitting. Only relevant if |
block_size |
number of loci read simultaneously (larger values will speed up computation, but require more memory) |
n_cores |
number of cores |
... |
no used |
a matrix of predictions, with samples as rows and components as columns. The number
of components depends on how many were estimated in the gt_pca
object.
Zhang et al (2020). Fast and robust ancestry prediction using principal component analysis 36(11): 3439–3446.
q_matrix
or q_matrix_list
objects.This function reads .Q matrix files generated by external clustering
algorithms (such as ADMIXTURE) and transforms them into q_matrix
or
q_matrix_list
objects for plotting.
q_matrix(x)
q_matrix(x)
x |
can be:
|
either:
a single q_matrix
object
a q_matrix_list
object containing a list of Q matrices and a list of
indices for each Q matrix separated by K
#' Return QC information to assess loci (Observed heterozygosity and missingness).
qc_report_indiv(.x, ...) ## S3 method for class 'tbl_df' qc_report_indiv(.x, kings_threshold = NULL, ...) ## S3 method for class 'grouped_df' qc_report_indiv(.x, kings_threshold = NULL, ...)
qc_report_indiv(.x, ...) ## S3 method for class 'tbl_df' qc_report_indiv(.x, kings_threshold = NULL, ...) ## S3 method for class 'grouped_df' qc_report_indiv(.x, kings_threshold = NULL, ...)
.x |
either a |
... |
further arguments to pass |
kings_threshold |
an optional numeric, a threshold of relatedness for the sample |
a tibble with 2 elements: het_obs and missingness
Return QC information to assess loci (MAF, missingness and HWE test).
qc_report_loci(.x, ...) ## S3 method for class 'tbl_df' qc_report_loci(.x, ...) ## S3 method for class 'grouped_df' qc_report_loci(.x, ...)
qc_report_loci(.x, ...) ## S3 method for class 'tbl_df' qc_report_loci(.x, ...) ## S3 method for class 'grouped_df' qc_report_loci(.x, ...)
.x |
a |
... |
currently unused the HWE test. |
a tibble with 3 elements: maf, missingness and hwe_p
This function provides an overview of the fate of each SNP in two
gen_tibble
objects in the case of a merge. Only SNPs found in
both objects will be kept. One object is used as a reference
, and SNPs
in the other dataset will be flipped and/or alleles swapped
as needed. SNPs that have different alleles in the two datasets will also be
dropped.
rbind_dry_run( ref, target, use_position = FALSE, flip_strand = FALSE, quiet = FALSE )
rbind_dry_run( ref, target, use_position = FALSE, flip_strand = FALSE, quiet = FALSE )
ref |
either a |
target |
either a |
use_position |
boolean of whether a combination of chromosome and position
should be used for matching SNPs. By default, |
flip_strand |
boolean on whether strand flipping should be checked to match the two datasets. Ambiguous SNPs (i.e. A/T and C/G) will also be removed. It defaults to FALSE |
quiet |
boolean whether to omit reporting to screen |
a list with two data.frames
, named target
and ref
. Each
data.frame has nrow()
equal to the number of loci in the respective dataset,
a column id
with the locus name, and boolean columns to_keep
(the valid loci
that will be kept in the merge),
alleles_mismatched
(loci found in both datasets but with mismatched alleles,
leading to those loci being dropped), to_flip
(loci that need to be flipped
to align the two datasets, only found in target
data.frame)
and to_swap
(loci for which the order of alleles needs to be swapped
to align the two datasets, target
data.frame)
This function combined two gen_tibbles. By defaults, it subsets the loci
and swaps ref and alt alleles to make the two datasets compatible (this
behaviour can be switched off with as_is
).
The first object is used as a "reference" , and SNPs
in the other dataset will be flipped and/or alleles swapped
as needed. SNPs that have different alleles in the two datasets
(i.e. triallelic) will also be
dropped. There are also options (NOT default) to attempt strand flipping to
match alleles (often needed in human datasets from different SNP chips),
and remove ambiguous alleles (C/G and A/T) where the correct strand can not
be guessed.
## S3 method for class 'gen_tbl' rbind( ..., as_is = FALSE, flip_strand = FALSE, use_position = FALSE, quiet = FALSE, backingfile = NULL )
## S3 method for class 'gen_tbl' rbind( ..., as_is = FALSE, flip_strand = FALSE, use_position = FALSE, quiet = FALSE, backingfile = NULL )
... |
two |
as_is |
boolean determining whether the loci should be left as they are
before merging. If FALSE (the defaults), |
flip_strand |
boolean on whether strand flipping should be checked to match the two datasets. If this is set to TRUE, ambiguous SNPs (i.e. A/T and C/G) will also be removed. It defaults to FALSE |
use_position |
boolean of whether a combination of chromosome and position
should be used for matching SNPs. By default, |
quiet |
boolean whether to omit reporting to screen |
backingfile |
the path and prefix of the files used to store the
merged data (it will be a .RDS to store the |
rbind differs from merging data with plink, which swaps the order of allele1 and allele2 according to minor allele frequency when merging datasets. rbind flips and/or swaps alleles according to the reference dataset, not according to allele frequency.
a gen_tibble
with the merged data.
A wrapper around ggplot2::scale_fill_manual()
, using the distruct colours
from distruct_colours
.
scale_fill_distruct(guide = "none", ...)
scale_fill_distruct(guide = "none", ...)
guide |
guide function passed to |
... |
further parameters to be passed to |
a scale constructor to be used with ggplot
select
verb for loci
An equivalent to dplyr::select()
that works on the genotype
column
of a gen_tibble
, using the mini-grammar available for tidyselect
. The
select
-like evaluation only has access to the names of the loci (i.e. it
can select only based on names, not summary statistics of those loci; look
at select_loci_if()
for that feature.
select_loci(.data, .sel_arg)
select_loci(.data, .sel_arg)
.data |
a |
.sel_arg |
one unquoted expression, using the mini-grammar of
|
Note that the select_loci
verb does not modify the backing FBM files,
but rather it subsets the list of loci to be used stored in the gen_tibble
.
a gen_tibble
with a subset of the loci.
select_if
verb for loci
An equivalent to dplyr::select_if()
that works on the genotype
column
of a gen_tibble
. This function has access to the genotypes (and thus can
work on summary statistics to select), but not the names of the loci (look
at select_loci()
for that feature.
select_loci_if(.data, .sel_logical)
select_loci_if(.data, .sel_logical)
.data |
a |
.sel_logical |
a logical vector of length equal to the number of loci, or an expression that will tidy evaluate to such a vector. Only loci for which .sel_logical is TRUE will be selected; NA will be treated as FALSE. |
#' Note that the select_loci_if
verb does not modify the backing FBM files,
but rather it subsets the list of loci to be used stored in the gen_tibble
.
gen_tibble
Extract the genotypes (as a matrix) from a gen_tibble
.
show_genotypes(.x, indiv_indices = NULL, loci_indices = NULL, ...) ## S3 method for class 'tbl_df' show_genotypes(.x, indiv_indices = NULL, loci_indices = NULL, ...) ## S3 method for class 'vctrs_bigSNP' show_genotypes(.x, indiv_indices = NULL, loci_indices = NULL, ...)
show_genotypes(.x, indiv_indices = NULL, loci_indices = NULL, ...) ## S3 method for class 'tbl_df' show_genotypes(.x, indiv_indices = NULL, loci_indices = NULL, ...) ## S3 method for class 'vctrs_bigSNP' show_genotypes(.x, indiv_indices = NULL, loci_indices = NULL, ...)
.x |
a vector of class |
indiv_indices |
indices of individuals |
loci_indices |
indices of loci |
... |
currently unused. |
a matrix of counts of the alternative alleles (see show_loci()
) to
extract information on the alleles for those loci from a gen_tibble
.
gen_tibble
Extract and set the information on loci from a gen_tibble
.
show_loci(.x, ...) ## S3 method for class 'tbl_df' show_loci(.x, ...) ## S3 method for class 'vctrs_bigSNP' show_loci(.x, ...) show_loci(.x) <- value ## S3 replacement method for class 'tbl_df' show_loci(.x) <- value ## S3 replacement method for class 'vctrs_bigSNP' show_loci(.x) <- value
show_loci(.x, ...) ## S3 method for class 'tbl_df' show_loci(.x, ...) ## S3 method for class 'vctrs_bigSNP' show_loci(.x, ...) show_loci(.x) <- value ## S3 replacement method for class 'tbl_df' show_loci(.x) <- value ## S3 replacement method for class 'vctrs_bigSNP' show_loci(.x) <- value
.x |
a vector of class |
... |
currently unused. |
value |
a data.frame or tibble of loci information to replace the current one. |
a tibble::tibble
of information (see gen_tibble
for details
on compulsory columns that will always be present)
gen_tibble
Extract the ploidy information from a gen_tibble
. NOTE that this function
does not return the ploidy level for each individual (that is obtained with
indiv_ploidy
); instead, it returns an integer which is either the ploidy
level of all individuals (e.g. 2 indicates all individuals are diploid),
or a 0 to indicate mixed ploidy.
show_ploidy(.x, ...) ## S3 method for class 'tbl_df' show_ploidy(.x, ...) ## S3 method for class 'vctrs_bigSNP' show_ploidy(.x, ...)
show_ploidy(.x, ...) ## S3 method for class 'tbl_df' show_ploidy(.x, ...) ## S3 method for class 'vctrs_bigSNP' show_ploidy(.x, ...)
.x |
a vector of class |
... |
currently unused. |
the ploidy (0 indicates mixed ploidy)
This function computes the Allele Sharing matrix.
Estimates Allele Sharing (matching in hierfstat
)) between pairs of individuals
(for each locus, gives 1 if the two individuals are homozygous
for the same allele, 0 if they are homozygous for a different allele, and 1/2 if at least one individual
is heterozygous. Matching is the average of these 0, 1/2 and 1s)
snp_allele_sharing( X, ind.row = bigstatsr::rows_along(X), ind.col = bigstatsr::cols_along(X), block.size = bigstatsr::block_size(nrow(X)) )
snp_allele_sharing( X, ind.row = bigstatsr::rows_along(X), ind.col = bigstatsr::cols_along(X), block.size = bigstatsr::block_size(nrow(X)) )
X |
a bigstatsr::FBM.code256 matrix (as found in the |
ind.row |
An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices. |
ind.col |
An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices. |
block.size |
maximum number of columns read at once. Note that, to optimise the speed of matrix operations, we have to store in memory 3 times the columns. |
a matrix of allele sharing between all pairs of individuals
This function computes the IBS matrix.
snp_ibs( X, ind.row = bigstatsr::rows_along(X), ind.col = bigstatsr::cols_along(X), type = c("proportion", "adjusted_counts", "raw_counts"), block.size = bigstatsr::block_size(nrow(X)) )
snp_ibs( X, ind.row = bigstatsr::rows_along(X), ind.col = bigstatsr::cols_along(X), type = c("proportion", "adjusted_counts", "raw_counts"), block.size = bigstatsr::block_size(nrow(X)) )
X |
a bigstatsr::FBM.code256 matrix (as found in the |
ind.row |
An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices. |
ind.col |
An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices. |
type |
one of "proportion" (equivalent to "ibs" in PLINK), "adjusted_counts" ("distance" in PLINK), and "raw_counts" (the counts of identical alleles and non-missing alleles, from which the two other quantities are computed) |
block.size |
maximum number of columns read at once. Note that, to optimise the speed of matrix operations, we have to store in memory 3 times the columns. |
Note that monomorphic sites are currently counted. Should we filter them beforehand? What does plink do?
if as.counts = TRUE function returns a list of two bigstatsr::FBM matrices, one of counts of IBS by alleles (i.e. 2*n loci), and one of valid alleles (i.e. 2 * n_loci - 2 * missing_loci). If as.counts = FALSE returns a single matrix of IBS proportions.
This function computes the KING-robust estimator of kinship.
snp_king( X, ind.row = bigstatsr::rows_along(X), ind.col = bigstatsr::cols_along(X), block.size = bigstatsr::block_size(nrow(X)) * 4 )
snp_king( X, ind.row = bigstatsr::rows_along(X), ind.col = bigstatsr::cols_along(X), block.size = bigstatsr::block_size(nrow(X)) * 4 )
X |
a bigstatsr::FBM.code256 matrix (as found in the |
ind.row |
An optional vector of the row indices that are used. If not specified, all rows are used. Don't use negative indices. |
ind.col |
An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices. |
block.size |
maximum number of columns read at once. |
The last step is not optimised yet, as it does the division of the num by the den all in memory (on my TODO list...).
Takes a q_matrix_list
object and returns a summary of the Q files in the
object based on the number of repeats for each K value.
## S3 method for class 'q_matrix_list' summary(object, ...)
## S3 method for class 'q_matrix_list' summary(object, ...)
object |
A |
... |
not currently used |
A summary of the number of repeats for each K value.
This function creates a summary of the merge report generated by
rbind_dry_run()
## S3 method for class 'rbind_report' summary(object, ..., ref_label = "reference", target_label = "target")
## S3 method for class 'rbind_report' summary(object, ..., ref_label = "reference", target_label = "target")
object |
a list generated by |
... |
unused (necessary for compatibility with generic function) |
ref_label |
the label for the reference dataset (defaults to "reference") |
target_label |
the label for the target dataset (defaults to "target") |
NULL (prints a summary to the console)
A theme to remove most plot decorations, matching the look of plots created with distruct.
theme_distruct()
theme_distruct()
gt_dapc
objectThis summarizes information about the components of a gt_dapc
from the
tidypopgen
package. The parameter matrix
determines which element is
returned.
## S3 method for class 'gt_dapc' tidy(x, matrix = "eigenvalues", ...)
## S3 method for class 'gt_dapc' tidy(x, matrix = "eigenvalues", ...)
x |
A |
matrix |
Character specifying which component of the DAPC should be tidied.
|
... |
Not used. Needed to match generic signature only. |
A tibble::tibble with columns depending on the component of DAPC being tidied.
If "scores"
each row in the
tidied output corresponds to the original data in PCA space. The columns
are:
row |
ID of the original observation (i.e. rowname from original data). |
LD |
Integer indicating a principal component. |
value |
The score of the observation for that particular principal component. That is, the location of the observation in PCA space. |
If matrix
is "loadings"
, each
row in the tidied output corresponds to information about the principle
components in the original space. The columns are:
row |
The variable labels (colnames) of the data set on which PCA was performed. |
LD |
An integer vector indicating the principal component. |
value |
The value of the eigenvector (axis score) on the indicated principal component. |
If "eigenvalues"
, the columns are:
LD |
An integer vector indicating the discriminant axis. |
std.dev |
Standard deviation (i.e. sqrt(eig/(n-1))) explained by
this DA (for compatibility with |
cumulative |
Cumulative variation explained by principal components up to this component (note that this is NOT phrased as a percentage of total variance, since many methods only estimate a truncated SVD. |
gt_pca
objectThis summarizes information about the components of a gt_pca
from the
tidypopgen
package. The parameter matrix
determines which element is
returned. Column names of the tidied output match those returned by
broom::tidy.prcomp, the tidier for the standard PCA objects returned
by stats::prcomp.
## S3 method for class 'gt_pca' tidy(x, matrix = "eigenvalues", ...)
## S3 method for class 'gt_pca' tidy(x, matrix = "eigenvalues", ...)
x |
A |
matrix |
Character specifying which component of the PCA should be tidied.
|
... |
Not used. Needed to match generic signature only. |
A tibble::tibble with columns depending on the component of PCA being tidied.
If "scores"
each row in the
tidied output corresponds to the original data in PCA space. The columns
are:
row |
ID of the original observation (i.e. rowname from original data). |
PC |
Integer indicating a principal component. |
value |
The score of the observation for that particular principal component. That is, the location of the observation in PCA space. |
If matrix
is "loadings"
, each
row in the tidied output corresponds to information about the principle
components in the original space. The columns are:
row |
The variable labels (colnames) of the data set on which PCA was performed. |
PC |
An integer vector indicating the principal component. |
value |
The value of the eigenvector (axis score) on the indicated principal component. |
If "eigenvalues"
, the columns are:
PC |
An integer vector indicating the principal component. |
std.dev |
Standard deviation (i.e. sqrt(eig/(n-1))) explained by
this PC (for compatibility with |
cumulative |
Cumulative variation explained by principal components up to this component (note that this is NOT phrased as a percentage of total variance, since many methods only estimate a truncated SVD. |
gt_pca_autoSVD()
augment_gt_pca
Takes a q_matrix
object, which is a matrix, and returns a tidied tibble.
## S3 method for class 'q_matrix' tidy(x, data, ...)
## S3 method for class 'q_matrix' tidy(x, data, ...)
x |
A Q matrix object (as returned by |
data |
An associated tibble (e.g. a |
... |
not currently used |
A tidied tibble