--- title: "Quality Control" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Quality Control} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library(ggplot2) ``` # Quality control for SNP datasets tidypopgen has two key functions to examine the quality of data across loci and across individuals: `qc_report_loci` and `qc_report_indiv`. This vignette uses a simulated data set to illustrate these methods of data cleaning. # Read data into gen_tibble format ```{r} library(tidypopgen) data <- gen_tibble(system.file("extdata/related/families.bed", package="tidypopgen"), quiet = TRUE, backingfile = tempfile(), valid_alleles = c("1","2")) ``` # Quality control for loci ```{r} loci_report <- qc_report_loci(data) summary(loci_report) ``` The output of `qc_report_loci` supplies minor allele frequency, rate of missingness, and a Hardy-Weinberg exact p-value for each SNP. These data can be visualised in autoplot : ```{r, fig.alt = "UpSet plot giving counts of snps over the threshold for: missingness, minor allele frequency, and Hardy-Weinberg equilibrium P-value"} autoplot(loci_report, type = "overview") ``` Using 'overview' provides an Upset plot, which is designed to show the intersection of different sets in the same way as a Venn diagram. SNPs can be divided into 'sets' that each pass predefined quality control threshold; a set of SNPs with missingness under a given threshold, a set of SNPs with MAF above a given threshold, and a set of SNPs with a Hardy-Weinberg exact p-value that falls above a given significance level. The upset plot then visualises our 961 SNPs within their respective sets. The number above the first bar indicates that, of our total SNPs, 609 occur in all 3 sets, meaning 609 SNPs pass all of our automatic QC measures. The combined total of the first and second bars represents the number of SNPs that pass our MAF and HWE thresholds, here, 956. The thresholds for each parameter, (level of missingness that is accepted etc) can be adjusted using the parameters provided in autoplot. For example: ```{r, fig.alt = "Upset plot as above, with adjusted thresholds"} autoplot(loci_report, type = "overview", miss_threshold = 0.03, maf_threshold = 0.02, hwe_p = 0.01) ``` To examine each QC measure in further detail, we can plot a different summary panel. ```{r, fig.alt = "Four panel plot, containing: a histogram of the proportion of missing data for snps with minor allele freqency above the threshold, a histogram of the proportion of missing data for snps with minor allele freqency below the threshold, a histogram of HWE exact test p-values, and a histogram of significant HWE exact test p-values"} autoplot(loci_report, type = "all", miss_threshold = 0.03, maf_threshold = 0.02, hwe_p = 0.01) ``` We can then begin to consider how to quality control this raw data set. Let's start by filtering SNPs according to their minor allele frequency. We can visualise the MAF distribution using: ```{r, fig.alt = "Histogram of minor allele frequency"} autoplot(loci_report, type = "maf") ``` Here we can see there are some monomorphic SNPs in the data set. Let's filter out loci with a minor allele frequency lower than 2%, by using `select_loci_if`. Here, we select all SNPs with a MAF greater than 2%. This operation is equivalent to plink --maf 0.02. ```{r} data <- data %>% select_loci_if(loci_maf(genotypes)>0.02) count_loci(data) ``` Following this, we can remove SNPs with a high rate of missingness. Lets say we want to remove SNPs that are missing in more than 5% of individuals, equivalent to using plink --geno 0.05 ```{r, fig.alt = "Histogram of the proportion of missing data"} autoplot(loci_report, type = "missing", miss_threshold = 0.05) ``` We can see here that most SNPs have low missingness, under our 5% threshold, some do, however, have missingness over our threshold. To remove these SNPs, we can again use `select_loci_if`. ```{r} data <- data %>% select_loci_if(loci_missingness(genotypes)<0.05) count_loci(data) ``` Finally, we may want to remove SNPs that show significant deviation from Hardy-Weinberg equilibrium, if our study design requires. To visualise SNPs with significant p-values in the Hardy-Weinberg exact test, we can again call autoplot: ```{r, fig.alt = "Histogram of significant HWE exact test p-values"} autoplot(loci_report,type = "significant hwe", hwe_p = 0.01) ``` Few SNPs in our data are significant, however there may be circumstances where we would want to cut out the most extreme cases, if these data were real, these cases could indicate genotyping errors. ```{r} data <- data %>% select_loci_if(loci_hwe(genotypes)>0.01) count_loci(data) ``` Once we have quality controlled the SNPs in our data, we can turn to individual samples. #Quality control for individuals ```{r} individual_report <- qc_report_indiv(data) summary(individual_report) ``` The output of `qc_report_indiv` supplies observed heterozygosity per individual, and rate of missingness per individual as standard. These data can also be visualised using autoplot: ```{r, fig.alt = "Scatter plot of missingness proportion and observed heterozygosity for each individual"} autoplot(individual_report) ``` Here we can see that most individuals have low missingness. If we wanted to filter individuals to remove those with more than 3% of their genotypes missing, we can use `filter`. ```{r} data <- data %>% filter(indiv_missingness(genotypes)<0.03) nrow(data) ``` And if we wanted to remove outliers with particularly high or low heterozygosity, we can again do so by using `filter`. As an example, here we remove observations that lie more than 3 standard deviations from the mean. ```{r} mean_val <- mean(individual_report$het_obs) sd_val <- stats::sd(individual_report$het_obs) lower <- mean_val - 3*(sd_val) upper <- mean_val + 3*(sd_val) data <- data %>% filter(indiv_het_obs(genotypes) > lower) data <- data %>% filter(indiv_het_obs(genotypes) < upper) nrow(data) ``` Next, we can look at relatedness within our sample. If the parameter `kings_threshold` is provided to `qc_report_indiv()`, then the report also calculates a KING coefficient of relatedness matrix using the sample. The `kings_threshold` is used to provide an output of the largest possible group with no related individuals in the third column `to_keep`. This boolean column recommends which individuals to remove (FALSE) and to keep (TRUE) to achieve an unrelated sample. ```{r} individual_report <- qc_report_indiv(data, kings_threshold = 0.177) summary(individual_report) ``` We can remove the recommended individuals by using: ```{r} data <- data %>% filter(id %in% individual_report$id & individual_report$to_keep == TRUE) ``` We can now view a summary of our cleaned data set again, showing that our data has reduced from 12 to 10 individuals. ```{r} summary(data) ``` # Linkage Disequilibrium For further analyses, it may be necessary to control for linkage in the data set. tidypopgen provides LD clumping. This option is similar to the --indep-pairwise flag in plink, but results in a more even distribution of loci when compared to LD pruning. To explore why clumping is preferable to pruning, see LD clumping requires a data set with no missingness. This means we need to create an imputed data set before LD pruning, which we can do quickly with `gt_impute_simple`. ```{r} imputed_data <- gt_impute_simple(data, method = "random") ``` In this example, if we want to remove SNPs with a correlation greater than 0.2 in windows of 10 SNPs at a time, we can set these parameters with `thr_r2` and `size` respectively. ```{r} to_keep_LD <- loci_ld_clump(imputed_data, thr_r2 = 0.2, size = 10) head(to_keep_LD) ``` `loci_ld_clump` provides a boolean vector the same length as our list of SNPs, telling us which to keep in the data set. We can then use this list to create a pruned version of our data: ```{r} ld_data <- imputed_data %>% select_loci_if(loci_ld_clump(genotypes, thr_r2 = 0.2, size = 10)) ``` # Save The benefit of operating on a `gen_tibble` is that each quality control step can be observed visually, and easily reversed if necessary. When we are happy with the quality of our data, we can create and save a final quality controlled version of our `gen_tibble` using `gt_save`. ```{r} gt_save(ld_data, file_name = tempfile()) ``` # Grouping data For some quality control measures, if you have a gen_tibble that includes multiple datasets you may want to group by population before running the quality control. This can be done using `group_by`. First, lets add some imaginary population data to our gen_tibble: ```{r} data <- data %>% mutate(population = rep(c("A", "B"), each = 5)) ``` We can then group by population and run quality control on each group: ```{r} grouped_loci_report <- data %>% group_by(population) %>% qc_report_loci() head(grouped_loci_report) ``` The loci report that we receive here will calculate Hardy-Weinberg equilibrium for each SNP within each population separately, providing a Bonferroni corrected p-value for each SNP. Similarly, we can run a quality control report for individuals within each population: ```{r} grouped_individual_report <- data %>% group_by(population) %>% qc_report_indiv(kings_threshold = 0.177) head(grouped_individual_report) ``` This is important when we have related individuals, as background population structure can affect the filtering of relatives.