---
title: "RVenn: An R package for set operations on multiple sets"
author: "Turgut Yigit Akyol"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{RVenn: An R package for set operations on multiple sets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

This tutorial shows how to use `RVenn`, a package for working with multiple sets. The base R functions (`intersect`, `union` and `setdiff`) only work with two sets. The pipe operator `%>%` from `magrittr` can chain these calls, but for many sets this becomes tedious. The `reduce` function from the `purrr` package offers a cleaner solution, and it is the function this package uses for its set operations. The functions `overlap`, `unite` and `discern` abstract away the details, so you only need to construct the universe and select the sets to operate on by index or by set name.

Further, `ggvenn` draws a Venn diagram for two or three sets. As the name suggests, `ggvenn` is based on `ggplot2`, so it is a neat way to show the relationships among a small number of sets. For many sets, it is much better to use [UpSet](http://caleydo.org/tools/upset/) or the `setmap` function provided by this package. Finally, the `enrichment_test` function calculates the p-value of an overlap between two sets. The usage of all these functions is shown below.

## Creating toy data

This chunk of code creates 10 sets with sizes ranging from 5 to 25.

```{r, message=FALSE}
library(purrr)
library(RVenn)
library(ggplot2)
```

```{r}
set.seed(42)
toy = map(sample(5:25, replace = TRUE, size = 10),
          function(x) sample(letters, size = x))
toy[1:3]  # First 3 of the sets.
```

### Construct the Venn object

```{r}
toy = Venn(toy)
```

## Set operations

### Intersection

Intersection of all sets:

```{r}
overlap(toy)
```

Intersection of selected sets (chosen by set name or by index, respectively):

```{r}
overlap(toy, c("Set_1", "Set_2", "Set_5", "Set_8"))
```

```{r}
overlap(toy, c(1, 2, 5, 8))
```

### Pairwise intersections

```{r}
overlap_pairs(toy, slice = 1:4)
```

### Union

Union of all sets:

```{r}
unite(toy)
```

Union of selected sets (chosen by set name or by index, respectively):

```{r}
unite(toy, c("Set_3", "Set_8"))
```

```{r}
unite(toy, c(3, 8))
```

### Pairwise unions

```{r}
unite_pairs(toy, slice = 1:4)
```

### Set difference

```{r}
discern(toy, 1, 8)
```

```{r}
discern(toy, "Set_1", "Set_8")
```

```{r}
discern(toy, c(3, 4), c(7, 8))
```

### Pairwise differences

```{r}
discern_pairs(toy, slice = 1:4)
```

## Venn Diagram

For two sets:

```{r, fig.height=5, fig.width=8, fig.retina=3}
ggvenn(toy, slice = c(1, 5))
```

For three sets:

```{r, fig.height=8, fig.width=8, fig.retina=3}
ggvenn(toy, slice = c(3, 6, 8))
```

## Heatmap

```{r, fig.height=8, fig.width=8, fig.retina=3}
setmap(toy)
```

Without clustering:

```{r, fig.height=8, fig.width=8, fig.retina=3}
setmap(toy, element_clustering = FALSE, set_clustering = FALSE)
```

## Enrichment test

```{r, fig.width=8, fig.height=5, fig.retina=3}
er = enrichment_test(toy, 6, 7)
er$Significance

# Histogram of the null distribution; the dashed line marks the observed overlap.
qplot(er$Overlap_Counts, geom = "blank") +
  geom_histogram(fill = "lemonchiffon4", bins = 8, color = "black") +
  geom_vline(xintercept = length(overlap(toy, c(6, 7))), color = "firebrick2",
             size = 2, linetype = "dashed", alpha = 0.7) +
  ggtitle("Null Distribution") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(name = "Overlap Counts") +
  scale_y_continuous(name = "Frequency")
```

The test above is, of course, not very meaningful because the sets were created at random; therefore, we get a high p-value. However, when you are working with actual data, e.g.
to check whether a motif is enriched in the promoter regions of the genes in a gene set, you can use this test. In that case, `set1` is the gene set of interest, `set2` is the set of all genes in the genome in which the motif is found, and `univ` is the set of all genes in the genome.
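
To make the roles of `set1`, `set2` and `univ` concrete, here is a minimal base R sketch of the general idea behind an overlap enrichment test. It is an illustration under stated assumptions, not the implementation of `enrichment_test`: the gene names, the set sizes and the variables `gene_set`, `motif_genes` and `null_counts` are all hypothetical. The sketch compares the observed overlap with overlaps obtained from random gene sets of the same size, and cross-checks the result against the hypergeometric tail probability from `phyper`.

```{r}
# Hypothetical universe of 1000 gene IDs (not taken from real data).
set.seed(1)
univ = paste0("gene", 1:1000)

# Plays the role of `set2`: genes in which the motif is found.
motif_genes = sample(univ, 200)

# Plays the role of `set1`: a gene set deliberately enriched for the motif
# (30 genes with the motif, 20 without).
gene_set = c(sample(motif_genes, 30),
             sample(setdiff(univ, motif_genes), 20))

observed = length(intersect(gene_set, motif_genes))

# Null distribution: overlaps of random gene sets of the same size with `set2`.
null_counts = replicate(2000, {
  length(intersect(sample(univ, length(gene_set)), motif_genes))
})

# Empirical p-value: fraction of random overlaps at least as large as observed.
p_perm = mean(null_counts >= observed)

# Analytic counterpart: P(overlap >= observed) under the hypergeometric model.
p_hyper = phyper(observed - 1,
                 length(motif_genes),
                 length(univ) - length(motif_genes),
                 length(gene_set),
                 lower.tail = FALSE)

c(observed = observed, p_perm = p_perm, p_hyper = p_hyper)
```

Because the gene set here was built to be enriched, both p-values should come out small, in contrast to the randomly generated toy sets above.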