--- title: "nuggets: Get Started" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{nuggets} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r, include=FALSE} library(nuggets) library(dplyr) ``` # Introduction Package `nuggets` searches for patterns that can be described with formulae in the form of elementary conjunctions, which are called **conditions** in this text. The conditions are constructed from predicates, which represent data columns. The user may select the interpretation of conditions by selecting the underlying logic: - **crisp (i.e. Boolean, binary) logic**, where each predicate may be either true (1) or false (0), and the truth value of the condition is computed using the laws of classical Boolean algebra; or - **fuzzy logic**, where each predicate may have assigned a *truth degree* from the interval $[0, 1]$ and the truth degree of the conjunction is computed with a selected *triangular norm (t-norm)*. Package `nuggets` allows to work with three most common t-norms: *Goedel* (minimum), *Goguen* (product), and *Lukasiewicz*. Let $a, b \in [0, 1]$ be the truth degrees of two predicates. Goedel t-norm is defined as $\min(a, b)$, Goguen t-norm as $a \cdot b$, and Lukasiewicz t-norm as $\max(0, a + b - 1)$. Before analyzed by `nuggets`, the data columns that would serve as predicates in conditions have to be either dichotomized or transformed to fuzzy sets. The package provides functions for both transformations. See the section **Data Preparation** for more details. `nuggets` provides functions to search for patterns of pre-defined types, such as `dig_associations()` for association rules, `dig_paired_baseline_contrasts()` for contrast patterns on paired numeric variables, and `dig_correlations()` for conditional correlations. See the section **Pre-defined Patterns** for more details. The user may also define a custom function to evaluate the conditions and search for patterns of a different type. `dig()` function is a general function that allows to search for patterns of any type. `dig_grid()` function is a wrapper around `dig()` that allows to search for patterns defined by conditions and a pair of columns, whose combination is evaluated by the user-defined function. See the section **Custom Patterns** for more details. # Data Preparation ## Preparations of Crisp (Boolean) Predicates For patterns based on crisp conditions, the data columns that would serve as predicates in conditions have to be transformed to logical (`TRUE`/`FALSE`) data: - numeric columns have to be transformed to factors with a selected number of levels; - factors have to be transformed to dummy logical columns. Both operations can be done with the help of the `partition()` function. The `partition()` function requires the dataset as its first argument and a *tidyselect* selection expression to select the columns to be transformed. Factors and logical columns are transformed to dummy logical columns. For numeric columns, the `partition()` function requires the `.method` argument to specify the method of partitioning. The `"crisp"` method divides the range of values of the selected columns into intervals specified by the `.breaks` argument and codes the values into dummy logical columns. The `.breaks` argument is a numeric vector that specifies the border values of the intervals. For example, consider the `CO2` dataset from the `datasets` package: ```{r} head(CO2) ``` The `Plant`, `Type`, and `Treatment` columns are factors and they will be transformed to dummy logical columns without any special arguments added to the `partition()` function: ```{r} partition(CO2, Plant:Treatment) ``` The `conc` and `uptake` columns are numeric. For instance, we can split the `conc` column into four intervals: (-Inf, 175], (175, 350], (350, 675], and (675, Inf). The breaks are thus `c(-Inf, 175, 350, 675, Inf)`. ```{r} partition(CO2, conc, .method = "crisp", .breaks = c(-Inf, 175, 350, 675, Inf)) ``` Similarly, we can split the `uptake` column into three intervals: (-Inf, 10], (10, 20], and (20, Inf) by specifying the breaks `c(-Inf, 10, 20, Inf)`. The transformation of the whole `CO2` dataset to crisp predicates can be done as follows: ```{r} crispCO2 <- CO2 |> partition(Plant:Treatment) |> partition(conc, .method = "crisp", .breaks = c(-Inf, 175, 350, 675, Inf)) |> partition(uptake, .method = "crisp", .breaks = c(-Inf, 10, 20, Inf)) head(crispCO2) ``` Each call to the `partition()` function returns a tibble data frame with the selected columns transformed to dummy logical columns while the other columns remain unchanged. Each original factor column became replaced by a set of logical columns, all of which start with the original column name and are followed by the factor level name. For example, the `Type` column, which is a factor with two levels `Quebec` and `Mississippi`, was replaced by two logical columns: `Type=Quebec` and `Type=Mississippi`. Numeric columns were replaced by logical columns with names that indicate the interval to which the original value belongs. For example, the `conc` column was replaced by four logical columns: `conc=(-Inf,175]`, `conc=(175,350]`, `conc=(350,675]`, and `conc=(675,Inf)`. Other columns were transformed similarly: ```{r} colnames(crispCO2) ``` Now all the columns are logical and can be used as predicates in crisp conditions. ## Preparations of Fuzzy Predicates For patterns based on fuzzy conditions, the data columns that would serve as predicates in conditions have to be transformed to fuzzy predicates. The fuzzy predicate is represented by a vector of truth degrees from the interval $[0, 1]$. The truth degree of a predicate is the degree to which the predicate is true with 0 meaning that the predicate is false and 1 meaning that the predicate is true. A value between 0 and 1 indicates a partial truthfulness. In order to search for fuzzy patterns, the numeric input data columns have to be transformed to fuzzy predicates, i.e., to vectors of truth degrees from the interval $[0, 1]$. (Fuzzy methods allow to be used with logical columns too.) The transformation to fuzzy predicates can be done again with the help of the `partition()` function. Again, factors will be transformed to dummy logical columns. On the other hand, numeric columns will be transformed to fuzzy predicates. For that, the `partition()` function provides two fuzzy partitioning methods: `"triangle"` and `"raisedcos"`. The `"triangle"` method creates fuzzy sets with triangular membership functions, while the `"raisedcos"` method creates fuzzy sets with raised cosine membership functions. More advanced fuzzy partitioning of numeric columns may be achieved with the help of the [lfl](https://cran.r-project.org/package=lfl) package, which provides tools for definition of fuzzy sets of many types including fuzzy sets that model linguistic terms such as "very small", "extremely big" and so on. See the [`lfl` documentation](https://github.com/beerda/lfl/blob/master/vignettes/main.pdf) for more information. In the following example, both the `conc` and `uptake` columns are transformed to fuzzy sets with triangular membership functions. For that, the `partition()` function requires the `.breaks` argument to specify the shape of fuzzy sets. For `.method = "triangle"`, each consecutive triplet of values in the `.breaks` vector specifies a single triangular fuzzy set: the first and the last value of the triplet are the borders of the triangle, and the middle value is the peak of the triangle. For instance, the `conc` column's `.breaks` may be specified as `c(-Inf, 175, 350, 675, Inf)`, which creates three triangular fuzzy sets: `conc=(-Inf,175,350)`, `conc=(175,350,675)`, and `conc=(350,675,Inf)`. Similarly, the `uptake` column's `.breaks` may be specified as `c(-Inf, 18, 28, 37, Inf)`. The transformation of the whole `CO2` dataset to fuzzy predicates can be done as follows: ```{r, message=FALSE} fuzzyCO2 <- CO2 |> partition(Plant:Treatment) |> partition(conc, .method = "triangle", .breaks = c(-Inf, 175, 350, 675, Inf)) |> partition(uptake, .method = "triangle", .breaks = c(-Inf, 18, 28, 37, Inf)) head(fuzzyCO2) colnames(fuzzyCO2) ``` # Pre-defined Patterns `nuggets` provides a set of functions for searching for some best-known pattern types. These functions allow to process Boolean data, fuzzy data, or both. The result of these functions is always a tibble with patterns stored as rows. For more advance usage, which allows to search for custom patterns or to compute user-defined measures and statistics, see the section **Custom Patterns**. ### Search for Association Rules Association rules are rules of the form $A \Rightarrow B$, where $A$ is either Boolean or fuzzy condition in the form of conjunction, and $B$ is a Boolean or fuzzy predicate. Before continuing with the search for rules, it is advisable to create the so-called *vector of disjoints*. The vector of disjoints is a character vector with the same length as the number of columns in the analyzed dataset. It specifies predicates, which are mutually exclusive and should not be combined together in a single pattern's condition: columns with equal values in the disjoint vector will not appear in a single condition. Providing the vector of disjoints to the algorithm will speed-up the search as it makes no sense, e.g., to combine `Plant=Qn1` and `Plant=Qn2` in a condition `Plant=Qn1 & Plant=Qn2` as such formula is never true for any data row. The vector of disjoints can be easily created from the column names of the dataset, e.g., by obtaining the first part of column names before the equal sign, which is neatly provided by the `var_names()` function as follows: ```{r} disj <- var_names(colnames(fuzzyCO2)) print(disj) ``` The function `dig_associations` takes the analyzed dataset as its first parameter and a pair of `tidyselect` expressions to select the column names to appear in the left-hand (antecedent) and right-hand (consequent) side of the rule. The following command searches for associations rules, such that: - any column except those starting with "Treatment" is in the antecedent; - any column starting with "Treatment" is in the consequent; - the minimum support is 0.02 (support is the proportion of rows that satisfy the antecedent AND consequent)); - the minimum confidence is 0.8 (confidence is the proportion of rows satisfying the consequent GIVEN the antecedent is true). ```{r} result <- dig_associations(fuzzyCO2, antecedent = !starts_with("Treatment"), consequent = starts_with("Treatment"), disjoint = disj, min_support = 0.02, min_confidence = 0.8) ``` The result is a tibble with found rules. We may arrange it by support in descending order: ```{r} result <- arrange(result, desc(support)) print(result) ``` ## Conditional Correlations TBD (`dig_correlations`) ## Contrast Patterns TBD (`dig_contrasts`) # Custom Patterns The `nuggets` package allows to execute a user-defined callback function on each generated frequent condition. That way a custom type of patterns may be searched. The following example replicates the search for associations rules with the custom callback function. For that, a dataset has to be dichotomized and the disjoint vector created as in the **Data Preparation** section above: ```{r} head(fuzzyCO2) print(disj) ``` As we want to search for associations rules with some minimum support and confidence, we define the variables to hold that thresholds. We also need to define a callback function that will be called for each found frequent condition. Its purpose is to generate the rules with the obtained condition as an antecedent: ```{r} min_support <- 0.02 min_confidence <- 0.8 f <- function(condition, support, foci_supports) { conf <- foci_supports / support sel <- !is.na(conf) & conf >= min_confidence & !is.na(foci_supports) & foci_supports >= min_support conf <- conf[sel] supp <- foci_supports[sel] lapply(seq_along(conf), function(i) { list(antecedent = format_condition(names(condition)), consequent = format_condition(names(conf)[[i]]), support = supp[[i]], confidence = conf[[i]]) }) } ``` The callback function `f()` defines three arguments: `condition`, `support` and `foci_supports`. The names of the arguments are not random. Based on the argument names of the callback function, the searching algorithm provides information to the function. Here `condition` is a vector of indices representing the conjunction of predicates in a condition. By the predicate we mean the column in the source dataset. The `support` argument gets the relative frequency of the condition in the dataset. `foci_supports` is a vector of supports of special predicates, which we call "foci" (plural of "focus"), within the rows satisfying the condition. For associations rules, foci are potential rule consequents. Now we can run the digging for rules: ```{r} result <- dig(fuzzyCO2, f = f, condition = !starts_with("Treatment"), focus = starts_with("Treatment"), disjoint = disj, min_length = 1, min_support = min_support) ``` As we return a list of lists in the callback function, we have to flatten the first level of lists in the result and binding it into a data frame: ```{r} result <- result |> unlist(recursive = FALSE) |> lapply(as_tibble) |> do.call(rbind, args = _) |> arrange(desc(support)) print(result) ```