---
title: "nuggets: Get Started"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{nuggets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r, include=FALSE}
library(nuggets)
library(dplyr)
```


# Introduction

Package `nuggets` searches for patterns that can be described with formulae in
the form of elementary conjunctions, which are called **conditions** in this
text. The conditions are constructed from predicates, which represent data
columns. The user may select the interpretation of conditions by selecting the
underlying logic:

- **crisp (i.e. Boolean, binary) logic**, where each predicate may be either
  true (1) or false (0), and the truth value of the condition is computed
  using the laws of classical Boolean algebra; or
- **fuzzy logic**, where each predicate may be assigned a *truth degree* from
  the interval $[0, 1]$ and the truth degree of the conjunction is computed
  with a selected *triangular norm (t-norm)*. Package `nuggets` supports the
  three most common t-norms: *Goedel* (minimum), *Goguen* (product), and
  *Lukasiewicz*. Let $a, b \in [0, 1]$ be the truth degrees of two predicates.
  The Goedel t-norm is defined as $\min(a, b)$, the Goguen t-norm as
  $a \cdot b$, and the Lukasiewicz t-norm as $\max(0, a + b - 1)$.
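As a small illustration of the three t-norms defined above, the following
base R sketch evaluates them on two example vectors of truth degrees. The
vectors `deg_a` and `deg_b` are made up for this illustration, and nothing here
calls `nuggets` itself.

```{r}
# Example truth degrees of two predicates (illustrative values only)
deg_a <- c(0.25, 0.75, 1)
deg_b <- c(0.5, 0.5, 0.5)

goedel      <- pmin(deg_a, deg_b)          # minimum
goguen      <- deg_a * deg_b               # product
lukasiewicz <- pmax(0, deg_a + deg_b - 1)  # bounded sum shifted by 1

data.frame(deg_a, deg_b, goedel, goguen, lukasiewicz)
```

When the truth degrees are restricted to 0 and 1, all three t-norms coincide
with the classical Boolean conjunction.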

Before being analyzed by `nuggets`, the data columns that serve as predicates in
conditions have to be either dichotomized or transformed to fuzzy sets. The
package provides functions for both transformations. See the section
**Data Preparation** for more details.

`nuggets` provides functions to search for patterns of pre-defined types, such
as `dig_associations()` for association rules, `dig_paired_baseline_contrasts()`
for contrast patterns on paired numeric variables, and `dig_correlations()`
for conditional correlations. See the section **Pre-defined Patterns** for more
details.

The user may also define a custom function to evaluate the conditions and
search for patterns of a different type. The `dig()` function is a general
function that searches for patterns of any type. The `dig_grid()` function is
a wrapper around `dig()` that searches for patterns defined by conditions and
a pair of columns, whose combination is evaluated by the user-defined
function. See the section **Custom Patterns** for more details.


# Data Preparation

## Preparation of Crisp (Boolean) Predicates

For patterns based on crisp conditions, the data columns that would serve as
predicates in conditions have to be transformed to logical (`TRUE`/`FALSE`)
data:

- numeric columns have to be transformed to factors with a selected number of
  levels;
- factors have to be transformed to dummy logical columns.

Both operations can be done with the help of the `partition()` function. The
`partition()` function requires the dataset as its first argument and a
*tidyselect* selection expression to select the columns to be transformed.
Factors and logical columns are transformed to dummy logical columns.

For numeric columns, the `partition()` function requires the `.method` argument
to specify the method of partitioning. The `"crisp"` method divides the range of
values of the selected columns into intervals specified by the `.breaks`
argument and codes the values into dummy logical columns. The `.breaks` argument
is a numeric vector that specifies the border values of the intervals.
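To make the interval coding concrete, the following sketch mimics the idea
with base R's `cut()`. It is only an analogy under the assumption of
right-closed intervals, not the implementation used by `partition()`, and the
values in `toy_conc` are made up for the illustration.

```{r}
# Illustrative numeric values (not taken from any dataset)
toy_conc <- c(95, 175, 250, 350, 500, 675, 1000)

# Assign each value to a right-closed interval (a, b] given by the breaks
toy_intervals <- cut(toy_conc, breaks = c(-Inf, 175, 350, 675, Inf))

# One logical (dummy) column per interval
toy_dummies <- sapply(levels(toy_intervals), function(lv) toy_intervals == lv)
toy_dummies
```

Each value falls into exactly one interval, so every row contains exactly one
`TRUE`.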

For example, consider the `CO2` dataset from the `datasets` package:

```{r}
head(CO2)
```

The `Plant`, `Type`, and `Treatment` columns are factors and they will be
transformed to dummy logical columns without any special arguments added to the
`partition()` function:

```{r}
partition(CO2, Plant:Treatment)
```

The `conc` and `uptake` columns are numeric. For instance, we can split the
`conc` column into four intervals: (-Inf, 175], (175, 350], (350, 675], and
(675, Inf). The breaks are thus `c(-Inf, 175, 350, 675, Inf)`.

```{r}
partition(CO2, conc, .method = "crisp", .breaks = c(-Inf, 175, 350, 675, Inf))
```

Similarly, we can split the `uptake` column into three intervals: (-Inf, 10],
(10, 20], and (20, Inf) by specifying the breaks `c(-Inf, 10, 20, Inf)`.

The transformation of the whole `CO2` dataset to crisp predicates can be done as
follows:

```{r}
crispCO2 <- CO2 |>
    partition(Plant:Treatment) |>
    partition(conc, .method = "crisp", .breaks = c(-Inf, 175, 350, 675, Inf)) |>
    partition(uptake, .method = "crisp", .breaks = c(-Inf, 10, 20, Inf))

head(crispCO2)
```

Each call to the `partition()` function returns a tibble data frame with the
selected columns transformed to dummy logical columns while the other columns
remain unchanged.

Each original factor column was replaced by a set of logical columns, whose
names consist of the original column name followed by the factor level
name. For example, the `Type` column, which is a factor with two levels
`Quebec` and `Mississippi`, was replaced by two logical columns: 
`Type=Quebec` and `Type=Mississippi`. Numeric columns were replaced by logical
columns with names that indicate the interval to which the original value
belongs. For example, the `conc` column was replaced by four logical columns:
`conc=(-Inf,175]`, `conc=(175,350]`, `conc=(350,675]`, and `conc=(675,Inf)`.
Other columns were transformed similarly:

```{r}
colnames(crispCO2)
```

Now all the columns are logical and can be used as predicates in crisp conditions.


## Preparation of Fuzzy Predicates

For patterns based on fuzzy conditions, the data columns that would serve as
predicates in conditions have to be transformed to fuzzy predicates. The
fuzzy predicate is represented by a vector of truth degrees from the interval
$[0, 1]$. The truth degree expresses the degree to which the predicate is
true: 0 means that the predicate is false, 1 means that it is true, and a value
between 0 and 1 indicates partial truth.

In order to search for fuzzy patterns, the numeric input data columns thus have
to be transformed to vectors of truth degrees from the interval $[0, 1]$.
(Fuzzy methods can be used with logical columns too.)

The transformation to fuzzy predicates can be done again with the help of the
`partition()` function. Again, factors will be transformed to dummy logical
columns. On the other hand, numeric columns will be transformed to fuzzy
predicates. For that, the `partition()` function provides two fuzzy partitioning
methods: `"triangle"` and `"raisedcos"`. The `"triangle"` method creates fuzzy
sets with triangular membership functions, while the `"raisedcos"` method creates
fuzzy sets with raised cosine membership functions.

More advanced fuzzy partitioning of numeric columns may be achieved with the
help of the [lfl](https://cran.r-project.org/package=lfl) package, which provides
tools for defining fuzzy sets of many types, including fuzzy sets that model
linguistic terms such as "very small", "extremely big" and so on. See the
[`lfl` documentation](https://github.com/beerda/lfl/blob/master/vignettes/main.pdf)
for more information.

In the following example, both the `conc` and `uptake` columns are
transformed to fuzzy sets with triangular membership functions. For that, the
`partition()` function requires the `.breaks` argument to specify the shape
of fuzzy sets. For `.method = "triangle"`, each consecutive triplet of values
in the `.breaks` vector specifies a single triangular fuzzy set: the first and
the last value of the triplet are the borders of the triangle, and the middle
value is the peak of the triangle. 

For instance, the `conc` column's `.breaks` may be specified as
`c(-Inf, 175, 350, 675, Inf)`, which creates three triangular fuzzy sets:
`conc=(-Inf,175,350)`, `conc=(175,350,675)`, and `conc=(350,675,Inf)`.
Similarly, the `uptake` column's `.breaks` may be specified as
`c(-Inf, 18, 28, 37, Inf)`.
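The triplet rule described above can be sketched in base R as follows. The
helper `tri_membership()` is hypothetical and is not part of `nuggets`. In
particular, treating an infinite border as a plateau of full membership is an
assumption made for this illustration, and the values in `toy_x` are made up.

```{r}
# Hypothetical helper: triangular membership given by left border, peak,
# and right border. ASSUMPTION: an infinite border yields full membership
# on that side of the peak.
tri_membership <- function(x, left, peak, right) {
  rising  <- if (is.infinite(left))  rep(1, length(x)) else (x - left) / (peak - left)
  falling <- if (is.infinite(right)) rep(1, length(x)) else (right - x) / (right - peak)
  pmax(0, pmin(rising, falling, 1))
}

toy_breaks <- c(-Inf, 175, 350, 675, Inf)
toy_x <- c(95, 175, 262.5, 350, 675, 1000)

# Each consecutive triplet of breaks defines one fuzzy set (one column)
toy_degrees <- sapply(seq_len(length(toy_breaks) - 2), function(i) {
  tri_membership(toy_x, toy_breaks[i], toy_breaks[i + 1], toy_breaks[i + 2])
})
rownames(toy_degrees) <- toy_x
toy_degrees
```

In this sketch, a value lying between two neighboring peaks belongs partially
to both neighboring fuzzy sets, and its membership degrees sum to 1.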

The transformation of the whole `CO2` dataset to fuzzy predicates can be done as
follows:

```{r, message=FALSE}
fuzzyCO2 <- CO2 |>
    partition(Plant:Treatment) |>
    partition(conc, .method = "triangle", .breaks = c(-Inf, 175, 350, 675, Inf)) |>
    partition(uptake, .method = "triangle", .breaks = c(-Inf, 18, 28, 37, Inf))

head(fuzzyCO2)
colnames(fuzzyCO2)
```


# Pre-defined Patterns

`nuggets` provides a set of functions that search for some of the best-known pattern types.
These functions can process Boolean data, fuzzy data, or both. The result of
these functions is always a tibble with patterns stored as rows. For more advanced
usage, which allows searching for custom patterns or computing user-defined measures
and statistics, see the section **Custom Patterns**.


## Search for Association Rules

Association rules are rules of the form $A \Rightarrow B$, where $A$ is either
a Boolean or a fuzzy condition in the form of a conjunction, and $B$ is
a Boolean or fuzzy predicate.

Before continuing with the search for rules, it is advisable to create the so-called
*vector of disjoints*. The vector of disjoints is a character vector with the same
length as the number of columns in the analyzed dataset. It specifies which predicates
are mutually exclusive and should not be combined in a single pattern's condition:
columns with equal values in the vector of disjoints will not appear together in a single
condition. Providing the vector of disjoints to the algorithm speeds up the search, as it
makes no sense, e.g., to combine `Plant=Qn1` and `Plant=Qn2` in a condition
`Plant=Qn1 & Plant=Qn2`, because such a formula is never true for any data row.

The vector of disjoints can be easily created from the column names of the dataset, e.g.,
by obtaining the first part of column names before the equal sign, which is neatly
provided by the `var_names()` function as follows:

```{r}
disj <- var_names(colnames(fuzzyCO2))
print(disj)
```
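The idea of taking the part of each column name before the equal sign can be
sketched in base R as well. This is only an approximation for illustration and
not necessarily how `var_names()` is implemented. The column names in
`toy_cols` are examples in the naming style produced by `partition()`.

```{r}
toy_cols <- c("Plant=Qn1", "Plant=Qn2", "Type=Quebec", "conc=(-Inf,175,350)")

# Drop everything from the first equal sign to the end of the name
toy_disj <- sub("=.*", "", toy_cols)
toy_disj
```

Columns that share the same value here, such as the two `Plant` columns, would
never be combined in a single condition.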

The `dig_associations()` function takes the analyzed dataset as its first argument and
a pair of *tidyselect* expressions that select the columns allowed to appear
on the left-hand (antecedent) and right-hand (consequent) side of the rule. The following
command searches for association rules such that:

- any column except those starting with "Treatment" may appear in the antecedent;
- any column starting with "Treatment" may appear in the consequent;
- the minimum support is 0.02 (support is the proportion of rows that satisfy both the
  antecedent and the consequent);
- the minimum confidence is 0.8 (confidence is the proportion of rows that satisfy the
  consequent among the rows that satisfy the antecedent).

```{r}
result <- dig_associations(fuzzyCO2,
                           antecedent = !starts_with("Treatment"),
                           consequent = starts_with("Treatment"),
                           disjoint = disj,
                           min_support = 0.02,
                           min_confidence = 0.8)
```

The result is a tibble with the rules found. We may arrange it by support in descending order:

```{r}
result <- arrange(result, desc(support))
print(result)
```
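The support and confidence measures can also be computed by hand. The
following base R sketch does so for one made-up antecedent and consequent,
first on crisp data and then on fuzzy data. The fuzzy variant assumes the Goedel
(minimum) t-norm and sums of truth degrees in place of row counts. This is an
assumption of the illustration and not a statement about the internals of
`dig_associations()`.

```{r}
# Crisp case: logical vectors saying whether each row satisfies each side
toy_ante <- c(TRUE, TRUE, TRUE, FALSE, FALSE)
toy_cons <- c(TRUE, TRUE, FALSE, TRUE, FALSE)

crisp_supp <- mean(toy_ante & toy_cons)               # rows satisfying both sides
crisp_conf <- sum(toy_ante & toy_cons) / sum(toy_ante)

# Fuzzy case (ASSUMPTION: minimum t-norm, sums of truth degrees)
fz_ante <- c(1, 0.5, 0.25, 0)
fz_cons <- c(0.5, 1, 0.75, 1)

fuzzy_supp <- mean(pmin(fz_ante, fz_cons))
fuzzy_conf <- sum(pmin(fz_ante, fz_cons)) / sum(fz_ante)

c(crisp_supp = crisp_supp, crisp_conf = crisp_conf,
  fuzzy_supp = fuzzy_supp, fuzzy_conf = fuzzy_conf)
```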


## Conditional Correlations

TBD (`dig_correlations()`)

## Contrast Patterns

TBD (`dig_paired_baseline_contrasts()`)


# Custom Patterns

The `nuggets` package can execute a user-defined callback function on each generated
frequent condition, which makes it possible to search for custom types of patterns. The
following example replicates the search for association rules with a custom callback function.
For that, the dataset has to be dichotomized or transformed to fuzzy predicates and the vector
of disjoints created as shown in the **Data Preparation** and **Pre-defined Patterns** sections
above:

```{r}
head(fuzzyCO2)
print(disj)
```

As we want to search for association rules with some minimum support and confidence, we define
variables to hold those thresholds. We also need to define a callback function that will be
called for each frequent condition found. Its purpose is to generate the rules with the
obtained condition as an antecedent:

```{r}
min_support <- 0.02
min_confidence <- 0.8

f <- function(condition, support, foci_supports) {
    # confidence of each candidate rule: condition => focus
    conf <- foci_supports / support
    # keep only the foci that pass both thresholds
    sel <- !is.na(conf) & conf >= min_confidence &
        !is.na(foci_supports) & foci_supports >= min_support
    conf <- conf[sel]
    supp <- foci_supports[sel]

    # one list per rule; an empty list if no focus passed
    lapply(seq_along(conf), function(i) {
      list(antecedent = format_condition(names(condition)),
           consequent = format_condition(names(conf)[[i]]),
           support = supp[[i]],
           confidence = conf[[i]])
    })
}
```

The callback function `f()` declares three arguments: `condition`, `support`, and `foci_supports`.
The argument names are significant: based on the argument names of the callback function,
the search algorithm decides which information to pass to it. Here, `condition` is a named vector
of indices representing the conjunction of predicates in a condition, where a predicate is a column
of the source dataset. The `support` argument receives the relative frequency of the condition in
the dataset. `foci_supports` is a vector of supports of special predicates, which we call "foci"
(plural of "focus"), within the rows satisfying the condition. For association rules, foci are
the potential rule consequents.
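To see what the filtering step inside the callback does, the following
base R sketch applies the same computation to hand-made inputs. The numbers and
the `toy_` variable names are invented for the illustration. In a real run, the
values are supplied by `dig()`.

```{r}
# Invented inputs resembling what the callback receives for one condition
toy_support <- 0.1
toy_foci_supports <- c("Treatment=chilled" = 0.09, "Treatment=nonchilled" = 0.01)

# Confidence of each candidate rule and the threshold filter
toy_conf <- toy_foci_supports / toy_support
toy_keep <- toy_conf >= 0.8 & toy_foci_supports >= 0.02

names(toy_conf)[toy_keep]
```

Only the foci that pass both thresholds become consequents of the returned
rules.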

Now we can run the digging for rules:

```{r}
result <- dig(fuzzyCO2,
              f = f,
              condition = !starts_with("Treatment"),
              focus = starts_with("Treatment"),
              disjoint = disj,
              min_length = 1,
              min_support = min_support)
```

As the callback function returns a list of lists, we have to flatten the first level
of lists in the result and bind it into a data frame:

```{r}
result <- result |>
  unlist(recursive = FALSE) |>
  lapply(as_tibble) |>
  do.call(rbind, args = _) |>
  arrange(desc(support))

print(result)
```
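The flattening step may be easier to follow on a small hand-made structure.
The following base R sketch uses an invented nested list, `toy_nested`, shaped
like the output of a callback that returns zero or more rules per condition. It
uses `as.data.frame()` instead of `as_tibble()` only to stay free of package
dependencies.

```{r}
# Invented result: one list per condition, each holding zero or more rules
toy_nested <- list(
  list(list(antecedent = "{a}", consequent = "{c}", support = 0.5)),
  list(),
  list(list(antecedent = "{b}", consequent = "{c}", support = 0.25),
       list(antecedent = "{a,b}", consequent = "{c}", support = 0.125))
)

# Remove the per-condition level, then bind the rules row by row
toy_flat <- unlist(toy_nested, recursive = FALSE)
toy_rules <- do.call(rbind, lapply(toy_flat, as.data.frame))
toy_rules
```

Conditions that produced no rule contribute an empty list, which disappears
during flattening.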