| Type: | Package |
| Title: | Distance-Based Learning for Mixed-Type Data |
| Version: | 0.5.0 |
| Description: | Provides tools for constructing, computing, and using distance measures for numerical, categorical, and mixed-type data. The package implements a flexible framework in which continuous and categorical components can be combined under additive, commensurable, and association-aware specifications. Supported methods include classical distances such as Gower, Euclidean, Manhattan, and Mahalanobis-type distances; categorical dissimilarities such as simple matching, occurrence-frequency, and association-based measures; and mixed-type presets designed to reduce biases due to variable type, scale, distribution, redundancy, and number of categories. The package also provides scaling options, supervised and unsupervised distance constructions, leave-one-variable-out tools for distance-based variable importance, and integration with distance-based learning workflows such as nearest-neighbour prediction, partitioning around medoids, and spectral clustering. Methods are motivated by van de Velden, Iodice D'Enza, Markos, and Cavicchia (2026) <doi:10.1080/10618600.2026.2680181> and related work on categorical and mixed-type dissimilarities. |
| Imports: | aricode, cluster, clusterGeneration, data.table, dials, distances, dplyr, entropy, fastDummies, forcats, fpc, generics, ggplot2, kdml, magrittr, Matrix, parsnip, philentropy, purrr, readr, recipes, Rfast, rlang, rsample, stats, tibble, tidyr, tidyselect, tune |
| Depends: | R (≥ 4.5.0) |
| Suggests: | arules, clustMixType, FD, klaR, mclust, palmerpenguins, parallelDist, StatMatch, workflows |
| License: | GPL-3 |
| Encoding: | UTF-8 |
| LazyData: | true |
| NeedsCompilation: | no |
| Config/roxygen2/version: | 8.0.0 |
| Packaged: | 2026-06-09 13:48:59 UTC; Alfo |
| Author: | Alfonso Iodice D'Enza [aut, cre], Angelos Markos [aut], Michel van de Velden [aut], Carlo Cavicchia [aut] |
| Maintainer: | Alfonso Iodice D'Enza <iodicede@unina.it> |
| Repository: | CRAN |
| Date/Publication: | 2026-06-09 21:40:08 UTC |
Compare LOVO diagnostics across multiple distance specifications
Description
This function compares **leave-one-variable-out (LOVO)** diagnostics
across multiple distance definitions supported by manydist.
Usage
compare_lovo_mdist(
x,
methods,
dims = 2,
keep_dist = FALSE,
.progress = FALSE,
...
)
Arguments
x |
A data frame or tibble containing the predictors. |
methods |
A **named list** describing the distance specifications
to compare. Each element must be a list of arguments passed to lovo_mdist().
For example:
methods = list(
gower = list(preset = "gower"),
u_dep = list(preset = "unbiased_dependent"),
custom = list(
preset = "custom",
method_cat = "matching",
method_num = "std",
commensurable = TRUE
)
)
|
dims |
Number of dimensions used for the MDS configuration when computing congruence-based diagnostics. |
keep_dist |
Logical; if TRUE, the full and leave-one-variable-out distance matrices are stored in the returned object. |
.progress |
Logical; if TRUE, progress is reported while the methods are evaluated. |
... |
Additional arguments passed to lovo_mdist(). These may include optional clustering diagnostics, for example cluster_k = 3.
|
Details
For each distance specification, the function:
1. Computes the full mixed-type distance using mdist().
2. Recomputes the distance repeatedly, leaving out one variable at a time.
3. Measures the impact of each variable using metrics such as mean absolute deviation (MAD), congruence-based diagnostics, and, when requested, clustering-based agreement measures.
The results are combined across methods and returned as an
MDistLOVOCompare object, which supports
print(), summary(), and ggplot2::autoplot().
Value
An object of class MDistLOVOCompare containing:
- results
A tibble with one row per method-variable combination.
- methods
The list of distance specifications used.
- dims
Number of MDS dimensions used.
- n_obs
Number of observations in the dataset.
See Also
Examples
## Not run:
library(manydist)
library(palmerpenguins)
data <- penguins |>
dplyr::select(-species) |>
tidyr::drop_na()
cmp <- compare_lovo_mdist(
x = data,
methods = list(
gower = list(preset = "gower"),
u_dep = list(preset = "unbiased_dependent")
),
cluster_k = 3
)
summary(cmp)
autoplot(cmp, metric = "mad_importance")
autoplot(cmp, metric = "pam_importance")
## End(Not run)
Congruence coefficient between two configurations
Description
Computes the congruence coefficient between two data configurations using the Frobenius inner product of their pairwise distance matrices.
Usage
congruence_coeff(L1, L2)
Arguments
L1 |
A numeric matrix or data frame (rows = observations) |
L2 |
A numeric matrix or data frame with the same number of rows as L1. |
Details
The congruence coefficient is defined as
\[ \frac{\langle D_1, D_2 \rangle_F}{\sqrt{\langle D_1, D_1 \rangle_F \, \langle D_2, D_2 \rangle_F}} \]
where D_1 and D_2 are the pairwise distance matrices derived from
L1 and L2.
Value
A scalar in [-1,1] measuring similarity between the two configurations.
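As a rough illustration of the definition above, the coefficient can be written in a few lines of base R. This is a sketch, not the package's implementation: the name congruence_sketch() is hypothetical, and Euclidean distances between rows are assumed, which the documentation above does not specify. Because pairwise distances are non-negative, a coefficient computed this way cannot be negative.

```r
# Hypothetical helper (not part of manydist): congruence between two
# configurations via the Frobenius inner product of their distance matrices.
# Assumption: pairwise distances between rows are Euclidean.
congruence_sketch <- function(L1, L2) {
  D1 <- as.matrix(stats::dist(as.matrix(L1)))
  D2 <- as.matrix(stats::dist(as.matrix(L2)))
  stopifnot(nrow(D1) == nrow(D2))
  sum(D1 * D2) / sqrt(sum(D1^2) * sum(D2^2))
}

X <- as.matrix(iris[, 1:4])
congruence_sketch(X, 3 * X)     # a rescaled copy gives 1, up to rounding
congruence_sketch(X, X[, 1:2])  # dropping columns gives a value below 1
```

The coefficient is invariant to a common rescaling of either configuration, which is why it is convenient for comparing configurations that live on different scales.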
FIFA 21 Player Data - Dutch League
Description
The fifa_nl dataset contains information on players in the Dutch League from the FIFA 21 video game. This dataset includes various attributes of players, such as demographics, club details, skill ratings, and physical characteristics.
Usage
data("fifa_nl")
Format
A data frame with observations on various attributes describing the players.
- player_positions
Primary playing positions of the player.
- nationality
The country the player represents.
- team_position
Player's assigned position within their club.
- club_name
Name of the club the player is part of.
- work_rate
The player's work rate, describing defensive and attacking intensity.
- weak_foot
Skill rating for the player's non-dominant foot, ranging from 1 to 5.
- skill_moves
Skill moves rating, indicating technical skill and ability to perform complex moves, on a scale of 1 to 5.
- international_reputation
Player's reputation on an international scale, from 1 = local to 3 = global star.
- body_type
Body type of the player (Lean, Normal, Stocky).
- preferred_foot
Dominant foot of the player, either Left or Right.
- age
Age of the player in years.
- height_cm
Height of the player in centimeters.
- weight_kg
Weight of the player in kilograms.
- overall
Overall skill rating of the player out of 100.
- potential
Potential skill rating the player may achieve in the future.
- value_eur
Estimated market value of the player in Euros.
- wage_eur
Player's weekly wage in Euros.
- release_clause_eur
Release clause value in Euros, which other clubs must pay to buy out the player's contract.
- pace
Speed rating of the player out of 100.
- shooting
Shooting skill rating out of 100.
- passing
Passing skill rating out of 100.
- dribbling
Dribbling skill rating out of 100.
- defending
Defending skill rating out of 100.
- physic
Physicality rating out of 100.
Details
This dataset provides a snapshot of player attributes and performance indicators as represented in FIFA 21 for players in the Dutch League. It can be used to analyze player characteristics, compare skills across players, and explore potential relationships among variables such as age, position, and various skill ratings.
References
Stefano Leone. (2021). FIFA 21 Complete Player Dataset. Retrieved from https://www.kaggle.com/datasets/stefanoleone992/fifa-21-complete-player-dataset.
Examples
data(fifa_nl)
summary(fifa_nl)
Distance-based nearest-neighbour model
Description
'nearest_neighbor_dist()' defines a parsnip model specification for nearest-neighbour prediction using precomputed distance representations, such as those produced by [step_mdist()]. It can be used for classification or regression workflows in which predictors have first been transformed into distances to the training observations.
Usage
fit_knn_dist(x, y, k = 5, dist_fun = NULL, dist_args = list())
predict_knn_dist_class(object, new_data, type = c("class", "prob"))
predict_knn_dist_prob(object, new_data, ...)
predict_knn_dist_reg(object, new_data)
nearest_neighbor_dist(mode = "classification", neighbors = NULL)
## S3 method for class 'nearest_neighbor_dist'
tunable(x, ...)
Arguments
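x |
For 'tunable()', a 'nearest_neighbor_dist' model specification. For 'fit_knn_dist()', the training inputs: a precomputed distance matrix, or raw predictors when 'dist_fun' is supplied. |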
y |
A vector of outcomes. |
k |
Number of neighbours. |
dist_fun |
Optional distance function taking arguments 'x' and 'new_data'. If 'NULL', inputs are assumed to already be distance matrices. |
dist_args |
A named list of additional arguments passed to 'dist_fun'. |
object |
A fitted 'knn_dist' object created by 'fit_knn_dist()'. |
new_data |
A data frame or matrix of precomputed distances, or raw predictors when 'dist_fun' is supplied. |
type |
Prediction type. For classification, '"class"' returns class predictions and '"prob"' returns class probabilities. |
... |
Additional arguments passed to lower-level methods or currently not used. |
mode |
Character string specifying the model mode. Available values are '"classification"' and '"regression"'. |
neighbors |
Number of neighbours. This can be an integer or a tunable parameter, for example 'tune::tune()'. |
Details
This model is intended to be used together with [step_mdist()] in a [recipes::recipe()]. The recipe creates distance columns named 'dist_1', 'dist_2', and so on; 'nearest_neighbor_dist()' then applies k-nearest neighbours to that distance representation.
The model uses a manydist-specific parsnip engine. In the usual workflow, distances are computed by [step_mdist()] with 'output = "distance_to_training"'. The resulting distance columns are then passed to 'nearest_neighbor_dist()'.
Lower-level engine functions such as 'fit_knn_dist()' and 'predict_knn_dist_*()' are exported for parsnip registration, but users normally do not need to call them directly.
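To make the idea of k-nearest neighbours on a distance representation concrete, here is a minimal base R sketch. It is not the manydist engine: knn_from_dist_sketch() is a hypothetical name, and the function assumes a rectangular matrix with one row per new observation and one column per training observation, with majority voting and ties resolved in favour of the first factor level.

```r
# Hypothetical helper (not part of manydist): majority-vote k-NN on a
# rectangular distance matrix (rows = new observations,
# columns = training observations).
knn_from_dist_sketch <- function(D_new, y_train, k = 5) {
  D_new <- as.matrix(D_new)
  y_train <- as.factor(y_train)
  stopifnot(ncol(D_new) == length(y_train), k <= length(y_train))
  pred <- apply(D_new, 1, function(d) {
    nn <- order(d)[seq_len(k)]      # indices of the k closest training rows
    votes <- table(y_train[nn])     # class counts among those neighbours
    names(votes)[which.max(votes)]  # ties go to the first factor level
  })
  factor(pred, levels = levels(y_train))
}

x_train <- c(0, 0.1, 0.2, 5, 5.1, 5.2)
y_train <- c("a", "a", "a", "b", "b", "b")
x_new   <- c(0.05, 5.05)
D_new   <- abs(outer(x_new, x_train, "-"))  # 2 x 6 distances to training rows
knn_from_dist_sketch(D_new, y_train, k = 3)
```

Only the distances enter the prediction, so the same voting rule applies unchanged whether the matrix comes from a Gower, Euclidean, or any other mixed-type dissimilarity.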
Value
A parsnip model specification of class '"nearest_neighbor_dist"'.
See Also
[step_mdist()], [mdist()]
Examples
if (requireNamespace("palmerpenguins", quietly = TRUE) &&
    requireNamespace("workflows", quietly = TRUE)) {
data("penguins", package = "palmerpenguins")
penguins_small <- palmerpenguins::penguins |>
dplyr::select(
species, bill_length_mm, bill_depth_mm, flipper_length_mm,
body_mass_g, island, sex
) |>
tidyr::drop_na()
set.seed(123)
penguin_split <- rsample::initial_split(
penguins_small,
prop = 0.75,
strata = species
)
penguin_train <- rsample::training(penguin_split)
penguin_test <- rsample::testing(penguin_split)
rec <- recipes::recipe(species ~ ., data = penguin_train) |>
step_mdist(
recipes::all_predictors(),
preset = "gower",
output = "distance_to_training"
)
spec <- nearest_neighbor_dist(
mode = "classification",
neighbors = 5
) |>
parsnip::set_engine("manydist")
wf <- workflows::workflow() |>
workflows::add_recipe(rec) |>
workflows::add_model(spec)
fit <- workflows::fit(wf, data = penguin_train)
predict(fit, new_data = penguin_test) |>
dplyr::slice_head(n = 5)
}
Generate mixed-type clustering data with signal and noise
Description
Generates synthetic mixed datasets with controllable numerical and categorical signal/noise structure and balanced cluster sizes.
Usage
gen_mixed(
k_true,
clustSizeEq = 50,
numsignal = 2,
numnoise = 2,
catsignal = 2,
catnoise = 2,
q = 5,
q_err = 9,
numsep = 0.1,
catsep = 0.5,
seed = NULL,
error_type = c("normal", "chisq"),
error_df = 2,
error_scale = 1
)
Arguments
k_true |
Number of clusters |
clustSizeEq |
Observations per cluster |
numsignal |
Number of numerical signal variables |
numnoise |
Number of numerical noise variables |
catsignal |
Number of categorical signal variables |
catnoise |
Number of categorical noise variables |
q |
Number of categories for signal categorical variables |
q_err |
Number of categories for categorical noise |
numsep |
Separation for numerical signal |
catsep |
Separation for categorical signal |
seed |
Optional seed |
error_type |
Error distribution ("normal" or "chisq") |
error_df |
Degrees of freedom for chi-square noise |
error_scale |
Scale of noise |
Value
A list with elements df, X_num, X_cat, y, num_cols, and cat_cols.
Generate toy mixed datasets for the supplementary material (Figures 1 to 3)
Description
Generates toy mixed datasets for the supplementary material (Figures 1 to 3).
Usage
generate_dataset(
n,
porig,
pn,
pnnoise,
pcnoise,
sigma,
qoptions,
seed = NULL,
mode = "per_variable"
)
Arguments
n |
Number of observations |
porig |
Number of original informative continuous variables |
pn |
Number of continuous variables (total, before adding noise variables) |
pnnoise |
Number of extra numeric noise variables |
pcnoise |
Number of extra categorical noise variables |
sigma |
Standard deviation of the noise added to informative variables |
qoptions |
Number of bins for categorization (a vector if mode is "per_variable", a scalar if mode is "shared") |
seed |
Optional seed |
mode |
Either "per_variable" or "shared" |
Leave-one-variable-out diagnostics for distance-based variable importance
Description
Computes leave-one-variable-out (LOVO) diagnostics for a distance specification. The function first computes the full dissimilarity matrix using [mdist()]. It then removes one predictor at a time, recomputes the dissimilarity matrix, and compares each leave-one-variable-out matrix with the full one.
Usage
lovo_mdist(
x,
response = NULL,
...,
dims = 2,
keep_dist = FALSE,
cluster_k = NULL,
cluster_methods = c("pam", "hclust", "spectral"),
hclust_method = "average",
spectral_sigma = NULL,
spectral_nstart = 50,
response_used = TRUE
)
Arguments
x |
A data frame or object coercible to a tibble. Rows are observations and columns are variables used to compute the dissimilarity. |
response |
Optional response variable. It can be supplied as an unquoted column name or as a character string. When supplied and 'response_used = TRUE', it is passed to [mdist()] for response-aware distance construction. The response column is not treated as a predictor in the leave-one-variable-out loop. |
... |
Additional arguments passed to [mdist()], such as 'preset', 'method_cat', 'method_num', 'commensurable', or 'interaction'. |
dims |
Integer. Number of dimensions used by classical multidimensional scaling when computing congruence-based diagnostics. |
keep_dist |
Logical. If 'TRUE', store the full dissimilarity matrix and all leave-one-variable-out dissimilarity matrices in the returned object. |
cluster_k |
Optional integer. Number of clusters used when computing clustering-based LOVO diagnostics. If 'NULL', clustering diagnostics are not computed. |
cluster_methods |
Character vector specifying the clustering methods used for clustering-based diagnostics. Possible values are '"pam"', '"hclust"', and '"spectral"'. |
hclust_method |
Character string specifying the linkage method passed to [stats::hclust()] when '"hclust"' is included in 'cluster_methods'. |
spectral_sigma |
Optional numeric value for the Gaussian affinity bandwidth used by spectral clustering. If 'NULL', the default used by [spectral_dist()] is applied. |
spectral_nstart |
Integer. Number of random starts used by the k-means step in spectral clustering. |
response_used |
Logical. If 'TRUE', the response variable, when supplied, is used in the distance construction. If 'FALSE', the response column is removed before computing distances. |
Details
'lovo_mdist()' is useful for assessing how strongly each predictor contributes to a distance-based representation. A predictor is considered influential when removing it produces a large change in the dissimilarity matrix, the multidimensional scaling configuration, or an optional clustering partition.
The returned object contains several LOVO diagnostics. The main distance contribution is measured by the mean absolute difference between the full dissimilarity matrix and each leave-one-variable-out matrix ('mad_importance'). The normalized version is stored as 'relative_distance'.
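The mean absolute difference described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the code used by 'lovo_mdist()': lovo_mad_sketch() is a hypothetical name, the default distance is an unscaled Manhattan distance chosen only to keep the arithmetic transparent, and the normalization behind 'relative_distance' is not reproduced because it is not defined here.

```r
# Hypothetical helper (not part of manydist): mean absolute difference between
# the full distance and each leave-one-variable-out distance.
# Assumption: dist_fun returns a "dist" object over the rows of its input.
lovo_mad_sketch <- function(X,
                            dist_fun = function(Z) stats::dist(Z, method = "manhattan")) {
  X <- as.data.frame(X)
  d_full <- as.vector(dist_fun(X))
  vapply(names(X), function(v) {
    keep <- setdiff(names(X), v)                        # drop one variable
    d_lovo <- as.vector(dist_fun(X[, keep, drop = FALSE]))
    mean(abs(d_full - d_lovo))                          # change over all pairs
  }, numeric(1))
}

toy <- data.frame(a = c(0, 1, 2), b = c(0, 10, 20))
lovo_mad_sketch(toy)
```

In this toy example, removing 'b' changes the distances ten times as much as removing 'a', purely because 'b' is on a larger scale. That is the kind of scale-driven imbalance the scaling and commensurability options of the package are meant to address.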
The function also compares classical multidimensional scaling configurations computed from the full and leave-one-variable-out dissimilarities. These diagnostics are stored as 'mds_congruence' / 'cc_importance' and 'ac_importance', the latter corresponding to an alienation coefficient.
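A congruence-based comparison of two MDS configurations could look roughly like the sketch below. It assumes the standard definition of the alienation coefficient, sqrt(1 - c^2), where c is the congruence coefficient. How 'cc_importance' and 'ac_importance' are derived inside the package is not stated here, so mds_congruence_sketch() is a hypothetical helper and not a reimplementation.

```r
# Hypothetical helper (not part of manydist): congruence and alienation between
# classical MDS configurations of two dissimilarities on the same observations.
# Assumption: alienation = sqrt(1 - congruence^2).
mds_congruence_sketch <- function(d_full, d_lovo, dims = 2) {
  conf_full <- stats::cmdscale(d_full, k = dims)
  conf_lovo <- stats::cmdscale(d_lovo, k = dims)
  D1 <- as.vector(stats::dist(conf_full))  # distances within each configuration
  D2 <- as.vector(stats::dist(conf_lovo))
  cc <- sum(D1 * D2) / sqrt(sum(D1^2) * sum(D2^2))
  c(congruence = cc, alienation = sqrt(max(0, 1 - cc^2)))
}

X <- scale(iris[, 1:4])
d_full <- stats::dist(X)
d_lovo <- stats::dist(X[, -1])  # leave out the first variable
mds_congruence_sketch(d_full, d_lovo, dims = 2)
```

A congruence close to 1 (alienation close to 0) indicates that the left-out variable barely changes the low-dimensional configuration.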
If 'cluster_k' is supplied, the function additionally computes clustering partitions from the full and leave-one-variable-out dissimilarities and compares them using the adjusted Rand index. The corresponding importance measures are defined as '1 - ARI' and are stored as 'pam_importance', 'hclust_importance', or 'spectral_importance', depending on the selected clustering methods.
Clustering-based diagnostics require the suggested package 'mclust'.
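The '1 - ARI' idea can be illustrated without the package. The sketch below is an assumption-laden stand-in: ari_sketch() and pam_importance_sketch() are hypothetical names, the adjusted Rand index is hand-coded so the example stays self-contained (in practice 'mclust::adjustedRandIndex()' computes the same quantity), and plain Euclidean distances replace a mixed-type dissimilarity.

```r
# Hypothetical helper: adjusted Rand index between two label vectors.
ari_sketch <- function(a, b) {
  tab <- table(a, b)
  n_pairs  <- choose(length(a), 2)
  s_cells  <- sum(choose(tab, 2))
  s_rows   <- sum(choose(rowSums(tab), 2))
  s_cols   <- sum(choose(colSums(tab), 2))
  expected <- s_rows * s_cols / n_pairs
  (s_cells - expected) / ((s_rows + s_cols) / 2 - expected)
}

# Hypothetical helper: 1 - ARI between the PAM partition on the full distance
# and the PAM partition obtained after leaving out each variable in turn.
pam_importance_sketch <- function(X, k) {
  X <- as.data.frame(X)
  full <- cluster::pam(stats::dist(X), k = k, diss = TRUE)$clustering
  vapply(names(X), function(v) {
    keep <- setdiff(names(X), v)
    d_lovo <- stats::dist(X[, keep, drop = FALSE])
    1 - ari_sketch(full, cluster::pam(d_lovo, k = k, diss = TRUE)$clustering)
  }, numeric(1))
}

pam_importance_sketch(scale(iris[, 1:4]), k = 3)
```

A value of 0 means the partition is unchanged when the variable is removed; larger values mean the variable matters more for the clustering.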
Value
An object of class '"MDistLOVO"'. The main results are stored in the '$results' field as a tibble with one row per left-out variable. The object also has print, summary, and autoplot methods.
See Also
[mdist()], [compare_lovo_mdist()], [spectral_dist()]
Examples
if (requireNamespace("palmerpenguins", quietly = TRUE)) {
data("penguins", package = "palmerpenguins")
penguins_small <- palmerpenguins::penguins |>
dplyr::select(
species, bill_length_mm, bill_depth_mm, flipper_length_mm,
body_mass_g, island, sex
) |>
tidyr::drop_na()
# LOVO diagnostics for a Gower distance
res <- lovo_mdist(
penguins_small,
preset = "gower",
response = species,
response_used = FALSE
)
res
summary(res)
# Plot the relative distance contribution of each predictor
p <- res$autoplot(metric = "relative_distance", reorder = TRUE)
p
}
Create a LOVO method specification for 'compare_lovo_mdist()'
Description
Helper to build one method specification for [compare_lovo_mdist()]. This allows tidy-style specification of 'response', e.g. 'response = Name', while storing the method definition as a regular list.
Usage
lovo_method_spec(response = NULL, ...)
Arguments
response |
Optional response column, supplied either unquoted (e.g. 'Name') or quoted (e.g. '"Name"'). |
... |
Additional arguments passed on to [lovo_mdist()] through [compare_lovo_mdist()]. |
Value
A named list of arguments suitable for one element of the 'methods' argument in [compare_lovo_mdist()].
Examples
## Not run:
methods <- list(
tvd_sup = lovo_method_spec(
response = species,
preset = "custom",
method_cat = "tvd",
method_num = "std",
commensurable = TRUE,
response_used = TRUE
),
tvd_unsup = lovo_method_spec(
response = species,
preset = "custom",
method_cat = "tvd",
method_num = "std",
commensurable = TRUE,
response_used = FALSE
)
)
## End(Not run)
Build a recipe that computes mdist-based distance features
Description
This helper creates a tidymodels recipe using step_mdist(). It supports both mdist presets and custom param sets.
Usage
make_mdist_recipe(df, mdist_type, mdist_preset, param_set, outcome)
Arguments
df |
A data frame. |
mdist_type |
"preset" or "custom". |
mdist_preset |
Name of the preset (if mdist_type == "preset"). |
param_set |
A list of custom mdist arguments (if mdist_type == "custom"). |
outcome |
Name of the outcome variable. |
Value
A 'recipes::recipe()' object.
Mixed-type dissimilarities for distance-based learning
Description
Computes a dissimilarity object for numerical, categorical, or mixed-type data. The function combines continuous and categorical components according to either a predefined 'preset' or a user-defined custom specification.
Usage
mdist(
x,
new_data = NULL,
response = NULL,
method_cat = "tvd",
method_num = "std",
commensurable = TRUE,
ncomp = NULL,
threshold = NULL,
preset = "custom",
interaction = FALSE,
prop_nn = 0.1,
score = "ba",
decision = "prior_corrected",
gower_average = TRUE
)
Arguments
x |
A data frame or matrix containing the training observations. Columns can be numeric, factors, or a mixture of both. |
new_data |
Optional data frame or matrix containing new observations. If supplied, distances are computed from rows of 'new_data' to rows of 'x', producing a rectangular test-to-training dissimilarity matrix. |
response |
Optional response variable used for response-aware categorical dissimilarities. It can be supplied as an unquoted column name or as a character string. The response column is removed from the predictors before computing distances. |
method_cat |
Character string specifying the categorical-variable dissimilarity used when 'preset = "custom"'. Common values include '"matching"' and '"tvd"'. Use [all_dist_method_specs()] to inspect available methods. |
method_num |
Character string specifying the numerical-variable preprocessing used when 'preset = "custom"'. Available options include '"none"' for no preprocessing, '"std"' for standard-deviation scaling, '"range"' for range scaling, '"robust"' for inter-quartile-range-based scaling, and '"pc_scores"' for principal-component score scaling. |
commensurable |
Logical. If 'TRUE', dissimilarities are scaled so that the average contribution of each variable to the overall distance is equal to 1. |
ncomp |
Integer or 'NULL'. Number of principal components to retain when 'method_num = "pc_scores"'. If 'NULL', all available components are used unless 'threshold' is supplied and supported by the underlying method. |
threshold |
Numeric or 'NULL'. Optional cumulative variance threshold used when 'method_num = "pc_scores"'. |
preset |
Character string specifying a predefined distance specification. Available values include '"custom"', '"gower"', '"unbiased_dependent"', '"u_dep"', '"u_indep"', '"u_mix"', '"hl"', '"gudmm"', '"dkss"', '"mod_gower"', and '"euclidean"'. When 'preset' is not '"custom"', arguments such as 'method_cat', 'method_num', 'commensurable', and 'interaction' are handled by the preset and user-supplied values for those arguments are ignored. |
interaction |
Logical. If 'TRUE', adds an interaction-aware continuous-categorical component based on local predictive separability. |
prop_nn |
Numeric. Proportion of nearest neighbours used when 'interaction = TRUE'. |
score |
Character string specifying the score used when 'interaction = TRUE'. Available values include '"ba"' for balanced accuracy and '"logloss"'. |
decision |
Character string specifying the decision rule used when 'score = "ba"'. The default is '"prior_corrected"'. |
gower_average |
Logical; only used when 'preset = "gower"'. If 'TRUE', returns the standard Gower dissimilarity averaged over variables, matching the scale of [cluster::daisy()] with 'metric = "gower"'. If 'FALSE', returns the sum of per-variable Gower contributions, equivalent to multiplying the averaged Gower dissimilarity by the number of active variables. |
Details
'mdist()' is the main distance-construction function in 'manydist'. It can return ordinary train-train dissimilarities or rectangular test-to-training dissimilarities when 'new_data' is supplied. The resulting object stores both the dissimilarity matrix and metadata about the distance specification that was used.
With 'preset = "custom"', users manually choose the numerical preprocessing, categorical dissimilarity, commensurability, and optional interaction term.
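One way to picture the commensurability option is the additive sketch below, in which each variable's dissimilarity matrix is divided by its own mean before summing. This is only an illustration: commensurable_sketch() is a hypothetical name, and the choice of absolute differences for numeric columns, simple matching for categorical columns, and averaging over distinct pairs are all assumptions rather than the package's actual construction.

```r
# Hypothetical helper (not part of manydist): additive mixed-type distance in
# which every variable contributes 1 on average.
# Assumptions: absolute differences for numeric columns, simple matching for
# the others, mean taken over distinct pairs; no constant columns.
commensurable_sketch <- function(X) {
  X <- as.data.frame(X)
  n <- nrow(X)
  total <- matrix(0, n, n)
  for (v in names(X)) {
    x <- X[[v]]
    D_v <- if (is.numeric(x)) {
      abs(outer(x, x, "-"))
    } else {
      outer(as.character(x), as.character(x), "!=") * 1
    }
    total <- total + D_v / mean(D_v[upper.tri(D_v)])  # mean contribution = 1
  }
  total
}

D <- commensurable_sketch(iris)
mean(D[upper.tri(D)])  # equals the number of variables, up to rounding
```

Under this scaling no variable dominates the total simply because of its units or number of categories.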
The '"gower"' preset follows the usual Gower construction based on range scaling for continuous variables and matching dissimilarities for categorical variables. The 'gower_average' argument controls whether the result is averaged over variables or returned as a sum of variable-wise contributions.
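The relation between the averaged and summed forms can be checked with a small hand-written Gower computation. The helper gower_sketch() is hypothetical and handles only numeric variables and unordered factors without missing values; on such data the averaged version should agree with 'cluster::daisy()' using 'metric = "gower"'.

```r
# Hypothetical helper (not part of manydist): Gower dissimilarity from range
# scaling (numeric columns) and simple matching (other columns).
# Assumptions: no missing values, no constant numeric columns.
gower_sketch <- function(X, average = TRUE) {
  X <- as.data.frame(X)
  n <- nrow(X)
  total <- matrix(0, n, n)
  for (v in names(X)) {
    x <- X[[v]]
    total <- total + if (is.numeric(x)) {
      abs(outer(x, x, "-")) / diff(range(x))  # range-scaled difference
    } else {
      outer(as.character(x), as.character(x), "!=") * 1
    }
  }
  if (average) total / ncol(X) else total
}

g_avg <- gower_sketch(iris, average = TRUE)
g_sum <- gower_sketch(iris, average = FALSE)
all.equal(g_sum, g_avg * ncol(iris))  # sum = average x number of variables
```

The summed form is convenient when Gower is compared with other additive specifications, while the averaged form keeps values in the unit interval.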
The '"u_dep"', '"unbiased_dependent"', '"u_indep"', and '"u_mix"' presets are convenience specifications for unbiased or commensurable mixed-variable dissimilarities. The '"euclidean"' preset computes a Euclidean distance after one-hot encoding categorical variables. The '"gudmm"', '"dkss"', and '"mod_gower"' presets provide additional mixed-type distance constructions. Some presets currently support only train-train distances and will stop if 'new_data' is supplied.
Use [all_dist_method_specs()] to inspect the available distance components and method specifications.
Value
An object of class '"MDist"'. The object contains the computed dissimilarity in its '$distance' field, the selected 'preset', the training data, and a list of parameters describing the fitted distance specification. Square train-train dissimilarities are stored as '"dissimilarity"'/'"dist"' objects; rectangular test-to-training dissimilarities are stored as '"dissimilarity"'/'"matrix"' objects.
See Also
[step_mdist()], [all_dist_method_specs()]
Examples
if (requireNamespace("palmerpenguins", quietly = TRUE)) {
data("penguins", package = "palmerpenguins")
penguins_small <- palmerpenguins::penguins |>
dplyr::select(
bill_length_mm, bill_depth_mm, flipper_length_mm,
body_mass_g, species, island, sex
) |>
tidyr::drop_na()
# Gower distance on mixed-type data
d_gower <- mdist(penguins_small, preset = "gower")
d_gower
# Custom mixed-type specification
d_custom <- mdist(
penguins_small,
preset = "custom",
method_cat = "matching",
method_num = "std",
commensurable = TRUE
)
d_custom
# Test-to-training distances (from rows of new_data to rows of x)
penguin_split <- rsample::initial_split(penguins_small, prop = 0.75)
penguin_train <- rsample::training(penguin_split)
penguin_test <- rsample::testing(penguin_split)
d_new <- mdist(
penguin_train,
new_data = penguin_test,
preset = "gower"
)
d_new
}
Internal: summary method implementation for MDist R6 objects
Description
Internal: summary method implementation for MDist R6 objects
Usage
mdist_summary_impl(object, ...)
Arguments
object |
An 'MDist' object to summarize. |
... |
Additional arguments currently not used. |
Value
Invisibly returns 'object'.
PAM clustering specification based on manydist dissimilarities
Description
PAM clustering specification based on manydist dissimilarities
Usage
pam_dist(num_clusters = NULL)
Arguments
num_clusters |
Number of clusters. |
Value
A 'pam_dist_spec' object.
Spectral clustering specification based on manydist dissimilarities
Description
Spectral clustering specification based on manydist dissimilarities
Usage
spectral_dist(num_clusters = NULL, sigma = NULL, nstart = 50)
Arguments
num_clusters |
Number of clusters. |
sigma |
Optional bandwidth for the Gaussian affinity. If 'NULL', the median pairwise distance is used. |
nstart |
Number of random starts for k-means. |
Value
A 'spectral_dist_spec' object.
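For orientation, spectral clustering from a dissimilarity matrix with a Gaussian affinity and a median-distance bandwidth can be sketched in base R as follows. The details are assumptions and not taken from the package: spectral_sketch() is a hypothetical name, the affinity is taken to be exp(-d^2 / (2 * sigma^2)), and the embedding uses the symmetrically normalized affinity with row-normalized eigenvectors before k-means.

```r
# Hypothetical helper (not part of manydist): spectral clustering from a
# dissimilarity matrix.
# Assumptions: Gaussian affinity exp(-d^2 / (2 * sigma^2)); sigma defaults to
# the median pairwise distance; symmetric normalization of the affinity.
spectral_sketch <- function(D, k, sigma = NULL, nstart = 50) {
  D <- as.matrix(D)
  if (is.null(sigma)) sigma <- stats::median(D[upper.tri(D)])
  A <- exp(-D^2 / (2 * sigma^2))
  diag(A) <- 0
  deg <- rowSums(A)
  A_norm <- A / sqrt(outer(deg, deg))         # Deg^(-1/2) A Deg^(-1/2)
  eig <- eigen(A_norm, symmetric = TRUE)      # eigenvalues in decreasing order
  U <- eig$vectors[, seq_len(k), drop = FALSE]
  U <- U / sqrt(rowSums(U^2))                 # row-normalize the embedding
  stats::kmeans(U, centers = k, nstart = nstart)$cluster
}

set.seed(1)
x <- c(seq(0, 0.9, by = 0.1), seq(10, 10.9, by = 0.1))  # two separated groups
spectral_sketch(stats::dist(x), k = 2)
```

The 'nstart' argument matters because k-means on the embedding depends on random initial centres; several starts reduce the risk of a poor local solution.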
Examples
## Not run:
library(manydist)
library(palmerpenguins)
library(recipes)
library(generics)
data <- penguins |>
dplyr::select(-species) |>
tidyr::drop_na()
rec <- recipes::recipe(~ ., data = data) |>
step_mdist(all_predictors(), preset = "gower", output = "pairwise")
spec <- spectral_dist(num_clusters = 3)
fit_obj <- generics::fit(spec, recipe = rec, data = data)
print(fit_obj)
predict(fit_obj)
predict(fit_obj, type = "embed")
## End(Not run)
Spectral clustering from a distance matrix
Description
Spectral clustering from a distance matrix
Usage
spectral_from_dist(D, k, affinity_method = "selftune")
Arguments
D |
A distance matrix |
k |
Number of clusters |
affinity_method |
Method to build affinity |
Value
A vector of cluster labels
Add manydist dissimilarities to a recipe
Description
'step_mdist()' is a [recipes::recipe()] step that replaces selected predictors by a distance-based representation computed with [mdist()]. It is designed for distance-based learning workflows, especially nearest-neighbour prediction and clustering models that operate on dissimilarity matrices.
Usage
step_mdist(
recipe,
...,
role = "predictor",
trained = FALSE,
output = "distance_to_training",
preset = "custom",
method_cat = "tot_var_dist",
method_num = "none",
commensurable = FALSE,
ncomp = NULL,
threshold = NULL,
columns = NULL,
train_predictors = NULL,
preprocessor = NULL,
skip = FALSE,
id = recipes::rand_id("mdist")
)
Arguments
recipe |
A recipe object. |
... |
Selector(s) for the predictor columns to be used in [mdist()]. These are passed to [recipes::recipes_eval_select()] during preparation. |
role |
Role for the new distance columns. The default is '"predictor"'. |
trained |
Logical for recipes internals. Do not set manually. |
output |
Character string specifying the type of distance output. '"distance_to_training"' returns distances from the baked data to the training observations and is the usual choice for prediction workflows. '"pairwise"' returns the within-training pairwise dissimilarity matrix and is intended for training-only distance-based clustering workflows. |
preset |
Character string specifying the distance preset passed to [mdist()]. Available values include '"custom"', '"gower"', '"unbiased_dependent"', '"u_dep"', '"u_indep"', '"u_mix"', '"hl"', '"gudmm"', '"dkss"', '"mod_gower"', and '"euclidean"'. |
method_cat |
Character string specifying the categorical-variable dissimilarity passed to [mdist()] when 'preset = "custom"'. Common values include '"matching"' and '"tvd"'. Use [all_dist_method_specs()] to inspect available methods. |
method_num |
Character string specifying the numerical-variable preprocessing passed to [mdist()] when 'preset = "custom"'. Available options include '"none"' for no preprocessing, '"std"' for standard-deviation scaling, '"range"' for range scaling, '"robust"' for inter-quartile-range-based scaling, and '"pc_scores"' for principal-component score scaling. |
commensurable |
Logical. If 'TRUE', dissimilarities are scaled so that the average contribution of each variable to the overall distance is equal to 1, when supported by the selected distance specification. |
ncomp |
Integer or 'NULL'. Number of principal components to retain when 'method_num = "pc_scores"'. If 'NULL', all available components are used unless 'threshold' is supplied and supported by the underlying method. |
threshold |
Numeric or 'NULL'. Optional cumulative variance threshold used when 'method_num = "pc_scores"'. |
columns |
Names of columns selected at prep time. Used internally by recipes. |
train_predictors |
Training predictors stored at prep time. Used internally by recipes to compute distances from new observations to the training observations. |
preprocessor |
Internal fitted manydist preprocessor. |
skip |
Logical. Standard recipes argument indicating whether the step should be skipped when baking new data. |
id |
Character string. Unique step identifier. |
Details
The step can produce either distances from new observations to the training observations, or the within-training pairwise dissimilarity matrix. The former is the usual choice for supervised prediction workflows; the latter is useful for distance-based clustering workflows fitted on the training data.
During [recipes::prep()], 'step_mdist()' stores the selected training predictors and fits the internal manydist preprocessor. During [recipes::bake()], the selected predictors are removed and replaced by distance columns named 'dist_1', 'dist_2', and so on.
With 'output = "distance_to_training"', baking the training data returns the training pairwise distances, while baking new data returns distances from each new observation to each training observation. This rectangular representation is suitable for nearest-neighbour prediction models.
With 'output = "pairwise"', the step returns the within-training pairwise dissimilarity matrix. Baking genuinely new data is not supported in this mode, because the output is intended for training-only clustering workflows such as [pam_dist()] or [spectral_dist()].
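The shape of the distance-to-training representation can be illustrated with a base R sketch. It is not what 'step_mdist()' does internally: dist_to_training_sketch() is a hypothetical name and Euclidean distance on numeric columns stands in for a mixed-type dissimilarity. The point is only that each baked row carries one 'dist_*' column per training observation.

```r
# Hypothetical helper (not part of manydist): rectangular distances from new
# rows to training rows, with columns named dist_1, dist_2, ...
# Assumption: Euclidean distance on numeric columns only.
dist_to_training_sketch <- function(train, new_data) {
  train <- as.matrix(train)
  new_data <- as.matrix(new_data)
  n_new <- nrow(new_data)
  n_train <- nrow(train)
  D_all <- as.matrix(stats::dist(rbind(new_data, train)))
  D <- D_all[seq_len(n_new), n_new + seq_len(n_train), drop = FALSE]
  dimnames(D) <- list(NULL, paste0("dist_", seq_len(n_train)))
  as.data.frame(D)
}

train <- iris[1:100, 1:4]
new   <- iris[101:105, 1:4]
out   <- dist_to_training_sketch(train, new)
dim(out)  # 5 rows (new observations), 100 columns (training observations)
```

Because the number of columns equals the number of training rows, the representation grows with the training set, which is worth keeping in mind for large data.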
Value
An updated recipe with a manydist step.
See Also
[mdist()], [nearest_neighbor_dist()], [pam_dist()], [spectral_dist()], [all_dist_method_specs()]
Examples
if (requireNamespace("palmerpenguins", quietly = TRUE)) {
data("penguins", package = "palmerpenguins")
penguins_small <- palmerpenguins::penguins |>
dplyr::select(
species, bill_length_mm, bill_depth_mm, flipper_length_mm,
body_mass_g, island, sex
) |>
tidyr::drop_na()
# Distance-to-training representation for prediction workflows
rec <- recipes::recipe(species ~ ., data = penguins_small) |>
step_mdist(
recipes::all_predictors(),
preset = "gower",
output = "distance_to_training"
)
rec_prep <- recipes::prep(rec, training = penguins_small)
baked <- recipes::bake(rec_prep, new_data = penguins_small)
baked |> dplyr::slice_head(n = 5)
# Pairwise representation for clustering workflows
rec_pairwise <- recipes::recipe(~ ., data = penguins_small) |>
step_mdist(
recipes::all_predictors(),
preset = "gower",
output = "pairwise"
)
rec_pairwise_prep <- recipes::prep(rec_pairwise, training = penguins_small)
pairwise_dist <- recipes::bake(rec_pairwise_prep, new_data = penguins_small)
pairwise_dist
}