Download a copy of the vignette to follow along here: nmi_scores.Rmd
Normalized mutual information (NMI) scores were used in the original SNFtool
package as a unitless way to compare the relative importance of different
features in a final cluster solution. The premise of this approach is that if
a feature was very important, clustering on that feature alone should produce
a solution very similar to the one generated by clustering on all the features
together.
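To make that premise concrete, here is a minimal base R sketch. It is not the SNFtool or metasnf implementation: k-means stands in for SNF-based clustering, the data are simulated, and the helper `nmi()` is a hypothetical function that normalizes mutual information by the geometric mean of the two entropies (one common convention, assumed here).

```r
# Hypothetical helper: NMI between two cluster label vectors, normalized by
# the geometric mean of the two entropies (an assumed convention).
nmi <- function(a, b) {
  joint <- table(a, b) / length(a)          # joint label distribution
  pa <- rowSums(joint)                      # marginal of the first labelling
  pb <- colSums(joint)                      # marginal of the second labelling
  entropy <- function(p) {
    p <- p[p > 0]
    -sum(p * log(p))
  }
  expected <- outer(pa, pb)                 # joint distribution if independent
  nz <- joint > 0
  mi <- sum(joint[nz] * log(joint[nz] / expected[nz]))
  # Note: returns NaN if either labelling contains only one cluster
  mi / sqrt(entropy(pa) * entropy(pb))
}

set.seed(42)
# Simulated data: `signal` separates two groups, `noise` does not
x <- cbind(
  signal = c(rnorm(100, mean = 0), rnorm(100, mean = 10)),
  noise = rnorm(200)
)

# "All-feature" solution (k-means is a stand-in for SNF here)
all_features <- kmeans(x, centers = 2, nstart = 10)$cluster

# "Solo-feature" solutions, one per feature
solo_signal <- kmeans(x[, "signal", drop = FALSE], centers = 2, nstart = 10)$cluster
solo_noise <- kmeans(x[, "noise", drop = FALSE], centers = 2, nstart = 10)$cluster

nmi_signal <- nmi(all_features, solo_signal)
nmi_noise <- nmi(all_features, solo_noise)
print(c(signal = nmi_signal, noise = nmi_noise))
```

The feature that drives the all-feature solution should score close to 1, while the noise feature should score close to 0. NMI does not depend on how the clusters are numbered, so two solutions with swapped labels still count as identical.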
In the original SNFtool implementation of NMI scoring, the cluster solution
for the individual feature being assessed was always generated using squared
Euclidean distance, a K hyperparameter value of 20, an alpha hyperparameter
value of 0.5, and spectral clustering with the number of clusters chosen by
the best eigen-gap value among solutions spanning 2 to 5 clusters.
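The fixed pipeline described above can be sketched with SNFtool's exported building blocks. This is an illustration under assumptions, not a copy of SNFtool's internal NMI code: the single feature is simulated, and the alpha hyperparameter is assumed to correspond to the `sigma` argument of `affinityMatrix()`.

```r
library(SNFtool)

set.seed(42)
# Hypothetical single feature: 60 observations drawn from two groups
feature <- matrix(c(rnorm(30, mean = 0), rnorm(30, mean = 5)), ncol = 1)

# Squared Euclidean distance between all pairs of observations
sq_dist <- dist2(feature, feature)

# Affinity matrix with K = 20 neighbours and alpha (argument `sigma`) = 0.5
affinity <- affinityMatrix(sq_dist, K = 20, sigma = 0.5)

# Candidate cluster counts of 2 to 5; the first element of the returned list
# is the best eigen-gap estimate
n_clusters <- estimateNumberOfClustersGivenGraph(affinity, NUMC = 2:5)[[1]]

# Spectral clustering of the solo-feature affinity matrix
solo_solution <- spectralClustering(affinity, K = n_clusters)
table(solo_solution)
```

A solo-feature solution produced this way can then be compared against an all-feature solution with SNFtool's `calNMI()`.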
In contrast, the metasnf implementation leverages all the architectural
details and hyperparameters supplied in the original settings_matrix and
batch_snf() call to make the solo-feature and all-feature solutions as
comparable as possible.
The chunk below outlines how the primary NMI calculating function,
batch_nmi(), can be used.
library(metasnf)
data_list <- generate_data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)
set.seed(42)
settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 20,
    min_k = 20,
    max_k = 50
)
# Generation of 20 cluster solutions
solutions_matrix <- batch_snf(data_list, settings_matrix)
# Let's just calculate NMIs of the anxiety and depression data types for the
# first 5 cluster solutions to save time:
feature_nmis <- batch_nmi(data_list[4:5], solutions_matrix[1:5, ])
print(feature_nmis)
#> feature row_id_1 row_id_2 row_id_3 row_id_4 row_id_5
#> 1 cbcl_anxiety_r 0.08307759 0.3825622 0.5532495 0.4068634 0.2532882
#> 2 cbcl_depress_r 0.30514882 0.3348474 0.4058227 0.2307721 0.1486859
One important thing to note is that if the cluster space you initially set up
when calling batch_snf() relied on custom distance metrics, clustering
algorithms, or the automatic_standard_normalize parameter, you should use
those same values when calling batch_nmi() as well.
Another important note is that by default, batch_nmi() ignores the inc_*
columns of the settings matrix, i.e., no data types are dropped during
solo-feature cluster solution calculations. This can lead to an odd
interpretation if you view NMI as a direct reflection of contribution to the
final SNF output: a feature that was not part of a particular cluster solution
can still produce its own cluster solution with a very high NMI score relative
to that solution. If you wish to suppress the calculation of NMIs for features
that were not actually included in a particular SNF run (because they have a
value of 0 in the inclusion column), you can set the ignore_inclusions
parameter to FALSE.
Finally, if you’d like the NMI information to be presented in a transposed
format, you can do that too by setting transpose to FALSE.