% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/clustering.R
\name{clustering}
\alias{clustering}
\title{Clustering analysis}
\usage{
clustering(
  .data,
  ...,
  by = NULL,
  scale = FALSE,
  selvar = FALSE,
  verbose = TRUE,
  distmethod = "euclidean",
  clustmethod = "average",
  nclust = NA
)
}
\arguments{
\item{.data}{The data to be analyzed. It can be a data frame, possibly with
grouped data passed from \code{\link[dplyr:group_by]{dplyr::group_by()}}.}

\item{...}{The variables in \code{.data} used to compute the distances. If no
variables are supplied, all the numeric variables in \code{.data} are used.}

\item{by}{One variable (factor) to compute the function by. It is a shortcut
to \code{\link[dplyr:group_by]{dplyr::group_by()}}. To compute the statistics by more than
one grouping variable, use that function.}

\item{scale}{Should the data be scaled before computing the distances?
Defaults to \code{FALSE}. If \code{TRUE}, each observation is divided by the
standard deviation of its variable, \mjseqn{Z_{ij} = X_{ij} / sd_j}.}

\item{selvar}{Logical argument. Defaults to \code{FALSE}. If \code{TRUE}, a
variable selection algorithm is run. See the section
\strong{Details} for additional information.}

\item{verbose}{Logical argument. If \code{TRUE} (default) then the results
for variable selection are shown in the console.}

\item{distmethod}{The distance measure to be used. This must be one of
\code{'euclidean'}, \code{'maximum'}, \code{'manhattan'},
\code{'canberra'}, \code{'binary'}, \code{'minkowski'}, \code{'pearson'},
\code{'spearman'}, or \code{'kendall'}. The last three are
correlation-based distances.}

\item{clustmethod}{The agglomeration method to be used. This should be one of
\code{'ward.D'}, \code{'ward.D2'}, \code{'single'}, \code{'complete'},
\code{'average'} (= UPGMA), \code{'mcquitty'} (= WPGMA), \code{'median'} (=
WPGMC) or \code{'centroid'} (= UPGMC).}

\item{nclust}{The number of clusters to be formed. Defaults to \code{NA}.}
}
\value{
\itemize{
\item \strong{data} The data that was used to compute the distances.
\item \strong{cutpoint} The cutpoint of the dendrogram according to Mojena (1977).
\item \strong{distance} The matrix with the distances.
\item \strong{de} The distances in an object of class \code{dist}.
\item \strong{hc} The hierarchical clustering.
\item \strong{Sqt} The total sum of squares.
\item \strong{tab} A table with the clusters and similarity.
\item \strong{clusters} The sum of squares and the mean of the clusters for each
variable.
\item \strong{cofgrap} If \code{selvar = TRUE}, \code{cofgrap} is a
ggplot2-based graphic showing the cophenetic correlation for each model
(with a different number of variables). Otherwise, a \code{NULL} object.
\item \strong{statistics} If \code{selvar = TRUE}, \code{statistics} shows
the summary of the models fitted with different numbers of variables,
including the cophenetic correlation, Mantel's correlation with the original
distances (all variables), and the p-value associated with Mantel's
test. Otherwise, a \code{NULL} object.
}
}
\description{
\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#stable}{\figure{lifecycle-stable.svg}{options: alt='[Stable]'}}}{\strong{[Stable]}}
\loadmathjax
Performs clustering analysis with selection of variables.
}
\details{
When \code{selvar = TRUE}, a variable selection algorithm is executed. The
objective is to select the group of variables that contributes most to
explaining the variability of the original data. The selection is based on the
eigenvalues and eigenvectors of the correlation matrix, following these steps.
\enumerate{
\item Compute the distance matrix and the cophenetic correlation with the original
variables (all numeric variables in the dataset);
\item Compute the eigenvalues and eigenvectors of the correlation matrix between
the variables;
\item Delete the variable with the largest weight (largest element of the
eigenvector associated with the smallest eigenvalue);
\item Compute the distance matrix and the cophenetic correlation with the remaining
variables;
\item Compute Mantel's correlation between the obtained distance matrix and
the original distance matrix;
\item Iterate steps 2 to 5 \emph{p} - 2 times, where \emph{p} is the number of original
variables.
}

At the end of the \emph{p} - 2 iterations, a summary of the models is returned.
The distance is calculated with the variables that generated the model with
the largest cophenetic correlation. I suggest a careful evaluation aimed at
choosing a parsimonious model, i.e., the one with the fewest
variables that still presents an acceptable cophenetic correlation and high
similarity with the original distances.
}
\examples{
\donttest{
library(metan)

# All rows and all numeric variables from data
d1 <- clustering(data_ge2)
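
# A further call for illustration (the object name d1b is arbitrary):
# another distance measure and agglomeration method, both taken from the
# values listed in the argument descriptions
d1b <- clustering(data_ge2,
                  distmethod = "manhattan",
                  clustmethod = "complete")

# Dendrogram cutpoint according to Mojena (1977), stored in the result
d1b$cutpoint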

# Based on the mean for each genotype
mean_gen <-
 data_ge2 \%>\%
 mean_by(GEN) \%>\%
 column_to_rownames("GEN")

d2 <- clustering(mean_gen)


# Select variables to compute the distances
d3 <- clustering(mean_gen, selvar = TRUE)
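
# A minimal base-R sketch of the selection steps listed in Details.
# This is an illustration only, not the code run by clustering():
# the helper name select_vars_sketch() is hypothetical, the weight of a
# variable is assumed to be the absolute value of its eigenvector element,
# the Mantel correlation is taken as the Pearson correlation between the
# two distance vectors, and no permutation p-value is computed.
select_vars_sketch <- function(x,
                               distmethod = "euclidean",
                               clustmethod = "average") {
  x <- as.data.frame(x)
  # Step 1: distances and cophenetic correlation with all variables
  d_full <- dist(x, method = distmethod)
  res <- data.frame(
    nvar = ncol(x),
    excluded = NA_character_,
    cophenetic = cor(d_full, cophenetic(hclust(d_full, method = clustmethod))),
    mantel = 1
  )
  # Step 6: repeat steps 2 to 5 p - 2 times
  for (i in seq_len(ncol(x) - 2)) {
    # Step 2: eigenvalues (sorted in decreasing order) and eigenvectors
    eig <- eigen(cor(x), symmetric = TRUE)
    # Step 3: drop the variable with the largest weight in the
    # eigenvector associated with the smallest eigenvalue
    idx <- which.max(abs(eig$vectors[, ncol(x)]))
    dropped <- names(x)[idx]
    x <- x[, -idx, drop = FALSE]
    # Steps 4 and 5: distances, cophenetic and Mantel correlations
    d_red <- dist(x, method = distmethod)
    res <- rbind(res, data.frame(
      nvar = ncol(x),
      excluded = dropped,
      cophenetic = cor(d_red, cophenetic(hclust(d_red, method = clustmethod))),
      mantel = cor(d_full, d_red)
    ))
  }
  res
}

# The built-in mtcars data keeps the sketch independent of the objects
# above; any data frame with only numeric columns could be passed instead
sel_sketch <- select_vars_sketch(datasets::mtcars[, 1:7])
sel_sketch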

# Compute the distances with standardized data
# Define 4 clusters
d4 <- clustering(data_ge,
                 by = ENV,
                 scale = TRUE,
                 nclust = 4)
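
# The scaling described for scale = TRUE, sketched in base R on a small
# made-up matrix (an illustration under that description, not the package
# code): each column is divided by its standard deviation, without centering
x_demo <- cbind(a = c(1, 2, 3, 4), b = c(10, 30, 20, 40))
z_demo <- sweep(x_demo, 2, apply(x_demo, 2, sd), "/")

# After scaling, every column has unit standard deviation
apply(z_demo, 2, sd)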

}
}
\references{
Mojena, R. 1977. Hierarchical grouping methods and stopping
rules: an evaluation. Comput. J. 20:359-363. \doi{10.1093/comjnl/20.4.359}
}
\author{
Tiago Olivoto \email{tiagoolivoto@gmail.com}
}
