% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/Functions.R
\name{wKModes}
\alias{wKModes}
\title{Weighted K-Modes Clustering with Tie-Breaking}
\usage{
wKModes(data,
        modes,
        weights = NULL,
        iter.max = .Machine$integer.max,
        freq.weighted = FALSE,
        fast = TRUE,
        random = TRUE,
        ...)
}
\arguments{
\item{data}{A matrix or data frame of categorical data. Objects have to be in rows, variables in columns.}

\item{modes}{Either the number of modes or a set of initial (distinct) cluster modes (where each mode is a row and \code{modes} has the same number of columns as \code{data}). If a number, a random set of (distinct) rows in \code{data} is chosen as the initial modes. Note, this randomness is always present, and is not governed by \code{random} below.}

\item{weights}{Optional numeric vector containing non-negative observation-specific case weights.}

\item{iter.max}{The maximum number of iterations allowed. Defaults to \code{.Machine$integer.max}. The algorithm terminates when \code{iter.max} is reached or when the partition ceases to change between iterations.}

\item{freq.weighted}{A logical indicating whether the usual simple-matching (Hamming) distance between objects is used, or a frequency weighted version of this distance. Defaults to \code{FALSE}; when \code{TRUE}, the frequency weights are computed within the algorithm and are \emph{not} user-specified. Distinct from the observation-level \code{weights} above, the frequency weights are assigned on a per-feature basis and derived from the categories represented in each column of \code{data}. For convenience, the function \code{dist_freqwH} is provided for calculating the corresponding pairwise dissimilarity matrix for subsequent use.}

\item{fast}{A logical indicating whether a fast version of the algorithm should be applied. Defaults to \code{TRUE}.}

\item{random}{A logical indicating whether ties for the modal values and/or assignments are broken at random. Defaults to \code{TRUE}: the implied default had been \code{FALSE} prior to version \code{1.3.2} of this package, as per \code{klaR::kmodes} prior to version \code{1.7-1} (see Note). Note that when \code{modes} is specified as the number of modes, the algorithm is \emph{always} randomly initialised, regardless of the specification of \code{random}.

Regarding the modes, ties are broken at random when \code{TRUE} and the first candidate state is always chosen for the mode when \code{FALSE}. Regarding assignments, tie-breaking is always first biased in favour of the observation's most recent cluster: any ties remaining thereafter are broken at random when \code{TRUE}, or the first other candidate cluster is always chosen when \code{FALSE}.}

\item{...}{Catches unused arguments.}
}
\value{
An object of class \code{"wKModes"} which is a list with the following components:
\describe{
\item{\code{cluster}}{A vector of integers indicating the cluster to which each object is allocated.}
\item{\code{size}}{The number of objects in each cluster.}
\item{\code{modes}}{A matrix of cluster modes.}
\item{\code{withindiff}}{The within-cluster (weighted) simple-matching distance for each cluster.}
\item{\code{tot.withindiff}}{The total within-cluster (weighted) distance over all clusters. \code{tot.withindiff} can be used to guide the choice of the number of clusters, but beware of inherent randomness in the algorithm, which is liable to yield a jagged elbow plot (see examples).}
\item{\code{iterations}}{The number of iterations the algorithm reached.}
\item{\code{weighted}}{A logical indicating whether observation-level \code{weights} were used or not throughout the algorithm.}
\item{\code{freq.weighted}}{A logical indicating whether feature-level frequency weights (see the \code{freq.weighted} argument) were used or not in the computation of the distances. For convenience, the function \code{dist_freqwH} is provided for calculating the corresponding pairwise dissimilarity matrix for subsequent use.}
\item{\code{random}}{A logical indicating whether ties were broken at random or not throughout the algorithm.}}
}
\description{
Perform k-modes clustering on categorical data with observation-specific sampling weights and tie-breaking adjustments.
}
\details{
The k-modes algorithm (Huang, 1998) is an extension of the k-means algorithm by MacQueen (1967).

The data given by \code{data} is clustered by the k-modes method (Huang, 1998) which aims to partition the objects into k groups such that the distance from objects to the assigned cluster modes is minimised. 

By default, the simple-matching (Hamming) distance is used to determine the dissimilarity of two objects. It is computed by counting the number of mismatches in all variables. Alternatively, this distance can be weighted by the frequencies of the categories in data, using the \code{freq.weighted} argument (see Huang, 1998, for details). For convenience, the function \code{dist_freqwH} is provided for calculating the corresponding pairwise dissimilarity matrix for subsequent use.
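
As a minimal base-R sketch of the unweighted simple-matching distance described above (the toy objects \code{x} and \code{m} are hypothetical, and this is not necessarily how \code{wKModes} computes the distance internally):

\preformatted{
# Toy categorical data: 3 objects (rows) measured on 4 variables (columns)
x <- rbind(c("a", "b", "a", "c"),
           c("a", "c", "a", "c"),
           c("b", "b", "b", "b"))

# A candidate cluster mode, with one category per variable
m <- c("a", "b", "a", "c")

# Count the variables on which each object differs from the mode
d <- colSums(t(x) != m)
d
# 0 1 3
}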

If an initial matrix of modes is supplied, it is possible that no object will be closest to one or more modes. In this case, fewer clusters than the number of supplied modes will be returned and a warning will be printed.

If called using \code{fast = TRUE}, all objects are reassigned to clusters before the modes are recomputed. For computational reasons, this option should be chosen unless the data set is of only moderate size.
}
\note{
This code is adapted from the \code{kmodes} function in the \pkg{klaR} package. Specifically, modifications were made to allow for random tie-breaking for the modes and assignments (see \code{random} above) and the incorporation of observation-specific sampling \code{weights}, with a view to using this function as a means to initialise the allocations for MEDseq models (see the \code{\link{MEDseq_control}} argument \code{init.z} and the related options \code{"kmodes"} and \code{"kmodes2"}). 

Notably, the \code{wKModes} function, when invoked inside \code{\link{MEDseq_fit}}, is used regardless of whether the weights are true sampling weights, or the weights are merely aggregation weights, or there are no weights at all. Furthermore, the \code{\link{MEDseq_control}} argument \code{random} is \emph{also} passed to \code{wKModes} when it is invoked inside \code{\link{MEDseq_fit}}.

\strong{Update}: as of version \code{1.7-1} of \pkg{klaR}, \code{klaR::kmodes} breaks assignment ties at random, but only when \code{fast=TRUE}. Assignment ties when \code{fast=FALSE}, and all ties for modal values, are still broken in the non-random manner described above. Thus, the old behaviour of \code{klaR::kmodes} can be recovered by specifying \code{random=FALSE} here, but \code{random=TRUE} allows random tie-breaking for both types of ties in all situations.
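
As a rough illustration of the two tie-breaking rules for modal values (a base-R sketch on a hypothetical toy vector \code{x}, ignoring observation weights; it is not the code used internally by \code{wKModes}):

\preformatted{
# One categorical variable within a cluster; "a" and "b" are tied as the mode
x    <- c("a", "b", "b", "a", "c")
tab  <- table(x)
ties <- names(tab)[tab == max(tab)]
ties
# "a" "b"

# Non-random rule: always take the first candidate state
ties[1]

# Random rule: pick one of the tied candidates at random
sample(ties, 1)
}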
}
\examples{
suppressMessages(require(WeightedCluster))
set.seed(99)
# Load the MVAD data & aggregate the state sequences
data(mvad)
agg      <- wcAggregateCases(mvad[,17:86], weights=mvad$weight)

# Create a state sequence object without the first two (summer) time points
states   <- c("EM", "FE", "HE", "JL", "SC", "TR")
labels   <- c("Employment", "Further Education", "Higher Education", 
              "Joblessness", "School", "Training")
mvad.seq <- seqdef(mvad[agg$aggIndex, 17:86], 
                   states=states, labels=labels, 
                   weights=agg$aggWeights)

# Run k-modes without the weights
resX     <- wKModes(mvad.seq, 2)

# Run k-modes with the weights
resW     <- wKModes(mvad.seq, 2, weights=agg$aggWeights)

# Examine the modal sequences of both solutions
seqformat(seqdef(resX$modes), from="STS", to="SPS", compress=TRUE)
seqformat(seqdef(resW$modes), from="STS", to="SPS", compress=TRUE)
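
# Cross-tabulate the two hard partitions to see how the weights alter the allocations
# (a sketch: the cluster labels are arbitrary, so rows & columns may be permuted)
table(Unweighted=resX$cluster, Weighted=resW$cluster)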

# Using tot.withindiff to choose the number of clusters
\donttest{
TWdiffs   <- sapply(1:5, function(k) wKModes(mvad.seq, k, weights=agg$aggWeights)$tot.withindiff)
plot(TWdiffs, type="b", xlab="K")

# Use multiple random starts to account for inherent randomness
TWDiff    <- sapply(1:5, function(k) min(replicate(10, 
                    wKModes(mvad.seq, k, weights=agg$aggWeights)$tot.withindiff)))
plot(TWDiff, type="b", xlab="K")}
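
# A sketch of the remaining options, reusing the objects created above
\donttest{
# Break ties in the non-random manner (see the 'random' argument);
# the initial modes are still chosen at random, so set a seed first
set.seed(2024)
resD     <- wKModes(mvad.seq, 2, weights=agg$aggWeights, random=FALSE)
resD$random

# Use the frequency-weighted version of the simple-matching distance
resF     <- wKModes(mvad.seq, 2, weights=agg$aggWeights, freq.weighted=TRUE)
resF$freq.weighted}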
}
\references{
Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. \emph{Data Mining and Knowledge Discovery}, 2(3): 283-304.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L. M. Le Cam and J. Neyman (Eds.), \emph{Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability}, Volume 1, June 21-July 18, 1965 and December 27 1965-January 7, 1966, Statistical Laboratory of the University of California, Berkeley, CA, USA, pp. 281-297. University of California Press.
}
\seealso{
\code{\link{MEDseq_control}}, \code{\link{MEDseq_fit}}, \code{\link{dist_freqwH}}, \code{\link[WeightedCluster]{wcAggregateCases}}, \code{\link[TraMineR]{seqformat}}
}
\author{
Keefe Murphy - <\email{keefe.murphy@mu.ie}>
(adapted from \code{klaR::kmodes})
}
\keyword{clustering}
