\name{createClustersDlg}
\alias{createClustersDlg}
\alias{showCorpusClustering}
\title{Cut hierarchical clustering tree into clusters}
\description{Cut a hierarchical clustering tree into clusters of documents.}
\details{This dialog allows grouping the documents present in a \pkg{tm} corpus
         according to a previously computed hierarchical clustering tree (see
         \code{\link{corpusClustDlg}}). It adds a new meta-data variable to the corpus, whose values
         are the cluster numbers; this variable is also added to the \code{corpusMetaData}
         data set. If clusters had already been created, they are simply replaced.

         Clusters will be created by starting from the top of the dendrogram, and going through
         the merge points with the highest position until the requested number of branches is reached.
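
         As a rough illustration of this step, the sketch below cuts a tree with base R's
         \code{hclust} and \code{cutree}. The data are made up, and this is not necessarily
         the code run by the dialog:
         \preformatted{
# Hypothetical toy data: 6 "documents" described by 2 numeric features
set.seed(1)
m <- matrix(rnorm(12), nrow = 6, dimnames = list(paste0("doc", 1:6), NULL))
tree <- hclust(dist(m), method = "ward.D2")
# Cut the tree from the top until 3 branches are obtained;
# the result is an integer vector giving the cluster number of each document
clusters <- cutree(tree, k = 3)
table(clusters)
}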

         A window opens to summarize created clusters, providing information about specific documents
         and terms for each cluster. Specific terms are those whose observed frequency in the cluster
         (the \dQuote{level} in the column names below) has the lowest probability under a hypergeometric
         distribution, based on their global frequencies in the corpus and on the number of occurrences
         of all terms in the considered cluster.
         All terms with a probability below the value chosen using the third slider are reported, ignoring
         terms with fewer occurrences in the whole corpus than the value of the fourth slider (these terms
         can often have a low probability but are too rare to be of interest). The last slider allows limiting
         the number of terms that will be shown for each cluster.

         The positive or negative character of the association is visible from the sign of the t value,
         or by comparing the value of the \dQuote{\% Term/Level} column with that of the \dQuote{Global \%}
         column. The definition of columns is:
         \describe{
         \item{\dQuote{\% Term/Level}:}{the percent of the term's occurrences among the occurrences of all terms in the level.}
         \item{\dQuote{\% Level/Term}:}{the percent of the term's occurrences that appear in the level
             (rather than in other levels).}
         \item{\dQuote{Global \%}:}{the percent of the term's occurrences among the occurrences of all terms in the corpus.}
         \item{\dQuote{Level}:}{the number of occurrences of the term in the level (\dQuote{internal}).}
         \item{\dQuote{Global}:}{the number of occurrences of the term in the corpus.}
         \item{\dQuote{t value}:}{the quantile of a normal distribution corresponding to the probability \dQuote{Prob.}.}
         \item{\dQuote{Prob.}:}{the probability of observing such an extreme (high or low) number of occurrences of
             the term in the level, under a hypergeometric distribution.}
         }
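
         The columns above can be approximated by hand as in the following sketch. All counts are
         hypothetical, and the tail and sign conventions used for \dQuote{Prob.} and \dQuote{t value}
         are assumptions that may differ from what the dialog actually computes:
         \preformatted{
# Hypothetical counts for one term and one level (cluster)
n_term_level <- 30    # occurrences of the term in the level ("Level")
n_term_corpus <- 100  # occurrences of the term in the corpus ("Global")
n_level <- 2000       # occurrences of all terms in the level
n_corpus <- 20000     # occurrences of all terms in the corpus

pct_term_level <- 100 * n_term_level / n_level        # "\% Term/Level"
pct_level_term <- 100 * n_term_level / n_term_corpus  # "\% Level/Term"
pct_global <- 100 * n_term_corpus / n_corpus          # "Global \%"

# Hypergeometric tails: drawing n_level occurrences from the corpus,
# how likely is a count at least this high, or at most this low?
p_high <- phyper(n_term_level - 1, n_term_corpus, n_corpus - n_term_corpus,
                 n_level, lower.tail = FALSE)
p_low <- phyper(n_term_level, n_term_corpus, n_corpus - n_term_corpus, n_level)
prob <- min(p_high, p_low)

# Assumed convention: normal quantile, positive when the term is over-represented
t_value <- if (p_high < p_low) qnorm(p_high, lower.tail = FALSE) else qnorm(p_low)
}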

         Specific documents are selected using a different criterion than terms: documents with the smallest
         Chi-squared distance to the average vocabulary of the cluster are shown. This is a Euclidean distance,
         but weighted by the inverse of the prevalence of each term in the whole corpus, and controlling for
         the documents' different lengths.
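
         A minimal sketch of such a distance is given below, on a made-up document-term matrix.
         The object names are hypothetical and the exact computation used by the dialog may differ:
         \preformatted{
# Hypothetical corpus: 5 documents, 4 terms; the first 3 documents form the cluster
dtm <- matrix(c(2, 0, 1, 3,
                1, 1, 0, 4,
                0, 5, 2, 1,
                3, 2, 2, 0,
                1, 0, 4, 2),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("doc", 1:5), paste0("term", 1:4)))
in_cluster <- c(TRUE, TRUE, TRUE, FALSE, FALSE)

# Prevalence of each term in the whole corpus
term_prev <- colSums(dtm) / sum(dtm)
# Row profiles (shares of each term) control for the different document lengths
profiles <- dtm[in_cluster, ] / rowSums(dtm[in_cluster, ])
# Average vocabulary of the cluster
centre <- colSums(dtm[in_cluster, ]) / sum(dtm[in_cluster, ])

# Squared deviations from the centre, weighted by the inverse term prevalence
dev2 <- sweep(profiles, 2, centre)^2
chisq_dist <- sqrt(rowSums(sweep(dev2, 2, term_prev, "/")))
# Documents closest to the average vocabulary of the cluster come first
sort(chisq_dist)
}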

         This dialog can only be used after having created a tree, which is done via the Text
         Mining->Hierarchical clustering->Create dendrogram... dialog.
        }

\seealso{\code{\link{corpusClustDlg}}, \code{\link{cutree}}, \code{\link{hclust}}, \code{\link{dendrogram}} }
