% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/compDis.R
\name{compDis}
\alias{compDis}
\title{Calculate compound dissimilarities}
\usage{
compDis(
  compoundData,
  type = "PubChemFingerprint",
  npcTable = NULL,
  unknownCompoundsMean = FALSE
)
}
\arguments{
\item{compoundData}{Data frame with the chemical compounds of interest,
usually the compounds found in the sample dataset.
Should have a column named "compound" with common names of
the compounds, a column named "smiles" with SMILES IDs of the compounds,
and a column named "inchikey" with the InChIKey IDs for the compounds.}

\item{type}{Type of data that compound dissimilarity calculations will be
based on: \code{NPClassifier}, \code{PubChemFingerprint} or \code{fMCS}.
If more than one is chosen, a matrix with the mean values of the
individual matrices will also be calculated.}

\item{npcTable}{A data frame already generated by \code{\link{NPCTable}}
can optionally be supplied, if compound dissimilarities are to be
calculated using \code{type = "NPClassifier"}.}

\item{unknownCompoundsMean}{Whether unknown compounds, i.e. ones without
SMILES or InChIKey, should be given mean dissimilarity values. If
\code{FALSE}, these will have dissimilarity 1 to all other compounds.}
}
\value{
List with compound dissimilarity matrices. A list is always
returned, even if only one matrix is calculated. Downstream functions,
including \code{\link{calcDiv}}, \code{\link{calcBetaDiv}},
\code{\link{calcDivProf}}, \code{\link{sampDis}}, \code{\link{molNet}}
and \code{\link{chemoDivPlot}} require only the matrix as
input (e.g. as \code{fullList$specificMatrix}) rather than the whole list.
}
\description{
Function to quantify dissimilarities between phytochemical compounds.
}
\details{
This function calculates matrices with pairwise dissimilarities between
the chemical compounds in \code{compoundData}, to quantify how
different the molecules are from each other. It does so in three
different ways, based on the biosynthetic classification or
molecular structure of the molecules:
\enumerate{
\item Using the classification from the \emph{NPClassifier} tool,
\code{type = "NPClassifier"}. \emph{NPClassifier} (Kim et al. 2021) is a
deep-learning tool that automatically classifies natural products
(i.e. phytochemical compounds) into a hierarchical classification of
three levels: pathway, superclass and class. This classification largely
corresponds to the biosynthetic groups/pathways the compounds
are produced in. Classifications are downloaded from
\url{https://npclassifier.ucsd.edu/}. \emph{NPClassifier} does not always
manage to classify every compound into all three hierarchical levels. In
such cases, it might be beneficial to first run \code{\link{NPCTable}},
manually edit the resulting data frame with probable classifications if
possible (with help from the Supporting Information in Kim et al. 2021),
and then supply this classification to the \code{compDis} function
with the \code{npcTable} argument. This will ensure that compound
dissimilarities are computed optimally.
\item Using PubChem Fingerprints, \code{type = "PubChemFingerprint"}.
This is a binary substructure fingerprint with 881 binary
variables describing the chemical structure of a compound.
With this method, compounds are therefore compared
based on how structurally dissimilar the molecules are.
See \url{https://pubchem.ncbi.nlm.nih.gov/docs/data-specification}
for more information. (There are many other types of fingerprints,
and ways of calculating compound dissimilarities based on them, see
e.g. packages \code{fingerprint} and \code{rcdk}). Fingerprint data for
molecules is downloaded from PubChem. This may produce a warning
about closing unused connections, which can safely be ignored.
\item fMCS, flexible Maximum Common Substructure,
\code{type = "fMCS"}. This is a pairwise graph matching concept.
The fMCS of two compounds is the largest substructure that occurs in both
compounds, allowing for atom and/or bond mismatches (Wang et al. 2013).
As with the fingerprints, compounds are compared based on how
structurally dissimilar the molecules are. While potentially a very
accurate similarity measure, fMCS is much more computationally demanding
than the other methods, and will take a significant amount of time for
larger data sets. Data on molecules is downloaded from PubChem.
This may produce a warning about closing unused connections,
which can safely be ignored.
}

Dissimilarities using NPClassifier and PubChem Fingerprints
are generated by calculating Jaccard (Tanimoto) dissimilarities from a
0/1 table with compounds as rows and group (NPClassifier) or binary
fingerprint variable (PubChem Fingerprints) as columns. fMCS generates
dissimilarity values by calculating Jaccard dissimilarities based on the
number of atoms in the maximum common substructure, allowing for one
atom and one bond mismatch. Dissimilarities are returned as
dissimilarity matrices.
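
As an illustration of the Jaccard calculation described above, the
following base R sketch computes dissimilarities from a small 0/1 table.
The table \code{bits} and its compound names are hypothetical, and this
is not necessarily how \code{compDis} computes the values internally.
\preformatted{
# Hypothetical 0/1 table: rows are compounds, columns are groups
# (NPClassifier) or fingerprint variables (PubChem Fingerprints)
bits <- rbind(cmpA = c(1, 1, 0, 0),
              cmpB = c(1, 0, 1, 0),
              cmpC = c(1, 1, 0, 0))

shared <- tcrossprod(bits)            # features present in both compounds
n <- rowSums(bits)                    # features present in each compound
either <- outer(n, n, "+") - shared   # features present in either compound

# Jaccard dissimilarity: 1 - shared / either.
# cmpA and cmpC have identical rows, so their dissimilarity is 0.
jaccardDis <- 1 - shared / either
jaccardDis
}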

If dissimilarities are calculated with more than one method,
the function will output additional dissimilarity matrices.
This always includes a matrix with the mean dissimilarity values of the
selected methods. If \code{"NPClassifier"} is included in \code{type},
a matrix of "mix" values is also calculated. The values in this matrix
are the dissimilarities from NPClassifier when these are > 0.
For pairs of compounds where the dissimilarity from NPClassifier
equals 0 (i.e. when the compounds belong to the same pathway, superclass
and class), values are equal to half of the structural dissimilarity from
PubChem Fingerprints or fMCS, or half of their mean if both are used.
With this method, compound dissimilarities are primarily based on
NPClassifier, but instead of compounds with identical classification having
0 dissimilarity, these have a dissimilarity based on PubChem Fingerprints
and/or fMCS, scaled to always be lower (< 0.5) than the dissimilarity
between compounds in the same pathway and superclass but different classes.
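
The rule for the "mix" values can be sketched in base R as follows. The
matrices \code{npcDis} and \code{fingerDis} are hypothetical inputs made
up for illustration, and the actual implementation in \code{compDis}
may differ.
\preformatted{
# Hypothetical dissimilarity matrices for three compounds
npcDis <- matrix(c(0, 0, 1,
                   0, 0, 1,
                   1, 1, 0), nrow = 3, byrow = TRUE)
fingerDis <- matrix(c(0,   0.4, 0.8,
                      0.4, 0,   0.6,
                      0.8, 0.6, 0), nrow = 3, byrow = TRUE)

# Keep the NPClassifier value where it is > 0, otherwise use half of
# the structural dissimilarity
mixDis <- ifelse(npcDis > 0, npcDis, fingerDis / 2)
mixDis
}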

If there are unknown compounds, which do not have a
corresponding SMILES or InChIKey, these can be handled in three
different ways. First, these can be completely removed from the list
of compounds and the sample data set, and hence excluded from all analyses.
Second, if \code{unknownCompoundsMean = FALSE}, unknown compounds will
be given a dissimilarity value of 1 to all other compounds. Third, if
\code{unknownCompoundsMean = TRUE}, unknown compounds will be given
a dissimilarity value to all other compounds which equals the mean
dissimilarity value between all known compounds. See \code{\link{chemodiv}}
for alternative methods that can be used when most or all compounds
are unknown.
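
The second and third options can be sketched in base R as follows.
The matrix \code{disMat} and the helper \code{fillUnknown} are
hypothetical and only illustrate the idea. The mean is here assumed to be
taken over the off-diagonal pairs of known compounds, which may differ
from what \code{compDis} does internally.
\preformatted{
# Hypothetical matrix: cmpA to cmpC are known, unknown1 lacks
# SMILES and InChIKey, so its dissimilarities are missing
ids <- c("cmpA", "cmpB", "cmpC", "unknown1")
disMat <- matrix(c(0,   0.2, 0.6, NA,
                   0.2, 0,   0.4, NA,
                   0.6, 0.4, 0,   NA,
                   NA,  NA,  NA,  0), nrow = 4, byrow = TRUE,
                 dimnames = list(ids, ids))

# Mean dissimilarity between all pairs of known compounds
known <- disMat[1:3, 1:3]
meanKnown <- mean(known[lower.tri(known)])

# Hypothetical helper: replace missing dissimilarities with a value
fillUnknown <- function(m, value) {
  m[is.na(m)] <- value
  m
}

disOne <- fillUnknown(disMat, 1)           # unknownCompoundsMean = FALSE
disMean <- fillUnknown(disMat, meanKnown)  # unknownCompoundsMean = TRUE
}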
}
\examples{
data(minimalCompData)
data(minimalNPCTable)
compDis(minimalCompData, type = "NPClassifier",
npcTable = minimalNPCTable) # Dissimilarity based on NPClassifier

\dontrun{compDis(minimalCompData)} # Dissimilarity based on Fingerprints

data(alpinaCompData)
data(alpinaNPCTable)
compDis(compoundData = alpinaCompData, type = "NPClassifier",
npcTable = alpinaNPCTable) # Dissimilarity based on NPClassifier
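
# compDis() always returns a list, so a single matrix has to be extracted
# before it is passed to downstream functions. As a minimal sketch, the
# matrix is selected by position here, because the names of the list
# elements are not assumed. The object names below are illustrative.
alpinaDisList <- compDis(compoundData = alpinaCompData, type = "NPClassifier",
npcTable = alpinaNPCTable)
names(alpinaDisList) # Names of the matrices in the list
alpinaDisMat <- alpinaDisList[[1]] # Only one type, so this is NPClassifier
dim(alpinaDisMat)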
}
\references{
Kim HW, Wang M, Leber CA, Nothias L-F, Reher R, Kang KB,
van der Hooft JJJ, Dorrestein PC, Gerwick WH, Cottrell GW. 2021.
NPClassifier: A Deep Neural Network-Based Structural Classification
Tool for Natural Products. Journal of Natural Products 84: 2795-2807.

Wang Y, Backman TWH, Horan K, Girke T. 2013.
fmcsR: mismatch tolerant maximum common substructure searching in R.
Bioinformatics 29: 2792-2794.
}
