% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/DistCalc_func_20230321.R
\name{DistCalc}
\alias{DistCalc}
\title{DistCalc() function}
\usage{
DistCalc(
  seq_file,
  path_out,
  input_fasta = NULL,
  input_seq = "aa",
  aa_dist = NULL,
  codon_pos = NULL,
  dist_type = "G"
)
}
\arguments{
\item{seq_file}{is a sequence occurrence table as output by the 'dada2'
pipeline, which has samples in rows and nucleotide sequence variants in
columns. Optionally, a fasta file can be supplied as input in the format
rendered by read.fasta() from the package 'seqinr'.}

\item{path_out}{is a user defined path to the folder where the output files
will be saved.}

\item{input_fasta}{optional, a logical (TRUE/FALSE) that indicates whether
the input file is a fasta file (TRUE) or a 'dada2'-style sequence table
(NULL/FALSE). The default is NULL/FALSE.}

\item{input_seq}{defines the type of sequences in seq_file. It may take the
values 'nucl' or 'aa'.}

\item{aa_dist}{is optional, a logical (TRUE/FALSE) that determines whether
nucleotide sequences should be translated to amino acid sequences before
distance calculation, default is NULL/FALSE. Note that aa_dist must be set
to TRUE, if Grantham or Sandberg distances are calculated from an alignment
of nucleotide sequences.}

\item{codon_pos}{is optional, a vector of comma separated integers specifying
which codon positions to include in distance calculations. If omitted,
distance calculations are made using all codons.}

\item{dist_type}{is used to specify which kind of distances that are
calculated. It takes the values 'G' for Grantham distances, 'S'  for
Sandberg distances, or 'P' for p-distances. The argument is optional with
'G' as default setting.}
}
\value{
The function returns a matrix with distances from all pairwise sequence
  comparisons, where n is the number of sequences. If a sequence occurrence
  table is given as input file, the function additionally returns a table with
  the mean distance for each sample in the data set. If a sequence occurrence
  table is given as input file, the sequences are named in the output matrix by
  an index number that corresponds to their column number in the input file. If
  calculation of Sandberg distances is specified, the function additionally
  outputs five tables with physico-chemical z-descriptor values for each amino
  acid position in all sequences in the data set. All output tables are saved
  as .csv files in the output path.
}
\description{
\code{\link{DistCalc}} calculates Grantham distances, Sandberg distances,
or p-distances from pairwise comparisons of aligned sequences.
}
\details{
The DistCalc() function takes a fasta file or a 'dada2'-style sequence
occurrence table (with aligned sequences as column names and samples in
rows) as input and produces a matrix with pairwise distances for all
sequences in the data set. If calculation of Sandberg distances is specified,
the function additionally outputs five tables with physico-chemical
z-descriptor values (based on Sandberg et al. 1998) for each amino acid
position in all sequences in the data set. These tables may be useful for
further downstream analyses, such as estimation of MHC supertypes. If a
sequence occurrence table is provided as input, the DistCalc() function
furthermore produces a table with the mean distances from all pairwise
comparisons of the sequences in each sample in the data set. (Note:
The mean distance will be NA for samples that have 0 or 1 sequence(s).)

Grantham distances and Sandberg distances are calculated as described in
Pierini & Lenz 2018. The Grantham distances produced by DistCalc() are
simply the mean Grantham distances (Grantham 1974) between all amino acid
codons in sequence pairs. When calculating Sandberg distances, DistCalc()
first computes Euclidian distances between all amino acid pairs based on
the five physico-chemical z-descriptors defined in Sandberg et al. 1998.
Sandberg distances are then calculated as the mean Euclidian distances
between all amino acid codons in sequence pairs. P-distances calculated
by DistCalc() are simply the proportion of varying codons between pairs
of sequences.

The DistCalc() function includes an option for the user to specify which
codons to compare, which is useful e.g. if conducting the analysis only
on codon positions involved in specific functions, such as peptide binding
of an MHC molecule. It also accepts calculating amino acid distances
directly from protein-coding DNA sequences using the standard genetic
code.

The DistCalc() function accepts the following characters in the sequences:
Nucleotide sequences: A,T,G,C
Amino acid sequences: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V

It accepts gaps defined by '-'. Nucleotide triplets containing gaps are
translated to 'X', if amino acid distances are calculated directly from DNA
nucleotide sequences. Please note that '-' or 'X' are treated as unique
characters in p-distance calculations. The function will not accept 'X' or
gaps in Grantham or Sandberg distance calculations. If you wish to exclude
codons with 'X' or gaps from distance calculations, please use the
codon_pos option to specify which codons to compare.

If you publish data or results produced with MHCtools, please cite both of
the following references:
Roved, J. 2022. MHCtools: Analysis of MHC data in non-model species. Cran.
Roved, J., Hansson, B., Stervander, M., Hasselquist, D., & Westerdahl, H. 2022.
MHCtools - an R package for MHC high-throughput sequencing data: genotyping,
haplotype and supertype inference, and downstream genetic analyses in non-model
organisms. Molecular Ecology Resources. https://doi.org/10.1111/1755-0998.13645

If you calculated Grantham or Sandberg distances, please additionally cite:
Pierini, F., Lenz, T.L. 2018. Divergent allele advantage at human MHC genes:
Signatures of past and ongoing selection. Mol. Biol. Evol. 35, 2145-2158.

...and either of the following references:
Grantham R. 1974. Amino acid difference formula to help explain protein
evolution. Science 185:862-864.
Sandberg M, Eriksson L, Jonsson J, Sjostrom M, Wold S. 1998. New chemical
descriptors relevant for the design of biologically active peptides. A
multivariate characterization of 87 amino acids. JMed Chem. 41(14):2481-2491.
}
\examples{
seq_file <- sequence_table_fas
path_out <- tempdir()
DistCalc(seq_file, path_out, input_fasta=NULL, input_seq="nucl", aa_dist=NULL,
codon_pos=c(1,2,3,4,5,6,7,8), dist_type="P")
}
\seealso{
For more information about 'dada2', visit
  <https://benjjneb.github.io/dada2/>
}
