% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/Simulations.R
\name{Simulate_bakRData}
\alias{Simulate_bakRData}
\title{Simulating nucleotide recoding data}
\usage{
Simulate_bakRData(
  ngene,
  num_conds = 2L,
  nreps = 3L,
  eff_sd = 0.75,
  eff_mean = 0,
  fn_mean = 0,
  fn_sd = 1,
  kslog_c = 0.8,
  kslog_sd = 0.95,
  tl = 60,
  p_new = 0.05,
  p_old = 0.001,
  read_lengths = 200L,
  p_do = 0,
  noise_deg_a = -0.3,
  noise_deg_b = -1.5,
  noise_synth = 0.1,
  sd_rep = 0.05,
  low_L2FC_ks = -1,
  high_L2FC_ks = 1,
  num_kd_DE = c(0L, as.integer(rep(round(as.integer(ngene)/2), times =
    as.integer(num_conds) - 1))),
  num_ks_DE = rep(0L, times = as.integer(num_conds)),
  scale_factor = 150,
  sim_read_counts = TRUE,
  a1 = 5,
  a0 = 0.01,
  nreads = 50L,
  alpha = 25,
  beta = 75,
  STL = FALSE,
  STL_len = 40
)
}
\arguments{
\item{ngene}{Number of genes to simulate data for}

\item{num_conds}{Number of experimental conditions (including the reference condition) to simulate}

\item{nreps}{Number of replicates to simulate}

\item{eff_sd}{Effect size; more specifically, the standard deviation of the normal distribution from which non-zero
changes in logit(fraction new) are drawn.}

\item{eff_mean}{Effect size mean; mean of the normal distribution from which non-zero changes in logit(fraction new) are drawn.
Note that setting this to 0 does not mean that some of the significant effect sizes will be 0, as drawing any exact value
from a continuous distribution has probability zero. Setting this to 0 just means that stabilization and destabilization are symmetric (equally likely).}

\item{fn_mean}{Mean of fraction news of simulated transcripts in reference condition. The logit(fraction) of RNA from each transcript that is
metabolically labeled (new) is drawn from a normal distribution with this mean}

\item{fn_sd}{Standard deviation of fraction news of simulated transcripts in reference condition. The logit(fraction) of RNA
from each transcript that is metabolically labeled (new) is drawn from a normal distribution with this sd}

\item{kslog_c}{Synthesis rate constants will be drawn from a lognormal distribution with meanlog = \code{kslog_c} - mean(log(kd_mean)) where kd_mean
is determined from the fraction new simulated for each gene as well as the label time (\code{tl}).}

\item{kslog_sd}{Synthesis rate lognormal standard deviation; see kslog_c documentation for details}

\item{tl}{metabolic label feed time}

\item{p_new}{metabolic label (e.g., s4U) induced mutation rate. Can be a vector of length num_conds}

\item{p_old}{background mutation rate}

\item{read_lengths}{Total read length for each sequencing read (e.g., PE100 reads correspond to read_lengths = 200)}

\item{p_do}{Rate at which metabolic label containing reads are lost due to dropout; must be between 0 and 1}

\item{noise_deg_a}{Slope of trend relating log10(standardized read counts) to log(replicate variability)}

\item{noise_deg_b}{Intercept of trend relating log10(standardized read counts) to log(replicate variability)}

\item{noise_synth}{Homoskedastic variability of L2FC(ksyn)}

\item{sd_rep}{Variance of lognormal distribution from which replicate variability is drawn}

\item{low_L2FC_ks}{Most negative L2FC(ksyn) that can be simulated}

\item{high_L2FC_ks}{Most positive L2FC(ksyn) that can be simulated}

\item{num_kd_DE}{Vector where each element represents the number of genes that show a significant change in stability relative
to the reference. 1st entry must be 0 by definition (since relative to the reference the reference sample is unchanged)}

\item{num_ks_DE}{Same as num_kd_DE but for significant changes in synthesis rates.}

\item{scale_factor}{Factor relating RNA concentration (in arbitrary units) to average number of read counts}

\item{sim_read_counts}{Logical; if TRUE, read counts are simulated as coming from a heterodisperse negative binomial distribution}

\item{a1}{Heterodispersion 1/reads dependence parameter}

\item{a0}{High read depth limit of negative binomial dispersion parameter}

\item{nreads}{Number of reads simulated if sim_read_counts is FALSE}

\item{alpha}{shape1 parameter of the beta distribution from which U-contents (probability that a nucleotide in a read from a transcript is a U) are
drawn for each gene.}

\item{beta}{shape2 parameter of the beta distribution from which U-contents (probability that a nucleotide in a read from a transcript is a U) are
drawn for each gene.}

\item{STL}{logical; if TRUE, simulation is of STL-seq rather than a standard TL-seq experiment. The two big changes are that a short read length is required
(< 60 nt) and that every read for a particular feature will have the same number of Us. Only one read length is simulated for simplicity.}

\item{STL_len}{Average length of simulated STL-seq reads. Since Pol II typically pauses about 20-60 bases
from the promoter, this should be around 40}
}
\value{
A list containing a simulated \code{bakRData} object as well as a list of simulated kinetic parameters of interest.
The contents of the latter list are:
\itemize{
\item Effect_sim; Dataframe meant to mimic formatting of Effect_df that is part of \code{bakRFit(StanFit = TRUE)}, \code{bakRFit(HybridFit = TRUE)} and \code{bakRFit(bakRData object)} output.
\item Fn_mean_sim; Dataframe meant to mimic formatting of Regularized_ests that is part of \code{bakRFit(bakRData object)} output. Contains information
about the true fraction new simulated in each condition (the mean of the normal distribution from which replicate fraction news are simulated)
\item Fn_rep_sim; Dataframe meant to mimic formatting of Fn_Estimates that is part of \code{bakRFit(bakRData object)} output. Contains information
about the fraction new simulated for each feature in each replicate of each condition.
\item L2FC_ks_mean; The true L2FC(ksyn) for each feature in each experimental condition. The i-th column corresponds to the L2FC(ksyn) when comparing
the i-th condition to the reference condition (defined as the 1st condition) so the 1st column is always all 0s
\item RNA_conc; The average number of normalized read counts expected for each feature in each sample.
}
}
\description{
\code{Simulate_bakRData} simulates a \code{bakRData} object. Its output also includes the simulated
values of all kinetic parameters of interest. Only the number of genes (\code{ngene}) has to be set by the
user, but an extensive list of additional parameters can be adjusted.
}
\details{
\code{Simulate_bakRData} simulates a \code{bakRData} object using a realistic generative model with many
adjustable parameters. Average RNA kinetic parameters are drawn from biologically inspired
distributions. Replicate variability is simulated by drawing a feature's
fraction new in a given replicate from a logit-Normal distribution with a heteroskedastic
variance term with average magnitude given by the chosen read count vs. variance relationship.
For each replicate, a feature's ksyn is drawn from a homoskedastic lognormal distribution. Read counts
can either be set to the same value for all simulated features or can be simulated according to
a heterodisperse negative binomial distribution. The latter is the default.
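
The replicate variability step can be sketched roughly as follows. This is an illustration under stated assumptions, not the
package's internal code: it assumes the trend is linear on the log scale, that \code{sd_rep} acts as the \code{sdlog} of the
lognormal distribution, and it uses hypothetical variable names (\code{logit_fn_mean}, \code{std_log10_reads}, \code{fn_sd_feature}).

\preformatted{
set.seed(1)
nreps <- 3
noise_deg_a <- -0.3   # slope of the trend (default shown in usage)
noise_deg_b <- -1.5   # intercept of the trend (default shown in usage)
sd_rep <- 0.05

logit_fn_mean <- 0       # hypothetical: true logit(fraction new) of one feature
std_log10_reads <- 0.5   # hypothetical: standardized log10(read count)

# Assumed linear trend for the average log(replicate variability)
mean_log_sd <- noise_deg_a * std_log10_reads + noise_deg_b

# Feature-specific sd scattered around the trend (assumed lognormal)
fn_sd_feature <- rlnorm(1, meanlog = mean_log_sd, sdlog = sd_rep)

# Replicate fraction news: normal on the logit scale, then back-transformed
logit_fn_rep <- rnorm(nreps, mean = logit_fn_mean, sd = fn_sd_feature)
fn_rep <- plogis(logit_fn_rep)
}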

The number of Us in each sequencing read is drawn from a binomial distribution with number of trials
equal to the read length and probability of each nucleotide being a U drawn from a beta distribution. Each read is assigned to the
new or old population according to a Bernoulli distribution with p = fraction new. The number of
mutations in each read is then drawn from one of two binomial distributions; if the read is assigned to the
population of new RNA, the number of mutations is drawn from a binomial distribution with number of trials equal
to the number of Us and probability of mutation = \code{p_new}; if the read is assigned to the population of old RNA,
the number of mutations is instead drawn from a binomial distribution with the same number of trials but with the probability
of mutation = \code{p_old}. \code{p_new} must be greater than \code{p_old} because mutations in new RNA
arise from both background mutations that occur with probability \code{p_old} and metabolic label induced mutations.
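
The read-level steps described above can be sketched in base R for a single feature. This is a minimal illustration, not the
function's actual implementation; the fraction new (\code{fn}) and the variable names are hypothetical, and the remaining values
are the defaults shown in the usage section.

\preformatted{
set.seed(42)
nreads <- 50         # number of reads for this feature
read_length <- 200   # total read length
p_new <- 0.05        # mutation rate in new RNA
p_old <- 0.001       # background mutation rate
fn <- 0.5            # hypothetical fraction new for this feature

# U-content of the feature, drawn from a beta distribution (alpha = 25, beta = 75)
u_cont <- rbeta(1, shape1 = 25, shape2 = 75)

# Number of Us in each read
nU <- rbinom(nreads, size = read_length, prob = u_cont)

# Assign each read to the new (1) or old (0) population
is_new <- rbinom(nreads, size = 1, prob = fn)

# Number of mutations: nU trials, with the rate set by the assignment
n_mut <- rbinom(nreads, size = nU, prob = ifelse(is_new == 1, p_new, p_old))
}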

Simulated read counts should be treated as if they are spike-in and RPKM normalized, so the same scale factor can be applied
to each sample when comparing the sequencing reads (e.g., if you are performing differential expression analysis).

}
\examples{
\donttest{
# 2 replicate, 2 experimental condition, 1000 gene simulation
sim_2reps <- Simulate_bakRData(ngene = 1000, nreps = 2)

# 3 replicate, 2 experimental condition, 1000 gene simulation
# with 100 instances of differential degradation kinetics
sim_3reps <- Simulate_bakRData(ngene = 1000, num_kd_DE = c(0, 100))

# 2 replicate, 3 experimental condition, 1000 gene simulation
# with 100 instances of differential degradation kinetics in the 2nd
# condition and no instances of differential degradation kinetics in the
# 3rd condition (the 1st condition is the reference)
sim_3es <- Simulate_bakRData(ngene = 1000,
                             nreps = 2,
                             num_conds = 3,
                             num_kd_DE = c(0, 100, 0))
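
# Inspect the top-level structure of the first simulation's output.
# str() is used so that no element names need to be assumed here; per the
# Value section, the returned list should hold the simulated bakRData object
# and the list of simulated kinetic parameters.
str(sim_2reps, max.level = 1)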

}
}
