%\VignetteIndexEntry{CloneSeeker} %\VignetteKeywords{SNP,Copy Number,clone,subclone} %\VignetteDepends{stats} %\VignettePackage{CloneSeeker} \documentclass{article} \usepackage{Sweave} \addtolength{\textwidth}{1in} \addtolength{\oddsidemargin}{-.5in} \setlength{\evensidemargin}{\oddsidemargin} \newcommand{\code}[1]{\texttt{#1}} \title{CloneSeeker} \author{Mark Zucker \and Kevin R. Coombes} \date{November, 2018} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \tableofcontents \section{Introduction} Tumors often consist of multiple distinct subpopulations or clones. Information about the number of clones present in a tumor can be inferred using either mutation allele frequency data, from sequencing studies, or from copy number varants (CNVs), derived either from sequencing or from SNP array data. The \code{CloneSeeker} package can be applied to SNP array data, sequencing data, or both, from tumor cells from a cancer patient. \code{CloneSeeker} can determine the number of clones, the distribution of cells among clones, and the copy number variations and mutations (depending on the available data sources) that occur in each clone. The presence of multiple detectable clones is called ``clonal heterogeneity'' in the literature. Clonal heterogeneity likely plays an important role in the clinical course of a cancer. It is possible, for example, that the tumor cells that will eventually become the refractory cancer after treatment are present as a minor subclone in the tumor early on. First, we load the \code{CloneSeeker} package: <>= library(CloneSeeker) @ \section{Simulated Tumor Containing Multiple Clones} In order to illustrate the algorithms, we are going to simulate data where we know the true structure. Specifically, we will simulate copy number and mutation data for a tumor with three clones. We start with an object that represents the Tumor at a somewhat abstract level. <>= set.seed(21303) # for reproducibility simTumor <- Tumor(c(5, 3, 2), rounds = 100, nu = 10, pcnv = 0.8, norm.contam = FALSE) @ The first argument to the \code{Tumor} constructor is a vector that specifies the relative proportions of cells belonging to each clone; the length of the vector determines the number of clones. These values are automatically converted to fractions that add up to one: <>= simTumor@psi @ The second argument, \code{rounds}, specifies the number of generations through which the tumor clones are evolved. The idea is that new abnormalities, either in the form of mutations or copy number variants (CNVs), are acquired at each evolutionary step from some parent cell. The parameter \code{nu} is the expected number of new mutations and the parameter \code{pcnv} is the probability of a new CNV at each step. The final parameter, \code{norm.contam}, is a logical indicator of whether the tumor sample is assumed to include a subset of cells that represent non-cancerous ``normal contamination''. The resulting simulated tumor contains descriptions of each individual clone. In the current implementation, these are stored as a list of clones. <>= class(simTumor@clones) length(simTumor@clones) @ Individual clones contain descriptions of both CNVs and mutations. <>= oneClone <- simTumor@clones[[1]] class(oneClone) length(oneClone) names(oneClone) @ The copy number data includes the chromosome, with start and end positions, the number of copies of the \code{A} and \code{B} alleles, an arbitrary ``segment'' identifier, and (as a residual from the simulated evolutionary history), a ``parent'' identifier. <>= dim(oneClone$cn) summary(oneClone$cn) @ The mutation data has a chromosomal location, arbitrary segment and mutation identifiers, the number of mutated and wild type copies for each mutation, and the affected allele. <>= dim(oneClone$seq) oneClone$seq @ \subsection{Simulating Tumor Data} Now that we have the tumor in place, we can simulate data arising from a sudy of that tumor. <>= simData <- generateTumorData(simTumor, snps.seq = 10000, snps.cgh = 600000, mu = 70, sigma.reads = 25, sigma0.lrr = 0.15, sigma0.baf = 0.03, density.sigma = 0.1) @ For a description of the many parameters to the \code{generateTumorData} function, see the man page. The first two arguments are size parameters. The first, \code{snp.seq}, determines the number of \textit{germline} variants to simulate; in the absence of separate copy number data, these are used to provide a crude estimate. The second, \code{snps.cgh}, represents the number of SNP locations on the simulated SNP chip from which copy number segments are derived. The remaining parameters control the simulated read depth and variabilty. As with individual clones, the simulated data is structured as a list with separate data frames for the CNVs and mutations. <>= class(simData) length(simData) names(simData) @ The simulated copy number data includes chromosomal locations along with estimated log R ratios (\code{LRR}), B allele frequencies (\code{BAF}), separate intensity values for the two parental alleles (\code{X} and \code{Y}), and the number of SNPs in each segment (\code{markers}). <>= cnDat <- simData$cn.data dim(cnDat) summary(cnDat) @ The simulated sequencing data, in addition to chromosomal locations, has read counts for the number of reference alleles, alternate (meaning varianmt or mutated) alleles, total counts, the variant allele frequency (\code{VAF}), and a status indicator of whether the variant is believed to be germline or somatic. <>= dim(simData$seq.data) seqDat <- simData$seq.data somatic <- seqDat[seqDat$status=='somatic',] dim(seqDat) summary(seqDat) table(seqDat$status) @ \section{Seeking Clones} To run \code{CloneSeeker}, we will need a starting set of $\psi$ vectors as inputs, where $\psi$ records the fraction of cells belonging to each clone. For each $\psi$ vector, the algorithm will compute the most probable copy number state for each clone at each segment. The maximum posterior probability is computed for each input $\psi$ vector, and these probabilities are used to resample new potential $\psi$ vectors. We usually start by considering every possible decomposition of the tumor into five clones, where the fraction assigned to each clone is a multiple of $1/20 = 0.05$. We can generate this initial matrix of $\psi$ vectors as follows: <>= psis <- generateSimplex(20, 5) dim(psis) head(psis) tail(psis) @ For SNP array data, we also need, as input, a set of possible clonal segment copy number states. If none exists the function will automatically generate one. The version used here considers all possible copy number states from 0 to 5 copies, but it imposes a strong prior belief that two different clones cannot both gain and lose the same segment. <>= cnmodels <- expand.grid(rep(list(0:5),5)) include <- sapply(1:nrow(cnmodels), function(i) { length(which(cnmodels[i,] >= 1))==5 | length(which(cnmodels[i,] <= 1)) == 5 }) cnmodels <- cnmodels[include,] @ Now we will define the other algorithm parameters: <>= pars <- list(sigma0 = 1, # SNP-wise standard deviation ktheta = 0.3, # geometric prior parameter on number of clones theta = 0.9, # geometric prior parameter on copy number changes mtheta = 0.9, # gemoetric prior parameter on point mutations alpha = 0.5, # parameter for a symmetric Dirichlet prior on psi thresh = 0.04, # smallest possible detectble clone cutoff = 100, # filter out copy number segments supported by fewer SNPs Q = 100, # number of new psi vectors resamples at each iteration iters = 4) # number of iterations @ <>= f <- system.file("auxiliary/stash.Rda", package="CloneSeeker") if (file.exists(f)) { load(f) } else { resCN <- seekClones(cndata = cnDat, vardata = NULL, cnmodels = cnmodels, psis, pars = pars) resMut <- seekClones(cndata = NULL, vardata = seqDat, cnmodels = cnmodels, psiset = psis, pars = pars) resBoth <- seekClones(cndata = cnDat, vardata = somatic, cnmodels = cnmodels, psis, pars = pars) save(resCN, resMut, resBoth, file = "../inst/auxiliary/stash.Rda") } rm(f) @ \subsection{Seeking Clones from Copy Number Data} The \code{seekClones} function can estimate the clonal architecture from copy number data, or from mutation and variant data, or jointly from both kinds of data. In this section, we will run the algorithm using \textbf{only the copy number data}. To do that, we set the \code{varData} argument to \code{NULL}. <>= resCN <- seekClones(cndata = cnDat, vardata = NULL, cnmodels = cnmodels, psiset = psis, pars = pars) @ Here are the results of the ``CNV only'' analysis of this sample: <>= resCN$psi simTumor@psi @ In this case, \code{CloneSeeker} accurately estimates not only the number of clones but also the clonal fractions. Let's look at the clonal copy number assignments as well: <>= trueCN_Assignments <- t(sapply(1:nrow(resCN$filtered.data$cndata.filt), function(i) { index <- rownames(simTumor@clones[[1]]$cn) == rownames(resCN$filtered.data$cndata.filt)[i] sapply(1:length(simTumor@clones),function(j){ simTumor@clones[[j]]$cn$A[index] + simTumor@clones[[j]]$cn$B[index] }) })) inferredCN_Assignments <- (resCN$A+resCN$B)[,1:length(simTumor@clones)] colnames(inferredCN_Assignments) <- colnames(trueCN_Assignments) <- paste("C", 1:3) data.frame(Truth = trueCN_Assignments, Infer = inferredCN_Assignments) @ Although not perfect, the algorithm managed to correctly estimate most of the segment-wise allelic copy numbers of different clones. \subsection{Sequencing Data} Now, let's illustrate the use of \code{CloneSeeker} in analyzing mutation data (by which we mean variant data such as one would find in a \code{.vcf} file) to seek clones. This time, we run the \code{CloneSeeker} algorithm with the \code{cndata} argument set to \code{NULL}. <>= resMut <- seekClones(cndata = NULL, vardata = seqDat, cnmodels = cnmodels, psiset = psis, pars = pars) @ Here the results aren't as good; at least one of the actual clones has been split into separate pieces. <>= resMut$psi simTumor@psi @ %Let's also look at the mutation assignment compared to the truth: <>= trueMutAssignments <- sapply(1:length(simTumor@clones),function(i){ sapply(1:length(resMut$filtered.data$mutdata.filt$mut.id),function(j){ index <- which(simTumor@clones[[i]]$seq$mut.id==resMut$filtered.data$mutdata.filt$mut.id[j]) if(length(index)==0){ val <- 0 }else{ val <- simTumor@clones[[i]]$seq$mutated.copies[index] } val }) }) inferredMutAssignments <- resMut$mutated[,1:length(simTumor@clones)] colnames(inferredMutAssignments) <- colnames(trueMutAssignments) <- c('C1','C2','C3') data.frame(Truth = trueMutAssignments, Infer = inferredMutAssignments) @ \subsection{Both Sequencing and SNP Array Data} Finally, we illustrate running \code{CloneSeeker} on a sample for which there is both SNP array and mutation data. <>= resBoth <- seekClones(cndata = cnDat, vardata = somatic, cnmodels = cnmodels, psiset = psis, pars = pars) @ And we can look at the inferred allocation of tumor fraction to clones: <>= resBoth$psi simTumor@psi @ Surprisingly, the results here are similar to the overaggressive results obtained using just the sequencing data rather than the simpler and correct results obtained when using just the copy number data. In conclusion, CloneSeeker can be applied effectively to cases where one has SNP array data, (processed) sequencing data, or both. \end{document}