% -*- mode: noweb; noweb-default-code-mode: R-mode; -*- \documentclass[11pt]{article} %% Set my margins \setlength{\oddsidemargin}{0.0truein} \setlength{\evensidemargin}{0.0truein} \setlength{\textwidth}{6.5truein} \setlength{\topmargin}{0.0truein} \setlength{\textheight}{9.0truein} \setlength{\headsep}{0.0truein} \setlength{\headheight}{0.0truein} \setlength{\topskip}{0pt} %% End of margins %%\pagestyle{myheadings} %%\markboth{$Date$\hfil$Revision$}{\thepage} \usepackage[pdftex, bookmarks, bookmarksopen, pdfauthor={Dongjun Chung, Hyonho Chun and Sunduz Keles}, pdftitle={spls Vignette}] {hyperref} \title{An Introduction to the `\texttt{spls}' Package, Version 1.0} \author{Dongjun Chung$^1$, Hyonho Chun$^1$ and S\"und\"uz Kele\c{s}$^{1,2}$\\ $^1$Department of Statistics, University of Wisconsin\\ Madison, WI 53706.\\ $^2$Department of Biostatistics and Medical Informatics, University of Wisconsin\\ Madison, WI 53706.} \date{\today} \SweaveOpts{engine=R, echo=TRUE, pdf=TRUE} \begin{document} %\VignetteIndexEntry{Sparse PLS} %\VignetteKeywords{PLS, Sparse PLS} %\VignettePackage{spls} \maketitle \section{Overview} This vignette provides basic information about the `\texttt{spls}' package. SPLS stands for ``Sparse Partial Least Squares''. The SPLS regression methodology is developed in \cite{spls}. The main principle of this methodology is to impose sparsity within the context of partial least squares and thereby carry out dimension reduction and variable selection simultaneously. SPLS regression exhibits good performance even when (1) the sample size is much smaller than the total number of variables; and (2) the covariates are highly correlated. One additional advantage of SPLS regression is its ability to handle both univariate and multivariate responses. The package can be loaded with the command: <>= options(prompt = "R> ") @ <>= library("spls") @ \section{Input Data} The package requires that the response is given in the form of a either vector or matrix, and the predictors in the form of a matrix. The response can be either univariate or multivariate. The responses and the predictors are assumed to be numerical and should not contain missing values. As part of pre-processing, the predictors are centered and scaled and the responses are centered automatically as default by the package `\texttt{spls}'. We provide the Yeast Cell Cycle dataset as an example application for the `\texttt{spls}' package. The responses are the cell cycle gene expression data of 542 genes \cite{cycle} from an $\alpha$ factor based experiment. In this experiment, mRNA levels were measured at every 7 minute during 119 minutes. Hence, the response has a total of 18 measurements covering two cell cycle periods. These 18 measurements correspond to 18 columns of the response matrix in the dataset. The predictors are chromatin immunoprecipitation on chip (ChIP-chip) data of \cite{chip} and they contain the binding information for 106 transcription factors (TF). This application is concerned with identifying cell cycle related TFs, i.e., TFs whose binding events contribute to explaining the variability in gene expression, as well as inferring their activities. See \cite{spls} for more details. The yeast cell cycle dataset with the `\texttt{y}' matrix as the cell cycle gene expression (responses) and the `\texttt{x}' matrix as the ChIP-chip data (predictors) can be loaded as follows: <>= data(yeast) yeast$x[1:5,1:5] yeast$y[1:5,1:5] @ \section{Tuning Parameters} \texttt{SPLS} regression has two main tuning parameters: `\texttt{eta}' represents the sparsity tuning parameter and `\texttt{K}' is the number of hidden (latent) components. Parameters can be chosen by ($v$-fold) cross-validation using the function `\texttt{cv.spls}'. The user specifies the range for these parameters and the cross-validation procedure searches within these ranges. `\texttt{eta}' should have a value between 0 and 1. `\texttt{K}' is integer valued and can range between 1 and $ min \left\{ p, (v-1) n / v \right\} $, where $p$ is the number of predictors and $n$ is the sample size. For example, if 10-fold cross-validation is used (default), `\texttt{K}' should be smaller than $ min \left\{ p, 0.9 n \right\} $. For the yeast data, we search for `\texttt{K}' between 5 and 10 and for `\texttt{eta}' between 0.1 and 0.9 with the following command: <>= set.seed(1) cv <- cv.spls( yeast$x, yeast$y, eta = seq(0.1,0.9,0.1), K = c(5:10) ) @ `\texttt{cv.spls}' returns a heatmap-type plot of mean squared prediction error (MSPE) and the optimal values for `\texttt{eta}' and `\texttt{K}'. MSPE plot is given in Figure \ref{fig:cv} and `\texttt{cv.spls}' recommends to use `\texttt{eta}=0.7' and `\texttt{K}=8'. \begin{figure}[tbh] \begin{center} <>= <> @ \caption{\label{fig:cv} MSPE plot of SPLS when eta = 0.1--0.9 and K = 5--10.} \end{center} \end{figure} \section{\texttt{SPLS} Fit} Using the parameters obtained from `\texttt{cv.spls}', \texttt{SPLS} can be fitted by the function `\texttt{spls}'. `\texttt{spls}' also prints out the variables that join the set of selected variables at each iteration step of SPLS fit. `\texttt{print.spls}' displays the parameters used, the number of selected predictors, and the list of the selected predictors. `\texttt{coef.spls}' prints out the coefficient estimates of SPLS fits. <>= f <- spls( yeast$x, yeast$y, eta = cv$eta.opt, K = cv$K.opt ) print(f) coef.f <- coef(f) coef.f[1:5,1:5] @ A plot that illuminates the fitting procedure is the coefficient path plot given in Figure \ref{fig:plot}. This plot illustrates how the coefficient estimates change as a function of `\texttt{K}' for a given `\texttt{eta}'. The coefficient path plot for a specific value of `\texttt{eta}' can be obtained using the function `\texttt{plot.spls}'. When there are many responses, it might not be a good idea to plot them all together. The response of interest can be defined with the option `\texttt{yvar}'. If the option `\texttt{yvar}' is not specified, then `\texttt{plot.spls}' draws the coefficient paths of all responses. The command below generates the coefficient path plot given in Figure \ref{fig:plot}. <>= plot.spls( f, yvar=1 ) @ \begin{figure}[tbh] \begin{center} <>= <> @ \caption{\label{fig:plot} Coefficient path plot of the SPLS fit for the yeast data when eta=0.7.} \end{center} \end{figure} In the yeast cell cycle data, the responses were repeatedly measured at different time points. In this case, it is useful to visualize how the estimated coefficients change as a function of time. The function `\texttt{coefplot.spls}' plots the estimated coefficients of the fit obtained from the `\texttt{spls}' function across all the responses. By default, `\texttt{coefplot.spls}' displays the estimated coefficients of the selected predictors. However, in the case that too many predictors are chosen, this might lead to a crowded plotting area. We provide two options to avoid this. First, one can choose predictors to be plotted by the option `\texttt{xvar}'. Note that the index number here is defined among the selected variables. For example, `\texttt{xvar = 1}' refers to the first variable among the selected predictors. In this example, `\texttt{xvar}' can be between 1 and 28. Second, one can also control the number of plots that appear in each window using the option `\texttt{nwin}'. For example, `\texttt{nwin = c(2,2)}' indicates that the plotting area will have 4 plots with two rows and two columns. An illustration of this plotting option is given below and the resulting plots are displayed in Figure \ref{fig:coef}. <>= coefplot.spls( f, nwin=c(2,2), xvar=c(1:4) ) @ \begin{figure}[tbh] \begin{center} <>= <> @ \caption{\label{fig:coef} Plot of the estimated coefficients.} \end{center} \end{figure} \section{eQTL Application} In this section, we study the application of SPLS regression to the expression quantitative trait loci (eQTL) mapping. We provide the mice dataset \cite{mice} as an example of the eQTL application. Mice were collected from a \textit{$F_2$-ob/ob} cross and lacked a functional leptin protein hormone. The functional leptin protein hormone is known to be important for reproduction and regulation of body weight and metabolism. The predictors are the marker map consisting of 145 microsatellite markers from 19 non-sex mouse chromosomes. The responses are the gene expression measurements of 83 transcripts from liver tissues of 60 mice. This group of 83 transcripts was obtained as one of the clusters, when 45,265 transcripts were clustered using a hierarchical clustering approach. See \cite{eqtl} for more details. The mice dataset with the `\texttt{y}' matrix as the gene expression (responses) and the `\texttt{x}' matrix as the marker map (predictors) can be loaded as follows: <>= data(mice) mice$x[1:5,1:5] mice$y[1:5,1:5] @ The optimal parameters were chosen as `\texttt{eta=0.6}' and `\texttt{K=1}' from `\texttt{cv.spls}' as follows. <>= set.seed(1) cv <- cv.spls( mice$x, mice$y, eta = seq(0.1,0.9,0.1), K = c(1:5) ) @ SPLS fits are obtained as below. <>= f <- spls( mice$x, mice$y, eta = cv$eta.opt, K = cv$K.opt ) print(f) @ <>= f <- spls( mice$x, mice$y, eta = 0.6, K = 1 ) print(f) @ In this eQTL analysis, we can improve the initial SPLS fits further as described in \cite{eqtl}. First, we obtain the bootstrapped confidence intervals for the coefficients of the selected predictors using the function `\texttt{ci.spls}'. `\texttt{ci.spls}' also provides the confidence interval plots and lets users to control the plots with two options, `\texttt{plot.fix}' and `\texttt{plot.var}'. If `\texttt{plot.fix="x"}', then the function `\texttt{ci.spls}' plots the confidence intervals of a given predictor specified by the option `\texttt{plot.var}' across all the responses. Similarly, `\texttt{plot.fix="y"}' draws the plot against the predictors for the given responses. Note that if `\texttt{plot.fix="x"}', then the index number is defined among the selected variables. Figure \ref{fig:ciplot} shows the confidence interval plot of the coefficients. <>= set.seed(1) ci.f <- ci.spls( f, plot.it=TRUE, plot.fix='x', plot.var=20 ) @ \begin{figure}[tbh] \begin{center} <>= <> @ \caption{\label{fig:ciplot} Example plot of the confidence intervals of the coefficients.} \end{center} \end{figure} The function `\texttt{ci.spls}' returns the list whose element is the matrix of the confidence intervals of the coefficients. The names of elements of the list are the same as the column names of the responses. <>= cis <- ci.f$cibeta cis[[20]][1:5,] @ After we obtain the confidence intervals of the coefficients, the function `\texttt{correct.spls}' updates the coefficient estimates of the selected variables by setting the coefficients with zero-containing confidence intervals to zero. In addition, `\texttt{correct.spls}' provides the heatmap-type plots of the original SPLS coefficient estimates (Figure \ref{fig:coef1}) and the corrected coefficient estimates (Figure \ref{fig:coef2}). <>= cf <- correct.spls( ci.f ) @ <>= cf <- correct.spls( ci.f, plot.it=FALSE ) @ <>= cf[15:20,1:5] @ <>= heatmap.spls( mat=f$betahat, xlab='Predictors', ylab='Responses', main='Original Coefficient Estimates', coln=16, as='i' ) @ <>= heatmap.spls( mat=cf, xlab='Predictors', ylab='Responses', main='Corrected Coefficient Estimates', coln=16, as='i' ) @ \begin{figure}[tbh] \begin{center} <>= <> @ \caption{\label{fig:coef1} Plot of the original coefficient estimates.} \end{center} \end{figure} \begin{figure}[tbh] \begin{center} <>= <> @ \caption{\label{fig:coef2} Plot of the corrected coefficient estimates.} \end{center} \end{figure} \begin{thebibliography}{99} \bibitem{spls} Chun, H. and Kele\c{s}, S. (2007) ``Sparse partial least squares for simultaneous dimension reduction and variable selection'', (\url{http://www.stat.wisc.edu/~keles/Papers/SPLS_Nov07.pdf}). \bibitem{eqtl} Chun, H. and Kele\c{s}, S. (2008). ``Expression quantitative trait loci mapping with multivariate sparse partial least squares regression", (\url{http://www.stat.wisc.edu/~keles/Papers/chun_keles_eQTL_SPLS_submit.pdf}). \bibitem{mice} Lan, H., M. Chen, J. B. Flowers, B. S. Yandell, D. S. Stapleton, C. M. Mata, E. T-K Mui, M. T. Flowers, K. L. Schueler, K. F. Manly, R. W. Williams, C. Kendziorski, and A. D. Attie (2006). ``Combined expression trait correlations and expression quantitative trait locus mapping", \textit{PLoS Genetics}, 2, e6. \bibitem{chip} Lee, T. I., N. J. Rinaldi, F. Robert, D. T. Odom, Z. Bar-Joseph, G. K. Gerber, N. M. Hannett, C. T. Harbison, C. M. Thomson, I. Simon, J. Zeitlinger, E. G. Jennings, H. L. Murray, D. B. Gordon, B. Ren, J. J. Wyrick, J.-B. Tagne, T. L. Volkert, E. Fraenkel, D. K. Gifford, and R. A. Young (2002). ``Transcriptional regulatory networks in \textit{Saccharmomyces cerevisiae}". \textit{Science}, \textbf{298}, pp. 799--804. \bibitem{cycle} Spellman, P. T., G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher (1998). ``Comprehensive identification of cell cycle- regulated genes of the yeast \textit{Saccharomyces cerevisiae} by microarray hydrization". \textit{Molecular Biology of the Cell}, \textbf{9}, pp. 3273--3279. \end{thebibliography} \end{document}