--- title: "Example usage of the vinereg package" author: "Daniel Kraus and Claudia Czado" date: "September 2017" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Example usage of the vinereg package} %\VignetteEngine{knitr::rmarkdown} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, warning = FALSE) ``` This file contains the source code of an exemplary application of the D-vine copula based quantile regression approach implemented in the R-package *vinereg* and presented in Kraus and Czado (2017): *D-vine copula based quantile regression*, Computational Statistics and Data Analysis, 110, 1-18. Please, feel free to address questions to . # Load required packages ```{r, message = FALSE} library(vinereg) require(ggplot2) require(dplyr) require(tidyr) require(AppliedPredictiveModeling) ``` # Data analysis ```{r} set.seed(5) ``` We consider the data set `abalone` from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/abalone) and focus on the female sub-population. In a first application we only consider continuous variables and fit models to predict the quantiles of weight (`whole`) given the predictors `length`, `diameter`, and `height`. ## Load and clean data ```{r} data(abalone, package = "AppliedPredictiveModeling") colnames(abalone) <- c( "sex", "length", "diameter", "height", "whole", "shucked", "viscera", "shell", "rings" ) abalone_f <- abalone %>% dplyr::filter(sex == "F") %>% # select female abalones dplyr::select(-sex) %>% # remove id and sex variables dplyr::filter(height < max(height)) # remove height outlier ``` ```{r, fig.width=7, fig.height=6} pairs(abalone_f, pch = ".") ``` # D-vine regression models ## Parametric D-vine quantile regression We consider the female subset and fit a parametric regression D-vine for the response weight given the covariates len, diameter and height (ignoring the discreteness of some of the variables). The D-vine based model is selected sequentially by maximizing the conditional log-likelihood of the response given the covariates. Covariates are only added if they increase the (possibly AIC- or BIC-corrected) conditional log-likelihood. We use the function `vinereg()` to fit a regression D-vine for predicting the response weight given the covariates `length`, `diameter`, and `height`. The argument `family_set` determines how the pair-copulas are estimated. We will only use one-parameter families and the *t* copula here. The `selcrit` argument specifies the penalty type for the conditional log-likelihood criterion for variable selection. ```{r} fit_vine_par <- vinereg( whole ~ length + diameter + height, data = abalone_f, family_set = c("onepar", "t"), selcrit = "aic" ) ``` The result has a field `order` that shows the selected covariates and their ranking order in the D-vine. ```{r} fit_vine_par$order ``` The field `vine` contains the fitted D-vine, where the first variable corresponds to the response. The object is of class `"vinecop_dist"` so we can use `rvineocpulib`'s functionality to summarize the model ```{r} summary(fit_vine_par$vine) ``` We can also plot the contours of the fitted pair-copulas. ```{r,, fig.width=7, fig.height=7} contour(fit_vine_par$vine) ``` ## Estimation of corresponding conditional quantiles In order to visualize the predicted influence of the covariates, we plot the estimated quantiles arising from the D-vine model at levels 0.1, 0.5 and 0.9 against each of the covariates. ```{r} # quantile levels alpha_vec <- c(0.1, 0.5, 0.9) ``` We call the `fitted()` function on `fit_vine_par` to extract the fitted values for multiple quantile levels. This is equivalent to predicting the quantile at the training data. The latter function is more useful for out-of-sample predictions. ```{r} pred_vine_par <- fitted(fit_vine_par, alpha = alpha_vec) # equivalent to: # predict(fit_vine_par, newdata = abalone.f, alpha = alpha_vec) head(pred_vine_par) ``` To examine the effect of the individual variables, we will plot the predicted quantiles against each of the variables. To visualize the relationship more clearly, we add a smoothed line for each quantile level. This gives an estimate of the expected effect of a variable (taking expectation with respect to all other variables). ```{r, fig.width=7, fig.height=4} plot_effects(fit_vine_par) ``` The fitted quantile curves suggest a non-linear effect of all three variables. ## Comparison to the benchmark model: linear quantile regression This can be compared to linear quantile regression: ```{r,, fig.width=7, fig.height=6} pred_lqr <- pred_vine_par for (a in seq_along(alpha_vec)) { my.rq <- quantreg::rq( whole ~ length + diameter + height, tau = alpha_vec[a], data = abalone_f ) pred_lqr[, a] <- quantreg::predict.rq(my.rq) } plot_marginal_effects <- function(covs, preds) { cbind(covs, preds) %>% tidyr::gather(alpha, prediction, -seq_len(NCOL(covs))) %>% dplyr::mutate(prediction = as.numeric(prediction)) %>% tidyr::gather(variable, value, -(alpha:prediction)) %>% ggplot(aes(value, prediction, color = alpha)) + geom_point(alpha = 0.15) + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"), se = FALSE) + facet_wrap(~ variable, scale = "free_x") + ylab(quote(q(y* "|" * x[1] * ",...," * x[p]))) + xlab(quote(x[k])) + theme(legend.position = "bottom") } plot_marginal_effects(abalone_f[, 1:3], pred_lqr) ``` ## Nonparametric D-vine quantile regression We also want to check whether these results change, when we estimate the pair-copulas nonparametrically. ```{r,, fig.width=4.6, fig.height=4.6} fit_vine_np <- vinereg( whole ~ length + diameter + height, data = abalone_f, family_set = "nonpar", selcrit = "aic" ) fit_vine_np contour(fit_vine_np$vine) ``` Now only the length and height variables are selected as predictors. Let's have a look at the marginal effects. ```{r, fig.width=7, fig.height=4} plot_effects(fit_vine_np, var = c("diameter", "height", "length")) ``` The effects look similar to the parametric one, but slightly more wiggly. Note that even the diameter was not selected as a covariate, it's marginal effect is captured by the model. It just does not provide additional information when height and length are already accounted for. ## Discrete D-vine quantile regression To deal with discrete variables, we use the methodology of Schallhorn et al. (2017). For estimating nonparametric pair-copulas with discrete variable(s), jittering is used (Nagler, 2017). We let `vinereg()` know that a variable is discrete by declaring it `ordered`. ```{r, fig.width=4.7, fig.height=4} abalone_f$rings <- as.ordered(abalone_f$rings) fit_disc <- vinereg(rings ~ ., data = abalone_f, selcrit = "aic") fit_disc plot_effects(fit_disc) ``` # References Kraus and Czado (2017), **D-vine copula based quantile regression**, *Computational Statistics and Data Analysis, 110, 1-18* Nagler (2017), **A generic approach to nonparametric function estimation with mixed data**, *Statistics & Probability Letters, 137:326–330* Schallhorn, Kraus, Nagler and Czado (2017), **D-vine quantile regression with discrete variables**, *arXiv preprint*