--- title: "Replicating Resampling with Rsampling - Regressions and ANCOVA" author: "Paulo I Prado, Alexandre Oliveira and Andre Chalom" date: "June 2016" output: rmarkdown::html_vignette: fig_width: 5 fig_height: 5 fig_caption: true vignette: > %\VignetteIndexEntry{Regression and ANCOVA} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r setup, echo = FALSE} knitr::opts_chunk$set( collapse=TRUE, comment = NA, prompt = TRUE ) set.seed(42) ``` ## Installation Rsampling is hosted on CRAN. To install it use ```{r installation CRAN, eval=FALSE} install.packages("Rsampling") ``` You can also install from the GitHub site, where the project is hosted. To do that use the devtools package function `install_github`: ```{r installation GitHub, eval=FALSE} library(devtools) install_github(repo = 'lageIBUSP/Rsampling') ``` After installation load the package ```{r load library} library(Rsampling) ``` ## Regression examples The data frame `rhyzophora` contains measurements of mangrove trees growing in two sites that differ in the soil stability (more or less muddy soils). ```{r inspecting object rhyzophora} head(rhyzophora) summary(rhyzophora) ``` Learn more about the data at its help page (`?rhyzophora`). ### Study Hypothesis The hypothesis is that trees at more unstable soils will allocate more biomass in supporting structures. One possible prediction is that the relation between the tree's torque [^1] and the allocation in supporting roots is different in the two kinds of soils. To express the torque the ratio between the areas of the canopy and the trunk cross-section was used. The allocation in supporting roots was expressed in number of supporting roots and the area at ground level encompassed by these roots. The data suggests a positive relation between torque and number of roots. Plus, the points of the sampled trees at the two kinds of soil seems to separate in the plot: ```{r plot rhyzophora, fig.cap = "Relation between number of supporting roots and the ratio canopy area / trunk area for mangrove trees at two sites that differ in the soil instability (medium and high instability)."} plot(n.roots ~ canopy.trunk, data=rhyzophora, type="n", xlab="canopy area / trunk area", ylab="root number") points(n.roots ~ canopy.trunk, data=rhyzophora, subset=soil.instability=="medium") points(n.roots ~ canopy.trunk, data=rhyzophora, subset=soil.instability=="high", pch=19) legend("topright", c("Medium","High"), title="Soil instability", pch=c(1,19)) ``` This pattern suggests that the relationship between torque and number of roots differ between the two sites. ### Shuffling rows within the strata #### Null hypothesis In order to illustrate how to run randomization restricted to strata we will test the most basic null hypothesis: that there is no relation at none of the soil types. #### Statistic of interest We have a **statistic of interest** for each soil type, which is the slope of the linear regressions: ```{r rhyzophora statistics of interest} rhyz.si <- function(dataframe){ m1 <- lm(n.roots ~ canopy.trunk, data=dataframe, subset=soil.instability=="medium") m2 <- lm(n.roots ~ canopy.trunk, data=dataframe, subset=soil.instability=="high") c(med = coef(m1)[[2]], high = coef(m2)[[2]]) } ## Observed values rhyz.si(rhyzophora) ``` #### Distribution of the statistics under the null hypothesis We simulate the null hypothesis shuffling the values of the torque variable between trees of the same soil type: ```{r rhyzophora resampling, results="hide"} rhyz.r <- Rsampling(type = "normal_rand", dataframe = rhyzophora, statistics = rhyz.si, stratum = rhyzophora$soil.instability, cols = 2, ntrials = 1000) ``` The argument `stratum = rhyzophora$soil.instability`, tells that the shuffling of values (at column 2) must be done within each soil type. When there's more than one statistic of interest, the function `Rsampling` returns a matrix where which line is a statistic and columns are the replications. ```{r rhyzophora resampling results} rhyz.r[,1:3] ``` Values which are equal or bigger than the observed slopes see very rare at the value distribution under the null hypothesis: ```{r rhyzophora null distributions, fig.cap="Frequency distributions of the slope of linear regressions of number of supporting roots to torque in mangrove trees, under the null hypothesis that there is not a relationship. The trees were measured at two sites, and the null hypothesis was simulated shuffling y-values (n of roots) among trees at each site. Red lines show the observed values of the slopes, and the acceptance region of the null hypothesis at 5% is in grey. Absolute values large than the observed are depicted in orange. Results from 1000 simulations. ", fig.width=7.5} par(mfrow=c(1,2)) dplot(rhyz.r[1,], svalue=rhyz.si(rhyzophora)[1], pside="Greater", main="Less unstable soil", xlab="Regression slope") dplot(rhyz.r[2,], svalue=rhyz.si(rhyzophora)[2], pside="Greater", main="More unstable soil", xlab="Regression slope") par(mfrow=c(1,1)) ``` #### Decision: should we reject the null hypothesis? The observed slopes for the two groups are out of the region of acceptance for the one-tailed null hypothesis [^2] at 5% significance level. ```{r rhyzophora test} sum(rhyz.r[1,] >= rhyz.si(rhyzophora)[1])/1000 < 0.05 sum(rhyz.r[2,] >= rhyz.si(rhyzophora)[2])/1000 < 0.05 ``` **Conclusion:** the null hypothesis is rejected (p < 0,05) at both cases. ### Comparing slopes Our main study hypothesis was that the relation between torque and support is different between the two kinds of soils. Assuming the linear relation exists, the difference can occur in two ways: different slopes or same slope but different intercepts. #### Null hypothesis We start by testing the null hypothesis that the linear the slopes of the linear regressions does not differ between soil types. #### Statistic of interest Our statistic of interest is the difference between slopes, which seems small: ```{r rhyzophora diff between slopes} rhyz.si2 <- function(dataframe){ m1 <- lm(n.roots ~ canopy.trunk, data=dataframe, subset=soil.instability=="medium") m2 <- lm(n.roots ~ canopy.trunk, data=dataframe, subset=soil.instability=="high") coef(m1)[[2]] - coef(m2)[[2]] } ## Observed values rhyz.si2(rhyzophora) ``` #### Null hypothesis simulation We simulate our new null hypothesis shuffling the trees between soil types: ```{r rhyzophora resampling slopes, results="hide"} rhyz.r2 <- Rsampling(type = "normal_rand", dataframe = rhyzophora, statistics = rhyz.si2, cols = 1, ntrials = 1000) ``` #### Decision: should we reject the null hypothesis? In this case, we cannot reject the null hypothesis at the 5% significance level: ```{r rhyzophora 2nd test} sum(rhyz.r2 > rhyz.si2(rhyzophora))/1000 < 0.05 ``` ### Comparing intercepts We have decided above to accept the null hypothesis that the slopes are equal. The biological interpretation of this fact is that at both soil types the number of support roots follows the same proportionality relation with the torque variable. This proportionality factor is the slope of the linear regressions applied to **all** trees, which we estimate by adjusting the regression to whole data set: ```{r rhyzophora common slope} lm(n.roots ~ canopy.trunk, data=rhyzophora) ``` That is, to each increment of 100 units of the torque variable in average `r round(coef(lm(n.roots ~ canopy.trunk, data=rhyzophora))[[2]]*100,1)` roots are added. This proportionality is maintained if we add any constant. For this reason the linear model is expressed by: $$E[Y] = \alpha + \beta X$$ Where $E[Y]$ is the expected value of the response variable (root number), $\beta$ is the slope or proportionality factor, and $X$ is the predictor variable (torque). The intercept $\alpha$ does not change the proportionality, rather, it only moves the regression line upwards or downwards. In other words, lines with same slope but different intercepts are parallel. In our case, different intercepts with the same slope express that trees with the same canopy/trunk ratio **always** have more roots at one of the soil types. #### Null hypothesis Now our null hypothesis is that the intercepts of the linear regressions do not differ between soil types. If this is true, the linear regression adjusted to all data would predict well the number of roots for all trees. If not, the points of one soil type will tend to fall below the line, while the points for the other soil type will fall above it. We already adjusted the regression for the whole data set, and then we can add the regression line to the plot: ```{r plot rhyzophora single regression, fig.cap = "Relation between the number of supporting roots and the ratio canopy area / trunk area for mangrove trees at two sites that differ in the soil instability (medium and high instability). Also shown the line of the linear regression fitted to the whole data set."} plot(n.roots ~ canopy.trunk, data=rhyzophora, type="n", xlab="canopy area / trunk area", ylab="number of roots") points(n.roots ~ canopy.trunk, data=rhyzophora, subset=soil.instability=="medium") points(n.roots ~ canopy.trunk, data=rhyzophora, subset=soil.instability=="high", pch=19) abline(lm(n.roots ~ canopy.trunk, data=rhyzophora)) legend("topright", c("Medium","High"), title="Soil instability", pch=c(1,19)) ``` Indeed it seems that this regression underestimates the number of roots of the trees sampled at the site with the most unstable soil, and does the opposite for the trees at sampled the site with less unstable soil. For this reason, the residuals of this regression are positive for trees sampled at unstable soil and negative for the rest. #### Statistic of interest Our statistic of interest is the difference between the means of the residual of trees at each soil type. The residuals are calculated from the regression applied to all data: ```{r statistics of interest 3 rhyzophora} rhyz.si3 <- function(dataframe){ m1 <- lm(n.roots ~ canopy.trunk, data=dataframe) res.media <- tapply(resid(m1), dataframe$soil.instability, mean) res.media[[1]] - res.media[[2]] } ## Observed values rhyz.si3(rhyzophora) ``` #### Simulating the null hypothesis We simulate the null hypothesis in the same way as before: shuffling the trees between soil types (first row of the data table) ```{r rhyzophora resampling intercept, results="hide"} rhyz.r3 <- Rsampling(type = "normal_rand", dataframe = rhyzophora, statistics = rhyz.si3, cols = 1, ntrials = 1000) ``` #### Decision: should we reject the null hypothesis? In this case we reject the null hypothesis: ```{r rhyzophora 3rd test} sum(rhyz.r3 > rhyz.si3(rhyzophora))/1000 < 0.05 ``` Therefore, there is one intercept for each soil type. We can estimate them including the soil's effect on the regression model [^3]: ```{r rhyzophora ancova} (rhyz.ancova <- lm(n.roots ~ soil.instability + canopy.trunk -1, data=rhyzophora)) ``` And then we can add the lines for the two groups to the plot: ```{r plot rhyzophora ancova, fig.cap = "Relation between number of supporting roots and the ratio canopy area / trunk area for mangrove trees at two sites that differ in the soil instability (medium and high instability). Also shown the line of the regressions fitted to each site, but with a common slope."} cfs <- coef(rhyz.ancova) plot(n.roots ~ canopy.trunk, data=rhyzophora, type="n", xlab="canopy area / trunk area", ylab="number of roots") points(n.roots ~ canopy.trunk, data=rhyzophora, subset=soil.instability=="medium", col="blue") points(n.roots ~ canopy.trunk, data=rhyzophora, subset=soil.instability=="high", col="red") abline(cfs[1],cfs[3], col="red") abline(cfs[2],cfs[3], col="blue") legend("topright", c("Medium","High"), title="Soil instability", col=c("blue", "red")) ``` [^1]: Roughly, for our purpose the torque express the force to bring the tree down. [^2]: As it doesn't make sense, in this case, to expect the number of roots to decrease with the torque variable, we did the one-tailed test. [^3]: Technical detail: we add the term `-1` to the regression formula in order to explicit to R that we want the estimates of each intercept. Otherwise, we'd get the estimation the intercept of one group and the difference to the intercept of the other group.