Regression models for continuous responses restricted to the unit interval \((0, 1)\) play a central role in empirical research across economics, biostatistics, political science, and environmental studies. Typical examples include rates, proportions, indices, and concentration measures. The simplex regression model provides a flexible and theoretically appealing framework for such data. It is based on the simplex distribution, introduced by Barndorff-Nielsen and Jørgensen (1991), which is indexed by a mean parameter \(\mu \in (0, 1)\) and a dispersion parameter \(\sigma^2 > 0\).
A random variable \(Y\) follows a simplex distribution, \(Y \sim S^-(\mu, \sigma^2)\), if its density is
\[f(y; \mu, \sigma^2) = \left\{2\pi\sigma^2[y(1-y)]^3\right\}^{-1/2} \exp\!\left(-\frac{d(y;\mu)}{2\sigma^2}\right), \quad y \in (0,1),\]
where \(d(y;\mu) = (y - \mu)^2 / [y(1-y)\mu^2(1-\mu)^2]\) is the unit deviance. The mean is \(\mathrm{E}(Y) = \mu\), the variance satisfies \(\mathrm{Var}(Y) < \mu(1-\mu)\), and the variance function is \(\mathrm{V}(\mu) = \mu^3(1-\mu)^3\).
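As a quick numerical check of the density above, the following base-R sketch evaluates it on the log scale and integrates it over \((0,1)\). The helper name `dsimplex_sketch` and the parameter values are illustrative assumptions; they are not part of the package.

```r
# Simplex density written directly from the formula above (hypothetical helper,
# not a SimplexRegression function). Computed on the log scale for stability.
dsimplex_sketch <- function(y, mu, sigma2) {
  dev <- (y - mu)^2 / (y * (1 - y) * mu^2 * (1 - mu)^2)  # unit deviance d(y; mu)
  exp(-0.5 * log(2 * pi * sigma2) - 1.5 * log(y * (1 - y)) - dev / (2 * sigma2))
}

mu <- 0.6      # illustrative values
sigma2 <- 1

# Total mass should be close to 1 and the first moment close to mu
total_mass <- integrate(dsimplex_sketch, 0, 1, mu = mu, sigma2 = sigma2)$value
first_moment <- integrate(function(y) y * dsimplex_sketch(y, mu, sigma2), 0, 1)$value

print(c(total_mass = total_mass, first_moment = first_moment))
```

Working on the log scale avoids overflow in the \([y(1-y)]^{-3/2}\) factor near the boundaries of the unit interval.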
The simplex distribution can assume a wide variety of shapes — symmetric, left- or right-skewed, J-shaped, U-shaped, and even bimodal — making it particularly flexible for modeling bounded responses. Unlike the beta distribution, which is indexed by a precision parameter, the simplex distribution is indexed by a dispersion parameter \(\sigma^2\): smaller values correspond to higher concentration around the mean, whereas larger values indicate greater variability.
The simplex regression model with variable dispersion relates the mean and dispersion to covariates through link functions:
\[g(\mu_i) = \mathbf{x}_i^\top \boldsymbol{\beta} = \eta_{1i} \quad \text{and} \quad h(\sigma^2_i) = \mathbf{z}_i^\top \boldsymbol{\gamma} = \eta_{2i},\]
where \(\boldsymbol{\beta}\) and \(\boldsymbol{\gamma}\) are unknown parameter vectors. For models with a parametric mean link, an additional shape parameter \(\lambda > 0\) is estimated jointly with \(\boldsymbol{\beta}\) and \(\boldsymbol{\gamma}\):
\[g(\mu_i, \lambda) = \mathbf{x}_i^\top \boldsymbol{\beta}.\]
The SimplexRegression package supports five fixed link
functions for the mean submodel — "logit",
"probit", "loglog", "cloglog",
"cauchit" — and two parametric links,
"plogit1" and "plogit2", defined as
\[\text{plogit1:}\; g(\mu_i,\lambda) = \log\!\left[(1-\mu_i)^{-\lambda}-1\right], \qquad \text{plogit2:}\; g(\mu_i,\lambda) = \log\!\left(\frac{\mu_i^\lambda}{1-\mu_i^\lambda}\right).\]
Both reduce to the standard logit when \(\lambda = 1\). For the dispersion submodel,
the links "log", "sqrt", and
"identity" are available. All parameters are estimated by
maximum likelihood using a mixed algorithm that combines the BFGS
quasi-Newton method with Fisher scoring steps.
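The two parametric links defined above can be written in a few lines of base R to confirm that they collapse to the logit at \(\lambda = 1\) and become asymmetric otherwise. The function names below are illustrative helpers written from the formulas, not functions exported by the package.

```r
# Link functions written from the formulas above (illustrative helpers,
# not exported package functions)
logit   <- function(mu) log(mu / (1 - mu))
plogit1 <- function(mu, lambda) log((1 - mu)^(-lambda) - 1)
plogit2 <- function(mu, lambda) log(mu^lambda / (1 - mu^lambda))

mu <- seq(0.05, 0.95, by = 0.05)

# At lambda = 1 both parametric links coincide with the logit
max_diff_p1 <- max(abs(plogit1(mu, 1) - logit(mu)))
max_diff_p2 <- max(abs(plogit2(mu, 1) - logit(mu)))

# For lambda != 1 the links are no longer symmetric about mu = 0.5
asym_p1 <- plogit1(0.5, 2)   # log(0.5^(-2) - 1) = log(3)
asym_p2 <- plogit2(0.5, 2)   # log(0.25 / 0.75) = -log(3)

print(c(max_diff_p1, max_diff_p2, asym_p1, asym_p2))
```

The logit maps \(\mu = 0.5\) to zero, whereas with \(\lambda = 2\) the two parametric links map it to \(\log 3\) and \(-\log 3\), which shows the direction in which each link skews.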
Compared to beta regression, simplex regression has several distinctive advantages:
This vignette illustrates the main functionalities of the
SimplexRegression package through the analysis of monthly
average relative humidity data recorded in Brasília, Brazil (\(n = 312\) observations). The analysis
covers:
- model selection with the penalized Scout Score (penalized.ss) and penalized information criteria (penalized.ic);
- likelihood ratio tests (lrtest);
- score tests for the mean link (scoretest);
- RESET misspecification tests (resettest);
- cross-validated predictive assessment via PRESS (press);
- diagnostic plots (plot, halfnormal.plot);
- influence diagnostics (local.influence, diag.im, diag.distances).

Relative air humidity (RH) is defined as the ratio of the partial pressure of water vapor in the air to the saturation vapor pressure at the same temperature. In Brasília, low RH levels during dry months are associated with adverse health outcomes (asthma, nosebleeds, dehydration), increased forest-fire risk, and pressure on water resources. The dataset contains monthly averages from January 2000 to December 2025, obtained from the National Institute of Meteorology (INMET).
library(SimplexRegression)
data(RelativeHumidity, package = "SimplexRegression")
head(RelativeHumidity, 5)
#> Date RH Ins Ins2 Pre Pre2 Neb AP MT WS Dir
#> 1 2000-01-31 0.763 150.0 150.0 130.0 130.0 8.0 883.9 26.5 2.5 32
#> 2 2000-02-29 0.756 145.7 145.7 168.3 168.3 7.5 885.6 26.7 2.0 0
#> 3 2000-03-31 0.780 164.9 164.9 229.6 229.6 7.7 884.9 26.0 2.6 14
#> 4 2000-04-30 0.692 224.8 224.8 98.8 98.8 6.1 886.5 26.5 2.2 14
#> 5 2000-05-31 0.574 278.3 278.3 0.0 0.0 3.9 887.5 26.5 1.8 14

| Variable | Description |
|---|---|
| Date | Observation date (end of month) |
| RH | Monthly average relative humidity, rescaled to \((0,1)\) |
| Ins | Total insolation (hours), with two missing values |
| Ins2 | Total insolation with imputed missing values |
| Pre | Total precipitation (mm), with two missing values |
| Pre2 | Total precipitation with imputed missing values |
| Neb | Average cloudiness (tenths) |
| AP | Average atmospheric pressure (hPa) |
| MT | Average maximum temperature (°C) |
| WS | Average wind speed (m/s) |
| Dir | Predominant wind direction (degrees) |
To capture the 12-month seasonal cycle we add harmonic regressors \(s_i = \sin(2\pi i/12)\) and \(c_i = \cos(2\pi i/12)\), and a rainy-season dummy equal to 1 for October, November, and December (the rainy season in Brasília) and 0 otherwise.
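The chunk that builds the working data frame `rh` is not shown in this section. The following base-R sketch constructs the seasonal terms as described above, assuming that \(i\) is the observation index \(1, \dots, 312\) and that dates are month ends. The names `hs`, `hc`, and `dummy` follow the coefficient labels in the model output; the vector name `date` is illustrative.

```r
# Month-end dates from January 2000 to December 2025 (n = 312)
date <- seq(as.Date("2000-02-01"), by = "month", length.out = 312) - 1
i <- seq_along(date)                       # observation index
month <- as.integer(format(date, "%m"))

hs <- sin(2 * pi * i / 12)                 # harmonic sine term
hc <- cos(2 * pi * i / 12)                 # harmonic cosine term
dummy <- as.integer(month %in% 10:12)      # 1 for Oct-Dec, 0 otherwise

head(data.frame(date, hs = round(hs, 3), hc = round(hc, 3), dummy), 3)
```

In the actual analysis these columns would be added to the `RelativeHumidity` data to form `rh`; that step is an assumption about the hidden preparation code.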
summary(rh$RH)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.301 0.523 0.661 0.635 0.756 0.851
cat(sprintf(
"Std. dev.: %.4f | Skewness: %.4f\n",
sd(rh$RH),
mean(((rh$RH - mean(rh$RH)) / sd(rh$RH))^3)
))
#> Std. dev.: 0.1331 | Skewness: -0.4660
The response ranges from 0.301 to 0.851, with mean 0.635 and moderate left skewness (\(-0.47\)), consistent with a higher concentration of RH values above the mean during the wetter months. These features support the use of a unit-interval distribution such as the simplex.
All models fitted in this vignette share the same formula structure.
The two-part formula y ~ x | z specifies the mean submodel
(left of |) and the dispersion submodel (right of
|) simultaneously. The mean submodel includes insolation
(Ins2), maximum temperature (MT), wind speed
(WS), harmonic terms (hs, hc) to
capture seasonality, a rainy-season dummy, and its interaction with wind
speed. The dispersion submodel contains only precipitation
(Pre2), allowing the variability of RH to depend on
rainfall.
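The formula object itself is not displayed in this section, but its structure can be read from the model output printed later in the vignette. The sketch below writes it out and uses only base R to show how the `|` separates the two submodels; the object names `rhs`, `mean_rhs`, and `disp_rhs` are illustrative, and this is not necessarily how the package parses the formula internally.

```r
# Two-part formula as printed in the model output later in this vignette
formula <- RH ~ Ins2 + MT + WS + hs + hc + dummy + I(dummy * WS) | Pre2

# In base R the right-hand side parses as a call to `|`, so the two
# submodels can be pulled apart without extra packages
rhs <- formula[[3]]
mean_rhs <- rhs[[2]]         # left of |
disp_rhs <- rhs[[3]]         # right of |

mean_vars <- all.vars(mean_rhs)
disp_vars <- all.vars(disp_rhs)
print(mean_vars)
print(disp_vars)
```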
We first fit models with the two parametric mean links available in
the package, plogit1 and plogit2. These links
include a shape parameter \(\lambda\)
estimated jointly with the regression coefficients. When \(\lambda = 1\), both reduce to the standard
logit link; values \(\lambda \neq 1\)
introduce asymmetry in the link, adding flexibility to capture
non-standard relationships between the mean response and the linear
predictor.
When comparing models that all use a parametric link, the
penalized Scout Score (penalized.ss) and
penalized information criteria
(penalized.ic) are recommended. These functions incorporate
an additional penalty controlled by kappa (default
kappa = 0.1) that accounts for the complexity introduced by
\(\lambda\) relative to its deviation
from the standard logit. The model with the highest Scout Score or
lowest penalized criterion is selected.
penalized.ss(fit_p1, fit_p2, kappa = 0.1)
#>
#> Penalized Scout Score values (kappa = 0.100):
#> df SS
#> fit_p1 11 16.507
#> fit_p2 11 0.000
#>
#> Selected model: fit_p1
penalized.ic(fit_p1, fit_p2, kappa = 0.1)
#>
#> Penalized information criteria values (kappa = 0.100):
#> df AICc BICc HQICc
#> fit_p1 11 -1352.6 -1311.4 -1336.1
#> fit_p2 11 -1349.3 -1308.1 -1332.8
All penalized criteria unanimously select the plogit1
link. The Scout Score of fit_p1 is substantially higher
than that of fit_p2, and all three penalized information
criteria are lower for fit_p1.
We also fit models with all five fixed mean links supported by the package. Fixed links do not include an additional parameter and are therefore always comparable using standard (unpenalized) criteria.
fit_loglog <- simplexreg(formula, data = rh, link.mu = "loglog")
fit_logit <- simplexreg(formula, data = rh, link.mu = "logit")
fit_probit <- simplexreg(formula, data = rh, link.mu = "probit")
fit_cauchit <- simplexreg(formula, data = rh, link.mu = "cauchit")
fit_cloglog <- simplexreg(formula, data = rh, link.mu = "cloglog")
When comparing models with both fixed and parametric links
simultaneously, the unpenalized version
(kappa = 0) must be used, since the penalty for \(\lambda\) is only meaningful when all
candidates use a parametric link.
penalized.ss(
fit_loglog, fit_logit, fit_probit,
fit_cauchit, fit_cloglog, fit_p1,
kappa = 0
)
#>
#> Scout Score values:
#> df SS
#> fit_loglog 10 115.923
#> fit_logit 10 55.299
#> fit_probit 10 37.521
#> fit_cauchit 10 15.857
#> fit_cloglog 10 0.000
#> fit_p1 11 25.182
#>
#> Selected model: fit_loglog
The log-log link achieves the highest Scout Score and is selected as
the best-fitting specification. We proceed with fit_loglog
for all remaining analyses.
The summary method provides a standard regression
output: estimated coefficients for the mean and dispersion submodels,
standard errors, Wald \(z\)-statistics
and \(p\)-values, the maximized
log-likelihood, information criteria (AIC, BIC, HQIC), pseudo-\(R^2\) measures, and details on the
numerical optimization procedure.
summary(fit_loglog)
#>
#> Call:
#> simplexreg(formula = formula, data = rh, link.mu = "loglog")
#>
#> Quantile residuals:
#> Min 1Q Median 3Q Max
#> -2.585 -0.635 -0.002 0.644 2.992
#>
#> Coefficients (mean model with loglog link):
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 3.5101214213 0.1335820307 26.2769 <2e-16 ***
#> Ins2 -0.0049290663 0.0002225212 -22.1510 <2e-16 ***
#> MT -0.0592798832 0.0049701318 -11.9272 <2e-16 ***
#> WS -0.0345373509 0.0133605126 -2.5850 0.0097 **
#> hs 0.2797251376 0.0121416923 23.0384 <2e-16 ***
#> hc 0.0428113435 0.0185577603 2.3069 0.0211 *
#> dummy 0.3374909598 0.0661928168 5.0986 <2e-16 ***
#> I(dummy * WS) -0.1040241692 0.0302313202 -3.4409 0.0006 ***
#>
#> Dispersion coefficients (dispersion model with log link):
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -2.8324455820 0.1142028509 -24.8019 <2e-16 ***
#> Pre2 0.0020288924 0.0006649587 3.0512 0.0023 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Log-likelihood: 687.47 on 10 Df
#> AIC: -1354.9
#> BIC: -1317.5
#> HQIC: -1340
#> Pseudo R-squared (Nagelkerke): 0.95318
#> Pseudo R-squared (Ferrari and Cribari-Neto): 0.95212
#> P-squared (Espinheira-Silva-Lima): 0.93404
#> Number of observations: 312
#> Number of iterations: 101 (BFGS) + 7 (Fisher scoring)
All regression coefficients are statistically significant at the 5%
level. The harmonic terms (hs, hc) and the
rainy-season dummy capture the strong seasonal pattern in RH, while
insolation (Ins2), maximum temperature (MT),
and wind speed (WS) explain the within-season variation.
The Nagelkerke and Ferrari–Cribari-Neto pseudo-\(R^2\) values both exceed 0.95, indicating
an excellent fit. The model converged in 101 BFGS iterations followed by
7 Fisher scoring steps.
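The Wald statistics in the coefficient table are the ratio of each estimate to its standard error. The sketch below recomputes the WS row from the printed values, assuming a two-sided standard normal reference distribution, which is consistent with the reported \(z\) value and \(p\)-value.

```r
# Estimate and standard error for WS, copied from the summary output above
est <- -0.0345373509
se  <-  0.0133605126

z <- est / se                    # Wald z statistic
p <- 2 * pnorm(-abs(z))          # two-sided p-value from the standard normal

round(c(z = z, p = p), 4)
```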
The standard (unpenalized) AIC, BIC, and HQIC are available through functions of the same names, each of which accepts several fitted models at once. Lower values indicate a better trade-off between fit and complexity.
AIC(fit_loglog, fit_logit, fit_probit, fit_cauchit, fit_cloglog, fit_p1)
#> df AIC
#> fit_loglog 10 -1354.9
#> fit_logit 10 -1345.9
#> fit_probit 10 -1339.1
#> fit_cauchit 10 -1348.6
#> fit_cloglog 10 -1300.7
#> fit_p1 11 -1353.4
BIC(fit_loglog, fit_logit, fit_probit, fit_cauchit, fit_cloglog, fit_p1)
#> df BIC
#> fit_loglog 10 -1317.5
#> fit_logit 10 -1308.4
#> fit_probit 10 -1301.7
#> fit_cauchit 10 -1311.2
#> fit_cloglog 10 -1263.2
#> fit_p1 11 -1312.2
HQIC(fit_loglog, fit_logit, fit_probit, fit_cauchit, fit_cloglog, fit_p1)
#> df HQIC
#> fit_loglog 10 -1340.0
#> fit_logit 10 -1330.9
#> fit_probit 10 -1324.2
#> fit_cauchit 10 -1333.6
#> fit_cloglog 10 -1285.7
#> fit_p1 11 -1336.9
The log-log model consistently achieves the lowest values across all three criteria, confirming the Scout Score selection.
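The three criteria for the log-log model can be recovered from the log-likelihood, parameter count, and sample size reported by `summary(fit_loglog)`. The sketch assumes the textbook definitions of AIC, BIC, and HQIC; the package's internal formulas are not shown here, but the printed values agree with these definitions up to rounding.

```r
# Values reported by summary(fit_loglog)
loglik <- 687.47
k <- 10        # estimated parameters (8 mean + 2 dispersion)
n <- 312

aic  <- -2 * loglik + 2 * k
bic  <- -2 * loglik + k * log(n)
hqic <- -2 * loglik + 2 * k * log(log(n))

round(c(AIC = aic, BIC = bic, HQIC = hqic), 1)
```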
The coef method extracts estimated coefficients for the
full model or for each submodel separately, controlled by the
model argument ("full", "mean",
or "dispersion"). The vcov method returns the
corresponding variance-covariance matrix.
coef(fit_loglog) # full coefficient vector
#> (Intercept) Ins2
#> 3.5101214 -0.0049291
#> MT WS
#> -0.0592799 -0.0345374
#> hs hc
#> 0.2797251 0.0428113
#> dummy I(dummy * WS)
#> 0.3374910 -0.1040242
#> (dispersion)_(Intercept) (dispersion)_Pre2
#> -2.8324456 0.0020289
coef(fit_loglog, model = "mean") # mean submodel only
#> (Intercept) Ins2 MT WS hs
#> 3.5101214 -0.0049291 -0.0592799 -0.0345374 0.2797251
#> hc dummy I(dummy * WS)
#> 0.0428113 0.3374910 -0.1040242
coef(fit_loglog, model = "dispersion") # dispersion submodel only
#> (Intercept) Pre2
#> -2.8324456 0.0020289
round(vcov(fit_loglog, model = "mean"), 6) # vcov of mean submodel
#> (Intercept) Ins2 MT WS hs
#> (Intercept) 0.017844 9e-06 -0.000627 -0.001014 -0.000893
#> Ins2 0.000009 0e+00 -0.000001 -0.000001 0.000000
#> MT -0.000627 -1e-06 0.000025 0.000032 0.000022
#> WS -0.001014 -1e-06 0.000032 0.000179 0.000076
#> hs -0.000893 0e+00 0.000022 0.000076 0.000147
#> hc 0.001378 3e-06 -0.000063 -0.000112 -0.000044
#> dummy -0.002718 -1e-06 0.000071 0.000403 0.000274
#> I(dummy * WS) 0.000841 1e-06 -0.000023 -0.000157 -0.000066
#> hc dummy I(dummy * WS)
#> (Intercept) 0.001378 -0.002718 0.000841
#> Ins2 0.000003 -0.000001 0.000001
#> MT -0.000063 0.000071 -0.000023
#> WS -0.000112 0.000403 -0.000157
#> hs -0.000044 0.000274 -0.000066
#> hc 0.000344 -0.000353 0.000088
#> dummy -0.000353 0.004381 -0.001892
#> I(dummy * WS) 0.000088 -0.001892 0.000914
The lrtest method from the lmtest package
performs likelihood ratio tests between nested simplex regression
models. A natural application is testing whether the dispersion is
constant against a model with covariate-dependent dispersion. The
restricted model (constant dispersion) is obtained via
update by replacing the dispersion submodel with an
intercept only (| 1).
fit_loglog_null <- update(fit_loglog, . ~ . | 1)
lmtest::lrtest(fit_loglog, fit_loglog_null)
#> Likelihood ratio test
#>
#> Model 1: RH ~ Ins2 + MT + WS + hs + hc + dummy + I(dummy * WS) | Pre2
#> Model 2: RH ~ Ins2 + MT + WS + hs + hc + dummy + I(dummy * WS) | 1
#> #Df LogLik Df Chisq Pr(>Chisq)
#> 1 10 687
#> 2 9 683 -1 9.73 0.0018 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The small \(p\)-value (\(0.0018\)) provides strong evidence
against constant dispersion, confirming that RH variability is
significantly associated with precipitation (Pre2). The
model with variable dispersion is therefore preferred.
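The \(p\)-value of the likelihood ratio test comes from the upper tail of a chi-squared distribution whose degrees of freedom equal the number of restrictions. A minimal base-R sketch using the statistic printed above:

```r
lr_stat <- 9.73   # Chisq value printed by lrtest above
df <- 1           # one restriction: the Pre2 dispersion coefficient

p_value <- pchisq(lr_stat, df = df, lower.tail = FALSE)
round(p_value, 4)
```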
The scoretest function implements Rao’s score test for
the hypothesis that the standard logit link is adequate, i.e., \(H_0\!: \lambda = 1\). The test requires a
model fitted under \(H_0\) (with the
logit link) and tests against either "plogit1" or
"plogit2" as the alternative. Rejection of \(H_0\) suggests that a parametric link
provides a better fit than the standard logit.
fit_logit_h0 <- simplexreg(formula, data = rh, link.mu = "logit")
scoretest(fit_logit_h0, link.mu = "plogit1")
#>
#> Rao score test
#>
#> data: Logit vs plogit1
#> S = 9.7, df = 1, p-value = 0.0018
scoretest(fit_logit_h0, link.mu = "plogit2")
#>
#> Rao score test
#>
#> data: Logit vs plogit2
#> S = 9.08, df = 1, p-value = 0.0026
The resettest function implements Ramsey’s RESET test,
which augments the original model with the squared fitted linear
predictor as an additional covariate and tests its significance. By
default (dispersion = TRUE), the squared predictor is added
to both the mean and dispersion submodels. Setting
dispersion = FALSE restricts the augmentation to the mean
submodel only. Failure to reject \(H_0\) supports the adequacy of the assumed
functional form.
resettest(fit_loglog) # both submodels augmented
#>
#> RESET test
#>
#> data: RH ~ Ins2 + MT + WS + hs + hc + dummy + I(dummy * WS) | Pre2
#> RESET = 0.404, df = 2, p-value = 0.82
resettest(fit_loglog, dispersion = FALSE) # mean submodel only
#>
#> RESET test
#>
#> data: RH ~ Ins2 + MT + WS + hs + hc + dummy + I(dummy * WS) | Pre2
#> RESET = 0.00136, df = 1, p-value = 0.97
The large \(p\)-values in both cases (\(0.817\) and \(0.971\)) provide no evidence of functional form misspecification in either submodel.
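The RESET \(p\)-values can be recomputed to three decimals from the printed statistics. The sketch assumes that each statistic is referred to a chi-squared distribution with the degrees of freedom shown in the output, which matches the two-decimal \(p\)-values printed by `resettest`.

```r
# Statistics and degrees of freedom copied from the RESET output above
p_both <- pchisq(0.404,   df = 2, lower.tail = FALSE)  # both submodels augmented
p_mean <- pchisq(0.00136, df = 1, lower.tail = FALSE)  # mean submodel only

round(c(both = p_both, mean_only = p_mean), 3)
```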
The fitted method returns fitted mean values \(\hat{\mu}_i\). The residuals
method supports ten residual types. Quantile residuals (default) have an
approximately standard normal distribution under correct model
specification and are recommended for general diagnostics. Weighted
residuals are particularly useful for half-normal plots.
head(fitted(fit_loglog))
#> 1 2 3 4 5 6
#> 0.75946 0.78218 0.77083 0.68605 0.58019 0.53788
head(residuals(fit_loglog, type = "quantile")) # approx. N(0,1)
#> 1 2 3 4 5 6
#> 0.13417 -1.27716 0.37469 0.20025 -0.22132 -0.46531
head(residuals(fit_loglog, type = "pearson"))
#> [1] 0.16448 -1.29814 0.40502 0.22319 -0.21326 -0.46403
head(residuals(fit_loglog, type = "weighted")) # for halfnormal.plot
#> 1 2 3 4 5 6
#> 0.16533 -1.30327 0.40565 0.22250 -0.21350 -0.46338
The press function computes the PRESS (Predicted
Residual Error Sum of Squares) statistic and the associated
cross-validation measures \(P^2\) and
adjusted \(P^2_c\). The \(P^2\) statistic is a cross-validation
analog of the coefficient of determination, computed from the hat matrix
diagonal without refitting the model \(n\) times. Values close to 1 indicate
strong predictive accuracy. The function accepts multiple models
simultaneously, enabling direct comparison of predictive performance
across competing specifications.
press(fit_loglog) # single model
#> P2 P2_c PRESS
#> 0.93404 0.93208 327.39633
press(fit_loglog, fit_logit, fit_probit) # comparing models
#> P2 P2_c PRESS
#> fit_loglog 0.93404 0.93208 327.40
#> fit_logit 0.95150 0.95006 327.06
#> fit_probit 0.95402 0.95265 326.83
All three models show strong predictive accuracy. Among them, the probit model attains the highest \(P^2\) and \(P^2_c\) and the lowest PRESS, while the log-log model has the lowest \(P^2\) (0.934). The PRESS statistic aggregates the squared leave-one-out prediction errors.
The predict method computes predicted values for new or
original data. The type argument controls the output:
"response" returns fitted means \(\hat{\mu}_i\); "link" returns
the linear predictors \(\hat{\eta}_{1i}\) and \(\hat{\eta}_{2i}\);
"dispersion" returns fitted dispersion values \(\hat{\sigma}^2_i\).
head(predict(fit_loglog, type = "response")) # fitted means
#> 1 2 3 4 5 6
#> 0.75946 0.78218 0.77083 0.68605 0.58019 0.53788
head(predict(fit_loglog, type = "link")$mean) # mean linear predictor
#> 1 2 3 4 5 6
#> 1.29044 1.40376 1.34597 0.97601 0.60806 0.47783
head(predict(fit_loglog, type = "link")$dispersion) # dispersion predictor
#> 1 2 3 4 5 6
#> -2.5687 -2.4910 -2.3666 -2.6320 -2.8324 -2.8324
head(predict(fit_loglog, type = "dispersion")) # fitted sigma^2
#> 1 2 3 4 5 6
#> 0.076636 0.082829 0.093798 0.071935 0.058869 0.058869
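The "link", "response", and "dispersion" outputs above are tied together by the inverse link functions. The sketch below applies them to the first three printed linear predictors, assuming the log-log link has the form \(g(\mu) = -\log(-\log \mu)\), which is the form consistent with the printed values.

```r
# Linear predictors for the first three observations, copied from the
# predict(..., type = "link") output in this section
eta_mean <- c(1.29044, 1.40376, 1.34597)
eta_disp <- c(-2.5687, -2.4910, -2.3666)

# Inverse log-log link (assumed form g(mu) = -log(-log(mu))) and inverse log link
mu_hat     <- exp(-exp(-eta_mean))
sigma2_hat <- exp(eta_disp)

round(mu_hat, 5)
round(sigma2_hat, 4)
```

Up to rounding of the printed predictors, the results agree with the fitted means and fitted dispersion values shown above.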
# Prediction via newdata (the first three rows are reused here for illustration)
new_obs <- rh[1:3, ]
predict(fit_loglog, newdata = new_obs, type = "response")
#> [1] 0.75946 0.78218 0.77083
The simulate method generates response vectors from the
fitted simplex distribution using the estimated \(\hat{\mu}_i\) and \(\hat{\sigma}^2_i\). The nsim
argument controls the number of replicates and seed ensures
reproducibility. The result is a data frame with nsim
columns.
The hatvalues method returns the diagonal elements \(h_{ii}\) of the hat matrix, measuring the
leverage of each observation on the fitted values. Observations with
\(h_{ii}\) much larger than the average
\(r/n\) (where \(r\) is the number of mean-submodel coefficients; here \(8/312 \approx 0.0256\))
deserve closer inspection. The cooks.distance method
computes approximate Cook’s distances, which combine leverage and
residual size to measure overall influence on the parameter estimates.
The type argument selects the residual type used
("pearson" or "weighted").
hii <- hatvalues(fit_loglog)
cook <- cooks.distance(fit_loglog, type = "pearson")
cat(sprintf("Leverages — max: %.4f mean: %.4f\n", max(hii), mean(hii)))
#> Leverages — max: 0.1272 mean: 0.0256
cat(sprintf("Cook's D — max: %.4f\n", max(cook)))
#> Cook's D — max: 0.2195
No observation has a disproportionately large leverage or Cook’s distance, suggesting no single point dominates the fit.
The gleverage function computes the generalized leverage
values, which extend the classical hat values to account for both the
mean and dispersion submodels. High generalized leverage indicates
observations that may exert disproportionate influence on the fitted
values across the entire model.
The plot method produces up to eight diagnostic plots
selected via which: (1) residuals vs. observation index;
(2) residuals vs. fitted values; (3) residuals vs. linear predictor; (4)
observed vs. fitted values; (5) normal Q-Q plot; (6) Cook’s distances;
(7) generalized leverages. The default residual type is
"quantile". For plots 6 and 7, threshold
highlights influential observations and label.pos controls
label placement (1 = below, 2 = left, 3 = above, 4 = right).
Diagnostic plots (1–6) for the fitted simplex regression model with log-log link.
Cook’s distances. Observations exceeding the threshold of 0.15 are labeled.
Generalized leverage values. Observations exceeding 0.08 are labeled.
The quantile residuals are approximately standard normal, with no systematic patterns across fitted values or the linear predictor. Cook’s distances and generalized leverages are uniformly small, with no evidence of unduly influential observations.
Local influence analysis quantifies how small perturbations to the
model assumptions affect the parameter estimates. Two perturbation
schemes are supported: "case.weight" perturbs observation
weights and "response" perturbs the response values. The
parameter argument selects the parameter block of interest
— "theta" (all parameters), "beta" (mean
submodel), or "gamma" (dispersion submodel). The
type argument controls the influence measure:
"Ci" for the total local influence index \(C_i\), or "dmax" for the
direction of maximum curvature. Observations exceeding
threshold are labeled in the index plot.
local.influence(
fit_loglog,
scheme = "case.weight",
parameter = "theta",
type = "Ci",
plot = TRUE,
threshold = 0.5,
label.pos = c(3, 4, 3, 2, 2)
)
Total local influence \(C_i\) under case-weight perturbation for all parameters.
local.influence(
fit_loglog,
scheme = "response",
parameter = "theta",
type = "Ci",
plot = TRUE,
threshold = 0.4,
label.pos = 2
)
Total local influence \(C_i\) under response perturbation for all parameters.
A small set of observations is identified as locally influential under both perturbation schemes, but none produces an extreme influence value.
The halfnormal.plot function produces a half-normal plot
of the absolute residuals together with a simulated envelope based on
nsim Monte Carlo replications of the fitted model (default
nsim = 100). Points outside the envelope may indicate model
inadequacy. The type argument selects the residual type
(default "weighted") and seed ensures
reproducibility.
Half-normal plot of absolute weighted residuals with 95% simulated envelope (100 replications).
All diagnostic plots reveal no evidence against the adequacy of the fitted model. Quantile residuals are approximately standard normal, no observation has disproportionate influence, and nearly all points in the half-normal plot fall within the simulated envelope.
plot(rh$Date, rh$RH,
type = "l", col = "black", lwd = 1.2,
xlab = "Date", ylab = "Relative humidity",
main = "Observed vs Fitted RH — Brasília (2000–2025)")
lines(rh$Date, fitted(fit_loglog), col = "red", lwd = 1.5, lty = 2)
legend("bottomleft",
legend = c("Observed", "Fitted"),
col = c("black", "red"),
lty = c(1, 2), lwd = c(1.2, 1.5),
bty = "n", cex = 0.85)
Observed (solid black) and fitted (dashed red) monthly relative humidity in Brasília, January 2000 to December 2025.
The simplex regression model with log-log link accurately captures the strong seasonal pattern in relative humidity, with fitted values tracking the observed series closely throughout the entire 26-year period.
This vignette demonstrated the main functionalities of the
SimplexRegression package through the analysis of monthly
relative humidity data from Brasília. The selected simplex regression
model uses the log-log mean link with variable dispersion driven by
precipitation. It achieved excellent goodness of fit (pseudo-\(R^2 > 0.95\)) and strong cross-validated
predictive performance (\(P^2 \approx 0.93\)), with no evidence of misspecification from the RESET test
or departures from model assumptions in the diagnostic plots.
The package provides a comprehensive and self-consistent workflow — from data preparation and model fitting to inference, model selection, diagnostics, and influence analysis — following the standard conventions of formula-based regression modeling in R.
sessionInfo()
#> R version 4.5.2 (2025-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#>
#> Matrix products: default
#> LAPACK version 3.12.1
#>
#> locale:
#> [1] LC_COLLATE=C LC_CTYPE=Portuguese_Brazil.utf8
#> [3] LC_MONETARY=Portuguese_Brazil.utf8 LC_NUMERIC=C
#> [5] LC_TIME=Portuguese_Brazil.utf8
#>
#> time zone: America/Sao_Paulo
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] SimplexRegression_0.1.3
#>
#> loaded via a namespace (and not attached):
#> [1] cli_3.6.5 knitr_1.51 rlang_1.1.6 xfun_0.54
#> [5] Formula_1.2-5 otel_0.2.0 jsonlite_2.0.0 zoo_1.8-15
#> [9] htmltools_0.5.9 pracma_2.4.6 lmtest_0.9-40 sass_0.4.10
#> [13] rmarkdown_2.30 grid_4.5.2 evaluate_1.0.5 jquerylib_0.1.4
#> [17] fastmap_1.2.0 yaml_2.3.12 lifecycle_1.0.5 compiler_4.5.2
#> [21] sandwich_3.1-1 rstudioapi_0.18.0 lattice_0.22-7 digest_0.6.38
#> [25] R6_2.6.1 parallel_4.5.2 bslib_0.10.0 Matrix_1.7-4
#> [29] tools_4.5.2 cachem_1.1.0