This vignette introduces the main functions of the
PEAXAI package.
PEAXAI provides a methodology that integrates Data
Envelopment Analysis (DEA) with Machine Learning (ML) algorithms to
train a classifier that estimates the probability that each
decision-making unit (DMU) is (in)efficient.
Once the model is trained, the package offers several functions for post-hoc analysis, including global feature importance (PEAXAI_global_importance), counterfactual improvement targets (PEAXAI_targets), efficiency rankings (PEAXAI_ranking), and peer identification (PEAXAI_peer).
These tools facilitate benchmarking and policy simulations in a transparent and explainable framework.
To illustrate the functionality of the PEAXAI package,
we use a dataset of 917 companies in the food industry sector operating
in Spain. The data were retrieved from the SABI (Sistema de Análisis de
Balances Ibéricos) database for the year 2023 and include only firms
with more than 50 employees.
Each row corresponds to a single firm and includes both financial and operational variables relevant for productivity and efficiency analysis.
The output variable used to measure performance is:
operating_income: Operating income (in millions of euros).
The input variables include:
total_assets: Total assets (in millions of euros).
employees: Number of employees.
fixed_assets: Tangible fixed assets (in millions of euros).
personnel_expenses: Personnel-related costs (in millions of euros).
Additionally, the variable autonomous_community indicates the geographical location of each firm within one of Spain’s 17 autonomous communities. This allows the analysis to reflect regional institutional and market heterogeneity.
The dataset exhibits a wide dispersion across firms, including both medium-sized and large enterprises. This heterogeneity poses a realistic challenge for evaluating efficiency and provides a rich testing ground for model explainability.
For illustration, this vignette focuses on a subset of the data: firms located in the Comunidad Valenciana (Autonomous Community of Valencia). We filter the dataset to include only these firms and remove the autonomous_community column.
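A minimal sketch of this filtering step, assuming the region is recorded literally as "Comunidad Valenciana" in the autonomous_community column:
data <- data[data$autonomous_community == "Comunidad Valenciana", ]  # keep Valencian firms only
data$autonomous_community <- NULL                                    # drop the location column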
Quick structure and summary.
str(data)
#> 'data.frame': 97 obs. of 5 variables:
#> $ total_assets : num 259 185 129 104 153 ...
#> $ employees : num 70 261 354 968 1076 ...
#> $ fixed_assets : num 84.15 21.75 5.76 30.55 103.76 ...
#> $ personnel_expenses: num 16.7 19.9 13.7 36.8 31.8 ...
#> $ operating_income : num 461 358 357 294 268 ...
summary(data)
#> total_assets employees fixed_assets personnel_expenses
#> Min. : 1.537 Min. : 50.0 Min. : 0.1421 Min. : 1.037
#> 1st Qu.: 8.989 1st Qu.: 75.0 1st Qu.: 2.6797 1st Qu.: 2.167
#> Median : 24.555 Median : 98.0 Median : 6.2578 Median : 3.059
#> Mean : 41.030 Mean : 201.1 Mean : 15.2799 Mean : 6.757
#> 3rd Qu.: 52.409 3rd Qu.: 240.0 3rd Qu.: 20.0960 3rd Qu.: 8.244
#> Max. :258.825 Max. :1076.0 Max. :140.6890 Max. :36.789
#> operating_income
#> Min. : 2.382
#> 1st Qu.: 12.994
#> Median : 29.138
#> Mean : 62.307
#> 3rd Qu.: 72.688
#> Max. :460.578
Determine inputs (x) and outputs (y) indices.
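Based on the column order shown above, a natural specification uses the first four columns as inputs and the fifth as the output:
x <- 1:4  # total_assets, employees, fixed_assets, personnel_expenses
y <- 5    # operating_income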
Determine the technology assumptions.
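As an illustrative (unverified) choice, variable returns to scale could be assumed; check the PEAXAI documentation for the identifiers actually accepted by RTS:
RTS <- "vrs"  # assumed identifier for variable returns to scale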
To address class imbalance, we apply SMOTE based on the best-practice frontier (assuming a convex production technology). First, we identify the combinations of DMUs that define the frontier. Then, we generate one of two types of synthetic DMUs from these combinations: efficient units (on the frontier) and near-frontier inefficient units (inside the frontier’s envelope).
When the efficient class is the minority, efficient-class SMOTE on the best-practice frontier helps the classifier learn the frontier more accurately. Optionally, if the observed imbalance still falls short of the target ratio, we also generate near-frontier inefficient synthetic units derived from the frontier geometry to sharpen the separation between efficient and inefficient regions.
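The target ratio between classes is controlled by imbalance_rate; the value below is illustrative and matches the rate reported for the selected model later in this vignette:
imbalance_rate <- 0.3  # illustrative target imbalance rate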
Define the machine learning methods and their tuning parameters. Here we configure a neural network (nnet) with a grid of hyperparameters:
methods <- list(
  "nnet" = list(
    tuneGrid = expand.grid(
      size = c(1, 5, 10, 20),
      decay = 10^seq(-5, -1, by = 1)
    ),
    preProcess = c("center", "scale"),
    # --- arguments passed to nnet ---
    entropy = TRUE,
    skip = TRUE,
    maxit = 1000,
    MaxNWts = 100000,
    trace = FALSE,
    weights = NULL
  )
)
Parameters for controlling model training (k-fold cross-validation).
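A minimal sketch of the resampling control, assuming trControl follows caret::trainControl conventions (the fold number and options are illustrative):
trControl <- caret::trainControl(
  method = "cv",     # k-fold cross-validation
  number = 5,        # assumed number of folds
  classProbs = TRUE  # return class probabilities
)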
Classification metrics to optimize during training (priority order in case of ties):
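The names below are taken from the metric columns reported in the tuning results; their ordering here is an illustrative priority:
metric_priority <- c("F1", "Balanced_Accuracy", "ROC_AUC")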
Define a hold-out sample to evaluate model performance on unseen data. These observations are excluded from the cross-validation and hyperparameter tuning process. By default, hold_out is NULL.
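For this example we assume the default is kept:
hold_out <- NULL  # no hold-out sample; all DMUs enter cross-validation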
Set random seed for reproducibility (ensures consistent results across runs).
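Any fixed integer works; the value below is illustrative:
seed <- 1234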
We now apply the full PEAXAI pipeline:
models <- PEAXAI_fitting(
data = data,
x = x,
y = y,
RTS = RTS,
imbalance_rate = imbalance_rate,
methods = methods,
trControl = trControl,
metric_priority = metric_priority,
calibration_method = NULL,
hold_out = hold_out,
verbose = TRUE,
seed = seed
)
#> [1] "Computing maximal friends with 1 DMUs (step 1 of 14 )"
#> [1] "Computing maximal friends with 2 DMUs (step 2 of 14 )"
#> [1] "Computing maximal friends with 3 DMUs (step 3 of 14 )"
#> [1] "Computing maximal friends with 4 DMUs (step 4 of 14 )"
#> [1] "Computing maximal friends with 5 DMUs (step 5 of 14 )"
#> [1] "Computing maximal friends with 6 DMUs (step 6 of 14 )"
#> [1] "Computing maximal friends with 7 DMUs (step 7 of 14 )"
#> [1] "Computing maximal friends with 8 DMUs (step 8 of 14 )"
#> [1] "Computing maximal friends with 9 DMUs (step 9 of 14 )"
#> [1] "Computing maximal friends with 10 DMUs (step 10 of 14 )"
#> [1] "Computing maximal friends with 11 DMUs (step 11 of 14 )"
#> [1] "Computing maximal friends with 12 DMUs (step 12 of 14 )"
#> [1] "Computing maximal friends with 13 DMUs (step 13 of 14 )"
#> [1] "Computing maximal friends with 14 DMUs (step 14 of 14 )"
#> [1] "Computing maximal friends with 1 DMUs (step 1 of 14 )"
#> [1] "Computing maximal friends with 2 DMUs (step 2 of 14 )"
#> [1] "Computing maximal friends with 3 DMUs (step 3 of 14 )"
#> [1] "Computing maximal friends with 4 DMUs (step 4 of 14 )"
#> [1] "Computing maximal friends with 5 DMUs (step 5 of 14 )"
#> [1] "Computing maximal friends with 6 DMUs (step 6 of 14 )"
#> [1] "Computing maximal friends with 7 DMUs (step 7 of 14 )"
#> [1] "Computing maximal friends with 8 DMUs (step 8 of 14 )"
#> [1] "Computing maximal friends with 9 DMUs (step 9 of 14 )"
#> [1] "Computing maximal friends with 10 DMUs (step 10 of 14 )"
#> [1] "Computing maximal friends with 11 DMUs (step 11 of 14 )"
#> [1] "Computing maximal friends with 12 DMUs (step 12 of 14 )"
#> [1] "Computing maximal friends with 13 DMUs (step 13 of 14 )"
#> [1] "Computing maximal friends with 14 DMUs (step 14 of 14 )"
#> [1] "Computing maximal friends with 1 DMUs (step 1 of 12 )"
#> [1] "Computing maximal friends with 2 DMUs (step 2 of 12 )"
#> [1] "Computing maximal friends with 3 DMUs (step 3 of 12 )"
#> [1] "Computing maximal friends with 4 DMUs (step 4 of 12 )"
#> [1] "Computing maximal friends with 5 DMUs (step 5 of 12 )"
#> [1] "Computing maximal friends with 6 DMUs (step 6 of 12 )"
#> [1] "Computing maximal friends with 7 DMUs (step 7 of 12 )"
#> [1] "Computing maximal friends with 8 DMUs (step 8 of 12 )"
#> [1] "Computing maximal friends with 9 DMUs (step 9 of 12 )"
#> [1] "Computing maximal friends with 10 DMUs (step 10 of 12 )"
#> [1] "Computing maximal friends with 11 DMUs (step 11 of 12 )"
#> [1] "Computing maximal friends with 12 DMUs (step 12 of 12 )"
#> [1] "Computing maximal friends with 1 DMUs (step 1 of 13 )"
#> [1] "Computing maximal friends with 2 DMUs (step 2 of 13 )"
#> [1] "Computing maximal friends with 3 DMUs (step 3 of 13 )"
#> [1] "Computing maximal friends with 4 DMUs (step 4 of 13 )"
#> [1] "Computing maximal friends with 5 DMUs (step 5 of 13 )"
#> [1] "Computing maximal friends with 6 DMUs (step 6 of 13 )"
#> [1] "Computing maximal friends with 7 DMUs (step 7 of 13 )"
#> [1] "Computing maximal friends with 8 DMUs (step 8 of 13 )"
#> [1] "Computing maximal friends with 9 DMUs (step 9 of 13 )"
#> [1] "Computing maximal friends with 10 DMUs (step 10 of 13 )"
#> [1] "Computing maximal friends with 11 DMUs (step 11 of 13 )"
#> [1] "Computing maximal friends with 12 DMUs (step 12 of 13 )"
#> [1] "Computing maximal friends with 13 DMUs (step 13 of 13 )"
#> [1] "Computing maximal friends with 1 DMUs (step 1 of 17 )"
#> [1] "Computing maximal friends with 2 DMUs (step 2 of 17 )"
#> [1] "Computing maximal friends with 3 DMUs (step 3 of 17 )"
#> [1] "Computing maximal friends with 4 DMUs (step 4 of 17 )"
#> [1] "Computing maximal friends with 5 DMUs (step 5 of 17 )"
#> [1] "Computing maximal friends with 6 DMUs (step 6 of 17 )"
#> [1] "Computing maximal friends with 7 DMUs (step 7 of 17 )"
#> [1] "Computing maximal friends with 8 DMUs (step 8 of 17 )"
#> [1] "Computing maximal friends with 9 DMUs (step 9 of 17 )"
#> [1] "Computing maximal friends with 10 DMUs (step 10 of 17 )"
#> [1] "Computing maximal friends with 11 DMUs (step 11 of 17 )"
#> [1] "Computing maximal friends with 12 DMUs (step 12 of 17 )"
#> [1] "Computing maximal friends with 13 DMUs (step 13 of 17 )"
#> [1] "Computing maximal friends with 14 DMUs (step 14 of 17 )"
#> [1] "Computing maximal friends with 15 DMUs (step 15 of 17 )"
#> [1] "Computing maximal friends with 16 DMUs (step 16 of 17 )"
#> [1] "Computing maximal friends with 17 DMUs (step 17 of 17 )"
#> [1] "Computing maximal friends with 1 DMUs (step 1 of 12 )"
#> [1] "Computing maximal friends with 2 DMUs (step 2 of 12 )"
#> [1] "Computing maximal friends with 3 DMUs (step 3 of 12 )"
#> [1] "Computing maximal friends with 4 DMUs (step 4 of 12 )"
#> [1] "Computing maximal friends with 5 DMUs (step 5 of 12 )"
#> [1] "Computing maximal friends with 6 DMUs (step 6 of 12 )"
#> [1] "Computing maximal friends with 7 DMUs (step 7 of 12 )"
#> [1] "Computing maximal friends with 8 DMUs (step 8 of 12 )"
#> [1] "Computing maximal friends with 9 DMUs (step 9 of 12 )"
#> [1] "Computing maximal friends with 10 DMUs (step 10 of 12 )"
#> [1] "Computing maximal friends with 11 DMUs (step 11 of 12 )"
#> [1] "Computing maximal friends with 12 DMUs (step 12 of 12 )"#> $nnet
#> Neural Network
#>
#> 118 samples
#> 5 predictor
#> 2 classes: 'efficient', 'not_efficient'
#>
#> Pre-processing: centered (5), scaled (5)
#> Resampling: None
#> $nnet
#> Imbalance_rate size decay Accuracy Kappa Recall Specificity Precision F1
#> 121 0.3 1 1e-05 0.9 0.7 0.93 0.89 0.66 0.76
#> Balanced_Accuracy G_mean ROC_AUC PR_AUC Cross_Entropy
#> 121 0.91 0.91 0.92 0.68 0.99
#> Cross_Entropy_Efficient_class Cross_Entropy_not_Efficient_class AccuracySD
#> 121 2.4 0.74 0.08
#> KappaSD RecallSD SpecificitySD PrecisionSD F1SD Balanced_AccuracySD
#> 121 0.22 0.15 0.09 0.23 0.18 0.09
#> G_meanSD ROC_AUCSD PR_AUCSD Cross_EntropySD Cross_Entropy_Efficient_classSD
#> 121 0.09 0.1 0.04 1.26 5.1
#> Cross_Entropy_not_Efficient_classSD diff_imbalance
#> 121 0.73 0.8546
This example uses SHAP (SHapley Additive exPlanations) computed with
the fastshap package to estimate global feature importance with
PEAXAI_global_importance. We approximate Shapley values via
Monte Carlo sampling (nsim) and obtain a global importance
score by aggregating absolute SHAP values across observations (and then
normalizing to relative importance). For alternative XAI methods and
options, see the package documentation.
If the method needs a reference distribution or a dataset on which to train a surrogate, background selects that dataset; target selects the dataset on which explanations are computed.
If target = "train", relative importances are computed on the training distribution and thus reflect what the model actually learned during fitting.
If target = "real", relative importances are computed on a dataset intended to approximate the real-world (deployment) distribution, yielding importances that better represent the underlying problem as observed in practice.
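The identifier below is an assumption for the fastshap-based SHAP option described above; check the package documentation for the values importance_method actually accepts:
importance_method <- "shap"  # assumed identifier for SHAP via fastshap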
relative_importance <- PEAXAI_global_importance(
data = data,
x = x,
y = y,
final_model = models[["best_model_fit"]][["nnet"]],
background = "train",
target = "train",
importance_method = importance_method
)
#> total_assets employees fixed_assets personnel_expenses
#> importance 0.1259843 0.1943872 0.1256861 0.09496279
#> operating_income
#> importance 0.4589795
We evaluate results at three efficiency cutoffs: 0.75, 0.85, and 0.95.
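These cutoffs translate directly into the vector used by the functions below:
efficiency_thresholds <- c(0.75, 0.85, 0.95)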
We define a global, model-aware direction for improving inputs/outputs.
relative_importance guides the direction using the
model’s global importance.
scope = global applies a single direction
to all DMUs.
baseline = mean centers perturbations
around the training mean.
The argument relative_importance provides the direction used for the counterfactual analysis. The directional vector can be custom or provided by the PEAXAI_global_importance function.
For example, suppose the directional vector is custom: if it is not possible (or we do not want) to change a specific input (such as employees), we set its component to zero:
relative_importance_custom <- t(matrix(
  data = c(0.2, 0, 0.2, 0.2, 0.4)
))
relative_importance_custom <- as.data.frame(relative_importance_custom)
names(relative_importance_custom) <- names(data)[c(x, y)]
#> total_assets employees fixed_assets personnel_expenses operating_income
#> 1 0.2 0 0.2 0.2 0.4
But if we do not know which direction to define, we can use the relative importance estimated by the fitted model:
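A minimal sketch, assuming the importance table returned by PEAXAI_global_importance can be passed directly as the direction:
directional_vector <- relative_importance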
Given the thresholds and directional vector, we compute feasible improvement targets for each DMU. Key controls:
n_expand = 0.5: expansion factor for the search
neighborhood.
n_grid = 50: grid resolution for exploring
adjustments.
max_y = 2, min_x = 1: bounds that cap
output expansion and input contraction.
targets <- PEAXAI_targets(
data = data,
x = x,
y = y,
final_model = models[["best_model_fit"]][["nnet"]],
efficiency_thresholds = efficiency_thresholds,
directional_vector = directional_vector,
n_expand = 0.5,
n_grid = 50,
max_y = 2,
min_x = 1
)
head(targets[["0.85"]][["counterfactual_dataset"]], 10)
#> total_assets employees fixed_assets personnel_expenses operating_income
#> 1 258.53448 70.0000 84.15178 16.70692 460.5779
#> 2 184.97800 261.0000 21.75500 19.92400 358.1300
#> 3 128.93344 354.0000 5.75535 13.70937 356.9468
#> 4 99.03400 928.8030 28.62337 36.14623 322.6025
#> 5 143.50947 1002.4934 100.14911 30.54698 322.0384
#> 6 158.17604 735.8251 77.22218 33.89631 273.0876
#> 7 108.17040 686.9396 49.98499 23.16355 258.0558
#> 8 247.18561 557.9576 136.36460 26.42711 253.7645
#> 9 53.41076 442.0000 24.98992 19.24766 170.1763
#> 10 78.39246 598.8868 18.23653 20.09118 241.6773
We compute efficiency rankings at each threshold using two bases:
Predicted: ranks DMUs by the model’s predicted probability of being efficient (rank_basis = "predicted").
Attainable: ranks DMUs by the attainable improvements/targets implied by the model (rank_basis = "attainable").
ranking <- PEAXAI_ranking(
data = data,
x = x,
y = y,
final_model = models[["best_model_fit"]][["nnet"]],
rank_basis = "predicted"
)
ranking2 <- PEAXAI_ranking(
data = data,
x = x,
y = y,
final_model = models[["best_model_fit"]][["nnet"]],
efficiency_thresholds = efficiency_thresholds,
targets = targets,
rank_basis = "attainable"
)
head(round(ranking, 4), 50)
#> Ranking DMU Probability_predicted
#> 1 1 1 1.0000
#> 2 2 2 1.0000
#> 3 3 3 1.0000
#> 4 4 17 1.0000
#> 5 5 20 1.0000
#> 6 6 46 1.0000
#> 7 7 62 0.9988
#> 8 8 9 0.9881
#> 9 9 92 0.9839
#> 10 10 18 0.9263
#> 11 11 75 0.8132
#> 12 12 36 0.8129
#> 13 13 56 0.7758
#> 14 14 26 0.7053
#> 15 15 93 0.6020
#> 16 16 85 0.3999
#> 17 17 25 0.3268
#> 18 18 97 0.2149
#> 19 19 91 0.1226
#> 20 20 71 0.0854
#> 21 21 31 0.0766
#> 22 22 44 0.0596
#> 23 23 95 0.0545
#> 24 24 42 0.0408
#> 25 25 28 0.0113
#> 26 26 89 0.0004
#> 27 27 15 0.0001
#> 28 28 58 0.0000
#> 29 29 6 0.0000
#> 30 30 83 0.0000
#> 31 31 22 0.0000
#> 32 32 70 0.0000
#> 33 33 59 0.0000
#> 34 34 43 0.0000
#> 35 35 4 0.0000
#> 36 36 5 0.0000
#> 37 37 7 0.0000
#> 38 38 8 0.0000
#> 39 39 10 0.0000
#> 40 40 11 0.0000
#> 41 41 12 0.0000
#> 42 42 13 0.0000
#> 43 43 14 0.0000
#> 44 44 16 0.0000
#> 45 45 19 0.0000
#> 46 46 21 0.0000
#> 47 47 23 0.0000
#> 48 48 24 0.0000
#> 49 49 27 0.0000
#> 50 50 29 0.0000
head(round(ranking2[["0.85"]], 4), 50)
#> Ranking DMU Probability_predicted betas Probability_target
#> 1 1 1 1.0000 0.0000 0.85
#> 2 2 2 1.0000 0.0000 0.85
#> 3 3 3 1.0000 0.0000 0.85
#> 4 4 17 1.0000 0.0000 0.85
#> 5 5 20 1.0000 0.0000 0.85
#> 6 6 46 1.0000 0.0000 0.85
#> 7 7 62 0.9988 0.0000 0.85
#> 8 8 9 0.9881 0.0000 0.85
#> 9 9 92 0.9839 0.0000 0.85
#> 10 10 18 0.9263 0.0000 0.85
#> 11 11 75 0.8132 0.0056 0.85
#> 12 12 36 0.8129 0.0065 0.85
#> 13 13 56 0.7758 0.0109 0.85
#> 14 14 26 0.7053 0.0341 0.85
#> 15 15 85 0.3999 0.0460 0.85
#> 16 16 44 0.0596 0.0834 0.85
#> 17 17 71 0.0854 0.0905 0.85
#> 18 18 91 0.1226 0.0933 0.85
#> 19 19 31 0.0766 0.0954 0.85
#> 20 20 25 0.3268 0.0970 0.85
#> 21 21 95 0.0545 0.1705 0.85
#> 22 22 42 0.0408 0.1815 0.85
#> 23 23 70 0.0000 0.2384 0.85
#> 24 24 28 0.0113 0.2454 0.85
#> 25 25 15 0.0001 0.3100 0.85
#> 26 26 64 0.0000 0.3209 0.85
#> 27 27 73 0.0000 0.3535 0.85
#> 28 28 58 0.0000 0.4716 0.85
#> 29 29 6 0.0000 0.4904 0.85
#> 30 30 22 0.0000 0.5146 0.85
#> 31 31 86 0.0000 0.6240 0.85
#> 32 32 67 0.0000 0.6929 0.85
#> 33 33 48 0.0000 0.7629 0.85
#> 34 34 61 0.0000 0.8047 0.85
#> 35 35 27 0.0000 0.8600 0.85
#> 36 36 68 0.0000 0.9199 0.85
#> 37 37 29 0.0000 0.9504 0.85
#> 38 38 24 0.0000 0.9927 0.85
#> 39 39 4 0.0000 1.0025 0.85
#> 40 40 37 0.0000 1.0100 0.85
#> 41 41 12 0.0000 1.0217 0.85
#> 42 42 49 0.0000 1.1646 0.85
#> 43 43 57 0.0000 1.1835 0.85
#> 44 44 52 0.0000 1.2705 0.85
#> 45 45 47 0.0000 1.3890 0.85
#> 46 46 41 0.0000 1.3902 0.85
#> 47 47 45 0.0000 1.4376 0.85
#> 48 48 35 0.0000 1.4977 0.85
#> 49 49 14 0.0000 1.5997 0.85
#> 50 50 39 0.0000 1.8141 0.85
For each threshold, we identify peer DMUs that serve as reference comparators.
weighted = FALSE treats peers uniformly; set it to TRUE to weight peers according to relative_importance.
peers <- PEAXAI_peer(
data = data,
x = x,
y = y,
final_model = models[["best_model_fit"]][["nnet"]],
targets = targets,
efficiency_thresholds = efficiency_thresholds,
weighted = FALSE,
relative_importance = relative_importance
)
head(peers, 50)
#> DMU 0.75 0.85 0.95
#> 1 1 1 1 1
#> 2 2 2 2 2
#> 3 3 3 3 3
#> 4 4 9 9 9
#> 5 5 9 9 9
#> 6 6 9 9 9
#> 7 7 9 9 9
#> 8 8 9 9 9
#> 9 9 9 9 9
#> 10 10 9 9 9
#> 11 11 9 9 9
#> 12 12 18 18 9
#> 13 13 9 9 9
#> 14 14 9 9 9
#> 15 15 18 18 17
#> 16 16 9 9 9
#> 17 17 17 17 17
#> 18 18 18 18 17
#> 19 19 9 9 9
#> 20 20 20 20 20
#> 21 21 9 9 9
#> 22 22 18 18 17
#> 23 23 9 9 9
#> 24 24 18 18 17
#> 25 25 17 17 17
#> 26 26 17 17 17
#> 27 27 17 17 17
#> 28 28 20 20 20
#> 29 29 18 18 17
#> 30 30 18 18 17
#> 31 31 18 18 17
#> 32 32 18 18 17
#> 33 33 18 18 9
#> 34 34 18 18 17
#> 35 35 18 18 17
#> 36 36 36 20 20
#> 37 37 17 17 17
#> 38 38 18 18 17
#> 39 39 18 18 17
#> 40 40 18 18 46
#> 41 41 17 17 17
#> 42 42 36 46 46
#> 43 43 46 46 46
#> 44 44 46 46 46
#> 45 45 17 17 17
#> 46 46 46 46 46
#> 47 47 17 17 17
#> 48 48 46 46 46
#> 49 49 18 18 17
#> 50 50 46 46 46