When several methods, models, or scoring systems evaluate the same set of objects, a natural question arises: which method agrees most closely with a set of reference values? Sum of Ranking Differences (SRD) answers this question with a single, interpretable number.
The idea is straightforward. Each method produces a ranking of the objects. A reference is chosen — either an external gold standard or one derived from the data itself (for example, the row-wise mean or median). For each method, the absolute differences between its ranks and the reference ranks are summed. A small SRD value means the method tracks the reference closely, whereas a large SRD value means the method disagrees with it.
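The calculation just described can be sketched in a few lines of base R. The toy scores and the helper name `srd_raw` are illustrative assumptions and not part of rSRD; ties would receive average ranks here, which may differ from the package's internal handling.

```r
# Toy data: five objects scored by two hypothetical methods and a reference.
scores <- data.frame(
  method_a  = c(10, 20, 30, 40, 50),
  method_b  = c(50, 10, 40, 20, 30),
  reference = c(12, 18, 33, 41, 47)
)

# Hypothetical helper: sum of absolute rank differences against a reference.
srd_raw <- function(x, ref) {
  sum(abs(rank(x) - rank(ref)))
}

srd_a <- srd_raw(scores$method_a, scores$reference)
srd_b <- srd_raw(scores$method_b, scores$reference)

print(c(method_a = srd_a, method_b = srd_b))
```

Here `method_a` reproduces the reference order exactly (SRD of 0), while `method_b` accumulates a total rank difference of 10.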
rSRD normalises SRD values to the interval \([0, 1]\), where 0 indicates perfect
agreement and 1 indicates the maximum possible disagreement given the
number of objects being ranked (i.e., scaled by the theoretical maximum
SRD). To determine whether an observed SRD value is meaningfully small
or merely within the range expected by chance under the null hypothesis
that the method ranks objects randomly, the package uses a Monte Carlo
permutation test (shuffling the assigned ranks and recalculating the
SRD) with 1,000,000 random rankings.
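As a sketch of the scaling step: the maximum SRD is reached when a method ranks the objects in exactly the reverse order of the reference. Assuming no ties, this maximum equals `floor(n^2 / 2)` for `n` objects. The helper names below are illustrative and not part of the package.

```r
# Hypothetical helper: largest possible SRD for n objects without ties,
# obtained by comparing a ranking with its exact reverse.
max_srd <- function(n) {
  sum(abs(seq_len(n) - rev(seq_len(n))))
}

# Hypothetical helper: scale a raw SRD to the interval [0, 1].
normalise_srd <- function(srd, n) {
  srd / max_srd(n)
}

n <- 14                       # number of films in movies1994
print(max_srd(n))             # 98, which equals floor(14^2 / 2)
print(normalise_srd(49, n))   # 0.5, i.e. half of the maximum disagreement
```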
For a full theoretical treatment of Sum of Ranking Differences,
please refer to the original paper by
Héberger and Kollár-Hunek (2011) and to the recent extensions on consensus
ranking and significance testing in Sziklai, Baranyi and Héberger
(2025).
Throughout this vignette we use the movies1994 dataset,
which is bundled with rSRD. It covers 14 films released in
1994 and records six scores plus a composite reference score for each:
| Column | Description |
|---|---|
| RT Reviewers | Rotten Tomatoes critic score (%) |
| RT Audience | Rotten Tomatoes audience score (%) |
| IMDB | IMDB user rating (out of 10) |
| Box Office | Worldwide box office gross (USD) |
| Nominations | Total award nominations received |
| Awards | Total awards won |
| Recognition score | Composite score: Awards + 0.5 * Nominations. Used as the reference against which all other scoring systems are compared. |
The last column, Recognition score, is used as the
reference — it combines nominations and awards into a single measure of
industry recognition against which the other scoring systems are
compared.
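The composite can be checked with a line of arithmetic. The vectors below are copied from the Nominations, Awards and Recognition score columns of movies1994 (in the row order of the dataset); the variable names are ours.

```r
# Values copied from the movies1994 dataset, one entry per film.
nominations <- c(10, 3, 33, 74, 27, 34, 16, 10, 72, 20, 35, 22, 42, 23)
awards      <- c(7, 5, 27, 51, 24, 24, 5, 5, 69, 21, 43, 6, 21, 8)
printed_recognition <- c(12, 6.5, 43.5, 88, 37.5, 41, 13, 10,
                         105, 31, 60.5, 17, 42, 19.5)

# Recognition = Awards + 0.5 * Nominations
recognition <- awards + 0.5 * nominations

print(all.equal(recognition, printed_recognition))  # TRUE
```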
path <- system.file("extdata", "movies1994.csv", package = "rSRD")
movies <- read.csv(path, header = TRUE, sep = ";", row.names = 1,
                   check.names = FALSE)
movies
#> RT Reviewers RT Audience IMDB Box Office
#> Clerks 90 89 7.7 3158707
#> Dumb and Dumber 69 84 7.3 247290327
#> Ed Wood 92 88 7.8 5888353
#> Forrest Gump 75 95 8.8 678226465
#> Four Weddings and a Funeral 92 74 7.1 245700832
#> Interview with the Vampire 66 86 7.5 223664608
#> Leon: The Professional 76 95 8.5 20330788
#> Natural Born Killers 52 81 7.2 50287182
#> Pulp Fiction 92 96 8.8 213928762
#> Speed 95 77 7.3 350448145
#> The Lion King 92 93 8.5 979161632
#> The Mask 81 68 7.0 351583407
#> The Shawshank Redemption 89 98 9.3 29420884
#> True Lies 77 76 7.3 378882411
#> Nominations Awards Recognition score
#> Clerks 10 7 12.0
#> Dumb and Dumber 3 5 6.5
#> Ed Wood 33 27 43.5
#> Forrest Gump 74 51 88.0
#> Four Weddings and a Funeral 27 24 37.5
#> Interview with the Vampire 34 24 41.0
#> Leon: The Professional 16 5 13.0
#> Natural Born Killers 10 5 10.0
#> Pulp Fiction 72 69 105.0
#> Speed 20 21 31.0
#> The Lion King 35 43 60.5
#> The Mask 22 6 17.0
#> The Shawshank Redemption 42 21 42.0
#> True Lies 23 8 19.5
The scoring systems use very different scales: percentage scores, a 10-point rating, a raw dollar figure, and raw counts. Before running SRD it is good practice to bring all columns onto a common scale so that no single variable dominates purely because of its unit of measurement. Note that each data preprocessing method is strictly monotonic and therefore leaves the ranks within a column unchanged; hence, if the reference is provided and not aggregated from the data, preprocessing has no effect on the SRD analysis.
utilsPreprocessDF() offers four methods:
"range_scale" (maps each column to \([0, 1]\)), "standardize" (zero
mean, unit variance), "scale_to_unit" (L2 normalisation),
and "scale_to_max" (divides by the column maximum). Range
scaling is the most natural choice here since all columns measure
inherently positive quantities.
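The four transformations can be approximated in base R as follows. This is a sketch of the descriptions above rather than the package's implementation; in particular, whether "standardize" divides by the sample (n - 1) or the population standard deviation is an assumption here. The final check illustrates the monotonicity remark: none of the transformations changes the ranks within a column.

```r
# Box office figures of the first six films in movies1994.
x <- c(3158707, 247290327, 5888353, 678226465, 245700832, 223664608)

range_scale   <- (x - min(x)) / (max(x) - min(x))  # maps to [0, 1]
standardize   <- (x - mean(x)) / sd(x)             # zero mean, unit variance (n - 1 assumed)
scale_to_unit <- x / sqrt(sum(x^2))                # L2 norm of the column becomes 1
scale_to_max  <- x / max(x)                        # largest value becomes 1

transformed <- list(range_scale, standardize, scale_to_unit, scale_to_max)

# Strictly increasing transformations leave the ranks untouched.
ranks_unchanged <- vapply(transformed,
                          function(v) identical(rank(v), rank(x)),
                          logical(1))
print(ranks_unchanged)
```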
movies_scaled <- utilsPreprocessDF(movies, method = "range_scale")
round(head(movies_scaled), 3)
#> RT Reviewers RT Audience IMDB Box Office Nominations Awards
#> 1 0.884 0.700 0.304 0.000 0.099 0.031
#> 2 0.395 0.533 0.130 0.250 0.000 0.000
#> 3 0.930 0.667 0.348 0.003 0.423 0.344
#> 4 0.535 0.900 0.783 0.692 1.000 0.719
#> 5 0.930 0.200 0.043 0.249 0.338 0.297
#> 6 0.326 0.600 0.217 0.226 0.437 0.297
#> Recognition score
#> 1 0.056
#> 2 0.000
#> 3 0.376
#> 4 0.827
#> 5 0.315
#> 6 0.350
Note that the reference column (Recognition score) is
scaled along with the rest. After scaling, every value lies between 0
and 1.
If you do not have a natural reference column,
utilsCreateReference() can derive one from the data itself.
Available options are "max", "min",
"mean", and "median" (applied row-wise), as
well as "mixed" for a row-by-row specification.
# Example: use the row-wise median as the reference
movies_with_ref <- utilsCreateReference(movies_scaled[, -ncol(movies_scaled)],
                                        method = "median")
In our case movies_scaled already contains
Recognition score as the last column, so we use it directly
as the reference.
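For readers who want to see what a data-derived reference looks like, the row-wise aggregates can be sketched in base R. The toy matrix and variable names are hypothetical, and appending the result as the last column is our assumption about the layout, based on the convention that the reference sits in the final column.

```r
# Toy matrix: four objects (rows) scored by three hypothetical methods (columns).
scores <- matrix(c(0.2, 0.4, 0.9,
                   0.5, 0.1, 0.3,
                   0.8, 0.7, 0.6,
                   0.0, 1.0, 0.5),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("obj", 1:4), c("m1", "m2", "m3")))

# Row-wise aggregates that could serve as a data-derived reference.
ref_median <- apply(scores, 1, median)
ref_mean   <- rowMeans(scores)
ref_max    <- apply(scores, 1, max)
ref_min    <- apply(scores, 1, min)

# Append the chosen reference as the last column.
with_ref <- cbind(scores, reference = ref_median)
print(with_ref)
```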
calculateSRDValues() computes the normalised SRD score
for every non-reference column. The last column of the data frame is
always treated as the reference.
srd <- calculateSRDValues(movies_scaled, output_to_file = FALSE)
srd
#> [1] 0.4489796 0.5204082 0.4183673 0.5714286 0.1326531 0.1122449
The result is a numeric vector with one entry per solution
column, in the column order of the data frame. Values closer to 0 indicate stronger agreement with the
Recognition score reference.
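As a cross-check, the two smallest SRD values can be reproduced in base R from the raw dataset (scaling does not change ranks, so unscaled values are used). The sketch assumes that ties receive average ranks and that the raw SRD is divided by the no-tie maximum `floor(n^2 / 2)`; the helper `srd_norm` is illustrative and is not the package's implementation.

```r
# Columns copied from the movies1994 dataset, one entry per film.
nominations <- c(10, 3, 33, 74, 27, 34, 16, 10, 72, 20, 35, 22, 42, 23)
awards      <- c(7, 5, 27, 51, 24, 24, 5, 5, 69, 21, 43, 6, 21, 8)
recognition <- awards + 0.5 * nominations   # reference column

# Hypothetical helper: normalised SRD, average ranks for ties.
srd_norm <- function(x, ref) {
  n <- length(ref)
  sum(abs(rank(x) - rank(ref))) / floor(n^2 / 2)
}

srd_check <- c(Nominations = srd_norm(nominations, recognition),
               Awards      = srd_norm(awards, recognition))
print(round(srd_check, 7))
# Nominations 0.1326531 (raw SRD 13 of 98), Awards 0.1122449 (raw SRD 11 of 98)
```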
Awards and Nominations agree most closely
with the Recognition score reference. This is not
surprising since the reference is a weighted combination of these two
measures (Recognition = Awards + 0.5 * Nominations),
i.e. their high agreement is a mathematical consequence rather than an
independent finding. Among the remaining scoring systems,
RT Reviewers and IMDB show notably lower SRD
values than RT Audience and Box Office,
suggesting that critics and dedicated film raters track award
recognition more closely than general audiences or commercial
success.
While the raw SRD values provide an initial ranking of method performance, their interpretation requires a statistical baseline. Specifically, we must ask whether an observed SRD is significantly lower than what would arise from random ranking.
The function calculateSRDDistribution() does this by
simulating 1,000,000 random permutations of the ranks and computing
their SRD distribution. Two threshold values are derived from this
distribution: XX1, the 5% threshold, and XX19, the 95% threshold.
Because the simulation is stochastic, the thresholds can vary slightly between runs, in particular for small datasets. We recommend always setting a seed in published analyses to ensure reproducibility.
sim <- calculateSRDDistribution(movies_scaled, seed = 42)
cat("XX1 (5% threshold):", sim$xx1, "\n")
cat("Median: ", sim$median, "\n")
cat("XX19 (95% threshold):", sim$xx19, "\n")
#> XX1 (5% threshold): 0.4694
#> Median: 0.6735
#> XX19 (95% threshold): 0.8571
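The simulation can be imitated in base R on a smaller scale. The sketch below draws 20,000 random rankings instead of 1,000,000 and takes XX1 and XX19 to be the 5% and 95% quantiles of the simulated normalised SRD values. The exact quantile definition used by calculateSRDDistribution() may differ, so the numbers need not match the package output exactly.

```r
set.seed(42)                 # base R seed, unrelated to the package's seed argument
n      <- 14                 # number of ranked objects
n_perm <- 20000              # far fewer draws than the package uses
ref    <- seq_len(n)         # reference ranks 1..n (no ties)

# Normalised SRD of random permutations against the reference ranks.
null_srd <- replicate(n_perm, sum(abs(sample(n) - ref))) / floor(n^2 / 2)

thresholds <- quantile(null_srd, probs = c(0.05, 0.5, 0.95), names = FALSE)
names(thresholds) <- c("XX1", "median", "XX19")
print(round(thresholds, 4))
```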
plotPermTest() overlays the observed SRD values on the
simulated distribution as vertical coloured lines, with dashed lines
marking XX1, the median, and XX19. Solutions whose lines fall to the
left of the XX1 dashed line agree with the reference significantly
better than chance. Solutions to the right of XX19 indicate a
significantly reversed ranking.
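The reading rule for the plot can also be applied numerically. The sketch below uses the SRD values and thresholds printed in this vignette; the names are attached by us on the assumption that the values follow the column order of the data frame.

```r
# Observed SRD values and thresholds as printed in this vignette.
srd <- c("RT Reviewers" = 0.4489796, "RT Audience" = 0.5204082,
         "IMDB"         = 0.4183673, "Box Office"  = 0.5714286,
         "Nominations"  = 0.1326531, "Awards"      = 0.1122449)
xx1  <- 0.4694
xx19 <- 0.8571

# Left of XX1: better than chance; right of XX19: reversed; otherwise neither.
verdict <- ifelse(srd < xx1, "better than chance",
           ifelse(srd > xx19, "reversed ranking", "not distinguishable"))
print(verdict)
```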
The permutation test reveals that RT Reviewers (critics’
score) and IMDB ratings agree with the
Recognition score reference significantly better than
chance — their SRD values fall to the left of the XX1 threshold.
Awards and Nominations are also significant,
though this is largely a consequence of the
Recognition score reference being composed of these two
measures, i.e. their significance should be interpreted with
caution.
RT Audience and Box Office fall between XX1
and XX19, meaning their rankings are statistically indistinguishable
from a random ordering relative to the Recognition score
reference. Box Office is the farthest from the reference,
which suggests that award committees and general audiences evaluate
films by different criteria. Commercial blockbusters such as The
Mask and Speed rank highly by box office but received
comparatively few nominations, while critically acclaimed films such as
The Shawshank Redemption — a box office disappointment on
release — were heavily nominated. This disconnect between commercial
success and award recognition is one of the most striking findings the
SRD analysis reveals.
Note that with only 14 observations the simulation thresholds are
more variable than they would be for a larger dataset. The
seed = 42 argument ensures the thresholds reported here are
reproducible.
The permutation test provides an absolute benchmark by testing whether a given method’s SRD is significantly lower than random expectation. Cross-validation extends this framework to a relative comparison. It tests whether the SRD values of two methods are statistically distinguishable from each other, thereby enabling direct pairwise model selection.
The function calculateCrossValidation() divides the data
into folds, computes SRD values for each fold and executes a statistical
test (Wilcoxon, Alpaydin or Dietterich) between consecutive pairs of
solutions sorted by their median fold SRD.
cv <- calculateCrossValidation(movies_scaled,
                               method = "Wilcoxon",
                               number_of_folds = 7,
                               output_to_file = FALSE,
                               seed = 42)
cv$statistical_significance
The significance labels are:
| Label | Meaning |
|---|---|
| (p<0.05*) | The two adjacent solutions are significantly different at the 5% level |
| (p<0.1) | Significant at the 10% level |
| n.s. | Not significantly different |
| n.a. | Not available (fold count outside the 5–10 range) |
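The procedure can be outlined in base R. The sketch below is only an approximation: it assigns the 14 films to 7 random folds, recomputes the normalised SRD with each fold left out, and compares two scoring systems with a paired Wilcoxon signed-rank test via `wilcox.test()`. How calculateCrossValidation() builds its folds and applies the test may differ, and the helper `srd_norm` is illustrative.

```r
# Columns copied from the movies1994 dataset, one entry per film.
imdb        <- c(7.7, 7.3, 7.8, 8.8, 7.1, 7.5, 8.5, 7.2, 8.8, 7.3, 8.5, 7.0, 9.3, 7.3)
nominations <- c(10, 3, 33, 74, 27, 34, 16, 10, 72, 20, 35, 22, 42, 23)
awards      <- c(7, 5, 27, 51, 24, 24, 5, 5, 69, 21, 43, 6, 21, 8)
recognition <- awards + 0.5 * nominations

# Hypothetical helper: normalised SRD, average ranks for ties.
srd_norm <- function(x, ref) {
  n <- length(ref)
  sum(abs(rank(x) - rank(ref))) / floor(n^2 / 2)
}

set.seed(42)
n_folds <- 7
# Random fold labels, two films per fold.
fold_id <- sample(rep(seq_len(n_folds), length.out = length(recognition)))

# SRD of each scoring system with one fold (two films) left out at a time.
fold_srd <- t(sapply(seq_len(n_folds), function(k) {
  keep <- fold_id != k
  c(IMDB        = srd_norm(imdb[keep], recognition[keep]),
    Nominations = srd_norm(nominations[keep], recognition[keep]))
}))

# Paired test on the fold-wise values; exact = FALSE uses the normal approximation.
test <- wilcox.test(fold_srd[, "IMDB"], fold_srd[, "Nominations"],
                    paired = TRUE, exact = FALSE)
print(fold_srd)
print(test$p.value)
```

With 7 folds the test rests on only seven paired values, so the resulting p-value should be read as a rough indication.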
plotCrossValidation() displays the fold-wise SRD values
as box-and-whisker plots, ordered by median, making it easy to see which
solutions are stable across folds and which are sensitive to the subset
of films used.
The box-whisker plot orders the scoring systems from left to right by
increasing median SRD. Awards and Nominations
occupy the two leftmost positions, confirming their strong agreement
with the Recognition score reference — though as noted
above, this is partly a mathematical consequence of how the reference is
constructed. The pairwise Wilcoxon test finds that Awards
and Nominations are not significantly different from each
other (n.s.), but Nominations and
IMDB are significantly different
(p<0.05), marking a meaningful boundary between the
award-derived measures and the independent scoring systems.
IMDB and RT Reviewers occupy the middle
positions and are not significantly different from each other
(n.s.), nor is RT Reviewers significantly
different from RT Audience (n.s.). This
suggests that while critics and dedicated film raters (IMDB) track award
recognition somewhat more closely than general audiences, the difference
between these three systems is not statistically compelling with this
dataset. Note that only consecutive methods are tested, so whether the
difference between IMDB and RT Audience
is statistically significant is not reported in this analysis.
The most striking finding is the significant gap between
RT Audience and Box Office
(p<0.05), confirming that commercial box office
performance is not merely different from the Recognition reference by
degree but is categorically more distant than all other scoring systems.
Box Office sits furthest from the reference, reinforcing
the conclusion from the permutation test, i.e. award committees and
general audiences evaluate films by fundamentally different
criteria.
Note that with only 14 films and 7 folds, each fold omits just 2 films. Wide boxes indicate scoring systems whose SRD values are sensitive to which 2 films are excluded — films like The Shawshank Redemption (high IMDB and RT scores but low box office) are likely influential observations that affect fold-wise results considerably.
plotHeatmapSRD() provides a complementary view of the
results by displaying the pairwise distances between all scoring systems
— including the reference — as a colour-coded heatmap. By default, the
heatmap uses a red-to-blue gradient, with red indicating SRD = 0 and
blue indicating SRD = 1. Users may supply custom palettes, and the
number of colour breaks corresponds to the palette length.
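The quantity behind such a heatmap can be sketched as a symmetric matrix of normalised SRD values between every pair of columns. The base R code below builds that matrix for four columns of the dataset. It is an illustration under our own assumptions (average ranks for ties, division by `floor(n^2 / 2)`) and does not reproduce the plotting or any options of plotHeatmapSRD().

```r
# Columns copied from the movies1994 dataset, one entry per film.
movies_sub <- data.frame(
  IMDB        = c(7.7, 7.3, 7.8, 8.8, 7.1, 7.5, 8.5, 7.2, 8.8, 7.3, 8.5, 7.0, 9.3, 7.3),
  Nominations = c(10, 3, 33, 74, 27, 34, 16, 10, 72, 20, 35, 22, 42, 23),
  Awards      = c(7, 5, 27, 51, 24, 24, 5, 5, 69, 21, 43, 6, 21, 8)
)
movies_sub$Recognition <- movies_sub$Awards + 0.5 * movies_sub$Nominations

# Hypothetical helper: normalised SRD between two columns.
srd_norm <- function(x, y) {
  n <- length(x)
  sum(abs(rank(x) - rank(y))) / floor(n^2 / 2)
}

# Pairwise distance matrix, including the reference column.
cols <- names(movies_sub)
dist_mat <- outer(seq_along(cols), seq_along(cols),
                  Vectorize(function(i, j) srd_norm(movies_sub[[i]], movies_sub[[j]])))
dimnames(dist_mat) <- list(cols, cols)
print(round(dist_mat, 3))
```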
The heatmap confirms the findings of the permutation test and
cross-validation. Awards and Nominations
cluster closely together and sit nearest to the Recognition reference,
while Box Office is most distant from all other
systems.
A particularly striking feature is the distance between
RT Audience and Box Office, despite both
reflecting the judgement of general audiences rather than critics or
award committees. This apparent contradiction dissolves once we
distinguish between two fundamentally different modes of audience
behaviour. Buying a cinema ticket is an anticipatory decision
made before seeing a film, shaped by marketing, star power, and word of
mouth. Rating a film on Rotten Tomatoes is a reflective
decision made after viewing, shaped by the experience of the film
itself. The 1994 dataset illustrates this gap rather vividly. The
Mask and Speed were heavily marketed blockbusters that
sold large numbers of tickets but received more measured audience
ratings on reflection, while The Shawshank Redemption — a
commercial disappointment on release — consistently receives some of the
highest audience ratings ever recorded. Award committees, critics, and
reflective online audiences converge in their evaluations, whereas box
office receipts apparently tell a different story.
calculateSRDDistribution() and
calculateCrossValidation() both rely on random number
generation in C++. Without a seed the results vary slightly between
runs. For small datasets this variability can affect the precise values
of XX1 and XX19.
The following example makes the issue concrete:
# Two unseeded runs -- XX1 may differ slightly
sim_a <- calculateSRDDistribution(movies_scaled)
sim_b <- calculateSRDDistribution(movies_scaled)
cat("Run A -- XX1:", sim_a$xx1, " XX19:", sim_a$xx19, "\n")
cat("Run B -- XX1:", sim_b$xx1, " XX19:", sim_b$xx19, "\n")
# Two seeded runs -- results are identical
sim_1 <- calculateSRDDistribution(movies_scaled, seed = 42)
sim_2 <- calculateSRDDistribution(movies_scaled, seed = 42)
cat("Seed 42, run 1 -- XX1:", sim_1$xx1, " XX19:", sim_1$xx19, "\n")
cat("Seed 42, run 2 -- XX1:", sim_2$xx1, " XX19:", sim_2$xx19, "\n")
For any analysis that will be published or shared, we strongly recommend always specifying a seed. The choice of seed value does not affect the statistical validity of the results; it only guarantees their reproducibility.
# 1. Load data (last column = reference)
path <- system.file("extdata", "movies1994.csv", package = "rSRD")
movies <- read.csv(path, header = TRUE, sep = ";", row.names = 1,
                   check.names = FALSE)
# 2. Preprocess
movies_scaled <- utilsPreprocessDF(movies, method = "range_scale")
# 3. Compute SRD values
srd <- calculateSRDValues(movies_scaled, output_to_file = FALSE)
# 4. Permutation test (set seed for reproducibility)
sim <- calculateSRDDistribution(movies_scaled, seed = 42)
plotPermTest(movies_scaled, sim)
# 5. Cross-validation
cv <- calculateCrossValidation(movies_scaled,
                               method = "Wilcoxon",
                               number_of_folds = 7,
                               output_to_file = FALSE,
                               seed = 42)
plotCrossValidation(cv)
Héberger K., Kollár-Hunek K. (2011) “Sum of ranking differences for method discrimination and its validation: comparison of ranks with random numbers”, Journal of Chemometrics, 25(4), pp. 151–158.
Sziklai B.R., Baranyi M., Héberger K. (2025) “Does cross-validation work in telling rankings apart?”, Cent Eur J Oper Res, 33, pp. 1503–1528.