Getting Started with rSRD

Balázs R. Sziklai (0000-0002-0068-8920)
HUN-REN Centre for Economic and Regional Studies, 1097 Tóth Kálmán u. 4, Budapest, Hungary
Department of Operations Research and Actuarial Sciences, Corvinus University of Budapest, 1093 Fővám tér 8, Budapest, Hungary
Email: sziklai.balazs@krtk.elte.hu

Attila Gere (0000-0003-3075-1561)
Hungarian University of Agriculture and Life Sciences, 1118 Villányi út 29-43, Budapest, Hungary
Email: Gere.Attila@uni-mate.hu

Károly Héberger (0000-0003-0965-939X)
Plasma Chemistry Research Group, Institute of Materials and Environmental Chemistry, HUN-REN Research Centre for Natural Sciences, 1117 Magyar tudósok körútja 2, Budapest, Hungary
Email: heberger.karoly@ttk.hu

Jochen Staudacher (0000-0002-0619-4606)
Fakultät Informatik, Hochschule Kempten, Bahnhofstr. 61, 87435 Kempten, Germany
Email: jochen.staudacher@hs-kempten.de

What is Sum of Ranking Differences?

When several methods, models, or scoring systems evaluate the same set of objects, a natural question arises: which method agrees most closely with a set of reference values? Sum of Ranking Differences (SRD) answers this question with a single, interpretable number.

The idea is straightforward. Each method produces a ranking of the objects. A reference is chosen — either an external gold standard or one derived from the data itself (for example, the row-wise mean or median). For each method, the absolute differences between its ranks and the reference ranks are summed. A small SRD value means the method tracks the reference closely, whereas a large SRD value means the method disagrees with it.

rSRD normalises SRD values to the interval \([0, 1]\), where 0 indicates perfect agreement and 1 indicates the maximum possible disagreement given the number of objects being ranked (i.e., scaled by the theoretical maximum SRD). To determine whether an observed SRD value is meaningfully small or merely within the range expected by chance under the null hypothesis that the method ranks objects randomly, the package uses a Monte Carlo permutation test (shuffling the assigned ranks and recalculating the SRD) with 1,000,000 random rankings.
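As a rough illustration of the calculation described above, the following base R sketch computes a normalised SRD value by hand. It is not the package's implementation: the helper name srd_normalised, the use of average ranks for ties, and the closed-form maximum floor(n^2 / 2) are assumptions made for this example. The two input vectors are the Awards and Recognition score columns of the movies1994 dataset introduced in the next section.

```r
# Hypothetical helper (not part of rSRD): normalised SRD of one solution
# against a reference. Assumes tied values receive average ranks.
srd_normalised <- function(solution, reference) {
  stopifnot(length(solution) == length(reference))
  n <- length(reference)
  rank_diff <- abs(rank(solution) - rank(reference))  # rank() averages ties
  max_srd <- floor(n^2 / 2)  # largest possible sum of rank differences
  sum(rank_diff) / max_srd
}

# Awards and Recognition score, copied from the movies1994 table
awards      <- c(7, 5, 27, 51, 24, 24, 5, 5, 69, 21, 43, 6, 21, 8)
recognition <- c(12, 6.5, 43.5, 88, 37.5, 41, 13, 10, 105, 31, 60.5, 17, 42, 19.5)

srd_awards <- srd_normalised(awards, recognition)
srd_awards
#> [1] 0.1122449
```

Here the raw sum of rank differences is 11 and the maximum for 14 films is 98, so the normalised value is 11/98. This coincides with the smallest value in the calculateSRDValues() output shown later in this vignette.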

For a full theoretical treatment of Sum of Ranking Differences, please refer to the original paper by Héberger and Kollár-Hunek (2011) and to the recent extensions on consensus ranking and significance testing in Sziklai, Baranyi and Héberger (2025).


The movies1994 dataset

Throughout this vignette we use the movies1994 dataset, which is bundled with rSRD. It covers 14 films released in 1994 and records seven values for each:

Column              Description
RT Reviewers        Rotten Tomatoes critic score (%)
RT Audience         Rotten Tomatoes audience score (%)
IMDB                IMDB user rating (out of 10)
Box Office          Worldwide box office gross (USD)
Nominations         Total award nominations received
Awards              Total awards won
Recognition score   Composite score: Awards + 0.5 * Nominations. Used as the reference against which all other scoring systems are compared.

The last column, Recognition score, is used as the reference — it combines nominations and awards into a single measure of industry recognition against which the other scoring systems are compared.

library(rSRD)

# Locate the semicolon-separated CSV shipped with the package and read it
path   <- system.file("extdata", "movies1994.csv", package = "rSRD")
movies <- read.csv(path, header = TRUE, sep = ";", row.names = 1,
                   check.names = FALSE)
movies
#>                             RT Reviewers RT Audience IMDB Box Office
#> Clerks                                90          89  7.7    3158707
#> Dumb and Dumber                       69          84  7.3  247290327
#> Ed Wood                               92          88  7.8    5888353
#> Forrest Gump                          75          95  8.8  678226465
#> Four Weddings and a Funeral           92          74  7.1  245700832
#> Interview with the Vampire            66          86  7.5  223664608
#> Leon: The Professional                76          95  8.5   20330788
#> Natural Born Killers                  52          81  7.2   50287182
#> Pulp Fiction                          92          96  8.8  213928762
#> Speed                                 95          77  7.3  350448145
#> The Lion King                         92          93  8.5  979161632
#> The Mask                              81          68  7.0  351583407
#> The Shawshank Redemption              89          98  9.3   29420884
#> True Lies                             77          76  7.3  378882411
#>                             Nominations Awards Recognition score
#> Clerks                               10      7              12.0
#> Dumb and Dumber                       3      5               6.5
#> Ed Wood                              33     27              43.5
#> Forrest Gump                         74     51              88.0
#> Four Weddings and a Funeral          27     24              37.5
#> Interview with the Vampire           34     24              41.0
#> Leon: The Professional               16      5              13.0
#> Natural Born Killers                 10      5              10.0
#> Pulp Fiction                         72     69             105.0
#> Speed                                20     21              31.0
#> The Lion King                        35     43              60.5
#> The Mask                             22      6              17.0
#> The Shawshank Redemption             42     21              42.0
#> True Lies                            23      8              19.5

Preprocessing

The scoring systems use very different scales: percentage scores, a 10-point rating, a raw dollar figure, and raw counts. Before running SRD it is good practice to bring all columns onto a common scale so that no single variable dominates purely because of its unit of measurement. Note that each data preprocessing method is strictly monotonic; hence, if the reference is provided and not aggregated from the data, preprocessing has no effect on the SRD analysis.

utilsPreprocessDF() offers four methods: "range_scale" (maps each column to \([0, 1]\)), "standardize" (zero mean, unit variance), "scale_to_unit" (L2 normalisation), and "scale_to_max" (divides by the column maximum). Range scaling is the most natural choice here since all columns measure inherently positive quantities.

movies_scaled <- utilsPreprocessDF(movies, method = "range_scale")
round(head(movies_scaled), 3)
#>   RT Reviewers RT Audience  IMDB Box Office Nominations Awards
#> 1        0.884       0.700 0.304      0.000       0.099  0.031
#> 2        0.395       0.533 0.130      0.250       0.000  0.000
#> 3        0.930       0.667 0.348      0.003       0.423  0.344
#> 4        0.535       0.900 0.783      0.692       1.000  0.719
#> 5        0.930       0.200 0.043      0.249       0.338  0.297
#> 6        0.326       0.600 0.217      0.226       0.437  0.297
#>   Recognition score
#> 1             0.056
#> 2             0.000
#> 3             0.376
#> 4             0.827
#> 5             0.315
#> 6             0.350

Note that the reference column (Recognition score) is scaled along with the rest. After scaling, every value lies between 0 and 1.
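To make the range-scaling step concrete, here is a minimal base R sketch of the usual min-max formula, applied to the RT Reviewers column. The helper name range_scale is hypothetical, and the sketch only assumes that "range_scale" means (x - min) / (max - min); it is not taken from the package source. It also illustrates why a strictly monotonic transformation cannot change the ranks of a column.

```r
# Hypothetical helper: map a numeric vector linearly onto [0, 1]
range_scale <- function(x) (x - min(x)) / (max(x) - min(x))

# RT Reviewers scores, copied from the movies1994 table
rt_reviewers <- c(90, 69, 92, 75, 92, 66, 76, 52, 92, 95, 92, 81, 89, 77)
scaled <- range_scale(rt_reviewers)

round(head(scaled), 3)
#> [1] 0.884 0.395 0.930 0.535 0.930 0.326

# A strictly increasing transformation leaves the ranks untouched
same_ranks <- identical(rank(rt_reviewers), rank(scaled))
same_ranks
#> [1] TRUE
```

The first six values agree with the RT Reviewers column of the utilsPreprocessDF() output above.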

Constructing a reference column

If you do not have a natural reference column, utilsCreateReference() can derive one from the data itself. Available options are "max", "min", "mean", and "median" (applied row-wise), as well as "mixed" for a row-by-row specification.

# Example: use the row-wise median as the reference
movies_with_ref <- utilsCreateReference(movies_scaled[, -ncol(movies_scaled)],
                                        method = "median")

In our case movies_scaled already contains Recognition score as the last column, so we use it directly as the reference.
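For readers who want to see what a data-derived reference amounts to, the following base R sketch builds a row-wise median by hand and appends it as the last column. The toy data frame and the column name Reference are made up for illustration; how utilsCreateReference() names or stores the column is not assumed here.

```r
# Toy data: three objects (rows) scored by three methods (columns)
scores <- data.frame(A = c(0.2, 0.9, 0.5),
                     B = c(0.4, 0.7, 0.1),
                     C = c(0.3, 0.8, 0.6))

# Row-wise median across the methods, appended as the last column
row_median <- unname(apply(scores, 1, median))
with_ref   <- cbind(scores, Reference = row_median)

with_ref$Reference
#> [1] 0.3 0.8 0.5
```

Replacing median with mean, max, or min in the apply() call gives the other row-wise variants.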


Computing SRD values

calculateSRDValues() computes the normalised SRD score for every non-reference column. The last column of the data frame is always treated as the reference.

srd <- calculateSRDValues(movies_scaled, output_to_file = FALSE)
srd
#> [1] 0.4489796 0.5204082 0.4183673 0.5714286 0.1326531 0.1122449

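The result is a numeric vector with one entry per solution column, in the column order of the data frame (RT Reviewers, RT Audience, IMDB, Box Office, Nominations, Awards). Values closer to 0 indicate stronger agreement with the Recognition score reference.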

Awards and Nominations agree most closely with the Recognition score reference. This is not surprising since the reference is a weighted combination of these two measures (Recognition = Awards + 0.5 * Nominations), i.e. their high agreement is a mathematical consequence rather than an independent finding. Among the remaining scoring systems, RT Reviewers and IMDB show notably lower SRD values than RT Audience and Box Office, suggesting that critics and dedicated film raters track award recognition more closely than general audiences or commercial success.


Testing significance via permutation tests

While the raw SRD values provide an initial ranking of method performance, their interpretation requires a statistical baseline. Specifically, we must ask whether an observed SRD is significantly lower than what would arise from random ranking.

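The function calculateSRDDistribution() performs this by simulating 1,000,000 random permutations of the ranks and computing their SRD distribution. Two threshold values are derived from this distribution: XX1, the 5% threshold, below which a solution agrees with the reference significantly better than chance, and XX19, the 95% threshold, above which a solution ranks the objects in a significantly reversed order. The median of the distribution is reported as well.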

Because the simulation is stochastic, the thresholds can vary slightly between runs, in particular for small datasets. We recommend always setting a seed in published analyses to ensure reproducibility.

sim <- calculateSRDDistribution(movies_scaled, seed = 42)

cat("XX1  (5% threshold):", sim$xx1,    "\n")
cat("Median:             ", sim$median,  "\n")
cat("XX19 (95% threshold):", sim$xx19,  "\n")
#> XX1  (5% threshold): 0.4694
#> Median:              0.6735
#> XX19 (95% threshold): 0.8571
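The idea behind the simulation can be sketched in a few lines of base R. This is only an approximation under stated assumptions: it uses R's own random number generator rather than the package's C++ generator, draws 20,000 instead of 1,000,000 permutations, assumes a reference without ties, and reads the thresholds off with quantile(), which may not match the package's exact definition of XX1 and XX19.

```r
set.seed(42)          # base R RNG, so results will not match rSRD's seed = 42
n      <- 14          # number of ranked objects (films)
n_perm <- 20000       # far fewer draws than the package's 1,000,000
ref    <- seq_len(n)  # reference ranks 1..n, assuming no ties

# Normalised SRD of n_perm random rankings against the reference
null_srd <- replicate(n_perm, sum(abs(sample(n) - ref))) / floor(n^2 / 2)

# Empirical 5%, 50% and 95% points of the simulated distribution
thresholds <- quantile(null_srd, probs = c(0.05, 0.5, 0.95))
thresholds
```

With this many draws the three values should land close to the thresholds reported above, although small deviations are expected.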

plotPermTest() overlays the observed SRD values on the simulated distribution as vertical coloured lines, with dashed lines marking XX1, the median, and XX19. Solutions whose lines fall to the left of the XX1 dashed line agree with the reference significantly better than chance. Solutions to the right of XX19 indicate a significantly reversed ranking.

plotPermTest(movies_scaled, sim)

[Figure: permutation test plot showing the SRD distribution and solution positions]

The permutation test reveals that RT Reviewers (critics’ score) and IMDB ratings agree with the Recognition score reference significantly better than chance — their SRD values fall to the left of the XX1 threshold. Awards and Nominations are also significant, though this is largely a consequence of the Recognition score reference being composed of these two measures, i.e. their significance should be interpreted with caution.

RT Audience and Box Office fall between XX1 and XX19, meaning their rankings are statistically indistinguishable from a random ordering relative to the Recognition score reference. Box Office is the farthest from the reference, which suggests that award committees and general audiences evaluate films by different criteria. Commercial blockbusters such as The Mask and Speed rank highly by box office but received comparatively few nominations, while critically acclaimed films such as The Shawshank Redemption — a box office disappointment on release — were heavily nominated. This disconnect between commercial success and award recognition is one of the most striking findings the SRD analysis reveals.
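The comparison against the thresholds can also be done programmatically. The sketch below hard-codes the SRD values and thresholds printed earlier so that it runs on its own; attaching the column names to the values assumes that the entries follow the column order of the data frame, and the three verdict labels are wording chosen for this example.

```r
# Values copied from the outputs above; names assume data-frame column order
srd <- c("RT Reviewers" = 0.4489796, "RT Audience" = 0.5204082,
         "IMDB" = 0.4183673, "Box Office" = 0.5714286,
         "Nominations" = 0.1326531, "Awards" = 0.1122449)
xx1  <- 0.4694
xx19 <- 0.8571

# Below XX1: better than chance; above XX19: reversed; otherwise in between
verdict <- cut(srd, breaks = c(-Inf, xx1, xx19, Inf), right = FALSE,
               labels = c("better than chance",
                          "not distinguishable from random",
                          "reversed"))
result <- setNames(as.character(verdict), names(srd))
result
```

In a live session the hard-coded numbers would be replaced by srd, sim$xx1 and sim$xx19 from the calls shown earlier.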

Note that with only 14 observations the simulation thresholds are more variable than they would be for a larger dataset. The seed = 42 argument ensures the thresholds reported here are reproducible.


Cross-validation

The permutation test provides an absolute benchmark by testing whether a given method’s SRD is significantly lower than random expectation. Cross-validation extends this framework to a relative comparison. It tests whether the SRD values of two methods are statistically distinguishable from each other, thereby enabling direct pairwise model selection.

The function calculateCrossValidation() divides the data into folds, computes SRD values for each fold and executes a statistical test (Wilcoxon, Alpaydin or Dietterich) between consecutive pairs of solutions sorted by their median fold SRD.

cv <- calculateCrossValidation(movies_scaled,
                               method          = "Wilcoxon",
                               number_of_folds = 7,
                               output_to_file  = FALSE,
                               seed            = 42)

cv$statistical_significance

The significance labels are:

Label       Meaning
(p<0.05*)   The two adjacent solutions are significantly different at the 5% level
(p<0.1)     Significant at the 10% level
n.s.        Not significantly different
n.a.        Not available (fold count outside the 5–10 range)
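The fold-wise procedure can be approximated in base R as follows. This is a sketch under several assumptions rather than a description of calculateCrossValidation(): the fold assignment via sample(), the helper srd_normalised with average ranks for ties, and the use of a paired wilcox.test() with a normal approximation are all choices made for illustration. Only two scoring systems, IMDB and Box Office, are compared to keep the example short.

```r
# Hypothetical helper (not part of rSRD), assuming average ranks for ties
srd_normalised <- function(solution, reference) {
  sum(abs(rank(solution) - rank(reference))) / floor(length(reference)^2 / 2)
}

# Columns copied from the movies1994 table
imdb        <- c(7.7, 7.3, 7.8, 8.8, 7.1, 7.5, 8.5, 7.2, 8.8, 7.3, 8.5, 7.0, 9.3, 7.3)
box_office  <- c(3158707, 247290327, 5888353, 678226465, 245700832, 223664608,
                 20330788, 50287182, 213928762, 350448145, 979161632, 351583407,
                 29420884, 378882411)
recognition <- c(12, 6.5, 43.5, 88, 37.5, 41, 13, 10, 105, 31, 60.5, 17, 42, 19.5)

set.seed(42)
folds <- sample(rep(1:7, each = 2))  # assign each of the 14 films to one of 7 folds

# SRD of each scoring system with one fold (2 films) left out at a time
fold_srd <- sapply(1:7, function(k) {
  keep <- folds != k
  c(IMDB       = srd_normalised(imdb[keep], recognition[keep]),
    Box_Office = srd_normalised(box_office[keep], recognition[keep]))
})

# Paired Wilcoxon signed-rank test on the seven fold-wise SRD values
test <- wilcox.test(fold_srd["IMDB", ], fold_srd["Box_Office", ],
                    paired = TRUE, exact = FALSE)
test$p.value
```

With only seven paired values the test has little power, which is one reason the package restricts the number of folds to a fixed range.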

plotCrossValidation() displays the fold-wise SRD values as box-and-whisker plots, ordered by median, making it easy to see which solutions are stable across folds and which are sensitive to the subset of films used.

plotCrossValidation(cv)

[Figure: box-whisker plot of cross-validation SRD values by solution]

The box-whisker plot orders the scoring systems from left to right by increasing median SRD. Awards and Nominations occupy the two leftmost positions, confirming their strong agreement with the Recognition score reference — though as noted above, this is partly a mathematical consequence of how the reference is constructed. The pairwise Wilcoxon test finds that Awards and Nominations are not significantly different from each other (n.s.), but Nominations and IMDB are significantly different (p<0.05), marking a meaningful boundary between the award-derived measures and the independent scoring systems.

IMDB and RT Reviewers occupy the middle positions and are not significantly different from each other (n.s.), nor is RT Reviewers significantly different from RT Audience (n.s.). This suggests that while critics and dedicated film raters (IMDB) track award recognition somewhat more closely than general audiences, the difference between these three systems is not statistically compelling with this dataset. Note that only consecutive methods are tested, so this analysis does not report whether the difference between IMDB and RT Audience is statistically significant.

The most striking finding is the significant gap between RT Audience and Box Office (p<0.05), confirming that commercial box office performance is not merely different from the Recognition reference by degree but is categorically more distant than all other scoring systems. Box Office sits furthest from the reference, reinforcing the conclusion from the permutation test, i.e. award committees and general audiences evaluate films by fundamentally different criteria.

Note that with only 14 films and 7 folds, each fold omits just 2 films. Wide boxes indicate scoring systems whose SRD values are sensitive to which 2 films are excluded — films like The Shawshank Redemption (high IMDB and RT scores but low box office) are likely influential observations that affect fold-wise results considerably.


Heatmap visualisation

plotHeatmapSRD() provides a complementary view of the results by displaying the pairwise distances between all scoring systems — including the reference — as a colour-coded heatmap. By default, the heatmap uses a red-to-blue gradient, with red indicating SRD = 0 and blue indicating SRD = 1. Users may supply custom palettes, and the number of colour breaks corresponds to the palette length.

plotHeatmapSRD(movies_scaled)

[Figure: heatmap of pairwise SRD distances between scoring systems]
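The numbers behind such a heatmap form a symmetric matrix of pairwise SRD values. The base R sketch below computes that matrix for three columns of the dataset. It is an illustration only: srd_normalised is a hypothetical helper that assumes average ranks for ties, and nothing is assumed about how plotHeatmapSRD() computes or orders its entries.

```r
# Hypothetical helper (not part of rSRD), assuming average ranks for ties
srd_normalised <- function(a, b) {
  sum(abs(rank(a) - rank(b))) / floor(length(a)^2 / 2)
}

# Three columns copied from the movies1994 table
cols <- list(
  IMDB        = c(7.7, 7.3, 7.8, 8.8, 7.1, 7.5, 8.5, 7.2, 8.8, 7.3, 8.5, 7.0, 9.3, 7.3),
  Awards      = c(7, 5, 27, 51, 24, 24, 5, 5, 69, 21, 43, 6, 21, 8),
  Recognition = c(12, 6.5, 43.5, 88, 37.5, 41, 13, 10, 105, 31, 60.5, 17, 42, 19.5)
)

# Normalised SRD between every pair of columns (symmetric, zero diagonal)
dist_mat <- sapply(cols, function(a) sapply(cols, function(b) srd_normalised(a, b)))
round(dist_mat, 3)
```

Because rank differences are taken in absolute value, the distance from A to B equals the distance from B to A, so only one triangle of the matrix carries information.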

The heatmap confirms the findings of the permutation test and cross-validation. Awards and Nominations cluster closely together and sit nearest to the Recognition reference, while Box Office is most distant from all other systems.

A particularly striking feature is the distance between RT Audience and Box Office, despite both reflecting the judgement of general audiences rather than critics or award committees. This apparent contradiction dissolves once we distinguish between two fundamentally different modes of audience behaviour. Buying a cinema ticket is an anticipatory decision made before seeing a film, shaped by marketing, star power, and word of mouth. Rating a film on Rotten Tomatoes is a reflective decision made after viewing, shaped by the experience of the film itself. The 1994 dataset illustrates this gap rather vividly. The Mask and Speed were heavily marketed blockbusters that sold large numbers of tickets but received more measured audience ratings on reflection, while The Shawshank Redemption — a commercial disappointment on release — consistently receives some of the highest audience ratings ever recorded. Award committees, critics, and reflective online audiences converge in their evaluations, whereas box office receipts apparently tell a different story.


A note on reproducibility

calculateSRDDistribution() and calculateCrossValidation() both rely on random number generation in C++. Without a seed the results vary slightly between runs. For small datasets this variability can affect the precise values of XX1 and XX19.

The following example makes the issue concrete:

# Two unseeded runs -- XX1 may differ slightly
sim_a <- calculateSRDDistribution(movies_scaled)
sim_b <- calculateSRDDistribution(movies_scaled)
cat("Run A -- XX1:", sim_a$xx1, "  XX19:", sim_a$xx19, "\n")
cat("Run B -- XX1:", sim_b$xx1, "  XX19:", sim_b$xx19, "\n")

# Two seeded runs -- results are identical
sim_1 <- calculateSRDDistribution(movies_scaled, seed = 42)
sim_2 <- calculateSRDDistribution(movies_scaled, seed = 42)
cat("Seed 42, run 1 -- XX1:", sim_1$xx1, "  XX19:", sim_1$xx19, "\n")
cat("Seed 42, run 2 -- XX1:", sim_2$xx1, "  XX19:", sim_2$xx19, "\n")

For any analysis that will be published or shared, we strongly recommend always specifying a seed. The choice of seed value does not affect the statistical validity of the results; it only guarantees their reproducibility.


Summary of the workflow

library(rSRD)

# 1. Load data (last column = reference)
path   <- system.file("extdata", "movies1994.csv", package = "rSRD")
movies <- read.csv(path, header = TRUE, sep = ";", row.names = 1,
                   check.names = FALSE)

# 2. Preprocess
movies_scaled <- utilsPreprocessDF(movies, method = "range_scale")

# 3. Compute SRD values
srd <- calculateSRDValues(movies_scaled, output_to_file = FALSE)

# 4. Permutation test (set seed for reproducibility)
sim <- calculateSRDDistribution(movies_scaled, seed = 42)
plotPermTest(movies_scaled, sim)

# 5. Cross-validation
cv <- calculateCrossValidation(movies_scaled,
                               method          = "Wilcoxon",
                               number_of_folds = 7,
                               output_to_file  = FALSE,
                               seed            = 42)
plotCrossValidation(cv)

References

Héberger K., Kollár-Hunek K. (2011) “Sum of ranking differences for method discrimination and its validation: comparison of ranks with random numbers”, Journal of Chemometrics, 25(4), pp. 151–158.

Sziklai B.R., Baranyi M., Héberger K. (2025) “Does cross-validation work in telling rankings apart?”, Central European Journal of Operations Research, 33, pp. 1503–1528.