---
title: "Getting Started with rSRD"
author:
  - "Balázs R. Sziklai ([0000-0002-0068-8920](https://orcid.org/0000-0002-0068-8920))<br>HUN-REN Centre for Economic and Regional Studies, 1097 Tóth Kálmán u. 4, Budapest, Hungary<br>Department of Operations Research and Actuarial Sciences, Corvinus University of Budapest, 1093 Fővám tér 8, Budapest, Hungary<br>Email: sziklai.balazs@krtk.elte.hu"
  - "Attila Gere ([0000-0003-3075-1561](https://orcid.org/0000-0003-3075-1561))<br>Hungarian University of Agriculture and Life Sciences, 1118 Villányi út 29-43, Budapest, Hungary<br>Email: Gere.Attila@uni-mate.hu"
  - "Károly Héberger ([0000-0003-0965-939X](https://orcid.org/0000-0003-0965-939X))<br>Plasma Chemistry Research Group, Institute of Materials and Environmental Chemistry, HUN-REN Research Centre for Natural Sciences, 1117 Magyar tudósok körútja 2, Budapest, Hungary<br>Email: heberger.karoly@ttk.hu"
  - "Jochen Staudacher ([0000-0002-0619-4606](https://orcid.org/0000-0002-0619-4606))<br>Fakultät Informatik, Hochschule Kempten, Bahnhofstr. 61, 87435 Kempten, Germany<br>Email: jochen.staudacher@hs-kempten.de"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with rSRD}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>"
)
library(rSRD)
```

## What is Sum of Ranking Differences?

When several methods, models, or scoring systems evaluate the same set of
objects, a natural question arises: which method agrees most closely with a
set of reference values? Sum of Ranking Differences (SRD) answers this question with
a single, interpretable number.

The idea is straightforward. Each method produces a ranking of the objects.
A reference is chosen --- either an external gold standard or one
derived from the data itself (for example, the row-wise mean or median). For
each method, the absolute differences between its ranks and the reference ranks
are summed. A small SRD value means the method tracks the reference closely, 
whereas a large SRD value means the method disagrees with it.
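
The calculation can be traced by hand on a toy example. The sketch below uses
base R only and made-up scores for five objects (the numbers and variable names
are hypothetical, not taken from `movies1994`); it illustrates the definition
rather than the package's internal code.

```{r sketch-raw-srd, eval = FALSE}
# Hypothetical scores for five objects (illustration only)
method_scores    <- c(7.1, 8.4, 6.0, 9.2, 5.5)
reference_scores <- c(6.8, 9.0, 7.5, 8.1, 5.0)

# Convert scores to ranks (1 = smallest score)
method_ranks    <- rank(method_scores)     # 3 4 2 5 1
reference_ranks <- rank(reference_scores)  # 2 5 3 4 1

# SRD: sum of the absolute rank differences
srd_raw <- sum(abs(method_ranks - reference_ranks))
srd_raw  # 4
```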

`rSRD` normalises SRD values to the interval $[0, 1]$ by dividing by the
theoretical maximum SRD for the given number of objects, so 0 indicates
perfect agreement and 1 indicates the maximum possible disagreement. To
determine whether an observed SRD value is meaningfully small or merely within
the range expected by chance, the package uses a Monte Carlo permutation test.
Under the null hypothesis that the method ranks the objects randomly, it
shuffles the ranks, recalculates the SRD, and repeats this for 1,000,000
random rankings.
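
The scaling step can be illustrated in base R. For $n$ objects without ties,
the largest possible SRD is $\lfloor n^2/2 \rfloor$, which is attained by the
fully reversed ranking. The sketch below is a minimal illustration; that `rSRD`
divides by exactly this quantity is an assumption based on the description
above.

```{r sketch-normalise, eval = FALSE}
n <- 14                               # number of objects, as in movies1994
reference_ranks <- seq_len(n)
reversed_ranks  <- rev(reference_ranks)

max_srd      <- floor(n^2 / 2)        # theoretical maximum: 98 for n = 14
srd_reversed <- sum(abs(reversed_ranks - reference_ranks))

srd_reversed / max_srd                                 # 1: maximum disagreement
sum(abs(reference_ranks - reference_ranks)) / max_srd  # 0: perfect agreement
```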

For a full theoretical treatment of Sum of Ranking Differences, please refer to the original paper
by Héberger and Kollár-Hunek (2011) and to the recent extensions on consensus ranking and significance
testing in Sziklai, Baranyi and Héberger (2025).

---

## The movies1994 dataset

Throughout this vignette we use the `movies1994` dataset, which is bundled with
`rSRD`. It covers 14 films released in 1994 and records six scores plus a
composite reference score for each:

| Column | Description |
|---|---|
| `RT Reviewers` | Rotten Tomatoes critic score (%) |
| `RT Audience` | Rotten Tomatoes audience score (%) |
| `IMDB` | IMDB user rating (out of 10) |
| `Box Office` | Worldwide box office gross (USD) |
| `Nominations` | Total award nominations received |
| `Awards` | Total awards won |
| `Recognition score` | Composite score: Awards + 0.5 * Nominations. Used as the reference against which all other scoring systems are compared. |

The last column, `Recognition score`, is used as the reference --- it combines
nominations and awards into a single measure of industry recognition against
which the other scoring systems are compared.
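
As a quick arithmetic check of the composite formula, the following base R
lines compute the score for three made-up films. The counts are hypothetical
and are not taken from the dataset.

```{r sketch-recognition, eval = FALSE}
# Hypothetical award and nomination counts (not from movies1994)
awards      <- c(7, 2, 0)
nominations <- c(10, 5, 3)

# Recognition score = Awards + 0.5 * Nominations
recognition <- awards + 0.5 * nominations
recognition  # 12.0 4.5 1.5
```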

```{r load-data}
path   <- system.file("extdata", "movies1994.csv", package = "rSRD")
movies <- read.csv(path, header = TRUE, sep = ";", row.names = 1,
                   check.names = FALSE)
movies
```

---

## Preprocessing

The scoring systems use very different scales: percentages, a
10-point rating, a raw dollar figure, and raw counts. Before running SRD it
is good practice to bring all columns onto a common scale so that no single
variable dominates purely because of its unit of measurement. This matters
when the reference is aggregated from the data, for example as a row-wise
mean. Each preprocessing method is strictly monotonic within a column and
therefore leaves the ranks unchanged; hence, if the reference is provided
rather than aggregated from the data, preprocessing has no effect on the SRD
analysis.

`utilsPreprocessDF()` offers four methods: `"range_scale"` (maps each column
to $[0, 1]$), `"standardize"` (zero mean, unit variance), `"scale_to_unit"`
(L2 normalisation), and `"scale_to_max"` (divides by the column maximum).
Range scaling is the most natural choice here since all columns measure
inherently positive quantities.

```{r preprocess}
movies_scaled <- utilsPreprocessDF(movies, method = "range_scale")
round(head(movies_scaled), 3)
```

Note that the reference column (`Recognition score`) is scaled along with the
rest. After scaling, every value lies between 0 and 1.
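
To make the four options concrete, the sketch below applies the usual textbook
form of each transformation to a single made-up column in base R. These
formulas are assumptions about what the method names mean;
`utilsPreprocessDF()` may differ in details such as the variance estimator.
The final check shows that every transformation leaves the ranks unchanged.

```{r sketch-scaling, eval = FALSE}
x <- c(12, 45, 7, 30, 18)  # hypothetical positive scores

scaled <- list(
  range_scale   = (x - min(x)) / (max(x) - min(x)),  # maps to [0, 1]
  standardize   = (x - mean(x)) / sd(x),             # zero mean, unit variance
  scale_to_unit = x / sqrt(sum(x^2)),                # unit L2 norm
  scale_to_max  = x / max(x)                         # maximum becomes 1
)

# Each transformation is strictly increasing here, so ranks are preserved
ranks_kept <- sapply(scaled, function(v) identical(rank(v), rank(x)))
ranks_kept
```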

### Constructing a reference column

If you do not have a natural reference column, `utilsCreateReference()` can
derive one from the data itself. Available options are `"max"`, `"min"`,
`"mean"`, and `"median"` (applied row-wise), as well as `"mixed"` for a
row-by-row specification.

```{r create-reference, eval = FALSE}
# Example: use the row-wise median as the reference
movies_with_ref <- utilsCreateReference(movies_scaled[, -ncol(movies_scaled)],
                                        method = "median")
```

In our case `movies_scaled` already contains `Recognition score` as the last
column, so we use it directly as the reference.
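
The idea behind a data-driven reference can be sketched in base R. The example
below appends a row-wise median to a small made-up data frame; the data and the
column name `ref_median` are hypothetical, and the way
`utilsCreateReference()` names or places its output column is not shown here.

```{r sketch-reference, eval = FALSE}
toy <- data.frame(
  A = c(0.2, 0.9, 0.5),
  B = c(0.4, 0.7, 0.1),
  C = c(0.3, 1.0, 0.6)
)

# Row-wise median across the three solution columns, appended as the last column
toy$ref_median <- apply(as.matrix(toy), 1, median)
toy$ref_median  # 0.3 0.9 0.5
```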

---

## Computing SRD values

`calculateSRDValues()` computes the normalised SRD score for every
non-reference column. The last column of the data frame is always treated as
the reference.

```{r srd-values}
srd <- calculateSRDValues(movies_scaled, output_to_file = FALSE)
srd
```

The result is a named numeric vector with one entry per solution column.
Values closer to 0 indicate stronger agreement with the `Recognition score`
reference.
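
The convention that the last column acts as the reference can be mimicked in
base R. The helper below, `srd_by_hand()`, is a hypothetical name introduced
for illustration; it assumes no tied values and normalisation by
$\lfloor n^2/2 \rfloor$, and it is not the package's implementation.

```{r sketch-srd-columns, eval = FALSE}
# Normalised SRD of every non-reference column against the last column
srd_by_hand <- function(df) {
  n         <- nrow(df)
  ref_ranks <- rank(df[[ncol(df)]])
  solutions <- df[-ncol(df)]
  sapply(solutions, function(col) {
    sum(abs(rank(col) - ref_ranks)) / floor(n^2 / 2)
  })
}

toy <- data.frame(
  same     = c(1, 2, 3, 4),      # identical ordering to the reference
  reversed = c(4, 3, 2, 1),      # opposite ordering
  ref      = c(10, 20, 30, 40)
)
toy_srd <- srd_by_hand(toy)
toy_srd  # same = 0, reversed = 1
```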

<!--
```{r srd-bar, fig.width = 6, fig.height = 4, fig.alt = "Bar chart of normalised SRD values for each scoring system"}
barplot(sort(srd),
        horiz  = TRUE,
        las    = 1,
        xlab   = "Normalised SRD value",
        main   = "SRD values — movies1994",
        col    = "steelblue")
```
-->

`Awards` and `Nominations` agree most closely with the `Recognition score`
reference. This is not surprising since the reference is a weighted combination of
these two measures (`Recognition = Awards + 0.5 * Nominations`), i.e. their
high agreement is a mathematical consequence rather than an independent
finding. Among the remaining scoring systems, `RT Reviewers` and `IMDB`
show notably lower SRD values than `RT Audience` and `Box Office`, 
suggesting that critics and dedicated film raters track award
recognition more closely than general audiences or commercial success.

---

## Testing significance via permutation tests

While the raw SRD values provide an initial ranking of method performance, 
their interpretation requires a statistical baseline. Specifically, we must 
ask whether an observed SRD is significantly lower than what would arise from 
random ranking.

The function `calculateSRDDistribution()` performs this by simulating 1,000,000 random 
permutations of the ranks and computing their SRD distribution. Two threshold values
are derived from this distribution:

- **XX1** (5th percentile): SRD values below this threshold agree with the
  reference *significantly better than chance* at the 5% level.
- **XX19** (95th percentile): SRD values above this threshold rank objects in
  *reverse order* relative to the reference, also at the 5% level.

SRD values between XX1 and XX19 are indistinguishable from random rankings.

Because the simulation is stochastic, the thresholds can vary slightly between
runs, in particular for small datasets. We recommend always setting a seed in
published analyses to ensure reproducibility.

```{r srd-dist, eval = FALSE}
sim <- calculateSRDDistribution(movies_scaled, seed = 42)

cat("XX1  (5% threshold):", sim$xx1,    "\n")
cat("Median:             ", sim$median,  "\n")
cat("XX19 (95% threshold):", sim$xx19,  "\n")
```

```{r srd-dist-hidden, echo = FALSE}
# Run with a fixed seed so the vignette output is reproducible
sim <- calculateSRDDistribution(movies_scaled, seed = 42)
cat("XX1  (5% threshold):", sim$xx1,    "\n")
cat("Median:             ", sim$median,  "\n")
cat("XX19 (95% threshold):", sim$xx19,  "\n")
```
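
To see where such thresholds come from, the simulation can be mimicked on a
small scale in base R. The sketch below draws 10,000 random rankings instead of
1,000,000 and uses R's own random number generator, so its numbers will not
match those reported by `calculateSRDDistribution()`; taking the empirical 5%
and 95% quantiles as stand-ins for XX1 and XX19 is likewise an assumption.

```{r sketch-null-distribution, eval = FALSE}
set.seed(42)                      # base R seed, unrelated to the package's seed
n         <- 14                   # number of ranked objects
n_draws   <- 10000                # far fewer draws than the package uses
ref_ranks <- seq_len(n)

# Normalised SRD of random permutations against the reference ranking
null_srd <- replicate(n_draws, {
  sum(abs(sample(n) - ref_ranks)) / floor(n^2 / 2)
})

# Empirical 5% threshold, median, and 95% threshold
thresholds <- quantile(null_srd, probs = c(0.05, 0.50, 0.95))
thresholds
```

An observed normalised SRD can then be compared with the lower quantile: values
below it would be rare under random ranking.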

`plotPermTest()` overlays the observed SRD values on the simulated distribution
as vertical coloured lines, with dashed lines marking XX1, the median, and
XX19. Solutions whose lines fall to the left of the XX1 dashed line agree with the
reference significantly better than chance. Solutions to the right of XX19
indicate a significantly reversed ranking.

```{r perm-plot, eval = FALSE, fig.width = 7, fig.height = 5, fig.alt = "Permutation test plot showing SRD distribution and solution positions"}
plotPermTest(movies_scaled, sim)
```

```{r perm-plot-hidden, echo = FALSE, fig.width = 7, fig.height = 5, fig.alt = "Permutation test plot showing SRD distribution and solution positions"}
plotPermTest(movies_scaled, sim)
```

The permutation test reveals that `RT Reviewers` (critics' score) and
`IMDB` ratings agree with the `Recognition score` reference significantly better than
chance --- their SRD values fall to the left of the XX1 threshold. `Awards`
and `Nominations` are also significant, though this is largely a
consequence of the `Recognition score` reference being composed of these two measures,
i.e. their significance should be interpreted with caution.

`RT Audience` and `Box Office` fall between XX1 and XX19,
meaning their rankings are statistically indistinguishable from a random
ordering relative to the `Recognition score` reference. `Box Office` is
the farthest from the reference, which suggests that award committees and
general audiences evaluate films by different criteria. Commercial
blockbusters such as *The Mask* and *Speed* rank highly by box office but
received comparatively few nominations, while critically acclaimed films
such as *The Shawshank Redemption* --- a box office disappointment on
release --- were heavily nominated. This disconnect between commercial
success and award recognition is one of the most striking findings the SRD
analysis reveals.

Note that with only 14 observations the simulation thresholds are more
variable than they would be for a larger dataset. The `seed = 42` argument
ensures the thresholds reported here are reproducible.

---

## Cross-validation

The permutation test provides an absolute benchmark by testing whether a given method's 
SRD is significantly lower than random expectation. Cross-validation extends this framework 
to a relative comparison. It tests whether the SRD values of two methods are statistically 
distinguishable from each other, thereby enabling direct pairwise model selection.

The function `calculateCrossValidation()` divides the data into folds, 
computes SRD values for each fold and executes a statistical test (Wilcoxon, Alpaydin or Dietterich) 
between consecutive pairs of solutions sorted by their median fold SRD. 
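
The fold-and-test idea can be sketched in base R on simulated data. Everything
below is illustrative: the data are random, the contiguous leave-two-out folds
are an assumption rather than the package's fold assignment, and `norm_srd()`
is a hypothetical helper.

```{r sketch-folds, eval = FALSE}
set.seed(1)
n         <- 14
reference <- rnorm(n)
toy <- data.frame(
  close = reference + rnorm(n, sd = 0.3),  # tracks the reference
  far   = rnorm(n),                        # unrelated to the reference
  ref   = reference
)

# Normalised SRD of one score vector against a reference vector
norm_srd <- function(scores, ref) {
  sum(abs(rank(scores) - rank(ref))) / floor(length(ref)^2 / 2)
}

# Seven folds of two rows each; every fold is left out once
fold_id  <- rep(seq_len(7), each = 2)
fold_srd <- t(sapply(seq_len(7), function(k) {
  keep <- fold_id != k
  c(close = norm_srd(toy$close[keep], toy$ref[keep]),
    far   = norm_srd(toy$far[keep],   toy$ref[keep]))
}))

# Paired Wilcoxon signed-rank test on the fold-wise SRD values
wilcox.test(fold_srd[, "close"], fold_srd[, "far"],
            paired = TRUE, exact = FALSE)
```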

```{r cv-hidden, echo = FALSE}
cv <- calculateCrossValidation(movies_scaled,
                               method          = "Wilcoxon",
                               number_of_folds = 7,
                               output_to_file  = FALSE,
                               seed            = 42)
```

```{r cv, eval = FALSE}
cv <- calculateCrossValidation(movies_scaled,
                               method          = "Wilcoxon",
                               number_of_folds = 7,
                               output_to_file  = FALSE,
                               seed            = 42)

cv$statistical_significance
```

The significance labels are:

| Label | Meaning |
|---|---|
| `(p<0.05*)` | The two adjacent solutions are significantly different at the 5% level |
| `(p<0.1)` | Significant at the 10% level |
| `n.s.` | Not significantly different |
| `n.a.` | Not available (fold count outside the 5--10 range) |

`plotCrossValidation()` displays the fold-wise SRD values as box-and-whisker
plots, ordered by median, making it easy to see which solutions are stable
across folds and which are sensitive to the subset of films used.

```{r cv-plot, eval = FALSE, fig.width = 7, fig.height = 5, fig.alt = "Box-whisker plot of cross-validation SRD values by solution"}
plotCrossValidation(cv)
```

```{r cv-plot-hidden, echo = FALSE, fig.width = 7, fig.height = 5, fig.alt = "Box-whisker plot of cross-validation SRD values by solution"}
plotCrossValidation(cv)
```

The box-whisker plot orders the scoring systems from left to right by
increasing median SRD. `Awards` and `Nominations` occupy the two leftmost
positions, confirming their strong agreement with the `Recognition score` 
reference --- though as noted above, this is partly a mathematical
consequence of how the reference is constructed. The pairwise Wilcoxon
test finds that `Awards` and `Nominations` are not significantly different
from each other (`n.s.`), but `Nominations` and `IMDB` *are*
significantly different (`p<0.05`), marking a meaningful boundary between
the award-derived measures and the independent scoring systems.

`IMDB` and `RT Reviewers` occupy the middle positions and are not
significantly different from each other (`n.s.`), nor is `RT Reviewers`
significantly different from `RT Audience` (`n.s.`). This suggests that
while critics and dedicated film raters (IMDB) track award recognition
somewhat more closely than general audiences, the difference between these
three systems is not statistically compelling with this dataset. 
Note that only consecutive methods are tested, so this analysis does not report
whether the difference between `IMDB` and `RT Audience` is statistically significant.

The most striking finding is the significant gap between `RT Audience` and
`Box Office` (`p<0.05`). `Box Office` sits furthest from the reference and is
significantly more distant than its nearest neighbour in the ordering. This
reinforces the conclusion from the permutation test, i.e. award committees and
general audiences evaluate films by different criteria.

Note that with only 14 films and 7 folds, each fold omits just 2 films.
Wide boxes indicate scoring systems whose SRD values are sensitive to
which 2 films are excluded --- films like *The Shawshank Redemption* (high
IMDB and RT scores but low box office) are likely influential observations
that affect fold-wise results considerably.

---

## Heatmap visualisation

`plotHeatmapSRD()` provides a complementary view of the results by
displaying the pairwise distances between all scoring systems --- including
the reference --- as a colour-coded heatmap. By default, the heatmap uses 
a red-to-blue gradient, with red indicating SRD = 0 and blue 
indicating SRD = 1. Users may supply custom palettes, and the number 
of colour breaks corresponds to the palette length.

```{r heatmap, eval = FALSE}
plotHeatmapSRD(movies_scaled)
```

```{r heatmap-hidden, echo = FALSE, fig.width = 7, fig.height = 6, fig.alt = "Heatmap of pairwise SRD distances between scoring systems"}
plotHeatmapSRD(movies_scaled)
```
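
A pairwise distance matrix of this kind can be sketched in base R. The example
below computes a normalised SRD between every pair of columns in a small
made-up data frame; whether `plotHeatmapSRD()` uses exactly this scaling is an
assumption.

```{r sketch-pairwise, eval = FALSE}
toy <- data.frame(
  A = c(1, 2, 3, 4, 5),
  B = c(2, 1, 3, 5, 4),
  C = c(5, 4, 3, 2, 1)
)
ranks <- sapply(toy, rank)   # one column of ranks per scoring system
n     <- nrow(toy)

# Normalised SRD between every pair of columns
pairwise <- sapply(colnames(ranks), function(i) {
  sapply(colnames(ranks), function(j) {
    sum(abs(ranks[, i] - ranks[, j])) / floor(n^2 / 2)
  })
})
round(pairwise, 3)  # A-B: 0.333, A-C: 1, B-C: 1, diagonal: 0
```

The matrix is symmetric with a zero diagonal, which is why a heatmap of it can
be read in either direction.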

The heatmap confirms the findings of the permutation test and
cross-validation. `Awards` and `Nominations` cluster closely together and
sit nearest to the Recognition reference, while `Box Office` is
most distant from all other systems.

A particularly striking feature is the distance between `RT Audience` and
`Box Office`, despite both reflecting the judgement of general
audiences rather than critics or award committees. This apparent
contradiction dissolves once we distinguish between two fundamentally
different modes of audience behaviour. Buying a cinema ticket is an
*anticipatory* decision made before seeing a film, shaped by marketing,
star power, and word of mouth. Rating a film on Rotten Tomatoes is a
*reflective* decision made after viewing, shaped by the experience of the
film itself. The 1994 dataset illustrates this gap rather vividly. *The Mask* and
*Speed* were heavily marketed blockbusters that sold large numbers of
tickets but received more measured audience ratings on reflection, while
*The Shawshank Redemption* --- a commercial disappointment on release --- 
consistently receives some of the highest audience ratings ever recorded.
Award committees, critics, and reflective online audiences converge in
their evaluations, whereas box office receipts apparently tell a different story.


---

## A note on reproducibility

`calculateSRDDistribution()` and `calculateCrossValidation()` both rely on
random number generation in C++. Without a seed the results vary slightly
between runs. For small datasets this variability can affect the precise values
of XX1 and XX19.

The following example makes the issue concrete:

```{r repro-demo, eval = FALSE}
# Two unseeded runs -- XX1 may differ slightly
sim_a <- calculateSRDDistribution(movies_scaled)
sim_b <- calculateSRDDistribution(movies_scaled)
cat("Run A -- XX1:", sim_a$xx1, "  XX19:", sim_a$xx19, "\n")
cat("Run B -- XX1:", sim_b$xx1, "  XX19:", sim_b$xx19, "\n")

# Two seeded runs -- results are identical
sim_1 <- calculateSRDDistribution(movies_scaled, seed = 42)
sim_2 <- calculateSRDDistribution(movies_scaled, seed = 42)
cat("Seed 42, run 1 -- XX1:", sim_1$xx1, "  XX19:", sim_1$xx19, "\n")
cat("Seed 42, run 2 -- XX1:", sim_2$xx1, "  XX19:", sim_2$xx19, "\n")
```

For any analysis that will be published or shared, we strongly recommend
always specifying a seed. The choice of seed value does not affect the statistical
validity of the results; it only guarantees their reproducibility.

---

## Summary of the workflow

```{r workflow-summary, eval = FALSE}
# 1. Load data (last column = reference)
path   <- system.file("extdata", "movies1994.csv", package = "rSRD")
movies <- read.csv(path, header = TRUE, sep = ";", row.names = 1,
                   check.names = FALSE)

# 2. Preprocess
movies_scaled <- utilsPreprocessDF(movies, method = "range_scale")

# 3. Compute SRD values
srd <- calculateSRDValues(movies_scaled, output_to_file = FALSE)

# 4. Permutation test (set seed for reproducibility)
sim <- calculateSRDDistribution(movies_scaled, seed = 42)
plotPermTest(movies_scaled, sim)

# 5. Cross-validation
cv <- calculateCrossValidation(movies_scaled,
                               method          = "Wilcoxon",
                               number_of_folds = 7,
                               output_to_file  = FALSE,
                               seed            = 42)
plotCrossValidation(cv)
```

---

## References

Héberger K., Kollár-Hunek K. (2011) "Sum of ranking differences for method
discrimination and its validation: comparison of ranks with random numbers",
*Journal of Chemometrics*, **25**(4), pp. 151--158.

Sziklai B.R., Baranyi M., Héberger K. (2025) "Does cross-validation work
in telling rankings apart?", *Central European Journal of Operations Research*,
**33**, pp. 1503--1528.

<!--
Sziklai B. R., Baranyi M., Héberger K. (2021) "Testing Cross-Validation
Variants in Ranking Environments", arXiv preprint arXiv:2105.11939.
-->
