The RecordTest package (Inference Tools in Time Series Based on Record Statistics) contains functions to visualize the behaviour of the record occurrence, functions to calculate a wide variety of distribution-free tests for trend in location, variation and non-stationarity, for change-point detection, and tools to prepare a time series in order to study its records.
Install RecordTest using
The introductory theory and summary for the package is at
Here, the main purpose of the package is developed as well as an outline of the functions available in the package.
RecordTest has several functions that attempt to test the classical record model which assumes randomness in its variables, that is, they are continuous independent and identically distributed, by means of hypothesis tests and graphical tools.
To begin with, RecordTest has a dataset
Olympic_records_200m
containing the record times
time
and record values value
of the Olympic
200-meter, from 1900 to 2020. In this case, only the lower records are
available.
The RecordTest functions need a complete series of
observations to calculate its records. In order to apply these tools to
the series of Olympic records, the function series_record
is applied, which generates a series with the same records.
As a preview, the Olympic records series is drawn highlighting its lower records.
The graph below shows the number of accumulated lower records together with its confidence intervals under the null hypothesis, from which we see that the observed sample departs significantly since time \(t = 13\), corresponding to the 1960 Olympics.
Due to the sample size is not very large, an exact one-sided test can be implemented based on the Poisson binomial distribution. The result is highly significant. The number of observed records is 12 while the expected under the null hypothesis is close to 4.
RecordTest has a benchmark temperature dataset
TX_Zaragoza
containing the time series TX
of
daily maximum temperature at Zaragoza (Spain), from 01/01/1951 to
31/12/2020 measured in tenths of a degree Celsius. In this case, the
whole series is available.
As a preview, the temperature series TX_Zaragoza$TX
is
drawn highlighting its upper and lower records.
A large number of upper records are observed in the first observations and very few lower records. The initial behaviour in the records appear due to the seasonal increment of temperature between January and July, because this series has a strong seasonal component and serial correlation.
To take these characteristics into consideration, splitting the
observed series into \(M\) uncorrelated
subseries is especially useful in the presence of serial correlation and
seasonality. series_split
splits
TX_Zaragoza$TX
into 365 subseries, each corresponding to
each day of the year. series_uncor
selects the larger
number of columns or subseries that are not correlated with their
adjacent columns.
TxZ365 <- series_split(TX_Zaragoza$TX, Mcols = 365)
TxZ <- series_uncor(TxZ365)
dim(TxZ)
#> [1] 70 76
Since the observed series is measured and rounded to tenths of a degree Celsius, ties can occur, which may cause a lower number of records to be identified than actually occurred. In particular, we observe that around \(4\%\) of the records are weak records.
series_ties(TxZ365)
#> $number
#> Total Strong Weak Expected under IID
#> 1969 1889 80 1764
#>
#> $percentage
#> [1] 4.063
#>
#> $percentage.position
#> 1 2 3 4 5 6 7 8 9 10
#> 0.0000 2.4631 2.4590 0.0000 4.2105 6.4516 3.0769 1.6393 0.0000 3.4483
#> 11 12 13 14 15 16 17 18 19 20
#> 10.2564 3.7037 6.2500 2.4390 13.3333 0.0000 6.2500 9.0909 11.1111 0.0000
#> 21 22 23 24 25 26 27 28 29 30
#> 42.8571 22.2222 0.0000 0.0000 0.0000 16.6667 0.0000 5.5556 11.1111 8.3333
#> 31 32 33 34 35 36 37 38 39 40
#> 4.7619 7.1429 20.0000 18.1818 12.5000 14.2857 0.0000 10.0000 7.1429 10.5263
#> 41 42 43 44 45 46 47 48 49 50
#> 12.5000 6.2500 0.0000 8.3333 9.5238 0.0000 0.0000 0.0000 0.0000 25.0000
#> 51 52 53 54 55 56 57 58 59 60
#> 14.2857 0.0000 6.6667 16.6667 0.0000 0.0000 0.0000 20.0000 0.0000 0.0000
#> 61 62 63 64 65 66 67 68 69 70
#> 25.0000 3.8462 0.0000 5.5556 8.3333 0.0000 7.6923 0.0000 0.0000 0.0000
We could untie the possible records by adding a random value from a Uniform and independent distribution for each observation.
The following plot shows the upper and lower record times in the forward and backward series. Many more points are observed in the plots of the diagonal, giving evidence that there is a positive trend in the series.
The following plots show the mean number of (weighted) upper and lower records in the forward and backward series. Without weights the trend in the forward series is not significant, but backward it is highly significant. With weights, the trend in both directions is significant. This fact is produced because the evolution of temperature in Zaragoza is showing a faster increase since 1980.
ggpubr::ggarrange(N.plot(TxZ), N.plot(TxZ, weights = function(t) t-1),
ncol = 2, nrow = 1, common.legend = TRUE, legend = "bottom")
A plot that gathers the information about the number of occurrences in the four types of record is obtained as follows.
If we choose incremental weights \(\omega_t
= t-1\) to the records within the statistic, the trend becomes
more significant earlier, then this graphical tool could be useful to
identify the first time when non-stationary evidence appears. Several
versions of plots of this style can be implemented (see
help(foster.plot)
).
foster.plot(TxZ, weights = function(t) t-1) +
ggplot2::ylim(-85, 85) +
ggplot2::geom_vline(xintercept = 44, linetype = "dashed")
We can apply the associated test to detect non-stationary behaviour in the records of the series. The result is highly significant.
foster.test(TxZ, distribution = "normal", weights = function(t) t-1)
#>
#> Foster-Stuart D-statistic test with weights = t - 1
#>
#> data: TxZ
#> Z = 5.53, p-value = 1.6e-08
#> alternative hypothesis: true 'statistic' is greater than 0
#> sample estimates:
#> statistic E VAR
#> 6068 0 1203318
foster.test(TxZ, distribution = "t", weights = function(t) t-1)
#>
#> Foster-Stuart D-statistic test with weights = t - 1
#>
#> data: TxZ
#> t = 5.4, df = 75, p-value = 3.7e-07
#> alternative hypothesis: true 't' is greater than 0
It is possible to use all the series if we compute the p-value with permutations, say 10,000:
set.seed(23)
foster.test(TxZ365, distribution = "normal", weights = function(t) t-1,
permutation.test = TRUE, B = 10000)
#>
#> Foster-Stuart D-statistic test with weights = t - 1 with permuted
#> p-value (based on 10000 permutations)
#>
#> data: TxZ365
#> Z = 13.6, p-value <2e-16
#> alternative hypothesis: true 'statistic' is greater than 0
#> sample estimates:
#> statistic E VAR
#> 32801 0 5779093
Under the null hypothesis of randomness, the record probability meets \(t p_t = 1\). An exploratory tool can be proposed where \(E(t \hat p_t) = \alpha t + \beta\) and also a regression test for the hypothesis \[ H_0:\,\alpha=0,\,\beta=1 \qquad \text{and} \qquad H_1:\,\alpha\neq0\,\text{or}\,\beta\neq1. \] For Zaragoza data, plots related to this test detect a clear positive trend that is expressed with more upper records and less lower records in the forward series and the opposite in the backward series.
ggpubr::ggarrange(
p.plot(TxZ, record = c(1,1,0,0)) + ggplot2::ylim(0, 5),
p.plot(TxZ, record = c(0,0,1,1)) + ggplot2::ylim(0, 5),
ncol = 2, nrow = 1, common.legend = TRUE, legend = "bottom")
Those tests can be implemented as follows, where in addition the estimation of the parameters of the line is displayed.
p.regression.test(TxZ, record = "upper")
#>
#> Regression test on the upper records probabilities
#>
#> data: TxZ
#> F = 5.37, df1 = 2, df2 = 67, p-value = 0.0069
#> alternative hypothesis: two-sided for record probabilities
#> null values:
#> (Intercept) x
#> 1 0
#> sample estimates:
#> (Intercept) x
#> 0.848373 0.012581
p.regression.test(TxZ, record = "lower")
#>
#> Regression test on the lower records probabilities
#>
#> data: TxZ
#> F = 4.2, df1 = 2, df2 = 67, p-value = 0.019
#> alternative hypothesis: two-sided for record probabilities
#> null values:
#> (Intercept) x
#> 1 0
#> sample estimates:
#> (Intercept) x
#> 1.0283589 -0.0069164
p.regression.test(series_rev(TxZ), record = "upper")
#>
#> Regression test on the upper records probabilities
#>
#> data: series_rev(TxZ)
#> F = 8.52, df1 = 2, df2 = 67, p-value = 0.00051
#> alternative hypothesis: two-sided for record probabilities
#> null values:
#> (Intercept) x
#> 1 0
#> sample estimates:
#> (Intercept) x
#> 1.0303789 -0.0097694
p.regression.test(series_rev(TxZ), record = "lower")
#>
#> Regression test on the lower records probabilities
#>
#> data: series_rev(TxZ)
#> F = 5.33, df1 = 2, df2 = 67, p-value = 0.0071
#> alternative hypothesis: two-sided for record probabilities
#> null values:
#> (Intercept) x
#> 1 0
#> sample estimates:
#> (Intercept) x
#> 0.970841 0.010271
Other alternative based on a Monte Carlo approach is to join the information of all previous regression tests. Of the 1000 simulations considered under the null hypothesis, none has a statistic with a value greater than that of the observed series, making the test highly significant.
set.seed(23)
global.test(TxZ, FUN = p.regression.test, B = 1000)
#>
#> Test with global statistic for 'p.regression.test' with simulated
#> p-value (based on 1000 replicates)
#>
#> data: TxZ
#> Monte-Carlo = 23.4, p-value <2e-16
Other powerful tests for trend detection can be implemented as follows:
brown.method(TxZ, weights = function(t) t-1)
#>
#> Brown's method on the weighted number of records with weights = t - 1
#>
#> data: TxZ
#> X-squared = 43, df = 4.76, c = 1.68, p-value = 2.8e-08
N.test(TxZ, weights = function(t) t-1)
#>
#> Test on the weighted number of upper records with weights = t - 1
#>
#> data: TxZ
#> Z = 3.75, p-value = 8.7e-05
#> alternative hypothesis: true 'N' is greater than 4952.7
#> sample estimates:
#> N E VAR
#> 6518.0 4952.7 173877.9
or
set.seed(23)
p.chisq.test(TxZ, simulate.p.value = TRUE)
#>
#> Chi-square test on the upper records probabilities with simulated
#> p-value (based on 1000 replicates)
#>
#> data: TxZ
#> X-squared = 129, df = 69, p-value <2e-16
lr.test(TxZ, simulate.p.value = TRUE, B = 10000)
#>
#> Likelihood-ratio test for upper record indicators with simulated
#> p-value (based on 10000 replicates)
#>
#> data: TxZ
#> LR = 2055, p-value = 0.028
#> alternative hypothesis: two.sided with different probabilities
score.test(TxZ)
#>
#> Score test for upper record indicators with simulated p-value (based on
#> 1000 replicates)
#>
#> data: TxZ
#> LM = 6809, p-value <2e-16
#> alternative hypothesis: two.sided with different probabilities
Other plots:
ggpubr::ggarrange(
p.plot(TxZ, plot = 1, record = c(1,1,0,0),
smooth.method = stats::loess, span = 0.25),
p.plot(TxZ, plot = 1, record = c(1,1,0,0),
smooth.formula = y ~ I(x-1) - 1 + offset(rep(1, length(x)))),
p.plot(TxZ, plot = 2, record = c(1,1,0,0)),
p.plot(TxZ, plot = 3, record = c(1,1,0,0)),
ncol = 2, nrow = 2, common.legend = TRUE, legend = "bottom")
Signs of climate change have not been present since the beginning of the series. We can implement a test to detect change-points in the series on a daily scale as follows.
change.point(ZaragozaSeries)
#>
#> Records test for single changepoint detection
#>
#> data: ZaragozaSeries
#> Kolmogorov = 2.08, p-value = 0.00035
#> alternative hypothesis: two.sided
#> sample estimates:
#> probable changepoint time
#> 36
change.point(ZaragozaSeries, weights = function(t) sqrt(t),
record = "d", simulate.p.value = TRUE, B = 10000)
#>
#> Records test for single changepoint detection with weights = sqrt(t)
#> with simulated p-value (based on 10000 replicates)
#>
#> data: ZaragozaSeries
#> Kolmogorov = 2.35, p-value = 1e-04
#> alternative hypothesis: two.sided
#> sample estimates:
#> probable changepoint time
#> 36
The change-point is found at time 36 (1986). We can see analogous results for the annual mean temperature series, a change-point is significantly detected with time estimate 38 (1988).
test.result <- change.point(rowMeans(TxZ365, na.rm = TRUE))
test.result
#>
#> Records test for single changepoint detection
#>
#> data: rowMeans(TxZ365, na.rm = TRUE)
#> Kolmogorov = 3.74, p-value = 1.4e-12
#> alternative hypothesis: two.sided
#> sample estimates:
#> probable changepoint time
#> 38
records(rowMeans(TxZ365, na.rm = TRUE)) +
ggplot2::geom_vline(xintercept = test.result$estimate, colour = "red")
There are still more tools! Try them yourself.