library(SwimmeR)
library(dplyr)
SwimmeR
was developed to work with results from swimming
competitions. Results are often shared as web pages (.html) or PDF
documents, which are nice to read, but make data difficult to
access.
SwimmeR
solves this problem by importing and cleaning .html and .pdf files containing swimming results, returning a tidy data frame.
Importing is performed by read_results, which takes a file path as its file argument and, for .html files only, a node argument (defaults to "pre").
In addition to this vignette I do a lot of demos on how to use
SwimmeR
at my blog Swimming + Data Science.
ISL results are handled differently; see the ISL section below.
SwimmeR
includes Texas-Florida-Indiana.pdf, results from
a tri-meet between the three schools. It can be read in as such:
TX_FL_IN_path <- system.file("extdata", "Texas-Florida-Indiana.pdf", package = "SwimmeR")
TX_FL_IN_text <- read_results(file = TX_FL_IN_path)
TX_FL_IN_text[294:303]
#> [1] "\n --- John Shebat 21 Texas, University of NT SCR"
#> [2] "\nEvent 7 Women 100 Yard Breaststroke"
#> [3] "\n 58.79 A"
#> [4] "\n 1:01.84 B"
#> [5] "\n Name Age School Seed Time Finals Time Points"
#> [6] "\n 1 Lilly King 21 Indiana University NT 59.46 B"
#> [7] "\n r:+0.70 27.62 59.46 (31.84)"
#> [8] "\n 2 Olivia Anderson 21 Texas, University of NT 1:01.88"
#> [9] "\n r:+0.74 29.13 1:01.88 (32.75)"
#> [10] "\n 3 Noelle Peplowski 18 Indiana University NT 1:02.02"
Here we see a subsection of the meet - the top three finishers in the
Women’s 100 Yard Breaststroke featuring Olympic gold medalist, and the
SwimmeR
package’s favorite swimmer, Lilly King.
The next step is to convert this data to a data frame using
swim_parse
. Because swim_parse
works on text
strings it is very sensitive to typos and/or nonstandard naming
conventions. “Texas-Florida-Indiana.pdf” has two examples of these
potential problems.
The first is that Indiana University is sometimes entered as Indiana  University, with two spaces between Indiana and University. This is a problem in versions of SwimmeR < 0.7.0 because swim_parse will interpret two spaces as a column separator, and will not properly capture Indiana  University (two spaces) as a team name. In versions of SwimmeR >= 0.7.0 the extra space won’t cause a problem at all.
The second issue is that Texas
and Florida
are styled as Texas, University of
and
Florida, University of
which I personally disapprove of. It
won’t cause any issues in SwimmeR
versions >= 0.7.0, but
will in earlier versions.
Both of these issues can be fixed with the typo
and
replacement
arguments to swim_parse
. Elements
of typo
will be replaced by the element of
replacement
with which they share an index, so all
instances of the first element of typo
will be replaced by
the first element of replacement
, and so on. Not specifying
typo
or replacement
will not produce an error,
but might negatively impact the results. If your results look strange,
or are missing values, look for typos related to those swims.
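The index pairing between typo and replacement can be sketched in base R. This is only an illustration of the idea, not SwimmeR's internal code: replace_typos is a hypothetical helper, the two sample lines are adapted from the raw text above, and I'm assuming literal (non-regex) matching.

```r
# Hypothetical helper: NOT part of SwimmeR, just a sketch of index-paired replacement
replace_typos <- function(text, typo, replacement) {
  stopifnot(length(typo) == length(replacement))
  for (i in seq_along(typo)) {
    # fixed = TRUE treats each typo as a literal string (an assumption of this sketch)
    text <- gsub(typo[i], replacement[i], text, fixed = TRUE)
  }
  text
}

# Sample lines adapted from the raw results, including the two-space team name
raw_lines <- c(
  "1 Lilly King 21 Indiana  University NT 59.46",
  "2 Olivia Anderson 21 Texas, University of NT 1:01.88"
)

cleaned_lines <- replace_typos(
  raw_lines,
  typo = c("Indiana  University", ", University of"),
  replacement = c("Indiana University", "")
)
cleaned_lines
```

The first typo is swapped for the first replacement and the second typo for the second, so the two vectors need to be the same length and in the same order.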
There is another argument to swim_parse
, called
avoid
, which will be addressed in the section on reading in
html results below.
TX_FL_IN_df <- swim_parse(
  file = TX_FL_IN_text,
  typo = c("Indiana  University", ", University of"), # not required in versions >= 0.7.0
  replacement = c("Indiana University", "") # not required in versions >= 0.7.0
)
Here are those same Women’s 100 Breaststroke results, as a data frame in tidy format:
TX_FL_IN_df[102:104, ]
#> Place Name Age Team Finals DQ Exhibition
#> 102 1 Lilly King 21 Indiana University 59.46 0 0
#> 103 2 Olivia Anderson 21 Texas 1:01.88 0 0
#> 104 3 Noelle Peplowski 18 Indiana University 1:02.02 0 0
#> Event Reaction_Time
#> 102 Women 100 Yard Breaststroke 0.70
#> 103 Women 100 Yard Breaststroke 0.74
#> 104 Women 100 Yard Breaststroke 0.71
Reading .html results is very similar to reading .pdf results, but a value must be passed to node, specifying which CSS node read_results should look in for results. Here results from the New York State 2003 Girls Championship meet will be read in from the “pre” node.
NYS_link <- "http://www.nyhsswim.com/Results/Girls/2003/NYS/Single.htm"
NYS_text <- read_results(file = NYS_link, node = "pre")
NYS_text[587:598]
#> [1] "\nEvent 6 Girls 100 Yard Butterfly"
#> [2] "\n==============================================================================="
#> [3] "\nNY State Rcd: S 54.35 1990 Richelle Depold, Scotia"
#> [4] "\n Name Year School Prelims Finals"
#> [5] "\n==============================================================================="
#> [6] "\nNYSPHSAA 2003 Federation Championship"
#> [7] "\nA - Final"
#> [8] "\n 1 Bridget O'Connor 12 1-Scarsdale 56.16 55.42"
#> [9] "\n 26.12 29.30"
#> [10] "\n 2 Lauren Bonfe 12 5-Alfred-Almond 56.37 56.93"
#> [11] "\n 26.18 30.75"
#> [12] "\n 3 Christa Narus 11 11-Ward Melville 58.67 57.94"
Looking at the raw results above, one will see that line 3 is a header and contains NY State Rcd:, showing the New York State record. Lines of this type are a common feature in swimming results, but because they contain a recognizable swimming time without being a result per se, they can cause problems for swim_parse.
Like typos these will not cause an error, but might produce nonsense
rows in the resulting data frame. swim_parse
deals with
strings that should not be included in results with the
avoid
argument. By default avoid
contains a
lot of common formulations of these header items under
avoid_default
. You can create your own list of strings and pass it to avoid, or add to avoid_default via avoid_new <- c(avoid_default, "your string here"). The avoid argument should also include "r\\:"
if your
results have reaction times (avoid_default
already includes
"r\\:"
).
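Conceptually, avoid acts as a filter on the raw lines before they are parsed. Here's a rough base R sketch of that idea. drop_avoided is a hypothetical helper rather than a SwimmeR function, and I'm assuming each avoid string is matched as a regular expression (which is why the colon in "r\\:" is escaped).

```r
# Hypothetical helper: NOT part of SwimmeR, just a sketch of filtering by avoid strings
drop_avoided <- function(lines, avoid) {
  # TRUE for every line that matches at least one avoid pattern
  hits <- Reduce(
    `|`,
    lapply(avoid, function(pattern) grepl(pattern, lines)),
    rep(FALSE, length(lines))
  )
  lines[!hits]
}

# Sample lines adapted from the raw results shown earlier
raw_lines <- c(
  "Event 6 Girls 100 Yard Butterfly",
  "NY State Rcd: S 54.35 1990 Richelle Depold, Scotia",
  "1 Bridget O'Connor 12 1-Scarsdale 56.16 55.42",
  "r:+0.70 27.62 59.46 (31.84)"
)

kept <- drop_avoided(raw_lines, avoid = c("NY State Rcd:", "r\\:"))
kept
```

Only the event line and the actual result survive. The record line and the reaction time line both contain something that looks like a swimming time, which is exactly why they need to be kept away from the parser.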
NYS_df <- swim_parse(file = NYS_text, avoid = c("NY State Rcd:"))
NYS_df[358:360, ]
#> Place Name Age Team Prelims Finals DQ Exhibition
#> 358 35 Amanda Acomb 12 5-Wayland <NA> 75.75 0 0
#> 359 36 April Bresette 11 7-Ausable <NA> 71.45 0 0
#> 360 1 Bridget O'Connor 12 1-Scarsdale 56.16 55.42 0 0
#> Event
#> 358 Girls 1 mtr Diving
#> 359 Girls 1 mtr Diving
#> 360 Girls 100 Yard Butterfly
By setting splits = TRUE
inside swim_parse
one can read in split times. Splits will then be read in as either 50
splits (the default), or 25 splits, depending on the value provided to
split_length
. Let’s look at those same
Texas/Florida/Indiana Results again.
TX_FL_IN_df_splits <- swim_parse(
  read_results(TX_FL_IN_path),
  # typo = c("Indiana  University", ", University of"), # not required in versions >= 0.7.0
  # replacement = c("Indiana University", ""), # not required in versions >= 0.7.0
  splits = TRUE,
  split_length = 50
)
TX_FL_IN_df_splits[100:102, ]
#> Place Name Age Team Finals DQ Exhibition
#> 100 NA Alexander Margherio 18 Texas, University of 52.68 0 1
#> 101 NA John Shebat 21 Texas, University of <NA> 1 0
#> 102 1 Lilly King 21 Indiana University 59.46 0 0
#> Event Reaction_Time Split_50 Split_100 Split_150
#> 100 Men 100 Yard Backstroke <NA> 25.00 27.68 <NA>
#> 101 Men 100 Yard Backstroke <NA> <NA> <NA> <NA>
#> 102 Women 100 Yard Breaststroke 0.70 27.62 31.84 <NA>
#> Split_200 Split_250 Split_300 Split_350 Split_400 Split_450 Split_500
#> 100 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 101 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 102 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> Split_550 Split_600 Split_650 Split_700 Split_750 Split_800 Split_850
#> 100 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 101 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 102 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> Split_900 Split_950 Split_1000
#> 100 <NA> <NA> <NA>
#> 101 <NA> <NA> <NA>
#> 102 <NA> <NA> <NA>
We can now see split times for the 50 and 100 walls, plus more split columns that are filled in for the longer races.
Care is needed, however, because split times are handled inconsistently in source data. For example, in these results from a meet between Indiana and Louisville, splits are sometimes by 25 and sometimes by 50, within the same meet.
As another example, in these 2017 Junior National results from Singapore, the 1500m splits are by 25 for the first 800m, and then the last split is for the final 700m of the race.
Relays are also traditionally handled differently, with splits summing for individual athletes. In the 2018 Big Ten championship results Lilly King does not split a 25.84 on the second 25 of the breaststroke leg of Indiana’s 200 medley relay, rather 25.84 was her time for the entire 50 yard breaststroke leg.
Just be forewarned - splits, even within the same meet, will often require some after-import attention and swimming-specific knowledge to clean.
The preferred format for splits is “lap” format, where each split is the duration of a single lap (or length) of the pool. Splits are sometimes also presented in cumulative format, where each split is the total time elapsed at a particular point in the race. For example consider this data frame, containing two swimmers swimming the exact same times and splits, but with one in lap format and the other in cumulative format.
df <- data.frame(
  Place = 1,
  Name = c("Lenore Lap", "Casey Cumulative"),
  Team = rep("KVAC", 2),
  Event = rep("Womens 200 Freestyle", 2),
  Finals = rep("1:58.00", 2),
  Split_50 = rep("28.00", 2),
  Split_100 = c("31.00", "59.00"),
  Split_150 = c("30.00", "1:29.00"),
  Split_200 = c("29.00", "1:58.00")
)
df
#> Place Name Team Event Finals Split_50 Split_100
#> 1 1 Lenore Lap KVAC Womens 200 Freestyle 1:58.00 28.00 31.00
#> 2 1 Casey Cumulative KVAC Womens 200 Freestyle 1:58.00 28.00 59.00
#> Split_150 Split_200
#> 1 30.00 29.00
#> 2 1:29.00 1:58.00
Cumulative splits can be converted to lap splits with the splits_to_lap function.
df %>%
  filter(Name == "Casey Cumulative") %>%
  splits_to_lap()
#> Place Name Team Event Finals Split_50 Split_100
#> 1 1 Casey Cumulative KVAC Womens 200 Freestyle 1:58.00 28.00 31.00
#> Split_150 Split_200
#> 1 30.00 29.00
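The arithmetic relating the two formats is simple once the times are in seconds: lap splits are the successive differences of cumulative splits, and cumulative splits are the running sum of lap splits. A minimal base R sketch using Casey's splits from above, converted to seconds by hand (this shows the relationship only, not how SwimmeR implements the conversion):

```r
# Casey Cumulative's splits in seconds: 28.00, 59.00, 1:29.00, 1:58.00
cumulative_sec <- c(28.00, 59.00, 89.00, 118.00)

# Each lap is the elapsed time at a wall minus the elapsed time at the previous wall
lap_sec <- diff(c(0, cumulative_sec))
lap_sec

# Going the other way, a running sum of lap splits recovers the cumulative splits
cumsum(lap_sec)
```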
Splits that are already in lap format can be avoided using the
threshold
parameter in splits_to_lap
. The value of threshold is effectively a maximum lap split value. If no swimmer in the data frame will swim a split slower than (i.e. greater than) 35.00
then 35.00
makes a
good threshold
value. Failing to set threshold
in data frames containing both lap and cumulative split times will
result in nonsensical splits and warnings from SwimmeR
.
df %>%
  splits_to_lap(threshold = 35)
#> Place Name Team Event Finals Split_50 Split_100
#> 1 1 Lenore Lap KVAC Womens 200 Freestyle 1:58.00 28.00 31.00
#> 2 1 Casey Cumulative KVAC Womens 200 Freestyle 1:58.00 28.00 31.00
#> Split_150 Split_200
#> 1 30.00 29.00
#> 2 30.00 29.00
Converting to cumulative, although not the preferred format, is
possible as well with splits_to_cumulative
.
df %>%
  filter(Name == "Lenore Lap") %>%
  splits_to_cumulative()
#> Place Name Team Event Finals Split_50 Split_100
#> 1 1 Lenore Lap KVAC Womens 200 Freestyle 1:58.00 28.00 59.00
#> Split_150 Split_200
#> 1 1:29.00 1:58.00
Similarly, setting threshold
allows the exclusion of
splits that are already in cumulative format. Here
threshold
is a minimum split value.
df %>%
  splits_to_cumulative(threshold = 20)
#> Place Name Team Event Finals Split_50 Split_100
#> 1 1 Lenore Lap KVAC Womens 200 Freestyle 1:58.00 28.00 59.00
#> 2 1 Casey Cumulative KVAC Womens 200 Freestyle 1:58.00 28.00 59.00
#> Split_150 Split_200
#> 1 1:29.00 1:58.00
#> 2 1:29.00 1:58.00
The final argument to swim_parse
is
relay_swimmers
, which defaults to FALSE
.
Setting relay_swimmers = TRUE
will cause
swim_parse
to read in the names of relay swimmers for each
relay, and add them to the normal swim_parse
output as
columns. I don’t love this, because the result is having individual
swimmers as rows, and relay swimmers as columns (because relay swimmers
are associated with their particular relay). This is not very tidy, and
SwimmeR
strives to be tidy. Still, the functionality does
exist.
TX_FL_IN_df_relay_swimmers <- swim_parse(
  read_results(TX_FL_IN_path),
  # typo = c("Indiana  University", ", University of"), # not required in versions >= 0.7.0
  # replacement = c("Indiana University", ""), # not required in versions >= 0.7.0
  relay_swimmers = TRUE
)
TX_FL_IN_df_relay_swimmers[1:3, ]
#> Place Name Age Team Finals DQ Exhibition
#> 1 1 <NA> <NA> Indiana University 3:36.59 0 0
#> 2 2 <NA> <NA> Texas, University of 3:36.84 0 0
#> 3 3 <NA> <NA> Texas, University of 3:38.75 0 0
#> Event Reaction_Time Relay_Swimmer_1 Relay_Swimmer_2
#> 1 Women 400 Yard Medley Relay <NA> Morgan Scott Lilly King
#> 2 Women 400 Yard Medley Relay <NA> Claire Adams Olivia Anderson
#> 3 Women 400 Yard Medley Relay <NA> Julia Cook Brooke Hansen
#> Relay_Swimmer_3 Relay_Swimmer_4
#> 1 Christie Jensen Shelby Koontz
#> 2 Remedy Rule Anelise Diener
#> 3 Emily Reese Joanna Evans
It is of course also possible to read in both splits and relay
swimmers, by setting both of the relevant arguments to
TRUE
.
International Swimming League results are technically .pdf files, but
they’re formatted very differently, so they have their own special
function, swim_parse_ISL
. Handling of ISL results is
otherwise the same, with the file first going to
read_results
and then to swim_parse_ISL
,
returning a data frame.
The SwimmeR
package’s favorite swimmer, Lilly King, is
involved in the ISL. Let’s see what she got up to at this particular
meet.
file_url <-
  "https://github.com/gpilgrim2670/Pilgrim_Data/raw/master/ISL/Season_1_2019/ISL_16112019_CollegePark_Day_1.pdf"
if (SwimmeR:::is_link_broken(file_url) == TRUE) {
  warning("External data unavailable")
} else {
  file_read <- read_results(file_url)
  df_ISL <- swim_parse_ISL(file = file_read)
  df_ISL[which(df_ISL$Name == "KING Lilly"), ]
}
#> # A tibble: 2 × 7
#> # Rowwise:
#> Place Lane Name Team Finals Event DQ
#> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
#> 1 1 8 KING Lilly CAC 29.00 Women's 50m Breaststroke Final 0
#> 2 1 7 KING Lilly CAC 2:17.78 Women's 200m Breaststroke Final 0
Two first place finishes for Ms. King - very nice! Otherwise all the
normal information is here, place, time, team, event etc. Beginning in
the 2020 season ISL starts reporting points in their results, which
swim_parse_ISL
will also read. swim_parse_ISL
also handles splits and relay swimmers via the arguments
splits
and relay_swimmers
. All ISL meets thus
far have splits at the 50 walls, so there is no
split_length
argument. Otherwise splits and relay swimmers are handled exactly the same way as in swim_parse, detailed above.
Once results are captured in R as tidy data frames the real fun can
begin - but there’s another problem. Times in swimming are recorded as
minutes:seconds.hundredths. This is fine when a time is less than a minute, because 59.99 can be of class numeric in R, but times greater than or equal to a minute, such as 1:00.00, are stuck as class character
. SwimmeR
provides
two functions, sec_format
and mmss_format
to
convert between times as seconds (for doing math), and times as
minutes:seconds.hundredths, for swimming-specific display.
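The arithmetic behind the two conversions is worth seeing once. The base R sketch below uses two hypothetical helpers, to_seconds and to_swim_time, which are stand-ins I've made up for illustration; the real sec_format and mmss_format should be used in practice, and they may handle edge cases this sketch ignores.

```r
# Hypothetical helpers: NOT SwimmeR functions, just the underlying arithmetic

# "2:02.60" -> 122.6; times under a minute ("59.99") pass straight through
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    if (length(p) == 2) p[1] * 60 + p[2] else p[1]
  }, numeric(1))
}

# 122.6 -> "2:02.60"; times under a minute are shown as plain seconds
to_swim_time <- function(sec) {
  minutes <- sec %/% 60
  seconds <- sec - minutes * 60
  ifelse(
    minutes > 0,
    sprintf("%d:%05.2f", as.integer(minutes), seconds), # zero-pad seconds to two digits
    sprintf("%.2f", sec)
  )
}

times_sec <- to_seconds(c("59.99", "2:02.60"))
times_sec
to_swim_time(times_sec)
```

Once times are numeric they can be sorted, averaged and subtracted like any other number, then turned back into swimming format for display.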
data(King200Breast)
King200Breast
#> # A tibble: 50 × 4
#> Event Year Time Date
#> <chr> <chr> <chr> <date>
#> 1 200 Breast Junior 2:02.60 2018-03-17
#> 2 200 Breast Senior 2:02.90 2019-03-23
#> 3 200 Breast Sophomore 2:03.18 2017-03-18
#> 4 200 Breast Freshman 2:03.59 2016-03-19
#> 5 200 Breast Senior 2:03.60 2018-11-17
#> 6 200 Breast Sophomore 2:04.03 2017-02-18
#> 7 200 Breast Junior 2:04.68 2018-02-17
#> 8 200 Breast Senior 2:05.14 2019-02-23
#> 9 200 Breast Junior 2:05.49 2018-03-17
#> 10 200 Breast Freshman 2:05.58 2016-02-20
#> # … with 40 more rows
Included in SwimmeR
is King200Breast
,
containing all Lilly King’s 200 Breaststroke times for her NCAA career.
Times are recorded as character values, in standard minutes:seconds.hundredths format. We can use sec_format to format them as seconds, and mmss_format to go back to minutes:seconds.hundredths. Both functions work well with the
tidyverse
packages.
King200Breast <- King200Breast %>%
  dplyr::mutate(Time_sec = sec_format(Time),
                Time_swim_2 = mmss_format(Time_sec))
King200Breast
#> # A tibble: 50 × 6
#> Event Year Time Date Time_sec Time_swim_2
#> <chr> <chr> <chr> <date> <dbl> <chr>
#> 1 200 Breast Junior 2:02.60 2018-03-17 123. 2:02.60
#> 2 200 Breast Senior 2:02.90 2019-03-23 123. 2:02.90
#> 3 200 Breast Sophomore 2:03.18 2017-03-18 123. 2:03.18
#> 4 200 Breast Freshman 2:03.59 2016-03-19 124. 2:03.59
#> 5 200 Breast Senior 2:03.60 2018-11-17 124. 2:03.60
#> 6 200 Breast Sophomore 2:04.03 2017-02-18 124. 2:04.03
#> 7 200 Breast Junior 2:04.68 2018-02-17 125. 2:04.68
#> 8 200 Breast Senior 2:05.14 2019-02-23 125. 2:05.14
#> 9 200 Breast Junior 2:05.49 2018-03-17 125. 2:05.49
#> 10 200 Breast Freshman 2:05.58 2016-02-20 126. 2:05.58
#> # … with 40 more rows
This is useful for comparing times, or for plotting:
plot(King200Breast$Date, King200Breast$Time_sec, axes = FALSE, ann = FALSE)
axis(1, at = c(16800, 17200, 17600, 18000), labels = c(2016, 2017, 2018, 2019))
axis(2, at = c(125, 130, 135, 140), labels = mmss_format(c(125, 130, 135, 140)), las = 1)
par(mar = c(5,7,4,2) + 0.3)
The same thing can be done in ggplot
.
library(ggplot2) # needed for ggplot(), aes() and friends
King200Breast %>%
  ggplot(aes(x = Date, y = Time_sec)) +
  geom_point() +
  scale_y_continuous(labels = scales::trans_format("identity", mmss_format)) +
  theme_classic() +
  labs(y = "Time",
       title = "Lilly King NCAA 200 Breaststroke")
get_mode to clean swimming data
Swim teams often have abbreviations. For example, Lilly King swam for Indiana University, and sometimes “Indiana University” was listed as her team name. Other times, though, the team might be listed as “IU” or “IUWSD”. James (Sulley) Sullivan swam (probably) for Monsters University, or MU. Regularizing these names is a useful part of cleaning data.
Name <- c(rep("Lilly King", 5), rep("James Sullivan", 3))
Team <- c(rep("IU", 2), "Indiana", "IUWSD", "Indiana University", rep("Monsters University", 2), "MU")
df <- data.frame(Name, Team, stringsAsFactors = FALSE)
df
#> Name Team
#> 1 Lilly King IU
#> 2 Lilly King IU
#> 3 Lilly King Indiana
#> 4 Lilly King IUWSD
#> 5 Lilly King Indiana University
#> 6 James Sullivan Monsters University
#> 7 James Sullivan Monsters University
#> 8 James Sullivan MU
Lilly has 4 different teams, but all of them are actually the same
team. Similarly Sulley has two teams, but actually only one. Using
get_mode
to return the most frequently occurring team for
each swimmer is easier than manually specifying every swimmer’s
team.
df <- df %>%
  dplyr::group_by(Name) %>%
  dplyr::mutate(Team = get_mode(Team))
df
#> # A tibble: 8 × 2
#> # Groups: Name [2]
#> Name Team
#> <chr> <chr>
#> 1 Lilly King IU
#> 2 Lilly King IU
#> 3 Lilly King IU
#> 4 Lilly King IU
#> 5 Lilly King IU
#> 6 James Sullivan Monsters University
#> 7 James Sullivan Monsters University
#> 8 James Sullivan Monsters University
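Base R has no built-in function for the statistical mode, so it helps to see what "most frequently occurring" means in code. The sketch below uses a hypothetical helper, most_common, that counts values with table and picks the largest count. It is not how get_mode is necessarily implemented, and ties may be broken differently.

```r
# Hypothetical helper: NOT SwimmeR's get_mode, just a sketch of the idea
most_common <- function(x) {
  counts <- table(x)               # frequency of each distinct value
  names(counts)[which.max(counts)] # value with the highest count
}

lilly_teams <- c("IU", "IU", "Indiana", "IUWSD", "Indiana University")
sulley_teams <- c("Monsters University", "Monsters University", "MU")

most_common(lilly_teams)
most_common(sulley_teams)
```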
To aid in making single elimination brackets for tournaments and
shoot-outs SwimmeR
has draw_bracket
. Any
number of teams between 5 and 64 can be used, with byes automatically
assigned to higher seeds.
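The number of byes in a single elimination bracket follows from simple arithmetic: the first round is padded out to the next power of two, and each empty slot becomes a bye. A quick base R sketch of that calculation for a seven-team field (this is general bracket arithmetic, not code taken from draw_bracket):

```r
n_teams <- 7

# Smallest power of two that can hold every team
bracket_size <- 2^ceiling(log2(n_teams))

# Empty slots in the first round become byes for the top seeds
n_byes <- bracket_size - n_teams

bracket_size
n_byes
```

With seven teams the bracket has eight slots, so only the top seed gets a bye.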
teams <- c("red", "orange", "yellow", "green", "blue", "indigo", "violet")
draw_bracket(teams = teams)
Now add the results of round two:
round_two <- c("red", "yellow", "blue", "indigo")
draw_bracket(teams = teams,
             round_two = round_two)
And round three:
round_three <- c("red", "blue")
draw_bracket(teams = teams,
             round_two = round_two,
             round_three = round_three)
And crown the champion:
champion <- "red"
draw_bracket(teams = teams,
             round_two = round_two,
             round_three = round_three,
             champion = champion)